Complete Guide to Exploratory Data Analysis with Python Plotly

Anar Abiyev
Published in Analytics Vidhya
4 min read, Mar 13, 2022

Introduction

Exploratory data analysis (EDA) is a key method for extracting useful insights from data. These insights play an important role in model selection, data preprocessing, interpretation of results, and so on.

In this blog, I will go into the details of EDA, using a car price prediction dataset as the example.

You can also have a look at the project example built on the same dataset.

I will be using the pandas and plotly libraries. At each step, I will point out which question I am analyzing and what information I extracted with the EDA tools.

Hope you will enjoy it.

1. Data Loading and Cleaning

As this blog is focused on EDA, I will not explain the data cleaning part in detail. You can find it in this blog.

2. Overview of columns

The pandas library provides the “info” method, which shows column names, data types and the count of non-null values:
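As a minimal sketch of this step, the snippet below calls `info()` on a tiny made-up DataFrame. The column names (`carbody`, `horsepower`, `price`) and the values are assumptions for illustration only, not the real dataset.

```python
import pandas as pd

# Toy stand-in for the car price data; columns and values are assumptions.
df = pd.DataFrame({
    "carbody": ["sedan", "hatchback", "sedan", "wagon"],
    "horsepower": [111, 154, 102, 115],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# Prints column names, dtypes and non-null counts to stdout.
df.info()

# The same overview as a DataFrame, which is easier to filter or sort.
overview = pd.DataFrame({"dtype": df.dtypes, "non_null": df.notna().sum()})
print(overview)
```

The `overview` table is just a convenience; `df.info()` alone is usually enough for a first look.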

3. Missing values

Our dataset doesn’t contain any missing values.

If you come across null values in your dataset, the missingno library is ideal for visualizing them.
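Before reaching for a plotting library, the missing values can be counted with pandas alone. The sketch below uses a made-up DataFrame with gaps injected on purpose (the dataset in this blog has none); the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately injected gaps; not the real dataset.
df = pd.DataFrame({
    "fueltype": ["gas", "diesel", None, "gas"],
    "horsepower": [111, np.nan, 102, 115],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# Number and share of missing values per column.
missing_count = df.isnull().sum()
missing_share = df.isnull().mean().round(2)
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

If missingno is installed, `missingno.matrix(df)` should draw the same information as a plot.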

4. Descriptive Summary of numerical columns

Scaling is one of the main parts of data preprocessing. To determine whether the data needs scaling, look at the range of values in the numerical columns. The “describe” function gives not only this information but also the main summary statistics:
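A small sketch of this check, again on made-up values with assumed column names: `describe()` gives the summary table, and the difference between its `max` and `min` rows gives the range of each column.

```python
import pandas as pd

# Toy numerical columns; names and values are assumptions.
df = pd.DataFrame({
    "horsepower": [111, 154, 102, 115],
    "enginesize": [130, 152, 109, 136],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# count, mean, std, min, quartiles and max for every numerical column.
summary = df.describe()
print(summary)

# Range per column; very different ranges suggest that scaling is needed.
value_range = summary.loc["max"] - summary.loc["min"]
print(value_range)
```

Here `price` spans thousands while `horsepower` spans tens, which is the kind of gap that usually calls for scaling.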

From now on, the main plots in EDA will be illustrated with the plotly library. Plotly provides interactive plots, a feature that makes it preferable to other libraries such as seaborn and matplotlib.

!pip install plotly

1. Distribution of target variable

In the dataset, the target variable is “price”. The distribution of the target variable is preferred to be normal; when it is not, the performance of the model can weaken. In our dataset, the “price” column is not normally distributed, which means it should be normalized (for example, with a log transform) before model building.

2. Count plot of categorical columns

For categorical columns, it is important to look at the count of values in each class, because there are two cases that dramatically affect model results:

  1. Curse of dimensionality: when the number of classes is too large for the number of samples (dummy variables increase the number of features).
  2. Imbalance: when there is a significant difference between the counts of the classes.

For example, high dimension:

Imbalance:

3. Distribution of numerical columns

With the distribution of numerical columns, you can check the normality, skewness and scale of the features.

Example:

4. Boxplot for outlier check

If you build a mathematical model, you should definitely check whether the numerical columns contain outliers. Boxplots are very suitable for this task: the points above the upper whisker and below the lower whisker are outliers.

Example:

5. Variance of target with categorical columns

In order to check how the target column differs across the classes of a categorical column, I will use boxplots with the color set to the categorical column.

Example:

6. Pairplot

A pairplot visualizes the pairwise relationships between variables, which can be continuous or categorical. In this case, I will use only numerical columns. As there are lots of numerical columns, a pairplot of the whole dataset would not be readable. That is why I grouped the columns and plotted three pairplots.

Example:

7. Correlation heatmap

A correlation heatmap shows the 2D correlation matrix of the numerical columns, using colored cells to represent the correlation between each pair of columns.

Thank you.

Reference to the whole code:

https://github.com/anarabiyev/ExploratoryDataAnalysis/blob/master/carpriceprediction_eda_plotly.ipynb
