Complete Guide to Exploratory Data Analysis with Python Plotly
Introduction
Exploratory data analysis (EDA) is a key method for extracting useful insights from data. These insights play an important role in model selection, data preprocessing, interpretation of results, and more.
In this blog, I will dive into the details of EDA using a car price prediction dataset as an example.
You can also have a look at the project example built on the same dataset.
I will be using the pandas and plotly libraries. At each step, I will point out which question I am analyzing and what information I extracted with the EDA tools.
I hope you enjoy it.
1. Data Loading and Cleaning
As this blog is focused on EDA, I will not explain the data cleaning part in detail. You can find it in this blog.
2. Overview of columns
The pandas library provides the “info” function, which shows column names, data types, and the count of non-null values:
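As a minimal sketch of this step, the snippet below calls “info” on a tiny stand-in DataFrame. The column names and values are hypothetical and only illustrate the shape of the car dataset.

```python
import pandas as pd

# Hypothetical stand-in for the car dataset; names and values are assumptions.
df = pd.DataFrame({
    "carbody": ["sedan", "hatchback", "sedan", "wagon"],
    "horsepower": [111, 154, 102, 115],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# Prints one row per column: name, non-null count, and dtype.
df.info()
```

On the real data you would call `df.info()` right after loading the file.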
3. Missing values
First of all, our dataset doesn’t contain any missing values.
If you come across null values in your own dataset, the missingno library is well suited to visualizing them.
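Before reaching for a visualization, a quick numeric check with plain pandas is often enough. The sketch below uses a hypothetical DataFrame with missing values injected on purpose (the real dataset has none) and counts the nulls per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with NaNs added deliberately for illustration.
df = pd.DataFrame({
    "horsepower": [111, np.nan, 102, 115],
    "price": [13495.0, 16500.0, np.nan, np.nan],
})

# Number of missing values in each column.
missing_counts = df.isnull().sum()
# Share of missing values in each column (between 0 and 1).
missing_share = df.isnull().mean()

print(missing_counts)
print(missing_share)
```

If every count is zero, as in our dataset, you can move on to the next step.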
4. Descriptive Summary of numerical columns
Scaling is one of the main parts of data preprocessing. To determine whether the data needs scaling, you have to look at the range of values in the numerical columns. The “describe” function gives not only this information but also the main summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum):
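A small sketch of this check, again on hypothetical columns: “describe” returns the summary table, and subtracting the “min” row from the “max” row gives the range of each column.

```python
import pandas as pd

# Hypothetical numerical columns; names and values are assumptions.
df = pd.DataFrame({
    "horsepower": [111, 154, 102, 115],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# Summary statistics for all numerical columns.
summary = df.describe()

# Range (max - min) per column, useful for judging whether scaling is needed.
value_range = summary.loc["max"] - summary.loc["min"]

print(summary)
print(value_range)
```

Here the two ranges differ by roughly two orders of magnitude, which is the kind of gap that usually suggests scaling.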
From now on, the main plots in EDA will be illustrated with the plotly library. Plotly provides interactive plots, and this feature is why I prefer it here over other libraries such as seaborn and matplotlib.
!pip install plotly
1. Distribution of target variable
In the dataset, the target variable is “price”. The distribution of the target variable is preferred to be close to normal. When it is heavily skewed, the performance of some models can weaken. In our dataset, the “price” column is not normally distributed, which means that it should be transformed (for example, with a log transform) before model building.
2. Count plot of categorical columns
For categorical columns, it is important to look at the count of values in each class, because there are two cases that can dramatically affect model results:
- Curse of dimensionality: when the number of classes is too large for the number of samples (dummy variables increase the number of features).
- Imbalance: when there is a significant difference between the counts of the classes.
For example, high dimension:
Imbalance:
3. Distribution of numerical columns
With the distribution of numerical columns, you can check the normality, skewness, and scale of the features.
Example:
4. Boxplot for outlier check
If you build a mathematical model, you should definitely check whether the numerical columns contain outliers. Boxplots are well suited to this task. The points above the upper whisker and below the lower whisker are treated as outliers.
Example:
5. Variance of target with categorical columns
To check how the target column differs across the classes of a categorical column, I will use boxplots with the color set to the categorical column.
Example:
6. Pairplot
A pairplot draws pairwise scatter plots of the given variables to reveal relationships between them; the variables can be continuous or categorical. In this case, I will use only numerical columns. As there are lots of numerical columns, a pairplot of the whole dataset would not be readable. That is why I grouped the columns and plotted three pairplots.
Example:
7. Correlation heatmap
A correlation heatmap shows the correlation matrix of the numerical columns as a 2D grid: each row and each column corresponds to one variable, and the color of each cell represents the correlation between that pair.
Thank you.
Reference to the whole code: