Data Science Regression Project Example

Anar Abiyev
Mar 3, 2022 · 9 min read


This is a Data Science regression case study, and it aims to show how to apply data science tools to a real project in Python. For better understanding and clarity, the blog has been organized into separate parts (data understanding, EDA, modelling, etc.). Both the code and the explanations are provided.

I hope this blog contributes to your knowledge on your Data Science journey.

Business Understanding

You are a data scientist at an automobile consulting company, and you have been given data about cars from different manufacturers. Your task is to analyze the factors affecting the price and give meaningful insights about the case.

Data Understanding

Let’s start by importing the dataset into the notebook.

I have assigned the imported csv file to the “data” variable and then created a copy of it. This is to keep the original dataset intact. You might think the source file is always there and safe, but in real work you may be working directly on the source of the data, and you could make permanent changes and spoil it. As data collection is a very costly and time-consuming task, that would harm your company or your project. That is why you should always build the habit of preserving the original data while working.

import pandas as pd

data = pd.read_csv('CarPrice_Assignment.csv')
df = data.copy()

The pandas library provides a wide range of tools to help you get familiar with the data. For the initial steps, I use head and info.

  1. The head function shows the first five rows of the dataframe (you can pass any number; five is the default).
pd.options.display.max_columns = None         # to print all columns
df.head()

2. The info function returns the number of non-null values and the data type of each column. It is the simplest way to see whether the data contains null values.

df.info()

The steps of a data science project haven’t been decided by an authority or a community; they are the natural flow of the project. This means the previous step should be a foundation for the next one. For example, if you don’t know what a car is, you cannot interpret what door number or fuel type mean. So, before passing to the next step, we have to summarize the results of the current step.

Results:

  1. There are no null values, so data imputation will not be needed;
  2. The car_ID column can be dropped, because it is unique for every car and has no bearing on the price;
  3. There are object columns, so label encoding will be needed;
  4. The scales of the numerical columns are different, so scaling will be needed;

Data Cleaning

Before starting any further analysis, let’s drop the car_ID column, because it is unique for every car and doesn’t mean anything for its price.

df.drop('car_ID', axis = 1, inplace = True)

The column CarName has 147 unique values. Compared to the number of rows, which is 205, that is too many unique classes. This case is called high cardinality of categorical values. There are various ways to solve this problem; for this instance, I will group the cars that come from the same manufacturer. For example, the different models of Audi will be in the same class. Let’s look at the values in this column to find a way to extract the company names:

df.CarName.unique()

In all values, the company name is the first word, separated from the rest by a white space.

df['CarName'] = df['CarName'].str.split().str[0]
print(df['CarName'].unique())

Now, we have another problem — inconsistent data. For example, there are both “Nissan” and “nissan” or “maxda” and “mazda”.

df.loc[df['CarName'] == 'maxda', 'CarName'] = 'mazda'
df.loc[df['CarName'] == 'Nissan', 'CarName'] = 'nissan'
df.loc[df['CarName'] == 'porcshce', 'CarName'] = 'porsche'
df.loc[df['CarName'] == 'toyouta', 'CarName'] = 'toyota'
df.loc[df['CarName'] == 'vokswagen', 'CarName'] = 'volkswagen'
df.loc[df['CarName'] == 'vw', 'CarName'] = 'volkswagen'
print(df['CarName'].unique())

The CarName column is clean now.

Let’s check if the data contains any duplicated rows.

df.loc[df.duplicated()]
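
If any duplicated rows had been found, they could be dropped; a minimal illustrative snippet (only needed if the check above returns rows):

# illustrative: remove duplicated rows in place, only if any exist
if df.duplicated().any():
    df.drop_duplicates(inplace = True)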

In the EDA, I will analyze the imbalance of the categorical columns (which matters for label encoding).

Categorical data has two types, nominal and ordinal, and the encoding method should be selected according to the type of data. Thus, in order to choose the encoding method, let’s analyze the values and value counts in the object columns to see how we can approach them.

obj_cols = df.select_dtypes(include = 'object').columns
for col in obj_cols:
    print(col, ' : ', df[col].unique(), end = '\n\n')

We can say that only cylindernumber is ordinal, while the other columns are nominal. So I will replace the values in cylindernumber manually and apply the dummy variables method to the other columns.
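
To make the distinction concrete, here is a minimal illustrative sketch of the two encoding styles, outside the pipeline that will be built later (fueltype is used only as an example of a nominal column, and the variable names are illustrative):

# ordinal: map the string counts to integers, preserving their order
cylinder_map = {'two': 2, 'three': 3, 'four': 4, 'five': 5,
                'six': 6, 'eight': 8, 'twelve': 12}
ordinal_encoded = df['cylindernumber'].map(cylinder_map)

# nominal: dummy (one-hot) encoding, no order implied between classes
nominal_encoded = pd.get_dummies(df['fueltype'], prefix = 'fueltype', drop_first = True)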

For the nominal columns, I will inspect their value counts using bar plots.

import matplotlib.pyplot as plt

cols = df.select_dtypes(include = 'object').columns.drop('cylindernumber')

plt.figure(figsize=(16, 16))
n = len(cols)
for i in range(1, n + 1):
    plt.subplot(3, 3, i)
    plt1 = df[cols[i-1]].value_counts()
    plt1.plot(kind = 'barh')
    plt.title(cols[i-1])

As seen from the graphs, there are lots of unique classes in the categorical columns. If we convert them using dummy variables, the number of columns will increase, and since we have a small number of samples (rows), this will cause the curse of dimensionality and decrease model performance. Hence, let’s analyze whether we can reduce some of the classes or columns.
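
As a rough, illustrative sanity check, you can count how many columns dummy encoding would produce at this point (the exact number depends on the cleaning done so far):

# illustrative: compare the column count before and after dummy encoding
dummy_cols = df.select_dtypes(include = 'object').columns
n_after = pd.get_dummies(df, columns = dummy_cols, drop_first = True).shape[1]
print(df.shape[1], 'columns ->', n_after, 'columns after dummy encoding')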

In most of the categorical columns there is an imbalance problem: the number of elements in each class is not close to one another. Firstly, in the enginelocation column, almost all elements are front, so this column can be dropped.

df.drop('enginelocation', axis = 1, inplace = True)

For enginetype and fuelsystem:

print(df['fuelsystem'].value_counts())
print(df['enginetype'].value_counts())
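
From those counts, the single-sample classes can be listed directly; a small illustrative helper:

# illustrative: classes that occur only once in each of the two columns
for col in ['fuelsystem', 'enginetype']:
    counts = df[col].value_counts()
    print(col, ':', list(counts[counts == 1].index))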

The classes with only one sample will be merged into the classes with the closest price. I will use boxplots to see which classes’ prices are close to each other:

import seaborn as sns

sns.boxplot(data = df, y = 'price', x = 'fuelsystem')

Interpretation: mfi, spfi, 4bbl -> idi.

sns.boxplot(data = df, y = 'price', x = 'enginetype')

Interpretation: dohcv -> ohcv, rotor ->l.

df.loc[df['fuelsystem'] == 'mfi', 'fuelsystem'] = 'idi'
df.loc[df['fuelsystem'] == 'spfi', 'fuelsystem'] = 'idi'
df.loc[df['fuelsystem'] == '4bbl', 'fuelsystem'] = 'idi'
df.loc[df['enginetype'] == 'dohcv', 'enginetype'] = 'ohcv'
df.loc[df['enginetype'] == 'rotor', 'enginetype'] = 'l'

Now, let’s do the same steps for CarName:

print(df['CarName'].value_counts())

plt.figure(figsize=(16, 10))
sns.boxplot(data = df, y = 'price', x = 'CarName')

Interpretation: mercury -> peugeot, renault -> volkswagen, alfa-romero -> saab, jaguar -> buick, chevrolet -> plymouth

df.loc[df['CarName'] == 'mercury', 'CarName'] = 'peugeot'
df.loc[df['CarName'] == 'renault', 'CarName'] = 'volkswagen'
df.loc[df['CarName'] == 'alfa-romero', 'CarName'] = 'saab'
df.loc[df['CarName'] == 'jaguar', 'CarName'] = 'buick'
df.loc[df['CarName'] == 'chevrolet', 'CarName'] = 'plymouth'

Data Preprocessing

After cleaning the data, the next step is data preprocessing. At this stage, the data is prepared for modelling. In our case, outlier treatment, label encoding and scaling will be applied.

Data Leakage

Data leakage is a very important point in data preparation. Simply put, it happens when you do preprocessing before the train-test split. If you apply operations like encoding or scaling to the whole dataset and only then split the data, the train and test sets share the same preprocessing statistics. The test set will therefore not be truly new to the model, because the model has already seen those patterns in the train set. In order to avoid data leakage, I split the data into train and test sets and do the preprocessing independently with sklearn pipelines.

from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)
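
To make the leakage point concrete, here is a minimal illustrative sketch using MinMaxScaler (not the exact pipeline built below):

from sklearn.preprocessing import MinMaxScaler

# leaky: the scaler sees statistics from the test rows as well
leaky_scaler = MinMaxScaler().fit(df.select_dtypes(include = 'number'))

# cleaner: the scaler is fitted on the training rows only
clean_scaler = MinMaxScaler().fit(df_train.select_dtypes(include = 'number'))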

Sklearn pipelines

As the preprocessing is done separately for the train and test sets, the steps should be automated to avoid code repetition. For this purpose, the sklearn library provides pipelines.

All the preprocessing steps will be implemented as classes so that they can be organized into a pipeline.

from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Label Encoding

As the cylindernumber column is ordinal, it will be encoded manually:

class CylinderNumberEncoder(BaseEstimator, TransformerMixin):

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        # note: the masks must be built from X itself, not from the global df
        X.loc[X['cylindernumber'] == 'two', 'cylindernumber'] = 2
        X.loc[X['cylindernumber'] == 'three', 'cylindernumber'] = 3
        X.loc[X['cylindernumber'] == 'four', 'cylindernumber'] = 4
        X.loc[X['cylindernumber'] == 'five', 'cylindernumber'] = 5
        X.loc[X['cylindernumber'] == 'six', 'cylindernumber'] = 6
        X.loc[X['cylindernumber'] == 'eight', 'cylindernumber'] = 8
        X.loc[X['cylindernumber'] == 'twelve', 'cylindernumber'] = 12

        X['cylindernumber'] = X['cylindernumber'].astype(int)

        return X

The other columns will be encoded using dummy variables.

class DummyVariables(BaseEstimator, TransformerMixin):

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        dummy_cols = X.select_dtypes(include = 'object').columns
        X = pd.get_dummies(X, columns = dummy_cols, drop_first=True)

        return X

Outliers

When working with numerical data and building a mathematical model, you have to pay special attention to outliers. One of the ways to identify the presence of outliers is the interquartile range (IQR) method.

def number_of_outliers(df):

    df = df.select_dtypes(exclude = 'object')

    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1

    return ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).sum()

print(number_of_outliers(df_train))
print(number_of_outliers(df_test))

I will use the interquartile range method to handle the outliers.

import numpy as np

def outlier_treatment(datacolumn):

    Q1, Q3 = np.percentile(datacolumn, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)

    return lower_range, upper_range

class OutlierTreatment(BaseEstimator, TransformerMixin):

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        # clip every column to its own IQR bounds
        for col in X.columns:
            lowerbound, upperbound = outlier_treatment(X[col])
            X[col] = np.clip(X[col], a_min=lowerbound, a_max=upperbound)
        return X

Scaling

from sklearn.preprocessing import MinMaxScaler

class CustomizedScaler(BaseEstimator, TransformerMixin):

    def fit(self, X, Y=None):
        return self

    def transform(self, X):
        names = X.columns
        scaler = MinMaxScaler()
        scaler.fit(X)
        X = scaler.transform(X)
        X = pd.DataFrame(X, columns = names)

        return X

Pipeline

As the classes of all steps are created, we can now build the pipeline and transform the train and test sets.

pipeline = Pipeline(steps = [
    ('cylinder_number_encoder', CylinderNumberEncoder()),
    ('encoder', DummyVariables()),
    ('outlier', OutlierTreatment()),
    ('scaler', CustomizedScaler())
])

df_train = pipeline.fit_transform(df_train)
df_test = pipeline.fit_transform(df_test)

Modelling and Evaluation

Firstly, the train and test sets should be separated into X and y values. Because some classes have very few samples, the test set ended up with one fewer dummy column, so I keep only the columns that appear in both sets.
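
A quick illustrative check shows which dummy columns appear in the train set but not in the test set:

# illustrative: dummy columns present only in the train set
print(set(df_train.columns) - set(df_test.columns))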

X_train = df_train.drop('price', axis = 1)
y_train = df_train.price
X_test = df_test.drop('price', axis = 1)
y_test = df_test.price
X_train = X_train[X_test.columns]

The data contains too many features, which is why I will apply feature selection. For a detailed explanation and as a reference for the code, you can refer to the link.

import statsmodels.api as sm

cols = list(X_train.columns)
pmax = 1
while len(cols) > 0:
    p = []
    X_1 = X_train[cols]
    X_1 = sm.add_constant(X_1)
    model = sm.OLS(y_train, X_1).fit()
    p = pd.Series(model.pvalues.values[1:], index = cols)
    pmax = max(p)
    feature_with_p_max = p.idxmax()
    if pmax > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features_BE = cols
print(selected_features_BE)

Updating train set:

X_train = X_train[selected_features_BE]

Building Model:

X_train = sm.add_constant(X_train) #Adding the constant
lm = sm.OLS(y_train, X_train).fit() # fitting the model
print(lm.summary())

Test set score:

X_test = X_test[selected_features_BE]
X_test = sm.add_constant(X_test)
y_pred = lm.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
0.7828025643949679

Thank you for reading.

The dataset.

The code.

For Communication

https://www.linkedin.com/in/anar-abiyev-224a45196/
