Missing Value Imputation in Python
This blog will teach you how to deal with missing values in Python
My previous blog was about theoretical information on the topic:
In this one, you will learn the Python implementation of those tips.
Without further ado, let’s get started!
Imports
First, we will import the necessary libraries and load the dataset into the pandas data frame.
import pandas as pd
import numpy as np
import missingno as msno
df = pd.read_csv('sample_dataset.csv')
df.head()
Overview of Missing Values
After that, we can use missingno matrix function to take a look at the distribution of missing values.
msno.matrix(df)
To check the number of missing values for each and every column, we can use .isnull().sum() functions:
df.isnull().sum()
For better insights, I have written this function to get the percentage values of missing values for every column.
def column_missing_value_percentiles(df):
values = df.isnull().sum().values/df.shape[1]
columns = df.columns
for idx in range(len(columns)):
print(f"{columns[idx]}: {values[idx].round()}%")
column_missing_value_percentiles(df)
Solutions
Now you will learn which solution method is suitable for the missing values.
Dropping Rows
If you check out percentiles again, you can see that some columns have a quite small amount of missing values — less than 5%.
For such columns, we can drop the rows which contain missing values from these columns.
def drop_rows(df, columns):
df.dropna(subset=columns, inplace=True)
drop_rows(df, ['enrolled_university', 'education_level', 'last_new_job', 'experience'])
Dropping Columns
For the column called “company_type”, the percentile is 67%. In this case, we should drop the column, because the filling will create bias and not be helpful for analysis.
def drop_columns(df, columns):
df.drop(columns, axis = 1, inplace = True)
drop_columns(df, ['company_type'])
Mean, Median, Mode methods
These methods are similar to each other. I have shown functions for each one and used one of them as an example.
def fill_mode(df, column):
mode = df[column].mode()[0]
df[column] = df[column].fillna(mode)
def fill_mean(df, column):
mean = df[column].mean()
df[column] = df[column].fillna(mean)
def fill_median(df, column):
median = df[column].median()
df[column] = df[column].fillna(median)
fill_mode(df, 'gender')
Divide and Conquer
In this method, I am using another column to get better insights for the target column (which is going to be filled).
I have called two columns here:
- column to conquer — the column that is going to be filled.
- column to divide — the column that is used.
If we apply the previous mode method, then the mode of the whole column will be used to fill all the NAs.
To apply a more advanced method, the column is divided into different groups and individual mode values will be found.
In this example, I assume that the person’s experience might be related to the company size; if you have more experience, you are likely to work in a bigger company.
In the first loop, individual mode values is found and stored in the list.
In the second loop, if the value of the divider matches, then the corresponding mode values are used to fill NAs.
df['company_size'].unique()
df['experience'].unique()
def divide_and_conquer(df, column_to_conquer, column_to_divide):
modes = []
for i in df[column_to_divide].unique():
mode = df[df[column_to_divide] == i][column_to_conquer].mode()[0]
modes.append(mode)
for i in range(df[column_to_divide].nunique()):
mask = df[column_to_divide] == df[column_to_divide].unique()[i]
mode_value = modes[i]
df.loc[mask, column_to_conquer] = df.loc[mask, column_to_conquer].fillna(mode_value)
column_to_conquer = 'company_size'
column_to_divide = 'experience'
divide_and_conquer(df, column_to_conquer, column_to_divide)
Random Imputation
Here, we use random values of the column in order to fill the missing values.
def random_imputation(df, column):
options = df[column].dropna().unique()
df[column] = df[column].apply(lambda x: np.random.choice(options) if pd.isna(x) else x)
random_imputation(df, 'major_discipline')
Model-based Methods
In this method, a model is trained to fill in missing values.
The target column is the one to be filled.
The train set is the rows without missing values.
The test set is the rows with missing values.
column_to_fill = 'gender'
df_train = df.dropna()
df_test = df[df[column_to_fill].isna()]
X = df.drop(column_to_fill, axis = 1)
y = df[column_to_fill]
X_train = df_train.drop(column_to_fill, axis = 1)
y_train = df_train[column_to_fill]
X_test = df_test.drop(column_to_fill, axis = 1)
X_train = pd.get_dummies(X_train, columns=X_train.select_dtypes(include = 'object').columns, drop_first=True)
X_test = pd.get_dummies(X_test, columns=X_test.select_dtypes(include = 'object').columns, drop_first=True
After we get the sets, the model can be defined and trained.
The predictions are the values that are used to fill the NAs.
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X_train, y_train)
predictions = knn_classifier.predict(X_test)
Check the theoretical explanation of each solution shown here: