Missing Value Imputation in Python

This blog will teach you how to deal with missing values in Python

Anar Abiyev
4 min readFeb 4, 2024

My previous blog was about theoretical information on the topic:

In this one, you will learn the Python implementation of those tips.

Without further ado, let’s get started!

Imports

First, we will import the necessary libraries and load the dataset into the pandas data frame.

import pandas as pd
import numpy as np
import missingno as msno

df = pd.read_csv('sample_dataset.csv')
df.head()

Overview of Missing Values

After that, we can use missingno matrix function to take a look at the distribution of missing values.

msno.matrix(df)
Missingno Matrix.
Fig 1. Missingno Matrix.

To check the number of missing values for each and every column, we can use .isnull().sum() functions:

df.isnull().sum()
Pandas .isnull().sum() Output.
Fig 2. Pandas .isnull().sum() Output.

For better insights, I have written this function to get the percentage values of missing values for every column.

def column_missing_value_percentiles(df):
values = df.isnull().sum().values/df.shape[1]
columns = df.columns
for idx in range(len(columns)):
print(f"{columns[idx]}: {values[idx].round()}%")

column_missing_value_percentiles(df)
Fig 3. Output for Percentiles.

Solutions

Now you will learn which solution method is suitable for the missing values.

Dropping Rows

If you check out percentiles again, you can see that some columns have a quite small amount of missing values — less than 5%.

For such columns, we can drop the rows which contain missing values from these columns.

def drop_rows(df, columns):
df.dropna(subset=columns, inplace=True)

drop_rows(df, ['enrolled_university', 'education_level', 'last_new_job', 'experience'])

Dropping Columns

For the column called “company_type”, the percentile is 67%. In this case, we should drop the column, because the filling will create bias and not be helpful for analysis.

def drop_columns(df, columns):
df.drop(columns, axis = 1, inplace = True)

drop_columns(df, ['company_type'])

Mean, Median, Mode methods

These methods are similar to each other. I have shown functions for each one and used one of them as an example.

def fill_mode(df, column):
mode = df[column].mode()[0]
df[column] = df[column].fillna(mode)

def fill_mean(df, column):
mean = df[column].mean()
df[column] = df[column].fillna(mean)

def fill_median(df, column):
median = df[column].median()
df[column] = df[column].fillna(median)

fill_mode(df, 'gender')

Divide and Conquer

In this method, I am using another column to get better insights for the target column (which is going to be filled).

I have called two columns here:

  • column to conquer — the column that is going to be filled.
  • column to divide — the column that is used.

If we apply the previous mode method, then the mode of the whole column will be used to fill all the NAs.

To apply a more advanced method, the column is divided into different groups and individual mode values will be found.

In this example, I assume that the person’s experience might be related to the company size; if you have more experience, you are likely to work in a bigger company.

In the first loop, individual mode values is found and stored in the list.

In the second loop, if the value of the divider matches, then the corresponding mode values are used to fill NAs.

df['company_size'].unique()
df['experience'].unique()

def divide_and_conquer(df, column_to_conquer, column_to_divide):
modes = []
for i in df[column_to_divide].unique():
mode = df[df[column_to_divide] == i][column_to_conquer].mode()[0]
modes.append(mode)

for i in range(df[column_to_divide].nunique()):
mask = df[column_to_divide] == df[column_to_divide].unique()[i]
mode_value = modes[i]
df.loc[mask, column_to_conquer] = df.loc[mask, column_to_conquer].fillna(mode_value)

column_to_conquer = 'company_size'
column_to_divide = 'experience'

divide_and_conquer(df, column_to_conquer, column_to_divide)

Random Imputation

Here, we use random values of the column in order to fill the missing values.

def random_imputation(df, column):
options = df[column].dropna().unique()
df[column] = df[column].apply(lambda x: np.random.choice(options) if pd.isna(x) else x)

random_imputation(df, 'major_discipline')

Model-based Methods

In this method, a model is trained to fill in missing values.

The target column is the one to be filled.

The train set is the rows without missing values.

The test set is the rows with missing values.

column_to_fill = 'gender'

df_train = df.dropna()
df_test = df[df[column_to_fill].isna()]

X = df.drop(column_to_fill, axis = 1)
y = df[column_to_fill]

X_train = df_train.drop(column_to_fill, axis = 1)
y_train = df_train[column_to_fill]

X_test = df_test.drop(column_to_fill, axis = 1)

X_train = pd.get_dummies(X_train, columns=X_train.select_dtypes(include = 'object').columns, drop_first=True)
X_test = pd.get_dummies(X_test, columns=X_test.select_dtypes(include = 'object').columns, drop_first=True

After we get the sets, the model can be defined and trained.

The predictions are the values that are used to fill the NAs.

from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=3)

knn_classifier.fit(X_train, y_train)

predictions = knn_classifier.predict(X_test)

Check the theoretical explanation of each solution shown here:

Clap and Follow for support!

Thank you for reading!

--

--

Anar Abiyev

Writing about Data Science / Deep Learning and Self Improvement