Complete Guide to Missing Value Imputation

You will learn everything you need to know to deal with missing values in a dataset!

Anar Abiyev
7 min read · Feb 4, 2024

In this blog, I will go through different scenarios of missing data problems and their solutions.

You will know how to approach each case as a data scientist.

After you read this guide, you can also check my blog about the Python implementation of the methods explained here!

Outline

  • What is missing data?
  • What are the reasons for missing data?
  • Solutions

What is missing data?

In data science, missing data refers to the absence of values or information in a dataset. Dealing with missing data is a crucial aspect of the data cleaning and preprocessing stage, as it can impact the quality and accuracy of analyses and machine learning models. It reduces the effective sample size, potentially reducing the power of statistical tests and the generalizability of models.

There are two general ways to deal with missing data:

  • Deletion. Removing rows or columns with missing values. This can lead to the loss of valuable information and may introduce bias.
  • Imputation. Filling missing values with estimated or predicted values.

In the following sections, I will explain which imputation methods work best in different situations.

What are the reasons for missing data?

Prior to starting missing data imputation, it is a good practice to analyze the reasons behind the missing data problem.

The way I prefer to do it is to use the missingno library in Python.

import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

# Load the dataset and draw the missing-value matrix.
df = pd.read_csv('sample_dataset.csv')
msno.matrix(df)
plt.show()  # needed to display the figure outside a notebook

The code produces a matrix like the one below, where the white lines are missing values. With such a tool, you can easily get an overview of your dataset in terms of missing values.

Fig 1. Missing Value Matrix with missingno Library.

Before proceeding to the code section, let’s go through the reasons that might cause missing data.

There can be various reasons for missing data in a dataset. Understanding these reasons is crucial for handling missing data appropriately and making informed decisions in data analysis or modeling.

Here are some common reasons for missing data:

  • Non-response. Individuals or entities may choose not to respond to certain survey questions or provide specific information, leading to missing values.
  • Instrumentation Issues. Problems with measurement instruments or data collection tools can lead to missing values.
  • Technological Limitations. Technical constraints or limitations in data capture methods can result in missing data.
  • Unavailability of Historical Data. In longitudinal studies or time-series data, historical records may be missing due to various reasons such as system upgrades, changes in data collection methods, or data storage issues.

Solutions

In this section, I will go through various imputation strategies and explain which one to use in which scenario.

Dropping rows

This method involves removing entire rows that contain missing values from the dataset. It is simple and easy to implement, but it is only advisable when the affected rows are a small minority (up to roughly 10%). Otherwise, it can lead to significant data loss.

For example, if your dataset has 10,000 rows and 50 of them contain missing values, you can drop those rows and continue with the remaining data, as they are a very small proportion.
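As a minimal sketch in pandas, continuing with the df loaded above:

# Drop rows with missing values only if they are a small fraction of the data.
missing_rows = df.isna().any(axis=1).sum()
if missing_rows / len(df) <= 0.10:  # the ~10% rule of thumb above
    df = df.dropna(axis=0)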

Dropping columns

The previous method dealt with rows. However, if the missing values are concentrated in the same column, you can drop that column instead.

If more than 60–70% of a column is missing, you can make a case for dropping the entire column. Otherwise, if you try to fill the missing rows, most of the values in the column will be synthetic, which can create bias.
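A pandas sketch of this rule of thumb, using the 60% threshold mentioned above:

# Drop columns where more than 60% of the values are missing.
missing_share = df.isna().mean()  # fraction of NaNs per column
cols_to_drop = missing_share[missing_share > 0.60].index
df = df.drop(columns=cols_to_drop)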

Mean, Median, and Mode methods

These are the most widely used and most straightforward methods of missing data imputation.

  • Mean is the average of a numerical column. If a numerical column has some missing values, you can fill them with the mean of the column.
  • Median is the middle value of a sorted numerical column.

Bonus Tip: If the data has many outliers, use median imputation; otherwise, use mean imputation.

  • Mode is the most frequent value of a categorical column and can be used to fill categorical columns. Here, you have to pay attention to the balance between the different classes.
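As a minimal pandas sketch of all three, assuming hypothetical column names 'income', 'age', and 'city':

# Mean for a well-behaved numerical column.
df['income'] = df['income'].fillna(df['income'].mean())

# Median for a numerical column with many outliers.
df['age'] = df['age'].fillna(df['age'].median())

# Mode (most frequent value) for a categorical column.
df['city'] = df['city'].fillna(df['city'].mode()[0])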

Divide and Conquer

The dataset is divided into subsets based on observed variables, and imputation is performed separately on each subset. This method addresses missing data based on related subsets, potentially capturing more nuanced patterns, but it requires careful consideration of how to divide the data. Complexity increases with multiple variables.

Let’s see the example below.

Suppose you have “age” and “marital status” columns, and the latter has some missing values. Instead of filling all the missing values with the mode of the column, you can divide the rows based on age, because we can assume that people are more likely to be married as they get older.

So, you divide the data into classes according to the age column: young, mature, old. After that, you fill in the missing values with the mode of each group separately.
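A possible pandas sketch of this idea, assuming hypothetical 'age' and 'marital_status' columns (and that each age group has at least one observed value):

# Bin ages into groups, then impute the mode within each group.
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 60, 120],
                         labels=['young', 'mature', 'old'])

df['marital_status'] = df.groupby('age_group', observed=True)['marital_status'] \
    .transform(lambda s: s.fillna(s.mode()[0]))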

Random imputation / hot deck

Random imputation, also known as hot deck imputation, is a method for handling missing data by replacing missing values with randomly selected observed values from the same variable.

The term “hot deck” refers to a metaphorical deck of cards, where each card (or observation) is available to be selected to fill in the missing value.

  • Identify the variables with missing values in the dataset.
  • Create a pool or deck of observed values from the variable containing missing values.
  • Randomly select values from the pool and use them to replace the missing values.

Random imputation helps preserve the variability in the dataset by introducing randomness into the imputed values. It is a relatively simple method to implement, requiring minimal computational resources.
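These steps can be sketched in pandas and NumPy like this, for a hypothetical 'income' column:

import numpy as np

# Pool of observed (non-missing) values for the variable.
pool = df['income'].dropna().to_numpy()

# Randomly draw one pool value for each missing entry.
mask = df['income'].isna()
rng = np.random.default_rng(42)  # seed for reproducibility
df.loc[mask, 'income'] = rng.choice(pool, size=mask.sum())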

But,

Since values are selected randomly, there’s a possibility of imputing values that do not accurately represent the overall distribution or patterns in the data.

So,

Careful consideration should be given to how the imputation pool is created to ensure that it is representative of the variable’s distribution.

Random imputation is often more suitable for continuous variables than for categorical ones.

To account for uncertainty introduced by randomness, multiple imputations can be performed, creating several datasets with different imputed values for each missing entry.
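As a sketch, multiple random imputation could look like this, building on the hot-deck snippet above (df_raw is a hypothetical name for the original dataframe that still contains the NaNs):

# Create m datasets, each with its own random imputation.
m = 5
imputed_datasets = []
for i in range(m):
    df_i = df_raw.copy()
    pool_i = df_i['income'].dropna().to_numpy()
    mask_i = df_i['income'].isna()
    rng_i = np.random.default_rng(i)  # different seed per copy
    df_i.loc[mask_i, 'income'] = rng_i.choice(pool_i, size=mask_i.sum())
    imputed_datasets.append(df_i)
# Each copy is analyzed separately and the results are pooled.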

Model-based methods

Model-based imputation is an advanced technique for handling missing data by using predictive models to estimate and impute missing values. Instead of relying solely on summary statistics like mean or median, this method leverages relationships within the dataset to make informed predictions. The choice of the model depends on the characteristics of the data and the relationships between variables.

Let’s see the example below.

Consider a dataset in which one column has some missing rows and some present (full) rows.

Fig 2. Dataset with missing and full values.

To impute missing values with a model-based approach, the strategy in the following image is applied: the rows where the value is present are used as the training set, while the rows where it is missing form the test set. The model's predictions are then used to fill in the missing values.

For the imputation itself, the dependent (target) column has to be the one with missing values. Only after the imputation is done is the dataset divided into dependent and independent columns according to the actual task.

Fig 3. Train and Test set for Model Imputation.

Model-based imputation takes into account relationships between variables, allowing for more accurate imputations compared to simple statistical measures. This method can capture non-linear relationships, making it suitable for datasets with complex patterns. Model-based imputation can be computationally intensive, especially when using complex models or dealing with large datasets.
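As a minimal sketch of the train/test strategy above, using scikit-learn's RandomForestRegressor as one possible model (the column names are hypothetical, and the feature columns are assumed to be fully observed):

from sklearn.ensemble import RandomForestRegressor

target = 'income'                     # the column with missing values
features = ['age', 'years_employed']  # fully observed predictor columns

train = df[df[target].notna()]  # rows where the target is present
test = df[df[target].isna()]    # rows where the target is missing

model = RandomForestRegressor(random_state=42)
model.fit(train[features], train[target])

# Predict the missing entries and write them back into the dataframe.
df.loc[df[target].isna(), target] = model.predict(test[features])

scikit-learn also ships an experimental IterativeImputer that automates a similar round-robin modeling approach.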

Converting NA into a feature

This method suits data from user forms. When a question can be answered or left blank by the user, the blanks end up as missing data.

Such a column can be converted into a binary column with values of true and false: true when the user answers the question, false when they do not.

Here, the assumption we make is that the user left the question unanswered for a reason, so the absence of an answer is informative and can serve as a feature itself.
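In pandas, this conversion is a one-liner (the column name is hypothetical):

# True where the user answered, False where they left it blank.
df['answered_question'] = df['optional_question'].notna()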

Check the Python implementation of each solution explained here:

Clap and Follow for support!

Thank you for reading!
