Standardization and Normalization, Feature Scaling — Clearly Explained!

This story will clear up all your questions about standardization and normalization, and you will never have to search this topic again!

Anar Abiyev
6 min read · Dec 21, 2023

You probably have many questions about standardization and normalization; you have read many articles and watched videos on YouTube.

If you read this blog to the end, I assure you that you will never need to search for standardization or normalization again!

In this blog, you will learn:

· What is feature scaling and why do we need it?

· Which models need scaling and which ones don’t?

· What is normalization?

· What is standardization?

· When to use normalization or standardization?

Firstly, let’s state that both normalization and standardization are types of feature scaling. They have different formulas and use cases, but both are used to change the scale of data.

What even is feature scaling?

Let’s say you have a column in your dataset that looks like the histogram on the left. Its values range from around 20 to 65. If we want to change the scale of the column, all we have to do is divide the values by some constant, for example 2. The histogram on the right shows the distribution after this scaling.

Fig 1. Feature scaling by dividing by a constant.

Another method of scaling the data is by subtracting a constant. For instance, if you want the data to start from zero, you can subtract 20 from the column values.

Fig 2. Feature scaling by subtracting a constant.

Please note that multiplication and addition can be used as well, but the usual goal is to bring the values close to zero, so subtraction and division are the operations most often applied. Both operations are shown in the short sketch below.
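
As a quick illustration, here is a minimal sketch of both operations, assuming a hypothetical NumPy array with values in roughly the 20–65 range (the actual column behind the histograms is not shown):

```python
import numpy as np

# Hypothetical stand-in for the column in Fig 1 and Fig 2,
# with values roughly between 20 and 65.
values = np.array([21.0, 30.0, 35.0, 42.0, 50.0, 64.0])

# Fig 1: scaling by dividing by a constant.
# The shape of the distribution stays the same; only the range shrinks.
divided = values / 2
print(divided)   # [10.5 15.  17.5 21.  25.  32. ]

# Fig 2: scaling by subtracting a constant.
# The data now starts near zero.
shifted = values - 20
print(shifted)   # [ 1. 10. 15. 22. 30. 44.]
```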

Why apply Feature Scaling?

Check out the dataset below: all three columns have different scales. If you feed this dataset to a model as it is, the model will give more importance to the column with the larger values, the “income” column in this example. However, we want the model to treat all columns equally and assign weights according to the optimization, not according to their scales.

Fig 3. Example dataset with different scales.

The models I will mention below are the ones that benefit from feature scaling the most:

  • Gradient-based optimization algorithms. Models that use gradient descent for optimization, such as linear regression, logistic regression, and neural networks. Scaling helps them converge faster.
  • Distance-based algorithms. Models that use distances between data points, such as k-Nearest Neighbors (KNN) and Support Vector Machines (SVM), benefit from feature scaling because it ensures that all features contribute equally to the distance computation (see the sketch after this list).
  • PCA (Principal Component Analysis). PCA is a dimensionality reduction technique that involves finding the principal components of the data. Feature scaling is important for PCA because it ensures that all features have equal influence on the principal components.
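
As an illustration of the distance-based case, here is a minimal scikit-learn sketch (assuming scikit-learn is installed; the wine dataset is used only because its features sit on very different scales):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The wine dataset has features on very different scales,
# similar to the "income" example above.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# KNN without scaling: distances are dominated by the large-scale features.
knn_raw = KNeighborsClassifier().fit(X_train, y_train)

# KNN with scaling inside a pipeline, so the scaler is fit only on the
# training data and then reused on the test data.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)

print("accuracy without scaling:", knn_raw.score(X_test, y_test))
print("accuracy with scaling:   ", knn_scaled.score(X_test, y_test))
```

You should typically see the scaled pipeline score noticeably higher, precisely because the distance computation is no longer dominated by one feature.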

The models that do not benefit from feature scaling are those that do not rely on the magnitudes of the values themselves, but rather on comparisons between them:

  • Tree-based models, such as decision trees, Random Forests, and Gradient Boosted Trees. These models make decisions based on feature thresholds and are invariant to monotonic transformations of the features.
  • Naive Bayes. Naive Bayes classifiers are probabilistic models that assume independence between features given the class. They are generally not sensitive to the scale of individual features.

What is Normalization?

Normalization is moving the scale of data into the range between 0 and 1. It is done with the following formula:

Formula 1. Equation of normalization: X_norm = (X - X_min) / (X_max - X_min)

Let’s apply normalization to the sample data we plotted above:

Fig 4. Normalized data.

As you can see, the shape of the histogram remains the same, but the range has been changed to 0–1.

This is all the theoretical background needed for normalization: changing the scale of the dataset to the range between zero and one. A minimal sketch is shown below; after that, let’s move on to standardization.
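
Here is a minimal sketch of normalization, reusing the same hypothetical values as before (scikit-learn’s MinMaxScaler is shown only as an equivalent, assuming it is available):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same hypothetical column as in the earlier sketch.
values = np.array([21.0, 30.0, 35.0, 42.0, 50.0, 64.0])

# Formula 1 applied directly: (X - X_min) / (X_max - X_min).
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)                          # all values now lie in [0, 1]
print(normalized.min(), normalized.max())  # 0.0 1.0

# Equivalent with scikit-learn (expects a 2D array).
normalized_sk = MinMaxScaler().fit_transform(values.reshape(-1, 1))
```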

What is Standardization?

The purpose of standardization is the same as normalization: changing the scale of the data. However, it achieves this by a different method. Instead of mapping the values into a fixed range, it changes the mean and variance of the data.

It may seem complicated, but I will explain all the terms one by one.

Let’s continue with the formula to have a clear view of what it means “to apply standardization”:

Formula 2. Equation of standardization: X_std = (X - µ) / σ

Here,

- µ is the mean, which is the average of the data.

- σ is the standard deviation.

Simply put, the mean of the data is subtracted from every value, and the result is divided by the standard deviation. After this operation, the resulting data has a mean of zero and a variance of one.
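
Here is the same idea as a short sketch, again on the hypothetical column from earlier; the printed mean and standard deviation confirm the claim (StandardScaler is the scikit-learn equivalent, assuming it is available):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same hypothetical column as in the earlier sketches.
values = np.array([21.0, 30.0, 35.0, 42.0, 50.0, 64.0])

# Formula 2 applied directly: (X - mean) / standard deviation.
standardized = (values - values.mean()) / values.std()
print(round(standardized.mean(), 10))  # 0.0 (up to floating-point error)
print(standardized.std())              # 1.0

# Equivalent with scikit-learn (expects a 2D array).
standardized_sk = StandardScaler().fit_transform(values.reshape(-1, 1))
```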

The best way to observe this is with two-dimensional data. See how the values and the spread of the points change when standardization is applied: the values are centered around zero, so the mean is zero, and the distances between values have been rescaled, so the variance is one.

Fig 5. Data points before standardization.
Fig 6. Data points after standardization.

Now that you have an intuition for what changing the mean and variance looks like, let’s see our example data after standardization:

Fig 7. Standardized data.

An important point to underline here: there is a misconception that after standardization the distribution of the data becomes a normal distribution. This is a wrong conclusion. Yes, the mean and the variance are 0 and 1 respectively for both the standard normal distribution and the result of standardization, but that does not mean the shape of the data’s distribution becomes normal; standardization only shifts and rescales the values. You can observe this in our example as well.
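
A quick way to check this for yourself is to standardize some clearly skewed data and compare its skewness before and after. This sketch assumes NumPy and SciPy and uses exponential samples purely as an example of a non-normal distribution:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=10_000)  # strongly right-skewed data

standardized = (skewed - skewed.mean()) / skewed.std()

# Mean and variance become 0 and 1 ...
print(round(standardized.mean(), 10), round(standardized.std(), 10))
# ... but the shape of the distribution is untouched: the skewness is
# identical before and after (a normal distribution would have skewness 0).
print(skew(skewed), skew(standardized))
```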

When to use normalization or standardization?

In general, the best approach is to try both methods and see which result is better.

If we dive deeper into the use cases:

- Normalization is preferred for neural networks, especially when working with images, where pixel values are scaled from the 0–255 range to 0–1.

- Standardization is preferred when there are outliers in the data, because an outlier can dominate the range and squash the other normalized values into a narrow band (see the sketch below).
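
To make the outlier point concrete, here is a small sketch with made-up income values containing one extreme outlier:

```python
import numpy as np

incomes = np.array([30_000.0, 40_000.0, 45_000.0, 50_000.0, 1_000_000.0])  # one extreme outlier

normalized = (incomes - incomes.min()) / (incomes.max() - incomes.min())
standardized = (incomes - incomes.mean()) / incomes.std()

# Under normalization the outlier becomes 1 and squashes everything
# else into a tiny sliver near 0.
print(normalized)    # [0.     0.0103 0.0155 0.0206 1.    ] (rounded)

# Standardization keeps more useful separation between the non-outlier
# values; for severe outliers, robust scalers are another option.
print(standardized)
```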

Thank you for reading; I hope I added value to your journey in mastering data science / AI. If so, do not forget to clap and follow!
