Python treatment for outliers in data science

August 12, 2022

Finding and dealing with outliers is one of the most crucial processes in data preparation since they can have a detrimental impact on statistical analysis and the training of a machine learning or AI algorithm, leading to reduced accuracy.

In this blog, we cover what an outlier is, how outliers arise, how to detect them, and which treatment techniques you can use to handle them.

What is an Outlier?

An observation that differs significantly from the other observations in a dataset is considered an outlier. An outlier is therefore considerably larger or smaller than the other values in the collection.

An outlier may appear as a result of experimental error, human mistake, data variability, or all three. A dataset of class 7 students, for instance, shows that each student is between the ages of 13 and 15, but accidentally includes information about a student from the 12th grade who is actually 19 years old. The student who is 19 years old is, therefore, an outlier for this batch of data.

How is an Outlier Created in the Data?

Outliers typically arise from one of the following causes: incorrect data entry or a processing error; missing values in a dataset; data that did not come from the intended sample; mistakes made during an experiment; genuine observations that are not errors but differ markedly from the rest of the data; or a distribution that is more extreme than normal.

Detecting Outliers using Python

If our dataset is tiny, we can find the outlier by simply scanning it. But what if our dataset is very large? How would we recognize the outliers in that case? We must employ quantitative and visual methods for detecting outliers.

Visualization

The most common and simplest way to spot outliers visually is with box plots and scatter plots. With just a box and a pair of whiskers, a box plot summarises the sample using its 25th, 50th, and 75th percentiles, so a single glance reveals the quartiles, the median, and any outliers. Scatter plots, by contrast, are useful when you have paired numerical data, when the dependent variable has several values for each value of the independent variable, or when you are trying to establish a relationship between two variables.

Code for Box Plot

import seaborn as sns

sns.boxplot(x=data_frame['col_name'])
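To see what a box plot actually encodes, the sketch below computes the same summary numerically. It assumes the conventional rule of drawing whiskers out to the last point within 1.5*IQR of the box; the `ages` sample and the helper name `box_plot_summary` are made up for illustration.

```python
import numpy as np

def box_plot_summary(values):
    """Return the quartiles, whisker ends, and fliers a box plot would show."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme points that are still inside the fences
    inside = values[(values >= low_fence) & (values <= high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        # Points beyond the fences are drawn individually as outliers
        "fliers": values[(values < low_fence) | (values > high_fence)],
    }

# Hypothetical sample: ages of a class with one mis-filed record
ages = [13, 14, 14, 15, 13, 14, 15, 14, 13, 15, 14, 19]
summary = box_plot_summary(ages)
print(summary)
```

In this sample the only point outside the whiskers is the 19-year-old, which is the dot a box plot would draw on its own.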

Code for Scatter Plot

import matplotlib.pyplot as plt

fig, plot = plt.subplots(figsize=(18, 10))

plot.scatter(data_frame['X_axis'], data_frame['Y_axis'])

Z Score

The z score measures how many standard deviations a data point lies from the mean. After choosing a threshold, you can flag as outliers the points whose absolute z score exceeds it.

Z score = (data_point - mean) / standard deviation

from scipy import stats

import numpy as np

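z_scores = np.abs(stats.zscore(data_frame['col_name']))

# Keep the rows whose absolute z score exceeds the threshold (3 is a common choice, assumed here)

outlier = data_frame[z_scores > 3]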
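As a worked illustration of the formula, here is a minimal sketch that computes z scores with NumPy alone and applies a threshold. The function name `zscore_outliers`, the default threshold of 3, and the `ages` sample are assumptions made for this example; the population standard deviation is used, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask marking values whose |z score| exceeds threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        # All values are identical, so nothing can be an outlier
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold

# Hypothetical sample: 19 students aged 13 to 15 and one aged 19
ages = np.array([13] * 6 + [14] * 7 + [15] * 6 + [19])
mask = zscore_outliers(ages)
print(ages[mask])
```

Only the 19-year-old is flagged here. The threshold is a judgment call: lowering it flags more points, raising it flags fewer.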

Interquartile Range (IQR)

The IQR quantifies variability by splitting a data set into quartiles. The data is arranged in ascending order and divided into four equal parts. The values that separate those parts are the first, second, and third quartiles, or Q1, Q2, and Q3. The interquartile range is the distance between the first and third quartiles, i.e. IQR = Q3 - Q1.

# NumPy 1.22+ names this argument method=; older releases call it interpolation=

Q1 = np.percentile(data_frame['col_name'], 25, method='midpoint')

Q3 = np.percentile(data_frame['col_name'], 75, method='midpoint')

IQR = Q3 - Q1

To flag outliers, define upper and lower limits that sit 1.5*IQR above Q3 and 1.5*IQR below Q1, just outside the dataset's typical range:

Upper = 1.5*IQR + Q3

Lower = Q1 - 1.5*IQR

upper = data_frame['col_name'] >= (Q3+1.5*IQR)

lower = data_frame['col_name'] <= (Q1-1.5*IQR)

Outliers are data points that lie above the upper limit or below the lower limit.

print(np.where(upper))

print(np.where(lower))
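The IQR steps above can be wrapped into a small reusable helper, sketched below under a few assumptions: the name `iqr_bounds` and its `k` parameter are illustrative, the `ages` sample is made up, and NumPy's default linear interpolation is used instead of the midpoint rule, so the quartiles can differ slightly from the snippet above.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits placed k * IQR outside Q1 and Q3."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: class ages with one mis-filed record
ages = np.array([13, 14, 14, 15, 13, 14, 15, 14, 13, 15, 14, 19])
lower, upper = iqr_bounds(ages)

# Same comparison as above: at or beyond either limit counts as an outlier
is_outlier = (ages <= lower) | (ages >= upper)
positions = np.flatnonzero(is_outlier)

print(lower, upper)
print(positions, ages[positions])
```

Returning the limits rather than a mask keeps the helper usable for both detection and the treatments discussed later, and a larger `k` gives a more conservative rule.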

Treatment for outliers

Once we have identified the outliers in our data set, the next question is what to do with them.

Here are a few approaches to handling outliers.

Deleting or Trimming the outlier

This method removes the outliers from the dataset. The visualization gives us an estimate of the range in which outliers lie, and on that basis we can drop them from the data set.

A good example of this is the ‘Age’ variable, which ranged from 0 to 100. The first line of code below builds an index of all rows whose age falls at or beyond the chosen cut-offs (18 and 100 here), and the second line drops those rows from the dataset.

index=data_frame[(data_frame['Age']>=100)|(data_frame['Age']<=18)].index

data_frame.drop(index, inplace=True)
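An alternative to dropping rows in place is to build a boolean mask and keep only the rows that pass it, which leaves the original data untouched for comparison. The sketch below assumes pandas is available; the `people` frame, its values, and the cut-offs of 18 and 100 are hypothetical.

```python
import pandas as pd

# Hypothetical data: two ages fall outside the accepted range
people = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [25, 17, 42, 130, 63],
})

# Keep rows strictly between the cut-offs, mirroring the drop condition above
keep = (people["Age"] > 18) & (people["Age"] < 100)
trimmed = people[keep].reset_index(drop=True)

print(trimmed)
print(f"Removed {len(people) - len(trimmed)} of {len(people)} rows")
```

Reporting how many rows were removed is a cheap sanity check: if trimming discards a large share of the data, the cut-offs probably need another look.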

Flooring and capping based on quantiles

With this method, we floor the lower values at, say, the 10th percentile and cap the higher values at, say, the 90th percentile. The code below computes the 10th and 90th percentiles of the variable (for example, "Income") and then uses these values for the quantile-based flooring and capping.

# Computing 10th, and 90th percentiles and replacing the outliers

import numpy as np

tenth_percentile = np.percentile(data, 10)

ninetieth_percentile = np.percentile(data, 90)

treated_data = np.where(data < tenth_percentile, tenth_percentile, data)

treated_data = np.where(treated_data > ninetieth_percentile, ninetieth_percentile, treated_data)

The array treated_data now holds the data with its outliers floored at the 10th percentile and capped at the 90th percentile.
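The same flooring and capping can be written in one step with `np.clip`, which limits every value to a given range. This is a minimal sketch on a made-up `income` array, not a replacement for the code above.

```python
import numpy as np

# Hypothetical incomes with one very low and one very high value
income = np.array([1, 20, 21, 22, 23, 24, 25, 26, 27, 28, 100], dtype=float)

# 10th and 90th percentiles act as the floor and the cap
floor, cap = np.percentile(income, [10, 90])

# Values below the floor are raised to it; values above the cap are lowered to it
treated = np.clip(income, floor, cap)

print(floor, cap)
print(treated)
```

Unlike trimming, this keeps the number of observations unchanged, which matters when the rows have to stay aligned with other columns.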

Median treatment

With this method, the extreme values are replaced with the mean or the median. Using the mean is discouraged because the mean itself is strongly influenced by outliers, so it is better to rely on the median. In the code below, the median replaces every value in the data that is listed in sample_outliers, for example the values that are higher than the 95th percentile.

median = np.median(data)

# sample_outliers holds the outlier values identified earlier

c = np.where(np.isin(data, sample_outliers), median, data)
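For the 95th-percentile rule described above, a self-contained sketch might look like the following. The sample values are made up, and the median is taken over the original data before anything is replaced.

```python
import numpy as np

# Hypothetical data: the values 1 to 19 plus one extreme reading
data = np.array(list(range(1, 20)) + [500], dtype=float)

cutoff = np.percentile(data, 95)   # values above this are treated as outliers
median = np.median(data)           # robust centre, barely moved by the 500

# Replace everything above the cutoff with the median; keep the rest as is
treated = np.where(data > cutoff, median, data)

print(cutoff, median)
print(treated)
```

Here the mean of the raw data is about 34.5 while the median is 10.5, which illustrates why the median is the safer replacement value.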

Conclusion

In this blog, we learned about handling outliers, a crucial step in data preparation. We now have a variety of techniques for identifying and treating them. However, because there is no mathematically right or wrong answer, handling outliers remains a highly subjective task. Qualitative knowledge, such as understanding the origin or impact of an anomaly, makes the choice of treatment easier.
