Finding and dealing with outliers is one of the most crucial processes in data preparation since they can have a detrimental impact on statistical analysis and the training of a machine learning or AI algorithm, leading to reduced accuracy.
In this blog, we briefly cover what an outlier is, how outliers arise, how to detect them, and which treatment techniques you can use to handle them.
What is an Outlier?
An observation that differs significantly from the other observations in a dataset is considered an outlier. An outlier is therefore considerably larger or smaller than the other values in the collection.
An outlier may appear as a result of experimental error, human error, natural data variability, or a combination of these. For instance, a dataset of class 7 students in which every student is between 13 and 15 years old might accidentally include a 12th-grade student who is 19. The 19-year-old student is therefore an outlier in this dataset.
How is an Outlier Created in the Data?
Outliers typically arise from one of the following causes: incorrect data entry or a processing mistake; missing values in a dataset; data that did not come from the intended sample; errors made during an experiment; genuine observations that are not errors but differ from the rest of the data; and a distribution that is more extreme than normal.
Detecting Outliers using Python
If our dataset is tiny, we can find the outlier by simply scanning it. But what if our dataset is very large? How would we recognize the outliers in that case? We must employ quantitative and visual methods for detecting outliers.
- Visualization
The most common and easiest visualization techniques for detecting outliers in a dataset are box plots and scatter plots. With just a box and a few whiskers, a box plot summarises the data effectively: it uses the 25th, 50th, and 75th percentiles to describe the sample, so a single glance reveals the quartiles, the median, and any outliers. Scatter plots, by contrast, are used when you have paired numerical data, when the dependent variable has several values for each value of the independent variable, or when you are trying to establish a relationship between two variables.
Code for Box Plot
import seaborn as sns
# Draw a box plot of a single numeric column
sns.boxplot(x=data_frame['col_name'])
Code for Scatter Plot
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(18, 10))
ax.scatter(data_frame['X_axis'], data_frame['Y_axis'])
plt.show()
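To see which points a box plot actually marks as outliers, here is a minimal sketch using matplotlib alone. The ages array is made-up data echoing the class example above, and the Agg backend is only an assumption so that the script runs without a display. By default matplotlib places the whiskers at 1.5 times the IQR beyond the quartiles and draws anything further out as a "flier".

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical ages: mostly 13 to 15, plus one 19-year-old entered by mistake
ages = np.array([13] * 6 + [14] * 7 + [15] * 6 + [19])

fig, ax = plt.subplots(figsize=(6, 4))
box = ax.boxplot(ages)  # returns a dict of the artists that were drawn

# The "fliers" are the points plotted beyond the whiskers
fliers = box["fliers"][0].get_ydata()
print(fliers)
```

Reading the fliers back from the plot is handy when you want the visual check and the numeric list of suspects to agree.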
- Z Score
The z-score tells us how many standard deviations a data point lies from the mean. After choosing a threshold (3 is a common choice), any data point whose absolute z-score exceeds it can be flagged as an outlier.
z_score = (data_point - mean) / standard_deviation
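from scipy import stats
import numpy as np
# Absolute z-score of every value in the column
z_scores = np.abs(stats.zscore(data_frame['col_name']))
# Flag rows more than 3 standard deviations from the mean (3 is a common, adjustable threshold)
outliers = data_frame[z_scores > 3]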
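The same idea can be sketched end to end with NumPy only. The ages array and the threshold of 3 are assumptions made for illustration; the standard deviation here is the population one (ddof=0), which is also the default of scipy.stats.zscore.

```python
import numpy as np

# Hypothetical ages: mostly 13 to 15, plus one 19-year-old
ages = np.array([13] * 6 + [14] * 7 + [15] * 6 + [19])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
z_scores = (ages - ages.mean()) / ages.std()

threshold = 3  # assumed cutoff; tune it for your data
outliers = ages[np.abs(z_scores) > threshold]
print(outliers)
```

Keep in mind that the threshold interacts with sample size: with ten or fewer points, no value can have an absolute z-score above 3, so a lower cutoff may be needed for tiny datasets.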
- Inter Quartile Range (IQR)
The IQR quantifies variability by splitting a dataset into quartiles. The data is arranged in ascending order and divided into four equal parts. The values that separate these parts are known as the first, second, and third quartiles, or Q1, Q2, and Q3, respectively. The interquartile range is the distance between the first and third quartiles, i.e. IQR = Q3 - Q1.
# 'method' replaces the deprecated 'interpolation' keyword (NumPy 1.22 and later)
Q1 = np.percentile(data_frame['col_name'], 25,
                   method='midpoint')
Q3 = np.percentile(data_frame['col_name'], 75,
                   method='midpoint')
IQR = Q3 - Q1
Next, define the upper and lower boundaries, which sit 1.5*IQR above Q3 and below Q1. Values beyond these boundaries fall outside the dataset's typical range:
Upper = 1.5*IQR + Q3
Lower = Q1 - 1.5*IQR
upper = data_frame['col_name'] >= (Q3+1.5*IQR)
lower = data_frame['col_name'] <= (Q1-1.5*IQR)
Outliers are data points that lie either above the upper limit or below the lower limit.
print(np.where(upper))
print(np.where(lower))
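The IQR steps above can be wrapped into one small helper. This is a sketch under stated assumptions: iqr_bounds is a hypothetical name, the ages are made-up data, and the quartiles use NumPy's default linear interpolation rather than the 'midpoint' option shown earlier.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits placed k * IQR beyond Q1 and Q3."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages: mostly 13 to 15, plus one 19-year-old
ages = np.array([13] * 6 + [14] * 7 + [15] * 6 + [19])

lower, upper = iqr_bounds(ages)
# Values strictly outside the limits are treated as outliers
outliers = ages[(ages < lower) | (ages > upper)]
print(lower, upper, outliers)
```

Exposing k as a parameter makes it easy to tighten or loosen the rule; 1.5 is the conventional default.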
Treatment for outliers
Once we have identified the outliers in our dataset, the next question is what to do with them.
Here are a few approaches to handling outliers.
- Deleting or Trimming the outlier
With this method, we remove the outliers from the dataset. The visualization gives us an estimate of the range in which the outliers lie, and on that basis we can drop them from the dataset.
A good example of this is the ‘Age’ variable, which ranged from 0 to 100. The first line of code below builds an index of all rows where the age is at or above 100 or at or below 18 (the cutoffs chosen for this example), and the second line drops those rows from our dataset.
index = data_frame[(data_frame['Age'] >= 100) | (data_frame['Age'] <= 18)].index
data_frame.drop(index, inplace=True)
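An alternative to dropping rows in place is to keep only the rows inside an accepted range, which leaves the original data untouched for later comparison. The sketch below assumes pandas is available; the sample ages and the 0 to 100 range are illustrative assumptions, not values from a real dataset.

```python
import pandas as pd

# Hypothetical data; 150 and -4 are clearly invalid ages
data_frame = pd.DataFrame({"Age": [25, 31, 47, 52, 38, 150, 29, -4]})

low, high = 0, 100  # assumed plausible range for Age

# Keep rows inside the range (bounds inclusive); data_frame itself is not modified
trimmed = data_frame[data_frame["Age"].between(low, high)]
print(len(data_frame), "->", len(trimmed))
```

Because trimming discards whole rows, it is worth checking how many observations are lost before committing to it.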
- Flooring and capping based on quantiles
With this method, we floor the lower values at, say, the 10th percentile, and cap the higher values at, say, the 90th percentile. The code below computes the 10th and 90th percentiles of the variable (for example, an ‘Income’ column held in ‘data’). These values are then used for the quantile-based flooring and capping.
# Computing the 10th and 90th percentiles and replacing the outliers
import numpy as np
tenth_percentile = np.percentile(data, 10)
ninetieth_percentile = np.percentile(data, 90)
# Floor: raise values below the 10th percentile up to it
treated_data = np.where(data < tenth_percentile, tenth_percentile, data)
# Cap: lower values above the 90th percentile down to it
treated_data = np.where(treated_data > ninetieth_percentile, ninetieth_percentile, treated_data)
The array treated_data now holds the data with its extreme values floored at the 10th percentile and capped at the 90th percentile. No rows are removed; only the outlying values are replaced.
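Flooring and capping can also be written in a single step with np.clip. The incomes array below is made-up data used only to illustrate the effect, and the 10th and 90th percentiles are the same example cutoffs as above.

```python
import numpy as np

# Hypothetical incomes with one very low and one very high value
incomes = np.array([5, 40, 42, 45, 47, 50, 52, 55, 58, 60, 400])

# Compute both cutoffs in one call
floor, cap = np.percentile(incomes, [10, 90])

# Values below the floor are raised to it; values above the cap are lowered to it
capped = np.clip(incomes, floor, cap)
print(floor, cap)
print(capped)
```

The array keeps its original length, which is the main advantage of this method over trimming.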
- Median treatment
With this method, the extreme values are replaced with the mean or the median. Using the mean is not recommended because it is itself heavily influenced by the outliers, so it is better to rely on the median. In the code below, the median is used to replace any values in ‘data’ that are higher than the 95th percentile.
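median = np.median(data)
# Treat values above the 95th percentile as outliers, as described above
threshold = np.percentile(data, 95)
# Replace those values with the median and keep everything else unchanged
treated_data = np.where(data > threshold, median, data)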
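A small sketch makes the difference between mean and median concrete. The ages are made-up data, and the 95th percentile cutoff follows the description above; both are assumptions for illustration.

```python
import numpy as np

# Hypothetical ages: mostly 13 to 15, plus one 19-year-old
ages = np.array([13] * 6 + [14] * 7 + [15] * 6 + [19])

# The single outlier pulls the mean upward but leaves the median unchanged
print("mean:", ages.mean(), "median:", np.median(ages))

median = np.median(ages)
threshold = np.percentile(ages, 95)

# Replace values above the 95th percentile with the median
treated = np.where(ages > threshold, median, ages)
print(treated)
```

Only the value 19 lies above the threshold here, so it is the only one replaced.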
Conclusion
In this blog, we learned about handling outliers, a crucial step in data preparation. We now have a variety of techniques for identifying and treating them. But since there is no mathematically right or wrong answer, handling outliers is a highly subjective task. Qualitative knowledge, such as understanding the origin or impact of an anomaly, makes the choice of treatment easier.