Outliers: The Unexpected Guests in Data Science

October 17, 2023 Dr. Aammar Tufail

In the world of data, outliers are the unexpected guests who turn up at your dinner party without an invitation. Just when you think everything is set up properly, they arrive and disturb the harmony. But what exactly are outliers? Why do they matter? And most importantly, how do we handle them? Let's get started.

What Are Outliers? 🔍

Outliers are data points that deviate significantly from the rest of the observations in a dataset. Imagine plotting the ages of a group of high school students, and among those teenagers, you find an age of 85. That 85 is an outlier—it doesn’t fit the general trend or expectation.

What Are Outliers? An Insightful Story 😯

On a busy street in Lahore, Ahmed was having tea with his friends. ☕

Ali: “Ahmed bhai, yesterday while analyzing data I saw something strange. Some values were very different from the rest. Do you know what they were?”

Ahmed: “Ali bhai, you are talking about outliers. Outliers are values that differ markedly from the rest of the data. In other words, they stand apart from the usual data points.” 😌

Ali: “Outliers? You mean values that fall outside the normally expected range?”

Ahmed: “Exactly! Sometimes outliers occur naturally, and sometimes they come from incorrect input or measurement errors. Identifying and dealing with them is very important, because they can affect our analysis and models.”

Usman (who owned the tea shop): “So if there are outliers, what should we do with them?” 🤔

Ahmed: “Usman bhai, that depends on the situation. Sometimes it is better to remove outliers, and sometimes we replace them with the median or mean.”

Ali: “But Ahmed bhai, how do we know whether a value is an outlier or not?”

Ahmed: “Ali, there are many ways. Visual techniques include scatter plots, box plots, and so on. Among statistical methods, the z-score or the IQR (Interquartile Range) can also help us identify outliers.”

Usman: “I see, so if I look at my tea sales records and on some days the sales are very high or very low, those could be outliers too?”

Ahmed: “Exactly, Usman bhai. But not every unusual value can be called an outlier. You have to analyze it.”

As they chatted, Ahmed explained outliers thoroughly to his friends. And so another informative discussion came to a close over a delightful round of tea. 😄📊

Outliers may be only a few values in our data, but their impact can be very large. So if you are new to data science, learn to understand outliers and to deal with them properly. In data, even a small detail can matter a great deal! 🌟📈

How to Identify Outliers?

Visual Method: Tools such as scatter plots, box plots, and histograms are excellent visual aids. For example, in a box plot, data points that fall outside the whiskers can be considered outliers. The whiskers are conventionally drawn at 1.5 × IQR beyond the quartiles, so reading a box plot properly requires understanding the IQR.
Statistical Methods: The Z-score and the IQR (Interquartile Range) method are two popular statistical approaches. The Z-score tells you how many standard deviations a data point is from the mean. Commonly, a point is treated as an outlier if its Z-score is > 3 or < -3.
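
As a rough illustration of the Z-score rule, the sketch below uses a made-up list of student ages (not taken from any real dataset) that includes the 85 from the earlier example. The variable names and the threshold of 3 are illustrative assumptions; the right cutoff depends on your data.

```python
import pandas as pd

# Hypothetical ages of 20 high school students; 85 is the planted outlier.
ages = pd.Series([14, 15, 15, 16, 16, 16, 17, 17, 17, 17,
                  18, 18, 15, 16, 17, 14, 16, 17, 15, 85])

# Z-score: how many standard deviations each value is from the mean.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
z_scores = (ages - ages.mean()) / ages.std()

# Flag values whose absolute Z-score exceeds the common threshold of 3.
threshold = 3
outliers = ages[z_scores.abs() > threshold]
print(outliers.tolist())
```

Keep in mind that in very small samples the Z-score of a single extreme value is mathematically bounded, so this rule can miss outliers in tiny datasets; the IQR method is often the safer choice there.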

How to Handle Them? 🛠

Truncation or Capping: For high outliers, values above a chosen threshold can be set to a maximum cap. Likewise, for low outliers, values below a chosen threshold can be set to a minimum cap.
Transformation: Sometimes mathematical transformations such as logarithms can bring outliers under control.
Imputation: Replace the outlier with a measure of central tendency such as the mean, median, or mode.
Deletion: If the outlier is due to a data entry error, or it is clear that it will not add value, it is better to remove it.
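
The first three options can be sketched on a small series as follows. This is a minimal illustration, assuming made-up daily sales figures and the 1.5 × IQR fences as the capping thresholds; the names `sales`, `capped`, `logged`, and `imputed` are hypothetical, and other thresholds or replacement values may suit your data better.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales; 400 is the unusually high value.
sales = pd.Series([52, 48, 55, 50, 47, 53, 49, 51, 400, 50], dtype=float)

# IQR fences, used here as the thresholds for capping and imputation.
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (sales < lower) | (sales > upper)

# Truncation / capping: pull extreme values back to the fences.
capped = sales.clip(lower=lower, upper=upper)

# Transformation: a log compresses large values (requires positive data).
logged = np.log10(sales)

# Imputation: replace outliers with the median of the remaining values.
imputed = sales.mask(is_outlier, sales[~is_outlier].median())

print(capped.max(), round(logged.max(), 2), imputed.max())
```

Capping and imputation keep the number of rows unchanged, which matters when other columns in the same row are still trustworthy; deletion, shown in the next section, drops the whole row.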

Python code to remove outliers

import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Display the first few rows of the dataset
print(titanic.head())

# Calculate the IQR for the 'age' column
Q1 = titanic['age'].quantile(0.25)
Q3 = titanic['age'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds for the outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers (rows with a missing age are dropped too, since NaN fails both comparisons)
titanic_no_outliers = titanic[(titanic['age'] >= lower_bound) & (titanic['age'] <= upper_bound)]

# Display the first few rows of the dataset without outliers
print(titanic_no_outliers.head())

⚠️ The Effect of Ignoring Outliers

Skewed Analysis: Outliers can skew both descriptive and inferential statistics, giving a distorted view of the data.
Effect on Machine Learning Models: Algorithms, especially linear models, are sensitive to outliers. Outliers can drastically affect coefficient estimates and predictions.
Misleading Assumptions: Data assumptions such as normality can be violated by the presence of outliers, leading to incorrect results.
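
To make the first two effects concrete, here is a small sketch on made-up numbers. It compares the mean and median with and without a single extreme value, then fits a straight line with `np.polyfit` before and after corrupting one point. All data and variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical ages: 19 teenagers, then the same group plus an entry of 85.
clean = np.array([14, 15, 15, 16, 16, 16, 17, 17, 17, 17,
                  18, 18, 15, 16, 17, 14, 16, 17, 15], dtype=float)
with_outlier = np.append(clean, 85.0)

# Descriptive statistics: the mean moves noticeably, the median does not.
mean_shift = with_outlier.mean() - clean.mean()
median_shift = np.median(with_outlier) - np.median(clean)
print(f"mean shift: {mean_shift:.2f}, median shift: {median_shift:.2f}")

# Linear model: points on the line y = 2x + 1, fitted by least squares.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
slope_clean, _ = np.polyfit(x, y, 1)

# Corrupt a single point and refit; the estimated slope changes sharply.
y_out = y.copy()
y_out[-1] = 100.0
slope_out, _ = np.polyfit(x, y_out, 1)
print(f"slope without outlier: {slope_clean:.2f}, with outlier: {slope_out:.2f}")
```

The median staying put while the mean shifts is the reason robust statistics are often suggested when outliers cannot simply be removed.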

Final Word:

Outliers, although troublesome, are an important part of data analysis. They carry important information about anomalies, unique events, or errors in data collection. Handling outliers properly helps make an analysis robust and reliable. So, the next time you see these uninvited guests in your dataset, you will know exactly what to do! 🌟

Read another blog for Data Wrangling Tips and also on Missing Values k Rolay.

48 Comments.

sikandar ul hassan says:
April 16, 2024 at 6:57 am
Great
Reply
Danish Azeem says:
March 28, 2024 at 2:44 pm
Great blog
Reply
Muhammad Akhtar says:
March 19, 2024 at 11:17 pm
The chaiwala was an outlier too
Reply
Ihtasham Aslam says:
January 30, 2024 at 5:42 pm
First, identify whether the dataset has outliers, then choose the method to fix them depending on the data or requirements.
Reply
Aly Omer says:
January 22, 2024 at 2:14 pm
You explained it wonderfully, sir.
An outlier is that person who is a very distant relative and sits a little apart from us.
Some of its types are:
contextual, collective, global.
Catch it by the ear with a visualization method, and if it is a useful person, replace it with the mean or median, or transform it;
otherwise pick it up and throw it out.
Reply
1. Wajahat Ali says:
  October 8, 2024 at 10:26 am
  😂😂 bravo
  Reply
Anam Jafar says:
January 5, 2024 at 4:36 pm
very well explained.
Reply
Fawad Ali says:
January 3, 2024 at 2:19 am
Sir Great blog
Reply
Muhammad Haroon says:
December 23, 2023 at 2:00 pm
A good and simple way to teach about outliers
Reply
Pingback: Handling outliers in Data Science and Machine Learning - Codanics
Ahad Ali says:
November 11, 2023 at 10:43 pm
# What are outliers?
– data points that differ from the rest of the dataset
– Outliers in Pandas refer to data points that stand out from the majority of the data in a DataFrame. They are exceptionally high or low values in a column. Detecting outliers is important in data analysis. You can find outliers using methods like box plots, Z-scores, or the Interquartile Range (IQR) method. Outliers can impact statistical analyses and machine learning models, so it’s essential to handle them appropriately, such as by removing, transforming, or addressing them depending on your data analysis goals.
# How many types of outliers are there?
– there are two major types of outliers
– Univariate outliers
– Multivariate outliers
—
### Univariate outliers
– uni means [one]
– variate means [variable]\
**it means `one variable`**\
***for example***\
`titanic['age']`: here we check a single column of the Titanic data; outliers found in one column are known as univariate outliers
—
### Multivariate outliers
– multi means `multiple`
– variate means `variable`\
**it means `multiple variables`**\
`titanic[['age', 'fare', 'MEAL']]`: here we check several columns of the Titanic data together; outliers found across multiple columns are known as multivariate outliers
# other types
1. Global outliers –> `a single point that is abnormal relative to the whole dataset (point anomaly)`
2. Contextual outliers –> `A contextual outlier is a data point that is unusual within a specific context or scenario, even if it’s not statistically extreme in the overall dataset.`
3. Collective outliers –> `Collective outliers, also known as group outliers, refer to a set of data points that deviate significantly from the norm when considered together as a group, even though individual data points may not be extreme outliers. They are identified by analyzing the collective behavior of data subsets rather than individual values.`
—
# how we identify outliers? IQR method | handling outlier in DS
### How to identify outliers?
– Visually ——> by generating plot
– `boxplot`
– `histogram`
**Note** : `quartile 1 = first 25 of 100 = 25%`, `quartile 2 = next 25 of 100 = 50%`, `quartile 3 = next 25 of 100 = 75%`, `quartile 4 = next 25 of 100 = 100%`
The quartiles are also called `Q1, Q2, Q3, Q4`
***important*** : the range from Q1 to Q3 is called the IQR. [`Inter Quartile Range`]
***Formula of IQR***: IQR = Q3-Q1
—
# Z-Score method
Note : `Z-Score is the simplest way of finding Outliers`\
**what is Z-score**
– In Python using Pandas, a Z-score is a way to standardize or normalize data to see how far each data point is from the mean (average) in terms of standard deviations. Here’s how you can calculate Z-scores for a column in a Pandas DataFrame in a simple way
1. First, import Pandas:\
`import pandas as pd`
2. Calculate the Z-scores for the 'data' column and store them in a new column 'Z-score':\
`df['Z-score'] = (df['data'] - df['data'].mean()) / df['data'].std()`
***In this code:***\
– `df['data'].mean()` calculates the mean (average) of the 'data' column.
– `df['data'].std()` calculates the standard deviation of the 'data' column.
– Subtracting the mean from each value and dividing by the standard deviation gives you the Z-score for each data point.
Now, the ‘Z-score’ column in your DataFrame ‘df’ will contain the Z-scores for the ‘data’ column, which helps you understand how each data point relates to the mean in terms of standard deviations.
—
# Handling Outliers in dataset
**How to deal with outliers?**\
– Remove all outliers
– Transform them
– Impute outliers = `replace them the same way you fill missing values in a pandas dataset, e.g. with the mean or median`
– ML Model – Robust
Reply
Aftab Ahmad says:
November 10, 2023 at 11:52 pm
Excellent, sir
Reply
saima Shahzadi says:
November 7, 2023 at 12:09 am
types of outliers:
1-collective
2-global
3-contextual
detect by:
1-visual
2-plotting
removing outliers:
1-deleting
2-transformation
3-imputing
4-robust methods
Reply
Muhammad Daniyal says:
November 5, 2023 at 5:05 pm
Explained in very easy way.
Reply
SALMAN TASADDUQ says:
November 4, 2023 at 10:33 am
valuable learning about outliers
Reply
tayyab Ali says:
October 27, 2023 at 8:35 pm
Sir, the outlier blog is very nice.
Reply
Sibtain Ali says:
October 27, 2023 at 8:33 pm
The Outlier blog is Good Baba..G
Reply
Quratul Ain says:
October 23, 2023 at 10:47 am
OUTLIERS are extreme values within the data.
TYPES OF OUTLIERS:
1- Global Outlier: An extreme value according to the whole data set.
2- Contextual Outlier: Any data point which is considered an extreme value according to the context of that data set.
3- Collective outliers: Data points that make a group with its neighbor point and that neighbor point is the outlier. Hence the whole group becomes the outlier.
Detecting Outliers: There are three methods to detect Outliers.
1- By Visualizing (Box plot or Bar chart)
2- By IQR method
3- By z-score
Handling Outliers:
1- Removing or deleting the row
2- Impute using Mean, Median, and Mode.
3-Transform the data like taking log10 of the column.
4- Use ML, robust model.
Reply
Anwar Mehmood Sohail says:
October 22, 2023 at 1:12 pm
Outliers are those data points which deviate significantly from the rest of the data.
How to detect: visually, Z-score, IQR method.
How to handle: depends on the scenario: delete them, impute them, take the logarithm, or use a robust ML algorithm.
Reply
1. Moavia Hassan says:
  October 23, 2023 at 12:07 am
  Outliers in data are data points that significantly differ from the majority of the data. They can be understood as extreme values that lie far from the central distribution of the dataset. One common method to identify outliers is through visualization using techniques like box plots or scatter plots. Outliers may also be detected mathematically by calculating the interquartile range (IQR) and considering data points that fall below Q1 – 1.5 * IQR or above Q3 + 1.5 * IQR as potential outliers. These extreme values can have a notable impact on data analysis, so it’s important to investigate and handle them appropriately, as they may signify errors or anomalies in the data, or, in some cases, provide valuable insights into unique or unexpected patterns.
  Reply
Shahid Umar says:
October 22, 2023 at 12:04 pm
Outliers are those values that create confusion in our data and lead it in the wrong direction; they may spoil all the predictions in future analysis. So, beware of outliers…..
Reply
Neelam Nasir says:
October 22, 2023 at 3:25 am
An outlier is an observation or data point that significantly differs from the rest of the data in a dataset. Outliers can be either exceptionally high or exceptionally low values in a dataset and can distort statistical analyses and data visualization. Identifying and handling outliers is important in data analysis and statistics.
There are several methods for identifying outliers, including:
Z-Score or Standard Score Method: This method involves calculating the z-score for each data point. Z-scores measure how many standard deviations an individual data point is from the mean. Data points with high absolute z-scores (typically greater than 2 or 3) are considered outliers.
IQR (Interquartile Range) Method: This method involves calculating the IQR, which is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. Data points that fall below the first quartile minus 1.5 times the IQR or above the third quartile plus 1.5 times the IQR are considered outliers.
Visual Inspection: Sometimes, you can identify outliers by simply visualizing your data with plots such as box plots, scatter plots, or histograms. Outliers are often points that are far away from the bulk of the data.
Domain Knowledge: In some cases, outliers may be valid data points, so it’s essential to consider domain knowledge when determining whether an observation is genuinely an outlier.
Once you’ve identified outliers, you have several options for handling them:
Remove the Outliers: You can choose to remove the outliers from your dataset. This can be a good option if you are confident that the outliers are due to errors or unusual circumstances. However, removing outliers may also result in a loss of valuable information.
Transform the Data: You can apply data transformations to mitigate the impact of outliers. Common transformations include logarithmic transformations, square root transformations, or using robust statistical methods.
Impute or Replace Outliers: Instead of removing outliers, you can impute or replace them with more typical values. You can use methods like median imputation, mean imputation, or replacing them with a predefined threshold value.
Segment Data: Another approach is to segment your data into subsets, separating the outliers from the main dataset. You can then analyze the two subsets separately.
Use Robust Statistics: Robust statistical methods are designed to be less influenced by outliers. For example, instead of using the mean, you can use the median or other robust estimators.
The choice of how to handle outliers depends on the nature of the data and the goals of your analysis. It’s important to carefully consider the implications of your approach and be transparent about the methods used to handle outliers in your analysis. Additionally, domain knowledge and the context of the data should always be taken into account when deciding how to handle outliers.
Reply
Islam Ud din says:
October 21, 2023 at 9:06 pm
Explained outliers in a precise way. Baba ji, you are great.
Reply
Muhammed Anas says:
October 21, 2023 at 8:41 pm
outstanding !!!!!
Reply
Hashir Bhatti says:
October 20, 2023 at 9:53 pm
## Summary
——————–
Outliers: Outliers are data points in a dataset that deviate significantly from the rest of the observations.
——————–
Type of Outliers:
1. Global Outlier
2. Contextual Outlier
3. Collective Outlier
——————–
How to detect them?
1. Visually (Boxplot, Scatterplot, Histogram)
2. Interquartile range (IQR) Method
3. Z-Score method
——————–
How to deal with them?
1. Delete them
2. Transform them (log transformation, etc.)
3. Impute them (mean, median, mode, etc.)
4. Use Truncation or Capping
5. Use robust models (But it will limit the models we can use)
Reply
Saad khuda bux says:
October 20, 2023 at 9:20 pm
*Outliers: Observations in a data set which deviate significantly from the others.
– Be careful! Analyze the data before removing outliers, because not every deviating value is an outlier.
-It's very important to identify and remove the outliers because they will ultimately have a great impact on our machine learning models.
-IQR > interquartile range, the distance between Quartile 1 and Quartile 3.
– We can see the outliers by drawing plots like boxplot, scatterplot and histogram which can help us to identify the outliers easily.
– We can handle the outliers by imputing>mean,median,and mode, through Transformation like np.log10().
– We can delete the outliers by using pandas.
Reply
Sana Shah says:
October 20, 2023 at 6:47 pm
Fully understandable, and now I know what outliers actually are and how to detect and then handle them. JAZAKALLAH
Reply
Saadat Khalid says:
October 20, 2023 at 11:32 am
Ma Sha Allah, you explained the outliers in a simple and easy way.
Reply
Muhammad Raffay liaqat says:
October 19, 2023 at 10:47 pm
That was helpful
Reply
Maria Nadeem says:
October 19, 2023 at 9:08 pm
This blog explains that outliers are data points which differ from the rest of the data. They are recognized by visual and statistical methods: the visual methods include the boxplot and scatterplot, while the statistical methods include the z-score and IQR. These outliers can disrupt the whole dataset.
Reply
sadia ali says:
October 19, 2023 at 8:30 pm
Excellent sir! keep it up.
Data points which are out of range, or unexpected values, are called outliers.
It is a must to remove them, otherwise there will be a lot of deviation in our insights.
Reply
Nimra Ishaq says:
October 19, 2023 at 6:17 pm
1. An outlier is a data point that is significantly different from the other data points in a dataset.
2. We can identify outliers by plotting a boxplot or a histogram. In these plots, the data points that lie
at a distance from the rest of the data are identified as outliers.
3. We can also identify outliers with the help of the IQR method and the Z-score method.
4. How we deal with outliers depends on their type: first we identify the outliers, then we deal with them:
(1) remove them, (2) transform them, (3) impute them with the mean, median, or mode, and (4) use a robust ML model.
Reply
Muhammad Haseeb says:
October 19, 2023 at 2:30 pm
That was helpful, keep writing ☺️
Reply
Mehak Iftikhar says:
October 19, 2023 at 11:46 am
Very good piece of writing, easy to understand and very well explained about Outliers.
Reply
DANISH AMMAR says:
October 19, 2023 at 3:03 am
## outliers
types of outliers
1. univariate
2. multivariate
kinds of outliers
1. global
2. contextual
3. collective
how to identify outliers
1. visually/plotting
2. IQR method
3. z-score
how to remove outliers
1. Truncation or Capping
2. Transformation
3. Imputation
4. Deletion
5. use robust ml models
Reply
Shafat Hussain Khan says:
October 18, 2023 at 11:11 pm
MashaAllah, explained brilliantly, and I also got the 12 points from this.
Reply
Azhar Muheem says:
October 18, 2023 at 9:55 pm
No one can teach AI this way. Hats off ……
Reply
Afzaal Ahmad says:
October 18, 2023 at 2:42 pm
Clearly thought out and imaginative article.
Reply
Muhammad Mubeen says:
October 17, 2023 at 9:52 pm
I once asked a very famous AI teacher about blogs like these, saying, sir, this would make things easier for us, but he flatly refused and said that we should do some work ourselves too, use ChatGPT, etc. (and that teacher is also very sincere). But do value these blogs; believe me, sir is doing half of our share of the hard work himself. I say this on the basis of my previous experience.
Reply
Muhammad Mubeen says:
October 17, 2023 at 9:46 pm
Sir, InshaAllah your hard work will surely have an impact on our way of learning.
Reply
TAHA EHSAN ULLAH says:
October 17, 2023 at 8:51 pm
Insightful blog, sir. Please keep explaining with real-life examples like this; it helps us remember things well.
Reply
Asad Ullah says:
October 17, 2023 at 8:42 pm
You are trying so hard for us to understand . Thank you for everything Dr. Ammaar Bhai.
Reply
Shahid Umar says:
October 17, 2023 at 8:36 pm
What I learned about outliers from this blog is that a value that only creates falsehood in your data is called an outlier. There is a famous saying: ‘By the time the truth is known, the lie will have destroyed village after village.’ So outliers are the ones that wreck our analysis.
Reply
Muhammad Ali Talha says:
October 17, 2023 at 8:29 pm
It explains the outlier really well, along with its identification methods and how to remove it. Moreover, the impact of not removing outliers is explained briefly but comprehensively. To remember how to check for outliers visually, I will use the term “BHS”, meaning Boxplot, Histogram and Scatter Plot, and for statistical analysis “IQRAZ”, meaning “Interquartile Range AND Z-Score”. Thanks for such a nice blog.
Reply
Farman Ali says:
October 17, 2023 at 8:28 pm
Understood the concept of outliers: anomalies that can be handled according to the situation and data types.
Reply
Muhammad Abdullah says:
October 17, 2023 at 8:26 pm
It was a very good explanation.
Reply
Rimshah Sabir says:
October 17, 2023 at 8:08 pm
WOW! very nicely and easily explained…👌
Reply
Mehreen Bibi says:
October 17, 2023 at 7:24 pm
Explained very nicely.
Reply