Outliers: The Unexpected Guests in Data Science

Outlier Detection and Dealing with Them

In the world of data, outliers are the unexpected guests who show up at your dinner party without an invitation. Just when you think everything is set up properly, they arrive and disturb the harmony. But what exactly are outliers? Why do they matter? And most importantly, how do we handle them? Let's get started.

What Are Outliers? 🔍

Outliers are data points that deviate significantly from the rest of the observations in a dataset. Imagine plotting the ages of a group of high school students, and among those teenagers, you find an age of 85. That 85 is an outlier—it doesn’t fit the general trend or expectation.

What Are Outliers? An Insightful Story 😯

On a busy street in Lahore, Ahmed was having chai with his friends. ☕

Ali: “Ahmed bhai, yesterday while analyzing some data I noticed something strange. A few values were very different from the rest. Do you know what they were?”

Ahmed: “Ali bhai, you are talking about outliers. Outliers are values that differ markedly from the rest of the data. In other words, they stand apart from the typical data points.” 😌

Ali: “Outliers? You mean values that fall outside the normally expected range?”

Ahmed: “Exactly! Sometimes outliers occur naturally, and sometimes they come from incorrect input or measurement errors. Identifying and dealing with them is very important, because they can affect our analysis and models.”

Usman (the owner of the chai shop): “So if there are outliers, what should we do with them?” 🤔

Ahmed: “Usman bhai, that depends on the situation. Sometimes it is better to remove outliers, and sometimes we replace them with the median or the mean.”

Ali: “But Ahmed bhai, how do we know whether a value is an outlier or not?”

Ahmed: “Ali, there are many ways. Visual techniques include scatter plots, box plots, and so on. Among statistical methods, the z-score or the IQR (Interquartile Range) can also help us identify outliers.”

Usman: “I see. So if I look at my chai sales records and on some days sales are very high or very low, those could be outliers too?”

Ahmed: “Exactly, Usman bhai. But not every unusual value can be called an outlier. You have to analyze it.”

As the chat went on, Ahmed explained outliers thoroughly to his friends. And so another informative discussion wrapped up over a delightful round of chai. 😄📊

Outliers may be only a handful of values in our data, but their impact can be very large. So if you are new to data science, learn to understand outliers and deal with them properly. Because in data, even a small detail can matter a great deal! 🌟📈

How to Identify Outliers?

  1. Visual methods: Tools such as scatter plots, box plots, and histograms are excellent visual aids. In a box plot, for example, data points that fall outside the whiskers can be treated as outliers; the whiskers are conventionally drawn at 1.5 × IQR beyond the quartiles, so it helps to understand the IQR first.
  2. Statistical methods: The Z-score and the IQR (Interquartile Range) method are two popular statistical approaches. The Z-score tells you how many standard deviations a data point lies from the mean. Typically, a point is considered an outlier if its Z-score is > 3 or < -3. With the IQR method, points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are flagged.
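
As a rough sketch of the two statistical rules, the snippet below flags outliers in a small made-up list of student ages. The list, the function names, and the default thresholds of 3 and 1.5 are illustrative assumptions, not part of the Titanic example further down. It uses only Python's standard library.

```python
import statistics

# Hypothetical ages of a high school class, with one suspicious entry (85)
ages = [15, 16, 17, 16, 15, 17, 16, 15, 16, 17,
        16, 15, 17, 16, 16, 15, 17, 16, 15, 85]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs((v - mean) / std) > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


print(zscore_outliers(ages))  # [85]
print(iqr_outliers(ages))     # [85]
```

On very small samples the Z-score rule can miss an obvious outlier, because the outlier itself inflates the mean and the standard deviation. The IQR rule is less affected by this.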

How to Deal with Them? 🛠

  1. Truncation or Capping: For high outliers, any value above a chosen threshold can be set to a maximum cap. Likewise, for low outliers, values below a threshold can be set to a minimum cap.
  2. Transformation: Sometimes mathematical transformations such as logarithms can bring outliers under control.
  3. Imputation: Replace the outlier with a measure of central tendency such as the mean, median, or mode.
  4. Deletion: If the outlier is caused by a data entry error, or it is clear that it will not add value, it is better to remove it.
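
The first three options can be sketched in a few lines. This is only an illustration on made-up daily chai sales: the thresholds `LOWER` and `UPPER` and the helper names are assumptions chosen for the example, and in practice the bounds might come from the IQR rule or from domain knowledge.

```python
import math
import statistics

# Hypothetical daily chai sales (cups); 900 and 5 look suspicious
sales = [120, 130, 125, 128, 122, 900, 127, 5]

# Assumed thresholds for this example only
LOWER, UPPER = 100, 150


def cap(values, lower, upper):
    """Capping: pull values outside [lower, upper] back to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


def impute_median(values, lower, upper):
    """Imputation: replace out-of-range values with the median of the in-range ones."""
    typical = statistics.median(v for v in values if lower <= v <= upper)
    return [v if lower <= v <= upper else typical for v in values]


def log_transform(values):
    """Transformation: log10 compresses large values (requires values > 0)."""
    return [math.log10(v) for v in values]


print(cap(sales, LOWER, UPPER))            # [120, 130, 125, 128, 122, 150, 127, 100]
print(impute_median(sales, LOWER, UPPER))  # [120, 130, 125, 128, 122, 126.0, 127, 126.0]
print([round(x, 2) for x in log_transform(sales)])
```

Capping and imputation change only the flagged values, while the log transform rescales every value in the column.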

Python code to remove outliers

import seaborn as sns
import pandas as pd

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Display the first few rows of the dataset
print(titanic.head())

# Calculate the IQR for the 'age' column
Q1 = titanic['age'].quantile(0.25)
Q3 = titanic['age'].quantile(0.75)
IQR = Q3 - Q1

# Define bounds for the outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

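# Remove outliers. Rows where 'age' is missing (NaN) are dropped as well,
# because comparisons with NaN evaluate to False.
titanic_no_outliers = titanic[(titanic['age'] >= lower_bound) & (titanic['age'] <= upper_bound)]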

# Display the first few rows of the dataset without outliers
print(titanic_no_outliers.head())

⚠️ The Impact of Ignoring Outliers

  1. Skewed Analysis: Outliers can skew both descriptive and inferential statistics, giving a distorted view of the data.
  2. Impact on Machine Learning Models: Algorithms, especially linear models, are sensitive to outliers. Outliers can drastically affect coefficient estimates and predictions.
  3. Misleading Assumptions: Data assumptions such as normality can be violated by the presence of outliers, leading to incorrect results.
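
A small numeric check makes the first point concrete. The sketch below uses a made-up list of ages (an assumption for illustration) and compares summary statistics with and without a single extreme value.

```python
import statistics

# Hypothetical ages of 19 students, then the same list plus one entry of 85
clean = [15, 16, 17, 16, 15, 17, 16, 15, 16, 17,
         16, 15, 17, 16, 16, 15, 17, 16, 15]
with_outlier = clean + [85]

for label, data in [("without outlier", clean), ("with outlier", with_outlier)]:
    print(
        f"{label}: mean={statistics.mean(data):.2f}, "
        f"median={statistics.median(data)}, "
        f"stdev={statistics.stdev(data):.2f}"
    )
```

The mean and standard deviation shift noticeably while the median stays at 16, which is why median-based (robust) statistics are often preferred when outliers are present.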

Final Word:

Outliers, troublesome as they are, are an important part of data analysis. They carry essential information about anomalies, unique events, or errors in data collection. Handling outliers correctly helps make an analysis robust and reliable. So the next time you spot these uninvited guests in your dataset, you will know exactly what to do! 🌟

Also read the blogs on Data Wrangling Tips and on Missing Values k Rolay.

47 Comments.

  1. First, identify whether the dataset has outliers or not, then choose the method to fix them depending on the data or requirements.

  2. You explained it wonderfully, sir.

    An outlier is like that very distant relative who sits a little apart from the rest of us.
    Some of its types are:
    contextual, collective, global.

    Catch it by the ear using a visualization method, and if it is a useful fellow, replace it with the mean or median or transform it;
    otherwise pick it up and throw it out.

  3. # What are outliers?
    – data that differs from the rest of the dataset

    – Outliers in Pandas refer to data points that stand out from the majority of the data in a DataFrame. They are exceptionally high or low values in a column. Detecting outliers is important in data analysis. You can find outliers using methods like box plots, Z-scores, or the Interquartile Range (IQR) method. Outliers can impact statistical analyses and machine learning models, so it’s essential to handle them appropriately, such as by removing, transforming, or addressing them depending on your data analysis goals.

    # How many types of outliers are there?
    – there are two major types of outliers
    – Univariate outliers
    – Multivariate outliers

    ### Univariate outliers
    – uni means `one`
    – variate means `variable`\
    **it means `one variable`**\
    ***for example***\
    `let's suppose: titanic['age']` in this example we check the Titanic data; outliers in a single column are known as univariate outliers

    ### Multivariate outliers
    – multi means `multiple`
    – variate means `variable`\
    **it means `multiple variables`**\
    `let's suppose: titanic[['age', 'fare', 'MEAL']]` in this example we check the Titanic data; outliers across multiple columns are known as multivariate outliers

    # other types
    1. Global outliers –> `a single data point that is abnormal compared with the whole dataset (a point anomaly)`

    2. Contextual outliers –> `A contextual outlier is a data point that is unusual within a specific context or scenario, even if it’s not statistically extreme in the overall dataset.`

    3. Collective outliers –> `Collective outliers, also known as group outliers, refer to a set of data points that deviate significantly from the norm when considered together as a group, even though individual data points may not be extreme outliers. They are identified by analyzing the collective behavior of data subsets rather than individual values.`

    # How do we identify outliers? IQR method | Handling outliers in DS

    ### How to identify outliers?
    – Visually –> by generating a plot
    – `boxplot`
    – `histogram`

    **Note**: `quartile 1 = the 25% mark`, `quartile 2 = the next 25%, i.e. the 50% mark`, `quartile 3 = the next 25%, i.e. the 75% mark`, `quartile 4 = the next 25%, i.e. the 100% mark`
    Quartiles are also written `Q1, Q2, Q3, Q4`

    ***Important***: the range from Q1 to Q3 is called the IQR [`Inter Quartile Range`].
    ***Formula of the IQR***: IQR = Q3 - Q1

    # Z-Score method
    Note : `Z-Score is the simplest way of finding Outliers`\
    **what is Z-score**
    – In Python using Pandas, a Z-score is a way to standardize or normalize data to see how far each data point is from the mean (average) in terms of standard deviations. Here’s how you can calculate Z-scores for a column in a Pandas DataFrame in a simple way

    1. First, import Pandas:\
    `import pandas as pd`

    2. Calculate the Z-scores for the ‘data’ column and store them in a new column ‘Z-score’:\
    `df['Z-score'] = (df['data'] - df['data'].mean()) / df['data'].std()`

    ***In this code:***\
    – df[‘data’].mean() calculates the mean (average) of the ‘data’ column.
    – df[‘data’].std() calculates the standard deviation of the ‘data’ column.
    – Subtracting the mean from each value and dividing by the standard deviation gives you the Z-score for each data point.

    Now, the ‘Z-score’ column in your DataFrame ‘df’ will contain the Z-scores for the ‘data’ column, which helps you understand how each data point relates to the mean in terms of standard deviations.

    # Handling Outliers in dataset

    **How to deal with outliers?**\

    – Remove all outliers
    – Transform them
    – Impute outliers = `replace them with a typical value, the same way you would fill missing values in a pandas dataset`
    – Use a robust ML model

  4. types of outliers:
    1-collective
    2-global
    3-contextual
    detect by:
    1-visual inspection
    2-plotting
    removing outliers:
    1-deleting
    2-transformation
    3-imputing
    4-using robust models

  5. OUTLIERS are extreme values within the data.
    TYPES OF OUTLIERS:
    1- Global Outlier: An extreme value according to the whole data set.
    2- Contextual Outlier: Any data point that is considered an extreme value within the context of that data set.
    3- Collective outliers: A group of data points that deviates from the rest of the data when considered together, even if the individual points are not extreme on their own.
    Detecting Outliers: There are three methods to detect Outliers.
    1- By Visualizing (Box plot or Histogram)
    2- By IQR method
    3- By z-score
    Handling Outliers:
    1- Removing or deleting the row
    2- Impute using Mean, Median, and Mode.
    3-Transform the data like taking log10 of the column.
    4- Use ML, robust model.

  6. Outliers are those data points which significantly deviate from the rest of the data.
    How to detect: visually, Z-score, IQR method.
    How to handle: depends on the scenario: delete them, impute them, take the logarithm, or use a robust ML algorithm.

    1. Outliers in data are data points that significantly differ from the majority of the data. They can be understood as extreme values that lie far from the central distribution of the dataset. One common method to identify outliers is through visualization using techniques like box plots or scatter plots. Outliers may also be detected mathematically by calculating the interquartile range (IQR) and considering data points that fall below Q1 – 1.5 * IQR or above Q3 + 1.5 * IQR as potential outliers. These extreme values can have a notable impact on data analysis, so it’s important to investigate and handle them appropriately, as they may signify errors or anomalies in the data, or, in some cases, provide valuable insights into unique or unexpected patterns.

  7. Outliers are those values that create confusion in our data and lead it in the wrong direction; they may spoil all the predictions in future analysis. So, beware of outliers...

  8. An outlier is an observation or data point that significantly differs from the rest of the data in a dataset. Outliers can be either exceptionally high or exceptionally low values in a dataset and can distort statistical analyses and data visualization. Identifying and handling outliers is important in data analysis and statistics.

    There are several methods for identifying outliers, including:

    Z-Score or Standard Score Method: This method involves calculating the z-score for each data point. Z-scores measure how many standard deviations an individual data point is from the mean. Data points with high absolute z-scores (typically greater than 2 or 3) are considered outliers.

    IQR (Interquartile Range) Method: This method involves calculating the IQR, which is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. Data points that fall below the first quartile minus 1.5 times the IQR or above the third quartile plus 1.5 times the IQR are considered outliers.

    Visual Inspection: Sometimes, you can identify outliers by simply visualizing your data with plots such as box plots, scatter plots, or histograms. Outliers are often points that are far away from the bulk of the data.

    Domain Knowledge: In some cases, outliers may be valid data points, so it’s essential to consider domain knowledge when determining whether an observation is genuinely an outlier.

    Once you’ve identified outliers, you have several options for handling them:

    Remove the Outliers: You can choose to remove the outliers from your dataset. This can be a good option if you are confident that the outliers are due to errors or unusual circumstances. However, removing outliers may also result in a loss of valuable information.

    Transform the Data: You can apply data transformations to mitigate the impact of outliers. Common transformations include logarithmic transformations, square root transformations, or using robust statistical methods.

    Impute or Replace Outliers: Instead of removing outliers, you can impute or replace them with more typical values. You can use methods like median imputation, mean imputation, or replacing them with a predefined threshold value.

    Segment Data: Another approach is to segment your data into subsets, separating the outliers from the main dataset. You can then analyze the two subsets separately.

    Use Robust Statistics: Robust statistical methods are designed to be less influenced by outliers. For example, instead of using the mean, you can use the median or other robust estimators.

    The choice of how to handle outliers depends on the nature of the data and the goals of your analysis. It’s important to carefully consider the implications of your approach and be transparent about the methods used to handle outliers in your analysis. Additionally, domain knowledge and the context of the data should always be taken into account when deciding how to handle outliers.

  9. ## Summary
    ——————–
    Outliers: Outliers are data points in a dataset that deviate significantly from the rest of the observations.
    ——————–
    Type of Outliers:
    1. Global Outlier
    2. Contextual Outlier
    3. Collective Outlier
    ——————–
    How to detect them?
    1. Visually (Boxplot, Scatterplot, Histogram)
    2. Interquartile range (IQR) Method
    3. Z-Score method
    ——————–
    How to deal with them?
    1. Delete them
    2. Transform them (log transformation, etc.)
    3. Impute them (mean, median, mode, etc.)
    4. Use Truncation or Capping
    5. Use robust models (But it will limit the models we can use)

  10. *Outliers: Observations in a data set that deviate significantly from the others.
    – Be careful! Analyze the data before removing outliers, because not every deviating value is an outlier.
    – It is very important to identify and handle outliers because they can ultimately have a great impact on our machine learning models.
    – IQR: interquartile range, the distance from Quartile 1 to Quartile 3.
    – We can see outliers by drawing plots like a boxplot, scatterplot, or histogram, which help us identify them easily.
    – We can handle outliers by imputing the mean, median, or mode, or through a transformation like np.log10().
    – We can delete outliers by using pandas.

  11. Fully understandable, and now I know what outliers actually are and how to detect and then handle them. JAZAKALLAH

  12. This blog explains that outliers are data points that differ from the rest of the data. They are recognized by visual and statistical methods: the visual methods include the boxplot and scatterplot, while the statistical methods include the z-score and IQR. These outliers can disrupt the whole dataset.

  13. Excellent sir! Keep it up.
    Data points which are out of range or unexpected values are called outliers.
    It is a must to remove them, otherwise there will be a lot of deviations in our insights.

  14. 1. An outlier is a data point that is significantly different from the other data points in a dataset.
    2. We can identify outliers by plotting a boxplot or a histogram. In these plots, points that lie
    at a distance from the rest of the data are identified as outliers.
    3. We can also identify outliers with the help of the IQR method and the Z-score method.
    4. How we deal with outliers depends on their type: first we identify the outliers, then we deal with them by
    (1) removing them, (2) transforming them, (3) imputing them with the mean, median, or mode, or (4) using a robust ML model.

  15. ## outliers
    types of outliers
    1. univariate
    2. multivariate

    kinds of outliers
    1. global
    2. contextual
    3. collective

    how to identify outliers
    1. visually/plotting
    2. IQR method
    3. z-score

    how to remove outliers
    1. Truncation or Capping
    2. Transformation
    3. Imputation
    4. Deletion
    5. use robust ml models

  16. MashaAllah, explained superbly, and I also got 12 points from it.

  17. I once asked a very famous AI teacher for blogs like these, saying, sir, it would make things easier for us, but he flatly refused and said, brother, do some of the work yourself, use ChatGPT, etc. (and that teacher is very sincere too). But value these blogs; believe me, sir is doing half of our share of the hard work himself. I am saying this on the basis of my previous experience.

  18. What I learned about outliers from this blog is that a value that produces nothing but falsehood in your data is called an outlier. There is a famous saying: ‘by the time the truth comes out, the lie will have razed village after village’. So outliers are what sink our analysis.

  19. It explains the outlier really well, along with its identification methods and how to remove it. Moreover, the impact of not removing outliers is explained briefly but comprehensively. To remember how to check for outliers visually, I will use the term “BHS”, meaning Boxplot, Histogram and Scatter Plot, and for statistical analysis “IQRAZ”, meaning “Interquartile Range AND Z-Score”. Thanks for such a nice blog.
