When we set out to understand data and draw insights from it, some elements stand apart from the rest of the data. We call these 'anomalies' or 'outliers'. In this chapter we look at how to recognize these anomalies, what effect they have, and how to handle them.
Outliers are data points that differ substantially from the rest of the data set.
Example: If the temperature in your city usually stays between 20°C and 35°C, a day at 50°C is an outlier.
Importance: Identifying outliers matters because they can signal an error in data collection or an unusual event.
Outliers are also known as: 1. Aberrant observations 2. Deviants 3. Outlying cases 4. Anomalous points 5. Abnormalities
6.1.1 Types of Outliers
Outliers can be classified into nine types:
Univariate: Outliers in a single variable. For example, if your data contains only an age variable, unusual age values are univariate outliers.
Multivariate: Outliers that show up only when two or more variables are considered together. For example, with age and income, a 12-year-old with a very high income is a multivariate outlier even though each value may look plausible on its own.
Global: Points that lie far from the entire data set. For example, an age of 150 in a data set of human ages.
Local: Points that look unusual only relative to their neighbors or their own cluster. For example, an income that is ordinary across the whole data set but far above everyone else's in the same age group.
Point: A single observation that is anomalous on its own. For example, one transaction that is far larger than every other transaction.
Contextual: Points that are unusual only in a particular context, such as time or place. For example, 30°C is normal in summer but an outlier in winter.
Collective: A group of points that is anomalous together, even if each point looks normal individually. For example, a sensor that reports exactly the same value for hours.
Recurrent: Anomalies that keep coming back, though not necessarily at fixed intervals. For example, the same faulty sensor producing spikes again and again.
Periodic: Anomalies that repeat at regular intervals. For example, a spike in sales every year on the same holiday.
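The difference between a global and a contextual outlier is easier to see in code. The sketch below uses made-up temperature readings and an assumed threshold of 2 (lowered because each season has only six values); the column names and the `zscore` helper are illustrative, not part of any library.

```python
import pandas as pd

# Hypothetical daily temperatures (°C); the 30 in winter is the contextual outlier
df = pd.DataFrame({
    "season": ["winter"] * 6 + ["summer"] * 6,
    "temp": [5, 6, 7, 5, 6, 30, 28, 30, 31, 29, 30, 32],
})


def zscore(s):
    """Population z-score of every value in a Series."""
    return (s - s.mean()) / s.std(ddof=0)


# Global view: compare every reading with the whole data set
df["z_global"] = zscore(df["temp"])

# Contextual view: compare every reading only with its own season
df["z_season"] = df.groupby("season")["temp"].transform(zscore)

threshold = 2  # assumed cut-off for this tiny example
global_outliers = df[df["z_global"].abs() > threshold]
contextual_outliers = df[df["z_season"].abs() > threshold]

print("Global outliers:\n", global_outliers)
print("Contextual outliers:\n", contextual_outliers)
```

Looked at globally, 30°C sits comfortably among the summer readings and nothing is flagged; compared only with other winter days, the same reading stands out.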
6.1.2 Causes of Outliers
Outliers can arise for many reasons. Some common causes are listed below:
Data Entry Errors: Human mistakes while entering data can create outliers.
For example, if your data has an age variable and someone types 1000 instead of 100, that value is an outlier.
Measurement Errors: A faulty or badly calibrated instrument, or a mistake while taking the reading, can create outliers.
For example, if your data has a height variable and a height of 5 feet is recorded as 50 feet, that value is an outlier.
Experimental Errors: Mistakes in how an experiment is set up or carried out can create outliers.
For example, if your data has a weight variable and a weight of 50 kg is recorded as 500 kg because the scale was not reset, that value is an outlier.
Intentional Outliers: Someone may add extreme values to the data on purpose.
For example, a survey respondent deliberately reports an age of 1000, or extreme dummy values are injected to test a detection method.
Data Processing Errors: Mistakes made while cleaning, transforming, or merging data can create outliers.
For example, a unit conversion applied twice turns a height of 170 cm into 17000.
Sampling Errors: Outliers can appear when the sample includes observations from outside the population being studied.
For example, a few professional basketball players end up in a sample meant to represent the heights of school students.
Natural Outliers: Some outliers are genuine values produced by natural variation or rare events.
For example, a person who really is 110 years old in an age data set, or a genuine 50°C day during an extreme heatwave.
6.1.3 Why should we care about Outliers?
Hidden Clues: Outliers can carry hidden clues. Identifying them can reveal a pattern that would otherwise go unnoticed.
Data Quality: Outliers caused by errors lower data quality. Identifying them lets us improve it.
Impact Analysis: Outliers can introduce errors into our analysis. Identifying them makes the analysis more reliable.
Better Decisions: Outliers also affect the decisions we base on the data. Identifying them helps us make better decisions.
Better Models: Outliers can reduce the accuracy of our models. Identifying them helps us build better models.
Better Insights: Outliers also distort our insights. Identifying them helps us draw better insights.
Better Visualization: Outliers can degrade our visualizations, for example by stretching the axis so that the rest of the data is squeezed together. Identifying them helps us build better visualizations.
Better Storytelling: Outliers also affect the story we tell with data. Identifying them helps us tell better stories.
Better Data Products: Outliers can lower the quality of our data products. Identifying them helps us build better data products.
Better Data Science: Outliers can lower the quality of our data science work as a whole. Identifying them helps us do better data science.
6.1.4 Detect and remove Outliers
To identify outliers, we use a set of techniques known as 'Outlier Detection Techniques'. Some of them are listed below:
Z-Score
IQR
DBSCAN
Isolation Forest
Local Outlier Factor
Elliptic Envelope
One-Class SVM
Mahalanobis Distance
Robust Random Cut Forest
Histogram-based Outlier Score
K-Nearest Neighbors
K-Means Clustering
Local Correlation Integral
and many more…
We will look only at Z-Score, IQR, and K-Means clustering.
6.1.5 Z-Score Method
In the Z-Score method, we measure how many standard deviations (SD) a data point lies from the mean.
The Z-Score formula is: \[Z = \frac{x - \mu}{\sigma}\]
Where:
\(Z\): is the Z-Score
\(x\): is the data point
\(\mu\): is the mean of the data
\(\sigma\): is the standard deviation of the data
\(x - \mu\): is the difference between the data point and the mean
\(\frac{x - \mu}{\sigma}\): is the difference between the data point and the mean in terms of standard deviations
The properties of the Z-Score are:
1. The Z-Scores of a data set have a mean of 0 and a standard deviation of 1.
2. The larger the absolute value of the Z-Score, the farther the data point is from the mean.
3. The smaller the absolute value of the Z-Score, the closer the data point is to the mean.
4. As a common rule of thumb, a data point with a Z-Score greater than 3 or less than -3 is treated as an outlier.
To interpret Z-Score values, see the table below (using the threshold of 3 given above):

| Z-Score | Data Point | Interpretation |
| --- | --- | --- |
| -3 | 3 SDs below the mean | Borderline (an outlier if below -3) |
| -2 | 2 SDs below the mean | Not an outlier |
| -1 | 1 SD below the mean | Not an outlier |
| 0 | Mean | Not an outlier |
| 1 | 1 SD above the mean | Not an outlier |
| 2 | 2 SDs above the mean | Not an outlier |
| 3 | 3 SDs above the mean | Borderline (an outlier if above 3) |
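To make the rule concrete, here is a small worked example with assumed numbers: a mean of 25, a standard deviation of 5, and a data point of 42. None of these values come from the chapter's data sets.

```latex
\[
Z = \frac{x - \mu}{\sigma} = \frac{42 - 25}{5} = 3.4
\]
```

Since 3.4 is greater than 3, this point would be flagged as an outlier under the usual threshold.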
6.1.5.1 Z-Score Method Example in Python
To implement the Z-Score method in Python, follow the steps below:
6.1.5.1.1 Using numpy
Run the code below to see the steps.

```python
# Step 1: Import the required libraries
import pandas as pd
import numpy as np

# Step 2: Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]})

# Step 3: Calculate the mean and the (population) standard deviation
mean = np.mean(data['Age'])
std = np.std(data['Age'])

# Step 4: Calculate the Z-Score
data['Z-Score'] = (data['Age'] - mean) / std

# Step 5: Print the data
print("----------------------------------------")
print(f"Here is the data with outliers:\n{data}")
print("----------------------------------------")

# Step 6: Print the outliers (the absolute value catches both tails)
print(f"Here are the outliers based on the z-score threshold, 3:\n{data[data['Z-Score'].abs() > 3]}")
print("----------------------------------------")

# Step 7: Remove the outliers
data = data[data['Z-Score'].abs() <= 3]

# Step 8: Print the data without outliers
print(f"Here is the data without outliers:\n{data}")
```
----------------------------------------
Here is the data with outliers:
Age Z-Score
0 20 -0.938954
1 21 -0.806396
2 22 -0.673838
3 23 -0.541280
4 24 -0.408721
5 25 -0.276163
6 26 -0.143605
7 27 -0.011047
8 28 0.121512
9 29 0.254070
10 30 0.386628
11 50 3.037793
----------------------------------------
Here are the outliers based on the z-score threshold, 3:
Age Z-Score
11 50 3.037793
----------------------------------------
Here is the data without outliers:
Age Z-Score
0 20 -0.938954
1 21 -0.806396
2 22 -0.673838
3 23 -0.541280
4 24 -0.408721
5 25 -0.276163
6 26 -0.143605
7 27 -0.011047
8 28 0.121512
9 29 0.254070
10 30 0.386628
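One detail in the example above deserves a closer look: the result depends on which standard deviation is used. The sketch below, which assumes the same twelve ages, compares the population SD (`ddof=0`, used in the example above) with the sample SD (`ddof=1`, the pandas default).

```python
import pandas as pd

ages = pd.Series([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50])

# Population standard deviation (divide by n), as in the example above
z_population = (ages - ages.mean()) / ages.std(ddof=0)

# Sample standard deviation (divide by n - 1), the pandas default
z_sample = (ages - ages.mean()) / ages.std(ddof=1)

print(f"Z-score of 50 with ddof=0: {z_population.iloc[-1]:.3f}")
print(f"Z-score of 50 with ddof=1: {z_sample.iloc[-1]:.3f}")
print("Flagged with ddof=0:", ages[z_population.abs() > 3].tolist())
print("Flagged with ddof=1:", ages[z_sample.abs() > 3].tolist())
```

With the sample SD the Z-Score of 50 drops just below 3, so the same point is no longer flagged. On small data sets it is worth stating which convention you use.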
6.1.5.1.2 Using scipy library
You can also follow the steps below to implement the Z-Score method in Python, using scipy library:
Run the code below to see the steps.

```python
# Import libraries
import numpy as np
from scipy import stats

# Sample data
data = [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 110.0]

# Calculate the absolute Z-score for each data point
z_scores = np.abs(stats.zscore(data))

# Set a threshold for identifying outliers
threshold = 2.5
outliers = np.where(z_scores > threshold)[0]

# Print the data
print("----------------------------------------")
print("Data:", data)
print("----------------------------------------")
print("Indices of Outliers:", outliers)
print("Outliers:", [data[i] for i in outliers])

# Remove outliers
data = [data[i] for i in range(len(data)) if i not in outliers]
print("----------------------------------------")
print("Data without outliers:", data)
```
----------------------------------------
Data: [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 110.0]
----------------------------------------
Indices of Outliers: [9]
Outliers: [110.0]
----------------------------------------
Data without outliers: [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0]
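The example above uses a threshold of 2.5 rather than 3, and there is a reason a lower value is needed here. With the population standard deviation, no Z-Score in a sample of n points can exceed the square root of n - 1 in absolute value, so with ten points the limit is exactly 3. The following sketch checks this on the same data; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

data = [2.5, 2.7, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 110.0]
n = len(data)

# Largest |z| any single point can reach when the population SD (ddof=0) is used
max_possible_z = np.sqrt(n - 1)

z_scores = np.abs(stats.zscore(data))  # scipy uses ddof=0 by default
largest_z = z_scores.max()

print(f"n = {n}, upper bound on |z| = {max_possible_z:.3f}")
print(f"Largest observed |z| = {largest_z:.4f}")
print("Any point flagged at threshold 3.0:", bool((z_scores > 3.0).any()))
print("Any point flagged at threshold 2.5:", bool((z_scores > 2.5).any()))
```

Even a value as extreme as 110 cannot cross a threshold of 3 in a ten-point sample, which is why small data sets call for a lower threshold or a different method such as IQR.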
6.1.6 IQR Method
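In the IQR method, we check whether a data point falls outside the range covered by the middle 50% of the data, extended by 1.5 IQRs on each side. Any point below \(Q_1 - 1.5 \times IQR\) or above \(Q_3 + 1.5 \times IQR\) is treated as an outlier.
The IQR formula is:
\[IQR = Q_3 - Q_1\]
Where:
\(IQR\): is the Interquartile Range
\(Q_3\): is the third quartile
\(Q_1\): is the first quartile
\(Q_3 - Q_1\): is the difference between the third quartile and the first quartile
The properties of the IQR are:
The IQR measures the spread of the middle 50% of the data.
The larger the IQR, the more spread out the central values are; the smaller the IQR, the more tightly they cluster around the median.
Because it depends only on the quartiles, the IQR is barely affected by extreme values, which makes it a robust basis for outlier detection.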
6.1.6.1 IQR Method Example in Python
To implement the IQR method in Python, follow the steps below:
6.1.6.1.1 Using numpy
Run the code below to see the steps.

```python
# Step 1: Import the required libraries
import pandas as pd
import numpy as np

# Step 2: Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50]})

# Step 3: Calculate the first and third quartile
# (NumPy 1.22+ uses `method`; older versions call this argument `interpolation`)
Q1 = np.percentile(data['Age'], 25, method='midpoint')
Q3 = np.percentile(data['Age'], 75, method='midpoint')

# Step 4: Calculate the IQR
IQR = Q3 - Q1

# Step 5: Calculate the lower and upper bound
lower_bound = Q1 - (1.5 * IQR)
upper_bound = Q3 + (1.5 * IQR)

# Step 6: Print the data
print("----------------------------------------")
print(f"Here is the data with outliers:\n{data}")
print("----------------------------------------")

# Step 7: Print the outliers
print(f"Here are the outliers based on the IQR threshold:\n{data[(data['Age'] < lower_bound) | (data['Age'] > upper_bound)]}")
print("----------------------------------------")

# Step 8: Remove the outliers
data = data[(data['Age'] >= lower_bound) & (data['Age'] <= upper_bound)]

# Step 9: Print the data without outliers
print(f"Here is the data without outliers:\n{data}")
```
----------------------------------------
Here is the data with outliers:
Age
0 20
1 21
2 22
3 23
4 24
5 25
6 26
7 27
8 28
9 29
10 30
11 50
----------------------------------------
Here are the outliers based on the IQR threshold:
Age
11 50
----------------------------------------
Here is the data without outliers:
Age
0 20
1 21
2 22
3 23
4 24
5 25
6 26
7 27
8 28
9 29
10 30
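The bounds behind this result can be worked out by hand. For the twelve ages above, the midpoint quartiles are \(Q_1 = 22.5\) and \(Q_3 = 28.5\), so:

```latex
\[
IQR = Q_3 - Q_1 = 28.5 - 22.5 = 6
\]
\[
\text{lower bound} = Q_1 - 1.5 \times IQR = 22.5 - 9 = 13.5
\]
\[
\text{upper bound} = Q_3 + 1.5 \times IQR = 28.5 + 9 = 37.5
\]
```

The age 50 lies above 37.5, so it is the only value flagged; every other age falls between 13.5 and 37.5.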
6.1.7 Clustering Method (K-Means)
In the clustering method, we divide the data points into clusters. This can be done with the K-Means clustering algorithm: we specify the number of clusters, assign each data point to a cluster, compute each point's distance from the centroid of its cluster, and then remove the points that lie farthest from their centroid.
Use the code below to see the steps.

```python
# Import library
from sklearn.cluster import KMeans

# Sample data
data = [[2, 2], [3, 3], [3, 4], [30, 30], [31, 31], [32, 32]]

# Create a K-means model with two clusters (normal and outlier)
# random_state is fixed so the result is repeatable
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(data)

# Predict cluster labels
labels = kmeans.predict(data)

# Identify outliers based on cluster labels
# Note: this simplified example treats every point in cluster 1 as an outlier;
# K-Means numbers its clusters arbitrarily, so check which group received label 1
outliers = [data[i] for i, label in enumerate(labels) if label == 1]

# Print data
print("Data:", data)
print("Outliers:", outliers)

# Remove outliers
data = [data[i] for i, label in enumerate(labels) if label == 0]
print("Data without outliers:", data)
```
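The description above talks about distance from the cluster centroid, so here is a minimal sketch of that version. The data (two tight groups plus one stray point) and the 90th-percentile cut-off are assumptions chosen for illustration, not a recommended default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data: two tight groups plus one stray point at [45, 45]
X = np.array([[2, 2], [3, 3], [3, 4], [2, 3],
              [30, 30], [31, 31], [32, 32], [31, 30],
              [45, 45]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Distance of every point from the centroid of its own cluster
centroids = kmeans.cluster_centers_[labels]
distances = np.linalg.norm(X - centroids, axis=1)

# Assumed rule: flag points beyond the 90th percentile of these distances
cutoff = np.percentile(distances, 90)
is_outlier = distances > cutoff

print("Outliers:", X[is_outlier].tolist())
print("Data without outliers:", X[~is_outlier].tolist())
```

Unlike the label-based version, this one does not depend on how K-Means happens to number its clusters, because it looks only at how far each point sits from its own centroid.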
To handle outliers, we use a set of techniques known as 'Outlier Handling Techniques'. Some of them are listed below:
Removing the outlier: This is the most common method, where all detected outliers are removed from the dataset.
Transforming and binning values: Outliers can be transformed to bring them within a range. Techniques like log transformation or square root transformation can be used.
Imputation: Outliers can also be replaced with mean, median, or mode values.
Separate treatment: In some use cases, it is beneficial to treat outliers separately rather than removing or imputing them.
Robust Statistical Methods: Some statistical methods for analyzing and modeling data are less sensitive to outliers and therefore give more reliable results when outliers are present.
I have demonstrated the first of these techniques in the sections above, where we removed the outliers found with the Z-Score, IQR, and K-Means clustering methods. You can practice the other techniques on your own.
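As a starting point for that practice, here is a minimal sketch of two alternatives to removal: capping outliers at the IQR bounds and replacing them with the median. It reuses the ages from the earlier examples; note that it relies on pandas' default (linear) quartile calculation, so the bounds differ slightly from the midpoint quartiles used earlier.

```python
import pandas as pd

ages = pd.Series([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 50], name="Age")

# IQR bounds (pandas' default linear interpolation for the quartiles)
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (ages < lower) | (ages > upper)

# Option 1: cap (clip) outliers at the bounds instead of dropping them
capped = ages.clip(lower=lower, upper=upper)

# Option 2: replace outliers with the median of the column
imputed = ages.mask(is_outlier, ages.median())

print(f"Bounds: [{lower}, {upper}]")
print("Capped: ", capped.tolist())
print("Imputed:", imputed.tolist())
```

Both options keep all twelve rows, which matters when every record is needed; which one is appropriate depends on whether the extreme value is an error or a genuine observation.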
6.1.9 Conclusion
Outliers in a dataset are observations that deviate dramatically from the rest of the data points. They might arise as a result of data gathering mistakes or abnormalities, or they can be real findings that are just infrequent or extraordinary.
If outliers are not appropriately accounted for, they might produce misleading, inconsistent, and erroneous findings. As a result, identifying and dealing with outliers is critical in order to produce accurate and useful data analysis findings.
Outliers may be detected using a variety of methods, including the percentile approach, the IQR method, and the z-score method. They can be dealt with in a variety of ways, including removal, transformation, and imputation.
6.2 Missing Values
How Should Missing Values Be Handled, and Why Is It Necessary to Handle Them? Their Role in the World of Data Science 🤔🛠️
Every data scientist or researcher who has dealt with missing values knows how much they matter and how much trouble they bring. In the world of Data Science, running into missing values is a regular experience. If some of you are lucky enough never to have faced this problem, you really are fortunate! 😄 But those who do face it have no difficulty understanding how many problems missing values can create.
6.2.1 A Job, Missing Values, and One Big Mistake 😢
Ahmed was a talented data scientist at Codanics Solutions, a well-known company in Lahore. He always gave his projects top priority, and for that he was highly respected in the company. 🌟
One day, Ahmed was having lunch with his friends. 🍛
Ali (another data scientist): “Ahmed bhai! I hear you've got a new project?”
Ahmed: “Yes, Ali. I have to analyze customers' buying habits. But there are some missing values in the data, and I don't think it will be a problem if I just ignore them.” 😕
Ali: “Bhai, never ignore missing values. That small detail can ruin a model's performance.”
But Ahmed brushed off Ali's advice and started working his own way.
When the model was ready and was tested on real-world data, its predictions were not accurate at all. 😲 The company suffered a heavy loss because of it.
The CEO, Mr. Usman, called Ahmed into his office. 🏢
Mr. Usman: “Ahmed, we have lost a great deal on this project. What went wrong?”
Ahmed: “Sir, I thought a few missing values would not be a problem. But now I understand that I was wrong.” 😔
Mr. Usman: “Ahmed, you know that in data science even the smallest mistake can create a big problem. I am sorry, but we will have to let you go.”
Ahmed was deeply sorry. He realized that data should never be taken lightly. He went home and called Ali. 📞
Ahmed: “Ali, you were right. I have been fired from the company.”
Ali: “I'm sorry to hear that. But Ahmed, every mistake teaches us something. You will work in a better way from now on.”
Ahmed learned from his mistake and started paying special attention to missing values and data preprocessing. A few months later he started a job at another company, and there he proved that he was an expert data scientist. But he never forgot the lesson of that one mistake.
Now, if you too want to take a risk like Ahmed, feel free to ignore this blog before learning about missing values. But if you are interested, believe me, this blog is going to do wonders for your Data Science and AI journey. I know you must be wondering what is so special about it. So, Pola Paye, shall we start? Ready then, Bhola?
I know these are nicknames, but missing values have plenty of nicknames of their own. By the way, will you share your own nickname in the comments?
6.2.2 Nicknames of Missing Values
If you too are a product of desi culture, you probably have plenty of nicknames yourself, right? Achoo, Billa, Bhola, Pola, Saji, Kala, Chitta, Mota, Chota, Kaddu, and so on. It isn't just me saying this: look around anywhere and you will find such names, and some are even used with great respect, like Paye Kalay. In the same way, missing values go by quite a few names, and if you don't know them you will get confused. So let's take a look!
Missing values are called by different names depending on the context and on the domain or field under discussion. The names commonly used in Data Science and statistics are:
NA (Not Available)
NaN (Not a Number): Used especially in programming languages such as Python, for example in the pandas library.
Null: The term used in database management systems such as SQL.
Undefined
Blank or Empty
Placeholder Values: Sometimes default values are set that we can recognize as not being actual data, for example -1 or 999 in an age field.
Sentinel Values: These are also a kind of placeholder value, used to represent specific conditions.
Dummy Data: Used as a placeholder or for testing purposes.
Missing Data: The term generally used in research papers.
Some of these terms belong to specific situations or tools, while others are in general use. Whenever you analyze or preprocess data, it is essential to recognize these different kinds of missing values and handle them correctly.
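Placeholder and sentinel values are easy to overlook because pandas does not treat them as missing. The sketch below uses a made-up table in which -1, 999, an empty string, and the text "NA" all stand for missing data; which codes a real dataset uses is something you have to find out from its documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data in which -1 and 999 stand for "age not recorded"
raw = pd.DataFrame({"Age": [25, -1, 31, 999, 40],
                    "City": ["Lahore", "", "Karachi", "Multan", "NA"]})

# Turn the placeholder codes into real missing values
clean = raw.copy()
clean["Age"] = clean["Age"].replace([-1, 999], np.nan)
clean["City"] = clean["City"].replace(["", "NA"], np.nan)

print("Missing values before cleaning:")
print(raw.isnull().sum())
print("Missing values after cleaning:")
print(clean.isnull().sum())
```

Before the conversion, the raw table reports no missing values at all, and a mean of the age column would silently include 999.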
6.2.3 How to Identify Missing Values?
To identify missing values, we use a set of techniques known as 'Missing Value Detection Techniques'. Some of them are listed below:
Visual Inspection: Missing values are identified by visualizing the data.
Descriptive Statistics: Missing values are identified by counting them and calculating summary statistics for each column.
Missingno Library: Missing values are identified with the help of the missingno library.
6.2.3.1 Visual Inspection
In visual inspection, we plot the data to see where values are missing.
Use the code below to see the steps.

```python
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load titanic dataset
data = sns.load_dataset('titanic')

# Visualize the data: each highlighted cell marks a missing value
plt.figure(figsize=(8, 5))
sns.heatmap(data.isnull(), cbar=False)
plt.show()
```
6.2.3.2 Descriptive Statistics
With descriptive statistics, we count the missing values in each column and express them as a percentage of all rows.
Use the code below to see the steps.

```python
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns

# Load titanic dataset
data = sns.load_dataset('titanic')

# Calculate missing values
print("----------------------------------------")
print(f"Missing values in each column:\n{data.isnull().sum().sort_values(ascending=False)}")
print("----------------------------------------")
print(f"Percentage of missing values in each column:\n{round(data.isnull().sum() / len(data) * 100, 2).sort_values(ascending=False)}")
```
----------------------------------------
Missing values in each column:
deck 688
age 177
embarked 2
embark_town 2
survived 0
pclass 0
sex 0
sibsp 0
parch 0
fare 0
class 0
who 0
adult_male 0
alive 0
alone 0
dtype: int64
----------------------------------------
Percentage of missing values in each column:
deck 77.22
age 19.87
embarked 0.22
embark_town 0.22
survived 0.00
pclass 0.00
sex 0.00
sibsp 0.00
parch 0.00
fare 0.00
class 0.00
who 0.00
adult_male 0.00
alive 0.00
alone 0.00
dtype: float64
6.2.3.3 Missingno Library
We can also identify missing values with the missingno library.
Use the code below to see the steps.

```python
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns

# Load and display the titanic dataset
df = sns.load_dataset('titanic')
df
```
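
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S | Second | man | True | NaN | Southampton | no | True |
| 887 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S | First | woman | False | B | Southampton | yes | True |
| 888 | 0 | 3 | female | NaN | 1 | 2 | 23.4500 | S | Third | woman | False | NaN | Southampton | no | False |
| 889 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C | First | man | True | C | Cherbourg | yes | True |
| 890 | 0 | 3 | male | 32.0 | 0 | 0 | 7.7500 | Q | Third | man | True | NaN | Queenstown | no | True |

891 rows × 15 columns
Titanic Dataset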

```python
# Import libraries
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns

# Load titanic dataset
data = sns.load_dataset('titanic')

# Visualize the data: white gaps in a column mark missing values
msno.matrix(data, labels=True, fontsize=12, width_ratios=(2, 4), color=(0.2, 0.4, 0.6))
plt.show()
```
Deep Impact on Model Accuracy: 💔 Missing values reduce the accuracy of machine learning models and hurt their overall performance.
Doubts About Data Quality: 📉 Missing values weaken the quality of the data, which can lead to wrong conclusions in our analysis and decisions.
Longer Model Training Time: ⏱️ Sometimes missing values increase model training time, which wastes both resources and time.
6.2.5 Hold On, Have Some Patience
Missing values are common in any dataset, but the decision to remove a column or keep it depends on several factors:
Quantity of Data: If you have a lot of data and a specific column has a very high share of missing values (say, 70% or 80%), removing that column may be the better choice, because it is hard to get any benefit from it.
Importance of the Column: If the column with missing values is very important for your analysis or model, removing it is not a good idea. In that case you can impute the missing values instead.
Nature of Data: Sometimes the fact that a value is missing is itself informative. For example, if a survey question was left unanswered, it may indicate that the participant was not comfortable with that question. In such cases, simply dropping or replacing the missing value is not appropriate.
Model Sensitivity: Some machine learning models can handle missing values on their own, while others cannot. If your model is sensitive to missing values, you will have to handle them first.
Type of Data: In numeric data, missing values can be replaced with the mean, median, or mode. In categorical data, they can be replaced with the mode or with a specific category.
As a general guideline, if more than 50% of a column is missing, you should consider whether removing it would be better. But this is not a hard and fast rule. Every dataset is unique and has its own requirements, so you have to decide how to handle missing values in the context of each dataset.
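The 50% guideline and the idea that missingness can itself be informative can both be sketched in a few lines. The table below is made up (its column names only echo the Titanic data), and the 0.5 threshold is the rule of thumb from the text, not a fixed standard.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'deck' is mostly missing, 'age' only partly
df = pd.DataFrame({
    "age":  [22, np.nan, 26, 35, np.nan, 54, 2, 27],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan, "E", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 51.86, 21.08, 11.13],
})

# Share of missing values per column
missing_share = df.isnull().mean()

# Assumed rule of thumb from the text: drop columns with more than 50% missing
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = df.drop(columns=to_drop)

# Keep a flag so the information "age was missing" survives later imputation
reduced["age_was_missing"] = reduced["age"].isnull()

print(missing_share)
print("Dropped columns:", to_drop)
print(reduced)
```

The indicator column is one simple way to preserve the signal carried by a missing value before it is filled in.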
6.2.6 Detailed Ways to Handle Missing Values 🧐
6.2.6.1 Re-acquiring the Data from the Original Source 🔄
If you still have access to the source the data came from, you can go back and obtain the missing values from there.
6.2.6.2 Imputing with the Mean, Median, or Mode 📊
If you have numerical data, its missing values are replaced with the mean or the median. For categorical data, the mode is used.
Use the following code to see the steps to fill missing values with the mean, median, or mode in Python:
1. Mean

```python
# Import libraries
import pandas as pd
import numpy as np

# Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, np.nan, 50]})

# Print the data with missing value
print("----------------------------------------")
print(f"Here is the data with missing value:\n{data}")

# Calculate the mean (NaN values are skipped by default)
mean = data['Age'].mean()

# Replace the missing values with mean
data['Age'] = data['Age'].fillna(mean)
print("----------------------------------------")

# Print the data without missing value
print(f"Here is the data without missing value:\n{data}")
```
----------------------------------------
Here is the data with missing value:
Age
0 20.0
1 21.0
2 22.0
3 23.0
4 24.0
5 25.0
6 26.0
7 27.0
8 28.0
9 29.0
10 NaN
11 50.0
----------------------------------------
Here is the data without missing value:
Age
0 20.000000
1 21.000000
2 22.000000
3 23.000000
4 24.000000
5 25.000000
6 26.000000
7 27.000000
8 28.000000
9 29.000000
10 26.818182
11 50.000000
2. Median

```python
# Import libraries
import pandas as pd
import numpy as np

# Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, np.nan, 50]})

# Print the data with missing value
print("----------------------------------------")
print(f"Here is the data with missing value:\n{data}")

# Calculate the median
median = data['Age'].median()

# Replace the missing values with median
data['Age'] = data['Age'].fillna(median)
print("----------------------------------------")

# Print the data without missing value
print(f"Here is the data without missing value:\n{data}")
```
----------------------------------------
Here is the data with missing value:
Age
0 20.0
1 21.0
2 22.0
3 23.0
4 24.0
5 25.0
6 26.0
7 27.0
8 28.0
9 29.0
10 NaN
11 50.0
----------------------------------------
Here is the data without missing value:
Age
0 20.0
1 21.0
2 22.0
3 23.0
4 24.0
5 25.0
6 26.0
7 27.0
8 28.0
9 29.0
10 25.0
11 50.0
3. Mode

```python
# Import libraries
import pandas as pd
import numpy as np

# Create categorical data with a clear mode and one missing value
data = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple', 'Banana',
                               'Apple', 'Banana', 'Apple', 'Banana', np.nan, 'Banana']})

# Print the data with missing value
print("----------------------------------------")
print(f"Here is the data with missing value:\n{data}")

# Find the mode (mode() returns a Series, so take its first entry)
mode = data['Fruit'].mode()[0]

# Replace the missing values with mode
data['Fruit'] = data['Fruit'].fillna(mode)
print("----------------------------------------")

# Print the data without missing value
print(f"Here is the data without missing value:\n{data}")
```
----------------------------------------
Here is the data with missing value:
Fruit
0 Apple
1 Banana
2 Apple
3 Banana
4 Apple
5 Banana
6 Apple
7 Banana
8 Apple
9 Banana
10 NaN
11 Banana
----------------------------------------
Here is the data without missing value:
Fruit
0 Apple
1 Banana
2 Apple
3 Banana
4 Apple
5 Banana
6 Apple
7 Banana
8 Apple
9 Banana
10 Banana
11 Banana
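The same imputations can also be done with scikit-learn's `SimpleImputer`, which is convenient when the step has to be repeated on new data. This is a sketch rather than the approach used above: it combines the Age and Fruit columns from the earlier examples into one hypothetical table.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Age": [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, np.nan, 50],
    "Fruit": pd.Series(["Apple", "Banana"] * 5 + [np.nan, "Banana"], dtype=object),
})

# Median for the numeric column, most frequent value (mode) for the categorical one
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

df["Age"] = num_imputer.fit_transform(df[["Age"]]).ravel()
df["Fruit"] = cat_imputer.fit_transform(df[["Fruit"]]).ravel()

print(df.tail(3))
```

The median is used for Age here because the value 50 pulls the mean upward (26.82 versus a median of 25), so the median is the safer choice when a column contains outliers.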
6.2.6.3 Using Forward or Backward Fill 🚶🏃
Some datasets follow a time or date sequence. In such datasets, a missing value in a row is replaced with the value from the previous row (forward fill) or the next row (backward fill).
Use the following code to see the steps to fill missing values with forward or backward fill in Python:
1. Forward Fill

```python
# Import libraries
import pandas as pd
import numpy as np

# Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, np.nan, 50]})

# Print the data with missing value
print("----------------------------------------")
print(f"Here is the data with missing value:\n{data}")

# Replace the missing values with forward fill (copy the previous row's value)
data['Age'] = data['Age'].ffill()
print("----------------------------------------")

# Print the data without missing value
print(f"Here is the data without missing value:\n{data}")
```
----------------------------------------
Here is the data with missing value:
Age
0 20.0
1 21.0
2 22.0
3 23.0
4 24.0
5 25.0
6 26.0
7 27.0
8 28.0
9 29.0
10 NaN
11 50.0
----------------------------------------
Here is the data without missing value:
Age
0 20.0
1 21.0
2 22.0
3 23.0
4 24.0
5 25.0
6 26.0
7 27.0
8 28.0
9 29.0
10 29.0
11 50.0
2. Backward Fill

```python
# Import libraries
import pandas as pd
import numpy as np

# Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, np.nan, 50]})

# Print the data with missing value
print("----------------------------------------")
print(f"Here is the data with missing value:\n{data}")

# Replace the missing values with backward fill (copy the next row's value)
data['Age'] = data['Age'].bfill()
print("----------------------------------------")

# Print the data without missing value
print(f"Here is the data without missing value:\n{data}")
```
----------------------------------------
Here is the data with missing value:
Age
0 20.0
1 21.0
2 22.0
3 23.0
4 24.0
5 25.0
6 26.0
7 27.0
8 28.0
9 29.0
10 NaN
11 50.0
----------------------------------------
Here is the data without missing value:
Age
0 20.0
1 21.0
2 22.0
3 23.0
4 24.0
5 25.0
6 26.0
7 27.0
8 28.0
9 29.0
10 50.0
11 50.0
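Forward and backward fill each have a blind spot at one end of the series, which the examples above do not show. The sketch below uses a small made-up daily series with gaps at both ends.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps at the start, middle, and end
s = pd.Series([np.nan, 10.0, np.nan, 30.0, np.nan],
              index=pd.date_range("2024-01-01", periods=5, freq="D"))

forward = s.ffill()        # first value stays NaN: nothing comes before it
backward = s.bfill()       # last value stays NaN: nothing comes after it
both = s.ffill().bfill()   # forward fill, then back-fill the leading gap

print(pd.DataFrame({"original": s, "ffill": forward,
                    "bfill": backward, "both": both}))
```

Chaining the two fills removes every gap, but only do this when carrying a value across time in both directions makes sense for your data.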
6.2.6.4KNN Imputation Ka Istemal: 🧑🤝🧑Yeh ek advanced technique hai jahan missing value ko uske aas-paas ke data points ke average value se replace kiya jata hai. Aise libraries jaise scikit-learn mein yeh method maujood hai.
Use following code to see the steps to fill missing values with KNN imputation in Python:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
# Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, np.nan, 50]})
# Print the data with missing value
print("----------------------------------------")
print(f"Here is the data with missing value:\n{data}")
# Initialize the KNNImputer
imputer = KNNImputer(n_neighbors=2)
# Replace the missing values with KNN imputation
data['Age'] = imputer.fit_transform(data[['Age']])
print("----------------------------------------")
# Print the data without missing value
print(f"Here is the data without missing value:\n{data}")
----------------------------------------
Here is the data with missing value:
Age
0 20.0
1 21.0
2 22.0
3 23.0
4 24.0
5 25.0
6 26.0
7 27.0
8 28.0
9 29.0
10 NaN
11 50.0
----------------------------------------
Here is the data without missing value:
Age
0 20.000000
1 21.000000
2 22.000000
3 23.000000
4 24.000000
5 25.000000
6 26.000000
7 27.000000
8 28.000000
9 29.000000
10 26.818182
11 50.000000
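The imputed value 26.818182 in the output above is simply the mean of the Age column. With only one column, the row with the missing value has no other feature to measure distance with, so KNNImputer falls back to the column mean. The sketch below adds a second column so that neighbours are actually used. The `Experience` column and its values are assumptions made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: 'Experience' is an assumed second column, not from the chapter
data = pd.DataFrame({
    "Age": [20, 21, 22, 23, np.nan, 50],
    "Experience": [1, 2, 3, 4, 5, 30],
})

# Use the two nearest rows, as in the example above
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

# The row with the missing Age has Experience = 5. Its two nearest rows by
# Experience are those with Experience 4 and 3, whose Ages are 23 and 22,
# so the imputed Age is their average
knn_age = imputed.loc[4, "Age"]
print(imputed)
print(f"Imputed Age: {knn_age}")  # 22.5
```

Because KNN relies on distances, columns on very different scales can dominate the neighbour search, so scaling the features first is often worth considering.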
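6.2.6.5 Using Deep Learning Techniques 🧠
Deep learning techniques such as autoencoders can also help in handling missing values. The code below does not train a neural network, though: it uses scikit-learn's IterativeImputer, a regression-based imputer that predicts each column with missing values from the other columns, as a simpler stand-in for a model-based approach.
Use the following code to see the steps to fill missing values with IterativeImputer in Python: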
# Import libraries
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, np.nan, 50]})
# Print the data with missing value
print("----------------------------------------")
print(f"Here is the data with missing value:\n{data}")
# Initialize the IterativeImputer
imputer = IterativeImputer()
# Replace the missing values with iterative (regression-based) imputation
data['Age'] = imputer.fit_transform(data[['Age']])
print("----------------------------------------")
# Print the data without missing value
print(f"Here is the data without missing value:\n{data}")
----------------------------------------
Here is the data with missing value:
Age
0 20.0
1 21.0
2 22.0
3 23.0
4 24.0
5 25.0
6 26.0
7 27.0
8 28.0
9 29.0
10 NaN
11 50.0
----------------------------------------
Here is the data without missing value:
Age
0 20.000000
1 21.000000
2 22.000000
3 23.000000
4 24.000000
5 25.000000
6 26.000000
7 27.000000
8 28.000000
9 29.000000
10 26.818182
11 50.000000
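The output above again shows 26.818182, the column mean. IterativeImputer predicts each column from the others, and with a single column it has nothing to learn from, so it returns its initial mean fill. The sketch below gives it a second column to work with. The `Experience` column and its values are assumptions for illustration, and the default estimator (a Bayesian ridge regression) is used as-is.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Hypothetical data: Age rises by one year for every year of Experience
data = pd.DataFrame({
    "Age": [20, 21, 22, 23, np.nan, 25, 26, 27],
    "Experience": [0, 1, 2, 3, 4, 5, 6, 7],
})

# random_state makes the run repeatable
imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

# The regression on Experience should land close to 24, the value that
# continues the pattern, rather than at the plain column mean
iter_age = imputed.loc[4, "Age"]
column_mean = data["Age"].mean()
print(f"Column mean of Age: {column_mean:.3f}")
print(f"Imputed Age:        {iter_age:.3f}")
```

The exact imputed number depends on the estimator and its regularisation, so treat it as approximate rather than as a fixed result.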
6.2.6.6 Simply Deleting Them ❌
If the number of missing values in your data set is very small, you can also delete the specific rows or columns that contain them.
Use the following code to see the steps to delete missing values in Python:
# Import libraries
import pandas as pd
import numpy as np
# Create the data
data = pd.DataFrame({'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, np.nan, 50]})
# Print the data with missing value
print("----------------------------------------")
print(f"Here is the data with missing value:\n{data}")
# Delete the rows with missing values
data = data.dropna()
print("----------------------------------------")
# Print the data without missing value
print(f"Here is the data without missing value:\n{data}")
----------------------------------------
Here is the data with missing value:
Age
0 20.0
1 21.0
2 22.0
3 23.0
4 24.0
5 25.0
6 26.0
7 27.0
8 28.0
9 29.0
10 NaN
11 50.0
----------------------------------------
Here is the data without missing value:
Age
0 20.0
1 21.0
2 22.0
3 23.0
4 24.0
5 25.0
6 26.0
7 27.0
8 28.0
9 29.0
11 50.0
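The example above drops rows only. Whether to drop a row or a whole column usually depends on how much is missing and where. The sketch below checks the share of missing values first and then deletes selectively. The `City` and `Notes` columns and the 50% threshold are assumptions chosen for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'City' and 'Notes' are assumed columns
data = pd.DataFrame({
    "Age": [20, 21, np.nan, 23, 24],
    "City": ["Lahore", "Karachi", "Multan", None, "Quetta"],
    "Notes": [np.nan, np.nan, np.nan, "ok", np.nan],
})

# Share of missing values per column, to judge how costly deletion would be
missing_share = data.isna().mean()
print(missing_share)

# Drop rows only when 'Age' is missing; gaps in other columns are tolerated
rows_with_age = data.dropna(subset=["Age"])

# Drop columns that are more than half empty (threshold is an assumption)
mostly_empty = missing_share[missing_share > 0.5].index
without_sparse_cols = data.drop(columns=mostly_empty)

# Finally, drop every row that still has any missing value
complete_rows = without_sparse_cols.dropna()
print(complete_rows)
```

Dropping the sparse column first keeps three complete rows here. Calling `dropna()` on the original frame would have removed every row, because `Notes` is empty almost everywhere.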
6.2.6.7 What If I Don't Handle Them?
Well, my dear student, then the model will certainly not work well, and that is not all, so keep listening!
If we ignore missing values, we can run into several problems. Here are some of the problems that can arise:
📉 Lower Model Accuracy: The accuracy of machine learning models can drop, because the model does not receive complete information.
📊 Wrong Analysis: Data analysis can produce wrong results, which can have a negative effect on decisions.
😕 Model Confusion: Some models cannot handle missing values, so the model either fails to train or makes wrong predictions.
🤖 Bias in Model: Missing values increase the risk of bias in the model.
📚 Wrong Interpretation of Data: Because of missing values, we may have incomplete or wrong information, which can lead us to interpret the data incorrectly.
💾 Storage Issues: If missing values are not replaced, storage problems can also arise, because some systems cannot store missing values.
🔀 Data Integration Problems: If data coming from different sources contains missing values, integrating it can cause problems.
🚫 Wrong Feature Selection: In the presence of missing values, some important features that should influence the model may be ignored.
🧪 Wrong Experimental Results: In science or research projects, missing values can lead to wrong experimental results.
😰 Stress and Extra Work: Data scientists may have to do extra work during analysis, because missing values have to be identified and handled.
That is why handling missing values is so important: it lets us avoid the problems listed above. 🛠️🔧🔍
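The "Model Confusion" point can be seen directly in code. The sketch below counts missing values and then shows that scikit-learn's LinearRegression refuses to train on data containing NaN. The `Salary` column and all values are assumptions for illustration, and mean imputation is used here only as one possible fix.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 'Salary' is an assumed target column
data = pd.DataFrame({
    "Age": [20, 21, 22, np.nan, 24, 25],
    "Salary": [30, 32, 34, 36, 38, 40],
})

# Count missing values per column before modelling
missing_counts = data.isna().sum()
print(missing_counts)

# LinearRegression rejects NaN inputs and raises a ValueError
try:
    LinearRegression().fit(data[["Age"]], data["Salary"])
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print(f"Training failed: {err}")

# After handling the gap (here: mean imputation), the same model trains
filled = data.fillna({"Age": data["Age"].mean()})
model = LinearRegression().fit(filled[["Age"]], filled["Salary"])
print(f"Trained coefficient: {model.coef_[0]:.3f}")
```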
6.2.7 Conclusion
Missing Values: A Big Challenge but Also a Great Opportunity 🌟
Dealing with missing values is a challenge for every data scientist, but it also teaches us how to improve the quality of our data. In the end, better and smarter insights and models are built only from better-quality data.
6.3 Follow us
I hope this chapter has taught you a lot. If it really has, then please do support us by sharing this book with your friends and colleagues. Also, do share your feedback with us, so that we can improve our work in the future.