9  Exploratory Data Analysis (EDA) πŸ“ŠπŸ”

9.1 What is EDA?

Exploratory Data Analysis (EDA) is a systematic approach in which we explore our data in depth to understand its characteristics, patterns, anomalies, and hidden insights.

Just as a detective carefully examines a crime scene, a data scientist investigates data through EDA. Through EDA we:

  1. πŸ“‹ Understand the structure and composition of the data
  2. πŸ“ˆ Examine the distribution of each variable
  3. πŸ”„ Compare different groups
  4. πŸ”— Explore relationships between variables
Tip: John Tukey - Father of EDA

John Tukey introduced the concept of EDA in 1977. He said: β€œExploratory data analysis is detective workβ€”numerical detective workβ€”or counting detective workβ€”or graphical detective work.”

9.2 The Four Dimensions of EDA 🎯

We can divide EDA into four dimensions. These four dimensions are essential for understanding data completely:

graph LR
    A[EDA] --> B[πŸ“‹ Data Composition]
    A --> C[πŸ“Š Distribution]
    A --> D[πŸ”„ Comparison]
    A --> E[πŸ”— Relationship]
    
    B --> B1[Shape & Size]
    B --> B2[Data Types]
    B --> B3[Missing Values]
    B --> B4[Duplicates]
    
    C --> C1[Central Tendency]
    C --> C2[Spread/Variability]
    C --> C3[Skewness]
    C --> C4[Outliers]
    
    D --> D1[Group Comparisons]
    D --> D2[Category Analysis]
    D --> D3[Trends Over Time]
    
    E --> E1[Correlations]
    E --> E2[Dependencies]
    E --> E3[Interactions]

| Dimension | Description | Key Questions | Analysis Type |
|---|---|---|---|
| Composition | Structure and makeup of the data | How many rows/columns are there? What are the data types? | Univariate |
| Distribution | Behavior of individual variables | How are the values spread? Are there outliers? | Univariate |
| Comparison | Differences between groups | Are the groups different? | Bivariate |
| Relationship | Connections between variables | Are the variables related? | Bi/Multivariate |

9.3 Univariate, Bivariate, and Multivariate Analysis πŸ“ˆ

Important: Which Analysis to Use When?
  β€’ Univariate: When you want to understand a single variable
  β€’ Bivariate: When you want to examine the connection between two variables
  β€’ Multivariate: When you want to understand the combined effect of multiple variables

| Analysis Type | Variables | Purpose | Examples |
|---|---|---|---|
| Univariate | 1 | Characteristics of a single variable | Histogram, Box plot |
| Bivariate | 2 | Relationship between two variables | Scatter plot, Correlation |
| Multivariate | 3+ | Interaction among multiple variables | Pair plot, Heatmap |
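
As a quick preview, here is a minimal sketch with one plot per analysis type, using the seaborn tips dataset that we load properly later in this chapter:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

# Univariate: distribution of a single variable
sns.histplot(data=tips, x='total_bill')
plt.show()

# Bivariate: relationship between two variables
sns.scatterplot(data=tips, x='total_bill', y='tip')
plt.show()

# Multivariate: interactions among several variables at once
sns.pairplot(tips, vars=['total_bill', 'tip', 'size'], hue='time')
plt.show()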

9.4 πŸ“‹ Dimension 1: Data Composition

In the Data Composition dimension we look at how our data is put together: its shape, size, data types, and quality.

9.4.1 Load the Dataset - Tips Dataset

We will use the famous tips dataset from the seaborn library. This dataset describes restaurant bills and tips.

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Load tips dataset
tips = sns.load_dataset('tips')

# Display first 5 rows
tips.head()
Table 9.1: Tips Dataset - First 5 Rows
|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |

9.4.2 1.1 Data Shape and Size

# Data dimensions
print("=" * 50)
print("πŸ“Š DATA SHAPE AUR SIZE")
print("=" * 50)
print(f"Total Rows: {tips.shape[0]}")
print(f"Total Columns: {tips.shape[1]}")
print(f"Total Elements: {tips.size}")
print("=" * 50)
==================================================
πŸ“Š DATA SHAPE AND SIZE
==================================================
Total Rows: 244
Total Columns: 7
Total Elements: 1708
==================================================

9.4.3 1.2 Data Types

# Data types
print("=" * 50)
print("πŸ“‹ DATA TYPES")
print("=" * 50)
print(tips.dtypes)
print("=" * 50)
print(f"\nπŸ“Š Summary:")
print(f"Numerical Columns: {tips.select_dtypes(include=['number']).columns.tolist()}")
print(f"Categorical Columns: {tips.select_dtypes(include=['category', 'object']).columns.tolist()}")
==================================================
πŸ“‹ DATA TYPES
==================================================
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object
==================================================

πŸ“Š Summary:
Numerical Columns: ['total_bill', 'tip', 'size']
Categorical Columns: ['sex', 'smoker', 'day', 'time']

9.4.4 1.3 Column Information

# Detailed info
print("=" * 50)
print("πŸ“‹ DETAILED COLUMN INFO")
print("=" * 50)
tips.info()
==================================================
πŸ“‹ DETAILED COLUMN INFO
==================================================
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB

9.4.5 1.4 Missing Values Check

As we discussed in Chapter 6, identifying missing values is very important.

# Missing values check
print("=" * 50)
print("❓ MISSING VALUES CHECK")
print("=" * 50)
missing = tips.isnull().sum()
missing_pct = (tips.isnull().sum() / len(tips) * 100).round(2)

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
})
print(missing_df)
print("=" * 50)
print(f"βœ… Great! Is dataset mein koi missing values nahi hain!")
==================================================
❓ MISSING VALUES CHECK
==================================================
            Missing Count  Missing %
total_bill              0        0.0
tip                     0        0.0
sex                     0        0.0
smoker                  0        0.0
day                     0        0.0
time                    0        0.0
size                    0        0.0
==================================================
βœ… Great! This dataset has no missing values!

9.4.6 1.5 Duplicates Check

# Duplicates check
print("=" * 50)
print("πŸ”„ DUPLICATES CHECK")
print("=" * 50)
duplicates = tips.duplicated().sum()
print(f"Total Duplicate Rows: {duplicates}")
if duplicates > 0:
    print("⚠️ Duplicate rows found!")
else:
    print("βœ… No duplicate rows found!")
print("=" * 50)
==================================================
πŸ”„ DUPLICATES CHECK
==================================================
Total Duplicate Rows: 1
⚠️ Duplicate rows found!
==================================================
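
Since one duplicate row was found, a reasonable next step is to inspect it and then decide whether to drop it. A minimal sketch, assuming the same tips DataFrame:

# Show the duplicated row together with its first occurrence
dupes = tips[tips.duplicated(keep=False)]
print(dupes)

# Drop exact duplicates, keeping the first occurrence
tips_clean = tips.drop_duplicates(keep='first')
print(f"Rows before: {len(tips)} | Rows after: {len(tips_clean)}")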

9.4.7 1.6 Memory Usage

# Memory usage
print("=" * 50)
print("πŸ’Ύ MEMORY USAGE")
print("=" * 50)
print(tips.memory_usage(deep=True))
print("=" * 50)
print(f"Total Memory: {tips.memory_usage(deep=True).sum() / 1024:.2f} KB")
==================================================
πŸ’Ύ MEMORY USAGE
==================================================
Index          132
total_bill    1952
tip           1952
sex            476
smoker         471
day            657
time           477
size          1952
dtype: int64
==================================================
Total Memory: 7.88 KB
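
On a larger dataset, memory can often be reduced by converting low-cardinality text columns to category and downcasting numeric columns. The tips dataset already uses category dtypes, so the savings here are tiny, but the pattern looks like this (an illustrative sketch, not a required step):

# Illustrative memory optimization (tips is already mostly optimized)
optimized = tips.copy()
for col in optimized.select_dtypes(include='object').columns:
    optimized[col] = optimized[col].astype('category')
optimized['size'] = pd.to_numeric(optimized['size'], downcast='integer')
optimized['total_bill'] = pd.to_numeric(optimized['total_bill'], downcast='float')

print(f"Before: {tips.memory_usage(deep=True).sum() / 1024:.2f} KB")
print(f"After:  {optimized.memory_usage(deep=True).sum() / 1024:.2f} KB")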

9.5 πŸ“Š Dimension 2: Distribution

In the Distribution dimension we look at how the values of each individual variable are spread.

9.5.1 2.1 Descriptive Statistics

# Descriptive statistics for numerical columns
tips.describe()
Table 9.2: Descriptive Statistics of Tips Dataset
|   | total_bill | tip | size |
|---|---|---|---|
| count | 244.000000 | 244.000000 | 244.000000 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.951100 |
| min | 3.070000 | 1.000000 | 1.000000 |
| 25% | 13.347500 | 2.000000 | 2.000000 |
| 50% | 17.795000 | 2.900000 | 2.000000 |
| 75% | 24.127500 | 3.562500 | 3.000000 |
| max | 50.810000 | 10.000000 | 6.000000 |
# Categorical columns statistics
print("=" * 50)
print("πŸ“Š CATEGORICAL COLUMNS STATISTICS")
print("=" * 50)
print(tips.describe(include=['category', 'object']))
==================================================
πŸ“Š CATEGORICAL COLUMNS STATISTICS
==================================================
         sex smoker  day    time
count    244    244  244     244
unique     2      2    4       2
top     Male     No  Sat  Dinner
freq     157    151   87     176

9.5.2 2.2 Central Tendency - Mean, Median, Mode

As we discussed in Chapter 4, measures of central tendency represent the center of the data.

# Central tendency for 'total_bill'
print("=" * 50)
print("πŸ“Š CENTRAL TENDENCY - total_bill")
print("=" * 50)
print(f"Mean (Average): ${tips['total_bill'].mean():.2f}")
print(f"Median (Middle): ${tips['total_bill'].median():.2f}")
print(f"Mode (Most Frequent): ${tips['total_bill'].mode()[0]:.2f}")
print("=" * 50)

# Central tendency for 'tip'
print("πŸ“Š CENTRAL TENDENCY - tip")
print("=" * 50)
print(f"Mean (Average): ${tips['tip'].mean():.2f}")
print(f"Median (Middle): ${tips['tip'].median():.2f}")
print(f"Mode (Most Frequent): ${tips['tip'].mode()[0]:.2f}")
print("=" * 50)
==================================================
πŸ“Š CENTRAL TENDENCY - total_bill
==================================================
Mean (Average): $19.79
Median (Middle): $17.80
Mode (Most Frequent): $13.42
==================================================
πŸ“Š CENTRAL TENDENCY - tip
==================================================
Mean (Average): $3.00
Median (Middle): $2.90
Mode (Most Frequent): $2.00
==================================================

9.5.3 2.3 Spread/Variability

# Spread measures
print("=" * 50)
print("πŸ“ SPREAD MEASURES - total_bill")
print("=" * 50)
print(f"Range: ${tips['total_bill'].max() - tips['total_bill'].min():.2f}")
print(f"Variance: {tips['total_bill'].var():.2f}")
print(f"Standard Deviation: {tips['total_bill'].std():.2f}")
print(f"IQR: {tips['total_bill'].quantile(0.75) - tips['total_bill'].quantile(0.25):.2f}")
print("=" * 50)
==================================================
πŸ“ SPREAD MEASURES - total_bill
==================================================
Range: $47.74
Variance: 79.25
Standard Deviation: 8.90
IQR: 10.78
==================================================

9.5.4 2.4 Distribution Visualization - Histogram

import plotly.express as px

# Histogram with Plotly
fig = px.histogram(
    tips, 
    x='total_bill', 
    nbins=30,
    title='πŸ“Š Distribution of Total Bill',
    labels={'total_bill': 'Total Bill ($)'},
    color_discrete_sequence=['#2E86AB']
)
fig.update_layout(
    xaxis_title="Total Bill ($)",
    yaxis_title="Frequency",
    showlegend=False
)
fig.show()
Figure 9.1: Distribution of Total Bill Amount
# Seaborn histogram with KDE
plt.figure(figsize=(10, 5))
sns.histplot(
    data=tips, 
    x='tip', 
    kde=True, 
    color='#E94560',
    bins=20
)
plt.title('πŸ“Š Distribution of Tip Amount', fontsize=14)
plt.xlabel('Tip ($)')
plt.ylabel('Frequency')
plt.show()
Figure 9.2: Distribution of Tip Amount with KDE

9.5.5 2.5 Box Plot - Outliers Detection

As we discussed in Chapter 6, box plots are helpful for detecting outliers.

# Box plots with Plotly
fig = px.box(
    tips, 
    y=['total_bill', 'tip'],
    title='πŸ“¦ Box Plots - Outlier Detection'
)
fig.show()
Figure 9.3: Box Plots for Numerical Variables in Tips Dataset
# Seaborn box plot
plt.figure(figsize=(10, 5))
sns.boxplot(data=tips, y='total_bill', color='#00ADB5')
plt.title('πŸ“¦ Box Plot - Total Bill', fontsize=14)
plt.ylabel('Total Bill ($)')
plt.show()
Figure 9.4: Box Plot of Total Bill
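
To put numbers behind what the box plots show, the common 1.5 * IQR rule can be used to count the outliers explicitly. A minimal sketch, assuming the same tips DataFrame:

# Count outliers in total_bill using the 1.5 * IQR rule
q1 = tips['total_bill'].quantile(0.25)
q3 = tips['total_bill'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = tips[(tips['total_bill'] < lower) | (tips['total_bill'] > upper)]
print(f"IQR fences: ${lower:.2f} to ${upper:.2f}")
print(f"Number of outliers in total_bill: {len(outliers)}")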

9.5.6 2.6 Skewness and Kurtosis

# Skewness and Kurtosis
print("=" * 50)
print("πŸ“ SKEWNESS AUR KURTOSIS")
print("=" * 50)
for col in ['total_bill', 'tip', 'size']:
    skew = tips[col].skew()
    kurt = tips[col].kurtosis()
    print(f"\n{col}:")
    print(f"  Skewness: {skew:.3f}", end=" ")
    if skew > 0:
        print("(Right skewed / Positively skewed)")
    elif skew < 0:
        print("(Left skewed / Negatively skewed)")
    else:
        print("(Symmetric)")
    print(f"  Kurtosis: {kurt:.3f}", end=" ")
    if kurt > 0:
        print("(Heavy tailed / Leptokurtic)")
    else:
        print("(Light tailed / Platykurtic)")
print("=" * 50)
==================================================
πŸ“ SKEWNESS AUR KURTOSIS
==================================================

total_bill:
  Skewness: 1.133 (Right skewed / Positively skewed)
  Kurtosis: 1.218 (Heavy tailed / Leptokurtic)

tip:
  Skewness: 1.465 (Right skewed / Positively skewed)
  Kurtosis: 3.648 (Heavy tailed / Leptokurtic)

size:
  Skewness: 1.448 (Right skewed / Positively skewed)
  Kurtosis: 1.732 (Heavy tailed / Leptokurtic)
==================================================
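
All three columns are right-skewed. One common way to make a right-skewed variable more symmetric before modeling is a log transform; here is a minimal sketch using np.log1p (which handles zero values safely):

# Compare skewness before and after a log1p transform
for col in ['total_bill', 'tip']:
    original_skew = tips[col].skew()
    transformed_skew = np.log1p(tips[col]).skew()
    print(f"{col}: skew {original_skew:.3f} -> {transformed_skew:.3f} after log1p")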

9.5.7 2.7 Value Counts - Categorical Variables

# Categorical distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Day distribution
sns.countplot(
    data=tips, 
    x='day', 
    ax=axes[0, 0], 
    palette='viridis',
    order=['Thur', 'Fri', 'Sat', 'Sun']
)
axes[0, 0].set_title('πŸ“… Distribution by Day')

# Time distribution
sns.countplot(
    data=tips, 
    x='time', 
    ax=axes[0, 1], 
    palette='coolwarm'
)
axes[0, 1].set_title('⏰ Distribution by Time')

# Gender distribution
sns.countplot(
    data=tips, 
    x='sex', 
    ax=axes[1, 0], 
    palette='Set2'
)
axes[1, 0].set_title('πŸ‘₯ Distribution by Gender')

# Smoker distribution
sns.countplot(
    data=tips, 
    x='smoker', 
    ax=axes[1, 1], 
    palette='Pastel1'
)
axes[1, 1].set_title('🚬 Distribution by Smoker Status')

plt.tight_layout()
plt.show()
Figure 9.5: Distribution of Categorical Variables

9.6 πŸ”„ Dimension 3: Comparison

In the Comparison dimension we compare different groups with one another.

9.6.1 3.1 Group-wise Statistics

# Group by day
day_stats = tips.groupby('day').agg({
    'total_bill': ['mean', 'median', 'std', 'count'],
    'tip': ['mean', 'median', 'std']
}).round(2)
day_stats
Table 9.3: Statistics by Day
| day | total_bill mean | total_bill median | total_bill std | count | tip mean | tip median | tip std |
|---|---|---|---|---|---|---|---|
| Thur | 17.68 | 16.20 | 7.89 | 62 | 2.77 | 2.30 | 1.24 |
| Fri | 17.15 | 15.38 | 8.30 | 19 | 2.73 | 3.00 | 1.02 |
| Sat | 20.44 | 18.24 | 9.48 | 87 | 2.99 | 2.75 | 1.63 |
| Sun | 21.41 | 19.63 | 8.83 | 76 | 3.26 | 3.15 | 1.23 |
# Group by time
time_stats = tips.groupby('time').agg({
    'total_bill': ['mean', 'median', 'std', 'count'],
    'tip': ['mean', 'median', 'std']
}).round(2)
time_stats
Table 9.4: Statistics by Time (Lunch vs Dinner)
| time | total_bill mean | total_bill median | total_bill std | count | tip mean | tip median | tip std |
|---|---|---|---|---|---|---|---|
| Lunch | 17.17 | 15.96 | 7.71 | 68 | 2.73 | 2.25 | 1.21 |
| Dinner | 20.80 | 18.39 | 9.14 | 176 | 3.10 | 3.00 | 1.44 |

9.6.2 3.2 Box Plot Comparison

# Box plot comparison by day
fig = px.box(
    tips, 
    x='day', 
    y='total_bill',
    color='day',
    title='πŸ“¦ Total Bill Comparison by Day',
    category_orders={'day': ['Thur', 'Fri', 'Sat', 'Sun']}
)
fig.show()
Figure 9.6: Total Bill Comparison by Day
# Seaborn comparison
plt.figure(figsize=(10, 5))
sns.boxplot(
    data=tips, 
    x='time', 
    y='tip',
    hue='sex',
    palette='Set2'
)
plt.title('πŸ’΅ Tip Comparison: Time & Gender', fontsize=14)
plt.xlabel('Time')
plt.ylabel('Tip ($)')
plt.legend(title='Gender')
plt.show()
Figure 9.7: Tip Amount: Lunch vs Dinner Comparison

9.6.3 3.3 Bar Chart Comparison

# Grouped bar chart
avg_tip = tips.groupby(['day', 'time'])['tip'].mean().reset_index()
fig = px.bar(
    avg_tip,
    x='day',
    y='tip',
    color='time',
    barmode='group',
    title='πŸ“Š Average Tip by Day and Time',
    labels={'tip': 'Average Tip ($)'},
    category_orders={'day': ['Thur', 'Fri', 'Sat', 'Sun']}
)
fig.show()
Figure 9.8: Average Tip by Day and Time

9.6.4 3.4 Violin Plot - Distribution Comparison

# Violin plot
plt.figure(figsize=(12, 6))
sns.violinplot(
    data=tips, 
    x='day', 
    y='total_bill',
    hue='sex',
    split=True,
    palette='husl',
    order=['Thur', 'Fri', 'Sat', 'Sun']
)
plt.title('🎻 Total Bill Distribution by Day & Gender', fontsize=14)
plt.xlabel('Day')
plt.ylabel('Total Bill ($)')
plt.show()
Figure 9.9: Total Bill Distribution by Day (Violin Plot)

9.6.5 3.5 Pivot Table Analysis

# Pivot table
pivot_table = tips.pivot_table(
    values='tip',
    index='day',
    columns='time',
    aggfunc='mean'
).round(2)
pivot_table
Table 9.5: Pivot Table: Average Tip by Day and Time
| day | Lunch | Dinner |
|---|---|---|
| Thur | 2.77 | 3.00 |
| Fri | 2.38 | 2.94 |
| Sat | NaN | 2.99 |
| Sun | NaN | 3.26 |
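
The NaN values appear because this dataset has no Lunch records for Saturday and Sunday. If you prefer a cleaner display, pivot_table's fill_value argument can replace those gaps; a minimal sketch:

# Same pivot, but fill day/time combinations that have no data
pivot_filled = tips.pivot_table(
    values='tip',
    index='day',
    columns='time',
    aggfunc='mean',
    fill_value=0  # no Lunch records on Sat/Sun, so show 0 instead of NaN
).round(2)
print(pivot_filled)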

9.7 πŸ”— Dimension 4: Relationship

In the Relationship dimension we explore the connections and correlations between variables.

9.7.1 4.1 Correlation Matrix

As we discussed in Chapter 7, correlation measures the linear relationship between two variables.

# Correlation matrix
corr_matrix = tips[['total_bill', 'tip', 'size']].corr().round(3)
corr_matrix
Table 9.6: Correlation Matrix
|   | total_bill | tip | size |
|---|---|---|---|
| total_bill | 1.000 | 0.676 | 0.598 |
| tip | 0.676 | 1.000 | 0.489 |
| size | 0.598 | 0.489 | 1.000 |
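
By default .corr() computes the Pearson correlation, which captures linear relationships. If you suspect a monotonic but non-linear relationship, the Spearman rank correlation is a useful cross-check; a minimal sketch:

# Compare Pearson (linear) and Spearman (rank-based) correlations
pearson = tips[['total_bill', 'tip', 'size']].corr(method='pearson').round(3)
spearman = tips[['total_bill', 'tip', 'size']].corr(method='spearman').round(3)

print("Pearson:")
print(pearson)
print("\nSpearman:")
print(spearman)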

9.7.2 4.2 Correlation Heatmap

# Correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    corr_matrix,
    annot=True,
    cmap='RdYlBu_r',
    center=0,
    square=True,
    linewidths=0.5,
    fmt='.3f'
)
plt.title('πŸ”₯ Correlation Heatmap', fontsize=14)
plt.show()
Figure 9.10: Correlation Heatmap
# Plotly heatmap
import plotly.figure_factory as ff

fig = ff.create_annotated_heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns.tolist(),
    y=corr_matrix.index.tolist(),
    colorscale='RdBu_r',
    showscale=True
)
fig.update_layout(title='πŸ”₯ Interactive Correlation Heatmap')
fig.show()
Figure 9.11: Interactive Correlation Heatmap

9.7.3 4.3 Scatter Plot - Bivariate Relationship

# Scatter plot with Plotly
fig = px.scatter(
    tips,
    x='total_bill',
    y='tip',
    color='day',
    size='size',
    title='πŸ“ˆ Total Bill vs Tip',
    labels={
        'total_bill': 'Total Bill ($)',
        'tip': 'Tip ($)'
    },
    hover_data=['sex', 'time']
)
fig.show()
Figure 9.12: Relationship between Total Bill and Tip
# Seaborn regplot
plt.figure(figsize=(10, 6))
sns.regplot(
    data=tips,
    x='total_bill',
    y='tip',
    scatter_kws={'alpha': 0.5},
    line_kws={'color': 'red'}
)
plt.title('πŸ“ˆ Total Bill vs Tip (with Trend Line)', fontsize=14)
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')
plt.show()
Figure 9.13: Scatter Plot with Regression Line

9.7.4 4.4 Pair Plot - Multivariate Relationships

# Pair plot
g = sns.pairplot(
    tips,
    vars=['total_bill', 'tip', 'size'],
    hue='time',
    palette='husl',
    diag_kind='kde',
    height=2.5
)
g.fig.suptitle('πŸ”— Pair Plot - Multivariate Relationships', y=1.02)
plt.show()
Figure 9.14: Pair Plot - All Numerical Variables

9.7.5 4.5 Joint Plot

# Joint plot
g = sns.jointplot(
    data=tips,
    x='total_bill',
    y='tip',
    kind='reg',
    height=8,
    color='#2E86AB'
)
g.fig.suptitle('πŸ“Š Joint Plot: Total Bill vs Tip', y=1.02)
plt.show()
Figure 9.15: Joint Plot: Total Bill vs Tip

9.7.6 4.6 Categorical vs Numerical Relationship

# Catplot
g = sns.catplot(
    data=tips,
    x='day',
    y='tip',
    hue='smoker',
    kind='box',
    height=5,
    aspect=1.5,
    palette='Set2',
    order=['Thur', 'Fri', 'Sat', 'Sun']
)
g.fig.suptitle('πŸ’΅ Tip by Day & Smoker Status', y=1.02)
plt.show()
Figure 9.16: Tip Amount by Smoker Status and Day

9.7.7 4.7 FacetGrid - Multiple Relationships

# FacetGrid
g = sns.FacetGrid(
    tips, 
    col='time', 
    row='smoker',
    height=4,
    aspect=1.2
)
g.map_dataframe(
    sns.scatterplot, 
    x='total_bill', 
    y='tip',
    alpha=0.7
)
g.fig.suptitle('πŸ“Š Total Bill vs Tip by Time & Smoker', y=1.02)
plt.show()
Figure 9.17: Scatter Plots by Time and Smoker Status

9.8 🎯 Complete EDA Summary

# Complete EDA Summary
print("=" * 60)
print("πŸ“Š COMPLETE EDA SUMMARY - TIPS DATASET")
print("=" * 60)

print("\nπŸ“‹ COMPOSITION:")
print(f"   β€’ Rows: {tips.shape[0]} | Columns: {tips.shape[1]}")
print(f"   β€’ Missing Values: {tips.isnull().sum().sum()}")
print(f"   β€’ Duplicates: {tips.duplicated().sum()}")

print("\nπŸ“Š DISTRIBUTION:")
print(f"   β€’ Average Total Bill: ${tips['total_bill'].mean():.2f}")
print(f"   β€’ Average Tip: ${tips['tip'].mean():.2f}")
print(f"   β€’ Tip Range: ${tips['tip'].min():.2f} - ${tips['tip'].max():.2f}")

print("\nπŸ”„ COMPARISON:")
print(f"   β€’ Most Busy Day: {tips['day'].value_counts().idxmax()}")
print(f"   β€’ Highest Avg Tip Day: {tips.groupby('day')['tip'].mean().idxmax()}")

print("\nπŸ”— RELATIONSHIP:")
corr = tips['total_bill'].corr(tips['tip'])
print(f"   β€’ Correlation (Bill vs Tip): {corr:.3f}")
print(f"   β€’ Interpretation: Strong Positive Relationship")

print("\n" + "=" * 60)
============================================================
πŸ“Š COMPLETE EDA SUMMARY - TIPS DATASET
============================================================

πŸ“‹ COMPOSITION:
   β€’ Rows: 244 | Columns: 7
   β€’ Missing Values: 0
   β€’ Duplicates: 1

πŸ“Š DISTRIBUTION:
   β€’ Average Total Bill: $19.79
   β€’ Average Tip: $3.00
   β€’ Tip Range: $1.00 - $10.00

πŸ”„ COMPARISON:
   β€’ Most Busy Day: Sat
   β€’ Highest Avg Tip Day: Sun

πŸ”— RELATIONSHIP:
   β€’ Correlation (Bill vs Tip): 0.676
   β€’ Interpretation: Strong Positive Relationship

============================================================

9.9 πŸ€– Automatic EDA Tools

To make a data scientist's work easier, Python offers several libraries that can perform EDA automatically. These tools save time and effort and generate comprehensive reports.

Note: Installation Note

To install these libraries, you first need to create and activate a conda environment; better yet, try to create a separate conda environment for each tool.

9.9.1 Available Auto-EDA Libraries

| Library | Best For | Installation | URL |
|---|---|---|---|
| YData Profiling | Complete EDA + Time Series | pip install ydata-profiling | Docs |
| D-Tale | Interactive Web UI | pip install dtale | PyPI |
| PyGWalker | Tableau-like Interface | pip install pygwalker | PyPI |
| Sweetviz | Beautiful Reports | pip install sweetviz | PyPI |
| Skimpy | Quick Summary | pip install skimpy | PyPI |
| DataPrep | Fast EDA | pip install dataprep | PyPI |
| LIDA | AI-Powered Viz | pip install lida | PyPI |
| PandasAI | Natural Language Queries | pip install pandasai | GitHub |

9.9.2 Installation Commands

# Install all auto-EDA libraries using conda/pip
# YData Profiling (formerly pandas-profiling)
conda create -n ydata-profiling python=3.12 -y
conda activate ydata-profiling
pip install ydata-profiling

# D-Tale - Interactive EDA
conda create -n dtale python=3.12 -y
conda activate dtale
pip install dtale

# PyGWalker - Tableau-like interface
conda create -n pygwalker python=3.12 -y
conda activate pygwalker
pip install pygwalker

# Sweetviz - Beautiful reports
conda create -n sweetviz python=3.12 -y
conda activate sweetviz
pip install sweetviz

# Skimpy - Quick summaries
conda create -n skimpy python=3.12 -y
conda activate skimpy
pip install skimpy

# DataPrep - Fast EDA
conda create -n dataprep python=3.12 -y
conda activate dataprep
pip install dataprep

# LIDA - AI-powered visualization
conda create -n lida python=3.12 -y
conda activate lida
pip install lida

# PandasAI - Natural language queries
conda create -n pandasai python=3.12 -y
conda activate pandasai
pip install pandasai

9.9.3 1. YData Profiling (Pandas Profiling)

Tip: Best For

Complete automated EDA reports with minimal code. Supports both tabular and time series data.

# YData Profiling - Basic Usage
from ydata_profiling import ProfileReport

# Generate report
profile = ProfileReport(
    tips, 
    title="Tips Dataset EDA Report",
    explorative=True
)

# Save as HTML
profile.to_file("tips_eda_report.html")

# For Time Series Data
profile_ts = ProfileReport(
    df, 
    tsmode=True,
    sortby="date_column"
)

9.9.4 2. D-Tale

Tip: Best For

Interactive web-based data exploration with point-and-click interface.

# D-Tale - Interactive EDA
import dtale

# Launch interactive session
d = dtale.show(tips)

# Opens in browser automatically
# Access at: http://localhost:40000

9.9.5 3. PyGWalker

Tip: Best For

Tableau-like drag-and-drop visualization interface in Jupyter notebooks.

# PyGWalker - Tableau-like Interface
import pygwalker as pyg

# Create interactive visualization
walker = pyg.walk(tips)

9.9.6 4. Sweetviz

Tip: Best For

Beautiful HTML reports with comparison capabilities between datasets.

# Sweetviz - Beautiful Reports
import sweetviz as sv

# Create EDA report
report = sv.analyze(tips)

# Show in notebook or save
report.show_html("sweetviz_report.html")

# Compare two datasets
# report = sv.compare([train_df, "Train"], [test_df, "Test"])

9.9.7 5. Skimpy

Tip: Best For

Quick, console-friendly data summaries similar to R's skimr package.

# Skimpy - Quick Summaries
from skimpy import skim

# Quick summary
skim(tips)

9.9.8 6. DataPrep

Tip: Best For

Fast exploratory data analysis with blazing fast performance on large datasets.

# DataPrep - Fast EDA
from dataprep.eda import create_report, plot

# Create complete report
report = create_report(tips)
report.show_browser()

# Individual plots
plot(tips)
plot(tips, "total_bill")
plot(tips, "total_bill", "tip")

9.9.9 7. LIDA

Tip: Best For

AI-powered automatic visualization generation using natural language.

# LIDA - AI-Powered Visualization
from lida import Manager, TextGenerationConfig

# Initialize LIDA
lida = Manager()

# Generate visualizations automatically
summary = lida.summarize(tips)
goals = lida.goals(summary, n=5)
charts = lida.visualize(
    summary=summary, 
    goal=goals[0]
)

9.9.10 8. PandasAI

Tip: Best For

Ask questions about your data in natural language (requires API key).

# PandasAI - Natural Language Queries
from pandasai import SmartDataframe

# Initialize with OpenAI API key
df = SmartDataframe(
    tips, 
    config={"llm": your_llm}
)

# Ask questions in natural language
response = df.chat("What is the average tip by day?")
print(response)

response = df.chat("Show me a bar chart of tips by gender")

9.10 Data Cleaning vs EDA

Data Cleaning and EDA are both essential steps in data science, but their purpose and focus differ.

| Aspect | Data Cleaning | EDA |
|--------|---------------|-----|
| Purpose | Getting the data ready for analysis | Understanding the data and extracting insights |
| Focus | Missing values, duplicates, inconsistencies | Data distribution, relationships, patterns |
| Techniques | Imputation, normalization, encoding | Visualization, statistical analysis |
| Outcome | Cleaned dataset | Understanding of data characteristics |

Tip

The figure below gives a complete overview of data cleaning and its impact on data analysis.

from IPython.display import Image
Image("data_cleaning.png")
Figure 9.18: A comprehensive understanding of data cleaning and its impact on data analysis

9.11 EDA Best Practices 🎯

Important: Key Takeaways
  1. Always start with Data Composition - Know your data structure first
  2. Check for data quality issues - Missing values, duplicates, outliers
  3. Use both statistics AND visualizations - Look at the numbers and the graphs
  4. Document your findings - Keep writing notes as you go
  5. Iterate - EDA is a continuous process

9.11.1 EDA Checklist βœ…

| Step | Task | Done? |
|---|---|---|
| 1 | Check data shape and size | ☐ |
| 2 | Verify data types | ☐ |
| 3 | Identify missing values | ☐ |
| 4 | Check for duplicates | ☐ |
| 5 | Calculate descriptive statistics | ☐ |
| 6 | Visualize distributions | ☐ |
| 7 | Detect outliers | ☐ |
| 8 | Compare groups | ☐ |
| 9 | Analyze correlations | ☐ |
| 10 | Visualize relationships | ☐ |
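
To make this checklist repeatable, the composition and distribution steps can be wrapped in a small helper function. A minimal sketch (the name quick_eda is just an illustration, not a standard API):

import pandas as pd

def quick_eda(df: pd.DataFrame) -> None:
    """Print a first-pass EDA summary covering the early checklist steps."""
    print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
    print(f"\nData types:\n{df.dtypes}")
    print(f"\nMissing values per column:\n{df.isnull().sum()}")
    print(f"\nDuplicate rows: {df.duplicated().sum()}")
    print("\nNumerical summary:")
    print(df.describe())
    numeric = df.select_dtypes(include='number')
    if numeric.shape[1] >= 2:
        print("\nCorrelation matrix:")
        print(numeric.corr().round(3))

# Usage:
# quick_eda(tips)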

9.12 🎬 Video Tutorial

Tip: Data Visualization Masterclass

To master data visualization, watch this complete course:

Data Visualization Masterclass in Python | Matplotlib, Seaborn & Plotly for Beginners to Advanced

In this course you will learn:

  β€’ Matplotlib plotting, from the basics to advanced
  β€’ Statistical graphics with Seaborn
  β€’ Interactive visualizations with Plotly
  β€’ Building professional dashboards

9.13 Automatic EDA in Python - Crash Course

Tip

If you want to see these automatic EDA tools in action, be sure to watch this video:

Automatic Exploratory Data Analysis in Python | Crash Course for Data Analysts & Scientists

In this video you will learn:

  β€’ What EDA is
  β€’ How EDA helps you understand your data deeply and make better decisions
  β€’ Several automatic EDA tools in action

9.14 Conclusion

In this chapter we learned that EDA is an essential skill in data science. We saw that:

  1. Data Composition - understanding the structure of the data
  2. Distribution - analyzing how the values are spread
  3. Comparison - comparing groups
  4. Relationship - exploring connections between variables

Through EDA we can understand our data in depth and make better decisions. Automatic EDA tools such as YData Profiling, D-Tale, and others make this journey even easier.

9.15 Follow us

Tip: Follow us

I hope this chapter has taught you a lot. If it really has, please support us by sharing this book with your friends and colleagues. Also, do share your feedback with us so that we can improve our work in the future.