Pandas Mastery Guide: From Beginner to Pro 🐼

Pandas is the backbone of data analysis in Python, providing powerful, flexible, and efficient tools for real-world data manipulation and analysis. Whether you’re just starting your data science journey or looking to level up your skills, this comprehensive guide will take you from basic operations to advanced techniques.

In this Codanics masterclass, we’ll work with two popular datasets – the Titanic passenger data and the Diamonds dataset – to showcase practical applications at every skill level.

Why Pandas Is Essential in 2024

  • Universal Data Tool: Used by data scientists, analysts, engineers, and ML experts worldwide
  • Versatile Data Handling: Effortlessly manages structured data of any size
  • Industry Standard: Required skill for virtually all data-related positions
  • Ecosystem Integration: Seamlessly connects with NumPy, Matplotlib, Scikit-learn, and other tools
  • Time-saving: Automates repetitive data tasks that would take hours manually

Setting Up Your Environment

Let’s start by installing pandas and other necessary libraries:

pip install pandas numpy matplotlib seaborn scikit-learn

Now, let’s import the libraries we’ll use throughout this guide:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set styling for visualizations
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Display settings for better output readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# For reproducibility
np.random.seed(42)

Understanding Our Datasets

We’ll be working with two well-known datasets throughout this guide:

1. Titanic Dataset

Contains data about Titanic passengers including demographics, cabin information, and survival status.

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Preview the data
print("Titanic Dataset Preview:")
print(titanic.head())

2. Diamonds Dataset

Contains attributes of almost 54,000 diamonds including price, carat, cut quality, color, and clarity.

# Load the Diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Preview the data
print("\nDiamonds Dataset Preview:")
print(diamonds.head())

Basic Operations with Pandas

Exploring Dataset Structure

# Basic information about the dataset
print(titanic.info())

# Statistical summary
print(titanic.describe())

# Check for missing values
print("\nMissing values in Titanic dataset:")
print(titanic.isnull().sum())

# Dataset dimensions
print(f"\nTitanic dataset shape: {titanic.shape}")

# Column names
print(f"\nColumns: {titanic.columns.tolist()}")

The DataFrame, the core data structure in pandas, is a two-dimensional, size-mutable, heterogeneous tabular structure with labeled axes (rows and columns).
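For example, a DataFrame can be built directly from a dictionary of columns, and each column is itself a Series (a minimal sketch with made-up values):

# Build a small DataFrame from a dictionary of columns
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 32, 37],
    'fare': [7.25, 71.28, 8.05]
})

# Each column is a pandas Series with its own dtype
print(type(df['age']))  # <class 'pandas.core.series.Series'>
print(df.dtypes)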

Intermediate Pandas Techniques

Working with Missing Data

Handling missing values (entries that are empty or not available, i.e. NaN) is crucial for accurate analysis.
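Before the class- and sex-based imputation below, here is a minimal sketch of the simpler built-in options, dropna() and fillna(), applied to a copy of the Titanic data (the variable name titanic_simple is only illustrative):

# Simple strategies on a copy of the data
titanic_simple = titanic.copy()

# Option 1: drop rows with any missing values (can discard a lot of data)
print(f"Rows before dropna: {len(titanic_simple)}, after: {len(titanic_simple.dropna())}")

# Option 2: fill numeric gaps with a single statistic
titanic_simple['age'] = titanic_simple['age'].fillna(titanic_simple['age'].median())

# Option 3: fill categorical gaps with the most frequent value
titanic_simple['embark_town'] = titanic_simple['embark_town'].fillna(
    titanic_simple['embark_town'].mode()[0]
)
print(titanic_simple[['age', 'embark_town']].isnull().sum())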

# Create more sophisticated missing value imputation
titanic_adv = titanic.copy()

# Impute age based on class and sex - more accurate than a single overall median
age_medians = titanic.groupby(['sex', 'class'])['age'].median()

# Define function to assign median age based on passenger attributes
def fill_age(row):
    if pd.isnull(row['age']):
        return age_medians[row['sex'], row['class']]
    return row['age']

# Apply the function to each row
titanic_adv['age'] = titanic_adv.apply(fill_age, axis=1)

# Create age categories
titanic_adv['age_group'] = pd.cut(
    titanic_adv['age'],
    bins=[0, 12, 18, 35, 60, 100],
    labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior']
)

# Check the results
print("Age distribution by group after imputation:")
print(titanic_adv['age_group'].value_counts().sort_index())

Creating New Features

# Feature engineering with the diamonds dataset
diamonds_fe = diamonds.copy()

# Create price per carat feature
diamonds_fe['price_per_carat'] = diamonds_fe['price'] / diamonds_fe['carat']

# Create simplified color categories
diamonds_fe['color_quality'] = diamonds_fe['color'].apply(
    lambda x: 'Premium' if x in ['D', 'E', 'F'] else 'Good' if x in ['G', 'H'] else 'Fair'
)

# Create volume estimate
diamonds_fe['volume'] = diamonds_fe['x'] * diamonds_fe['y'] * diamonds_fe['z']

# Create a categorical feature for price ranges
diamonds_fe['price_category'] = pd.qcut(
    diamonds_fe['price'],
    q=4,
    labels=['Budget', 'Medium', 'High', 'Luxury']
)

# Check new features
print(diamonds_fe[['carat', 'price', 'price_per_carat', 'color', 'color_quality',
                   'volume', 'price_category']].head(10))

Advanced Filtering and Selection

# Advanced selection techniques with Diamonds dataset
# Get top 5 most expensive diamonds
top_diamonds = diamonds.nlargest(5, 'price')
print("Top 5 most expensive diamonds:")
print(top_diamonds[['carat', 'cut', 'color', 'clarity', 'price']])

# Get 5 diamonds closest to the median price
median_price = diamonds['price'].median()
closest_to_median = diamonds.iloc[(diamonds['price'] - median_price).abs().argsort()[:5]]
print("\nDiamonds closest to median price:")
print(closest_to_median[['carat', 'cut', 'color', 'clarity', 'price']])

# Boolean masking with AND, OR, NOT conditions
ideal_bargains = diamonds[
    (diamonds['cut'] == 'Ideal') &
    (diamonds['carat'] > 1.5) &
    (diamonds['price'] < diamonds['price'].quantile(0.25) * 3)
]
print(f"\nNumber of 'ideal' cut bargain diamonds: {len(ideal_bargains)}")

# Find outliers using IQR method
Q1 = diamonds['price'].quantile(0.25)
Q3 = diamonds['price'].quantile(0.75)
IQR = Q3 - Q1
price_outliers = diamonds[(diamonds['price'] < (Q1 - 1.5 * IQR)) |
                          (diamonds['price'] > (Q3 + 1.5 * IQR))]
print(f"\nNumber of price outliers: {len(price_outliers)}")

Pivot Tables and Reshaping Data

# Create pivot table - survival rates by sex and class
survival_pivot = titanic.pivot_table(
    values='survived',
    index='sex',
    columns='class',
    aggfunc='mean'
).round(2)

print("Survival rate by sex and class:")
print(survival_pivot)

# Create pivot table with multiple values and aggregations
complex_pivot = titanic.pivot_table(
    values=['age', 'fare'],
    index=['sex', 'survived'],
    columns='class',
    aggfunc={
        'age': ['mean', 'count'],
        'fare': ['mean', 'median']
    }
).round(1)

print("\nComplex pivot table:")
print(complex_pivot)

# Reshape data with melt (wide to long format)
# First create a simplified dataframe to demonstrate
pivot_example = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [90, 80, 70],
    'Science': [95, 85, 75],
    'History': [85, 75, 95]
})

# Melt to long format
melted = pd.melt(
    pivot_example,
    id_vars=['Name'],
    value_vars=['Math', 'Science', 'History'],
    var_name='Subject',
    value_name='Score'
)

print("\nOriginal 'wide' data:")
print(pivot_example)
print("\nMelted 'long' data:")
print(melted)

Advanced Pandas Mastery

Advanced Grouping Operations

GroupBy, a pandas method that splits data into groups based on some criteria, applies a function to each group, and combines the results, is essential for advanced aggregations.
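As a quick warm-up before the more complex examples, here is a minimal split-apply-combine sketch using the Titanic columns loaded above:

# Minimal split-apply-combine: mean fare per passenger class
print(titanic.groupby('class')['fare'].mean())

# Several aggregations at once
print(titanic.groupby('sex')['survived'].agg(['mean', 'count']))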

# Complex groupby operations
# Find survival rate by age group and class
titanic_adv['survived'] = titanic['survived']  # Ensure we have the survived column
survival_by_age_class = titanic_adv.groupby(['age_group', 'class'])['survived'].agg(
    ['mean', 'count', 'sum']
).sort_values(by=['age_group', 'mean'], ascending=[True, False])

print("Survival rate by age group and class:")
print(survival_by_age_class)

# Custom aggregation function
def survival_stats(x):
    return pd.Series({
        'survival_rate': x.mean(),
        'survived_count': x.sum(),
        'total_count': len(x),
        'survival_stderr': x.std() / np.sqrt(len(x)) if len(x) > 0 else 0
    })

# Apply custom aggregation
detailed_survival = titanic.groupby(['sex', 'class'])['survived'].apply(survival_stats)
print("\nDetailed survival statistics:")
print(detailed_survival)

# Using transform for grouped operations while maintaining the original dataframe size
# Add column showing mean fare for each passenger's class
titanic_adv['class_mean_fare'] = titanic.groupby('class')['fare'].transform('mean')
# Add survival rate by class
titanic_adv['class_survival_rate'] = titanic.groupby('class')['survived'].transform('mean')

print("\nTransformed data (showing mean fare and survival rate by class):")
print(titanic_adv[['class', 'fare', 'class_mean_fare', 'survived', 'class_survival_rate']].head())

Advanced Merging and Joining

# Create sample dataframes for demonstration
# (seaborn's titanic dataset has no 'name' column, so we use the row index as a passenger id)
titanic_id = titanic.reset_index().rename(columns={'index': 'passenger_id'})
passengers = titanic_id[['passenger_id', 'sex', 'age', 'survived']].head(5)
fares = titanic_id[['passenger_id', 'fare', 'embarked']].head(7)  # Note: 2 more rows than passengers
cabins = pd.DataFrame({
    'passenger_id': titanic_id['passenger_id'].iloc[[0, 1, 3, 5, 6]],  # Mixed overlap with passengers
    'cabin': ['C123', 'E46', 'A10', 'D33', 'B22']
})

print("Passengers:")
print(passengers)
print("\nFares:")
print(fares)
print("\nCabins:")
print(cabins)

# Inner join - only matching rows
inner_join = passengers.merge(cabins, on='passenger_id', how='inner')
print("\nInner join result:")
print(inner_join)

# Left join - all rows from left dataframe
left_join = passengers.merge(cabins, on='passenger_id', how='left')
print("\nLeft join result:")
print(left_join)

# Outer join - all rows from both dataframes
outer_join = passengers.merge(cabins, on='passenger_id', how='outer')
print("\nOuter join result:")
print(outer_join)

# Multiple joins
complete_info = passengers.merge(
    fares, on='passenger_id', how='left'
).merge(
    cabins, on='passenger_id', how='left'
)
print("\nMultiple joins result:")
print(complete_info)

Window Functions and Rolling Calculations

# Sort diamonds by price
sorted_diamonds = diamonds.sort_values('price')

# Calculate rolling statistics
sorted_diamonds['rolling_avg_10'] = sorted_diamonds['price'].rolling(window=10).mean()
sorted_diamonds['rolling_std_10'] = sorted_diamonds['price'].rolling(window=10).std()

# Calculate cumulative statistics
sorted_diamonds['cum_max'] = sorted_diamonds['price'].cummax()
sorted_diamonds['cum_min'] = sorted_diamonds['price'].cummin()

print("Rolling and cumulative calculations:")
print(sorted_diamonds[['carat', 'price', 'rolling_avg_10', 'rolling_std_10',
                       'cum_max', 'cum_min']].head(15))

# Expanding window calculations
sorted_diamonds['exp_mean'] = sorted_diamonds['price'].expanding().mean()

# Adding ranks
sorted_diamonds['price_rank'] = sorted_diamonds['price'].rank(method='min')
sorted_diamonds['price_dense_rank'] = sorted_diamonds['price'].rank(method='dense')
sorted_diamonds['price_percent_rank'] = sorted_diamonds['price'].rank(pct=True)

print("\nRanking examples:")
print(sorted_diamonds[['price', 'price_rank', 'price_dense_rank',
                       'price_percent_rank']].head(10))

Working with Time Series Data

# Create a date range
date_range = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')

# Create time series data
time_series = pd.DataFrame({
    'date': date_range,
    'value': np.random.normal(100, 10, size=len(date_range))
})

# Set date as index
time_series.set_index('date', inplace=True)

print("Time series data:")
print(time_series.head())

# Resampling - aggregate to weekly frequency
weekly = time_series.resample('W').mean()
print("\nWeekly average:")
print(weekly)

# Shifting data (lag and lead)
time_series['previous_day'] = time_series['value'].shift(1)
time_series['next_day'] = time_series['value'].shift(-1)
time_series['day_over_day_change'] = time_series['value'] - time_series['previous_day']
time_series['day_over_day_pct_change'] = time_series['value'].pct_change() * 100

print("\nTime series with shifts and changes:")
print(time_series)

Advanced Function Application

# Using apply with a custom function
def price_analysis(diamond):
    """Analyze if a diamond is good value based on multiple criteria"""
    price_per_carat = diamond['price'] / diamond['carat']
    avg_price_per_carat = diamonds.loc[diamonds['cut'] == diamond['cut'], 'price'].sum() / \
                          diamonds.loc[diamonds['cut'] == diamond['cut'], 'carat'].sum()

    return pd.Series({
        'price_per_carat': price_per_carat,
        'avg_for_cut': avg_price_per_carat,
        'value_ratio': price_per_carat / avg_price_per_carat,
        'good_value': price_per_carat < avg_price_per_carat * 0.9
    })

# Apply the function to a sample of diamonds
diamonds_sample = diamonds.sample(5)
value_analysis = diamonds_sample.apply(price_analysis, axis=1)

# Combine with original data
diamonds_with_analysis = pd.concat([diamonds_sample[['carat', 'cut', 'price']], value_analysis], axis=1)
print("Diamond value analysis:")
print(diamonds_with_analysis)

# Using vectorized operations (much faster than apply)
diamonds_vector = diamonds.sample(1000).copy()
diamonds_vector['price_per_carat'] = diamonds_vector['price'] / diamonds_vector['carat']
cut_avg_prices = diamonds_vector.groupby('cut')['price'].sum() / diamonds_vector.groupby('cut')['carat'].sum()

# Map the averages to each diamond based on its cut
diamonds_vector['avg_for_cut'] = diamonds_vector['cut'].map(cut_avg_prices)
diamonds_vector['value_ratio'] = diamonds_vector['price_per_carat'] / diamonds_vector['avg_for_cut']
diamonds_vector['good_value'] = diamonds_vector['price_per_carat'] < diamonds_vector['avg_for_cut'] * 0.9

print("\nVectorized operations result (faster than apply):")
print(diamonds_vector[['carat', 'cut', 'price', 'price_per_carat', 'avg_for_cut',
                       'value_ratio', 'good_value']].head())

Data Visualization with Pandas

Pandas provides easy-to-use plotting capabilities built on Matplotlib. Let’s explore how to visualize our datasets:

Basic Visualizations

# Basic histogram
titanic['age'].plot(kind='hist', bins=20, title='Age Distribution on Titanic')
plt.xlabel('Age')
plt.ylabel('Count')
plt.savefig('titanic_age_histogram.png')
plt.close()

# Basic bar chart
titanic['class'].value_counts().plot(kind='bar', title='Passenger Count by Class')
plt.xlabel('Class')
plt.ylabel('Count')
plt.savefig('titanic_class_bar.png')
plt.close()

# Using seaborn with pandas dataframes
plt.figure(figsize=(10, 6))
sns.countplot(data=titanic, x='class', hue='survived')
plt.title('Survival by Passenger Class')
plt.savefig('titanic_survival_by_class.png')
plt.close()

# Scatter plot for diamonds
diamonds.sample(1000).plot.scatter(x='carat', y='price', alpha=0.5, figsize=(10, 6))
plt.title('Diamond Price vs. Carat')
plt.savefig('diamond_price_carat.png')
plt.close()

Advanced Visualizations

# Create a pivot table for visualization
survival_pct = pd.crosstab(
    index=titanic['class'],
    columns=titanic['sex'],
    values=titanic['survived'],
    aggfunc='mean'
).round(2) * 100

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(survival_pct, annot=True, cmap='YlGnBu', fmt='g')
plt.title('Survival Rate (%) by Class and Sex')
plt.savefig('titanic_survival_heatmap.png')
plt.close()

# Boxplot of diamond prices by cut
plt.figure(figsize=(12, 6))
sns.boxplot(x='cut', y='price', data=diamonds)
plt.title('Diamond Price Distribution by Cut')
plt.yscale('log')  # Log scale for better visualization
plt.savefig('diamond_price_by_cut.png')
plt.close()

# Create a pair plot using seaborn
sns.pairplot(diamonds.sample(1000)[['carat', 'depth', 'table', 'price']])
plt.savefig('diamond_pairplot.png')
plt.close()

# Create a custom grouped bar chart
survival_by_class_sex = titanic.pivot_table(
    index='class',
    columns='sex',
    values='survived',
    aggfunc='mean'
)

survival_by_class_sex.plot(kind='bar', figsize=(10, 6))
plt.title('Survival Rate by Class and Sex')
plt.ylabel('Survival Rate')
plt.xticks(rotation=0)
plt.savefig('titanic_survival_by_class_sex.png')
plt.close()

Performance Optimization

When working with larger datasets, optimizing your code becomes crucial. Here are some techniques to make your pandas operations faster:

# Speed comparison example
import time

# Create a larger dataframe for demonstration
large_df = pd.DataFrame(np.random.rand(100000, 4), columns=list('ABCD'))

# 1. Using .apply() (slower)
start_time = time.time()
result1 = large_df.apply(lambda x: x['A'] * x['B'] + x['C'] - x['D'], axis=1)
apply_time = time.time() - start_time
print(f"Time using .apply(): {apply_time:.4f} seconds")

# 2. Using vectorized operations (faster)
start_time = time.time()
result2 = large_df['A'] * large_df['B'] + large_df['C'] - large_df['D']
vector_time = time.time() - start_time
print(f"Time using vectorized operations: {vector_time:.4f} seconds")
print(f"Speedup factor: {apply_time / vector_time:.1f}x")

# 3. Using .eval() for complex expressions (can be faster for large datasets)
start_time = time.time()
result3 = large_df.eval('A * B + C - D')
eval_time = time.time() - start_time
print(f"Time using .eval(): {eval_time:.4f} seconds")

# Memory optimization with categoricals
# For string columns with repetitive values
titanic_optimized = titanic.copy()
titanic_optimized['sex'] = titanic_optimized['sex'].astype('category')
titanic_optimized['class'] = titanic_optimized['class'].astype('category')
titanic_optimized['embark_town'] = titanic_optimized['embark_town'].astype('category')

# Compare memory usage
print(f"\nOriginal memory usage: {titanic.memory_usage(deep=True).sum() / 1024:.2f} KB")
print(f"Optimized memory usage: {titanic_optimized.memory_usage(deep=True).sum() / 1024:.2f} KB")

# Chunking for processing large files
def process_in_chunks(filename, chunk_size=10000):
    chunk_list = []
    # This is just a demonstration - replace with your actual CSV file
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Process each chunk (example: filter rows with age > 30)
        processed = chunk[chunk['age'] > 30]
        chunk_list.append(processed)

    # Combine all processed chunks
    return pd.concat(chunk_list)

# Example usage (commented out as we don't have the file)
# result = process_in_chunks('large_dataset.csv', chunk_size=50000)

Best Practices & Common Pitfalls

Best Practices

  • Use Vectorized Operations: Always prefer vectorized operations over loops or apply when possible
  • Chain Methods: Take advantage of method chaining for cleaner code (see the sketch after this list)
  • Catch Chained Assignment: Use pd.set_option('mode.chained_assignment', 'raise') during development to turn chained-assignment warnings into errors
  • Use .loc and .iloc: Prefer these for data selection to avoid SettingWithCopyWarning
  • Make Explicit Copies: Use .copy() when creating a new dataframe from a slice
  • Use Appropriate Data Types: Convert to categorical types for string columns with few unique values
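As a quick illustration of the chaining and .loc advice above, here is a minimal sketch using the Titanic columns from earlier in this guide (the variable names are only illustrative):

# Method chaining: one readable pipeline instead of many intermediate variables
adult_fares = (
    titanic
    .loc[titanic['age'] >= 18, ['class', 'sex', 'fare']]  # label-based selection with .loc
    .groupby(['class', 'sex'])['fare']
    .mean()
    .round(2)
)
print(adult_fares)

# Explicit copy plus .loc assignment avoids SettingWithCopyWarning
adults = titanic.loc[titanic['age'] >= 18].copy()
adults.loc[:, 'fare_with_tax'] = adults['fare'] * 1.1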

Common Pitfalls

# Problem 1: SettingWithCopyWarning
# Create a subset
bad_subset = titanic[titanic['age'] > 30]
# This might trigger SettingWithCopyWarning (if it's a view and not a copy)
bad_subset['fare'] = bad_subset['fare'] * 1.1

# Correct approach:
good_subset = titanic[titanic['age'] > 30].copy()
good_subset['fare'] = good_subset['fare'] * 1.1

# Problem 2: Using == with NaN values
# This won't work as expected
missing_ages = titanic[titanic['age'] == np.nan]  # Will return an empty dataframe
print(f"Wrong way to find NaN: {len(missing_ages)} rows")

# Correct approach:
missing_ages = titanic[titanic['age'].isna()]
print(f"Correct way to find NaN: {len(missing_ages)} rows")

# Problem 3: Forgetting that loc is inclusive on both ends
# This includes row indices 0, 1, 2, 3, 4, 5
first_six = titanic.loc[0:5]
print(f"titanic.loc[0:5] gives {len(first_six)} rows")

# iloc is exclusive on the upper bound
# This includes row indices 0, 1, 2, 3, 4
first_five = titanic.iloc[0:5]
print(f"titanic.iloc[0:5] gives {len(first_five)} rows")

# Problem 4: Ignoring missing values in comparisons
# Rows where 'age' is NaN are excluded from BOTH filters, so the counts don't add up to the total
adults = titanic[titanic['age'] >= 18]
children = titanic[titanic['age'] < 18]
print(f"Adults + Children = {len(adults) + len(children)}, Total = {len(titanic)}")
print("Missing values are excluded from both filters!")

Real-World Use Cases

Titanic Survival Prediction

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Prepare the data
titanic_model = titanic.copy()

# Handle missing values
titanic_model['age'] = titanic_model['age'].fillna(titanic_model['age'].median())
titanic_model['embarked'] = titanic_model['embarked'].fillna(titanic_model['embarked'].mode()[0])

# Create dummies for categorical variables
titanic_dummies = pd.get_dummies(
    titanic_model[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']],
    drop_first=True  # Drop one category to avoid multicollinearity
)

# Split features and target
X = titanic_dummies
y = titanic_model['survived']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred))

# Feature importance
features = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature importance:")
print(features.head(10))

Diamond Price Analysis

from sklearn.linear_model import LinearRegression

# Prepare the diamonds data
diamonds_model = diamonds.copy()

# Create dummies for categorical variables
diamonds_dummies = pd.get_dummies(
    diamonds_model[['carat', 'depth', 'table', 'x', 'y', 'z', 'cut', 'color', 'clarity']],
    columns=['cut', 'color', 'clarity'],
    drop_first=True
)

# Split features and target
X = diamonds_dummies
y = diamonds_model['price']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: ${mae:.2f}")
print(f"Root Mean Squared Error: ${rmse:.2f}")
print(f"R² Score: {r2:.4f}")

Conclusion

Pandas is an incredibly powerful library that forms the backbone of data analysis in Python. By mastering its capabilities, from basic operations to advanced techniques, you can efficiently handle any data processing task that comes your way.

In this guide, we’ve covered:

  • Basic operations for exploration and manipulation
  • Intermediate techniques for data transformation
  • Advanced functions for complex analysis
  • Visualization capabilities for communicating insights
  • Performance optimization for handling large datasets
  • Real-world applications showcasing pandas in action

Ready to take your data science skills to the next level?
Free Course to Master Data Science in 40 Days by Codanics

👨‍💻 Author: Dr. Muhammad Aammar Tufail

Essential Pandas Terms & Concepts

  • DataFrame: 2D labeled data structure with columns of potentially different types. Example: pd.DataFrame(data)
  • Series: 1D labeled array capable of holding any data type. Example: pd.Series([1, 2, 3])
  • Index: Labels for rows or columns in a DataFrame/Series. Example: df.index
  • GroupBy: Splitting data into groups for aggregation. Example: df.groupby('col').mean()
  • Pivot Table: Summarize data with multi-dimensional grouping. Example: df.pivot_table(values='A', index='B', columns='C')
  • Merge/Join: Combine DataFrames using keys/columns. Example: pd.merge(df1, df2, on='key')
  • loc / iloc: Label-based / integer-based selection. Example: df.loc[0:5], df.iloc[0:5]
  • NaN: Missing value marker in pandas. Example: np.nan, df.isna()
  • apply(): Apply a function along an axis of the DataFrame. Example: df.apply(np.sqrt)
  • astype(): Cast a pandas object to a specified dtype. Example: df['col'].astype('float')
  • value_counts(): Count unique values in a Series. Example: df['col'].value_counts()
  • dropna(): Remove missing values. Example: df.dropna()
  • fillna(): Fill missing values. Example: df.fillna(0)
  • sort_values(): Sort by the values along either axis. Example: df.sort_values('col')
  • query(): Query the columns of a DataFrame with a boolean expression. Example: df.query('col > 5')
  • crosstab(): Compute a simple cross-tabulation of two (or more) factors. Example: pd.crosstab(df['A'], df['B'])
  • read_csv(): Read a comma-separated values (CSV) file into a DataFrame. Example: pd.read_csv('file.csv')
  • to_csv(): Write a DataFrame to a CSV file. Example: df.to_csv('file.csv')