Pandas Mastery Guide: From Beginner to Pro 🐼
Pandas is the backbone of data analysis in Python, providing powerful, flexible, and efficient tools for real-world data manipulation and analysis. Whether you’re just starting your data science journey or looking to level up your skills, this comprehensive guide will take you from basic operations to advanced techniques.
In this Codanics masterclass, we’ll work with two popular datasets – the Titanic passenger data and the Diamonds dataset – to showcase practical applications at every skill level.

Watch our video: Pandas Python Library: The Ultimate Beginner’s Guide
Why Pandas Is Essential in 2024
- Universal Data Tool: Used by data scientists, analysts, engineers, and ML experts worldwide
- Versatile Data Handling: Effortlessly manages structured data of any size
- Industry Standard: Required skill for virtually all data-related positions
- Ecosystem Integration: Seamlessly connects with NumPy, Matplotlib, Scikit-learn, and other tools
- Time-saving: Automates repetitive data tasks that would take hours manually
Setting Up Your Environment
Let’s start by installing pandas and other necessary libraries:
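If you don't have them installed yet, one common way is with pip (a conda environment works just as well):
pip install pandas numpy matplotlib seaborn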
Now, let’s import the libraries we’ll use throughout this guide:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set styling for visualizations
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
# Display settings for better output readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
# For reproducibility
np.random.seed(42)
Understanding Our Datasets
We’ll be working with two well-known datasets throughout this guide:
1. Titanic Dataset
Contains data about Titanic passengers including demographics, cabin information, and survival status.
titanic = sns.load_dataset('titanic')
# Preview the data
print("Titanic Dataset Preview:")
print(titanic.head())
2. Diamonds Dataset
Contains attributes of almost 54,000 diamonds including price, carat, cut quality, color, and clarity.
diamonds = sns.load_dataset('diamonds')
# Preview the data
print("\nDiamonds Dataset Preview:")
print(diamonds.head())
Basic Operations with Pandas
Exploring Dataset Structure
print(titanic.info())
# Statistical summary
print(titanic.describe())
# Check for missing values
print("\nMissing values in Titanic dataset:")
print(titanic.isnull().sum())
# Dataset dimensions
print(f"\nTitanic dataset shape: {titanic.shape}")
# Column names
print(f"\nColumns: {titanic.columns.tolist()}")
The DataFrame, a two-dimensional, size-mutable, heterogeneous tabular structure with labeled rows and columns, is the core data structure in pandas.
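As a quick illustrative sketch (with made-up example values, not from our datasets), here's how a small DataFrame is built from a dictionary and how columns and rows are selected:
# Build a small DataFrame from a dictionary of columns
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 32, 41],
    'fare': [7.25, 71.28, 8.05]
})
print(df['age'])               # single column -> Series
print(df[['name', 'fare']])    # list of columns -> DataFrame
print(df.loc[0])               # row by index label
print(df.iloc[-1])             # row by integer position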
Intermediate Pandas Techniques
Working with Missing Data
Handling missing values (entries that are empty or not available, shown as NaN) is crucial for accurate analysis.
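Before the more tailored group-based imputation below, the two workhorse methods are dropna() and fillna(); a minimal sketch on the Titanic data loaded above:
# Drop rows containing any missing value (often too aggressive)
titanic_dropped = titanic.dropna()
print(f"Rows before: {len(titanic)}, after dropna: {len(titanic_dropped)}")
# Fill missing ages with a single overall median (simple, but ignores group differences)
titanic_simple = titanic.copy()
titanic_simple['age'] = titanic_simple['age'].fillna(titanic_simple['age'].median())
print(f"Missing ages after fillna: {titanic_simple['age'].isnull().sum()}")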
titanic_adv = titanic.copy()
# Impute age based on class and sex - more accurate than a simple overall median
age_medians = titanic.groupby(['sex', 'class'])['age'].median()
# Define function to assign median age based on passenger attributes
def fill_age(row):
    if pd.isnull(row['age']):
        return age_medians[row['sex'], row['class']]
    return row['age']
# Apply the function to each row
titanic_adv['age'] = titanic_adv.apply(fill_age, axis=1)
# Create age categories
titanic_adv['age_group'] = pd.cut(
    titanic_adv['age'],
    bins=[0, 12, 18, 35, 60, 100],
    labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior']
)
# Check the results
print("Age distribution by group after imputation:")
print(titanic_adv['age_group'].value_counts().sort_index())
Creating New Features
diamonds_fe = diamonds.copy()
# Create price per carat feature
diamonds_fe['price_per_carat'] = diamonds_fe['price'] / diamonds_fe['carat']
# Create simplified color categories
diamonds_fe['color_quality'] = diamonds_fe['color'].apply(
    lambda x: 'Premium' if x in ['D', 'E', 'F'] else 'Good' if x in ['G', 'H'] else 'Fair'
)
# Create volume estimate
diamonds_fe['volume'] = diamonds_fe['x'] * diamonds_fe['y'] * diamonds_fe['z']
# Create a categorical feature for price ranges
diamonds_fe['price_category'] = pd.qcut(
    diamonds_fe['price'],
    q=4,
    labels=['Budget', 'Medium', 'High', 'Luxury']
)
# Check new features
print(diamonds_fe[['carat', 'price', 'price_per_carat', 'color', 'color_quality',
                   'volume', 'price_category']].head(10))
Advanced Filtering and Selection
# Get top 5 most expensive diamonds
top_diamonds = diamonds.nlargest(5, 'price')
print("Top 5 most expensive diamonds:")
print(top_diamonds[['carat', 'cut', 'color', 'clarity', 'price']])
# Get 5 diamonds closest to the median price
median_price = diamonds['price'].median()
closest_to_median = diamonds.iloc[(diamonds['price'] - median_price).abs().argsort()[:5]]
print("\nDiamonds closest to median price:")
print(closest_to_median[['carat', 'cut', 'color', 'clarity', 'price']])
# Boolean masking with AND, OR, NOT conditions
ideal_bargains = diamonds[
    (diamonds['cut'] == 'Ideal') &
    (diamonds['carat'] > 1.5) &
    (diamonds['price'] < diamonds['price'].quantile(0.25) * 3)
]
print(f"\nNumber of 'Ideal' cut bargain diamonds: {len(ideal_bargains)}")
# Find outliers using IQR method
Q1 = diamonds['price'].quantile(0.25)
Q3 = diamonds['price'].quantile(0.75)
IQR = Q3 - Q1
price_outliers = diamonds[(diamonds['price'] < (Q1 - 1.5 * IQR)) |
                          (diamonds['price'] > (Q3 + 1.5 * IQR))]
print(f"\nNumber of price outliers: {len(price_outliers)}")
Pivot Tables and Reshaping Data
survival_pivot = titanic.pivot_table(
    values='survived',
    index='sex',
    columns='class',
    aggfunc='mean'
).round(2)
print("Survival rate by sex and class:")
print(survival_pivot)
# Create pivot table with multiple values and aggregations
complex_pivot = titanic.pivot_table(
    values=['age', 'fare'],
    index=['sex', 'survived'],
    columns='class',
    aggfunc={
        'age': ['mean', 'count'],
        'fare': ['mean', 'median']
    }
).round(1)
print("\nComplex pivot table:")
print(complex_pivot)
# Reshape data with melt (wide to long format)
# First create a simplified dataframe to demonstrate
pivot_example = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [90, 80, 70],
    'Science': [95, 85, 75],
    'History': [85, 75, 95]
})
# Melt to long format
melted = pd.melt(
    pivot_example,
    id_vars=['Name'],
    value_vars=['Math', 'Science', 'History'],
    var_name='Subject',
    value_name='Score'
)
print("\nOriginal 'wide' data:")
print(pivot_example)
print("\nMelted 'long' data:")
print(melted)
Advanced Pandas Mastery
Advanced Grouping Operations
GroupBy, which splits data into groups based on some criteria, applies a function, and combines the results, is essential for advanced aggregations.
# Find survival rate by age group and class
titanic_adv['survived'] = titanic['survived']  # Ensure we have the survived column
survival_by_age_class = titanic_adv.groupby(['age_group', 'class'])['survived'].agg(
    ['mean', 'count', 'sum']
).sort_values(by=['age_group', 'mean'], ascending=[True, False])
print("Survival rate by age group and class:")
print(survival_by_age_class)
# Custom aggregation function
def survival_stats(x):
    return pd.Series({
        'survival_rate': x.mean(),
        'survived_count': x.sum(),
        'total_count': len(x),
        'survival_stderr': x.std() / np.sqrt(len(x)) if len(x) > 0 else 0
    })
# Apply custom aggregation
detailed_survival = titanic.groupby(['sex', 'class'])['survived'].apply(survival_stats)
print("\nDetailed survival statistics:")
print(detailed_survival)
# Using transform for grouped operations while keeping the original dataframe size
# Add column showing mean fare for each passenger's class
titanic_adv['class_mean_fare'] = titanic.groupby('class')['fare'].transform('mean')
# Add survival rate by class
titanic_adv['class_survival_rate'] = titanic.groupby('class')['survived'].transform('mean')
print("\nTransformed data (showing mean fare and survival rate by class):")
print(titanic_adv[['class', 'fare', 'class_mean_fare', 'survived', 'class_survival_rate']].head())
Advanced Merging and Joining
# The seaborn Titanic data has no name column, so create an explicit passenger_id key to join on
titanic_keyed = titanic.reset_index().rename(columns={'index': 'passenger_id'})
passengers = titanic_keyed[['passenger_id', 'sex', 'age', 'survived']].head(5)
fares = titanic_keyed[['passenger_id', 'fare', 'embarked']].head(7)  # Note: 2 more rows than passengers
cabins = pd.DataFrame({
    'passenger_id': titanic_keyed['passenger_id'].iloc[[0, 1, 3, 5, 6]],  # Mixed overlap with passengers
    'cabin': ['C123', 'E46', 'A10', 'D33', 'B22']
})
print("Passengers:")
print(passengers)
print("\nFares:")
print(fares)
print("\nCabins:")
print(cabins)
# Inner join - only matching rows
inner_join = passengers.merge(cabins, on='passenger_id', how='inner')
print("\nInner join result:")
print(inner_join)
# Left join - all rows from the left dataframe
left_join = passengers.merge(cabins, on='passenger_id', how='left')
print("\nLeft join result:")
print(left_join)
# Outer join - all rows from both dataframes
outer_join = passengers.merge(cabins, on='passenger_id', how='outer')
print("\nOuter join result:")
print(outer_join)
# Multiple joins
complete_info = passengers.merge(
    fares, on='passenger_id', how='left'
).merge(
    cabins, on='passenger_id', how='left'
)
print("\nMultiple joins result:")
print(complete_info)
Window Functions and Rolling Calculations
sorted_diamonds = diamonds.sort_values('price')
# Calculate rolling statistics
sorted_diamonds['rolling_avg_10'] = sorted_diamonds['price'].rolling(window=10).mean()
sorted_diamonds['rolling_std_10'] = sorted_diamonds['price'].rolling(window=10).std()
# Calculate cumulative statistics
sorted_diamonds['cum_max'] = sorted_diamonds['price'].cummax()
sorted_diamonds['cum_min'] = sorted_diamonds['price'].cummin()
print("Rolling and cumulative calculations:")
print(sorted_diamonds[['carat', 'price', 'rolling_avg_10', 'rolling_std_10',
                       'cum_max', 'cum_min']].head(15))
# Expanding window calculations
sorted_diamonds['exp_mean'] = sorted_diamonds['price'].expanding().mean()
# Adding ranks
sorted_diamonds['price_rank'] = sorted_diamonds['price'].rank(method='min')
sorted_diamonds['price_dense_rank'] = sorted_diamonds['price'].rank(method='dense')
sorted_diamonds['price_percent_rank'] = sorted_diamonds['price'].rank(pct=True)
print("\nRanking examples:")
print(sorted_diamonds[['price', 'price_rank', 'price_dense_rank',
                       'price_percent_rank']].head(10))
Working with Time Series Data
date_range = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
# Create time series data
time_series = pd.DataFrame({
    'date': date_range,
    'value': np.random.normal(100, 10, size=len(date_range))
})
# Set date as index
time_series.set_index('date', inplace=True)
print("Time series data:")
print(time_series.head())
# Resampling - aggregate to weekly frequency
weekly = time_series.resample('W').mean()
print("\nWeekly average:")
print(weekly)
# Shifting data (lag and lead)
time_series['previous_day'] = time_series['value'].shift(1)
time_series['next_day'] = time_series['value'].shift(-1)
time_series['day_over_day_change'] = time_series['value'] - time_series['previous_day']
time_series['day_over_day_pct_change'] = time_series['value'].pct_change() * 100
print("\nTime series with shifts and changes:")
print(time_series)
Advanced Function Application
def price_analysis(diamond):
    """Analyze if a diamond is good value based on multiple criteria"""
    price_per_carat = diamond['price'] / diamond['carat']
    avg_price_per_carat = diamonds.loc[diamonds['cut'] == diamond['cut'], 'price'].sum() / \
                          diamonds.loc[diamonds['cut'] == diamond['cut'], 'carat'].sum()
    return pd.Series({
        'price_per_carat': price_per_carat,
        'avg_for_cut': avg_price_per_carat,
        'value_ratio': price_per_carat / avg_price_per_carat,
        'good_value': price_per_carat < avg_price_per_carat * 0.9
    })
# Apply the function to a sample of diamonds
diamonds_sample = diamonds.sample(5)
value_analysis = diamonds_sample.apply(price_analysis, axis=1)
# Combine with original data
diamonds_with_analysis = pd.concat([diamonds_sample[['carat', 'cut', 'price']], value_analysis], axis=1)
print("Diamond value analysis:")
print(diamonds_with_analysis)
# Using vectorized operations (much faster than apply)
diamonds_vector = diamonds.sample(1000).copy()
diamonds_vector['price_per_carat'] = diamonds_vector['price'] / diamonds_vector['carat']
cut_avg_prices = diamonds_vector.groupby('cut')['price'].sum() / diamonds_vector.groupby('cut')['carat'].sum()
# Map the averages to each diamond based on its cut
diamonds_vector['avg_for_cut'] = diamonds_vector['cut'].map(cut_avg_prices)
diamonds_vector['value_ratio'] = diamonds_vector['price_per_carat'] / diamonds_vector['avg_for_cut']
diamonds_vector['good_value'] = diamonds_vector['price_per_carat'] < diamonds_vector['avg_for_cut'] * 0.9
print("\nVectorized operations result (faster than apply):")
print(diamonds_vector[['carat', 'cut', 'price', 'price_per_carat',
                       'avg_for_cut', 'value_ratio', 'good_value']].head())
Data Visualization with Pandas
Pandas provides easy-to-use plotting capabilities built on Matplotlib. Let’s explore how to visualize our datasets:
Basic Visualizations
titanic['age'].plot(kind='hist', bins=20, title='Age Distribution on Titanic')
plt.xlabel('Age')
plt.ylabel('Count')
plt.savefig('titanic_age_histogram.png')
plt.close()
# Basic bar chart
titanic['class'].value_counts().plot(kind='bar', title='Passenger Count by Class')
plt.xlabel('Class')
plt.ylabel('Count')
plt.savefig('titanic_class_bar.png')
plt.close()
# Using seaborn with pandas dataframes
plt.figure(figsize=(10, 6))
sns.countplot(data=titanic, x='class', hue='survived')
plt.title('Survival by Passenger Class')
plt.savefig('titanic_survival_by_class.png')
plt.close()
# Scatter plot for diamonds
diamonds.sample(1000).plot.scatter(x='carat', y='price', alpha=0.5, figsize=(10, 6))
plt.title('Diamond Price vs. Carat')
plt.savefig('diamond_price_carat.png')
plt.close()
Advanced Visualizations
survival_pct = pd.crosstab(
    index=titanic['class'],
    columns=titanic['sex'],
    values=titanic['survived'],
    aggfunc='mean'
).round(2) * 100
# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(survival_pct, annot=True, cmap='YlGnBu', fmt='g')
plt.title('Survival Rate (%) by Class and Sex')
plt.savefig('titanic_survival_heatmap.png')
plt.close()
# Boxplot of diamond prices by cut
plt.figure(figsize=(12, 6))
sns.boxplot(x='cut', y='price', data=diamonds)
plt.title('Diamond Price Distribution by Cut')
plt.yscale('log')  # Log scale for better visualization
plt.savefig('diamond_price_by_cut.png')
plt.close()
# Create a pair plot using seaborn
sns.pairplot(diamonds.sample(1000)[['carat', 'depth', 'table', 'price']])
plt.savefig('diamond_pairplot.png')
plt.close()
# Create a custom grouped bar chart
survival_by_class_sex = titanic.pivot_table(
    index='class',
    columns='sex',
    values='survived',
    aggfunc='mean'
)
survival_by_class_sex.plot(kind='bar', figsize=(10, 6))
plt.title('Survival Rate by Class and Sex')
plt.ylabel('Survival Rate')
plt.xticks(rotation=0)
plt.savefig('titanic_survival_by_class_sex.png')
plt.close()
Performance Optimization
When working with larger datasets, optimizing your code becomes crucial. Here are some techniques to make your pandas operations faster:
import time
# Create a larger dataframe for demonstration
large_df = pd.DataFrame(np.random.rand(100000, 4), columns=list('ABCD'))
# 1. Using .apply() (slower)
start_time = time.time()
result1 = large_df.apply(lambda x: x['A'] * x['B'] + x['C'] - x['D'], axis=1)
apply_time = time.time() - start_time
print(f"Time using .apply(): {apply_time:.4f} seconds")
# 2. Using vectorized operations (faster)
start_time = time.time()
result2 = large_df['A'] * large_df['B'] + large_df['C'] - large_df['D']
vector_time = time.time() - start_time
print(f"Time using vectorized operations: {vector_time:.4f} seconds")
print(f"Speedup factor: {apply_time / vector_time:.1f}x")
# 3. Using .eval() for complex expressions (can be faster for large datasets)
start_time = time.time()
result3 = large_df.eval('A * B + C - D')
eval_time = time.time() - start_time
print(f"Time using .eval(): {eval_time:.4f} seconds")
# Memory optimization with categoricals
# For string columns with repetitive values
titanic_optimized = titanic.copy()
titanic_optimized['sex'] = titanic_optimized['sex'].astype('category')
titanic_optimized['class'] = titanic_optimized['class'].astype('category')
titanic_optimized['embark_town'] = titanic_optimized['embark_town'].astype('category')
# Compare memory usage
print(f"\nOriginal memory usage: {titanic.memory_usage(deep=True).sum() / 1024:.2f} KB")
print(f"Optimized memory usage: {titanic_optimized.memory_usage(deep=True).sum() / 1024:.2f} KB")
# Chunking for processing large files
def process_in_chunks(filename, chunk_size=10000):
    chunk_list = []
    # This is just a demonstration - replace with your actual CSV file
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Process each chunk (example: filter rows with age > 30)
        processed = chunk[chunk['age'] > 30]
        chunk_list.append(processed)
    # Combine all processed chunks
    return pd.concat(chunk_list)
# Example usage (commented out as we don't have the file)
# result = process_in_chunks('large_dataset.csv', chunk_size=50000)
Best Practices & Common Pitfalls
Best Practices
- Use Vectorized Operations: Always prefer vectorized operations over loops or apply when possible
- Chain Methods: Take advantage of method chaining for cleaner code (see the sketch after this list)
- Set a Copy Warning: Use pd.set_option('mode.chained_assignment', 'warn') to catch potential issues
- Use .loc and .iloc: Prefer these for data selection to avoid SettingWithCopyWarning
- Make Explicit Copies: Use .copy() when creating a new dataframe from a slice
- Use Appropriate Data Types: Convert string columns with few unique values to categorical types
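As a small sketch of method chaining on the Titanic data loaded earlier (column names follow the seaborn dataset), here is one readable pipeline that filters, derives a column, aggregates, and sorts:
fare_summary = (
    titanic
    .dropna(subset=['age'])                        # keep rows with a known age
    .assign(is_child=lambda df: df['age'] < 18)    # derive a boolean column
    .groupby(['class', 'is_child'], observed=True)['fare']
    .mean()
    .sort_values(ascending=False)
)
print(fare_summary)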
Common Pitfalls
# Problem 1: SettingWithCopyWarning from chained assignment
# Create a subset
bad_subset = titanic[titanic['age'] > 30]
# This might raise a warning (if it's a view and not a copy)
bad_subset['fare'] = bad_subset['fare'] * 1.1
# Correct approach:
good_subset = titanic[titanic['age'] > 30].copy()
good_subset['fare'] = good_subset['fare'] * 1.1
# Problem 2: Using == with NaN values
# This won't work as expected
missing_ages = titanic[titanic['age'] == np.nan]  # Will return an empty dataframe
print(f"Wrong way to find NaN: {len(missing_ages)} rows")
# Correct approach:
missing_ages = titanic[titanic['age'].isna()]
print(f"Correct way to find NaN: {len(missing_ages)} rows")
# Problem 3: Forgetting that loc is inclusive on both ends
# This includes row indices 0, 1, 2, 3, 4, 5
first_six = titanic.loc[0:5]
print(f"titanic.loc[0:5] gives {len(first_six)} rows")
# iloc is exclusive on the upper bound
# This includes row indices 0, 1, 2, 3, 4
first_five = titanic.iloc[0:5]
print(f"titanic.iloc[0:5] gives {len(first_five)} rows")
# Problem 4: Ignoring missing values when filtering
# Rows where 'age' is NaN silently drop out of both filters
adults = titanic[titanic['age'] >= 18]
children = titanic[titanic['age'] < 18]
print(f"Adults + Children = {len(adults) + len(children)}, Total = {len(titanic)}")
print("Missing values are excluded from both filters!")
Real-World Use Cases
Titanic Survival Prediction
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Prepare the data
titanic_model = titanic.copy()
# Handle missing values
titanic_model['age'] = titanic_model['age'].fillna(titanic_model['age'].median())
titanic_model['embarked'] = titanic_model['embarked'].fillna(titanic_model['embarked'].mode()[0])
# Create dummies for categorical variables
titanic_dummies = pd.get_dummies(
    titanic_model[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']],
    drop_first=True  # Drop one category to avoid multicollinearity
)
# Split features and target
X = titanic_dummies
y = titanic_model['survived']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred))
# Feature importance
features = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)
print("\nFeature importance:")
print(features.head(10))
Diamond Price Analysis
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Prepare the diamonds data
diamonds_model = diamonds.copy()
# Create dummies for categorical variables
diamonds_dummies = pd.get_dummies(
    diamonds_model[['carat', 'depth', 'table', 'x', 'y', 'z', 'cut', 'color', 'clarity']],
    columns=['cut', 'color', 'clarity'],
    drop_first=True
)
# Split features and target
X = diamonds_dummies
y = diamonds_model['price']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train a model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error: ${mae:.2f}")
print(f"Root Mean Squared Error: ${rmse:.2f}")
print(f"R² Score: {r2:.4f}")
Conclusion
Pandas is an incredibly powerful library that forms the backbone of data analysis in Python. By mastering its capabilities, from basic operations to advanced techniques, you can efficiently handle any data processing task that comes your way.
In this guide, we’ve covered:
- Basic operations for exploration and manipulation
- Intermediate techniques for data transformation
- Advanced functions for complex analysis
- Visualization capabilities for communicating insights
- Performance optimization for handling large datasets
- Real-world applications showcasing pandas in action
Ready to take your data science skills to the next level?
Free Course: Master Data Science in 40 Days by Codanics
Quick reference: key pandas terms used in this guide.

| Term | Description | Example |
|---|---|---|
| DataFrame | 2D labeled data structure with columns of potentially different types | pd.DataFrame(data) |
| Series | 1D labeled array capable of holding any data type | pd.Series([1, 2, 3]) |
| Index | Labels for rows or columns in a DataFrame/Series | df.index |
| GroupBy | Split data into groups for aggregation | df.groupby('col').mean() |
| Pivot Table | Summarize data with multi-dimensional grouping | df.pivot_table(values='A', index='B', columns='C') |
| Merge/Join | Combine DataFrames using keys/columns | pd.merge(df1, df2, on='key') |
| loc / iloc | Label-based / integer-based selection | df.loc[0:5], df.iloc[0:5] |
| NaN | Missing value marker in pandas | np.nan, df.isna() |
| apply() | Apply a function along an axis of the DataFrame | df.apply(np.sqrt) |
| astype() | Cast a pandas object to a specified dtype | df['col'].astype('float') |
| value_counts() | Count unique values in a Series | df['col'].value_counts() |
| dropna() | Remove missing values | df.dropna() |
| fillna() | Fill missing values | df.fillna(0) |
| sort_values() | Sort by the values along either axis | df.sort_values('col') |
| query() | Query the columns of a DataFrame with a boolean expression | df.query('col > 5') |
| crosstab() | Compute a simple cross-tabulation of two (or more) factors | pd.crosstab(df['A'], df['B']) |
| read_csv() | Read a comma-separated values (CSV) file into a DataFrame | pd.read_csv('file.csv') |
| to_csv() | Write a DataFrame to a CSV file | df.to_csv('file.csv') |
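Two entries from the table that weren't demonstrated earlier, query() and crosstab(), shown in a quick illustrative sketch on the datasets used in this guide:
# query(): filter rows with a boolean expression string
cheap_ideal = diamonds.query("cut == 'Ideal' and price < 1000")
print(f"Ideal-cut diamonds under $1000: {len(cheap_ideal)}")
# crosstab(): passenger counts by class and survival
print(pd.crosstab(titanic['class'], titanic['survived']))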
