Pandas Mastery Guide: From Beginner to Pro 🐼

Pandas is the backbone of data analysis in Python, providing powerful, flexible, and efficient tools for real-world data manipulation and analysis. Whether you're just starting your data science journey or looking to level up your skills, this comprehensive guide will take you from basic operations to advanced techniques. In this Codanics masterclass, we'll work with two popular datasets - the Titanic passenger data and the Diamonds dataset - to showcase practical applications at every skill level.

Watch our video: Pandas Python Library: The Ultimate Beginner's Guide

Table of Contents
- Why Pandas Is Essential in 2024
- Setting Up Your Environment
- Understanding Our Datasets
- Basic Operations with Pandas
- Intermediate Pandas Techniques
- Advanced Pandas Mastery
- Data Visualization with Pandas
- Performance Optimization
- Best Practices & Common Pitfalls
- Real-World Use Cases
- Conclusion

Why Pandas Is Essential in 2024
- Universal Data Tool: Used by data scientists, analysts, engineers, and ML experts worldwide
- Versatile Data Handling: Effortlessly manages structured data of any size
- Industry Standard: Required skill for virtually all data-related positions
- Ecosystem Integration: Seamlessly connects with NumPy (arrays and numerical computing), Matplotlib (visualization), Scikit-learn (machine learning), and other tools
- Time-saving: Automates repetitive data tasks that would take hours manually

Setting Up Your Environment

Let's start by installing pandas and other necessary libraries:

pip install pandas numpy matplotlib seaborn scikit-learn

Now, let's import the libraries we'll use throughout this guide:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set styling for visualizations
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Display settings for better output readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# For reproducibility
np.random.seed(42)

Understanding Our Datasets

We'll be working with two well-known datasets throughout this guide:

1. Titanic Dataset
Contains data about Titanic passengers including demographics, cabin information, and survival status.

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Preview the data
print("Titanic Dataset Preview:")
print(titanic.head())
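If sns.load_dataset() cannot reach its bundled data (for example, in an offline or restricted environment), the same table can be read with pd.read_csv(). The URL below points at the public seaborn-data mirror and is an assumption - substitute a local path to your own copy if needed.

# Fallback loading with read_csv (mirror URL is assumed; a local 'titanic.csv' works the same way)
titanic_from_csv = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
print(titanic_from_csv.head())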
2. Diamonds Dataset
Contains attributes of almost 54,000 diamonds including price, carat, cut quality, color, and clarity.

# Load the Diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Preview the data
print("\nDiamonds Dataset Preview:")
print(diamonds.head())

Basic Operations with Pandas

Exploring Dataset Structure

# Basic information about the dataset
print(titanic.info())

# Statistical summary
print(titanic.describe())

# Check for missing values
print("\nMissing values in Titanic dataset:")
print(titanic.isnull().sum())

# Dataset dimensions
print(f"\nTitanic dataset shape: {titanic.shape}")

# Column names
print(f"\nColumns: {titanic.columns.tolist()}")

The DataFrame - a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled axes (rows and columns) - is the core data structure in pandas.

Intermediate Pandas Techniques

Working with Missing Data

Handling missing values (entries in your dataset that are empty or not available, shown as NaN) is crucial for accurate analysis.

# Create more sophisticated missing value imputation
titanic_adv = titanic.copy()

# Impute age based on class and sex - more accurate than simple median
age_medians = titanic.groupby(['sex', 'class'])['age'].median()

# Define function to assign median age based on passenger attributes
def fill_age(row):
    if pd.isnull(row['age']):
        return age_medians[row['sex'], row['class']]
    return row['age']

# Apply the function to each row
titanic_adv['age'] = titanic_adv.apply(fill_age, axis=1)

# Create age categories
titanic_adv['age_group'] = pd.cut(
    titanic_adv['age'],
    bins=[0, 12, 18, 35, 60, 100],
    labels=['Child', 'Teen', 'Young Adult', 'Adult', 'Senior']
)

# Check the results
print("Age distribution by group after imputation:")
print(titanic_adv['age_group'].value_counts().sort_index())

Creating New Features

# Feature engineering with the diamonds dataset
diamonds_fe = diamonds.copy()

# Create price per carat feature
diamonds_fe['price_per_carat'] = diamonds_fe['price'] / diamonds_fe['carat']

# Create simplified color categories
diamonds_fe['color_quality'] = diamonds_fe['color'].apply(
    lambda x: 'Premium' if x in ['D', 'E', 'F'] else 'Good' if x in ['G', 'H'] else 'Fair'
)

# Create volume estimate
diamonds_fe['volume'] = diamonds_fe['x'] * diamonds_fe['y'] * diamonds_fe['z']

# Create a categorical feature for price ranges
diamonds_fe['price_category'] = pd.qcut(
    diamonds_fe['price'],
    q=4,
    labels=['Budget', 'Medium', 'High', 'Luxury']
)

# Check new features
print(diamonds_fe[['carat', 'price', 'price_per_carat', 'color', 'color_quality', 'volume', 'price_category']].head(10))

Advanced Filtering and Selection

# Advanced selection techniques with Diamonds dataset

# Get top 5 most expensive diamonds
top_diamonds = diamonds.nlargest(5, 'price')
print("Top 5 most expensive diamonds:")
print(top_diamonds[['carat', 'cut', 'color', 'clarity', 'price']])

# Get 5 diamonds closest to the median price
median_price = diamonds['price'].median()
closest_to_median = diamonds.iloc[(diamonds['price'] - median_price).abs().argsort()[:5]]
print("\nDiamonds closest to median price:")
print(closest_to_median[['carat', 'cut', 'color', 'clarity', 'price']])

# Boolean masking with AND, OR, NOT conditions
ideal_bargains = diamonds[
    (diamonds['cut'] == 'Ideal') &
    (diamonds['carat'] > 1.5) &
    (diamonds['price'] < diamonds['price'].quantile(0.25) * 3)
]
print(f"\nNumber of 'ideal' cut bargain diamonds: {len(ideal_bargains)}")

# Find outliers using IQR method
Q1 = diamonds['price'].quantile(0.25)
Q3 = diamonds['price'].quantile(0.75)
IQR = Q3 - Q1
price_outliers = diamonds[(diamonds['price'] < (Q1 - 1.5 * IQR)) | (diamonds['price'] > (Q3 + 1.5 * IQR))]
print(f"\nNumber of price outliers: {len(price_outliers)}")
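The same multi-condition filters can also be written with DataFrame.query(), which keeps the expression readable as the number of conditions grows. Here is a minimal sketch equivalent to the 'ideal bargains' mask above (price_cap is a helper variable introduced for this example):

# query() version of the boolean mask; @price_cap refers to the local Python variable
price_cap = diamonds['price'].quantile(0.25) * 3
ideal_bargains_q = diamonds.query("cut == 'Ideal' and carat > 1.5 and price < @price_cap")
print(f"Ideal cut bargains via query(): {len(ideal_bargains_q)}")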
Pivot Tables and Reshaping Data

# Create pivot table - survival rates by sex and class
survival_pivot = titanic.pivot_table(
    values='survived',
    index='sex',
    columns='class',
    aggfunc='mean'
).round(2)

print("Survival rate by sex and class:")
print(survival_pivot)

# Create pivot table with multiple values and aggregations
complex_pivot = titanic.pivot_table(
    values=['age', 'fare'],
    index=['sex', 'survived'],
    columns='class',
    aggfunc={
        'age': ['mean', 'count'],
        'fare': ['mean', 'median']
    }
).round(1)

print("\nComplex pivot table:")
print(complex_pivot)

# Reshape data with melt (wide to long format)
# First create a simplified dataframe to demonstrate
pivot_example = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [90, 80, 70],
    'Science': [95, 85, 75],
    'History': [85, 75, 95]
})

# Melt to long format
melted = pd.melt(
    pivot_example,
    id_vars=['Name'],
    value_vars=['Math', 'Science', 'History'],
    var_name='Subject',
    value_name='Score'
)

print("\nOriginal 'wide' data:")
print(pivot_example)
print("\nMelted 'long' data:")
print(melted)

Advanced Pandas Mastery

Advanced Grouping Operations

GroupBy - a pandas method that lets you split data into groups based on some criteria, apply a function, and combine the results - is essential for advanced aggregations.

# Complex groupby operations
# Find survival rate by age group and class
titanic_adv['survived'] = titanic['survived']  # Ensure we have the survived column
survival_by_age_class = titanic_adv.groupby(['age_group', 'class'])['survived'].agg(
    ['mean', 'count', 'sum']
).sort_values(by=['age_group', 'mean'], ascending=[True, False])

print("Survival rate by age group and class:")
print(survival_by_age_class)

# Custom aggregation function
def survival_stats(x):
    return pd.Series({
        'survival_rate': x.mean(),
        'survived_count': x.sum(),
        'total_count': len(x),
        'survival_stderr': x.std() / np.sqrt(len(x)) if len(x) > 0 else 0
    })

# Apply custom aggregation
detailed_survival = titanic.groupby(['sex', 'class'])['survived'].apply(survival_stats)
print("\nDetailed survival statistics:")
print(detailed_survival)

# Using transform for grouped operations while maintaining the original dataframe size
# Add column showing mean fare for each passenger's class
titanic_adv['class_mean_fare'] = titanic.groupby('class')['fare'].transform('mean')

# Add survival rate by class
titanic_adv['class_survival_rate'] = titanic.groupby('class')['survived'].transform('mean')

print("\nTransformed data (showing mean fare and survival rate by class):")
print(titanic_adv[['class', 'fare', 'class_mean_fare', 'survived', 'class_survival_rate']].head())
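As a side note, recent pandas versions (0.25+) also support named aggregation, which produces the same kind of multi-column summary as the custom survival_stats helper above without defining a separate function. A minimal sketch:

# Named aggregation: each keyword argument becomes a column in the result
named_summary = titanic.groupby(['sex', 'class'])['survived'].agg(
    survival_rate='mean',
    survived_count='sum',
    total_count='count'
)
print(named_summary)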
Advanced Merging and Joining

# Create sample dataframes for demonstration
# Note: seaborn's titanic data has no passenger name column, so we expose the
# row index as a 'passenger_id' key to join on instead
passengers = titanic[['sex', 'age', 'survived']].head(5).reset_index().rename(columns={'index': 'passenger_id'})
fares = titanic[['fare', 'embarked']].head(7).reset_index().rename(columns={'index': 'passenger_id'})  # Note: 2 more rows than passengers
cabins = pd.DataFrame({
    'passenger_id': [0, 1, 3, 5, 6],  # Mixed overlap with passengers
    'cabin': ['C123', 'E46', 'A10', 'D33', 'B22']
})

print("Passengers:")
print(passengers)
print("\nFares:")
print(fares)
print("\nCabins:")
print(cabins)

# Inner join - only matching rows
inner_join = passengers.merge(cabins, on='passenger_id', how='inner')
print("\nInner join result:")
print(inner_join)

# Left join - all rows from left dataframe
left_join = passengers.merge(cabins, on='passenger_id', how='left')
print("\nLeft join result:")
print(left_join)

# Outer join - all rows from both dataframes
outer_join = passengers.merge(cabins, on='passenger_id', how='outer')
print("\nOuter join result:")
print(outer_join)

# Multiple joins
complete_info = passengers.merge(
    fares, on='passenger_id', how='left'
).merge(
    cabins, on='passenger_id', how='left'
)
print("\nMultiple joins result:")
print(complete_info)

Window Functions and Rolling Calculations

# Sort diamonds by price
sorted_diamonds = diamonds.sort_values('price')

# Calculate rolling statistics
sorted_diamonds['rolling_avg_10'] = sorted_diamonds['price'].rolling(window=10).mean()
sorted_diamonds['rolling_std_10'] = sorted_diamonds['price'].rolling(window=10).std()

# Calculate cumulative statistics
sorted_diamonds['cum_max'] = sorted_diamonds['price'].cummax()
sorted_diamonds['cum_min'] = sorted_diamonds['price'].cummin()

print("Rolling and cumulative calculations:")
print(sorted_diamonds[['carat', 'price', 'rolling_avg_10', 'rolling_std_10', 'cum_max', 'cum_min']].head(15))

# Expanding window calculations
sorted_diamonds['exp_mean'] = sorted_diamonds['price'].expanding().mean()

# Adding ranks
sorted_diamonds['price_rank'] = sorted_diamonds['price'].rank(method='min')
sorted_diamonds['price_dense_rank'] = sorted_diamonds['price'].rank(method='dense')
sorted_diamonds['price_percent_rank'] = sorted_diamonds['price'].rank(pct=True)

print("\nRanking examples:")
print(sorted_diamonds[['price', 'price_rank', 'price_dense_rank', 'price_percent_rank']].head(10))

Working with Time Series Data

# Create a date range
date_range = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')

# Create time series data
time_series = pd.DataFrame({
    'date': date_range,
    'value': np.random.normal(100, 10, size=len(date_range))
})

# Set date as index
time_series.set_index('date', inplace=True)

print("Time series data:")
print(time_series.head())

# Resampling - aggregate to weekly frequency
weekly = time_series.resample('W').mean()
print("\nWeekly average:")
print(weekly)

# Shifting data (lag and lead)
time_series['previous_day'] = time_series['value'].shift(1)
time_series['next_day'] = time_series['value'].shift(-1)
time_series['day_over_day_change'] = time_series['value'] - time_series['previous_day']
time_series['day_over_day_pct_change'] = time_series['value'].pct_change() * 100

print("\nTime series with shifts and changes:")
print(time_series)
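Real-world data rarely arrives with a ready-made DatetimeIndex, so it is worth knowing pd.to_datetime and the .dt accessor as well. The sketch below parses string dates before indexing; the column names and values are illustrative, not part of our two datasets.

# Parsing string dates and extracting calendar features (illustrative data)
raw = pd.DataFrame({
    'date': ['2023-01-05', '2023-01-06', '2023-01-07'],
    'value': [101.2, 99.8, 103.5]
})
raw['date'] = pd.to_datetime(raw['date'])     # strings -> datetime64
raw['weekday'] = raw['date'].dt.day_name()    # calendar feature via the .dt accessor
raw = raw.set_index('date')
print(raw)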
Advanced Function Application

# Using apply with a custom function
def price_analysis(diamond):
    """Analyze if a diamond is good value based on multiple criteria"""
    price_per_carat = diamond['price'] / diamond['carat']
    avg_price_per_carat = diamonds.loc[diamonds['cut'] == diamond['cut'], 'price'].sum() / \
                          diamonds.loc[diamonds['cut'] == diamond['cut'], 'carat'].sum()
    return pd.Series({
        'price_per_carat': price_per_carat,
        'avg_for_cut': avg_price_per_carat,
        'value_ratio': price_per_carat / avg_price_per_carat,
        'good_value': price_per_carat < avg_price_per_carat * 0.9
    })

# Apply the function to a sample of diamonds
diamonds_sample = diamonds.sample(5)
value_analysis = diamonds_sample.apply(price_analysis, axis=1)

# Combine with original data
diamonds_with_analysis = pd.concat([diamonds_sample[['carat', 'cut', 'price']], value_analysis], axis=1)
print("Diamond value analysis:")
print(diamonds_with_analysis)

# Using vectorized operations (much faster than apply)
diamonds_vector = diamonds.sample(1000).copy()
diamonds_vector['price_per_carat'] = diamonds_vector['price'] / diamonds_vector['carat']
cut_avg_prices = diamonds_vector.groupby('cut')['price'].sum() / diamonds_vector.groupby('cut')['carat'].sum()

# Map the averages to each diamond based on its cut
diamonds_vector['avg_for_cut'] = diamonds_vector['cut'].map(cut_avg_prices)
diamonds_vector['value_ratio'] = diamonds_vector['price_per_carat'] / diamonds_vector['avg_for_cut']
diamonds_vector['good_value'] = diamonds_vector['price_per_carat'] < diamonds_vector['avg_for_cut'] * 0.9

print("\nVectorized operations result (faster than apply):")
print(diamonds_vector[['carat', 'cut', 'price', 'price_per_carat', 'avg_for_cut', 'value_ratio', 'good_value']].head())

Data Visualization with Pandas

Pandas provides easy-to-use plotting capabilities built on Matplotlib. Let's explore how to visualize our datasets:

Basic Visualizations

# Basic histogram
titanic['age'].plot(kind='hist', bins=20, title='Age Distribution on Titanic')
plt.xlabel('Age')
plt.ylabel('Count')
plt.savefig('titanic_age_histogram.png')
plt.close()

# Basic bar chart
titanic['class'].value_counts().plot(kind='bar', title='Passenger Count by Class')
plt.xlabel('Class')
plt.ylabel('Count')
plt.savefig('titanic_class_bar.png')
plt.close()

# Using seaborn with pandas dataframes
plt.figure(figsize=(10, 6))
sns.countplot(data=titanic, x='class', hue='survived')
plt.title('Survival by Passenger Class')
plt.savefig('titanic_survival_by_class.png')
plt.close()

# Scatter plot for diamonds
diamonds.sample(1000).plot.scatter(x='carat', y='price', alpha=0.5, figsize=(10, 6))
plt.title('Diamond Price vs. Carat')
plt.savefig('diamond_price_carat.png')
plt.close()

Advanced Visualizations

# Create a pivot table for visualization
survival_pct = pd.crosstab(
    index=titanic['class'],
    columns=titanic['sex'],
    values=titanic['survived'],
    aggfunc='mean'
).round(2) * 100

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(survival_pct, annot=True, cmap='YlGnBu', fmt='g')
plt.title('Survival Rate (%) by Class and Sex')
plt.savefig('titanic_survival_heatmap.png')
plt.close()

# Boxplot of diamond prices by cut
plt.figure(figsize=(12, 6))
sns.boxplot(x='cut', y='price', data=diamonds)
plt.title('Diamond Price Distribution by Cut')
plt.yscale('log')  # Log scale for better visualization
plt.savefig('diamond_price_by_cut.png')
plt.close()

# Create a pair plot using seaborn
sns.pairplot(diamonds.sample(1000)[['carat', 'depth', 'table', 'price']])
plt.savefig('diamond_pairplot.png')
plt.close()

# Create a custom grouped bar chart
survival_by_class_sex = titanic.pivot_table(
    index='class',
    columns='sex',
    values='survived',
    aggfunc='mean'
)

survival_by_class_sex.plot(kind='bar', figsize=(10, 6))
plt.title('Survival Rate by Class and Sex')
plt.ylabel('Survival Rate')
plt.xticks(rotation=0)
plt.savefig('titanic_survival_by_class_sex.png')
plt.close()
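One more pandas plotting convenience worth knowing before we move on: DataFrame.hist() draws a histogram for every numeric column in a single call, which is handy for a quick first look at a new dataset. A minimal sketch, following the same save-and-close pattern used above:

# Histograms of all numeric columns in one call
diamonds.sample(1000).hist(figsize=(12, 8), bins=20)
plt.tight_layout()
plt.savefig('diamond_histograms.png')
plt.close()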
Performance Optimization

When working with larger datasets, optimizing your code becomes crucial. Here are some techniques to make your pandas operations faster:

# Speed comparison example
import time

# Create a larger dataframe for demonstration
large_df = pd.DataFrame(np.random.rand(100000, 4), columns=list('ABCD'))

# 1. Using .apply() (slower)
start_time = time.time()
result1 = large_df.apply(lambda x: x['A'] * x['B'] + x['C'] - x['D'], axis=1)
apply_time = time.time() - start_time
print(f"Time using .apply(): {apply_time:.4f} seconds")

# 2. Using vectorized operations (faster)
start_time = time.time()
result2 = large_df['A'] * large_df['B'] + large_df['C'] - large_df['D']
vector_time = time.time() - start_time
print(f"Time using vectorized operations: {vector_time:.4f} seconds")
print(f"Speedup factor: {apply_time / vector_time:.1f}x")

# 3. Using .eval() for complex expressions (can be faster for large datasets)
start_time = time.time()
result3 = large_df.eval('A * B + C - D')
eval_time = time.time() - start_time
print(f"Time using .eval(): {eval_time:.4f} seconds")

# Memory optimization with categoricals
# For string columns with repetitive values
titanic_optimized = titanic.copy()
titanic_optimized['sex'] = titanic_optimized['sex'].astype('category')
titanic_optimized['class'] = titanic_optimized['class'].astype('category')
titanic_optimized['embark_town'] = titanic_optimized['embark_town'].astype('category')

# Compare memory usage
print(f"\nOriginal memory usage: {titanic.memory_usage(deep=True).sum() / 1024:.2f} KB")
print(f"Optimized memory usage: {titanic_optimized.memory_usage(deep=True).sum() / 1024:.2f} KB")

# Chunking for processing large files
def process_in_chunks(filename, chunk_size=10000):
    chunk_list = []
    # This is just a demonstration - replace with your actual CSV file
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Process each chunk (example: filter rows with age > 30)
        processed = chunk[chunk['age'] > 30]
        chunk_list.append(processed)
    # Combine all processed chunks
    return pd.concat(chunk_list)

# Example usage (commented out as we don't have the file)
# result = process_in_chunks('large_dataset.csv', chunk_size=50000)

Best Practices & Common Pitfalls

Best Practices
- Use Vectorized Operations: Always prefer vectorized operations over loops or apply when possible
- Chain Methods: Take advantage of method chaining for cleaner code (see the sketch after this section)
- Set a Copy Warning: Use pd.set_option('mode.chained_assignment', 'warn') to catch potential issues
- Use .loc and .iloc: Prefer these for data selection to avoid SettingWithCopyWarning
- Make Explicit Copies: Use .copy() when creating a new dataframe from a slice
- Use Appropriate Data Types: Convert to categorical types for string columns with few unique values

Common Pitfalls

# Problem 1: SettingWithCopyWarning
# Create a subset
bad_subset = titanic[titanic['age'] > 30]
# This might create a warning (if it's a view and not a copy)
bad_subset['fare'] = bad_subset['fare'] * 1.1

# Correct approach:
good_subset = titanic[titanic['age'] > 30].copy()
good_subset['fare'] = good_subset['fare'] * 1.1

# Problem 2: Using == with NaN values
# This won't work as expected
missing_ages = titanic[titanic['age'] == np.nan]  # Will return empty dataframe
print(f"Wrong way to find NaN: {len(missing_ages)} rows")

# Correct approach:
missing_ages = titanic[titanic['age'].isna()]
print(f"Correct way to find NaN: {len(missing_ages)} rows")

# Problem 3: Forgetting that loc is inclusive on both ends
# This includes row indices 0, 1, 2, 3, 4, 5
first_six = titanic.loc[0:5]
print(f"titanic.loc[0:5] gives {len(first_six)} rows")

# iloc is exclusive on the upper bound
# This includes row indices 0, 1, 2, 3, 4
first_five = titanic.iloc[0:5]
print(f"titanic.iloc[0:5] gives {len(first_five)} rows")

# Problem 4: Ignoring missing values when filtering
# Rows where 'age' is NaN satisfy neither condition, so the two subsets don't add up
adults = titanic[titanic['age'] >= 18]
children = titanic[titanic['age'] < 18]
print(f"Adults + Children = {len(adults) + len(children)}, Total = {len(titanic)}")
print("Missing values are excluded from both filters!")
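Before moving on to the use cases, here is what the "chain methods" advice above looks like in practice - a small sketch that filters, derives a column, aggregates, and sorts in one readable pipeline:

# Method chaining: each step returns a new object, so the pipeline reads top to bottom
fare_summary = (
    titanic
    .dropna(subset=['age'])
    .assign(fare_per_person=lambda df: df['fare'] / (df['sibsp'] + df['parch'] + 1))
    .groupby('class')['fare_per_person']
    .mean()
    .sort_values(ascending=False)
)
print(fare_summary)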
Real-World Use Cases

Titanic Survival Prediction

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Prepare the data
titanic_model = titanic.copy()

# Handle missing values
titanic_model['age'] = titanic_model['age'].fillna(titanic_model['age'].median())
titanic_model['embarked'] = titanic_model['embarked'].fillna(titanic_model['embarked'].mode()[0])

# Create dummies for categorical variables
titanic_dummies = pd.get_dummies(
    titanic_model[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']],
    drop_first=True  # Drop one category to avoid multicollinearity
)

# Split features and target
X = titanic_dummies
y = titanic_model['survived']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
print("\nClassification report:")
print(classification_report(y_test, y_pred))

# Feature importance
features = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nFeature importance:")
print(features.head(10))

Diamond Price Analysis

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Prepare the diamonds data
diamonds_model = diamonds.copy()

# Create dummies for categorical variables
diamonds_dummies = pd.get_dummies(
    diamonds_model[['carat', 'depth', 'table', 'x', 'y', 'z', 'cut', 'color', 'clarity']],
    columns=['cut', 'color', 'clarity'],
    drop_first=True
)

# Split features and target
X = diamonds_dummies
y = diamonds_model['price']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: ${mae:.2f}")
print(f"Root Mean Squared Error: ${rmse:.2f}")
print(f"R² Score: {r2:.4f}")

Conclusion

Pandas is an incredibly powerful library that forms the backbone of data analysis in Python. By mastering its capabilities, from basic operations to advanced techniques, you can efficiently handle any data processing task that comes your way.

In this guide, we've covered:
- Basic operations for exploration and manipulation
- Intermediate techniques for data transformation
- Advanced functions for complex analysis
- Visualization capabilities for communicating insights
- Performance optimization for handling large datasets
- Real-world applications showcasing pandas in action

Ready to take your data science skills to the next level? Check out the Free Course to Master Data Science in 40 Days by Codanics.

👨‍💻 Author: Dr. Muhammad Aammar Tufail
Essential Pandas Terms & Concepts

- DataFrame: 2D labeled data structure with columns of potentially different types. Example: pd.DataFrame(data)
- Series: 1D labeled array capable of holding any data type. Example: pd.Series([1, 2, 3])
- Index: Labels for rows or columns in a DataFrame/Series. Example: df.index
- GroupBy: Splitting data into groups for aggregation. Example: df.groupby('col').mean()
- Pivot Table: Summarize data with multi-dimensional grouping. Example: df.pivot_table(values='A', index='B', columns='C')
- Merge/Join: Combine DataFrames using keys/columns. Example: pd.merge(df1, df2, on='key')
- loc / iloc: Label-based / integer-based selection. Example: df.loc[0:5], df.iloc[0:5]
- NaN: Missing value marker in pandas. Example: np.nan, df.isna()
- apply(): Apply a function along an axis of the DataFrame. Example: df.apply(np.sqrt)
- astype(): Cast a pandas object to a specified dtype. Example: df['col'].astype('float')
- value_counts(): Count unique values in a Series. Example: df['col'].value_counts()
- dropna(): Remove missing values. Example: df.dropna()
- fillna(): Fill missing values. Example: df.fillna(0)
- sort_values(): Sort by the values along either axis. Example: df.sort_values('col')
- query(): Query the columns of a DataFrame with a boolean expression. Example: df.query('col > 5')
- crosstab(): Compute a simple cross-tabulation of two (or more) factors. Example: pd.crosstab(df['A'], df['B'])
- read_csv(): Read a comma-separated values (csv) file into a DataFrame. Example: pd.read_csv('file.csv')
- to_csv(): Write a DataFrame to a csv file. Example: df.to_csv('file.csv')