Pandas Essentials Cheat Sheet – IT Exams Training

Python Pandas is a simple, expressive, and one of the most important libraries in Python for data analysis and manipulation. It significantly simplifies working with real-world data, making data analysis faster and easier. For beginners, the variety of functions and operations can be overwhelming, so having a structured guide to help understand and apply Pandas is essential.

This guide introduces the basics of Pandas, including data structures, importing and exporting data, key functions, operations, and basic plotting techniques. It aims to provide a solid foundation for anyone starting their data science journey with Python.

Importing the Pandas Library

Before using Pandas, you need to import the library into your Python environment. The conventional way is:

python

CopyEdit

import pandas as pd

This imports the Pandas library and gives it the alias pd for easier usage throughout your code.

Pandas Data Structures

Pandas primarily offers two main data structures to work with data: Series and DataFrame. Understanding these is crucial to harness the power of Pandas.

Series

A Series is a one-dimensional labeled array capable of holding any data type, such as integers, strings, or floats. It is similar to a column in a spreadsheet or database table.

Example of creating a Series:

python

CopyEdit

s = pd.Series([1, 2, 3, 4], index=[‘a’, ‘b’, ‘c’, ‘d’])

This creates a Series with values 1, 2, 3, 4 labeled by the index a, b, c, and d.

DataFrame

A DataFrame is a two-dimensional, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is the most commonly used Pandas object and can be thought of as a spreadsheet or SQL table.

Example of creating a DataFrame:

python

CopyEdit

data_mobile = {

‘Mobile’: [‘iPhone’, ‘Samsung’, ‘Redmi’],

‘Color’: [‘Red’, ‘White’, ‘Black’],

‘Price’: [‘High’, ‘Medium’, ‘Low’]

}

Df = pd.DataFrame(data_mobile, columns=[‘Mobile’, ‘Color’, ‘Price’])

This creates a DataFrame with three columns: Mobile, Color, and Price, and three rows of data.

Importing Data

Pandas offers a variety of functions to import data from different file formats. These functions read data and return Pandas objects like DataFrames or Series.

Common reader functions include:

python

CopyEdit

pd.read_csv(“filename.csv”)

pd.read_table(“filename.txt”)

pd.read_excel(“filename.xlsx”)

pd.read_sql(query, connection_object)

pd.read_json(“filename.json”)

These functions are useful for loading data from CSV files, text files, Excel spreadsheets, SQL databases, and JSON files.

Exporting Data

Similarly, Pandas provides methods to export DataFrame contents back to various formats.

Common writer functions include:

python

CopyEdit

df.to_csv(“filename.csv”)

df.to_excel(“filename.xlsx”)

df.to_sql(table_name, connection_object)

df.to_json(“filename.json”)

These allow you to save your data after manipulation for sharing or further use.

Creating Test and Fake Data

For testing and development purposes, it is often necessary to generate sample or fake data.

You can create an empty DataFrame:

python

CopyEdit

df = pd.DataFrame()

Generate random data:

python

CopyEdit

import numpy as np

pd.DataFrame(np.random.rand(4,3))

This generates a DataFrame with 4 rows and 3 columns filled with random floating-point numbers.

Creating a Series from an iterable is also straightforward:

python

CopyEdit

pd.Series(new_series)

Where new_series is any iterable, like a list.

Retrieving Data from DataFrames

Retrieving Column Data

To access columns from a DataFrame, you can use the following methods:

Access a single column by its name:

python

CopyEdit

df[‘Pet’]

This returns the column named ‘Pet’.

Access multiple columns:

python

CopyEdit

df[[‘Pet’, ‘Vehicle’]]

This returns a DataFrame with the columns ‘Pet’ and ‘Vehicle’.

You can also filter columns by patterns using regular expressions:

python

CopyEdit

df.filter(regex=’TIM’)

This returns columns whose names match the pattern ‘TIM’.

Viewing DataFrame Contents

Pandas offers several convenient methods to view data samples and summaries:

View the first few rows:

python

CopyEdit

df.head(n)

View the last few rows:

python

CopyEdit

df.tail(n)

Sample random rows:

python

CopyEdit

df.sample(n)

Find the largest values in a column:

python

CopyEdit

df.nlargest(n, ‘value’)

Find the smallest values in a column:

python

CopyEdit

df.nsmallest(n, ‘value’)

Filter rows based on conditions:

python

CopyEdit

df[df.HEIGHT > 100]

Remove duplicate rows:

python

CopyEdit

df.drop_duplicates()

Check the shape (rows, columns):

python

CopyEdit

df.shape

Get general info about data types and memory usage:

python

CopyEdit

df.info()

Get summary statistics of numerical columns:

python

CopyEdit

df.describe()

Selecting Data from DataFrames

There are two primary ways to select data in a DataFrame: by position and by label.

Selecting by Position using

iloc selects data based on the integer position of rows and columns.

Select the first row:

python

CopyEdit

df.iloc[0]

Select the second row:

python

CopyEdit

df.iloc[1]

Select the last row:

python

CopyEdit

df.iloc[-1]

Select the first column for all rows:

python

CopyEdit

df.iloc[:, 0]

Select the second column for all rows:

python

CopyEdit

df.iloc[:, 1]

Selecting by Label using loc

loc selects data based on the labels of rows and columns.

Select a single value by row and column labels:

python

CopyEdit

df.loc[0, ‘column_label’]

Select a slice of rows and columns by labels:

python

CopyEdit

df.loc[‘row1′:’row3’, ‘column1′:’column3’]

Sorting DataFrames

Sorting is an essential operation when organizing data.

Sort by index labels:

python

CopyEdit

df.sort_index()

Sort by values in a column ascending:

python

CopyEdit

df.sort_values(‘column1’)

Sort by values in a column descending:

python

CopyEdit

df.sort_values(‘column2’, ascending=False)

Reset the index to the default integer index:

python

CopyEdit

df.reset_index()

Grouping Data

Grouping data allows applying aggregate functions on subsets of data grouped by one or more columns.

Create a groupby object by one column:

python

CopyEdit

df.groupby(‘column’)

Group by multiple columns:

python

CopyEdit

df.groupby([‘column1’, ‘column2’])

Calculate the mean of a column grouped by another:

python

CopyEdit

df.groupby(‘column1’)[‘column2’].mean()

Calculate the median similarly:

python

CopyEdit

df.groupby(‘column1’)[‘column2’].median()

Grouping is a powerful technique for summarizing and analyzing data by categories.

Merging Data Sets

In real-world data science projects, data often comes from multiple sources. Combining these datasets is a common task, and Pandas provides flexible methods to merge DataFrames.

Types of Joins

Pandas supports several types of joins similar to SQL joins:

Inner Join: Returns rows where there is a match in both DataFrames.
Outer Join: Returns all rows from both DataFrames, filling in NaNs for missing matches.
Left Join: Returns all rows from the left DataFrame and matched rows from the right DataFrame.
Right Join: Returns all rows from the right DataFrame and matched rows from the left DataFrame.
Cross Join: Returns the Cartesian product of both DataFrames.

Examples of Merging

Inner join example:

python

CopyEdit

pd.merge(df1, df2, how=’inner’, on=’Apple’)

Outer join example:

python

CopyEdit

pd.merge(df1, df2, how=’outer’, on=’Orange’)

Left join example:

python

CopyEdit

pd.merge(df1, df2, how=’left’, on=’Animals’)

Right join example:

python

CopyEdit

pd.merge(df1, df2, how=’right’, on=’Vehicles’)

Cross join example:

python

CopyEdit

df1.merge(df2, how=’cross’)

Parameters for Merging

How: Type of join operation (‘inner’, ‘outer’, ‘left’, ‘right’, ‘cross’).
On: Column or index level names to join on.
left_on, right_on: Columns or index levels from left and right DataFrames to join on, respectively.
left_index, right_index: Use indexes from left or right DataFrames as join keys.

Merging datasets effectively enables combining data spread across multiple tables or sources into one unified DataFrame for analysis.

Renaming Data

Renaming columns and indexes can improve readability and clarity when working with DataFrames.

Renaming Columns and Indexes

The rename() method allows mapping old labels to new ones.

Example to rename columns:

python

CopyEdit

df.rename(columns={‘ferrari’: ‘FERRARI’, ‘mercedes’: ‘MERCEDES’, ‘bently’: ‘BENTLEY’}, inplace=True)

Example to rename multiple columns:

python

CopyEdit

df.rename(columns={“”: “a”, “B”: “c”})

Renaming indexes by mapping:

python

CopyEdit

df.rename(index={0: “london”, 1: “newyork”, 2: “berlin”})

In-placee Parameter

Setting inplace=True modifies the DataFrame in place without returning a new object. Otherwise, rename() returns a new DataFrame.

Renaming improves code readability and is useful when preparing data for presentation or export.

Handling Duplicate Data

Duplicate data can skew analysis results. Pandas provides tools to detect and remove duplicates efficiently.

Identifying Duplicate Rows

Use the duplicated() method to find duplicate rows:

python

CopyEdit

df.duplicated()

This returns a Boolean Series indicating which rows are duplicates.

Removing Duplicates

Remove duplicate rows with:

python

CopyEdit

df.drop_duplicates()

This returns a DataFrame with duplicate rows removed.

Removing Duplicate Index Values

Duplicate indexes can cause issues in data retrieval:

python

Concatenation appends or combines Pandas objects along a particular axis.

Concatenating DataFrames and Series

Concatenate DataFrames vertically (default axis=0):

python

CopyEdit

df = pd.concat([df3, df1])

Concatenate Series vertically:

python

CopyEdit

df = pd.concat([S3, S1])

Concatenate along columns (axis=1):

python

CopyEdit

df = pd.concat([df3, S1], axis=1)

Copy Parameter

By default, concat() copies data. Use copy=False to avoid copying, but this can have side effects if the original data changes.

Concatenation is useful for combining datasets with similar columns or adding new columns from different sources.

Filtering Data

Filtering allows selecting rows or columns based on specific conditions or patterns.

Using filter()

Filter columns by explicit list:

python

CopyEdit

df = df.filter(items=[‘City’, ‘Country’])

Filter columns by substring match:

python

CopyEdit

df = df.filter(like=’tion’, axis=1)

Filter columns by regex pattern:

python

CopyEdit

df.fillna(method=’ffill’, inplace=True)

Replacing Values

Replace specific values in the DataFrame:

python

CopyEdit

df.replace([2, 30], [1, 10])

Interpolating Missing Data

Interpolate missing numerical values:

python

CopyEdit

df.interpolate(method=’linear’, limit_direction=’backward’, axis=0)

Handling missing data properly is critical to maintain data integrity and avoid analysis bias.

Pandas Statistical Functions

Pandas simplifies many statistical operations commonly used in data analysis.

Basic Statistics

Calculate the mean for each column:

python

CopyEdit

df.mean()

Calculate median:

python

CopyEdit

df.median()

Standard deviation:

python

CopyEdit

df.std()

Maximum and minimum values:

python

CopyEdit

df.max()

df.min()

Count of non-null values:

python

CopyEdit

df.count()

Summary Statistics

Generate descriptive statistics:

python

CopyEdit

df.describe()

This provides count, mean, standard deviation, min, max, and quartiles.

These statistical functions allow quick insights into data distribution and variability.

Dropping Data

Sometimes, it is necessary to remove specific rows or columns from a DataFrame.

Dropping Columns

Drop one or more columns by name:

python

CopyEdit

df.drop([‘Nike’], axis=1)

Dropping Rows by Index

Drop rows using index labels:

python

CopyEdit

df.drop([‘Size’], axis=0)

Dropping Multiple Labels

Drop multiple labels in rows and columns simultaneously:

python

Data visualization is key to understanding data. Pandas integrates well with Matplotlib to provide quick plots.

Histogram

Create histograms to show data distribution:

python

CopyEdit

df.plot.hist()

Scatter Plot

Create scatter plots to visualize relationships:

python

CopyEdit

df.plot.scatter(x=’column1′, y=’column2′)

Inline Plotting in Jupyter

Use this magic command to enable inline plotting in notebooks:

python

CopyEdit

%matplotlib inline

Plotting helps reveal trends, outliers, and patterns in data.

Advanced Data Manipulation Techniques in Pandas

As you deepen your understanding of Pandas, you will encounter more advanced techniques that help efficiently transform, analyze, and extract insights from data. These techniques leverage Pandas’ powerful functionality and allow you to write cleaner, faster, and more expressive code.

Working with Time Series Data

Time series data is ubiquitous in finance, IoT, sales tracking, and many other fields. Pandas provides extensive support for working with dates, times, and time-indexed data.

DateTime Objects

Pandas builds on NumPy’s datetime64 and Python’s datetime modules to handle date and time data.

Applying Functions Efficiently

Applying custom or built-in functions to your data can transform or summarize it in powerful ways.

The apply() Method

The apply() method lets you apply a function across DataFrame rows or columns.

Example applying a function to each row:

python

CopyEdit

df[‘new_col’] = df.apply(lambda row: row[‘A’] + row[‘B’], axis=1)

Vectorized Operations

Pandas and NumPy support vectorized operations, which are faster than row-wise operations:

python

CopyEdit

df[‘C’] = df[‘A’] + df[‘B’]

Prefer vectorized operations over apply() when possible for better performance.

Using map() and applymap()

map() is used with Series to map values using a dictionary or function:

python

CopyEdit

df[‘column’].map({‘a’: 1, ‘b’: 2})

applymap() applies a function element-wise across the entire DataFrame:

python

CopyEdit

df[‘category_col’] = pd.Categorical(df[‘category_col’], categories=[‘low’, ‘medium’, ‘high’], ordered=True)

This enables meaningful comparisons and sorting.

Categorical types optimize performance for datasets with repeated labels and are essential for certain analyses.

Working with Text Data

Text data is often messy and requires cleaning and transformation before analysis.

String Methods in Pandas

Pandas provides vectorized string functions accessible via. .str.

Examples:

python

CopyEdit

df[‘column’].str.lower() # Convert to lowercase

df[‘column’].str.upper() # Convert to uppercase

df[‘column’].str.strip() # Remove whitespace

df[‘column’].str.contains(‘pattern’) # Filter rows containing pattern

df[‘column’].str.replace(‘old’, ‘new’) # Replace substring

Extracting Patterns

Extract parts of strings with regular expressions:

python

CopyEdit

df[‘extracted’] = df[‘column’].str.extract(r'(\d+)’)

This extracts digits from text.

Splitting and Joining

Split strings into lists:

python

CopyEdit

df[‘split_col’] = df[‘column’].str.split(‘,’)

Join lists into strings:

CopyEdit

df.groupby(‘Category’).filter(lambda x: x[‘Sales’].sum() > 1000)

This keeps groups with total sales greater than 1000.

Transforming Grouped Data

transform() returns an object indexed like the original, but transformed:

python

CopyEdit

df[‘Sales_zscore’] = df.groupby(‘Category’)[‘Sales’].transform(lambda x: (x – x.mean()) / x.std())

Useful for normalization within groups.

Mastering groupby operations enables insightful data summaries and feature engineering.

Performance Tips

Handling large datasets requires performance considerations.

Avoid Loops

Avoid explicit Python loops over DataFrames. Use vectorized Pandas or NumPy operations instead.

Use Categoricals

Convert repeated strings to categorical dtype to reduce memory usage and speed up operations.

Use Efficient Joins

When merging large datasets, ensure the keys are indexed to speed up joins.

Use Chunking for Large Files

For huge files, read in chunks with pd.read_csv() using the chunksize parameter to avoid memory overload.

Use .eval() and .query() for Speed

eval() lets you perform operations efficiently using pandas expressions:

python

CopyEdit

df.eval(‘new_col = A + B’, inplace=True)

Query () allows fast filtering using expressions:

python

CopyEdit

df.query(‘A > 5 & B < 10’)

These can be faster than normal Pandas syntax for large DataFrames.

Exporting Data

After processing, saving the results is crucial.

Export to CSV

Python

CopyEdit

df.to_csv(‘filename.csv’, index=False)

Set index=False to exclude row labels.

Export to Excel

python

CopyEdit

df.to_excel(‘filename.xlsx’, index=False)

Export to JSON

python

CopyEdit

df.to_json(‘filename.json’)

Export to SQL Databases

Save DataFrame to SQL:

python

CopyEdit

df.to_sql(‘table_name’, connection_object)

Choosing the right format depends on your data sharing and analysis needs.

Advanced Data Analysis and Visualization with Pandas

df.fillna(0) # Replace NaN with 0

df.fillna(method=’ffill’) # Forward fill to propagate last valid observation

df.fillna(method=’bfill’) # Backward fill to propagate next valid observation

You can fill in the mean or median of the column:

python

CopyEdit

df[‘column’].fillna(df[‘column’].mean(), inplace=True)

Interpolation

Interpolation fills missing values by estimating them based on surrounding data:

python

CopyEdit

df.interpolate(method=’linear’, inplace=True)

This is useful for time series or numerical data with continuous values.

Working with MultiIndex DataFrames

MultiIndex (hierarchical indexing) allows more complex data organization, supporting multiple index levels on rows and columns.

Creating MultiIndex

Create a MultiIndex from arrays or tuples:

python

CopyEdit

arrays = [[‘A’, ‘A’, ‘B’, ‘B’], [1, 2, 1, 2]]

index = pd.MultiIndex.from_arrays(arrays, names=(‘Letter’, ‘Number’))

df = pd.DataFrame({‘Value’: [10, 20, 30, 40]}, index=index)

Accessing Data in MultiIndex

Use loc with tuples to access data:

python

CopyEdit

df.loc[(‘A’, 1)]

Slice data by levels:

python

CopyEdit

df.loc[‘A’]

df.loc[pd.IndexSlice[:, 2], :]

Resetting and Setting MultiIndex

Convert MultiIndex to columns:

python

CopyEdit

df.reset_index(inplace=True)

Set multiple columns as an index:

python

CopyEdit

df.set_index([‘col1’, ‘col2’], inplace=True)

MultiIndex is powerful for representing grouped data with multiple categorical levels.

Combining DataFrames: Concatenation, Merge, and Join

Concatenation

Concatenate along rows or columns:

python

CopyEdit

pd.concat([df1, df2], axis=0) # Row-wise concatenation

pd.concat([df1, df2], axis=1) # Column-wise concatenation

Merge

Merge DataFrames based on keys, similar to SQL joins:

python

CopyEdit

pd.merge(df1, df2, on=’key’, how=’inner’)

Types of joins:

Inner: intersection of keys
left: all keys from the left DataFrame
right: all keys from the right DataFrame
outer: union of keys

Join

Join is a convenient method for joining on an index:

python

CopyEdit

df1.join(df2, how=’left’)

Combining datasets effectively enables richer analyses.

Working with Window Functions

Window functions allow calculations across a sliding window of data.

Rolling Windows

Calculate moving averages or sums:

python

CopyEdit

df[‘rolling_mean’] = df[‘Value’].rolling(window=3).mean()

df[‘rolling_sum’] = df[‘Value’].rolling(window=3).sum()

Expanding Windows

Compute cumulative statistics:

python

CopyEdit

df[‘expanding_sum’] = df[‘Value’].expanding().sum()

Exponentially Weighted Windows

Give more weight to recent observations:

python

CopyEdit

df[‘ewm_mean’] = df[‘Value’].ewm(span=3).mean()

Window functions are vital for smoothing time series and detecting trends.

Data Visualization with Pandas

Visualizing data helps to uncover patterns and communicate findings.

Plotting Basics

Pandas integrates with Matplotlib to plot directly from DataFrames.

Enable inline plotting in notebooks:

python

CopyEdit

%matplotlib inline

Common Plot Types

Line Plot (default):

python

CopyEdit

df.plot()

Histogram:

python

CopyEdit

df[‘column’].plot.hist()

Bar Plot:

python

CopyEdit

df.plot.bar(x=’Category’, y=’Value’)

Scatter Plot:

python

CopyEdit

df.plot.scatter(x=’col1′, y=’col2′)

Box Plot:

python

CopyEdit

df.boxplot(column=’col1′, by=’group_col’)

Customizing Plots

Adjust titles, labels, and colors:

python

CopyEdit

ax = df.plot(x=’col1′, y=’col2′, kind=’scatter’, color=’red’)

ax.set_title(‘Scatter Plot’)

ax.set_xlabel(‘X axis’)

ax.set_ylabel(‘Y axis’)

Visualization is critical for data exploration and presentation.

Exporting and Saving Your Work

Export your cleaned and processed data for sharing or further use.

CSV and Excel

Save to CSV or Excel files:

python

CopyEdit

df.to_csv(‘output.csv’, index=False)

df.to_excel(‘output.xlsx’, index=False)

JSON and SQL

Export as JSON or to SQL databases:

python

CopyEdit

df.to_json(‘output.json’)

df.to_sql(‘table_name’, connection)

Choose the format based on your needs and downstream processes.

Best Practices and Tips

Always inspect your data after each major step (head(), info(), describe()).
Use meaningful variable names for readability.
Handle missing data thoughtfully to avoid skewed results.
Utilize vectorized operations for efficiency.
Use grouping and aggregation to summarize data.
Leverage plotting to understand and communicate insights.
Document your code for reproducibility.

Conclusion

This guide has covered a comprehensive range of topics, from basics to advanced data manipulation, time series handling, performance tips, and visualization in Pandas. Mastery of these tools empowers you to confidently analyze and derive insights from complex datasets. With continuous practice and exploration, Pandas becomes an invaluable asset in your data science toolkit.