{"id":893,"date":"2025-07-04T05:52:02","date_gmt":"2025-07-04T05:52:02","guid":{"rendered":"https:\/\/www.actualtests.com\/blog\/?p=893"},"modified":"2025-07-04T05:52:14","modified_gmt":"2025-07-04T05:52:14","slug":"pandas-essentials-cheat-sheet","status":"publish","type":"post","link":"https:\/\/www.actualtests.com\/blog\/pandas-essentials-cheat-sheet\/","title":{"rendered":"Pandas Essentials Cheat Sheet"},"content":{"rendered":"\n<p>Pandas is a simple, expressive library and one of the most important tools in Python for data analysis and manipulation. It significantly simplifies working with real-world data, making data analysis faster and easier. For beginners, the variety of functions and operations can be overwhelming, so a structured guide to understanding and applying Pandas is valuable.<\/p>\n\n\n\n<p>This guide introduces the basics of Pandas, including data structures, importing and exporting data, key functions, operations, and basic plotting techniques. It aims to provide a solid foundation for anyone starting their data science journey with Python.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Importing the Pandas Library<\/strong><\/h2>\n\n\n\n<p>Before using Pandas, you need to import the library into your Python environment. The conventional way is:<\/p>\n\n\n\n<p>import pandas as pd<\/p>\n\n\n\n<p>This imports the Pandas library and gives it the alias pd for easier usage throughout your code.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Pandas Data Structures<\/strong><\/h2>\n\n\n\n<p>Pandas offers two main data structures: Series and DataFrame. Understanding these is crucial to harnessing the power of Pandas.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Series<\/strong><\/h3>\n\n\n\n<p>A Series is a one-dimensional labeled array capable of holding any data type, such as integers, strings, or floats.
It is similar to a column in a spreadsheet or database table.<\/p>\n\n\n\n<p>Example of creating a Series:<\/p>\n\n\n\n<p>s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])<\/p>\n\n\n\n<p>This creates a Series with values 1, 2, 3, 4 labeled by the index a, b, c, and d.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>DataFrame<\/strong><\/h3>\n\n\n\n<p>A DataFrame is a two-dimensional, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It is the most commonly used Pandas object and can be thought of as a spreadsheet or SQL table.<\/p>\n\n\n\n<p>Example of creating a DataFrame:<\/p>\n\n\n\n<p>data_mobile = {<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;'Mobile': ['iPhone', 'Samsung', 'Redmi'],<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;'Color': ['Red', 'White', 'Black'],<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;'Price': ['High', 'Medium', 'Low']<\/p>\n\n\n\n<p>}<\/p>\n\n\n\n<p>df = pd.DataFrame(data_mobile, columns=['Mobile', 'Color', 'Price'])<\/p>\n\n\n\n<p>This creates a DataFrame with three columns (Mobile, Color, and Price) and three rows of data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Importing Data<\/strong><\/h2>\n\n\n\n<p>Pandas offers a variety of functions to import data from different file formats.
These functions read data and return Pandas objects like DataFrames or Series.<\/p>\n\n\n\n<p>Common reader functions include:<\/p>\n\n\n\n<p>pd.read_csv(&quot;filename.csv&quot;)<\/p>\n\n\n\n<p>pd.read_table(&quot;filename.txt&quot;)<\/p>\n\n\n\n<p>pd.read_excel(&quot;filename.xlsx&quot;)<\/p>\n\n\n\n<p>pd.read_sql(query, connection_object)<\/p>\n\n\n\n<p>pd.read_json(&quot;filename.json&quot;)<\/p>\n\n\n\n<p>These functions are useful for loading data from CSV files, text files, Excel spreadsheets, SQL databases, and JSON files.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Exporting Data<\/strong><\/h2>\n\n\n\n<p>Similarly, Pandas provides methods to export DataFrame contents back to various formats.<\/p>\n\n\n\n<p>Common writer methods include:<\/p>\n\n\n\n<p>df.to_csv(&quot;filename.csv&quot;)<\/p>\n\n\n\n<p>df.to_excel(&quot;filename.xlsx&quot;)<\/p>\n\n\n\n<p>df.to_sql(table_name, connection_object)<\/p>\n\n\n\n<p>df.to_json(&quot;filename.json&quot;)<\/p>\n\n\n\n<p>These allow you to save your data after manipulation for sharing or further use.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Creating Test and Fake Data<\/strong><\/h2>\n\n\n\n<p>For testing and development purposes, it is often necessary to generate sample or fake data.<\/p>\n\n\n\n<p>You can create an empty DataFrame:<\/p>\n\n\n\n<p>df = pd.DataFrame()<\/p>\n\n\n\n<p>Generate random data:<\/p>\n\n\n\n<p>import numpy as np<\/p>\n\n\n\n<p>pd.DataFrame(np.random.rand(4, 3))<\/p>\n\n\n\n<p>This generates a DataFrame with 4 rows and 3 columns filled with random floating-point numbers drawn uniformly from the interval [0, 1).<\/p>\n\n\n\n<p>Creating a Series from an iterable is also straightforward:<\/p>\n\n\n\n<p>pd.Series(new_series)<\/p>\n\n\n\n<p>Here, new_series is any iterable, such as
a list.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Retrieving Data from DataFrames<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Retrieving Column Data<\/strong><\/h3>\n\n\n\n<p>To access columns from a DataFrame, you can use the following methods:<\/p>\n\n\n\n<p>Access a single column by its name:<\/p>\n\n\n\n<p>df['Pet']<\/p>\n\n\n\n<p>This returns the column named \u2018Pet\u2019 as a Series.<\/p>\n\n\n\n<p>Access multiple columns:<\/p>\n\n\n\n<p>df[['Pet', 'Vehicle']]<\/p>\n\n\n\n<p>This returns a DataFrame with the columns \u2018Pet\u2019 and \u2018Vehicle\u2019.<\/p>\n\n\n\n<p>You can also filter columns by patterns using regular expressions:<\/p>\n\n\n\n<p>df.filter(regex='TIM')<\/p>\n\n\n\n<p>This returns the columns whose names contain a match for the pattern \u2018TIM\u2019.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Viewing DataFrame Contents<\/strong><\/h2>\n\n\n\n<p>Pandas offers several convenient methods to view data samples and summaries:<\/p>\n\n\n\n<p>View the first n rows:<\/p>\n\n\n\n<p>df.head(n)<\/p>\n\n\n\n<p>View the last n rows:<\/p>\n\n\n\n<p>df.tail(n)<\/p>\n\n\n\n<p>Sample n random rows:<\/p>\n\n\n\n<p>df.sample(n)<\/p>\n\n\n\n<p>Return the n rows with the largest values in a column:<\/p>\n\n\n\n<p>df.nlargest(n, 'value')<\/p>\n\n\n\n<p>Return the n rows with the smallest values in a column:<\/p>\n\n\n\n<p>df.nsmallest(n, 'value')<\/p>\n\n\n\n<p>Filter rows based on conditions:<\/p>\n\n\n\n<p>df[df.HEIGHT &gt; 100]<\/p>\n\n\n\n<p>Remove duplicate
rows:<\/p>\n\n\n\n<p>df.drop_duplicates()<\/p>\n\n\n\n<p>Check the shape (rows, columns):<\/p>\n\n\n\n<p>df.shape<\/p>\n\n\n\n<p>Get general info about data types and memory usage:<\/p>\n\n\n\n<p>df.info()<\/p>\n\n\n\n<p>Get summary statistics of numerical columns:<\/p>\n\n\n\n<p>df.describe()<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Selecting Data from DataFrames<\/strong><\/h2>\n\n\n\n<p>There are two primary ways to select data in a DataFrame: by position and by label.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Selecting by Position using iloc<\/strong><\/h3>\n\n\n\n<p>iloc selects data based on the integer position of rows and columns.<\/p>\n\n\n\n<p>Select the first row:<\/p>\n\n\n\n<p>df.iloc[0]<\/p>\n\n\n\n<p>Select the second row:<\/p>\n\n\n\n<p>df.iloc[1]<\/p>\n\n\n\n<p>Select the last row:<\/p>\n\n\n\n<p>df.iloc[-1]<\/p>\n\n\n\n<p>Select the first column for all rows:<\/p>\n\n\n\n<p>df.iloc[:, 0]<\/p>\n\n\n\n<p>Select the second column for all rows:<\/p>\n\n\n\n<p>df.iloc[:, 1]<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Selecting by Label using loc<\/strong><\/h3>\n\n\n\n<p>loc selects data based on the labels of rows and columns.<\/p>\n\n\n\n<p>Select a single value by row and column labels:<\/p>\n\n\n\n<p>df.loc[0, 'column_label']<\/p>\n\n\n\n<p>Select a slice of rows and columns by labels (unlike positional slices, both endpoints are included):<\/p>\n\n\n\n<p>df.loc['row1':'row3', 'column1':'column3']<\/p>\n\n\n\n<h2
class=\"wp-block-heading\"><strong>Sorting DataFrames<\/strong><\/h2>\n\n\n\n<p>Sorting is an essential operation when organizing data.<\/p>\n\n\n\n<p>Sort by index labels:<\/p>\n\n\n\n<p>df.sort_index()<\/p>\n\n\n\n<p>Sort by the values in a column in ascending order (the default):<\/p>\n\n\n\n<p>df.sort_values('column1')<\/p>\n\n\n\n<p>Sort by the values in a column in descending order:<\/p>\n\n\n\n<p>df.sort_values('column2', ascending=False)<\/p>\n\n\n\n<p>Reset the index to the default integer index (the old index is kept as a new column unless you pass drop=True):<\/p>\n\n\n\n<p>df.reset_index()<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Grouping Data<\/strong><\/h2>\n\n\n\n<p>Grouping lets you apply aggregate functions to subsets of the data defined by one or more columns.<\/p>\n\n\n\n<p>Create a groupby object by one column:<\/p>\n\n\n\n<p>df.groupby('column')<\/p>\n\n\n\n<p>Group by multiple columns:<\/p>\n\n\n\n<p>df.groupby(['column1', 'column2'])<\/p>\n\n\n\n<p>Calculate the mean of a column grouped by another:<\/p>\n\n\n\n<p>df.groupby('column1')['column2'].mean()<\/p>\n\n\n\n<p>Calculate the median similarly:<\/p>\n\n\n\n<p>df.groupby('column1')['column2'].median()<\/p>\n\n\n\n<p>Grouping is a powerful technique for summarizing and analyzing data by categories.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Merging Data Sets<\/strong><\/h2>\n\n\n\n<p>In real-world data science projects, data often comes from multiple sources.
Combining these datasets is a common task, and Pandas provides flexible methods to merge DataFrames.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Types of Joins<\/strong><\/h3>\n\n\n\n<p>Pandas supports several types of joins similar to SQL joins:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Inner Join<\/strong>: Returns only the rows whose keys match in both DataFrames.<br><\/li>\n\n\n\n<li><strong>Outer Join<\/strong>: Returns all rows from both DataFrames, filling in NaN where there is no match.<br><\/li>\n\n\n\n<li><strong>Left Join<\/strong>: Returns all rows from the left DataFrame and the matching rows from the right DataFrame.<br><\/li>\n\n\n\n<li><strong>Right Join<\/strong>: Returns all rows from the right DataFrame and the matching rows from the left DataFrame.<br><\/li>\n\n\n\n<li><strong>Cross Join<\/strong>: Returns the Cartesian product of both DataFrames.<br><\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Examples of Merging<\/strong><\/h3>\n\n\n\n<p>Inner join example (the column named in on must exist in both DataFrames):<\/p>\n\n\n\n<p>pd.merge(df1, df2, how='inner', on='Apple')<\/p>\n\n\n\n<p>Outer join example:<\/p>\n\n\n\n<p>pd.merge(df1, df2, how='outer', on='Orange')<\/p>\n\n\n\n<p>Left join example:<\/p>\n\n\n\n<p>pd.merge(df1, df2, how='left', on='Animals')<\/p>\n\n\n\n<p>Right join example:<\/p>\n\n\n\n<p>pd.merge(df1, df2, how='right', on='Vehicles')<\/p>\n\n\n\n<p>Cross join example:<\/p>\n\n\n\n<p>df1.merge(df2, how='cross')<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Parameters for Merging<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>how<\/strong>: Type of join operation (&#8216;inner&#8217;, &#8216;outer&#8217;,
&#8216;left&#8217;, &#8216;right&#8217;, &#8216;cross&#8217;).<br><\/li>\n\n\n\n<li><strong>on<\/strong>: Column or index level names to join on.<br><\/li>\n\n\n\n<li><strong>left_on<\/strong>, <strong>right_on<\/strong>: Columns or index levels from left and right DataFrames to join on, respectively.<br><\/li>\n\n\n\n<li><strong>left_index<\/strong>, <strong>right_index<\/strong>: Use indexes from left or right DataFrames as join keys.<br><\/li>\n<\/ul>\n\n\n\n<p>Merging combines data spread across multiple tables or sources into one unified DataFrame for analysis.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Renaming Data<\/strong><\/h2>\n\n\n\n<p>Renaming columns and indexes can improve readability and clarity when working with DataFrames.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Renaming Columns and Indexes<\/strong><\/h3>\n\n\n\n<p>The rename() method allows mapping old labels to new ones.<\/p>\n\n\n\n<p>Example that renames columns in place:<\/p>\n\n\n\n<p>df.rename(columns={'ferrari': 'FERRARI', 'mercedes': 'MERCEDES', 'bently': 'BENTLEY'}, inplace=True)<\/p>\n\n\n\n<p>Example that returns a new DataFrame with renamed columns:<\/p>\n\n\n\n<p>df.rename(columns={&quot;A&quot;: &quot;a&quot;, &quot;B&quot;: &quot;c&quot;})<\/p>\n\n\n\n<p>Renaming indexes by mapping:<\/p>\n\n\n\n<p>df.rename(index={0: &quot;london&quot;, 1: &quot;newyork&quot;, 2: &quot;berlin&quot;})<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The inplace Parameter<\/strong><\/h3>\n\n\n\n<p>Setting inplace=True modifies the DataFrame in place and returns None instead of a new object.
Otherwise, rename() returns a new DataFrame.<\/p>\n\n\n\n<p>Renaming improves code readability and is useful when preparing data for presentation or export.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Handling Duplicate Data<\/strong><\/h2>\n\n\n\n<p>Duplicate data can skew analysis results. Pandas provides tools to detect and remove duplicates efficiently.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Identifying Duplicate Rows<\/strong><\/h3>\n\n\n\n<p>Use the duplicated() method to find duplicate rows:<\/p>\n\n\n\n<p>df.duplicated()<\/p>\n\n\n\n<p>This returns a Boolean Series that is True for each row repeating an earlier row. By default, the first occurrence is marked False.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Removing Duplicates<\/strong><\/h3>\n\n\n\n<p>Remove duplicate rows with:<\/p>\n\n\n\n<p>df.drop_duplicates()<\/p>\n\n\n\n<p>This returns a DataFrame with duplicate rows removed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Removing Duplicate Index Values<\/strong><\/h3>\n\n\n\n<p>Duplicate index labels can cause issues in data retrieval. To flag them:<\/p>\n\n\n\n<p>df.index.duplicated()<\/p>\n\n\n\n<p>To remove rows with duplicate index labels, keep only the unflagged rows with df[~df.index.duplicated()], or call reset_index() to replace the index.<\/p>\n\n\n\n<p>Handling duplicates ensures data quality and accurate analysis results.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Reshaping Data<\/strong><\/h2>\n\n\n\n<p>Data reshaping changes the layout of data to make it more suitable for analysis or visualization.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Pivoting<\/strong><\/h3>\n\n\n\n<p>Pivoting reshapes data by turning the unique values of one column into new columns.<\/p>\n\n\n\n<p>Example of pivoting:<\/p>\n\n\n\n<p>pivot = df.pivot(columns='Vehicles', values=['BRAND', 'YEAR'])<\/p>\n\n\n\n<p>This creates a table with one column per unique &#8216;Vehicles&#8217; value for each of the
&#8216;BRAND&#8217; and &#8216;YEAR&#8217; values.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Melting<\/strong><\/h3>\n\n\n\n<p>Melting converts wide-form data to long-form, combining multiple columns into key-value pairs.<\/p>\n\n\n\n<p>Example of melting:<\/p>\n\n\n\n<p>pd.melt(df)<\/p>\n\n\n\n<p>This stacks the columns into rows, producing a &#8216;variable&#8217; column with the original column names and a &#8216;value&#8217; column with their values.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Pivot Table with Aggregation<\/strong><\/h3>\n\n\n\n<p>Pivot tables can include aggregation of numeric data:<\/p>\n\n\n\n<p>pd.pivot_table(df, values=&quot;10&quot;, index=[&quot;1&quot;, &quot;3&quot;], columns=[&quot;2&quot;])<\/p>\n\n\n\n<p>This summarizes the values by the specified index and column keys. The default aggregation function is the mean, and the aggfunc parameter selects a different one.<\/p>\n\n\n\n<p>Reshaping data is essential for cleaning, exploring, and preparing datasets for modeling.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Concatenating Data<\/strong><\/h2>\n\n\n\n<p>Concatenation appends or combines Pandas objects along a particular axis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Concatenating DataFrames and Series<\/strong><\/h3>\n\n\n\n<p>Concatenate DataFrames vertically (default axis=0):<\/p>\n\n\n\n<p>df = pd.concat([df3, df1])<\/p>\n\n\n\n<p>Concatenate Series vertically (the result is a Series):<\/p>\n\n\n\n<p>s = pd.concat([S3, S1])<\/p>\n\n\n\n<p>Concatenate along columns (axis=1):<\/p>\n\n\n\n<p>df = pd.concat([df3, S1], axis=1)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Copy Parameter<\/strong><\/h3>\n\n\n\n<p>By default, concat() copies data.
Use copy=False to avoid copying, but this can have side effects if the original data changes. In recent pandas versions that use Copy-on-Write, this keyword may be deprecated or have no effect, so check the documentation for your version.<\/p>\n\n\n\n<p>Concatenation is useful for combining datasets with similar columns or adding new columns from different sources.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Filtering Data<\/strong><\/h2>\n\n\n\n<p>Filtering allows selecting rows or columns based on specific conditions or patterns.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Using filter()<\/strong><\/h3>\n\n\n\n<p>Note that filter() selects by label names, not by the data values. Filter columns by explicit list:<\/p>\n\n\n\n<p>df = df.filter(items=['City', 'Country'])<\/p>\n\n\n\n<p>Filter columns by substring match:<\/p>\n\n\n\n<p>df = df.filter(like='tion', axis=1)<\/p>\n\n\n\n<p>Filter columns by regex pattern:<\/p>\n\n\n\n<p>df = df.filter(regex='Quest')<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Querying DataFrame<\/strong><\/h3>\n\n\n\n<p>Use query() to filter rows based on expressions:<\/p>\n\n\n\n<p>df = df.query('Speed &gt; 70')<\/p>\n\n\n\n<p>This returns rows where the value in the &#8216;Speed&#8217; column exceeds 70.<\/p>\n\n\n\n<p>Filtering helps focus on relevant subsets of data during analysis.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Working with Missing Data<\/strong><\/h2>\n\n\n\n<p>Real-world data often contains missing or null values.
Pandas provides multiple methods to handle them.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Dropping Missing Data<\/strong><\/h3>\n\n\n\n<p>Drop columns containing any null values:<\/p>\n\n\n\n<p>df.dropna(axis=1, inplace=True)<\/p>\n\n\n\n<p>Drop rows with any null values:<\/p>\n\n\n\n<p>df.dropna(inplace=True)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Filling Missing Data<\/strong><\/h3>\n\n\n\n<p>Fill missing values in a column with a constant. Assigning the result back is more reliable than calling fillna() with inplace=True on a selected column:<\/p>\n\n\n\n<p>df['London'] = df['London'].fillna('Newyork')<\/p>\n\n\n\n<p>Fill by propagating the previous valid value forward with ffill(), or the next valid value backward with bfill(). The older fillna(method='ffill') form is deprecated in recent pandas versions:<\/p>\n\n\n\n<p>df.ffill(inplace=True)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Replacing Values<\/strong><\/h3>\n\n\n\n<p>Replace specific values in the DataFrame (here 2 becomes 1 and 30 becomes 10):<\/p>\n\n\n\n<p>df.replace([2, 30], [1, 10])<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Interpolating Missing Data<\/strong><\/h3>\n\n\n\n<p>Interpolate missing numerical values:<\/p>\n\n\n\n<p>df.interpolate(method='linear', limit_direction='backward', axis=0)<\/p>\n\n\n\n<p>Handling missing data properly is critical to maintain data integrity and avoid analysis bias.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Pandas Statistical Functions<\/strong><\/h2>\n\n\n\n<p>Pandas simplifies many statistical operations commonly used in data analysis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Basic Statistics<\/strong><\/h3>\n\n\n\n<p>Calculate the mean for each column (pass numeric_only=True if the DataFrame also has non-numeric columns):<\/p>\n\n\n\n<p>df.mean()<\/p>\n\n\n\n<p>Calculate
median:<\/p>\n\n\n\n<p>df.median()<\/p>\n\n\n\n<p>Standard deviation:<\/p>\n\n\n\n<p>df.std()<\/p>\n\n\n\n<p>Maximum and minimum values:<\/p>\n\n\n\n<p>df.max()<\/p>\n\n\n\n<p>df.min()<\/p>\n\n\n\n<p>Count of non-null values:<\/p>\n\n\n\n<p>df.count()<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Summary Statistics<\/strong><\/h3>\n\n\n\n<p>Generate descriptive statistics:<\/p>\n\n\n\n<p>df.describe()<\/p>\n\n\n\n<p>This provides count, mean, standard deviation, min, max, and quartiles.<\/p>\n\n\n\n<p>These statistical functions allow quick insights into data distribution and variability.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Dropping Data<\/strong><\/h2>\n\n\n\n<p>Sometimes, it is necessary to remove specific rows or columns from a DataFrame. Each call below returns a new DataFrame unless inplace=True is passed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Dropping Columns<\/strong><\/h3>\n\n\n\n<p>Drop one or more columns by name:<\/p>\n\n\n\n<p>df.drop(['Nike'], axis=1)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Dropping Rows by Index<\/strong><\/h3>\n\n\n\n<p>Drop rows using index labels:<\/p>\n\n\n\n<p>df.drop(['Size'], axis=0)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Dropping Multiple Labels<\/strong><\/h3>\n\n\n\n<p>Drop row labels and column labels in a single call:<\/p>\n\n\n\n<p>df.drop(index='offers', columns='location')<\/p>\n\n\n\n<p>Dropping unwanted data helps to clean datasets and focus analysis on relevant information.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Pandas Indexing<\/strong><\/h2>\n\n\n\n<p>Indexing controls how data is accessed and manipulated
in Pandas.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Reading CSV with Index<\/strong><\/h3>\n\n\n\n<p>You can specify an index column when reading a CSV file:<\/p>\n\n\n\n<p>detail = pd.read_csv(&quot;employee_db.csv&quot;, index_col=&quot;Contact&quot;)<\/p>\n\n\n\n<p>This sets the &#8216;Contact&#8217; column as the DataFrame index.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Setting Index<\/strong><\/h3>\n\n\n\n<p>Change or set a new index:<\/p>\n\n\n\n<p>detail.set_index('Name', inplace=True)<\/p>\n\n\n\n<p>This sets the &#8216;Name&#8217; column as the index.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>MultiIndexing<\/strong><\/h3>\n\n\n\n<p>Pandas supports hierarchical indexing. The MultiIndex constructor needs both levels and codes, so a helper such as from_product() is usually simpler:<\/p>\n\n\n\n<p>multi_index = pd.MultiIndex.from_product([['2025-01-01', '2025-01-11', '2025-02-14'], ['mathew', 'linda']])<\/p>\n\n\n\n<p>This builds an index containing every combination of the two lists. MultiIndex allows more complex data structures and grouping.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Resetting Index<\/strong><\/h3>\n\n\n\n<p>Reset the index to default integers. For a MultiIndex, the level parameter restricts the reset to specific levels:<\/p>\n\n\n\n<p>df.reset_index(inplace=True)<\/p>\n\n\n\n<p>Indexing is powerful for organizing and accessing complex datasets efficiently.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Plotting DataFrames<\/strong><\/h2>\n\n\n\n<p>Data visualization is key to understanding data.
Pandas integrates well with Matplotlib to provide quick plots.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Histogram<\/strong><\/h3>\n\n\n\n<p>Create histograms to show data distribution:<\/p>\n\n\n\n<p>df.plot.hist()<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Scatter Plot<\/strong><\/h3>\n\n\n\n<p>Create scatter plots to visualize relationships:<\/p>\n\n\n\n<p>df.plot.scatter(x='column1', y='column2')<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Inline Plotting in Jupyter<\/strong><\/h3>\n\n\n\n<p>Use this magic command to enable inline plotting in notebooks:<\/p>\n\n\n\n<p>%matplotlib inline<\/p>\n\n\n\n<p>Plotting helps reveal trends, outliers, and patterns in data.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Advanced Data Manipulation Techniques in Pandas<\/strong><\/h2>\n\n\n\n<p>As you deepen your understanding of Pandas, you will encounter more advanced techniques that help efficiently transform, analyze, and extract insights from data. These techniques leverage Pandas\u2019 powerful functionality and allow you to write cleaner, faster, and more expressive code.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Working with Time Series Data<\/strong><\/h2>\n\n\n\n<p>Time series data is ubiquitous in finance, IoT, sales tracking, and many other fields.
Pandas provides extensive support for working with dates, times, and time-indexed data.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>DateTime Objects<\/strong><\/h3>\n\n\n\n<p>Pandas builds on NumPy\u2019s datetime64 type and Python\u2019s datetime module to handle date and time data.<\/p>\n\n\n\n<p>Convert a string column to a datetime:<\/p>\n\n\n\n<p>df['date_column'] = pd.to_datetime(df['date_column'])<\/p>\n\n\n\n<p>This ensures that the column is of datetime type and enables date\/time-specific operations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Setting DateTime Index<\/strong><\/h3>\n\n\n\n<p>Often, time series data benefits from having a datetime index:<\/p>\n\n\n\n<p>df.set_index('date_column', inplace=True)<\/p>\n\n\n\n<p>This makes time-based slicing and resampling easier.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Time-Based Indexing and Slicing<\/strong><\/h3>\n\n\n\n<p>With a datetime index, you can select data by date or date ranges:<\/p>\n\n\n\n<p>df.loc['2025-01-01']<\/p>\n\n\n\n<p>df.loc['2025-01-01':'2025-01-15']<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Resampling Time Series Data<\/strong><\/h3>\n\n\n\n<p>Resampling aggregates data over time intervals (e.g., daily to monthly):<\/p>\n\n\n\n<p>df.resample('M').mean()<\/p>\n\n\n\n<p>Here, &#8216;M&#8217; stands for month-end frequency (pandas 2.2 and later prefer the alias &#8216;ME&#8217;).
Other options include &#8216;D&#8217; for day, &#8216;H&#8217; for hour (&#8216;h&#8217; in pandas 2.2 and later), and more.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Rolling Window Calculations<\/strong><\/h3>\n\n\n\n<p>Rolling functions compute statistics over a sliding window:<\/p>\n\n\n\n<p>df['rolling_mean'] = df['value'].rolling(window=7).mean()<\/p>\n\n\n\n<p>This calculates a 7-period moving average, smoothing short-term fluctuations. The first six results are NaN because their windows are incomplete.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Time Shifts<\/strong><\/h3>\n\n\n\n<p>Shift time series data forward or backward:<\/p>\n\n\n\n<p>df['shifted'] = df['value'].shift(1)<\/p>\n\n\n\n<p>Each row now holds the previous row\u2019s value, which is useful for creating lag features in time series modeling.<\/p>\n\n\n\n<p>Working with time series requires understanding datetime formats, indexing, and aggregation, all well supported in Pandas.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Applying Functions Efficiently<\/strong><\/h2>\n\n\n\n<p>Applying custom or built-in functions to your data can transform or summarize it in powerful ways.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The apply() Method<\/strong><\/h3>\n\n\n\n<p>The apply() method lets you apply a function across DataFrame rows or columns.<\/p>\n\n\n\n<p>Example applying a function to each row:<\/p>\n\n\n\n<p>df['new_col'] = df.apply(lambda row: row['A'] + row['B'], axis=1)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Vectorized Operations<\/strong><\/h3>\n\n\n\n<p>Pandas and NumPy support vectorized operations, which are faster than row-wise operations:<\/p>\n\n\n\n<p>df['C'] = df['A'] + df['B']<\/p>\n\n\n\n<p>Prefer vectorized operations over apply() when possible for better performance.<\/p>\n\n\n\n<h3
class=\"wp-block-heading\"><strong>Using map() and applymap()<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>map()<\/strong> is used with Series to map values using a dictionary or function. With a dictionary, values that have no matching key become NaN:<br><\/li>\n<\/ul>\n\n\n\n<p>df['column'].map({'a': 1, 'b': 2})<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>applymap()<\/strong> applies a function element-wise across the entire DataFrame. In pandas 2.1 and later it is deprecated in favor of DataFrame.map():<br><\/li>\n<\/ul>\n\n\n\n<p>df.applymap(lambda x: x*2)<\/p>\n\n\n\n<p>These methods provide flexibility for customized data transformations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Handling Categorical Data<\/strong><\/h2>\n\n\n\n<p>Categorical data is common in datasets (e.g., gender, country, product category). Pandas offers specific support to optimize memory and performance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Converting to Category Type<\/strong><\/h3>\n\n\n\n<p>Convert string\/object columns to the category type:<\/p>\n\n\n\n<p>df['category_col'] = df['category_col'].astype('category')<\/p>\n\n\n\n<p>This saves memory and enables category-specific methods.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Categories and Codes<\/strong><\/h3>\n\n\n\n<p>Categories have underlying integer codes:<\/p>\n\n\n\n<p>df['category_col'].cat.codes<\/p>\n\n\n\n<p>You can see or manipulate categories explicitly. rename_categories() returns a new Series (its inplace option was removed in pandas 2.0), so assign the result back:<\/p>\n\n\n\n<p>df['category_col'].cat.categories<\/p>\n\n\n\n<p>df['category_col'] = df['category_col'].cat.rename_categories(['A', 'B', 'C'])<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Ordered Categories<\/strong><\/h3>\n\n\n\n<p>If categories have an order (e.g., low
&lt; medium &lt; high), set ordered=True:<\/p>\n\n\n\n<p>df['category_col'] = pd.Categorical(df['category_col'], categories=['low', 'medium', 'high'], ordered=True)<\/p>\n\n\n\n<p>This enables meaningful comparisons (for example, df['category_col'] &gt; 'low') and sorting in category order.<\/p>\n\n\n\n<p>Categorical types optimize performance for datasets with repeated labels and are essential for certain analyses.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Working with Text Data<\/strong><\/h2>\n\n\n\n<p>Text data is often messy and requires cleaning and transformation before analysis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>String Methods in Pandas<\/strong><\/h3>\n\n\n\n<p>Pandas provides vectorized string functions accessible via the .str accessor.<\/p>\n\n\n\n<p>Examples:<\/p>\n\n\n\n<p>df['column'].str.lower() &nbsp; &nbsp; &nbsp; # Convert to lowercase<\/p>\n\n\n\n<p>df['column'].str.upper() &nbsp; &nbsp; &nbsp; # Convert to uppercase<\/p>\n\n\n\n<p>df['column'].str.strip() &nbsp; &nbsp; &nbsp; # Remove leading and trailing whitespace<\/p>\n\n\n\n<p>df['column'].str.contains('pattern')&nbsp; # Boolean mask: True where the pattern is found<\/p>\n\n\n\n<p>df['column'].str.replace('old', 'new')&nbsp; # Replace substring<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Extracting Patterns<\/strong><\/h3>\n\n\n\n<p>Extract parts of strings with regular expressions:<\/p>\n\n\n\n<p>df['extracted'] = df['column'].str.extract(r'(\\d+)', expand=False)<\/p>\n\n\n\n<p>This extracts the first run of digits from each string; expand=False returns a Series rather than a one-column DataFrame.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Splitting and Joining<\/strong><\/h3>\n\n\n\n<p>Split strings into lists:<\/p>\n\n\n\n<p>df['split_col'] = 
df['column'].str.split(',')<\/p>\n\n\n\n<p>Join lists into strings:<\/p>\n\n\n\n<p>df['joined'] = df['split_col'].str.join(';')<\/p>\n\n\n\n<p>Working with text data is critical in natural language processing, feature extraction, and data cleaning.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Efficient Data Aggregation with groupby<\/strong><\/h2>\n\n\n\n<p>While basic grouping was covered earlier, more complex aggregations unlock powerful summarization capabilities.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Multiple Aggregations<\/strong><\/h3>\n\n\n\n<p>You can apply different aggregation functions to different columns:<\/p>\n\n\n\n<p>df.groupby('Category').agg({'Sales': 'sum', 'Profit': 'mean'})<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Using Custom Aggregation Functions<\/strong><\/h3>\n\n\n\n<p>Define your own aggregation functions:<\/p>\n\n\n\n<p>def range_func(x):<\/p>\n\n\n\n<p>&nbsp;&nbsp;&nbsp;&nbsp;return x.max() - x.min()<\/p>\n\n\n\n<p>df.groupby('Category').agg({'Sales': range_func})<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Filtering Groups<\/strong><\/h3>\n\n\n\n<p>Filter groups based on aggregate values:<\/p>\n\n\n\n<p>df.groupby('Category').filter(lambda x: x['Sales'].sum() &gt; 1000)<\/p>\n\n\n\n<p>This keeps all rows from groups whose total sales exceed 1000.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Transforming Grouped Data<\/strong><\/h3>\n\n\n\n<p>transform() returns an object with the same index as the original, holding the transformed values:<\/p>\n\n\n\n<p>df['Sales_zscore'] = 
df.groupby('Category')['Sales'].transform(lambda x: (x - x.mean()) \/ x.std())<\/p>\n\n\n\n<p>This is useful for normalizing values within each group.<\/p>\n\n\n\n<p>Mastering groupby operations enables insightful data summaries and feature engineering.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Performance Tips<\/strong><\/h2>\n\n\n\n<p>Large datasets call for attention to performance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Avoid Loops<\/strong><\/h3>\n\n\n\n<p>Avoid explicit Python loops over DataFrames. Use vectorized Pandas or NumPy operations instead.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Use Categoricals<\/strong><\/h3>\n\n\n\n<p>Convert repeated strings to categorical dtype to reduce memory usage and speed up operations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Use Efficient Joins<\/strong><\/h3>\n\n\n\n<p>When merging large datasets, ensure the keys are indexed to speed up joins.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Use Chunking for Large Files<\/strong><\/h3>\n\n\n\n<p>For huge files, read in chunks with pd.read_csv() using the chunksize parameter to avoid memory overload.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Use <\/strong><strong>.eval()<\/strong><strong> and <\/strong><strong>.query()<\/strong><strong> for Speed<\/strong><\/h3>\n\n\n\n<p>eval() lets you perform operations efficiently using pandas expressions:<\/p>\n\n\n\n<p>df.eval('new_col = A + B', inplace=True)<\/p>\n\n\n\n<p>query() allows fast filtering using expressions:<\/p>\n\n\n\n<p>df.query('A &gt; 5 &amp; B &lt; 10')<\/p>\n\n\n\n<p>These can be faster than standard Pandas syntax for large DataFrames.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Exporting Data<\/strong><\/h2>\n\n\n\n<p>After processing, saving the results is crucial.<\/p>\n\n\n\n<h3 
class=\"wp-block-heading\"><strong>Export to CSV<\/strong><\/h3>\n\n\n\n<p>df.to_csv('filename.csv', index=False)<\/p>\n\n\n\n<p>Set index=False to exclude row labels.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Export to Excel<\/strong><\/h3>\n\n\n\n<p>df.to_excel('filename.xlsx', index=False)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Export to JSON<\/strong><\/h3>\n\n\n\n<p>df.to_json('filename.json')<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Export to SQL Databases<\/strong><\/h3>\n\n\n\n<p>Save a DataFrame to a SQL table:<\/p>\n\n\n\n<p>df.to_sql('table_name', connection_object)<\/p>\n\n\n\n<p>The right format depends on your data sharing and analysis needs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Advanced Data Analysis and Visualization with Pandas<\/strong><\/h2>\n\n\n\n<p>Building on the fundamentals and intermediate techniques, this part covers advanced data analysis strategies and visualization options in Pandas. These topics help you not only manipulate data but also understand and present it effectively.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Handling Missing Data in Depth<\/strong><\/h2>\n\n\n\n<p>Missing data is common in real-world datasets, and handling it correctly is vital for accurate analysis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Detecting Missing Data<\/strong><\/h3>\n\n\n\n<p>Pandas identifies missing values as NaN (Not a Number) or None. 
Detect missing values using:<\/p>\n\n\n\n<p>df.isnull()<\/p>\n\n\n\n<p>df.isnull().sum()<\/p>\n\n\n\n<p>The first call returns a Boolean DataFrame; the second returns the count of missing values per column.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Dropping Missing Data<\/strong><\/h3>\n\n\n\n<p>Remove rows or columns with missing values:<\/p>\n\n\n\n<p>df.dropna()&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Drop rows with any missing value<\/p>\n\n\n\n<p>df.dropna(axis=1)&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Drop columns with any missing value<\/p>\n\n\n\n<p>df.dropna(thresh=2)&nbsp; &nbsp; &nbsp; &nbsp; # Keep rows with at least 2 non-null values<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Filling Missing Data<\/strong><\/h3>\n\n\n\n<p>Fill missing values with a specific value or strategy. Recent pandas versions deprecate fillna(method=...), so use ffill() and bfill() directly:<\/p>\n\n\n\n<p>df.fillna(0) &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # Replace NaN with 0<\/p>\n\n\n\n<p>df.ffill()&nbsp; # Forward fill to propagate last valid observation<\/p>\n\n\n\n<p>df.bfill()&nbsp; # Backward fill to propagate next valid observation<\/p>\n\n\n\n<p>You can also fill with the mean or median of the column, assigning the result back:<\/p>\n\n\n\n<p>df['column'] = df['column'].fillna(df['column'].mean())<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Interpolation<\/strong><\/h3>\n\n\n\n<p>Interpolation fills missing values by estimating them based on surrounding data:<\/p>\n\n\n\n<p>df.interpolate(method='linear', inplace=True)<\/p>\n\n\n\n<p>This is useful for time series or numerical data with continuous values.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Working with MultiIndex DataFrames<\/strong><\/h2>\n\n\n\n<p>MultiIndex (hierarchical indexing) allows 
more complex data organization, supporting multiple index levels on rows and columns.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Creating MultiIndex<\/strong><\/h3>\n\n\n\n<p>Create a MultiIndex from arrays or tuples:<\/p>\n\n\n\n<p>arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]<\/p>\n\n\n\n<p>index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number'))<\/p>\n\n\n\n<p>df = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=index)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Accessing Data in MultiIndex<\/strong><\/h3>\n\n\n\n<p>Use loc with tuples to access data:<\/p>\n\n\n\n<p>df.loc[('A', 1)]<\/p>\n\n\n\n<p>Slice data by levels:<\/p>\n\n\n\n<p>df.loc['A']<\/p>\n\n\n\n<p>df.loc[pd.IndexSlice[:, 2], :]<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Resetting and Setting MultiIndex<\/strong><\/h3>\n\n\n\n<p>Convert MultiIndex to columns:<\/p>\n\n\n\n<p>df.reset_index(inplace=True)<\/p>\n\n\n\n<p>Set multiple columns as an index:<\/p>\n\n\n\n<p>df.set_index(['col1', 'col2'], inplace=True)<\/p>\n\n\n\n<p>MultiIndex is powerful for representing grouped data with multiple categorical levels.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Combining DataFrames: Concatenation, Merge, and Join<\/strong><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Concatenation<\/strong><\/h3>\n\n\n\n<p>Concatenate along rows or columns:<\/p>\n\n\n\n<p>pd.concat([df1, df2], axis=0)&nbsp; # Row-wise concatenation<\/p>\n\n\n\n<p>pd.concat([df1, df2], axis=1)&nbsp; # Column-wise concatenation<\/p>\n\n\n\n<h3 
class=\"wp-block-heading\"><strong>Merge<\/strong><\/h3>\n\n\n\n<p>Merge DataFrames based on keys, similar to SQL joins:<\/p>\n\n\n\n<p>pd.merge(df1, df2, on='key', how='inner')<\/p>\n\n\n\n<p>Types of joins:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>inner: intersection of keys<br><\/li>\n\n\n\n<li>left: all keys from the left DataFrame<br><\/li>\n\n\n\n<li>right: all keys from the right DataFrame<br><\/li>\n\n\n\n<li>outer: union of keys<br><\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Join<\/strong><\/h3>\n\n\n\n<p>join() is a convenient method for combining DataFrames on their index:<\/p>\n\n\n\n<p>df1.join(df2, how='left')<\/p>\n\n\n\n<p>Combining datasets effectively enables richer analyses.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Working with Window Functions<\/strong><\/h2>\n\n\n\n<p>Window functions allow calculations across a sliding window of data.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Rolling Windows<\/strong><\/h3>\n\n\n\n<p>Calculate moving averages or sums:<\/p>\n\n\n\n<p>df['rolling_mean'] = df['Value'].rolling(window=3).mean()<\/p>\n\n\n\n<p>df['rolling_sum'] = df['Value'].rolling(window=3).sum()<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Expanding Windows<\/strong><\/h3>\n\n\n\n<p>Compute cumulative statistics:<\/p>\n\n\n\n<p>df['expanding_sum'] = df['Value'].expanding().sum()<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Exponentially Weighted Windows<\/strong><\/h3>\n\n\n\n<p>Give more weight to recent observations:<\/p>\n\n\n\n<p>df['ewm_mean'] = df['Value'].ewm(span=3).mean()<\/p>\n\n\n\n<p>Window functions are vital for smoothing time series and 
detecting trends.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Data Visualization with Pandas<\/strong><\/h2>\n\n\n\n<p>Visualizing data helps to uncover patterns and communicate findings.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Plotting Basics<\/strong><\/h3>\n\n\n\n<p>Pandas integrates with Matplotlib to plot directly from DataFrames.<\/p>\n\n\n\n<p>Enable inline plotting in Jupyter notebooks:<\/p>\n\n\n\n<p>%matplotlib inline<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Common Plot Types<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Line Plot<\/strong> (default):<br><\/li>\n<\/ul>\n\n\n\n<p>df.plot()<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Histogram<\/strong>:<br><\/li>\n<\/ul>\n\n\n\n<p>df['column'].plot.hist()<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Bar Plot<\/strong>:<br><\/li>\n<\/ul>\n\n\n\n<p>df.plot.bar(x='Category', y='Value')<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Scatter Plot<\/strong>:<br><\/li>\n<\/ul>\n\n\n\n<p>df.plot.scatter(x='col1', y='col2')<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Box Plot<\/strong>:<br><\/li>\n<\/ul>\n\n\n\n<p>df.boxplot(column='col1', by='group_col')<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Customizing Plots<\/strong><\/h3>\n\n\n\n<p>Adjust titles, labels, and colors:<\/p>\n\n\n\n<p>ax = df.plot(x='col1', y='col2', kind='scatter', color='red')<\/p>\n\n\n\n<p>ax.set_title('Scatter Plot')<\/p>\n\n\n\n<p>ax.set_xlabel(&#8216;X 
axis&#8217;)<\/p>\n\n\n\n<p>ax.set_ylabel('Y axis')<\/p>\n\n\n\n<p>Visualization is critical for data exploration and presentation.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Exporting and Saving Your Work<\/strong><\/h2>\n\n\n\n<p>Export your cleaned and processed data for sharing or further use.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>CSV and Excel<\/strong><\/h3>\n\n\n\n<p>Save to CSV or Excel files:<\/p>\n\n\n\n<p>df.to_csv('output.csv', index=False)<\/p>\n\n\n\n<p>df.to_excel('output.xlsx', index=False)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>JSON and SQL<\/strong><\/h3>\n\n\n\n<p>Export as JSON or to SQL databases:<\/p>\n\n\n\n<p>df.to_json('output.json')<\/p>\n\n\n\n<p>df.to_sql('table_name', connection)<\/p>\n\n\n\n<p>Choose the format based on your needs and downstream processes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Best Practices and Tips<\/strong><\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Always inspect your data after each major step (head(), info(), describe()).<br><\/li>\n\n\n\n<li>Use meaningful variable names for readability.<br><\/li>\n\n\n\n<li>Handle missing data thoughtfully to avoid skewed results.<br><\/li>\n\n\n\n<li>Utilize vectorized operations for efficiency.<br><\/li>\n\n\n\n<li>Use grouping and aggregation to summarize data.<br><\/li>\n\n\n\n<li>Leverage plotting to understand and communicate insights.<br><\/li>\n\n\n\n<li>Document your code for reproducibility.<br><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>This guide has covered a comprehensive range of topics, from basics to advanced data manipulation, time series handling, performance tips, and visualization in Pandas. Mastery of these tools empowers you to confidently analyze and derive insights from complex datasets. 
With continuous practice and exploration, Pandas becomes an invaluable asset in your data science toolkit.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Python Pandas is a simple, expressive, and one of the most important libraries in Python for data analysis and manipulation. It significantly simplifies working with real-world data, making data analysis faster and easier. For beginners, the variety of functions and operations can be overwhelming, so having a structured guide to help understand and apply Pandas [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-893","post","type-post","status-publish","format-standard","hentry","category-posts"],"_links":{"self":[{"href":"https:\/\/www.actualtests.com\/blog\/wp-json\/wp\/v2\/posts\/893"}],"collection":[{"href":"https:\/\/www.actualtests.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.actualtests.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.actualtests.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.actualtests.com\/blog\/wp-json\/wp\/v2\/comments?post=893"}],"version-history":[{"count":1,"href":"https:\/\/www.actualtests.com\/blog\/wp-json\/wp\/v2\/posts\/893\/revisions"}],"predecessor-version":[{"id":912,"href":"https:\/\/www.actualtests.com\/blog\/wp-json\/wp\/v2\/posts\/893\/revisions\/912"}],"wp:attachment":[{"href":"https:\/\/www.actualtests.com\/blog\/wp-json\/wp\/v2\/media?parent=893"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.actualtests.com\/blog\/wp-json\/wp\/v2\/categories?post=893"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.actualtests.com\/blog\/wp-json\/wp\/v2\/tags?post=893"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":tr
ue}]}}