These are not the functions in every tutorial. They are the ones experienced analysts reach for constantly and the ones beginners spend hours working around.
Most pandas tutorials teach you read_csv, groupby, merge, and fillna. Those are important. They are also the ones everyone learns in their first month.
The functions in this list are different. These are the ones that separate analysts who struggle with pandas from analysts who are fast with it. Not because they are complex; most of them are simple once you know they exist. But because nobody teaches them in beginner courses, and most analysts discover them by accident, if at all.
Each one comes with the real-world scenario that makes it necessary, a working code example, and what it replaces. The time it saves is not theoretical. It is the time you currently spend doing the same thing the long way.
- df.query() — Filter rows in plain English instead of chained Boolean conditions
You have a dataframe and you need to filter on three conditions simultaneously. The standard approach is df[(df['status'] == 'active') & (df['revenue'] > 10000) & (df['region'] == 'South')]. It works. It is also hard to read, easy to get the parentheses wrong, and becomes a mess with five conditions. query() fixes all of this.
# Standard approach — hard to read
result = df[(df['status'] == 'active') & (df['revenue'] > 10000) & (df['region'] == 'South')]
# query() approach — reads like English
result = df.query("status == 'active' and revenue > 10000 and region == 'South'")
# Reference a Python variable inside the query string with @
threshold = 10000
result = df.query('revenue > @threshold')
Key uses:
- Multi-condition filtering that needs to be readable and maintainable
- Dynamic filtering where the threshold or value comes from a variable
- Any filter with three or more conditions — the readability improvement is immediate
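query() also handles two cases that trip up the bracket syntax: column names containing spaces (wrap them in backticks) and membership tests with in. A quick sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'order status': ['active', 'closed', 'active'],  # note the space in the name
    'region': ['South', 'West', 'North'],
    'revenue': [12000, 8000, 15000],
})

# Backticks let query() reference column names with spaces
active = df.query("`order status` == 'active'")

# 'in' works directly inside the query string
south_west = df.query("region in ['South', 'West']")
```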
- df.assign() — Create multiple new columns in one clean statement without intermediate assignments
The standard way to create a new column is df['new_col'] = df['col_a'] + df['col_b']. That is fine for one column. When you are creating four derived columns in sequence, you get four separate lines that each reference df, any of which can silently overwrite the wrong thing. assign() creates new columns in a single expression and chains naturally.
# Standard approach — four separate lines, harder to audit
df['revenue_per_unit'] = df['revenue'] / df['units']
df['margin_pct'] = df['profit'] / df['revenue'] * 100
df['is_high_value'] = df['revenue'] > 50000
# assign() approach — one expression, chainable
df = df.assign(
    revenue_per_unit=lambda x: x['revenue'] / x['units'],
    margin_pct=lambda x: x['profit'] / x['revenue'] * 100,
    is_high_value=lambda x: x['revenue'] > 50000
)
Key uses:
- Creating multiple derived columns from existing columns in one step
- Chaining transformations — assign() returns a new dataframe, so .assign().query() chains cleanly
- Keeping transformation logic readable when creating several columns at once
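Because assign() returns a new dataframe, it composes directly with query(). A minimal sketch of the .assign().query() chain mentioned above, using hypothetical numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'revenue': [60000, 20000, 90000],
    'units': [100, 50, 300],
})

# Derive a column and filter on the original one in a single chain
high_value = (
    df
    .assign(revenue_per_unit=lambda x: x['revenue'] / x['units'])
    .query('revenue > 50000')
)
```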
- df.pipe() — Chain your entire cleaning pipeline into a readable sequence of steps
Real data cleaning involves many steps applied in sequence — strip whitespace, rename columns, cast types, drop duplicates, filter out bad rows. Most analysts write these as separate lines modifying df in place. pipe() lets you wrap each step as a function and chain them in order, producing a pipeline that reads as documentation of what the cleaning actually does.
def strip_whitespace(df):
    df.columns = df.columns.str.strip()
    return df

def standardise_categories(df):
    df['category'] = df['category'].str.lower().str.strip()
    return df

def remove_test_accounts(df):
    return df[~df['email'].str.contains('test|example', na=False)]
# Chain all steps with pipe() — reads as a documented pipeline
cleaned_df = (
    raw_df
    .pipe(strip_whitespace)
    .pipe(standardise_categories)
    .pipe(remove_test_accounts)
)
Key uses:
- Multi-step data cleaning where each step should be named, testable, and reorderable
- Any analysis that needs to be audited or handed to a colleague — the steps are self-documenting
- Production pipelines where individual cleaning functions need to be reused across datasets
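pipe() also forwards extra arguments to the wrapped function, which is what makes the same cleaning step reusable across datasets with different thresholds. A short sketch; drop_low_revenue is a hypothetical step, not from the pipeline above:

```python
import pandas as pd

def drop_low_revenue(df, threshold):
    # Keep only rows at or above the revenue threshold
    return df[df['revenue'] >= threshold]

raw_df = pd.DataFrame({'revenue': [500, 15000, 30000]})

# Arguments after the function name are passed straight through to it
cleaned_df = raw_df.pipe(drop_low_revenue, threshold=10000)
```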
- df.melt() — Reshape wide data into long format — the format most analysis actually needs
You receive a spreadsheet with months as column headers: Jan, Feb, Mar, Apr across the top. Every BI tool, every groupby operation, and every time-series analysis you want to do needs the data in long format — one row per month per entity. melt() does this transformation in one line.
# Wide format — months as columns
# customer | Jan | Feb | Mar
# A | 1200 | 1350 | 980
# Transform to long format with melt()
df_long = df.melt(
    id_vars=['customer'],
    value_vars=['Jan', 'Feb', 'Mar'],
    var_name='month',
    value_name='revenue'
)
# Result:
# customer | month | revenue
# A | Jan | 1200
# A | Feb | 1350
Key uses:
- Transforming pivot-style data exports into analysis-ready long format
- Preparing data for tools like seaborn or Tableau, which expect long-format input
- Any time you receive a report with time periods or categories as column headers
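The payoff comes immediately after the reshape: once the data is long, an ordinary groupby works on it. A runnable sketch with made-up numbers (note that if value_vars is omitted, melt() uses every column not listed in id_vars):

```python
import pandas as pd

# Wide export: months as column headers
df = pd.DataFrame({
    'customer': ['A', 'B'],
    'Jan': [1200, 700],
    'Feb': [1350, 900],
})

# Without value_vars, all non-id columns are melted
df_long = df.melt(id_vars=['customer'], var_name='month', value_name='revenue')

# Long format makes per-month aggregation a one-liner
monthly_totals = df_long.groupby('month')['revenue'].sum()
```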
- df.explode() — Unnest list-valued columns in one line — no loops required
You import a JSON export or a database table where one column contains a list of values per row — product tags, category labels, multi-select responses. To analyse any of those values you need one row per value. The loop-based approach is slow on large datasets and verbose. explode() handles it instantly.
import pandas as pd
df = pd.DataFrame({
    'order_id': [1, 2, 3],
    'tags': [['urgent', 'large'], ['standard'], ['urgent', 'fragile', 'international']]
})
# Expand list values into separate rows
df_exploded = df.explode('tags').reset_index(drop=True)
# Result: 6 rows — one per tag per order
# Now count frequency of each tag
tag_counts = df_exploded['tags'].value_counts()
Key uses:
- Unnesting JSON array fields from API or database exports
- Analysing multi-select survey responses where each cell contains a list
- Any column that contains Python lists or comma-separated values stored as strings (after splitting)
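For the comma-separated-string case mentioned above, str.split() produces the lists and explode() does the rest. A minimal sketch with hypothetical tag data:

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2],
    'tags': ['urgent,large', 'urgent,fragile'],
})

# Split the strings into lists, then expand to one row per tag
df_exploded = (
    df
    .assign(tags=df['tags'].str.split(','))
    .explode('tags')
    .reset_index(drop=True)
)
```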
- df.groupby().transform() — Add group-level aggregates back to the original dataframe — without losing rows
You need each row to show both its own revenue and the total revenue of its group. groupby().agg() collapses to one row per group. To get group aggregates alongside individual rows, you need transform(). It calculates the aggregate per group and broadcasts it back to every row in that group, preserving the original dataframe shape.
# Add group totals and group rank to original dataframe
df['group_total_revenue'] = df.groupby('region')['revenue'].transform('sum')
df['pct_of_group'] = df['revenue'] / df['group_total_revenue'] * 100
# Add group-level mean for normalisation
df['revenue_vs_group_avg'] = df['revenue'] - df.groupby('region')['revenue'].transform('mean')
# Rank within group — groupby().rank() broadcasts back to every row, just like transform
df['rank_in_region'] = df.groupby('region')['revenue'].rank(ascending=False)
Key uses:
- Calculating what percentage each row contributes to its group total
- Normalising values relative to their group mean or median
- Adding group-level statistics to individual rows for combined analysis
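A related pattern, not shown above but built on the same broadcast behaviour, is group-aware missing-value filling: each NaN gets its own group's mean rather than the global mean. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['South', 'South', 'West', 'West'],
    'revenue': [100.0, None, 50.0, 70.0],
})

# transform('mean') skips NaN and broadcasts each group's mean to its rows,
# so fillna() substitutes a group-specific value
df['revenue'] = df['revenue'].fillna(
    df.groupby('region')['revenue'].transform('mean')
)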
- pd.cut() and pd.qcut() — Bin continuous values into categories in one line
You have an age column, a revenue column, or a score column and you need to create age groups, revenue tiers, or quartiles. The manual approach is a series of np.where() or df.apply() conditions. pd.cut() bins by defined boundaries. pd.qcut() bins by quantile — equal frequency per bin rather than equal width.
# pd.cut() — define your own bin boundaries
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 18, 35, 55, 100],
    labels=['Under 18', '18-35', '36-55', '55+']
)
# pd.qcut() — equal frequency bins (quartiles)
df['revenue_quartile'] = pd.qcut(
    df['revenue'],
    q=4,
    labels=['Q1 (Bottom)', 'Q2', 'Q3', 'Q4 (Top)']
)
Key uses:
- Creating age bands, revenue tiers, or score categories from continuous data
- Segmenting customers into equal-sized quartile or decile groups with qcut()
- Any analysis that requires categorical grouping of a numerical variable
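One detail worth knowing: by default pd.cut() bins are right-inclusive, so an age of exactly 18 lands in 'Under 18' with the boundaries above. A small self-contained check, and since the result is an ordered categorical, value_counts(sort=False) reports the bins in order:

```python
import pandas as pd

ages = pd.Series([12, 25, 40, 70])

age_group = pd.cut(
    ages,
    bins=[0, 18, 35, 55, 100],        # bins are (0, 18], (18, 35], ...
    labels=['Under 18', '18-35', '36-55', '55+']
)

# Ordered categorical: counts come back in bin order, not frequency order
counts = age_group.value_counts(sort=False)
```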
- df.str accessor methods — Clean and transform text columns without loops or apply() calls
Text columns in real data are consistently messy. Trailing spaces. Mixed case. Embedded whitespace. Inconsistent formatting. The standard approach is apply() with a lambda or a custom function. The str accessor runs the same operations vectorised — faster and more readable.
# Clean a messy category column in one chain
df['category'] = (
    df['category']
    .str.strip()                            # Remove leading/trailing whitespace
    .str.lower()                            # Standardise case
    .str.replace(r'\s+', ' ', regex=True)   # Collapse internal spaces
)
# Extract from structured strings
df['domain'] = df['email'].str.split('@').str[1]
df['first_name'] = df['full_name'].str.split(' ').str[0]
# Check patterns
df['is_gmail'] = df['email'].str.endswith('@gmail.com')
Key uses:
- Standardising category labels that vary due to capitalisation or whitespace
- Extracting substrings — domains from emails, codes from product SKUs, names from full strings
- Detecting patterns in text columns without writing custom apply() functions
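For the SKU-style extraction mentioned above, str.extract() with named regex groups returns the captures as ready-made columns. A sketch assuming a hypothetical "letters-dash-digits" SKU format:

```python
import pandas as pd

df = pd.DataFrame({'sku': ['AB-1234', 'CD-5678']})

# Named groups become column names in the result (format is an assumption)
parts = df['sku'].str.extract(r'(?P<prefix>[A-Z]+)-(?P<code>\d+)')
```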
- df.pivot_table() — Build Excel-style pivot tables in code — and then use them in the next step of your pipeline
pivot_table() creates grouped summaries with hierarchical indexing that would take multiple groupby and unstack operations to replicate manually. It supports multiple aggregation functions, margins (totals), and fill values for missing combinations — in a single call. The result can be styled, exported to Excel, or used as input to the next analysis step.
# Multi-dimension summary with totals
summary = df.pivot_table(
    values='revenue',
    index='region',
    columns='product_category',
    aggfunc='sum',
    margins=True,    # Add row and column totals
    fill_value=0     # Replace NaN with 0 for missing combos
)
# Multiple aggregations in one pivot
detailed = df.pivot_table(
    values='revenue',
    index='region',
    columns='quarter',
    aggfunc=['sum', 'mean', 'count']
)
Key uses:
- Creating structured reporting summaries, equivalent to Excel pivot tables but code-driven
- Producing cross-tabulation of two categorical dimensions with a numerical metric
- Generating summaries with row and column totals in a single operation
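To feed the pivot into the next step of a pipeline, reset_index() flattens the grouped index back into ordinary columns. A runnable sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['South', 'South', 'West'],
    'product_category': ['A', 'B', 'A'],
    'revenue': [100, 200, 300],
})

summary = df.pivot_table(
    values='revenue', index='region', columns='product_category',
    aggfunc='sum', fill_value=0   # West/B never occurs, so it becomes 0
)

# reset_index() turns the pivot back into a flat table for further analysis
flat = summary.reset_index()
```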
- df.memory_usage() and df.astype() — Reduce dataframe memory by 50 to 90 percent before your kernel crashes
Large dataframes slow down, consume RAM, and eventually cause out-of-memory errors. The default dtypes pandas assigns are often wasteful — int64 when int8 is sufficient, object when category is more efficient. Checking memory usage and downcasting types before analysis prevents the crash that interrupts a two-hour processing job at the worst possible moment.
# Check current memory usage
print(df.memory_usage(deep=True).sum() / 1024**2, 'MB')
# Downcast integer columns
int_cols = df.select_dtypes('int64').columns
df[int_cols] = df[int_cols].apply(pd.to_numeric, downcast='integer')
# Convert low-cardinality string columns to category
for col in df.select_dtypes('object').columns:
    if df[col].nunique() / len(df) < 0.05:  # Less than 5% unique values
        df[col] = df[col].astype('category')
# Check memory after optimisation
print(df.memory_usage(deep=True).sum() / 1024**2, 'MB')
# Typical result: 40-80% reduction
Key uses:
- Loading large datasets for analysis — catching memory issues before they interrupt work
- Joining or merging large tables — reduced memory makes those operations significantly faster
- Any dataset over roughly 500 MB where performance is noticeably slow
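To see why the category conversion matters so much, here is a small self-contained demonstration with a synthetic low-cardinality column. A category column stores each unique string once plus a small integer code per row, instead of a full Python string object per row:

```python
import pandas as pd

# Synthetic example: two unique values repeated 100,000 times
df = pd.DataFrame({'region': ['South', 'West'] * 50_000})

before = df.memory_usage(deep=True).sum()
df['region'] = df['region'].astype('category')
after = df.memory_usage(deep=True).sum()

# Each row now holds a tiny integer code instead of a string object
print(f'{before / 1024**2:.1f} MB -> {after / 1024**2:.1f} MB')
```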
Wrapping Up
Ten functions. Most of them are not in beginner pandas tutorials. All of them are used by experienced analysts every week.
- query(): Multi-condition filtering that is readable and maintainable
- assign(): Multiple derived columns in one chainable expression
- pipe(): Named, reorderable cleaning steps that document themselves
- melt(): Wide-to-long transformation for any pivot-style export
- explode(): Unnest list-valued columns without loops
- transform(): Group aggregates broadcast back to individual rows
- cut() and qcut(): Continuous values to categorical bins in one line
- str accessor: Text cleaning vectorised faster than apply()
- pivot_table(): Code-driven pivot tables with totals and multiple aggregations
- memory_usage() + astype(): Memory optimisation before large dataset processing
Pick two of these that are new to you. Use them in your next analysis. By the end of the week, the time they save will make the case for the rest of the list.