placement brief / Interview Questions / interview questions / 08 Jun 2026

Pandas Interview Questions 2026: 28 Answers with Code

28 pandas interview questions with full code answers covering DataFrames, groupby, merge/join, reshape, time series, performance optimization, and real-world data manipulation for 2026.

By Aditya SharmaPublished 8 Jun 20262 sources listedSpot an error? Corrections open

4 min read last revised 8 Jun 2026

on this page§ 05

Pandas is the lingua franca of data science in Python, and pandas proficiency is tested in almost every data science and data analyst interview. Beyond basics, interviewers look for knowledge of GroupBy internals, efficient merging, time series resampling, memory optimization, and vectorized operations. This guide covers 28 pandas interview questions with full answers and code examples.

PapersAdda's take: Candidates report that GroupBy + merge questions with realistic business scenarios (e.g., "compute rolling 7-day revenue per user cohort") are the most common live coding pandas challenges. Vectorization vs apply performance questions appear in nearly every senior DS interview. Confirm the specific Python/data stack expected on the official company careers portal before you prepare.

Related articles: Data Science Interview Questions 2026 | NumPy Interview Questions 2026 | SQL for Data Analysts 2026 | Scikit-Learn Interview Questions 2026 | Statistics for Data Science 2026

Which Roles Test Pandas Deeply?

Role	Pandas Depth
Data Analyst	Core manipulation, pivot, merge, visualization
Data Scientist	Feature engineering, time series, performance
ML Engineer	Preprocessing pipelines, transform, memory optimization
Data Engineer (Python)	Large file handling, chunking, dtype optimization
Quant/FinTech	Time series, rolling windows, portfolio analytics

EASY: Core DataFrame Operations (Questions 1-8)

Q1. What is a pandas DataFrame? How does it differ from a Series?

A Series is a one-dimensional labeled array (single column with index). A DataFrame is a two-dimensional labeled data structure (collection of Series sharing an index).

import pandas as pd
import numpy as np

# Series
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="score")
print(s.dtype, s.shape)  # int64 (3,)

# DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78],
    "grade": ["A", "A", "B"],
})
print(df.dtypes)
print(df.shape)  # (3, 3)

# Access a column as Series
print(type(df["score"]))  # pandas.core.series.Series

# Access multiple columns as DataFrame
print(type(df[["score", "grade"]]))  # pandas.core.frame.DataFrame

Key difference: Series has one axis (index), DataFrame has two axes (index + columns). DataFrame columns are Series sharing the same index.

Q2. What is the difference between loc, iloc, and at/iat?

df = pd.DataFrame(
    {"score": [85, 92, 78], "grade": ["A", "A", "B"]},
    index=["alice", "bob", "charlie"]
)

# .loc: label-based (index labels + column names)
print(df.loc["alice"])               # Row by label
print(df.loc["alice", "score"])      # Single cell by label
print(df.loc["alice":"bob", :])      # Slice by label (inclusive)
print(df.loc[df["score"] > 80])      # Boolean mask

# .iloc: integer position-based (0-indexed)
print(df.iloc[0])                    # First row
print(df.iloc[0, 0])                 # First row, first column
print(df.iloc[0:2, :])               # Rows 0 and 1 (exclusive upper)

# .at/.iat: single cell access (faster than loc/iloc for scalars)
print(df.at["alice", "score"])       # Label-based single cell
print(df.iat[0, 0])                  # Position-based single cell

# Common gotcha: loc slice is INCLUSIVE on both ends, iloc is exclusive on upper end

Q3. How do you handle missing values in pandas?

df = pd.DataFrame({
    "age": [25, None, 32, np.nan, 28],
    "city": ["Delhi", None, "Mumbai", "Bangalore", None],
    "salary": [50000, 60000, None, 70000, 55000],
})

# Detection
print(df.isnull().sum())
print(df.notnull().sum())
print(f"Missing %: {df.isnull().mean() * 100}")

# Drop
df_no_null = df.dropna()                   # Drop rows with ANY null
df_no_null_cols = df.dropna(axis=1)        # Drop columns with ANY null
df_thresh = df.dropna(thresh=2)            # Keep rows with >= 2 non-null values

# Fill
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
df["salary"] = df["salary"].fillna(method="ffill")  # Forward fill

# Fill with group-specific median
df["age"] = df.groupby("city")["age"].transform(
    lambda x: x.fillna(x.median())
)

# Check for None vs NaN (both are null in pandas but different in Python)
print(pd.isna(None))   # True
print(pd.isna(np.nan)) # True

Q4. Explain the difference between merge, join, and concat.

import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "salary": [50, 60, 70]})
df3 = pd.DataFrame({"id": [5, 6], "name": ["D", "E"]})

# merge: SQL-style join on key column(s)
inner = pd.merge(df1, df2, on="id", how="inner")   # rows present in both
left  = pd.merge(df1, df2, on="id", how="left")    # all rows from df1
right = pd.merge(df1, df2, on="id", how="right")   # all rows from df2
outer = pd.merge(df1, df2, on="id", how="outer")   # union

# join: merge on index (syntactic shortcut)
df1_idx = df1.set_index("id")
df2_idx = df2.set_index("id")
joined = df1_idx.join(df2_idx, how="left")  # uses index

# concat: stack DataFrames vertically or horizontally
vertical = pd.concat([df1, df3], axis=0, ignore_index=True)   # row-wise
horizontal = pd.concat([df1, df2.drop("id", axis=1)], axis=1) # col-wise

# Key difference:
# merge -- flexible SQL join, any columns
# join -- merge on index
# concat -- stack without matching keys

Q5. What is GroupBy? How does split-apply-combine work?

import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North"],
    "product": ["A", "A", "B", "B", "A"],
    "sales": [100, 200, 150, 300, 120],
    "returns": [5, 10, 8, 15, 6],
})

# Basic aggregation
print(df.groupby("region")["sales"].sum())

# Multiple aggregation
print(df.groupby("region").agg(
    total_sales=("sales", "sum"),
    avg_sales=("sales", "mean"),
    total_returns=("returns", "sum"),
    sales_std=("sales", "std"),
))

# Multi-level groupby
print(df.groupby(["region", "product"])["sales"].sum().unstack())

# Transform: group-level operation that returns same-shape output
df["sales_pct_of_region"] = df.groupby("region")["sales"].transform(
    lambda x: x / x.sum() * 100
)

# Filter: keep groups that meet a condition
high_sales_regions = df.groupby("region").filter(
    lambda x: x["sales"].sum() > 300
)

# Apply: apply arbitrary function
def region_stats(group):
    return pd.Series({
        "sales_range": group["sales"].max() - group["sales"].min(),
        "return_rate": group["returns"].sum() / group["sales"].sum(),
    })

print(df.groupby("region").apply(region_stats))

Q6. How do you reshape a DataFrame? Explain pivot_table, melt, stack, and unstack.

import pandas as pd

df = pd.DataFrame({
    "date": ["2026-01", "2026-01", "2026-02", "2026-02"],
    "region": ["North", "South", "North", "South"],
    "sales": [100, 200, 150, 250],
    "returns": [5, 10, 8, 12],
})

# pivot_table: wide format (like SQL pivot)
pivot = df.pivot_table(
    values="sales", index="date", columns="region",
    aggfunc="sum", fill_value=0
)
print(pivot)
# region  North  South
# date
# 2026-01   100    200
# 2026-02   150    250

# melt: wide to long format (inverse of pivot)
wide_df = pd.DataFrame({"id": [1, 2], "Jan": [100, 200], "Feb": [150, 250]})
long_df = wide_df.melt(id_vars="id", var_name="month", value_name="sales")
print(long_df)

# stack: column labels become innermost row index
stacked = pivot.stack()  # Series with MultiIndex (date, region)

# unstack: innermost row index becomes columns
unstacked = stacked.unstack(level="region")  # back to original pivot

Q7. What are vectorized string operations in pandas?

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice Smith", "bob jones", "CHARLIE BROWN"],
    "email": ["[email protected]", "[email protected]", "[email protected]"],
    "phone": ["+91-9876543210", "9123456789", "  +91 8765432109  "],
})

# str accessor for vectorized string ops
df["name_clean"] = df["name"].str.strip().str.title()
df["email_domain"] = df["email"].str.split("@").str[1]
df["has_gmail"] = df["email"].str.endswith("gmail.com")
df["first_name"] = df["name"].str.split().str[0]

# Regex operations
df["phone_clean"] = df["phone"].str.strip().str.replace(r"[^\d]", "", regex=True)
df["is_valid_phone"] = df["phone_clean"].str.match(r"^[6-9]\d{9}$")

# Contains / startswith / endswith
gmail_users = df[df["email"].str.contains("gmail", na=False)]

# Extract with named groups
df["country_code"] = df["phone"].str.extract(r"\+(\d{2})")

print(df[["name_clean", "email_domain", "phone_clean", "is_valid_phone"]])

Performance note: str accessor operations avoid Python-level loops by calling optimized C code. Still avoid on very large datasets -- use polars or numba for >10M rows.

Q8. How do you sort a DataFrame? What is the difference between sort_values and sort_index?

import pandas as pd

df = pd.DataFrame({
    "name": ["Charlie", "Alice", "Bob"],
    "score": [78, 92, 85],
    "age": [25, 22, 28],
}, index=[3, 1, 2])

# sort_values: sort by column content
df_sorted = df.sort_values("score", ascending=False)
df_multi = df.sort_values(["score", "age"], ascending=[False, True])

# sort_index: sort by row index
df_idx_sorted = df.sort_index()  # index 1, 2, 3

# rank: assign rank per column
df["score_rank"] = df["score"].rank(ascending=False, method="min")
df["score_pct_rank"] = df["score"].rank(pct=True)

# nlargest / nsmallest: efficient top-K
top_3 = df.nlargest(3, "score")       # faster than sort + head
bottom_2 = df.nsmallest(2, "score")

# argsort-equivalent (get integer positions)
sorted_positions = df["score"].argsort()[::-1]  # descending positions

MEDIUM: Advanced Operations and Performance (Questions 9-22)

Q9. What is the difference between apply, map, and applymap (now map for DataFrames)?

import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
s = pd.Series([1, 2, 3])

# Series.map: element-wise on a Series (can use dict or function)
s_mapped = s.map({1: "one", 2: "two", 3: "three"})
s_func = s.map(lambda x: x ** 2)

# DataFrame.apply: apply function along axis (row or column)
col_means = df.apply(np.mean, axis=0)           # apply over rows -> per-column result
row_sums = df.apply(lambda row: row.sum(), axis=1)  # apply over columns -> per-row result

# DataFrame.map (pandas 2.1+ -- was applymap before): element-wise on DataFrame
df_formatted = df.map(lambda x: f"${x:.2f}")

# PERFORMANCE WARNING: apply and map call Python for each element
# Always prefer vectorized alternatives:

# Slow: apply for column operation
# df["c"] = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Fast: vectorized (direct arithmetic)
df["c"] = df["a"] + df["b"]

# Slow: apply with numpy function
# df["log_a"] = df["a"].apply(np.log)

# Fast: direct numpy on Series
df["log_a"] = np.log(df["a"])

Q10. How do you optimize pandas memory usage?

import pandas as pd
import numpy as np

def optimize_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Reduce memory usage by downcasting numeric types and using categoricals."""
    original_memory = df.memory_usage(deep=True).sum() / 1024**2
    
    for col in df.select_dtypes(include=['int64']).columns:
        col_min, col_max = df[col].min(), df[col].max()
        if col_min >= 0:
            if col_max < 255:
                df[col] = df[col].astype(np.uint8)
            elif col_max < 65535:
                df[col] = df[col].astype(np.uint16)
            elif col_max < 4294967295:
                df[col] = df[col].astype(np.uint32)
        else:
            if col_min >= -128 and col_max < 127:
                df[col] = df[col].astype(np.int8)
            elif col_min >= -32768 and col_max < 32767:
                df[col] = df[col].astype(np.int16)
            elif col_min >= -2147483648 and col_max < 2147483647:
                df[col] = df[col].astype(np.int32)

    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype(np.float32)

    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < 0.5:  # less than 50% unique -> categorical
            df[col] = df[col].astype('category')

    new_memory = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory: {original_memory:.1f}MB -> {new_memory:.1f}MB ({(1 - new_memory/original_memory)*100:.0f}% savings)")
    return df

# Read large CSV in chunks to avoid memory overflow
chunks = []
for chunk in pd.read_csv("large.csv", chunksize=100_000, dtype={"id": np.int32}):
    chunk = optimize_dtypes(chunk)
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)

Q11. How do you work with time series data in pandas?

import pandas as pd
import numpy as np

# Create time series
dates = pd.date_range("2026-01-01", periods=365, freq="D")
df = pd.DataFrame({
    "date": dates,
    "revenue": np.random.randn(365).cumsum() + 100,
    "orders": np.random.randint(50, 200, 365),
})
df = df.set_index("date")

# Resample: aggregate to different frequency
weekly = df.resample("W").agg({"revenue": "sum", "orders": "sum"})
monthly = df.resample("ME").agg({"revenue": "mean", "orders": "sum"})

# Rolling window
df["revenue_7d_ma"] = df["revenue"].rolling(window=7, min_periods=1).mean()
df["revenue_30d_std"] = df["revenue"].rolling(window=30).std()

# Expanding window (cumulative)
df["revenue_cummax"] = df["revenue"].expanding().max()

# Shift for lag features (CRITICAL: avoids leakage in forecasting)
df["revenue_lag1"] = df["revenue"].shift(1)   # yesterday
df["revenue_lag7"] = df["revenue"].shift(7)   # last week

# Pct change
df["revenue_pct_change"] = df["revenue"].pct_change()
df["revenue_wow"] = df["revenue"].pct_change(periods=7)  # week-over-week

# Time-based indexing
jan_data = df["2026-01"]             # entire month
q1_data = df["2026-01":"2026-03"]    # quarter

# Business day resampling
business_weekly = df.resample("W-FRI").sum()  # Friday week-end

# Time zone handling
df.index = df.index.tz_localize("UTC").tz_convert("Asia/Kolkata")

Q12. What is the performance difference between iterrows, itertuples, and vectorized operations?

import pandas as pd
import numpy as np
import timeit

df = pd.DataFrame({
    "a": np.random.randn(100_000),
    "b": np.random.randn(100_000),
})

# DON'T: iterrows (very slow -- creates Series for each row)
# ~30 seconds for 100K rows
# for idx, row in df.iterrows():
#     df.loc[idx, "c"] = row["a"] + row["b"]

# BETTER: itertuples (3-10x faster than iterrows)
# ~3 seconds for 100K rows
# results = [row.a + row.b for row in df.itertuples()]

# BEST: vectorized operations
# ~0.001 seconds for 100K rows
df["c"] = df["a"] + df["b"]

# Vectorized conditional (replaces apply with if/else)
# Slow apply approach:
# df["category"] = df["a"].apply(lambda x: "high" if x > 0 else "low")

# Fast np.where:
df["category"] = np.where(df["a"] > 0, "high", "low")

# Multi-condition: np.select
conditions = [df["a"] > 1, df["a"].between(-1, 1), df["a"] < -1]
choices = ["high", "medium", "low"]
df["tier"] = np.select(conditions, choices, default="unknown")

# Speed comparison (approximate, 100K rows):
# iterrows: ~30s | itertuples: ~3s | vectorized: ~1ms

Q13. How do you compute window functions similar to SQL in pandas?

import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "date": pd.to_datetime(["2026-01-01", "2026-01-03", "2026-01-05",
                             "2026-01-02", "2026-01-04", "2026-01-01"]),
    "amount": [100, 200, 150, 300, 100, 500],
})

# ROW_NUMBER() equivalent
df["row_num"] = df.sort_values("date").groupby("user_id").cumcount() + 1

# RANK() equivalent
df["rank"] = df.groupby("user_id")["amount"].rank(method="min", ascending=False)

# DENSE_RANK() equivalent
df["dense_rank"] = df.groupby("user_id")["amount"].rank(method="dense", ascending=False)

# Running total (SUM() OVER (PARTITION BY user_id ORDER BY date))
df = df.sort_values(["user_id", "date"])
df["running_total"] = df.groupby("user_id")["amount"].cumsum()

# LAG / LEAD equivalent
df["prev_amount"] = df.groupby("user_id")["amount"].shift(1)
df["next_amount"] = df.groupby("user_id")["amount"].shift(-1)

# Partition-level aggregation (AVG() OVER PARTITION BY user_id)
df["user_avg"] = df.groupby("user_id")["amount"].transform("mean")
df["pct_of_user_total"] = df["amount"] / df.groupby("user_id")["amount"].transform("sum")

print(df[["user_id", "date", "amount", "row_num", "running_total", "prev_amount"]].to_string())

Q14. How do you handle duplicate data in pandas?

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3, 3, 3],
    "name": ["Alice", "Bob", "Bob", "Charlie", "Charlie", "Charlie"],
    "score": [85, 92, 92, 78, 80, 78],
})

# Detect duplicates
print(df.duplicated().sum())              # Total duplicate rows
print(df.duplicated(subset=["id"]).sum()) # Duplicate IDs
print(df[df.duplicated(keep=False)])      # Show ALL duplicates including first

# Remove duplicates
df_clean = df.drop_duplicates()                          # keep first occurrence
df_clean_last = df.drop_duplicates(keep="last")          # keep last
df_clean_id = df.drop_duplicates(subset=["id"])          # dedup on id only
df_no_dupes = df.drop_duplicates(subset=["id"], keep=False)  # remove all dupes

# Keep row with highest score per id (common interview scenario)
df_best = df.sort_values("score", ascending=False).drop_duplicates(subset=["id"])

# Count duplicates per key
dup_counts = df.groupby(["id", "name"]).size().reset_index(name="count")
print(dup_counts[dup_counts["count"] > 1])

Q15. Explain multi-index (hierarchical index) in pandas.

import pandas as pd
import numpy as np

# Create MultiIndex DataFrame
arrays = [
    ["2026-Q1", "2026-Q1", "2026-Q2", "2026-Q2"],
    ["North", "South", "North", "South"],
]
index = pd.MultiIndex.from_arrays(arrays, names=["quarter", "region"])
df = pd.DataFrame({"sales": [100, 200, 150, 250], "returns": [5, 10, 8, 12]}, index=index)

# Access
print(df.loc["2026-Q1"])                    # All regions in Q1
print(df.loc[("2026-Q1", "North")])         # Specific level
print(df.loc["2026-Q1", "sales"])           # Q1, sales column

# xs (cross-section): more readable for level selection
print(df.xs("North", level="region"))       # All quarters, North only

# Reset index: MultiIndex -> columns
df_flat = df.reset_index()

# Set MultiIndex from columns
df_flat = df_flat.set_index(["quarter", "region"])

# Aggregate on one level
print(df.groupby(level="quarter").sum())
print(df.groupby(level="region").mean())

# Sort MultiIndex
df_sorted = df.sort_index(level=["region", "quarter"])

# stack/unstack between MultiIndex and columns
print(df["sales"].unstack(level="region"))  # quarter as index, regions as columns

Q16. How do you use pandas query() and eval() for performance?

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": np.random.randint(18, 65, 100_000),
    "salary": np.random.randint(30_000, 200_000, 100_000),
    "department": np.random.choice(["Engineering", "Sales", "Marketing"], 100_000),
    "score": np.random.random(100_000),
})

# Standard boolean indexing (readable)
result = df[(df["age"] > 30) & (df["salary"] > 80_000) & (df["department"] == "Engineering")]

# query(): string-based, more readable, slightly faster for large DataFrames
result_q = df.query("age > 30 and salary > 80_000 and department == 'Engineering'")

# Use local variables in query with @
min_age = 30
min_salary = 80_000
result_var = df.query("age > @min_age and salary > @min_salary")

# eval(): compute new columns without Python overhead
# Each operation creates a temporary array -- eval avoids this
df.eval("bonus = salary * 0.15", inplace=True)
df.eval("total_comp = salary + bonus", inplace=True)
df.eval("high_earner = salary > 100_000", inplace=True)

# Multiple eval operations (one pass)
df = df.eval("""
    tax = salary * 0.30
    net = salary - tax
    score_pct = score * 100
""")

Q17. What is the difference between wide and long data formats? When do you convert?

import pandas as pd

# Wide format: one row per entity, each variable as a column
wide = pd.DataFrame({
    "student_id": [1, 2, 3],
    "math_score": [85, 92, 78],
    "english_score": [80, 88, 75],
    "science_score": [90, 85, 82],
})

# Long format: one row per (entity, variable) pair
long = wide.melt(
    id_vars="student_id",
    value_vars=["math_score", "english_score", "science_score"],
    var_name="subject",
    value_name="score",
)
print(long)
# student_id    subject  score
#          1 math_score     85
#          1 english_score  80
#          ...

# Convert back: long to wide
wide_again = long.pivot(index="student_id", columns="subject", values="score")
wide_again.columns.name = None  # remove "subject" column index name

# When to use each:
# Wide: ML modeling (each row = one sample, columns = features)
# Long: plotting with seaborn/matplotlib (faceting requires long format)
# Long: GroupBy analytics in SQL-like workflows
# Long: time series analysis

# Plot example requiring long format
import seaborn as sns
# sns.boxplot(data=long, x="subject", y="score")

Q18. How do you merge DataFrames on multiple keys? Handle mismatched column names?

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [101, 102, 101],
    "order_date": ["2026-01-01", "2026-01-02", "2026-01-03"],
    "product_id": [501, 502, 503],
})

customers = pd.DataFrame({
    "cust_id": [101, 102, 103],       # note: different column name
    "name": ["Alice", "Bob", "Charlie"],
    "city": ["Delhi", "Mumbai", "Bangalore"],
})

products = pd.DataFrame({
    "product_id": [501, 502, 503],
    "product_name": ["Laptop", "Phone", "Tablet"],
    "price": [50000, 20000, 30000],
})

# Merge with mismatched column names
result = pd.merge(
    orders, customers,
    left_on="customer_id", right_on="cust_id",  # handle different names
    how="left",
)

# Chain multiple merges
result = (
    orders
    .merge(customers, left_on="customer_id", right_on="cust_id", how="left")
    .merge(products, on="product_id", how="left")
    .drop(columns=["cust_id"])  # remove redundant key column
)

# Merge on multiple keys
events = pd.DataFrame({"user_id": [1, 1, 2], "date": ["2026-01-01", "2026-01-02", "2026-01-01"], "events": [5, 3, 7]})
budget = pd.DataFrame({"user_id": [1, 2], "date": ["2026-01-01", "2026-01-01"], "budget": [1000, 2000]})
merged = pd.merge(events, budget, on=["user_id", "date"], how="left")

Q19. How do you validate data quality in a pandas pipeline?

import pandas as pd
import numpy as np

def validate_dataframe(df: pd.DataFrame, schema: dict) -> list:
    """Run data quality checks and return list of violations."""
    violations = []

    for col, rules in schema.items():
        if col not in df.columns:
            violations.append(f"MISSING COLUMN: {col}")
            continue

        # Null check
        if rules.get("not_null") and df[col].isnull().any():
            null_pct = df[col].isnull().mean() * 100
            violations.append(f"{col}: {null_pct:.1f}% null values (expected 0)")

        # Range check
        if "min" in rules and (df[col].min() < rules["min"]).any() if hasattr(df[col].min(), '__iter__') else df[col].min() < rules["min"]:
            violations.append(f"{col}: min {df[col].min()} < expected min {rules['min']}")

        if "max" in rules and df[col].max() > rules["max"]:
            violations.append(f"{col}: max {df[col].max()} > expected max {rules['max']}")

        # Uniqueness
        if rules.get("unique") and df[col].duplicated().any():
            violations.append(f"{col}: {df[col].duplicated().sum()} duplicate values (expected unique)")

        # Categorical values
        if "allowed" in rules:
            bad = df[~df[col].isin(rules["allowed"])][col].unique()
            if len(bad) > 0:
                violations.append(f"{col}: unexpected values {bad}")

    return violations

schema = {
    "user_id": {"not_null": True, "unique": True},
    "age": {"not_null": True, "min": 0, "max": 120},
    "status": {"allowed": ["active", "inactive", "pending"]},
    "salary": {"min": 0},
}

violations = validate_dataframe(df, schema)
for v in violations:
    print(f"VIOLATION: {v}")

Q20. How do you use pandas with SQLAlchemy for database operations?

import pandas as pd
from sqlalchemy import create_engine, text

# Connect to PostgreSQL
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# Read from database
df = pd.read_sql("SELECT * FROM transactions WHERE date >= '2026-01-01'", engine)
df = pd.read_sql_table("users", engine, columns=["id", "name", "email"])

# Read with query object (parameterized)
query = text("SELECT * FROM orders WHERE user_id = :uid")
df = pd.read_sql(query, engine, params={"uid": 12345})

# Write to database
df.to_sql("ml_features", engine, if_exists="replace", index=False, 
          dtype={"user_id": "INTEGER", "score": "FLOAT"})

# Append to existing table
new_batch.to_sql("ml_features", engine, if_exists="append", index=False,
                 method="multi",  # batch inserts, much faster
                 chunksize=10_000)

# Use chunked reading for large tables
chunks = []
for chunk in pd.read_sql("SELECT * FROM large_table", engine, chunksize=50_000):
    processed = chunk.pipe(compute_features)
    chunks.append(processed)
df = pd.concat(chunks, ignore_index=True)

Q21. Explain the pandas pipe method. How does it improve code readability?

import pandas as pd
import numpy as np

# Without pipe: nested function calls (hard to read)
# result = add_features(clean_data(filter_data(load_data())))

# With pipe: left-to-right method chaining
def load_data(filepath):
    return pd.read_csv(filepath)

def clean_data(df):
    df = df.dropna(subset=["user_id", "amount"])
    df["amount"] = df["amount"].clip(lower=0)
    return df

def filter_data(df, min_amount=10):
    return df[df["amount"] >= min_amount]

def add_features(df):
    df["log_amount"] = np.log1p(df["amount"])
    df["is_large"] = df["amount"] > df["amount"].quantile(0.9)
    return df

# Pipe chains functions left to right
result = (
    pd.read_csv("data.csv")
    .pipe(clean_data)
    .pipe(filter_data, min_amount=10)
    .pipe(add_features)
    .query("is_large == True")
    .sort_values("amount", ascending=False)
)

Q22. What is the difference between copy() and a view in pandas? Explain SettingWithCopyWarning.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Slice creates a VIEW (not always a copy) -- modifying it may or may not modify df
subset = df[df["a"] > 1]      # may be a view
subset["b"] = 999              # SettingWithCopyWarning! Uncertain behavior

# ALWAYS use .copy() when you intend to modify a slice
subset = df[df["a"] > 1].copy()
subset["b"] = 999              # Safe -- modifies subset, not df
print(df)  # df unchanged

# Avoid chained assignment (guaranteed SettingWithCopyWarning)
# WRONG:
# df[df["a"] > 1]["b"] = 999

# RIGHT: use loc for in-place modification of original
df.loc[df["a"] > 1, "b"] = 999

# pandas 3.0+ CoW (Copy-on-Write): all indexing operations return copies
# SettingWithCopyWarning becomes obsolete but adapting early is good practice
pd.options.mode.copy_on_write = True  # opt-in to CoW behavior in pandas 2.x

HARD: Real-World Patterns (Questions 23-28)

Q23. How do you build a user cohort analysis pipeline?

import pandas as pd

# User cohort: month of first purchase
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 1, 2],
    "order_date": pd.to_datetime([
        "2026-01-05", "2026-02-10", "2026-01-15", "2026-03-20",
        "2026-02-01", "2026-03-15", "2026-04-10"
    ]),
    "revenue": [100, 150, 200, 250, 300, 120, 180],
})

# Assign cohort = month of first purchase per user
orders["cohort"] = orders.groupby("user_id")["order_date"].transform("min").dt.to_period("M")
orders["order_month"] = orders["order_date"].dt.to_period("M")

# Calculate periods since cohort
orders["cohort_period"] = (
    (orders["order_month"] - orders["cohort"]).apply(lambda x: x.n)
)

# Build retention matrix
cohort_table = orders.groupby(["cohort", "cohort_period"])["user_id"].nunique().unstack()
retention = cohort_table.divide(cohort_table.iloc[:, 0], axis=0)

print("\nRetention Matrix:")
print(retention.round(2))

# Revenue per cohort x period
revenue_table = orders.groupby(["cohort", "cohort_period"])["revenue"].sum().unstack()
print("\nRevenue by Cohort x Period:")
print(revenue_table)

Q24. How do you efficiently process a 10GB CSV file with pandas?

import pandas as pd
import numpy as np

# Strategy 1: Read only needed columns
df = pd.read_csv("large.csv", usecols=["user_id", "amount", "date"], 
                 dtype={"user_id": np.int32, "amount": np.float32})

# Strategy 2: Chunked processing (streaming)
def process_chunk(chunk):
    chunk = chunk.dropna(subset=["user_id"])
    chunk["date"] = pd.to_datetime(chunk["date"])
    chunk["log_amount"] = np.log1p(chunk["amount"])
    return chunk[chunk["amount"] > 0]

results = []
for chunk in pd.read_csv("large.csv", chunksize=500_000, 
                          dtype={"user_id": np.int32}):
    results.append(process_chunk(chunk))
df = pd.concat(results, ignore_index=True)

# Strategy 3: Filter at read time (no chunking needed for highly selective queries)
df_filtered = pd.read_csv(
    "large.csv",
    usecols=["user_id", "amount", "date"],
    skiprows=lambda i: i > 0 and np.random.random() > 0.1,  # 10% sample
)

# Strategy 4: Use Parquet for repeated access (10x faster than CSV)
df.to_parquet("data.parquet", compression="snappy", index=False)
df_fast = pd.read_parquet("data.parquet", columns=["user_id", "amount"])

# Strategy 5: Dask for truly large files (parallelized pandas API)
import dask.dataframe as dd
ddf = dd.read_csv("large.csv", dtype={"user_id": np.int32})
result = ddf.groupby("user_id")["amount"].sum().compute()

Q25. Implement a custom GroupBy aggregation for session analysis.

import pandas as pd
import numpy as np

# User clickstream events
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "timestamp": pd.to_datetime([
        "2026-01-01 10:00", "2026-01-01 10:05", "2026-01-01 10:30",
        "2026-01-01 11:00", "2026-01-01 11:10", "2026-01-01 12:00"
    ]),
    "page": ["home", "product", "checkout", "home", "search", "home"],
    "event_type": ["view", "view", "purchase", "view", "view", "view"],
})

# Session definition: gap > 20 minutes = new session
events = events.sort_values(["user_id", "timestamp"])

def assign_sessions(group, gap_minutes=20):
    """Assign session IDs based on inactivity gap."""
    time_diff = group["timestamp"].diff()
    new_session = (time_diff > pd.Timedelta(minutes=gap_minutes)) | time_diff.isna()
    group["session_id"] = new_session.cumsum()
    return group

events = events.groupby("user_id", group_keys=False).apply(assign_sessions)

# Session-level aggregation
session_stats = events.groupby(["user_id", "session_id"]).agg(
    session_start=("timestamp", "min"),
    session_end=("timestamp", "max"),
    page_views=("event_type", "count"),
    unique_pages=("page", "nunique"),
    has_purchase=("event_type", lambda x: (x == "purchase").any()),
).reset_index()

session_stats["duration_minutes"] = (
    session_stats["session_end"] - session_stats["session_start"]
).dt.total_seconds() / 60

print(session_stats)

Q26. How do you implement a sliding window join in pandas?

import pandas as pd
import numpy as np

# Match user events with the most recent campaign impression within 7 days
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "event_time": pd.to_datetime(["2026-01-05", "2026-01-10", "2026-01-08", "2026-01-15"]),
    "event_value": [100, 150, 200, 250],
})

impressions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "impression_time": pd.to_datetime(["2026-01-01", "2026-01-07", "2026-01-06"]),
    "campaign_id": ["C1", "C2", "C1"],
})

# Merge all combinations (cross join within user)
merged = pd.merge(events, impressions, on="user_id", how="left")

# Filter: impression must be before event and within 7-day attribution window
merged["days_diff"] = (merged["event_time"] - merged["impression_time"]).dt.days
valid = merged[(merged["days_diff"] >= 0) & (merged["days_diff"] <= 7)]

# Keep most recent impression per event (last touch attribution)
attributed = (
    valid.sort_values("impression_time", ascending=False)
    .drop_duplicates(subset=["user_id", "event_time"])
)

print(attributed[["user_id", "event_time", "event_value", "campaign_id", "days_diff"]])

Q27. How do you handle timezone-aware datetime operations?

import pandas as pd
import pytz

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "timestamp_utc": pd.to_datetime([
        "2026-01-01 05:30:00",   # UTC
        "2026-01-01 12:00:00",
        "2026-01-01 18:45:00",
    ], utc=True),
})

# Convert to IST (UTC+5:30)
df["timestamp_ist"] = df["timestamp_utc"].dt.tz_convert("Asia/Kolkata")
df["hour_ist"] = df["timestamp_ist"].dt.hour
df["is_peak_hours"] = df["hour_ist"].between(9, 22)

# Handle mixed timezone inputs
raw_timestamps = ["2026-01-01 10:00:00+05:30", "2026-01-01 06:00:00+01:00"]
df_tz = pd.DataFrame({"ts": pd.to_datetime(raw_timestamps, utc=True)})  # normalize to UTC first

# Localize naive datetime to specific timezone
naive_timestamps = pd.date_range("2026-01-01", periods=5, freq="h")
ist_timestamps = naive_timestamps.tz_localize("Asia/Kolkata")
utc_timestamps = ist_timestamps.tz_convert("UTC")

# Common error: comparing tz-aware and tz-naive
# Fixed by normalizing all to UTC before comparison
t1 = pd.Timestamp("2026-01-01 10:00:00", tz="UTC")
t2 = pd.Timestamp("2026-01-01 15:30:00", tz="Asia/Kolkata")
same_moment = t1 == t2.tz_convert("UTC")  # True -- same instant
print(f"Same moment? {same_moment}")

Q28. How do you profile and optimize a slow pandas pipeline?

import pandas as pd
import numpy as np
import cProfile
import line_profiler

# Step 1: Identify the bottleneck
def slow_pipeline(df):
    # Bad: iterrows + apply
    results = []
    for _, row in df.iterrows():
        results.append(row["a"] ** 2 + np.log(row["b"] + 1))
    df["result"] = results
    df["category"] = df["a"].apply(lambda x: "high" if x > 0 else "low")
    return df

def fast_pipeline(df):
    # Good: vectorized operations
    df = df.copy()
    df["result"] = df["a"] ** 2 + np.log1p(df["b"])
    df["category"] = np.where(df["a"] > 0, "high", "low")
    return df

# Step 2: Profile with %timeit (in Jupyter) or timeit
import timeit
df = pd.DataFrame({"a": np.random.randn(100_000), "b": np.random.rand(100_000)})

slow_time = timeit.timeit(lambda: slow_pipeline(df.copy()), number=3) / 3
fast_time = timeit.timeit(lambda: fast_pipeline(df.copy()), number=3) / 3
print(f"Slow: {slow_time:.2f}s | Fast: {fast_time:.4f}s | Speedup: {slow_time/fast_time:.0f}x")

# Step 3: Memory profiler
# pip install memory-profiler
# @profile decorator on function, run with python -m memory_profiler

# Step 4: Profiling top-level with cProfile
# python -m cProfile -s cumulative your_script.py | head -30

# Optimization checklist:
# 1. Replace iterrows/apply with vectorized ops
# 2. Use categorical dtype for low-cardinality columns
# 3. Downcast int64/float64 to int32/float32
# 4. Use query() for large boolean filters
# 5. Pre-sort DataFrames before merge
# 6. Use to_parquet instead of CSV for repeated reads
# 7. Profile before optimizing -- don't guess the bottleneck

FAQ

Q: Should I learn polars alongside pandas for 2026 interviews?

A: Pandas is still the primary expectation. Polars knowledge is a differentiator for senior roles at data-intensive companies. If asked about pandas performance, mentioning polars as an alternative for large datasets shows current awareness. Candidates from public preparation resources confirm that pandas remains the dominant interview subject.

Q: What is the most common pandas mistake in interviews?

A: Using iterrows() for row-level operations when vectorized equivalents exist. Interviewers at product companies specifically probe this to assess Python/pandas maturity.

Q: How do I get faster at pandas for live coding interviews?

A: Practice on real datasets (Kaggle, government open data). Focus on GroupBy + merge + time series patterns. Use .pipe() chains for readable code. Confirm exact Python version and available libraries on the official company interview guide or careers portal before your round.

Sources and review notesreviewed 8 Jun 2026

Article-specific sources

Verification window

Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.

Evidence labels

Official notices, candidate reports, offer documents, and editorial practice questions carry different confidence levels. The visible source list lets you inspect the evidence instead of relying on a blanket verification badge.

Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.

topic cluster

Sat this this year? Share your story, earn ₹500.

First-person experience reports help future candidates prep smarter. We pay verified contributors ₹500 via UPI per accepted story with byline.

Submit your story →

ready to practice?

Take a free timed mock test

Put what you learned into practice. Our mock tests match the 2026 pattern with timer, navigator, reveal, and score breakdown. No signup.

Start free mock test →

related guides

Interview Questions

Share this guide

Twitter LinkedIn W WhatsApp