SQL for Data Analysts 2026: 30 Interview Questions with Answers

What changed in 2026 drives
Mass-recruiter offer letters are flatter for 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.
What I'd actually study for this
- 01Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
- 02One real project you can defend end-to-end - file paths, design decisions, and what you would change
- 03One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
- 04Three behavioural STAR stories: failure recovered, conflict handled, ownership taken
Where most candidates trip up
The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.
Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.
SQL proficiency is the most-tested hard skill in data analyst and data scientist interviews. Product companies from Flipkart and Swiggy to Google India and Amazon include SQL rounds as the primary technical filter. This guide covers 30 SQL interview questions with full answers, from foundation to advanced window functions and business analytics scenarios.
PapersAdda's take: Candidates report that window functions (ROW_NUMBER, LAG, SUM OVER PARTITION BY) appear in over 70% of senior data analyst SQL rounds at product companies. Solving "top-N per group" and "retention cohort" queries in under 20 minutes is the expected bar. Confirm the specific SQL dialect and complexity expected on the official company careers portal before you prepare.
Related articles: Data Science Interview Questions 2026 | Statistics for Data Science 2026 | Pandas Interview Questions 2026 | Apache Spark Interview Questions 2026 | Tableau Interview Questions 2026
SQL Topics by Company and Level
| Level | Topics | Companies |
|---|---|---|
| Junior DA | SELECT, WHERE, GROUP BY, HAVING, basic JOINs | Startups, consulting |
| Mid DA | Subqueries, CTEs, CASE WHEN, multiple JOINs | Product companies |
| Senior DA | Window functions, complex aggregations, optimization | FAANG, unicorns |
| Data Scientist | All of above + query optimization + analytical queries | All tech |
EASY: Foundations (Questions 1-8)
Q1. What is the difference between WHERE and HAVING?
-- WHERE: filters ROWS before grouping (before GROUP BY)
-- HAVING: filters GROUPS after aggregation (after GROUP BY)
-- Find departments with more than 5 employees earning > 50000
SELECT department, COUNT(*) as emp_count, AVG(salary) as avg_salary
FROM employees
WHERE salary > 50000 -- filter individual rows first
GROUP BY department
HAVING COUNT(*) > 5; -- filter groups after aggregation
-- Common mistake: using WHERE with aggregate functions
-- WRONG:
-- SELECT department, AVG(salary) FROM employees
-- WHERE AVG(salary) > 60000 GROUP BY department; -- ERROR
-- CORRECT:
SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 60000;
Q2. What are the different types of JOINs?
-- Setup
-- employees: (emp_id, name, dept_id, salary)
-- departments: (dept_id, dept_name, manager_id)
-- INNER JOIN: only matching rows from both tables
SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;
-- LEFT JOIN: all rows from left, matching from right (NULL if no match)
SELECT e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;
-- Employees without a department will show NULL dept_name
-- RIGHT JOIN: all rows from right, matching from left
SELECT e.name, d.dept_name
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.dept_id;
-- Departments with no employees will show NULL name
-- FULL OUTER JOIN: all rows from both tables
SELECT e.name, d.dept_name
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.dept_id;
-- CROSS JOIN: cartesian product (all combinations)
SELECT e.name, d.dept_name
FROM employees e
CROSS JOIN departments d; -- rows_e * rows_d total rows
-- SELF JOIN: join table with itself
SELECT e1.name as employee, e2.name as manager
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.emp_id;
Q3. What is a CTE? How does it differ from a subquery?
-- CTE (Common Table Expression): named temporary result set
-- Defined with WITH clause before the main query
-- Example: find employees earning above their department average
WITH dept_avg AS (
SELECT dept_id, AVG(salary) as avg_salary
FROM employees
GROUP BY dept_id
),
above_avg AS (
SELECT e.emp_id, e.name, e.salary, da.avg_salary,
e.salary - da.avg_salary as above_by
FROM employees e
JOIN dept_avg da ON e.dept_id = da.dept_id
WHERE e.salary > da.avg_salary
)
SELECT * FROM above_avg ORDER BY above_by DESC;
-- Equivalent subquery (harder to read for complex cases)
SELECT e.emp_id, e.name, e.salary,
da.avg_salary,
e.salary - da.avg_salary as above_by
FROM employees e
JOIN (
SELECT dept_id, AVG(salary) as avg_salary
FROM employees
GROUP BY dept_id
) da ON e.dept_id = da.dept_id
WHERE e.salary > da.avg_salary
ORDER BY above_by DESC;
-- CTE advantages:
-- 1. Reusability: reference same CTE multiple times in query
-- 2. Readability: step-by-step logic like a pipeline
-- 3. Recursive queries: trees, hierarchies (not possible with subqueries)
-- Recursive CTE: employee hierarchy
WITH RECURSIVE org_chart AS (
-- Base case: top-level manager (no manager_id)
SELECT emp_id, name, manager_id, 0 as level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case: employees reporting to current level
SELECT e.emp_id, e.name, e.manager_id, oc.level + 1
FROM employees e
JOIN org_chart oc ON e.manager_id = oc.emp_id
)
SELECT * FROM org_chart ORDER BY level, name;
Q4. What is a window function? Explain PARTITION BY and ORDER BY in windows.
-- Window functions: compute over a "window" (partition) of rows
-- Unlike GROUP BY: original rows are NOT collapsed
-- Syntax: function_name(...) OVER (
-- [PARTITION BY col1, col2] -- divide into groups
-- [ORDER BY col3 DESC] -- ordering within group
-- [frame_clause] -- row/range specification
-- )
-- Basic window functions
SELECT
emp_id,
name,
dept_id,
salary,
-- Aggregate window functions: computed per partition
AVG(salary) OVER (PARTITION BY dept_id) as dept_avg,
MAX(salary) OVER (PARTITION BY dept_id) as dept_max,
SUM(salary) OVER (PARTITION BY dept_id) as dept_total,
COUNT(*) OVER (PARTITION BY dept_id) as dept_size,
-- Ranking functions (require ORDER BY)
ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) as salary_rank,
RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rank_with_ties,
DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) as dense_rank,
-- Offset functions
LAG(salary, 1) OVER (PARTITION BY dept_id ORDER BY hire_date) as prev_salary,
LEAD(salary, 1) OVER (PARTITION BY dept_id ORDER BY hire_date) as next_salary,
-- Distribution functions
PERCENT_RANK() OVER (ORDER BY salary) as pct_rank,
NTILE(4) OVER (ORDER BY salary) as salary_quartile
FROM employees;
Q5. What is RANK vs DENSE_RANK vs ROW_NUMBER?
-- Setup: scores with ties
-- Scores: 100, 100, 90, 85, 85, 80
WITH scores AS (
SELECT 'Alice' as name, 100 as score UNION ALL
SELECT 'Bob', 100 UNION ALL
SELECT 'Charlie', 90 UNION ALL
SELECT 'Diana', 85 UNION ALL
SELECT 'Eve', 85 UNION ALL
SELECT 'Frank', 80
)
SELECT
name,
score,
ROW_NUMBER() OVER (ORDER BY score DESC) as row_num,
-- Unique sequential number; tie-breaking is arbitrary
-- 1, 2, 3, 4, 5, 6
RANK() OVER (ORDER BY score DESC) as rank_val,
-- Ties get same rank; next rank skips numbers
-- 1, 1, 3, 4, 4, 6 (no 2, no 5)
DENSE_RANK() OVER (ORDER BY score DESC) as dense_rank_val
-- Ties get same rank; next rank is consecutive (no gaps)
-- 1, 1, 2, 3, 3, 4
FROM scores;
-- Use ROW_NUMBER: when you need exactly one row per group (deduplication)
-- Use RANK: leaderboards where ties share position and next rank skips
-- Use DENSE_RANK: when you want no gaps in rankings (compact ranks)
Q6. How do you get the top N rows per group?
-- Classic interview question: "Get top 3 highest-paid employees per department"
-- Method 1: ROW_NUMBER (best for exact top-N without ties)
WITH ranked AS (
SELECT
emp_id, name, dept_id, salary,
ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rn
FROM employees
)
SELECT emp_id, name, dept_id, salary
FROM ranked
WHERE rn <= 3;
-- Method 2: RANK (includes ties -- may return more than N)
WITH ranked AS (
SELECT
emp_id, name, dept_id, salary,
RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rk
FROM employees
)
SELECT emp_id, name, dept_id, salary
FROM ranked
WHERE rk <= 3;
-- Method 3: DENSE_RANK (compact ranks with ties)
WITH ranked AS (
SELECT
emp_id, name, dept_id, salary,
DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) as dr
FROM employees
)
SELECT emp_id, name, dept_id, salary
FROM ranked
WHERE dr <= 3;
-- The interviewer expects: understand the difference between methods
-- and choose based on how ties should be handled.
Q7. What is the difference between UNION and UNION ALL?
-- UNION: combines result sets and removes duplicates
-- UNION ALL: combines result sets, keeps ALL rows (including duplicates)
-- Example: combine sales from two regions
SELECT product_id, sales, 'North' as region FROM sales_north
UNION
SELECT product_id, sales, 'South' as region FROM sales_south;
-- Removes any rows that are identical across both tables
SELECT product_id, sales, 'North' as region FROM sales_north
UNION ALL
SELECT product_id, sales, 'South' as region FROM sales_south;
-- Keeps all rows -- faster (no dedup step)
-- Performance: UNION ALL is always faster (no sorting/dedup)
-- Use UNION when: you genuinely need to remove cross-table duplicates
-- Use UNION ALL when: you know tables don't overlap, or duplicates are expected
-- Requirements for UNION:
-- Same number of columns in both SELECT statements
-- Compatible data types in corresponding column positions
-- Column names taken from first SELECT
Q8. What are aggregate functions? Explain COUNT with and without DISTINCT.
-- Standard aggregate functions
SELECT
COUNT(*) as total_rows, -- count all rows including NULLs
COUNT(1) as also_total_rows, -- same as COUNT(*)
COUNT(email) as non_null_emails, -- count non-NULL values only
COUNT(DISTINCT user_id) as unique_users, -- count distinct non-NULL values
SUM(revenue) as total_revenue,
AVG(revenue) as avg_revenue,
MIN(order_date) as first_order,
MAX(order_date) as last_order,
SUM(revenue) / NULLIF(COUNT(*), 0) as manual_avg -- safe division
FROM orders;
-- Important: COUNT(*) vs COUNT(column)
-- COUNT(*): counts every row including NULLs
-- COUNT(column): counts only NON-NULL values in that column
CREATE TABLE test_nulls AS
SELECT 1 as id, NULL as value UNION ALL
SELECT 2, 100 UNION ALL
SELECT 3, 200;
SELECT
COUNT(*) as total, -- 3
COUNT(value) as non_null, -- 2
COUNT(DISTINCT value) as distinct_non_null -- 2
FROM test_nulls;
-- GROUP BY with ROLLUP (multi-level subtotals)
SELECT
department,
job_title,
SUM(salary) as total_salary,
COUNT(*) as headcount
FROM employees
GROUP BY ROLLUP(department, job_title);
-- Returns: dept+title rows + dept subtotals + grand total
MEDIUM: Window Functions and Business Queries (Questions 9-20)
Q9. How do you compute running totals and moving averages?
-- Running total (cumulative sum)
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY order_date) as running_total,
-- Same with explicit frame
SUM(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as running_total_explicit
FROM daily_sales
ORDER BY order_date;
-- 7-day moving average
SELECT
order_date,
daily_revenue,
AVG(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as moving_avg_7d,
-- Rolling sum
SUM(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as rolling_7d_sum
FROM daily_sales
ORDER BY order_date;
-- Running total WITHIN each year (partition by year)
SELECT
order_date,
EXTRACT(YEAR FROM order_date) as year,
daily_revenue,
SUM(daily_revenue) OVER (
PARTITION BY EXTRACT(YEAR FROM order_date)
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as ytd_revenue
FROM daily_sales
ORDER BY order_date;
Q10. How do you compute month-over-month and year-over-year growth?
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) as month,
SUM(revenue) as total_revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
),
growth AS (
SELECT
month,
total_revenue,
LAG(total_revenue, 1) OVER (ORDER BY month) as prev_month,
LAG(total_revenue, 12) OVER (ORDER BY month) as same_month_last_year
FROM monthly_revenue
)
SELECT
month,
total_revenue,
prev_month,
-- MoM growth
ROUND(
(total_revenue - prev_month) / NULLIF(prev_month, 0) * 100, 2
) as mom_pct_growth,
-- YoY growth
ROUND(
(total_revenue - same_month_last_year) / NULLIF(same_month_last_year, 0) * 100, 2
) as yoy_pct_growth
FROM growth
ORDER BY month;
Q11. How do you find duplicate rows and deduplicate?
-- Find duplicates
SELECT email, COUNT(*) as cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
-- Show ALL duplicate rows with details
WITH dupes AS (
SELECT email, COUNT(*) OVER (PARTITION BY email) as cnt
FROM users
)
SELECT u.*
FROM users u
JOIN dupes d ON u.email = d.email
WHERE d.cnt > 1;
-- Deduplicate: keep the LATEST row per email
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY created_at DESC -- keep most recent
) as rn
FROM users
)
SELECT * FROM ranked WHERE rn = 1;
-- Delete duplicates in-place (PostgreSQL / BigQuery)
DELETE FROM users
WHERE user_id NOT IN (
SELECT MIN(user_id) -- keep lowest ID per email
FROM users
GROUP BY email
);
-- For large tables: create new table and swap
CREATE TABLE users_deduped AS
SELECT DISTINCT ON (email) *
FROM users
ORDER BY email, created_at DESC; -- PostgreSQL DISTINCT ON syntax
Q12. Write a cohort retention query.
-- Cohort: users grouped by their first order month
-- Retention: what % of each cohort is still active in month N?
WITH first_orders AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(order_date)) as cohort_month
FROM orders
GROUP BY user_id
),
user_activity AS (
SELECT DISTINCT
o.user_id,
fo.cohort_month,
DATE_TRUNC('month', o.order_date) as activity_month,
DATEDIFF('month', fo.cohort_month, DATE_TRUNC('month', o.order_date)) as period_number
-- Use DATE_PART in PostgreSQL:
-- EXTRACT(EPOCH FROM (DATE_TRUNC('month', o.order_date) - fo.cohort_month)) / (30.44 * 86400)
FROM orders o
JOIN first_orders fo ON o.user_id = fo.user_id
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT user_id) as cohort_size
FROM first_orders
GROUP BY cohort_month
),
retention_matrix AS (
SELECT
ua.cohort_month,
ua.period_number,
COUNT(DISTINCT ua.user_id) as retained_users
FROM user_activity ua
GROUP BY ua.cohort_month, ua.period_number
)
SELECT
rm.cohort_month,
rm.period_number,
rm.retained_users,
cs.cohort_size,
ROUND(100.0 * rm.retained_users / cs.cohort_size, 2) as retention_pct
FROM retention_matrix rm
JOIN cohort_sizes cs ON rm.cohort_month = cs.cohort_month
ORDER BY rm.cohort_month, rm.period_number;
Q13. Write a funnel analysis query.
-- Conversion funnel: home -> product -> cart -> checkout -> purchase
-- Find the count and drop-off at each stage
WITH funnel_events AS (
SELECT
user_id,
MAX(CASE WHEN event_type = 'home_view' THEN 1 ELSE 0 END) as visited_home,
MAX(CASE WHEN event_type = 'product_view' THEN 1 ELSE 0 END) as viewed_product,
MAX(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) as added_to_cart,
MAX(CASE WHEN event_type = 'checkout_start' THEN 1 ELSE 0 END) as started_checkout,
MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as purchased
FROM events
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id
),
funnel_counts AS (
SELECT
SUM(visited_home) as step1_home,
SUM(viewed_product) as step2_product,
SUM(added_to_cart) as step3_cart,
SUM(started_checkout) as step4_checkout,
SUM(purchased) as step5_purchase
FROM funnel_events
)
SELECT
step1_home,
step2_product,
step3_cart,
step4_checkout,
step5_purchase,
-- Conversion from previous step
ROUND(100.0 * step2_product / NULLIF(step1_home, 0), 2) as home_to_product,
ROUND(100.0 * step3_cart / NULLIF(step2_product, 0), 2) as product_to_cart,
ROUND(100.0 * step4_checkout / NULLIF(step3_cart, 0), 2) as cart_to_checkout,
ROUND(100.0 * step5_purchase / NULLIF(step4_checkout, 0), 2) as checkout_to_purchase,
-- Overall funnel conversion
ROUND(100.0 * step5_purchase / NULLIF(step1_home, 0), 2) as overall_cvr
FROM funnel_counts;
Q14. How do you pivot and unpivot data in SQL?
-- PIVOT: rows to columns
-- Input: (user_id, month, revenue)
-- Output: (user_id, Jan, Feb, Mar, Apr)
-- PostgreSQL / BigQuery: use CASE WHEN
SELECT
user_id,
SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 1 THEN revenue ELSE 0 END) as jan,
SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 2 THEN revenue ELSE 0 END) as feb,
SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 3 THEN revenue ELSE 0 END) as mar,
SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 4 THEN revenue ELSE 0 END) as apr
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2026
GROUP BY user_id;
-- UNPIVOT: columns to rows
-- From: (user_id, jan_revenue, feb_revenue, mar_revenue)
-- To: (user_id, month, revenue)
-- Using UNION ALL (universal SQL)
SELECT user_id, 'Jan' as month, jan_revenue as revenue FROM quarterly_data
UNION ALL
SELECT user_id, 'Feb', feb_revenue FROM quarterly_data
UNION ALL
SELECT user_id, 'Mar', mar_revenue FROM quarterly_data
ORDER BY user_id, month;
-- BigQuery: UNPIVOT operator (more readable)
SELECT user_id, month, revenue
FROM quarterly_data
UNPIVOT (revenue FOR month IN (jan_revenue, feb_revenue, mar_revenue));
Q15. How do you find gaps in sequential data?
-- Find missing dates in a user activity sequence
WITH date_series AS (
-- Generate all dates in the range
SELECT generate_series(
'2026-01-01'::date,
'2026-01-31'::date,
'1 day'::interval
)::date as expected_date
),
user_activity AS (
SELECT DISTINCT activity_date FROM user_logins
WHERE user_id = 12345 AND activity_date BETWEEN '2026-01-01' AND '2026-01-31'
)
SELECT ds.expected_date as missing_date
FROM date_series ds
LEFT JOIN user_activity ua ON ds.expected_date = ua.activity_date
WHERE ua.activity_date IS NULL;
-- Find gaps in order ID sequence
SELECT
o1.order_id + 1 as gap_start,
o2.order_id - 1 as gap_end,
o2.order_id - o1.order_id - 1 as gap_size
FROM orders o1
JOIN orders o2 ON o2.order_id = (
SELECT MIN(order_id) FROM orders WHERE order_id > o1.order_id
)
WHERE o2.order_id > o1.order_id + 1
ORDER BY gap_start;
-- Find periods of inactivity > 30 days
SELECT
user_id,
activity_date as last_active,
LEAD(activity_date) OVER (PARTITION BY user_id ORDER BY activity_date) as next_active,
LEAD(activity_date) OVER (PARTITION BY user_id ORDER BY activity_date) - activity_date as gap_days
FROM user_logins
WHERE LEAD(activity_date) OVER (PARTITION BY user_id ORDER BY activity_date) - activity_date > 30;
Q16. Write a query for user session analysis.
-- Define sessions: events within 30 minutes of each other belong to same session
WITH session_gaps AS (
SELECT
user_id,
event_time,
event_type,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event_time,
CASE
WHEN LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL THEN 1
WHEN event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)
> INTERVAL '30 minutes' THEN 1
ELSE 0
END as is_session_start
FROM events
),
session_ids AS (
SELECT
user_id,
event_time,
event_type,
SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_time) as session_id
FROM session_gaps
),
session_summary AS (
SELECT
user_id,
session_id,
MIN(event_time) as session_start,
MAX(event_time) as session_end,
COUNT(*) as event_count,
EXTRACT(EPOCH FROM MAX(event_time) - MIN(event_time)) / 60 as duration_minutes,
MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as has_purchase
FROM session_ids
GROUP BY user_id, session_id
)
SELECT
user_id,
COUNT(DISTINCT session_id) as total_sessions,
ROUND(AVG(duration_minutes), 2) as avg_session_min,
SUM(has_purchase) as purchase_sessions,
ROUND(100.0 * SUM(has_purchase) / COUNT(*), 2) as purchase_rate_pct
FROM session_summary
GROUP BY user_id
ORDER BY total_sessions DESC;
Q17. How do you write a query to find the second highest salary?
-- Method 1: Subquery with NOT IN
SELECT MAX(salary)
FROM employees
WHERE salary NOT IN (SELECT MAX(salary) FROM employees);
-- Method 2: OFFSET (skip top result)
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
-- Method 3: DENSE_RANK (handles ties correctly)
SELECT salary
FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as dr
FROM employees
) ranked
WHERE dr = 2
LIMIT 1;
-- Generic: Nth highest salary
-- Replace 2 with N
WITH ranked AS (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as dr
FROM employees
)
SELECT MAX(salary) -- max handles if multiple employees have same Nth salary
FROM ranked
WHERE dr = 2;
-- Edge case: what if there's only 1 unique salary?
-- DENSE_RANK approach returns NULL gracefully
-- NOT IN approach returns NULL (correct)
Q18. Write a query to compute the 7-day rolling active users (DAU/WAU).
-- DAU (Daily Active Users)
WITH daily_active AS (
SELECT
DATE(event_time) as event_date,
COUNT(DISTINCT user_id) as dau
FROM user_events
GROUP BY DATE(event_time)
),
-- WAU using 7-day rolling window
wau AS (
SELECT
event_date,
dau,
COUNT(DISTINCT u.user_id) as wau_7d
FROM daily_active da
JOIN user_events u ON DATE(u.event_time) BETWEEN da.event_date - INTERVAL '6 days'
AND da.event_date
GROUP BY da.event_date, da.dau
),
-- Stickiness: DAU/MAU ratio
mau AS (
SELECT
DATE(event_time) as event_date,
COUNT(DISTINCT user_id) as mau_28d
FROM user_events
WHERE event_time >= CURRENT_DATE - INTERVAL '28 days'
GROUP BY DATE(event_time)
)
SELECT
da.event_date,
da.dau,
w.wau_7d,
m.mau_28d,
ROUND(100.0 * da.dau / NULLIF(w.wau_7d, 0), 2) as dau_wau_stickiness,
ROUND(100.0 * da.dau / NULLIF(m.mau_28d, 0), 2) as dau_mau_stickiness
FROM daily_active da
LEFT JOIN wau w ON da.event_date = w.event_date
LEFT JOIN mau m ON da.event_date = m.event_date
ORDER BY da.event_date;
Q19. How do you use CASE WHEN for conditional aggregation?
-- Pivot-like aggregation by category
SELECT
DATE_TRUNC('month', order_date) as month,
COUNT(*) as total_orders,
COUNT(CASE WHEN status = 'completed' THEN 1 END) as completed,
COUNT(CASE WHEN status = 'cancelled' THEN 1 END) as cancelled,
COUNT(CASE WHEN status = 'returned' THEN 1 END) as returned,
ROUND(
100.0 * COUNT(CASE WHEN status = 'cancelled' THEN 1 END) / COUNT(*), 2
) as cancellation_rate,
SUM(CASE WHEN category = 'Electronics' THEN revenue ELSE 0 END) as electronics_rev,
SUM(CASE WHEN category = 'Clothing' THEN revenue ELSE 0 END) as clothing_rev,
SUM(CASE WHEN revenue > 5000 THEN revenue ELSE 0 END) as high_value_rev
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
-- Segmented averages
SELECT
AVG(CASE WHEN age < 25 THEN revenue END) as avg_revenue_young,
AVG(CASE WHEN age BETWEEN 25 AND 40 THEN revenue END) as avg_revenue_mid,
AVG(CASE WHEN age > 40 THEN revenue END) as avg_revenue_older
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;
-- Rate calculation with CASE
SELECT
user_segment,
SUM(CASE WHEN converted = 1 THEN 1.0 ELSE 0 END) / COUNT(*) as cvr
FROM experiment_data
GROUP BY user_segment;
Q20. How do you handle NULL values in SQL queries?
-- NULL is NOT equal to anything, including NULL
-- NULL = NULL returns NULL (not TRUE)
-- Use IS NULL, IS NOT NULL, COALESCE, NULLIF
-- Detect NULLs
SELECT COUNT(*) as total, COUNT(email) as non_null_emails,
COUNT(*) - COUNT(email) as null_emails
FROM users;
-- COALESCE: return first non-NULL value
SELECT
user_id,
COALESCE(display_name, username, email, 'Anonymous') as name_to_show,
COALESCE(revenue, 0) as safe_revenue -- replace NULL with 0
FROM users;
-- NULLIF: return NULL if expression equals a value (prevents division by zero)
SELECT
user_id,
total_revenue / NULLIF(order_count, 0) as avg_order_value -- safe division
FROM user_stats;
-- NULL in aggregations
-- SUM(col) ignores NULLs; COUNT(*) counts NULLs; COUNT(col) ignores NULLs
SELECT
SUM(revenue) as sum_ignores_null, -- treats NULL as 0 contribution
COUNT(*) as count_all, -- includes NULLs
COUNT(revenue) as count_non_null, -- excludes NULLs
AVG(revenue) as avg_excludes_null -- denominator = non-NULL count
FROM orders;
-- NULL in JOINs: NULL != NULL means NULL keys never match
-- LEFT JOIN preserves unmatched rows (NULLs in right table columns)
-- NULL in WHERE: UNKNOWN, not TRUE
SELECT * FROM orders WHERE revenue > 100; -- rows where revenue IS NULL are excluded
SELECT * FROM orders WHERE revenue > 100 OR revenue IS NULL; -- include NULLs explicitly
HARD: Performance and Advanced Analytics (Questions 21-30)
Q21. How do you optimize a slow SQL query?
-- Step 1: Check execution plan
EXPLAIN ANALYZE
SELECT u.user_id, u.name, COUNT(o.order_id) as order_count
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE o.created_at >= '2026-01-01'
GROUP BY u.user_id, u.name;
-- Look for: Seq Scan (table scan) vs Index Scan, high row estimates vs actual rows
-- Optimization techniques:
-- 1. Add indexes on JOIN keys and WHERE columns
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_created_at ON orders(created_at);
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at); -- composite
-- 2. Filter early with CTEs / subqueries to reduce join size
WITH recent_orders AS (
SELECT user_id, order_id -- only select needed columns
FROM orders
WHERE created_at >= '2026-01-01' -- filter BEFORE join
)
SELECT u.user_id, u.name, COUNT(ro.order_id) as order_count
FROM users u
JOIN recent_orders ro ON u.user_id = ro.user_id
GROUP BY u.user_id, u.name;
-- 3. Avoid SELECT * (fetches unnecessary columns, kills columnar storage performance)
-- 4. Use partition pruning (BigQuery/Redshift/Snowflake)
-- 5. Use approximate functions for large datasets
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events; -- BigQuery
SELECT COUNT(DISTINCT user_id) FROM events; -- exact but slower
-- 6. Push predicates inside window function QUALIFY (BigQuery)
-- Instead of: SELECT * FROM (SELECT ..., ROW_NUMBER() OVER (...) as rn ...) WHERE rn = 1
-- Use QUALIFY: SELECT ..., ROW_NUMBER() OVER (...) as rn ... QUALIFY rn = 1
Q22. Write a query to detect transaction fraud patterns.
-- Find suspicious patterns: 3+ transactions within 1 hour from same user
WITH transaction_velocity AS (
SELECT
user_id,
transaction_id,
amount,
created_at,
COUNT(*) OVER (
PARTITION BY user_id
ORDER BY created_at
RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
) as txn_count_1h,
SUM(amount) OVER (
PARTITION BY user_id
ORDER BY created_at
RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
) as amount_sum_1h,
AVG(amount) OVER (PARTITION BY user_id) as user_avg_amount
FROM transactions
),
fraud_signals AS (
SELECT
*,
CASE
WHEN txn_count_1h >= 5 THEN 'HIGH_VELOCITY'
WHEN amount > user_avg_amount * 5 THEN 'HIGH_AMOUNT'
WHEN txn_count_1h >= 3 AND amount > user_avg_amount * 3 THEN 'COMBINED_RISK'
ELSE 'NORMAL'
END as fraud_signal
FROM transaction_velocity
)
SELECT
user_id,
transaction_id,
amount,
created_at,
txn_count_1h,
amount_sum_1h,
fraud_signal
FROM fraud_signals
WHERE fraud_signal != 'NORMAL'
ORDER BY created_at DESC;
Q23. How do you write a median calculation in SQL?
-- SQL has no built-in MEDIAN function in standard SQL
-- Method 1: PERCENTILE_CONT (PostgreSQL, BigQuery, Snowflake)
SELECT
dept_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) as median_salary,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) as p25_salary,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) as p75_salary
FROM employees
GROUP BY dept_id;
-- Method 2: Manual calculation (works in MySQL and older DBs)
WITH ordered AS (
SELECT
dept_id,
salary,
ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary) as rn,
COUNT(*) OVER (PARTITION BY dept_id) as total
FROM employees
),
median_rows AS (
SELECT dept_id, salary
FROM ordered
WHERE rn IN (FLOOR((total + 1) / 2.0), CEIL((total + 1) / 2.0))
)
SELECT dept_id, AVG(salary) as median_salary
FROM median_rows
GROUP BY dept_id;
-- APPROX_QUANTILES (BigQuery -- fast approximation for large tables)
SELECT APPROX_QUANTILES(salary, 100)[OFFSET(50)] as approx_median
FROM employees;
Q24. Write a query to compute customer lifetime value (LTV).
WITH customer_metrics AS (
SELECT
c.customer_id,
c.signup_date,
COUNT(DISTINCT o.order_id) as total_orders,
SUM(o.revenue) as total_revenue,
MIN(o.order_date) as first_order_date,
MAX(o.order_date) as last_order_date,
DATEDIFF('day', MIN(o.order_date), MAX(o.order_date)) + 1 as active_days,
DATEDIFF('day', c.signup_date, CURRENT_DATE) as tenure_days
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.signup_date
),
ltv_calculation AS (
SELECT
customer_id,
total_orders,
total_revenue,
tenure_days,
active_days,
-- Average order value
ROUND(total_revenue / NULLIF(total_orders, 0), 2) as aov,
-- Purchase frequency (orders per 30 days active)
ROUND(total_orders * 30.0 / NULLIF(active_days, 0), 3) as purchase_freq,
-- Historical LTV
total_revenue as historical_ltv,
-- Projected annual LTV (if active)
ROUND(
(total_revenue / NULLIF(GREATEST(tenure_days, 1), 0)) * 365, 2
) as projected_annual_ltv,
-- LTV segment
CASE
WHEN total_revenue > 50000 THEN 'Platinum'
WHEN total_revenue > 10000 THEN 'Gold'
WHEN total_revenue > 2000 THEN 'Silver'
ELSE 'Bronze'
END as ltv_segment
FROM customer_metrics
)
SELECT * FROM ltv_calculation
ORDER BY historical_ltv DESC;
Q25. How do you calculate a percentile distribution of a metric?
-- Distribution of order values into buckets
WITH order_stats AS (
SELECT
order_id,
revenue,
NTILE(10) OVER (ORDER BY revenue) as decile,
NTILE(4) OVER (ORDER BY revenue) as quartile,
PERCENT_RANK() OVER (ORDER BY revenue) as pct_rank,
CUME_DIST() OVER (ORDER BY revenue) as cume_dist
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
decile,
COUNT(*) as order_count,
MIN(revenue) as min_revenue,
MAX(revenue) as max_revenue,
ROUND(AVG(revenue), 2) as avg_revenue,
ROUND(SUM(revenue), 2) as total_revenue,
ROUND(100.0 * SUM(revenue) / SUM(SUM(revenue)) OVER (), 2) as revenue_pct
FROM order_stats
GROUP BY decile
ORDER BY decile;
-- Find exact percentile values
SELECT
PERCENTILE_CONT(0.10) WITHIN GROUP (ORDER BY revenue) as p10,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY revenue) as p25,
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY revenue) as p50,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY revenue) as p75,
PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY revenue) as p90,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY revenue) as p95,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY revenue) as p99
FROM orders;
Q26. Write a query to find the first and last event per user.
-- Method 1: FIRST_VALUE / LAST_VALUE window functions
SELECT DISTINCT
user_id,
FIRST_VALUE(event_type) OVER (PARTITION BY user_id ORDER BY event_time) as first_event,
FIRST_VALUE(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as first_event_time,
LAST_VALUE(event_type) OVER (
PARTITION BY user_id
ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as last_event,
LAST_VALUE(event_time) OVER (
PARTITION BY user_id
ORDER BY event_time
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) as last_event_time
FROM user_events;
-- Method 2: GROUP BY with MIN/MAX (simpler, more readable)
SELECT
user_id,
MIN(event_time) as first_event_time,
MAX(event_time) as last_event_time,
DATEDIFF('day', MIN(event_time), MAX(event_time)) as engagement_span_days
FROM user_events
GROUP BY user_id;
-- Method 3: ROW_NUMBER for complete row access (not just timestamp)
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time ASC) as rn_first,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) as rn_last
FROM user_events
)
SELECT
f.user_id,
f.event_type as first_event_type, f.event_time as first_event_time,
l.event_type as last_event_type, l.event_time as last_event_time
FROM ranked f
JOIN ranked l ON f.user_id = l.user_id AND l.rn_last = 1
WHERE f.rn_first = 1;
Q27. Write a query to analyze A/B test results.
-- A/B test result analysis query
WITH experiment_data AS (
SELECT
user_id,
variant, -- 'control' or 'treatment'
MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as converted,
SUM(CASE WHEN event_type = 'purchase' THEN revenue ELSE 0 END) as revenue
FROM ab_test_events
WHERE experiment_id = 'checkout_redesign_v2'
AND event_date BETWEEN '2026-05-01' AND '2026-05-28'
GROUP BY user_id, variant
),
variant_stats AS (
SELECT
variant,
COUNT(*) as n_users,
SUM(converted) as n_converted,
ROUND(100.0 * SUM(converted) / COUNT(*), 4) as cvr_pct,
ROUND(AVG(revenue), 2) as avg_revenue,
ROUND(SUM(revenue), 2) as total_revenue
FROM experiment_data
GROUP BY variant
),
-- Compute relative lift (pivot control and treatment)
control_stats AS (SELECT * FROM variant_stats WHERE variant = 'control'),
treatment_stats AS (SELECT * FROM variant_stats WHERE variant = 'treatment')
SELECT
t.variant,
t.n_users,
t.n_converted,
t.cvr_pct,
c.cvr_pct as control_cvr_pct,
ROUND(t.cvr_pct - c.cvr_pct, 4) as absolute_lift_pct,
ROUND(100.0 * (t.cvr_pct - c.cvr_pct) / NULLIF(c.cvr_pct, 0), 2) as relative_lift_pct,
t.avg_revenue,
ROUND(100.0 * (t.avg_revenue - c.avg_revenue) / NULLIF(c.avg_revenue, 0), 2) as revenue_lift_pct,
-- SRM check: expected ~50/50 split
ROUND(100.0 * t.n_users / (t.n_users + c.n_users), 2) as treatment_share_pct
FROM treatment_stats t
CROSS JOIN control_stats c;
Q28. How do you use SQL for geo-spatial analysis?
-- Distance between user location and nearest store (PostgreSQL with PostGIS)
-- Or BigQuery with ST_ functions
-- BigQuery geo functions
SELECT
u.user_id,
s.store_id,
s.store_name,
ST_DISTANCE(
ST_GEOGPOINT(u.longitude, u.latitude),
ST_GEOGPOINT(s.longitude, s.latitude)
) / 1000 as distance_km
FROM users u
CROSS JOIN stores s
WHERE ST_DWITHIN(
ST_GEOGPOINT(u.longitude, u.latitude),
ST_GEOGPOINT(s.longitude, s.latitude),
10000 -- 10km radius in meters
)
ORDER BY u.user_id, distance_km;
-- Find users within delivery radius
SELECT DISTINCT u.user_id
FROM users u
JOIN restaurants r ON ST_DWITHIN(
ST_GEOGPOINT(u.longitude, u.latitude),
ST_GEOGPOINT(r.longitude, r.latitude),
5000 -- 5km
)
WHERE r.is_open = TRUE;
-- Haversine formula (standard SQL, no spatial extension needed)
SELECT
u.user_id,
s.store_id,
(6371 * ACOS(
COS(RADIANS(u.lat)) * COS(RADIANS(s.lat)) *
COS(RADIANS(s.lon) - RADIANS(u.lon)) +
SIN(RADIANS(u.lat)) * SIN(RADIANS(s.lat))
)) as distance_km
FROM users u
CROSS JOIN stores s
WHERE u.user_id = 12345
ORDER BY distance_km
LIMIT 5; -- nearest 5 stores
Q29. Write a query to compute product recommendation candidates (collaborative filtering).
-- Find products frequently bought together (association rules)
WITH user_products AS (
SELECT DISTINCT user_id, product_id
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
WHERE o.created_at >= CURRENT_DATE - INTERVAL '90 days'
),
product_pairs AS (
SELECT
up1.product_id as product_a,
up2.product_id as product_b,
COUNT(*) as co_purchase_count
FROM user_products up1
JOIN user_products up2
ON up1.user_id = up2.user_id
AND up1.product_id < up2.product_id -- avoid duplicates
GROUP BY up1.product_id, up2.product_id
HAVING COUNT(*) >= 10 -- minimum support threshold
),
product_counts AS (
SELECT product_id, COUNT(DISTINCT user_id) as user_count
FROM user_products
GROUP BY product_id
)
SELECT
pp.product_a,
pp.product_b,
pp.co_purchase_count,
pc_a.user_count as product_a_buyers,
pc_b.user_count as product_b_buyers,
-- Lift = P(A and B) / (P(A) * P(B))
-- High lift = non-random co-occurrence
ROUND(
pp.co_purchase_count * 1.0 /
(pc_a.user_count * pc_b.user_count / (SELECT COUNT(DISTINCT user_id) FROM user_products)::float),
3
) as lift
FROM product_pairs pp
JOIN product_counts pc_a ON pp.product_a = pc_a.product_id
JOIN product_counts pc_b ON pp.product_b = pc_b.product_id
ORDER BY lift DESC
LIMIT 100;
Q30. How do you detect account takeover with SQL?
-- Suspicious login patterns: unusual location + high-value action within 1 hour
WITH recent_logins AS (
SELECT
user_id,
login_time,
ip_country,
device_fingerprint,
LAG(ip_country) OVER (PARTITION BY user_id ORDER BY login_time) as prev_country,
LAG(login_time) OVER (PARTITION BY user_id ORDER BY login_time) as prev_login_time,
LAG(device_fingerprint) OVER (PARTITION BY user_id ORDER BY login_time) as prev_device
FROM login_events
WHERE login_time >= CURRENT_TIMESTAMP - INTERVAL '7 days'
),
suspicious_logins AS (
SELECT
rl.user_id,
rl.login_time,
rl.ip_country,
rl.prev_country,
CASE
WHEN rl.ip_country != rl.prev_country AND
rl.login_time - rl.prev_login_time < INTERVAL '2 hours'
THEN 'IMPOSSIBLE_TRAVEL'
WHEN rl.device_fingerprint != rl.prev_device AND
rl.ip_country != rl.prev_country
THEN 'NEW_DEVICE_NEW_COUNTRY'
ELSE NULL
END as alert_type
FROM recent_logins rl
WHERE rl.prev_country IS NOT NULL
),
high_value_actions AS (
SELECT user_id, action_time, action_type, amount
FROM user_actions
WHERE action_type IN ('withdrawal', 'password_change', 'email_change')
AND action_time >= CURRENT_TIMESTAMP - INTERVAL '7 days'
)
SELECT
sl.user_id,
sl.login_time,
sl.alert_type,
sl.prev_country, sl.ip_country,
hva.action_type,
hva.amount,
hva.action_time
FROM suspicious_logins sl
JOIN high_value_actions hva
ON sl.user_id = hva.user_id
AND hva.action_time BETWEEN sl.login_time AND sl.login_time + INTERVAL '1 hour'
WHERE sl.alert_type IS NOT NULL
ORDER BY sl.login_time DESC;
FAQ
Q: What is the most important SQL skill to master for DA interviews? A: Window functions -- specifically ROW_NUMBER for deduplication, RANK/DENSE_RANK for rankings, LAG/LEAD for period-over-period calculations, and SUM OVER PARTITION for running totals. Candidates from public preparation resources consistently report window functions as the primary differentiator between entry-level and senior DA candidates.
Q: How should I practice SQL for interviews? A: Use LeetCode (Database section), Mode Analytics practice problems, and Stratascratch. Practice on realistic datasets: e-commerce orders, ride-hailing trips, user event logs. Focus on the 5 most common patterns: top-N per group, retention cohort, funnel analysis, period-over-period growth, and deduplication. Confirm the specific SQL platform used in the interview on the official company careers portal or interview prep guide.
Q: Does BigQuery SQL differ significantly from PostgreSQL for interviews? A: Minor syntax differences. BigQuery uses DATE_TRUNC, TIMESTAMP_DIFF, QUALIFY, and APPROX_QUANTILES. PostgreSQL uses ::cast syntax, generate_series, and LATERAL joins. The window function and CTE syntax is nearly identical. Most interview questions use portable standard SQL.
Methodology applied to this articlelast verified 8 Jun 2026
- No fabricated salary numbers or success rates. If we quote a range, it's sourced.
- No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
- No paid placements, sponsored coaching links, or affiliate-shilled course pushes.
Explore this topic cluster
More resources in Interview Questions
Use the category hub to browse similar questions, exam patterns, salary guides, and preparation resources related to this topic.
Paid contributor programme
Sat this this year? Share your story, earn ₹500.
First-person experience reports help future candidates prep smarter. We pay verified contributors ₹500 via UPI per accepted story - with byline.
Submit your story →Ready to practice?
Take a free timed mock test
Put what you learned into practice. Our mock tests match the 2026 pattern with timer, navigator, reveal, and score breakdown. No signup.
Start Free Mock Test →More from PapersAdda
Accenture Interview Questions 2026 (with Answers for Freshers)
Capgemini Interview Questions 2026 (with Answers for Freshers)
HCLTech Interview Questions 2026 (TechBee + TGT, with Answers)
IBM Interview Questions 2026 (with Answers for Freshers)