placement brief / Interview Questions / brief / 08 Jun 2026

SQL for Data Analysts 2026: 30 Interview Questions with Answers

30 SQL interview questions for data analysts with full answers covering window functions, CTEs, joins, subqueries, performance optimization, and business analytics scenarios for 2026.

By Aditya SharmaPublished 8 Jun 20262 sources listedSpot an error? Corrections open

4 min read last revised 8 Jun 2026

on this page§ 05

SQL proficiency is the most-tested hard skill in data analyst and data scientist interviews. Product companies from Flipkart and Swiggy to Google India and Amazon include SQL rounds as the primary technical filter. This guide covers 30 SQL interview questions with full answers, from foundation to advanced window functions and business analytics scenarios.

PapersAdda's take: Candidates report that window functions (ROW_NUMBER, LAG, SUM OVER PARTITION BY) appear in over 70% of senior data analyst SQL rounds at product companies. Solving "top-N per group" and "retention cohort" queries in under 20 minutes is the expected bar. Confirm the specific SQL dialect and complexity expected on the official company careers portal before you prepare.

Related articles: Data Science Interview Questions 2026 | Statistics for Data Science 2026 | Pandas Interview Questions 2026 | Apache Spark Interview Questions 2026 | Tableau Interview Questions 2026

SQL Topics by Company and Level

Level	Topics	Companies
Junior DA	SELECT, WHERE, GROUP BY, HAVING, basic JOINs	Startups, consulting
Mid DA	Subqueries, CTEs, CASE WHEN, multiple JOINs	Product companies
Senior DA	Window functions, complex aggregations, optimization	FAANG, unicorns
Data Scientist	All of above + query optimization + analytical queries	All tech

EASY: Foundations (Questions 1-8)

Q1. What is the difference between WHERE and HAVING?

-- WHERE: filters ROWS before grouping (before GROUP BY)
-- HAVING: filters GROUPS after aggregation (after GROUP BY)

-- Find departments with more than 5 employees earning > 50000
SELECT department, COUNT(*) as emp_count, AVG(salary) as avg_salary
FROM employees
WHERE salary > 50000          -- filter individual rows first
GROUP BY department
HAVING COUNT(*) > 5;          -- filter groups after aggregation

-- Common mistake: using WHERE with aggregate functions
-- WRONG:
-- SELECT department, AVG(salary) FROM employees
-- WHERE AVG(salary) > 60000 GROUP BY department;  -- ERROR

-- CORRECT:
SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department
HAVING AVG(salary) > 60000;

Q2. What are the different types of JOINs?

-- Setup
-- employees: (emp_id, name, dept_id, salary)
-- departments: (dept_id, dept_name, manager_id)

-- INNER JOIN: only matching rows from both tables
SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

-- LEFT JOIN: all rows from left, matching from right (NULL if no match)
SELECT e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON e.dept_id = d.dept_id;
-- Employees without a department will show NULL dept_name

-- RIGHT JOIN: all rows from right, matching from left
SELECT e.name, d.dept_name
FROM employees e
RIGHT JOIN departments d ON e.dept_id = d.dept_id;
-- Departments with no employees will show NULL name

-- FULL OUTER JOIN: all rows from both tables
SELECT e.name, d.dept_name
FROM employees e
FULL OUTER JOIN departments d ON e.dept_id = d.dept_id;

-- CROSS JOIN: cartesian product (all combinations)
SELECT e.name, d.dept_name
FROM employees e
CROSS JOIN departments d;  -- rows_e * rows_d total rows

-- SELF JOIN: join table with itself
SELECT e1.name as employee, e2.name as manager
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.emp_id;

Q3. What is a CTE? How does it differ from a subquery?

-- CTE (Common Table Expression): named temporary result set
-- Defined with WITH clause before the main query

-- Example: find employees earning above their department average
WITH dept_avg AS (
    SELECT dept_id, AVG(salary) as avg_salary
    FROM employees
    GROUP BY dept_id
),
above_avg AS (
    SELECT e.emp_id, e.name, e.salary, da.avg_salary,
           e.salary - da.avg_salary as above_by
    FROM employees e
    JOIN dept_avg da ON e.dept_id = da.dept_id
    WHERE e.salary > da.avg_salary
)
SELECT * FROM above_avg ORDER BY above_by DESC;

-- Equivalent subquery (harder to read for complex cases)
SELECT e.emp_id, e.name, e.salary,
       da.avg_salary,
       e.salary - da.avg_salary as above_by
FROM employees e
JOIN (
    SELECT dept_id, AVG(salary) as avg_salary
    FROM employees
    GROUP BY dept_id
) da ON e.dept_id = da.dept_id
WHERE e.salary > da.avg_salary
ORDER BY above_by DESC;

-- CTE advantages:
-- 1. Reusability: reference same CTE multiple times in query
-- 2. Readability: step-by-step logic like a pipeline
-- 3. Recursive queries: trees, hierarchies (not possible with subqueries)

-- Recursive CTE: employee hierarchy
WITH RECURSIVE org_chart AS (
    -- Base case: top-level manager (no manager_id)
    SELECT emp_id, name, manager_id, 0 as level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case: employees reporting to current level
    SELECT e.emp_id, e.name, e.manager_id, oc.level + 1
    FROM employees e
    JOIN org_chart oc ON e.manager_id = oc.emp_id
)
SELECT * FROM org_chart ORDER BY level, name;

Q4. What is a window function? Explain PARTITION BY and ORDER BY in windows.

-- Window functions: compute over a "window" (partition) of rows
-- Unlike GROUP BY: original rows are NOT collapsed

-- Syntax: function_name(...) OVER (
--     [PARTITION BY col1, col2]  -- divide into groups
--     [ORDER BY col3 DESC]       -- ordering within group
--     [frame_clause]             -- row/range specification
-- )

-- Basic window functions
SELECT
    emp_id,
    name,
    dept_id,
    salary,
    -- Aggregate window functions: computed per partition
    AVG(salary) OVER (PARTITION BY dept_id) as dept_avg,
    MAX(salary) OVER (PARTITION BY dept_id) as dept_max,
    SUM(salary) OVER (PARTITION BY dept_id) as dept_total,
    COUNT(*) OVER (PARTITION BY dept_id) as dept_size,

    -- Ranking functions (require ORDER BY)
    ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) as salary_rank,
    RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rank_with_ties,
    DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) as dense_rank,

    -- Offset functions
    LAG(salary, 1) OVER (PARTITION BY dept_id ORDER BY hire_date) as prev_salary,
    LEAD(salary, 1) OVER (PARTITION BY dept_id ORDER BY hire_date) as next_salary,

    -- Distribution functions
    PERCENT_RANK() OVER (ORDER BY salary) as pct_rank,
    NTILE(4) OVER (ORDER BY salary) as salary_quartile

FROM employees;

Q5. What is RANK vs DENSE_RANK vs ROW_NUMBER?

-- Setup: scores with ties
-- Scores: 100, 100, 90, 85, 85, 80

WITH scores AS (
    SELECT 'Alice' as name, 100 as score UNION ALL
    SELECT 'Bob', 100 UNION ALL
    SELECT 'Charlie', 90 UNION ALL
    SELECT 'Diana', 85 UNION ALL
    SELECT 'Eve', 85 UNION ALL
    SELECT 'Frank', 80
)
SELECT
    name,
    score,
    ROW_NUMBER() OVER (ORDER BY score DESC) as row_num,
    -- Unique sequential number; tie-breaking is arbitrary
    -- 1, 2, 3, 4, 5, 6

    RANK() OVER (ORDER BY score DESC) as rank_val,
    -- Ties get same rank; next rank skips numbers
    -- 1, 1, 3, 4, 4, 6  (no 2, no 5)

    DENSE_RANK() OVER (ORDER BY score DESC) as dense_rank_val
    -- Ties get same rank; next rank is consecutive (no gaps)
    -- 1, 1, 2, 3, 3, 4

FROM scores;

-- Use ROW_NUMBER: when you need exactly one row per group (deduplication)
-- Use RANK: leaderboards where ties share position and next rank skips
-- Use DENSE_RANK: when you want no gaps in rankings (compact ranks)

Q6. How do you get the top N rows per group?

-- Classic interview question: "Get top 3 highest-paid employees per department"

-- Method 1: ROW_NUMBER (best for exact top-N without ties)
WITH ranked AS (
    SELECT
        emp_id, name, dept_id, salary,
        ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rn
    FROM employees
)
SELECT emp_id, name, dept_id, salary
FROM ranked
WHERE rn <= 3;

-- Method 2: RANK (includes ties -- may return more than N)
WITH ranked AS (
    SELECT
        emp_id, name, dept_id, salary,
        RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) as rk
    FROM employees
)
SELECT emp_id, name, dept_id, salary
FROM ranked
WHERE rk <= 3;

-- Method 3: DENSE_RANK (compact ranks with ties)
WITH ranked AS (
    SELECT
        emp_id, name, dept_id, salary,
        DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) as dr
    FROM employees
)
SELECT emp_id, name, dept_id, salary
FROM ranked
WHERE dr <= 3;

-- The interviewer expects: understand the difference between methods
-- and choose based on how ties should be handled.

Q7. What is the difference between UNION and UNION ALL?

-- UNION: combines result sets and removes duplicates
-- UNION ALL: combines result sets, keeps ALL rows (including duplicates)

-- Example: combine sales from two regions
SELECT product_id, sales, 'North' as region FROM sales_north
UNION
SELECT product_id, sales, 'South' as region FROM sales_south;
-- Removes any rows that are identical across both tables

SELECT product_id, sales, 'North' as region FROM sales_north
UNION ALL
SELECT product_id, sales, 'South' as region FROM sales_south;
-- Keeps all rows -- faster (no dedup step)

-- Performance: UNION ALL is always faster (no sorting/dedup)
-- Use UNION when: you genuinely need to remove cross-table duplicates
-- Use UNION ALL when: you know tables don't overlap, or duplicates are expected

-- Requirements for UNION:
-- Same number of columns in both SELECT statements
-- Compatible data types in corresponding column positions
-- Column names taken from first SELECT

Q8. What are aggregate functions? Explain COUNT with and without DISTINCT.

-- Standard aggregate functions
SELECT
    COUNT(*) as total_rows,               -- count all rows including NULLs
    COUNT(1) as also_total_rows,          -- same as COUNT(*)
    COUNT(email) as non_null_emails,      -- count non-NULL values only
    COUNT(DISTINCT user_id) as unique_users,  -- count distinct non-NULL values
    SUM(revenue) as total_revenue,
    AVG(revenue) as avg_revenue,
    MIN(order_date) as first_order,
    MAX(order_date) as last_order,
    SUM(revenue) / NULLIF(COUNT(*), 0) as manual_avg  -- safe division
FROM orders;

-- Important: COUNT(*) vs COUNT(column)
-- COUNT(*): counts every row including NULLs
-- COUNT(column): counts only NON-NULL values in that column

CREATE TABLE test_nulls AS
SELECT 1 as id, NULL as value UNION ALL
SELECT 2, 100 UNION ALL
SELECT 3, 200;

SELECT
    COUNT(*) as total,          -- 3
    COUNT(value) as non_null,   -- 2
    COUNT(DISTINCT value) as distinct_non_null  -- 2
FROM test_nulls;

-- GROUP BY with ROLLUP (multi-level subtotals)
SELECT
    department,
    job_title,
    SUM(salary) as total_salary,
    COUNT(*) as headcount
FROM employees
GROUP BY ROLLUP(department, job_title);
-- Returns: dept+title rows + dept subtotals + grand total

MEDIUM: Window Functions and Business Queries (Questions 9-20)

Q9. How do you compute running totals and moving averages?

-- Running total (cumulative sum)
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY order_date) as running_total,
    -- Same with explicit frame
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as running_total_explicit
FROM daily_sales
ORDER BY order_date;

-- 7-day moving average
SELECT
    order_date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as moving_avg_7d,
    -- Rolling sum
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as rolling_7d_sum
FROM daily_sales
ORDER BY order_date;

-- Running total WITHIN each year (partition by year)
SELECT
    order_date,
    EXTRACT(YEAR FROM order_date) as year,
    daily_revenue,
    SUM(daily_revenue) OVER (
        PARTITION BY EXTRACT(YEAR FROM order_date)
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as ytd_revenue
FROM daily_sales
ORDER BY order_date;

Q10. How do you compute month-over-month and year-over-year growth?

WITH monthly_revenue AS (
    SELECT
        DATE_TRUNC('month', order_date) as month,
        SUM(revenue) as total_revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', order_date)
),
growth AS (
    SELECT
        month,
        total_revenue,
        LAG(total_revenue, 1) OVER (ORDER BY month) as prev_month,
        LAG(total_revenue, 12) OVER (ORDER BY month) as same_month_last_year
    FROM monthly_revenue
)
SELECT
    month,
    total_revenue,
    prev_month,
    -- MoM growth
    ROUND(
        (total_revenue - prev_month) / NULLIF(prev_month, 0) * 100, 2
    ) as mom_pct_growth,
    -- YoY growth
    ROUND(
        (total_revenue - same_month_last_year) / NULLIF(same_month_last_year, 0) * 100, 2
    ) as yoy_pct_growth
FROM growth
ORDER BY month;

Q11. How do you find duplicate rows and deduplicate?

-- Find duplicates
SELECT email, COUNT(*) as cnt
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Show ALL duplicate rows with details
WITH dupes AS (
    SELECT email, COUNT(*) OVER (PARTITION BY email) as cnt
    FROM users
)
SELECT u.*
FROM users u
JOIN dupes d ON u.email = d.email
WHERE d.cnt > 1;

-- Deduplicate: keep the LATEST row per email
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY email
            ORDER BY created_at DESC  -- keep most recent
        ) as rn
    FROM users
)
SELECT * FROM ranked WHERE rn = 1;

-- Delete duplicates in-place (PostgreSQL / BigQuery)
DELETE FROM users
WHERE user_id NOT IN (
    SELECT MIN(user_id)  -- keep lowest ID per email
    FROM users
    GROUP BY email
);

-- For large tables: create new table and swap
CREATE TABLE users_deduped AS
SELECT DISTINCT ON (email) *
FROM users
ORDER BY email, created_at DESC;  -- PostgreSQL DISTINCT ON syntax

Q12. Write a cohort retention query.

-- Cohort: users grouped by their first order month
-- Retention: what % of each cohort is still active in month N?

WITH first_orders AS (
    SELECT
        user_id,
        DATE_TRUNC('month', MIN(order_date)) as cohort_month
    FROM orders
    GROUP BY user_id
),
user_activity AS (
    SELECT DISTINCT
        o.user_id,
        fo.cohort_month,
        DATE_TRUNC('month', o.order_date) as activity_month,
        DATEDIFF('month', fo.cohort_month, DATE_TRUNC('month', o.order_date)) as period_number
        -- Use DATE_PART in PostgreSQL:
        -- EXTRACT(EPOCH FROM (DATE_TRUNC('month', o.order_date) - fo.cohort_month)) / (30.44 * 86400)
    FROM orders o
    JOIN first_orders fo ON o.user_id = fo.user_id
),
cohort_sizes AS (
    SELECT cohort_month, COUNT(DISTINCT user_id) as cohort_size
    FROM first_orders
    GROUP BY cohort_month
),
retention_matrix AS (
    SELECT
        ua.cohort_month,
        ua.period_number,
        COUNT(DISTINCT ua.user_id) as retained_users
    FROM user_activity ua
    GROUP BY ua.cohort_month, ua.period_number
)
SELECT
    rm.cohort_month,
    rm.period_number,
    rm.retained_users,
    cs.cohort_size,
    ROUND(100.0 * rm.retained_users / cs.cohort_size, 2) as retention_pct
FROM retention_matrix rm
JOIN cohort_sizes cs ON rm.cohort_month = cs.cohort_month
ORDER BY rm.cohort_month, rm.period_number;

Q13. Write a funnel analysis query.

-- Conversion funnel: home -> product -> cart -> checkout -> purchase
-- Find the count and drop-off at each stage

WITH funnel_events AS (
    SELECT
        user_id,
        MAX(CASE WHEN event_type = 'home_view' THEN 1 ELSE 0 END) as visited_home,
        MAX(CASE WHEN event_type = 'product_view' THEN 1 ELSE 0 END) as viewed_product,
        MAX(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) as added_to_cart,
        MAX(CASE WHEN event_type = 'checkout_start' THEN 1 ELSE 0 END) as started_checkout,
        MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as purchased
    FROM events
    WHERE event_date >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
),
funnel_counts AS (
    SELECT
        SUM(visited_home) as step1_home,
        SUM(viewed_product) as step2_product,
        SUM(added_to_cart) as step3_cart,
        SUM(started_checkout) as step4_checkout,
        SUM(purchased) as step5_purchase
    FROM funnel_events
)
SELECT
    step1_home,
    step2_product,
    step3_cart,
    step4_checkout,
    step5_purchase,
    -- Conversion from previous step
    ROUND(100.0 * step2_product / NULLIF(step1_home, 0), 2) as home_to_product,
    ROUND(100.0 * step3_cart / NULLIF(step2_product, 0), 2) as product_to_cart,
    ROUND(100.0 * step4_checkout / NULLIF(step3_cart, 0), 2) as cart_to_checkout,
    ROUND(100.0 * step5_purchase / NULLIF(step4_checkout, 0), 2) as checkout_to_purchase,
    -- Overall funnel conversion
    ROUND(100.0 * step5_purchase / NULLIF(step1_home, 0), 2) as overall_cvr
FROM funnel_counts;

Q14. How do you pivot and unpivot data in SQL?

-- PIVOT: rows to columns
-- Input: (user_id, month, revenue)
-- Output: (user_id, Jan, Feb, Mar, Apr)

-- PostgreSQL / BigQuery: use CASE WHEN
SELECT
    user_id,
    SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 1 THEN revenue ELSE 0 END) as jan,
    SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 2 THEN revenue ELSE 0 END) as feb,
    SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 3 THEN revenue ELSE 0 END) as mar,
    SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 4 THEN revenue ELSE 0 END) as apr
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2026
GROUP BY user_id;

-- UNPIVOT: columns to rows
-- From: (user_id, jan_revenue, feb_revenue, mar_revenue)
-- To: (user_id, month, revenue)

-- Using UNION ALL (universal SQL)
SELECT user_id, 'Jan' as month, jan_revenue as revenue FROM quarterly_data
UNION ALL
SELECT user_id, 'Feb', feb_revenue FROM quarterly_data
UNION ALL
SELECT user_id, 'Mar', mar_revenue FROM quarterly_data
ORDER BY user_id, month;

-- BigQuery: UNPIVOT operator (more readable)
SELECT user_id, month, revenue
FROM quarterly_data
UNPIVOT (revenue FOR month IN (jan_revenue, feb_revenue, mar_revenue));

Q15. How do you find gaps in sequential data?

-- Find missing dates in a user activity sequence
WITH date_series AS (
    -- Generate all dates in the range
    SELECT generate_series(
        '2026-01-01'::date,
        '2026-01-31'::date,
        '1 day'::interval
    )::date as expected_date
),
user_activity AS (
    SELECT DISTINCT activity_date FROM user_logins
    WHERE user_id = 12345 AND activity_date BETWEEN '2026-01-01' AND '2026-01-31'
)
SELECT ds.expected_date as missing_date
FROM date_series ds
LEFT JOIN user_activity ua ON ds.expected_date = ua.activity_date
WHERE ua.activity_date IS NULL;

-- Find gaps in order ID sequence
SELECT
    o1.order_id + 1 as gap_start,
    o2.order_id - 1 as gap_end,
    o2.order_id - o1.order_id - 1 as gap_size
FROM orders o1
JOIN orders o2 ON o2.order_id = (
    SELECT MIN(order_id) FROM orders WHERE order_id > o1.order_id
)
WHERE o2.order_id > o1.order_id + 1
ORDER BY gap_start;

-- Find periods of inactivity > 30 days
SELECT
    user_id,
    activity_date as last_active,
    LEAD(activity_date) OVER (PARTITION BY user_id ORDER BY activity_date) as next_active,
    LEAD(activity_date) OVER (PARTITION BY user_id ORDER BY activity_date) - activity_date as gap_days
FROM user_logins
WHERE LEAD(activity_date) OVER (PARTITION BY user_id ORDER BY activity_date) - activity_date > 30;

Q16. Write a query for user session analysis.

-- Define sessions: events within 30 minutes of each other belong to same session
WITH session_gaps AS (
    SELECT
        user_id,
        event_time,
        event_type,
        LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as prev_event_time,
        CASE
            WHEN LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL THEN 1
            WHEN event_time - LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)
                 > INTERVAL '30 minutes' THEN 1
            ELSE 0
        END as is_session_start
    FROM events
),
session_ids AS (
    SELECT
        user_id,
        event_time,
        event_type,
        SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_time) as session_id
    FROM session_gaps
),
session_summary AS (
    SELECT
        user_id,
        session_id,
        MIN(event_time) as session_start,
        MAX(event_time) as session_end,
        COUNT(*) as event_count,
        EXTRACT(EPOCH FROM MAX(event_time) - MIN(event_time)) / 60 as duration_minutes,
        MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as has_purchase
    FROM session_ids
    GROUP BY user_id, session_id
)
SELECT
    user_id,
    COUNT(DISTINCT session_id) as total_sessions,
    ROUND(AVG(duration_minutes), 2) as avg_session_min,
    SUM(has_purchase) as purchase_sessions,
    ROUND(100.0 * SUM(has_purchase) / COUNT(*), 2) as purchase_rate_pct
FROM session_summary
GROUP BY user_id
ORDER BY total_sessions DESC;

Q17. How do you write a query to find the second highest salary?

-- Method 1: Subquery with NOT IN
SELECT MAX(salary)
FROM employees
WHERE salary NOT IN (SELECT MAX(salary) FROM employees);

-- Method 2: OFFSET (skip top result)
SELECT DISTINCT salary
FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;

-- Method 3: DENSE_RANK (handles ties correctly)
SELECT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as dr
    FROM employees
) ranked
WHERE dr = 2
LIMIT 1;

-- Generic: Nth highest salary
-- Replace 2 with N
WITH ranked AS (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as dr
    FROM employees
)
SELECT MAX(salary)  -- max handles if multiple employees have same Nth salary
FROM ranked
WHERE dr = 2;

-- Edge case: what if there's only 1 unique salary?
-- DENSE_RANK approach returns NULL gracefully
-- NOT IN approach returns NULL (correct)

Q18. Write a query to compute the 7-day rolling active users (DAU/WAU).

-- DAU (Daily Active Users)
WITH daily_active AS (
    SELECT
        DATE(event_time) as event_date,
        COUNT(DISTINCT user_id) as dau
    FROM user_events
    GROUP BY DATE(event_time)
),
-- WAU using 7-day rolling window
wau AS (
    SELECT
        event_date,
        dau,
        COUNT(DISTINCT u.user_id) as wau_7d
    FROM daily_active da
    JOIN user_events u ON DATE(u.event_time) BETWEEN da.event_date - INTERVAL '6 days'
                                                    AND da.event_date
    GROUP BY da.event_date, da.dau
),
-- Stickiness: DAU/MAU ratio
mau AS (
    SELECT
        DATE(event_time) as event_date,
        COUNT(DISTINCT user_id) as mau_28d
    FROM user_events
    WHERE event_time >= CURRENT_DATE - INTERVAL '28 days'
    GROUP BY DATE(event_time)
)
SELECT
    da.event_date,
    da.dau,
    w.wau_7d,
    m.mau_28d,
    ROUND(100.0 * da.dau / NULLIF(w.wau_7d, 0), 2) as dau_wau_stickiness,
    ROUND(100.0 * da.dau / NULLIF(m.mau_28d, 0), 2) as dau_mau_stickiness
FROM daily_active da
LEFT JOIN wau w ON da.event_date = w.event_date
LEFT JOIN mau m ON da.event_date = m.event_date
ORDER BY da.event_date;

Q19. How do you use CASE WHEN for conditional aggregation?

-- Pivot-like aggregation by category
SELECT
    DATE_TRUNC('month', order_date) as month,
    COUNT(*) as total_orders,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) as completed,
    COUNT(CASE WHEN status = 'cancelled' THEN 1 END) as cancelled,
    COUNT(CASE WHEN status = 'returned' THEN 1 END) as returned,
    ROUND(
        100.0 * COUNT(CASE WHEN status = 'cancelled' THEN 1 END) / COUNT(*), 2
    ) as cancellation_rate,
    SUM(CASE WHEN category = 'Electronics' THEN revenue ELSE 0 END) as electronics_rev,
    SUM(CASE WHEN category = 'Clothing' THEN revenue ELSE 0 END) as clothing_rev,
    SUM(CASE WHEN revenue > 5000 THEN revenue ELSE 0 END) as high_value_rev
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

-- Segmented averages
SELECT
    AVG(CASE WHEN age < 25 THEN revenue END) as avg_revenue_young,
    AVG(CASE WHEN age BETWEEN 25 AND 40 THEN revenue END) as avg_revenue_mid,
    AVG(CASE WHEN age > 40 THEN revenue END) as avg_revenue_older
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id;

-- Rate calculation with CASE
SELECT
    user_segment,
    SUM(CASE WHEN converted = 1 THEN 1.0 ELSE 0 END) / COUNT(*) as cvr
FROM experiment_data
GROUP BY user_segment;

Q20. How do you handle NULL values in SQL queries?

-- NULL is NOT equal to anything, including NULL
-- NULL = NULL returns NULL (not TRUE)
-- Use IS NULL, IS NOT NULL, COALESCE, NULLIF

-- Detect NULLs
SELECT COUNT(*) as total, COUNT(email) as non_null_emails,
       COUNT(*) - COUNT(email) as null_emails
FROM users;

-- COALESCE: return first non-NULL value
SELECT
    user_id,
    COALESCE(display_name, username, email, 'Anonymous') as name_to_show,
    COALESCE(revenue, 0) as safe_revenue  -- replace NULL with 0
FROM users;

-- NULLIF: return NULL if expression equals a value (prevents division by zero)
SELECT
    user_id,
    total_revenue / NULLIF(order_count, 0) as avg_order_value  -- safe division
FROM user_stats;

-- NULL in aggregations
-- SUM(col) ignores NULLs; COUNT(*) counts NULLs; COUNT(col) ignores NULLs
SELECT
    SUM(revenue) as sum_ignores_null,     -- treats NULL as 0 contribution
    COUNT(*) as count_all,                -- includes NULLs
    COUNT(revenue) as count_non_null,     -- excludes NULLs
    AVG(revenue) as avg_excludes_null     -- denominator = non-NULL count
FROM orders;

-- NULL in JOINs: NULL != NULL means NULL keys never match
-- LEFT JOIN preserves unmatched rows (NULLs in right table columns)

-- NULL in WHERE: UNKNOWN, not TRUE
SELECT * FROM orders WHERE revenue > 100;  -- rows where revenue IS NULL are excluded
SELECT * FROM orders WHERE revenue > 100 OR revenue IS NULL;  -- include NULLs explicitly

HARD: Performance and Advanced Analytics (Questions 21-30)

Q21. How do you optimize a slow SQL query?

-- Step 1: Check execution plan
EXPLAIN ANALYZE
SELECT u.user_id, u.name, COUNT(o.order_id) as order_count
FROM users u
JOIN orders o ON u.user_id = o.user_id
WHERE o.created_at >= '2026-01-01'
GROUP BY u.user_id, u.name;

-- Look for: Seq Scan (table scan) vs Index Scan, high row estimates vs actual rows

-- Optimization techniques:

-- 1. Add indexes on JOIN keys and WHERE columns
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_created_at ON orders(created_at);
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);  -- composite

-- 2. Filter early with CTEs / subqueries to reduce join size
WITH recent_orders AS (
    SELECT user_id, order_id  -- only select needed columns
    FROM orders
    WHERE created_at >= '2026-01-01'  -- filter BEFORE join
)
SELECT u.user_id, u.name, COUNT(ro.order_id) as order_count
FROM users u
JOIN recent_orders ro ON u.user_id = ro.user_id
GROUP BY u.user_id, u.name;

-- 3. Avoid SELECT * (fetches unnecessary columns, kills columnar storage performance)
-- 4. Use partition pruning (BigQuery/Redshift/Snowflake)
-- 5. Use approximate functions for large datasets
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;  -- BigQuery
SELECT COUNT(DISTINCT user_id) FROM events;           -- exact but slower

-- 6. Push predicates inside window function QUALIFY (BigQuery)
-- Instead of: SELECT * FROM (SELECT ..., ROW_NUMBER() OVER (...) as rn ...) WHERE rn = 1
-- Use QUALIFY: SELECT ..., ROW_NUMBER() OVER (...) as rn ... QUALIFY rn = 1

Q22. Write a query to detect transaction fraud patterns.

-- Find suspicious patterns: 3+ transactions within 1 hour from same user
WITH transaction_velocity AS (
    SELECT
        user_id,
        transaction_id,
        amount,
        created_at,
        COUNT(*) OVER (
            PARTITION BY user_id
            ORDER BY created_at
            RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
        ) as txn_count_1h,
        SUM(amount) OVER (
            PARTITION BY user_id
            ORDER BY created_at
            RANGE BETWEEN INTERVAL '1 hour' PRECEDING AND CURRENT ROW
        ) as amount_sum_1h,
        AVG(amount) OVER (PARTITION BY user_id) as user_avg_amount
    FROM transactions
),
fraud_signals AS (
    SELECT
        *,
        CASE
            WHEN txn_count_1h >= 5 THEN 'HIGH_VELOCITY'
            WHEN amount > user_avg_amount * 5 THEN 'HIGH_AMOUNT'
            WHEN txn_count_1h >= 3 AND amount > user_avg_amount * 3 THEN 'COMBINED_RISK'
            ELSE 'NORMAL'
        END as fraud_signal
    FROM transaction_velocity
)
SELECT
    user_id,
    transaction_id,
    amount,
    created_at,
    txn_count_1h,
    amount_sum_1h,
    fraud_signal
FROM fraud_signals
WHERE fraud_signal != 'NORMAL'
ORDER BY created_at DESC;

Q23. How do you write a median calculation in SQL?

-- SQL has no built-in MEDIAN function in standard SQL

-- Method 1: PERCENTILE_CONT (PostgreSQL, BigQuery, Snowflake)
SELECT
    dept_id,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) as median_salary,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) as p25_salary,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) as p75_salary
FROM employees
GROUP BY dept_id;

-- Method 2: Manual calculation (works in MySQL and older DBs)
WITH ordered AS (
    SELECT
        dept_id,
        salary,
        ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary) as rn,
        COUNT(*) OVER (PARTITION BY dept_id) as total
    FROM employees
),
median_rows AS (
    SELECT dept_id, salary
    FROM ordered
    WHERE rn IN (FLOOR((total + 1) / 2.0), CEIL((total + 1) / 2.0))
)
SELECT dept_id, AVG(salary) as median_salary
FROM median_rows
GROUP BY dept_id;

-- APPROX_QUANTILES (BigQuery -- fast approximation for large tables)
SELECT APPROX_QUANTILES(salary, 100)[OFFSET(50)] as approx_median
FROM employees;

Q24. Write a query to compute customer lifetime value (LTV).

WITH customer_metrics AS (
    SELECT
        c.customer_id,
        c.signup_date,
        COUNT(DISTINCT o.order_id) as total_orders,
        SUM(o.revenue) as total_revenue,
        MIN(o.order_date) as first_order_date,
        MAX(o.order_date) as last_order_date,
        DATEDIFF('day', MIN(o.order_date), MAX(o.order_date)) + 1 as active_days,
        DATEDIFF('day', c.signup_date, CURRENT_DATE) as tenure_days
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.signup_date
),
ltv_calculation AS (
    SELECT
        customer_id,
        total_orders,
        total_revenue,
        tenure_days,
        active_days,
        -- Average order value
        ROUND(total_revenue / NULLIF(total_orders, 0), 2) as aov,
        -- Purchase frequency (orders per 30 days active)
        ROUND(total_orders * 30.0 / NULLIF(active_days, 0), 3) as purchase_freq,
        -- Historical LTV
        total_revenue as historical_ltv,
        -- Projected annual LTV (if active)
        ROUND(
            (total_revenue / NULLIF(GREATEST(tenure_days, 1), 0)) * 365, 2
        ) as projected_annual_ltv,
        -- LTV segment
        CASE
            WHEN total_revenue > 50000 THEN 'Platinum'
            WHEN total_revenue > 10000 THEN 'Gold'
            WHEN total_revenue > 2000 THEN 'Silver'
            ELSE 'Bronze'
        END as ltv_segment
    FROM customer_metrics
)
SELECT * FROM ltv_calculation
ORDER BY historical_ltv DESC;

Q25. How do you calculate a percentile distribution of a metric?

-- Distribution of order values into buckets
WITH order_stats AS (
    SELECT
        order_id,
        revenue,
        NTILE(10) OVER (ORDER BY revenue) as decile,
        NTILE(4) OVER (ORDER BY revenue) as quartile,
        PERCENT_RANK() OVER (ORDER BY revenue) as pct_rank,
        CUME_DIST() OVER (ORDER BY revenue) as cume_dist
    FROM orders
    WHERE order_date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
    decile,
    COUNT(*) as order_count,
    MIN(revenue) as min_revenue,
    MAX(revenue) as max_revenue,
    ROUND(AVG(revenue), 2) as avg_revenue,
    ROUND(SUM(revenue), 2) as total_revenue,
    ROUND(100.0 * SUM(revenue) / SUM(SUM(revenue)) OVER (), 2) as revenue_pct
FROM order_stats
GROUP BY decile
ORDER BY decile;

-- Find exact percentile values
SELECT
    PERCENTILE_CONT(0.10) WITHIN GROUP (ORDER BY revenue) as p10,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY revenue) as p25,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY revenue) as p50,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY revenue) as p75,
    PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY revenue) as p90,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY revenue) as p95,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY revenue) as p99
FROM orders;

Q26. Write a query to find the first and last event per user.

-- Method 1: FIRST_VALUE / LAST_VALUE window functions
SELECT DISTINCT
    user_id,
    FIRST_VALUE(event_type) OVER (PARTITION BY user_id ORDER BY event_time) as first_event,
    FIRST_VALUE(event_time) OVER (PARTITION BY user_id ORDER BY event_time) as first_event_time,
    LAST_VALUE(event_type) OVER (
        PARTITION BY user_id
        ORDER BY event_time
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as last_event,
    LAST_VALUE(event_time) OVER (
        PARTITION BY user_id
        ORDER BY event_time
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) as last_event_time
FROM user_events;

-- Method 2: GROUP BY with MIN/MAX (simpler, more readable)
SELECT
    user_id,
    MIN(event_time) as first_event_time,
    MAX(event_time) as last_event_time,
    DATEDIFF('day', MIN(event_time), MAX(event_time)) as engagement_span_days
FROM user_events
GROUP BY user_id;

-- Method 3: ROW_NUMBER for complete row access (not just timestamp)
WITH ranked AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time ASC) as rn_first,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time DESC) as rn_last
    FROM user_events
)
SELECT
    f.user_id,
    f.event_type as first_event_type, f.event_time as first_event_time,
    l.event_type as last_event_type, l.event_time as last_event_time
FROM ranked f
JOIN ranked l ON f.user_id = l.user_id AND l.rn_last = 1
WHERE f.rn_first = 1;

Q27. Write a query to analyze A/B test results.

-- A/B test result analysis query
WITH experiment_data AS (
    SELECT
        user_id,
        variant,  -- 'control' or 'treatment'
        MAX(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) as converted,
        SUM(CASE WHEN event_type = 'purchase' THEN revenue ELSE 0 END) as revenue
    FROM ab_test_events
    WHERE experiment_id = 'checkout_redesign_v2'
      AND event_date BETWEEN '2026-05-01' AND '2026-05-28'
    GROUP BY user_id, variant
),
variant_stats AS (
    SELECT
        variant,
        COUNT(*) as n_users,
        SUM(converted) as n_converted,
        ROUND(100.0 * SUM(converted) / COUNT(*), 4) as cvr_pct,
        ROUND(AVG(revenue), 2) as avg_revenue,
        ROUND(SUM(revenue), 2) as total_revenue
    FROM experiment_data
    GROUP BY variant
),
-- Compute relative lift (pivot control and treatment)
control_stats AS (SELECT * FROM variant_stats WHERE variant = 'control'),
treatment_stats AS (SELECT * FROM variant_stats WHERE variant = 'treatment')
SELECT
    t.variant,
    t.n_users,
    t.n_converted,
    t.cvr_pct,
    c.cvr_pct as control_cvr_pct,
    ROUND(t.cvr_pct - c.cvr_pct, 4) as absolute_lift_pct,
    ROUND(100.0 * (t.cvr_pct - c.cvr_pct) / NULLIF(c.cvr_pct, 0), 2) as relative_lift_pct,
    t.avg_revenue,
    ROUND(100.0 * (t.avg_revenue - c.avg_revenue) / NULLIF(c.avg_revenue, 0), 2) as revenue_lift_pct,
    -- SRM check: expected ~50/50 split
    ROUND(100.0 * t.n_users / (t.n_users + c.n_users), 2) as treatment_share_pct
FROM treatment_stats t
CROSS JOIN control_stats c;

Q28. How do you use SQL for geo-spatial analysis?

-- Distance between user location and nearest store (PostgreSQL with PostGIS)
-- Or BigQuery with ST_ functions

-- BigQuery geo functions
SELECT
    u.user_id,
    s.store_id,
    s.store_name,
    ST_DISTANCE(
        ST_GEOGPOINT(u.longitude, u.latitude),
        ST_GEOGPOINT(s.longitude, s.latitude)
    ) / 1000 as distance_km
FROM users u
CROSS JOIN stores s
WHERE ST_DWITHIN(
    ST_GEOGPOINT(u.longitude, u.latitude),
    ST_GEOGPOINT(s.longitude, s.latitude),
    10000  -- 10km radius in meters
)
ORDER BY u.user_id, distance_km;

-- Find users within delivery radius
SELECT DISTINCT u.user_id
FROM users u
JOIN restaurants r ON ST_DWITHIN(
    ST_GEOGPOINT(u.longitude, u.latitude),
    ST_GEOGPOINT(r.longitude, r.latitude),
    5000  -- 5km
)
WHERE r.is_open = TRUE;

-- Haversine formula (standard SQL, no spatial extension needed)
SELECT
    u.user_id,
    s.store_id,
    (6371 * ACOS(
        COS(RADIANS(u.lat)) * COS(RADIANS(s.lat)) *
        COS(RADIANS(s.lon) - RADIANS(u.lon)) +
        SIN(RADIANS(u.lat)) * SIN(RADIANS(s.lat))
    )) as distance_km
FROM users u
CROSS JOIN stores s
WHERE u.user_id = 12345
ORDER BY distance_km
LIMIT 5;  -- nearest 5 stores

Q29. Write a query to compute product recommendation candidates (collaborative filtering).

-- Find products frequently bought together (association rules)
WITH user_products AS (
    SELECT DISTINCT user_id, product_id
    FROM order_items oi
    JOIN orders o ON oi.order_id = o.order_id
    WHERE o.created_at >= CURRENT_DATE - INTERVAL '90 days'
),
product_pairs AS (
    SELECT
        up1.product_id as product_a,
        up2.product_id as product_b,
        COUNT(*) as co_purchase_count
    FROM user_products up1
    JOIN user_products up2
        ON up1.user_id = up2.user_id
        AND up1.product_id < up2.product_id  -- avoid duplicates
    GROUP BY up1.product_id, up2.product_id
    HAVING COUNT(*) >= 10  -- minimum support threshold
),
product_counts AS (
    SELECT product_id, COUNT(DISTINCT user_id) as user_count
    FROM user_products
    GROUP BY product_id
)
SELECT
    pp.product_a,
    pp.product_b,
    pp.co_purchase_count,
    pc_a.user_count as product_a_buyers,
    pc_b.user_count as product_b_buyers,
    -- Lift = P(A and B) / (P(A) * P(B))
    -- High lift = non-random co-occurrence
    ROUND(
        pp.co_purchase_count * 1.0 /
        (pc_a.user_count * pc_b.user_count / (SELECT COUNT(DISTINCT user_id) FROM user_products)::float),
        3
    ) as lift
FROM product_pairs pp
JOIN product_counts pc_a ON pp.product_a = pc_a.product_id
JOIN product_counts pc_b ON pp.product_b = pc_b.product_id
ORDER BY lift DESC
LIMIT 100;

Q30. How do you detect account takeover with SQL?

-- Suspicious login patterns: unusual location + high-value action within 1 hour
WITH recent_logins AS (
    SELECT
        user_id,
        login_time,
        ip_country,
        device_fingerprint,
        LAG(ip_country) OVER (PARTITION BY user_id ORDER BY login_time) as prev_country,
        LAG(login_time) OVER (PARTITION BY user_id ORDER BY login_time) as prev_login_time,
        LAG(device_fingerprint) OVER (PARTITION BY user_id ORDER BY login_time) as prev_device
    FROM login_events
    WHERE login_time >= CURRENT_TIMESTAMP - INTERVAL '7 days'
),
suspicious_logins AS (
    SELECT
        rl.user_id,
        rl.login_time,
        rl.ip_country,
        rl.prev_country,
        CASE
            WHEN rl.ip_country != rl.prev_country AND
                 rl.login_time - rl.prev_login_time < INTERVAL '2 hours'
            THEN 'IMPOSSIBLE_TRAVEL'
            WHEN rl.device_fingerprint != rl.prev_device AND
                 rl.ip_country != rl.prev_country
            THEN 'NEW_DEVICE_NEW_COUNTRY'
            ELSE NULL
        END as alert_type
    FROM recent_logins rl
    WHERE rl.prev_country IS NOT NULL
),
high_value_actions AS (
    SELECT user_id, action_time, action_type, amount
    FROM user_actions
    WHERE action_type IN ('withdrawal', 'password_change', 'email_change')
      AND action_time >= CURRENT_TIMESTAMP - INTERVAL '7 days'
)
SELECT
    sl.user_id,
    sl.login_time,
    sl.alert_type,
    sl.prev_country, sl.ip_country,
    hva.action_type,
    hva.amount,
    hva.action_time
FROM suspicious_logins sl
JOIN high_value_actions hva
    ON sl.user_id = hva.user_id
    AND hva.action_time BETWEEN sl.login_time AND sl.login_time + INTERVAL '1 hour'
WHERE sl.alert_type IS NOT NULL
ORDER BY sl.login_time DESC;

FAQ

Q: What is the most important SQL skill to master for DA interviews?

A: Window functions -- specifically ROW_NUMBER for deduplication, RANK/DENSE_RANK for rankings, LAG/LEAD for period-over-period calculations, and SUM OVER PARTITION for running totals. Candidates from public preparation resources consistently report window functions as the primary differentiator between entry-level and senior DA candidates.

Q: How should I practice SQL for interviews?

A: Use LeetCode (Database section), Mode Analytics practice problems, and Stratascratch. Practice on realistic datasets: e-commerce orders, ride-hailing trips, user event logs. Focus on the 5 most common patterns: top-N per group, retention cohort, funnel analysis, period-over-period growth, and deduplication. Confirm the specific SQL platform used in the interview on the official company careers portal or interview prep guide.

Q: Does BigQuery SQL differ significantly from PostgreSQL for interviews?

A: Minor syntax differences. BigQuery uses DATE_TRUNC, TIMESTAMP_DIFF, QUALIFY, and APPROX_QUANTILES. PostgreSQL uses ::cast syntax, generate_series, and LATERAL joins. The window function and CTE syntax is nearly identical. Most interview questions use portable standard SQL.

Sources and review notesreviewed 8 Jun 2026

Article-specific sources

Verification window

Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.

Evidence labels

Official notices, candidate reports, offer documents, and editorial practice questions carry different confidence levels. The visible source list lets you inspect the evidence instead of relying on a blanket verification badge.

Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.

topic cluster

Sat this this year? Share your story, earn ₹500.

First-person experience reports help future candidates prep smarter. We pay verified contributors ₹500 via UPI per accepted story with byline.

Submit your story →

ready to practice?

Take a free timed mock test

Put what you learned into practice. Our mock tests match the 2026 pattern with timer, navigator, reveal, and score breakdown. No signup.

Start free mock test →

related guides

Interview Questions

Share this guide

Twitter LinkedIn W WhatsApp