PapersAdda

Data Engineering Interview Questions 2026 — Top 50 Questions with Answers

Last Updated: 30 Mar 2026

Data engineering roles saw a 47% increase in job postings in 2025, and the trend is accelerating. Databricks, Snowflake, Meta, Amazon, and every other data-driven business are hiring aggressively, because every AI model is only as good as its data pipeline. The modern data engineer must master streaming pipelines, lakehouse architectures, data quality frameworks, and increasingly, ML feature engineering. This guide compiles 50 real questions from interviews at Databricks, Snowflake, Meta, Amazon, Airbnb, and Uber, with battle-tested answers that get offers.

The data engineering interview isn't about memorizing Spark APIs. It's about proving you can build pipelines that don't break at 3 AM. This guide teaches you both.

Related articles: AI/ML Interview Questions 2026 | System Design Interview Questions 2026 | Generative AI Interview Questions 2026 | AWS Interview Questions 2026


Which Companies Ask These Questions?

| Topic Cluster | Companies |
|---|---|
| ETL Pipelines & Orchestration | Airbnb, Meta, Amazon, Stripe, all data-heavy companies |
| Apache Spark & PySpark | Databricks, LinkedIn, Apple, all Hadoop/Spark shops |
| Apache Kafka & Streaming | LinkedIn, Confluent, Uber, DoorDash |
| Data Modeling & Warehousing | Snowflake, Google BigQuery teams, Redshift shops |
| Data Quality & Testing | Lyft, Airbnb, data platform teams |
| Lakehouse Architecture | Databricks, Dremio, AWS, Azure |
| Real-time vs Batch | All data teams with streaming requirements |

EASY — Core Fundamentals (Questions 1-15)

Even senior data engineers get tripped up on fundamentals. Interviewers use these to gauge how deeply you understand the "why" behind your daily tools.

Q1. What is ETL and ELT? When do you use each?

| Aspect | ETL (Extract-Transform-Load) | ELT (Extract-Load-Transform) |
|---|---|---|
| Transform location | Before loading (external compute) | Inside the target warehouse |
| Era | Traditional data warehouses (Teradata) | Cloud DWH (Snowflake, BigQuery, Redshift) |
| Flexibility | Less: structure defined upfront | More: raw data stored, transform later |
| Latency | Higher (transform can be slow) | Lower for loading; transforms run later |
| Raw data preserved? | Sometimes not | Always (raw layer) |

When ETL: Legacy systems, strict data governance, PII masking must happen before loading, target system has limited compute.

When ELT: Cloud data warehouse with elastic compute (Snowflake, BigQuery), data lake architectures, explorative analytics where transformation requirements are evolving.

Modern reality (2026): Most teams use ELT with a tool like dbt (data build tool) for SQL-based transformations in the warehouse, plus Python (PySpark/pandas) for complex non-SQL transformations.
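The ELT shape is easy to see in miniature: land the raw records first, then transform with SQL inside the warehouse. A toy sketch, with sqlite3 standing in for a cloud warehouse (table and column names are made up for illustration):

```python
import json
import sqlite3

# Raw records as they might arrive from an API or event stream
raw_events = [
    {"order_id": 1, "amount": "120.50", "country": "IN"},
    {"order_id": 2, "amount": "80.00", "country": "US"},
]

conn = sqlite3.connect(":memory:")

# L: load the raw payloads as-is, no upfront structure
conn.execute("CREATE TABLE raw_orders (payload TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?)",
                 [(json.dumps(e),) for e in raw_events])

# T: transform in SQL inside the "warehouse" (the T happens after the L)
conn.execute("""
    CREATE TABLE stg_orders AS
    SELECT json_extract(payload, '$.order_id') AS order_id,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount,
           json_extract(payload, '$.country') AS country
    FROM raw_orders
""")
total = conn.execute("SELECT SUM(amount) FROM stg_orders").fetchone()[0]
print(total)  # 200.5
```

The raw layer stays queryable forever, so when transformation logic changes you rebuild `stg_orders` from `raw_orders` instead of re-extracting from the source.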


Q2. What is Apache Spark and why is it faster than MapReduce?

| Aspect | MapReduce (Hadoop) | Apache Spark |
|---|---|---|
| Data persistence | Write to HDFS after each step | In-memory RDDs (lazy evaluation) |
| Speed | Baseline (disk-bound between steps) | In-memory: up to 100x faster; on disk: ~10x faster |
| API | Low-level (map, reduce) | High-level (DataFrame, SQL, MLlib, Streaming) |
| Iterative algorithms | Very slow (e.g. ML training) | Efficient (cache intermediate data) |
| Language support | Java natively (others via Hadoop Streaming) | Scala, Python (PySpark), R, SQL |

Spark's execution model:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, window

spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()

# Transformations are LAZY — no computation until action is called
df = (spark.read.parquet("s3://data-lake/sales/")
          .filter(col("year") == 2026)
          .groupBy("product_category")
          .agg(sum("revenue").alias("total_revenue"),
               avg("order_value").alias("avg_order_value"))
          .orderBy("total_revenue", ascending=False))

# Action triggers execution
df.show(20)     # triggers DAG execution
df.write.parquet("s3://data-warehouse/sales_summary/2026/")

DAG execution: Spark builds a Directed Acyclic Graph of transformations, then optimizes and executes it. Catalyst optimizer rewrites the query plan for efficiency. Tungsten execution engine uses off-heap memory and SIMD vectorization.


Q3. What is the difference between narrow and wide transformations in Spark?

| Type | Definition | Examples | Shuffle? |
|---|---|---|---|
| Narrow | Each input partition feeds at most one output partition | map, filter, union | No |
| Wide | One input partition feeds multiple output partitions | groupBy, join, distinct, repartition | Yes |

Why shuffles are expensive: Wide transformations require shuffling data across the network — all rows with the same key must go to the same partition. Shuffle involves disk I/O, network I/O, and serialization.

# Wide transformation — causes shuffle
df.groupBy("customer_id").count()
# Spark: sort/hash all rows by customer_id → network transfer → reduce

# Narrow transformation — no shuffle
df.filter(col("amount") > 100).select("customer_id", "amount")
# Each partition processed independently

# Optimization: use filter BEFORE join to reduce shuffle data
# Bad:
big_df.join(other_df, "user_id").filter(col("country") == "IN")
# Good:
big_df.filter(col("country") == "IN").join(other_df, "user_id")

Q4. What is a Spark partition? How do you optimize partitioning?

Rules of thumb:

  • Target: 2-3 partitions per CPU core available
  • Ideal partition size: 100MB–300MB
  • Avoid too many small partitions (task overhead) and too few large partitions (OOM, skew)
# Check partition count and sizes
df = spark.read.parquet("s3://data/")
print(f"Partitions: {df.rdd.getNumPartitions()}")

# Repartition by column (even distribution via hash)
# Use when: preparing for shuffle-heavy operation, joining
df_repartitioned = df.repartition(200, "user_id")

# Coalesce (reduce partitions — no full shuffle)
# Use when: reducing partitions before write
df_coalesced = df.coalesce(50)

# Repartition by range (sorted partitions — good for range queries)
df_range = df.repartitionByRange(200, "date")

# Handle data skew: salting technique
from pyspark.sql.functions import col, concat, lit, floor, rand
# Add random salt to key to distribute skewed keys
df_salted = df.withColumn("salted_key",
    concat(col("user_id"), lit("_"), (floor(rand() * 10)).cast("string")))

Q5. What is Apache Kafka? Explain its core concepts.

Kafka Core Concepts:
├── Topic: Named stream of messages (e.g., "user-clicks", "order-events")
├── Partition: Ordered, immutable sequence of messages within a topic
│   └── Offset: Sequential ID of each message in a partition
├── Producer: Writes messages to topics
├── Consumer: Reads messages from topics
├── Consumer Group: Group of consumers sharing partition assignment
│   └── Each partition consumed by exactly one consumer in a group
├── Broker: Kafka server storing partition data
├── Cluster: Multiple brokers
├── ZooKeeper/KRaft: Cluster metadata management
└── Replication: Each partition replicated across N brokers (leader + followers)
from confluent_kafka import Producer, Consumer
import json

# Producer
producer = Producer({'bootstrap.servers': 'kafka:9092'})
producer.produce('user-clicks',
                  key=str(user_id).encode(),
                  value=json.dumps(click_event).encode(),
                  callback=delivery_callback)
producer.flush()

# Consumer
consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'analytics-consumers',
    'auto.offset.reset': 'earliest',    # or 'latest'
    'enable.auto.commit': False         # manual commit => at-least-once
})
consumer.subscribe(['user-clicks'])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None: continue
    if msg.error(): handle_error(msg.error())
    process_event(json.loads(msg.value()))
    consumer.commit()  # manual commit after processing

Delivery semantics:

| Semantic | How | Risk |
|---|---|---|
| At most once | Commit before processing | Message loss on crash |
| At least once | Commit after processing | Duplicate processing |
| Exactly once | Kafka transactions + idempotent producers | Complex, slight overhead |
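The first two semantics can be demonstrated without a broker: the only difference is whether the offset commit happens before or after processing, and what a crash between those two steps costs you. A plain-Python simulation (no Kafka involved, names are illustrative):

```python
def simulate(messages, commit_first, crash_between_steps_at):
    """Run until a crash between the commit and process steps, then restart once."""
    processed, committed = [], 0
    for i, msg in enumerate(messages):
        if commit_first:                       # at-most-once ordering
            committed = i + 1                  # 1) commit offset
            if i == crash_between_steps_at:
                break                          # crash before processing: message lost
            processed.append(msg)              # 2) process
        else:                                  # at-least-once ordering
            processed.append(msg)              # 1) process
            if i == crash_between_steps_at:
                break                          # crash before commit: message reprocessed
            committed = i + 1                  # 2) commit offset
    for msg in messages[committed:]:           # restart from last committed offset
        processed.append(msg)
    return processed

msgs = ["m0", "m1", "m2"]
print(simulate(msgs, commit_first=False, crash_between_steps_at=1))
# ['m0', 'm1', 'm1', 'm2']  -- at-least-once: m1 processed twice
print(simulate(msgs, commit_first=True, crash_between_steps_at=1))
# ['m0', 'm2']              -- at-most-once: m1 lost
```

Exactly-once needs more machinery (idempotent producers plus transactions spanning the write and the offset commit), which is why it carries the overhead noted in the table.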

Q6. What is Apache Airflow? How does it orchestrate pipelines?

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-engineering',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['[email protected]']
}

with DAG(
    dag_id='daily_sales_pipeline',
    default_args=default_args,
    schedule_interval='0 6 * * *',   # 6 AM daily (cron)
    start_date=datetime(2026, 1, 1),
    catchup=False,                    # don't run past dates on deploy
    tags=['sales', 'production']
) as dag:

    extract_task = PythonOperator(
        task_id='extract_from_postgres',
        python_callable=extract_sales_data,
        op_kwargs={'date': '{{ ds }}'}  # Airflow templating: execution date
    )

    transform_task = GlueJobOperator(
        task_id='spark_transform',
        job_name='sales-transform-job',
        script_args={'--date': '{{ ds }}', '--env': 'production'}
    )

    load_task = PythonOperator(
        task_id='load_to_snowflake',
        python_callable=load_to_warehouse
    )

    validate_task = PythonOperator(
        task_id='run_data_quality_checks',
        python_callable=run_great_expectations
    )

    notify_task = BashOperator(
        task_id='notify_stakeholders',
        bash_command='python notify.py --date {{ ds }} --status success'
    )

    # Define dependencies
    extract_task >> transform_task >> load_task >> validate_task >> notify_task

Airflow key concepts:

  • DAG: Python class defining workflow structure
  • Operator: Unit of work (PythonOperator, BashOperator, SparkSubmitOperator, etc.)
  • Task: Instance of an operator
  • XCom: Cross-communication between tasks (pass small data)
  • Scheduler: Scans DAGs, triggers task instances based on schedule
  • Executor: LocalExecutor, CeleryExecutor (distributed), KubernetesExecutor (cloud-native)
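Under the hood, the scheduler is doing a dependency-ordered traversal of the task graph. A toy version of that resolution for the DAG above, using only the standard library (task names reused purely for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# The `>>` chains in the DAG boil down to: task -> set of upstream tasks
deps = {
    "spark_transform": {"extract_from_postgres"},
    "load_to_snowflake": {"spark_transform"},
    "run_data_quality_checks": {"load_to_snowflake"},
    "notify_stakeholders": {"run_data_quality_checks"},
}

# static_order() yields tasks only after all their upstreams, exactly
# the property the scheduler enforces when triggering task instances
order = list(TopologicalSorter(deps).static_order())
print(order)
# ['extract_from_postgres', 'spark_transform', 'load_to_snowflake',
#  'run_data_quality_checks', 'notify_stakeholders']
```

Airflow adds retries, schedules, and distributed executors on top, but the core contract is this topological ordering.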

Q7. What is a data warehouse? How is it different from a data lake?

| Aspect | Data Warehouse | Data Lake |
|---|---|---|
| Data type | Structured only | Structured + semi-structured + unstructured |
| Schema | Schema-on-write (defined before loading) | Schema-on-read (defined at query time) |
| Purpose | BI, reporting, dashboards | ML, data science, exploratory analytics |
| Examples | Snowflake, Redshift, BigQuery | S3, ADLS, HDFS |
| Query performance | Fast (pre-defined schema, optimized storage) | Slower (must parse raw files) |
| Data quality | High (enforced schema) | Varies (raw data may be messy) |

Data Lakehouse (2026 dominant pattern): Combines lake (cheap storage, raw data) with warehouse (ACID transactions, query performance, schema enforcement) using open table formats:

| Format | Creator | Key Feature |
|---|---|---|
| Delta Lake | Databricks | ACID transactions, time travel |
| Apache Iceberg | Netflix | Partition evolution, hidden partitioning |
| Apache Hudi | Uber | Upserts (record-level updates) |

Q8. What is dbt (data build tool) and why is it standard in 2026?

-- models/mart/fct_sales.sql
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    cluster_by=['order_date', 'customer_id']
  )
}}

WITH source_orders AS (
    SELECT * FROM {{ ref('stg_orders') }}       -- reference to staging model
    {% if is_incremental() %}
    WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
    {% endif %}
),
customer_segments AS (
    SELECT * FROM {{ ref('dim_customers') }}    -- reference to dimension
),
final AS (
    SELECT
        o.order_id,
        o.customer_id,
        o.created_at::DATE AS order_date,
        o.gross_amount,
        o.discount_amount,
        o.gross_amount - o.discount_amount AS net_amount,
        c.segment,
        c.country
    FROM source_orders o
    LEFT JOIN customer_segments c USING (customer_id)
)
SELECT * FROM final
# models/mart/fct_sales.yml — documentation and tests
version: 2
models:
  - name: fct_sales
    description: "Fact table for all completed orders"
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: net_amount
        tests: [not_null, {dbt_expectations.expect_column_values_to_be_between: {min_value: 0}}]
      - name: order_date
        tests: [not_null]

dbt advantages: Version control for data transformations. Test every model. Automatic documentation. Lineage graphs. CI/CD integration. The standard for analytics engineering in 2026.


Q9. What is the star schema? How does it differ from snowflake schema?

-- Fact table: records events/transactions (large, narrow columns)
CREATE TABLE fact_sales (
    sale_id        BIGINT PRIMARY KEY,
    date_key       INT REFERENCES dim_date(date_key),
    product_key    INT REFERENCES dim_product(product_key),
    customer_key   INT REFERENCES dim_customer(customer_key),
    store_key      INT REFERENCES dim_store(store_key),
    quantity       INT,
    unit_price     DECIMAL(10,2),
    discount       DECIMAL(5,2),
    net_revenue    DECIMAL(12,2)
);

-- Dimension table: descriptive attributes (small, wide columns)
CREATE TABLE dim_product (
    product_key     INT PRIMARY KEY,
    product_id      VARCHAR(20),
    product_name    VARCHAR(200),
    category        VARCHAR(100),
    subcategory     VARCHAR(100),
    brand           VARCHAR(100),
    unit_cost       DECIMAL(10,2)
);

-- Date dimension: pre-computed calendar attributes
CREATE TABLE dim_date (
    date_key        INT PRIMARY KEY,   -- YYYYMMDD
    date            DATE,
    year            INT,
    quarter         INT,
    month           INT,
    month_name      VARCHAR(10),
    week_of_year    INT,
    day_of_week     INT,
    is_weekend      BOOLEAN,
    is_holiday      BOOLEAN
);

Star vs Snowflake:

| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Dimension normalization | Denormalized (flat) | Normalized (hierarchies split into tables) |
| Query performance | Faster (fewer joins) | Slower (more joins) |
| Storage | More (redundant data) | Less |
| Maintenance | Simpler | More complex |
| Use case | OLAP/BI workloads | When dimension tables are very large |
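The payoff of a star schema is that every analytical question is the fact table plus one join per dimension. A cut-down runnable sketch (sqlite3 standing in for the warehouse; rows are made up):

```python
import sqlite3

# Miniature star: one fact table, one dimension (mirrors the DDL above)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INT PRIMARY KEY, brand TEXT);
    CREATE TABLE fact_sales (sale_id INT, product_key INT, net_revenue REAL);
    INSERT INTO dim_product VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO fact_sales VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 70.0);
""")

# One hop from fact to dimension, then aggregate
rows = conn.execute("""
    SELECT p.brand, SUM(f.net_revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product p USING (product_key)
    GROUP BY p.brand
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('Acme', 150.0), ('Globex', 70.0)]
```

In a snowflake schema, `dim_product` would itself be split (brand, category as separate tables), adding joins to the same query.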

Q10. What are Slowly Changing Dimensions (SCDs)? Explain Types 1, 2, and 3.

SCD Type 1 — Overwrite:

-- Simply overwrite old value — no history kept
UPDATE dim_customer SET city = 'Bangalore', state = 'Karnataka'
WHERE customer_id = 12345;
-- Use when: corrections to data errors, history not needed

SCD Type 2 — Add New Row (most common):

CREATE TABLE dim_customer (
    customer_key    SERIAL PRIMARY KEY,    -- surrogate key
    customer_id     INT,                   -- natural/business key
    customer_name   VARCHAR(200),
    city            VARCHAR(100),
    state           VARCHAR(50),
    segment         VARCHAR(50),
    start_date      DATE NOT NULL,
    end_date        DATE,                  -- NULL = current record
    is_current      BOOLEAN DEFAULT TRUE
);

-- When customer moves city:
-- 1. Expire old record
UPDATE dim_customer SET end_date = CURRENT_DATE - 1, is_current = FALSE
WHERE customer_id = 12345 AND is_current = TRUE;
-- 2. Insert new record
INSERT INTO dim_customer (customer_id, customer_name, city, state, start_date, is_current)
VALUES (12345, 'Rahul Sharma', 'Bangalore', 'Karnataka', CURRENT_DATE, TRUE);

SCD Type 3 — Add Column:

-- Store both current and previous value
ALTER TABLE dim_customer
ADD COLUMN previous_city VARCHAR(100),
ADD COLUMN city_change_date DATE;

UPDATE dim_customer
SET previous_city = city, city = 'Bangalore', city_change_date = CURRENT_DATE
WHERE customer_id = 12345;
-- Use when: only last change matters, and need to compare current vs previous

SCD Type 4 — History table (separate current and history tables). SCD Type 6 — Hybrid of 1+2+3. In 2026, Type 2 is most common with Delta Lake's time travel often used as an alternative for simpler history tracking.


Q11. What is data partitioning in data warehouses and lakes?

# Partitioning in PySpark (for data lake)
df.write \
  .partitionBy("year", "month", "day") \
  .mode("append") \
  .parquet("s3://data-lake/events/")

# Resulting directory structure:
# s3://data-lake/events/year=2026/month=03/day=30/part-0001.parquet

# Query with partition pruning (reads only needed partitions)
df = spark.read.parquet("s3://data-lake/events/") \
    .filter(col("year") == 2026).filter(col("month") == 3)
# Spark reads ONLY s3://data-lake/events/year=2026/month=03/*

-- Clustering in Snowflake (the warehouse analogue of partitioning)
ALTER TABLE events CLUSTER BY (TO_DATE(event_timestamp), event_type);
-- Snowflake automatically sorts data within micro-partitions by cluster key,
-- enabling micro-partition pruning for range scans

Partition design principles:

  • Partition by columns commonly used in WHERE clauses
  • Avoid high-cardinality columns as partition keys (too many tiny partitions)
  • Date partitioning is almost always useful
  • Partitioning by a column with only a few distinct values is rarely worth the overhead

Q12. What is the difference between OLTP and OLAP systems?

| Aspect | OLTP | OLAP |
|---|---|---|
| Purpose | Transaction processing | Analytics and reporting |
| Query type | Short reads/writes on few rows | Long reads across many rows |
| Optimization | Row-based storage, indexes | Columnar storage, partitioning |
| Normalization | 3NF (normalized) | Denormalized (star/snowflake) |
| Concurrency | Thousands of concurrent users | Fewer, complex queries |
| Examples | PostgreSQL, MySQL | Snowflake, BigQuery, Redshift |
| Data currency | Real-time | Batch-updated |
| Transaction support | ACID | Limited (ACID in modern lakehouses) |
-- OLAP query pattern: aggregate across millions of rows
-- Good for columnar storage: only reads the relevant columns
SELECT
    product_category,
    DATE_TRUNC('month', order_date) AS month,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM fact_orders
WHERE order_date BETWEEN '2025-01-01' AND '2026-12-31'
GROUP BY 1, 2
ORDER BY 3 DESC;

Q13. What are the key Spark performance tuning techniques?

# 1. Broadcast joins (for small tables < ~10MB)
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
# Avoids shuffle — broadcasts small_df to all executors

# 2. Persist/cache intermediate results
df_filtered = df.filter(col("status") == "active").cache()
df_filtered.count()       # Action to trigger caching
# ... multiple uses of df_filtered ...
df_filtered.unpersist()   # Free memory when done

# 3. Avoid Python UDFs when possible (they bypass JVM-native, vectorized execution)
# Bad: Python UDF (slow — serializes row by row to Python)
from pyspark.sql.functions import udf
@udf("double")
def calculate_tax(amount): return amount * 0.18  # Slow!

# Good: use built-in Spark functions (JVM-native, vectorized)
from pyspark.sql.functions import col
df = df.withColumn("tax", col("amount") * 0.18)  # Fast!

# 4. Predicate pushdown (push filters into file reading)
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

# 5. Column pruning (select only needed columns early)
df = df.select("user_id", "event_type", "timestamp")  # Before joins

# 6. Adaptive Query Execution (AQE) — enabled by default in Spark 3.x
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# AQE dynamically adjusts partition count after shuffle based on actual data size

# 7. Salting for skewed joins
from pyspark.sql.functions import col, concat, lit, rand
df_salted = skewed_df.withColumn("salt", (rand() * 10).cast("int")) \
                      .withColumn("skewed_key_salted",
                                  concat(col("skewed_key"), lit("_"), col("salt")))
other_salted = other_df.crossJoin(spark.range(10).toDF("salt")) \
                        .withColumn("skewed_key_salted",
                                    concat(col("skewed_key"), lit("_"), col("salt")))
result = df_salted.join(other_salted, "skewed_key_salted")

Q14. What is the Lambda architecture and how does Kappa architecture improve it?

Lambda Architecture:

Data Source → [Batch Layer] → Batch views
           ↘ [Speed Layer] → Real-time views
                                    ↓
              [Serving Layer] merges both → Query results

Problem: Maintaining two separate codebases (batch + streaming) for the same logic. Code divergence, debugging complexity.

Kappa Architecture:

Data Source → [Streaming Layer only (Kafka + Flink/Spark Streaming)]
                              ↓
                    [Serving Layer] → All queries served from streaming output

Replay for "batch" correction: Replay Kafka topic from offset 0 through a faster streaming job.

In 2026 — predominant patterns:

  1. Lakehouse + streaming (Databricks/Delta Lake): Structured Streaming writes to Delta tables. Batch jobs read from same tables. No architecture split.
  2. Kappa with Flink: Flink handles both streaming (milliseconds) and batch (bounded streams) uniformly.
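The "replay" idea behind Kappa fits in a few lines: if every derived view is just a fold over the append-only log, fixing logic means re-running the fold from offset 0 rather than maintaining a second batch codebase. A plain-Python miniature (the list stands in for a Kafka topic):

```python
# Append-only event log: (user, amount) pairs, the single source of truth
log = [("u1", 5), ("u2", 3), ("u1", 2)]

def build_view(events, value_fn):
    """Fold the log into a per-user view; value_fn is the 'business logic'."""
    view = {}
    for user, amount in events:
        view[user] = view.get(user, 0) + value_fn(amount)
    return view

v1 = build_view(log, lambda a: a)        # original streaming job's logic
print(v1)                                # {'u1': 7, 'u2': 3}

# Bug fix or logic change: replay the same log from offset 0 with new logic
v2 = build_view(log, lambda a: a * 2)
print(v2)                                # {'u1': 14, 'u2': 6}
```

One codebase, one log; "batch correction" is just the same fold run again, which is exactly what replaying a Kafka topic through the streaming job achieves.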

Q15. What is a medallion architecture?

Bronze (Raw) → Silver (Cleaned) → Gold (Business-ready)

Bronze:
  - Raw data as-is from source systems
  - Immutable, append-only
  - JSON/CSV/Parquet from Kafka, databases, APIs
  - Retention: indefinite
  - s3://data-lake/bronze/orders/year=2026/

Silver:
  - Deduplicated, cleaned, validated
  - Standardized types and formats
  - Parsed nested JSON
  - Some joins (order + order_items unified)
  - Retention: 3-5 years

Gold:
  - Business-level aggregates
  - Optimized for specific use cases (BI, ML features, operational)
  - Star schema or wide tables
  - Frequently refreshed
  - s3://data-lake/gold/daily_sales_summary/
# Bronze to Silver example with Delta Lake
from delta.tables import DeltaTable
from pyspark.sql.functions import col, to_timestamp, current_timestamp

# Read from Bronze
bronze_df = spark.readStream \
    .format("delta") \
    .load("s3://data-lake/bronze/orders/")

# Clean and standardize
silver_df = bronze_df \
    .dropDuplicates(["order_id"]) \
    .filter(col("amount").isNotNull() & (col("amount") > 0)) \
    .withColumn("created_at", to_timestamp(col("created_at_str"), "yyyy-MM-dd HH:mm:ss")) \
    .withColumn("_ingested_at", current_timestamp()) \
    .drop("created_at_str", "_raw_json")

# Write to Silver (streaming append)
silver_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://checkpoints/bronze-to-silver/orders/") \
    .start("s3://data-lake/silver/orders/")

MEDIUM — Advanced Topics (Questions 16-35)

Databricks, Snowflake, and Airbnb data engineering interviews go deep on these topics. This is where you prove you've built real pipelines, not just followed tutorials.

Q16. How does Spark Structured Streaming work? What are output modes?

from pyspark.sql.functions import window, col, sum, from_json
# Note: `schema` used below is the event payload StructType, assumed defined elsewhere

# Read from Kafka as a stream
events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "user-clicks") \
    .option("startingOffsets", "latest") \
    .load() \
    .selectExpr("CAST(value AS STRING) as json_value",
                "timestamp as event_time") \
    .select(from_json(col("json_value"), schema).alias("data"), "event_time") \
    .select("data.*", "event_time")

# Windowed aggregation (tumbling window, 5-minute intervals)
# Watermark: accept data up to 10 minutes late, then finalize windows
hourly_counts = events \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(
        window(col("event_time"), "5 minutes"),    # tumbling window
        col("product_category")
    ) \
    .agg(sum("click_count").alias("total_clicks"))

# Output modes:
# append: only new rows (not suitable for aggregations without watermark)
# update: only changed rows since last trigger (most efficient for aggregations)
# complete: entire result table (only for small result sets)
hourly_counts.writeStream \
    .format("delta") \
    .outputMode("update") \
    .option("checkpointLocation", "s3://checkpoints/hourly-clicks/") \
    .trigger(processingTime="1 minute") \
    .start("s3://gold/hourly-clicks/")

Watermarks: Tell Spark how late data can be and still be counted. Without watermarks, Spark must keep all state forever. With watermark of 10 min, Spark can clean up state for windows older than max_event_time - 10 min.
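The state-cleanup rule can be simulated in a few lines: the watermark trails the maximum event time seen by the configured delay, and anything older is dropped rather than buffered. A deliberately simplified sketch (real Spark compares window end times, not individual events; numbers are illustrative):

```python
def run(events, delay):
    """events: event-time stamps in arrival order; watermark = max seen - delay."""
    max_seen, kept, dropped = 0, [], []
    for t in events:
        max_seen = max(max_seen, t)            # watermark only ever advances
        watermark = max_seen - delay
        if t >= watermark:
            kept.append(t)                     # within allowed lateness
        else:
            dropped.append(t)                  # state for this region already cleaned up
    return kept, dropped

# t=15 arrives after t=30 pushed the watermark to 20, so it is dropped
print(run([10, 12, 11, 30, 15], delay=10))  # ([10, 12, 11, 30], [15])
```

This is the trade-off the delay controls: a larger delay tolerates later data but forces Spark to hold window state longer.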


Q17. What is a streaming join? What challenges arise?

from pyspark.sql.functions import expr  # needed for the interval join condition

# Stream-stream join (both sides are unbounded)
orders_stream = spark.readStream.format("kafka").option("subscribe", "orders").load()
payments_stream = spark.readStream.format("kafka").option("subscribe", "payments").load()

# Join with watermarks — required for stream-stream joins
orders_w = orders_stream.withWatermark("order_time", "1 hour")
payments_w = payments_stream.withWatermark("payment_time", "1 hour")

result = orders_w.join(
    payments_w,
    expr("""order_id = payment_order_id AND
            payment_time BETWEEN order_time AND order_time + INTERVAL 1 HOUR""")
)
# Without watermarks: state grows unboundedly (OOM)
# With watermarks: Spark discards old unmatched events after watermark passes

# Stream-static join (no state issues — static side fully loaded)
products = spark.read.format("delta").load("s3://gold/dim_products/")  # static
enriched = orders_stream.join(products, "product_id")   # no state issues

Challenges:

  1. State management: Unmatched events must be buffered until they're matched or watermark expires
  2. Out-of-order events: Watermarks handle this with a delay threshold
  3. Outer joins with streams: Only supported with watermarks in Spark
  4. Memory: State store (RocksDB in Spark 3.x) can grow large for slow streams

Q18. How do you implement data quality checks in a pipeline?

# Great Expectations — industry standard for data quality (2026)
import great_expectations as gx
from great_expectations.checkpoint import SimpleCheckpoint

context = gx.get_context()

# Define expectations (business rules as code)
suite = context.create_expectation_suite("orders_suite", overwrite_existing=True)
validator = context.get_validator(batch_request=batch_request, expectation_suite=suite)

# Column-level expectations
validator.expect_column_to_exist("order_id")
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("amount", min_value=0, max_value=1000000)
validator.expect_column_values_to_be_in_set("status",
    {"pending", "confirmed", "shipped", "delivered", "cancelled"})

# Table-level expectations
validator.expect_table_row_count_to_be_between(min_value=100, max_value=1000000)
validator.expect_column_pair_values_a_to_be_greater_than_b("delivery_date", "order_date")

# Regex-based expectation
validator.expect_column_values_to_match_regex("phone", r"^\+\d{10,15}$")

# Statistical expectations (detect data drift)
validator.expect_column_mean_to_be_between("amount", min_value=500, max_value=5000)
validator.expect_column_stdev_to_be_between("amount", min_value=100, max_value=10000)

validator.save_expectation_suite()

# Run in pipeline
checkpoint = SimpleCheckpoint(name="orders_checkpoint", data_context=context)
result = checkpoint.run()
if not result.success:
    alert_on_call("Data quality check failed!")
    raise Exception("Halting pipeline: data quality failure")

Beyond GE — dbt tests (SQL-native):

# dbt_project/models/staging/stg_orders.yml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: amount
        tests:
          - dbt_utils.not_constant
          - dbt_utils.at_least_one

Q19. What is Delta Lake and what problems does it solve?

from delta.tables import DeltaTable
from pyspark.sql.functions import col

# Problem 1: ACID transactions on data lakes
# Without Delta: multiple writers corrupt files; reads during write return partial data
# With Delta: transaction log (JSON/_delta_log/) ensures ACID

# Problem 2: Schema enforcement & evolution
# `mergeSchema` allows schema evolution on append
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("s3://data-lake/orders/")

# Problem 3: Upserts (MERGE / CDC)
delta_table = DeltaTable.forPath(spark, "s3://data-lake/orders/")
delta_table.alias("target").merge(
    updates_df.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdate(set={
    "status": "source.status",
    "updated_at": "source.updated_at"
}).whenNotMatchedInsertAll().execute()

# Problem 4: Time travel (query historical versions)
df_yesterday = spark.read.format("delta") \
    .option("timestampAsOf", "2026-03-29") \
    .load("s3://data-lake/orders/")

df_version_5 = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("s3://data-lake/orders/")

# Problem 5: Audit trail
delta_table.history().show()  # see all versions and operations

# Problem 6: Data compaction (small files problem)
delta_table.optimize().executeCompaction()  # merge small files
delta_table.vacuum(retentionHours=168)      # delete old versions (7 days)

Q20. What is Change Data Capture (CDC)? How do you implement it?

Approaches:

| Approach | Method | Latency | Impact on Source |
|---|---|---|---|
| Log-based CDC | Read DB transaction logs (WAL) | Near real-time | Minimal |
| Trigger-based | DB triggers write to audit table | Low | High (per-row overhead) |
| Timestamp-based | Poll for records where updated_at > last_poll | Minutes | Moderate |
| Snapshot-based | Full table scan, compare to previous | Hours | High |

Log-based CDC with Debezium (industry standard):

# docker-compose.yml (illustrative; connector config is normally registered via the Kafka Connect REST API)
services:
  debezium:
    image: debezium/connect:latest
    config:
      connector.class: "io.debezium.connector.postgresql.PostgresConnector"
      database.hostname: "postgres"
      database.port: "5432"
      database.user: "replicator"
      database.password: "${DB_PASSWORD}"
      database.dbname: "production"
      database.server.name: "prod-postgres"
      table.include.list: "public.orders,public.customers"
      plugin.name: "pgoutput"
      publication.autocreate.mode: "filtered"
# Consuming CDC events from Kafka
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType
# `schema` below is the full row StructType for the table, assumed defined elsewhere

cdc_schema = StructType([
    StructField("op", StringType()),       # c=create, u=update, d=delete, r=read
    StructField("before", schema),          # row state before change
    StructField("after", schema),           # row state after change
    StructField("ts_ms", LongType())
])

cdc_df = spark.readStream.format("kafka") \
    .option("subscribe", "prod-postgres.public.orders") \
    .load() \
    .selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), cdc_schema).alias("cdc")) \
    .select("cdc.*")

# Apply changes to Delta Lake
def process_cdc_batch(batch_df, batch_id):
    delta_table = DeltaTable.forPath(spark, "s3://silver/orders/")
    inserts = batch_df.filter(col("op").isin(["c", "r"])).select("after.*")
    updates = batch_df.filter(col("op") == "u").select("after.*")
    deletes = batch_df.filter(col("op") == "d").select("before.order_id")
    # Merge inserts/updates; apply `deletes` separately (e.g. a second merge with whenMatchedDelete)
    delta_table.alias("t").merge(
        inserts.union(updates).alias("s"), "t.order_id = s.order_id"
    ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

cdc_df.writeStream.foreachBatch(process_cdc_batch).start()

Q21. How does Apache Flink compare to Spark Structured Streaming?

| Aspect | Apache Flink | Spark Structured Streaming |
|---|---|---|
| Processing model | True streaming (event-by-event) | Micro-batch (small batches) |
| Latency | Low milliseconds | 100ms-1s typical |
| State management | Native RocksDB state backend | Managed state store (RocksDB in Spark 3.x) |
| Event time | First-class citizen | Well-supported (watermarks) |
| Batch + streaming | Unified (batch as bounded streams) | Same DataFrame API (read vs readStream) |
| Window types | Tumbling, sliding, session, global | Tumbling, sliding, watermark-based |
| Maturity | Very mature, production-proven | Mature, widely adopted |
| Ecosystem | Better for complex streaming pipelines | Better for unified batch + streaming on Databricks |

2026 landscape: Flink dominates pure streaming use cases (fraud detection, real-time recommendations, trading). Spark Structured Streaming dominates in Databricks-heavy shops due to Delta Lake integration.

# Flink Python API (PyFlink)
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

t_env.execute_sql("""
    CREATE TABLE kafka_orders (
        order_id STRING,
        amount DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

result = t_env.sql_query("""
    SELECT
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        COUNT(*) AS order_count,
        SUM(amount) AS total_amount
    FROM kafka_orders
    GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
result.execute().print()  # execute the query (print sink here; use a real sink in prod)

Q22. How do you handle late-arriving data in streaming pipelines?

# Strategies for late data:

# 1. Watermarks (Spark/Flink) — define maximum lateness
events.withWatermark("event_time", "2 hours")  # accept data up to 2h late
# Data arriving > 2h late is dropped/ignored

# 2. Reprocessing with Lambda architecture:
#    Streaming handles recent data; daily batch job reprocesses last 3 days
#    Delta Lake time travel enables targeted reprocessing

# 3. Idempotent MERGE for late data (Delta Lake)
def handle_late_data(late_df):
    DeltaTable.forPath(spark, "s3://gold/daily_metrics/") \
        .alias("t").merge(
            late_df.alias("s"),
            "t.date = s.date AND t.user_id = s.user_id"
        ).whenMatchedUpdate(set={
            "event_count": "t.event_count + s.event_count",
            "updated_at": "current_timestamp()"
        }).whenNotMatchedInsertAll().execute()

# 4. Allowed lateness (Flink DataStream API):
#    Flink keeps window state for extra time after the watermark passes,
#    so late records re-fire the window instead of being dropped. Sketch:
#    stream.key_by(lambda e: e["user_id"]) \
#          .window(TumblingEventTimeWindows.of(Time.minutes(5))) \
#          .allowed_lateness(10 * 60 * 1000)  # 10 min grace period, in ms
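The watermark rule above can be simulated in plain Python (a toy model, not any engine's API): the watermark trails the maximum event time seen so far by the allowed lateness, and events that fall behind the watermark are dropped.

```python
from datetime import datetime, timedelta

def process_with_watermark(events, allowed_lateness):
    """Toy watermark model: accept an event only if its event time is
    not older than (max event time seen so far - allowed_lateness)."""
    max_event_time = datetime.min
    accepted, dropped = [], []
    for event_time, payload in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness
        if event_time >= watermark:
            accepted.append(payload)
        else:
            dropped.append(payload)  # too late: behind the watermark
    return accepted, dropped

t0 = datetime(2026, 1, 1, 12, 0)
events = [
    (t0, "a"),
    (t0 + timedelta(hours=3), "b"),    # advances watermark to t0 + 1h
    (t0 + timedelta(minutes=30), "c"), # 2.5h behind max time seen → dropped
]
accepted, dropped = process_with_watermark(events, timedelta(hours=2))
# accepted == ["a", "b"], dropped == ["c"]
```

Note the watermark only moves forward as newer events arrive, which is exactly why an idle source can stall a streaming job's output.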

Q23. What is data lineage and why does it matter?

# Apache Atlas — open-source data lineage
# OpenLineage — open standard for lineage (used by Airflow, Spark, dbt)

# OpenLineage with Airflow
from openlineage.airflow import OpenLineageListener
# Automatically captures: DAG name, task inputs/outputs, run metadata

# dbt lineage (automatic)
dbt docs generate  # generates lineage graph
dbt docs serve     # visual DAG of all models and their dependencies

# Custom lineage in PySpark
from uuid import uuid4
from datetime import datetime, timezone
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset

client = OpenLineageClient.from_environment()
run = Run(runId=str(uuid4()))
job = Job(namespace="my-spark-jobs", name="daily-sales-transform")

# Emit lineage events
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job,
    producer="https://github.com/my-org/my-spark-jobs",  # identifies the emitter
    inputs=[Dataset(namespace="s3://bronze", name="orders/")],
    outputs=[Dataset(namespace="s3://silver", name="orders/")]
))

Why lineage matters in 2026:

  • Impact analysis: "If I change this table, what downstream dashboards break?"
  • GDPR/data deletion: "Where does user_id 12345's data flow? Delete it everywhere."
  • Debugging: "Why is this dashboard showing wrong numbers?" → trace to source
  • Compliance: Auditors require proof of data governance
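Impact analysis is, at its core, a reachability query over the lineage graph. A minimal sketch (the edge map and table names are invented for illustration):

```python
from collections import deque

def downstream_impact(lineage, start):
    """BFS over a {table: [downstream tables]} edge map; returns every
    asset affected by a change to `start`."""
    impacted, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# Hypothetical lineage: bronze → silver → gold → dashboard
lineage = {
    "bronze.orders": ["silver.orders"],
    "silver.orders": ["gold.daily_sales", "gold.user_ltv"],
    "gold.daily_sales": ["dashboard.revenue"],
}
downstream_impact(lineage, "bronze.orders")
# → {'silver.orders', 'gold.daily_sales', 'gold.user_ltv', 'dashboard.revenue'}
```

Tools like DataHub and OpenLineage maintain this graph for you; the query pattern is the same.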

Q24. What are the common causes of Spark OOM errors and how do you fix them?

| Cause | Symptom | Fix |
| --- | --- | --- |
| Skewed data | One task takes 100x longer | Salting, skew join hint, AQE |
| Cartesian product | Memory explodes | Review join keys, avoid crossJoin |
| Collecting large data | df.toPandas() on large DF | Use Spark aggregations, sample first |
| Too much caching | Memory pressure | Unpersist when done, use MEMORY_AND_DISK |
| Driver memory | Driver OOM on collect() | Increase driver memory, avoid collect() on large data |
| Too many small files | Metadata overhead | Coalesce before write, Delta OPTIMIZE |
# Fix 1: Increase executor memory
spark = SparkSession.builder \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.memoryOverhead", "2g") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

# Fix 2: More shuffle partitions — smaller partitions, less memory per task
spark.conf.set("spark.sql.shuffle.partitions", "400")
# (shuffle spill to disk is automatic in modern Spark; no flag needed)

# Fix 3: AQE handles skew automatically (Spark 3.x)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")

# Fix 4: Avoid collect() on large DataFrames
# Bad:
rows = df.collect()  # brings everything to driver — OOM risk
# Good:
df.write.parquet("s3://output/")  # distributed write
# Or sample:
sample = df.sample(0.01).collect()
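A quick sanity check when tuning the memory settings above: the YARN/Kubernetes container must fit heap plus overhead, and only a fraction of the heap is usable for execution and storage (governed by spark.memory.fraction, default 0.6, after a fixed ~300 MB reservation). A simplified back-of-envelope calculator:

```python
def executor_memory_budget(executor_memory_gb, overhead_gb,
                           memory_fraction=0.6, reserved_gb=0.3):
    """Approximate Spark unified-memory math (simplified model).
    Container size = heap + overhead; usable execution+storage memory
    ≈ (heap - reserved) * spark.memory.fraction."""
    container_gb = executor_memory_gb + overhead_gb
    unified_gb = (executor_memory_gb - reserved_gb) * memory_fraction
    return container_gb, round(unified_gb, 2)

container, unified = executor_memory_budget(8, 2)
# container == 10 → the cluster manager must grant a 10 GB container
# unified == 4.62 → only ~4.6 GB of the 8 GB heap serves shuffles/joins/caching
```

The gap between requested and usable memory is why "just bump executor.memory" often under-delivers: roughly half the container is overhead, reservation, and user memory.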

Q25. What is data mesh and how does it differ from a traditional data platform?

| Aspect | Traditional Data Platform | Data Mesh |
| --- | --- | --- |
| Ownership | Central data team owns all pipelines | Domain teams own their data products |
| Architecture | Monolithic data warehouse/lake | Distributed, domain-oriented |
| Data discovery | Central catalog | Federated catalog |
| Governance | Central policies | Federated governance with global standards |
| Scalability | Bottlenecked by central team | Scales with number of domains |
| Consistency | High (centralized) | Requires standards for interoperability |

Four principles:

  1. Domain ownership: Orders domain team owns orders data product
  2. Data as a product: Data products have SLAs, documentation, versioning
  3. Self-serve platform: Domain teams can publish/consume without central team
  4. Federated governance: Global standards (schemas, quality, privacy) enforced locally

2026 tools: Datahub, OpenMetadata, Amundsen (data catalogs); dbt (transformations per domain); Snowflake/Databricks with workspace separation; OpenLineage (lineage federation).
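The "data as a product" principle is typically enforced with a data contract: the producing domain publishes a schema plus quality expectations, and consumers (or CI) validate batches against it. A minimal, library-free sketch — the contract format here is invented for illustration:

```python
contract = {  # hypothetical contract for the orders data product
    "columns": {"order_id": str, "amount": float, "status": str},
    "checks": {"amount": lambda v: v >= 0,
               "status": lambda v: v in {"placed", "shipped", "cancelled"}},
}

def validate_rows(rows, contract):
    """Return a list of violations; an empty list means the batch honors the contract."""
    violations = []
    for i, row in enumerate(rows):
        for col, typ in contract["columns"].items():
            if col not in row or not isinstance(row[col], typ):
                violations.append((i, col, "missing/wrong type"))
        for col, check in contract["checks"].items():
            if col in row and not check(row[col]):
                violations.append((i, col, "failed check"))
    return violations

good = {"order_id": "o1", "amount": 10.0, "status": "placed"}
bad = {"order_id": "o2", "amount": -5.0, "status": "unknown"}
validate_rows([good, bad], contract)
# → [(1, 'amount', 'failed check'), (1, 'status', 'failed check')]
```

In practice the contract lives in version control next to the producing pipeline, and a breaking change to it goes through review like any API change.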


HARD — Expert-Level Topics (Questions 26-50)

These questions are asked for senior data engineer and data platform architect roles (Rs 30-60+ LPA). If you can discuss data skew solutions, streaming exactly-once semantics, and lakehouse optimization at this level, you're in elite territory.

Q26. Design a real-time ML feature store.

                         [Batch Feature Pipeline (daily)]
                              ↓ (PySpark jobs)
Feature Sources:    [Offline Store (Delta Lake / Redshift)]
  - Clickstream                      ↑ training data
  - Transactions          [Feature Registry (metadata)]
  - User profiles                    ↓ sync recent features
  - Product catalog    [Online Store (Redis/Cassandra)]
                              ↑ inference (< 5ms)
                         [Real-time Pipeline (Flink/Kafka)]
                              ↑ (streaming features: last 1h activity)
from datetime import timedelta
from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Float32, Int64, String

user = Entity(name="user_id", join_keys=["user_id"])

# Define feature view (source path and timestamp field are illustrative)
user_features = FeatureView(
    name="user_features",
    entities=[user],
    schema=[
        Field(name="total_purchases_30d", dtype=Float32),
        Field(name="avg_order_value_30d", dtype=Float32),
        Field(name="days_since_last_purchase", dtype=Int64),
        Field(name="preferred_category", dtype=String),
    ],
    online=True,   # enable online store
    source=FileSource(path="s3://gold/user_features/",
                      timestamp_field="event_timestamp"),
    ttl=timedelta(days=7)
)

store = FeatureStore(repo_path="feature_repo/")

# Training: point-in-time correct feature retrieval
training_df = store.get_historical_features(
    entity_df=labeled_events,  # (user_id, event_timestamp, label)
    features=["user_features:total_purchases_30d",
              "user_features:avg_order_value_30d"]
).to_df()

# Inference: online serving
features = store.get_online_features(
    features=["user_features:total_purchases_30d"],
    entity_rows=[{"user_id": 12345}]
).to_dict()

Critical design decision: Point-in-time correctness for training — the feature value at training time must reflect what was available when the label was created, not today's value. This prevents data leakage.
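Point-in-time correctness boils down to an as-of join: for each label timestamp, take the latest feature value whose timestamp is at or before it. A plain-Python sketch of that rule, with toy data:

```python
from bisect import bisect_right

def point_in_time_value(feature_history, as_of):
    """feature_history: list of (timestamp, value) sorted ascending by timestamp.
    Returns the latest value with timestamp <= as_of, or None if none exists."""
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx else None

# total_purchases_30d snapshots for one user (toy timestamps)
history = [(1, 3.0), (5, 7.0), (9, 12.0)]
point_in_time_value(history, as_of=6)   # → 7.0 (the value live at label time)
point_in_time_value(history, as_of=0)   # → None (feature didn't exist yet — no leakage)
```

Using today's value (12.0) for a label created at time 6 would leak future information into training; the as-of lookup is what `get_historical_features` does for you at scale.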


Q27. What is the Iceberg table format? How does it handle schema evolution and partition evolution?

# Apache Iceberg — Netflix's table format, now a top-level Apache project

from pyiceberg.catalog import load_catalog
catalog = load_catalog("default", **{"uri": "http://rest-catalog:8181"})

# Schema evolution (add columns without rewriting data)
from pyiceberg.types import StringType

table = catalog.load_table("db.orders")
with table.update_schema() as update:
    update.add_column("shipping_method", StringType())  # add new column
    update.rename_column("amount", "gross_amount")      # rename column
    # Old Parquet files unchanged; schema mapping handles it

# Partition evolution (change partitioning strategy without rewriting)
from pyiceberg.transforms import DayTransform

with table.update_spec() as update:
    # switch from monthly to daily partitioning on order_date
    update.add_field("order_date", DayTransform(), "day")
    # Old partitions (monthly) still readable; new writes use daily partitions

# Hidden partitioning (users don't think about partitions)
# Users query: WHERE order_date BETWEEN '2026-01-01' AND '2026-03-31'
# Iceberg prunes partitions automatically — no need to know the layout

# Time travel — read a specific snapshot
table.scan(snapshot_id=3823904823).to_arrow()
# To read "as of" a wall-clock time, look up the matching snapshot id
# in table.history() first, then scan with that snapshot_id

Iceberg vs Delta Lake vs Hudi:

| Feature | Iceberg | Delta Lake | Hudi |
| --- | --- | --- | --- |
| Creator | Netflix | Databricks | Uber |
| Partition evolution | Yes | No | Limited |
| Schema evolution | Best | Good | Good |
| ACID transactions | Yes | Yes | Yes |
| Multi-engine support | Best (Spark, Flink, Trino, DuckDB) | Good | Good |
| Hidden partitioning | Yes | No | No |

Q28. How do you build a streaming data quality monitoring system?

from pyspark.sql.functions import col, count, isnan, isnull, stddev, mean, percentile_approx

def compute_quality_metrics(df, table_name, partition_date):
    """Compute comprehensive quality metrics for a batch"""
    metrics = {}
    total_rows = df.count()
    metrics["row_count"] = total_rows

    for col_name in df.columns:
        col_metrics = {}
        dtype = dict(df.dtypes)[col_name]

        # Null rate (isnan only applies to floating-point columns)
        null_cond = isnull(col(col_name))
        if dtype in ("double", "float"):
            null_cond = null_cond | isnan(col(col_name))
        null_count = df.filter(null_cond).count()
        col_metrics["null_rate"] = null_count / max(total_rows, 1)

        # Cardinality
        col_metrics["cardinality"] = df.select(col_name).distinct().count()

        if dtype in ("double", "float", "int", "bigint"):
            stats = df.select(
                mean(col_name).alias("mean"),
                stddev(col_name).alias("stddev"),
                percentile_approx(col_name, 0.5).alias("p50"),
                percentile_approx(col_name, 0.99).alias("p99")
            ).collect()[0]
            col_metrics.update(stats.asDict())

        metrics[col_name] = col_metrics

    # Write metrics to monitoring table (row_count recorded under a pseudo-column)
    metrics_df = spark.createDataFrame(
        [{"table": table_name, "date": partition_date,
          "column": "_table", "metric": "row_count", "value": float(total_rows)}] +
        [{"table": table_name, "date": partition_date,
          "column": c, "metric": k, "value": float(v)}
         for c, col_metrics in metrics.items() if isinstance(col_metrics, dict)
         for k, v in col_metrics.items() if isinstance(v, (int, float))]
    )
    metrics_df.write.format("delta").mode("append").save("s3://monitoring/dq-metrics/")

# Anomaly detection on metrics
def detect_anomalies(metric_name, table_name, window_days=30):
    historical = spark.read.format("delta").load("s3://monitoring/dq-metrics/") \
        .filter((col("table") == table_name) & (col("metric") == metric_name)) \
        .orderBy("date", ascending=False).limit(window_days)
    stats = historical.select(mean("value").alias("hist_mean"),
                               stddev("value").alias("hist_std")).collect()[0]
    latest = historical.orderBy("date", ascending=False).first()["value"]
    z_score = abs(latest - stats["hist_mean"]) / (stats["hist_std"] + 1e-10)
    if z_score > 3:
        # alert() is your paging/notification hook (Slack, PagerDuty, ...)
        alert(f"Anomaly detected: {table_name}.{metric_name} = {latest} "
              f"(z-score={z_score:.2f})")

Q29. What is exactly-once processing in Kafka? How do you implement it?

from confluent_kafka import Producer, Consumer

# Producer-side: Idempotent producer (enable.idempotence=true)
# + Transactional producer for multi-partition atomic writes
producer = Producer({
    'bootstrap.servers': 'kafka:9092',
    'enable.idempotence': True,          # prevents duplicate messages on retry
    'transactional.id': 'order-processor-1',  # unique per producer instance
    'acks': 'all',
    'retries': 2147483647
})

producer.init_transactions()
try:
    producer.begin_transaction()
    # Produce to multiple topics atomically
    # (processed_order / inventory_delta are your serialized payloads)
    producer.produce('orders-processed', key='order_123', value=processed_order)
    producer.produce('inventory-updates', key='product_456', value=inventory_delta)
    # Commit consumed offsets and produced messages atomically
    # (offsets and group metadata come from the consumer you read from)
    producer.send_offsets_to_transaction(offsets, consumer_group_metadata)
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()

# Consumer-side: read_committed isolation
consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'my-consumer-group',
    'isolation.level': 'read_committed',  # only see committed transactions
    'enable.auto.commit': False
})

Idempotent sink with database:

# Use INSERT ... ON CONFLICT DO NOTHING (upsert)
# Natural key or message_id as deduplication key
cursor.execute("""
    INSERT INTO processed_events (event_id, user_id, action, processed_at)
    VALUES (%s, %s, %s, NOW())
    ON CONFLICT (event_id) DO NOTHING
""", (event_id, user_id, action))
conn.commit()
consumer.commit()  # commit offset only after DB commit
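The idempotent-sink idea can be unit-tested without Kafka or Postgres: replay a stream containing duplicates through a sink keyed by event_id and confirm each event is applied exactly once (toy in-memory stand-ins for the topic and the table):

```python
def idempotent_sink(messages, store):
    """Apply each message at most once, keyed by event_id —
    the in-memory analogue of INSERT ... ON CONFLICT DO NOTHING."""
    applied = 0
    for msg in messages:
        if msg["event_id"] not in store:   # event_id is the dedup key
            store[msg["event_id"]] = msg
            applied += 1
    return applied

# At-least-once delivery: e1 is redelivered after a simulated retry
stream = [
    {"event_id": "e1", "action": "click"},
    {"event_id": "e2", "action": "buy"},
    {"event_id": "e1", "action": "click"},  # duplicate
]
store = {}
idempotent_sink(stream, store)  # → 2 (the duplicate is silently skipped)
```

This is why many teams prefer "at-least-once delivery + idempotent sink" over full Kafka transactions: the end-to-end result is effectively exactly-once with much less operational complexity.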

Q30. What is data skew in distributed processing and all strategies to address it?

# Diagnose skew: check partition sizes
from pyspark.sql.functions import spark_partition_id
df.groupBy(spark_partition_id().alias("partition")).count() \
  .orderBy("count", ascending=False).show()
# If one partition has 10M rows and others have 100K, you have skew

# Strategy 1: Broadcast join (if skewed side is small dimension table)
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_lookup), "key")

# Strategy 2: Salting (distribute skewed key)
from pyspark.sql.functions import concat, lit, floor, rand, col, explode, array

NUM_SALTS = 20

# On the large (skewed) side: randomly append salt
large_salted = large_df.withColumn("salt", (rand() * NUM_SALTS).cast("int")) \
                        .withColumn("salted_key", concat(col("skewed_key"),
                                                          lit("_"), col("salt")))

# On the small side: expand with all possible salts
other_expanded = other_df.withColumn("salt", explode(array([lit(i) for i in range(NUM_SALTS)]))) \
                          .withColumn("salted_key", concat(col("skewed_key"),
                                                            lit("_"), col("salt")))

result = large_salted.join(other_expanded, "salted_key").drop("salt", "salted_key")

# Strategy 3: Adaptive Query Execution (Spark 3.x) — automatic
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Strategy 4: Pre-aggregate (reduce volume before the shuffle)
from pyspark.sql.functions import sum as spark_sum
# Bad: shuffle 1B raw rows just to form 10M groups
# Good: partial-aggregate within each partition, then shuffle the partials
df.groupBy("user_id").agg(spark_sum("amount"))  # Spark does partial aggregation automatically

# Strategy 5: Custom partitioner (if you know the distribution)
# Consistent hash partitioning with more buckets for hot keys
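You can see why salting works with a toy distribution in plain Python, mimicking the PySpark logic above: one hot key hashes to a single partition and dominates it; appending a random salt spreads the same rows across roughly NUM_SALTS partitions.

```python
import random
from collections import Counter

random.seed(42)
NUM_SALTS = 20
# 10,000 rows for one hot key plus a long tail of 100 normal keys
keys = ["hot_key"] * 10_000 + [f"key_{i}" for i in range(100)]

def partition_of(key, num_partitions=20):
    """Hash-partition a key, like a shuffle would."""
    return hash(key) % num_partitions

before = Counter(partition_of(k) for k in keys)
after = Counter(partition_of(f"{k}_{random.randrange(NUM_SALTS)}") for k in keys)

max(before.values())  # one partition holds essentially all 10k hot rows
max(after.values())   # the hot key's rows are now spread across many partitions
```

The same trade-off as in Spark applies: the salted join needs the other side expanded NUM_SALTS times, so salt only the keys you have diagnosed as hot.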

Data Engineering FAQs — No-BS Answers

Q: What is the modern data stack in 2026? A: Here's the exact stack most top-tier data teams run: Fivetran/Airbyte (ingestion) → Snowflake/Databricks (storage + processing) → dbt (transformations) → Airflow/Dagster/Prefect (orchestration) → Monte Carlo/Great Expectations (data quality) → Datahub/Atlan (catalog) → Looker/Metabase (BI). Add Apache Kafka for streaming. Know this stack cold — interviewers expect you to have opinions about each layer.

Q: PySpark vs pandas in 2026? A: Pandas for < 1GB data (data science notebooks, feature engineering on small samples). PySpark for anything larger. Polars (Rust-based, 10-100x faster than pandas for single-node) is increasingly popular for medium-scale (1GB-100GB) data. pandas 2.0 with PyArrow backend improved performance but Polars is still faster.

Q: What is the difference between Airflow, Dagster, and Prefect? A: All three are orchestration tools. Airflow is the most mature and widely used, but its UI is dated and testing DAGs is hard. Dagster prioritizes software-engineering best practices (type checking, testing, asset-first design). Prefect has the best developer experience and cloud offering. In 2026, Dagster is gaining adoption in modern data teams; Airflow still dominates legacy shops.

Q: What is a data product? A: A dataset or service that is intentionally designed, maintained, and delivered with defined SLAs, documentation, quality guarantees, and a clear owner. A view in a dashboard is not a data product; a versioned, documented, quality-checked dataset with a contract is.

Q: What is Z-ordering in Delta Lake and when should you use it? A: Z-ordering (also called Z-order clustering) reorganizes data within Delta files so that related data is co-located. It creates a multi-dimensional locality — sorting on two columns simultaneously (unlike ORDER BY which sorts one dimension). Use Z-ordering when queries filter on multiple columns. delta_table.optimize().executeZOrderBy("user_id", "event_date").

Q: What is the small files problem and how do you solve it? A: Streaming or frequent small batch writes create millions of tiny Parquet/ORC files. Each file requires metadata I/O → query planning is slow. Solutions: Delta Lake OPTIMIZE (auto-compaction), coalesce() before write, adjust spark.sql.shuffle.partitions, enable auto-optimization in Databricks.

Q: How do you handle GDPR right-to-erasure in a data lake? A: Pure lakes with immutable Parquet files make deletion hard. Solutions: (1) Delta Lake delete + vacuum (removes file versions after retention), (2) Iceberg row-level deletes, (3) Encryption + key deletion (delete the key → data is unreadable), (4) Pseudonymization (replace PII with reversible token; delete mapping table for erasure).
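Option (4), pseudonymization, is easy to prototype: PII is replaced with a random token before landing in the lake, and the token→PII mapping lives in a small mutable store; deleting the mapping row fulfils erasure without rewriting any lake files. A stdlib-only sketch (the mapping dict stands in for a real mutable store such as Postgres):

```python
import secrets

mapping = {}  # token -> PII; kept in a mutable store, NOT in the lake

def pseudonymize(email):
    """Swap PII for a random token before the row is written to the lake."""
    token = secrets.token_hex(8)
    mapping[token] = email
    return token

def erase(token):
    """Right-to-erasure: drop the mapping row; lake files stay untouched."""
    mapping.pop(token, None)

token = pseudonymize("alice@example.com")
lake_row = {"user_token": token, "amount": 42.0}  # what the immutable lake stores

mapping[token]        # → 'alice@example.com' (re-identifiable while mapping exists)
erase(token)
mapping.get(token)    # → None: lake_row is now anonymous, with no file rewrite
```

Crypto-shredding follows the same shape, with a per-user encryption key in place of the token mapping: deleting the key renders the encrypted lake data unreadable.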

Q: What are the most important skills for a data engineer to learn in 2026? A: In priority order: SQL (still the single most important skill — don't skip window functions and CTEs), Python, PySpark for large-scale processing, dbt for transformations, Kafka for streaming, Delta Lake/Iceberg for lakehouse, Airflow/Dagster for orchestration, and Great Expectations for data quality. The 2026 bonus skills that separate good from great: vector databases for AI pipelines, feature stores for MLOps, and streaming ML with Flink.

