PapersAdda

Data Engineering Interview Questions 2026 — Top 50 Questions with Answers

Last Updated: 30 Mar 2026

Data engineering roles saw a 47% increase in job postings in 2025, and the trend is accelerating. Databricks, Snowflake, Meta, Amazon, and every other data-driven business are hiring aggressively, because every AI model is only as good as its data pipeline. The modern data engineer must master streaming pipelines, lakehouse architectures, data quality frameworks, and increasingly, ML feature engineering. This guide compiles 50 real questions from interviews at Databricks, Snowflake, Meta, Amazon, Airbnb, and Uber, with battle-tested answers that get offers.

The data engineering interview isn't about memorizing Spark APIs. It's about proving you can build pipelines that don't break at 3 AM. This guide teaches you both.

Related articles: AI/ML Interview Questions 2026 | System Design Interview Questions 2026 | Generative AI Interview Questions 2026 | AWS Interview Questions 2026


Which Companies Ask These Questions?

| Topic Cluster | Companies |
|---|---|
| ETL Pipelines & Orchestration | Airbnb, Meta, Amazon, Stripe, all data-heavy companies |
| Apache Spark & PySpark | Databricks, LinkedIn, Apple, all Hadoop/Spark shops |
| Apache Kafka & Streaming | LinkedIn, Confluent, Uber, DoorDash |
| Data Modeling & Warehousing | Snowflake, Google BigQuery teams, Redshift shops |
| Data Quality & Testing | Lyft, Airbnb, data platform teams |
| Lakehouse Architecture | Databricks, Dremio, AWS, Azure |
| Real-time vs Batch | All data teams with streaming requirements |

EASY — Core Fundamentals (Questions 1-15)

Even senior data engineers get tripped up on fundamentals. Interviewers use these to gauge how deeply you understand the "why" behind your daily tools.

Q1. What is ETL and ELT? When do you use each?

| Aspect | ETL (Extract-Transform-Load) | ELT (Extract-Load-Transform) |
|---|---|---|
| Transform location | Before loading (external compute) | Inside the target warehouse |
| Era | Traditional data warehouses (Teradata) | Cloud DWH (Snowflake, BigQuery, Redshift) |
| Flexibility | Less: structure defined upfront | More: raw data stored, transform later |
| Latency | Higher (transform can be slow) | Lower for loading; transforms run later |
| Raw data preserved? | Sometimes not | Always (raw layer) |

When ETL: Legacy systems, strict data governance, PII masking must happen before loading, target system has limited compute.

When ELT: Cloud data warehouse with elastic compute (Snowflake, BigQuery), data lake architectures, explorative analytics where transformation requirements are evolving.

Modern reality (2026): Most teams use ELT with a tool like dbt (data build tool) for SQL-based transformations in the warehouse, plus Python (PySpark/pandas) for complex non-SQL transformations.
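The ELT shape is easy to see in miniature: land the raw records first, then transform with SQL inside the warehouse. A toy sketch, with sqlite3 standing in for a cloud warehouse (table and column names are made up for illustration):

```python
import json
import sqlite3

# Raw records as they might arrive from an API or event stream
raw_events = [
    {"order_id": 1, "amount": "120.50", "country": "IN"},
    {"order_id": 2, "amount": "80.00", "country": "US"},
]

conn = sqlite3.connect(":memory:")

# L: load the raw payloads as-is, no upfront structure
conn.execute("CREATE TABLE raw_orders (payload TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?)",
                 [(json.dumps(e),) for e in raw_events])

# T: transform in SQL inside the "warehouse" (the T happens after the L)
conn.execute("""
    CREATE TABLE stg_orders AS
    SELECT json_extract(payload, '$.order_id') AS order_id,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount,
           json_extract(payload, '$.country') AS country
    FROM raw_orders
""")
total = conn.execute("SELECT SUM(amount) FROM stg_orders").fetchone()[0]
print(total)  # 200.5
```

The raw layer stays queryable forever, so when transformation logic changes you rebuild `stg_orders` from `raw_orders` instead of re-extracting from the source.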


Q2. What is Apache Spark and why is it faster than MapReduce?

| Aspect | MapReduce (Hadoop) | Apache Spark |
|---|---|---|
| Data persistence | Write to HDFS after each step | In-memory RDDs (lazy evaluation) |
| Speed | Baseline (disk-bound between steps) | In-memory: up to 100x faster; on disk: ~10x faster |
| API | Low-level (map, reduce) | High-level (DataFrame, SQL, MLlib, Streaming) |
| Iterative algorithms | Very slow (e.g. ML training) | Efficient (cache intermediate data) |
| Language support | Java natively (others via Hadoop Streaming) | Scala, Python (PySpark), R, SQL |

Spark's execution model:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, window

spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()

# Transformations are LAZY — no computation until action is called
df = (spark.read.parquet("s3://data-lake/sales/")
          .filter(col("year") == 2026)
          .groupBy("product_category")
          .agg(sum("revenue").alias("total_revenue"),
               avg("order_value").alias("avg_order_value"))
          .orderBy("total_revenue", ascending=False))

# Action triggers execution
df.show(20)     # triggers DAG execution
df.write.parquet("s3://data-warehouse/sales_summary/2026/")

DAG execution: Spark builds a Directed Acyclic Graph of transformations, then optimizes and executes it. Catalyst optimizer rewrites the query plan for efficiency. Tungsten execution engine uses off-heap memory and SIMD vectorization.


Q3. What is the difference between narrow and wide transformations in Spark?

| Type | Definition | Examples | Shuffle? |
|---|---|---|---|
| Narrow | Each input partition feeds at most one output partition | map, filter, union | No |
| Wide | One input partition feeds multiple output partitions | groupBy, join, distinct, repartition | Yes |

Why shuffles are expensive: Wide transformations require shuffling data across the network — all rows with the same key must go to the same partition. Shuffle involves disk I/O, network I/O, and serialization.

# Wide transformation — causes shuffle
df.groupBy("customer_id").count()
# Spark: sort/hash all rows by customer_id → network transfer → reduce

# Narrow transformation — no shuffle
df.filter(col("amount") > 100).select("customer_id", "amount")
# Each partition processed independently

# Optimization: use filter BEFORE join to reduce shuffle data
# Bad:
big_df.join(other_df, "user_id").filter(col("country") == "IN")
# Good:
big_df.filter(col("country") == "IN").join(other_df, "user_id")

Q4. What is a Spark partition? How do you optimize partitioning?

Rules of thumb:

  • Target: 2-3 partitions per CPU core available
  • Ideal partition size: 100MB–300MB
  • Avoid too many small partitions (task overhead) and too few large partitions (OOM, skew)
# Check partition count and sizes
df = spark.read.parquet("s3://data/")
print(f"Partitions: {df.rdd.getNumPartitions()}")

# Repartition by column (even distribution via hash)
# Use when: preparing for shuffle-heavy operation, joining
df_repartitioned = df.repartition(200, "user_id")

# Coalesce (reduce partitions — no full shuffle)
# Use when: reducing partitions before write
df_coalesced = df.coalesce(50)

# Repartition by range (sorted partitions — good for range queries)
df_range = df.repartitionByRange(200, "date")

# Handle data skew: salting technique
from pyspark.sql.functions import col, concat, lit, floor, rand
# Add random salt to key to distribute skewed keys
df_salted = df.withColumn("salted_key",
    concat(col("user_id"), lit("_"), (floor(rand() * 10)).cast("string")))

Q5. What is Apache Kafka? Explain its core concepts.

Kafka Core Concepts:
├── Topic: Named stream of messages (e.g., "user-clicks", "order-events")
├── Partition: Ordered, immutable sequence of messages within a topic
│   └── Offset: Sequential ID of each message in a partition
├── Producer: Writes messages to topics
├── Consumer: Reads messages from topics
├── Consumer Group: Group of consumers sharing partition assignment
│   └── Each partition consumed by exactly one consumer in a group
├── Broker: Kafka server storing partition data
├── Cluster: Multiple brokers
├── ZooKeeper/KRaft: Cluster metadata management
└── Replication: Each partition replicated across N brokers (leader + followers)
from confluent_kafka import Producer, Consumer
import json

# Producer
producer = Producer({'bootstrap.servers': 'kafka:9092'})
producer.produce('user-clicks',
                  key=str(user_id).encode(),
                  value=json.dumps(click_event).encode(),
                  callback=delivery_callback)
producer.flush()

# Consumer
consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'analytics-consumers',
    'auto.offset.reset': 'earliest',    # or 'latest'
    'enable.auto.commit': False         # manual commit => at-least-once
})
consumer.subscribe(['user-clicks'])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None: continue
    if msg.error(): handle_error(msg.error())
    process_event(json.loads(msg.value()))
    consumer.commit()  # manual commit after processing

Delivery semantics:

| Semantic | How | Risk |
|---|---|---|
| At most once | Commit before processing | Message loss on crash |
| At least once | Commit after processing | Duplicate processing |
| Exactly once | Kafka transactions + idempotent producers | Complex, slight overhead |
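The first two semantics can be demonstrated without a broker: the only difference is whether the offset commit happens before or after processing, and what a crash between those two steps costs you. A plain-Python simulation (no Kafka involved, names are illustrative):

```python
def simulate(messages, commit_first, crash_between_steps_at):
    """Run until a crash between the commit and process steps, then restart once."""
    processed, committed = [], 0
    for i, msg in enumerate(messages):
        if commit_first:                       # at-most-once ordering
            committed = i + 1                  # 1) commit offset
            if i == crash_between_steps_at:
                break                          # crash before processing: message lost
            processed.append(msg)              # 2) process
        else:                                  # at-least-once ordering
            processed.append(msg)              # 1) process
            if i == crash_between_steps_at:
                break                          # crash before commit: message reprocessed
            committed = i + 1                  # 2) commit offset
    for msg in messages[committed:]:           # restart from last committed offset
        processed.append(msg)
    return processed

msgs = ["m0", "m1", "m2"]
print(simulate(msgs, commit_first=False, crash_between_steps_at=1))
# ['m0', 'm1', 'm1', 'm2']  -- at-least-once: m1 processed twice
print(simulate(msgs, commit_first=True, crash_between_steps_at=1))
# ['m0', 'm2']              -- at-most-once: m1 lost
```

Exactly-once needs more machinery (idempotent producers plus transactions spanning the write and the offset commit), which is why it carries the overhead noted in the table.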

Q6. What is Apache Airflow? How does it orchestrate pipelines?

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-engineering',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['[email protected]']
}

with DAG(
    dag_id='daily_sales_pipeline',
    default_args=default_args,
    schedule_interval='0 6 * * *',   # 6 AM daily (cron)
    start_date=datetime(2026, 1, 1),
    catchup=False,                    # don't run past dates on deploy
    tags=['sales', 'production']
) as dag:

    extract_task = PythonOperator(
        task_id='extract_from_postgres',
        python_callable=extract_sales_data,
        op_kwargs={'date': '{{ ds }}'}  # Airflow templating: execution date
    )

    transform_task = GlueJobOperator(
        task_id='spark_transform',
        job_name='sales-transform-job',
        script_args={'--date': '{{ ds }}', '--env': 'production'}
    )

    load_task = PythonOperator(
        task_id='load_to_snowflake',
        python_callable=load_to_warehouse
    )

    validate_task = PythonOperator(
        task_id='run_data_quality_checks',
        python_callable=run_great_expectations
    )

    notify_task = BashOperator(
        task_id='notify_stakeholders',
        bash_command='python notify.py --date {{ ds }} --status success'
    )

    # Define dependencies
    extract_task >> transform_task >> load_task >> validate_task >> notify_task

Airflow key concepts:

  • DAG: Python class defining workflow structure
  • Operator: Unit of work (PythonOperator, BashOperator, SparkSubmitOperator, etc.)
  • Task: Instance of an operator
  • XCom: Cross-communication between tasks (pass small data)
  • Scheduler: Scans DAGs, triggers task instances based on schedule
  • Executor: LocalExecutor, CeleryExecutor (distributed), KubernetesExecutor (cloud-native)
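Under the hood, the scheduler is doing a dependency-ordered traversal of the task graph. A toy version of that resolution for the DAG above, using only the standard library (task names reused purely for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# The `>>` chains in the DAG boil down to: task -> set of upstream tasks
deps = {
    "spark_transform": {"extract_from_postgres"},
    "load_to_snowflake": {"spark_transform"},
    "run_data_quality_checks": {"load_to_snowflake"},
    "notify_stakeholders": {"run_data_quality_checks"},
}

# static_order() yields tasks only after all their upstreams, exactly
# the property the scheduler enforces when triggering task instances
order = list(TopologicalSorter(deps).static_order())
print(order)
# ['extract_from_postgres', 'spark_transform', 'load_to_snowflake',
#  'run_data_quality_checks', 'notify_stakeholders']
```

Airflow adds retries, schedules, and distributed executors on top, but the core contract is this topological ordering.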

Q7. What is a data warehouse? How is it different from a data lake?

| Aspect | Data Warehouse | Data Lake |
|---|---|---|
| Data type | Structured only | Structured + semi-structured + unstructured |
| Schema | Schema-on-write (defined before loading) | Schema-on-read (defined at query time) |
| Purpose | BI, reporting, dashboards | ML, data science, exploratory analytics |
| Examples | Snowflake, Redshift, BigQuery | S3, ADLS, HDFS |
| Query performance | Fast (pre-defined schema, optimized storage) | Slower (must parse raw files) |
| Data quality | High (enforced schema) | Varies (raw data may be messy) |

Data Lakehouse (2026 dominant pattern): Combines lake (cheap storage, raw data) with warehouse (ACID transactions, query performance, schema enforcement) using open table formats:

| Format | Creator | Key Feature |
|---|---|---|
| Delta Lake | Databricks | ACID transactions, time travel |
| Apache Iceberg | Netflix | Partition evolution, hidden partitioning |
| Apache Hudi | Uber | Upserts (record-level updates) |

Q8. What is dbt (data build tool) and why is it standard in 2026?

-- models/mart/fct_sales.sql
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    cluster_by=['order_date', 'customer_id']
  )
}}

WITH source_orders AS (
    SELECT * FROM {{ ref('stg_orders') }}       -- reference to staging model
    {% if is_incremental() %}
    WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
    {% endif %}
),
customer_segments AS (
    SELECT * FROM {{ ref('dim_customers') }}    -- reference to dimension
),
final AS (
    SELECT
        o.order_id,
        o.customer_id,
        o.created_at::DATE AS order_date,
        o.gross_amount,
        o.discount_amount,
        o.gross_amount - o.discount_amount AS net_amount,
        c.segment,
        c.country
    FROM source_orders o
    LEFT JOIN customer_segments c USING (customer_id)
)
SELECT * FROM final
# models/mart/fct_sales.yml — documentation and tests
version: 2
models:
  - name: fct_sales
    description: "Fact table for all completed orders"
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: net_amount
        tests: [not_null, {dbt_expectations.expect_column_values_to_be_between: {min_value: 0}}]
      - name: order_date
        tests: [not_null]

dbt advantages: Version control for data transformations. Test every model. Automatic documentation. Lineage graphs. CI/CD integration. The standard for analytics engineering in 2026.


Q9. What is the star schema? How does it differ from snowflake schema?

-- Fact table: records events/transactions (large, narrow columns)
CREATE TABLE fact_sales (
    sale_id        BIGINT PRIMARY KEY,
    date_key       INT REFERENCES dim_date(date_key),
    product_key    INT REFERENCES dim_product(product_key),
    customer_key   INT REFERENCES dim_customer(customer_key),
    store_key      INT REFERENCES dim_store(store_key),
    quantity       INT,
    unit_price     DECIMAL(10,2),
    discount       DECIMAL(5,2),
    net_revenue    DECIMAL(12,2)
);

-- Dimension table: descriptive attributes (small, wide columns)
CREATE TABLE dim_product (
    product_key     INT PRIMARY KEY,
    product_id      VARCHAR(20),
    product_name    VARCHAR(200),
    category        VARCHAR(100),
    subcategory     VARCHAR(100),
    brand           VARCHAR(100),
    unit_cost       DECIMAL(10,2)
);

-- Date dimension: pre-computed calendar attributes
CREATE TABLE dim_date (
    date_key        INT PRIMARY KEY,   -- YYYYMMDD
    date            DATE,
    year            INT,
    quarter         INT,
    month           INT,
    month_name      VARCHAR(10),
    week_of_year    INT,
    day_of_week     INT,
    is_weekend      BOOLEAN,
    is_holiday      BOOLEAN
);

Star vs Snowflake:

| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Dimension normalization | Denormalized (flat) | Normalized (hierarchies split into tables) |
| Query performance | Faster (fewer joins) | Slower (more joins) |
| Storage | More (redundant data) | Less |
| Maintenance | Simpler | More complex |
| Use case | OLAP/BI workloads | When dimension tables are very large |
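The payoff of a star schema is that every analytical question is the fact table plus one join per dimension. A cut-down runnable sketch (sqlite3 standing in for the warehouse; rows are made up):

```python
import sqlite3

# Miniature star: one fact table, one dimension (mirrors the DDL above)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INT PRIMARY KEY, brand TEXT);
    CREATE TABLE fact_sales (sale_id INT, product_key INT, net_revenue REAL);
    INSERT INTO dim_product VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO fact_sales VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 70.0);
""")

# One hop from fact to dimension, then aggregate
rows = conn.execute("""
    SELECT p.brand, SUM(f.net_revenue) AS revenue
    FROM fact_sales f
    JOIN dim_product p USING (product_key)
    GROUP BY p.brand
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('Acme', 150.0), ('Globex', 70.0)]
```

In a snowflake schema, `dim_product` would itself be split (brand, category as separate tables), adding joins to the same query.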

Q10. What are Slowly Changing Dimensions (SCDs)? Explain Types 1, 2, and 3.

SCD Type 1 — Overwrite:

-- Simply overwrite old value — no history kept
UPDATE dim_customer SET city = 'Bangalore', state = 'Karnataka'
WHERE customer_id = 12345;
-- Use when: corrections to data errors, history not needed

SCD Type 2 — Add New Row (most common):

CREATE TABLE dim_customer (
    customer_key    SERIAL PRIMARY KEY,    -- surrogate key
    customer_id     INT,                   -- natural/business key
    customer_name   VARCHAR(200),
    city            VARCHAR(100),
    state           VARCHAR(50),
    segment         VARCHAR(50),
    start_date      DATE NOT NULL,
    end_date        DATE,                  -- NULL = current record
    is_current      BOOLEAN DEFAULT TRUE
);

-- When customer moves city:
-- 1. Expire old record
UPDATE dim_customer SET end_date = CURRENT_DATE - 1, is_current = FALSE
WHERE customer_id = 12345 AND is_current = TRUE;
-- 2. Insert new record
INSERT INTO dim_customer (customer_id, customer_name, city, state, start_date, is_current)
VALUES (12345, 'Rahul Sharma', 'Bangalore', 'Karnataka', CURRENT_DATE, TRUE);

SCD Type 3 — Add Column:

-- Store both current and previous value
ALTER TABLE dim_customer
ADD COLUMN previous_city VARCHAR(100),
ADD COLUMN city_change_date DATE;

UPDATE dim_customer
SET previous_city = city, city = 'Bangalore', city_change_date = CURRENT_DATE
WHERE customer_id = 12345;
-- Use when: only last change matters, and need to compare current vs previous

SCD Type 4 — History table (separate current and history tables). SCD Type 6 — Hybrid of 1+2+3. In 2026, Type 2 is most common with Delta Lake's time travel often used as an alternative for simpler history tracking.


Q11. What is data partitioning in data warehouses and lakes?

# Partitioning in PySpark (for data lake)
df.write \
  .partitionBy("year", "month", "day") \
  .mode("append") \
  .parquet("s3://data-lake/events/")

# Resulting directory structure:
# s3://data-lake/events/year=2026/month=03/day=30/part-0001.parquet

# Query with partition pruning (reads only needed partitions)
df = spark.read.parquet("s3://data-lake/events/") \
    .filter(col("year") == 2026).filter(col("month") == 3)
# Spark reads ONLY s3://data-lake/events/year=2026/month=03/*

-- Clustering in Snowflake (the warehouse analogue of partitioning)
ALTER TABLE events CLUSTER BY (TO_DATE(event_timestamp), event_type);
-- Snowflake automatically sorts data within micro-partitions by cluster key,
-- enabling micro-partition pruning for range scans

Partition design principles:

  • Partition by columns commonly used in WHERE clauses
  • Avoid high-cardinality columns as partition keys (too many tiny partitions)
  • Date partitioning is almost always useful
  • Partitioning by a column with only a few distinct values is rarely worth the overhead

Q12. What is the difference between OLTP and OLAP systems?

| Aspect | OLTP | OLAP |
|---|---|---|
| Purpose | Transaction processing | Analytics and reporting |
| Query type | Short reads/writes on few rows | Long reads across many rows |
| Optimization | Row-based storage, indexes | Columnar storage, partitioning |
| Normalization | 3NF (normalized) | Denormalized (star/snowflake) |
| Concurrency | Thousands of concurrent users | Fewer, complex queries |
| Examples | PostgreSQL, MySQL | Snowflake, BigQuery, Redshift |
| Data currency | Real-time | Batch-updated |
| Transaction support | ACID | Limited (ACID in modern lakehouses) |
-- OLAP query pattern: aggregate across millions of rows
-- Good for columnar storage: only reads the relevant columns
SELECT
    product_category,
    DATE_TRUNC('month', order_date) AS month,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM fact_orders
WHERE order_date BETWEEN '2025-01-01' AND '2026-12-31'
GROUP BY 1, 2
ORDER BY 3 DESC;

Q13. What are the key Spark performance tuning techniques?

# 1. Broadcast joins (for small tables < ~10MB)
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
# Avoids shuffle — broadcasts small_df to all executors

# 2. Persist/cache intermediate results
df_filtered = df.filter(col("status") == "active").cache()
df_filtered.count()       # Action to trigger caching
# ... multiple uses of df_filtered ...
df_filtered.unpersist()   # Free memory when done

# 3. Avoid Python UDFs when possible (they bypass JVM-native, vectorized execution)
# Bad: Python UDF (slow — serializes row by row to Python)
from pyspark.sql.functions import udf
@udf("double")
def calculate_tax(amount): return amount * 0.18  # Slow!

# Good: use built-in Spark functions (JVM-native, vectorized)
from pyspark.sql.functions import col
df = df.withColumn("tax", col("amount") * 0.18)  # Fast!

# 4. Predicate pushdown (push filters into file reading)
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

# 5. Column pruning (select only needed columns early)
df = df.select("user_id", "event_type", "timestamp")  # Before joins

# 6. Adaptive Query Execution (AQE) — enabled by default in Spark 3.x
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# AQE dynamically adjusts partition count after shuffle based on actual data size

# 7. Salting for skewed joins
from pyspark.sql.functions import col, concat, lit, rand
df_salted = skewed_df.withColumn("salt", (rand() * 10).cast("int")) \
                      .withColumn("skewed_key_salted",
                                  concat(col("skewed_key"), lit("_"), col("salt")))
other_salted = other_df.crossJoin(spark.range(10).toDF("salt")) \
                        .withColumn("skewed_key_salted",
                                    concat(col("skewed_key"), lit("_"), col("salt")))
result = df_salted.join(other_salted, "skewed_key_salted")

Q14. What is the Lambda architecture and how does Kappa architecture improve it?

Lambda Architecture:

Data Source → [Batch Layer] → Batch views
           ↘ [Speed Layer] → Real-time views
                                    ↓
              [Serving Layer] merges both → Query results

Problem: Maintaining two separate codebases (batch + streaming) for the same logic. Code divergence, debugging complexity.

Kappa Architecture:

Data Source → [Streaming Layer only (Kafka + Flink/Spark Streaming)]
                              ↓
                    [Serving Layer] → All queries served from streaming output

Replay for "batch" correction: Replay Kafka topic from offset 0 through a faster streaming job.

In 2026 — predominant patterns:

  1. Lakehouse + streaming (Databricks/Delta Lake): Structured Streaming writes to Delta tables. Batch jobs read from same tables. No architecture split.
  2. Kappa with Flink: Flink handles both streaming (milliseconds) and batch (bounded streams) uniformly.
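The "replay" idea behind Kappa fits in a few lines: if every derived view is just a fold over the append-only log, fixing logic means re-running the fold from offset 0 rather than maintaining a second batch codebase. A plain-Python miniature (the list stands in for a Kafka topic):

```python
# Append-only event log: (user, amount) pairs, the single source of truth
log = [("u1", 5), ("u2", 3), ("u1", 2)]

def build_view(events, value_fn):
    """Fold the log into a per-user view; value_fn is the 'business logic'."""
    view = {}
    for user, amount in events:
        view[user] = view.get(user, 0) + value_fn(amount)
    return view

v1 = build_view(log, lambda a: a)        # original streaming job's logic
print(v1)                                # {'u1': 7, 'u2': 3}

# Bug fix or logic change: replay the same log from offset 0 with new logic
v2 = build_view(log, lambda a: a * 2)
print(v2)                                # {'u1': 14, 'u2': 6}
```

One codebase, one log; "batch correction" is just the same fold run again, which is exactly what replaying a Kafka topic through the streaming job achieves.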

Q15. What is a medallion architecture?

Bronze (Raw) → Silver (Cleaned) → Gold (Business-ready)

Bronze:
  - Raw data as-is from source systems
  - Immutable, append-only
  - JSON/CSV/Parquet from Kafka, databases, APIs
  - Retention: indefinite
  - s3://data-lake/bronze/orders/year=2026/

Silver:
  - Deduplicated, cleaned, validated
  - Standardized types and formats
  - Parsed nested JSON
  - Some joins (order + order_items unified)
  - Retention: 3-5 years

Gold:
  - Business-level aggregates
  - Optimized for specific use cases (BI, ML features, operational)
  - Star schema or wide tables
  - Frequently refreshed
  - s3://data-lake/gold/daily_sales_summary/
# Bronze to Silver example with Delta Lake
from delta.tables import DeltaTable
from pyspark.sql.functions import col, to_timestamp, current_timestamp

# Read from Bronze
bronze_df = spark.readStream \
    .format("delta") \
    .load("s3://data-lake/bronze/orders/")

# Clean and standardize
silver_df = bronze_df \
    .dropDuplicates(["order_id"]) \
    .filter(col("amount").isNotNull() & (col("amount") > 0)) \
    .withColumn("created_at", to_timestamp(col("created_at_str"), "yyyy-MM-dd HH:mm:ss")) \
    .withColumn("_ingested_at", current_timestamp()) \
    .drop("created_at_str", "_raw_json")

# Write to Silver (streaming append)
silver_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://checkpoints/bronze-to-silver/orders/") \
    .start("s3://data-lake/silver/orders/")

MEDIUM — Advanced Topics (Questions 16-35)

Databricks, Snowflake, and Airbnb data engineering interviews go deep on these topics. This is where you prove you've built real pipelines, not just followed tutorials.

Q16. How does Spark Structured Streaming work? What are output modes?

from pyspark.sql.functions import window, col, sum, from_json
# Note: `schema` used below is the event payload StructType, assumed defined elsewhere

# Read from Kafka as a stream
events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "user-clicks") \
    .option("startingOffsets", "latest") \
    .load() \
    .selectExpr("CAST(value AS STRING) as json_value",
                "timestamp as event_time") \
    .select(from_json(col("json_value"), schema).alias("data"), "event_time") \
    .select("data.*", "event_time")

# Windowed aggregation (tumbling window, 5-minute intervals)
# Watermark: accept data up to 10 minutes late, then finalize windows
hourly_counts = events \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(
        window(col("event_time"), "5 minutes"),    # tumbling window
        col("product_category")
    ) \
    .agg(sum("click_count").alias("total_clicks"))

# Output modes:
# append: only new rows (not suitable for aggregations without watermark)
# update: only changed rows since last trigger (most efficient for aggregations)
# complete: entire result table (only for small result sets)
hourly_counts.writeStream \
    .format("delta") \
    .outputMode("update") \
    .option("checkpointLocation", "s3://checkpoints/hourly-clicks/") \
    .trigger(processingTime="1 minute") \
    .start("s3://gold/hourly-clicks/")

Watermarks: Tell Spark how late data can be and still be counted. Without watermarks, Spark must keep all state forever. With watermark of 10 min, Spark can clean up state for windows older than max_event_time - 10 min.
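The state-cleanup rule can be simulated in a few lines: the watermark trails the maximum event time seen by the configured delay, and anything older is dropped rather than buffered. A deliberately simplified sketch (real Spark compares window end times, not individual events; numbers are illustrative):

```python
def run(events, delay):
    """events: event-time stamps in arrival order; watermark = max seen - delay."""
    max_seen, kept, dropped = 0, [], []
    for t in events:
        max_seen = max(max_seen, t)            # watermark only ever advances
        watermark = max_seen - delay
        if t >= watermark:
            kept.append(t)                     # within allowed lateness
        else:
            dropped.append(t)                  # state for this region already cleaned up
    return kept, dropped

# t=15 arrives after t=30 pushed the watermark to 20, so it is dropped
print(run([10, 12, 11, 30, 15], delay=10))  # ([10, 12, 11, 30], [15])
```

This is the trade-off the delay controls: a larger delay tolerates later data but forces Spark to hold window state longer.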


Q17. What is a streaming join? What challenges arise?

from pyspark.sql.functions import expr  # needed for the interval join condition

# Stream-stream join (both sides are unbounded)
orders_stream = spark.readStream.format("kafka").option("subscribe", "orders").load()
payments_stream = spark.readStream.format("kafka").option("subscribe", "payments").load()

# Join with watermarks — required for stream-stream joins
orders_w = orders_stream.withWatermark("order_time", "1 hour")
payments_w = payments_stream.withWatermark("payment_time", "1 hour")

result = orders_w.join(
    payments_w,
    expr("""order_id = payment_order_id AND
            payment_time BETWEEN order_time AND order_time + INTERVAL 1 HOUR""")
)
# Without watermarks: state grows unboundedly (OOM)
# With watermarks: Spark discards old unmatched events after watermark passes

# Stream-static join (no state issues — static side fully loaded)
products = spark.read.format("delta").load("s3://gold/dim_products/")  # static
enriched = orders_stream.join(products, "product_id")   # no state issues

Challenges:

  1. State management: Unmatched events must be buffered until they're matched or watermark expires
  2. Out-of-order events: Watermarks handle this with a delay threshold
  3. Outer joins with streams: Only supported with watermarks in Spark
  4. Memory: State store (RocksDB in Spark 3.x) can grow large for slow streams

Q18. How do you implement data quality checks in a pipeline?

# Great Expectations — industry standard for data quality (2026)
import great_expectations as gx
from great_expectations.checkpoint import SimpleCheckpoint

context = gx.get_context()

# Define expectations (business rules as code)
suite = context.create_expectation_suite("orders_suite", overwrite_existing=True)
validator = context.get_validator(batch_request=batch_request, expectation_suite=suite)

# Column-level expectations
validator.expect_column_to_exist("order_id")
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("amount", min_value=0, max_value=1000000)
validator.expect_column_values_to_be_in_set("status",
    {"pending", "confirmed", "shipped", "delivered", "cancelled"})

# Table-level expectations
validator.expect_table_row_count_to_be_between(min_value=100, max_value=1000000)
validator.expect_column_pair_values_a_to_be_greater_than_b("delivery_date", "order_date")

# Regex-based expectation
validator.expect_column_values_to_match_regex("phone", r"^\+\d{10,15}$")

# Statistical expectations (detect data drift)
validator.expect_column_mean_to_be_between("amount", min_value=500, max_value=5000)
validator.expect_column_stdev_to_be_between("amount", min_value=100, max_value=10000)

validator.save_expectation_suite()

# Run in pipeline
checkpoint = SimpleCheckpoint(name="orders_checkpoint", data_context=context)
result = checkpoint.run()
if not result.success:
    alert_on_call("Data quality check failed!")
    raise Exception("Halting pipeline: data quality failure")

Beyond GE — dbt tests (SQL-native):

# dbt_project/models/staging/stg_orders.yml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: amount
        tests:
          - dbt_utils.not_constant
          - dbt_utils.at_least_one

Q19. What is Delta Lake and what problems does it solve?

from delta.tables import DeltaTable
from pyspark.sql.functions import col

# Problem 1: ACID transactions on data lakes
# Without Delta: multiple writers corrupt files; reads during write return partial data
# With Delta: transaction log (JSON/_delta_log/) ensures ACID

# Problem 2: Schema enforcement & evolution
# `mergeSchema` allows schema evolution on append
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("s3://data-lake/orders/")

# Problem 3: Upserts (MERGE / CDC)
delta_table = DeltaTable.forPath(spark, "s3://data-lake/orders/")
delta_table.alias("target").merge(
    updates_df.alias("source"),
    "target.order_id = source.order_id"
).whenMatchedUpdate(set={
    "status": "source.status",
    "updated_at": "source.updated_at"
}).whenNotMatchedInsertAll().execute()

# Problem 4: Time travel (query historical versions)
df_yesterday = spark.read.format("delta") \
    .option("timestampAsOf", "2026-03-29") \
    .load("s3://data-lake/orders/")

df_version_5 = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("s3://data-lake/orders/")

# Problem 5: Audit trail
delta_table.history().show()  # see all versions and operations

# Problem 6: Data compaction (small files problem)
delta_table.optimize().executeCompaction()  # merge small files
delta_table.vacuum(retentionHours=168)      # delete old versions (7 days)

Q20. What is Change Data Capture (CDC)? How do you implement it?

Approaches:

| Approach | Method | Latency | Impact on Source |
|---|---|---|---|
| Log-based CDC | Read DB transaction logs (WAL) | Near real-time | Minimal |
| Trigger-based | DB triggers write to audit table | Low | High (per-row overhead) |
| Timestamp-based | Poll for records where updated_at > last_poll | Minutes | Moderate |
| Snapshot-based | Full table scan, compare to previous | Hours | High |

Log-based CDC with Debezium (industry standard):

# docker-compose.yml (illustrative; connector config is normally registered via the Kafka Connect REST API)
services:
  debezium:
    image: debezium/connect:latest
    config:
      connector.class: "io.debezium.connector.postgresql.PostgresConnector"
      database.hostname: "postgres"
      database.port: "5432"
      database.user: "replicator"
      database.password: "${DB_PASSWORD}"
      database.dbname: "production"
      database.server.name: "prod-postgres"
      table.include.list: "public.orders,public.customers"
      plugin.name: "pgoutput"
      publication.autocreate.mode: "filtered"
# Consuming CDC events from Kafka
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType
# `schema` below is the full row StructType for the table, assumed defined elsewhere

cdc_schema = StructType([
    StructField("op", StringType()),       # c=create, u=update, d=delete, r=read
    StructField("before", schema),          # row state before change
    StructField("after", schema),           # row state after change
    StructField("ts_ms", LongType())
])

cdc_df = spark.readStream.format("kafka") \
    .option("subscribe", "prod-postgres.public.orders") \
    .load() \
    .selectExpr("CAST(value AS STRING)") \
    .select(from_json(col("value"), cdc_schema).alias("cdc")) \
    .select("cdc.*")

# Apply changes to Delta Lake
def process_cdc_batch(batch_df, batch_id):
    delta_table = DeltaTable.forPath(spark, "s3://silver/orders/")
    inserts = batch_df.filter(col("op").isin(["c", "r"])).select("after.*")
    updates = batch_df.filter(col("op") == "u").select("after.*")
    deletes = batch_df.filter(col("op") == "d").select("before.order_id")
    # Merge inserts/updates; apply `deletes` separately (e.g. a second merge with whenMatchedDelete)
    delta_table.alias("t").merge(
        inserts.union(updates).alias("s"), "t.order_id = s.order_id"
    ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

cdc_df.writeStream.foreachBatch(process_cdc_batch).start()

Q21. How does Apache Flink compare to Spark Structured Streaming?

| Aspect | Apache Flink | Spark Structured Streaming |
|---|---|---|
| Processing model | True streaming (event-by-event) | Micro-batch (small batches) |
| Latency | Low milliseconds | 100ms-1s typical |
| State management | Native RocksDB state backend | Managed state store (RocksDB in Spark 3.x) |
| Event time | First-class citizen | Well-supported (watermarks) |
| Batch + streaming | Unified (batch as bounded streams) | Same DataFrame API (read vs readStream) |
| Window types | Tumbling, sliding, session, global | Tumbling, sliding, watermark-based |
| Maturity | Very mature, production-proven | Mature, widely adopted |
| Ecosystem | Better for complex streaming pipelines | Better for unified batch + streaming on Databricks |

2026 landscape: Flink dominates pure streaming use cases (fraud detection, real-time recommendations, trading). Spark Structured Streaming dominates in Databricks-heavy shops due to Delta Lake integration.

# Flink Python API (PyFlink)
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

t_env.execute_sql("""
    CREATE TABLE kafka_orders (
        order_id STRING,
        amount DOUBLE,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

result = t_env.sql_query("""
    SELECT
        TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
        COUNT(*) AS order_count,
        SUM(amount) AS total_amount
    FROM kafka_orders
    GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
result.execute().print()  # execute the query (print sink here; use a real sink in prod)

Q22. How do you handle late-arriving data in streaming pipelines?

# Strategies for late data:

# 1. Watermarks (Spark/Flink) — define maximum lateness
events.withWatermark("event_time", "2 hours")  # accept data up to 2h late
# Data arriving > 2h late is dropped/ignored

# 2. Reprocessing with Lambda architecture:
#    Streaming handles recent data; daily batch job reprocesses last 3 days
#    Delta Lake time travel enables targeted reprocessing

# 3. Idempotent MERGE for late data (Delta Lake)
def handle_late_data(late_df):
    DeltaTable.forPath(spark, "s3://gold/daily_metrics/") \
        .alias("t").merge(
            late_df.alias("s"),
            "t.date = s.date AND t.user_id = s.user_id"
        ).whenMatchedUpdate(set={
            "event_count": "t.event_count + s.event_count",
            "updated_at": "current_timestamp()"
        }).whenNotMatchedInsertAll().execute()

# 4. Allowed lateness (Flink DataStream API):
#    Flink keeps window state for extra time after the watermark passes,
#    so late records re-fire the window instead of being dropped. Sketch:
#    stream.key_by(lambda e: e["user_id"]) \
#          .window(TumblingEventTimeWindows.of(Time.minutes(5))) \
#          .allowed_lateness(10 * 60 * 1000)  # 10 min grace period, in ms
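The watermark rule above can be simulated in plain Python (a toy model, not any engine's API): the watermark trails the maximum event time seen so far by the allowed lateness, and events that fall behind the watermark are dropped.

```python
from datetime import datetime, timedelta

def process_with_watermark(events, allowed_lateness):
    """Toy watermark model: accept an event only if its event time is
    not older than (max event time seen so far - allowed_lateness)."""
    max_event_time = datetime.min
    accepted, dropped = [], []
    for event_time, payload in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - allowed_lateness
        if event_time >= watermark:
            accepted.append(payload)
        else:
            dropped.append(payload)  # too late: behind the watermark
    return accepted, dropped

t0 = datetime(2026, 1, 1, 12, 0)
events = [
    (t0, "a"),
    (t0 + timedelta(hours=3), "b"),    # advances watermark to t0 + 1h
    (t0 + timedelta(minutes=30), "c"), # 2.5h behind max time seen → dropped
]
accepted, dropped = process_with_watermark(events, timedelta(hours=2))
# accepted == ["a", "b"], dropped == ["c"]
```

Note the watermark only moves forward as newer events arrive, which is exactly why an idle source can stall a streaming job's output.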

Q23. What is data lineage and why does it matter?

# Apache Atlas — open-source data lineage
# OpenLineage — open standard for lineage (used by Airflow, Spark, dbt)

# OpenLineage with Airflow
from openlineage.airflow import OpenLineageListener
# Automatically captures: DAG name, task inputs/outputs, run metadata

# dbt lineage (automatic)
dbt docs generate  # generates lineage graph
dbt docs serve     # visual DAG of all models and their dependencies

# Custom lineage in PySpark
from uuid import uuid4
from datetime import datetime, timezone
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset

client = OpenLineageClient.from_environment()
run = Run(runId=str(uuid4()))
job = Job(namespace="my-spark-jobs", name="daily-sales-transform")

# Emit lineage events
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job,
    producer="https://github.com/my-org/my-spark-jobs",  # identifies the emitter
    inputs=[Dataset(namespace="s3://bronze", name="orders/")],
    outputs=[Dataset(namespace="s3://silver", name="orders/")]
))

Why lineage matters in 2026:

  • Impact analysis: "If I change this table, what downstream dashboards break?"
  • GDPR/data deletion: "Where does user_id 12345's data flow? Delete it everywhere."
  • Debugging: "Why is this dashboard showing wrong numbers?" → trace to source
  • Compliance: Auditors require proof of data governance
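Impact analysis is, at its core, a reachability query over the lineage graph. A minimal sketch (the edge map and table names are invented for illustration):

```python
from collections import deque

def downstream_impact(lineage, start):
    """BFS over a {table: [downstream tables]} edge map; returns every
    asset affected by a change to `start`."""
    impacted, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# Hypothetical lineage: bronze → silver → gold → dashboard
lineage = {
    "bronze.orders": ["silver.orders"],
    "silver.orders": ["gold.daily_sales", "gold.user_ltv"],
    "gold.daily_sales": ["dashboard.revenue"],
}
downstream_impact(lineage, "bronze.orders")
# → {'silver.orders', 'gold.daily_sales', 'gold.user_ltv', 'dashboard.revenue'}
```

Tools like DataHub and OpenLineage maintain this graph for you; the query pattern is the same.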

Q24. What are the common causes of Spark OOM errors and how do you fix them?

| Cause | Symptom | Fix |
| --- | --- | --- |
| Skewed data | One task takes 100x longer | Salting, skew join hint, AQE |
| Cartesian product | Memory explodes | Review join keys, avoid crossJoin |
| Collecting large data | df.toPandas() on large DF | Use Spark aggregations, sample first |
| Too much caching | Memory pressure | Unpersist when done, use MEMORY_AND_DISK |
| Driver memory | Driver OOM on collect() | Increase driver memory, avoid collect() on large data |
| Too many small files | Metadata overhead | Coalesce before write, Delta OPTIMIZE |
# Fix 1: Increase executor memory
spark = SparkSession.builder \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.memoryOverhead", "2g") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

# Fix 2: More shuffle partitions — smaller partitions, less memory per task
spark.conf.set("spark.sql.shuffle.partitions", "400")
# (shuffle spill to disk is automatic in modern Spark; no flag needed)

# Fix 3: AQE handles skew automatically (Spark 3.x)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")

# Fix 4: Avoid collect() on large DataFrames
# Bad:
rows = df.collect()  # brings everything to driver — OOM risk
# Good:
df.write.parquet("s3://output/")  # distributed write
# Or sample:
sample = df.sample(0.01).collect()
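A quick sanity check when tuning the memory settings above: the YARN/Kubernetes container must fit heap plus overhead, and only a fraction of the heap is usable for execution and storage (governed by spark.memory.fraction, default 0.6, after a fixed ~300 MB reservation). A simplified back-of-envelope calculator:

```python
def executor_memory_budget(executor_memory_gb, overhead_gb,
                           memory_fraction=0.6, reserved_gb=0.3):
    """Approximate Spark unified-memory math (simplified model).
    Container size = heap + overhead; usable execution+storage memory
    ≈ (heap - reserved) * spark.memory.fraction."""
    container_gb = executor_memory_gb + overhead_gb
    unified_gb = (executor_memory_gb - reserved_gb) * memory_fraction
    return container_gb, round(unified_gb, 2)

container, unified = executor_memory_budget(8, 2)
# container == 10 → the cluster manager must grant a 10 GB container
# unified == 4.62 → only ~4.6 GB of the 8 GB heap serves shuffles/joins/caching
```

The gap between requested and usable memory is why "just bump executor.memory" often under-delivers: roughly half the container is overhead, reservation, and user memory.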

Q25. What is data mesh and how does it differ from a traditional data platform?

| Aspect | Traditional Data Platform | Data Mesh |
| --- | --- | --- |
| Ownership | Central data team owns all pipelines | Domain teams own their data products |
| Architecture | Monolithic data warehouse/lake | Distributed, domain-oriented |
| Data discovery | Central catalog | Federated catalog |
| Governance | Central policies | Federated governance with global standards |
| Scalability | Bottlenecked by central team | Scales with number of domains |
| Consistency | High (centralized) | Requires standards for interoperability |

Four principles:

  1. Domain ownership: Orders domain team owns orders data product
  2. Data as a product: Data products have SLAs, documentation, versioning
  3. Self-serve platform: Domain teams can publish/consume without central team
  4. Federated governance: Global standards (schemas, quality, privacy) enforced locally

2026 tools: Datahub, OpenMetadata, Amundsen (data catalogs); dbt (transformations per domain); Snowflake/Databricks with workspace separation; OpenLineage (lineage federation).
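The "data as a product" principle is typically enforced with a data contract: the producing domain publishes a schema plus quality expectations, and consumers (or CI) validate batches against it. A minimal, library-free sketch — the contract format here is invented for illustration:

```python
contract = {  # hypothetical contract for the orders data product
    "columns": {"order_id": str, "amount": float, "status": str},
    "checks": {"amount": lambda v: v >= 0,
               "status": lambda v: v in {"placed", "shipped", "cancelled"}},
}

def validate_rows(rows, contract):
    """Return a list of violations; an empty list means the batch honors the contract."""
    violations = []
    for i, row in enumerate(rows):
        for col, typ in contract["columns"].items():
            if col not in row or not isinstance(row[col], typ):
                violations.append((i, col, "missing/wrong type"))
        for col, check in contract["checks"].items():
            if col in row and not check(row[col]):
                violations.append((i, col, "failed check"))
    return violations

good = {"order_id": "o1", "amount": 10.0, "status": "placed"}
bad = {"order_id": "o2", "amount": -5.0, "status": "unknown"}
validate_rows([good, bad], contract)
# → [(1, 'amount', 'failed check'), (1, 'status', 'failed check')]
```

In practice the contract lives in version control next to the producing pipeline, and a breaking change to it goes through review like any API change.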


HARD — Expert-Level Topics (Questions 26-50)

These questions are asked for senior data engineer and data platform architect roles (Rs 30-60+ LPA). If you can discuss data skew solutions, streaming exactly-once semantics, and lakehouse optimization at this level, you're in elite territory.

Q26. Design a real-time ML feature store.

                         [Batch Feature Pipeline (daily)]
                              ↓ (PySpark jobs)
Feature Sources:    [Offline Store (Delta Lake / Redshift)]
  - Clickstream                      ↑ training data
  - Transactions          [Feature Registry (metadata)]
  - User profiles                    ↓ sync recent features
  - Product catalog    [Online Store (Redis/Cassandra)]
                              ↑ inference (< 5ms)
                         [Real-time Pipeline (Flink/Kafka)]
                              ↑ (streaming features: last 1h activity)
from datetime import timedelta
from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Float32, Int64, String

user = Entity(name="user_id", join_keys=["user_id"])

# Define feature view (source path and timestamp field are illustrative)
user_features = FeatureView(
    name="user_features",
    entities=[user],
    schema=[
        Field(name="total_purchases_30d", dtype=Float32),
        Field(name="avg_order_value_30d", dtype=Float32),
        Field(name="days_since_last_purchase", dtype=Int64),
        Field(name="preferred_category", dtype=String),
    ],
    online=True,   # enable online store
    source=FileSource(path="s3://gold/user_features/",
                      timestamp_field="event_timestamp"),
    ttl=timedelta(days=7)
)

store = FeatureStore(repo_path="feature_repo/")

# Training: point-in-time correct feature retrieval
training_df = store.get_historical_features(
    entity_df=labeled_events,  # (user_id, event_timestamp, label)
    features=["user_features:total_purchases_30d",
              "user_features:avg_order_value_30d"]
).to_df()

# Inference: online serving
features = store.get_online_features(
    features=["user_features:total_purchases_30d"],
    entity_rows=[{"user_id": 12345}]
).to_dict()

Critical design decision: Point-in-time correctness for training — the feature value at training time must reflect what was available when the label was created, not today's value. This prevents data leakage.
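Point-in-time correctness boils down to an as-of join: for each label timestamp, take the latest feature value whose timestamp is at or before it. A plain-Python sketch of that rule, with toy data:

```python
from bisect import bisect_right

def point_in_time_value(feature_history, as_of):
    """feature_history: list of (timestamp, value) sorted ascending by timestamp.
    Returns the latest value with timestamp <= as_of, or None if none exists."""
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, as_of)
    return feature_history[idx - 1][1] if idx else None

# total_purchases_30d snapshots for one user (toy timestamps)
history = [(1, 3.0), (5, 7.0), (9, 12.0)]
point_in_time_value(history, as_of=6)   # → 7.0 (the value live at label time)
point_in_time_value(history, as_of=0)   # → None (feature didn't exist yet — no leakage)
```

Using today's value (12.0) for a label created at time 6 would leak future information into training; the as-of lookup is what `get_historical_features` does for you at scale.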


Q27. What is the Iceberg table format? How does it handle schema evolution and partition evolution?

# Apache Iceberg — Netflix's table format, now a top-level Apache project

from pyiceberg.catalog import load_catalog
catalog = load_catalog("default", **{"uri": "http://rest-catalog:8181"})

# Schema evolution (add columns without rewriting data)
from pyiceberg.types import StringType

table = catalog.load_table("db.orders")
with table.update_schema() as update:
    update.add_column("shipping_method", StringType())  # add new column
    update.rename_column("amount", "gross_amount")      # rename column
    # Old Parquet files unchanged; schema mapping handles it

# Partition evolution (change partitioning strategy without rewriting)
from pyiceberg.transforms import DayTransform

with table.update_spec() as update:
    # switch from monthly to daily partitioning on order_date
    update.add_field("order_date", DayTransform(), "day")
    # Old partitions (monthly) still readable; new writes use daily partitions

# Hidden partitioning (users don't think about partitions)
# Users query: WHERE order_date BETWEEN '2026-01-01' AND '2026-03-31'
# Iceberg prunes partitions automatically — no need to know the layout

# Time travel — read a specific snapshot
table.scan(snapshot_id=3823904823).to_arrow()
# To read "as of" a wall-clock time, look up the matching snapshot id
# in table.history() first, then scan with that snapshot_id

Iceberg vs Delta Lake vs Hudi:

| Feature | Iceberg | Delta Lake | Hudi |
| --- | --- | --- | --- |
| Creator | Netflix | Databricks | Uber |
| Partition evolution | Yes | No | Limited |
| Schema evolution | Best | Good | Good |
| ACID transactions | Yes | Yes | Yes |
| Multi-engine support | Best (Spark, Flink, Trino, DuckDB) | Good | Good |
| Hidden partitioning | Yes | No | No |

Q28. How do you build a streaming data quality monitoring system?

from pyspark.sql.functions import col, count, isnan, isnull, stddev, mean, percentile_approx

def compute_quality_metrics(df, table_name, partition_date):
    """Compute comprehensive quality metrics for a batch"""
    metrics = {}
    total_rows = df.count()
    metrics["row_count"] = total_rows

    for col_name in df.columns:
        col_metrics = {}
        dtype = dict(df.dtypes)[col_name]

        # Null rate (isnan only applies to floating-point columns)
        null_cond = isnull(col(col_name))
        if dtype in ("double", "float"):
            null_cond = null_cond | isnan(col(col_name))
        null_count = df.filter(null_cond).count()
        col_metrics["null_rate"] = null_count / max(total_rows, 1)

        # Cardinality
        col_metrics["cardinality"] = df.select(col_name).distinct().count()

        if dtype in ("double", "float", "int", "bigint"):
            stats = df.select(
                mean(col_name).alias("mean"),
                stddev(col_name).alias("stddev"),
                percentile_approx(col_name, 0.5).alias("p50"),
                percentile_approx(col_name, 0.99).alias("p99")
            ).collect()[0]
            col_metrics.update(stats.asDict())

        metrics[col_name] = col_metrics

    # Write metrics to monitoring table (row_count recorded under a pseudo-column)
    metrics_df = spark.createDataFrame(
        [{"table": table_name, "date": partition_date,
          "column": "_table", "metric": "row_count", "value": float(total_rows)}] +
        [{"table": table_name, "date": partition_date,
          "column": c, "metric": k, "value": float(v)}
         for c, col_metrics in metrics.items() if isinstance(col_metrics, dict)
         for k, v in col_metrics.items() if isinstance(v, (int, float))]
    )
    metrics_df.write.format("delta").mode("append").save("s3://monitoring/dq-metrics/")

# Anomaly detection on metrics
def detect_anomalies(metric_name, table_name, window_days=30):
    historical = spark.read.format("delta").load("s3://monitoring/dq-metrics/") \
        .filter((col("table") == table_name) & (col("metric") == metric_name)) \
        .orderBy("date", ascending=False).limit(window_days)
    stats = historical.select(mean("value").alias("hist_mean"),
                               stddev("value").alias("hist_std")).collect()[0]
    latest = historical.orderBy("date", ascending=False).first()["value"]
    z_score = abs(latest - stats["hist_mean"]) / (stats["hist_std"] + 1e-10)
    if z_score > 3:
        # alert() is your paging/notification hook (Slack, PagerDuty, ...)
        alert(f"Anomaly detected: {table_name}.{metric_name} = {latest} "
              f"(z-score={z_score:.2f})")

Q29. What is exactly-once processing in Kafka? How do you implement it?

from confluent_kafka import Producer, Consumer

# Producer-side: Idempotent producer (enable.idempotence=true)
# + Transactional producer for multi-partition atomic writes
producer = Producer({
    'bootstrap.servers': 'kafka:9092',
    'enable.idempotence': True,          # prevents duplicate messages on retry
    'transactional.id': 'order-processor-1',  # unique per producer instance
    'acks': 'all',
    'retries': 2147483647
})

producer.init_transactions()
try:
    producer.begin_transaction()
    # Produce to multiple topics atomically
    # (processed_order / inventory_delta are your serialized payloads)
    producer.produce('orders-processed', key='order_123', value=processed_order)
    producer.produce('inventory-updates', key='product_456', value=inventory_delta)
    # Commit consumed offsets and produced messages atomically
    # (offsets and group metadata come from the consumer you read from)
    producer.send_offsets_to_transaction(offsets, consumer_group_metadata)
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()

# Consumer-side: read_committed isolation
consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'my-consumer-group',
    'isolation.level': 'read_committed',  # only see committed transactions
    'enable.auto.commit': False
})

Idempotent sink with database:

# Use INSERT ... ON CONFLICT DO NOTHING (upsert)
# Natural key or message_id as deduplication key
cursor.execute("""
    INSERT INTO processed_events (event_id, user_id, action, processed_at)
    VALUES (%s, %s, %s, NOW())
    ON CONFLICT (event_id) DO NOTHING
""", (event_id, user_id, action))
conn.commit()
consumer.commit()  # commit offset only after DB commit
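The idempotent-sink idea can be unit-tested without Kafka or Postgres: replay a stream containing duplicates through a sink keyed by event_id and confirm each event is applied exactly once (toy in-memory stand-ins for the topic and the table):

```python
def idempotent_sink(messages, store):
    """Apply each message at most once, keyed by event_id —
    the in-memory analogue of INSERT ... ON CONFLICT DO NOTHING."""
    applied = 0
    for msg in messages:
        if msg["event_id"] not in store:   # event_id is the dedup key
            store[msg["event_id"]] = msg
            applied += 1
    return applied

# At-least-once delivery: e1 is redelivered after a simulated retry
stream = [
    {"event_id": "e1", "action": "click"},
    {"event_id": "e2", "action": "buy"},
    {"event_id": "e1", "action": "click"},  # duplicate
]
store = {}
idempotent_sink(stream, store)  # → 2 (the duplicate is silently skipped)
```

This is why many teams prefer "at-least-once delivery + idempotent sink" over full Kafka transactions: the end-to-end result is effectively exactly-once with much less operational complexity.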

Q30. What is data skew in distributed processing and all strategies to address it?

# Diagnose skew: check partition sizes
from pyspark.sql.functions import spark_partition_id
df.groupBy(spark_partition_id().alias("partition")).count() \
  .orderBy("count", ascending=False).show()
# If one partition has 10M rows and others have 100K, you have skew

# Strategy 1: Broadcast join (if skewed side is small dimension table)
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_lookup), "key")

# Strategy 2: Salting (distribute skewed key)
from pyspark.sql.functions import concat, lit, floor, rand, col, explode, array

NUM_SALTS = 20

# On the large (skewed) side: randomly append salt
large_salted = large_df.withColumn("salt", (rand() * NUM_SALTS).cast("int")) \
                        .withColumn("salted_key", concat(col("skewed_key"),
                                                          lit("_"), col("salt")))

# On the small side: expand with all possible salts
other_expanded = other_df.withColumn("salt", explode(array([lit(i) for i in range(NUM_SALTS)]))) \
                          .withColumn("salted_key", concat(col("skewed_key"),
                                                            lit("_"), col("salt")))

result = large_salted.join(other_expanded, "salted_key").drop("salt", "salted_key")

# Strategy 3: Adaptive Query Execution (Spark 3.x) — automatic
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Strategy 4: Pre-aggregate (reduce volume before the shuffle)
from pyspark.sql.functions import sum as spark_sum
# Bad: shuffle 1B raw rows just to form 10M groups
# Good: partial-aggregate within each partition, then shuffle the partials
df.groupBy("user_id").agg(spark_sum("amount"))  # Spark does partial aggregation automatically

# Strategy 5: Custom partitioner (if you know the distribution)
# Consistent hash partitioning with more buckets for hot keys
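You can see why salting works with a toy distribution in plain Python, mimicking the PySpark logic above: one hot key hashes to a single partition and dominates it; appending a random salt spreads the same rows across roughly NUM_SALTS partitions.

```python
import random
from collections import Counter

random.seed(42)
NUM_SALTS = 20
# 10,000 rows for one hot key plus a long tail of 100 normal keys
keys = ["hot_key"] * 10_000 + [f"key_{i}" for i in range(100)]

def partition_of(key, num_partitions=20):
    """Hash-partition a key, like a shuffle would."""
    return hash(key) % num_partitions

before = Counter(partition_of(k) for k in keys)
after = Counter(partition_of(f"{k}_{random.randrange(NUM_SALTS)}") for k in keys)

max(before.values())  # one partition holds essentially all 10k hot rows
max(after.values())   # the hot key's rows are now spread across many partitions
```

The same trade-off as in Spark applies: the salted join needs the other side expanded NUM_SALTS times, so salt only the keys you have diagnosed as hot.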

Data Engineering FAQs — No-BS Answers

Q: What is the modern data stack in 2026? A: Here's the exact stack most top-tier data teams run: Fivetran/Airbyte (ingestion) → Snowflake/Databricks (storage + processing) → dbt (transformations) → Airflow/Dagster/Prefect (orchestration) → Monte Carlo/Great Expectations (data quality) → Datahub/Atlan (catalog) → Looker/Metabase (BI). Add Apache Kafka for streaming. Know this stack cold — interviewers expect you to have opinions about each layer.

Q: PySpark vs pandas in 2026? A: Pandas for < 1GB data (data science notebooks, feature engineering on small samples). PySpark for anything larger. Polars (Rust-based, 10-100x faster than pandas for single-node) is increasingly popular for medium-scale (1GB-100GB) data. pandas 2.0 with PyArrow backend improved performance but Polars is still faster.

Q: What is the difference between Airflow, Dagster, and Prefect? A: All three are orchestration tools. Airflow is the most mature and widely used, but its UI is dated and testing DAGs is hard. Dagster prioritizes software-engineering best practices (type checking, testing, asset-first design). Prefect has the best developer experience and cloud offering. In 2026, Dagster is gaining adoption in modern data teams; Airflow still dominates legacy shops.

Q: What is a data product? A: A dataset or service that is intentionally designed, maintained, and delivered with defined SLAs, documentation, quality guarantees, and a clear owner. A view in a dashboard is not a data product; a versioned, documented, quality-checked dataset with a contract is.

Q: What is Z-ordering in Delta Lake and when should you use it? A: Z-ordering (also called Z-order clustering) reorganizes data within Delta files so that related data is co-located. It creates a multi-dimensional locality — sorting on two columns simultaneously (unlike ORDER BY which sorts one dimension). Use Z-ordering when queries filter on multiple columns. delta_table.optimize().executeZOrderBy("user_id", "event_date").

Q: What is the small files problem and how do you solve it? A: Streaming or frequent small batch writes create millions of tiny Parquet/ORC files. Each file requires metadata I/O → query planning is slow. Solutions: Delta Lake OPTIMIZE (auto-compaction), coalesce() before write, adjust spark.sql.shuffle.partitions, enable auto-optimization in Databricks.

Q: How do you handle GDPR right-to-erasure in a data lake? A: Pure lakes with immutable Parquet files make deletion hard. Solutions: (1) Delta Lake delete + vacuum (removes file versions after retention), (2) Iceberg row-level deletes, (3) Encryption + key deletion (delete the key → data is unreadable), (4) Pseudonymization (replace PII with reversible token; delete mapping table for erasure).
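Option (4), pseudonymization, is easy to prototype: PII is replaced with a random token before landing in the lake, and the token→PII mapping lives in a small mutable store; deleting the mapping row fulfils erasure without rewriting any lake files. A stdlib-only sketch (the mapping dict stands in for a real mutable store such as Postgres):

```python
import secrets

mapping = {}  # token -> PII; kept in a mutable store, NOT in the lake

def pseudonymize(email):
    """Swap PII for a random token before the row is written to the lake."""
    token = secrets.token_hex(8)
    mapping[token] = email
    return token

def erase(token):
    """Right-to-erasure: drop the mapping row; lake files stay untouched."""
    mapping.pop(token, None)

token = pseudonymize("alice@example.com")
lake_row = {"user_token": token, "amount": 42.0}  # what the immutable lake stores

mapping[token]        # → 'alice@example.com' (re-identifiable while mapping exists)
erase(token)
mapping.get(token)    # → None: lake_row is now anonymous, with no file rewrite
```

Crypto-shredding follows the same shape, with a per-user encryption key in place of the token mapping: deleting the key renders the encrypted lake data unreadable.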

Q: What are the most important skills for a data engineer to learn in 2026? A: In priority order: SQL (still the single most important skill — don't skip window functions and CTEs), Python, PySpark for large-scale processing, dbt for transformations, Kafka for streaming, Delta Lake/Iceberg for lakehouse, Airflow/Dagster for orchestration, and Great Expectations for data quality. The 2026 bonus skills that separate good from great: vector databases for AI pipelines, feature stores for MLOps, and streaming ML with Flink.

