Data Engineering Interview Questions 2026 — Top 50 Questions with Answers
Data engineering roles saw a 47% increase in job postings in 2025, and the trend is accelerating. Companies like Databricks, Snowflake, Meta, Amazon, and every data-driven business are hiring aggressively — because every AI model is only as good as its data pipeline. The modern data engineer must master streaming pipelines, lakehouse architectures, data quality frameworks, and increasingly, ML feature engineering. This guide compiles 50 real questions from interviews at Databricks, Snowflake, Meta, Amazon, Airbnb, and Uber — with battle-tested answers that get offers.
The data engineering interview isn't about memorizing Spark APIs. It's about proving you can build pipelines that don't break at 3 AM. This guide teaches you both.
Related articles: AI/ML Interview Questions 2026 | System Design Interview Questions 2026 | Generative AI Interview Questions 2026 | AWS Interview Questions 2026
Which Companies Ask These Questions?
| Topic Cluster | Companies |
|---|---|
| ETL Pipelines & Orchestration | Airbnb, Meta, Amazon, Stripe, all data-heavy companies |
| Apache Spark & PySpark | Databricks, LinkedIn, Apple, all Hadoop/Spark shops |
| Apache Kafka & Streaming | LinkedIn, Confluent, Uber, DoorDash |
| Data Modeling & Warehousing | Snowflake, Google BigQuery teams, Redshift shops |
| Data Quality & Testing | Lyft, Airbnb, data platform teams |
| Lakehouse Architecture | Databricks, Dremio, AWS, Azure |
| Real-time vs Batch | All data teams with streaming requirements |
EASY — Core Fundamentals (Questions 1-15)
Even senior data engineers get tripped up on fundamentals. Interviewers use these to gauge how deeply you understand the "why" behind your daily tools.
Q1. What is ETL and ELT? When do you use each?
| Aspect | ETL (Extract-Transform-Load) | ELT (Extract-Load-Transform) |
|---|---|---|
| Transform location | Before loading (external compute) | Inside the target warehouse |
| Era | Traditional data warehouses (Teradata) | Cloud DWH (Snowflake, BigQuery, Redshift) |
| Flexibility | Less — structure defined upfront | More — raw data stored, transform later |
| Latency | Higher to load (transform happens first) | Low to load raw data; transforms run later in-warehouse |
| Raw data preserved? | Sometimes not | Always (raw layer) |
When ETL: Legacy systems, strict data governance, PII masking must happen before loading, target system has limited compute.
When ELT: Cloud data warehouse with elastic compute (Snowflake, BigQuery), data lake architectures, explorative analytics where transformation requirements are evolving.
Modern reality (2026): Most teams use ELT with a tool like dbt (data build tool) for SQL-based transformations in the warehouse, plus Python (PySpark/pandas) for complex non-SQL transformations.
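The ELT split — load raw first, transform in-warehouse later — can be sketched end-to-end with SQLite standing in for the warehouse (table names and rows here are hypothetical):

```python
import sqlite3

# Toy ELT sketch: SQLite stands in for a cloud warehouse.
conn = sqlite3.connect(":memory:")

# 1. Extract + Load: land source rows as-is into a raw layer (all TEXT, no cleansing)
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, status TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", [
    ("1", "120.50", "confirmed"),
    ("2", "-5.00", "cancelled"),   # bad record stays preserved in the raw layer
    ("3", "80.00", "confirmed"),
])

# 2. Transform inside the warehouse with SQL — the "T" happens last
conn.execute("""
    CREATE TABLE stg_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount
    FROM raw_orders
    WHERE CAST(amount AS REAL) > 0 AND status = 'confirmed'
""")
clean = conn.execute("SELECT order_id, amount FROM stg_orders ORDER BY order_id").fetchall()
print(clean)
```

Because the raw table is untouched, the staging transform can be rewritten and re-run at any time — the core flexibility argument for ELT.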
Q2. What is Apache Spark and why is it faster than MapReduce?
| Aspect | MapReduce (Hadoop) | Apache Spark |
|---|---|---|
| Data persistence | Write to HDFS after each step | In-memory RDDs (lazy evaluation) |
| Speed | 10-100x slower | In-memory: 100x faster; disk: 10x faster |
| API | Low-level (map, reduce) | High-level (DataFrame, SQL, MLlib, Streaming) |
| Iterative algorithms | Very slow (ML training) | Efficient (cache intermediate data) |
| Language support | Java (primary); others via Hadoop Streaming | Scala, Java, Python (PySpark), R, SQL |
Spark's execution model:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, window
spark = SparkSession.builder.appName("SalesAnalysis").getOrCreate()
# Transformations are LAZY — no computation until action is called
df = (spark.read.parquet("s3://data-lake/sales/")
      .filter(col("year") == 2026)
      .groupBy("product_category")
      .agg(sum("revenue").alias("total_revenue"),
           avg("order_value").alias("avg_order_value"))
      .orderBy("total_revenue", ascending=False))
# Action triggers execution
df.show(20) # triggers DAG execution
df.write.parquet("s3://data-warehouse/sales_summary/2026/")
DAG execution: Spark builds a Directed Acyclic Graph of transformations, then optimizes and executes it. Catalyst optimizer rewrites the query plan for efficiency. Tungsten execution engine uses off-heap memory and SIMD vectorization.
Q3. What is the difference between narrow and wide transformations in Spark?
| Type | Definition | Examples | Shuffle? |
|---|---|---|---|
| Narrow | Each partition → at most one output partition | map, filter, union | No |
| Wide | One partition → multiple output partitions | groupBy, join, distinct, repartition | Yes |
Why shuffles are expensive: Wide transformations require shuffling data across the network — all rows with the same key must go to the same partition. Shuffle involves disk I/O, network I/O, and serialization.
# Wide transformation — causes shuffle
df.groupBy("customer_id").count()
# Spark: sort/hash all rows by customer_id → network transfer → reduce
# Narrow transformation — no shuffle
df.filter(col("amount") > 100).select("customer_id", "amount")
# Each partition processed independently
# Optimization: use filter BEFORE join to reduce shuffle data
# Bad:
big_df.join(other_df, "user_id").filter(col("country") == "IN")
# Good:
big_df.filter(col("country") == "IN").join(other_df, "user_id")
Q4. What is a Spark partition? How do you optimize partitioning?
Rules of thumb:
- Target: 2-3 partitions per CPU core available
- Ideal partition size: 100MB–300MB
- Avoid too many small partitions (task overhead) and too few large partitions (OOM, skew)
# Check partition count and sizes
df = spark.read.parquet("s3://data/")
print(f"Partitions: {df.rdd.getNumPartitions()}")
# Repartition by column (even distribution via hash)
# Use when: preparing for shuffle-heavy operation, joining
df_repartitioned = df.repartition(200, "user_id")
# Coalesce (reduce partitions — no full shuffle)
# Use when: reducing partitions before write
df_coalesced = df.coalesce(50)
# Repartition by range (sorted partitions — good for range queries)
df_range = df.repartitionByRange(200, "date")
# Handle data skew: salting technique
from pyspark.sql.functions import concat, lit, floor, rand
# Add random salt to key to distribute skewed keys
df_salted = df.withColumn("salted_key",
concat(col("user_id"), lit("_"), (floor(rand() * 10)).cast("string")))
Q5. What is Apache Kafka? Explain its core concepts.
Kafka Core Concepts:
├── Topic: Named stream of messages (e.g., "user-clicks", "order-events")
├── Partition: Ordered, immutable sequence of messages within a topic
│ └── Offset: Sequential ID of each message in a partition
├── Producer: Writes messages to topics
├── Consumer: Reads messages from topics
├── Consumer Group: Group of consumers sharing partition assignment
│ └── Each partition consumed by exactly one consumer in a group
├── Broker: Kafka server storing partition data
├── Cluster: Multiple brokers
├── ZooKeeper/KRaft: Cluster metadata management
└── Replication: Each partition replicated across N brokers (leader + followers)
import json
from confluent_kafka import Producer, Consumer
# Producer
producer = Producer({'bootstrap.servers': 'kafka:9092'})
producer.produce('user-clicks',
key=str(user_id).encode(),
value=json.dumps(click_event).encode(),
callback=delivery_callback)
producer.flush()
# Consumer
consumer = Consumer({
'bootstrap.servers': 'kafka:9092',
'group.id': 'analytics-consumers',
'auto.offset.reset': 'earliest', # or 'latest'
'enable.auto.commit': False # manual commit after processing → at-least-once
})
consumer.subscribe(['user-clicks'])
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue
    if msg.error():
        handle_error(msg.error())
        continue
    process_event(json.loads(msg.value()))
    consumer.commit()  # commit only after processing → at-least-once delivery
Delivery semantics:
| Semantic | How | Risk |
|---|---|---|
| At most once | Commit before processing | Message loss on crash |
| At least once | Commit after processing | Duplicate processing |
| Exactly once | Kafka transactions + idempotent producers | Complex, slight overhead |
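As a concrete illustration of the exactly-once row, these are the producer settings that enable idempotence and transactions in confluent-kafka-python — broker address and `transactional.id` are placeholders:

```python
# Producer settings for exactly-once semantics (confluent_kafka / librdkafka).
# Broker address and transactional.id are illustrative placeholders.
eos_producer_config = {
    'bootstrap.servers': 'kafka:9092',
    'enable.idempotence': True,         # broker de-duplicates producer retries
    'acks': 'all',                      # required by (and implied with) idempotence
    'transactional.id': 'orders-etl-1'  # stable ID enables transactional writes
}

# Typical transactional flow (needs a live broker, so shown as comments):
# producer = Producer(eos_producer_config)
# producer.init_transactions()
# producer.begin_transaction()
# producer.produce('orders-clean', value=b'...')
# producer.send_offsets_to_transaction(offsets, consumer_group_metadata)
# producer.commit_transaction()
print(sorted(eos_producer_config))
```

Committing consumed offsets inside the same transaction as the produced output is what makes read-process-write pipelines exactly-once end to end.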
Q6. What is Apache Airflow? How does it orchestrate pipelines?
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from datetime import datetime, timedelta
default_args = {
    'owner': 'data-engineering',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['[email protected]']
}
with DAG(
    dag_id='daily_sales_pipeline',
    default_args=default_args,
    schedule='0 6 * * *',  # 6 AM daily (cron); `schedule_interval` in older Airflow
    start_date=datetime(2026, 1, 1),
    catchup=False,  # don't backfill past dates on deploy
    tags=['sales', 'production']
) as dag:
    extract_task = PythonOperator(
        task_id='extract_from_postgres',
        python_callable=extract_sales_data,
        op_kwargs={'date': '{{ ds }}'}  # Airflow templating: execution date
    )
    transform_task = GlueJobOperator(
        task_id='spark_transform',
        job_name='sales-transform-job',
        script_args={'--date': '{{ ds }}', '--env': 'production'}
    )
    load_task = PythonOperator(
        task_id='load_to_snowflake',
        python_callable=load_to_warehouse
    )
    validate_task = PythonOperator(
        task_id='run_data_quality_checks',
        python_callable=run_great_expectations
    )
    notify_task = BashOperator(
        task_id='notify_stakeholders',
        bash_command='python notify.py --date {{ ds }} --status success'
    )
    # Define dependencies
    extract_task >> transform_task >> load_task >> validate_task >> notify_task
Airflow key concepts:
- DAG: Directed Acyclic Graph of tasks, defined in Python code
- Operator: Unit of work (PythonOperator, BashOperator, SparkSubmitOperator, etc.)
- Task: Instance of an operator
- XCom: Cross-communication between tasks (pass small data)
- Scheduler: Scans DAGs, triggers task instances based on schedule
- Executor: LocalExecutor, CeleryExecutor (distributed), KubernetesExecutor (cloud-native)
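Under the hood, the scheduler's core job is ordering tasks by their dependencies. A toy sketch of that using the chain above — task names mirror the DAG, and `graphlib` is Python stdlib, not an Airflow API:

```python
from graphlib import TopologicalSorter

# Toy illustration of what a scheduler derives from `>>` dependencies:
# each task maps to its upstream tasks (names mirror the example DAG).
deps = {
    "extract":   set(),
    "transform": {"extract"},
    "load":      {"transform"},
    "validate":  {"load"},
    "notify":    {"validate"},
}
# Topological order: every task appears after all of its upstreams
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Real schedulers add retries, pools, and parallel branches on top, but dependency-ordered execution is the foundation.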
Q7. What is a data warehouse? How is it different from a data lake?
| Aspect | Data Warehouse | Data Lake |
|---|---|---|
|---|---|---|
| Data type | Structured only | Structured + semi-structured + unstructured |
| Schema | Schema-on-write (defined before loading) | Schema-on-read (defined at query time) |
| Purpose | BI, reporting, dashboards | ML, data science, exploratory analytics |
| Examples | Snowflake, Redshift, BigQuery | S3, ADLS, HDFS |
| Query performance | Fast (pre-defined schema, optimized storage) | Slower (must parse raw files) |
| Data quality | High (enforced schema) | Varies (raw data may be messy) |
Data Lakehouse (2026 dominant pattern): Combines lake (cheap storage, raw data) with warehouse (ACID transactions, query performance, schema enforcement) using open table formats:
| Format | Creator | Key Feature |
|---|---|---|
| Delta Lake | Databricks | ACID transactions, time travel |
| Apache Iceberg | Netflix | Partition evolution, hidden partitioning |
| Apache Hudi | Uber | Upserts (record-level updates) |
Q8. What is dbt (data build tool) and why is it standard in 2026?
-- models/mart/fct_sales.sql
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
cluster_by=['order_date', 'customer_id']
)
}}
WITH source_orders AS (
SELECT * FROM {{ ref('stg_orders') }} -- reference to staging model
{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
),
customer_segments AS (
SELECT * FROM {{ ref('dim_customers') }} -- reference to dimension
),
final AS (
SELECT
o.order_id,
o.customer_id,
o.created_at::DATE AS order_date,
o.gross_amount,
o.discount_amount,
o.gross_amount - o.discount_amount AS net_amount,
c.segment,
c.country
FROM source_orders o
LEFT JOIN customer_segments c USING (customer_id)
)
SELECT * FROM final
# models/mart/fct_sales.yml — documentation and tests
version: 2
models:
  - name: fct_sales
    description: "Fact table for all completed orders"
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: net_amount
        tests:
          - not_null
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
      - name: order_date
        tests: [not_null]
dbt advantages: Version control for data transformations. Test every model. Automatic documentation. Lineage graphs. CI/CD integration. The standard for analytics engineering in 2026.
Q9. What is the star schema? How does it differ from snowflake schema?
-- Fact table: records events/transactions (large, narrow columns)
CREATE TABLE fact_sales (
sale_id BIGINT PRIMARY KEY,
date_key INT REFERENCES dim_date(date_key),
product_key INT REFERENCES dim_product(product_key),
customer_key INT REFERENCES dim_customer(customer_key),
store_key INT REFERENCES dim_store(store_key),
quantity INT,
unit_price DECIMAL(10,2),
discount DECIMAL(5,2),
net_revenue DECIMAL(12,2)
);
-- Dimension table: descriptive attributes (small, wide columns)
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_id VARCHAR(20),
product_name VARCHAR(200),
category VARCHAR(100),
subcategory VARCHAR(100),
brand VARCHAR(100),
unit_cost DECIMAL(10,2)
);
-- Date dimension: pre-computed calendar attributes
CREATE TABLE dim_date (
date_key INT PRIMARY KEY, -- YYYYMMDD
date DATE,
year INT,
quarter INT,
month INT,
month_name VARCHAR(10),
week_of_year INT,
day_of_week INT,
is_weekend BOOLEAN,
is_holiday BOOLEAN
);
Star vs Snowflake:
| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Dimension normalization | Denormalized (flat) | Normalized (hierarchies split into tables) |
| Query performance | Faster (fewer joins) | Slower (more joins) |
| Storage | More (redundant data) | Less |
| Maintenance | Simpler | More complex |
| Use case | OLAP/BI workloads | When dimension tables are very large |
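A runnable miniature of the star-schema query shape — one join per dimension, aggregate on the fact — with SQLite standing in for the warehouse (tables and values are hypothetical):

```python
import sqlite3

# Minimal star-schema sketch: one fact table, one dimension.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INT PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (sale_id INT, product_key INT, net_revenue REAL);
    INSERT INTO dim_product VALUES (1, 'Electronics'), (2, 'Grocery');
    INSERT INTO fact_sales  VALUES (10, 1, 500.0), (11, 1, 300.0), (12, 2, 40.0);
""")
# The defining star-schema query shape: fact JOIN dimension, GROUP BY attribute
revenue = conn.execute("""
    SELECT p.category, SUM(f.net_revenue)
    FROM fact_sales f JOIN dim_product p USING (product_key)
    GROUP BY p.category ORDER BY 2 DESC
""").fetchall()
print(revenue)
```

In a snowflake schema the same query would need extra joins (e.g. `dim_product` → `dim_category`), which is exactly the performance trade-off in the table above.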
Q10. What are Slowly Changing Dimensions (SCDs)? Explain Types 1, 2, and 3.
SCD Type 1 — Overwrite:
-- Simply overwrite old value — no history kept
UPDATE dim_customer SET city = 'Bangalore', state = 'Karnataka'
WHERE customer_id = 12345;
-- Use when: corrections to data errors, history not needed
SCD Type 2 — Add New Row (most common):
CREATE TABLE dim_customer (
customer_key SERIAL PRIMARY KEY, -- surrogate key
customer_id INT, -- natural/business key
customer_name VARCHAR(200),
city VARCHAR(100),
state VARCHAR(50),
segment VARCHAR(50),
start_date DATE NOT NULL,
end_date DATE, -- NULL = current record
is_current BOOLEAN DEFAULT TRUE
);
-- When customer moves city:
-- 1. Expire old record
UPDATE dim_customer SET end_date = CURRENT_DATE - 1, is_current = FALSE
WHERE customer_id = 12345 AND is_current = TRUE;
-- 2. Insert new record
INSERT INTO dim_customer (customer_id, customer_name, city, state, start_date, is_current)
VALUES (12345, 'Rahul Sharma', 'Bangalore', 'Karnataka', CURRENT_DATE, TRUE);
SCD Type 3 — Add Column:
-- Store both current and previous value
ALTER TABLE dim_customer
ADD COLUMN previous_city VARCHAR(100),
ADD COLUMN city_change_date DATE;
UPDATE dim_customer
SET previous_city = city, city = 'Bangalore', city_change_date = CURRENT_DATE
WHERE customer_id = 12345;
-- Use when: only last change matters, and need to compare current vs previous
SCD Type 4 — History table (separate current and history tables). SCD Type 6 — Hybrid of 1+2+3. In 2026, Type 2 is most common with Delta Lake's time travel often used as an alternative for simpler history tracking.
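The Type 2 expire-and-insert steps can be run end-to-end in a toy SQLite sketch — IDs, cities, and the fixed date are made up for determinism:

```python
import sqlite3
from datetime import date, timedelta

# SCD Type 2 sketch: expire the current row, insert the new version.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key (auto-assigned)
    customer_id INT, city TEXT,
    start_date TEXT, end_date TEXT, is_current INT)""")
conn.execute("INSERT INTO dim_customer (customer_id, city, start_date, is_current) "
             "VALUES (12345, 'Pune', '2024-01-01', 1)")

today = date(2026, 3, 30)  # fixed date so the example is deterministic
# 1. Expire the old record
conn.execute("UPDATE dim_customer SET end_date = ?, is_current = 0 "
             "WHERE customer_id = 12345 AND is_current = 1",
             ((today - timedelta(days=1)).isoformat(),))
# 2. Insert the new version
conn.execute("INSERT INTO dim_customer (customer_id, city, start_date, is_current) "
             "VALUES (12345, 'Bangalore', ?, 1)", (today.isoformat(),))

history = conn.execute("SELECT city, is_current FROM dim_customer "
                       "WHERE customer_id = 12345 ORDER BY customer_key").fetchall()
print(history)  # both versions kept; only the latest is current
```

Note both steps must run in one transaction in production, or a crash between them leaves the customer with no current row.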
Q11. What is data partitioning in data warehouses and lakes?
# Partitioning in PySpark (for data lake)
df.write \
.partitionBy("year", "month", "day") \
.mode("append") \
.parquet("s3://data-lake/events/")
# Resulting directory structure:
# s3://data-lake/events/year=2026/month=03/day=30/part-0001.parquet
# Query with partition pruning (reads only needed partitions)
df = spark.read.parquet("s3://data-lake/events/") \
.filter(col("year") == 2026).filter(col("month") == 3)
# Spark reads ONLY s3://data-lake/events/year=2026/month=03/*
-- Partitioning in Snowflake: clustering keys (micro-partitions are automatic)
ALTER TABLE events CLUSTER BY (TO_DATE(event_timestamp), event_type);
-- Snowflake sorts data within micro-partitions by the cluster key,
-- enabling micro-partition pruning for range scans
Partition design principles:
- Partition by columns commonly used in WHERE clauses
- Avoid high-cardinality columns as partition keys (too many tiny partitions)
- Date partitioning is almost always useful
- Avoid very low-cardinality partition keys (< 10 distinct values): pruning benefit is small and each partition becomes huge
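Partition pruning itself is just path selection — a filter on partition columns decides which directories are read at all. A toy sketch (paths hypothetical):

```python
# Toy partition-pruning sketch: partition values are encoded in the path,
# so a filter on partition columns selects directories without opening files.
paths = [
    "s3://data-lake/events/year=2025/month=12/part-0001.parquet",
    "s3://data-lake/events/year=2026/month=03/part-0001.parquet",
    "s3://data-lake/events/year=2026/month=04/part-0001.parquet",
]

def matches(path, year, month):
    # Mirrors Hive-style partition directories: year=YYYY/month=MM
    return f"year={year}/" in path and f"month={month:02d}/" in path

pruned = [p for p in paths if matches(p, 2026, 3)]
print(pruned)  # only one partition directory would be scanned
```

This is why partitioning only pays off when queries actually filter on the partition columns — otherwise every directory is listed and read anyway.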
Q12. What is the difference between OLTP and OLAP systems?
| Aspect | OLTP | OLAP |
|---|---|---|
| Purpose | Transaction processing | Analytics and reporting |
| Query type | Short reads/writes on few rows | Long reads across many rows |
| Optimization | Row-based storage, indexes | Columnar storage, partitioning |
| Normalization | 3NF (normalized) | Denormalized (star/snowflake) |
| Concurrency | Thousands of concurrent users | Fewer, complex queries |
| Examples | PostgreSQL, MySQL | Snowflake, BigQuery, Redshift |
| Data currency | Real-time | Batch-updated |
| Transaction support | ACID | Limited (ACID in modern lakehouses) |
-- OLAP query pattern: aggregate across millions of rows
-- Columnar storage shines here — only the referenced columns are read
SELECT
product_category,
DATE_TRUNC('month', order_date) AS month,
SUM(revenue) AS total_revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM fact_orders
WHERE order_date BETWEEN '2025-01-01' AND '2026-12-31'
GROUP BY 1, 2
ORDER BY 3 DESC;
Q13. What are the key Spark performance tuning techniques?
# 1. Broadcast joins (for small tables < ~10MB)
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
# Avoids shuffle — broadcasts small_df to all executors
# 2. Persist/cache intermediate results
df_filtered = df.filter(col("status") == "active").cache()
df_filtered.count() # Action to trigger caching
# ... multiple uses of df_filtered ...
df_filtered.unpersist() # Free memory when done
# 3. Avoid Python UDFs when possible (they bypass Catalyst and run row-by-row in Python)
# Bad: Python UDF (slow — serializes row by row to Python)
from pyspark.sql.functions import udf
@udf("double")
def calculate_tax(amount): return amount * 0.18 # Slow!
# Good: use built-in Spark functions (JVM-native, vectorized)
from pyspark.sql.functions import col
df = df.withColumn("tax", col("amount") * 0.18) # Fast!
# 4. Predicate pushdown (push filters into file reading)
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
# 5. Column pruning (select only needed columns early)
df = df.select("user_id", "event_type", "timestamp") # Before joins
# 6. Adaptive Query Execution (AQE) — enabled by default in Spark 3.x
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# AQE dynamically adjusts partition count after shuffle based on actual data size
# 7. Salting for skewed joins
df_salted = skewed_df.withColumn("salt", (rand() * 10).cast("int")) \
.withColumn("skewed_key_salted",
concat(col("skewed_key"), lit("_"), col("salt")))
other_salted = other_df.crossJoin(spark.range(10).toDF("salt")) \
.withColumn("skewed_key_salted",
concat(col("skewed_key"), lit("_"), col("salt")))
result = df_salted.join(other_salted, "skewed_key_salted")
Q14. What is the Lambda architecture and how does Kappa architecture improve it?
Lambda Architecture:
Data Source → [Batch Layer] → Batch views
↘ [Speed Layer] → Real-time views
↓
[Serving Layer] merges both → Query results
Problem: Maintaining two separate codebases (batch + streaming) for the same logic. Code divergence, debugging complexity.
Kappa Architecture:
Data Source → [Streaming Layer only (Kafka + Flink/Spark Streaming)]
↓
[Serving Layer] → All queries served from streaming output
Replay for "batch" correction: Replay Kafka topic from offset 0 through a faster streaming job.
In 2026 — predominant patterns:
- Lakehouse + streaming (Databricks/Delta Lake): Structured Streaming writes to Delta tables. Batch jobs read from same tables. No architecture split.
- Kappa with Flink: Flink handles both streaming (milliseconds) and batch (bounded streams) uniformly.
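The Kappa premise — one code path, with state rebuilt by replaying from offset 0 — can be shown with a plain list standing in for a Kafka topic:

```python
# Toy sketch of the Kappa idea: a single processing function, and state
# rebuilt deterministically by replaying the log (list stands in for a topic).
log = [("user1", 5), ("user2", 3), ("user1", 2)]  # (key, amount) events

def apply_event(state, event):
    key, amount = event
    state[key] = state.get(key, 0) + amount
    return state

# "Streaming": fold events as they arrive
live_state = {}
for e in log:
    apply_event(live_state, e)

# "Batch correction": replay the whole log through the SAME code path
replayed_state = {}
for e in log:
    apply_event(replayed_state, e)

assert live_state == replayed_state  # one codebase, identical results
print(live_state)
```

Lambda needs two implementations of `apply_event` (batch and speed layer) that must be kept in sync by hand; Kappa removes that divergence risk.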
Q15. What is a medallion architecture?
Bronze (Raw) → Silver (Cleaned) → Gold (Business-ready)
Bronze:
- Raw data as-is from source systems
- Immutable, append-only
- JSON/CSV/Parquet from Kafka, databases, APIs
- Retention: indefinite
- s3://data-lake/bronze/orders/year=2026/
Silver:
- Deduplicated, cleaned, validated
- Standardized types and formats
- Parsed nested JSON
- Some joins (order + order_items unified)
- Retention: 3-5 years
Gold:
- Business-level aggregates
- Optimized for specific use cases (BI, ML features, operational)
- Star schema or wide tables
- Frequently refreshed
- s3://data-lake/gold/daily_sales_summary/
# Bronze to Silver example with Delta Lake
from delta.tables import DeltaTable
from pyspark.sql.functions import col, to_timestamp, current_timestamp
# Read from Bronze
bronze_df = spark.readStream \
.format("delta") \
.load("s3://data-lake/bronze/orders/")
# Clean and standardize
silver_df = bronze_df \
.dropDuplicates(["order_id"]) \
.filter(col("amount").isNotNull() & (col("amount") > 0)) \
.withColumn("created_at", to_timestamp(col("created_at_str"), "yyyy-MM-dd HH:mm:ss")) \
.withColumn("_ingested_at", current_timestamp()) \
.drop("created_at_str", "_raw_json")
# Write to Silver (streaming append)
silver_df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "s3://checkpoints/bronze-to-silver/orders/") \
.start("s3://data-lake/silver/orders/")
MEDIUM — Advanced Topics (Questions 16-35)
Databricks, Snowflake, and Airbnb data engineering interviews go deep on these topics. This is where you prove you've built real pipelines, not just followed tutorials.
Q16. How does Spark Structured Streaming work? What are output modes?
from pyspark.sql.functions import window, col, sum, from_json
# Read from Kafka as a stream
events = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka:9092") \
.option("subscribe", "user-clicks") \
.option("startingOffsets", "latest") \
.load() \
.selectExpr("CAST(value AS STRING) as json_value",
"timestamp as event_time") \
.select(from_json(col("json_value"), schema).alias("data"), "event_time") \
.select("data.*", "event_time")
# Windowed aggregation (tumbling window, 5-minute intervals)
# Watermark: allow up to 10 minutes of late data
hourly_counts = events \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(
        window(col("event_time"), "5 minutes"),  # tumbling window
        col("product_category")
    ) \
    .agg(sum("click_count").alias("total_clicks"))
# Output modes:
# append: only new rows (not suitable for aggregations without watermark)
# update: only changed rows since last trigger (most efficient for aggregations)
# complete: entire result table (only for small result sets)
hourly_counts.writeStream \
.format("delta") \
.outputMode("update") \
.option("checkpointLocation", "s3://checkpoints/hourly-clicks/") \
.trigger(processingTime="1 minute") \
.start("s3://gold/hourly-clicks/")
Watermarks: Tell Spark how late data can be and still be counted. Without watermarks, Spark must keep all state forever. With watermark of 10 min, Spark can clean up state for windows older than max_event_time - 10 min.
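The watermark rule itself is simple arithmetic — a simplified sketch of when a window can be finalized (timestamps made up):

```python
from datetime import datetime, timedelta

# Simplified watermark arithmetic (mirrors the rule Spark applies):
# watermark = max event time seen so far - allowed lateness
events = [datetime(2026, 3, 30, 10, 0),
          datetime(2026, 3, 30, 10, 7),
          datetime(2026, 3, 30, 10, 12)]
allowed_lateness = timedelta(minutes=10)
watermark = max(events) - allowed_lateness  # 10:12 - 10 min = 10:02

# A window ending at 10:00 can be finalized (and its state dropped)
# once the watermark has passed its end boundary
window_end = datetime(2026, 3, 30, 10, 0)
can_close = watermark >= window_end
print(watermark, can_close)
```

Events that arrive with an event time older than the watermark land behind a closed window and are dropped — that is the trade-off between state size and tolerance for lateness.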
Q17. What is a streaming join? What challenges arise?
# Stream-stream join (both sides are unbounded)
from pyspark.sql.functions import expr
orders_stream = spark.readStream.format("kafka").option("subscribe", "orders").load()
payments_stream = spark.readStream.format("kafka").option("subscribe", "payments").load()
# Join with watermarks — required for stream-stream joins
orders_w = orders_stream.withWatermark("order_time", "1 hour")
payments_w = payments_stream.withWatermark("payment_time", "1 hour")
result = orders_w.join(
    payments_w,
    expr("""order_id = payment_order_id AND
            payment_time BETWEEN order_time AND order_time + INTERVAL 1 HOUR""")
)
# Without watermarks: state grows unboundedly (OOM)
# With watermarks: Spark discards old unmatched events after watermark passes
# Stream-static join (static side fully loaded — no streaming state needed)
products = spark.read.format("delta").load("s3://gold/dim_products/")  # static
enriched = orders_stream.join(products, "product_id")
Challenges:
- State management: Unmatched events must be buffered until they're matched or watermark expires
- Out-of-order events: Watermarks handle this with a delay threshold
- Outer joins with streams: Only supported with watermarks in Spark
- Memory: State store (RocksDB in Spark 3.x) can grow large for slow streams
Q18. How do you implement data quality checks in a pipeline?
# Great Expectations — industry standard for data quality (2026)
import great_expectations as gx
from great_expectations.checkpoint import SimpleCheckpoint
context = gx.get_context()
# Define expectations (business rules as code)
suite = context.create_expectation_suite("orders_suite", overwrite_existing=True)
validator = context.get_validator(batch_request=batch_request, expectation_suite=suite)
# Column-level expectations
validator.expect_column_to_exist("order_id")
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("amount", min_value=0, max_value=1000000)
validator.expect_column_values_to_be_in_set("status",
{"pending", "confirmed", "shipped", "delivered", "cancelled"})
# Table-level expectations
validator.expect_table_row_count_to_be_between(min_value=100, max_value=1000000)
validator.expect_column_pair_values_a_to_be_greater_than_b("delivery_date", "order_date")
# Format validation via regex
validator.expect_column_values_to_match_regex("phone", r"^\+\d{10,15}$")
# Statistical expectations (detect data drift)
validator.expect_column_mean_to_be_between("amount", min_value=500, max_value=5000)
validator.expect_column_stdev_to_be_between("amount", min_value=100, max_value=10000)
validator.save_expectation_suite()
# Run in pipeline
checkpoint = SimpleCheckpoint(name="orders_checkpoint", data_context=context)
result = checkpoint.run()
if not result.success:
alert_on_call("Data quality check failed!")
raise Exception("Halting pipeline: data quality failure")
Beyond GE — dbt tests (SQL-native):
# dbt_project/models/staging/stg_orders.yml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: amount
        tests:
          - dbt_utils.not_constant
          - dbt_utils.at_least_one
Q19. What is Delta Lake and what problems does it solve?
from delta.tables import DeltaTable
from pyspark.sql.functions import col
# Problem 1: ACID transactions on data lakes
# Without Delta: multiple writers corrupt files; reads during write return partial data
# With Delta: transaction log (JSON/_delta_log/) ensures ACID
# Problem 2: Schema enforcement & evolution
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("s3://data-lake/orders/")
# mergeSchema=true permits compatible schema evolution on append
# Problem 3: Upserts (MERGE / CDC)
delta_table = DeltaTable.forPath(spark, "s3://data-lake/orders/")
delta_table.alias("target").merge(
updates_df.alias("source"),
"target.order_id = source.order_id"
).whenMatchedUpdate(set={
"status": "source.status",
"updated_at": "source.updated_at"
}).whenNotMatchedInsertAll().execute()
# Problem 4: Time travel (query historical versions)
df_yesterday = spark.read.format("delta") \
.option("timestampAsOf", "2026-03-29") \
.load("s3://data-lake/orders/")
df_version_5 = spark.read.format("delta") \
.option("versionAsOf", 5) \
.load("s3://data-lake/orders/")
# Problem 5: Audit trail
delta_table.history().show() # see all versions and operations
# Problem 6: Data compaction (small files problem)
delta_table.optimize().executeCompaction() # merge small files
delta_table.vacuum(retentionHours=168) # delete old versions (7 days)
Q20. What is Change Data Capture (CDC)? How do you implement it?
Approaches:
| Approach | Method | Latency | Impact on Source |
|---|---|---|---|
| Log-based CDC | Read DB transaction logs (WAL) | Near real-time | Minimal |
| Trigger-based | DB triggers write to audit table | Low | High (per-row overhead) |
| Timestamp-based | Poll for records where updated_at > last_poll | Minutes | Moderate |
| Snapshot-based | Full table scan, compare to previous | Hours | High |
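The timestamp-based row of the table is simple enough to sketch end-to-end, with SQLite standing in for the source database (data hypothetical):

```python
import sqlite3

# Timestamp-based CDC sketch: poll for rows changed since a high-water mark.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INT, status TEXT, updated_at TEXT);
    INSERT INTO orders VALUES
      (1, 'shipped',   '2026-03-30 08:00:00'),
      (2, 'confirmed', '2026-03-30 09:30:00'),
      (3, 'pending',   '2026-03-30 10:15:00');
""")
last_poll = '2026-03-30 09:00:00'  # watermark persisted from the previous run
changed = conn.execute(
    "SELECT order_id FROM orders WHERE updated_at > ? ORDER BY updated_at",
    (last_poll,)).fetchall()
print([r[0] for r in changed])  # only rows touched since the last poll
# Caveats: misses hard deletes and any intermediate states between polls —
# the reasons log-based CDC (Debezium) is usually preferred.
```

Each run then advances `last_poll` to the max `updated_at` it saw, which is why clock skew on the source can silently drop rows with this approach.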
Log-based CDC with Debezium (industry standard):
# Debezium Postgres connector — registered by POSTing this JSON to the
# Kafka Connect REST API (/connectors); values illustrative
{
  "name": "prod-postgres-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "replicator",
    "database.password": "${DB_PASSWORD}",
    "database.dbname": "production",
    "database.server.name": "prod-postgres",
    "table.include.list": "public.orders,public.customers",
    "plugin.name": "pgoutput",
    "publication.autocreate.mode": "filtered"
  }
}
# Consuming CDC events from Kafka
from pyspark.sql.functions import col, from_json, when
from pyspark.sql.types import StructType, StructField, StringType, LongType
cdc_schema = StructType([
StructField("op", StringType()), # c=create, u=update, d=delete, r=read
StructField("before", schema), # row state before change
StructField("after", schema), # row state after change
StructField("ts_ms", LongType())
])
cdc_df = spark.readStream.format("kafka") \
.option("subscribe", "prod-postgres.public.orders") \
.load() \
.selectExpr("CAST(value AS STRING)") \
.select(from_json(col("value"), cdc_schema).alias("cdc")) \
.select("cdc.*")
# Apply changes to Delta Lake
def process_cdc_batch(batch_df, batch_id):
    delta_table = DeltaTable.forPath(spark, "s3://silver/orders/")
    inserts = batch_df.filter(col("op").isin(["c", "r"])).select("after.*")
    updates = batch_df.filter(col("op") == "u").select("after.*")
    deletes = batch_df.filter(col("op") == "d").select("before.order_id")
    # Merge inserts/updates; handle deletes separately (soft- or hard-delete)
    delta_table.alias("t").merge(
        inserts.union(updates).alias("s"), "t.order_id = s.order_id"
    ).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()
cdc_df.writeStream.foreachBatch(process_cdc_batch).start()
Q21. What is the difference between Flink and Spark Streaming?
| Aspect | Apache Flink | Spark Structured Streaming |
|---|---|---|
| Processing model | True streaming (event-at-a-time) | Micro-batch (continuous small batches) |
| Latency | Low milliseconds | ~100 ms to seconds typical |
| State management | Native RocksDB state backend | Managed state store |
| Event time | First-class citizen | Well-supported (watermarks) |
| Batch + streaming | Unified (bounded streams run as batch) | Unified DataFrame API; micro-batch engine underneath |
| Window types | Tumbling, sliding, session, global | Tumbling, sliding, watermark-based |
| Maturity | Very mature, production-proven | Mature, widely adopted |
| Ecosystem | Better for complex streaming pipelines | Better for unified batch + streaming on Databricks |
2026 landscape: Flink dominates pure streaming use cases (fraud detection, real-time recommendations, trading). Spark Structured Streaming dominates in Databricks-heavy shops due to Delta Lake integration.
# Flink Python API (PyFlink)
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment
env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)
t_env.execute_sql("""
CREATE TABLE kafka_orders (
order_id STRING,
amount DOUBLE,
event_time TIMESTAMP(3),
WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'orders',
'properties.bootstrap.servers' = 'kafka:9092',
'format' = 'json'
)
""")
result = t_env.sql_query("""
SELECT
TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
COUNT(*) AS order_count,
SUM(amount) AS total_amount
FROM kafka_orders
GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
Q22. How do you handle late-arriving data in streaming pipelines?
# Strategies for late data:
# 1. Watermarks (Spark/Flink) — define maximum lateness
events.withWatermark("event_time", "2 hours") # accept data up to 2h late
# Data arriving > 2h late is dropped/ignored
# 2. Reprocessing with Lambda architecture:
# Streaming handles recent data; daily batch job reprocesses last 3 days
# Delta Lake time travel enables targeted reprocessing
# 3. Idempotent MERGE for late data (Delta Lake)
from delta.tables import DeltaTable

def handle_late_data(late_df):
    DeltaTable.forPath(spark, "s3://gold/daily_metrics/") \
        .alias("t").merge(
            late_df.alias("s"),
            "t.date = s.date AND t.user_id = s.user_id"
        ).whenMatchedUpdate(set={
            "event_count": "t.event_count + s.event_count",
            "updated_at": "current_timestamp()"
        }).whenNotMatchedInsertAll().execute()
# 4. Allowed lateness (Flink DataStream API):
# Flink keeps window state for extra time after the watermark passes
# so late records can still update already-emitted results:
#
#   stream.key_by(lambda e: e.user_id)
#         .window(TumblingEventTimeWindows.of(Time.minutes(5)))
#         .allowed_lateness(...)            # e.g. a 10-minute grace period
#         .side_output_late_data(late_tag)  # anything later goes to a side output
#
# (The Table API exposes the same idea via the table.exec.emit.allow-lateness option.)
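The watermark mechanics behind strategy 1 can be illustrated in plain Python. This is a toy sketch of my own (not a Spark or Flink API): the watermark tracks the maximum event time seen so far, and any record further behind it than the allowed lateness is dropped.

```python
def apply_watermark(events, max_lateness_ms):
    """events: list of event_time_ms values in arrival order.
    Keep an event only if it is no more than max_lateness_ms behind
    the maximum event time seen so far (the watermark)."""
    max_event_time = float("-inf")
    kept, dropped = [], []
    for t in events:
        max_event_time = max(max_event_time, t)  # watermark advances with max event time
        if t >= max_event_time - max_lateness_ms:
            kept.append(t)
        else:
            dropped.append(t)  # arrived later than the watermark allows
    return kept, dropped

kept, dropped = apply_watermark([100, 200, 150, 10], max_lateness_ms=100)
# 150 is late but within the 100ms bound; 10 is too far behind and is dropped
```

This also explains the trade-off interviewers probe: a larger lateness bound means less data loss but more state held in memory.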
Q23. What is data lineage and why does it matter?
# Apache Atlas — open-source data lineage
# OpenLineage — open standard for lineage (used by Airflow, Spark, dbt)
# OpenLineage with Airflow: install the openlineage-airflow package;
# its plugin registers a listener automatically and captures
# DAG name, task inputs/outputs, and run metadata
# dbt lineage (automatic)
dbt docs generate # generates lineage graph
dbt docs serve # visual DAG of all models and their dependencies
# Custom lineage in PySpark
from uuid import uuid4
from datetime import datetime, timezone
from openlineage.client import OpenLineageClient
from openlineage.client.run import RunEvent, RunState, Run, Job, Dataset

client = OpenLineageClient.from_environment()
run = Run(runId=str(uuid4()))
job = Job(namespace="my-spark-jobs", name="daily-sales-transform")
# Emit lineage events
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run, job=job,
    producer="my-pipeline",
    inputs=[Dataset(namespace="s3://bronze", name="orders/")],
    outputs=[Dataset(namespace="s3://silver", name="orders/")]
))
Why lineage matters in 2026:
- Impact analysis: "If I change this table, what downstream dashboards break?"
- GDPR/data deletion: "Where does user_id 12345's data flow? Delete it everywhere."
- Debugging: "Why is this dashboard showing wrong numbers?" → trace to source
- Compliance: Auditors require proof of data governance
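Impact analysis over a lineage graph is, at its core, graph reachability. A minimal sketch (the table and dashboard names are hypothetical, and this is not an OpenLineage API) of answering "if I change this table, what breaks downstream?":

```python
from collections import deque

def downstream_impact(lineage, changed_table):
    """lineage: dict mapping asset -> list of direct downstream consumers.
    Returns every asset transitively affected by a change to changed_table."""
    affected, queue = set(), deque([changed_table])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)  # breadth-first walk of the dependency graph
    return affected

lineage = {
    "bronze.orders": ["silver.orders"],
    "silver.orders": ["gold.daily_sales", "gold.user_ltv"],
    "gold.daily_sales": ["dashboard.revenue"],
}
impact = downstream_impact(lineage, "bronze.orders")
# every silver/gold table and the revenue dashboard are affected
```

Real lineage tools (DataHub, OpenLineage backends) run this same traversal over metadata collected from Airflow, Spark, and dbt runs.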
Q24. What are the common causes of Spark OOM errors and how do you fix them?
| Cause | Symptom | Fix |
|---|---|---|
| Skewed data | One task takes 100x longer | Salting, skew join hint, AQE |
| Cartesian product | Memory explodes | Review join keys, avoid crossJoin |
| Collecting large data | df.toPandas() on large DF | Use Spark aggregations, sample first |
| Too much caching | Memory pressure | Unpersist when done, use MEMORY_AND_DISK |
| Driver memory | Driver OOM on collect() | Increase driver memory, avoid collect() on large data |
| Too many small files | Metadata overhead | Coalesce before write, Delta OPTIMIZE |
# Fix 1: Increase executor memory
spark = SparkSession.builder \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.memoryOverhead", "2g") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()
# Fix 2: More, smaller shuffle partitions (spilling to disk is automatic in modern Spark)
spark.conf.set("spark.sql.shuffle.partitions", "400")  # more partitions = smaller each
# Fix 3: AQE handles skew automatically (Spark 3.x)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
# Fix 4: Avoid collect() on large DataFrames
# Bad:
rows = df.collect() # brings everything to driver — OOM risk
# Good:
df.write.parquet("s3://output/") # distributed write
# Or sample:
sample = df.sample(0.01).collect()
Q25. What is data mesh and how does it differ from a traditional data platform?
| Aspect | Traditional Data Platform | Data Mesh |
|---|---|---|
| Ownership | Central data team owns all pipelines | Domain teams own their data products |
| Architecture | Monolithic data warehouse/lake | Distributed, domain-oriented |
| Data discovery | Central catalog | Federated catalog |
| Governance | Central policies | Federated governance with global standards |
| Scalability | Bottlenecked by central team | Scales with number of domains |
| Consistency | High (centralized) | Requires standards for interoperability |
Four principles:
- Domain ownership: Orders domain team owns orders data product
- Data as a product: Data products have SLAs, documentation, versioning
- Self-serve platform: Domain teams can publish/consume without central team
- Federated governance: Global standards (schemas, quality, privacy) enforced locally
2026 tools: Datahub, OpenMetadata, Amundsen (data catalogs); dbt (transformations per domain); Snowflake/Databricks with workspace separation; OpenLineage (lineage federation).
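"Data as a product" becomes concrete when a domain team runs a contract check before publishing a dataset version. A minimal sketch (the contract format and field names here are my own invention, not any specific tool's schema) of validating rows against a published contract:

```python
def validate_contract(rows, contract):
    """Check each row against a contract of {column: (type, nullable)}.
    Returns a list of violation messages (empty list = contract satisfied)."""
    violations = []
    for i, row in enumerate(rows):
        for column, (expected_type, nullable) in contract.items():
            value = row.get(column)
            if value is None:
                if not nullable:
                    violations.append(f"row {i}: {column} is null but non-nullable")
            elif not isinstance(value, expected_type):
                violations.append(f"row {i}: {column} expected {expected_type.__name__}")
    return violations

# the producing domain team owns and versions this contract
contract = {"order_id": (str, False), "amount": (float, False)}
rows = [{"order_id": "o1", "amount": 9.99}, {"order_id": "o2", "amount": None}]
issues = validate_contract(rows, contract)  # one violation: null amount in row 1
```

In a mesh, the federated-governance layer standardizes the contract format; each domain enforces it locally in CI before a data product release.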
HARD — Expert-Level Topics (Questions 26-50)
These questions are asked for senior data engineer and data platform architect roles (Rs 30-60+ LPA). If you can discuss data skew solutions, streaming exactly-once semantics, and lakehouse optimization at this level, you're in elite territory.
Q26. Design a real-time ML feature store.
[Batch Feature Pipeline (daily)]
↓ (PySpark jobs)
Feature Sources: [Offline Store (Delta Lake / Redshift)]
- Clickstream ↑ training data
- Transactions [Feature Registry (metadata)]
- User profiles ↓ sync recent features
- Product catalog [Online Store (Redis/Cassandra)]
↑ inference (< 5ms)
[Real-time Pipeline (Flink/Kafka)]
↑ (streaming features: last 1h activity)
# Feast-style definitions (a sketch — the exact source class depends on your offline store)
from datetime import timedelta
from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Float32, Int64, String

user = Entity(name="user_id", join_keys=["user_id"])

# Define feature view
user_features = FeatureView(
    name="user_features",
    entities=[user],
    schema=[
        Field(name="total_purchases_30d", dtype=Float32),
        Field(name="avg_order_value_30d", dtype=Float32),
        Field(name="days_since_last_purchase", dtype=Int64),
        Field(name="preferred_category", dtype=String),
    ],
    online=True,  # materialize to the online store
    source=FileSource(path="s3://gold/user_features/"),  # swap for your warehouse/Spark source
    ttl=timedelta(days=7),
)

store = FeatureStore(repo_path="feature_repo/")

# Training: point-in-time correct feature retrieval
training_df = store.get_historical_features(
    entity_df=labeled_events,  # (user_id, event_timestamp, label)
    features=["user_features:total_purchases_30d",
              "user_features:avg_order_value_30d"],
).to_df()

# Inference: online serving
features = store.get_online_features(
    features=["user_features:total_purchases_30d"],
    entity_rows=[{"user_id": 12345}],
).to_dict()
Critical design decision: Point-in-time correctness for training — the feature value at training time must reflect what was available when the label was created, not today's value. This prevents data leakage.
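The point-in-time rule can be made explicit in a few lines. This is a toy sketch of my own (not a Feast API): for each labeled event, take the newest feature value whose timestamp is at or before the label timestamp, so training never sees a value from the future.

```python
def point_in_time_join(labels, feature_history):
    """labels: list of (entity_id, label_ts).
    feature_history: dict entity_id -> list of (feature_ts, value), sorted by ts.
    For each label, use the newest feature value with feature_ts <= label_ts."""
    joined = []
    for entity_id, label_ts in labels:
        value = None
        for feature_ts, v in feature_history.get(entity_id, []):
            if feature_ts <= label_ts:
                value = v  # keep advancing while still in the past
            else:
                break      # anything later would be leakage
        joined.append((entity_id, label_ts, value))
    return joined

history = {"u1": [(10, 1.0), (20, 2.0), (30, 3.0)]}
rows = point_in_time_join([("u1", 25), ("u1", 5)], history)
# the label at t=25 sees the t=20 value, never the t=30 one;
# the label at t=5 predates all features, so its value is None
```

Production feature stores implement exactly this as an "as-of" join over the offline store, typically with a window-function query rather than a Python loop.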
Q27. What is the Iceberg table format? How does it handle schema evolution and partition evolution?
# Apache Iceberg — Netflix's table format, now a top-level Apache project
from pyiceberg.catalog import load_catalog
catalog = load_catalog("default", **{"uri": "http://rest-catalog:8181"})
# Schema evolution (add columns without rewriting data)
from pyiceberg.types import StringType

table = catalog.load_table("db.orders")
with table.update_schema() as update:
    update.add_column("shipping_method", StringType())  # add new column
    update.rename_column("amount", "gross_amount")      # rename column
# Old Parquet files unchanged; schema mapping handles it

# Partition evolution (change partitioning strategy without rewriting)
from pyiceberg.transforms import DayTransform

with table.update_spec() as update:
    update.add_field("order_date", DayTransform(), "day")  # switch from month to day
# Old partitions (monthly) still readable; new writes use daily partitions

# Hidden partitioning (users don't think about partitions)
# Users query: WHERE order_date BETWEEN '2026-01-01' AND '2026-03-31'
# Iceberg prunes partitions automatically — queries never reference partition columns

# Time travel
table.scan(snapshot_id=3823904823).to_arrow()  # read a specific snapshot
# For "as of a timestamp": look up the matching snapshot in table.history(),
# then scan by its snapshot_id
Iceberg vs Delta Lake vs Hudi:
| Feature | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Creator | Netflix | Databricks | Uber |
| Partition evolution | Yes | No | Limited |
| Schema evolution | Best | Good | Good |
| ACID transactions | Yes | Yes | Yes |
| Multi-engine support | Best (Spark, Flink, Trino, DuckDB) | Good | Good |
| Hidden partitioning | Yes | No | No |
Q28. How do you build a streaming data quality monitoring system?
from pyspark.sql.functions import col, isnan, isnull, stddev, mean, percentile_approx

def compute_quality_metrics(df, table_name, partition_date):
    """Compute comprehensive quality metrics for a batch"""
    metrics = {}
    total_rows = df.count()
    metrics["row_count"] = total_rows
    for col_name in df.columns:
        col_metrics = {}
        dtype = dict(df.dtypes)[col_name]
        # Null rate (isnan only applies to floating-point columns)
        if dtype in ("double", "float"):
            null_count = df.filter(isnull(col(col_name)) | isnan(col(col_name))).count()
        else:
            null_count = df.filter(isnull(col(col_name))).count()
        col_metrics["null_rate"] = null_count / total_rows
        # Cardinality
        col_metrics["cardinality"] = df.select(col_name).distinct().count()
        if dtype in ("double", "float", "int", "bigint"):
            stats = df.select(
                mean(col_name).alias("mean"),
                stddev(col_name).alias("stddev"),
                percentile_approx(col_name, 0.5).alias("p50"),
                percentile_approx(col_name, 0.99).alias("p99")
            ).collect()[0]
            col_metrics.update(stats.asDict())
        metrics[col_name] = col_metrics
    # Write metrics to a monitoring table (use c, not col, to avoid shadowing col())
    metrics_df = spark.createDataFrame([
        {"table": table_name, "date": partition_date,
         "column": c, "metric": k, "value": float(v)}
        for c, col_metrics in metrics.items() if isinstance(col_metrics, dict)
        for k, v in col_metrics.items() if isinstance(v, (int, float))
    ])
    metrics_df.write.format("delta").mode("append").save("s3://monitoring/dq-metrics/")
# Anomaly detection on metrics
def detect_anomalies(metric_name, table_name, window_days=30):
    historical = spark.read.format("delta").load("s3://monitoring/dq-metrics/") \
        .filter((col("table") == table_name) & (col("metric") == metric_name)) \
        .orderBy("date", ascending=False).limit(window_days)
    stats = historical.select(mean("value").alias("hist_mean"),
                              stddev("value").alias("hist_std")).collect()[0]
    latest = historical.orderBy("date", ascending=False).first()["value"]
    z_score = abs(latest - stats["hist_mean"]) / (stats["hist_std"] + 1e-10)
    if z_score > 3:
        # alert() is your paging/Slack hook
        alert(f"Anomaly detected: {table_name}.{metric_name} = {latest} "
              f"(z-score={z_score:.2f})")
Q29. What is exactly-once processing in Kafka? How do you implement it?
from confluent_kafka import Producer, Consumer

# Producer-side: Idempotent producer (enable.idempotence=true)
# + Transactional producer for multi-partition atomic writes
producer = Producer({
    'bootstrap.servers': 'kafka:9092',
    'enable.idempotence': True,               # prevents duplicate messages on retry
    'transactional.id': 'order-processor-1',  # unique per producer instance
    'acks': 'all',
    'retries': 2147483647
})
producer.init_transactions()
try:
    producer.begin_transaction()
    # Produce to multiple topics atomically
    producer.produce('orders-processed', key='order_123', value=processed_order)
    producer.produce('inventory-updates', key='product_456', value=inventory_delta)
    # Commit offsets and messages atomically
    producer.send_offsets_to_transaction(offsets, consumer_group_metadata)
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()

# Consumer-side: read_committed isolation
consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'my-consumer-group',
    'isolation.level': 'read_committed',  # only see committed transactions
    'enable.auto.commit': False
})
Idempotent sink with database:
# Use INSERT ... ON CONFLICT DO NOTHING (upsert)
# Natural key or message_id as deduplication key
cursor.execute("""
INSERT INTO processed_events (event_id, user_id, action, processed_at)
VALUES (%s, %s, %s, NOW())
ON CONFLICT (event_id) DO NOTHING
""", (event_id, user_id, action))
conn.commit()
consumer.commit() # commit offset only after DB commit
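The ON CONFLICT pattern above can be demonstrated in memory. This is a toy idempotent sink of my own (a dict standing in for the database) showing why redelivery under at-least-once semantics becomes harmless once writes are keyed by event_id:

```python
def idempotent_write(store, event_id, payload):
    """Insert-if-absent keyed by event_id: redelivering the same event
    (at-least-once semantics) leaves the store unchanged -- the in-memory
    analogue of INSERT ... ON CONFLICT (event_id) DO NOTHING."""
    if event_id in store:
        return False  # duplicate — already processed, no-op
    store[event_id] = payload
    return True

store = {}
first = idempotent_write(store, "evt-1", {"amount": 10})  # processed
dup = idempotent_write(store, "evt-1", {"amount": 10})    # redelivered, skipped
```

The interview takeaway: exactly-once "processing" is usually at-least-once delivery plus an idempotent, deduplicating sink, with the offset commit ordered after the sink commit.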
Q30. What is data skew in distributed processing and all strategies to address it?
# Diagnose skew: check partition sizes
from pyspark.sql.functions import spark_partition_id

df.groupBy(spark_partition_id().alias("partition")).count() \
  .orderBy("count", ascending=False).show()
# If one partition has 10M rows and others have 100K, you have skew
# Strategy 1: Broadcast join (if skewed side is small dimension table)
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_lookup), "key")
# Strategy 2: Salting (distribute skewed key)
from pyspark.sql.functions import concat, lit, floor, rand, col, explode, array
NUM_SALTS = 20
# On the large (skewed) side: randomly append salt
large_salted = large_df.withColumn("salt", (rand() * NUM_SALTS).cast("int")) \
.withColumn("salted_key", concat(col("skewed_key"),
lit("_"), col("salt")))
# On the small side: expand with all possible salts
other_expanded = other_df.withColumn("salt", explode(array([lit(i) for i in range(NUM_SALTS)]))) \
.withColumn("salted_key", concat(col("skewed_key"),
lit("_"), col("salt")))
result = large_salted.join(other_expanded, "salted_key").drop("salt", "salted_key")
# Strategy 3: Adaptive Query Execution (Spark 3.x) — automatic
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Strategy 4: Pre-aggregate (reduce volume before shuffle)
# Bad: 1B rows → group → 10M groups
# Good: pre-aggregate per partition first → 10M partial aggregates → group → 10M groups
from pyspark.sql.functions import sum as spark_sum
df.groupBy("user_id").agg(spark_sum("amount"))  # Spark 3.x does partial (map-side) aggregation automatically
# Strategy 5: Custom partitioner (if you know the distribution)
# Consistent hash partitioning with more buckets for hot keys
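The salting math from strategy 2 is easy to verify locally. A pure-Python sketch of my own (using a deterministic round-robin salt instead of Spark's `rand()`, so the result is reproducible) showing how one hot key fans out across NUM_SALTS buckets while the join still matches every row:

```python
NUM_SALTS = 4

def salt_large(rows):
    """Append a salt to each large-side key (round-robin here; random in Spark)
    so a single hot key spreads across NUM_SALTS shuffle partitions."""
    return [(f"{key}_{i % NUM_SALTS}", payload) for i, (key, payload) in enumerate(rows)]

def expand_small(rows):
    """Replicate each small-side row once per salt so every salted key finds a match."""
    return [(f"{key}_{s}", payload) for key, payload in rows for s in range(NUM_SALTS)]

large = [("hot", n) for n in range(8)]  # one skewed key carrying all the rows
small = [("hot", "dim_row")]            # the dimension row it joins against
salted, expanded = salt_large(large), expand_small(small)
lookup = dict(expanded)
joined = [(payload, lookup[k]) for k, payload in salted if k in lookup]
# all 8 rows still join, but spread across 4 distinct salted keys
```

The cost of salting is the small-side replication factor (NUM_SALTS copies), which is why it's only worth it when the small side really is small.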
Data Engineering FAQs — No-BS Answers
Q: What is the modern data stack in 2026? A: Here's the exact stack most top-tier data teams run: Fivetran/Airbyte (ingestion) → Snowflake/Databricks (storage + processing) → dbt (transformations) → Airflow/Dagster/Prefect (orchestration) → Monte Carlo/Great Expectations (data quality) → Datahub/Atlan (catalog) → Looker/Metabase (BI). Add Apache Kafka for streaming. Know this stack cold — interviewers expect you to have opinions about each layer.
Q: PySpark vs pandas in 2026? A: Pandas for < 1GB data (data science notebooks, feature engineering on small samples). PySpark for anything larger. Polars (Rust-based, 10-100x faster than pandas for single-node) is increasingly popular for medium-scale (1GB-100GB) data. pandas 2.0 with PyArrow backend improved performance but Polars is still faster.
Q: What is the difference between Airflow, Dagster, and Prefect? A: All are orchestration tools. Airflow is the most mature and most widely used, but its configuration is heavyweight and testing DAGs locally is painful. Dagster prioritizes software-engineering best practices (type checking, testing, asset-first orchestration). Prefect has the best developer experience and cloud offering. In 2026, Dagster is gaining adoption in modern data teams; Airflow still dominates legacy shops.
Q: What is a data product? A: A dataset or service that is intentionally designed, maintained, and delivered with defined SLAs, documentation, quality guarantees, and a clear owner. A view in a dashboard is not a data product; a versioned, documented, quality-checked dataset with a contract is.
Q: What is Z-ordering in Delta Lake and when should you use it?
A: Z-ordering (also called Z-order clustering) reorganizes data within Delta files so that related data is co-located. It creates a multi-dimensional locality — sorting on two columns simultaneously (unlike ORDER BY which sorts one dimension). Use Z-ordering when queries filter on multiple columns. delta_table.optimize().executeZOrderBy("user_id", "event_date").
Q: What is the small files problem and how do you solve it?
A: Streaming or frequent small batch writes create millions of tiny Parquet/ORC files. Each file requires metadata I/O → query planning is slow. Solutions: Delta Lake OPTIMIZE (auto-compaction), coalesce() before write, adjust spark.sql.shuffle.partitions, enable auto-optimization in Databricks.
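The compaction step behind OPTIMIZE can be pictured as bin-packing small files toward a target size. A simplified, hypothetical sketch (not Delta Lake's actual algorithm) that groups files into rewrite batches:

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily pack small files into groups whose combined size approaches
    target_mb; each group would be rewritten as one larger file."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            groups.append(current)  # close the current rewrite batch
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

plan = plan_compaction([5, 5, 60, 70, 10], target_mb=128)
# the four small files compact into one ~80MB file; the 70MB file stands alone
```

Fewer, larger files means less metadata I/O at query-planning time, which is the whole point of compaction.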
Q: How do you handle GDPR right-to-erasure in a data lake? A: Pure lakes with immutable Parquet files make deletion hard. Solutions: (1) Delta Lake delete + vacuum (removes file versions after retention), (2) Iceberg row-level deletes, (3) Encryption + key deletion (delete the key → data is unreadable), (4) Pseudonymization (replace PII with reversible token; delete mapping table for erasure).
Q: What are the most important skills for a data engineer to learn in 2026? A: In priority order: SQL (still the single most important skill — don't skip window functions and CTEs), Python, PySpark for large-scale processing, dbt for transformations, Kafka for streaming, Delta Lake/Iceberg for lakehouse, Airflow/Dagster for orchestration, and Great Expectations for data quality. The 2026 bonus skills that separate good from great: vector databases for AI pipelines, feature stores for MLOps, and streaming ML with Flink.
Keep leveling up your interview prep:
- AI/ML Interview Questions 2026 — ML fundamentals every data engineer should know
- System Design Interview Questions 2026 — Design the systems your pipelines feed
- Generative AI Interview Questions 2026 — The GenAI pipeline knowledge gap
- AWS Interview Questions 2026 — Cloud infrastructure for your data stack
- DevOps Interview Questions 2026 — CI/CD for data pipelines