Apache Spark Interview Questions 2026: 28 Answers with Code

What changed in 2026 drives
Mass-recruiter offer letters are flatter for 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.
What I'd actually study for this
- 01Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
- 02One real project you can defend end-to-end - file paths, design decisions, and what you would change
- 03One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
- 04Three behavioural STAR stories: failure recovered, conflict handled, ownership taken
Where most candidates trip up
The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.
Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.
Apache Spark is the de-facto standard for large-scale data processing, and Spark proficiency is a core requirement for data engineering, ML engineering, and senior data science roles. This guide covers 28 Apache Spark interview questions with full PySpark answers and code examples, from fundamentals to production performance tuning.
PapersAdda's take: Candidates report that shuffle optimization and partitioning questions eliminate more candidates than any other Spark topic. The most common live coding scenario: "rewrite this groupByKey operation to avoid a full shuffle". Understanding when Spark needs to move data across the network is the single most important concept for Spark interviews. Confirm the specific Spark version and platform (Databricks, EMR, Dataproc) expected on the official company careers portal before you prepare.
Related articles: Data Engineering Interview Questions 2026 | Kafka Interview Questions 2026 | Airflow Interview Questions 2026 | SQL for Data Analysts 2026 | MLOps Interview Questions 2026
EASY: Core Spark Concepts (Questions 1-8)
Q1. What is Apache Spark? How does it differ from MapReduce?
| Property | Hadoop MapReduce | Apache Spark |
|---|---|---|
| Processing | Disk-based (read/write between stages) | In-memory (cache intermediate results) |
| Speed | Slow (I/O bound) | Up to 100x faster for iterative workloads |
| Fault tolerance | Task re-execution, data replication | RDD lineage: recompute lost partitions |
| APIs | Low-level Java | High-level: DataFrame, SQL, MLlib, Streaming |
| Ease of use | Complex, verbose | Concise Python/Scala/SQL API |
| Streaming | Batch only | Micro-batch + true streaming (Structured Streaming) |
Spark's key advantages:
- In-memory processing: DataFrames and RDDs can be cached in memory for repeated use.
- DAG execution: builds an optimal Directed Acyclic Graph of transformations before executing.
- Unified platform: batch, streaming, ML, graph processing in one framework.
Q2. What is the difference between RDD, DataFrame, and Dataset?
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("demo").getOrCreate()
# RDD: Resilient Distributed Dataset (low-level API)
# Untyped, no schema, no query optimization
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
rdd_result = rdd.map(lambda x: x * 2).filter(lambda x: x > 4).collect()
# DataFrame: distributed table with named columns and schema
# SQL optimizer (Catalyst), in-memory columnar format
df = spark.createDataFrame([(1, "Alice", 50000), (2, "Bob", 60000)],
["id", "name", "salary"])
df.filter(F.col("salary") > 55000).show()
# Dataset (Scala/Java only): typed DataFrame
# Not directly available in PySpark (DataFrame IS a Dataset[Row] under the hood)
print(f"RDD result: {rdd_result}")
df.printSchema()
| API | Language | Schema | Optimization | Use case |
|---|---|---|---|---|
| RDD | Any | None | None (manual) | Custom operations, unstructured data |
| DataFrame | Python/Scala/Java/R | Yes | Catalyst optimizer | Structured data, SQL-like operations |
| Dataset | Scala/Java | Yes (typed) | Catalyst + compile-time type safety | Typed transformations in Scala/Java |
Q3. What is lazy evaluation in Spark? What is an action vs a transformation?
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("lazy_eval").getOrCreate()
# Lazy evaluation: transformations build a DAG but do NOT execute
# Execution happens only when an ACTION is called
df = spark.range(1_000_000)
# TRANSFORMATIONS (lazy -- builds DAG, no execution)
df_filtered = df.filter(F.col("id") % 2 == 0) # narrow transformation
df_mapped = df_filtered.withColumn("value", F.col("id") * 2) # narrow
# Nothing has been computed yet!
# ACTIONS (trigger execution)
count = df_mapped.count() # triggers full DAG execution
first_row = df_mapped.first() # triggers partial execution
rows = df_mapped.take(10) # triggers partial execution
df_mapped.write.parquet("/tmp/output") # triggers full execution
# Common transformations (lazy):
# .filter(), .select(), .withColumn(), .groupBy(), .join(),
# .orderBy(), .distinct(), .union(), .limit(), .map() (RDD)
# Common actions (eager):
# .count(), .collect(), .take(n), .first(), .show(), .write, .save()
# .reduce() (RDD), .foreach() (RDD)
# Benefits of lazy evaluation:
# 1. Catalyst can optimize the full plan (predicate pushdown, column pruning)
# 2. Avoid unnecessary computation when results not needed
# 3. Pipeline transformations without intermediate materializations
Q4. What is a DAG in Spark? What is lineage?
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("dag_demo").getOrCreate()
df = spark.read.parquet("/data/orders")
result = (df
.filter(F.col("status") == "completed") # Stage 1 begin
.groupBy("user_id") # Wide transformation (shuffle)
.agg(F.sum("amount").alias("total")) # Stage 2 begin
.filter(F.col("total") > 1000) # Stage 2 continue
.orderBy(F.col("total").desc()) # Wide transformation (sort shuffle)
)
# View the DAG (execution plan)
result.explain(mode="extended")
# Shows: Parsed / Analyzed / Optimized / Physical plan
# Lineage: chain of transformations from source to result
# Spark tracks lineage to recompute lost partitions (fault tolerance)
# No replication needed like HDFS -- just re-execute the subset of the DAG
# Stages: separated by wide transformations (shuffle boundaries)
# Stage 1: filter -> stop (before groupBy shuffle)
# Stage 2: groupBy aggregate -> filter -> stop (before orderBy shuffle)
# Stage 3: orderBy
Q5. What is the difference between narrow and wide transformations?
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("narrow_wide").getOrCreate()
df = spark.range(1_000_000).withColumn("value", F.rand())
# NARROW transformations: each partition produces exactly one output partition
# No data movement across the network
narrow_ops = [
df.filter(F.col("value") > 0.5), # map/filter within partition
df.withColumn("doubled", F.col("id") * 2), # column operations
df.select("id", "value"), # projection
df.union(df), # if same partitioning
]
# WIDE transformations: partition i may need data from ANY partition j
# Causes a SHUFFLE (expensive: serialize, write to disk, transfer over network, deserialize)
wide_ops_examples = [
"df.groupBy('key').agg(...)", # shuffle to group same keys
"df.orderBy('column')", # global sort: shuffle to range partitioner
"df.join(df2, 'key')", # shuffle-hash join or sort-merge join
"df.distinct()", # shuffle to deduplicate
"df.repartition(200)", # explicit repartition
]
# Why shuffles are expensive:
# - Data serialized + written to disk on each executor
# - Network transfer between executors/nodes
# - Data deserialized + read from disk on destination executor
# - Can be the single biggest bottleneck in a Spark job
Q6. What is the difference between repartition and coalesce?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("partition_demo").getOrCreate()
df = spark.range(1_000_000)
print(f"Initial partitions: {df.rdd.getNumPartitions()}") # typically ~200
# repartition(N): full shuffle -- can increase OR decrease partitions
# Use when: increasing partitions, or need even data distribution
df_repartitioned = df.repartition(400)
print(f"After repartition(400): {df_repartitioned.rdd.getNumPartitions()}") # 400
# repartition on a column: co-locate same key values in same partition
df_by_key = df.repartition(50, "id") # hash partition on id into 50 partitions
# Benefit: subsequent joins/groupBys on id column avoid shuffle
# coalesce(N): narrow transformation -- only DECREASES partitions
# Avoids full shuffle by merging adjacent partitions
# Use when: reducing output file count, post-filter cleanup
df_small = df.filter(df.id < 1000) # now mostly empty partitions
df_coalesced = df_small.coalesce(10) # merge to 10 without full shuffle
print(f"After coalesce(10): {df_coalesced.rdd.getNumPartitions()}") # 10
# Rule of thumb:
# Reducing partitions? Use coalesce (faster, no shuffle)
# Increasing partitions or need balanced distribution? Use repartition
# Target: ~128MB per partition (configurable via spark.sql.files.maxPartitionBytes)
Q7. What is caching in Spark? When should you cache?
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark import StorageLevel
spark = SparkSession.builder.appName("cache_demo").getOrCreate()
df = spark.read.parquet("/data/large_dataset")
# WITHOUT caching: each action re-reads and re-computes from scratch
# If df is used 3 times, parquet is read 3 times
# WITH caching: data is stored in memory after first materialization
df_filtered = df.filter(F.col("status") == "active")
# Cache in memory (default)
df_filtered.cache() # or df_filtered.persist()
# After the first action, data is in memory
count = df_filtered.count() # triggers read + cache
result1 = df_filtered.groupBy("dept").count() # reads from memory (fast)
result2 = df_filtered.agg(F.sum("salary")) # reads from memory (fast)
# Different storage levels
df_filtered.persist(StorageLevel.MEMORY_ONLY) # default: fail if OOM
df_filtered.persist(StorageLevel.MEMORY_AND_DISK) # spill to disk if OOM
df_filtered.persist(StorageLevel.MEMORY_ONLY_SER) # serialized (less memory, slower)
df_filtered.persist(StorageLevel.DISK_ONLY) # disk only (slow)
df_filtered.persist(StorageLevel.MEMORY_AND_DISK_2) # replicated (fault tolerant)
# Remove from cache
df_filtered.unpersist()
# WHEN to cache:
# 1. DataFrame used in 2+ subsequent actions
# 2. Expensive computation (complex joins, aggregations) used repeatedly
# 3. Iterative ML algorithms (same data used each epoch)
# WHEN NOT to cache:
# 1. Used only once
# 2. Data too large for available memory (causes spilling, may hurt performance)
# 3. Simple sequential pipeline
Q8. What is a Spark shuffle? What are the main performance problems it causes?
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("shuffle").getOrCreate()
# Shuffle anatomy:
# 1. Map side: partition data by output partition key (hash)
# 2. Write shuffle files to local disk (one file per output partition per mapper)
# 3. Network transfer: reducers fetch their partition from all mappers
# 4. Reduce side: sort + merge data within partition
# Shuffle bottlenecks:
# - Disk I/O for writing/reading shuffle files
# - Network bandwidth for data transfer
# - Memory pressure: spilling when shuffle buffer full
# PROBLEM: Data skew (most data maps to one key)
# Example: 90% of events are user_id=1
df = spark.range(1_000_000).withColumn(
"user_id",
F.when(F.col("id") < 900_000, 1).otherwise(F.col("id"))
)
# groupBy("user_id") sends 900K rows to one reducer -- that task takes 10x longer
# Solutions to data skew:
# 1. Salting: add random suffix to hot keys
df_salted = df.withColumn(
"user_id_salt",
F.concat(F.col("user_id"), F.lit("_"), (F.rand() * 10).cast("int"))
)
# Aggregate with salt, then aggregate again without salt
# 2. Broadcast join for small tables (avoids shuffle entirely)
small_df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
large_df = spark.range(10_000_000).withColumn("user_id", F.col("id") % 1000)
result = large_df.join(F.broadcast(small_df), large_df.user_id == small_df.id)
# 3. Increase partitions
spark.conf.set("spark.sql.shuffle.partitions", "400") # default is 200
MEDIUM: Performance Tuning and Operations (Questions 9-20)
Q9. What is the difference between reduceByKey and groupByKey?
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("reduce_vs_group").getOrCreate()
sc = spark.sparkContext
data = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5), ("a", 6)]
rdd = sc.parallelize(data, numSlices=4)
# groupByKey: sends ALL values for each key across the network
# Then reduces on reducer side
# Memory-intensive: all values for one key must fit in one executor's memory
result_group = rdd.groupByKey().mapValues(sum).collect()
# Network transfer: ALL (a, 1), (a, 3), (a, 6) are sent to one node
# reduceByKey: partially reduces values ON EACH PARTITION first
# Only partial results are shuffled -- much less data transferred
result_reduce = rdd.reduceByKey(lambda a, b: a + b).collect()
# Partition 1: (a,1) + (a,3) -> (a,4) -- sent over network
# Partition 2: (a,6) -- sent over network
# Reducer combines (a,4) + (a,6) = (a,10)
# For the same wordcount task, reduceByKey transfers ~1/3 the data
# DataFrame equivalent (always preferred over RDD for SQL-like operations)
from pyspark.sql import functions as F
df = spark.createDataFrame(data, ["key", "value"])
df.groupBy("key").agg(F.sum("value")).show()
# Catalyst optimizer automatically applies partial aggregation (like reduceByKey)
# Rule: always prefer reduceByKey over groupByKey when the operation is associative
Q10. How do you implement and use broadcast joins?
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("broadcast").getOrCreate()
# Large fact table: 500M orders
orders = spark.read.parquet("/data/orders") # 500M rows
# Small dimension table: 200 products
products = spark.read.parquet("/data/products") # 200 rows
# Without broadcast: sort-merge join (shuffle both tables -- expensive)
orders.join(products, "product_id").explain()
# Output: SortMergeJoin
# With broadcast: small table fits in each executor's memory
result = orders.join(F.broadcast(products), "product_id")
result.explain()
# Output: BroadcastHashJoin (no shuffle of orders table!)
# Automatic broadcast: configure threshold
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024) # 50MB
# Spark automatically broadcasts tables smaller than this threshold
# Broadcast a collected DataFrame back to all executors
from pyspark.sql.functions import broadcast
# Manually collect small table for custom logic
product_dict = {row["product_id"]: row["category"]
for row in products.collect()}
bc_products = spark.sparkContext.broadcast(product_dict)
# Use broadcast variable in UDF
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
@udf(returnType=StringType())
def get_category(product_id):
return bc_products.value.get(product_id, "unknown")
orders_with_cat = orders.withColumn("category", get_category(F.col("product_id")))
Q11. How do you tune Spark memory? Explain executor memory layout.
# Spark executor memory regions:
# Total executor memory = executor.memory + executor.memoryOverhead
# executor.memory is divided into:
# 1. Reserved memory (300MB hardcoded)
# 2. Usable memory = executor.memory - 300MB
# Usable = unified memory (fraction: spark.memory.fraction, default 0.6)
# + user memory (1 - fraction)
# Unified memory (60% of usable) further splits:
# - Execution memory: shuffle buffers, sort buffers, joins
# - Storage memory: caching DataFrames
# (These can borrow from each other -- unified memory model)
# User memory (40% of usable): UDFs, custom data structures
# Configuration
spark = SparkSession.builder \
.config("spark.executor.memory", "8g") \
.config("spark.executor.memoryOverhead", "2g") \
.config("spark.memory.fraction", "0.6") \
.config("spark.memory.storageFraction", "0.5") \
.getOrCreate()
# Common OOM causes and fixes:
# 1. Large shuffle: increase spark.executor.memory or reduce data skew
# 2. Too many cached DataFrames: unpersist() unused caches
# 3. Large broadcast: reduce autoBroadcastJoinThreshold or disable broadcast
# 4. Skewed UDFs: increase memoryOverhead
# Executor sizing heuristics:
# Small executors (2 cores, 8GB): HDFS friendlier, less GC pressure
# Large executors (16 cores, 64GB): fewer JVMs, more parallelism per executor
# Sweet spot: 4-5 cores per executor, executor memory = ~5 * partitionSize
# Monitor with Spark UI:
# - Storage tab: cache hit rate
# - Stages tab: task duration, shuffle read/write bytes
# - SQL tab: physical plan, task metrics
Q12. How do you handle data skew in Spark?
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("skew").getOrCreate()
# Detect skew
df = spark.read.parquet("/data/transactions")
key_dist = df.groupBy("user_id").count().orderBy(F.col("count").desc())
key_dist.show(10) # top 10 keys -- if top key >> median, there's skew
# Solution 1: Salting (for aggregations and joins)
SALT_FACTOR = 10
# Salt the skewed table
df_salted = df.withColumn(
"salted_user_id",
F.concat(F.col("user_id").cast("string"), F.lit("_"),
(F.rand() * SALT_FACTOR).cast("int").cast("string"))
)
# Salt the small lookup table (explode to match all salt values)
lookup = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["user_id", "name"])
lookup_salted = lookup.crossJoin(
spark.range(SALT_FACTOR).withColumnRenamed("id", "salt")
).withColumn(
"salted_user_id",
F.concat(F.col("user_id").cast("string"), F.lit("_"), F.col("salt").cast("string"))
)
# Join on salted key (distributed evenly now)
result = df_salted.join(lookup_salted, "salted_user_id")
# Solution 2: Adaptive Query Execution (AQE) -- Spark 3.0+
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
# AQE automatically splits skewed partitions at runtime
# Solution 3: Isolate hot keys
hot_keys = {1, 2, 3} # pre-identified high-traffic users
df_hot = df.filter(F.col("user_id").isin(*hot_keys))
df_cold = df.filter(~F.col("user_id").isin(*hot_keys))
# Process hot keys with more partitions
result_hot = df_hot.repartition(200, "user_id") # distribute hot key rows
result_cold = df_cold.repartition(50, "user_id")
final = result_hot.union(result_cold)
Q13. What is Adaptive Query Execution (AQE)? What problems does it solve?
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
.config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1") \
.config("spark.sql.adaptive.skewJoin.enabled", "true") \
.appName("aqe_demo").getOrCreate()
# AQE: dynamically adjusts execution plan based on runtime statistics
# Available since Spark 3.0, significantly improved in 3.2+
# Problem 1: Too many small shuffle partitions
# Default spark.sql.shuffle.partitions=200 -- many may be tiny after heavy filter
# AQE solution: dynamically coalesce small partitions
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "400")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
# AQE combines small partitions to reach ~128MB target
# Problem 2: Join strategy mismatch
# Sort-merge join was planned, but after filter the small table fits in memory
# AQE solution: dynamically switch to broadcast hash join at runtime
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")
# Problem 3: Data skew (see Q12)
# AQE splits skewed partitions and replicates small-side data to match
# Verify AQE is working
df = spark.read.parquet("/data/orders")
result = df.groupBy("category").count()
result.explain("formatted")
# Look for "AdaptiveSparkPlan" in the plan output
# AQE limitations:
# - Cannot change the physical join type for streaming queries
# - Subqueries are not AQE-optimized
# - Non-deterministic data source reads may cause issues
Q14. How do you write and optimize a Spark join?
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("join_opt").getOrCreate()
orders = spark.read.parquet("/data/orders") # large: 500M rows
users = spark.read.parquet("/data/users") # medium: 10M rows
products = spark.read.parquet("/data/products") # small: 10K rows
# Join strategy selection:
# 1. Broadcast Hash Join: small table fits in memory (<= autoBroadcastJoinThreshold)
result = orders.join(F.broadcast(products), "product_id", "left")
# Zero shuffle of orders table -- fastest
# 2. Sort-Merge Join: both tables large, pre-sort on join key
# Enable bucketing to avoid sort phase
orders.write.bucketBy(200, "user_id").sortBy("user_id").saveAsTable("orders_bucketed")
users.write.bucketBy(200, "user_id").sortBy("user_id").saveAsTable("users_bucketed")
orders_b = spark.table("orders_bucketed")
users_b = spark.table("users_bucketed")
# Join on user_id: no shuffle (already co-partitioned)
result = orders_b.join(users_b, "user_id")
# 3. Hash Join: moderate size, no need to sort
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
# Join optimization checklist:
# a) Filter before join (reduce rows before shuffle)
result = (
orders.filter(F.col("status") == "completed") # filter first
.join(users.filter(F.col("is_active") == True), "user_id")
)
# b) Select only needed columns before join (reduces shuffle size)
orders_slim = orders.select("order_id", "user_id", "amount", "status")
users_slim = users.select("user_id", "name", "city")
result = orders_slim.join(users_slim, "user_id")
# c) Avoid cross joins (cartesian product)
# d) Use skew handling for data-skewed join keys (see Q12)
Q15. What is Delta Lake? How does it improve on Parquet?
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession
spark = configure_spark_with_delta_pip(
SparkSession.builder
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.appName("delta_demo")
).getOrCreate()
# Delta Lake: ACID-compliant storage layer on top of Parquet
# Write Delta table
data = [(1, "Alice", 50000), (2, "Bob", 60000)]
df = spark.createDataFrame(data, ["id", "name", "salary"])
df.write.format("delta").save("/tmp/delta_table")
# ACID transactions: concurrent reads/writes are safe
df_new = spark.createDataFrame([(3, "Charlie", 70000)], ["id", "name", "salary"])
df_new.write.format("delta").mode("append").save("/tmp/delta_table")
# Upsert (MERGE): update existing rows, insert new ones
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/tmp/delta_table")
updates = spark.createDataFrame([(1, "Alice", 55000), (4, "Diana", 80000)], ["id", "name", "salary"])
delta_table.alias("target").merge(
updates.alias("source"),
"target.id = source.id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
# Time travel: query historical versions
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta_table").show()
spark.read.format("delta").option("timestampAsOf", "2026-06-01").load("/tmp/delta_table").show()
# Delta vs plain Parquet:
# Parquet: immutable files, no ACID, overwrites risky, schema inconsistency possible
# Delta: transaction log, ACID, schema enforcement, time travel, MERGE/UPDATE/DELETE
Q16. What is Spark Structured Streaming?
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, IntegerType, TimestampType
spark = SparkSession.builder.appName("streaming").getOrCreate()
# Schema for incoming events
schema = StructType() \
.add("user_id", IntegerType()) \
.add("event_type", StringType()) \
.add("amount", IntegerType()) \
.add("event_time", TimestampType())
# Read from Kafka as streaming source
raw_stream = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "transactions") \
.load()
# Parse JSON payload
events = raw_stream.select(
F.from_json(F.col("value").cast("string"), schema).alias("data")
).select("data.*")
# Windowed aggregation with watermark (handles late data)
result = events \
.withWatermark("event_time", "10 minutes") \
.groupBy(
F.window("event_time", "5 minutes"), # 5-minute tumbling windows
"user_id"
) \
.agg(F.sum("amount").alias("total_amount"))
# Write to console (development) or sink (production)
query = result.writeStream \
.outputMode("update") \
.format("console") \
.trigger(processingTime="1 minute") \
.start()
# Production sinks: Kafka, Delta Lake, Parquet, JDBC
result.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/tmp/checkpoint") \
.start("/tmp/streaming_output")
query.awaitTermination()
Q17. What are the output modes in Structured Streaming?
# Three output modes:
# 1. APPEND: only new rows added since last trigger are written
# Use for: immutable results (no aggregation, or append-only windowed aggregation with watermark)
query_append = df_stream.writeStream.outputMode("append").format("parquet").start()
# 2. COMPLETE: entire result table is re-written every trigger
# Use for: aggregations where you want the full updated table each time
# Warning: for unbounded aggregations (no watermark), this grows forever
query_complete = df_stream \
.groupBy("user_id").agg(F.sum("amount")) \
.writeStream.outputMode("complete").format("console").start()
# 3. UPDATE: only rows that changed since last trigger are written
# Use for: stateful aggregations where downstream can handle updates
# Most sinks don't support update mode natively (Delta Lake does)
query_update = df_stream \
.withWatermark("event_time", "10 minutes") \
.groupBy(F.window("event_time", "5 minutes"), "user_id") \
.agg(F.sum("amount")) \
.writeStream.outputMode("update").format("console").start()
# Summary:
# | Operation | Append | Complete | Update |
# |-----------|--------|----------|--------|
# | No aggregation | Supported | Not supported | Same as append |
# | With watermark | Supported | Supported | Supported |
# | Without watermark | Not supported | Supported | Supported |
Q18. How do you implement exactly-once semantics in Spark Streaming?
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("exactly_once").getOrCreate()
# Exactly-once = no data loss + no duplicate processing
# Component 1: Checkpointing (saves streaming state to durable storage)
query = (
events.writeStream
.option("checkpointLocation", "s3://my-bucket/checkpoints/stream_v1/")
# checkpoint stores: current offsets, aggregation state, pending commits
.format("delta")
.outputMode("append")
.start("/delta/output")
)
# Component 2: Idempotent sinks
# Delta Lake: ACID transactions ensure each micro-batch committed exactly once
# Kafka: use transactional producer (Spark 3.0+)
events.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "output_topic") \
.option("kafka.transactional.id", "my-stream-transaction-1") \
.option("checkpointLocation", "/tmp/kafka_checkpoint") \
.start()
# Component 3: Source offset tracking
# Kafka source: Spark tracks offset per partition in checkpoint
# At restart: resumes from last committed offset (no reprocessing, no loss)
# Failure recovery:
# Driver failure: restart job -- it reads checkpoint and resumes from last good state
# Executor failure: recompute from RDD lineage (at-least-once within micro-batch)
# + idempotent sink = exactly-once end-to-end
Q19. How do you read and write Parquet, CSV, and JSON efficiently?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, IntegerType, StringType, DoubleType
spark = SparkSession.builder.appName("io_demo").getOrCreate()
# PARQUET (preferred for large data)
# Columnar, compressed, schema-preserving
df = spark.read.parquet("/data/orders") # single file or directory
df_partitioned = spark.read.parquet("/data/orders/year=2026/month=*") # partition pruning
df.write \
.mode("overwrite") \
.partitionBy("year", "month") \ # partition by date for efficient queries
.parquet("/data/orders_partitioned")
# Predicate pushdown: Spark reads ONLY needed columns and row groups
df_fast = spark.read.parquet("/data/orders") \
.filter("status = 'completed' AND year = 2026") \
.select("order_id", "amount") # column pruning
df_fast.explain() # should show: PushedFilters
# CSV (with explicit schema for performance + safety)
schema = StructType() \
.add("order_id", IntegerType()) \
.add("user_id", IntegerType()) \
.add("amount", DoubleType()) \
.add("status", StringType())
df_csv = spark.read \
.schema(schema) \
.option("header", "true") \
.option("nullValue", "NULL") \
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
.csv("/data/orders.csv")
df.write.mode("overwrite").option("header", "true").csv("/data/output.csv")
# JSON (schema inference is slow for large files)
df_json = spark.read.schema(schema).json("/data/events/*.json")
# Performance tips:
# 1. Always specify schema (avoid spark.read.parquet().schema inference for CSV/JSON)
# 2. Partition Parquet by high-cardinality time columns
# 3. Use columnar formats (Parquet/ORC) instead of CSV for large datasets
# 4. Compress with snappy (fast) or zstd (better compression)
Q20. What is SparkSQL? How do you run SQL queries on DataFrames?
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("sparksql").getOrCreate()
# Create DataFrames
orders = spark.read.parquet("/data/orders")
users = spark.read.parquet("/data/users")
# Register as temporary views (session-scoped)
orders.createOrReplaceTempView("orders")
users.createOrReplaceTempView("users")
# Run SQL
result_sql = spark.sql("""
SELECT
u.name,
COUNT(o.order_id) as order_count,
ROUND(SUM(o.amount), 2) as total_revenue,
ROUND(AVG(o.amount), 2) as avg_order_value
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
WHERE o.status = 'completed'
AND o.created_at >= '2026-01-01'
GROUP BY u.name
HAVING COUNT(o.order_id) >= 5
ORDER BY total_revenue DESC
LIMIT 100
""")
# Equivalent DataFrame API
result_df = (
users.alias("u")
.join(orders.filter(
(F.col("status") == "completed") &
(F.col("created_at") >= "2026-01-01")
).alias("o"), "user_id", "left")
.groupBy("u.name")
.agg(
F.count("o.order_id").alias("order_count"),
F.round(F.sum("o.amount"), 2).alias("total_revenue"),
F.round(F.avg("o.amount"), 2).alias("avg_order_value"),
)
.filter(F.col("order_count") >= 5)
.orderBy(F.col("total_revenue").desc())
.limit(100)
)
# Both produce identical execution plans
result_sql.explain()
result_df.explain()
# Catalyst optimizer produces the same physical plan either way
HARD: Production and System Design (Questions 21-28)
Q21. How do you implement a window function in PySpark?
from pyspark.sql import SparkSession, functions as F
from pyspark.sql import Window
spark = SparkSession.builder.appName("window_demo").getOrCreate()
df = spark.createDataFrame([
(1, "Engineering", 90000), (2, "Engineering", 80000),
(3, "Sales", 60000), (4, "Sales", 70000), (5, "Engineering", 95000),
(6, "HR", 55000), (7, "HR", 65000),
], ["emp_id", "dept", "salary"])
# Define window specifications
dept_window = Window.partitionBy("dept")
dept_salary_window = Window.partitionBy("dept").orderBy(F.col("salary").desc())
ordered_window = Window.orderBy(F.col("salary").desc())
# Apply window functions
result = df.select(
"emp_id", "dept", "salary",
F.avg("salary").over(dept_window).alias("dept_avg"),
F.rank().over(dept_salary_window).alias("dept_rank"),
F.dense_rank().over(dept_salary_window).alias("dept_dense_rank"),
F.row_number().over(dept_salary_window).alias("dept_row_num"),
F.lag("salary", 1).over(dept_salary_window).alias("prev_salary"),
F.lead("salary", 1).over(dept_salary_window).alias("next_salary"),
F.ntile(3).over(ordered_window).alias("tertile"),
F.percent_rank().over(ordered_window).alias("pct_rank"),
)
result.orderBy("dept", "dept_rank").show()
# Rolling window (time-based)
time_window = Window \
.partitionBy("user_id") \
.orderBy(F.col("event_time").cast("long")) \
.rangeBetween(-7 * 24 * 3600, 0) # last 7 days in seconds
events = spark.read.parquet("/data/events")
events_with_rolling = events.withColumn(
"rolling_7d_events",
F.count("event_id").over(time_window)
)
Q22. Write a PySpark job to compute user session features.
from pyspark.sql import SparkSession, functions as F, Window
spark = SparkSession.builder.appName("session_features").getOrCreate()
events = spark.read.parquet("/data/user_events")
# Session definition: 30-minute inactivity gap
SESSION_GAP = 30 * 60 # seconds
# Step 1: Compute time since previous event per user
user_time_window = Window.partitionBy("user_id").orderBy("event_time")
events_with_prev = events.withColumn(
"prev_time",
F.lag(F.col("event_time").cast("long")).over(user_time_window)
)
# Step 2: Mark session starts
events_with_sessions = events_with_prev.withColumn(
"is_new_session",
F.when(
(F.col("prev_time").isNull()) |
(F.col("event_time").cast("long") - F.col("prev_time") > SESSION_GAP),
1
).otherwise(0)
)
# Step 3: Assign session IDs (cumulative sum of session starts)
events_with_session_id = events_with_sessions.withColumn(
"session_id",
F.sum("is_new_session").over(user_time_window)
)
# Step 4: Compute session-level features
session_features = events_with_session_id.groupBy("user_id", "session_id").agg(
F.min("event_time").alias("session_start"),
F.max("event_time").alias("session_end"),
F.count("event_id").alias("event_count"),
F.countDistinct("page_id").alias("unique_pages"),
F.max(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("has_purchase"),
((F.max(F.col("event_time").cast("long")) - F.min(F.col("event_time").cast("long"))) / 60).alias("duration_min"),
)
# Step 5: User-level features from sessions
user_features = session_features.groupBy("user_id").agg(
F.count("session_id").alias("total_sessions"),
F.avg("duration_min").alias("avg_session_duration"),
F.sum("has_purchase").alias("purchase_sessions"),
(F.sum("has_purchase") / F.count("session_id")).alias("purchase_rate"),
F.avg("event_count").alias("avg_events_per_session"),
)
user_features.write.mode("overwrite").parquet("/output/user_session_features")
Q23. How do you unit test PySpark code?
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
@pytest.fixture(scope="session")
def spark():
"""Create a local SparkSession for testing."""
return (
SparkSession.builder
.master("local[2]") # 2 threads
.appName("pytest_spark")
.config("spark.sql.shuffle.partitions", "2") # small for tests
.config("spark.ui.enabled", "false") # disable Spark UI
.getOrCreate()
)
def compute_revenue_per_user(df):
"""Function under test."""
return (
df.filter(F.col("status") == "completed")
.groupBy("user_id")
.agg(F.sum("amount").alias("total_revenue"))
)
class TestRevenueComputation:
def test_completed_orders_only(self, spark):
data = [
(1, 100.0, "completed"),
(1, 200.0, "cancelled"),
(2, 150.0, "completed"),
]
df = spark.createDataFrame(data, ["user_id", "amount", "status"])
result = compute_revenue_per_user(df)
rows = {row.user_id: row.total_revenue for row in result.collect()}
assert rows[1] == 100.0 # cancelled order excluded
assert rows[2] == 150.0
def test_empty_dataframe(self, spark):
schema = "user_id int, amount double, status string"
df = spark.createDataFrame([], schema)
result = compute_revenue_per_user(df)
assert result.count() == 0
def test_schema_output(self, spark):
data = [(1, 100.0, "completed")]
df = spark.createDataFrame(data, ["user_id", "amount", "status"])
result = compute_revenue_per_user(df)
assert "user_id" in result.columns
assert "total_revenue" in result.columns
assert result.schema["total_revenue"].dataType.typeName() == "double"
def test_null_amounts(self, spark):
data = [(1, None, "completed"), (1, 100.0, "completed")]
df = spark.createDataFrame(data, ["user_id", "amount", "status"])
result = compute_revenue_per_user(df)
row = result.collect()[0]
assert row.total_revenue == 100.0 # SUM ignores NULLs
Q24. How do you implement slowly changing dimensions (SCD) with Spark?
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable
spark = SparkSession.builder.appName("scd").getOrCreate()
# SCD Type 2: maintain full history of dimension changes
# Keep current record + all historical records with effective dates
# Current dimension table (Delta)
dim_path = "/delta/dim_customers"
def upsert_scd_type2(current_delta_path, updates_df):
"""Implement SCD Type 2 with Delta MERGE."""
dim = DeltaTable.forPath(spark, current_delta_path)
# Step 1: Expire records that have changed
# Update old record: set end_date = today, is_current = False
dim.alias("target").merge(
updates_df.alias("source"),
"target.customer_id = source.customer_id AND target.is_current = true AND "
"(target.email != source.email OR target.city != source.city)"
).whenMatchedUpdate(set={
"is_current": F.lit(False),
"end_date": F.current_date(),
}).execute()
# Step 2: Insert new versions (all updates)
updates_with_metadata = updates_df.withColumn("start_date", F.current_date()) \
.withColumn("end_date", F.lit(None).cast("date")) \
.withColumn("is_current", F.lit(True)) \
.withColumn("surrogate_key", F.monotonically_increasing_id())
updates_with_metadata.write.format("delta").mode("append").save(current_delta_path)
# SCD Type 1: just overwrite (no history)
def upsert_scd_type1(dim_path, updates_df):
dim = DeltaTable.forPath(spark, dim_path)
dim.alias("target").merge(
updates_df.alias("source"),
"target.customer_id = source.customer_id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
Q25. Design a Spark pipeline for a daily feature engineering batch job.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql import Window
from datetime import datetime, timedelta
def run_feature_job(run_date: str):
spark = SparkSession.builder \
.appName(f"feature_job_{run_date}") \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.sql.shuffle.partitions", "400") \
.getOrCreate()
run_dt = datetime.strptime(run_date, "%Y-%m-%d")
lookback_90d = (run_dt - timedelta(days=90)).strftime("%Y-%m-%d")
# 1. Read raw data (partition pruning via filter pushdown)
orders = spark.read.parquet("/data/orders") \
.filter(F.col("order_date").between(lookback_90d, run_date))
events = spark.read.parquet("/data/events") \
.filter(F.col("event_date").between(lookback_90d, run_date))
# 2. Cache shared DataFrames (used in multiple aggregations)
users_active = spark.read.parquet("/data/users") \
.filter(F.col("is_active") == True) \
.select("user_id", "signup_date", "plan_type") \
.cache()
# 3. Order features
order_features = orders.groupBy("user_id").agg(
F.count("order_id").alias("order_count_90d"),
F.sum("amount").alias("gmv_90d"),
F.avg("amount").alias("avg_order_value_90d"),
F.max("order_date").alias("last_order_date"),
F.countDistinct("product_id").alias("unique_products_90d"),
).withColumn("days_since_last_order",
F.datediff(F.lit(run_date), F.col("last_order_date")))
# 4. Event features
event_features = events.groupBy("user_id").agg(
F.count("event_id").alias("event_count_90d"),
F.countDistinct("session_id").alias("sessions_90d"),
F.max("event_date").alias("last_active_date"),
)
# 5. Join all features
features = users_active \
.join(order_features, "user_id", "left") \
.join(event_features, "user_id", "left") \
.fillna(0, subset=["order_count_90d", "gmv_90d", "event_count_90d"])
# 6. Write with date partition
features.withColumn("feature_date", F.lit(run_date)) \
.write \
.partitionBy("feature_date") \
.mode("overwrite") \
.parquet("/data/user_features")
users_active.unpersist()
spark.stop()
if __name__ == "__main__":
run_feature_job("2026-06-08")
Q26. How do you monitor and debug a Spark job in production?
# Spark UI: http://driver-host:4040 (while job runs)
# History Server: http://history-server:18080 (after job completes)
# Key metrics to monitor:
# 1. Stage DAG: identify shuffles (wide transformations)
# Look for: stages with large shuffle read/write bytes
# 2. Task metrics: find skewed tasks
# Good: all tasks finish within 2x of median duration
# Bad: one task takes 10x longer (data skew)
# 3. Executor metrics: GC time, memory pressure
# GC > 5% of task time: memory tuning needed
# 4. SQL tab: query plan with actual row counts
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")
# Programmatic monitoring
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("monitored_job").getOrCreate()
sc = spark.sparkContext
# Add job group for tracking
sc.setJobGroup("feature_engineering", "Daily user features batch")
# Log partition metrics
df = spark.read.parquet("/data/orders")
print(f"Input partitions: {df.rdd.getNumPartitions()}")
df_result = df.groupBy("user_id").count()
print(f"Output partitions: {df_result.rdd.getNumPartitions()}")
# Check for empty partitions (common after heavy filter)
partition_sizes = df_result.rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
empty_partitions = sum(1 for s in partition_sizes if s == 0)
print(f"Empty partitions: {empty_partitions} / {len(partition_sizes)}")
# Common debugging patterns:
# 1. .explain("formatted") -- verbose query plan
# 2. .rdd.getNumPartitions() -- check partition count
# 3. .groupBy(F.spark_partition_id()).count() -- check partition size distribution
# 4. df.select(F.spark_partition_id(), F.count()).groupBy("spark_partition_id()").count() -- skew check
Q27. What are user-defined functions (UDFs) in PySpark? When to avoid them?
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType, ArrayType, DoubleType
from pyspark.sql.functions import udf, pandas_udf
import pandas as pd
spark = SparkSession.builder.appName("udf_demo").getOrCreate()
# Regular Python UDF: row-by-row execution, no optimization
@udf(returnType=StringType())
def categorize_salary(salary):
if salary is None:
return "unknown"
elif salary < 30000:
return "low"
elif salary < 70000:
return "medium"
else:
return "high"
df = spark.createDataFrame([(1, 50000), (2, 85000), (3, 25000)], ["id", "salary"])
df.withColumn("category", categorize_salary(F.col("salary"))).show()
# Regular UDFs are slow: each row deserialized, Python function called, result serialized
# Avoid for operations expressible as Spark native functions!
# Better: use native functions
df.withColumn("category",
F.when(F.col("salary") < 30000, "low")
.when(F.col("salary") < 70000, "medium")
.otherwise("high")
).show() # 10-100x faster
# Pandas UDF (Vectorized UDF): much faster for custom logic
@pandas_udf(returnType=DoubleType())
def custom_log_transform(salary: pd.Series) -> pd.Series:
"""Custom transform not available as Spark native function."""
import numpy as np
return np.log1p(salary.fillna(0))
df.withColumn("log_salary", custom_log_transform(F.col("salary"))).show()
# Pandas UDF uses Arrow for zero-copy data transfer (much faster than Python UDF)
# Use UDFs for: complex custom logic not expressible in Spark SQL functions
# Avoid UDFs for: arithmetic, string ops, date ops, conditions (all have native equivalents)
Q28. Design a real-time anomaly detection pipeline using Spark Structured Streaming.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql import Window
from pyspark.sql.types import StructType, IntegerType, DoubleType, TimestampType
spark = SparkSession.builder \
.appName("anomaly_detection") \
.config("spark.sql.adaptive.enabled", "true") \
.getOrCreate()
schema = StructType() \
.add("transaction_id", IntegerType()) \
.add("user_id", IntegerType()) \
.add("amount", DoubleType()) \
.add("event_time", TimestampType())
# Stream from Kafka
raw_stream = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka:9092") \
.option("subscribe", "transactions") \
.load() \
.select(F.from_json(F.col("value").cast("string"), schema).alias("d")).select("d.*")
# Load historical user statistics (batch lookup)
user_stats = spark.read.parquet("/data/user_stats_baseline") \
.select("user_id", "avg_amount", "std_amount")
# Streaming join: enrich transactions with user baseline stats
enriched = raw_stream.join(F.broadcast(user_stats), "user_id", "left")
# Compute z-score anomaly signal
anomaly_signals = enriched.withColumn(
"z_score",
(F.col("amount") - F.col("avg_amount")) / (F.col("std_amount") + 1.0)
).withColumn(
"is_anomaly",
F.when(F.abs(F.col("z_score")) > 3.0, True).otherwise(False)
)
# Window-based velocity check: too many transactions in 5 minutes
with_watermark = anomaly_signals.withWatermark("event_time", "2 minutes")
velocity_check = with_watermark \
.groupBy(F.window("event_time", "5 minutes"), "user_id") \
.agg(
F.count("transaction_id").alias("txn_count"),
F.sum("amount").alias("total_amount"),
F.max("is_anomaly").alias("any_anomaly"),
) \
.filter((F.col("txn_count") > 10) | F.col("any_anomaly"))
# Write alerts to Kafka output topic
velocity_check.selectExpr(
"CAST(user_id AS STRING) as key",
"to_json(struct(*)) as value"
).writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka:9092") \
.option("topic", "anomaly_alerts") \
.option("checkpointLocation", "/checkpoints/anomaly_v1") \
.outputMode("update") \
.trigger(processingTime="30 seconds") \
.start() \
.awaitTermination()
FAQ
Q: What version of Spark should I know for 2026 interviews? A: Spark 3.4+ is the current production standard. Key features introduced since Spark 3.0 that are heavily tested: Adaptive Query Execution (AQE), ANSI SQL compliance, improved Python UDF performance, Spark Connect, and better Delta Lake integration. Confirm the specific version expected on the official company careers portal.
Q: Is PySpark the right choice over Scala for interviews? A: PySpark is accepted at most companies for data engineering and data science roles. Scala is preferred at companies with existing Scala/JVM codebases (some FAANG teams, financial firms). Python/PySpark covers 85% of Spark interview scenarios. Confirm the language preference on the official company job description.
Q: Do I need to know Databricks specifically? A: Databricks-specific features (Delta Lake, Photon engine, Unity Catalog, Databricks Workflows, MLflow integration) are tested at companies using Databricks. Core Spark knowledge transfers directly. Delta Lake is increasingly important to know regardless of whether the company uses Databricks. Candidates from public preparation resources confirm that Delta MERGE, time travel, and ACID properties are tested at companies using lakehouse architectures.
Methodology applied to this articlelast verified 8 Jun 2026
- No fabricated salary numbers or success rates. If we quote a range, it's sourced.
- No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
- No paid placements, sponsored coaching links, or affiliate-shilled course pushes.
Explore this topic cluster
More resources in Interview Questions
Use the category hub to browse similar questions, exam patterns, salary guides, and preparation resources related to this topic.
Paid contributor programme
Sat this this year? Share your story, earn ₹500.
First-person experience reports help future candidates prep smarter. We pay verified contributors ₹500 via UPI per accepted story - with byline.
Submit your story →Ready to practice?
Take a free timed mock test
Put what you learned into practice. Our mock tests match the 2026 pattern with timer, navigator, reveal, and score breakdown. No signup.
Start Free Mock Test →Related Articles
Airbnb Interview Questions 2026: Top Tech, HR & Behavioural Q&As for Freshers
Clearing Airbnb's fresher loop in 2026 comes down to preparing for the exact mix of questions across technical, behavioural,...
Airtel Interview Questions 2026: Top Tech, HR & Behavioural Q&As for Freshers
Clearing Airtel's fresher loop in 2026 comes down to preparing for the exact mix of questions across technical, behavioural,...
AMD Interview Questions 2026: Top Tech, HR & Behavioural Q&As for Freshers
Clearing AMD's fresher loop in 2026 comes down to preparing for the exact mix of questions across technical, behavioural,...
Atlassian Interview Questions 2026: Top Tech, HR & Behavioural Q&As for Freshers
Clearing Atlassian's fresher loop in 2026 comes down to preparing for the exact mix of questions across technical,...
Barclays Interview Questions 2026
_Last verified by [Aditya Sharma](/author/aditya-sharma/) · cross-checked against PapersAdda Hiring Pulse and...
More from PapersAdda
Accenture Interview Questions 2026 (with Answers for Freshers)
Capgemini Interview Questions 2026 (with Answers for Freshers)
HCLTech Interview Questions 2026 (TechBee + TGT, with Answers)
IBM Interview Questions 2026 (with Answers for Freshers)