
Hadoop Interview Questions 2026: HDFS, MapReduce, YARN & Ecosystem


By Aditya Sharma | Published 8 Jun 2026 | Last revised 8 Jun 2026

Candidates report that Hadoop ecosystem questions in 2026 interviews focus heavily on HDFS internals, MapReduce optimization, and integration with modern tools like Spark and cloud storage. Confirm exact topics and framework versions on the official company careers portal before your interview.

Hadoop remains a foundational big data technology. While cloud-native tools have taken center stage, understanding Hadoop's architecture -- HDFS, MapReduce, YARN, and the broader ecosystem (Hive, HBase, Pig, Oozie) -- is essential for data engineering roles at enterprises with large on-premise or hybrid deployments.

Core Architecture

Q1. Explain Hadoop's core architecture and the role of each component.

Hadoop has three core components:

Component	Role
HDFS	Distributed filesystem -- stores data across commodity hardware
MapReduce	Parallel processing framework -- splits jobs into map + reduce tasks
YARN	Resource management -- allocates CPU/memory to applications

HDFS architecture:

NameNode: Master -- stores filesystem metadata (namespace, file-to-block mappings) and tracks block locations reported by DataNodes. It is a single point of failure unless HA mode adds a standby NameNode.
DataNodes: Workers -- store actual data blocks (default 128 MB each). Send heartbeats every 3 seconds to NameNode.
Secondary NameNode: Checkpoints NameNode's edit logs to prevent log bloat. NOT a hot standby.

YARN architecture:

ResourceManager: Cluster-wide resource allocator -- two main components: Scheduler (allocates resources) + ApplicationsManager (accepts submitted jobs and starts their ApplicationMasters).
NodeManager: Per-node -- launches and monitors containers.
ApplicationMaster: Per-job -- negotiates resources with RM, coordinates tasks.

Data flow for a MapReduce job:

Client submits job to ResourceManager.
RM launches ApplicationMaster in a container.
AM requests containers from RM; NMs launch mapper/reducer tasks.
Output written to HDFS.

Q2. How does HDFS achieve fault tolerance?

Three mechanisms:

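1. Replication (default factor = 3)

Each block is replicated to 3 DataNodes. Default placement policy: first replica on the writer's own node (or a random node if the client runs outside the cluster), second on a node in a different rack, third on a different node in the same rack as the second. This balances fault tolerance against cross-rack write bandwidth.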
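The placement pattern can be sketched in a few lines of Python. This is a simplified illustration, not the NameNode's actual placement code: the rack map, node names, and the `place_replicas` helper are hypothetical, and the sketch assumes at least two racks with two or more nodes on the remote rack.

```python
import random

def place_replicas(writer, racks, rng=random):
    """Pick three DataNodes following the default rack-aware pattern.

    racks maps a rack name to the DataNodes on it (hypothetical names).
    """
    local_rack = next(r for r, nodes in racks.items() if writer in nodes)
    # Replica 1: the writer's own node, so the first copy costs no network hop.
    first = writer
    # Replica 2: a node on a different rack, to survive a whole-rack failure.
    remote_rack = rng.choice([r for r in racks if r != local_rack])
    second = rng.choice(racks[remote_rack])
    # Replica 3: a different node on that same remote rack, so only one
    # copy of the block has to cross the inter-rack link.
    third = rng.choice([n for n in racks[remote_rack] if n != second])
    return [first, second, third]

racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}
replicas = place_replicas("dn1", racks)
print(replicas)
```

Losing all of rack1 or all of rack2 still leaves at least one replica, while the write pipeline crosses racks only once.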

2. Heartbeat + block reports

DataNodes send heartbeats to NameNode every 3 seconds.
Block reports every 6 hours (default) -- NameNode validates replication factor.
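A DataNode that sends no heartbeat for 30 seconds (default) is marked stale. It is declared dead, and its blocks are re-replicated, only after about 10.5 minutes by default (2 x the 5-minute recheck interval + 10 x the 3-second heartbeat interval).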

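3. Checksums

HDFS computes a CRC-32C checksum for every 512 bytes of data (default) and stores the checksums in a metadata file alongside each block. Reads verify checksums; on a mismatch the client reads from another replica and reports the corrupt block to the NameNode.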

# Check HDFS file health
hdfs fsck /user/data/my_file.csv -files -blocks -locations

# Check replication factor
hdfs dfs -stat "%r" /user/data/my_file.csv

NameNode HA: Active + Standby NameNode share state via JournalNodes (quorum-based edit log). ZooKeeper, together with a ZKFailoverController (ZKFC) process beside each NameNode, handles automatic failover.

Q3. What is data locality in Hadoop, and why does it matter?

Data locality = running computation where the data already resides, avoiding network I/O.

Three levels:

Node-local: Task runs on the DataNode that holds the block. Best -- zero network cost.
Rack-local: Task runs on a DataNode in the same rack. One network hop within the rack switch.
Off-rack: Task fetches data across racks. Most expensive -- cross-rack bandwidth is the bottleneck.

The MapReduce ApplicationMaster (the JobTracker in pre-YARN Hadoop 1) attempts node-local placement first. If no container is available on that node, it falls back to rack-local, then off-rack.

Why it matters: A typical MapReduce job reads all input data. At 100 TB input, even 1 GB/s of cross-rack bandwidth means roughly 28 hours of network transfer. Node-local reads are limited by local disk throughput instead (often 200-500 MB/s per disk), and every node reads in parallel.
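The arithmetic behind that comparison can be checked with a short sketch. The cluster shape (100 nodes, 12 disks each, 0.2 GB/s per disk) is an assumption chosen for illustration, and decimal units are used throughout.

```python
def transfer_hours(data_tb, bandwidth_gb_per_s):
    """Hours needed to move data_tb terabytes at a sustained rate (decimal units)."""
    seconds = data_tb * 1000 / bandwidth_gb_per_s
    return seconds / 3600

# 100 TB pulled over a shared 1 GB/s cross-rack link
cross_rack = transfer_hours(100, 1.0)

# Same data read node-locally: assumed 100 nodes x 12 disks x 0.2 GB/s per disk
node_local = transfer_hours(100, 100 * 12 * 0.2)

print(f"cross-rack: {cross_rack:.1f} h, node-local: {node_local * 60:.1f} min")
```

Under these assumptions the aggregate disk bandwidth of the cluster is a couple of hundred times the shared link, which is the whole case for moving computation to the data.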

Cloud implication: S3-based data lakes decouple storage from compute, so node-level data locality no longer applies: every read crosses the network. Clusters such as Spark on EMR compensate with instance types that offer high network bandwidth, and with columnar formats and partition pruning that cut the bytes read.

Q4. Explain the MapReduce programming model with a word count example.

MapReduce has four phases: Input split, Map, Shuffle & Sort, Reduce.

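# Conceptual Python sketch of word count (runs locally, without Hadoop)
from collections import defaultdict

# MAP phase: runs on each input split (portion of a text file)
def map_fn(key, value):
    # key = line offset, value = line text
    for word in value.split():
        yield word.lower(), 1

# After map: framework groups all values by key (shuffle + sort)
# ("hadoop", [1, 1, 1]), ("is", [1, 1]), ...
def shuffle(pairs):
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return sorted(groups.items())

# REDUCE phase: runs once per unique key
def reduce_fn(key, values):
    # key = word, values = list of 1s
    return key, sum(values)

lines = ["Hadoop is big", "hadoop is distributed"]
mapped = [pair for offset, line in enumerate(lines) for pair in map_fn(offset, line)]
print([reduce_fn(word, counts) for word, counts in shuffle(mapped)])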

Java implementation structure:

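import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {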
    // Mapper class
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer class
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
            throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Phases in detail:

Input split: HDFS file divided into splits (usually one block = one split). Each split goes to one Mapper.
Map: Emits (key, value) pairs.
Combiner (optional): Local mini-reducer on mapper output -- reduces shuffle data volume.
Shuffle & Sort: Framework partitions, transfers, and sorts mapper output by key before reduce.
Reduce: Aggregates all values for each key into final output.

Q5. What is a Combiner in MapReduce, and when should you use it?

A Combiner is a local reducer that runs on each Mapper's output before the shuffle phase. It reduces the volume of data transferred across the network.

When to use:

The reduce operation is commutative and associative (e.g., sum, max, min, count).
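NOT valid as-is for operations like average: an average of per-mapper averages weights every mapper equally, no matter how many records each one saw. The framework may also run the combiner zero, one, or several times, so the job must produce the same result without it.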

// Set combiner to same class as reducer (valid for word count)
job.setCombinerClass(IntSumReducer.class);

Impact:

Without combiner: Each mapper emits (word, 1) for every occurrence. Network transfers millions of pairs.
With combiner: Local aggregation per word per mapper. Network transfers one (word, localCount) per unique word per mapper.

For average -- correct pattern:

# Map emits: (word, (count, sum))
# Combine: (word, (totalCount, totalSum)) -- valid, just add counts + sums
# Reduce: (word, totalSum / totalCount)
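A small local simulation shows why carrying (count, sum) pairs works where averaging averages does not. It is a sketch in plain Python rather than Hadoop code; the function names and the sample "latency" records are made up for illustration.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, (count, sum)) so partial results stay mergeable.
    for key, value in records:
        yield key, (1, value)

def combine(pairs):
    # Merge (count, sum) pairs per key; safe to run zero, one, or many times.
    acc = defaultdict(lambda: (0, 0.0))
    for key, (count, total) in pairs:
        c, t = acc[key]
        acc[key] = (c + count, t + total)
    return list(acc.items())

def reduce_phase(pairs):
    # Same merge as the combiner, then divide once at the very end.
    return {key: total / count for key, (count, total) in combine(pairs)}

# Two mappers with uneven record counts for the same key
split_a = [("latency", 10.0), ("latency", 20.0), ("latency", 30.0)]
split_b = [("latency", 100.0)]

shuffled = combine(map_phase(split_a)) + combine(map_phase(split_b))
correct = reduce_phase(shuffled)["latency"]

# Naive approach: average per mapper, then average the averages
naive = (sum(v for _, v in split_a) / len(split_a) + split_b[0][1]) / 2

print(correct, naive)
```

The true mean of the four values is 40.0, while the average of the two per-mapper averages is 60.0, because the single record in the second split gets the same weight as the three in the first.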

Q6. Explain MapReduce speculative execution.

Problem: Stragglers -- slow tasks that lag behind and delay entire job completion. Can be caused by hardware degradation, data skew, or resource contention.

Solution: Speculative execution -- launch duplicate copies of slow tasks on other nodes; use whichever finishes first, kill the other.

Trigger conditions (configurable):

Task has been running longer than average for its stage.
Progress rate is significantly below average.
Cluster has spare capacity.


<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>

When to disable:

Non-idempotent operations (e.g., writing to external databases without deduplication).
Tasks with external side effects.
When cluster is already fully loaded (speculative tasks just compete for same resources).

HDFS Deep Dive

Q7. What is the difference between NameNode and Secondary NameNode?

Aspect	NameNode	Secondary NameNode
Role	Active master -- serves all metadata	Checkpoint helper
Stores	FsImage + EditLog (in memory + disk)	Merged FsImage checkpoints
Failure impact	Cluster unusable	No immediate impact
HA role	Active or Standby (HA mode)	Not part of HA setup

EditLog problem: NameNode logs every metadata operation to an EditLog. Over time, EditLog grows huge. On restart, replaying a large EditLog is slow.

Secondary NameNode solution:

Periodically downloads FsImage + EditLog from NameNode.
Merges them into a new FsImage (checkpointing).
Uploads merged FsImage back to NameNode.
NameNode can start fresh from new FsImage + smaller EditLog.

Secondary NameNode is NOT a hot standby. For actual HA, use HDFS HA with Active/Standby NameNodes and JournalNodes.

# Force checkpoint on Secondary NameNode
hdfs secondarynamenode -checkpoint force

Q8. What is the HDFS small files problem, and how do you solve it?

Problem: Every file, directory, and block in HDFS is an object in the NameNode's memory (roughly 150 bytes each). Millions of small files:

Exhaust NameNode heap memory.
Create excessive MapReduce tasks (one mapper per file/split).
Slow metadata operations.
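To get a feel for the memory cost, the rough 150-byte figure can be turned into a back-of-the-envelope estimate. This sketch counts one object per file plus one per block and ignores directories; the helper name and the file sizes are illustrative assumptions, not measurements.

```python
import math

BYTES_PER_OBJECT = 150          # rough per-object figure quoted above
BLOCK_SIZE = 128 * 1024 ** 2    # default HDFS block size

def namenode_heap_bytes(num_files, avg_file_size):
    """Estimate heap used by file and block objects (directories ignored)."""
    blocks_per_file = max(1, math.ceil(avg_file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file object + its blocks
    return objects * BYTES_PER_OBJECT

one_tib = 1024 ** 4
# The same 1 TiB of data stored as 1 MiB files versus 1 GiB files
small = namenode_heap_bytes(one_tib // 1024 ** 2, 1024 ** 2)
large = namenode_heap_bytes(one_tib // 1024 ** 3, 1024 ** 3)

print(f"1 MiB files: {small / 1024 ** 2:.0f} MiB of heap")
print(f"1 GiB files: {large / 1024:.0f} KiB of heap")
```

The data volume is identical in both cases; only the number of objects the NameNode has to track changes, by a factor of a few hundred.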

Solutions:

1. HAR (Hadoop Archive):

# Package many small files into a single HAR archive
hadoop archive -archiveName myarchive.har -p /user/small_files /user/archives/

# Access individual files through HAR
hdfs dfs -ls har:///user/archives/myarchive.har/

Downside: HAR files are read-only; not suitable for streaming updates.

2. SequenceFile: Pack small files as key-value pairs in a SequenceFile (filename as key, content as value). Compressible, splittable, MapReduce-friendly.

3. CombineFileInputFormat:

// Combine multiple small files into one mapper's input split
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024); // 128 MB

4. Upstream fix: Avoid creating small files. Use Spark/Hive to write fewer, larger partitioned files. Coalesce output before writing to HDFS.

Q9. How does HDFS handle writes and what is the write pipeline?

Write flow:

Client calls create() on the NameNode, which checks permissions and records the new file. For each block, the NameNode then returns a list of DataNodes that form the replication pipeline.
Client writes data to first DataNode in the pipeline.
First DataNode stores the block and forwards to second DataNode, which forwards to third.
ACKs flow back: DN3 ACKs to DN2, DN2 ACKs to DN1, DN1 ACKs to client.
After all replicas acknowledge, block is considered written.
Client calls complete() on NameNode to close the file.

Packet-level pipeline:

Data is streamed in packets (64 KB default).
Client doesn't wait for full block ACK -- pipelines packets.
If a DataNode fails mid-write: client is notified, remaining good DataNodes form a new pipeline, block is re-replicated after write completes.
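The packet flow can be modelled with a toy trace. This is only a mental model in plain Python: the function names are hypothetical, and real clients keep many packets in flight at once rather than sending one at a time.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # default block size
PACKET_SIZE = 64 * 1024          # default packet size

def send_packet(seq, pipeline):
    """Trace one packet down the pipeline and the path its ACK takes back."""
    # Data flows client -> DN1 -> DN2 -> DN3 ...
    forward = ["client"] + pipeline
    # ... and ACKs flow back in the reverse order.
    ack_path = list(reversed(forward))
    return {"seq": seq, "forward": forward, "ack_path": ack_path}

def rebuild_pipeline(pipeline, failed):
    """On a DataNode failure, continue the write with the surviving nodes."""
    return [dn for dn in pipeline if dn != failed]

packets_per_block = BLOCK_SIZE // PACKET_SIZE
trace = send_packet(0, ["dn1", "dn2", "dn3"])
survivors = rebuild_pipeline(["dn1", "dn2", "dn3"], failed="dn2")

print(packets_per_block, trace["ack_path"], survivors)
```

With the default sizes a full block is streamed as 2048 packets, which is why waiting for a per-block ACK would waste most of the pipeline's capacity.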

# Python WebHDFS client (the `hdfs` package, also known as HdfsCLI)
from hdfs import InsecureClient

# NameNode web port: 50070 on Hadoop 2.x, 9870 on Hadoop 3.x
client = InsecureClient('http://namenode:50070', user='hadoop')

# Write to HDFS
with client.write('/user/data/output.csv', overwrite=True) as writer:
    writer.write(b"col1,col2\n1,2\n3,4\n")

# Read from HDFS
with client.read('/user/data/output.csv', encoding='utf-8') as reader:
    content = reader.read()

YARN and Resource Management

Q10. How does YARN schedule jobs, and what schedulers are available?

YARN has three built-in schedulers:

1. FIFO Scheduler

First-in-first-out queue.
Simple but poor for multi-tenant clusters -- large jobs starve small ones.

2. Capacity Scheduler (default in Apache Hadoop and HDP; CDH defaulted to the Fair Scheduler)

Multiple queues with guaranteed capacity (e.g., prod queue: 60%, dev queue: 40%).
Each queue uses FIFO internally.
Supports queue hierarchies, ACLs, preemption.
Queues can borrow unused capacity.


<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>40</value>
</property>

3. Fair Scheduler

All jobs get equal share of resources over time.
New job immediately gets some resources (preempts others if needed).
Supports weighted queues, minimum guarantees.
Better for ad-hoc workloads mixed with long jobs.

Preemption: Fair and Capacity schedulers can preempt (kill) containers from queues that are using more than their share, freeing capacity for under-served queues.

Hive

Q11. What is Hive, and how does HiveQL translate to MapReduce/Tez/Spark?

Hive is a data warehouse layer on top of Hadoop. It provides SQL-like HiveQL syntax that compiles to MapReduce, Tez, or Spark execution engines.

Architecture:

Metastore: MySQL/Derby database storing table schemas, partition metadata, SerDes.
Driver: Receives HiveQL, manages query lifecycle.
Compiler: Parses HiveQL into Abstract Syntax Tree, creates execution plan (DAG of MapReduce/Tez jobs).
Execution Engine: Runs the plan on YARN.

Execution engines:

Engine	Latency	Use case
MapReduce	Minutes	Batch, legacy
Tez	Seconds to minutes	Interactive, complex DAGs
Spark	Seconds	Memory-intensive, iterative

-- Set execution engine
SET hive.execution.engine = tez;

-- A simple HiveQL query
SELECT
    department,
    COUNT(*) AS employee_count,
    AVG(salary) AS avg_salary
FROM employees
WHERE year = 2026
GROUP BY department
ORDER BY avg_salary DESC
LIMIT 10;

Hive compiles this to:

TableScan on HDFS (map phase reads partitions for year=2026).
HashAggregate (map-side combiner for COUNT/SUM).
Shuffle by department.
MergeAggregate (reduce).
Order/Limit (single reducer for final sort).

Q12. What is partitioning in Hive, and how does dynamic partitioning work?

Partitioning stores data in subdirectories by column value, enabling partition pruning (skip entire directories during scans).

-- Create partitioned table
CREATE TABLE sales (
    order_id    BIGINT,
    product_id  BIGINT,
    amount      DOUBLE,
    customer_id BIGINT
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;

-- Static partition insert
INSERT INTO sales PARTITION (year=2026, month=6)
SELECT order_id, product_id, amount, customer_id FROM raw_sales
WHERE year=2026 AND month=6;

-- Dynamic partition insert (Hive figures out partitions from data)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO sales PARTITION (year, month)
SELECT order_id, product_id, amount, customer_id, year, month FROM raw_sales;

Partition pruning:

-- This query only scans /warehouse/sales/year=2026/month=6/
SELECT SUM(amount) FROM sales WHERE year=2026 AND month=6;

-- EXPLAIN to verify partition pruning
EXPLAIN SELECT SUM(amount) FROM sales WHERE year=2026 AND month=6;
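Partition pruning is essentially a filter over directory names that runs before any data file is opened. The sketch below mimics that step in plain Python; the `/warehouse/sales` root follows the comment above, and the partition list and helper names are hypothetical stand-ins for what the metastore would return.

```python
def partition_path(table_root, **partition_values):
    """Build the directory Hive uses for one partition, e.g. year=2026/month=6."""
    parts = "/".join(f"{k}={v}" for k, v in partition_values.items())
    return f"{table_root}/{parts}"

def prune(partitions, **predicate):
    """Keep only partitions whose values match every equality predicate."""
    return [p for p in partitions
            if all(p.get(col) == val for col, val in predicate.items())]

# Hypothetical partition list, as the metastore might return it
partitions = [{"year": y, "month": m} for y in (2025, 2026) for m in range(1, 13)]

selected = prune(partitions, year=2026, month=6)
paths = [partition_path("/warehouse/sales", **p) for p in selected]

print(f"scanning {len(paths)} of {len(partitions)} partitions: {paths}")
```

A filter on a non-partition column (for example `amount`) cannot be pruned this way, because nothing in the directory layout encodes it.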

Bucketing (clustering within partitions):

-- Bucket by customer_id into 32 buckets
CREATE TABLE sales_bucketed (
    order_id    BIGINT,
    product_id  BIGINT,
    amount      DOUBLE,
    customer_id BIGINT
)
PARTITIONED BY (year INT, month INT)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

Bucketing enables bucket map joins (when both tables are bucketed on the join key and one bucket count is a multiple of the other) and more uniform splits.

Q13. What is ORC format and why is it preferred over CSV/JSON in Hive?

ORC (Optimized Row Columnar) is a self-describing, type-aware column-oriented file format for Hive workloads.

Feature	CSV	ORC
Storage	Row	Columnar (per-column stripes)
Compression	None/GZIP	Zlib, Snappy, LZO, LZ4, Zstd (applied to each column stream)
Predicate pushdown	No	Yes (min/max statistics and optional Bloom filters per stripe and row group)
Schema evolution	Limited	Add/rename columns
Splittable	Yes (line-based, if uncompressed)	Yes (stripe-based)
Native type support	Strings only	All Hive types including complex

Performance advantage:

-- With ORC, this query only reads the 'amount' column
-- Skips all other columns entirely
SELECT SUM(amount) FROM sales_orc WHERE year=2026;

-- Stripe-level skipping: stripes where min(amount) > 1000
-- are skipped when filtering amount < 100
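The skipping decision itself is a simple comparison against per-stripe statistics. Here is a minimal sketch of that check for a filter of the form `amount < bound`; the `StripeStats` class and the sample statistics are invented for illustration and do not reflect ORC's actual reader API.

```python
from dataclasses import dataclass

@dataclass
class StripeStats:
    stripe_id: int
    min_amount: float
    max_amount: float

def stripes_to_read(stripes, upper_bound):
    """Return stripes that may contain rows with amount < upper_bound."""
    # A stripe whose minimum is already >= the bound cannot match, so skip it.
    return [s.stripe_id for s in stripes if s.min_amount < upper_bound]

# Hypothetical per-stripe statistics for the 'amount' column
stripes = [
    StripeStats(0, min_amount=5.0, max_amount=90.0),
    StripeStats(1, min_amount=1200.0, max_amount=9000.0),
    StripeStats(2, min_amount=50.0, max_amount=4000.0),
]

to_read = stripes_to_read(stripes, upper_bound=100)
print(to_read)
```

Stripe 1 is skipped without being decompressed. Stripe 2 still has to be read even though most of its rows fail the filter, which is why sorting data on the filtered column makes the statistics far more selective.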

Creating ORC tables:

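CREATE TABLE sales_orc (
    order_id    BIGINT,
    product_id  BIGINT,
    amount      DOUBLE,
    customer_id BIGINT
)
PARTITIONED BY (year INT)
STORED AS ORC
TBLPROPERTIES (
    "orc.compress" = "SNAPPY",
    "orc.stripe.size" = "134217728",  -- 128 MB stripes
    "orc.bloom.filter.columns" = "customer_id,product_id"
);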

Parquet vs ORC: Both are columnar. ORC is optimized for Hive (stronger predicate pushdown). Parquet has broader ecosystem support (Spark, Impala, BigQuery, AWS Glue). Modern practice: Parquet for cloud data lakes, ORC for Hive-heavy on-premise stacks.

HBase

Q14. What is HBase, and when would you choose it over RDBMS or Hive?

HBase is a distributed, column-family-oriented NoSQL database built on HDFS. Modeled after Google Bigtable.

Data model:

Table: Collection of rows.
Row key: Byte array, rows stored in sorted lexicographic order. Critical for access patterns.
Column family: Physical grouping of columns (defined at table creation). Separate HFiles per family.
Column qualifier: Column within a family (defined at write time, schema-free).
Cell: (row key, column family, column qualifier, timestamp) -> value.

Comparison:

Aspect	RDBMS	Hive	HBase
Query pattern	SQL, complex joins	SQL, analytical scans	Point lookups, range scans by row key
Latency	ms	Minutes	ms (random reads)
Mutations	Yes (ACID)	Append/overwrite	Yes (put/delete/increment)
Schema	Fixed	Fixed	Sparse, schema-free columns
Scale	Vertical	Horizontal (batch)	Horizontal (OLTP scale)
Use case	Transactional	Analytical batch	Real-time read/write at scale

Choose HBase when:

Random read/write access to billions of rows.
Time-series data (sensor readings, event logs) with row-key-based range scans.
Real-time counter increments (HBase supports atomic increment).
Variable schema per row (sparse data).

import happybase

# Connect to HBase via Thrift server
connection = happybase.Connection('hbase-master', port=9090)
connection.open()

# Create table
connection.create_table(
    'user_events',
    {'events': dict(max_versions=5),
     'metadata': dict(max_versions=1)}
)

table = connection.table('user_events')

# Write
table.put(
    b'user:12345:20260608',  # row key: entity:id:date for efficient range scans
    {b'events:page_view': b'homepage',
     b'events:duration': b'45',
     b'metadata:device': b'mobile'}
)

# Read single row
row = table.row(b'user:12345:20260608')

# Scan range
for key, data in table.scan(
    row_start=b'user:12345:20260601',
    row_stop=b'user:12345:20260609'
):
    print(key, data)

Q15. Explain HBase row key design principles.

The row key is the primary access path in HBase. Poor row key design causes hotspotting (all reads/writes hitting one region) and inefficient scans.

Principles:

1. Avoid sequential keys (hotspotting)

# BAD: timestamp as prefix -- all current writes go to one region
20260608120000:user123
20260608120001:user456
20260608120002:user789

# GOOD: salt prefix -- distributes writes across regions
a3:20260608120000:user123  # hash(user123) % num_regions -> prefix
b7:20260608120001:user456
f2:20260608120002:user789
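A salt has to be deterministic so that readers can recompute it for point lookups. One possible way to derive it, sketched in Python: the bucket count of 16, the MD5-based hash, and both helper names are assumptions for illustration, not an HBase API.

```python
import hashlib

NUM_BUCKETS = 16  # assumed number of salt buckets, e.g. one per pre-split region

def salted_key(user_id, timestamp):
    """Prefix the key with a stable two-hex-digit salt derived from user_id."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    salt = digest[0] % NUM_BUCKETS
    return f"{salt:02x}:{timestamp}:{user_id}"

def scan_prefixes(timestamp_prefix):
    """A time-range scan must now fan out over every salt bucket."""
    return [f"{salt:02x}:{timestamp_prefix}" for salt in range(NUM_BUCKETS)]

key = salted_key("user123", "20260608120000")
print(key)
print(len(scan_prefixes("20260608")))
```

The trade-off is visible in `scan_prefixes`: writes spread evenly, but a scan over one time window turns into one scan per bucket whose results the client has to merge.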

2. Design for access pattern

# Access pattern: "get all events for user X in date range"
# Row key: userId:date -> enables range scan
user123:20260601 ... user123:20260608  # range scan returns events for 1-8 June

3. Keep row keys short

Row keys are stored with every cell. Long row keys multiply storage. Use hashed or encoded IDs.

4. Reverse timestamp for latest-first reads

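import time

# Reverse timestamp: subtract the current time from a fixed ceiling
# (Long.MAX_VALUE - currentTimeMs in Java; a 13-digit ceiling here keeps the string fixed-width)
# Scan from start gets newest events first
row_key = f"user123:{9999999999999 - int(time.time() * 1000)}"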

5. Composite keys

# User activity: (tenant:userId:eventType:timestamp)
# Enables prefix scans: by tenant, by tenant+user, by tenant+user+eventType
acme:user123:pageview:20260608120000

Pig and Oozie

Q16. What is Apache Pig, and how does Pig Latin differ from HiveQL?

Apache Pig is a platform for expressing data flow programs on Hadoop in a high-level scripting language, Pig Latin. Pig Latin compiles to MapReduce jobs.

Pig Latin vs HiveQL:

Aspect	Pig Latin	HiveQL
Paradigm	Data flow (procedural)	Declarative SQL
Optimization	Manual (you control flow)	Automatic query optimizer
Schema	Optional (schema on read)	Required (DDL)
UDFs	Easier to integrate (Java/Python/Ruby)	Supported but more complex
Use case	ETL pipelines, multi-step transformations	Ad-hoc queries, reporting

-- Pig Latin: find top-10 products by sales
raw_data = LOAD '/user/data/sales.csv'
    USING PigStorage(',')
    AS (order_id:int, product_id:int, amount:float, date:chararray);

-- Filter
recent = FILTER raw_data BY date >= '2026-01-01';

-- Group and aggregate
by_product = GROUP recent BY product_id;
totals = FOREACH by_product GENERATE
    group AS product_id,
    SUM(recent.amount) AS total_sales,
    COUNT(recent) AS order_count;

-- Sort and limit
ranked = ORDER totals BY total_sales DESC;
top10 = LIMIT ranked 10;

-- Store
STORE top10 INTO '/user/output/top_products' USING PigStorage('\t');

Q17. What is Apache Oozie, and how does it orchestrate Hadoop workflows?

Oozie is a workflow scheduler for Hadoop jobs. It orchestrates sequences of MapReduce, Hive, Pig, Sqoop, and shell actions as directed acyclic graphs (DAGs).

Two main types of Oozie jobs (a third, Bundle, groups several coordinators):

Workflow: One-time DAG of actions.
Coordinator: Time-triggered or data-triggered workflow scheduler.

Workflow XML example:

<workflow-app name="etl-pipeline" xmlns="uri:oozie:workflow:0.5">
    <start to="validate-data"/>

    <action name="validate-data">
        <hive xmlns="uri:oozie:hive-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>scripts/validate.hql</script>
        </hive>
        <ok to="transform-data"/>
        <error to="send-failure-email"/>
    </action>

    <action name="transform-data">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapreduce.job.jar</name>
                    <value>/user/oozie/transform.jar</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="send-failure-email"/>
    </action>

    <action name="send-failure-email">
        <email xmlns="uri:oozie:email-action:0.2">
            <to>[email protected]</to>
            <subject>ETL Pipeline Failed</subject>
            <body>Check Oozie logs for job ${wf:id()}</body>
        </email>
        <ok to="fail"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Pipeline failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>

Coordinator for scheduled runs:

<coordinator-app name="daily-etl" frequency="${coord:days(1)}"
    start="2026-01-01T00:00Z" end="2027-01-01T00:00Z"
    timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
        </workflow>
    </action>
</coordinator-app>

Sqoop and Flume

Q18. What is Sqoop, and how does it import/export data between HDFS and RDBMS?

Sqoop is a tool for bulk transfer of structured data between Hadoop and relational databases (MySQL, Oracle, PostgreSQL, SQL Server).

# Import entire table from MySQL to HDFS
sqoop import \
    --connect jdbc:mysql://mysql-server:3306/production_db \
    --username hadoop_user \
    --password-file /user/hadoop/.sqoop_password \
    --table orders \
    --target-dir /user/data/orders \
    --as-parquetfile \
    --num-mappers 8 \
    --compress \
    --compression-codec snappy

# Incremental import (only new rows since last run)
sqoop import \
    --connect jdbc:mysql://mysql-server:3306/production_db \
    --username hadoop_user \
    --password-file /user/hadoop/.sqoop_password \
    --table orders \
    --target-dir /user/data/orders_incremental \
    --incremental append \
    --check-column order_id \
    --last-value 1000000

# Export from HDFS to MySQL
sqoop export \
    --connect jdbc:mysql://mysql-server:3306/reporting_db \
    --username hadoop_user \
    --password-file /user/hadoop/.sqoop_password \
    --table aggregated_sales \
    --export-dir /user/data/aggregated_sales \
    --input-fields-terminated-by '\t' \
    --update-mode allowinsert \
    --update-key order_date,product_id

How Sqoop parallelizes imports:

Sqoop launches N mappers (default 4), each reads a range of the primary key.
Boundary query: SELECT MIN(pk), MAX(pk) FROM table.
Range divided into N equal splits.
Each mapper runs its own JDBC SELECT with WHERE clause on its range.

Limitation: A skewed split column (large gaps, or values clustered in a narrow range) causes imbalanced splits, because the ranges are equal in width, not in row count. Use --split-by with a uniformly distributed column.
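The range-splitting step, and how it goes wrong on sparse keys, can be reproduced in a few lines. This is a simplified model for integer keys only; Sqoop's own splitter handles more column types and differs in boundary details, and the function names here are hypothetical.

```python
def split_ranges(min_pk, max_pk, num_mappers=4):
    """Divide [min_pk, max_pk] into contiguous, near-equal integer ranges."""
    total = max_pk - min_pk + 1
    base, extra = divmod(total, num_mappers)
    ranges, lo = [], min_pk
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        ranges.append((lo, lo + size - 1))
        lo += size
    return ranges

def where_clauses(column, ranges):
    """Each mapper would run its SELECT with one of these predicates."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]

ranges = split_ranges(1, 1_000_000, num_mappers=4)
for clause in where_clauses("order_id", ranges):
    print(clause)

# Skew demo: IDs cluster in 1..1000 plus one outlier at 1_000_000
ids = list(range(1, 1001)) + [1_000_000]
rows_per_split = [sum(lo <= i <= hi for i in ids) for lo, hi in ranges]
print(rows_per_split)
```

In the skew demo one mapper receives 1000 of the 1001 rows while two mappers receive none, even though the four ranges are exactly the same width.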

Q19. What is Apache Flume, and how does it differ from Kafka for data ingestion?

Flume is a distributed service for collecting, aggregating, and moving log/event data to HDFS. Designed specifically for Hadoop.

Flume architecture:

Source: Receives data (Avro, Thrift, Syslog, HTTP, JMS, Exec).
Channel: Buffer between source and sink (Memory channel or File channel for durability).
Sink: Writes data (HDFS, HBase, Kafka, Elasticsearch, Logger).

# flume-agent.conf
agent.sources = access_log_source
agent.sinks = hdfs_sink
agent.channels = memory_channel

agent.sources.access_log_source.type = exec
agent.sources.access_log_source.command = tail -F /var/log/nginx/access.log
agent.sources.access_log_source.channels = memory_channel

agent.channels.memory_channel.type = memory
agent.channels.memory_channel.capacity = 10000
agent.channels.memory_channel.transactionCapacity = 1000

agent.sinks.hdfs_sink.type = hdfs
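agent.sinks.hdfs_sink.hdfs.path = /user/logs/%Y/%m/%d
# Exec-source events carry no timestamp header, so resolve %Y/%m/%d from the agent's clock
agent.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true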
agent.sinks.hdfs_sink.hdfs.fileType = DataStream
agent.sinks.hdfs_sink.hdfs.rollInterval = 3600
agent.sinks.hdfs_sink.hdfs.rollSize = 134217728
agent.sinks.hdfs_sink.hdfs.rollCount = 0
agent.sinks.hdfs_sink.channel = memory_channel

Flume vs Kafka:

Aspect	Flume	Kafka
Primary purpose	Log collection to HDFS	Durable event streaming bus
Consumer model	Each event consumed once per channel (fan-out needs one channel per sink)	Multiple independent consumer groups
Replay	No (data goes to HDFS)	Yes (retained for configurable period)
Decoupling	Source-to-sink coupled	Full producer-consumer decoupling
Backpressure	Channel capacity	Consumer lag
Ecosystem	Hadoop-centric	Universal (any producer/consumer)

Modern architectures: use Kafka as the ingestion bus, Kafka Connect HDFS Sink (or Flume's Kafka source) to land data in HDFS.

Performance Tuning

Q20. How do you tune a MapReduce job for performance?

Memory tuning:

# Mapper: JVM heap (java.opts) must stay below the container size (memory.mb)
-Dmapreduce.map.java.opts=-Xmx1800m
-Dmapreduce.map.memory.mb=2048

# Reducer: JVM heap and container size
-Dmapreduce.reduce.java.opts=-Xmx3500m
-Dmapreduce.reduce.memory.mb=4096

Shuffle tuning (most impactful for large jobs):


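<!-- Map-side sort buffer in MB (default 100); must fit inside the mapper heap -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value>
</property>

<!-- Buffer fill fraction that triggers a spill to disk (default 0.80) -->
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.90</value>
</property>

<!-- Number of spill streams merged at once (default 10) -->
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>100</value>
</property>

<!-- Parallel fetch threads per reducer during shuffle (default 5) -->
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>25</value>
</property>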

Compression tuning:


<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

Reduce count:

# Rule of thumb: 0.95 x (nodes x max reduce containers per node), so all reducers run in a single wave
# Or: total_reduce_input / 128MB (target one reducer per 128 MB input)
-Dmapreduce.job.reduces=50
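Both rules of thumb are easy to compute side by side. The sketch below reads "reducers to fill the cluster" as nodes times reduce containers per node, which is an interpretation rather than something the rule states. The cluster size and shuffle volume are hypothetical.

```python
import math

def reducers_for_one_wave(nodes, reduce_containers_per_node, factor=0.95):
    """Reducer count that lets every reducer start in the first wave."""
    return round(factor * nodes * reduce_containers_per_node)

def reducers_for_input(shuffle_bytes, target_bytes=128 * 1024 ** 2):
    """Reducer count that gives each reducer about target_bytes of input."""
    return max(1, math.ceil(shuffle_bytes / target_bytes))

# Hypothetical cluster: 20 nodes, 8 reduce containers each, 50 GiB of shuffle data
by_capacity = reducers_for_one_wave(20, 8)
by_input = reducers_for_input(50 * 1024 ** 3)

print(by_capacity, by_input)
```

When the two numbers disagree, as they do here, the capacity-based figure minimizes waves while the input-based figure keeps each reducer's workload small. Which one to favor depends on whether the job is limited by cluster slots or by per-reducer memory.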

Q21. What causes reducer data skew, and how do you fix it?

Skew = one reducer gets vastly more data than others, becoming the job bottleneck.

Causes:

Natural key distribution (popular product IDs, bot user IDs).
NULL join keys, which all hash to the same reducer and behave like a single hot key.
Uneven partitioning.

Diagnosis:

# Check task times in the ResourceManager / JobHistory Server UI (JobTracker UI on Hadoop 1)
# Look for reduce tasks with 10x longer duration than peers
mapred job -status job_20260608_0001 | grep reduce

Fixes:

1. Custom partitioner for hot keys:

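import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SaltedPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // The mapper must append a salt to the hot key ("popular_product:0" .. ":9");
        // identical keys always hash to the same partition.
        String keyStr = key.toString();
        int hash = keyStr.hashCode() & Integer.MAX_VALUE;
        if (keyStr.startsWith("popular_product:")) {
            // Spread the salted hot key across at most 10 reducers
            return hash % Math.min(10, numPartitions);
        }
        return hash % numPartitions;
    }
}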

2. NULL key handling:

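-- Hive: replace NULL join keys with random negative values so they spread
-- across reducers (assumes real user_id values are never negative)
SELECT a.*, b.info
FROM large_table a
LEFT JOIN small_table b
ON COALESCE(a.user_id, CAST(-1 - FLOOR(rand() * 99999) AS BIGINT)) = b.user_id;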

3. Two-phase aggregation:

Map -> (salted_key, 1)
Reduce 1 -> (salted_key, partial_count) [many reducers]
Reduce 2 -> (original_key, total_count) [few reducers]
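The two phases can be simulated locally to confirm that salting changes the load distribution but not the final counts. This is plain Python standing in for two chained jobs; the `#` separator, the fan-out of 4, and the sample keys are arbitrary choices for illustration.

```python
import random
from collections import Counter

NUM_SALTS = 4  # assumed fan-out for hot keys

def map_salted(records, rng):
    # Phase 1 map: append a random salt so one hot key spreads over several reducers.
    for key in records:
        yield f"{key}#{rng.randrange(NUM_SALTS)}", 1

def reduce_counts(pairs):
    # Generic sum-by-key reducer, used in both phases.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

def strip_salt(partials):
    # Phase 2 map: drop the salt to recover the original key.
    for salted_key, count in partials.items():
        yield salted_key.rsplit("#", 1)[0], count

rng = random.Random(42)
records = ["popular_product"] * 1000 + ["rare_product"] * 3

partials = reduce_counts(map_salted(records, rng))   # phase 1: many reducers
totals = reduce_counts(strip_salt(partials))         # phase 2: few reducers

print(dict(partials))
print(dict(totals))
```

Phase 1 leaves at most `NUM_SALTS` partial counts per key, so phase 2 only has to merge a handful of small records per key, however hot the key was.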

Hadoop Ecosystem Integration

Q22. How does Hadoop integrate with modern cloud and Spark ecosystems?

Hadoop on cloud (AWS EMR, Azure HDInsight, GCP Dataproc):

# Spark reading from S3 through the Hadoop S3A connector (on EMR, EMRFS plays this role for s3:// paths)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("EMR-Spark-Job") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider") \
    .getOrCreate()

df = spark.read.parquet("s3a://my-datalake/processed/events/year=2026/")
df.groupBy("event_type").count().write.mode("overwrite") \
    .parquet("s3a://my-datalake/aggregates/event_counts/")

HDFS to Iceberg/Delta Lake migration path:

Keep HDFS for historical data (cheaper than re-migration).
New data lands in S3/GCS with Iceberg/Delta format.
Spark acts as a unified read layer over both, with the Hive Metastore as the shared catalog.

Hive Metastore as unified catalog:

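# Hive Metastore as the catalog (on AWS, the Glue Data Catalog offers a Hive-Metastore-compatible alternative)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("hive.metastore.uris", "thrift://metastore:9083") \
    .enableHiveSupport() \
    .getOrCreate()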

# Query both Hive tables (HDFS) and Delta tables (S3) in one SQL
spark.sql("""
    SELECT h.user_id, d.event_count
    FROM hive_db.legacy_users h
    JOIN delta_db.event_counts d ON h.user_id = d.user_id
""").show()

Q23. What is Apache Tez, and how does it improve on MapReduce for Hive?

MapReduce limitations:

Every job writes its output to HDFS, so chained jobs pay for replicated HDFS I/O between stages.
Forced map-shuffle-reduce structure -- complex queries need chained MR jobs.
Each job has JVM startup overhead.

Tez improvements:

Represents query as a generic DAG (Directed Acyclic Graph) of tasks.
Intermediate data flows in memory or local disk -- no HDFS writes between stages.
Operator fusion: consecutive operations (filter + project + aggregate) run in one task.
Container reuse: JVMs reused across tasks in the same query.
Dynamic re-planning: Adjust partition counts based on actual data size.

MapReduce for multi-join:
MR1: Map(scan A) -> Reduce(hash join A-B) -> write HDFS
MR2: Map(read HDFS) -> Reduce(hash join AB-C) -> write HDFS
MR3: Map(read HDFS) -> Reduce(aggregate) -> write HDFS

Tez for same query:
Task1(scan A) + Task2(scan B) -> Task3(hash join A-B, hash join with C, aggregate)
No intermediate HDFS writes, which is often several times faster for complex queries.

-- Enable Tez
SET hive.execution.engine = tez;
SET tez.am.resource.memory.mb = 4096;
SET hive.auto.convert.join = true;  -- Map join optimization
SET hive.tez.container.size = 2048;

Q24. Compare Hadoop MapReduce vs Apache Spark for batch processing.

Dimension	MapReduce	Spark
In-memory caching	No -- each job re-reads its input from HDFS	Yes -- RDD/DataFrame cache
Iterative algorithms	Slow (read/write HDFS each iteration)	Fast (cache between iterations)
Latency	Minutes	Seconds
Programming model	Map + Reduce only	Rich transformations (100+)
Language support	Java primary (Streaming for others)	Python, Scala, Java, R, SQL
DAG support	Chained jobs (manual)	Native DAG execution
Streaming	No	Structured Streaming
Fault tolerance	Task re-execution from HDFS	RDD lineage re-computation
Memory pressure	Spills to local disk gracefully	Also spills to disk, but OOM risk if executors are undersized
Maturity	Proven at exabyte scale	Production-ready, growing

When MapReduce is still valid:

Extreme-scale batch jobs where memory is the bottleneck.
Environments locked to older Hadoop clusters without Spark.
Jobs that already work well and don't justify migration cost.

ML workloads: Spark MLlib generally outperforms MapReduce-based Mahout for iterative algorithms such as gradient descent, because the training data stays cached in memory between iterations instead of being re-read from HDFS each time.

# Spark equivalent of word count (compare to MR Java code above)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

counts = (
    spark.read.text("hdfs:///user/data/corpus/")
    .rdd
    .flatMap(lambda row: row[0].split())
    .map(lambda word: (word.lower(), 1))
    .reduceByKey(lambda a, b: a + b)
    .sortBy(lambda x: -x[1])
)

counts.saveAsTextFile("hdfs:///user/output/word_counts/")

Q25. How do you monitor and troubleshoot a Hadoop cluster?

Key monitoring points:

# HDFS health
hdfs dfsadmin -report
# Shows: live/dead DataNodes, block counts, used/available space

# Check under-replicated blocks
hdfs dfsadmin -report | grep "Under replicated"
hdfs fsck /  # The summary at the end lists under-replicated, corrupt, and missing blocks

# YARN cluster status
yarn node -list -all
yarn application -list -appStates RUNNING

# Check YARN queue usage
yarn queue -status default

# View running/failed jobs
mapred job -list all | head -20
mapred job -status <job_id>

Log locations:

NameNode logs: $HADOOP_LOG_DIR/hadoop-<user>-namenode-<host>.log
DataNode logs: $HADOOP_LOG_DIR/hadoop-<user>-datanode-<host>.log
YARN RM logs: $HADOOP_LOG_DIR/yarn-<user>-resourcemanager-<host>.log
MapReduce job logs: YARN ResourceManager UI -> application -> container logs

Common issues:

Symptom	Likely cause	Fix
Jobs hang at 99% reduce	Data skew	Custom partitioner, salting
NameNode OOM	Too many small files	HAR archives, CombineFileInputFormat
DataNode full	Unbalanced data distribution	`hdfs balancer -threshold 10`
Jobs slow	GC pressure	Tune JVM heap, reduce shuffle
"Too many open files"	DataNode limit	Increase `ulimit -n` to 65536
Connection refused to NameNode	NameNode dead or GC pause	Check NN logs, increase heap
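The "salting" fix for data skew in the table above can be sketched in plain Python. This is an illustration of the idea rather than MapReduce or Spark code: the product IDs, the `#` separator, and the salt count of 8 are all assumptions, and keys are assumed not to contain `#`.

```python
import random
from collections import Counter

def salted_key(key, hot_keys, num_salts, rng):
    """Append a random salt to hot keys so they spread over several reducers."""
    if key in hot_keys:
        return f"{key}#{rng.randrange(num_salts)}"
    return key

def unsalt(key):
    """Strip the salt so partial aggregates can be merged in a second pass."""
    return key.split("#", 1)[0]

# Hypothetical skewed input: one product receives most of the page views.
events = ["p1"] * 9000 + ["p2"] * 500 + ["p3"] * 500
rng = random.Random(42)

# Pass 1: aggregate on salted keys (each salted key can land on a different reducer).
partial = Counter(salted_key(k, {"p1"}, 8, rng) for k in events)

# Pass 2: strip the salt and merge the partial counts.
final = Counter()
for key, count in partial.items():
    final[unsalt(key)] += count

print(dict(final))
```

The cost is a second aggregation pass; the benefit is that no single reducer has to process all 9,000 records for the hot key.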

# Rebalance HDFS data distribution
hdfs balancer -threshold 5  # Balance until no node deviates >5% from average

# Safe mode (NameNode waits for minimum replication)
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave  # Force exit if stuck
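As a rough illustration of what the balancer threshold means, the sketch below classifies DataNodes by how far their utilization sits from the cluster-wide average. It is a simplified model, not the balancer's actual algorithm, and the node names and capacities are made up.

```python
def classify_nodes(nodes, threshold_pct):
    """nodes: {name: (used, capacity)}. Returns (cluster utilization %, buckets)."""
    total_used = sum(used for used, _ in nodes.values())
    total_cap = sum(cap for _, cap in nodes.values())
    cluster_pct = 100.0 * total_used / total_cap
    buckets = {"over": [], "under": [], "balanced": []}
    for name, (used, cap) in sorted(nodes.items()):
        node_pct = 100.0 * used / cap
        if node_pct > cluster_pct + threshold_pct:
            buckets["over"].append(name)      # candidate source of block moves
        elif node_pct < cluster_pct - threshold_pct:
            buckets["under"].append(name)     # candidate destination
        else:
            buckets["balanced"].append(name)
    return cluster_pct, buckets

# Hypothetical three-node cluster, values in TB.
cluster = {"dn1": (9, 10), "dn2": (5, 10), "dn3": (1, 10)}
cluster_pct, buckets = classify_nodes(cluster, threshold_pct=5)
print(cluster_pct, buckets)
```

With a threshold of 5, only nodes within 5 percentage points of the 50% cluster average count as balanced, so blocks would move from `dn1` toward `dn3`.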

Real-World Scenarios

Q26. Design a Hadoop-based data pipeline for daily batch processing of 10 TB of e-commerce logs.

Requirements: Parse 10 TB raw NGINX access logs daily, compute product page views, add to 90-day rolling window, generate daily product performance report.

Architecture:

[NGINX Servers] -> [Flume/Kafka] -> [HDFS Raw Zone]
                                         |
                                   [Hive ETL (Tez)]
                                         |
                              [HDFS Processed Zone (ORC)]
                                         |
                              [Hive Aggregation Job]
                                         |
                              [HDFS Report Zone (Parquet)]
                                         |
                              [Sqoop Export to MySQL]
                                         |
                              [Dashboard/BI Tools]

Oozie workflow:

<workflow-app name="daily-ecommerce-pipeline">
    <start to="validate-raw"/>

    <action name="validate-raw">
        <hive><script>validate_raw_completeness.hql</script></hive>
        <ok to="parse-logs"/>
        <error to="alert-data-ops"/>
    </action>

    
    <action name="parse-logs">
        <hive><script>parse_access_logs.hql</script></hive>
        <ok to="compute-aggregates"/>
        <error to="alert-data-ops"/>
    </action>

    
    <action name="compute-aggregates">
        <hive><script>product_pageview_aggregates.hql</script></hive>
        <ok to="build-90day-window"/>
        <error to="alert-data-ops"/>
    </action>

    
    <action name="build-90day-window">
        <hive><script>rolling_window_update.hql</script></hive>
        <ok to="export-report"/>
        <error to="alert-data-ops"/>
    </action>

    
    <action name="export-report">
        <sqoop>
            <command>export --connect ... --table daily_product_report ...</command>
        </sqoop>
        <ok to="end"/>
        <error to="alert-data-ops"/>
    </action>
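
    <!-- Failure target for every action; a kill node is assumed here, an email action is another option -->
    <kill name="alert-data-ops">
        <message>Pipeline failed at node [${wf:lastErrorNode()}]</message>
    </kill>

    <end name="end"/>
</workflow-app>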

HDFS directory layout:

/data/raw/access_logs/year=2026/month=06/day=08/    -- Flume landing zone
/data/processed/parsed_logs/year=2026/month=06/day=08/   -- ORC partitioned
/data/aggregates/product_pageviews/year=2026/month=06/day=08/  -- daily aggregates
/data/reports/90day_window/   -- rolling window table
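The 90-day rolling window step can be maintained incrementally: add the newest day's aggregates and subtract the day that falls out of the window, instead of rescanning 90 partitions. The Python sketch below shows that logic under assumed inputs. In the pipeline itself this would be expressed in HiveQL, and the class name, product IDs, and counts here are hypothetical.

```python
from collections import Counter, deque

class RollingWindow:
    """Keeps per-product view totals over the most recent `days` daily aggregates."""

    def __init__(self, days=90):
        self.days = days
        self.daily = deque()     # (date, Counter) pairs, oldest first
        self.totals = Counter()  # running totals across the window

    def add_day(self, date_str, views):
        day = Counter(views)
        self.daily.append((date_str, day))
        self.totals.update(day)
        # Evict days that fall outside the window and subtract their counts.
        while len(self.daily) > self.days:
            _, expired = self.daily.popleft()
            self.totals.subtract(expired)
        # Unary plus drops products whose total fell to zero.
        self.totals = +self.totals

# A 3-day window keeps the example short; the pipeline would use days=90.
window = RollingWindow(days=3)
window.add_day("2026-06-05", {"p1": 10, "p2": 4})
window.add_day("2026-06-06", {"p1": 7})
window.add_day("2026-06-07", {"p1": 1, "p3": 2})
window.add_day("2026-06-08", {"p2": 5})  # evicts 2026-06-05
print(dict(window.totals))
```

The incremental approach depends on the daily aggregates being immutable. If a past day is reprocessed, the window has to be rebuilt from the daily partitions.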

Q27. How would you migrate an on-premise Hadoop cluster to AWS EMR?

Migration strategy: Lift-then-Modernize

Phase 1: Assess

# Inventory: file sizes, formats, partition counts
hdfs dfs -du -s /data/* | sort -rn | head -20
hive -e "SHOW DATABASES; USE prod; SHOW TABLES;" > table_inventory.txt

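# Job complexity: count Oozie action types (assumes workflow XML files are available under /oozie/workflows/)
grep -rhoE "<(map-reduce|hive|pig|spark|sqoop)[ >]" /oozie/workflows/ | tr -d ' >' | sort | uniq -c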

Phase 2: Copy data to S3

# S3DistCp -- distributed copy from HDFS to S3 using MapReduce
# Copy ORC files unchanged: --groupBy (file concatenation) and --outputCodec
# (recompression) are meant for text/log files and would corrupt ORC files
s3-dist-cp \
    --src hdfs:///data/processed/ \
    --dest s3://company-datalake/processed/ \
    --srcPattern ".*\.orc"

Phase 3: Validate

# Row count validation (whole table; group by the partition columns for a per-partition check)
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Compare counts: HDFS Hive vs S3 EMR
hdfs_count = spark.read.orc("hdfs:///data/processed/events/").count()
s3_count = spark.read.orc("s3a://company-datalake/processed/events/").count()

assert hdfs_count == s3_count, f"Mismatch: {hdfs_count} vs {s3_count}"
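A single total can hide a missing partition that is offset by duplicates elsewhere, so a per-partition comparison is usually more informative. The helper below is a sketch that compares two `{partition: row_count}` dictionaries. The sample counts are invented, and in practice they might come from something like `df.groupBy("year", "month", "day").count()` on each side.

```python
def compare_partition_counts(source, target):
    """Compare {partition: row_count} maps and report every discrepancy."""
    report = {"missing_in_target": [], "extra_in_target": [], "count_mismatch": []}
    for part in sorted(source.keys() | target.keys()):
        if part not in target:
            report["missing_in_target"].append(part)
        elif part not in source:
            report["extra_in_target"].append(part)
        elif source[part] != target[part]:
            report["count_mismatch"].append((part, source[part], target[part]))
    return report

# Hypothetical per-day counts from the HDFS source and the S3 copy.
hdfs_counts = {"2026-06-06": 1200, "2026-06-07": 1350, "2026-06-08": 1100}
s3_counts = {"2026-06-06": 1200, "2026-06-07": 1349}

report = compare_partition_counts(hdfs_counts, s3_counts)
is_clean = not any(report.values())
print(report, is_clean)
```

Reporting all discrepancies at once, rather than asserting on the first one, makes it easier to re-copy only the affected partitions.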

Phase 4: Switch + modernize

Move Oozie workflows to EMR, or re-implement them as DAGs on MWAA (managed Airflow).
Convert MR jobs to Spark.
Convert ORC to Parquet/Iceberg for cross-service compatibility.
Replace the Hive Metastore with the AWS Glue Data Catalog.

Q28. What are the key differences between Hadoop 2 and Hadoop 3?

Feature	Hadoop 2	Hadoop 3
NameNode HA	Active + 1 Standby	Active + multiple Standbys
Storage type	Replication only (3x by default)	Replication or Erasure Coding (EC)
Default block size	128 MB	128 MB (unchanged)
YARN timeline service	v1	v2 (scalable, HBase backend)
Minimum Java	Java 7	Java 8
Port defaults	NameNode web UI: 50070	NameNode web UI: 9870
HDFS Federation	Supported	Enhanced with Router-Based Federation

Erasure Coding (Hadoop 3):

# EC reduces storage overhead from 200% (3x replication) to ~50%
# Trade-off: higher CPU for encoding/decoding on reads/writes

# Enable EC on a directory
hdfs ec -setPolicy -path /cold_data -policy RS-6-3-1024k
# RS-6-3: 6 data blocks + 3 parity blocks, can lose any 3
# Overhead: 3/6 = 50% vs 200% for 3x replication

# View EC policies
hdfs ec -listPolicies
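The overhead figures quoted above follow from simple arithmetic, worked through in the Python sketch below. The 100 TB volume is an arbitrary example, and the function names are illustrative.

```python
def replication_overhead(replicas):
    """Extra storage as a fraction of logical data size (3 replicas -> 2.0, i.e. 200%)."""
    return replicas - 1

def erasure_coding_overhead(data_units, parity_units):
    """Reed-Solomon(k, m): m parity units are stored for every k data units."""
    return parity_units / data_units

def raw_storage_tb(logical_tb, overhead):
    """Raw capacity needed to hold `logical_tb` of user data."""
    return logical_tb * (1 + overhead)

logical_tb = 100  # hypothetical cold-data volume
rep3 = raw_storage_tb(logical_tb, replication_overhead(3))        # 300 TB
rs63 = raw_storage_tb(logical_tb, erasure_coding_overhead(6, 3))  # 150 TB
print(rep3, rs63, rep3 - rs63)
```

For the same 100 TB, RS-6-3 halves the raw footprint compared with 3x replication while still tolerating the loss of any three blocks in a stripe.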

Router-Based Federation (Hadoop 3): Allows multiple independent NameNode namespaces mounted under one unified namespace. Client sees single HDFS; Router maps paths to the correct NameNode cluster.
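The path-to-NameNode mapping can be pictured as a longest-prefix lookup in a mount table. The Python sketch below illustrates that idea only: the mount points and nameservice names are hypothetical, and a real Router's mount table supports features (such as multiple destinations per entry) that are not modelled here.

```python
def resolve(path, mount_table):
    """Map a client path to (nameservice, remote_path) via longest-prefix match."""
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/")
        # Match whole path components only, so /data does not capture /database.
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    nameservice, target = mount_table[best]
    suffix = path[len(best.rstrip("/")):]
    return nameservice, target.rstrip("/") + suffix

# Hypothetical mount table: mount point -> (nameservice, path in that namespace)
mounts = {
    "/data/raw": ("ns-ingest", "/raw"),
    "/data": ("ns-analytics", "/data"),
    "/user": ("ns-home", "/user"),
}

print(resolve("/data/raw/access_logs/x.log", mounts))
print(resolve("/data/processed/a.orc", mounts))
```

The more specific `/data/raw` entry wins over `/data`, which is what lets one subtree live on a different NameNode while clients keep using a single namespace.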

FAQ

Q: What is the difference between HDFS and a regular distributed filesystem like NFS?

HDFS is optimized for large sequential reads of large files, not random access. It assumes write-once-read-many workloads, stores data in large blocks (128 MB by default, often configured to 256 MB), and co-locates compute with storage for data locality. NFS provides POSIX semantics with random read/write but does not scale to petabytes or support MapReduce data locality. HDFS is fault-tolerant via replication; NFS relies on underlying hardware or RAID.

Q: Can Hadoop handle real-time data?

The classic Hadoop batch stack (HDFS + MapReduce, with Hive on top) is designed for batch processing with latencies of minutes to hours. Real-time processing in the Hadoop ecosystem uses Apache Spark Structured Streaming, Apache Flink, or Apache Storm, which can run on YARN and read from and write to HDFS. Kafka typically sits in front as the real-time ingestion layer.

Q: What replaced Hadoop in modern data architectures?

Candidates report that modern data stacks increasingly use: cloud object storage (S3/GCS/ADLS) instead of HDFS; Apache Spark instead of MapReduce; Apache Airflow instead of Oozie; open table formats (Delta Lake, Iceberg, Hudi) instead of Hive tables on ORC. The Hadoop ecosystem components (Hive Metastore, YARN, HBase) remain relevant but the underlying HDFS layer is being replaced by cloud storage. Confirm the exact stack at the company you are interviewing with on the official careers portal.

Sources and review notes (reviewed 8 Jun 2026)

Verification window

Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.

Evidence labels

Official notices, candidate reports, offer documents, and editorial practice questions carry different confidence levels. The visible source list lets you inspect the evidence instead of relying on a blanket verification badge.

Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
