
Hadoop Interview Questions 2026: HDFS, MapReduce, YARN & Ecosystem

Updated: 8 Jun 2026

PapersAdda 2026 Placement Cycle

By Aditya Sharma·Founder & Editor, PapersAdda

What changed in 2026 drives

Mass-recruiter offer letters are flatter for 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.

What I'd actually study for this

  • 01 Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
  • 02 One real project you can defend end-to-end - file paths, design decisions, and what you would change
  • 03 One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
  • 04 Three behavioural STAR stories: failure recovered, conflict handled, ownership taken

Where most candidates trip up

The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.

Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.

Candidates report that Hadoop ecosystem questions in 2026 interviews focus heavily on HDFS internals, MapReduce optimization, and integration with modern tools like Spark and cloud storage. Confirm exact topics and framework versions on the official company careers portal before your interview.

Hadoop remains a foundational big data technology. While cloud-native tools have taken center stage, understanding Hadoop's architecture -- HDFS, MapReduce, YARN, and the broader ecosystem (Hive, HBase, Pig, Oozie) -- is essential for data engineering roles at enterprises with large on-premise or hybrid deployments.


Core Architecture

Q1. Explain Hadoop's core architecture and the role of each component.

Hadoop has three core components:

Component | Role
HDFS | Distributed filesystem -- stores data across commodity hardware
MapReduce | Parallel processing framework -- splits jobs into map + reduce tasks
YARN | Resource management -- allocates CPU/memory to applications

HDFS architecture:

  • NameNode: Master -- stores filesystem metadata (namespace and file-to-block mappings) and tracks the block locations reported by DataNodes. It is the single metadata authority, so production clusters run it in HA mode with a standby NameNode.
  • DataNodes: Workers -- store actual data blocks (default 128 MB each). Send heartbeats every 3 seconds to NameNode.
  • Secondary NameNode: Checkpoints NameNode's edit logs to prevent log bloat. NOT a hot standby.

YARN architecture:

  • ResourceManager: Cluster-wide resource allocator -- two daemons: Scheduler (allocates resources) + ApplicationsManager (manages submitted jobs).
  • NodeManager: Per-node -- launches and monitors containers.
  • ApplicationMaster: Per-job -- negotiates resources with RM, coordinates tasks.

Data flow for a MapReduce job:

  1. Client submits job to ResourceManager.
  2. RM launches ApplicationMaster in a container.
  3. AM requests containers from RM; NMs launch mapper/reducer tasks.
  4. Output written to HDFS.

Q2. How does HDFS achieve fault tolerance?

Three mechanisms:

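1. Replication (default factor = 3)

Each block is replicated to 3 DataNodes. Default placement policy: first replica on the writer's own node (or a random node if the client runs outside the cluster), second on a node in a different rack, third on a different node in the same rack as the second. This balances fault tolerance with write bandwidth.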

2. Heartbeat + block reports

  • DataNodes send heartbeats to NameNode every 3 seconds.
  • Block reports every 6 hours (default) -- NameNode validates replication factor.
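  • If the NameNode hears no heartbeat for 30 seconds (default), it marks the DataNode stale. After 10 minutes 30 seconds by default (2 x the 5-minute recheck interval + 10 x the 3-second heartbeat interval) it declares the DataNode dead and re-replicates its blocks.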
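
As a rough sketch of how these intervals combine: stock Apache Hadoop treats a DataNode as stale after dfs.namenode.stale.datanode.interval and as dead after 2 x dfs.namenode.heartbeat.recheck-interval + 10 x dfs.heartbeat.interval. The function name below is hypothetical, and the default values (3 s, 5 min, 30 s) are assumed from hdfs-default.xml; check them against your distribution.

```python
def datanode_timeouts(heartbeat_interval_s=3,
                      recheck_interval_ms=300_000,
                      stale_interval_ms=30_000):
    """Return (stale_after_s, dead_after_s) for the given HDFS settings."""
    # Dead timeout: 2 * recheck interval + 10 * heartbeat interval
    dead_after_s = (2 * recheck_interval_ms) / 1000 + 10 * heartbeat_interval_s
    # Stale timeout is configured directly
    stale_after_s = stale_interval_ms / 1000
    return stale_after_s, dead_after_s


stale_s, dead_s = datanode_timeouts()
print(f"stale after {stale_s:.0f} s, dead after {dead_s:.0f} s")
```

Lowering the recheck interval shortens the time to re-replication, at the cost of more false positives during network blips or long GC pauses.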

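3. Checksums

HDFS stores a CRC-32C checksum for every 512-byte chunk of a block (dfs.bytes-per-checksum) in a metadata file next to the block. Reads verify checksums; on a mismatch the client reads from another replica and reports the corrupt block to the NameNode.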

# Check HDFS file health
hdfs fsck /user/data/my_file.csv -files -blocks -locations

# Check replication factor
hdfs dfs -stat "%r" /user/data/my_file.csv

NameNode HA: Active + Standby NameNode share state via JournalNodes (quorum-based log). ZooKeeper handles automatic failover.


Q3. What is data locality in Hadoop, and why does it matter?

Data locality = running computation where the data already resides, avoiding network I/O.

Three levels:

  1. Node-local: Task runs on the DataNode that holds the block. Best -- zero network cost.
  2. Rack-local: Task runs on a DataNode in the same rack. One network hop within the rack switch.
  3. Off-rack: Task fetches data across racks. Most expensive -- cross-rack bandwidth is the bottleneck.

MapReduce JobTracker (in YARN: ApplicationMaster) attempts node-local placement first. If the local node's slots are all occupied, it falls back to rack-local, then off-rack.

Why it matters: A typical MapReduce job reads all input data. At 100 TB input, pulling everything through an aggregate 1 GB/s of cross-rack bandwidth would take about 28 hours (100,000 seconds). Node-local reads use local disks instead (often 200-500 MB/s per disk), and every node reads in parallel.
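
The arithmetic behind that comparison can be checked with a few lines of Python. The cluster shape (200 nodes, 10 disks each, 200 MB/s per disk) is a hypothetical example chosen only to make the numbers concrete.

```python
TB = 10**12
GB = 10**9
MB = 10**6


def transfer_hours(input_bytes, bandwidth_bytes_per_s):
    """Hours needed to move input_bytes at a sustained aggregate bandwidth."""
    return input_bytes / bandwidth_bytes_per_s / 3600


# All 100 TB squeezed through 1 GB/s of cross-rack bandwidth
cross_rack = transfer_hours(100 * TB, 1 * GB)

# Hypothetical cluster: 200 nodes x 10 disks, each disk read locally at 200 MB/s
node_local = transfer_hours(100 * TB, 200 * 10 * 200 * MB)

print(f"cross-rack: {cross_rack:.1f} h, node-local: {node_local * 60:.1f} min")
```

The gap comes from parallelism: node-local reads scale with the number of disks, while cross-rack traffic shares a fixed uplink.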

Cloud implication: S3-based data lakes decouple storage from compute, so node-level data locality no longer applies and every read crosses the network. Spark on EMR therefore leans on instance types with high network bandwidth, plus columnar formats and partition pruning to cut the bytes read.


Q4. Explain the MapReduce programming model with a word count example.

MapReduce has four main phases: Input split, Map, Shuffle & Sort, Reduce. An optional Combiner can run between Map and Shuffle.

# Conceptual Python sketch of word count (plain Python, not the Hadoop API)

# MAP phase: runs on each input split (portion of a text file)
def map_fn(key, value):
    # key = line offset, value = line text
    for word in value.split():
        yield word.lower(), 1

# After map: framework groups all values by key (shuffle + sort)
# ("hadoop", [1, 1, 1]), ("is", [1, 1]), ...

# REDUCE phase: runs once per unique key
def reduce_fn(key, values):
    # key = word, values = list of 1s
    yield key, sum(values)

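Java implementation structure:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {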
    // Mapper class
    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer class
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
            throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

Phases in detail:

  1. Input split: HDFS file divided into splits (usually one block = one split). Each split goes to one Mapper.
  2. Map: Emits (key, value) pairs.
  3. Combiner (optional): Local mini-reducer on mapper output -- reduces shuffle data volume.
  4. Shuffle & Sort: Framework partitions, transfers, and sorts mapper output by key before reduce.
  5. Reduce: Aggregates all values for each key into final output.
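
To make the five steps concrete, here is a minimal in-memory simulation in plain Python. It does not use Hadoop; the function names and the two sample splits are invented for illustration, and each "mapper" is just a list comprehension iteration.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit (word, 1) for every word in every line of one split."""
    for line in split:
        for word in line.split():
            yield word.lower(), 1


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_and_sort(mapper_outputs):
    """Shuffle & sort: group values from all mappers by key, sorted by key."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(grouped):
    """Reduce: sum the values for each key."""
    return {key: sum(values) for key, values in grouped}


# Two input splits, each handled by its own mapper
splits = [["Hadoop is fast", "hadoop is big"], ["Hadoop scales"]]
mapper_outputs = [combine(map_phase(split)) for split in splits]
word_counts = reduce_phase(shuffle_and_sort(mapper_outputs))
print(word_counts)
```

In a real cluster the grouped keys would also be partitioned across several reducers; this sketch uses a single reducer to keep the flow visible.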

Q5. What is a Combiner in MapReduce, and when should you use it?

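A Combiner is a local reducer that runs on each Mapper's output before the shuffle phase. It reduces the volume of data transferred across the network. Hadoop may run the combiner zero, one, or several times per mapper, so job correctness must never depend on it.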

When to use:

  • The reduce operation is commutative and associative (e.g., sum, max, min, count).
  • NOT valid for operations like average: an average of per-mapper averages weights every mapper equally regardless of how many records it saw, so the result is wrong.

// Set combiner to same class as reducer (valid for word count)
job.setCombinerClass(IntSumReducer.class);

Impact:

  • Without combiner: Each mapper emits (word, 1) for every occurrence. Network transfers millions of pairs.
  • With combiner: Local aggregation per word per mapper. Network transfers one (word, localCount) per unique word per mapper.

For average -- correct pattern:

# Map emits: (word, (count, sum))
# Combine: (word, (totalCount, totalSum)) -- valid, just add counts + sums
# Reduce: (word, totalSum / totalCount)
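
A small simulation shows why carrying (count, sum) pairs works where averaging averages does not. This is plain Python with invented sample data, not Hadoop API code.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit (key, (count, sum)) with count = 1 for each record."""
    for key, value in records:
        yield key, (1, value)


def combine(pairs):
    """Combiner: add counts and sums per key; safe to apply repeatedly."""
    acc = defaultdict(lambda: (0, 0.0))
    for key, (count, total) in pairs:
        c, t = acc[key]
        acc[key] = (c + count, t + total)
    return list(acc.items())


def reduce_phase(all_pairs):
    """Reduce: merge partial (count, sum) pairs, then divide once."""
    merged = dict(combine(all_pairs))
    return {key: total / count for key, (count, total) in merged.items()}


mapper_1 = [("a", 10), ("a", 20), ("a", 30)]  # local average 20
mapper_2 = [("a", 100)]                       # local average 100

partials = combine(map_phase(mapper_1)) + combine(map_phase(mapper_2))
averages = reduce_phase(partials)

# The incorrect shortcut: averaging the per-mapper averages
naive = (20.0 + 100.0) / 2

print(averages["a"], naive)
```

The correct result divides the global sum (160) by the global count (4); the naive value is skewed because the single-record mapper gets the same weight as the three-record mapper.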

Q6. Explain MapReduce speculative execution.

Problem: Stragglers -- slow tasks that lag behind and delay entire job completion. Can be caused by hardware degradation, data skew, or resource contention.

Solution: Speculative execution -- launch duplicate copies of slow tasks on other nodes; use whichever finishes first, kill the other.

Trigger conditions (configurable):

  • Task has been running longer than average for its stage.
  • Progress rate is significantly below average.
  • Cluster has spare capacity.

<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>

When to disable:

  • Non-idempotent operations (e.g., writing to external databases without deduplication).
  • Tasks with external side effects.
  • When cluster is already fully loaded (speculative tasks just compete for same resources).

HDFS Deep Dive

Q7. What is the difference between NameNode and Secondary NameNode?

Aspect | NameNode | Secondary NameNode
Role | Active master -- serves all metadata | Checkpoint helper
Stores | FsImage + EditLog (in memory + disk) | Merged FsImage checkpoints
Failure impact | Cluster unusable | No immediate impact
HA role | Active or Standby (HA mode) | Not part of HA setup

EditLog problem: NameNode logs every metadata operation to an EditLog. Over time, EditLog grows huge. On restart, replaying a large EditLog is slow.

Secondary NameNode solution:

  1. Periodically downloads FsImage + EditLog from NameNode.
  2. Merges them into a new FsImage (checkpointing).
  3. Uploads merged FsImage back to NameNode.
  4. NameNode can start fresh from new FsImage + smaller EditLog.

Secondary NameNode is NOT a hot standby. For actual HA, use HDFS HA with Active/Standby NameNodes and JournalNodes.

# Force checkpoint on Secondary NameNode
hdfs secondarynamenode -checkpoint force

Q8. What is HDFS small files problem, and how do you solve it?

Problem: Each file in HDFS requires a metadata entry in the NameNode's memory (roughly 150 bytes per file/block). Millions of small files:

  • Exhaust NameNode heap memory.
  • Create excessive MapReduce tasks (one mapper per file/split).
  • Slow metadata operations.
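
A back-of-the-envelope estimate makes the heap pressure visible. The sketch below takes the roughly 150 bytes per object quoted above at face value and assumes one inode object plus one block object per small file; real usage varies by Hadoop version, so treat the output as an order of magnitude only.

```python
BYTES_PER_OBJECT = 150  # rough figure quoted above; an approximation


def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode heap: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# 100 TB stored as 100 million files of 1 MB each
small = namenode_heap_bytes(100_000_000)

# The same 100 TB stored as 781,250 files of 128 MB (one block each)
large = namenode_heap_bytes(781_250)

print(f"small files: {small / 10**9:.0f} GB, large files: {large / 10**6:.0f} MB")
```

Same data volume, over a hundred times less metadata: this is the argument for compacting files before they reach HDFS.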

Solutions:

1. HAR (Hadoop Archive):

# Package many small files into a single HAR archive
hadoop archive -archiveName myarchive.har -p /user/small_files /user/archives/

# Access individual files through HAR
hdfs dfs -ls har:///user/archives/myarchive.har/

Downside: HAR files are read-only; not suitable for streaming updates.

2. SequenceFile: Pack small files as key-value pairs in a SequenceFile (filename as key, content as value). Compressible, splittable, MapReduce-friendly.

3. CombineFileInputFormat:

// Combine multiple small files into one mapper's input split
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024); // 128 MB

4. Upstream fix: Avoid creating small files. Use Spark/Hive to write fewer, larger partitioned files. Coalesce output before writing to HDFS.


Q9. How does HDFS handle writes and what is the write pipeline?

Write flow:

  1. Client calls create() on the NameNode, which checks permissions and records the new file. For each block, the client then asks the NameNode for a list of DataNodes to host the replicas (the pipeline).
  2. Client writes data to first DataNode in the pipeline.
  3. First DataNode stores the block and forwards to second DataNode, which forwards to third.
  4. ACKs flow back: DN3 ACKs to DN2, DN2 ACKs to DN1, DN1 ACKs to client.
  5. After all replicas acknowledge, block is considered written.
  6. Client calls complete() on NameNode to close the file.

Packet-level pipeline:

  • Data is streamed in packets (64 KB default).
  • Client doesn't wait for full block ACK -- pipelines packets.
  • If a DataNode fails mid-write: the client is notified, the remaining healthy DataNodes form a new pipeline, and the under-replicated block is re-replicated after the write completes.

# Python WebHDFS client (the `hdfs` package, HdfsCLI)
from hdfs import InsecureClient

# NameNode web port: 50070 on Hadoop 2.x, 9870 on Hadoop 3.x
client = InsecureClient('http://namenode:50070', user='hadoop')

# Write to HDFS
with client.write('/user/data/output.csv', overwrite=True) as writer:
    writer.write(b"col1,col2\n1,2\n3,4\n")

# Read from HDFS
with client.read('/user/data/output.csv', encoding='utf-8') as reader:
    content = reader.read()

YARN and Resource Management

Q10. How does YARN schedule jobs, and what schedulers are available?

YARN has three built-in schedulers:

1. FIFO Scheduler

  • First-in-first-out queue.
  • Simple but poor for multi-tenant clusters -- large jobs starve small ones.

2. Capacity Scheduler (default in Apache Hadoop and HDP; CDH defaulted to the Fair Scheduler)

  • Multiple queues with guaranteed capacity (e.g., prod queue: 60%, dev queue: 40%).
  • Each queue uses FIFO internally.
  • Supports queue hierarchies, ACLs, preemption.
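  • Queues can borrow unused capacity.

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>60</value>
</property>
<!-- Sibling queue capacities must add up to 100 -->
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>40</value>
</property>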

3. Fair Scheduler

  • All jobs get equal share of resources over time.
  • A new job gets resources quickly, either as running jobs release containers or, if preemption is enabled, by preempting them.
  • Supports weighted queues, minimum guarantees.
  • Better for ad-hoc workloads mixed with long jobs.

Preemption: Fair and Capacity schedulers can preempt (kill) containers from queues that are using more than their share, freeing capacity for under-served queues.
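
The idea of fair sharing can be illustrated with a simplified max-min fairness calculation. This is a teaching sketch, not the Fair Scheduler's actual code: it ignores weights, minimum shares, queue hierarchies, and preemption, and the queue names and demands are invented.

```python
def max_min_fair_share(capacity, demands):
    """Split capacity so no queue gets more than it asks for and the
    leftover is shared equally among queues that still want more."""
    shares = {name: 0.0 for name in demands}
    remaining = dict(demands)
    left = float(capacity)
    while remaining and left > 1e-9:
        equal = left / len(remaining)
        # Queues whose whole demand fits inside an equal split
        satisfied = {n: d for n, d in remaining.items() if d <= equal}
        if not satisfied:
            # Everyone wants more than an equal split: divide evenly and stop
            for name in remaining:
                shares[name] += equal
            break
        for name, demand in satisfied.items():
            shares[name] += demand
            left -= demand
            del remaining[name]
    return shares


# 100 containers, three queues with different demands
shares = max_min_fair_share(100, {"etl": 80, "adhoc": 10, "report": 30})
print(shares)
```

The small ad-hoc queue is fully served, and the large ETL job absorbs whatever is left rather than starving the others.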


Hive

Q11. What is Hive, and how does HiveQL translate to MapReduce/Tez/Spark?

Hive is a data warehouse layer on top of Hadoop. It provides SQL-like HiveQL syntax that compiles to MapReduce, Tez, or Spark execution engines.

Architecture:

  • Metastore: MySQL/Derby database storing table schemas, partition metadata, SerDes.
  • Driver: Receives HiveQL, manages query lifecycle.
  • Compiler: Parses HiveQL into Abstract Syntax Tree, creates execution plan (DAG of MapReduce/Tez jobs).
  • Execution Engine: Runs the plan on YARN.

Execution engines:

Engine | Latency | Use case
MapReduce | Minutes | Batch, legacy
Tez | Seconds to minutes | Interactive, complex DAGs
Spark | Seconds | Memory-intensive, iterative

-- Set execution engine
SET hive.execution.engine = tez;

-- A simple HiveQL query
SELECT
    department,
    COUNT(*) AS employee_count,
    AVG(salary) AS avg_salary
FROM employees
WHERE year = 2026
GROUP BY department
ORDER BY avg_salary DESC
LIMIT 10;

Hive compiles this to:

  1. TableScan on HDFS (map phase reads only the partitions for year=2026, assuming employees is partitioned by year).
  2. HashAggregate (map-side combiner for COUNT/SUM).
  3. Shuffle by department.
  4. MergeAggregate (reduce).
  5. Order/Limit (single reducer for final sort).

Q12. What is partitioning in Hive, and how does dynamic partitioning work?

Partitioning stores data in subdirectories by column value, enabling partition pruning (skip entire directories during scans).

-- Create partitioned table
CREATE TABLE sales (
    order_id    BIGINT,
    product_id  BIGINT,
    amount      DOUBLE,
    customer_id BIGINT
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;

-- Static partition insert
INSERT INTO sales PARTITION (year=2026, month=6)
SELECT order_id, product_id, amount, customer_id FROM raw_sales
WHERE year=2026 AND month=6;

-- Dynamic partition insert (Hive figures out partitions from data)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO sales PARTITION (year, month)
SELECT order_id, product_id, amount, customer_id, year, month FROM raw_sales;

Partition pruning:

-- This query only scans /warehouse/sales/year=2026/month=6/
SELECT SUM(amount) FROM sales WHERE year=2026 AND month=6;

-- EXPLAIN to verify partition pruning
EXPLAIN SELECT SUM(amount) FROM sales WHERE year=2026 AND month=6;

Bucketing (clustering within partitions):

-- Bucket by customer_id into 32 buckets
CREATE TABLE sales_bucketed (
    order_id    BIGINT,
    product_id  BIGINT,
    amount      DOUBLE,
    customer_id BIGINT
)
PARTITIONED BY (year INT, month INT)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

Bucketing enables bucket map joins (when both tables are bucketed on the join key and their bucket counts are equal or multiples of each other) and more uniform splits.
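
The reason bucketing helps joins is that equal keys always land in the same bucket number on both sides, so each bucket pair can be joined independently. The sketch below uses a stand-in hash (the id itself); Hive's real bucketing hash depends on the Hive version and column type, and the sample rows are invented.

```python
from collections import defaultdict

NUM_BUCKETS = 32


def bucket_for(customer_id, num_buckets=NUM_BUCKETS):
    """Stand-in for Hive's bucketing hash: non-negative id modulo bucket count."""
    return (customer_id & 0x7FFFFFFF) % num_buckets


def bucketize(rows):
    """Group rows by the bucket of their first field (customer_id)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[0])].append(row)
    return buckets


def bucket_join(left, right):
    """Join bucket i of left only against bucket i of right."""
    left_b, right_b = bucketize(left), bucketize(right)
    joined = []
    for b in sorted(left_b):
        lookup = dict(right_b.get(b, []))  # small per-bucket hash table
        for customer_id, amount in left_b[b]:
            if customer_id in lookup:
                joined.append((customer_id, lookup[customer_id], amount))
    return joined


sales = [(1001, 250.0), (1033, 75.0), (2000, 10.0)]
customers = [(1001, "Asha"), (1033, "Ravi"), (2000, "Meera")]
joined = bucket_join(sales, customers)
print(joined)
```

Each per-bucket lookup table is a fraction of the full table, which is what lets a mapper hold it in memory without a shuffle.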


Q13. What is ORC format and why is it preferred over CSV/JSON in Hive?

ORC (Optimized Row Columnar) is a self-describing, type-aware column-oriented file format for Hive workloads.

Feature | CSV | ORC
Storage | Row | Columnar (columns stored separately within each stripe)
Compression | None/GZIP | Zlib, Snappy, LZO (applied to each column stream)
Predicate pushdown | No | Yes (min/max statistics and optional Bloom filters)
Schema evolution | Limited | Add/rename columns
Splittable | Yes when uncompressed (line-based); gzip-compressed CSV is not | Yes (stripe-based)
Native type support | Strings only | All Hive types including complex

Performance advantage:

-- With ORC, this query only reads the 'amount' column
-- Skips all other columns entirely
SELECT SUM(amount) FROM sales_orc WHERE year=2026;

-- Stripe-level skipping: stripes where min(amount) > 1000
-- are skipped when filtering amount < 100
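
Stripe-level skipping can be mimicked in a few lines: keep min/max statistics per stripe and consult them before touching the rows. The stripe contents below are invented, and a real ORC reader also uses row-group indexes and Bloom filters that this sketch leaves out.

```python
# Each "stripe" carries min/max statistics for the amount column
stripes = [
    {"min": 5, "max": 90, "rows": [5, 40, 90]},
    {"min": 1200, "max": 5000, "rows": [1200, 3000, 5000]},
    {"min": 50, "max": 1500, "rows": [50, 99, 1500]},
]


def scan_less_than(stripes, threshold):
    """Return rows with amount < threshold and the number of stripes read."""
    matched, stripes_read = [], 0
    for stripe in stripes:
        if stripe["min"] >= threshold:
            continue  # no row in this stripe can satisfy amount < threshold
        stripes_read += 1
        matched.extend(v for v in stripe["rows"] if v < threshold)
    return matched, stripes_read


matched, stripes_read = scan_less_than(stripes, 100)
print(matched, stripes_read)
```

The second stripe is never decoded. Skipping works best when data is sorted or clustered on the filter column, since that keeps each stripe's min/max range narrow.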

Creating ORC tables:

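CREATE TABLE sales_orc (
    order_id    BIGINT,
    product_id  BIGINT,
    amount      DOUBLE,
    customer_id BIGINT
)
PARTITIONED BY (year INT)
STORED AS ORC
TBLPROPERTIES (
    "orc.compress" = "SNAPPY",
    "orc.stripe.size" = "134217728",  -- 128 MB stripes
    "orc.bloom.filter.columns" = "customer_id,product_id"
);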

Parquet vs ORC: Both are columnar. ORC is optimized for Hive (stronger predicate pushdown). Parquet has broader ecosystem support (Spark, Impala, BigQuery, AWS Glue). Modern practice: Parquet for cloud data lakes, ORC for Hive-heavy on-premise stacks.


HBase

Q14. What is HBase, and when would you choose it over RDBMS or Hive?

HBase is a distributed, column-family-oriented NoSQL database built on HDFS. Modeled after Google Bigtable.

Data model:

  • Table: Collection of rows.
  • Row key: Byte array, rows stored in sorted lexicographic order. Critical for access patterns.
  • Column family: Physical grouping of columns (defined at table creation). Separate HFiles per family.
  • Column qualifier: Column within a family (defined at write time, schema-free).
  • Cell: (row key, column family, column qualifier, timestamp) -> value.

Comparison:

Aspect | RDBMS | Hive | HBase
Query pattern | SQL, complex joins | SQL, analytical scans | Point lookups, range scans by row key
Latency | ms | Minutes | ms (random reads)
Mutations | Yes (ACID) | Append/overwrite | Yes (put/delete/increment)
Schema | Fixed | Fixed | Sparse, schema-free columns
Scale | Vertical | Horizontal (batch) | Horizontal (OLTP scale)
Use case | Transactional | Analytical batch | Real-time read/write at scale

Choose HBase when:

  • Random read/write access to billions of rows.
  • Time-series data (sensor readings, event logs) with row-key-based range scans.
  • Real-time counter increments (HBase supports atomic increment).
  • Variable schema per row (sparse data).

import happybase

# Connect to HBase via Thrift server
connection = happybase.Connection('hbase-master', port=9090)
connection.open()

# Create table
connection.create_table(
    'user_events',
    {'events': dict(max_versions=5),
     'metadata': dict(max_versions=1)}
)

table = connection.table('user_events')

# Write
table.put(
    b'user:12345:20260608',  # row key: entity:id:date for efficient range scans
    {b'events:page_view': b'homepage',
     b'events:duration': b'45',
     b'metadata:device': b'mobile'}
)

# Read single row
row = table.row(b'user:12345:20260608')

# Scan range
for key, data in table.scan(
    row_start=b'user:12345:20260601',
    row_stop=b'user:12345:20260609'
):
    print(key, data)

Q15. Explain HBase row key design principles.

The row key is the primary access path in HBase. Poor row key design causes hotspotting (all reads/writes hitting one region) and inefficient scans.

Principles:

1. Avoid sequential keys (hotspotting)

# BAD: timestamp as prefix -- all current writes go to one region
20260608120000:user123
20260608120001:user456
20260608120002:user789

# GOOD: salt prefix -- distributes writes across regions
a3:20260608120000:user123  # hash(user123) % num_regions -> prefix
b7:20260608120001:user456
f2:20260608120002:user789
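
A salted key is straightforward to generate. The sketch below derives the salt from a stable hash of the user id so the same user always maps to the same prefix; the bucket count of 16 and the helper names are assumptions for illustration.

```python
import hashlib

NUM_SALT_BUCKETS = 16  # hypothetical; usually aligned with the region count


def salted_row_key(user_id, timestamp):
    """Prefix the key with a stable salt derived from the user id."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    salt = digest[0] % NUM_SALT_BUCKETS
    return f"{salt:02x}:{timestamp}:{user_id}"


def scan_prefixes(timestamp_prefix):
    """A time-range read must fan out: one scan per salt bucket."""
    return [f"{salt:02x}:{timestamp_prefix}" for salt in range(NUM_SALT_BUCKETS)]


key = salted_row_key("user123", "20260608120000")
print(key)
print(len(scan_prefixes("20260608")), "scans needed for one day")
```

The trade-off is visible in scan_prefixes: writes spread evenly, but any query by time range now has to issue one scan per salt value and merge the results.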

2. Design for access pattern

# Access pattern: "get all events for user X in date range"
# Row key: userId:date -> enables range scan
user123:20260601 ... user123:20260608  # range scan returns this user's events for June 1-8

3. Keep row keys short

Row keys are stored with every cell. Long row keys multiply storage. Use hashed or encoded IDs.

4. Reverse timestamp for latest-first reads

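import time

# Reverse timestamp: a fixed maximum minus the current time in ms (in Java: Long.MAX_VALUE - currentTimeMs)
# Newer events get smaller suffixes, so a scan from the start returns them first
MAX_TS_MS = 9999999999999  # 13 digits, so the suffix keeps a fixed width
row_key = f"user123:{MAX_TS_MS - int(time.time() * 1000)}"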

5. Composite keys

# User activity: (tenant:userId:eventType:timestamp)
# Enables prefix scans by tenant, by tenant+user, and by tenant+user+eventType
acme:user123:pageview:20260608120000

Pig and Oozie

Q16. What is Apache Pig, and how does Pig Latin differ from HiveQL?

Apache Pig is a high-level scripting language (Pig Latin) for expressing data flow programs on Hadoop. Pig Latin compiles to MapReduce jobs.

Pig Latin vs HiveQL:

Aspect | Pig Latin | HiveQL
Paradigm | Data flow (procedural) | Declarative SQL
Optimization | Manual (you control flow) | Automatic query optimizer
Schema | Optional (schema on read) | Required (DDL)
UDFs | Easier to integrate (Java/Python/Ruby) | Supported but more complex
Use case | ETL pipelines, multi-step transformations | Ad-hoc queries, reporting

-- Pig Latin: find top-10 products by sales
raw_data = LOAD '/user/data/sales.csv'
    USING PigStorage(',')
    AS (order_id:int, product_id:int, amount:float, date:chararray);

-- Filter
recent = FILTER raw_data BY date >= '2026-01-01';

-- Group and aggregate
by_product = GROUP recent BY product_id;
totals = FOREACH by_product GENERATE
    group AS product_id,
    SUM(recent.amount) AS total_sales,
    COUNT(recent) AS order_count;

-- Sort and limit
ranked = ORDER totals BY total_sales DESC;
top10 = LIMIT ranked 10;

-- Store
STORE top10 INTO '/user/output/top_products' USING PigStorage('\t');

Q17. What is Apache Oozie, and how does it orchestrate Hadoop workflows?

Oozie is a workflow scheduler for Hadoop jobs. It orchestrates sequences of MapReduce, Hive, Pig, Sqoop, and shell actions as directed acyclic graphs (DAGs).

Main types of Oozie jobs:

  1. Workflow: One-time DAG of actions.
  2. Coordinator: Schedules workflows on a time trigger or when input data becomes available.
  3. Bundle: Groups multiple coordinators so they can be managed together.

Workflow XML example:

<workflow-app name="etl-pipeline" xmlns="uri:oozie:workflow:0.5">
    <start to="validate-data"/>

    <action name="validate-data">
        <hive xmlns="uri:oozie:hive-action:0.3">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>scripts/validate.hql</script>
        </hive>
        <ok to="transform-data"/>
        <error to="send-failure-email"/>
    </action>

    <action name="transform-data">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapreduce.job.jar</name>
                    <value>/user/oozie/transform.jar</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="send-failure-email"/>
    </action>

    <action name="send-failure-email">
        <email xmlns="uri:oozie:email-action:0.2">
            <to>[email protected]</to>
            <subject>ETL Pipeline Failed</subject>
            <body>Check Oozie logs for job ${wf:id()}</body>
        </email>
        <ok to="fail"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Pipeline failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>

Coordinator for scheduled runs:

<coordinator-app name="daily-etl" frequency="${coord:days(1)}"
    start="2026-01-01T00:00Z" end="2027-01-01T00:00Z"
    timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
        </workflow>
    </action>
</coordinator-app>

Sqoop and Flume

Q18. What is Sqoop, and how does it import/export data between HDFS and RDBMS?

Sqoop is a tool for bulk transfer of structured data between Hadoop and relational databases (MySQL, Oracle, PostgreSQL, SQL Server).

# Import entire table from MySQL to HDFS
sqoop import \
    --connect jdbc:mysql://mysql-server:3306/production_db \
    --username hadoop_user \
    --password-file /user/hadoop/.sqoop_password \
    --table orders \
    --target-dir /user/data/orders \
    --as-parquetfile \
    --num-mappers 8 \
    --compress \
    --compression-codec snappy

# Incremental import (only new rows since last run)
sqoop import \
    --connect jdbc:mysql://mysql-server:3306/production_db \
    --username hadoop_user \
    --password-file /user/hadoop/.sqoop_password \
    --table orders \
    --target-dir /user/data/orders_incremental \
    --incremental append \
    --check-column order_id \
    --last-value 1000000

# Export from HDFS to MySQL
sqoop export \
    --connect jdbc:mysql://mysql-server:3306/reporting_db \
    --username hadoop_user \
    --password-file /user/hadoop/.sqoop_password \
    --table aggregated_sales \
    --export-dir /user/data/aggregated_sales \
    --input-fields-terminated-by '\t' \
    --update-mode allowinsert \
    --update-key order_date,product_id

How Sqoop parallelizes imports:

  • Sqoop launches N mappers (default 4), each reads a range of the primary key.
  • Boundary query: SELECT MIN(pk), MAX(pk) FROM table.
  • Range divided into N equal splits.
  • Each mapper runs its own JDBC SELECT with WHERE clause on its range.
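
The range-splitting step can be approximated as follows. This is a simplified sketch for an integer key, not Sqoop's actual splitter, which computes boundaries somewhat differently; the table and column names reuse the orders example from above.

```python
def split_ranges(min_pk, max_pk, num_mappers=4):
    """Divide [min_pk, max_pk] into up to num_mappers inclusive ranges."""
    total = max_pk - min_pk + 1
    base, extra = divmod(total, num_mappers)
    ranges, lo = [], min_pk
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder
        if size == 0:
            continue  # fewer key values than mappers
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


# Boundary query returned MIN(order_id) = 1, MAX(order_id) = 1000
for lo, hi in split_ranges(1, 1000):
    print(f"SELECT * FROM orders WHERE order_id >= {lo} AND order_id <= {hi}")
```

Note that the split is over key values, not rows: if most rows sit in one quarter of the key range, one mapper does most of the work.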

Limitation: Splits assume evenly distributed key values. Large gaps in the primary key, or a non-unique split column with many rows sharing one value, produce imbalanced mappers. Use --split-by with a uniformly distributed column.


Q19. What is Apache Flume, and how does it differ from Kafka for data ingestion?

Flume is a distributed service for collecting, aggregating, and moving log/event data to HDFS. Designed specifically for Hadoop.

Flume architecture:

  • Source: Receives data (Avro, Thrift, Syslog, HTTP, JMS, Exec).
  • Channel: Buffer between source and sink (Memory channel or File channel for durability).
  • Sink: Writes data (HDFS, HBase, Kafka, Elasticsearch, Logger).

# flume-agent.conf
agent.sources = access_log_source
agent.sinks = hdfs_sink
agent.channels = memory_channel

agent.sources.access_log_source.type = exec
agent.sources.access_log_source.command = tail -F /var/log/nginx/access.log
agent.sources.access_log_source.channels = memory_channel

agent.channels.memory_channel.type = memory
agent.channels.memory_channel.capacity = 10000
agent.channels.memory_channel.transactionCapacity = 1000

agent.sinks.hdfs_sink.type = hdfs
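agent.sinks.hdfs_sink.hdfs.path = /user/logs/%Y/%m/%d
# Exec-source events carry no timestamp header, so resolve %Y/%m/%d from local time
agent.sinks.hdfs_sink.hdfs.useLocalTimeStamp = true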
agent.sinks.hdfs_sink.hdfs.fileType = DataStream
agent.sinks.hdfs_sink.hdfs.rollInterval = 3600
agent.sinks.hdfs_sink.hdfs.rollSize = 134217728
agent.sinks.hdfs_sink.hdfs.rollCount = 0
agent.sinks.hdfs_sink.channel = memory_channel

Flume vs Kafka:

Aspect | Flume | Kafka
Primary purpose | Log collection to HDFS | Durable event streaming bus
Consumer model | Push to the sinks configured on each agent | Multiple independent consumer groups (pull)
Replay | No (data goes to HDFS) | Yes (retained for configurable period)
Decoupling | Source-to-sink coupled | Full producer-consumer decoupling
Backpressure | Channel capacity | Consumer lag
Ecosystem | Hadoop-centric | Universal (any producer/consumer)

Modern architectures: use Kafka as the ingestion bus, Kafka Connect HDFS Sink (or Flume's Kafka source) to land data in HDFS.


Performance Tuning

Q20. How do you tune a MapReduce job for performance?

Memory tuning:

# Mapper JVM heap
-Dmapreduce.map.java.opts=-Xmx1800m
-Dmapreduce.map.memory.mb=2048

# Reducer JVM heap
-Dmapreduce.reduce.java.opts=-Xmx3500m
-Dmapreduce.reduce.memory.mb=4096

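Shuffle tuning (most impactful for large jobs):

<!-- Map-side sort buffer in MB (default 100); a larger buffer means fewer spills -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value>
</property>

<!-- Buffer fill ratio that triggers a spill to disk (default 0.80) -->
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.90</value>
</property>

<!-- Number of spill streams merged at once (default 10) -->
<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>100</value>
</property>

<!-- Parallel fetch threads per reducer during shuffle (default 5) -->
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>25</value>
</property>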

Compression tuning:

<!-- Compress intermediate map output to cut shuffle traffic -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

Reduce count:

# Rule of thumb (Hadoop docs): 0.95 or 1.75 * (nodes * max containers per node)
# Or: total_reduce_input / 128MB (target one reducer per 128 MB input)
-Dmapreduce.job.reduces=50
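
Both rules of thumb are easy to turn into a quick calculator. The helper names are hypothetical, and the cluster figures in the example (10 nodes, 10 containers per node) are invented.

```python
import math


def reducers_by_input(total_reduce_input_bytes, target_bytes=128 * 1024 * 1024):
    """One reducer per ~128 MB of reduce input, with at least one reducer."""
    return max(1, math.ceil(total_reduce_input_bytes / target_bytes))


def reducers_by_capacity(nodes, containers_per_node, factor=0.95):
    """Fill most of the cluster's reduce capacity in a single wave."""
    return max(1, round(factor * nodes * containers_per_node))


by_input = reducers_by_input(50 * 128 * 1024 * 1024)  # 6.25 GiB of shuffle data
by_capacity = reducers_by_capacity(nodes=10, containers_per_node=10)
print(by_input, by_capacity)
```

When the two numbers disagree, the input-based figure avoids tiny output files while the capacity-based figure avoids idle containers; picking the smaller of the two is a reasonable starting point.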

Q21. What causes reducer data skew, and how do you fix it?

Skew = one reducer gets vastly more data than others, becoming the job bottleneck.

Causes:

  • Natural key distribution (popular product IDs, bot user IDs).
  • Cartesian-like joins (NULL keys all route to same reducer).
  • Uneven partitioning.

Diagnosis:

# Check task times in the ResourceManager / JobHistory UI (JobTracker UI on MRv1)
# Look for reduce tasks with 10x longer duration than peers
mapred job -status job_20260608_0001 | grep reduce

Fixes:

1. Custom partitioner for hot keys:

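import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SaltedPartitioner extends Partitioner<Text, IntWritable> {
    private static final int HOT_KEY_REDUCERS = 10;

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String keyStr = key.toString();
        int hash = keyStr.hashCode() & Integer.MAX_VALUE;
        // Assumes the mapper salts hot keys (e.g. "popular_product:7"),
        // so hashing the full key spreads them over up to 10 reducers.
        if (keyStr.startsWith("popular_product:")) {
            // Never return a partition index >= numPartitions
            return hash % Math.min(HOT_KEY_REDUCERS, numPartitions);
        }
        return hash % numPartitions;
    }
}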

2. NULL key handling:

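-- Hive: give each NULL key a random negative value so NULLs spread across reducers
-- (assumes user_id is a BIGINT whose real values are never negative)
SELECT a.*, b.info
FROM large_table a
LEFT JOIN small_table b
ON COALESCE(a.user_id, CAST(-1 - FLOOR(rand() * 99999) AS BIGINT)) = b.user_id;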

3. Two-phase aggregation:

Map -> (salted_key, 1)
Reduce 1 -> (salted_key, partial_count) [many reducers]
Reduce 2 -> (original_key, total_count) [few reducers]
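
The two phases can be simulated in memory to see the hot key being split and then merged back. This is a plain Python sketch with invented data; the salt count of 4 and the "#" separator are arbitrary choices.

```python
import random
from collections import Counter

NUM_SALTS = 4  # hypothetical: how many ways to split each key


def phase_one(keys, rng):
    """Map + first reduce: count per salted key, so a hot key is
    spread over NUM_SALTS reducers."""
    partial = Counter()
    for key in keys:
        salted = f"{key}#{rng.randrange(NUM_SALTS)}"
        partial[salted] += 1
    return partial


def phase_two(partial):
    """Second reduce: strip the salt and add up the partial counts."""
    totals = Counter()
    for salted, count in partial.items():
        original = salted.rsplit("#", 1)[0]
        totals[original] += count
    return totals


keys = ["popular_product"] * 1000 + ["rare_product"] * 3
rng = random.Random(42)  # fixed seed so the run is repeatable
partial = phase_one(keys, rng)
totals = phase_two(partial)
print(dict(totals))
```

The second phase sees at most NUM_SALTS records per original key, so even a single reducer handles it cheaply. The technique only works for aggregations that can be merged from partial results, such as counts and sums.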

Hadoop Ecosystem Integration

Q22. How does Hadoop integrate with modern cloud and Spark ecosystems?

Hadoop on cloud (AWS EMR, Azure HDInsight, GCP Dataproc):

# Spark reading from S3 through the S3A connector (on EMR, EMRFS plays the same role via s3:// URIs)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("EMR-Spark-Job") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.InstanceProfileCredentialsProvider") \
    .getOrCreate()

df = spark.read.parquet("s3a://my-datalake/processed/events/year=2026/")
df.groupBy("event_type").count().write.mode("overwrite") \
    .parquet("s3a://my-datalake/aggregates/event_counts/")

HDFS to Iceberg/Delta Lake migration path:

  1. Keep HDFS for historical data (cheaper than re-migration).
  2. New data lands in S3/GCS with Iceberg/Delta format.
  3. Spark unified read layer handles both (Hive metastore unified catalog).

Hive Metastore as unified catalog:

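# Shared Hive Metastore as the catalog (on EMR, AWS Glue Data Catalog can fill this role)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("hive.metastore.uris", "thrift://metastore:9083") \
    .enableHiveSupport() \
    .getOrCreate()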

# Query both Hive tables (HDFS) and Delta tables (S3) in one SQL
spark.sql("""
    SELECT h.user_id, d.event_count
    FROM hive_db.legacy_users h
    JOIN delta_db.event_counts d ON h.user_id = d.user_id
""").show()

Q23. What is Apache Tez, and how does it improve on MapReduce for Hive?

MapReduce limitations:

  • Every job in a chain writes its output to HDFS, so multi-stage queries pay replicated disk I/O between stages.
  • Forced map-shuffle-reduce structure -- complex queries need chained MR jobs.
  • Each job has JVM startup overhead.

Tez improvements:

  • Represents query as a generic DAG (Directed Acyclic Graph) of tasks.
  • Intermediate data flows in memory or local disk -- no HDFS writes between stages.
  • Operator fusion: consecutive operations (filter + project + aggregate) run in one task.
  • Container reuse: JVMs reused across tasks in the same query.
  • Dynamic re-planning: Adjust partition counts based on actual data size.

MapReduce for multi-join:
MR1: Map(scan A) -> Reduce(hash join A-B) -> write HDFS
MR2: Map(read HDFS) -> Reduce(hash join AB-C) -> write HDFS
MR3: Map(read HDFS) -> Reduce(aggregate) -> write HDFS

Tez for same query:
Task1(scan A) + Task2(scan B) -> Task3(hash join A-B, hash join with C, aggregate)
No intermediate HDFS writes. Commonly quoted as 3-10x faster for complex queries; the actual gain depends on the query.

-- Enable Tez
SET hive.execution.engine = tez;
SET tez.am.resource.memory.mb = 4096;
SET hive.auto.convert.join = true;  -- Map join optimization
SET hive.tez.container.size = 2048;

Q24. Compare Hadoop MapReduce vs Apache Spark for batch processing.

| Dimension | MapReduce | Spark |
| --- | --- | --- |
| In-memory caching | No -- data is re-read from HDFS for every job | Yes -- RDD/DataFrame cache |
| Iterative algorithms | Slow (read/write HDFS each iteration) | Fast (cache between iterations) |
| Latency | Minutes | Seconds |
| Programming model | Map + Reduce only | Rich transformations (100+) |
| Language support | Java primary (Streaming for others) | Python, Scala, Java, R, SQL |
| DAG support | Chained jobs (manual) | Native DAG execution |
| Streaming | No | Structured Streaming |
| Fault tolerance | Task re-execution from HDFS | RDD lineage re-computation |
| Memory pressure | Spills to local disk gracefully | OOM risk on large datasets |
| Maturity | Proven at exabyte scale | Production-ready, growing |

When MapReduce is still valid:

  • Extreme-scale batch jobs where memory is the bottleneck.
  • Environments locked to older Hadoop clusters without Spark.
  • Jobs that already work well and don't justify migration cost.

ML workloads: Spark MLlib generally outperforms MapReduce-based Mahout for iterative algorithms such as gradient descent, because each iteration reuses cached data instead of re-reading it from HDFS.
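
The gap for iterative algorithms comes down to how often the input is read from storage. The toy sketch below counts those reads in plain Python; it is not Spark or MapReduce code, and the class and function names are hypothetical.

```python
class CountingSource:
    """Stand-in for a dataset on HDFS that counts how often it is read."""
    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._rows)

def mean_by_gradient_descent(load, iterations=50, lr=0.1):
    """Minimise mean((w - x)^2) over the data; `load` is called once per iteration."""
    w = 0.0
    for _ in range(iterations):
        data = load()
        grad = sum(2 * (w - x) for x in data) / len(data)
        w -= lr * grad
    return w

source = CountingSource([1.0, 2.0, 3.0, 4.0])

# MapReduce-style: every iteration goes back to storage
w_uncached = mean_by_gradient_descent(source.load)
reads_uncached = source.reads

# Spark-style: load once, keep the data in memory, iterate over the cached copy
source.reads = 0
cached = source.load()
w_cached = mean_by_gradient_descent(lambda: cached)
reads_cached = source.reads

print(reads_uncached, reads_cached)  # prints: 50 1
```

Both runs reach the same result; only the number of storage reads differs, which is the cost that dominates when each read is a full HDFS scan.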

# Spark equivalent of word count (compare to MR Java code above)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

counts = (
    spark.read.text("hdfs:///user/data/corpus/")
    .rdd
    .flatMap(lambda row: row[0].split())
    .map(lambda word: (word.lower(), 1))
    .reduceByKey(lambda a, b: a + b)
    .sortBy(lambda x: -x[1])
)

counts.saveAsTextFile("hdfs:///user/output/word_counts/")

Q25. How do you monitor and troubleshoot a Hadoop cluster?

Key monitoring points:

# HDFS health
hdfs dfsadmin -report
# Shows: live/dead DataNodes, block counts, used/available space

# Check under-replicated blocks
hdfs dfsadmin -report | grep "Under replicated"
hdfs fsck /   # the summary at the end lists corrupt, missing, and under-replicated blocks

# YARN cluster status
yarn node -list -all
yarn application -list -appStates RUNNING

# Check YARN queue usage
yarn queue -status default

# View running/failed jobs
mapred job -list all | head -20
mapred job -status <job_id>

Log locations:

NameNode logs: $HADOOP_LOG_DIR/hadoop-<user>-namenode-<host>.log
DataNode logs: $HADOOP_LOG_DIR/hadoop-<user>-datanode-<host>.log
YARN RM logs: $HADOOP_LOG_DIR/yarn-<user>-resourcemanager-<host>.log
MapReduce job logs: YARN ResourceManager UI -> application -> container logs

Common issues:

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Jobs hang at 99% reduce | Data skew | Custom partitioner, salting |
| NameNode OOM | Too many small files | HAR archives, CombineFileInputFormat |
| DataNode full | Unbalanced data distribution | hdfs balancer -threshold 10 |
| Jobs slow | GC pressure | Tune JVM heap, reduce shuffle |
| "Too many open files" | DataNode limit | Increase ulimit -n to 65536 |
| Connection refused to NameNode | NameNode dead or GC pause | Check NN logs, increase heap |

# Rebalance HDFS data distribution
hdfs balancer -threshold 5  # Balance until no node deviates >5% from average

# Safe mode (NameNode waits for minimum replication)
hdfs dfsadmin -safemode get
hdfs dfsadmin -safemode leave  # Force exit if stuck
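
The balancer threshold is easier to reason about with numbers. The sketch below flags the nodes that a given threshold would treat as unbalanced, assuming equally sized DataNodes so that the cluster average is the plain mean of the per-node percentages; the function name and the utilisation figures are hypothetical.

```python
def nodes_outside_threshold(used_pct, threshold=5.0):
    """Return nodes whose utilisation deviates from the average by more than `threshold` points."""
    average = sum(used_pct.values()) / len(used_pct)
    return {
        node: round(pct - average, 1)
        for node, pct in used_pct.items()
        if abs(pct - average) > threshold
    }

# Hypothetical per-node disk utilisation in percent
utilisation = {"dn1": 82.0, "dn2": 61.0, "dn3": 58.0, "dn4": 39.0}
print(nodes_outside_threshold(utilisation))
```

With an average of 60%, dn1 (+22 points) and dn4 (-21 points) fall outside a 5-point threshold, so the balancer would move blocks from the former towards the latter.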

Real-World Scenarios

Q26. Design a Hadoop-based data pipeline for daily batch processing of 10 TB of e-commerce logs.

Requirements: Parse 10 TB raw NGINX access logs daily, compute product page views, add to 90-day rolling window, generate daily product performance report.
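
Before drawing the architecture, it helps to put rough numbers on the daily volume. This is a back-of-the-envelope sketch, assuming a 128 MB block size, 3x replication, splittable input, and one input split per block; the function name is hypothetical.

```python
import math

TB = 1024 ** 4
MB = 1024 ** 2

def daily_sizing(raw_bytes, block_size=128 * MB, replication=3):
    """Estimate input splits and replicated footprint for one day's raw logs."""
    splits = math.ceil(raw_bytes / block_size)
    replicated_tb = raw_bytes * replication / TB
    return splits, replicated_tb

splits, replicated_tb = daily_sizing(10 * TB)
print(splits, replicated_tb)  # prints: 81920 30.0
```

Roughly 82,000 map tasks and 30 TB of raw disk per day are the figures that drive cluster sizing and the choice of a compact columnar format for the processed zone.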

Architecture:

[NGINX Servers] -> [Flume/Kafka] -> [HDFS Raw Zone]
                                         |
                                   [Hive ETL (Tez)]
                                         |
                              [HDFS Processed Zone (ORC)]
                                         |
                              [Hive Aggregation Job]
                                         |
                              [HDFS Report Zone (Parquet)]
                                         |
                              [Sqoop Export to MySQL]
                                         |
                              [Dashboard/BI Tools]

Oozie workflow:

<workflow-app name="daily-ecommerce-pipeline" xmlns="uri:oozie:workflow:0.5">
    <start to="validate-raw"/>

    <action name="validate-raw">
        <hive><script>validate_raw_completeness.hql</script></hive>
        <ok to="parse-logs"/>
        <error to="alert-data-ops"/>
    </action>

    
    <action name="parse-logs">
        <hive><script>parse_access_logs.hql</script></hive>
        <ok to="compute-aggregates"/>
        <error to="alert-data-ops"/>
    </action>

    
    <action name="compute-aggregates">
        <hive><script>product_pageview_aggregates.hql</script></hive>
        <ok to="build-90day-window"/>
        <error to="alert-data-ops"/>
    </action>

    
    <action name="build-90day-window">
        <hive><script>rolling_window_update.hql</script></hive>
        <ok to="export-report"/>
        <error to="alert-data-ops"/>
    </action>

    
    <action name="export-report">
        <sqoop>
            <command>export --connect ... --table daily_product_report ...</command>
        </sqoop>
        <ok to="end"/>
        <error to="alert-data-ops"/>
    </action>
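
    <kill name="alert-data-ops">
        <message>Pipeline failed at [${wf:lastErrorNode()}]</message>
    </kill>

    <end name="end"/>
</workflow-app>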

HDFS directory layout:

/data/raw/access_logs/year=2026/month=06/day=08/    -- Flume landing zone
/data/processed/parsed_logs/year=2026/month=06/day=08/   -- ORC partitioned
/data/aggregates/product_pageviews/year=2026/month=06/day=08/  -- daily aggregates
/data/reports/90day_window/   -- rolling window table
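
The layout above makes the 90-day window a matter of listing date partitions. The sketch below builds those paths in Python; the helper names are hypothetical, and the base path is taken from the layout.

```python
from datetime import date, timedelta

def partition_path(base, day):
    """Build a Hive-style partition path such as .../year=2026/month=06/day=08/."""
    return f"{base}/year={day:%Y}/month={day:%m}/day={day:%d}/"

def window_partitions(base, end_day, days=90):
    """List the partition paths covering `days` days ending at `end_day` (inclusive)."""
    return [
        partition_path(base, end_day - timedelta(days=offset))
        for offset in range(days - 1, -1, -1)
    ]

paths = window_partitions("/data/aggregates/product_pageviews", date(2026, 6, 8))
print(paths[0])
print(paths[-1])
```

A rolling-window job can pass this list to its reader, so only the 90 relevant daily partitions are scanned instead of the full history.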

Q27. How would you migrate an on-premise Hadoop cluster to AWS EMR?

Migration strategy: Lift-then-Modernize

Phase 1: Assess

# Inventory: file sizes, formats, partition counts
hdfs dfs -du -s /data/* | sort -rn | head -20
hive -e "SHOW DATABASES; USE prod; SHOW TABLES;" > table_inventory.txt

# Job complexity: how many MR vs Hive vs Pig jobs
grep -r "job_type" /oozie/workflows/ | sort | uniq -c

Phase 2: Copy data to S3

# S3DistCp -- distributed copy from HDFS to S3 using MapReduce
# ORC files are copied as-is: --groupBy (byte-level concatenation) and
# --outputCodec (re-compression) are meant for text files and would corrupt them
s3-dist-cp \
    --src hdfs:///data/processed/ \
    --dest s3://company-datalake/processed/ \
    --srcPattern ".*\.orc"

Phase 3: Validate

# Row count validation (total rows; repeat per partition for a finer check)
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Compare counts: source on HDFS vs copy on S3
hdfs_count = spark.read.orc("hdfs:///data/processed/events/").count()
s3_count = spark.read.orc("s3a://company-datalake/processed/events/").count()

assert hdfs_count == s3_count, f"Mismatch: {hdfs_count} vs {s3_count}"
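
A single total can hide offsetting errors, so it is worth comparing counts per partition as well. This is a minimal sketch in plain Python, assuming the per-partition counts have already been collected (for example with a GROUP BY on the partition columns); the function name and the figures are hypothetical.

```python
def diff_partition_counts(source_counts, target_counts):
    """Return {partition: (source_rows, target_rows)} for every partition that differs."""
    partitions = sorted(set(source_counts) | set(target_counts))
    return {
        p: (source_counts.get(p), target_counts.get(p))
        for p in partitions
        if source_counts.get(p) != target_counts.get(p)
    }

# Hypothetical per-partition row counts
hdfs_counts = {"2026-06-06": 1200, "2026-06-07": 1350, "2026-06-08": 1280}
s3_counts = {"2026-06-06": 1200, "2026-06-07": 1349}

print(diff_partition_counts(hdfs_counts, s3_counts))
```

A `None` on the target side marks a partition that was never copied, which a total-only check would report merely as a shortfall.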

Phase 4: Switch + modernize

  • Repoint Oozie workflows at EMR, or port them to Airflow on MWAA (Amazon Managed Workflows for Apache Airflow).
  • Convert MR jobs to Spark.
  • Convert ORC to Parquet/Iceberg for cross-service compatibility.
  • Replace Hive metastore with AWS Glue Catalog.

Q28. What are the key differences between Hadoop 2 and Hadoop 3?

| Feature | Hadoop 2 | Hadoop 3 |
| --- | --- | --- |
| NameNode HA | Active + 1 Standby | Active + multiple Standbys |
| Storage type | 3x replication only | Erasure Coding (EC) support |
| Default block size | 128 MB | 128 MB (unchanged; configurable via dfs.blocksize) |
| YARN timeline service | v1 | v2 (scalable, HBase backend) |
| Minimum Java | Java 7 | Java 8 |
| Port defaults | NameNode: 50070 | NameNode: 9870 |
| HDFS Federation | Supported | Enhanced with Router-Based Federation |

Erasure Coding (Hadoop 3):

# EC reduces storage overhead from 200% (3x replication) to ~50%
# Trade-off: higher CPU for encoding/decoding on reads/writes

# Enable EC on a directory
hdfs ec -setPolicy -path /cold_data -policy RS-6-3-1024k
# RS-6-3: 6 data blocks + 3 parity blocks, can lose any 3
# Overhead: 3/6 = 50% vs 200% for 3x replication

# View EC policies
hdfs ec -listPolicies
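
The overhead figures quoted above follow from simple arithmetic, sketched here in Python. The function names and the 600 TB volume are hypothetical; the 3x replication and RS-6-3 parameters come from the text.

```python
def replication_overhead(replicas):
    """Extra storage beyond the raw data, as a fraction (3 replicas -> 2.0)."""
    return float(replicas - 1)

def erasure_coding_overhead(data_units, parity_units):
    """Parity storage relative to data storage (RS-6-3 -> 0.5)."""
    return parity_units / data_units

raw_tb = 600  # hypothetical cold-data volume in TB
print(raw_tb * (1 + replication_overhead(3)))        # prints: 1800.0
print(raw_tb * (1 + erasure_coding_overhead(6, 3)))  # prints: 900.0
```

For the same 600 TB of data, RS-6-3 halves the disk footprint compared with 3x replication while still tolerating the loss of three blocks per stripe.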

Router-Based Federation (Hadoop 3): Allows multiple independent NameNode namespaces mounted under one unified namespace. Client sees single HDFS; Router maps paths to the correct NameNode cluster.
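
The path-to-namespace mapping can be pictured as a longest-prefix lookup in a mount table. The sketch below illustrates that idea in plain Python; it is not the Router's actual code, and the mount table, nameservice names, and function name are hypothetical.

```python
def resolve(path, mount_table):
    """Map a client path to (nameservice, remote_path) via the longest matching mount point."""
    best = None
    for mount_point in mount_table:
        prefix = mount_point.rstrip("/")
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best.rstrip("/")):
                best = mount_point
    if best is None:
        raise KeyError(f"no mount point for {path}")
    nameservice, target = mount_table[best]
    suffix = path[len(best.rstrip("/")):]
    return nameservice, target.rstrip("/") + suffix

# Hypothetical mount table: mount point -> (nameservice, path inside that namespace)
MOUNTS = {
    "/data": ("ns1", "/data"),
    "/data/raw": ("ns2", "/raw"),
    "/user": ("ns3", "/user"),
}

print(resolve("/data/raw/access_logs/year=2026", MOUNTS))
print(resolve("/data/reports/90day_window", MOUNTS))
```

The first path resolves to ns2 because `/data/raw` is a longer match than `/data`; the second falls back to ns1. The client only ever sees the unified path.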


FAQ

Q: What is the difference between HDFS and a regular distributed filesystem like NFS? HDFS is optimized for large sequential reads of large files, not random access. It assumes write-once-read-many workloads, stores data in large blocks (128-256 MB), and co-locates compute with storage for data locality. NFS provides POSIX semantics with random read/write but does not scale to petabytes or support MapReduce data locality. HDFS is fault-tolerant via replication; NFS relies on underlying hardware or RAID.

Q: Can Hadoop handle real-time data? Hadoop core (HDFS + MapReduce + Hive) is designed for batch processing with latencies of minutes to hours. Real-time processing on the Hadoop ecosystem uses Apache Spark Structured Streaming, Apache Flink, or Apache Storm, which can run on YARN and read/write to HDFS. Kafka typically sits in front as the real-time ingestion layer.

Q: What replaced Hadoop in modern data architectures? Candidate reports indicate that modern data stacks increasingly use cloud object storage (S3/GCS/ADLS) instead of HDFS, Apache Spark instead of MapReduce, Apache Airflow instead of Oozie, and open table formats (Delta Lake, Iceberg, Hudi) instead of Hive tables on ORC. Hadoop ecosystem components (Hive Metastore, YARN, HBase) remain relevant, but the underlying HDFS layer is being replaced by cloud storage. Confirm the exact stack with the company you are interviewing with.


Methodology applied to this article (last verified 8 Jun 2026)
Sources used
Public exam-pattern documents, official recruiter pages, and verified candidate reports on r/developersIndia and LinkedIn.
Verification window
Page last edited 8 Jun 2026 by Aditya Sharma. Numbers and patterns sanity-checked against the most recent 2026 cycle drives we tracked.
What we did NOT do
  • No fabricated salary numbers or success rates. If we quote a range, it's sourced.
  • No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
  • No paid placements, sponsored coaching links, or affiliate-shilled course pushes.
Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
