Sun, 27 Apr 2026
vol. IX · no. 117

LLM Interview Questions 2026: 28 Answers with Code

Updated: 8 Jun 2026
Aditya Sharma
Aditya's Edit

PapersAdda 2026 Placement Cycle

By Aditya Sharma·Founder & Editor, PapersAdda

What changed in 2026 drives

Mass-recruiter offer letters are flatter for the 2026 batch: the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry; it is the entire interview.

What I'd actually study for this

  1. Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
  2. One real project you can defend end-to-end: file paths, design decisions, and what you would change
  3. One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
  4. Three behavioural STAR stories: failure recovered, conflict handled, ownership taken

Where most candidates trip up

The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.

Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.

Large Language Models are the defining technology of 2026, and every AI engineer interview now includes LLM-specific rounds. Roles at OpenAI, Anthropic, Google DeepMind, and every AI-first startup in India require fluency in transformer internals, fine-tuning, RAG, and production serving. This guide covers 28 LLM interview questions with full answers and code examples from basic to system design.

PapersAdda's take: Candidates report that RAG pipeline design and fine-tuning strategy questions now appear in over 75% of LLM engineer shortlists. The two most common elimination questions are "implement multi-head attention" and "explain KV cache". Confirm the exact interview format and tech stack on the official company careers portal before you prepare.



Which Companies Ask LLM-Specific Questions?

| Company / Role | LLM Focus Area |
| --- | --- |
| OpenAI, Anthropic | Alignment, RLHF, inference systems |
| Google DeepMind | Research, Gemini family, TPU serving |
| Microsoft (Azure AI) | Azure OpenAI integration, fine-tuning |
| Indian AI startups (Sarvam, Krutrim, Cohesive) | Multilingual LLM, RAG, cost-efficient serving |
| Enterprise (Infosys, TCS AI labs) | RAG pipelines, LLM ops, governance |

EASY: Transformer and LLM Foundations (Questions 1-8)

Q1. What is the transformer architecture? What are its key components?

The transformer (Vaswani et al., 2017) replaces recurrence with self-attention. Core components:

  1. Input embedding + positional encoding -- converts tokens to vectors and adds position information.
  2. Multi-head self-attention -- computes Query, Key, Value projections; attention weights = softmax(QK^T / sqrt(d_k)), and the output is those weights multiplied by V.
  3. Feed-forward network -- position-wise two-layer MLP applied identically at each position.
  4. Layer normalization -- pre-LN (modern) or post-LN (original) stabilizes training.
  5. Residual connections -- enable gradient flow in deep networks.

Encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) are three families derived from this.


Q2. Explain multi-head attention. Why "multi-head"?

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

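    def forward(self, x, mask=None):
        B, T, C = x.shape  # batch, sequence length, d_model
        # Project, then split d_model into heads: (B, T, C) -> (B, n_heads, T, d_k)
        Q = self.q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product scores for every pair of positions: (B, n_heads, T, T)
        attn = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Positions where mask == 0 get -inf, so softmax gives them zero weight
            attn = attn.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(attn, dim=-1)
        # Weighted sum of values, then merge heads back: (B, n_heads, T, d_k) -> (B, T, C)
        out = (attn @ V).transpose(1, 2).reshape(B, T, C)
        return self.out(out)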

Multiple heads allow the model to attend to different semantic subspaces simultaneously. One head might focus on syntactic relations, another on coreference. Single-head attention collapses this to one subspace.


Q3. What is the KV cache? Why is it critical for LLM inference?

During autoregressive generation, each new token needs attention over all previous tokens. Without caching, every step recomputes the K and V projections for all previous tokens: O(n) projection work per step, O(n^2) over an n-token sequence.

The KV cache stores Key and Value tensors from all previous positions. On each new token:

  • Only compute Q, K, V for the new token.
  • Retrieve cached K, V for all previous positions.
  • Concatenate and compute attention.

This reduces per-step projection work from O(n) to O(1). The attention itself is still O(n) per step, because the new query attends over every cached key.

Memory cost: For a 7B model with 32 layers, 32 heads, d_k=128, batch=1, sequence=4096:

  • KV cache = 2 * 32 * 32 * 128 * 4096 * 2 bytes (FP16) = ~2GB
  • This is why long-context inference is memory-bound.
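
The arithmetic above can be checked with a small helper. This is a sketch, assuming plain multi-head attention (one K/V head per query head); the function name `kv_cache_bytes` and its parameters are illustrative, not part of any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache K and V for every layer, head, and position."""
    # The leading factor of 2 covers the two cached tensors (K and V) per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value


# Figures from the text: 32 layers, 32 heads, d_k=128, 4096 tokens, FP16 (2 bytes).
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB")  # 2.0 GiB
```

The cost grows linearly with both sequence length and batch size, which is why serving long contexts at high batch sizes is usually limited by memory rather than compute.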

Q4. What is the difference between causal (decoder) attention and full (encoder) attention?

| Property | Causal / Decoder Attention | Full / Encoder Attention |
| --- | --- | --- |
| Mask | Lower-triangular: token i attends only to positions 0..i | No mask: all positions attend to all positions |
| Use case | Autoregressive generation (GPT family) | Bidirectional understanding (BERT family) |
| Training objective | Next-token prediction (CLM) | Masked language modeling (MLM) |
| Context | Past only | Full sequence |

In code, causal mask is typically torch.tril(torch.ones(T, T)).


Q5. What is rotary positional encoding (RoPE)? How does it differ from learned positional embeddings?

Original transformers add a learned or sinusoidal embedding to the input. RoPE (Su et al., 2021) encodes position by rotating the Q and K vectors in 2D planes:

Q_rotated[2i]   = Q[2i] * cos(theta) - Q[2i+1] * sin(theta)
Q_rotated[2i+1] = Q[2i] * sin(theta) + Q[2i+1] * cos(theta)

Advantages of RoPE:

  • Relative position information is preserved in dot products: the score between a rotated query at position m and a rotated key at position n depends on position only through (m - n).
  • Generalizes better to sequence lengths unseen during training.
  • No added embedding parameters.
  • Used in LLaMA, Mistral, Qwen, Gemma, and most modern LLMs.
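
The relative-position property can be verified numerically. The following is a minimal pure-Python sketch, assuming the common frequency schedule theta_i = pos * base^(-2i/d) with base 10,000; the helper names `rope` and `dot` are illustrative.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):  # i is the even index of the pair, so the exponent is -2j/d
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_a = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2: offset 3
score_b = dot(rope(q, 105), rope(k, 102))  # both shifted by 100: offset still 3
score_c = dot(rope(q, 5), rope(k, 4))      # offset 1

print(abs(score_a - score_b) < 1e-9)  # True: same offset, same score
print(abs(score_a - score_c) > 1e-6)  # True: different offset, different score
```

Shifting both positions by the same amount leaves the attention score unchanged, which is the property that absolute additive embeddings do not guarantee.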

Q6. What is perplexity? How do you interpret it for LLMs?

Perplexity = exp(cross-entropy loss) = exp(-(1/N) * sum(log P(token_i | context))).

It measures how "surprised" the model is by the test sequence. A perplexity of 10 means the model is as uncertain as choosing uniformly among 10 options at each step.

Interpretation:

  • Lower perplexity = better language model.
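  • GPT-2 (117M), zero-shot: ~37.5 on WikiText-103 and ~29.4 on WikiText-2. Much larger models score substantially lower, but figures for closed models such as GPT-4 are not published.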
  • Perplexity is not a proxy for task performance -- a model can have low perplexity but poor reasoning.

Limitation: perplexity depends on both the dataset and the tokenizer, so comparisons across datasets, or across models with different vocabularies, are not valid.
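
The formula and the "uniform among 10 options" interpretation can be reproduced in a few lines. This is a sketch that takes the probabilities a model assigned to the observed tokens as a plain list; the function name `perplexity` is illustrative.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that always gives the correct token probability 0.1 behaves like
# a uniform choice among 10 options.
uniform = perplexity([0.1] * 20)
# A confident model assigns high probability to what actually occurs.
confident = perplexity([0.9, 0.8, 0.95, 0.7])

print(round(uniform, 6))    # 10.0
print(round(confident, 2))  # 1.2
```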


Q7. What is the difference between GPT-style (decoder-only) and BERT-style (encoder-only) LLMs?

| Property | GPT-style (decoder-only) | BERT-style (encoder-only) |
| --- | --- | --- |
| Attention | Causal (left-to-right) | Bidirectional |
| Pretraining | Next-token prediction (CLM) | Masked LM + next sentence prediction |
| Generation | Native (autoregressive sampling) | Not directly generative |
| Best for | Open-ended generation, chat, code | Classification, NER, embedding extraction |
| Examples | GPT-4, LLaMA 3, Mistral, Gemma | BERT, RoBERTa, DeBERTa |

Modern preference is decoder-only for general-purpose LLMs due to unified pretraining + generative capability.


Q8. What is tokenization in LLMs? Why does it matter?

LLMs operate on integer token IDs, not raw characters. Common methods:

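  • BPE (Byte Pair Encoding): iteratively merges the most frequent adjacent symbol pairs (bytes, in byte-level BPE). Used by GPT-2/3/4 and LLaMA 3.
  • WordPiece: similar to BPE but chooses merges that maximize the likelihood of the training data. Used by BERT.
  • SentencePiece: language-agnostic, works directly on raw Unicode text without pre-tokenization. Used by LLaMA 1/2 and T5.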

Why tokenization matters:

  • Longer tokenizations = higher context usage for the same text.
  • Code and math tokenize poorly with word-level vocabularies -- GPT-4 uses a code-optimized vocabulary.
  • Multilingual models need subword segmentation to share vocabulary across scripts.
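
from transformers import AutoTokenizer
# The Llama 3 repository is gated on the Hugging Face Hub; any tokenizer you have access to works here
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tok("Hello, PapersAdda!")
print(tokens["input_ids"])  # list of integer IDs starting with the BOS token (128000 for Llama 3)
print(len(tokens["input_ids"]))  # token count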

MEDIUM: Fine-Tuning, RAG, and Alignment (Questions 9-20)

Q9. What is the difference between full fine-tuning, LoRA, and QLoRA?

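| Method | What changes | GPU memory | Quality |
| --- | --- | --- | --- |
| Full fine-tuning | All weights | Very high (gradients and optimizer states for all params) | Best |
| LoRA | Low-rank adapter matrices (r << d) | Lower -- base weights stay frozen in 16-bit; gradients and optimizer states only for adapters | Near-full with r=16-64 |
| QLoRA | LoRA + base model in 4-bit NF4 | Lowest -- 4-bit base + 16-bit adapters | Slight quality cost |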

LoRA math: Instead of updating W (d x d), learn W = W0 + BA where B is d x r, A is r x d. Only 2d*r parameters per layer vs d^2.

from peft import get_peft_model, LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
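# base_model is assumed to be a 7B LLaMA-style causal LM (hidden size 4096, 32 layers)
# loaded beforehand, e.g. with AutoModelForCausalLM.from_pretrained(...)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# With r=16 on q_proj and v_proj of such a model:
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243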
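
The 2d*r parameter count can be worked through by hand. The sketch below assumes a LLaMA-7B-like shape (hidden size 4096, 32 layers, square q_proj and v_proj matrices); these dimensions and the helper name `lora_params` are assumptions for illustration.

```python
def lora_params(d_in, d_out, r):
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


d_model, n_layers = 4096, 32
modules_per_layer = 2  # q_proj and v_proj, as in the config above

for r in (8, 16):
    total = lora_params(d_model, d_model, r) * modules_per_layer * n_layers
    print(r, f"{total:,}")  # 8 -> 4,194,304 ; 16 -> 8,388,608

# Fully training the same two matrices in every layer would touch d^2 weights each.
full = d_model * d_model * modules_per_layer * n_layers
total_r16 = lora_params(d_model, d_model, 16) * modules_per_layer * n_layers
print(full // total_r16)  # 128x fewer trainable parameters at r=16
```

Doubling r doubles the adapter size, so rank is the main knob for trading adapter capacity against memory.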

Q10. Walk through building a RAG pipeline from scratch.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import pipeline

# 1. Ingest and chunk documents
docs = [
    "UPSC 2026 prelims date is June 1, 2026.",
    "SBI PO exam pattern includes reasoning, quant, and English.",
    "CAT 2026 registration starts August 2026.",
]

# 2. Embed chunks
embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)

# 3. Build FAISS index
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(dim)  # inner product = cosine on normalized vectors
index.add(doc_embeddings.astype(np.float32))

# 4. Retrieve at query time
def retrieve(query: str, k: int = 2):
    q_emb = embedder.encode([query], normalize_embeddings=True)
    scores, idx = index.search(q_emb.astype(np.float32), k)
    return [docs[i] for i in idx[0]]

# 5. Generate with retrieved context
llm = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

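def rag_answer(question: str):
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # return_full_text=False returns only the completion, not the prompt echoed back
    return llm(prompt, max_new_tokens=150, return_full_text=False)[0]["generated_text"]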

print(rag_answer("When is UPSC prelims?"))

Production additions: chunking strategy (sentence vs fixed-token), re-ranking (cross-encoder), hybrid search (BM25 + dense), metadata filtering.
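
One common way to combine BM25 and dense results in hybrid search is reciprocal rank fusion (RRF). The sketch below is one option among several, not necessarily what any given stack uses; the document IDs are hypothetical and k=60 is simply the conventional default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids by summing 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["d3", "d1", "d7"]   # hypothetical keyword-search results, best first
dense_hits = ["d1", "d4", "d3"]  # hypothetical embedding-search results, best first

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['d1', 'd3', 'd4', 'd7']
```

RRF only uses ranks, so it avoids having to normalise BM25 scores and cosine similarities onto a common scale.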


Q11. What is RLHF? Explain the three stages.

Reinforcement Learning from Human Feedback (RLHF) aligns LLMs to human preferences:

Stage 1 -- Supervised Fine-Tuning (SFT):

  • Fine-tune base LLM on high-quality demonstrations.
  • Result: SFT model that follows instructions.

Stage 2 -- Reward Model Training:

  • Human annotators rank pairs of model outputs (A vs B).
  • Train a reward model RM(prompt, response) -> scalar reward.
  • Dataset: comparison pairs with human preference labels.

Stage 3 -- PPO Optimization:

  • Use PPO to update the LLM policy to maximize RM score.
  • KL penalty keeps the updated policy close to SFT model: reward_total = RM(x,y) - beta * KL(pi_rlhf || pi_sft).

DPO (Direct Preference Optimization) skips the separate RM and optimizes preference directly:

  • Loss = -log(sigma(beta * (log pi(y_w|x) - log pi(y_l|x) - log pi_ref(y_w|x) + log pi_ref(y_l|x))))
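
The DPO loss above can be evaluated for a single preference pair using summed sequence log-probabilities. This is a scalar sketch for intuition only; real training computes it over batches of token log-probs, and beta=0.1 is just an illustrative value.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log(sigmoid(margin)) for one (chosen, rejected) pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: margin is 0, loss is log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raised the chosen answer and lowered the rejected one relative to the reference.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)

print(round(neutral, 4))   # 0.6931
print(round(improved, 4))  # 0.513
```

The loss only falls when the policy widens the chosen-versus-rejected gap relative to the reference model, which plays the role of the KL penalty in PPO-based RLHF.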

Q12. What is speculative decoding? How does it speed up LLM inference?

Autoregressive generation is sequential -- one token per forward pass through a large model. Speculative decoding breaks this bottleneck:

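  1. A small draft model (e.g., 3B or distilled) proposes K tokens autoregressively in K cheap steps.
  2. The large target model scores all K drafted positions in a single forward pass, which also yields its distribution for position K+1.
  3. Each drafted token is accepted with probability min(1, p_target / p_draft); at the first rejection, a replacement is sampled from the target-adjusted distribution and the remaining drafts are discarded.
  4. Expected speedup: 2-4x. With this acceptance rule the output distribution matches the target model's, so there is no quality degradation.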
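
# HuggingFace supports speculative decoding natively ("assisted generation")
from transformers import AutoModelForCausalLM, AutoTokenizer

# Draft and target must share a tokenizer; the OPT family does
tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

inputs = tok("The capital of France is", return_tensors="pt")
output = target.generate(
    **inputs,
    assistant_model=draft,
    max_new_tokens=50,
)
print(tok.decode(output[0], skip_special_tokens=True))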
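
The accept-or-resample step can be simulated on a toy vocabulary to show that emitted tokens follow the target distribution. This is a sketch of the standard rejection rule for a single position, with made-up three-token distributions; `accept_or_resample` is an illustrative name, not a library function.

```python
import random


def accept_or_resample(token, p_target, p_draft, rng):
    """Return (emitted_token, accepted) for one drafted token.

    p_target and p_draft map token -> probability over the same vocabulary.
    """
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token, True
    # Rejected: resample from the normalised residual max(0, p_target - p_draft).
    residual = {t: max(0.0, p_target[t] - p_draft.get(t, 0.0)) for t in p_target}
    threshold, acc = rng.random() * sum(residual.values()), 0.0
    for t, w in residual.items():
        acc += w
        if threshold < acc:
            return t, False
    return t, False  # guard against floating-point rounding


rng = random.Random(0)
p_draft = {"a": 0.7, "b": 0.2, "c": 0.1}   # toy draft distribution
p_target = {"a": 0.4, "b": 0.5, "c": 0.1}  # toy target distribution

n = 20000
counts = {t: 0 for t in p_target}
for _ in range(n):
    drafted = rng.choices(list(p_draft), weights=list(p_draft.values()))[0]
    emitted, _accepted = accept_or_resample(drafted, p_target, p_draft, rng)
    counts[emitted] += 1

freqs = {t: c / n for t, c in counts.items()}
print(freqs)  # close to the target: a ~0.4, b ~0.5, c ~0.1
```

Even though the draft model strongly prefers "a", the emitted frequencies track the target distribution, which is why the speedup does not cost quality.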

Q13. Explain quantization for LLMs. What is the difference between GPTQ, AWQ, and GGUF?

Quantization reduces weight precision to save memory and speed up inference.

| Format | Method | Use case |
| --- | --- | --- |
| GPTQ | Post-training, layer-wise weight quantization using approximate second-order information | GPU inference via AutoGPTQ or ExLlama kernels |
| AWQ | Activation-aware weight quantization -- preserves salient weights | GPU inference; often higher quality than GPTQ at the same bit-width |
| GGUF | File format for llama.cpp with several quantization schemes | Local inference on CPU / Mac M-series (Metal) |
| BnB 4-bit | NF4 + double quantization (QLoRA) via bitsandbytes | Fine-tuning on consumer GPUs |

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Memory: roughly 5-6GB vs ~16GB for the FP16 8B model (embeddings and lm_head stay in 16-bit)
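
The core idea shared by these schemes, storing low-bit integer codes plus a per-block scale, can be shown with plain absmax quantization. This is a deliberately simplified sketch: it is not NF4, GPTQ, or AWQ, which choose codes and scales more carefully, and the weights are made up.

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid a zero scale
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    return [c * scale for c in codes]


block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]  # made-up weights
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print(codes)                  # integers in [-7, 7]
print(max_err <= scale / 2)   # rounding error is at most half a quantization step
```

Each weight shrinks from 16 bits to 4 bits, plus one scale per block; a single large outlier inflates the scale for the whole block, which is the problem that activation-aware and second-order methods try to reduce.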

Q14. What is prompt engineering? What techniques matter most in 2026?

Prompt engineering shapes LLM behavior through input design rather than weight updates.

Core techniques:

  1. Zero-shot: direct instruction, no examples.
  2. Few-shot: 3-8 in-context examples demonstrating the desired format.
  3. Chain-of-Thought (CoT): "Let's think step by step" elicits reasoning traces.
  4. Self-consistency: sample multiple CoT paths, majority-vote the answer.
  5. ReAct: interleave reasoning (Thought) and tool calls (Action) in the prompt.
  6. System prompt engineering: define persona, output constraints, refusal behavior.

# Few-shot CoT example
prompt = """
Classify sentiment. Think step by step.

Review: "Battery dies in 2 hours" -> Thought: mentions battery problem, negative tone -> Sentiment: Negative
Review: "Amazing build quality, fast delivery" -> Thought: positive attributes, no complaints -> Sentiment: Positive
Review: "Average camera but great screen" -> Thought: mixed, one positive one negative ->
"""

Q15. What is flash attention? Why is it important?

Standard attention computes the full N x N attention matrix in HBM (GPU DRAM), which is memory-bandwidth-bound for long sequences.

FlashAttention (Dao et al., 2022):

  • Tiles the Q, K, V matrices and computes attention in SRAM (fast cache).
  • Never materializes the full N x N matrix in HBM.
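  • Extra memory: O(N) instead of O(N^2), because only running softmax statistics are kept.
  • HBM traffic: O(N^2 d^2 / M) accesses for SRAM size M, versus O(N d + N^2) for standard attention.
  • Result (as reported): roughly 2-4x faster attention and up to ~20x less attention memory at long sequence lengths.

FlashAttention-2 (2023): better work partitioning, about 2x faster than FA1. FlashAttention-3 (2024): targets H100 async pipelines, about 1.5-2x faster than FA2.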

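import torch
import torch.nn.functional as F

# q, k, v: (batch, heads, seq_len, head_dim); random tensors stand in for real projections
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))
# PyTorch 2.0+ can dispatch to a FlashAttention kernel when inputs are fp16/bf16 CUDA tensors
# and no arbitrary attention mask is passed; otherwise it falls back to another backend
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)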
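
The trick that lets FlashAttention avoid materializing the full score matrix is an online softmax: process K/V in tiles while keeping only a running maximum, a running normaliser, and a running output. The sketch below shows this for a single query in pure Python; it illustrates the numerics only, not the GPU kernel, and the data is made up.

```python
import math


def attention_naive(q, keys, values):
    """Reference: compute all scores first, then softmax, then the weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]


def attention_tiled(q, keys, values, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    m, z = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in k_tile]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)  # re-base earlier partial sums on the new max
        z *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * x for a, x in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]


q = [0.2, -0.4, 0.1, 0.9]
keys = [[0.5, 0.1, -0.3, 0.8], [-0.2, 0.7, 0.4, 0.0], [1.0, -1.0, 0.5, 0.5],
        [0.3, 0.3, 0.3, 0.3], [-0.9, 0.2, 0.6, -0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]

full = attention_naive(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
print(all(abs(a - b) < 1e-12 for a, b in zip(full, tiled)))  # True
```

Because the result is exact rather than approximate, FlashAttention is a drop-in replacement for standard attention.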

Q16. How do you evaluate an LLM? What metrics go beyond perplexity?

| Evaluation Type | Metric / Benchmark | What it measures |
| --- | --- | --- |
| General knowledge | MMLU, ARC-Challenge | Breadth of factual recall |
| Reasoning | GSM8K, MATH, HumanEval | Step-by-step reasoning, code |
| Instruction following | MT-Bench, AlpacaEval | Multi-turn instruction quality |
| Safety / alignment | TruthfulQA, BBQ (bias) | Hallucination, fairness |
| Retrieval accuracy | EM, F1 on QA datasets | Factual accuracy in RAG |
| Human eval | Elo/Arena ratings (LMSYS Chatbot Arena) | Real user preference |

For production RAG systems: use RAGAS (context precision, context recall, answer faithfulness, answer relevance).


Q17. What is the difference between in-context learning, fine-tuning, and RAG? When to use each?

| Method | Knowledge source | Latency | Cost | Best for |
| --- | --- | --- | --- | --- |
| In-context learning | Prompt context window | Low | Per-call tokens | Dynamic, few-shot tasks |
| Fine-tuning | Baked into weights | Low (no retrieval) | One-time training | Style, tone, domain-specific format |
| RAG | External vector store | Medium (+ retrieval) | Infrastructure | Fresh, factual, large knowledge bases |

Rule of thumb:

  • Need to follow a specific format/style reliably? Fine-tune.
  • Need access to fresh or large external knowledge? RAG.
  • Need general reasoning with a few examples? ICL.
  • Best systems combine fine-tuning (style) + RAG (knowledge).

Q18. What are hallucinations in LLMs? How do you reduce them in production?

Hallucinations = model generates plausible-sounding but factually incorrect content. Caused by:

  • Training data gaps / outdated knowledge.
  • Model over-confidence in generation distribution.
  • Ambiguous or underspecified prompts.

Mitigation strategies:

  1. RAG: ground generation in retrieved documents; add source attribution.
  2. Self-consistency: sample multiple outputs, check agreement.
  3. Structured output + validation: force JSON schema output, validate post-generation.
  4. Calibration prompts: "Answer only if you are confident. Say 'I don't know' otherwise."
  5. RLHF/DPO alignment: use preference data that penalizes hallucinated answers (via a reward model for RLHF, or directly on preference pairs for DPO).
  6. Citation forcing: require model to quote source document for every factual claim.
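
# LangChain with citation forcing (legacy chain API; llm and retriever are assumed to be
# an already-configured LangChain LLM and retriever)
from langchain.chains import RetrievalQAWithSourcesChain
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, retriever=retriever
)
# Newer LangChain releases prefer chain.invoke({...}) over calling the chain directly
result = chain({"question": "When is UPSC prelims 2026?"})
# result["answer"] + result["sources"]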

Q19. What is Mixture of Experts (MoE)? How does it relate to modern LLMs?

MoE replaces the dense feed-forward layer with N expert sub-networks. A router network selects the top-K experts for each token.

Output = sum_k(gate_k * Expert_k(x))

Benefits:

  • Total parameters: N * expert_size (large).
  • Active parameters per token: K * expert_size (small).
  • Inference cost matches a small dense model; capacity matches a large one.

Examples: Mixtral 8x7B (8 experts, top-2 active = 13B active out of 47B total). GPT-4 is widely believed to be a MoE. Gemini 1.5 uses MoE.

Challenges: load balancing (all tokens routing to same expert), communication overhead in distributed training, expert collapse.
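
The routing formula can be sketched with toy scalar "experts". This assumes one common gating variant (softmax over only the top-K router logits, as described for Mixtral); other MoE designs normalise differently, and the expert functions here are placeholders.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those logits."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Output = sum over selected experts of gate_k * Expert_k(x)."""
    gates = top_k_route(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())


# Four placeholder experts that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token

gates = top_k_route(logits)
y = moe_forward(10.0, experts, logits)
print(gates)  # {1: 0.5, 3: 0.5}
print(y)      # 30.0
```

Only two of the four experts run for this token, which is how total parameter count and per-token compute are decoupled.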


Q20. How does grouped-query attention (GQA) differ from multi-head attention?

  • Multi-head attention (MHA): each head has its own Q, K, V projections -- high KV cache memory.
  • Multi-query attention (MQA): all heads share a single K and V -- low KV cache memory but quality degradation.
  • Grouped-query attention (GQA): K and V are shared within groups of H/G heads -- best quality/memory tradeoff.

| Method | KV cache | Quality |
| --- | --- | --- |
| MHA (h=32) | 32 K, 32 V per layer | Best |
| MQA (h=32) | 1 K, 1 V per layer | Lowest |
| GQA (h=32, g=8) | 8 K, 8 V per layer | Near-MHA |

LLaMA 3, Mistral, Gemma 2, Qwen2.5 all use GQA. The memory saving is critical for long-context and batched inference.
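
The head-to-group mapping and the resulting memory saving can be made concrete. The sketch below assumes 32 layers, head dimension 128, and FP16 values, matching the table's h=32, g=8 setting; the helper names are illustrative.

```python
def kv_head_for(query_head, n_heads=32, n_kv_heads=8):
    """Index of the shared K/V head that a given query head reads from."""
    group_size = n_heads // n_kv_heads
    return query_head // group_size


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes added per generated token (K and V, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


mapping = [kv_head_for(h) for h in range(32)]
print(mapping[:9])  # [0, 0, 0, 0, 1, 1, 1, 1, 2]

mha = kv_bytes_per_token(32, 32, 128)  # 524,288 bytes per token
gqa = kv_bytes_per_token(32, 8, 128)   # 131,072 bytes per token
mqa = kv_bytes_per_token(32, 1, 128)   # 16,384 bytes per token
print(mha // gqa, mha // mqa)          # 4 32
```

With 8 KV heads instead of 32, the same GPU memory holds a 4x longer context or a 4x larger batch.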


HARD: Production and System Design (Questions 21-28)

Q21. Design a production RAG system for a 10-million-document corpus.

Architecture:
  Ingest pipeline:
    S3 / GCS (raw docs)
      --> Document parser (Unstructured.io or Docling)
      --> Chunker (semantic / fixed-token with overlap)
      --> Embedding batch job (BAAI/bge-large via SageMaker batch)
      --> Vector DB (Pinecone / Weaviate / pgvector with HNSW)
      --> Metadata store (PostgreSQL: doc_id, url, date, category)

  Query pipeline:
    User query
      --> Query rewriter (LLM: HyDE or step-back prompting)
      --> Hybrid search: BM25 (Elasticsearch) + dense (FAISS/HNSW)
      --> Re-ranker (cross-encoder: ms-marco-MiniLM-L-6-v2)
      --> Top-K context (k=5 chunks)
      --> LLM generation (GPT-4o / fine-tuned Llama)
      --> Source attribution

  Scale decisions:
    - HNSW index: ANN lookup in <50ms at 10M vectors
    - Embedding cache: Redis for repeated queries
    - Async ingestion: Kafka queue, batch embedding workers
    - Monitoring: RAGAS scores, query latency p99, retrieval hit rate
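
The "fixed-token with overlap" chunker in the ingest pipeline can be sketched as follows. Whitespace-split words stand in for real tokenizer output, and the size and overlap values are arbitrary illustrations rather than recommendations.

```python
def chunk_tokens(tokens, size=200, overlap=40):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reached the end
            break
    return chunks


words = [f"w{i}" for i in range(500)]  # stand-in for tokenizer output
chunks = chunk_tokens(words)

print(len(chunks))                           # 3
print([len(c) for c in chunks])              # [200, 200, 180]
print(chunks[0][-40:] == chunks[1][:40])     # True: neighbouring chunks overlap
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of indexing some text twice.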

Q22. How do you implement streaming for LLM APIs?

import anthropic

client = anthropic.Anthropic()

# Streaming with Anthropic Claude
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformers in 3 sentences."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# FastAPI streaming endpoint
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_llm(prompt: str):
    async with anthropic.AsyncAnthropic().messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            # JSON-encode each chunk so newlines in the text cannot break SSE framing;
            # the client decodes each data: payload with a JSON parser
            yield f"data: {json.dumps(text)}\n\n"

@app.get("/stream")
async def stream_endpoint(prompt: str):
    return StreamingResponse(stream_llm(prompt), media_type="text/event-stream")

Q23. How do you reduce LLM inference latency in production?

| Technique | Latency reduction | Notes |
| --- | --- | --- |
| KV cache | 2-4x per-token | Always on; critical for long contexts |
| Speculative decoding | 2-3x | Needs a good draft model |
| Continuous batching | 2-5x throughput | vLLM, TGI -- processes multiple requests |
| Quantization (INT8/INT4) | 1.5-2x with ~1% quality drop | GPTQ/AWQ recommended |
| Smaller model + RAG | 3-10x | Trade model size for retrieval quality |
| FlashAttention | 2-4x attention ops | Always use in training and serving |
| vLLM PagedAttention | 2-5x memory efficiency | Enables larger batch sizes |

# vLLM for production serving
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is UPSC?"] * 100, params)  # batched generation

Q24. Explain context length scaling. What are the techniques to extend an LLM's context window?

Standard attention is O(N^2) in sequence length. Approaches to extend:

  1. RoPE base scaling: raise the theta base frequency (e.g., base=500,000 vs the original 10,000). LLaMA 3 uses base 500,000, and LLaMA 3.1 adds frequency scaling on top to reach 128K context.
  2. Position interpolation: scale positional indices to fit new length -- fine-tune for a few steps.
  3. YaRN (Yet another RoPE extensioN): NTK-aware interpolation + attention temperature scaling -- usable without fine-tuning in its dynamic variant, and typically best after a short fine-tune.
  4. Sliding window attention: attend only to a local window, optionally plus global tokens (Longformer); Mistral 7B uses a plain sliding window.
  5. Memory-efficient attention: FlashAttention-2/3 -- same O(N^2) but better constant factor.
  6. External memory / RAG: practical limit at 1M+ tokens; retrieve and inject instead.
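
# YaRN config in HuggingFace (rope_scaling support and key names vary by architecture and
# transformers version; newer releases use "rope_type" in place of "type")
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    rope_scaling={"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768},
)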

Q25. What is the difference between temperature, top-p, and top-k sampling?

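import torch
import torch.nn.functional as F

def sample_token(logits: torch.Tensor, temperature=1.0, top_k=0, top_p=0.9):
    """Sample one token id from a 1-D logits vector."""
    # temperature=0 means greedy decoding; return argmax instead of dividing by zero
    if temperature == 0:
        return torch.argmax(logits, dim=-1, keepdim=True)

    # Temperature scaling (creates a new tensor, so the caller's logits are untouched)
    logits = logits / temperature

    # Top-K filtering: mask everything below the K-th largest logit
    if top_k > 0:
        values, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        min_val = values[-1]
        logits[logits < min_val] = float('-inf')

    # Top-P (nucleus) filtering: keep the smallest sorted prefix whose mass reaches top_p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    sorted_probs = F.softmax(sorted_logits, dim=-1)
    mass_before = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    sorted_logits[mass_before > top_p] = float('-inf')
    logits.scatter_(0, sorted_idx, sorted_logits)

    return torch.multinomial(F.softmax(logits, dim=-1), 1)

| Parameter | Effect | Typical range |
| --- | --- | --- |
| temperature=0 | Greedy (argmax) | 0 = deterministic |
| temperature=1 | Raw model distribution | Default |
| temperature>1 | More random / creative | 1.2-1.5 for creative |
| top_k | Sample from top K tokens only | 40-100 |
| top_p | Sample from smallest set summing to p | 0.85-0.95 |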

Q26. How do you implement LLM guardrails in production?

from pydantic import BaseModel
from enum import Enum
import json

class SafetyLevel(Enum):
    SAFE = "safe"
    FLAGGED = "flagged"
    BLOCKED = "blocked"

class GuardrailResult(BaseModel):
    level: SafetyLevel
    reason: str | None = None

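def input_guardrail(user_message: str, llm) -> GuardrailResult:
    """LLM-as-judge input safety check. `llm` is any callable mapping a prompt string to a string."""
    prompt = f"""Is this message safe to answer? Reply with JSON: {{"safe": true/false, "reason": "..."}}
Message: {user_message}"""
    response = llm(prompt)
    try:
        result = json.loads(response)
        is_safe = result["safe"] is True
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fail closed: an unparseable verdict is flagged rather than treated as safe
        return GuardrailResult(level=SafetyLevel.FLAGGED, reason="guardrail reply was not valid JSON")
    if is_safe:
        return GuardrailResult(level=SafetyLevel.SAFE)
    return GuardrailResult(level=SafetyLevel.BLOCKED, reason=result.get("reason"))

# Output guardrails: PII detection, hallucination check, format validation
import re

def output_guardrail(response: str) -> str:
    """Strip PII from generated output."""
    # Redact email patterns ([A-Za-z] for the TLD; a "|" inside a character class is a literal)
    response = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', response)
    # Redact phone patterns (Indian 10-digit mobile numbers starting with 6-9)
    response = re.sub(r'\b[6-9]\d{9}\b', '[PHONE]', response)
    return response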

Production stack: Llama Guard for content classification, custom regex for PII, RAGAS faithfulness for hallucination detection, rate limiting per user tier.


Q27. Explain continuous batching (PagedAttention). Why does it matter for LLM serving?

Problem: In static batching, the server waits to fill a batch before processing -- early requests wait for late arrivals. Also, KV cache is pre-allocated for max_seq_len even if actual output is shorter, wasting GPU memory.

PagedAttention (vLLM):

  • Manages KV cache like OS virtual memory with pages (typically 16 tokens/page).
  • Allocates pages on demand as tokens are generated.
  • Multiple requests can share KV cache pages (prefix caching for repeated system prompts).
  • Memory utilization: the vLLM authors report under 4% KV cache waste, versus 60-80% wasted in systems that pre-allocate contiguous buffers.

Continuous batching:

  • Instead of waiting for all requests to finish, new requests join as old ones complete.
  • GPU utilization stays near-constant vs bursty in static batching.
  • Throughput improvement: reported 2-24x depending on workload.

vLLM implements both. Other frameworks: TGI (HuggingFace), TensorRT-LLM (NVIDIA), MLC-LLM.
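
The on-demand page allocation can be illustrated with a toy block table. This is a conceptual sketch only: the class `PagedKVCache` is hypothetical and tracks block IDs rather than tensors, and it is not how vLLM's internals are actually structured.

```python
class PagedKVCache:
    """Toy allocator: each request owns a list of fixed-size KV blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # request id -> list of block ids
        self.lengths = {}                    # request id -> tokens stored

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:  # first token, or the current block is full
            if not self.free:
                raise MemoryError("no free KV blocks; request must wait or be preempted")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Return a finished request's blocks to the free pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)


cache = PagedKVCache(num_blocks=8)
for _ in range(40):
    cache.append_token("a")  # 40 tokens need 3 blocks of 16
for _ in range(5):
    cache.append_token("b")  # 5 tokens need 1 block

used = len(cache.tables["a"]) + len(cache.tables["b"])
print(used)             # 4
cache.release("a")
print(len(cache.free))  # 7
```

Waste is bounded by at most one partially filled block per request, and released blocks are immediately reusable by new requests, which is what makes continuous batching practical.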


Q28. Design an LLM application for real-time exam answer evaluation.

Requirements:
  - Student submits written answer (200-800 words)
  - System compares against model answer and rubric
  - Returns score (0-10) + detailed feedback in <5 seconds
  - Scale: 10,000 concurrent students (exam peak)

Architecture:

  Request path:
    Student answer (POST /evaluate)
      --> Redis queue (prevent spike overload)
      --> Worker: retrieve model answer + rubric from PostgreSQL
      --> LLM evaluation prompt (structured rubric criteria)
      --> Pydantic output validation (score + feedback JSON)
      --> Cache result (Redis 1-hour TTL for identical answers)
      --> Return to student

  LLM evaluation prompt pattern:
    System: "You are an expert examiner for {subject}. Score strictly."
    User:
      "Model answer: {model_answer}
       Rubric: {rubric_criteria}
       Student answer: {student_answer}
       Return JSON: {score: 0-10, feedback: {criterion: comment, ...}}"

  Scaling:
    - vLLM serving fine-tuned LLaMA-3-8B (exam-domain SFT)
    - 4x A100 40GB: ~800 evaluations/minute (rough estimate; benchmark on your own workload)
    - Horizontal scaling via Kubernetes HPA on queue depth
    - Fallback: GPT-4o API for overflow during exam peaks

  Quality assurance:
    - Shadow evaluation: 5% of answers re-evaluated by GPT-4o
    - Score distribution monitoring (alert if mean shifts >0.5)
    - Human-in-loop queue for contested scores
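
A quick capacity check ties the requirements to the scaling numbers. The sketch below takes the 10,000-student peak, the 5-second target, and the ~800 evaluations/minute per-node estimate from the text, and assumes the worst case in which every student submits at the same moment; real traffic is likely to be more spread out.

```python
import math

peak_submissions = 10_000  # requirement: concurrent students at exam peak
evals_per_minute = 800     # the estimate above for one 4x A100 node
latency_budget_s = 5       # requirement: feedback in under 5 seconds

# Time for a single node to drain the queue if everyone submits at once.
drain_minutes = peak_submissions / evals_per_minute

# Nodes needed to finish all submissions inside the latency budget.
nodes_needed = math.ceil(
    peak_submissions * 60 / (latency_budget_s * evals_per_minute)
)

print(drain_minutes)  # 12.5
print(nodes_needed)   # 150
```

Under this worst-case assumption, one node cannot meet the 5-second target at peak, which is why the design relies on the queue, autoscaling on queue depth, and an API fallback, and why the latency target is best stated per request under expected load rather than for a simultaneous burst.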

FAQ

Q: Which LLM framework should I learn first? A: HuggingFace Transformers is the de facto standard for working with open-weight LLMs. Learn it alongside LangChain or LlamaIndex for RAG applications. Candidate reports on public preparation forums suggest that HuggingFace + PEFT (for LoRA) covers about 90% of LLM engineer interview questions.

Q: Do I need a GPU to practice LLM interview prep? A: Google Colab T4 GPU (free tier) is sufficient for most exercises. Quantized 7B models (GGUF 4-bit) run on CPU. Use Ollama locally for fast iteration without cloud costs.

Q: What LLM papers should I read before an interview? A: Attention Is All You Need (transformer), LoRA, QLoRA, RLHF (Ouyang et al.), FlashAttention, RAG (Lewis et al.), and the LLaMA 3 technical report cover 90% of architecture questions that candidates report encountering in senior LLM interviews.

Q: Is prompt engineering still relevant in 2026? A: Yes -- but it has matured. Basic zero/few-shot prompting is table stakes. Interviewers now ask about prompt injection defense, structured output forcing, multi-step ReAct agents, and evaluation harnesses. Candidate reports on public preparation forums point to the same shift from ad-hoc prompting toward systematic prompt engineering.

Methodology applied to this article · last verified 8 Jun 2026
Sources used
Public exam-pattern documents, official recruiter pages, and verified candidate reports on r/developersIndia and LinkedIn.
Verification window
Page last edited 8 Jun 2026 by Aditya Sharma. Numbers and patterns sanity-checked against the most recent 2026 cycle drives we tracked.
What we did NOT do
  • No fabricated salary numbers or success rates. If we quote a range, it's sourced.
  • No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
  • No paid placements, sponsored coaching links, or affiliate-shilled course pushes.
Verification policy: /editorial-standards/. Found something incorrect? Submit a correction - we respond within 48 hours.
