
LLM Interview Questions 2026: 28 Answers with Code

28 LLM interview questions with full code answers covering transformer architecture, fine-tuning, RAG, RLHF, inference optimization, and production LLM system design for 2026.

By Aditya Sharma. Published 8 Jun 2026.

11 min read last revised 8 Jun 2026


Large Language Models are the defining technology of 2026, and every AI engineer interview now includes LLM-specific rounds. Roles at OpenAI, Anthropic, Google DeepMind, and every AI-first startup in India require fluency in transformer internals, fine-tuning, RAG, and production serving. This guide covers 28 LLM interview questions with full answers and code examples from basic to system design.

PapersAdda's take: Candidates report that RAG pipeline design and fine-tuning strategy questions now appear in over 75% of LLM engineer shortlists. The two most common elimination questions are "implement multi-head attention" and "explain KV cache". Confirm the exact interview format and tech stack on the official company careers portal before you prepare.


Which Companies Ask LLM-Specific Questions?

Company / Role	LLM Focus Area
OpenAI, Anthropic	Alignment, RLHF, inference systems
Google DeepMind	Research, Gemini family, TPU serving
Microsoft (Azure AI)	Azure OpenAI integration, fine-tuning
Indian AI startups (Sarvam, Krutrim, Cohesive)	Multilingual LLM, RAG, cost-efficient serving
Enterprise (Infosys, TCS AI labs)	RAG pipelines, LLM ops, governance

EASY: Transformer and LLM Foundations (Questions 1-8)

Q1. What is the transformer architecture? What are its key components?

The transformer (Vaswani et al., 2017) replaces recurrence with self-attention. Core components:

Input embedding + positional encoding -- converts tokens to vectors and adds position information.
Multi-head self-attention -- computes Query, Key, Value projections; attention output = softmax(QK^T / sqrt(d_k)) V, where the softmax term holds the attention weights.
Feed-forward network -- position-wise two-layer MLP applied identically at each position.
Layer normalization -- pre-LN (modern) or post-LN (original) stabilizes training.
Residual connections -- enable gradient flow in deep networks.

Encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) are three families derived from this.

Q2. Explain multi-head attention. Why "multi-head"?

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        Q = self.q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        attn = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn = attn.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(attn, dim=-1)
        out = (attn @ V).transpose(1, 2).reshape(B, T, C)
        return self.out(out)

Multiple heads allow the model to attend to different semantic subspaces simultaneously. One head might focus on syntactic relations, another on coreference. Single-head attention collapses this to one subspace.

Q3. What is the KV cache? Why is it critical for LLM inference?

During autoregressive generation, each new token attends over all previous tokens. Without caching, every step recomputes the K and V projections for all earlier tokens, so projection work over an n-token sequence grows as O(n^2) instead of O(n).

The KV cache stores Key and Value tensors from all previous positions. On each new token:

Only compute Q, K, V for the new token.
Retrieve cached K, V for all previous positions.
Concatenate and compute attention.

This reduces per-step compute from O(n) matrix multiplications to O(1) new projections + O(n) attention.

Memory cost: For a 7B model with 32 layers, 32 heads, d_k=128, batch=1, sequence=4096:

KV cache = 2 * 32 * 32 * 128 * 4096 * 2 bytes (FP16) = ~2GB
This is why long-context inference is memory-bound.
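
The arithmetic above can be checked with a small helper. This is a minimal sketch: the function name and the 8-KV-head comparison are illustrative assumptions, not figures from any specific model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Size of the KV cache; the leading 2 covers the separate Key and Value tensors."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Configuration from the text: 32 layers, 32 heads, d_k=128, 4096 tokens, FP16
full = kv_cache_bytes(32, 32, 128, 4096)
print(full / 2**30)  # 2.0 (GiB)

# Assumed variant: the same model with only 8 KV heads
reduced = kv_cache_bytes(32, 8, 128, 4096)
print(reduced / 2**30)  # 0.5 (GiB)
```

The cache grows linearly with both sequence length and batch size, which is why serving many long requests at once runs out of memory before it runs out of compute.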

Q4. What is the difference between causal (decoder) attention and full (encoder) attention?

Property	Causal / Decoder Attention	Full / Encoder Attention
Mask	Lower-triangular: token i attends only to positions 0..i	No mask: all positions attend to all positions
Use case	Autoregressive generation (GPT family)	Bidirectional understanding (BERT family)
Training objective	Next-token prediction (CLM)	Masked language modeling (MLM)
Context	Past only	Full sequence

In code, causal mask is typically torch.tril(torch.ones(T, T)).

Q5. What is rotary positional encoding (RoPE)? How does it differ from learned positional embeddings?

Original transformers add a learned or sinusoidal embedding to the input. RoPE (Su et al., 2021) encodes position by rotating the Q and K vectors in 2D planes:

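For a token at position m and dimension pair i, the angle is theta = m * base^(-2i/d), with base = 10,000 in the original paper:

Q_rotated[2i]   = Q[2i] * cos(theta) - Q[2i+1] * sin(theta)
Q_rotated[2i+1] = Q[2i] * sin(theta) + Q[2i+1] * cos(theta)

The same rotation is applied to K.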

Advantages of RoPE:

Relative position information is preserved in dot products: for fixed vectors, the score between Q rotated at position m and K rotated at position n depends on position only through (m - n).
Generalizes better to sequence lengths unseen during training.
No added embedding parameters.
Used in LLaMA, Mistral, Qwen, Gemma, and most modern LLMs.
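
The relative-position property can be checked numerically. The sketch below is a plain-Python illustration rather than any library's implementation; the helper names and the toy vectors are assumptions made for the example.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * base^(-2i/d)."""
    d = len(vec)
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * math.cos(theta) - y * math.sin(theta)
        out[2 * i + 1] = x * math.sin(theta) + y * math.cos(theta)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # toy query vector
k = [1.1, 0.4, -0.6, 0.9]   # toy key vector

# Same offset (m - n = 3) at two very different absolute positions
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(abs(near - far) < 1e-9)  # True
```

Shifting both positions by the same amount leaves the attention score unchanged, which is the property that lets RoPE express relative position without a learned embedding table.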

Q6. What is perplexity? How do you interpret it for LLMs?

Perplexity = exp(cross-entropy loss) = exp(-(1/N) * sum(log P(token_i | context))).

It measures how "surprised" the model is by the test sequence. A perplexity of 10 means the model is as uncertain as choosing uniformly among 10 options at each step.

Interpretation:

Lower perplexity = better language model.
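GPT-2 (117M) reported zero-shot perplexity of about 37.5 on WikiText-103 and about 29.4 on WikiText-2. Larger models score much lower on the same data, but closed frontier models rarely publish comparable figures.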
Perplexity is not a proxy for task performance -- a model can have low perplexity but poor reasoning.

Limitation: perplexity is dataset-specific and tokenizer-specific. Comparing values across datasets, or across models with different vocabularies, is invalid.
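
A short sketch of the formula, assuming you already have the probability the model assigned to each correct token (the function name and the toy probabilities are illustrative):

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood over the sequence."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that gives every correct token probability 0.1 is "choosing among 10 options"
uniform = perplexity([0.1] * 20)
print(round(uniform, 6))  # 10.0

# Higher probability on the correct tokens gives lower perplexity
confident = perplexity([0.9, 0.8, 0.95, 0.7])
print(confident < uniform)  # True
```

In practice the per-token log-probabilities come from the model's cross-entropy loss, so perplexity is usually computed as exp(mean loss) over a held-out set.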

Q7. What is the difference between GPT-style (decoder-only) and BERT-style (encoder-only) LLMs?

Property	GPT-style (decoder-only)	BERT-style (encoder-only)
Attention	Causal (left-to-right)	Bidirectional
Pretraining	Next-token prediction (CLM)	Masked LM + next sentence prediction
Generation	Native (autoregressive sampling)	Not directly generative
Best for	Open-ended generation, chat, code	Classification, NER, embedding extraction
Examples	GPT-4, LLaMA 3, Mistral, Gemma	BERT, RoBERTa, DeBERTa

Modern preference is decoder-only for general-purpose LLMs due to unified pretraining + generative capability.

Q8. What is tokenization in LLMs? Why does it matter?

LLMs operate on integer token IDs, not raw characters. Common methods:

BPE (Byte Pair Encoding): merges frequent byte pairs iteratively. Used by GPT-2/3/4.
WordPiece: similar to BPE but maximizes likelihood of training data. Used by BERT.
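SentencePiece: language-agnostic library (BPE or unigram) that works on raw Unicode text without pre-tokenization. Used by T5 and LLaMA 1/2; LLaMA 3 moved to a tiktoken-style byte-level BPE.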

Why tokenization matters:

Longer tokenizations = higher context usage for the same text.
Code and math tokenize poorly with word-level vocabularies -- GPT-4 uses a code-optimized vocabulary.
Multilingual models need subword segmentation to share vocabulary across scripts.

from transformers import AutoTokenizer

# Gated repo: requires accepting the Llama 3 license on the Hugging Face Hub
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tok("Hello, PapersAdda!")
print(tokens["input_ids"])       # integer token IDs, starting with the BOS id 128000
print(len(tokens["input_ids"]))  # token count

MEDIUM: Fine-Tuning, RAG, and Alignment (Questions 9-20)

Q9. What is the difference between full fine-tuning, LoRA, and QLoRA?

Method	What changes	GPU memory	Quality
Full fine-tuning	All weights	Very high (requires optimizer states for all params)	Best
LoRA	Low-rank adapter matrices (r << d)	Low -- adapters only	Near-full with r=16-64
QLoRA	LoRA + base model in 4-bit NF4	Very low -- 4-bit base + 16-bit adapters	Slight quality cost

LoRA math: Instead of updating W (d x d), learn W = W0 + BA where B is d x r, A is r x d. Only 2d*r parameters per layer vs d^2.

from peft import get_peft_model, LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
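# base_model: a causal LM already loaded via AutoModelForCausalLM.from_pretrained(...)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# For a LLaMA-2-7B-sized base (32 layers, d=4096), r=16 on q_proj and v_proj gives
# 32 * 2 * (2 * 4096 * 16) = 8,388,608 trainable params, about 0.12% of the model.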
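
To make the W = W0 + BA update concrete, here is a minimal pure-Python sketch of a LoRA forward pass. The alpha/r scaling and the zero initialisation of B follow the LoRA paper; the helper names and the tiny dimensions are assumptions chosen for the example.

```python
import random

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha, r):
    """y = W0 x + (alpha / r) * B (A x); W0 stays frozen, only A and B are trained."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, r, alpha = 8, 2, 4
W0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]      # frozen d x d weight
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]    # r x d, small random init
B = [[0.0] * r for _ in range(d)]                                    # d x r, zero init
x = [random.gauss(0, 1) for _ in range(d)]

print(lora_forward(W0, A, B, x, alpha, r) == matvec(W0, x))  # True: B = 0 means no change at init
print(2 * d * r, "adapter params vs", d * d, "full")         # 32 adapter params vs 64 full
```

Because B starts at zero, the adapted model is identical to the base model before training, so fine-tuning begins from the pretrained behavior.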

Q10. Walk through building a RAG pipeline from scratch.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import pipeline

# 1. Ingest and chunk documents
docs = [
    "UPSC 2026 prelims date is June 1, 2026.",
    "SBI PO exam pattern includes reasoning, quant, and English.",
    "CAT 2026 registration starts August 2026.",
]

# 2. Embed chunks
embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)

# 3. Build FAISS index
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(dim)  # inner product = cosine on normalized vectors
index.add(doc_embeddings.astype(np.float32))

# 4. Retrieve at query time
def retrieve(query: str, k: int = 2):
    q_emb = embedder.encode([query], normalize_embeddings=True)
    scores, idx = index.search(q_emb.astype(np.float32), k)
    return [docs[i] for i in idx[0]]

# 5. Generate with retrieved context
llm = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def rag_answer(question: str):
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt, max_new_tokens=150)[0]["generated_text"]

print(rag_answer("When is UPSC prelims?"))

Production additions: chunking strategy (sentence vs fixed-token), re-ranking (cross-encoder), hybrid search (BM25 + dense), metadata filtering.

Q11. What is RLHF? Explain the three stages.

Reinforcement Learning from Human Feedback (RLHF) aligns LLMs to human preferences:

Stage 1 -- Supervised Fine-Tuning (SFT):

Fine-tune base LLM on high-quality demonstrations.
Result: SFT model that follows instructions.

Stage 2 -- Reward Model Training:

Human annotators rank pairs of model outputs (A vs B).
Train a reward model RM(prompt, response) -> scalar reward.
Dataset: comparison pairs with human preference labels.

Stage 3 -- PPO Optimization:

Use PPO to update the LLM policy to maximize RM score.
KL penalty keeps the updated policy close to SFT model: reward_total = RM(x,y) - beta * KL(pi_rlhf || pi_sft).

DPO (Direct Preference Optimization) skips the separate RM and optimizes preference directly:

Loss = -log(sigma(beta * (log pi(y_w|x) - log pi(y_l|x) - log pi_ref(y_w|x) + log pi_ref(y_l|x))))
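
The DPO loss above can be sketched for a single preference pair as follows. The inputs are assumed to be summed log-probabilities of the full responses, and beta=0.1 is only a commonly used value, not one prescribed by the text.

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """pi_* / ref_* are log-probs of the chosen (w) and rejected (l) responses."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: margin is 0, loss is log(2)
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931

# Policy raises the chosen response and lowers the rejected one: loss drops
print(dpo_loss(-8.0, -14.0, -10.0, -12.0) < math.log(2))  # True
```

The loss only falls when the policy widens the chosen-versus-rejected gap relative to the reference model, which is how DPO keeps an implicit KL-style anchor without a separate reward model.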

Q12. What is speculative decoding? How does it speed up LLM inference?

Autoregressive generation is sequential -- one token per forward pass through a large model. Speculative decoding breaks this bottleneck:

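A small draft model (e.g. 3B or distilled) proposes K tokens autoregressively in K cheap steps.
The large target model scores all K proposed positions in a single forward pass (batched).
Each proposed token is accepted with probability min(1, p_target / p_draft); the first rejection is re-sampled from a corrected target distribution and the remaining drafts are discarded.
Expected speedup: roughly 2-4x, depending on how often drafts are accepted. With this acceptance rule the output distribution matches sampling from the target alone, so there is no quality loss.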

# HuggingFace supports speculative (assisted) decoding natively
from transformers import AutoModelForCausalLM, AutoTokenizer

# Draft and target must share a tokenizer; both OPT checkpoints do
tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

inputs = tok("The capital of France is", return_tensors="pt")
output = target.generate(
    **inputs,
    assistant_model=draft,
    max_new_tokens=50,
)
print(tok.decode(output[0], skip_special_tokens=True))
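
The accept/reject rule at the heart of speculative sampling can be simulated without any model. This sketch handles a single proposed token with toy three-token distributions; the function name and the probabilities are assumptions for illustration only.

```python
import random

def speculative_step(p_target, q_draft, rng):
    """One accept/reject step; the returned token is distributed according to p_target."""
    tokens = range(len(q_draft))
    x = rng.choices(tokens, weights=q_draft)[0]            # draft model's proposal
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x                                           # accepted
    # Rejected: resample from the residual max(0, p - q), renormalised by choices()
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(tokens, weights=residual)[0]

rng = random.Random(0)
p = [0.6, 0.3, 0.1]   # target distribution (toy values)
q = [0.3, 0.5, 0.2]   # draft distribution (toy values)
n = 100_000
counts = [0, 0, 0]
for _ in range(n):
    counts[speculative_step(p, q, rng)] += 1
freqs = [c / n for c in counts]
print(freqs)  # empirical frequencies, close to p
```

Even though every proposal comes from the draft distribution, the accepted-or-resampled tokens follow the target distribution, which is why the speedup comes without a change in output quality.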

Q13. Explain quantization for LLMs. What is the difference between GPTQ, AWQ, and GGUF?

Quantization reduces weight precision to save memory and speed up inference.

Format	Method	Use case
GPTQ	Post-training, layer-wise second-order weight quantization	GPU inference via AutoGPTQ or ExLlama kernels
AWQ	Activation-aware weight quantization -- preserves salient weights	GPU inference; often reported as higher quality than GPTQ at the same bit-width
GGUF	CPU/Metal quantization format for llama.cpp	Local inference on CPU/Mac M-series
BnB 4-bit	NF4 + double quantization (QLoRA)	Fine-tuning on consumer GPUs

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Memory: roughly 5-6GB for weights vs ~16GB for an FP16 8B model (embeddings and lm_head stay in 16-bit)
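
To show what "reducing weight precision" means at the lowest level, here is a sketch of plain symmetric round-to-nearest (absmax) quantization. It is deliberately simpler than GPTQ, AWQ, or NF4, none of which it implements; the function names and weights are illustrative.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric per-tensor quantization onto signed integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for int8, 7 for int4
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.53, 0.91, -0.07, 0.33]              # toy weight values
for bits in (8, 4):
    q, scale = quantize_absmax(w, bits)
    err = max(abs(a - b) for a, b in zip(w, dequantize(q, scale)))
    print(bits, q, round(err, 4))
```

The 4-bit pass shows a visibly larger reconstruction error than the 8-bit pass. Methods such as GPTQ and AWQ exist to reduce that error by choosing scales and roundings more carefully than this baseline does.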

Q14. What is prompt engineering? What techniques matter most in 2026?

Prompt engineering shapes LLM behavior through input design rather than weight updates.

Core techniques:

Zero-shot: direct instruction, no examples.
Few-shot: 3-8 in-context examples demonstrating the desired format.
Chain-of-Thought (CoT): "Let's think step by step" elicits reasoning traces.
Self-consistency: sample multiple CoT paths, majority-vote the answer.
ReAct: interleave reasoning (Thought) and tool calls (Action) in the prompt.
System prompt engineering: define persona, output constraints, refusal behavior.

# Few-shot CoT example
prompt = """
Classify sentiment. Think step by step.

Review: "Battery dies in 2 hours" -> Thought: mentions battery problem, negative tone -> Sentiment: Negative
Review: "Amazing build quality, fast delivery" -> Thought: positive attributes, no complaints -> Sentiment: Positive
Review: "Average camera but great screen" -> Thought: mixed, one positive one negative ->
"""

Q15. What is flash attention? Why is it important?

Standard attention computes the full N x N attention matrix in HBM (GPU DRAM), which is memory-bandwidth-bound for long sequences.

FlashAttention (Dao et al., 2022):

Tiles the Q, K, V matrices and computes attention in SRAM (fast cache).
Never materializes the full N x N matrix in HBM.
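Extra memory: O(N) instead of O(N^2). HBM accesses drop from O(Nd + N^2) to O(N^2 d^2 / M) for on-chip SRAM of size M.
Result: reported speedups of roughly 2-4x and up to 20x less attention memory, depending on sequence length.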

FlashAttention-2 (2023): better work partitioning, about 2x faster than FA1. FlashAttention-3 (2024): targets H100 asynchronous pipelines, about 1.5-2x faster than FA2.

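import torch
import torch.nn.functional as F
# PyTorch 2.0+ can dispatch scaled_dot_product_attention to a FlashAttention kernel
# when inputs are fp16/bf16 CUDA tensors and no arbitrary attention mask is passed
q = k = v = torch.randn(1, 8, 1024, 64)  # (batch, heads, seq, head_dim); move to CUDA fp16 for the flash kernel
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)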
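
The trick that lets FlashAttention avoid the full N x N matrix is an online softmax: K and V are processed in blocks while a running maximum and normaliser are rescaled. Below is a pure-Python sketch for a single query vector. It illustrates only the numerics, not the GPU tiling or SRAM management, and all names and values are assumptions for the example.

```python
import math

def score(q, k):
    # Scaled dot product, as in softmax(QK^T / sqrt(d_k))
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

def naive_attention(q, keys, values):
    """Reference: materialise every score, then softmax, then weight the values."""
    scores = [score(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, block=2):
    """Stream K/V in blocks, keeping only a running max, normaliser and output."""
    m, l = float("-inf"), 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        scores = [score(q, k) for k in keys[start:start + block]]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)   # rescales results from earlier blocks
        l *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            p = math.exp(s - m_new)
            l += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.2, 0.9], [-0.7, 0.3], [0.4, -1.5], [2.0, 1.0]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0], [0.5, 0.5]]
a = naive_attention(q, keys, values)
b = tiled_attention(q, keys, values, block=2)
print(max(abs(x - y) for x, y in zip(a, b)) < 1e-12)  # True
```

The tiled version never holds more than one block of scores at a time yet produces the same result, so the algorithm is exact rather than an approximation.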

Q16. How do you evaluate an LLM? What metrics go beyond perplexity?

Evaluation Type	Metric / Benchmark	What it measures
General knowledge	MMLU, ARC-Challenge	Breadth of factual recall
Reasoning	GSM8K, MATH, HumanEval	Step-by-step reasoning, code
Instruction following	MT-Bench, AlpacaEval	Multi-turn instruction quality
Safety / alignment	TruthfulQA, BBQ (bias)	Hallucination, fairness
Retrieval accuracy	EM, F1 on QA datasets	Factual accuracy in RAG
Human eval	Elo/Arena ratings (LMSYS Chatbot Arena)	Real user preference

For production RAG systems: use RAGAS (context precision, context recall, answer faithfulness, answer relevance).

Q17. What is the difference between in-context learning, fine-tuning, and RAG? When to use each?

Method	Knowledge source	Latency	Cost	Best for
In-context learning	Prompt context window	Low	Per-call tokens	Dynamic, few-shot tasks
Fine-tuning	Baked into weights	Low (no retrieval)	One-time training	Style, tone, domain-specific format
RAG	External vector store	Medium (+ retrieval)	Infrastructure	Fresh, factual, large knowledge bases

Rule of thumb:

Need to follow a specific format/style reliably? Fine-tune.
Need access to fresh or large external knowledge? RAG.
Need general reasoning with a few examples? ICL.
Best systems combine fine-tuning (style) + RAG (knowledge).

Q18. What are hallucinations in LLMs? How do you reduce them in production?

Hallucinations = model generates plausible-sounding but factually incorrect content. Caused by:

Training data gaps / outdated knowledge.
Model over-confidence in generation distribution.
Ambiguous or underspecified prompts.

Mitigation strategies:

RAG: ground generation in retrieved documents; add source attribution.
Self-consistency: sample multiple outputs, check agreement.
Structured output + validation: force JSON schema output, validate post-generation.
Calibration prompts: "Answer only if you are confident. Say 'I don't know' otherwise."
RLHF/DPO alignment: train reward model to penalize hallucinations.
Citation forcing: require model to quote source document for every factual claim.

# LangChain (legacy chain API) with source attribution; llm and retriever are assumed to be configured already
from langchain.chains import RetrievalQAWithSourcesChain
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, retriever=retriever
)
result = chain({"question": "When is UPSC prelims 2026?"})
# result["answer"] + result["sources"]

Q19. What is Mixture of Experts (MoE)? How does it relate to modern LLMs?

MoE replaces the dense feed-forward layer with N expert sub-networks. A router network selects the top-K experts for each token.

Output = sum_k(gate_k * Expert_k(x))

Benefits:

Total parameters: N * expert_size (large).
Active parameters per token: K * expert_size (small).
Inference cost matches a small dense model; capacity matches a large one.

Examples: Mixtral 8x7B (8 experts, top-2 active = 13B active out of 47B total). GPT-4 is widely believed to be a MoE. Gemini 1.5 uses MoE.

Challenges: load balancing (all tokens routing to same expert), communication overhead in distributed training, expert collapse.
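
Top-K routing can be sketched in a few lines. The example below renormalises the gates with a softmax over the selected logits only, which is one common convention (used by Mixtral) and an assumption here; the toy experts simply scale their input.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def moe_forward(x, router_logits, experts, top_k=2):
    """Route x to the top_k experts and mix their outputs with renormalised gates."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    gates = softmax([router_logits[i] for i in chosen])   # softmax over selected logits only
    out = [0.0] * len(x)
    for g, i in zip(gates, chosen):
        y = experts[i](x)                                  # only the chosen experts run
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, chosen

# Four toy experts: each scales its input by a different constant
experts = [lambda v, c=c: [c * vi for vi in v] for c in (1.0, 2.0, 3.0, 4.0)]
out, chosen = moe_forward([1.0, -1.0], router_logits=[0.1, 2.0, -1.0, 2.0], experts=experts)
print(chosen)  # [1, 3]
print(out)     # [3.0, -3.0]
```

Only two of the four experts execute for this token, which is the source of the gap between total and active parameters. In a real model the router logits come from a learned linear layer, and an auxiliary loss discourages every token from picking the same experts.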

Q20. How does grouped-query attention (GQA) differ from multi-head attention?

Multi-head attention (MHA): each head has its own Q, K, V projections -- high KV cache memory.
Multi-query attention (MQA): all heads share a single K and V -- low KV cache memory but quality degradation.
Grouped-query attention (GQA): the H query heads are split into G groups, and the H/G heads in each group share one K and V -- best quality/memory tradeoff.

Method	KV cache	Quality
MHA (h=32)	32 K, 32 V per layer	Best
MQA (h=32)	1 K, 1 V per layer	Lowest
GQA (h=32, g=8)	8 K, 8 V per layer	Near-MHA

LLaMA 3, Mistral, Gemma 2, Qwen2.5 all use GQA. The memory saving is critical for long-context and batched inference.
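
The head-to-group mapping behind the table can be written down directly. This sketch assumes consecutive query heads share a KV head, which is the layout used in common LLaMA-style implementations; the function name is illustrative.

```python
def kv_head_for(query_head, n_heads, n_kv_heads):
    """Consecutive query heads share one KV head; group size = n_heads // n_kv_heads."""
    assert n_heads % n_kv_heads == 0
    return query_head // (n_heads // n_kv_heads)

n_heads = 32
for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    mapping = [kv_head_for(h, n_heads, n_kv) for h in range(n_heads)]
    # Ratio of KV cache size relative to full multi-head attention
    print(name, "KV heads:", n_kv, "cache vs MHA:", n_kv / n_heads, "query head 5 ->", mapping[5])
```

With 8 KV heads the cache shrinks to a quarter of the MHA size, while each query head still keeps its own Q projection.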

HARD: Production and System Design (Questions 21-28)

Q21. Design a production RAG system for a 10-million-document corpus.

Architecture:
  Ingest pipeline:
    S3 / GCS (raw docs)
      --> Document parser (Unstructured.io or Docling)
      --> Chunker (semantic / fixed-token with overlap)
      --> Embedding batch job (BAAI/bge-large via SageMaker batch)
      --> Vector DB (Pinecone / Weaviate / pgvector with HNSW)
      --> Metadata store (PostgreSQL: doc_id, url, date, category)

  Query pipeline:
    User query
      --> Query rewriter (LLM: HyDE or step-back prompting)
      --> Hybrid search: BM25 (Elasticsearch) + dense (FAISS/HNSW)
      --> Re-ranker (cross-encoder: ms-marco-MiniLM-L-6-v2)
      --> Top-K context (k=5 chunks)
      --> LLM generation (GPT-4o / fine-tuned Llama)
      --> Source attribution

  Scale decisions:
    - HNSW index: ANN lookup in <50ms at 10M vectors
    - Embedding cache: Redis for repeated queries
    - Async ingestion: Kafka queue, batch embedding workers
    - Monitoring: RAGAS scores, query latency p99, retrieval hit rate
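
The hybrid search step needs a way to merge the BM25 and dense result lists. One widely used option is reciprocal rank fusion (RRF), sketched below. RRF is an assumed choice here, since the design above does not name a fusion method, and the document ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Each ranking lists doc ids best-first; a doc's score is the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]    # hypothetical BM25 results
dense_hits = ["d1", "d9", "d3"]   # hypothetical dense-retrieval results
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # ['d1', 'd3', 'd9', 'd7']
```

RRF uses only ranks, so it avoids having to calibrate BM25 scores against cosine similarities. The fused list would then go to the cross-encoder re-ranker.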

Q22. How do you implement streaming for LLM APIs?

import anthropic

client = anthropic.Anthropic()

# Streaming with Anthropic Claude
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformers in 3 sentences."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# FastAPI streaming endpoint
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio

app = FastAPI()

async def stream_llm(prompt: str):
    async with anthropic.AsyncAnthropic().messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield f"data: {text}\n\n"

@app.get("/stream")
async def stream_endpoint(prompt: str):
    return StreamingResponse(stream_llm(prompt), media_type="text/event-stream")

Q23. How do you reduce LLM inference latency in production?

Technique	Latency reduction	Notes
KV cache	2-4x per-token	Always on; critical for long contexts
Speculative decoding	2-3x	Needs a good draft model
Continuous batching	2-5x throughput	vLLM, TGI -- processes multiple requests
Quantization (INT8/INT4)	1.5-2x with ~1% quality drop	GPTQ/AWQ recommended
Smaller model + RAG	3-10x	Trade model size for retrieval quality
FlashAttention	2-4x attention ops	Always use in training and serving
vLLM PagedAttention	2-5x memory efficiency	Enables larger batch sizes

# vLLM for production serving
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is UPSC?"] * 100, params)  # batched generation

Q24. Explain context length scaling. What are the techniques to extend an LLM's context window?

Standard attention is O(N^2) in sequence length. Approaches to extend:

RoPE scaling: scale the theta base frequency (e.g., base=500,000 vs default 10,000). Used in LLaMA 3.1 (128K context).
Position interpolation: scale positional indices to fit new length -- fine-tune for a few steps.
YaRN (Yet another RoPE extensioN): NTK-by-parts interpolation + attention temperature scaling -- needs far less fine-tuning than plain position interpolation.
Sliding window attention: attend only to a local window (Mistral 7B), optionally plus a few global tokens (Longformer).
Memory-efficient attention: FlashAttention-2/3 -- same O(N^2) but better constant factor.
External memory / RAG: beyond roughly 1M tokens, retrieving and injecting the relevant chunks is usually more practical than extending the window further.

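# YaRN config in HuggingFace (support and key names vary by model class and transformers
# version; recent releases use "rope_type" instead of "type")
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    rope_scaling={"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768},
)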

Q25. What is the difference between temperature, top-p, and top-k sampling?

import torch
import torch.nn.functional as F

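def sample_token(logits: torch.Tensor, temperature=1.0, top_k=0, top_p=0.9):
    """Sample one token id from 1-D logits of shape (vocab_size,)."""
    # temperature=0 means greedy decoding; avoid dividing by zero
    if temperature == 0:
        return logits.argmax(dim=-1, keepdim=True)

    # Temperature scaling (creates a new tensor, so the caller's logits are untouched)
    logits = logits / temperature

    # Top-K filtering: mask everything below the K-th largest logit
    if top_k > 0:
        values, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < values[-1]] = float('-inf')

    # Top-P (nucleus) filtering: keep the smallest prefix whose mass reaches top_p
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = F.softmax(sorted_logits, dim=-1)
    cumulative_probs = torch.cumsum(probs, dim=-1)
    # Compare the mass *before* each token, so the top token is always kept
    sorted_logits[cumulative_probs - probs > top_p] = float('-inf')
    logits.scatter_(0, sorted_idx, sorted_logits)

    return torch.multinomial(F.softmax(logits, dim=-1), 1)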

Parameter	Effect	Typical range
temperature=0	Greedy (argmax)	0 = deterministic
temperature=1	Raw model distribution	Default
temperature>1	More random / creative	1.2-1.5 for creative
top_k	Sample from top K tokens only	40-100
top_p	Sample from smallest set summing to p	0.85-0.95

Q26. How do you implement LLM guardrails in production?

from pydantic import BaseModel
from enum import Enum
import json

class SafetyLevel(Enum):
    SAFE = "safe"
    FLAGGED = "flagged"
    BLOCKED = "blocked"

class GuardrailResult(BaseModel):
    level: SafetyLevel
    reason: str | None = None

def input_guardrail(user_message: str, llm) -> GuardrailResult:
    """LLM-as-judge input safety check."""
    prompt = f"""Is this message safe to answer? Reply with JSON: {{"safe": true/false, "reason": "..."}}
Message: {user_message}"""
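    response = llm(prompt)
    try:
        result = json.loads(response)
        is_safe = bool(result["safe"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Fail closed: an unparseable judge reply is never treated as safe
        return GuardrailResult(level=SafetyLevel.FLAGGED, reason="unparseable judge output")
    if is_safe:
        return GuardrailResult(level=SafetyLevel.SAFE)
    return GuardrailResult(level=SafetyLevel.BLOCKED, reason=result.get("reason"))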

# Output guardrails: PII detection, hallucination check, format validation
import re

def output_guardrail(response: str) -> str:
    """Strip PII from generated output."""
    # Redact email patterns
    response = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', response)
    # Redact phone patterns (Indian)
    response = re.sub(r'\b[6-9]\d{9}\b', '[PHONE]', response)
    return response

Production stack: Llama Guard for content classification, custom regex for PII, RAGAS faithfulness for hallucination detection, rate limiting per user tier.

Q27. Explain continuous batching (PagedAttention). Why does it matter for LLM serving?

Problem: In static batching, the server waits to fill a batch before processing -- early requests wait for late arrivals. Also, KV cache is pre-allocated for max_seq_len even if actual output is shorter, wasting GPU memory.

PagedAttention (vLLM):

Manages KV cache like OS virtual memory with pages (typically 16 tokens/page).
Allocates pages on demand as tokens are generated.
Multiple requests can share KV cache pages (prefix caching for repeated system prompts).
Memory waste: under 4% of KV cache memory, versus 60-80% wasted with contiguous pre-allocation (figures reported by the vLLM authors).

Continuous batching:

Instead of waiting for all requests to finish, new requests join as old ones complete.
GPU utilization stays near-constant vs bursty in static batching.
Throughput improvement: reported 2-24x depending on workload.

vLLM implements both. Other frameworks: TGI (HuggingFace), TensorRT-LLM (NVIDIA), MLC-LLM.
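
The paging idea can be illustrated with a toy allocator. This is a conceptual sketch only, not vLLM's implementation: the class name and methods are hypothetical, and real systems also handle preemption, copy-on-write, and prefix sharing.

```python
class PagedKVAllocator:
    """Toy block table: each sequence maps to a list of fixed-size physical pages."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}   # seq_id -> list of physical page ids
        self.lengths = {}        # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.page_size == 0:               # current page is full (or first token)
            if not self.free_pages:
                raise MemoryError("no free KV pages; a real server would preempt or queue")
            self.block_tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's pages to the pool so new requests can use them."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_pages=8)
for _ in range(40):
    alloc.append_token("req-a")      # 40 tokens -> 3 pages of 16
for _ in range(5):
    alloc.append_token("req-b")      # 5 tokens -> 1 page
print(len(alloc.block_tables["req-a"]), len(alloc.free_pages))  # 3 4
alloc.release("req-a")
print(len(alloc.free_pages))                                     # 7
```

Pages are claimed only as tokens arrive and are returned as soon as a request finishes, so memory waste is limited to the unfilled tail of each sequence's last page. Freed pages are what allow continuous batching to admit a new request mid-flight.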

Q28. Design an LLM application for real-time exam answer evaluation.

Requirements:
  - Student submits written answer (200-800 words)
  - System compares against model answer and rubric
  - Returns score (0-10) + detailed feedback in <5 seconds
  - Scale: 10,000 concurrent students (exam peak)

Architecture:

  Request path:
    Student answer (POST /evaluate)
      --> Redis queue (prevent spike overload)
      --> Worker: retrieve model answer + rubric from PostgreSQL
      --> LLM evaluation prompt (structured rubric criteria)
      --> Pydantic output validation (score + feedback JSON)
      --> Cache result (Redis 1-hour TTL for identical answers)
      --> Return to student

  LLM evaluation prompt pattern:
    System: "You are an expert examiner for {subject}. Score strictly."
    User:
      "Model answer: {model_answer}
       Rubric: {rubric_criteria}
       Student answer: {student_answer}
       Return JSON: {score: 0-10, feedback: {criterion: comment, ...}}"

  Scaling:
    - vLLM serving fine-tuned LLaMA-3-8B (exam-domain SFT)
    - 4x A100 40GB: ~800 evaluations/minute
    - Horizontal scaling via Kubernetes HPA on queue depth
    - Fallback: GPT-4o API for overflow during exam peaks

  Quality assurance:
    - Shadow evaluation: 5% of answers re-evaluated by GPT-4o
    - Score distribution monitoring (alert if mean shifts >0.5)
    - Human-in-loop queue for contested scores
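
The output-validation step can be sketched as follows. The design above names Pydantic; this example swaps in the standard library (dataclass plus manual checks) so it runs without extra dependencies, and the class and function names are hypothetical.

```python
import json
from dataclasses import dataclass

@dataclass
class Evaluation:
    score: int
    feedback: dict

def parse_evaluation(raw: str) -> Evaluation:
    """Validate the LLM's JSON reply; raise ValueError so the worker can retry or escalate."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError("score must be an integer from 0 to 10")
    feedback = data.get("feedback")
    if not isinstance(feedback, dict) or not feedback:
        raise ValueError("feedback must be a non-empty object keyed by rubric criterion")
    return Evaluation(score=score, feedback=feedback)

ok = parse_evaluation('{"score": 7, "feedback": {"accuracy": "Covers the key points."}}')
print(ok.score)  # 7
```

Rejecting malformed or out-of-range replies before they reach the student lets the worker retry the LLM call or route the answer to the human-in-loop queue instead of returning a bad score.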

FAQ

Q: Which LLM framework should I learn first?

A: HuggingFace Transformers is the industry standard for working with LLMs. Learn it alongside LangChain or LlamaIndex for RAG applications. Candidates from public preparation resources report that HuggingFace + PEFT (for LoRA) covers 90% of LLM engineer interview questions.

Q: Do I need a GPU to practice LLM interview prep?

A: Google Colab T4 GPU (free tier) is sufficient for most exercises. Quantized 7B models (GGUF 4-bit) run on CPU. Use Ollama locally for fast iteration without cloud costs.

Q: What LLM papers should I read before an interview?

A: Attention Is All You Need (transformer), LoRA, QLoRA, RLHF (Ouyang et al.), FlashAttention, RAG (Lewis et al.), and the LLaMA 3 technical report cover 90% of architecture questions that candidates report encountering in senior LLM interviews.

Q: Is prompt engineering still relevant in 2026?

A: Yes -- but it has matured. Basic zero/few-shot prompting is table stakes. Interviewers now ask about prompt injection defense, structured output forcing, multi-step ReAct agents, and evaluation harnesses. Candidates from public preparation resources confirm this shift toward systematic prompt engineering over ad-hoc prompting.

Page last edited 8 Jun 2026 by Aditya Sharma. A review date records an editorial edit, not a guarantee that every external fact is still current.