LLM Interview Questions 2026: 28 Answers with Code

What changed in 2026 drives
Mass-recruiter offer letters are flatter for the 2026 batch - the 4-5 LPA ASE band has barely budged in three years while inflation eats real wages. Premium tracks (Digital, Pro, Elite, Specialist) are still where the differential lives, and they are entirely test-driven. If you are aiming higher than the default offer, the coding round is not optional pageantry - it is the entire interview.
What I'd actually study for this
- Two solid coding-round answers (1 medium-hard DSA each, with edge-case discussion) > five half-baked ones
- One real project you can defend end-to-end - file paths, design decisions, and what you would change
- One DBMS schema you actually built (not a textbook ER diagram), with at least 3 join-heavy queries written from memory
- Three behavioural STAR stories: failure recovered, conflict handled, ownership taken
Where most candidates trip up
The single biggest mistake is treating company-specific guides as primary prep and DSA as secondary. It is the opposite. Mass recruiters use the test as a filter, but premium tracks at every IT services company use coding to allocate offer band. Spend 70% of prep time on DSA + system fundamentals, 20% on company-specific patterns, 10% on HR rehearsal. Reverse that ratio and you collect the default offer.
Editorial commentary by Aditya Sharma · written for PapersAdda · not generated, not aggregated.
Large Language Models are the defining technology of 2026, and every AI engineer interview now includes LLM-specific rounds. Roles at OpenAI, Anthropic, Google DeepMind, and every AI-first startup in India require fluency in transformer internals, fine-tuning, RAG, and production serving. This guide covers 28 LLM interview questions with full answers and code examples from basic to system design.
PapersAdda's take: Candidates report that RAG pipeline design and fine-tuning strategy questions now appear in over 75% of LLM engineer shortlists. The two most common elimination questions are "implement multi-head attention" and "explain KV cache". Confirm the exact interview format and tech stack on the official company careers portal before you prepare.
Which Companies Ask LLM-Specific Questions?
| Company / Role | LLM Focus Area |
|---|---|
| OpenAI, Anthropic | Alignment, RLHF, inference systems |
| Google DeepMind | Research, Gemini family, TPU serving |
| Microsoft (Azure AI) | Azure OpenAI integration, fine-tuning |
| Indian AI startups (Sarvam, Krutrim, Cohesive) | Multilingual LLM, RAG, cost-efficient serving |
| Enterprise (Infosys, TCS AI labs) | RAG pipelines, LLM ops, governance |
EASY: Transformer and LLM Foundations (Questions 1-8)
Q1. What is the transformer architecture? What are its key components?
The transformer (Vaswani et al., 2017) replaces recurrence with self-attention. Core components:
- Input embedding + positional encoding -- converts tokens to vectors and adds position information.
- Multi-head self-attention -- computes Query, Key, Value projections; attention output = softmax(QK^T / sqrt(d_k)) V.
- Feed-forward network -- position-wise two-layer MLP applied identically at each position.
- Layer normalization -- pre-LN (modern) or post-LN (original) stabilizes training.
- Residual connections -- enable gradient flow in deep networks.
Encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) are three families derived from this.
Q2. Explain multi-head attention. Why "multi-head"?
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)
    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project, then split into heads: (B, T, C) -> (B, n_heads, T, d_k)
        Q = self.q(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.k(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.v(x).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product scores: (B, n_heads, T, T)
        attn = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # mask == 0 marks positions that must not be attended to
            attn = attn.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(attn, dim=-1)
        # Merge heads back: (B, n_heads, T, d_k) -> (B, T, C)
        out = (attn @ V).transpose(1, 2).reshape(B, T, C)
        return self.out(out)
Multiple heads allow the model to attend to different semantic subspaces simultaneously. One head might focus on syntactic relations, another on coreference. Single-head attention collapses this to one subspace.
Q3. What is the KV cache? Why is it critical for LLM inference?
During autoregressive generation, each new token attends over all previous tokens. Without caching, every step re-runs the K and V projections for the entire prefix, so projection work grows as O(n) per step and O(n^2) over a sequence of n tokens.
The KV cache stores Key and Value tensors from all previous positions. On each new token:
- Only compute Q, K, V for the new token.
- Retrieve cached K, V for all previous positions.
- Concatenate and compute attention.
This reduces per-step projection work from O(n) tokens to O(1) (the new token only); attention over the cached keys and values still costs O(n) per step.
Memory cost: For a 7B model with 32 layers, 32 heads, d_k=128, batch=1, sequence=4096:
- KV cache = 2 * 32 * 32 * 128 * 4096 * 2 bytes (FP16) = ~2GB
- This is why long-context inference is memory-bound.
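The arithmetic above can be checked with a small helper. This is a minimal sketch: the function name `kv_cache_bytes` and its parameter names are illustrative, not taken from any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache K and V for every layer, head, and position."""
    # The leading factor 2 covers the two cached tensors (K and V).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Same configuration as the worked example: 32 layers, 32 heads, d_k=128, 4096 tokens, FP16.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.2f} GiB")  # prints 2.00 GiB
```

The cache grows linearly with both batch size and sequence length, so doubling either doubles the figure.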
Q4. What is the difference between causal (decoder) attention and full (encoder) attention?
| Property | Causal / Decoder Attention | Full / Encoder Attention |
|---|---|---|
| Mask | Lower-triangular: token i attends only to positions 0..i | No mask: all positions attend to all positions |
| Use case | Autoregressive generation (GPT family) | Bidirectional understanding (BERT family) |
| Training objective | Next-token prediction (CLM) | Masked language modeling (MLM) |
| Context | Past only | Full sequence |
In code, causal mask is typically torch.tril(torch.ones(T, T)).
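To make the effect of the lower-triangular mask concrete, here is a dependency-free sketch of a causally masked softmax over a small score matrix. The helper name `causal_attention_weights` is hypothetical; a real implementation would use the `torch.tril` mask with `masked_fill`.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], keeping only j <= i (lower-triangular mask)."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                  # positions 0..i only
        m = max(visible)                        # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

scores = [[0.0, 5.0, 5.0],
          [1.0, 1.0, 9.0],
          [2.0, 2.0, 2.0]]
weights = causal_attention_weights(scores)
```

Row 0 can only see itself, so large scores at future positions (the 5.0 and 9.0 entries) have no influence on the result.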
Q5. What is rotary positional encoding (RoPE)? How does it differ from learned positional embeddings?
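Original transformers add a learned or sinusoidal embedding to the input. RoPE (Su et al., 2021) instead encodes position by rotating each pair of Q and K dimensions (2i, 2i+1) by an angle theta = m * base^(-2i/d), where m is the token position and base is 10,000 by default: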
Q_rotated[2i] = Q[2i] * cos(theta) - Q[2i+1] * sin(theta)
Q_rotated[2i+1] = Q[2i] * sin(theta) + Q[2i+1] * cos(theta)
Advantages of RoPE:
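- Relative position information is preserved in dot products: the position dependence of the rotated Q_m * K_n enters only through the offset (m - n).
- Tends to extrapolate better to unseen sequence lengths than learned absolute embeddings, especially when combined with the scaling methods in Q24.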
- No added embedding parameters.
- Used in LLaMA, Mistral, Qwen, Gemma, and most modern LLMs.
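The relative-position property can be checked numerically. The sketch below applies the rotation formulas above in plain Python; the function names `rope` and `dot` and the sample vectors are illustrative assumptions, and the frequency schedule base^(-2i/d) with base 10,000 follows the common convention.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (2i, 2i+1) of vec by the angle pos * base**(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
score_a = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2: offset 3
score_b = dot(rope(q, 105), rope(k, 102))  # both shifted by 100: offset still 3
```

`score_a` and `score_b` agree up to floating-point error, which is the sense in which the attention score depends on the offset rather than on absolute positions.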
Q6. What is perplexity? How do you interpret it for LLMs?
Perplexity = exp(cross-entropy loss) = exp(-(1/N) * sum(log P(token_i | context))).
It measures how "surprised" the model is by the test sequence. A perplexity of 10 means the model is as uncertain as choosing uniformly among 10 options at each step.
Interpretation:
- Lower perplexity = better language model.
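- GPT-2 small (117M) reports zero-shot perplexity of about 29 on WikiText-2 and about 37.5 on WikiText-103. Much larger models score far lower, but comparable figures for closed GPT-4 class models are generally not published.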
- Perplexity is not a proxy for task performance -- a model can have low perplexity but poor reasoning.
Limitation: perplexity is dataset-specific. Cross-dataset comparison is invalid.
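The definition translates directly into code. This is a minimal sketch that assumes you already have per-token natural-log probabilities from a model; the function name `perplexity` is illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that spreads probability uniformly over 10 options at every step.
ppl_uniform = perplexity([math.log(1 / 10)] * 50)
# A model that assigns probability 0.9 to the correct token at every step.
ppl_confident = perplexity([math.log(0.9)] * 50)
```

The uniform case recovers the "choosing among 10 options" reading, and the confident case gives 1/0.9, just above the floor of 1.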
Q7. What is the difference between GPT-style (decoder-only) and BERT-style (encoder-only) LLMs?
| Property | GPT-style (decoder-only) | BERT-style (encoder-only) |
|---|---|---|
| Attention | Causal (left-to-right) | Bidirectional |
| Pretraining | Next-token prediction (CLM) | Masked LM + next sentence prediction |
| Generation | Native (autoregressive sampling) | Not directly generative |
| Best for | Open-ended generation, chat, code | Classification, NER, embedding extraction |
| Examples | GPT-4, LLaMA 3, Mistral, Gemma | BERT, RoBERTa, DeBERTa |
Modern preference is decoder-only for general-purpose LLMs due to unified pretraining + generative capability.
Q8. What is tokenization in LLMs? Why does it matter?
LLMs operate on integer token IDs, not raw characters. Common methods:
- BPE (Byte Pair Encoding): merges frequent byte pairs iteratively. Used by GPT-2/3/4.
- WordPiece: similar to BPE but maximizes likelihood of training data. Used by BERT.
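- SentencePiece: a language-agnostic tokenizer library (BPE or unigram) that works directly on raw Unicode text, whitespace included. Used by LLaMA 1/2 and T5; LLaMA 3 moved to a tiktoken-style BPE vocabulary.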
Why tokenization matters:
- Longer tokenizations = higher context usage for the same text.
- Code and math tokenize poorly with word-level vocabularies -- GPT-4 uses a code-optimized vocabulary.
- Multilingual models need subword segmentation to share vocabulary across scripts.
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tok("Hello, PapersAdda!")
print(tokens["input_ids"])       # list of integer token IDs (exact values depend on the tokenizer version)
print(len(tokens["input_ids"])) # token count
MEDIUM: Fine-Tuning, RAG, and Alignment (Questions 9-20)
Q9. What is the difference between full fine-tuning, LoRA, and QLoRA?
| Method | What changes | GPU memory | Quality |
|---|---|---|---|
| Full fine-tuning | All weights | Very high (requires optimizer states for all params) | Best |
| LoRA | Low-rank adapter matrices (r << d) | Low -- adapters only | Near-full with r=16-64 |
| QLoRA | LoRA + base model in 4-bit NF4 | Very low -- 4-bit base + 16-bit adapters | Slight quality cost |
LoRA math: Instead of updating W (d x d), learn W = W0 + BA where B is d x r, A is r x d. Only 2d*r parameters per layer vs d^2.
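from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, TaskType
# Assumption: any causal LM works here; Llama-2-7B is used only to make the counts below concrete
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# With r=16 on q_proj and v_proj of a 32-layer, d=4096 model:
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243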
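The 2d*r count from the LoRA math can be turned into a quick calculator. This sketch assumes a 32-layer model with square 4096 x 4096 `q_proj` and `v_proj` matrices (a LLaMA-7B-like shape); the helper name `lora_trainable_params` is hypothetical.

```python
def lora_trainable_params(d_in, d_out, r, n_layers, modules_per_layer):
    """Each adapted d_out x d_in weight gains A (r x d_in) and B (d_out x r)."""
    per_module = r * d_in + d_out * r
    return per_module * modules_per_layer * n_layers

# Assumed shape: 32 layers, two adapted 4096 x 4096 projections per layer.
counts = {r: lora_trainable_params(4096, 4096, r, n_layers=32, modules_per_layer=2)
          for r in (8, 16)}
print(counts)  # {8: 4194304, 16: 8388608}
```

Trainable parameters scale linearly with the rank, so r=8 yields about 4.2M and r=16 about 8.4M under these assumptions.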
Q10. Walk through building a RAG pipeline from scratch.
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from transformers import pipeline
# 1. Ingest and chunk documents
docs = [
    "UPSC 2026 prelims date is June 1, 2026.",
    "SBI PO exam pattern includes reasoning, quant, and English.",
    "CAT 2026 registration starts August 2026.",
]
# 2. Embed chunks
embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)
# 3. Build FAISS index
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatIP(dim) # inner product = cosine on normalized vectors
index.add(doc_embeddings.astype(np.float32))
# 4. Retrieve at query time
def retrieve(query: str, k: int = 2):
    q_emb = embedder.encode([query], normalize_embeddings=True)
    scores, idx = index.search(q_emb.astype(np.float32), k)
    return [docs[i] for i in idx[0]]
# 5. Generate with retrieved context
llm = pipeline("text-generation", model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
def rag_answer(question: str):
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # return_full_text=False returns only the completion, not the prompt echoed back
    return llm(prompt, max_new_tokens=150, return_full_text=False)[0]["generated_text"]
print(rag_answer("When is UPSC prelims?"))
Production additions: chunking strategy (sentence vs fixed-token), re-ranking (cross-encoder), hybrid search (BM25 + dense), metadata filtering.
Q11. What is RLHF? Explain the three stages.
Reinforcement Learning from Human Feedback (RLHF) aligns LLMs to human preferences:
Stage 1 -- Supervised Fine-Tuning (SFT):
- Fine-tune base LLM on high-quality demonstrations.
- Result: SFT model that follows instructions.
Stage 2 -- Reward Model Training:
- Human annotators rank pairs of model outputs (A vs B).
- Train a reward model RM(prompt, response) -> scalar reward.
- Dataset: comparison pairs with human preference labels.
Stage 3 -- PPO Optimization:
- Use PPO to update the LLM policy to maximize RM score.
- KL penalty keeps the updated policy close to SFT model: reward_total = RM(x,y) - beta * KL(pi_rlhf || pi_sft).
DPO (Direct Preference Optimization) skips the separate RM and optimizes preference directly:
- Loss = -log(sigma(beta * (log pi(y_w|x) - log pi(y_l|x) - log pi_ref(y_w|x) + log pi_ref(y_l|x))))
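The DPO loss above can be evaluated for a single preference pair in a few lines. This is a sketch only: the inputs are assumed to be summed sequence log-probabilities, the function name `dpo_loss` is illustrative, and beta=0.1 is just a commonly used default.

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """-log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)]) for one preference pair.

    Arguments are log pi(y|x) values: policy and reference, chosen (w) and rejected (l).
    """
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy identical to the reference
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)      # chosen more likely, rejected less likely
```

When the policy equals the reference the loss is log 2; it falls as the policy raises the chosen response and lowers the rejected one relative to the reference.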
Q12. What is speculative decoding? How does it speed up LLM inference?
Autoregressive generation is sequential -- one token per forward pass through a large model. Speculative decoding breaks this bottleneck:
- A small draft model (3B or distilled) generates K tokens speculatively in K steps.
- The large target model evaluates all K+1 tokens in a single forward pass (batched).
- Tokens that match target model's distribution are accepted; the first rejection triggers re-sampling from the target.
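- Expected speedup: commonly quoted at 2-4x. Because a rejected token is re-sampled from a corrected target distribution, the output distribution matches the target model's, so in principle there is no quality loss.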
# Hugging Face Transformers exposes this as "assisted generation" via assistant_model
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")  # draft and target must share a tokenizer
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
inputs = tok("The capital of France is", return_tensors="pt")
output = target.generate(
    **inputs,
    assistant_model=draft,
    max_new_tokens=50,
)
print(tok.decode(output[0], skip_special_tokens=True))
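The accept/reject step is what keeps the output faithful to the target model. Below is a toy sketch of the standard rule for a single position (accept with probability min(1, p/q), otherwise re-sample from the normalised positive part of p - q). The distributions and the name `verify_draft_token` are made up for illustration.

```python
import random

def verify_draft_token(token, p_target, q_draft, rng):
    """Accept the draft token with prob min(1, p/q); otherwise resample from max(0, p - q)."""
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token, True
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)  # positive whenever a rejection can occur
    probs = [r / total for r in residual]
    return rng.choices(range(len(probs)), weights=probs)[0], False

rng = random.Random(0)
p_target = [0.6, 0.3, 0.1]   # target model's next-token distribution (toy values)
q_draft = [0.2, 0.5, 0.3]    # draft model's next-token distribution (toy values)
counts = [0, 0, 0]
for _ in range(20000):
    proposal = rng.choices(range(3), weights=q_draft)[0]  # the draft model proposes
    final, _accepted = verify_draft_token(proposal, p_target, q_draft, rng)
    counts[final] += 1
freqs = [c / 20000 for c in counts]  # empirically close to p_target
```

Even though every proposal comes from the draft distribution, the frequencies of the final tokens track the target distribution.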
Q13. Explain quantization for LLMs. What is the difference between GPTQ, AWQ, and GGUF?
Quantization reduces weight precision to save memory and speed up inference.
| Format | Method | Use case |
|---|---|---|
| GPTQ | Post-training, layer-wise second-order weight quantization | GPU inference via AutoGPTQ or ExLlama kernels |
| AWQ | Activation-aware weight quantization -- preserves salient weights | GPU inference; often reported to match or beat GPTQ quality at the same bit-width |
| GGUF | File format used by llama.cpp to store quantized weights | Local inference on CPU/Mac M-series |
| BnB 4-bit | NF4 + double quantization (QLoRA) | Fine-tuning on consumer GPUs |
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
# Memory: roughly 5-6 GB of weights in 4-bit vs ~16 GB in FP16 for an 8B model
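To see what "reducing weight precision" means mechanically, here is a deliberately simple round-to-nearest sketch with one absmax scale per group. It is not GPTQ, AWQ, or NF4 (those choose scales and codes far more carefully); the function names and sample weights are illustrative.

```python
def quantize_absmax(weights, bits=4):
    """Symmetric round-to-nearest quantization with a single scale for the group."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit codes
    scale = max(abs(w) for w in weights) / qmax     # assumes at least one non-zero weight
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

weights = [0.7, -0.33, 0.1, 0.02, -0.6]
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as a small integer plus a shared scale, and the reconstruction error is bounded by half a quantization step.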
Q14. What is prompt engineering? What techniques matter most in 2026?
Prompt engineering shapes LLM behavior through input design rather than weight updates.
Core techniques:
- Zero-shot: direct instruction, no examples.
- Few-shot: 3-8 in-context examples demonstrating the desired format.
- Chain-of-Thought (CoT): "Let's think step by step" elicits reasoning traces.
- Self-consistency: sample multiple CoT paths, majority-vote the answer.
- ReAct: interleave reasoning (Thought) and tool calls (Action) in the prompt.
- System prompt engineering: define persona, output constraints, refusal behavior.
# Few-shot CoT example
prompt = """
Classify sentiment. Think step by step.
Review: "Battery dies in 2 hours" -> Thought: mentions battery problem, negative tone -> Sentiment: Negative
Review: "Amazing build quality, fast delivery" -> Thought: positive attributes, no complaints -> Sentiment: Positive
Review: "Average camera but great screen" -> Thought: mixed, one positive one negative ->
"""
Q15. What is flash attention? Why is it important?
Standard attention computes the full N x N attention matrix in HBM (GPU DRAM), which is memory-bandwidth-bound for long sequences.
FlashAttention (Dao et al., 2022):
- Tiles the Q, K, V matrices and computes attention in SRAM (fast cache).
- Never materializes the full N x N matrix in HBM.
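- Memory: O(N) extra memory instead of O(N^2). HBM accesses drop from O(N^2) to O(N^2 * d^2 / M) for SRAM size M, which is several times fewer for typical head dimensions.
- Result: the authors report roughly 2-4x faster attention and 5-20x less attention memory, with savings growing as sequences get longer.
FlashAttention-2 (2023): better work partitioning, about 2x faster than FA1. FlashAttention-3 (2024): targets H100 async pipelines, reported 1.5-2x faster than FA2.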
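import torch
import torch.nn.functional as F
# PyTorch 2.0+ can dispatch scaled_dot_product_attention to a FlashAttention kernel,
# typically when inputs are fp16/bf16 on CUDA and no arbitrary mask is passed
q = k = v = torch.randn(1, 8, 128, 64)  # (batch, heads, seq_len, head_dim); illustrative shapes
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)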
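The reason tiling works without ever holding a full row of scores is the online-softmax trick: keep a running maximum, denominator, and weighted sum, and rescale them whenever a new tile raises the maximum. The sketch below shows this for one query row with scalar values; it illustrates the idea only and says nothing about the actual CUDA kernel. Function names are hypothetical.

```python
import math

def attention_row_naive(scores, values):
    """Softmax-weighted average computed with the whole row in memory."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)

def attention_row_tiled(scores, values, tile=2):
    """Same result, processed tile by tile with running max, denominator and output."""
    run_max, denom, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(run_max, max(s_tile))
        rescale = math.exp(run_max - new_max)  # corrects the earlier partial sums
        denom *= rescale
        acc *= rescale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        run_max = new_max
    return acc / denom

scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, -2.0, 0.5, 4.0, 3.0]
naive = attention_row_naive(scores, values)
tiled = attention_row_tiled(scores, values)
```

Both functions return the same value up to rounding, but the tiled version only ever touches `tile` scores at a time.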
Q16. How do you evaluate an LLM? What metrics go beyond perplexity?
| Evaluation Type | Metric / Benchmark | What it measures |
|---|---|---|
| General knowledge | MMLU, ARC-Challenge | Breadth of factual recall |
| Reasoning | GSM8K, MATH, HumanEval | Step-by-step reasoning, code |
| Instruction following | MT-Bench, AlpacaEval | Multi-turn instruction quality |
| Safety / alignment | TruthfulQA, BBQ (bias) | Hallucination, fairness |
| Retrieval accuracy | EM, F1 on QA datasets | Factual accuracy in RAG |
| Human eval | Elo/Arena ratings (LMSYS Chatbot Arena) | Real user preference |
For production RAG systems: use RAGAS (context precision, context recall, answer faithfulness, answer relevance).
Q17. What is the difference between in-context learning, fine-tuning, and RAG? When to use each?
| Method | Knowledge source | Latency | Cost | Best for |
|---|---|---|---|---|
| In-context learning | Prompt context window | Low | Per-call tokens | Dynamic, few-shot tasks |
| Fine-tuning | Baked into weights | Low (no retrieval) | One-time training | Style, tone, domain-specific format |
| RAG | External vector store | Medium (+ retrieval) | Infrastructure | Fresh, factual, large knowledge bases |
Rule of thumb:
- Need to follow a specific format/style reliably? Fine-tune.
- Need access to fresh or large external knowledge? RAG.
- Need general reasoning with a few examples? ICL.
- Best systems combine fine-tuning (style) + RAG (knowledge).
Q18. What are hallucinations in LLMs? How do you reduce them in production?
Hallucinations = model generates plausible-sounding but factually incorrect content. Caused by:
- Training data gaps / outdated knowledge.
- Model over-confidence in generation distribution.
- Ambiguous or underspecified prompts.
Mitigation strategies:
- RAG: ground generation in retrieved documents; add source attribution.
- Self-consistency: sample multiple outputs, check agreement.
- Structured output + validation: force JSON schema output, validate post-generation.
- Calibration prompts: "Answer only if you are confident. Say 'I don't know' otherwise."
- RLHF/DPO alignment: train reward model to penalize hallucinations.
- Citation forcing: require model to quote source document for every factual claim.
# LangChain with source attribution (legacy chain API; newer releases favour LCEL pipelines)
# Assumes `llm` and `retriever` have already been constructed
from langchain.chains import RetrievalQAWithSourcesChain
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, retriever=retriever
)
result = chain.invoke({"question": "When is UPSC prelims 2026?"})
# result["answer"] + result["sources"]
Q19. What is Mixture of Experts (MoE)? How does it relate to modern LLMs?
MoE replaces the dense feed-forward layer with N expert sub-networks. A router network selects the top-K experts for each token.
Output = sum_k(gate_k * Expert_k(x))
Benefits:
- Total parameters: N * expert_size (large).
- Active parameters per token: K * expert_size (small).
- Inference cost matches a small dense model; capacity matches a large one.
Examples: Mixtral 8x7B (8 experts, top-2 active = 13B active out of 47B total). GPT-4 is widely believed to be a MoE. Gemini 1.5 uses MoE.
Challenges: load balancing (all tokens routing to same expert), communication overhead in distributed training, expert collapse.
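The routing formula can be sketched for a single token. This toy version uses scalar "experts" and applies the softmax over the selected experts only, which is one common convention; the function name `moe_forward` and the example experts are illustrative.

```python
import math

def moe_forward(x, router_logits, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs with renormalised gates."""
    top = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    gates = {i: e / z for i, e in exps.items()}  # softmax over the selected experts only
    out = sum(gates[i] * experts[i](x) for i in top)
    return out, gates

# Four toy experts; only two of them run for any given token.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
out, gates = moe_forward(3.0, router_logits=[0.1, 2.0, -1.0, 2.0], experts=experts)
```

Experts 1 and 3 tie for the highest router logit, so each receives a gate of 0.5 and the other two experts are never evaluated.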
Q20. How does grouped-query attention (GQA) differ from multi-head attention?
- Multi-head attention (MHA): each head has its own Q, K, V projections -- high KV cache memory.
- Multi-query attention (MQA): all heads share a single K and V -- low KV cache memory but quality degradation.
- Grouped-query attention (GQA): K and V are shared within groups of H/G heads -- best quality/memory tradeoff.
| Method | KV cache | Quality |
|---|---|---|
| MHA (h=32) | 32 K, 32 V per layer | Best |
| MQA (h=32) | 1 K, 1 V per layer | Lowest |
| GQA (h=32, g=8) | 8 K, 8 V per layer | Near-MHA |
LLaMA 3, Mistral, Gemma 2, Qwen2.5 all use GQA. The memory saving is critical for long-context and batched inference.
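A short sketch of the two things GQA changes: which K/V head a query head reads from, and how much cache each token costs. It assumes 32 layers, head_dim=128, and FP16 values; the helper names are hypothetical.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the shared K/V head that a given query head reads from."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes added per generated token (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

per_token = {name: kv_bytes_per_token(32, n_kv, 128)
             for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]}
print({name: b // 1024 for name, b in per_token.items()})  # {'MHA': 512, 'GQA': 128, 'MQA': 16} (KiB)
```

With 32 query heads and 8 K/V heads, query heads 0-3 share K/V head 0, heads 4-7 share head 1, and so on, cutting the cache to a quarter of the MHA figure.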
HARD: Production and System Design (Questions 21-28)
Q21. Design a production RAG system for a 10-million-document corpus.
Architecture:
Ingest pipeline:
S3 / GCS (raw docs)
--> Document parser (Unstructured.io or Docling)
--> Chunker (semantic / fixed-token with overlap)
--> Embedding batch job (BAAI/bge-large via SageMaker batch)
--> Vector DB (Pinecone / Weaviate / pgvector with HNSW)
--> Metadata store (PostgreSQL: doc_id, url, date, category)
Query pipeline:
User query
--> Query rewriter (LLM: HyDE or step-back prompting)
--> Hybrid search: BM25 (Elasticsearch) + dense (FAISS/HNSW)
--> Re-ranker (cross-encoder: ms-marco-MiniLM-L-6-v2)
--> Top-K context (k=5 chunks)
--> LLM generation (GPT-4o / fine-tuned Llama)
--> Source attribution
Scale decisions:
- HNSW index: ANN lookup in <50ms at 10M vectors
- Embedding cache: Redis for repeated queries
- Async ingestion: Kafka queue, batch embedding workers
- Monitoring: RAGAS scores, query latency p99, retrieval hit rate
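The hybrid search step needs some way to merge the BM25 and dense result lists. One common option, shown here as an assumption rather than something the architecture above prescribes, is reciprocal rank fusion (RRF). The doc ids and the constant k=60 are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids; each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]    # hypothetical lexical results, best first
dense_hits = ["d1", "d9", "d3"]   # hypothetical vector-search results, best first
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['d1', 'd3', 'd9', 'd7']
```

RRF only uses ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. The fused list would then go to the re-ranker.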
Q22. How do you implement streaming for LLM APIs?
import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# Streaming with Anthropic Claude
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformers in 3 sentences."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
# FastAPI streaming endpoint
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
app = FastAPI()
async_client = anthropic.AsyncAnthropic()  # create once and reuse across requests
async def stream_llm(prompt: str):
    async with async_client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            # JSON-encode each chunk so embedded newlines cannot break SSE framing
            yield f"data: {json.dumps(text)}\n\n"
@app.get("/stream")
async def stream_endpoint(prompt: str):
    return StreamingResponse(stream_llm(prompt), media_type="text/event-stream")
Q23. How do you reduce LLM inference latency in production?
| Technique | Typical reported gain (workload-dependent) | Notes |
|---|---|---|
| KV cache | 2-4x per-token | Always on; critical for long contexts |
| Speculative decoding | 2-3x | Needs a good draft model |
| Continuous batching | 2-5x throughput | vLLM, TGI -- processes multiple requests |
| Quantization (INT8/INT4) | 1.5-2x with ~1% quality drop | GPTQ/AWQ recommended |
| Smaller model + RAG | 3-10x | Trade model size for retrieval quality |
| FlashAttention | 2-4x attention ops | Always use in training and serving |
| vLLM PagedAttention | 2-5x memory efficiency | Enables larger batch sizes |
# vLLM for production serving
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is UPSC?"] * 100, params) # batched generation
Q24. Explain context length scaling. What are the techniques to extend an LLM's context window?
Standard attention is O(N^2) in sequence length. Approaches to extend:
- RoPE scaling: scale the theta base frequency (e.g., base=500,000 vs default 10,000). Used in LLaMA 3.1 (128K context).
- Position interpolation: scale positional indices to fit new length -- fine-tune for a few steps.
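- YaRN (Yet another RoPE extensioN): NTK-aware (by-parts) interpolation + attention temperature scaling -- needs far less fine-tuning than plain position interpolation, and its dynamic variant can be applied without fine-tuning for modest extensions.
- Sliding window attention: attend only to a local window (Mistral), optionally plus a few global tokens (Longformer).
- Memory-efficient attention: FlashAttention-2/3 -- compute is still O(N^2), but memory is linear in N and wall-clock time drops.
- External memory / RAG: once the corpus exceeds what even 1M+ token windows can hold, retrieve and inject the relevant passages instead.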
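# YaRN-style RoPE scaling via a Hugging Face config override
# Assumption: whether a model honours rope_scaling, and whether the key is "type" or
# "rope_type", depends on the architecture and the installed transformers version
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.3",
    rope_scaling={"type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768},
)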
Q25. What is the difference between temperature, top-p, and top-k sampling?
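import torch
import torch.nn.functional as F
def sample_token(logits: torch.Tensor, temperature=1.0, top_k=0, top_p=0.9):
    """Sample one token id from 1-D logits of shape (vocab_size,)."""
    # temperature == 0 means greedy decoding; handle it before dividing
    if temperature == 0:
        return torch.argmax(logits, dim=-1, keepdim=True)
    # Temperature scaling (creates a new tensor, so the caller's logits are untouched)
    logits = logits / temperature
    # Top-K filtering
    if top_k > 0:
        values, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        min_val = values[-1]
        logits[logits < min_val] = float('-inf')
    # Top-P (nucleus) filtering
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    probs = F.softmax(sorted_logits, dim=-1)
    cumulative_probs = torch.cumsum(probs, dim=-1)
    # Drop a token once the probability mass before it already exceeds top_p
    sorted_logits[cumulative_probs - probs > top_p] = float('-inf')
    logits.scatter_(0, sorted_idx, sorted_logits)
    return torch.multinomial(F.softmax(logits, dim=-1), 1)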
| Parameter | Effect | Typical range |
|---|---|---|
| temperature=0 | Greedy (argmax) | 0 = deterministic |
| temperature=1 | Raw model distribution | Default |
| temperature>1 | More random / creative | 1.2-1.5 for creative |
| top_k | Sample from top K tokens only | 40-100 |
| top_p | Sample from smallest set summing to p | 0.85-0.95 |
Q26. How do you implement LLM guardrails in production?
from pydantic import BaseModel
from enum import Enum
import json
import re
class SafetyLevel(Enum):
    SAFE = "safe"
    FLAGGED = "flagged"
    BLOCKED = "blocked"
class GuardrailResult(BaseModel):
    level: SafetyLevel
    reason: str | None = None
def input_guardrail(user_message: str, llm) -> GuardrailResult:
    """LLM-as-judge input safety check; `llm` is any callable mapping a prompt string to a response string."""
    prompt = f"""Is this message safe to answer? Reply with JSON: {{"safe": true/false, "reason": "..."}}
Message: {user_message}"""
    try:
        result = json.loads(llm(prompt))
    except json.JSONDecodeError:
        # Fail closed when the judge does not return valid JSON
        return GuardrailResult(level=SafetyLevel.FLAGGED, reason="unparseable judge output")
    if result.get("safe") is True:
        return GuardrailResult(level=SafetyLevel.SAFE)
    return GuardrailResult(level=SafetyLevel.BLOCKED, reason=result.get("reason"))
# Output guardrails: PII detection, hallucination check, format validation
def output_guardrail(response: str) -> str:
    """Redact PII from generated output."""
    # Redact email patterns
    response = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', response)
    # Redact phone patterns (Indian 10-digit mobile numbers)
    response = re.sub(r'\b[6-9]\d{9}\b', '[PHONE]', response)
    return response
Production stack: Llama Guard for content classification, custom regex for PII, RAGAS faithfulness for hallucination detection, rate limiting per user tier.
Q27. Explain continuous batching (PagedAttention). Why does it matter for LLM serving?
Problem: In static batching, the server waits to fill a batch before processing -- early requests wait for late arrivals. Also, KV cache is pre-allocated for max_seq_len even if actual output is shorter, wasting GPU memory.
PagedAttention (vLLM):
- Manages KV cache like OS virtual memory with pages (typically 16 tokens/page).
- Allocates pages on demand as tokens are generated.
- Multiple requests can share KV cache pages (prefix caching for repeated system prompts).
- Memory utilization: the vLLM authors report under 4% KV-cache waste, versus 60-80% wasted by contiguous pre-allocation in earlier systems.
Continuous batching:
- Instead of waiting for all requests to finish, new requests join as old ones complete.
- GPU utilization stays near-constant vs bursty in static batching.
- Throughput improvement: reported 2-24x depending on workload and baseline (the upper figure is against plain HuggingFace Transformers).
vLLM implements both. Other frameworks: TGI (HuggingFace), TensorRT-LLM (NVIDIA), MLC-LLM.
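The paging idea can be illustrated with a toy block table. This is a sketch of the bookkeeping only (no tensors, no prefix sharing), and the class `PagedKVAllocator` is hypothetical rather than part of vLLM's API.

```python
class PagedKVAllocator:
    """Toy block table: sequences take fixed-size pages from a shared free list on demand."""

    def __init__(self, n_pages, page_size=16):
        self.page_size = page_size
        self.free = list(range(n_pages))
        self.tables = {}    # seq_id -> list of physical page ids
        self.lengths = {}   # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.page_size == 0:          # first token, or the current page is full
            if not self.free:
                raise MemoryError("no free KV pages")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's pages to the pool immediately."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(n_pages=8)
for _ in range(17):
    alloc.append_token("req-a")   # 17 tokens need 2 pages
for _ in range(5):
    alloc.append_token("req-b")   # 5 tokens need 1 page
```

A request only ever holds the pages it has actually filled, and `release` hands them straight back, which is what lets a continuous-batching scheduler admit new requests as old ones finish.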
Q28. Design an LLM application for real-time exam answer evaluation.
Requirements:
- Student submits written answer (200-800 words)
- System compares against model answer and rubric
- Returns score (0-10) + detailed feedback in <5 seconds
- Scale: 10,000 concurrent students (exam peak)
Architecture:
Request path:
Student answer (POST /evaluate)
--> Redis queue (prevent spike overload)
--> Worker: retrieve model answer + rubric from PostgreSQL
--> LLM evaluation prompt (structured rubric criteria)
--> Pydantic output validation (score + feedback JSON)
--> Cache result (Redis 1-hour TTL for identical answers)
--> Return to student
LLM evaluation prompt pattern:
System: "You are an expert examiner for {subject}. Score strictly."
User:
"Model answer: {model_answer}
Rubric: {rubric_criteria}
Student answer: {student_answer}
Return JSON: {score: 0-10, feedback: {criterion: comment, ...}}"
Scaling:
- vLLM serving fine-tuned LLaMA-3-8B (exam-domain SFT)
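- 4x A100 40GB: ~800 evaluations/minute (illustrative estimate; benchmark on your own workload). At that rate a burst of 10,000 submissions takes about 12.5 minutes to drain, so the <5 second target needs many more replicas or the API fallback below.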
- Horizontal scaling via Kubernetes HPA on queue depth
- Fallback: GPT-4o API for overflow during exam peaks
Quality assurance:
- Shadow evaluation: 5% of answers re-evaluated by GPT-4o
- Score distribution monitoring (alert if mean shifts >0.5)
- Human-in-loop queue for contested scores
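A quick capacity check helps sanity-test the scaling numbers against the latency requirement. The sketch below treats the ~800 evaluations/minute figure as the sustained rate of one 4x A100 serving group, which is an assumption; the function names are illustrative.

```python
import math

def drain_time_seconds(queued_requests, evals_per_minute):
    """Time to clear a burst if the fleet sustains evals_per_minute."""
    return queued_requests * 60 / evals_per_minute

def replicas_needed(queued_requests, evals_per_minute_per_replica, deadline_seconds):
    """Serving groups required to clear the burst within the deadline."""
    return math.ceil(queued_requests * 60 / (evals_per_minute_per_replica * deadline_seconds))

burst = 10_000
single_group_drain = drain_time_seconds(burst, evals_per_minute=800)       # 750.0 seconds
groups_for_5s = replicas_needed(burst, evals_per_minute_per_replica=800,
                                deadline_seconds=5)                        # 150 groups
```

If all 10,000 students submit at once, a single group needs 12.5 minutes, so meeting a strict 5-second bound for everyone would take on the order of 150 such groups. In practice the queue, staggered submissions, and the overflow path are what make the design affordable.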
FAQ
Q: Which LLM framework should I learn first? A: HuggingFace Transformers is the industry standard for working with LLMs. Learn it alongside LangChain or LlamaIndex for RAG applications. Candidate reports shared on public preparation forums suggest that HuggingFace + PEFT (for LoRA) covers about 90% of LLM engineer interview questions.
Q: Do I need a GPU to practice LLM interview prep? A: Google Colab T4 GPU (free tier) is sufficient for most exercises. Quantized 7B models (GGUF 4-bit) run on CPU. Use Ollama locally for fast iteration without cloud costs.
Q: What LLM papers should I read before an interview? A: Attention Is All You Need (transformer), LoRA, QLoRA, RLHF (Ouyang et al.), FlashAttention, RAG (Lewis et al.), and the LLaMA 3 technical report cover 90% of architecture questions that candidates report encountering in senior LLM interviews.
Q: Is prompt engineering still relevant in 2026? A: Yes -- but it has matured. Basic zero/few-shot prompting is table stakes. Interviewers now ask about prompt injection defense, structured output forcing, multi-step ReAct agents, and evaluation harnesses. Candidate reports shared on public preparation forums point to the same shift toward systematic prompt engineering over ad-hoc prompting.
Methodology applied to this article (last verified 8 Jun 2026)
- No fabricated salary numbers or success rates. If we quote a range, it's sourced.
- No noun-substituted templates. This article was not generated by swapping company names in a stock prompt.
- No paid placements, sponsored coaching links, or affiliate-shilled course pushes.