What you will learn
KV Cache
KV caching stores key-value tensors from previous tokens during autoregressive generation. Instead of recomputing the whole prefix for every new token, the model reuses cached representations. This reduces latency for long outputs but increases memory pressure per active request.
Mixture of Experts
Mixture of Experts routes tokens or examples through a subset of specialized expert networks. This can increase parameter count without using every parameter for every token. The tradeoffs are routing complexity, load balancing, communication overhead and harder debugging.
Speculative Decoding
Speculative decoding uses a smaller draft model to propose tokens and a larger target model to verify them. When drafts are accepted, generation can be faster without changing the target distribution. Speedups depend heavily on draft quality, batch shape and hardware.
Quantization
Quantization reduces the precision of model weights or activations, for example from 16-bit to 8-bit or 4-bit. It can reduce memory and improve throughput, but may degrade quality, calibration or tool-call reliability if pushed too far.
LoRA and Fine-Tuning
LoRA freezes the base model and trains small low-rank adapter matrices, making adaptation cheaper than full fine-tuning. It is useful for style, format or domain adaptation, but it does not magically add fresh knowledge unless the training data and evaluation support that use case.
DPO/RLHF
RLHF and DPO optimize model behavior using human or preference data. RLHF typically involves reward modeling and policy optimization. DPO directly optimizes preferences with a simpler objective. Both depend on preference quality and can overfit to what raters reward.
LLM Observability
LLM observability tracks prompts, retrieval results, tool calls, latency, cost, user feedback, safety events and evaluation scores. Logs must be privacy-aware: capture enough to debug, but avoid storing unnecessary sensitive data.
RAG Evaluation
RAG evaluation separates retrieval quality from answer quality. Measure whether the right documents were found, whether the answer used them, whether citations support claims and whether the final output satisfies the task.
Model Routing
Model routing sends each task to the cheapest sufficient model or workflow. Simple classification may use a small model; complex reasoning may use a stronger model; risky answers may require retrieval and review. Routing saves cost only if quality controls catch misroutes.
AI Safety Red Teaming
Red teaming tests how systems fail under adversarial or unusual inputs. For LLM apps, test prompt injection, data leakage, unsafe tool use, policy bypasses, hidden instructions and social-engineering prompts. The goal is to improve controls, not to prove perfection.
Cost Optimization for LLM Apps
Cost optimization combines prompt compression, caching, routing, batching, context pruning, retrieval quality, quantization and evaluation. Optimize cost per successful task, not cost per token alone. A cheap model that needs three retries can be more expensive than a strong model once.
Mini test you can run
Pick five real tasks from your own workflow. Run one short prompt, one structured prompt and one prompt with examples or source context. Score each output from 1 to 5 for usefulness, factual risk and edit time. Keep the winning prompt as your baseline and retest after every major change.
| Variant | Usefulness | Factual risk | Edit time |
|---|---|---|---|
| Short prompt | Medium | Higher | High |
| Structured prompt | High | Medium | Medium |
| Context + examples | Highest for repeat tasks | Lower if sources are good | Low |