Production AI Engineering

Joona-Pekka Kokko · jokokko.com · github.com/jokokko

TL;DR — top 5 if you read nothing else


1. Foundations

1.1 LLMs vs classical services

Classical software is deterministic. LLMs are not. The contrast:

| Dimension | Classical service | LLM-backed service |
| --- | --- | --- |
| Determinism | Reproducible | Stochastic; not even reproducible at temperature 0 |
| State | App-managed | Stateless model — "memory" is whatever you put in context |
| Failure modes | Exceptions, timeouts | Plus: hallucination, instruction drift, format violations, prompt injection, context overflow |
| Latency | Sub-100ms | 500ms–60s+ depending on output and reasoning |
| Cost | CPU-time, fixed infra | Token-metered, user-driven, hard to forecast |
| Testing | Unit tests | Eval sets, judges, statistical thresholds |
| Versioning | Code | Code + prompts + model + tool schemas + retrieval index |

The LLM is an unreliable remote service with a natural-language API and a hard token budget.

1.2 The request cycle

LLM APIs take a structured array of messages with roles (system, user, assistant, tool), not a string. The model is stateless — every turn must contain whatever history matters.
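
A minimal sketch of the request shape, using the OpenAI Python SDK as one concrete example (client setup, model ID, and conversation content are placeholders; Anthropic's API is analogous but takes the system prompt as a separate parameter):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The model sees only this list -- prior turns must be replayed explicitly on every request.
messages = [
    {"role": "system", "content": "You are a terse support assistant. Answer only from the provided context."},
    {"role": "user", "content": "How do I rotate my API key?"},
    {"role": "assistant", "content": "Settings > API Keys > Rotate."},
    {"role": "user", "content": "How long does the old key stay valid?"},  # follow-up depends on replayed history
]

resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(resp.choices[0].message.content)
```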

Context budgeting. Total tokens = system prompt + retrieved context + tool schemas + history + tool outputs + completion. You'll exceed limits faster than expected once tools and retrieval kick in. Measure tokens; characters lie (especially for code, non-English, and unusual symbols).
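
A quick budget check, assuming tiktoken as a stand-in tokenizer (the encoding name and placeholder prompt parts below are illustrative; your provider's own token counter is authoritative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximation; use your model's actual tokenizer

def num_tokens(text: str) -> int:
    return len(enc.encode(text))

# Placeholder prompt parts; in practice these come from your prompt-assembly step.
parts = {
    "system": "You are a terse support assistant.",
    "tools": '{"name": "search_docs", "description": "...", "parameters": {"type": "object"}}',
    "history": "user: How do I rotate my API key?\nassistant: Settings > API Keys > Rotate.",
    "retrieved": "Title: API Security Guide\nKeys remain valid for 24 hours after rotation.",
}

budget = 128_000  # model context limit
used = sum(num_tokens(t) for t in parts.values())
print(f"{used} tokens used, {budget - used} left for tool outputs and the completion")
```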

History management. For long sessions:

Prompt caching is the highest-ROI optimization most teams skip. Anthropic gives you explicit cache breakpoints (cache_control); OpenAI auto-caches stable prefixes. Cached input tokens are typically ~10× cheaper and faster. Order matters: only content before the first variable input gets cached. Structure prompts so the system prompt, tool schemas, and large reused documents sit at the front.
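
A sketch of an explicit cache breakpoint with the Anthropic SDK (model ID and prompt contents are placeholders; field names match recent SDK versions, so verify against current docs):

```python
import anthropic

client = anthropic.Anthropic()

STABLE_SYSTEM_PROMPT = "(persona, policies, tool guidance, large reused reference document)"

resp = client.messages.create(
    model="claude-sonnet-4-20250514",                 # placeholder model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,             # stable content first...
            "cache_control": {"type": "ephemeral"},   # ...breakpoint after the cacheable prefix
        }
    ],
    messages=[{"role": "user", "content": "How do I rotate my API key?"}],  # variable content after the breakpoint
)
print(resp.content[0].text)
```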

1.3 Picking a model

Selection criterion: cheapest model that hits the eval bar.

Routing. Once you have multiple task types, route. A small classifier (or a cheap LLM call) sends simple queries to a cheap model and complex ones to the expensive one. Add a fallback chain: on schema-validation failure, transient error, or refusal, retry with a stronger model before surfacing the failure to the user.
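
A minimal routing-plus-fallback sketch; the keyword router, model names, and the injected call_llm function are hypothetical stand-ins for your own classifier and provider client:

```python
class SchemaError(Exception): ...
class TransientError(Exception): ...

CHEAP, STRONG = "small-model", "frontier-model"   # placeholder model IDs

def route(query: str) -> str:
    # Stand-in for a real router: a small trained classifier or a cheap LLM call.
    hard = ("compare", "why", "root cause", "multi-step")
    return STRONG if any(h in query.lower() for h in hard) else CHEAP

def answer(query: str, call_llm) -> str:
    """call_llm(model, query) wraps your provider SDK and raises SchemaError / TransientError."""
    chain = [route(query)]
    if chain[0] != STRONG:
        chain.append(STRONG)                      # fallback: retry once with the stronger model
    for model in chain:
        try:
            return call_llm(model, query)
        except (SchemaError, TransientError):
            continue
    raise RuntimeError("all models in the fallback chain failed")
```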

Prompts are code. Prompts, model IDs, sampling params, tool schemas, and output schemas are all behavior-affecting config. Version them with the application, attach to releases, gate changes through CI with eval runs. Unversioned prompt drift in prod is one of the most common "it used to work" root causes.

1.4 Transformer mechanics

The attention computation:

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V
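
The same computation in a few lines of NumPy, for intuition only (single head, no masking, random toy data):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k): every query scores every key -- the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted mixture of value vectors

Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
print(attention(Q, K, V).shape)                       # (4, 8)
```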

What matters in practice:

| Component | Engineering implication |
| --- | --- |
| Tokenizer | Drives token cost and context utilization. Budget in tokens. |
| Embeddings | Powers retrieval, similarity caching, dedup, clustering. |
| Self-attention | Quadratic — long context is expensive. |
| Positional encoding | Order matters; the model knows where tokens sit. |
| Softmax + temperature | Your only real lever on output randomness. |
| KV cache | Reused prefixes save real money. Design for cache-friendliness. |

1.5 Controlling randomness

Logits → softmax → token. Temperature T rescales logits to logits/T before softmax:

pᵢ = exp(zᵢ / T) / Σⱼ exp(zⱼ / T)

temperature=0 is not deterministic. Floating-point non-associativity in batched matmul means you'll see occasional differences across model versions, hardware, batching, even time. If you need bit-exact repeatability, you don't have it. Design around statistical stability, not equality.

top_p (nucleus sampling) truncates the distribution to the smallest set of tokens whose probability mass exceeds p. Often a better lever than temperature alone. Don't aggressively tune both at once.
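
A toy sampler showing both knobs on a raw logit vector (illustrative only; hosted APIs apply this server-side):

```python
import numpy as np

def sample(logits: np.ndarray, temperature: float = 0.7, top_p: float = 0.9) -> int:
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                               # softmax with temperature
    order = np.argsort(probs)[::-1]                    # most likely token first
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]    # smallest set whose mass reaches top_p
    nucleus = probs[keep] / probs[keep].sum()          # renormalize within the nucleus
    return int(np.random.choice(keep, p=nucleus))

print(sample(np.array([2.0, 1.0, 0.5, -1.0, -3.0])))  # index of the sampled token
```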


2. Context engineering and RAG

2.1 Prompt engineering principles

Principles that consistently improve outputs:

Use the system prompt for stable behavior. Persona, format constraints, refusal policy, style — anything that doesn't change per-request goes here. It's also the cleanest cache target.

Delimit cleanly. XML tags (<context>...</context>, <question>...</question>) work especially well on Anthropic models. Markdown headers also work. The point is unambiguous boundaries between instructions, user input, and retrieved data — both for the model and for prompt-injection defense.

Position matters, and "lost in the middle" is real. Models attend best to the start and end of long contexts, worst to the middle. So: put critical instructions and the actual question near the start or end of the prompt, and don't bury must-follow constraints mid-context.

Few-shot examples. Two to five well-chosen examples typically beat long instructions. Demonstrate edge cases and the exact output format. Dynamic few-shot — retrieve the most similar examples from a curated library at request time — beats static examples on heterogeneous workloads.
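
A sketch of dynamic few-shot selection; embed() is a placeholder for your embedding client and the example-library schema is illustrative:

```python
import numpy as np

def pick_examples(query: str, library: list[dict], embed, k: int = 3) -> list[dict]:
    """library items: {"input": ..., "output": ..., "vec": precomputed embedding}."""
    q = np.asarray(embed(query))
    def cos(v): return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(library, key=lambda ex: cos(np.asarray(ex["vec"])), reverse=True)
    return ranked[:k]   # render these as input/output demonstrations in the prompt
```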

Chain-of-thought for non-reasoning models. Asking for step-by-step thinking before the answer measurably helps on multi-step problems. For reasoning models, this is built in — adding "think step by step" is redundant and sometimes counterproductive.

Force citations. When grounding on retrieved context, require inline citation of the source chunk for each claim. Hallucinations drop sharply because the model commits to evidence at generation time instead of confabulating around it.

Heuristic framing is unreliable. "This is mission-critical", "lives depend on this" — inconsistent across models, often counterproductive on modern frontier ones. Structural clarity beats emotional pressure.

2.2 RAG: ingestion

RAG grounds outputs in your data. Three stages:

Normalization. Convert source formats (PDF, HTML, DOCX, code, transcripts) to clean text with structure preserved. PDF extraction has more failure modes than people expect — flattened tables, interleaved columns, lost headers — and it silently destroys retrieval downstream. Inspect your normalized output before you ever embed.

Chunking. Strategies, in increasing sophistication:

Chunk enrichment. Prepend each chunk with metadata breadcrumbs: doc title, section path, date, author, source URL. A 200-token chunk with "Title: X / Section: Y" prefixed retrieves dramatically better. Anthropic's "contextual retrieval" technique generates a per-chunk situating sentence with a cheap LLM call before embedding — measurable wins on hard queries.
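
A sketch of breadcrumb prefixing before embedding; the metadata fields are illustrative:

```python
def enrich(chunk: str, meta: dict) -> str:
    # Breadcrumbs go into the embedded text itself, not only into the stored payload.
    header = (
        f"Title: {meta['title']}\n"
        f"Section: {' > '.join(meta['section_path'])}\n"
        f"Date: {meta['date']}\n\n"
    )
    return header + chunk

text = enrich(
    "Keys remain valid for 24 hours after rotation.",
    {"title": "API Security Guide", "section_path": ["Auth", "Key rotation"], "date": "2024-11-02"},
)
# Embed `text`; store the raw chunk and metadata alongside the vector.
```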

Embedding. Pick a model that matches your domain (general English, multilingual, code). Test on your queries. Write to an ANN index (FAISS, pgvector, Qdrant, Pinecone) with metadata stored alongside.

2.3 Retrieval

Hybrid search is the default, not an optimization. Pure vector search misses exact-match cases: error codes, identifiers, proper nouns, version numbers, code symbols. BM25 catches these. Combine via Reciprocal Rank Fusion (RRF). Production RAG systems converge on hybrid because the failure modes are complementary.
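
RRF itself is a few lines; k=60 is the conventional constant:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: ranked doc-id lists from each retriever (e.g., vector and BM25)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7", "d2"]
bm25_hits = ["d7", "d9", "d3", "d4"]
print(rrf([vector_hits, bm25_hits])[:3])   # docs ranked well by either retriever float to the top
```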

Metadata pre-filtering. Hard-filter the candidate set before semantic search: tenant_id, date >= ..., department = ..., language = .... This turns vector search into search within the allowed results rather than a query against one global semantic database. Skipping this is a common cause of cross-tenant leaks and stale results.

Query rewriting. A cheap LLM call rewrites vague or conversational queries into focused search queries. Especially important in multi-turn dialogue where pronouns and context shift the actual information need.

HNSW tuning (the dominant ANN structure): M sets graph degree (memory and the recall ceiling), ef_construction sets build-time graph quality, and ef_search trades query latency for recall at search time.

Measure recall against an exhaustive baseline on a sample. Fast indexes that miss 30% of relevant docs are a silent failure mode.
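
A recall check against brute force, sketched with FAISS and synthetic data (hnswlib or your vector DB's equivalent parameters work the same way):

```python
import numpy as np
import faiss

d, n, k = 384, 50_000, 10
xb = np.random.rand(n, d).astype("float32")            # corpus embeddings (stand-in data)
xq = np.random.rand(200, d).astype("float32")          # sample of real queries

exact = faiss.IndexFlatL2(d)                           # exhaustive baseline
exact.add(xb)
hnsw = faiss.IndexHNSWFlat(d, 32)                      # M = 32
hnsw.hnsw.efConstruction = 200
hnsw.add(xb)

_, gt = exact.search(xq, k)                            # ground-truth neighbours
for ef in (16, 64, 256):
    hnsw.hnsw.efSearch = ef
    _, approx = hnsw.search(xq, k)
    recall = np.mean([len(set(a) & set(g)) / k for a, g in zip(approx, gt)])
    print(f"efSearch={ef}: recall@{k}={recall:.3f}")
```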

2.4 Reranking

Retrieval is high-recall, low-precision by design. Rerank to fix that.

Cross-encoder rerankers (Cohere Rerank, BGE reranker) score query+candidate as a single input. Slower than bi-encoders, materially more accurate. Standard pattern: hybrid retrieve top-50 to top-100, rerank to top-5 to top-10.
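
A sketch with sentence-transformers' CrossEncoder and a public BGE reranker checkpoint (the model name is an example; hosted rerankers such as Cohere Rerank slot into the same retrieve-then-rerank shape):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")      # example checkpoint

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # Each (query, candidate) pair is scored jointly -- slower than bi-encoders, materially more accurate.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

# Usage: hybrid-retrieve 50-100 candidates, keep the reranked top 5-10 for the prompt.
```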

A 200k-token window does not eliminate reranking. Two reasons: attention quality degrades in the middle of very long contexts ("lost in the middle"), and every extra candidate you stuff in adds latency and token cost on every call.

Measure faithfulness and answer quality, not document presence in the candidate set.

2.5 Long-context vs. RAG

Prompt caching shifts the math: caching a 100k-token document prefix makes long-context far cheaper to reuse across turns. Re-evaluate annually.

2.6 Retrieval evaluation

| Metric | What it measures |
| --- | --- |
| Recall@k | Fraction of relevant docs in top-k |
| MRR | Mean reciprocal rank of the first relevant doc |
| nDCG | Position-weighted ranking quality |
| Faithfulness | Whether the generated answer is supported by retrieved context |
| Answer relevance | Whether the answer addresses the actual question |

Evaluate retrieval in isolation with a curated query → relevant-docs gold set before evaluating end-to-end. Otherwise the LLM's own knowledge masks retrieval failures and you'll fix the wrong thing. Tools: Ragas, DeepEval, TruLens, or a few hundred lines of your own.

Maintain the gold set under version control. Add new examples whenever you find a regression. Run retrieval evals on every index change, embedding-model upgrade, and chunking-strategy change.

2.7 Semantic caching

Two-tier: an exact-match cache keyed on a hash of the normalized request first, then a semantic cache keyed on embedding similarity above a threshold.

The semantic cache is dangerous if the threshold is loose. "What is X" and "What is not X" can sit at similarity 0.95. Treat the threshold as task-specific, evaluate on real traffic, and only cache stable factoid-style answers — not anything that depends on context, time, or user state.
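
A sketch of the two tiers; the injected embed() client and the threshold are placeholders, and per the caveat above the threshold must be tuned on real traffic:

```python
import hashlib
import numpy as np

class TwoTierCache:
    def __init__(self, embed, threshold: float = 0.97):
        self.embed, self.threshold = embed, threshold        # embed(): your embedding client
        self.exact: dict[str, str] = {}                      # tier 1: normalized-prompt hash -> answer
        self.semantic: list[tuple[np.ndarray, str]] = []     # tier 2: (query vector, answer)

    def _key(self, q: str) -> str:
        return hashlib.sha256(q.strip().lower().encode()).hexdigest()

    def get(self, q: str):
        if (hit := self.exact.get(self._key(q))) is not None:
            return hit
        v = np.asarray(self.embed(q))
        for vec, answer in self.semantic:
            sim = float(v @ vec / (np.linalg.norm(v) * np.linalg.norm(vec)))
            if sim >= self.threshold:                        # loose thresholds serve wrong answers confidently
                return answer
        return None

    def put(self, q: str, answer: str):
        self.exact[self._key(q)] = answer
        self.semantic.append((np.asarray(self.embed(q)), answer))
```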


3. Agents

3.1 Agent loop

A control loop wrapping an LLM with tools. Minimum loop:
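
A minimal sketch with the OpenAI tool-calling API (the tool schemas and the execute() dispatcher are placeholders; Anthropic's tool-use API differs in field names, not in shape):

```python
import json
from openai import OpenAI

client = OpenAI()

def run_agent(messages: list[dict], tools: list[dict], execute, max_steps: int = 8) -> str:
    """execute(name, args) runs one of your tools; `tools` holds their JSON schemas."""
    for _ in range(max_steps):                               # hard loop budget
        msg = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools
        ).choices[0].message
        if not msg.tool_calls:                               # no tool requested -> final answer
            return msg.content
        messages.append(msg)                                 # keep the assistant turn that requested the calls
        for tc in msg.tool_calls:
            result = execute(tc.function.name, json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": str(result)})
    raise RuntimeError("agent exceeded its step budget")
```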

This is ReAct. Use the model's native tool-calling API, not prompt-based parsing — eliminates a class of formatting failures.

3.2 The Model Context Protocol

MCP is Anthropic's open standard for connecting models to tools and data without per-integration glue code. JSON-RPC 2.0 over stdio, SSE, or streamable HTTP.

Roles: a host application embeds one MCP client per server connection; servers expose tools, resources, and prompts for the model to use.

Transports: stdio for local subprocess servers; streamable HTTP for remote servers, with SSE as the older remote transport.

Capability negotiation happens at session init. Don't assume capabilities; check.
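
A sketch of session init with the official MCP Python SDK (the `mcp` package); names reflect the SDK at the time of writing, so verify against current docs:

```python
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(command="python", args=["my_server.py"])  # placeholder local server
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            init = await session.initialize()                 # capability negotiation happens here
            print(init.capabilities)                          # check what the server actually offers
            tools = await session.list_tools()
            print([t.name for t in tools.tools])

asyncio.run(main())
```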

Security. MCP gives the model access to tools that may have side effects:

3.3 Workflow patterns

Patterns combine; production systems mix them.

Plan-then-execute. Force the model to write a plan (numbered steps, expected tools per step) before any tool call. Reduces thrashing on complex tasks, produces an auditable artifact. Costs an extra LLM call.

ReAct loop. Open-ended thought–action–observation. Best for exploratory tasks where the path can't be planned upfront (research, debugging, data exploration).

LLM pipeline / assembly line. Sequential, fixed stages, each with a narrow specialized prompt. Most reliable for known business workflows. Easier to evaluate per-stage. Less flexible.

Agentic RAG / multi-hop retrieval. Agent searches, reads, identifies gaps, searches again. Strong for research-style queries that don't decompose into a single retrieval.

Parallel sub-agents. Independent calls executed concurrently for genuinely independent subtasks (summarize each of 10 documents, then aggregate). Latency win; doesn't help if the subtasks have dependencies.
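
Independent subtasks fan out with ordinary async concurrency; summarize() below is a placeholder for one sub-agent call:

```python
import asyncio

async def summarize(doc: str) -> str:
    await asyncio.sleep(0)          # placeholder for one sub-agent / LLM call
    return doc[:80]

async def summarize_all(docs: list[str]) -> str:
    partials = await asyncio.gather(*(summarize(d) for d in docs))   # latency ~= slowest call, not the sum
    return "\n".join(partials)                                       # aggregation can be another LLM call

print(asyncio.run(summarize_all([f"document {i} ..." for i in range(10)])))
```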

Orchestrator + specialist workers. A controller agent delegates to specialists, each with a narrow toolset and dedicated system prompt. Reduces per-agent cognitive load and tool confusion at the cost of additional handoff points where context can be lost. Indicated above the complexity threshold where single-agent tool selection accuracy degrades.

3.4 Structured outputs

When tool results or final answers must conform to a schema, prompting alone is insufficient.
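
One common pattern is schema validation with a bounded repair loop, sketched here with Pydantic and a placeholder call_llm function (prefer your provider's native JSON-schema or constrained-decoding mode when it exists):

```python
from pydantic import BaseModel, ValidationError

class Ticket(BaseModel):
    title: str
    severity: int            # 1-5
    component: str

def extract_ticket(text: str, call_llm, max_attempts: int = 3) -> Ticket:
    prompt = f"Return JSON with fields title, severity (1-5), component.\n\n{text}"
    for _ in range(max_attempts):
        raw = call_llm(prompt)                                # placeholder provider call
        try:
            return Ticket.model_validate_json(raw)
        except ValidationError as err:
            prompt = f"Your previous output failed validation:\n{err}\nReturn corrected JSON only."
    raise RuntimeError("could not obtain schema-valid output")
```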

3.5 Tool design

The tool schema is a prompt. The model picks tools and fills parameters based on names, descriptions, and parameter docs.
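
An example schema written like documentation, in the OpenAI function-tool format (names, descriptions, and enums are illustrative):

```python
search_tickets = {
    "type": "function",
    "function": {
        "name": "search_tickets",
        "description": "Search existing support tickets. Use this before creating a new ticket "
                       "to avoid duplicates. Returns at most `limit` matches, newest first.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text search, e.g. 'login timeout after SSO'."},
                "status": {"type": "string", "enum": ["open", "closed", "any"], "description": "Defaults to 'open'."},
                "limit": {"type": "integer", "minimum": 1, "maximum": 20},
            },
            "required": ["query"],
        },
    },
}
# The model reads exactly these names and descriptions when deciding what to call and how.
```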

3.6 Resilience

Agents fail in ways ordinary services don't.

3.7 Side effects

Any agent that writes — DB, side-effecting API, email, deploys — needs hard guardrails.
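
A sketch of a confirmation gate in front of destructive tools; the tool names, execute() dispatcher, and approval transport (trusted UI, Slack, review queue) are placeholders:

```python
DESTRUCTIVE = {"send_email", "delete_record", "deploy"}

def guarded_execute(name: str, args: dict, execute, request_approval) -> str:
    """execute(name, args) runs the tool; request_approval(name, args) -> bool is the human-in-the-loop hook."""
    if name in DESTRUCTIVE and not request_approval(name, args):   # block until a human approves
        return "REJECTED: action not approved by an operator"
    return execute(name, args)
```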


4. Evaluation

4.1 The evaluation loop

A team without an eval set is shipping vibes. Every prompt change, model upgrade, retrieval tweak, and tool-schema edit is a regression risk. Without a measurement harness, there is no way to tell whether a change is an improvement.

The cycle:

4.2 Error analysis

Treat it as qualitative coding, not bug triage: read real traces, write free-form open codes describing each failure, then group the codes (axial coding) into a taxonomy of recurring failure categories.

The taxonomy drives the next step: each category becomes either an automated check, a reranker tweak, a prompt revision, or a tool-schema fix.

4.3 LLM-as-judge

Human review doesn't scale past a few hundred examples. Automated judges fill the gap.
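
A sketch of a rubric-based judge; the rubric, the XML delimiting, and the call_llm helper are placeholders, and the judge itself has to be calibrated against human labels before its scores mean anything:

```python
import json

JUDGE_PROMPT = """You are grading an assistant's answer.
<question>{question}</question>
<context>{context}</context>
<answer>{answer}</answer>

Score each criterion 1-5 and return JSON only:
{{"faithfulness": int, "relevance": int, "reasoning": "one sentence"}}"""

def judge(question: str, context: str, answer: str, call_llm) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return json.loads(raw)    # validate and clamp in real code; judges violate formats too
```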

4.4 Human evaluation

Human evaluation is the ground-truth signal. Requirements for it to produce usable labels:

4.5 Synthetic data and adversarial testing


5. Production

5.1 Quantization and serving

For self-hosted open models, quantization (FP16 → INT8/INT4) cuts memory ~2–4× and latency ~1.5–3×.

Always evaluate the quantized model on your task. INT4 can degrade arithmetic, code, and long-context reasoning more than benchmarks suggest. Right bit width is the lowest one that holds your eval scores.
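
A loading sketch with Hugging Face transformers + bitsandbytes (model ID is an example; NF4 is the usual 4-bit starting point):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_id = "meta-llama/Llama-3.1-8B-Instruct"          # example open model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant, device_map="auto")
# Now run the eval set against this model, not the FP16 variant you benchmarked earlier.
```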

5.2 Fine-tuning

Default to RAG and prompt engineering. Fine-tune only when:

PEFT methods:
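
LoRA (and its quantized variant QLoRA) is the usual starting point. A configuration sketch with the peft library; the base model, rank, and target modules are illustrative and architecture-dependent:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")   # example base model
config = LoraConfig(
    r=16,                                   # adapter rank: capacity vs. adapter size
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # which projections get adapters; architecture-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()           # typically well under 1% of the base weights
```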

Distillation — train a small model on a strong model's outputs — is the standard approach for high-volume narrow tasks. Inference cost drops materially once quality is validated on a held-out set.

Default position: do not fine-tune. The frontier moves fast enough that fine-tunes become obsolete within months; a stronger base model with better prompts ports forward.

5.3 Guardrails

Threat model is wider than for ordinary services because the LLM itself is an attack surface.

Input side:

Output side:

Operational:

5.4 Observability

Standard metrics aren't enough. You need qualitative trace data alongside quantitative telemetry.

Per-request:

Aggregate:

User feedback:

Implicit signals are noisier but higher-volume. Both feed back into the eval set.

A/B and canary. Serve two variants in parallel to a slice of traffic. Compare on the metrics above with proper statistical tests (two-sample tests with multiple-comparison correction). A/B comparison is the only mechanism that validates a prompt or model change against a real-traffic baseline; without it, changes ship blind.
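
A two-proportion z-test for comparing, say, thumbs-up rates between control and candidate, written out by hand to keep dependencies minimal (with many metrics or repeated looks, apply multiple-comparison correction as noted above):

```python
from math import erf, sqrt

def two_proportion_z(succ_a: int, n_a: int, succ_b: int, n_b: int) -> tuple[float, float]:
    p_a, p_b = succ_a / n_a, succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # two-sided
    return z, p_value

# Control vs. candidate: thumbs-up counts out of sampled requests (made-up numbers).
print(two_proportion_z(412, 1000, 451, 1000))
```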

Streaming UX. Tokens stream, users hit cancel, agents take 30 seconds. Implement clean cancellation: propagate stop signals to the provider, abort in-flight tool calls, clear partial state. Otherwise your system happily burns tokens on responses no one will read.

5.5 Pre-launch checklist

Model and config. Picked against eval set. Temperature/sampling tuned per task. Routing and fallback chains defined. Cache breakpoints placed for stable prefixes. Cost and latency budgets set.

Prompts. System prompt versioned. Stable prefixes positioned for caching. Few-shot (static or dynamic) in place. Critical instructions positioned to survive lost-in-the-middle. Tool schemas reviewed for clarity and minimality. Prompts in CI with eval gates.

RAG. Hybrid retrieval (vector + BM25) with RRF. Metadata pre-filtering enforced for tenant and ACL boundaries. Reranking active. Chunk enrichment / contextual retrieval applied. HNSW parameters benchmarked against exhaustive baseline. Retrieval evaluated in isolation.

Agents. Tool schemas validated. Toolset sized appropriately per agent. Loop / token / cost budgets enforced. Tool errors structured. Tool outputs sanitized. Code execution sandboxed. HITL on destructive actions in trusted UI. Trajectory eval set in place.

Safety. Input and output filtering deployed. PII redaction before logging and third-party calls. Red-team suite run; failures triaged. Bias evaluation run. Prompt-injection test cases in eval set.

Evaluation. Error taxonomy from open/axial coding. LLM judge calibrated against human gold set with held-out test split. Multi-judge ensemble for high-stakes evals. Drift monitoring with alerts. Eval gates in CI for prompt and model changes.

Telemetry. Full traces logged with PII redaction. Token, cost, latency captured per request. Implicit and explicit feedback collected. A/B framework operational. Streaming cancellation handled cleanly.