AI Application Architecture

Goal

Design principles and patterns for building reliable, maintainable AI applications around foundation models. Foundation models are powerful but non-deterministic, expensive, and under continuous evolution — the system surrounding the model largely determines production quality. Good architecture decouples the application from any specific model provider, creates a data flywheel for continuous improvement, and manages context deliberately rather than naively concatenating strings.

Architecture

Layer model:

User
  └─ Frontend / API Gateway
      └─ Orchestration Layer (LangChain / LlamaIndex / DSPy / custom)
          ├─ Model Gateway (routing, auth, rate limiting, semantic cache)
          │   ├─ Primary LLM provider (OpenAI, Anthropic, Google)
          │   └─ Fallback LLM provider
          ├─ Tool Registry (search, code execution, external APIs)
          ├─ Vector Store (retrieval, memory)
          └─ Observability (LangSmith / OpenTelemetry)

Context management: The model’s context window is the most important resource to manage explicitly. Components of a typical context:

  1. System prompt — persona, capabilities, constraints, output format instructions
  2. Conversation history — compressed using summarization or sliding window when approaching context limits
  3. Retrieved context — documents from RAG, ranked by relevance, truncated to fit
  4. Tool results — structured outputs from tool calls, potentially large (e.g., code execution output)
  5. Current user turn — the immediate input

Context overflow is a hard failure; context bloat increases latency and cost roughly linearly with token count. Budget each component explicitly: e.g., system=1k, history=4k, retrieved=6k, tools=2k, user=1k (14k total, leaving 2k of a 16k window for the response).
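A minimal budgeting sketch (component names and token counts follow the example above; the 4-characters-per-token heuristic is a crude stand-in for a real tokenizer such as tiktoken):

```python
# Illustrative per-component context budgets (tokens).
BUDGETS = {"system": 1000, "history": 4000, "retrieved": 6000, "tools": 2000, "user": 1000}

def count_tokens(text: str) -> int:
    # Crude stand-in: ~4 characters per token. Use the provider's
    # tokenizer for real budgets.
    return max(1, len(text) // 4) if text else 0

def truncate_to_budget(text: str, budget: int) -> str:
    # Keeps the head for simplicity; for history you would usually
    # keep the tail (most recent turns) or summarize instead.
    if count_tokens(text) <= budget:
        return text
    return text[: budget * 4]

def build_context(components: dict) -> str:
    # Enforce each component's budget independently, then assemble.
    parts = []
    for name, budget in BUDGETS.items():
        parts.append(truncate_to_budget(components.get(name, ""), budget))
    return "\n\n".join(p for p in parts if p)
```

Because each component is capped independently, the total can never exceed the sum of the budgets, so overflow becomes impossible by construction rather than something to detect after the fact.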

Model gateway pattern: An internal proxy between the orchestration layer and LLM providers abstracts provider APIs, enforces auth/rate limiting, implements semantic caching, and enables routing logic. Key behaviors:

  • Semantic caching: Cache responses for queries semantically similar to previously seen queries (embed query, vector search cache, serve if similarity >threshold). Can reduce costs 30–80% for FAQ-heavy workloads. LiteLLM + Redis provides this out of the box.
  • Fallback routing: Primary → fallback model on error (500, rate limit, timeout). E.g., GPT-4o primary, GPT-4o-mini fallback; or Anthropic primary, OpenAI fallback.
  • Cost-aware routing: Route simple queries to cheap models (GPT-4o-mini, Claude Haiku), complex queries to capable models (GPT-4o, Claude 3.5 Sonnet). Classifier determines query complexity tier.
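The fallback behavior can be sketched as an ordered try-list; the provider callables here are hypothetical stand-ins for real clients (LiteLLM's router implements this same pattern):

```python
class ProviderError(Exception):
    """Stand-in for provider 5xx / rate-limit errors."""

def complete_with_fallback(prompt, providers):
    """Try each (name, callable) provider in order; raise only if all fail."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except (ProviderError, TimeoutError) as exc:
            # Record and fall through to the next provider.
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all providers failed: {errors}")
```

Usage: pass the primary first and the fallback second; the return value includes which provider answered, which is worth logging so you notice when traffic is silently running on the fallback.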

Data flywheel: Production queries → sample and annotate (human labels or LLM-as-judge) → evaluation dataset → measure regressions with each prompt/model change → fine-tune or improve prompt → redeploy → repeat. The flywheel accelerates over time: more production data → better evaluation coverage → faster, more confident iteration.

Components

Component        Purpose                                       Tools
Model gateway    Provider abstraction, caching, routing        LiteLLM, custom proxy
Orchestration    Multi-step reasoning, tool use, RAG           LangChain, LlamaIndex, DSPy
Vector store     Semantic retrieval, memory                    Chroma (dev); Pinecone/Weaviate/pgvector (prod)
Tool registry    External capabilities (search, code, APIs)    LangChain tools, custom functions
Observability    Tracing, evaluation, cost monitoring          LangSmith, OpenTelemetry + Grafana
Human feedback   Preference signal for improvement loop        Thumbs up/down, correction interface
A/B testing      Compare prompts/models on live traffic        Feature flags (LaunchDarkly), custom
Cache            Semantic + exact-match response cache         Redis + LiteLLM, GPTCache

Orchestration framework selection:

  • LangChain: Broad ecosystem, best observability integration (LangSmith), good for RAG and agents. High abstraction can obscure behavior.
  • LlamaIndex: Stronger data/retrieval primitives; best for complex RAG (multi-document, hierarchical indices).
  • DSPy: Systematic prompt optimization via compilation; best when prompts are complex and tunable.
  • Custom/minimal: Prefer for simple chains where framework overhead is unjustified. Often the right call for production-critical paths.

Evaluation

Latency SLOs:

  • TTFT: p50 <100ms, p95 <300ms (with streaming; users tolerate latency better when tokens start flowing)
  • Total latency: p50 <1.5s, p95 <3s for typical chat; p95 <10s for complex agent tasks
  • Tool call overhead: instrument and budget individually

Quality metrics on sampled production traffic:

  • Correctness (LLM-as-judge, 1–5 scale): track weekly trend
  • Groundedness (for RAG): proportion of response claims supported by retrieved context
  • Refusal rate: proportion of valid requests incorrectly refused by safety systems
  • User satisfaction: thumbs up/(thumbs up + thumbs down) ratio
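The LLM-as-judge correctness score can be computed with a small wrapper; `call_judge` below is a hypothetical stand-in for a real model client, and the prompt wording is illustrative:

```python
import re

# Illustrative judge prompt; real rubrics are usually longer and
# include anchored examples for each score level.
JUDGE_PROMPT = """Rate the assistant's answer for factual correctness on a 1-5 scale.
Question: {question}
Answer: {answer}
Reply with only the integer score."""

def parse_score(raw):
    # Parse defensively: judges sometimes add prose around the number.
    m = re.search(r"[1-5]", raw)
    return int(m.group()) if m else None

def judge_correctness(question, answer, call_judge):
    raw = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    return parse_score(raw)
```

Returning None on an unparseable response (rather than a default score) keeps judge failures visible in the weekly trend instead of silently skewing it.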

A/B testing: Route 5–10% of production traffic to new prompt/model variant. Measure: user satisfaction signal (if available), task completion, session length, cost per session. Minimum detectable effect ~2% requires ~5k sessions per variant for adequate power.
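A back-of-envelope check on the ~5k figure, using the standard two-proportion sample-size formula (α=0.05 two-sided, 80% power; the ~0.85 baseline satisfaction rate is an assumption for illustration, not from the text):

```python
def sessions_per_variant(baseline=0.85, mde=0.02, z_alpha=1.96, z_beta=0.84):
    # n per group ≈ (z_a + z_b)^2 * 2p(1-p) / MDE^2
    # (two-proportion z-test, equal group sizes, absolute MDE)
    variance = 2 * baseline * (1 - baseline)
    return (z_alpha + z_beta) ** 2 * variance / mde ** 2
```

With those assumptions the formula lands near 5,000 sessions per variant; a noisier metric or smaller MDE pushes the requirement up quadratically.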

Cost per conversation: Track as p50/p95 to catch runaway agent loops. Alert threshold: p99 > 10× p50 suggests pathological behavior.
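The alert condition can be sketched directly; nearest-rank percentiles here are an approximation (use your metrics backend's percentile implementation in production):

```python
def percentile(values, p):
    # Nearest-rank percentile over a small in-memory sample.
    s = sorted(values)
    idx = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
    return s[idx]

def runaway_alert(costs_per_conversation, ratio=10.0):
    # Fires when the tail is pathologically far from the median,
    # the signature of an agent stuck in a loop.
    p50 = percentile(costs_per_conversation, 50)
    p99 = percentile(costs_per_conversation, 99)
    return p99 > ratio * p50
```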

Failure Modes

Prompt injection: Malicious content in tool outputs, retrieved documents, or user input attempts to override the system prompt or hijack tool calls. Mitigations: (1) separate user/tool content from instructions via clear delimiters and role enforcement; (2) screen tool outputs and retrieved documents with a safety classifier such as Llama Guard before including them in context; (3) sandbox tool execution; (4) monitor for instruction-pattern tokens in retrieved content.
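Mitigation (1) can be sketched as a wrapper that delimits untrusted content and neutralizes delimiter look-alikes the attacker may have planted; the tag name is illustrative, not a standard:

```python
def wrap_untrusted(content, source):
    # Escape any embedded copies of our delimiter so the attacker
    # cannot close the block early and inject instructions after it.
    cleaned = (
        content
        .replace("<untrusted", "&lt;untrusted")
        .replace("</untrusted", "&lt;/untrusted")
    )
    return f'<untrusted source="{source}">\n{cleaned}\n</untrusted>'
```

Delimiters only help if the system prompt also states that nothing inside the block is an instruction; the wrapper is the mechanical half of that contract.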

LLM provider outage: Unmitigated, a primary provider outage causes total product failure. Mitigation: fallback routing in the model gateway, with automatic failover on 5xx/timeout. Test quarterly. LiteLLM supports this natively.

Context window overflow: Application fails or silently truncates context when inputs exceed the model’s context limit, causing incoherent outputs. Mitigation: explicit context budgeting, conversation summarization before window boundary, early warning metric (log when context >80% of limit).

Hallucination in high-stakes decisions: Model confidently asserts false information. Mitigation: RAG with groundedness checking, structured output with validation, human review for high-stakes outputs, uncertainty quantification (sample multiple completions, flag high variance).

Feedback loop amplification: Model fine-tuned on its own outputs can amplify biases and reduce diversity. Mitigation: always include a ground-truth anchor (verified examples from humans), monitor output diversity metrics, never fine-tune exclusively on model-generated data.

Tool call error cascades: One failed tool call causes the agent to hallucinate downstream results rather than retry or escalate. Mitigation: explicit error handling in tool definitions, retry logic with backoff, agent should report uncertainty rather than fabricate results.
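The retry-with-backoff mitigation can be sketched as a wrapper that returns an explicit error payload on exhaustion, so the agent sees "the tool failed" rather than an absence it is tempted to fill in:

```python
import time

def call_tool_with_retry(tool, args, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call tool(**args) with exponential backoff; never raise to the agent."""
    for attempt in range(retries):
        try:
            return {"ok": True, "result": tool(**args)}
        except Exception as exc:
            if attempt == retries - 1:
                # Explicit failure marker: the agent can report
                # uncertainty instead of fabricating a result.
                return {"ok": False, "error": str(exc)}
            sleep(base_delay * 2 ** attempt)
```

Injecting `sleep` keeps the wrapper testable; in production the agent's prompt should instruct it to surface `{"ok": false}` results to the user rather than guess.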

Cost / Latency

Per token, output costs more than input for most providers (~3–5×), but input volume usually dominates the bill: long system prompts and large retrieved contexts are the main cost driver in RAG applications. Cache aggressively:

  • Exact cache: Hash of full prompt string → cached response. Effective for repeated FAQ-style queries.
  • Semantic cache: Embedding similarity search → cached response. 30–80% hit rate for FAQ workloads; near-zero for open-ended generation.
  • Prompt prefix cache: OpenAI applies prefix caching automatically (~50% discount on cached input tokens); Anthropic requires explicit cache breakpoints and discounts cached reads by ~90%. Either way, structure prompts so the static portion comes first and stays as long as possible.
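The semantic tier can be sketched as embed, linear-scan, serve-above-threshold; the bag-of-letters `embed` below is a toy stand-in for a real embedding model, and a production cache would use a vector index instead of a scan:

```python
import math

def embed(text):
    # Toy embedder: normalized letter-frequency vector. Stand-in only;
    # use a real embedding model in practice.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already unit-normalized, so the dot product suffices.
    return sum(x * y for x, y in zip(a, b))

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query):
        qv = embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: fall through to the model

    def put(self, query, response):
        self.entries.append((embed(query), response))
```

The threshold is the key tuning knob: too low and users get stale answers to subtly different questions; too high and the hit rate collapses toward exact-match behavior.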

Streaming: Start rendering output to the user as soon as the first token arrives. Reduces perceived latency dramatically even when total latency is unchanged. Required for any user-facing chatbot or copilot.
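The pattern reduces to flushing each chunk as it arrives instead of buffering the full completion; `stream_tokens` is a hypothetical stand-in for a provider's streaming iterator:

```python
def render_stream(stream_tokens, write):
    """Flush each chunk to the UI as it arrives; return the full text."""
    rendered = []
    for token in stream_tokens:
        write(token)            # user sees output immediately (TTFT, not total latency)
        rendered.append(token)
    return "".join(rendered)    # full completion, for logging / caching
```

Keeping the assembled string at the end matters: observability and caching layers need the complete response even though the user consumed it incrementally.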

Model cost tiers (approximate 2025 pricing, input/output per 1M tokens):

  • Tier 1 (capable): GPT-4o ~$2.50/$10, Claude 3.5 Sonnet ~$3/$15
  • Tier 2 (balanced): GPT-4o-mini ~$0.15/$0.60, Claude 3 Haiku ~$0.25/$1.25
  • Tier 3 (self-hosted): Llama-3-8B-Q4 on a consumer GPU, marginal cost on the order of $0.005 per 1M tokens equivalent

Route to the cheapest model that satisfies quality requirements for each request type.