Foundation Model Overview

Purpose

Foundation models are large neural networks pre-trained on massive, diverse datasets — typically hundreds of billions to trillions of tokens drawn from web text, books, code, and scientific literature. The central engineering insight is the pre-training + fine-tuning paradigm: one expensive pre-training run produces a general-purpose model that can be adapted cheaply to many downstream tasks.

These models exhibit emergent capabilities — abilities that appear at scale but are absent in smaller models: multi-step reasoning, in-context learning (few-shot prompting), instruction following, and rudimentary tool use. This makes them qualitatively different from task-specific models.

From an engineering standpoint, foundation models are infrastructure components: they provide general-purpose language understanding and generation that you layer application logic on top of. The decision of which model to use, how to serve it, and how to constrain its behaviour are first-class engineering decisions.

Architecture

Transformer Core

All dominant foundation models are built on the Transformer (Vaswani et al., 2017). Key components:

  • Multi-head self-attention: each token attends to all other tokens in the context window; O(n²) in sequence length. Captures long-range dependencies.
  • Feed-forward network (FFN): two linear layers with a non-linearity (GELU/SwiGLU). Expanded dimension typically 4× model dimension.
  • Layer normalisation (LayerNorm / RMSNorm): applied pre- or post-attention; stabilises training.
  • Residual connections: wrap both attention and FFN sub-layers; enable gradient flow through depth.
  • Positional encoding: learned absolute (GPT-2), sinusoidal (original Transformer), or RoPE (Rotary Position Embedding, most modern LLMs) / ALiBi for length generalisation.

Model Families by Architecture

FamilyArchitectureExamples
Decoder-only (autoregressive)Causal self-attentionGPT-4, Llama 3, Mistral, Qwen, Gemma
Encoder-only (masked LM)Bidirectional attentionBERT, RoBERTa, DeBERTa
Encoder-decoder (seq2seq)Cross-attention between encoder and decoderT5, BART, Flan-T5

Decoder-only models dominate generative AI workloads. Encoder-only models remain useful for classification, NER, and retrieval (bi-encoder embeddings). Encoder-decoder models suit structured generation (summarisation, translation) where the input and output are clearly distinct.

Multimodal Models

  • CLIP: contrastive vision-language alignment; image and text encoders trained on 400M image-text pairs. Enables zero-shot image classification and cross-modal retrieval.
  • LLaVA / BLIP-2: vision-language models that project visual features into the LLM token space, enabling image-conditioned generation.
  • Whisper: encoder-decoder model for speech recognition; trained on 680K hours of multilingual audio.

State-Space Alternatives

Mamba (Gu & Dao, 2023) is a selective state-space model with O(n) complexity in sequence length, no KV cache, and competitive performance with transformers on language tasks. Hardware-aware selective scan enables fast training and inference on long sequences. An emerging alternative for long-context workloads where quadratic attention becomes a bottleneck.

Implementation Notes

Model Families and Sizes

Common parameter counts and typical use cases:

SizeExamplesUse Case
1B–3BPhi-3-mini, Llama 3.2 3BEdge deployment, low-latency classification
7B–8BLlama 3.1 8B, Mistral 7B, Gemma 7BCapable general-purpose; fits single consumer GPU
13B–14BLlama 2 13B, Qwen2 14BBetter reasoning; 2× GPU
70BLlama 3.1 70BNear-frontier open quality
FrontierGPT-4o, Claude 3.5, Gemini 1.5Best capability; API only

Context Lengths and Engineering Implications

Longer context windows (32K–1M tokens) enable retrieval-free document QA, long conversation memory, and whole-codebase context. Engineering implications:

  • KV cache grows linearly with context length; at 128K tokens with a 70B model, KV cache alone can exceed 80 GB.
  • Attention cost is O(n²) unless using Flash Attention or sliding-window attention.
  • Needle-in-a-haystack degradation: many models perform worse at the middle of long contexts.

VRAM Requirements (approximate, FP16)

ParametersFP16INT8INT4
7B~14 GB~7 GB~4 GB
13B~26 GB~13 GB~7 GB
70B~140 GB~70 GB~35 GB

Distribution and Selection

HuggingFace Hub is the canonical distribution point for open-weight models: model cards, tokenizer configs, quantised variants (GGUF, AWQ, GPTQ). Key selection criteria:

  1. Capability: benchmark scores (MMLU, HumanEval, MT-Bench) relative to task requirements.
  2. License: permissive (Apache 2.0: Mistral, Falcon) vs restricted (Llama community licence) vs proprietary API.
  3. Latency and throughput: smaller models at higher batch sizes often win on tokens/sec/dollar.
  4. Domain fit: code models (DeepSeek Coder, CodeLlama), multilingual models (Qwen, Aya), long-context models (Yarn-Mistral, Claude).

Trade-offs

DimensionOpen-weightClosed / API
Cost at scaleInfra cost onlyPer-token pricing
PrivacyData stays in-houseData leaves to provider
CapabilityUp to 70B open rivals frontierGPT-4-class leads on hard reasoning
Ops burdenFull MLOps stackZero infra

Capability vs cost vs latency: frontier API models maximise quality but cost 10–100× open-weight alternatives at equivalent throughput. Smaller quantised open-weight models running locally minimise latency (no network round-trip) at the cost of capability.

General vs domain-specific: general models trained on diverse data generalise better zero-shot; domain-specific fine-tunes (code, biomedical, legal) outperform on narrow tasks with limited prompting.

References

  • Vaswani et al. (2017). Attention Is All You Need. NeurIPS.
  • Touvron et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. Meta AI.
  • Hoffmann et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). DeepMind.
  • Gu & Dao (2023). Mamba: Linear-Time Sequence Modelling with Selective State Spaces.
  • Radford et al. (2021). Learning Transferable Visual Models from Natural Language Supervision (CLIP). OpenAI.