Foundation Model Overview

Purpose

Foundation models are large neural networks pre-trained on massive, diverse datasets — typically hundreds of billions to trillions of tokens drawn from web text, books, code, and scientific literature. The central engineering insight is the pre-training + fine-tuning paradigm: one expensive pre-training run produces a general-purpose model that can be adapted cheaply to many downstream tasks.

These models exhibit emergent capabilities — abilities that appear at scale but are absent in smaller models: multi-step reasoning, in-context learning (few-shot prompting), instruction following, and rudimentary tool use. This makes them qualitatively different from task-specific models.

From an engineering standpoint, foundation models are infrastructure components: they provide general-purpose language understanding and generation on top of which you layer application logic. Which model to use, how to serve it, and how to constrain its behaviour are first-class engineering decisions.

Architecture

Transformer Core

All dominant foundation models are built on the Transformer (Vaswani et al., 2017). Key components:

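Multi-head self-attention: each token attends to every token in the context window (only to earlier tokens under a causal mask); the score matrix makes the cost O(n²) in sequence length. Captures long-range dependencies.
Feed-forward network (FFN): two linear layers with a non-linearity (GELU), or three projections in the gated SwiGLU variant. The hidden dimension is typically 4× the model dimension (about 8/3× for SwiGLU, which keeps the parameter count comparable).
Layer normalisation (LayerNorm / RMSNorm): applied before each sub-layer (pre-norm, the common choice in modern LLMs) or after it (post-norm, as in the original Transformer); stabilises training.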
Residual connections: wrap both attention and FFN sub-layers; enable gradient flow through depth.
Positional encoding: learned absolute (GPT-2), sinusoidal (original Transformer), or relative schemes such as RoPE (Rotary Position Embedding, used by most modern LLMs) and ALiBi, which are designed to generalise better to longer sequences.
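
The components listed above can be sketched as a single pre-norm block in NumPy. This is a minimal illustration, not any particular model's implementation: it uses one attention head, no causal mask, no positional encoding, a ReLU in place of GELU/SwiGLU, and a normalisation without learned scale or shift. All function and parameter names (`transformer_block`, `init_params`, `w_q`, and so on) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean and unit variance
    # (learned scale and shift omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    # x: (n, d). The (n, n) score matrix is the source of the O(n^2) cost.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def ffn(x, w1, w2):
    # Two linear layers with a ReLU in between (assumption: real models
    # use GELU or SwiGLU here).
    return np.maximum(x @ w1, 0.0) @ w2

def transformer_block(x, params):
    # Pre-norm layout: normalise, apply the sub-layer, add the residual.
    x = x + self_attention(layer_norm(x), params["w_q"], params["w_k"], params["w_v"])
    x = x + ffn(layer_norm(x), params["w1"], params["w2"])
    return x

def init_params(d_model, rng, ffn_mult=4):
    # The FFN hidden size is ffn_mult times the model dimension (4x here).
    d_ff = ffn_mult * d_model
    shapes = {
        "w_q": (d_model, d_model),
        "w_k": (d_model, d_model),
        "w_v": (d_model, d_model),
        "w1": (d_model, d_ff),
        "w2": (d_ff, d_model),
    }
    return {name: rng.normal(scale=0.02, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 tokens, model dimension 8
y = transformer_block(x, init_params(8, rng))
print(y.shape)  # (5, 8)
```

The block maps an (n, d) input to an (n, d) output, which is what lets many identical blocks be stacked in depth.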

Model Families by Architecture

Family	Architecture	Examples
Decoder-only (autoregressive)	Causal self-attention	GPT-4, Llama 3, Mistral, Qwen, Gemma
Encoder-only (masked LM)	Bidirectional attention	BERT, RoBERTa, DeBERTa
Encoder-decoder (seq2seq)	Cross-attention between encoder and decoder	T5, BART, Flan-T5

Decoder-only models dominate generative AI workloads. Encoder-only models remain useful for classification, NER, and retrieval (bi-encoder embeddings). Encoder-decoder models suit structured generation (summarisation, translation) where the input and output are clearly distinct.
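
The practical difference between causal and bidirectional attention comes down to the mask applied to the score matrix. The sketch below is illustrative only; `attention_mask` and `masked_softmax` are hypothetical helper names, not functions from any library.

```python
import numpy as np

def attention_mask(n, causal):
    # True means "query position i (row) may attend to key position j (column)".
    full = np.ones((n, n), dtype=bool)
    # Causal (decoder-only): lower triangle, so each token sees itself and the past.
    # Bidirectional (encoder-only): every token sees every other token.
    return np.tril(full) if causal else full

def masked_softmax(scores, mask):
    # Masked positions get -inf so they receive exactly zero attention weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # uniform scores, so only the mask shapes the result
causal_weights = masked_softmax(scores, attention_mask(4, causal=True))
bidir_weights = masked_softmax(scores, attention_mask(4, causal=False))
```

With uniform scores, the first row of `causal_weights` puts all its weight on position 0, while every row of `bidir_weights` spreads weight evenly over all four positions.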

Multimodal Models

CLIP: contrastive vision-language alignment; image and text encoders trained on 400M image-text pairs. Enables zero-shot image classification and cross-modal retrieval.
LLaVA / BLIP-2: vision-language models that project visual features into the LLM token space, enabling image-conditioned generation.
Whisper: encoder-decoder model for speech recognition; trained on 680K hours of multilingual audio.
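
Zero-shot classification with a CLIP-style model reduces to comparing one image embedding against one text embedding per candidate label. The sketch below assumes the embeddings have already been produced by the two encoders; the toy vectors, the function name `zero_shot_classify`, and the temperature value are illustrative assumptions, not values taken from a real checkpoint.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    # L2-normalise so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # One similarity per candidate label, sharpened by an assumed temperature.
    logits = text_embs @ image_emb / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy 3-d embeddings standing in for encoder outputs (hypothetical values).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = np.eye(3)
image_emb = np.array([1.0, 0.1, 0.0])

probs = zero_shot_classify(image_emb, text_embs)
print(labels[int(np.argmax(probs))])  # a photo of a cat
```

No label-specific training is involved: adding a class only means embedding one more text prompt.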

State-Space Alternatives

Mamba (Gu & Dao, 2023) is a selective state-space model with O(n) complexity in sequence length. Instead of a KV cache that grows with context, it carries a fixed-size recurrent state, and the authors report performance competitive with Transformers of similar size on language tasks. A hardware-aware selective scan enables fast training and inference on long sequences. It is an emerging alternative for long-context workloads where quadratic attention becomes a bottleneck.
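
The linear-time, constant-state property can be seen in a plain diagonal state-space recurrence. The sketch below is deliberately simplified and is not Mamba itself: the parameters `a`, `b`, and `c` are fixed, whereas Mamba's selective mechanism makes them depend on the input, and the hardware-aware scan is replaced by an ordinary Python loop. The function name `linear_ssm_scan` is hypothetical.

```python
import numpy as np

def linear_ssm_scan(x, a, b, c):
    # x: (n,) scalar input sequence; a, b, c: (d_state,) diagonal SSM parameters.
    h = np.zeros_like(a, dtype=float)   # fixed-size state, independent of n
    ys = []
    for x_t in x:                       # one pass over the sequence: O(n)
        h = a * h + b * x_t             # state update
        ys.append(float(c @ h))         # readout
    return np.array(ys)

y = linear_ssm_scan(
    x=np.array([1.0, 0.0, 0.0]),
    a=np.array([0.5]),
    b=np.array([1.0]),
    c=np.array([1.0]),
)
print(y)  # [1.   0.5  0.25]
```

Memory during generation is the size of `h`, regardless of how many tokens have been processed, which is the contrast with a KV cache that grows with every token.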

Implementation Notes

Model Families and Sizes

Common parameter counts and typical use cases:

Size	Examples	Use Case
1B–4B	Phi-3-mini (3.8B), Llama 3.2 3B	Edge deployment, low-latency classification
7B–8B	Llama 3.1 8B, Mistral 7B, Gemma 7B	Capable general-purpose; fits single consumer GPU
13B–14B	Llama 2 13B, Qwen2.5 14B	Better reasoning; two consumer GPUs at FP16, or one with quantisation
70B	Llama 3.1 70B	Near-frontier open quality
Frontier	GPT-4o, Claude 3.5, Gemini 1.5	Best capability; API only

Context Lengths and Engineering Implications

Longer context windows (32K–1M tokens) enable retrieval-free document QA, long conversation memory, and whole-codebase context. Engineering implications:

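KV cache grows linearly with context length and batch size. At 128K tokens, a 70B model with grouped-query attention (e.g. Llama 3.1 70B: 80 layers, 8 KV heads, head dimension 128) needs roughly 40 GB of FP16 KV cache per sequence, so a few concurrent sequences, or a model without GQA, push it well past 80 GB.
Attention compute is O(n²) in sequence length. FlashAttention keeps the computation exact and quadratic but avoids materialising the full score matrix, cutting memory to O(n) and reducing memory traffic; sliding-window attention reduces compute to O(n·w) for window size w.
Lost-in-the-middle degradation: many models retrieve information less reliably from the middle of a long context than from its start or end, as needle-in-a-haystack tests show.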
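
The KV cache figure can be estimated from a model's shape. The sketch below assumes an FP16 cache and uses the published Llama 3.1 70B configuration (80 layers, 8 KV heads, head dimension 128) as an example; the function name `kv_cache_bytes` is hypothetical, and real serving stacks add overhead (paging, fragmentation) that this ignores.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    # The leading factor of 2 covers the separate key and value tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

GIB = 1024 ** 3

# Grouped-query attention: 8 KV heads shared across 64 query heads.
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128 * 1024)
print(gqa / GIB)  # 40.0

# Same shape with full multi-head attention (64 KV heads), for comparison.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=128 * 1024)
print(mha / GIB)  # 320.0
```

The 8× gap between the two results is why grouped-query attention matters for long-context serving.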

VRAM Requirements (approximate, weights only)

Parameters	FP16	INT8	INT4
7B	~14 GB	~7 GB	~4 GB
13B	~26 GB	~13 GB	~7 GB
70B	~140 GB	~70 GB	~35 GB
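
The table follows from parameter count times bytes per parameter. A minimal sketch, assuming weights only (no KV cache, activations, or quantisation metadata); `weight_memory_gb` and `BYTES_PER_PARAM` are hypothetical names.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params_billion, precision):
    # 1e9 parameters times bytes per parameter, divided by 1e9 bytes per GB.
    return n_params_billion * BYTES_PER_PARAM[precision]

for size in (7, 13, 70):
    row = {p: weight_memory_gb(size, p) for p in BYTES_PER_PARAM}
    print(size, row)
```

Budget extra headroom on top of these figures for the KV cache and activations at the batch size and context length you plan to serve.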

Distribution and Selection

The Hugging Face Hub is the canonical distribution point for open-weight models, hosting model cards, tokenizer configs, and quantised variants (GGUF, AWQ, GPTQ). Key selection criteria:

Capability: benchmark scores (MMLU, HumanEval, MT-Bench) relative to task requirements.
License: permissive (Apache 2.0: Mistral, Falcon) vs restricted (Llama community licence) vs proprietary API.
Latency and throughput: smaller models at higher batch sizes often win on tokens/sec/dollar.
Domain fit: code models (DeepSeek Coder, CodeLlama), multilingual models (Qwen, Aya), long-context models (Yarn-Mistral, Claude).

Trade-offs

Dimension	Open-weight	Closed / API
Cost at scale	Infra cost only	Per-token pricing
Privacy	Data stays in-house	Data leaves to provider
Capability	Open models up to 70B approach frontier quality on many tasks	GPT-4-class models lead on hard reasoning
Ops burden	Full MLOps stack	Zero infra

Capability vs cost vs latency: frontier API models maximise quality but can cost 10–100× more per token than self-hosted open-weight alternatives at equivalent throughput, depending on how well the self-hosted hardware is utilised. Smaller quantised open-weight models running locally minimise latency (no network round-trip) at the cost of capability.
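
A rough cost comparison can be sketched as below. Every number in the usage example (API price, GPU hourly rate, sustained throughput) is a made-up placeholder, not a quote from any provider, and the self-hosted estimate assumes the GPU is fully utilised and ignores engineering and operations cost.

```python
def api_cost_usd(tokens, price_per_million_tokens):
    # Per-token pricing: cost scales directly with volume.
    return tokens / 1_000_000 * price_per_million_tokens

def self_hosted_cost_usd(tokens, gpu_hourly_usd, tokens_per_second):
    # Infra pricing: cost is GPU time needed to generate the tokens.
    hours = tokens / tokens_per_second / 3600
    return hours * gpu_hourly_usd

monthly_tokens = 1_000_000_000  # hypothetical workload
api = api_cost_usd(monthly_tokens, price_per_million_tokens=10.0)
hosted = self_hosted_cost_usd(monthly_tokens, gpu_hourly_usd=2.0, tokens_per_second=1000)
print(f"API: ${api:,.0f}  self-hosted: ${hosted:,.0f}  ratio: {api / hosted:.1f}x")
```

The ratio is highly sensitive to `tokens_per_second`: an under-utilised GPU erases most of the self-hosting advantage, so measure sustained throughput under realistic batch sizes before deciding.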

General vs domain-specific: general models trained on diverse data generalise better zero-shot; domain-specific fine-tunes (code, biomedical, legal) outperform on narrow tasks with limited prompting.

References

Vaswani et al. (2017). Attention Is All You Need. NeurIPS.
Touvron et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. Meta AI.
Hoffmann et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). DeepMind.
Gu & Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
Radford et al. (2021). Learning Transferable Visual Models from Natural Language Supervision (CLIP). OpenAI.
