# Glossary

One-line definitions for key terms used across the vault. Entries link to the primary source note where one exists. Alphabetically sorted within sections.


## Vault Meta

| Term | Definition |
| --- | --- |
| concept note | Note type for timeless theoretical content; template sections: Definition → Intuition → Formal Description → Applications → Trade-offs → Links. |
| engineering note | Note type for tooling, systems, and implementation patterns; template sections: Purpose → Architecture → Implementation Notes → Trade-offs → References → Links. |
| evergreen | Highest lifecycle status; note is comprehensive, accurate, and well-linked. |
| growing | Intermediate lifecycle status; note has substantive content but may lack cross-links or completeness. |
| index note | Note type serving as a sublayer or layer table of contents; lists all notes in lifecycle reading order with prev/next navigation. |
| seed | Initial lifecycle status; note is a stub or placeholder needing expansion. |
| sublayer | A numbered subdirectory within a layer (e.g., `01_foundations/03_probability_and_statistics/`). |
| wikilink | Obsidian-style internal link `[[path/to/note\|Display Text]]`; must use full vault-root paths for Quartz compatibility. |

## Mathematics

### Linear Algebra

| Term | Definition |
| --- | --- |
| basis | A linearly independent spanning set for a vector space; every vector has a unique representation in terms of basis vectors. |
| dot product | $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i$; measures alignment between vectors; equals $\lVert\mathbf{a}\rVert \lVert\mathbf{b}\rVert \cos\theta$. |
| eigenvalue / eigenvector | Scalar $\lambda$ and non-zero vector $\mathbf{v}$ satisfying $A\mathbf{v} = \lambda\mathbf{v}$; eigenvectors are invariant directions under $A$. See Eigenvalues. |
| gradient | Vector of partial derivatives $\nabla f = (\partial f/\partial x_1, \ldots, \partial f/\partial x_n)^\top$; points in the direction of steepest ascent. See Differentiation. |
| Hessian | Matrix of second-order partial derivatives $H_{ij} = \partial^2 f / \partial x_i \partial x_j$; characterises local curvature. |
| matrix rank | Number of linearly independent rows (or columns); determines the solution space of $A\mathbf{x} = \mathbf{b}$. |
| norm | Function measuring vector magnitude; $L_1$: $\sum_i \lvert x_i \rvert$; $L_2$: $\sqrt{\sum_i x_i^2}$. |
| orthogonal matrix | Square matrix with $Q^\top Q = Q Q^\top = I$; columns form an orthonormal basis; preserves lengths and angles. |
| positive semidefinite (PSD) | Symmetric matrix with $\mathbf{x}^\top A \mathbf{x} \ge 0$ for all $\mathbf{x}$; all eigenvalues $\ge 0$; arises in covariance matrices and kernels. |
| singular value decomposition (SVD) | Factorisation $A = U \Sigma V^\top$; $U, V$ orthogonal, $\Sigma$ diagonal; foundational for PCA, low-rank approximation, and the pseudoinverse. |
| span | Set of all linear combinations of a collection of vectors. |
| trace | Sum of diagonal entries of a square matrix; equals the sum of eigenvalues; invariant under cyclic permutations. |
| transpose | Matrix with rows and columns swapped; $(AB)^\top = B^\top A^\top$. |
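Several identities in the table above lend themselves to a quick numerical sanity check. A minimal sketch, assuming NumPy; the matrices are arbitrary examples, not from any note:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
S = A @ A.T  # symmetric and PSD by construction

# trace equals the sum of eigenvalues (imaginary parts cancel for real A)
assert np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real)

# PSD: all eigenvalues of S are >= 0 (up to numerical tolerance)
assert np.all(np.linalg.eigvalsh(S) >= -1e-10)

# SVD reconstructs the matrix: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A)
assert np.allclose(A, U @ np.diag(s) @ Vt)

# U is orthogonal: its columns are orthonormal, so U.T @ U = I
assert np.allclose(U.T @ U, np.eye(4))
```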

### Calculus & Analysis

| Term | Definition |
| --- | --- |
| chain rule | $(f \circ g)'(x) = f'(g(x))\, g'(x)$; extended to vectors via Jacobians; backbone of backpropagation. See Chain Rule. |
| Jacobian | Matrix of all first-order partial derivatives of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$; $J_{ij} = \partial f_i / \partial x_j$. |
| Taylor expansion | Approximation of $f$ near $a$ as a polynomial in $(x - a)$; first-order: $f(x) \approx f(a) + f'(a)(x - a)$. |
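The chain rule and the first-order Taylor expansion can both be checked against finite differences. A sketch with made-up example functions (`f`, `g` are arbitrary choices for illustration):

```python
import numpy as np

def f(u):
    return np.sin(u)

def g(x):
    return x ** 2

def fg(x):
    return f(g(x))

x0, h = 0.7, 1e-6

# chain rule: (f∘g)'(x) = f'(g(x)) * g'(x)
analytic = np.cos(g(x0)) * 2 * x0
numeric = (fg(x0 + h) - fg(x0 - h)) / (2 * h)  # central difference
assert abs(analytic - numeric) < 1e-6

# first-order Taylor: f(x) ≈ f(a) + f'(a)(x - a) near a
a, x = 1.0, 1.01
approx = f(a) + np.cos(a) * (x - a)
assert abs(f(x) - approx) < 1e-3
```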

### Probability & Statistics

| Term | Definition |
| --- | --- |
| Bayes’ theorem | $P(\theta \mid D) = P(D \mid \theta)\, P(\theta) / P(D)$; updates prior belief with likelihood to obtain a posterior. See Bayesian Inference. |
| central limit theorem (CLT) | Sample mean of $n$ i.i.d. variables with finite variance converges in distribution to a Gaussian: $\sqrt{n}(\bar{X}_n - \mu) \to \mathcal{N}(0, \sigma^2)$ as $n \to \infty$. See Distributions. |
| conditional independence | $X \perp Y \mid Z$ iff $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$; foundational for graphical models and Naive Bayes. |
| conjugate prior | A prior whose posterior is in the same parametric family; simplifies Bayesian updates analytically. See Bayesian Inference. |
| covariance | $\mathrm{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}X)(Y - \mathbb{E}Y)]$; measures linear co-variation; zero for independent variables. |
| expected value | $\mathbb{E}[X] = \sum_x x\, p(x)$ or $\int x\, f(x)\, dx$; linear operator (linearity holds without independence). |
| exponential family | Class of distributions with density $p(x \mid \theta) = h(x) \exp\!\big(\eta(\theta)^\top T(x) - A(\theta)\big)$; includes Gaussian, Bernoulli, Poisson, Gamma. |
| hypothesis testing | Statistical procedure testing a null hypothesis $H_0$ using a test statistic; controlled by significance level $\alpha$ (Type I error rate). See Hypothesis Testing. |
| MAP estimate | Maximum a posteriori: $\hat{\theta}_{\text{MAP}} = \arg\max_\theta p(\theta \mid D)$; MLE with a prior. |
| maximum likelihood estimation (MLE) | $\hat{\theta}_{\text{MLE}} = \arg\max_\theta p(D \mid \theta)$; finds parameters most consistent with observed data. |
| p-value | Probability of observing a test statistic at least as extreme as the observed value under $H_0$; not the probability $H_0$ is true. |
| posterior | $p(\theta \mid D)$; updated belief about parameters after observing data. |
| prior | $p(\theta)$; belief about parameters before observing data. |
| variance | $\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}X)^2]$; measures spread; $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}X)^2$. |
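Conjugate prior, posterior, and MAP estimate come together in the Beta–Bernoulli model, where the update is a closed-form pseudo-count addition. A sketch assuming NumPy only; the prior parameters and true rate are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 2.0                      # Beta(a, b) prior pseudo-counts
data = rng.binomial(1, 0.7, size=500)  # 0/1 outcomes with true rate 0.7

# Conjugacy: posterior is Beta(a + successes, b + failures)
a_post = a + data.sum()
b_post = b + len(data) - data.sum()
posterior_mean = a_post / (a_post + b_post)

# MAP estimate of a Beta(a, b) density with a, b > 1: (a - 1) / (a + b - 2)
map_estimate = (a_post - 1) / (a_post + b_post - 2)

assert abs(posterior_mean - 0.7) < 0.1
```

With 500 observations the data dominate the weak prior, so both estimates land close to the true rate.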

## Optimisation

| Term | Definition |
| --- | --- |
| Adam | Adaptive gradient optimiser combining momentum and RMSProp; per-parameter learning rates using first and second moment estimates. See Gradient Descent. |
| convex function | Function satisfying $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for $\lambda \in [0, 1]$; any local minimum is global. See Convex Optimisation. |
| duality | Relationship between a primal optimisation problem and its Lagrangian dual; strong duality holds under Slater’s condition. |
| gradient descent | Iterative optimisation: $\theta_{t+1} = \theta_t - \eta \nabla f(\theta_t)$; converges to a local minimum for smooth objectives. See Gradient Descent. |
| KKT conditions | Necessary (and sufficient for convex problems) first-order conditions for constrained optimisation: stationarity, primal/dual feasibility, complementary slackness. See Lagrangian. |
| Lagrangian | $\mathcal{L}(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$; encodes a constrained problem as an unconstrained one via multipliers. |
| learning rate | Step size $\eta$ in gradient descent; too large → divergence; too small → slow convergence; typically scheduled or adapted. |
| SGD (stochastic gradient descent) | Gradient descent using a mini-batch (or single sample) gradient estimate per step; noisier but faster per update and better for large datasets. |
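The gradient-descent update above can be sketched on a convex quadratic, where the unique minimiser is known in closed form. A minimal NumPy sketch; the matrix, vector, and step size are arbitrary example values:

```python
import numpy as np

# f(x) = 0.5 x^T A x - b^T x has gradient A x - b, so the minimiser solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

x = np.zeros(2)
eta = 0.2            # learning rate; too large here would diverge
for _ in range(200):
    x = x - eta * grad(x)  # theta_{t+1} = theta_t - eta * grad f(theta_t)

assert np.allclose(x, np.linalg.solve(A, b), atol=1e-6)
```

Because $f$ is convex, the local minimum reached is the global one, matching the direct solve.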

## Modeling

| Term | Definition |
| --- | --- |
| ARIMA | Auto-regressive integrated moving average; classical time series model for stationary (after differencing) univariate series. See Time Series Models. |
| attention mechanism | Soft, differentiable lookup: computes a weighted sum of values $V$ using similarity scores between queries $Q$ and keys $K$. See Attention. |
| AUC-ROC | Area under the receiver operating characteristic curve; threshold-invariant classification metric equal to the probability that a randomly chosen positive is ranked above a randomly chosen negative. See Evaluation. |
| bias–variance trade-off | Generalisation error = bias² + variance + irreducible noise; simpler models → higher bias; complex models → higher variance. See Bias–Variance. |
| calibration | A model is calibrated if predicted probabilities match empirical frequencies; assessed via reliability diagram and ECE. See Evaluation. |
| cross-validation (k-fold) | Partition data into $k$ folds; train on $k - 1$, evaluate on the remaining 1; repeat $k$ times; provides more evaluation signal than a single hold-out. |
| DBSCAN | Density-based clustering; groups core points within $\varepsilon$-neighbourhoods; handles arbitrary shapes and labels outliers as noise. See Unsupervised Learning. |
| decision tree | Recursive binary partition of feature space; splits chosen to minimise impurity (Gini/entropy for classification, MSE for regression). |
| dropout | Regularisation technique: randomly zero activations during training with probability $p$; reduces co-adaptation of neurons. See Regularisation. |
| early stopping | Halt training when validation loss stops improving; prevents overfitting; a form of implicit regularisation. See Early Stopping. |
| ElasticNet | Regularised regression combining L1 (Lasso) and L2 (Ridge) penalties: $\lambda_1 \lVert\beta\rVert_1 + \lambda_2 \lVert\beta\rVert_2^2$. |
| F1 score | Harmonic mean of precision and recall: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$; useful for imbalanced classes. |
| feature engineering | Creating informative input features from raw data; includes polynomial features, lag features, cyclical encoding, and target encoding. See Feature Engineering. |
| GLM (generalised linear model) | Extends linear regression to non-Gaussian targets via a link function; special cases include logistic regression (logit link) and Poisson regression (log link). See Linear Models & GLMs. |
| GMM (Gaussian mixture model) | Probabilistic clustering model: $p(x) = \sum_k \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$; fitted via EM algorithm. See Probabilistic Models. |
| gradient boosting | Ensemble method that sequentially fits shallow trees to the residuals (pseudo-gradients) of the current ensemble; XGBoost and LightGBM are implementations. |
| hyperparameter | Model configuration set before training (e.g., learning rate, regularisation strength, number of trees); tuned via cross-validation or Bayesian optimisation. |
| k-means | Iterative clustering: assign each point to its nearest centroid, recompute centroids; minimises within-cluster sum of squares. See Unsupervised Learning. |
| kernel trick | Implicitly computes inner products in a high-dimensional feature space via a kernel function $k(x, x') = \langle \phi(x), \phi(x') \rangle$; avoids explicit feature map computation. See Kernel Methods. |
| Lasso (L1 regularisation) | Adds $\lambda \lVert\beta\rVert_1$ to the loss; induces sparsity by driving coefficients exactly to zero; performs feature selection. |
| LightGBM | Gradient boosting library using leaf-wise tree growth and histogram-based splits; faster than XGBoost on large datasets. |
| logistic regression | Linear classifier applying sigmoid to a linear combination of features; outputs calibrated probabilities for binary classification. |
| loss function | Scalar measuring prediction error; cross-entropy for classification, MSE for regression; optimised during training. See Loss Functions. |
| LSTM | Long Short-Term Memory; RNN variant with gated cell state; captures long-range sequential dependencies. See Recurrent Networks. |
| Naive Bayes | Probabilistic classifier assuming conditional independence of features given class: $p(y \mid x) \propto p(y) \prod_j p(x_j \mid y)$. |
| overfitting | Model performs well on training data but poorly on unseen data; caused by excess capacity relative to data; mitigated by regularisation, early stopping, or more data. |
| PCA (principal component analysis) | Linear dimensionality reduction via SVD; projects data onto directions of maximum variance. See Unsupervised Learning. |
| precision / recall | Precision = TP/(TP+FP); recall = TP/(TP+FN); the precision–recall trade-off is controlled by the classification threshold. |
| random forest | Ensemble of decision trees trained on bootstrap samples with random feature subsets; reduces variance via averaging. |
| regularisation | Penalty on model complexity added to loss (L1/L2) or applied implicitly (dropout, early stopping); reduces overfitting. See Regularisation. |
| Ridge (L2 regularisation) | Adds $\lambda \lVert\beta\rVert_2^2$ to the loss; shrinks coefficients towards zero but does not set them exactly to zero. |
| SHAP | SHapley Additive exPlanations; decomposes model predictions into additive feature contributions with game-theoretic guarantees. See SHAP. |
| softmax | $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$; converts logits to a probability distribution over classes. |
| stationarity | A time series is weakly stationary if its mean and autocovariance are time-invariant; required by ARIMA; tested with the ADF test. |
| SVM (support vector machine) | Maximum-margin classifier; finds the hyperplane maximising the margin between classes; extended to non-linear boundaries via kernels. See Kernel Methods. |
| t-SNE | Non-linear dimensionality reduction for 2D/3D visualisation; preserves local neighbourhood structure; not suitable for downstream ML. |
| transformer | Deep learning architecture using multi-head self-attention and positional encoding; dominant for sequence tasks. See Transformer. |
| UMAP | Non-linear dimensionality reduction; faster than t-SNE and better preserves global structure; suitable for large datasets. |
| XGBoost | Gradient boosting library with second-order gradient approximation, regularisation, and parallel tree construction. |
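Two of the formulas above, softmax and the F1 score, are small enough to sketch directly in NumPy (illustrative only; the inputs are made-up numbers):

```python
import numpy as np

def softmax(z):
    z = z - z.max()        # shift for numerical stability; result unchanged
    e = np.exp(z)
    return e / e.sum()     # softmax(z)_i = exp(z_i) / sum_j exp(z_j)

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)  # harmonic mean

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)    # valid probability distribution
assert p[0] > p[1] > p[2]          # order of logits is preserved

# 8 true positives, 2 false positives, 4 false negatives -> F1 = 8/11
assert np.isclose(f1(tp=8, fp=2, fn=4), 8 / 11)
```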

## ML Engineering

| Term | Definition |
| --- | --- |
| A/B test | Controlled experiment comparing a treatment (new model) to a control (current model) on live traffic; used to measure real-world impact. |
| canary deployment | Gradually shift a fraction of traffic to a new model version; roll back if metrics degrade before full rollout. |
| data drift | Change in input feature distribution between training and production; detected via statistical tests (KS, PSI). |
| feature store | Centralised repository for computed features with consistent serving between training and inference; prevents train–serve skew. See Feature Engineering. |
| MLflow | Open-source ML lifecycle platform: experiment tracking, model registry, and model serving. See Model Development. |
| model registry | Versioned store of trained model artefacts with stage transitions (Staging → Production); enables reproducibility and governance. |
| online learning | Model updates continuously as new data arrives; suited for non-stationary environments. |
| PSI (population stability index) | Measures shift between two distributions; PSI < 0.1: stable; 0.1–0.25: minor shift; > 0.25: major drift requiring investigation. |
| training–serving skew | Discrepancy between features computed at training time and at inference time; causes silent model degradation. |
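The PSI thresholds quoted above can be exercised on synthetic data. A sketch assuming NumPy; decile binning on the expected sample is one common convention, not a fixed standard:

```python
import numpy as np

def psi(expected, actual, bins=10):
    # Bin edges from quantiles of the expected (training) sample.
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf           # catch out-of-range values
    e_frac = np.histogram(expected, cuts)[0] / len(expected)
    a_frac = np.histogram(actual, cuts)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)          # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
assert psi(train, rng.normal(0, 1, 10_000)) < 0.1   # same distribution: stable
assert psi(train, rng.normal(1, 1, 10_000)) > 0.25  # shifted mean: major drift
```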

## AI Engineering

| Term | Definition |
| --- | --- |
| chunking | Splitting documents into segments for embedding and retrieval in RAG pipelines; chunk size and overlap affect recall and coherence. |
| context window | Maximum number of tokens a language model can process in a single forward pass; limits document length and conversation history. |
| embedding | Dense vector representation of text, image, or other data in a latent semantic space; similarity is measured by cosine or dot product. |
| fine-tuning | Further training a pre-trained model on a task-specific dataset; updates weights (full fine-tuning) or only adapters (LoRA/QLoRA). See Fine-tuning. |
| hallucination | LLM generating plausible-sounding but factually incorrect or unsupported content; mitigated by RAG, grounding, and output validation. |
| KV cache | Cached key–value attention tensors for previously processed tokens; enables efficient autoregressive decoding without recomputation. |
| LoRA (Low-Rank Adaptation) | Parameter-efficient fine-tuning: learns low-rank updates added to frozen weight matrices; far fewer trainable parameters. See Fine-tuning. |
| prompt engineering | Designing input text to elicit desired LLM behaviour; includes chain-of-thought, few-shot examples, and system prompts. See Prompt Engineering. |
| quantisation | Reducing model weight precision (FP32 → INT8/INT4) to decrease memory footprint and increase inference throughput with minimal accuracy loss. See Inference Optimisation. |
| RAG (retrieval-augmented generation) | Combines a retrieval system (vector search over a knowledge base) with an LLM; grounds responses in retrieved context. See RAG and Agents. |
| reranker | A cross-encoder model that scores query–document pairs for relevance; used after initial retrieval to improve precision before LLM generation. |
| system prompt | Instruction prepended to a conversation that shapes LLM behaviour, persona, and constraints; not visible to end-users in production. |
| temperature | Sampling parameter scaling logits before softmax; higher → more random outputs; temperature = 0 → greedy (deterministic) decoding. |
| token | Basic unit of text processed by an LLM; typically ~0.75 words for English; models have a fixed context window measured in tokens. |
| tool use / function calling | LLM capability to invoke external tools (APIs, code interpreters, databases) by outputting structured JSON; enables agentic behaviour. |
| vLLM | High-throughput LLM inference library using PagedAttention for efficient KV cache management; enables continuous batching. See Inference Optimisation. |
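The temperature entry above can be sketched directly: dividing logits by the temperature before softmax sharpens or flattens the distribution. A minimal NumPy sketch; the logits are made-up values:

```python
import numpy as np

def sample_probs(logits, temperature):
    if temperature == 0:                      # greedy: all mass on the argmax
        p = np.zeros_like(logits, dtype=float)
        p[np.argmax(logits)] = 1.0
        return p
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                           # numerical stability shift
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
cold = sample_probs(logits, 0.5)   # sharper: top logit dominates
hot = sample_probs(logits, 2.0)    # flatter: closer to uniform

assert cold[0] > hot[0]
assert np.isclose(sample_probs(logits, 0).max(), 1.0)  # deterministic at T = 0
```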