# Glossary

One-line definitions for key terms used across the vault. Entries link to the primary source note where one exists. Alphabetically sorted within sections.


## Vault Meta

| Term | Definition |
| --- | --- |
| concept note | Note type for timeless theoretical content; template sections: Definition → Intuition → Formal Description → Applications → Trade-offs → Links. |
| engineering note | Note type for tooling, systems, and implementation patterns; template sections: Purpose → Architecture → Implementation Notes → Trade-offs → References → Links. |
| evergreen | Highest lifecycle status; note is comprehensive, accurate, and well-linked. |
| growing | Intermediate lifecycle status; note has substantive content but may lack cross-links or completeness. |
| index note | Note type serving as a sublayer or layer table of contents; lists all notes in lifecycle reading order with prev/next navigation. |
| seed | Initial lifecycle status; note is a stub or placeholder needing expansion. |
| sublayer | A numbered subdirectory within a layer (e.g., `01_foundations/03_probability_and_statistics/`). |
| wikilink | Obsidian-style internal link `[[path/to/note\|Display Text]]`; must use full vault-root paths for Quartz compatibility. |

## Mathematics

### Linear Algebra

| Term | Definition |
| --- | --- |
| basis | A linearly independent spanning set for a vector space; every vector has a unique representation in terms of basis vectors. |
| dot product | $\mathbf{a} \cdot \mathbf{b} = \sum_i a_i b_i$; measures alignment between vectors; equals $\lVert\mathbf{a}\rVert \lVert\mathbf{b}\rVert \cos\theta$. |
| eigenvalue / eigenvector | Scalar $\lambda$ and non-zero vector $\mathbf{v}$ satisfying $A\mathbf{v} = \lambda\mathbf{v}$; eigenvectors are invariant directions under $A$. See Eigenvalues. |
| gradient | Vector of partial derivatives $\nabla f = (\partial f/\partial x_1, \ldots, \partial f/\partial x_n)^\top$; points in the direction of steepest ascent. See Differentiation. |
| Hessian | Matrix of second-order partial derivatives $H_{ij} = \partial^2 f / \partial x_i \partial x_j$; characterises local curvature. |
| matrix rank | Number of linearly independent rows (or columns); determines the solution space of $A\mathbf{x} = \mathbf{b}$. |
| norm | Function measuring vector magnitude; $L_1$: $\sum_i \lvert x_i \rvert$; $L_2$: $\sqrt{\sum_i x_i^2}$. |
| orthogonal matrix | Square matrix with $Q^\top Q = Q Q^\top = I$; columns form an orthonormal basis; preserves lengths and angles. |
| positive semidefinite (PSD) | Symmetric matrix with $\mathbf{x}^\top A \mathbf{x} \ge 0$ for all $\mathbf{x}$; all eigenvalues $\ge 0$; arises in covariance matrices and kernels. |
| singular value decomposition (SVD) | Factorisation $A = U \Sigma V^\top$; $U, V$ orthogonal, $\Sigma$ diagonal; foundational for PCA, low-rank approximation, and the pseudoinverse. |
| span | Set of all linear combinations of a collection of vectors. |
| trace | Sum of diagonal entries of a square matrix; equals the sum of eigenvalues; invariant under cyclic permutations. |
| transpose | Matrix with rows and columns swapped; $(AB)^\top = B^\top A^\top$. |
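Several identities in the table above lend themselves to a quick numerical sanity check. A minimal sketch, assuming NumPy; the matrices are arbitrary examples, not from any note:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
S = A @ A.T  # symmetric and PSD by construction

# trace equals the sum of eigenvalues (imaginary parts cancel for real A)
assert np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real)

# PSD: all eigenvalues of S are >= 0 (up to numerical tolerance)
assert np.all(np.linalg.eigvalsh(S) >= -1e-10)

# SVD reconstructs the matrix: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A)
assert np.allclose(A, U @ np.diag(s) @ Vt)

# U is orthogonal: its columns are orthonormal, so U.T @ U = I
assert np.allclose(U.T @ U, np.eye(4))
```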

### Calculus & Analysis

| Term | Definition |
| --- | --- |
| chain rule | $(f \circ g)'(x) = f'(g(x))\, g'(x)$; extended to vectors via Jacobians; backbone of backpropagation. See Chain Rule. |
| Jacobian | Matrix of all first-order partial derivatives of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$; $J_{ij} = \partial f_i / \partial x_j$. |
| Taylor expansion | Approximation of $f$ near $a$ as a polynomial in $(x - a)$; first-order: $f(x) \approx f(a) + f'(a)(x - a)$. |
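The chain rule and the first-order Taylor expansion can both be checked against finite differences. A sketch with made-up example functions (`f`, `g` are arbitrary choices for illustration):

```python
import numpy as np

def f(u):
    return np.sin(u)

def g(x):
    return x ** 2

def fg(x):
    return f(g(x))

x0, h = 0.7, 1e-6

# chain rule: (f∘g)'(x) = f'(g(x)) * g'(x)
analytic = np.cos(g(x0)) * 2 * x0
numeric = (fg(x0 + h) - fg(x0 - h)) / (2 * h)  # central difference
assert abs(analytic - numeric) < 1e-6

# first-order Taylor: f(x) ≈ f(a) + f'(a)(x - a) near a
a, x = 1.0, 1.01
approx = f(a) + np.cos(a) * (x - a)
assert abs(f(x) - approx) < 1e-3
```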

### Probability & Statistics

| Term | Definition |
| --- | --- |
| Bayes’ theorem | $P(\theta \mid D) = P(D \mid \theta)\, P(\theta) / P(D)$; updates prior belief with likelihood to obtain a posterior. See Bayesian Inference. |
| central limit theorem (CLT) | Sample mean of $n$ i.i.d. variables with finite variance converges in distribution to a Gaussian: $\sqrt{n}(\bar{X}_n - \mu) \to \mathcal{N}(0, \sigma^2)$ as $n \to \infty$. See Distributions. |
| conditional independence | $X \perp Y \mid Z$ iff $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$; foundational for graphical models and Naive Bayes. |
| conjugate prior | A prior whose posterior is in the same parametric family; simplifies Bayesian updates analytically. See Bayesian Inference. |
| covariance | $\mathrm{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}X)(Y - \mathbb{E}Y)]$; measures linear co-variation; zero for independent variables. |
| expected value | $\mathbb{E}[X] = \sum_x x\, p(x)$ or $\int x\, f(x)\, dx$; linear operator (linearity holds without independence). |
| exponential family | Class of distributions with density $p(x \mid \theta) = h(x) \exp\!\big(\eta(\theta)^\top T(x) - A(\theta)\big)$; includes Gaussian, Bernoulli, Poisson, Gamma. |
| hypothesis testing | Statistical procedure testing a null hypothesis $H_0$ using a test statistic; controlled by significance level $\alpha$ (Type I error rate). See Hypothesis Testing. |
| MAP estimate | Maximum a posteriori: $\hat{\theta}_{\text{MAP}} = \arg\max_\theta p(\theta \mid D)$; MLE with a prior. |
| maximum likelihood estimation (MLE) | $\hat{\theta}_{\text{MLE}} = \arg\max_\theta p(D \mid \theta)$; finds parameters most consistent with observed data. |
| p-value | Probability of observing a test statistic at least as extreme as the observed value under $H_0$; not the probability $H_0$ is true. |
| posterior | $p(\theta \mid D)$; updated belief about parameters after observing data. |
| prior | $p(\theta)$; belief about parameters before observing data. |
| variance | $\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}X)^2]$; measures spread; $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}X)^2$. |
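Conjugate prior, posterior, and MAP estimate come together in the Beta–Bernoulli model, where the update is a closed-form pseudo-count addition. A sketch assuming NumPy only; the prior parameters and true rate are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 2.0                      # Beta(a, b) prior pseudo-counts
data = rng.binomial(1, 0.7, size=500)  # 0/1 outcomes with true rate 0.7

# Conjugacy: posterior is Beta(a + successes, b + failures)
a_post = a + data.sum()
b_post = b + len(data) - data.sum()
posterior_mean = a_post / (a_post + b_post)

# MAP estimate of a Beta(a, b) density with a, b > 1: (a - 1) / (a + b - 2)
map_estimate = (a_post - 1) / (a_post + b_post - 2)

assert abs(posterior_mean - 0.7) < 0.1
```

With 500 observations the data dominate the weak prior, so both estimates land close to the true rate.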

## Optimisation

| Term | Definition |
| --- | --- |
| Adam | Adaptive gradient optimiser combining momentum and RMSProp; per-parameter learning rates using first and second moment estimates. See Gradient Descent. |
| convex function | Function satisfying $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for $\lambda \in [0, 1]$; any local minimum is global. See Convex Optimisation. |
| duality | Relationship between a primal optimisation problem and its Lagrangian dual; strong duality holds under Slater’s condition. |
| gradient descent | Iterative optimisation: $\theta_{t+1} = \theta_t - \eta \nabla f(\theta_t)$; converges to a local minimum for smooth objectives. See Gradient Descent. |
| KKT conditions | Necessary (and sufficient for convex problems) first-order conditions for constrained optimisation: stationarity, primal/dual feasibility, complementary slackness. See Lagrangian. |
| Lagrangian | $\mathcal{L}(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$; encodes a constrained problem as an unconstrained one via multipliers. |
| learning rate | Step size $\eta$ in gradient descent; too large → divergence; too small → slow convergence; typically scheduled or adapted. |
| SGD (stochastic gradient descent) | Gradient descent using a mini-batch (or single sample) gradient estimate per step; noisier but faster per update and better for large datasets. |
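The gradient-descent update above can be sketched on a convex quadratic, where the unique minimiser is known in closed form. A minimal NumPy sketch; the matrix, vector, and step size are arbitrary example values:

```python
import numpy as np

# f(x) = 0.5 x^T A x - b^T x has gradient A x - b, so the minimiser solves A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

x = np.zeros(2)
eta = 0.2            # learning rate; too large here would diverge
for _ in range(200):
    x = x - eta * grad(x)  # theta_{t+1} = theta_t - eta * grad f(theta_t)

assert np.allclose(x, np.linalg.solve(A, b), atol=1e-6)
```

Because $f$ is convex, the local minimum reached is the global one, matching the direct solve.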

## Modeling

| Term | Definition |
| --- | --- |
| ARIMA | Auto-regressive integrated moving average; classical time series model for stationary (after differencing) univariate series. See Time Series Models. |
| attention mechanism | Soft, differentiable lookup: computes a weighted sum of values $V$ using similarity scores between queries $Q$ and keys $K$. See Attention. |
| AUC-ROC | Area under the receiver operating characteristic curve; threshold-invariant classification metric equal to the probability that a randomly chosen positive is ranked above a randomly chosen negative. See Evaluation. |
| bias–variance trade-off | Generalisation error = bias² + variance + irreducible noise; simpler models → higher bias; complex models → higher variance. See Bias–Variance. |
| calibration | A model is calibrated if predicted probabilities match empirical frequencies; assessed via reliability diagram and ECE. See Evaluation. |
| cross-validation (k-fold) | Partition data into $k$ folds; train on $k - 1$, evaluate on the remaining 1; repeat $k$ times; provides more evaluation signal than a single hold-out. |
| DBSCAN | Density-based clustering; groups core points within $\varepsilon$-neighbourhoods; handles arbitrary shapes and labels outliers as noise. See Unsupervised Learning. |
| decision tree | Recursive binary partition of feature space; splits chosen to minimise impurity (Gini/entropy for classification, MSE for regression). |
| dropout | Regularisation technique: randomly zero activations during training with probability $p$; reduces co-adaptation of neurons. See Regularisation. |
| early stopping | Halt training when validation loss stops improving; prevents overfitting; a form of implicit regularisation. See Early Stopping. |
| ElasticNet | Regularised regression combining L1 (Lasso) and L2 (Ridge) penalties: $\lambda_1 \lVert\beta\rVert_1 + \lambda_2 \lVert\beta\rVert_2^2$. |
| F1 score | Harmonic mean of precision and recall: $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$; useful for imbalanced classes. |
| feature engineering | Creating informative input features from raw data; includes polynomial features, lag features, cyclical encoding, and target encoding. See Feature Engineering. |
| GLM (generalised linear model) | Extends linear regression to non-Gaussian targets via a link function; special cases include logistic regression (logit link) and Poisson regression (log link). See Linear Models & GLMs. |
| GMM (Gaussian mixture model) | Probabilistic clustering model: $p(x) = \sum_k \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$; fitted via EM algorithm. See Probabilistic Models. |
| gradient boosting | Ensemble method that sequentially fits shallow trees to the residuals (pseudo-gradients) of the current ensemble; XGBoost and LightGBM are implementations. |
| hyperparameter | Model configuration set before training (e.g., learning rate, regularisation strength, number of trees); tuned via cross-validation or Bayesian optimisation. |
| k-means | Iterative clustering: assign each point to its nearest centroid, recompute centroids; minimises within-cluster sum of squares. See Unsupervised Learning. |
| kernel trick | Implicitly computes inner products in a high-dimensional feature space via a kernel function $k(x, x') = \langle \phi(x), \phi(x') \rangle$; avoids explicit feature map computation. See Kernel Methods. |
| Lasso (L1 regularisation) | Adds $\lambda \lVert\beta\rVert_1$ to the loss; induces sparsity by driving coefficients exactly to zero; performs feature selection. |
| LightGBM | Gradient boosting library using leaf-wise tree growth and histogram-based splits; faster than XGBoost on large datasets. |
| logistic regression | Linear classifier applying sigmoid to a linear combination of features; outputs calibrated probabilities for binary classification. |
| loss function | Scalar measuring prediction error; cross-entropy for classification, MSE for regression; optimised during training. See Loss Functions. |
| LSTM | Long Short-Term Memory; RNN variant with gated cell state; captures long-range sequential dependencies. See Recurrent Networks. |
| Naive Bayes | Probabilistic classifier assuming conditional independence of features given class: $p(y \mid x) \propto p(y) \prod_j p(x_j \mid y)$. |
| overfitting | Model performs well on training data but poorly on unseen data; caused by excess capacity relative to data; mitigated by regularisation, early stopping, or more data. |
| PCA (principal component analysis) | Linear dimensionality reduction via SVD; projects data onto directions of maximum variance. See Unsupervised Learning. |
| precision / recall | Precision = TP/(TP+FP); recall = TP/(TP+FN); the precision–recall trade-off is controlled by the classification threshold. |
| random forest | Ensemble of decision trees trained on bootstrap samples with random feature subsets; reduces variance via averaging. |
| regularisation | Penalty on model complexity added to loss (L1/L2) or applied implicitly (dropout, early stopping); reduces overfitting. See Regularisation. |
| Ridge (L2 regularisation) | Adds $\lambda \lVert\beta\rVert_2^2$ to the loss; shrinks coefficients towards zero but does not set them exactly to zero. |
| SHAP | SHapley Additive exPlanations; decomposes model predictions into additive feature contributions with game-theoretic guarantees. See SHAP. |
| softmax | $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$; converts logits to a probability distribution over classes. |
| stationarity | A time series is weakly stationary if its mean and autocovariance are time-invariant; required by ARIMA; tested with the ADF test. |
| SVM (support vector machine) | Maximum-margin classifier; finds the hyperplane maximising the margin between classes; extended to non-linear boundaries via kernels. See Kernel Methods. |
| t-SNE | Non-linear dimensionality reduction for 2D/3D visualisation; preserves local neighbourhood structure; not suitable for downstream ML. |
| transformer | Deep learning architecture using multi-head self-attention and positional encoding; dominant for sequence tasks. See Transformer. |
| UMAP | Non-linear dimensionality reduction; faster than t-SNE and better preserves global structure; suitable for large datasets. |
| XGBoost | Gradient boosting library with second-order gradient approximation, regularisation, and parallel tree construction. |
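Two of the formulas above, softmax and the F1 score, are small enough to sketch directly in NumPy (illustrative only; the inputs are made-up numbers):

```python
import numpy as np

def softmax(z):
    z = z - z.max()        # shift for numerical stability; result unchanged
    e = np.exp(z)
    return e / e.sum()     # softmax(z)_i = exp(z_i) / sum_j exp(z_j)

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)  # harmonic mean

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)    # valid probability distribution
assert p[0] > p[1] > p[2]          # order of logits is preserved

# 8 true positives, 2 false positives, 4 false negatives -> F1 = 8/11
assert np.isclose(f1(tp=8, fp=2, fn=4), 8 / 11)
```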

## ML Engineering

| Term | Definition |
| --- | --- |
| A/B test | Controlled experiment comparing a treatment (new model) to a control (current model) on live traffic; used to measure real-world impact. |
| canary deployment | Gradually shift a fraction of traffic to a new model version; roll back if metrics degrade before full rollout. |
| data drift | Change in input feature distribution between training and production; detected via statistical tests (KS, PSI). |
| feature store | Centralised repository for computed features with consistent serving between training and inference; prevents train–serve skew. See Feature Engineering. |
| MLflow | Open-source ML lifecycle platform: experiment tracking, model registry, and model serving. See Model Development. |
| model registry | Versioned store of trained model artefacts with stage transitions (Staging → Production); enables reproducibility and governance. |
| online learning | Model updates continuously as new data arrives; suited for non-stationary environments. |
| PSI (population stability index) | Measures shift between two distributions; PSI < 0.1: stable; 0.1–0.25: minor shift; > 0.25: major drift requiring investigation. |
| training–serving skew | Discrepancy between features computed at training time and at inference time; causes silent model degradation. |
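The PSI thresholds quoted above can be exercised on synthetic data. A sketch assuming NumPy; decile binning on the expected sample is one common convention, not a fixed standard:

```python
import numpy as np

def psi(expected, actual, bins=10):
    # Bin edges from quantiles of the expected (training) sample.
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf           # catch out-of-range values
    e_frac = np.histogram(expected, cuts)[0] / len(expected)
    a_frac = np.histogram(actual, cuts)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)          # avoid log(0)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
assert psi(train, rng.normal(0, 1, 10_000)) < 0.1   # same distribution: stable
assert psi(train, rng.normal(1, 1, 10_000)) > 0.25  # shifted mean: major drift
```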

## AI Engineering

| Term | Definition |
| --- | --- |
| chunking | Splitting documents into segments for embedding and retrieval in RAG pipelines; chunk size and overlap affect recall and coherence. |
| context window | Maximum number of tokens a language model can process in a single forward pass; limits document length and conversation history. |
| embedding | Dense vector representation of text, image, or other data in a latent semantic space; similarity is measured by cosine or dot product. |
| fine-tuning | Further training a pre-trained model on a task-specific dataset; updates weights (full fine-tuning) or only adapters (LoRA/QLoRA). See Fine-tuning. |
| hallucination | LLM generating plausible-sounding but factually incorrect or unsupported content; mitigated by RAG, grounding, and output validation. |
| KV cache | Cached key–value attention tensors for previously processed tokens; enables efficient autoregressive decoding without recomputation. |
| LoRA (Low-Rank Adaptation) | Parameter-efficient fine-tuning: learns low-rank updates added to frozen weight matrices; far fewer trainable parameters. See Fine-tuning. |
| prompt engineering | Designing input text to elicit desired LLM behaviour; includes chain-of-thought, few-shot examples, and system prompts. See Prompt Engineering. |
| quantisation | Reducing model weight precision (FP32 → INT8/INT4) to decrease memory footprint and increase inference throughput with minimal accuracy loss. See Inference Optimisation. |
| RAG (retrieval-augmented generation) | Combines a retrieval system (vector search over a knowledge base) with an LLM; grounds responses in retrieved context. See RAG and Agents. |
| reranker | A cross-encoder model that scores query–document pairs for relevance; used after initial retrieval to improve precision before LLM generation. |
| system prompt | Instruction prepended to a conversation that shapes LLM behaviour, persona, and constraints; not visible to end-users in production. |
| temperature | Sampling parameter scaling logits before softmax; higher → more random outputs; temperature = 0 → greedy (deterministic) decoding. |
| token | Basic unit of text processed by an LLM; typically ~0.75 words for English; models have a fixed context window measured in tokens. |
| tool use / function calling | LLM capability to invoke external tools (APIs, code interpreters, databases) by outputting structured JSON; enables agentic behaviour. |
| vLLM | High-throughput LLM inference library using PagedAttention for efficient KV cache management; enables continuous batching. See Inference Optimisation. |
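The temperature entry above can be sketched directly: dividing logits by the temperature before softmax sharpens or flattens the distribution. A minimal NumPy sketch; the logits are made-up values:

```python
import numpy as np

def sample_probs(logits, temperature):
    if temperature == 0:                      # greedy: all mass on the argmax
        p = np.zeros_like(logits, dtype=float)
        p[np.argmax(logits)] = 1.0
        return p
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                           # numerical stability shift
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
cold = sample_probs(logits, 0.5)   # sharper: top logit dominates
hot = sample_probs(logits, 2.0)    # flatter: closer to uniform

assert cold[0] > hot[0]
assert np.isclose(sample_probs(logits, 0).max(), 1.0)  # deterministic at T = 0
```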