# Glossary
One-line definitions for key terms used across the vault. Entries link to the primary source note where one exists. Alphabetically sorted within sections.
## Vault Meta
| Term | Definition |
|---|---|
| concept note | Note type for timeless theoretical content; template sections: Definition → Intuition → Formal Description → Applications → Trade-offs → Links. |
| engineering note | Note type for tooling, systems, and implementation patterns; template sections: Purpose → Architecture → Implementation Notes → Trade-offs → References → Links. |
| evergreen | Highest lifecycle status; note is comprehensive, accurate, and well-linked. |
| growing | Intermediate lifecycle status; note has substantive content but may lack cross-links or completeness. |
| index note | Note type serving as a sublayer or layer table of contents; lists all notes in lifecycle reading order with prev/next navigation. |
| seed | Initial lifecycle status; note is a stub or placeholder needing expansion. |
| sublayer | A numbered subdirectory within a layer (e.g., 01_foundations/03_probability_and_statistics/). |
| wikilink | Obsidian-style internal link [[path/to/note|Display Text]]; must use full vault-root paths for Quartz compatibility. |
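The wikilink format above is regular enough to parse mechanically. A minimal sketch in Python; the regex and function name are illustrative, not part of any vault tooling:

```python
import re

# Matches [[vault/root/path|Display Text]] and bare [[vault/root/path]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

def parse_wikilinks(text):
    """Return (path, display) pairs; display falls back to the path."""
    return [(m.group(1), m.group(2) or m.group(1)) for m in WIKILINK.finditer(text)]

links = parse_wikilinks("See [[01_foundations/eigenvalues|Eigenvalues]] and [[glossary]].")
```

The fallback to the path when no `|Display Text` is given mirrors how Obsidian renders bare wikilinks.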
## Mathematics

### Linear Algebra
| Term | Definition |
|---|---|
| basis | A linearly independent spanning set for a vector space; every vector has a unique representation in terms of basis vectors. |
| dot product | $x \cdot y = \sum_i x_i y_i$; measures alignment between vectors; equals $\lVert x \rVert \lVert y \rVert \cos\theta$. |
| eigenvalue / eigenvector | Scalar $\lambda$ and non-zero vector $v$ satisfying $Av = \lambda v$; eigenvectors are invariant directions under $A$. See Eigenvalues. |
| gradient | Vector of partial derivatives $\nabla f = (\partial f/\partial x_1, \ldots, \partial f/\partial x_n)$; points in the direction of steepest ascent. See Differentiation. |
| Hessian | Matrix of second-order partial derivatives $H_{ij} = \partial^2 f / \partial x_i \partial x_j$; characterises local curvature. |
| matrix rank | Number of linearly independent rows (or columns); determines solution space of $Ax = b$. |
| norm | Function measuring vector magnitude; $L_p$ norm: $\lVert x \rVert_p = \big(\sum_i \vert x_i \vert^p\big)^{1/p}$; $p = 2$ gives the Euclidean norm. |
| orthogonal matrix | Square matrix with $Q^\top Q = QQ^\top = I$; columns form an orthonormal basis; preserves lengths and angles. |
| positive semidefinite (PSD) | Symmetric matrix with $x^\top A x \ge 0$ for all $x$; all eigenvalues $\ge 0$; arises in covariance matrices and kernels. |
| singular value decomposition (SVD) | Factorisation $A = U \Sigma V^\top$; $U$, $V$ orthogonal, $\Sigma$ diagonal; foundational for PCA, low-rank approximation, and pseudoinverse. |
| span | Set of all linear combinations of a collection of vectors. |
| trace | Sum of diagonal entries of a square matrix; equals sum of eigenvalues; invariant under cyclic permutations. |
| transpose | Matrix with rows and columns swapped; $(AB)^\top = B^\top A^\top$. |
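Several of these identities are easy to verify numerically. A small plain-Python sketch for a symmetric 2×2 matrix (the entries are illustrative): trace equals the sum of eigenvalues, the determinant equals their product, and non-negative eigenvalues confirm the matrix is PSD.

```python
import math

# A symmetric 2x2 matrix A = [[a, b], [b, d]] with illustrative entries.
a, b, d = 2.0, 1.0, 2.0
trace = a + d
det = a * d - b * b

# Eigenvalues of a symmetric 2x2 via the quadratic formula.
mean = (a + d) / 2
disc = math.sqrt(((a - d) / 2) ** 2 + b * b)
lam1, lam2 = mean + disc, mean - disc

assert math.isclose(trace, lam1 + lam2)   # trace = sum of eigenvalues
assert math.isclose(det, lam1 * lam2)     # det = product of eigenvalues
assert lam1 >= 0 and lam2 >= 0            # all eigenvalues >= 0: A is PSD
```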
### Calculus & Analysis
| Term | Definition |
|---|---|
| chain rule | $(f \circ g)'(x) = f'(g(x))\, g'(x)$; extended to vectors via Jacobians; backbone of backpropagation. See Chain Rule. |
| Jacobian | Matrix of all first-order partial derivatives of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$; $J_{ij} = \partial f_i / \partial x_j$. |
| Taylor expansion | Approximation of $f$ near $a$ as a polynomial in $(x - a)$; first-order: $f(x) \approx f(a) + f'(a)(x - a)$. |
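The chain rule can be sanity-checked against a central finite difference. A minimal sketch with an arbitrary composite, $f = \sin$ and $g(x) = x^2$ (the choice of functions and the evaluation point are illustrative):

```python
import math

def h(x):
    # Composite h = f(g(x)) with f = sin, g(x) = x**2.
    return math.sin(x ** 2)

x, eps = 1.3, 1e-6
analytic = math.cos(x ** 2) * 2 * x              # chain rule: f'(g(x)) * g'(x)
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)  # central finite difference
assert abs(analytic - numeric) < 1e-6
```

The central difference has $O(\varepsilon^2)$ truncation error, which is why it agrees with the analytic derivative to well below the tolerance.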
### Probability & Statistics
| Term | Definition |
|---|---|
| Bayes’ theorem | $P(\theta \mid D) = P(D \mid \theta)\, P(\theta) / P(D)$; updates prior belief with likelihood to obtain a posterior. See Bayesian Inference. |
| central limit theorem (CLT) | Sample mean of $n$ i.i.d. variables with finite variance converges in distribution to a Gaussian: $\sqrt{n}(\bar{X}_n - \mu) \to \mathcal{N}(0, \sigma^2)$ as $n \to \infty$. See Distributions. |
| conditional independence | $X \perp Y \mid Z$ iff $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$; foundational for graphical models and Naive Bayes. |
| conjugate prior | A prior whose posterior is in the same parametric family; simplifies Bayesian updates analytically. See Bayesian Inference. |
| covariance | $\operatorname{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$; measures linear co-variation; zero for independent variables (the converse does not hold). |
| expected value | $E[X] = \sum_x x\, P(X = x)$ or $\int x\, p(x)\, dx$; linear operator (linearity holds without independence). |
| exponential family | Class of distributions with density $p(x \mid \theta) = h(x) \exp\!\big(\eta(\theta)^\top T(x) - A(\theta)\big)$; includes Gaussian, Bernoulli, Poisson, Gamma. |
| hypothesis testing | Statistical procedure testing a null hypothesis using a test statistic; controlled by significance level (Type I error rate). See Hypothesis Testing. |
| MAP estimate | Maximum a posteriori: $\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid D)$; MLE with a prior. |
| maximum likelihood estimation (MLE) | $\hat{\theta}_{\text{MLE}} = \arg\max_\theta P(D \mid \theta)$; finds parameters most consistent with observed data. |
| p-value | Probability of observing a test statistic at least as extreme as the observed value under $H_0$; not the probability $H_0$ is true. |
| posterior | $P(\theta \mid D)$; updated belief about parameters after observing data. |
| prior | $P(\theta)$; belief about parameters before observing data. |
| variance | $\operatorname{Var}(X) = E[(X - E[X])^2]$; measures spread; $\operatorname{Var}(X) = E[X^2] - E[X]^2$. |
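The prior, posterior, conjugate-prior, MLE, and MAP entries fit together in one worked example: the Beta–Bernoulli conjugate update. A sketch in plain Python; the pseudo-counts and observations are illustrative:

```python
# Beta(alpha, beta) prior on a coin's heads probability; Bernoulli likelihood.
alpha, beta = 2.0, 2.0          # prior pseudo-counts (illustrative)
heads, tails = 7, 3             # observed data (illustrative)

# Conjugacy: the posterior stays in the Beta family.
post_a, post_b = alpha + heads, beta + tails       # Beta(9, 5)

mle = heads / (heads + tails)                      # argmax of the likelihood
map_est = (post_a - 1) / (post_a + post_b - 2)     # mode of the Beta posterior
post_mean = post_a / (post_a + post_b)

assert mle == 0.7
assert abs(map_est - 8 / 12) < 1e-12   # the prior pulls the MAP below the MLE
```

This is why conjugate priors "simplify Bayesian updates analytically": the update is just adding counts, with no integration required.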
### Optimisation
| Term | Definition |
|---|---|
| Adam | Adaptive gradient optimiser combining momentum and RMSProp; per-parameter learning rates using first and second moment estimates. See Gradient Descent. |
| convex function | $f$ satisfying $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for all $\lambda \in [0, 1]$; any local minimum is global. See Convex Optimisation. |
| duality | Relationship between a primal optimisation problem and its Lagrangian dual; strong duality holds under Slater’s condition. |
| gradient descent | Iterative optimisation: $\theta_{t+1} = \theta_t - \eta \nabla f(\theta_t)$; converges to a local minimum for smooth objectives. See Gradient Descent. |
| KKT conditions | Necessary (and sufficient for convex problems) first-order conditions for constrained optimisation; stationarity, primal/dual feasibility, complementary slackness. See Lagrangian. |
| Lagrangian | $\mathcal{L}(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)$; encodes constrained problem as unconstrained via multipliers. |
| learning rate | Step size in gradient descent; too large → divergence; too small → slow convergence; typically scheduled or adapted. |
| SGD (stochastic gradient descent) | Gradient descent using a mini-batch (or single sample) gradient estimate per step; noisier but faster per update and better for large datasets. |
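The gradient-descent update rule in miniature, on a one-dimensional convex quadratic (the learning rate and iteration count are illustrative; because the objective is convex, the local minimum the iterates reach is the global one):

```python
# Minimise f(x) = (x - 3)**2, whose gradient is 2 * (x - 3); minimum at x = 3.
x, lr = 0.0, 0.1
for _ in range(200):
    grad = 2 * (x - 3)
    x -= lr * grad          # theta_{t+1} = theta_t - eta * grad

assert abs(x - 3) < 1e-6
```

Each step scales the error $(x - 3)$ by $(1 - 2\eta) = 0.8$, a concrete instance of the learning-rate trade-off: larger $\eta$ converges faster until it overshoots and diverges.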
## Modeling
| Term | Definition |
|---|---|
| ARIMA | Auto-regressive integrated moving average; classical time series model for stationary (after differencing) univariate series. See Time Series Models. |
| attention mechanism | Soft, differentiable lookup: computes a weighted sum of values $V$ using similarity scores between queries $Q$ and keys $K$. See Attention. |
| AUC-ROC | Area under the receiver operating characteristic curve; threshold-invariant classification metric equal to $P(\text{score}(x^+) > \text{score}(x^-))$ for a random positive–negative pair. See Evaluation. |
| bias–variance trade-off | Generalisation error = bias² + variance + irreducible noise; simpler models → higher bias; complex models → higher variance. See Bias–Variance. |
| calibration | A model is calibrated if predicted probabilities match empirical frequencies; assessed via reliability diagram and ECE. See Evaluation. |
| cross-validation (k-fold) | Partition data into $k$ folds; train on $k - 1$, evaluate on the remaining 1; repeat $k$ times; provides more evaluation signal than a single hold-out. |
| DBSCAN | Density-based clustering; groups core points within $\varepsilon$-neighbourhoods; handles arbitrary shapes and labels outliers as noise. See Unsupervised Learning. |
| decision tree | Recursive binary partition of feature space; splits chosen to minimise impurity (Gini/entropy for classification, MSE for regression). |
| dropout | Regularisation technique: randomly zero activations during training with probability $p$; reduces co-adaptation of neurons. See Regularisation. |
| early stopping | Halt training when validation loss stops improving; prevents overfitting; a form of implicit regularisation. See Early Stopping. |
| ElasticNet | Regularised regression combining L1 (Lasso) and L2 (Ridge) penalties; $\lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2$. |
| F1 score | Harmonic mean of precision and recall: $F_1 = 2PR/(P + R)$; useful for imbalanced classes. |
| feature engineering | Creating informative input features from raw data; includes polynomial features, lag features, cyclical encoding, and target encoding. See Feature Engineering. |
| GLM (generalised linear model) | Extends linear regression to non-Gaussian targets via a link function; special cases include logistic regression (logit link) and Poisson regression (log link). See Linear Models & GLMs. |
| GMM (Gaussian mixture model) | Probabilistic clustering model: $p(x) = \sum_k \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$; fitted via EM algorithm. See Probabilistic Models. |
| gradient boosting | Ensemble method that sequentially fits shallow trees to the residuals (pseudo-gradients) of the current ensemble; XGBoost and LightGBM are implementations. |
| hyperparameter | Model configuration set before training (e.g., learning rate, regularisation strength, number of trees); tuned via cross-validation or Bayesian optimisation. |
| k-means | Iterative clustering: assign each point to nearest centroid, recompute centroids; minimises within-cluster sum of squares. See Unsupervised Learning. |
| kernel trick | Implicitly computes inner products in a high-dimensional feature space via a kernel function $k(x, x') = \langle \phi(x), \phi(x') \rangle$; avoids explicit feature map computation. See Kernel Methods. |
| Lasso (L1 regularisation) | Adds $\lambda \lVert w \rVert_1$ to loss; induces sparsity by driving coefficients exactly to zero; performs feature selection. |
| LightGBM | Gradient boosting library using leaf-wise tree growth and histogram-based splits; faster than XGBoost on large datasets. |
| logistic regression | Linear classifier applying sigmoid to a linear combination of features; outputs calibrated probabilities for binary classification. |
| loss function | Scalar measuring prediction error; cross-entropy for classification, MSE for regression; optimised during training. See Loss Functions. |
| LSTM | Long Short-Term Memory; RNN variant with gated cell state; captures long-range sequential dependencies. See Recurrent Networks. |
| Naive Bayes | Probabilistic classifier assuming conditional independence of features given class: $P(y \mid x) \propto P(y) \prod_i P(x_i \mid y)$. |
| overfitting | Model performs well on training data but poorly on unseen data; caused by excess capacity relative to data; mitigated by regularisation, early stopping, or more data. |
| PCA (principal component analysis) | Linear dimensionality reduction via SVD; projects data onto directions of maximum variance. See Unsupervised Learning. |
| precision / recall | Precision = TP/(TP+FP); recall = TP/(TP+FN); precision-recall trade-off is controlled by classification threshold. |
| random forest | Ensemble of decision trees trained on bootstrap samples with random feature subsets; reduces variance via averaging. |
| regularisation | Penalty on model complexity added to loss (L1/L2) or applied implicitly (dropout, early stopping); reduces overfitting. See Regularisation. |
| Ridge (L2 regularisation) | Adds $\lambda \lVert w \rVert_2^2$ to loss; shrinks coefficients towards zero but does not set them exactly to zero. |
| SHAP | SHapley Additive exPlanations; decomposes model predictions into additive feature contributions with game-theoretic guarantees. See SHAP. |
| softmax | $\operatorname{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$; converts logits to a probability distribution over classes. |
| stationarity | A time series is weakly stationary if its mean and autocovariance are time-invariant; required by ARIMA; tested with the ADF test. |
| SVM (support vector machine) | Maximum-margin classifier; finds the hyperplane maximising the margin between classes; extended to non-linear boundaries via kernels. See Kernel Methods. |
| t-SNE | Non-linear dimensionality reduction for 2D/3D visualisation; preserves local neighbourhood structure; not suitable for downstream ML. |
| transformer | Deep learning architecture using multi-head self-attention and positional encoding; dominant for sequence tasks. See Transformer. |
| UMAP | Non-linear dimensionality reduction; faster than t-SNE and better preserves global structure; suitable for large datasets. |
| XGBoost | Gradient boosting library with second-order gradient approximation, regularisation, and parallel tree construction. |
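Two of the formulas above, softmax and F1, in executable form. The logits and confusion-matrix counts are illustrative:

```python
import math

def softmax(z):
    # Subtracting the max before exponentiating avoids overflow
    # without changing the result (a standard numerical trick).
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-12   # valid probability distribution

# Precision, recall, and F1 from confusion-matrix counts (illustrative).
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)     # 0.8
recall = tp / (tp + fn)        # 2/3
f1 = 2 * precision * recall / (precision + recall)
assert abs(f1 - 8 / 11) < 1e-12
```

Note how the harmonic mean pulls F1 (≈0.727) below the arithmetic mean of precision and recall (≈0.733); the gap widens as the two diverge, which is why F1 penalises imbalanced precision/recall.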
## ML Engineering
| Term | Definition |
|---|---|
| A/B test | Controlled experiment comparing a treatment (new model) to a control (current model) on live traffic; used to measure real-world impact. |
| canary deployment | Gradually shift a fraction of traffic to a new model version; roll back if metrics degrade before full rollout. |
| data drift | Change in input feature distribution between training and production; detected via statistical tests (KS, PSI). |
| feature store | Centralised repository for computed features with consistent serving between training and inference; prevents train–serve skew. See Feature Engineering. |
| MLflow | Open-source ML lifecycle platform: experiment tracking, model registry, and model serving. See Model Development. |
| model registry | Versioned store of trained model artefacts with stage transitions (Staging → Production); enables reproducibility and governance. |
| online learning | Model updates continuously as new data arrives; suited for non-stationary environments. |
| PSI (population stability index) | Measures shift between two distributions; PSI < 0.1: stable; 0.1–0.25: minor shift; > 0.25: major drift requiring investigation. |
| training–serving skew | Discrepancy between features computed at training time and at inference time; causes silent model degradation. |
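The PSI thresholds above are easy to illustrate. A sketch over pre-binned distributions using the standard per-bin (actual − expected) · ln(actual/expected) sum; the bin proportions are made up for illustration:

```python
import math

def psi(expected, actual):
    """PSI between two binned distributions given as lists of proportions."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

baseline = [0.25, 0.25, 0.25, 0.25]   # training-time bin proportions
stable = [0.26, 0.24, 0.25, 0.25]     # small production wobble
shifted = [0.05, 0.15, 0.30, 0.50]    # pronounced production drift

assert psi(baseline, stable) < 0.1    # "stable" by the < 0.1 rule of thumb
assert psi(baseline, shifted) > 0.25  # "major drift" by the > 0.25 rule
```

In practice the proportions would come from histogramming a feature at training time and again in production; zero-count bins need smoothing before the log is taken.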
## AI Engineering
| Term | Definition |
|---|---|
| chunking | Splitting documents into segments for embedding and retrieval in RAG pipelines; chunk size and overlap affect recall and coherence. |
| context window | Maximum number of tokens a language model can process in a single forward pass; limits document length and conversation history. |
| embedding | Dense vector representation of text, image, or other data in a latent semantic space; similarity is measured by cosine or dot product. |
| fine-tuning | Further training a pre-trained model on a task-specific dataset; updates weights (full fine-tuning) or only adapters (LoRA/QLoRA). See Fine-tuning. |
| hallucination | LLM generating plausible-sounding but factually incorrect or unsupported content; mitigated by RAG, grounding, and output validation. |
| KV cache | Cached key-value attention tensors for previously processed tokens; enables efficient autoregressive decoding without recomputation. |
| LoRA (Low-Rank Adaptation) | Parameter-efficient fine-tuning: learns low-rank updates added to frozen weight matrices; far fewer trainable parameters. See Fine-tuning. |
| prompt engineering | Designing input text to elicit desired LLM behaviour; includes chain-of-thought, few-shot examples, and system prompts. See Prompt Engineering. |
| quantisation | Reducing model weight precision (FP32 → INT8/INT4) to decrease memory footprint and increase inference throughput with minimal accuracy loss. See Inference Optimisation. |
| RAG (retrieval-augmented generation) | Combines a retrieval system (vector search over a knowledge base) with an LLM; grounds responses in retrieved context. See RAG and Agents. |
| reranker | A cross-encoder model that scores query–document pairs for relevance; used after initial retrieval to improve precision before LLM generation. |
| system prompt | Instruction prepended to a conversation that shapes LLM behaviour, persona, and constraints; not visible to end-users in production. |
| temperature | Sampling parameter scaling logits before softmax; higher → more random outputs; temperature = 0 → greedy (deterministic) decoding. |
| token | Basic unit of text processed by an LLM; typically ~0.75 words for English; models have a fixed context window measured in tokens. |
| tool use / function calling | LLM capability to invoke external tools (APIs, code interpreters, databases) by outputting structured JSON; enables agentic behaviour. |
| vLLM | High-throughput LLM inference library using PagedAttention for efficient KV cache management; enables continuous batching. See Inference Optimisation. |
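Chunk size and overlap interact as a simple sliding window. A minimal sketch over integer "tokens"; the sizes are toy values (real chunks are typically hundreds of tokens):

```python
def chunk(tokens, size, overlap):
    """Split a token list into fixed-size windows that overlap by `overlap`."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                      # last window already covers the tail
    return chunks

tokens = list(range(10))
chunks = chunk(tokens, size=4, overlap=2)
assert chunks == [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap is what protects recall: a sentence straddling a chunk boundary still appears whole in at least one chunk, at the cost of indexing some tokens twice.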