Jacobian and Hessian
Definition
Jacobian: the matrix of all first-order partial derivatives of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$:

$$J_f(x) \in \mathbb{R}^{m \times n}, \qquad (J_f)_{ij} = \frac{\partial f_i}{\partial x_j}$$
Hessian: the matrix of all second-order partial derivatives of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$:

$$H_f(x) \in \mathbb{R}^{n \times n}, \qquad (H_f)_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
When $f$ has continuous second derivatives (Schwarz’s theorem), $H$ is symmetric: $H_{ij} = H_{ji}$.
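As a concrete check of the definitions above, here is a minimal numpy sketch (the example functions and step sizes are illustrative, not canonical) that builds both matrices by forward finite differences:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Forward-difference Jacobian of f: R^n -> R^m at x (m x n matrix)."""
    x = np.asarray(x, dtype=float)
    f0 = np.atleast_1d(np.asarray(f(x), dtype=float))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (np.atleast_1d(f(x + step)) - f0) / eps
    return J

def hessian_fd(f, x, eps=1e-5):
    """Finite-difference Hessian of scalar f: the Jacobian of the gradient."""
    grad = lambda z: jacobian_fd(f, z, eps).ravel()
    H = jacobian_fd(grad, x, eps)
    return 0.5 * (H + H.T)  # symmetrise, per Schwarz's theorem

# f(x, y) = (x^2 y, 5x + sin y): Jacobian is [[2xy, x^2], [5, cos y]]
f = lambda v: np.array([v[0]**2 * v[1], 5*v[0] + np.sin(v[1])])
J = jacobian_fd(f, np.array([1.0, 2.0]))

# g(x, y) = x^2 y: Hessian is [[2y, 2x], [2x, 0]]
H = hessian_fd(lambda v: v[0]**2 * v[1], np.array([1.0, 2.0]))
```

Nested finite differences lose precision quickly; they are useful for sanity-checking analytic or autodiff derivatives, not as a production method.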
Intuition
Jacobian is the “derivative” of a multi-input, multi-output function — it captures how each output changes as each input changes. A $1 \times n$ Jacobian is the gradient as a row vector; an $m \times n$ Jacobian generalises it to multiple outputs. The Jacobian gives the best linear approximation of $f$ near $x$: $f(x + \delta) \approx f(x) + J_f(x)\,\delta$.
Hessian encodes the curvature of a scalar landscape at a point. Just as the second derivative tells you whether a 1D function is concave or convex, the Hessian’s eigenvalues tell you the same in each direction. Positive definite Hessian → local minimum; indefinite Hessian → saddle point.
Formal Description
Jacobian: vector chain rule. If $h(x) = g(f(x))$, then:

$$J_h(x) = J_g(f(x)) \, J_f(x)$$
This is the multivariable chain rule; it underlies backpropagation’s layer-by-layer gradient computation.
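The product rule above can be verified numerically; in this sketch (example maps are my own), the product of the two layer Jacobians matches the Jacobian of the composition, exactly as backpropagation assumes:

```python
import numpy as np

def jac_fd(f, x, eps=1e-6):
    # Forward-difference Jacobian, built column by column.
    f0 = np.asarray(f(x), dtype=float)
    cols = []
    for j in range(len(x)):
        dx = np.zeros_like(x)
        dx[j] = eps
        cols.append((np.asarray(f(x + dx)) - f0) / eps)
    return np.stack(cols, axis=1)

# f: R^2 -> R^3 and g: R^3 -> R^2, composed as h = g . f
f = lambda v: np.array([v[0]*v[1], v[0]**2, np.sin(v[1])])
g = lambda u: np.array([u[0] + u[2], u[1]*u[2]])
h = lambda v: g(f(v))

x = np.array([0.5, 1.5])
J_chain = jac_fd(g, f(x)) @ jac_fd(f, x)   # J_g(f(x)) J_f(x)
J_direct = jac_fd(h, x)                    # Jacobian of the composition
# the two agree to finite-difference accuracy
```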
Jacobian determinant: for a square map $f: \mathbb{R}^n \to \mathbb{R}^n$, $\det J_f(x)$ measures the local volume scaling factor of the transformation. Used in change-of-variables for integration and in normalising flows.
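The classic instance of this is the polar-to-Cartesian map, whose Jacobian determinant is the familiar factor $r$ in $dx\,dy = r\,dr\,d\theta$:

```python
import numpy as np

# Polar -> Cartesian map T(r, theta) = (r cos theta, r sin theta).
def jacobian_polar(r, theta):
    return np.array([[np.cos(theta), -r * np.sin(theta)],
                     [np.sin(theta),  r * np.cos(theta)]])

r, theta = 2.0, 0.7
det = np.linalg.det(jacobian_polar(r, theta))  # equals r
```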
Second-order Taylor expansion:

$$f(x + \delta) \approx f(x) + \nabla f(x)^\top \delta + \tfrac{1}{2}\, \delta^\top H_f(x)\, \delta$$
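A quick numerical check of the expansion on a hypothetical function, using analytic gradient and Hessian: the residual of the quadratic model shrinks as $O(\lVert\delta\rVert^3)$.

```python
import numpy as np

# f(x0, x1) = exp(x0) + x0 * x1^2, expanded around x
f = lambda v: np.exp(v[0]) + v[0] * v[1]**2
x = np.array([0.2, -1.0])
grad = np.array([np.exp(x[0]) + x[1]**2,      # df/dx0
                 2 * x[0] * x[1]])            # df/dx1
H = np.array([[np.exp(x[0]), 2 * x[1]],
              [2 * x[1],     2 * x[0]]])      # analytic Hessian

delta = np.array([0.01, -0.02])
taylor2 = f(x) + grad @ delta + 0.5 * delta @ H @ delta
err = abs(f(x + delta) - taylor2)  # third-order remainder, tiny here
```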
Stationary points: at a critical point $x^*$ where $\nabla f(x^*) = 0$, the Hessian classifies the point:
| Eigenvalue signature | Type |
|---|---|
| All eigenvalues $> 0$ (PD) | Strict local minimum |
| All eigenvalues $< 0$ (ND) | Strict local maximum |
| Mixed signs (indefinite) | Saddle point |
| Some zero eigenvalues (PSD) | Degenerate — need higher-order terms |
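This classification can be sketched with `numpy.linalg.eigvalsh` on two toy Hessians (the example functions are my own):

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle at the origin;
# g(x, y) = x^2 + 2y^2 has a strict minimum there.
H_f = np.array([[2.0, 0.0], [0.0, -2.0]])   # Hessian of x^2 - y^2
H_g = np.array([[2.0, 0.0], [0.0, 4.0]])    # Hessian of x^2 + 2y^2

def classify(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)   # symmetric matrix -> real eigenvalues
    if np.all(eig > tol):
        return "strict local minimum"
    if np.all(eig < -tol):
        return "strict local maximum"
    if eig.min() < -tol and eig.max() > tol:
        return "saddle point"
    return "degenerate (zero eigenvalue)"
```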
Newton’s method uses the Hessian for faster convergence than gradient descent:

$$x_{k+1} = x_k - H_f(x_k)^{-1} \nabla f(x_k)$$
Converges quadratically near a minimum but costs $O(n^3)$ per step (Hessian inversion); impractical for large $n$ (e.g., neural network parameters). Quasi-Newton methods (L-BFGS) approximate $H^{-1}$ without explicitly computing it.
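A minimal sketch of the Newton update on a hypothetical two-variable function with analytic gradient and Hessian; `np.linalg.solve` is used instead of explicit inversion, the standard practice:

```python
import numpy as np

# f(x, y) = (x - 1)^2 + 10 (y + 2)^4, minimised at (1, -2)
def grad(v):
    return np.array([2 * (v[0] - 1), 40 * (v[1] + 2)**3])

def hess(v):
    return np.array([[2.0, 0.0],
                     [0.0, 120 * (v[1] + 2)**2]])

x = np.array([5.0, 1.0])
for _ in range(40):
    # Newton step: solve H dx = grad rather than forming H^{-1}
    x = x - np.linalg.solve(hess(x), grad(x))
# x converges toward the minimiser (1, -2)
```

The quadratic term is solved exactly in one step; the quartic term shows that away from strict positive definiteness, convergence degrades to linear.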
Hessian in deep learning: the Hessian of the loss w.r.t. parameters $\theta \in \mathbb{R}^n$ has $n^2$ entries — intractable for modern networks with millions of parameters. Second-order information is instead captured implicitly through adaptive optimizers (Adam/RMSProp) or through the Fisher information matrix.
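Even when the full Hessian is intractable, Hessian-vector products $Hv$ are cheap: one extra gradient evaluation suffices. A finite-difference sketch (the quadratic example is illustrative; autodiff frameworks do this exactly):

```python
import numpy as np

# H v ~ (grad f(x + eps v) - grad f(x)) / eps, no n x n matrix formed
def hvp_fd(grad, x, v, eps=1e-6):
    return (grad(x + eps * v) - grad(x)) / eps

# f(x) = 0.5 x^T A x  =>  grad f = A x, Hessian = A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
x = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
hv = hvp_fd(grad, x, v)   # approximately A @ v
```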
Applications
| Application | Role |
|---|---|
| Backpropagation | Jacobian of each layer’s output w.r.t. input is the layer’s local gradient |
| Change of variables | Jacobian determinant scales probability density under transformation |
| Normalising flows | Bijective map with tractable Jacobian determinant |
| Newton / quasi-Newton optimisation | Hessian (or its approximation) gives curvature information |
| Curvature analysis | Hessian eigenvalues diagnose saddle points, condition number |
| Sensitivity analysis | $\partial f / \partial x_j$ (Jacobian column) measures feature importance |
Trade-offs
- Full Hessian computation: $O(n^2)$ storage, $O(n^3)$ inversion — infeasible for large networks.
- Jacobian-vector products (JVP, forward mode) and vector-Jacobian products (VJP, reverse mode) can be computed without materialising the full Jacobian — this is the basis of automatic differentiation.
- Saddle points dominate in high-dimensional loss landscapes; second-order analysis reveals them, but first-order methods (SGD, Adam) escape them through gradient noise.
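The JVP point above can be illustrated without any autodiff framework: $Jv$ is just the directional derivative of $f$ along $v$, approximated here by one finite difference (a VJP, $v^\top J$, requires reverse-mode AD and has no equally cheap finite-difference analogue):

```python
import numpy as np

def jvp_fd(f, x, v, eps=1e-6):
    # Directional derivative (f(x + eps v) - f(x)) / eps ~ J_f(x) v
    return (np.asarray(f(x + eps * v)) - np.asarray(f(x))) / eps

f = lambda x: np.array([x[0] * x[1], np.exp(x[0]), x[1]**3])  # R^2 -> R^3
x = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])
jv = jvp_fd(f, x, v)  # one pair of evaluations, no 3 x 2 Jacobian materialised
```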
Links
- Gradient and Directional Derivative — gradient is the Jacobian for scalar functions
- Chain Rule — scalar chain rule; Jacobian is its vector generalisation
- Convex Optimization — PD Hessian ↔ strictly convex
- Constrained Optimization — second-order conditions for KKT
- Backpropagation — uses Jacobian chain rule throughout
- Spectral Theorem — Hessian is symmetric; eigenvalues determine its definiteness