Jacobian and Hessian
Definition
Jacobian: the matrix of all first-order partial derivatives of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$:

$$J_f(x) \in \mathbb{R}^{m \times n}, \qquad (J_f)_{ij} = \frac{\partial f_i}{\partial x_j}$$
Hessian: the matrix of all second-order partial derivatives of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$:

$$H_f(x) \in \mathbb{R}^{n \times n}, \qquad (H_f)_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
When $f$ has continuous second derivatives (Schwarz’s theorem), $H$ is symmetric: $H_{ij} = H_{ji}$.
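As a concrete check of the definitions above, here is a minimal numpy sketch (the example functions and step sizes are illustrative, not canonical) that builds both matrices by forward finite differences:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Forward-difference Jacobian of f: R^n -> R^m at x (m x n matrix)."""
    x = np.asarray(x, dtype=float)
    f0 = np.atleast_1d(np.asarray(f(x), dtype=float))
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (np.atleast_1d(f(x + step)) - f0) / eps
    return J

def hessian_fd(f, x, eps=1e-5):
    """Finite-difference Hessian of scalar f: the Jacobian of the gradient."""
    grad = lambda z: jacobian_fd(f, z, eps).ravel()
    H = jacobian_fd(grad, x, eps)
    return 0.5 * (H + H.T)  # symmetrise, per Schwarz's theorem

# f(x, y) = (x^2 y, 5x + sin y): Jacobian is [[2xy, x^2], [5, cos y]]
f = lambda v: np.array([v[0]**2 * v[1], 5*v[0] + np.sin(v[1])])
J = jacobian_fd(f, np.array([1.0, 2.0]))

# g(x, y) = x^2 y: Hessian is [[2y, 2x], [2x, 0]]
H = hessian_fd(lambda v: v[0]**2 * v[1], np.array([1.0, 2.0]))
```

Nested finite differences lose precision quickly; they are useful for sanity-checking analytic or autodiff derivatives, not as a production method.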
Intuition
Jacobian is the “derivative” of a multi-input, multi-output function — it captures how each output changes as each input changes. A $1 \times n$ Jacobian is the gradient as a row vector; an $m \times n$ Jacobian generalises it to multiple outputs. The Jacobian gives the best linear approximation of $f$ near $x$: $f(x + \delta) \approx f(x) + J_f(x)\,\delta$.
Hessian encodes the curvature of a scalar landscape at a point. Just as the second derivative tells you whether a 1D function is concave or convex, the Hessian’s eigenvalues tell you the same in each direction. Positive definite Hessian → local minimum; indefinite Hessian → saddle point.
Formal Description
Jacobian: vector chain rule. If $h(x) = g(f(x))$, then:

$$J_h(x) = J_g(f(x)) \, J_f(x)$$
This is the multivariable chain rule; it underlies backpropagation’s layer-by-layer gradient computation.
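The product rule above can be verified numerically; in this sketch (example maps are my own), the product of the two layer Jacobians matches the Jacobian of the composition, exactly as backpropagation assumes:

```python
import numpy as np

def jac_fd(f, x, eps=1e-6):
    # Forward-difference Jacobian, built column by column.
    f0 = np.asarray(f(x), dtype=float)
    cols = []
    for j in range(len(x)):
        dx = np.zeros_like(x)
        dx[j] = eps
        cols.append((np.asarray(f(x + dx)) - f0) / eps)
    return np.stack(cols, axis=1)

# f: R^2 -> R^3 and g: R^3 -> R^2, composed as h = g . f
f = lambda v: np.array([v[0]*v[1], v[0]**2, np.sin(v[1])])
g = lambda u: np.array([u[0] + u[2], u[1]*u[2]])
h = lambda v: g(f(v))

x = np.array([0.5, 1.5])
J_chain = jac_fd(g, f(x)) @ jac_fd(f, x)   # J_g(f(x)) J_f(x)
J_direct = jac_fd(h, x)                    # Jacobian of the composition
# the two agree to finite-difference accuracy
```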
Jacobian determinant: for a square map $f: \mathbb{R}^n \to \mathbb{R}^n$, $\det J_f(x)$ measures the local volume scaling factor of the transformation. Used in change-of-variables for integration and in normalising flows.
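The classic instance of this is the polar-to-Cartesian map, whose Jacobian determinant is the familiar factor $r$ in $dx\,dy = r\,dr\,d\theta$:

```python
import numpy as np

# Polar -> Cartesian map T(r, theta) = (r cos theta, r sin theta).
def jacobian_polar(r, theta):
    return np.array([[np.cos(theta), -r * np.sin(theta)],
                     [np.sin(theta),  r * np.cos(theta)]])

r, theta = 2.0, 0.7
det = np.linalg.det(jacobian_polar(r, theta))  # equals r
```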
Second-order Taylor expansion:

$$f(x + \delta) \approx f(x) + \nabla f(x)^\top \delta + \tfrac{1}{2}\, \delta^\top H_f(x)\, \delta$$
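A quick numerical check of the expansion on a hypothetical function, using analytic gradient and Hessian: the residual of the quadratic model shrinks as $O(\lVert\delta\rVert^3)$.

```python
import numpy as np

# f(x0, x1) = exp(x0) + x0 * x1^2, expanded around x
f = lambda v: np.exp(v[0]) + v[0] * v[1]**2
x = np.array([0.2, -1.0])
grad = np.array([np.exp(x[0]) + x[1]**2,      # df/dx0
                 2 * x[0] * x[1]])            # df/dx1
H = np.array([[np.exp(x[0]), 2 * x[1]],
              [2 * x[1],     2 * x[0]]])      # analytic Hessian

delta = np.array([0.01, -0.02])
taylor2 = f(x) + grad @ delta + 0.5 * delta @ H @ delta
err = abs(f(x + delta) - taylor2)  # third-order remainder, tiny here
```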
Stationary points: at a critical point $x^*$ where $\nabla f(x^*) = 0$, the Hessian classifies the point:
| Eigenvalue signature | Type |
|---|---|
| All eigenvalues $> 0$ (PD) | Strict local minimum |
| All eigenvalues $< 0$ (ND) | Strict local maximum |
| Mixed signs (indefinite) | Saddle point |
| Some zero eigenvalues (PSD) | Degenerate — need higher-order terms |
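This classification can be sketched with `numpy.linalg.eigvalsh` on two toy Hessians (the example functions are my own):

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a saddle at the origin;
# g(x, y) = x^2 + 2y^2 has a strict minimum there.
H_f = np.array([[2.0, 0.0], [0.0, -2.0]])   # Hessian of x^2 - y^2
H_g = np.array([[2.0, 0.0], [0.0, 4.0]])    # Hessian of x^2 + 2y^2

def classify(H, tol=1e-10):
    eig = np.linalg.eigvalsh(H)   # symmetric matrix -> real eigenvalues
    if np.all(eig > tol):
        return "strict local minimum"
    if np.all(eig < -tol):
        return "strict local maximum"
    if eig.min() < -tol and eig.max() > tol:
        return "saddle point"
    return "degenerate (zero eigenvalue)"
```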
Newton’s method uses the Hessian for faster convergence than gradient descent:

$$x_{k+1} = x_k - H_f(x_k)^{-1} \nabla f(x_k)$$
Converges quadratically near a minimum but costs $O(n^3)$ per step (Hessian inversion); impractical for large $n$ (e.g., neural network parameters). Quasi-Newton methods (L-BFGS) approximate $H^{-1}$ without explicitly computing it.
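A minimal sketch of the Newton update on a hypothetical two-variable function with analytic gradient and Hessian; `np.linalg.solve` is used instead of explicit inversion, the standard practice:

```python
import numpy as np

# f(x, y) = (x - 1)^2 + 10 (y + 2)^4, minimised at (1, -2)
def grad(v):
    return np.array([2 * (v[0] - 1), 40 * (v[1] + 2)**3])

def hess(v):
    return np.array([[2.0, 0.0],
                     [0.0, 120 * (v[1] + 2)**2]])

x = np.array([5.0, 1.0])
for _ in range(40):
    # Newton step: solve H dx = grad rather than forming H^{-1}
    x = x - np.linalg.solve(hess(x), grad(x))
# x converges toward the minimiser (1, -2)
```

The quadratic term is solved exactly in one step; the quartic term shows that away from strict positive definiteness, convergence degrades to linear.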
Hessian in deep learning: the Hessian of the loss w.r.t. parameters $\theta \in \mathbb{R}^n$ has $n^2$ entries — intractable for modern networks with millions of parameters. Second-order information is instead captured implicitly through adaptive optimizers (Adam/RMSProp) or through the Fisher information matrix.
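Even when the full Hessian is intractable, Hessian-vector products $Hv$ are cheap: one extra gradient evaluation suffices. A finite-difference sketch (the quadratic example is illustrative; autodiff frameworks do this exactly):

```python
import numpy as np

# H v ~ (grad f(x + eps v) - grad f(x)) / eps, no n x n matrix formed
def hvp_fd(grad, x, v, eps=1e-6):
    return (grad(x + eps * v) - grad(x)) / eps

# f(x) = 0.5 x^T A x  =>  grad f = A x, Hessian = A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
x = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
hv = hvp_fd(grad, x, v)   # approximately A @ v
```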
Applications
| Application | Role |
|---|---|
| Backpropagation | Jacobian of each layer’s output w.r.t. input is the layer’s local gradient |
| Change of variables | Jacobian determinant scales probability density under transformation |
| Normalising flows | Bijective map with tractable Jacobian determinant |
| Newton / quasi-Newton optimisation | Hessian (or its approximation) gives curvature information |
| Curvature analysis | Hessian eigenvalues diagnose saddle points, condition number |
| Sensitivity analysis | $\partial f / \partial x_j$ (Jacobian column) measures feature importance |
Trade-offs
- Full Hessian computation: $O(n^2)$ storage, $O(n^3)$ inversion — infeasible for large networks.
- Jacobian-vector products (JVP, forward mode) and vector-Jacobian products (VJP, reverse mode) can be computed without materialising the full Jacobian — this is the basis of automatic differentiation.
- Saddle points dominate in high-dimensional loss landscapes; second-order analysis reveals them, but first-order methods (SGD, Adam) escape them through gradient noise.
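The JVP point above can be illustrated without any autodiff framework: $Jv$ is just the directional derivative of $f$ along $v$, approximated here by one finite difference (a VJP, $v^\top J$, requires reverse-mode AD and has no equally cheap finite-difference analogue):

```python
import numpy as np

def jvp_fd(f, x, v, eps=1e-6):
    # Directional derivative (f(x + eps v) - f(x)) / eps ~ J_f(x) v
    return (np.asarray(f(x + eps * v)) - np.asarray(f(x))) / eps

f = lambda x: np.array([x[0] * x[1], np.exp(x[0]), x[1]**3])  # R^2 -> R^3
x = np.array([1.0, 2.0])
v = np.array([0.5, -1.0])
jv = jvp_fd(f, x, v)  # one pair of evaluations, no 3 x 2 Jacobian materialised
```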
Links
- Gradient and Directional Derivative — gradient is the Jacobian for scalar functions
- Chain Rule — scalar chain rule; Jacobian is its vector generalisation
- Convex Optimization — PD Hessian ↔ strictly convex
- Constrained Optimization — second-order conditions for KKT
- Backpropagation — uses Jacobian chain rule throughout
- Spectral Theorem — Hessian is symmetric; eigenvalues determine its definiteness