Jacobian and Hessian

Definition

Jacobian: the $m \times n$ matrix of all first-order partial derivatives of a vector-valued function $f: \mathbb{R}^n \to \mathbb{R}^m$:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

Hessian: the $n \times n$ matrix of all second-order partial derivatives of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$

When $f$ has continuous second derivatives (Schwarz’s theorem), $H$ is symmetric: $H_{ij} = H_{ji}$.
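As a concrete check of both definitions, here is a minimal numpy sketch (the function names `jacobian_fd` and `hessian_fd` and the example functions are illustrative, not from any library) that approximates each matrix entry with central differences:

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Central-difference approximation of the m x n Jacobian J_ij = df_i/dx_j."""
    x = np.asarray(x, dtype=float)
    m = np.atleast_1d(f(x)).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (np.atleast_1d(f(x + e)) - np.atleast_1d(f(x - e))) / (2 * eps)
    return J

def hessian_fd(f, x, eps=1e-4):
    """Central-difference approximation of the n x n Hessian H_ij = d^2f/dx_i dx_j."""
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# f(x, y) = (x^2 y, 5x + sin y): each Jacobian row is one output's gradient
f = lambda v: np.array([v[0]**2 * v[1], 5 * v[0] + np.sin(v[1])])
# g(x, y) = x^2 + 3xy: analytic Hessian is [[2, 3], [3, 0]] (note the symmetry)
g = lambda v: v[0]**2 + 3 * v[0] * v[1]
```

Note that the computed Hessian of `g` comes out symmetric, as Schwarz’s theorem predicts.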

Intuition

Jacobian is the “derivative” of a multi-input, multi-output function — it captures how each output changes as each input changes. A $1 \times n$ Jacobian is the gradient written as a row vector; an $m \times n$ Jacobian generalises it to $m$ outputs. The Jacobian gives the best linear approximation of $f$ near $x$: $f(x + \Delta x) \approx f(x) + J(x)\,\Delta x$.

Hessian encodes the curvature of a scalar landscape at a point. Just as the second derivative tells you whether a 1D function is concave or convex, the Hessian’s eigenvalues tell you the same in each direction. Positive definite Hessian → local minimum; indefinite Hessian → saddle point.

Formal Description

Jacobian: vector chain rule. If $h = g \circ f$, then:

$$J_h(x) = J_g(f(x)) \, J_f(x)$$

This is the multivariable chain rule; it underlies backpropagation’s layer-by-layer gradient computation.
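The chain rule above can be verified numerically. This sketch (the `jac` helper is an illustrative finite-difference routine, not a library call) checks that the Jacobian of a composition equals the product of the layer Jacobians:

```python
import numpy as np

def jac(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x."""
    x = np.asarray(x, dtype=float)
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (np.atleast_1d(f(x + e)) - np.atleast_1d(f(x - e))) / (2 * eps)
    return J

f = lambda v: np.array([v[0] * v[1], v[0] + v[1]])            # R^2 -> R^2
g = lambda u: np.array([np.sin(u[0]), u[0] * u[1], u[1]**2])  # R^2 -> R^3
h = lambda v: g(f(v))                                         # composition

x = np.array([0.5, -1.2])
lhs = jac(h, x)                    # Jacobian of the composition
rhs = jac(g, f(x)) @ jac(f, x)     # product of per-"layer" Jacobians
```

This matrix product is exactly what backpropagation evaluates, layer by layer, from the output backwards.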

Jacobian determinant: for a square map $f: \mathbb{R}^n \to \mathbb{R}^n$, $\det J_f(x)$ measures the local volume scaling factor of the transformation. Used in change-of-variables for integration and in normalising flows.
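The classic example of the volume-scaling interpretation is the polar-to-Cartesian map, whose Jacobian determinant is exactly $r$ — the familiar $\mathrm{d}A = r \, \mathrm{d}r \, \mathrm{d}\theta$ factor. A quick finite-difference check (the `jac` helper is illustrative):

```python
import numpy as np

def jac(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x."""
    x = np.asarray(x, dtype=float)
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (np.atleast_1d(f(x + e)) - np.atleast_1d(f(x - e))) / (2 * eps)
    return J

# Polar -> Cartesian: T(r, theta) = (r cos(theta), r sin(theta))
T = lambda p: np.array([p[0] * np.cos(p[1]), p[0] * np.sin(p[1])])

p = np.array([2.0, 0.7])
det = np.linalg.det(jac(T, p))  # local area scaling factor; equals r = 2 here
```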

Second-order Taylor expansion:

$$f(x + \Delta x) \approx f(x) + \nabla f(x)^\top \Delta x + \tfrac{1}{2} \, \Delta x^\top H(x) \, \Delta x$$
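The quadratic model’s error is third order in the step size, so halving the step should cut the error by roughly $8\times$. A small sketch with an illustrative polynomial whose gradient and Hessian are written out analytically:

```python
import numpy as np

# f(x, y) = x^2 y + y^3, with analytic gradient and Hessian
f = lambda v: v[0]**2 * v[1] + v[1]**3
grad = lambda v: np.array([2 * v[0] * v[1], v[0]**2 + 3 * v[1]**2])
hess = lambda v: np.array([[2 * v[1], 2 * v[0]],
                           [2 * v[0], 6 * v[1]]])

x = np.array([1.0, 2.0])
d = np.array([0.3, -0.4])  # fixed step direction

def taylor_err(scale):
    """Gap between f and its second-order Taylor model at step scale * d."""
    s = scale * d
    model = f(x) + grad(x) @ s + 0.5 * s @ hess(x) @ s
    return abs(f(x + s) - model)

# error scales as O(||step||^3): halving the step divides it by ~8
```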
Stationary points: at a critical point $x^*$ (where $\nabla f(x^*) = 0$), the Hessian classifies the point:

| Eigenvalue signature | Type |
| --- | --- |
| All eigenvalues $> 0$ (PD) | Strict local minimum |
| All eigenvalues $< 0$ (ND) | Strict local maximum |
| Mixed signs (indefinite) | Saddle point |
| Some zero eigenvalues (PSD) | Degenerate — need higher-order terms |
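The table above translates directly into code. A minimal sketch (the `classify` helper is illustrative) that reads off the eigenvalue signs of a symmetric Hessian:

```python
import numpy as np

def classify(H, tol=1e-8):
    """Classify a critical point from the Hessian's eigenvalue signs."""
    w = np.linalg.eigvalsh(H)  # symmetric matrix -> real eigenvalues
    if np.any(np.abs(w) < tol):
        return "degenerate"        # zero eigenvalue: need higher-order terms
    if np.all(w > 0):
        return "local minimum"     # positive definite
    if np.all(w < 0):
        return "local maximum"     # negative definite
    return "saddle point"          # indefinite

# f(x, y) = x^2 - y^2 has a saddle at the origin: H = diag(2, -2)
print(classify(np.diag([2.0, -2.0])))  # saddle point
```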

Newton’s method uses the Hessian for faster convergence than gradient descent:

$$x_{k+1} = x_k - H(x_k)^{-1} \nabla f(x_k)$$

Converges quadratically near a minimum but costs $O(n^3)$ per step (Hessian inversion); impractical for large $n$ (e.g., neural network parameters). Quasi-Newton methods (L-BFGS) approximate $H^{-1}$ without explicitly computing it.
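The Newton update above is a short loop in practice. A minimal sketch on an illustrative convex function with analytic gradient and Hessian (solving the linear system rather than forming the inverse):

```python
import numpy as np

def newton(grad, hess, x0, steps=8):
    """Pure Newton iteration: x_{k+1} = x_k - H(x_k)^{-1} grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        # solve H dx = grad instead of inverting H explicitly
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# f(x, y) = cosh(x) + y^2, minimised at (0, 0)
grad = lambda v: np.array([np.sinh(v[0]), 2.0 * v[1]])
hess = lambda v: np.diag([np.cosh(v[0]), 2.0])

x_star = newton(grad, hess, [1.0, 3.0])  # converges to (0, 0) in a few steps
```

The quadratic convergence is visible in the $x$-coordinate: the error drops from $\sim 10^{-1}$ to $\sim 10^{-3}$ to $\sim 10^{-8}$ over successive iterations.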

Hessian in deep learning: the Hessian of the loss w.r.t. $n$ parameters has $n^2$ entries — intractable for modern networks with millions of parameters. Second-order information is instead captured implicitly through adaptive optimizers (Adam/RMSProp) or through the Fisher information matrix.
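Even when the full Hessian is intractable, Hessian-vector products $H v$ cost only about two gradient evaluations. A minimal sketch of the standard finite-difference-of-gradients trick (function names illustrative):

```python
import numpy as np

def hvp(grad, x, v, eps=1e-5):
    """Hessian-vector product H(x) v without forming H:
    central difference of the gradient along direction v."""
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

# f(x, y) = x^2 y + y^2: grad = [2xy, x^2 + 2y], H = [[2y, 2x], [2x, 2]]
grad = lambda u: np.array([2 * u[0] * u[1], u[0]**2 + 2 * u[1]])

x = np.array([1.0, 2.0])
v = np.array([1.0, -1.0])
# exact H at x is [[4, 2], [2, 2]], so H v = [2, 0]
```

In autodiff frameworks the same product is obtained exactly by differentiating `grad(x) @ v` once more, again without materialising $H$.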

Applications

| Application | Role |
| --- | --- |
| Backpropagation | Jacobian of each layer’s output w.r.t. its input is the layer’s local gradient |
| Change of variables | Jacobian determinant scales probability density under transformation |
| Normalising flows | Bijective map with tractable Jacobian determinant |
| Newton / quasi-Newton optimisation | Hessian (or its approximation) gives curvature information |
| Curvature analysis | Hessian eigenvalues diagnose saddle points, condition number |
| Sensitivity analysis | $\partial f / \partial x_i$ (Jacobian column) measures feature importance |

Trade-offs

  • Full Hessian computation: $O(n^2)$ storage, $O(n^3)$ inversion — infeasible for large networks.
  • Jacobian-vector products (JVP, forward mode) and vector-Jacobian products (VJP, reverse mode) can be computed without materialising the full Jacobian — this is the basis of automatic differentiation.
  • Saddle points dominate in high-dimensional loss landscapes; second-order analysis reveals them, but first-order methods (SGD, Adam) escape them through gradient noise.
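The JVP point above can be made concrete without any autodiff framework: a Jacobian-vector product is just a directional derivative, so it needs only two function evaluations regardless of the Jacobian’s size. A sketch (the `jvp_fd` helper is illustrative; the explicit `J` is built only to check the matrix-free result):

```python
import numpy as np

def jvp_fd(f, x, v, eps=1e-6):
    """Matrix-free Jacobian-vector product J(x) v via a directional
    derivative: two function evaluations, no Jacobian materialised."""
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

f = lambda u: np.array([u[0] * u[1], np.sin(u[0]), u[1]**2])  # R^2 -> R^3

x = np.array([0.5, 2.0])
v = np.array([1.0, -1.0])

# explicit Jacobian at x, only for verification
J = np.array([[x[1], x[0]],
              [np.cos(x[0]), 0.0],
              [0.0, 2 * x[1]]])
# jvp_fd(f, x, v) matches J @ v; a VJP u^T J would come from reverse mode,
# which propagates a cotangent vector backwards instead
```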