Gradient and Directional Derivative

Definition

The gradient of a scalar field at a point is the vector of partial derivatives:

It points in the direction of steepest ascent of at , with magnitude equal to the rate of steepest ascent.

Intuition

The gradient generalises the derivative to multiple dimensions. For a 2D landscape = altitude, is an arrow on the horizontal plane pointing uphill as steeply as possible. Its magnitude tells you how steep that slope is. Moving in the direction descends as steeply as possible — the basis of gradient descent.

The directional derivative answers: “how fast does change if I walk in direction ?” The gradient is the tool that computes this for any direction at once.

Formal Description

Partial derivative: fix all variables except and differentiate:

Directional derivative in direction ():

The directional derivative equals the dot product of the gradient with the unit direction. Maximised when , giving steepest ascent. Zero when (moving along a level set).

Level sets and gradient orthogonality: the level set is a -dimensional surface. The gradient is orthogonal to the level set at every point .

First-order Taylor approximation:

This is the linear approximation; the gradient is the coefficient vector.

Chain rule for scalar composition: if and :

where is the Jacobian of (see jacobian_and_hessian).

Gradient in Cartesian coordinates ():

Applications

ApplicationRole of gradient
Gradient descentUpdate
BackpropagationAccumulate gradients through the computation graph
Lagrange multipliersCondition: at constrained optimum
PhysicsForce = (potential energy)
Image processingImage gradient detects edges:

Trade-offs

  • The gradient exists only where is differentiable. Non-smooth functions (e.g., ReLU) require subgradients at non-differentiable points.
  • In high dimensions, computing the full gradient requires evaluating all partial derivatives; automatic differentiation handles this efficiently.
  • Gradient direction is locally optimal but can lead to saddle points or local minima; second-order information (Hessian) is needed to distinguish these.