Gradient and Directional Derivative
Definition
The gradient of a scalar field $f : \mathbb{R}^n \to \mathbb{R}$ at a point $\mathbf{x}$ is the vector of partial derivatives:

$$\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}(\mathbf{x}), \dots, \frac{\partial f}{\partial x_n}(\mathbf{x}) \right)$$

It points in the direction of steepest ascent of $f$ at $\mathbf{x}$, with magnitude $\lVert \nabla f(\mathbf{x}) \rVert$ equal to the rate of steepest ascent.
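As a quick numerical check (my own example, not from the source), the gradient of the hypothetical function $f(x, y) = x^2 + 3y^2$ is $(2x, 6y)$, and a central finite difference recovers it:

```python
import numpy as np

def f(p):
    # Hypothetical example function: f(x, y) = x^2 + 3y^2
    x, y = p
    return x**2 + 3 * y**2

def numerical_gradient(f, p, h=1e-6):
    # Approximate each partial derivative with a central difference
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        grad[i] = (f(p + e) - f(p - e)) / (2 * h)
    return grad

p = np.array([1.0, 2.0])
print(numerical_gradient(f, p))   # close to the analytic (2x, 6y) = [2, 12]
```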
Intuition
The gradient generalises the derivative to multiple dimensions. For a 2D landscape where $f(x, y)$ = altitude, $\nabla f$ is an arrow on the horizontal plane pointing uphill as steeply as possible. Its magnitude tells you how steep that slope is. Moving in the direction $-\nabla f$ descends as steeply as possible, which is the basis of gradient descent.
The directional derivative answers: “how fast does $f$ change if I walk in direction $\mathbf{v}$?” The gradient is the tool that computes this for any direction at once.
Formal Description
Partial derivative: fix all variables except $x_i$ and differentiate:

$$\frac{\partial f}{\partial x_i}(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{e}_i) - f(\mathbf{x})}{h}$$
Directional derivative in direction $\mathbf{v}$ ($\lVert \mathbf{v} \rVert = 1$):

$$D_{\mathbf{v}} f(\mathbf{x}) = \lim_{h \to 0} \frac{f(\mathbf{x} + h\mathbf{v}) - f(\mathbf{x})}{h} = \nabla f(\mathbf{x}) \cdot \mathbf{v}$$

The directional derivative equals the dot product of the gradient with the unit direction. It is maximised when $\mathbf{v} = \nabla f(\mathbf{x}) / \lVert \nabla f(\mathbf{x}) \rVert$, giving steepest ascent, and zero when $\mathbf{v} \perp \nabla f(\mathbf{x})$ (moving along a level set).
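To illustrate (a sketch using the hypothetical example $f(x, y) = x^2 + 3y^2$, whose analytic gradient is $(2x, 6y)$), a finite difference along a unit direction $\mathbf{v}$ agrees with $\nabla f \cdot \mathbf{v}$:

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 3 * y**2

def grad_f(p):
    # Analytic gradient of the example function
    x, y = p
    return np.array([2 * x, 6 * y])

p = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
v = v / np.linalg.norm(v)            # unit direction

h = 1e-6
finite_diff = (f(p + h * v) - f(p - h * v)) / (2 * h)
dot_product = grad_f(p) @ v          # D_v f = ∇f · v
print(finite_diff, dot_product)      # both ≈ 10.8
```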
Level sets and gradient orthogonality: the level set $\{\mathbf{x} : f(\mathbf{x}) = c\}$ is an $(n-1)$-dimensional surface. The gradient $\nabla f(\mathbf{x})$ is orthogonal to the level set at every point $\mathbf{x}$ on it.
First-order Taylor approximation:

$$f(\mathbf{x} + \boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \boldsymbol{\delta}$$

This is the linear approximation of $f$ near $\mathbf{x}$; the gradient is the coefficient vector.
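A small sketch (again using the hypothetical $f(x, y) = x^2 + 3y^2$) showing the linear model's error shrinking quadratically as the step $\boldsymbol{\delta}$ shrinks, as a first-order approximation should:

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + 3 * y**2

def grad_f(p):
    x, y = p
    return np.array([2 * x, 6 * y])

p = np.array([1.0, 2.0])
for eps in [1e-1, 1e-2, 1e-3]:
    delta = eps * np.array([1.0, -1.0])
    linear = f(p) + grad_f(p) @ delta    # first-order Taylor model
    error = abs(f(p + delta) - linear)
    print(eps, error)                    # error shrinks like eps**2
```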
Chain rule for scalar composition: if $f : \mathbb{R}^m \to \mathbb{R}$ and $\mathbf{g} : \mathbb{R}^n \to \mathbb{R}^m$:

$$\nabla (f \circ \mathbf{g})(\mathbf{x}) = J_{\mathbf{g}}(\mathbf{x})^\top \, \nabla f(\mathbf{g}(\mathbf{x}))$$

where $J_{\mathbf{g}}$ is the Jacobian of $\mathbf{g}$ (see jacobian_and_hessian).
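A concrete check of the Jacobian-transpose form (my own hypothetical maps: $\mathbf{g}(x_1, x_2) = (x_1 + x_2,\ x_1 x_2)$ and $f(u_1, u_2) = u_1^2 + u_2$, so $(f \circ \mathbf{g})(\mathbf{x}) = (x_1 + x_2)^2 + x_1 x_2$):

```python
import numpy as np

def g(x):
    # Inner map g: R^2 -> R^2
    return np.array([x[0] + x[1], x[0] * x[1]])

def jacobian_g(x):
    # J_g: rows = outputs of g, columns = inputs
    return np.array([[1.0, 1.0],
                     [x[1], x[0]]])

def grad_f(u):
    # f(u) = u1^2 + u2, so ∇f = (2*u1, 1)
    return np.array([2 * u[0], 1.0])

x = np.array([2.0, 3.0])
chain = jacobian_g(x).T @ grad_f(g(x))          # J_g(x)^T ∇f(g(x))
direct = np.array([2 * (x[0] + x[1]) + x[1],    # gradient of the composed
                   2 * (x[0] + x[1]) + x[0]])   # function, by hand
print(chain, direct)                            # both [13. 12.]
```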
Gradient in Cartesian coordinates ($n = 3$):

$$\nabla f = \frac{\partial f}{\partial x}\,\mathbf{i} + \frac{\partial f}{\partial y}\,\mathbf{j} + \frac{\partial f}{\partial z}\,\mathbf{k}$$
Applications
| Application | Role of gradient |
|---|---|
| Gradient descent | Update $\mathbf{x} \leftarrow \mathbf{x} - \eta \, \nabla f(\mathbf{x})$ |
| Backpropagation | Accumulate gradients through the computation graph |
| Lagrange multipliers | Condition: $\nabla f = \lambda \nabla g$ at a constrained optimum |
| Physics | Force $\mathbf{F} = -\nabla V$ ($V$ = potential energy) |
| Image processing | Image gradient detects edges: $\nabla I = (\partial I / \partial x,\ \partial I / \partial y)$ |
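The gradient-descent update from the table can be sketched in a few lines (a minimal illustration, assuming the same hypothetical objective $f(x, y) = x^2 + 3y^2$ with minimum at the origin):

```python
import numpy as np

def grad_f(p):
    # ∇f for the example f(x, y) = x^2 + 3y^2
    return np.array([2 * p[0], 6 * p[1]])

x = np.array([5.0, -4.0])
eta = 0.1                        # learning rate (step size)
for _ in range(100):
    x = x - eta * grad_f(x)      # x ← x − η ∇f(x)
print(x)                         # converges toward the minimum at (0, 0)
```

Each step moves against the gradient, the direction of steepest descent; for this quadratic, a fixed step size of 0.1 is small enough to converge.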
Trade-offs
- The gradient exists only where $f$ is differentiable. Non-smooth functions (e.g., ReLU) require subgradients at non-differentiable points.
- In high dimensions, computing the full gradient requires evaluating all partial derivatives; automatic differentiation handles this efficiently.
- Gradient direction is locally optimal but can lead to saddle points or local minima; second-order information (Hessian) is needed to distinguish these.
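For the ReLU caveat above, a common practical convention (an illustrative sketch, not the only valid choice) is to pick the subgradient 0 at the kink, where any value in $[0, 1]$ would be valid:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_subgradient(x):
    # ReLU is non-differentiable at x = 0; any value in [0, 1] is a
    # valid subgradient there. This uses the common convention of 0.
    return (x > 0).astype(float)

xs = np.array([-2.0, 0.0, 3.0])
print(relu(xs), relu_subgradient(xs))   # [0. 0. 3.] [0. 0. 1.]
```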
Links
- Jacobian and Hessian — multivariable analogue for vector-valued functions
- Chain Rule — single-variable chain rule generalised here
- Gradient Descent — gradient used in iterative optimization
- Convex Optimization — first-order optimality condition: $\nabla f(\mathbf{x}^*) = 0$
- Backpropagation — gradient of loss w.r.t. parameters