Skip to main content

Section 14.5 Gradient descent

Introduction will go here.
  • If \(f(x,y)\) is a differentiable function, the direction of steepest decrease at a point is \(-\,\Grad f\text{.}\) This suggests the iterative scheme
    \begin{equation*} \bv{x}_{k+1} = \bv{x}_k - \alpha \Grad f(\bv{x}_k), \end{equation*}
    where \(\alpha > 0\) is a step size.
    For suitable choices of \(\alpha\text{,}\) this process moves downhill and can converge to a local minimum.
  • Now suppose we want to optimize \(f(x,y)\) subject to the constraint \(g(x,y)=c\text{.}\) Previously, we saw that a constrained extremum satisfies
    \begin{equation*} \Grad f = \lambda \Grad g. \end{equation*}
    Rather than solving this system algebraically, we introduce the Lagrangian:
    \begin{equation*} \cl{L}(x,y,\lambda)=f(x,y)+\lambda(c-g(x,y)). \end{equation*}
  • The stationary conditions \(\Grad_{x,y,\lambda}\cl{L}=0\) reproduce the Lagrange multiplier equations:
    \begin{align*} \pdv{f}{x} - \lambda\pdv{g}{x} \amp = 0\\ \pdv{f}{y} - \lambda\pdv{g}{y} \amp = 0\\ c - g(x,y) \amp = 0 \end{align*}
  • However, instead of minimizing \(\cl{L}\) in all variables (which would fail, since the critical point is always a saddle point), we use a combined descent-ascent strategy:
    \begin{equation*} \bv{x}_{k+1} = \bv{x}_k - \alpha \Grad_{x,y}\cl{L}, \end{equation*}
    \begin{equation*} \lambda_{k+1} = \lambda_k + \beta \pdv{\cl{L}}{\lambda}. \end{equation*}
    The variables \((x,y)\) move in the direction that decreases \(f\text{,}\) while \(\lambda\) adjusts to enforce the constraint. When this system settles, we recover the constrained optimum.
  • What does \(\lambda\) represent? Since
    \begin{equation*} \pdv{\cl{L}}{c} = \lambda, \end{equation*}
    the multiplier measures how the Lagrangian changes when the constraint level \(c\) changes.
  • Let \((x^{*}(c),y^{*}(c))\) denote the constrained optimizer and consider the optimal value
    \begin{equation*} f(x^{*}(c),y^{*}(c)). \end{equation*}
    Differentiating with respect to \(c\) and using the chain rule together with
    \begin{equation*} \Grad f = \lambda \Grad g, \end{equation*}
    one finds
    \begin{equation*} \dv{f}{c} = \lambda. \end{equation*}
    Thus the Lagrange multiplier measures the sensitivity of the optimal value to changes in the constraint.