1. Climb Every Mountain:Sound of Music 1965 Gradient and Hessian Based Optimization
    In more than one dimension the idea of bracketing a maximum becomes unworkable. Now we turn to calculus to help the algorithm decide which direction to go towards a maximum. That is, we use the partial derivatives (gradient vector) of the function to guide the search for improvement. In turns out that the line optimization technique of bracket-and-bisect still comes in handy.

  1. Steepest Ascent
  2. NR 10.6
    Imagine you are hiking to the top of a mountain in a dense fog. You can't see the peak at all. All you can see is in which direction the terrain is going up, down or staying level at your feet. With nothing else to go on, you might as well climb in the direction where the terrain below your feet is steepest, hoping the pattern continues to the top. In addition, you are hiking with a friend and want to carry on a conversation. This means that you don't survey the terrain after every step. You head out in a direction and keep going for a while until you feel that you are no longer going up. Then you stop, look down again and move in the steepest direction again.

    If that analogy makes sense then you understand the notion of steepest ascent broken down into steps (an algorithm!)

    Algorithm 15. Steepest Ascent

    Given $f(x)$ and $x^0\in \Re^n$ and tolerance $\epsilon>0$.
    1. Initialize. Set $t=0$.
    2. →Compute Gradient. $\nabla\,f_t = \nabla\,f(x^t)$. Strong convergence: If $\|\nabla\,f_t\| \mathop{=_{\epsilon_1}} 0$, then .
    3. Line Optimization.
    4. $$\lambda^\star = {\arg\max}_\lambda f\left(x^t + \lambda \nabla\,f^{\,\prime}_t \right).\nonumber$$
      Set $x^{t+1} = x^t + \lambda^\star\nabla\,f^{\,\prime}_t$.
    5. Increment $t$ and repeat previous two steps (go back to →).

    Exhibit 40. Steepest Descent in Two Dimensions

    1. Set $t=0$. $\nabla\,f_t = \nabla\,f(x^t)$.
    2. Line Optimization. $\lambda^\star = {\arg\max}_\lambda f\left(x^t + \lambda \nabla\,f^{\,\prime}_t \right)$
    3. Set $x^1$.
    4. Line Optimization 2, etc.
  3. Two Examples
  4. Consider $$f(x_0,x_1) = -x_0^2 + x_0x_1 - x_1^2.\nonumber$$ As a quadratic function we get a linear gradient: $$\nabla\,f = (\,-2x_0+x_1\,,\,x_0 - 2x_1\,).\nonumber$$ First order conditions have a closed form solution, so it is easy to show that $x^\star = (0,0)$. If we started at $x^0 = (2,1)$, we have $\nabla\,f^0 = (-3,-1)$. It's easy to see that with a step size of $s^0=.75$ we get $$x^1 = x^0-(1.5)\nabla\,f^0 = (-0.25,.25).\nonumber$$ That might not be the exact line maximization step, but we can see get very close. It also turns out that once the current values are equal like that the symmetry of the gradient kicks in and the next step will hit $x^\star$ exactly (the reader is encouraged to verify or correct that claim). Therefore, Steepest Ascent works perfectly in two iterations! This is because the objective is quadratic so the Hessian is a constant matrix. That then means the gradient points in the right direction to go but it might not be the right length.

    Now consider a different simple function: $$f(x_0,x_1)=-\left(\,1-x_0\,\right)^{2}-100\left(\,x_1-x_0^2\,\right)^{2}.\tag{Rosenbrock}$$ This is known as the Rosenbrock function (see 28-rosenbrock.ox). Again, $x^\star =(0,0)$. It turns out that Steepest Ascent cannot quickly find $x^\star$ as we will see later. The reason is that the gradient is almost never pointing in the direction of $x^\star$. Using information about how the gradient is changing will fix this problem, hence the value of using the Hessian as well as the gradient.

  5. Newton's Method
  6. Newton's method is a big improvement over Steepest Ascent. It uses both the gradient vector and the Hessian matrix in order to speed up convergence to a maximum. Continuing with our mountain climbers in the fog analogy: rather than just going up the steepest path they would be better off thinking about how the gradient is changing within the area they can see. If they can see that the gradient is slowing down along the steepest path but stays steeper to one side of that path they would make quicker process if they curled off into that direction.

    The algorithm can be explained several different ways. One way is to use Taylor's second order approximation of any (smooth) function. Given the objective $f(x)$ and a point $x_0$ we compute the gradient vector and Hessian matrix at $x_0$, denoted $\nabla\,f_0$ and $Hf_0$. Then a Taylor approximation to the value of $f(x)$ at any point away from $x_0$ is $${f(x) \approx f(\,x_0\,) + (\,x-x_0\,)\nabla\,f_0^T + {1\over 2}\left(\,x-x_0\,\right)^T \left[ Hf_0\right] \left(\,x-x_0\,\right).}\tag{Taylor}\label{Taylor}$$ ($^T$ stands for the matrix transpose operator.) What is going is that the gradient and Hessian matrices at $x_0$ are determining a quadratic function on the right. Quadratic functions have closed form first order conditions and closed-form solutions to solving them. The max of the approximation would be $$x^\star = x_0 - H^{-1}f_0 \nabla\,f_0^T.\tag{QuadMax}\label{QuadMax}$$ However, our goal is not to maximize the approximation but rather to maximize the function itself. So $x^\star$ would be the true maximum only if $f(x)$ was actually quadratic (in which case the approximation is exact everywhere.). So Newton's method turns this closed form problem into an iterative process for the non-closed form original.

    Algorithm 16. Newton Iteration

    Given $f(x)$, and $x_0\in \Re^n$ and tolerance $\epsilon_1$.
    1. Set $t=0$.
    2. →Compute $\nabla\,f_t$ Strong convergence: If $\|\,\nabla\,f_t\,\|\ \mathop{=_{\epsilon_1}}\ 0$, then .
    3. Compute $H_t = Hf(\,x_t\,)$.
    4. Solve $H_t\,s_t = -\nabla\,f_t^T$ for the direction $s_t$.
    5. Perform line optimization on $\lambda$:
    6. $$\lambda^\star_t = {\arg\max}_\lambda f\left(x_t + \lambda s_t\right)\nonumber$$ Set $x_{t+1} = x_t + \lambda^\star s_t$.
    7. Increment $t$ and repeat previous two steps (go back to →).

    Exhibit 41. Newton versus Steepest Descent

    Hessian at each iteration `curls' the direction away from steepest ascent.
    Newton's method is always superior to steepest ascent, which was discussed mainly to introduce the notion of using gradients. However, there are reasons for not using Newton's Method. First, computing $H$ at each step of the iteration can be costly. For our simple, elementary problems this is not an issue, but for big problems computing the Hessian can add a lot to the overall cost of finding a maximum. Second, the Hessian can be misleading far away from an optimum. Consider a two dimensional objective with contour curves like this:

    Exhibit 42. Newton and Bad Starting Values

  7. Quasi-Newton Methods
  8. NR 10.7
    A quasi-Newton algorithm uses the same step formula as Newton but it uses an approximation to the Hessian instead of computing it directly. Quasi-Newton algorithms build up the approximation to the Hessian as iteration proceeds. The formula uses the current Hessian approximation and the changes in the parameter vector and the gradient vector.

    Different quasi-Newton methods are defined by different formulas for updating the Hessian approximation at each iteration. The formula that has become the standard is known as BFGS.

    Algorithm 17. BFGS Updating

    Given $f(x)$, and $x_0\in \Re^n$.
    1. Initialize.
      Compute $\nabla\, f_0$. Set $t=0$ and $H_0=I$ (or some other symmetric invertible matrix).
    2. → Strong convergence: If $\|\,\nabla\,f_t\,\|\ \mathop{=_{\epsilon_1}}\ 0$, then .
    3. Solve $H_t s_t = -\nabla\,f_t^T$.
    4. Perform line optimization on $\lambda$:
    5. $$\lambda^\star_t = {\arg\max}_\lambda f\left(x_t + \lambda s_t\right)\nonumber$$ Set $x_{t+1} = x_t - \lambda^\star_t$.
    6. Update $H$ (according to BFGS).

    7. Compute $\nabla\,f(x_{t+1})$. Let $z=x_{t+1}-x_t$ and $y= \left(\nabla\, f_{t+1}-\nabla\, f_t\right)^T$.
      BFGS Update: $${H_{t+1} = H_t - { (H_t z)(H_t z)^T\over z^TH_t z} + {yy^T \over y^T z}.}\tag{BFGS}\label{BFGS}$$
    8. Increment $t$ and repeat the previous 4 steps (go to →).
    The theoretical result behind quasi-Newton methods is that the approximation $H_t$ converges to the true Hessian $Hf(x_t)$ in $n$ iterations (where $n$ is the dimensions of the parameter vector). The final value $H_T$ is not a reliable approximation to $Hf(x^\star)$ and should never be used as an estimate of $H$. Instead, compute $Hf(x^\star)$ directly or do one Newton step.

  9. Gradient-Based Methods in Ox
  10. #import "maximize"
    
        MaxNewton(func, ax, af, aH, UseNH);
    
            Argument        Explanation
            ------------------------------------------------------------------------------------
            func            function that computes the objective (see format below)
            ax              address of starting vector, where final values will be placed
            aH              address to place final Hessian (can be 0 if not needed)
            UseNH           0=func computes Hessian
                            1=Use numerical Hessians computed in MaxNewton
            ------------------------------------------------------------------------------------
    
        Returns an integer Convergence Code:
    
              Value    Label            Explanation
            ------------------------------------------------------------------------------------------------
                0      MAX_CONV         Strong convergence Both convergence tests were passed
                1      MAX_WEAK_CONV    Weak convergence (no improvement in line search); step length
                                            too small but one test passed
                2      MAX_MAXIT        No convergence (maximum no of iterations reached)
                3      MAX_LINE_FAIL    No convergence (no improvement in line search); step length
                                            too small and convergence test not passed.
                4      MAX_FUNC_FAIL    No convergence (initial function evaluation failed)
                5      MAX_NOCONV       No convergence Probably not yet attempted to maximize.
            ------------------------------------------------------------------------------------------------
    
    The function you write that you send to MaxNewton() has to be in a particular form since MaxNewton will call it.
        myobjective(x, af, agrad, aHess);
    
            Argument        Explanation
            ------------------------------------------------------------------------------------
            x               parameter vector
            af              address of where to put f(x)
            agrad           address to place gradient
                            0 = gradient not needed
            aHess           address to place Hessian
                            0 = Hessian not needed
            ------------------------------------------------------------------------------------
    
        Returns
        1: function computed successful at x
        0: function evaluation failed at x
    
    You should verify that this is a specialization of the function format required by Num1Derivative() and Num2Derivative(), which makes sense since they are all part of the maximize package. The difference is that the unspecified variable argument, , is replaced by two specified arguments.

    Notice that MaxNewton gives you the option of using numerical Hessians but your function must supply the gradient when asked for. That is, if the third argument agrad is not an int it is an address and it means that the algorithm wants the gradient computed as well as the function. But you can use Num1Derivative() rather than code the analytic derivatives. So if you want to send it to MaxNewton and don't want to code analytic derivatives add a line to your function that looks like this:

        if (!isint(agrad)) Num1Derivative(myobjective,x,agrad);
        
    Note that "myobjective" is the name of your function, "x" is whatever you call the first formal argument and "agrad" is whatever you call the third formal argument.
    Recursion This is an example of a recursive function call because myobjective() will be called within another call to myobjective(). Machinery in the background of the language keeps this from going terribly wrong. In particular, each call to a function goes on something called the "runtime stack" which keeps straight where the program is and which functions are still executing.
    So we can modify the example from earlier of the function $-x_0^2+x_0x_1-x_1^2$:
    27-maxnewton.ox
     1:    #include "oxstd.h"
     2:    #import "maximize"
     3:    
     4:    const decl beta = 0.9;
     5:    f(x,af,ag,aH) {
     6:    	af[0] = -x[0]^2 + 0.5*x[0]*x[1] - x[1]^2 + beta*log(1+x[0]+x[1]);
     7:    	if (!isint(ag)) Num1Derivative(f,x,ag);
     8:    	return 1;
     9:    	}
    10:    
    11:    main() {
    12:    	decl xp = <2;1>, fv;
    13:    	MaxControl(-1,1);
    14:    	println("Conv. Code=",MaxNewton(f,&xp,&fv,0,0));
    15:    	}
    
    Output produced from the example above:
    Starting values
    parameters
           2.0000       1.0000
    gradients
          -3.0000      0.00000
    Initial function =                  -3
    
    Position after 1 Newton iterations
    Status: Strong convergence
    parameters
      2.9771e-005  1.9326e-005
    gradients
     -4.0216e-005 -8.8816e-006
    function value = -7.53864179238e-010
    Conv. Code=0
    

Exercises