1. Finding Nemo: The Nelder-Mead Amoeba Algorithm
Quasi-Newton methods such as BFGS can provide the benefits of Newton's method at a lower overall cost. But any gradient-based method will typically fail when applied to an objective with kinks:

Exhibit 43. Gradients Do Not Help with Kinky Objectives

The reason is that at a kink the gradient is undefined, and near a kink it does not provide reliable information about where the function is going.
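For example, take the one-dimensional kinked objective $f(x) = -|x - x^\star|$: wherever the derivative exists it equals $+1$ or $-1$, so its size says nothing about how far $x$ is from the maximum at $x^\star$, and the second derivative is $0$ wherever it exists, so a Newton step is not even defined.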
NR 10.4
The Nelder-Mead (1965) algorithm searches for an optimum of a function without relying on the function's gradient or Hessian. It is a very simple but smart "search" algorithm, typically much better than searching over a grid of values. Because it does not rely on gradients, its performance requires only continuity of the objective. This means an objective with kinks poses no problem for Nelder-Mead.

The basic algorithm creates a simplex, which is a closed shape within the parameter space. If choosing one variable ($n=1$), a simplex is simply an interval on the line; an interval is defined by its lower and upper limits, that is, by $2 = 1+1$ points. When choosing two variables ($n=2$), a simplex is a triangle, a shape defined by three (or $2+1$) vertices. For the triangle to have non-zero area the three vertices cannot all lie on the same line. In general, a simplex in $n$ dimensions is defined by $n+1$ vertices that do not all lie in a lower-dimensional subspace (they are not co-linear, co-planar, and so on).
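A common way to build an initial simplex from a starting vector $x_0$ (this is the role of the vDelta step sizes in MaxSimplex below) is to perturb one coordinate at a time: $x^0 = x_0$ and $x^i = x_0 + \delta_i e_i$ for $i = 1, \dots, n$, where $e_i$ is the $i$-th unit vector and $\delta_i$ is a step size.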

Algorithm 18. The Nelder-Mead Amoeba Algorithm

Given $f(x)$ and $x_0$
  1. Initialization. Set $t=0$.
    • Starting from $x_0$, create a simplex in $\Re^n$ by adding $n$ other (not co-linear) points.
    • Call these points $x^0, x^1, \dots, x^n$.
  2. ‡ Evaluate the objective at each vertex, $f^i = f(x^i)$. Identify the best ($f^M$), worst ($f^m$) and second worst ($f^s$) points.
  3. → Attempt A. Extend the best point. That is,
    • Stretch the best vertex $x^M$ out further from all the other points (by a factor greater than 1). Call the new point $x^A$.
    • Evaluate the objective at that point, $f^A = f(x^A)$.
    • Keep $x^A$ if it is better than the current worst point:
    • If $f^A \gt f^m$, replace $x^m$ with $x^A$. (Reset the values of $M,m,s$ to account for the change.) Return to → with the new simplex.
    • If $f^A \le f^m$, go on to Attempt B.
  4. Attempt B. Try moving away from the worst point.
    • That is, try $x^B$, the point on the opposite side of the simplex from $x^m$, and evaluate $f^B = f(x^B)$.
    • If $f^B \gt f^m$ then keep it as in Attempt A and return to →.
    • If not, try Attempt C.
  5. Attempt C. Collapse the simplex.
    • Compute new points $x^0, \dots, x^n$ that are all a fixed factor (less than 1) closer to the average point of the current simplex.
    • If the size of the new simplex is smaller than the precision $\epsilon$, stop: report the best vertex as the optimum.
    • Otherwise, return to ‡: evaluate the objective at each new vertex, identify $M$, $m$ and $s$, and return to →.
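
To make the three attempts concrete, here is a minimal Ox sketch of the loop above. It is an illustration only, not the library routine: the names AmoebaSketch, fObj, dStep and dEps are made up, the stretch and collapse factors (2 and 0.5) are arbitrary choices, and in practice you would call MaxSimplex (Exhibit 44) rather than code this yourself.

    #include <oxstd.h>

    /* Sketch of Algorithm 18.  fObj(x) returns the objective at the n x 1 vector x. */
    AmoebaSketch(const fObj, const vX0, const dStep, const dEps)
    {
        decl n = rows(vX0), i, it = 0, iM, im, mS, vF, vCent, vA, vB, dfA, dfB;

        // 1. Initialization: n+1 vertices as columns; vertex i shifts x0 along axis i.
        mS = vX0 * ones(1, n + 1) + (zeros(n, 1) ~ dStep * unit(n));
        vF = zeros(n + 1, 1);
        for (i = 0; i <= n; ++i)
            vF[i] = fObj(mS[][i]);                    // 2. evaluate at each vertex

        while (it++ < 1000)                           // crude guard against endless looping
        {
            iM = int(maxcindex(vF));                  // best vertex (M)
            im = int(mincindex(vF));                  // worst vertex (m)
            vCent = meanr(mS);                        // average (centroid) of the simplex

            // 3. Attempt A: stretch the best vertex further out from the others.
            vA = vCent + 2 * (mS[][iM] - vCent);
            dfA = fObj(vA);
            if (dfA > vF[im])
                { mS[][im] = vA;  vF[im] = dfA;  continue; }

            // 4. Attempt B: reflect the worst vertex to the opposite side of the simplex.
            vB = vCent + (vCent - mS[][im]);
            dfB = fObj(vB);
            if (dfB > vF[im])
                { mS[][im] = vB;  vF[im] = dfB;  continue; }

            // 5. Attempt C: collapse every vertex halfway toward the centroid.
            for (i = 0; i <= n; ++i)
                {  mS[][i] = 0.5 * (mS[][i] + vCent);  vF[i] = fObj(mS[][i]);  }
            // Use the best-to-worst vertex distance as a crude measure of simplex size.
            if (sqrt(sumsqrc(mS[][iM] - mS[][im])) < dEps)
                break;
        }
        return mS[][int(maxcindex(vF))];              // best vertex found
    }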

Exhibit 44. Walk like an Amoeba

#import "maximize"

    MaxSimplex(func, ax, af, vDelta);

        Argument        Explanation
        ------------------------------------------------------------------------------------
        func            objective (in same form as other Max routines require)
        ax              address of starting values and where to place final values
        af              address to place final objective value
        vDelta          either 0 (initialize the simplex internally)
                        or an n x 1 matrix of step sizes for the simplex

    Returns
        Convergence code same as MaxBFGS
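
Here is a minimal usage sketch. The kinked objective fKinky and its starting values are made up for illustration; the objective is written in the usual Max-routine format (set the value through the second argument and return 1 on success), and MaxConvergenceMsg, also in the maximize package, translates the return code into text.

    #include <oxstd.h>
    #import "maximize"

    /* Kinked objective in the format the Max routines require. */
    fKinky(const vP, const adFunc, const avScore, const amHessian)
    {
        adFunc[0] = -fabs(vP[0] - 1) - 2 * fabs(vP[1] + 3);   // maximum at (1, -3)
        return 1;
    }

    main()
    {
        decl vp = <0; 0>, f, ir;
        ir = MaxSimplex(fKinky, &vp, &f, 0);    // 0: initialize the simplex internally
        println("convergence: ", MaxConvergenceMsg(ir));
        println("argmax: ", vp', "max value: ", f);
    }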

Summary of Unconstrained Optimization

We started with the simplest problem: find a local maximum of a continuous function in one dimension using a line search, which involves two steps: bracket, then bisect. Then we generalized to finding a local maximum in any number of dimensions when the gradient (the first derivatives) exists everywhere (no kinks). Whether these gradient-based algorithms converge depends on the shape and curvature of the objective. Then we added the Nelder-Mead routine, which works well even if the function has kinks. Later we will add a random search algorithm that tries to find a global maximum even when multiple local maxima exist.

In practice, what algorithm should you use? The answer is that you should be prepared to use all of them, especially when maximizing a function of many parameters (large $n$) that you are not sure is concave and smooth (no kinks). In particular, you should see sequential optimization as applying a sequence of algorithms, from the most robust but inefficient to the fastest, typically Newton or BFGS. Typically you would monitor the progress of each algorithm: as the current best value stops changing quickly, you can stop that algorithm and start the next one from where the last one left off.
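
As a sketch of that hand-off (fSmooth and its starting values are made up; the final argument of MaxBFGS requests numerical first derivatives since fSmooth does not supply a score):

    #include <oxstd.h>
    #import "maximize"

    /* Hypothetical smooth objective, again in the standard Max-routine format. */
    fSmooth(const vP, const adFunc, const avScore, const amHessian)
    {
        adFunc[0] = -((vP[0] - 1)^2 + 10 * (vP[1] - 2)^2);
        return 1;
    }

    main()
    {
        decl vp = <10; 10>, f, ir;
        // Stage 1: robust but slow -- the amoeba walks toward the region of the maximum.
        MaxSimplex(fSmooth, &vp, &f, 0);
        // Stage 2: fast -- restart BFGS from where the amoeba stopped.
        ir = MaxBFGS(fSmooth, &vp, &f, 0, TRUE);
        println(MaxConvergenceMsg(ir), " at ", vp', "  f = ", f);
    }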

Exercises