Exhibit 43. Gradients Do Not Help with Kinky Objectives

The basic algorithm creates a simplex, a closed shape within the parameter space. With one variable (n=1) a simplex is simply an interval on the line, defined by its lower and upper limits: 2 (or 1+1) points. With two variables (n=2) a simplex is a triangle, a shape defined by three (or 2+1) vertices. For the triangle to have non-zero area, no vertex can lie on the line through the other two. In general, a simplex in n dimensions is defined by n+1 points, no one of which lies in the plane spanned by the others.
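To make the non-degeneracy condition concrete, here is a short Python sketch (my own illustration, not part of the course code) that computes the volume of a simplex from its n+1 vertices; a volume of zero flags the degenerate case described above.

```python
import math
import numpy as np

def simplex_volume(vertices):
    # Volume of the simplex defined by n+1 points in R^n. A nonzero volume
    # means no vertex lies in the plane spanned by the others; for n=2 this
    # is exactly the "vertices not on one line" condition for a triangle.
    v = np.asarray(vertices, dtype=float)
    n = v.shape[1]
    edges = v[1:] - v[0]                  # the n edge vectors from vertex 0
    return abs(np.linalg.det(edges)) / math.factorial(n)

print(simplex_volume([[0.0], [2.0]]))            # n=1: interval of length 2.0
print(simplex_volume([[0, 0], [1, 0], [0, 1]]))  # n=2: triangle of area 0.5
print(simplex_volume([[0, 0], [1, 1], [2, 2]]))  # collinear vertices: 0.0
```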
Algorithm 18. The Nelder-Mead Amoeba Algorithm
- Initialization. Set t=0.
- Create a simplex in ℜn from n+1 (not co-linear) points.
- Call these points x0,x1,…,xn.
- ‡ Evaluate the objective at each vertex, fi=f(xi). Identify the best (fM), worst (fm) and second worst (fs) points.
- → Attempt A. Extend the best point. That is,
- Stretch the best vertex xM out further from all the other points (by a factor greater than 1). Call the new point xA.
- Evaluate the objective at that point, fA=f(xA).
- Keep xA if it is better than the current worst point:
- If fA>fm, replace xm with xA. (Reset the values of M,m,s to account for the change.) Return to → with the new simplex.
- If fA≤fm, go on to Attempt B.
- Attempt B. Try moving away from worst point.
- That is, try xB which is on the opposite side of the simplex from xm.
- If fB>fm then keep it as in Attempt A and return to →.
- If not, try C.
- Attempt C. Collapse the simplex.
- Compute new points x0,…,xn that are all a fixed fraction closer to the average point of the current simplex.
- If the size of the new simplex is smaller than the precision ϵ, stop: the algorithm has converged.
- Otherwise, return to ‡: evaluate the objective at each point, identify M, m and s, and return to →.
Exhibit 44. Walk like an Amoeba

#import "maximize"

MaxSimplex(func, ax, af, vDelta);

Argument   Explanation
---------  ----------------------------------------------------------------
func       objective (in same form as other Max routines require)
ax         address of starting values and where to place final values
af         address to place final objective value
vDelta     0 = initialize the simplex internally, or an n x 1 matrix of
           step sizes for the simplex

Returns    convergence code, same as MaxBFGS
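If you are working in Python rather than Ox, SciPy's Nelder-Mead method plays roughly the role of MaxSimplex. The sketch below is my own analogy, not taken from the course materials: SciPy minimizes, so we pass the negative of a made-up kinky objective, and the options play roughly the role of vDelta and the precision ϵ.

```python
import numpy as np
from scipy.optimize import minimize

# maximize -|b0 - 1| - |b1 + 2| by minimizing its negative
res = minimize(lambda b: abs(b[0] - 1.0) + abs(b[1] + 2.0),
               x0=np.zeros(2), method="Nelder-Mead",
               options={"xatol": 1e-9, "fatol": 1e-9, "maxiter": 2000})
print(res.x)   # should end up near [1, -2]
```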
Summary of Unconstrained Optimization
We started with the simplest problem: find a local maximum of a continuous function in one dimension using a line search, which involves two steps: bracket then bisect. Then we generalized to finding a local maximum in any number of dimensions when the gradient (first derivatives) exists everywhere (no kinks). Whether these gradient-based algorithms converge depends on the shape and curvature of the objective. Then we added the Nelder-Mead routine, which works well even if the function has kinks. Later we will add a random search algorithm that tries to find a global maximum even when multiple local maxima exist.
In practice, what algorithm should you use? The answer is that you should be prepared to use all of them, especially when maximizing a function of many parameters (large n) that you are not sure is concave and smooth (no kinks). In particular, you should see sequential optimization as applying a sequence of algorithms, from the most robust but inefficient to the fastest, typically Newton or BFGS. Typically you would monitor the progress of your algorithm: when the current best value stops changing quickly you can stop the algorithm and start the next one from where the last one left off.
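The hand-off strategy described above can be sketched as follows, with SciPy standing in for the Ox Max routines (the objective, starting point, and tolerances are made up for illustration; SciPy minimizes, so the objective is negated):

```python
import numpy as np
from scipy.optimize import minimize

def negf(b):
    # negative of a smooth concave objective, so minimizing it maximizes f
    return (b[0] - 1.0) ** 2 + 0.5 * (b[1] + 2.0) ** 2

# Stage 1: robust but slow. Run the amoeba from a rough start with a
# loose tolerance, just to get into the right neighborhood.
stage1 = minimize(negf, x0=np.array([8.0, -9.0]), method="Nelder-Mead",
                  options={"xatol": 1e-2, "fatol": 1e-2})

# Stage 2: fast finisher. Start BFGS where the amoeba stopped.
stage2 = minimize(negf, x0=stage1.x, method="BFGS")
print(stage2.x)   # near the maximizer [1, -2]
```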
Exercises
- The function
L = \sum_{i=0}^{N-1} \ln\left({ e^{y_i x_i\beta} \over 1+e^{x_i\beta} }\right) is the log-likelihood function for a simple (binary) logit model. This model explains which of two discrete outcomes happens (y=0 or y=1) using the row vector xi as explanatory variables. For example: explaining who goes to university or not depending on gender, high school grades, and parents' education. The columns of xi correspond to the variables, like gender. The multinomial logit generalizes the simple or binary logit so that y takes on J different values instead of just J=2 values. The idea of a mlogit is that each option j has its own "utility." The utility depends on the chooser (i) and the option (j). It also combines deterministic and random components, so that there is a probability of each j being optimal for chooser i. So let
v_{ij} = x_i\beta_j
be the deterministic component and \epsilon_{ij} be the random component, so that utility is:
u_{ij} = v_{ij} + \epsilon_{ij}.
We observe i choosing the option that maximizes their utility:
j^\star_i = \arg\max_j u_{ij}.
When the residual \epsilon_{ij} follows a particular distribution (the Extreme Value distribution) then we get a specific form for the probability that j^\star_i=y_i (where y_i is the observed choice):
Prob( j^\star_i =y_i | x_i, \beta_0,\dots,\beta_{J-1}) = { e^{v_{i\,y_i}} \over \sum_{j=0}^{J-1} e^{v_{i\,j}}}.
Notice that this uses the discrete outcome y_i as an index for the correct v_{ij} to include in the numerator, and this in turn depends on \beta_j.
We can get back to the simple logit by setting J=2 and \beta_0 = \overrightarrow{0}. That is, option 0 has a beta vector of 0s, which means e^{v_{i 0}} = e^{x_i\beta_0} = e^0 = 1. And now the function we've been working with all along becomes a special case of the multinomial logit log-likelihood function: L^J = \sum_{i=0}^{N-1} \ln \left( { e^{v_{i\,y_i}} \over \sum_{j=0}^{J-1} e^{v_{i\,j}} }\right),\qquad\hbox{and } v_{ij} \equiv x_i\beta_j. Note that L^J is shorthand. A more accurate definition would be L^J(\beta_1,\dots,\beta_{J-1}; y, X). That is, y and X are taken as given; the vectors of coefficients are varied in order to maximize L^J.
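This reduction can be checked numerically. The sketch below (Python with NumPy standing in for Ox; the data and coefficients are made up) computes L^J by unpacking a stacked coefficient vector with \beta_0 fixed at 0, then confirms that J=2 reproduces the simple logit log-likelihood.

```python
import numpy as np

def LJ(theta, y, X, J):
    # L^J: unpack theta (stacked beta_1,...,beta_{J-1}) into B with a
    # leading column of 0s for beta_0, then compute every v_ij = x_i beta_j
    # at once as the matrix product X B.
    N, K = X.shape
    B = np.hstack([np.zeros((K, 1)), theta.reshape(J - 1, K).T])   # K x J
    V = X @ B                                                      # N x J
    # log prob of the observed choice: v_{i,y_i} - ln sum_j e^{v_ij}
    return np.sum(V[np.arange(N), y] - np.log(np.exp(V).sum(axis=1)))

# J=2 with beta_0 = 0 reduces to the simple binary logit log-likelihood
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))             # made-up data: N=6 rows, K=3 columns
y = np.array([0, 1, 1, 0, 1, 0])
beta = np.array([0.5, -1.0, 2.0])
simple = np.sum(y * (X @ beta) - np.log(1.0 + np.exp(X @ beta)))
print(np.isclose(LJ(beta, y, X, J=2), simple))    # True
```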
In the simple logit we could maximize this function using Ox's MaxBFGS() by sending a single beta vector as the parameters to maximize over. In the multinomial model we want to maximize over 2 or more vectors, but MaxBFGS(), like almost all optimizers, takes only a column vector as the argument to maximize over. Let's define \theta as that vector and think of it as containing the "concatenation" of the \beta_j's: \theta \equiv \pmatrix{\beta_1\cr\beta_2\cr\vdots\cr\beta_{J-1}} Note that we don't include \beta_0 because it is fixed at 0 and MaxBFGS() should not change its values. A function written to compute L^J would "unpack" \theta into a matrix: \theta_{(J-1)K\times 1} \qquad \Rightarrow \qquad B_{K\times J} = \pmatrix{ \overrightarrow{0} & \beta_1 & \beta_2 & \cdots & \beta_{J-1}}. Now notice that XB is an N\times J matrix consisting of all the values x_i\beta_j for all i and all j, so computing v_{ij}, and ultimately L^J, can be done with vector functions (or with loops over j, or over both i and j).

The code and data file in the mlogit folder are designed to build a function to compute L^J and find the maximizing coefficients using one of the Max????() algorithms. The data come from the example used in Stata to illustrate its mlogit command. The data and estimates are summarized with these screen captures:

Add code to the files in the folder to:
- Complete the coding of LJ
- Create a starting theta vector
- Maximize LJ (use algorithm of your choice)
- Print out results to compare with Stata output
- EXTRA:
- copy over stata_logit_data.dta from the logit folder
- use loaddemo to load the simple logit data
- Reset K and J (*NOTE: do not decrement y since it is already 0/1)
- Verify that LJ() provides the same results as the simple logit