1. Constrained Optimization

Bound for Glory (book/movie, 1976)

Constraining Parameter Values to Open Intervals

Don't Fence Me In (Cole Porter, 1930s)

Many optimization problems in economics are constrained, meaning that the parameters cannot take on just any real value. These constraints come from economic concepts, such as budget constraints and technology frontiers. They also come from logical and mathematical necessities. For example, probabilities must be numbers between 0 and 1. If you maximize an objective that involves choosing a probability and the result is a value outside this range, it is meaningless.

Consider the problem: $${\max}_{0 \lt x \lt 1} \quad 2\ln (x) + 3\ln(1-x).\nonumber$$ This means choosing $x$ over the open interval $(0,1)$. If we try to evaluate the objective for $x\le 0$ or $x\ge 1$ we are trying to take the log of a non-positive number, which is undefined. The problem with using unconstrained methods is that they do not know this and may try to evaluate the objective at values outside the well-defined range.
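
Before introducing any tricks, note that this particular problem has a closed-form solution, which is useful for checking the numerical approaches below. The first order condition gives $${2 \over x} - {3 \over 1-x} = 0 \qquad\Rightarrow\qquad 2(1-x) = 3x \qquad\Rightarrow\qquad x^\star = {2 \over 5}.\nonumber$$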

However, there is a trick that can be pulled on unconstrained optimization algorithms. First, a little bit of calculus. Remember the chain rule:
if $f(y) = g\left(h(y)\right)$ then $f^{\,\prime}(y) = g^{\,\prime}\left(\,h(y)\,\right)h^{\,\prime}(y).$
If we want to maximize $f(y)$ the first order condition is $$f^{\,\prime}(y^\star) =0 \qquad\Leftrightarrow\qquad g^{\,\prime}\left(\,h\left(y^\star\right)\,\right)h^{\,\prime}\left(\,y^\star\,\right) = 0.\nonumber$$ Now suppose that the inner function $h(y)$ is monotonically increasing, meaning $h^{\,\prime}(y) > 0$ for all $y$. Then we can divide both sides of the condition by $h^{\,\prime}(y^\star)$ and the result is simply $$g^{\,\prime}\left(\,h(y^\star)\,\right)=0.\nonumber$$ Now we can re-imagine $h$ as the argument to $g()$ and let it be the variable we choose. If we maximize $g(h)$ with respect to $h$ we get the condition $g^{\,\prime}(h^\star) = 0$, the same condition as above. If we already knew $y^\star$ then we could simply set $h^\star = h(y^\star)$ and be sure we were satisfying the first order condition in $h$. That leads to a general principle: maximizing with respect to a monotonic function of a variable gives the same result as maximizing over the variable itself.

The trick is then to re-think $x$ in the problem at the start as if it were $h$ in the chain rule example. We want to make $x$ a monotonic function of a new variable $y$. Then we maximize over $y$ and we will get the same result as if we had maximized over $x$ directly. That is, we transform the choice variable. Instead of choosing $x$ we let the algorithm choose $y$. And for a given $y$ we then undo the transformation so that $h(y)$ is guaranteed to be between 0 and 1. A convenient monotonic transformation for the range $(0,1)$ is: $$h(y) = {e^y \over 1 + e^y}.\tag{Hprob}\label{Hprob}$$ Notice that for any value of $y$ the value of $h(y)$ is in $(0,1)$. And notice that $h(y)$ is monotonic: take the derivative and you will see that $h^{\,\prime}(y) > 0 $ for all $y$. So we have two equivalent optimization problems, A and B: $$A:\quad{\max}_{0 \lt x \lt 1} \quad 2\ln (x) + 3\ln(1-x) \qquad\Leftrightarrow\qquad B:\quad{\max}_{y} \quad 2\ln\left( {e^y \over 1 + e^y} \right) + 3\ln\left(1-{e^y \over 1 + e^y}\right).\nonumber$$ The difference is that the optimization algorithm can try any value of $y$ in problem B and the result is always a well-defined objective.
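
To see the trick in action, here is a minimal sketch in Python using SciPy's unconstrained BFGS routine. This is our own illustration, not the Max routines used elsewhere in these notes, and all helper names are ours. The unconstrained optimizer works on problem B and the answer is mapped back to $x$:

    import numpy as np
    from scipy.optimize import minimize

    def h(y):
        """The (Hprob) transformation: maps any real y into (0,1)."""
        return np.exp(y) / (1 + np.exp(y))

    def negB(y):
        """Problem B objective, negated because minimize() minimizes."""
        x = h(y[0])
        return -(2*np.log(x) + 3*np.log(1 - x))

    res = minimize(negB, x0=[0.0], method="BFGS")   # unconstrained in y
    print("y* =", res.x[0], " x* =", h(res.x[0]))   # x* should be close to 2/5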

But we do not really care about $y$ and problem B; we really care about $x$ and problem A. So we would like to deal with values of $x$ only and let the optimization algorithm deal with $y$. To do this we have to be able to go in the opposite direction: if we want to evaluate the objective for a particular value of $x$, call it $x_0$, what value $y_0$ would we need to send to problem B? We have to invert the transformation. Given $x_0$, solve for $y_0$ such that $x_0 = e^{y_0}/(1+e^{y_0})$: $$y_0 = \ln\left( {x_0 \over 1-x_0} \right).\nonumber$$ Notice this is well-defined as long as $x_0$ is in the feasible interval $(0,1)$. So if we could tell the optimization algorithm that $x$ should stay inside $(0,1)$ we would not have to do anything else. We would just send an initial $x_0$; the enhanced algorithm would work with $y$, but before asking your code to evaluate the objective it would transform back to $x$.
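
Continuing the sketch above, the inverse transformation lets the user supply and receive values of $x$ while the algorithm works with $y$ (again, the helper names are ours):

    def h_inv(x):
        """Inverse of (Hprob): maps x in (0,1) back to the real line."""
        return np.log(x / (1 - x))

    x0 = 0.5                       # user chooses a feasible starting value
    res = minimize(negB, x0=[h_inv(x0)], method="BFGS")  # algorithm sees y only
    print("x* =", h(res.x[0]))     # answer reported in terms of x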

Table 12. Transformations for Constraints on Parameters

Now consider choosing two parameters, $x_0$ and $x_1$, where you want to guarantee that $x_1 \gt x_0 \gt 0$. Now the lower bound on $x_1$ is dynamically determined by the current value of $x_0$. Can we use transformations to ensure this ordering? Yes, as long as we apply them in order: $$\eqalign{ x_0 &= e^{y_0}\cr x_1 &= x_0 + e^{y_1}.\cr}\nonumber$$ For any real $(y_0,y_1)$, $x_0$ is positive and $x_1$ exceeds $x_0$ by the positive amount $e^{y_1}$.
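
A sketch of this ordered transformation and its inverse (our names; note the inverse requires a starting point that already satisfies $x_1 \gt x_0 \gt 0$):

    import numpy as np

    def to_x(y):
        """Map unrestricted (y0, y1) to (x0, x1) with x1 > x0 > 0."""
        x0 = np.exp(y[0])
        x1 = x0 + np.exp(y[1])
        return x0, x1

    def to_y(x0, x1):
        """Inverse transformation; requires x1 > x0 > 0."""
        return np.log(x0), np.log(x1 - x0)

    print(to_x(to_y(1.5, 2.0)))    # round trip recovers (1.5, 2.0)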

Constrained Optimization

Transformation of variables can handle many kinds of constrained optimization problems, but there are important problems where it is not possible to specify an open interval ahead of time. You should recall that we solve constrained optimization problems by forming a Lagrangian objective $${\cal L}(x,\lambda) = f(x) - \lambda \circ g(x).\tag{L}\label{L}$$ This is more general than you are probably used to. First, $f(x)$ is the objective and $x$ is the choice vector, just as before. The function $g(x)$ is a system of equations that represent constraints on the choice of $x$. You are perhaps used to imposing a budget constraint: $$p_0 x_0 + p_1 x_1 = m.\nonumber$$ The choice of $x$ has to be affordable. To fit our notation we have to rewrite this equation as something we want the choice vector to set to 0. That is, the budget constraint is satisfied if expenditure minus income equals zero: $$p_0 x_0 + p_1 x_1 - m = 0.\tag{Budget1}\label{Budget1}$$ So now we can see that in this case $g(x) = p_0 x_0 + p_1 x_1 - m$ and we have to maximize $f(x)$ subject to $g(x)=\overrightarrow{0}$, where $\overrightarrow{0}$ is a vector of 0s.
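
As a concrete check, here is a sketch of a log-utility consumer facing a budget constraint in the form of (Budget1), solved with SciPy's constrained minimizer. The closed-form demands $x_0^\star = {a \over a+b}{m \over p_0}$ and $x_1^\star = {b \over a+b}{m \over p_1}$ provide a benchmark. The example and all names are ours:

    import numpy as np
    from scipy.optimize import minimize

    a, b = 2.0, 3.0                # utility weights
    p0, p1, m = 1.0, 2.0, 10.0     # prices and income

    def negU(x):
        """Objective f(x) = a ln(x0) + b ln(x1), negated for minimize()."""
        return -(a*np.log(x[0]) + b*np.log(x[1]))

    def g(x):
        """Budget constraint written in the g(x) = 0 form of (Budget1)."""
        return p0*x[0] + p1*x[1] - m

    res = minimize(negU, x0=[1.0, 1.0], method="SLSQP",
                   bounds=[(1e-6, None)]*2,        # keep the logs well-defined
                   constraints=[{"type": "eq", "fun": g}])
    print(res.x)                   # closed form: x0* = 4, x1* = 3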

What is different is that we may have multiple constraints, so that $g(x)$ is a system of $r$ equations in $n$ unknowns. If we have as many constraints as choices we will not have any choice left! So $r \lt n$ is the usual problem. A single budget constraint is the case where $r=1$ and $g(x)$ is linear in $x$.

You may also recall that $\lambda$ is the Lagrange multiplier on the constraint. We need a multiplier on each constraint, so in general $\lambda$ is an $r\times 1$ vector of multipliers. And $\lambda \circ g(x)$ is the inner product of the two vectors: $\lambda \circ g(x) = \sum_{i=1}^r \lambda_i g_i(x)$.

We are now going to go farther than you are used to by allowing for inequality constraints: $$h(x)\,\ge\,\overrightarrow{0}.\tag{Ineq}\label{Ineq}$$ In other words, what is required is that the values of the equations in the system $h(x)$ are non-negative. The set of choice vectors that satisfy this condition is the feasible region. Of course, budget constraints can be treated as inequality constraints too, but Lagrangian problems are much easier when we know the solution will hold with equality. The reason to allow both equality and inequality constraints is as follows: with multiple constraints, and with objectives that are not always increasing, we may want to be inside the feasible region rather than on the edge where $h_i(x)=0$.

The most general problem is then one that combines both equality and inequality constraints:

Definition 3. A general Kuhn-Tucker Optimization Problem

$${\cal L}(x,\lambda,\gamma) = f(x) - \lambda \circ g(x) + \gamma \circ h(x).\tag{Kuhn}\label{Kuhn}$$ At a solution $x^\star$ with multipliers $(\lambda^\star,\gamma^\star)$, the first order conditions require stationarity of ${\cal L}$ in $x$, the equality constraints $g(x^\star)=\overrightarrow{0}$, the inequality constraints $h(x^\star)\ge\overrightarrow{0}$ (IEC), complementary slackness $\gamma^\star \circ h(x^\star) = 0$ (CS), and non-negative multipliers $\gamma^\star\ge\overrightarrow{0}$ (POS). (Note the sign on the inequality penalty: with $h(x)\ge\overrightarrow{0}$ in a maximization problem, this convention makes the multipliers on binding constraints non-negative.)

Inequality constraints are more complicated to handle than equality constraints. Suppose there is only one inequality constraint, so $h(x)$ is a single equation. Suppose that at some point $x^o$ we have $h(x^o) = 3 \gt 0$, meaning the constraint is not binding at $x^o$. Near $x^o$, $h(x)$ remains positive and improvements in $f(x)$ can be pursued as if the problem were unconstrained by $h$. If it turns out that $x^\star = x^o$ then there is no marginal cost to imposing $h$. This means that the correct penalty in the Lagrangian should be 0. But we cannot know that ahead of time.

On the other hand, suppose $h(x^d)=0$, so $h$ is binding at $x^d$. The objective may improve in a direction from $x^d$ in which $h(x)$ turns negative. If $x^\star=x^d$ then the penalty on $h()$ reflects this marginal cost, just as $\lambda$ does for an equality constraint in the Lagrangian.

So at the solution either the penalty or the value of the constraint is zero. This is called complementary slackness. But which constraints bind cannot be determined ahead of time. And since each constraint is either binding or not, there are many combinations to consider: with $p$ constraints there are $2^p$ different possible sets of active constraints.

Algorithm 19. Kuhn-Tucker Programming

Given
$f(x)$, $h(x)$
  1. Initialize ${\cal A}_0$, the first guess of the active constraint set. Set $t=0$.
  2. Set $\gamma^\star_j=0$ for $j\in {\cal A}_t^c$ (the inactive constraints). Solve for $x^\star_t$ and the remaining elements of $\gamma^\star$ as a Lagrangian problem, imposing the constraints $j \in {\cal A}_t$ as equalities. Compute the value of all constraints at $x^\star_t$, $h(x^\star_t)$.
  3. Initialize ${\cal A}_{t+1}$ as a copy of ${\cal A}_t$. If $j\in {\cal A}_t^c$ and $h_j(x^\star_t)\le 0$ then move $j$ into ${\cal A}_{t+1}$ and now consider $\gamma_j$ free. If $j\in {\cal A}_t$ and $\gamma^\star_j\le 0$ then move $j$ out of ${\cal A}_{t+1}$ and set $\gamma_j=0$.
  4. If ${\cal A}_t = {\cal A}_{t+1}$ then (IEC), (CS) and (POS) are all satisfied at $x^\star_t$ and $\gamma^\star$. Stop iterating. (EXIT this recipe)
  5. Otherwise, one or more constraints were misclassified. This may make $x^\star_t$ either infeasible or suboptimal, and subsequent changes may affect other constraints that were conditionally correct. Increment $t$ and return to step 2.
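
To make the recipe concrete, here is a minimal Python sketch for a small quadratic example, $\max f(x) = -\|x-c\|^2$ subject to linear constraints $h(x) = Ax + b \ge \overrightarrow{0}$. With this structure the step-2 Lagrangian problem is a linear system, so each iteration is one linear solve plus the reclassification rules. The example and all names are ours:

    import numpy as np

    c = np.array([2.0, 1.0])              # unconstrained maximizer of f
    A = np.array([[-1.0, 0.0],            # h_0(x) = 1 - x0 >= 0  (x0 <= 1, binds)
                  [ 0.0, 1.0]])           # h_1(x) = x1 >= 0      (slack)
    b = np.array([1.0, 0.0])
    p, n = A.shape

    active = set()                        # step 1: guess no constraints are active
    for t in range(10):
        S = sorted(active)
        k = len(S)
        # Step 2: stationarity of L = f + gamma.h in x, plus h_j = 0 for j in S,
        # stacked as one linear system in (x, gamma_S).
        K = np.zeros((n + k, n + k))
        K[:n, :n] = 2*np.eye(n)
        K[:n, n:] = -A[S].T
        K[n:, :n] = A[S]
        rhs = np.concatenate([2*c, -b[S]])
        sol = np.linalg.solve(K, rhs)
        x, gam = sol[:n], dict(zip(S, sol[n:]))
        h = A @ x + b                     # evaluate all constraints at x*_t
        # Step 3: reclassify; step 4: stop when the active set is stable.
        new_active = {j for j in range(p)
                      if (j not in active and h[j] <= 0)   # violated -> activate
                      or (j in active and gam[j] > 0)}     # keep only if gamma > 0
        if new_active == active:
            break
        active = new_active               # step 5: increment t and repeat
    print("x* =", x, " gamma =", gam)     # expect x* = (1, 1), gamma = {0: 2}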

Sequential Quadratic Programming

The idea of Sequential Quadratic Programming (SQP; see Judd, section 4.7) is an extension of sorts of Newton's method for unconstrained optimization. Recall that we can think of Newton's method as optimizing a quadratic approximation to the objective built from the gradient and Hessian information. SQP instead takes a quadratic approximation to the Lagrangian, to account for the penalties, and linear approximations to the constraints themselves. So it maximizes a quadratic objective subject to linear constraints (both equality and inequality if applicable). Specialized methods can solve this kind of linear-quadratic problem in polynomial time.

First, suppose the algorithm is currently at some point $x^t$. At the same time we have to specify current values of the multiplier vectors, $\lambda^t$ and $\gamma^t$. Some of the inequality multipliers might be 0, meaning those constraints are currently inactive. The Hessian of the Lagrangian is $$H_{xx}{\cal L}(x,\lambda,\gamma) = \left[ {\partial^2 {\cal L} \over \partial x_i \partial x_j }\right] = Hf(x) - H[\lambda\circ g(x)] + H[\gamma\circ h(x)].\tag{HL}\label{HL}$$ This is an $n\times n$ matrix ($n$ being the number of parameters maximized over). Although the constraints are systems of equations, they are aggregated into a single number by the inner product with the multipliers, so the penalties themselves have symmetric Hessian matrices. The multiplier vectors $\lambda$ and $\gamma$ enter the matrix through the penalties, but we differentiate only with respect to the choice variables, not the multipliers. The constraint systems $g$ and $h$ have Jacobian matrices, $Jg(x)$ and $Jh(x)$, which are $r\times n$ and $p\times n$, respectively.
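
Concretely, since the inner product is just a multiplier-weighted sum, the Hessian of each penalty is the corresponding weighted sum of the individual constraint Hessians: $$H[\gamma\circ h(x)] = H\left[\sum_{i=1}^{p}\gamma_i h_i(x)\right] = \sum_{i=1}^{p} \gamma_i\, Hh_i(x),\nonumber$$ with each $Hh_i(x)$ an $n\times n$ symmetric matrix; the same applies to $\lambda\circ g(x)$.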

Then the quadratic objective with linear constraints, expressed in terms of the step $s$ away from $x^t$, can be written: $$\eqalign{ {\max}_{s\in \Re^n} \quad \nabla f(x^t)'\,s + {1\over 2}\,s'\, &H_{xx}{\cal L}\left(x^t,\lambda^t,\gamma^t\right)\, s\cr g(x^t) + &Jg(x^t)\,s = \overrightarrow{0}\cr h(x^t) + &Jh(x^t)\,s \ge \overrightarrow{0}.\cr }\tag{LQ}\label{LQ}$$ The constraints are linearized around $x^t$, so a step that keeps the linearized $g$ at $\overrightarrow{0}$ and the linearized $h$ non-negative is feasible to first order.

Algorithm 20. SQP Iteration

Given a Kuhn-Tucker problem defined above.
  1. Pick initial guesses $x^0$, $\lambda^0$, and $\gamma^0$. Set $t=0$.
  2. At iteration $t$, compute $H_{xx}{\cal L}$, $Jg$ and $Jh$ to form the linear-quadratic approximation in equation (LQ). That problem has closed-form solutions which determine a step $s^\star$ and a new set of multipliers. If $\| s^\star \| \mathop{=_\epsilon} 0$ then stop (strong convergence).
  3. Update $x^{t+1}=x^t+s^\star$. Then update $\lambda^{t+1}$ and $\gamma^{t+1}$. Increment $t$ and return to step 2.
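
As a quick check of the idea before turning to the course routine documented below, SciPy's SLSQP method belongs to the SQP family; pointing it at the same small example used in the Kuhn-Tucker sketch recovers the constrained maximizer (our example, not the course code):

    import numpy as np
    from scipy.optimize import minimize

    # max -(x0-2)^2 - (x1-1)^2  s.t.  1 - x0 >= 0  and  x1 >= 0
    res = minimize(lambda x: (x[0]-2)**2 + (x[1]-1)**2,
                   x0=[0.0, 0.0], method="SLSQP",
                   constraints=[{"type": "ineq", "fun": lambda x: 1 - x[0]},
                                {"type": "ineq", "fun": lambda x: x[1]}])
    print(res.x)                   # expect approximately (1, 1)
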
#import "maxsqp"

    MaxSQP(func, ax, af, aH, useNH, ineq, eq, Lo, Hi)

        Argument        Explanation
        ------------------------------------------------------------------------------------
        func            objective (in same form as other Max routines require)
        ax              address of starting values and where to place final values
        af              address to place final objective value
        aH              0 or address of where to place final Hessian
        useNH           1=use numerical first derivatives; 0=func provides them when asked
        ineq            h(x), system of INequality constraints, in form asked by SolveNLE
        eq              g(x), system of Equality constraints, in form asked by SolveNLE
        Lo              empty matrix or nx1 vector of lower bounds on x
        Hi              empty matrix or nx1 vector of upper bounds on x

    Returns
        Convergence code like MaxBFGS