choices are sequential over time and choices can be made contingent on what has happened in the past leading up to the current situation.
the value of choices now depends on what may happen in the future
choices now affect what happens in the future, and therefore anticipating future choices is crucial for understanding choices today.
Does Dynamic Programming Include Simpler Situations as Special Cases?
A single choice in a single state is the simplest special case of DP. So static optimization can be considered a special case of DP.
When there are multiple choices but they are not linked together through time, this too is a special case.
If there are multiple choices linked together through time but the agent does not care about the future, then this again is a special case of DP. It is the case of myopic decision-making.
What is the DDP component of niqlow?
DDP is the component of niqlow that lets a user design and solve a discrete dynamic program. It is written in Ox and requires Ox to run. The user imports DDP into their Ox program to use it.
Why Discrete Dynamic Programming (DDP)?
DDP is a term used in niqlow that applies to models that have discrete (as opposed to continuous) choices and/or states. A model of decisions with discrete states (including, as a very special case, a single state/decision) but continuous choices typically involves solving first order conditions or carrying out non-linear optimization. DDP does not incorporate continuous choice directly.
What about continuous choice?
Optimization over continuously varying controls (and solving non-linear systems) is the topic of the FiveO component of niqlow. Methods that embed continuous choice inside a discrete state space are therefore a hybrid of the two parts of niqlow.
What does this first part of the document do?
It presents dynamic programming from scratch, starting with simple static choices and building up to dynamics, and it shows how the key concepts are represented in DDP.
What does part 1 of the document not do?
It does not teach computer programming or the Ox programming language. It also does not present the general DDP framework nor the technical elements of DDP in a systematic way. If you are already comfortable with Ox and with dynamic programming, then you might find this document too slow. The second part below may be a better place to start (and it includes links to examples that presume more background knowledge than what is presented here).
What about estimation?
Within niqlow, estimation refers to finding statistical estimates (guesses) of parameters of the model using data that is assumed to be generated by the model. A parameter is a quantity that enters the model and whose value is determined outside the model (an exogenous or so-called structural parameter). Parameters are typically continuous values and so guessing or estimating them is typically carried out using algorithms in FiveO. DDP includes methods to handle data generated by the model or created outside but assumed to be generated from the model. This document does not concern itself with DDP data and estimation, which is discussed in Data.
Discrete Choice
We start with choice without states or dynamics, a static framework so simple that it might seem trivial. The computer code shown to implement the model will appear to be overkill. Bear with this simple start, because you will see that all the elements of dynamic programming in many areas of economics can be handled easily within the framework.
The choice set
Choice is simply picking from a set of options. We will let \(A\) denote the finite set of possible actions to choose among. And we will let an arbitrary element of \(A\) be denoted \(\alpha\). We will work through a couple basic choices:
Example 1A. Bob wants to go to university. Four have accepted him. These four places make up his discrete choice set:
\(A = \) {Harvard, Yale, Queen's, McGill}.
We sometimes need a symbol for the number of options available and will use \(J\) for that purpose. So for Bob, \(J=4\). Of course there are other schools, but Bob either did not apply or did not get into them. They are infeasible choices. \(A\) is his feasible choice set.
While \(\alpha\) stands for any of the feasible actions, at some point Bob makes a choice. The school he chooses becomes special, and we distinguish the choice made from the options considered. If Bob chooses to go to Yale we would write \(\alpha^\star = \)Yale. In these notes things marked with \(\star\) are typically related to the optimal choice.
In some cases we might use discrete choice to approximate a continuous choice, so here is another example:
Example 2A. Tommy is choosing how much time to study for a test the next day.
So the choice might be any number between 0 and, say, 12 hours. It would then be natural (or convenient or intuitive) to think of the choice set as \(A=[0,12]\), to let \(\alpha\) denote any real number in that range, and to characterize the optimal choice with a first-order condition on marginal utility, \(U^{\,\prime}(\alpha)\). However, for various reasons, we stick with a discrete choice set to model Tommy's choice. To be specific, we might think of him as studying for a whole number of hours: \(A=\) {0,1,2,…,12}. In this case \(J=13\).
What if Tommy were an actual person (say in a data set) and studied 2.2 hours?
Let's write \(\hat\alpha\) = 2.2 to denote the observed choice. This would suggest that Tommy did not choose among whole hours.
This difference between actual and model behaviour might not matter to us (this choice might be part of a much bigger model). So we might approximate by rounding to the nearest element of \(A\) so that
\(\alpha^\star=\)2 would explain Tommy's choice.
This is getting ahead of ourselves, but you might be wondering about accepting 2 to stand for an observed choice of 2.2. One way in which empirical DDP work deals with this kind of discrepancy is to assume that there is measurement error in the data.
So the difference \(d = 2.2-2 = 0.2\) would be attributed to the measurement error (i.e. Tommy says he studied 2.2 hours but he really studied exactly 2 hours). This can bother empirical people who are used to working with, say, regression models. But the error in a regression model plays a similar role in "explaining" the difference between observed and predicted values.
Utility
How did Bob choose Yale? And how did Tommy choose 2 hours (according to our discrete approximation)?
Perhaps they considered all the pros and cons of their options and carefully weighed the factors to determine which was best. Or maybe they threw a dart at the wall.
Whatever process they used ...
we are not concerned with how the choice was made, but with the fact that a choice was made based on a utility associated with each action.
Optimality and functions: Of course, much of social science concerns choices that are not optimal, but for our purposes what people do will always be optimal for them given their options and preferences. Also, a teacher of mine emphasized that utility is by definition a function, so saying "utility function" is redundant. It would be like saying "truck vehicle" instead of simply "truck," because by definition a truck is a vehicle.
Let \(U(\alpha)\) denote Bob's utility for school \(\alpha\).
\(U(\alpha)\) is just a real number associated with \(\alpha\). Bob can have utility for options that are not feasible (not in \(A\)), such as U(UCLA), but we don't have a complete model until utility is defined for each element of \(A\). So, we can get fancy and write \(U: A\,\to\, \Re\). But since \(A\) is a discrete set, \(U\) is really just four numbers, one for each feasible school.
For example, Bob would set \(\alpha^\star=\)Yale if his utility for Yale were higher than his utility for each of the other three schools.
Many other utility levels would explain Yale as an optimal choice.
Multiple Dimensions
In many cases we want to model choices in more than one dimension.
Bob was making a choice in the school dimension. But he might also be making decisions in other dimensions, such as major, roommate, etc. Before showing DDP code for Bob's choice, let's have him decide his major at the same time. One way to do this is to convert a choice in several (discrete) dimensions into a one-dimensional choice by simply listing all the possible combinations. Another way is to make \(\alpha\) a vector rather than a scalar. We illustrate both ways in the next example.
Example 1B: Bob is choosing both a university and a major. Which major is best might depend on which school he chooses.
For simplicity, suppose Bob's parents have told him he has to choose either Economics or Physics.
We could simply expand his choice set as follows: {Harvard-Econ, Harvard-Physics, Yale-Econ, Yale-Physics, Queen's-Econ, Queen's-Physics, McGill-Econ, McGill-Physics}.
Utility is then a number assigned to each of these J=8 options. This collapses a choice in two dimensions into a longer one-dimensional choice.
However, it can be more convenient not to collapse the two dimensions, but to consider them separate but simultaneous. In this case, we can let school be the row and major the column:
BOB'S UTILITY ON THE FEASIBLE MAJOR-SCHOOL CHOICE SET
            Econ    Physics
---------------------------
Harvard     -20      -18
Yale          4.2     -0.6
Queen's       1.5      3.2
McGill      -25       -0.5
Apparently Yale has a good Econ program, but if Bob had not gotten into Yale he would have chosen Queen's and majored in Physics.
We can also keep the dimensions separate by letting \(\alpha\) be a vector of action variables: \(\alpha = (a_0, a_1, \dots, a_{D-1})\).
In the major-school choice, D=2 because there are two dimensions of choice: \(a_0\) is the major choice and \(a_1\) the school choice. Start at 0? We start counting at 0 because that is the way counting is done in many computer languages. So doing so now may avoid some confusion later when we see code.
Although using rows and columns to represent two different action variables is clear, it is not very helpful when there are three or more variables.
We can combine the list version with the vector version, as in the full description of Bob's problem below. In DDP, the action vector is built by adding ActionVariables to it as part of the set-up of the model. The creation of the list of actions shown there is done for you by DDP as you add action variables to the action vector.
As the user you only have to concern yourself with which variables are chosen and how many different values they take on. You can provide value labels for each action variable you add to the model. In the example above "Econ" would be a label for an integer option value.
Full Description of Bob's Problem
As with the first example, each row is an action \(\alpha\), but it is associated with values of two action variables. Labelling the choices certainly helps us understand them, but a computer program that is designed to handle any kind of choice cannot hard-code labels like "Harvard" and use them to refer to utility. Hopefully it is clear that in the abstract the choice over the four universities is just a case of choosing among 4 possibilities. The labels are helpful to us, but generically the choices can just be numbered 0, 1, 2, and 3.
The school-major choice can be described with action variables that are just integers along with labels for each value.
α = (a0 a1)
Index Label Options Choice Set Value Labels
-------------------------------------------------------------------------
0 Major 2 0 … 1 Econ, Physics
1 School 4 0 … 3 Harvard,Yale,Queen's,McGill
A = [0…1] × [0…3]
a0 a1 U(α)
-------------------------
0 0 -20
1 0 -18
0 1 4.2
1 1 -0.6
0 2 1.5
1 2 3.2
0 3 -25
1 3 -0.5
Bob's Choice
So far we just have 8 options with arbitrary values (utilities) assigned to them. And it turns out that the maximum of those utilities is 4.2 for the action of choosing Yale and Econ.
$$\eqalign{
\alpha^\star &\equiv \arg\max_{A}U(\alpha) = (0,1) = (\hbox{Econ},\hbox{Yale})\cr
EV &\equiv \max_{A} U(\alpha) = U(\alpha^\star) = 4.2.\cr}$$
V and EV
In microeconomics the highest utility possible is named indirect utility, but in dynamic programming it is usually called the (optimal) value of a state. So, above, the value Bob receives from his choice once optimized is denoted \(EV\).
We use \(EV\) and not just \(V\) in anticipation that there can be elements of uncertainty in the choice. In particular, the person deciding will know everything up until today when the choice is made, but they will have to make that choice without knowing everything that will happen in the future (tomorrow). \(EV\) will end up being good notation since, when deciding today, they will have to average (take the expectation of) their optimal choices made tomorrow when they have more information than they have now.
Ties
The notation also assumes there are no ties: the optimal choice is unique. That assumption is not strictly necessary. Ties are handled properly by DDP, but assuming no ties does make the example simpler.
Coding Bob's Choice in DDP
Bob's choice is simple.
Simply look at the 8 numbers that make up his utility.
Find the biggest number.
Look up the labels associated with that action. Finished.
These operations could easily be done by hand, in a spreadsheet, or even in Ox using its built in maxc() and maxcindex() routines. When you look at the code for Bob's Choice in DDP you will see that it has some elements that are not obvious. Indeed, it relies on some sophisticated features of the Ox programming language. Even if you have done some programming in similar languages such as Matlab, Python or R, your reaction may be: that is a lot of complexity for such a simple choice. And you are right!
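For instance, the "by hand" calculation can be done in a few lines of plain Ox (a stand-alone sketch that does not use DDP; the utilities are those listed above):

    #include <oxstd.h>
    main() {
        decl U = < -20; -18; 4.2; -0.6; 1.5; 3.2; -25; -0.5 >;  // utilities in the order of the table above
        println("Highest utility: ", maxc(U));                  // 4.2
        println("Row of the best action: ", maxcindex(U));      // 2, i.e. the Econ-Yale row
        }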
However, hopefully you see some logic in the complications so that the code is not completely unrelated to the problem as you understand it. And you might see that some of the complication is there to make real models easier to build than if simple tools only were used.
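To fix ideas before the line-by-line discussion, here is a minimal sketch of what the whole BobsChoice.ox program looks like, reconstructed from the description below (details such as exactly how the option labels are passed to ActionVariable() may differ from the actual file):

    #import "DDP"
    struct BobsChoice : OneStateModel {
        static decl maj, sch;     // will hold the two action variables
        static Decide();          // sets up and solves the model
        Utility();                // returns the vector of utilities over A
        }
    main() {
        BobsChoice::Decide();
        }
    BobsChoice::Decide() {
        maj = new ActionVariable("major",{"Econ","Physics"});
        sch = new ActionVariable("school",{"Harvard","Yale","Queen's","McGill"});
        Initialize(new BobsChoice(),0,maj,sch);
        VISolve();
        }
    BobsChoice::Utility() {
        return < -20; -18; 4.2; -0.6; 1.5; 3.2; -25; -0.5 >;
        }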
#import is a way to tell Ox that you are using code that is not part of this file. In this case, the program is importing DDP!
#import and #include: If you want to know more, see Multiple files in Ox. Most examples shown in Ox start with #include "oxstd.h". That is done inside DDP, so it is not necessary to do it explicitly here.
class … { … }: Class Declaration
A user's model is represented by a class, which is a way to combine data and functions that work on the data in one package. This is called object oriented programming. See OOP in Ox if you are familiar with objects already and want to know how they work in Ox.
Declare and Define: In a programming language like Ox your program will have different things in it. The language has to know what things are in the program, and this is the idea of declaring something. This is like listing a chapter of a book in a table of contents. But Ox also has to know what the things are, which is like the contents of the chapter. This is defining. How things are declared and defined in Ox depends on the kind of thing it is. And, in some cases your program might declare and define something at the same time, not separately.
If you are not used to OOP, it can be mysterious and confusing. Ox can be used without OOP, but it is the way in which DDP lets the user develop an economic model, so it is essential for our purpose. So here is the basic idea. The program is telling Ox that a new class is going to appear in this program. A class is a description of a kind of thing (objects). These lines describe the class for Ox, which is just a list of the variables (also called members) and functions (also called methods) that make up the class. Every class has a name, and the code gives this class the name BobsChoice.
One of the powerful (but complicated and confusing) features of OOP is that one class can be based on another class that is already defined. In this case, BobsChoice is based on a class called OneStateModel which is defined in DDP. The name is meant to imply that this model is really simple. In real dynamic programs there are many states at which the person is making choices. Here Bob makes a choice once, so there is just one state.
In turn, OneStateModel is a class based on other classes which are more flexible. The advantage of having a special class for a simple one state model is that some things can be done for the user.
The choice involves two action variables, and this class makes room for them with the static decl statement. The names are short but understandable in the context. decl is short for declare and just makes room for two things. What they end up holding is determined by other parts of the program.
The static tag is important but at this point it is not necessary to understand why it is there. It does not have anything to do with the fact that Bob's choice is a static choice.
The BobsChoice class also has two functions in it: Decide() and Utility(). Note that Decide() is declared static but Utility() is not. Again, this is not important to understand yet. And like decl, listing functions in the declaration of a class does not say anything about what they do. They have to be defined later.
main(){…}
Every Ox program has to have a function (or routine) called main(). This is where the Ox program actually starts. Even though the class declaration comes first in the file, it is main() that is the first thing to happen. I have written the code so that main() just does one thing: it asks that Decide() be executed. Once Decide() is finished, main() is finished (no other statements appear inside main()). Most experienced programmers make their main routines pretty simple.
BobsChoice::Decide(){…}
Above, the class declaration said that a routine belonging to BobsChoice and named Decide() would appear, and these lines define what this routine does. They are the lines of code that are executed when main() refers to the function.
Decide() does four things. That is, it has four statements, each ending with ;. The first two say that maj and sch will each contain an action variable. The routine that creates them, ActionVariable(), is part of DDP, so it would not work to call it if we had not imported DDP.
Hopefully, you can see that the labels attached to the two variables are "major" and "school", respectively. Like nearly all computer languages, Ox asks you to put text inside quotes. So maj, which is not in quotes, refers to a variable with that name. Ox will make room for it because it was declared in the class declaration. However, the action variable has a name, "major", which is just those characters, not a variable or a routine.
In Bob's choice he had only two majors to choose from. We could write ActionVariable("major",2) and this would mean maj would hold an action variable with two possible values. But this would not give meaningful names (or labels) to the two options: they would just be coded as 0 and 1. By sending two strings in quotes instead of 2, each major gets a meaningful label. The routine ActionVariable() counts the labels and knows that there are two choices. They are still coded as 0 and 1, but those codes now have the labels "Econ" and "Physics". This will make some output easier to read.
The variable sch will contain another action variable that takes on four values (because a list of four labels is sent).
The next statement calls a routine with the name Initialize(). This routine is part of DDP not Ox itself. As the name implies, it initializes or sets up the problem meant to solve Bob's Choice. Four things are sent to Initialize(). The order is important. The first is a bit mysterious: new BobsChoice(). The second is the number 0. As with some other parts of the code, it is not yet important to understand how these are chosen or used for this problem. In particular, to explain what new BobsChoice() means and why it is there would take us on a tangent that is not necessary for now.
However, it is important to see that the variables maj and sch are sent as well. This is how the OneStateModel in DDP knows what action variables are part of the model.
You may think that the line that says maj contains an ActionVariable would add maj to the model automatically. It would be possible to make it work that way, but for other aspects of DP models (namely state variables) it is easier to separate the creation of a variable from including it in the model. So to be consistent, the same procedure is followed for action variables.
The last thing Decide() does is call a routine VISolve(), where "VI" stands for "value iteration." That is a technique for solving a dynamic programming problem. This one-state model is the simplest example of a DP problem, but one part of value iteration is to solve for the optimal choice at each state. This is where DDP will actually solve Bob's problem.
BobsChoice::Utility(){ …}
Any DDP problem has a utility, and here is Bob's. In general, the utility has to return (send back to where the utility was called) a vector in the same order as the actions \(\alpha\). The use of < -20; … > is the way Ox lets you hard-code a vector or matrix of numbers. So the list of utilities above has been copied and written in that format. We should definitely check that the action variables are organized so that the numbers match up: we wouldn't want Bob to mistakenly go to McGill and major in Physics.
Usually Utility() is a function of actions (and later states) and parameters of the problem. So Bob's utility is not at all typical. Later examples in the documentation and exercises will have utilities that are more like what shows up in real DP problems.
Run the program and look at the output.
Getting Ox to run BobsChoice.ox and be able to use DDP is not hard if you are used to doing this kind of thing. Otherwise, it can be a pain for the first few times you try something. Go here ??? for some help.
The output you get should look something like this:
That output contains a lot of information that is not relevant to this simple case, so here are selected parts of the output:
Selected Output
-------------------- DP Model Summary ------------------------
4. ACTION VARIABLES
Number of Distinct action vectors: 8
major schoo
a.N 2 4
6. FEASIBLE ACTION SETS
alpha A[0]
----------------------
(00) X -Econ-Harvard
(10) X -Physics-Harvard
(01) X -Econ-Yale
(11) X -Physics-Yale
(02) X -Econ-Queen's
(12) X -Physics-Queen's
(03) X -Econ-McGill
(13) X -Physics-McGill
#States 1
----------------------
Key: X = row vector is feasible. - = infeasible
Value of States and Choice Probabilities
------------------------------------------------------------------------------
Indx I T A q t r f EV |Choice Probabilities:
0 1 0 0 0 0 0 0 4.200000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000
------------------------------------------------------------------------------
Explanation of Relevant Output
The "DP Model Summary" is produced by Initialize() after everything is set up. (In models that have more than one state Initialize() only does some of the things done here. The rest are done by a routine that it calls but typically is called from the user's program.) Part 4 of the summary shows you the action vector \(\alpha\). It shows that the new action variables assigned to maj and sch were added to the vector and that together they create 8 different options.
Part 6 shows you the feasible action set \(A\). It shows you each action \(\alpha\) as two integer values and the labels that go along with them. Recall that Yale-Econ was the third option in the list and it appears that \(A\) was set up so that it matches up with the vector in Utility() (but make sure!).
In real DP problems the actions that are feasible can depend on which state the system is at. So this output is set up to show you the different values of \(A\) in the model. But with one state only there is one feasible set only. Thus \(A\) is called A[0], because in some cases there will be A[1] and so forth.
VISolve() produces the table of values and choice probabilities. The first 8 columns really don't apply to this simple model because there is only one state. The table is designed to show the value of each state in the model.
The parts that matter are the EV and Choice Probabilities. Recall that EV is the DP version of indirect utility, and we know that the best Bob can do is 4.2. And the optimal choice is the third element of \(A\), which Bob should choose with probability 1.0. All the other options are sub-optimal and are chosen with 0 probability.
Summary
Here is what is important to gather from the example and code above.
At the heart of a DDP model is discrete choice. Discrete dynamic programming links together many different discrete choices that are connected by state variables that evolve based on the choices made.
In DDP you build your model by defining a class that is derived from one of the built-in Bellman classes. The simplest model to build from is OneStateModel, which is a single discrete choice (one state, no dynamics, etc).
The discrete choice \(\alpha\) is a vector of action variables, each taking on a finite number of values. Your model builds \(\alpha\) by creating new ActionVariables and adding them. With a OneStateModel you send all the action variables to Initialize(). In general your code will add the action variables itself after calling Initialize().
Your model must supply a Utility() which returns a vector of numbers corresponding to the elements of the feasible set \(A\).
The optimal choice for the model can be found by calling VISolve(), which does several things. For a one state model it simply finds \(\alpha^\star\), the maximizing action.
Choice Probabilities
The examples above are not particularly interesting models of behaviour. For one thing, the utility values are arbitrary, whereas a good model would relate utility to observable characteristics of the chooser and the choices. And, second, suppose Bob did not choose Yale-Econ? Then our model is incorrect. This is not surprising, because we typically are not modeling a single person's choice and we will never have access to the utility of all the options. Instead, real discrete choice models provide probabilities of people making choices, not a 0/1 outcome.
The logit model of choice includes a continuous random variable in the value of a choice:
$$v(\alpha) = U(\alpha) + z_\alpha.$$
The extra component \(z_\alpha\) is assumed to follow the Type I Extreme Value (Gumbel) distribution. Across options the \(z_\alpha\) values are independent. The term \(U(\alpha)\) corresponds to the Utility() routine that is part of the model in DDP. The user's coding would be the same, and the term \(z_\alpha\) is not explicitly added to it.
The probit model of choice makes a different distributional assumption: each \(z_\alpha\) follows a normal distribution. Across options the \(z_\alpha\) values are again independent (in the independent probit case).
Conceptually ...
Bob sees both Utility() (the part we see or assume or estimate) and the vector of values of \(z_\alpha\) (the part we don't see but assume follows a convenient distribution). With his information, Bob still simply chooses the \(\alpha\) that maximizes \(v(\alpha)\). But since we don't see \(z_\alpha\), any of the options might be optimal to Bob. We get different probabilities of choices being optimal. The probabilities depend on the observed vector of utilities, so the 0/1 vector of output above would become a vector of numbers between 0 and 1. Since Yale-Econ has the highest value of Utility() it will have the greatest probability of being chosen, but the probability will not be 1.0.
Choice Probabilities under logit
Logit:
$$P(\alpha) = {e^{U(\alpha)} \over \sum_{\alpha'\in A} e^{U(\alpha')}} = {e^{U(\alpha)-U(\alpha^\star)} \over \sum_{\alpha'\in A} e^{U(\alpha')-U(\alpha^\star)}}$$
The second version expresses the probability in terms of the utility loss relative to the optimal choice \(\alpha^\star\).
Following the literature, DDP generalizes the logit expression to allow the amount of smoothing to be chosen by setting a parameter \(\rho\):
$$P(\alpha) = {e^{\rho\left(U(\alpha)-U(\alpha^\star)\right)} \over \sum_{\alpha'\in A} e^{\rho\left(U(\alpha')-U(\alpha^\star)\right)}}$$
As \(\rho\to \infty\) the probabilities approach the non-smoothed 0/1 values. As \(\rho\to 0\) the smoothing becomes complete and each feasible choice is equally likely.
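To get a feel for what the smoothed formula implies for Bob's utilities, the logit probabilities can be computed directly in plain Ox (a stand-alone sketch, assuming \(\rho=1\)):

    #include <oxstd.h>
    main() {
        decl U = < -20; -18; 4.2; -0.6; 1.5; 3.2; -25; -0.5 >;
        decl rho = 1.0;
        decl kern = exp(rho*(U - maxc(U)[0]));       // relative-loss form avoids overflow
        println("Logit P(alpha) = ", kern / sumc(kern)[0]);
        }

With \(\rho=1\), Yale-Econ still gets the largest probability (roughly 0.69), but Queen's-Physics now gets about 0.25, and the remaining options share the rest.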
See Train (2001): Train (2001) explains all aspects of discrete choice with an emphasis on econometric applications. Readers are encouraged to refer to Train (2001) for an in-depth discussion of the static case. We start even simpler than Train does. Eventually the set-up starts to look like the material in Train, but we move to dynamic choice before going into the depths of the various static choice models that Train does.
Choice Probabilities under Probit
Independent Probit:
$$\eqalign{ P(\alpha) &= Prob\left( v(\alpha) \ge v(\alpha'),\ \forall \alpha'\in A\right)\cr
&= \prod_{\alpha'\ne\alpha}\ \Phi\left( \rho(U(\alpha)-U(\alpha'))\right)\cr}$$
Again, DDP generalizes the standard probit by varying how important the deterministic component of utility is to the choice. The smoothing parameter is \(\rho = 1/\sigma\), where \(\sigma\) is the standard deviation of the difference \(z_\alpha-z_{\alpha'}\). This smoothing parameter would not be identified if/when \(U(\alpha)\) is based on estimated coefficients on variables related to \(\alpha\).
Correlated Probit:
Smoothing Choice Probabilities in DDP
Select a Smoothing Method
Recall that Bob's Choice was derived from the OneStateModel class of problems. That class is in turn a special case (a derived class) of the ExPostSmoothing class. The choice of smoothing method for the one-state model is made at this point in the code:
Initialize(new BobsChoice(),0,maj,sch)
As you can see from the documentation of Initialize(), the second argument is the smoothing method choice. To avoid having to explain everything at once, it was set to 0 in the code, which is the default choice in the general ExPostSmoothing class.
Another more descriptive way to send 0 is to use a name for zero that is defined in DDP in order to make the internal code easier to follow and debug:
Initialize(new BobsChoice(),NoSmoothing,maj,sch)
Here NoSmoothing is really just a name for the number 0. It is one of the SmoothingMethods that are really just names for the integers 0, 1, and 2. The code for the ExPostSmoothing class of problems will look at the method sent to it and decide if and how to smooth the resulting choice probabilities.
Logit Smoothing
To make Bob's Choice a logit model, just modify the code to ask for logit smoothing:
Initialize(new BobsChoice(),LogitKernel,maj,sch)
What does "kernel" mean? Essentially, a kernel means "smoothing." You can learn more at Kernel's wikipedia page
Probit Smoothing
To make Bob's Choice a probit model, just modify the code to ask for probit smoothing:
Initialize(new BobsChoice(),GaussKernel,maj,sch)
Why Gauss? Because the normal distribution is also referred to as the Gaussian distribution.
Why Ex Post?
In the standard discrete choice econometric model the error term \(z\) is considered something known/observed by the chooser but not by us. In a more complicated situation, namely a dynamic one, Bob might make other choices before he knows the value of \(z\). For example, \(z\) may be observed when Bob wakes up in the morning, but he had to make a choice the day before that was affected by what he does today. In that case, the value of \(z\) is not just smoothing the choice probability for our sake; it also has a direct effect on other decisions.
DDP distinguishes between choice probability smoothing for our own sake and real uncertainty that affects the chooser in a dynamic environment.
The case of ex-post smoothing is when \(z\) is not a real part of Bob's environment. He is really just making a deterministic choice based on \(U(\alpha)\). But after his choice we smooth the probability as if he had also acted on a value of \(z\).
The case of ex-ante smoothing allows \(z\) to play a role in other decisions made by Bob so it is a real "structural" error term.
Exercises
Make a copy of BobsChoice.ox. Try the 3 different smoothing options discussed above. Compare results.
Multiply all the utilities by 5. Confirm that this does not change the optimal choice when there is no smoothing. What happens to the choice probabilities when they are smoothed, compared to the values in the previous exercise?
Multiple States (Unlinked Choices)
Bob and Tommy's choices are one-time discrete choices. By deriving a model of their choices from the OneStateModel class there was no possibility of dynamics or examining how the situation affects the choice.
A state is a situation in which the agent makes a decision. States differ from each other for various reasons. Utility may differ between states (leading to different decisions). The feasible action set may differ between states. The effect of choices on what happens in the future may differ. Obviously the next step from a one-state model is a two-state model. We then double the states again to consider a four-state static choice model.
The State Space
In dynamic programming the choice of action \(\alpha\) is contingent on the state the chooser is in. To store and solve models efficiently, DDP will distinguish between types of states. But a generic state will be denoted \(\theta\). And the set of states the problem might be in is denoted \(\Theta\). As with feasible actions, the state space is constructed by adding state variables to it. We illustrate this with simple extension of Bob's Choice.
Example 1C. Bob is still deciding between schools and majors. However, it is more complicated. He may or may not receive a scholarship to Queen's which he has applied for.
Obviously Bob will make his decision based on whether he gets the scholarship or not, so the choice is contingent upon that. We want the model to capture his choice in both situations. One reason to do this is that we want to apply the model to data in which some people get scholarships and some don't. The values of utility above will be the values without a scholarship. Then, in the event of getting the scholarship, the utility of choosing Queen's goes up.
Scholarships are measured in dollars, but Bob's utility from getting and accepting a scholarship may be different than the monetary value. However, to keep things simple we will assume that the utilities already listed are in money-equivalent values, so we can add the value of the scholarship to them in order to compute net utility for Bob. Specifically, the Queen's scholarship has a value of $2,200 and utility is measured in thousands of dollars. So in the case of getting the scholarship and going to Queen's, utility increases by 2.2.
Coding Bob's New Choice
State Variables
The StateVariable class is designed to represent state variables in a dynamic program. Like action variables, the user's code creates state variables, stores them in a static member of the model class, and then adds them to the model. DDP then incorporates them into the state space and the solution methods.
Unlike action variables, which mainly differ only in the number of values they can take on, state variables differ from each other in many ways. So DDP provides a number of different kinds of state variables to add to a model, as well as the possibility of defining your own type if it does not exist already. However, in this simple static situation, the base StateVariable class will suffice. So Bob's award status will be of that class.
The simple OneStateModel class is no longer appropriate because Bob either gets the award or not. So this model of Bob's choice is derived from a different class of problems. Again, since it is a very simple situation, the base Bellman class will do the trick.
Static Time
The Bellman class does not restrict the model to be static, so unlike OneStateModel the code will have to specify this feature of the environment. But this will take one simple line of code. Below the role of time is introduced in earnest.
Utility
Finally, the new model of Bob's choice will have to modify Utility() to account for scholarships. As with the first example, this may seem much more complicated than necessary, and it is for this simple static two-state environment. But the approach to doing this will handle any kind of utility with limited programming complexity.
The vector of utilities has been moved from Utility() to a constant member of the class, Uv. Now the value of the scholarship can be added to Uv in order to compute overall utility.
static decl Qsch,
A variable to hold the scholarship state has been added to the model.
Changes to Decide
We still need to Initialize the model, but the Bellman version of Initialize is different than the very special OneStateModel version. It does not allow you to list actions. SetClock(StaticProgram) says that this is a static (but possibly multi-state) model.
Actions(maj,sch): Except in specialized models like OneStateModel, action variables are added to the model by sending them to Actions().
Qsch = new StateVariable("Qsch",2): this creates a basic state variable that takes on two values.
EndogenousStates(Qsch);: As mentioned earlier, different kinds of states are tracked in DDP and here we have added Qsch to the endogenous state space.
CreateSpaces(): In the first example this routine is called inside Initialize(). But in all other kinds of models, the user calls this method themselves once they have added all the elements to the model. Each kind of Bellman model has its own version of CreateSpaces(), and the one that is called is the one for the class the model is derived from (here Bellman's own base version). Sometimes arguments can or have to be sent to CreateSpaces(), but often the default values apply and nothing is sent.
Utility
Utility now returns a different vector depending on whether Qsch equals 0 or 1. Qsch is the state variable and takes on 2 values, 0 and 1. We set utility up so that a value of 1 indicates the Queen's scholarship was received and its value should be added to the utility of Queen's programs when choosing where to go. The underlying code will call Utility for all values of all states added to the model.
The value of Qsch when Utility() is called can be accessed a couple of different ways. One way is directly: Qsch.v will hold either 0 or 1 when Utility() is called. Or the state variable can be sent to CV(), which will access the value and return it. CV() is a flexible way of incorporating state variables, parameters and even constants in your model in such a way that it looks the same regardless of the kind of thing it is.
The trickier thing to understand at this point is accessing values of the school choice, which is an action variable, not a state variable. Looking at the list of labels associated with sch, Queen's is the 3rd value, or 2 since we start actions at 0. Ox's .== operator will compare a vector to a value and return a vector of 0s and 1s for whether the corresponding element equals the value.
If this is unclear, you can add a print() statement above return to see what CV(Qsch)*(CV(sch).==2) contributes to utility.
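Putting these pieces together, the new set-up and utility might look like the following sketch (assuming the class is named BobsChoiceB and the hard-coded vector is stored in the constant member Uv, as described above; the actual file may differ in details):

    BobsChoiceB::Decide() {
        Initialize(new BobsChoiceB());           // Bellman version: actions are not listed here
        SetClock(StaticProgram);                 // static, but possibly multi-state
        maj = new ActionVariable("major",{"Econ","Physics"});
        sch = new ActionVariable("school",{"Harvard","Yale","Queen's","McGill"});
        Actions(maj,sch);                        // add the action variables to the model
        Qsch = new StateVariable("Qsch",2);      // 0 = no scholarship, 1 = scholarship
        EndogenousStates(Qsch);                  // add it to the endogenous state space
        CreateSpaces();                          // called explicitly by the user now
        VISolve();
        }
    BobsChoiceB::Utility() {
        return Uv + 2.2*CV(Qsch)*(CV(sch).==2);  // add $2.2K to the Queen's rows when Qsch==1
        }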
Run the program and look at the output.
The output you get should look something like this:
Now, suppose we also want the model to consider what happens if Yale does or does not accept Bob.
We handle this by creating a model with two states, one in which Yale is in the feasible set of schools and another at which it is not.
State-Dependence in \(A\)
An important feature of many dynamic programming models: the feasible set depends on the state the action is being taken at. DDP provides a method for creating different feasible sets. Given a state \(\theta\), we can specify that the feasible action set is \(A(\theta)\). If we let plain old \(A\) denote the unrestricted set (as above), then \(A(\theta) \subseteq A\). Recall that each action vector \(\alpha\) is represented in the code as a row in a matrix. Now different states will have different matrices.
Coding Bob's New Choice in DDP
FeasibleActions()
To restrict feasible actions, the user provides a method that returns a column vector of 0s and 1s. This column vector depends on the current state of the model. A 1 indicates the action (row) is feasible at the state, 0 means infeasible. The user's method is called by DDP inside CreateSpaces() for each possible state.
Line-by-line explanation of the (new and modified) code
Yacc is added to the class so that it can track the case of Yale accepting Bob or not.
The class declares that there will be a new method named FeasibleActions(). The name is important because this new method will replace one that already exists. If the name is not correct then this replacement will not happen.
The new binary state variable is created and added to the model just like Qsch.
The actions are still which school to go to and which major to study. But if Yacc=0, Yale is not feasible and any row of \(A\) with that school should be removed at that state. Yale is school 1. If Yacc=1 all the options are feasible. If Yacc=0 then only cases in which sch is not 1 are feasible. The logical expression for feasibility is "Yacc equals 1 or sch does not equal 1". However, we have to use CV() to get the value of variables. And Ox uses double equal signs to test equality. It uses double vertical lines to denote "or." So the return value (CV(Yacc)==1) .|| (CV(sch).!=1); is a translation of the logic. The table below shows the two different return values.
Finally, utility must return values that correspond to the feasible action set. The hard-coded Uv vector includes rows for Yale. It cannot be used as-is without removing those rows. The function OnlyFeasible() can be used to do this without repeating the logical expression in FeasibleActions() to find and delete the Yale rows.
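A sketch of the two methods in version C, based on the description above (assuming the class is named BobsChoiceC, that the feasibility routine receives the matrix of actions as its argument as noted in the summary below, and omitting the version B scholarship term for brevity):

    BobsChoiceC::FeasibleActions(A) {
        // Return a column of 0s and 1s, one per row of A: Yale rows are infeasible when Yacc==0.
        return (CV(Yacc)==1) .|| (CV(sch).!=1);
        }
    BobsChoiceC::Utility() {
        return OnlyFeasible(Uv);                 // keep only the rows of Uv feasible at this state
        }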
Run the program and look at the output.
The output you get should look something like this:
Here is what is important to gather from the examples:
Discrete dynamic programming links together many different discrete choices that are made at different states.
A static problem with more than one state generalizes the OneStateModel introduced first but is a special case of sequential decisions made over time.
More generally than in the one-state example, your code will call routines/methods to first initialize the DP environment, then add state variables and action variables to the model, then create the spaces implied by the items added to the model. Then the model can be solved.
Unlike the simple one state set up, you build the action vector \(\alpha\) by sending action variables to Actions() which will add them to the model. This is done between initializing and creating spaces.
In DDP timing is controlled by a clock and a static problem is specified by setting the clock as StaticProgram. This is done between initializing and creating spaces.
The discrete state vector \(\theta\) is a vector of state variables, each taking on a finite number of values. Your model builds \(\theta\) by creating new state variable objects and adding them to your model using EndogenousStates().
Different states in the model can affect the utility of actions. The value of state variables can affect the set of feasible actions.
Restricting actions is possible by providing a FeasibleActions() routine with your model. It takes as an input argument the matrix of all possible actions, \(A\), and returns a column of 0s and 1s indicating which rows are feasible at the current state.
As with Utility(), the DDP code will always set the value of state variables before calling FeasibleActions(). The value of a state variable is stored in its .v data member, or your code can send the state variable to CV() which will return the value.
Exercises
Make a couple copies of BobsChoiceB.ox and BobsChoiceC.ox and experiment with them as follows. (Do not change the original file's contents.)
Suppose Bob has also applied for a Justin Bieber Scholarship which is worth $100 if Bob attends a Canadian university. Add a state variable to version B to account for this possibility as well as the Queen's scholarship. Now there will be 2 state variables and four possible states.
Suppose McGill's Physics department does not have an undergraduate degree, so McGill-Physics is not feasible but McGill-Econ is. Modify FeasibleActions in version C so that this option is excluded for both states of Yacc.
Choices Linked over Time (Dynamic States)
So far the environment has multiple discrete choices but no dynamics. Now we introduce dynamics, using Tommy's choice. That is, we will model how many hours Tommy studies for an exam in each period leading up to the exam.
Tommy's Choices Redux
Tommy chooses among \(J=13\) different discrete numbers of hours each night. Nights differ (and thus so does his choice) because the opportunity cost of studying goes up on the weekend.
For now we keep a very simple (trivial) reason Tommy cares about studying. Later we add to the model that he knows how studying affects his expected score.
Discounting and the Overall Objective
In the static examples earlier utility is the objective and there is nothing that ties together multiple states. Once we introduce sequential choice it becomes necessary to describe how choices and their utilities are aggregated over time. The power of dynamic programming is to tie together simple static choices in order to explain sequential choice.
Time
Let t denote time, which takes on discrete values. The first decision occurs at t=0. Suppose the chooser at time 0 is trying to decide among different sequences of choices. A particular sequence of choices is \(\alpha \equiv (\alpha_0, \alpha_1, \dots)\). For the moment, the total number of periods is left unspecified. It could be finite or infinite (the decisions keep going forever). And, for the moment, we suppress any other state variables that may be part of the model other than t.
Then, DDP and the basic DP framework itself assume that the chooser places value on that sequence equal to
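a discounted sum of the per-period utilities. In the notation used here (allowing utility to depend directly on t, as noted below, and letting T, the number of periods, be finite or infinite), this is the standard form
$$V \equiv \sum_{t=0}^{T-1} \delta^t\, U_t(\alpha_t).$$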
The new and fundamental parameter appearing in \(V()\) is the discount factor \(\delta\), which typically is in the range \([0,1)\).
Notes
If the decision horizon is finite then \(\delta=1\) is permissible. If \(\delta=0\) the person puts zero weight on future utility. The decider only cares about the current utility even though current decisions might affect the future.
This formulation allows utility to depend on time as well as the action chosen at time. But soon we will move t into the state vector \(\theta\) and it will simply be a specialized state variable not completely set apart from other state variables.
The Clock and Value Function Iteration
Dynamic programming accounts for how a choice made today affects outcomes in the future, along with the current appreciation (utility) of the choice. The linear separability of \(V()\) (and geometric discounting) makes it possible to solve the overall dynamic problem by breaking it into many connected static problems.
Transitions
The only required time distinction in DP is between now (today) and later (tomorrow). It is common in the literature to use ′ to denote things that will happen tomorrow. These notes follow that convention. So if t is a state variable, then t alone is its value today when a choice is being made. The value it takes on tomorrow is then denoted t′. A state variable moving from today's value to its value tomorrow is called a transition.
Stationary and Non-Stationary Clocks
There are two basic clocks in DP. The first is a stationary clock, in which today is the same as tomorrow. This means that the decision horizon is infinite, because if decisions end sometime in the future then tomorrow is one period closer to the end than today, and therefore the two are not the same. The second is a special kind of non-stationary world which ends after a fixed number of periods. This is usually called a finite horizon model. However, DDP makes distinctions between different kinds of finite horizon clocks. So what is usually called a finite horizon is here called normal aging.
Normal Aging
In normal aging, time starts at 0 and ends at some fixed last time, denoted T-1. That is, T is the number of periods of choice. In normal aging, the time next period is always one period later than the last (until T-1). That is,
$$t' = t + 1,\quad 0 \le t < T.$$
This is the transition of a normal aging clock. Each day you get another day older. Note that this is really defining a function, t'(t) = t+1. Technically, the transition is undefined when t=T-1, because that is the last period and the world ends immediately after.
Infinite Horizon (stationary)
A stationary clock simply says that tomorrow is just like today:
$$t' = t$$
Tomorrow is still a different day than today, but when the chooser wakes up tomorrow the environment is just like today. However, what state the chooser is in tomorrow may be different than the state he was in today.
A perfect analogy: In Groundhog Day, Bill Murray knows he will wake up and relive the same day, apparently forever, but his state can be different each day he wakes up. He can learn to play the piano, and every day he wakes up a little better. And each day he makes choices that affect his utility in the future. It is a perfect analogy to a stationary decision environment and also a perfect movie.
An important feature of time is that if there is a yesterday (as in normal aging) it can never come again. Any state that occurs in the future cannot have a value of time less than t.
Clocks in DDP
State Variables that Depend on Current and Past Events in DDP
In BobsChoiceB there were two states: either Bob has a scholarship or not, and this changes his decision. Dynamic programs can also account for state variables that depend on what has happened in the past. For example, a very common state variable counts how many times a choice has been made before the current period. If \(a\) is a binary choice, then utility might depend on how many times \(a=1\) before now.
$$A_t = {\sum}_{s=0}^{t-1} a_s.$$
Note that this does not clearly define the initial value \(A_0.\) Typically the model would set this to 0, but it could be initialized to a different value. For dynamic programming it does not matter how the current value \(A_t\) came about. What matters is how current decisions will alter the value of \(A\) in the future. This is what is meant by the transition of \(A\). We can state this simply as
$$A^\prime = A + a.$$
That is, next period's value of \(A\) will be what it is today plus the value of the current decision. If \(a=1\) then \(A^\prime = A + 1.\) Otherwise it is unchanged.
Terminal States
In some environments decision-making ends when a particular state is reached. These are called terminal states. A value of a state variable can be made terminal using MakeTerminal().
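For example, a hypothetical binary state variable (call it done) whose value 1 ends decision-making might be set up like this in the model's set-up routine (a sketch; the variable name is made up for illustration and the exact call form may differ):

    done = new StateVariable("done",2);   // takes on values 0 and 1
    done -> MakeTerminal(1);              // reaching done==1 makes the state terminal
    EndogenousStates(done);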