This section discusses two important applications of simulated randomness. They allow us to do things that would otherwise be very difficult, and both rely on the ability to generate pseudo-random numbers that appear to be independently distributed.
- Monte Carlo Experiments
- Monte Carlo Integration of a Function

Another use of simulated randomness is to calculate integrals over a distribution. Numerical integration over multiple dimensions can be computationally intensive (as briefly discussed previously). Monte Carlo integration is a way to approximate an integral that gets a good answer without complicated formulas.
- Given
- a random vector, $x\in \Re^n$, $x\sim F(x)$
- a function $g(x)$ whose average (expected value) with respect to the distribution is to be approximated, $$\hat S\ \approx\ S\ =\ \int g(x)f(x)dx,$$ where $f(x)$ is the density of $F(x)$.
- A number of replications $R$ or a desired precision of the approximation.
- Draw a pseudo-random $x^r_1 \sim F(x)$ and define $S_1 \equiv g(x^r_1) $.
- Then $E[S_1] = \int g(x^r)f(x^r)dx^r = S$. Since this is independent of the dimension $n$, Monte Carlo integration breaks the curse of dimensionality on average.
- Improve the precision of the estimate by repeatedly drawing $R$ independent values, $$\hat S_R = {1\over R}{\sum}_{i=1}^R S_i.$$
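For concreteness, here is a minimal Ox sketch of this procedure. The choices $F = {\cal N}(0,1)$ and $g(x) = x^2$ (so that $S = 1$) are illustrative assumptions, not part of the algorithm.

```ox
#include <oxstd.h>

// Illustrative g(): with x ~ N(0,1), E[g(x)] = E[x^2] = 1.
g(const x)
{
    return x .* x;
}

main()
{
    decl R = 100000;            // number of replications
    decl x = rann(R, 1);        // R pseudo-random draws x^r ~ F
    decl S = g(x);              // S_i = g(x^r_i)
    decl Shat = meanc(S);       // \hat S_R = (1/R) sum of the S_i
    println("Monte Carlo estimate of S: ", Shat[0][0], "   (true value 1)");
}
```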
- Simulated Annealing
- Given
- $f(x)$, and $x_0\in \Re^n$
- a cool down schedule $T(t)$
- a probability function $P(\delta,T)$.
- an iteration max $t_{max}$.
- Set $t=0$ and $f_0 = f(x_0)$.
- → Generate a new vector $x^d$ and compute $f^d = f(x^d)$.
- $$x_{t+1} = \cases{ x^d & with prob. $P\left(f^d-f_t,T(t)\right)$\cr x_t & otherwise.\cr}$$ Set $f_{t+1} = f(x_{t+1})$.
- Increment $t$. Stop if $t=t_{max}$; otherwise, return to →.
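Here is a minimal Ox sketch of this loop for a maximization problem. The objective f(), the proposal step, the acceptance probability P() (a Metropolis-style rule) and the cool-down schedule T() are illustrative choices, not part of the algorithm statement.

```ox
#include <oxstd.h>

f(const x)                          // illustrative objective, maximized at x = 0
{
    decl s = sumc(sqr(x));          // 1 x 1 sum of squares
    return -s[0][0];
}
T(const t) { return 10 * (0.95 ^ t); }   // illustrative geometric cool-down schedule
P(const delta, const temp)               // illustrative acceptance probability
{
    return delta > 0 ? 1.0 : exp(delta / temp);   // always accept improvements
}

main()
{
    decl n = 3, tmax = 1000, t, xd, fd, u;
    decl xt = ranu(n, 1);           // x_0
    decl ft = f(xt);                // f_0
    for (t = 0; t < tmax; ++t)
    {
        xd = xt + 0.1 * rann(n, 1);             // candidate x^d: a random step from x_t
        fd = f(xd);
        u = ranu(1, 1);
        if (u[0][0] < P(fd - ft, T(t)))         // keep x^d with prob. P(f^d - f_t, T(t))
        {
            xt = xd;
            ft = fd;
        }
    }
    println("final point: ", xt', "f = ", ft);
}
```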
One way to use Monte Carlo is to demonstrate the implications of statistical results like Central Limit Theorems. Typically a CLT will say that under some conditions some sort of average of random variables will follow the normal distribution as the sample size goes to $\infty$. Let $X$ be a random variable with distribution $F(x)$ that has a mean $\mu$ and variance $\sigma^2.$ The sample mean of $N$ draws of $X$ is, of course, $$\bar{X}_N = {1\over N}\sum_{i=1}^N X_i.$$ The notation includes the sample size in the definition of the mean since a CLT concerns what happens as $N$ gets large. We know that $E[\bar{X}_N]=\mu$ and $Var(\bar{X}_N) = \sigma^2/N$ for any sample size $N$ as long as the observations are independent. (Otherwise the variance of the mean would have to account for the covariance across observations.) So in the IID case the variance of the mean goes to $0$ as $N\to\infty$, and its distribution "collapses" to a point at $\mu$, which is not the normal distribution.
A CLT will need to "blow up" the distribution to keep it from collapsing. The way to do that, in this simple case, is to look at the distribution of $$Y_N = \sqrt{N}(\bar{X}_N-\mu).\label{YN}$$ Now we know that $E[Y_N] = 0$ and $Var[Y_N]= \sigma^2.$ That is, the mean and variance of the scaled deviations from the mean do not depend on the sample size. The factor $\sqrt{N}$ is the rate at which the distribution converges. That rate can differ if the observations are not identical in distribution and/or sampling is not independent.
Then the CLT says $$Y_N\ \to\ {\cal N}(0,\sigma^2)\hbox{ as }N\to\infty,$$ where the arrow denotes convergence in distribution. What this says in practice is that for a large but finite $N$ the sample mean is approximately normally distributed with mean $\mu$ and variance $\sigma^2/N.$ But a CLT is usually stated in terms of a scaled deviation like $Y_N.$
It doesn't matter what $F(x)$ looks like (within some technical conditions that in some cases might not hold). There are different CLTs depending on whether the draws of $X$ are correlated or not, whether the distributions are identical or changing over $i$, and many other nuances.
Be careful about interpreting this result. For any given sample, the sample mean is simply a number and has no distribution. A CLT says that if you repeatedly draw samples to get different sample means they will approximately follow the normal distribution for "large" $N.$ For $N=1$ the sample mean is just the random variable, so unless $X$ is itself a normal random variable the distribution won't look anything like a normal. But as you increase $N$ the distribution looks more and more normal.
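The sketch below illustrates this in Ox with uniform draws (an arbitrary choice, as are the sample size and replication count). Since $X\sim U(0,1)$ has $\mu = 1/2$ and $\sigma^2 = 1/12$, the simulated $Y_N$ should have mean near $0$, variance near $1/12$, and an increasingly normal-looking density as $N$ grows.

```ox
#include <oxstd.h>

main()
{
    decl N = 200, R = 1000;            // sample size and number of replications
    decl mu = 0.5, sig2 = 1 / 12.0;    // mean and variance of U(0,1)
    decl X = ranu(N, R);               // each column is one sample of size N
    decl Xbar = meanc(X);              // 1 x R row of sample means
    decl Y = sqrt(N) * (Xbar - mu);    // scaled deviations Y_N, one per replication
    decl mY = meanc(Y'), vY = varc(Y');
    println("mean of Y_N: ", mY[0][0], "   variance of Y_N: ", vY[0][0],
            "   (theory: 0 and ", sig2, ")");
}
```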
Algorithm 23. Use Monte Carlo to Approximate an Integral
A key feature of simulated annealing is that the probability of a mistake is designed to fall over time (over iterations). Formally, the code will have the mistake probability $p_t \to 0$ as $t\to\infty$. In other words, eventually only improvements will be accepted and the process will only move 'uphill'. Earlier on, however, the algorithm will sometimes go 'downhill' in order to jump away from a local but not global maximum. The reason it is called simulated annealing is that the decision to go downhill or not and the random candidate $x^d$ are generated using the random number algorithms discussed in the next section. The analogy to physical annealing is carried through by talking of the current temperature as determining the probability of making a mistake. The algorithm then lowers the temperature as $t$ increases. Define $$\delta \equiv f^d - f_t.$$ The acceptance probability should satisfy:
- If $\delta>0$ then $P(\delta,T) = 1$. In other words, always take an improvement on the current vector regardless of the temperature.
- $0\le P(\delta,T) \le 1$ for all $T\ge 0$ and for all $\delta$. In other words, $P()$ is a proper probability for any arguments.
- ${\partial P(\delta,T) \over \partial\delta} \ge 0$. In other words, the bigger the difference is algebraically, the greater the chance of going downhill.
- ${\partial P(\delta,T) \over \partial T} \ge 0$. In other words, the higher the temperature, the more likely the algorithm is to go downhill.

The last piece of the algorithm is how fast to let the temperature cool down; that is, we need a cool-down schedule, $T(t)$.
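One concrete pair that satisfies all of these conditions (an illustrative choice, not the only one) is a Metropolis-style acceptance rule combined with geometric cooling: $$P(\delta,T) = \cases{1 & if $\delta\ge 0$\cr e^{\delta/T} & if $\delta<0$,\cr}\qquad T(t) = T_0\,\rho^t,\quad 0<\rho<1.$$ For $\delta<0$, $e^{\delta/T}$ lies strictly between $0$ and $1$, rises toward $1$ as $\delta$ rises toward $0$, and rises as $T$ increases, so every requirement above holds.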
Algorithm 24. Simulated Annealing
Exercises
- Demonstrate the CLT by simulating samples of a $\chi^2_3$ random variable with ranchi(), which is in Ox's oxprob package. Remember that the ran?() functions in Ox will return a matrix, so you can draw an $N\times R$ matrix. Each column of that matrix is a sample (each row is observation $i$). Then just compute the statistics for each column. Graph densities or histograms for samples of size $N=10$, $N=30$ and $N=200.$ Do $R=1000$ replications for each sample size. This takes only a few lines of Ox code (a starter sketch appears after the exercises).
- Simulate $Y_N$ in \eqref{YN}, which is the scaled deviation from the population mean. Remember you can use built-in functions and operators to compute the mean of each column of a matrix. Here is Ox graphing code that will plot the density of $Y_N$ if Y contains a $1\times R$ row matrix:

```ox
#include <oxdraw.h>     // DrawDensity() and SaveDrawWindow()

// Draw simulated sample
// Compute values of YN in Y
DrawDensity(0, Y, "$Y_N$", 1, 1, 1);
SaveDrawWindow("YN.pdf");
```

The last 3 1's are optional but should make the graph look the way you want ("Graphic Reference" describes the Ox drawing routines).
- Not everything we do with sample data follows a normal distribution in large samples, even with IID sampling. One example is what are called extreme statistics (whereas the mean is a central statistic). For example, for a sample of $X_i$ of size $N$, define the largest value in the sample as $X^{max}_N \equiv \max_i\ X_i.$ Using the same simulated samples of $\chi^2_3$ values, graph the density of $X^{max}_N$ and note whether it "looks normal" for large $N$. You can compare it to draws from the extreme value distribution, using ranextremevalue(N,J,0,1) (the starter sketch after the exercises also computes $X^{max}_N$).
- Use the loaddemo() code to read in the $X$ matrix from the Stata sample. Instead of using the 3-outcome $y$ vector from the data, simulate new $y$ vectors multiple times using Monte Carlo. To do this, you need to have the "true" or population parameter vector, often denoted with a "0," so here $\theta_0.$ Set $\theta_0$ equal to the MLE estimates from the actual data: $\theta_0 = \hat\theta^{mle}.$ Write a function to simulate a sample of $y$ using the same $X$ matrix and $\theta_0.$ That is, compute the simulated $u_{ij}$ and compute $j^\star_i.$ That means you have to store the index of the element in a row or column vector that holds the max (in this case the choice with the highest utility for observation $i$). Ox has a maxcindex() function to do this (or use loops). Note it does not provide a maxrindex(), so you may have to transpose your matrix to use it (a sketch of this step appears after the exercises).
- For a simulated $y$ vector, re-estimate the coefficients using MaxBFGS() as in the previous assignment. Then put this code into a loop to estimate from $R$ simulated samples.
- You can put all the simulated results in a matrix that is $2K \times R.$ Each column is a simulated MLE vector. Compute the distribution of the estimates and compare them to $\theta_0$. (The moments() function is a good option for this; a sketch of this summary step also appears after the exercises.)
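Here are a few sketches related to the exercises above; the sample sizes, replication counts, and file names in them are arbitrary choices. First, a starter for the $\chi^2_3$ exercises: draw an $N\times R$ matrix, treat each column as one sample, and compute the column statistics (the extreme statistic $X^{max}_N$ is included as well).

```ox
#include <oxstd.h>
#include <oxprob.h>                 // ranchi()
#include <oxdraw.h>                 // DrawDensity(), SaveDrawWindow()

main()
{
    decl N = 200, R = 1000;
    decl X = ranchi(N, R, 3);       // N x R draws from chi^2_3; each column is one sample
    decl Xbar = meanc(X);           // 1 x R row of sample means, one per replication
    decl Xmax = maxc(X);            // 1 x R row of sample maxima X^max_N
    DrawDensity(0, Xbar, "sample mean", 1, 1, 1);
    DrawDensity(1, Xmax, "$X^{max}_N$", 1, 1, 1);
    SaveDrawWindow("chi2mc.pdf");
}
```

Repeat this for each sample size $N$ to see how the two densities change.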
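Next, the argmax step for the multinomial simulation exercise. The utility matrix U below is just random numbers standing in for the $u_{ij}$ computed from the data and $\theta_0$; the point is only that maxcindex() works down columns, so the utilities are transposed first.

```ox
#include <oxstd.h>

main()
{
    decl N = 5, J = 3;
    decl U = rann(N, J);            // stand-in for the N x J matrix of simulated utilities u_ij
    decl jstar = maxcindex(U');     // 1 x N row: j*_i, the highest-utility choice for each i
    println("simulated choices j*: ", jstar);
}
```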
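Finally, the summary step. Here theta_hats is a placeholder $2K\times R$ matrix of "estimates" (true values plus noise) standing in for the output of the estimation loop, so the comparison to $\theta_0$ runs on its own; the moments() function mentioned in the exercise is an alternative to meanc()/varc().

```ox
#include <oxstd.h>

main()
{
    decl K2 = 4, R = 1000;                                // 2K parameters, R replications
    decl theta0 = <0.5; 1.0; 0.25; 2.0>;                  // placeholder "true" parameter vector
    decl theta_hats = theta0 * ones(1, R) + 0.1 * rann(K2, R);  // placeholder 2K x R estimates
    decl mc_mean = meanc(theta_hats');                    // 1 x 2K Monte Carlo averages
    decl mc_sd   = sqrt(varc(theta_hats'));               // 1 x 2K Monte Carlo std. deviations
    println("%c", {"theta0", "mean", "std.dev."}, theta0 ~ mc_mean' ~ mc_sd');
}
```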