1. Applications
    This section discusses important applications of simulated randomness. They let us do things that would be very difficult without the ability to generate pseudo-random numbers that appear to be independently distributed.

  1. Monte Carlo Experiments
  2. NR 7.6
    Monte Carlo is a catch-all name for using pseudo-randomness to simulate truly random processes. The term comes from the casinos of Monte Carlo, where rolling dice and spinning wheels repeat the outcomes of random experiments over and over. Until pRNG algorithms were invented it was difficult to see what was supposed to happen when randomness was part of a model.

    One way to use Monte Carlo is to demonstrate the implications of statistical results like Central Limit Theorems. Typically a CLT will say that, under some conditions, some sort of average of random variables will follow the normal distribution as the sample size goes to $\infty$. Let $X$ be a random variable with distribution $F(x)$ that has a mean $\mu$ and variance $\sigma^2.$ The sample mean of $N$ draws of $X$ is, of course, $$\bar{X}_N = {1\over N}\sum_{i=1}^N X_i.$$ The notation includes the sample size in the definition of the mean since a CLT concerns what happens as $N$ gets large. We know that $E[\bar{X}_N]=\mu$ and $Var(\bar{X}_N) = \sigma^2/N$ for any sample size $N$ as long as the observations are independent. (Otherwise the variance of the mean would have to account for the covariance across observations.) So in the IID case the variance of the mean goes to $0$ as $N\to\infty,$ and its distribution "collapses" to a point at $\mu$, which is not the normal distribution.

    A CLT will need to "blow up" the distribution to keep it from collapsing. The way to do that, in this simple case, is to look at the distribution of $$Y_N = \sqrt{N}(\bar{X}_N-\mu).\label{YN}$$ Now we know that $E[Y_N] = 0$ and $Var[Y_N]= N\cdot Var(\bar{X}_N) = \sigma^2.$ That is, the mean and variance of the scaled deviations from the mean do not depend on the sample size. The factor $\sqrt{N}$ is the rate at which the distribution converges. That rate can differ if the observations are not identical in distribution and/or sampling is not independent.

    Then the CLT says $$Y_N\ {\buildrel d \over \longrightarrow}\ {\cal N}(0,\sigma^2)\quad\hbox{as } N\to\infty.$$ What this says in practice is that for a large but finite $N$ the sample mean is approximately normally distributed with mean $\mu$ and variance $\sigma^2/N.$ But a CLT is usually stated in terms of a scaled deviation like $Y_N.$

    It doesn't matter what $F(x)$ looks like (within some technical limits, which in some cases might fail to hold). There are different CLTs depending on whether the draws of $X$ are correlated, whether the distributions are identical or change over $i$, and many other nuances.

    Be careful about interpreting this result. For any given sample, the sample mean is simply a number and has no distribution. A CLT says that if you repeatedly draw samples to get different sample means, they will approximately follow the normal distribution for "large" $N.$ For $N=1$ the sample mean is just the random variable itself, so unless $X$ is itself a normal random variable the distribution won't look anything like a normal. But as you increase $N$ the distribution looks more and more normal.
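    To see the repeated-sampling interpretation in action, here is a minimal Monte Carlo sketch in Python (assuming NumPy is available; the exponential distribution, the seed, and the values of $R$ and $N$ are illustrative choices, not from the text). It draws $R$ samples of size $N$, forms $Y_N$ for each, and compares the simulated moments to the ${\cal N}(0,\sigma^2)$ values the CLT predicts.

```python
import numpy as np

rng = np.random.default_rng(20469)   # seeded pRNG so the experiment is reproducible

mu, sigma2 = 1.0, 1.0        # the exponential(1) distribution has mean 1, variance 1
R, N = 10_000, 500           # R replications, each a sample of size N

samples = rng.exponential(scale=mu, size=(R, N))
Y = np.sqrt(N) * (samples.mean(axis=1) - mu)   # R draws of Y_N

print(f"mean of Y_N: {Y.mean():+.4f}   (CLT: 0)")
print(f"var  of Y_N: {Y.var():.4f}    (CLT: {sigma2})")
# The skewness of Y_N shrinks toward 0 (the normal value) as N grows.
print(f"skew of Y_N: {np.mean(((Y - Y.mean()) / Y.std())**3):+.4f}")
```

    Rerunning with a smaller $N$ shows the skewness of the underlying exponential persisting in $Y_N$; increasing $N$ drives it toward the normal shape.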

  3. Monte Carlo Integration of a Function
  4. Another use of simulated randomness is to calculate integrals over a distribution. Numerical integration over multiple dimensions can be computationally intensive (as briefly discussed previously). Monte Carlo integration is a way to approximate an integral that gets a good answer without complicated formulas.

    Algorithm 23. Use Monte Carlo to Approximate an Integral

    Given
    a random vector, $x\in \Re^n$, $x\sim F(x)$
    a function $g(x)$ whose average (expected value) with respect to the distribution is to be approximated: $$\hat S\ \approx\ S\ =\ \int g(x)f(x)dx,$$ where $f(x)$ is the density corresponding to $F(x)$.
    A number of replications $R$ or a desired precision of the approximation.
    1. Draw a pseudo-random $x^r_1 \sim F(x)$ and define $S_1 \equiv g(x^r_1) $.
    2. Then $E[S_1] = \int g(x^r)f(x^r)dx^r = S$. Since this is independent of the dimension $n$, Monte Carlo integration breaks the curse of dimensionality on average.
    3. Improve the precision of the estimate by repeatedly drawing $R$ independent values, $$\hat S_R = {1\over R}{\sum}_{i=1}^R S_i.$$
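    As a concrete illustration, here is a minimal sketch of the algorithm in Python (assuming NumPy; the choices of $F$ as a standard normal in $n$ dimensions, $g(x)=\sum_i x_i^2$, and the values of $n$ and $R$ are illustrative assumptions). With these choices the true value is $S = n$, so the approximation, and its precision shrinking like $1/\sqrt{R}$ regardless of $n$, can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(7)

n, R = 10, 100_000                     # dimension and number of replications
x = rng.standard_normal(size=(R, n))   # R pseudo-random draws x^r ~ F = N(0, I_n)
S_r = (x**2).sum(axis=1)               # S_r = g(x^r) = ||x^r||^2

S_hat = S_r.mean()                     # \hat S_R, the Monte Carlo approximation
se = S_r.std(ddof=1) / np.sqrt(R)      # its standard error, O(1/sqrt(R)) for any n

print(f"estimate: {S_hat:.4f} +/- {se:.4f}   (true value: {n})")
```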
  5. Simulated Annealing
  6. NR 10.9
    The optimization methods discussed previously are designed to find a local maximum, $x^\star_V$, where the subscript $V$ indicates the output of the algorithm as opposed to a true global maximum of the function, $x^\star$. The local maximum they find may happen to be the global maximum, but not necessarily. The reason they seek local maxima is that they converge (decide to stop) based on how the value of the function changes very close to, or locally around, $x^\star_V$. But a global maximum produces a larger value than any other input vector, not just those nearby. Once these methods get close to a local maximum they simply zero in on it to the exclusion of what may happen far away.

    Methods designed to look for global maxima are based on the idea of annealing, which in short means purposely making local mistakes in order not to zero in too soon. Simulated annealing works by searching randomly for new points at which to evaluate the objective. If the new point, call it $x^d$, is an improvement on the current vector, call it $x_t$ at the $t$th iteration, then $x^d$ becomes the new current vector, $x_{t+1}$. If it is not an improvement it still might be chosen as the next current vector with a positive probability. Let this probability be called $p_t$. So if $p_t$ is close to 1 a 'mistake' is very likely: a better current vector will be abandoned in favour of a worse, randomly chosen vector.

    The analogy to annealing is that the probability of a mistake is designed to fall over time (over iterations). Formally, the code will have $p_t \to 0$ as $t\to\infty$. In other words, eventually only improvements will be accepted and the process will only move 'uphill'. Earlier, however, the algorithm will sometimes go 'downhill' in order to jump away from a local but not global maximum. The reason it is called simulated annealing is that the decision to go downhill or not and the random candidate $x^d$ are generated using the pseudo-random number algorithms discussed earlier.

    The analogy to physical annealing is carried through by talking of the current temperature $T$ as determining the probability of making a mistake; the algorithm then lowers the temperature as $t$ increases. Writing $f^d = f(x^d)$ and $f_t = f(x_t)$, define $$\delta \equiv f^d - f_t.$$ The acceptance probability $P(\delta,T)$ should have the following properties. If $\delta>0$ then $P(\delta,T) = 1$. In other words, always take an improvement on the current vector regardless of the temperature.

    $0\le P(\delta,T) \le 1$ for all $T\ge 0$ and for all $\delta$. In other words, $P()$ is a proper probability for any arguments.

    ${\partial P(\delta,T) \over \partial\delta} \ge 0$. In other words, the bigger $\delta$ is algebraically (the smaller the loss from a downhill move), the greater the chance of accepting the candidate.

    ${\partial P(\delta,T) \over \partial T} \ge 0$. In other words, the higher the temperature, the more likely the algorithm is to go downhill.
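    One concrete functional form with all of these properties (an illustrative assumption; the text does not commit to a particular $P$) is the Metropolis criterion $$P(\delta,T) = \min\left\{1,\ e^{\delta/T}\right\}.$$ For $\delta\ge 0$ it equals 1; for $\delta<0$ it is increasing in $\delta$, since ${\partial P \over \partial\delta} = e^{\delta/T}/T > 0$, and increasing in $T$, since ${\partial P \over \partial T} = -\delta\, e^{\delta/T}/T^2 > 0$ when $\delta<0$.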

    The last bit of the algorithm is how fast to let the temperature cool down. That is, we need a cool-down schedule, $T(t)$.

    Algorithm 24. Simulated Annealing

    Given
    $f(x)$, and $x_0\in \Re^n$
    a cool down schedule $T(t)$
    a probability function $P(\delta,T)$.
    an iteration max $t_{max}$.
    1. Set $t=0$ and $f_0 = f(x_0)$.
    2. → Generate a new vector $x^d$ and compute $f^d = f(x^d)$.
    3. $$x_{t+1} = \cases{ x^d & with prob. $P\left(f^d-f_t,T(t)\right)$\cr x_t & otherwise.\cr}$$ Set $f_{t+1} = f(x_{t+1})$.
    4. Increment $t$. Stop if $t=t_{max}$; Otherwise, return to →.
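    Here is a minimal sketch of the algorithm in Python (assuming NumPy). The objective $f$, the Gaussian proposal for $x^d$, the Metropolis acceptance probability, and the geometric cool-down schedule $T(t)=T_0\rho^t$ are all illustrative assumptions; the algorithm as stated leaves these choices open.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    # An illustrative bumpy objective with many local maxima; global maximum at x = 0.
    return -np.sum(x**2) + 2.0 * np.sum(np.cos(3.0 * x))

def anneal(x0, t_max=20_000, T0=5.0, rho=0.999, step=0.5):
    x_t = np.asarray(x0, dtype=float)
    f_t = f(x_t)
    best_x, best_f = x_t, f_t
    for t in range(t_max):
        T = T0 * rho**t                                    # geometric cool-down schedule T(t)
        x_d = x_t + step * rng.standard_normal(x_t.shape)  # random candidate x^d
        f_d = f(x_d)
        delta = f_d - f_t                                  # delta = f^d - f_t
        # Metropolis rule: always accept improvements; accept a 'mistake'
        # with probability exp(delta/T), which falls to 0 as T -> 0.
        if delta > 0 or rng.random() < np.exp(delta / T):
            x_t, f_t = x_d, f_d
            if f_t > best_f:
                best_x, best_f = x_t, f_t
    return best_x, best_f

x_star, f_star = anneal(x0=np.array([3.0, -2.5]))
print(f"x* ~ {x_star},  f(x*) = {f_star:.4f}")
```

    Tracking the best vector visited so far, as the sketch does, is a common safeguard: late in the run the chain may still wander 'downhill' away from the best point it has seen, and keeping the tracked best costs nothing.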

Exercises