motivation

Recently I revisited the classic Monte Carlo fundamentals (random number generators, importance/stratified sampling, QMC, etc.), and found that the explanations of the MC estimator scattered across sources are sometimes confusing or inconsistent, so I decided to write down my own understanding.

The classic idea

MC enjoys the advantage of being cheap (e.g. the cost is linear in the number of samples when used to calculate an integral, and independent of dimension, unlike classic methods such as the Riemann sum which suffer from the curse of dimensionality), though it sacrifices some accuracy for a fixed number of samples and introduces variance (which smart variance-reduction techniques such as importance sampling can help control).

Some classic applications include:

  • simulate random trials to calculate/validate the expected value of a target quantity, e.g. probabilities, earnings/losses in gambling, stock returns, coin flipping, the Buffon needle experiment, etc. The results get more accurate as the number of samples/repeats increases (e.g. to reduce the variance by half, the sample size needs to double)
  • it can also be used to estimate some constants (given prior knowledge about the experiment), e.g. estimating the area under a curve (integration), or PI using uniformly random tests, given the fact that the probability of falling into a particular region is proportional to its area (here probability is approximated by frequency, i.e. counting the number of preferred outcomes); see the sketch right after this list
  • integration
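
As a concrete illustration of the PI point above, here is a minimal sketch in plain Python (the function name estimate_pi and the sample count are my own choices): it counts how many uniformly random points in the unit square land inside the quarter circle of radius 1, whose area is PI/4, and uses the hit frequency to approximate that probability.

    import random

    def estimate_pi(n_samples=1_000_000, seed=0):
        """Estimate PI from the fraction of uniform points in the unit square
        that fall inside the quarter circle x^2 + y^2 <= 1 (area PI/4)."""
        rng = random.Random(seed)
        inside = 0
        for _ in range(n_samples):
            x, y = rng.random(), rng.random()   # point uniform in [0, 1)^2
            if x * x + y * y <= 1.0:
                inside += 1
        return 4.0 * inside / n_samples         # hit frequency approximates PI/4

    print(estimate_pi())   # roughly 3.14; the error shrinks like 1/sqrt(N)

Doubling n_samples halves the variance of the estimate, matching the first bullet above.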

These are possible thanks to prerequisites such as a (pseudo-)random number generator (efficient sampling from uniform or designed distributions). Also, framing the problem in the MC context and constructing a proper estimator is key.

Here we focus on a small point: understanding the MC estimator in the context of using MC to calculate integrals.

the MC estimator

The task is: we want to work out the value of a definite integral. This is necessary when a closed analytical form of the integral is not available, or when other numerical methods are too expensive (e.g. suffer from the curse of dimensionality).

For example, we want to calculate the following quantity:
$$ I \;=\; \int_a^b f(x)\,dx $$

One way to do this is to treat the integral as an expectation of f(x) over [a,b], with x randomly and uniformly drawn from U(a,b). Recall the following formula from a stats course:
$$ E[f(X)] \;=\; \int f(x)\, p_X(x)\,dx $$
where p_X(x) is the pdf of X.

Comparing the two, we see that we aim to calculate the mean value of f(x) with x uniformly distributed over [a,b]. This yields the idea that we can repeatedly draw samples x_i from U(a,b), evaluate the function values f(x_i), and compute the sample mean avg(all f(x_i)), then use the sample mean to estimate the population mean (the first equation used in method-of-moments estimation in stats). This idea can be summarised as:

$$ \int_a^b f(x)\,dx \;\approx\; \frac{b-a}{N}\sum_{i=1}^{N} f(x_i), \qquad x_i \sim U(a,b), $$

where N is the total number of samples drawn.
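
A minimal sketch of this estimator in plain Python (the helper name mc_integrate and the sin(x) sanity check are my own illustrative choices):

    import math
    import random

    def mc_integrate(f, a, b, n_samples=100_000, seed=0):
        """Estimate the integral of f over [a, b] as (b - a) * mean of f(x_i),
        with x_i drawn i.i.d. from U(a, b)."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(n_samples):
            x = rng.uniform(a, b)   # x_i ~ U(a, b)
            total += f(x)
        return (b - a) * total / n_samples

    # Sanity check against a known value: the integral of sin over [0, pi] is 2.
    print(mc_integrate(math.sin, 0.0, math.pi))   # ~2.0, up to MC error ~ 1/sqrt(N)

The returned value fluctuates around the true integral, with an error that shrinks like 1/sqrt(N).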

You might be curious about where the (b-a) comes from - it’s from the distribution U(a,b). We can think about it in 3 ways.

The easiest way is:

$$ \int_a^b f(x)\,dx \;=\; (b-a)\int_a^b f(x)\,\frac{1}{b-a}\,dx \;=\; (b-a)\,E[f(X)], \qquad X \sim U(a,b), $$

in which 1/(b-a) is the pdf of U(a,b). So, by the definition of E[f(X)], the integral of f(x)·(1/(b-a)) over [a,b] is exactly E[f(X)], where X is drawn from U(a,b). Numerically speaking, it is the mean value of f(X) when we draw X from U(a,b) infinitely many times and evaluate f. So if we use the sample mean (1/N)·Σ f(x_i) to estimate (approximate) the population mean (the first moment), it can be shown that the resulting estimator has expectation equal to the true integral and converges in probability, as per the law of large numbers (unbiased and consistent).

The second way is to use a change of variables to map x on [a,b] to a new variable X on [0,1]:

$$ x = a + (b-a)X, \qquad dx = (b-a)\,dX, \qquad \int_a^b f(x)\,dx \;=\; (b-a)\int_0^1 f\big(a+(b-a)X\big)\,dX, $$

where X ~ U(0,1). Then, using the same idea that the sample mean can approximate the population mean, we have:

$$ \int_a^b f(x)\,dx \;\approx\; \frac{b-a}{N}\sum_{i=1}^{N} f\big(a+(b-a)X_i\big), \qquad X_i \sim U(0,1). $$
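
The same estimator can be coded by drawing from U(0,1) and mapping into [a,b], exactly as in the change of variables above (again a sketch with my own names):

    import math
    import random

    def mc_integrate_unit(f, a, b, n_samples=100_000, seed=0):
        """Estimate the integral of f over [a, b] by drawing X_i ~ U(0, 1)
        and mapping x_i = a + (b - a) * X_i."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(n_samples):
            u = rng.random()          # X_i ~ U(0, 1)
            x = a + (b - a) * u       # mapped sample in [a, b]
            total += f(x)
        return (b - a) * total / n_samples

    print(mc_integrate_unit(math.sin, 0.0, math.pi))   # ~2.0, same estimator as before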

The third way of thinking is to use the mean value theorem for integrals.
When sampling x uniformly, each drawn sample carries weight 1/N (together the samples form the empirical distribution). Then evaluate the function value f(x_i) at each data point x_i, sum them up and multiply by the 1/N weight, which gives the sample-based mean value of the function over the interval:

$$ \bar{f} \;=\; \frac{1}{N}\sum_{i=1}^{N} f(x_i). $$
On the other hand, according to the mean value theorem for integrals, we have:

$$ \int_a^b f(x)\,dx \;=\; (b-a)\,f(c) \quad \text{for some } c \in [a,b], \qquad \text{i.e.}\quad f(c) \;=\; \frac{1}{b-a}\int_a^b f(x)\,dx. $$
So we arrive at:

$$ \int_a^b f(x)\,dx \;=\; (b-a)\,f(c) \;\approx\; (b-a)\,\bar{f} \;=\; \frac{b-a}{N}\sum_{i=1}^{N} f(x_i). $$
Actually this says that the mean value of the function over the interval is, by definition, a representative value of the function on that interval; so multiplying this representative value by the interval width gives the expected area under the curve, which matches the geometric meaning of the definite integral.
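
As a quick concrete check (a toy example of my own, not from the derivation above): take f(x) = x^2 on [0, 2]. Then

$$ \int_0^2 x^2\,dx \;=\; \frac{8}{3}, \qquad \frac{1}{2-0}\int_0^2 x^2\,dx \;=\; \frac{4}{3}, $$

so the representative (mean) value 4/3 multiplied by the interval width 2 recovers the area 8/3; the MC estimator simply approximates that 4/3 by the sample mean of the f(x_i).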

Further, if we re-write the MC estimation equation in a more general form:

$$ \int f(x)\,dx \;\approx\; \frac{1}{N}\sum_{i=1}^{N} \frac{f(x_i)}{p(x_i)}, \qquad x_i \sim p, $$

where p(x_i) is the pdf p evaluated at the sample x_i,
we can then design an arbitrary pdf p(x) from which to draw the samples. This is widely used in importance sampling: by carefully employing a distribution whose pdf is similar in shape to f(x) (e.g. a scaled copy), we can reduce variance/discrepancy. (Imagine using MC to estimate the integral of a constant function over a finite interval: every sample gives the same value, so whatever distribution we employ the variance is zero, the minimum achievable. Importance sampling tries to recreate this situation by making f(x)/p(x) as close to constant as possible, which is why we want to design the pdf to be a multiple of f(x).)
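
A minimal sketch of this general estimator in plain Python. The integrand f(x) = x on [0, 1], the proposal pdf p(x) = 2x, and the inverse-CDF sampler are my own illustrative choices; the proposal is deliberately proportional to f, so f/p is constant and the variance drops to zero:

    import math
    import random

    def mc_importance(f, pdf, sample, n_samples=100_000, seed=0):
        """General MC estimator: mean of f(x_i) / pdf(x_i) with x_i ~ pdf."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(n_samples):
            x = sample(rng)              # draw x_i from the proposal distribution
            total += f(x) / pdf(x)       # weight by 1 / pdf(x_i)
        return total / n_samples

    # Target: integral of f(x) = x over [0, 1], exact value 0.5.
    f = lambda x: x
    # Proposal proportional to f: pdf(x) = 2x on [0, 1].
    # Inverse-CDF sampling: CDF is x^2, so x = sqrt(u); use 1 - u ~ U(0, 1] to keep x > 0.
    pdf = lambda x: 2.0 * x
    sample = lambda rng: math.sqrt(1.0 - rng.random())

    print(mc_importance(f, pdf, sample))                              # exactly 0.5: f/pdf is constant
    print(mc_importance(f, lambda x: 1.0, lambda rng: rng.random()))  # ~0.5, with non-zero variance

Swapping in the uniform proposal (the last line) gives the same mean but a non-zero variance, which is exactly the gap importance sampling closes.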

Regarding how to do the sampling itself, there are deeper topics such as quasi-Monte Carlo that may be relevant; I'd better stop here.

Thanks, Yong 10/09/2020