2 Armed Bandits

A Blog by Jason Tang

CSC412 Communication Project

This blog post was written for my CSC412 Computer Science Communication Project. We will be looking at a general overview of Importance/Rejection Sampling along with some simulations of situations where they might be useful. I will be summarizing information from both lectures and the textbook “Information Theory, Inference, and Learning Algorithms” by David Mackay and will be presenting it along with animations powered by the D3 Library. To view the code used to run this blog, take a look at my Github Repo.

An Overview of Importance and Rejection Sampling

When working with complicated probability distributions $p$, whether purely theoretical or derived from real-world applications, two common problems arise:

  1. Generating samples $x \sim p$.
  2. Computing expectations of functions $f(x)$ over our distribution $p$, i.e. computing
$$\mathbb{E}_{x\sim p}[f(x)] = \int f(x)\,p(x)\,dx$$

For sufficiently complex or high-dimensional $p$, it becomes intractable to solve either of these problems exactly. Let's consider the situation in which we can evaluate the unnormalized $\tilde{p}(x)$ for any given $x$, such that:

$$p(x) = \frac{\tilde{p}(x)}{Z_p}$$

where we have the normalization constant $Z_p = \int \tilde{p}(x)\,dx$. Computing this $Z_p$ quickly becomes difficult for complex $p$'s in high dimensions. But even with access to $Z_p$ (and therefore $p(x)$), it remains difficult to sample from $p$, since we would need to evaluate $p(x)$ everywhere to know where we should sample from, i.e. the regions where $p(x)$ is large.

However, if we are able to solve problem 1, we can compute an unbiased, arbitrarily close approximation to problem 2 through a simple Monte Carlo estimator:

$$\mathbb{E}_{x\sim p}[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i) = \hat{f}$$

where we generate $N$ samples $x_i \sim p$. Note that this estimator can get arbitrarily close to the actual expectation, since its variance scales proportionally to $\frac{1}{N}$ (with no dependence on dimensionality). Therefore, we can conclude that problem 1 is a harder problem than problem 2. For now, we will focus on methods of solving problem 2 without solving problem 1, which is where Importance and Rejection Sampling come into play.
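
As a quick illustration of problem 2 when problem 1 is already solved, here is a minimal sketch of this Monte Carlo estimator (the choice of $p = \mathcal{N}(0,1)$ and $f(x) = x^2$ is mine, picked so the true answer is 1; this is not code from the blog's repo):

```python
import numpy as np

# Simple Monte Carlo estimator, assuming we *can* sample from p directly.
# Here p = N(0, 1) and f(x) = x^2, so the true expectation E_p[f(x)] = 1.
rng = np.random.default_rng(0)

def f(x):
    return x ** 2

N = 100_000
samples = rng.normal(loc=0.0, scale=1.0, size=N)  # x_i ~ p
f_hat = np.mean(f(samples))                       # (1/N) * sum_i f(x_i)
print(f_hat)                                      # should be close to 1
```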

The Importance of Importance Sampling

Since we cannot directly sample $x \sim p$ (problem 1), we will instead sample from some simpler distribution $q$. We will assume that we can easily generate samples $x \sim q$ and are able to compute an unnormalized $\tilde{q}$, such that:

$$q(x) = \frac{\tilde{q}(x)}{Z_q}$$

We then generate $N$ samples $\{x_i\}_{i=1}^N$ with $x_i \sim q$. Since these samples are drawn from $q$ rather than $p$, we need to adjust the weighting of each sample depending on how likely it is to be sampled under $p$ relative to $q$. Namely, we compute an importance weight:

$$\tilde{w}(x) = \frac{\tilde{p}(x)}{\tilde{q}(x)}$$

Assuming that $q$ and $p$ are different distributions, this adjusts the importance of samples in regions where $p$ is overrepresented by $q$ ($q > p$) and in regions where $p$ is underrepresented ($p > q$). We also require that the support of $p$ is captured entirely by the support of $q$: for any point $x$, if $p(x) > 0$, then $q(x) > 0$ as well. This ensures that we have asymptotic convergence to the real mean $\mathbb{E}_{x\sim p}[f(x)]$; otherwise, there would exist regions that $p$ might sample from but that our samples from $q$ would never visit. These restrictions make the normal distribution an ideal candidate for $q$: it has support over all real numbers and is easy to sample from. It is especially useful in higher dimensions, where it remains easy to sample from, unlike many other distributions.

Now we will derive how this importance weighting fits into our original goal of problem 2:

$$\begin{aligned}
\int f(x)\,p(x)\,dx &= \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx \\
&= \int f(x)\,\frac{\tilde{p}(x)}{Z_p}\,\frac{Z_q}{\tilde{q}(x)}\,q(x)\,dx \\
&= \int f(x)\,\frac{\tilde{p}(x)}{\tilde{q}(x)}\,\frac{Z_q}{Z_p}\,q(x)\,dx \\
&= \frac{Z_q}{Z_p}\int f(x)\,\tilde{w}(x)\,q(x)\,dx \\
&= \frac{Z_q}{Z_p}\,\mathbb{E}_{x\sim q}\left[f(x)\,\tilde{w}(x)\right] \\
&\approx \frac{Z_q}{Z_p}\,\frac{1}{N}\sum_{i=1}^{N} f(x_i)\,\tilde{w}(x_i)
\end{aligned}$$

Let's consider the case where we do not have access to the normalization constants $Z_q$ or $Z_p$. We can still approximate their ratio by first considering the expected value of an importance weight over samples drawn from $q$:

$$\mathbb{E}_{x\sim q}[\tilde{w}(x)] = \int \tilde{w}(x)\,q(x)\,dx = \int \frac{\tilde{p}(x)}{\tilde{q}(x)}\,q(x)\,dx = \int \frac{q(x)}{\tilde{q}(x)}\,\tilde{p}(x)\,dx = \frac{1}{Z_q}\int \tilde{p}(x)\,dx = \frac{Z_p}{Z_q}$$

Since we have $N$ samples from $q$, we can estimate the ratio $\frac{Z_p}{Z_q}$ with a Monte Carlo estimator:

$$\frac{Z_p}{Z_q} = \mathbb{E}_{x\sim q}[\tilde{w}(x)] \approx \frac{1}{N}\sum_{i=1}^{N} \tilde{w}(x_i)$$

Plugging this into our original equation, we get:

$$\int f(x)\,p(x)\,dx \approx \frac{N}{\sum_{i=1}^{N}\tilde{w}(x_i)}\cdot\frac{1}{N}\sum_{i=1}^{N} f(x_i)\,\tilde{w}(x_i) = \frac{\sum_{i=1}^{N} f(x_i)\,\tilde{w}(x_i)}{\sum_{i=1}^{N}\tilde{w}(x_i)}$$

Now, we have a method for utilizing samples from $q$ to learn something about our target distribution $p$.
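
Here is a minimal sketch of this self-normalized estimator as a reusable function (the function name and argument layout are my own, not taken from the blog's repo):

```python
import numpy as np

def importance_estimate(f, p_tilde, q_tilde, q_sampler, N, rng):
    """Self-normalized importance sampling estimate of E_{x~p}[f(x)],
    using only the unnormalized densities p_tilde and q_tilde."""
    x = q_sampler(N, rng)                 # x_i ~ q
    w = p_tilde(x) / q_tilde(x)           # importance weights w~(x_i)
    return np.sum(f(x) * w) / np.sum(w)   # sum_i f(x_i) w~(x_i) / sum_i w~(x_i)

# Tiny sanity check: target p = N(0, 1), proposal q = N(0, 2^2), f(x) = x.
# The true value of E_p[x] is 0.
rng = np.random.default_rng(0)
est = importance_estimate(
    f=lambda x: x,
    p_tilde=lambda x: np.exp(-0.5 * x ** 2),              # unnormalized N(0, 1)
    q_tilde=lambda x: np.exp(-x ** 2 / 8),                # unnormalized N(0, 2^2)
    q_sampler=lambda n, rng: rng.normal(0.0, 2.0, size=n),
    N=100_000,
    rng=rng,
)
print(est)  # should be close to 0
```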

A Basic Example of Importance Sampling

Let's suppose that we're working in a 1-dimensional space and we want to find the average over our very simple target distribution $p = \text{Uniform}[1,6]$. We know from elementary statistics that:

$$\mathbb{E}_{x\sim p}[x] = \frac{1+6}{2} = 3.5$$

In addition, almost every mathematical package out there can sample from a Uniform distribution. But for the purposes of demonstration, let's assume that we can only sample from a 1-D normal distribution defined as $q = \mathcal{N}(\mu_q, \sigma_q^2)$. We will utilize samples $x \sim q$ and importance sampling to approximate the value $\mathbb{E}_{x\sim p}[x]$ (which we know should be 3.5). In other words, we will be importance sampling with $f(x) = x$ for all $x$.
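
For a non-interactive analogue of the demo below, here is a minimal sketch (assuming numpy; the particular $\mu_q = 3.5$, $\sigma_q = 2$ are just one reasonable choice, not the demo's defaults):

```python
import numpy as np

# Importance sampling estimate of E_{x~p}[x] for p = Uniform[1, 6],
# using a normal proposal q = N(mu_q, sigma_q^2). True answer: 3.5.
rng = np.random.default_rng(0)
mu_q, sigma_q, N = 3.5, 2.0, 200_000

x = rng.normal(mu_q, sigma_q, size=N)                  # x_i ~ q
p_tilde = ((x >= 1) & (x <= 6)).astype(float)          # unnormalized Uniform[1, 6]
q_tilde = np.exp(-0.5 * ((x - mu_q) / sigma_q) ** 2)   # unnormalized N(mu_q, sigma_q^2)
w = p_tilde / q_tilde                                  # importance weights
print(np.sum(x * w) / np.sum(w))                       # should approach 3.5
```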

Use the sliders below to adjust the parameters of our $q$ distribution, then generate samples from $q$ and watch the estimate for $\mathbb{E}_{x\sim p}[x]$ asymptotically converge.

[Interactive demo: sliders for $\mu_q$ and $\sigma_q$, with live readouts of the running sums $\sum_{i=1}^N f(x_i)\,\tilde{w}(x_i)$ and $\sum_{i=1}^N \tilde{w}(x_i)$, the current estimate of $\mathbb{E}_{x\sim p}[x]$, and a plot of the samples.]

A More Complex Example

Let’s consider a more complex example brought up in chapter 29 of “Information Theory, Inference, and Learning Algorithms” by David Mackay:

$$\tilde{p}(x) = 0.5\,e^{0.4(x-0.4)^2 - 0.08x^4}$$

Since we don't know the normalizing constant $Z_p$ (and would have a very difficult time computing it analytically), we will use Wolfram Alpha to get an estimated expected value:

$$\frac{\int x \cdot 0.5\,e^{0.4(x-0.4)^2 - 0.08x^4}\,dx}{\int 0.5\,e^{0.4(x-0.4)^2 - 0.08x^4}\,dx} \approx 0.682815$$

So we want to see an estimated value around 0.682815 from our importance sampling simulation:

[Interactive demo: sliders for $\mu_q$ and $\sigma_q$, with live readouts of $\sum_{i=1}^N f(x_i)\,\tilde{w}(x_i)$, $\sum_{i=1}^N \tilde{w}(x_i)$, and the current estimate of $\mathbb{E}_{x\sim p}[x]$.]
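
A minimal non-interactive sketch of the same estimate (assuming numpy; the proposal $q = \mathcal{N}(0, 2^2)$ is my own choice, and 0.682815 is the Wolfram Alpha value from above):

```python
import numpy as np

# Importance sampling estimate of E_{x~p}[x] for the unnormalized target
# p~(x) = 0.5 * exp(0.4 (x - 0.4)^2 - 0.08 x^4), with proposal q = N(0, 2^2).
rng = np.random.default_rng(0)
mu_q, sigma_q, N = 0.0, 2.0, 500_000

x = rng.normal(mu_q, sigma_q, size=N)                    # x_i ~ q
p_tilde = 0.5 * np.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)
q_tilde = np.exp(-0.5 * ((x - mu_q) / sigma_q) ** 2)     # unnormalized N(0, 2^2)
w = p_tilde / q_tilde                                    # importance weights
print(np.sum(x * w) / np.sum(w))                         # should be around 0.68
```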

From these simulations, we see that importance sampling approximates the expected values over these relatively simple distributions well. But for more complex and higher-dimensional distributions, we run into issues with numerical stability. For example, we might happen to sample a point $x$ with $\tilde{q}(x) \ll \tilde{p}(x)$, resulting in an extremely large $\tilde{w}(x)$ that then dominates the expected value estimate. In order to circumvent this, we look towards the method of Rejection Sampling.

Why We Shouldn’t Reject Rejection Sampling

Rejection Sampling is a slightly different method for solving problem 2, i.e. approximating $\int f(x)\,p(x)\,dx$. All the assumptions from importance sampling still apply, but we also require knowledge of some constant $c$ such that for all $x$ over the support of $q$, we have:

$$c\,\tilde{q}(x) > \tilde{p}(x)$$

This essentially produces a probabilistic bounding box around $\tilde{p}$. From here, we sample a point $x \sim q$ and then a point $u \sim \text{Uniform}[0,\, c\,\tilde{q}(x)]$. Then we compare $u$ with $\tilde{p}(x)$: if $u > \tilde{p}(x)$, we reject that sample; otherwise, we accept it.

We can interpret this procedure as first uniformly selecting a point under the curve $c\,\tilde{q}$, then keeping that point only if it also falls under the curve $\tilde{p}$. Our accepted samples are therefore distributed proportionally to $\tilde{p}$, i.e. they are distributed according to $p$. We can then use a simple Monte Carlo estimator to estimate:

$$\mathbb{E}_{x\sim p}[f(x)] \approx \frac{1}{M}\sum_{i=1}^{M} f(x_i)$$

where we have $M$ accepted points $\{x_i\}_{i=1}^M$. In our case, we just want the vanilla expected value, so we have:

$$\mathbb{E}_{x\sim p}[x] \approx \frac{1}{M}\sum_{i=1}^{M} x_i$$
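
Here is a minimal sketch of this accept/reject procedure applied to the earlier Uniform[1, 6] example (the proposal and the bound $c = 2.5$ are my own choices, checked in the comments rather than taken from the blog):

```python
import numpy as np

# Rejection sampling for p = Uniform[1, 6] with proposal q = N(3.5, 2^2).
# We need c * q_tilde(x) > p_tilde(x) everywhere. With q_tilde peaking at 1,
# its smallest value on [1, 6] is exp(-0.5 * (2.5 / 2)^2) ~= 0.46, so c = 2.5 works.
rng = np.random.default_rng(0)
mu_q, sigma_q, c, N = 3.5, 2.0, 2.5, 100_000

x = rng.normal(mu_q, sigma_q, size=N)                  # x_i ~ q
q_tilde = np.exp(-0.5 * ((x - mu_q) / sigma_q) ** 2)   # unnormalized proposal density
p_tilde = ((x >= 1) & (x <= 6)).astype(float)          # unnormalized Uniform[1, 6]
u = rng.uniform(0.0, c * q_tilde)                      # u_i ~ Uniform[0, c * q_tilde(x_i)]

accepted = x[u <= p_tilde]                             # keep points under the p_tilde curve
print(accepted.mean())                                 # estimate of E_p[x], should be ~3.5
print(len(accepted) / N)                               # acceptance rate
```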

Let’s see how this compares to Importance Sampling for our previous 2 example p’s.

[Interactive demo: sliders for $\mu_q$, $\sigma_q$, and $c$, with live readouts of $\sum_{i=1}^M x_i$, $M$, the current estimate of $\mathbb{E}_{x\sim p}[x]$, the current acceptance rate, and a plot of accepted and rejected samples.]

Note that the red lines indicate samples that were rejected. Now for the more complex distribution:

[Interactive demo: the same rejection sampling simulation for the more complex $\tilde{p}$, with sliders for $\mu_q$, $\sigma_q$, and $c$ and the same readouts.]

Rejection Sampling in Higher Dimensions

We can see that rejection sampling approximates $\int f(x)\,p(x)\,dx$ about as well as importance sampling does, without the risk of numerical instability. However, the downside is that a proportion of our samples are rejected and go unused. This becomes an increasing issue in higher dimensions, as the volume sandwiched between $c\,\tilde{q}$ and $\tilde{p}$ grows with increasing dimension. Eventually, unless $c\,\tilde{q}$ and $\tilde{p}$ are exactly equal, the acceptance rate approaches 0%, i.e. most samples that we draw from $q$ will immediately be thrown away.

Let's consider a situation where our $p = \mathcal{N}(0,1)$ and our $q$ is a slightly more spread out normal distribution $\mathcal{N}(0,1.01)$:

[Plot: the densities of $p = \mathcal{N}(0,1)$ and $q = \mathcal{N}(0,1.01)$.]

Recall that the normal distribution $\mathcal{N}(\mu,\sigma)$ has the density function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

So we have:

$$p(0) = \frac{1}{\sqrt{2\pi}}, \qquad q(0) = \frac{1}{1.01\sqrt{2\pi}}, \qquad \frac{p(0)}{q(0)} = \frac{1/\sqrt{2\pi}}{1/(1.01\sqrt{2\pi})} = 1.01$$

We need a $c > 1.01$ in 1-D to satisfy $c\,q > p$. However, for an $n$-dimensional normal distribution, which we will denote $\mathcal{N}_n(\mu,\sigma)$, we recall that

$$\mathcal{N}_n(\mu,\sigma) = \mathcal{N}(\mu,\sigma)\times\mathcal{N}(\mu,\sigma)\times\cdots\times\mathcal{N}(\mu,\sigma)$$

where we take the product of $n$ independent 1-dimensional normal distributions. This means that for $p = \mathcal{N}_n(0,1)$ and $q = \mathcal{N}_n(0,1.01)$, we have:

$$p(\mathbf{0}) = \left(\frac{1}{\sqrt{2\pi}}\right)^n, \qquad q(\mathbf{0}) = \left(\frac{1}{1.01\sqrt{2\pi}}\right)^n, \qquad \frac{p(\mathbf{0})}{q(\mathbf{0})} = 1.01^n$$

So we need $c > 1.01^n$, which means $c \to \infty$ as $n \to \infty$. Let's see this empirically:

[Interactive demo: sliders for $\sigma_q$ and the number of dimensions $n$, with live readouts of $\sum_{i=1}^M \sum_{j=1}^n x_{i,j}$, $M$, the current estimate of $\mathbb{E}_{x\sim p}\left[\sum_{j=1}^n x_j\right]$, and the current acceptance rate.]
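
Here is a minimal non-interactive sketch of the same experiment (assuming numpy; it uses the fact that, for these isotropic normals with the tightest valid bound $c = \sigma_q^n$, the accept probability depends only on $\lVert x \rVert^2$):

```python
import numpy as np

# Empirical rejection-sampling acceptance rate for p = N_n(0, 1) and
# q = N_n(0, sigma_q), using the tightest valid bound c = sigma_q ** n.
# With both densities isotropic, accepting with probability p(x) / (c * q(x))
# reduces to exp(-0.5 * ||x||^2 * (1 - 1 / sigma_q ** 2)).
rng = np.random.default_rng(0)
sigma_q, n_samples = 1.01, 10_000

for n in range(1, 21):
    x = rng.normal(0.0, sigma_q, size=(n_samples, n))               # x_i ~ q
    sq_norm = np.sum(x ** 2, axis=1)                                 # ||x_i||^2
    accept_prob = np.exp(-0.5 * sq_norm * (1 - 1 / sigma_q ** 2))    # p(x) / (c * q(x))
    accepted = rng.uniform(size=n_samples) < accept_prob
    print(n, accepted.mean())  # acceptance rate shrinks as n grows
```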

From this experiment, we see that the acceptance rate drops significantly with increasing dimension:

[Plot: acceptance rate versus number of dimensions, for $n = 2$ through $20$.]

These values were generated by running the above simulation over 10,000 samples with $\sigma_q = 1.01$ and seeing what percentage of samples were accepted. Sadly, there is no way for me to visually show the sampling from these higher dimensions on a 2-dimensional computer screen. However, I'll leave you with a piece of advice: