Application example

Bayesian inference and prediction are particularly well suited to cases where there is not much data available to fit the probabilistic model of choice.

As an example, think of a sports season: as each game is played we want to refine our predictions for the future games. Over the whole season a given team does not play that many games (one per week at most), and we want to start making forecasts from the beginning of the season, even when only 3 games have been played.

Even if 3 games is not enough to fit any model with reasonable confidence, we should still make use of this data. But we use it only as a way to refine "Prior" information that represents expert knowledge. Formally, this Prior is given as a probability distribution over likely parameter values for the model.

Formal problem statement

We have a set of observations consisting of explanatory variables and target variables:
$\mathcal{O}=\left\{ \left(x_{1},y_{1}\right),\left(x_{2},y_{2}\right),\ldots,\left(x_{N},y_{N}\right)\right\}$

Our goal is to predict the target variable for a new sample.
Formally, we want to compute the probability:
$P\left(y^{new}\left|x^{new},\mathcal{O}\right.\right)$

In a Bayesian setting we do not assume a single "best" value for the model parameters $\theta$; instead we integrate over their probability distribution as follows:
$P\left(y^{new}\left|x^{new},\mathcal{O}\right.\right)=\int P\left(y^{new}\left|x^{new},\theta\right.\right)P\left(\theta\left|\mathcal{O}\right.\right)d\theta$

Monte Carlo approximation

Using Bayes' formula we get:
$P\left(\theta\left|\mathcal{O}\right.\right)=\frac{P\left(\mathcal{O}\left|\theta\right.\right)Prior\left(\theta\right)}{P\left(\mathcal{O}\right)}$

The denominator can be expressed as an integral:
$P\left(\mathcal{O}\right)=\int P\left(\mathcal{O}\left|\theta\right.\right)Prior\left(\theta\right)d\theta$

Since the denominator does not depend on $\theta$, it can be pulled out of the integral, and we get:
$P\left(y^{new}\left|x^{new},\mathcal{O}\right.\right)=\frac{\int P\left(y^{new}\left|x^{new},\theta\right.\right)P\left(\mathcal{O}\left|\theta\right.\right)Prior\left(\theta\right)d\theta}{\int P\left(\mathcal{O}\left|\theta\right.\right)Prior\left(\theta\right)d\theta}$

We can rewrite the two integrals as expectations under the Prior, so as to approximate them via Monte Carlo simulation:
$P\left(y^{new}\left|x^{new},\mathcal{O}\right.\right)=\frac{E_{Prior}\left[P\left(y^{new}\left|x^{new},\theta\right.\right)P\left(\mathcal{O}\left|\theta\right.\right)\right]}{E_{Prior}\left[P\left(\mathcal{O}\left|\theta\right.\right)\right]}$
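To make this concrete, here is a minimal JavaScript sketch of the prior-sampling estimator (JavaScript because the demo below is written in it). The functions `samplePrior`, `likelihood` and `predictiveProb` are hypothetical callbacks standing in for $Prior$, $P\left(\mathcal{O}\left|\theta\right.\right)$ and $P\left(y^{new}\left|x^{new},\theta\right.\right)$; they are assumptions for illustration, not the demo's actual code.

```javascript
// Monte Carlo estimate of P(y_new | x_new, O) as a ratio of two expectations under the Prior.
// samplePrior():         draws one theta from Prior(theta)
// likelihood(theta):     computes P(O | theta)
// predictiveProb(theta): computes P(y_new | x_new, theta)
function bayesianPredict(samplePrior, likelihood, predictiveProb, nSamples) {
  let numerator = 0;
  let denominator = 0;
  for (let i = 0; i < nSamples; i++) {
    const theta = samplePrior();
    const lik = likelihood(theta);
    numerator += predictiveProb(theta) * lik;
    denominator += lik;
  }
  return numerator / denominator; // the 1/nSamples factors cancel in the ratio
}
```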

More generally, we can sample from an auxiliary distribution $Q$ (this is known as importance sampling) by setting:
$W_{Q}\left(\theta\right)=\frac{P\left(\mathcal{O}\left|\theta\right.\right)Prior\left(\theta\right)}{Q\left(\theta\right)}$

Then we get:
$P\left(y^{new}\left|x^{new},\mathcal{O}\right.\right)=\frac{E_{Q}\left[P\left(y^{new}\left|x^{new},\theta\right.\right)W_{Q}\left(\theta\right)\right]}{E_{Q}\left[W_{Q}\left(\theta\right)\right]}$
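A sketch of the importance-sampling version, where $\theta$ is drawn from the auxiliary distribution $Q$ and each sample is reweighted by $W_{Q}\left(\theta\right)$; again the callbacks (`sampleQ`, `densityQ`, `priorDensity`, `likelihood`, `predictiveProb`) are assumed names for illustration.

```javascript
// Importance sampling: draw theta from Q and reweight by
// W_Q(theta) = P(O | theta) * Prior(theta) / Q(theta).
function importanceSamplingPredict(sampleQ, densityQ, priorDensity,
                                   likelihood, predictiveProb, nSamples) {
  let numerator = 0;
  let denominator = 0;
  for (let i = 0; i < nSamples; i++) {
    const theta = sampleQ();
    const w = likelihood(theta) * priorDensity(theta) / densityQ(theta);
    numerator += predictiveProb(theta) * w;
    denominator += w;
  }
  return numerator / denominator;
}
```

Choosing $Q$ equal to the Prior recovers the previous estimator, since the weights then reduce to the likelihood.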

Remark: Computationally, the MC approach is convenient as it only requires being able to compute the likelihood of the observed data given a set of model parameters. There is no need to find the optimal set of parameters as in the frequentist MLE approach. The drawback is that running MC simulations may be computationally intensive, but with today's computers this is rarely an issue for a model of this size.

Example

For this experiment, let's consider a tennis tournament with 4 players.
The players are labelled 0, 1, 2, 3 and their actual strength (which the model needs to infer from the scores) increases with their label.
Let's assume the players are quite young and not well known, so we use an identical prior distribution on each player's strength: a standard Gaussian.
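JavaScript has no built-in Gaussian sampler, so one way to draw strengths from the standard Gaussian prior is the Box-Muller transform; this is a sketch of one possible implementation, not necessarily the demo's.

```javascript
// One draw from a standard Gaussian N(0, 1) via the Box-Muller transform.
function randomStandardGaussian() {
  const u1 = 1 - Math.random(); // in (0, 1], avoids log(0)
  const u2 = Math.random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// One prior sample of the parameter vector: a strength for each of the 4 players.
function samplePriorStrengths(nPlayers = 4) {
  return Array.from({ length: nPlayers }, () => randomStandardGaussian());
}
```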

The strengths of the opponents are such that the probability of X beating Y is:
$P\left(X>Y\left|strengths\right.\right)=\sigma\left(strength\left(X\right)-strength\left(Y\right)\right)$
Remark: $\sigma$ denotes the sigmoid function, which "squashes" values between 0 and 1.
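In code, the win probability is just the sigmoid of the strength difference (a small sketch):

```javascript
// Sigmoid: squashes any real number into (0, 1).
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Probability that player x beats player y, given the vector of strengths.
function winProbability(strengths, x, y) {
  return sigmoid(strengths[x] - strengths[y]);
}
```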

Assuming that the outcome of each game is independent (conditionally on the strengths) of the other games, we get the likelihood of the observed games as follows:
$P\left(\mathcal{O}=Games\left|\theta=strengths\right.\right)=\prod_{g\in Games}P\left(g\left(winner\right)>g\left(loser\right)\left|\theta\right.\right)$

Remark: The hypothesis of independent game outcomes is not required to apply the Importance Sampling algorithm, but for this quick and simple JavaScript demo it is convenient because it simplifies the likelihood function.
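Putting the pieces together, here is a sketch of the likelihood and of the prior-sampling predictor for one future matchup, reusing the helpers sketched above. The `{winner, loser}` game format and the example games are made up for illustration; they are not the demo's actual data or code.

```javascript
// Likelihood of the observed games given a strength vector,
// assuming conditionally independent outcomes.
function gamesLikelihood(games, strengths) {
  let lik = 1;
  for (const g of games) {
    lik *= winProbability(strengths, g.winner, g.loser);
  }
  return lik;
}

// Estimate P(playerA beats playerB | observed games) by sampling strengths from the prior.
function predictMatch(games, playerA, playerB, nSamples = 100000) {
  let numerator = 0;
  let denominator = 0;
  for (let i = 0; i < nSamples; i++) {
    const strengths = samplePriorStrengths();
    const lik = gamesLikelihood(games, strengths);
    numerator += winProbability(strengths, playerA, playerB) * lik;
    denominator += lik;
  }
  return numerator / denominator;
}

// Hypothetical observed games, then a forecast for player 3 vs player 0.
const games = [
  { winner: 3, loser: 0 },
  { winner: 2, loser: 1 },
  { winner: 3, loser: 1 },
];
console.log(predictMatch(games, 3, 0));
```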

The interactive demo below shows four outputs: the observed data (the game results), the Bayesian Monte-Carlo prediction average, the Bayesian Monte-Carlo prediction histogram, and the frequentist (MLE) prediction (still needing code review).

If you re-run the page several times you will notice that the MLE prediction varies more than the Bayesian estimate.
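For comparison, a frequentist MLE prediction fits a single strength vector by maximizing the likelihood of the observed games (for instance by gradient ascent on the log-likelihood) and then plugs that single vector into the win-probability formula. This is a rough sketch of that approach, reusing the helpers above; it is not the demo's actual MLE code.

```javascript
// Fit strengths by gradient ascent on the log-likelihood, then make a point prediction.
function mlePredict(games, playerA, playerB, nPlayers = 4,
                    steps = 500, learningRate = 0.1) {
  const strengths = new Array(nPlayers).fill(0);
  for (let step = 0; step < steps; step++) {
    const grad = new Array(nPlayers).fill(0);
    for (const g of games) {
      // Gradient of log sigmoid(s_winner - s_loser) with respect to the strengths.
      const p = sigmoid(strengths[g.winner] - strengths[g.loser]);
      grad[g.winner] += 1 - p;
      grad[g.loser] -= 1 - p;
    }
    for (let i = 0; i < nPlayers; i++) {
      strengths[i] += learningRate * grad[i];
    }
  }
  return winProbability(strengths, playerA, playerB);
}
```

With only a handful of games the unregularized MLE can push some strengths to extreme values (for example for an undefeated player), which is part of why its predictions fluctuate more than the Bayesian average.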