Bayesian Parameter Estimation

Let $y$ be distributed according to a parametric family: $y \sim f_{\theta}(y)$. The goal is, given iid observations $\left\{y_{i}\right\}$, to estimate $\theta$. For instance, let $\left\{y_{i}\right\}$ be a series of coin flips where $y_{i} = 1$ denotes ``heads'' and $y_{i} = 0$ denotes ``tails''. The coin is weighted, so $P(y_{i} = 1)$ can be other than $1/2$. Let us define $\theta = P(y_{i} = 1)$; our goal is to estimate $\theta$. This simple distribution is called the ``Bernoulli'' distribution.

Without prior information, we use the maximum likelihood approach. Let the observations be $y_{1}, \ldots, y_{H+T}$. Let $H$ be the number of heads observed and $T$ be the number of tails.

$$\begin{aligned}
\hat{\theta} &= \mathrm{argmax}_{\theta} \, f_{\theta}(y_{1:H+T}) \\
&= \mathrm{argmax}_{\theta} \, \theta^{H} (1-\theta)^{T} \\
&= H/(H+T)
\end{aligned}$$

Not surprisingly, the probability of heads is estimated as the empirical frequency of heads in the data sample.
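The closed form can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper names `bernoulli_mle` and `log_likelihood` and the counts (7 heads, 3 tails) are made up for illustration. It compares $H/(H+T)$ with a brute-force search of the log-likelihood over a grid of $\theta$ values.

```python
import math

def bernoulli_mle(heads, tails):
    """Closed-form maximum likelihood estimate H / (H + T)."""
    return heads / (heads + tails)

def log_likelihood(theta, heads, tails):
    """Log of theta^H (1 - theta)^T, valid for 0 < theta < 1."""
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

# Illustrative counts (an assumption, not data from the text).
H, T = 7, 3

# Brute-force search over interior grid points of (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, H, T))

theta_closed = bernoulli_mle(H, T)
print(theta_closed, theta_grid)
```

Working with the log of the likelihood leaves the maximizer unchanged and avoids underflow when $H+T$ is large.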

Suppose we remember that yesterday, using the same coin, we recorded 10 heads and 20 tails. This is one way to indicate ``prior information'' about $\theta$ . We simply include these past trials in our estimate:

$\displaystyle \hat{\theta} = (10+H)/(10+H+20+T)$

As $H+T$ goes to infinity, the effect of the past trials washes out.
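To see the washing-out effect concretely, here is a small sketch. The function name `estimate_with_past_trials` is hypothetical, and the new data are assumed, purely for illustration, to contain equal numbers of heads and tails.

```python
def estimate_with_past_trials(heads, tails, past_heads=10, past_tails=20):
    """Estimate of theta that pools new counts with yesterday's counts."""
    return (past_heads + heads) / (past_heads + heads + past_tails + tails)

# Assumption for illustration: the new tosses split evenly, so the
# empirical frequency of the new data alone is exactly 1/2.
for n in (10, 100, 1000, 100000):
    h = n // 2
    t = n - h
    print(n, estimate_with_past_trials(h, t), h / n)
```

With no new data the estimate is $10/30$; as the number of new tosses grows it moves toward the empirical frequency of the new data.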

Suppose that, due to a computer crash, we have lost the details of the experiment, and that our memory has also failed (due to lack of sleep), so that we have forgotten even the number of heads and tails (which are the sufficient statistics for the Bernoulli distribution). However, we believe the probability of heads is about $1/3$, but this probability itself is somewhat uncertain, since we only performed 30 trials.

In short, we claim to have a $\textit{prior distribution}$ over the probability $\theta$, which represents our prior belief. Suppose this distribution is $P(\theta)$, with $\theta \sim \mathrm{Beta}(10,20)$:

$\displaystyle P(\theta) = \frac{\theta^{9}(1-\theta)^{19}}{\int\theta^{9}(1-\theta)^{19} \, d\theta}$

[Figure: density of the $\mathrm{Beta}(10,20)$ prior (beta1020.eps)]
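The shape of this prior can be summarized numerically. The sketch below is an illustration rather than part of the original notes: the function name `beta_pdf` and the grid size are assumptions. It evaluates the $\mathrm{Beta}(10,20)$ density, using the standard identity $\int_{0}^{1}\theta^{a-1}(1-\theta)^{b-1} d\theta = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ for the normalizer, and then approximates the total mass, mean, and standard deviation with a midpoint rule.

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta in (0, 1), computed in log space."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_kernel = (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta)
    return math.exp(log_kernel - log_norm)

# Midpoint rule on (0, 1); the grid size is an arbitrary choice.
N = 10000
grid = [(i + 0.5) / N for i in range(N)]
weights = [beta_pdf(t, 10, 20) / N for t in grid]

total_mass = sum(weights)
prior_mean = sum(w * t for w, t in zip(weights, grid))
prior_var = sum(w * (t - prior_mean) ** 2 for w, t in zip(weights, grid))
prior_std = math.sqrt(prior_var)

print(total_mass, prior_mean, prior_std)
```

The mean comes out near $1/3$ and the standard deviation near $0.085$, which quantifies how uncertain a belief based on only 30 trials is.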

Now we observe a new sequence of tosses: $y_{1:H+T}$ . We may calculate the posterior distribution $P(\theta \vert y_{1:H+T})$ according to Bayes' Rule:

$$\begin{aligned}
P(\theta\vert y) &= \frac{P(y\vert\theta) P(\theta)}{P(y)} \\
&= \frac{P(y\vert\theta) P(\theta)}{\int P(y\vert\theta) P(\theta) \, d\theta}
\end{aligned}$$

The term $P(y\vert\theta)$ is, as before, the likelihood function of $\theta$. The marginal $P(y)$ comes from integrating out $\theta$:

$\displaystyle P(y) = \int P(y\vert\theta) P(\theta) \, d\theta$

To continue our example, suppose we observe in the new data $y$ a sequence of 50 heads and 50 tails. The likelihood becomes:

$\displaystyle P(y\vert\theta) = \theta^{50} (1-\theta)^{50}$

Plugging this likelihood and the prior into Bayes' Rule and doing the math gives the posterior distribution, a $\mathrm{Beta}(10+50,20+50)$:

$\displaystyle P(\theta\vert y) = \frac{\theta^{59}(1-\theta)^{69}}{\int\theta^{59}(1-\theta)^{69} \, d\theta}$

[Figure: density of the $\mathrm{Beta}(60,70)$ posterior (beta6070.eps)]
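The algebra behind this posterior is short enough to write out. In the sketch below, $B(10,20)$ is shorthand introduced here for the prior's normalizing integral $\int\theta^{9}(1-\theta)^{19} d\theta$, and $t$ is a dummy integration variable. The constant appears in both numerator and denominator and cancels, leaving only the exponents to add.

```latex
\begin{aligned}
P(\theta \vert y)
  &= \frac{\theta^{50}(1-\theta)^{50} \cdot \theta^{9}(1-\theta)^{19} / B(10,20)}
          {\int_{0}^{1} t^{50}(1-t)^{50} \cdot t^{9}(1-t)^{19} / B(10,20) \, dt} \\
  &= \frac{\theta^{50+9}(1-\theta)^{50+19}}
          {\int_{0}^{1} t^{50+9}(1-t)^{50+19} \, dt} \\
  &= \frac{\theta^{59}(1-\theta)^{69}}
          {\int_{0}^{1} t^{59}(1-t)^{69} \, dt}
\end{aligned}
```

The last line is the $\mathrm{Beta}(60,70)$ density, so the update amounts to adding the observed heads and tails to the two parameters of the prior.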

Note that the posterior and the prior have the same form. A prior with this property is called a conjugate prior. The Beta distribution is conjugate to the binomial distribution, which gives the likelihood of iid Bernoulli trials. As we will see, a conjugate prior perfectly captures the results of past experiments; alternatively, it allows us to express prior belief in terms of ``invented'' data. More importantly, conjugacy allows for efficient sequential updating of the posterior distribution, where the posterior at one stage is used as the prior for the next.
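Sequential updating can be sketched in a few lines. The function name `update_beta` is hypothetical, and splitting the 100 tosses into two stages of (30 heads, 20 tails) and (20 heads, 30 tails) is an arbitrary choice made for illustration.

```python
def update_beta(prior, heads, tails):
    """Conjugate update: Beta(a, b) plus (heads, tails) gives Beta(a + heads, b + tails)."""
    a, b = prior
    return (a + heads, b + tails)

prior = (10, 20)

# All 100 tosses processed at once.
batch = update_beta(prior, 50, 50)

# The same 100 tosses processed in two stages; the posterior after
# stage one serves as the prior for stage two.
stage1 = update_beta(prior, 30, 20)
stage2 = update_beta(stage1, 20, 30)

print(batch, stage2)
```

Both routes end at the same pair of parameters, so only two numbers need to be stored between stages, not the raw data.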

Key Point: The ``output'' of the Bayesian analysis is not a single estimate of $\theta$, but rather the entire posterior distribution. The posterior distribution summarizes all our ``information'' about $\theta$. As we get more data, if the samples are truly iid, the posterior distribution will become more sharply peaked about a single value.
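The sharpening of the posterior can be illustrated with the standard formula for the variance of a Beta distribution, $ab/((a+b)^{2}(a+b+1))$. This is a sketch under an assumption made only for illustration: each batch of new tosses splits evenly into heads and tails, starting from the $\mathrm{Beta}(10,20)$ prior. The name `beta_std` is hypothetical.

```python
import math

def beta_std(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Assumed data: n new tosses, half heads and half tails.
sample_sizes = [0, 100, 1000, 10000]
posterior_stds = [beta_std(10 + n // 2, 20 + n // 2) for n in sample_sizes]

for n, s in zip(sample_sizes, posterior_stds):
    print(f"N = {n:5d}  posterior std = {s:.4f}")
```

The standard deviation shrinks roughly like $1/\sqrt{N}$, which is the numerical counterpart of the posterior becoming more sharply peaked.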

Of course, we can use this distribution to make inferences about $\theta$. Suppose an ``oracle'' were to tell us the true value of $\theta$ used to generate the samples. We want the guess of $\theta$ that minimizes the mean squared error between our guess and the true value. Note that this is a different criterion from maximum likelihood estimation, which chooses the $\theta$ that maximizes the likelihood of the observed data. We would choose the mean of the posterior distribution, because the conditional mean minimizes the mean squared error.
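The claim about the conditional mean follows from a short expansion. In this sketch, $\mu$ is shorthand introduced here for $E(\theta \vert y)$, and the guess $\hat{\theta}$ is assumed to depend only on the data, so it is a constant once $y$ is given.

```latex
\begin{aligned}
E\left[(\theta - \hat{\theta})^{2} \,\vert\, y\right]
  &= E\left[(\theta - \mu + \mu - \hat{\theta})^{2} \,\vert\, y\right] \\
  &= E\left[(\theta - \mu)^{2} \,\vert\, y\right]
     + 2(\mu - \hat{\theta}) \, E\left[\theta - \mu \,\vert\, y\right]
     + (\mu - \hat{\theta})^{2} \\
  &= \mathrm{Var}(\theta \vert y) + (\mu - \hat{\theta})^{2}
\end{aligned}
```

The middle term vanishes because $E[\theta - \mu \vert y] = 0$. The first term does not depend on the guess, so the expected squared error is smallest when $\hat{\theta} = \mu$.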

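Let our prior be $\mathrm{Beta}(H_{0}, T_{0})$ and let the $N = H + T$ new observations contain $H$ heads and $T$ tails. The posterior is then $\mathrm{Beta}(H_{0}+H, T_{0}+T)$, and its mean is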

$$\begin{aligned}
\hat{\theta} &= E(\theta \vert y_{1:N}) \\
&= \frac{H_{0} + H}{H_{0} + H + T_{0} + T}
\end{aligned}$$

In the same way, we can do prediction. What is $P(y_{N+1} = 1\vert y_{1:N})$?

$$\begin{aligned}
P(y_{N+1} = 1\vert y_{1:N}) &= \int P(y_{N+1} = 1\vert \theta, y_{1:N}) P(\theta \vert y_{1:N}) \, d\theta \\
&= \int P(y_{N+1} = 1\vert \theta) P(\theta \vert y_{1:N}) \, d\theta \\
&= \int \theta P(\theta \vert y_{1:N}) \, d\theta \\
&= E(\theta \vert y_{1:N}) \\
&= \frac{H_{0} + H}{H_{0} + H + T_{0} + T}
\end{aligned}$$
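The predictive integral can be checked numerically for the running example ($H_{0} = 10$, $T_{0} = 20$, $H = T = 50$). The sketch below is illustrative only: the function names `beta_pdf` and `predictive_heads` and the grid size are assumptions. It integrates $\theta P(\theta \vert y_{1:N})$ with a midpoint rule and compares the result with the closed form.

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta in (0, 1), computed in log space."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_kernel = (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta)
    return math.exp(log_kernel - log_norm)

def predictive_heads(h0, t0, heads, tails, n_grid=20000):
    """Midpoint-rule approximation of the integral of theta * posterior density."""
    a, b = h0 + heads, t0 + tails
    step = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        theta = (i + 0.5) * step
        total += theta * beta_pdf(theta, a, b) * step
    return total

numeric = predictive_heads(10, 20, 50, 50)
closed_form = (10 + 50) / (10 + 50 + 20 + 50)
print(numeric, closed_form)
```

The two values agree closely, and both differ from the maximum likelihood plug-in $H/(H+T) = 0.5$ because the prior still pulls the prediction toward $1/3$.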