- Let $x$ be distributed according to a *parametric family*: $p(x; \theta)$, $\theta \in \Theta$. The goal is, given iid observations $x_1, \ldots, x_N$, to estimate $\theta$. For instance, let $x_1, \ldots, x_N$ be a series of coin flips, where $x_i = 1$ denotes ``heads'' and $x_i = 0$ denotes ``tails''. The coin is weighted, so $P(x_i = 1)$ can be other than $1/2$. Let us define $\theta = P(x_i = 1)$; our goal is to estimate $\theta$. This simple distribution is given the name ``Bernoulli''.
- Without prior information, we use the
*maximum likelihood* approach. Let the observations be $D = \{x_1, \ldots, x_N\}$. Let $H$ be the number of heads observed and $T$ be the number of tails. The likelihood is $p(D; \theta) = \theta^H (1-\theta)^T$, and maximizing it over $\theta$ gives

$$\hat{\theta}_{ML} = \arg\max_\theta \, \theta^H (1-\theta)^T = \frac{H}{H+T}$$
- Not surprisingly, the probability of heads is estimated as the
empirical frequency of heads in the data sample.
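As a minimal sketch of this counting exercise (the flip data below is invented for illustration):

```python
# Maximum-likelihood estimate for a Bernoulli parameter: count and divide.
flips = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # invented data; 1 = heads, 0 = tails
H = sum(flips)            # number of heads
T = len(flips) - H        # number of tails
theta_ml = H / (H + T)    # empirical frequency of heads
print(theta_ml)           # -> 0.6
```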
- Suppose we remember that yesterday, using the same coin, we
recorded 10 heads and 20 tails. This is one way to indicate
``prior information'' about $\theta$. We simply include these past
trials in our estimate:

$$\hat{\theta} = \frac{H + 10}{H + T + 30}$$
- As $H + T$ goes to infinity, the effect of the past trials will
wash out.
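A quick numerical check of this washing-out effect, assuming (purely for illustration) that half of the new flips come up heads:

```python
# As H + T grows, the 10/20 pseudo-counts from yesterday matter less and less,
# and the estimate approaches the empirical frequency of the new data (0.5 here).
for n in (30, 300, 30000):
    H, T = n // 2, n - n // 2          # invented split: half heads, half tails
    est = (H + 10) / (H + T + 30)      # estimate with yesterday's trials included
    print(n, round(est, 4))            # 0.4167, then 0.4848, then 0.4998
```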
- Suppose that, due to a computer crash, we had lost the details of the
experiment, and our memory has also failed (due to lack of sleep), so that
we forget even the numbers of heads and tails (which are
the
*sufficient statistics* for the Bernoulli distribution). However, we believe the probability of heads is about $1/3$, but this probability itself is somewhat uncertain, since we only performed 30 trials.
- In short, we claim to have a *distribution*
over the probability $\theta$, which represents our prior belief.
Suppose this distribution is a Beta distribution, $p(\theta) = \mathrm{Beta}(\theta; \alpha, \beta) \propto \theta^{\alpha - 1} (1-\theta)^{\beta - 1}$, with $\alpha = 10$
and $\beta = 20$:

- Now we observe a new sequence of tosses: $D = \{x_1, \ldots, x_N\}$. We may
calculate the
*posterior* distribution according to *Bayes' Rule*:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$

The term $p(D \mid \theta)$ is, as before, the *likelihood* function of $\theta$. The *marginal* $p(D)$ comes by integrating out $\theta$:

$$p(D) = \int_0^1 p(D \mid \theta)\, p(\theta)\, d\theta$$
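These quantities can be checked numerically. The sketch below evaluates Bayes' Rule on a grid of $\theta$ values, assuming the Beta(10, 20) prior and (anticipating the data described next) a likelihood with 50 heads and 50 tails; the marginal $p(D)$ is a Riemann-sum approximation of the integral.

```python
# Grid approximation of the posterior p(theta | D) via Bayes' Rule.
N = 10_000
thetas = [(i + 0.5) / N for i in range(N)]   # midpoints of N bins on (0, 1)

def prior(th):        # unnormalized Beta(10, 20) density
    return th**9 * (1 - th)**19

def likelihood(th):   # p(D | theta) for 50 heads and 50 tails
    return th**50 * (1 - th)**50

marginal = sum(likelihood(th) * prior(th) for th in thetas) / N   # ~ p(D)
post = [likelihood(th) * prior(th) / (marginal * N) for th in thetas]
post_mean = sum(th * m for th, m in zip(thetas, post))
print(round(sum(post), 6), round(post_mean, 4))  # masses sum to 1; mean ~ 0.4615
```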
- To continue our example, suppose we observe in the new
data a sequence of 50 heads and 50 tails. The likelihood
becomes:

$$p(D \mid \theta) = \theta^{50} (1-\theta)^{50}$$
- Plugging this likelihood and the prior into the Bayes Rule expression,
and doing the math, obtains the posterior distribution as a
Beta distribution:

$$p(\theta \mid D) = \mathrm{Beta}(\theta; \alpha + 50, \beta + 50)$$
- Note that the posterior and prior distributions have the same form.
We call such a distribution a
*conjugate prior*. The Beta distribution is conjugate to the binomial distribution, which gives the likelihood of iid Bernoulli trials. As we will see, a conjugate prior perfectly captures the results of past experiments; equivalently, it allows us to express prior belief in terms of ``invented'' data. More importantly, conjugacy allows for efficient *sequential updating* of the posterior distribution, where the posterior at one stage is used as the prior for the next.

**Key Point** The ``output'' of the Bayesian analysis is not a *single estimate* of $\theta$, but rather *the entire posterior distribution*. The posterior distribution summarizes all our ``information'' about $\theta$. As we get more data, if the samples are truly iid, the posterior distribution will become more sharply peaked about a single value.

- Of course, we can use this distribution to make *inference* about $\theta$. Suppose an ``oracle'' were to tell us the true value of $\theta$ used to generate the samples. We want the guess $\hat{\theta}$ that minimizes the mean squared error between our guess and the true value. This is the same criterion as in maximum likelihood estimation. We would choose the *mean* of the posterior distribution, because we know the conditional mean minimizes mean squared error.
- Let our prior be $\mathrm{Beta}(\theta; \alpha, \beta)$ and let the data $D$ contain $H$ heads and $T$ tails; then

$$\hat{\theta} = E[\theta \mid D] = \frac{\alpha + H}{\alpha + \beta + H + T}$$
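A sketch of conjugate sequential updating: with a Beta prior, each batch of Bernoulli trials simply adds its counts to $(\alpha, \beta)$, and updating in two batches gives the same posterior as one combined batch. The batch split below is invented for illustration.

```python
def update(alpha, beta, heads, tails):
    """Beta-Bernoulli conjugate update: the pseudo-counts absorb the new counts."""
    return alpha + heads, beta + tails

prior = (10, 20)                      # Beta(10, 20) prior from the notes
batch = update(*prior, 30, 30)        # first batch (invented split of the data)
batch = update(*batch, 20, 20)        # second batch, using the posterior as prior
print(batch)                          # -> (60, 70)
print(update(*prior, 50, 50))         # same posterior in one shot: (60, 70)
```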
- In the same way, we can do
*prediction*. What is $p(x_{N+1} = 1 \mid D)$?

$$p(x_{N+1} = 1 \mid D) = \int_0^1 \theta\, p(\theta \mid D)\, d\theta = E[\theta \mid D]$$
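In the Beta-Bernoulli case this integral has a closed form: the predictive probability of heads equals the posterior mean. A sketch, assuming the posterior Beta(60, 70) derived above:

```python
alpha, beta = 60, 70                     # posterior Beta(alpha, beta)
posterior_mean = alpha / (alpha + beta)  # E[theta | D]
p_heads = posterior_mean                 # p(x_new = 1 | D) = E[theta | D]
print(round(p_heads, 4))                 # -> 0.4615
```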
Copyright © Center for Computer Research in Music and Acoustics (CCRMA), Stanford University
