JOS Index |
JOS Pubs |
JOS Home |
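- Let $X$ be distributed according to a parametric family:
$X \sim p(x \mid \theta)$. The goal is, given iid observations
$x_1, \ldots, x_n$, to estimate $\theta$. For instance, let
$x_1, \ldots, x_n$ be a series of coin flips where $x_i = 1$ denotes ``heads'' and
$x_i = 0$ denotes ``tails''. The coin is weighted, so $P(x_i = 1)$ can be
other than $1/2$. Let us define
$\theta = P(x_i = 1)$; our goal is
to estimate $\theta$. This simple distribution is given the name
``Bernoulli''.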
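- Without prior information, we use the maximum likelihood
approach. Let the observations be
$x_1, \ldots, x_{H+T}$. Let $H$ be
the number of heads observed and $T$ be the number of tails. Then
$$\hat{\theta}_{ML} = \arg\max_{\theta} \, \theta^{H} (1-\theta)^{T} = \frac{H}{H+T}$$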
- Not surprisingly, the probability of heads is estimated as the
empirical frequency of heads in the data sample.
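
The maximum likelihood estimate described above can be sketched in a few lines. This is a minimal illustration, not part of the original notes: the function name `ml_estimate` and the sample data are hypothetical, and flips are assumed to be coded as 1 for heads and 0 for tails.

```python
def ml_estimate(flips):
    """Maximum likelihood estimate of P(heads) from flips coded 1 (heads) / 0 (tails)."""
    heads = sum(flips)
    tails = len(flips) - heads
    if heads + tails == 0:
        raise ValueError("need at least one observation")
    # Empirical frequency of heads: H / (H + T)
    return heads / (heads + tails)


# Hypothetical sample: 3 heads and 7 tails.
flips = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
theta_ml = ml_estimate(flips)
print(theta_ml)
```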
- Suppose we remember that yesterday, using the same coin, we
recorded 10 heads and 20 tails. This is one way to indicate
``prior information'' about $\theta$. We simply include these past
trials in our estimate:
$$\hat{\theta} = \frac{H + 10}{H + T + 30}$$
- As $(H+T)$ goes to infinity, the effect of the past trials will
wash out.
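
To see how the remembered trials lose influence as new data accumulate, the sketch below adds the 10 past heads and 20 past tails to the new counts and compares the result with the plain empirical frequency. The function name `estimate_with_past` and the choice of a new data set that is exactly half heads are assumptions made for illustration.

```python
PAST_HEADS, PAST_TAILS = 10, 20  # yesterday's remembered trials


def estimate_with_past(heads, tails):
    """Estimate P(heads) by pooling the new counts with the remembered past trials."""
    return (heads + PAST_HEADS) / (heads + tails + PAST_HEADS + PAST_TAILS)


# Hypothetical new data: exactly half heads, for growing sample sizes.
for n in (10, 100, 1000, 100000):
    heads = n // 2
    tails = n - heads
    print(n, heads / n, estimate_with_past(heads, tails))
```

With no new data the estimate is the past frequency 10/30; as the number of new tosses grows it approaches the empirical frequency of the new data.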
- Suppose, due to a computer crash, we had lost the details of the
experiment, and our memory has also failed (due to lack of sleep), so that
we forget even the number of heads and tails (which are
the sufficient statistics for the Bernoulli distribution).
However, we believe the probability of heads is about
$1/3$, but this probability itself is somewhat uncertain, since we
only performed 30 trials.
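- In short, we claim to have a prior distribution
over the probability $\theta$, which represents our prior belief.
Suppose this distribution is $p(\theta)$ and
$p(\theta) \sim \mathrm{Beta}(10, 20)$:
$$p(\theta) = \frac{\theta^{9} (1-\theta)^{19}}{\int_0^1 u^{9} (1-u)^{19} \, du}$$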
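- Now we observe a new sequence of tosses: $x_{1:n} = (x_1, \ldots, x_n)$. We may
calculate the posterior distribution $p(\theta \mid x_{1:n})$
according to Bayes' Rule:
$$p(\theta \mid x_{1:n}) = \frac{p(x_{1:n} \mid \theta) \, p(\theta)}{p(x_{1:n})}$$
The term $p(x_{1:n} \mid \theta)$ is, as before, the likelihood function of
$\theta$. The marginal $p(x_{1:n})$ comes by integrating out $\theta$:
$$p(x_{1:n}) = \int_0^1 p(x_{1:n} \mid \theta) \, p(\theta) \, d\theta$$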
- To continue our example, suppose we observe in the new
data a sequence of 50 heads and 50 tails. The likelihood
becomes
$$p(x_{1:n} \mid \theta) = \theta^{50} (1-\theta)^{50}$$
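- Plugging this likelihood and the prior into the Bayes Rule expression,
and doing the math, we obtain the posterior distribution as a
$\mathrm{Beta}(60, 70)$:
$$p(\theta \mid x_{1:n}) = \frac{\theta^{59} (1-\theta)^{69}}{\int_0^1 u^{59} (1-u)^{69} \, du}$$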
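
The Bayes' Rule calculation can also be checked numerically. The sketch below is an illustration under stated assumptions, not the notes' own method: it takes the prior to be Beta(10, 20) (matching the 10 heads and 20 tails remembered earlier), multiplies it by the likelihood of 50 heads and 50 tails on a grid, divides by the marginal obtained with a midpoint-rule integral, and compares the result with a Beta(60, 70) density. The helper `beta_pdf` and the grid size are choices made here.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta in (0, 1), computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))


N = 2000
grid = [(i + 0.5) / N for i in range(N)]  # midpoints, so theta is never exactly 0 or 1
d_theta = 1.0 / N

prior = [beta_pdf(t, 10, 20) for t in grid]            # assumed prior: Beta(10, 20)
likelihood = [t**50 * (1 - t)**50 for t in grid]        # 50 heads, 50 tails
unnormalized = [l * p for l, p in zip(likelihood, prior)]

# Marginal: integrate likelihood * prior over theta (midpoint rule).
marginal = sum(unnormalized) * d_theta
posterior = [u / marginal for u in unnormalized]

# Largest pointwise gap between the numerical posterior and the Beta(60, 70) density.
max_diff = max(abs(p - beta_pdf(t, 60, 70)) for p, t in zip(posterior, grid))
print(max_diff)
```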
- Note that the posterior and prior distributions have the same form.
We call such a distribution a conjugate prior. The Beta
distribution is conjugate to the binomial distribution which gives the
likelihood of iid Bernoulli trials. As we will see, a conjugate prior
perfectly captures the results of past experiments. Or, it allows
us to express prior belief in terms of ``invented'' data. More
importantly, conjugacy allows for efficient sequential updating
of the posterior distribution, where the posterior at one stage is
used as prior for the next.
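
Sequential updating with a conjugate Beta prior reduces to adding counts. The sketch below assumes a Beta(10, 20) prior and splits the 50 heads and 50 tails into two hypothetical batches; the function name `update` and the batch split are illustrative choices, not taken from the notes.

```python
def update(beta_params, heads, tails):
    """Conjugate update: Beta(a, b) prior + (heads, tails) -> Beta(a + heads, b + tails)."""
    a, b = beta_params
    return (a + heads, b + tails)


prior = (10, 20)                  # assumed prior: Beta(10, 20)
batches = [(20, 30), (30, 20)]    # hypothetical split of 50 heads and 50 tails

# Process the batches one at a time, feeding each posterior in as the next prior.
posterior = prior
for heads, tails in batches:
    posterior = update(posterior, heads, tails)

# Processing all the data at once gives the same answer.
one_shot = update(prior, 50, 50)
print(posterior, one_shot)
```

Because only the two Beta parameters need to be stored between stages, the raw tosses can be discarded after each update.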
- Key Point: The ``output'' of the Bayesian analysis is
not a single estimate of $\theta$, but rather
the entire posterior distribution. The posterior distribution
summarizes all our ``information'' about $\theta$. As we get more data,
if the samples are truly iid, the posterior distribution will become more
sharply peaked about a single value.
- Of course, we can use this distribution to make inference
about $\theta$. Suppose an ``oracle'' were to tell
us the true value of $\theta$ used to generate the samples.
We want the guess $\hat{\theta}$ that minimizes the mean squared error between
our guess and the true value. Mean squared error is also the criterion by
which point estimators such as the maximum likelihood estimate are commonly
judged. We would choose the mean of the
posterior distribution, because we know the conditional mean minimizes
mean squared error.
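
The claim that the posterior mean minimizes expected squared error can be checked numerically. The sketch below assumes the posterior is Beta(60, 70), as in the running example, and compares the posterior mean with two other candidate guesses using a midpoint-rule integral; the helper names and the candidate list are illustrative assumptions.

```python
import math

A, B = 60, 70  # assumed posterior: Beta(60, 70)


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta in (0, 1), computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))


N = 2000
grid = [(i + 0.5) / N for i in range(N)]  # midpoints of N equal cells on (0, 1)


def expected_sq_error(guess):
    """E[(theta - guess)^2] under the posterior, by midpoint-rule integration."""
    return sum((t - guess) ** 2 * beta_pdf(t, A, B) for t in grid) / N


posterior_mean = A / (A + B)
candidates = {
    "posterior mean": posterior_mean,
    "frequency in new data": 50 / 100,
    "prior mean": 10 / 30,
}
errors = {name: expected_sq_error(guess) for name, guess in candidates.items()}
for name, err in errors.items():
    print(name, err)
```

For the posterior mean, the expected squared error equals the posterior variance; any other guess adds the squared distance between that guess and the mean.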
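- Let our prior be $\mathrm{Beta}(a, b)$. After observing $H$ heads and $T$ tails,
the posterior is $\mathrm{Beta}(a + H, b + T)$, and its mean is
$$\hat{\theta} = E[\theta \mid x_{1:n}] = \frac{a + H}{a + b + H + T}$$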
- The same way, we can do prediction. What is
$P(x_{n+1} = 1 \mid x_{1:n})$?