Some-Definitions


2025-03-25

Introduction

Let $\theta$ denote the parameters of a model and $x$ denote input data. The dataset $X_{N\times p} = (x_1,x_2,\dots,x_N)^T$ has $N$ samples, each a vector of dimension $p$ generated by $p(x|\theta)$. We also assume that the samples are independent and identically distributed (i.i.d.).

There are two ways to view a probabilistic model. The frequentist view treats $\theta$ as an unknown constant, which can be estimated by maximum likelihood estimation (MLE):

$$\theta_{MLE} = \argmax_\theta \log{p(X|\theta)} = \argmax_\theta \sum_{i=1}^N \log{p(x_i|\theta)}$$
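As a minimal sketch of the MLE objective, assume a one-dimensional Gaussian model with $\theta = (\mu, \sigma^2)$. The model choice, the synthetic data, and the function names below are illustrative assumptions, not part of the original text. For this model the maximizer of the summed log-likelihood has a closed form (the sample mean and the sample variance with divisor $N$), so the code only checks that nearby parameter values score lower.

```python
import math
import random


def gaussian_log_likelihood(data, mu, sigma2):
    """Sum of log N(x_i | mu, sigma2) over i.i.d. samples."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - squared_error / (2 * sigma2)


def gaussian_mle(data):
    """Closed-form maximizer of the Gaussian log-likelihood."""
    n = len(data)
    mu = sum(data) / n
    sigma2 = sum((x - mu) ** 2 for x in data) / n  # divides by N, not N - 1
    return mu, sigma2


rng = random.Random(0)  # fixed seed, for reproducibility only
data = [rng.gauss(2.0, 1.5) for _ in range(500)]  # synthetic data (assumption)

mu_mle, sigma2_mle = gaussian_mle(data)
best = gaussian_log_likelihood(data, mu_mle, sigma2_mle)

# Any perturbed parameter value gives a lower (or equal) log-likelihood.
print(best >= gaussian_log_likelihood(data, mu_mle + 0.1, sigma2_mle))
print(best >= gaussian_log_likelihood(data, mu_mle, sigma2_mle * 1.2))
```

Working with the sum of log-densities rather than the product of densities is what the i.i.d. assumption buys: the product of many small probabilities would underflow, while the sum stays numerically stable.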

The Bayesian view uses probability theory to quantify uncertainty: $\theta$ is not a constant but a random variable that follows a prior distribution chosen in advance. Bayes' theorem combines the observations with the prior to give the posterior distribution:

$$p(\theta | X )=\frac{p(X|\theta)p(\theta)}{p(X)} = \frac{p(X|\theta)p(\theta)}{\int p(X|\theta)p(\theta)d\theta}$$

In the above equation, $p(\theta | X )$ is the posterior, and $p(X|\theta)$ and $p(\theta)$ are the likelihood and the prior, respectively. The denominator $p(X)$ is the marginal likelihood, obtained by integrating the product of the likelihood and the prior over $\theta$.

The denominator in Bayes' theorem is a normalization factor that ensures the posterior is a valid probability distribution. Since it does not depend on $\theta$, we can write Bayes' theorem as $\text{posterior} \propto \text{likelihood} \times \text{prior}$, where every component is viewed as a function of $\theta$.
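The proportionality can be made concrete with a small grid approximation. The sketch below assumes a coin-flip (Bernoulli) likelihood with a Beta(2, 2) prior on $\theta \in [0, 1]$. This model, the counts of 7 heads and 3 tails, and the function name `grid_posterior` are hypothetical choices for illustration. The code multiplies likelihood by prior on a grid, then divides by a numerical estimate of the integral so that the result integrates to one.

```python
def grid_posterior(heads, tails, a=2.0, b=2.0, grid_size=1001):
    """Approximate p(theta | X) on an evenly spaced grid over [0, 1]."""
    step = 1.0 / (grid_size - 1)
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    unnormalized = []
    for t in thetas:
        likelihood = t ** heads * (1 - t) ** tails
        prior = t ** (a - 1) * (1 - t) ** (b - 1)  # Beta(a, b) up to a constant
        unnormalized.append(likelihood * prior)
    # Riemann-sum estimate of the normalizing integral. The prior's own
    # constant is omitted above, but it cancels in the ratio below.
    evidence = sum(unnormalized) * step
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior, step


thetas, posterior, step = grid_posterior(heads=7, tails=3)
total_mass = sum(posterior) * step
posterior_mean = sum(t * p for t, p in zip(thetas, posterior)) * step
print(total_mass, posterior_mean)
```

For this conjugate pair the exact posterior is Beta(9, 5), whose mean is 9/14, so the grid result can be compared against a known answer. A grid only works in very low dimensions, which is one reason the normalizing integral is hard in general.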

To obtain a point estimate of $\theta$, we usually apply maximum a posteriori (MAP) estimation:

$$\theta_{MAP} = \argmax_\theta p(\theta|X) = \argmax_\theta p(X|\theta)p(\theta)$$
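A short sketch of MAP estimation, again assuming a hypothetical coin-flip model with a Beta(2, 2) prior (neither is specified in the text above). Because the normalizing constant does not depend on $\theta$, the code maximizes the unnormalized log-posterior, once by a simple grid search and once with the closed-form mode of the Beta posterior, and compares both with the MLE.

```python
import math


def log_posterior_unnormalized(theta, heads, tails, a=2.0, b=2.0):
    """log likelihood + log prior, dropping terms that do not depend on theta."""
    log_likelihood = heads * math.log(theta) + tails * math.log(1 - theta)
    log_prior = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    return log_likelihood + log_prior


def map_by_grid_search(heads, tails, a=2.0, b=2.0, grid_size=999):
    """Pick the grid point in (0, 1) with the highest log-posterior."""
    candidates = [i / (grid_size + 1) for i in range(1, grid_size + 1)]
    return max(
        candidates,
        key=lambda t: log_posterior_unnormalized(t, heads, tails, a, b),
    )


def map_closed_form(heads, tails, a=2.0, b=2.0):
    """Mode of Beta(heads + a, tails + b); valid when both parameters exceed 1."""
    return (heads + a - 1) / (heads + tails + a + b - 2)


heads, tails = 7, 3  # illustrative counts
theta_mle = heads / (heads + tails)
theta_map = map_closed_form(heads, tails)
theta_map_grid = map_by_grid_search(heads, tails)
print(theta_mle, theta_map, theta_map_grid)
```

With these counts the MAP estimate sits between the MLE and the prior mode of 0.5, which shows how the prior pulls the estimate away from what the data alone would suggest.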

Once we have the posterior distribution, we can use it to predict new data. This step is called Bayesian prediction:

$$p(x_{new}|X) = \int_\theta p(x_{new}|\theta) \cdot p(\theta|X)d\theta$$
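The predictive integral is an expectation of $p(x_{new}|\theta)$ under the posterior, so one common way to approximate it is to average over posterior samples. The sketch below assumes the same hypothetical coin-flip model with a Beta(2, 2) prior, where the posterior is Beta(heads + 2, tails + 2) and can be sampled directly with `random.betavariate` from the Python standard library. The function name and the counts are illustrative.

```python
import random


def posterior_predictive_heads(heads, tails, a=2.0, b=2.0,
                               num_samples=20000, seed=0):
    """Monte Carlo estimate of p(x_new = 1 | X) for a Beta-Bernoulli model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta = rng.betavariate(heads + a, tails + b)  # draw from p(theta | X)
        total += theta  # p(x_new = 1 | theta) equals theta for a coin flip
    return total / num_samples


mc_estimate = posterior_predictive_heads(heads=7, tails=3)
# For this conjugate model the integral is also available exactly.
exact = (7 + 2.0) / (7 + 3 + 2.0 + 2.0)
print(mc_estimate, exact)
```

Unlike plugging a single point estimate into $p(x_{new}|\theta)$, this average accounts for the remaining uncertainty about $\theta$.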

Some Disadvantages

The likelihood is determined by the dataset alone, and estimates obtained by MLE can be biased. For example, in a Gaussian model the MLE of the mean is unbiased, but the MLE of the variance underestimates the true variance $\sigma^2$: its expectation is $\frac{N-1}{N}\sigma^2$.
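The downward bias of the Gaussian variance estimate can be checked with a small simulation. This is a sketch under assumed settings (sample size 5, true variance 1, 20000 repetitions, fixed seed), and the function names are hypothetical. It averages the MLE variance over many small datasets and compares the average with the true variance.

```python
import random


def mle_variance(data):
    """Gaussian MLE of the variance: squared deviations divided by N."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


def average_mle_variance(n, true_sigma=1.0, trials=20000, seed=0):
    """Average the MLE variance over many independent datasets of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, true_sigma) for _ in range(n)]
        total += mle_variance(sample)
    return total / trials


n = 5
avg = average_mle_variance(n)
expected = (n - 1) / n  # theoretical mean of the MLE when the true variance is 1
print(avg, expected)
```

The average lands near $(N-1)/N$ rather than 1, and the gap shrinks as $N$ grows, so the bias matters most for small datasets.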

On the other hand, the Bayesian method must choose a prior distribution first. A prior has benefits, such as preventing the estimate from going to extremes. But critics argue that the prior is sometimes chosen only to simplify computation and does not reflect any real knowledge. In addition, the denominator is often very difficult to compute. Various methods have been proposed to address these problems.