Bayesian Learning Exercises 3 notes

Lecture 7, Slide 21: Gibbs sampling for mixture distributions

Prior and notation

Prior:

\pi \sim \operatorname{Beta}(\alpha_1,\alpha_2)

Conjugate prior for (μ_j, σ_j^2), see Slide 16.

Here π is the mixing probability for component 1. The latent indicator I_i records which mixture component generated observation x_i, so the Gibbs sampler alternates between updating the component parameters and updating these hidden labels.

Define:

n_1=\sum_{i=1}^{n}\mathbf{1}(I_i=1),\qquad n_2=n-n_1

Gibbs sampling steps

\pi\mid I,x \sim \operatorname{Beta}(\alpha_1+n_1,\alpha_2+n_2)
\sigma_1^2\mid I,\mu_1,x \sim \operatorname{Inv}\text{-}\chi^2(\nu_{n_1},\sigma_{n_1}^2) \quad\text{and}\quad \mu_1\mid I,\sigma_1^2,x \sim N(\mu_{n_1},\tau_{n_1}^2)
\sigma_2^2\mid I,\mu_2,x \sim \operatorname{Inv}\text{-}\chi^2(\nu_{n_2},\sigma_{n_2}^2) \quad\text{and}\quad \mu_2\mid I,\sigma_2^2,x \sim N(\mu_{n_2},\tau_{n_2}^2)
I_i\mid \pi,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2,x \sim \operatorname{Bern}(\theta_i), \qquad i=1,\ldots,n
\theta_i= \frac{\pi\phi(x_i;\mu_1,\sigma_1^2)}{\pi\phi(x_i;\mu_1,\sigma_1^2)+(1-\pi)\phi(x_i;\mu_2,\sigma_2^2)}

This θ_i is the posterior probability that observation x_i belongs to component 1, obtained by comparing how well both mixture components explain the same observation.
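
To make the alternation concrete, here is a minimal sketch of the sampler in Python. The prior hyperparameters (mu0, tau0sq, nu0, sigma0sq) and the semi-conjugate update formulas are assumptions standing in for the Slide 16 prior, not taken from the slides.

```python
# Minimal Gibbs sketch for a two-component normal mixture, assuming
# semi-conjugate priors mu_j ~ N(mu0, tau0sq) and sigma_j^2 ~ Inv-chi2(nu0, sigma0sq).
import numpy as np


def norm_pdf(x, m, s2):
    """Normal density phi(x; m, s2)."""
    return np.exp(-0.5 * (x - m) ** 2 / s2) / np.sqrt(2 * np.pi * s2)


def gibbs_mixture(x, n_iter=2000, alpha=(1.0, 1.0),
                  mu0=0.0, tau0sq=10.0, nu0=1.0, sigma0sq=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = 0.5
    mu = np.array([x.min(), x.max()], dtype=float)
    sig2 = np.array([x.var(), x.var()], dtype=float)
    I = rng.integers(0, 2, size=n)              # 0 = component 1, 1 = component 2
    draws = np.empty((n_iter, 5))               # pi, mu1, mu2, sigma1^2, sigma2^2

    for t in range(n_iter):
        # 1. Component parameters given the labels
        for j in (0, 1):
            xj = x[I == j]
            nj = len(xj)
            tau_n2 = 1.0 / (1.0 / tau0sq + nj / sig2[j])          # mu_j | sigma_j^2, I, x
            mu_n = tau_n2 * (mu0 / tau0sq + xj.sum() / sig2[j])
            mu[j] = rng.normal(mu_n, np.sqrt(tau_n2))
            nu_n = nu0 + nj                                        # sigma_j^2 | mu_j, I, x
            sigma_n2 = (nu0 * sigma0sq + np.sum((xj - mu[j]) ** 2)) / nu_n
            sig2[j] = nu_n * sigma_n2 / rng.chisquare(nu_n)        # scaled Inv-chi2 draw
        # 2. Mixing probability given the labels
        n1 = int(np.sum(I == 0))
        pi = rng.beta(alpha[0] + n1, alpha[1] + n - n1)
        # 3. Labels given the parameters: I_i = component 1 with probability theta_i
        w1 = pi * norm_pdf(x, mu[0], sig2[0])
        w2 = (1.0 - pi) * norm_pdf(x, mu[1], sig2[1])
        theta = w1 / (w1 + w2)
        I = (rng.random(n) > theta).astype(int)
        draws[t] = pi, mu[0], mu[1], sig2[0], sig2[1]
    return draws
```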


1. Gibbs step for an indicator variable

Problem

Show that the full conditional posterior of I_i on Lecture 7, Slide 21, is correct.

Handwritten solution

Show that

I_i \mid \pi,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2,x_i \sim \operatorname{Bern}(\theta_i), \qquad i=1,\ldots,n, \qquad I_i\in\{1,2\},

where

\theta_i= \frac{ \pi\phi(x_i;\mu_1,\sigma_1^2) }{ \pi\phi(x_i;\mu_1,\sigma_1^2)+(1-\pi)\phi(x_i;\mu_2,\sigma_2^2) }

Model

The normal density model is

P(x_i\mid I_i,\pi,\mu_1,\mu_2,\sigma_1^2,\sigma_2^2) = \begin{cases} \phi(x_i;\mu_1,\sigma_1^2), & \text{if } I_i=1\\ \phi(x_i;\mu_2,\sigma_2^2), & \text{if } I_i=2 \end{cases}

and the prior probability of the indicator is

P(I_i\mid\pi) = \begin{cases} \pi, & \text{if } I_i=1\\ 1-\pi, & \text{if } I_i=2. \end{cases}

Conditional independence

In the Gibbs sampler, we sample

\pi,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2 \quad \text{given } I_1,\ldots,I_n

and then sample

I_1,\ldots,I_n \quad \text{given } \pi,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2.

Therefore,

P(I_1,\ldots,I_n\mid \vec{x},\pi,\cdot) = \prod_{i=1}^{n}P(I_i\mid x_i,\pi,\cdot),

where \cdot denotes μ_1, μ_2, σ_1^2, σ_2^2.

The important simplification is that once the parameters are fixed, each observation only needs its own x_i to update its own label. The other observations affect I_i only indirectly through the current parameter values.

Fact 1:

I_i\perp I_j\mid \vec{x},\pi,\cdot, \qquad \text{for } i\neq j.

Fact 2:

I_i\perp x_j\mid x_i,\pi,\cdot, \qquad \text{for } i\neq j.

Full conditional

To derive the full conditional, we keep only the terms that depend on I_i. Constants that are the same for I_i = 1 and I_i = 2 disappear into the normalizing constant.

By Bayes' theorem,

P(I_i\mid x_i,\pi,\cdot) \propto P(x_i\mid I_i,\pi,\cdot)P(I_i\mid\pi).

Hence,

P(I_i\mid x_i,\pi,\cdot) \propto \phi(x_i\mid\mu_{I_i},\sigma_{I_i}^2)P(I_i\mid\pi),

so

P(I_i\mid x_i,\pi,\cdot) \propto \begin{cases} \phi(x_i\mid\mu_1,\sigma_1^2)\pi, & \text{if } I_i=1\\ \phi(x_i\mid\mu_2,\sigma_2^2)(1-\pi), & \text{if } I_i=2. \end{cases}

Since I_i ∈ {1, 2}, the full conditional is Bernoulli after normalization. Find the normalizing constant from

\sum_{k=1}^{2}P(I_i=k\mid x_i,\pi,\cdot)=1.

Thus,

1=P(I_i=1\mid x_i,\pi,\cdot)+P(I_i=2\mid x_i,\pi,\cdot),
1=c\,\phi(x_i\mid\mu_1,\sigma_1^2)\,\pi + c\,\phi(x_i\mid\mu_2,\sigma_2^2)\,(1-\pi),

and

c= \frac{1}{ \pi\phi(x_i\mid\mu_1,\sigma_1^2)+(1-\pi)\phi(x_i\mid\mu_2,\sigma_2^2) }.

Therefore,

\theta_i=P(I_i=1\mid x_i,\pi,\cdot) = \frac{ \pi\phi(x_i\mid\mu_1,\sigma_1^2) }{ \pi\phi(x_i\mid\mu_1,\sigma_1^2)+(1-\pi)\phi(x_i\mid\mu_2,\sigma_2^2) }.

Intuitively, the numerator is the unnormalized weight for component 1. The denominator adds the two possible component weights, so the ratio turns the weight into a probability.
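
As a small numerical illustration (all parameter values below are made up), θ_i can be evaluated on the log scale, which avoids underflow when an observation is far from both components:

```python
# Made-up parameter values, just to show how theta_i is computed in practice.
import numpy as np
from scipy.stats import norm


def theta_i(x_i, pi, mu1, sig1sq, mu2, sig2sq):
    log_w1 = np.log(pi) + norm.logpdf(x_i, mu1, np.sqrt(sig1sq))
    log_w2 = np.log(1.0 - pi) + norm.logpdf(x_i, mu2, np.sqrt(sig2sq))
    m = max(log_w1, log_w2)                      # log-sum-exp for numerical stability
    w1, w2 = np.exp(log_w1 - m), np.exp(log_w2 - m)
    return w1 / (w1 + w2)


print(theta_i(1.2, pi=0.4, mu1=0.0, sig1sq=1.0, mu2=3.0, sig2sq=1.0))
```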


2. Frequentist vs Bayesian

Problem

2(a)

Let x_1, …, x_n | θ ~ iid Uniform(θ − 1/2, θ + 1/2). Let θ̂ = x̄ = (1/n) ∑_{i=1}^{n} x_i be an estimator of θ. Derive an expression for the repeated sampling variance of θ̂.

2(b)

Derive the posterior distribution for θ assuming a uniform prior distribution.

Hint: Here it is absolutely crucial to think about the support for the data distribution. Once you have observed some data, some values are no longer possible. I strongly suggest that you plot some imaginary data on the real line and plot the data distribution in the same graph for some made-up values of θ. Just to make you think in the right direction.

2(c)

Assume that you have observed three observations: x_1 = 1.13, x_2 = 2.1, x_3 = 1.38. What do we conclude about θ from a frequentist perspective? What do we conclude about θ from a Bayesian perspective? Discuss.

Handwritten solution

2(a)

Let

x_1,\ldots,x_n \stackrel{iid}{\sim} \operatorname{Uniform}\left(\theta-\frac12,\theta+\frac12\right).

Let

\hat{\theta}=\bar{x} = \frac1n\sum_{i=1}^{n}x_i.

Then

\operatorname{Var}(\hat{\theta}) = \operatorname{Var}\left(\frac1n\sum_{i=1}^{n}x_i\right)
= \frac1{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i)
= \frac1{n^2}\,n\operatorname{Var}(x_i)
= \frac1n\cdot\frac1{12} \left(\theta+\frac12-\left(\theta-\frac12\right)\right)^2
= \frac1{12n}.

This is a repeated sampling variance: it describes how much x̄ would vary if we repeatedly collected new samples of size n from the same model. It is not a posterior variance for θ.
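
A quick Monte Carlo check of this variance; θ and n below are arbitrary illustration values:

```python
# Monte Carlo check of Var(xbar) = 1/(12n) for the Uniform(theta - 1/2, theta + 1/2) model.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 1.5, 3, 200_000
xbar = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, n)).mean(axis=1)
print(xbar.var(), 1 / (12 * n))   # both close to 0.0278
```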

2(b)

Assume a flat uniform prior:

p(\theta)\propto 1.

The model density is

p(x_i\mid\theta) = \begin{cases} 1, & \theta-\frac12\leq x_i\leq\theta+\frac12\\ 0, & \text{otherwise.} \end{cases}

The posterior is

p(\theta\mid\vec{x}) \propto p(\vec{x}\mid\theta)\,p(\theta)
\propto \prod_{i=1}^{n}p(x_i\mid\theta)\,p(\theta).

Within the support allowed by all observations, this is proportional to 1, so θ | x is uniform on the feasible interval.

The key point is the support. A candidate value of θ is possible only if every observed x_i lies inside the interval of width 1 centered at θ. Values of θ outside this feasible interval get likelihood zero.

Determine the bounds for θ | x.

Given

\vec{x}=(x_1,\ldots,x_n),

let

x_{\min}=\min\{x_1,\ldots,x_n\}, \qquad x_{\max}=\max\{x_1,\ldots,x_n\}.

The likelihood is positive only if

\theta-\frac12\leq x_i\leq\theta+\frac12, \qquad \text{for all } i.

Therefore,

\theta-\frac12\leq x_{\min} \quad\text{and}\quad x_{\max}\leq\theta+\frac12,

which implies

\theta\leq x_{\min}+\frac12 \quad\text{and}\quad x_{\max}-\frac12\leq\theta.

Thus,

x_{\max}-\frac12\leq\theta\leq x_{\min}+\frac12,

and

\theta\mid\vec{x} \sim \operatorname{Uniform}\left(x_{\max}-\frac12, x_{\min}+\frac12\right).

2(c)

Given

x_1=1.13, \quad x_2=2.1, \quad x_3=1.38,

the frequentist estimate is

\hat{\theta}=\bar{x}=1.537, \qquad \operatorname{Var}(\hat{\theta})=\frac1{12n}=\frac1{36}\approx 0.028.

The Bayesian posterior is

\theta\mid\vec{x} \sim \operatorname{Uniform}\left(2.1-\frac12,\;1.13+\frac12\right) = \operatorname{Uniform}(1.6,1.63).

The frequentist estimate summarizes the center of the observed data through x̄. The Bayesian posterior instead uses the model support to rule out impossible values of θ, leaving only a short interval of feasible values.
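
The numbers above can be reproduced directly from the three observations; this snippet only restates the formulas already derived:

```python
import numpy as np

x = np.array([1.13, 2.1, 1.38])
n = len(x)
print(x.mean(), 1 / (12 * n))          # 1.537, 0.028  (frequentist estimate and its variance)
print(x.max() - 0.5, x.min() + 0.5)    # 1.6, 1.63     (support of the Bayesian posterior)
```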


3. The posterior becomes more normal

Problem

3(a)

Let x_1, …, x_n | θ ~ iid Bern(θ) and let θ ~ Beta(α, β) a priori. Find the posterior mode of θ.

3(b)

Approximate the posterior distribution of θ by a normal distribution.

3(c)

Assume now that you have the data n = 8 and s = 2. Plot the true posterior distribution and the normal approximation in the same graph. Assume a uniform prior for θ.

3(d)

Redo the previous exercise, but this time with twice the data size: n = 16 and s = 4. What do you conclude?

Handwritten solution

3(a)

Find the stationary point:

\frac{\partial \log P(\theta\mid\vec{x})}{\partial\theta}=0.

Let s be the number of successes and f = n − s be the number of failures. The posterior is Beta(α + s, β + f), so

\log P(\theta\mid\vec{x}) = \text{const} + (\alpha+s-1)\log\theta + (\beta+f-1)\log(1-\theta),

and the stationary-point condition becomes

\frac{\alpha+s-1}{\theta} - \frac{\beta+f-1}{1-\theta} = 0 \quad\Longleftrightarrow\quad (1-\theta)(\alpha+s-1)=\theta(\beta+f-1).

Expanding gives

\alpha+s-1-\theta(\alpha+s-1) = \theta(\beta+f-1),

so

\alpha+s-1 = \theta(\alpha+\beta+s+f-2).

Therefore,

\hat{\theta} = \frac{\alpha+s-1}{\alpha+\beta+n-2}.

This is the posterior mode, not necessarily the posterior mean. It is the value of θ where the posterior density is highest, assuming the mode is inside the interval (0, 1).

3(b)

Approximate the posterior by

\theta\mid\vec{x} \approx N\left(\hat{\theta},J_{\vec{x}}^{-1}(\hat{\theta})\right),

where J_x(θ̂) is the negative Hessian of the log posterior, evaluated at θ̂.

This is a Laplace approximation. Near its mode, a smooth log posterior can be approximated by a quadratic curve; exponentiating that quadratic gives a normal density. The inverse curvature controls the approximate variance.

J_{\vec{x}}(\theta) = - \frac{\partial^2\log P(\theta\mid\vec{x})}{\partial\theta^2}
= - \frac{\partial}{\partial\theta} \left( \frac{\partial}{\partial\theta}\log P(\theta\mid\vec{x}) \right)
= - \frac{\partial}{\partial\theta} \left( \frac{\alpha+s-1}{\theta} - \frac{\beta+f-1}{1-\theta} \right)
= \frac{\alpha+s-1}{\theta^2} + \frac{\beta+f-1}{(1-\theta)^2}.

Also,

1-\hat{\theta} = \frac{ \alpha+\beta+n-2-\alpha-s+1 }{ \alpha+\beta+n-2 } = \frac{\beta+f-1}{\alpha+\beta+n-2}.

Substitute θ̂ and 1 − θ̂ into J_x(θ̂):

J_{\vec{x}}(\hat{\theta}) = \frac{\alpha+s-1}{(\alpha+s-1)^2}(\alpha+\beta+n-2)^2 + \frac{\beta+f-1}{(\beta+f-1)^2}(\alpha+\beta+n-2)^2
= \frac{(\alpha+\beta+n-2)^2}{\alpha+s-1} + \frac{(\alpha+\beta+n-2)^2}{\beta+f-1}
= \frac{(\alpha+\beta+s+f-2)(\alpha+\beta+n-2)^2}{(\alpha+s-1)(\beta+f-1)}
= \frac{(\alpha+\beta+n-2)^3}{(\alpha+s-1)(\beta+f-1)}.

Therefore,

J_{\vec{x}}^{-1}(\hat{\theta}) = \frac{1}{J_{\vec{x}}(\hat{\theta})} = \frac{(\alpha+s-1)(\beta+f-1)}{(\alpha+\beta+n-2)^3}.

Hence,

\theta\mid\vec{x} \approx N\left(\hat{\theta},V_{\hat{\theta}}\right),

where

\hat{\theta}=\frac{\alpha+s-1}{\alpha+\beta+n-2}

and

V_{\hat{\theta}} = \frac{(\alpha+s-1)(\beta+f-1)}{(\alpha+\beta+n-2)^3}.

As the sample size increases while the success proportion stays similar, the posterior becomes more concentrated and the normal approximation usually improves. This is the sense in which the posterior becomes more normal.
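
A small helper that evaluates the mode and the Laplace variance from the formulas above. The call with α = β = 1, s = 2, n = 8 is only an illustration; note it gives the Laplace variance, not the moment-matched one used for the plots below.

```python
def laplace_beta_bernoulli(alpha, beta, s, n):
    """Posterior mode and Laplace variance for Bernoulli data with a Beta(alpha, beta) prior."""
    f = n - s
    mode = (alpha + s - 1) / (alpha + beta + n - 2)
    var = (alpha + s - 1) * (beta + f - 1) / (alpha + beta + n - 2) ** 3
    return mode, var


print(laplace_beta_bernoulli(1, 1, 2, 8))   # (0.25, 0.0234...)
```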

3(c) Plot: n = 8, s = 2

For the plots, use the moment-matched normal approximation from the handwritten note:

\theta\mid\vec{x}\approx N\left( \frac{\alpha+s}{\alpha+\beta+n}, \frac{(\alpha+s)(\beta+f)}{(\alpha+\beta+n)^2(\alpha+\beta+n+1)} \right).

With a uniform prior, α = β = 1, so for n = 8 and s = 2 we have

\theta\mid\vec{x}\sim \operatorname{Beta}(3,7), \qquad \theta\mid\vec{x}\approx N(0.30,\,0.019).

[Figure: True posterior and normal approximation for n = 8, s = 2]

3(d) Plot: n = 16, s = 4

Doubling the data while keeping the same success proportion gives

\theta\mid\vec{x}\sim \operatorname{Beta}(5,13), \qquad \theta\mid\vec{x}\approx N(0.28,\,0.011).

[Figure: True posterior and normal approximation for n = 16, s = 4]

The second posterior is more concentrated and the normal curve tracks the true posterior more closely. This supports the conclusion that, as the amount of data increases, the posterior tends to look more normal.
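
For reference, a plotting sketch that reproduces both figures using the moment-matched approximation from 3(c); the grid size and figure layout are arbitrary choices.

```python
# Plot sketch for 3(c) and 3(d): true Beta posterior vs. the moment-matched
# normal approximation under a uniform prior (alpha = beta = 1).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta, norm

theta = np.linspace(0, 1, 500)
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (n, s) in zip(axes, [(8, 2), (16, 4)]):
    a, b = 1 + s, 1 + (n - s)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    ax.plot(theta, beta.pdf(theta, a, b), label=f"Beta({a},{b})")
    ax.plot(theta, norm.pdf(theta, mean, np.sqrt(var)), "--",
            label=f"N({mean:.2f}, {var:.3f})")
    ax.set_title(f"n = {n}, s = {s}")
    ax.set_xlabel("theta")
    ax.legend()
plt.tight_layout()
plt.show()
```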