Bayesian Learning Exercises 3 notes

Lecture 7, Slide 21: Gibbs sampling for mixture distributions

Prior and notation

Prior:

\pi \sim \operatorname{Beta}(\alpha_1,\alpha_2)

Conjugate prior for (μ_j, σ_j^2), see Slide 16.

Here π is the mixing probability for component 1. The latent indicator I_i records which mixture component generated observation x_i, so the Gibbs sampler alternates between updating the component parameters and updating these hidden labels.

Define:

n_1=\sum_{i=1}^{n}\mathbf{1}(I_i=1),\qquad n_2=n-n_1

Gibbs sampling steps

\pi\mid I,x \sim \operatorname{Beta}(\alpha_1+n_1,\alpha_2+n_2)
\sigma_1^2\mid I,\mu_1,x \sim \operatorname{Inv}\text{-}\chi^2(\nu_{n_1},\sigma_{n_1}^2) \quad\text{and}\quad \mu_1\mid I,\sigma_1^2,x \sim N(\mu_{n_1},\tau_{n_1}^2)
\sigma_2^2\mid I,\mu_2,x \sim \operatorname{Inv}\text{-}\chi^2(\nu_{n_2},\sigma_{n_2}^2) \quad\text{and}\quad \mu_2\mid I,\sigma_2^2,x \sim N(\mu_{n_2},\tau_{n_2}^2)
I_i\mid \pi,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2,x \sim \operatorname{Bern}(\theta_i), \qquad i=1,\ldots,n
\theta_i= \frac{\pi\phi(x_i;\mu_1,\sigma_1^2)}{\pi\phi(x_i;\mu_1,\sigma_1^2)+(1-\pi)\phi(x_i;\mu_2,\sigma_2^2)}

This θ_i is the posterior probability that observation x_i belongs to component 1, obtained by comparing how well both mixture components explain the same observation.
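
To make the alternation concrete, here is a minimal sketch of the sampler in Python. The prior hyperparameters (mu0, tau0sq, nu0, sigma0sq) and the semi-conjugate update formulas are assumptions standing in for the Slide 16 prior, not taken from the slides.

```python
# Minimal Gibbs sketch for a two-component normal mixture, assuming
# semi-conjugate priors mu_j ~ N(mu0, tau0sq) and sigma_j^2 ~ Inv-chi2(nu0, sigma0sq).
import numpy as np


def norm_pdf(x, m, s2):
    """Normal density phi(x; m, s2)."""
    return np.exp(-0.5 * (x - m) ** 2 / s2) / np.sqrt(2 * np.pi * s2)


def gibbs_mixture(x, n_iter=2000, alpha=(1.0, 1.0),
                  mu0=0.0, tau0sq=10.0, nu0=1.0, sigma0sq=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = 0.5
    mu = np.array([x.min(), x.max()], dtype=float)
    sig2 = np.array([x.var(), x.var()], dtype=float)
    I = rng.integers(0, 2, size=n)              # 0 = component 1, 1 = component 2
    draws = np.empty((n_iter, 5))               # pi, mu1, mu2, sigma1^2, sigma2^2

    for t in range(n_iter):
        # 1. Component parameters given the labels
        for j in (0, 1):
            xj = x[I == j]
            nj = len(xj)
            tau_n2 = 1.0 / (1.0 / tau0sq + nj / sig2[j])          # mu_j | sigma_j^2, I, x
            mu_n = tau_n2 * (mu0 / tau0sq + xj.sum() / sig2[j])
            mu[j] = rng.normal(mu_n, np.sqrt(tau_n2))
            nu_n = nu0 + nj                                        # sigma_j^2 | mu_j, I, x
            sigma_n2 = (nu0 * sigma0sq + np.sum((xj - mu[j]) ** 2)) / nu_n
            sig2[j] = nu_n * sigma_n2 / rng.chisquare(nu_n)        # scaled Inv-chi2 draw
        # 2. Mixing probability given the labels
        n1 = int(np.sum(I == 0))
        pi = rng.beta(alpha[0] + n1, alpha[1] + n - n1)
        # 3. Labels given the parameters: I_i = component 1 with probability theta_i
        w1 = pi * norm_pdf(x, mu[0], sig2[0])
        w2 = (1.0 - pi) * norm_pdf(x, mu[1], sig2[1])
        theta = w1 / (w1 + w2)
        I = (rng.random(n) > theta).astype(int)
        draws[t] = pi, mu[0], mu[1], sig2[0], sig2[1]
    return draws
```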


1. Gibbs step for an indicator variable

Problem

Show that the full conditional posterior of I_i on Lecture 7, Slide 21, is correct.

Handwritten solution

Show that

I_i \mid \pi,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2,x_i \sim \operatorname{Bern}(\theta_i), \qquad i=1,\ldots,n, \qquad I_i\in\{1,2\},

where

\theta_i= \frac{ \pi\phi(x_i;\mu_1,\sigma_1^2) }{ \pi\phi(x_i;\mu_1,\sigma_1^2)+(1-\pi)\phi(x_i;\mu_2,\sigma_2^2) }

Model

The normal density model is

P(x_i\mid I_i,\pi,\mu_1,\mu_2,\sigma_1^2,\sigma_2^2) = \begin{cases} \phi(x_i;\mu_1,\sigma_1^2), & \text{if } I_i=1\\ \phi(x_i;\mu_2,\sigma_2^2), & \text{if } I_i=2 \end{cases}

and the prior probability of the indicator is

P(I_i\mid\pi) = \begin{cases} \pi, & \text{if } I_i=1\\ 1-\pi, & \text{if } I_i=2. \end{cases}

Conditional independence

In the Gibbs sampler, we sample

\pi,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2 \quad \text{given } I_1,\ldots,I_n

and then sample

I_1,\ldots,I_n \quad \text{given } \pi,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2.

Therefore,

P(I_1,\ldots,I_n\mid \vec{x},\pi,\cdot) = \prod_{i=1}^{n}P(I_i\mid x_i,\pi,\cdot),

where \cdot denotes μ_1, μ_2, σ_1^2, σ_2^2.

The important simplification is that once the parameters are fixed, each observation only needs its own x_i to update its own label. The other observations affect I_i only indirectly through the current parameter values.

Fact 1:

I_i\perp I_j\mid \vec{x},\pi,\cdot, \qquad \text{for } i\neq j.

Fact 2:

I_i\perp x_j\mid x_i,\pi,\cdot, \qquad \text{for } i\neq j.

Full conditional

To derive the full conditional, we keep only the terms that depend on I_i. Constants that are the same for I_i = 1 and I_i = 2 disappear into the normalizing constant.

By Bayes' theorem,

P(I_i\mid x_i,\pi,\cdot) \propto P(x_i\mid I_i,\pi,\cdot)P(I_i\mid\pi).

Hence,

P(I_i\mid x_i,\pi,\cdot) \propto \phi(x_i\mid\mu_{I_i},\sigma_{I_i}^2)P(I_i\mid\pi),

so

P(I_i\mid x_i,\pi,\cdot) \propto \begin{cases} \phi(x_i\mid\mu_1,\sigma_1^2)\pi, & \text{if } I_i=1\\ \phi(x_i\mid\mu_2,\sigma_2^2)(1-\pi), & \text{if } I_i=2. \end{cases}

Since I_i ∈ {1, 2}, the full conditional is Bernoulli after normalization. Find the normalizing constant from

\sum_{k=1}^{2}P(I_i=k\mid x_i,\pi,\cdot)=1.

Thus,

1=P(I_i=1\mid x_i,\pi,\cdot)+P(I_i=2\mid x_i,\pi,\cdot),
1=c\,\phi(x_i\mid\mu_1,\sigma_1^2)\,\pi + c\,\phi(x_i\mid\mu_2,\sigma_2^2)\,(1-\pi),

and

c= \frac{1}{ \pi\phi(x_i\mid\mu_1,\sigma_1^2)+(1-\pi)\phi(x_i\mid\mu_2,\sigma_2^2) }.

Therefore,

\theta_i=P(I_i=1\mid x_i,\pi,\cdot) = \frac{ \pi\phi(x_i\mid\mu_1,\sigma_1^2) }{ \pi\phi(x_i\mid\mu_1,\sigma_1^2)+(1-\pi)\phi(x_i\mid\mu_2,\sigma_2^2) }.

Intuitively, the numerator is the unnormalized weight for component 1. The denominator adds the two possible component weights, so the ratio turns the weight into a probability.
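
As a small numerical illustration (all parameter values below are made up), θ_i can be evaluated on the log scale, which avoids underflow when an observation is far from both components:

```python
# Made-up parameter values, just to show how theta_i is computed in practice.
import numpy as np
from scipy.stats import norm


def theta_i(x_i, pi, mu1, sig1sq, mu2, sig2sq):
    log_w1 = np.log(pi) + norm.logpdf(x_i, mu1, np.sqrt(sig1sq))
    log_w2 = np.log(1.0 - pi) + norm.logpdf(x_i, mu2, np.sqrt(sig2sq))
    m = max(log_w1, log_w2)                      # log-sum-exp for numerical stability
    w1, w2 = np.exp(log_w1 - m), np.exp(log_w2 - m)
    return w1 / (w1 + w2)


print(theta_i(1.2, pi=0.4, mu1=0.0, sig1sq=1.0, mu2=3.0, sig2sq=1.0))
```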


2. Frequentist vs Bayesian

Problem

2(a)

Let x_1, …, x_n | θ ~ iid Uniform(θ − 1/2, θ + 1/2). Let θ̂ = x̄ = (1/n) ∑_{i=1}^{n} x_i be an estimator of θ. Derive an expression for the repeated sampling variance of θ̂.

2(b)

Derive the posterior distribution for θ assuming a uniform prior distribution.

Hint: Here it is absolutely crucial to think about the support for the data distribution. Once you have observed some data, some values are no longer possible. I strongly suggest that you plot some imaginary data on the real line and plot the data distribution in the same graph for some made-up values of θ. Just to make you think in the right direction.

2(c)

Assume that you have observed three observations: x_1 = 1.13, x_2 = 2.1, x_3 = 1.38. What do we conclude about θ from a frequentist perspective? What do we conclude about θ from a Bayesian perspective? Discuss.

Handwritten solution

2(a)

Let

x_1,\ldots,x_n \stackrel{iid}{\sim} \operatorname{Uniform}\left(\theta-\frac12,\theta+\frac12\right).

Let

\hat{\theta}=\bar{x} = \frac1n\sum_{i=1}^{n}x_i.

Then

\operatorname{Var}(\hat{\theta}) = \operatorname{Var}\left(\frac1n\sum_{i=1}^{n}x_i\right)
= \frac1{n^2}\sum_{i=1}^{n}\operatorname{Var}(x_i)
= \frac1{n^2}\,n\operatorname{Var}(x_i)
= \frac1n\cdot\frac1{12} \left(\theta+\frac12-\left(\theta-\frac12\right)\right)^2
= \frac1{12n}.

This is a repeated sampling variance: it describes how much x̄ would vary if we repeatedly collected new samples of size n from the same model. It is not a posterior variance for θ.
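
A quick Monte Carlo check of this variance; θ and n below are arbitrary illustration values:

```python
# Monte Carlo check of Var(xbar) = 1/(12n) for the Uniform(theta - 1/2, theta + 1/2) model.
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 1.5, 3, 200_000
xbar = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, n)).mean(axis=1)
print(xbar.var(), 1 / (12 * n))   # both close to 0.0278
```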

2(b)

Assume a flat uniform prior:

p(\theta)\propto 1.

The model density is

p(x_i\mid\theta) = \begin{cases} 1, & \theta-\frac12\leq x_i\leq\theta+\frac12\\ 0, & \text{otherwise.} \end{cases}

The posterior is

p(\theta\mid\vec{x}) \propto p(\vec{x}\mid\theta)\,p(\theta)
\propto \prod_{i=1}^{n}p(x_i\mid\theta)\,p(\theta).

Within the support allowed by all observations, this is proportional to 1, so θ | x is uniform on the feasible interval.

The key point is the support. A candidate value of θ is possible only if every observed x_i lies inside the interval of width 1 centered at θ. Values of θ outside this feasible interval get likelihood zero.

Determine the bounds for θ | x.

Given

\vec{x}=(x_1,\ldots,x_n),

let

x_{\min}=\min\{x_1,\ldots,x_n\}, \qquad x_{\max}=\max\{x_1,\ldots,x_n\}.

The likelihood is positive only if

\theta-\frac12\leq x_i\leq\theta+\frac12, \qquad \text{for all } i.

Therefore,

\theta-\frac12\leq x_{\min} \quad\text{and}\quad x_{\max}\leq\theta+\frac12,

which implies

\theta\leq x_{\min}+\frac12 \quad\text{and}\quad x_{\max}-\frac12\leq\theta.

Thus,

x_{\max}-\frac12\leq\theta\leq x_{\min}+\frac12,

and

\theta\mid\vec{x} \sim \operatorname{Uniform}\left(x_{\max}-\frac12, x_{\min}+\frac12\right).

2(c)

Given

x_1=1.13, \quad x_2=2.1, \quad x_3=1.38,

the frequentist estimate is

\hat{\theta}=\bar{x}=1.537, \qquad \operatorname{Var}(\hat{\theta})=\frac1{12n}=\frac1{36}\approx 0.028.

The Bayesian posterior is

\theta\mid\vec{x} \sim \operatorname{Uniform}\left(2.1-\frac12,\;1.13+\frac12\right) = \operatorname{Uniform}(1.6,1.63).

The frequentist estimate summarizes the center of the observed data through x̄. The Bayesian posterior instead uses the model support to rule out impossible values of θ, leaving only a short interval of feasible values.
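
The numbers above can be reproduced directly from the three observations; this snippet only restates the formulas already derived:

```python
import numpy as np

x = np.array([1.13, 2.1, 1.38])
n = len(x)
print(x.mean(), 1 / (12 * n))          # 1.537, 0.028  (frequentist estimate and its variance)
print(x.max() - 0.5, x.min() + 0.5)    # 1.6, 1.63     (support of the Bayesian posterior)
```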


3. The posterior becomes more normal

Problem

3(a)

Let x_1, …, x_n | θ ~ iid Bern(θ) and let θ ~ Beta(α, β) a priori. Find the posterior mode of θ.

3(b)

Approximate the posterior distribution of θ by a normal distribution.

3(c)

Assume now that you have the data n = 8 and s = 2. Plot the true posterior distribution and the normal approximation in the same graph. Assume a uniform prior for θ.

3(d)

Redo the previous exercise, but this time with twice the data size: n = 16 and s = 4. What do you conclude?

Handwritten solution

3(a)

Find the stationary point:

\frac{\partial \log P(\theta\mid\vec{x})}{\partial\theta}=0.

Let s be the number of successes and f = n − s be the number of failures. The posterior is Beta(α + s, β + f), so

\log P(\theta\mid\vec{x}) = \text{const} + (\alpha+s-1)\log\theta + (\beta+f-1)\log(1-\theta),

and the stationary-point condition becomes

\frac{\alpha+s-1}{\theta} - \frac{\beta+f-1}{1-\theta} = 0 \quad\Longleftrightarrow\quad (1-\theta)(\alpha+s-1)=\theta(\beta+f-1).

Expanding gives

\alpha+s-1-\theta(\alpha+s-1) = \theta(\beta+f-1),

so

\alpha+s-1 = \theta(\alpha+\beta+s+f-2).

Therefore,

\hat{\theta} = \frac{\alpha+s-1}{\alpha+\beta+n-2}.

This is the posterior mode, not necessarily the posterior mean. It is the value of θ where the posterior density is highest, assuming the mode is inside the interval (0, 1).

3(b)

Approximate the posterior by

\theta\mid\vec{x} \approx N\left(\hat{\theta},J_{\vec{x}}^{-1}(\hat{\theta})\right),

where J_x(θ̂) is the negative Hessian of the log posterior, evaluated at θ̂.

This is a Laplace approximation. Near its mode, a smooth log posterior can be approximated by a quadratic curve; exponentiating that quadratic gives a normal density. The inverse curvature controls the approximate variance.

J_{\vec{x}}(\theta) = - \frac{\partial^2\log P(\theta\mid\vec{x})}{\partial\theta^2}
= - \frac{\partial}{\partial\theta} \left( \frac{\partial}{\partial\theta}\log P(\theta\mid\vec{x}) \right)
= - \frac{\partial}{\partial\theta} \left( \frac{\alpha+s-1}{\theta} - \frac{\beta+f-1}{1-\theta} \right)
= \frac{\alpha+s-1}{\theta^2} + \frac{\beta+f-1}{(1-\theta)^2}.

Also,

1-\hat{\theta} = \frac{ \alpha+\beta+n-2-\alpha-s+1 }{ \alpha+\beta+n-2 } = \frac{\beta+f-1}{\alpha+\beta+n-2}.

Substitute θ̂ and 1 − θ̂ into J_x(θ̂):

J_{\vec{x}}(\hat{\theta}) = \frac{\alpha+s-1}{(\alpha+s-1)^2}(\alpha+\beta+n-2)^2 + \frac{\beta+f-1}{(\beta+f-1)^2}(\alpha+\beta+n-2)^2
= \frac{(\alpha+\beta+n-2)^2}{\alpha+s-1} + \frac{(\alpha+\beta+n-2)^2}{\beta+f-1}
= \frac{(\alpha+\beta+s+f-2)(\alpha+\beta+n-2)^2}{(\alpha+s-1)(\beta+f-1)}
= \frac{(\alpha+\beta+n-2)^3}{(\alpha+s-1)(\beta+f-1)}.

Therefore,

J_{\vec{x}}^{-1}(\hat{\theta}) = \frac{1}{J_{\vec{x}}(\hat{\theta})} = \frac{(\alpha+s-1)(\beta+f-1)}{(\alpha+\beta+n-2)^3}.

Hence,

\theta\mid\vec{x} \approx N\left(\hat{\theta},V_{\hat{\theta}}\right),

where

\hat{\theta}=\frac{\alpha+s-1}{\alpha+\beta+n-2}

and

V_{\hat{\theta}} = \frac{(\alpha+s-1)(\beta+f-1)}{(\alpha+\beta+n-2)^3}.

As the sample size increases while the success proportion stays similar, the posterior becomes more concentrated and the normal approximation usually improves. This is the sense in which the posterior becomes more normal.
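
A small helper that evaluates the mode and the Laplace variance from the formulas above. The call with α = β = 1, s = 2, n = 8 is only an illustration; note it gives the Laplace variance, not the moment-matched one used for the plots below.

```python
def laplace_beta_bernoulli(alpha, beta, s, n):
    """Posterior mode and Laplace variance for Bernoulli data with a Beta(alpha, beta) prior."""
    f = n - s
    mode = (alpha + s - 1) / (alpha + beta + n - 2)
    var = (alpha + s - 1) * (beta + f - 1) / (alpha + beta + n - 2) ** 3
    return mode, var


print(laplace_beta_bernoulli(1, 1, 2, 8))   # (0.25, 0.0234...)
```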

3(c) Plot: n = 8, s = 2

For the plots, use the moment-matched normal approximation from the handwritten note:

\theta\mid\vec{x}\approx N\left( \frac{\alpha+s}{\alpha+\beta+n}, \frac{(\alpha+s)(\beta+f)}{(\alpha+\beta+n)^2(\alpha+\beta+n+1)} \right).

With a uniform prior, α = β = 1, so for n = 8 and s = 2 we have

\theta\mid\vec{x}\sim \operatorname{Beta}(3,7), \qquad \theta\mid\vec{x}\approx N(0.30,\,0.019).

[Figure: True posterior and normal approximation for n = 8, s = 2]

3(d) Plot: n = 16, s = 4

Doubling the data while keeping the same success proportion gives

\theta\mid\vec{x}\sim \operatorname{Beta}(5,13), \qquad \theta\mid\vec{x}\approx N(0.28,\,0.011).

[Figure: True posterior and normal approximation for n = 16, s = 4]

The second posterior is more concentrated and the normal curve tracks the true posterior more closely. This supports the conclusion that, as the amount of data increases, the posterior tends to look more normal.
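
For reference, a plotting sketch that reproduces both figures using the moment-matched approximation from 3(c); the grid size and figure layout are arbitrary choices.

```python
# Plot sketch for 3(c) and 3(d): true Beta posterior vs. the moment-matched
# normal approximation under a uniform prior (alpha = beta = 1).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta, norm

theta = np.linspace(0, 1, 500)
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, (n, s) in zip(axes, [(8, 2), (16, 4)]):
    a, b = 1 + s, 1 + (n - s)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    ax.plot(theta, beta.pdf(theta, a, b), label=f"Beta({a},{b})")
    ax.plot(theta, norm.pdf(theta, mean, np.sqrt(var)), "--",
            label=f"N({mean:.2f}, {var:.3f})")
    ax.set_title(f"n = {n}, s = {s}")
    ax.set_xlabel("theta")
    ax.legend()
plt.tight_layout()
plt.show()
```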