This note is the third in a three part series on the expectation maximization algorithm. Part 1 gives a cursory overview of the algorithm, Part 2 deals with mixture models, and Part 3 applies the EM algorithm to hidden markov models.
Introduction
Let’s extend now the application of the expectation maximization algorithm to a different class of latent variable models — hidden markov models (HMMs).
An HMM models an observable sequence of data as a function of an underlying latent stochastic process that transitions according to the Markov property, i.e. when the future is conditionally independent of the past given the present.1
As a working example, we’ll be partially replicating a 2012 article examining the link between political violence and house prices in Northern Ireland (Besley and Mueller 2012). The paper applies a series of HMMs whereby the recorded number of killings per yearly quarter is modeled as a function of the system being in a latent state of either peace or conflict.
It’s a really cool paper, but the code is atrocious which should have been a blocker for publication. So we’re going to re-implement the HMM portion from scratch using the EM algorithm, or more specifically a subvariant typically referred to as the Baum-Welch algorithm (Baum et al. 1970), to fit the models.
Replication Data
The replication files can be found on the AER website. We’re only going to be focusing on the HMM portion of the paper so ignore the data on house prices and grab the recorded number of regional killings per yearly quarter.
usingDataFrames, Plots, ReadStatTablesdf =readstat("maindata.dta") |> DataFrame |> d ->dropmissing(d, :totaldeaths) |> d ->select(d, [:region, :time, :totaldeaths])# Convert regions to proper namesdf.region_name =replace(df.region,1=>"Belfast",2=>"North Down",3=>"Lisburn",4=>"East Antrim",5=>"Londonderry/Strabane",6=>"Antrim/Ballymena",7=>"Coleraine/Limavady",8=>"Enniskillen/Fermanagh/S Tyrone",9=>"Mid Ulster",10=>"Mid and South Down",11=>"Craigavon/Armagh")describe(df, cols = [:totaldeaths, :time])
2×7 DataFrame
Row
variable
mean
min
median
max
nmissing
eltype
Symbol
Union…
Any
Any
Any
Int64
DataType
1
totaldeaths
0.908225
0
0.0
29
0
Int8
2
time
1983-01-01
1996-01-01
2009-01-01
0
Date
To get a quick visual of the data, Figure 1 provides of the number of killings over time per region.
Formally, for some region \(r \in \{1, 2, \ldots, 11\}\), let \(\boldsymbol{s}_{r} = (s_{r1}, s_{r2}, \ldots, s_{rT})\) be a latent markov process satisfying \(P(s_{rt} = k | s_{r,t-1}, s_{r,t-2}, \ldots, s_{r,1}) = P(s_{rt} = k | s_{r,t-1})\) for \(k \in \{1, \ldots, K\}\) where \(K = 2\) denotes the two latent states, peace and conflict. The sequence \(\boldsymbol{s}_{r}\) is never directly observed; however, it does result in observable emissions, in this case the number of killings measured per yearly quarter, \(\boldsymbol{y}_r = (y_{r1}, y_{r2}, \ldots, y_{rT})\).
Each \(y_{rt}\) is dependent on the hidden state at time \(t\) such that \(y_{rt} | s_{rt} = k \sim F_{rk}(\cdot)\) for some probability distribution, \(F_{rk}\), with the system evolving according to Figure 2. Following the original paper, we specify \(F_{rk}\) to be a normal distribution with parameters \(\mu_{rk}\) and \(\sigma_{rk}\).
Given the previous latent state of the system at time \(t-1\), the latent variable, \(s_{rt}\), transitions with time invariant probabilities, \(p_r\) and \(q_r\). The former is the probability of transitioning to conflict from conflict and the latter from peace to peace. Taken together, the transition matrix, \(\boldsymbol{A}_r\), is defined as
where each row is a probability simplex. When \(t=1\) let the initial state probability be \(\pi_{rk} = P(s_{r1} = k)\).
Our goal is to estimate the set of parameters \(\theta_r = \{\pi, \boldsymbol{A}, \xi\}\), where \(\xi\) is the \(F_1, \ldots, F_K\) distribution specific parameters, as well as generate predictions for the latent states \(\boldsymbol{s}_r\).
In line with the original paper, we will fit an HMM for each region separately using the EM algorithm to maximize the marginal log-likelihood, \(\ell(\theta_r | \boldsymbol{y}_r) = \log p(\boldsymbol{y}_r | \theta_r)\).
EM Algorithm
Start by defining the objective function in line with the previous discussions. We also drop the region subscript since it is implicit throughout.
where \(f_k(\cdot)\) is the probability function for the observed distribution given latent state \(k\).
Letting \(\gamma_t(k) = p(s_t = k | \boldsymbol{y}, \theta^\star)\) and \(\phi_{t-1,t}(i, j) = p(s_{t-1} = i, s_{t} = j | \boldsymbol{y}, \theta^\star)\), we are left with the following.
The expectation step consists of calculating \(\gamma_t(k)\) and \(\phi_{t-1,t}(i, j)\) for all \(k, i, j \in \{1, \ldots, K \}\) using the current iteration of \(\theta\).
The former can be expressed in terms of Bayes’ theorem and simplified.
In addition, the variables \(\alpha_t(k)\) and \(\beta_t(k)\) were substituted to respectively represent the forward probabilities, i.e. the joint probability of observing \(y_1, \ldots, y_t\) and being in state \(k\) at time \(t\), and the backward probabilities, i.e., the probability of observing \(y_{t+1}, \ldots, y_T\) given latent state \(k\) at time \(t\).
These forward and backward probabilities are also used to express \(\phi_{t-1,t}(i, j)\).
It is clear that we need to be able to find expressions for \(\alpha_t(k)\) and \(\beta_t(k)\) as part of the E-Step. Unsurprisingly, in accordance with today’s theme, this will be done using recursion. Starting with the forward probabilities,
with initial value, \(\alpha_1(k) = p(y_1 | s_1 = k, \theta^\star) P(s_1 = k | \theta^\star) = \pi_k f_k(y_1 | \theta^\star)\). Note, as part of this derivation recognize that \(y_1, \ldots, y_{t-1} {\perp\!\!\!\perp} s_t | s_{t-1}\). In other words, the observed data up until \(t-1\) are independent of the future latent states given the present state allowing us to express \(p(y_1, \ldots, y_{t-1} | s_t, s_{t-1}) = p(y_1, \ldots, y_{y-1} | s_{t-1})\).
Before we actually implement this in code, however, consider what happens to \(\alpha_t(k)\) as \(t \to \infty\). The forward probabilities will decay exponentially and risk underflowing. Therefore, to improve numeric stability we replace \(\alpha_t(k)\) with its normalized form, \(\hat{\alpha}_t(k) = \frac{\alpha_t(k)}{c_t}\), where \(c_t = \sum_{i=1}^K \alpha_t(k)\).2.
Finally, implement the E-Step to update \(\gamma_t(k)\) and \(\phi_{t-1,t}(i,j)\) using Equation 3 and Equation 4.3
usingDistributions"""EM algorithm E-step.# Arguments:- `X`: Observed data- `π`: Initial state probabilities- `A`: Transition matrix- `μ`: Means of the emission distributions- `σ`: Standard deviations of the emission distributions"""functionE_step(X, π, A, μ, σ) B =hcat([pdf(Normal(μ[i], σ[i]), X) for i in1:length(π)]...) α, c =forward(π, A, B) β =backward(π, A, B, c) γ = α .* β γ ./=sum(γ, dims=2) ϕ =Array{Float64}(undef, length(X) -1, length(π), length(π))for t inIterators.take(eachindex(X), length(X) -1) ϕ[t, :, :] = α[t, :] .* A .* (B[t+1, :]' .* β[t+1, :]') ϕ[t, :, :] ./=sum(ϕ[t, :, :])endreturnsum(log.(c)), α, ϕ, γend
M-Step
Returning to Equation 2, we update \(\theta = \{\boldsymbol{A}, \pi, \mu, \sigma \}\) in the M-Step by maximizing the objective function, \(\theta = \text{arg max}_{\theta \in \Theta} Q(\theta^\star, \theta)\), while treating \(\gamma_t(k)\) and \(\phi_{t-1,t}(i, j)\) as constant.
For brevity I will skip the full derivations, but solving the maximization problems leads to the following updating equations:4
usingLinearAlgebra"""EM algorithm M-Step.# Arguments:- `X`: Observed data- `γ`: Probability of being in state `k` given `X`- `ϕ`: Joint probability of being in state ``i`` and ``j`` at times ``t-1`` and ``t`` given `X`"""functionM_step(X, γ, ϕ)π= γ[1, :] q =dropdims(sum(ϕ, dims=1), dims=1) A = q ./sum(q, dims=2) μ =sum(γ .* X, dims=1) ./sum(γ, dims=1) σ =sqrt.([dot(γ[:, i], (X .- μ[i]).^2) /sum(γ[:, i]) for i inaxes(γ, 2)])returnπ, A, vec(μ), max.(σ, 1e-5)end
EM Function
The entry point for the EM implementation sets the initial values for \(\theta\) according to the values selected by the original paper.
It then iterates between the E-Step and M-Step until convergence with the stopping criteria also in line with the paper. Specifically, the algorithm stops when the absolute difference in estimates between iterations drops below a threshold. Formally, with some slight abuse in notation, we express this condition as \(\forall i \in \{1, 2, \ldots, n \}, \; |\theta_i - \theta_i^\star| < \epsilon\) for \(n\) parameters and \(\epsilon > 0\).
"""Entry point for EM algorithm."""functionEM(X; max_iter=1_000, tol=1e-8)π=fill(0.5, 2) A = [0.50.5; 0.50.5] μ = [0, 3] σ = [1, sqrt(3)]for i in1:max_iter# E-step log_lik, α, ϕ, γ =E_step(X, π, A, μ, σ)@info"Iteration $i: log-likelihood = $log_lik: μ = $μ" # M-step π_new, A_new, μ_new, σ_new =M_step(X, γ, ϕ)# Stopping criteria delta =Iterators.flatten([π_new -π, A_new - A, μ_new - μ, σ_new - σ])ifall(abs.(delta) .< tol)return A_new[1], A_new[4], μ_new, αendπ, A, μ, σ = π_new, A_new, μ_new, σ_newenderror("Model failed to converge")end
Run it all
Finally, run each region specific model.
# Enable logging to print the log-likelihood for each iterationusingLoggingdisable_logging(Logging.Info)# Fit an HMM per regiongroups =groupby(df, :region_name) |> collectresults = [EM(d.totaldeaths) for d in groups]
We can summarise the model fits by re-constructing Table 2 from the original paper. Most of the results are fairly in line, with the notable exception of Londonderry/Strabane which we estimate as having a significantly less degree of persistence between states.
We can also recreate Figure 4 from the original paper in order to plot the posterior expectations for the amount of violence per yearly quarter across four different regions.5 Again, Londonderry/Strabane stands out as our fitted HMM is more likely to transition between conflict and peace than the original results. Otherwise, the remaining region predictions are fairly in line with Besley and Mueller (2012).
functionpp(df, result) _, _, μ, α = result p =plot(df.time, df.totaldeaths, label="Observed deaths", legend=:topright, title=df.region_name[1], xrotation=45, background_color=:transparent)plot!(p, df.time, α * μ, label="Predicted deaths")return pendplot([pp(groups[i], results[i]) for i in [1, 5, 3, 9]]..., layout=[2,2])
Observed and predicted quarterly killings.
There is plenty of room for further improvement. We can investigate more robust initialization strategies, run sensitivity analyses, and implement a number of easy optimizations to the code in order to improve performance. Have fun, the world is your oyster.
References
Baum, Leonard E., Ted Petrie, George Soules, and Norman Weiss. 1970. “A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains.”The Annals of Mathematical Statistics 41 (1): 164–71.
Besley, Timothy, and Hannes Mueller. 2012. “Estimating the Peace Dividend: The Impact of Violence on House Prices in Northern Ireland.”The American Economic Review 102 (2): 810–33.
James D. Hamilton. 1991. “A Quasi-Bayesian Approach to Estimating Parameters for Mixtures of Normal Distributions.”Journal of Business & Economic Statistics 9 (1): 27–39.
For simplicity, we will restrict our discussion to first-order HMMs with finite state spaces and observed data measured at discrete uniform intervals of time.↩︎
The scaling factors arise from normalizing the forward probabilities such that \(\hat{\alpha}_t(k) = p(s_t | y_1, y_2, \ldots, y_t) = \frac{\alpha_t(k)}{p(y_1, y_2, \ldots, y_t)}\). Letting \(c_t = p(y_t | y_1, y_2, \ldots, y_{t-1})\), then \(\alpha_t(k) = \left(\prod_{m=1}^t c_t \right) \hat{\alpha}_t(k)\). Substituting into Equation 5 yields
Dropping the product term, we calculate \(\alpha_t(k)\) using the previous normalized term, \(\hat{\alpha}_{t-1}(k)\), and normalize the present value with scaling factor, \(c_t = \sum_{i=1}^K \alpha_t(i)\).↩︎
Note, the calculations for \(\gamma_t(k)\) and \(\phi_{t-1,t}(i,j)\) remain invariant to whether \(\alpha_t(k)\) and \(\beta_t(k)\) are normalized.↩︎
We deviate here from the original paper when updating \(\theta\). One potential issue when fitting a mixture of gaussians is that if \(\sigma_k^2\) becomes too small it may lead to numeric instability. Besley and Mueller (2012) mitigate this by following James D. Hamilton (1991) and using quasi-bayesian priors, which introduces a degree of regularization through penalized MLE. We will take a simpler, more naive approach and instead constrain the variances to a minimum threshold. The implication is that our results will not be an exact replication and differences will arise between the different sets of posterior estimates.↩︎
Typically fitted values are generated by conditioning on the entire sequence of observed data. However, in following the original paper, we will only use the forward probabilities and define \(\hat{y}_t = \sum_{k=1}^K \mathbb{E} \left[ y_t | s_t = k \right] p(s_t = k | y_1, \ldots, y_t)\).↩︎