Implementation of Baum-Welch algorithm for
HMM in Mahout Samsara
Manogna Vemulapati
November 8, 2018
Introduction
A Hidden Markov Model (HMM) is specified as a triplet $(A, B, \pi)$ where:

- The number of hidden states is $N$ and they are specified as the set $S = \{S_0, S_1, \dots, S_{N-1}\}$. The state at time $t$ is represented as $q_t$.
- The number of observation symbols is $M$ and they are specified as the set $V = \{v_0, \dots, v_{M-1}\}$.
- The state transition probability distribution matrix $A$ is a matrix of dimensions $N \times N$. The element $a_{ij}$ of the matrix $A$ is the probability of transitioning from state $S_i$ to state $S_j$.
- The emission probability distribution matrix $B$ is a matrix of dimensions $N \times M$. The element $b_j(k)$ of the matrix $B$ is the probability of emitting observation symbol $v_k$ from state $S_j$.
- The probability distribution for the initial state is specified by the vector $\pi = \{\pi_i\}$, where $\pi_i$ is the probability of being in state $S_i$ at time $t = 0$.
Given an observation sequence $O$ of observation symbols from the set $V$, the learning problem is to adjust the model parameters $\lambda = (A, B, \pi)$ such that the probability $P(O \mid \lambda)$ is maximized. The Baum-Welch algorithm provides a solution to this training problem.
Baum-Welch Algorithm
The Baum-Welch algorithm is an Expectation-Maximization (EM) algorithm which computes the maximum likelihood estimate of the parameters of an HMM given a set of observation sequences. It is an iterative algorithm: in each iteration it computes the forward variables and backward variables and uses these variables to update the model parameters so that $P(O \mid \bar{\lambda}) > P(O \mid \lambda)$, where $\bar{\lambda}$ is the model with the updated parameters. The algorithm iterates until the model parameters converge.
Forward Variables
For an observation sequence $O$ of length $T$, that is, $O = (O_0 \dots O_{T-1})$, the forward variables are defined as
\[
\alpha_t(i) = P(O_0 O_1 \dots O_t, q_t = S_i \mid \lambda), \quad 0 \le i \le N-1, \; 0 \le t \le T-1
\]
which is the probability of the partial observation sequence $O_0 O_1 \dots O_t$ up to time $t$ and state $S_i$ at time $t$, given the model $\lambda$. The forward variables are computed inductively as follows.

Initialization:
\[
\alpha_0(i) = \pi_i \, b_i(O_0), \quad 0 \le i \le N-1
\]

Induction:
\[
\alpha_{t+1}(j) = \left[ \sum_{i=0}^{N-1} \alpha_t(i) \, a_{ij} \right] b_j(O_{t+1}), \quad 0 \le t \le T-2, \; 0 \le j \le N-1
\]
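The forward recursion can be sketched in a few lines of NumPy. The Samsara implementation operates on distributed matrices, so this is only an illustration of the math; the function name and all model numbers below are made up for the example.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward variables: alpha[t, i] = P(O_0 ... O_t, q_t = S_i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # initialization: pi_i * b_i(O_0)
    for t in range(T - 1):                           # induction over t = 0 .. T-2
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha

# Toy model (illustrative numbers only): N = 2 states, M = 2 symbols
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities
pi = np.array([0.6, 0.4])                # initial state probabilities
alpha = forward(A, B, pi, [0, 1, 0])
p_obs = alpha[-1].sum()                  # P(O | lambda); see the later section
```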
Backward Variables
The backward variable $\beta_t(i)$ is the probability of the partial observation sequence from time $t+1$ to the end $T-1$, given the HMM is in state $S_i$ at time $t$ and the model $\lambda$:
\[
\beta_t(i) = P(O_{t+1} \dots O_{T-1} \mid q_t = S_i, \lambda), \quad 0 \le i \le N-1, \; 0 \le t \le T-1
\]
The backward variables are computed inductively as follows.

Initialization:
\[
\beta_{T-1}(i) = 1, \quad 0 \le i \le N-1
\]

Induction:
\[
\beta_t(i) = \sum_{j=0}^{N-1} a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j), \quad 0 \le t \le T-2, \; 0 \le i \le N-1
\]
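A matching NumPy sketch of the backward recursion, on a made-up 2-state, 2-symbol model. As a sanity check, $\sum_i \pi_i b_i(O_0) \beta_0(i)$ must equal $P(O \mid \lambda)$.

```python
import numpy as np

def backward(A, B, obs):
    """Backward variables: beta[t, i] = P(O_{t+1} ... O_{T-1} | q_t = S_i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                   # initialization: beta_{T-1}(i) = 1
    for t in range(T - 2, -1, -1):                   # induction, backwards in time
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# Toy model (illustrative numbers only)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 0]
beta = backward(A, B, obs)
# Sanity check: sum_i pi_i * b_i(O_0) * beta_0(i) equals P(O | lambda)
p_obs = (pi * B[:, obs[0]] * beta[0]).sum()
```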
Gamma and Xi Variables
The gamma variable $\gamma_t(i)$ is the probability of being in state $S_i$ at time $t$ given the observation sequence $O$ and the model $\lambda$:
\[
\gamma_t(i) = P(q_t = S_i \mid O, \lambda) = \frac{\alpha_t(i) \, \beta_t(i)}{P(O \mid \lambda)}, \quad 0 \le i \le N-1, \; 0 \le t \le T-1
\]
The xi variable $\xi_t(i, j)$ is the probability of being in state $S_i$ at time $t$ and in state $S_j$ at time $t+1$, given the model $\lambda$ and the observation sequence $O$:
\[
\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j)}{P(O \mid \lambda)}
\]
where
\[
0 \le i \le N-1, \quad 0 \le j \le N-1, \quad 0 \le t \le T-2
\]
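Both definitions can be combined into one NumPy sketch (again on made-up numbers). The standard identities hold: each row of gamma sums to one, and summing xi over $j$ recovers gamma.

```python
import numpy as np

def forward(A, B, pi, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha

def backward(A, B, obs):
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def gamma_xi(A, B, pi, obs):
    """gamma[t, i] and xi[t, i, j] as defined above (xi for t = 0 .. T-2)."""
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    p_obs = alpha[-1].sum()
    gamma = alpha * beta / p_obs
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # xi_t(i, j) = alpha_t(i) * a_ij * b_j(O_{t+1}) * beta_{t+1}(j) / P(O|lambda)
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / p_obs
    return gamma, xi

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
gamma, xi = gamma_xi(A, B, pi, [0, 1, 0])
```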
Probability of an observation sequence
The probability of an observation sequence $O$ of length $T$ given a model $\lambda$ is computed as follows:
\[
P(O \mid \lambda) = \sum_{i=0}^{N-1} \alpha_{T-1}(i)
\]
Update of Model Parameters
The sum of gamma variables for a particular state $i$, that is, the expression $\sum_{t=0}^{T-2} \gamma_t(i)$, can be interpreted as the expected number of times that the state $S_i$ is visited given the model parameters and the observation sequence $O$. Likewise, the sum of xi variables $\sum_{t=0}^{T-2} \xi_t(i, j)$ can be interpreted as the expected number of transitions from state $S_i$ to state $S_j$. Hence the ratio of the latter over the former is the updated probability of transitioning from state $S_i$ to state $S_j$. Thus, an iteration of the Baum-Welch algorithm adjusts the parameters as below.
Initial Probabilities Vector
\[
\bar{\pi}_i = \gamma_0(i), \quad 0 \le i \le N-1
\]

State Transition Probability Distribution
\[
\bar{a}_{ij} = \frac{\sum_{t=0}^{T-2} \xi_t(i, j)}{\sum_{t=0}^{T-2} \gamma_t(i)}, \quad 0 \le i \le N-1, \; 0 \le j \le N-1
\]

Emission Probability Distribution
\[
\bar{b}_j(k) = \frac{\sum_{t=0, \, O_t = v_k}^{T-1} \gamma_t(j)}{\sum_{t=0}^{T-1} \gamma_t(j)}, \quad 0 \le j \le N-1, \; 0 \le k \le M-1
\]
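Putting the pieces together, a single unscaled re-estimation step can be sketched as follows (function name and model numbers are illustrative, not the Samsara code). By the EM property, the likelihood is non-decreasing across iterations.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One unscaled Baum-Welch re-estimation step for a single sequence."""
    T, N, M = len(obs), len(pi), B.shape[1]
    # forward pass
    alpha = np.zeros((T, N)); alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    # backward pass
    beta = np.zeros((T, N)); beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()
    gamma = alpha * beta / p_obs
    xi = np.array([alpha[t][:, None] * A *
                   (B[:, obs[t + 1]] * beta[t + 1])[None, :] / p_obs
                   for t in range(T - 1)])
    # re-estimation, term by term as in the formulas above
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):
        mask = np.array([o == k for o in obs])
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, p_obs

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 1, 0]
pi2, A2, B2, p1 = baum_welch_step(A, B, pi, obs)
_, _, _, p2 = baum_welch_step(A2, B2, pi2, obs)  # likelihood under updated model
```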
Numerical Stability and Scaling
The value of a forward variable $\alpha_t(i)$ quickly tends to zero as $t$ becomes large. The solution to this problem is to scale the forward variables at each induction step. One common scaling scheme (as described in [1]) is to define a scaling factor which depends only on the time $t$ but is independent of the state $i$, as described below. The scaled forward variables $\hat{\alpha}_t(i)$, the intermediate variables $\ddot{\alpha}_t(i)$, and the scaling factors $c_t$ are computed by induction as follows.

Initialization:
\[
\ddot{\alpha}_0(i) = \alpha_0(i), \quad 0 \le i \le N-1
\]
\[
c_0 = \frac{1}{\sum_{i=0}^{N-1} \ddot{\alpha}_0(i)}
\]
\[
\hat{\alpha}_0(i) = c_0 \, \ddot{\alpha}_0(i), \quad 0 \le i \le N-1
\]
Induction:
\[
\ddot{\alpha}_t(i) = \sum_{j=0}^{N-1} \hat{\alpha}_{t-1}(j) \, a_{ji} \, b_i(O_t)
\]
\[
c_t = \frac{1}{\sum_{i=0}^{N-1} \ddot{\alpha}_t(i)}
\]
\[
\hat{\alpha}_t(i) = c_t \, \ddot{\alpha}_t(i), \quad 0 \le i \le N-1
\]
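The scaled recursion can be sketched in NumPy as below (illustrative model numbers). After each step the scaled variables sum to one, and the scaling factors retain the likelihood.

```python
import numpy as np

def scaled_forward(A, B, pi, obs):
    """Scaled forward pass: returns alpha_hat (each row sums to 1) and factors c_t."""
    T, N = len(obs), len(pi)
    alpha_hat = np.zeros((T, N))
    c = np.zeros(T)
    a = pi * B[:, obs[0]]                    # alpha-double-dot_0 = alpha_0
    c[0] = 1.0 / a.sum()
    alpha_hat[0] = c[0] * a
    for t in range(1, T):
        a = (alpha_hat[t - 1] @ A) * B[:, obs[t]]   # alpha-double-dot_t
        c[t] = 1.0 / a.sum()
        alpha_hat[t] = c[t] * a                      # alpha-hat_t = c_t * alpha-double-dot_t
    return alpha_hat, c

# Toy model (illustrative numbers only)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
alpha_hat, c = scaled_forward(A, B, pi, [0, 1, 0])
# P(O | lambda) = 1 / prod(c_t), matching the unscaled forward pass
p_obs = 1.0 / np.prod(c)
```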
To compute the scaled backward variables $\ddot{\beta}_t(i)$, the same scaling factors which were computed for the scaled forward variables are used.

Initialization:
\[
\ddot{\beta}_{T-1}(i) = 1
\]
\[
\hat{\beta}_{T-1}(i) = c_{T-1} \, \ddot{\beta}_{T-1}(i)
\]

Induction:
\[
\ddot{\beta}_t(i) = \sum_{j=0}^{N-1} a_{ij} \, b_j(O_{t+1}) \, \hat{\beta}_{t+1}(j)
\]
\[
\hat{\beta}_t(i) = c_t \, \ddot{\beta}_t(i)
\]
Probability of an observation sequence with scaled variables

The probability of an observation sequence $O$ given a model $\lambda$ is computed as follows:
\[
C_t = \prod_{\tau=0}^{t} c_\tau
\]
\[
P(O \mid \lambda) = 1 / C_{T-1}
\]
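For long sequences the product $C_{T-1}$ itself would overflow, so implementations commonly accumulate $\log P(O \mid \lambda) = -\sum_t \log c_t$ instead. A self-contained sketch (illustrative model numbers; the function name is an assumption, not the Samsara API):

```python
import numpy as np

def scaled_forward_loglik(A, B, pi, obs):
    """Return log P(O | lambda) = -sum_t log c_t, accumulated in log space
    so that even very long sequences neither underflow nor overflow."""
    log_p = 0.0
    a = pi * B[:, obs[0]]              # unscaled alpha-double-dot_0
    for t in range(1, len(obs) + 1):
        s = a.sum()                    # s = 1 / c_{t-1}
        log_p += np.log(s)             # accumulate -log c_{t-1}
        a = a / s                      # normalize to alpha-hat_{t-1}
        if t < len(obs):
            a = (a @ A) * B[:, obs[t]]  # next alpha-double-dot_t
    return log_p

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
ll = scaled_forward_loglik(A, B, pi, [0, 1, 0])
```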
Using the scaled forward and backward variables, the model parameters are adjusted as follows.

Initial Probabilities Vector
\[
\bar{\pi}_i = \gamma_0(i) = \hat{\alpha}_0(i) \, \hat{\beta}_0(i) / c_0
\]

State Transition Probability Distribution
\[
\bar{a}_{ij} = \frac{\sum_{t=0}^{T-2} \hat{\alpha}_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \hat{\beta}_{t+1}(j)}{\sum_{t=0}^{T-2} \hat{\alpha}_t(i) \, \hat{\beta}_t(i) / c_t}
\]

Emission Probability Distribution
\[
\bar{b}_j(k) = \frac{\sum_{t=0, \, O_t = v_k}^{T-1} \hat{\alpha}_t(j) \, \hat{\beta}_t(j) / c_t}{\sum_{t=0}^{T-1} \hat{\alpha}_t(j) \, \hat{\beta}_t(j) / c_t}
\]
Training with Multiple Observation Sequences
Suppose we have $L$ independent observation sequences, where the observation sequence indexed by $l$ is denoted by $O^l$ and $0 \le l \le L-1$. In order to update the parameters of the model (as described in [3]), we need to do the following.
Starting probabilities: From each observation sequence, compute the expected number of times of being in each state at time $t = 0$. For each state $i$, we can compute the sum of the expected number of times of being in that state at time $t = 0$ over all the sequences. From this we can update the initial probabilities vector.

Expected number of transitions: From each observation sequence, compute the expected number of transitions from state $i$ to state $j$. For each ordered pair of states $(i, j)$, we can compute the sum of the expected number of transitions from state $i$ to state $j$ over all sequences. Once we have computed these sums for row $i$ of the transition matrix, we can update that row by dividing each sum by the total expected number of times state $i$ is visited.

Expected number of emissions: From each observation sequence, compute the expected number of times of being in state $i$ and emitting symbol $j$. For each state $i$ and symbol $j$, we can compute the total expected number of times of being in state $i$ and emitting symbol $j$ over all sequences. Each row of the emission matrix can be updated by dividing each entry by the total expected number of times of visiting the state corresponding to that row.
The parameters are updated as follows.

Initial Probabilities Vector
\[
\bar{\pi}_i = \frac{\sum_{l=0}^{L-1} \alpha_0^l(i) \, \beta_0^l(i) / P(O^l \mid \lambda)}{L}
\]
State Transition Probability Distribution
\[
\bar{a}_{ij} = \frac{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-2} \alpha_t^l(i) \, a_{ij} \, b_j(O_{t+1}^l) \, \beta_{t+1}^l(j) / P(O^l \mid \lambda)}{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-2} \alpha_t^l(i) \, \beta_t^l(i) / P(O^l \mid \lambda)}
\]
Emission Probability Distribution
\[
\bar{b}_j(k) = \frac{\sum_{l=0}^{L-1} \sum_{t=0, \, O_t^l = v_k}^{T_l-1} \alpha_t^l(j) \, \beta_t^l(j) / P(O^l \mid \lambda)}{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-1} \alpha_t^l(j) \, \beta_t^l(j) / P(O^l \mid \lambda)}
\]
If we are using the scaled forward and backward variables, then the update equations are as follows.
Initial Probabilities Vector
\[
\bar{\pi}_i = \frac{\sum_{l=0}^{L-1} \hat{\alpha}_0^l(i) \, \hat{\beta}_0^l(i) / c_0^l}{L}
\]
State Transition Probability Distribution
\[
\bar{a}_{ij} = \frac{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-2} \hat{\alpha}_t^l(i) \, a_{ij} \, b_j(O_{t+1}^l) \, \hat{\beta}_{t+1}^l(j)}{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-2} \hat{\alpha}_t^l(i) \, \hat{\beta}_t^l(i) / c_t^l}
\]
Emission Probability Distribution
\[
\bar{b}_j(k) = \frac{\sum_{l=0}^{L-1} \sum_{t=0, \, O_t^l = v_k}^{T_l-1} \hat{\alpha}_t^l(j) \, \hat{\beta}_t^l(j) / c_t^l}{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-1} \hat{\alpha}_t^l(j) \, \hat{\beta}_t^l(j) / c_t^l}
\]
Distributed Training in Samsara
The current implementation of distributed training of HMM in Samsara is based on the HMM training in MapReduce described in [2]. During each iteration of the Baum-Welch algorithm, each node in a cluster works on a block of independent observation sequences. Each node in the cluster executes the following steps for each observation sequence in the block.

- Compute the forward variables matrix of dimensions $T \times N$, where $T$ is the length of the observation sequence. The forward variables can be either scaled or unscaled.
- Compute the backward variables matrix of dimensions $T \times N$, where $T$ is the length of the observation sequence. If the forward variables were scaled in the previous step, then use the same scaling factors to scale the backward variables too.
- For each state $i$, compute the expected number of times of being in that state at time $t = 0$.
- For each state $i$, compute the expected number of transitions from that state to every state $j$, where $0 \le j \le N-1$.
- For each state $i$, compute the expected number of emissions of symbol $k$, where $0 \le k \le M-1$.
The mapBlock operator transforms a block of observation sequences (which is a matrix with $R$ rows representing a subset of $R$ observation sequences) into a matrix of shape $R \times (N + N^2 + N \cdot M)$. Each row in the input block (which is an independent observation sequence) is mapped to a row in the output block with $(N + N^2 + N \cdot M)$ columns as described below. The first $N$ columns of the output row contain the values $\gamma_0(i)$, $0 \le i \le N-1$, which are the probabilities of starting in state $i$ for each of the $N$ states. The next $N^2$ columns store the row-major representation of the $N \times N$ matrix which contains the expected transition counts. The element $e_{ij}$ of this matrix is the expected number of transitions from state $i$ to state $j$ given the observation sequence. The last $N \cdot M$ columns of the output block matrix store the row-major representation of the $N \times M$ matrix of expected emission counts. The element $f_{ij}$ of this matrix contains the expected number of times the symbol $j$ is emitted from state $i$ given the observation sequence. When all the blocks of the input DRM of observation sequences are processed, the parameters of the model are updated as follows.
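The row layout described above can be sketched as follows. The helper names `pack_row`/`unpack_row` and all the statistics values are hypothetical, not the actual Samsara code; the point is that aggregation across sequences becomes a plain row-wise sum of the output matrix.

```python
import numpy as np

def pack_row(gamma0, trans_counts, emis_counts):
    """Flatten per-sequence statistics into one row of length N + N^2 + N*M:
    [gamma_0(0..N-1) | row-major expected transition counts e_ij
                     | row-major expected emission counts f_ij]."""
    return np.concatenate([gamma0, trans_counts.ravel(), emis_counts.ravel()])

def unpack_row(row, N, M):
    """Inverse of pack_row: recover the three statistics blocks."""
    gamma0 = row[:N]
    trans = row[N:N + N * N].reshape(N, N)
    emis = row[N + N * N:].reshape(N, M)
    return gamma0, trans, emis

# Hypothetical per-sequence statistics for N = 2 states, M = 3 symbols
N, M = 2, 3
gamma0 = np.array([0.7, 0.3])
trans = np.array([[1.2, 0.8], [0.5, 1.5]])
emis = np.array([[0.9, 0.6, 0.5], [0.4, 1.0, 0.6]])
row = pack_row(gamma0, trans, emis)

# Aggregation across sequences is a row-wise sum of the output DRM;
# normalizing each unpacked block then yields the updated pi, A and B.
total = row + row                        # e.g. two identical sequences
g0, tr, em = unpack_row(total, N, M)
new_A = tr / tr.sum(axis=1, keepdims=True)   # normalize transition counts row-wise
```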
Initial Probabilities Vector: To update the initial probabilities vector, compute the total count of the expected number of times of being in state $i$ at time $t = 0$, for all $0 \le i \le N-1$. The element $\bar{\pi}_i$ is calculated as the ratio of the total count for state $i$ to the sum of the counts over all states.

State Transition Probability Distribution: To update row $i$ of the transition matrix, for each element $a_{ij}$ we need to compute the cumulative expected number of transitions from state $i$ to state $j$ over all the observation sequences. The sum of these cumulative counts over $j$ gives us the total expected number of times the state $i$ is visited. If we divide the cumulative expected number of transitions from state $i$ to state $j$ by this total, we get the updated probability of transition from state $i$ to state $j$.

Emission Probability Distribution: To update row $j$ of the emission matrix, for each element $b_j(k)$ we need to compute the cumulative expected number of times the symbol $k$ is emitted while being in state $j$ over all the observation sequences. The sum of these cumulative counts over $k$ gives us the total expected number of times the state $j$ is visited. If we divide the cumulative expected number of emissions of symbol $k$ from state $j$ by this total, we get the updated probability of emitting symbol $k$ from state $j$.
References

[1] Dawei Shen, Some Mathematics for HMM.
https://pdfs.semanticscholar.org/4ce1/9ab0e07da9aa10be1c336400c8e4d8fc36c5.pdf

[2] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce.
http://www.iro.umontreal.ca/~nie/IFT6255/Books/MapReduce.pdf

[3] Xiaolin Li, Marc Parizeau and Rejean Plamondon, Training Hidden Markov Models with Multiple Observations - A Combinatorial Method.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.335.1457&rep=rep1&type=pdf