Implementation of Baum-Welch algorithm for
HMM in Mahout Samsara
Manogna Vemulapati
November 8, 2018
Introduction
A Hidden Markov Model (HMM) is specified as a triplet $(A, B, \pi)$ where:

- The number of hidden states is $N$ and they are specified as the set $S = \{S_0, S_1, \dots, S_{N-1}\}$. The state at time $t$ is represented as $q_t$.
- The number of observation symbols is $M$ and they are specified as the set $V = \{v_0, \dots, v_{M-1}\}$.
- The state transition probability distribution matrix $A$ is a matrix of dimensions $N \times N$. The element $a_{ij}$ of the matrix $A$ is the probability of transitioning from state $S_i$ to state $S_j$.
- The emission probability distribution matrix $B$ is a matrix of dimensions $N \times M$. The element $b_j(k)$ of the matrix $B$ is the probability of emitting observation symbol $v_k$ from state $S_j$.
- The probability distribution for the initial state is specified by the vector $\pi = \{\pi_i\}$, where $\pi_i$ is the probability of being in state $S_i$ at time $t = 0$.
Given an observation sequence $O$ of observation symbols from the set $V$, the learning problem is to adjust the model parameters $\lambda = (A, B, \pi)$ such that the probability $P(O \mid \lambda)$ is maximized. The Baum-Welch algorithm provides a solution to this training problem.
Baum-Welch Algorithm
The Baum-Welch algorithm is an Expectation-Maximization (EM) algorithm which computes the maximum likelihood estimate of the parameters of an HMM given a set of observation sequences. It is an iterative algorithm: in each iteration it computes the forward variables and backward variables and uses these variables to update the model parameters so that $P(O \mid \bar{\lambda}) > P(O \mid \lambda)$, where $\bar{\lambda}$ is the model with the updated parameters. The algorithm iterates until the model parameters converge.
Forward Variables
For an observation sequence $O$ of length $T$, that is, $O = (O_0 \dots O_{T-1})$, the forward variables are defined as
\[
\alpha_t(i) = P(O_0 O_1 \dots O_t, q_t = S_i \mid \lambda), \quad 0 \le i \le N-1, \; 0 \le t \le T-1
\]
which is the probability of the partial observation sequence $O_0 O_1 \dots O_t$ up to time $t$ and state $S_i$ at time $t$, given the model $\lambda$. The forward variables are computed inductively as follows.

Initialization:
\[
\alpha_0(i) = \pi_i \, b_i(O_0), \quad 0 \le i \le N-1
\]

Induction:
\[
\alpha_{t+1}(j) = \left[ \sum_{i=0}^{N-1} \alpha_t(i) \, a_{ij} \right] b_j(O_{t+1}), \quad 0 \le t \le T-2, \; 0 \le j \le N-1
\]
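The forward recursion can be sketched in a few lines of NumPy. The Samsara implementation operates on distributed matrices, so this is only an illustration of the math; the function name and all model numbers below are made up for the example.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward variables: alpha[t, i] = P(O_0 ... O_t, q_t = S_i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # initialization: pi_i * b_i(O_0)
    for t in range(T - 1):                           # induction over t = 0 .. T-2
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha

# Toy model (illustrative numbers only): N = 2 states, M = 2 symbols
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities
pi = np.array([0.6, 0.4])                # initial state probabilities
alpha = forward(A, B, pi, [0, 1, 0])
p_obs = alpha[-1].sum()                  # P(O | lambda); see the later section
```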
Backward Variables
The backward variable $\beta_t(i)$ is the probability of the partial observation sequence from time $t+1$ to the end $T-1$, given the HMM is in state $S_i$ at time $t$ and the model $\lambda$:
\[
\beta_t(i) = P(O_{t+1} \dots O_{T-1} \mid q_t = S_i, \lambda), \quad 0 \le i \le N-1, \; 0 \le t \le T-1
\]
The backward variables are computed inductively as follows.

Initialization:
\[
\beta_{T-1}(i) = 1, \quad 0 \le i \le N-1
\]

Induction:
\[
\beta_t(i) = \sum_{j=0}^{N-1} a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j), \quad 0 \le t \le T-2, \; 0 \le i \le N-1
\]
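A matching NumPy sketch of the backward recursion, on a made-up 2-state, 2-symbol model. As a sanity check, $\sum_i \pi_i b_i(O_0) \beta_0(i)$ must equal $P(O \mid \lambda)$.

```python
import numpy as np

def backward(A, B, obs):
    """Backward variables: beta[t, i] = P(O_{t+1} ... O_{T-1} | q_t = S_i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                   # initialization: beta_{T-1}(i) = 1
    for t in range(T - 2, -1, -1):                   # induction, backwards in time
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# Toy model (illustrative numbers only)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 0]
beta = backward(A, B, obs)
# Sanity check: sum_i pi_i * b_i(O_0) * beta_0(i) equals P(O | lambda)
p_obs = (pi * B[:, obs[0]] * beta[0]).sum()
```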
Gamma and Xi Variables
The gamma variable $\gamma_t(i)$ is the probability of being in state $S_i$ at time $t$ given the observation sequence $O$ and the model $\lambda$:
\[
\gamma_t(i) = P(q_t = S_i \mid O, \lambda) = \frac{\alpha_t(i) \, \beta_t(i)}{P(O \mid \lambda)}, \quad 0 \le i \le N-1, \; 0 \le t \le T-1
\]
The xi variable $\xi_t(i, j)$ is the probability of being in state $S_i$ at time $t$ and in state $S_j$ at time $t+1$, given the model $\lambda$ and the observation sequence $O$:
\[
\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \beta_{t+1}(j)}{P(O \mid \lambda)}
\]
where
\[
0 \le i \le N-1, \quad 0 \le j \le N-1, \quad 0 \le t \le T-2
\]
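Both definitions can be combined into one NumPy sketch (again on made-up numbers). The standard identities hold: each row of gamma sums to one, and summing xi over $j$ recovers gamma.

```python
import numpy as np

def forward(A, B, pi, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    return alpha

def backward(A, B, obs):
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def gamma_xi(A, B, pi, obs):
    """gamma[t, i] and xi[t, i, j] as defined above (xi for t = 0 .. T-2)."""
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    p_obs = alpha[-1].sum()
    gamma = alpha * beta / p_obs
    T, N = alpha.shape
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # xi_t(i, j) = alpha_t(i) * a_ij * b_j(O_{t+1}) * beta_{t+1}(j) / P(O|lambda)
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / p_obs
    return gamma, xi

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
gamma, xi = gamma_xi(A, B, pi, [0, 1, 0])
```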
Probability of an observation sequence
The probability of an observation sequence $O$ of length $T$ given a model $\lambda$ is computed as follows:
\[
P(O \mid \lambda) = \sum_{i=0}^{N-1} \alpha_{T-1}(i)
\]
Update of Model Parameters
The sum of gamma variables for a particular state $i$, that is, the expression $\sum_{t=0}^{T-2} \gamma_t(i)$, can be interpreted as the expected number of times that the state $S_i$ is visited given the model parameters and the observation sequence $O$. Likewise, the sum of xi variables $\sum_{t=0}^{T-2} \xi_t(i, j)$ can be interpreted as the expected number of transitions from state $S_i$ to state $S_j$. Hence the ratio of the latter over the former is the updated probability of transitioning from state $S_i$ to state $S_j$. Thus, an iteration of the Baum-Welch algorithm adjusts the parameters as below.
Initial Probabilities Vector
\[
\bar{\pi}_i = \gamma_0(i), \quad 0 \le i \le N-1
\]

State Transition Probability Distribution
\[
\bar{a}_{ij} = \frac{\sum_{t=0}^{T-2} \xi_t(i, j)}{\sum_{t=0}^{T-2} \gamma_t(i)}, \quad 0 \le i \le N-1, \; 0 \le j \le N-1
\]

Emission Probability Distribution
\[
\bar{b}_j(k) = \frac{\sum_{t=0, \, O_t = v_k}^{T-1} \gamma_t(j)}{\sum_{t=0}^{T-1} \gamma_t(j)}, \quad 0 \le j \le N-1, \; 0 \le k \le M-1
\]
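Putting the pieces together, a single unscaled re-estimation step can be sketched as follows (function name and model numbers are illustrative, not the Samsara code). By the EM property, the likelihood is non-decreasing across iterations.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One unscaled Baum-Welch re-estimation step for a single sequence."""
    T, N, M = len(obs), len(pi), B.shape[1]
    # forward pass
    alpha = np.zeros((T, N)); alpha[0] = pi * B[:, obs[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    # backward pass
    beta = np.zeros((T, N)); beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()
    gamma = alpha * beta / p_obs
    xi = np.array([alpha[t][:, None] * A *
                   (B[:, obs[t + 1]] * beta[t + 1])[None, :] / p_obs
                   for t in range(T - 1)])
    # re-estimation, term by term as in the formulas above
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):
        mask = np.array([o == k for o in obs])
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, p_obs

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 1, 0]
pi2, A2, B2, p1 = baum_welch_step(A, B, pi, obs)
_, _, _, p2 = baum_welch_step(A2, B2, pi2, obs)  # likelihood under updated model
```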
Numerical Stability and Scaling
The value of a forward variable $\alpha_t(i)$ quickly tends to zero as $t$ becomes large. The solution to this problem is to scale the forward variables at each induction step. One common scaling scheme (as described in [1]) is to define a scaling factor which depends only on the time $t$ but is independent of the state $i$, as described below. The scaled forward variables $\hat{\alpha}_t(i)$, the intermediate variables $\ddot{\alpha}_t(i)$, and the scaling factors $c_t$ are computed by induction as follows.

Initialization:
\[
\ddot{\alpha}_0(i) = \alpha_0(i), \quad 0 \le i \le N-1
\]
\[
c_0 = \frac{1}{\sum_{i=0}^{N-1} \ddot{\alpha}_0(i)}
\]
\[
\hat{\alpha}_0(i) = c_0 \, \ddot{\alpha}_0(i), \quad 0 \le i \le N-1
\]
Induction:
\[
\ddot{\alpha}_t(i) = \sum_{j=0}^{N-1} \hat{\alpha}_{t-1}(j) \, a_{ji} \, b_i(O_t)
\]
\[
c_t = \frac{1}{\sum_{i=0}^{N-1} \ddot{\alpha}_t(i)}
\]
\[
\hat{\alpha}_t(i) = c_t \, \ddot{\alpha}_t(i), \quad 0 \le i \le N-1
\]
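The scaled recursion can be sketched in NumPy as below (illustrative model numbers). After each step the scaled variables sum to one, and the scaling factors retain the likelihood.

```python
import numpy as np

def scaled_forward(A, B, pi, obs):
    """Scaled forward pass: returns alpha_hat (each row sums to 1) and factors c_t."""
    T, N = len(obs), len(pi)
    alpha_hat = np.zeros((T, N))
    c = np.zeros(T)
    a = pi * B[:, obs[0]]                    # alpha-double-dot_0 = alpha_0
    c[0] = 1.0 / a.sum()
    alpha_hat[0] = c[0] * a
    for t in range(1, T):
        a = (alpha_hat[t - 1] @ A) * B[:, obs[t]]   # alpha-double-dot_t
        c[t] = 1.0 / a.sum()
        alpha_hat[t] = c[t] * a                      # alpha-hat_t = c_t * alpha-double-dot_t
    return alpha_hat, c

# Toy model (illustrative numbers only)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
alpha_hat, c = scaled_forward(A, B, pi, [0, 1, 0])
# P(O | lambda) = 1 / prod(c_t), matching the unscaled forward pass
p_obs = 1.0 / np.prod(c)
```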
To compute the scaled backward variables $\ddot{\beta}_t(i)$, the same scaling factors which were computed for the scaled forward variables are used.

Initialization:
\[
\ddot{\beta}_{T-1}(i) = 1
\]
\[
\hat{\beta}_{T-1}(i) = c_{T-1} \, \ddot{\beta}_{T-1}(i)
\]

Induction:
\[
\ddot{\beta}_t(i) = \sum_{j=0}^{N-1} a_{ij} \, b_j(O_{t+1}) \, \hat{\beta}_{t+1}(j)
\]
\[
\hat{\beta}_t(i) = c_t \, \ddot{\beta}_t(i)
\]
Probability of an observation sequence with scaled variables

The probability of an observation sequence $O$ given a model $\lambda$ is computed as follows:
\[
C_t = \prod_{\tau=0}^{t} c_\tau
\]
\[
P(O \mid \lambda) = 1 / C_{T-1}
\]
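For long sequences the product $C_{T-1}$ itself would overflow, so implementations commonly accumulate $\log P(O \mid \lambda) = -\sum_t \log c_t$ instead. A self-contained sketch (illustrative model numbers; the function name is an assumption, not the Samsara API):

```python
import numpy as np

def scaled_forward_loglik(A, B, pi, obs):
    """Return log P(O | lambda) = -sum_t log c_t, accumulated in log space
    so that even very long sequences neither underflow nor overflow."""
    log_p = 0.0
    a = pi * B[:, obs[0]]              # unscaled alpha-double-dot_0
    for t in range(1, len(obs) + 1):
        s = a.sum()                    # s = 1 / c_{t-1}
        log_p += np.log(s)             # accumulate -log c_{t-1}
        a = a / s                      # normalize to alpha-hat_{t-1}
        if t < len(obs):
            a = (a @ A) * B[:, obs[t]]  # next alpha-double-dot_t
    return log_p

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
ll = scaled_forward_loglik(A, B, pi, [0, 1, 0])
```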
Using the scaled forward and backward variables, the model parameters are adjusted as follows.

Initial Probabilities Vector
\[
\bar{\pi}_i = \gamma_0(i) = \hat{\alpha}_0(i) \, \hat{\beta}_0(i) / c_0
\]

State Transition Probability Distribution
\[
\bar{a}_{ij} = \frac{\sum_{t=0}^{T-2} \hat{\alpha}_t(i) \, a_{ij} \, b_j(O_{t+1}) \, \hat{\beta}_{t+1}(j)}{\sum_{t=0}^{T-2} \hat{\alpha}_t(i) \, \hat{\beta}_t(i) / c_t}
\]

Emission Probability Distribution
\[
\bar{b}_j(k) = \frac{\sum_{t=0, \, O_t = v_k}^{T-1} \hat{\alpha}_t(j) \, \hat{\beta}_t(j) / c_t}{\sum_{t=0}^{T-1} \hat{\alpha}_t(j) \, \hat{\beta}_t(j) / c_t}
\]
Training with Multiple Observation Sequences
Suppose we have $L$ independent observation sequences, where the observation sequence indexed by $l$ is denoted by $O^l$ and $0 \le l \le L-1$. In order to update the parameters of the model (as described in [3]), we need to do the following.
Starting probabilities: From each observation sequence, compute the expected number of times of being in each state at time $t = 0$. For each state $i$, we can compute the sum of the expected number of times of being in that state at time $t = 0$ over all the sequences. From this we can update the initial probabilities vector.

Expected number of transitions: From each observation sequence, compute the expected number of transitions from state $i$ to state $j$. For each ordered pair of states $(i, j)$, we can compute the sum of the expected number of transitions from state $i$ to state $j$ over all sequences. Once we have computed these sums for row $i$ of the transition matrix, we can update that row by dividing each sum by the total expected number of times state $i$ is visited.

Expected number of emissions: From each observation sequence, compute the expected number of times of being in state $i$ and emitting symbol $j$. For each state $i$ and symbol $j$, we can compute the total expected number of times of being in state $i$ and emitting symbol $j$ over all sequences. Each row of the emission matrix can be updated by dividing each entry by the total expected number of times of visiting the state corresponding to that row.
The parameters are updated as follows.

Initial Probabilities Vector
\[
\bar{\pi}_i = \frac{\sum_{l=0}^{L-1} \alpha_0^l(i) \, \beta_0^l(i) / P(O^l \mid \lambda)}{L}
\]
State Transition Probability Distribution
\[
\bar{a}_{ij} = \frac{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-2} \alpha_t^l(i) \, a_{ij} \, b_j(O_{t+1}^l) \, \beta_{t+1}^l(j) / P(O^l \mid \lambda)}{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-2} \alpha_t^l(i) \, \beta_t^l(i) / P(O^l \mid \lambda)}
\]
Emission Probability Distribution
\[
\bar{b}_j(k) = \frac{\sum_{l=0}^{L-1} \sum_{t=0, \, O_t^l = v_k}^{T_l-1} \alpha_t^l(j) \, \beta_t^l(j) / P(O^l \mid \lambda)}{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-1} \alpha_t^l(j) \, \beta_t^l(j) / P(O^l \mid \lambda)}
\]
If we are using the scaled forward and backward variables, then the update equations are as follows.
Initial Probabilities Vector
\[
\bar{\pi}_i = \frac{\sum_{l=0}^{L-1} \hat{\alpha}_0^l(i) \, \hat{\beta}_0^l(i) / c_0^l}{L}
\]
State Transition Probability Distribution
\[
\bar{a}_{ij} = \frac{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-2} \hat{\alpha}_t^l(i) \, a_{ij} \, b_j(O_{t+1}^l) \, \hat{\beta}_{t+1}^l(j)}{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-2} \hat{\alpha}_t^l(i) \, \hat{\beta}_t^l(i) / c_t^l}
\]
Emission Probability Distribution
\[
\bar{b}_j(k) = \frac{\sum_{l=0}^{L-1} \sum_{t=0, \, O_t^l = v_k}^{T_l-1} \hat{\alpha}_t^l(j) \, \hat{\beta}_t^l(j) / c_t^l}{\sum_{l=0}^{L-1} \sum_{t=0}^{T_l-1} \hat{\alpha}_t^l(j) \, \hat{\beta}_t^l(j) / c_t^l}
\]
Distributed Training in Samsara
The current implementation of distributed training of HMM in Samsara is based on the HMM training in MapReduce described in [2]. During each iteration of the Baum-Welch algorithm, each node in a cluster works on a block of independent observation sequences. Each node in the cluster executes the following steps for each observation sequence in the block.

- Compute the forward variables matrix of dimensions $T \times N$, where $T$ is the length of the observation sequence. The forward variables can be either scaled or unscaled.
- Compute the backward variables matrix of dimensions $T \times N$, where $T$ is the length of the observation sequence. If the forward variables were scaled in the previous step, then use the same scaling factors to scale the backward variables too.
- For each state $i$, compute the expected number of times of being in that state at time $t = 0$.
- For each state $i$, compute the expected number of transitions from that state to every state $j$, where $0 \le j \le N-1$.
- For each state $i$, compute the expected number of emissions of symbol $k$, where $0 \le k \le M-1$.
The mapBlock operator transforms a block of observation sequences (which is a matrix with $R$ rows representing a subset of $R$ observation sequences) into a matrix of shape $R \times (N + N^2 + N \cdot M)$. Each row in the input block (which is an independent observation sequence) is mapped to a row in the output block with $(N + N^2 + N \cdot M)$ columns as described below. The first $N$ columns of the output row contain the values $\gamma_0(i)$, $0 \le i \le N-1$, which are the probabilities of starting in state $i$ for each of the $N$ states. The next $N^2$ columns store the row-major representation of the $N \times N$ matrix which contains the expected transition counts. The element $e_{ij}$ of this matrix is the expected number of transitions from state $i$ to state $j$ given the observation sequence. The last $N \cdot M$ columns of the output block matrix store the row-major representation of the $N \times M$ matrix of expected emission counts. The element $f_{ij}$ of this matrix contains the expected number of times the symbol $j$ is emitted from state $i$ given the observation sequence. When all the blocks of the input DRM of observation sequences are processed, the parameters of the model are updated as follows.
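The row layout described above can be sketched as follows. The helper names `pack_row`/`unpack_row` and all the statistics values are hypothetical, not the actual Samsara code; the point is that aggregation across sequences becomes a plain row-wise sum of the output matrix.

```python
import numpy as np

def pack_row(gamma0, trans_counts, emis_counts):
    """Flatten per-sequence statistics into one row of length N + N^2 + N*M:
    [gamma_0(0..N-1) | row-major expected transition counts e_ij
                     | row-major expected emission counts f_ij]."""
    return np.concatenate([gamma0, trans_counts.ravel(), emis_counts.ravel()])

def unpack_row(row, N, M):
    """Inverse of pack_row: recover the three statistics blocks."""
    gamma0 = row[:N]
    trans = row[N:N + N * N].reshape(N, N)
    emis = row[N + N * N:].reshape(N, M)
    return gamma0, trans, emis

# Hypothetical per-sequence statistics for N = 2 states, M = 3 symbols
N, M = 2, 3
gamma0 = np.array([0.7, 0.3])
trans = np.array([[1.2, 0.8], [0.5, 1.5]])
emis = np.array([[0.9, 0.6, 0.5], [0.4, 1.0, 0.6]])
row = pack_row(gamma0, trans, emis)

# Aggregation across sequences is a row-wise sum of the output DRM;
# normalizing each unpacked block then yields the updated pi, A and B.
total = row + row                        # e.g. two identical sequences
g0, tr, em = unpack_row(total, N, M)
new_A = tr / tr.sum(axis=1, keepdims=True)   # normalize transition counts row-wise
```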
Initial Probabilities Vector: To update the initial probabilities vector, compute the total count of the expected number of times of being in state $i$ at time $t = 0$, for all $0 \le i \le N-1$. The element $\bar{\pi}_i$ is calculated as the ratio of the total count for state $i$ to the sum of the counts over all states.

State Transition Probability Distribution: To update row $i$ of the transition matrix, for each element $a_{ij}$ we need to compute the cumulative expected number of transitions from state $i$ to state $j$ over all the observation sequences. The sum of these cumulative counts over $j$ gives us the total expected number of times the state $i$ is visited. If we divide the cumulative expected number of transitions from state $i$ to state $j$ by this total, we get the updated probability of transition from state $i$ to state $j$.

Emission Probability Distribution: To update row $j$ of the emission matrix, for each element $b_j(k)$ we need to compute the cumulative expected number of times the symbol $k$ is emitted while being in state $j$ over all the observation sequences. The sum of these cumulative counts over $k$ gives us the total expected number of times the state $j$ is visited. If we divide the cumulative expected number of emissions of symbol $k$ from state $j$ by this total, we get the updated probability of emitting symbol $k$ from state $j$.
References

[1] Dawei Shen, Some Mathematics for HMM.
https://pdfs.semanticscholar.org/4ce1/9ab0e07da9aa10be1c336400c8e4d8fc36c5.pdf

[2] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce.
http://www.iro.umontreal.ca/~nie/IFT6255/Books/MapReduce.pdf

[3] Xiaolin Li, Marc Parizeau and Rejean Plamondon, Training Hidden Markov Models with Multiple Observations - A Combinatorial Method.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.335.1457&rep=rep1&type=pdf