Author manuscript; available in PMC: 2019 Jul 1.

Published in final edited form as: Proc Conf Assoc Comput Linguist Meet. 2018 Jul;2018:2225–2235.

Algorithm 1. FEATURE EXTRACTION

procedure FORCED ALIGNMENT
    Determine the time interval of each word
    find w_i ↔ [A_ij], j ∈ [1, L], i ∈ [1, N]
end procedure
procedure TEXT BRANCH
    Text Attention Module
    for i ∈ [1, N] do
        T_i ← getEmbedded(w_i)
        t_h_i ← bi_GRU(T_i)
        t_e_i ← getEnergies(t_h_i)
        t_α_i ← getDistribution(t_e_i)
    end for
    return t_h_i, t_α_i
end procedure
procedure AUDIO BRANCH
    for i ∈ [1, N] do
        Frame-Level Attention Module
        for j ∈ [1, L] do
            f_h_ij ← bi_GRU(A_ij)
            f_e_ij ← getEnergies(f_h_ij)
            f_α_ij ← getDistribution(f_e_ij)
        end for
        f_V_i ← weightedSum(f_α_ij, f_h_ij)
        Word-Level Attention Module
        w_h_i ← bi_GRU(f_V_i)
        w_e_i ← getEnergies(w_h_i)
        w_α_i ← getDistribution(w_e_i)
    end for
    return w_h_i, w_α_i
end procedure
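
The frame-level attention step of the audio branch (energies, then a distribution, then a weighted sum) can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the implementation used in the paper: the scoring form e_j = v · tanh(W h_j + b) is a common additive-attention choice assumed here, the distribution is assumed to be a softmax, random arrays stand in for the bi_GRU hidden states, and the sizes L, d, k and the parameters W, b, v are hypothetical.

```python
import numpy as np

def get_energies(h, W, b, v):
    """Score each hidden state: e_j = v . tanh(W h_j + b) (assumed scoring form)."""
    return np.tanh(h @ W.T + b) @ v

def get_distribution(e):
    """Softmax over the energies, shifted by the max for numerical stability."""
    z = np.exp(e - np.max(e))
    return z / z.sum()

def weighted_sum(alpha, h):
    """Attention-weighted sum of hidden states: V = sum_j alpha_j * h_j."""
    return alpha @ h

rng = np.random.default_rng(0)
L, d, k = 5, 8, 4                  # frames per word, hidden size, attention size (hypothetical)
f_h = rng.normal(size=(L, d))      # stand-in for the bi_GRU outputs of one word's frames
W = rng.normal(size=(k, d))        # hypothetical attention parameters
b = np.zeros(k)
v = rng.normal(size=k)

f_e = get_energies(f_h, W, b, v)   # one energy per frame, shape (L,)
f_alpha = get_distribution(f_e)    # attention weights over frames, shape (L,)
f_V = weighted_sum(f_alpha, f_h)   # word-level acoustic vector, shape (d,)
```

Under these assumptions, f_V plays the role of f_V_i for a single word i; stacking the vectors for all N words would give the input sequence of the word-level bi_GRU.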