Two-Step Joint Optimization with Auxiliary Loss Function for Noise-Robust Speech Recognition

2022 Jul 19;22(14):5381. doi: 10.3390/s22145381

Algorithm 1: Two-Step Joint Optimization Procedure with Auxiliary ASR Loss.
Require: $m$, the batch size; $n$, the number of iterations; $α_{1}$, the weight balancing the SI-SNR loss against the auxiliary ASR loss; $α_{2}$, the weight balancing the speech enhancement (SI-SNR) loss against the ASR loss; $θ_{SE}$, the learnable parameters of the speech enhancement model; $θ_{ASR}$, the learnable parameters of the ASR model.
1:	for $1, \dots, n$ do
2:	Sample a mini-batch of $m$ clean speech frames $\{s^{(1)}, \dots, s^{(m)}\}$ and the corresponding acoustic features $\{a^{(1)}, \dots, a^{(m)}\}$
3:	Obtain the mini-batch of $m$ enhanced speech frames $\{\hat{s}^{(1)}, \dots, \hat{s}^{(m)}\}$ and the corresponding acoustic features $\{\hat{a}^{(1)}, \dots, \hat{a}^{(m)}\}$
4:	Take the mini-batch of $m$ corresponding linguistic units $\{y^{(1)}, \dots, y^{(m)}\}$
5:	First-Step Processing: Update the speech enhancement parameters using SI-SNR and auxiliary ASR loss
	$\nabla_{θ_{SE}} \frac{1}{m} \sum_{i=1}^{m} \left( α_{1} \cdot ℒ_{\text{SI-SNR}}(s^{(i)}, \hat{s}^{(i)}) + (1 - α_{1}) \cdot ℒ_{\text{aux}}(a^{(i)}, \hat{a}^{(i)}) \right)$
6:	Second-Step Processing: Update both speech enhancement and ASR parameters using SI-SNR and ASR loss
	$\nabla_{θ_{SE}, θ_{ASR}} \frac{1}{m} \sum_{i=1}^{m} \left( α_{2} \cdot ℒ_{\text{SI-SNR}}(s^{(i)}, \hat{s}^{(i)}) + (1 - α_{2}) \cdot ℒ_{\text{ASR}}(a^{(i)}, y^{(i)}) \right)$
7:	end for
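
The two updates in Algorithm 1 could be sketched as follows. This is an illustration only, not the authors' implementation. It assumes the commonly used SI-SNR definition with the loss taken as its negative, plain gradient descent with a hypothetical learning rate `lr`, and central finite differences standing in for backpropagation. The names `si_snr_loss`, `numerical_grad`, `first_step`, and `second_step` are hypothetical, and the loss callables are placeholders for the auxiliary and ASR losses, whose exact form is not specified here.

```python
import math


def si_snr_loss(target, estimate, eps=1e-8):
    """Negative scale-invariant SNR in dB (assumed, commonly used definition)."""
    n = len(target)
    t_mean = sum(target) / n
    e_mean = sum(estimate) / n
    t = [x - t_mean for x in target]            # zero-mean clean signal
    e = [x - e_mean for x in estimate]          # zero-mean enhanced signal
    scale = sum(a * b for a, b in zip(e, t)) / (sum(a * a for a in t) + eps)
    proj = [scale * a for a in t]               # component of e along t
    noise = [a - b for a, b in zip(e, proj)]    # residual component
    p = sum(a * a for a in proj)
    q = sum(a * a for a in noise)
    return -10.0 * math.log10((p + eps) / (q + eps))


def numerical_grad(loss_fn, params, h=1e-5):
    """Central-difference gradient; a stand-in for automatic differentiation."""
    grads = []
    for k in range(len(params)):
        up = params[:k] + [params[k] + h] + params[k + 1:]
        down = params[:k] + [params[k] - h] + params[k + 1:]
        grads.append((loss_fn(up) - loss_fn(down)) / (2 * h))
    return grads


def first_step(theta_se, loss_si_snr, loss_aux, alpha1, lr):
    """Step 1: update only the speech enhancement parameters."""
    def objective(se):
        return alpha1 * loss_si_snr(se) + (1 - alpha1) * loss_aux(se)
    grads = numerical_grad(objective, theta_se)
    return [p - lr * g for p, g in zip(theta_se, grads)]


def second_step(theta_se, theta_asr, loss_si_snr, loss_asr, alpha2, lr):
    """Step 2: update speech enhancement and ASR parameters jointly."""
    k = len(theta_se)
    def objective(joint):
        se, asr = joint[:k], joint[k:]
        return alpha2 * loss_si_snr(se) + (1 - alpha2) * loss_asr(se, asr)
    joint = theta_se + theta_asr
    grads = numerical_grad(objective, joint)
    updated = [p - lr * g for p, g in zip(joint, grads)]
    return updated[:k], updated[k:]


# Toy usage: hypothetical quadratic stand-ins for the three batch-averaged losses.
def toy_si_snr(se):
    return (se[0] - 1.0) ** 2


def toy_aux(se):
    return (se[0] - 2.0) ** 2


def toy_asr(se, asr):
    return (asr[0] - se[0]) ** 2


theta_se, theta_asr = [0.0], [0.0]
for _ in range(3):  # n = 3 iterations
    theta_se = first_step(theta_se, toy_si_snr, toy_aux, alpha1=0.5, lr=0.1)
    theta_se, theta_asr = second_step(
        theta_se, theta_asr, toy_si_snr, toy_asr, alpha2=0.5, lr=0.1
    )
```

In this sketch the ASR parameters are left untouched by `first_step`, mirroring the fact that the first gradient is taken with respect to $θ_{SE}$ only, while `second_step` differentiates one weighted objective with respect to both parameter sets. In practice each loss callable would average over the $m$ samples of the mini-batch, and an autodiff framework would replace `numerical_grad`.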