. 2021 Feb 19;15:578369. doi: 10.3389/fninf.2021.578369

Table 4.

DNN architecture.

Layer             Frames   Input dim   Output dim
Frame-level 1     5        5 × K       512
Frame-level 2     9        1,536       512
Frame-level 3     15       1,536       512
Frame-level 4     15       512         512
Frame-level 5     15       512         1,500
Pooling           T        1,500 × T   3,000
Segment-level 6   T        3,000       512
Segment-level 7   T        512         512
Softmax           T        512         N

X-vectors are extracted at the segment-level 6 layer, before the Rectified Linear Unit (ReLU) activation function. T is the number of frames in the input segment. K is the number of input features per frame: K = 24 for the telephone recordings (23 MFCCs + log energy) and K = 31 for the high-quality recordings (30 MFCCs + log energy). N is the number of speakers used for training: N = 5,139 for the SRE16 DNN and N = 7,330 for the voxceleb DNN.
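
The dimensions in Table 4 can be cross-checked with a short calculation. The sketch below is illustrative only: the table does not list the temporal splicing offsets or the pooling statistics, so the offsets in `ASSUMED_OFFSETS` and the mean-plus-standard-deviation pooling are assumptions taken from the commonly used x-vector TDNN recipe. They happen to reproduce the Frames, Input dim, and Output dim columns (for example 1,536 = 3 × 512 and 3,000 = 2 × 1,500). The names `frame_level_rows` and `pooled_dim` are hypothetical helpers, not part of any toolkit.

```python
# Sketch: reproduce the Table 4 dimensions under ASSUMED splicing offsets.
# The offsets are an assumption (common x-vector TDNN recipe); the table
# itself only reports the resulting dimensions.
ASSUMED_OFFSETS = [
    (-2, -1, 0, 1, 2),  # frame-level 1: 5 consecutive frames
    (-2, 0, 2),         # frame-level 2: 3 spliced outputs of layer 1
    (-3, 0, 3),         # frame-level 3: 3 spliced outputs of layer 2
    (0,),               # frame-level 4: no splicing
    (0,),               # frame-level 5: no splicing
]
FRAME_OUTPUT_DIMS = [512, 512, 512, 512, 1500]  # "Output dim" column


def frame_level_rows(k):
    """Return (layer, total context in frames, input dim, output dim) rows."""
    rows = []
    prev_dim = k          # the first layer sees K features per frame
    left = right = 0      # accumulated temporal context on each side
    for i, (offsets, out_dim) in enumerate(
        zip(ASSUMED_OFFSETS, FRAME_OUTPUT_DIMS), start=1
    ):
        left += -min(offsets)
        right += max(offsets)
        in_dim = len(offsets) * prev_dim  # spliced inputs are concatenated
        rows.append((f"Frame-level {i}", left + right + 1, in_dim, out_dim))
        prev_dim = out_dim
    return rows


def pooled_dim(frame_dim=1500):
    """Assumed statistics pooling: mean and std over T frames, concatenated."""
    return 2 * frame_dim


if __name__ == "__main__":
    for k in (24, 31):  # telephone and high-quality feature sizes
        print(f"K = {k}")
        for row in frame_level_rows(k):
            print("  ", row)
        print("   pooling output dim:", pooled_dim())
```

With K = 24 the first layer's input is 5 × 24 = 120 values, and with K = 31 it is 5 × 31 = 155; all later rows are independent of K.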