Table 4.
Layer | Frames | Input dim | Output dim |
---|---|---|---|
Frame-level 1 | 5 | 5 × K | 512 |
Frame-level 2 | 9 | 1,536 | 512 |
Frame-level 3 | 15 | 1,536 | 512 |
Frame-level 4 | 15 | 512 | 512 |
Frame-level 5 | 15 | 512 | 1,500 |
Pooling | T | 1,500 × T | 3,000 |
Segment-level 6 | T | 3,000 | 512 |
Segment-level 7 | T | 512 | 512 |
Softmax | T | 512 | N |
X-vectors are extracted at the segment-level 6 layer, before the Rectified Linear Unit (ReLU) activation function. T is the number of frames in the input segment. K is the number of input features per frame: K = 24 for the telephone recordings (23 MFCCs + log energy) and K = 31 for the high-quality recordings (30 MFCCs + log energy). N is the number of speakers used for training: N = 5,139 for the SRE16 DNN and N = 7,330 for the VoxCeleb DNN.
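The dimensions in the table can be checked with a short sketch. The per-layer frame offsets in `SPLICE_OFFSETS` are an assumption borrowed from the standard x-vector TDNN recipe; the table only reports the resulting totals. The pooling step is likewise assumed to concatenate the per-dimension mean and standard deviation, which would account for the 1,500 to 3,000 expansion. The names `frame_level_rows` and `stats_pooling` are hypothetical helpers written for this illustration.

```python
# Per-layer frame offsets: an ASSUMPTION taken from the standard x-vector TDNN
# recipe. The table gives only the cumulative context and the dimensions.
SPLICE_OFFSETS = [
    [-2, -1, 0, 1, 2],  # frame-level 1
    [-2, 0, 2],         # frame-level 2
    [-3, 0, 3],         # frame-level 3
    [0],                # frame-level 4
    [0],                # frame-level 5
]
OUTPUT_DIMS = [512, 512, 512, 512, 1500]


def frame_level_rows(k):
    """Return (cumulative_context, input_dim, output_dim) per frame-level layer."""
    rows = []
    total_context = 1          # a single frame before any splicing
    in_dim_per_frame = k       # K input features per frame
    for offsets, out_dim in zip(SPLICE_OFFSETS, OUTPUT_DIMS):
        # Each layer widens the context by the span of its offsets.
        total_context += max(offsets) - min(offsets)
        # The layer input is the concatenation of the spliced frames.
        rows.append((total_context, len(offsets) * in_dim_per_frame, out_dim))
        in_dim_per_frame = out_dim
    return rows


def stats_pooling(frames):
    """Concatenate per-dimension mean and standard deviation over T frames.

    ASSUMPTION: mean + standard deviation pooling, so D inputs give 2 * D outputs.
    """
    t = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / t for d in range(dims)]
    stds = [
        (sum((f[d] - means[d]) ** 2 for f in frames) / t) ** 0.5
        for d in range(dims)
    ]
    return means + stds


if __name__ == "__main__":
    for k in (24, 31):  # telephone and high-quality feature sizes
        print(f"K = {k}")
        for i, row in enumerate(frame_level_rows(k), start=1):
            print(f"  frame-level {i}: context={row[0]}, in={row[1]}, out={row[2]}")
```

Under these assumptions the computed context widths (5, 9, 15, 15, 15) and input dimensions (5 × K, 1,536, 1,536, 512, 512) agree with the Frames and Input dim columns, and pooling T frames of 1,500 values gives a single 3,000-dimensional vector regardless of T.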