. 2021 Feb 19;15:578369. doi: 10.3389/fninf.2021.578369

Table 4.

DNN architecture.

Layer	Frames	Input dim	Output dim
Frame-level 1	5	5 × K	512
Frame-level 2	9	1,536	512
Frame-level 3	15	1,536	512
Frame-level 4	15	512	512
Frame-level 5	15	512	1,500
Pooling	T	1,500 × T	3,000
Segment-level 6	T	3,000	512
Segment-level 7	T	512	512
softmax	T	512	N

X-vectors are extracted at layer segment-level 6, before the Rectified Linear Unit (ReLU) activation function. T is the number of frames in the input segment. K is the number of input features per frame: K = 24 for the telephone recordings (23 MFCCs + log energy) and K = 31 for the high-quality recordings (30 MFCCs + log energy). N is the number of speakers used for training: N = 5,139 for the SRE16 DNN and N = 7,330 for the VoxCeleb DNN.
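
The dimensions in Table 4 can be checked with a short calculation. The sketch below is illustrative only and is not the authors' code. It assumes the temporal splicing offsets of the standard x-vector recipe ({t-2, ..., t+2}, {t-2, t, t+2}, {t-3, t, t+3}, {t}, {t}), which the table does not state but which are consistent with its "Frames" and "Input dim" columns. It also assumes that the pooling layer concatenates the mean and standard deviation of the 1,500-dimensional frame-level outputs. The names `OFFSETS`, `OUTPUT_DIMS`, `frame_level_shapes`, and `pooled_dim` are hypothetical.

```python
# Assumed splicing offsets for frame-level layers 1-5 (standard x-vector
# recipe; NOT stated in Table 4, only consistent with its columns).
OFFSETS = [
    [-2, -1, 0, 1, 2],  # frame-level 1: five consecutive frames
    [-2, 0, 2],         # frame-level 2: three frames, stride 2
    [-3, 0, 3],         # frame-level 3: three frames, stride 3
    [0],                # frame-level 4: current frame only
    [0],                # frame-level 5: current frame only
]

# Output dimensions of frame-level layers 1-5, taken from Table 4.
OUTPUT_DIMS = [512, 512, 512, 512, 1500]


def frame_level_shapes(k):
    """Return (total context in frames, input dim, output dim) per layer.

    k is the number of input features per frame (K in Table 4).
    """
    rows = []
    prev_dim = k
    context = 1  # a single frame before any splicing
    for offsets, out_dim in zip(OFFSETS, OUTPUT_DIMS):
        # Splicing concatenates one copy of the previous output per offset.
        in_dim = len(offsets) * prev_dim
        # Each layer widens the receptive field by the span of its offsets.
        context += max(offsets) - min(offsets)
        rows.append((context, in_dim, out_dim))
        prev_dim = out_dim
    return rows


def pooled_dim(frame_dim=1500):
    """Assumed statistics pooling: mean and standard deviation concatenated."""
    return 2 * frame_dim


if __name__ == "__main__":
    for k in (24, 31):  # telephone and high-quality feature sets
        print(f"K = {k}")
        for i, (ctx, d_in, d_out) in enumerate(frame_level_shapes(k), start=1):
            print(f"  Frame-level {i}: frames={ctx}, in={d_in}, out={d_out}")
        print(f"  Pooling output: {pooled_dim()}")
```

Under these assumptions, the first layer's input is 5 × K (120 for K = 24, 155 for K = 31), and the remaining rows match the table regardless of K.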