Table 3. AST architecture.
Layer | Output shape |
---|---|
Input | (1024, 128) |
Embedding: Conv2d(1, 768, kernel_size = [16, 16], stride = [10, 10]) | (12, 101, 768) |
Encoder_1 | (1214, 768) |
Encoder_2 | (1214, 768) |
Encoder_3 | (1214, 768) |
Encoder_4 | (1214, 768) |
Encoder_5 | (1214, 768) |
Encoder_6 | (1214, 768) |
Encoder_7 | (1214, 768) |
Encoder_8 | (1214, 768) |
Encoder_9 | (1214, 768) |
Encoder_10 | (1214, 768) |
Encoder_11 | (1214, 768) |
Encoder_12 | (1214, 768) |
Linear_1(in_features = 768, out_features = 527, bias = True) | (1, 527) |
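
The shapes in Table 3 can be checked with the standard convolution output-size formula. The sketch below is illustrative only: the helper names `conv_out` and `ast_shapes` are hypothetical, and the two extra tokens added to the patch sequence are an assumption inferred from the table (12 × 101 = 1212 patches, while each encoder operates on 1214 tokens), which would be consistent with a class token plus a distillation token.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one axis (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def ast_shapes(time_frames=1024, mel_bins=128, kernel=16, stride=10,
               embed_dim=768, extra_tokens=2, num_classes=527):
    """Reproduce the per-layer output shapes listed in Table 3.

    extra_tokens=2 is an assumption inferred from 1214 - 12 * 101.
    """
    f_patches = conv_out(mel_bins, kernel, stride)      # patches along the 128-bin axis
    t_patches = conv_out(time_frames, kernel, stride)   # patches along the 1024-frame axis
    seq_len = f_patches * t_patches + extra_tokens      # flattened patches plus extra tokens
    return {
        "embedding": (f_patches, t_patches, embed_dim),
        "encoder": (seq_len, embed_dim),
        "output": (1, num_classes),
    }


if __name__ == "__main__":
    for name, shape in ast_shapes().items():
        print(name, shape)
```

With the default arguments, the 16 × 16 kernel and stride of 10 give 12 patches along the 128-bin axis and 101 along the 1024-frame axis, matching the embedding row, and the sequence length of 1214 matches every encoder row.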