Table 5. Detailed parameters and operations within the CCST feature extractor module.
| Operation | Input shape | Parameters | Output shape |
|---|---|---|---|
| Temporal convolution | (B, 1, C, T) | Conv2d: kernel_size = (1, 25), stride = (1, 1) | (B, 40, C, T) |
| Spatial convolution | (B, 40, C, T) | Conv2d: kernel_size = (C, 1), stride = (1, 1) | (B, 40, 1, T) |
| Batch normalization | (B, 40, 1, T) | BatchNorm2d: num_features = 40 | (B, 40, 1, T) |
| Activation (ELU) | (B, 40, 1, T) | ELU activation | (B, 40, 1, T) |
| Average pooling | (B, 40, 1, T) | AvgPool2d: kernel_size = (1, 75), stride = (1, 15) | ![]() |
| Dropout | ![]() | Dropout: p = 0.5 | ![]() |
| Projection convolution | ![]() | Conv2d: in_channels = 40, out_channels = 40, kernel_size = (1, 1) | ![]() |
| Rearrangement | ![]() | Rearrange: 'b e (ht) (w) -> b (ht w) e' | ![]() |
| Embedding projection | ![]() | Linear: in_features = 40, out_features = 64 | ![]() |
| Positional encoding | ![]() | {learnable, sine, none}, shape = (1, 15, 64) | ![]() |
| Transformer | ![]() | Multiple layers: num_layers = 3, num_heads = 4, mlp_hidden = 128, window_size = 4 | (B, 64) |
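
The shape bookkeeping in Table 5 can be checked with the standard output-size formula for convolution and pooling windows, floor((n + 2p - k) / s) + 1. The sketch below is illustrative only and is not the authors' code: the helper names `conv_out` and `trace_shapes`, the temporal padding of 12 (the value that lets a kernel of width 25 preserve T, as the table's output shape implies), and the example sizes B = 8, C = 22, T = 285 are all assumptions. The pooled and token shapes it prints are derived from the listed kernel and stride values, not read from the table, and the reduction from the token sequence to the final (B, 64) output is not specified there, so the trace stops at the embedding projection.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(B, C, T, temporal_padding=12):
    """Return (operation, output_shape) pairs following the rows of Table 5.

    temporal_padding=12 is an assumption: it is the padding that lets the
    (1, 25) temporal kernel keep the time length T, as the table states.
    """
    steps = []
    t = conv_out(T, kernel=25, stride=1, padding=temporal_padding)
    steps.append(("Temporal convolution", (B, 40, C, t)))
    c = conv_out(C, kernel=C)  # kernel spans all C rows, so this is 1
    steps.append(("Spatial convolution", (B, 40, c, t)))
    steps.append(("Batch normalization", (B, 40, c, t)))  # shape unchanged
    steps.append(("Activation (ELU)", (B, 40, c, t)))  # shape unchanged
    t = conv_out(t, kernel=75, stride=15)  # pooling shortens the time axis
    steps.append(("Average pooling", (B, 40, c, t)))
    steps.append(("Dropout", (B, 40, c, t)))  # shape unchanged
    steps.append(("Projection convolution", (B, 40, c, t)))  # 1x1 conv, 40 -> 40
    steps.append(("Rearrangement", (B, c * t, 40)))  # tokens along time
    steps.append(("Embedding projection", (B, c * t, 64)))  # Linear 40 -> 64
    return steps


# Hypothetical sizes; T = 285 is chosen so the token count is 15,
# the sequence length of the positional encoding in the table.
for name, shape in trace_shapes(B=8, C=22, T=285):
    print(f"{name:<24}{shape}")
```

Under these assumptions, any T from 285 to 299 yields 15 tokens, which is the sequence length a (1, 15, 64) positional encoding can be added to without reshaping.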












