Table 1.
Layer name | Output size | Original 50-layer | Off-the-shelf | Fine-tuned |
---|---|---|---|---|
conv1 | 112 × 112 | 7 × 7, 64-d, stride 2 | same | **fine-tuned** |
pooling1 | 56 × 56 | 3 × 3, 64-d, max pool, stride 2 | same | same |
conv2_x | 56 × 56 | [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3 | same | **fine-tuned** |
conv3_0 | 28 × 28 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 1 | same | **fine-tuned** |
conv3_x | 28 × 28 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 3 | same | **fine-tuned** |
conv4_0 | 14 × 14 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 1 | same | **fine-tuned** |
conv4_x | 14 × 14 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 5 | same | **fine-tuned** |
conv5_0 | 7 × 7 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 1 | same | **fine-tuned** |
conv5_x | 7 × 7 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 2 | same | **fine-tuned** |
pooling2 | 1 × 1 | 7 × 7, 2048-d, average pool, stride 1 | same | same |
dense | 1 × 1 | 1000-d, dense-layer | **15-d, dense-layer** | **15-d, dense-layer** |
loss | 1 × 1 | 1000-d, softmax | **15-d, sigmoid, BCE** | **15-d, sigmoid, BCE** |
In our experiments, we use the ResNet-50 architecture, and this table shows the differences between the original architecture and our two variants (off-the-shelf and fine-tuned ResNet-50). Where a layer is unchanged from the original network, the entry reads "same". Bold text highlights which parts of the network are changed for our application. All layers employ automatic padding (i.e., dependent on the kernel size) to keep the spatial size unchanged. The conv3_0, conv4_0, and conv5_0 layers down-sample the spatial size with a stride of 2.
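The head replacement in the table translates directly into code. The following PyTorch sketch is our own illustration (the framework, weight source, and all variable names are assumptions, not taken from the paper): it swaps the 1000-d softmax head of an ImageNet-pre-trained ResNet-50 for a 15-d dense layer with a sigmoid/BCE loss, and freezes the backbone to obtain the off-the-shelf variant.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 pre-trained on ImageNet (torchvision >= 0.13 weights API).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# "dense" row: replace the original 1000-d dense layer with a 15-d one.
model.fc = nn.Linear(model.fc.in_features, 15)

# Off-the-shelf variant: freeze every pre-trained layer so that only the
# new 15-d dense layer receives gradient updates.
for name, param in model.named_parameters():
    if not name.startswith("fc."):
        param.requires_grad = False

# "loss" row: 15-d sigmoid with binary cross-entropy (BCE).
# BCEWithLogitsLoss fuses the sigmoid into the loss for numerical stability.
criterion = nn.BCEWithLogitsLoss()

# Toy forward/backward pass with one 224 x 224 RGB image and a multi-hot
# 15-d label vector.
x = torch.randn(1, 3, 224, 224)
y = torch.zeros(1, 15)
loss = criterion(model(x), y)
loss.backward()
```

For the fine-tuned variant, the freezing loop is simply omitted so that gradients flow through all convolutional layers as well; the 15-d head and the BCE loss remain the same.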