Table 2.
| Model | Target | CNN | CNN + Lklhd | CNN + yLocn | CNN + scale | CNN + Lklhd + yLocn + scale | CNN + all car & person ratings | Increase (%) |
|---|---|---|---|---|---|---|---|---|
| RCNN3 | C | 82.4 ± 0.000 | 85.0 ± 0.000 | 82.5 ± 0.000 | 83.2 ± 0.000 | 85.5 ± 0.001 | **86.2 ± 0.0017** | 3.8 |
| RCNN3 | P | 80.6 ± 0.000 | 80.6 ± 0.001 | 81.5 ± 0.001 | 80.6 ± 0.000 | 80.4 ± 0.0018 | **82.0 ± 0.0023** | 1.4 |
| AlexNet2 | C | 83.5 ± 0.000 | 85.8 ± 0.0012 | 83.5 ± 0.001 | 84.3 ± 0.0015 | 86.8 ± 0.0017 | **87.1 ± 0.0019** | 3.6 |
| AlexNet2 | P | 73.4 ± 0.002 | 73.0 ± 0.0028 | **77.1 ± 0.0025** | 75.0 ± 0.0025 | 76.8 ± 0.0034 | **77.1 ± 0.0038** | 3.7 |
Each entry shows the cross-validated accuracy for detecting cars (C) or people (P) in novel scenes from the ADE20K dataset, chosen from the same scene categories as in the human experiments (for details, see Supplementary Tables S2, S3 and S4). The best accuracy in each row is highlighted in bold. Columns indicate the kind of model used: the column marked CNN gives the baseline accuracy of the deep neural network, and columns of the form "CNN + X" give the accuracy of the CNN augmented with feature X. Lklhd: predicted likelihood of the target category object; xLocn: predicted horizontal location of the target category object; yLocn: predicted vertical location of the target category object; scale: overall bounding-box area marked by subjects. The final column gives the increase (in percentage points) of the best augmented model over the CNN baseline.
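To make the "CNN + X" construction concrete, the sketch below shows one plausible way to augment precomputed CNN features with human-derived predictors and score each variant by cross-validated detection accuracy, as the caption describes. All names (`cnn_features`, `lklhd`, `y_locn`, `scale`), the random placeholder data, and the choice of a logistic-regression readout are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs (assumptions, not from the paper):
#   cnn_features        - (n_images, d) CNN activations per scene
#   lklhd, y_locn, scale - per-image human-derived predictors
#   labels              - binary target-present labels (car or person)
rng = np.random.default_rng(0)
n, d = 200, 512
cnn_features = rng.normal(size=(n, d))
lklhd = rng.normal(size=n)
y_locn = rng.normal(size=n)
scale = rng.normal(size=n)
labels = rng.integers(0, 2, size=n)

def cv_accuracy(X, y, folds=5):
    """Mean cross-validated accuracy of a standardized linear readout."""
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean()

# Baseline ("CNN" column): CNN features alone.
baseline = cv_accuracy(cnn_features, labels)

# Augmented ("CNN + X" columns): CNN features plus human-derived columns.
augmented = cv_accuracy(
    np.column_stack([cnn_features, lklhd, y_locn, scale]), labels
)
print(f"CNN: {baseline:.3f}   CNN + Lklhd + yLocn + scale: {augmented:.3f}")
```

Under this reading, each "CNN + X" column in the table corresponds to concatenating one extra feature column before refitting, and the difference between the best augmented score and the baseline yields the final "Increase (%)" column.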