Sci Rep. 2019 Feb 14;9:2112. doi: 10.1038/s41598-018-38427-0

Table 2.

Improvement in car/person detection obtained by augmenting state-of-the-art CNNs with predicted human-derived contextual expectations.

| Network | Target | CNN | CNN + Lklhd | CNN + yLocn | CNN + scale | CNN + Lklhd + yLocn + scale | CNN + all car & person ratings | Increase (%) |
|---|---|---|---|---|---|---|---|---|
| RCNN³ | C | 82.4 ± 0.000 | 85.0 ± 0.000 | 82.5 ± 0.000 | 83.2 ± 0.000 | 85.5 ± 0.001 | **86.2 ± 0.0017** | 3.8 |
| | P | 80.6 ± 0.000 | 80.6 ± 0.001 | 81.5 ± 0.001 | 80.6 ± 0.000 | 80.4 ± 0.0018 | **82.0 ± 0.0023** | 1.4 |
| AlexNet² | C | 83.5 ± 0.000 | 85.8 ± 0.0012 | 83.5 ± 0.001 | 84.3 ± 0.0015 | 86.8 ± 0.0017 | **87.1 ± 0.0019** | 3.6 |
| | P | 73.4 ± 0.002 | 73.0 ± 0.0028 | **77.1 ± 0.0025** | 75.0 ± 0.0025 | 76.8 ± 0.0034 | **77.1 ± 0.0038** | 3.7 |

Each entry shows the cross-validated accuracy for detecting cars (C) or people (P) on novel scenes from the ADE20K dataset, chosen from the same scene categories as in the human experiments (for details, see Supplementary Tables S2, S3 and S4). The best-performing models in each row are highlighted in bold. Columns indicate the kind of model used: the column marked CNN gives the baseline accuracy of the deep neural network; columns of the form "CNN + X" give the accuracy of the CNN augmented with feature X. Lklhd: predicted likelihood of the target category object; xLocn: predicted horizontal location of the target category object; yLocn: predicted vertical location of the target category object; scale: overall bounding-box area marked by subjects.
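To make the "CNN + X" columns concrete, here is a minimal sketch of how such an augmentation could be evaluated. It assumes the augmented model is a linear classifier (logistic regression) over the baseline CNN detection score plus the predicted contextual features, scored by k-fold cross-validation; the variable names and synthetic data below are illustrative assumptions, not the paper's actual data or method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical per-scene inputs (synthetic stand-ins, not the paper's data):
#   cnn_score - detection score from the baseline CNN
#   lklhd     - predicted human likelihood rating for the target
#   y_locn    - predicted vertical location of the target
#   scale     - predicted bounding-box area
#   labels    - 1 if the scene contains the target (car/person), else 0
rng = np.random.default_rng(0)
n_scenes = 500
cnn_score = rng.normal(size=n_scenes)
lklhd, y_locn, scale = rng.normal(size=(3, n_scenes))
labels = (cnn_score + 0.5 * lklhd + rng.normal(size=n_scenes) > 0).astype(int)

def cv_accuracy(features, cv=10):
    """Cross-validated accuracy of a linear classifier trained on the
    given feature columns (standardized before fitting)."""
    X = np.column_stack(features)
    clf = make_pipeline(StandardScaler(), LogisticRegression())
    return cross_val_score(clf, X, labels, cv=cv).mean()

acc_cnn = cv_accuracy([cnn_score])                        # "CNN" column
acc_lk  = cv_accuracy([cnn_score, lklhd])                 # "CNN + Lklhd"
acc_all = cv_accuracy([cnn_score, lklhd, y_locn, scale])  # "CNN + Lklhd + yLocn + scale"
print(f"CNN: {acc_cnn:.3f}  +Lklhd: {acc_lk:.3f}  +all: {acc_all:.3f}")
```

On real data, each row of the table above would correspond to running `cv_accuracy` with the relevant feature subset for one network and target category; the "Increase (%)" column is then the best augmented accuracy minus the baseline CNN accuracy.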