Network structure of the mask region-based convolutional neural network (Mask R-CNN). ResNet is a residual network; FPN is a feature pyramid network; RPN is a region proposal network; ROI is region of interest; NMS is non-maximum suppression; FC layer is a fully connected layer; Bbox is bounding box; FCN is a fully convolutional network; C1–C5 are convolutional stages 1 to 5 in the ResNet; P2–P6 are feature maps in the FPN; Box1–Box5 are boxes of various scales and aspect ratios proposed by the RPN; Conv. 1 × 1, 256 is a convolution with a kernel size of (1, 1) and a depth of 256; MP [(1, 1), 2] is max pooling with a pool size of (1, 1) and a stride of 2; ×2 Ups. is upsampling by a factor of (2, 2); Conv. 3 × 3 × 256 is a convolution with a kernel size of (3, 3) and a depth of 256; 7 × 7 × 256 is the size (length 7, width 7, depth 256) of the convolutional layer; 1024 is the number of neurons in the FC layer; 14 × 14 × 256 is the size (length 14, width 14, depth 256) of the convolutional layer; ×4 means that the previous layer is repeated four times; 28 × 28 × 256 is the size (length 28, width 28, depth 256) of the convolutional layer; 28 × 28 × 80 denotes 80 target masks, each 28 in length and 28 in width.
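The FPN operations named in the caption (Conv. 1 × 1, 256; ×2 Ups.; MP [(1, 1), 2]) can be traced with a minimal NumPy sketch. It is an illustration only, not the network's implementation: the 128 × 128 input size, the C2–C5 strides and depths (typical of ResNet-50/101), the random weights, nearest-neighbour upsampling and all function names are assumptions, and the 3 × 3 × 256 convolutions are omitted because, with same padding, they would not change the feature-map shapes.

```python
import numpy as np

def conv1x1(x, out_depth, rng):
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    # Random weights: only the shapes matter in this sketch.
    w = rng.standard_normal((x.shape[-1], out_depth)) * 0.01
    return x @ w

def upsample2x(x):
    # Nearest-neighbour upsampling by (2, 2) (assumed interpolation mode).
    return x.repeat(2, axis=0).repeat(2, axis=1)

def maxpool_1x1_stride2(x):
    # Max pooling with a (1, 1) window and stride 2 keeps every second pixel.
    return x[::2, ::2]

def fpn_top_down(c_maps, depth=256, seed=0):
    # c_maps: dict {stage: array of shape (H, W, C)} for stages 2..5.
    rng = np.random.default_rng(seed)
    laterals = {k: conv1x1(v, depth, rng) for k, v in c_maps.items()}
    p = {5: laterals[5]}
    for level in (4, 3, 2):
        # Merge the lateral map with the upsampled coarser map.
        p[level] = laterals[level] + upsample2x(p[level + 1])
    p[6] = maxpool_1x1_stride2(p[5])
    return p

# Assumed 128 x 128 input with C2..C5 at strides 4, 8, 16, 32.
c_maps = {
    2: np.zeros((32, 32, 256)),
    3: np.zeros((16, 16, 512)),
    4: np.zeros((8, 8, 1024)),
    5: np.zeros((4, 4, 2048)),
}
p = fpn_top_down(c_maps)
for level in sorted(p):
    print(f"P{level}: {p[level].shape}")
```

Under these assumptions every P map has a depth of 256, and the spatial size halves from each level to the next, from 32 × 32 at P2 to 2 × 2 at P6.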