Abstract
Abstract
Object detection is among the most significant challenges in computer vision, and convolutional neural network based techniques have been used extensively for it. Scale variation, illumination variation and occlusion are the most common challenges in crowd counting. To address them, a MaskFormer EfficientNetB7 instance segmentation architecture is proposed that uses EfficientNetB7 as the backbone for feature extraction, enabling efficient and accurate counting of people in challenging scenarios. MaskFormer, a simple mask classification model, predicts a set of binary masks, each associated with a single global class label, while EfficientNetB7 applies a compound scaling algorithm that uniformly scales each dimension using a predetermined set of scaling coefficients. Experimental findings on the UCF-QNRF, ShanghaiTech (Part A and Part B) and Mall datasets demonstrate that the suggested strategy provides remarkable outcomes in contrast to existing crowd counting approaches in terms of Mean Absolute Error and Root Mean Squared Error. The proposed crowd-counting model therefore proves adaptable to different environments and scenarios, with strong generalizability to unseen data.
Keywords: Computer vision, Object detection, Convolutional neural network, Transformer, Crowd counting, Instance segmentation
Subject terms: Computer science, Information technology
Introduction
The task of crowd counting is to analyze an image of an open scene in order to estimate the number of pedestrians in it1–3. Over the past decade, both industry and academia have identified several challenges in crowd counting within computer vision, and numerous innovative crowd-counting techniques have been presented for complicated crowded scenarios4,5. Due to numerous large-scale stampede tragedies and the recent COVID-19 pandemic, crowd counting has been widely used in both social distancing monitoring6 and smart video surveillance7,8. Various models have been proposed in this field to address different challenges such as large-scale fluctuation of pedestrians9,10, pedestrian occlusion11, background information, pedestrian clothing textures and other factors, as shown in Fig. 112. Estimating crowd density accurately remains very difficult in low-light situations and under occlusion, as pedestrians become almost invisible, which makes it hard to count the number of people in RGB images.
Fig. 1.
Uneven distribution, scale variance and lighting fluctuation are shown in the above images; the red line’s top and bottom segments exhibit sharp variations in scale, illumination and distribution12.
Different applications of crowd counting in various fields include people estimation, healthcare, safety and disaster management. Crowd behavior is analyzed in surveillance videos by different crowd counting methods in public places such as mass gatherings and airports. This analysis is essential for efficiently handling public events including concerts, sporting events, and political rallies by enabling the early identification of congestion. During the pandemic, maintaining the number of people within a reasonable range through crowd counting techniques has become crucial for ensuring public health and safety. This necessity has significantly increased the interest and attention of researchers in this field to develop innovative methods to overcome different challenges as mentioned in Fig. 1.
There are several ways to count the people in an image; however, the majority of them still struggle with generalization. These techniques may be broadly categorized into two groups: Convolutional Neural Networks (CNN) and object detection with instance segmentation. The major challenges and gaps in crowd counting using CNN models include the inability to accurately count individuals in areas of varying population density, from sparse to extremely congested scenes13. Additionally, occlusion and overlapping problems remain challenging14 in real-world scenarios. Instance segmentation techniques address this limitation by providing pixel-wise masks for each object within an image rather than just detecting bounding boxes. These masks allow for more precise separation and identification of individual objects, which makes instance segmentation particularly useful for applications where accurate object delineation is crucial, such as crowd analysis or crowd counting. As per the survey conducted so far, no one has used EfficientNetB7 as the backbone in the MaskFormer architecture for accurately counting people in images. The major contributions of this paper are as follows:
-
(i)
This research work proposes a new model named MaskFormer EfficientNetB7 (MFENet), integrating the EfficientNetB7 model with the MaskFormer architecture.
-
(ii)
The proposed model is based on EfficientNetB7 as the backbone network in the MaskFormer framework for counting the objects in an image, inspired by the Inception structure (proposed by GoogLeNet)15.
-
(iii)
The proposed MFENet model uses an attention mechanism for the extraction of specific features from images to generate the per-pixel embeddings, and the DEtection TRansformer (DETR) is used as the transformer decoder for end-to-end object detection.
-
(iv)
The effectiveness of the proposed MFENet in comparison to other state-of-the-art approaches is validated by extensive validation experiments performed on three benchmark datasets.
The remainder of this paper is organized as follows: Related Work is discussed in section “Related work”, Materials and methods are discussed in section “Materials and methods”. Results analysis and discussion are provided in section “Results analysis and discussion” and the conclusion is presented in section “Conclusion”.
Related work
In computer vision, object detection tasks are often performed using convolutional neural networks (CNNs). Localizing and categorizing items in an image or video are the main steps in the object detection process. Due to their capacity to automatically identify relevant characteristics from the input data, CNNs have demonstrated significant success in this field. Many CNN-based architectures are specifically designed for object detection, such as You Only Look Once (YOLO)16, Single Shot Multibox Detector (SSD)17, Faster-RCNN18, Fast-RCNN19, Region-based CNNs (R-CNN)19 and other popular convolutional neural network models20,21. Convolutional neural networks have excelled in a variety of computer vision tasks including semantic segmentation, object identification and instance segmentation, demonstrating their high effectiveness in this area. Zhai et al.22 presented an Attentive Hierarchy ConvNet (AHNet) to address background interference and scale variation problems, with Feature Enhancement (FE) to identify head regions at different scales; Re-Calibrated Attention (RA) is applied in different layers to reduce the impact of background interference. Li et al.23 suggested the Multi-Scale Feature Fusion and Attention Mechanism CNN Network (MSFFA) to handle complex background problems and large-scale variations, with a global attention mechanism used to extract features from complex background scenes. Tyagi et al.24 used a single object class, i.e. person, to build a person detection system using YOLOv3 and the Label Tool for dataset annotation; it achieved significant performance in the detection of single-class objects. Khan et al.25 reported a method based on semantic scene segmentation using an improved convolutional neural network; the framework was successful in suppressing the background and enhancing the foreground.
Cholakkal et al.26 presented a supervised image-level learning-based instance segmentation technique that generated an object category density map displaying both the overall object count and the spatial distribution of object instances. It further minimized the image-level supervision by utilizing only a small object count (up to four), motivated by psychological research.
Object detection with instance segmentation: After such tremendous success, several instance segmentation models were reported with transfer learning models as a backbone for feature extraction. Liang et al.27 introduced TransCrowd to handle limited receptive fields in weakly supervised crowd counting; the transformer self-attention mechanism was used to extract semantic crowd information. Qian28 developed a multiscale transformer, the Crowd U-Transformer (CUT), that extracted and aggregated semantic and spatial features from multiple layers to address the local attentive receptive field problem, which does not model long-range dependencies. Moreover, a weakly supervised method known as CrowdFormer, i.e. a pyramid vision transformer for the extraction of multi-scale features with feature aggregation regression (FAR), was given by Savner et al.29; it integrates features from various stages of the transformer for count estimation. To solve the issue of uneven population distribution, the Crowd Counting Swin Transformer (CCST) was designed by Li et al.30, where a feature adaptive fusion regression (FAF) head handles large head size ranges by improving the relationship between the feature pyramid hierarchy and head size.
A major task in computer vision is image segmentation for pixel-level classification in various fields such as autonomous driving, crowd counting31, medical systems and agricultural analysis. Semantic segmentation and instance segmentation32 are the two sub-tasks of image segmentation. Semantic segmentation separates the background and foreground at the pixel level but is not able to distinguish between different object instances33. Long et al.20 developed a fully convolutional network for semantic segmentation by introducing a skip architecture that combined information from deep layers to produce accurate segmentations. Several architectures such as the U-shaped network (U-Net)34, the Pyramid Pooling Module (PSPNet)35 and DeepLab v3+36 have been presented to enhance the performance of crowd analysis. Semantic segmentation analyzes the image as a whole and records contextual information but is unable to distinguish between objects belonging to the same category, which is crucial in some scenarios. A higher-level task, instance segmentation, simultaneously detects and segments individual objects in an image; it combines semantic segmentation with object detection. In addition to pixel-level classification, it must also differentiate between several objects belonging to the same class. Mask R-CNN was the first segmentation technique to address this issue37. Beyond Mask R-CNN, several techniques using various network designs, such as BlendMask, SOLOv2 and Segmenting Objects by Locations (SOLO), have been introduced to improve the effectiveness of instance segmentation38. The segmentation accuracy of these models is often improved by using encoder-decoder architectures, multi-scale feature fusion and complex attention mechanisms to collect local and global context information.
Jiang et al.39 introduced a mask-aware network that predicts a foreground/background mask and incorporates the mask prediction into the overall density map to improve crowd counting performance on different datasets. Tan et al.40 introduced a simple and highly effective compound scaling method that balances network width, depth and resolution for better accuracy and efficiency. In this section, previous studies on crowd counting have been discussed; the next section, Materials and Methods, provides detailed information about the datasets and the proposed method to overcome the shortcomings in crowd counting identified by several researchers.
Materials and methods
In this paper, MaskFormer EfficientNetB7 (MFENet) is proposed for counting people efficiently in crowded areas and enhancing the accuracy of counting individuals in images from the Mall, ShanghaiTech and UCF-QNRF datasets. Dataset descriptions are given in section “Datasets description”. In section “Methods”, the particulars of the network and the layout of the EfficientNetB7 and MaskFormer models are illustrated. The proposed instance segmentation model is reviewed and analyzed in section “Proposed method: MaskFormer EfficientNetB7 (MFENet)”.
Datasets description
This section describes three challenging datasets that compare the proposed approach with other techniques. All these datasets are publicly available and can be easily downloaded from the Internet.
ShanghaiTech: The most widely used dataset in the field of crowd counting is the ShanghaiTech dataset41. The dataset is divided into two parts, Part A and Part B. Most of the images in Part A were taken in highly congested areas; the average number of people per image in this sub-dataset is over 500, with 300 images used for training and 182 for testing. The Part B images were taken on the streets of Shanghai, where the smallest and largest crowd counts are 9 and 578 respectively; it has 400 training images and 316 test images.
UCF-QNRF: Idrees et al.42 provided the UCF-QNRF dataset. It includes a wide range of crowded scene images, with a maximum count of 12,865 people in a single image and 1,251,642 annotations in total. UCF-QNRF provides 1201 training images and 334 test images.
Mall: The Mall dataset43 was gathered in a sparse indoor scenario. The first 800 frames of the 2000 total video frames are used for training, while the remaining frames are used for testing. Images have a resolution of 640 × 480. The dataset has minimum and maximum crowd counts of 13 and 53 respectively, and the total number of persons approaches 60,000. The Mall dataset also contains additional information such as the position on the map and the area of interest.
Methods
MaskFormer
Any existing per-pixel classification model may be easily transformed into a mask classification model using the MaskFormer method. MaskFormer calculates a set of pairs, each of which comprises a class prediction and a mask embedding vector, by using a transformer decoder44 according to the set prediction process provided in the DEtection TRansformer (DETR)45. The binary mask prediction is obtained by taking a dot product between the mask embedding vector and the per-pixel embeddings produced by a fully convolutional network. With no modifications to the model, losses or training process needed, the model completes both semantic and instance-level segmentation tasks46. Instance-level segmentation problems frequently utilize mask classification47. The MaskFormer model48 has three modules: (i) a pixel-level module that gathers the per-pixel embeddings necessary to produce binary mask predictions at the pixel level, (ii) a transformer module made up of a stack of transformer decoder layers that computes N per-segment embeddings, and (iii) a segmentation module that produces predictions from these embeddings.
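The three-module pipeline above can be traced at the level of tensor shapes. The following NumPy sketch is purely illustrative, not the authors' implementation; all dimensions (N = 100 queries, K = 80 classes, embedding size 256) are placeholder assumptions:

```python
import numpy as np

# Shape-level sketch of MaskFormer's three modules (illustrative only).
H, W, C_E, N, K = 32, 32, 256, 100, 80

# (i) pixel-level module: backbone + pixel decoder -> per-pixel embeddings
E_pixel = np.random.randn(C_E, H, W)

# (ii) transformer module: N per-segment embeddings
Q = np.random.randn(N, C_E)

# (iii) segmentation module: class logits (K classes + "no object")
# and mask embeddings per segment (Q stands in for the MLP output here)
class_logits = np.random.randn(N, K + 1)
E_mask = Q

# one binary mask per segment: sigmoid of the dot product between the
# mask embedding and every per-pixel embedding
masks = 1.0 / (1.0 + np.exp(-np.einsum('nc,chw->nhw', E_mask, E_pixel)))
assert masks.shape == (N, H, W)
```

The random tensors only fix the data flow; in the real model each of them is produced by a trained sub-network.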
EfficientNetB7
EfficientNetB7 is used for crowd-count estimation since it is a highly effective and strong deep-learning model. Crowd counting is the process of estimating the number of people or objects in a frame or video; it has several uses in surveillance, urban planning, crowd control and other areas. Using EfficientNetB7 for crowd counting involves several steps, such as data preparation, pre-processing, modifications to the model’s architecture to make it appropriate for crowd counting, fine-tuning and optimization. Once the model has been trained and optimized, it can be used for crowd counting on new, previously unseen images or video frames.
Mobile Inverted Bottleneck Convolution (MBConv)49, with squeeze-and-excitation optimization, is the fundamental building block of the EfficientNet architecture. The number of these MBConv blocks varies across the EfficientNet family. Depth, width, resolution and model size all grow from EfficientNetB0 to EfficientNetB7, while accuracy also rises. EfficientNetB7 is the top model and has performed better than conventional CNNs, achieving higher Top-1 accuracy on the ImageNet dataset40 as compared with EfficientNet B0–B6 and other convolutional neural network models. Figure 2 depicts the network architecture of EfficientNetB750. Based on the filter size, stride and number of channels, it is separated into seven sections.
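The compound scaling rule behind the B0–B7 family can be sketched in a few lines. The base coefficients α = 1.2, β = 1.1, γ = 1.15 are the values reported by Tan et al.40 for the B0 baseline; treating them as fixed and varying only the compound coefficient φ is the essence of the method:

```python
# Illustrative sketch of EfficientNet-style compound scaling (Tan & Le).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases

def compound_scale(phi: float):
    """Return (depth, width, resolution) multipliers for a given phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    return depth, width, resolution

# The grid-search constraint alpha * beta^2 * gamma^2 ~= 2 means that
# increasing phi by one roughly doubles the FLOPS of the network.
flops_factor = ALPHA * BETA**2 * GAMMA**2
```

Larger variants such as B7 correspond to a larger φ (with rounding of the resulting depths and widths in the actual implementation).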
Fig. 2.
Architecture of EfficientNetB750.
Proposed method: MaskFormer EfficientNetB7 (MFENet)
The proposed architecture involves four components: (i) Backbone (ii) Pixel Decoder (iii) Transformer Decoder and (iv) Segmentation module. Figure 3 represents the structure of the proposed model.
Fig. 3.
Proposed MFENet model architecture.
Backbone. In this work, EfficientNetB7 is used as the backbone to enhance the overall performance of the model in terms of efficiency and accuracy. It balances three parameters, depth, width and image resolution, through a single compound coefficient. The detection head remains compatible with MaskFormer.
$$\mathcal{F} = \mathcal{B}(x) \tag{1}$$

where x = input image and $\mathcal{F}$ = feature maps produced by the backbone $\mathcal{B}$.
Pixel decoder: An image with the dimensions H × W serves as the input for the pixel-level module. The backbone produces a low-resolution image feature map $\mathcal{F} \in \mathbb{R}^{C_{\mathcal{F}} \times \frac{H}{S} \times \frac{W}{S}}$, where $C_{\mathcal{F}}$ signifies the channel number and S represents the feature map’s stride (S varies depending on the particular backbone). Then, a pixel decoder gradually upsamples the features to produce per-pixel embeddings $\mathcal{E}_{pixel} \in \mathbb{R}^{C_{\mathcal{E}} \times H \times W}$, where $C_{\mathcal{E}}$ signifies the embedding dimension. The pixel-level module architecture is compatible with any per-pixel classification-based segmentation model, including the latest Transformer-based approaches51,52. Such a model is easily transformed to mask classification via MaskFormer. Many per-pixel classification techniques employ modules like a fully convolutional decoder to decode the low-resolution feature maps and build pixel-level segmentation masks by applying a sequence of convolutional filters to sharpen the predictions. To increase the overall quality and accuracy of the segmentation masks, attention techniques are used to extract specific regions or features while producing pixel-level predictions. All image characteristics are taken into consideration by the transformer module, which also gathers global information to produce class predictions. With this configuration, the per-pixel module no longer requires extensive context aggregation.
$$\mathcal{E}_{pixel} = \mathrm{PixelDecoder}(\mathcal{F}) \tag{2}$$

where $\mathcal{F}$ = feature map and $\mathcal{E}_{pixel}$ = per-pixel embeddings.
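A minimal stand-in for the pixel decoder of Eq. (2) can be written in NumPy. The stride, channel counts and the use of nearest-neighbour upsampling plus a 1 × 1 projection are illustrative assumptions replacing a real convolutional decoder:

```python
import numpy as np

# Sketch: upsample a stride-S backbone feature map (C_F, H/S, W/S) to
# per-pixel embeddings (C_E, H, W). A real decoder interleaves learned
# convolutions; nearest-neighbour repeat + a 1x1 projection stand in here.
S, C_F, C_E, H, W = 4, 64, 32, 16, 16
F = np.random.randn(C_F, H // S, W // S)
proj = np.random.randn(C_E, C_F) / np.sqrt(C_F)   # 1x1 conv as a matrix

up = F.repeat(S, axis=1).repeat(S, axis=2)        # (C_F, H, W)
E_pixel = np.einsum('ec,chw->ehw', proj, up)      # (C_E, H, W)
assert E_pixel.shape == (C_E, H, W)
```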
Transformer decoder: A transformer decoder creates N per-segment embeddings Q from the image features. The embeddings independently yield N class predictions together with N associated mask embeddings $\mathcal{E}_{mask} \in \mathbb{R}^{C_{\mathcal{E}} \times N}$. The model then takes a dot product between the per-pixel embeddings $\mathcal{E}_{pixel}$ and the mask embeddings $\mathcal{E}_{mask}$, followed by a sigmoid activation, to predict N potentially overlapping binary masks. The N binary masks and their class predictions are combined via matrix multiplication to produce the final prediction for the semantic or instance segmentation task.
The same transformer decoder as in DETR has been applied in this work45. All N query embeddings are initialized as zero vectors, a learnable positional encoding is assigned to each query, and the same loss is applied after each decoder layer. In this research work, it is observed that MaskFormer needs many decoder layers for instance-level segmentation in order to eliminate duplicate predictions from the output.
$$\hat{m}_i = \sigma\left(\mathcal{E}_{mask}[:, i]^{\mathrm{T}} \cdot \mathcal{E}_{pixel}\right), \quad i = 1, \dots, N \tag{3}$$

$$\mathrm{output}[h, w] = \arg\max_{c} \sum_{i=1}^{N} p_i(c)\, \hat{m}_i[h, w]$$

where $\sigma$ denotes the sigmoid activation, $\hat{m}_i \in [0, 1]^{H \times W}$ is the i-th binary mask prediction and $p_i(c)$ is the predicted probability that the i-th segment belongs to class c.
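The sigmoid mask prediction and the final combination of masks with class probabilities can be sketched as follows; all shapes and values are illustrative placeholders, not the trained model's outputs:

```python
import numpy as np

# Sketch: given N soft masks m_i and class probabilities p_i, the
# semantic map is argmax over classes of sum_i p_i(c) * m_i[h, w].
N, K, H, W = 5, 3, 8, 8
rng = np.random.default_rng(0)

logits = rng.normal(size=(N, K + 1))                   # K classes + no-object
p = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax rows
m = 1.0 / (1.0 + np.exp(-rng.normal(size=(N, H, W))))  # sigmoid masks

# drop the no-object column, then marginalize over the N segments
scores = np.einsum('nk,nhw->khw', p[:, :K], m)         # (K, H, W)
semantic_map = scores.argmax(axis=0)                   # (H, W) class ids
assert semantic_map.shape == (H, W)
```

For instance-level output the per-segment masks are kept separate instead of being marginalized into one map.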
Segmentation module: Similar to the box head in DETR, a multi-layer perceptron (MLP) with two hidden layers of 256 channels each predicts the mask embeddings $\mathcal{E}_{mask}$. Both the per-pixel embeddings $\mathcal{E}_{pixel}$ and the mask embeddings $\mathcal{E}_{mask}$ have 256 channels.
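A sketch of this MLP head, with two 256-channel hidden layers; the random weights are placeholders standing in for trained parameters:

```python
import numpy as np

# Two-hidden-layer MLP mapping per-segment embeddings Q (N, 256) to
# mask embeddings E_mask (N, 256); dimensions follow the text above.
def mlp(q, w1, w2, w3):
    h = np.maximum(q @ w1, 0)      # hidden layer 1 + ReLU
    h = np.maximum(h @ w2, 0)      # hidden layer 2 + ReLU
    return h @ w3                  # output: 256-channel mask embedding

N, C = 100, 256
rng = np.random.default_rng(1)
Q = rng.normal(size=(N, C))
E_mask = mlp(Q, *(rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(3)))
assert E_mask.shape == (N, C)
```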
Parameter settings: For each dataset, Detectron2 is applied with the standard training parameters. With EfficientNetB7 as the backbone, the AdamW optimizer is employed along with the poly learning rate schedule and the corresponding starting learning rate and weight decay values. In this study, EfficientNetB7 is used as a feature extraction model along with various pre-processing techniques such as data augmentation, random cropping, rotation and background subtraction.
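The poly learning rate schedule mentioned above can be expressed as a short function; the base learning rate and decay power below are placeholders, not the paper's exact settings:

```python
# Polynomial ("poly") learning-rate decay, as commonly paired with AdamW:
# the LR falls from base_lr to 0 over max_steps with exponent `power`.
def poly_lr(base_lr: float, step: int, max_steps: int, power: float = 0.9):
    return base_lr * (1 - step / max_steps) ** power

# Example: LR at the start, midpoint and end of a 1000-step run
lrs = [poly_lr(1e-4, s, 1000) for s in (0, 500, 1000)]
```

Each training step simply sets the optimizer's learning rate to `poly_lr(...)` for the current step.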
Results analysis and discussion
MaskFormer has been used as the baseline network for this work, with EfficientNetB7 as the backbone, for conducting experiments on the ShanghaiTech, UCF-QNRF and Mall datasets.
Comparison of EfficientNetB7 with DenseNet121, ResNet50, EfficientNetB0, B4, B5 and B6
This section presents the comparison of DenseNet121, ResNet50 and EfficientNetB0, B4, B5 and B6 with EfficientNetB7 based on Mean Absolute Error (MAE) and Mean Squared Error (MSE), as shown in Tables 1, 2 and 3 for the different datasets. EfficientNetB7 performed the best compared to the other models; therefore, it is used as the backbone in the MaskFormer architecture. Figures 4 and 5 show sample results for the Mall and UCF-QNRF datasets respectively. In both figures, it can be observed that the counts predicted by EfficientNetB7 are more accurate than those of the other models. Graphical representations of the Mean Absolute Error for ShanghaiTech Part A and ShanghaiTech Part B are shown in Figs. 6 and 7, where the employed models are placed on the x-axis and the mean absolute error on the y-axis.
Table 1.
Comparison of models on Mall dataset.
| Models | MAE | MSE |
|---|---|---|
| DenseNet121 | 8.93 | 115.3 |
| ResNet50 | 15.83 | 291.4 |
| EfficientNetB0 | 13.19 | 214.85 |
| EfficientNetB4 | 20.46 | 458.8 |
| EfficientNetB5 | 7.51 | 66.91 |
| EfficientNetB6 | 9.88 | 138.30 |
| EfficientNetB7 | 5.60 | 61.78 |
Significant values are in bold.
Table 2.
Comparison of models on the ShanghaiTech Part A and ShanghaiTech Part B datasets.
| Models | Part A MAE | Part A MSE | Part B MAE | Part B MSE |
|---|---|---|---|---|
| DenseNet121 | 340.5 | 262.4 | 274.7 | 196.6 |
| ResNet50 | 1.30 | 2.82 | 0.517 | 0.598 |
| EfficientNetB0 | 0.64 | 1.28 | 0.518 | 0.564 |
| EfficientNetB4 | 0.78 | 1.70 | 0.504 | 0.559 |
| EfficientNetB5 | 0.99 | 1.48 | 0.530 | 0.571 |
| EfficientNetB6 | 0.65 | 1.37 | 1.070 | 1.355 |
| EfficientNetB7 | 0.60 | 1.18 | 1.040 | 1.325 |
Significant values are in bold.
Table 3.
Comparison of models on UCF_QNRF dataset.
| Models | MAE | MSE |
|---|---|---|
| DenseNet121 | 1.54 | 2.79 |
| ResNet50 | 0.74 | 1.50 |
| EfficientNetB0 | 0.55 | 1.83 |
| EfficientNetB4 | 1.04 | 2.67 |
| EfficientNetB5 | 0.80 | 1.59 |
| EfficientNetB6 | 0.60 | 1.66 |
| EfficientNetB7 | 0.50 | 1.20 |
Significant values are in bold.
Fig. 4.
Results from Mall dataset (a) DenseNet121 (b) ResNet50 (c) EfficientNetB0 (d) EfficientNetB4 (e) EfficientNetB5 (f) EfficientNetB6 (g) EfficientNetB7.
Fig. 5.
Results from UCF-QNRF dataset (a) DenseNet121 (b) ResNet50 (c) EfficientNetB0 (d) EfficientNetB4 (e) EfficientNetB5 (f) EfficientNetB6 (g) EfficientNetB7.
Fig. 6.

Representation of Mean Absolute Error on ShanghaiTech Part A for different models.
Fig. 7.

Representation of Mean Absolute Error on ShanghaiTech Part B for different models.
Performance of the proposed MFENet model with other models
The proposed MFENet model is compared to nine cutting-edge techniques including the Switching Convolutional Neural Network (SwitchCNN)53, Multi-Column Convolutional Neural Network (MCNN)41, Congested Scene Recognition Network (CSRNet)54, Dilated Attention Deformable ConvNet (DADNet)55, Contextual Pyramid Convolutional Neural Network (CPCNN)11, Cross-Level Parallel Network (CLPNet)56, Fusion of Multi-Layer Features (FMLF)57, Context-Aware Crowd Counting (CAN)10 and Hierarchical Scale Recalibration Network (HSRNet)58, using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) as evaluation metrics. MAE is the average absolute difference between the actual and predicted counts, and RMSE is the square root of the average squared difference between them. CLPNet integrates information from several levels of context, and CAN uses global average pooling to extract information from the local context. To increase prediction accuracy, HSRNet integrates multi-scale features, producing both channel and spatial attention to guide the features at each level to concatenate at a certain scale.
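The two evaluation metrics can be computed directly from per-image counts; the numbers below are toy values, not results from the paper:

```python
import math

# MAE: mean absolute difference between actual and predicted counts.
def mae(actual, pred):
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

# RMSE: square root of the mean squared difference.
def rmse(actual, pred):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

actual = [120, 80, 200]   # hypothetical ground-truth counts per image
pred = [110, 90, 190]     # hypothetical predicted counts
assert mae(actual, pred) == 10.0
assert rmse(actual, pred) == 10.0
```

RMSE penalizes large per-image errors more heavily than MAE, which is why both are reported throughout the tables below.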
The comparison results on the ShanghaiTech dataset are presented in Table 4, which shows that the proposed MFENet outperforms the previously mentioned approaches in terms of Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). To be more precise, the proposed technique significantly outperforms CLPNet on the Part A dataset in both RMSE and MAE. Compared to FMLF, the MAE of MFENet on Part A and Part B is reduced to 24.18 and 6.5 respectively, showing that MFENet can forecast crowd counts with greater accuracy. To further illustrate how well the proposed MFENet performs, the predicted outcomes for Part A and Part B are displayed in Figs. 8 and 9 respectively, showing the ground truth crowd size and the distribution of the density map.
Table 4.
Results of a comparison using the ShanghaiTech dataset.
| Method | Part A MAE | Part A RMSE | Part B MAE | Part B RMSE |
|---|---|---|---|---|
| MCNN41 | 110.2 | 173.2 | 26.4 | 41.31 |
| SwitchCNN53 | 90.4 | 135.0 | 21.6 | 33.4 |
| CPCNN11 | 73.6 | 106.4 | 20.1 | 30.1 |
| CLPNet56 | 71.5 | 108.7 | 12.2 | 20.0 |
| FMLF57 | 69.8 | 114.7 | 10.2 | 14.9 |
| CSRNet54 | 68.2 | 115.0 | 10.6 | 16.0 |
| DADNet55 | 64.2 | 99.9 | 8.8 | 13.5 |
| CAN59 | 62.3 | 100.0 | 7.8 | 12.2 |
| HSRNet58 | 62.3 | 100.3 | 7.2 | 11.8 |
| MFENet | 24.18 | 98.1 | 6.5 | 10.4 |
The top performance is indicated in bold.
Fig. 8.
Data visualisation from ShanghaiTech Part A. The three columns represent the image, the ground truth map and the predicted density map; G denotes the ground truth and P the predicted density map.
Fig. 9.
Visualisation of the ShanghaiTech Part B dataset. The image, the ground truth map and the projected density map are each displayed in a distinct column, where G and P stand for the actual and predicted crowd sizes respectively.
Table 5 compares the outcomes of different techniques on the UCF-QNRF dataset. The proposed MFENet approach achieves the lowest MAE and RMSE compared to the other eight exemplary approaches listed in Table 5. Multi-Level Bottom-Top and Top-Bottom Feature Fusion (MBTTBF)60 combines the relevant characteristics of cross-layers in the top-down and bottom-up directions, and the Deep Structured Scale Integration Network (DSSINet)10 uses a sequence of image pyramids to learn local correlations, enabling structural consistency between various regions. The proposed approach is also compared to the Cross-Level Parallel Network (CLPNet), the composition-loss convolutional neural network (CL-CNN) and the Context-Aware Network (CAN). Additionally, the proposed method reduces the RMSE values by 9.0 and 15.0 as compared to DSSINet and MBTTBF respectively. Figure 10 illustrates further visual outcomes; it is clear that the proposed approach can yield density maps and counts that are remarkably close to the true values in such crowded environments.
Table 5.
Results of a comparison using the UCF-QNRF dataset.
| Method | MAE | RMSE |
|---|---|---|
| MCNN41 | 277.0 | 426.0 |
| SwitchCNN53 | 228.0 | 445.0 |
| CL-CNN42 | 132.0 | 191.0 |
| CLPNet56 | 120.1 | 204.4 |
| DADNet55 | 113.2 | 189.4 |
| CAN59 | 107.0 | 183.0 |
| DSSINet10 | 99.1 | 159.2 |
| MBTTBF | 97.5 | 165.2 |
| MFENet | 90.1 | 150.2 |
The best performance is highlighted in bold.
Fig. 10.
The three columns show a visual representation of the UCF-QNRF dataset. An image, a ground truth map and a predicted density map are represented in the three columns. The abbreviations G and P stand for actual crowd size and expected crowd size respectively.
When a scene is extremely crowded, it can be difficult to estimate the density map. The resilience of the proposed MFENet in densely populated areas and low-light conditions can be verified from the results reported in Table 5.
The proposed MFENet model obtains remarkable prediction performance on the Mall dataset, as shown by the findings listed in Table 6. Considering MAE and RMSE, the outcomes produced by the proposed technique are slightly superior to those of the deep recurrent spatial-aware network (DRSAN) and the hierarchical scale recalibration network (HSRNet). Notably, DRSAN and HSRNet take associated side information into account, such as the region of interest, whereas the proposed MFENet does not. As a result, compared to other methodologies such as Multi-output ridge regression (MORR) and Cascaded Multi-task Learning (CMTL), the experimental setting is considerably closer to real-life situations. Predicted density maps and actual density maps generated on the Mall dataset are shown in Fig. 11. The outcomes further demonstrate that the proposed methodology provides highly precise crowd counts in these sparse environments.
Table 6.
Results of a comparison using the Mall dataset.
| Method | MAE | RMSE |
|---|---|---|
| MORR43 | 3.15 | 3.96 |
| CMTL61 | 3.02 | 3.81 |
| MCNN41 | 2.24 | 2.92 |
| DRSAN62 | 1.72 | 2.10 |
| HSRNet58 | 1.80 | 2.28 |
| MFENet | 1.50 | 2.08 |
The top performance is indicated in bold.
Fig. 11.
Visualization of the Mall dataset. The image, the ground truth map and the anticipated density map are represented by the three columns, where G and P stand for the ground truth and projected crowd numbers respectively.
Conclusion
In this work, MFENet has been presented as an instance segmentation model that integrates the idea of compound model scaling into MaskFormer for more effective extraction of features at widely disparate scales. This approach to counting people has been highly successful, not only improving the overall indicators but also resolving the overlapping issue through the end-to-end instance segmentation framework. Extensive experiments on three public benchmark datasets, ShanghaiTech, Mall and UCF-QNRF, have produced state-of-the-art findings, demonstrating the feasibility and applicability of the suggested approach for crowd counting. In future, researchers can focus on designing novel architectures that combine semantic and instance segmentation techniques to capture crowd density and diversity in different scenarios.
Author contributions
S.G conceptualized the research idea, designed and implementation of methodology and D.K analyzed the results and contributed to the discussion and interpretation of findings. All authors participated in writing, reviewing and editing the manuscript for final submission.
Data availability
The datasets and images showing humans that are presented and analyzed in this study are available upon request from the corresponding author. The source code is available in the following GitHub repository: https://github.com/silkygoel/crowd-counting
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Goel, S., Koundal, D. & Nijawan, R. Learning models in crowd analysis: A review. Arch. Comput. Methods Eng. 32, 943–961 (2024).
- 2. Fan, Z. et al. A survey of crowd counting and density estimation based on convolutional neural network. Neurocomputing 472, 224–251 (2022).
- 3. Kang, D., Ma, Z. & Chan, A. B. Beyond counting: Comparisons of density maps for crowd analysis tasks-counting, detection, and tracking. IEEE Trans. Circuits Syst. Video Technol. 29(5), 1408–1422 (2018).
- 4. Wang, E. et al. An efficient prediction-based user recruitment for mobile crowdsensing. IEEE Trans. Mob. Comput. 17(1), 16–28 (2018).
- 5. Xia, Y., He, Y., Peng, S., Yang, Q. & Yin, B. CFFNet: Coordinated feature fusion network for crowd counting. Image Vis. Comput. 112, 104242 (2021).
- 6. Ghodgaonkar, I., Chakraborty, S., Banna, V., Allcroft, S., Metwaly, M., Bordwell, F., Kimura, K., Zhao, X., Goel, A., Tung, C. et al. Analyzing worldwide social distancing through large-scale computer vision. arXiv preprint arXiv:2008.12363 (2020).
- 7. Saleh, S. A. M., Suandi, S. A. & Ibrahim, H. Recent survey on crowd density estimation and counting for visual surveillance. Eng. Appl. Artif. Intell. 41, 103–114 (2015).
- 8. Xiong, F., Shi, X. & Yeung, D.-Y. Spatiotemporal modeling for crowd counting in videos. In Proceedings of the IEEE International Conference on Computer Vision, 5151–5159 (2017).
- 9. Bai, S., He, Z., Qiao, Y., Hu, H., Wu, W. & Yan, J. Adaptive dilated network with self-correction supervision for counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4594–4603 (2020).
- 10. Liu, L., Qiu, Z., Li, G., Liu, S., Ouyang, W. & Lin, L. Crowd counting with deep structured scale integration network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1774–1783 (2019).
- 11. Sindagi, V. A. & Patel, V. M. Generating high-quality crowd density maps using contextual pyramid CNNs. In Proceedings of the IEEE International Conference on Computer Vision, 1861–1870 (2017).
- 12. Wang, T. et al. Context attention fusion network for crowd counting. Knowl.-Based Syst. 271, 110541 (2023).
- 13. Zhao, Z., Ma, P., Jia, M., Wang, X. & Hei, X. A dilated convolutional neural network for cross-layers of contextual information for congested crowd counting. Sensors 24(6), 1816 (2024).
- 14. Wang, B. et al. A novel embedded cross framework for high-resolution salient object detection. Appl. Intell. 55(4), 277 (2025).
- 15. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. & Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9 (2015).
- 16. Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 779–788 (2016).
- 17. Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A. & Berg, A. C. DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659 (2017).
- 18. Ren, S., He, K., Girshick, R. & Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Vol. 28 (2015).
- 19. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580–587 (2014).
- 20. Long, J., Shelhamer, E. & Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440 (2015).
- 21. Cheng, B., Misra, I., Schwing, A. G., Kirillov, A. & Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1290–1299 (2022).
- 22. Zhai, W. et al. An attentive hierarchy ConvNet for crowd counting in smart city. Clust. Comput. 26(2), 1099–1111 (2023).
- 23. Li, Z., Lu, S., Dong, Y. & Guo, J. MSFFA: A multi-scale feature fusion and attention mechanism network for crowd counting. Vis. Comput. 39(3), 1045–1056 (2023).
- 24. Tyagi, B., Nigam, S. & Singh, R. Person detection using YOLOv3. In Soft Computing: Theories and Applications: Proceedings of SoCTA 2022, 903–912 (Springer, 2023).
- 25. Khan, K. et al. Crowd counting using end-to-end semantic image segmentation. Electronics 10(11), 1293 (2021).
- 26. Cholakkal, H., Sun, G., Khan, F. S. & Shao, L. Object counting and instance segmentation with image-level supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12397–12405 (2019).
- 27. Liang, D., Chen, X., Xu, W., Zhou, Y. & Bai, X. TransCrowd: Weakly-supervised crowd counting with transformers. Sci. China Inf. Sci. 65(6), 160104 (2022).
- 28. Qian, Y., Zhang, L., Hong, X., Donovan, C. R., Arandjelovic, O., Fife, U. & Harbin, P. Segmentation assisted U-shaped multi-scale transformer for crowd counting (2022).
- 29. Savner, S. S. & Kanhangad, V. CrowdFormer: Weakly-supervised crowd counting with improved generalizability. J. Vis. Commun. Image Represent. 94, 103853 (2023).
- 30. Li, B., Zhang, Y., Xu, H. & Yin, B. CCST: Crowd counting with swin transformer. Vis. Comput. 30, 1–12 (2022).
- 31. Jiang, H. et al. Pa-count: Passenger counting in vehicles using Wi-Fi signals. IEEE Trans. Mob. Comput. 23, 2684–2697 (2023).
- 32. Gu, W., Bai, S. & Kong, L. A review on 2D instance segmentation based on deep neural networks. Image Vis. Comput. 120, 104401 (2022).
- 33. Li, X. & Sun, Y. Application of RBF neural network optimal segmentation algorithm in credit rating. Neural Comput. Appl. 33(14), 8227–8235 (2021).
- 34. Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18, 234–241 (Springer, 2015).
- 35. Zhao, H., Shi, J., Qi, X., Wang, X. & Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2881–2890 (2017).
- 36. Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F. & Adam, H. Encoder–decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 801–818 (2018).
- 37. He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2961–2969 (2017).
- 38. Zhao, Y., Chen, S., Liu, S., Hu, Z. & Xia, J. Hierarchical equalization loss for long-tailed instance segmentation. IEEE Trans. Multimed. 26, 6943–6955 (2024).
- 39. Jiang, S., Lu, X., Lei, Y. & Liu, L. Mask-aware networks for crowd counting. IEEE Trans. Circuits Syst. Video Technol. 30(9), 3119–3129 (2019).
- 40. Tan, M. & Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 6105–6114 (PMLR, 2019).
- 41. Zhang, Y., Zhou, D., Chen, S., Gao, S. & Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 589–597 (2016).
- 42. Idrees, H., Tayyab, M., Athrey, K., Zhang, D., Al-Maadeed, S., Rajpoot, N. & Shah, M. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), 532–546 (2018).
- 43. Chen, K., Loy, C. C., Gong, S. & Xiang, T. Feature mining for localised crowd counting. In BMVC, Vol. 1, 3 (2012).
- 44. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30 (2017).
- 45. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A. & Zagoruyko, S. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, 213–229 (Springer, 2020).
- 46. Zhang, H., Liu, H. & Kim, C. Semantic and instance segmentation in coastal urban spatial perception: A multi-task learning framework with an attention mechanism. Sustainability 16(2), 833 (2024).
- 47. Hariharan, B., Arbeláez, P., Girshick, R. & Malik, J. Simultaneous detection and segmentation. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part VII 13, 297–312 (Springer, 2014).
- 48. Cheng, B., Schwing, A. & Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 34, 17864–17875 (2021).
- 49. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520 (2018).
- 50. Baheti, B., Innani, S., Gajre, S. & Talbar, S. Eff-UNet: A novel architecture for semantic segmentation in unstructured environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 358–359 (2020).
- 51. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. & Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022 (2021).
- 52. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P. H. et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6881–6890 (2021).
- 53. Babu Sam, D., Surya, S. & Venkatesh Babu, R. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5744–5752 (2017).
- 54. Li, Y., Zhang, X. & Chen, D. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1091–1100 (2018).
- 55. Guo, D., Li, K., Zha, Z.-J. & Wang, M. DADNet: Dilated-attention-deformable convnet for crowd counting. In Proceedings of the 27th ACM International Conference on Multimedia, 1823–1832 (2019).
- 56. Li, J., Xue, Y., Wang, W. & Ouyang, G. Cross-level parallel network for crowd counting. IEEE Trans. Ind. Inf. 16(1), 566–576 (2019).
- 57. Ding, X. et al. Crowd density estimation using fusion of multi-layer features. IEEE Trans. Intell. Transp. Syst. 22(8), 4776–4787 (2020).
- 58. Zou, Z., Liu, Y., Xu, S., Wei, W., Wen, S. & Zhou, P. Crowd counting via hierarchical scale recalibration network. arXiv preprint arXiv:2003.03545 (2020).
- 59. Liu, W., Salzmann, M. & Fua, P. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5099–5108 (2019).
- 60. Sindagi, V. A. & Patel, V. M. Multi-level bottom-top and top-bottom feature fusion for crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1002–1012 (2019).
- 61. Sindagi, V. A. & Patel, V. M. CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 1–6 (IEEE, 2017).
- 62. Liu, L., Wang, H., Li, G., Ouyang, W. & Lin, L. Crowd counting using deep recurrent spatial-aware network. arXiv preprint arXiv:1807.00601 (2018).