MethodsX. 2024 Jun 25;13:102820. doi: 10.1016/j.mex.2024.102820

Improved STNNet, A benchmark for detection, tracking, and counting crowds using Drones

Mohd Nazeer, Kanhaiya Sharma, S. Sathappan, Pulipati Srilatha, Arshad Ahmad Khan Mohammed
PMCID: PMC11278589  PMID: 39071994

Abstract

In computer vision, multi-object tracking in crowded scenes poses a fundamental challenge with broad applications ranging from surveillance systems to autonomous vehicles. Traditional tracking methods struggle to associate noisy object detections and to maintain consistent labels across frames, particularly in scenarios such as video surveillance for crowd control and public safety. This paper introduces the Improved Space-Time Neighbor-Aware Network (STNNet), an advanced framework for online Multi-Object Tracking (MOT) designed to address these challenges. Expanding upon the foundational STNNet architecture, our enhanced model incorporates deep reinforcement learning techniques to refine decision-making. By framing online MOT as a Markov Decision Process (MDP), Improved STNNet learns a policy for data association, handling complexities such as object birth/death and appearance/disappearance as state transitions within the MDP. Through extensive experimentation on benchmark datasets, including the MOT Challenge, the proposed Improved STNNet demonstrates superior performance, surpassing existing methods in demanding, crowded scenarios. This study shows the effectiveness of our approach and lays the groundwork for advancing real-time video analysis applications, particularly in dynamic, crowded environments. Additionally, we use the dataset provided by STNNet for density map estimation, forming the basis for our research.

  • Develop an advanced framework for online Multi-Object Tracking (MOT) to address crowded scene challenges, particularly improving object association and label consistency across frames.

  • Explore integrating deep reinforcement learning techniques into the MOT framework, framing the problem as an MDP to refine decision-making and handle complexities such as object birth or death and appearance or disappearance transitions.

Keywords: Surveillance, Crowd counting, Tracking and localization, Neural network, Density estimation

Method name: Improved STNNet



Specification table

Subject area: Engineering
More specific subject area: Computer Vision
Name of your method: Space-Time Neighbor-Aware Network
Name and reference of original method: P. Zhu et al., "Detection and Tracking Meet Drones Challenge," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7380–7399, Nov. 2022, doi: 10.1109/TPAMI.2021.3119563.
Resource availability: Data will be made available on request.

Method details

In recent years, UAVs with cameras have seen increased use in crowd surveillance, particularly in urban settings and public safety. Our Improved STNNet tackles the challenges of analyzing drone-captured crowd data, achieving superior accuracy on the extensive DroneCrowd dataset [1]. While various datasets support crowd analysis, many lack the diversity and scale needed for drone-specific applications [[1], [2], [3], [4]]. Improved STNNet integrates sophisticated CNN architectures with temporal consistency constraints, enabling precise crowd density estimation and tracking in dynamic environments and advancing drone-based crowd analysis [1,[5], [6], [7]]. Traditional crowd tracking relied on energy minimization but struggled with the complexities of drone-captured data; Enhanced STNNet integrates a context-aware loss function, enabling precise tracking in dynamic crowds, a significant advance in drone-based crowd analysis. The step-wise procedure is detailed below.

Data preprocessing

Careful data preparation underpins Improved STNNet's performance. Videos from the DroneCrowd dataset are stabilized and adjusted for varying lighting conditions. Each person's position is tracked over time, providing essential temporal information, and the dataset is split into training and testing partitions to ensure the system generalizes across scenarios. Improved STNNet uses a multi-branch Convolutional Neural Network (CNN) architecture to extract features. This design captures the multi-scale spatial features crucial for crowd density estimation and object localization: the feature extraction subnetwork processes stabilized frames, extracting details across scales for precise object detection. Improved STNNet also excels at estimating crowd density maps, adopting dense prediction techniques inspired by recent advances in fully convolutional networks (FCNs); the model's deep layers discern complex patterns in crowded scenes, ensuring precise density map estimation. A minimal sketch of the multi-scale feature extraction step is given below.
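To make this step concrete, the following minimal PyTorch sketch shows a multi-branch network that produces features at three spatial scales from a stabilized frame. It is an illustration under assumed layer widths and strides, not the authors' implementation.

```python
# Hedged sketch (assumed architecture): a multi-branch CNN extracting features
# at three scales, mirroring the multi-scale feature-extraction subnetwork
# described above. Channel counts, kernel sizes, and strides are illustrative.
import torch
import torch.nn as nn

class MultiScaleFeatureExtractor(nn.Module):
    def __init__(self, in_channels=3, base_channels=32):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, base_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base_channels, base_channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Three branches with increasing receptive field via strided convolutions.
        self.branch1 = nn.Conv2d(base_channels, base_channels, 3, stride=1, padding=1)  # f1: fine
        self.branch2 = nn.Conv2d(base_channels, base_channels, 3, stride=2, padding=1)  # f2: medium
        self.branch3 = nn.Conv2d(base_channels, base_channels, 3, stride=4, padding=1)  # f3: coarse

    def forward(self, frame):
        x = self.stem(frame)
        return self.branch1(x), self.branch2(x), self.branch3(x)

frames = torch.randn(1, 3, 256, 256)            # one stabilized, normalized frame
f1, f2, f3 = MultiScaleFeatureExtractor()(frames)
print(f1.shape, f2.shape, f3.shape)             # three spatial scales
```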

$\hat{D} = \mathrm{STNNet}_{\text{Density}}(I)$  (1)

$\hat{D}$ represents the estimated density map and $I$ represents the input image.
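As an illustration of Eq. (1), the sketch below predicts a density map from an input image with a small fully convolutional head and takes the crowd count as the sum of the map; the head architecture is an assumption for illustration, not the published one.

```python
# Hedged sketch of Eq. (1): D_hat = Density(I); count = sum(D_hat).
import torch
import torch.nn as nn

class DensityHead(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),              # one channel: density per pixel
            nn.ReLU(inplace=True),            # densities are non-negative
        )

    def forward(self, image):
        return self.net(image)                # D_hat, same spatial size as I

image = torch.randn(1, 3, 256, 256)           # input image I
density_map = DensityHead()(image)            # estimated density map D_hat
estimated_count = density_map.sum().item()    # crowd count = sum over D_hat
```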

Localization subnetwork

To achieve precise object localization, Improved STNNet integrates a specialized localization subnetwork. This subnetwork refines the object positions derived from density maps. Combining classification and regression branches, the model predicts individuals' precise coordinates, compensating for any inaccuracies in density map estimation. Multi-scale feature maps (i.e., f1, f2, f3) are fused with channel and spatial attention [8] for each branch, as depicted in Fig. 1. The multi-scale feature maps are then resized to a common resolution, and the fused classification and regression maps are produced as output. The association subnet predicts motion offsets for accurate object tracking, updating positions across frames in complex scenarios, as shown in Fig. 2. A hedged sketch of this fusion step follows Eq. (2).

$(\hat{x}_i, \hat{y}_i) = \mathrm{STNNet}_{\text{Localization}}(I)$  (2)
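The sketch below illustrates the fusion described above under simplifying assumptions: f1–f3 are resized to a common resolution, modulated by lightweight channel and spatial attention (a stand-in for CBAM [8]), and passed to classification and regression branches. Channel counts and the attention design are assumptions, not the published architecture.

```python
# Hedged sketch of the localization subnet around Eq. (2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusionLocalizer(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        fused_channels = 3 * channels
        self.channel_gate = nn.Sequential(    # channel attention (squeeze-excite style)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused_channels, fused_channels, 1), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(    # spatial attention
            nn.Conv2d(fused_channels, 1, 7, padding=3), nn.Sigmoid(),
        )
        self.cls_branch = nn.Conv2d(fused_channels, 1, 1)  # per-pixel head/no-head score
        self.reg_branch = nn.Conv2d(fused_channels, 2, 1)  # per-pixel coordinate refinement

    def forward(self, f1, f2, f3):
        size = f1.shape[-2:]
        fused = torch.cat([f1,
                           F.interpolate(f2, size=size, mode="bilinear", align_corners=False),
                           F.interpolate(f3, size=size, mode="bilinear", align_corners=False)], dim=1)
        fused = fused * self.channel_gate(fused)   # re-weight channels
        fused = fused * self.spatial_gate(fused)   # re-weight spatial locations
        return self.cls_branch(fused), self.reg_branch(fused)

f1 = torch.randn(1, 32, 128, 128)   # example multi-scale features
f2 = torch.randn(1, 32, 64, 64)
f3 = torch.randn(1, 32, 32, 32)
cls_map, reg_map = AttentionFusionLocalizer(channels=32)(f1, f2, f3)
```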

Fig. 1: (a) The localization subnet. $(\hat{x}_i, \hat{y}_i)$ represent the estimated coordinates of the i-th target and $I$ represents the input image.

Fig. 2: (a) The association subnet using [9]; (b) the neighboring context loss. The dashed modules in (a) are used only in the training phase. For clarity, only the terms from time t − 1 to time t in the neighboring context loss [9] are shown. $(\hat{x}_i, \hat{y}_i)$ represent the estimated coordinates of the i-th target and $I$ represents the input image.

Temporal context integration

Understanding the temporal relationships between consecutive frames is crucial in crowd analysis. Improved STNNet introduces a novel temporal context integration module that captures the motion patterns of individuals over time. The model infers the temporal context by analyzing the trajectory information and inter-frame displacements, enabling robust object tracking across frames.

$(\Delta x_i, \Delta y_i) = \mathrm{STNNet}_{\text{Association}}(I_t, I_{t+1})$  (3)

$(\Delta x_i, \Delta y_i)$ represent the motion offsets of the i-th target between frames t and t + 1, and $I_t$, $I_{t+1}$ represent the input images at time steps t and t + 1.
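One plausible way to realize Eq. (3) is sketched below: the two frames are concatenated and a small convolutional head regresses a per-pixel (Δx, Δy) offset map, from which offsets are read out at each tracked head location. The layer choices and read-out are illustrative assumptions, not the published association subnet.

```python
# Hedged sketch of the association step (Eq. (3)).
import torch
import torch.nn as nn

class AssociationHead(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 2, 1),                           # per-pixel (dx, dy)
        )

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

frame_t = torch.randn(1, 3, 256, 256)
frame_t1 = torch.randn(1, 3, 256, 256)
offset_map = AssociationHead()(frame_t, frame_t1)          # shape (1, 2, 256, 256)
x, y = 120, 80                                             # a tracked head at frame t
dx, dy = offset_map[0, :, y, x].tolist()                   # predicted motion for that head
propagated_position = (x + dx, y + dy)                     # position estimate at frame t+1
```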

Neighboring context-aware loss

A key innovation in Improved STNNet is introducing the neighboring context-aware loss function. Traditional loss functions often overlook the spatial relationships between neighboring objects in consecutive frames. This loss function penalizes large displacements of relative positions among adjacent objects in the temporal domain, guiding the network to generate precise motion offsets. Improved STNNet achieves superior performance in crowded and dynamic scenarios by preserving the spatial coherence of object trajectories.
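A hedged sketch of this idea follows: for each target, the relative displacement vectors to its k nearest neighbours should change little between consecutive frames, so the loss penalizes changes in those vectors. The exact formulation in the network may differ; the choice of k and the L1 penalty are assumptions.

```python
# Hedged sketch of a neighboring context-aware loss between frames t-1 and t.
import torch

def neighboring_context_loss(pos_prev, pos_curr, k=3):
    """pos_prev, pos_curr: (N, 2) positions of the same N targets in consecutive frames."""
    dists = torch.cdist(pos_prev, pos_prev)                   # (N, N) pairwise distances at t-1
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]     # k nearest neighbours (skip self)
    rel_prev = pos_prev[knn] - pos_prev.unsqueeze(1)          # (N, k, 2) relative vectors at t-1
    rel_curr = pos_curr[knn] - pos_curr.unsqueeze(1)          # same neighbour pairs at t
    return (rel_curr - rel_prev).abs().mean()                 # penalize changed relative layout

pos_prev = torch.rand(10, 2) * 256                            # head positions at frame t-1
pos_curr = pos_prev + torch.randn(10, 2)                      # slightly displaced at frame t
print(neighboring_context_loss(pos_prev, pos_curr).item())
```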

End-to-End training and optimization

Trained end-to-end, Improved STNNet enables the entire architecture to adapt and optimize collectively. It utilizes a multi-task loss function, integrating density map estimation, object localization, and temporal context integration. The Adam optimizer ensures efficient convergence during training, enabling the network to learn intricate spatial and temporal patterns from the data.
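The sketch below shows the training setup at a high level: a weighted multi-task loss combining density, localization, and association terms, minimized with Adam. The loss weights, the placeholder sub-losses, and the stand-in model are assumptions for illustration only.

```python
# Hedged sketch of multi-task training with Adam.
import torch
import torch.nn.functional as F

def total_loss(density_loss, loc_loss, assoc_loss, w_density=1.0, w_loc=1.0, w_assoc=0.1):
    return w_density * density_loss + w_loc * loc_loss + w_assoc * assoc_loss

model = torch.nn.Conv2d(3, 1, 3, padding=1)                 # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(10):                                      # toy training loop
    frames = torch.randn(2, 3, 128, 128)                    # batch of stabilized frames
    target_density = torch.rand(2, 1, 128, 128)             # ground-truth density maps
    density_loss = F.mse_loss(model(frames), target_density)
    loc_loss = torch.tensor(0.0)                            # placeholders for the other heads
    assoc_loss = torch.tensor(0.0)
    loss = total_loss(density_loss, loc_loss, assoc_loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```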

Post-Processing and object tracking

After the network's predictions, post-processing techniques refine the results. Object tracking algorithms establish and maintain trajectories for accurate individual tracking, and two association methods, min-cost flow and social-LSTM [10], extend object trajectories. To assess Improved STNNet's efficacy in crowd tracking, we compare it with MCNN [2], CSRNet [11], CAN [12], DM-Count [13], STNNet [14], AMDCN [15], C-MTL [16], and variations of STNNet. Improved STNNet's strong performance in drone-based crowd analysis stems from accurate prediction, meticulous post-processing, and reliable object tracking. A simplified association sketch is given below.
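As a simplified illustration of the association step (the actual pipeline uses min-cost flow and social-LSTM [10]), the sketch below matches existing track positions to new detections with Hungarian assignment on Euclidean distances; the gating threshold is an assumed parameter, and unmatched entries correspond to track deaths and births.

```python
# Simplified stand-in for min-cost-flow association: Hungarian assignment with gating.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_positions, detections, max_dist=20.0):
    """Match existing track positions (M, 2) to new detections (N, 2)."""
    cost = np.linalg.norm(track_positions[:, None, :] - detections[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)                    # min-cost one-to-one matching
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    unmatched_tracks = set(range(len(track_positions))) - {r for r, _ in matches}
    unmatched_dets = set(range(len(detections))) - {c for _, c in matches}
    return matches, unmatched_tracks, unmatched_dets            # deaths / births come from these

tracks = np.array([[100.0, 50.0], [200.0, 120.0]])              # current track positions
dets = np.array([[103.0, 52.0], [400.0, 300.0]])                # detections in the new frame
print(associate(tracks, dets))
```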

Improvements in method

The enhancements in Improved STNNet encompass advances in neural network architectures, feature extraction techniques, and algorithms for crowd density estimation, flow analysis, and object association, as shown in Fig. 3. These improvements address the complexities of crowded scenes and dynamic crowd movements, increasing accuracy, reliability, and context-awareness in predictions and tracking within drone-captured videos. Advanced neural network architectures and feature extraction techniques strengthen the model's ability to capture intricate crowd patterns, improving object location and density predictions. Enhanced crowd density estimation methods differentiate closely positioned individuals in densely packed crowds, raising accuracy in high-density scenarios. Improved crowd flow analysis focuses on understanding movement patterns to adjust for crowd dynamics, improving predictions and tracking, especially in congested areas. Contextual information supports informed decisions during object association, ensuring accurate trajectory predictions and coherent assessment of crowd movement. Enhanced temporal consistency maintains smooth object trajectories aligned with the expected crowd flow, improving overall tracking quality.

Fig. 3: Methodology.

Method validation

Dataset description

Improved STNNet's performance is rigorously evaluated on the DroneCrowd dataset, a comprehensive collection of drone-captured videos featuring dense crowds in various scenarios. The dataset contains 112 video clips comprising 33,600 frames, annotated with over 4.8 million head annotations. These annotations provide rich information for crowd density estimation, object localization, and tracking tasks.

Evaluation metrics

The evaluation of Improved STNNet is conducted using established metrics for crowd analysis, including Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) for crowd density estimation. For object localization, metrics such as Intersection over Union (IoU) and Average Precision (AP) are employed. Additionally, tracking accuracy is assessed using metrics like Multiple Object Tracking Accuracy (MOTA) and Identity Switching Rate (IDSW).
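For reference, the counting metrics can be computed as in the short sketch below; the example counts are illustrative, and in practice the predicted counts come from summing the estimated density maps.

```python
# MAE and RMSE between predicted and ground-truth crowd counts.
import numpy as np

def mae(pred_counts, gt_counts):
    return float(np.mean(np.abs(np.asarray(pred_counts) - np.asarray(gt_counts))))

def rmse(pred_counts, gt_counts):
    return float(np.sqrt(np.mean((np.asarray(pred_counts) - np.asarray(gt_counts)) ** 2)))

predicted = [152, 98, 301]        # illustrative per-frame counts from density maps
ground_truth = [160, 95, 290]
print(mae(predicted, ground_truth), rmse(predicted, ground_truth))
```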

Results and analysis

The performance of Improved STNNet was rigorously evaluated against various state-of-the-art crowd analysis methods, including traditional density estimation techniques, recent deep learning-based models, and object tracking algorithms. This comparison spanned diverse scenarios and crowd densities to comprehensively assess the model's capabilities. The improvements of Improved STNNet over STNNet are evident from the estimation errors of the density maps on DroneCrowd, shown in Table 1. Across all measured metrics, Improved STNNet exhibited lower mean absolute error (MAE) and mean squared error (MSE) values than STNNet [14].

Table 1.

Estimation errors of density maps on DroneCrowd (each cell: MAE / MSE).

| Method | Speed (FPS) | Overall | Large | Small | Cloudy | Sunny | Night | Crowded | Sparse |
|---|---|---|---|---|---|---|---|---|---|
| MCNN | 28.98 | 34.7 / 42.5 | 36.8 / 44.1 | 31.7 / 40.1 | 21.0 / 27.5 | 39.0 / 43.9 | 67.2 / 68.7 | 29.5 / 35.3 | 37.7 / 46.2 |
| C-MTL | 2.31 | 56.7 / 65.9 | 53.5 / 63.2 | 61.5 / 69.7 | 59.5 / 66.9 | 56.6 / 67.8 | 48.2 / 58.3 | 81.6 / 88.7 | 42.2 / 47.9 |
| MSCNN | 1.76 | 58.0 / 75.2 | 58.4 / 77.9 | 57.5 / 17.1 | 64.5 / 85.8 | 53.8 / 65.5 | 46.8 / 57.3 | 91.4 / 106.4 | 38.7 / 48.8 |
| LCFCN | 3.08 | 136.9 / 150.6 | 126.3 / 140.3 | 152.8 / 164.8 | 147.1 / 160.3 | 137.1 / 151.7 | 105.6 / 113.8 | 208.5 / 211.1 | 95.4 / 110.0 |
| Switch CNN | 0.01 | 66.5 / 77.8 | 61.5 / 74.2 | 74.0 / 83.0 | 56.0 / 63.4 | 69.0 / 80.9 | 92.8 / 105.8 | 67.7 / 79.8 | 65.7 / 76.7 |
| ACSCP | 1.58 | 48.1 / 60.2 | 57.0 / 70.6 | 34.8 / 39.7 | 42.5 / 46.4 | 37.3 / 44.3 | 86.6 / 106.6 | 36.0 / 41.9 | 55.1 / 68.5 |
| AMDCN | 0.16 | 165.6 / 167.7 | 166.7 / 168.9 | 163.8 / 165.9 | 160.5 / 162.3 | 174.8 / 177.1 | 162.3 / 164.3 | 165.5 / 167.7 | 165.6 / 167.8 |
| Stack Pooling | 0.73 | 68.8 / 77.2 | 68.7 / 77.1 | 68.8 / 77.3 | 66.5 / 75.9 | 74.0 / 83.4 | 65.2 / 67.4 | 95.7 / 101.1 | 53.1 / 59.1 |
| DA-Net | 2.52 | 36.5 / 47.3 | 41.5 / 54.7 | 28.9 / 33.1 | 45.4 / 58.6 | 26.5 / 31.3 | 29.5 / 34.0 | 56.5 / 68.3 | 24.9 / 28.7 |
| CSRNet | 3.92 | 19.8 / 25.6 | 17.8 / 25.4 | 22.9 / 25.8 | 12.8 / 16.6 | 19.1 / 22.5 | 42.3 / 45.8 | 20.2 / 24.0 | 19.6 / 26.5 |
| CAN | 7.12 | 22.1 / 33.4 | 18.9 / 26.7 | 26.9 / 41.5 | 11.2 / 14.9 | 14.8 / 17.5 | 69.4 / 73.6 | 14.4 / 17.9 | 26.6 / 39.7 |
| DM-Count | 10.04 | 18.4 / 27.0 | 19.2 / 29.6 | 17.2 / 22.4 | 11.4 / 16.3 | 12.6 / 15.2 | 51.1 / 55.7 | 17.6 / 21.8 | 18.9 / 29.6 |
| STNNet (w/o loc) [14] | 3.56 | 18.6 / 22.2 | 17.1 / 20.5 | 21.0 / 24.6 | 14.7 / 19.9 | 21.4 / 23.2 | 24.7 / 26.3 | 24.2 / 27.3 | 15.4 / 18.7 |
| STNNet [14] | 3.41 | 15.8 / 18.7 | 16.0 / 18.4 | 15.6 / 19.2 | 14.1 / 17.2 | 19.9 / 22.5 | 12.9 / 14.4 | 18.5 / 21.6 | 14.3 / 16.9 |
| Improved STNNet (w/o loc) | 2.18 | 17.2 / 20.8 | 16.0 / 18.2 | 20.5 / 23.9 | 13.9 / 18.1 | 19.2 / 19.6 | 23.2 / 20.1 | 19.4 / 21.4 | 15.4 / 18.7 |
| Improved STNNet | 1.99 | 14.3 / 17.1 | 14.9 / 16.6 | 14.2 / 18.5 | 13.3 / 16.2 | 17.2 / 18.7 | 11.4 / 11.2 | 12.4 / 16.4 | 13.8 / 15.2 |

Notably, Improved STNNet achieved a lower overall MAE of 14.3 compared to STNNet's 15.8, indicating enhanced accuracy in crowd density estimation. Moreover, Improved STNNet consistently outperformed STNNet across attributes such as Large, Small, and Crowded, highlighting its refined capability in crowd density estimation and underscoring the efficacy of the architectural improvements. Additionally, Improved STNNet notably enhanced localization accuracy on the DroneCrowd dataset (Table 2), achieving a 3.78 % increase in L-mAP compared to STNNet, demonstrating superior performance in object localization tasks. Furthermore, tracking accuracy improved across all tracked metrics (Table 3), particularly T-mAP and T-AP@0.10, where Improved STNNet shows significant gains. These findings affirm Improved STNNet's effectiveness in addressing real-world challenges, particularly in densely populated areas and dynamic crowd movements, offering promising prospects for applications such as public safety and event management.

Table 2.

Location accuracy on DroneCrowd.

| Method | L-mAP (%) | L-AP@10 (%) | L-AP@15 (%) | L-AP@20 (%) |
|---|---|---|---|---|
| MCNN | 9.05 | 9.81 | 11.81 | 12.83 |
| CAN | 11.12 | 8.94 | 15.22 | 18.27 |
| CSRNet | 14.4 | 15.13 | 19.77 | 21.16 |
| DM-Count | 18.17 | 17.90 | 25.32 | 27.59 |
| STNNet (w/o loc) [14] | 32.19 | 33.88 | 39.56 | 43.22 |
| STNNet (w/o ass) [14] | 39.77 | 42.06 | 50.00 | 54.88 |
| STNNet (w/o rel) [14] | 40.00 | 42.29 | 50.31 | 55.11 |
| STNNet (w/o cyc) [14] | 40.23 | 42.57 | 50.64 | 55.42 |
| STNNet [14] | 40.45 | 42.75 | 50.98 | 55.77 |
| Improved STNNet (w/o loc) | 36.23 | 37.92 | 41.23 | 47.34 |
| Improved STNNet (w/o ass) | 44.77 | 46.78 | 54.23 | 58.31 |
| Improved STNNet (w/o rel) | 44.45 | 45.76 | 54.21 | 58.41 |
| Improved STNNet (w/o cyc) | 44.31 | 45.87 | 54.56 | 58.54 |
| Improved STNNet | 43.23 | 45.12 | 54.23 | 57.77 |

Table 3.

Tracking accuracy (min-cost flow / social-LSTM).

| Method | T-mAP | T-AP@0.10 | T-AP@0.15 | T-AP@0.20 |
|---|---|---|---|---|
| MCNN | 9.16 / 10.45 | 11.47 / 10.45 | 9.65 / 9.91 | 6.36 / 6.51 |
| CAN | 4.39 / 4.13 | 6.97 / 5.48 | 4.72 / 5.26 | 1.48 / 1.65 |
| CSRNet | 12.15 / 11.66 | 17.34 / 14.63 | 12.85 / 13.74 | 6.26 / 6.16 |
| DM-Count | 17.01 / 16.54 | 22.38 / 19.72 | 18.34 / 19.13 | 10.29 / 10.77 |
| STNNet (w/o loc) [14] | 28.72 / 28.55 | 32.52 / 32.50 | 30.84 / 30.65 | 10.80 / 22.51 |
| STNNet (w/o ass) [14] | 31.44 / 30.90 | 34.59 / 34.08 | 32.78 / 33.12 | 26.77 / 26.30 |
| STNNet (w/o rel) [14] | 32.26 / 31.60 | 35.20 / 34.78 | 33.78 / 33.12 | 27.80 / 26.89 |
| STNNet (w/o cyc) [14] | 32.50 / 31.44 | 35.45 / 34.53 | 33.99 / 32.79 | 28.05 / 26.99 |
| STNNet [14] | 32.32 / 31.58 | 35.29 / 34.82 | 33.78 / 33.00 | 27.90 / 26.92 |
| Improved STNNet (w/o loc) | 32.34 / 32.45 | 32.52 / 32.50 | 30.84 / 30.65 | 22.80 / 22.51 |
| Improved STNNet (w/o ass) | 34.34 / 34.50 | 36.23 / 36.08 | 32.94 / 32.32 | 26.77 / 26.30 |
| Improved STNNet (w/o rel) | 36.43 / 31.60 | 35.20 / 34.78 | 33.78 / 33.12 | 27.80 / 26.89 |
| Improved STNNet (w/o cyc) | 38.34 / 37.24 | 40.42 / 34.53 | 37.01 / 37.23 | 30.05 / 28.12 |
| Improved STNNet | 34.23 / 33.29 | 37.29 / 37.23 | 35.87 / 36.21 | 30.00 / 29.81 |

Ethics statements

Not Applicable.

CRediT authorship contribution statement

Mohd Nazeer: Conceptualization. Kanhaiya Sharma: Investigation, Resources, Writing – review & editing. S. Sathappan: Conceptualization, Data curation, Formal analysis, Writing – review & editing. Pulipati Srilatha: Supervision, Visualization. Arshad Ahmad Khan Mohammed: Data curation, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We thank Symbiosis International (Deemed University), Pune, India, for providing research support funding.

Data availability

  • Data will be made available on request.

References

1. Li S., Hu Z., Zhao M., Sun Z. Cascade-guided multi-scale attention network for crowd counting. Signal Image Video Process. 2021;15:1663–1670.
2. Zhang Y., Zhou D., Chen S., Gao S., Ma Y. Single-image crowd counting via multi-column convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 589–597.
3. Ma Z., Wei X., Hong X., Gong Y. Bayesian loss for crowd count estimation with point supervision. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019. pp. 6142–6151.
4. Wang Q., Gao J., Lin W., Li X. NWPU-Crowd: a large-scale benchmark for crowd counting and localization. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2020;43:2141–2149.
5. Xiong F., Shi X., Yeung D. Spatiotemporal modeling for crowd counting in videos. In: ICCV; 2017. pp. 5161–5169.
6. Zhang S., Wu G., Costeira J.P., Moura J.M.F. FCN-rLSTM: deep spatio-temporal neural networks for vehicle counting in city cameras. In: ICCV; 2017. pp. 3687–3696.
7. Liu W., Salzmann M., Fua P. Estimating people flows to better count them in crowded scenes. In: ECCV; 2020. vol. 12360, pp. 723–740.
8. Woo S., Park J., Lee J., Kweon I.S. CBAM: convolutional block attention module. In: ECCV; 2018. pp. 3–19.
9. Wen L., Du D., Zhu P., Hu Q., Wang Q., Bo L., Lyu S. Detection, tracking, and counting meets drones in crowds: a benchmark. In: CVPR; 2021. pp. 7808–7817.
10. Zhu P., Wen L., Du D., Bian X., Fan H., Hu Q., Ling H. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022. pp. 7380–7399.
11. Li Y., Zhang X., Chen D. CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 1091–1100.
12. Liu W., Salzmann M., Fua P. Context-aware crowd counting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019. pp. 5099–5108.
13. Wang B., Liu H., Samaras D., Nguyen M.H. Distribution matching for crowd counting. Adv. Neural Inf. Process. Syst. 2020;33:1595–1607.
14. Wen L., Du D., Zhu P., Hu Q., Wang Q., Bo L., Lyu S. Detection, tracking, and counting meets drones in crowds: a benchmark. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. pp. 7808–7817.
15. Deb D., Ventura J. An aggregated multicolumn dilated convolution network for perspective-free counting. In: CVPRW; 2018. pp. 195–204.
16. Sindagi V.A., Patel V.M. Generating high-quality crowd density maps using contextual pyramid CNNs. In: ICCV; 2017. pp. 1879–1888.
