Abstract
In computer vision, multi-object tracking in crowded scenes poses a fundamental challenge with broad applications ranging from surveillance systems to autonomous vehicles. Traditional tracking methods struggle to associate noisy object detections and maintain consistent labels across frames, particularly in scenarios such as video surveillance for crowd control and public safety. This paper introduces the 'Improved Space-Time Neighbor-Aware Network (STNNet)', an advanced framework for online Multi-Object Tracking (MOT) designed to address these challenges. Expanding upon the foundational STNNet architecture, our enhanced model incorporates deep reinforcement learning techniques to refine decision-making. By framing the online MOT problem as a Markov Decision Process (MDP), Improved STNNet learns a policy for data association, handling complexities such as object birth/death and appearance/disappearance as state transitions within the MDP. Through extensive experimentation on benchmark datasets, including the MOT Challenge, the proposed Improved STNNet demonstrates superior performance, surpassing existing methods in demanding, crowded scenarios. This study demonstrates the effectiveness of our approach and lays the groundwork for advancing real-time video analysis applications, particularly in dynamic, crowded environments. Additionally, we utilize the dataset provided by STNNet for density map estimation, forming the basis for our research.
- Develop an advanced framework for online Multi-Object Tracking (MOT) to address crowded-scene challenges, particularly improving object association and label consistency across frames.
- Explore the integration of deep reinforcement learning techniques into the MOT framework, framing the problem as a Markov Decision Process (MDP) to refine decision-making and to handle complexities such as object birth/death and appearance/disappearance transitions.
Keywords: Surveillance, Crowd counting, Tracking and localization, Neural network, Density estimation
Method name: Improved STNNet
Graphical abstract
Specification table
Subject area: Engineering
More specific subject area: Computer Vision
Name of your method: Improved Space-Time Neighbor-Aware Network (Improved STNNet)
Name and reference of original method: P. Zhu et al., "Detection and Tracking Meet Drones Challenge," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7380–7399, Nov. 2022, doi: 10.1109/TPAMI.2021.3119563.
Resource availability: Data will be made available on request.
Method details
In recent years, camera-equipped UAVs have seen increased use in crowd surveillance, particularly in urban settings and for public safety. Our 'Improved STNNet' tackles the challenges of analyzing drone-captured crowd data, achieving superior accuracy on the extensive DroneCrowd dataset [1]. While various datasets support crowd analysis, many lack the diversity and scale required for drone-specific applications [[1], [2], [3], [4]]. Improved STNNet integrates sophisticated CNN architectures with temporal consistency constraints, enabling precise crowd density estimation and tracking in dynamic environments and advancing drone-based crowd analysis [1,[5], [6], [7]]. Traditional crowd tracking relied on energy minimization but struggled with the complexities of drone-captured data. Improved STNNet instead integrates a context-aware loss function, enabling precise tracking in dynamic crowds, a significant advance in drone-based crowd analysis. The stepwise procedure is detailed below.
Data preprocessing
Careful data preparation underpins Improved STNNet's performance. Videos from the DroneCrowd dataset are stabilized and normalized for varying illumination. Each person's position is tracked over time, providing essential temporal information, and the dataset is split into training and testing partitions to verify that the system generalizes across scenarios. Improved STNNet employs a multi-branch Convolutional Neural Network (CNN) architecture for feature extraction, designed to capture the multi-scale spatial features crucial for crowd density estimation and object localization. The feature extraction subnetwork processes the stabilized frames, extracting fine-grained detail across scales for precise object detection. For crowd density estimation, Improved STNNet adopts dense prediction techniques inspired by recent advances in fully convolutional networks (FCNs); its deep layers discern complex patterns in crowded scenes, ensuring precise density map estimation.
$\hat{D} = f_{\text{density}}(I)$  (1)

where $\hat{D}$ represents the estimated density map and $I$ represents the input image.
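As a concrete illustration of the multi-branch feature extraction and FCN-style density head described above, the following PyTorch sketch shows one plausible arrangement; the branch widths, kernel sizes, and module names (MultiBranchBackbone, DensityHead) are illustrative assumptions, not the exact Improved STNNet configuration.

```python
import torch
import torch.nn as nn

class MultiBranchBackbone(nn.Module):
    """Illustrative multi-branch CNN: parallel branches with different
    kernel sizes capture multi-scale spatial features."""
    def __init__(self, in_ch=3, width=16):
        super().__init__()
        def branch(k):
            return nn.Sequential(
                nn.Conv2d(in_ch, width, k, padding=k // 2), nn.ReLU(inplace=True),
                nn.Conv2d(width, width, k, padding=k // 2), nn.ReLU(inplace=True),
            )
        self.branches = nn.ModuleList([branch(k) for k in (3, 5, 7)])

    def forward(self, x):
        return torch.cat([b(x) for b in self.branches], dim=1)  # fused multi-scale features

class DensityHead(nn.Module):
    """Dense (FCN-style) prediction of a non-negative density map, as in Eq. (1)."""
    def __init__(self, in_ch=48):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1), nn.ReLU(inplace=True),  # ReLU keeps the density non-negative
        )

    def forward(self, feats):
        return self.head(feats)

# Usage: the predicted count is the integral (sum) of the density map.
backbone, density_head = MultiBranchBackbone(), DensityHead()
frame = torch.randn(1, 3, 256, 256)          # a stabilized input frame I
density_map = density_head(backbone(frame))  # estimated density map
estimated_count = density_map.sum().item()
```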
Localization subnetwork
To achieve precise object localization, Improved STNNet integrates a specialized localization subnetwork that refines the object positions derived from the density maps. Combining classification and regression branches, the model predicts individuals' precise coordinates, compensating for inaccuracies in the density map estimation. Multi-scale feature maps (i.e., f1, f2, f3) are fused with channel and spatial attention [8] in each branch, as depicted in Fig. 1; the multi-scale feature maps are then resized and fused to produce the output classification and regression maps. The association subnet predicts motion offsets for accurate object tracking, updating positions across frames in complex scenarios, as shown in Fig. 2.
$(\hat{x}_i, \hat{y}_i) = f_{\text{loc}}(I)$  (2)

where $(\hat{x}_i, \hat{y}_i)$ represent the estimated coordinates of the i-th target and $I$ represents the input image.

Fig. 1.

(a) The localization subnet.
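The fusion step of the localization subnetwork can be sketched as follows; this is a minimal PyTorch illustration assuming CBAM-style channel and spatial attention [8] applied to each of f1, f2, f3 before resizing and concatenation, with the module names and channel sizes chosen for clarity rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAMBlock(nn.Module):
    """CBAM-style channel + spatial attention [8] applied to one feature map (illustrative)."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
                                 nn.Linear(ch // reduction, ch))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                 # channel attention from average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))                  # ...and from max pooling
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))          # spatial attention

class LocalizationSubnet(nn.Module):
    """Fuses attended multi-scale maps f1, f2, f3 into classification and regression maps."""
    def __init__(self, chs=(48, 48, 48)):
        super().__init__()
        self.attn = nn.ModuleList([CBAMBlock(c) for c in chs])
        self.cls_head = nn.Conv2d(sum(chs), 1, 1)          # per-pixel head/background score
        self.reg_head = nn.Conv2d(sum(chs), 2, 1)          # per-pixel coordinate refinement (dx, dy)

    def forward(self, f1, f2, f3):
        size = f1.shape[-2:]                               # resize everything to f1's resolution
        fused = torch.cat([F.interpolate(a(f), size=size, mode='bilinear', align_corners=False)
                           for a, f in zip(self.attn, (f1, f2, f3))], dim=1)
        return self.cls_head(fused), self.reg_head(fused)

# Usage with dummy multi-scale features
f1, f2, f3 = (torch.randn(1, 48, s, s) for s in (64, 32, 16))
cls_map, reg_map = LocalizationSubnet()(f1, f2, f3)
```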
Fig. 2.

(a) The association subnet using [9]; (b) the neighboring context loss. The dashed modules in (a) are only used in the training phase. For clarity, only the calculation of the terms from time t − 1 to time t is displayed for the neighboring context loss [9].
Temporal context integration
Understanding the temporal relationships between consecutive frames is crucial in crowd analysis. Improved STNNet introduces a novel temporal context integration module that captures the motion patterns of individuals over time. The model infers the temporal context by analyzing the trajectory information and inter-frame displacements, enabling robust object tracking across frames.
$(\Delta x_i, \Delta y_i) = f_{\text{assoc}}(I_t, I_{t+1})$  (3)

where $(\Delta x_i, \Delta y_i)$ denotes the predicted motion offset of the i-th target between frames t and t + 1, and $I_t$, $I_{t+1}$ represent the input images at time steps t and t + 1.
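A minimal sketch of how dense motion offsets in the spirit of Eq. (3) could be predicted and applied to propagate target positions to the next frame is given below; the network layout (OffsetPredictor) and the nearest-pixel sampling in propagate_positions are illustrative assumptions rather than the exact association subnet.

```python
import torch
import torch.nn as nn

class OffsetPredictor(nn.Module):
    """Illustrative association head: predicts a dense motion-offset field
    (delta_x, delta_y) from the features of two consecutive frames."""
    def __init__(self, feat_ch=48):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * feat_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, 3, padding=1),           # 2 channels: (delta_x, delta_y) per pixel
        )

    def forward(self, feat_t, feat_t1):
        return self.net(torch.cat([feat_t, feat_t1], dim=1))

def propagate_positions(positions_t, offset_field):
    """Move each (x, y) detection at time t by the offset sampled at its location."""
    propagated = []
    _, _, h, w = offset_field.shape
    for x, y in positions_t:
        xi, yi = min(int(round(x)), w - 1), min(int(round(y)), h - 1)
        dx, dy = offset_field[0, :, yi, xi].tolist()
        propagated.append((x + dx, y + dy))           # expected position at time t + 1
    return propagated

# Usage with dummy features of frames I_t and I_{t+1}
feat_t, feat_t1 = torch.randn(1, 48, 64, 64), torch.randn(1, 48, 64, 64)
offsets = OffsetPredictor()(feat_t, feat_t1)
print(propagate_positions([(10.0, 20.0), (33.5, 41.2)], offsets))
```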
Neighboring context-aware loss
A key innovation in Improved STNNet is the neighboring context-aware loss function. Traditional loss functions often overlook the spatial relationships between neighboring objects in consecutive frames. This loss penalizes large changes in the relative positions of adjacent objects across the temporal domain, guiding the network to generate precise motion offsets. By preserving the spatial coherence of object trajectories, Improved STNNet achieves superior performance in crowded and dynamic scenarios.
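The idea behind this loss can be sketched as follows: for each target, the change in its relative position to its nearest neighbors between consecutive frames is penalized. The k-nearest-neighbor construction and the plain L2 penalty below are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import torch

def neighboring_context_loss(pos_prev, pos_curr, k=3):
    """Illustrative neighboring context-aware loss: for each target, penalize how much
    its relative displacement to its k nearest neighbors (chosen at time t-1) changes
    between consecutive frames. pos_prev, pos_curr: (N, 2) matched target coordinates."""
    n = pos_prev.shape[0]
    if n <= 1:
        return pos_prev.new_zeros(())
    k = min(k, n - 1)
    # Pairwise distances at t-1 define each target's neighborhood.
    dists = torch.cdist(pos_prev, pos_prev)
    dists.fill_diagonal_(float('inf'))
    nbr_idx = dists.topk(k, largest=False).indices          # (N, k) nearest-neighbor indices

    # Relative positions to those neighbors at t-1 and at t.
    rel_prev = pos_prev[nbr_idx] - pos_prev.unsqueeze(1)     # (N, k, 2)
    rel_curr = pos_curr[nbr_idx] - pos_curr.unsqueeze(1)     # (N, k, 2)

    # Large changes in relative position break local spatial coherence and are penalized.
    return (rel_curr - rel_prev).norm(dim=-1).mean()

# Usage with dummy matched coordinates
prev = torch.tensor([[10., 10.], [12., 11.], [30., 25.], [31., 27.]])
curr = torch.tensor([[11., 10.5], [13., 11.5], [30.5, 26.], [32., 28.]])
print(neighboring_context_loss(prev, curr))
```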
End-to-End training and optimization
Improved STNNet is trained end-to-end, allowing the entire architecture to be optimized jointly. It uses a multi-task loss that combines density map estimation, object localization, and temporal context terms. The Adam optimizer ensures efficient convergence during training, enabling the network to learn intricate spatial and temporal patterns from the data.
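A skeletal training loop under these assumptions might look like the following; the loss weights, output/target keys, and the model and data loader are placeholders, and the individual loss terms (MSE on density maps, binary cross-entropy on the classification map) are common choices rather than the exact ones used by Improved STNNet.

```python
import torch
import torch.nn.functional as F

# Illustrative weights for the multi-task objective; the values actually used by
# Improved STNNet are not specified here and these are placeholders.
W_DENSITY, W_LOC, W_CONTEXT = 1.0, 1.0, 0.1

def multi_task_loss(outputs, targets):
    """Combine density-map estimation, localization, and temporal/context terms."""
    density_loss = F.mse_loss(outputs['density'], targets['density'])
    loc_loss = F.binary_cross_entropy_with_logits(outputs['cls'], targets['cls'])
    context_loss = outputs['context_loss']        # e.g. the neighboring context-aware term
    return W_DENSITY * density_loss + W_LOC * loc_loss + W_CONTEXT * context_loss

def train(model, loader, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for frames, targets in loader:
            outputs = model(frames)                # forward pass over a clip
            loss = multi_task_loss(outputs, targets)
            optimizer.zero_grad()
            loss.backward()                        # gradients flow through all subnetworks
            optimizer.step()
```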
Post-Processing and object tracking
After the network produces its predictions, post-processing refines the results. Object tracking algorithms establish and maintain trajectories for accurate individual tracking; two association methods, min-cost flow and Social-LSTM [10], are used to extend object trajectories. To assess Improved STNNet's efficacy in crowd tracking, we compare it with MCNN [2], CSRNet [11], CAN [12], DM-Count [13], STNNet [14], AMDCN [15], C-MTL [16], and variations of STNNet. Improved STNNet's strong performance in drone-based crowd analysis stems from accurate prediction, meticulous post-processing, and reliable object tracking.
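As a simplified stand-in for the min-cost flow association, the sketch below gates a Hungarian (linear sum assignment) matching between active tracks and new detections; the distance threshold and the frame-by-frame formulation are illustrative simplifications of the sequence-level min-cost flow and Social-LSTM association used here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, max_dist=25.0):
    """Frame-to-frame association: assign detections to existing tracks by minimizing
    total Euclidean cost (Hungarian algorithm), gated by max_dist.
    tracks, detections: arrays of shape (M, 2) and (N, 2) with (x, y) positions."""
    if len(tracks) == 0 or len(detections) == 0:
        return [], list(range(len(detections))), list(range(len(tracks)))
    cost = np.linalg.norm(tracks[:, None, :] - detections[None, :, :], axis=-1)  # (M, N)
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_dets = [j for j in range(len(detections)) if j not in matched_d]   # start new tracks
    unmatched_tracks = [i for i in range(len(tracks)) if i not in matched_t]     # candidates to terminate
    return matches, unmatched_dets, unmatched_tracks

# Usage: three active tracks, three new detections in the next frame
tracks = np.array([[10., 10.], [50., 40.], [80., 80.]])
dets = np.array([[12., 11.], [49., 43.], [200., 200.]])
print(associate(tracks, dets))
```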
Improvements in method
The enhancements in 'Improved STNNet' encompass advances in neural network architecture, feature extraction, and the algorithms for crowd density estimation, flow analysis, and object association, as shown in Fig. 3. These improvements address the complexities of crowded scenes and dynamic crowd movements, enhancing accuracy, reliability, and context-awareness in prediction and tracking within drone-captured videos. Advanced network architectures and feature extraction techniques strengthen the model's ability to capture intricate crowd patterns, improving object location and density predictions. Enhanced crowd density estimation differentiates closely positioned individuals in densely packed crowds, improving accuracy in high-density scenarios. Improved crowd flow analysis focuses on understanding movement patterns and adjusting for crowd dynamics, improving predictions and tracking, especially in congested areas. Contextual information supports informed decisions during object association, ensuring accurate trajectory predictions and a coherent assessment of crowd movement. Finally, enhanced temporal consistency keeps object trajectories smooth and aligned with the expected crowd flow, improving overall tracking quality.
Fig. 3.
Methodology.
Method validation
Dataset description
Improved STNNet's performance is rigorously evaluated on the DroneCrowd dataset, a comprehensive collection of drone-captured videos featuring dense crowds in various scenarios. The dataset contains 112 video clips comprising 33,600 frames, annotated with over 4.8 million head annotations. These annotations provide rich information for crowd density estimation, object localization, and tracking tasks.
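Ground-truth density maps are typically derived from such point-wise head annotations by smoothing unit impulses with a Gaussian kernel; the sketch below shows this standard construction, with the fixed sigma being an illustrative choice rather than the value used for DroneCrowd.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def heads_to_density_map(head_points, height, width, sigma=4.0):
    """Convert point head annotations into a ground-truth density map by placing
    a unit impulse at each head and smoothing with a Gaussian kernel; the sum of
    the map equals the annotated person count."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in head_points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < width and 0 <= yi < height:
            density[yi, xi] += 1.0
    return gaussian_filter(density, sigma=sigma)

# Usage: a tiny frame with three annotated heads
dmap = heads_to_density_map([(20.3, 15.8), (22.1, 17.0), (60.0, 40.5)], height=64, width=96)
print(round(float(dmap.sum()), 3))   # ~3.0, matching the number of annotated heads
```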
Evaluation metrics
The evaluation of Improved STNNet is conducted using established metrics for crowd analysis, including Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) for crowd density estimation. For object localization, metrics such as Intersection over Union (IoU) and Average Precision (AP) are employed. Additionally, tracking accuracy is assessed using metrics like Multiple Object Tracking Accuracy (MOTA) and Identity Switching Rate (IDSW).
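For reference, the counting metrics can be computed per frame as follows; this is a straightforward sketch (the tables below report the squared-error metric under the label MSE).

```python
import numpy as np

def count_errors(predicted_counts, ground_truth_counts):
    """Mean Absolute Error and Root Mean Squared Error over per-frame crowd counts."""
    pred = np.asarray(predicted_counts, dtype=np.float64)
    gt = np.asarray(ground_truth_counts, dtype=np.float64)
    mae = np.mean(np.abs(pred - gt))
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    return mae, rmse

# Usage with dummy per-frame counts
print(count_errors([102, 87, 140], [110, 90, 135]))
```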
Results and analysis
The performance of Improved STNNet was rigorously evaluated against various state-of-the-art crowd analysis methods, including traditional density estimation techniques, recent deep learning-based models, and object tracking algorithms. This comparison spanned diverse scenarios and crowd densities to comprehensively assess the model's capabilities. The gains of Improved STNNet over STNNet [14] are evident from the density map estimation errors on DroneCrowd reported in Table 1: across all measured metrics, Improved STNNet exhibits lower mean absolute error (MAE) and mean squared error (MSE) values than STNNet.
Table 1.
Estimation errors of density maps on DroneCrowd.
Method | Speed (FPS) | Overall MAE | Overall MSE | Large MAE | Large MSE | Small MAE | Small MSE | Cloudy MAE | Cloudy MSE | Sunny MAE | Sunny MSE | Night MAE | Night MSE | Crowded MAE | Crowded MSE | Sparse MAE | Sparse MSE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MCNN | 28.98 | 34.7 | 42.5 | 36.8 | 44.1 | 31.7 | 40.1 | 21.0 | 27.5 | 39.0 | 43.9 | 67.2 | 68.7 | 29.5 | 35.3 | 37.7 | 46.2 |
C-MTL | 2.31 | 56.7 | 65.9 | 53.5 | 63.2 | 61.5 | 69.7 | 59.5 | 66.9 | 56.6 | 67.8 | 48.2 | 58.3 | 81.6 | 88.7 | 42.2 | 47.9 |
MSCNN | 1.76 | 58.0 | 75.2 | 58.4 | 77.9 | 57.5 | 17.1 | 64.5 | 85.8 | 53.8 | 65.5 | 46.8 | 57.3 | 91.4 | 106.4 | 38.7 | 48.8 |
LCFCN | 3.08 | 136.9 | 150.6 | 126.3 | 140.3 | 152.8 | 164.8 | 147.1 | 160.3 | 137.1 | 151.7 | 105.6 | 113.8 | 208.5 | 211.1 | 95.4 | 110.0 |
Switch CNN | 0.01 | 66.5 | 77.8 | 61.5 | 74.2 | 74.0 | 83.0 | 56.0 | 63.4 | 69.0 | 80.9 | 92.8 | 105.8 | 67.7 | 79.8 | 65.7 | 76.7 |
ACSCP | 1.58 | 48.1 | 60.2 | 57.0 | 70.6 | 34.8 | 39.7 | 42.5 | 46.4 | 37.3 | 44.3 | 86.6 | 106.6 | 36.0 | 41.9 | 55.1 | 68.5 |
AMDCN | 0.16 | 165.6 | 167.7 | 166.7 | 168.9 | 163.8 | 165.9 | 160.5 | 162.3 | 174.8 | 177.1 | 162.3 | 164.3 | 165.5 | 167.7 | 165.6 | 167.8 |
Stack Pooling | 0.73 | 68.8 | 77.2 | 68.7 | 77.1 | 68.8 | 77.3 | 66.5 | 75.9 | 74.0 | 83.4 | 65.2 | 67.4 | 95.7 | 101.1 | 53.1 | 59.1 |
DA—Net | 2.52 | 36.5 | 47.3 | 41.5 | 54.7 | 28.9 | 33.1 | 45.4 | 58.6 | 26.5 | 31.3 | 29.5 | 34.0 | 56.5 | 68.3 | 24.9 | 28.7 |
CSRNet | 3.92 | 19.8 | 25.6 | 17.8 | 25.4 | 22.9 | 25.8 | 12.8 | 16.6 | 19.1 | 22.5 | 42.3 | 45.8 | 20.2 | 24.0 | 19.6 | 26.5 |
CAN | 7.12 | 22.1 | 33.4 | 18.9 | 26.7 | 26.9 | 41.5 | 11.2 | 14.9 | 14.8 | 17.5 | 69.4 | 73.6 | 14.4 | 17.9 | 26.6 | 39.7 |
DM-Count | 10.04 | 18.4 | 27.0 | 19.2 | 29.6 | 17.2 | 22.4 | 11.4 | 16.3 | 12.6 | 15.2 | 51.1 | 55.7 | 17.6 | 21.8 | 18.9 | 29.6 |
STNNet (w/o loc) [14] | 3.56 | 18.6 | 22.2 | 17.1 | 20.5 | 21.0 | 24.6 | 14.7 | 19.9 | 21.4 | 23.2 | 24.7 | 26.3 | 24.2 | 27.3 | 15.4 | 18.7 |
STNNet [14] | 3.41 | 15.8 | 18.7 | 16.0 | 18.4 | 15.6 | 19.2 | 14.1 | 17.2 | 19.9 | 22.5 | 12.9 | 14.4 | 18.5 | 21.6 | 14.3 | 16.9 |
Improved STNNet (w/o loc) | 2.18 | 17.2 | 20.8 | 16.0 | 18.2 | 20.5 | 23.9 | 13.9 | 18.1 | 19.2 | 19.6 | 23.2 | 20.1 | 19.4 | 21.4 | 15.4 | 18.7 |
Improved STNNet | 1.99 | 14.3 | 17.1 | 14.9 | 16.6 | 14.2 | 18.5 | 13.3 | 16.2 | 17.2 | 18.7 | 11.4 | 11.2 | 12.4 | 16.4 | 13.8 | 15.2 |
Notably, Improved STNNet achieved a lower overall MAE of 13.3 compared to STNNet's 15.8, indicating enhanced accuracy in crowd density estimation. Moreover, Improved STNNet consistently outperformed STNNet across scene attributes such as Large, Small, and Crowded, highlighting its refined capability in crowd density estimation and underscoring the efficacy of the architectural improvements. Additionally, Improved STNNet notably enhanced localization accuracy on the DroneCrowd dataset (Table 2), achieving a 3.78 % increase in L-mAP compared to STNNet, demonstrating superior performance in object localization. Furthermore, tracking accuracy improved across all reported metrics (Table 3), particularly T-mAP and T-AP@0.10, where Improved STNNet shows significant gains. These findings affirm Improved STNNet's effectiveness in addressing real-world challenges, particularly in densely populated areas and dynamic crowd movements, offering promising prospects for applications such as public safety and event management.
Table 2.
Localization accuracy on DroneCrowd.
Method | L-mAP (%) | L-AP@10 (%) | L-AP@15 (%) | L-AP@20 (%) |
---|---|---|---|---|
MCNN | 9.05 | 9.81 | 11.81 | 12.83 |
CAN | 11.12 | 8.94 | 15.22 | 18.27 |
CSRNet | 14.4 | 15.13 | 19.77 | 21.16 |
DM-Count | 18.17 | 17.90 | 25.32 | 27.59 |
STNNet (w/o loc) [14] | 32.19 | 33.88 | 39.56 | 43.22 |
STNNet (w/o ass) [14] | 39.77 | 42.06 | 50.00 | 54.88 |
STNNet (w/o rel) [14] | 40.00 | 42.29 | 50.31 | 55.11 |
STNNet (w/o cyc) [14] | 40.23 | 42.57 | 50.64 | 55.42 |
STNNet [14] | 40.45 | 42.75 | 50.98 | 55.77 |
Improved STNNet (w/o loc) | 36.23 | 37.92 | 41.23 | 47.34 |
Improved STNNet (w/o ass) | 44.77 | 46.78 | 54.23 | 58.31 |
Improved STNNet (w/o rel) | 44.45 | 45.76 | 54.21 | 58.41 |
Improved STNNet (w/o cyc) | 44.31 | 45.87 | 54.56 | 58.54 |
Improved STNNet | 43.23 | 45.12 | 54.23 | 57.77 |
Table 3.
Tracking accuracy on DroneCrowd (min-cost flow / Social-LSTM association).
Method | T-mAP | T-AP@0.10 | T-AP@0.15 | T-AP@0.20 |
---|---|---|---|---|
MCNN | 9.16/10.45 | 11.47/10.45 | 9.65/9.91 | 6.36/6.51 |
CAN | 4.39/4.13 | 6.97/5.48 | 4.72/5.26 | 1.48/1.65 |
CSRNet | 12.15/11.66 | 17.34/14.63 | 12.85/13.74 | 6.26/6.16 |
DM-Count | 17.01/16.54 | 22.38/19.72 | 18.34/19.13 | 10.29/10.77 |
STNNet (w/o loc) [14] | 28.72/28.55 | 32.52/32.50 | 30.84/30.65 | 10.80/22.51 |
STNNet (w/o ass) [14] | 31.44/30.90 | 34.59/34.08 | 32.78/33.12 | 26.77/26.30 |
STNNet (w/o rel) [14] | 32.26/31.60 | 35.20/34.78 | 33.78/33.12 | 27.80/26.89 |
STNNet (w/o cyc) [14] | 32.50/31.44 | 35.45/34.53 | 33.99/32.79 | 28.05/26.99 |
STNNet [14] | 32.32/31.58 | 35.29/34.82 | 33.78/33.00 | 27.90/26.92 |
Improved STNNet (w/o loc) | 32.34/32.45 | 32.52/32.50 | 30.84/30.65 | 22.80/22.51 |
Improved STNNet (w/o ass) | 34.34/34.50 | 36.23/36.08 | 32.94/32.32 | 26.77/26.30 |
Improved STNNet (w/o rel) | 36.43/31.60 | 35.20/34.78 | 33.78/33.12 | 27.80/26.89 |
Improved STNNet (w/o cyc) | 38.34/37.24 | 40.42/34.53 | 37.01/37.23 | 30.05/28.12 |
Improved STNNet | 34.23/33.29 | 37.29/37.23 | 35.87/36.21 | 30.00/29.81 |
Ethics statements
Not Applicable.
CRediT authorship contribution statement
Mohd Nazeer: Conceptualization. Kanhaiya Sharma: Investigation, Resources, Writing – review & editing. S. Sathappan: Conceptualization, Data curation, Formal analysis, Writing – review & editing. Pulipati Srilatha: Supervision, Visualization. Arshad Ahmad Khan Mohammed: Data curation, Writing – review & editing.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
We thank Symbiosis International (Deemed University), Pune, India, for providing research support funding.
Data availability
Data will be made available on request.
References
- 1. Li S., Hu Z., Zhao M., Sun Z. Cascade-guided multi-scale attention network for crowd counting. Signal Image Video Process. 2021;15:1663–1670.
- 2. Zhang Y., Zhou D., Chen S., Gao S., Ma Y. Single-image crowd counting via multi-column convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. pp. 589–597.
- 3. Ma Z., Wei X., Hong X., Gong Y. Bayesian loss for crowd count estimation with point supervision. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2019. pp. 6142–6151.
- 4. Wang Q., Gao J., Lin W., Li X. NWPU-Crowd: a large-scale benchmark for crowd counting and localization. IEEE Trans. Pattern Anal. Mach. Intell. 2020;43:2141–2149.
- 5. Xiong F., Shi X., Yeung D. Spatiotemporal modeling for crowd counting in videos. In: ICCV; 2017. pp. 5161–5169.
- 6. Zhang S., Wu G., Costeira J.P., Moura J.M.F. FCN-rLSTM: deep spatio-temporal neural networks for vehicle counting in city cameras. In: ICCV; 2017. pp. 3687–3696.
- 7. Liu W., Salzmann M., Fua P. Estimating people flows to better count them in crowded scenes. In: ECCV; 2020. vol. 12360, pp. 723–740.
- 8. Woo S., Park J., Lee J., Kweon I.S. CBAM: convolutional block attention module. In: ECCV; 2018. pp. 3–19.
- 9. Wen L., Du D., Zhu P., Hu Q., Wang Q., Bo L., Lyu S. Detection, tracking, and counting meets drones in crowds: a benchmark. In: CVPR; 2021. pp. 7808–7817.
- 10. Zhu P., Wen L., Du D., Bian X., Fan H., Hu Q., Ling H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022;44(11):7380–7399.
- 11. Li Y., Zhang X., Chen D. CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In: CVPR; 2018. pp. 1091–1100.
- 12. Liu W., Salzmann M., Fua P. Context-aware crowd counting. In: CVPR; 2019. pp. 5099–5108.
- 13. Wang B., Liu H., Samaras D., Nguyen M.H. Distribution matching for crowd counting. Adv. Neural Inf. Process. Syst. 2020;33:1595–1607.
- 14. Wen L., Du D., Zhu P., Hu Q., Wang Q., Bo L., Lyu S. Detection, tracking, and counting meets drones in crowds: a benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. pp. 7808–7817.
- 15. Deb D., Ventura J. An aggregated multicolumn dilated convolution network for perspective-free counting. In: CVPRW; 2018. pp. 195–204.
- 16. Sindagi V.A., Patel V.M. Generating high-quality crowd density maps using contextual pyramid CNNs. In: ICCV; 2017. pp. 1879–1888.