Eur. Phys. J. E 46(5), 32 (2023). doi: 10.1140/epje/s10189-023-00290-x

Benchmarking YOLOv5 and YOLOv7 models with DeepSORT for droplet tracking applications

Mihir Durve 1, Sibilla Orsini 2,3, Adriano Tiribocchi 3, Andrea Montessori 4, Jean-Michel Tucny 1,4, Marco Lauricella 3, Andrea Camposeo 2, Dario Pisignano 2,5, Sauro Succi 1,6
PMCID: PMC10167152  PMID: 37154834

Abstract

Tracking droplets in microfluidics is a challenging task. The difficulty arises in choosing a tool to analyze general microfluidic videos to infer physical quantities. The state-of-the-art object detector algorithm You Only Look Once (YOLO) and the object tracking algorithm Simple Online and Realtime Tracking with a Deep Association Metric (DeepSORT) are customizable for droplet identification and tracking. The customization includes training the YOLO and DeepSORT networks to identify and track the objects of interest. We trained several YOLOv5 and YOLOv7 models and the DeepSORT network for droplet identification and tracking from microfluidic experimental videos. We compare the performance of the droplet tracking applications built on YOLOv5 and YOLOv7 in terms of training time and the time to analyze a given video across various hardware configurations. Although the latest YOLOv7 models are about 10% faster, real-time tracking is achieved only by the lighter YOLO models on an RTX 3070 Ti GPU machine, because of the significant additional droplet tracking cost introduced by the DeepSORT algorithm. This work is a benchmark study of the YOLOv5 and YOLOv7 networks coupled with DeepSORT in terms of training time and inference time for a custom dataset of microfluidic droplets.

Introduction

A subset of machine learning-based tools, called computer vision tools, deals with object identification, classification and tracking in images or videos. State-of-the-art computer vision tools can read handwritten text [1–4], find objects in images [5–8], detect product defects [8, 9], make medical diagnoses from medical images with accuracy surpassing humans [10, 11] and track objects [12, 13], to name a few. In the last few years, they have been increasingly consolidating their place across scientific fields and industries as reliable and fast analysis methods.

Computer vision tools have shown remarkable success in studying microfluidic systems. Artificial neural networks, for example, can predict physical observables, such as flow rate and chemical composition, from images of microfluidic systems with high accuracy, thus reducing the hardware required to measure these quantities in a microfluidics experiment [14, 15]. More recently, a convolutional autoencoder model was trained to predict stable vs unstable droplets from their shapes within a concentrated emulsion [16].

Another application of computer vision tools in microfluidics is tracking droplets. Droplet recognition and tracking in experiments such as Refs. [17–19] and in simulation studies [20, 21] can yield rich information without the need for human intervention. For example, counting droplets, measuring flow rates, observing droplet size distributions and computing statistical quantities are cumbersome when the droplets must be marked manually across several frames. Two natural questions when using computer vision tools for image analysis are (i) how accurately the application finds and tracks the objects, and (ii) how fast the application analyzes each image. A typical digital camera operates at 30 frames per second (fps); thus, one challenge is to analyze the images at the same or a higher rate for real-time applications.

Along with a few other algorithms, You Only Look Once (YOLO) can analyze images at a few hundred frames per second [22, 23] and is designed to detect 80 classes of objects in a given image. The first version of YOLO was introduced in 2015, and the subsequent versions have focused on making the algorithm faster and more accurate at detecting objects. The latest release of YOLO is its 7th version [24], with a reported significant gain in speed and accuracy for object detection on standard datasets containing several objects in realistic scenes. In our previous studies, we trained YOLO version 5 and DeepSORT for real-time droplet identification and tracking in microfluidic experiments and simulations [25, 26], and we reported the image analysis speed for various YOLOv5 models. In this work, we train the latest YOLOv7 models along with DeepSORT and compare the performance and image analysis speed of these models with the previous ones. In particular, this paper studies and compares training time, droplet detection accuracy and inference time for an application that combines YOLOv5/YOLOv7 with DeepSORT for droplet recognition and tracking.

Experimental methods

The images analyzed in this study were obtained from a microfluidic device for the generation of droplets exploiting a flow-focusing configuration (a scheme of the device is shown in Fig. 1). The device has two inlets for the oil flow (length: 7 mm, width: 300 μm, depth: 500 μm), one inlet for the flow of an aqueous solution (length: 5 mm, width: 500 μm, depth: 500 μm), a Y-shaped junction for droplet generation and an expansion channel. The latter is connected to an outlet for collecting the two-phase emulsion. The device was fabricated with a stereolithography system (Envisiontec, Micro Plus HD) using E-shell®600 (Envisiontec) as the pre-polymer. The continuous phase consists of silicone oil (Sigma-Aldrich, viscosity 350 cSt at 25 °C), while an aqueous solution constitutes the dispersed phase. The latter was made by dissolving 7 mg of a black pigment (Sigma-Aldrich, Brilliant Black BN) in 1 mL of distilled water. Both phases were injected through the inlets at constant flow rates by a programmable syringe pump with two independent channels (Harvard Apparatus, model 33). The images analyzed in this study were obtained with flow rates of 10 μl/min for the dispersed phase and 150 μl/min for the continuous phase. The droplets have an average diameter of 185 μm. Droplet formation was imaged with a stereomicroscope (Leica, MZ 16 FA) and a fast camera (Photron, fastcam APX RS) acquiring images at 3000 frames per second (fps), a capture rate far higher than the real-time object detection capabilities of any present algorithm. The image playback rate is set to 30 fps. The sequences of images were stored as AVI video files, and frames from these videos were later used to train the YOLO and DeepSORT models as described in the following section.

Fig. 1 Schematic representation of the microfluidic device used for the droplet generation

Training YOLOv5 and YOLOv7 models

The steps required to train YOLOv5 and YOLOv7 are identical. First, a training dataset was prepared by manually annotating 1000 images taken from the microfluidics experiment described in Sect. 2. Each image in this dataset contains approximately 13 to 14 droplets. One example from the training dataset is shown in Fig. 2. The droplets in these images are identified, and the dimensions of a rectangle that fully covers each droplet are recorded in a separate text file called the label file. We used the PyTorch implementations of YOLOv5 [27] and YOLOv7 [28] to train several YOLO models on a single node of an HPC system containing two Intel(R) Xeon(R) Gold 6240 (Cascade Lake) CPUs @ 2.60 GHz and an NVIDIA Tesla V100 GPU with 32 GB of VRAM. YOLOv5 and YOLOv7 models differ in the number of trainable parameters (see Table 1). The YOLOv7 algorithm includes extended efficient layer aggregation networks to enhance the features learned by different feature maps and to improve the use of parameters and calculations over previous versions [22]. Typical training times are given in Table 1.
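As an illustration of this step, the sketch below shows how such a training run could be launched from Python with the YOLOv5 repository [27]. It is a minimal example under our own assumptions (a hypothetical dataset file droplets.yaml listing the image folders and the single droplet class, and an assumed batch size), not the exact command used in this study.

```python
import subprocess

# Minimal sketch of a YOLOv5 training run on the custom droplet dataset.
# "droplets.yaml" is a hypothetical dataset file pointing to the training and
# validation images and declaring the single class "droplet".
subprocess.run(
    [
        "python", "train.py",          # train.py from the ultralytics/yolov5 repository [27]
        "--img", "640",                # input image size in pixels (cf. Table 1)
        "--batch", "16",               # batch size (assumed value, not reported here)
        "--epochs", "1000",            # number of epochs used in this study
        "--data", "droplets.yaml",     # dataset description file (paths and class names)
        "--weights", "yolov5s.pt",     # start from pretrained YOLOv5s weights
        "--name", "droplet_yolov5s",   # name of the output folder for this run
    ],
    check=True,
)
```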

Fig. 2 Example from the custom training dataset used to train the YOLOv5 and YOLOv7 object detector models. Each object is manually enclosed in a rectangle (called the bounding box), and the dimensions of the rectangle are recorded in an associated label file
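For reference, YOLO label files store one line per object: the class index followed by the bounding box center, width and height, all normalized by the image dimensions. The short sketch below is our own illustration of this conversion, not code from the training pipeline.

```python
def to_yolo_label(x_min, y_min, x_max, y_max, img_w, img_h, class_id=0):
    """Convert a pixel-space bounding box into one line of a YOLO label file.

    YOLO format: "<class> <x_center> <y_center> <width> <height>", where the
    four box values are normalized by the image width and height.
    """
    x_c = (x_min + x_max) / 2.0 / img_w
    y_c = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# Example: a droplet occupying pixels (100, 80)-(285, 265) in a 1024 x 1024 frame
print(to_yolo_label(100, 80, 285, 265, 1024, 1024))
```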

Table 1.

Training times for the YOLO models on the same machine with an identical training dataset. The YOLO model descriptions can be found in Ref. [27] for v5 and in Ref. [24] for v7

Model Parameters (millions) Image size (pixels) Epochs Total time (×10³ s) Time per epoch (s)
YOLOv5s 7.2 640 1000 15.5 15.5
YOLOv5m 21.2 640 1000 27.1 27.1
YOLOv5l 46.5 640 1000 39.4 39.4
YOLOv5x 86.7 640 1000 50.7 50.7
YOLOv5s6 12.6 1280 1000 17.5 17.5
YOLOv5m6 35.7 1280 1000 30.4 30.4
YOLOv5l6 76.8 1280 1000 44.5 44.5
YOLOv5x6 140.7 1280 1000 54.6 54.6
YOLOv7-tiny 6.2 640 1000 14.4 14.4
YOLOv7 36.9 640 1000 26.7 26.7
YOLOv7-x 71.3 640 1000 45.2 45.2
YOLOv7-w6 70.04 1280 1000 66.7 66.7
YOLOv7-e6 97.2 1280 1000 100.4 100.4
YOLOv7-d6 154.7 1280 1000 140.4 140.4
YOLOv7-e6e 151.7 1280 1000 135.0 135.0

During the training phase, a subset of the data (called a batch) is passed through the network, and a loss value is computed from the difference between the network's predictions and the ground truth provided in the label files. The loss value is then used to update the network's trainable parameters so as to minimize the loss in subsequent passes. An epoch is completed when all of the training data have passed through the network once. YOLO's loss calculation takes into account the error in bounding box prediction, the error in object detection and the error in object classification [22]. The loss components computed with the training and validation data are shown in Fig. 3.
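Schematically, one epoch of this procedure corresponds to the generic PyTorch-style loop below. This is a conceptual sketch only; the actual loss terms and update rules are those implemented in the YOLO training scripts [27, 28].

```python
def train_one_epoch(model, loader, loss_fn, optimizer):
    """One epoch: every batch of the training data passes through the network once."""
    model.train()
    for images, targets in loader:            # one batch of images and their labels
        predictions = model(images)           # forward pass
        loss = loss_fn(predictions, targets)  # box, objectness and class error terms in YOLO
        optimizer.zero_grad()
        loss.backward()                       # gradients of the loss w.r.t. trainable parameters
        optimizer.step()                      # parameter update that reduces the loss
```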

Fig. 3 Loss function components for YOLOv5 and YOLOv7 as training progresses. See Ref. [22] for a detailed description of the loss function. Figure legends are the same as in Fig. 4

Inference with YOLO and DeepSORT

During the training phase, the quality of the YOLOv5 and YOLOv7 models is measured with the well-known mean average precision (mAP), calculated here with an Intersection over Union (IoU) threshold of 0.5 (see Fig. 4). For both versions, the mAP value quickly saturates to unity after about 20 epochs of training. Similarly, the mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 lies between 0.9 and 0.94 for the YOLOv5 models and between 0.8 and 0.9 for the YOLOv7 models. These differences in mAP are practically insignificant for droplet detection with the YOLOv5 and YOLOv7 models.
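For completeness, the IoU underlying these mAP values is the ratio between the overlap area and the union area of a predicted box and its ground-truth box; a prediction counts as correct when this ratio exceeds the chosen threshold. A minimal sketch (our own illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection with IoU >= 0.5 against its ground-truth box counts as a true positive
print(iou((100, 80, 285, 265), (105, 85, 290, 270)) >= 0.5)
```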

Fig. 4 Mean average precision (mAP) comparison between YOLOv5 and YOLOv7 models with the custom dataset

After the models are trained, they can be deployed for real-world applications. One challenging milestone for any computer vision application is to run in real time, i.e., with an image analysis speed exceeding 30 fps. YOLO models on their own do deliver real-time performance. In Tables 2 and 3, we report the total time for droplet identification and tracking, combining YOLOv5/YOLOv7 with DeepSORT, on two hardware configurations. Here, the YOLO and DeepSORT times are measured as the time taken by the functions that run the respective algorithms on the input; the time to load the input and write the output is not taken into account. The benchmarking study was carried out on an MSI G77 Stealth laptop with an i7-12700H CPU, 32 GB of RAM and an NVIDIA RTX 3070 Ti GPU with 8 GB of VRAM. The entries marked 'X' in Table 3 correspond to YOLOv7 models that require more than 8 GB of VRAM and are therefore unfeasible to run on the RTX 3070 Ti GPU. Running on the GPU, we observe approximately a 10% improvement in inference speed for YOLOv7 over YOLOv5. However, the additional time taken by the object tracking algorithm DeepSORT is comparable to the detection time of the heavier YOLO models. Taking 30 fps as a commonly accepted threshold for real-time tracking, the single application combining object identification and tracking delivers real-time performance with the lighter YOLO models (YOLOv5s, YOLOv5m, YOLOv7-tiny and YOLOv7-x), but falls below the real-time mark with the heavier models. Finally, a video of droplet tracking is provided in the supplemental material (see SM1.avi).
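The timings in Tables 2 and 3 refer only to the detection and tracking calls themselves, excluding I/O. A sketch of how such per-frame measurements can be collected is given below; run_yolo and run_deepsort are placeholders for the application's own detection and tracking functions, not part of any specific library API.

```python
import time

def benchmark(frames, run_yolo, run_deepsort):
    """Measure per-frame detection and tracking times, excluding input/output.

    `run_yolo(frame)` should return detections (boxes and scores), and
    `run_deepsort(frame, detections)` should return tracks with persistent IDs;
    both are placeholders for the actual functions of the application.
    """
    yolo_time, sort_time = 0.0, 0.0
    for frame in frames:                          # frames are already loaded in memory
        t0 = time.perf_counter()
        detections = run_yolo(frame)              # object identification (YOLO)
        t1 = time.perf_counter()
        tracks = run_deepsort(frame, detections)  # object tracking (DeepSORT)
        t2 = time.perf_counter()
        yolo_time += t1 - t0
        sort_time += t2 - t1
    n = len(frames)
    total = (yolo_time + sort_time) / n
    return yolo_time / n, sort_time / n, total, 1.0 / total  # per-frame times and FPS
```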

Table 2.

Inference time per frame on the CPU (Intel i7-12700H)

Model YOLO (s) DeepSORT (s) Total time (s) FPS
YOLOv5s 0.11 0.12 0.23 4.34
YOLOv5m 0.12 0.12 0.24 4.13
YOLOv5l 0.35 0.12 0.47 2.13
YOLOv5x 0.59 0.11 0.71 1.42
YOLOv5s6 0.27 0.12 0.39 2.57
YOLOv5m6 0.65 0.11 0.76 1.31
YOLOv5l6 1.25 0.12 1.37 0.73
YOLOv5x6 2.18 0.13 2.30 0.43
YOLOv7-tiny 0.11 0.12 0.22 4.47
YOLOv7-x 0.12 0.12 0.23 4.28
YOLOv7 0.36 0.11 0.47 2.12
YOLOv7-w6 0.26 0.12 0.38 2.64
YOLOv7-e6 0.55 0.11 0.67 1.50
YOLOv7-d6 1.20 1.31 2.51 0.40
YOLOv7-e6e 0.63 0.12 0.75 1.33

Table 3.

Inference time per frame on the GPU (NVIDIA RTX 3070 Ti)

Model YOLO (s) DeepSORT (s) Total time (s) FPS
YOLOv5s 0.0057 0.0235 0.0292 34.27
YOLOv5m 0.0076 0.0156 0.0232 43.05
YOLOv5l 0.0192 0.0258 0.0450 22.24
YOLOv5x 0.0196 0.0229 0.0425 23.53
YOLOv5s6 0.0162 0.0256 0.0418 23.92
YOLOv5m6 0.0261 0.0304 0.0565 17.70
YOLOv5l6 0.0384 0.0237 0.0621 16.11
YOLOv5x6 0.0696 0.0186 0.0881 11.34
YOLOv7-tiny 0.0049 0.0241 0.0290 34.48
YOLOv7-x 0.0065 0.0244 0.0309 32.40
YOLOv7 0.0175 0.0217 0.0392 25.53
YOLOv7-w6 0.0176 0.0221 0.0397 25.16
YOLOv7-e6 0.0138 0.0256 0.0394 25.35
YOLOv7-d6 X X X X
YOLOv7-e6e X X X X

Conclusion

This paper studied two versions of the YOLO object detector coupled with the DeepSORT tracking algorithm in a single tool we call DropTrack. DropTrack produces bounding boxes and unique IDs for the detected droplets, which are used to construct droplet trajectories across sequential frames, thus allowing one to compute derived quantities in real time, such as the droplet flow rate, the droplet size distribution, the distance between droplets and local order parameters, which are desired observables in other applications [29–32]. The benchmarks reported in this work serve as a guide to the computational resources required to train the networks and to the inference time expected for various models on diverse hardware configurations.
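As an example of such a derived quantity, droplet velocities (and from them flow rates) can be estimated directly from the centers of the tracked bounding boxes in consecutive frames. The sketch below is our own illustration of this post-processing step, not part of the DropTrack code; the pixel-to-micrometer factor and the frame rate are inputs from the experiment.

```python
from collections import defaultdict

def droplet_speeds(tracks, fps, um_per_pixel):
    """Estimate droplet speeds from tracked bounding-box centers.

    `tracks` is a list of (frame_index, droplet_id, x_center, y_center) tuples in
    pixel units, as produced by an ID-preserving tracker such as DeepSORT.
    Returns a dictionary mapping each droplet ID to its speeds in micrometers per second.
    """
    per_id = defaultdict(list)
    for frame, droplet_id, x, y in sorted(tracks):
        per_id[droplet_id].append((frame, x, y))

    speeds = defaultdict(list)
    for droplet_id, points in per_id.items():
        for (f0, x0, y0), (f1, x1, y1) in zip(points, points[1:]):
            dt = (f1 - f0) / fps                                      # time between detections (s)
            displacement = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5   # pixels
            speeds[droplet_id].append(displacement * um_per_pixel / dt)  # um/s
    return speeds
```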

The YOLOv5 and YOLOv7 networks were trained with identical training datasets on the same HPC machine with an NVIDIA V100 GPU. The training time per epoch is comparable for the lighter YOLOv5 and YOLOv7 models, but the heavier YOLOv7 models take almost twice as long to complete the training.

We observe a significant increase in inference speed for the YOLOv7 models compared to their YOLOv5 counterparts, as one would expect. Moreover, we report detailed computational costs for the object detection and object tracking routines and for the overall performance of the combined application. The lighter YOLO models identify objects in considerably less time than DeepSORT takes to track them, while the object identification time increases with the complexity of the detection network. Thus, choosing the right YOLO network and hardware configuration is crucial for real-time tracking, possibly at the cost of bounding box accuracy.

Acknowledgements

The authors acknowledge funding from the European Research Council Grant Agreement No. 739964 (COPMAT) and ERC-PoC2 grant No. 101081171 (DropTrack). We gratefully acknowledge the HPC infrastructure and the Support Team at Fondazione Istituto Italiano di Tecnologia.

Author contribution statement

MD, AT, AM, ML and JMT performed data labeling, neural network training and output data analysis. SO and AC performed experiments to generate training data. SS and DP designed the study. MD, AT, ML and SO wrote and revised the manuscript.

Funding

Open access funding provided by Istituto Italiano di Tecnologia within the CRUI-CARE Agreement.

Data Availability Statement

All data generated or analyzed during this study are included in this published article and its supplementary files.

References

1. Kang L, Riba P, Rusiñol M, Fornés A, Villegas M. Pay attention to what you read: non-recurrent handwritten text-line recognition. Pattern Recogn. 129, 108766 (2022). doi: 10.1016/j.patcog.2022.108766
2. Coquenet D, Chatelain C, Paquet T. End-to-end handwritten paragraph text recognition using a vertical attention network. IEEE Trans. Pattern Anal. Mach. Intell. 45(1), 508–524 (2023). doi: 10.1109/TPAMI.2022.3144899
3. Darmatasia, Fanany MI. Handwriting recognition on form document using convolutional neural network and support vector machines (CNN-SVM). In: 2017 5th International Conference on Information and Communication Technology (ICoIC7), pp. 1–6 (2017). doi: 10.1109/ICoICT.2017.8074699
4. Ahlawat S, Choudhary A, Nayyar A, Singh S, Yoon B. Improved handwritten digit recognition using convolutional neural networks (CNN). Sensors (2020). doi: 10.3390/s20123344
5. Zou Z, Chen K, Shi Z, Guo Y, Ye J. Object detection in 20 years: a survey. arXiv preprint arXiv:1905.05055 (2019)
6. Joseph KJ, Khan S, Khan FS, Balasubramanian VN. Towards open world object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5830–5840 (2021)
7. Brownlee J. Deep Learning for Computer Vision: Image Classification, Object Detection, and Face Recognition in Python. Machine Learning Mastery (2019). https://books.google.co.in/books?id=DOamDwAAQBAJ
8. Prabhu A, Sangeetha KV, Likhitha S, Shree Lakshmi S. Applications of computer vision for defect detection in fruits: a review. In: 2021 International Conference on Intelligent Technologies (CONIT), pp. 1–10 (2021). doi: 10.1109/CONIT51480.2021.9498393
9. John Rajan A, Jayakrishna K, Vignesh T, Chandradass J, Kannan TTM. Development of computer vision for inspection of bolt using convolutional neural network. Mater. Today Proc. 45, 6931–6935 (2021). doi: 10.1016/j.matpr.2021.01.372
10. Esteva A, Chou K, Yeung S, Naik N, Madani A, Mottaghi A, Liu Y, Topol E, Dean J, Socher R. Deep learning-enabled medical computer vision. npj Digital Medicine 4(1), 5 (2021). doi: 10.1038/s41746-020-00376-2
11. Bhargava A, Bansal A. Novel coronavirus (COVID-19) diagnosis using computer vision and artificial intelligence techniques: a review. Multimedia Tools Appl. 80(13), 19931–19946 (2021). doi: 10.1007/s11042-021-10714-5
12. Soleimanitaleb Z, Keyvanrad MA, Jafari A. Object tracking methods: a review. In: 2019 9th International Conference on Computer and Knowledge Engineering (ICCKE), pp. 282–288 (2019). doi: 10.1109/ICCKE48569.2019.8964761
13. Xu S, Wang J, Shou W, Ngo T, Sadick A-M, Wang X. Computer vision techniques in construction: a critical review. Arch. Comput. Methods Eng. 28(5), 3383–3397 (2021). doi: 10.1007/s11831-020-09504-3
14. Hadikhani P, Borhani N, Hashemi SMH, Psaltis D. Learning from droplet flows in microfluidic channels using deep neural networks. Sci. Rep. 9, 8114 (2019). doi: 10.1038/s41598-019-44556-x
15. Mahdi Y, Daoud K. Microdroplet size prediction in microfluidic systems via artificial neural network modeling for water-in-oil emulsion formulation. J. Dispersion Sci. Technol. 38(10), 1501–1508 (2017). doi: 10.1080/01932691.2016.1257391
16. Khor JW, Jean N, Luxenberg ES, Ermon S, Tang SKY. Using machine learning to discover shape descriptors for predicting emulsion stability in a microfluidic channel. Soft Matter 15, 1361–1372 (2019). doi: 10.1039/C8SM02054J
17. Bogdan M, Montessori A, Tiribocchi A, Bonaccorso F, Lauricella M, Jurkiewicz L, Succi S, Guzowski J. Stochastic jetting and dripping in confined soft granular flows. Phys. Rev. Lett. 128, 128001 (2022). doi: 10.1103/PhysRevLett.128.128001
18. Kintses B, van Vliet LD, Devenish SR, Hollfelder F. Microfluidic droplets: new integrated workflows for biological experiments. Curr. Opin. Chem. Biol. 14(5), 548–555 (2010). doi: 10.1016/j.cbpa.2010.08.013
19. Teh S-Y, Lin R, Hung L-H, Lee AP. Droplet microfluidics. Lab Chip 8, 198–220 (2008). doi: 10.1039/B715524G
20. Montessori A, Lauricella M, Tiribocchi A, Succi S. Modeling pattern formation in soft flowing crystals. Phys. Rev. Fluids 4(7), 072201 (2019). doi: 10.1103/PhysRevFluids.4.072201
21. Montessori A, Rocca ML, Prestininzi P, Tiribocchi A, Succi S. Deformation and breakup dynamics of droplets within a tapered channel. Phys. Fluids 33(8), 082008 (2021). doi: 10.1063/5.0057501
22. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788 (2016). doi: 10.1109/CVPR.2016.91
23. Redmon J, Farhadi A. YOLOv3: an incremental improvement. arXiv:1804.02767 (2018)
24. Wang C-Y, Bochkovskiy A, Liao H-YM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696 (2022)
25. Durve M, Tiribocchi A, Bonaccorso F, Montessori A, Lauricella M, Bogdan M, Guzowski J, Succi S. DropTrack - automatic droplet tracking with YOLOv5 and DeepSORT for microfluidic applications. Phys. Fluids 34(8), 082003 (2022). doi: 10.1063/5.0097597
26. Durve M, Bonaccorso F, Montessori A, Lauricella M, Tiribocchi A, Succi S. Tracking droplets in soft granular flows with deep learning techniques. Eur. Phys. J. Plus 136(8), 864 (2021). doi: 10.1140/epjp/s13360-021-01849-3
27. YOLOv5 git repository. https://github.com/ultralytics/yolov5
28. YOLOv7 git repository. https://github.com/WongKinYiu/yolov7
29. Ferraro D, Serra M, Filippi D, Zago L, Guglielmin E, Pierno M, Descroix S, Viovy J-L, Mistura G. Controlling the distance of highly confined droplets in a capillary by interfacial tension for merging on-demand. Lab Chip 19(1), 136–146 (2019). doi: 10.1039/C8LC01182F
30. Tan Y-C, Fisher JS, Lee AI, Cristini V, Lee AP. Design of microfluidic channel geometries for the control of droplet volume, chemical concentration, and sorting. Lab Chip 4, 292–298 (2004). doi: 10.1039/B403280M
31. Hettiarachchi S, Melroy G, Mudugamuwa A, Sampath P, Premachandra C, Amarasinghe R, Dau V. Design and development of a microfluidic droplet generator with vision sensing for lab-on-a-chip devices. Sens. Actuators A 332, 113047 (2021). doi: 10.1016/j.sna.2021.113047
32. Khater A, Mohammadi M, Mohamad A, Nezhad AS. Dynamics of temperature-actuated droplets within microfluidics. Sci. Rep. 9(1), 3832 (2019). doi: 10.1038/s41598-019-40069-9
