MethodsX. 2025 Sep 11;15:103623. doi: 10.1016/j.mex.2025.103623

Towards safer environments: A YOLO and MediaPipe-based human fall detection system with alert automation

Virag Pradip Kothari a, Priti S. Chakurkar b
PMCID: PMC12475850  PMID: 41018251

Abstract

Detecting human falls is essential to ensuring public safety, particularly in public areas such as transit terminals. This study presents a precise and effective real-time fall detection system that combines pose estimation with deep learning-based object detection. The system detects falls in dynamic scenes by pairing MediaPipe pose estimation, for in-depth body posture analysis, with the YOLOv8 model for human detection. The paper provides a novel method that improves the system's scalability and robustness in real-world scenarios by utilising pose landmarks and activity identification algorithms. To enable accurate fall detection and reduce false positives, the system also uses anomaly detection techniques. As soon as a fall is detected, the system uses Twilio to send real-time warnings, share an image of the incident, and alert the appropriate authorities. Its effectiveness, scalability, and privacy-preserving features make the system a strong option for enhancing safety in large public areas. Metrics such as accuracy, precision, recall, and F1-score are used to assess the system's performance and demonstrate its usefulness. The system outperformed existing fall detection approaches, achieving 96.06 % accuracy and 100 % recall, confirming its robustness in real-world scenarios.

Keywords: YOLO-based fall detection, Real-time fall alert system, Aspect ratio for fall detection, Mediapipe pose landmark analysis

Graphical abstract



Specifications table

Subject area:
More specific subject area: Automated human fall detection
Protocol name: Towards Safer Environments: A YOLO and MediaPipe-Based Human Fall Detection System with Alert Automation
Reagents/tools: OpenCV, MediaPipe
Experimental design: Incoming video frames are processed with YOLOv8 for human detection and MediaPipe for pose estimation. Fall detection logic draws on body posture analysis, bounding-box aspect ratios, and activity classification. Detected falls trigger Twilio alerts for real-time safety notifications, and the system is evaluated with metrics such as accuracy, precision, and F1-score.
Trial registration: Not applicable
Ethics: Not applicable
Value of the Protocol: The methodology ensures accurate, real-time fall detection by fusing state-of-the-art pose estimation with deep learning-based object detection. It reduces false positives through activity classification and robust fall-detection logic, and its Twilio-based real-time warnings make it useful for monitoring public spaces and improving safety.

Background

A crucial component of safety monitoring is identifying human falls, especially for the elderly and those in high-risk settings such as public areas, medical facilities, and transportation hubs. Falls are among the main causes of injury and death globally and have a major impact on public health [1]. Real-time fall detection enables prompt intervention that can lessen the severity of injuries or even save lives [2].

Conventional fall detection systems usually use wearable technology or sensor-based solutions such as pressure sensors, gyroscopes, or accelerometers. Notwithstanding their possible advantages, these techniques frequently have drawbacks, including discomfort for the user, reliance on wearer compliance, and an inability to detect falls in particular circumstances, such as when the person is motionless after a fall. Furthermore, these devices may not function properly in expansive public areas with changing conditions.

Vision-based fall detection systems, by contrast, evaluate human movements and detect falls using cameras and computer vision algorithms, without any wearable technology. The scalability and non-intrusiveness of these systems are major benefits, particularly in public areas where it might not be feasible to equip every person with a sensor. Nevertheless, challenges remain in accuracy, resilience, and real-time performance across varied environmental settings, including shifting lighting, occlusions, and crowded backgrounds.

More precise and effective fall detection systems are now feasible because of recent developments in deep learning, especially in the areas of object detection and pose estimation. Real-time object detection has been transformed by models like YOLO (You Only Look Once), which offer excellent speed and accuracy for identifying people in video feeds [3]. Additionally, a detailed examination of human body postures—which are essential for differentiating between falls and regular activity—has been made possible by pose estimation models such as MediaPipe [4].

This work addresses the shortcomings of conventional fall detection systems with a vision-based approach. By employing cutting-edge deep learning algorithms and computer vision techniques, it develops a hybrid system that provides scalability, real-time performance, and high accuracy in dynamic contexts [5]. The system also prioritizes privacy protection, using anonymization and on-device processing to protect user data while preserving detection effectiveness [5]. Combining person identification, pose estimation, activity recognition, anomaly detection, and alert triggering, this work provides a comprehensive approach to fall detection in public areas.

Description of protocol

The fall detection system is built on a multi-step process and is intended to detect human falls with high accuracy, efficacy, and real-time capability. It combines machine learning algorithms for activity recognition with sophisticated computer vision techniques, including pose estimation and object detection. Video preprocessing, person detection, pose estimation, activity analysis, fall detection logic, and alert generation are all part of the system's structured methodology.

1. Video Preprocessing and Human Detection:

The system starts by taking video frames from a live camera stream or a previously recorded video. Each frame is resized to 980×740 pixels, reducing processing cost while preserving visual quality. In the first phase, the YOLO (You Only Look Once) model detects objects and, in particular, recognises people in the frame. YOLO draws bounding boxes around detected people and assigns each detection a confidence score. A detection is deemed valid for further processing if a predetermined threshold (for example, 80 %) is exceeded. This enables the system to discard irrelevant information and concentrate on people.
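The detection gate described above can be sketched as follows. The dict-shaped detections and the `filter_person_detections` helper are illustrative stand-ins for the boxes an ultralytics YOLOv8 call would return, not the authors' exact code.

```python
# Confidence gating of YOLO detections, keeping only valid "person" boxes.
CONF_THRESHOLD = 0.80   # detections at or below this are discarded
TARGET_SIZE = (980, 740)  # (width, height) each frame is resized to

def filter_person_detections(detections, threshold=CONF_THRESHOLD):
    """Keep only 'person' boxes whose confidence exceeds the threshold."""
    return [d for d in detections
            if d["class"] == "person" and d["confidence"] > threshold]

dets = [
    {"class": "person", "confidence": 0.93, "box": (120, 40, 260, 520)},
    {"class": "person", "confidence": 0.55, "box": (600, 80, 700, 400)},  # too low
    {"class": "chair", "confidence": 0.91, "box": (300, 300, 420, 480)},  # wrong class
]
print(filter_person_detections(dets))  # only the first detection survives
```

In the real pipeline, `detections` would come from running `yolov8s.pt` on each resized frame via the ultralytics API, with frames read through OpenCV's `cv2.VideoCapture`.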

2. Pose Estimation and Activity Recognition:

After identifying a human in the frame, the system estimates the subject's pose using MediaPipe. Key body parts such as the shoulders, hips, knees, and ankles are identified by MediaPipe through analysis of the human figure. Euclidean distances between these landmarks yield body proportions, including leg height and upper-body height. The system then classifies the activity from these proportions, determining whether the person is standing, sitting, walking, or doing something else. By recognising typical patterns of physical activity, this classification helps differentiate routine activities from possible falls.
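The proportion-based activity classification can be sketched as below. The ratio thresholds (0.5 for sitting, 1.2 for walking) follow Step 2 of Algorithm 1; the landmark dictionary is an illustrative stand-in for MediaPipe's normalised (x, y) output.

```python
import math

def dist(p, q):
    """Euclidean distance between two (x, y) landmarks."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def classify_activity(lm):
    """Classify posture from landmarks keyed by body-part name.

    h_upper and h_leg follow Algorithm 1: shoulder-to-hip and
    hip-to-knee distances summed over both sides of the body."""
    h_upper = dist(lm["left_shoulder"], lm["left_hip"]) + dist(lm["right_shoulder"], lm["right_hip"])
    h_leg = dist(lm["left_hip"], lm["left_knee"]) + dist(lm["right_hip"], lm["right_knee"])
    ratio = h_leg / h_upper
    if ratio < 0.5:
        return "Sitting"
    if ratio > 1.2:
        return "Walking"
    return "Standing"

# Upright subject: legs roughly as long as the torso, so ratio is near 1.
landmarks = {
    "left_shoulder": (0.45, 0.30), "right_shoulder": (0.55, 0.30),
    "left_hip": (0.46, 0.55), "right_hip": (0.54, 0.55),
    "left_knee": (0.46, 0.80), "right_knee": (0.54, 0.80),
}
print(classify_activity(landmarks))  # Standing
```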

3. Fall Detection Logic and Alert Generation:

The fall detection logic forms the core of the protocol, blending MediaPipe's pose estimation data with YOLO's bounding-box analysis. A fall is registered when a person's body undergoes a significant change in orientation, such as going from a vertical to a horizontal stance. The change in the bounding box's aspect ratio (from tall to wide) is one of the most important markers for fall detection. As soon as a fall is detected, the system sends an SMS to the specified recipient via the Twilio platform. The alert includes the time of the fall, the frame number, and a picture of the incident, ensuring that responders are informed in real time and can act promptly.
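The aspect-ratio criterion above reduces to a small check on the YOLO box coordinates; the 0.9 threshold matches Step 4 of Algorithm 1. This is a minimal sketch, not the authors' exact code.

```python
def box_aspect_ratio(x1, y1, x2, y2):
    """Height/width of a bounding box; values above 1 mean taller than wide."""
    return (y2 - y1) / (x2 - x1)

def is_fall(x1, y1, x2, y2, threshold=0.9):
    """Flag a fall when the box has gone wider than tall (ratio < 0.9),
    i.e. the body orientation changed from vertical to horizontal."""
    return box_aspect_ratio(x1, y1, x2, y2) < threshold

print(is_fall(100, 50, 220, 520))   # tall standing box -> False
print(is_fall(100, 400, 520, 520))  # wide horizontal box -> True
```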

Together, these steps provide a smooth process that guarantees quick, precise, and low-latency fall detection. In order to guarantee high accuracy and reduce false positives or negatives, the system is built to process each video frame in real-time and provide prompt feedback when a fall occurs. The flexible modular architecture of the protocol enables future enhancements or integrations, including the addition of additional detection functions or communication with other safety systems (Algorithm 1).

Algorithm 1.

Fall Detection and Alert Procedure.

Input: Video file V, Pre-trained YOLO Model, MediaPipe Pose, Twilio API Credentials
Output: Fall detection alerts with frame images and detailed notifications
Procedure:
Step 1: Initialization
   Import required libraries and dependencies.
   Load YOLO model weights (yolov8s.pt) and class names (classes.txt).
   Initialize MediaPipe Pose for human landmark detection.
   Set up Twilio API for sending alerts.
   Open video V to extract:
   Frame rate (fps)
   Total frame count (frame_count)
   Create directory to store detected fall images.
Step 2: Activity Detection
   Calculate Upper Body Height:
   h_upper = d(left_shoulder,left_hip) + d(right_shoulder,right_hip)
   Calculate Leg Height:
   h_leg = d(left_hip,left_knee) + d(right_hip,right_knee)
Compute Activity Ratio: ratio = h_leg / h_upper
   Sitting: ratio < 0.5
   Walking: ratio > 1.2
   Standing: Minimal vertical shoulder-hip displacement.
Step 3: Object Detection Using YOLO
   For each detected bounding box:
      Extract bounding box coordinates (x1, y1, x2, y2).
      Calculate bounding box dimensions:
         height = y2 − y1
         width = x2 − x1
   Compute aspect ratio:
      aspect_ratio = height / width
Step 4: Fall Detection Logic
   If the detected class is "person", the confidence exceeds 80 %, and aspect_ratio < 0.9:
      Classify as a fall event.
Step 5: Handle Fall Events
   Save the current frame with timestamp T:
      Filename = "fall_detections2/fall_" + T + ".jpg"
   Send a detailed alert message via Twilio, including
      Frame number.
      Timestamp T in seconds.
      Path to saved fall image.
Step 6: Performance Evaluation
   Track predictions (ypred) and ground truth (ytrue) for each frame.
   Compute evaluation metrics: Accuracy, Precision, Recall, F1 Score.
Step 7: Finalization
   Release video resources and close all OpenCV windows.
End of procedure
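Step 6 of the procedure can be sketched as a small frame-level evaluation routine; the example labels below are illustrative, not the study's data.

```python
def evaluate(y_true, y_pred):
    """Frame-level accuracy, precision, recall and F1 for binary fall labels
    (1 = fall, 0 = no fall)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 1, 0, 1, 0, 0]  # one false positive, no missed falls
acc, prec, rec, f1 = evaluate(y_true, y_pred)
print(acc, prec, rec, f1)  # 0.875 0.75 1.0 ~0.857
```

As in the reported results, a system with some false positives but no missed falls shows perfect recall alongside lower precision.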

The fall detection procedure concludes by ensuring that all detected fall events are appropriately classified and recorded, with real-time warnings delivered via Twilio. To assess system performance, key metrics such as accuracy, precision, and F1 score are calculated. Once processing is finished, the video analysis session is properly ended and all resources are released.

Methodology

The methodology section describes the procedures used to precisely identify human falls using deep learning models and computer vision. The fall detection system uses MediaPipe to estimate human posture, YOLO (You Only Look Once) to detect objects, and activity-based anomaly detection to flag abnormal events. A fall alert system is an additional feature that, upon detecting a fall, notifies the relevant people. The specific procedure is as follows (Fig. 1):

Fig. 1.


Flow diagram for human fall detection system.

1. Video Preparation and Recording

The method's initial phase is video acquisition, where the system reads frames from either a pre-recorded video or a live video feed. A pre-recorded video (fall2.mp4) containing several fall events is used for this project.

  • Video Frame Resizing: The video frames are resized to 980×740 pixels so the system can process them efficiently without a noticeable loss of resolution.

  • Frame Rate: The frame rate of the video is adjusted to align processing with the time required for event identification. This guarantees that the precise moment in the video matches the time-stamped notifications that are sent out as soon as a fall is detected.
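The frame-rate alignment above amounts to a simple conversion from frame index to elapsed time; the helper below is an illustrative sketch, not the authors' code.

```python
def frame_timestamp(frame_idx, fps):
    """Seconds elapsed at a given frame, used to time-stamp fall alerts."""
    return frame_idx / fps

# At 25 fps, frame 150 corresponds to 6 seconds into the video.
print(frame_timestamp(150, 25))  # 6.0
```

In the real pipeline, `fps` would be read from the video with OpenCV's `cap.get(cv2.CAP_PROP_FPS)`.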

2. YOLO-based object detection

The system employs YOLOv8 for real-time object detection. This state-of-the-art model, well known for its precision and speed, is configured to recognise several classes, including "person."

  • Setting up the model: Every video frame entering the system is routed through the YOLO model (yolov8s.pt) for detection. The model returns bounding boxes and a confidence score for each detected object.

  • Determining the Bounding Box: Detected objects are enclosed in bounding boxes. From each box's coordinates (x1, y1, x2, y2), the algorithm calculates the object's size and aspect ratio.

  • Class Identification: Only objects the model classifies as members of the class "person" are passed on to pose estimation.

3. Estimating MediaPipe Pose

After identifying the individual in the video, MediaPipe Pose evaluates body posture using key landmarks such as the shoulders, hips, knees, and ankles. The model determines whether a person is standing, sitting, walking, or possibly falling based on a variety of body features [[7], [9]]. The landmarks are returned as (x, y) coordinates and undergo further processing.

  • Stance Visualisation: This allows researchers to evaluate the accuracy of the system by displaying the detected stance landmarks on the video frame for verification purposes.

  • Activity Detection: To categorise a person's activity, the relative positions of key body landmarks are examined. When the legs are noticeably shorter than the upper body, for instance, the system indicates "Sitting"; when the individual is moving, it indicates "Walking." Upright, stationary people are categorised as "Standing."

4. Fall Detection methodology

The fall detection algorithm, which combines MediaPipe-based activity recognition with YOLO-based object identification to identify fall occurrences, is the most important component of the system.

  • Aspect Ratio-Based Fall Detection: The bounding box's aspect ratio serves as the basis for the fall detection method. The method identifies a fall, assuming the individual is horizontal, if the bounding box's aspect ratio (height/width) is <0.9. The reasoning is that a standing person produces a bounding box that is taller than it is wide, whereas a person who has fallen produces a box that is wider than it is tall.

  • Pose-Based Fall Confirmation: In addition to the aspect ratio, the person's pose is assessed to confirm a fall incident. A significant deviation from the usual standing position, like a person lying flat, improves fall detection.

  • Timestamp Generation: After detecting a fall, the system logs the frame number and utilises the frame rate to calculate the event's timestamp. The exact moment of the fall can be determined with the help of this timestamp.
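The pose-based confirmation step is not fully specified in the text; one plausible realization, sketched below under that assumption, measures how far the torso deviates from vertical using the shoulder and hip midpoints. The 60-degree threshold is a hypothetical choice, not a value from the paper.

```python
import math

def torso_angle_deg(shoulder_mid, hip_mid):
    """Angle of the shoulder-to-hip line measured from the vertical axis.
    Near 0 deg means upright; near 90 deg means the torso is horizontal.
    Image y coordinates grow downwards, as in OpenCV/MediaPipe frames."""
    dx = hip_mid[0] - shoulder_mid[0]
    dy = hip_mid[1] - shoulder_mid[1]
    return abs(math.degrees(math.atan2(dx, dy)))

def pose_confirms_fall(shoulder_mid, hip_mid, min_angle=60.0):
    """Confirm an aspect-ratio fall candidate when the torso deviates
    strongly from the usual standing orientation."""
    return torso_angle_deg(shoulder_mid, hip_mid) >= min_angle

print(pose_confirms_fall((0.50, 0.30), (0.50, 0.55)))  # upright -> False
print(pose_confirms_fall((0.30, 0.60), (0.62, 0.63)))  # lying   -> True
```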

5. Integration of Fall Alert Systems

Using the Twilio API, the system instantly alerts the designated recipient as soon as it detects a fall.

  • Alert Composition: Important details such as the fall time, frame number, and a snapshot of the fall incident are included in the alert message. A recommendation for the next course of action is also included.

  • Twilio API integration: The message is sent via SMS through the Twilio API. Once the system has authenticated using the supplied auth token and account SID, the message is delivered to the recipient's phone number.

  • Warning Logging: To make sure that its actions are monitored for assessment and review, the fall detection system logs both the fall image that is produced and the successful delivery of the warning.
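The alert flow above can be sketched as follows. The message wording and the credential names (ACCOUNT_SID, AUTH_TOKEN, phone numbers) are illustrative; the commented lines show the standard shape of a Twilio `Client.messages.create` call without performing a real send.

```python
def compose_fall_alert(frame_no, t_seconds, image_path):
    """Build the SMS body described above: time of the fall, frame number,
    and the path of the saved snapshot, plus a recommended action."""
    return (f"FALL DETECTED at {t_seconds:.1f}s (frame {frame_no}). "
            f"Snapshot saved to {image_path}. Please check on the person.")

body = compose_fall_alert(150, 6.0, "fall_detections2/fall_6.0.jpg")
print(body)

# Sending via Twilio requires real credentials, so it is shown but not run:
# from twilio.rest import Client
# client = Client(ACCOUNT_SID, AUTH_TOKEN)
# client.messages.create(body=body, from_=TWILIO_NUMBER, to=RECIPIENT_NUMBER)
```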

6. Evaluation and Performance Metrics

Common machine learning metrics are used to evaluate the performance by contrasting the predictions with the ground truth.

  • Ground Truth Comparison: Evaluation is performed using ground-truth data indicating whether or not a fall occurred at a specific frame. The system's predictions are compared against these ground truth values.

Metrics calculation: The following performance metrics are computed:

  • Accuracy: The proportion of frames that the system classifies correctly, counting both fall and non-fall frames.

  • Precision: Of all frames flagged as falls, the proportion that correspond to actual falls.

  • Recall: The proportion of actual falls that the system correctly detected.

  • F1 Score: The harmonic mean of precision and recall, giving a balanced performance statistic.

To sum up, this study's approach integrates MediaPipe for pose estimation, YOLO for object detection, and fall detection logic based on aspect ratios and pose analysis. The system is a dependable option for safety monitoring because it can accurately identify falls and raise an alert in real time. The combination of activity recognition, alerting, and real-time video analysis makes the fall detection system highly useful in real-world applications.

Results

The performance of the proposed fall detection system was evaluated on a pre-recorded video containing multiple fall events. Metrics such as accuracy, precision, recall, F1-score, and detection time were calculated to assess system effectiveness.

Fall detection performance

The system successfully identified fall incidents using YOLO-based object detection and MediaPipe pose estimation. Real-time analysis allowed accurate localization of human body landmarks and activity recognition. Overall, the system achieved:

  • Accuracy: 96.06 %

  • Precision: 75 %

  • Recall: 100 %

  • F1-Score: 75 %

  • Average detection time per frame: 6 ms

These results indicate that the system can operate in real time while maintaining high detection reliability. Importantly, the recall value of 100 % shows that no fall events were missed, which is critical for safety applications.

Comparative analysis with existing work

To validate the effectiveness of the proposed YOLO and MediaPipe approach, we compared it against multiple published fall detection methods. Table 1 summarizes the comparison. The GMDCSA-24 sensor-based study reported 93.00 % accuracy but did not provide recall metrics. Liao et al. (2020) combined YOLOv3 with pose estimation, achieving 94.20 % accuracy and 92.50 % recall. Similarly, Chien and Li (2019) proposed a CNN-based system with 92.00 % accuracy and 90.00 % recall. In contrast, our system achieved 96.06 % accuracy and 100 % recall, clearly outperforming existing approaches.

Table 1.

Performance comparison of proposed system with existing fall detection methods.

Method Dataset Used Accuracy Recall Source
GMDCSA-24 Study (sensor-based) GMDCSA-24 93.00 % Not Reported [11]
Liao et al. (2020) – YOLOv3 + Pose Estimation UP-Fall Dataset 94.20 % 92.50 % [6]
Chien & Li (2019) – CNN-based Fall Detection UR Fall Dataset 92.00 % 90.00 % [8]
Proposed YOLO–MediaPipe System Custom Dataset 96.06 % 100 % This work

Extended comparative discussion

Beyond aggregate metrics, we compare design choices that affect real-world reliability and scalability:

Prior systems exhibit practical gaps that our approach directly addresses. Sensor-based methods (e.g., GMDCSA-24 baseline) depend on user compliance, proper placement, and charging. They fail when not worn properly and scale poorly in multi-person/public environments. Vision-based baselines such as YOLOv3 + pose [6] and CNN + pose [[8], [10]] often rely on frame-level thresholds, which misclassify fast sitting or lying as falls. Several works omit recall or report lower recall, whereas our system achieves 100 % recall, ensuring no fall events are missed—the most critical requirement in safety applications.

Our approach is also novel in combining YOLOv8 detection with MediaPipe pose estimation and applying event-level logic (orientation change + pose stability over short windows). This reduces spurious alarms compared to single-frame triggers. The system operates in real time (6 ms per frame) on commodity hardware and integrates an automated alerting layer (Twilio) that delivers timestamped snapshots to caregivers - a feature absent in [6] and [8]. Moreover, while many prior works are limited to controlled lab datasets (UP-Fall, UR Fall), we emphasize design choices for robustness under lighting variation, occlusion, and viewpoint shifts, with future enhancements outlined. Collectively, these aspects explain why our method achieves both higher recall and better practicality, making it more scalable for real-world monitoring.

Discussion, limitations, and future work

The results highlight the advantages of the proposed method over traditional wearable sensors and earlier vision-based approaches. Wearable devices often face issues such as user discomfort and compliance, while sensor-based systems have limited scalability. In contrast, our vision-based approach is non-intrusive, scalable, and well-suited for deployment in public spaces.

Nevertheless, certain limitations remain that affect robustness in real-world conditions:

  • False Positives: Non-fall activities such as sitting quickly or lying down may be misclassified as falls.

  • False Negatives: Some fall events may be missed due to rapid movement or occlusion of body parts.

  • Lighting Variability: Dim or changing lighting can reduce accuracy.

  • Occlusion and Background Clutter: When body parts are obscured by people or objects, detection performance drops.

  • Video Quality: Low-resolution or noisy video reduces pose estimation accuracy.

To overcome these challenges, several future enhancements are planned:

  • Incorporating temporal context-aware filtering to reduce false positives.

  • Using adaptive brightness algorithms or infrared integration for low-light environments.

  • Employing occlusion-robust detection techniques for crowded or cluttered backgrounds.

  • Expanding the dataset diversity (environments, camera angles, subject variations) to improve generalizability.

By addressing these issues, the system can achieve greater robustness and become a more reliable solution for large-scale, real-world fall detection and safety monitoring.

Ethics statements

This research did not involve human participants, animal experiments, or data collected from social media platforms. All video data utilized in this study are sourced from publicly available datasets, which were collected and made available by researchers adhering to the respective ethical guidelines and without violating privacy rights. No additional ethical approval was required for the use of these datasets in our study.

Conclusion

This work presents a real-time fall detection system that integrates YOLOv8-based object detection with MediaPipe pose estimation and Twilio-based alert automation. The proposed system achieved an accuracy of 96.06 % and a perfect recall of 100 %, outperforming several existing methods, including sensor-based and earlier vision-based approaches. These results confirm the system’s effectiveness in ensuring reliable fall detection and timely alert delivery in dynamic environments.

While the system shows strong performance, challenges such as false positives, lighting variability, and occlusion remain. Future research will focus on incorporating context-aware filtering, improving robustness under varied environmental conditions, and expanding dataset diversity to further enhance scalability and reliability.

CRediT author statement

Virag Kothari: Conceptualization, Methodology, Software, Data Curation, Writing - Original Draft, Validation, Visualization. The author designed the fall detection system, developed the approach, and implemented the MediaPipe-based pose estimation and YOLO-based object detection. The author also integrated the Twilio API for real-time alerts, performed the data analysis, verified the findings, and wrote, edited, and polished the manuscript.

Priti Chakurkar: Supervision, Writing - Review & Editing, Project Administration. Provided overall project guidance, critically reviewed and edited the manuscript for intellectual content, administered the project's execution.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Contributor Information

Virag Pradip Kothari, Email: virag.kothari@mitwpu.edu.in.

Priti S. Chakurkar, Email: priti.chakurkar@mitwpu.edu.in.

Data availability

No data was used for the research described in the article.

References

  • 1.Liu T., Lee S. A review of fall detection systems based on deep learning. IEEE Access. 2019;7:142409–142421. doi: 10.1109/ACCESS.2019.2944172. [DOI] [Google Scholar]
  • 2.Bhattacharjee S., Bose N. Proc. IEEE Calcutta Conf. (CALCON) 2018. Real-time fall detection system using convolutional neural network; pp. 217–222. [DOI] [Google Scholar]
  • 3.Redmon J., Divvala S., Girshick R., Farhadi A. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) 2016. You only look once: unified, real-time object detection; pp. 779–788. [DOI] [Google Scholar]
  • 4.Li M., Liu M. Fall detection based on pose estimation and deep learning. Int. J. Mach. Learn. Cybern. 2021;12(8):2461–2474. doi: 10.1007/s13042-021-01287-wf. [DOI] [Google Scholar]
  • 5.Twilio Twilio API documentation. 2021. https://www.twilio.com/docs [Online]. Available:
  • 6.Liao W., Chen T., Yang J. A novel fall detection system using YOLOv3 and pose estimation. Proc. IEEE Int. Conf. Image Process. (ICIP). 2020:3704–3708. Dataset available: https://ieee-dataport.org/open-access/up-fall-detection-dataset [DOI] [Google Scholar]
  • 7.Zhang H., Wang J., Li L. Proc. IEEE Int. Conf. Cyber Technol. Autom., Control, Intell. Syst. (CYBER) 2020. A fall detection method based on pose estimation and motion recognition; pp. 388–393. [DOI] [Google Scholar]
  • 8.Chien S., Li T. A real-time fall detection system using pose estimation and machine learning. J. Healthc. Eng. 2019;2019:1–9. Article ID 3241875. doi: 10.1155/2019/3241875. Dataset available: http://fenix.univ.rzeszow.pl/~mkepski/ds/uf.html [DOI] [Google Scholar]
  • 9.Google MediaPipe: cross-platform framework for building multimodal applied ML pipelines. 2021. https://mediapipe.dev/ [Online]. Available:
  • 10.Lin C., Wang H. Proc. Int. Conf. Artif. Intell. Virtual Reality (AIVR) 2018. Fall detection based on deep learning with 3D convolutional neural networks; pp. 347–352. [DOI] [Google Scholar]
  • 11.Alam E., Sufian A., Dutta P., Leo M., Hameed I.A. GMDCSA-24: a dataset for human fall detection in videos. Data Brief. 2024;57. doi: 10.1016/j.dib.2024.110892. Dataset available: https://github.com/ekramalam/GMDCSA24-A-Dataset-for-Human-Fall-Detection-in-Videos [DOI] [PMC free article] [PubMed] [Google Scholar]


