Abstract
Wearable camera and thermal sensing systems are increasingly used for real-time eating detection and timely notifications to remind users to log their meals. However, confounding gestures, such as irrelevant hand movements, can cause false device confirmations of eating in real-time. Delaying the device's confirmation of an eating episode until the system is certain can improve the accuracy of eating detection, but prevents the capture of shorter bouts of eating. Balancing the trade-off between errors and detection delay is key to developing effective methods that provide immediate user feedback. This paper presents a real-time, hand-object-based method for automated detection of eating and drinking gestures and identifies the minimum number of gestures needed to reliably detect an eating episode. Unlike prior work, our method considers both hand motion and the object-in-hand and uses a low-power thermal sensor to reduce false positives. We evaluated our method on 36 participants, 28 of whom wore a wearable camera for up to 14 days in free-living environments. The results show that eating episodes can be accurately detected using 10 gestures or within the first 1.5 minutes of the eating episode, achieving an F1-score of 89.0%. Our findings provide evaluation guidelines for designing real-time intervention systems to address problematic eating behaviors.
I. Introduction
Wearable video cameras are increasingly being used for real-time detection and feedback in monitoring human behavior. These devices hold promise to lessen the burden and inaccuracies associated with self-reporting, scale natural observation, and enhance our understanding of human behavior in both public and private settings. Unlike other wearables, cameras offer visual confirmation, enhancing the ability to validate automated mobile health systems [1], [2].
Compared to traditional methods such as self-report, continuous data streams from these cameras capture fine-grained activities, allowing for timely interventions, such as notifications or meal logging reminders. Real-time monitoring is particularly relevant for identifying and addressing problematic behaviors, such as overeating, making wearable cameras a valuable tool in mobile health research [3]. However, in real-world settings, eating gestures are often confused with similar hand-to-head actions such as smoking, touching or scratching the face [4]. These irrelevant gestures can cause false positives, hindering the development of systems capable of providing reliable real-time feedback and notifications to users.
To address these challenges, previous work focused on creating eating episodes from detected feeding gestures [5]. This approach leverages the fact that feeding gestures occur in short, consecutive intervals, allowing them to be clustered into episodes, while irrelevant hand-to-head gestures, which occur sporadically, are identified as noise. Despite its effectiveness in eliminating false positive errors, this method has several limitations: (1) it requires a large time window of past data, which can delay notifications and undermine the system’s real-time functionality; (2) it often fails to detect short meals (e.g., less than 5 minutes); (3) it confounds other similar activities, such as smoking episodes, with eating episodes; and (4) it struggles to generalize across different studies with new participants and settings. Methods are needed that generalize well and can accurately determine when the device is most confident an eating episode has occurred, so that it can intervene or prompt the user for further information. This requires studying the trade-off between false positive errors, detection delay, and overall system performance to ensure timely detection and reliable interventions.
In this work, we address the first two limitations by developing a real-time eating detection algorithm that uses hand and object-in-hand data to overcome the challenges posed by confounding gestures. Our approach determines the optimal balance between the number of gestures required to trigger a notification and the occurrence of false positive errors. To tackle the third limitation, we incorporate a thermal sensor alongside an RGB camera to distinguish and filter out smoking gestures. Thermal sensors provide distinctive information for objects with thermal signatures (e.g., the tip of a cigarette), which enhances smoking session detection accuracy. The advent of low-resolution, power-efficient thermal cameras has led to their increased use in human activity recognition via wearables [6]. Additionally, thermal sensor data can be used to trigger RGB cameras to improve power efficiency and reduce privacy concerns in wearable devices [7]. To demonstrate the generalizability of our method, we conduct a rigorous evaluation using a dataset entirely different from the training set and compare our method with a baseline. We also perform an ablation study comparing the performance of our algorithm when combining RGB and thermal data versus using RGB-only data. This study highlights the added value of thermal sensors in accurately detecting eating episodes and distinguishing them from similar activities.
Our algorithm detects both the object-in-hand (e.g., food, utensil) and the wearer’s hand using a custom loss function [8] and a lightweight YOLOX [9] object detection backbone running in real-time. Frames where both the hand and object-in-hand are detected are clustered to form gestures, which are then used to create episodes. Thermal images are employed to detect smoking sessions using a threshold algorithm, enhancing the system’s accuracy for individuals who smoke. We then examine the trade-off between the number of gestures required to detect an eating episode and the detection delay.
Our method was evaluated in a study with 36 participants, demonstrating that it can detect an eating episode with an F1-score of up to 89.0% using an average of 10 gestures. For some participants, our method can identify eating episodes as short as 1.3 minutes. We compared our method to baseline approaches that only consider the presence of hands in detecting eating episodes, showing that our method improves the baseline F1-score by at least 34%. Our work provides a robust method and evaluation for real-time eating detection and paves the way for studies to customize their timely interventions, ensuring the high intervention fidelity (the degree to which an intervention is implemented as intended) required when studying human eating behaviors and supporting personalized and interactive health monitoring.
II. Method
A. System Design
The wearable is designed for continuous visual confirmation of human hand-to-head activities, with ease of use, power efficiency, and intelligence (the ability to run ML models) in mind. We developed a low-cost, lightweight sensing device about the size of a golf ball (see Figure 1-A). It is built around an STM32L4 System on Chip (SoC) with a Cortex-M4 core, an 80 MHz clock frequency, and 128 KB of built-in SRAM. It also includes an nRF SoC for Bluetooth Low Energy (BLE) connectivity. This configuration supports real-time, on-device ML model execution. The system includes a low-power MLX90640 thermal sensor array for continuous thermal imaging and an OV2640 camera for visual confirmation. Additionally, it has a 500 mAh LiPo battery and charging circuitry for safe and efficient power management.
Fig. 1.

Diagram of the methodological framework. An activity-oriented wearable camera (A) captures hand-to-head activities using RGB and thermal sensors. A hand-object detection model (B) detects the hand and object-in-hand using (C) a custom loss function incorporating the vector from the centroid of the hand to the centroid of the object-in-hand. Frames with a hand and object-in-hand are clustered to create gestures (D), and their overlap with the ground truth is used to calculate the method’s F1-score (E). Frames with a hand and object-in-hand are also clustered directly to create episodes (F). We filter out smoking episodes (shown in gray) using the thermal sensor (G) and study the balance between the number of gestures required to create an episode and the accuracy (H).
B. Data Collection
We recruited 36 participants, including 7 smokers and 20 individuals with obesity. In the first phase, 8 participants wore the device for 1 day; in the second phase, 14 participants wore the device for 7 days; and in the third phase, another 14 new participants wore the device for 14 days. Participants were instructed to wear the device during all waking hours. On average, each participant contributed 78 hours of data, resulting in a total of 2,797 hours of collected data. The data included RGB and thermal video recordings, both captured at 5 frames per second. Each frame was labeled to identify when the participant was performing a feeding gesture or a smoking gesture; all other activities were assigned to an other (background) class. This dataset was then used to develop and evaluate our real-time eating detection model.
C. Gesture Detection
1). Architecture:
Our gesture detection method is built on a hand and object-in-hand detection model (see Figure 1-B). The object-in-hand detection model enables us to differentiate between gestures involving objects and those that do not (e.g., touching the face). Following a methodology similar to that of Shan et al. [8], we designed a custom loss function to enhance our hand-object detection model. This loss function integrates the direction and magnitude of vectors extending from the centroid of the hand bounding box to the centroid of the object-in-hand (see Figure 1-C). This approach allows for class-agnostic object-in-hand detection. To ensure our model operates in real-time, we implemented the smallest version of the YOLOX architecture, known as YOLOX-Nano. This model comprises only 0.91 million parameters and, after quantization, requires approximately 3 MB of storage, making it suitable for deployment on most edge devices.
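As a rough illustration of the offset-vector idea (not the authors' exact implementation, which follows Shan et al. [8] inside a YOLOX detection head), the regression target for each hand box can be expressed as a unit direction plus a magnitude toward the object-in-hand. The helper names below are ours, not the paper's:

```python
import math

def hand_object_offset_target(hand_box, obj_box):
    """Target (unit direction, magnitude) of the vector from the hand-box
    centroid to the object-box centroid. Boxes are (x1, y1, x2, y2)."""
    hx, hy = (hand_box[0] + hand_box[2]) / 2.0, (hand_box[1] + hand_box[3]) / 2.0
    ox, oy = (obj_box[0] + obj_box[2]) / 2.0, (obj_box[1] + obj_box[3]) / 2.0
    dx, dy = ox - hx, oy - hy
    mag = math.hypot(dx, dy)
    if mag == 0.0:
        return (0.0, 0.0), 0.0
    return (dx / mag, dy / mag), mag

def offset_loss(pred_dir, pred_mag, hand_box, obj_box):
    """Squared-error penalty on predicted direction and magnitude versus
    the centroid-to-centroid target (an illustrative loss form)."""
    (tx, ty), tmag = hand_object_offset_target(hand_box, obj_box)
    return (pred_dir[0] - tx) ** 2 + (pred_dir[1] - ty) ** 2 + (pred_mag - tmag) ** 2
```

In a full training loop this term would be added to the usual YOLOX classification and box-regression losses for hand boxes that have an associated object.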
2). Training Details:
We used data from 8 participants who wore the device for a single day consuming at least 3 meals, along with a public hand-object dataset [8], to train our gesture recognition model. The model was trained and validated on a total of 30,203 images, achieving a mean Average Precision (mAP) of 71% in recognizing hand and object bounding boxes on the validation set. We set the model’s confidence threshold to 70% (empirically determined).
3). Clustering Frames:
We input every frame of our test set into the trained model to detect the presence of a hand and an object-in-hand. Since our camera device is activity-oriented (oriented upwards towards the face) and captures only hand-to-head gestures, the simultaneous presence of both a hand and an object strongly indicates a feeding gesture. Each frame is classified as either including a hand and object or not, forming a binary sequence. We then use DBSCAN [10] to create the feeding gestures (see Figure 1-D). The parameters for DBSCAN were determined using grid search and cross-validation, with the optimal empirical settings being eps = 21 seconds and min_points = 3. These identified gestures are subsequently grouped together to create eating episodes.
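A minimal sketch of this step, using a greedy gap rule as a 1-D stand-in for DBSCAN (the function name, parameters, and timestamps below are illustrative, not the paper's code):

```python
def cluster_events(times, eps, min_points):
    """Cluster 1-D event times (seconds of frames where both hand and
    object-in-hand were detected): events closer than eps join a cluster,
    and clusters with fewer than min_points events are discarded as noise.
    Approximates DBSCAN's behavior on a sorted 1-D sequence."""
    clusters, current = [], []
    for t in sorted(times):
        if current and t - current[-1] > eps:
            if len(current) >= min_points:
                clusters.append((current[0], current[-1]))
            current = []
        current.append(t)
    if len(current) >= min_points:
        clusters.append((current[0], current[-1]))
    return clusters

# Two bursts of positive frames become gestures; an isolated frame is noise.
gestures = cluster_events([0, 1, 2, 100, 101, 102, 103, 500],
                          eps=21, min_points=3)
```

With the paper's settings (eps = 21 s, min_points = 3), a lone spurious detection never survives clustering, which is how sporadic confounding gestures are dropped.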
4). Evaluation:
To evaluate our gesture recognition method, we compare the overlap between the ground-truth gestures and the predicted gestures (see Figure 1-E). A predicted gesture is considered a true positive if it overlaps with a ground-truth gesture. If there is no overlap, the predicted gesture is deemed a false positive. Conversely, a ground-truth gesture is classified as a false negative if there is no overlapping predicted gesture.
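This overlap-based scoring can be sketched as follows (a simplified illustration with intervals in seconds; `interval_f1` is a hypothetical helper, not the paper's evaluation code):

```python
def _overlaps(a, b):
    """True if intervals (start, end) share any time."""
    return min(a[1], b[1]) > max(a[0], b[0])

def interval_f1(truth, pred):
    """Overlap-based F1: a predicted interval is a true positive if it
    overlaps any ground-truth interval; a ground-truth interval with no
    overlapping prediction is a false negative."""
    tp = sum(any(_overlaps(p, t) for t in truth) for p in pred)
    fp = len(pred) - tp
    fn = sum(not any(_overlaps(p, t) for p in pred) for t in truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Note that any non-zero overlap counts as a match here; a stricter variant could require a minimum intersection-over-union.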
D. Episode Detection
1). Clustering Gestures:
We use the gesture clusters created in the previous step to generate eating episodes through a similar approach to DBSCAN. We create our binary sequence by marking the center of each gesture as one and the rest as zero. As before, we determine the optimal parameters for DBSCAN using grid search and cross-validation, empirically setting them to eps = 5 minutes and min_points = 4. The start and end of each cluster indicate the beginning and end of each eating episode (i.e., meals). To reduce false positive errors, we exclude clusters that are shorter than 1 minute.
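Under the same simplification (greedy 1-D gap clustering in place of DBSCAN, with gesture midpoints as the cluster members), the gesture-to-episode step with the reported parameters might look like this; the function name is ours, and episode bounds here come from the first and last gesture centers rather than the full cluster extent:

```python
def gestures_to_episodes(gestures, eps_s=300.0, min_points=4, min_len_s=60.0):
    """Cluster gesture midpoints into eating episodes: midpoints closer
    than eps_s join a cluster; clusters with fewer than min_points
    gestures or spanning less than min_len_s are dropped."""
    centers = sorted((s + e) / 2.0 for s, e in gestures)
    episodes, current = [], []
    for c in centers:
        if current and c - current[-1] > eps_s:
            if len(current) >= min_points and current[-1] - current[0] >= min_len_s:
                episodes.append((current[0], current[-1]))
            current = []
        current.append(c)
    if len(current) >= min_points and current[-1] - current[0] >= min_len_s:
        episodes.append((current[0], current[-1]))
    return episodes
```

Six gestures spaced a minute apart merge into one episode, while an isolated gesture far away is discarded as noise.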
2). Clustering Frames:
We also experimented with directly clustering frames where both the hand and object-in-hand are present to create eating episodes (see Figure 1-F). We used DBSCAN for this process, with parameters set to eps = 2 minutes and min_points = 22, determined through grid search and cross-validation. While this method, as discussed in Section III, yielded slightly better results, it is limited by its inability to detect gestures specifically.
3). Filtering Smoking Episodes:
Our method can distinguish eating from confounding hand-to-head gestures involving objects, such as cigarettes. Once the start and end of each eating episode are determined, we use thermal data to filter out smoking sessions, following a similar approach to [6]. This filtering method employs a thresholding algorithm that examines the maximum temperature value in each frame. If the ratio of frames with at least one pixel exceeding the temperature threshold of 70°C (as determined in [6]) is more than 5%, the episode is classified as a smoking session and filtered out (see Figure 1-G).
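The thresholding rule can be sketched in a few lines (a pure-Python illustration; frames are 2-D lists of per-pixel temperatures in °C, and the function name is ours):

```python
def is_smoking_episode(thermal_frames, temp_thresh_c=70.0, ratio_thresh=0.05):
    """Flag an episode as a smoking session if more than ratio_thresh of
    its thermal frames contain at least one pixel above temp_thresh_c."""
    if not thermal_frames:
        return False
    hot = sum(1 for frame in thermal_frames
              if max(max(row) for row in frame) > temp_thresh_c)
    return hot / len(thermal_frames) > ratio_thresh
```

For example, an episode where 10% of frames show a pixel at a cigarette-tip temperature (~80°C) would be flagged, while one that never exceeds body temperature would not.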
4). Evaluation:
To evaluate our episode detection method, we compare the overlap between ground-truth episodes (i.e., meals) and predicted episodes. A predicted episode is considered a true positive if there is any overlap with a ground-truth episode, and a false positive if it does not. Conversely, a ground-truth episode with no overlapping predicted episodes is considered a false negative.
5). Minimum Number of Gestures:
We study the trade-off between the minimum number of gestures required to detect an eating episode, a proxy for the minimum time needed to trigger an eating notification, and the F1-score of episode detection (see Figure 1-H). To do this, we start by removing episodes containing only one gesture, then incrementally exclude episodes with two gestures, three gestures, and continue this process up to 15 gestures. At each step, we compute the F1-score using our method to assess how the number of gestures affects detection performance. The results of this analysis are presented in Section III and shown in Figure 2.
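The sweep itself is straightforward to express: pair each predicted episode with its gesture count, drop episodes below the threshold k, and score the remainder against the ground truth (illustrative code with a toy example, not study data):

```python
def sweep_min_gestures(truth, pred_with_counts, max_k=15):
    """For each threshold k in 1..max_k, keep predicted episodes with at
    least k gestures and compute overlap-based F1 against ground truth.
    pred_with_counts: list of ((start, end), num_gestures)."""
    def overlaps(p, t):
        return min(p[1], t[1]) > max(p[0], t[0])

    scores = {}
    for k in range(1, max_k + 1):
        kept = [iv for iv, n in pred_with_counts if n >= k]
        tp = sum(any(overlaps(p, t) for t in truth) for p in kept)
        fp = len(kept) - tp
        fn = sum(not any(overlaps(p, t) for p in kept) for t in truth)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[k] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores
```

In this toy setup, raising k first removes a short false-positive episode (precision rises), and raising it further starts dropping genuine short meals (recall falls), mirroring the trade-off in Figure 2.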
Fig. 2.

F1-score vs. number of gestures required to detect an eating episode. Precision increases as the number of gestures increases, while recall drops because meals with fewer gestures can no longer be detected. Overall, the method achieves the highest F1-score at 10 gestures.
III. Results
We evaluated our approach and compared it to a baseline method using our dataset. The baseline method detects eating gestures and episodes solely based on the presence of a hand. Additionally, to assess the role of the thermal sensor in effectively removing smoking sessions, we compare our method against only the RGB sensors for participants who smoke. Our findings are discussed in this section.
A. Gesture Detection Evaluation
Table I and Table II show the results of our gesture recognition model for participants without and with smoking sessions, respectively. For non-smokers, our method improves precision and F1-score by 33% and 20%, respectively, over the baseline method. For smokers, our method using the thermal sensor improves precision and F1-score by 35% and 23%, respectively.
TABLE I.
Non-Smokers Only: Gesture-Level and Episode-Level Results, Alongside the Minimum Number of Gestures Required, Using the Proposed Method vs. the Baseline Method (in Parentheses)
| Detection Task | Recall | Precision | F1-Score | Num Gestures |
|---|---|---|---|---|
| Gesture | 0.61 (0.65) | 0.64 (0.31) | 0.62 (0.42) | N/A |
| Episode | 0.92 (0.93) | 0.83 (0.33) | 0.87 (0.49) | 10 (13) |
TABLE II.
Smokers Only: Gesture-Level and Episode-Level Results, Alongside the Minimum Number of Gestures Required, Using Our Method vs. the Baseline Method (in Parentheses)
| Detection Task | Recall | Precision | F1-Score | Num Gestures |
|---|---|---|---|---|
| Gesture (RGB-T) | 0.65 (0.73) | 0.67 (0.33) | 0.66 (0.45) | N/A |
| Episode (RGB-T) | 0.91 (0.92) | 0.86 (0.38) | 0.88 (0.54) | 10 (15) |
B. Episode Detection Evaluation
Table I and Table II present the results of our eating episode detection method for participants without and with smoking sessions, respectively. For non-smoking participants, our method achieves 92% recall, 83% precision, and an 87% F1-score using an average of 10 gestures, outperforming the baseline while requiring 3 fewer gestures (10 vs. 13) and improving the F1-score by 38%. Figure 2 illustrates the trade-off between the number of gestures required to detect an eating episode and the F1-score. For smoking participants, our method using the thermal sensor requires 3 fewer gestures on average and improves precision and F1-score by 18% and 10%, respectively, compared to the RGB-only variant.
Our evaluation also includes clustering frames directly into episodes, which achieves a 2% improvement in F1-score compared to clustering gestures into episodes. This improvement is likely due to the increased granularity and temporal consistency captured at the frame level. By clustering frames, we leverage the continuous and detailed nature of the data, providing a more nuanced view of the underlying patterns and transitions within an episode.
C. Number of Gestures vs. F1-score
In Figure 2, we show that as the number of gestures required to detect an eating episode increases, precision improves drastically due to the filtering of noise (i.e., irrelevant confounding gestures) associated with fewer gestures. However, the F1-score begins to decline when the number of gestures exceeds 10. This decline is because eating episodes with fewer than 10 gestures are no longer detected, reducing the recall performance of the method. In scenarios where precision is more critical (to avoid annoying the user with false notifications), the number of required gestures can be increased up to 13. Conversely, for applications where recall is prioritized, the minimum number of gestures can be reduced to 6, achieving a recall of 89%.
IV. Conclusion and Future Work
We presented a method that uses RGB and thermal sensors to detect eating episodes in real-time, along with an evaluation methodology for real-time eating detection systems. We developed a hand-object-based method that significantly reduces false positives by considering both hand motion and the object-in-hand. We discussed considerations for clustering methods, the associated trade-offs, and guidelines for choosing optimal parameters depending on the use case. We evaluated our method in a study involving 36 participants and demonstrated that it can accurately detect eating episodes with 10 gestures, or within the first 1.5 minutes, achieving an F1-score of 89%. In the future, we plan a more fine-grained evaluation that examines the intersection over union of clusters. These findings provide valuable guidelines for the design of real-time intervention systems aimed at mitigating problematic eating behaviors.
Acknowledgement
This material is based upon work supported by the National Institutes of Health (NIH) under award numbers K25DK113242, R03DK127128, R21EB030305, and R01DK129843.
References
- [1] Alshurafa N, Lin AW, Zhu F, Ghaffari R, Hester J, Delp E, Rogers J, Spring B, et al., “Counting bites with bits: expert workshop addressing calorie and macronutrient intake monitoring,” Journal of Medical Internet Research, vol. 21, no. 12, p. e14904, 2019.
- [2] Bell BM, Alam R, Alshurafa N, Thomaz E, Mondol AS, de la Haye K, Stankovic JA, Lach J, and Spruijt-Metz D, “Automatic, wearable-based, in-field eating detection approaches for public health research: a scoping review,” NPJ Digital Medicine, vol. 3, no. 1, pp. 1–14, 2020.
- [3] Vu T, Lin F, Alshurafa N, and Xu W, “Wearable food intake monitoring technologies: A comprehensive review,” Computers, vol. 6, no. 1, 2017. [Online]. Available: https://www.mdpi.com/2073-431X/6/1/4
- [4] Shahi S, Sen S, Pedram M, Alharbi R, Gao Y, Katsaggelos AK, Hester J, and Alshurafa N, “Detecting eating and social presence with all day wearable RGB-T,” in Proceedings of the 8th ACM/IEEE International Conference on Connected Health: Applications, Systems and Engineering Technologies, 2023, pp. 68–79.
- [5] Zhang S, Zhao Y, Nguyen DT, Xu R, Sen S, Hester J, and Alshurafa N, “NeckSense: A multi-sensor necklace for detecting eating activities in free-living conditions,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 4, no. 2, pp. 1–26, 2020.
- [6] Alharbi R, Shahi S, Cruz S, Li L, Sen S, Pedram M, Romano C, Hester J, Katsaggelos AK, and Alshurafa N, “SmokeMon: unobtrusive extraction of smoking topography using wearable energy-efficient thermal,” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 6, no. 4, pp. 1–25, 2023.
- [7] Shahi S, Pedram M, Fernandes G, and Alshurafa N, “SmartAct: energy efficient and real-time hand-to-mouth gesture detection using wearable RGB-T,” in 2022 IEEE-EMBS International Conference on Wearable and Implantable Body Sensor Networks (BSN). IEEE, 2022, pp. 1–4.
- [8] Shan D, Geng J, Shu M, and Fouhey DF, “Understanding human hands in contact at internet scale,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9869–9878.
- [9] Ge Z, Liu S, Wang F, Li Z, and Sun J, “YOLOX: Exceeding YOLO series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
- [10] Ester M, Kriegel H-P, Sander J, Xu X, et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, vol. 96, no. 34, 1996, pp. 226–231.
