Abstract
This article explores video analysis methods for monitoring eating behaviors, a critical factor in approximately 70% of global deaths due to illnesses like cancer, diabetes, and heart disease. Automated monitoring quantifies aspects such as meal duration, food types, and intake gestures (bite and drink gestures). Previous deep-learning methods segment videos into short clips (e.g., 16 frames at 8 Hz) for analysis, but this approach overlooks common meal-length patterns in gesture distribution across different individuals and sessions, which can enhance detection accuracy. Our study introduces a novel pipeline that analyzes the entire meal context (5–40 minutes). We propose a framework allowing a global detector to learn meal-length patterns with manageable computational demands. Additionally, we introduced a new augmentation technique to generate hundreds of meal-length feature samples per video, facilitating effective training of a global detector with limited video availability. Experimental results on two datasets (Clemson Cafeteria and EatSense) demonstrate that our pipeline significantly enhances the performance of state-of-the-art window-based networks, particularly in reducing false positives in gesture detection. On the Clemson Cafeteria dataset of 486 meal videos (the largest dataset to date), our method achieves F1 scores of 0.93 for bite gestures and 0.88 for drink gestures, substantially outperforming existing methodologies.
Additional Key Words and Phrases: intake gesture detection, video processing, deep learning, dietary monitoring, computer vision, neural network
1. Introduction
Unhealthy dietary habits are a key behavioral risk factor associated with noncommunicable diseases, which contribute to 70% of global deaths [24]. The need to address unhealthy dietary habits motivates the development of practical methods for monitoring daily food intake. Traditional tools for dietary monitoring, such as food diaries and 24-hour recalls, impose a high user burden, leading to underreporting and decreased compliance over time [35]. Consequently, current research focuses on automating the monitoring of daily intake activities. These automated tools facilitate the objective quantification of eating behaviors, encompassing aspects such as meal duration, food and beverage consumption [14, 27], and the enumeration of intake occurrences [16, 31]. These metrics yield valuable insights to inform targeted dietary interventions [34].
Researchers have investigated various sensor modalities for detecting dietary intake activities. Specifically, significant efforts have been dedicated to wearable devices equipped with sensors, including inertial measurement unit sensors [8, 9, 21], acoustic sensors [12, 25, 26], and electromyography sensors [17, 47, 48]. These wearable devices can be worn on the wrist or head, enabling users to carry them wherever they go. However, they require frequent recharging and may disrupt the daily routine of users due to the need for continuous wear.
In contrast to wearables, camera sensors offer an environment-based solution for detecting intake activities [16, 31]. Once installed, they can operate with no user burden, eliminating the need for daily charging and continuous wear, thus reducing disruptions to daily routines. Another notable advantage of camera sensors is their capacity to provide insights into the types of foods and beverages being consumed [14].
Home-based video monitoring of health-related activities is being researched across several applications, such as gait analysis through telemedicine [22], exercise-based rehabilitation [23], and improving care for elders in assisted living facilities [2]. Similarly, home-based monitoring of eating activities could be useful for interventions, telemedicine, or rehabilitation of diseases or health problems requiring intake measures, such as obesity and diabetes. In all these applications it is important to consider user privacy. Typical solutions involve immediately destroying raw video data after feature extraction, normalizing sensitive features to prevent re-identification [19], and deploying computing algorithms on edge devices to keep data offline [28]. For our system, only analysis results summarizing eating behaviors, such as meal duration, intake gesture count, drink-to-food intake ratio, and eating pace, are reported and stored.
Accurate detection of intake gestures, specifically bites and drinks, is crucial for dietary monitoring systems to reliably assess eating behaviors. These measurements directly facilitate evaluations such as eating rate (bites/min), total bite gesture count (a proxy for portion size), and total drink gesture count (used alongside bite count to calculate the liquid-to-solid intake ratio). Additionally, there is growing evidence that the total bite count may serve as a reasonable proxy for meal-level energy intake (kilocalories). For instance, Salley et al. demonstrated that bite-based estimates of kilocalories were more accurate than participants’ guesses across 271 individuals each eating a single meal in a cafeteria setting [34]. Similarly, Chou et al. discovered that bite count contributed more significantly to total energy intake than the types of foods consumed [4]. These assessments of eating behaviors, distinguishing between lean and obese individuals, are instrumental in formulating recommendations for behavioral changes to manage obesity [39]. This article provides evidence that measurements of intake gestures can be effectively extracted from video data, supporting the development of home-based systems for dietary monitoring.
Some researchers have focused on developing deep learning models tailored for the assessment of eating behaviors through video analysis [16, 31, 37, 44]. Rouast et al. explored various video models and identified SlowFast [11] and a custom-designed recurrent neural network (RNN), which they called CNN-long short-term memory (LSTM), as the top-performing models for detecting intake gestures [31]. SlowFast employs two pathways that process the input at different temporal rates [11]. In CNN-LSTM, spatial-temporal information is encoded using a CNN, and the resulting feature maps are then aggregated via an LSTM [15] for video-level predictions. In a subsequent study, Rouast et al. used connectionist temporal classification loss to train a single-stage model for detecting and localizing intake gestures [32]. This approach enabled the model to learn gesture sequencing instead of treating each gesture independently. However, both of these studies were limited to relatively short data windows, ranging from 2 seconds to 8 seconds.
Hossain et al. developed methods to quantify both bites and chews from video recordings over a longer duration [16]. Their approach involved initially using Faster R-CNN to extract facial regions from videos. Subsequently, they applied AlexNet to classify each frame as either a bite or non-bite, followed by utilizing 1D optical flow to tally chews over a maximum time span of 52 seconds. However, it is important to mention that this final post-processing step was heuristic and not learned by any specific model.
We theorize that eating-related gestures inherently carry sequential context that can be harnessed to improve their recognition over extended time periods. For instance, food preparation activities such as cutting and stirring are typically followed by the intake of food, leading to mastication [29]. Bite gestures, indicative of food intake, tend to occur more frequently than drink gestures. Beverage intake often happens more toward the end of a meal, informally referred to as “washing down food,” and intake tends to slow as the person eating becomes satiated. These contextual cues within a meal-length timeframe hold the potential for enhancing the recognition of intake gestures.
In the following section, we will review related work focused on the analysis of full-length videos to elucidate concepts explored in diverse applications unrelated to dietary intake. The last section of the introduction will delineate the distinctive contributions and novelty of our research.
1.1. Full-length Video Analysis
Several researchers have attempted to leverage long-term temporal information in videos within an affordable computational overhead. One approach involves splitting videos into segments and compressing each segment into a feature vector. For example, Tu et al. [43] adaptively split videos into segments and aggregated segment-level deep feature maps over the video to classify videos from a video-level representation. However, their feature encoding method is limited to video-wise classification and is not suitable in scenarios such as intake gesture detection, where frame-wise classification is required for temporally locating actions in videos. Additionally, their method was found to outperform window-based networks only on short-trimmed videos with actions of interest taking up most of the video [43]. Meal videos, on the other hand, are tens of minutes long and contain numerous non-intake activities.
Another research direction involves extracting long-term information and integrating it into window-based networks to classify short video clips. For example, Yang et al. [46] introduced a collaborative memory pool shared by every clip windowed from the same video. By simultaneously applying a window-based network to every clip within a video, the memory pool gathered features from each clip and infused these features into each clip. In another work, Tang et al. [41] introduced a global attention mechanism that employed long-term attention features to enhance window-based networks. However, neither the collaborative memory mechanism nor the global attention mechanism effectively addressed the computational challenges when processing long videos. For instance, Tang et al. had to limit the maximum input frame count to 768 during model testing on ActivityNet-1.3 due to GPU memory constraints.
All of these prior works either focus solely on clip-level interaction, neglecting long-term context at different scales, or concentrate on short videos with durations of several minutes. In contrast, training a network on longer videos can capture more comprehensive, global knowledge across different scales, particularly when the videos share a common theme, such as eating a meal, where people follow behavioral patterns from the beginning to the end.
1.2. Novelty
We propose a new pipeline, as shown in Figure 1, designed to analyze full-length videos from a global perspective. In our benchmarking efforts, we compare our approach against the methods that performed best in the study by Rouast et al., namely SlowFast and CNN-LSTM [31]. Additionally, we include X3D [10], a more recent model known for delivering state-of-the-art (SOTA) performance in generic activity recognition. In the implementation of our pipeline, we adapt CNN-LSTM from Rouast et al. and X3D [10], recognized as SOTA models in the RNN and 3D-CNN categories, respectively, as the window-based backbones.
Fig. 1.

Standard and proposed video gesture detection methods. The window-based model may struggle with short and fake actions, such as misinterpreting moving hands towards the mouth as a drink or bite gesture, as shown in the figure. In contrast, the global-view model considers the entire meal-length video content and other gestures, allowing it to accurately exclude more fake actions and provide more reliable results.
The computational cost associated with directly building a model to analyze an entire meal video is substantial. We evaluate the computational savings of our approach by comparing a hypothetical one-stage full-video analysis, which uses ResNet-34 as the backbone, to our two-stage approach. Both approaches assume training on the Clemson Cafeteria dataset using two Nvidia V100 GPUs, without accounting for the warm-up training phase on other datasets. The forward/backward pass size for processing a single frame with ResNet-34 is approximately 60 MB, which is manageable when analyzing a brief window of typically tens of frames. However, a meal-length video in the Clemson Cafeteria dataset could contain as many as 12,000 frames when sampled at 5 Hz, resulting in a forward/backward pass size for the spatial encoding section alone of about 60 MB × 12,000 = 720 GB. Including temporal components would require even more memory for training. Additionally, the training time for this hypothetical one-stage model would be prohibitive. X3D-L [10] has a forward/backward pass size of 43 GB and takes approximately 118 hours to train on windows of only 16 frames at 6 Hz. In contrast, our two-stage approach takes approximately 240 hours to train on all 12,000 frames at 5 Hz while outperforming previous works. Thus, our approach enables full-length video analysis at a significantly lower computational cost.
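The memory estimate above is simple arithmetic, and a short sketch makes the scale difference between a window and a full meal explicit. The constants are the figures quoted in the text; the function name and the decimal convention (1 GB = 1,000 MB) are assumptions made for illustration.

```python
# Back-of-the-envelope memory estimate using the figures quoted in the text.
PER_FRAME_MB = 60          # approximate forward/backward pass size per frame (ResNet-34)
FRAMES_PER_MEAL = 12_000   # longest meal video when sampled at 5 Hz
SAMPLE_RATE_HZ = 5


def spatial_encoding_memory_gb(frames, per_frame_mb=PER_FRAME_MB):
    """Memory for the spatial encoder alone, in GB (assuming 1 GB = 1,000 MB)."""
    return frames * per_frame_mb / 1000


full_meal_gb = spatial_encoding_memory_gb(FRAMES_PER_MEAL)   # whole meal at once
window_gb = spatial_encoding_memory_gb(16)                   # one 16-frame window
meal_minutes = FRAMES_PER_MEAL / SAMPLE_RATE_HZ / 60         # duration covered

print(f"full meal: {full_meal_gb:.0f} GB, one window: {window_gb:.2f} GB, "
      f"meal length: {meal_minutes:.0f} min")
```

Under these assumptions a single 16-frame window needs roughly 1 GB, while the full meal needs several hundred times more, which is the gap the two-stage design is meant to close.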
The scarcity of available video data also makes it difficult to train a network with full-length videos. Training a video classifier typically necessitates thousands of samples [45], which implies an equivalent number of videos since one video typically represents one full-length sample without additional augmentation. Yet, acquiring such a large volume of ground truth videos is exceedingly time-consuming, with annotating every frame in a 15-minute meal video taking approximately one hour [38]. As a point of reference, the two largest public datasets for intake gesture detection, the OREBA dataset and the Clemson Cafeteria dataset, comprise only 100 and 486 meal-length videos, respectively [33, 42]. This limited availability of data poses a significant challenge for training a meal-length video classifier.
To overcome the shortage of video data, we introduce an innovative data augmentation technique during the preparation of samples for training the second network. We observed that the output features of a window-based network (local encoder) exhibit volatility between multiple independent training iterations or when using different frame offsets within a window. This volatility results from random processes such as parameter initialization and data batch variations, as well as from varying forward and backward distances. Leveraging this model volatility, we generate a substantial volume of meal-length feature patterns for training our second network (global detector).
The novelty of our work can be summarized as follows:
We propose the first effective two-stage framework for analyzing full-length meal videos to detect intake gestures.
We incorporate two SOTA models as window-based backbones and show significant performance improvements following the application of our global analysis framework, as evidenced by improvements in F1 scores and model stability across diverse training runs and individuals.
Our experimental findings underscore the effectiveness of full-length pattern analysis on meal videos, surpassing the performance of several window-based benchmark methods.
2. Methods
In this section, we first give an overview of the proposed pipeline and the datasets. We then describe the details, including the local encoders, global detectors, augmentation method, and the training, testing, and evaluation of the proposed pipeline.
2.1. Overview
The input to our complete pipeline is a meal-length video, and the output is a label for each frame from the set of classes {bite, drink, non-intake}. Figure 2 shows the training phase of our proposed method. We first build local encoders by adapting small-scale versions of SOTA window-based video action recognition networks. During training, local encoders make frame-wise predictions via a sliding window on videos. Each local encoder network is trained ten times to obtain ten variants. Then we use the frame-wise unnormalized probabilities obtained by local encoders to construct meal-length feature patterns, which we refer to as global patterns. During the construction, we augment the global pattern dataset by up to 160× by leveraging model volatility. Finally, with the augmented global pattern dataset, we are able to train a second network, which we refer to as the global detector. The global detector learns global knowledge, such as the interaction between different gestures and temporal gesture distributions. Therefore, the global detector is expected to perform better than window-based networks.
Fig. 2.

Proposed training procedure. Rounded boxes specify the output from the training phase. Window-level operations are represented in green, trained local encoders are in yellow, and meal-level operations are in red.
Figure 3 shows the testing and inference phase of our proposed method. During this phase, a prediction pipeline is established by combining one trained local encoder variant with one trained global detector. The trained local encoder variant constructs a global pattern for each video, and the trained global detector then analyzes these global patterns and predicts labels for frames in each corresponding meal video.
Fig. 3.

Proposed inference and testing procedure. Only one trained local encoder variant is deployed during this phase. Rounded boxes specify the output from the testing and inference phase.
Finally, the predicted labels are evaluated by comparing them to the corresponding ground truth.
2.2. Dataset
In this study we used the publicly available Clemson Cafeteria dataset [42] and EatSense dataset [30]. The Clemson Cafeteria dataset includes 486 videos of 264 participants, in which each participant ate a single meal consisting of 1–4 courses (each course was recorded as a separate video) at the Harcombe Dining Hall at Clemson University. Each course is considered an independent meal episode in our proposed pipeline. The participant group consists of 137 females and 127 males, of whom 183 were 18–30 years old, 60 were 31–50 years old, and 21 were 51–75 years old when the data were collected. A total of 374 different food and beverage types appeared in their meals. Utensils included forks, spoons, chopsticks, and hands. Containers included plates, bowls, glasses, and cups [42]. Videos were recorded in 480 × 640 spatial resolution at 30 Hz. These were cropped to regions of interest centered on the participant and their tray of food and beverage [42].
All 264 participants were randomly allocated to the training, validation, and test sets. The training set includes 70% of participants (i.e., 184 participants), and the validation and test sets each include 15% of participants (i.e., 40 and 39 participants, respectively).
The EatSense dataset includes 135 untrimmed videos from 27 participants of 12 different nationalities. The participant group consists of 11 females and 16 males, of whom 11 were below 30 years old, 6 were 30–39 years old, 3 were 40–49 years old, 3 were 50–59 years old, and 4 were above 60 years old. Utensils included forks, spoons, knives, and hands. Videos were recorded in 640 × 480 spatial resolution at 15 Hz or 30 Hz. We do not crop videos in the EatSense dataset, as each video features the participant as the sole subject.
Because of the relatively small number of participants, we randomly split the videos into training, validation, and testing sets, rather than splitting by participants, to minimize performance variability due to dataset division. The train set includes 70% of videos (i.e., 94 videos), and the validation and test sets each include 15% of videos (i.e., 20 and 21 videos, respectively).
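The two splitting strategies described above differ only in which identifier is shuffled. The following is a minimal sketch of that difference, not the authors' code: the `split_ids` helper, the seed, the rounding rule, and the toy video index are all hypothetical.

```python
import random


def split_ids(ids, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle unique IDs and cut them into train/validation/test lists."""
    ids = sorted(set(ids))
    random.Random(seed).shuffle(ids)
    n_train = round(train_frac * len(ids))
    n_val = round(val_frac * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Hypothetical video index: (participant_id, video_id) pairs, two courses each.
videos = [(f"p{p:03d}", f"p{p:03d}/c{c}") for p in range(20) for c in (1, 2)]

# Participant-wise split (Clemson Cafeteria): no participant appears in two sets.
train_p, val_p, test_p = split_ids([p for p, _ in videos])
train_participants = set(train_p)
train_videos = [v for p, v in videos if p in train_participants]

# Video-wise split (EatSense): the videos themselves are shuffled.
train_v, val_v, test_v = split_ids([v for _, v in videos])
```

With a participant-wise split, every video of a held-out participant is unseen during training; with a video-wise split, a participant can contribute videos to more than one set, which is the trade-off accepted for the smaller dataset.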
In this study, we focus on intake gesture detection and consider bite and drink gestures as target classes. All other classes are consolidated into a single background class. Raw video frames from both datasets are resized according to the specifications recommended for each local encoder and benchmark method. These details are explained in Sections 2.3 and 2.6.
2.3. Local Encoder
We investigate two SOTA networks for local encoders, X3D [10] and CNN-LSTM [7]. The CNN-LSTM architecture achieved the best results in intake gesture detection in the study by Rouast et al. [31]. X3D achieved the best accuracy to date on Kinetics [3], a large dataset for generic video action recognition with 700 classes, reaching top-1 and top-5 accuracies of 80.0% and 94.5%, respectively. As intake gesture detection can be viewed as a downstream task in the realm of generic action recognition, we anticipate that X3D can yield comparable SOTA performance in this domain. Prior to training, the ResNet component of the CNN-LSTM model and the complete X3D model are pre-trained on the ImageNet [6] and Kinetics [3] datasets, respectively. This initialization equips the models with a strong foundation of visual features relevant to intake gesture detection, enabling efficient training on datasets with relatively small amounts of data.
2.3.1. Local Encoder Instantiations.
We instantiate the local encoders at small scales (X3D-S and CNN-LSTM-S) to decrease computational cost. This is important because the local encoders will be trained repeatedly for generating and augmenting global pattern data. Table 1 shows the instantiation details of each local encoder. We follow the model design approach of the original authors to set the window length and sample rate. Therefore, each local encoder analyzes approximately a 2-second window of video data. A sliding window is used to analyze each video in its entirety.
Table 1.
Instantiations for Local Encoders
| Stage | X3D-S (13 frames at 5 Hz): Kernels | X3D-S: Output size | CNN-LSTM-S (16 frames at 8 Hz): Kernels | CNN-LSTM-S: Output size |
|---|---|---|---|---|
| input layer | - | 13 × 160² | - | 16 × 224² |
| conv1 | 1 × 3², 24 channels; stride 1 × 2² | 13 × 80² | 7², 64 channels; stride 1 × 2² | 16 × 112² |
| pool1 | - | - | 3²; stride 1 × 2² | 16 × 56² |
| res2 | | 13 × 40² | | 16 × 56² |
| res3 | | 13 × 20² | | 16 × 28² |
| res4 | | 13 × 10² | | 16 × 14² |
| res5 | | 13 × 5² | | 16 × 7² |
| conv2 | 1 × 1², 432 channels; stride 1 × 1² | 13 × 5² | - | - |
| pool2 | 1 × 5²; stride 1 × 1² | 13 × 1² | 1 × 7²; stride 1 × 1² | 16 × 1² |
| flatten | - | 13 × 1 | - | 16 × 1 |
| lstm | - | - | 128 units, bidirectional | 16 × 1 |
| dense | 3 nodes per timestep | 13 × 3 | 3 nodes per timestep | 16 × 3 |
Kernel, stride and output sizes are expressed in temporal size × spatial size. Two SOTA models are downsized and adapted for fast feature encoding purposes.
In addition to keeping the model complexity small, the local encoders are modified to output predictions for every temporal frame in a window (i.e., seq2seq, the input and output keep the same time dimension). This increases the number of samples generated per training cycle, which increases the amount of data augmentation that can be done each time the models are retrained. The data augmentation will be further explained in Section 2.4. For X3D-S, the last spatiotemporal pooling layer is set to 1 temporal stride so that the time dimension does not shrink. For CNN-LSTM-S, ResNet34 is used as the backbone. ResNet34 reduces the model complexity while still providing competitive performance to ResNet50. The LSTM layer in CNN-LSTM-S is bidirectional, so that the receptive field of each node equally covers the whole input pattern.
2.3.2. Construct Global Patterns.
We utilize the 3-class unnormalized probability set, generated by the last dense layer of the local encoder, as the encoded feature vector for each frame. Subsequently, we construct a meal-length global pattern by concatenating the 3-class unnormalized probability sets of each frame in a meal video, where each class (bite, drink, non-intake) corresponds to one channel of global patterns. Using a probability value for each class per frame allows global patterns to explicitly convey gesture information encoded by window-based models in a compact way. Additionally, a 3-class unnormalized probability set is not squashed by the last softmax activation function in a local encoder, making it more continuous and providing more differentiable information about gesture prediction confidence compared to a model’s output probabilities that tend to discretely distribute near 0 or 1.
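To illustrate why unnormalized outputs carry more usable confidence information than softmax probabilities, the sketch below applies a softmax to two made-up output vectors in the order (bite, drink, non-intake). The numbers are invented for illustration and do not come from the trained encoders.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of unnormalized scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical frames, both favoring "bite" but with different margins.
confident = [6.0, -1.0, -2.0]
very_confident = [12.0, -1.0, -2.0]

p_confident = softmax(confident)
p_very_confident = softmax(very_confident)

score_gap = very_confident[0] - confident[0]   # gap before softmax
prob_gap = p_very_confident[0] - p_confident[0]  # gap after softmax
print(f"score gap: {score_gap:.1f}, probability gap: {prob_gap:.4f}")
```

The two frames differ by 6 units before the softmax but by well under 0.01 after it, since both probabilities saturate near 1. Feeding the unnormalized values to the global detector preserves that distinction.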
Figure 4 shows the process of constructing meal-length global patterns. When applying a local encoder to a window of video frames, each frame gets a 3-class unnormalized probability set. When sliding a window across a video with the stride of 1, a frame offset can be chosen to record a sequence of unnormalized probabilities, which becomes one global pattern for the video. Different global patterns can be generated from one video using different frame offsets; this is discussed more in the next section.
Fig. 4.

An example of constructing global patterns using local encoder outputs. The window size is 8 frames, and the frame offset is 3. p_{i,j} is the unnormalized probability set for the i-th frame in the video when that frame is fed into the local encoder within the j-th window. The dashed box highlights the unnormalized probabilities taken for constructing global patterns. Frames left uncovered because of the offset are padded with zeros.
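The construction illustrated in Figure 4 can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: `toy_encoder` is a hypothetical stand-in for a trained local encoder, frames are plain integers, and windows are assumed to be indexed by their first frame.

```python
def toy_encoder(window):
    """Hypothetical stand-in for a trained local encoder.

    Returns one 3-value vector (bite, drink, non-intake) per frame in the
    window. The values only encode the frame and its position in the window
    so that the result is easy to inspect.
    """
    return [[float(frame), float(pos), 1.0] for pos, frame in enumerate(window)]


def build_global_pattern(frames, encoder, window_size, offset):
    """Slide a stride-1 window and keep the output at a fixed in-window offset."""
    if not 0 <= offset < window_size:
        raise ValueError("offset must lie inside the window")
    # Frames never reached at this offset keep their zero padding.
    pattern = [[0.0, 0.0, 0.0] for _ in frames]
    for start in range(len(frames) - window_size + 1):
        outputs = encoder(frames[start:start + window_size])
        pattern[start + offset] = list(outputs[offset])
    return pattern


frames = list(range(100, 112))  # 12 dummy frames
pattern = build_global_pattern(frames, toy_encoder, window_size=8, offset=3)
```

With 12 frames, a window of 8, and an offset of 3, only frames 3 to 7 receive encoder outputs; the first three and last four entries stay at zero, matching the padding described in the caption.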
2.4. Global Pattern Augmentation
Model volatility in local encoders enables the generation of global patterns across various contexts. We augment global patterns by a total of 130× to 160× using two different sources of model volatility.
2.4.1. Using Different Frame Offsets in the Sliding Window.
Because local encoders are built for frame-wise predictions, the model outputs for each frame in each input window are accessible. By using different frame offsets within the sliding window, multiple global patterns can be constructed from one video depending on the window size.
Figure 5 shows an example. With a window size of 16 frames, global patterns can be augmented to 16x the number of videos.
Fig. 5.

An example of generating global patterns using different frame offsets in the sliding window. The window size is 8 frames, and the frame offsets are 0, 2, 4, and 6. p_{i,j} is the unnormalized probability set for the i-th frame in the video when that frame is fed into the local encoder within the j-th window. The dashed boxes highlight the unnormalized probabilities taken for constructing global patterns, with a different box color for each offset. Frames left uncovered because of the offsets are padded with zeros.
When creating a global sample, the window offset remains consistent across all local windows, so the temporal distance for looking forward and backward within a local window is uniform and the distance between data points is consistent.
Different frame offsets affect the distance a local model considers when looking ahead and backward within a window, altering the information used for inferring a frame. Consequently, a local model can yield varying performance across different window offsets, and result in diverse global samples, even when encoding the same raw recording. We regard this variability as a type of model volatility that helps produce more patterns from a limited dataset.
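Because every window already produces an output for each of its frames, one pass over the video is enough to fill the global patterns for all offsets at once. The sketch below shows this under simplifying assumptions: `toy_encoder` is a hypothetical placeholder for a trained local encoder, and frames are plain integers.

```python
def toy_encoder(window):
    """Hypothetical local encoder: one (bite, drink, non-intake) vector per frame."""
    return [[float(frame), float(pos), 1.0] for pos, frame in enumerate(window)]


def build_all_offset_patterns(frames, encoder, window_size):
    """Return one global pattern per frame offset, using one pass per window."""
    n = len(frames)
    patterns = [[[0.0, 0.0, 0.0] for _ in range(n)] for _ in range(window_size)]
    for start in range(n - window_size + 1):
        outputs = encoder(frames[start:start + window_size])
        for offset in range(window_size):
            # The same offset is used for every window within one pattern.
            patterns[offset][start + offset] = list(outputs[offset])
    return patterns


frames = list(range(10))
patterns = build_all_offset_patterns(frames, toy_encoder, window_size=4)
covered = [sum(1 for vec in p if vec[2] == 1.0) for p in patterns]
```

Each offset yields a pattern in which a frame is inferred with a different amount of look-ahead and look-back, so a window of W frames gives W distinct patterns from the same video.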
Figure 6 shows an example of the fluctuation range in those global patterns constructed using one local encoder and different frame offsets on one video. We can see that the probability fluctuation between global patterns is apparent. For example, the upper bound of bite at around 930 s indicates a suspected bite gesture with considerable unnormalized probability, while the lower bound does not. Consequently, some global patterns include the suspected bite gesture while others do not. Those diverse contexts augment the global pattern dataset.
Fig. 6.

An example of the model volatility between different frame offsets. The Y axes are the probabilities of bite and drink in the top and bottom plots, respectively. The two curves within each plot are the upper and lower bounds of the global patterns constructed using different frame offsets on one video. Sixteen global patterns are generated from one trained CNN-LSTM-S variant and 16 different frame offsets. The shown period is from 900 s to 1,000 s in the p110/c1 video in the Cafeteria dataset.
Furthermore, we observed that suspicious gestures are often where the local encoder struggles, such as when a subject plays with a utensil or the video has significant noise. A global detector trained on the augmented global pattern dataset is expected to perform better on those places after learning the knowledge of gesture distribution over a meal.
2.4.2. Using Separately Trained Local Encoder Variants.
We noticed that different local encoder variants of the same network architecture but trained separately do not always make identical inferences. For each of the local encoder instantiations, we train 10 times independently to obtain 10 trained variants. Each trained variant is then used for generating global patterns. This augments the global patterns by an additional 10x.
Figure 7 shows an example of the fluctuation range in the global patterns constructed using separately trained local encoder variants. There is a clear gap between the upper and lower bounds, which indicates a significant difference between global patterns in terms of the probability ranges. For example, the upper bound for the bite probability curve indicates a suspected bite gesture with considerable unnormalized probability right after the first detected bite gesture, while the lower bound does not. Moreover, the upper bound for the drink probability curve indicates a second suspected drink gesture with an impulse of the unnormalized probability at around 940 s, while the lower bound does not. These observed differences demonstrate that the augmented global patterns are diverse in both frame-level probabilities and gesture distribution.
Fig. 7.

An example of the model volatility between independently trained local encoder variants. The Y axes are the probabilities of bite and drink in the top and bottom plots, respectively. The two curves within each plot are the upper and lower bounds of the global patterns constructed using one video and separately trained local encoder variants. Ten global patterns are generated by 10 independently trained CNN-LSTM-S variants. The window is 16 frames at 8 Hz, and the frame offset is 7. The shown period is from 900 s to 1,000 s in the p110/c1 video in the Cafeteria dataset (same as Figure 6).
One reason that independently trained local encoders perform differently is the randomness in training, such as network parameter initialization, the order in which training patterns are fed, and random choices within gradient descent. Such volatility has a random impact when a trained model makes a new inference, and can be considered a realistic and random factor in the augmentation.
The augmented train set contains 53,280 global patterns (160x the video count) when using CNN-LSTM-S as the local encoder (window: 16 frames at 8 Hz), or 43,290 global patterns (130x the video count) when using X3D-S as the local encoder (window: 13 frames at 6 Hz). Therefore, the augmented train set provides sufficient data for training a neural network for meal-length level gesture prediction.
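The augmentation factors follow from multiplying the two volatility sources: one global pattern per frame offset in the window, times one per independently trained variant. The short check below reproduces the reported totals. The training-video count is not stated directly in the text and is inferred here from the reported figures, so it should be read as an assumption.

```python
N_VARIANTS = 10  # independently trained variants per local encoder


def augmentation_factor(window_frames, n_variants=N_VARIANTS):
    """One global pattern per (frame offset, trained variant) pair."""
    return window_frames * n_variants


cnn_lstm_factor = augmentation_factor(16)  # CNN-LSTM-S window: 16 frames
x3d_factor = augmentation_factor(13)       # X3D-S window: 13 frames

# Inferred, not stated in the text: training videos implied by the totals.
train_videos = 53_280 // cnn_lstm_factor

print(cnn_lstm_factor, x3d_factor, train_videos)
print(train_videos * cnn_lstm_factor, train_videos * x3d_factor)
```

Both reported totals are consistent with the same inferred number of training videos, which supports reading 160× and 130× as window size times the number of variants.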
2.5. Global Detector
2.5.1. Dataset Split and Usage.
Data leakage is prevented by performing all data augmentation on the training and validation sets only. The validation set is held separately and is used for selecting the best epoch during the training of the local encoders. The test set is also held separately and is only used for evaluation after all training has been completed.
2.5.2. Pre-processing Global Patterns.
All values in global patterns are z-score standardized using means and standard deviations calculated from training sets.
Global patterns are then padded to a uniform length by adding zeros to the end, so that the input to the global detector network stays consistent. Since all samples are normalized using z-score standardization prior to padding, a padded value of zero does not impact the distribution. Additionally, these padded locations are masked out during the calculation of the loss function, so they do not influence the optimization process for parameters associated with the actual data locations.
We set the target global pattern lengths to slightly exceed the longest video in each dataset: 2,500 seconds (approximately 41 minutes) for the Clemson Cafeteria dataset and 1,050 seconds (approximately 18 minutes) for the EatSense dataset.
Subsequently, global patterns are downsampled to reduce sequence lengths and computational costs. Since local encoders handle frame-level information at higher frame rates and global patterns mainly convey gesture-level information, maintaining a high frame rate for global patterns is unnecessary. For the Clemson Cafeteria dataset, patterns are downsampled to 2 Hz, resulting in 5,000 data points per pattern. For the EatSense dataset, patterns are downsampled to 4 Hz, yielding 4,200 data points per pattern. The chosen sampling frequencies are specifically selected based on the durations of labeled bite and drink events in each dataset, ensuring that the downsampled data retain sufficient data points to accurately represent the underlying behaviors.
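The sequence lengths follow directly from duration times sampling rate, as the sketch below shows. The downsampling routine is only one plausible choice: it keeps the nearest earlier source sample for each output step, which also handles non-integer ratios such as 6 Hz to 4 Hz. The text does not state which downsampling method was used, so this method and the function names are assumptions.

```python
def target_points(duration_s: float, rate_hz: float) -> int:
    """Number of data points in a padded pattern of the given duration."""
    return int(round(duration_s * rate_hz))

def downsample(pattern, src_hz: float, dst_hz: float):
    """Keep the nearest earlier source sample for each output time step.

    Assumption: simple sample selection; averaging or filtering before
    selection would be an equally valid reading of the text.
    """
    if dst_hz > src_hz:
        raise ValueError("target rate must not exceed the source rate")
    n_out = int(len(pattern) * dst_hz / src_hz)
    return [pattern[int(k * src_hz / dst_hz)] for k in range(n_out)]

print(target_points(2500, 2))  # Clemson Cafeteria: 5000
print(target_points(1050, 4))  # EatSense: 4200
```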
2.5.3. Global Detector Architecture.
The global detector network can be any structure. For validation purposes, we choose a simple recurrent network as the global detector in this work. The network consists of one bidirectional LSTM layer and one dense layer. The input is a global pattern (for the Clemson Cafeteria dataset, 5,000 data points at 2 Hz), and the output is a predicted label for every data point.
2.6. Benchmark Methods
For comparative evaluation, we implemented three SOTA networks: SlowFast [11], X3D-L [10], and CNN-LSTM [7]. SlowFast uses two pathways: a slow pathway at a low frame rate to capture spatial semantics, and a fast pathway at a high frame rate to capture motion at fine temporal resolution. The two pathways are fused by several lateral connections. By processing the raw video at different temporal rates, the method lets each pathway specialize in a different aspect of video modeling. X3D is a family of 3D-CNN networks optimized for different computation budgets. Given a specific target model complexity, an X3D variant is constructed by expanding a tiny 2D-CNN architecture step by step along the axes of space, time, width and depth. CNN-LSTM adds LSTMs on top of a 2D-CNN architecture and learns spatial and temporal features sequentially. Although CNN-LSTM cannot learn low-level spatiotemporal features, the architecture leverages LSTMs to learn long-term dependencies better than 3D-CNNs.
Unlike the small-scale instantiations X3D-S and CNN-LSTM-S that were described previously and used for local encoders, these benchmark methods were implemented using large-scale instantiations to achieve the maximum performance possible by each method. Table 2 shows details of the benchmarking instantiations. SlowFast used ResNet50 as the backbone. The speed ratio between the fast and slow pathways is set to 4. We follow the original paper [11] to build lateral connections and non-local blocks. X3D-L differs from X3D-S in that it has deeper and wider structures and more input frames, all of which are optimized for achieving the best accuracy with the larger available model complexity. Rouast et al. [31] built a ResNet50 backboned CNN-LSTM model for eating gesture detection tasks, and achieved good performance. We re-implement their model for result benchmarking.
Table 2.
Instantiations for Benchmarking
Input windows: SlowFast [11], slow: 32 frames at 16 Hz, fast: 8 frames at 4 Hz; X3D-L [10], 16 frames at 6 Hz; Rouast et al. CNN-LSTM [31], 16 frames at 8 Hz.
| Stage | SlowFast kernels (slow) | SlowFast kernels (fast) | SlowFast output size | X3D-L kernels | X3D-L output size | CNN-LSTM kernels | CNN-LSTM output size |
|---|---|---|---|---|---|---|---|
| input layer | - | - | slow: 8 × 224², fast: 32 × 224² | - | 16 × 312² | - | 16 × 224² |
| conv1 | 1 × 7², 64 channels, stride 1 × 2² | 5 × 7², 8 channels, stride 1 × 2² | slow: 8 × 112², fast: 32 × 112² | 1 × 3², 24 channels, stride 1 × 2² | 16 × 156² | 7², 64 channels, stride 1 × 2² | 16 × 112² |
| pool1 | 1 × 3², stride 1 × 2² | 1 × 3², stride 1 × 2² | slow: 8 × 56², fast: 32 × 56² | - | - | 3², stride 1 × 2² | 16 × 56² |
| res2 | | | slow: 8 × 56², fast: 32 × 56² | | 16 × 78² | | 16 × 56² |
| res3 | | | slow: 8 × 28², fast: 32 × 28² | | 16 × 39² | | 16 × 28² |
| res4 | | | slow: 8 × 14², fast: 32 × 14² | | 16 × 20² | | 16 × 14² |
| res5 | | | slow: 8 × 7², fast: 32 × 7² | | 16 × 10² | | 16 × 7² |
| conv2 | - | - | - | | 16 × 5² | - | - |
| pool2 | 8 × 7², stride 1 × 1² | 32 × 7², stride 1 × 1² | slow: 1 × 1², fast: 1 × 1² | 16 × 10², stride 1 × 1² | 1 × 1² | 1 × 7², stride 1 × 1² | 16 × 1² |
| conc | - | - | 1 × 1² | - | - | - | - |
| flatten | - | - | 1 × 1 | - | 1 × 1 | - | 16 × 1 |
| lstm | - | - | - | - | - | 128 units | 16 × 1 |
| dense | 3 classes | 3 classes | 3 × 1 | 3 classes | | 3 classes per timestep | 16 × 3 classes |
Kernel, stride and output sizes are expressed in temporal size × spatial size.
2.7. Training
2.7.1. Model Initialization.
All window-based models in this study leverage transfer learning, while global detector models are trained from scratch.
Specifically, X3D families and SlowFast are initialized using corresponding models trained on Kinetics [3], a large dataset for video action recognition. ResNet backbones of CNN-LSTM families are initialized using ResNet34 or ResNet50 trained on the ImageNet dataset [6]. LSTM parts in CNN-LSTM families and global detector models are initialized using weights with random normal distribution.
2.7.2. Training Settings.
We use the Adam optimizer to train models. The learning rate for global detector models is 0.001, with an exponentially decaying rate of 0.9. Other models are initialized via transfer learning, and the learning rate is 0.0001, with an exponentially decaying rate of 0.9.
The batch size is 32 for global detector models and 8 for all window-based models. Each model is trained for 50 epochs, and the maximum steps within each epoch are 10,000. Afterward, inspired by Rouast et al. [31], the unweighted average recall on the validation set is used to decide the best epochs.
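Unweighted average recall (UAR) is the mean of the per-class recalls, so each class counts equally regardless of how many frames it has. The sketch below shows the metric and how it could drive epoch selection; the function names and the handling of classes with no validation samples (they are skipped) are assumptions for illustration.

```python
def unweighted_average_recall(y_true, y_pred, classes):
    """Mean of per-class recalls; classes absent from y_true are skipped."""
    recalls = []
    for c in classes:
        support = sum(1 for t in y_true if t == c)
        if support == 0:
            continue
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)

def best_epoch(val_uar_per_epoch):
    """Index of the epoch with the highest validation UAR (first one on ties)."""
    return max(range(len(val_uar_per_epoch)), key=val_uar_per_epoch.__getitem__)

# A model that is 90% accurate on imbalanced data can still have a modest UAR.
y_true = ["NI"] * 8 + ["B", "B"]
y_pred = ["NI"] * 8 + ["B", "NI"]
print(unweighted_average_recall(y_true, y_pred, ("B", "D", "NI")))  # 0.75
```

This is why UAR suits intake gesture data, where non-intake frames greatly outnumber bite and drink frames.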
The categorical cross-entropy loss is used to train models. Class weights are applied to the cross-entropy loss when calculating batch loss. Class weights are calculated on the training set as:
| (1) |
where w_i and N(i) are the weight and the number of samples for class i, and a is set to 10,000 to avoid very small weights.
To mitigate overfitting, we apply batch normalization [18] after every convolution layer and LSTM layer. Additionally, dropout with a rate of 0.5 is applied before the last dense layer of each model.
The video windows used to train the local encoder are augmented using standard brightness jittering and random horizontal flipping [31].
We trained our models using two Nvidia V100 GPUs. Table 3 shows the time spent on one training phase for each model. Local encoders X3D-S and CNN-LSTM-S trained much faster than full-size video action recognition networks.
Table 3.
Training Time for Models
| Models | Training time (hrs) |
|---|---|
| X3D-S (a) | 24 |
| CNN-LSTM-S (a) | 18 |
| Global detector | 5 |
| SlowFast | 80 |
| X3D-L | 118 |
| Rouast et al. CNN-LSTM | 30 |
Reported hours are for training a single model instance.
(a) These models are scaled down, and the training of multiple variants can occur in parallel given sufficient computational resources.
2.8. Inference and Testing
During the testing phase, we integrate one trained local encoder variant with a trained global detector into a pipeline that delivers meal-length predictions from videos. Local encoders are designed to output logits for every frame within an input window, primarily to generate more global patterns. Because of the sliding window approach, this sequence-to-sequence prediction mechanism classifies each frame multiple times, once for every window in which it appears. Unlike benchmark models, which operate on a sequence-to-one basis and make a single classification prediction per input window, the local encoders can therefore produce different predictions for the same frame. To mitigate fluctuation caused by different window offsets, we implement a maximum vote strategy as a post-processing step. This strategy fuses outputs from different window offsets, as shown in Figure 8. In cases of a tie among three classes, we prioritize classification as bite, drink, and non-intake, in that order, to emphasize intake detection.
Fig. 8.

An example of constructing meal-length frame predictions using local encoder predictions. The window size is 8 frames. B, D, and NI stand for the bite, drink and non-intake classes, respectively.
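The vote fusion across overlapping windows can be sketched as below. This is an illustrative reading of the strategy, not the original code: it assumes votes are cast on per-frame class labels (rather than on summed logits), that window i starts at frame i × stride, and that frames covered by no window default to non-intake. The function and constant names are hypothetical.

```python
from collections import Counter

# Tie-break order stated in the text: bite, then drink, then non-intake.
PRIORITY = ("B", "D", "NI")

def fuse_window_predictions(window_preds, num_frames, stride=1):
    """Fuse per-frame labels from overlapping windows by maximum vote.

    window_preds[i] holds the per-frame labels of the window that starts
    at frame i * stride (an assumption about the window layout).
    """
    votes = [Counter() for _ in range(num_frames)]
    for w, labels in enumerate(window_preds):
        start = w * stride
        for j, label in enumerate(labels):
            if start + j < num_frames:
                votes[start + j][label] += 1

    fused = []
    for counter in votes:
        if not counter:
            fused.append("NI")  # assumption: uncovered frames are non-intake
            continue
        top = max(counter.values())
        # The first class in priority order that reaches the top count wins.
        fused.append(next(c for c in PRIORITY if counter[c] == top))
    return fused

windows = [
    ["NI", "NI", "B"],  # frames 0-2
    ["NI", "B", "B"],   # frames 1-3
    ["D", "B", "NI"],   # frames 2-4
    ["B", "D", "NI"],   # frames 3-5
]
print(fuse_window_predictions(windows, num_frames=6))
# ['NI', 'NI', 'B', 'B', 'D', 'NI']
```

In this toy input, frame 4 receives one non-intake vote and one drink vote, so the tie-break rule resolves it as a drink.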
Benchmark models are direct instantiations from the original papers and make one classification prediction for an input window. Therefore, we use these models to predict the last frame of each window. After the sliding window travels through one video with a stride of 1, the predictions form a frame-wise prediction sequence for the video.
For a comprehensive evaluation of the global detector performance, we independently test all potential combinations of local encoder variants with the trained global detector, and report performance distributions including means and standard deviations. Additionally, the local encoder variant that yields the best validation results is selected for combination with the global detector when benchmarking against other studies.
Meal-length frame-level predictions are converted to gesture-level predictions by searching for the start and end times of each detected gesture. We apply a smoothing filter on gesture-level predictions to reduce noise and jitter. The filter fills small gaps between two consecutive intake gestures with the same class and removes extremely short intake gestures. The thresholds for determining a small gap and a short gesture are both 0.5 second, which is unlikely to filter out natural intake gestures.
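A minimal sketch of this conversion and smoothing step follows. It assumes that gaps are filled before short gestures are removed and that both thresholds use strict comparisons; the text fixes only the 0.5 s values, so the ordering, the comparisons, and the function names are assumptions.

```python
def frames_to_gestures(labels, rate_hz):
    """Turn frame labels into (label, start_s, end_s) runs, skipping 'NI'."""
    gestures, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != "NI":
                gestures.append((labels[start], start / rate_hz, i / rate_hz))
            start = i
    return gestures

def smooth(gestures, min_gap_s=0.5, min_len_s=0.5):
    """Merge same-class neighbors separated by a small gap, then drop short gestures."""
    merged = []
    for label, start, end in gestures:
        prev = merged[-1] if merged else None
        if prev and prev[0] == label and start - prev[2] < min_gap_s:
            merged[-1] = (label, prev[1], end)  # fill the gap
        else:
            merged.append((label, start, end))
    return [g for g in merged if g[2] - g[1] >= min_len_s]

# 4 Hz toy sequence: two bite fragments 0.25 s apart and a 0.25 s drink blip.
frames = ["B"] * 3 + ["NI"] + ["B"] * 3 + ["NI"] * 4 + ["D"] + ["NI"] * 2
raw = frames_to_gestures(frames, rate_hz=4)
print(raw)          # [('B', 0.0, 0.75), ('B', 1.0, 1.75), ('D', 2.75, 3.0)]
print(smooth(raw))  # [('B', 0.0, 1.75)]
```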
2.9. Evaluation Metrics
The meal-length prediction of a model is compared with the corresponding ground truth to calculate the model performance. We evaluate model performance on both gesture classification and localization on the time span.
Our evaluation scheme combines the scheme our group used for measuring the inter-rater reliability during ground truth labeling [38], and the intake gesture counting scheme used by Kyritsis et al. [20] and Rouast et al. [31]. Specifically, when a predicted gesture overlaps a ground truth gesture with the same label, if the predicted gesture is the first detection within the range of the ground truth gesture and the overlap ratio is more than 50%, the predicted gesture is considered true positive (TP). Otherwise, the predicted gesture is considered false positive (FP). When the predicted gesture and corresponding ground truth gesture meet the overlap criteria but are with different labels, the predicted gesture is considered as a confused detection. A missed detection happens when there is a ground truth gesture but no predicted gesture meeting the overlap criteria. The examples in Figure 9 illustrate those definitions. We calculate precision, recall and F1 score for bite and drink gestures. Note that non-intake cannot be evaluated at the gesture level as it is the background value.
Fig. 9.

Different cases when matching predicted gestures with ground truth. Blocks in yellow and blue stand for bite and drink gestures, respectively.
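The matching rules can be sketched in code as follows. Several details are assumptions made for illustration because the text leaves them open: the overlap ratio is taken relative to the predicted gesture's duration, precision counts every non-TP prediction of a class (including confused ones) against that class, and recall is TP divided by the number of ground truth gestures of the class. Function names are hypothetical.

```python
def overlap(a, b):
    """Overlap in seconds between two (label, start_s, end_s) gestures."""
    return max(0.0, min(a[2], b[2]) - max(a[1], b[1]))

def match_gestures(preds, truths, min_ratio=0.5):
    """Label each prediction 'TP', 'FP' or 'confused'; also return missed truths."""
    claimed = set()  # truths that already have a true positive
    covered = set()  # truths hit by any prediction meeting the overlap criterion
    outcomes = []
    for pred in sorted(preds, key=lambda g: g[1]):
        duration = pred[2] - pred[1]
        outcome = "FP"
        if duration > 0:
            for i, truth in enumerate(truths):
                # Assumption: ratio is measured against the predicted duration.
                if overlap(pred, truth) / duration <= min_ratio:
                    continue
                covered.add(i)
                if pred[0] != truth[0]:
                    outcome = "confused"
                elif i not in claimed:  # only the first detection is a TP
                    claimed.add(i)
                    outcome = "TP"
                break
        outcomes.append((pred, outcome))
    missed = [t for i, t in enumerate(truths) if i not in covered]
    return outcomes, missed

def precision_recall_f1(outcomes, truths, label):
    """Gesture-level precision, recall and F1 for one class."""
    tp = sum(1 for g, o in outcomes if g[0] == label and o == "TP")
    n_pred = sum(1 for g, _ in outcomes if g[0] == label)
    n_true = sum(1 for t in truths if t[0] == label)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

truths = [("B", 0.0, 4.0), ("D", 10.0, 14.0), ("B", 20.0, 24.0)]
preds = [("B", 0.5, 1.5), ("B", 2.0, 3.0), ("B", 10.5, 13.0), ("B", 30.0, 31.0)]
outcomes, missed = match_gestures(preds, truths)
print([o for _, o in outcomes])  # ['TP', 'FP', 'confused', 'FP']
print(missed)                    # [('B', 20.0, 24.0)]
```

In this toy case the second bite prediction is an FP because the first one already claimed the same ground truth gesture, which mirrors the first-detection rule in the text.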
3. Results and Evaluation
We conduct three sets of comparative evaluations. First, we evaluate the performance of each local encoder with and without the proposed framework on two datasets: Clemson Cafeteria and EatSense. This comparison reveals how much global information over a meal can improve gesture detection performance. Second, we compare the performance of our two-stage framework (local encoder and global detector) with other SOTA video action recognition models on the Clemson Cafeteria dataset, the largest intake gesture dataset to date. Third, we investigate how our new method performs across 264 participants on the Clemson Cafeteria dataset. This result helps us assess the generalizability of the models.
3.1. Influence of Using Global Detectors
Table 4 shows the evaluation metrics on Clemson Cafeteria dataset for using local encoders alone, compared to combining them with a global detector. We can see that adding a global detector led to an increase in the F1 scores for both bite and drink gestures, for both the CNN-LSTM-S and X3D-S local encoders, with the increase ranging from 0.02 to 0.12. This shows that global meal-length pattern analysis provided a consistent improvement in intake gesture detection. Additionally, recalls tended to remain about the same, but precisions were greatly improved, with the largest increases occurring for drink gestures (0.16 and 0.23 improvement). This reveals that one main advantage of using global analysis is in reducing FP detections. The global detector is able to use a longer context to determine if transient detections by the local encoder are actual gestures or not.
Table 4.
Model Performance on Clemson Cafeteria Dataset with and without the Proposed Framework
| Method | F1 (Drink) | Precision (Drink) | Recall (Drink) |
|---|---|---|---|
| Window-based | 0.75 | 0.73 | 0.78 |
| Proposed framework | 0.84 | 0.89 | 0.80 |

| Method | F1 (Drink) | Precision (Drink) | Recall (Drink) |
|---|---|---|---|
| Window-based | 0.73 | 0.68 | 0.80 |
| Proposed framework | 0.85 | 0.91 | 0.80 |
Reported F1, precision and recall values are averages from independently testing each of 10 trained local encoder variants. Applying our framework on a window-based backbone resulted in significant improvements in F1 scores and most other performance metrics. The best numbers in key indices are bolded.
Table 5 shows the evaluation metrics on the EatSense dataset, comparing the performance of local encoders alone to their combination with a global detector. We can see that F1 scores for both bite and drink gestures increased by 0.03 to 0.14 after integrating a global detector into window-based backbones. Recalls and precisions also improved, especially for the drink class. This provides additional evidence for the effectiveness of our global analysis framework.
Table 5.
Model Performance on the EatSense Dataset with and without the Proposed Framework
| Method | F1 (Drink) | Precision (Drink) | Recall (Drink) |
|---|---|---|---|
| Window-based | 0.66 | 0.55 | 0.83 |
| Proposed framework | 0.74 | 0.64 | 0.90 |

| Method | F1 (Drink) | Precision (Drink) | Recall (Drink) |
|---|---|---|---|
| Window-based | 0.59 | 0.46 | 0.83 |
| Proposed framework | 0.73 | 0.63 | 0.89 |
Reported F1, precision and recall values are averages from independently testing each of 10 trained local encoder variants. Applying our framework on a window-based backbone resulted in significant improvements in F1 scores and most other performance metrics. The best numbers in key indices are bolded.
When comparing Table 4 with Table 5, we can see that window-based backbones performed less effectively on the EatSense dataset than on the Clemson Cafeteria dataset. A primary factor is that the two datasets define the bite and drink classes differently. The Clemson Cafeteria dataset includes accessory sub-movements, such as moving hands toward the mouth, within the intake event labels. In contrast, the EatSense dataset strictly defines intake events as the brief periods when food or beverage enters the mouth. This labeling results in shorter intake gestures in the EatSense dataset, which makes detection harder. Despite these challenges, our global detector consistently enhanced the performance of backbone models across both datasets.
Figure 10 shows an example of how our proposed global detector reduced FP detections compared to the local encoder on the Clemson Cafeteria dataset. The first FP detection was isolated and did not resemble a typical gesture distribution, and the second FP detection from the local encoder was likely too close to the following detection to be real. In the video, the low-light environment and clothing similar in color to the background decreased the visibility of tools (e.g., cups) and hand movements, making it challenging for the model to identify gestures accurately. However, the global detector effectively eliminated FP detections by analyzing global clues and frame-level probabilities in combination.
Fig. 10.

An example of the effectiveness of the global detector in reducing FP detections of drink gestures. By considering meal-length information and behavior patterns, the global detector successfully eliminated two FPs in the video with index “p259/c1” from the Clemson Cafeteria dataset.
Table 6 shows the standard deviations of evaluation metrics across ten local encoder variants trained independently with the same model structure, using Clemson Cafeteria dataset. We can see that by wrapping a local encoder into our framework, the standard deviations of most evaluation metrics were reduced. This suggests that our framework effectively stabilized the performance of local encoders between different training phases, making it easier to obtain a well-trained model irrespective of the starting point of training.
Table 6.
Model Stability on Clemson Cafeteria Dataset with and without the Proposed Framework
| Method | std: F1 (Drink) | std: Precision (Drink) | std: Recall (Drink) |
|---|---|---|---|
| Window-based | 0.06 | 0.09 | 0.04 |
| Proposed framework | 0.02 | 0.02 | 0.03 |

| Method | std: F1 (Drink) | std: Precision (Drink) | std: Recall (Drink) |
|---|---|---|---|
| Window-based | 0.05 | 0.08 | 0.03 |
| Proposed framework | 0.02 | 0.03 | 0.02 |
Reported standard deviations are calculated from independently testing each of 10 trained local encoder variants. A global detector helped reduce fluctuations between different trained window-based variants in most performance metrics (smaller standard deviations), indicating that the framework stabilizes model performance across different training runs. The best numbers in key indices are bolded.
3.2. Comparison with SOTA Models
We compare our methods with current SOTA networks on the Clemson Cafeteria dataset, which is the largest intake gesture dataset with untrimmed videos to date. Table 7 shows the results. For our results, we used CNN-LSTM-S and X3D-S as local encoders. Each local encoder was trained 10 times for data augmentation, so our pipeline can potentially be evaluated 10 times by combining each variant with our global detector. We report the best pipeline variant according to performance on the validation set. We can see that the F1 scores of both our framework instances were higher than all the benchmarks for both bite and drink gestures, with the improvement ranging from 0.03 to 0.06 for bite gestures and from 0.08 to 0.21 for drink gestures. It is thus reasonable to conclude that global patterns over meal episodes benefit gesture detection and improve upon window-based approaches.
Table 7.
Our Results Compared to SOTA Models on Clemson Cafeteria Dataset
| Model | #Params (M) | F1 (Bite) | F1 (Drink) | Precision (Bite) | Precision (Drink) | Recall (Bite) | Recall (Drink) |
|---|---|---|---|---|---|---|---|
| Benchmark: X3D-L | 5.34 | 0.90 | 0.77 | 0.86 | 0.70 | 0.94 | 0.85 |
| Benchmark: SlowFast | 33.65 | 0.88 | 0.78 | 0.83 | 0.74 | 0.93 | 0.82 |
| Benchmark: Rouast et al. CNN-LSTM | 24.62 | 0.89 | 0.67 | 0.86 | 0.58 | 0.92 | 0.79 |
| Ours: CNN-LSTM-S + global detector | 21.65 (a) | 0.94 | 0.86 | 0.97 | 0.90 | 0.92 | 0.83 |
| Ours: X3D-S + global detector | 3.02 (a) | 0.93 | 0.88 | 0.95 | 0.95 | 0.91 | 0.81 |
Our framework utilizing X3D-S as the backbone achieved significantly higher F1 scores while using a much smaller model size during the testing phase. The best numbers in key indices are bolded.
(a) The reported number of parameters corresponds to one inference pipeline, including one local encoder and one global detector.
The combination of X3D-S and the global detector achieved the highest class-balanced F1 scores and precisions over bite and drink gestures. X3D-L had higher recalls on both gestures than the combined detectors, but at the cost of much lower precisions. Therefore, it can be concluded that X3D-L tended to make more gesture detections, which added some TPs but proportionally more FPs. In general, the combined detector achieved better overall performance than these SOTA networks.
Considering that our framework implementation used downgraded window-based networks as local encoders (small-scale instantiations), it is reasonable to suppose that the performance improvement might be even greater if the local encoders could be trained with large-scale instantiations. This is currently infeasible due to the high computational burden of repeatedly training these models for global pattern data augmentation.
3.3. Performance on Different Participants
In this section, we analyze the performance improvement between individuals using the Clemson Cafeteria dataset, which is the largest dataset in the field to date. As detailed in Section 2.2, the test set was randomly split and includes 39 participants.
Figure 11 shows a histogram of F1 scores per participant from the X3D-S network, with and without integrating the proposed global detector.
Fig. 11.

F1 score distribution of participants from Clemson Cafeteria dataset using X3D-S alone (bottom row) and combined with a global detector (top row). Our framework with a global detector helped concentrate the subject-wise results in the higher F1 score range, indicating its ability to stabilize model performance across different subjects.
It can be observed that after using a global detector, the F1 score distributions concentrated more on higher values. For example, in bite detection, the percentage of participants with F1 scores higher than 0.9 increased from 46% to 74% after utilizing global detectors. In drink detection, the percentage of participants with F1 scores higher than 0.8 increased from 53% to 82%. This suggests that the application of global patterns contributes to consistently improved performance across different individuals. One main reason is that meal videos have general episode-level patterns that a global detector can learn and then leverage when making inferences on other individuals.
We also observe that adding a global detector did not always yield significant F1 score improvement on every participant, and several small bins remained below the high F1 scores mentioned earlier. Therefore, we conducted a demographic analysis to explore the varying performance improvements provided by the global detector across different participant groups.
Table 8 shows the distribution of participants across different demographic categories used for training, validation, and testing. We can see that the overall distribution of the testing set closely mirrors that of the training and validation sets, with small variations observed in minor groups. This indicates that the dataset splitting process did not introduce significant demographic bias during the training of models.
Table 8.
Distribution of Participants across Demographics in the Training, Validation, and Testing Sets Split from the Clemson Cafeteria Dataset
| Age | <20 | 21–30 | 31–40 | 41–50 | >50 |
|---|---|---|---|---|---|
| Train & val | 45 | 131 | 20 | 27 | 18 |
| Test | 9 | 13 | 8 | 6 | 3 |

| BMI (a) | Underweight | Normal | Overweight | Obese |
|---|---|---|---|---|
| Train & val | 4 | 144 | 56 | 35 |
| Test | 1 | 23 | 11 | 4 |

| Ethnicity | Caucasian | Asian / Pacific Islander | African-American | Hispanic | American Indian / Alaska Native | Other |
|---|---|---|---|---|---|---|
| Train & val | 174 | 25 | 20 | 8 | 1 | 13 |
| Test | 7 | 4 | 6 | 3 | 1 | 2 |
(a) The definition of BMI categories follows the WHO standard [5]: underweight (<18.5), normal range (18.5–24.9), overweight (25.0–29.9), obese (>=30.0).
Figure 12 shows the performance improvement of different sub-populations after applying our framework to the X3D-S model. Across ethnicities, F1 scores for African-American subjects had the largest increases (7% for bite detection, 25% for drink detection). This may be because darker skin color is more likely to blend into the background, which hinders tasks that involve recognizing facial expressions and hand motions in video. In this case, the use of global patterns is most helpful. Across all ages, bite gesture F1 scores increased by 2%–6%. Drink gesture F1 scores increased more for younger people (22% for subjects younger than 20 years old) than for older people (9% for subjects older than 50 years old). This may be due to a higher consistency in the global distribution of drink intake during a meal (e.g., “washing down foods” towards the end of a meal) for younger people compared to older people. Across all BMI categories, F1 scores for bite and drink gestures increased by 2%–5% and 1%–7%, respectively. However, there was no clear correlation between BMI groups and F1 score improvement.
Fig. 12.

F1 score improvements on different sub-populations from the Clemson Cafeteria dataset obtained by combining a global detector with one X3D-S local model variant. Numbers in parentheses are the numbers of participants in the corresponding categories.
4. Conclusion
This study presented a new approach to detecting intake gestures in videos. We analyzed meal-length patterns that improve the detection of individual gestures. To train this classifier, we described an efficient augmentation method that boosts meal-length patterns by 130x–160x by making use of model volatility. With sufficient meal-length patterns after the augmentation, a second network was trained to model meal-length patterns. Integrating all our ideas, we showed an end-to-end pipeline for intake gesture detection that achieved better overall performance than related SOTA networks, with F1 scores increasing by 0.03–0.06 for bite gestures and 0.08–0.21 for drink gestures.
We hypothesize that the improvement on drink gestures was larger than the improvement on bite gestures because drinks tend to follow a more consistent global pattern between individuals. For example, many people tend to consume more of their beverage towards the end of a meal (the so-called “washing down” of food).
In addition to benchmarking SOTA methods, our experiments also investigated the effectiveness of learning global patterns. To make the augmentation phase computationally feasible, we downsized two SOTA networks and evaluated their performance before and after incorporating them into our framework. Our results showed that adding a second stage that considered meal-length patterns improved F1 scores on gesture recognition, particularly for drink gestures. Furthermore, our framework stabilized model performance across different training runs and individuals.
All of these results confirm our hypothesis that individuals share similar behavior patterns during a meal timespan, which can be learned by our global detector and can improve the accuracy of intake gesture recognition. These findings are consistent with other researchers who have studied long-term video content. For example, Yang et al. [46] consistently obtained over 2% video-level accuracy gain on Kinetics-400 dataset using a collaborative memory pool that captured dependencies across multiple windowed video clips. Similarly, Tang et al. [41] achieved 2.7% better mAP@0.5 on THUMOS’14 dataset by deploying a global attention mechanism on their base model to capture global context from several minutes of video.
An automatic eating activity monitoring system could aid a healthcare professional in delivering guidance and treatment. By deploying a camera within users’ homes, such a system can observe their natural eating behaviors in daily life and automatically calculate summaries of their eating patterns, such as eating rate, meal duration, kilocalories consumed [36], and typical drink-to-food ratio. Based on this information, healthcare professionals could provide tailored interventions remotely without imposing heavy recall tasks on users. For example, clinicians can set up real-time triggers in the system to alert users when they spend too much time eating [40] or eat too fast [13]. Furthermore, healthcare consultants can design time-of-day based strategies for users and provide timely adjustments, such as intermittent fasting or adjusting the daily distribution of food intake [1], based on the monitored kilocalories consumed by users.
The insights provided by video data facilitate the development of robust intake gesture detection algorithms, which in turn enables the assessment of other eating activities. By integrating global pattern analysis, our framework demonstrates increased accuracies in recognizing intake gestures on multiple datasets. In addition to achieving SOTA overall accuracies, our framework consistently delivers high performance across participants of diverse demographics. For instance, in testing on the Clemson Cafeteria dataset, 74% of participants achieved F1 scores exceeding 0.9 for bite detection, and 82% exceeded 0.8 for drink detection. This suggests promising potential for real-world deployment in individuals’ homes. Additionally, our streamlined inference pipeline features significantly reduced model sizes compared to existing solutions, and therefore requires fewer computational resources. This makes it feasible to deploy a user-friendly monitoring system without costly hardware investments.
Our proposed framework could enhance the accuracy of other video-based behavior recognition problems, especially in videos with a consistent theme such as sports or dance. In activities like a basketball game, it is likely that different subjects follow similar routines throughout the activity. Although the clue from the global view alone is not sufficient to model individual behaviors, it provides useful information about the location of behaviors over the duration of the activity. Analyzing long videos with an end-to-end neural network is computationally expensive, and current researchers can only model videos up to several minutes long. Our framework offers an efficient way to analyze much longer videos while preserving the integrity of the video-length patterns. Our local encoder backbone can be adapted from most SOTA behavior recognition models. And our global detector helps correct many false detections due to noise in transient windows used by those models. Our experimental results demonstrate the effectiveness of a global detector for intake gesture recognition. Since our framework analyzes raw videos without any specific design for meal activity, we expect it to benefit behavior recognition tasks in other activities.
Although our method has shown promising results, it has some limitations. In particular, our method is not suitable for real-time applications where frames are captured from a continuous stream, as it benefits from analyzing a completed video as one sample. However, this limitation does not apply in offline scenarios where videos are recorded and post-processed for accurate behavior logging.
There are several research questions to explore regarding our global pattern augmentation method. Our approach relies on retraining the local encoder multiple times and utilizing different frame offsets to generate sufficient global patterns for training. Each independent execution of training may be considered a new interpretation (model) of what was learned from real data. This idea warrants a much deeper theoretical and experimental evaluation across multiple types of classification problems. There may be limitations in increasing dataset diversity, particularly in scenarios where the dataset is too small or when the local model lacks volatility. Understanding the strengths and limitations of our augmentation method is an important question for future work.
This study found differences in certain demographics in terms of increases in accuracy of gesture recognition from the use of global patterns. Specifically, global patterns benefited African Americans more than other ethnicities, and younger people more than other age groups. However, the dataset was not collected with balanced demographics intended to study these questions, so these analyses are preliminary. Future work could explore these differences more systematically.
There are several other possible improvements to our framework that can be explored in future work. First, instead of using frame-level unnormalized probabilities as the features for modeling meal-length patterns, we could use earlier-stage features, such as those from the middle stages of the encoding network. These would have higher feature dimensions and may convey more information to the global detector. A challenge is that the global detector would need to be more complex to process the increased input dimensions. Second, the model could be trained for each individual instead of on all data. This could allow the classifier to learn personalized meal-length patterns. In the experimental results, we saw that a few participants had distinct intake behaviors. Networks trained for these particular participants might improve gesture detection accuracy. However, more data are needed to train personalized networks. All these ideas remain for future work.
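The trade-off in the first idea above (richer features versus a larger input to the global detector) can be made concrete with a small back-of-the-envelope calculation. The Python sketch below is only illustrative: the helper names are hypothetical, the 8 Hz rate is borrowed from the clip sampling mentioned for earlier window-based methods, and the feature widths of 3 (for example, one unnormalized score each for bite, drink, and other) and 512 (an intermediate-layer feature vector) are assumed values, not figures reported in this article.

```python
def sequence_shape(meal_minutes, frame_rate_hz, feature_dim):
    """Shape (num_frames, feature_dim) of one meal-length feature sequence."""
    num_frames = int(meal_minutes * 60 * frame_rate_hz)
    return num_frames, feature_dim


def input_size(meal_minutes, frame_rate_hz, feature_dim):
    """Total number of values the global detector would receive for one meal."""
    num_frames, dim = sequence_shape(meal_minutes, frame_rate_hz, feature_dim)
    return num_frames * dim


if __name__ == "__main__":
    # Assumed setting: a 40-minute meal sampled at 8 Hz.
    print(input_size(40, 8, 3))    # 57600 values with 3 scores per frame
    print(input_size(40, 8, 512))  # 9830400 values with 512-d features per frame
```

With these assumed numbers, moving from per-frame scores to intermediate features increases the input by a factor of roughly 170, which illustrates why the global detector would need more capacity, and likely more training data, to exploit them.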
CCS Concepts:
• Computing methodologies → Neural networks; Temporal reasoning; Activity recognition and understanding; • Applied computing → Health care information systems; Health informatics;
Acknowledgments
This work was supported by National Institutes of Health (NIH) grant R01DK135679 and National Science Foundation (NSF) grant 2242812.
References
- [1] Bellisle France. 2004. Impact of the daily meal pattern on energy balance. Scandinavian Journal of Nutrition 48, 3 (2004), 114–118.
- [2] Buzzelli Marco, Albé Alessio, and Ciocca Gianluigi. 2020. A vision-based system for monitoring elderly people at home. Applied Sciences 10, 1 (2020), 374.
- [3] Carreira Joao and Zisserman Andrew. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.
- [4] Chou Tommy, Hoover Adam W., Goldstein Stephanie P., Greco-Henderson Dante, Martin Corby K., Raynor Hollie A., Muth Eric R., and Thomas J. Graham. 2024. An explanation for the accuracy of sensor-based measures of energy intake: Amount of food consumed matters more than dietary composition. Appetite 194 (2024), 107176.
- [5] WHO Consultation. 2000. Obesity: Preventing and managing the global epidemic. World Health Organization Technical Report Series 894 (2000), 1–253.
- [6] Deng Jia, Dong Wei, Socher Richard, Li Li-Jia, Li Kai, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
- [7] Donahue Jeffrey, Hendricks Lisa Anne, Guadarrama Sergio, Rohrbach Marcus, Venugopalan Subhashini, Saenko Kate, and Darrell Trevor. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2625–2634.
- [8] Dong Yujie, Hoover Adam, Scisco Jenna, and Muth Eric. 2012. A new method for measuring meal intake in humans via automated wrist motion tracking. Applied Psychophysiology and Biofeedback 37, 3 (2012), 205–215.
- [9] Dong Yujie, Scisco Jenna, Wilson Mike, Muth Eric, and Hoover Adam. 2013. Detecting periods of eating during free-living by tracking wrist motion. IEEE Journal of Biomedical and Health Informatics 18, 4 (2013), 1253–1260.
- [10] Feichtenhofer Christoph. 2020. X3D: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 203–213.
- [11] Feichtenhofer Christoph, Fan Haoqi, Malik Jitendra, and He Kaiming. 2019. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6202–6211.
- [12] Gao Yang, Zhang Ning, Wang Honghao, Ding Xiang, Ye Xu, Chen Guanling, and Cao Yu. 2016. IHear food: Eating detection using commodity Bluetooth headsets. In Proceedings of the IEEE 1st International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE). IEEE, 163–172.
- [13] Guss JL and Kissileff HR. 2000. Microstructural analyses of human ingestive patterns: From description to mechanistic hypotheses. Neuroscience & Biobehavioral Reviews 24, 2 (2000), 261–268.
- [14] Han Yue, Yarlagadda Sri Kalyan, Ghosh Tonmoy, Zhu Fengqing, Sazonov Edward, and Delp Edward J. 2021. Improving food detection for images from a wearable egocentric camera. Electronic Imaging 2021, 8 (2021), 286–1.
- [15] Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
- [16] Hossain Delwar, Ghosh Tonmoy, and Sazonov Edward. 2020. Automatic count of bites and chews from videos of eating episodes. IEEE Access 8 (2020), 101934–101945.
- [17] Huang Qianyi, Wang Wei, and Zhang Qian. 2017. Your glasses know your diet: Dietary monitoring using electromyography sensors. IEEE Internet of Things Journal 4, 3 (2017), 705–712.
- [18] Ioffe Sergey and Szegedy Christian. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, 448–456.
- [19] Jourdan Théo, Boutet Antoine, Bahi Amine, and Frindel Carole. 2020. Privacy-preserving IoT framework for activity recognition in personal healthcare monitoring. ACM Transactions on Computing for Healthcare 2, 1 (2020), 1–22.
- [20] Kyritsis Konstantinos, Diou Christos, and Delopoulos Anastasios. 2019. Modeling wrist micromovements to measure in-meal eating behavior from inertial sensor data. IEEE Journal of Biomedical and Health Informatics 23, 6 (2019), 2325–2334.
- [21] Luktuke Yadnyesh Y. and Hoover Adam. 2020. Segmentation and recognition of eating gestures from wrist motion using deep learning. In Proceedings of the IEEE International Conference on Big Data (Big Data). IEEE, 1368–1373.
- [22] Martini Enrico, Boldo Michele, Aldegheri Stefano, Valè Nicola, Filippetti Mirko, Smania Nicola, Bertucco Matteo, Picelli Alessandro, and Bombieri Nicola. 2022. Enabling gait analysis in the telemedicine practice through portable and accurate 3D human pose estimation. Computer Methods and Programs in Biomedicine 225 (2022), 107016.
- [23] Mennella Ciro, Maniscalco Umberto, De Pietro Giuseppe, and Esposito Massimo. 2023. A deep learning system to monitor and assess rehabilitation exercises in home-based remote and unsupervised conditions. Computers in Biology and Medicine 166 (2023), 107485.
- [24] World Health Organization. 2020. Noncommunicable Diseases: Progress Monitor 2020.
- [25] Päßler Sebastian and Fischer Wolf-Joachim. 2013. Food intake monitoring: Automated chew event detection in chewing sounds. IEEE Journal of Biomedical and Health Informatics 18, 1 (2013), 278–289.
- [26] Päßler Sebastian, Fischer Wolf-Joachim, and Kraljevski Ivan. 2012. Adaptation of models for food intake sound recognition using maximum a posteriori estimation algorithm. In Proceedings of the 9th International Conference on Wearable and Implantable Body Sensor Networks. IEEE, 148–153.
- [27] Qiu Jianing, Lo Frank P-W, Jiang Shuo, Tsai Ya-Yen, Sun Yingnan, and Lo Benny. 2020. Counting bites and recognizing consumed food from videos for passive dietary monitoring. IEEE Journal of Biomedical and Health Informatics 25, 5 (2020), 1471–1482.
- [28] Rahman Md Juber and Morshed Bashir I. 2021. A minimalist method toward severity assessment and progression monitoring of obstructive sleep apnea on the edge. ACM Transactions on Computing for Healthcare 3, 2 (2021), 1–16.
- [29] Ramos-Garcia Raul I., Muth Eric R., Gowdy John N., and Hoover Adam W. 2014. Improving the recognition of eating gestures using intergesture sequential dependencies. IEEE Journal of Biomedical and Health Informatics 19, 3 (2014), 825–831.
- [30] Raza Muhammad Ahmed, Chen Longfei, Nanbo Li, and Fisher Robert B. 2023. EatSense: Human centric, action recognition and localization dataset for understanding eating behaviors and quality of motion assessment. Image and Vision Computing 137 (2023), 104762.
- [31] Rouast Philipp V. and Adam Marc T. P. 2019. Learning deep representations for video-based intake gesture detection. IEEE Journal of Biomedical and Health Informatics 24, 6 (2019), 1727–1737.
- [32] Rouast Philipp V. and Adam Marc T. P. 2020. Single-stage intake gesture detection using CTC loss and extended prefix beam search. IEEE Journal of Biomedical and Health Informatics 25, 7 (2020), 2733–2743.
- [33] Rouast Philipp V., Heydarian Hamid, Adam Marc T. P., and Rollo Megan E. 2020. OREBA: A dataset for objectively recognizing eating behavior and associated intake. IEEE Access 8 (2020), 181955–181963.
- [34] Salley James N., Hoover Adam W., Wilson Michael L., and Muth Eric R. 2016. Comparison between human and bite-based methods of estimating caloric intake. Journal of the Academy of Nutrition and Dietetics 116, 10 (2016), 1568–1577.
- [35] Schoeller Dale A and Allison David B. 2017. Use of doubly-labeled water measured energy expenditure as a biomarker of self-reported energy intake. In Advances in the Assessment of Dietary Intake. CRC Press, 185–197.
- [36] Scisco Jenna L., Muth Eric R., and Hoover Adam W. 2014. Examining the utility of a bite-count–based measure of eating activity in free-living human beings. Journal of the Academy of Nutrition and Dietetics 114, 3 (2014), 464–469.
- [37] Selamat Nur Asmiza and Md Ali Sawal Hamid. 2020. Automatic food intake monitoring based on chewing activity: A survey. IEEE Access 8 (2020), 48846–48869.
- [38] Shen Yiru, Salley James, Muth Eric, and Hoover Adam. 2016. Assessing the accuracy of a wrist motion tracking method for counting bites across demographic and food variables. IEEE Journal of Biomedical and Health Informatics 21, 3 (2016), 599–606.
- [39] Spiegel Theresa A., Kaplan Joel M., Tomassini Antonina, and Stellar Eliot. 1993. Bite size, ingestion rate, and meal size in lean and obese women. Appetite 21, 2 (1993), 131–145.
- [40] Spruijt-Metz Donna and Nilsen Wendy. 2014. Dynamic models of behavior for just-in-time adaptive interventions. IEEE Pervasive Computing 13, 3 (2014), 13–17.
- [41] Tang Yiping, Zheng Yang, Wei Chen, Guo Kaitai, Hu Haihong, and Liang Jimin. 2023. Video representation learning for temporal action detection using global-local attention. Pattern Recognition 134 (2023), 109135.
- [42] Tang Zeyu and Hoover Adam. 2022. A new video dataset for recognizing intake gestures in a cafeteria setting. In Proceedings of the 26th International Conference on Pattern Recognition (ICPR). IEEE, 4399–4405.
- [43] Tu Zhigang, Li Hongyan, Zhang Dejun, Dauwels Justin, Li Baoxin, and Yuan Junsong. 2019. Action-stage emphasized spatiotemporal VLAD for video action recognition. IEEE Transactions on Image Processing 28, 6 (2019), 2799–2812.
- [44] Tufano Michele, Lasschuijt Marlou, Chauhan Aneesh, Feskens Edith J. M., and Camps Guido. 2022. Capturing eating behavior from video analysis: A systematic review. Nutrients 14, 22 (2022), 4847.
- [45] Yadav Santosh Kumar, Tiwari Kamlesh, Pandey Hari Mohan, and Akbar Shaik Ali. 2021. A review of multimodal human activity recognition with special emphasis on classification, applications, challenges and future directions. Knowledge-Based Systems 223 (2021), 106970.
- [46] Yang Xitong, Fan Haoqi, Torresani Lorenzo, Davis Larry S., and Wang Heng. 2021. Beyond short clips: End-to-end video-level learning with collaborative memories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7567–7576.
- [47] Zhang Rui and Amft Oliver. 2016. Bite glasses: Measuring chewing using EMG and bone vibration in smart eyeglasses. In Proceedings of the ACM International Symposium on Wearable Computers, 50–52.
- [48] Zhang Rui and Amft Oliver. 2017. Monitoring chewing and eating in free-living using smart eyeglasses. IEEE Journal of Biomedical and Health Informatics 22, 1 (2017), 23–32.
