Abstract
BACKGROUND:
Estimates of microactivity (e.g., hand- and object-to-mouth contact) frequencies are essential for modeling children’s environmental exposures but are challenging to obtain due to the time and human costs of manually labeling behaviors from pre-recorded videos.
OBJECTIVES:
We aim to develop and evaluate a computer vision model to quantify microactivities for young children.
METHODS:
The vision model was trained and validated using video footage (collected via four concurrent GoPro cameras) of 25 children aged 6–18 months playing in their homes in Baltimore, MD. We leveraged computer vision techniques to develop an algorithm that assesses children’s pose by identifying and tracking 3D key points (e.g., the locations of children’s eyes, hands, wrists, and elbows). The algorithm automatically measured the distance between the child’s hands and mouth in every video frame; when the distance fell below a minimum threshold, the model logged a “contact event.” We compared the timing and number of events for three microactivities (left- and right-hand-to-mouth, and object-to-mouth) yielded by the vision model to the outputs of comparable human behavioral coding.
RESULTS:
Our method accurately recognized children’s microactivities: the timing and number of contact events detected agreed with human coding on a second-by-second basis 96–99% of the time, with minimal counting errors (0.04–2.18 events per video). We observed higher rates of object-to-mouth contacts (mean = 27 contacts/h) compared to hand-to-mouth contacts (mean = 3 contacts/h).
IMPACT:
This study developed and evaluated a computer vision method for accurately identifying and quantifying young children’s hand-to-mouth and object-to-mouth contacts from collected video, greatly reducing the costs and burden of generating microactivity data needed for soil and dust exposure modeling.
Keywords: microactivity, soil, dust, incidental ingestion, videography, computer vision
INTRODUCTION
Infants and toddlers (hereafter: children) are exposed to chemical and biological contaminants through the incidental ingestion of soil and dust. For some hazards, exposure via soil and dust is believed to drive aggregate exposures in early life [1]. Estimating children’s exposures is a critical element of human health risk assessment and management. Recommended default exposure assumptions for incidental ingestion rates of soil and dust are of low confidence [2], and approaches to generating these estimates have been plagued by various methodological and practical limitations [3]. Numerous efforts are ongoing to better characterize soil and dust ingestion pathways, specifically to reduce uncertainty and better characterize variability, thereby improving estimates of soil and dust ingestion rates used to assess exposure among early lifestages [4].
Children have higher rates of mouthing behaviors than adults [5–7] which are a fundamental means of interacting with their environment, gathering information and advancing their physical and cognitive development [8–13]. Estimates of these microactivities (i.e., hand-to-mouth and object-to-mouth contacts) can be used to support an indirect exposure modeling approach that uses time-activity pattern methodology [2, 14]. To accomplish this, data describing children’s microactivities and microenvironments (i.e., where children spend their time) are combined with assumptions about soil and dust transfer parameters and other exposure factors in models to derive estimates of soil and dust ingestion rates. These exposure models, including the Stochastic Human Exposure and Dose Simulation model [1], can be used to inform risk assessments and public health interventions.
The existing database of studies of children’s microactivity behaviors is relatively small and may not capture population variability in the frequency and duration of hand- and object-to-mouth contacts. If population variability is not properly characterized, it is possible that soil and dust ingestion recommendations for exposure modeling and risk assessment may underprotect children who exhibit the highest frequency of these behaviors. This would have direct implications for regulatory decisions that depend on models involving soil and dust ingestion estimates, including (but not limited to) standards for lead and other chemicals in soil and dust, pesticide registrations, and Superfund site reuse decisions.
Estimates of microactivities are typically derived from observational videography studies of children in various settings (e.g., [15–19]). To date, most videography studies typically use a single video camera to record a child’s activities, inevitably resulting in a portion of time in which the child is either “not in view” or a key point, such as the child’s hand or face, is obstructed and thus microactivity data cannot be coded for the entirety of the video [20]. While some studies involve researchers using a standard diary form for recording observed children’s microactivities in real-time [21], the gold standard approach involves researchers using a computer software interface to manually code pre-selected behaviors from pre-recorded video data [22]. Because these methods are time and labor-intensive, videography studies typically have small sample sizes (e.g., 4; [23]), only observe children for a portion of the day (e.g., 1 h; [24]), and have been conducted in a limited number of settings. Most studies capturing microactivities by video to date have focused on children in the US, but an emerging body of literature has investigated children’s microactivity patterns in Bangladesh [19], Korea [25], and Taiwan [6, 18]. Given the limited data available on children’s microactivity patterns, there is considerable uncertainty as to whether current estimates capture the true variability of children’s microactivities in a given population. Furthermore, it is unclear whether it is appropriate to apply estimates of American children’s microactivities to estimate exposure and risks in other countries. Given feasibility challenges with human behavioral coding, new approaches are needed to support larger, representative studies of children’s microactivity frequencies that can be used to develop robust estimates of soil and dust exposure.
Artificial intelligence and machine learning-based approaches may have promise for facilitating more efficient and accurate characterization of children’s microactivity behaviors. While computer vision algorithms have achieved significant success in certain applications (such as bounding boxes around humans, identification of specific points [e.g., hands and mouth] on human bodies and object tracking), most of these efforts have primarily focused on adults [26–28], while very few studies have specifically addressed children [29]. For example, advanced models for adult pose and shape estimation [30, 31] or those analyzing hand-to-mouth interactions, cannot be directly applied to children due to significant anatomical and behavioral differences. The lack of annotated data for children further complicates the adaptation and validation of these methods, presenting a critical challenge for extending their use to studying children’s microactivities.
The goals of this study were to 1) develop a computer vision algorithm to enable an automatic inference pipeline that estimates the timing and number of contacts (e.g., hand-to-mouth and object-to-mouth) in which a transfer of dust or soil into the mouth of the child could occur and 2) evaluate the computer algorithm by comparing its outputs to that of traditional human behavioral coding.
METHODS
Recruitment
Recruitment was implemented as part of the broader Innovations to Generate Estimates of children’s Soil/dust inTake (INGEST) study, which aims to advance methods for estimating rates of soil and dust ingestion among children. We recruited and enrolled (n = 66) parents of children aged 6–18 months residing in or near Baltimore, MD, USA. Participants were recruited through BuildClinical [32], an online data-driven platform and service that connects potential participants with research studies that might interest them. We also recruited participants from the Harriet Lane Clinic at Johns Hopkins Children’s Health Center and WIC Centers. Flyers posted in community centers, childcare centers, and public libraries supplemented direct recruitment. Parents were screened and enrolled via telephone. We scheduled home visits to collect video footage at a time convenient to the family.
Camera calibration and video collection
Video data were collected between April 2023 and June 2024 (Fig. 1). Two study team members trained in home visitation and videography data collection visited each participant’s home to collect up to 20 min of video of the child engaged in play. The team set up four GoPro Hero9 cameras mounted on all four sides of the room where the child most often plays. We used multiple cameras to limit the occurrence of occlusion (time when the child’s mouth was not in view of one or two cameras), which can reduce the amount of usable video footage for estimation of microactivity behaviors. The GoPro cameras were set to “Narrow” mode, the camera’s tightest field-of-view setting, which crops the sensor’s center for a more zoomed-in and less distorted perspective. Compared to the GoPro’s “Wide” or “SuperView” modes, Narrow significantly reduces fisheye and barrel distortion. Before collecting video footage of the child playing, we recorded up to 2 min of footage to support camera calibration [33].
Fig. 1. Flowchart illustrating the collection, processing, analysis and evaluation of video data to generate frequency and duration estimates of microactivity contacts.
Calibration is essential for accurate 3D child reconstruction from 2D videos captured by multiple cameras [34, 35]; it involves intrinsic and extrinsic components. Intrinsic calibration determines a camera’s internal parameters, including focal length, principal point, and lens distortion, which are unique to each camera; they are useful in correcting for distortion and necessary for mapping 3D points to 2D image coordinates. Extrinsic calibration defines the spatial relationships between cameras by estimating their relative positions and orientations. To achieve precise calibration, a large black and white checkerboard is used for extrinsic calibration to ensure all cameras simultaneously view the same pattern, facilitating the estimation of their relative transformations (Fig. S1). Conversely, smaller black and white checkerboards are used for intrinsic calibration, as each camera is calibrated individually, requiring only its own field of view to refine internal parameters. In our data collection process, a study team member placed the large checkerboard at the center of the area viewed by all four cameras (removing it before the child entered the scene) and presented the small checkerboard to each camera in turn, tilting the board toward and away from the camera in a figure-eight pattern. After calibration, the four cameras were left in place, and video collection continued uninterrupted.
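To make the intrinsic step concrete, the sketch below applies OpenCV’s standard checkerboard routines to frames from one camera; the board dimensions, directory layout, and variable names are illustrative assumptions rather than the exact calibration scripts used in this study.

```python
# Minimal intrinsic-calibration sketch with OpenCV (board size and paths are assumed, not the
# study's actual configuration). Extrinsic calibration would additionally relate the cameras
# to one another using views of the shared large checkerboard.
import glob
import cv2
import numpy as np

PATTERN = (9, 6)  # inner corners per row/column of the small checkerboard (assumption)
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2)  # board coordinates in square units

obj_points, img_points = [], []
for path in sorted(glob.glob("calibration/cam1/*.png")):  # hypothetical frame directory
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K is the 3x3 intrinsic matrix; dist holds the lens-distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
```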
Following calibration, we invited caregivers and the participating child into the play area and instructed caregivers to play with the child (i.e., unstructured play) for 10 min. Following this, we asked caregivers to introduce a standardized novel toy (i.e., Baby Einstein Take Along Tunes Musical Toy) that we provided in addition to any other toys already in the play area and to play with the child for about 10 more minutes. We repeated intrinsic and extrinsic camera calibration after collection of play footage was completed. We also captured photos of each toy or object the child interacted with during filming. While these photos were not utilized in the algorithm described in this paper, given its already high performance, we believe they could prove valuable for future research or algorithm improvements.
All 61 caregivers provided written informed consent to participate in the study. Forty-six participants provided written permission for the video of their child to be published for research purposes. Participants were offered a $90 gift card and invited to keep the Baby Einstein Take Along Tunes Musical Toy. The Johns Hopkins Bloomberg School of Public Health Institutional Review Board (IRB00020023) approved all study protocols.
Video footage curation
We preprocessed the recorded play videos using Adobe Premiere Pro to synchronize them into one timeline. Extraneous footage was removed from each video file to ensure the processed videos captured the calibration and play footage and were of equal length. From this, we generated a single video stream that synchronously displays video views from each of the four cameras (Fig. S1).
Implementation of the computer vision algorithm and determination of microactivity events
The goal of the computer vision algorithm was to use collected video footage to count the microactivities that occurred during filmed structured and unstructured play. The high-level strategy was to detect and localize three-dimensional locations of the observed key points (the locations of these key points on 2D images are shown in Fig. S2) of the child’s body (e.g., the location of the child’s eyes, wrists, elbows, knees, etc.) and then calculate the distance between sets of those locations as the basis for determination of when a microactivity of interest occurs (e.g., hand-to-mouth contact). A stepwise description of the computer vision algorithm implementation follows (Fig. 2).
Fig. 2. Flowchart illustrating the computer vision pipeline.

The algorithm consists of six stages: (1) Detection of humans in the input 4-view images, with two-dimensional (2D) bounding boxes drawn around each individual in every image. (2) Identification of 133 2D key points representing human body features within each bounding box. (3) Segmentation to isolate child-specific regions, focusing on child activities, followed by extraction of the 2D key points corresponding to the child. (4) Triangulation of the child 2D key points from multiple views, leveraging camera parameters found by camera calibration, to generate three-dimensional (3D) key points. (5) Calculation of the distance between the 3D key points of the child’s hand and mouth to determine whether the hand was near the mouth. (6) Classification to distinguish between hand-to-mouth and object-to-mouth actions when the hand is near the mouth.
In Stage 1, for each frame of each of the four camera views, we used Faster R-CNN [36], a neural network model from the MMPose toolbox [37], to detect humans and create bounding boxes around them in the scene. In Stage 2, we used HRNet (High-Resolution Network) [38] to detect the positions of two-dimensional (2D) key points (Fig. S2) in COCO-WholeBody format [39] for each of the humans within the bounding boxes. In Stage 3, since both parents and children were present in the scene, we used the Grounded Segment Anything Model (Grounded SAM) [40, 41] to select the child from among the detected humans (and discard the parent) for further analyses.
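Stages 1 and 2 can be approximated with MMPose’s high-level inference API; the sketch below is a hedged illustration based on our reading of the MMPose 1.x interface (the ‘wholebody’ alias and the result dictionary keys are assumptions), not the exact detector and pose configuration used in this study.

```python
# Sketch of whole-body key-point detection with MMPose's high-level inferencer.
# The 'wholebody' model alias and the result keys are assumptions about the MMPose 1.x API;
# the study specifically paired a Faster R-CNN detector with HRNet in COCO-WholeBody format.
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer('wholebody')          # detector + 133-key-point pose model

result = next(inferencer('frame_cam1_000123.jpg'))  # hypothetical frame path
for person in result['predictions'][0]:             # one entry per detected human in the frame
    keypoints = person['keypoints']                  # 133 (x, y) whole-body key points
    scores = person['keypoint_scores']               # per-key-point confidence
```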
In Stage 4, after obtaining 2D key points of the child from views where the child is visible, we used EasyMocap [42] to triangulate the 2D key points from each camera view to get the three-dimensional (3D) key points (points in 3D space that represent a specific, meaningful location, e.g., fingertips, mouth corners (Fig. 2)). To do this, we used each camera’s intrinsic and extrinsic parameters (obtained from the calibration process described above). For each camera, we adopted a triangulation process, which can be imagined as a ray being traced from the camera through the detected 2D key point on the image plane into 3D space. Given four rays (one from each of the cameras), we assigned a 3D key point at the three-dimensional location where the rays intersect.
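For readers unfamiliar with triangulation, the generic direct linear transform (DLT) sketch below shows how a single 3D key point is recovered from its 2D detections and the calibrated projection matrices; EasyMocap’s implementation adds refinements (e.g., weighting views by detection confidence) that this simplified version omits.

```python
# Generic DLT triangulation of one key point from multiple calibrated views; a simplified
# sketch, not the EasyMocap code used in the study.
import numpy as np

def triangulate(points_2d, proj_mats):
    """points_2d: list of (u, v) pixel coordinates, one per camera that sees the key point.
    proj_mats: matching list of 3x4 projection matrices P = K [R | t]."""
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])   # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)        # least-squares solution = last right singular vector
    X = vt[-1]
    return X[:3] / X[3]                # homogeneous -> Euclidean 3D coordinates
```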
In Stage 5, 3D key points can be used to estimate the distance between parts of the body or between objects and body parts (Supplementary video). To facilitate estimation of the distance between the child’s hand (or held object) and mouth, we took the 3D spatial average of key points located at the upper and lower lips to establish a 3D contact point for the mouth, and the 3D spatial average of key points corresponding to the fingertips and wrists to establish 3D contact points for each hand. A distance threshold D was used to determine whether the proximity between the child’s hand and mouth indicated a potential mouthing event. Frames from the collected video were sampled at 10 per second (i.e., each sampled frame represents a tenth of a second). For frames where the distance between the child’s hand and mouth was smaller than D, it was necessary to differentiate potential hand-to-mouth from object-to-mouth events. Rather than developing a separate model for objects, we trained a binary classification model to make this determination; the model was based on the ResNet50 architecture [43] and used data labeled from four videos to classify child behaviors as hand-to-mouth or object-to-mouth.
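A minimal sketch of this distance test is shown below, assuming the 3D key points are expressed in centimeters and using illustrative index lists for the lip, fingertip, and wrist key points (the exact indices follow the COCO-WholeBody layout and are not reproduced here).

```python
# Sketch of the Stage 5 proximity test. Key-point index lists and the centimeter scale are
# assumptions for illustration; D = 6 cm is the threshold ultimately selected in this study.
import numpy as np

D_THRESHOLD_CM = 6.0

def near_mouth(kp3d, mouth_idx, hand_idx, threshold_cm=D_THRESHOLD_CM):
    """kp3d: (133, 3) array of triangulated 3D key points for one frame.
    mouth_idx / hand_idx: hypothetical index lists for lip and fingertip/wrist key points."""
    mouth_point = kp3d[mouth_idx].mean(axis=0)   # 3D contact point for the mouth
    hand_point = kp3d[hand_idx].mean(axis=0)     # 3D contact point for one hand
    return float(np.linalg.norm(hand_point - mouth_point)) < threshold_cm
```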
Finally, in Stage 6, we analyzed the child’s actions to determine the occurrence of microactivity contact events (Fig. 3, Supplemental video). We considered a microactivity event (e.g., left hand-to-mouth, right hand-to-mouth, or object-to-mouth) to have occurred in a single second if more than three frames within that second consistently indicated that specific microactivity. To differentiate consecutive microactivities (i.e., a hand remains in the mouth vs. multiple mouthing events), we applied a tolerance interval of four seconds, meaning that mouthing behaviors occurring within a space of less than or equal to four seconds were not considered independent microactivities.
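The per-second decision rule and the four-second tolerance interval can be summarized with the short sketch below; it is a simplified rendering of the Stage 6 logic described above, with illustrative variable names.

```python
# Sketch of Stage 6 event logic: a second is positive if more than 3 of its 10 frames indicate
# the microactivity, and positive seconds separated by <= 4 s are merged into one contact event.
FRAMES_PER_SEC = 10
FRAME_THRESHOLD = 3
TOLERANCE_SEC = 4

def positive_seconds(frame_flags):
    """frame_flags: per-frame booleans (10 frames per second) for a single microactivity."""
    seconds = []
    for sec in range(len(frame_flags) // FRAMES_PER_SEC):
        window = frame_flags[sec * FRAMES_PER_SEC:(sec + 1) * FRAMES_PER_SEC]
        if sum(window) > FRAME_THRESHOLD:
            seconds.append(sec)
    return seconds

def count_contact_events(seconds):
    """Count independent events, treating gaps of <= TOLERANCE_SEC as the same event."""
    events, last = 0, None
    for sec in seconds:
        if last is None or sec - last > TOLERANCE_SEC:
            events += 1
        last = sec
    return events
```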
The models in this workflow (including Faster R-CNN, HRNet, and Grounded SAM) are trained on large-scale datasets encompassing a wide variety of images, including those of humans and other objects. This training process exposes the models to millions of annotated examples, allowing them to learn general information such as shapes, textures, poses, and object boundaries, which are essential for tasks like detecting people, identifying key body points, and segmenting babies. Our work demonstrates that these models (which are designed and trained for adults) can be applied to 1) detect children and 2) detect and analyze children’s key points without any further training, highlighting their adaptability (Fig. 2, Supplemental video).
Human behavioral coding
Observations were coded second-by-second from video recordings for 36 study participants using Noldus Observer XT 16 software (Noldus Information Technology, Inc., Leesburg, VA) (Fig. S3). The child’s activity during play (i.e., body position, position of left and right hand, and the position of an object and/or pacifier (e.g., in the mouth)) was captured by specific non-overlapping microactivity code sets (Table S1). Two research assistants dual-coded each microactivity of interest using the preprocessed synchronous video stream. Although the human eye cannot discern millisecond-level changes in behavior, the Noldus program captures the onset and offset of selected behaviors at this level of resolution.
In the first pass, we coded the orientation of the child’s body (i.e., supine, prone, supported sit, independent sit, crawling/all fours, standing, walking, and child not in view). In the second and third passes, we coded the right, then left-hand behavior (i.e., hand touching mouth, hand touching face, hand touching floor, hand elsewhere, and child not in view). In the fourth pass, we coded the object that the child was actively playing with (i.e., object touching mouth, object touching face, object in hand, object touching floor, mouth-to-floor, non-contact with objects, and child not in view). Finally, we coded the pacifier position (i.e., no pacifier in use, pacifier touching mouth, pacifier touching face, pacifier in hand, pacifier touching floor, or child not in view). For each microactivity, data generated includes the frequency count during the ~10-min play interval and the rate per minute. With each instance of a microactivity, there is an associated onset and offset time recorded continuously. From these parameters, the Noldus Observer software quantified the total and mean duration of each microactivity and computed the proportion of time the child engaged in each microactivity by dividing its total duration by the total duration of the video.
Coders were trained to achieve a reliability of 0.80 (Cohen’s kappa). Interrater reliability for videos containing unstructured play was derived within the reliability module accompanying the Noldus Observer XT program on 100% of cases (i.e., all videos were double-coded for this study). A time tolerance of 3 s was built into the reliability computation to allow for differences in human reaction time (e.g., a 1-s difference in the onset of hand-to-mouth). Coding discrepancies were discussed by the two coders and resolved unanimously.
Based on comparison of frequency and sequence, interrater reliability ranged from 0.85–1.00 (mean = 0.95) for the orientation of the child’s body, 0.82–1.00 (mean = 0.94) for the action of the right and left hands, 0.80–0.97 (mean = 0.9) for objects, and 0.93–1.00 (mean = 1.00) for pacifier use for the unstructured play scenario. Interrater reliability ranged from 0.86–0.99 (mean = 0.94) for orientation of child’s body, 0.80–1.00 (mean = 0.91) for the action of the right and left hands, 0.80–0.92 (mean=0.80) for objects, and 0.83–0.99 (mean = 0.89) for pacifier use for the structured (novel toy) play scenario.
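For orientation, per-second agreement between two coders can be summarized with Cohen’s kappa as in the short sketch below (using scikit-learn); the labels are illustrative, and, unlike the Noldus reliability computation, this plain version applies no time tolerance to behavior onsets.

```python
# Minimal per-second Cohen's kappa between two coders (illustrative labels; no 3-s time
# tolerance, which the Noldus software applies when aligning coders' onset times).
from sklearn.metrics import cohen_kappa_score

coder_a = ["none", "hand_to_mouth", "hand_to_mouth", "none", "object_to_mouth"]
coder_b = ["none", "hand_to_mouth", "none",          "none", "object_to_mouth"]
print(cohen_kappa_score(coder_a, coder_b))
```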
Hyperparameters
Hyperparameters are predefined values that control model behavior and decision logic, such as detection thresholds, event grouping intervals, and training settings (e.g., learning rate, batch size). These are not learned from data but are tuned empirically to optimize performance for the specific temporal and spatial structure of behavioral events. We fine-tuned key hyperparameters empirically to optimize detection sensitivity across the multi-stage pipeline. We aimed to identify a primary threshold D to represent the Euclidean distance between hand and mouth contact points. This value was selected via grid search to balance detection coverage with minimization of false positives in validation trials. In Stage 6 of our computer vision method, a microactivity event is considered to have occurred if at least 3 out of 10 frames in a 1-s window were positively identified, corresponding to a 30% frame-level threshold. This value was chosen to maintain robustness against brief occlusions or noise while avoiding spurious detections. We assessed an appropriate separation interval to delineate discrete events to ensure that repeated mouthing behaviors within this interval were not considered separate instances. For the ResNet50 classifier distinguishing hand-to-mouth from object-to-mouth behavior, we fine-tuned the model on four manually labeled training videos, using a learning rate of 1e−4, batch size of 32, and training for 20 epochs.
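The grid search over the distance threshold can be summarized with the sketch below; the candidate values mirror those reported in the Results, while `run_pipeline_counts` and `human_coded_counts` are hypothetical placeholders for the vision pipeline and the reference annotations.

```python
# Sketch of the grid search over the distance threshold D, scored by average absolute counting
# error against human coding. The two helper functions are hypothetical placeholders.
def grid_search_distance_threshold(videos, candidates_cm=(4, 6, 8, 10)):
    best_d, best_err = None, float("inf")
    for d in candidates_cm:
        errors = []
        for video in videos:
            predicted = run_pipeline_counts(video, distance_threshold_cm=d)  # hypothetical
            reference = human_coded_counts(video)                            # hypothetical
            errors.append(abs(predicted - reference))
        mean_err = sum(errors) / len(errors)
        if mean_err < best_err:
            best_d, best_err = d, mean_err
    return best_d, best_err
```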
Evaluation of the computer vision algorithm
For each child, we exported three behavioral coding passes (i.e., right hand, left hand, and object) for the structured and unstructured play scenarios using the Noldus software. Each export file contained a continuous time variable at the microsecond level and the associated numerical values indicating the presence/absence of each mutually exclusive behavior coded in the pass (e.g., for passes related to coding the right hand, we used: child not in view, right-hand touching mouth, right-hand touching face, right-hand touching floor, right hand elsewhere).
The right-hand touching mouth, left-hand touching mouth, and object-touching-mouth behaviors from the human annotation were compared to the output from the computer vision algorithm. We assessed the performance of the computer vision algorithm in two ways: 1) second-by-second accuracy, which measures the proportion of seconds correctly classified by the algorithm among four possible categories—no behavior, left hand-to-mouth, right hand-to-mouth, or object-to-mouth—when compared to human annotations, and 2) average absolute counting error, which quantifies the difference between the count from the computer vision algorithm and the count from human behavioral coding for each of the three action types (left-hand-to-mouth, right-hand-to-mouth, object-to-mouth), across all videos. The formula for average absolute counting error is:
$$\text{Average absolute counting error} = \frac{1}{N}\sum_{i=1}^{N}\left|c_i^{\text{vision}} - c_i^{\text{human}}\right|$$

where $N$ is the number of videos and $c_i^{\text{vision}}$ and $c_i^{\text{human}}$ are the counts of a given action type in video $i$ from the computer vision algorithm and from human behavioral coding, respectively. This dual evaluation approach ensures that we obtained accurate counts of microactivity contacts and that the contacts were counted at the appropriate time.
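A brief sketch of both evaluation metrics, written against hypothetical per-second label arrays and per-video count lists, is shown below for clarity.

```python
# Sketch of the two evaluation metrics: second-by-second accuracy over the four categories
# (none / left / right / object) and the average absolute counting error per action type.
import numpy as np

def second_by_second_accuracy(pred_labels, human_labels):
    """Per-second category labels from the algorithm and from human coding (equal length)."""
    return float((np.asarray(pred_labels) == np.asarray(human_labels)).mean())

def average_absolute_counting_error(pred_counts, human_counts):
    """Counts of one action type (e.g., object-to-mouth), one entry per video."""
    diff = np.asarray(pred_counts, dtype=float) - np.asarray(human_counts, dtype=float)
    return float(np.abs(diff).mean())
```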
Although the proposed algorithm was evaluated solely on our dataset, it is important to note that the majority of its components rely on pretrained models without additional fine-tuning. As such, the system does not learn dataset-specific biases during training. The modular design further facilitates generalization, as each component operates independently and can be readily adapted or replaced for other microactivity domains.
Ablation analyses
Our video collection method used four cameras to collect video of child participants simultaneously. We conducted an ablation analysis (where parts of a “system” are removed to assess the impact of the missing components on the overall system) to evaluate whether using fewer (2 or 3) camera views would adequately support analysis by our computer vision algorithm; this sensitivity analysis simulates video collection using fewer cameras (which would be more efficient in cost, time in the field, and analytical resources). To assess the impact of using fewer cameras, we reran the algorithm on a subset of 12 videos (one run with three camera views, and another run with two camera views). Using the outputs of these algorithm runs, we re-calculated the average absolute counting error with the human coding as the reference. A higher counting error is indicative of poorer performance of the vision algorithm.
Microactivity frequency estimation
Counts of hand- and object-to-mouth contacts were generated at the participant level for all videos in both the training and evaluation datasets. To facilitate comparison with other published studies of microactivity data in the literature and in the EPA Exposure Factors Handbook, we report time-weighted activity counts for each hand-to-mouth and object-to-mouth microactivity by extrapolating the rate of contacts during the observation period (generally ~20 min) to an hourly rate. We used descriptive statistics to summarize the frequency of contact events in RStudio 2024.12.0+467 with R 4.4.2.
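As a worked illustration of this time weighting, with hypothetical numbers, a child with 9 object-to-mouth contacts over 20 min of observation would be assigned 9 × (60 / 20) = 27 contacts/h:

```python
# Illustrative conversion of an observed count into a contacts/h rate (numbers are hypothetical).
def contacts_per_hour(observed_contacts, observed_minutes):
    return observed_contacts * 60.0 / observed_minutes

print(contacts_per_hour(9, 20))  # 9 contacts in 20 min -> 27.0 contacts/h
```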
RESULTS
Composition of collected videos and training and evaluation data
We collected video footage of 61 children engaged in unstructured and structured (i.e., with a new and identical toy given to the child by the study team) play. We randomly identified four participants’ videos to serve as the “training dataset” to train the algorithm, and another 21 participants’ videos served as the “evaluation dataset” (Table S2). The length of the videos in each dataset ranged from 18 min, 40 s to 23 min, 56 s, with approximately equal time dedicated to structured and unstructured play. Children played with other toys in 100% of the videos in the training dataset. The child did not play with the other toys in two videos of structured play in the evaluation dataset. A caregiver was present in most videos (i.e., ≥50% of videos across both datasets). The presence of pets and other children and the use of pacifiers were relatively uncommon in both datasets (i.e., present in ≤50% of videos).
Microactivity frequencies
The computer vision and human annotation approaches yielded similar estimates of hand-to-mouth and object-to-mouth contact rates (Table 1). The four videos in the training set generally had higher counts for all microactivities (e.g., mean object-to-mouth = 57 contacts/h in the training set versus 21 contacts/h in the evaluation set). Rates of object-to-mouth contacts (mean = 27 contacts/h, SD = 34 contacts/h; range = 0–123 contacts/h) were both higher and more variable than hand-to-mouth contacts (mean = 3 contacts/h, SD = 4; range = 0–13 contacts/h) across both datasets.
Table 1.
Summary of microactivity frequencies (contacts/h) generated by different methods.
| Microactivity | Training set (n = 4) | | | Evaluation set (n = 22) | | | All (n = 26) | | |
|---|---|---|---|---|---|---|---|---|---|
| | range | mean (SD) | median | range | mean (SD) | median | range | mean (SD) | median |
| Human coding | | | | | | | | | |
| Total (left and right) hand-to-mouth | 0–12 | 6 (5) | 6 | 0–12 | 3 (4) | 3 | 0–12 | 3 (4) | 3 |
| Left-hand-to-mouth | 0–12 | 4 (6) | 1 | 0–9 | 2 (3) | 0 | 0–12 | 2 (3) | 0 |
| Right-hand-to-mouth | 0–6 | 2 (3) | 2 | 0–9 | 1 (2) | 0 | 0–9 | 1 (2) | 0 |
| Object-to-mouth | 21–83 | 50 (25) | 47 | 0–151 | 26 (39) | 11 | 0–151 | 30 (38) | 16 |
| Computer vision algorithm | | | | | | | | | |
| Total (left and right) hand-to-mouth | 0–9 | 4 (4) | 3 | 0–13 | 3 (5) | 1 | 0–13 | 3 (4) | 3 |
| Left-hand-to-mouth | 0–9 | 2 (4) | 0 | 0–12 | 2 (4) | 0 | 0–12 | 2 (4) | 0 |
| Right-hand-to-mouth | 0–3 | 1 (2) | 1 | 0–9 | 1 (2) | 0 | 0–9 | 1 (2) | 0 |
| Object-to-mouth | 30–85 | 57 (29) | 56 | 0–123 | 21 (32) | 10 | 0–123 | 27 (34) | 12 |
Hand-to-mouth and object-to-mouth frequencies (contacts/h) per participant generated by traditional human and computer vision approaches in videos used to develop and evaluate a computer vision workflow.
Hyperparameter selection results and evaluation for the vision method
We selected three key hyperparameters to optimize performance on behavior-coded videos: 1) the distance threshold (D) of 6 cm to determine whether the proximity between the child’s hand or object and mouth indicates a potential mouthing event, 2) a frame count threshold (F) of more than 3 frames per second to define a microactivity, and 3) a tolerance interval (T) of 4 s to differentiate consecutive microactivities. These values were chosen to ensure a balance between accurately detecting relevant events and minimizing false positives (i.e., detecting a microactivity when one was not documented by human coders). While these hyperparameters were fine-tuned for optimal performance, their overall impact on the final scores was minor. Here, we describe how we determined the most important of these hyperparameters (the selection of the other two hyperparameters followed a similar process).
The distance threshold D was selected through iteration with different values to ensure reliable measurements and consistent performance when compared to the human annotated data. The differences in average absolute counting error rates across varying distances (4 cm, 6 cm, 8 cm, and 10 cm), however, were minimal (Table 2). For instance, the left-hand errors ranged between 0.191 and 0.545, while the right-hand errors (when compared with human coded activities) varied between 0.04 and 0.455 across all distances. Similarly, the object errors were stable, ranging from 2.18 to 2.883. These findings indicate that while the choice of D is important for ensuring consistency, the overall outcomes remain robust to changes in this parameter, suggesting flexibility in its selection without significantly impacting accuracy. We found that the frame count threshold and tolerance interval were similarly flexible.
Table 2.
Average absolute counting error for each microactivity across candidate values of the distance threshold hyperparameter D.
| Microactivity | Distance threshold D | | | |
|---|---|---|---|---|
| | 4 cm | 6 cm | 8 cm | 10 cm |
| Left hand-to-mouth | 0.191 | 0.23 | 0.545 | 0.455 |
| Right hand-to-mouth | 0.182 | 0.04 | 0.273 | 0.455 |
| Object-to-mouth | 2.562 | 2.18 | 2.634 | 2.883 |
Comparison of computer vision method to human behavioral coding
The output of the computer vision model showed high performance compared to the gold standard output from human behavioral coding (Table 3). The timing accuracy for the computer model was greater than 96%, meaning that, at the second level, the computer vision outputs agreed with the human coding at least 96% of the time for all three microactivities. Counting accuracy was also high: the number of unique hand-to-mouth and object-to-mouth events differed between the vision algorithm and the human coding by an average of less than one event for the left and right hands and approximately two contacts for objects.
Table 3.
Comparison of event detection by microactivity type using two validation approaches (n = 21 videos).
| Left hand-to-mouth | Right hand-to-mouth | Object-to-mouth | |
|---|---|---|---|
| Second-by-second accuracy | 99.7% | 99.9% | 96.9% |
| Average absolute counting error | 0.23 | 0.04 | 2.18 |
Results of ablation analyses
As the number of cameras decreases from four to two (the minimum required for 3D triangulation), the accuracy of the estimated 3D child key points declines, leading to increased reconstruction errors and higher counting errors across the three microactivities (Table 4). For instance, the average absolute counting error for object-to-mouth contacts increased from 2.18 with 4 cameras to 5.1 with 2 cameras, highlighting the greater ambiguity in estimating depth and spatial positioning when fewer viewpoints are available. Similarly, the counting errors for left hand-to-mouth and right hand-to-mouth interactions increased slightly, from 0.23 to 0.4 and from 0.04 to 0.3, respectively. These results suggest that it is feasible to conduct the analysis with fewer cameras, but careful consideration is needed to ensure accuracy. The placement and calibration of the cameras become critical, with cameras ideally positioned at separate locations to maximize the visibility of the child while minimizing blind spots. For hand-to-mouth interactions, the increase in error is relatively small, suggesting that these movements may be more consistently visible even with fewer cameras. Proper planning and placement of cameras can mitigate some of the challenges of reducing the number of viewpoints.
Table 4.
Results of average absolute counting error from ablation analyses (n = 12 videos).
| 4 cameras | 3 cameras | 2 cameras | |
|---|---|---|---|
| Left hand-to-mouth | 0.23 | 0.34 | 0.40 |
| Right hand-to-mouth | 0.04 | 0.27 | 0.30 |
| Object-to-mouth | 2.18 | 4.72 | 5.10 |
DISCUSSION
We propose a computer vision-based method to identify, code, and quantify three microactivities (i.e., left hand-to-mouth, right hand-to-mouth, and object-to-mouth) for children. Using prospectively collected data of children playing coupled with detailed microactivity annotations, we demonstrated the effectiveness and reliability of our method in accurately documenting and quantifying these behaviors. This contribution not only advances the field of exposure science but also paves the way for identifying and quantifying children’s microactivity patterns with greater accuracy and reliability. Our method can serve as an efficient, less labor-intensive approach to collection of microactivity data as compared to prior approaches reliant on human coding. Automating the analysis portion of videography studies will support larger-scale efforts to describe population variation in children’s mouthing behaviors. A better understanding of distributions of population exposures and patterns among more vulnerable subgroups will improve public health intervention efforts focusing on soil and dust exposure.
Our approach to collecting video footage of children confers several benefits over traditional behavioral coding studies, which are usually conducted in laboratories or controlled settings [44, 45]; in our study, we conducted videography in the naturalistic setting of children’s homes in rooms where caregivers indicated that children typically play. Previous studies of children’s microactivities conducted outside of controlled laboratory settings have focused on older children (i.e., 7–12 years) [46] or outdoor exposures to rubber crumb [47]. We used a standardized toy in the structured play portion of the video to introduce a novel stimulus (though a few children were familiar with the toy).
Studies have shown that the presence of unfamiliar adults or parents deviating from their usual caregiving behavior may alter the behavior of young children [20, 48, 49]. As such, it is possible that the presence of our home videography team influenced the behavior of the child participants in our study. We attempted to minimize this whenever possible by setting up and calibrating the cameras while the child was out of the room and then having our team stay out of sight of the child during filming. Caregivers present during filming may have also engaged in more performative and positive behaviors, knowing our team was recording them. Our instructions for caregivers were minimal (“play with your child as you normally would, and as if we’re not here”), but it is difficult to know the extent to which our presence in the home influenced estimates of caregivers’ and children’s behaviors regarding play and microactivities.
We note that we recorded 20 min of indoor play for each child and did not collect video data of other meso-activities the child might engage in on a typical day (e.g., eating, having a diaper changed, outdoor play). Twenty minutes proved to be an amount of time that was minimally invasive and least disruptive to children’s and caregivers’ normal routines while still allowing us to observe a sufficient number of hand- and object-to-mouth contacts to train and apply the algorithms. Despite this, we believe that our method is temporally scalable and that future studies using it may hold promise for overcoming this hurdle. With the computer vision method demonstrated, other researchers could apply it to much longer collected videos; camera setup and calibration could be performed without the child present, and the cameras left to record for longer durations. This may lead the child to act more naturally, being unaware of the cameras or study team, or would at least allow adequate time for the child to acclimate to the presence of the cameras and resume typical behavior, increasing confidence in estimates derived from the collected video. We acknowledge there is uncertainty in extrapolating these observed microactivity rates to represent a single day or longer.
Occlusion, in which the key point of interest (e.g., an object or the child’s mouth) is blocked or not present in a video frame, inhibiting the quantification of a microactivity, is a challenge in both computer vision and human annotation approaches. Typical approaches used in behavioral coding rely on one or two cameras [23, 50]. Occlusion rates are often not reported, but one previous study using a single camera reported that at least 80% of the video data collected had the relevant body parts “in view” for human coding [20]. Our data collection approach, which uses four concurrent videos, seeks to minimize occlusion by ensuring that at least one of the cameras will always capture the critical key points needed to quantify the microactivities of interest. This four-view approach is a strength of our protocol in that usually at least one view can clearly see the child’s mouth, enabling more accurate key-point detection and microactivity classification. The four-view approach is also well suited to accommodate a mobile infant or toddler who may be constantly changing location, shifting or rotating the position of various body parts, and engaging in mouthing behaviors. Most existing infant microactivity datasets lack multi-view capture. Even the three-camera dataset in Pacheco et al. targets only gross motor actions (e.g., crawling, sitting, standing, and walking) rather than fine-grained microactivities [51].
Our computer annotation workflow has several important strengths over traditional human annotation of microactivities. Although we created a detailed codebook (Table S1) and defined each behavior of interest as objectively as possible, human coding is subject to different interpretations. For example, what one human coder interprets as a finger graze across the face could be viewed as mouthing or perioral contact by another human coder, resulting in conflicting codes that require ongoing reliability assessments. Observer biases in which human coders consciously or unconsciously use environmental or language cues exchanged between parent and child during play (e.g., parent says “Are you sucking on that toy?”) to determine whether specific behaviors occur are not well characterized. The computer vision method, on the other hand, relies on objective and quantifiable definitions of the distance between two key points to determine whether a “contact” occurred.
The computer vision method offers significant time and cost savings relative to the gold standard human coding and annotation approach. Human annotation of observational video data is time-consuming and requires extensive training for the research assistants. In our study, the training phase required 60 h of staff time over 3 months to achieve sufficient interrater reliability. After the research assistants routinely achieved high interrater reliability, it took approximately 2 h for each of the two research assistants to watch and annotate each of the two 10-min video segments (i.e., structured and unstructured play), resulting in 4 h of human effort expended to annotate the 20 min of video footage collected for each participant. There was also additional human time (ranging from 0 to 2 h per participant) needed to address and revise any gross discrepancies in coding decisions between the two research assistants. We estimate 4–6 h of human effort and 4–8 h of compensable time for each participant’s 20 min of video footage collected. We were unable to locate estimates of the labor burden from prior videography studies involving human coding, as reporting this information has not been common in the published literature.
Conversely, after the initial training and evaluation of the workflow, we estimate the computer vision approach requires less than 20 min of human involvement per video (i.e., the approximate duration of the footage collected). This human involvement consists of overseeing the pipeline and the computational processes. The computational processes (which do not require direct or concurrent human oversight) require up to 5 h of preprocessing and 20 h of code execution and can be completed in parallel for numerous videos on a single graphics processing unit (GPU) with sufficient memory capacity (i.e., >3000 megabytes per video). A key efficiency of the computer vision approach is computational parallelism, which allows several videos to be assessed concurrently, substantially reducing the overall time cost compared to manual annotation. In addition, the reduced human labor costs may represent significant savings for research teams with limited personnel budgets. Application of our computer vision algorithm may also aid in ensuring reproducibility; in studies involving human annotation, there may be inherent differences in the way microactivities are coded. For example, the determination of when a behavior begins and ends may vary across coders, and even across project teams, based on subtle variations in coding and training protocols [45, 52].
While the current approach demonstrates robustness on the existing data, testing on larger and more diverse datasets is necessary to validate its generalizability and reliability across different domains. We identified several existing video datasets of children, however, each dataset only addresses a specific, limited scenario: infant reaching in single-view clips [53], respiration estimation from baby-monitor footage [54], non-nutritive sucking detection [55], pose estimation with mixed real/synthetic images [29], coarse gross-motor action classification from multiple fixed cameras [51], motion analysis using RGB-D recordings of supine infants [56], and outdoor or playground studies that transcribe hand- and mouth-contact events [46, 47]. To the best of our knowledge, ours is the first naturalistic, in-home, multi-camera dataset suitable for fine-grained microactivity analysis. Investigating domain transfer techniques will also be critical to adapting the system for varied environments and populations. Additionally, the method relies on a setup requiring 2–4 camera views for accurate pose estimation and activity analysis. While this multi-view approach enhances accuracy, it requires camera calibration procedures. Exploring single-view approaches and taking advantage of 3D infant body priors [56] could potentially simplify the pipeline and improve performance, particularly in less controlled settings. This represents a promising avenue for future work.
Estimates of children’s microactivities are an important input into exposure models and risk assessments, yet few studies have yielded estimates of children’s hand-to-mouth and object-to-mouth activities. Two meta-analyses of children’s hand-to-mouth [5] and object-to-mouth [57] activities, published in 2007 and 2010, respectively, rely on 10 studies, all using human annotation (either manually or with computer software) from recorded videos. Since 2010, few new studies have been published, and most of those have focused on non-US populations [6, 18]. While these studies are important for characterizing how cultural or environmental factors may impact microactivity patterns, we believe a more important area of research is the development and refinement of approaches for quantifying microactivities from recorded videos. Advancements in computer vision methodology and increases in computational speeds and efficiencies create new opportunities to collect and generate data across more diverse populations and in more microenvironments and contexts. Additional video data of children analyzed with our computer vision method would allow exposure scientists to better characterize the variability and reduce the uncertainty in current estimates of microactivity rates for children.
Supplementary Material
ADDITIONAL INFORMATION
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41370-025-00814-x.
ACKNOWLEDGEMENTS
We thank the parents and their children who participated in this study. We are grateful to our student research assistants, Tionna Tolefree and Sofia Harrison, who visited participants’ homes and recorded the video data.
FUNDING
This project was supported by a grant from the US Environmental Protection Agency: Estimating Children’s Soil and Dust Ingestion Rates for Exposure Science EPA-G2020-STAR-D1. S.N.L. also received financial support from the National Institute of Environmental Health Sciences (NIEHS, Grant ID P30ES032756).
Footnotes
COMPETING INTERESTS
The authors declare no competing interests.
ETHICAL APPROVAL
This study has been approved by the Johns Hopkins Bloomberg School of Public Health Institutional Review Board (IRB00020023). All methods were performed in accordance with relevant guidelines and regulations.
DATA AVAILABILITY
The dataset and codebook will be available from the authors upon request.
REFERENCES
- 1.Zartarian V, Xue J, Tornero-Velez R, Brown J. Children’s lead exposure: a multimedia modeling analysis to guide public health decision-making. Environ Health Perspect. 2017;125:097009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.US EPA. Exposure Factors Handbook Chapter 5 2017. [Available from: https://www.epa.gov/expobox/exposure-factors-handbook-chapter-5. [Google Scholar]
- 3.Panagopoulos Abrahamsson D, Sobus JR, Ulrich EM, Isaacs K, Moschet C, Young TM, et al. A quest to identify suitable organic tracers for estimating children’s dust ingestion rates. J Expo Sci Environ Epidemiol. 2021;31:70–81. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Ferguson A, Adelabu F, Solo-Gabriele H, Obeng-Gyasi E, Fayad-Martinez C, Gidley M, et al. Methodologies for the collection of parameters to estimate dust/soil ingestion for young children. Front Public Health. 2024;12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Xue J, Zartarian V, Moya J, Freeman N, Beamer P, Black K, et al. A meta-analysis of children’s hand-to-mouth frequency data for estimating nondietary ingestion exposure. Risk Anal. 2007;27:411–20. [DOI] [PubMed] [Google Scholar]
- 6.Tsou M-C, Özkaynak H, Beamer P, Dang W, Hsi H-C, Jiang C-B, et al. Mouthing activity data for children aged 7 to 35 months in Taiwan. J Expo Sci Environ Epidemiol. 2015;25:388–98. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Freeman NCG, Jimenez M, Reed KJ, Gurunathan S, Edwards RD, Roy A, et al. Quantitative analysis of children’s microactivity patterns: the Minnesota Children’s Pesticide Exposure Study. J Expo Sci Environ Epidemiol. 2001;11:501–9. [DOI] [PubMed] [Google Scholar]
- 8.Tulve NS, Suggs JC, McCurdy T, Cohen Hubal EA, Moya J. Frequency of mouthing behavior in young children. J Expo Sci Environ Epidemiol. 2002;12:259–64. [DOI] [PubMed] [Google Scholar]
- 9.Rochat P (ed.) Object manipulation and exploration in 2-to 5-month-old infants 2001. [Google Scholar]
- 10.Ruff HA. Infants’ manipulative exploration of objects: Effects of age and object characteristics. Dev Psychol. 1984;20:9–20. [Google Scholar]
- 11.Palmer CF. The discriminating nature of infants’ exploratory actions. Dev Psychol. 1989;25:885–93. [Google Scholar]
- 12.Malachowski LG, Needham AW. Infants exploring objects: a cascades perspective. Adv Child Dev Behav. 2023;64:39–68. [DOI] [PubMed] [Google Scholar]
- 13.Whyte VA, McDonald PV, Baillargeon R, Newell KM. Mouthing and grasping of objects by young infants. Ecol Psychol. 1994;6:205–18. [Google Scholar]
- 14.Moya J, Phillips L. A review of soil and dust ingestion studies for children. J Expo Sci Environ Epidemiol. 2014;24:545–54. [DOI] [PubMed] [Google Scholar]
- 15.Beamer PI, Canales RA, Bradman A, Leckie JO. Farmworker children’s residential non-dietary exposure estimates from micro-level activity time series. Environ Int. 2009;35:1202–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Beamer P, Key ME, Ferguson AC, Canales RA, Auyeung W, Leckie JO. Quantified activity pattern data from 6 to 27-month-old farmworker children for use in exposure assessment. Environ Res. 2008;108:239–46. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Black K, Shalat SL, Freeman NCG, Jimenez M, Donnelly KC, Calvin JA. Children’s mouthing and food-handling behavior in an agricultural community on the US/Mexico border. J Expo Sci Environ Epidemiol. 2005;15:244–51. [DOI] [PubMed] [Google Scholar]
- 18.Tsou M-C, Özkaynak H, Beamer P, Dang W, Hsi H-C, Jiang C-B, et al. Mouthing activity data for children age 3 to <6 years old and fraction of hand area mouthed for children age <6 years old in Taiwan. J Expo Sci Environ Epidemiol. 2018;28:182–92. [DOI] [PubMed] [Google Scholar]
- 19.Kwong LH, Ercumen A, Pickering AJ, Unicomb L, Davis J, Luby SP. Age-related changes to environmental exposure: variation in the frequency that young children place hands and objects in their mouths. J Expo Sci Environ Epidemiol. 2020;30:205–16. [DOI] [PubMed] [Google Scholar]
- 20.Ferguson AC, Canales RA, Beamer P, Auyeung W, Key M, Munninghoff A, et al. Video methods in the quantification of children’s exposures. J Expo Sci Environ Epidemiol. 2006;16:287–98. [DOI] [PubMed] [Google Scholar]
- 21.Juberg DR, Alfano K, Coughlin RJ, Thompson KM. An observational study of object mouthing behavior by young children. Pediatrics. 2001;107:135–42. [DOI] [PubMed] [Google Scholar]
- 22.Zartarian VG, Ferguson AC, Ong CG, Leckie JO. Quantifying videotaped activity patterns: video translation software and training methodologies. J Expo Anal Environ Epidemiol. 1997;7:535–42. [PubMed] [Google Scholar]
- 23.Zartarian VG, Streicker J, Rivera A, Cornejo CS, Molina S, Valadez OF, et al. A pilot study to collect micro-activity data of two- to four-year-old farm labor children in Salinas Valley, California. J Expo Anal Environ Epidemiol. 1995;5:21–34. [PubMed] [Google Scholar]
- 24.Ferguson A, Dwivedi A, Adelabu F, Ehindero E, Lamssali M, Obeng-Gyasi E, et al. Quantified activity patterns for young children in beach environments relevant for exposure to contaminants. Int J Environ Res Public Health. 2021;18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Oh HS, Ryu M. Hand-to-face contact of preschoolers during indoor activities in childcare facilities in the Republic of Korea. Int J Environ Res Public Health. 2022;19:13282. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Fang H-S, Li J, Tang H, Xu C, Zhu H, Xiu Y, et al. Alphapose: whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans Pattern Anal Mach Intell. 2022;45:7157–73. [DOI] [PubMed] [Google Scholar]
- 27.Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, et al. Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2020;43:3349–64. [DOI] [PubMed] [Google Scholar]
- 28.Loper M, Mahmood N, Romero J, Pons-Moll G, Black MJ. SMPL: a skinned multi-person linear model. Semin Graph Pap: Push Bound. 2023;2:851–66. p. [Google Scholar]
- 29.Huang X, Fu N, Liu S, Ostadabbas S (editors) Invariant representation learning for infant pose estimation with small data. In: Proceedings 16th international conference on automatic face and gesture recognition (FG 2021); IEEE; 2021. [Google Scholar]
- 30.Cai Z, Yin W, Zeng A, Wei C, Sun Q, Yanjun W, et al. Smpler-X: scaling up expressive human pose and shape estimation. Adv Neural Inf Process Syst. 2024;36. [Google Scholar]
- 31.Goel S, Pavlakos G, Rajasegaran J, Kanazawa A, Malik J. Humans in 4D: reconstructing and tracking humans with transformers. In: Proceedings IEEE/CVF international conference on computer vision (ICCV). 2023. pp 14783–94. [Google Scholar]
- 32.BuildClinical. 2025. [https://www.buildclinical.com/]. [Google Scholar]
- 33.Szeliski R Computer vision: algorithms and applications, 2nd ed. Switzerland: Springer; 2022. [Google Scholar]
- 34.Joo HL, Liu H, Tan L, Gui L, Nabbe B, Matthews I, et al. Panoptic studio: a massively multiview system for social motion capture. In: Proceedings IEEE international conference on computer vision. 2015:3334–42. [Google Scholar]
- 35.Dong JFQ, Jiang W, Yang Y, Huang Q, Bao H, Zhou X. Fast and robust multi-person 3 d pose estimation and tracking from multiple views. IEEE Trans Pattern Anal Mach Intell. 2021;44:6981–92. [DOI] [PubMed] [Google Scholar]
- 36.Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst. 2015;28:91–9. [Google Scholar]
- 37.MMPose Contributors. OpenMMLab pose estimation toolbox and benchmark. 2020. https://github.com/open-mmlab/mmpose. [Google Scholar]
- 38.Sun K, Xiao B, Liu D, Wang J, editors. Deep high-resolution representation learning for human pose estimation. In: Proceedings IEEE/CVF conference on computer vision and pattern recognition; 2019. [Google Scholar]
- 39.Jin S, Xu L, Xu J, Wang C., Liu W, Qian C, et al. Whole-body human pose estimation in the wild. In: Proceedings 16th European conference on computer vision–ECCV 2020; 23–28 August; Glasgow, UK: Springer International Publishing; 2020. pp 196–214. [Google Scholar]
- 40.Ren T, Liu S, Zeng A, Lin J, Li K, Cao H, et al. Grounded Sam: assembling open-world models for diverse visual tasks. Preprint at 10.48550/arXiv.2401.14159. [DOI] [Google Scholar]
- 41.Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, et al. , editors. Segment anything. In: Proceedings IEEE/CVF international conference on computer vision; 2023. [Google Scholar]
- 42.EasyMocap: make human motion capture easier. GitHub; 2021. https://github.com/zju3dv/EasyMocap. [Google Scholar]
- 43.He K, Zhang X, Ren S, Sun J, editors. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. [Google Scholar]
- 44.Bakeman R, Quera V. Behavioral observation. APA handbook of research methods in psychology, Vol 1: Foundations, planning, measures, and psychometrics. APA handbooks in psychology®. Washington, DC, US: American Psychological Association; 2012. p. 207–25. [Google Scholar]
- 45.Bakeman R Behavioral observation and coding. Handbook of research methods in social and personality psychology. New York, NY, US: Cambridge University Press; 2000. p. 138–59. [Google Scholar]
- 46.Beamer PI, Luik CE, Canales RA, Leckie JO. Quantified outdoor micro-activity data for children aged 7–12-years old. J Expo Sci Environ Epidemiol. 2012;22:82–92. [DOI] [PubMed] [Google Scholar]
- 47.Lopez-Galvez N, Claude J, Wong P, Bradman A, Hyland C, Castorina R, et al. Quantification and analysis of micro-level activities data from children aged 1–12 years old for use in the assessments of exposure to recycled tire on turf and playgrounds. Int J Env Res Public Health. 2022;19:2483. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Groot EM, Lekkerkerk MC, Steenbekkers LPA. Mouthing behaviour of young children; an observational study (summary report). RIVM report 613320 002. RIVM: Bilthoven, The Netherlands; 1998. [Google Scholar]
- 49.Davis S MP, Kohler E, Wiggins C. Soil ingestion in children with PICA: Final Report (US EPA Cooperative Agreement CR 816334–01). Seattle, WA: Fred Hutchison Cancer Research Center; 1995. [Google Scholar]
- 50.Hubal EAC, Sheldon LS, Burke JM, McCurdy TR, Berry MR, Rigas ML, et al. Children’s exposure assessment: a review of factors influencing Children’s exposure, and the data available to characterize and assess that exposure. Environ Health Perspect. 2000;108:475–86. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Pacheco C, Mavroudi E, Kokkoni E, Tanner HG, Vidal R, editors. A detection-based approach to multiview action classification in infants. In: Proceedings 25th international conference on pattern recognition (ICPR); 2021. [Google Scholar]
- 52.Chorney JM, McMurtry CM, Chambers CT, Bakeman R. Developing and modifying behavioral coding schemes in pediatric psychology: a practical guide. J Pediatr Psychol. 2014;40:154–64. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Dechemi A, Bhakri V, Sahin I, Modi A, Mestas J, Peiris P, et al. , editors. BabyNet: a lightweight network for infant reaching action recognition in unconstrained environments to support future pediatric rehabilitation applications. In: Proceedings 30th IEEE international conference on robot & human interactive communication (RO-MAN); 8–12 August 2021. [Google Scholar]
- 54.Manne SKR, Zhu S, Ostadabbas S, Wan M, editors. Automatic infant respiration estimation from video: a deep flow-based algorithm and a novel public benchmark. In: Proceedings international workshop on preterm, perinatal and paediatric image analysis. Springer; 2023. [Google Scholar]
- 55.Zhu S, Wan M, Hatamimajoumerd E, Jain K, Zlota S, Kamath CV, et al. , editors. A video-based end-to-end pipeline for non-nutritive sucking action recognition and segmentation in young infants. In: Proceedings medical image computing and computer assisted intervention—MICCAI 2023; Cham: Springer Nature Switzerland; 2023. [Google Scholar]
- 56.Hesse N, Pujades S, Romero J, Black MJ, Bodensteiner C, Arens M, et al. , editors. Learning an infant body model from RGB-D data for accurate full-body motion analysis. In: Proceedings 21st international conference on medical image computing and computer-assisted intervention–MICCAI 2018, Granada, Spain, 16–20 September 2018, Springer. [Google Scholar]
- 57.Xue J, Zartarian V, Tulve N, Moya J, Freeman N, Auyeung W, et al. A meta-analysis of children’s object-to-mouth frequency data for estimating non-dietary ingestion exposure. J Expo Sci Environ Epidemiol. 2010;20:536–45. [DOI] [PubMed] [Google Scholar]