Dev Sci. 2026 Mar 8;29(3):e70144. doi: 10.1111/desc.70144

Thinking Critically About Algorithms for Automated Detection of Behavior: 11 Guidelines for Social and Behavioral Scientists

Kaya de Barbaro 1, Anna Madden‐Rusnak 2, Adela Timmons 1
PMCID: PMC12968601  PMID: 41797339

ABSTRACT

Developmental psychologists are increasingly leveraging mobile and wearable sensors paired with machine learning and artificial intelligence (AI) to automatically detect the everyday behaviors and interactions theorized to drive development. These technologies provide an opportunity to capture learners’ real‐world experiences, with wide‐ranging implications for basic science and intervention. However, many developmentalists lack the training to critically evaluate the accuracy of models used to automatically detect behavior and may not be aware of various challenges of implementing these approaches in real‐world settings. To advance the next wave of research and innovation in this area, we provide readers with a set of 11 practical guidelines that will give researchers the critical perspective necessary to leverage or codesign systems in a way that is technically sound, ethically responsive, and practical. Our guidelines highlight common pitfalls and challenges with using AI for research and intervention, matched with best practices and practical recommendations for researchers working in this field. They cover the limits of model generalizability, recommendations for careful interpretation of accuracy statistics, the importance of real‐world feasibility, ethical deployment, and interdisciplinary collaboration with sustained community engagement. Collectively, these guidelines provide a foundation for advancing the rigor, equity, and impact of tools for activity recognition in developmental science.

Keywords: accuracy, artificial intelligence, best practices, generalizability, machine learning, mobile sensing

Summary

  • Models are unlikely to generalize beyond their training data: researchers should match model training data with the settings and populations models will be used in.

  • High accuracy scores can hide poor model performance: carefully evaluate the quality of the training data, and check accuracy for each behavior of interest.

  • Model accuracy does not equal benefits to participants: a less accurate but adoptable model may deliver greater benefits than a highly accurate but impractical model.

  • Developmentalists—and the participants in their studies—bring crucial perspectives in developing AI tools that are accurate, meaningful, and usable in real‐world contexts.

1. Thinking Critically About Algorithms for Automated Detection of Behavior: 11 Guidelines for Developmental Researchers

Mobile and wearable sensors, paired with algorithms that can automatically detect activity from sensor data, offer unprecedented access to the everyday behaviors and interactions theorized to drive development. Recent work highlights the various ways that developmentalists are leveraging these technologies to support research and intervention (de Barbaro 2019; de Barbaro and Fausey 2023). For example, data from wearable motion sensors have been used to characterize behavioral regulation in natural daycare settings (Koepp et al. 2021). Parent speech detected from child‐worn audio recordings has been used to quantify language inputs for children across the planet (Bergelson et al. 2023) or, paired with screen use data, to demonstrate the real‐time effects of phone use on parent–child interactions (Mikhelson et al. 2024). Head‐mounted video cameras have been used to characterize children's visual inputs and their links to word learning (Jayaraman et al. 2015; Schroer et al. 2024). Real‐time behavioral predictions are revolutionizing traditional intervention paradigms (Nahum‐Shani et al. 2016; Timmons, Feng et al. 2024). For example, in clinical trials, automated affect detection is being used to provide parents real‐time feedback during therapy sessions for children with emotional and behavioral issues (Saliba et al. 2023).

Artificial intelligence (AI) plays a key role in such work. Mobile sensors, including wearable audio recorders, cameras, accelerometers, and proximity monitors, can record tens, hundreds, or even thousands of raw samples per second, resulting in large datasets even for brief recordings. To efficiently manage these large datasets, researchers can use algorithms to automatically infer meaningful activity from sensor outputs, a process known as activity recognition (Nweke et al. 2018). For example, algorithms can be applied to motion data from accelerometers to automatically infer that a child is walking versus crawling (Franchak et al. 2021), sleeping or awake (Sazonov et al. 2004), or held versus not held by a parent (Yao et al. 2019). These automated inferences can drastically reduce human annotation effort, allowing researchers to derive insights about naturalistic activity at scale.

Traditionally, most activity recognition applications have been developed by teams of engineers in the domain of adult consumer applications, including digital healthcare and fitness (Pires et al. 2020). However, in recent years there has been a proliferation of work—often led by or including developmentalists—to develop algorithms to infer activity from infant, child, and family sensor data, including markers of affect (Lee et al. 2025; Micheletti et al. 2023; Timmons et al. 2017), posture and motor activity (Deng et al. 2025; Franchak et al. 2021, 2024; Hendry et al. 2023), language (Lavechin et al. 2020; Li et al. 2024; Räsänen et al. 2021; Soderstrom et al. 2021), parent–child interactions (Karaca et al. 2024; Timmons, Feng et al. 2024), and aspects of the home environment (Khan et al. 2024; Khante et al. 2023). Additionally, off‐the‐shelf AI tools such as ChatGPT or other large language models that are not specifically designed for child‐worn data are becoming increasingly effective for a wide range of data annotation tasks (Rathje et al. 2024). As machine learning applications become increasingly common in developmental research, researchers must develop a degree of AI literacy that allows them to critically evaluate these models. Without this, the field risks drawing misleading conclusions, applying models in inappropriate contexts, or overlooking biases that could distort findings and their implications.

Finally, some developmentalists may want to lead or join teams engaged in developing algorithms to detect specific behaviors of interest. Their expertise is critical for ensuring that models address problems of genuine scientific and applied importance, defining gold‐standard behavioral annotations, highlighting sources of variability that the model must account for (e.g., potential differences in performance across settings, ages, and treatment groups or cultural contexts), and guiding recruitment of relevant populations (Bergelson et al. 2023). Familiarity with the terminology, possibilities, and challenges of activity recognition can enable these researchers to advocate for better solutions in their collaborations with machine learning experts (see Baumgartner et al. 2023 for a specific account). Overall, whether as end‐users or collaborators, developmentalists can play a vital role in critically assessing and supporting these technologies such that the resulting tools are meaningful, ethical, and practical.

To facilitate these interdisciplinary efforts, we share a set of 11 guidelines for developing, evaluating, and implementing activity recognition algorithms in developmental science, as summarized in Figure 1. Our focus is not to provide general guidance on conducting mobile sensor studies (see de Barbaro and Fausey 2023; Harari et al. 2016) or an introduction to machine learning more broadly (see Hindman 2015; Jacobucci et al. 2023; Pargent et al. 2023). Instead, we aim to offer a critical lens and practical considerations, drawn from our own work and that of others, to support the next generation of developmentalists working in this area. We recognize that not every laboratory or publication will be able to adopt all of these guidelines at all times. For example, while we encourage recruiting diverse populations when developing models, individual projects will inevitably reflect the constraints of their local context and resources. As such, these guidelines are offered not as strict requirements, but as benchmarks to help researchers identify best practices so that future efforts yield stronger and more impactful science.

FIGURE 1 Eleven guidelines for thinking critically about algorithms for automated detection of behavior.

2. Primer

Broadly, AI refers to computational systems that can perform tasks associated with human intelligence, such as reasoning, language, or decision‐making. Machine learning, a subset of AI, uses algorithms and statistical models to identify patterns in data without relying on explicitly programmed rules. In developmental research, machine learning can be used to derive representations of behaviors of interest (e.g., walking, crawling, or sitting) from raw sensor data.

A common machine learning approach, supervised learning, pairs sensor data with “ground‐truth” labels or annotations—typically generated and validated by human coders—during the training or “learning” phase. Supervised models use these labeled examples to learn sensor features that can be used to predict or automatically detect behaviors of interest. The basis of these predictions is that different behaviors (e.g., walking vs. sitting) have distinct signatures in the sensor data. For example, walking involves sustained forward motion, while sitting involves relative stillness. The model learns to use such features, extracted from sensor input, to distinguish between behavioral classes.

The logic of supervised modeling resembles fitting a regression: an algorithm learns parameter values mapping input variables (features of sensor data) onto an outcome (a labeled behavior), adjusting them to best predict that outcome. However, where psychologists typically use regression to explore relationships within a known dataset (e.g., identifying which inputs best predict an outcome), machine learning approaches more often use a trained model to predict unknown outcomes (i.e., behaviors of interest) from new input variables (previously unseen sensor data). Accordingly, machine learning efforts typically rely on stricter cross‐validation approaches to evaluate model generalization.

In cross‐validation, a portion of the training dataset (e.g., paired sensor data with “ground truth” outputs) is held out while the model is trained on the rest. Following training, model accuracy is assessed on the held‐out data by comparing ground truth outputs with model predictions. In activity recognition tasks, the gold‐standard for cross‐validation is leave‐one‐participant‐out (LOPO), in which data from one participant is excluded from the training set and used exclusively for testing. This process is repeated across all participants in the training dataset until each participant has been left out of training once, with overall model accuracy calculated as an average across these iterations. This approach is critical for testing generalizability of the model, in that it allows researchers to determine how well the model performs on data from new participants that the model did not have access to during the training phase when model parameters were set.
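
To make the LOPO procedure concrete, the sketch below runs it with scikit‐learn's LeaveOneGroupOut splitter. This is a minimal illustration rather than a recommended pipeline: the feature matrix, labels, and participant IDs are synthetic placeholders standing in for a real windowed sensor dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder inputs: one row of features per sensor window, one behavior
# label per window, and the ID of the participant each window came from.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))             # windowed sensor features
y = rng.integers(0, 2, size=500)           # 0 = sitting, 1 = walking
participants = np.repeat(np.arange(10), 50)

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=participants):
    model = RandomForestClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])   # train on all but one participant
    preds = model.predict(X[test_idx])      # test on the held-out participant
    scores.append(f1_score(y[test_idx], preds, average="macro"))

print(f"LOPO macro F1: {np.mean(scores):.2f} (+/- {np.std(scores):.2f})")
```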

Training a model from scratch is not always feasible or necessary. Instead, researchers can rely on pretrained models, that is, models that have already been trained on large, labeled (“ground truth”) datasets, and released for reuse, either commercially or on repositories such as GitHub. As we elaborate in Guidelines 1 and 2 below, under the right conditions pretrained models can be used “off the shelf” for new applications, reducing time and computational costs. This is the basis of tools including LENA classification (Gilkerson et al. 2017), OpenPose (Cao et al. 2017), and Whisper (Radford et al. 2023). Alternatively, if the pretrained model is publicly available, it can be “fine‐tuned” with additional ground truth data from a new dataset. Fine‐tuning allows the model to build upon representations learned from the original dataset, while adapting to the specific populations or context of the new dataset. This is particularly useful when a new dataset is too small to train a model from scratch.
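
The sketch below illustrates the logic of fine‐tuning in PyTorch: freeze a pretrained backbone and train only a new classification head on a small labeled dataset. Everything here is a stand‐in; in practice, the backbone weights would be loaded from a public repository and the data would come from your own ground‐truth annotations.

```python
import torch
import torch.nn as nn

# Stand-in "pretrained" backbone mapping 12-d sensor windows to 128-d features;
# in practice, load real pretrained weights from a public repository.
backbone = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 128))
for p in backbone.parameters():
    p.requires_grad = False                  # freeze the learned representations

head = nn.Linear(128, 3)                     # new head: 128-d features -> 3 behaviors
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Tiny synthetic dataset standing in for labeled windows from the new population.
X_new = torch.randn(64, 12)
y_new = torch.randint(0, 3, (64,))

for epoch in range(5):                       # brief adaptation pass
    logits = head(backbone(X_new))           # reuse frozen pretrained features
    loss = loss_fn(logits, y_new)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

print(f"final training loss: {loss.item():.3f}")
```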

3. Recognizing the Limits of Model Generalizability

A key lesson in the development of activity recognition algorithms is that models are unlikely to generalize beyond the data used to train them (Lockhart and Weiss 2014). For example, models trained with data collected in clean lab settings are unlikely to perform well in natural everyday settings, and models developed with data from one population may not perform well in another. Consequently, researchers should carefully consider the match between the settings and populations used for model training and those in which the models will be used.

Guideline 1. Compare Training Data With Your Intended Use Case

Many studies report major performance drops when models move from structured to unstructured data collection environments (Gillick et al. 2021; Liaqat et al. 2021). In our own work on infant distress detection (Yao et al. 2022), we found that most published models were trained on highly controlled data, for example, infants who were audio recorded in a silent room using a microphone suspended above the crib (Marschik et al. 2017). Published models leveraging data from such structured or otherwise “clean” datasets often yield over 90% accuracy in crying detection (see Ji et al. 2021 for a review). However, we demonstrated that the accuracy of a cry detection model trained on such “clean” data dropped by 64% when applied to real‐world child‐worn audio recordings (Yao et al. 2022). Structured settings lack the variety of everyday sounds from siblings, pets, kitchen appliances, or music, which can overlap with the classes of interest (i.e., infant distress) and make them harder to detect. Further, other sounds—like cats meowing or chairs creaking—may share acoustic properties with infant fussing or crying sounds, confusing the model and leading to increased false positives. Importantly, algorithms trained on this real‐world dataset have reached over 80% accuracy when tested on real‐world data (Lee et al. 2025; Galhotra et al. under review). Note that while these scores are high, they do not reach the accuracy levels achieved in clean environments, reflecting the challenges of classification in noisier settings.

The issue of generalizability across settings extends to many contexts. For example, models trained to detect affect from real‐world YouTube clips performed poorly for detecting affect in our child‐centered data, where the tone and tenor of affect classes (e.g., “shouting”) may be very different in adult settings (e.g., at sports events or to call someone's attention) than in everyday home scenarios (e.g., a young child tantrumming) (unpublished work; Supporting Information S1). Similarly, models trained to detect a parent holding a child using data collected in daytime waking hours (Yao et al. 2019), when holding positions are mostly upright and mobile, drop in accuracy at night, when holding positions are relatively still and supine. As another example, models trained on adult postures and movement may misinterpret children's physical activities, as children's movements differ substantially from adults' (d'Andrea et al. 2025). Finally, models trained to detect speech in households with one child perform worse in households with multiple children (Cristia et al. 2020).

Researchers using publicly available models should carefully consider the characteristics of the model training data: where model training and/or testing data are highly mismatched with the intended use case, additional validation may need to be conducted to determine the true accuracy of the model in the novel setting. These efforts should be targeted rather than universally applied. In domains where there are already numerous validation studies, such as models tested extensively across multiple languages or age groups, additional validation may add little new information unless the new dataset introduces conditions that have not yet been examined. Validation should therefore focus on likely sources of error (e.g., increased noise or variability of behavior, unusual recording environment) and be guided by clear evidence gaps, rather than repeated in settings or populations that closely mirror existing studies. However, because models developed in structured protocols often fail to generalize to naturalistic data, models intended to detect spontaneous behavior should be validated in real‐world settings (see Franchak et al. 2024).

Guideline 2. Look for Evidence of Generalizability Across Samples—Or Test It Locally

The limits of model generalizability, that is, that algorithms trained on one population or sample often fail to generalize to another, can be especially problematic in developmental and clinical research. Behavior varies across age, culture, socioeconomic status, clinical status, and time (Buolamwini and Gebru 2018; DeCamp and Lindvall 2020; O'Brien et al. 2017), which contributes to variability in model performance. For example, issues in generalizability are likely to arise in models designed to assess early language development through speech patterns or vocalization features. Models trained to identify typical and atypical speech trajectories using audio data from monolingual, English‐speaking toddlers in well‐resourced households may fail to generalize to children from bilingual homes, non‐Western cultural contexts, or children with developmental delays (Ganek et al. 2018; Piot et al. 2022; Warlaumont et al. 2014). At the same time, cross‐corpus accuracy can be non‐monotonic: some tools do not fail catastrophically with older children or monolingual non‐English contexts. For example, the original LENA training dataset included a nationally representative sample of English‐speaking children aged 1–42 months old (Gilkerson et al. 2017), and several validation studies have shown that performance only drops modestly—adult word count correlations decreased by roughly 4%–10% across a wide range of other languages (Cristia et al. 2020, 2024; McDonald et al. 2021). However, correlations on child vocalizations/conversational turns drop by about 30%–50%, highlighting that generalization challenges can be specific to particular measures. Additionally, models trained on large, diverse datasets—for example, pooled across multiple unique corpora—do not always outperform models trained on more homogeneous samples (Amrani et al. 2025). These findings highlight that greater diversity in training data, while valuable, does not always translate to better model performance across sample populations.

At its core, successful generalization depends on characterizing the full range of variability of behavior across the population. For example, consider a model trained to detect maternal sensitivity to infant distress using data from a community sample primarily composed of low‐risk, well‐resourced mothers exhibiting moderate to high levels of sensitivity. This model may be ill‐equipped to detect or differentiate lower‐sensitivity behaviors that may be more prevalent in high‐risk samples, such as mothers with depression or histories of childhood maltreatment. Without this representation, the model may misclassify or entirely overlook behaviors that fall outside of the distribution it has “seen” before.

In addition to cross‐population variability, developmental models must also account for change over time. Speech, for instance, evolves markedly across developmental stages—from early babbling to complex syntax and pragmatic language use in adolescence—and these shifts can render models trained on data from younger children ill‐suited for classifying data from older children. Moreover, communication styles are culturally and historically situated (Rogoff 2003). As cultural norms and linguistic practices evolve, so too do the ways individuals express themselves. This is particularly salient in mental health and family contexts, where relational and emotional expression is deeply embedded in dynamic social and cultural frameworks (Chentsova‐Dutton and Ryder 2020). As such, the performance of AI models cannot be presumed to be static; rather, their accuracy and relevance may degrade as the interactions they assess change over time.

Historic datasets built on homogenous, predominantly white, English‐speaking, and socioeconomically advantaged populations can pose a significant risk when applied broadly across diverse developmental and clinical contexts (Koenecke et al. 2020; Kostick‐Quenet et al. 2022; Scaff et al. 2025; Timmons et al. 2023). When such models are used to inform scientific inference or guide intervention decisions, they can encode and perpetuate systemic biases—particularly when they underperform in underrepresented populations (Buolamwini and Gebru 2018; Hooker 2021; Mehrabi et al. 2021). This can manifest in different ways: underdiagnosis may lead to missed concerns and limit access to critical resources or services, while overdiagnosis can reinforce harmful stereotypes and contribute to inappropriate labeling or surveillance (Obermeyer et al. 2019). In both cases, the consequences can extend beyond individual misclassification, feeding into broader patterns of inequity in mental health care. When biased models inform science, practice, or policy, they risk entrenching disparities rather than correcting them—ultimately undermining the promise of AI to advance more precise and equitable approaches to developmental and clinical research.

End‐users of pretrained algorithms should carefully audit the match between model training data and their intended populations. To determine the need for additional validation, users should look for published performance on datasets similar in population, age, language, and context, and identify possible failures to generalize. Researchers involved in the model development process should consider that recruiting socioeconomically, clinically, and/or racially diverse samples for model development can increase variability in the training data, which could allow trained models to be used more widely. When working with diverse samples, following training or additional validation, researchers should compare performance across meaningful subpopulations in the data. These results, including potential biases, should be shared in publications for transparency; ideally, models would be retrained with population‐specific data if performance differs substantially. Finally, if working with longitudinal samples or interventions, it may be necessary to monitor model accuracy regularly, as model performance can drift as the behaviors themselves change.

Not all labs will be positioned to recruit broadly diverse samples, and no single project can solve the full problem of generalizability. At a minimum, population characteristics should be thoroughly described in model publications, something that is standard practice in psychology but often not included in engineering publications. More broadly, systemic initiatives—such as funder incentives for diverse recruitment, requirements for subgroup reporting in journals, and the creation of bias‐audit benchmarks—may be just as critical as individual lab practices for advancing equity and generalizability in developmental algorithm research. Finally, whenever feasible, researchers should deposit raw data and accompanying annotations from validation studies in shared repositories (e.g., Databrary, HomeBank). This practice enables cumulative pooling of validation datasets, allowing future models to be trained and tested on richer, more representative resources and ensuring that individual validation efforts contribute to collective progress.

4. Thinking Carefully About Model Accuracy

A model's accuracy reflects its ability to correctly classify or predict activities of interest. Common evaluation metrics in activity recognition include precision (the proportion of correct positive predictions), recall (the proportion of actual positives correctly identified), and F1 scores, which balance precision and recall to summarize overall performance. Below, we highlight common challenges when interpreting or working with these statistics.
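
These definitions can be verified on a toy example; the snippet below scores a hypothetical detector that makes three correct detections, one miss, and one false alarm.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels: 1 = target behavior present, 0 = absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # 3 hits, 1 miss, 1 false alarm

print(precision_score(y_true, y_pred))  # 3/4 = 0.75 of positive predictions correct
print(recall_score(y_true, y_pred))     # 3/4 = 0.75 of actual positives identified
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75
```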

Guideline 3. Beware: High Accuracy Scores Can Hide Poor Model Performance

Accuracy scores are often treated as a gold standard of model quality, but they can be misleading. First, if training or test data are of low quality, whether unrepresentative (e.g., excluding harder‐to‐model cases) or unreliably annotated (e.g., low interrater agreement), accuracy will not reflect real‐world performance. Second, scores may be artificially inflated when cross‐validation is not rigorous. For example, including data from the same individual in both training and test datasets boosts performance relative to gold‐standard LOPO protocols. Third, real‐world behaviors of interest to developmental scientists are often imbalanced: behaviors like laughter, anger, or family conflict often occur rarely even in clinical populations, meaning overall accuracy can mask poor performance on these less frequent but critical classes. This issue can be exacerbated by weighted accuracy metrics, which weight per‐class accuracy by the observed frequency of each class. Because frequent behaviors (e.g., no distress) are usually easier to detect, this can make the overall score look excellent, even if the model performs poorly on rare but clinically important behaviors (e.g., “cry” or “fuss”). By contrast, unweighted accuracy uses a simple average to summarize performance across classes, better revealing such issues. For example, a model that is 90% accurate for “no distress” but only 50% accurate for “cry” and “fuss” would report a weighted F1 score of 89% if “no distress” is 100× more frequent than distress. The unweighted score would be only 63%, showing that the model is essentially randomly guessing on the behaviors that matter the most (Supporting Information S2). Many studies rely on weighted scores alone, making it difficult to judge whether models capture these less frequent but meaningful activities.
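
The gap between weighted and unweighted scores is easy to reproduce on synthetic data. In the sketch below, “no distress” is 100× more frequent than each distress class; the exact values depend on the simulated error structure, so they will not match the worked example above, but the pattern is the same: a reassuring weighted score alongside a much lower unweighted (macro) score.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

rng = np.random.default_rng(0)

# Synthetic ground truth: 0 = no distress, 1 = fuss, 2 = cry.
y_true = np.array([0] * 10000 + [1] * 100 + [2] * 100)

# Simulated predictions: 90% per-class accuracy for "no distress",
# 50% for "fuss" and "cry" (missed distress defaults to "no distress").
def simulate_prediction(label):
    if label == 0:
        return 0 if rng.random() < 0.90 else rng.choice([1, 2])
    return label if rng.random() < 0.50 else 0

y_pred = np.array([simulate_prediction(y) for y in y_true])

print("weighted F1:", round(f1_score(y_true, y_pred, average="weighted"), 2))
print("macro F1:   ", round(f1_score(y_true, y_pred, average="macro"), 2))
print(classification_report(y_true, y_pred, target_names=["no distress", "fuss", "cry"]))
```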

Researchers evaluating publicly available models should consider the quality of the modeling process rather than relying on accuracy scores alone. Training data quality should be assessed in the same way as any other human‐annotated dataset, that is, by considering the reliability statistics and the representativeness of the annotated sample. Unless the intent is to develop personalized models (Guideline 4), cross‐validation should always be the gold‐standard LOPO. When classes of interest are imbalanced, readers should examine unweighted averages, per‐class accuracy, and/or contingency tables that cross‐reference model predictions and ground truth for each sample of each class. Editors and reviewers should push to make this reporting standard in the field. Researchers developing models should publish and explain these metrics to provide a realistic and useful assessment of model performance. Additionally, including accuracy metrics more familiar to behavioral scientists (e.g., Cohen's kappa or intraclass correlation coefficients, see Bakeman and Gottman 1997) can also help to bridge results with behavioral research standards.
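
Bridging to metrics familiar from observational research takes one line each in scikit‐learn; the snippet below computes a full contingency table and Cohen's kappa for a small illustrative set of labels.

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Illustrative ground-truth vs. model labels for a three-class problem.
y_true = ["none"] * 6 + ["fuss", "fuss", "cry", "cry"]
y_pred = ["none"] * 5 + ["fuss", "fuss", "none", "cry", "none"]

labels = ["none", "fuss", "cry"]
print(confusion_matrix(y_true, y_pred, labels=labels))  # rows = truth, columns = predictions
print(cohen_kappa_score(y_true, y_pred))                # chance-corrected agreement
```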

Guideline 4. Try Personalized Models to Increase Accuracy

In activity recognition, models are typically designed to generalize—that is, to perform well on data from individuals they were not trained on. However, recent research has increasingly highlighted the potential of personalized machine learning models, which leverage labeled training data specific to an individual or family, to outperform generalized approaches (Webb et al. 2025).

Personalized models may be particularly useful for tasks with high behavioral heterogeneity across individuals, such as mental health symptom detection in daily life. For example, in recent work, we conducted a pilot study leveraging mobile and wearable data from 35 families over 60 days, comparing generalized versus personalized models for detecting moment‐to‐moment mental health states (Timmons, Tutul, et al. 2024). Personalized models significantly outperformed generalized models in sensitivity and F1 scores for a range of emotional and interpersonal outcomes, including sadness, anxiety, stress, and positive interactions. Similarly, Franchak et al. (2021, 2024) found that individualized models of infant body position based on wearable inertial sensors in the home provided better detection accuracy than group‐based models.

Tailoring models to the individual—by training on their own symptom profiles and available data streams—can substantially improve model performance (Jacobson et al. 2019; Wang et al. 2014). This may be especially true in domains where symptoms are expressed idiosyncratically, and sensor or device usage patterns vary widely across participants. Although promising, personalized models introduce computational and implementation challenges, including the need for per‐participant training pipelines, more frequent updates, and robust methods to handle sparse or irregular data. In some contexts, the costs of these adaptations may outweigh their benefits, so researchers should carefully consider resource demands relative to their study goals and constraints. Still, pursuing these efforts may be worthwhile when model generalizability across participants is low or when real‐time accuracy is critical for clinical impact.
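
As a minimal sketch of the personalized approach (complementing the LOPO example in the Primer), each participant's model below is trained on the earlier portion of their own recordings and tested on the remainder; a chronological split avoids temporal leakage. All arrays are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 8))                 # windowed sensor features
y = rng.integers(0, 2, size=600)              # per-window behavior labels
participants = np.repeat(np.arange(10), 60)   # 10 participants, 60 windows each

scores = {}
for pid in np.unique(participants):
    Xi, yi = X[participants == pid], y[participants == pid]
    n_train = int(0.7 * len(yi))              # earlier windows train, later windows test
    model = RandomForestClassifier(random_state=0).fit(Xi[:n_train], yi[:n_train])
    scores[pid] = f1_score(yi[n_train:], model.predict(Xi[n_train:]), average="macro")

print({int(pid): round(s, 2) for pid, s in scores.items()})
```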

Guideline 5. Find Creative Ways to Leverage Low‐Accuracy Models

The reality of using automated behavioral classification—particularly in noisy, naturalistic settings—is that our models will often have accuracy below that of human annotators, even following various attempts to increase performance. However, even when model predictions fall below thresholds for direct use, they can still offer valuable insights into human behavior by providing access to large volumes of data that would otherwise be impractical to annotate entirely by human efforts (see Cristia et al. 2020; de Barbaro and Fausey 2023).

For example, when developing a model to detect household chaos from audio (Khante et al. 2023), we evaluated YAMNet, a public model that recognizes 521 real‐world audio classes (Gemmeke et al. 2017; Google Research 2019). YAMNet performed poorly on our child‐worn audio recordings (unpublished data; see Supporting Information S1); however, we were able to use its predictions to build a filter that increased human annotation efficiency eightfold (Khante et al. 2023). Segments flagged with “loud” or “chaotic” sound classes were far more likely to contain high chaos than randomly selected segments, enabling coders to identify rare but meaningful events more efficiently. Similarly, LENA predictions of high volumes of speech or conversation turn counts are often used to sample audio segments with elevated activity for further annotation (e.g., Cychosz et al. 2020).
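
A sketch of this filtering strategy appears below. It assumes a matrix of per‐segment class scores from some pretrained audio tagger; the class names and the set of “chaos marker” classes are hypothetical placeholders for whatever a given model actually outputs.

```python
import numpy as np

# Placeholder outputs from a pretrained audio tagger: one score per class
# per audio segment. Class names here are hypothetical.
class_names = np.array(["Speech", "Music", "Vehicle", "Shout", "Crying, sobbing"])
scores = np.random.default_rng(0).random((1000, len(class_names)))

# Classes we treat as plausible markers of loud or chaotic moments (an assumption).
marker_classes = {"Vehicle", "Shout", "Crying, sobbing"}
marker_idx = [i for i, name in enumerate(class_names) if name in marker_classes]

# Flag segments where any marker class lands among the top-3 predicted classes.
top3 = np.argsort(scores, axis=1)[:, -3:]
flagged = np.array([any(i in row for i in marker_idx) for row in top3])

# Human coders annotate flagged segments first, rather than a random sample.
print(f"{flagged.sum()} of {len(flagged)} segments prioritized for annotation")
```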

Carefully examining where model errors are occurring can also help to assess and refine model predictions. For example, visualizations revealed that our chaos algorithm was detecting suspiciously long periods of high chaos at night. On review, we found the model misclassified white noise machines used for infant sleep as truck sounds, which were then associated with high chaos. This led us to restrict our analyses to waking hours, where the model was less likely to make this mistake (see Figure 2).

FIGURE 2 Household Chaos model predictions for individual participants' 24‐h audio recordings. Note: Chaos predictions for N = 19 families (in rows) across 24‐h audio recordings, with time along the x axis. Although Chaos model accuracy was high overall (see Khante et al. 2023), visualizations indicated that some children have substantial amounts of high chaos (Level 3) detected overnight (e.g., between 8 p.m./20:00 and 8 a.m.). Listening to those segments, we determined that the model was incorrectly labeling white noise machines, used by parents to facilitate sleep, as truck sounds, which were then associated with high chaos. This insight led us to restrict our analyses to daytime hours, where white noise was less common.
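
Raster‐style displays like Figure 2 are straightforward to produce. The matplotlib sketch below plots hypothetical per‐window chaos predictions for each participant across a 24‐h recording; systematic anomalies, such as long unbroken stretches of high chaos overnight, stand out visually and can then be audited by listening to the flagged segments.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder predictions: chaos level (0-3) for each 5-min window of a
# 24-h recording, for 19 families (one row per family).
rng = np.random.default_rng(0)
chaos = rng.integers(0, 4, size=(19, 24 * 12))

fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(chaos, aspect="auto", cmap="viridis", interpolation="nearest")
ticks = np.arange(0, 24 * 12 + 1, 48)               # one tick every 4 hours
ax.set_xticks(ticks)
ax.set_xticklabels([f"{h:02d}:00" for h in range(0, 25, 4)])
ax.set_xlabel("time of day")
ax.set_ylabel("participant")
fig.colorbar(im, label="predicted chaos level")
plt.tight_layout()
plt.show()
```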

5. Prioritizing Ethics, Feasibility, and Participant Benefits

Algorithm developers often prioritize model accuracy as the primary goal. Although important, it is equally critical to consider whether an algorithm is feasible, ethical, and useful.

Guideline 6. Recognize That Model Accuracy Is Not the Same as Benefits to Participants

A less accurate but feasible and adoptable model may ultimately deliver greater social benefit—reaching more people quickly—than a highly accurate model that is too complex, computationally demanding, or impractical to scale. This is particularly true in contexts where the risks of incorrect classification are minimal. For example, when an algorithm is used to deliver supportive feedback or engagement reminders—especially when such prompts are infrequent or low in burden—misclassification is unlikely to negatively affect the user or their experience interacting with the AI system. In such contexts, maximizing adoption and usability may be more important than optimizing for highly accurate classification. Balancing accuracy with feasibility and real‐world utility is essential to ensuring that AI tools can truly support people in meaningful and sustainable ways.

Guideline 7. Evaluate Feasibility in Target Settings

Similarly, while researchers may achieve higher levels of accuracy by using expensive, specialized research‐grade sensors or by integrating data streams from a variety of sensors simultaneously, we must carefully consider whether people would be willing to purchase and wear such sensors, especially outside the structure and incentives of a time‐limited research study. Without sustained use, the real‐world utility of these complex models may be limited, regardless of how well they perform in ideal conditions. In contrast, leveraging commercial devices—such as the smartphones people already own or widely available wearables like Fitbits and Garmins—may offer advantages in terms of scalability, cost, and long‐term engagement. Although these devices may offer lower data fidelity than research‐grade alternatives, their ubiquity and ease of use can make them more suitable for widespread deployment. These considerations are particularly relevant when the goal is to translate research tools into real‐world applications, but they may be less central for teams whose focus is to obtain highly precise data within a small sample, a common tension in mobile sensor research design (see de Barbaro and Fausey 2023).

A related issue when considering large‐scale deployments concerns technological infrastructure. Models are typically trained on precollected datasets from a heterogeneous mix of sensors that are not interoperable, limiting them to offline assessments rather than real‐time responsiveness. To increase their real‐world impact, researchers can start with infrastructure that is already feasible to deploy at scale—such as smartphones, which integrate multiple sensors and support real‐time data collection—and then iteratively add compatible devices (e.g., wearables) as integration pipelines and user adoption make them viable at scale (Harari et al. 2016). Building from existing infrastructure in this way increases the likelihood that models will be both accurate and practically deployable at scale.

Guideline 8. Anticipate and Address Ethical Considerations

Alongside questions of feasibility and scalability, ethical considerations must be central to the deployment of real‐world monitoring technologies—particularly in sensitive domains like mental and behavioral health (Cychosz et al. 2020; Kostick‐Quenet et al. 2022; Lee et al. 2025; Timmons, Tutul, et al. 2024; Timmons et al. 2023). At the foundation of ethical data collection is the principle of informed consent, which must extend beyond initial enrollment to ongoing clarity about what is being captured, how it is processed, and how it might be used or shared. Real‐world monitoring, especially when passive or continuous, challenges traditional consent models by making it more difficult for participants to maintain awareness and agency over data collection. Ensuring participant autonomy and control requires systems that are transparent and user‐configurable, with clear mechanisms for participants to control and revoke access to their data. Although full control is often limited once data are uploaded to institutional servers or repositories, researchers can create meaningful opportunities for participants to make choices earlier in the process. The period immediately following data collection, that is, before files are uploaded or archived, is a particularly important window, as participants are most likely to request deletion and honoring such requests is most feasible.

Real‐world sensing often captures sensitive, granular, and context‐rich data, raising significant concerns about privacy and confidentiality, especially when collecting audio data. The risk of inadvertently collecting data from non‐consenting individuals (e.g., third‐party speech) or capturing highly sensitive information (e.g., interpersonal conflict, health‐related discussions) must be carefully mitigated. Participants may be uncomfortable with ongoing passive audio recording, especially without clear and continuous awareness of when and how their speech is being captured. To address these concerns, researchers may consider time‐limited “sessions” in which participants actively grant permission for audio capture for specific time intervals. A second strategy includes utilizing recording devices with accessible features that allow participants to start and stop recordings at will. Additionally, researchers can implement technical solutions that process audio locally on the device—extracting de‐identified features such as word counts or pitch—while ensuring that raw audio and transcripts never leave the participant's phone. Equally important is robust data security and storage: encryption, secure transmission, and restricted access protocols must be standard practice to prevent unauthorized use or breaches. When developing models, researchers must also actively monitor algorithms for fairness and bias, ensuring that they perform equitably across demographic groups and do not reinforce existing disparities in access, diagnosis, or care, as detailed in Guideline 2. These considerations are especially critical in mental health contexts, where marginalized populations have historically been underrepresented in research and underserved in clinical practice (Timmons et al. 2024).
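
As one illustration of local processing, the sketch below (assuming the librosa library and a hypothetical local recording) reduces a waveform to a handful of de‐identified summary features, so raw audio never needs to leave the device. A deployed system would run this on‐device and transmit only the summary.

```python
import numpy as np
import librosa

# Hypothetical local file; in a deployed app this would be an in-memory buffer.
y, sr = librosa.load("session.wav", sr=16000)

rms = librosa.feature.rms(y=y)[0]                           # frame-level loudness
f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=600, sr=sr)   # pitch track (Hz)

features = {
    "mean_rms": float(np.mean(rms)),            # overall loudness
    "mean_f0_hz": float(np.nanmean(f0)),        # average pitch of voiced frames
    "voiced_fraction": float(np.mean(voiced)),  # proportion of voiced frames
}
del y  # only the de-identified summary is retained or transmitted

print(features)
```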

In addition to considering privacy, confidentiality, and fairness, legal and regulatory compliance must be prioritized. Regulations such as the Health Insurance Portability and Accountability Act (HIPAA) in the U.S. and the General Data Protection Regulation (GDPR) in the EU impose strict standards for data protection, consent, and cross‐border data sharing (GDPR—Legal Text, n.d.). Jurisdiction‐specific laws—such as two‐party consent for audio recordings—must be carefully reviewed during system design and implementation. Finally, researchers must be proactive in assessing the risk of harm and unintended consequences of these technologies. For example, sensitive or unintended content captured in audio recordings may require clinical assessment, intervention, or reporting. Even well‐intentioned systems can produce harm if misclassifications lead to unnecessary clinical escalation, stigmatization, or erosion of trust. Prior scholarship in digital health and AI ethics has emphasized that risk–benefit analyses should be ongoing, not one‐time, and that researchers need frameworks for balancing innovation with patient safety and trust (Goodman et al. 2011; Nebeker et al. 2019; Vayena et al. 2018). These resources provide more detailed discussions of regulatory obligations, practical ethics, and approaches to managing unintended consequences, and can serve as a starting point for researchers navigating compliance.

6. Involving Diverse Expertise to Strengthen Research and Practice

Automated activity recognition is an exciting interdisciplinary nexus where collaborations allow research teams to do more together than they could do alone. The technical sophistication of computational approaches can be daunting, and developmentalists may feel like their disciplinary knowledge carries less weight. But developmental researchers and the participants in their studies bring crucial perspectives to the table. Insights from social and behavioral sciences, and the voices of end users, are essential for shaping sensor tools that are accurate, meaningful, and usable in real‐world contexts.

Guideline 9. Bring Engineers and Computer Scientists to the Table

Developmental research presents unique challenges for activity recognition that can, in turn, offer opportunities for technical innovation. Tasks such as recognizing parent–child activities in real‐world settings often involve noisy data, rare events, and high variability within activity classes, both in duration and form. Unlike traditional human activity recognition tasks (e.g., walking or eating), these activities are often more complex and context dependent; they may unfold across multiple timescales, involve multiple individuals, and include nuanced qualitative dimensions of interaction (Khante et al. 2025). Highlighting these challenges early in conversations can help attract and engage interdisciplinary collaboration. Another effective strategy is to demonstrate that existing models or standard architectures underperform on the problem of interest, which can motivate technical faculty to collaborate and strengthen the case for reviewers in technical venues. For example, in our cry detection work, we motivated the need to develop real‐world cry detection models by first showing that existing models trained with clean lab data performed poorly on real‐world datasets (Yao et al. 2022).

At the same time, some applied activity recognition problems in developmental research may be solvable with “tried and true” approaches. This means that collaborations with senior engineers or computer scientists may result in models that are unnecessarily complex for the problem being studied, or the problems themselves may be less interesting to these collaborators if they do not require novel architectures. For simpler problems, one solution is to work with less senior researchers—such as undergraduate or master's students coadvised by PhD students or postdocs—who may be motivated to apply existing architectures effectively to developmental datasets.

Guideline 10. Leverage Domain Knowledge to Enhance Activity Recognition

Algorithms are typically developed in collaboration with engineering researchers who are experienced in machine learning but have little to no domain expertise. Although large datasets paired with large models have famously achieved breakthroughs without explicit domain knowledge (LeCun et al. 2015), developmentalists can leverage their deep expertise and intuitions about human behavior to facilitate model development.

Although state‐of‐the‐art deep learning models independently identify features that distinguish data classes, they are data‐hungry and may be inappropriate for cases where huge training datasets are infeasible to collect. This is often the case for psychological behaviors that are rare or otherwise arduous to code. In these cases, it may be necessary to use less computationally intense “traditional” machine learning models, where researchers prespecify the features that models use to make their predictions, using libraries of features that have performed well on similar datasets (e.g., standard lists of audio or motion features). However, although such feature libraries have been effective for classifying simple actions like walking or eating, because they typically operate on short windows of data, they are unable to capture the longer‐range temporal dependencies important for classifying more complex activities. For example, in recent work, we developed a classifier to determine the quality of maternal sensitivity to infant distress (Khante et al. 2025). Caregiver responses to infant distress unfold over multiple timescales, requiring consideration of caregivers' immediate responses to individual distress vocalizations as well as the consistency of their responding throughout the episode. Additionally, response quality depends on context: a caregiver could receive a high sensitivity rating despite a delayed or absent response to brief fussing, but not in response to hard crying. Leveraging domain knowledge to create features that could capture such dependencies in the data improved model performance over models that included traditional features alone (Khante et al. 2025). Developmentalists' knowledge of the multimodal, multiscalar, and contextual nature of behaviors of interest can support models to capture meaningful distinctions in behavior and thus improve classification accuracy.
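
To illustrate, the sketch below derives multi‐timescale features from two hypothetical event series (infant distress vocalization onsets and caregiver response onsets), capturing both the immediate latency to each vocalization and the consistency of responding across the episode. The event times and the 15‐s response window are illustrative assumptions, not values from our published models.

```python
import numpy as np

# Hypothetical onsets (seconds), e.g., produced by upstream audio classifiers.
distress_onsets = np.array([10.0, 42.5, 60.0, 95.2])
response_onsets = np.array([12.1, 70.3, 96.0])

def latency_to_response(t, responses, window=15.0):
    """Seconds until the first caregiver response within `window` s, else NaN."""
    later = responses[(responses >= t) & (responses <= t + window)]
    return later[0] - t if len(later) else np.nan

# Short timescale: latency to each individual distress vocalization.
latencies = np.array([latency_to_response(t, response_onsets) for t in distress_onsets])

# Long timescale: consistency of responding across the whole episode.
features = {
    "median_latency_s": float(np.nanmedian(latencies)),      # typical promptness
    "prop_responded": float(np.mean(~np.isnan(latencies))),  # responsiveness rate
    "latency_sd_s": float(np.nanstd(latencies)),             # consistency
}
print(features)
```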

Guideline 11. Include Community Collaborators in Mobile Sensing Research and Interventions

Including community collaborators in the development of algorithms—particularly those used to inform scientific conclusions or guide intervention delivery—is becoming both standard and essential (Tebes 2018; Timmons et al. 2025). Although this work comes with known challenges (e.g., Israel et al. 1998; Wallerstein and Duran 2010), involving individuals with lived experience can offer critical insights into how algorithms are perceived, trusted, and used in real‐world contexts. Community input helps identify potential barriers to adoption, surface unintended consequences, and highlight practical and ethical concerns that may not be apparent through technical development alone. Importantly, this engagement should not be limited to the initial design phase; rather, it should occur at regular intervals and continue even after deployment or data collection. Qualitative methods—such as interviews and focus groups—can be used to assess users’ experiences and perceptions over time, creating ongoing feedback loops that support iterative refinement. Sustained collaboration with users allows algorithms to evolve alongside the communities they are intended to support. These partnerships must also be approached ethically and sensitively: avoiding undue burden on community members, respecting their time and expertise, and ensuring they are appropriately compensated for their contributions.

7. Synthesis and Next Steps

The field of developmental research is entering a transformative era, in which mobile sensing and machine learning offer unprecedented opportunities to observe behavior in context and at scale. As algorithm development becomes increasingly central to behavioral science, researchers have made exciting strides in applying AI to detect and interpret meaningful human activity. These advancements underscore both the promise and the complexity of using AI in developmental contexts. In this paper, we introduced 11 guidelines designed to help researchers critically engage with this emerging space. These include recognizing the limits of generalizability across settings and populations, interpreting accuracy metrics with care, and finding creative uses for low‐accuracy models. We emphasized the importance of real‐world feasibility, ethical deployment, interdisciplinary collaboration, and sustained community engagement. Although not all guidelines will be immediately feasible for every lab, our aim is to provide a foundation that can guide incremental progress and collaborative efforts toward advancing the rigor, equity, and impact of algorithmic tools in developmental science.

Looking ahead, we anticipate several critical challenges and opportunities that will shape the trajectory of the field. These include developing new theories, ensuring sustainability, investing in shared infrastructure, fostering collaborative ecosystems, and navigating the expanding commercial landscape. We detail these challenges and opportunities below. Addressing these next‐phase priorities will be essential to realizing the next generation of AI‐enhanced developmental research.

7.1. From New Data to New Theories

Moving forward, a key challenge for basic science is shifting from amassing large everyday datasets to leveraging these datasets to develop new theories of development and learning (see, e.g., Smith et al. 2018; Nencheva, Tamir and Lew‐Williams 2023). Doing so will require training the next generation of developmentalists in data science approaches for working with high‐density data (see tutorials in Xu et al. 2020). It will also require combining expertise across the ecological systems of development, from multimodal behavior to family processes and neighborhood effects. These efforts can be facilitated by data sharing initiatives, as we elaborate below. For example, the openly shared SEEDlingS corpus (Bergelson 2015) is a longitudinal dataset of monthly home audio and video recordings paired with lab‐based eye tracking data and manual coding to investigate child language development. Similarly, the NIH‐funded Play & Learning Across a Year (PLAY) project (https://play‐project.org/) is an ongoing collaboration between over 70 researchers to capture and make publicly available one hour of richly annotated naturalistic parent‐child interactions from 1000 infants across the United States. These types of shared datasets can help the field move beyond tightly controlled lab datasets and allow researchers to build new theories drawn from rich data of the everyday interactions through which children develop.

7.2. From Proof‐of‐Concept to Sustainable Implementation

In the domain of intervention, the challenge is to move from proof‐of‐concept toward sustainable, real‐world systems. This means shifting from isolated demonstrations of accuracy to building tools that prioritize long‐term usability, adaptability, and integration into existing educational, clinical, or community infrastructures. Sustained impact in both domains depends on iterative refinement: not just what performs well in validation studies, but what endures and adds value in practice. For example, a recent e‐mental health intervention for depressive symptoms was first tested in a randomized controlled trial (N = 187) and then scaled to over 50,000 users (Xu et al. 2025). The platform integrates live Q&A with clinical experts, moderated peer groups, and an evolving FAQ section—turning user feedback into ongoing resource updates. This combination of theory‐driven content, robust support, and built‐in feedback loops shows how an initially controlled intervention can be iteratively refined to reach large populations while maintaining measurable impact.

7.3. Building Shared Infrastructure and Structural Solutions to Accelerate Progress

Although many of our guidelines are aimed at individual researchers, sustainable progress will also require structural solutions. Shared infrastructures—such as open‐source codebases, diverse and representative datasets, standardized pipelines, and model repositories—can accelerate discovery and enable reproducibility. Open datasets that reflect the variability, diversity, and complexity of children's real‐world environments are especially vital for improving generalizability. Institutions and funders can play a critical role by creating infrastructures that make best practices achievable. For example, federally supported data repositories lower barriers to sharing real‐world, multimodal sensor datasets collected in ecologically valid contexts across varied samples and populations. Similarly, bias audits—already mandated in some applied AI domains (e.g., NIST's speech recognition benchmarks across accents)—could become standard practice in developmental algorithm research. More broadly, the field lacks shared standards for evaluating the validity and reliability of algorithmic approaches. What counts as “accurate enough,” how large or diverse a dataset is sufficient, and what levels of replication are needed remain open questions. Establishing evidence‐based norms and benchmarks—analogous to efforts in mobile sensing to standardize data collection parameters (Harari and Gosling 2023)—would help ensure rigor without imposing unrealistic burdens. Developing such standards should be a priority for the field moving forward.

7.4. Supporting Ecosystems for Collaboration

The interdisciplinary nature of this work demands collaborative ecosystems that span disciplines, institutions, and communities. Addressing the technical, ethical, and practical challenges of AI in developmental science requires partnerships between developmentalists, engineers, data scientists, clinicians, and—critically—community members and practitioners with lived experience. Platforms that support modular contributions, shared governance, and ongoing cocreation can help move the field from isolated innovations toward collective progress. Fostering these ecosystems will require intentional structural and cultural investments in openness, trust, and shared credit.

7.5. Navigating the Commercial Landscape of Mobile Sensing Technology

The explosion of consumer‐facing sensing and digital health tools—many developed outside of research settings—offers both opportunity and risk. On one hand, commercial partners may provide the infrastructure, reach, and long‐term support needed to collect data or disseminate digital interventions at scale. On the other hand, commercial endeavors may be misaligned with the goals of researchers. For example, commercial sensing or AI platforms may suppress unfavorable results or abandon maintenance of tools when company priorities shift. On the intervention side, the proliferation of thousands of unregulated mental health apps has saturated the space with tools of highly variable quality. To ensure that the most needed and useful tools and evidence‐based interventions are developed and are sustained over time, researchers may need to partner with implementation channels outside of academia. Collaborating with industry, healthcare systems, or mission‐aligned nonprofit platforms can help bridge the gap between rigorous science and scalable, real‐world delivery. As Harari and Gosling (2023) emphasize, unresolved issues around measurement, standardization, and privacy in mobile sensing are not just academic, but determine whether tools can be commercialized in ways that are reliable, ethical, and sustainable. Taken together, these considerations suggest that commercialization is not a separate concern but an extension of the scientific process. When approached with care, commercialization allows developmental science to advance while delivering more robust tools to support research and intervention.

7.6. Implementation and Integration Efforts

A number of early initiatives offer encouraging examples of how collaborative infrastructure systems can be deployed, tested, and shared to support broader use. For example, Colliga, a platform developed by the third author, offers shared infrastructure to support the development, testing, and dissemination of AI‐enhanced interventions (Timmons and Ahle 2020a, 2020b). It allows scientists to deploy personalized algorithms within modular intervention programs and is available for free research use by emailing the last author. Similarly, the MD2K initiative was established to advance the foundational science needed to convert the vast data generated by wearable sensors into meaningful, usable health insights (https://md2k.org/). In addition, data repositories such as HomeBank (https://www.talkbank.org/), Databrary (https://databrary.org/), and the Colliga repository (https://colliga.io/repository/) allow researchers to develop and evaluate models in the diverse, messy conditions in which such tools will actually be used. However, despite the potential of such infrastructure, widespread implementation and adoption across the developmental science community have yet to be realized. To scale these systems effectively, the field must continue investing in infrastructure that is not only technically robust but also sustainable, well‐supported, and integrated into researchers' day‐to‐day workflows.

8. Conclusion

As developmental researchers embrace the tools of AI and mobile sensing, the field stands at a pivotal crossroads. This paper offers a set of practical guidelines for critically navigating the use of algorithms to detect behavior, grounded in lessons learned from early efforts across the field. Although the technical promise of these tools is clear, their scientific and societal value will depend on how thoughtfully they are developed, evaluated, and implemented. Moving forward, the field must prioritize not only accuracy but also equity, feasibility, and long‐term impact. This will require new forms of infrastructure, sustained interdisciplinary collaboration, and a commitment to shared standards and open practices. Most importantly, we must center real‐world contexts and ensure that the tools we develop reflect and serve the diverse populations and settings in which children develop. With intentional design and collective effort, we can build a future in which algorithmic tools enhance—not replace—human insight in developmental science.

Ethics Statement

This manuscript involved no human subjects data and thus ethics approval was not obtained.

Conflicts of Interest

Adela C. Timmons owns intellectual property and stock in Colliga Apps Corporation and could benefit financially from the commercialization of related research. The other authors declare no conflicts of interest.

Supporting information

Supporting File 1: desc70144‐sup‐0001‐SuppMat.docx


Acknowledgments

This research was funded by the National Institutes of Health under Grant No. 1R01DA059423‐01 (PI: K.B.) and Grant No. R42MH123368 (Timmons, Comer, and Ahle, Co‐PIs) and supported by the Whole Communities‐Whole Health research grand challenge at the University of Texas at Austin.

Data Availability Statement

This study does not report upon novel data or findings, and thus there are no data to publicly share.

References

  1. Amrani, H. , Micucci D., Mobilio M., and Napoletano P.. 2025. “Leveraging Dataset Integration and Continual Learning for Human Activity Recognition.” International Journal of Machine Learning and Cybernetics 16, no. 7: 5213–5234. 10.1007/s13042-025-02569-1. [DOI] [Google Scholar]
  2. Bakeman, R. , and Gottman J. M.. 1997. Observing Interaction: An Introduction to Sequential Analysis. Cambridge University Press. [Google Scholar]
  3. Baumgartner, H. A. , Alessandroni N., Byers‐Heinlein K., et al. 2023. “How to Build up Big Team Science: A Practical Guide for Large‐scale Collaborations.” Royal Society Open Science 10, no. 6: 230235. 10.1098/rsos.230235. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Bergelson, E. 2015. HomeBank English Bergelson Seedlings Corpus [Dataset]. TalkBank. 10.21415/T5PK6D. [DOI] [Google Scholar]
  5. Bergelson, E. , Soderstrom M., Schwarz I.‐C., et al. 2023. “Everyday Language Input and Production in 1,001 Children From Six Continents.” Proceedings of the National Academy of Sciences of the United States of America 120, no. 52: e2300671120. 10.1073/pnas.2300671120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Buolamwini, J. , and Gebru T.. 2018. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” Proceedings of the 1st Conference on Fairness, Accountability and Transparency 77–91. https://proceedings.mlr.press/v81/buolamwini18a.html. [Google Scholar]
  7. Cao, Z. , Simon T., Wei S. E., and Sheikh Y.. 2017. “Realtime Multi‐Person 2d Pose Estimation Using Part Affinity Fields.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 7291–7299. 10.1109/CVPR.2017.143. [DOI]
  8. Chentsova‐Dutton, Y. E. , and Ryder A. G.. 2020. “Cultural Models of Normalcy and Deviancy.” Asian Journal of Social Psychology 23, no. 2: 187–204. 10.1111/ajsp.12413. [DOI] [Google Scholar]
  9. Cristia, A. , Bulgarelli F., and Bergelson E.. 2020. “Accuracy of the Language Environment Analysis System Segmentation and Metrics: A Systematic Review.” Journal of Speech, Language, and Hearing Research 63, no. 4: 1093–1105. 10.1044/2020_JSLHR-19-00017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Cristia, A. , Gautheron L., Zhang Z., et al. 2024. “Establishing the Reliability of Metrics Extracted From Long‐Form Recordings Using LENA and the ACLEW Pipeline.” Behavior Research Methods 56, no. 8: 8588–8607. 10.3758/s13428-024-02493-2. [DOI] [PubMed] [Google Scholar]
  11. Cychosz, M. , Romeo R., Soderstrom M., et al. 2020. “Longform Recordings of Everyday Life: Ethics for Best Practices.” Behavior Research Methods 52: 1951–1969. 10.3758/s13428-020-01365-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. d'Andrea, F. , Heller B., Wheat J., and Penitente G.. 2025. “Gait Temporal Parameters Estimation in Toddlers Using Inertial Measurement Units: A Comparison of 15 Algorithms.” Gait & Posture 119: 77–86. 10.1016/j.gaitpost.2025.02.024. [DOI] [PubMed] [Google Scholar]
  13. de Barbaro, K. 2019. “Automated Sensing of Daily Activity: A New Lens Into Development.” Developmental Psychobiology 50th Anniversary Special Issue 61, no. 3: 444–464. 10.1002/dev.21831. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. de Barbaro, K. , and Fausey C.. 2023. “Mobile Sensing in Developmental Science: A Practical Guide for Researchers.” In Mobile Sensing in Psychology: Methods and Applications. Guilford Press. [Google Scholar]
  15. DeCamp, M. , and Lindvall C.. 2020. “Latent Bias and the Implementation of Artificial Intelligence in Medicine.” Journal of the American Medical Informatics Association 27, no. 12: 2020–2023. 10.1093/jamia/ocaa094. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Deng, W. , O'Brien M. K., Andersen R. A., Rai R., Jones E., and Jayaraman A.. 2025. “A Systematic Review of Portable Technologies for the Early Assessment of Motor Development in Infants.” NPJ Digital Medicine 8, no. 1: 1–15. 10.1038/s41746-025-01450-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Franchak, J. M. , Scott V., and Luo C.. 2021. “A Contactless Method for Measuring Full‐Day, Naturalistic Motor Behavior Using Wearable Inertial Sensors.” Frontiers in Psychology 12: 701343. 10.3389/fpsyg.2021.701343. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Franchak, J. M. , Tang M., Rousey H., and Luo C.. 2024. “Long‐Form Recording of Infant Body Position in the Home Using Wearable Inertial Sensors.” Behavior Research Methods 56, no. 5: 4982–5001. 10.3758/s13428-023-02236-9. [DOI] [PubMed] [Google Scholar]
  19. Galhotra, Y. , Khante P., Madden‐Rusnak A., and de Barbaro K.. (under review). “Three‐Way Classification of Infant Fussing and Crying in Real‐World Environments.” Interspeech 2026.
  20. Ganek, H. , Smyth R., Nixon S., and Eriks‐Brophy A.. 2018. “Using the Language ENvironment Analysis (LENA) System to Investigate Cultural Differences in Conversational Turn Count.” Journal of Speech, Language and Hearing Research (Online) 61, no. 9: 1–13. 10.1044/2018_JSLHR-L-17-0370. [DOI] [PubMed] [Google Scholar]
  21. Gemmeke, J. F. , Ellis D. P. W., Freedman D., et al. 2017. “Audio Set: An Ontology and Human‐Labeled Dataset for Audio Events.” In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 776–780. 10.1109/ICASSP.2017.7952261. [DOI]
  22. Gilkerson, J. , Richards J. A., and Topping K.. 2017. “Evaluation of a LENA‐Based Online Intervention for Parents of Young Children.” Journal of Early Intervention 39, no. 4: 281–298. 10.1177/1053815117718490. [DOI] [Google Scholar]
  23. Gillick, J. , Deng W., Ryokai K., and Bamman D.. 2021. “Robust Laughter Detection in Noisy Environments.” Interspeech 2021 2481–2485. 10.21437/Interspeech.2021-353. [DOI] [Google Scholar]
  24. Goodman, K. W. , Berner E. S., Dente M. A., et al. 2011. “Challenges in Ethics, Safety, Best Practices, and Oversight Regarding HIT Vendors, Their Customers, and Patients: A Report of an AMIA Special Task Force.” Journal of the American Medical Informatics Association: JAMIA 18, no. 1: 77–81. 10.1136/jamia.2010.008946. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Google Research . 2019. YAMNet: An Audio Event Classifier Trained on AudioSet [Dataset]. GitHub repository. https://github.com/tensorflow/models/tree/master/research/audioset/yamnet. [Google Scholar]
  26. Harari, G. M. , and Gosling S. D.. 2023. “Understanding Behaviours in Context Using Mobile Sensing.” Nature Reviews Psychology 2, no. 12: 767–779. 10.1038/s44159-023-00235-3. [DOI] [Google Scholar]
  27. Harari, G. M. , Lane N. D., Wang R., Crosier B. S., Campbell A. T., and Gosling S. D.. 2016. “Using Smartphones to Collect Behavioral Data in Psychological Science: Opportunities, Practical Considerations, and Challenges.” Perspectives on Psychological Science: A Journal of the Association for Psychological Science 11, no. 6: 838–854. 10.1177/1745691616650285. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Hendry, D. , Rohl A. L., Rasmussen C. L., et al. 2023. “Objective Measurement of Posture and Movement in Young Children Using Wearable Sensors and Customised Mathematical Approaches: A Systematic Review.” Sensors 23, no. 24: 24. 10.3390/s23249661. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Hindman, M. 2015. “Building Better Models: Prediction, Replication, and Machine Learning in the Social Sciences.” The ANNALS of the American Academy of Political and Social Science 659, no. 1: 48–62. 10.1177/0002716215570279. [DOI] [Google Scholar]
  30. Hooker, S. 2021. “Moving Beyond “Algorithmic Bias Is a Data Problem”.” Patterns 2, no. 4: 100241. 10.1016/j.patter.2021.100241. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Israel, B. A. , Schulz A. J., Parker E. A., and Becker A. B.. 1998. “Review of Community‐Based Research: Assessing Partnership Approaches to Improve Public Health.” Annual Review of Public Health 19: 173–202. 10.1146/annurev.publhealth.19.1.173. [DOI] [PubMed] [Google Scholar]
  32. Jacobson, N. C. , Weingarden H., and Wilhelm S.. 2019. “Digital Biomarkers of Mood Disorders and Symptom Change.” NPJ Digital Medicine 2, no. 1: 1–3. 10.1038/s41746-019-0078-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Jacobucci, R. , Grimm K. J., and Zhang Z.. 2023. Machine Learning for Social and Behavioral Research. Guilford Publications. [Google Scholar]
  34. Jayaraman, S. , Fausey C. M., and Smith L. B.. 2015. “The Faces in Infant‐Perspective Scenes Change Over the First Year of Life.” PLoS ONE 10, no. 5: e0123780. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Ji, C. , Mudiyanselage T. B., Gao Y., and Pan Y.. 2021. “A Review of Infant Cry Analysis and Classification.” EURASIP Journal on Audio, Speech, and Music Processing 2021, no. 1: 8. 10.1186/s13636-021-00197-5. [DOI] [Google Scholar]
  36. Karaca, B. , Salah A. A., Denissen J., Poppe R., and de Zwarte S. M. C.. 2024. “Survey of Automated Methods for Nonverbal Behavior Analysis in Parent‐Child Interactions.” In 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG) , 1–11. 10.1109/FG59268.2024.10582009. [DOI]
  37. Khan, M. N. H. , Li J., McElwain N. L., Hasegawa‐Johnson M., and Islam B.. 2024. “Sound Tagging in Infant‐Centric Home Soundscapes.” In 2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE) , 142–146. 10.1109/CHASE60773.2024.00023. [DOI]
  38. Khante, P. , Madden‐Rusnak A., and de Barbaro K.. 2025. “Real‐World Classification of Caregiver Sensitivity to Infant Distress.” In ICDL 2025 IEEE International Conference on Development and Learning (ICDL) , 1–8. 10.1109/ICDL63968.2025.11204456. [DOI]
  39. Khante, P. , Thomaz E., and de Barbaro K.. 2023. “Auditory Chaos Classification in Real‐World Environments.” Frontiers in Digital Health 5: 1261057. 10.3389/fdgth.2023.1261057. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Koenecke, A. , Nam A., Lake E., et al. 2020. “Racial Disparities in Automated Speech Recognition.” Proceedings of the National Academy of Sciences of the United States of America 117, no. 14: 7684–7689. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Koepp, A. E. , Gershoff E. T., Castelli D. M., and Bryan A. E.. 2021. “Measuring Children's Behavioral Regulation in the Preschool Classroom: An Objective, Sensor‐Based Approach.” Developmental Science 25, no. 5: e13214. 10.1111/desc.13214. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Kostick‐Quenet, K. M. , Cohen I. G., Gerke S., et al. 2022. “Mitigating Racial Bias in Machine Learning.” The Journal of Law, Medicine & Ethics 50, no. 1: 92–100. 10.1017/jme.2022.13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Lavechin, M. , Bousbib R., Bredin H., Dupoux E., and Cristia A.. 2020. “An Open‐Source Voice Type Classifier for Child‐Centered Daylong Recordings.” Interspeech 2020: 3072–3076. 10.21437/Interspeech.2020-1690. [DOI] [Google Scholar]
  44. LeCun, Y. , Bengio Y., and Hinton G.. 2015. “Deep Learning.” Nature 521, no. 7553: 436. [DOI] [PubMed] [Google Scholar]
  45. Lee, K. , Henry L. M., Hansen E., et al. 2025. Enhancing Infant Crying Detection With Gradient Boosting for Improved Emotional and Mental Health Diagnostics (No. arXiv:2410.09236). arXiv. 10.48550/arXiv.2410.09236. [DOI]
  46. Li, J. , Hasegawa‐Johnson M., and McElwain N. L.. 2024. “Analysis of Self‐Supervised Speech Models on Children's Speech and Infant Vocalizations.” In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) , 550–554. 10.1109/ICASSPW62465.2024.10626416. [DOI] [PMC free article] [PubMed]
  47. Liaqat, D. , Liaqat S., Chen J. L., Sedaghat T., Gabel M., and Rudzicz F.. 2021. “Coughwatch: Real‐World Cough Detection Using Smartwatches.” In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 8333–8337. 10.1109/ICASSP39728.2021.9414881. [DOI]
  48. Lockhart, J. W. , and Weiss G. M.. 2014. “Limitations With Activity Recognition Methodology and Data Sets.” In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication , 747–756. 10.1145/2638728.2641306. [DOI]
  49. Marschik, P. B. , Pokorny F. B., Peharz R., et al. 2017. “A Novel Way to Measure and Predict Development: A Heuristic Approach to Facilitate the Early Detection of Neurodevelopmental Disorders.” Current Neurology and Neuroscience Reports 17, no. 5: 43. 10.1007/s11910-017-0748-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. McDonald, M. , Kwon T., Kim H., Lee Y., and Ko E.‐S.. 2021. “Evaluating the Language Environment Analysis System for Korean.” Journal of Speech, Language, and Hearing Research 64, no. 3: 792–808. 10.1044/2020_JSLHR-20-00489. [DOI] [PubMed] [Google Scholar]
  51. Mehrabi, N. , Morstatter F., Saxena N., Lerman K., and Galstyan A.. 2021. “A Survey on Bias and Fairness in Machine Learning.” ACM Computing Surveys 54, no. 6: 115:1–115:35. 10.1145/3457607. [DOI] [Google Scholar]
  52. Micheletti, M. , Yao X., Johnson M., and de Barbaro K.. 2023. “Validating a Model to Detect Infant Crying From Naturalistic Audio.” Behavior Research Methods 55, no. 6: 3187–3197. 10.3758/s13428-022-01961-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Mikhelson, M. , Luong A., Etz A., Micheletti M., Khante P., and de Barbaro K.. 2024. “Mothers Speak Less to Infants During Detected Real‐World Phone Use.” Child Development 95, no. 5: e324–e337. 10.1111/cdev.14125. [DOI] [PubMed] [Google Scholar]
  54. Nahum‐Shani, I. , Smith S. N., Spring B. J., et al. 2016. “Just‐in‐Time Adaptive Interventions (JITAIs) in Mobile Health: Key Components and Design Principles for Ongoing Health Behavior Support.” Annals of Behavioral Medicine 52, no. 6: 446–462. 10.1007/s12160-016-9830-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Nebeker, C. , Torous J., and Bartlett Ellis R. J.. 2019. “Building the Case for Actionable Ethics in Digital Health Research Supported by Artificial Intelligence.” BMC Medicine 17, no. 1: 137. 10.1186/s12916-019-1377-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Nencheva, M. L. , Tamir D. I., and Lew‐Williams C.. 2023. “Caregiver Speech Predicts the Emergence of Children's Emotion Vocabulary.” Child Development 94, no. 3: 585–602. 10.1111/cdev.13897. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Nweke, H. F. , Teh Y. W., Al‐garadi M. A., and Alo U. R.. 2018. “Deep Learning Algorithms for Human Activity Recognition Using Mobile and Wearable Sensor Networks: State of the Art and Research Challenges.” Expert Systems with Applications 105: 233–261. 10.1016/j.eswa.2018.03.056. [DOI] [Google Scholar]
  58. Obermeyer, Z. , Powers B., Vogeli C., and Mullainathan S.. 2019. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations.” Science 366, no. 6464: 447–453. 10.1126/science.aax2342. [DOI] [PubMed] [Google Scholar]
  59. O'Brien, M. K. , Shawen N., Mummidisetty C. K., et al. 2017. “Activity Recognition for Persons With Stroke Using Mobile Phone Technology: Toward Improved Performance in a Home Setting.” Journal of Medical Internet Research 19, no. 5: e7385. 10.2196/jmir.7385. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Pargent, F. , Schoedel R., and Stachl C.. 2023. “Best Practices in Supervised Machine Learning: A Tutorial for Psychologists.” Advances in Methods and Practices in Psychological Science 6, no. 3: 25152459231162559. 10.1177/25152459231162559. [DOI] [Google Scholar]
  61. Piot, L. , Havron N., and Cristia A.. 2022. “Socioeconomic Status Correlates With Measures of Language Environment Analysis (LENA) System: A Meta‐Analysis.” Journal of Child Language 49, no. 5: 1037–1051. 10.1017/S0305000921000441. [DOI] [PubMed] [Google Scholar]
  62. Pires, I. M. , Marques G., Garcia N. M., et al. 2020. “A Review on the Artificial Intelligence Algorithms for the Recognition of Activities of Daily Living Using Sensors in Mobile Devices.” In Handbook of Wireless Sensor Networks: Issues and Challenges in Current Scenario's, edited by Singh P. K. Bhargava B. K. Paprzycki M. Kaushal N. C., and Hong W.‐C., 685–713. Springer International Publishing. 10.1007/978-3-030-40305-8_33. [DOI] [Google Scholar]
  63. Radford, A. , Kim J. W., Xu T., Brockman G., McLeavey C., and Sutskever I.. 2023. “Robust Speech Recognition via Large‐Scale Weak Supervision.” In International Conference on Machine Learning, 28492–28518. PMLR. [Google Scholar]
  64. Räsänen, O. , Seshadri S., Lavechin M., Cristia A., and Casillas M.. 2021. “ALICE: An Open‐Source Tool for Automatic Measurement of Phoneme, Syllable, and Word Counts From Child‐Centered Daylong Recordings.” Behavior Research Methods 53, no. 2: 818–835. 10.3758/s13428-020-01460-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Rathje, S. , Mirea D.‐M., Sucholutsky I., Marjieh R., Robertson C. E., and Van Bavel J. J.. 2024. “GPT Is an Effective Tool for Multilingual Psychological Text Analysis.” Proceedings of the National Academy of Sciences of the United States of America 121, no. 34: e2308950121. 10.1073/pnas.2308950121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Rogoff, B. 2003. The Cultural Nature of Human Development. Oxford University Press. [Google Scholar]
  67. Saliba, M. , Drapeau N., Skime M., et al. 2023. “PISTACHIo (PreemptIon of diSrupTive behAvior in CHIldren): Real‐Time Monitoring of Sleep and Behavior of Children 3–7 Years Old Receiving Parent–Child Interaction Therapy Augment With Artificial Intelligence – The Study Protocol, Pilot Study.” Pilot and Feasibility Studies 9, no. 1: 23. 10.1186/s40814-023-01254-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Sazonov, E. , Sazonova N., Schuckers S., and Neuman M., CHIME Study Group. 2004. “Activity‐Based Sleep–Wake Identification in Infants.” Physiological Measurement 25, no. 5: 1291–1304. 10.1088/0967-3334/25/5/018. [DOI] [PubMed] [Google Scholar]
  69. Scaff, C. , Loukatou G., Cristia A., and Havron N.. 2025. “Demographic Biases in Naturalistic Language Recordings in the CHILDES Database.” Developmental Science 28, no. 3: e70011. 10.1111/desc.70011. [DOI] [PubMed] [Google Scholar]
  70. Schroer, S. E. , Peters R. E., and Yu C.. 2024. “Consistency and Variability in Multimodal Parent–Child Social Interaction: An at‐Home Study Using Head‐Mounted Eye Trackers.” Developmental Psychology 60, no. 8: 1432–1446. 10.1037/dev0001756. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Smith, L. B. , Jayaraman S., Clerkin E., and Yu C.. 2018. “The Developing Infant Creates a Curriculum for Statistical Learning.” Trends in Cognitive Sciences 22, no. 4: 325–336. [DOI] [PMC free article] [PubMed]
  72. Soderstrom, M. , Casillas M., Bergelson E., et al. 2021. “Developing a Cross‐Cultural Annotation System and MetaCorpus for Studying Infants' Real World Language Experience.” Collabra: Psychology 7, no. 1: 23445. 10.1525/collabra.23445. [DOI] [Google Scholar]
  73. Tebes, J. K. 2018. “Team Science, Justice, and the Co‐Production of Knowledge.” American Journal of Community Psychology 62, no. 1–2: 13–22. 10.1002/ajcp.12252. [DOI] [PubMed] [Google Scholar]
  74. Timmons, A. C. , and Ahle M.. 2020a. The Colliga App [Computer software]. www.colliga.io.
  75. Timmons, A. C. , Feng K., Carta K., et al. 2024. “Biased Bots: An Empirical Demonstration of How AI Bias Could Compromise Mental Healthcare.” OSF. 10.31234/osf.io/7t98e. [DOI]
  76. Timmons, A. C. , Tutul A. A., Avramidis K., et al. 2024. “Developing Personalized Algorithms for Sensing Mental Health Symptoms in Daily Life.” OSF. 10.31234/osf.io/tzd7w. [DOI] [PMC free article] [PubMed]
  77. Timmons, A. C. , and Ahle M. W.. 2020b. Colliga Data Repository [Dataset]. www.colliga.io.
  78. Timmons, A. C. , Baucom B. R., Han S. C., et al. 2017. “New Frontiers in Ambulatory Assessment: Big Data Methods for Capturing Couples' Emotions, Vocalizations, and Physiology in Daily Life.” Social Psychological and Personality Science 8, no. 5: 552–563. [Google Scholar]
  79. Timmons, A. C. , Duong J. B., Simo Fiallo N., et al. 2023. “A Call to Action on Assessing and Mitigating Bias in Artificial Intelligence Applications for Mental Health.” Perspectives on Psychological Science 18, no. 5: 1062–1096. 10.1177/17456916221134490. [DOI] [PMC free article] [PubMed] [Google Scholar]
  80. Timmons, A. C. , Duong J. B., Walters S. N., et al. 2025. “Bridging Fair‐Aware Artificial Intelligence and Co‐Creation for Equitable Mental Healthcare.” Nature Reviews Psychology 4: 793–807. 10.1038/s44159-025-00491-5. [DOI] [Google Scholar]
  81. Vayena, E. , Blasimme A., and Cohen I. G.. 2018. “Machine Learning in Medicine: Addressing Ethical Challenges.” PLoS Medicine 15, no. 11: e1002689. 10.1371/journal.pmed.1002689. [DOI] [PMC free article] [PubMed] [Google Scholar]
  82. Wallerstein, N. , and Duran B.. 2010. “Community‐Based Participatory Research Contributions to Intervention Research: The Intersection of Science and Practice to Improve Health Equity.” American Journal of Public Health 100, no. S1: S40–S46. 10.2105/AJPH.2009.184036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  83. Wang, R. , Chen F., Chen Z., et al. 2014. “StudentLife: Assessing Mental Health, Academic Performance and Behavioral Trends of College Students Using Smartphones.” In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 3–14. 10.1145/2632048.2632054. [DOI]
  84. Warlaumont, A. S. , Richards J. A., Gilkerson J., and Oller D. K.. 2014. “A Social Feedback Loop for Speech Development and Its Reduction in Autism.” Psychological Science 25, no. 7: 1314–1324. 10.1177/0956797614531023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. Webb, C. A. , Ren B., Rahimi‐Eichi H., Gillis B. W., Chung Y., and Baker J. T.. 2025. “Personalized Prediction of Negative Affect in Individuals With Serious Mental Illness Followed Using Long‐Term Multimodal Mobile Phenotyping.” Translational Psychiatry 15, no. 1: 174. 10.1038/s41398-025-03394-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  86. Xu, J. , Han Z. R., Lv X., et al. 2025. “A Scalable Mental Health Intervention for Depressive Symptoms: Evidence From a Randomized Controlled Trial and Large‐Scale Real‐World Studies.” NPJ Digital Medicine 8: 491. 10.1038/s41746-025-01888-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Xu, T. L. , de Barbaro K., Abney D. H., and Cox R. F. A.. 2020. “Finding Structure in Time: Visualizing and Analyzing Behavioral Time Series.” Frontiers in Psychology 11: 1457. 10.3389/fpsyg.2020.01457. [DOI] [PMC free article] [PubMed]
  88. Yao, X. , Micheletti M., Johnson M., Thomaz E., and de Barbaro K.. 2022. “Infant Crying Detection in Real‐World Environments.” In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , 131–135. 10.1109/ICASSP43922.2022.9746096. [DOI] [PMC free article] [PubMed]
  89. Yao, X. , Plötz T., Johnson M., and Barbaro K. D.. 2019. “Automated Detection of Infant Holding Using Wearable Sensing: Implications for Developmental Science and Intervention.” Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3, no. 2: 64. 10.1145/3328935. [DOI] [PMC free article] [PubMed] [Google Scholar]
