. 2026 Feb 24;29(2):e70156. doi: 10.1111/desc.70156

Evaluating Open‐Source Solutions for Computerized Inference of Infant Facial Affect

Martin Lund Trinhammer 1,2, Ida Egmose 3, Marianne Thode Krogh 3, Anne Christine Stuart 3, Mette Skovgaard Væver 3,4, Sami Sebastian Brandt 1,2,, Stella Graßhof 1,2
PMCID: PMC12930228  PMID: 41732111

ABSTRACT

Infant affect is often expressed through facial expressions, making this modality a key source of insight into the child's well‐being and social functioning. Computational inference of infant affect could critically assist both researchers and clinicians working with infant development and mitigate the need for manual coding. While many studies have explored open‐source solutions in the adult domain, only the commercial Baby FaceReader 9 exists for the infant domain. To address this gap, we utilize the recently proposed, open‐source infant‐native action unit (AU) detection library PyAFAR (Python‐based Automated Facial Action Recognition) on a sample of 71 four‐month‐old infants, whose facial expressions were manually annotated frame‐by‐frame for three minutes according to the Infant Facial Affect (IFA) coding scheme. Using these AUs as features, we classify facial affect into negative, neutral, and positive using XGBoost and Bayesian filtering, both in a multiclass and a binary setup. Our results show that AU estimates from PyAFAR, combined with an XGBoost classification model, can distinguish positive from neutral and positive from negative affect with AUC scores of 0.78 and 0.76, respectively. This performance is essentially on par with that reported in evaluation studies of the Baby FaceReader 9, when accounting for differences in study setup. Our work indicates that the area of infant facial affect is particularly well‐suited to supervised learning, given the availability of two distinct, commensurable measurement schemes that underpin the same phenomenon. Finally, we discuss how future iterations of PyAFAR may benefit from including AUs that capture more variability around the infant forehead and mouth opening.

Keywords: computer vision, infant facial affect, machine learning, open‐source

Summary

  • Open‐source models for infant face detection and action unit estimation enable affect estimation comparable to that of commercial tools.

  • The two main measurement schemes used for annotating infant affect are highly commensurable, suggesting a fruitful avenue for imitation learning.

  • Next iterations of infant action unit detection models may benefit from incorporating features specific to infant forehead activation, mouth opening, and mouth widening.

1. Introduction

Facial expressions remain among the earliest indicators of infant socio‐emotional development and affective state, and they have long been a focus of scientific inquiry in developmental psychology (Camras et al. 1992; Izard 1997; Messinger et al. 2012). Inferring infant affective state from facial expressions is a valuable tool for both researchers and clinicians. However, its reliance on time‐consuming manual annotation severely limits its application. This requirement is a significant bottleneck for scientific inquiry and reliable clinical estimation, as it is challenging to administer efficiently and effectively, and excludes personnel without specialized training. Automating the coding of facial expressions according to affective state, based on a robust theory, could provide faster access to details on infant development and functioning and open new pathways for researchers and clinicians.

Supported by notable developments in computer vision, efforts to automate the coding of adult facial expressions have seen substantial improvements, with multiple different software packages being available (Cheong et al. 2023; Baltrušaitis et al. 2016). Given the differences in both facial morphology and the nature of facial communication between adults and infants (Onal Ertugrul et al. 2023), tools built for adults cannot necessarily be applied to the infant domain. However, a small but growing cluster of computational approaches can be identified that are native to the infant domain. Relevant contributions can be grouped in terms of 1) studies that propose infant‐native solutions to address the preprocessing of raw video or image data of infants and 2) studies that explore affect or valence estimation of infant face configurations. In terms of the first cluster of contributions, Reyes‐Hernández et al. 2024 has recently offered the first infant face detection module, applying oriented bounding boxes to capture the high variability in infant orientation. When estimating facial features in a computer vision research design, a face detection model is necessary to “zoom in” on the relevant region of the frame containing the face. This region is then commonly cropped and used in the next step of a preprocessing pipeline. Following this, features (such as action units (AUs) or landmarks) need to be extracted from the infant's face to function as inputs to a computational model. To address this, Wan et al. 2022 proposed a model and dataset for infant facial landmark detection. Landmark detection is a technique that assigns key points, i.e., landmarks, to areas of the face, such as eyebrows, jawline, forehead, and mouth. The coordinates of this cluster of points form a mesh that captures granular changes in facial movement over time. 
In both the adult and infant literature, however, the features most often used are the AUs, which comprise an occurrence and/or intensity estimate of a facial movement, for instance, cheek raising. This terminology was originally developed for adults and further revised for infants (Ekman et al. 1978; Hjortsjo 1969; Oster et al. 2016). An AU estimation model has recently been proposed for the infant domain (Onal Ertugrul et al. 2023; Hinduja et al. 2023), thus enabling feature extraction through the common terminology of AUs for which there already lies considerable research on activation patterns for different affective states (Messinger et al. 2012; Colonnesi et al. 2012).

With regards to the second cluster of contributions, which attempts computational inference of infant affective state, the model and software known as the Noldus Baby FaceReader (Noldus 2022) and the associated performance evaluation of this software (Zaharieva et al. 2024) stand out. The Baby FaceReader is a commercial software module that enables AU estimation and inference of infant emotional valence scores. In their performance evaluation study, Zaharieva et al. 2024 found that the Baby FaceReader 9 can provide reliable estimates of positive versus neutral/negative affect in infants aged four and eight months. The tool, however, is only available under a license, which likely prevents broader adoption by small research groups or clinicians in low‐resource communities. Democratizing the computational tools necessary to enable affect inference could greatly increase its utility for researchers and pave the way for novel developments in applied clinical contexts. As noted, the array of open‐source preprocessing tools native to the infant domain is growing, and it is therefore critical to investigate if comparable performance on affect estimation can be achieved through the application of common software modules.

In the present study, we will evaluate if reliable estimates of infant facial affect can be achieved using the set of available open‐source tools (Reyes‐Hernández et al. 2024; Onal Ertugrul et al. 2023; Hinduja et al. 2023) and compare our estimates to those achieved by the current (commercial) baseline as reported in Zaharieva et al. 2024.

1.1. Different Measures of What Constitutes Negative, Neutral, and Positive Affect in Infant Facial Expressions

When considering automating a manual coding process, it is critical to closely examine the theory used to annotate the labels on which the models are trained (Mullainathan and Obermeyer 2021). Essentially, the proposed model can be seen as automating the annotator's work (supervised learning), thereby replicating their biases and the measurement system they relied on in the initial annotation. The present study uses data annotated according to the infant facial affect (IFA) coding scheme (Beebe et al. 2010; Koulomzin et al. 2002; Egmose et al. 2018), developed by Beatrice Beebe. The IFA system codes the infant's face according to five distinct affective conditions: neutral, positive (high/low), or negative (high/low). Commonly, these distinctions are collapsed into three: neutral, positive, and negative (Koulomzin et al. 2002). The coding criteria are summarized in Table 1.

TABLE 1.

Infant Facial Affect coding criteria from Koulomzin et al. 2002; Beebe et al. 2010.

Infant facial affect Criteria/Definition Mouth widen (MW) Mouth open (MO)
Positive (high) Forehead smooth, cheeks raised, mouth corners drawn back and curved up in full display, mouth fully open MW 2 MO 3 or 4
Positive (low) Forehead smooth, eyes open, mouth corners curved up, mouth open or closed MW 1 MO 1 or 2
Neutral Forehead smooth, eyes open, mouth relaxed open/closed, or slightly pursed MW 0 or 1 MO 0 or 1
Negative (low) Inner corners of eyebrows raised, eyes open or squinting, mouth corners down (grimace), or lips squeezed tightly together (“line mouth”) MO 0 or 1
Negative (high) Pre cry/cry‐face (partial/full display) Eyebrows drawn together (classical frown), eyes squinting, mouth angry, open and squarish MO 2 or 3

Note: MW0 = lips neutral, MW1 = sideways lip stretch, MW2 = lip‐corner raise, MO1 = lips slightly parted, MO2 = mouth slightly open, MO3 = mouth medium open, MO4 = mouth fully open

This annotation scheme is widely used in the field of developmental psychology (Egmose et al. 2018; Lotzin et al. 2016; Margolis et al. 2019; Væver et al. 2020), even though the nosology originating from the Facial Action Coding System (FACS) (Ekman et al. 1978) is often incorrectly assumed to be the primary or only tool available (Zaharieva et al. 2024). When the FACS nosology is used in the infant context, applications are often attributed to the extension known as Baby FACS (Oster 1978; Oster et al. 2016; Messinger et al. 2012). From this literature, the AU combinations yielding instances of positive and negative affect are shown in Table 2.

TABLE 2.

Annotation criteria used in sample from Zaharieva et al. 2024, according to Colonnesi et al. 2012; Bolzani Dinehart et al. 2005.

Facial expression Action units
Positive facial expressions
Smile with eye constriction AU12, AU6 (Duchenne smile)
Smile with mouth opening AU12, AU25, AU26, AU27
Negative facial expressions
Lip stretching with brow lowering AU20, AU3, AU4
Lip stretching with eye constriction AU20, AU6, AU7 (Duchenne cry‐face)
Lip stretching with mouth opening AU20, AU25, AU26, AU27

Note: A neutral expression was annotated when neither a positive nor a negative expression was present. The table is inspired by table 1 from Zaharieva et al. 2024.

The granular similarities and differences in coding criteria are important to keep in mind: the labels in the present study were annotated using one scheme (Beebe et al. 2010; Koulomzin et al. 2002) (Table 1), whereas the results we compare against rely on labels obtained through a slightly different scheme (Zaharieva et al. 2024; Colonnesi et al. 2012; Bolzani Dinehart et al. 2005) (Table 2). When evaluating their overlap, it is evident that the two measurement systems are highly commensurable, especially in how they classify instances of positive affect. This is indexed by lip corner pull (AU12) and cheek raiser (AU6) in both contributions, which is essentially the Duchenne smile. Both theories also use the degree of lip stretching (AU20) to determine a negative expression. When distinguishing between negative and neutral expressions, the IFA approach also accounts for the degree of forehead smoothing. This effect is partially captured by AU3 and AU4 (brow tightener and brow lowerer), but for a full estimate, AU1 and AU2 (inner and outer brow raisers) would also need to be included.

In sum, these reflections suggest that it is viable to perform a comparison given the high degree of overlap in the operationalization of infant affect between the two measurement schemes. Further, as two systems developed in different research environments have identified largely the same granular mechanisms relating a facial expression to an affective category, it can be argued that the phenomenon is well accounted for, thus rendering the problem suitable for supervised learning (Kleinberg et al. 2024).

2. Method

2.1. Participants

The data used in the current study originates from a larger longitudinal study focusing on mother‐infant interactions during the first year after birth and their associations with different aspects of infant development. The participants in the larger study were 60 typical mother‐infant dyads and 30 dyads with mothers fulfilling criteria for postpartum depression (PPD). Typical mothers were recruited before birth through webpages targeting pregnant women or mothers with a newborn infant and advertisements at local obstetricians' and midwives' clinics. The mothers with PPD were recruited after birth by referrals from community health visitors (for details regarding the inclusion and assessment of mothers with PPD, see Smith‐Nielsen et al. 2015; 2016). All mothers were primiparous and lived in Greater Copenhagen at the time of recruitment. All infants were born full‐term and had unremarkable pre‐ and postnatal medical histories. Before inclusion, all mothers gave written informed consent to participate in the study. All children received a small gift for participation. The sample in the present study includes 71 dyads: 24 PPD and 47 typical. Nineteen dyads were excluded due to: video recording of mother‐infant dyads missing (n = 3), more than 25% of recordings being noncodable (NA) (n = 15), or mother not fulfilling the criteria for clinical depression (n = 1) (Egmose et al. 2018). Please refer to Table 3 for an overview of the demographic distribution (same data as reported in Egmose et al. 2018).

TABLE 3.

Mother and infant sociodemographic characteristics

Variables Non‐clinical (n = 47) PPD (n = 24)
Infant gender Female 26; Male 21 Female 11; Male 13
Mean infant age (months) (SD) 4.0 (0.2) 3.9 (0.3)
Infant age range (months) 3.8–4.7 3.5–4.6
Mean mother age (years) (SD) 30.6 (3.9) 30.7 (3.9)
Mother age range (years) 23.3–43.3 22.4–38.3
Mean EPDS score (SD) 4.8 (3.6) 15.9 (4.4)
Mean years of education (%)
9–12 years (ISCED level 3) 3 (6.4) 2 (8.3)
14 years (ISCED level 4) 3 (6.4) 2 (8.3)
15 years (ISCED level 5 and 6) 16 (34) 12 (50)
17 years or more (ISCED level 7 and 8) 25 (53.2) 8 (33.3)
Unemployed, n (%) 4 (8.5) 3 (12.5)
Living with partner, n (%) 45 (95.7) 24 (100)
Mother originates from Denmark, n (%) 45 (95.7) 21 (87.5)

a EPDS = Edinburgh postnatal depression scale.

b ISCED = International standard classification of education.

c The category “Mother originates from Denmark” designates individuals who have at least one parent of Danish citizenship who was also born in Denmark.

2.2. Procedure

The experimental design used a standard face‐to‐face setup (Tronick and Cohn 1989), with the infant seated opposite the mother in an infant seat. Each dyad was recorded over a 10‐min session by two cameras. The mothers were instructed to play with their child as they usually would. If the infant became too upset, the mother or the experimenter would stop the assessment before the 10 minutes had passed.

2.3. IFA Coding

IFA was coded on a frame‐by‐frame level for three minutes with the software ELAN 4.0.1 (Lausberg and Sloetjes 2009) using the criteria listed in Table 1. An affective category was coded if it lasted for at least 280 ms. Coding was conducted from the 60th second to the 240th second by psychology students and research assistants (Egmose et al. 2018). For a minority of dyads, the start of the IFA annotation was postponed to the 120th second because the infant took longer to adjust to the setup. Not mentioned in Table 1 is the NA category, which was used when the infant's face was visually occluded for more than 500 ms. The five categories seen in Table 1 were collapsed into three according to Koulomzin et al. 2002, yielding a positive (1), neutral (0), and negative (‐1) distinction. This was done to ensure comparability with other studies (Zaharieva et al. 2024) and is also standard practice for the IFA scheme (Koulomzin et al. 2002). In our study, we intentionally do not analyze differences in IFA between mothers experiencing PPD and typical mothers, as previous work on the same sample showed no viable differences (Egmose et al. 2018). The cameras recorded at 25 frames per second, leaving 25·(60·3) = 4500 frames in scope per infant. We randomly selected a subset of the sample to use for inter‐rater reliability assessment (n = 19; 6 PPD dyads and 13 nonclinical dyads). A different group of coders annotated this subset according to the same procedure as described in Table 1. Using the initial annotations as the ground truth and the reliability‐coded set as the test data, the resulting measure of agreement was quantified: the F1 score was 83%, with recall of 81% and precision of 89%. This indicates very high agreement between annotators (Landis and Koch 1977). We refer to Section 2.7 for the details of how these measures are calculated.
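As a minimal sketch of the collapsing step, the five IFA categories can be mapped to the three‐way scheme as below. The underscore category names are our own shorthand for illustration, not the coding software's export format; NA frames are kept as None so they can be excluded later in the pipeline.

```python
# Sketch: collapse the five IFA categories into the three-class scheme
# (positive = 1, neutral = 0, negative = -1); NA maps to None.
COLLAPSE = {
    "positive_high": 1,
    "positive_low": 1,
    "neutral": 0,
    "negative_low": -1,
    "negative_high": -1,
}

def collapse_labels(frame_labels):
    """Map a sequence of five-way IFA labels to the three-way scheme."""
    return [COLLAPSE.get(lbl) for lbl in frame_labels]

labels = ["positive_high", "neutral", "negative_low", "NA"]
print(collapse_labels(labels))  # [1, 0, -1, None]
```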

2.4. Preprocessing

Our fully automated preprocessing pipeline for a single recording is illustrated in Figure 1. As mentioned, the dyads were recorded by two cameras; however, some recordings captured the infant from a lateral angle, rendering inspection of facial movements infeasible. These recordings were removed, leaving 47 dyads with two viable recordings and 24 dyads with one viable recording available for inspection of facial features, a total of 118 recordings. As shown in the example frame in step one (Figure 1), the infant is roughly centered in the camera's field of view. We thus began by cropping 25% from the periphery of each frame, effectively zooming in on the dyad.
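The cropping step can be sketched as follows. Since the text does not state whether the 25% is removed per side or in total, this sketch assumes 25% of each dimension, split evenly between opposite edges; the exact fraction used is an assumption for illustration.

```python
import numpy as np

def crop_periphery(frame: np.ndarray, fraction: float = 0.25) -> np.ndarray:
    """Remove `fraction` of each dimension from the periphery (split
    evenly between opposite sides), zooming in on the frame center."""
    h, w = frame.shape[:2]
    dy, dx = int(h * fraction / 2), int(w * fraction / 2)
    return frame[dy:h - dy, dx:w - dx]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # toy 480x640 RGB frame
print(crop_periphery(frame).shape)  # (360, 480, 3)
```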

FIGURE 1.


Preprocessing pipeline illustrated for a single recording. The values in the frame examples are chosen randomly. The image in step 2 was created using generative AI (Gemini Pro).

In step two, we used the infant face detection model OBBabyFace, developed by Reyes‐Hernández et al. 2024. To the best of our knowledge, OBBabyFace is the first open‐source model for infant face detection. OBBabyFace is built on the Ultralytics Python framework (Jocher et al. 2023) and employs oriented bounding boxes (YOLOv8‐OBB) to capture the often high variability in infant face orientation. The bounding boxes represent the predicted four coordinates on the frame in which the infant's face is assumed to be located. For this study, the model is instantiated with a minimum detection accuracy of 75% and is forced to identify only one individual. Across 118 recordings, OBBabyFace successfully identifies the infant's face (i.e., detection accuracy > 75%) in 70.48% of frames. Note that this estimate cannot be considered a valid performance assessment of OBBabyFace, as in many frames, the infant is not facing the camera due to head tilt or occlusion by the mother. For frames where OBBabyFace failed to identify the infant's face, we reused the coordinates from the most recent successful detection. The region specified by the bounding box was then cropped and saved for the whole recording. As such, no frames are removed at this step; frames are retained even when no face is identified. We chose this procedure to ensure compatibility with our IFA labels, which were available in milliseconds.
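The fallback policy for undetected frames can be sketched independently of the detector itself. The tuple box format below is illustrative, not OBBabyFace's actual output format; the point is only the carry-forward behavior described above.

```python
def track_with_fallback(detections):
    """For each frame, use the detected bounding box if present;
    otherwise reuse the most recent successful detection.

    `detections` has one entry per frame: a box (here, an illustrative
    coordinate tuple) or None when the detector's confidence fell
    below the threshold (75% in the paper's setup).
    """
    boxes, last_box = [], None
    for det in detections:
        if det is not None:
            last_box = det
        boxes.append(last_box)  # may still be None before the first hit
    return boxes

dets = [None, (10, 10, 50, 50), None, (12, 11, 52, 51), None]
print(track_with_fallback(dets))
# [None, (10, 10, 50, 50), (10, 10, 50, 50), (12, 11, 52, 51), (12, 11, 52, 51)]
```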

In step three, we applied the recently proposed infant AU detection module known as PyAFAR (Hinduja et al. 2023; Onal Ertugrul et al. 2023) to the cropped input image, yielding a probability of occurrence between 0 and 1 for AU4, AU6, AU12, and AU20 for each frame. While PyAFAR does output estimates for a range of other AUs as well, the model was trained only on 13‐month‐old infants for this other subset of AUs (Onal Ertugrul et al. 2023). We include a small experiment on the performance of the complete set of AUs available from PyAFAR; however, our chief interest is in the AUs for which the model was explicitly trained on the four‐month cohort (AU4, AU6, AU12, and AU20), as much research underscores the need for age‐specific feature extraction tools for infants (Zaharieva et al. 2024; Onal Ertugrul et al. 2023). After extracting the AUs, the data were combined with the corresponding IFA labels. A random example of the resulting dataset is shown in Figure 1.

In step four, all frames corresponding to NA IFA labels were excluded. In the final fifth step, we excluded frames in which PyAFAR failed to detect the infant's face and thus returned no AU estimates. It is therefore also in this step that we remove the frames where OBBabyFace could not identify a face. The number of excluded frames can be seen in Figure 2a. This process was carried out to preprocess dyads with a single recording. For dyads with two viable recordings, we ran the pipeline described in Figure 1 for each recording and then combined the resulting AU estimates by taking the mean when both recordings had estimates for a frame. If only one recording had an estimate for a given frame, we used that value.
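The merging rule for dyads with two recordings can be sketched as below, with NaN standing in for frames where PyAFAR returned no estimate (the list-of-floats representation is our simplification of the per-frame, per-AU data).

```python
import math

def merge_au_estimates(rec_a, rec_b):
    """Combine per-frame AU estimates from two camera recordings:
    the mean when both have an estimate, otherwise whichever is
    available; NaN survives only when both are missing."""
    merged = []
    for a, b in zip(rec_a, rec_b):
        if not math.isnan(a) and not math.isnan(b):
            merged.append((a + b) / 2)
        elif not math.isnan(a):
            merged.append(a)
        else:
            merged.append(b)  # b's value, or NaN if both are missing
    return merged

nan = float("nan")
merged = merge_au_estimates([0.2, nan, 0.8, nan], [0.4, 0.6, nan, nan])
# merged is approximately [0.3, 0.6, 0.8, nan]
```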

FIGURE 2.


Sample characteristics.

2.5. Sample Characteristics

In this section, we report details on the final sample included for analysis. Figure 2a gives an overview of the number of frames available for modeling after removing the frames with either a missing label or missing AU occurrence estimates (see details in Figure 1). As seen from Figure 2a, two dyads did not have annotations for the full scope of 4500 frames (Step 3). In these two cases, the recordings were not annotated for the full three minutes because the infants became increasingly discontent and the interaction had to be terminated prematurely. At step 4, upon removing the frames that could not be annotated due to brief moments of occlusion or infant head tilt (NA), each dyad had 4241 frames (94%) on average. This drops to a mean count of 3732 (82%) frames per dyad after removing the frames in which PyAFAR could not return values for the four AUs (step 5). This left a total of 261253 frames for further analysis; the distribution of their labels is seen in Figure 2b. We observed a class imbalance: the positive class accounted for roughly 14% of the sample (36641), the neutral class for 44.4% (115947), and the negative class for 41.6% (108665).

2.6. Modeling

The core objective of our modeling is to predict the human‐annotated IFA labels on each frame using as input features the PyAFAR occurrence estimates for the four AUs. We then compare the model's predictions to the ground‐truth annotations in our evaluations. For this, we implemented two classification models, XGBoost and Bayesian filtering. We chose XGBoost (Chen and Guestrin 2016) as multiple studies suggest it to be a superior classification model across many different domains (Imani et al. 2025; Liu et al. 2024; Yarmohammadtoosky Dinesh Chowdary Attota 2024; Gündoğdu 2023). Being a tree‐based model, XGBoost can be considered an extension of the commonly used random forest ensemble learning algorithm, with the notable advantage of maintaining a sequential learning architecture in which each new tree attempts to correct the mistakes of the previous trees. This is ideal for an imbalanced dataset, which was another key reason we chose this architecture. A limitation of using XGBoost for the present use case is that it neglects the order of frames, as it classifies samples independently. Facial movements are temporally linked, an essential feature of the dataset that XGBoost fails to exploit. We therefore chose Bayesian filtering (Kaipio and Somersalo 2006) as our second model, as it natively accounts for the temporal relationship between frames. We decided on this architecture over other models for temporally dependent datasets, such as recurrent neural networks (RNNs), due to its interpretable framework and ease of implementation. Given its foundations in Bayesian learning theory, it further offers strong uncertainty quantification. For both models, we relied on four‐fold cross‐validation. The data split for each fold was conducted at the infant level, that is, we ensured that data from each infant was not leaked across folds. Otherwise, the model could inadvertently learn characteristics of each infant rather than their affective expressions.
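The infant-level split can be sketched as below. In practice a library routine such as scikit-learn's GroupKFold provides the same guarantee; a round-robin assignment of infants to folds simply makes the idea explicit.

```python
def infant_level_folds(infant_ids, n_folds=4):
    """Assign each unique infant to exactly one fold, so that no
    infant's frames appear in both the train and test split of any
    fold (preventing identity leakage)."""
    unique = sorted(set(infant_ids))
    fold_of = {inf: i % n_folds for i, inf in enumerate(unique)}
    folds = [[] for _ in range(n_folds)]
    for idx, inf in enumerate(infant_ids):
        folds[fold_of[inf]].append(idx)
    return folds  # folds[k] holds the test-frame indices for fold k

# toy example: three frames each from four infants
ids = ["a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"]
folds = infant_level_folds(ids, n_folds=4)
print(folds)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
```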
We now briefly introduce both model architectures and refer the interested reader to our supplementary material for the complete set of implementation details.

2.6.1. XGBoost

XGBoost is a highly efficient and scalable implementation of gradient boosting. The algorithm builds an ensemble of decision trees sequentially, where each new tree is trained to correct the errors of the previous tree. Specifically, this process optimizes a loss function using gradient descent. Instead of fitting a new tree to the simple residuals of the previous model, it is fit to the negative gradient of the loss function. Each tree added to the model thus represents a step in the direction that most steeply reduces the overall loss, making it a flexible and powerful predictive model. We refer to supplementary material A for further details, including hyperparameter selection.
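The boosting principle can be illustrated on a toy 1-D regression with squared loss, where the negative gradient reduces to the plain residual. This is a didactic sketch only, not the XGBoost implementation, which additionally uses second-order information and regularization.

```python
import numpy as np

def boost_1d(x, y, n_rounds=50, lr=0.1):
    """Toy gradient boosting with squared loss: each round fits a
    depth-1 'stump' (single best split) to the negative gradient,
    which for squared loss is simply the residual y - F(x)."""
    F = np.full_like(y, y.mean())           # round 0: constant model
    for _ in range(n_rounds):
        r = y - F                           # negative gradient (residual)
        best = None
        for t in np.unique(x):              # candidate split thresholds
            left = x <= t
            if left.all():                  # empty right side, skip
                continue
            pred = np.where(left, r[left].mean(), r[~left].mean())
            sse = ((r - pred) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, pred)
        F = F + lr * best[1]                # small step along the new tree
    return F

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
fit = boost_1d(x, y)                        # converges toward the step function
```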

2.6.2. Bayesian Filtering

XGBoost classifies each frame independently of the preceding and subsequent frames; it thus fails to utilize the a priori information that facial movements are correlated across frames. We wished to account for the sequential nature of the data by implementing a Bayesian filtering approach (Kaipio and Somersalo 2006). To set the stage, we first note that the problem facing us is essentially an inverse problem, where we aim to identify parameters of an unobservable latent process (the infant's underlying facial affect) that is assumed to give rise to the data we observe (the facial expressions as captured in the AU measurements) (Bishop and Bishop 2023; MacKay 2003). We can further categorize the problem as one of discrete latent‐variable modeling, since the labels for the affective condition are discrete: negative, neutral, and positive. We visualize this theoretical notion in Figure 3, introducing our latent variables as z ∈ {1, 2, …, K} for K classes. The inverse problem is to identify the latent vectors z_n from the observed facial movements x_n. In a Bayesian context, we aim not to find a single estimate for the latent vector but rather to obtain its posterior probability distribution given the data. For extended terminology, implementation details, and notation, please see supplementary material B.
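A minimal forward-filtering recursion over the three discrete affect states might look as follows. The transition matrix and per-frame likelihoods below are illustrative assumptions, not values fitted in this study; the recursion itself (predict with the transition model, update with the likelihood, normalize) is the standard discrete Bayes filter.

```python
import numpy as np

# Illustrative transition matrix P(z_t | z_{t-1}) over the states
# (negative, neutral, positive): affect tends to persist across frames.
A = np.array([[0.95, 0.04, 0.01],
              [0.03, 0.94, 0.03],
              [0.01, 0.04, 0.95]])

def filter_step(belief, likelihood):
    """One forward-filtering step: predict with the transition model,
    then update with the per-frame likelihood p(x_t | z_t) and normalize."""
    predicted = A.T @ belief
    posterior = predicted * likelihood
    return posterior / posterior.sum()

belief = np.full(3, 1 / 3)                   # uniform prior over states
for lik in [np.array([0.1, 0.3, 0.9])] * 5:  # frames favoring 'positive'
    belief = filter_step(belief, lik)
print(belief.argmax())  # 2  (the positive state)
```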

FIGURE 3.


A visualization of a sequential dependency scheme, inspired by Bishop and Bishop 2023.

2.7. Evaluation Metrics

We now introduce our procedure for evaluating our machine learning models, using the human‐annotated IFA labels as the ground‐truth test vector and comparing them with the model's estimates of the IFA labels. In machine learning classification experiments, the confusion matrix serves as the foundation for most commonly used metrics. The confusion matrix is a K×K matrix that depicts the samples in a grid of actual and predicted cases. This yields the number of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP). The first metric we rely on is the receiver operating characteristic (ROC) curve and its associated measure, the area under the curve (AUC). The ROC curve is a standard tool for assessing classifier performance: it is generated by varying the decision threshold from −∞ to +∞ and plotting the true positive rate (TPR/recall) against the false positive rate (FPR). Each point on the curve corresponds to a specific confusion matrix (Bishop and Bishop 2023). The FPR and TPR used in the ROC curve are calculated as FPR = FP/(FP + TN) and TPR = TP/(TP + FN), respectively, and the AUC metric is the area under this curve. When visualizing this curve, we plot the concatenated predicted and ground‐truth test vectors across the four‐fold cross‐validation, thus showing all samples used as test cases. As AUC is inherently a binary measure, in the multiclass setting we follow the common one‐vs‐rest strategy, computing an AUC for each class against the rest and reporting their class‐weighted average. For binary evaluations, the AUC metric yields a single value, so a weighted average is not computed. In addition to the AUC score, we report recall, precision, and the F1‐score. Precision is calculated as precision = TP/(TP + FP), and the F1‐score is the harmonic mean of precision and recall: F1 = 2 · (precision · recall)/(precision + recall). In essence, recall, precision, and F1‐scores are all calculated on a per‐class basis (as in Table 6).
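The threshold-sweeping construction of the ROC curve and its AUC can be sketched as below; sweeping over the observed scores is equivalent to sweeping the threshold from −∞ to +∞, since the curve only changes at observed score values.

```python
import numpy as np

def roc_auc(scores, labels):
    """Binary AUC: sweep the decision threshold over all observed
    scores, collect (FPR, TPR) points, and integrate the curve with
    the trapezoidal rule. `labels` are 0/1."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))[::-1]  # high to low
    P = labels.sum()
    N = len(labels) - P
    fpr, tpr = [0.0], [0.0]                        # threshold = +inf
    for t in thresholds:
        pred = scores >= t
        tpr.append((pred & (labels == 1)).sum() / P)
        fpr.append((pred & (labels == 0)).sum() / N)
    # trapezoidal integration over the (FPR, TPR) curve
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```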
To provide a comprehensive account of model performance, we need to aggregate the metrics to show either the combined scores for the models (as in Table 5 for the multiclass evaluations), the combined scores across raters (as in our inter‐rater agreement analysis in Section 2.3) or the combined scores for the binary evaluations (as in Table 7). When we report aggregated recall, precision, and F1‐score, we show the metrics weighted according to their respective class, as per

Weighted-metric = (Σ_k N_k × metric_k) / (Σ_k N_k). (1)

We compare our multiclass evaluations to the chance‐level weighted‐F1 score, calculated using the manually coded IFA labels. The procedure is as follows: All samples are assigned to the majority (neutral) class, with recall and precision for the negative and positive classes set to zero. Precision for the neutral class is

P_neutral = 115947 / (108665 + 115947 + 36641) ≈ 0.444 (44.4%).

The corresponding F1 score is

F1_neutral = (2 × P_neutral × 1) / (P_neutral + 1) ≈ 0.615 (61.5%).

Lastly, we use Equation 1 to calculate the chance‐level weighted F1‐score:

Weighted-F1 = (N_negative × F1_negative + N_neutral × F1_neutral + N_positive × F1_positive) / N_total
Weighted-F1 = (108665 × 0 + 115947 × 0.615 + 36641 × 0) / 261253 ≈ 0.273 (27.3%).
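The arithmetic above can be reproduced mechanically from the class counts reported in Section 2.5:

```python
# Chance-level weighted F1 for a majority-class (neutral) predictor,
# using the class counts from Section 2.5.
counts = {"negative": 108665, "neutral": 115947, "positive": 36641}
total = sum(counts.values())                      # 261253

p_neutral = counts["neutral"] / total             # precision of majority vote
f1_neutral = 2 * p_neutral * 1 / (p_neutral + 1)  # recall is 1 for neutral

# negative and positive contribute F1 = 0, so only neutral remains
weighted_f1 = (counts["neutral"] * f1_neutral) / total
print(round(p_neutral, 3), round(f1_neutral, 3), round(weighted_f1, 3))
# 0.444 0.615 0.273
```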

TABLE 6.

Evaluation metrics for each affect class for our multiclass experiment using XGBoost. Calculated from the concatenated test vectors across the four folds to match Figure 5.

Recall (TPR) Precision F1 AUC
Negative 0.38 0.56 0.45 0.60
Neutral 0.79 0.53 0.64 0.65
Positive 0.21 0.46 0.29 0.76

TABLE 5.

Test‐set performance estimates for the best model during multiclass classification. Average over four‐fold cross‐validation. Chance level weighted F1: 27%.

Recall (TPR) Precision F1 AUC
XGBoost 0.54 0.54 0.51 0.65
Bay. filt. 0.49 0.50 0.49 0.62

TABLE 7.

Test‐set performance for XGBoost across the three binary studies. Averages over four‐fold cross‐validation.

Recall (TPR) Precision F1 AUC
Negative vs. Neutral 0.6 0.64 0.57 0.61
Negative vs. Positive 0.76 0.73 0.73 0.76
Neutral vs. Positive 0.81 0.79 0.79 0.78

3. Results

3.1. Explorative Findings

Before commencing the modeling, we report initial exploratory findings on the four available AUs in Table 4 and Figure 4. First, we calculated the Spearman correlation coefficient between each AU and the binarized labels. The input data for the Spearman correlation calculation is the full data sample, in which we concatenate the frames from each infant into a single combined matrix. We observed that the coefficients were modest, with the maximum being 0.29 for the positive condition on AU12. At the same time, we observed a tendency for the AUs to be negatively correlated with the neutral condition and positively correlated with the positive condition, suggesting that a small but coherent signal can be identified. These findings were further underscored when we explored the mean occurrence probability for each AU across the affective categories, as shown in Figure 4. We observe that high AU12 values suggest class membership in positive affect, while low AU12 values indicate a neutral expression.
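The Spearman computation can be sketched as the Pearson correlation of ranks. Tie handling is simplified here relative to standard implementations such as scipy.stats.spearmanr, and the toy AU12 values below are illustrative, not data from the study.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    (Ranks via double argsort; no average-rank correction for ties,
    which is fine for this sketch.)"""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx, ry = rx - rx.mean(), ry - ry.mean()
    return (rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry))

# toy example: AU12 occurrence probabilities vs. binarized 'positive' labels
au12 = np.array([0.05, 0.10, 0.20, 0.70, 0.90, 0.95])
is_pos = np.array([0, 0, 0, 1, 1, 1])
print(spearman(au12, is_pos))  # 1.0
```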

TABLE 4.

Spearman correlation coefficient between labels and AUs.

           AU4     AU6     AU12    AU20
Negative   0.13    0.08    −0.10   0.10
Neutral    −0.17   −0.23   −0.10   −0.26
Positive   0.06    0.22    0.29    0.23

FIGURE 4.

FIGURE 4

Mean occurrence of AUs across the dataset.

3.2. Multiclass Prediction Model

In Table 5, we report the performance across the two classification models when tasked with a multiclass classification, that is, classifying the IFA label corresponding to each frame according to one of its three possible discrete classes. The XGBoost model performs slightly better than Bayesian filtering, though the difference is minimal. Taking as baseline the chance‐level weighted‐F1 at 27% in Table 5, it is evident that the model has captured a moderate signal. It is noteworthy that the AUC performance estimate is markedly higher than the F1 score, reflecting the ROC‐AUC curve's softer decision boundary. The ROC‐AUC score accounts for the model's ability to rank classes, that is, to indicate the order of the predicted probabilities for each label. In contrast, the F1, recall, and precision scores only consider the most probable class label and thus use a hard decision boundary. Investigating the number of correctly identified frames per class, as shown in Figures 5 and 6, we observed that the model overpredicts the neutral affective condition and underpredicts instances of negative and positive affect. Oversampling the minority class did not improve overall performance (see supplementary material C). The high AUC score for the positive class in Table 6 suggests good separability in terms of assigning a higher probability to positive cases as compared to negative and neutral when the ground truth is positive. However, the model is seldom confident enough to assign this class the highest probability among the three classes when the ground truth is positive, as evidenced by the low recall and precision scores. For readers particularly interested in the temporal changes in affect, we refer to supplementary material D, where we show the distribution of predicted vs. ground‐truth affect states for one of the cross‐validated folds, as predicted by Bayesian filtering.
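The gap between the AUC and F1 estimates discussed above follows from the soft vs. hard decision boundaries. The invented toy example below shows how a model can rank the positive class perfectly (AUC = 1.0) while never making it the argmax (F1 = 0); the numbers are ours, chosen only to illustrate the mechanism:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Whenever class 2 is the ground truth, its probability (0.40) is higher than
# for any other sample (0.10) — perfect ranking — yet class 1 always wins the
# argmax, so recall, precision, and F1 for class 2 collapse to zero.
y_true = np.array([2, 2, 2, 0, 0, 1, 1, 1])
proba = np.array([
    [0.15, 0.45, 0.40],  # true 2: class 2 scored high, but class 1 wins argmax
    [0.15, 0.45, 0.40],
    [0.15, 0.45, 0.40],
    [0.70, 0.20, 0.10],  # true 0
    [0.70, 0.20, 0.10],
    [0.20, 0.70, 0.10],  # true 1
    [0.20, 0.70, 0.10],
    [0.20, 0.70, 0.10],
])

hard = proba.argmax(axis=1)  # hard decision boundary
f1_class2 = f1_score(y_true, hard, average=None, zero_division=0)[2]
auc_class2 = roc_auc_score((y_true == 2).astype(int), proba[:, 2])  # one-vs-rest
print(f1_class2, auc_class2)  # → 0.0 1.0
```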

FIGURE 5.

FIGURE 5

Confusion matrix for the multiclass evaluation for XGBoost.

3.3. Binary Prediction Model

As XGBoost was the best‐performing model in the multiclass setting, we also used this model in a binary setting, distinguishing negative vs. neutral, neutral vs. positive, and negative vs. positive infant facial affect. Our results for these three experiments are displayed in Figure 6. The neutral vs. positive condition achieves an AUC score of 0.78. We also show the recall, precision, and F1 scores for the binary condition in Table 7.

FIGURE 6.

FIGURE 6

ROC curve for all features.

We have further chosen two modeling approaches to gauge the relative importance of each AU and each possible subset of AUs. First, as shown in Figure 7, we present the AUC scores obtained when running our modeling pipeline across the 14 possible AU subsets from the PyAFAR four‐month cohort (AU4, AU6, AU12, and AU20). To investigate whether the five additional AUs used to train the PyAFAR 13‐month cohort model improve performance, we also report results of executing our pipeline on the complete set of nine AUs (AU1, AU2, AU3, AU4, AU6, AU9, AU12, AU20, and AU28). We observe that AU4, AU6, AU12, and AU20 (AUs_4months) drive the bulk of the performance, and that including the five additional AUs yields no meaningful performance increase. In sum, our interpretability analysis shows that AU12 and AU20 are particularly salient in driving differences in IFA, especially in distinguishing between positive and neutral, and between positive and negative. These findings are corroborated by the XGBoost feature importance scores (supplementary material E).
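Such a subset sweep can be enumerated with `itertools.combinations`. The sketch below assumes the 14 subsets are the non-empty proper subsets of the four AUs, with the full set reported separately as AUs_4months:

```python
from itertools import combinations

# Enumerate candidate AU feature subsets: the 14 non-empty proper subsets
# of {AU4, AU6, AU12, AU20}. Each subset would index columns of the feature
# matrix before retraining the classifier.
aus = ["AU4", "AU6", "AU12", "AU20"]
proper_subsets = [c for r in range(1, len(aus))
                  for c in combinations(aus, r)]
print(len(proper_subsets))  # → 14
```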

FIGURE 7.

FIGURE 7

Test score as average over four‐fold cross‐validation for all subsets of features (metric: AUC). AUs_4months refers to AU4, AU6, AU12, and AU20 from the PyAFAR four‐month‐old training cohort, while AUs_4months_13months refers to AU1, AU2, AU3, AU4, AU6, AU9, AU12, AU20, and AU28 obtained from using both the PyAFAR four‐month‐old and 13‐month‐old training cohorts. As such, the column with the subtext "4" indicates the XGBoost model trained solely on AU4; "4_6" indicates the model trained on both AU4 and AU6, and so forth.

4. Discussion

In this study, we investigate whether reliable estimates of infant facial affect can be obtained using PyAFAR for AU feature extraction in a sample of four‐month‐old infants. We find that a minimally tailored XGBoost model can distinguish between positive and negative instances and between positive and neutral instances, achieving AUC scores of 0.76 and 0.78, respectively.

4.1. Analyzing the Multiclass Findings

The first relevant consideration is the findings from our multiclass evaluations, as reported in Table 5. The XGBoost model performed marginally better than the Bayesian filtering implementation, achieving an average F1‐score of 51% and an AUC of 0.65. The model performed markedly better than chance at 27%, though the results remain moderate. Although the F1 score's hard classification rule yields moderate average performance, the soft decision boundary of the AUC metric allows performance to increase to a reasonable level. This indicates that while the model struggles to determine exact class membership, it is markedly better at ranking the classes by predicted probability.

Investigating the details of the model performance further, as seen in Figure 5, it is evident that the model overpredicts the neutral class. When attempting to account for this finding, it should be observed that all AUs correlated negatively with this class, indicating that lower AU estimates will lead to prediction of the neutral class (Figure 4). This is fully in line with the IFA annotation criteria for the neutral class (Table 1), which is labeled when the infant has their eyes open and shows minimal lip activation (low AU12 and AU20). The model thus appears to be correctly identifying instances of facial expressions belonging to the neutral class. The poor performance in the multiclass case likely stems from underestimating the negative class. Let us again consider the IFA annotation criteria, which show that distinguishing neutral from negative affect depends on the degree of forehead smoothening. This facial configuration is reflected mainly through AU1 and AU2, which are not available for the four‐month cohort of PyAFAR (Onal Ertugrul et al. 2023). The criteria for the IFA are also more dependent on details of infant mouth opening (AU25, AU26, and AU27), which are likewise unavailable for the PyAFAR four‐month training cohort. These considerations also explain why oversampling failed to improve performance, as the issue stems more from the features available than from modeling details.

Contrary to our expectations, we also observed in Table 5 that XGBoost performed slightly better than Bayesian filtering. We assumed that a sequential architecture would best model the sequential nature of facial expressions, but our results do not support this assumption. Multiple explanations can account for this finding. The first concerns the nature of infant affective change. We intuited that when an infant changes its affective state from positive to negative or vice versa, a neutral state would mediate the change; this was our key motivation for choosing a modeling architecture with a sequential inductive bias, such as Bayesian filtering. Upon investigating the ground‐truth pattern of IFA change over time, we do not find this expected relationship, as it is strikingly common for an infant to switch abruptly from positive to negative and vice versa (see supplementary material D). As deeper investigation of this finding exceeds the scope of the present contribution, we leave it to future research. A further explanation for XGBoost's slight performance advantage is that Bayesian filtering has no hyperparameters to tune, limiting the options for customizing the model to the target problem. Lastly, we observe that the F1 score of 51% is substantially lower than the inter‐rater F1 overlap metric of 83% (as described in Section 2.3), indicating that purely automated infant affect estimation tools still require substantial improvement for reliable multiclass evaluation.
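For readers unfamiliar with the technique, a discrete Bayesian filter over the three affect states can be sketched as follows; the transition matrix and per-frame likelihoods below are invented for illustration and do not reproduce the paper's exact implementation:

```python
import numpy as np

def bayes_filter(likelihoods, transition, prior):
    """Forward-filter a sequence of per-frame class likelihoods."""
    belief = np.asarray(prior, dtype=float)
    posteriors = []
    for lik in likelihoods:
        belief = lik * (transition.T @ belief)  # predict, then update
        belief /= belief.sum()                  # renormalize to a distribution
        posteriors.append(belief.copy())
    return np.array(posteriors)

# Sticky dynamics: states (negative, neutral, positive) tend to persist
# from frame to frame, smoothing over noisy per-frame evidence.
T = np.array([[0.90, 0.08, 0.02],
              [0.05, 0.90, 0.05],
              [0.02, 0.08, 0.90]])
liks = np.array([[0.2, 0.6, 0.2],
                 [0.2, 0.5, 0.3],
                 [0.1, 0.3, 0.6]])
post = bayes_filter(liks, T, prior=[1 / 3, 1 / 3, 1 / 3])
print(post[-1].argmax())  # most probable state at the last frame
```

Note how the sticky prior can keep the filter in neutral even when the last frame's likelihood favors positive — exactly the behavior that hurts when infants switch affect abruptly.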

4.2. Comparing to Benchmark

In the context of computerized estimation of infant facial affect, the study by Zaharieva et al. 2024 serves as the most robust baseline for comparison, owing to their similarly aged cohort (51 four‐month‐old infants), identical affect classes (negative, neutral, and positive), and reliance on the AU framework, which allows for direct comparison with the present study. The samples differ in two primary respects. First, our study includes a clinical group of mothers diagnosed with PPD. However, previous work on this sample suggests no significant differences in affective states between infants in dyads with mothers diagnosed with PPD and those in the nonclinical group (Egmose et al. 2018). Consequently, we pooled the groups for this analysis, as the clinical label does not appear to carry informative value for the current objective. Second, the label distribution differs between the two studies: we report a larger share of infants showing negative facial affect than both Zaharieva et al. 2024 and most other comparable samples (Colonnesi et al. 2012; Koulomzin et al. 2002). We consider this higher prevalence of negative affect a strength, as it facilitates a more granular investigation of negative expression features that are often underrepresented in similar datasets.

The study by Zaharieva et al. 2024 is a performance evaluation of the Baby FaceReader 9, comparing manual annotations of infant affect with estimates either predicted directly by the tool (the global valence score) or predicted by the authors from the tool's AUs. Of the study's numerous findings, two are most important in the current context: Zaharieva et al. 2024 report (1) AUC scores above 0.80 when distinguishing positive vs. negative, positive vs. neutral, and neutral vs. negative, and (2) that a parsimonious approach (i.e., selecting only one AU) outperforms the global emotional valence formula.

Considering first the maximum AUC scores, we find them notably similar to those reported in the present study (Figure 6). Zaharieva et al. 2024 found a maximum AUC of 0.87 for the four‐month cohort when comparing positive to negative estimates and a maximum AUC of 0.85 when comparing positive to neutral. Our study reports a maximum AUC of 0.76 when comparing positive to negative and 0.78 when comparing positive to neutral, using all four AUs. As argued above, one likely reason for this difference is that the IFA theory also factors in the degree of forehead smoothening, which is captured by a set of AUs not available in the four‐month cohort of PyAFAR. Another likely reason is that the infants were much closer to the camera in the sample used by Zaharieva et al. 2024 than in the present study. While the infants were seated in the same position in both studies – i.e., facing the parent – and parents were instructed to engage their child as they normally would, our sample had the infants positioned farther from the camera. We thus need to rely on a face detection model, as described in Figure 1, which adds uncertainty to the preprocessing pipeline. When factoring in this additional limitation, our findings suggest that highly comparable affect estimates can be achieved with a minimally tailored classification model (XGBoost) applied to the AU occurrence probabilities obtained from PyAFAR (Hinduja et al. 2023; Onal Ertugrul et al. 2023). This is further corroborated by the fact that the mean AU occurrence rates from our study (Figure 4) closely resemble those reported for the Baby FaceReader 9 (fig. 5 in Zaharieva et al. 2024).

One notable difference in performance is observed when comparing our results for the negative vs. neutral category to those of Zaharieva et al. 2024. Here, we report a maximum AUC of 0.61 across all four AUs, substantially lower than the 0.83 reported by Zaharieva et al. 2024 for AU3 and AU4 alone. The first thing to consider in this comparison is that PyAFAR excludes AU3 for the four‐month cohort. Although AU3 and AU4 both capture the degree of brow activity, a fair comparison would also include AU3 as a feature in the present study. Second, the base rate of AU4 in the PyAFAR four‐month training database (Onal Ertugrul et al. 2023) is low, indicating that this facial movement is seldom exhibited by infants in the given dataset and thus limiting PyAFAR's ability to learn a strong representation of AU4. The lack of a consistent signal in AU4 is also observed in supplementary material E, where the feature importance scores for AU4 are among the lowest. Third, our findings show that AU6 and AU12 account for all the signal in distinguishing the neutral from the negative class (Figure 7). As relevant theoretical contributions (Colonnesi et al. 2012; Bolzani Dinehart et al. 2005) (Table 2) suggest AU20 to be the primary indicator of a negative expression, it is striking that neither our study nor Zaharieva et al. 2024 reports this feature as salient. Future studies should pay close attention to this finding and examine whether it is a spurious pattern in the training data of our two studies or a genuine finding that warrants theoretical updates.

It is further interesting to consider that Zaharieva et al. 2024 reports that a parsimonious approach (using only AU12 or AU3+AU4) performs better at discriminating the affect classes than both the global valence score from Baby FaceReader 9 and any a priori configuration suggested by the literature. This finding appears to contrast with ours, as Figure 7 shows that the best performance is achieved by using all AUs from the PyAFAR four‐month cohort. When investigating the contribution of individual feature subsets, we generally observe that AU12 and AU20 drive most of the differences in IFA, with the inclusion of AU4 and AU6 yielding only a minimal performance increase. Our findings suggest that AU12 and AU20 are particularly salient in distinguishing positive from negative and neutral affective states. As AU12 indicates lip corner pull, i.e., a proxy for smiling and thus positive affect (Messinger et al. 2012; Beebe et al. 2010), the performance estimates are underscored by high clinical validity.

As a final note, we can see that including the AUs from the PyAFAR 13‐month training cohort (see Figure 7) does not drive performance gains for the demarcation of positive from negative/neutral affect, and only yields a 0.01 increase in distinguishing negative from neutral. The additional features are thus almost exclusively noisy, which is expected given both the morphological differences in facial configuration and the different patterns of interpersonal communication between infants of four and 13 months. Ultimately, this observation underscores the importance of age‐specific feature‐extraction tools for infant facial movements.

4.3. Next Steps for Technical Developments

The current investigation identifies several fruitful avenues for future research to improve both performance and reliability of estimates. Notably, the four‐month cohort on which PyAFAR was trained comprised 43 infants with annotations available for 116,000 frames (Onal Ertugrul et al. 2023). The base occurrence rates for the four AUs were 0.10 (AU4), 0.31 (AU6), 0.26 (AU12), and 0.12 (AU20). This limited sample, combined with the generally low base rates, suggests that AU estimation accuracy could greatly benefit from including annotations from additional samples. Further, the outputs for the four AUs from PyAFAR are measured as the occurrence probability of the given AU, rather than an estimate of its intensity. Having access to both an occurrence and an intensity estimate for each AU would be an additional step toward improving performance. We also find that the degree of forehead smoothening is relied on slightly more heavily for inferring a negative state in IFA than in the AU framework (Table 2). Subsequent iterations of PyAFAR could benefit from including AUs covering these facial movements, such as AU1, AU2, and AU3 for forehead activation. Additionally, AU25, AU26, and AU27 would provide more insight into mouth opening, likely enabling better differentiation of affective states. As more AUs are included over time, it will also be fruitful to model infant facial affect at a higher granularity than the three labels used in the present study. Future research would benefit from adopting the full five‐category IFA measurement scheme shown in Table 1, which differentiates between low and high intensities of positive and negative affect.

5. Conclusion

Computational inference of infant affective state through facial movement analysis holds considerable potential to give researchers and clinicians faster access to information. In the present study, we demonstrated how the currently available open‐source software enables inference of affect with performance estimates in the same range as those from the primary proprietary tool. The marginally lower accuracy observed in the present study likely stems from PyAFAR providing only four AU estimates for the four‐month cohort and from minor differences in the annotation criteria between the IFA theory and the AU nosology (Beebe et al. 2010; Koulomzin et al. 2002; Egmose et al. 2018; Zaharieva et al. 2024; Colonnesi et al. 2012). Future iterations of PyAFAR may benefit from including AU estimates corresponding to more granular details of infant mouth and forehead activity, thereby enabling more nuanced capture of negative and neutral affective states in inference.

Funding

This project was partially funded by the Pioneer Centre for AI, DNRF grant number P1.

Ethics Approval Statement

The Institutional Ethical Review Board, University of Copenhagen, Department of Psychology has reviewed the application titled “Early forms of Intersubjectivity at the UCPH Babylab” in retrospect and approved it. The approval was conducted according to the ethical standards and guidelines provided by the FP8 and the national ethical guidelines. Date of approval August 31, 2014, approval number 2014/02.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Supporting Information

DESC-29-e70156-s001.tex (10.9KB, tex)

Supporting Information

DESC-29-e70156-s002.pdf (238.6KB, pdf)

Data Availability Statement

The authors have nothing to report.

References

  1. Baltrušaitis, T. , Robinson P., and Morency L.‐P.. 2016. OpenFace: An Open Source Facial Behavior Analysis Toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–10. IEEE.
  2. Beebe, B. , Jaffe J., Markese S., et al. 2010. “The Origins of 12‐Month Attachment: A Microanalysis of 4‐Month Mother–Infant Interaction.” Attachment and Human Development 12, no. 1–2: 3–141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bishop, C. M. , and Bishop H.. 2023. Deep Learning: Foundations and Concepts. Springer Nature. [Google Scholar]
  4. Bolzani Dinehart, L. H. , Messinger D. S., Acosta S. I., Cassel T., Ambadar Z., and Cohn J.. 2005. “Adult Perceptions of Positive and Negative Infant Emotional Expressions.” Infancy 8, no. 3: 279–303. [Google Scholar]
  5. Camras, L. A. , Oster H., Campos J. J., Miyake K., and Bradshaw D.. 1992. “Japanese and American Infants' Responses to Arm Restraint.” Developmental Psychology 28, no. 4: 578. [Google Scholar]
  6. Chen, T. , and Guestrin C.. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 785–794. Association for Computing Machinery.
  7. Cheong, J. H. , Jolly E., Xie T., Byrne S., Kenney M., and Chang L. J.. 2023. “Py‐Feat: Python Facial Expression Analysis Toolbox.” Affective Science 4, no. 4: 781–796. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Colonnesi, C. , Zijlstra B. J., van der Zande A., and Bögels S. M.. 2012. “Coordination of Gaze, Facial Expressions and Vocalizations of Early Infant Communication With Mother and Father.” Infant Behavior and Development 35, no. 3: 523–532. [DOI] [PubMed] [Google Scholar]
  9. Egmose, I. , Cordes K., Smith‐Nielsen J., Væver M. S., and Køppe S.. 2018. “Mutual Regulation Between Infant Facial Affect and Maternal Touch in Depressed and Nondepressed Dyads.” Infant Behavior and Development 50: 274–283. [DOI] [PubMed] [Google Scholar]
  10. Ekman, P. , Friesen W. V., and Hager J.. 1978. Facial Action Coding System: Manual. Palo Alto, CA: Consulting Psychologists Press.
  11. Gündoğdu, S. 2023. “Efficient Prediction of Early‐Stage Diabetes Using XGBoost Classifier With Random Forest Feature Selection Technique.” Multimedia Tools and Applications 82, no. 22: 34163–34181. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Hinduja, S. , Ertugrul I. O., Bilalpur M., Messinger D. S., and Cohn J. F.. 2023. PyAFAR: Python‐Based Automated Facial Action Recognition Library for Use in Infants and Adults. In 2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pp. 1–3. IEEE.
  13. Hjortsjo, C.‐H. 1969. Man's Face and Mimic Language. Studentlitteratur.
  14. Imani, M. , Beikmohammadi A., and Arabnia H. R.. 2025. “Comprehensive Analysis of Random Forest and XGBoost Performance With SMOTE, ADASYN, and GNUS Under Varying Imbalance Levels.” Technologies 13, no. 3: 88. [Google Scholar]
  15. Izard, C. E. 1997. Emotions and Facial Expressions: A Perspective From Differential Emotions Theory.
  16. Jocher, G. , Qiu J., and Chaurasia A.. 2023. Ultralytics YOLO. Available at: https://github.com/ultralytics/ultralytics.
  17. Kaipio, J. , and Somersalo E.. 2006. Statistical and Computational Inverse Problems, Vol. 160. Springer Science & Business Media. [Google Scholar]
  18. Kleinberg, J. , Ludwig J., Mullainathan S., and Raghavan M.. 2024. “The Inversion Problem: Why Algorithms Should Infer Mental State and Not Just Predict Behavior.” Perspectives on Psychological Science 19, no. 5: 827–838. [DOI] [PubMed] [Google Scholar]
  19. Koulomzin, M. , Beebe B., Anderson S., Jaffe J., Feldstein S., and Crown C.. 2002. “Infant Gaze, Head, Face and Self‐Touch at 4 Months Differentiate Secure Vs. Avoidant Attachment at 1 Year: A Microanalytic Approach.” Attachment & Human Development 4, no. 1: 3–24. [DOI] [PubMed] [Google Scholar]
  20. Landis, J. R. , and Koch G. G.. 1977. “The Measurement of Observer Agreement for Categorical Data.” Biometrics 33, no. 1: 159–174. [PubMed]
  21. Lausberg, H. , and Sloetjes H.. 2009. “Coding Gestural Behavior With the NEUROGES–ELAN System.” Behavior Research Methods 41, no. 3: 841–849. [DOI] [PubMed] [Google Scholar]
  22. Liu, P. , Li X.‐J., Zhang T., and Huang Y.‐H.. 2024. “Comparison Between XGBoost Model and Logistic Regression Model for Predicting Sepsis After Extremely Severe Burns.” Journal of International Medical Research 52, no. 5: 03000605241247696. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Lotzin, A. , Schiborr J., Barkmann C., Romer G., and Ramsauer B.. 2016. “Maternal Emotion Dysregulation is Related to Heightened Mother–Infant Synchrony of Facial Affect.” Development and Psychopathology 28, no. 2: 327–339. [DOI] [PubMed] [Google Scholar]
  24. MacKay, D. J. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press. [Google Scholar]
  25. Margolis, A. E. , Lee S. H., Peterson B. S., and Beebe B.. 2019. “Profiles of Infant Communicative Behavior.” Developmental Psychology 55, no. 8: 1594. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Messinger, D. S. , Mattson W. I., Mahoor M. H., and Cohn J. F.. 2012. “The Eyes Have it: Making Positive Expressions More Positive and Negative Expressions More Negative.” Emotion 12, no. 3: 430. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Mullainathan, S. , and Obermeyer Z.. 2021. “On the Inequity of Predicting a While Hoping for b.” In AEA Papers and Proceedings, Vol. 111, pp. 37–42. American Economic Association.
  28. Noldus. 2022. FaceReader: Tool for Automatic Analysis of Facial Expressions: Version 9.017.
  29. Onal Ertugrul, I. , Ahn Y. A., Bilalpur M., Messinger D. S., Speltz M. L., and Cohn J. F.. 2023. “Infant AFAR: Automated Facial Action Recognition in Infants.” Behavior Research Methods 55, no. 3: 1024–1035. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Oster, H. 1978. “Facial Expression and Affect Development.” In The Development of Affect, 43–75. Springer. [Google Scholar]
  31. Oster, H. , Dondi M., et al. 2016. “Facial Action Coding System for Infants and Young Children (Baby FACS).” In International Conference on Infant Studies (ICIS), pp. 8–8. ICIS.
  32. Reyes‐Hernández, J. C. , Alomar A., Rubio R., Piella G., and Sukno F.. 2024. “Obbabyface: Oriented Bounding Box for Infant Face Detection.” In Deep Learning Theory and Applications, edited by Fred A., Hadjali A., Gusikhin O., and Sansone C. (Eds.), 336–350. Springer Nature Switzerland. [Google Scholar]
  33. Smith‐Nielsen, J. , Steele H., Mehlhase H., et al. 2015. “Links Among High EPDS Scores, State of Mind Regarding Attachment, and Symptoms of Personality Disorder.” Journal of Personality Disorders 29, no. 6: 771–793. [DOI] [PubMed] [Google Scholar]
  34. Smith‐Nielsen, J. , Tharner A., Krogh M. T., and Vaever M. S.. 2016. “Effects of Maternal Postpartum Depression in a Well‐Resourced Sample: Early Concurrent and Long‐Term Effects on Infant Cognitive, Language, and Motor Development.” Scandinavian Journal of Psychology 57, no. 6: 571–583. [DOI] [PubMed] [Google Scholar]
  35. Tronick, E. Z. , and Cohn J. F.. 1989. “Infant‐Mother Face‐to‐Face Interaction: Age and Gender Differences in Coordination and the Occurrence of Miscoordination.” Child Development 60, no. 1: 85–92. [PubMed]
  36. Væver, M. S. , Pedersen I. E., Smith‐Nielsen J., and Tharner A.. 2020. “Maternal Postpartum Depression is a Risk Factor for Infant Emotional Variability at 4 Months.” Infant Mental Health Journal 41, no. 4: 477–494. [DOI] [PubMed] [Google Scholar]
  37. Wan, M. , Zhu S., Luan L., et al. 2022. “Infanface: Bridging the Infant–Adult Domain Gap in Facial Landmark Estimation in the Wild.” In 2022 26th International Conference on Pattern Recognition (ICPR), pp. 4486–4492. IEEE.
  38. Yarmohammadtoosky Dinesh Chowdary Attota, S. 2024. Optimizing Fintech Marketing: A Comparative Study of Logistic Regression and Xgboost. arXiv e‐prints, arXiv–2412.
  39. Zaharieva, M. S. , Salvadori E. A., Messinger D. S., Visser I., and Colonnesi C.. 2024. “Automated Facial Expression Measurement in a Longitudinal Sample of 4‐ and 8‐Month‐Olds: Baby FaceReader 9 and Manual Coding of Affective Expressions.” Behavior Research Methods 1–23. [DOI] [PMC free article] [PubMed]



Articles from Developmental Science are provided here courtesy of Wiley
