Abstract
Risk assessments for a pediatric population are often conducted across multiple stages. For example, clinicians may evaluate risks prenatally, at birth, and during well-child visits. While predictions at later stages typically achieve higher accuracy, it is clinically desirable to make reliable risk assessments as early as possible. Therefore, this study focuses on enhancing prediction performance in early-stage risk assessments. Our solution, Borrowing From the Future (BFF), is a contrastive multi-modal framework that treats each time window as a distinct modality. In BFF, a model is trained on all data available over time, while risk assessment is conducted using only the information observed up to the assessment point. This contrastive framework allows the model to “borrow” informative signals from later stages (e.g., well-child visits) to implicitly supervise learning at earlier stages (e.g., the prenatal and at-birth stages). We validate BFF on two real-world pediatric outcome prediction tasks, demonstrating consistent improvements in early risk assessment. The code is available at https://github.com/scotsun/bff.
1. Introduction
Clinical risk assessment can be conducted at multiple points in time. For pediatric populations, risk may be assessed prenatally, at birth, and during developmental well-child visits (Lipkin et al., 2020). Numerous studies have leveraged prenatal information to develop clinical prediction models for pediatric outcomes, highlighting important links between maternal prenatal factors and child health (Beijers et al., 2010; Walsh et al., 2019; Kong et al., 2024). Risk assessments performed at later time points tend to be more accurate because: (1) additional information has accumulated over the period of watchful waiting, and (2) the child is temporally closer to the potential onset of the condition, making recent data more predictive of the eventual diagnosis. However, it is clinically desirable to make risk assessments as early as possible, enabling timely interventions and preventative actions. In this work, our goal is to improve risk assessment at earlier points in time.
A clinical prediction model (CPM) relying solely on data from earlier time windows may have limited performance, since those data may lack explicit and direct predictive relevance to the target outcome. Therefore, we develop a contrastive learning framework to guide the learning of early information. In general, contrastive learning improves representation quality by leveraging information from different views of the same sample (Chen et al., 2020; Tian et al., 2020), or by integrating data across different modalities observed in the same sample (Radford et al., 2021). In existing contrastive methods for sequential data, representations are learned either at the overall sequence level or at individual positions (Gao et al., 2021; Yue et al., 2022; Lee et al., 2023). However, for our clinical use case, where risk assessments are performed across distinct time windows, we aim to efficiently learn representations at the time-window level.
To achieve our objective, we introduce Borrowing From the Future (BFF), a contrastive multi-modal framework that redefines the notion of modality. Typically, multi-modal approaches integrate heterogeneous data types, such as text and images, to exploit complementary information. In our case, however, all inputs consist solely of medical codes. We treat each time-window segment as a distinct modality, thereby converting our uni-modal data into a multi-modal configuration. This design choice enables the capture of temporal nuances across clinically significant periods, including the maternal prenatal, at-birth, and child developmental stages.
One key element of the BFF framework is the contrastive regularization (CR), which improves the representations of the “early” modalities by using information from later time windows during training. Drawing on advances in unsupervised and self-supervised contrastive learning (Oord et al., 2018; Tian et al., 2020; Radford et al., 2021; Yuan et al., 2021; Lin and Hu, 2022), our CR term contains both within-modal and across-modal contrastive losses. The within-modal loss focuses on capturing time-window-specific features, while the across-modal loss aligns representations across different modalities. This dual strategy supports the learning of distinct temporal features. More importantly, the across-modal alignment generates reliable patient-specific features from early modalities by “borrowing” information from beyond the assessment time during training.
Finding an effective way to aggregate the multi-modal representations is also important in multi-modal learning. We adapt the self-gating mechanism of the SE block proposed by Hu et al. (2018) to the multi-modal setting. The original sigmoid gate is replaced by softmax gates across modalities, which we call Softmax Self-Gating. Each feature from each modality is calibrated by a softmax weight before the final aggregation. Modalities can easily be masked off through the softmax gates when they are missing or for the purpose of model evaluation. Meanwhile, the gating mechanism provides effortless model explainability: it directly quantifies the attention assigned to each feature or modality in the prediction task.
To summarize, in our BFF framework, we treat Electronic Health Record (EHR) medical codes observed in different time windows as separate modalities. We use contrastive learning to separately extract both time-window-specific and patient-specific features. During the learning of patient-specific features, the CR effectively borrows information from the future modality to improve representation learning in the earlier modalities. We use Softmax Self-Gating to aggregate the multi-modal features. We demonstrate the effectiveness of our framework on two real-world pediatric outcome prediction tasks: time-to-diagnosis prediction for Autism Spectrum Disorder (ASD) and binary outcome prediction for Recurrent Acute Otitis Media (rAOM). According to the experimental results on these two tasks, BFF improves a CPM’s performance in early risk assessments.
In summary, our contributions are as follows:
We propose BFF, a framework that converts uni-modal EHR data into multi-modal data by treating data from different time windows as separate modalities. It leverages contrastive learning to enhance CPMs’ performance in early risk assessment.
We design a contrastive regularization that jointly optimizes within-modal feature understanding and cross-modal representation alignment. The cross-modal alignment, in turn, enriches the representations learned on earlier modalities by incorporating information from data closer to the event onset.
We introduce Softmax Self-Gating, a straightforward yet powerful mechanism for achieving efficient and explainable multi-modal fusion.
Generalizable Insights about Machine Learning in the Context of Healthcare
Contrastive learning has been extensively explored as an effective approach to enhancing representation learning. In this study, we employ this technique to boost the performance of a CPM in early risk assessments. Data gathered during the child development period are highly relevant to clinical outcomes of interest. We leverage these data as a form of “soft” supervision to guide the learning of historical information from the maternal prenatal period and at-birth encounter — two clinically significant time windows for early risk assessments. Our proposed framework, BFF, is capable of advancing preventive care for pediatric populations.
2. Related Work
Multi-modal Contrastive Learning.
InfoNCE provides an information-theoretic foundation for contrastive learning (Oord et al., 2018). In the self-supervised learning (SSL) setting, representations are learned through within-modal alignment (Chen et al., 2020; He et al., 2020). Positive pairs are constructed by applying different random augmentations to the same sample, whereas negative pairs are constructed from different samples. In the multi-modal learning setting, various streamlined methods learn multi-modal representations through cross-modal alignment (Tian et al., 2020; Radford et al., 2021; Qiu et al., 2023; Hager et al., 2023). In this context, a positive pair refers to a matched pair of modalities (e.g., a matched image-text pair), while a negative pair refers to a mismatched pair. Several recent works extend multi-modal contrastive learning with simultaneous within-modal and cross-modal objectives (Yuan et al., 2021; Lin and Hu, 2022). The combination enables the model to learn modality-specific representations, and the extracted features are further enhanced by the cross-modal interaction (Yuan et al., 2021).
Contrastive Learning in Sequential Data.
TS2Vec (Yue et al., 2022) and SoftCLT (Lee et al., 2023) learn contrastive representations for multi- and univariate time series in a continuous space, and are evaluated on standard time series data. Contrastive learning is performed at two levels using augmented views of the sequences: an instance-wise loss, which contrasts representations from different sequence instances at the same time point, and a temporal loss, which contrasts representations from different timestamps within a single sequence. Both loss terms operate on the same representation vector. In contrast, BFF is applied to tokenized medical event sequences, structured similarly to NLP data. Contrastive pairs are constructed using time-window labels and instance IDs. The representation vector is explicitly split into two parts, each optimized by a separate contrastive loss: a within-modal loss that extracts temporal features, and an across-modal loss that captures instance-level features.
Attention and Soft Gating.
Soft gating techniques have been used in various works to improve recurrent neural networks (Hochreiter and Schmidhuber, 1997; Cho et al., 2014; Dauphin et al., 2017). These methods use input-dependent information control on features to address the vanishing gradient issue and improve representation learning. Similarly, the SE block, leveraging a sigmoid self-gating mechanism, recalibrates feature maps channel-wise in convolutional neural networks (Hu et al., 2018). These gating techniques improve feature discriminability, empowering the model to adaptively focus more on important features and less on trivial ones. The self-attention mechanism introduced in the Transformer shares the same functionality of adaptively controlling feature expressiveness (Vaswani et al., 2017). The attention weights resemble soft gating weights, suppressing or amplifying the interaction between each pair of embeddings and thereby generating contextualized representations.
3. Methods
3.1. Overview of the BFF
We consider each time window as a separate modality. All medical concepts are mapped to embeddings using a pretrained Continuous Bag-of-Words (CBOW) model (Mikolov et al., 2013). For each modality, an encoder takes the average pooling of the token embeddings along the sequence dimension and then computes two latent representations: a patient-specific vector $z^p$ and a modality-specific vector $z^m$. Subsequently, a fusion module mixes the representations from the different modalities. Fig. 1 illustrates the working mechanism of BFF. Finally, a prediction head can use either $z^p$ alone or the pair $(z^p, z^m)$ for a downstream task.
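As a rough sketch of this encoding step, the snippet below average-pools pretrained token embeddings for one time window and splits a linear projection into the two latent vectors. The linear map `W`, all dimensions, and the names `z_p`/`z_m` are illustrative assumptions, not the exact encoder architecture used in BFF.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_modality(token_embs, W, d_out):
    """Average-pool token embeddings along the sequence dimension, project
    with a stand-in linear encoder, and split the output into a
    patient-specific part and a modality-specific part."""
    pooled = token_embs.mean(axis=0)   # (d_in,) sequence-level summary
    h = W @ pooled                     # (2 * d_out,) encoded representation
    return h[:d_out], h[d_out:]        # z_p (patient-specific), z_m (modality-specific)

d_in, d_out, seq_len = 8, 4, 10
W = rng.standard_normal((2 * d_out, d_in))
tokens = rng.standard_normal((seq_len, d_in))  # CBOW embeddings of one window's codes
z_p, z_m = encode_modality(tokens, W, d_out)
```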
Figure 1: Multi-modal architecture with Softmax Self-Gating for modality fusion. Each encoder extracts two feature vectors: $z^p$ and $z^m$. A fusion module is applied to the $z^m$ (shaded) features and the $z^p$ features separately. Here, we use an example of three samples with modality missingness to showcase the mechanism by which the contrastive regularization pulls some representations closer while pushing others apart during training. Different shapes and colors denote distinct samples and modalities, respectively.
As illustrated in Fig. 1, the CR consists of two contrastive loss terms. First, a cross-modal contrastive term is used to regularize $z^p$, which contains the patient-specific information. Second, a within-modal contrastive loss is used to regularize $z^m$, which contains the time-/modality-specific information. The Soft Nearest Neighbor (SNN) loss is used for both regularization terms (Frosst et al., 2019) (Sec. 3.3). Notably, these contrastive terms are additional to the loss function for the prediction task of interest. We use Softmax Self-Gating, a multi-modal fusion module, to separately aggregate $z^p$ and $z^m$ from all modalities (Sec. 3.4). The overall loss function for BFF training is therefore:
$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \mathcal{L}_{\text{across}} + \mathcal{L}_{\text{within}} \tag{1}$$
Here, $\mathcal{L}_{\text{pred}}$ is the objective for the prediction task (e.g., negative log-likelihood), and $\mathcal{L}_{\text{across}}$ and $\mathcal{L}_{\text{within}}$ are the SNN regularization terms for across-modal and within-modal alignment, respectively.
3.2. Training & Testing of BFF and Other Approaches
Under the standard practice for developing a CPM to calculate risks at time $t$, we use a conventional holdout validation framework where both the training and testing datasets share the same structure and cover the same set of time windows. More specifically, both the training and testing data are drawn exclusively from historical time windows that occur prior to $t$.
Given that future data may contain richer predictive signals, we can enhance early risk assessment through self-supervised forecasting pretraining followed by task-specific fine-tuning (Lyu et al., 2018; Xue et al., 2020). The training procedure for this forecasting-based approach involves two stages: (1) pretraining a forecasting autoencoder to predict future observations based on data available before time $t$; and (2) using the pretrained encoder to extract representations of historical data, which serve as input features for the main CPM. During model testing, only information observed prior to $t$ is processed through the forecasting encoder and subsequently passed to the CPM. See Appx. B for implementation details of this approach.
The BFF framework, on the other hand, is more training-efficient than the two-step forecasting approach. The CPM is trained on all available information across all time windows in a single step. The encoders for all time windows are jointly optimized under the contrastive regularization. However, during the testing/evaluation phase, BFF calculates the risks at $t$ using only the information observed prior to $t$; encoders for the future time windows are masked off. Tab. 1 summarizes how BFF and the other approaches differ in utilizing training and testing data for model development.
Table 1:
Differences in how training and testing data are utilized to develop a CPM for risk assessment at time $t$ across different approaches
| Method | Training Paradigm | Training Data | Testing Data |
|---|---|---|---|
| Standard Practice | one-step | prior to $t$ | prior to $t$ |
| Forecasting | two-step | all time windows | prior to $t$ |
| BFF | one-step | all time windows | prior to $t$ |
3.3. Contrastive Regularization
Let the sample size be $N$ and the number of modalities be $M$. Let $\{z^p_{ij}\}$ and $\{z^m_{ij}\}$ denote the sets of patient representations and modality representations produced by the encoders, respectively, where $i$ indexes the patient and $j$ indexes the modality. Since each anchor can have multiple positive pairs, the SNN loss is used for both within-modal and cross-modal contrastive learning. In our implementation of the SNN loss, we replace the Euclidean distance with cosine similarity, $\mathrm{sim}(a, b) = a^{\top}b / (\lVert a\rVert\,\lVert b\rVert)$, as cosine similarity is more aligned with the established InfoNCE derivation and is used more often in mainstream contrastive learning frameworks (Oord et al., 2018; He et al., 2020; Chen et al., 2020; Radford et al., 2021).
We implement a dual contrastive mechanism that jointly optimizes within-modal feature understanding and cross-modal representation alignment. In both components, similarities are encouraged to be maximized among the positively paired representations and minimized among the negatively paired ones. In the within-modal contrastive term (Eqn. 2), positive pairs are formed by representations belonging to the same modality (observed in the same time window). This allows the encoders to preserve modality-specific information in the $z^m$'s. On the other hand, in the cross-modal contrastive term (Eqn. 3), positive pairs are formed by representations generated from the same patient. Under this regularization, the $z^p$'s are encouraged to carry patient-specific features. The contrastive temperatures $\tau_w$ and $\tau_a$ are learnable parameters.
$$\mathcal{L}_{\text{within}} = -\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\log\frac{\sum_{k\neq i}\exp\!\big(\mathrm{sim}(z^m_{ij}, z^m_{kj})/\tau_w\big)}{\sum_{(k,l)\neq(i,j)}\exp\!\big(\mathrm{sim}(z^m_{ij}, z^m_{kl})/\tau_w\big)} \tag{2}$$
$$\mathcal{L}_{\text{across}} = -\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\log\frac{\sum_{l\neq j}\exp\!\big(\mathrm{sim}(z^p_{ij}, z^p_{il})/\tau_a\big)}{\sum_{(k,l)\neq(i,j)}\exp\!\big(\mathrm{sim}(z^p_{ij}, z^p_{kl})/\tau_a\big)} \tag{3}$$
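The pairing rules behind the two SNN terms can be sketched in numpy as below. For compactness, both terms are computed on a single flattened set of representations `Z`; in BFF the within-modal term acts on the modality-specific vectors and the across-modal term on the patient-specific vectors, and the fixed temperature here stands in for the learnable ones.

```python
import numpy as np

def cosine_sim(Z):
    """Pairwise cosine similarity matrix."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Zn @ Zn.T

def snn_loss(Z, pos_mask, tau=0.1):
    """Soft Nearest Neighbor loss with cosine similarity.
    Z: (K, d) representations; pos_mask[a, b] = True if b is a positive for a."""
    sim = np.exp(cosine_sim(Z) / tau)
    np.fill_diagonal(sim, 0.0)                 # exclude self-pairs
    pos = (sim * pos_mask).sum(axis=1)         # similarity mass on the positives
    return -np.mean(np.log(pos / sim.sum(axis=1)))

# Toy setup: N patients x M modalities, flattened to K = N * M vectors.
N, M, d = 3, 2, 4
rng = np.random.default_rng(0)
Z = rng.standard_normal((N * M, d))
patient = np.repeat(np.arange(N), M)           # patient index of each vector
modality = np.tile(np.arange(M), N)            # modality index of each vector

# Within-modal positives: same modality, different patients.
within_pos = (modality[:, None] == modality[None, :]) & (patient[:, None] != patient[None, :])
# Across-modal positives: same patient, different modalities.
across_pos = (patient[:, None] == patient[None, :]) & (modality[:, None] != modality[None, :])

l_within = snn_loss(Z, within_pos)
l_across = snn_loss(Z, across_pos)
```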
To illustrate the dual contrastive mechanism, consider a hypothetical scenario with three time windows of data collection and three patients, where some modalities are missing. Fig. 2 and Fig. 1 illustrate how the representations generated by the available modalities are paired up and adjusted by the contrastive regularization.
Figure 2: Representation pairing in the dual contrastive mechanism.
Importantly, the representations derived from future modalities (e.g., medical codes collected during developmental well-child visits) are inherently more informative for predicting outcomes, as they are based on information collected closer to the event onset. In the cross-modal alignment, these future representations serve as “soft labels” to guide the representations formed from earlier modalities. This approach is analogous to the contrastive mechanism in the CLIP framework, where natural language implicitly supervises visual representation learning (Radford et al., 2021; Yuan et al., 2021; Lin and Hu, 2022).
3.4. Softmax Self-Gating
Since we treat data from each time window as a separate modality, effective multi-modal fusion is critical for downstream prediction performance. Our softmax self-gating mechanism for modality fusion is inspired by the SE block for computer vision (Hu et al., 2018). We extend the SE block to our multi-modal setting and calculate attention weights for each feature (i.e., embedding coordinate) of each modality. The channel-wise weights of the SE block thus become feature-wise attention scores over all modalities in the downstream information aggregation.
Assume we have $M$ modalities, each with a latent embedding of size $d$, so that each patient has a multi-modal feature embedding $E \in \mathbb{R}^{M \times d}$. We compute the feature-wise attention weights using an MLP, denoted $g(\cdot)$, followed by a final softmax activation. The gated features are the element-wise modulation of $E$ with the attention weights $A$.
$$A = \operatorname{softmax}\big(g(E)\big), \qquad \tilde{E} = A \odot E \tag{4}$$
Notably, we apply the softmax along the modality dimension of $g(E)$, producing a modality-wise attention weight distribution for each feature. The operation $\odot$ denotes the Hadamard (element-wise) product. For each patient, the final representation is formed through a feature-specific weighted sum across all modalities. Fig. 1 provides a visual demonstration of this process.
$$z = \sum_{j=1}^{M} \tilde{E}_{j,:} = \sum_{j=1}^{M} A_{j,:} \odot E_{j,:} \tag{5}$$
For predictions at time $t$, we mask future modalities in the softmax, setting their fusion weights to zero to prevent the model from accessing future information. This is similar to “causal masking” (Vaswani et al., 2017) but applied for downstream prediction rather than decoding. To handle missing modalities during both training and inference, we represent missing inputs with padding tokens as placeholders without meaningful content. We apply boolean masking to set the gating weights of these modalities to zero, ensuring aggregation is performed only over observed modalities. Together, these masking strategies ensure that the model makes predictions based only on valid and temporally appropriate inputs.
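A minimal numpy sketch of Softmax Self-Gating with masking follows. The two-layer `mlp` is a stand-in for the gating network; setting a modality's gating logits to negative infinity before the softmax implements both the future-modality mask and the missing-modality mask.

```python
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_self_gate(E, mlp, observed):
    """E: (M, d) stacked modality features for one patient.
    mlp: callable (M, d) -> (M, d) producing feature-wise gating logits.
    observed: (M,) boolean; False marks a missing or future-masked modality."""
    logits = mlp(E)                    # feature-wise logits per modality
    logits[~observed] = -np.inf        # masked modalities receive zero weight
    A = softmax(logits, axis=0)        # softmax over the modality dimension
    return (A * E).sum(axis=0)         # feature-specific weighted sum -> (d,)

M, d = 3, 4
rng = np.random.default_rng(1)
E = rng.standard_normal((M, d))
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
mlp = lambda X: np.tanh(X @ W1) @ W2   # stand-in two-layer gating network

fused_all = softmax_self_gate(E, mlp, np.array([True, True, True]))
fused_early = softmax_self_gate(E, mlp, np.array([True, True, False]))  # mask the last window
```

Because a masked modality's weights are exactly zero, its feature values cannot influence the fused representation, which is what allows the same trained model to be evaluated at earlier assessment times.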
Beyond enabling efficient multi-modal fusion, the weight matrix $A$ directly quantifies the contribution of each feature and modality to the downstream task during aggregation. See Sec. 5.2 for more details.
3.5. Model Evaluation
We use the area under the receiver operating characteristic (ROC) curve (AUC) to evaluate the binary classification task. For the time-to-event prediction task, we use the cumulative/dynamic AUC, $\mathrm{AUC}^{C/D}(t)$, a time-dependent extension of the ordinary AUC first introduced by Heagerty and Zheng (2005). It evaluates a model's capability to differentiate cumulative cases from dynamic controls. Let $T$ be the true time-to-event of an arbitrary individual. For a given time $t$, cumulative cases refer to the patients who experience the event of interest by time $t$ ($T \le t$), whereas dynamic controls are those who still remain event-free at that moment ($T > t$) (Heagerty and Zheng, 2005; Kamarudin et al., 2017). In this work, we use the inverse probability of censoring weighting (IPCW) estimator (Eqn. 6) proposed by Uno et al. (2007) and Hung and Chiang (2010) to compute the metric.
$$\widehat{\mathrm{AUC}}^{C/D}(t) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} \Delta_i\,\hat{W}_i\,\mathbb{1}(T_i \le t)\,\mathbb{1}(T_j > t)\,\mathbb{1}(\hat{r}_i > \hat{r}_j)}{\sum_{i=1}^{n}\sum_{j=1}^{n} \Delta_i\,\hat{W}_i\,\mathbb{1}(T_i \le t)\,\mathbb{1}(T_j > t)} \tag{6}$$
where $\hat{W}_i = 1/\hat{G}(T_i)$ is the inverse of the probability of being censored for subject $i$, $\Delta_i$ is the event indicator, and $\hat{r}_i$ is the predicted risk score. The estimated survival function is computed using the Kaplan-Meier estimator (Kaplan and Meier, 1958) based on the training data.
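To make the estimator concrete, here is a simplified numpy sketch of an IPCW cumulative/dynamic AUC in the spirit of the estimator above. Ties in the risk scores are counted as 1/2, and the censoring survival function is passed in as a callable `G_hat`; the exact estimator of Uno et al. (2007) may differ in detail.

```python
import numpy as np

def ipcw_cd_auc(time, event, risk, G_hat, t):
    """IPCW estimate of the cumulative/dynamic AUC at horizon t.
    Cumulative cases (T_i <= t, event observed) are weighted by 1 / G_hat(T_i);
    dynamic controls are subjects still event-free at t."""
    case = (time <= t) & (event == 1)
    ctrl = time > t
    w = np.where(case, 1.0 / G_hat(time), 0.0)   # IPCW weights for the cases
    num = den = 0.0
    for i in np.flatnonzero(case):
        for j in np.flatnonzero(ctrl):
            num += w[i] * ((risk[i] > risk[j]) + 0.5 * (risk[i] == risk[j]))
            den += w[i]
    return num / den

time = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
event = np.array([1, 1, 0, 1, 0])
risk = np.array([0.9, 0.8, 0.3, 0.4, 0.1])
G_hat = lambda s: np.ones_like(s)                # toy: no censoring before t
auc3 = ipcw_cd_auc(time, event, risk, G_hat, t=3.0)
```

In this toy example every case outranks every control, so the estimate is 1; in practice `G_hat` would come from a Kaplan-Meier fit to the censoring distribution on the training data.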
We use the method (Eqn. 7) proposed by Lambert and Chevret (2016) to summarize $\mathrm{AUC}^{C/D}(t)$ over time. It calculates a weighted mean over a restricted time interval $(\tau_1, \tau_2)$ by integrating over the estimated survival function $\hat{S}$.
$$\mathrm{iAUC}(\tau_1, \tau_2) = \frac{1}{\hat{S}(\tau_1) - \hat{S}(\tau_2)} \int_{\tau_1}^{\tau_2} \widehat{\mathrm{AUC}}^{C/D}(t)\, \mathrm{d}\big({-}\hat{S}(t)\big) \tag{7}$$
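The restricted summary can be approximated on a discrete time grid as below, with each grid point weighted by the drop in the estimated survival function over its step. The exponential survival curve and the evaluation grid are toy assumptions for illustration.

```python
import numpy as np

def integrated_auc(ts, aucs, S, t1, t2):
    """Weighted mean of AUC(t) over (t1, t2], weighting each grid point by the
    drop of the estimated survival function across its step."""
    sel = (ts > t1) & (ts <= t2)
    ts, aucs = ts[sel], aucs[sel]
    left = np.concatenate(([t1], ts[:-1]))   # left endpoints of the steps
    w = S(left) - S(ts)                      # survival mass falling in each step
    return float((aucs * w).sum() / w.sum())

S = lambda t: np.exp(-0.1 * np.asarray(t, dtype=float))  # toy survival curve
ts = np.linspace(3.0, 8.0, 6)                            # evaluation grid (years 3-8)
aucs = np.full(6, 0.70)                                  # constant AUC(t) for illustration
iauc = integrated_auc(ts, aucs, S, t1=3.0, t2=8.0)
```

A constant AUC(t) integrates to itself, which is a quick sanity check on the normalization of the weights.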
4. Experiment & Data
4.1. CBOW pre-training
To capture generalized representations of medical encounters for both children and their mothers, we trained a CBOW model on tokenized EHR codes. In this formulation, each code is treated as an individual token, with children's codes covering ages 0–24 months and mothers' codes spanning the prenatal period through newborn delivery. The data were sourced from our institution's Clinical Research Datamart (CRDM). Specifically, we used encounter data consisting of diagnosis codes, procedure codes, medication prescriptions, and laboratory tests for all children born between 2015 and 2022. The data included mothers' prenatal visit encounters, birth encounters, and post-birth medical and well-child encounters. To prevent any potential data leakage, we excluded the downstream tasks' pre-defined testing data from the training data used for the CBOW model.
4.2. Downstream tasks
We considered two common tasks for early childhood risk assessment using data retrieved from three critical time windows: prenatal ($t_1$), birth ($t_2$), and developmental ($t_3$). For both tasks, the EHR inputs are tokenized into medical codes and organized into three corresponding modalities. A critical distinction emerges at $t_2$, where maternal and newborn records contain fundamentally different clinical information. Therefore, we further subdivide the birth modality into a maternal component and a newborn component.
4.2.1. Prediction Task 1: Autism Spectrum Disorder (ASD)
ASD is a neurodevelopmental condition that typically presents within the first years of life. The mean age at diagnosis is around 5 years (Van'T Hof et al., 2021). Children are typically screened during their 18-month well-child visit and referred for confirmatory diagnosis. Previous work has shown that early medical conditions (e.g., gastrointestinal disease) can serve as early indicators of a future ASD diagnosis (Engelhard et al., 2023; Alexeeff et al., 2017). Our team has been focused on developing automated early screening tools for ASD.
We identified a cohort of 43,945 children who had a well-child visit at our institution between 12 and 24 months of age. We used the well-child visit closest to 18 months as the index encounter. We required two distinct medical encounters with an ICD-10 code for ASD to label an ASD case (Guthrie et al., 2019). Across our cohort's follow-up, 2.05% were diagnosed with ASD. Our training and testing set sizes were 35,000 and 8,945, respectively. To account for potential censoring and loss to follow-up, we set up our task as a time-to-event prediction task. Our outcome of interest is the time to diagnosis, and we used the fixed-interval Discrete-Time Neural Network (DTNN) as the downstream prediction model (Lee et al., 2018; Hickey et al., 2024). More details about the DTNN and the corresponding prediction loss function are provided in Appx. A.
4.2.2. Prediction Task 2: Recurrent Acute Otitis Media (rAOM)
rAOM, or recurrent ear infection, is a common medical condition in early childhood. While most children will experience an ear infection during their first years of life, approximately 20–30% of the pediatric population will experience recurrent ear infections, defined as 3 episodes within a 6-month period or 4 episodes within a year (Kaur et al., 2017; Pichichero, 2000). Children who experience rAOM often require surgical intervention in the form of ear tube placement. We sought to develop a predictive model for the probability that a first AOM develops into rAOM. We identified a cohort of 5,438 children with a first AOM. Of these, 22.0% transitioned into rAOM. We used the same embedding framework as above to abstract prenatal, birth, and developmental clinical data. We split the data into training and testing sets of sizes 4,227 and 1,211, respectively. Since our eligible cohort had attained age 4, an age at which rAOM is unlikely to develop, we modeled the binary diagnosis indicator.
5. Results
5.1. Prediction Performance
Tab. 3 and Tab. 4 present model performance scores for risk assessments across the three time windows. While predictions at $t_1$ and $t_2$ are the early risk assessments, we include $t_3$ assessments to provide a comprehensive analysis. The fusion techniques are based on the same masking strategies. The forecasting approach does not involve any fusion, and its performance at $t_3$ is omitted as no future data are available beyond this point. Crucially, model comparisons must be conducted within individual columns rather than across columns, since each column corresponds to a different evaluation time point with a varying testing population. For example, when evaluating performance at $t_2$, the analysis is restricted to testing samples with the birth modality observed, since some children were not born at our institution.
Table 3:
Evaluation results of the ASD time-to-diagnosis prediction task. iAUC is the integrated $\mathrm{AUC}^{C/D}$ measure across postnatal years 3–8. The prediction head encapsulates the specific representations utilized for the downstream task. CR includes both within-modal and across-modal contrastive losses. Performance results are averaged over 5 random seeds.
| Method | Fusion Technique | $t_1$ (prenatal) | $t_2$ (birth) | $t_3$ (developmental) |
|---|---|---|---|---|
| Standard Practice | masked mean | 0.676 (0.004) | 0.677 (0.009) | 0.756 (0.003) |
| | self-attention | 0.672 (0.004) | 0.684 (0.024) | 0.785 (0.004) |
| | softmax self-gating | 0.671 (0.011) | 0.685 (0.012) | 0.792 (0.006) |
| Forecasting | / | 0.722 (0.003) | 0.737 (0.004) | - |
| BFF w/o CR | masked mean | 0.680 (0.009) | 0.679 (0.011) | 0.760 (0.008) |
| | self-attention | 0.682 (0.002) | 0.682 (0.010) | 0.778 (0.009) |
| | softmax self-gating | 0.683 (0.002) | 0.684 (0.004) | 0.783 (0.007) |
| BFF w/ CR (head: $z^p$) | masked mean | 0.707 (0.008) | 0.745 (0.004) | 0.742 (0.011) |
| | self-attention | 0.697 (0.011) | 0.741 (0.005) | 0.744 (0.017) |
| | softmax self-gating | 0.721 (0.011) | 0.749 (0.009) | 0.742 (0.010) |
| BFF w/ CR (head: $z^p, z^m$) | masked mean | 0.706 (0.008) | 0.733 (0.028) | 0.762 (0.003) |
| | self-attention | 0.690 (0.006) | 0.733 (0.015) | 0.765 (0.009) |
| | softmax self-gating | 0.693 (0.010) | 0.727 (0.020) | 0.787 (0.005) |
Table 4:
Evaluation results of the rAOM binary prediction task, where the prediction head denotes which representations ($z^p$ alone, or both $z^p$ and $z^m$) are used in the downstream prediction. CR includes both within-modal and across-modal contrastive losses. Performance results are averaged over 5 random seeds.
| Method | Fusion Technique | $t_1$ (prenatal) | $t_2$ (birth) | $t_3$ (developmental) |
|---|---|---|---|---|
| Standard Practice | masked mean | 0.613 (0.002) | 0.612 (0.005) | 0.871 (0.003) |
| | self-attention | 0.604 (0.002) | 0.610 (0.003) | 0.872 (0.001) |
| | softmax self-gating | 0.600 (0.012) | 0.613 (0.007) | 0.870 (0.002) |
| Forecasting | / | 0.519 (0.042) | 0.505 (0.010) | - |
| BFF w/o CR | masked mean | 0.580 (0.018) | 0.597 (0.003) | 0.868 (0.001) |
| | self-attention | 0.600 (0.006) | 0.620 (0.005) | 0.868 (0.003) |
| | softmax self-gating | 0.573 (0.001) | 0.583 (0.003) | 0.871 (0.002) |
| BFF w/ CR (head: $z^p$) | masked mean | 0.628 (0.010) | 0.634 (0.012) | 0.657 (0.002) |
| | self-attention | 0.609 (0.014) | 0.622 (0.007) | 0.649 (0.002) |
| | softmax self-gating | 0.623 (0.006) | 0.641 (0.004) | 0.649 (0.021) |
| BFF w/ CR (head: $z^p, z^m$) | masked mean | 0.600 (0.013) | 0.608 (0.015) | 0.856 (0.002) |
| | self-attention | 0.611 (0.015) | 0.607 (0.016) | 0.857 (0.003) |
| | softmax self-gating | 0.605 (0.010) | 0.628 (0.009) | 0.852 (0.012) |
Given the substantial size (~44k observations) of the ASD cohort, we use it to conduct experiments comparing the data efficiency of BFF against forecasting pretraining. We train each model on random sub-samples of the training data and evaluate it on the original testing data. Our findings reveal a pattern consistent with CLIP's results, where the contrastive objective demonstrates superior efficiency compared to the predictive forecasting objective (Radford et al., 2021). As a contrastive framework, BFF requires a much smaller amount of data to achieve robust performance, whereas the forecasting approach requires far more data to attain comparable performance levels. The forecasting method demonstrates inferior data efficiency and generalizability under limited training data, as evidenced by low performance scores and high variability on small random training subsets. The model fails to extract robust patterns from the scarce data and becomes overly dependent on the specific training instances selected. Therefore, when applied to downstream tasks with small datasets, such as our rAOM dataset containing merely ~4k observations, forecasting approaches are expected to demonstrate significantly degraded performance.
The BFF framework demonstrates significant improvement in early risk assessment performance. The contrastive regularization terms emerge as the primary driver of the enhancements, particularly through the improved learning of the patient-specific features $z^p$. One interesting observation is that the time-/modality-specific features $z^m$ can be counter-effective for the predictions at $t_1$ and $t_2$. This implies that the $z^m$ from the prenatal and at-birth modalities can be irrelevant to the outcome and therefore mask the true signal. However, $z^m$ from the developmental modality is highly predictive of the outcome. This empirically validates that risk assessments conducted later in time are generally more accurate, since data observed later are closer and therefore more correlated to the potential outcome of interest.
5.2. Modality Attention during BFF’s Training Stage
In our multi-modal setting, self-attention computes pair-wise attention across the sequence of modalities. These scores represent how much attention each modality should allocate to the others. While these weights produce contextualized representation features, it is challenging to translate them into feature importance measures. Appx. C provides visualizations of the multi-modal self-attention scores for two random samples.
On the other hand, Softmax Self-Gating's gating mechanism calculates an input-dependent weight for each coordinate of the embeddings, which can be directly interpreted as a feature attribution for the downstream task. It offers an explainable illustration of how each modality/time window contributes to the prediction during training and testing. More importantly, it explains why BFF improves over the standard practice in early risk assessment.
Fig. 4 visualizes how the contrastive regularization adjusts feature attention in BFF. As evidenced by (a), when all modalities are used during training without CR, the developmental modality disproportionately dominates the calculation. The model naturally prioritizes this modality's representation due to its high predictive value for the outcome. It therefore underutilizes the early-stage modalities and insufficiently optimizes the corresponding encoders. Under the contrastive regularization in BFF, the encoders for the early modalities utilize the representations from the later modality as soft supervision. In (b), we observe that the model is regularized to direct greater attention to the prenatal and at-birth modalities when the downstream prediction head takes only $z^p$, thereby enhancing early risk assessment. Additionally, (c) helps to explain the drop in performance of the evaluation at $t_1$ and $t_2$: when both $z^p$ and $z^m$ from all modalities are used in the downstream prediction, the prediction head prioritizes the modality-specific features from the developmental modality and pays less attention to the features from the early-stage modalities. Consequently, the encoders may not fully leverage the benefits of borrowing from the future information.
Figure 4: Feature-wise attention scores from the Softmax Self-Gating in the ASD time-to-diagnosis prediction. Each row corresponds to a patient, and the weight matrices, visualized as heat maps, are calculated from the same random sample under different model setups. The y-axis indicates the modality while the x-axis indicates the coordinates of the representations. There is no differentiation between and without CR.
We corroborate these findings with an analysis of Integrated Gradients (IG) (Sundararajan et al., 2017) at the modality level, as shown in Fig. 5. The procedure for calculating the modality importance is detailed in Appx. D. The contrastive regularization and the selection of latent features effectively alter the modality contributions in the downstream task. In "", earlier modalities receive greater emphasis while the developmental modality is de-emphasized. This analysis also provides additional insight into the compromised performance at of "": for the risk assessment at , the model appears to over-rely on earlier time windows instead of focusing on the more informative developmental modality.
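As a self-contained illustration of how IG can be aggregated to the modality level, the sketch below applies a midpoint Riemann approximation of Integrated Gradients to a toy logistic risk model over concatenated modality blocks, then sums absolute attributions within each block. The toy model, block layout, and baseline choice are assumptions for illustration; the paper's actual procedure is in Appx. D:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def integrated_gradients(f_grad, x, baseline, steps=200):
    """Midpoint Riemann approximation of IG (Sundararajan et al., 2017)."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.mean([f_grad(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * grads

# Hypothetical logistic model over 3 concatenated modality blocks of size 4.
rng = np.random.default_rng(0)
d = 4
w = rng.standard_normal(3 * d)
f = lambda x: sigmoid(w @ x)
f_grad = lambda x: sigmoid(w @ x) * (1.0 - sigmoid(w @ x)) * w  # analytic gradient

x = rng.standard_normal(3 * d)
baseline = np.zeros_like(x)
ig = integrated_gradients(f_grad, x, baseline)

# Completeness axiom: attributions sum to f(x) - f(baseline).
print(abs(ig.sum() - (f(x) - f(baseline))) < 1e-3)

# Modality-level importance: aggregate |IG| within each modality's block.
importance = np.abs(ig).reshape(3, d).sum(axis=1)
print(importance.shape)  # (3,): one importance score per modality
```

The completeness check is a useful sanity test in practice, since it confirms the path integral was approximated finely enough before the per-modality scores are interpreted.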
Figure 5: Modality importance for models optimized at different evaluation time points. Although all models are trained on the same data (data across all time windows), the choice of evaluation time directly influences early stopping and checkpoint selection, which in turn shapes the learned modality importance.
6. Discussion
Inspired by previous works in contrastive learning (Oord et al., 2018; Tian et al., 2020; Radford et al., 2021), we introduce BFF, a novel framework that "borrows" future data as soft, implicit supervision to enhance the learning of early information. Compared to the standard practice and the forecasting approach (Lyu et al., 2018; Xue et al., 2020), BFF achieves significant improvements in early risk assessment while maintaining high training and data efficiency. As an alternative framework, it algorithmically provides more accurate preventative decision support. Our results also show that the contrastive regularization is a crucial component of this framework.
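For readers unfamiliar with this family of objectives, the following is a minimal numpy sketch of an InfoNCE-style contrastive term (Oord et al., 2018) aligning an early time window's representations with a later one's: matching patients in a batch are positives, all other pairings are negatives, so the later-stage representations act as soft targets for the early-stage encoder. This is a common instantiation, not necessarily the exact loss used in BFF:

```python
import numpy as np

def info_nce(Z_early, Z_late, tau=0.1):
    """InfoNCE loss between paired (batch, d) L2-normalized representations.

    Row i of Z_early and row i of Z_late come from the same patient
    (positive pair); every other row in the batch is a negative.
    """
    sim = Z_early @ Z_late.T / tau                 # (batch, batch) similarities
    sim = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # cross-entropy on positives

rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 16))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)      # unit-norm rows

aligned = info_nce(Z, Z)                           # perfectly matched pairs
shuffled = info_nce(Z, np.roll(Z, 1, axis=0))      # mismatched pairs
print(aligned < shuffled)                          # alignment lowers the loss
```

Minimizing this term pulls each early-stage representation toward its own patient's later-stage representation and away from other patients', which is the "borrowing" mechanism in the framework.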
Additionally, we demonstrate that the Softmax Self-Gating mechanism not only is an effective technique for multi-modal fusion but also offers a clear interpretation of how contrastive regularization adjusts representation learning within the BFF framework. The visualizations of feature-wise attention scores provide evidence that helps to explain the enhancements in early risk assessments.
Although missing data is a common challenge in developing clinical prediction models in real-world settings, handling it appropriately and effectively remains technically demanding. Commonly used imputation-based methods are typically designed for structured tabular data (Van Buuren and Groothuis-Oudshoorn, 2011; Stekhoven and Bühlmann, 2012) and define missingness at the feature level. We process EHR data into sequences of tokenized medical events following an NLP-style data structure, where missingness manifests at entire modality/time-window levels rather than at individual feature values. Consequently, traditional missing data imputation methods are unsuitable for this context. Our approach leverages masking during multimodal fusion to incorporate all available information for downstream clinical prediction. It is both simple and entirely data-driven, requiring no distributional assumptions about missingness patterns. Nevertheless, theoretically grounded methodologies with greater sophistication may yield superior representation learning and predictive performance (Ma et al., 2021; Lin and Hu, 2023; Yao et al., 2024). It would be interesting to systematically investigate how various modality-level missing-data handling approaches impact model performance in future work.
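A simplified sketch of the masking idea is shown below: missing modalities receive a score of negative infinity before the softmax over modalities, so they get exactly zero weight and the fused representation is built only from the observed time windows. The gating scores here are a stand-in (the raw embeddings themselves); the learned scoring in BFF may differ:

```python
import numpy as np

def masked_gated_fusion(H, present):
    """Fuse modality embeddings while masking out missing modalities.

    H: (n_modalities, d) embeddings; present: (n_modalities,) boolean mask.
    Masked-softmax over modalities guarantees missing modalities contribute
    nothing, with no imputation or distributional assumptions needed.
    """
    scores = H.copy()                       # stand-in for learned gating scores
    scores[~present] = -np.inf              # exclude missing modalities
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    G = e / e.sum(axis=0, keepdims=True)    # weights over observed modalities only
    return (G * H).sum(axis=0), G

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 4))             # 3 modalities, dim 4 (illustrative)
present = np.array([True, False, True])     # e.g., middle time window unobserved
fused, G = masked_gated_fusion(H, present)
print(np.allclose(G[1], 0.0))               # missing modality gets zero weight
print(np.allclose(G.sum(axis=0), 1.0))      # weights renormalize over the rest
```

Because the renormalization happens per coordinate, patients with different missingness patterns can share one model without any special-case imputation step.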
Limitations
We demonstrate that BFF can enhance early risk assessments, enabling more preventive care for children. However, this comes with a trade-off: while BFF improves model performance at earlier stages, it performs worse than the standard practice at . When relying solely on for downstream tasks, performance is notably lower at because representation learning at this optimal time window cannot benefit from “borrowed” information via cross-time-window alignment and may even be adversely affected by it. Technically, the cross-modal alignment objective encourages to capture patient-specific rather than time-specific features. To address this, in one implementation, we combine , modulated by the within-modal alignment term, with as input to the downstream head. However, this combined representation still slightly underperforms the baseline at , suggesting that further work is needed to refine the CR for more effective representation learning.
Moreover, BFF may not generalize to time series of continuous measures. Because raw measurements can be highly correlated across time, later windows carry little additional signal, and representation learning at an earlier time point may not benefit from the contrastive regularization. Further investigation is needed to identify when BFF provides the greatest advantage and under what conditions its effectiveness may be constrained.
Currently, in BFF, we train separate models for each evaluation time. As future work, we aim to develop a single unified model optimized across all time windows. Rather than treating the entire developmental period as a single modality, we are interested in increasing temporal granularity to enable risk assessments at each well-child visit. Additionally, replacing the CBOW encoder with a Transformer-based encoder may improve modeling capacity and contextual understanding.
Supplementary Material
Figure 3: BFF, as a one-step procedure, demonstrates more robust performance and higher data efficiency under limited training data. The performance scores are averaged over five random seeds, and the error bars are the standard deviations.
Table 2:
Time windows and modalities
| Time window | Modality |
|---|---|
Acknowledgments
We thank the anonymous reviewers for their valuable insights and constructive feedback. Special appreciation goes to Jillian Hurst, Congwen Zhao, Abby Scheer for the careful preparation and setup of the cohort data. This research was supported by NIH grants NICHD P50 HD0N3074 and NIAID KO1 AI73398. Matthew Engelhard was supported by NIMH K01 MH127409. This project was completed during the Duke AI Health Data Science Fellowship Program. Health Data Science at Duke is supported by the National Center for Advancing Translational Sciences (NCATS), National Institutes of Health, through Grant Award Number UL1 TR002553. The Duke AI Health Data Science Fellowship Program is supported by the above NCATS grant, the Duke Department of Biostatistics & Bioinformatics, and Duke AI Health. The Duke Protected Analytics Computing Environment (PACE) program is supported by the above grant and by Duke University Health System. The content of this publication is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
References
- Alexeeff Stacey E, Yau Vincent, Qian Yinge, Davignon Meghan, Lynch Frances, Crawford Phillip, Davis Robert, and Croen Lisa A. Medical conditions in the first years of life associated with future diagnosis of asd in children. Journal of autism and developmental disorders, 47:2067–2079, 2017.
- Beijers Roseriet, Jansen Jarno, Riksen-Walraven Marianne, and de Weerth Carolina. Maternal prenatal anxiety and stress predict infant illnesses and health complaints. Pediatrics, 126(2):e401–e409, 2010.
- Boag John W. Maximum likelihood estimates of the proportion of patients cured by cancer therapy. Journal of the Royal Statistical Society. Series B (Methodological), 11(1):15–53, 1949.
- Chen Ting, Kornblith Simon, Norouzi Mohammad, and Hinton Geoffrey. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
- Cho Kyunghyun, Van Merriënboer Bart, Gulcehre Caglar, Bahdanau Dzmitry, Bougares Fethi, Schwenk Holger, and Bengio Yoshua. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- Dauphin Yann N, Fan Angela, Auli Michael, and Grangier David. Language modeling with gated convolutional networks. In International conference on machine learning, pages 933–941. PMLR, 2017.
- Engelhard Matthew M, Henao Ricardo, Berchuck Samuel I, Chen Junya, Eichner Brian, Herkert Darby, Kollins Scott H, Olson Andrew, Perrin Eliana M, Rogers Ursula, et al. Predictive value of early autism detection models based on electronic health record data collected before age 1 year. JAMA network open, 6(2):e2254303–e2254303, 2023.
- Farewell Vern T. The use of mixture models for the analysis of survival data with long-term survivors. Biometrics, pages 1041–1046, 1982.
- Frosst Nicholas, Papernot Nicolas, and Hinton Geoffrey. Analyzing and improving representations with the soft nearest neighbor loss. In International conference on machine learning, pages 2012–2020. PMLR, 2019.
- Gao Tianyu, Yao Xingcheng, and Chen Danqi. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821, 2021.
- Guthrie Whitney, Wallis Kate, Bennett Amanda, Brooks Elizabeth, Dudley Jesse, Gerdes Marsha, Pandey Juhi, Levy Susan E, Schultz Robert T, and Miller Judith S. Accuracy of autism screening in a large pediatric network. Pediatrics, 144(4), 2019.
- Hager Paul, Menten Martin J, and Rueckert Daniel. Best of both worlds: Multimodal contrastive learning with tabular and imaging data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23924–23935, 2023.
- He Kaiming, Fan Haoqi, Wu Yuxin, Xie Saining, and Girshick Ross. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
- Heagerty Patrick J and Zheng Yingye. Survival model predictive accuracy and roc curves. Biometrics, 61(1):92–105, 2005.
- Hickey Jimmy, Henao Ricardo, Wojdyla Daniel, Pencina Michael, and Engelhard Matthew. Adaptive discretization for event prediction (adept). In International Conference on Artificial Intelligence and Statistics, pages 1351–1359. PMLR, 2024.
- Hochreiter Sepp and Schmidhuber Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Hu Jie, Shen Li, and Sun Gang. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
- Hung Hung and Chiang Chin-Tsang. Estimation methods for time-dependent auc models with survival data. Canadian Journal of Statistics, 38(1):8–26, 2010.
- Kamarudin Adina Najwa, Cox Trevor, and Kolamunnage-Dona Ruwanthi. Time-dependent roc curve analysis in medical research: current methods and applications. BMC medical research methodology, 17:1–19, 2017.
- Kaplan Edward L and Meier Paul. Nonparametric estimation from incomplete observations. Journal of the American statistical association, 53(282):457–481, 1958.
- Kaur Ravinder, Morris Matthew, and Pichichero Michael E. Epidemiology of acute otitis media in the postpneumococcal conjugate vaccine era. Pediatrics, 140(3), 2017.
- Kong Deming, Tao Ye, Xiao Haiyan, Xiong Huini, Wei Weizhong, and Cai Miao. Predicting preterm birth using auto-ml frameworks: a large observational study using electronic inpatient discharge data. Frontiers in Pediatrics, 12:1330420, 2024.
- Lambert Jérôme and Chevret Sylvie. Summary measure of discrimination in survival models based on cumulative/dynamic time-dependent roc curves. Statistical methods in medical research, 25(5):2088–2102, 2016.
- Lee Changhee, Zame William, Yoon Jinsung, and Van Der Schaar Mihaela. Deephit: A deep learning approach to survival analysis with competing risks. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018.
- Lee Seunghan, Park Taeyoung, and Lee Kibok. Soft contrastive learning for time series. arXiv preprint arXiv:2312.16424, 2023.
- Lin Ronghao and Hu Haifeng. Multimodal contrastive learning via uni-modal coding and cross-modal prediction for multimodal sentiment analysis. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 511–523, 2022.
- Lin Ronghao and Hu Haifeng. Missmodal: Increasing robustness to missing modality in multimodal sentiment analysis. Transactions of the Association for Computational Linguistics, 11, 2023.
- Lipkin Paul H, Macias Michelle M, Norwood Kenneth W, Brei Timothy J, Davidson Lynn F, Davis Beth Ellen, Ellerbeck Kathryn A, Houtrow Amy J, Hyman Susan L, Kuo Dennis Z, et al. Promoting optimal development: identifying infants and young children with developmental disorders through developmental surveillance and screening. Pediatrics, 145(1), 2020.
- Lyu Xinrui, Hueser Matthias, Hyland Stephanie L, Zerveas George, and Raetsch Gunnar. Improving clinical predictions through unsupervised time series representation learning. arXiv preprint arXiv:1812.00490, 2018.
- Ma Mengmeng, Ren Jian, Zhao Long, Tulyakov Sergey, Wu Cathy, and Peng Xi. Smil: Multimodal learning with severely missing modality. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 2302–2310, 2021.
- Mikolov Tomas, Chen Kai, Corrado Greg, and Dean Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- van den Oord Aaron, Li Yazhe, and Vinyals Oriol. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Pichichero Michael E. Recurrent and persistent otitis media. The Pediatric infectious disease journal, 19(9):911–916, 2000.
- Qiu Zi-Hao, Hu Quanqi, Yuan Zhuoning, Zhou Denny, Zhang Lijun, and Yang Tianbao. Not all semantics are created equal: Contrastive self-supervised learning with automatic temperature individualization. arXiv preprint arXiv:2305.11965, 2023.
- Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Stekhoven Daniel J and Bühlmann Peter. Missforest—non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118, 2012.
- Sundararajan Mukund, Taly Ankur, and Yan Qiqi. Axiomatic attribution for deep networks. In International conference on machine learning, pages 3319–3328. PMLR, 2017.
- Tian Yonglong, Krishnan Dilip, and Isola Phillip. Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 776–794. Springer, 2020.
- Uno Hajime, Cai Tianxi, Tian Lu, and Wei Lee-Jen. Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association, 102(478):527–537, 2007.
- Van Buuren Stef and Groothuis-Oudshoorn Karin. mice: Multivariate imputation by chained equations in r. Journal of statistical software, 45:1–67, 2011.
- Van’T Hof Maarten, Tisseur Chanel, van Berckelear-Onnes Ina, Van Nieuwenhuyzen Annemyn, Daniels Amy M, Deen Mathijs, Hoek Hans W, and Ester Wietske A. Age at autism spectrum disorder diagnosis: A systematic review and meta-analysis from 2012 to 2019. Autism, 25(4):862–873, 2021.
- Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, Kaiser Łukasz, and Polosukhin Illia. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Walsh Kate, McCormack Clare A, Webster Rachel, Pinto Anita, Lee Seonjoo, Feng Tianshu, Krakovsky H Sloan, O’Grady Sinclaire M, Tycko Benjamin, Champagne Frances A, et al. Maternal prenatal stress phenotypes associate with fetal neurodevelopment and birth outcomes. Proceedings of the National Academy of Sciences, 116(48):23996–24005, 2019.
- Xue Yuan, Du Nan, Mottram Anne, Seneviratne Martin, and Dai Andrew M. Learning to select best forecast tasks for clinical outcome prediction. Advances in Neural Information Processing Systems, 33:15031–15041, 2020.
- Yao Wenfang, Yin Kejing, Cheung William K, Liu Jia, and Qin Jing. Drfuse: Learning disentangled representation for clinical multi-modal fusion with missing modality and modal inconsistency. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 16416–16424, 2024.
- Yuan Xin, Lin Zhe, Kuen Jason, Zhang Jianming, Wang Yilin, Maire Michael, Kale Ajinkya, and Faieta Baldo. Multimodal contrastive training for visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6995–7004, 2021.
- Yue Zhihan, Wang Yujing, Duan Juanyong, Yang Tianmeng, Huang Congrui, Tong Yunhai, and Xu Bixiong. Ts2vec: Towards universal representation of time series. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 8980–8987, 2022.