. 2023 Jul 25;3(8):100540. doi: 10.1016/j.crmeth.2023.100540

© 2023 The Authors

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Combining datasets to predict values and uncertainties for missing viruses

(A) Schematic of data availability; two studies measure antibody responses against overlapping viruses (shades of blue) as well as unique viruses (green/gray). Studies may have different fractions of missing values (dark-red boxes) and measured values (gray). To test whether virus behavior can be inferred across studies, we predict the titers of a virus in dataset 1 (V₀, gold squares), using measurements from the overlapping viruses (V₁–V_n) as features in a random forest model.

(B) We train a decision tree model using a random subset of antibodies and viruses from dataset 2 (boxed in purple), cross-validate against the remaining antibody responses in dataset 2, and compute the root-mean-square error (RMSE, denoted by σ_Training).
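For reference, the root-mean-square error in panel (B) can be written out as follows. This is the standard RMSE definition; the symbols N (number of held-out antibody responses in dataset 2), t_i (measured titer), and t̂_i (titer predicted by the decision tree) are notation introduced here for illustration and do not appear in the caption.

```latex
\sigma_{\text{Training}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{t}_i - t_i\right)^{2}}
```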

(C) Multiple decision trees are trained, and the average of the 5 trees with the lowest error is used as the model going forward. Applying this model to dataset 1 (which was not used during training) yields the desired predictions, whose RMSE is given by σ_Actual. We repeat this process, withholding each virus in each dataset in turn.
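The selection step in panel (C) can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: each "tree" is a depth-1 regression tree (a stump), only antibodies (rows) are subsampled rather than both antibodies and viruses, and σ_Training is taken as the mean held-out RMSE of the kept trees. The names `fit_stump`, `train_ensemble`, `n_trees`, `keep`, and `train_frac` are hypothetical.

```python
import random
from math import sqrt


def rmse(pred, actual):
    """Root-mean-square error between two equal-length sequences."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def fit_stump(X, y):
    """Fit a depth-1 regression tree (assumed stand-in for a full decision tree).

    X is a list of rows (one per antibody response, one column per overlapping
    virus V1..Vn); y holds the titers of the withheld virus V0.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between adjacent values
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # every feature is constant: fall back to the mean
        mean_y = sum(y) / len(y)
        return lambda row: mean_y
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr


def train_ensemble(X, y, n_trees=20, keep=5, train_frac=0.5, seed=0):
    """Train n_trees stumps on random row subsets and keep the `keep` best.

    Returns (predict, sigma_training), where predict averages the kept trees
    and sigma_training is the mean held-out RMSE of those trees (an assumption).
    """
    n = len(X)
    if n < 4:
        raise ValueError("need at least 4 rows to train and hold out")
    rng = random.Random(seed)
    scored = []
    for _ in range(n_trees):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = max(2, int(train_frac * n))
        train, held = idx[:cut], idx[cut:]
        tree = fit_stump([X[i] for i in train], [y[i] for i in train])
        err = rmse([tree(X[i]) for i in held], [y[i] for i in held])
        scored.append((err, tree))
    scored.sort(key=lambda pair: pair[0])  # lowest held-out error first
    best = scored[:keep]
    sigma_training = sum(err for err, _ in best) / len(best)

    def predict(row):
        return sum(tree(row) for _, tree in best) / len(best)

    return predict, sigma_training
```

In the setting of the figure, `X` and `y` would come from dataset 2, and `predict` would then be applied to the rows of dataset 1 to obtain the predictions whose RMSE is σ_Actual.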

(D) To estimate the prediction error σ_Actual (which cannot be computed directly because V₀’s titers are withheld), we define the transferability relation f_2→1 between the training error σ_Training in dataset 2 and the actual error σ_Actual in dataset 1, using the decision trees that predict viruses V₁–V_n (without using V₀). Applying this relation to the training error of V₀, f_2→1(σ_Training), estimates σ_Actual for V₀.
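One way to picture the transferability relation in panel (D) is as a curve fitted through the (σ_Training, σ_Actual) pairs obtained for the overlapping viruses V₁–V_n, then evaluated at the training error of V₀. The caption does not state the functional form of f_2→1, so the sketch below assumes an ordinary least-squares line purely for illustration; `fit_transferability` is a hypothetical name.

```python
def fit_transferability(sigma_training, sigma_actual):
    """Fit sigma_actual ~ slope * sigma_training + intercept by least squares.

    Inputs are the per-virus errors for V1..Vn, where both the dataset 2
    training error and the dataset 1 actual error are known. The linear form
    is an assumption; the caption does not specify how f_2->1 is defined.
    """
    n = len(sigma_training)
    mean_x = sum(sigma_training) / n
    mean_y = sum(sigma_actual) / n
    sxx = sum((x - mean_x) ** 2 for x in sigma_training)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sigma_training, sigma_actual))
    slope = sxy / sxx if sxx > 0 else 0.0  # flat line if all x are equal
    intercept = mean_y - slope * mean_x
    return lambda sigma: slope * sigma + intercept


# Illustrative numbers only (not taken from the paper).
f_2_to_1 = fit_transferability([0.2, 0.4, 0.6], [0.4, 0.8, 1.2])
estimated_sigma_actual_v0 = f_2_to_1(0.5)  # estimate for the withheld virus V0
```

The point of the construction is that every input to the fit comes from viruses other than V₀, so the estimate for V₀ never touches its withheld titers.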