Abstract
Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the biomedical knowledge encoded in pre-trained LLMs and their emerging applications in genetics, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data with a novel knowledge-driven framework. We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features using the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find that this framework outperforms several data-driven methods, particularly in low-data regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.
Introduction
Predicting observable phenotypes from genotype data has proven to be a monumental task in the field of genetics, with diverse applications ranging from personalized medicine1 to genomic selection of crops2. Vast numbers of genetic variants, such as single nucleotide polymorphisms (SNPs) from high-throughput sequencing data, are often harnessed to predict phenotypes. To deal with the sparsity of this data, linear models with regularization have been applied with much success in the case of polygenic risk scores1,3. Furthermore, machine learning models have been developed to capture the entangled epistatic relationships between genes and to predict complex traits4.
Modeling genotype data, however, poses substantial difficulties: (1) First, the goal is often not merely prediction but also discovery (e.g. identifying causal variants5), and machine learning methods are known to flag spurious features as significant due to multicollinearity6. Analyzing interaction terms—like epistatic relationships between variants—also introduces challenges, such as a lack of interpretability when using complex models or multiple testing when examining higher-order interactions7. (2) Furthermore, a table of genotype data can contain thousands or even millions of columns. This ‘curse of dimensionality’ can cause severe overfitting and amplify many issues, including those named above8. (3) Lastly, data can be limited in real settings, exacerbating existing concerns about overfitting.
Feature selection and feature engineering (i.e. feature construction) can be crucial steps to mitigate these concerns. Data-driven methods, such as Lasso regression, have demonstrated great success in selecting features9. Feature engineering improves predictive performance without resorting to complex models and can help uncover interactions between features 10. However, these methods have their own issues: data-driven methods can struggle with small sample sizes and feature engineering is a laborious process that requires expertise to avoid multiple testing.
Recent advances in large language models (LLMs) have shown promise in addressing these challenges. With their remarkable performance across various tasks, LLMs have established themselves as powerful tools in many domains11,12. A key strength of LLMs lies in the knowledge they acquire through pre-training, enabling them to act as domain experts; recent LLMs have shown an extensive understanding of biomedical concepts13,14,15. Furthermore, LLMs can be greatly enhanced with well-designed prompting strategies16,17,18. Chain-of-thought (CoT) prompting16 improves the reasoning of LLMs by encouraging step-by-step problem solving. Self-consistency17 addresses the naive greedy decoding used in CoT prompting by selecting the most consistent outcome across multiple reasoning paths.
Several studies have proposed to employ such capabilities of LLMs for feature preprocessing. Among them, Choi et al.19 adopts the notion of prior knowledge in LLMs to conduct feature selection and causal discovery. Jeong et al.20 examines three types of selection strategies with LLMs, e.g. ranking versus scoring features. Hollmann et al.21 takes an agentic approach, using the LLM to generate Python code that creates features in an iterative fashion based on cross-validation feedback. Han et al.22 employs LLMs to perform feature engineering, generating conditional rules for each class label (e.g. age > 21 increases the logits for label 0) and repeating this several times to form an ensemble.
Given the promising performance of LLMs, there has also been growing exploration of their use in biomedical applications, especially genetics23,15,24,25. Despite these innovations, most genetic studies have focused on gene-level data, likely due to the limitations of earlier LLMs. To our knowledge, no prior work has explored LLMs on variant-level data, except one that used API calls to NCBI databases to retrieve SNP information26. While this approach compensates for LLMs’ inability to pass the GeneTuring benchmark27, we argue the benchmark is an inadequate test of their utility, as it evaluates gene-SNP association by randomly sampling 100 SNPs from hundreds of millions. This setup overlooks how well LLMs leverage their knowledge of known variants (Fig. 5), a gap we address in this study.
Figure 5:
Comparison between different LLM models on their knowledge of the SNP rs671 relating to genomic ancestry. Red text indicates a hallucination, which was only observed in the case of the Llama 2 7B model.
Inspired by these recent explorations20,22,23,28, we propose to leverage LLMs’ knowledge, both intrinsic and augmented, and their reasoning capabilities16,29 to select the most informative genetic variants and generate novel features. We develop a knowledge-driven framework, FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), grounded in the principles of ensembling (self-consistency)17 and “free-flow reasoning”30 to best leverage the expertise of LLMs. In this LLM-enabled framework, we implement scalable feature selection strategies that can process a large number of variants and feature engineering approaches that focus on interaction terms, which are more interpretable. We evaluate FREEFORM on two real genotype datasets, genetic ancestry and hereditary hearing loss, and compare with data-driven and LLM-enabled methods. In particular, we focus on few-shot (i.e. few data samples) settings where data-driven methods struggle due to limited sample size but LLMs have shown surprising generalizability31,32. We expect challenges when applying LLMs to tabular genotype data due to their limited semantics: column names are variant IDs and values indicate the number of minor allele copies (0, 1, 2). Thus, we analyze the effect of retrieval augmentation and domain-specific serialization when prompting.
Our results highlight FREEFORM’s potential to address three challenges present in modeling genotype data: our method (1) enhances prediction while upholding interpretability, (2) reduces the dimensionality of the dataset, and (3) excels in low-shot regimes compared to data-driven methods by grounding feature selection and engineering in knowledge rather than in limited data. Furthermore, we challenge the notion that LLMs lack knowledge of genetic variants by applying them to variant-level data for the first time.
Methods
In this section, we introduce our FREEFORM framework (Fig. 1), designed to address the challenges of training models on genotype data. The framework is built on two components: (1) leveraging the knowledge of LLMs to select a set of features and (2) leveraging the knowledge of LLMs to engineer new features from the selected features.
Figure 1:
Overview of the FREEFORM framework. The pipeline consists of two parts: (1) LLM-driven feature selection takes d variants and selects d’ of them (2) Given the selected features, we use LLMs to generate sets of engineered features to create an ensemble of classifiers.
Feature Selection
The genotype dataset can be formalized as D_S = {(x_i, y_i)}_{i=1}^N, i.e. a table comprising N labeled samples (i.e. rows), each with d variants (i.e. columns). Each entry x_ij ∈ {0, 1, 2} represents the number of minor allele copies, while the corresponding label y_i denotes the phenotype, e.g. y_i ∈ {“African”, “American”, “East Asian”, “European”, “South Asian”}. The column names, denoted by S = {s_1, …, s_d}, are text strings representing variants, such as rsIDs.
The goal of feature selection is to identify a subset S′ ⊆ S with d′ ≪ d variants such that a downstream model f, trained on D_S′, can make efficient, interpretable predictions. We use an LLM, modelled as a stochastic operator T parameterized by θ and given a prompt p, to output a selected subset of features S′:

S′ = T_θ(p, S)    (1)
Notably, the proposed selection method is model-agnostic, unlike model-based feature rankings. Furthermore, previous LLM-based feature selection methods have not been tested on variant-level data, where hallucination can be more likely33 and high dimensionality is an issue. While context windows have become longer, new challenges have emerged34,35 in which naive usage of the context would be ineffective; to address these challenges, we design our feature selection strategies to scale to high dimensions while remaining token-efficient. In the downstream analysis, we train two models, Random Forest and Logistic Regression, on the selected features to evaluate their quality.
Relevance Filtering We first ask the LLM to determine whether each of the d variants is relevant to the task, requesting a simple “Yes” or “No”, yielding a subset of variants. This set may still be large, so we adjust the language of the prompt appropriately (e.g. “potentially relevant”) based on how many are filtered.
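To make this step concrete, below is a minimal sketch of how the relevance filter could be scripted. The prompt wording and the ask_llm wrapper are illustrative placeholders, not the exact prompts used in our pipeline (those are in the GitHub repository).

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_llm(prompt: str, temperature: float = 0.0, model: str = "gpt-4o") -> str:
    """One chat-completion call; returns the raw text of the reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def relevance_filter(variants, task="predicting genetic ancestry from genotype data"):
    """Keep only the variants the LLM marks as (potentially) relevant to the task."""
    kept = []
    for rsid in variants:
        prompt = (
            f"Is the variant {rsid} potentially relevant to {task}? "
            "Answer with a single word: Yes or No."
        )
        if ask_llm(prompt).strip().lower().startswith("yes"):
            kept.append(rsid)
    return kept
```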
Self-Consistent Hierarchical Selection is the first strategy we employ for selecting d’ features from the filtered set of variants. We begin by randomly partitioning the variants into buckets of approximately 50 to 100 variants (a hyperparameter) to prevent the loss of information when contexts become too large 34. Each bucket is independently passed to the LLM, tasked with selecting the d’ most relevant variants; we select d’ at every step in the case that the relevant features are concentrated in a single bucket. The d’ selected variants from each bucket are merged together, and the process repeats as delineated in Fig. 1. We observe that the selection process is sensitive to the order in which features are presented to the LLM. Thus, for each bucket, we conduct multiple iterations, randomizing the order of variants and using a temperature of 0.3. This approach naturally integrates self-consistency 17 by retaining the top d’ variants that appear the most across iterations. During the final selection, we enhance the LLM with chain-of-thought (CoT) prompting, increase the temperature to 0.7, and increase the number of iterations to ten.
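A simplified sketch of this strategy is shown below, reusing the same hypothetical ask_llm chat-completion wrapper (repeated here so the snippet is self-contained). Bucket size, vote counts, and prompt text are illustrative, and the chain-of-thought portion of the final prompt is omitted for brevity.

```python
import random
from collections import Counter
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt, temperature=0.3, model="gpt-4o"):
    return client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}],
        temperature=temperature).choices[0].message.content

def select_from_bucket(bucket, d_prime, task, n_iter=3, temperature=0.3):
    """Self-consistency within a bucket: several shuffled passes, keep the d' most frequent picks."""
    votes = Counter()
    for _ in range(n_iter):
        shuffled = random.sample(bucket, len(bucket))  # randomize presentation order
        prompt = (
            f"Task: {task}. From the variants below, list the {d_prime} most relevant "
            "ones, one rsID per line.\n" + "\n".join(shuffled)
        )
        picks = [ln.strip() for ln in ask_llm(prompt, temperature).splitlines()
                 if ln.strip() in set(bucket)]
        votes.update(picks[:d_prime])
    return [v for v, _ in votes.most_common(d_prime)]

def hierarchical_selection(variants, d_prime, task, bucket_size=75):
    """Partition into buckets, select d' per bucket, merge, and repeat until one bucket remains."""
    pool = list(variants)
    while len(pool) > bucket_size:
        random.shuffle(pool)
        buckets = [pool[i:i + bucket_size] for i in range(0, len(pool), bucket_size)]
        pool = [v for b in buckets for v in select_from_bucket(b, d_prime, task)]
    # final round: more iterations and a higher temperature (CoT instructions omitted here)
    return select_from_bucket(pool, d_prime, task, n_iter=10, temperature=0.7)
```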
Self-Consistent Sequential Forward Selection is the second strategy we employ. Starting with the filtered set of variants, we task the LLM with identifying the single most relevant variant with CoT. After each selection, the chosen variant is removed and the process repeats. Initially, we perform this extraction without any ensembling; the top few features are easy for the LLM to identify, but the task becomes more challenging as it becomes ambiguous which variants are more significant. After selecting a few, we start to apply self-consistency: repeating the extraction several times, in increasing amounts as we near the end of the selection. We find that this also mitigates the LLM’s tendency to return the features that were already selected. In cases where the LLM still fails to identify a valid variant, which occurs frequently towards the end, we implement an exception handling mechanism that retries the selection process.
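The sketch below illustrates the forward-selection loop under the same assumptions (illustrative prompts, hypothetical ask_llm wrapper); the schedule for how many self-consistency samples to draw at each step is a simplification of what we actually use.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt, temperature=0.7, model="gpt-4o"):
    return client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}],
        temperature=temperature).choices[0].message.content

def forward_selection(variants, d_prime, task, max_retries=3):
    """Pick one variant at a time, ensembling more heavily as the choice becomes ambiguous."""
    remaining, selected = list(variants), []
    for step in range(d_prime):
        n_votes = 1 if step < 5 else 1 + step  # increase self-consistency toward the end
        votes = Counter()
        for _ in range(n_votes):
            prompt = (
                f"Task: {task}. Think step by step, then name the single most relevant "
                "variant from this list:\n" + "\n".join(remaining)
                + "\nEnd your answer with 'Final answer: <variant ID>'."
            )
            for _ in range(max_retries):  # retry when the reply is not a valid, unselected variant
                pick = ask_llm(prompt).rsplit("Final answer:", 1)[-1].strip().rstrip(".")
                if pick in remaining:
                    votes[pick] += 1
                    break
        if not votes:  # exception handling: no valid variant returned after all retries
            break
        best = votes.most_common(1)[0][0]
        selected.append(best)
        remaining.remove(best)
    return selected
```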
Feature Engineering
Given the selected features S’, our goal is to engineer new features that capture meaningful relationships the model might not identify on its own (e.g., household density = family size ÷ number of rooms). We automate this traditionally manual, expert-driven process by leveraging the knowledge and reasoning capabilities of LLMs. As outlined in Fig. 1, we repeat this several times and train a model on each feature set, forming an ensemble.
Formally, we transform our dataset D_S′ into K transformed datasets {D_k}_{k=1}^K, where D_k is the dataset over the augmented feature set S_k defined below. Each transformed dataset is created by prompting the LLM T with a prompt p′, which includes a serialized representation of a set of selected examples R ⊆ D_S′ to enhance the LLM with context31. We define a function Serialize to convert each row into a textual description (e.g. “The s_1 variant of the person has … minor alleles. … The s_d′ variant of the person has … minor alleles.”), as LLMs generally handle natural language better than raw tables. Since LLMs can be sensitive to the input, we examine various serialization templates and prompts. To address gaps in the LLM’s prior knowledge, we also explore retrieval augmentation to supplement the variant IDs with their associated genes36.
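Below is a minimal sketch of such a Serialize function; the template text mirrors the example above, and the column and label names are illustrative.

```python
import pandas as pd

def serialize_row(row: pd.Series, variant_cols, label_col=None) -> str:
    """Convert one genotype row into the natural-language form pasted into the prompt."""
    parts = [f"The {v} variant of the person has {int(row[v])} minor alleles."
             for v in variant_cols]
    if label_col is not None:
        parts.append(f"Answer: {row[label_col]}")
    return " ".join(parts)

def serialize_examples(df: pd.DataFrame, variant_cols, label_col) -> str:
    """Serialize the in-context examples R included in the prompt p'."""
    return "\n".join(serialize_row(r, variant_cols, label_col) for _, r in df.iterrows())

# Toy usage:
# df = pd.DataFrame({"rs671": [0, 2], "rs2814778": [1, 0],
#                    "ancestry": ["East Asian", "African"]})
# print(serialize_examples(df, ["rs671", "rs2814778"], "ancestry"))
```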
The LLM T then outputs new features that are added back onto S′ to create S_k:

S_k = S′ ∪ T_θ(p′)    (2)
We then train K models, f_1, …, f_K, each on a dataset D_k, aiming to capture different hypotheses about how the variants relate to the phenotype. During inference, for an input x, we average the class probabilities p(f_k(x)) from each model f_k. The final prediction is made by selecting the class i with the highest averaged probability:

ŷ = argmax_i (1/K) Σ_{k=1}^K p_i(f_k(x))    (3)

where p_i(f_k(x)) is the probability that model f_k assigns to class i.
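A minimal sketch of this averaging step, assuming scikit-learn classifiers whose classes_ attributes are aligned; datasets_for_x holds the same test rows mapped into each D_k's feature space.

```python
import numpy as np

def ensemble_predict(models, datasets_for_x):
    """Average class probabilities across the K classifiers and take the argmax (Eq. 3)."""
    avg_probs = np.mean(
        [m.predict_proba(X_k) for m, X_k in zip(models, datasets_for_x)], axis=0
    )
    return models[0].classes_[np.argmax(avg_probs, axis=1)], avg_probs
```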
Our aim is to harness the diverse feature representations generated by the LLM, reducing the risk of overfitting in low-shot settings by anchoring each feature set in the knowledge embedded within LLMs rather than in the limited data. To uphold interpretability, we limit our feature construction to interaction terms. Notably, our method is model-agnostic, decoupled from the classifier. We train two models, Random Forest and Logistic Regression, on the transformed datasets to evaluate the quality of the constructed features. We discuss the key steps of our method in depth below.
Automating Feature Engineering When asking the LLM to engineer features, we provide a comprehensive prompt p′ that includes the following components (a sketch of how they can be assembled is shown after the list):
Instructions: Directions to use the provided features to engineer new features relevant to the task.
Task Description: A concise description of the specific task for which the features are being engineered.
Features: A list of features including the name of each genetic variant.
Examples: |R| examples that illustrate the data in a serialized format.
Detailed Instruction: List of specific choices for feature engineering, such as multiplying or adding features, accompanied by a task-specific demonstration as seen in Fig 2.
Step-by-Step Solution: Directions for the LLM to solve the problem step by step.
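The sketch below shows one way these components might be assembled into p′; the wording is illustrative, and the exact prompts are available in our repository.

```python
def build_engineering_prompt(task_description, variant_names, serialized_examples,
                             detailed_instructions, example_feature):
    """Assemble the feature-engineering prompt p' from the components listed above."""
    return "\n\n".join([
        # Instructions
        "You are a genetics expert. Using the variants below, engineer new features "
        "that are relevant to the task.",
        # Task description
        f"Task: {task_description}",
        # Features
        "Variants: " + ", ".join(variant_names),
        # Serialized examples (|R| rows)
        "Examples:\n" + serialized_examples,
        # Detailed instructions (allowed operations plus a task-specific demonstration)
        detailed_instructions,
        f"Here is one correctly engineered feature as a guide: {example_feature}",
        # Step-by-step solution
        "Reason step by step, then list the engineered features you propose.",
    ])
```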
Figure 2:
Example of Detailed Instructions
Free-Flow Reasoning LLM output is typically constrained to a structured format, e.g. JSON. Here, we deliberately allow the LLM to generate its response freely, recognizing that enforcing a rigid structure can diminish the depth and quality of the LLM’s CoT reasoning30. To guide the LLM, we provide an example of a “correctly” engineered feature37, which we self-generate with an LLM and then manually verify.
Self-Parsing and Function Writing via LLMs The unstructured output generated by the LLM can be challenging to parse. To address this, we employ the LLM itself to extract the engineered features from its own output, listing them line by line for easy parsing. Subsequently, we task the LLM with writing an executable Python function that generates the new columns of dataset D_k based on the extracted features. We implement error handling that catches any execution issues and feeds them back to the LLM, which rewrites the function.
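A sketch of the function-writing and error-feedback loop is shown below, using exec to run LLM-written code; the prompt text, helper names, and retry budget are illustrative, and in practice the generated code should be sandboxed.

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt, model="gpt-3.5-turbo", temperature=0.0):
    return client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}],
        temperature=temperature).choices[0].message.content

def materialize_features(feature_list: str, df: pd.DataFrame, max_retries: int = 3) -> pd.DataFrame:
    """Ask the LLM for a Python function that adds the engineered columns, execute it,
    and on failure feed the error message back so the LLM can rewrite the function."""
    prompt = (
        "Write a Python function add_features(df) that takes a pandas DataFrame with the "
        "original variant columns and returns it with these new columns added:\n"
        + feature_list + "\nReturn only the code, no markdown fences."
    )
    for _ in range(max_retries):
        code = ask_llm(prompt)
        try:
            namespace = {}
            exec(code, {"pd": pd}, namespace)        # defines add_features
            return namespace["add_features"](df.copy())
        except Exception as err:                     # error handling with feedback
            prompt += f"\nYour previous code raised: {err!r}. Please fix it and resend."
    raise RuntimeError("The LLM did not produce a working feature function.")
```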
Ensembling To mitigate overfitting, we repeat the feature engineering process K times to generate the datasets D_k used to train K classifiers. Similar to FEATLLM22, we incorporate additional ensembling and order-bias mitigation strategies such as bagging and order shuffling, where we pass a random subset of |R| ≤ N samples to the LLM; we limit the number of samples |R| to 16. This further diversifies the LLM output across iterations while avoiding exorbitant usage of the context window. Our free-flow reasoning approach further contributes to the diversity of the output, as we allow the LLM to determine, during generation, how many features to construct. By using a temperature of 1, we ensure that each iteration produces a varied set of features, especially in type. The resulting K transformed datasets form the basis for the ensembled model.
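The outer ensembling loop can be sketched as follows; here engineer_fn is a stand-in for the LLM-driven steps above (prompt assembly, free-flow generation, self-parsing, and function writing) and is assumed to return a column-adding transformation, while the classifier choice is one of the two we evaluate.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def train_ensemble(train_df: pd.DataFrame, label_col: str, engineer_fn,
                   K: int = 20, max_examples: int = 16, seed: int = 0):
    """For each of K iterations: bag a shuffled subsample of at most |R| = 16 rows,
    let the LLM engineer features from it, apply the resulting transformation to the
    full training set, and fit one classifier."""
    models, transforms = [], []
    for k in range(K):
        n = min(max_examples, len(train_df))
        subsample = train_df.sample(n=n, random_state=seed + k)  # bagging + order shuffling
        transform = engineer_fn(subsample)        # returns a DataFrame -> DataFrame mapping
        D_k = transform(train_df.drop(columns=[label_col]))
        clf = LogisticRegression(max_iter=1000).fit(D_k, train_df[label_col])
        models.append(clf)
        transforms.append(transform)
    return models, transforms
```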
Due to the page limit, we direct readers to our GitHub repo, which contains the detailed prompts and relevant hyper-parameters used in the entire pipeline. The source code of FREEFORM is provided to promote reproducibility.
Results
Experiment Setup
Datasets Our experiments involve two datasets: the Genomic Ancestry Dataset38 and the Hereditary Hearing Loss Dataset39. The Genomic Ancestry Dataset is derived from the 1000 Genomes Project (1KGP). We focus on determining the superpopulation ancestry phenotype (African, American, East Asian, European, and South Asian). In particular, we used a curated set of 10,000 SNPs predefined by GRAF to pinpoint ancestry markers40; we discuss below how we addressed issues arising from its quality control (QC). After QC and preprocessing, the dataset includes 2,403 subjects and 8,688 variants as columns in the rsID format, a standard identifier used by dbSNP.
The Hereditary Hearing Loss Dataset is considerably smaller, comprising 1,209 subjects and 144 variants as columns, which use the HGVS nomenclature. This dataset is notably imbalanced, with approximately 75.9% of the samples classified as “Yes” (indicating the presence of hereditary hearing loss) and 24.1% as “No.” To the best of our knowledge, these are the only two open-access genotype datasets available online.
Baselines For feature selection, we conduct comparisons with four baseline methods. The first three are conventional machine learning approaches: (1) LASSO, (2) PCA, and (3) RF-based Gini importance, where we fit a Random Forest on the training data and rank the features by their Gini importance. Our study requires a fixed number of selected features, whereas LASSO shrinks an arbitrary number of coefficients to zero during training and PCA only provides loadings along principal axes (a tool widely used in genetics to select variants). To adapt these methods, we select the d′ features with the largest coefficients in the highest-performing LASSO model and, for PCA, the d′ features with the largest loadings on the first principal component. Lastly, we include (4) LLM-SELECT20 (using their LLM-RANK prompts).
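For clarity, the following sketch shows one way to extract a fixed number of features from these baselines; here LASSO is approximated with an L1-penalized logistic regression tuned by cross-validation, and the exact estimators and grids used in our experiments may differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV

def top_features_lasso(X, y, feature_names, d_prime):
    """d' features with the largest absolute coefficients of an L1-penalized model."""
    model = LogisticRegressionCV(penalty="l1", solver="saga", max_iter=5000).fit(X, y)
    scores = np.abs(model.coef_).max(axis=0)   # strongest effect across classes
    return [feature_names[i] for i in np.argsort(scores)[::-1][:d_prime]]

def top_features_pca(X, feature_names, d_prime):
    """d' features with the largest absolute loadings on the first principal component."""
    pca = PCA(n_components=1).fit(X)
    loadings = np.abs(pca.components_[0])
    return [feature_names[i] for i in np.argsort(loadings)[::-1][:d_prime]]
```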
For feature engineering, we compare our approach against five baselines. The first three are traditional machine learning methods: (1) Logistic Regression, (2) Random Forest, and (3) XGBoost. We also include recent baselines: (4) TabPFN41, a foundation model for tabular data, and (5) FeatLLM22, which also leverages LLMs for feature engineering but limits itself to conditional rules (e.g. variant 1 > 0) feeding a linear model.
Implementation Details Our FREEFORM framework utilizes GPT-4o (2024-05-13) as the primary LLM backbone, particularly for tasks requiring advanced reasoning capabilities, such as automating feature engineering and selecting relevant features. For more routine tasks, including parsing output and writing Python functions, we employ GPT-3.5-turbo, which offers a cost-effective solution that meets the performance requirements for these functions. These models are accessed through the OpenAI API, which requires only internet access.
For feature engineering, we employ an ensemble of K = 20 models, striking a balance between cost-effectiveness and model performance, noting that performance gains diminish beyond this point. In replicating baselines, machine learning models were implemented using Python’s scikit-learn library. Hyperparameters were optimized using grid search and k-fold cross-validation, with k set to either 2 or 4, ensuring that the training set includes at least one example of each class. For other methods, such as FEATLLM and TABPFN, we used default parameters with slight adjustments for fair comparison (e.g., using 15 conditions instead of 10 for FEATLLM). Also, for the evaluation of all feature engineering methods, we choose one of the feature sets generated by hierarchical selection.
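As a concrete example of this tuning setup, here is a hedged sketch of a grid search for the Random Forest baseline; the hyperparameter grid shown is illustrative, not the exact one from our experiments.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_random_forest(X, y, n_splits=4):
    """Grid search with k-fold CV (k = 2 or 4 depending on the shot count)."""
    grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
    search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                          cv=n_splits, scoring="roc_auc_ovr")
    return search.fit(X, y).best_estimator_
```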
Main Results
In Fig 3, we compare the performance of our feature selection methods against baselines using Logistic Regression and Random Forest as downstream classifiers. For feature selection, we repeat our experiments five times with cross-validation, limiting the evaluation to few-shot settings where N < 320 for ancestry and N < 128 for hearing loss. For data-driven methods, we use the training data (varying the size of N) to perform feature selection and train the classifier on the same training data. We also emphasize that, unlike the data-driven methods, the LLM-driven methods perform feature selection without relying on any data samples, leveraging only the model’s prior knowledge. Their usage of the training data is limited to training the downstream classifier.
Figure 3:
Evaluation of Feature Selection on Ancestry and Hearing Loss
Our findings indicate that LLM-driven methods significantly outperform data-driven approaches for feature selection in low-shot regimes, achieving gains of up to approximately 20%. Notably, in the genomic ancestry task, LASSO requires 80 shots to achieve similar results to what our framework achieves with just 10 shots. In hearing loss, we observe that the performance gap is smaller and our advantage remains until 16 shots, when using Random Forests. This discrepancy is likely due to the limited presence of variants within the dataset that the LLM has knowledge of.
For feature engineering, we conduct our experiments five times with cross-validation, limiting the evaluation to few-shot settings where N < 120. In Fig 4, FREEFORM consistently ranks at or above baseline models. In the genomic ancestry task, our framework improves the performance of both Logistic Regression and Random Forest, especially in the lower-shot scenarios, and outperforms recent models such as FEATLLM and TABPFN. However, as the number of shots approaches 80, the gap between our methods and the baselines decreases. In the hearing loss task, our framework notably enhances the performance of Logistic Regression in higher-shot scenarios. While Random Forest does not benefit from the engineered features, our approach remains competitive. This outcome suggests limited effectiveness of the interaction terms we engineer for complex models like Random Forests.
Figure 4:
Evaluation of Feature Engineering on Ancestry and Hearing Loss
Ablations and Analysis
Open Source Models In Table 1, we examine the performance of various LLMs in our framework, repeating the experiment five times with cross-validation. For feature selection, GPT-4o performs better as expected, except on hearing loss, further suggesting limited knowledge of the relevant variants across the LLMs. It remains unclear whether this advantage arises from the LLMs’ ability to identify relevant features or their depth of knowledge (e.g. while an LLM may recognize a variant, its expertise regarding that variant may vary). For feature engineering, the weaker models are highly competitive. This is surprising and may indicate that the interaction terms we enable (e.g. conditional rules, multiplicative expressions) are within the reasoning capabilities of all studied models. Alternatively, it could reflect all the models’ inability to capture epistasis. Either way, these results align with the shared awareness of variants among models (see Figure 5), suggesting that even basic knowledge can enable LLMs to act as weak learners.
Table 1:
FREEFORM using different models. For feature selection, hierarchical selection is used across all models. We limit the analysis to 16-shots (Hearing Loss) and 20-shots (Ancestry). LR: Logistic Regression. RF: Random Forest. Values shown are AUROC (AUC). Standard deviation is in parenthesis.
| Model | Feature Selection: Hearing (LR / RF) | Feature Selection: Ancestry (LR / RF) | Feature Engineering: Hearing (LR / RF) | Feature Engineering: Ancestry (LR / RF) |
| GPT-3.5-turbo | 0.524 (0.03) / 0.539 (0.04) | 0.785 (0.03) / 0.765 (0.04) | 0.506 (0.10) / 0.540 (0.11) | 0.955 (0.01) / 0.951 (0.01) |
| Llama-3.1-405B | — / — | — / — | 0.511 (0.11) / 0.544 (0.11) | 0.956 (0.02) / 0.952 (0.01) |
| GPT-4o | 0.506 (0.09) / 0.544 (0.10) | 0.943 (0.01) / 0.943 (0.02) | 0.514 (0.11) / 0.543 (0.11) | 0.957 (0.01) / 0.953 (0.01) |
Augmentation and Serialization In our main results, FREEFORM uses a simple serialization strategy of the form “s_1 is x_i1. s_2 is x_i2. ... Answer: y_i”. In this ablation study (Table 2), we find that a more elaborate, genotype-specific schema such as “The s_1 variant of the person has x_i1 minor alleles ...” does not make a difference. While this may be surprising given the existing efforts on exploring serialization strategies42, large foundation models may be robust to such formatting. Furthermore, our findings show that augmenting the prompt with gene information does not yield significant gains (Table 2). Our experimentation is limited to providing the genes associated with each variant. This may be information the LLM already knows, but other strategies are not straightforward; augmenting with literature for a variant is challenging because the retrieved text may have little relevance to the task. This will be an important avenue for future work.
Feature Nomination The genomic ancestry dataset we used is a curated version of the full dataset, obtained after rigorous quality control. While this ensures data reliability, it can lead to the omission of significant genetic variants. To address this concern, we asked GPT-4o to suggest fifteen SNPs; SNPs with clear causal relationships or status as standard ancestry-informative markers (AIMs), such as rs671 or rs2814778, were usually suggested. These SNPs exist in the original database but were omitted in the curation of the 10K version, so we inserted them back. Our LLM-driven selection methods were able to recover most of these variants while the data-driven methods could not, and we found that these variants contributed largely to the performance gap we observed for feature selection in the genomic ancestry dataset. We find it concerning that data-driven methods failed to identify many of these variants despite their predictive power. In an additional analysis, training a Logistic Regression model with the GPT-4o-suggested SNPs alone achieved an average of 0.94 AUC at 20 shots, whereas a set of fifteen SNPs selected by PCA on the original dataset achieved 0.78 AUC. LLMs could thus be a promising way to mitigate issues of quality control, a frequent concern in this domain, providing a potentially more robust, automatable alternative to manually reinstating well-known variants from the literature.
Discussion
We present FREEFORM, which advances the state of the art in LLM-based, few-shot tabular learning, and we apply our LLM-driven framework to genotype data for the first time. FREEFORM goes beyond the typical usage of LLMs for inference and aids the processes of feature selection and engineering, tackling the issues of high dimensionality, limited data samples, and interpretability. Furthermore, we find that LLMs have a robust knowledge of genetic variants, demonstrating state-of-the-art performance across different variant ID schemas and showcasing the promise of LLMs in genetics.
Our framework has several key advantages over existing approaches besides performance: (1) it is model-agnostic, (2) it scales well to higher dimensions, and (3) it incurs no inference costs, as features are engineered once during training, unlike LLM-only methods with high computational costs at inference time. Moreover, the entire pipeline can be executed using API access, costing approximately one dollar to run, with the majority of the pipeline completing within minutes, aside from the initial filtering step in feature selection.
However, FREEFORM has room for growth. The current results suggest that either the range of variant interaction types we allow in feature engineering is too simple, or none of the studied models is capable of capturing epistasis. Our retrieval augmentation is also limited to gene information. Additionally, the LLM’s input could be enhanced by including high-level feature statistics typically considered in this domain. Future work remains in expanding its capabilities to generate novel features, improving the extraction of knowledge intrinsic to LLMs, better augmenting task-specific knowledge from APIs such as PubMed, and integrating further interpretability or explainability. Recent studies have demonstrated the potential of LLMs in the discovery of new gene sets or causal genes24,25, where LLMs effectively interpolate across the vast corpus of scientific literature they are trained on. We find this approach promising for addressing the issue of multicollinearity, which we acknowledged but did not resolve in this work. Furthermore, we see promising future work in developing feature nomination, where we used the LLM to suggest predictive features. The controllability of LLMs through prompting opens up interesting possibilities, such as the nomination of features that better represent diverse populations, thereby mitigating biases that data-driven methods can exacerbate. A limitation of our study is that several assumptions were made in the evaluation, such as the selection of fifteen features, to showcase our pipeline. Thus, we plan to expand our evaluation to more scenarios and more phenotypes, such as Alzheimer’s disease, to demonstrate its robust utility.
As LLMs advance in domain expertise, potentially surpassing humans43, their potential to revolutionize bioinformatics becomes increasingly evident. While our study demonstrates LLMs’ excellence in low-shot regimes, we acknowledge such scenarios are rare in practice. We anticipate, however, that these capabilities will scale as foundation models advance and domain-specific LLMs develop. Recent efforts, such as fine-tuning foundation models on literature43 or augmenting them with knowledge graphs44, are making progress towards this. While our framework focuses on feature selection and engineering, our work serves as a prototype, showcasing the potential of LLMs in genetics.
Acknowledgments This work was supported in part by the NIH grants U01 AG066833, U01 AG068057, R01 AG071470, U19 AG074879, and S10 OD023495.
Figures & Table
Table 2:
FREEFORM using genotype-specific strategies for feature engineering on Hearing Loss (16-shot) and Ancestry (20-shot). LR: Logistic Regression. RF: Random Forest. Values are AUC with standard deviation in parentheses.
| Configuration | LR AUC (Std) | RF AUC (Std) |
| Hearing Loss | | |
| FreeForm: Feature Engineering | 0.5145 (0.11) | 0.5438 (0.11) |
| + Genotype Serialization | 0.5127 (0.11) | 0.5384 (0.11) |
| + Gene Augmentation | 0.5110 (0.09) | 0.5387 (0.11) |
| Ancestry | | |
| FreeForm: Feature Engineering | 0.9572 (0.01) | 0.9527 (0.01) |
| + Genotype Serialization | 0.9568 (0.01) | 0.9532 (0.01) |
| + Gene Augmentation | 0.9571 (0.01) | 0.9530 (0.01) |
References
1. Torkamani A, Wineinger NE, Topol EJ. The personal and clinical utility of polygenic risk scores. Nature Reviews Genetics. 2018;19:581-90. doi: 10.1038/s41576-018-0018-x.
2. Guo T, Li X. Machine learning for predicting phenotype from genotype and environment. Current Opinion in Biotechnology. 2023;79:102853. doi: 10.1016/j.copbio.2022.102853.
3. Ma Y, Zhou X. Genetic prediction of complex traits with polygenic scores: a statistical review. Trends in Genetics. 2021;37(11):995-1011. doi: 10.1016/j.tig.2021.06.004.
4. Medvedev A, Mishra Sharma S, Tsatsorin E, Nabieva E, Yarotsky D. Human genotype-to-phenotype predictions: Boosting accuracy with nonlinear models. PLoS ONE. 2022;17(8):e0273293. doi: 10.1371/journal.pone.0273293.
5. Uffelmann E, Huang QQ, Munung NS, De Vries J, Okada Y, Martin AR, et al. Genome-wide association studies. Nature Reviews Methods Primers. 2021;1(1):59.
6. Krzywinski M, Altman N. Multiple linear regression: when multiple variables are associated with a response, the interpretation of a prediction equation is seldom simple. Nature Methods. 2015;12(12):1103-5. doi: 10.1038/nmeth.3665.
7. Lippert C, Listgarten J, Davidson RI, Baxter J, Poon H, Kadie CM, et al. An exhaustive epistatic SNP association analysis on expanded Wellcome Trust data. Scientific Reports. 2013;3(1):1099. doi: 10.1038/srep01099.
8. Altman N, Krzywinski M. The curse(s) of dimensionality. Nature Methods. 2018;15(6):399-400. doi: 10.1038/s41592-018-0019-x.
9. Pudjihartono N, Fadason T, Kempa-Liehr AW, O'Sullivan JM. A review of feature selection methods for machine learning-based disease risk prediction. Frontiers in Bioinformatics. 2022;2:927312. doi: 10.3389/fbinf.2022.927312.
10. Lou Y, Caruana R, Gehrke J, Hooker G. Accurate intelligible models with pairwise interactions. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013:623-31.
11. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023.
12. Chang Y, Wang X, Wang J, Wu Y, Yang L, Zhu K, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology. 2023.
13. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-80. doi: 10.1038/s41586-023-06291-2.
14. Nori H, Lee YT, Zhang S, Carignan D, Edgar R, Fusi N, et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452. 2023.
15. Hou W, Ji Z. Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. Nature Methods. 2024:1-4. doi: 10.1038/s41592-024-02235-4.
16. Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems. 2022;35:24824-37.
17. Wang X, Wei J, Schuurmans D, Le Q, Chi E, Narang S, et al. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171. 2022.
18. Yao S, Yu D, Zhao J, Shafran I, Griffiths T, Cao Y, et al. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems. 2024;36.
19. Choi K, Cundy C, Srivastava S, Ermon S. LMPriors: Pre-trained language models as task-specific priors. NeurIPS 2022 Foundation Models for Decision Making Workshop.
20. Jeong DP, Lipton ZC, Ravikumar P. LLM-Select: Feature selection with large language models. arXiv preprint arXiv:2407.02694. 2024.
21. Hollmann N, Müller S, Hutter F. Large language models for automated data science: Introducing CAAFE for context-aware automated feature engineering. Advances in Neural Information Processing Systems. 2024;36.
22. Han S, Yoon J, Arik SO, Pfister T. Large language models can automatically engineer features for few-shot tabular learning. Forty-first International Conference on Machine Learning. 2024.
23. Toufiq M, Rinchai D, Bettacchioli E, Kabeer BSA, Khan T, Subba B, et al. Harnessing large language models (LLMs) for candidate gene prioritization and selection. Journal of Translational Medicine. 2023;21(1):728. doi: 10.1186/s12967-023-04576-8.
24. Wang Z, Jin Q, Wei CH, Tian S, Lai PT, Zhu Q, et al. GeneAgent: Self-verification language agent for gene set knowledge discovery using domain databases. arXiv preprint arXiv:2405.16205. 2024.
25. Shringarpure SS, Wang W, Karagounis S, Wang X, Reisetter AC, Auton A, et al. Large language models identify causal genes in complex trait GWAS. medRxiv. 2024.
26. Jin Q, Yang Y, Chen Q, Lu Z. GeneGPT: Augmenting large language models with domain tools for improved access to biomedical information. Bioinformatics. 2024;40(2):btae075. doi: 10.1093/bioinformatics/btae075.
27. Hou W, Shang X, Ji Z. Benchmarking large language models for genomic knowledge with GeneTuring. bioRxiv. 2025. Available from: https://www.biorxiv.org/content/early/2025/01/05/2023.03.11.532238.
28. Li D, Tan Z, Liu H. Exploring large language models for feature selection: A data-centric perspective. arXiv preprint arXiv:2408.12025. 2024.
29. Huang J, Chang KCC. Towards reasoning in large language models: A survey. arXiv preprint arXiv:2212.10403. 2022.
30. Tam ZR, Wu CK, Tsai YL, Lin CY, Lee HY, Chen YN. Let me speak freely? A study on the impact of format restrictions on performance of large language models. arXiv preprint arXiv:2408.02442. 2024.
31. Brown TB, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165. 2020.
32. Agrawal M, Hegselmann S, Lang H, Kim Y, Sontag D. Large language models are few-shot clinical information extractors. arXiv preprint arXiv:2205.12689. 2022.
33. Kandpal N, Deng H, Roberts A, Wallace E, Raffel C. Large language models struggle to learn long-tail knowledge. International Conference on Machine Learning. PMLR; 2023:15696-707.
34. Liu NF, Lin K, Hewitt J, Paranjape A, Bevilacqua M, Petroni F, et al. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics. 2024;12:157-73.
35. Kortukov E, Rubinstein A, Nguyen E, Oh SJ. Studying large language model behaviors under realistic knowledge conflicts. arXiv preprint arXiv:2404.16032. 2024.
36. Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems. 2020;33:9459-74.
37. Tong Y, Li D, Wang S, Wang Y, Teng F, Shang J. Can LLMs learn from previous mistakes? Investigating LLMs' errors to boost for reasoning. arXiv preprint arXiv:2403.20046. 2024.
38. Fairley S, Lowy-Gallego E, Perry E, Flicek P. The International Genome Sample Resource (IGSR) collection of open human genomic variation resources. Nucleic Acids Research. 2020;48(D1):D941-7. doi: 10.1093/nar/gkz836.
39. Luo X, Li F, Xu W, Hong K, Yang T, Chen J, et al. Machine learning-based genetic diagnosis models for hereditary hearing loss by the GJB2, SLC26A4 and MT-RNR1 variants. EBioMedicine. 2021;69. doi: 10.1016/j.ebiom.2021.103322.
40. Moustafa A. Genetic Ancestry. 2023. https://github.com/ahmedmoustafa/genetic-ancestry [Accessed: May 6, 2024].
41. Hollmann N, Müller S, Eggensperger K, Hutter F. TabPFN: A transformer that solves small tabular classification problems in a second. arXiv preprint arXiv:2207.01848. 2022.
42. Hegselmann S, Buendia A, Lang H, Agrawal M, Jiang X, Sontag D. TabLLM: Few-shot classification of tabular data with large language models. International Conference on Artificial Intelligence and Statistics. PMLR; 2023:5549-81.
43. Luo X, Rechardt A, Sun G, Nejad KK, Yáñez F, Yilmaz B, et al. Large language models surpass human experts in predicting neuroscience results. Nature Human Behaviour. 2024:1-11. doi: 10.1038/s41562-024-02046-9.
44. Li D, Yang S, Tan Z, Baik JY, Yun S, Lee J, et al. DALK: Dynamic co-augmentation of LLMs and KG to answer Alzheimer's disease questions with scientific literature. arXiv preprint arXiv:2405.04819. 2024.