Author manuscript; available in PMC: 2025 Sep 10.
Published in final edited form as: IEEE Int Conf Omnilayer Intell Syst. 2025 Aug 22;2025:10.1109/coins65080.2025.11125786. doi: 10.1109/coins65080.2025.11125786

Detecting Oral Cancer Using Tabular Deep Learning

Zhiyun Xue 1, Zhaohui Liang 2, Sivaramakrishnan Rajaraman 3, Niccolo Marini 4, Sameer Antani 5
PMCID: PMC12418227  NIHMSID: NIHMS2104405  PMID: 40933553

Abstract

Oral cancer has one of the lowest five-year survival rates among major cancer types; early detection, confirmed histopathologically, is therefore crucial. State-of-the-art methods reported in the literature largely analyze only images for oral cancer prediction, while the use of deep learning networks on related tabular medical data remains unexplored for oral cancer and understudied in general. As part of our multimodal AI/ML approach toward reliable prediction of candidate lesions to biopsy, we describe our work applying deep learning to fielded clinical structured data in spreadsheet format (tabular data) for a subset of 1791 patients drawn from a large ongoing oral cancer study, with the goal of classifying patients with a cancerous lesion versus those with a precancerous lesion (i.e., a direct precursor to cancer). We compare two tabular deep learning methods and one conventional algorithm for this predictive analysis. The experimental results on a hold-out test set demonstrate promising performance for all models (Youden index > 0.6 and AUC > 0.9). In addition, we examine and analyze the interpretability of the models; all indicate that lesion characteristics are crucial predictive features. The insights and results obtained from this work should be valuable to the research community applying AI/ML to biomedicine.

Keywords: oral cancer screening, tabular data, deep learning, oral lesion characteristics

I. Introduction

Oral (cavity) cancer is one of the most common cancers in the world and has one of the lowest five-year survival rates among major cancer types; advanced-stage oral cancer is associated with a high mortality rate, especially in Asia and the Pacific region. Its worldwide incidence and mortality in 2022 were 389,846 and 188,438, respectively [1], with India and China leading the list of countries. In the United States, the estimated new cases and deaths from oral cancer for 2024 were 58,450 and 12,230 [2]. Therefore, early detection and effective prevention are crucial for reducing its morbidity and mortality rates. The risk factors for oral cancer include tobacco use, alcohol consumption, betel quid chewing, and human papillomavirus (HPV) infection [3]. Visual inspection of the mouth is usually the first step in identifying oral precancer and cancer and requires considerable experience and training; significant inter- and intra-observer variance has been reported [4] for this technique. Histopathological analysis of biopsied suspected lesions is therefore the diagnostic gold standard.

Prior work in oral cancer screening with automatic techniques has focused mainly on processing and analyzing oral images [5–13]. In this paper, we report findings from deep learning-based analysis of tabular, non-image clinical metadata for oral cancer detection. Such techniques are relatively understudied compared to those for image and text data, especially in the biomedical domain. Non-deep learning methods such as decision trees remain the dominant techniques for tabular data classification and analysis due to their competitive performance, simplicity, and interpretability [14, 15]. Hence, it is of novel interest to apply deep learning methods and examine their effectiveness on tabular clinical data, particularly since this aligns with our long-term research goal of multimodal AI/ML methods for oral cancer prediction. To our knowledge, this is the first effort that explores tabular deep learning techniques to predict oral cancer. Specifically, we investigate and evaluate two deep learning methods, alongside a decision tree-based method, for separating patients with oral cancer from those with oral precancer. In addition, we analyze which features in the tabular data are most important to the models and how they influence model performance. The insights and results obtained from this work should be valuable to the medical AI research community, not only for oral cancer screening but also for other relevant applications. In the following sections, we introduce the oral tabular dataset used in the experiments, describe the methods for classifying these data into two categories (precancer and cancer), and present and discuss the experimental results before concluding.

II. Tabular Data For Oral Cancer Analysis

The U.S. National Cancer Institute (NCI) at the National Institutes of Health (NIH), in collaboration with several hospitals in Taiwan, has conducted a multicenter longitudinal study to understand the natural history of oral cancer, identify epidemiologic factors and molecular markers associated with disease progression, and improve early and timely detection of high-grade lesions. Since NIH does not have direct participant contact, per the Common Rule, it is exempted from Institutional Review Board (IRB) review. IRBs at the respective sites in Taiwan have obtained the necessary approvals for the study protocol. The study participants were informed of the study and consented to participation and to subsequent use of the data for research purposes. In the study, participants with clinical precursors for oral cancer were followed up bi-annually for five years, while controls and invasive cancer cases had only one visit. The demographic, behavioral, and medical information of each participant was collected. Biopsies as well as other biospecimens, including exfoliated cells, oral rinse, saliva, and venous blood, were also collected at multiple time points. In addition, multi-field images of the oral cavity were taken at each visit using a specialized oral camera. The field medical staff also recorded lesion characteristics if a lesion was observed.

The tabular data used in this work is a subset of data obtained as part of this study. The data includes information extracted from the study’s clinical database on 1791 patients with only one lesion (1296 precancers and 495 cancers). It contains 21 columns of information, shown in Table I, comprising the deidentified participant ID, the biopsy result, 2 columns of demographic information (age and gender), 12 columns related to risk factors (e.g., smoking/drinking alcohol/chewing betel nuts, with their duration, intensity, and amounts), and 5 columns related to descriptive lesion characteristics: lesion texture, lesion size in centimeters, lesion border, lesion appearance, and lesion site. For the tabular data classification, we exclude the participant ID column, use the biopsy result column as the ground truth (precancer vs. cancer), and use the remaining 19 columns as features. Among these, 11 features are categorical and 8 are numerical. Their values, ranges, and units vary significantly. For example, the lesion texture is described as either “smooth”, “corrugated”, or “exophytic”; the lesion appearance as “homogeneous” or “mixed/non-homogeneous”; the consumed alcohol amount per week is between 0 and 1827 ml; and the betel nut chewing duration is between 0 and 58 years. In addition, 17 of the 19 features have empty or missing values in some rows.

TABLE I.

Columns In The Oral Tabular Data

Category                    Columns
ID                          participant ID
Ground truth                biopsy result
Demographics (2)            age; gender
Risk factors (12)           ever/current smoking; ever/current drinking; ever/current chewing betel nuts; duration/intensity/amount of smoking; duration/intensity/amount of drinking; duration/intensity/amount of chewing betel nuts
Lesion characteristics (5)  lesion texture; lesion appearance; lesion border; lesion size; lesion site

III. Methods

We analyzed existing efforts using tabular data for predictive analysis toward our goal of separating patients with precancers from those with cancer. Only a handful of deep neural networks [14–22] have attained promising results on public tabular datasets (in the general domain), and classical methods can outperform or be on par with deep learning [14, 15]. These findings provided the basis for examining both tabular deep learning methods and high-performing classical methods to classify and analyze our oral cancer screening tabular data.

A. Data Preprocessing

To make our tabular data suitable for machine learning algorithms, we first preprocess and transform them in two steps: empty/missing data imputation and categorical data encoding. In our dataset, empty values exist in both numerical and categorical features. Some empty values are due to skip patterns, while others are genuinely missing. For example, never-smokers need not be asked if they are currently smoking, resulting in an empty cell for that field; here, we replace the empty values in the current-smoking column with “No”. The same logic is applied to empty cells resulting from the skip patterns in the chewing and drinking columns. For never-users of tobacco, betel nut, and alcohol, we set all the corresponding duration and intensity variables to 0. To impute missing values in numerical features that are not due to skip patterns, after partitioning the data into training, validation, and test subsets, we use simple mean imputation, where the empty fields in each subset are replaced with the mean value computed from the training set. Missing values in categorical features that are not due to skip patterns are imputed with the string “NNA”. To convert a categorical feature value into a real number accepted by deep learning networks, we use a simple label encoding approach, mapping it to an integer between 0 and Ncategories − 1 for that categorical variable. For TabR, since one-hot encoding is already incorporated in its input module for categorical features, we do not apply additional encoding.
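The imputation and encoding steps above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the column names are hypothetical stand-ins for the real fields, and both the imputation means and the category-to-integer mappings are fitted on the training split only, as described.

```python
import pandas as pd

# Hypothetical column names for illustration; the study's real columns differ.
CATEGORICAL = ["current_smoking", "lesion_texture"]
NUMERICAL = ["age", "alcohol_ml_per_week"]

def preprocess(train: pd.DataFrame, other: pd.DataFrame):
    """Impute skip-pattern and missing values, then label-encode categoricals.
    Statistics and category mappings are fitted on the training split only."""
    train, other = train.copy(), other.copy()
    for df in (train, other):
        # Skip pattern: never-smokers were not asked about current smoking.
        df.loc[df["ever_smoking"] == "No", "current_smoking"] = "No"
    means = train[NUMERICAL].mean()  # training-set means for imputation
    for df in (train, other):
        df[NUMERICAL] = df[NUMERICAL].fillna(means)
        df[CATEGORICAL] = df[CATEGORICAL].fillna("NNA")  # placeholder string
    for col in CATEGORICAL:
        # Map each category to an integer in [0, N_categories - 1];
        # categories unseen in training would map to NaN here.
        mapping = {v: i for i, v in enumerate(sorted(train[col].astype(str).unique()))}
        for df in (train, other):
            df[col] = df[col].astype(str).map(mapping)
    return train, other
```

The same fitted means and mappings are applied unchanged to the validation and test splits, avoiding information leakage from held-out data.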

B. Classification

We use three tabular classification methods for our work, two of which are deep learning-based and one decision tree-based. Specifically, we use TabNet [17], TabR [22], and XGBoost [23]. They are selected because of their very different design ideas as well as their demonstrated state-of-the-art performance on other datasets [17, 22, 23].

TabNet [17] is a transformer-based network. It consists of an encoder and a decoder, each containing a sequence of decision steps. Each step of the encoder has a feature transformer, an attentive transformer, and a feature selection mask. The feature transformer processes the features and is composed of both step-dependent layers and step-independent layers that are shared across all steps. Its output is divided into two groups, one used by the attentive transformer in the succeeding step and the other aggregated for the overall output. The attentive transformer performs a sparse selection of the most relevant features among its received features using a learnable mask; it includes prior scale terms (indicating how much each feature has been used previously) and normalization (such as sparsemax [24]). The feature selection masks, which indicate the feature contributions/importance at each step, can be combined to generate a global importance score for each feature. In the decoder, each decision step consists of a feature transformer followed by a fully connected layer. Given the encoded representations passed from the encoder, the reconstructed tabular features output by the decoder are obtained by aggregating the outputs of all the decoder's decision steps. The whole encoder-decoder architecture enables self-supervised learning as well as unsupervised pretraining. For supervised classification (our use case), a fully connected layer is attached to the encoder to generate the final output, and the standard softmax cross-entropy loss is used for model training. Compared to other deep learning methods, one significant strength of TabNet is that its design enables interpretability: it provides not only local interpretability for each instance but also global interpretability for the trained model.
In our work, we identify and examine the top-most important tabular features that contribute to the model and will discuss the interesting insights from what we observe in Section IV.

TabR [22] is a recent deep learning model for tabular data classification and regression. It belongs to the family of retrieval-augmented methods: given an input, it retrieves similar or relevant samples from the training data and utilizes their features and labels to help predict the output. Specifically, the TabR architecture consists of three components: an encoder, a retrieval module, and a predictor. The encoder converts the input and each candidate in the pool into intermediate feature representations. The retrieval module, which implements the key idea of TabR, finds and processes the nearest neighboring samples for the input. This module is attention-based and is composed of two main submodules: the similarity submodule, which calculates the distance between the representations of the input and the candidates, and the value submodule, which incorporates information from both embedded labels and representations. The predictor makes the prediction based on the feature representations of both the input and the most relevant candidates identified by the retrieval module. Although TabR, unlike TabNet, does not provide mechanisms to show which input features the model considers most important, we examine the change in model performance after certain tabular features are removed, to demonstrate implicitly whether those removed features are key to separating the classes.

XGBoost [23] is a popular classical machine learning method for tabular data applications. It is widely used as a go-to tool for numerous machine learning tasks across a range of applications, due to its computational efficiency and the state-of-the-art performance it has achieved. It has also been reported to be superior to counterpart deep learning networks on certain public benchmark datasets [14]. XGBoost is based on the gradient boosting decision tree (GBDT) algorithm but uses various implementation techniques to make it scalable, extensible, distributed, efficient, and effective. GBDT is an ensemble learning method that combines the predictions of a series of weak learners (shallow decision trees) into a strong overall model. It is an iterative process: in each iteration, a new tree is added to fit the error residuals of the current model (i.e., to reduce the loss value), guided by gradient descent, and the final model aggregates the result of each iteration. XGBoost improves on conventional GBDT algorithms significantly by incorporating techniques such as parallel tree boosting with a sparsity-aware algorithm, shrinkage and feature subsampling to overcome overfitting, weighted quantile sketch for approximate learning, and a column block structure with cache-aware access for efficient computation [23]. XGBoost models offer a level of interpretability by indicating which features are used the most in the decision tree ensemble.

IV. Results and Discussion

We randomly split the tabular data of 1791 patients into training, validation, and test sets using a 70/15/15 ratio, stratified by the histopathology results (precancer and cancer). The number of patients in each class of each set is provided in Table II.

TABLE II.

Number of patients in the training, validation, and test set

            Training   Validation   Test
Precancer   907        194          195
Cancer      347        74           74
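A stratified 70/15/15 partition of this kind can be produced along the following lines with scikit-learn. This is a sketch: the two-stage split and the random seed are implementation choices not specified in the paper.

```python
from sklearn.model_selection import train_test_split

def split_70_15_15(X, y, seed=0):
    """Stratified 70/15/15 split into training, validation, and test sets."""
    # First hold out 15% of the data as the test set.
    X_tmp, X_te, y_tmp, y_te = train_test_split(
        X, y, test_size=0.15, stratify=y, random_state=seed)
    # 0.15 / 0.85 of the remainder equals 15% of the original data.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_tmp, y_tmp, test_size=0.15 / 0.85, stratify=y_tmp, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)
```

Stratification keeps the precancer/cancer proportions approximately equal across the three subsets, matching the class counts reported in Table II.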

For TabNet, we choose a small model given the size of our dataset. Specifically, the total number of representations output by the feature transformer in each step is 16 (i.e., Nd = Na = 8, where Nd denotes the step output aggregated for the final output and Na the one passed to the subsequent step), and the number of decision steps in the encoder, Nsteps, is set to 3. The binary cross-entropy loss function and the Adam optimizer are used. The initial learning rate is 0.02 and is decayed using a StepLR scheduler. The maximum number of training epochs is 50, and the model is trained from scratch. Early stopping occurs if there is no improvement in the model’s performance on the validation set for 20 consecutive epochs. We set the training batch size to 256. The masking function for normalization in the attentive transformer module is an entmax function [25] with alpha equal to 1.5. For the ghost batch normalization (BN) [26] used in the gated linear unit (GLU) blocks of the feature transformer module, the virtual batch size is 32.
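For reference, these settings roughly correspond to the following configuration sketch, assuming the open-source pytorch-tabnet package. The StepLR step size and decay factor are not stated above and are placeholders; X_train, y_train, X_val, and y_val denote the preprocessed arrays from the earlier steps and are not defined here.

```python
import torch
from pytorch_tabnet.tab_model import TabNetClassifier

clf = TabNetClassifier(
    n_d=8, n_a=8, n_steps=3,          # small model: Nd = Na = 8, 3 decision steps
    mask_type="entmax",                # entmax normalization in the attentive transformer
    optimizer_fn=torch.optim.Adam,
    optimizer_params={"lr": 0.02},     # initial learning rate 0.02
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    scheduler_params={"step_size": 10, "gamma": 0.9},  # decay values assumed
)
clf.fit(
    X_train, y_train,                  # preprocessed numpy arrays (not defined here)
    eval_set=[(X_val, y_val)],
    max_epochs=50,
    patience=20,                       # early stop after 20 epochs without improvement
    batch_size=256,
    virtual_batch_size=32,             # ghost BN virtual batch size
)
# Global feature importances (summing to 1), used for Table IV-style analysis:
importances = clf.feature_importances_
```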

For the XGBoost classification model, we set the maximum tree depth for base learners to 8 and the number of trees to 1000. The objective function is logistic regression for binary classification. The boosting learning rate for shrinking the feature weights is 0.1. The L1 and L2 regularization terms on weights are 0 and 1, respectively. The ratios related to feature subsampling for each level, each node, and each tree constructed are all set to 1. For tree node partitioning, the default values of the minimum split loss reduction and the minimum sum of instance weights are used. The subsampling ratio of training samples in each boosting iteration is 0.7. The initial prediction score of all instances in leaf nodes is set to 0.5. Early stopping is also employed.
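A hedged sketch of the corresponding XGBoost configuration follows; remaining arguments keep the library defaults, and the early-stopping patience is an assumption, as it is not stated above. The training arrays are the preprocessed splits from earlier and are not defined here.

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=8,                       # maximum tree depth for base learners
    n_estimators=1000,                 # number of trees
    objective="binary:logistic",       # logistic regression for binary classification
    learning_rate=0.1,                 # boosting learning rate (shrinkage)
    reg_alpha=0,                       # L1 regularization term on weights
    reg_lambda=1,                      # L2 regularization term on weights
    colsample_bytree=1, colsample_bylevel=1, colsample_bynode=1,
    subsample=0.7,                     # row subsampling per boosting iteration
    base_score=0.5,                    # initial prediction score for leaf nodes
    early_stopping_rounds=20,          # patience value assumed; fit-time arg in older versions
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])  # arrays not defined here
```

After fitting, `model.feature_importances_` gives the usage-based importances that support the kind of feature ranking reported for XGBoost in Table IV.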

For the TabR experiments, we use the simple architecture TabR-S, in which embeddings are not used for numerical features, the encoder does not contain any block module (NE = 0), and the predictor contains only one block module (NP = 1) [22]. Quantile normalization is used for numerical features. The optimization algorithm is AdamW. As in [22], a learning rate schedule is not used. The hyperparameters are tuned first; the tuning spaces for the learning rate, weight decay, encoder output vector dimension (width d), attention dropout rate, and dropout rate in the block module are [3e-05, 0.001], [1e-06, 0.0001], [16, 64], [0, 0.6], and [0, 0.6], respectively. The number of tuning iterations is set to 10. After hyperparameter tuning, we follow the same evaluation strategy as in [22] and train 15 models with different random seeds; these individual TabR-S models are then ensembled into one TabR-S model.

Other model hyperparameters not specified above use the default settings provided by the model code packages [17, 22]. Data augmentation and class weighting are not used in the experiments. All experiments are conducted on a Lambda server with 8 Nvidia GeForce RTX 2080 Ti GPUs (each with 11 GiB of memory). Some data preprocessing and result analysis are done using MATLAB (R2024a) on a Windows Server 2019 machine.

To evaluate and compare the classification performance of all models, we use the following metrics (the threshold applied to the output probability to generate the predicted class label is 0.5): sensitivity (or recall), specificity, positive predictive value (PPV) (or precision), Youden index (or Youden’s J statistic, in (1)), balanced accuracy (BAcc., in (2)), F1 score (in (3)), Matthews correlation coefficient (MCC, in (4)), and AUC value.

J = sensitivity + specificity − 1  (1)

Balanced Accuracy = (sensitivity + specificity) / 2  (2)

F1 = (2 × precision × recall) / (precision + recall)  (3)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))  (4)

Here, TP, TN, FP, and FN denote true positive, true negative, false positive, and false negative, respectively.
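As a consistency check, Eqs. (1)-(4) can be computed directly from confusion-matrix counts. The counts below are not taken from the paper's confusion matrix; they are inferred from the reported test-set sizes (74 cancer, 195 precancer) and the TabNet sensitivity/specificity in Table III, so they are illustrative only.

```python
import math

def metrics(tp, fn, tn, fp):
    """Binary-classification metrics from confusion-matrix counts."""
    sens = tp / (tp + fn)          # sensitivity (recall)
    spec = tn / (tn + fp)          # specificity
    prec = tp / (tp + fp)          # positive predictive value (precision)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": prec,
        "youden": sens + spec - 1,                      # Eq. (1)
        "bacc": (sens + spec) / 2,                      # Eq. (2)
        "f1": 2 * prec * sens / (prec + sens),          # Eq. (3)
        "mcc": (tp * tn - fp * fn) / math.sqrt(         # Eq. (4)
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Illustrative counts inferred from Table III's TabNet row: TP=68, FN=6, TN=153, FP=42.
m = metrics(tp=68, fn=6, tn=153, fp=42)
```

With these inferred counts, the computed values round to the TabNet row of Table III (Youden 0.704, balanced accuracy 0.852, F1 0.739, MCC 0.639).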

Table III lists the metrics used to evaluate the classification performance of each algorithm. The confusion matrix and the corresponding ROC curve (with the AUC value) on the test set for each method are given in Fig. 1. All three models achieve very promising results: the Youden index is above 0.6 and the AUC value is above 0.9 for all three methods. The TabNet model has the highest sensitivity (0.919), while XGBoost has the highest specificity (0.908). Both TabR and TabNet achieve higher values than XGBoost on the Youden index, F1 score, MCC, and balanced accuracy, demonstrating the effectiveness of the deep learning networks. Finally, we extract the top 5 features for TabNet and XGBoost, respectively, to understand which features in the data the models consider most important.

TABLE III.

Classification Performance on the Test Set

          Sensi.   Speci.   PPV     Youden   BAcc.   F1      MCC     AUC
TabNet    0.919    0.785    0.618   0.704    0.852   0.739   0.639   0.91
XGBoost   0.703    0.908    0.743   0.610    0.805   0.722   0.621   0.91
TabR      0.770    0.897    0.740   0.668    0.834   0.755   0.660   0.91

Fig. 1.

Confusion matrix and ROC curve of different models on the test set.

As shown in Table IV, the top 4 features for both models are all related to lesion information or characteristics. We hypothesize that the lesion border feature is not among the most important features probably because its value is missing for a significant number (580) of patients, whereas each of the other lesion features has fewer than 20 missing values. Table V lists the values for each lesion feature. To further examine and demonstrate the crucial role that the lesion characteristic features play in the models, we re-train a new model for each method without those 5 columns of lesion characteristics and inspect its performance. As shown in Fig. 2, without these lesion features, the classification performance of TabNet, XGBoost, and TabR degrades significantly, with AUC values dropping to around 0.6.

TABLE IV.

The top 5 features considered by TabNet and XGBoost

Rank   TabNet feature       Score    XGBoost feature      Score
1      lesion texture       0.351    lesion appearance    0.277
2      lesion site          0.271    lesion texture       0.157
3      lesion size          0.105    lesion size          0.079
4      lesion appearance    0.052    lesion site          0.068
5      gender               0.036    current smoker       0.040

TABLE V.

Values for lesion features

Feature      Value
Appearance   Homogeneous, Mixed/non-homogeneous
Texture      Smooth, Corrugated, Exophytic
Border       Regular, Irregular
Site         Lip, Tongue, Buccal Mucosa, Gum, Hard Palate, Soft Palate, Floor of Mouth, Retromolar Trigone, Multifocal, Other
Size         numerical, in centimeters

Fig. 2.

Classification performance on the test set for each classifier using only 14 features, after excluding the 5 features on lesion characteristics.

V. Conclusions

Oral cancer is one of the most common cancers worldwide. Visual evaluation of the oral cavity requires training and extensive clinical experience and can result in significant inter- and intra-observer variance. We aim to develop AI/ML techniques for supporting oral cancer screening and improving its effectiveness. Prior work in oral cancer screening with automatic techniques has focused mainly on processing and analyzing oral images. In this paper, we report our effort to use a tabular data subset collected in a large ongoing oral cancer study to develop a classifier that distinguishes patients with a cancerous lesion from those with a precancerous lesion. Compared to other data formats, such as images and text, deep learning on tabular data is much less studied and reported in the literature. To reduce this gap, we investigate two state-of-the-art tabular deep learning methods (TabNet and TabR) and a conventional high-performing gradient boosting decision tree algorithm (XGBoost) for our oral cancer classification task, and we evaluate and compare each method. The experimental results on the test set demonstrate the promising performance of our approaches. In addition, we examine and analyze the interpretability of the models and identify the features most important to them; our analysis indicates that lesion characteristic features (such as texture, appearance, size, and location) are crucial to all models. For future work, besides collecting more data, we will focus on two main aspects: developing multimodal classification that utilizes both image and tabular data, and exploring generative AI for data balancing and augmentation.

Acknowledgment

This research was supported by the Division of Intramural Research of the National Library of Medicine (NLM), National Institutes of Health (NIH). We are grateful to our collaborators at the NCI for the data and background information. We also acknowledge the collaborators in Taiwan for their clinical and data collection efforts.

Contributor Information

Zhiyun Xue, Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Zhaohui Liang, Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Sivaramakrishnan Rajaraman, Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Niccolo Marini, Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

Sameer Antani, Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.

References

  • [1].https://www.wcrf.org/cancer-trends/mouth-and-oral-cancer-statistics/
  • [2].https://www.cancer.org/cancer/types/oral-cavity-and-oropharyngeal-cancer/about/key-statistics.html
  • [3].https://www.cancer.org/cancer/types/oral-cavity-and-oropharyngeal-cancer/causes-risks-prevention/risk-factors.html
  • [4].Epstein JB, Güneri P, Boyacioglu H, and Abt E, “The limitations of the clinical oral examination in detecting dysplastic oral lesions and oral squamous cell carcinoma,” J Am Dent Assoc, 2012. 143(12): p. 1332–42 [DOI] [PubMed] [Google Scholar]
  • [5].Song B, Sunny S, Li S, Gurushanth K, Mendonca P, Mukhia N, et al. “Bayesian deep learning for reliable oral cancer image classification,” Biomed Opt Express. 2021. Sep 20;12(10):6422–6430. doi: 10.1364/BOE.432365. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Welikala RA, Remagnino P, Lim JH, Chan CS, Rajendran S, Kallarakkal TG, et al. “Clinically guided trainable soft attention for early detection of oral cancer,” In: Tsapatsoulis N, Panayides A, Theocharides T, Lanitis A, Pattichis C, Vento M (eds) Computer Analysis of Images and Patterns. CAIP 2021. Lecture Notes in Computer Science, vol. 13052. Springer, Cham. 10.1007/978-3-030-89128-2_22 [DOI] [Google Scholar]
  • [7].Yang SM, Song B, Wink C, Abouakl M, Takesh T, Hurlbutt M, et al. , “Performance of automated oral cancer screening algorithm in tobacco users vs. non-tobacco users,” Applied Sciences, 2023, 13(5), 3370. 10.3390/app13053370 [DOI] [Google Scholar]
  • [8].Welikala RA, Remagnino P, Lim JH, Chan CS, Rajendran S, Kallarakkal TG, “Automated detection and classification of oral lesions using deep learning for early detection of oral cancer,” in IEEE Access, vol. 8, pp. 132677–132693, 2020, doi: 10.1109/ACCESS.2020.3010180. [DOI] [Google Scholar]
  • [9].Warin K, Suebnukarn S, “Deep learning in oral cancer - a systematic review,” BMC Oral Health. 2024. Feb 10;24(1):212. doi: 10.1186/s12903-024-03993-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [10].Xue Z, Yu K, Pearlman P, Pal A, Chen T-C, Hua C-H, Kang CJ, Chien CY, Tsai M-H, Wang C-P, Chaturvedi A, Antani S, “Oral cavity anatomical site image classification and analysis,” in Proc. SPIE Int Soc Opt Eng 2022. Feb-Mar;12037:120370E. doi: 10.1117/12.2611541 [DOI] [Google Scholar]
  • [11].Xue Z, Yu K, Pearlman P, Chen T-C, Hua C-H, Kang CJ, Chien CY, Tsai M-H, Wang -P, Chaturvedi A, Antani S, “Extraction of ruler markings for estimating physical size of oral lesions,” Int Conference on Pattern Recognition (ICPR) 2022. [Google Scholar]
  • [12].Mahmoodi E, Xue Z, Rajaraman S, and Antani S, “A study on reducing big data image annotation burden through iterative expert-in-the-loop strategy,” in Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkiye, 2023, pp. 3097–3102, doi: 10.1109/BIBM58861.2023.10385356 [DOI] [Google Scholar]
  • [13].Xue Z, Oguguo T, Yu KJ, Chen Tseng-Cheng, Hua Chun-Hung, Kang Chung Jan, Chien Chih-Yen, Tsai Ming-Hsui, Wang Cheng-Ping, Chaturvedi Anil K., Antani Sameer, “Cleaning and harmonizing medical image data for reliable AI: Lessons learned from longitudinal oral cancer natural history study data,” in Proc. SPIE 12931, Medical Imaging 2024: Imaging Informatics for Healthcare, Research, and Applications, 129310E (2 April 2024) [Google Scholar]
  • [14].Shwartz-Ziv R, Armon A, “Tabular data: deep learning is not all you need”, Information Fusion, vol. 81, 2022, pp. 84–90, ISSN 1566–2535, 10.1016/j.inffus.2021.11.011. [DOI] [Google Scholar]
  • [15].Ye H, Liu S, Cai H, Zhou Q, Zhan D, “A closer look at deep learning on tabular data”, arXiv preprint, arXiv: 2407.00956, 2024 [Google Scholar]
  • [16].Borisov V, Leemann T, Seßler K, Haug J, Pawelczyk M, Kasneci G. “Deep neural networks and tabular data: a survey,” IEEE Transactions on Neural Networks and Learning Systems, vol.35, pp. 7499–7519, 2021. [Google Scholar]
  • [17].Arik SO and Pfister T, “TabNet: Attentive interpretable tabular learning,” in Proceedings of the AAAI Conference on Artificial Intelligence 35(8):6679–6687, 2021, DOI: 10.1609/aaai.v35i8.16826 [DOI] [Google Scholar]
  • [18].Huang X, Khetan A, Cvitkovic M, and Karnin Z, “TabTransformer: tabular data modeling using contextual embeddings,” arXiv preprint, arxiv:2012.06678, 2020. [Google Scholar]
  • [19].Somepalli G, Goldblum M, Schwarzschild A, Bruss CB, and Goldstein T, “SAINT: improved neural networks for tabular data via row attention and contrastive pre-training,” arXiv preprint, arXiv:2106.01342, 2021. [Google Scholar]
  • [20].Hollmann N, Müller S, Eggensperger K, and Hutter F, “TabPFN: a transformer that solves small tabular classification problems in a second”, arXiv preprint, arXiv: 2207.01848, 2023. [Google Scholar]
  • [21].McElfresh DC, Khandagale S, Valverde J, Prasad C. V, Feuer B, Hegde C, Ramakrishnan G, Goldblum M, and White C, “When do neural nets outperform boosted trees on tabular data?”, in Proceedings of NeurIPS, pp. 76336–76369, 2023 [Google Scholar]
  • [22].Gorishniy Y, Rubachev I, Kartashev N, Shlenskii D, Kotelnikov A, and Babenko A, “TabR: tabular deep learning meets nearest neighbors in 2023”, in Proceedings of ICLR, 2024. [Google Scholar]
  • [23].Chen T and Guestrin C, “XGBoost: a scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.785 – 794, 2016. [Google Scholar]
  • [24].Martins AFT and Astudillo RF, “From Softmax to Sparsemax: a sparse model of attention and multi-label classification,” arXiv preprint, arXiv:1602.02068. 2016. [Google Scholar]
  • [25].Peters B, Niculae V, and Martins AFT, “Sparse sequence-to-sequence models,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1504–1519, 2019. [Google Scholar]
  • [26].Hoffer E, Hubara I, and Soudry D, “Train longer, generalize better: closing the generalization gap in large batch training of neural networks,” arXiv preprint, arXiv:1705.08741, 2017. [Google Scholar]
