Abstract
In this paper we propose feature selection and machine learning approaches to identify a combination of features for risk prediction of Temporomandibular Joint (TMJ) disease progression. In a sample of 32 TMJ osteoarthritis patients and 38 controls, feature selection over 5 clinical comorbidities, 43 quantitative imaging features and 28 biological features was performed using Maximum Relevance Minimum Redundancy, Chi-Square, Least Absolute Shrinkage and Selection Operator (LASSO) and Recursive Feature Elimination. We compared the performance of Learning Using Concave and Convex Kernels (LUCCK), Support Vector Machine (SVM) and Random Forest (RF) approaches to predict disease cure/improvement or persistence/worsening. We show that the SVM model using LASSO achieves area under the curve (AUC), sensitivity and precision of 0.92±0.08, 0.85±0.19 and 0.76±0.18, respectively. Baseline levels of headaches, lower back pain, restless sleep, muscle soreness, articular fossa bone surface/bone volume and trabecular separation, condylar High Grey Level Run Emphasis and Short Run High Grey Level Emphasis, saliva levels of 6Ckine, Osteoprotegerin (OPG) and Angiogenin, and serum levels of 6Ckine and Brain Derived Neurotrophic Factor (BDNF) were the most frequently occurring features for predicting a more severe TMJ osteoarthritis prognosis.
Keywords: Temporomandibular Joint Osteoarthritis, disease progression, feature selection, machine learning
1. INTRODUCTION
Osteoarthritis (OA) of the Temporomandibular Joint (TMJ) is a prevalent progressive disorder characterized by chronic joint degradation. Rapidly progressive OA can involve multiple joints [1], and severe stages may require joint replacement [2]. Assessments of OA have focused on disk and cartilage degradation, with no symptom or test that predicts the risk of severe prognosis [3]. The bone of the mandibular condyles is located just beneath the fibrocartilage, making it particularly vulnerable to inflammatory damage and a valuable model for studying arthritic changes. The current practice for determining the progression of TMJ OA is limited to improvement of pain symptoms and subjective radiographic signs [4]. This is the first study to utilize feature selection and machine learning approaches to develop a robust model for TMJ OA risk prediction, including quantitative radiomic features at baseline. A recent systematic review [5] concluded that training artificial intelligence models with various features of the disease may increase the accuracy of TMJ OA diagnosis. Indeed, Bianchi et al. [6] proposed a method that enhanced TMJ OA diagnosis using quantitative features. Nevertheless, these studies were limited to baseline assessments of TMJ OA; determining predictive factors for disease progression remains unexplored.
Based on our published results [7,8], we hypothesize that patterns of clinical symptoms, TMJ bone structure and biological mediators are unrecognized indicators of the severity of progression of TMJ OA. Selecting the combination of features that optimizes the performance of machine learning/statistical models is an important task. Many feature selection methods have been proposed in the literature, and here we compare a filter-based method (Chi-Square); a filter algorithm called Maximum Relevance Minimum Redundancy (mRMR), which uses mutual information criteria as a measure of both relevance and redundancy of features to quantify nonlinear relationships between variables; a wrapper-based algorithm called Recursive Feature Elimination (RFE); and Least Absolute Shrinkage and Selection Operator (LASSO), a method that combines a shrinking/regularization process with feature selection. We then evaluate the performance metrics of Learning Using Concave and Convex Kernels (LUCCK), Support Vector Machine (SVM) and Random Forest (RF) for classifying the patients at risk of severe prognosis.
2. METHODS
This study followed the “Strengthening the Reporting of Observational Studies in Epidemiology” (STROBE) guidelines for observational studies, was approved by the Institutional Review Board of the University of Michigan (HUM00113199), and informed consent was obtained from all participants. The longitudinal sample consisted of 32 early-stage TMJ OA patients and 38 healthy controls recruited at the University of Michigan School of Dentistry, with a 2.5 ± 0.9 year follow-up interval between the subjects’ assessments. All subjects were examined by a Temporomandibular Disorder (TMD) and orofacial pain specialist based on the diagnostic criteria for TMD. The clinical symptom features comprised 5 comorbidities obtained from a diagnostic questionnaire and an exam by the same investigator: 1) headaches in the last month, 2) muscle soreness in the last month, 3) vertical range of unassisted jaw opening without pain (mouth opening), 4) restless sleep and 5) lower back pain. Using the 3D Accuitomo cone-beam computed tomography scanner (CBCT, J. Morita MFG. Corp., Tokyo, Japan), TMJ scans were acquired for each subject to analyze the TMJ bone structure. The joint space was measured at the most superior region of the fossa. Radiomics analysis was centered on the lateral region of the articular fossa and condyle, the sites where greater OA bone degeneration occurs. Twenty-one radiomic features each in the articular fossa and in the condyle were extracted using the BoneTexture module in 3D Slicer software v.4.11 (www.3Dslicer.org), including 5 bone morphometry features, 6 Gray Level Co-occurrence Matrix (GLCM) features and 10 Gray Level Run Length Matrix (GLRLM) features. The biologic mediators were evaluated using customized protein microarrays (RayBiotech, Inc., Norcross, GA), which measured the expression levels of 14 proteins in the participants’ saliva and serum samples.
The analyzed proteins included: 6Ckine, Angiogenin, BDNF, CXCL16, ENA-78, MMP-3, MMP-7, OPG, PAI-1, TGFb1, TIMP-1, TRANCE, VE-Cadherin and VEGF. As MMP-3 measures were below the level of detection in saliva, it was excluded from subsequent analysis. Health status at the follow-up evaluation was scored as 0 (healthy), 1 (improved), 2 (same) or 3 (worsened). These scores were based on the level of clinical pain-related symptoms compared to baseline, radiographic signs of the disease (subchondral cyst, erosion, osteophyte) assessed by two expert radiologists, and 3D degenerative morphological changes. Since the small and unbalanced sample sizes of the four categories (4 worsened, 11 same, 9 improved, 8 healthy after treatment) are not sufficient for cross-validation and modeling, we binarized the follow-up evaluation score into effective treatment (healthy/improved, n=17) and non-effective treatment response (same/worsened, n=15) groups. The 38 non-TMJ controls were not included in the training phase and were only used for testing. Our results focus on the evaluation scores of treated patients. Towards building a robust model for TMJ OA prognosis we performed: 1) stratified cross-validation and grid search, 2) comparison of feature selection methods and 3) comparison of machine learning approaches.
2.1. Cross-validation
The dataset was shuffled randomly and split, in a stratified manner, into 80% for training and 20% for testing based on the severity of disease progression and diagnosis at the baseline visit. We performed 5-fold cross-validation, and a grid search was performed in each fold for hyperparameter tuning, based on the mean and standard deviation of F1 scores. The overall procedure was repeated 10 times with different random seeds for shuffling, to avoid sampling bias and overfitting from data partitioning. The final evaluation scores reported in this study are the mean ± standard deviation of the test set performance over the 10 repetitions of 5-fold cross-validation.
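The validation scheme above can be sketched with scikit-learn. This is a minimal illustration, not the study code: the data, classifier and hyperparameter grid are placeholders, and the study's dataset dimensions (32 patients, 76 features) are used only to make the shapes concrete.

```python
import numpy as np
from sklearn.model_selection import (train_test_split,
                                     RepeatedStratifiedKFold, GridSearchCV)
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 76))   # 32 patients x 76 features (synthetic stand-in)
y = np.array([0, 1] * 16)       # 0 = healthy/improved, 1 = same/worsened

# Stratified 80/20 train/test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 5-fold CV repeated 10 times with different shuffles; grid search tunes
# hyperparameters by F1 score, as described in the text
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, scoring="f1", cv=cv)
grid.fit(X_tr, y_tr)
print(grid.best_params_)
```

Reporting mean ± standard deviation over all 50 test folds then follows from `grid.cv_results_`.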
2.2. Feature Selection
We collected and measured 76 features in our dataset. To improve the efficiency of training, enhance the accuracy of the models and reduce model complexity, we performed feature selection and chose a subset of features. Feature selection methods are roughly divided into three categories: 1) filter-based methods, which select features independently of the learning process; 2) wrapper-based methods, which are based on the learning procedure and perform a greedy search by evaluating different combinations of features against an evaluation criterion; and 3) embedded methods, which integrate feature selection into the training of the model. Four feature selection methods were chosen, considering effectiveness and complexity and covering the primary methodologies. 1) Maximum relevance and minimum redundancy (mRMR) [9] is a filter-based method that selects a feature subset based on the importance of features and the least correlation among them; relevance is calculated by mutual information and redundancy by Pearson correlation. 2) Chi-Square [10] is another filter-based method, which calculates the chi-squared statistic between each feature and the prediction target, testing whether the occurrences of a specific feature and a specific class are independent; the top-ranking features with maximum chi-squared values, i.e. those most dependent on the prediction target, are selected for modeling. 3) Recursive feature elimination (RFE) [11] is a wrapper-based method that selects features by recursively eliminating the least important ones: a ranking of feature weights is obtained by training a classifier on the current feature set, the lowest-ranked features are removed, and the procedure is repeated until it reaches the number of features that assures peak performance. 4) LASSO [12] regression is an embedded method using the l1-norm as regularizer.
The objective of LASSO is to solve

$$\hat{\beta} = \underset{\beta}{\arg\min} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \quad (1)$$

where N and p are the numbers of samples and features respectively, $x_i$ is the feature vector of sample $i$, $y_i$ is its label and $\beta$ is the coefficient vector. An important property of the l1-norm regularizer is that it can produce coefficient estimates that are exactly zero, which means the corresponding feature is eliminated. The parameter $\lambda$ controls the strength of the shrinkage: the higher the value of $\lambda$, the fewer features retain a non-zero coefficient.
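Three of the four selectors have direct scikit-learn counterparts; the sketch below is illustrative only (synthetic data, arbitrary `k`, `alpha` and estimator choices), and mRMR is omitted because it is not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import Lasso
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.abs(rng.normal(size=(32, 76)))  # chi2 requires non-negative features
y = np.array([0, 1] * 16)

# 1) Filter: keep the 10 features with the largest chi-squared statistic
X_chi = SelectKBest(chi2, k=10).fit_transform(X, y)

# 2) Wrapper: recursively drop features ranked by a linear SVM's weights
rfe = RFE(SVC(kernel="linear"), n_features_to_select=10).fit(X, y)

# 3) Embedded: LASSO keeps features with non-zero coefficients; a larger
#    alpha (the lambda in Eq. 1) shrinks more coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(X_chi.shape, int(rfe.support_.sum()), selected.size)
```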
2.3. Machine learning approaches
Although there are many machine learning classifiers, in this paper the performance of three of them, RF, SVM and LUCCK, was tested on our dataset, as they learn patterns in data in different ways. Random forest [14] is an ensemble learning method for classification that operates by combining multiple decision trees at training time and outputs the class selected by most trees. A decision tree recursively partitions the given dataset into two groups based on a certain criterion until a predetermined stopping condition is met. However, decision trees are prone to overfitting, especially when they perfectly classify the training data. The bootstrap aggregating method and randomization in the node-splitting process prevent overfitting and improve on the performance of a single decision tree. SVM is based on statistical learning theory and finds an optimal hyperplane by minimizing the norm of the vector that defines the separating hyperplane [13]. The basic intuition of SVM is to find a hyperplane that best separates the data points into different classes. In the real world, data might be noisy, and the presence of a few outliers can lead to overfitting and eventually misclassification. SVM can work with a hyperplane that separates most, but not all, data points, called a soft margin, to deal with outliers and provide a generalized, robust model. LUCCK [15] is a classification method that identifies and incorporates local patterns in data while accounting for non-convex borders between classes. The algorithm uses vital feature-specific information to determine the complex pattern of changes in the data, with adjusted concavity or convexity of the similarity function, and then adjusts the importance of each feature for the classifier. Unlike SVM and RF, which give a fixed weight to each feature across all individuals, LUCCK gives a dynamic weight to each feature depending on the context of the prediction target.
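A minimal comparison of the soft-margin SVM and RF classifiers described above can be written as follows; the data are a synthetic stand-in, not the study sample, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 10))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=60) > 0).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# C controls the softness of the SVM margin: a smaller C tolerates more
# misclassified points and yields a wider, more regularized margin
svm_auc = cross_val_score(SVC(C=1.0), X, y, cv=cv, scoring="roc_auc")

# Bootstrap aggregation over many randomized trees curbs the overfitting
# that a single fully grown decision tree exhibits
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf_auc = cross_val_score(rf, X, y, cv=cv, scoring="roc_auc")
print(svm_auc.mean(), rf_auc.mean())
```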
3. RESULTS
We implemented three machine learning models by incorporating four feature selection methods. We calculated six metrics to compare these models and summarized the results in heatmap tables. Figure 1 shows the heatmap comparison of LUCCK, SVM and RF predictive models with each feature selection method.
Figure 1.

Heatmaps for the performance of the tested feature selection methods and machine learning approaches. The color code from dark green to red indicates lower to improved performance, respectively.
We evaluated the predictive risk of TMJ OA cure/improvement or persistence/worsening using clinical comorbidities, imaging features extracted from the articular fossa, condyle and joint space, as well as biological features. The best performance was obtained with the SVM model using LASSO, which achieved AUC, sensitivity and precision of 0.92±0.08, 0.85±0.19 and 0.76±0.18, respectively.
Figure 2 allows visualization of the predictive performance separately for each feature selection method and machine learning approach. LASSO and mRMR are the top-performing feature selection methods when considering AUC, F1 score and sensitivity. SVM presents stronger performance in terms of AUC, F1 score, sensitivity and accuracy, with slightly lower precision than RF. LUCCK outperformed RF in terms of sensitivity with every feature selection method.
Figure 2.

A. Graphic visualization of performance of each feature selection method. B. Graphic visualization of the performance of each machine learning approach.
To interpret the predictions of our proposed model, we utilized feature occurrence, which counts the number of times a feature is selected by the SVM model using LASSO among the total of 50 models (10 repetitions × 5 folds). The more often a feature occurs, the more reliable its importance. Contributing features are shown in Table 1 based on feature occurrence (cut-off ≥ 8 out of 50 models), which indicates the impact of each feature on model performance. Baseline levels of headaches, lower back pain, restless sleep, muscle soreness, articular fossa bone surface/bone volume and trabecular separation, condylar HighGreyLevelRunEmphasis and ShortRunHighGreyLevelEmphasis, saliva levels of OPG, 6Ckine, Angiogenin and ENA78, and serum levels of 6Ckine and BDNF were the most frequently occurring features in the SVM model using LASSO to predict a more severe TMJ OA prognosis. Figure 3 shows the mean Shapley value for feature importance in the SVM model with the LASSO feature selection method.
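The occurrence tally can be reproduced in outline by refitting LASSO on each of the 50 training folds and counting non-zero coefficients; this sketch uses synthetic data and an assumed `alpha`, so the counts are illustrative only.

```python
import numpy as np
from collections import Counter
from sklearn.linear_model import Lasso
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(3)
X = rng.normal(size=(32, 76))
y = np.array([0, 1] * 16)

occurrence = Counter()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
for train_idx, _ in cv.split(X, y):               # 10 repeats x 5 folds = 50 fits
    coef = Lasso(alpha=0.05).fit(X[train_idx], y[train_idx]).coef_
    occurrence.update(np.flatnonzero(coef).tolist())  # tally non-zero features

# Features selected in at least 8 of the 50 models (the cut-off in Table 1)
stable = [f for f, n in occurrence.items() if n >= 8]
print(len(stable))
```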
Table 1.
Times of occurrence showing how many times a feature was selected by the SVM model using LASSO.
| Category | Feature | Times of occurrence |
|---|---|---|
| Clinical comorbidities | Headaches | 50 |
| Clinical comorbidities | Lower Back Pain | 46 |
| Clinical comorbidities | Restless Sleep | 11 |
| Clinical comorbidities | Muscle Soreness | 8 |
| Imaging (Articular Fossa) | Bone Surface/Bone Volume | 34 |
| Imaging (Articular Fossa) | Trabecular Separation | 8 |
| Imaging (Condyle) | HighGreyLevelRunEmphasis | 32 |
| Imaging (Condyle) | ShortRunHighGreyLevelEmphasis | 9 |
| Biological (Saliva) | OPG | 39 |
| Biological (Saliva) | 6Ckine | 38 |
| Biological (Saliva) | Angiogenin | 28 |
| Biological (Saliva) | ENA78 | 15 |
| Biological (Serum) | 6Ckine | 41 |
| Biological (Serum) | BDNF | 8 |
Figure 3.

Features’ importance measured as the mean absolute Shapley values in 50 SVM models.
*Af, articular fossa; C, condyle; Sal, saliva; Ser, serum; BS/BV, bone surface/bone volume; OPG, osteoprotegerin; PAI-1, plasminogen activator inhibitor-1; ENA78, epithelial neutrophil-activating peptide 78; BDNF, brain-derived neurotrophic factor; TbSp, trabecular separation; MMP, matrix metalloproteinase; TIMP1, tissue inhibitor of matrix metalloproteinase 1.
4. CONCLUSION
In this paper we compared three machine learning models and four feature selection methods to predict the progression of TMJ OA. We designed a repeated, stratified 5-fold cross-validation and calculated feature occurrence to determine which features have the strongest predictive power. The SVM predictive model of TMJ OA using LASSO for feature selection achieved AUC, sensitivity and precision of 0.92±0.08, 0.85±0.19 and 0.76±0.18, respectively. Baseline levels of headaches, lower back pain, restless sleep, muscle soreness, articular fossa bone surface/bone volume and trabecular separation, condylar HighGreyLevelRunEmphasis and ShortRunHighGreyLevelEmphasis, saliva levels of 6Ckine, OPG and Angiogenin, and serum levels of 6Ckine and BDNF were the most frequently occurring features in the SVM model using LASSO to predict a more severe TMJ OA prognosis.
REFERENCES
- [1] Abrahamsson AK, Kristensen M, Arvidsson LZ, Kvien TK, Larheim TA, Haugen IK, "Frequency of temporomandibular joint osteoarthritis and related symptoms in a hand osteoarthritis cohort", Osteoarthritis and Cartilage, 25(5), 654–657 (2017).
- [2] O'Connor RC, Fawthrop F, Salha R, Sidebottom AJ, "Management of the temporomandibular joint in inflammatory arthritis: involvement of surgical procedures", European Journal of Rheumatology, 4(2), 151 (2017).
- [3] Goldring SR, Goldring MB, "Changes in the osteochondral unit during osteoarthritis: structure, function and cartilage–bone crosstalk", Nature Reviews Rheumatology, 12(11), 632–644 (2016).
- [4] Song H, Lee JY, Huh KH, et al., "Long-term changes of temporomandibular joint osteoarthritis on computed tomography", Scientific Reports, 10, 6731 (2020).
- [5] Jha N, Lee KS, Kim YJ, "Diagnosis of temporomandibular disorders using artificial intelligence technologies: a systematic review and meta-analysis", PLOS ONE, 17(8) (2022).
- [6] Bianchi J, de Oliveira Ruellas AC, Gonçalves JR, Paniagua B, Prieto JC, Styner M, Li T, et al., "Osteoarthritis of the temporomandibular joint can be diagnosed earlier using biomarkers and machine learning", Scientific Reports, 10(1) (2020). doi:10.1038/s41598-020-64942-0.
- [7] Cevidanes LHS, Hajati AK, Paniagua B, Lim PF, Walker DG, Palconet G, et al., "Quantification of condylar resorption in temporomandibular joint osteoarthritis", Oral Surgery, Oral Medicine, Oral Pathology, Oral Radiology, and Endodontology, 110(1), 110–117 (2010).
- [8] Cevidanes LH, Walker D, Schilling J, Sugai J, Giannobile W, Paniagua B, et al., "3D osteoarthritic changes in TMJ condylar morphology correlates with specific systemic and local biomarkers of disease", Osteoarthritis and Cartilage, 22(10), 1657–1667 (2014).
- [9] Ding C, Peng H, "Minimum redundancy feature selection from microarray gene expression data", Journal of Bioinformatics and Computational Biology, 3(2), 185–205 (2005).
- [10] Yang Y, Pedersen JO, "A comparative study on feature selection in text categorization", in Proceedings of ICML, Vol. 97, 412–420 (1997).
- [11] Guyon I, Weston J, Barnhill S, Vapnik V, "Gene selection for cancer classification using support vector machines", Machine Learning, 46(1), 389–422 (2002).
- [12] Muthukrishnan R, Rohini R, "LASSO: a feature selection technique in predictive modeling for machine learning", in 2016 IEEE International Conference on Advances in Computer Applications (ICACA), 18–20, IEEE (2016).
- [13] Trafalis TB, Ince H, "Support vector machine for regression and applications to financial forecasting", in Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Vol. 6, 348–353, IEEE (2000).
- [14] Maimon OZ, Rokach L, "Data Mining with Decision Trees: Theory and Applications", Vol. 81, World Scientific (2014).
- [15] Sabeti E, Gryak J, Derksen H, Biwer C, Ansari S, Isenstein H, et al., "Learning using concave and convex kernels: applications in predicting quality of sleep and level of fatigue in fibromyalgia", Entropy, 21(5), 442 (2019).
