PLOS ONE. 2020 Nov 20;15(11):e0242730. doi: 10.1371/journal.pone.0242730

A comparison of penalised regression methods for informing the selection of predictive markers

Christopher J Greenwood 1,2,*, George J Youssef 1,2, Primrose Letcher 3, Jacqui A Macdonald 1,2, Lauryn J Hagg 1, Ann Sanson 3, Jenn Mcintosh 1,2, Delyse M Hutchinson 1,2,3,4, John W Toumbourou 1, Matthew Fuller-Tyszkiewicz 1, Craig A Olsson 1,2,3
Editor: Yuka Kotozaki
PMCID: PMC7678959  PMID: 33216811

Abstract

Background

Penalised regression methods are a useful atheoretical approach for both developing predictive models and selecting key indicators within an often substantially larger pool of available indicators. In comparison to traditional methods, penalised regression models improve prediction in new data by shrinking the size of coefficients and retaining only those indicators whose coefficients remain non-zero. However, predictive performance and the selection of indicators depend on the specific algorithm implemented. The purpose of this study was to examine the predictive performance and feature (i.e., indicator) selection capability of common penalised logistic regression methods (LASSO, adaptive LASSO, and elastic-net), compared with traditional logistic regression and forward selection methods.

Design

Data were drawn from the Australian Temperament Project, a multigenerational longitudinal study established in 1983. The analytic sample consisted of 1,292 (707 women) participants. A total of 102 adolescent psychosocial and contextual indicators were available to predict young adult daily smoking.

Findings

Penalised logistic regression methods showed small improvements in predictive performance over logistic regression and forward selection. However, no single penalised logistic regression model outperformed the others. Elastic-net models selected more indicators than either LASSO or adaptive LASSO. Additionally, more regularised models included fewer indicators, yet had comparable predictive performance. Forward selection methods dismissed many indicators identified as important in the penalised logistic regression models.

Conclusions

Although overall predictive accuracy was only marginally better with penalised logistic regression methods, their benefits were most clear in their capacity to select a manageable subset of indicators. The choice among competing penalised logistic regression methods may therefore be guided by feature selection capability, and thus by interpretative considerations, rather than by predictive performance alone.

Introduction

Maximising prediction of health outcomes in populations is central to good public health practice and policy development. Population cohort studies have the potential to provide such evidence because they collect a wide range of developmentally appropriate psychosocial and contextual information on individuals over extended periods of time. It is, however, challenging to identify which indicators maximise prediction, particularly when a large number of potential indicators is available. Atheoretical predictive modelling approaches, such as penalised regression methods, have the potential to identify key predictive markers while handling potential multicollinearity issues, addressing selection biases that impinge on the ability of indicators to be identified as important, and using procedures that enhance the likelihood of replication. The interpretation of penalised regression is relatively straightforward for those accustomed to regression analyses, and it thus represents an accessible solution. Furthermore, the growing accessibility of predictive modelling tools is encouraging and presents a potential point of advancement for identifying key predictive indicators relevant to population health [1].

Broadly, predictive modelling contrasts with the causal perspective most commonly seen within cohort studies, in that it aims to maximise prediction of an outcome, not investigate underlying causal mechanisms [2]. While both predictive and causal perspectives share some similarities, Yarkoni and Westfall note that “it is simply not true that the model that most closely approximates the data-generating process will in general be the most successful at predicting real-world outcomes” [2, p. 1000]. This reflects the perennial difficulty of achieving full or even adequate representation of underlying constructs through measurable data, creating a “disparity between the ability to explain phenomena at the conceptual level and the ability to generate predictions at the measurable level” [3, p. 293]. The perspectives differ further in their foci: while causal inference approaches are commonly interested in a single exposure-outcome pathway [4, 5], predictive modelling focuses on multivariable patterns of indicators that together predict an outcome [6].

Two of the key goals in finding evidence for predictive markers are improving the accuracy and generalisability of predictive models. Accuracy refers to the ability of the model to correctly predict an outcome, whereas generalisability refers to the ability of the model to predict well given new data [7]. The concept of model fitting is key to understanding both accuracy and generalisability. Over-fitting is of particular concern and refers to the tendency for analyses to mistakenly fit models to sample-specific random variation in the data, rather than true underlying relationships between variables [8]. When a model is overfitted, it is likely to be accurate in the dataset it was developed with but is unlikely to generalise well to new data.

Several model building considerations are key to balancing the accuracy and generalisability of predictive models. First and foremost is the use of training and testing data. Training data are the data used to generate the predictive model, and testing data are then used to examine the performance of the predictive model. However, given the rarity of entirely separate cohort datasets suitable to train and then test a predictive model, a single data set is often split into training and testing portions, that is, two subsets of a larger data pool [7]. If predictive models are trained and tested on the exact same data (i.e., no data splitting), the accuracy of models is likely to be inflated due to overfitting. To further improve generalisability, it is also recommended to iterate through a series of many training and testing data splits, to reduce the influence of any specific training/testing split of the data [9].

Another consideration important to balancing the accuracy and generalisability of predictive models is the process of regularisation involved in penalised regression. Regularisation is an automated method whereby the coefficients of predictor variables deemed unimportant in predicting the outcome are shrunk towards zero. Regularisation also helps reduce overfitting by balancing the bias-variance trade-off [10]. Specifically, by reducing the size of the estimated coefficients (i.e., adding bias and reducing accuracy in the training data), the model becomes less sensitive to the idiosyncrasies of the training data, resulting in smaller changes in predictions when the model is applied to the testing data (i.e., reducing variance and increasing generalisability).

Additionally, regularisation aids in balancing the accuracy and generalisability of predictive models in terms of complexity. Complexity here refers largely to the number of indicators in the final model. For several penalised regression procedures, the regularisation process results in the coefficients of unimportant variables being shrunk to exactly zero (i.e., those variables are excluded from the model). Importantly, reducing the number of indicators in the final model helps to reduce overfitting. This is commonly referred to as feature selection [10]. Feature selection helps to improve the interpretability of models by selecting only the most important indicators from a potentially large initial pool, which is critical for researchers seeking to create administrable population health surveillance tools. While traditional approaches, such as backward elimination and forward selection procedures [11], are capable of identifying a subset of indicators, penalised regression methods improve on these by entering all potential variables simultaneously, reducing biases induced by the order in which variables are entered into or removed from the model.

Retention of indicators is, however, influenced by the particular decision rules of the algorithm. Three penalised regression methods that conduct automatic feature selection, and that are commonly compared in the literature and discussed in standard statistical texts [10, 12], are the least absolute shrinkage and selection operator (LASSO) [13], the adaptive LASSO [14] and the elastic-net [15]. The LASSO applies the L1 penalty (a constraint based on the sum of the absolute values of the regression coefficients), which shrinks coefficients equally and enables automatic feature selection. However, in situations with highly correlated indicators the LASSO tends to select one and ignore the others [16]. The adaptive LASSO and elastic-net are extensions of the LASSO, both of which draw on the L2 penalty used in ridge regression [17].

The L2 penalty (a constraint based on the sum of the squared regression coefficients) shrinks coefficients towards zero, but not exactly to zero, and is beneficial in situations of multicollinearity because the coefficients of correlated indicators tend to group towards each other [16, 18]. More specifically, the adaptive LASSO incorporates an additional data-dependent weight (here derived from ridge regression) into the L1 penalty term, which results in the coefficients of strong indicators being shrunk less than the coefficients of weak indicators [14], in contrast to the standard LASSO approach. The elastic-net includes both the L1 and L2 penalties and enjoys the benefits of both automatic feature selection and the grouping of correlated predictors [15].
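For readers who prefer a formal statement, the penalised logistic regression coefficients discussed above can be written as the values minimising the negative log-likelihood plus a penalty term. The following is a standard formulation offered as a sketch in common notation (it is not reproduced from this paper); n is the sample size, x_i the indicator vector for person i, and y_i the binary outcome:

$$\hat{\beta} = \arg\min_{\beta_0,\,\beta}\ -\frac{1}{n}\sum_{i=1}^{n}\Big[\,y_i(\beta_0 + x_i^\top\beta) - \log\big(1 + e^{\beta_0 + x_i^\top\beta}\big)\Big] + \lambda P(\beta)$$

The penalty $P(\beta)$ distinguishes the methods: for the LASSO, $P(\beta)=\sum_j |\beta_j|$ (L1); for ridge regression, $P(\beta)=\tfrac{1}{2}\sum_j \beta_j^2$ (L2); for the adaptive LASSO, $P(\beta)=\sum_j w_j|\beta_j|$, with data-dependent weights $w_j$ derived here from an initial ridge fit; and for the elastic-net, $P(\beta)=\alpha\sum_j|\beta_j| + \tfrac{1-\alpha}{2}\sum_j\beta_j^2$, with the mixing parameter $\alpha$ balancing the L1 and L2 components.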

In addition to selecting between alternative penalised regression methods, analysts need to tune (i.e., select) the parameter lambda (λ), which controls the strength of the penalty term. This is most commonly done via the data-driven process of k-fold cross-validation [19]. Specifically, k-fold cross-validation splits the training data into k partitions; a model is built on k-1 of the partitions and validated on the remaining partition. This is repeated k times until each partition has been used once as the validation data. 5-fold cross-validation is often used. Commonly, the value of λ which minimises out-of-sample prediction error is selected, identifying the best model. An alternative parameterisation applies the one-standard-error rule, under which the selected λ yields the most regularised model whose error is within one standard error of the best model's [20]. Comparatively, the best model usually selects a greater number of indicators than the one-standard-error model. The tuning of λ is, however, often poorly articulated and represents an important consideration for those seeking to derive a succinct set of predictive indicators [21].
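As a concrete illustration of this tuning step, the minimal R sketch below (using the glmnet package referenced in the Method; x and y are assumed to be a numeric predictor matrix and a binary outcome vector, names of our choosing) extracts both the best and one-standard-error values of λ from 5-fold cross-validation.

library(glmnet)
set.seed(2020)
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)  # LASSO penalty
cv_fit$lambda.min                 # lambda minimising cross-validated prediction error (best model)
cv_fit$lambda.1se                 # most regularised lambda within one SE of the best model
coef(cv_fit, s = "lambda.1se")    # coefficients of the more regularised (sparser) model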

The purpose of this study was to examine the predictive performance and feature (i.e., variable) selection capability of common penalised logistic regression methods (LASSO, adaptive LASSO, and elastic-net) compared with traditional logistic regression and forward selection methods. To demonstrate these methods, a broad range of adolescent health and development indicators, drawn from one of Australia’s longest-running cohort studies of social-emotional development, was used to maximise prediction of tobacco use in young adulthood. Tobacco use is widely recognised as a leading health concern in Australia [22], making it a high priority area for government investment [23] and a targetable health outcome for predictive models. Although some comparative work has previously examined the prediction of substance use with cohort study data [24], comparisons have largely focused on differences in predictive performance, not feature selection, and have yet to examine differences between the best and one-standard-error models.

Method

Participants

Participants were from the Australian Temperament Project (ATP), a large multi-wave longitudinal study (16 waves) tracking the psychosocial development of young people from infancy to adulthood. The baseline sample in 1983 consisted of 2,443 infants aged 4–8 months from urban and rural areas of Victoria, Australia. Information regarding sample characteristics and attrition is available elsewhere [25, 26]. The current sample consisted of 1,292 (707 women) participants with responses from at least one of the adolescent data collection waves (ages 13–14, 15–16 or 17–18 years) and who remained active in the study during at least one of the three young adult waves (ages 19–20, 23–24 or 27–28 years).

Research protocols were approved by the Human Research Ethics Committee at the University of Melbourne, the Australian Institute of Family Studies and/or the Royal Children’s Hospital, Melbourne. Participants’ parents or guardians provided informed written consent at recruitment into the study, and participants provided informed written consent at subsequent waves.

Measures

Adolescent indicators

A total of 102 adolescent indicators, assessed at ages 13–14, 15–16 and 17–18 years by parent and self-report, were available for analysis (see S1 Table). Data spanned individual (i.e., biological, internalising/externalising, personality/temperament, social competence, and positive development), relational (i.e., peer and family relationships, and parenting practices), contextual (i.e., demographics, and school and work), and substance use specific (i.e., personal and environmental use) domains. Repeated measures data were combined (i.e., maximum or mean level, depending on the indicator type) to represent overall adolescent experience. All 102 adolescent indicators were entered as predictors into each model.
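As a purely hypothetical sketch of this combining step (the variable names below are invented for illustration and are not the ATP item names):

# Mean across adolescent waves for a continuous scale
dat$depressive_symptoms_ad <- rowMeans(dat[, c("dep_w10", "dep_w11", "dep_w12")], na.rm = TRUE)

# Maximum level across adolescent waves for an exposure-type indicator
dat$peer_smoking_ad <- pmax(dat$peer_smoking_w10, dat$peer_smoking_w11, dat$peer_smoking_w12, na.rm = TRUE)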

Young adult tobacco use outcome

Tobacco use was assessed at age 19–20 years (after measurement of the indicators), as the number of days used in the last month. This was converted to a binary variable representing daily use (i.e., ≥ 28 days in the last month), which was used as the outcome (response variable) in all analyses.
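The outcome coding reduces to a single comparison; a minimal sketch follows (days_smoked is a hypothetical variable name for the reported number of days used in the last month):

# 1 = daily smoker (at least 28 days in the last month), 0 = otherwise
dat$daily_smoking <- as.integer(dat$days_smoked >= 28)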

Statistical analyses

R statistical software [27] was used for all analyses. Since standard penalised regression packages in R do not have in-built capacity to handle multiply imputed or missing data, a single imputed data set was generated using the mice package [28]. Following imputation, each continuous indicator was standardised by dividing its scores by two times its standard deviation, to improve comparability between continuous and binary indicators [29].
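A minimal sketch of these imputation and scaling steps, assuming dat is a data frame holding the 102 indicators and the outcome (m = 1 reflects the single imputation used here, not a general recommendation):

library(mice)
set.seed(2020)
imp <- mice(dat, m = 1, printFlag = FALSE)   # single imputed data set
dat_imp <- complete(imp, 1)

# Standardise continuous indicators by dividing by twice their standard deviation (Gelman, 2008)
is_cont <- sapply(dat_imp, function(v) is.numeric(v) && length(unique(v)) > 2)
dat_imp[is_cont] <- lapply(dat_imp[is_cont], function(v) v / (2 * sd(v)))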

Regression procedures

Logistic, LASSO, adaptive LASSO, and elastic-net logistic regressions were run using the glmnet package [16]. The additional weight term used in the adaptive LASSO was derived as the inverse of the corresponding coefficient from ridge regression. Elastic-net models were specified with an equal split between the L1 and L2 penalties. Specifically, L1 penalisation imposes a constraint based on the sum of the absolute values of the regression coefficients, whilst L2 penalisation imposes a constraint based on the sum of the squared regression coefficients [12]. 5-fold cross-validation was used to tune λ (the strength of the penalty) for all penalised logistic regression methods. Logistic regression imposes no penalisation on regression coefficients. Forward selection logistic regression was conducted using the MASS package [30]. Predictive performance and feature selection were examined separately using the processes described below. Models were run based on an adapted procedure and syntax implemented by Ahn et al. [9]. All syntax is available online (https://osf.io/ehprb/?view_only=9f96d224f08e4987829bb29204061f4b).
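The sketch below reconstructs these model specifications under the stated settings (x, y, and dat_imp as above). It is an illustrative approximation only; the authors' full syntax is available at the OSF link.

library(glmnet)
library(MASS)

# LASSO (alpha = 1) and elastic-net with an equal L1/L2 split (alpha = 0.5)
cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1,   nfolds = 5)
cv_enet  <- cv.glmnet(x, y, family = "binomial", alpha = 0.5, nfolds = 5)

# Adaptive LASSO: weight each coefficient by the inverse of its (absolute) ridge estimate
cv_ridge  <- cv.glmnet(x, y, family = "binomial", alpha = 0, nfolds = 5)
w         <- 1 / abs(as.vector(coef(cv_ridge, s = "lambda.min"))[-1])   # drop the intercept
cv_alasso <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5, penalty.factor = w)

# Unpenalised logistic regression and forward selection (stepAIC, MASS)
full_df  <- data.frame(y = y, x)
null_fit <- glm(y ~ 1, family = binomial, data = full_df)
fwd_fit  <- stepAIC(null_fit, direction = "forward",
                    scope = list(lower = ~ 1, upper = reformulate(colnames(x))),
                    trace = FALSE)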

The process used to compare predictive performance is illustrated in Fig 1A. All models were implemented across 100 iterations of training and testing data splits (80/20%). For the penalised logistic regression approaches, 100 iterations of 5-fold cross-validation were implemented to tune λ in the training data (80%). Each iteration of cross-validation identified the best (λ-min; the model which minimised out-of-sample prediction error) and one-standard-error (λ-1se; the more regularised model with out-of-sample prediction error within one standard error of the best model) models. For all models, predictions were made in the test data (20%). For the penalised logistic regression approaches, mean predictive performance was derived across cross-validation iterations. For all models, predictive performance metrics were saved for each of the 100 training and testing data splits.
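A condensed sketch of a single iteration of this procedure (objects as above); the full comparison simply repeats these steps over 100 random splits and stores the resulting metrics:

set.seed(2020)
n        <- nrow(x)
train_id <- sample(seq_len(n), size = floor(0.8 * n))   # 80% training / 20% testing split

cv_fit <- cv.glmnet(x[train_id, ], y[train_id], family = "binomial", alpha = 1, nfolds = 5)

# Predicted probabilities in the held-out 20% under the best and one-standard-error models
p_min <- predict(cv_fit, newx = x[-train_id, ], s = "lambda.min", type = "response")
p_1se <- predict(cv_fit, newx = x[-train_id, ], s = "lambda.1se", type = "response")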

Fig 1. A: Process to compare predictive performance of models. B: Process to compare feature selection capability of models.

Predictive performance was assessed using the area under the curve (AUC) of the receiver operating characteristic [31] and the harmonic mean of precision and recall (F1 score) [32]. AUC indicates a model’s ability to discriminate a given binary outcome, derived by plotting the true positive rate (i.e., the likelihood of correctly identifying a case) against the false positive rate (i.e., the likelihood of incorrectly identifying a non-case as a case). An AUC value of 0.5 is equivalent to chance, and a value of 1 equals perfect discrimination [33]. However, as base rates decline (i.e., the prevalence of the outcome gets lower), the AUC can become less reliable because high scores can be driven by correctly identifying true-negatives rather than true-positives. When base rates are low, the F1 score is a useful addition [34]. The F1 score represents the harmonic average of precision (i.e., the proportion of true-positives among all cases identified as positive) and recall/sensitivity (i.e., the proportion of true-positives identified among all actual positives). The F1 score indicates perfect prediction at 1 and complete inaccuracy at 0.
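For concreteness, both metrics can be computed from the held-out predictions of the previous sketch (p_min and train_id). In this example, AUC uses the pROC package and the F1 helper applies a 0.5 probability threshold; the threshold is an assumption of the example rather than a detail reported in the paper.

library(pROC)
auc_value <- auc(roc(y[-train_id], as.vector(p_min)))   # area under the ROC curve

f1_score <- function(obs, prob, cut = 0.5) {
  pred      <- as.integer(prob >= cut)
  tp        <- sum(pred == 1 & obs == 1)
  precision <- tp / sum(pred == 1)                      # true-positives / all predicted positives
  recall    <- tp / sum(obs == 1)                       # true-positives / all actual positives
  2 * precision * recall / (precision + recall)         # harmonic mean of precision and recall
}
f1_value <- f1_score(y[-train_id], as.vector(p_min))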

The process used to compare feature selection capability is illustrated in Fig 1B. In line with Ahn et al. [9], to identify the most robust indicators from each model, procedures similar to those described above were implemented, although the full data set was used to train the models. Additionally, the number of cross-validation iterations was increased to 1,000. For the penalised logistic regression methods, robust indicators were those selected in at least 80% of the cross-validation iterations. For indicators that met this robustness criterion, the mean of the coefficients across iterations was taken; all other coefficients were set to zero. Small coefficients were defined as those between -0.1 and 0.1 [34].
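A sketch of this robustness rule, assuming coef_mat is a matrix of fitted coefficients with one row per indicator and one column per cross-validation iteration (a hypothetical intermediate object assembled by the analyst; glmnet does not store this across iterations itself):

selected_prop <- rowMeans(coef_mat != 0)              # proportion of the 1,000 iterations in which each indicator was retained
robust        <- selected_prop >= 0.80                # retained in at least 80% of iterations

robust_coef <- ifelse(robust, rowMeans(coef_mat), 0)  # mean coefficient for robust indicators, zero otherwise
small_flag  <- robust & abs(robust_coef) < 0.1        # "small" coefficients (between -0.1 and 0.1)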

Results

Predictive performance

The predictive performance of each model across iterations of training and testing data splits is presented in Fig 2.

Fig 2. Predictive performance: box and whisker plots of AUC (A) and F1 (B) scores across 100 iterations of training and testing data splits. Dotted lines indicate median performance for logistic regression and forward selection; median values are reported for each model.

All λ-min penalised logistic regression models had higher AUC scores than both logistic regression and forward selection (Δ median AUC 0.002–0.008). Similarly, all of the λ-1se penalised logistic regression models outperformed forward selection (Δ median AUC 0.003–0.006); however, only the λ-1se elastic-net model outperformed logistic regression (Δ median AUC 0.002). Likewise, in comparison to logistic regression and forward selection, F1 scores were higher for all λ-min and λ-1se penalised logistic regression models (Δ median F1 0.007–0.025).

Among the penalised logistic regression models, the elastic-net had the highest AUC scores within both the λ-min (Δ median AUC 0.001–0.002) and λ-1se (Δ median AUC 0.003) models. Comparatively, the LASSO had the highest F1 scores among the λ-min models (Δ median F1 0.010–0.015) and the adaptive LASSO had the highest F1 scores among the λ-1se models (Δ median F1 0.002–0.007).

Finally, all λ-min models had higher AUC scores than their respective λ-1se models (Δ median AUC 0.002–0.004). In contrast, while the λ-min LASSO had a higher F1 score than the λ-1se model (Δ median F1 0.004), the λ-1se model outperformed the respective λ-min model for the adaptive LASSO and elastic-net (Δ median F1 0.011–0.018).

Feature selection

Feature selection was compared between the penalised logistic regression methods and forward selection logistic regression. Fig 3 plots the beta coefficients from the feature selection methods. Indicators are presented only if they were selected in at least one model.

Fig 3. Mean beta coefficients for indicators that survived at least 80% of 1,000 iterations; * = small coefficients (between -0.1 and 0.1). Suffix 10 = wave 10 (13–14 years) only; 11 = wave 11 (15–16 years) only; 12 = wave 12 (17–18 years) only; ad = combined across adolescence.

There were notable differences in feature selection when comparing the penalised logistic regression methods to the forward selection procedure. Specifically, the number of indicators selected via forward selection was far below that selected in the λ-min models. There was, however, similarity in the number of selected indicators between forward selection and the λ-1se models. Even so, forward selection models selected several indicators not present in any of the penalised logistic regression models.

There was notable similarity in the number of indicators selected by the LASSO and adaptive LASSO. The LASSO did, however, select slightly fewer indicators than the adaptive LASSO for both the λ-min (35 vs 36) and λ-1se (5 vs 7) models. Comparatively, both the LASSO and adaptive LASSO λ-min models contained several unique indicators (five and six, respectively); in the LASSO, unique indicators were limited to those with small coefficients, whereas in the adaptive LASSO none of the unique indicators had small coefficients. Additionally, while the λ-min LASSO contained a total of seven indicators with small coefficients (28 indicators with non-small coefficients), the respective adaptive LASSO contained only one (35 indicators with non-small coefficients). There was clear similarity between the LASSO and adaptive LASSO λ-1se models, such that the adaptive LASSO contained all indicators from the LASSO (plus two unique indicators). For both the LASSO and adaptive LASSO λ-1se models, no indicators had small coefficients.

In contrast, the elastic-net models selected more indicators in both the λ-min (51) and λ-1se (17) models than the respective LASSO and adaptive LASSO models. The elastic-net models selected all indicators that were selected by the LASSO and adaptive LASSO. Additionally, there were several indicators unique to the elastic-net λ-min and λ-1se models (10 and 10, respectively); almost all of these unique indicators had small coefficients. The elastic-net selected a total of 13 indicators with small coefficients in the λ-min model (38 indicators with non-small coefficients) and 8 in the λ-1se model (9 indicators with non-small coefficients).

Discussion

This study examined three common penalised logistic regression methods, LASSO, adaptive LASSO and elastic-net, in terms of both predictive performance and feature selection using adolescent data from a mature longitudinal study of social and emotional development to maximise prediction of daily tobacco use in young adulthood. We demonstrated an analytical process for examining predictive performance and feature selection and found that while differences in predictive performance were only small, differences in feature selection were notable. Findings suggested that the benefits of penalised logistic regression methods were most clear in their capacity to select a manageable subset of indicators. Therefore, decisions to select any particular method may benefit from reflecting on interpretative considerations.

The use of penalised regression approaches provides one method of identifying important indicators from amongst a large pool. While the interpretation of penalised regression methodologies is relatively straightforward for those accustomed to regression analyses, the process by which output is obtained requires a series of iterative procedures, which may be somewhat novel. This study has outlined one potential approach to this sequence of analyses and provided syntax for others to apply and adjust for themselves. Understanding both the predictive performance and the feature selection capabilities of any one model is necessary to make informed decisions regarding the development of population surveillance tools. For instance, a well-predicting model with too many indicators may be difficult to translate into a succinct tool, whereas an interpretable model with poor predictive performance may throw into question the usefulness of its indicators. Additionally, the current analytical process allows for the examination of both the best and the more regularised one-standard-error models, which have previously received limited attention.

Findings suggested greater predictive performance of penalised logistic regression compared to logistic regression or forward selection, albeit small. Results were consistent in that all of the best models outperformed both standard and forward selection logistic regression in terms of both AUC and F1 scores. Additionally, all one-standard-error models outperformed forward selection on both performance indices. The only underperforming models were the one-standard-error LASSO and adaptive LASSO, which had lower AUC scores, albeit only marginally, than standard logistic regression. The standard logistic regression model, however, includes no feature selection and is thus not a useful alternative.

Current findings support previously identified improvements in predictive performance, although the improvements here appear smaller than previously documented [24]. While alignment with previous work may be limited by substantial differences in both indicators and outcomes, even small improvements in predictive performance encourage the use of penalised logistic regression methods. In contrast to previous literature [24], differences in predictive performance between penalised logistic regression methods were only small and did not suggest any particular best performing method. Specifically, elastic-net models did not show a consistent improvement in predictive performance over the other methods: although elastic-net models had the highest AUC scores, the highest F1 scores were found in the LASSO and adaptive LASSO models. For those seeking to use penalised logistic regression methods to develop population surveillance tools, the current findings suggest that all examined methods performed similarly.

There were, however, notable differences in feature selection, and thus interpretation, between models. Forward selection selected far fewer indicators than the best (λ-min) penalised logistic regression models. Additionally, forward selection included a number of indicators not selected in the penalised logistic regression models and omitted indicators with large coefficients that were identified in the penalised logistic regression models. Overall, there appeared to be little benefit to the use of forward selection over the penalised logistic regression alternatives. This recommendation coincides with a range of statistical concerns inherent to forward selection procedures [35].

Comparing among the penalised logistic regression models, the most notable difference was that elastic-net models selected substantially more indicators than either the LASSO or adaptive LASSO, although almost all indicators unique to the elastic-net models had small coefficients. The elastic-net selected all indicators included in the other penalised logistic regression models. Additionally, while the LASSO and adaptive LASSO selected a similar number of indicators, both models contained several unique indicators. All indicators unique to the LASSO, however, had small effects, whereas in the adaptive LASSO the unique indicators had more pronounced effects. These findings suggest a considerable level of similarity in the selected features, with differences between models largely limited to indicators with small coefficients.

A particularly relevant consideration for those seeking to develop predictive population surveillance tools is the difference in feature selection between the best and one-standard-error models, which reflects the importance of understanding the desired goal of the model [1]. Findings suggest that the one-standard-error models selected far fewer indicators than the best models yet had relatively comparable predictive performance. In developing a predictive population surveillance tool intended to be relevant to a diversity of outcomes (e.g., tobacco use or mental health problems), examining both the best and one-standard-error models simultaneously is likely to be advantageous in determining which indicators are worth retaining. By examining the best model, a diverse range of indicators is likely to be identified, among which indicators (even those with weak prediction) may overlap across multiple outcomes and suggest cost-effective points of investment. By examining the one-standard-error model, the smallest subset of predictors for each relevant outcome can be identified, which may highlight the most pertinent indicators of future harms.

There are, however, some study limitations to note. First, the use of real data provides a relatable demonstration of how the models may function, but findings may not necessarily generalise to other populations or data types. The use of simulation studies remains an important and complementary area of research for systematically exploring differences between predictive models [e.g., 12]. Second, we did not explore all available penalised logistic regression applications, but rather methods that are common and accessible to researchers. Methods such as the relaxed LASSO [36] or data-driven procedures to balance the L1 and L2 penalties of the elastic-net [34] require similar comparative work. Finally, as penalised regression methods have not been widely implemented within a multiple imputation framework, the current study relies on a singly-imputed data set [37, 38].

In summary, this paper provided an overview of the implementation, predictive performance, and feature selection capacity of several common penalised logistic regression methods and their potential parameterisations. Such approaches provide an empirical basis for selecting indicators for population surveillance research aimed at maximising prediction of population health outcomes over time. Broadly, findings suggested that penalised logistic regression methods showed improvements in predictive performance over logistic regression and forward selection, albeit small ones. However, in selecting between penalised logistic regression methods, there was no clear best predicting method. Differences in feature selection were more apparent, suggesting that interpretative goals may be a key consideration for researchers when choosing between penalised logistic regression methods. This includes greater consideration of the respective best and one-standard-error models. Future work should continue to compare the predictive performance and feature selection capacities of penalised logistic regression models.

Supporting information

S1 Table. Description of adolescent indicators.

Note: a = approach to combine repeated measures data; SR = Self report, PR = Parent report; SMFQ = Short Mood and Feelings Questionnaire [39], RBPC = Revised Behaviour Problem Checklist [40], RCMAS = Revised Children’s Manifest Anxiety Scale [41], SRED = Self-Report Early Delinquency Instrument [42], SSRS = Social Skills Rating System [43], CSEI = Coopersmith Self-Esteem Inventory [44], PIES = Psychosocial Inventory of Ego Strengths [45], ACER SLQ = ACER School Life Questionnaire [46], OSBS = O’Donnell School Bonding Scale [47], IPPA = Inventory of Parent and Peer Attachment [48], CBQ = Conflict Behaviour Questionnaire [49], FACES II = Family Adaptability and Cohesion Evaluation Scale [50], RAS = Relationship Assessment Scale [51], OHS = Overt Hostility Scale [52], ZTAS = Zuckerman’s Thrill and Adventure Seeking Scale [53], FFPQ = Five Factor Personality Questionnaire [54], GBFM = Goldberg’s Big Five Markers [55], SATI = School Age Temperament Inventory [56], more information on ATP derived scales can be found in Vassallo and Sanson [25].

(DOCX)

Acknowledgments

The ATP study is located at The Royal Children’s Hospital Melbourne and is a collaboration between Deakin University, The University of Melbourne, the Australian Institute of Family Studies, The University of New South Wales, The University of Otago (New Zealand), and the Royal Children’s Hospital (further information available at www.aifs.gov.au/atp). The views expressed in this paper are those of the authors and may not reflect those of their organizational affiliations, nor of other collaborating individuals or organizations. We acknowledge all collaborators who have contributed to the ATP, especially Professors Ann Sanson, Margot Prior, Frank Oberklaid, John Toumbourou and Ms Diana Smart. We would also like to sincerely thank the participating families for their time and invaluable contribution to the study.

Data Availability

Ethics approvals for this study do not permit these potentially re-identifiable participant data to be made publicly available. Enquiries about collaboration are possible through our institutional data access protocol: https://lifecourse.melbournechildrens.com/data-access/. The current institutional body responsible for ethical approval is The Royal Children’s Hospital Human Research Ethics Committee.

Funding Statement

Data collection for the ATP study was supported primarily through Australian grants from the Melbourne Royal Children’s Hospital Research Foundation, National Health and Medical Research Council, Australian Research Council, and the Australian Institute of Family Studies. Funding for this work was supported by grants from the Australian Research Council [DP130101459; DP160103160; DP180102447] and the National Health and Medical Research Council of Australia [APP1082406]. Olsson, C.A. was supported by a National Health and Medical Research Council fellowship (Investigator grant APP1175086). Hutchinson, D.M. was supported by the National Health and Medical Research Council of Australia [APP1197488].

References

  • 1.Shatte ABR, Hutchinson DM, Teague SJ. Machine learning in mental health: a scoping review of methods and applications. Psychol Med [Internet]. 2019. July 12;49(09):1426–48. Available from: https://www.cambridge.org/core/product/identifier/S0033291719000151/type/journal_article 10.1017/S0033291719000151 [DOI] [PubMed] [Google Scholar]
  • 2.Yarkoni T, Westfall J. Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspect Psychol Sci. 2017;12(6):1100–22. 10.1177/1745691617693393 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Shmueli G. To Explain or to Predict? Stat Sci [Internet]. 2010. August;25(3):289–310. Available from: http://projecteuclid.org/euclid.ss/1294167961 [Google Scholar]
  • 4.Pearl J. Causal inference in statistics: An overview. Stat Surv [Internet]. 2009;3(0):96–146. Available from: http://projecteuclid.org/euclid.ssu/1255440554 [Google Scholar]
  • 5.Hernán MA. A definition of causal effect for epidemiological research. J Epidemiol Community Health. 2004;58(4):265–71. 10.1136/jech.2002.006361 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Moons KGM, Royston P, Vergouwe Y, Grobbee DE, Altman DG. Prognosis and prognostic research: what, why, and how? Br Med J [Internet]. 2009. February 23;338(feb23 1):b375–b375. Available from: http://www.bmj.com/cgi/doi/10.1136/bmj.b375 [DOI] [PubMed] [Google Scholar]
  • 7.Altman DG, Vergouwe Y, Royston P, Moons KGM. Prognosis and prognostic research: validating a prognostic model. Br Med J [Internet]. 2009. May 28;338(may28 1):b605–b605. Available from: 10.1136/bmj.b605 [DOI] [PubMed] [Google Scholar]
  • 8.Sauer B, VanderWeele TJ. Developing a Protocol for Observational Comparative Effectiveness Research: A User’s Guide. [Internet]. Velentgas P, Dreyer N, Nourjah P, Smith S, Torchia M, editors. AHRQ Publication No. 12(13)-EHC099. Rockville, MD: Agency for Healthcare Research and Quality; 2013. 177–184 p. Available from: https://www.ncbi.nlm.nih.gov/books/NBK126178/ [PubMed] [Google Scholar]
  • 9.Ahn W, Ramesh D, Moeller FG, Vassileva J. Utility of machine-learning approaches to identify behavioral markers for substance use disorders: Impulsivity dimensions as predictors of current cocaine dependence. Front Psychiatry. 2016;7(MAR):1–11. 10.3389/fpsyt.2016.00034 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning [Internet]. Springer Texts in Statistics. 2013. 618 p. Available from: http://books.google.com/books?id=9tv0taI8l6YC [Google Scholar]
  • 11.Derksen S, Keselman HJ. Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. Br J Math Stat Psychol. 1992;45(2):265–82. [Google Scholar]
  • 12.Pavlou M, Ambler G, Seaman S, De iorio M, Omar RZ. Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events. Stat Med. 2016;35(7):1159–77. 10.1002/sim.6782 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Tibshirani R. Regression Shrinkage and Selection Via the Lasso. J R Stat Soc Ser B [Internet]. 1996. January;58(1):267–88. Available from: http://doi.wiley.com/10.1111/j.2517-6161.1996.tb02080.x [Google Scholar]
  • 14.Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101(476):1418–29. [Google Scholar]
  • 15.Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Statistical Methodol [Internet]. 2005. April;67(2):301–20. Available from: http://doi.wiley.com/10.1111/j.1467-9868.2005.00503.x [Google Scholar]
  • 16.Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw [Internet]. 2010;33(1):1–20. Available from: http://www.jstatsoft.org/v33/i01/ [PMC free article] [PubMed] [Google Scholar]
  • 17.le Cessie S, van Houwelingen JC. Ridge Estimators in Logistic Regression. Appl Stat [Internet]. 1992;41(1):191. Available from: https://www.jstor.org/stable/10.2307/2347628?origin=crossref [Google Scholar]
  • 18.Feig DG. Ridge regression: when biased estimation is better. Soc Sci Q [Internet]. 1978;58(4):708–16. Available from: http://www.jstor.org/stable/42859928 [Google Scholar]
  • 19.Kohavi R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence—Volume 2, San Francisco, CA, USA. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.; 1995. (IJCAI’95).
  • 20.Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning [Internet]. The Mathematical Intelligencer. New York, NY: Springer New York; 2009. 369–370 p. (Springer Series in Statistics). Available from: http://link.springer.com/10.1007/978-0-387-84858-7 [Google Scholar]
  • 21.Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol [Internet]. 2019;110:12–22. Available from: 10.1016/j.jclinepi.2019.02.004 [DOI] [PubMed] [Google Scholar]
  • 22.Australian Institute of Health and Welfare. Impact and cause of illness and death in Australia [Internet]. Australian Burden of Disease Study. 2011. Available from: https://www.aihw.gov.au/getmedia/d4df9251-c4b6-452f-a877-8370b6124219/19663.pdf.aspx?inline=true
  • 23.Skara S, Sussman S. A review of 25 long-term adolescent tobacco and other drug use prevention program evaluations. Prev Med (Baltim). 2003;37(5):451–74. 10.1016/s0091-7435(03)00166-x [DOI] [PubMed] [Google Scholar]
  • 24.Afzali MH, Sunderland M, Stewart S, Masse B, Seguin J, Newton N, et al. Machine-learning prediction of adolescent alcohol use: a cross-study, cross-cultural validation. Addiction. 2018;662–71. 10.1111/add.14504 [DOI] [PubMed] [Google Scholar]
  • 25.Vassallo S, Sanson A. The Australian temperament project: The first 30 years. Melbourne: Australian Institute of Family Studies; 2013. [Google Scholar]
  • 26.Letcher P, Sanson A, Smart D, Toumbourou JW. Precursors and correlates of anxiety trajectories from late childhood to late adolescence. J Clin Child Adolesc Psychol [Internet]. 2012;41(4):417–32. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22551395 10.1080/15374416.2012.680189 [DOI] [PubMed] [Google Scholar]
  • 27.R Core Team. R Core Team (2018). R: A language and environment for statistical computing. R Foundation for Statistical Computing; 2018. [Google Scholar]
  • 28.Van Buuren S, Groothuis-Oudshoorn K. Multivariate Imputation by Chained Equations. J Stat Softw [Internet]. 2011;45(3):1–67. Available from: http://igitur-archive.library.uu.nl/fss/2010-0608-200146/UUindex.html [Google Scholar]
  • 29.Gelman A. Scaling regression inputs by dividing by two standard deviations. Stat Med [Internet]. 2008. July 10;27(15):2865–73. Available from: 10.1002/sim.3107 [DOI] [PubMed] [Google Scholar]
  • 30.Venables WN, Ripley BD. Modern Applied Statistics with S. 4th ed. New York: Springer; 2002.
  • 31.Metz CE. Basic principles of ROC analysis. Semin Nucl Med [Internet]. 1978. October;8(4):283–98. Available from: http://www.umich.edu/~ners580/ners-bioe_481/lectures/pdfs/1978-10-semNucMed_Metz-basicROC.pdf 10.1016/s0001-2998(78)80014-2 [DOI] [PubMed] [Google Scholar]
  • 32.Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):1–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Hosmer DW, Lemeshow S. Applied Logistic Regression [Internet]. 2nd ed Hoboken, NJ, USA: John Wiley & Sons, Inc; 2000. Available from: http://doi.wiley.com/10.1002/0471722146 [Google Scholar]
  • 34.Fitzgerald A, Giollabhui N Mac, Dolphin L, Whelan R, Dooley B. Dissociable psychosocial profiles of adolescent substance users. PLoS One. 2018;13(8):1–17. 10.1371/journal.pone.0202498 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Harrell FE. Regression Modeling Strategies with applications to linear models, logistic regression, and survival analysis Regression modeling strategies. Springer; 2001. 106–107 p. [Google Scholar]
  • 36.Meinshausen N. Relaxed Lasso. Comput Stat Data Anal [Internet]. 2007. September;52(1):374–93. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0167947306004956 [Google Scholar]
  • 37.Wan Y, Datta S, Conklin DJ, Kong M. Variable selection models based on multiple imputation with an application for predicting median effective dose and maximum effect. J Stat Comput Simul [Internet]. 2015. June 13;85(9):1902–16. Available from: http://www.tandfonline.com/doi/abs/10.1080/00949655.2014.907801 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Chen Q, Wang S. Variable selection for multiply-imputed data with application to dioxin exposure study. Stat Med. 2013;32(21):3646–59. 10.1002/sim.5783 [DOI] [PubMed] [Google Scholar]
  • 39.Angold A, Costello EJ, Messer S, Pickles A, Winder F, D S. The development of a short questionnaire for use in epidemiological studies of depression in children and adolescents. Int J Methods Psychiatr Res. 1995;5:237–49. [Google Scholar]
  • 40.Quay HC, Peterson DR. The Revised Behavior Problem Checklist: Manual. Odessa, FL: Psychological Assessment Resources; 1993. [Google Scholar]
  • 41.Reynolds CR, Richmond BO. What I think and feel: A revised measure of children’s manifest anxiety. J Abnorm Child Psychol. 1997;25(1):15–20. 10.1023/a:1025751206600 [DOI] [PubMed] [Google Scholar]
  • 42.Moffitt TE, Silva PA. Self-reported delinquency: Results from an instrument for new zealand. Aust N Z J Criminol. 1988;21(4):227–40. [Google Scholar]
  • 43.Gresham F, Elliott S. Manual for the Social Skills Rating System. Circle Pines MN: American Guidance Service; 1990. [Google Scholar]
  • 44.Coopersmith S. Self-esteem inventories. Palo Alto, CA: Consulting Psychologists Press; 1982. [Google Scholar]
  • 45.Markstrom CA, Sabino VM, Turner BJ, Berman RC. The psychosocial inventory of ego strengths: Development and validation of a new Eriksonian measure. J Youth Adolesc. 1997;26(6):705–29. [Google Scholar]
  • 46.Ainley J, Reed R, Miller H. School organisation and the quality of schooling: A study of Victorian government secondary schools. Hawthorn, Victoria: ACER; 1986. [Google Scholar]
  • 47.O’Donnell J, Hawkins JD, Abbott RD. Predicting Serious Delinquency and Substance Use Among Aggressive Boys. J Consult Clin Psychol. 1995;63(4):529–37. 10.1037//0022-006x.63.4.529 [DOI] [PubMed] [Google Scholar]
  • 48.Armsden GC, Greenberg MT. The inventory of parent and peer attachment: Individual differences and their relationship to psychological well-being in adolescence. J Youth Adolesc [Internet]. 1987. October;16(5):427–54. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24277469 10.1007/BF02202939 [DOI] [PubMed] [Google Scholar]
  • 49.Prinz RJ, Foster S, Kent RN, O’Leary KD. Multivariate assessment of conflict in distressed and nondistressed mother-adolescent dyads. J Appl Behav Anal. 1979;12(4):691–700. 10.1901/jaba.1979.12-691 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Olson DH, Portner J, Bell RQ. FACES II: Family Adaptability and Cohesion Evaluation Scales. St. Paul: University of Minnesota, Department of Family Social Science; 1982. [Google Scholar]
  • 51.Hendrick S. A Generic Measure of Relationship Satisfaction. J Marriage Fam. 1988;50(1):93–8. [Google Scholar]
  • 52.Porter B, Leary KDO. Marital Discord and Childhood Behavior Problems. 1980;8(3):287–95. [DOI] [PubMed] [Google Scholar]
  • 53.Zuckerman M, Kolin EA, Price L, Zoob I. Development of a sensation-seeking scale. J Consult Psychol. 1964;28(6):477–82. [DOI] [PubMed] [Google Scholar]
  • 54.Lanthier RP, Bates JE. Infancy era predictors of the big five personality dimensions in adolescence. In: Paper presented at the 1995 meeting of the Midwestern Psychological Association, Chicago. 1995.
  • 55.Goldberg RL. Development of Factors for Big Five Factor Structure. Psychol Assess. 1992;4:26–42. [Google Scholar]
  • 56.McClowry SG. The Development of the School-Age Temperament Inventory. 2016;41(3):271–85. [Google Scholar]

Decision Letter 0

Yuka Kotozaki

3 Sep 2020

PONE-D-20-18792

An overview of penalised regression methods for informing the selection of predictive markers

PLOS ONE

Dear Dr. Christopher Greenwood,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please review the comments from the two reviewers and make appropriate corrections.

Please submit your revised manuscript by Oct 18 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Yuka Kotozaki

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript is a useful, important, and largely well-written study on the applications of penalized logistic regression on real cohort data. I can envision its eventual publication in PLOS ONE after suitable revisions are made.

Suggested Revisions

Abstract: It does not seem that penalized regression provides an "Atheoretical approach".

For example, the LASSO estimates of coefficients, under a Bayesian theoretical point of view, correspond to a posterior mode of the coefficients under independent Laplace priors.

Abstract: "Penalized regression methods" -> "Penalized logistic regression methods"

Throughout the manuscript:

Please provide page numbers for the manuscript.

Many references are made to "penalized regression" (e.g. LASSO, Elastic net) being used.

That phrase refers to real-valued responses. But the manuscript seems to focus on the use of logistic regression. So it is better to state "penalized logistic regression" instead of just "penalized regression" in these instances.

Lines 62-63: Can you give recent few examples of cohort studies that represent limited use of penalized regression in such studies?

Lines 82-83: "apply well to new data" is a vague term. Do you mean "predict new data well?"

Lines 191-192: "A single imputed data set was generated via multiple imputation (MI)" is an awkward phrase that needs to be fixed.

"Single imputation" is not an example of "multiple imputation".

Lines 193-194: The statement "since standard penalised regression procedures in R cannot handle multiple imputation or missing data"

is not exactly true. In principle, one can adopt a fully Bayesian approach to elastic net, lasso or ridge logistic regression, by running an MCMC algorithm to generate samples from the posterior distribution of model parameters for each imputed data set, from the multiple imputations of data sets. Then results are summarized by mixing together the MCMC posterior samples across data sets.

This is according to the Bayesian book: Bayesian Data Analysis (2nd Edition), by Andrew Gelman, et al.

Lines 197-238: The section on Regression Procedures lacks detailed descriptions of these procedures.

When describing each of the methods, namely the logistic, LASSO, adaptive LASSO, and elastic-net logistic regressions,

the section needs to describe the penalized likelihood function that is being maximized, by each method.

(Of course, the logistic regression method uses zero penalty).

For ideas, a good example of a paper that describes the likelihood function of each of these methods is provided in

the article described in https://www.jstatsoft.org/article/view/v033i01

Lines 212-213: Please define "best (lambda-min)" and "best one-standard-error (lambda-1se) model."

This will greatly help interpretation of the Results section, later.

The above two critical points were the main factor that moved my decision from Minor Revision to Major Revision.

Line 377: The phrase "single MI data set" is a misnomer, because it refers to one thing having the property of being both "single" and "multiple". Instead, use something like "singly-imputed" data set.

Line 422: "BMJ". Spell out the journal's full name.

Line 426: "Int J Methods Psychiatr Res." Spell out the journal's full name.

Line 460: The reference is missing the volume, issue, and page numbers.

Line 478: The reference is missing the conference location.

Line 501: "BMJ". Spell out the journal's full name.

Line 532: The reference is missing the volume, issue, and page numbers.

Line 534: "Ssrn" appears to be an incomplete journal name.

Line 539: The reference is missing the journal name.

Line 543: In the reference, capitalize the second "the".

Reviewer #2: Although the title suggests that the manuscript overviews the ability of different penalized regression methods to select predictive markers, it is not clear whether the focus is on feature selection or on the predictive ability of the selected features. On the other hand, it is known that penalized regression methods were developed for variable selection rather than prediction, and there are many theoretical results on the different methods (in contrast to the "atheoretical predictive modelling approaches" claimed by the authors). As far as prediction is concerned, there are many machine learning methods that could be used (deep learning in particular). There are some other major issues:

1. The current overview relies solely on the analysis of a single data set. Instead, a comprehensive review might be better conducted with a well-designed simulation study as well as several typical data sets. Otherwise, how have the authors calculated the sensitivity or specificity of the selected predictors?

2. The presented analysis of the longitudinal study is very vague and lacks detail. For example, what is the response variable, how many predictors are there, and what is the sample size? The "Discussion" refers to "prediction of daily tobacco use in young adulthood"; does this mean a binary response variable or a quantitative response variable? As the original study provides longitudinal data, was a cross-sectional data set selected for the analysis? Otherwise, how are the longitudinal data fit to the proposed models (again, details are lacking)?

3. Logistic regression is a regression model, rather than a statistical approach parallel to penalized regression methods. As a logistic regression model can be used here, I suppose a binary response variable is used. In this case, have the authors considered all penalized regression methods for the logistic regression models?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Nov 20;15(11):e0242730. doi: 10.1371/journal.pone.0242730.r002

Author response to Decision Letter 0


28 Oct 2020

October 22, 2020

Dr Yuka Kotozaki

Academic Editor, PLOS ONE

RE: Revised manuscript for consideration in PLOS ONE

Dear Dr Yuka Kotozaki,

Thank you for the invitation to submit a revised version of our manuscript to PLOS ONE. We are grateful to the reviewers for their considered and helpful comments. We have responded to each and have revised the manuscript accordingly. We believe the manuscript has been strengthened through the peer review process.

We look forward to hearing from you.

Yours sincerely,

Christopher Greenwood

Research Fellow | Life-course Epidemiology

Centre for Social and Early Emotional Development

Deakin University, Australia

E: christopher.greenwood@deakin.edu.au

Comments from the reviewers

Reviewer #1:

The manuscript is a useful, important, and largely well-written study on the applications of penalized logistic regression on real cohort data. I can envision its eventual publication in PLOS ONE after suitable revisions are made.

Thank you for your kind feedback and suggestions for improving the manuscript.

Throughout the manuscript:

Please provide page numbers for the manuscript.

Done.

Many references are made to "penalized regression" (e.g. LASSO, Elastic net) being used.

That phrase refers to real-valued responses. But the manuscript seems to focus on the use of logistic regression. So it is better to state "penalized logistic regression" instead of just "penalized regression" in these instances.

This has been amended throughout the revised manuscript so that we now consistently refer to penalized logistic regression instead of penalized regression.

Lines 62-63: Can you give a few recent examples of cohort studies that represent the limited use of penalized regression in such studies?

We have now re-written this part of the text to better express our point. The amended text now reads:

“Furthermore, growing accessibility of predictive modelling tools is encouraging and presents as a potential point of advancement for identifying key predictive indicators relevant to population health (Shatte et al., 2019).”

Lines 82-83: "apply well to new data" is a vague term. Do you mean "predict new data well?"

Thank you, this has been clarified.

“Accuracy refers to the ability of the model to correctly predict an outcome, whereas generalisability refers to the ability of the model to predict well given new data (Altman et al., 2009).”

Lines 191-192: "A single imputed data set was generated via multiple imputation (MI)" is an awkward phrase that needs to be fixed. "Single imputation" is not an example of "multiple imputation".

Lines 193-194: The statement "since standard penalised regression procedures in R cannot handle multiple imputation or missing data" is not exactly true. In principle, one can adopt a fully Bayesian approach to elastic-net, LASSO, or ridge logistic regression by running an MCMC algorithm to generate samples from the posterior distribution of the model parameters for each imputed data set from the multiple imputations. Results are then summarized by pooling the MCMC posterior samples across the imputed data sets. This approach is described in Bayesian Data Analysis (2nd Edition) by Andrew Gelman et al.

We thank the reviewer for this information. This section has now been clarified.

“Since standard penalised regression packages in R do not have in-built capacity to handle multiply imputed or missing data, a single imputed data set was generated using the mice package (Van Buuren & Groothuis-Oudshoorn, 2011).”
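
For readers unfamiliar with the package, this step reduces to a short call; a minimal sketch with a hypothetical data object (argument values illustrative only):

    library(mice)

    # Generate a single imputed data set (m = 1) and extract it in completed form
    imp            <- mice(analysis_data, m = 1, seed = 2020)  # 'analysis_data' is hypothetical
    single_imputed <- complete(imp, 1)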

Lines 197-238: The section on Regression Procedures lacks detailed descriptions of these procedures.

When describing each of the methods, namely the logistic, LASSO, adaptive LASSO, and elastic-net logistic regressions, the section needs to describe the penalized likelihood function that is being maximized by each method. (Of course, the logistic regression method uses zero penalty).

For ideas, a good example of a paper that describes the likelihood function of each of these methods is provided in the article described in https://www.jstatsoft.org/article/view/v033i01

We have now described in the “Regression procedures” section the penalty terms.

“Specifically, L1 penalisation imposes a constraint based on the sum of the absolute values of the regression coefficients, whilst L2 penalisation imposes a constraint based on the sum of the squared regression coefficients (Pavlou et al., 2016). 5-fold cross-validation was used to tune λ (the strength of the penalty) for all penalised logistic regression methods. Logistic regression imposes no penalisation on the regression coefficients.”
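
For reference, the objective being described can be written, following the glmnet formulation (Friedman et al., 2010), as a penalised negative binomial log-likelihood (a standard formulation, not text from the manuscript):

    \min_{\beta_0,\,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\Big[ y_i\,(\beta_0 + x_i^{\top}\beta) - \log\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big] + \lambda\Big[ \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1 \Big]

Here α = 1 gives the LASSO (L1 penalty only), 0 < α < 1 gives the elastic-net, and the adaptive LASSO replaces the L1 term with a weighted sum of |β_j| using data-dependent weights w_j; standard logistic regression corresponds to λ = 0.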

We have additionally expanded the section of the introduction which discusses the compromises of each model and the relevant penalties. We have provided the full paragraph here for completeness:

“Retention of indicators is, however, influenced by the particular decision rules of the algorithm. Three penalised regression methods which conduct automatic feature selection and are commonly compared in the literature and discussed in standard statistical texts (James et al., 2013; Pavlou et al., 2016) are the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), the adaptive LASSO (Zou, 2006) and the elastic-net (Zou & Hastie, 2005). The LASSO applies the L1 penalty (constraint based on the sum of the absolute values of the regression coefficients), which shrinks coefficients equally and enables automatic feature selection. However, in situations with highly correlated indicators the LASSO tends to select one and ignore the others (Friedman et al., 2010). The adaptive LASSO and elastic-net are extensions of the LASSO, both of which incorporate the L2 penalty from ridge regression (Cessie & Houwelingen, 1992).

The L2 penalty (constraint based on the sum of the squared regression coefficients) shrinks coefficients equally towards zero, but not exactly to zero, and is beneficial in situations of multicollinearity as correlated indicators tend to group towards each other (Feig, 1978; Friedman et al., 2010). More specifically, the adaptive LASSO incorporates an additional data-dependent weight (derived from ridge regression) to the L1 penalty term, which results in coefficients of strong indicators being shrunk less than the coefficients of weak indicators (Zou, 2006), in contrast to the standard LASSO approach. The elastic-net includes both the L1 and L2 penalty and enjoys the benefits of both automatic feature selection and the grouping of correlated predictors (Zou & Hastie, 2005).”
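
A compact R sketch of how these three penalties are typically specified with the glmnet package ('x' denoting the indicator matrix and 'y' the binary outcome are hypothetical objects, and the tuning choices shown are illustrative assumptions rather than the authors' exact settings):

    library(glmnet)

    # LASSO: pure L1 penalty (alpha = 1)
    cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)

    # Elastic-net: mixture of L1 and L2 penalties (alpha = 0.5 shown for illustration)
    cv_enet <- cv.glmnet(x, y, family = "binomial", alpha = 0.5, nfolds = 5)

    # Adaptive LASSO: L1 penalty with data-dependent weights derived from a
    # preliminary ridge fit (alpha = 0), supplied through 'penalty.factor'
    cv_ridge   <- cv.glmnet(x, y, family = "binomial", alpha = 0, nfolds = 5)
    ridge_coef <- as.numeric(coef(cv_ridge, s = "lambda.min"))[-1]  # drop intercept
    w          <- 1 / (abs(ridge_coef) + 1e-6)                      # guard against division by zero
    cv_alasso  <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                            nfolds = 5, penalty.factor = w)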

Lines 212-213: Please define "best (lambda-min)" and "best one-standard-error (lambda-1se) model."

This will greatly help interpretation of the Results section, later.

We have now added a description of these models to help with interpretation in the results section.

“Each iteration of cross-validation identified the best (λ-min; the model which minimised out-of-sample prediction error) and one-standard-error (λ-1se; a more regularised model with out-of-sample prediction error within one standard error of the best model) model.”
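
In glmnet terms these correspond to the lambda.min and lambda.1se values returned by cross-validation; continuing the hypothetical objects from the sketch above:

    # Penalty strengths for the 'best' and the more regularised 'one-standard-error' models
    lambda_min <- cv_lasso$lambda.min
    lambda_1se <- cv_lasso$lambda.1se

    # The selected indicators (non-zero coefficients) differ between the two models
    coef_min <- as.numeric(coef(cv_lasso, s = "lambda.min"))[-1]
    coef_1se <- as.numeric(coef(cv_lasso, s = "lambda.1se"))[-1]
    n_selected_min <- sum(coef_min != 0)
    n_selected_1se <- sum(coef_1se != 0)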

The above two critical points were the main factors that moved my decision from Minor Revision to Major Revision.

Line 377: The phrase "single MI data set" is a misnomer, because it refers to one thing having the property of being both "single" and "multiple". Instead, use something like "singly-imputed" data set.

This has now been amended as “a singly-imputed data set”.

Line 422: "BMJ". Spell out the journal's full name.

Done.

Line 426: "Int J Methods Psychiatr Res." Spell out the journal's full name.

Done.

Line 460: The reference is missing the volume, issue, and page numbers.

Done.

Line 478: The reference is missing the conference location.

Done.

Line 501: "BMJ". Spell out the journal's full name.

Done.

Line 532: The reference is missing the volume, issue, and page numbers.

Done.

Line 534: "Ssrn" appears to be an incomplete journal name.

Done.

Line 539: The reference is missing the journal name.

Done.

Line 543: In the reference, capitalize the second "the".

Done.

Reviewer #2:

Although the title suggests that the manuscript overviews the ability of different penalized regression methods to select predictive markers, it is not clear whether the focus is on feature selection or on the predictive ability of the selected features. On the other hand, it is known that penalized regression methods were developed for variable selection rather than prediction, and there are many theoretical results on the different methods (in contrast to the "atheoretical predictive modelling approaches" claimed by the authors). As far as prediction is concerned, there are many machine learning methods that could be used (deep learning in particular). There are some other major issues:

We thank the reviewer for providing their thoughts on the current manuscript. We believe that we have presented a comparative examination of the current methods in terms of both prediction and feature selection; considering the two in unison provides further insight into the differences between the methods. It remains desirable to identify a subset of indicators that predicts accurately.

Penalised regression methods have, however, not been developed solely for the purpose of feature selection – for example, take ridge regression, which does not perform feature selection. The use of regularisation has benefits in terms of both prediction accuracy (related to the bias-variance trade-off) and model interpretability (e.g., feature selection).

We do agree that there are a range of other machine learning methods (such as deep learning algorithms) that could also be used to develop predictive models; however, these models often have low interpretability (e.g., the black box problem) which can be important for many psychological applications (e.g., the development of screening tools).

1. The current overview solely relies on the analysis of single data set. Instead a comprehensive review may be better conducted with a well-designed simulation study, as well as several typical data sets. Otherwise, how have the authors calculated the sensitivity or specificity of selected predictors?

We have now acknowledged in the discussion that a simulation study is an important area of research that permits examination of the sensitivity and specificity of selected predictors (see amended text below). However, we believe there is great merit in comparing methods using real data, and this has been a focus of many studies, including those comparing machine learning applications within the PLoS ONE journal (e.g., Luo et al., 2017; full reference provided below). The benefit of demonstrating methods using real data sets is to illustrate how they are applied in practice, where the imperfections of real data and the accompanying analytic decisions are present. As such, we believe our approach provides an important contribution to the literature and is consistent with the approaches taken by others to demonstrate methodology in practice.

“First, the use of real data provides a relatable demonstration of how models may function, but findings may not necessarily generalise to other populations or data types. The use of simulation studies remains an important and complementary area of research for systematically exploring differences in predictive models (e.g., Pavlou et al., 2016).”

Luo, Y., Li, Z., Guo, H., Cao, H., Song, C., Guo, X., & Zhang, Y. (2017). Predicting congenital heart defects: A comparison of three data mining methods. PLoS ONE, 12(5), 1–14. https://doi.org/10.1371/journal.pone.0177811

2. The presented analysis of the longitudinal study is very vague, lack of details. For example, what is the response variable, how many predictors are there, and what is the sample size? It suggested in "Discussion" on "prediction of daily tobacco use in young adulthood", does it mean a binary response variable, or quantitative response variable? As the original study suggests a longitudinal data, is a cross-sectional data set selected for the analysis? Otherwise, how is the longitudinal data fit to the proposed models (again lack of details)?

We apologise if these details were unclear. We have now revised the wording throughout the methods section and believe that the design of the study is more clearly articulated.

“Adolescent indicators

A total of 102 adolescent indicators, assessed at ages 13-14, 15-16, 17-18 years, by parent and self-report, were available for analysis (see S1 Table). Data spanned individual (i.e., biological, internalising/externalising, personality/temperament, social competence, and positive development), relational (i.e., peer and family relationships, and parenting practices), contextual (demographics, and school and work), and substance use specific (personal, and environmental use) domains. Repeated measures data were combined (i.e., maximum or mean level depending on the indicator type) to represent overall adolescent experience. All 102 adolescent indicators were entered as predictors into each model.

Young adult tobacco use outcome

Tobacco use was assessed at age 19-20 years (after measurement of the indicators), as the number of days used in the last month. This was converted to a binary variable representing daily use (i.e., ≥ 28 days in the last month), which was used as the outcome (response variable) in all analyses.”
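
In code, the outcome construction and the combining of repeated measures described above amount to simple recodes; a hedged R sketch with hypothetical variable names (not the authors' code):

    # Binary daily-use outcome: used on >= 28 days in the last month
    daily_smoking <- as.integer(days_used_last_month >= 28)

    # Repeated adolescent measures combined across waves by maximum or mean,
    # depending on the indicator type
    indicator_max  <- pmax(wave1_score, wave2_score, wave3_score, na.rm = TRUE)
    indicator_mean <- rowMeans(cbind(wave1_score, wave2_score, wave3_score), na.rm = TRUE)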

3. Logistic regression is a regression model, instead of a statistical approach in parallel of penalized regression method. As logistic regression model can be used here, I suppose a binary response variable is used here. In this case, have the authors considered all penalized regression methods for the logistic regression models?

We have now clarified throughout that the penalised regression approaches were all penalised logistic regression models. We use standard logistic regression as a comparison to the penalised logistic regression models (i.e., the non-penalised comparison).

We have not considered all penalised regression methods, as mentioned in the limitations section. We do, however, believe that the models compared here are an inclusive selection of those used commonly throughout the literature, as reflected by their discussion in key texts (see James et al., 2013) and their selection in other comparative studies (see Pavlou et al., 2016). We have amended the text to read:

“Three penalised regression methods which conduct automatic feature selection and are commonly compared in the literature and discussed in standard statistical texts (James et al., 2013; Pavlou et al., 2016) are the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996), the adaptive LASSO (Zou, 2006) and the elastic-net (Zou & Hastie, 2005).”

Attachment

Submitted filename: PONE-D-20-18792_ReviewerFeedback_r1_201029.docx

Decision Letter 1

Yuka Kotozaki

9 Nov 2020

A comparison of penalised regression methods for informing the selection of predictive markers

PONE-D-20-18792R1

Dear Dr. Christopher Greenwood,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Yuka Kotozaki

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Acceptance letter

Yuka Kotozaki

11 Nov 2020

PONE-D-20-18792R1

A comparison of penalised regression methods for informing the selection of predictive markers

Dear Dr. Greenwood:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Yuka Kotozaki

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Description of adolescent indicators.

    Note: a = approach to combine repeated measures data; SR = Self report, PR = Parent report; SMFQ = Short Mood and Feelings Questionnaire [39], RBPC = Revised Behaviour Problem Checklist [40], RCMAS = Revised Children’s Manifest Anxiety Scale [41], SRED = Self-Report Early Delinquency Instrument [42], SSRS = Social Skills Rating System [43], CSEI = Coopersmith Self-Esteem Inventory [44], PIES = Psychosocial Inventory of Ego Strengths [45], ACER SLQ = ACER School Life Questionnaire [46], OSBS = O’Donnell School Bonding Scale [47], IPPA = Inventory of Parent and Peer Attachment [48], CBQ = Conflict Behaviour Questionnaire [49], FACES II = Family Adaptability and Cohesion Evaluation Scale [50], RAS = Relationship Assessment Scale [51], OHS = Overt Hostility Scale [52], ZTAS = Zuckerman’s Thrill and Adventure Seeking Scale [53], FFPQ = Five Factor Personality Questionnaire [54], GBFM = Goldberg’s Big Five Markers [55], SATI = School Age Temperament Inventory [56], more information on ATP derived scales can be found in Vassallo and Sanson [25].

    (DOCX)

    Attachment

    Submitted filename: PONE-D-20-18792_ReviewerFeedback_r1_201029.docx

    Data Availability Statement

    Ethics approvals for this study do not permit these potentially re-identifiable participant data to be made publicly available. Enquires about collaboration are possible through our institutional data access protocol: https://lifecourse.melbournechildrens.com/data-access/. The current institutional body responsible for ethical approval is The Royal Children’s Hospital Human Research Ethics Committee.

