PLOS ONE. 2022 Sep 21;17(9):e0274998. doi: 10.1371/journal.pone.0274998

Using machine learning to determine the shared and unique risk factors for marijuana use among child-welfare versus community adolescents

Sonya Negriff 1,‡,*, Bistra Dilkina 2, Laksh Matai 2, Eric Rice 3
Editor: Carlos Andres Trujillo 4
PMCID: PMC9491564  PMID: 36129944

Abstract

Objective

This study used machine learning (ML) to test an empirically derived set of risk factors for marijuana use. Models were built separately for child welfare (CW) and non-CW adolescents in order to compare the variables selected as important features/risk factors.

Method

Data were from Time 4 (M age = 18.22) of a longitudinal study of the effects of maltreatment on adolescent development (n = 350; CW = 222; non-CW = 128; 56% male). Marijuana use in the past 12 months (none versus any) was obtained from a single self-report item. Risk factors entered into the model included mental health, parent/family social support, peer risk behavior, self-reported risk behavior, self-esteem, and self-reported adversities (e.g., abuse, neglect, witnessing family violence or community violence).

Results

The ML approaches indicated 80% accuracy in predicting marijuana use in the CW group and 85% accuracy in the non-CW group. In addition, the top features differed for the CW and non-CW groups, with peer marijuana use emerging as the most important risk factor for CW youth, whereas externalizing behavior was the most important for the non-CW group. The most important risk factor common to both groups was gender, with males having higher risk.

Conclusions

This is the first study to examine the shared and unique risk factors for marijuana use for CW and non-CW youth using a machine learning approach. The results support our assertion that there may be similar risk factors for both groups, but there are also risks unique to each population. Therefore, risk factors derived from normative populations may not have the same importance when used for CW youth. These differences should be considered in clinical practice when assessing risk for substance use among adolescents.

Introduction

Child maltreatment and subsequent involvement with child welfare (CW) is a significant issue that affects a large number of youth in the US [1]. In addition to the wide-ranging effects on physical and mental health [2–5], child maltreatment is a known risk for alcohol, marijuana and illicit drug use in adolescence and adulthood [6–11]. While not all children who are maltreated come to the attention of CW, those who do enter CW have higher rates of substance use than the general population [12, 13], highlighting the need to understand the specific risk factors in this vulnerable population. In addition, adolescence is a key developmental period for prevention, as data indicate that 90% of adults who meet the criteria for addiction initiated use of alcohol or drugs in adolescence [14–16]. Identifying the risk factors for early substance use among adolescents will help prevent future abuse and alleviate the economic toll of substance use/abuse [17–20].

Adolescence is a key developmental period for the initiation of substance use

The confluence of biological, social, and cognitive changes that occur during adolescence increases the potential risk for initiation and prolonged use of substances [21]. Foremost, brain development during adolescence is primed for risk-taking behaviors, with executive function in the prefrontal cortex lagging behind the increased growth of the reward and sensation-seeking regions [21]. Evidence also indicates that the adolescent brain is more sensitive to the addictive properties of nicotine, alcohol and other drugs, increasing the propensity for addiction [22, 23]. Social influences also increase vulnerability; susceptibility to peer pressure peaks in mid-adolescence and peer substance use is a known risk factor for substance use [24]. Early timing of puberty is also associated with higher risk of substance use through initiation of sexual behavior and exposure to older peers [25]. According to the Centers for Disease Control and Prevention, 90% of adults who meet the criteria for addiction initiated use of alcohol or drugs in adolescence [14–16], highlighting this developmental period as a key time for prevention. More specifically, evidence indicates that individuals who initiate marijuana use in early adolescence are more likely to be prolonged users and progress to marijuana dependence [26]. Similarly, early alcohol misuse has been linked with abuse in adulthood. The effects of substance use in adolescence include injury [27], unintended pregnancy [28], mental health problems [29, 30], impaired brain function [31–33], reduced academic performance [34, 35], and criminal involvement [36]. The economic toll of substance use/abuse is estimated to be over $740 billion annually as a result of accidents, health care, homelessness, unemployment, and criminal activity [17–20].

Predictors of early substance use for youth involved with child welfare

There is a substantial body of evidence addressing factors that predict alcohol, marijuana, and illicit drug use among community samples of adolescents [37], yet the relative contributions of these risk factors to substance use for CW youth have largely been examined with samples composed only of CW-involved youth (no comparison group) [38]. In the only known comparison of CW and non-CW youth, Fettes and colleagues [12, 39] used the National Longitudinal Study on Adolescent Health and the National Survey of Child and Adolescent Well-Being (NSCAW) to examine the known risks for marijuana, inhalant, and other illicit drug use between CW and non-CW-involved youth. Results showed that parental closeness and parental education predicted current substance use among CW youth but not the community sample, and two-parent household predicted lifetime and current use among the community sample [39]. These findings demonstrate that the expected risk factors do not operate similarly for CW youth, necessitating further work to delineate the relative importance of known predictors of substance use for CW versus non-CW adolescents.

Machine learning predictive models for early substance use among high risk youth

The development of more accurate predictive analytics can provide new opportunities for the early detection of high-risk youth that can go beyond the identification of broad epidemiologic type categories such as gender, race, and CW involvement. There has been increased interest in, and much commentary on, developing predictive models that use Machine Learning (ML) to help hone intervention efforts in medicine [40–45], substance abuse [46], and child welfare [47]. It has been shown in various domains that the broader class of machine learning techniques is often able to achieve more accurate predictive models than conventional statistical models such as logistic regression. A significant advantage of ML methods is the ability to analyze large amounts of data and to uncover relationships that remain hidden to standard statistical techniques by allowing for more complex multi-way relationships between dependent and independent variables [48]. In addition, established strategies in machine learning, such as train/test splits and cross-validation, help with model calibration, prevent over-fitting, and measure performance in terms of generalization to unseen test data. Importantly, ML can advance our understanding of the predictors of early substance use using both data-driven and theory-driven selection of risk factors and can accommodate more potential predictors than conventional statistical techniques.

A number of studies have applied ML approaches to enhance our understanding of substance use, abuse, and relapse [46, 49–52]. However, the majority have used adult samples, with fewer focused on adolescents. In a study of Australian and Canadian adolescents, four distinct clusters of predictors (demographics, psychopathology, risk behaviors, personality) were used as predictors of alcohol use, with the personality and psychopathology clusters yielding the highest prediction accuracy indices [53]. Individually, sensation seeking, attention problems, prior alcohol use, and negative thinking were among the features with the highest predictive coefficients. A cohort study of nearly 700 adolescents included six features (demographics, family history, genetics, brain images, personality, and cognition) to predict current and future alcohol drinking [54]. Overall, stressful life events were found to contribute the most unique variance to the predictive model. While evidence suggests the utility of ML models for alcohol, cocaine, heroin, and substance use disorders, as of yet ML has not been applied to investigate marijuana use among adolescents.

The current study

Some studies [12, 39] have examined the same risk factors for substance use among CW-involved and non-CW-involved youth, albeit using different studies, and others have examined the factors specific to CW youth [38]. However, there are no studies delineating both the shared and unique risks for substance use among CW versus community youth within the same study. The extant evidence indicates the established risks for early substance use for community youth are not the same for CW youth. Therefore, further research is needed to pinpoint the risk factors specific to CW youth and those that may be shared across both CW and community youth. The current study used machine learning to test an empirically and theoretically derived set of risk factors for marijuana use. Models were built separately for CW and non-CW adolescents in order to compare the variables selected as important features/risk factors. The findings will enhance our understanding of the relative importance of risk variables among CW-involved youth to provide better information for screening into services.

Methods

Participants

Data were from the fourth assessment (T4) of an ongoing longitudinal study examining the effects of maltreatment on adolescent development. At Time 1 (T1), the sample was composed of 454 adolescents aged 9–13 years (241 males and 213 females). Time 2 (T2), Time 3 (T3), and Time 4 (T4) occurred on average 1, 2.7, and 7.2 years after baseline, respectively. Descriptives of the sample for baseline (T1) and T4 can be found in Table 1.

Table 1. Sample characteristics for Time 1 and 4.

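| Characteristic | Child welfare, Time 1 | Child welfare, Time 4 | Non-Child welfare, Time 1 | Non-Child welfare, Time 4 |
| --- | --- | --- | --- | --- |
| N | 303 | 222 | 151 | 128 |
| Age (std deviation) | 10.84 (1.15) | 18.28 (1.41) | 11.11 (1.15) | 18.15 (1.56) |
| Gender (%) | | | | |
| Male | 50 | 47 | 60 | 56 |
| Female | 50 | 53 | 40 | 44 |
| Ethnicity (%) | | | | |
| African American | 40 | 43 | 32 | 35 |
| Latino | 35 | 34 | 47 | 42 |
| White | 12 | 10 | 10 | 10 |
| Multi-racial | 13 | 13 | 11 | 13 |
| Living Arrangement (%) | | | | |
| With Parent | 52 | 56 | 93 | 85 |
| Foster Care or Extended Family | 48 | 24 | 7 | 3 |
| Without caregiver | n/a | 20 | n/a | 12 |
| Marijuana use (% ever used) | | 48.2 | | 41.4 |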

Recruitment

The participants in the child welfare group (N = 303) were recruited from active cases in the Children and Family Services (CFS) of a large west coast city. The inclusion criteria were: (1) a new referral to CFS in the preceding month for any type of maltreatment (e.g. neglect, physical abuse, sexual abuse, emotional abuse); (2) child age of 9–12 years (some turned 13 between scheduling and actual study visit); (3) child identified as Latino, African-American, or Caucasian (non-Latino); (4) child residing in one of 10 zip codes in a designated county at the time of referral to CFS. With the approval of CFS and the Institutional Review Board of the University of Southern California, caregivers of potential participants were contacted via postcard and asked to indicate their willingness to participate. Contact via mail was followed up by a phone call. Of the families referred by CFS, 77% agreed to participate.

The non-child welfare group (N = 151) was recruited using names from school lists of children aged 9–12 years residing in the same 10 zip codes as the maltreated sample. Caregivers of potential participants were sent a postcard and asked to indicate their interest in participating which was followed up by a phone call. Non-CW families confirmed that they had no previous or ongoing experience with child welfare agencies. Approximately 50% of the comparison families contacted agreed to participate.

Upon enrollment in the study, the CW and non-CW groups were compared on a number of demographic variables (see Table 1). The two groups were similar on age (CW M = 10.84 years, SD = 1.15; non-CW M = 11.11, SD = 1.15), gender (53% male), race (38% African American, 39% Latino, 12% Multi-racial, and 11% Caucasian), and neighborhood characteristics (low-income based on Census block information) [reference withheld for blind review]. However, they were different in terms of living arrangements. In the non-CW group 93% lived with a biological parent, whereas this was the case for only 52% of the CW group. The remainder of the CW group was living in foster care, which is not unusual for those adolescents involved with social services.

Retention

The retention rate between T1 and T4 was 77.5% (n = 352). Participants not seen at Time 4 were more likely to be in the CW group (OR = 2.45, p < .01) and male (OR = 1.86, p < .01).

Procedures

Assessments were conducted at an urban research university. After written consent from the caregiver and assent from the adolescent were obtained, they each completed the questionnaires and tasks in separate rooms. The measures used in the following analyses represent a subset of the questionnaires administered during the protocol. Both the child and caretaker were given remuneration compatible with the National Institutes of Health’s standard compensation rate for healthy volunteers. The Institutional Review Board of the University of Southern California reviewed and approved all study procedures.

Measures

Outcome

Marijuana use. Participants reported on their own marijuana use within the past 12 months via one item from the Adolescent Delinquency Questionnaire [ADQ; adapted from [55]]. Due to the needs of the prediction model, the number of times the adolescent used marijuana (0 to five or more) was re-coded as 0 (no use) or 1 (any use).

Risk factors

Demographics. Age was calculated from date of birth and date of interview (continuous), gender was given by the parent at enrollment in the study (male vs. female), and race/ethnicity (African American, Hispanic, Multi-racial and White) was reported by the parent using a demographic questionnaire. Race/ethnicity was entered as four separate variables, each indicating that particular race/ethnicity versus all others.
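The one-versus-all-others coding of race/ethnicity described above can be sketched as follows. This is only an illustration: the function name `one_vs_rest_indicators` and the example value are hypothetical and do not come from the study's code.

```python
# Category labels as listed in the text; the function itself is hypothetical.
CATEGORIES = ["African American", "Hispanic", "Multi-racial", "White"]

def one_vs_rest_indicators(value, categories=CATEGORIES):
    """Return one 0/1 indicator per category: 1 if `value` matches, else 0."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {cat: int(value == cat) for cat in categories}

row = one_vs_rest_indicators("Hispanic")
print(row)
```

Each participant therefore contributes four binary features, exactly one of which equals 1.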

Mental health symptoms. Symptoms of depression, anxiety, and post-traumatic stress were included as risk factors. Adolescents completed the 27-item Children’s Depression Inventory [56, 57]. They rated statements such as “I am sad all the time” and “I feel like crying every day” on a three-point scale, with the total score used in the analyses (range of possible scores = 0–54). The Cronbach’s alpha for T4 was .89. Symptoms of PTSD occurring in the past couple of months were assessed using the Youth Symptom Survey Checklist (Margolin G. The Youth Symptom Survey Checklist. Los Angeles, CA: Unpublished manuscript; 2000). This is a 17-item self-report measure of symptoms from the diagnostic criteria for PTSD found in the Diagnostic and Statistical Manual of Mental Disorders IV-TR, such as hyperarousal, avoidance/numbness, and re-experiencing. Answer options range from 1 = not at all to 4 = almost always. The total score was used for this analysis (17 items; α = .88) and can range from 17 to 68. The 39-item Multidimensional Anxiety Scale for Children [58] was used to measure anxiety symptoms. It has been found to have good internal consistency (range for subscales is .70–.89), good test-retest reliability, invariant factor structure across gender and age, and discriminant validity [58]. The nine items on the separation anxiety subscale (e.g., “I get scared when my parents go away”) were removed from the scale at T4 because they were developmentally inappropriate. Items such as “I feel tense or uptight” were rated from 0 to 3 (“never true about me” to “often true about me”), yielding a possible total score range of 0–90. Internal consistency reliability was .89 at T4.

Self-reported childhood maltreatment and adversities (self-reported ACEs). The Comprehensive Trauma Interview (CTI) [59] was used at Time 4 to assess self-reported exposure to maltreatment and adversities. The CTI assesses 19 different adverse experiences including parental divorce, parental incarceration, witnessing intimate partner violence (IPV), household substance use, death of parent, foster care placement or other parental separation, sexual abuse, physical abuse, emotional abuse, emotional neglect, and physical neglect. The CTI was administered via interview by a trained research assistant. Other studies have shown test-retest reliability ranging from .45-.76 depending on the maltreatment type [60, 61]. For the current analyses we used 7 individual items: witnessing IPV, household substance use, sexual abuse, physical abuse, emotional abuse, emotional neglect, and physical neglect (each coded 0 = no or 1 = yes). Community violence exposure was assessed with 19 items asking about witnessing violence “in your neighborhood or around your school” [62]. Items included “in the past year have you seen a person beaten up without a weapon” on a scale from never = 0 to more than 8 times = 4. The items were summed to create a composite score for exposure.

Risk behavior. Sexual behavior was measured using the Sexual Activity Questionnaire for Girls and Boys [63]. This questionnaire assesses a series of eleven sexual activities with a current boyfriend/girlfriend as well as a past partner or with anyone. Activities begin with holding hands, continue with kissing and heavy petting, and culminate in sexual intercourse. The eleven sexual behavior items were summed (no = 0, yes = 1) to create a composite score of sexual behavior, with higher scores indicating more advanced sexual behavior. Alcohol use was measured using the ADQ with one item indicating “how many times in the past 12 months have you passed out drunk”. The Youth Self Report was used to measure externalizing behavior [64]. The externalizing subscale is composed of aggression (17 items) and rule-breaking/delinquency (12 items). Each item is rated from 0 to 2 (“not at all” to “a lot”) with a possible range of 0–58. Cronbach’s alpha was .89 at T4. The participants also reported on their own delinquent behaviors within the past 12 months via 23 items from the Adolescent Delinquency Questionnaire (ADQ; adapted from [55]). Computerized administration was used to ensure participant confidentiality. For the present study three scales were used: status offenses (6 items, e.g. “run away from home”, α = .72-.74), person offenses (7 items, e.g. “carried a hidden weapon”, α = .77-.83), and property offenses (10 items, e.g. “damaged or destroyed someone else’s property on purpose”, α = .88-.92). The three scales were summed to create a composite score for delinquency.

Peer risk behavior. Participants reported on the delinquency and substance use of their peers within the past 12 months. Similar to the adolescent self-report, they were asked “how many of your friends or people your age you know have done this in the past 12 months”. Answer options were 0 = none, 1 = some, 2 = a lot. Three scales of delinquency were used (status offenses, 6 items, α = .72-.80; person offenses, 7 items, α = .77-.82; property offenses, 10 items, α = .85-.90) and summed to create a composite score for peer delinquency. One item was used to assess peer marijuana use and one to assess peer alcohol use (“how many of your friends or people your age you know have used marijuana in the past 12 months” and “how many of your friends or people your age you know have had an alcoholic drink in the past 12 months”). The same answer options were used as with the delinquency items.

Parent/family social support. The Hill intimacy scale was included to assess the degree of intimacy/closeness with parents [65]. The adolescent is asked to answer about their mother, or the person who acted most like their mother, and a second time about their father or father figure. There are eight questions such as “how much do you go to your mother for advice or support” answered on a three-point scale: none, some, or a lot. The Cronbach’s alpha for the mother scale was .85 and for the father scale was .91. The two scores were averaged; if one of the two was missing, the available score was used on its own. Parental monitoring was assessed using the AddHealth Parental Monitoring questions [66]. This scale is composed of six questions such as “How often do you tell your parent(s) who you were going out with?” rated on a scale from 0 = never to 4 = always. The total score was used in analyses and the Cronbach’s alpha was .82 at T4.
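The scoring rule for the parental closeness composite (average the mother and father scores, or use whichever one is available) might look like the sketch below. The function name `parental_closeness` and the use of `None` to mark a missing score are assumptions made for illustration.

```python
def parental_closeness(mother_score, father_score):
    """Average the two scores; if one is missing (None), use the other alone."""
    available = [s for s in (mother_score, father_score) if s is not None]
    if not available:
        return None  # neither score was reported
    return sum(available) / len(available)

print(parental_closeness(2.0, 3.0))   # both reported: 2.5
print(parental_closeness(2.0, None))  # father score missing: 2.0
```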

Self-esteem. Two components of self-esteem were used, global self-worth and self-image [67]. The Self-Perception Profile for Adolescents (SPPA) [68] is a widely used self-report measure assessing global self-worth (synonymous with self-esteem) [69–71]. The current study collected six of the original eight subscales: athletic competence, scholastic competence, social competence, behavioral conduct, self-acceptance, and close friendship. The scales were summed in order to create a composite score of global self-worth (α = .85). Self-image was assessed using two subscales (body image and mastery/coping) of the Self-Image Questionnaire for Young Adolescents (SIQYA) [72]. This self-report measure is designed for children ages 11 to 15 years and these subscales were selected because of their particular relevance to adolescents. The total sum scale had an internal consistency reliability of α = .85.

Data analysis

The small number of missing values in the dataset (see Table 2) was addressed using multiple imputation (MI). Although missingness was below 2% for each individual variable, listwise deletion (as required by our ML models) would have resulted in dropping 10% of the total sample. As such, we determined that MI was necessary to account for potential bias in missingness. Five imputed datasets were created using the MI function in SPSS 25.0.

Table 2. Individual predictor variables (features), domains, and descriptives.

| Feature Domain | Feature name | Coding | Range | % missingness | CW: Bivariate correlation with MJ use | nonCW: Bivariate correlation with MJ use |
| --- | --- | --- | --- | --- | --- | --- |
| Demographics | Race | white = 0, minority = 1 | [0, 1] | none | .096 | .230** |
| Demographics | Gender | female = 0, male = 1 | [0, 1] | none | -.168* | -.166 |
| Demographics | Age | | [14.71, 22.66] | none | .154* | .016 |
| Mental health | Anxiety | continuous | [0, 85] | 0.57% | -.097 | .033 |
| Mental health | Depression | continuous | [2, 40] | 0.57% | .070 | .146 |
| Mental health | PTSD | continuous | [17, 66] | 1.70% | .050 | .317** |
| Parent/family social support | Parental closeness | continuous (low = risk) | [1, 3] | none | .023 | -.297** |
| Parent/family social support | Parental monitoring | continuous (low = risk) | [0, 24] | none | -.258** | -.158 |
| Parent/family social support | Social support | continuous (low = risk) | [1, 5] | 0.85% | -.092 | -.196 |
| Peer risk behavior | Peer delinquency | continuous | [0, 46] | none | .286** | .250** |
| Peer risk behavior | Peer alcohol use | 0 = none, 1 = some, 2 = a lot | [0, 1, 2] | 1.14% | .203** | .129 |
| Peer risk behavior | Peer marijuana use | 0 = none, 1 = some, 2 = a lot | [0, 1, 2] | 1.14% | .380** | .232** |
| Risk behavior | Sexual activity | score 0–11 | [0, 11] | 1.99% | .261** | .265** |
| Risk behavior | Delinquency | continuous | [0, 30.5] | none | .085 | .384** |
| Risk behavior | Externalizing | continuous | [0, 36] | 1.70% | .165* | .461** |
| Self-esteem | Global self-worth | continuous (low = risk) | [8, 20] | none | -.077 | -.215* |
| Self-esteem | Self-image | continuous (low = risk) | [38, 126] | 1.99% | .017 | -.093 |
| Self-report ACEs | Emotional abuse | no = 0; yes = 1 | [0, 1] | 1.42% | .025 | .303** |
| Self-report ACEs | Emotional neglect | no = 0; yes = 1 | [0, 1] | 1.42% | .079 | .167 |
| Self-report ACEs | Household substance use | no = 0; yes = 1 | [0, 1] | 1.42% | .038 | .095 |
| Self-report ACEs | Physical abuse | no = 0; yes = 1 | [0, 1] | 1.42% | .122 | .306** |
| Self-report ACEs | Physical neglect | no = 0; yes = 1 | [0, 1] | 1.42% | -.083 | .207* |
| Self-report ACEs | Sexual abuse | no = 0; yes = 1 | [0, 1] | 1.42% | .131 | .219* |
| Self-report ACEs | Witnessing IPV | no = 0; yes = 1 | [0, 1] | 1.42% | .076 | .098 |
| Self-report ACEs | Witnessing community violence | continuous | [0, 32] | 1.14% | .247** | .330** |

Note: CW = child welfare; ACEs = adverse childhood experiences; MJ = marijuana; IPV = intimate partner violence.

**p < .01, *p < .05

Following MI, all non-binary predictor variables were standardized to allow for better interpretability of the coefficients. This was followed by a multi-step analysis plan to establish the best performing ML model and to identify the top features contributing to its performance. It should be noted that in the context of ML, the terms “variables” and “features” are used interchangeably.
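The standardization step can be illustrated with a minimal sketch. It assumes z-scoring with the population standard deviation and treats any column containing only 0 and 1 as binary; the function name and toy values are hypothetical, and the original analysis may have used a different routine.

```python
from statistics import mean, pstdev

def standardize_non_binary(columns):
    """Z-score every column that is not purely 0/1; leave binary columns as-is."""
    out = {}
    for name, values in columns.items():
        if set(values) <= {0, 1}:
            out[name] = list(values)  # binary feature: keep the raw coding
            continue
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            raise ValueError(f"column {name!r} has zero variance")
        out[name] = [(v - mu) / sigma for v in values]
    return out

# Toy data, not study data.
toy = {"gender": [0, 1, 1, 0], "age": [16, 18, 20, 18]}
scaled = standardize_non_binary(toy)
print(scaled["gender"])
print([round(z, 3) for z in scaled["age"]])
```

After this step each continuous feature has mean 0 and standard deviation 1, so a coefficient reflects the change associated with a one-standard-deviation difference in that feature.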

Our analytic approach used a traditional statistical method (binary logistic regression) as well as ML approaches: Lasso and Support Vector Machines (SVM). Logistic regression is a well-known and often used statistical technique for binary classification [73] that is easy to interpret but has limited model capacity and suffers from overfitting when many features are considered. Lasso regularization often helps in overcoming overfitting [74]. It achieves this by adding a penalty proportional to the sum of the absolute values of the coefficients. Support Vector Machines [75] are widely used and very successful models for classification. SVM interprets the predictor values for each data point (i.e., person in our study) as a vector of coordinates in p-dimensional space and searches for a (p-1)-dimensional hyperplane in that space that separates the points belonging to the two classes with the largest margin or gap possible. In linear SVM, one can interpret the magnitude and sign of the coefficients of the linear hyperplane similarly to the coefficients in logistic regression and Lasso [76]. All ML analyses were performed in Python using the package Scikit-learn [77]. We used the area under the receiver operating characteristic (ROC) curve (AUC, or C-statistic) as our main metric of goodness of fit, where generally a value higher than 0.7 designates a good model, and higher than 0.8 a strong model [78]. The ROC curve shows the tradeoff between true positives (sensitivity) and false positives (1-specificity) at all possible thresholds, and hence the area under the ROC curve measures the overall accuracy of the model without choosing a specific threshold. In addition, we report precision and recall at the threshold of 0.5. Precision measures the fraction of the youth who were indeed marijuana users among those predicted to use marijuana, while recall (sensitivity) measures the fraction of all youth who used marijuana that the model actually predicted as such. For each of the three approaches (logistic regression, Lasso, SVM), we performed the following steps.
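As a rough illustration of the three approaches and the three metrics, the following scikit-learn sketch fits each model on synthetic data. Several choices here are assumptions rather than details from the study: the data come from `make_classification`, "Lasso" is taken to mean L1-penalized logistic regression, the `C` values are arbitrary, a single train/test split is used for brevity, and the 0.5 threshold is implemented as a decision value of at least 0 (equivalent to a predicted probability of 0.5 for the logistic models, and the margin's sign for the SVM).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the study data).
X, y = make_classification(n_samples=350, n_features=25, n_informative=6,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Standardize using statistics from the training portion only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    # A very large C makes the default L2 penalty negligible.
    "logistic": LogisticRegression(C=1e6, max_iter=5000),
    # L1 penalty on the coefficients (assumed meaning of "Lasso" here).
    "lasso": LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
    "svm": SVC(kernel="linear", C=1.0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.decision_function(X_test)   # signed distance / log-odds
    predicted = (scores >= 0).astype(int)      # assumed 0.5-equivalent cut-off
    results[name] = {
        "auc": roc_auc_score(y_test, scores),
        "precision": precision_score(y_test, predicted),
        "recall": recall_score(y_test, predicted),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 2) for k, v in metrics.items()})
```

AUC is computed from the continuous scores, so it does not depend on the cut-off, whereas precision and recall change if the cut-off is moved.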

First, due to the large number of potential features, we performed feature selection to remove redundant features that might degrade model performance. We used a technique called Backward Feature Selection [79], which starts with all features and iteratively removes one feature/variable at a time. At each step, the predictive performance is evaluated with each single feature removed in turn, and the feature whose removal results in the best AUC is dropped. The procedure is then repeated on the remaining features, again selecting the removal that yields the best AUC. This process stops when only one feature remains. The method returns the number and names of the features that resulted in the best AUC over these iterative steps.
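The backward selection loop described above can be sketched generically. In the sketch, `score_fn` stands in for whatever evaluation is used (for example, cross-validated AUC); the toy scoring function and feature names are invented purely so the example runs, and tie-breaking is an arbitrary choice.

```python
def backward_feature_selection(features, score_fn):
    """Greedy backward elimination.

    `score_fn(subset)` returns a score (e.g. cross-validated AUC) for a tuple
    of feature names. Returns the best-scoring subset seen and its score.
    """
    current = tuple(features)
    best_subset, best_score = current, score_fn(current)
    while len(current) > 1:
        # Score every subset obtained by removing exactly one feature.
        scored = [
            (score_fn(tuple(f for f in current if f != drop)),
             tuple(f for f in current if f != drop))
            for drop in current
        ]
        # Keep the removal that gives the highest score, then continue.
        score, current = max(scored, key=lambda pair: pair[0])
        if score > best_score:
            best_subset, best_score = current, score
    return best_subset, best_score

# Hypothetical per-feature contributions, used only to make the demo run.
TOY_GAIN = {"peer_use": 0.20, "monitoring": 0.08, "noise_a": -0.02, "noise_b": -0.01}

def toy_auc(subset):
    return 0.5 + sum(TOY_GAIN[f] for f in subset)

subset, score = backward_feature_selection(list(TOY_GAIN), toy_auc)
print(subset, round(score, 2))
```

In this toy run the two noise features are removed first, and the search keeps going down to a single feature but reports the two-feature subset because it achieved the highest score along the way.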

Next, k-fold cross-validation was used to evaluate model performance. This validation technique ensures that every data point is part of the test set exactly once and alleviates possible sensitivity to selecting a single split of the data into a training and a test subset. It does this by splitting the data into k groups. Each group in turn is held out as a test set while the others are combined and used for training the model. The model’s performance on each test set is retained, and the final score is the average performance across all k groups. Since we have 352 participants, the value of k was chosen to be 5 for all experiments, which means approximately 70 participants were part of the test set at each validation.
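A minimal sketch of the fold construction, using only the standard library, shows where the figure of roughly 70 test participants per fold comes from. The shuffling seed and the function name are arbitrary assumptions.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, k)  # the first `extra` folds get one more
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

folds = k_fold_indices(352, k=5)
print([len(f) for f in folds])  # prints [71, 71, 70, 70, 70]

for test_idx in folds:
    # Everything outside the held-out fold is used for training.
    train_idx = [i for fold in folds if fold is not test_idx for i in fold]
    assert len(train_idx) + len(test_idx) == 352
```

Every index appears in exactly one test fold, and the reported score is the average of the per-fold scores.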

To determine which features contributed most strongly to the model performance we performed Permutation Feature Importance (PFI) analysis [80, 81]. PFI is a widely used technique for calculating feature importance that is model-agnostic, i.e., it works for any of the predictive approaches [80, 81]. It randomly permutes a single feature/predictor in the validation dataset, leaving all the other features intact, and then computes the AUC of the model on this permuted validation set. The PFI value of a feature is the resulting drop in AUC. The larger the decrease from the original AUC, the higher the rank importance of the feature. Following PFI, we averaged (across k = 5 folds) the linear coefficients for those features with PFI ≥0.005 (≥.5% drop in AUC) in order to understand the positive or negative effect of a feature on the predicted likelihood of marijuana use. Finally, for each of the three approaches, we computed the average k-fold AUC across the five imputations to determine the most accurate model for each of the two groups (CW and non-CW).
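The PFI computation can be sketched without any particular model library. Here `predict` is any function that maps a row of features to a score; the toy data, the trivial "model" that looks only at the first feature, and all names are assumptions for illustration.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Drop in AUC when each feature column is shuffled on the validation rows."""
    rng = random.Random(seed)
    baseline = auc(labels, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the outcome
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - auc(labels, [predict(r) for r in permuted]))
    return importances

# Toy validation set: feature 0 determines the label, feature 1 is pure noise.
data_rng = random.Random(1)
rows = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(100)]
labels = [int(r[0] > 0) for r in rows]

def toy_model(row):
    return row[0]  # hypothetical scorer that ignores feature 1

importances = permutation_importance(toy_model, rows, labels, n_features=2)
kept = [j for j, imp in enumerate(importances) if imp >= 0.005]
print([round(imp, 3) for imp in importances], kept)
```

Shuffling the feature the scorer relies on lowers the AUC substantially, while shuffling the ignored feature leaves it unchanged, so only the first feature passes the 0.005 cut-off.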

We performed sensitivity analyses to determine if there were differences between the original (non-imputed) and imputed data on the performance metrics for the ML models. The AUC was essentially the same for both the CW and non-CW groups in the non-imputed and imputed data across all three models (see S1 Table). Therefore, we report only the results from the imputed data.

Results

Descriptives

As shown in Table 1, 48.2% of the CW group and 41.4% of the non-CW group reported marijuana use in the past year. Bivariate correlations between the predictors and marijuana use for each group (Table 2) showed that in the CW youth, marijuana use was positively associated with peer delinquency, peer alcohol use, peer marijuana use, sexual activity, externalizing behavior, and witnessing community violence (p < .05). Additionally, there were negative associations of marijuana use with gender (female) and parental monitoring (p < .05). For the non-CW youth, marijuana use was positively correlated with race (minority), PTSD, peer delinquency, peer marijuana use, sexual activity, delinquency, externalizing, emotional abuse, physical abuse, physical neglect, sexual abuse, and witnessing community violence (p < .05). It was also negatively correlated with parental closeness, social support, and self-esteem competence (p < .05).

Machine learning models

CW group

The performance metrics for each of the three ML approaches can be found in Table 3. For the CW group, the AUC for all three models was very similar and indicated good model fit (logistic regression AUC = .79, Lasso AUC = .80, SVM AUC = .80). We also report precision and recall at the threshold of 0.5. Like AUC, precision and recall were very similar for all three models (see Table 3). As an example, the logistic regression model achieved 0.72 precision (the fraction of the youth who were indeed marijuana users among those predicted to use marijuana), while recall was 0.73 (the fraction of all youth who used marijuana that the model actually predicted as such). Because the AUC, precision, and recall for the three ML models were not appreciably different, we could not choose one model as superior over the others. Therefore, feature selection and permutation feature importance (PFI) were performed on all three and we retained only those features that met the PFI threshold of 0.005 across all three models. This strategy increases the robustness of the feature selection as it mitigates model-specific uncertainty. Using these criteria, eight features were retained (Fig 1a). The top feature was peer marijuana use, which reduced the AUC by 12–13% (across the three models) if dropped from the model. Other features in order of importance were parental monitoring, gender, sexual abuse, age, physical neglect, witnessing community violence, and parental closeness. The linear coefficients produced by the model (Fig 2a) indicated that six features were risk factors whereas parental monitoring and physical neglect were protective. Specifically, higher levels of peer marijuana use, male sex, self-reported sexual abuse, older age, witnessing community violence, and higher levels of parental closeness all predicted marijuana use. On the other hand, higher levels of parental monitoring and self-reported physical neglect predicted non-use.
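The retention rule (keep a feature only if its PFI meets the 0.005 threshold in all three models) amounts to a set intersection, as sketched below. The PFI values in the example are invented for illustration and are not results from the study.

```python
PFI_THRESHOLD = 0.005

def features_retained_by_all(pfi_by_model, threshold=PFI_THRESHOLD):
    """Return the features whose PFI meets the threshold in every model."""
    per_model = [
        {name for name, value in pfi.items() if value >= threshold}
        for pfi in pfi_by_model.values()
    ]
    return set.intersection(*per_model)

# Made-up PFI values for three models (not the paper's numbers).
example_pfi = {
    "logistic": {"peer_marijuana_use": 0.12, "age": 0.010, "anxiety": 0.006},
    "lasso":    {"peer_marijuana_use": 0.13, "age": 0.008, "anxiety": 0.001},
    "svm":      {"peer_marijuana_use": 0.13, "age": 0.007, "anxiety": 0.004},
}
retained = features_retained_by_all(example_pfi)
print(sorted(retained))
```

A feature that clears the threshold in only one or two models, such as the hypothetical `anxiety` entry here, is dropped.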

Table 3. Performance metrics for the three machine learning approaches.
Model | CW AUC | CW Precision | CW Recall | Non-CW AUC | Non-CW Precision | Non-CW Recall
Logistic Regression | 0.79 ± 0.004 | 0.72 ± 0.013 | 0.73 ± 0.009 | 0.87 ± 0.028 | 0.72 ± 0.003 | 0.68 ± 0.037
Lasso | 0.80 ± 0.001 | 0.71 ± 0.015 | 0.75 ± 0.018 | 0.85 ± 0.021 | 0.72 ± 0.007 | 0.66 ± 0.014
SVM | 0.80 ± 0.012 | 0.72 ± 0.01 | 0.74 ± 0.025 | 0.84 ± 0.010 | 0.73 ± 0.011 | 0.69 ± 0.057

Note: ± indicates the range across the 5 imputed datasets.
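
The three metrics reported in Table 3 can be illustrated with a short, dependency-free Python sketch. The labels and predicted probabilities below are invented for illustration, the function names are hypothetical, and a probability exactly equal to the 0.5 threshold is assumed to count as a predicted user. AUC is computed here as the proportion of (user, non-user) pairs in which the user receives the higher predicted probability, which is one standard way to define it.

```python
def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall after dichotomizing probabilities at `threshold`."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def auc(y_true, y_prob):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 1 = any marijuana use in the past 12 months, 0 = none.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

precision, recall = precision_recall(y_true, y_prob)
area = auc(y_true, y_prob)
print(precision, recall, area)
```

Unlike precision and recall, the AUC does not depend on the 0.5 threshold, which is why the two kinds of metric are reported side by side.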

Fig 1. Plot of individual predictors selected by model ranked by Permutation Feature Importance value for a) Child Welfare and b) non-Child Welfare groups.

Fig 2. Plot of individual predictors selected by model ranked by coefficient for a) Child Welfare and b) non-Child Welfare groups.

Non-CW group

The AUC for all three models was very similar and indicated good model fit (logistic regression AUC = .87, Lasso AUC = .85, SVM AUC = .84), with slightly better performance than the CW models. At a threshold of 0.5, the logistic regression model for non-CW achieved 0.72 precision (the fraction of youth predicted to use marijuana who were indeed marijuana users), while recall was 0.68 (the fraction of all youth who used marijuana that the model correctly predicted as users). Again, because the model performance metrics were very similar across all three models, we chose to retain only those features that achieved our cut-off of .005 for PFI across all three. This resulted in four features being retained (Fig 1b). In order of importance these were: externalizing behavior (9–12% drop in AUC if removed from the model), delinquency, gender, and parental closeness. The sign of the linear coefficients indicated that higher externalizing behavior, higher delinquency, and male gender were all risk factors for marijuana use, while higher levels of parental closeness were protective (Fig 2b).
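
How the sign of a linear coefficient maps to "risk" versus "protective" can be sketched as follows for the four retained non-CW features. Only the signs follow the text above; the magnitudes are invented, the feature names are hypothetical identifiers, and the sketch assumes the outcome is coded 1 = any use and gender is coded 1 = male.

```python
def classify_by_sign(coefficients):
    """Split features into risk (positive coefficient) and protective (negative),
    each ordered from largest to smallest absolute coefficient."""
    risk = sorted(
        (f for f, c in coefficients.items() if c > 0),
        key=lambda f: -coefficients[f],
    )
    protective = sorted(
        (f for f, c in coefficients.items() if c < 0),
        key=lambda f: coefficients[f],
    )
    return risk, protective


# Signs follow the reported direction of effect; magnitudes are made up.
non_cw_coefficients = {
    "externalizing_behavior": 0.8,
    "delinquency": 0.5,
    "male_gender": 0.4,
    "parental_closeness": -0.3,
}

risk, protective = classify_by_sign(non_cw_coefficients)
print("risk:", risk)
print("protective:", protective)
```

A positive coefficient raises the predicted probability of use as the feature increases, and a negative coefficient lowers it. The sign describes an association within the fitted model, not a causal effect.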

Discussion

The risks for substance use in adolescence have been extensively studied in normative populations; however, CW youth are a particularly vulnerable population and may have unique risks. Our results support this supposition by showing that although the CW and non-CW groups shared some key predictors of marijuana use, they also had a substantial number of unique key predictors, including a different top-ranked predictor. These findings demonstrate the importance of developing separate predictive models for these different populations of youth and show that generalizing results from normative populations may miss key predictors of marijuana use for CW youth.

Shared predictors for CW and non-CW groups

In an attempt to increase the robustness of the feature selection and reduce model-specific idiosyncrasies, we only retained those features that met our cut-off for PFI in all three models. Using this strategy, eight features were retained for the CW youth while four were retained for the non-CW youth. Of those features, only two were shared by both groups: male gender and parental closeness. While males have been shown to have higher risk for marijuana use in both normative [82–84] and CW samples [38], parental closeness had opposite associations in the CW versus non-CW groups. Higher parental closeness was a risk for the CW youth but protective for the non-CW youth. The expectation was that parental closeness would be protective in both groups [84, 85], yet, as others have shown, expected associations do not always hold in youth with maltreatment or CW experiences [86]. Maltreated youth are more likely to experience insecure attachments with their parents [87], and although they report closeness, it may reflect an unhealthy attachment style that may exacerbate vulnerability for risk behavior such as marijuana use. Future studies might consider how, in certain contexts, close family relationships may actually be detrimental, as in the case where parents might be abusing substances and modeling this behavior for the adolescent [88].
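
The comparison of retained features can be written out as a small set computation. The feature lists and the direction of each association (+1 = risk, -1 = protective) are taken from the Results; the variable names are hypothetical identifiers chosen for this sketch.

```python
# Direction of association for each retained feature (+1 = risk, -1 = protective).
cw = {
    "peer_marijuana_use": +1,
    "parental_monitoring": -1,
    "gender_male": +1,
    "sexual_abuse": +1,
    "age": +1,
    "physical_neglect": -1,
    "witnessing_community_violence": +1,
    "parental_closeness": +1,
}
non_cw = {
    "externalizing_behavior": +1,
    "delinquency": +1,
    "gender_male": +1,
    "parental_closeness": -1,
}

shared = sorted(cw.keys() & non_cw.keys())        # retained in both groups
cw_only = sorted(cw.keys() - non_cw.keys())       # unique to CW
non_cw_only = sorted(non_cw.keys() - cw.keys())   # unique to non-CW
# Shared features whose direction differs between the groups.
opposite = [f for f in shared if cw[f] != non_cw[f]]

print("shared:", shared)
print("opposite direction:", opposite)
print("CW only:", cw_only)
print("non-CW only:", non_cw_only)
```

Laid out this way, the overlap is small (two of the ten distinct features), and one of the two shared features, parental closeness, points in opposite directions in the two groups.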

Differences between CW and non-CW groups

The top two features of importance for the CW group were peer marijuana use and parental monitoring, while for non-CW youth they were externalizing problems and delinquency. All of these variables have been found to be consistent predictors of adolescent substance use among both CW and non-CW adolescents, with parental monitoring being protective [38, 82, 83]. Interestingly, although externalizing and delinquency were included in the initial set of variables considered by the model for both groups, they were not retained in the final set of important features for the CW group. This result suggests that behavior problems may be less important for predicting marijuana use among CW youth and that peer behavior may be more critical to assess. This contrasts with findings from the NSCAW data, in which delinquency was a risk factor [89], and with Aarons et al. (2008), in which externalizing problems were a predictor of substance use for CW youth. Our results may diverge in part because we specifically tested marijuana use as our outcome rather than a combination of alcohol, marijuana, and hard drugs, or because of other nuances of our study design. Importantly, because of our use of machine learning, we were able to include far more predictors in the model at one time than prior studies. This may yield different associations with outcomes, as the relative importance of each variable is assessed in combination with all others rather than just in a small subset of possible risk factors.

Our results also indicate that CW youth’s marijuana use was predicted by older age, self-reported sexual abuse, self-reported physical neglect, and witnessing community violence. Prior evidence supports these variables as risk factors [11, 90–93]. However, the negative coefficient for physical neglect is unexpected and may be an artifact of the particular combination of variables in the model and potential collinearity between predictors. As evidence accumulates on predictors of substance use among CW youth, it is clear that no consistent pattern of risks has emerged. In a systematic review of studies examining predictors of substance use among current and former foster youth, over 15 different variables emerged as predictors across studies [38]. While the strength of our machine learning approach is the ability to test a large number of potential variables without concern for error associated with multiple statistical tests, it is clear that more work needs to be done with larger datasets to converge on a clear set of risk factors.

Limitations

A limitation of this study is that we did not examine CW-specific predictors that have been shown to be important for substance abuse risk (e.g., placement history, number of referrals, caregiver type). For example, foster care placement is associated with five times higher risk for substance abuse compared to youth who were allowed to remain in their home of origin [94]. We did not include these variables because we were trying to keep the predictors the same across both groups. An additional limitation is that we used a binary variable for our outcome (use/no use). This does not allow examination of the full range of potential use, since youth reporting one use in the past 12 months are combined with those reporting five or more uses. However, the identification of early initiation of marijuana use is clinically important for prevention of future substance abuse. We examined only concurrent predictors of marijuana use; it is possible that the predictive model would change if data from earlier timepoints were included. We chose concurrent data because these are likely to be the data available in practice settings and will be useful for assessing current risk for marijuana use. Our predictive modeling strategy is limited in its ability to infer causal or explanatory relationships; instead, it uses concurrent information to predict whether a given adolescent is a marijuana user. Finally, it is possible that unreported maltreatment may have occurred in our non-CW group. This is mitigated by our definition of the groups based on child welfare involvement, and we do not suggest generalizing the results to individuals who may have experienced maltreatment but were not reported to child welfare.

Conclusions

Substance use is a substantial public health problem, especially among CW youth. Our study is only the second to provide evidence regarding the comparability of risk factors for substance use among CW-involved versus non-CW-involved youth, but the first to do so within one study and using machine learning approaches. Our findings show that while there was some overlap in the most important risks for marijuana use among CW and non-CW youth, the order of importance differed, with peer marijuana use emerging as the top feature for CW youth and externalizing behavior for non-CW youth. In addition, more features were retained in the CW model than in the non-CW model, implying that a more complex interplay of risk factors is needed to accurately predict marijuana use for CW youth. The results also support our assertion that there are shared risk factors as well as features unique to each population. Therefore, risk factors derived from normative populations may not have the same predictive power when used for CW youth. These differences should be considered in clinical practice when assessing risk for substance use among high-risk adolescents.

Supporting information

S1 Table. Performance metrics for the three machine learning approaches for non-imputed (raw) data.

(DOCX)

Abbreviations

AUC

area under the curve

CPS

children and family services

CW

child welfare

IPV

intimate partner violence

ML

machine learning

PFI

Permutation Feature Importance

ROC

receiver operating characteristic

SVM

Support Vector Machines

Data Availability

There are ethical restrictions on sharing a de-identified dataset because it contains sensitive information about child welfare involvement and maltreatment experiences. Sharing a de-identified dataset on this small vulnerable group could potentially lead to identifiable information, given that the location of the study is specified in the manuscript and the dataset includes age, race, and gender. Requests may be sent to Julie Cederbaum PhD, MSW, the lead of the data access committee at USC School of Social Work, at jcederba@usc.edu.

Funding Statement

This study was funded by National Institutes of Health Grants R01HD39129 and R01DA024569 (to P.K. Trickett, Principal Investigator). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.U.S. Department of Health and Human Services, Child Maltreatment 2015. 2017, Administration for Children Youth and Families, Children’s Bureau: http://www.acf.hhs.gov/programs/cb/research-data-technology/statistics-research/child-maltreatment.
  • 2.Danese A. and Tan M., Childhood maltreatment and obesity: systematic review and meta-analysis. Molecular Psychiatry, 2014. 19(5): p. 544–554. doi: 10.1038/mp.2013.54 [DOI] [PubMed] [Google Scholar]
  • 3.Fang X., et al., The economic burden of child maltreatment in the United States and implications for prevention. Child Abuse & Neglect, 2012. 36(2): p. 156–165. doi: 10.1016/j.chiabu.2011.10.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Jonson-Reid M., Kohl P.L., and Drake B., Child and adult outcomes of chronic child maltreatment. Pediatrics, 2012. 129(5): p. 839–845. doi: 10.1542/peds.2011-2529 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Shonkoff J.P., et al., The lifelong effects of early childhood adversity and toxic stress. Pediatrics, 2012. 129(1): p. e232–e246. doi: 10.1542/peds.2011-2663 [DOI] [PubMed] [Google Scholar]
  • 6.Lewis T.L., et al., Internalizing Problems: A potential pathway from child maltreatment to adolescent smoking. Journal of Adolescent Health, 2011. 48(3): p. 247–252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Moran P.B., Vuchinich S., and Hall N.K., Associations between types of maltreatment and substance use during adolescence. Child Abuse & Neglect, 2004. 28(5): p. 565–574. doi: 10.1016/j.chiabu.2003.12.002 [DOI] [PubMed] [Google Scholar]
  • 8.Topitzes J., Mersky J.P., and Reynolds A.J., Child maltreatment and adult cigarette smoking: A long-term developmental model. Journal of Pediatric Psychology, 2010. 35(5): p. 484–498. doi: 10.1093/jpepsy/jsp119 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Mills R., et al., Alcohol and tobacco use among maltreated and non-maltreated adolescents in a birth cohort. Addiction, 2014. 109(4): p. 672–680. doi: 10.1111/add.12447 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Mills R., et al., Child maltreatment and cannabis use in young adulthood: a birth cohort study. Addiction, 2017. 112(3): p. 494–501. doi: 10.1111/add.13634 [DOI] [PubMed] [Google Scholar]
  • 11.Tonmyr L., et al., A review of childhood maltreatment and adolescent substance use relationship. Current Psychiatry Reviews, 2010. 6(3): p. 223–234. [Google Scholar]
  • 12.Fettes D.L. and Aarons G.A., Smoking behavior of US youths: a comparison between child welfare system and community populations. American Journal of Public Health, 2011. 101(12): p. 2342–2348. doi: 10.2105/AJPH.2011.300266 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Braciszewski J.M. and Colby S.M., Tobacco use among foster youth: Evidence of health disparities. Child and Youth Services Review, 2015. 58: p. 142–145. doi: 10.1016/j.childyouth.2015.09.017 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.US Department of Health and Human Services, The Health Consequences of Smoking-50 Years of Progress: A Report of the Surgeon General. 2014, Centers for Disease Control and Prevention, National Center for Chronic Disease Prevention and Health Promotion, Office on Smoking and Health: Atlanta.
  • 15.US Department of Health and Human Services, Preventing Tobacco Use Among Youth and Young Adults: A Report of the Surgeon General. 2012, Centers for Disease Control and Prevention, National Center for Chronic Disease Prevention and Health Promotion, Office on Smoking and Health: Atlanta.
  • 16.National Center on Addiction and Substance Abuse, Adolescent substance use: America’s #1 public health problem. 2011, New York: Columbia University.
  • 17.U. S. Department of Health and Human Services, The Health Consequences of Smoking-50 Years of Progress: A Report of the Surgeon General. 2014, Centers for Disease Control and Prevention, National Center for Chronic Disease Prevention and Health Promotion, Office on Smoking and Health: Atlanta.
  • 18.Centers for Disease Control and Prevention. Excessive drinking is draining the US Economy https://www.cdc.gov/features/costsofdrinking. [cited 2019 April 17].
  • 19.Xu X., et al., Annual Healthcare Spending Attributable to Cigarette Smoking: An Update. American Journal of Preventive Medicine, 2014. 48(3): p. 326–333. doi: 10.1016/j.amepre.2014.10.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Florence C., et al., The economic burden of prescription opioid overdose, abuse and dependence in the United States, 2013. Medical Care, 2016. 54(10): p. 901. doi: 10.1097/MLR.0000000000000625 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Steinberg L., Risk taking in adolescence: new perspectives from brain and behavioral science. Current Directions in Psychological Science, 2007. 16(2): p. 55–59. [Google Scholar]
  • 22.Chen C.Y., Storr C.L., and Anthony J.C., Early-onset drug use and risk for drug dependence problems. Addictive Behaviors, 2009. 34(3): p. 319–322. doi: 10.1016/j.addbeh.2008.10.021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Benowitz N.L., Nicotine addiction. New England Journal of Medicine, 2010. 362: p. 2295–2303. doi: 10.1056/NEJMra0809890 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Monahan K.C., Steinberg L., and Cauffman E., Affiliation with antisocial peers, susceptibility to peer influence, and antisocial behavior during the transition to adulthood. Developmental Psychology, 2009. 45(6): p. 1520–1530. doi: 10.1037/a0017417 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Negriff S., Brensilver M., and Trickett P.K., Elucidating the mechanisms linking early pubertal timing, sexual activity, and substance use for maltreated versus nonmaltreated adolescents. Journal of Adolescent Health, 2015. 56: p. 625–631. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Hyshka E., Applying a social determinants model of health perspective to early adolescent cannabis use—An Overview. Drugs: Education, Prevention and Policy, 2013. 20(2): p. 110–119. [Google Scholar]
  • 27.Substance Abuse and Mental Health Services Administration, Drug Abuse Warning Network (DAWN): Detailed tables: National estimates of drug-related emergency department visits, 2004–2009. 2010. [PubMed]
  • 28.Henry J. Kaiser Family Foundation, et al., National survey of adolescents and young adults: Sexual health knowledge, attitudes, and experiences. 2003, Henry J. Kaiser Family Foundation: Menlo Park, CA. [Google Scholar]
  • 29.Patton G.C., et al., Cannabis use and mental health in young people: cohort study. BMJ, 2002. 325: p. 1195–1198. doi: 10.1136/bmj.325.7374.1195 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Caspi A., et al., Moderation of the effect of adolescent-onset cannabis use on adult psychosis by a functional polymorphism in the catechol-O-methyltransferase gene: longitudinal evidence of a gene X environment interaction. Biological Psychiatry, 2005. 57(10): p. 1117–1127. doi: 10.1016/j.biopsych.2005.01.026 [DOI] [PubMed] [Google Scholar]
  • 31.Bava S., et al., Altered white matter microstructure in adolescent substance users. Psychiatry Research, 2009. 173(3): p. 228–237. doi: 10.1016/j.pscychresns.2009.04.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.McQueeny T., et al., Altered white matter integrity in adolescent binge drinkers. Alcoholism: Clinical and Experimental Research, 2009. 33(7): p. 1278–1285. doi: 10.1111/j.1530-0277.2009.00953.x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Jacobus J., et al., Functional consequences of marijuana use in adolescents. Pharmacology, Biochemistry and Behavior, 2009. 92(4): p. 559–565. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Martins S.S. and Alexandre P.K., The association of ecstasy use and academic achievement among adolescents in two U.S. national surveys. Addictive Behaviors, 2009. 34(1): p. 9–16. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Bray J.W., et al., The relationship between marijuana initiation and dropping out of high school. Health Economics, 2000. 9(1): p. 9–18. [DOI] [PubMed] [Google Scholar]
  • 36.Dawkins M.P., Drug use and violent crime among adolescents. Adolescence 1997. 32(126): p. 395–405. [PubMed] [Google Scholar]
  • 37.Trucco E.M., A review of psychosocial factors linked to adolescent substance use. Pharmacology Biochemistry and Behavior, 2020: p. 172969. doi: 10.1016/j.pbb.2020.172969 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Braciszewski J.M. and Stout R.L., Substance use among current and former foster youth: A systematic review. Children and Youth Services Review, 2012. 34(12): p. 2337–2344. doi: 10.1016/j.childyouth.2012.08.011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Fettes D.L., Aarons G.A., and Green A.E., Higher rates of adolescent substance use in child welfare versus community populations in the United States. Journal of Studies on Alcohol and Drugs, 2013. 74(6): p. 825–834. doi: 10.15288/jsad.2013.74.825 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Gulshan V., et al., Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 2016. 316(22): p. 2402–2410. doi: 10.1001/jama.2016.17216 [DOI] [PubMed] [Google Scholar]
  • 41.Gijsberts C.M., et al., Race/Ethnic differences in the associations of the Framingham risk factors with carotid IMT and cardiovascular events. PLoS One, 2015. 10(7): p. e0132321. doi: 10.1371/journal.pone.0132321 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Char D.S., Shah N.H., and Magnus D., Implementing Machine Learning in Health Care—Addressing Ethical Challenges. New England Journal of Medicine, 2018. 378(11): p. 981–983. doi: 10.1056/NEJMp1714229 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Verghese A., Shah N.H., and Harrington R.A., What this computer needs is a physician: humanism and artificial intelligence. JAMA, 2018. 319(1): p. 19–20. doi: 10.1001/jama.2017.19198 [DOI] [PubMed] [Google Scholar]
  • 44.Parikh R.B., Kakad M., and Bates D.W., Integrating predictive analytics into high-value care: the dawn of precision delivery. JAMA, 2016. 315(7): p. 651–652. doi: 10.1001/jama.2015.19417 [DOI] [PubMed] [Google Scholar]
  • 45.Beam A.L. and Kohane I.S., Big data and machine learning in health care. JAMA, 2018. 319(13): p. 1317–1318. doi: 10.1001/jama.2017.18391 [DOI] [PubMed] [Google Scholar]
  • 46.Ahn W.Y., et al., Utility of machine-learning approaches to identify behavioral markers for substance use disorders: impulsivity dimensions as predictors of current cocaine dependence. Frontiers in Psychiatry, 2016. 7: p. 34. doi: 10.3389/fpsyt.2016.00034 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Cuccaro-Alamin S., et al., Risk assessment and decision making in child protective services: Predictive risk modeling in context. Children and Youth Services Review, 2017. 79: p. 291–298. [Google Scholar]
  • 48.Bertsimas, D., A.K. O'Hair, and W.R. Pulleyblank, The Analytics Edge. 2016: Dynamic Ideas LLC.
  • 49.Acion L., et al., Use of a machine learning framework to predict substance use disorder treatment success. PloS one, 2017. 12(4): p. e0175383. doi: 10.1371/journal.pone.0175383 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Connor J.P., et al., The application of machine learning techniques as an adjunct to clinical decision making in alcohol dependence treatment. Substance Use & Misuse, 2007. 42(14): p. 2193–2206. doi: 10.1080/10826080701658125 [DOI] [PubMed] [Google Scholar]
  • 51.Hu Z., et al., Analysis of substance use and its outcomes by machine learning: II. Derivation and prediction of the trajectory of substance use severity. Drug and Alcohol Dependence, 2019: p. 107604. doi: 10.1016/j.drugalcdep.2019.107604 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Ahn W.Y. and Vassileva J., Machine-learning identifies substance-specific behavioral markers for opiate and stimulant dependence. Drug and Alcohol Dependence, 2016. 161: p. 247–257. doi: 10.1016/j.drugalcdep.2016.02.008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Afzali M.H., et al., Machine‐learning prediction of adolescent alcohol use: a cross‐study, cross‐cultural validation. Addiction, 2019. 114(4): p. 662–671. doi: 10.1111/add.14504 [DOI] [PubMed] [Google Scholar]
  • 54.Whelan R., et al., Neuropsychosocial profiles of current and future adolescent alcohol misusers. Nature, 2014. 512(7513): p. 185. doi: 10.1038/nature13402 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Huizinga D. and Elliott D.S., Reassessing the reliability and validity of self-report delinquency measures. Journal of Quantitative Criminology, 1986. 2(4): p. 293–327. [Google Scholar]
  • 56.Kovacs M., Rating scales to assess depression in school-aged children. Acta Paedopsychiatrica: International Journal of Child & Adolescent Psychiatry, 1981. 46(5–6): p. 305–315. [PubMed] [Google Scholar]
  • 57.Kovacs, M., Child Depression Inventory Manual. 1992, Toronto: Multi-Health System.
  • 58.March J.S., et al., The Multidimensional Anxiety Scale for Children (MASC): Factor structure, reliability, and validity. Journal of the American Academy of Child and Adolescent Psychiatry, 1997. 36: p. 554–565. doi: 10.1097/00004583-199704000-00019 [DOI] [PubMed] [Google Scholar]
  • 59.Noll J.G., et al., Revictimization and self-harm in females who experienced childhood sexual abuse: Results from a prospective study. Journal of Interpersonal Violence, 2003. 18: p. 1452–1471. doi: 10.1177/0886260503258035 [DOI] [PubMed] [Google Scholar]
  • 60.Fergusson D.M., Horwood L.J., and Woodward L.J., The stability of child abuse reports: A longitudinal study of the reporting behaviour of young adults. Psychological Medicine, 2000. 30(3): p. 529–544. doi: 10.1017/s0033291799002111 [DOI] [PubMed] [Google Scholar]
  • 61.Barnes J.E., et al., Sexual and physical revictimization among victims of severe childhood sexual abuse. Child Abuse & Neglect, 2009. 33: p. 412–420. doi: 10.1016/j.chiabu.2008.09.013 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Richters J.E. and Saltzman W., Childhood victimization and violent offending. Violence and Victims, 1990. 5: p. 19–35. [PubMed] [Google Scholar]
  • 63.Udry, R., The Sexual Activity Questionnaire for Girls and Boys. 1988: University of North Carolina, Chapel Hill.
  • 64.Achenbach, T.M. and L.A. Rescorla, Manual for the ASEBA School-Age Forms & Profiles. 2001, Burlington, VT: University of Vermont, Research Center for Children Youth and Families.
  • 65.Blythe D.A., Hill J.P., and Thiel K.S., Early adolescents’ significant others: Grade and gender differences in perceived relationships with familial and nonfamilial adults and young people. Journal of Youth and Adolescence, 1982. 11: p. 425–450. doi: 10.1007/BF01538805 [DOI] [PubMed] [Google Scholar]
  • 66.Carolina Population Center, National Study of Adolescent Health: Adolescent In-Home Questionnaire and Codebook. 1998, University of North Carolina at Chapel Hill: Chapel Hill.
  • 67.Harter S. and Leahy R.L., The construction of the self: A developmental perspective. 2001, Guilford Press. [Google Scholar]
  • 68.Harter S., Manual for the self-perception profile for adolescents. 1988, University of Denver: Denver, CO. [Google Scholar]
  • 69.Harter S., Waters P., and Whitesell N.R., Relational self‐worth: Differences in perceived worth as a person across interpersonal contexts among adolescents. Child Development, 1998. 69(3): p. 756–766. [PubMed] [Google Scholar]
  • 70.Harter S. and Whitesell N.R., Beyond the debate: Why some adolescents report stable self‐worth over time and situation, whereas others report changes in self‐worth. Journal of Personality, 2003. 71(6): p. 1027–1058. doi: 10.1111/1467-6494.7106006 [DOI] [PubMed] [Google Scholar]
  • 71.Harter S., The development of self-esteem, in Self-esteem issues and answers: A sourcebook of current perspectives, Kernis M.H., Editor. 2006, Psychology Press: New York. p. 144–150. [Google Scholar]
  • 72.Petersen A.C., et al., A self-image questionnaire for young adolescents (SIQYA): Reliability and validity studies. Journal of Youth Adolescence, 1984. 13(2): p. 93–111. doi: 10.1007/BF02089104 [DOI] [PubMed] [Google Scholar]
  • 73.Hosmer D.W. Jr, Lemeshow S., and Sturdivant R.X., Applied logistic regression. Vol. 398. 2013: John Wiley & Sons. [Google Scholar]
  • 74.Tibshirani R., Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 1996. 58(1): p. 267–288. [Google Scholar]
  • 75.Hearst M.A., et al., Support vector machines. IEEE Intelligent Systems and their Applications, 1998. 13(4): p. 18–28. [Google Scholar]
  • 76.Rakotomamonjy A., Variable selection using SVM-based criteria. Journal of Machine Learning Research, 2003. 3: p. 1357–1370. [Google Scholar]
  • 77.Pedregosa F., et al., Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 2011. 12: p. 2825–2830. [Google Scholar]
  • 78.Šimundić A., Measures of diagnostic accuracy: basic definitions. EJIFCC, 2009. 19(4): p. 203–211. [PMC free article] [PubMed] [Google Scholar]
  • 79.Guyon I. and Elisseeff A., An introduction to variable and feature selection. Journal of Machine Learning Research, 2003. 3(Mar): p. 1157–1182. [Google Scholar]
  • 80.Breiman L., Random forests. Machine Learning, 2001. 45(1): p. 5–32. [Google Scholar]
  • 81.Fisher, A., C. Rudin, and F. Dominici, Model Class Reliance: Variable importance measures for any machine learning model class, from the "Rashomon" perspective. arXiv preprint arXiv:1801.01489, 2018. http://arxiv.org/abs/1801.01489.
  • 82.Leung R.K., Toumbourou J.W., and Hemphill S.A., The effect of peer influence and selection processes on adolescent alcohol use: a systematic review of longitudinal studies. Health Psychology Review, 2014. 8(4): p. 426–457. doi: 10.1080/17437199.2011.587961 [DOI] [PubMed] [Google Scholar]
  • 83.Allen M., et al., Comparing the influence of parents and peers on the choice to use drugs: A meta-analytic summary of the literature. Criminal Justice and Behavior, 2003. 30(2): p. 163–186. [Google Scholar]
  • 84.Cleveland M.J., et al., The role of risk and protective factors in substance use across adolescence. Journal of Adolescent Health, 2008. 43(2): p. 157–164. doi: 10.1016/j.jadohealth.2008.01.015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Nash S.G., McQueen A., and Bray J.H., Pathways to adolescent alcohol use: Family environment, peer influence, and parental expectations. Journal of Adolescent Health, 2005. 37(1): p. 19–28. [DOI] [PubMed] [Google Scholar]
  • 86.Cicchetti D. and Toth S.L., Developmental processes in maltreated children, in Nebraska Symposium on Motivation Vol. 46, 1998: Motivation and child maltreatment, Hansen D.J., Editor. 2000, University of Nebraska Press: Lincoln, NE. p. 86–160. [PubMed] [Google Scholar]
  • 87.Baer J.C. and Martinez C.D., Child maltreatment and insecure attachment: a meta-analysis. Journal of reproductive and infant psychology, 2006. 24(3): p. 187–197. [Google Scholar]
  • 88.Cicchetti D. and Handley E.D., Child maltreatment and the development of substance use and disorder. Neurobiology of Stress, 2019. 10: p. 100144. doi: 10.1016/j.ynstr.2018.100144 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Traube D.E., et al., A national study of risk and protective factors for substance use among youth in the child welfare system. Addictive Behaviors, 2012. 37(5): p. 641–50. doi: 10.1016/j.addbeh.2012.01.015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Shin S.H., Hong H.G., and Hazen A.L., Childhood sexual abuse and adolescent substance use: a latent class analysis. Drug and Alcohol Dependence, 2010. 109(1–3): p. 226–235. doi: 10.1016/j.drugalcdep.2010.01.013 [DOI] [PubMed] [Google Scholar]
  • 91.Tubman J.G., et al., Maltreatment clusters among youth in outpatient substance abuse treatment: Co-occurring patterns of psychiatric symptoms and sexual risk behaviors. Archives of Sexual Behavior, 2011. 40(2): p. 301–309. doi: 10.1007/s10508-010-9699-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Löfving-Gupta S., et al., Community violence exposure and substance use: cross-cultural and gender perspectives. European Child & Adolescent Psychiatry, 2018. 27(4): p. 493–500. doi: 10.1007/s00787-017-1097-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Zinzow H.M., et al., Witnessed community and parental violence in relation to substance use and delinquency in a national sample of adolescents. Journal of Traumatic Stress, 2009. 22(6): p. 525–533. doi: 10.1002/jts.20469 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Pilowsky D.J. and Wu L.T., Psychiatric symptoms and substance use disorders in a nationally representative sample of American adolescents involved with foster care. Journal of Adolescent Health, 2006. 38(4): p. 351–358. [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Carlos Andres Trujillo

17 Feb 2022

PONE-D-21-15650

Using machine learning to determine the shared and unique risk factors for marijuana use among child-welfare versus community adolescents

PLOS ONE

Dear Dr. Negriff,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

I apologize for the unusually long time that the review process took, but I needed to secure good reviews both from the theoretical and methodological points of view. Both reviewers see potential in your manuscript and I agree, but there are several aspects that require improvement. Nonetheless, they don't identify unfixable flaws, hence, I am confident you can address all their comments. I will send your review to the same reviewers.

Please submit your revised manuscript by Apr 02 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Carlos Andres Trujillo, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following financial disclosure: "Funding acknowledgements: National Institutes of Health Grants R01HD39129 and R01DA024569 (to P.K. Trickett., Principal Investigator)."

Please state what role the funders took in the study.  If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript." 

If this statement is not correct you must amend it as needed. 

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

3. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. 

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

4. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well. 

5. We note that you have referenced "Margolin G. The Youth Symptom Survey Checklist. Los Angeles, CA: Unpublished manuscript; 2000" which has currently not yet been accepted for publication. Please remove this from your References and amend this to state in the body of your manuscript: "Margolin G. The Youth Symptom Survey Checklist. Los Angeles, CA: Unpublished manuscript; 2000" as detailed online in our guide for authors

http://journals.plos.org/plosone/s/submission-guidelines#loc-reference-style.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The study proposes a novel statistical approach to understanding the interplay of risk factors for marijuana use, comparing a child welfare population and a non-welfare population. The authors argue that there is little research addressing the specific risk factors that affect welfare children, and that they could differ from those of community children. The manuscript is well written and clearly explains the ML step by step, which helps the reader understand the analysis process.

When the authors introduce variables such as depression, anxiety and PTSD as risk factors, the age of onset of such psychological issues should be addressed, because those problems could be the result of drug use (and, marijuana not being the onset drug, this may be more probable); in that case they would not be risk factors, but outcomes. If the age of onset is not clear or the authors do not have it, then it should be addressed in the limitations section.

The researchers included the use of alcohol in their variables, but what they measure is abuse rather than use, which is different. They address alcohol use by asking how many times in the past 12 months the respondent has passed out drunk. Even if the answer is never, the adolescent may be drinking a lot without getting drunk, or may have been drunk but not necessarily passed out. It is clear why they have dichotomized the dependent variable (marijuana use), because of the model they are using, but they should also review what happens when the variable is kept with its full range of categories.

For better understanding, it is also important to explain how peer drug use was measured. The authors said it was measured through one item, but they do not explain how. It is important that, in the introduction, the researchers clarify whether they are really measuring self-esteem or are just measuring competence and body image. They should support this notion of self-esteem with literature arguing that this is self-esteem, and not self-efficacy or just self-competence. It is also important to discuss why these kinds of variables have no impact on the dependent variable in either of the two groups.

The use of ML to analyze risk and protective factors is really interesting, because it reveals new ways in which the risk and protective factors interact. Studying these factors through ML can be an ideal analytic approach to examining multiple variables such as risk and protective factors. The study has several strengths, including its data-driven approach to studying many factors that could influence marijuana use in both samples. One thing I would question about the paper is the way marijuana use was measured. The possible answers were some, a lot, none. I think this type of categorization can diminish the variance that could be found in these kinds of behaviors. It is also difficult to understand what the difference could be, for the adolescent, between a lot and some use.

The result that parental closeness is a risk factor for children in welfare is really surprising. Even though the authors offer some hypotheses, it could also be interesting to explore, or to propose for future research, whether, based perhaps on the social development model (Hawkins & Catalano), the family could in certain circumstances become a risk factor, because the child or adolescent has an affective attachment, so she follows her family's behaviors, and parents could be using drugs themselves. It could also be a way the adolescent reacts depending on her placement. The authors, in their limitations, said that they did not account for placement as a variable so they could pair both samples, but maybe that is one of the reasons for this result. I am not sure that the dichotomization of the race variable is accurate.

Another surprising result is that peer delinquency reduces marijuana use in the community sample. The researchers suggest that this variable had issues with its weight, but I think that the problem could be how the adolescent sees his peers, or the dichotomization of the dependent variable. The authors should explore this result a little further.

Finally, I don’t agree that black and Latino should necessarily share the same backgrounds or contextual variables.

Reviewer #2: PLoS One Review:

Using machine learning to determine the shared and unique risk factors for marijuana use among child-welfare versus community adolescents

1. I suggest more extension and detail in how decision trees work in ML techniques. No reference in lines 56-58: “Not only can ML provide potentially more accurate predictive models, but techniques such as decision trees can provide new insights into non-linear relationships.”.

2. Lines 82-83: “there are no studies delineating both the shared and unique risks” is the italic necessary?

3. Why did CW=child welfare and non-CW turn into Line 117: maltreated and comparison groups? Please have uniformity.

4. What does a “biracial” participant’s race mean?

5. Regarding the data analysis, why not present results before and after multiple imputation? The missingness proportion is less than 2%, but if you apply multiple imputation, the results should show how the data and results look before imputation.

6. Here is one serious concern with the analytical approach: Lines 242-243 describe how “generally a value higher than 0.7 designates a good model, and higher than 0.8 a strong model”. Table 3 presents AUC for all models between 0.79 and 0.83. That means all models (i.e., Logistic Regression, Lasso and SVM) are good or strong. Why do you opt for different models under small and marginal differences?

7. Following the previous comment in Line 237- 239: In linear SVM, one can interpret the magnitude and sign of the coefficient in the linear hyperplane similarly to the coefficients in Logistic Regression and Lasso. What references do you have for this affirmation? Also, it is unclear whether your study is a comparison (e.g., CW=child welfare and non-CW), but you might use non-comparable models (e.g., Logistic Regression vs SVM). Unusually, two different models are recognized for two independent samples under the same analysis. For CW, it is SVM, and for N-CW, it is the Logistic Regression. It is necessary to have a better explanation and justify with technical background why a model may be more applicable to one group and not to the other. This point needs a deep rationale for your data analysis proposal.

8. Line 272: Is it “perturbed” or “permuted”?

9. Why do you use the restrictions of the Lasso model and the technique called Backward Feature Selection? It seems like there are too many restrictions for the exploration of Line 46: “further work to delineate the relative importance of known predictors of substance use /…/”.

10. How sensitive is the Permutation Feature Importance (PFI) to the SVM and the logistic regression results?

11. It is not easy to interpret the values from the Permutation Feature Importance (PFI) analysis. However, if the arbitrary selection of models we asked about in point 7 is justifiable, at least all results in all models should be presented for a fair comparison.

12. Since the authors claim that this is a study “Line 396-398: to provide evidence regarding the comparability of risk factors for substance use among CW-involved versus non-CW-involved youth, but the first to so within one study and using machine learning approaches.”, it is capital to clarify the analytic concerns mentioned before. The richness of their approach will reside in how well the points 5 to 11 are corrected and rationalized.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Juan J Giraldo-Huertas

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: PLoS One Review_PONE-D-21-15650.docx

PLoS One. 2022 Sep 21;17(9):e0274998. doi: 10.1371/journal.pone.0274998.r002

Author response to Decision Letter 0


21 Jul 2022

RESPONSE TO REVIEWERS

We thank the editor and reviewers for their time and thoughtful comments. We have addressed all comments below and indicated where changes have been made in the manuscript.

Editor’s comments

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

RESPONSE: We have made sure that we followed the PLOS ONE style requirements.

2. Thank you for stating the following financial disclosure: "Funding acknowledgements: National Institutes of Health Grants R01HD39129 and R01DA024569 (to P.K. Trickett., Principal Investigator)."

Please state what role the funders took in the study. If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

RESPONSE: We have revised the funding statement in the manuscript to note the role of the funders. We have also included this in the cover letter.

3. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

RESPONSE: The data contain sensitive information about child welfare involvement and maltreatment experiences. Sharing a de-identified dataset on this small, vulnerable group could potentially expose identifiable information, given that the location of the study is specified in the manuscript and the dataset includes age, race, and gender. Requests may be sent to the lead of the data access committee at USC School of Social Work, Julie Cederbaum PhD, MSW. We have included this information in the cover letter.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

4. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

RESPONSE: We have included the ethics statement at the end of the procedures section. We already stated at the beginning of the procedures section that written consent was obtained from the caregiver and assent from the adolescent.

5. We note that you have referenced "Margolin G. The Youth Symptom Survey Checklist. Los Angeles, CA: Unpublished manuscript; 2000" which has currently not yet been accepted for publication. Please remove this from your References and amend this to state in the body of your manuscript: "Margolin G. The Youth Symptom Survey Checklist. Los Angeles, CA: Unpublished manuscript; 2000" as detailed online in our guide for authors

http://journals.plos.org/plosone/s/submission-guidelines#loc-reference-style.

RESPONSE: We have updated this reference as requested.

Reviewer #1:

The study proposes a novel statistical approach to understanding the interplay of risk factors for marijuana use, comparing a child welfare population and a non-welfare population. The authors argue that there is little research addressing the specific risk factors that affect welfare children, and that they could differ from those of community children. The manuscript is well written and clearly explains the ML step by step, which helps the reader understand the analysis process.

RESPONSE: We thank the reviewer for their positive appraisal of our manuscript.

When the authors introduce variables such as depression, anxiety and PTSD as risk factors, the age of onset of such psychological issues should be addressed, because those problems could be the result of drug use (and, marijuana not being the onset drug, this may be more probable); in that case they would not be risk factors, but outcomes. If the age of onset is not clear or the authors do not have it, then it should be addressed in the limitations section.

RESPONSE: We are using variables concurrent with our outcome to determine if information from the same timepoint can predict whether an adolescent will be a marijuana user. This is different than using these variables as potential causal factors which would necessitate that they precede the outcome. In predictive models we are not attempting to use the “predictor variables” as explanatory or causal features. Instead, the predictive model attempts to use the available information on other variables to make a prediction about the likelihood that a given adolescent is a marijuana user or not. In this type of model it does not matter if the drug use or mental health symptoms occurred first, it only matters if concurrently having mental health symptoms allows us to predict the outcome (marijuana use) with higher accuracy. We have added the limitation of this approach in the discussion.

The researchers included the use of alcohol in their variables, but what they measure is abuse rather than use, which is different. They address alcohol use by asking how many times in the past 12 months the respondent has passed out drunk. Even if the answer is never, the adolescent may be drinking a lot without getting drunk, or may have been drunk but not necessarily passed out.

RESPONSE: We thank the reviewer for this comment. We agree that alcohol use and our variable “passed out drunk” are very different. We had to choose between including more introductory alcohol use or more severe abuse. When we used alcohol use, the rates were very high and gave us little variability. This is in part why we chose the passed out drunk variable.

It is clear why they have dichotomized the dependent variable (marijuana use), because of the model they are using, but they should also review what happens when the variable is kept with its full range of categories.

RESPONSE: To maintain a categorical variable we would need to use different machine learning models, and this would not allow for comparison with the models we have already reported on. In addition, the dataset is too small to use 5 categories within the marijuana use variable; the cell sizes would limit the ability to predict the outcome. There are also clinical implications for using a dichotomous variable of use/no use, as any use in adolescence should be of concern and warrant further assessment or intervention.

For better understanding, it is also important to explain how peer drug use was measured. The authors said it was measured through one item, but they do not explain how.

RESPONSE: We have added clarification of these variables in the “Peer risk behavior” section of the measures.

It is important that, in the introduction, the researchers clarify whether they are really measuring self-esteem or are just measuring competence and body image. They should support this notion of self-esteem with literature arguing that this is self-esteem, and not self-efficacy or just self-competence. It is also important to discuss why these kinds of variables have no impact on the dependent variable in either of the two groups.

RESPONSE: We appreciate the opportunity to clarify these issues. In the Self-esteem section of the measures, the two questionnaires we included both measure aspects of self-esteem. Global self-worth is a construct synonymous with self-esteem and self-image is viewed as a component of self-esteem. We have included references to support this conceptualization. It is also not that these variables have no impact, but that other variables in the model are more important in predicting marijuana use. As seen in the bivariate correlations, higher global self-worth is protective for marijuana use.

The use of ML to analyze risk and protective factors is really interesting, because it reveals new ways in which the risk and protective factors interact. Studying these factors through ML can be an ideal analytic approach to examining multiple variables such as risk and protective factors.

RESPONSE: We thank the reviewer for highlighting this new way to analyze substance use in adolescence.

The study has several strengths, including its data-driven approach to study many factors that could influence marijuana use in both samples. One of the things that I could question about the paper is the way use of marijuana was measured. The possible answers were some, a lot, none. I think this type of categorization can diminish the possible variance that could be found in this kind of behaviors. It is also difficult to understand what the difference could be, for the adolescent, between a lot and some use.

RESPONSE: We thank the reviewer for raising this issue so that we can clarify. The outcome variable for marijuana use was dichotomized as ‘no use’ versus ‘any use’. It might be that the reviewer is referring to how the peer marijuana use was coded. The three answers referred to how many of their friends were using marijuana.

It is really surprising the result that parental closeness is a risk factor for children in welfare. Even though the authors give some hypothesis, it could also be interesting to explore, or to propose for future research if, maybe based on the social development model (Hawkins & Catalano), the family, in certain circumstances could become a risk factor, because the child or adolescent has an affective attachment, so she follows her family behaviors, and parents could be using drugs themselves. It could also be a way the adolescent reacts depending on her placement. The authors, in their limitations said that they did not account for placement as a variable so they could pair both samples, but maybe that is one of the reasons of this result.

RESPONSE: We thank the reviewer for this suggestion. We have added in the discussion section: “Future studies might consider how in certain contexts close family relationships may actually be detrimental, as in the case where parents might be abusing substances and modeling this behavior for the adolescent.”

I am not sure that the dichotomization of the race variable is accurate.

RESPONSE: We used this dichotomization because evidence indicates higher substance use rates among minority youth. However, we have completed new analyses with race included individually for White, Black, Hispanic, and Multi-racial. We did not find any of these variables to be important predictors of marijuana use. We now report on these findings.

Another surprising result is that peer delinquency reduces marijuana use in the community sample. The researchers suggest that this variable had issues with its weight, but I think that the problem could be how the adolescent sees his peers, or the dichotomization of the dependent variable. The authors should explore this result a little further.

RESPONSE: We agree that this was a surprising finding. However, in our updated analyses this does not emerge as an important feature and has been removed from the results.

Finally, I don’t agree that black and Latino should necessarily share the same backgrounds or contextual variables.

RESPONSE: We assume this is a reference to the race/ethnicity groups we created (white versus minority youth). If so, then we have addressed this by including individual race/ethnicity variables.

Reviewer 2

1. I suggest more extension and detail in how decision trees work in ML techniques. No reference in lines 56-58: “Not only can ML provide potentially more accurate predictive models, but techniques such as decision trees can provide new insights into non-linear relationships.”.

RESPONSE: Given that decision trees are not one of our actual analytic models, we have removed this from the introduction so as not to confuse the reader.

2. Lines 82-83: “there are no studies delineating both the shared and unique risks” is the italic necessary?

RESPONSE: We appreciate the question and agree that there does not need to be italics. It has been removed.

3. Why did CW=child welfare and non-CW turn into Line 117: maltreated and comparison groups? Please have uniformity.

RESPONSE: We apologize for this inconsistency and have revised the language to be consistent.

4. What does a “biracial” participant’s race mean?

RESPONSE: We have updated this term to Multi-racial, meaning the participant reported 2 or more races/ethnicities.

5. Regarding the data analysis, why not present results before and after multiple imputation? The missingness proportion is less than 2%, but if you apply multiple imputation, the results should show how the data and results look before imputation.

RESPONSE: We thank the reviewer for raising this point. Although the variable-level missingness is <2%, listwise deletion would result in 10% of the sample being dropped from the analyses. As such, we feel that it is critical to use imputation to mitigate systematic bias due to missingness.

We have completed the analyses for the data with missingness and report that there is no difference in the model performance metrics between the imputed and non-imputed data. As such we report only the results for the feature importance and coefficients for the imputed data. We include the performance metrics for the non-imputed data in a supplementary table.
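
As an illustration of why low variable-level missingness can still remove a sizeable share of cases under listwise deletion, the following is a minimal sketch. It assumes missingness is independent across variables, and the per-variable rates and the number of variables are hypothetical placeholders, not values from the study.

```python
from math import prod


def listwise_loss(missing_rates):
    """Expected fraction of cases dropped by listwise deletion.

    Assumes (hypothetically) that missingness is independent across
    variables, so a case is complete with probability prod(1 - r).
    """
    return 1.0 - prod(1.0 - r for r in missing_rates)


# Hypothetical example: 20 predictors, each missing for 0.5% of cases.
loss = listwise_loss([0.005] * 20)
print(f"Expected share of cases dropped: {loss:.3f}")
```

Under these assumed rates, each variable is well below 2% missing, yet roughly one case in ten has at least one missing value. Real missingness patterns are rarely independent, so this only conveys the direction of the effect.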

6. Here is one serious concern with the analytical approach: Lines 242-243 describe how “generally a value higher than 0.7 designates a good model, and higher than 0.8 a strong model”. Table 3 presents AUC for all models between 0.79 and 0.83. That means all models (i.e., Logistic Regression, Lasso and SVM) are good or strong. Why do you opt for different models under small and marginal differences?

RESPONSE: Although the difference in the AUC may seem small, any incremental increase in accuracy is considered an improvement. However, in our revised analyses with the new race/ethnicity variables we found that the performance metrics across the three models were so similar we could not choose a best-fitting model. Therefore, we now report the feature selection and coefficients across all three models. This adds robustness to our results as we only retained the features that met our threshold across all three models. This mitigates differences in features selected by each model that may be due to model-specific algorithms.
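The rule of retaining only features that meet the threshold in all three models amounts to a set intersection. The following is a minimal sketch, assuming the threshold is the .005 importance cut-off mentioned in the response to point 12; the feature names and importance values are hypothetical and are not results from the study.

```python
THRESHOLD = 0.005  # assumed to be the importance cut-off described in the response

# Hypothetical importance scores per model; names and values are illustrative only.
importances = {
    "logistic_regression": {"peer_marijuana_use": 0.120, "gender": 0.020, "self_esteem": 0.004},
    "lasso": {"peer_marijuana_use": 0.130, "gender": 0.015, "self_esteem": 0.009},
    "linear_svm": {"peer_marijuana_use": 0.125, "gender": 0.018, "self_esteem": 0.007},
}


def robust_features(scores_by_model: dict, threshold: float) -> set:
    """Return the features whose importance meets the threshold in every model."""
    selected_per_model = [
        {name for name, value in scores.items() if value >= threshold}
        for scores in scores_by_model.values()
    ]
    return set.intersection(*selected_per_model)


retained = robust_features(importances, THRESHOLD)
print(sorted(retained))
```

In this toy input, `self_esteem` clears the threshold in two models but not in the third, so it is excluded. This is how a feature favoured by one model-specific algorithm would be filtered out.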

7. Following the previous comment in Line 237- 239: In linear SVM, one can interpret the magnitude and sign of the coefficient in the linear hyperplane similarly to the coefficients in Logistic Regression and Lasso. What references do you have for this affirmation?

RESPONSE: In those models the magnitude of the coefficient means the same thing for the prediction of the outcome. We have added the following reference: Rakotomamonjy, A. (2003). Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3(Mar), 1357-1370.
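The parallel drawn here can be made explicit by writing the two linear forms side by side. This is a general sketch in standard notation, not a formula taken from the manuscript; the symbols $w_j$, $b$, and $x_j$ are introduced here for illustration.

```latex
\begin{aligned}
\text{Logistic regression / Lasso:}\quad
  & \log\frac{P(y=1\mid x)}{1-P(y=1\mid x)} = b + \sum_{j=1}^{p} w_j x_j \\
\text{Linear SVM:}\quad
  & f(x) = b + \sum_{j=1}^{p} w_j x_j, \qquad \hat{y} = \operatorname{sign} f(x)
\end{aligned}
```

In both cases the sign of $w_j$ gives the direction in which feature $x_j$ pushes the prediction, and a larger $|w_j|$ means a larger change per unit of $x_j$. Two caveats apply: magnitudes are comparable across features only when the features are on a common scale, and the SVM score $f(x)$ is a signed distance-like quantity rather than a log-odds.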

8. Also, it is unclear whether your study is a comparison (e.g., CW=child welfare and non-CW), but you might use non-comparable models (e.g., Logistic Regression vs SVM). Unusually, two different models are recognized for two independent samples under the same analysis. For CW, it is SVM, and for Non-CW, it is the Logistic Regression. It is necessary to have a better explanation and justify with technical background why a model may be more applicable to one group and not to the other. This point needs a deep rationale for your data analysis proposal.

RESPONSE: In our revised analyses we present all three models for both CW and non-CW youth.

9. Line 272 It is “perturbed” or “permuted”?

RESPONSE: We apologize for this error and have changed it to permutated.

10. Why do you use the restrictions of the Lasso model and the technique called Backward Feature Selection? It seems like there are too many restrictions for the exploration of Line 46: “further work to delineate the relative importance of known predictors of substance use /…/”.

RESPONSE: Lasso does not impose restrictions; rather, regularization is used to overcome overfitting when many features are used that may duplicate variance. This is an advantage of Lasso over logistic regression. Backward feature selection is a distinct part of the process: each feature is iteratively removed, the predictive performance with that single feature removed is evaluated, and the final set of features is the one that results in the best AUC.
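The backward feature selection loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it assumes a greedy procedure that drops one feature per round, and `toy_auc` is a hypothetical stand-in for the cross-validated AUC of a refitted model. The feature names are also invented.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def backward_select(
    features: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
) -> Tuple[FrozenSet[str], float]:
    """Greedy backward elimination that returns the best-scoring subset seen."""
    current = frozenset(features)
    best_set, best_score = current, score(current)
    while len(current) > 1:
        # Score every subset obtained by removing exactly one feature.
        candidates = [(score(current - {name}), name) for name in sorted(current)]
        step_score, dropped = max(candidates)
        current = current - {dropped}
        # Remember the subset only if it beats everything evaluated so far.
        if step_score > best_score:
            best_set, best_score = current, step_score
    return best_set, best_score


def toy_auc(subset: FrozenSet[str]) -> float:
    """Hypothetical scorer: two useful features add to AUC, any other subtracts."""
    gains = {"peer_use": 0.15, "gender": 0.10}
    return 0.5 + sum(gains.get(name, -0.01) for name in sorted(subset))


best_set, best_score = backward_select(
    ["peer_use", "gender", "noise_a", "noise_b"], toy_auc
)
print(sorted(best_set), round(best_score, 2))
```

The loop keeps eliminating down to a single feature but returns the best subset encountered along the way, which matches the idea of choosing the set with the highest AUC rather than stopping at a fixed size.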

11. How sensitive is the Permutation Feature Importance (PFI) to the SVM and the logistic regression results?

RESPONSE: Permutation feature importance is model agnostic, as stated on lines 307-308.

12. It is not easy to interpret the values from the Permutation Feature Importance (PFI) analysis. However, if the arbitrary selection of models we asked about in point 7 is justifiable, at least all results in all models should be presented for a fair comparison.

RESPONSE: For each model, a higher value means the feature contributes more to the prediction of the outcome. The value approximates the amount by which the predictive accuracy (AUC) would drop if that feature were removed from the model. We stated in the text that we used a cut-off of .005 to report important features, which means a contribution of ≥.5% to the accuracy. In the results section we note that the top feature for CW youth is peer marijuana use, which would decrease the AUC by 12-13% if dropped from the model. Similarly, for non-CW youth, externalizing behavior is the top feature, and its removal would result in a drop of 10-12%. These clarifications have been added to the analysis and results sections.
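To make the interpretation of these values concrete, the sketch below computes permutation feature importance as the mean drop in AUC when one column is shuffled. It is a minimal illustration, not the authors' implementation: the data, the `predict` function, and the number of repeats are hypothetical, and the procedure shuffles a feature rather than literally removing it. Because `predict` can be any callable that returns a score, the same routine applies to any fitted model, which is the sense in which the method is model agnostic.

```python
import random
from typing import Callable, List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(
    predict: Callable[[List[float]], float],
    rows: List[List[float]],
    labels: Sequence[int],
    n_repeats: int = 20,
    seed: int = 0,
) -> List[float]:
    """Mean drop in AUC per column after shuffling that column's values."""
    rng = random.Random(seed)
    baseline = auc(labels, [predict(r) for r in rows])
    drops = []
    for col in range(len(rows[0])):
        total = 0.0
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            total += baseline - auc(labels, [predict(r) for r in permuted])
        drops.append(total / n_repeats)
    return drops


# Hypothetical data: column 0 separates the classes, column 1 is ignored by the model.
rows = [[0.9, 1.0], [0.8, 0.0], [0.7, 1.0], [0.2, 0.0], [0.3, 1.0], [0.1, 0.0]]
labels = [1, 1, 1, 0, 0, 0]
drops = permutation_importance(lambda r: r[0], rows, labels)
important = [i for i, d in enumerate(drops) if d >= 0.005]  # cut-off from the response
print(drops, important)
```

A column the model never uses has an importance of exactly zero, because shuffling it cannot change any prediction. A column the model relies on shows a positive drop that clears the .005 cut-off.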

13. Since the authors claim that this is a study “Line 396-398: to provide evidence regarding the comparability of risk factors for substance use among CW-involved versus non-CW-involved youth, but the first to so within one study and using machine learning approaches.”, it is capital to clarify the analytic concerns mentioned before. The richness of their approach will reside in how well the points 5 to 11 are corrected and rationalized.

RESPONSE: We appreciate the opportunity to clarify our analytic methods in the above points.

Attachment

Submitted filename: PLoS One_reponse to reviewers 5-26-22.docx

Decision Letter 1

Carlos Andres Trujillo

9 Sep 2022

Using machine learning to determine the shared and unique risk factors for marijuana use among child-welfare versus community adolescents

PONE-D-21-15650R1

Dear Dr. Negriff,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Carlos Andres Trujillo, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I think the authors have addressed all my concerns. I think it has improved and is ready for publication. I really like the research.

Reviewer #2: I appreciate the effort and clarity of the authors in every answer to the comments. Also, I have no additional comments for the authors or concerns about dual publication, research, or publication ethics.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Angela Trujillo

Reviewer #2: Yes: Juan Jose Giraldo-Huertas

**********

Acceptance letter

Carlos Andres Trujillo

13 Sep 2022

PONE-D-21-15650R1

Using machine learning to determine the shared and unique risk factors for marijuana use among child-welfare versus community adolescents

Dear Dr. Negriff:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Carlos Andres Trujillo

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Performance metrics for the three machine learning approaches for non-imputed (raw) data.

    (DOCX)

    Attachment

    Submitted filename: PLoS One Review_PONE-D-21-15650.docx

    Attachment

    Submitted filename: PLoS One_reponse to reviewers 5-26-22.docx

    Data Availability Statement

    There are ethical restrictions on sharing a de-identified dataset because it contains sensitive information about child welfare involvement and maltreatment experiences. Sharing a de-identified dataset on this small, vulnerable group could potentially expose identifiable information, given that the location of the study is specified in the manuscript and the dataset includes age, race, and gender. Requests may be sent to Julie Cederbaum PhD, MSW, the lead of the data access committee at USC School of Social Work, at jcederba@usc.edu.


    Articles from PLoS ONE are provided here courtesy of PLOS
