Author manuscript; available in PMC: 2021 Jan 1.
Published in final edited form as: Epidemiol Methods. 2020 Feb 25;9(1):20190032. doi: 10.1515/em-2019-0032

The use of Logic regression in epidemiologic studies to investigate multiple binary exposures: an example of occupation history and amyotrophic lateral sclerosis

Andrea Bellavia 1,*, Ran S Rotem 1, Aisha S Dickerson 1,2, Johnni Hansen 3, Ole Gredal 3, Marc G Weisskopf 1,2
PMCID: PMC7679079  NIHMSID: NIHMS1619289  PMID: 33224709

Abstract

Investigating the joint exposure to several risk factors is becoming a key component of epidemiologic studies. Individuals are exposed to multiple factors, often simultaneously, and evaluating patterns of exposures and high-dimension interactions may allow for a better understanding of health risks at the individual level. When jointly evaluating high-dimensional exposures, common statistical methods should be integrated with machine learning techniques that may better account for complex settings. Among these, Logic regression was developed to investigate a large number of binary exposures as they relate to a given outcome. This method may be of interest in several public health settings, yet has never been presented to an epidemiologic audience. In this paper, we review and discuss Logic regression as a potential tool for epidemiological studies, using an example of occupation history (68 binary exposures of primary occupations) and amyotrophic lateral sclerosis in a population-based Danish cohort. Logic regression identifies predictors that are Boolean combinations of the original (binary) exposures, fully operating within the regression framework of interest (e.g. linear, logistic). Combinations of exposures are graphically presented as Logic trees, and techniques for selecting the best Logic model are available and of high importance. While highlighting several advantages of the method, we also discuss specific drawbacks and practical issues that should be considered when using Logic regression in population-based studies. With this paper, we encourage researchers to explore the use of machine learning techniques when evaluating large-dimensional epidemiologic data, and advocate the need for further methodological work in the area.

Keywords: machine learning, big data, Logic regression, occupational epidemiology, amyotrophic lateral sclerosis

Introduction

Epidemiological studies are increasingly interested in investigating the effect of simultaneous exposures to multiple risk factors on a given outcome.(Mooney, Westreich, and El-Sayed 2015) Individuals are generally exposed to several risk factors during their lifetime, often simultaneously, and evaluating their joint rather than individual effects provides a better understanding of the associated risks and potentially improves intervention development.(Aylward et al. 2013; Dominici et al. 2010) In addition, the joint evaluation of several exposures allows investigating synergistic or antagonistic interactions, potentially detecting recurrent patterns of exposures associated with the outcome of interest. (Vanderweele 2009; Howard and Webster 2012)

With the increasing availability of big data and the need to jointly investigate several exposures as well as high-dimension interactions, classical statistical approaches based on regression modeling may often be of little use due to common problems such as overfitting or collinearity.(Greenland 1993; Weisskopf, Seals, and Webster 2018) Several methodological as well as applied publications have focused on addressing this problem in situations where the interest is in evaluating a large dimensional set of continuous exposures. In environmental epidemiology, for example, methods for the joint evaluation of environmental mixtures are becoming increasingly popular.(Taylor et al. 2016; Billionnet, Sherrill, and Annesi-Maesano 2012; Stafoggia et al. 2017)

Most commonly used approaches for large dimensional exposures are focused on the setting of continuous exposures, and little emphasis has been given thus far to settings where multiple binary exposures are of interest. Such settings are becoming more common in epidemiological investigations, such as those that use large databases of electronic medical records where diagnosis information is generally dichotomized. Also, studies focusing on occupational history are often interested in evaluating the relationship between a given outcome and a large set of dichotomous indicators of work history at specific occupations.(Dickerson et al. 2018) Such analysis can be implemented using a series of independent logistic regressions, each examining one exposure at a time. However, this method fails to capture joint effects and effects that are due to interactions. This is a significant limitation, as oftentimes it is the pattern of co-exposures that drives the risk of the outcome.(Lucek and Ott 1997)

Machine learning methods to deal with binary predictors while allowing for high-dimension interactions have been developed, mainly based on clustering and regression tree techniques.(Breiman 2017; Friedman 1991) Among these, one methodology that offers several advantages is Logic regression.(Ruczinski, Kooperberg, and LeBlanc 2003) This method was developed over a decade ago, but to our knowledge has never been presented or discussed as a potential tool for epidemiological studies. Applications of the method have been sporadic and mainly limited to genetics.(Ruczinski, Kooperberg, and LeBlanc 2004; Schwender and Ickstadt 2007; Chen et al. 2011) Nevertheless, by specifically focusing on identifying patterns of binary exposures while working iteratively within a regression framework, Logic regression may be relevant to several epidemiological studies.

The aim of this paper is to apply Logic regression in a study evaluating the association between occupational history and the risk of amyotrophic lateral sclerosis (ALS), and discuss advantages of the method as well as drawbacks and practical issues relevant for epidemiological research.

Study population and job exposure data

We used population-based Danish registry data that were previously analyzed in a nested case-control study examining the association between occupational employment history and incidence of ALS.(Dickerson et al. 2018) Briefly, ALS diagnoses from 1982 to 2013 were identified from the National Danish Patient Registry using extended Danish International Classification of Diseases and Related Health Problems, Eighth Revision (ICD-8) code 348 for “amyotrophic lateral sclerosis” through 1994, and ICD-10 code G12.2 for “motor neuron disease” thereafter.(Schmidt et al. 2015; Kioumourtzoglou et al. 2015) Occupation history data backdating to 1964 were obtained through the Danish Pension Fund by using Industrial Classification codes from Statistics Denmark.(Hansen and Lassen 2011) We identified 68 primary occupation classifications and 746 sub-classifications among study participants. In total, 1826 ALS cases (1079 male and 747 female) were identified and matched to 182,600 controls (1:100 ratio), after excluding cases who were older than 25 in 1964 when the Pension Fund began.(Dickerson et al. 2018) Further details on inclusion and exclusion criteria can be found in the original publication, where each occupation was evaluated independently using conditional logistic regression models.(Dickerson et al. 2018) Subjects without any employment history were included in the analysis as unexposed for every occupation category. The current package for Logic regression in the R statistical software, described in the following sections, allows for implementation in datasets with a maximum of 40,000 rows. We therefore randomly selected 20 of the 100 controls for each ALS case, yielding a final population of 37,972 individuals. The original case-control study was designed by matching controls to ALS cases based on sex, birth year, and vital status at the time the case was diagnosed.
Because occupations and job tasks within industries are expected to differ substantially between men and women, we stratified all analyses a priori by gender.(Dickerson et al. 2018) Further covariates that could confound the associations and are available in the data include socioeconomic status (SES) and area of residence.

Review of Logic regression

Model definition

The main goal of Logic regression is to find combinations of binary covariates associated with a response variable of interest. The method takes its name from the fact that such combinations are defined in terms of Boolean logic expressions. These can be written, for example, as L=[X1 and (X2C or X3)], meaning “if X1 and either ‘not-X2’ or X3 are true”, with X1, X2, and X3 being 3 out of several binary predictors. While several search algorithms for finding Boolean combinations of predictors exist, Logic regression is the only one that operates entirely within a regression framework of interest. A Logic regression model takes the general form

g(E[Y]) = β0 + Σj βjLj

For example, given an outcome Y, we can think of a Logic regression model such as g(E[Y]) = β0 + β1L, where L is the Boolean combination previously defined. If Y is a binary outcome (which is not a requirement for the method), the function g would be a logit transformation and we could interpret β1 as the logarithm of the odds ratio comparing individuals for whom L is true (those exposed to X1 and either ‘not-X2’ or X3) to individuals who do not have this combination. Similar to other methods based on Boolean combinations, Logic regression allows for 3 types of operators: Ʌ (AND), V (OR), C (NOT). The combination of operators is presented as a tree. For example, Figure 1 represents the combination L previously defined. White squares are interpreted as “the predictor is true”, while the NOT complements (“the predictor is not true”) are represented with black squares.
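To make the interpretation concrete, the Boolean combination and its coefficient can be sketched in a few lines of code (a hypothetical illustration; the predictor names and the value of β1 are invented, not estimates from our data):

```python
import math

def L(x1: int, x2: int, x3: int) -> int:
    """Return 1 when the Boolean combination [X1 AND (not-X2 OR X3)] is true."""
    return int(bool(x1) and ((not x2) or bool(x3)))

# Truth table: L is true only when X1 = 1 and (X2 = 0 or X3 = 1)
assert L(1, 0, 0) == 1   # X1 true, X2 false          -> true
assert L(1, 1, 1) == 1   # X1 true, X3 true           -> true
assert L(1, 1, 0) == 0   # X2 true and X3 false       -> false
assert L(0, 0, 1) == 0   # X1 false                   -> false

# With a logit link, a coefficient beta1 (a made-up value here) translates
# into an odds ratio comparing L-true to L-false individuals.
beta1 = 0.58
odds_ratio = math.exp(beta1)
```

The same exponentiation applies to any coefficient reported later in the paper, since all fitted models use the logit link.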

Figure 1:


Logic tree representing the Boolean combination [X1 Ʌ (X2C V X3)]. White letters on a black background denote the complement (i.e. ‘not-X2’)

Model estimation

Logic regression can be used with any function g for which a score function can be defined. For example, for a logistic regression model, where g(E[Y]) = log(E[Y]/(1 − E[Y])), the binomial deviance can be used as the scoring function. Other scoring functions associated with alternative model types (e.g. least squares for linear regression) can also be used. Additionally, users can define their own score function and apply Logic regression to any kind of regression technique. Once the score function is defined, estimation proceeds by finding the Boolean expressions that minimize the scoring function. In this way, estimation of the parameters and the search for the Boolean expression happen simultaneously.
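As an illustration of the scoring function that drives the search in the logistic case, a minimal sketch of the binomial deviance (the probabilities below are made up; this is not the LogicReg implementation):

```python
import math

def binomial_deviance(y, p, eps=1e-12):
    """Deviance = -2 * Bernoulli log-likelihood of outcomes y under probabilities p."""
    ll = sum(yi * math.log(max(pi, eps)) + (1 - yi) * math.log(max(1 - pi, eps))
             for yi, pi in zip(y, p))
    return -2.0 * ll

# Lower deviance = better fit; the search keeps the Boolean tree minimizing it.
y = [1, 0, 1, 0]
good = binomial_deviance(y, [0.9, 0.1, 0.8, 0.2])   # well-calibrated predictions
bad = binomial_deviance(y, [0.5, 0.5, 0.5, 0.5])    # uninformative predictions
```

Each candidate Boolean tree implies fitted probabilities, and the tree with the lowest deviance is preferred, exactly as least squares would rank trees in the linear case.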

The choice of Boolean expressions is generally based on decision trees. As the number of predictors in such applications is generally large, the number of potential combinations is enormous, and it is not possible to evaluate all potential trees. For this reason, Logic regression primarily uses simulated annealing, a technique that iteratively compares trees differing from one another by only a single move (e.g. a change of one of the predictors or of the operator between predictors). A new move is accepted if its score is better than that of the old combination. If the score is not better, the move can still be accepted, with a probability governed by a parameter called temperature that decreases as the annealing chain progresses. This creates a gradient of acceptance probabilities: at each step of the chain a number of moves to modify the trees are considered, and only some are accepted, with lower acceptance probability later in the process. The starting temperature should be chosen such that proposed moves are readily accepted early in the annealing process, while the final temperature should be low enough that few moves are accepted at the end. The recommendation is that most of the moves occur while the acceptance rate falls from roughly 90% to 40%. Alternative search algorithms are also available, but are generally less efficient and provide less satisfactory results. Further details on the estimation of Boolean combinations have been previously published.(Ruczinski, Kooperberg, and LeBlanc 2003)
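The acceptance rule at the heart of simulated annealing can be sketched as follows (a conceptual illustration, not the LogicReg code; the Metropolis-style acceptance probability shown here is one common formulation):

```python
import math
import random

def accept_move(old_score, new_score, temperature, rng=random.random):
    """Accept a better (lower) score always; a worse score with prob. exp(-(delta)/T)."""
    if new_score <= old_score:
        return True
    p_accept = math.exp(-(new_score - old_score) / temperature)
    return rng() < p_accept

# Early in the chain (high T) a slightly worse move is almost always accepted;
# late in the chain (low T) it is almost never accepted.
p_early = math.exp(-1.0 / 100.0)   # uphill move of +1 at T = 100
p_late = math.exp(-1.0 / 0.01)    # same move at T = 0.01
```

This gradient of acceptance probabilities is what lets the search escape local optima early on while converging to a stable tree at the end of the chain.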

Model selection

Because of the complexity of the iterative annealing process and the high number of potential combinations, techniques for evaluating whether the selected model is indeed the best predictive model, or at least close to it, are of extreme importance. Several approaches are proposed for this aim. First, the Logic regression algorithm can be run with varying specifications for model size and the total number of trees, and the scores of these runs compared. The model size is the total number of predictors that will be included in the final model; the number of trees is the maximum number of Logic trees (Boolean combinations) to be estimated. For example, a Logic model of size 10 with 2 trees will be g(E[Y]) = β0 + β1L1 + β2L2, where L1 and L2 are combinations of no more than 10 predictors in total (e.g. 7+3; 4+6…). Second, standard validation techniques such as training/test sets or cross-validation can be used to determine the size of the Logic tree with the best predictive capability. Finally, two types of randomization tests are available: a test for signal in the data, and a test for optimal model size. We refer to previous publications for details on model selection techniques.(Ruczinski, Kooperberg, and LeBlanc 2003) When Logic regression is used among several alternative approaches, it is recommended to evaluate and compare classification accuracy by calculating the area under the curve (AUC).(Yoo et al. 2012) Logic regression is implemented in the statistical software R (package LogicReg) and online tutorials are available. R code related to our analysis is available in the Supplementary Material.
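For readers less familiar with the AUC, a minimal rank-based sketch of its computation (illustrative code, not tied to any particular package):

```python
def auc(y_true, y_score):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A model whose scores perfectly separate cases from controls has AUC = 1;
# uninformative constant scores give AUC = 0.5.
perfect = auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
chance = auc([0, 1], [0.3, 0.3])
```

On this scale, the AUC values around 0.545 reported later in the paper indicate only modest discrimination beyond chance.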

Illustration: occupation history and ALS

The first step of the analysis, required to fit the Logic model in R, is to define a matrix of binary exposures in wide format (i.e. one column for each predictor). In our example we focused on both specific occupations (746 binary predictors) and occupation categories (68 binary predictors). All occupations and categories are listed in the Supplementary Material. As all analyses were stratified by gender, we conducted the same set of analyses in four settings: occupation categories in men; occupation categories in women; specific occupations in men; specific occupations in women. We present details of the analysis examining occupation categories in women, and later briefly discuss results in the other settings.
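The reshaping step can be sketched as follows (hypothetical code; the occupation names and the long-format record layout are invented for illustration):

```python
def to_wide(records, occupations):
    """records: iterable of (person_id, occupation) pairs, one row per employment;
    returns {person_id: [0/1 per occupation]} in wide format."""
    wide = {}
    for pid, occ in records:
        row = wide.setdefault(pid, [0] * len(occupations))
        row[occupations.index(occ)] = 1
    return wide

occs = ["cleaning", "preserved_foods", "wholesale"]
records = [(1, "cleaning"), (1, "wholesale"), (2, "preserved_foods")]
matrix = to_wide(records, occs)
# person 1 ever worked in cleaning and wholesale; person 2 only in preserved foods
```

Subjects with no employment records would simply receive an all-zero row, matching how unexposed individuals were coded in our analysis.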

Algorithm definition and code sample

Before fitting the model, the algorithm for the tree search must be defined. For simulated annealing, the range of the temperature (the parameter governing the acceptance rate for moves with higher scores) and the number of iterations must be set. After some tuning, we found that the best temperature limits for our datasets were 2 to −8, with 25,000 iterations. Figure 2 shows the upper-left section of the output of a Logic regression model. The temperature parameter is provided in the first column, with only some values presented (here we chose to update the output every 0.08 unit change in the temperature). The current and best scores up to that moment are provided in columns 2 and 3, respectively. The 4th column presents the number of accepted moves since the last output, with the number of moves that yielded scores identical to the preceding move in parentheses. We see here that at early stages most of the moves are accepted. Finally, columns 5 and 6 present the number of rejected moves (with numerically acceptable and unacceptable results, respectively). The number of rejected moves gradually increases as the annealing chain progresses.
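Assuming, as documented for the LogicReg annealing control parameters, that the temperature limits are given on a log10 scale (so 2 to −8 means cooling from 10² to 10⁻⁸), the implied geometric cooling schedule can be sketched as:

```python
def temperature_schedule(log_start, log_end, n_iter):
    """Temperatures decreasing geometrically from 10**log_start to 10**log_end."""
    step = (log_end - log_start) / (n_iter - 1)
    return [10 ** (log_start + i * step) for i in range(n_iter)]

# Our tuned settings: limits 2 to -8 over 25,000 iterations.
temps = temperature_schedule(2, -8, 25000)
# temps starts near 100 (moves readily accepted) and ends near 1e-8
# (almost no uphill moves accepted), matching the pattern seen in Figure 2.
```

The schedule makes explicit why the accepted-move counts in the monitoring output shrink as the chain progresses.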

Figure 2:


Upper left section of an R output monitoring the estimation of a Logic regression model.

Model fitting and pruning

After defining the search algorithm, we moved to the actual fitting of the Logic model. We simultaneously estimated several models by varying the number of trees from 1 to 3 and the number of predictors from 1 to 20 (10 to 50 when focusing on specific occupations) and looked for the lowest score (representing the best model). These ranges were chosen because the current version of the R package for Logic regression allows for a maximum of 5 trees (recommending 1–3 for practical purposes) and 50 separate parameters. Scores from the different models are presented in Figure 3, based on which we would choose the Logic model with 20 predictors in 3 trees, which had the lowest score. If several models have similar scores, it is recommended to select the Logic model of lowest complexity.
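The grid search over model sizes can be sketched schematically (the fitting step is replaced by made-up scores loosely mimicking Figure 3; `select_model` is a hypothetical helper, not part of any package):

```python
def select_model(scores):
    """scores: dict mapping (ntrees, npredictors) -> fitted score; returns argmin key."""
    return min(scores, key=scores.get)

# Invented scores for three candidate configurations; in practice each entry
# would come from one Logic regression fit on the training data.
grid_scores = {(1, 5): 5990.0, (2, 11): 5955.0, (3, 20): 5923.6}
best = select_model(grid_scores)   # the lowest score wins
```

In-sample, larger models will usually win this comparison, which is why the cross-validation described below is needed before settling on a final size.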

Figure 3:


Scores of Logic regression models of different sizes. The x-axis represents the total size of the model (the maximum number of included predictors), while the number of trees is indicated inside the squares.

One important practical caveat, which has not received enough attention in previous publications, is that results are based on an iterative algorithm and may therefore be strongly affected by random fluctuations. As such, fitting the same model multiple times will generally give different results. If a second run of the same model provides a completely different Logic tree, this is a sign of high noise in the data, indicating poor predictive capability. For example, in Table 1 we show the results of the same Logic model (20 predictors in 3 trees) in 3 different runs. As the outcome is binary, beta coefficients are interpreted as the log(OR) associated with the respective L being TRUE.

Table 1.

Results of 3 different runs of the same Logic model with 20 predictors in 3 trees

Score β0 β1 β2 β3 L1 L2 L3
5923.6 −3.14 3.73 0.36 −1.07 (((Textile and Clothing Ʌ (Cpaper Ʌ printing industry)) V (laundries Ʌ Others)) Ʌ ((Professional services Ʌ (CConstruction)) Ʌ (CWelfare))) (((Preserved foods V (CHealth and Research)) V Societies) Ʌ ((COthers) Ʌ ((CTextile and Clothing) Ʌ (CPersonal services)))) ((((CCommunication) Ʌ Retail trade) Ʌ Cleaning services) V ((Professional services V Electrical plants) Ʌ (CLaundries)))
5924.2 −5.94 0.81 1.68 2.19 (((Others Ʌ (CWholesale trade)) V Preserved foods) V ((Communication V Education) V (Leather industry V (CCleaning services)))) ((((COthers) Ʌ (CCleaning services)) V Medical goods) Ʌ (((CWood furniture production) Ʌ (CLeather industry)) Ʌ Ground transportation)) (((CGround transportation) Ʌ (CPainting)) V ((Electrical firms V General contractor) V (Metal factory Ʌ Retail trade)))
5926.4 −2.99 −18.3 0.54 −0.6 ((((CTextile and Clothing) Ʌ (CRenting of machines)) Ʌ (Professional services V Heat/gas company)) Ʌ ((CConstruction planning) Ʌ (Public administration V Transportation))) ((((CHealth and Research) Ʌ (CBank)) Ʌ ((COthers) Ʌ (CTextile and Clothing))) Ʌ ((Communication Ʌ (CTransportation)) V (Wholesale trade V Property management))) (((Metal foundries V Cleaning services) Ʌ (CTransportation manufacturing)) Ʌ (CEducation))

β0 represents the intercept, while β1, β2, and β3 are the parameters associated with the combinations L1, L2, and L3, respectively

Cross Validation

In Table 1, despite some predictors consistently appearing in all three models, there seems to be a high amount of noise, with the three runs of the same Logic model providing substantially different trees. Note that the models presented are not wrong, as they all identify combinations of exposures that are associated with the outcome of interest, but we are not able to consistently identify the combinations of highest importance. It is thus recommended to evaluate the best model size using techniques such as training/test sets or cross-validation, which compare the predictive capability of the different models considered. We implemented a 10-fold cross-validation, which is easily automated in the R package (see Supplementary Material). Figure 4 shows the scores for all evaluated models in both the training and test datasets of the cross-validation procedure.
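The fold construction underlying k-fold cross-validation can be sketched as follows (illustrative code; the R package automates this step internally):

```python
def kfold_indices(n, k):
    """Partition row indices 0..n-1 into k disjoint folds of near-equal size."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

# With 10 folds, each model size is fit 10 times on 9/10 of the data and
# scored on the held-out tenth; test-set scores are then averaged per size.
folds = kfold_indices(20, 10)
```

Comparing training and held-out scores across model sizes, as in Figure 4, is what reveals when added complexity is fitting noise rather than signal.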

Figure 4:


Scores of Logic regression models of different sizes in the training (left) and test data (right). The x-axis represents the total size of the model (the maximum number of included predictors), while the number of trees is indicated inside the squares

We conclude from Figure 4 that while a larger model size is better per se (lower scores in the training set), it also substantially increases noise (higher scores in the test dataset), and models with 3 trees have consistently higher scores in the test dataset, indicating lower predictive capability. For example, although the model we have been using with 20 predictors in 3 trees is among the best in the training set, it also has one of the highest scores in the test dataset. This low predictive capability may explain the inconsistent results we observed in the previous section. It is recommended, in such situations, to reduce the complexity of the model while still choosing a model with a low score in the training dataset. For example, we could use the model with 11 predictors and 2 trees, which is one of the few models with more than 1 tree that also has an acceptable score in the test dataset. Results from this model, again replicated three times, are reported in Table 2 with the corresponding Logic trees in Figure 5.

Table 2.

Results of 3 different runs of the same Logic model with 11 predictors in 2 trees

Score β0 β1 β2 L1 L2
5951.31 −2.7 −0.82 0.58 (((CHotels and restaurants) Ʌ (CWholesale trade)) V ((CPreserved foods) Ʌ (CMilitary and defense))) ((((CCleaning services) V Communication) Ʌ ((CProfessional services) Ʌ (CLaundries))) Ʌ (((CHotels and restaurants) Ʌ (CTransportation)) V (CAgriculture Ʌ farming)))
5955 −3.6 0.66 1.13 ((((CHealth and research) V (CCleaning services)) Ʌ (CProfessional services)) Ʌ (((CPersonal services) Ʌ (CEntertainment)) V Health Ʌ Research)) (((Preserved foods V Military and defense) Ʌ (Health and research V (CPreserved foods))) Ʌ Hotels and restaurants)
5950.92 −3.33 −0.35 0.58 ((((CPreserved foods) Ʌ Health and research) V Personal services) V ((Others V Textile and clothing) V Entertainment)) ((CProfessional services) Ʌ (((CCleaning services) V Communication) V (Plumbing V Preserved foods)))

β0 represents the intercept, while β1 and β2 are the parameters associated with the combinations L1 and L2 respectively

Figure 5:


Results from 3 Logic models with 2 trees and 11 predictors. Each row represents a different run of the same model, corresponding to the 3 rows of Table 2.

Results from this analysis provide much more consistent patterns, with several job categories being consistently selected (e.g. hotels and restaurants, preserved foods, cleaning services, military and defense, wholesale trade, health and research), and some patterns emerging (e.g. the interaction between hotels/restaurants and wholesale trade, or the one between preserved foods and military and defense). Results from Logic regression are interpreted as the increase (or decrease) in the log(OR) when given combinations are true. For example, in the first model, not working in hotels/restaurants nor in wholesale trade, or not working in preserved foods nor in military and defense (L1), is associated with a 0.82 reduction in the log(OR) of ALS.

Incorporating results in classical regression modeling

Results can be replicated by manually constructing binary indicators corresponding to the estimated Boolean combinations, and evaluating them in a classical regression model (e.g. binary logistic regression in our example). This also allows retrieving standard errors for the estimated coefficients. By fitting logistic regression models that replicate the Logic models presented above, we obtained the same parameters together with interval measures (all betas were statistically significant at the conventional α=0.05). In Table 3, Model 1, we show for example results from a logistic regression model that replicates the Logic model presented in the first row of Table 2. In addition, the obtained coefficients can be adjusted for potential confounders, such as area of residence, for which we observed negligible changes in the association (Table 3, Model 2).
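The manual reconstruction of a Boolean combination as a 0/1 indicator can be sketched as follows (hypothetical code: column names are shortened versions of the occupation categories, and the coefficient is taken from Table 3 rather than re-estimated):

```python
import math

def l1_indicator(row):
    """Indicator for L1 = ((not hotels/restaurants AND not wholesale trade)
    OR (not preserved foods AND not military/defense)), as in Table 2, row 1."""
    return int((not row["hotels"] and not row["wholesale"]) or
               (not row["preserved_foods"] and not row["military"]))

# Example subject: worked in hotels, preserved foods, and military -> L1 false
subject = {"hotels": 1, "wholesale": 0, "preserved_foods": 1, "military": 1}

# The refit logistic coefficient reproduces the Logic estimate; a log(OR)
# of -0.82 for L1 (Table 3) corresponds to an odds ratio of exp(-0.82).
or_l1 = math.exp(-0.82)
```

Once each L is materialized as a column like this, any standard logistic regression routine can refit the model, yielding the standard errors and confidence intervals that the Logic fit itself does not report.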

Table 3.

Results from logistic regression modelsa based on Logic models results

Predictor log (OR) OR p-value
Model 1 L1b −0.82 0.44 <0.01
L2c 0.58 1.79 <0.01
Model 2 L1b −0.82 0.44 <0.01
L2c 0.58 1.79 <0.01
Model 3 Preserved foods 0.36 1.43 0.03
Health and research −0.15 0.86 0.05
Military defense 0.38 1.46 0.25
Agriculture and farming 0.12 1.13 0.41
Hotels and restaurants −0.09 0.91 0.41
Cleaning services −0.38 0.68 <0.01
Wholesale trade 0.05 1.05 0.4
Model 4 Preserved foods 0.37 1.45 0.03
Health and research −0.15 0.86 0.05
Military defense 0.43 1.54 0.19
Agriculture and farming 0.12 1.13 0.41
Hotels and restaurants −0.15 0.86 0.21
Cleaning services −0.38 0.68 <0.01
Wholesale trade 0.05 1.05 0.96
Preserved foods *military −0.11 0.90 0.96
Hotel/rest*wholesale trade 0.24 1.27 0.28
a

Model 1: replicates the Logic model. Model 2: further adjusts for area of residence. Model 3: incorporates selected categories as independent predictors. Model 4: further adjusts Model 3 for area of residence and includes interactions between preserved foods and military and defense, and between hotels/restaurants and wholesale trade.

b

L1=(((CHotels and restaurants) Ʌ (CWholesale trade)) V ((CPreserved foods) Ʌ (CMilitary and defense)))

c

L2= ((((CCleaning services) V Communication) Ʌ ((CProfessional services) Ʌ (CLaundries))) Ʌ (((CHotels and restaurants) Ʌ (CTransportation)) V (Agriculture Ʌ farming)))

In addition, one could also use results from Logic regression as a preliminary step for selecting covariates and potential interactions to be included in a more conventional multivariable regression model. For example, we could fit a regression with all occupation categories selected in the first Logic model as independent predictors. Results from this model (Table 3, Model 3) show a significant protective effect for cleaning services (OR=0.68), and a significant harmful effect for preserved foods (OR=1.43). Furthermore, we could use observations from recurring patterns in the Logic trees, such as the combination of hotels and restaurants and wholesale trade, to include potential interactions. This is presented in Table 3 (Model 4), where, however, no interaction term was found to be significantly associated with ALS. Finally, we calculated the AUC to compare the classification accuracy of the Logic regression model selected after cross-validation and the final logistic regression model presented in Table 3. As expected, since the logistic model includes covariates and interactions suggested by the Logic model, the two approaches performed similarly, showing moderate levels of accuracy (AUC=0.545 for Logic regression, and AUC=0.547 for logistic regression).

Other settings

When focusing on occupation categories in the subsample of men, after model pruning and cross-validation we selected a Logic model with 2 trees and 14 predictors. Table 4 presents results of a logistic regression model in which predictors and interactions consistently observed over 3 runs of the final Logic model are included. The main predictors of ALS identified from the Logic models were: textile and clothing, health and research, transportation manufacturing, wood furniture production, meat products preparation and packaging, wholesale trade, metal works and foundries, chemical industry, and metal goods. When including these covariates in the model, only the health and research occupation category was significantly associated with lower odds of ALS (OR: 0.84, p-value=0.02). Interactions were also detected, with a positive interaction between wholesale trade and textile and clothing remaining significant in logistic regression models based on the Logic regression results (OR: 2.36, p-value=0.01).

Table 4.

Results from logistic regression models based on Logic models resultsa

Predictor log(OR) OR p-value
Meat products preparation and packaging 0.17 1.19 0.17
Textile and clothing −0.37 0.69 0.09
Health and research −0.22 0.80 0.01
Metal goods −0.17 0.84 0.15
Transportation manufacturing 0.10 1.11 0.32
Wood furniture production 0.14 1.15 0.16
Chemical industry −0.04 0.96 0.65
Metal works and foundries 0.15 1.16 0.18
Wholesale trade −0.04 0.96 0.61
Textile/clothing*health and research 0.86 2.36 0.01
Metal work foundries*wholesale trade −0.01 0.99 0.96
a

Incorporates selected categories and interactions as individual predictors

When applying Logic regression to the larger covariate matrix of 746 specific occupations, we could not identify any clear pattern of exposures. Noise was extremely high, and consistent models could not be obtained even after parameter tuning and cross-validation.

Discussion

The scope of this paper was to apply and discuss Logic regression as a potential machine learning tool for high-dimensional epidemiologic datasets in which a large number of binary predictors is of interest.

Several researchers have recommended moving epidemiology beyond the classical exposure-outcome approach, where each potential risk factor for a given outcome of interest is independently evaluated.(Greenland 1993; Rothman, Greenland, and Lash 2008; Taylor et al. 2016) Individuals are exposed to a large number of factors over their lifetime, often simultaneously, and evaluating the joint exposure to these factors allows a better understanding of health risks at the individual level while also providing the possibility of assessing the inter-relationship between different risk factors.(Carlin et al. 2013; Thomas, Witte, and Greenland 2007) This recommended shift is accompanied by a growing availability of large datasets with individual-level information.(Mooney, Westreich, and El-Sayed 2015) Increasingly large datasets provide enormous potential for evaluating high-dimension interactions and patterns of exposures, enabling a better understanding of disease courses and determinants. At the same time, however, the large number of covariates makes classical statistical approaches based on regression modeling of little use due to limitations such as overfitting and collinearity.(Bühlmann and Van De Geer 2011) In such scenarios, novel approaches for big data based on machine learning and data mining techniques should be considered and pursued. While these techniques are becoming standard in fields such as genomics, they are seldom applied in epidemiologic studies.(Naimi, Platt, and Larkin 2018) Specific to the focus of this paper, applications and discussions of methods for multiple binary covariates in epidemiology have been sporadic.(Alexeeff et al. 2017) This is despite the fact that research settings requiring the evaluation of a large number of binary predictors are common.
For instance, in the example we used of occupation history data, the interest was in evaluating whether occupational history was associated with risk of ALS, considering a large matrix of occupations.

Among the potential methods for large-dimensional binary predictors, Logic regression is a potentially promising tool.(Ruczinski, Kooperberg, and LeBlanc 2003) Logic regression searches for combinations of predictors within a regression framework, and is available as a user-friendly package in the R statistical software. Nevertheless, despite its potential utility in epidemiological studies, applications of this method have been sporadic and mainly limited to genetics studies.(Schwender and Ickstadt 2007; Ruczinski, Kooperberg, and LeBlanc 2004; Chen et al. 2011) Our results using Logic regression in a large cohort study highlight some of the strengths of this approach and confirm its potential. First, our results identified some occupations that were also notable in a previous analysis using the same data that evaluated exposures one by one.(Dickerson et al. 2018) For example, in the previous analysis, cleaning services was the only job category found to be associated with ALS in women, with an adjusted OR of 0.69. In our analysis, cleaning services was also consistently identified as an important predictor by the Logic regression models, and we obtained an OR of 0.68 when evaluating it in a multivariable logistic regression model mutually adjusted for other occupations and incorporating interactions. Other job categories, such as health and research, were also consistently identified as potentially protective by our Logic models. In addition, we also observed the job category of preserved foods to be an important predictor of ALS, and obtained a significant OR of 1.51. The same occupation was associated with higher odds of ALS in the previous analysis, even though the magnitude of the association was smaller (OR=1.33) and the result non-significant. Moreover, the Logic regression models were also able to identify interactions that could not be detected when analyzing predictors one by one.
Our findings of an association between certain occupations and ALS risk are supported by recent studies on ALS etiology. Immune-related factors have been suggested to play a role in ALS. This includes a possible link between ALS and infections caused by enteroviruses,(Xue et al. 2018) a class of positive-sense single-stranded RNA viruses that can target motor neurons and are transmitted through the fecal-oral route. Similar findings have been reported for certain food-borne gram-negative pathogens.(Steenblock et al. 2018) Occupations such as cleaning services or jobs related to health and research are often associated with higher use of disinfectants and greater familiarity with protocols that prevent contamination and infection, providing a potential explanation for the protective associations we observed. Similarly, the positive association with employment in the preserved food industry may be explained by the potential use of chemicals involved in the preservation process. For example, sodium nitrite, a common food preservative, is a precursor of nitric oxide (NO), which inhibits glutamate transport, potentially contributing to the pathogenesis of ALS.(Taskiran et al. 2000; DeVan et al. 2016) Although these potential mechanisms support our findings, and our final models were chosen based on cross-validation, a component of chance finding due to overfitting of the model to our data cannot be excluded, and validation in external data is recommended. Future studies on the subject should also evaluate approaches for accounting for the timing of each occupation, which may be an important factor in determining the subsequent risk of ALS.

Of note, we encountered several problems and identified limitations that are worth considering when using this method. First, we found estimates to be extremely sensitive to noise in the data, and obtained inconsistent results in some of the analyses. Because Logic regression is an iterative procedure that does not search among all possible combinations of predictors (which would be practically infeasible), it necessarily has a random component. Previous applications of the model in genetic studies, as well as the original publications, did not underline this issue, which we found highly relevant in our data. In contrast to genetic analyses, where a few strong predictors or combinations are often identified, many epidemiological analyses involve a large number of predictors or combinations that are only weakly associated with the outcome (e.g. OR ≈ 1.05–1.10). In our analysis of 68 different occupation categories, for example, there may easily be dozens of Boolean combinations with similar (weak) associations with the outcome. By fitting a model with two trees, as in our example, we identify only a few combinations out of a larger pool of many others that could have been equally selected. From a practical point of view, poor replicability of results may signal the absence of strong predictors and increase the probability that selected exposures are due to chance. For this reason, we recommend that epidemiologists using this method in population-based studies, especially when risk factors are not strongly associated with the outcome, use Logic regression to identify potentially key variables, and then replicate and confirm the results with classical approaches that include only a few exposures and interactions. This issue becomes even more challenging when hundreds of covariates are of interest, as when we evaluated over 700 specific occupations.
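The replication strategy we recommend can be sketched as follows. This is a deliberately simplified, hypothetical stand-in: a greedy single-variable selector takes the place of a full stochastic Logic regression run, and the selection is repeated over bootstrap resamples so that recurrent picks can be tallied. Variables that recur across runs are the ones worth confirming in a classical model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated setting: 68 binary "occupations"; only column 0 weakly raises the
# odds of the outcome (OR ~ exp(0.4) ~ 1.5), mimicking the weak signals common
# in epidemiologic data.
n, p = 2000, 68
X = rng.integers(0, 2, size=(n, p))
logit = -1.0 + 0.4 * X[:, 0]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

def best_predictor(Xb, yb):
    """Greedy stand-in for one stochastic Logic-regression run: pick the column
    with the largest absolute risk difference between exposed and unexposed."""
    risk1 = (Xb * yb[:, None]).sum(0) / np.maximum(Xb.sum(0), 1)
    risk0 = ((1 - Xb) * yb[:, None]).sum(0) / np.maximum((1 - Xb).sum(0), 1)
    return int(np.argmax(np.abs(risk1 - risk0)))

# Replicate the selection over bootstrap resamples and tally recurrent picks.
picks = [best_predictor(X[idx], y[idx])
         for idx in (rng.integers(0, n, n) for _ in range(50))]
counts = np.bincount(picks, minlength=p)
# The truly associated column should recur far more often than any noise column;
# if no column recurs, the data likely contain no strong predictor.
```

In practice one would re-run the actual Logic regression (e.g. LogicReg's fitting routine) rather than this toy selector, but the tallying logic is the same.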
Importantly, because of this stochastic component, results from Logic regression are not always reproducible, but, as discussed, they can still be informative. To our knowledge, no computational tool such as seed setting is available to guarantee exact replication of a given run of the model. Several approaches are available for choosing the best predictive model for the data, including cross-validation and training/test splits. In our analysis, we chose to describe the cross-validation technique, but the alternatives described in the original publications could also be pursued.

A second caveat is the interpretation of the Boolean combinations, which may not have a direct explanation in practical terms. For example, we found that not working in the preserved foods industry nor in military and defense, or not working in wholesale trade nor hotel/restaurants, is associated with a change of −0.82 in the log(OR) of ALS (L1 in Table 2), implying lower risk associated with that pattern of employment. This is not straightforward to interpret, and there may also be very few individuals who actually exhibit the identified combination. To facilitate interpretation, one often needs to find a balance between two extremes: 1) focusing on complex models (i.e. several predictors per tree), which increases the chances of detecting important predictors but severely complicates interpretation, and 2) focusing on simple models, which are easier to interpret but carry a higher risk of missing important covariates. Nevertheless, to simplify interpretation, the results from Logic models can be used as a preliminary step in conventional regression modeling (i.e. as a selection procedure). Predictors detected by Logic models can then be evaluated in a multivariable regression model, also allowing for low-dimension interactions when recurrent combinations are detected. While this approach cannot account for high-dimension interactions, it provides considerable advantages in terms of parameter interpretation.
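As a sketch of this two-step approach, the code below (illustrative Python with simulated data, not our actual cohort) constructs the indicator corresponding to a Boolean combination like L1 and estimates its crude odds ratio from the 2×2 table; in practice, this indicator would enter a multivariable logistic model alongside the other selected terms.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Hypothetical binary job indicators (names mirror the paper's categories).
preserved, military, wholesale, hotel = rng.integers(0, 2, (4, n)).astype(bool)

# L1: (NOT preserved AND NOT military) OR (NOT wholesale AND NOT hotel)
L1 = (~preserved & ~military) | (~wholesale & ~hotel)

# Simulated outcome in which membership in L1 truly lowers the log-odds by 0.82.
y = rng.random(n) < 1 / (1 + np.exp(-(-0.5 - 0.82 * L1)))

# Crude odds ratio for the derived predictor from its 2x2 table with the outcome.
a, b = np.sum(L1 & y), np.sum(L1 & ~y)
c, d = np.sum(~L1 & y), np.sum(~L1 & ~y)
or_hat = (a * d) / (b * c)
# or_hat should fall near exp(-0.82) ~ 0.44, recovering the simulated effect.
```

Treating each selected combination as a single derived covariate is what makes the subsequent multivariable model interpretable in familiar odds-ratio terms.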

It is important to note that individuals in our study experienced relatively few exposures during their lifetime, so this dataset may not represent the best setting to investigate the specific properties of Logic regression in identifying high-order interactions. Nonetheless, even in these data, Logic regression represents one of the few methodologies currently available to evaluate the effects on ALS risk of high-dimensional interactions between occupational exposures, without having to specify these interactions a priori using conventional modeling techniques. Finally, our description did not cover the potential adjustment for confounders in the Logic models. Such adjustment can be accommodated in a Logic regression and is implemented in the R package, but we observed that, with sample sizes such as the one we investigated, it may require an extremely long computational time, especially if one or more confounders is continuous. In this case, adjustment via propensity scores might be advantageous, reducing the number of variables added to the model and thereby the computational time. We recommend adjusting for potential confounders if the sample size allows it or, when this is not feasible, evaluating them in later regression models as we have described here.
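A minimal sketch of the propensity-score idea, assuming hypothetical confounders (age and sex): fit a small logistic model for exposure given the confounders (here with a hand-rolled Newton-Raphson routine rather than any particular package), and carry the fitted score forward as a single covariate in place of the original confounder set.

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Minimal Newton-Raphson logistic regression (intercept added internally)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Z @ beta))
        H = (Z * (p * (1 - p))[:, None]).T @ Z   # observed information matrix
        beta += np.linalg.solve(H, Z.T @ (y - p))
    return beta

rng = np.random.default_rng(2)
n = 500
age = rng.normal(60, 10, n)      # continuous confounder (hypothetical)
sex = rng.integers(0, 2, n)      # binary confounder (hypothetical)
exposure = (rng.random(n) <
            1 / (1 + np.exp(-(0.03 * (age - 60) + 0.3 * sex)))).astype(float)

# Propensity score: P(exposure | age, sex), collapsing the confounders into one
# column that can enter the Logic (or subsequent logistic) model as a covariate.
beta = fit_logistic(np.column_stack([age, sex]), exposure)
ps = 1 / (1 + np.exp(-(beta[0] + beta[1] * age + beta[2] * sex)))
```

With many exposures, one propensity score per exposure of interest (or a single score for a designated primary exposure) keeps the number of model terms, and hence the computational burden, manageable.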

In conclusion, Logic regression may represent a useful methodology for epidemiological studies dealing with a high number of covariates, and is one of the few available approaches for investigating patterns of multiple binary covariates as they relate to a given outcome, offering several advantages in terms of both computation and interpretation. Specifically, Logic regression is one of the few methods that searches the space of models for Boolean combinations related to the outcome while being completely embedded in a regression framework, and is recommended as the optimal choice when all exposures are binary. The use of specific scoring functions for each modeling class ensures objectivity in assessing the quality of the model, and provides the flexibility to implement the methodology across various modeling classes. The method requires extensive model pruning and results may not be intuitive to interpret, but interpretation can be largely simplified by using the results as a preliminary step for building a classical regression model. Applications of the method in epidemiological studies may be particularly subject to noise, and we recommend replicating the analysis several times to identify recurrent combinations. If noise persists, this may simply imply that the covariates being considered have low predictive ability, which is an informative conclusion in itself. We encourage researchers to move beyond the evaluation of risk factors one at a time and to consider available methods for simultaneously analyzing multiple interrelated predictors, such as Logic regression. At the same time, we underline the need for further methodological development for dealing with large epidemiological datasets.

Supplementary Material

Funding.

This work was supported by the National Institute of Neurological Disorders and Stroke (R21NS099910) and by the National Institute of Environmental Health Science (R01ES028800).

Footnotes

Disclosure. The authors declare no conflict of interest.

References

  1. Alexeeff Stacey E., Yau Vincent, Qian Yinge, Davignon Meghan, Lynch Frances, Crawford Phillip, Davis Robert, and Croen Lisa A.. 2017. “Medical Conditions in the First Years of Life Associated with Future Diagnosis of ASD in Children.” Journal of Autism and Developmental Disorders 47 (7): 2067–2079. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Aylward Lesa L., Kirman Christopher R., Schoeny Rita, Portier Christopher J., and Hays Sean M.. 2013. “Evaluation of Biomonitoring Data from the CDC National Exposure Report in a Risk Assessment Context: Perspectives across Chemicals.” Environmental Health Perspectives 121 (3): 287–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Billionnet Cécile, Sherrill Duane, and Annesi-Maesano Isabella. 2012. “Estimating the Health Effects of Exposure to Multi-Pollutant Mixture.” Annals of Epidemiology 22 (2): 126–141. [DOI] [PubMed] [Google Scholar]
  4. Breiman Leo. 2017. Classification and Regression Trees. Routledge. [Google Scholar]
  5. Bühlmann Peter, and Van De Geer Sara. 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media. [Google Scholar]
  6. Carlin Danielle J., Rider Cynthia V., Woychik Rick, and Birnbaum Linda S.. 2013. “Unraveling the Health Effects of Environmental Mixtures: An NIEHS Priority.” Environmental Health Perspectives 121 (1): A6–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Chen Carla CM, Schwender Holger, Keith Jonthan, Nunkesser Robin, Mengersen Kerrie, and Macrossan Paula. 2011. “Methods for Identifying SNP Interactions: A Review on Variations of Logic Regression, Random Forest and Bayesian Logistic Regression.” IEEE/ACM Transactions on Computational Biology and Bioinformatics 8 (6): 1580–1591. [DOI] [PubMed] [Google Scholar]
  8. DeVan AE, Johnson LC, Brooks FA, Evans TD, Justice JN, Cruickshank-Quinn C, Reisdorph N, Bryan NS, McQueen MB, Santos-Parker JR, and Chonchol MB. 2016. “Effects of Sodium Nitrite Supplementation on Vascular Function and Related Small Metabolite Signatures in Middle-Aged and Older Adults.” Journal of Applied Physiology 120 (4): 416–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Dickerson Aisha S., Hansen Johnni, Kioumourtzoglou Marianthi-Anna, Specht Aaron J., Gredal Ole, and Weisskopf Marc G.. 2018. “Study of Occupation and Amyotrophic Lateral Sclerosis in a Danish Cohort.” Occup Environ Med 75 (9): 630–638. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Dominici Francesca, Peng Roger D., Barr Christopher D., and Bell Michelle L.. 2010. “Protecting Human Health from Air Pollution: Shifting from a Single-Pollutant to a Multi-Pollutant Approach.” Epidemiology (Cambridge, Mass.) 21 (2): 187. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Friedman Jerome H. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics, 1–67. [Google Scholar]
  12. Greenland Sander. 1993. “Methods for Epidemiologic Analyses of Multiple Exposures: A Review and Comparative Study of Maximum-Likelihood, Preliminary-Testing, and Empirical-Bayes Regression.” Statistics in Medicine 12 (8): 717–736. [DOI] [PubMed] [Google Scholar]
  13. Hansen Johnni, and Funch Lassen Christina. 2011. “The Supplementary Pension Fund Register.” Scandinavian Journal of Public Health 39 (7_suppl): 99–102. [DOI] [PubMed] [Google Scholar]
  14. Howard Gregory J., and Webster Thomas F.. 2012. “Contrasting Theories of Interaction in Epidemiology and Toxicology.” Environmental Health Perspectives 121 (1): 1–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Kioumourtzoglou Marianthi-Anna, Seals Ryan M., Himmerslev Liselotte, Gredal Ole, Hansen Johnni, and Weisskopf Marc G.. 2015. “Comparison of Diagnoses of Amyotrophic Lateral Sclerosis by Use of Death Certificates and Hospital Discharge Data in the Danish Population.” Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration 16 (3–4): 224–229. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Lucek Paul R., and Ott Jurg. 1997. “Neural Network Analysis of Complex Traits.” Genetic Epidemiology 14 (6): 1101–1106. [DOI] [PubMed] [Google Scholar]
  17. Mooney Stephen J., Westreich Daniel J., and El-Sayed Abdulrahman M.. 2015. “Epidemiology in the Era of Big Data.” Epidemiology (Cambridge, Mass.) 26 (3): 390. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Naimi Ashley I., Platt Robert W., and Larkin Jacob C.. 2018. “Machine Learning for Fetal Growth Prediction.” Epidemiology 29 (2): 290–298. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Rothman Kenneth J., Greenland Sander, and Lash Timothy L.. 2008. Modern Epidemiology. Lippincott Williams & Wilkins. [Google Scholar]
  20. Ruczinski Ingo, Kooperberg Charles, and LeBlanc Michael. 2003. “Logic Regression.” Journal of Computational and Graphical Statistics 12 (3): 475–511. [Google Scholar]
  21. Ruczinski Ingo, Kooperberg Charles, and LeBlanc Michael L.. 2004. “Exploring Interactions in High-Dimensional Genomic Data: An Overview of Logic Regression, with Applications.” Journal of Multivariate Analysis 90 (1): 178–195. [Google Scholar]
  22. Schmidt Morten, Schmidt Sigrun Alba Johannesdottir, Sandegaard Jakob Lynge, Ehrenstein Vera, Pedersen Lars, and Sørensen Henrik Toft. 2015. “The Danish National Patient Registry: A Review of Content, Data Quality, and Research Potential.” Clinical Epidemiology 7: 449. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Schwender Holger, and Ickstadt Katja. 2007. “Identification of SNP Interactions Using Logic Regression.” Biostatistics 9 (1): 187–198. [DOI] [PubMed] [Google Scholar]
  24. Stafoggia Massimo, Breitner Susanne, Hampel Regina, and Basagaña Xavier. 2017. “Statistical Approaches to Address Multi-Pollutant Mixtures and Multiple Exposures: The State of the Science.” Current Environmental Health Reports 4 (4): 481–490. [DOI] [PubMed] [Google Scholar]
  25. Steenblock DA, Ikrar T, Antonio AS, Wardaningsih E, and Azizi MJ. 2018. “Amyotrophic Lateral Sclerosis (ALS) Linked to Intestinal Microbiota Dysbiosis & Systemic Microbial Infection in Human Patients: A Cross-Sectional Clinical Study.” Int J Neurodegener Dis 1 (003). [Google Scholar]
  26. Taskiran D, Sagduyu A, Yüceyar N, Kutay FZ, and Pögün Ş. 2000. “Increased Cerebrospinal Fluid and Serum Nitrite and Nitrate Levels in Amyotrophic Lateral Sclerosis.” International Journal of Neuroscience 101 (1–4): 65–72. [DOI] [PubMed] [Google Scholar]
  27. Taylor Kyla W., Joubert Bonnie R., Braun Joe M., Dilworth Caroline, Gennings Chris, Hauser Russ, Heindel Jerry J., Rider Cynthia V., Webster Thomas F., and Carlin Danielle J.. 2016. “Statistical Approaches for Assessing Health Effects of Environmental Chemical Mixtures in Epidemiology: Lessons from an Innovative Workshop.” Environmental Health Perspectives 124 (12): A227–29. 10.1289/EHP547. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Thomas Duncan C., Witte John S., and Greenland Sander. 2007. “Dissecting Effects of Complex Mixtures: Who’s Afraid of Informative Priors?” Epidemiology 18 (2): 186–190. [DOI] [PubMed] [Google Scholar]
  29. Vanderweele Tyler J. 2009. “Sufficient Cause Interactions and Statistical Interactions.” Epidemiology 20 (1): 6–13. 10.1097/EDE.0b013e31818f69e7. [DOI] [PubMed] [Google Scholar]
  30. Weisskopf Marc G., Seals Ryan M., and Webster Thomas F.. 2018. “Bias Amplification in Epidemiologic Analysis of Exposure to Mixtures.” Environmental Health Perspectives [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Yoo W, Ference BA, Cote ML, and Schwartz A. 2012. “A Comparison of Logistic Regression, Logic Regression, Classification Tree, and Random Forests to Identify Effective Gene-Gene and Gene-Environmental Interactions.” International Journal of Applied Science and Technology 2 (7): 268. [PMC free article] [PubMed] [Google Scholar]
  32. Xue YC, Feuer R, Cashman N, and Luo H. 2018. “Enteroviral Infection: The Forgotten Link to Amyotrophic Lateral Sclerosis?” Frontiers in Molecular Neuroscience 11: 63. [DOI] [PMC free article] [PubMed] [Google Scholar]
