PLOS ONE
2020 Aug 20;15(8):e0237905. doi: 10.1371/journal.pone.0237905

A classification model of homelessness using integrated administrative data: Implications for targeting interventions to improve the housing status, health and well-being of a highly vulnerable population

Thomas Byrne 1,*, Travis Baggett 2,3, Thomas Land 4, Dana Bernson 5, Maria-Elena Hood 5, Cheryl Kennedy-Perez 5, Rodrigo Monterrey 5, David Smelson 4, Marc Dones 6, Monica Bharel 5
Editor: Benn Sartorius
PMCID: PMC7446866  PMID: 32817717

Abstract

Homelessness is poorly captured in most administrative data sets, making it difficult to understand how, when, and where this population can be better served. This study sought to develop and validate a classification model of homelessness. Our sample included 5,050,639 individuals aged 11 years and older who were included in a linked dataset of administrative records from multiple state-maintained databases in Massachusetts for the period 2011–2015. We used logistic regression to develop a classification model with 94 predictors and subsequently tested its performance. The model had high specificity (95.4%) and moderate sensitivity (77.8%) for predicting known cases of homelessness, with excellent classification properties (area under the receiver operating curve 0.94; balanced accuracy 86.4%). To demonstrate the potential opportunity that exists for using such a modeling approach to target interventions to mitigate the risk of an adverse health outcome, we also estimated the association between model predicted homeless status and fatal opioid overdoses, finding that model predicted homeless status was associated with a nearly 23-fold increase in the risk of fatal opioid overdose. This study provides a novel approach for identifying homelessness using integrated administrative data. The strong performance of our model underscores the potential value of linking data from multiple service systems to improve the identification of housing instability and to assist government in developing programs that seek to improve health and other outcomes for homeless individuals.

Introduction

Homelessness is associated with a wide range of adverse social, economic and health outcomes [1–3]. Persons experiencing homelessness often interact with multiple publicly-funded systems of care including the emergency shelter, health care, mental health, substance use disorder treatment, and criminal justice systems [4–7], thus providing numerous points to address their housing, health care, and other social needs.

However, the capacity of publicly-funded systems to intervene is hampered by unavailable or incomplete data regarding persons who are experiencing homelessness or housing instability. Accurate risk identification, even without a perfect record of periods of homelessness, would still enhance the potential for more effective targeting of a variety of housing, health care, and social service interventions. Unfortunately, many service systems do not capture information about housing status in a reliable manner, despite the potential importance of such information for tailoring service delivery to those experiencing housing instability. Recognition of this shortcoming has led to increased interest in developing predictive models to identify persons experiencing homelessness using available data in administrative records. Much of this work has been conducted in health care systems, where studies have used indicators obtained from medical records, including diagnosis codes [8], address information [9, 10], and free text notes [11–13], to develop models identifying persons experiencing homelessness. Yet, these studies are limited by their exclusive reliance on data obtained from medical records and thus are based on a limited set of predictor variables and apply to non-representative samples of individuals.

Only one study [14] has used administrative records from multiple service systems to identify predictors of homelessness. However, that study focused on a specific population (young adults exiting foster care) and sought to identify salient predictors of homelessness rather than evaluate the overall performance of a predictive model that might be used in an applied context. Two related studies [15, 16] have used linked administrative data from multiple service systems to develop predictive models of the risk of long-term homelessness among two specific sub-populations (low-wage workers who recently experienced a job loss and young adults receiving public assistance) and of extremely high cost use of public services among individuals currently experiencing homelessness, respectively. However, to our knowledge, no study to date has used multiple sources of administrative data from public service systems to develop and evaluate the performance of a similar model of homelessness in a broader population.

Therefore, the current study capitalizes on the availability of a unique and rich data source that integrates administrative records from a wide array of service systems in Massachusetts to develop and test a classification model of homelessness. Because of widely recognized challenges in accurately identifying people experiencing homelessness [17], and evidence that administrative data sources capturing housing status do not fully concord with self-reported housing status [18], this study assumes that our data on homelessness are incomplete and that patterns of relationships between known cases of homelessness and other data can be applied to individuals who have not been recorded as having experienced a period of homelessness. As a second, exploratory aim, intended to illustrate the potential for health improvements that exists from using a similar modeling approach in an applied context to target interventions that mitigate serious health outcomes, we use the results of this classification model to assess the relationship between homelessness and the risk of fatal opioid-related overdose, a particularly adverse outcome previously linked to homelessness [19] that is especially important to evaluate in light of increasing rates of opioid-related overdose deaths nationwide [20].

Methods

Data and sample

Data for the present study come from the Massachusetts Chapter 55 of the Acts of 2015 (“Chapter 55”) integrated dataset. Enacted in August 2015, Chapter 55 authorized the linkage and mandated the analysis of several Massachusetts government databases to better understand the opioid epidemic and guide policy development. Chapter 55 allowed the Massachusetts Department of Public Health (DPH) to link individual-level records from 16 state-based administrative data sources. Persons aged 11 years or older who had health insurance between 2011 and 2015 as reported in the Massachusetts All-Payer Claims Database (APCD), which aggregates health care claims from all public and private payers, comprised the universe of individuals included in the Chapter 55 data warehouse. APCD data for these individuals, who represent more than 98% of Massachusetts residents, were linked with other datasets using a multistage deterministic linkage algorithm; full details on this linkage algorithm and about the 16 data sources included in the Chapter 55 data warehouse are available elsewhere [21]. The present study used data from 15 of the 16 sources that contributed to the Chapter 55 dataset (see Table 1). Since many of the 15 data sets included the same demographics, a “master demographic” dataset was created from the best available demographic information from across all Chapter 55 datasets.

Table 1. Summary of chapter 55 datasets and variables included in classification model.

Data Source Description Sample Variables
Chapter 55 Master Demographic Dataset Dataset aggregating and reconciling demographic information from all Chapter 55 datasets • Age
• Sex
• Race/ethnicity
Massachusetts All Payer Claims Database (APCD) Health, pharmacy, and dental insurance claims from the ~80 private health care payers, public health care payers, and publicly supported managed care organizations and senior care organizations across Massachusetts. • Indicator of MassHealth (Massachusetts Medicaid program) membership
• Separate indicators of any claims with diagnoses for:
• Psychoses
• Schizophrenia
• Substance use disorders
• Opioid use disorder
• Alcohol use disorder
Massachusetts Cancer Registry Population-based registry tracking incidence of cancer. • None included, used only to restrict sample
Acute Care Hospital Case Mix Records for all inpatient, emergency department, and outpatient observation discharges from acute care hospitals in the state • Indicator for any use of emergency department services
• Separate indicators for any inpatient, emergency department and outpatient observation discharge with claims with diagnosis codes for:
• Skin/soft tissue infection
• Anxiety disorder
• Bipolar disorder
• Medication induced mental health disorder
• Injection drug use
• Obsessive compulsive disorder
Massachusetts Department of Correction (DOC) Records for individuals incarcerated in Massachusetts prisons • Indicator for any history of incarceration in DOC facility
Massachusetts Department of Housing and Community Development (DHCD) Emergency Assistance Program Records of heads of homeless families who received services from the Emergency Assistance program. • None included, used to restrict sample
Massachusetts Department of Mental Health (DMH) Records for individuals receiving services from DMH, the Massachusetts State Mental Health Authority. • Indicator of psychiatric hospitalization
• Indicator of incarceration, as recorded by DMH
Massachusetts Department of Veteran Services (DVS) Records for individuals receiving medical, housing, or other benefits from DVS • Indicator of receipt of medical benefits from DVS
Massachusetts Department of Public Health, Bureau of Substance Addiction Services (BSAS) Substance use disorder (SUD) treatment episode data from BSAS-funded SUD treatment providers. • Separate indicators for BSAS-funded services including:
• Detox
• Case management
• Post-Detox treatment
• Outpatient treatment
Massachusetts Ambulance Trip Record Information (MATRIS) Emergency medical service (EMS) incident data from licensed ambulance services. • Indicator for any ambulance trip
Massachusetts Department of Public Health, Prescription Monitoring Program (PMP) Records for prescriptions for schedule II through V medications filled by all Massachusetts community, hospital outpatient, and clinic pharmacies, as well as from out-of-state mail order pharmacies delivering to Massachusetts. • Indicator for veteran status as recorded in the PMP
Massachusetts Office of the Chief Medical Examiner (OCME) Intake forms Cause of death • Opioid related deaths
Massachusetts Office of the Chief Medical Examiner (OCME) Toxicology Reports Toxicology Reports • Opioid related deaths
Massachusetts State Police Circumstances of Death Reports • Opioid related deaths
Massachusetts Registry of Vital Records and Statistics (RVRS) Death Records Official death certificates • Opioid related deaths
Massachusetts Registry of Vital Records and Statistics (RVRS) Birth Records Official birth certificates • Mother’s occupation code

The Chapter 55 dataset included records for a total of 14,245,349 individuals, based on APCD data. This exceeded the actual number of Massachusetts residents who met criteria for inclusion in the Chapter 55 dataset, suggesting that a number of records in the data reflected either non-Massachusetts residents or unresolved duplicate records. To ensure that our sample included only unique Massachusetts residents, we conservatively limited the cohort for the present study from the 14,245,349 unique individuals in the APCD data to the 5,050,639 unique individuals who had at least one record in both the APCD and one other Chapter 55 dataset. See S1 Fig for a schematic diagram showing the sample selection process.

Measures

Measures of homelessness

Developing a classification model of homelessness required that we identify known cases of homelessness in the Chapter 55 datasets. Based on the consensus of a working group of experts in homelessness, we used the following criteria to identify these known cases: 1) a claim in the APCD or record in the Acute Care Hospital Case Mix (Case Mix) data with an accompanying ICD-9 V60.0 or ICD-10 Z59.0 code indicating homelessness; 2) a record in the Department of Mental Health (DMH) dataset in which individuals were ever identified as experiencing a loss of housing based on a measure of housing status captured on a monthly basis for all DMH clients; 3) an ambulance record in the Massachusetts Ambulance Trip Record Information System (MATRIS) data in which the word “homeless” or “shelter” appeared in the narrative report; or 4) a prescription record in the Prescription Monitoring Program (PMP) in which the patient’s address matched that of an emergency shelter. Individuals meeting any of these criteria at any point during the 5-year observation period were classified as experiencing homelessness.
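
The "any of four criteria" rule above can be sketched as a simple disjunction. The field names below are illustrative only, not the actual Chapter 55 schema:

```python
# Hypothetical sketch of the "known case" rule: a person counts as a known
# case of homelessness if ANY of the four record-level criteria is met at
# any point in the observation window. Field names are illustrative.
def is_known_homeless(person):
    has_icd_code = person.get("apcd_or_casemix_homeless_code", False)  # ICD-9/10 homelessness code
    dmh_housing_loss = person.get("dmh_housing_loss", False)           # DMH monthly housing status
    matris_narrative = person.get("matris_narrative", "")              # EMS narrative text
    pmp_shelter_address = person.get("pmp_shelter_address", False)     # Rx address matches a shelter
    narrative_hit = any(w in matris_narrative.lower() for w in ("homeless", "shelter"))
    return bool(has_icd_code or dmh_housing_loss or narrative_hit or pmp_shelter_address)

print(is_known_homeless({"matris_narrative": "pt found outside SHELTER entrance"}))  # True
```

Note that the narrative match is case-insensitive, mirroring a plain keyword search of the MATRIS free-text report.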

Independent variables

We selected 94 possible independent variables from across the Chapter 55 datasets based on prior research identifying correlates of homelessness [22–25]. These predictors were classified into several groups, including socio-demographic predictors (e.g. age, gender, race, Medicaid receipt [a proxy for socioeconomic status]); drug/alcohol use predictors (e.g. presence of drug/alcohol diagnoses, use of substance use disorder treatment services); mental health predictors (e.g. presence of mental health diagnoses, use of mental health services); physical health predictors (e.g. skin disorders); and other service use predictors (e.g. history of incarceration in state prison, use of emergency department services). Table 1 provides examples of these predictors from each of the Chapter 55 datasets (the full set of predictors is provided in S1 Table).

Fatal opioid overdoses

Fatal opioid-related overdoses were identified from death records from the Massachusetts Registry of Vital Records and Statistics (RVRS). Deaths were classified using International Classification of Diseases, Tenth Revision (ICD-10) mortality codes or, for records that did not yet have an ICD-10 code assigned, using a literal search of the written cause of death from the medical examiner’s office. The following codes were selected from the underlying cause of death field to identify poisonings/overdoses: X40–X49, X60–X69, X85–X90, Y10–Y19, and Y35.2. All multiple cause of death fields were then used to identify an opioid-related death, which included any of the following ICD-10 codes: T40.0, T40.1, T40.2, T40.3, T40.4, and T40.6.
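
The two-step code selection described above (a poisoning underlying cause plus an opioid T-code in any multiple-cause field) can be expressed directly; this is a minimal sketch, not the production classification logic:

```python
# ICD-10 selection rule for fatal opioid overdoses: the underlying cause
# must fall in the poisoning ranges, and any multiple-cause field must
# carry one of the opioid T-codes listed in the text.
POISONING_PREFIXES = (
    [f"X{n}" for n in range(40, 50)]   # X40-X49 accidental poisoning
    + [f"X{n}" for n in range(60, 70)] # X60-X69 intentional self-poisoning
    + [f"X{n}" for n in range(85, 91)] # X85-X90 assault by poisoning
    + [f"Y{n}" for n in range(10, 20)] # Y10-Y19 undetermined intent
    + ["Y35.2"]                        # legal intervention, gas
)
OPIOID_CODES = {"T40.0", "T40.1", "T40.2", "T40.3", "T40.4", "T40.6"}

def is_opioid_overdose(underlying_cause, multiple_causes):
    is_poisoning = any(underlying_cause.startswith(p) for p in POISONING_PREFIXES)
    return bool(is_poisoning and any(c in OPIOID_CODES for c in multiple_causes))

print(is_opioid_overdose("X42", ["T40.1", "R09.2"]))  # True
```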

Analysis

Our primary aim and analytic plan centered on the development and testing of a classification model of homelessness. The terms “predictive model” and “classification model” are frequently used interchangeably to describe models that attempt to predict a categorical outcome conditional on a set of predictors, with the goal of maximizing model performance. We chose the term "classification model" in the current paper because our data are cross-sectional, and thus we cannot determine the temporal ordering of our outcome variable (homelessness) relative to our set of independent variables. Our analysis is therefore not “predicting” a future outcome from a set of antecedent predictors, and we use the term “classification model” to avoid confusion about the scope of our analysis.

To develop and test our classification model, we split the study sample into a development sample to be used in building the classification model of homelessness and a validation sample to be used to evaluate model performance. Given the proportionally small number of cases in our dataset identified as homeless and to ensure that the development and validation samples included equal proportions of individuals identified as homeless, we used a stratified random sampling approach to divide the sample into a development and validation sample. Specifically, we identified two strata based on whether individuals were identified as homeless based on the criteria outlined above. We then randomly assigned 75% of the cases within each stratum to the development sample and the remaining 25% of cases within each stratum to the validation sample. This resulted in a development sample that comprised 75% (n = 3,787,980) of cases in the full sample while the remaining 25% of cases from the full sample (n = 1,262,659) formed the validation sample.
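<br/>
The stratified 75/25 split described above can be sketched as follows. The function name and data layout are illustrative; the key property is that shuffling and allocation happen within each outcome stratum, so both samples carry the same proportion of known homeless cases:

```python
import random

# Sketch of a stratified random split: within each outcome stratum,
# shuffle and allocate 75% of cases to the development sample and the
# remaining 25% to the validation sample.
def stratified_split(ids_by_stratum, dev_frac=0.75, seed=42):
    rng = random.Random(seed)
    dev, val = [], []
    for stratum_ids in ids_by_stratum.values():
        shuffled = stratum_ids[:]          # copy so the input is untouched
        rng.shuffle(shuffled)
        cut = int(round(dev_frac * len(shuffled)))
        dev.extend(shuffled[:cut])
        val.extend(shuffled[cut:])
    return dev, val

# Toy example: 100 "homeless" ids and 1,000 "not homeless" ids.
dev, val = stratified_split({"homeless": list(range(100)),
                             "not_homeless": list(range(100, 1100))})
print(len(dev), len(val))  # 825 275
```

Because the cut is taken per stratum, exactly 75 of the 100 positive cases land in the development sample, preserving the outcome proportion in both partitions.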

We used multivariable binary logistic regression as the classification method in developing our classification model of homelessness. We initially estimated a model that included all individuals in the development sample. However, the small proportion of individuals in our cohort who met the criteria for homelessness (0.82%) resulted in models that had near perfect specificity but extremely poor sensitivity. We therefore used a technique called downsampling to balance outcome class membership in the development sample [26]. In the present context, downsampling worked by retaining all persons identified as homeless in the development sample and then randomly selecting an equal number of persons not identified as homeless for inclusion, while excluding all other cases. We then used this balanced development sample in the model development phase.
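<br/>
Downsampling as described above reduces to keeping every positive case and drawing an equal-sized random subset of negatives; a minimal sketch, with illustrative names:

```python
import random

# Sketch of downsampling: keep all positive (homeless) cases and draw an
# equal-sized random subset of negatives, discarding the rest, so the
# development sample is balanced 50/50 before model fitting.
def downsample(positives, negatives, seed=0):
    rng = random.Random(seed)
    sampled_negatives = rng.sample(negatives, k=len(positives))
    return positives + sampled_negatives

# Toy example: 50 positives, 10,000 negatives -> balanced sample of 100.
balanced = downsample(list(range(50)), list(range(1000, 11000)))
print(len(balanced))  # 100
```

The trade-off is that the fitted intercept no longer reflects the true base rate, which is one reason model performance must be assessed on the (unbalanced) validation sample, as the authors do.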

We applied parameter estimates from the logistic regression model estimated using the development sample to derive predicted probabilities of homelessness for all individuals in the validation sample. We evaluated model performance using area under the receiver operating curve (AUC), sensitivity (i.e. true positive rate), specificity (i.e. true negative rate), and balanced accuracy, which is the average proportion of correctly classified cases in each outcome category and is a better metric of overall model accuracy when there is severe imbalance between outcome classes [27]. We also calculated positive predictive value, which measures true positives as a proportion of all model predicted positive cases, and negative predictive value, which measures true negatives as a proportion of all model predicted negative cases.

To address the study’s second aim, we estimated fatal opioid-related overdose rates per 100,000 persons for both homeless and non-homeless individuals in the validation sample using model predicted probabilities to classify persons as homeless or not homeless. In the principal analysis, we classified individuals with predicted probabilities of ≥0.5 as homeless and individuals with predicted probabilities of <0.5 as non-homeless. In sensitivity analyses, we used two alternative approaches for assigning homelessness status based on model-predicted probabilities. In the first approach, we assigned all persons in the validation cohort with a known case of homelessness (regardless of their model predicted probability) a risk score of 1, and we used the model-predicted probabilities as the risk score for all other members of the study cohort. In the second approach, we assigned all persons with a known case of homelessness (based on criteria described above) a risk score of 1 and all persons with no observed homeless indicator and a model predicted probability <0.5 a risk score of 0, with all remaining individuals assigned a risk score equivalent to their predicted probability.

We then used these risk scores to calculate weighted estimates of the number of homeless and non-homeless persons in the validation sample, in addition to the number of fatal opioid overdoses experienced by each group. Specifically, we calculated the weighted estimate of the number of homeless persons as the sum of the homeless risk scores for all those with scores ≥0.5, and the weighted estimate of the number of non-homeless persons as the sum of the inverse of the homeless risk scores for all those with risk scores <0.5. We calculated the weighted estimate of the number of overdoses in the homeless group using a two-step process. First, we multiplied the homeless risk scores for all those with scores ≥0.5 by 1 if they experienced a fatal overdose or 0 if they did not. We then summed the resulting products to estimate the number of fatal overdose deaths in the homeless group. To estimate the number of fatal overdose deaths in the non-homeless group we repeated this two-step process, but used the inverse of the homeless risk score for all those with risk scores <0.5 in the first step.
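<br/>
The weighting scheme above can be sketched compactly. Here the "inverse" of a risk score is interpreted as its complement (1 − score), i.e. the implied probability of being non-homeless; the data layout is illustrative:

```python
# Sketch of the weighted estimates: risk_scores maps person id -> homeless
# risk score; overdoses is the set of ids with a fatal opioid overdose.
# Scores >= 0.5 contribute the score itself to the homeless totals;
# scores < 0.5 contribute the complement (1 - score) to the non-homeless totals.
def weighted_counts(risk_scores, overdoses):
    n_homeless = sum(s for s in risk_scores.values() if s >= 0.5)
    n_not = sum(1 - s for s in risk_scores.values() if s < 0.5)
    od_homeless = sum(s for pid, s in risk_scores.items()
                      if s >= 0.5 and pid in overdoses)
    od_not = sum(1 - s for pid, s in risk_scores.items()
                 if s < 0.5 and pid in overdoses)
    return n_homeless, n_not, od_homeless, od_not

# Toy example: persons 1 and 3 died of overdose.
nh, nn, oh, on = weighted_counts({1: 0.9, 2: 0.8, 3: 0.1, 4: 0.2}, {1, 3})
print(round(nh, 1), round(nn, 1), round(oh, 1), round(on, 1))  # 1.7 1.7 0.9 0.9
```

Group-specific mortality rates then follow as the weighted overdose count divided by the weighted person count, scaled to 100,000.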

For each of the above analytic approaches, we compared the risk of fatal opioid overdose between the homeless and non-homeless groups using rate ratios with 95% confidence intervals estimated using standard techniques [28].
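<br/>
One standard technique matching this description is the log-transform approximation for a rate ratio CI, with SE(ln RR) = sqrt(1/d₁ + 1/d₀) based on the two event counts. A sketch, using the event and person counts later reported in Table 3 to illustrate (this may differ in detail from the authors' exact implementation):

```python
import math

# Rate ratio with a 95% CI via the standard log-transform approximation:
# RR = (d1/n1) / (d0/n0), SE(ln RR) = sqrt(1/d1 + 1/d0),
# CI = RR * exp(-+ z * SE) with z = 1.96.
def rate_ratio_ci(d1, n1, d0, n0, z=1.96):
    rr = (d1 / n1) / (d0 / n0)
    se = math.sqrt(1 / d1 + 1 / d0)
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)

# First row of Table 3 (model predicted homeless status):
# 724 deaths among 69,675 homeless; 541 deaths among 1,192,443 non-homeless.
rr, lo, hi = rate_ratio_ci(724, 69675, 541, 1192443)
print(round(rr, 1), round(lo, 1), round(hi, 1))  # 22.9 20.5 25.6
```

The result reproduces the 22.9 (20.5–25.6) rate ratio reported in the paper's principal analysis.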

Results

Observed homelessness

Of 5,050,639 individuals in the analytic cohort, 41,457 (0.82%) were identified as experiencing homelessness according to our pre-specified indicators of known cases of homelessness. The number of individuals identified as homeless in each of the datasets used to construct this measure was as follows: 23,239 in the APCD dataset, 21,722 in the Case Mix dataset, 300 in the DMH dataset, 3,237 in the MATRIS dataset, and 6,704 in the PMP dataset. A total of 13,745 individuals, roughly one third of all those identified as homeless, were identified based on multiple indicators (see S1 Fig).

Homelessness classification

Applying the parameters of the classification model estimated on the downsampled development sample to the validation sample yielded an AUC of 0.94, which is in the excellent range by conventional guidelines [29]. S1 Table provides parameter estimates for the full development sample model. Assigning homeless status to all individuals in the validation sample with a predicted probability of homelessness of ≥0.5, the model identified a total of 69,675 individuals in the validation sample (5.5%) as homeless. Table 2 summarizes the metrics used to assess model performance. Balanced accuracy in the validation sample was 86.4%, indicating that, on average, 86.4% of cases in each outcome category were correctly classified. Sensitivity and specificity were 77.8% and 95.1%, respectively. Positive predictive value in the validation sample was 11.7% and negative predictive value was 99.8%. Dividing this positive predictive value (i.e. the proportion of model-predicted homeless cases that are truly homeless) by the baseline prevalence of homelessness of 0.82% (i.e. the expected proportion of cases that would be truly homeless if they were randomly selected) indicates that the performance of the model in identifying persons experiencing homelessness was more than fourteen times better than what would be expected based on chance alone. Nonetheless, taking the reciprocal of this positive predictive value also indicates that, for each person that the model correctly identified as experiencing homelessness, there would be 8.5 false positives.
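<br/>
The "fourteen times better than chance" figure is a simple lift calculation, which can be checked directly from the reported values:

```python
# Lift over chance: positive predictive value divided by baseline prevalence.
ppv = 0.117          # positive predictive value reported in Table 2
prevalence = 0.0082  # observed prevalence of homelessness in the cohort
lift = ppv / prevalence
print(round(lift, 1))  # 14.3
```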

Table 2. Summary of model performance.

Metric Value
Area under the receiver operating curve 0.94
Balanced accuracy 86.4
Sensitivity 77.8
Specificity 95.1
Positive predictive value 11.7
Negative predictive value 99.8

Fatal opioid overdoses

A total of 1,265 individuals in the validation sample experienced a fatal opioid overdose during the study period, resulting in a crude opioid-related mortality rate of 100.2 per 100,000 individuals. Table 3 summarizes the results of the comparison of fatal opioid overdose rates by homeless status in the validation sample. Using model-predicted probabilities to assign individuals to a homeless status (i.e. using a predicted probability of 0.5 as the threshold) resulted in a 22.9-fold increased risk of fatal opioid overdose in the homeless group relative to the non-homeless group. The two alternative approaches yielded estimated fatal overdose rates that were, respectively, roughly 9 and 21 times higher in the homeless group than in the non-homeless group.

Table 3. Summary of fatal overdoses based on validation sample.

Homeless Not Homeless Rate ratio (95% CI)
N No. of deaths Mortality rate N No. of deaths Mortality rate
Model predicted homeless status 69,675 724 1039.1 1,192,443 541 45.4 22.9 (20.5–25.6)
Weighted homeless status (Approach 1) 169,378 743 438.7 1,093,281 522 47.7 9.2 (8.2–10.3)
Weighted homeless status (Approach 2) 55,430 618 1114.9 1,207,229 647 53.6 20.8 (18.6–23.2)

Discussion

This study provides a novel approach for identifying homelessness using integrated administrative data from a large number of service systems. Leveraging data from these systems, we developed an accurate classification model with high specificity (95.4%), moderate sensitivity (77.8%), and excellent classification properties (AUC 0.94; balanced accuracy 86.4%). The strong relationship between model-predicted homeless status and several conditions with established associations with homelessness provided additional support for the validity of our model.

The strong overall performance of our model serves as a valuable proof of concept for other service systems or localities that are interested in identifying clients experiencing housing instability or homelessness even when such information is not directly available. Since we based our model on predictors obtained from a unique integrated administrative dataset, it may be difficult for other localities to replicate our exact model. Nonetheless, our results underscore the potential value of the general approach of linking data from multiple service systems and applying classification modeling techniques to these data. Doing so can lead to improved identification of housing instability and other risk factors that negatively impact health and well-being, which are difficult to measure and typically poorly captured in many service systems.

Using data in this manner carries with it the potential for the more efficient targeting of specialized service interventions at the point of care, particularly in the medical care, behavioral health, and criminal justice service systems. Our study presents a proof of concept of this idea, rather than a shelf-ready approach that can be applied immediately. However, the potential for developing such an applied approach based on our findings is real. To illustrate one example of the potential value of applying our model, we assessed the association between model-predicted homeless status and fatal opioid overdoses. We found a substantially elevated risk of fatal opioid overdose among those identified by the model as having a high probability of homelessness in alignment with prior research [19, 30]. This finding underscores the sizeable opportunity that could exist for reducing fatal overdoses if such individuals could be proactively identified and targeted for effective treatment interventions. There are analogous approaches already in use in other contexts. For example, Allegheny County in Pennsylvania uses linked administrative data from multiple county agencies in a predictive model that serves as a decision aid to frontline workers who screen and triage cases referred to the local child welfare system [31]. The model assigns risk scores quantifying both the likelihood of re-referral to the child welfare system were a worker to screen a child out of the system, and of foster care placement, were a child to be screened-in for further investigation. From a practical standpoint, moving from the development of a predictive model to the application of model results to inform service delivery requires resolving a host of technical, legal and ethical issues. Resolving these issues is not a small challenge, but neither is it an insurmountable one.

The high specificity of the model relative to its lower sensitivity indicates better performance at correctly identifying persons not experiencing homelessness than at correctly identifying those experiencing homelessness. The relatively low positive predictive value (11.7%) of our model was tied to the low prevalence of homelessness (0.82%), based on the indicators we used to identify known cases of homelessness in the available data, and underscores the challenges associated with developing predictive models for a relatively rare phenomenon. Indeed, while our model performed much better at correctly identifying persons experiencing homelessness than would be expected by chance, it nonetheless identified nearly nine false positives for every person correctly identified as homeless. As implied above, this approach may not be suitable for identifying individuals in near real-time but may be more useful for evaluating policies and programs aimed at serving a poorly identified population.

Finally, it is also important to note that this project was part of a larger effort to better understand predictors of fatal and nonfatal opioid overdose in Massachusetts. As such, the Chapter 55 data set was developed for multiple uses and users. The likelihood estimates of homelessness were made available to multiple research groups for inclusion in their models or for testing their hypotheses. This model of shared knowledge has profound implications for research uses of state collected administrative data sets.

This study has several limitations. First, the indicators we used to identify known cases of homelessness likely failed to capture many individuals who actually did experience homelessness over the five-year study period. Indeed, the prevalence of homelessness observed in the present study (0.82%) is far lower than prior estimates of the five-year prevalence of homelessness in the general population (4.6%) [32]. This underestimate was due in part to the fact that identification of homelessness was conditional on use of a service system that captured information about housing status. Access to Homeless Management Information System (HMIS) records, collected on a routine basis by the homeless assistance system, would have improved the quality of our measure of homelessness, although it still would have been imperfect and incomplete. The shortcomings of our measure of homelessness likely affected the performance of our model, although, at the same time, the lack of reliable measures of homelessness in the Chapter 55 data was one of the primary motivations for this study. Another limitation is that our cross-sectional approach could not take into account the duration of homelessness or its timing relative to other service use experiences (e.g. episodes of incarceration) used as predictor variables in our model. This means that some experiences used as predictors in our model may have temporally succeeded an individual’s experience of homelessness. Similarly, our analysis of the relationship between model-predicted homeless status and fatal opioid overdoses was potentially biased by the inclusion of substance-use related measures in our predictive model of homelessness. Additionally, selected demographic variables known to be associated with homelessness (e.g. gender identity and sexual orientation) were not reliably available in the Chapter 55 data and were not included in our model.

Conclusions

The present study is a useful example of how large, integrated administrative data from multiple service systems can be used to identify individuals at risk of homelessness to facilitate targeted services or timely intervention. Prior research has shown that homeless individuals have a high burden of medical and mental illnesses, substance use disorders, and health care and human services systems use [1, 33–36]. By identifying individuals at risk of homelessness, service providers can improve the coordination of services and promote better health outcomes, particularly for conditions such as opioid use disorder that exact a high toll on individuals experiencing homelessness. Future work should focus on refining our approach to aid in identifying individuals at high risk of homelessness who may benefit from targeted service interventions.

Supporting information

S1 Fig. Flow diagram of sample selection.

(DOCX)

S1 Table. Full results of binary logistic regression model fit on development sample.

(DOCX)

Data Availability

Data cannot be publicly shared due to legal restrictions that prohibit the sharing of both the data sources used in the construction of the study's analytic data set and the analytic data set itself. More specifically, study authors had access to these data by responding to a Notice of Opportunity issued by the Commonwealth of Massachusetts that enabled interested parties to propose analysis of these data for a set period of time. The deadline for responding to the Notice of Opportunity was April 30, 2017. Information about that Notice of Opportunity is available at the following link: https://www.commbuys.com/bso/external/bidDetail.sdo?docId=BD-17-1031-HISRE-HIS01-11089&external=true&parentUrl=bid. The Commonwealth subsequently issued a second Notice of Opportunity to conduct analysis with these data, and applications in response to that Notice of Opportunity closed on March 23, 2018. Additional information about that Notice of Opportunity is available here: https://www.commbuys.com/bso/external/bidDetail.sdo?bidId=BD-18-1031-OFFIC-ODMOA-24680&parentUrl=activeBids. At the time of the submission of this manuscript, there was not an open Notice of Opportunity from the Commonwealth of Massachusetts to which interested researchers could apply to access these data. Additional information about the data used in this study and relevant legal restrictions and data access are available at the following link: https://www.mass.gov/public-health-data-warehouse-phd. As per that website, the contact persons for the data are Abigail Averbach (Abigail.Averbach@MassMail.State.MA.US) and Brigido Ramirez Espinosa (Brigido.Ramirez1@State.MA.US) at the Massachusetts Department of Public Health Office of Population Health.

Funding Statement

One author (Marc Dones) was employed by a commercial entity, the Center for Social Innovation, at the time work on the manuscript was completed. The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of this author are articulated in the ‘author contributions’ section.

References

Decision Letter 0

Benn Sartorius

23 Jan 2020

PONE-D-19-26052

Predicting homelessness using integrated administrative data: Implications for targeting interventions to improve the housing status, health and well-being of a highly vulnerable population

PLOS ONE

Dear Dr. Byrne,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Mar 08 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Benn Sartorius, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

Editorial comments

Please include a completed GATHER checklist as part of the supplementary material and make reference to this in the methods.

Journal Requirements:

When submitting your revision, we need you to address these additional requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your data availability statement, please add the date in the following statement: "The deadline for responding to the Notice of Opportunity was [XX]."

3. Thank you for stating the following in the Competing Interests section:

"I have read the journal's policy and the authors of this manuscript have the following competing interests: Travis Baggett receives royalties from UpToDate for authorship of a topic review on health care for homeless people. No other authors have any competing interests to disclose."

Please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials, by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests).  If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include your updated Competing Interests statement in your cover letter; we will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

4. Please amend your list of authors on the manuscript to ensure that each author is linked to an affiliation. Authors’ affiliations should reflect the institution where the work was done (if authors moved subsequently, you can also list the new affiliation stating “current affiliation:….” as necessary).

5. Thank you for stating the following in the Competing Interests section:

"I have read the journal's policy and the authors of this manuscript have the following competing interests: Travis Baggett receives royalties from UpToDate for authorship of a topic review on health care for homeless people. No other authors have any competing interests to disclose."

We note that one or more of the authors are employed by a commercial company: Future Laboratories.

  1. Please provide an amended Funding Statement declaring this commercial affiliation, as well as a statement regarding the Role of Funders in your study. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. You can update author roles in the Author Contributions section of the online submission form.

Please also include the following statement within your amended Funding Statement.

“The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

If your commercial affiliation did play a role in your study, please state and explain this role within your updated Funding Statement.

2. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc. 

Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and  there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include both an updated Funding Statement and Competing Interests Statement in your cover letter. We will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This is a very interesting use of a novel linked administrative dataset at a state level. The study methods are appropriate to the data, but one question is why the authors did not use a machine learning approach (e.g. PRM)? The primary limitation of the study is that homelessness is significantly underidentified in the data available and biased to specific service users. This undoubtedly contributes to the modest sensitivity. As the authors note in the study limitations section, the study would have been greatly strengthened by inclusion of general shelter user data.

One of the indicators for homelessness was Emergency Assistance receipt, which I believe is the emergency shelter assistance program for families. If so, this means that nearly all of the homeless family adults were identified for the study, in contrast to the single adults. And given that homeless families are quite distinct from single adults, in terms of their risk factors and service use patterns, the study model may have been muddled by the combining of adults in homeless families and other homeless adults. Perhaps the model should be run separately for families and single adults to see if it improves model performance. However, based on their description, the data may not be available to rerun the analysis in this way.

Otherwise, I found the paper to be a strong contribution, particularly in light of the fact that the model could be used by other states to identify highly vulnerable people in various service systems who could be screened for homelessness risk, and provided with prevention services.

Reviewer #2: Date: 1/22/2020

Manuscript #: PONE-D-19-26052

Title: Predicting homelessness using integrated administrative data: Implications for

targeting interventions to improve the housing status, health and well-being of a highly

vulnerable population

Overall comment: Identifying (classifying or predicting) homelessness is an important topic and has many applications in healthcare and elsewhere. The paper is well-written. I have two major concerns: (1) right type of analysis and (2) focus of the paper. First, in its current format, I think the analysis represents a cross-sectional classification/association type analysis rather than a prediction study. There is no indication whether the variables used occurred before, during, or after experiencing homelessness. As a matter of fact, authors do not have a good indication on when homelessness has occurred. For a prediction or prognosis study, predictors should occur before the event. The timing is important and key in a predictive model. This is not the case in this study. Having said that, I still see value in the study. Classification or identification of risks associated with homelessness is also important. It is up to authors to decide what they want to do and appropriately conduct the analysis. Second, the paper is not focused. I would remove the association between homelessness and other health condition and opioid overdose from the paper. They are irrelevant to the main topic. They can be presented separately and more in depth elsewhere.

The following are my minor comments:

Abstract:

The main purpose of the study was to develop and validate a model for homelessness prediction. This is an important topic. However, I am not sure why the authors lost the focus and brought into attention association between homelessness and a series of health conditions including opioid overdose. I would suggest the authors discuss these as potential applications of their predictive model and not as the main focus of the manuscript.

Introduction:

Page 3, line 56: Change “service systems” to “publicly-funded systems.”

Page 4, lines 83-86: what is the basis for your assumption? Any citation to validate or explain how you made this assumption and how accurate it could be?

Page 4, lines 86-88: Again, what is the basis for this claim? This is a huge assumption to make. You are basically validating your model not based on actual data on homelessness but based on measures that correlate with homelessness. What are the degrees of correlation? Elaborate. Cite your references for such an assumption.

Page 4. Lines 88-91: As I mentioned above, I strongly recommend keeping the paper focus. This paper is about a predictive model of homelessness based on integrated administrative data. Keep the rest for future papers. And stay focus on using various predictive models, make your model parsimonious, validate it properly, show its economic usefulness, etc., etc.

Data and Sample:

Page 5, line 97: What are programmatic decisions?

Page 5, lines 104-105: What are the 15 data sets? What variables is linked with the main Chapter 55 dataset? Why these variables are chosen? Cite your multistage deterministic approach to merge the data across all these datasets (or put it in the appendix).

Page 5, line 114: consider rewording the sentence. So, did I understand this correctly? Among 14,245,349 people included in the Chapter 55 dataset only 5,050,639 had a record in the APCD. Please include a complete and detailed schematic flow diagram of your sample size. This can be included in the appendix.

Measure of Homelessness:

Page 6, line 122-129. Please include number of homeless people identified based on each of the defined criterion in your schematic flow diagram.

Analysis:

Page 10, lines 155: Explain your stratified random sampling. How did you stratify? Did you consider other variables such as age, sex, race/ethnicity to be randomly distributed in both development and validation group?

Page 14, line 233: How did you calculate 14 times? For each correctly identified homeless person, there are 8.5 false positive. Elaborate on this.

Note: I would like to see a table with all related diagnostic measures (i.e. C-statistic, sensitivity, specificity, PPV, NPV, etc.)

Note 2. The specificity of your model is extremely high (95.1%). Could it be because of timing of prediction, meaning that you included variables in your prediction model that was taken after a person experienced homelessness. So, your model actually did not predict homelessness. It assesses the risks of several variables and their associations with homelessness. This is different from prediction. Timing is important in a predictive model. Timing of prediction should be prior to the event. So, in building a predictive model, one should use only predictors that are available prior to being homeless.

Note 3. Please include the following information for the logistic regression model in the appendix:

1. Full name of the abbreviated variables.

2. Diagnostic testing of your regression model.

3. Instead of using “unknown” or “missing” as your reference category, please use more meaningful groups for your categories. For example, for race, use “White” as your reference category.

4. This study covers a wide age-range group (11+). The question about one’s mother’s occupation for certain age group seems irrelevant. And, this variable may change frequently for certain jobs.

5. These data are gathered over time. What if the condition for one person changed? Any thought of including longitudinal (time-variant) variables in your model? In that case probably a generalized estimating equation would be more appropriate than simple logistic model.

6. What do these variables mean or represent?

Any record in BSAS

Any record in Casemix mental health records

Any record in DMH

Any record in DVS

Any record in Matris

Any record in PMP


**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: PLOSONE_20200122_EM.docx

PLoS One. 2020 Aug 20;15(8):e0237905. doi: 10.1371/journal.pone.0237905.r002

Author response to Decision Letter 0


15 Jul 2020

We thank the reviewers for their thorough feedback on this manuscript. We have taken their feedback into account in revising this manuscript, and believe it is stronger as a result. Below, we detail how we responded to each of the comments offered by the reviewers.

Reviewer #1

1. This is a very interesting use of a novel linked administrative dataset at a state level. The study methods are appropriate to the data, but one question is why the authors did not use a machine learning approach (e.g. PRM)? The primary limitation of the study is that homelessness is significantly underidentified in the data available and biased to specific service users. This undoubtedly contributes to the modest sensitivity. As the authors note in the study limitations section, the study would have been greatly strengthened by inclusion of general shelter user data.

We thank the reviewer for this positive assessment of our use of this unique dataset. We appreciate the reviewer’s suggestion that we use a machine learning approach as part of our analytic strategy. We did, in fact, hope to employ machine learning algorithms such as random forests or support vector machines as an alternative and point of comparison to logistic regression. Unfortunately, due to the unique nature of these data, they could only be managed and analyzed using a version of the SAS statistical software that did not provide us with the capability to use such algorithms. However, it bears mentioning that recent work by Gao and colleagues (2017) and the present study’s lead author (Byrne et al., 2019) have shown that the performance of machine learning algorithms in predicting homelessness is only marginally better than logistic regression.
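The diagnostic measures that frame this comparison (sensitivity, specificity, and balanced accuracy, as reported in the abstract) are simple functions of the confusion matrix. As an illustrative sketch only (the study's analyses were run in SAS; this is not the authors' code, and the toy labels below are invented), they can be computed from observed and model-predicted homeless status as follows:

```python
def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, and balanced accuracy from
    binary observed labels (y_true) and predicted labels (y_pred)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)   # share of known homeless correctly flagged
    specificity = tn / (tn + fp)   # share of non-homeless correctly not flagged
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy

# Toy example: 1 = known homeless, 0 = not identified as homeless
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec, bal = classification_metrics(y_true, y_pred)
```

Note that balanced accuracy, unlike raw accuracy, is not inflated by the extreme class imbalance here (0.82% prevalence), which is why it is a more honest summary for this kind of model.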

We also agree that the study could be strengthened by the inclusion of general shelter use data. Unfortunately, these data were unavailable, in part because data for the single adult shelter system are collected and maintained not by a single state agency but by more than a dozen distinct entities, known as Continuums of Care (CoCs), in various regions throughout the state. It was therefore not possible to obtain data from these entities and merge them with the Chapter 55 database.

Gao, Y., Das, S., & Fowler, P. (2017, March). Homelessness service provision: a data science perspective. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.

Byrne, T., Montgomery, A. E., & Fargo, J. D. (2019). Predictive modeling of housing instability and homelessness in the Veterans Health Administration. Health services research, 54(1), 75-85.

2. One of the indicators for homelessness was Emergency Assistance receipt, which I believe is the emergency shelter assistance program for families. If so, this means that nearly all of the homeless family adults were identified for the study, in contrast to the single adults. And given that homeless families are quite distinct from single adults, in terms of their risk factors and service use patterns, the study model may have been muddled by the combining of adults in homeless families and other homeless adults. Perhaps the model should be run separately for families and single adults to see if it improves model performance. However, based on their description, the data may not be available to rerun the analysis in this way.

The reviewer raises an excellent point with this comment. It is indeed correct that records from the Emergency Assistance program were one of the data sources included in the Chapter 55 data warehouse upon which the study was based. However, we did not use receipt of Emergency Assistance services as an indicator of homelessness for precisely the reason that the reviewer mentions: it would have led to a nearly completely accurate identification of homelessness among adults in families, and a far less accurate identification of homelessness among single adults. We used the Emergency Assistance data only as a way to restrict the sample to eliminate what were assumed to be duplicate records in the All Payer Claims Database, which served as the base dataset for the data linkage; i.e., we only included persons who had a claim in the All Payer Claims Database AND at least one of the other data sources (one of which was Emergency Assistance data). As we note in the methods section, we used the following indicators to identify homelessness: 1) a claim in the APCD or record in the Acute Care Hospital Case Mix (Case Mix) data with an accompanying ICD-9 V60.0 or ICD-10 Z59.0 code indicating homelessness; 2) a record in the Department of Mental Health (DMH) dataset in which individuals were ever identified as experiencing a loss of housing based on a measure of housing status captured on a monthly basis for all DMH clients; 3) an ambulance record in the Massachusetts Ambulance Trip Record Information System (MATRIS) data in which the word “homeless” or “shelter” appeared in the narrative report; or 4) a prescription record in the Prescription Monitoring Program (PMP) in which the patient’s address matched that of an emergency shelter.
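The third indicator amounts to a keyword search over free-text narratives. A minimal sketch, assuming the narratives are available as plain strings (the function name and sample records below are hypothetical illustrations, not the actual MATRIS schema or the authors' implementation):

```python
import re

# Case-insensitive, whole-word match, so e.g. "home" alone is not flagged.
KEYWORDS = re.compile(r"\b(homeless|shelter)\b", re.IGNORECASE)

def flag_matris_record(narrative: str) -> bool:
    """Flag an ambulance trip record whose narrative report mentions
    'homeless' or 'shelter' (indicator 3 in the list above)."""
    return bool(KEYWORDS.search(narrative))

# Invented example narratives, purely for illustration:
records = [
    "Pt found outside shelter, c/o chest pain",
    "Transport from home to ED",
]
flags = [flag_matris_record(r) for r in records]  # [True, False]
```

A simple pattern like this trades recall for precision; misspellings or phrases such as "no fixed address" would be missed, which is consistent with the undercounting discussed in the limitations.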

3. Otherwise, I found the paper to be a strong contribution, particularly in light of the fact that the model could be used by other states to identify highly vulnerable people in various service systems who could be screened for homelessness risk, and provided prevention services.

We thank the reviewer for this generous assessment of our work. While we do not view our analysis as delivering a “shelf ready” product that can be used to identify those at risk of homelessness and target them with prevention services accordingly, we do believe that it offers “proof of concept” of how integrated administrative data might be used for this purpose.

Reviewer #2

1. Overall comment: Identifying (classifying or predicting) homelessness is an important topic and has many applications in healthcare and elsewhere. The paper is well-written. I have two major concerns: (1) right type of analysis and (2) focus of the paper. First, in its current format, I think the analysis represents a cross-sectional classification/association type analysis rather than a prediction study. There is no indication whether the variables used occurred before, during, or after experiencing homelessness. As a matter of fact, authors do not have a good indication on when homelessness has occurred. For a prediction or prognosis study, predictors should occur before the event. The timing is important and key in a predictive model. This is not the case in this study. Having said that, I still see value in the study. Classification or identification of risks associated with homelessness is also important. It is up to authors to decide what they want to do and appropriately conduct the analysis.

We thank the reviewer for this comment. While the terms “predictive model” and “classification model” are frequently used interchangeably, we concede the reviewer’s point that our use of the term “predictive model” may be misleading. Additionally, while we would like to revise our analysis to take temporal ordering into account, we cannot accurately assess the onset of homelessness in our data; we can only assess when someone is identified as homeless in one of the administrative data sources we use. The distinction is subtle, but important in terms of developing a truly “predictive” model. In other words, we could theoretically estimate a model in which we use predictors collected only prior to a first indicated date of homelessness, but this too would be problematic because we could not actually assess whether predictors temporally preceded the onset of an episode of homelessness. As such, we have decided to retain our cross-sectional approach and describe this as a limitation in the Discussion. We also now use the term “classification model” throughout the paper and explain our reasoning for doing so in the Analysis section, where we have added the following explanation:

Our analytic plan centered on the development and testing of a classification model of homelessness. The terms “predictive model” and “classification model” are frequently used interchangeably to describe models that attempt to predict a categorical outcome conditional on a set of predictors, with the goal of maximizing the performance of such models. We chose the term “classification model” in the current paper to reflect the fact that our data are cross-sectional, and thus we cannot determine the temporal ordering of our outcome variable (homelessness) relative to our set of independent variables. Our analysis is therefore not “predicting” a future outcome based on a set of antecedent predictors, and we use the term “classification model” to avoid confusion about the scope of our analysis.

2. Second, the paper is not focused. I would remove the association between homelessness and other health condition and opioid overdose from the paper. They are irrelevant to the main topic. They can be presented separately and more in depth elsewhere.

We thank the reviewer for this comment and recognize that the inclusion of these other health outcomes made for a somewhat disjointed paper in its original configuration. As such, we have removed the analysis of the association between homelessness and the health conditions (e.g. Hepatitis C, HIV/AIDS) that we had originally included as validators of our model. We have, however, retained the analysis of the association between model-predicted homeless status and fatal opioid overdoses, and reframed the presentation of this analysis. Our decision to retain this analysis was motivated by two factors. First, the regulations put in place by the Massachusetts legislature governing the use of the Chapter 55 data used in this study required that all analyses using the data include some examination of opioid overdoses as part of their scope. Second, and more importantly, we believe that there is substantive value in retaining this analysis in the paper even if these regulations did not exist. Specifically, as we try to emphasize in the Discussion, we envision our analysis not simply as an academic exercise, but as the first step towards an applied use of more refined modeling approaches that would lead to targeting of services to improve housing and health outcomes among a high-risk population. Thus, our inclusion of an analysis of the relationship between model-predicted homeless status and fatal opioid overdoses is intended as an example that demonstrates the potential value and opportunity that exists in targeting interventions in this manner to mitigate the risk of a serious adverse health outcome. To more clearly articulate this rationale for including this analysis, we have reframed our description of the aim of assessing the relationship between model-predicted homeless status and fatal opioid overdoses in the Introduction and the implications of the results of this analysis in the Discussion.
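The association retained here (the roughly 23-fold elevation in fatal opioid overdose risk reported in the abstract) is, in its simplest unadjusted form, a risk ratio from a 2x2 table. A minimal sketch with made-up counts purely for illustration (these are not the study's data, and the published estimate may reflect a different, adjusted model):

```python
def risk_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Unadjusted relative risk: event incidence in the exposed group
    (here, model-predicted homeless) divided by incidence in the
    unexposed group."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)

# Invented counts for illustration only: 46 fatal overdoses per 10,000
# model-predicted homeless individuals vs. 2 per 10,000 among the rest
# would yield a 23-fold elevation in risk.
rr = risk_ratio(46, 10_000, 2, 10_000)  # 23.0
```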

The following are my minor comments:

3. Abstract:

The main purpose of the study was to develop and validate a model for homelessness prediction. This is an important topic. However, I am not sure why the authors lost the focus and brought into attention association between homelessness and a series of health conditions including opioid overdose. I would suggest the authors discuss these as potential applications of their predictive model and not as the main focus of the manuscript.

We thank the reviewer for this comment. As we noted above, we have revised the paper to 1) remove the analysis of the relationship between model-predicted homeless status and health conditions we had previously included as validators; and 2) reframed our inclusion of the analysis of the relationship between model predicted homeless status and fatal opioid overdoses. We have amended the abstract to reflect these changes.

4. Introduction:

Page 3, line 56: Change “service systems” to “publicly-funded systems.”

We have made this change.

5. Page 4, lines 83-86: what is the basis for your assumption? Any citation to validate or explain how you made this assumption and how accurate it could be?

We thank the reviewer for this comment. We have revised this sentence to explain the basis of this assumption. Specifically, the assumption is rooted in long-standing challenges in accurately identifying people experiencing homelessness using administrative data and the fact that even administrative data with information on housing status may not accurately capture an individual’s true housing status. We now explain these points and have included citations to justify them.

6. Page 4, lines 86-88: Again, what is the basis for this claim? This is a huge assumption to make. You are basically validating your model not based on actual data on homelessness but based on measures that correlate with homelessness. What are the degrees of correlation? Elaborate. Cite your references for such an assumption.

We agree with this comment and, as described above, we have removed the analysis in which we seek to validate our models against other health conditions.

7. Page 4, lines 88-91: As I mentioned above, I strongly recommend keeping the paper focused. This paper is about a predictive model of homelessness based on integrated administrative data. Keep the rest for future papers. And stay focused on using various predictive models, making your model parsimonious, validating it properly, showing its economic usefulness, etc.

We thank the reviewer again for this comment. As noted above, we have made changes to the paper to more clearly focus on the classification model and to make a more explicit connection between the model and our analysis of the relationship between model-predicted homeless status and fatal opioid overdoses.

8. Data and Sample:

Page 5, line 97: What are programmatic decisions?

We have removed this phrase and now mention only the goal to “guide policy development,” which we feel is clearer.

9. Page 5, lines 104-105: What are the 15 data sets? What variables are linked with the main Chapter 55 dataset? Why were these variables chosen? Cite your multistage deterministic approach to merging the data across all these datasets (or put it in the appendix).

Thank you for raising this comment. The 15 data sets we used are all included in Table 1 and we now direct readers to that Table in this section of the paper. We believe a full discussion of the multistage deterministic matching approach is beyond the scope of the current paper, although it has been described in full detail in a publicly available report from the Massachusetts Department of Public Health. We thus cite this report and direct interested readers to it by writing the following (new text shown in italics):

APCD data for these individuals, who represent more than 98% of Massachusetts residents, were linked with other datasets using a multistage deterministic linkage algorithm; full details on this linkage algorithm and about the 16 data sources included in the Chapter 55 data warehouse are available elsewhere.[19]

10. Page 5, line 114: consider rewording the sentence. So, did I understand this correctly: among 14,245,349 people included in the Chapter 55 dataset, only 5,050,639 had a record in the APCD? Please include a complete and detailed schematic flow diagram of your sample size. This can be included in the appendix.

We have revised the wording to clarify that there are actually 14,245,349 unique records in the APCD, but we only included the 5,050,639 who also had at least one record in a dataset besides the APCD. The remaining records are suspected duplicates or out-of-state residents. We have also now included the recommended schematic diagram as S1 Figure.

11. Measure of Homelessness:

Page 6, lines 122-129. Please include the number of homeless people identified based on each of the defined criteria in your schematic flow diagram.

We have included this in the newly added S1 Figure.

12. Analysis:

Page 10, line 155: Explain your stratified random sampling. How did you stratify? Did you consider other variables such as age, sex, and race/ethnicity to be randomly distributed in both the development and validation groups?

We have revised this section to more clearly describe the stratified random sampling procedure we used and why we used it. Specifically, we now explain that we used our measure of whether an individual was identified as homeless to stratify our sample into two strata. We then randomly sampled 75% of cases within each stratum and assigned them to the development sample; the remaining 25% within each stratum were assigned to the validation sample. Given the small number of cases in our overall sample identified as homeless, and the attendant risk that simple random sampling would yield an imbalanced proportion of homeless cases in the development (or validation) sample, we used this approach to balance the proportion of homeless cases across the development and validation samples.
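
The stratified splitting procedure described above can be sketched as follows. This is a minimal illustration only, not the authors' actual code, and the record counts and field names (e.g., a binary "homeless" flag, a 1,000-record sample) are hypothetical:

```python
import random

def stratified_split(records, is_homeless, dev_frac=0.75, seed=42):
    """Split records into development and validation samples while
    preserving the proportion of homeless cases in each (stratified
    random sampling on a binary homeless indicator)."""
    rng = random.Random(seed)
    development, validation = [], []
    # Form one stratum per value of the homeless indicator, then
    # randomly assign dev_frac of each stratum to the development sample.
    for stratum_value in (True, False):
        stratum = [r for r in records if is_homeless(r) == stratum_value]
        rng.shuffle(stratum)
        cut = round(dev_frac * len(stratum))
        development.extend(stratum[:cut])
        validation.extend(stratum[cut:])
    return development, validation

# Hypothetical data: 1,000 records, 5% flagged as homeless
records = [{"id": i, "homeless": i < 50} for i in range(1000)]
dev, val = stratified_split(records, lambda r: r["homeless"])
```

Because the split is performed within each stratum, the development and validation samples end up with approximately the same share of homeless cases as the full sample, which a simple random split of a rare-outcome sample would not guarantee.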

13. Page 14, line 233: How did you calculate 14 times? For each correctly identified homeless person, there are 8.5 false positives. Elaborate on this.

We now explain how we calculated each of these.

14. Note: I would like to see a table with all related diagnostic measures (i.e. C-statistic, sensitivity, specificity, PPV, NPV, etc.)

We now include these in a new table, Table 2.
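
The diagnostic measures requested by the reviewer can all be derived from the counts of a 2x2 confusion matrix. A small sketch, using hypothetical counts chosen for illustration only (not the study's actual figures):

```python
def diagnostics(tp, fp, fn, tn):
    """Standard classification diagnostics from 2x2 confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        # Balanced accuracy is the mean of sensitivity and specificity.
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical confusion-matrix counts for illustration only
m = diagnostics(tp=780, fp=46, fn=220, tn=954)
```

Reporting these measures together, as in the new Table 2, lets readers see the trade-off between sensitivity and specificity that a single summary statistic such as the C-statistic obscures.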

15. Note 2. The specificity of your model is extremely high (95.1%). Could it be because of the timing of prediction, meaning that you included variables in your prediction model that were measured after a person experienced homelessness? If so, your model did not actually predict homelessness; it assessed several variables and their associations with homelessness, which is different from prediction. Timing is important in a predictive model: the timing of prediction should be prior to the event. So, in building a predictive model, one should use only predictors that are available prior to a person becoming homeless.

We acknowledge that this is a limitation of our paper. As we note above, and in response to similar feedback offered by the reviewer, we have now amended the description of our modeling approach to reflect the fact that it is not possible to assess temporal ordering between our predictors and our outcomes.

16. Note 3. Please include the following information for the logistic regression model in the appendix:

1. Full name of the abbreviated variables.

We have now added a note in which we spell out all abbreviations used in the table.

2. Diagnostic testing of your regression model.

We are not sure what specific diagnostics the reviewer is hoping to see. We are happy to include them if the reviewer can be more specific.

3. Instead of using “unknown” or “missing” as your reference category, please use more meaningful groups for your categories. For example, for race, use “White” as your reference category.

Due to restrictions around the use of the Chapter 55 data, we were unable to re-estimate the models with different reference categories. However, the choice of reference category, while potentially of substantive interest, should not affect the performance of the models, which was the main focus of the analysis, rather than the substantive relationships.

4. This study covers a wide age range (11+). The question about one’s mother’s occupation seems irrelevant for certain age groups. Also, this variable may change frequently for certain jobs.

We agree that this is potentially a limitation. However, given that the goal of the analysis was to maximize model performance, we erred on the side of being overly inclusive with predictors.

5. These data are gathered over time. What if the condition of a person changed? Any thought of including longitudinal (time-varying) variables in your model? In that case, a generalized estimating equation would probably be more appropriate than a simple logistic model.

In the ideal case, we would include these. Unfortunately, as we outline above, given that our data do not allow for a true assessment of the timing of the onset of homelessness, we believe our more conservative cross-sectional approach is preferred. We note the limitations associated with our cross-sectional approach in the Discussion.

6. What do these variables mean or represent?

Any record in BSAS

Any record in Casemix mental health records

Any record in DMH

Any record in DVS

Any record in Matris

Any record in PMP

We have amended the description of these variables in the table to clarify that they refer to whether an individual had received any service from each of the corresponding agencies.

Attachment

Submitted filename: reviewer_response_20200715.docx

Decision Letter 1

Benn Sartorius

6 Aug 2020

A classification model of homelessness using integrated administrative data: Implications for targeting interventions to improve the housing status, health and well-being of a highly vulnerable population

PONE-D-19-26052R1

Dear Dr. Byrne,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Benn Sartorius, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Benn Sartorius

11 Aug 2020

PONE-D-19-26052R1

A classification model of homelessness using integrated administrative data: Implications for targeting interventions to improve the housing status, health and well-being of a highly vulnerable population

Dear Dr. Byrne:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Benn Sartorius

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Flow diagram of sample selection.

    (DOCX)

    S1 Table. Full results of binary logistic regression model fit on development sample.

    (DOCX)

    Attachment

    Submitted filename: PLOSONE_20200122_EM.docx


    Data Availability Statement

    Data cannot be publicly shared due to legal restrictions that prohibit the sharing of both the data sources used in the construction of the study's analytic data set and the analytic data set itself. More specifically, study authors had access to these data by responding to a Notice of Opportunity issued by the Commonwealth of Massachusetts that enabled interested parties to propose analysis of these data for a set period of time. The deadline for responding to the Notice of Opportunity was April 30, 2017. Information about that Notice of Opportunity is available at the following link: https://www.commbuys.com/bso/external/bidDetail.sdo?docId=BD-17-1031-HISRE-HIS01-11089&external=true&parentUrl=bid The Commonwealth subsequently issued a second Notice of Opportunity to conduct analysis with these data, and applications in response to that Notice of Opportunity closed on March 23, 2018. Additional information about that Notice of Opportunity is available here: https://www.commbuys.com/bso/external/bidDetail.sdo?bidId=BD-18-1031-OFFIC-ODMOA-24680&parentUrl=activeBids At the time of the submission of this manuscript, there was not an open Notice of Opportunity from the Commonwealth of Massachusetts to which interested researchers could apply to access these data. Additional information about the data used in this study, relevant legal restrictions, and data access is available at the following link: https://www.mass.gov/public-health-data-warehouse-phd As per that website, the contact persons for the data are Abigail Averbach (Abigail.Averbach@MassMail.State.MA.US) and Brigido Ramirez Espinosa (Brigido.Ramirez1@State.MA.US) at the Massachusetts Department of Public Health Office of Population Health.

