ABSTRACT
The skin microbiome is a highly abundant and relatively stable source of DNA that may be utilized for human identification (HID). In this study, a set of single nucleotide polymorphisms (SNPs) with a high mean estimated Wright’s fixation index (FST) (>0.1) and widespread abundance (found in ≥75% of samples compared) were selected from a diverse set of markers in the hidSkinPlex panel. The least absolute shrinkage and selection operator (LASSO) was used in a novel machine learning framework to generate a SNP panel and predict the human host from skin microbiome samples collected from the hand, manubrium, and foot. The framework was devised to emulate a new unknown person introduced to the algorithm and to match samples from that person against a population database. Unknown samples were classified with 96% accuracy (Matthews correlation coefficient [MCC], 0.954) in the test (n = 225 samples) data set. A final panel of informative SNPs was determined for HID (hidSkinPlex+) using all 51 individuals sampled at three body sites in triplicate. The hidSkinPlex+ panel comprises 365 SNPs and yielded prediction accuracy for the correct host of 95% (MCC = 0.949). The accuracy of the hidSkinPlex+ panel may be somewhat overestimated due to using 26 individuals from the training data set for the selection of the final panel. However, this accuracy still provides an indication of performance when tested on new samples.
IMPORTANCE One of the fundamental goals in forensic genetics is to identify the source of biological evidence. Methods for detecting human DNA have advanced and can be quite sensitive, but not all DNA samples are amenable to current methods. However, the human skin microbiome is a source of DNA with high copy numbers, and it has the potential for high discriminatory power. The hidSkinPlex panel has been used for HID; however, some aspects of it could be improved. Missing information is ambiguous, as it is unclear if marker drop-out is a by-product of a low-template sample or if the reasons for not observing a marker are biological. Such ambiguity may confound methods for HID, and as such, an improved marker set (hidSkinPlex+) was designed that is considerably smaller and more robust to drop-out (365 SNPs contained in 135 markers) yet still can be used to accurately predict the human host.
KEYWORDS: hidSkinPlex, skin microbiome, microbial forensics, human identification, massively parallel sequencing, machine learning, multinomial logistic regression, Wright’s fixation index
INTRODUCTION
The human microbiome encompasses the fungi, bacteria, and viruses living on and in individuals and their surrounding environment. The interplay of genetics and environment results in each person having a skin microbiome that is suggested to be unique (1–4). Like human skin cells, the skin microbiome is continuously shed from its host and deposited on other individuals, items, and surfaces. For every squamous epithelial cell shed from human skin, approximately 30 microorganisms are shed (5). Thus, deposited microorganisms could serve as an additional source of evidence to include or exclude a person of interest in criminal cases.
Genetic signatures from the skin microbiome could be used for human identification (HID) with a panel that targets stable and abundant microorganisms (6, 7). Schmedes et al. (8) developed a targeted genome sequencing (TGS) panel called hidSkinPlex. This panel contains 286 markers covering a range of taxonomies of specific microorganisms that are in high abundance on the human skin. With the greater resolution of the hidSkinPlex panel and the demonstrated stability of the microorganisms chosen for the panel, a new avenue is available for HID using the skin microbiome. Although some relatively high accuracies were obtained, the hidSkinPlex panel has areas that can be improved (9). Optimization of the number and informativeness of the markers as well as reduction in their amplicon size is still needed to improve the robustness for HID purposes.
Like the process of selecting ancestry informative markers in humans, single nucleotide polymorphisms (SNPs) within the hidSkinPlex panel of markers can be selected for HID using Wright’s fixation index (FST). FST is an estimate of population differentiation that can be used to select SNPs to potentially increase classification accuracies for HID. Sherier et al. (10) used FST to select SNPs for HID; however, the approach only focused on SNPs that were common to the two samples being compared. One ramification of only using SNPs that are local to a pair of individuals is losing information that is specific in one individual but missing in the other. A global panel may allow for a better population genetic characterization of the selected SNPs. Furthermore, reduction to a specific set of informative SNPs can improve the efficiency of machine learning. Also, defined SNPs allow for better primer design for smaller amplicons, which in turn can improve amplification efficiency (i.e., increased sensitivity of detection). Overall, HID could be easier to accomplish when the metrics for comparison are the same among all individuals.
The study here focuses on developing a select microbial SNP panel for HID that is highly effective at associating a sample with its host. An effective microbial SNP panel would be well defined in nearly all individuals, be highly individualizing, and involve typing as few genetic markers as possible to achieve a defined level of attribution. One approach to select specific SNPs and define potential accuracy is to consider the human host as a class and to leverage classifier algorithms to predict the identity of the human host from microbial signatures. Some classification algorithms can be used to learn a sparse solution (i.e., few SNP markers), and the data can be described in a way that is robust to missing data, a common problem with forensic samples. The least absolute shrinkage and selection operator (LASSO) is one such algorithm which can be used to simultaneously identify a sparse set of SNPs and use those SNPs for HID. Here, a machine learning procedure is introduced that tests the ability of LASSO to identify SNPs for microbial-based HID. The performance of the selected SNP panel is then tested on individuals not used to generate the panel using a cross-validation framework. The candidate SNPs were assessed by their ability to predict the human host.
RESULTS
FST estimation.
The microbiomes of 51 individuals sampled at three body sites in triplicate (total samples, n = 459) were sequenced with the hidSkinPlex panel. As described in Woerner et al. (9), the resulting fastq files were aligned to the metagenomics database of MetaPhlAn2 (11). Wright’s fixation index (FST) was estimated between all pairs of samples (459 × 458/2 = 105,111 pairs) using the formulation of Hudson et al. (12) as described in Sherier et al. (10).
There are two objectives for determining single nucleotide markers for HID. The SNPs should be (i) individualizing to the person (i.e., have generally high FST) and (ii) relatively stable over time. In terms of the first objective, FST varies from 0 to 1 with an estimate of 1 indicating complete differentiation between (microbial) populations; thus, the nucleotides selected should tend to have a large FST. In terms of the second objective, FST is undefined in the presence of missing data and when the allele is monomorphic between (and within) populations. Thus, it follows that the sites selected should have an FST that tends to be well defined. To evaluate the interplay between missing information and the central tendencies of the FST of microbial markers, FST was estimated at all the 172,116 nucleotide positions in the hidSkinPlex panel in anywhere from 1 to 105,111 pairwise comparisons (as limited by the information apparent; Fig. 1). Nucleotide positions with FST estimates of ≥0.1 and defined in at least 75% of the comparisons are considered candidate SNPs here. Additionally, selecting SNPs seen in at least 75% of the pairwise comparisons allows for some tolerance for missing data which may be due to technical limitations.
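The two selection criteria can be illustrated with a minimal sketch. It assumes that undefined FST estimates are stored as `None` and that the mean is taken over the defined comparisons only; the position names and values are hypothetical and are not taken from the study data.

```python
from statistics import mean

MIN_MEAN_FST = 0.1       # threshold on the mean FST estimate (from the text)
MIN_DEFINED_FRAC = 0.75  # fraction of comparisons with a defined FST (from the text)

def is_candidate(fst_values, min_mean=MIN_MEAN_FST, min_defined=MIN_DEFINED_FRAC):
    """fst_values: one entry per pairwise comparison; None marks an undefined FST."""
    defined = [f for f in fst_values if f is not None]
    if not defined:
        return False
    frac_defined = len(defined) / len(fst_values)
    # Assumption: the mean is computed over defined comparisons only.
    return frac_defined >= min_defined and mean(defined) >= min_mean

# Hypothetical nucleotide positions, four pairwise comparisons each.
toy = {
    "pos_a": [0.30, 0.20, None, 0.25],  # defined in 3/4 comparisons, mean 0.25
    "pos_b": [0.90, None, None, 0.80],  # high FST but defined in only 2/4
    "pos_c": [0.01, 0.02, 0.00, 0.03],  # well defined but low FST
}
candidates = [name for name, vals in toy.items() if is_candidate(vals)]
print(candidates)
```

In this toy example only `pos_a` satisfies both criteria; `pos_b` fails the coverage requirement and `pos_c` fails the FST requirement.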
FIG 1.
The average FST estimate and the sample size in the hidSkinPlex panel. The graph on the left shows the distribution of the average FST for all nucleotide positions in the hidSkinPlex panel. The graph on the right shows the percentage of nucleotide positions in which FST can be estimated.
Training and test data set creation.
Training (n = 234, 26 individuals at three body-sites in triplicate) and test (n = 225, 25 individuals at three body-sites in triplicate) data were randomly partitioned (see File S1 in the supplemental material). The training data set produced 26 SNP panels (one for each individual) with a mean of 1,265.769 ± 21.486 SNPs per panel. Similarly, the test data set produced 25 panels with a mean of 1,475.240 ± 45.256 SNPs per panel. For the final panel using all 459 samples from the training and test data set, the list of initial SNPs for analysis contained 4,445 SNPs (Fig. 2).
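The sample counts above, and the pairwise comparison counts given in Materials and Methods, follow directly from the study design. The sketch below reproduces the arithmetic with a random partition by individual; the seed is an arbitrary assumption and does not reproduce the authors' actual split (File S1).

```python
import random
from math import comb

N_INDIVIDUALS, N_SITES, N_REPLICATES = 51, 3, 3
N_TRAIN_INDIVIDUALS = 26

rng = random.Random(0)  # arbitrary seed, not the authors' partition
ids = list(range(1, N_INDIVIDUALS + 1))
train_ids = sorted(rng.sample(ids, N_TRAIN_INDIVIDUALS))
train_set = set(train_ids)
test_ids = [i for i in ids if i not in train_set]

# Every individual contributes 3 body sites x 3 replicates = 9 samples.
samples_per_person = N_SITES * N_REPLICATES
n_train = len(train_ids) * samples_per_person
n_test = len(test_ids) * samples_per_person

print(n_train, n_test)                    # samples per data set
print(comb(n_train, 2), comb(n_test, 2))  # pairwise comparisons within each set
print(comb(n_train + n_test, 2))          # pairwise comparisons over all samples
```

Partitioning by individual rather than by sample keeps all nine samples from one person in the same data set, which is what allows the test individuals to be treated as previously unseen.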
FIG 2.
The average FST estimate and the sample size of the reduced list of 1,344 candidate SNPs from the training data set. The graph on the left shows the distribution of the average FST estimated for the SNP candidate list. The graph on the right shows the distribution of SNPs contained in the top 75% of pairwise comparisons.
Analysis of the training data set.
The training data set was used to optimize an algorithm for selecting a reduced number of SNPs for HID. The lambda sequence and the alpha parameter were optimized (see Materials and Methods), using all 26 individuals, to ensure that there were not too few or too many SNPs selected. A procedure was developed to select a SNP panel and then classify a new individual based on the selected markers. The procedure was run in a cross-validation framework, holding out each individual in turn. The training data set produced 26 SNP panels with a mean of 191.400 ± 21.702 SNPs.
Classification results.
The above-described approach for selecting a reduced SNP list was applied to the training and test data sets. Applying the classification procedure (see Materials and Methods) to the training data set gave an overall accuracy of 93% (Matthews correlation coefficient [MCC], 0.920; 24.180 times better than chance), with only 18 out of 234 samples incorrectly classified (Fig. 3). Of the incorrectly classified samples, 4 samples were from the foot (Fb), 11 from the manubrium (Mb), and 3 from the hand (Hp) (Table 1). Samples from the Mb had a higher number of incorrectly classified samples compared to the Fb and a significantly higher number than the Hp (Fisher’s exact test, P = 0.101 and 0.0470, respectively). A missing SNP was defined as a site selected in the procedure where 0 reads were apparent for a given sample. For the training data, the number of samples missing SNPs was determined for each of the 26 SNP panels. The mean number of samples missing SNPs was 93.300 ± 7.394 per panel. The mean number of missing SNPs for combined predicted results was 5.154 ± 11.312. For correctly classified samples, 131 samples were not missing any SNPs, and 85 had missing SNPs, with a mean of 10.330 ± 13.342; 14 incorrectly classified samples were missing SNPs, with a mean of 23.430 ± 18.241. Incorrectly classified samples were more likely to have missing SNPs than correctly classified samples (Fisher’s exact test, P = 0.002). The held-out sample for the development of each panel is more likely to have missing data because the SNPs were selected without considering the held-out sample.
FIG 3.

Classification results for the training and test data sets and the number of samples missing SNPs. The x axis indicates the number of missing SNPs for a given sample. The y axis shows the training and test data sets partitioned into the correct (white) and incorrect (gray) classification groups.
TABLE 1.
Classification accuracy at different body sites in the training data set
| Training, no. of samples (%) | Foot (Fb) | Manubrium (Mb) | Hand (Hp) | Total |
|---|---|---|---|---|
| Correct | 74 (95) | 67 (85) | 75 (96) | 216 (93) |
| Incorrect | 4 (5) | 11 (15) | 3 (4) | 18 (7) |
| Total | 78 | 78 | 78 | 234 |
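The body-site comparisons rely on Fisher's exact test applied to 2 × 2 tables of incorrect and correct counts. The following is a minimal standard-library sketch of the two-sided test, assuming the common convention of summing the probabilities of all tables that are no more likely than the observed one; it is not the authors' code, and the function name is illustrative.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Unnormalised hypergeometric weights, kept as exact integers so that
    # ties with the observed table are detected without rounding error.
    weight = {k: comb(row1, k) * comb(row2, col1 - k) for k in range(lo, hi + 1)}
    observed = weight[a]
    extreme = sum(w for w in weight.values() if w <= observed)
    return extreme / sum(weight.values())

# Rows: body site; columns: incorrect, correct (counts from Table 1).
p_mb_vs_fb = fisher_exact_two_sided(11, 67, 4, 74)
p_mb_vs_hp = fisher_exact_two_sided(11, 67, 3, 75)
print(p_mb_vs_fb, p_mb_vs_hp)
```

With the Table 1 counts, this sketch should return values close to the reported P = 0.101 (Mb versus Fb) and P = 0.0470 (Mb versus Hp).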
The classification procedure was applied to the test data set. The test data set was 96% accurate (MCC, 0.954; 24.000 times better than chance), with only 10 (all from the Mb) out of 225 samples incorrectly classified (Table 2). Nine misclassified samples were missing a mean of 23.333 ± 32.943 SNPs (Fig. 3). Of the correctly classified samples, 103 out of 215 had missing SNPs (6.786 ± 8.354). As with the training data set, incorrectly classified samples were more likely to have missing SNPs than correctly classified samples (Fisher’s exact test, P = 0.009).
TABLE 2.
Classification accuracy at different body sites in the test data set
| Test, no. of samples (%) | Foot (Fb) | Manubrium (Mb) | Hand (Hp) | Total |
|---|---|---|---|---|
| Correct | 75 (100) | 65 (87) | 75 (100) | 215 (96) |
| Incorrect | 0 (0) | 10 (13) | 0 (0) | 10 (4) |
| Total | 75 | 75 | 75 | 225 |
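The MCC reported alongside each accuracy summarizes the full confusion matrix of predicted versus true hosts. The sketch below computes the standard multiclass generalization of the MCC from a confusion matrix; it assumes this is the form the reported values correspond to, and the three-class matrix is a hypothetical toy example, not study data.

```python
from math import sqrt

def multiclass_mcc(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    k = len(confusion)
    s = sum(sum(row) for row in confusion)         # total samples
    c = sum(confusion[i][i] for i in range(k))     # correctly classified samples
    t = [sum(confusion[i]) for i in range(k)]      # true count per class
    p = [sum(confusion[i][j] for i in range(k)) for j in range(k)]  # predicted count per class
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = sqrt((s * s - sum(pk * pk for pk in p)) *
                       (s * s - sum(tk * tk for tk in t)))
    return numerator / denominator if denominator else 0.0

# Hypothetical example: three hosts with nine samples each, one sample misassigned.
toy = [
    [9, 0, 0],
    [1, 8, 0],
    [0, 0, 9],
]
print(round(multiclass_mcc(toy), 3))
```

Unlike raw accuracy, the MCC accounts for how predictions are distributed across classes, so it remains informative when some hosts attract more misassigned samples than others.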
Reduced SNP list.
The final candidate SNP list was determined by pooling the test and training data sets and reapplying a similar classification procedure. LASSO was used to produce a single SNP list. Cross-validation was used to find the optimal lambda and to estimate the overall accuracy. The final SNP list, referred to as hidSkinPlex+, is composed of 365 SNPs (Table S2) that reside in 135 of the original amplicons (mean number of SNPs in each marker, 3.419 ± 4.984; range, 1 to 51) from the hidSkinPlex panel (11). The markers are specific to four taxa, Cutibacterium acnes, Cutibacterium humerusii, Corynebacterium tuberculostearicum, and Propionibacteriaceae. Previous studies have shown that Cutibacterium is a common and abundant (13–15) genus found on human skin (3, 8).
Of 459 samples, 95% (MCC, 0.949; 48.469 times better than chance) were correctly classified using data from the hidSkinPlex+ panel. Of the 23 incorrectly classified samples, 17 were from Fb samples, which is a significantly larger number of samples than the 2 Mb and the 4 Hp samples incorrectly classified (Fisher’s exact test, P < 0.001 compared to the Mb, P = 0.003 compared to Hp). The numbers of Mb and Hp were not significantly different (P = 0.684; Table 3). Of the 23 incorrectly classified samples, 21 samples were missing a mean of 40.520 ± 36.853 SNPs. More incorrectly classified Fb samples had missing SNPs (P < 0.001) compared to the number of incorrectly classified Mb or Hp samples with missing SNPs. For the 436 samples correctly classified, 204 samples had a mean of 11.590 ± 16.559 missing SNPs. A sample was more likely to be misclassified if it had missing SNPs (Fisher’s exact test, P < 0.001).
TABLE 3.
Classification accuracy at different body sites for hidSkinPlex+ panel
| All data, no. of samples (%) | Foot (Fb) | Manubrium (Mb) | Hand (Hp) | Total |
|---|---|---|---|---|
| Correct | 136 (89) | 151 (99) | 149 (97) | 436 (95) |
| Incorrect | 17 (11) | 2 (1) | 4 (3) | 23 (5) |
| Total | 153 | 153 | 153 | 459 |
DISCUSSION
The skin microbiome is a highly abundant and relatively stable source of DNA that may be utilized for HID (6, 16–20). A common set of microbial SNPs could provide another avenue of investigation to improve HID. In this study, a subset of SNPs from the hidSkinPlex panel that generally were common to all individuals analyzed were assessed for classification accuracy. The skin microbiome samples from 51 individuals’ Fb, Mb, and Hp were attributed to their respective individual hosts with an accuracy of 96% for the test data set. The targeted panels were composed of 157 to 243 SNPs, a substantial decrease in the number of SNPs relied on by Woerner et al. (9). The final SNP panel, hidSkinPlex+, contained 365 SNPs residing in 135 markers which were specific to 4 taxa. LASSO was used to select informative SNPs for HID and correctly predicted the human host 95% of the time. It should be noted, however, that the reported accuracy of the final panel may be slightly biased upward, as it is estimated within-fold, though given the 96% accuracy of the test data set, this bias is likely modest. Classification accuracies for each of the three body sites using the hidSkinPlex+ panel ranged from 89 to 99%. Accuracy with Fb samples (89%) was significantly lower (i.e., a greater number of incorrectly classified samples) than with the Mb (99%, Fisher’s exact test P < 0.001) and Hp (97%, Fisher’s exact test P = 0.001) samples, while the accuracies for the Mb and Hp sites were not significantly different. While accuracies for the Fb in this study were lower than those for the other two body sites, the results were more accurate than previous work from Woerner et al. (9) (28 to 73%).
One factor that appears to be related to the reduced accuracies is missing SNPs. Samples that had missing SNPs were more likely to be incorrectly classified than samples that had no missing SNPs (Fisher’s exact test, P < 0.001 for all comparisons). However, there were also samples that had missing SNPs that were classified correctly. For example, 38 out of 149 Hp samples had missing SNPs but were correctly classified. The incorrectly classified Hp samples had a significantly higher mean number of missing SNPs (34.500 ± 32.296) than Hp samples with missing data that were correctly classified (13.340 ± 16.515, Fisher’s exact test P < 0.001). While further research is needed to determine why some samples were incorrectly classified, one possible explanation of incorrect classification is low coverage. The Fb had the largest amount of missing data and the lowest read coverage. For example, the Fb had a mean of 448,400 ± 319,227 reads, compared to Hp, which had a mean of 1,025,866 ± 410,674 reads. Therefore, a more efficient chemistry could reduce the chances of data drop out.
The hidSkinPlex+ panel allows for a new targeted sequencing panel to be designed and optimized for HID. Eliminating the markers that do not contribute to classification accuracy can improve the enrichment process, i.e., the amplification efficiency of the PCR. Fewer markers in a PCR may increase amplicon yield and thus provide a more sensitive assay. Since the hidSkinPlex+ panel contains fewer markers, and thus fewer SNPs, than the original hidSkinPlex, primers targeting specific SNPs may be redesigned to generate smaller amplicons, which may increase amplicon yield and provide a more robust panel for analyzing degraded samples; both are desirable features for forensic applications. Research is still needed to assess how well the SNPs selected for the hidSkinPlex+ panel work when applied to touch samples and to samples collected at different time points. With additional studies on the allele frequency of the selected SNPs in different populations (populations may be geographically determined instead of genetically determined), a better estimation of HID classification accuracies can be achieved.
These results further support that the skin microbiome can serve as a potential source of DNA for HID. This panel could serve as a set of biomarkers to assess the stability of the specific SNPs and whether they can be generalized to the greater population.
MATERIALS AND METHODS
Sample collection and sequencing.
Human skin microbiome samples were collected by swabbing 51 individuals at three body sites (manubrium [Mb], hand [Hp], and foot [Fb]) in triplicate (replicates R1, R2, and R3), for a total of 459 samples as described previously in Woerner et al. (9). Briefly, the samples were assayed with hidSkinPlex, a TGS panel developed by Schmedes et al. (8). All markers in the hidSkinPlex panel (8) are drawn from the MetaPhlAn2 database (11) and, as such, describe both the nucleotide sequence of the marker and a corresponding taxonomic affiliation (e.g., the marker is associated with C. acnes). The hidSkinPlex panel targets 22 clades from the genus to the species level and comprises 286 markers that are considered taxonomically stable and abundant on human skin (8). The University of North Texas Health Science Center Institutional Review Board approved the collection and analyses of these samples.
Sequence data generated.
As described previously in Woerner et al. (9), all sequencing was performed on a MiSeq instrument (Illumina, San Diego, CA). Fastq files were trimmed with cutadapt (21) to remove adapters from the sequencing results. The sequence data were aligned to the MetaPhlAn2 reference database (11) using bowtie2. Using an in-house BASH script (v. 4.4.20, Free Software Foundation, http://www.gnu.org/software/bash/), the total number of reads and the percentage of ACGT for each nucleotide base were calculated based on pileups from SAMtools (22). Finally, the base pileups for each aligned marker in the hidSkinPlex panel were generated.
Computation and statistical analysis.
As described in Sherier et al. (10), the FST was computed as per Hudson et al. (12), who proposed estimating FST as FST = 1 – (Hw/Hb), where Hw is the mean number of pairwise differences within a population and Hb is the mean number of pairwise differences between two populations (12). FST was estimated using an in-house script written in the Python programming language (v. 2.7.17, Python Software Foundation, https://www.python.org/) with minor modifications from the script used in Sherier et al. (10). The modifications allowed FST to be measured at all nucleotide positions regardless of read depth. All other statistical analyses were performed in R (v. 4.0.3) using the glmnet package (v. 4.1-1) (23), tidyverse (v. 1.3.1) (24), and ggplot2 (v. 1.3.1) (25) as appropriate. Additionally, the Matthews correlation coefficient (MCC) was estimated using mltools (v. 0.3.5) (26).
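The estimator FST = 1 – (Hw/Hb) can be sketched for a single nucleotide position and a single pair of samples. The sketch assumes that the read proportions of A, C, G, and T at the site are treated as allele frequencies and that the mean numbers of pairwise differences are approximated by expected heterozygosities; the in-house script may differ in detail. Negative estimates are set to 0, as stated for this study.

```python
def hudson_fst(p, q):
    """Per-site FST = 1 - Hw/Hb for two samples.

    p, q: allele (read) frequencies over A, C, G, T for each sample,
    or None when a sample has no reads at the site.
    """
    if p is None or q is None:
        return None  # missing data: FST is undefined
    # Hw: mean of the two within-sample expected heterozygosities.
    h_within = ((1 - sum(x * x for x in p)) + (1 - sum(x * x for x in q))) / 2
    # Hb: probability that one read drawn from each sample differs.
    h_between = 1 - sum(x * y for x, y in zip(p, q))
    if h_between == 0:
        return None  # same monomorphic allele in both samples: FST is undefined
    return max(0.0, 1 - h_within / h_between)

print(hudson_fst([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]))  # fixed difference
print(hudson_fst([0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]))  # identical mixtures
print(hudson_fst([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]))  # monomorphic site
print(hudson_fst(None, [1.0, 0.0, 0.0, 0.0]))                  # missing sample
```

A fixed difference between two samples gives an estimate of 1, identical allele mixtures give 0, and both monomorphic sites and sites with missing data return no estimate, which is why the proportion of defined comparisons becomes a selection criterion in its own right.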
Potential SNPs for analysis.
Potential informative nucleotide positions were identified on the basis of FST estimated between pairs of samples. Training (n = 234, 26 individuals at 3 body sites in triplicate) and test (n = 225, 25 individuals at 3 body sites in triplicate) data were randomly partitioned by sample(c(1:51), 26) in R (File S1). FST was estimated between all samples within the training and the test data sets separately (27,261 and 25,200 pairwise comparisons, respectively) at all 172,116 nucleotide positions in the hidSkinPlex panel. As a summary statistic, FST is undefined if either individual (or both individuals) is missing the SNP or if the allele is monomorphic between populations. Thus, for some pairwise comparisons there is no FST estimate. FST estimates less than 0 were documented as 0. Here, candidate SNPs were defined as nucleotide positions that have a mean FST estimate of ≥0.1 and had a defined FST in >75% of comparisons.
Machine learning strategy.
A major aim of the current study is to identify SNPs that are informative for HID. The SNPs are selected to both differentiate individuals (e.g., tend to have high FST) and to be well defined (have a defined FST). Further, a central aim is to identify a small number of such SNPs (i.e., a sparse solution). Classification algorithms can be used for HID by treating each person as a class (i.e., as a categorical variable), and for the approaches here, an individual is predicted based on coefficients that are learned for that individual. One tool to find a small set of SNPs for HID is LASSO (the least absolute shrinkage and selection operator). LASSO considers two measures in its optimization: the error (i.e., deviance in a logistic regression) and the absolute value of the coefficients (i.e., the L1 norm). The relative importance of these two criteria is specified with lambda. L1 regularization tends to produce solutions that are sparse. Thus, in the current use case, LASSO can be used to simultaneously identify a SNP panel and predict the human host based on those SNPs.
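Why an L1 penalty yields sparse solutions while an L2 penalty does not can be seen in a one-dimensional toy problem. The sketch below is illustrative only and is not the glmnet implementation: it compares the closed-form minimizers of a squared error term plus an L1 or an L2 penalty for a few hypothetical unpenalized coefficients.

```python
def soft_threshold(z, lam):
    """Minimiser of 0.5*(b - z)**2 + lam*abs(b): the L1 (LASSO) solution."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero

def ridge_shrink(z, lam):
    """Minimiser of 0.5*(b - z)**2 + 0.5*lam*b**2: the L2 (ridge) solution."""
    return z / (1.0 + lam)  # coefficients shrink but never reach zero

unpenalised = [2.0, -0.75, 0.25, -0.1]  # hypothetical coefficients
lam = 0.5
lasso = [soft_threshold(z, lam) for z in unpenalised]
ridge = [ridge_shrink(z, lam) for z in unpenalised]
print(lasso)
print(ridge)
```

The L1 solution removes the two smallest coefficients entirely, which corresponds to dropping SNPs from the panel, whereas the L2 solution retains every coefficient at a reduced magnitude.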
A potential concern with LASSO is that the SNP panel identified may work well for each person currently in the database; however, it may not work well for new individuals. If one were to consider a potential forensic scenario, a new individual (e.g., a person of interest) is presented and the SNP panel needs to be accurate for both the current database and the additional individual (i.e., the new class). While the collection of the samples used in this study does not mimic real-life casework, the classification method should determine if accurate HID is possible when samples are technical replicates and collected directly from an individual. Given these requirements, a three-step procedure was developed: (i) a sparse set of SNPs is learned using LASSO from the individuals in a database, (ii) a new individual is introduced, and (iii) the additional individual is classified based on the SNP panel. The final classification step is performed using a ridge regression (L2 norm) with the SNP panel developed within-fold for each held-out individual (i.e., based on high FST SNPs identified in the database and not considering the held-out individual). The three-step procedure was repeated in a cross-validation framework, holding out each individual in turn. To ensure that the sample sizes were equal in all classes during the cross-validation development of the reduced SNP panel, the same sample type (e.g., Hp, replicate 3) was held out in all individuals. In R (27), the lambda sequence is given by 1.1^seq(1, -200, length = 100). SNP coefficients and the optimal lambda value were learned using the cv.glmnet function in the R package glmnet. The optimal lambda was taken to be the lambda that minimizes the deviance (lambda.min from cv.glmnet). In particular, the LASSO regression was run by standardizing the allele frequencies for the provided SNPs (standardize = TRUE), the regression type was set to grouped (type.multinomial = "grouped"), and the maximum number of iterations was set to 1,000,000.
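The lambda sequence and the hold-out structure described above can be sketched in a few lines. The first function is a Python equivalent of the R expression 1.1^seq(1, -200, length = 100); the second only enumerates which individuals form the database and which one is held out in each fold. Both function names are illustrative, and no model fitting is shown.

```python
def lambda_sequence(start=1.0, stop=-200.0, n=100, base=1.1):
    """Python equivalent of the R expression base^seq(start, stop, length = n)."""
    step = (stop - start) / (n - 1)
    return [base ** (start + i * step) for i in range(n)]

def leave_one_individual_out(individual_ids):
    """Yield (database, held_out) pairs, holding out each individual in turn."""
    for held_out in individual_ids:
        database = [i for i in individual_ids if i != held_out]
        yield database, held_out

lambdas = lambda_sequence()
print(len(lambdas), lambdas[0], lambdas[-1])

folds = list(leave_one_individual_out(list(range(1, 27))))  # 26 training individuals
print(len(folds), len(folds[0][0]))
```

The sequence is strictly decreasing, from 1.1 down to 1.1 raised to the power of -200, so the penalty ranges from strong (few SNPs retained) to nearly absent (many SNPs retained).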
SNPs were identified by selecting SNP alleles based on the optimal lambda from LASSO. SNP panels were created by using all allele frequencies for any SNP position corresponding to a nonzero coefficient. A ridge regression was used to predict the held-out individual (that is, by setting alpha = 0 in cv.glmnet, but otherwise as per the above-described LASSO procedure).
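The step from nonzero coefficients to a SNP panel can be sketched as follows. The feature labels of the form "marker:position:allele", the coefficient values, and the function name are hypothetical; the sketch only illustrates the rule that every allele frequency at a position is retained once any allele at that position has a nonzero coefficient for any individual.

```python
def panel_from_coefficients(feature_names, coef_by_class, tol=0.0):
    """Return all allele features at positions with any nonzero coefficient.

    feature_names: hypothetical labels of the form "marker:position:allele".
    coef_by_class: {class_label: [one coefficient per feature]}.
    """
    def position(name):
        return name.rsplit(":", 1)[0]  # drop the trailing allele label

    selected_positions = set()
    for coefs in coef_by_class.values():
        for name, value in zip(feature_names, coefs):
            if abs(value) > tol:
                selected_positions.add(position(name))
    # Keep every allele at a selected position, in the original feature order.
    return [name for name in feature_names if position(name) in selected_positions]

features = ["m1:10:A", "m1:10:G", "m1:57:C", "m1:57:T", "m2:3:A", "m2:3:C"]
coefficients = {
    "person_1": [0.8, 0.0, 0.0, 0.0, 0.0, 0.0],
    "person_2": [-0.4, 0.0, 0.0, 0.0, 0.0, 0.2],
}
panel = panel_from_coefficients(features, coefficients)
print(panel)
```

In this toy example, positions m1:10 and m2:3 are selected, so both alleles at each of those positions enter the panel that would then be passed to the ridge regression.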
Selection of SNPs for hidSkinPlex+.
The machine learning strategy described above was designed to simulate the ability of LASSO to identify SNPs in some data set that can then be used to predict the identity of a previously unseen individual. In the framework described above, a SNP panel is produced for each held-out individual, which is appropriate for assessing the accuracy of the approach, but it does not create a singular SNP panel. To produce a final SNP panel, a similar but simpler procedure was used. LASSO was used to identify a sparse set of SNPs considering all individuals (and body sites) pooled across the training and test data sets. The same lambda sequence was considered, and the optimal lambda (and corresponding panel) was estimated using cross-validation (cv.glmnet). The final panel is referred to as hidSkinPlex+. The accuracy of the final panel was estimated within-fold (keep = TRUE), and as such, the estimated accuracy of the final panel is likely inflated (biased upward).
Data availability.
Custom R and Python scripts can be accessed at https://github.com/CardiShire/MLforSkinMicrobiomeHID.
ACKNOWLEDGMENTS
We thank Sarah Schmedes for the design of the hidSkinPlex panel and sample processing. Additionally, we thank Angie Ambers, Rachel Kieser, Frank Wendt, Nicole Novroski, and Jonathan King for their contributions to collecting/processing samples. We also thank Utpal Smart, Sammed Mandape, Ben Crysup, and Jonathan King for all the time they spent advising on code and debugging.
This study was supported in part by the National Institute of Justice, award numbers 2015-NE-BX-K006 and 2020-R2-CX-0046. The views expressed in this article do not necessarily represent the views of the Department of Justice, the National Institute of Justice, or the United States government.
Footnotes
Supplemental material is available online only.
Contributor Information
Allison J. Sherier, Email: allisonsherier@my.unthsc.edu.
Maia Kivisaar, University of Tartu.
REFERENCES
- 1. Wang Y, Yu Q, Zhou R, Feng T, Hilal MG, Li H. 2021. Nationality and body location alter human skin microbiome. Appl Microbiol Biotechnol 105:5241–5256. doi: 10.1007/s00253-021-11387-8.
- 2. Ross AA, Doxey AC, Neufeld JD. 2017. The skin microbiome of cohabiting couples. mSystems 2:e00043-17. doi: 10.1128/mSystems.00043-17.
- 3. Oh J, Byrd AL, Deming C, Conlan S, Kong HH, Segre JA, NISC Comparative Sequencing Program. 2014. Biogeography and individuality shape function in the human skin metagenome. Nature 514:59–64. doi: 10.1038/nature13786.
- 4. Richardson M, Gottel N, Gilbert JA, Lax S. 2019. Microbial similarity between students in a common dormitory environment reveals the forensic potential of individual microbial signatures. mBio 10:e01054-19. doi: 10.1128/mBio.01054-19.
- 5. Percival SL, Emanuel C, Cutting KF, Williams DW. 2012. Microbiology of the skin and the role of biofilms in infection. Int Wound J 9:14–32. doi: 10.1111/j.1742-481X.2011.00836.x.
- 6. Fierer N, Lauber CL, Zhou N, McDonald D, Costello EK, Knight R. 2010. Forensic identification using skin bacterial communities. Proc Natl Acad Sci USA 107:6477–6481. doi: 10.1073/pnas.1000162107.
- 7. Knight R, Metcalf JL, Gilbert JA, Carter DO. 2018. Evaluating the skin microbiome as trace evidence. https://nij.ojp.gov/library/publications/evaluating-skin-microbiome-trace-evidence.
- 8. Schmedes SE, Woerner AE, Novroski NMM, Wendt FR, King JL, Stephens KM, Budowle B. 2018. Targeted sequencing of clade-specific markers from skin microbiomes for forensic human identification. Forensic Sci Int Genet 32:50–61. doi: 10.1016/j.fsigen.2017.10.004.
- 9. Woerner AE, Novroski NMM, Wendt FR, Ambers A, Wiley R, Schmedes SE, Budowle B. 2019. Forensic human identification with targeted microbiome markers using nearest neighbor classification. Forensic Sci Int Genet 38:130–139. doi: 10.1016/j.fsigen.2018.10.003.
- 10. Sherier AJ, Woerner AE, Budowle B. 2021. Population informative markers selected using Wright’s fixation index and machine learning improves human identification using the skin microbiome. Appl Environ Microbiol 87:e0120821. doi: 10.1128/AEM.01208-21.
- 11. Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, Tett A, Huttenhower C, Segata N. 2015. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods 12:902–903. doi: 10.1038/nmeth.3589.
- 12. Hudson RR, Slatkin M, Maddison WP. 1992. Estimation of levels of gene flow from DNA sequence data. Genetics 132:583–589. doi: 10.1093/genetics/132.2.583.
- 13. Grice EA, Kong HH, Conlan S, Deming CB, Davis J, Young AC, Bouffard GG, Blakesley RW, Murray PR, Green ED, Turner ML, Segre JA, NISC Comparative Sequencing Program. 2009. Topographical and temporal diversity of the human skin microbiome. Science 324:1190–1192. doi: 10.1126/science.1171700.
- 14. Grice EA, Kong HH, Renaud G, Young AC, Bouffard GG, Blakesley RW, Wolfsberg TG, Turner ML, Segre JA, NISC Comparative Sequencing Program. 2008. A diversity profile of the human skin microbiota. Genome Res 18:1043–1050. doi: 10.1101/gr.075549.107.
- 15. Fitz-Gibbon S, Tomida S, Chiu BH, Nguyen L, Du C, Liu M, Elashoff D, Erfe MC, Loncaric A, Kim J, Modlin RL, Miller JF, Sodergren E, Craft N, Weinstock GM, Li H. 2013. Propionibacterium acnes strain populations in the human skin microbiome associated with acne. J Invest Dermatol 133:2152–2160. doi: 10.1038/jid.2013.21.
- 16. Franzosa EA, Huang K, Meadow JF, Gevers D, Lemon KP, Bohannan BJ, Huttenhower C. 2015. Identifying personal microbiomes using metagenomic codes. Proc Natl Acad Sci USA 112:E2930–E2938. doi: 10.1073/pnas.1423854112.
- 17. Hampton-Marcell JT, Larsen P, Anton T, Cralle L, Sangwan N, Lax S, Gottel N, Salas-Garcia M, Young C, Duncan G, Lopez JV, Gilbert JA. 2020. Detecting personal microbiota signatures at artificial crime scenes. Forensic Sci Int 313:110351. doi: 10.1016/j.forsciint.2020.110351.
- 18. Kapono CA, Morton JT, Bouslimani A, Melnik AV, Orlinsky K, Knaan TL, Garg N, Vazquez-Baeza Y, Protsyuk I, Janssen S, Zhu Q, Alexandrov T, Smarr L, Knight R, Dorrestein PC. 2018. Creating a 3D microbial and chemical snapshot of a human habitat. Sci Rep 8:3669. doi: 10.1038/s41598-018-21541-4.
- 19. Lax S, Hampton-Marcell JT, Gibbons SM, Colares GB, Smith D, Eisen JA, Gilbert JA. 2015. Forensic analysis of the microbiome of phones and shoes. Microbiome 3:21. doi: 10.1186/s40168-015-0082-9.
- 20. Lee S-Y, Woo S-K, Lee S-M, Eom Y-B. 2016. Forensic analysis using microbial community between skin bacteria and fabrics. Toxicol Environ Health Sci 8:263–270. doi: 10.1007/s13530-016-0284-y.
- 21. Martin M. 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 17. doi: 10.14806/ej.17.1.200.
- 22. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 23. Friedman J, Hastie T, Tibshirani R. 2010. Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33:1–22.
- 24. Wickham H, Averick M, Bryan J, Chang W, McGowan L, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H. 2019. Welcome to the tidyverse. JOSS 4:1686. doi: 10.21105/joss.01686.
- 25. Wickham H, Chang W, Henry L, Pedersen TL, Takahashi K, Wilke C, Woo K. 2016. ggplot2: elegant graphics for data analysis, vol 2018. Springer-Verlag, New York, NY.
- 26. Gorman B. 2018. mltools. https://github.com/ben519/mltools.
- 27. R Core Team. 2013. R: a language and environment for statistical computing. R Foundation for Statistical Computing. http://www.R-project.org/.
Supplementary Materials
List S1 and Table S2 (aem.00052-22-s0001.pdf).