Author manuscript; available in PMC: 2019 Jul 1.
Published in final edited form as: J Forensic Sci. 2018 Jan 22;63(4):1033–1042. doi: 10.1111/1556-4029.13741

Bioinformatics Approach to Assess the Biogeographic Patterns of Soil Communities: The Utility for Soil Provenance*

Natalie Damaso 1,2, Julian Mendel 1,2, Maria Mendoza 1,2, Eric J von Wettberg 1,3, Giri Narasimhan 4, DeEtta Mills 1,2
PMCID: PMC6028300  NIHMSID: NIHMS932482  PMID: 29357400

Abstract

Soil DNA profiling has potential as a forensic tool to establish a link between soil collected at a crime scene and soil recovered from a suspect. However, a quantitative measure is needed to investigate the spatial/temporal variability across multiple scales prior to its application in forensic science. In this study, soil DNA profiles across Miami-Dade, FL were generated using Length Heterogeneity PCR to target four taxa. The objectives of this study were to 1) assess the biogeographical patterns of soils to determine if soil biota is spatially correlated to geographical location, and 2) evaluate five machine-learning algorithms for their predictive ability to recognize biotic patterns that could accurately classify soils at different spatial scales regardless of seasonal collection. Results demonstrate that soil communities have unique patterns and are spatially autocorrelated. Bioinformatic algorithms could accurately classify soils across all scales, with Random Forest significantly outperforming all other algorithms regardless of spatial level.

Keywords: forensic science, soil DNA profiling, spatial scale, machine learning algorithms, Random Forest, soil provenance


Soil can provide valuable corroborative evidence in forensic investigations due to its prevalence and transferability, based on the Locard Exchange Principle (1). Forensic soil analyses are usually conducted by comparing questioned samples with those of known origin to evaluate if they are similar (inclusion) or different (exclusion) based on physical characteristics (e.g., soil color, texture, consistency, density, porosity, and particle size) (2–4), and chemical properties (i.e., mineralogy and elemental composition) (1, 4–6). With the growth of molecular techniques, research has shown the potential of soil DNA profiling as a reliable method for forensic soil analyses (7–19). The current ecological hypothesis states that the soil type (e.g., chemical and physical properties) determines which microbes and plants occupy a particular soil (20–22) and provides the foundation for soil provenance studies to assist in intelligence gathering or forensics applications. Based on this hypothesis, soil community profiling should produce a unique biotic profile at every location that can provide a rapid and efficient method for soil geographical provenance or comparison of soil evidence. For example, human DNA profiles are used to determine a match between biological evidence from a crime scene and a suspect. Applying the same principle, a soil biotic profile can be used to establish a link between soil found on the suspect’s shoes, or clothing, to the soil at a crime scene (comparison) or identify a likely geographic source (provenance).

Studies have shown the potential and effectiveness of using microbial DNA from soil to identify the origin of a soil sample (7–9, 11, 14, 16–18). A majority of the studies explore the bacterial 16S rRNA genes using terminal restriction fragment length polymorphism (T-RFLP) (7, 9, 10) or length heterogeneity polymerase chain reaction (LH-PCR) (8, 23). Several studies have recently shown that soil community profiling (i.e., of bacteria or multiple taxa) using Next Generation Sequencing (NGS) technologies is strongly correlated with soil location and soil disturbance (14, 16–18, 24). In this study, a simple profiling method, LH-PCR, was used instead of NGS metagenomic sequencing for two reasons. First, LH-PCR has been shown to be quick, robust, cost-effective, and reproducible in studying microbial community dynamics (23, 25, 26). Second, many forensic laboratories have not implemented NGS technologies, and LH-PCR is one method that can be used with standard forensic DNA instrumentation.

Although studies have previously shown spatial variation, the effectiveness of soil DNA profiling to differentiate forensic samples depends on the existence of quantitative measures or methods that (a) help distinguish soils from different types of habitats, (b) exhibit spatial autocorrelation (i.e., increasing dissimilarity with further distance from a particular habitat), and (c) remain relatively stable within limited temporal scales (27). Analytical approaches to soil community profiling need to combine discriminatory power, robustness, and reliability; in addition, statistical methods must be identified to provide objective measures for assessing the similarities and differences between samples (28). In forensic science, the probability that a sample originated from one source rather than another selected at random must be evaluated with statistics such as the Random Match Probability or Likelihood Ratio, which are commonly used for human DNA profiling (16). However, soil analysis differs from human identification as soil is not discrete and the soil community is vulnerable to spatial and temporal variability. To date, there is no standardized way to process T-RFLP or LH-PCR profiles. A standard method to quantify the calculated similarity in a forensic setting and develop a decision model to estimate the evidential value of such similarities is needed (12, 29). Therefore, two soil samples cannot be said in the absolute sense to have originated from a single source (16) and it is only possible to establish a degree of probability (i.e., similarity percentages) regarding whether or not the sample derived from a given location (16, 29).

Pattern classification and prediction are developing rapidly, and previous literature discusses the benefits of machine learning tools for pattern discovery, classification, and prediction (30–33). These statistical algorithms are designed to study patterns in data that can further provide predictive models for the classification of unknown samples. Machine learning tools are separated into two categories: supervised and unsupervised. Supervised learning involves using a training set to build a model of causation for the desired classification, whereas unsupervised learning does not make such assumptions and attempts to discover patterns and structures in the data without a training set (31). The forensic community could benefit enormously from classification tools and a comprehensive reference database for distinguishing soil samples and determining their geographic origin. However, bioinformatic trials are required to establish an optimal data analysis pipeline and assess the signal-to-noise ratio and false-positive/negative error rates (34).

This study examines DNA from four large kingdoms (i.e., bacteria, archaea, fungi, and plants) by using a four-taxa concatenated profile from the soil samples and assessing it over multiple spatial scales: soil types (i.e., similar physical and chemical properties), transects (i.e., sites within a soil type and 100 m in length), subplot levels (i.e., sites within the transect within a 1 m² quadrat), and temporal scales (dry and wet seasons over a year). A previous study conducted by MacDonald et al. (2008) utilized a multiplex T-RFLP approach to analyze bacteria, archaea, and fungi, which led to better discrimination of soil samples because different taxa responded differently to spatio-temporal ecological drivers (10). Plants also have the potential to discriminate soil samples, as they are dependent on the soil’s microbes, water, and nutrients. Therefore, a four-taxa approach was employed to include plants as well as bacteria, archaea, and fungi; however, the discrimination power of each taxon compared to their combined power for soil discrimination was outside the scope of this manuscript and will be discussed in a subsequent paper. The first objective of this study was to assess the biogeographical patterns of the soil and determine whether a soil’s genetic content was spatially correlated with its geographic location (i.e., samples that are geographically closer are more similar than those farther away). The second objective was to evaluate five different machine-learning algorithms (i.e., K-Nearest Neighbor, Decision Trees, Random Forests, Neural Networks, and Support Vector Machines) for their predictive ability to recognize biotic patterns that could accurately classify soils at different spatial scales.
This study advances the fundamental understanding of how site-by-site differences in soil microbiota might be exploited to distinguish soil samples that can assist in provenance investigations and ultimately, help to decide whether community analysis of soils is a viable tool to incorporate into forensic practice.

Materials and Methods

Soil Collection

Soil samples (N = 1332) were collected across Miami-Dade County, Florida (S1 Table). Given that the collections were made from public access sites and did not involve endangered or protected species, no special permits were required. Six soil types, labeled according to USDA soil surveys (35), were surveyed: 1-Urban Land-Udorthents, 2-Lauderhill Dania-Pahokee, 3-Rock Outcrop-Biscayne-Chekika, 4-Perrine-Biscayne-Pennsuco, 5-Krome Association, 6-Perrine-Terra Ceia-Pennsuco. Within each soil type, 2–4 transects of 100 m in length and 1.6 km apart were sampled. Most transects were established in undisturbed sites that had limited public access. Six subplots were randomly selected within each transect and GPS coordinates for each subplot of each transect were recorded. Names of transects and their GPS coordinates can be found in the Supporting Information section (S1 Table). Within each subplot, six cored samples were taken within a 1 m² quadrat. A 5 cm diameter soil corer was used to collect the top 5–10 cm of the soil (Fig. 1). Samples were collected during a one-year time frame, with the exception of one transect (FIU) where samples were collected over a 1.5-year period. Due to South Florida’s monsoonal subtropical climate, sampling was repeated at the same sites during both the dry and wet seasons. Wet and dry seasons were defined by the Florida Automated Weather Network (FAWN, http://fawn.ifas.ufl.edu), where seasons in Florida are classified based on the average rainfall. The wet season is defined as the period during which the average rainfall is four times more than that in the corresponding dry season. The wet season typically occurs from May–October, while the dry season is generally from November–April (36).

FIG. 1.

Map of Miami-Dade County, FL. Shaded areas represent the six soil types observed in Miami-Dade County (according to the USDA): 1-Urban Land-Udorthents, 2-Lauderhill Dania-Pahokee, 3-Rock Outcrop-Biscayne-Chekika, 4-Perrine-Biscayne-Pennsuco, 5-Krome Association, 6-Perrine-Terra Ceia-Pennsuco (35). Stars indicate transect sites. Within each 100 m transect, six subplots were sampled and six cored samples were taken within a 1 m² quadrat from each subplot. A 5 cm diameter soil corer was used to collect the top 5–10 cm of the soil.

DNA Extraction

The soil samples were transported back to the laboratory on ice, each manually homogenized, and sieved to remove large objects and debris. DNA was extracted from each homogenized soil sample (<250 mg) using the BIO 101 Fast DNA Spin Kit for Soil® and FastPrep®-24 System homogenizer (MP Bio, Solon, OH). The Fluorescent DNA Quantitation Kit (Bio-Rad, Hercules, CA) and Modulus™ Microplate Multimode Reader (Turner Biosystems, Sunnyvale, CA) were used to quantify the extracted DNA. Samples were diluted to a working stock of 20 ng/μL. Lastly, a 1% agarose yield gel was run to assess the integrity and quality of the extracted DNA.

Length Heterogeneity Polymerase Chain Reaction

DNA was amplified by Length Heterogeneity Polymerase Chain Reaction (LH-PCR) using two PCR duplexes: (1) bacteria and fungi, and (2) plant and archaea. LH-PCR, as the name implies, is based on the natural variations in sequence lengths of target gene fragments (8, 25). This technique is a rapid, robust, and reliable method that uses universal primer sets that anneal to sequences highly conserved within each taxon and flank hypervariable regions that can discriminate at the species level. Universal primers for the following genomic regions of each taxon were used: 16S rRNA for bacteria (27-F, 355-R) (37) and archaea (21-F, 518-R) (38), ribosomal internal transcribed spacer region (ITS) for fungi (ITS5-F, ITS2-R) (39), and chloroplast trnL intergenic region for plant (trnL-F, trnL-R) (40) (Table 1). Forward primers were labeled with 6-FAM fluorescent dye. PCR reaction mixtures included: 1X reaction buffer, 2.5 mM MgCl₂, 250 μM dNTPs (Promega, Madison, WI), 1% BSA (fraction V, Fisher Scientific, Pittsburgh, PA), 1% DMSO (Promega), various concentrations of primers (bacteria, 0.5 μM; archaea and fungi, 0.4 μM; plant, 0.6 μM), 40 ng DNA, 0.5 U AmpliTaq Gold® DNA Polymerase (Applied Biosystems, Foster City, CA), and diethylpyrocarbonate-treated (DEPC) water to a final volume of 20 μL. Each duplex was amplified with the same program using the ABI 9700™ thermocycler (Applied Biosystems) with the following parameters: 10 min initial denaturing step at 95°C; 25 cycles of denaturation at 95°C, annealing at 54°C, and extension at 74°C each for 30 sec; with a final extension at 74°C for 10 min.

TABLE 1.

Primer pairs used for each taxon.

| Taxon | Primer Name | Primer Sequence (5′→3′) | Reference |
|---|---|---|---|
| Bacteria | 27-F | AGA GTT TGA TCM TGG CTC AG | [42] |
| | 355-R | GCT GCC TCC CGT AGG AGT | |
| Fungi | ITS5-F | GGA AGT AAA AGT CGT AAC AAG G | [44] |
| | ITS2-R | GCT GCG TTC TTC ATC GAT GC | |
| Archaea | Arch21-F | TTC CGG TTG ATC CYG CCG GA | [43] |
| | Arch518-R | ATT ACC GCG GCT GCT GG | |
| Plant | trnLUAA-F | CGA AAT CGG TAG ACG CTA CG | [45] |
| | trnLUAA-R | GGG GAT AGA GGG ACT TGA | |

Capillary Electrophoresis

Samples from the two duplexes were co-loaded where 0.5 μL of each duplex PCR product was added to a mixture of 11.5 μL Hi-Di™ Formamide (Applied Biosystems) and 0.65 μL internal size standard, GeneScan 600 LIZ® (Applied Biosystems), denatured by heating for 2 min at 95°C and then snap-cooled on ice for 2 min. The amplicon data were analyzed using the DS-33 Matrix and Filter Set G5 (Applied Biosystems). The samples were electrokinetically injected at 15 kV for 5 sec and separated at 60°C on an ABI Prism™ 310 Genetic Analyzer (Applied Biosystems) using a Performance Optimized Polymer 4 (POP4) separation matrix (Applied Biosystems) with laser power at 9.9 mW and capillary length of 36 cm well-to-read (WTR) distance to the detection window.

Analyses

Raw data were analyzed using the GeneMapper™ research software, Version 4.0 (Applied Biosystems, Foster City, CA). The Local Southern size-calling method was used with a minimum threshold of 50 relative fluorescence units (RFUs). Bins were created to separate amplicons that differed from each other in length by a single base pair. Data from all taxa were concatenated for subsequent analyses. The relative ratios were calculated by normalizing the height of each peak in the genotype to the total peak intensity, so that each relative peak height lies between 0 and 1.
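The normalization step can be illustrated with a minimal Python sketch. It assumes that peaks are first filtered at the 50 RFU threshold, then concatenated across taxa, and only then normalized to the total intensity of the concatenated profile; the text does not state the exact order. The function names and the example peak table are hypothetical.

```python
def concatenate_profiles(profiles_by_taxon, min_rfu=50):
    """Merge per-taxon {amplicon_length_bp: peak_height} tables into one
    profile keyed by (taxon, length), dropping peaks below min_rfu."""
    merged = {}
    for taxon, peaks in profiles_by_taxon.items():
        for length_bp, height in peaks.items():
            if height >= min_rfu:
                merged[(taxon, length_bp)] = height
    return merged


def relative_ratios(profile):
    """Divide each peak height by the total intensity so ratios sum to 1."""
    total = sum(profile.values())
    if total == 0:
        return {peak: 0.0 for peak in profile}
    return {peak: height / total for peak, height in profile.items()}


# Hypothetical raw peak heights (RFU) for one soil sample.
sample = {
    "bacteria": {312: 400, 340: 100, 351: 30},  # the 30 RFU peak is dropped
    "fungi": {275: 250},
    "archaea": {498: 150},
    "plant": {101: 100},
}
ratios = relative_ratios(concatenate_profiles(sample))
for peak, ratio in sorted(ratios.items()):
    print(peak, round(ratio, 3))
```

Each value in `ratios` lies between 0 and 1, and the values of one sample sum to 1, which makes profiles with different total signal comparable.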

Mantel Test

Mantel tests were performed using the ade4 library in the R programming language (41). Two distance matrices were tested, geographic distance and genetic distance, with the genetic data imported as binary data (i.e., presence/absence). Relative GPS coordinates were recorded for each sample by designating the center of each subplot as the true GPS coordinate. The Mantel tests were run and plotted with the mantel.randtest function, and significance was assessed by random permutation using the Monte Carlo method. This method relies on repeated random sampling (i.e., 999 permutations), so no assumptions regarding the statistical distributions of samples in the matrix were needed. The rows and columns of one matrix were randomly permuted and the correlation was recalculated after each permutation, thereby testing the significance. A detailed exemplar script can be found in the Supporting Information section (S1 File).
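The permutation logic of a Mantel test can be sketched in plain Python. This is an illustrative reimplementation, not the ade4 code or the script in S1 File. It assumes Jaccard distance for the presence/absence profiles, Euclidean distance on planar coordinates, and a one-sided p-value, none of which are specified in the text; the coordinates and profiles are invented.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Off-diagonal upper triangle after reordering rows and columns together."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Observed matrix correlation and a one-sided Monte Carlo p-value."""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    a = upper_triangle(dist_a, identity)
    r_obs = pearson(a, upper_triangle(dist_b, identity))
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of one matrix
        if pearson(a, upper_triangle(dist_b, order)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


def jaccard_distance(p, q):
    """1 - shared peaks / peaks present in either binary profile."""
    union = sum(1 for a, b in zip(p, q) if a or b)
    shared = sum(1 for a, b in zip(p, q) if a and b)
    return 0.0 if union == 0 else 1 - shared / union


# Hypothetical subplot coordinates (m) and presence/absence peak profiles.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
profiles = [
    [1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 1, 1],
]
geo = [[math.dist(p, q) for q in coords] for p in coords]
gen = [[jaccard_distance(p, q) for q in profiles] for p in profiles]
r_obs, p_value = mantel(geo, gen)
print(round(r_obs, 3), p_value)
```

A positive `r_obs` with a small `p_value` corresponds to the pattern described in the Results: samples that are closer together have more similar profiles.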

Dissimilarity Percentages

All analyses were conducted using the PRIMER-E v7 software (PRIMER E Ltd., Plymouth Marine Laboratory, Plymouth, U.K.). Dissimilarity percentages were obtained from Bray-Curtis similarity matrices that were generated on relative abundance ratios, which had been square-root transformed prior to analysis. Cluster dendrograms were used to visualize the average transect Bray-Curtis similarity matrices and show if the different soil types (1–6) cluster in an order that reflects their geographical distribution (i.e., samples closer together geographically were similar in their biotic composition). SIMPER analysis was conducted to determine the similarity percentages within and between samples at multiple spatial scales (i.e., soil type, transect, subplot) and seasonal differences (i.e., wet, dry). These analyses were also conducted to identify unique LH-PCR peaks contributing to dissimilarity between sites, which was supported by the Random Forest important variable plot, illustrating the significant LH-PCR peaks to discriminate between samples.
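For readers without access to PRIMER-E, the similarity measure can be sketched in a few lines of Python. The example computes the percent Bray-Curtis similarity between two relative-abundance profiles after a square-root transformation; the function name and the two profiles are hypothetical, and the SIMPER decomposition itself is not reproduced.

```python
import math


def bray_curtis_similarity(x, y):
    """Percent Bray-Curtis similarity of two square-root transformed profiles."""
    tx = [math.sqrt(v) for v in x]
    ty = [math.sqrt(v) for v in y]
    numerator = sum(abs(a - b) for a, b in zip(tx, ty))
    denominator = sum(a + b for a, b in zip(tx, ty))
    if denominator == 0:
        return 0.0  # convention chosen here for two empty profiles
    return 100 * (1 - numerator / denominator)


# Hypothetical relative ratios for the same four amplicon bins in two samples.
sample_a = [0.25, 0.25, 0.50, 0.00]
sample_b = [0.25, 0.25, 0.00, 0.50]
similarity = bray_curtis_similarity(sample_a, sample_b)
dissimilarity = 100 - similarity
print(round(similarity, 2), round(dissimilarity, 2))
```

The square-root transformation down-weights the dominant peaks, so the two shared minor peaks contribute more to the similarity than they would on the untransformed ratios.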

Machine Learning Tools

R software was used for all the classification methods; the specific R packages used for these methods include: class for K-Nearest Neighbor (42), rpart for Decision Trees (43), randomForest for Random Forests (44), neuralnet for Neural Networks (45), and e1071 for Support Vector Machines (46). Detailed exemplar scripts for each classification method can be found in the Supporting Information section (S2 File). Two-thirds of the dataset was used for training while one-third was used for testing for each replicate run and across each algorithm. For reproducibility, the datasets were re-tested by randomly selecting a different training and testing set, three different times. Comparisons of the methods were conducted by calculating the average percent of samples correctly classified into soil type, transect, or subplot based on the three test sets. The primary performance criterion evaluated was the classification accuracy, which is the measurement of correctly classified instances (i.e., accuracy = total number of samples correctly classified/total number of samples), as well as the overall error rate. The second performance criterion evaluated was the area under an ROC curve (AUC). This is widely used to measure the performance of supervised classification methods based on their ranking quality of sensitivity (true-positive rate) as a function of 1 − specificity (false-positive rate). An AUC value of 1 illustrates a perfect test that has zero false-positives and zero false-negatives. Multi-class AUC was conducted using the pROC package in R (47). Random Forest and Support Vector Machines were re-evaluated using different minimum ratio RFU thresholds (1%, 5%, 10%, and 20%) for the electrophoretic data to check for any changes in the classification accuracy. Student’s two-sample T-tests were conducted to determine significant differences between varying classification scales and machine learning tools.
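The evaluation protocol (a random two-thirds/one-third split, repeated three times, with accuracy averaged over the test sets) can be sketched as follows. The sketch is in Python rather than R and substitutes a simple 1-nearest-neighbor rule for the five R packages, purely to keep the example self-contained; the data, seed, and function names are hypothetical.

```python
import random
import statistics


def split_two_thirds(n_samples, rng):
    """Shuffle sample indices and return (training, testing) index lists."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = (2 * n_samples) // 3
    return indices[:cut], indices[cut:]


def nearest_neighbor_predict(train_x, train_y, query):
    """Label of the training profile closest to the query (1-NN stand-in)."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]


def accuracy(true_labels, predicted_labels):
    """Number of samples correctly classified / total number of samples."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def repeated_holdout(x, y, repeats=3, seed=1):
    """Mean and SD of test accuracy over independent random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        train_idx, test_idx = split_two_thirds(len(x), rng)
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        predictions = [nearest_neighbor_predict(train_x, train_y, x[i]) for i in test_idx]
        scores.append(accuracy([y[i] for i in test_idx], predictions))
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical two-peak relative-ratio profiles from two soil types.
profiles = [
    [0.90, 0.10], [0.85, 0.15], [0.80, 0.20], [0.95, 0.05], [0.75, 0.25], [0.70, 0.30],
    [0.10, 0.90], [0.15, 0.85], [0.20, 0.80], [0.05, 0.95], [0.25, 0.75], [0.30, 0.70],
]
soil_types = ["type 1"] * 6 + ["type 2"] * 6
mean_accuracy, sd_accuracy = repeated_holdout(profiles, soil_types)
error_rate = 1 - mean_accuracy
print(mean_accuracy, sd_accuracy, error_rate)
```

Because each classifier in the study was given the same training and test sets, fixing the random seed (or storing the index lists) is what makes the comparison across algorithms fair.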
Random Forest analysis was conducted to provide the most important variables for classification for each spatial scale (i.e., soil type, transect, and subplot). Random Forest analysis provides a “Mean Decrease Accuracy” for the different LH-PCR peaks to determine the most important variable for discriminating between the soils being classified. The greater the decrease in accuracy due to the exclusion of a single variable, the more important that variable is deemed. Three scales were analyzed: soil type, transect, and subplot. The Mean Decrease Accuracy was calculated based on an out-of-bag error calculation phase to determine if the accuracy of the Random Forest prediction decreases when the single variable is excluded.
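The idea behind a decrease-in-accuracy importance score can be sketched with a simplified permutation test. This is not the randomForest out-of-bag procedure: it assumes a held-out set and an arbitrary classifier, and it shuffles one variable at a time to break its link with the class label. The classifier rule, data, and function names are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the classifier returns the true label."""
    return sum(predict(r) == l for r, l in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical classifier that relies only on the first LH-PCR peak.
def toy_classifier(row):
    return "type 1" if row[0] > 0.5 else "type 2"


rows = [[0.9, 0.3], [0.8, 0.1], [0.7, 0.2], [0.1, 0.3], [0.2, 0.1], [0.3, 0.2]]
labels = ["type 1"] * 3 + ["type 2"] * 3
importances = permutation_importance(toy_classifier, rows, labels)
print(importances)
```

The first peak receives a positive score because scrambling it degrades the predictions, while the second peak, which the classifier never uses, scores zero.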

Results

Spatial Autocorrelation Analysis: Mantel Test

The genetic profiles from a majority of the sites in Miami-Dade displayed a significant positive spatial autocorrelation between geographic location and biotic composition, indicating that samples that were geographically closer together were statistically similar in their biotic composition. However, out of eighteen transects, six transects had non-significant correlations, as well as temporal variability (S2 Table). These six sites were found to have been previously disturbed and therefore did not represent the pristine parent soil type. During the wet season, the sites included: Soil type 1, OSP1 (ob = −0.04, p = 0.58), constructed parkland; Soil type 3, CH (ob = 0.03, p = 0.29), burned 6 months earlier; Soil type 4, PE (ob = −0.17, p = 0.99), abandoned nursery site; and Soil type 5, USDA 1 (ob = −0.05, p = 0.73), agricultural land. For dry season samples, the following correlations were found: Soil type 1, OSP1 (ob = 0.09, p = 0.15), constructed parkland; Soil type 1, OSP2 (ob = −0.08, p = 0.81), constructed parkland; Soil type 2, NW 137 (ob = −0.01, p = 0.42), illegal trash dumping; and Soil type 5, USDA 1 (ob = −0.17, p = 0.98), agricultural land.

Similarity Percentages

Cluster dendrograms supported the Mantel spatial autocorrelation results, showing that samples that were geographically closer together were similar in their biotic composition (Fig. 2). For instance, the KK and CH transects, both from Soil Type 3, grouped within the same cluster. The same trend was observed for Soil Types 2, 5, and 6. Two outliers did not group with their respective soil type (i.e., CS from Soil Type 4 and OSP3 from Soil Type 1, which grouped with Soil Types 6 and 4, respectively). SIMPER analyses conducted at different scales (i.e., soil type, transect, subplot, and season) illustrated the similarities between and within each scale (Tables 2 and 3). Overall, the average similarities within sites were greater than those between sites. Within-soil-type similarities ranged from 19–49% and within-transect similarities ranged from 36–71%, while the between-site similarities ranged from 12–20% and 8–26% for soil type and transect, respectively. Seasonal similarity differed by soil type and transect (Tables 2 and 3, respectively). At the transect level, KS8 had the highest seasonal similarity (71%), while PE had the lowest (24%). These results can be attributed to the physical characteristics of the soil samples, as will be elaborated in the discussion. Soil type did not prove to be an indicator of seasonal variability (Fig. 3). In this study, both transects from Soil Type 4 (CS and PE) had the highest seasonal differences (70 and 76% seasonal dissimilarity, respectively), whereas the Soil Type 2 transects differed in their seasonal dynamics, with KS8 having 71% similarity between the wet and dry communities and KNT having a larger seasonal fluctuation in the biotic community (i.e., 60% dissimilarity).

FIG. 2.

Cluster dendrogram based on average transect Bray-Curtis similarity to visualize the biotic similarity in relation to their geographic distribution. Symbols represent the soil type classification, while labels represent the transects within each soil type.

TABLE 2.

SIMPER analysis illustrating the average similarity (± is the SD of the mean % similarity). “Between” soil types represents the average similarity of one soil type compared to the other five soil types. “Within” each soil type represents the average similarity of the 2–4 transects within a soil type. Season represents the similarity between wet and dry season for each soil type.

| Soil Type | Between similarity (%) | Within similarity (%) | Season similarity (%) |
|---|---|---|---|
| 1 | 19.73 ± 8.17 | 32.48 ± 6.42 | 35.43 |
| 2 | 18.18 ± 6.56 | 45.60 ± 2.59 | 48.92 |
| 3 | 18.22 ± 7.19 | 49.18 ± 0 | 45.10 |
| 4 | 11.61 ± 5.21 | 19.16 ± 0 | 20.93 |
| 5 | 17.33 ± 3.87 | 33.99 ± 6.7 | 38.29 |
| 6 | 18.88 ± 1.82 | 34.27 ± 0 | 40.11 |

TABLE 3.

SIMPER analysis illustrating the average similarity (± is the SD of the mean % similarity). “Between” transects represents the average similarity of one transect when compared to the other seventeen transects. “Within” each transect represents the average similarity of the six subplots within each transect. Season represents the similarity between wet and dry season for each transect.

| Transect | Between similarity (%) | Within similarity (%) | Season similarity (%) |
|---|---|---|---|
| FIU | 21.93 ± 9.04 | 37.67 ± 4.44 | 36.76 |
| OSP1 | 22.21 ± 9.46 | 47.08 ± 8.77 | 48.91 |
| OSP2 | 25.86 ± 9.06 | 57.59 ± 3.40 | 53.46 |
| OSP3 | 19.99 ± 7.92 | 47.05 ± 10.84 | 45.10 |
| NW137 | 24.97 ± 12.07 | 57.28 ± 7.12 | 55.22 |
| KNT | 22.73 ± 12.14 | 46.07 ± 9.28 | 40.55 |
| KS8 | 23.38 ± 12.64 | 71.46 ± 3.34 | 71.43 |
| CC6 | 25.36 ± 11.58 | 68.27 ± 3.54 | 64.99 |
| KK | 20.84 ± 9.95 | 53.14 ± 3.54 | 42.19 |
| CH | 21.05 ± 10.21 | 52.26 ± 8.74 | 47.55 |
| CS | 14.56 ± 4.64 | 35.90 ± 6.97 | 30.00 |
| PE | 7.82 ± 6.34 | 39.73 ± 2.25 | 23.94 |
| HA | 24.05 ± 8.63 | 68.84 ± 4.12 | 68.30 |
| TREC | 19.95 ± 8.71 | 47.50 ± 4.21 | 45.22 |
| USDA1 | 18.12 ± 7.53 | 46.06 ± 4.60 | 36.06 |
| USDA2 | 21.26 ± 7.29 | 54.84 ± 4.36 | 53.57 |
| USDA3 | 20.08 ± 5.95 | 53.92 ± 2.75 | 35.84 |
| FC | 19.36 ± 5.54 | 58.20 ± 4.90 | 57.35 |

FIG. 3.

SIMPER analysis representing the average similarity (± is the SD of the mean % similarity) within a transect per soil type.

Soil Classification: Comparison of Five Machine Learning Tools

Using only the soil type (N=144–288 per soil type) as a classifier, K-Nearest Neighbors, Decision Tree, Random Forest, Neural Networks, and Support Vector Machines provided 98%, 95%, 99%, 91%, and 91% classification accuracy (AUC= 0.93–1), respectively (Fig. 4, Table 4). At the transect level (N=72 per transect), accuracies were 92%, 85%, 98%, 64%, and 89% (AUC= 0.95–1), while at the subplot level (N=12 per subplot) classification accuracies dropped to 51%, 6%, 67%, 13%, and 45% (AUC= 0.97–0.99) with K-Nearest Neighbors, Decision Tree, Random Forest, Neural Network, and Support Vector Machines, respectively (Fig. 4, Table 4). Irrespective of which machine learning tool was used, soil type classification resulted in significantly higher accuracy when compared to either the transect or subplot classifications (p< 0.007). Student’s T-Test results show that Random Forest significantly outperformed all other algorithms regardless of the spatial level selected (i.e., soil type (p< 0.044), transect (p< 0.001), subplot (p< 0.001)), with the exception of K-Nearest Neighbors (p= 0.065) for soil type classification.
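The two-sample comparison can be illustrated from the published summary statistics alone, since the three raw accuracies per algorithm are not listed. The Python sketch below makes two assumptions that the text does not confirm: the ± values in Table 4 are treated as standard errors of the mean over three repeats, and a pooled-variance (Student's) test with equal group sizes is used. The p-value is obtained by numerically integrating the t density, so no statistics library is required; all function names are hypothetical.

```python
import math


def pooled_t_from_summary(mean1, se1, mean2, se2, n):
    """Student's t and degrees of freedom for two groups of equal size n."""
    sd1, sd2 = se1 * math.sqrt(n), se2 * math.sqrt(n)
    pooled_var = (sd1 ** 2 + sd2 ** 2) / 2
    t = (mean1 - mean2) / math.sqrt(pooled_var * 2 / n)
    return t, 2 * n - 2


def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density from 0 to |t|
    (Simpson's rule; steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)


# Soil-type accuracies from Table 4: Random Forest vs. K-Nearest Neighbor.
t_stat, df = pooled_t_from_summary(99.76, 0.24, 98.5, 0.44, n=3)
p_value = two_sided_p(t_stat, df)
print(round(t_stat, 3), df, round(p_value, 3))
```

Under these assumptions the result lands close to the p = 0.065 reported above for the Random Forest versus K-Nearest Neighbors comparison; with only three repeats per algorithm the test has four degrees of freedom, so even a difference of more than one percentage point can fall short of significance.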

FIG. 4.

Prediction accuracy values (±SD of the mean) for each of the five machine-learning tools (K-Nearest Neighbor (KNN), Decision Trees (DT), Random Forest (RF), Neural Networks (NN), Support Vector Machines (SVM)) using training and test sets randomly chosen three different times from the complete database. Black = Soil Type; Light Grey = Transect; Dark Grey = Subplot.

TABLE 4.

Prediction accuracy and AUC values (±SD of the mean) for each of the five machine-learning tools (KNN, DT, RF, NN, SVM) based on three repeats.

| Algorithm | Metric | Soil Type | Transect | Subplot |
|---|---|---|---|---|
| KNN | Accuracy (%) | 98.5 ± 0.44 | 92.57 ± 0.42 | 51.67 ± 2.65 |
| | AUC | 0.99 ± 0.01 | 0.98 ± 0.01 | 0.98 ± 0.01 |
| DT | Accuracy (%) | 95.26 ± 1.54 | 85.45 ± 1.37 | 6.16 ± 2.06 |
| | AUC | 0.97 ± 0.01 | 0.96 ± 0.00 | 0.94 ± 0.01 |
| RF | Accuracy (%) | 99.76 ± 0.24 | 98.1 ± 0.36 | 67.98 ± 1.52 |
| | AUC | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.99 ± 0.00 |
| NN | Accuracy (%) | 91.61 ± 1 | 64.71 ± 0.67 | 13.91 ± 0.94 |
| | AUC | 0.93 ± 0.01 | 0.95 ± 0.01 | 0.97 ± 0.00 |
| SVM | Accuracy (%) | 91.86 ± 0.84 | 89.47 ± 0.68 | 45.95 ± 1.86 |
| | AUC | 0.95 ± 0.01 | 0.97 ± 0.00 | 0.98 ± 0.02 |

When only the Random Forest and Support Vector Machine algorithms were tested, classification accuracy was not significantly altered by increasing the electrophoretic threshold to 5% (p= 0.528). However, increasing the threshold to 10% or higher significantly reduced the classification accuracy (p< 0.001). At the 1% and 5% thresholds, Random Forest significantly outperformed Support Vector Machines (p< 0.004); however, under higher thresholds (i.e., 10%, 20%) the two machine learning tools were not significantly different (p= 0.570, 0.848, respectively). Prediction accuracy and AUC values are listed in Table 5.
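The effect of raising the minimum ratio threshold can be illustrated with a short Python sketch. It assumes that the threshold is applied to the relative ratios of the concatenated profile and that surviving peaks are not renormalized afterwards; the text does not specify either point. The profile and function name are hypothetical.

```python
def apply_ratio_threshold(profile, min_ratio):
    """Keep only peaks whose relative ratio is at least min_ratio."""
    return {peak: ratio for peak, ratio in profile.items() if ratio >= min_ratio}


# Hypothetical concatenated four-taxa profile (relative ratios sum to 1).
profile = {
    ("bacteria", 312): 0.40,
    ("bacteria", 340): 0.22,
    ("fungi", 275): 0.15,
    ("archaea", 498): 0.12,
    ("plant", 101): 0.07,
    ("fungi", 290): 0.04,
}

summary = {}
for threshold in (0.01, 0.05, 0.10, 0.20):
    kept = apply_ratio_threshold(profile, threshold)
    taxa = sorted({taxon for taxon, _ in kept})
    summary[threshold] = (len(kept), taxa)
    print(f"{threshold:.0%}: {len(kept)} peaks from {taxa}")
```

In this invented profile the low-intensity peaks removed first belong to the non-bacterial taxa, so at the 20% threshold only bacterial peaks remain, which is the kind of information loss that would reduce a classifier's ability to separate sites.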

TABLE 5.

Prediction accuracy and AUC values (±SD of the mean) for Random Forest and Support Vector Machines from randomly sampling training and testing sets three different times.

| Algorithm | Threshold | Soil Type Accuracy (%) | Soil Type AUC | Transect Accuracy (%) | Transect AUC |
|---|---|---|---|---|---|
| RF | 1% | 99.76 ± 0.24 | 1.00 ± 0.00 | 98.10 ± 0.36 | 1.00 ± 0.00 |
| | 5% | 99.19 ± 0.32 | 0.99 ± 0.01 | 94.27 ± 0.89 | 0.99 ± 0.00 |
| | 10% | 93.23 ± 0.12 | 0.97 ± 0.00 | 73.97 ± 0.78 | 0.95 ± 0.00 |
| | 20% | 63.30 ± 1.59 | 0.86 ± 0.01 | 35.72 ± 0.84 | 0.86 ± 0.02 |
| SVM | 1% | 91.86 ± 0.84 | 0.95 ± 0.01 | 89.47 ± 0.68 | 0.97 ± 0.00 |
| | 5% | 93.21 ± 0.55 | 0.96 ± 0.01 | 87.94 ± 0.23 | 0.96 ± 0.00 |
| | 10% | 89.63 ± 0.88 | 0.95 ± 0.00 | 70.38 ± 0.94 | 0.92 ± 0.01 |
| | 20% | 64.52 ± 0.21 | 0.87 ± 0.01 | 37.9 ± 0.65 | 0.85 ± 0.02 |

Discriminatory LH-PCR Peaks

The results illustrated that at finer spatial scales (i.e., subplot vs. soil type) more peaks were important to accurately classify the soil’s origin (Fig. 5). Moreover, these data support the threshold data (Table 5) and illustrate that all four taxa were important to discriminate between soils. The Random Forest analyses were supported by the SIMPER analysis results, which identified the unique LH-PCR peaks that contributed to the dissimilarity between sites (Fig. 5).

FIG. 5.

Most Important Variables (LH-PCR amplicons) for discriminating between soils at multiple spatial scales (soil type, transect, and subplot) based on Random Forest analysis. The greater the Mean Decrease Accuracy (y-axis), the more important the LH-PCR amplicon (x-axis) is for classification. Results indicate that as the spatial scale decreases from Soil Type to Subplot Level, more LH-PCR amplicons are important to discriminate locations.

Discussion

Soil ecosystems are very complex and contain a vast array of information, both abiotic and biotic, that are highly integrated. It is not surprising to see differences in biota, especially for sites that have varying vegetation and soil properties. However, this study illustrates that not only are there differences, but more importantly, there is a biotic organization to the soil communities that can be captured with length heterogeneity PCR. The machine learning algorithms were able to detect these hidden patterns by focusing on the dissimilarities as well as the similarities to be able to group the sites according to their geographic location at multiple spatial scales.

This study builds on the growing knowledge of spatial relationships in microbial communities by applying the Mantel statistic to illustrate that the biotic patterns and their geographic location are indeed spatially auto-correlated in Miami-Dade soils (S2 Table). The non-significant spatial autocorrelation proved to be an indicator of localized disturbed or constructed sites when compared to the undisturbed transects within the same soil type, which adds value for discrimination of sites for provenance or forensic applications (i.e., CH had been burned six months prior to soil collection, PE was an old abandoned nursery, NW137 was an illegal mixed trash dump site, and the OSP transects spanned mixed forest vegetation to abandoned construction sites). Previous studies have also shown that extensive human interactions can lead to biological homogenization (27, 48–50).

Based on the four-taxa profiles and Mantel test, correlation between biotic content and geographic location was observed, thus justifying the use of machine learning tools to predict biotic patterns that can be applied for determination of soil provenance. Studies have shown that no single classification method is superior in every case (33, 51–54). Each classification tool has its own learning and prediction procedure; therefore, to be able to compare five different supervised machine learning tools, using the same training and test sets was important. For forensic/provenance applications, the model needs to have a high degree of classification accuracy and be easily interpretable for implementation in a court of law. A balance between sensitivity and specificity is required for forensics, whereby the method should be able to detect differences and also avoid false-positive results (16). In this study, all algorithms were able to classify the soil samples with high accuracy and high AUC values. Irrespective of the spatial scale (soil type, transect, or subplot), the Random Forest approach had the highest classification accuracy and AUC value compared to the other algorithms (Table 4). Random Forest was able to predict soil type and transect level accurately 99% and 98% of the time, respectively. This was significantly higher than the other algorithms that were only able to achieve soil type and transect level classification accuracies of 91–98% and 64–92%, respectively. Moreover, the Random Forest method was able to accurately predict the origin of the soil using the four-taxa profiles at the smallest scale, subplot level with 67% accuracy (Fig. 4). Those that misclassified at the subplot level were still classified correctly within the same transect. In contrast, the other four machine learning algorithms were only able to accurately predict subplot level with 6–51% accuracy.
Previous studies also concluded that the Random Forest algorithm was both computationally efficient and made extraction of important model features simple (54, 55). Similarly, Kampichler et al. (2010) recommended the Random Forest approach to biologists and decision makers because of the ease of interpreting its classifiers and the clarity of the method (33).
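The evaluation measures discussed above can be made concrete with a short sketch. It computes overall accuracy, the sensitivity and specificity for one class, and the accuracy obtained when subplot predictions are collapsed to their transect, which mirrors the observation that subplot misclassifications stayed within the correct transect. The "transect-subplot" label format and the example predictions are hypothetical and are not data from this study.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def sensitivity_specificity(true_labels, predicted_labels, positive):
    """One-vs-rest sensitivity and specificity for the class `positive`.

    Assumes the class occurs at least once and is not the only class.
    """
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1  # false positive: wrongly assigned to this class
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def transect_of(subplot_label):
    """Hypothetical label scheme: 'KS8-2' is subplot 2 of transect KS8."""
    return subplot_label.rsplit("-", 1)[0]


# Made-up predictions for six test samples from two transects.
true_subplots = ["KS8-1", "KS8-2", "KS8-2", "CH-1", "CH-1", "CH-2"]
predicted = ["KS8-1", "KS8-1", "KS8-2", "CH-1", "CH-2", "CH-2"]

subplot_accuracy = accuracy(true_subplots, predicted)
transect_accuracy = accuracy([transect_of(t) for t in true_subplots],
                             [transect_of(p) for p in predicted])
sens, spec = sensitivity_specificity(true_subplots, predicted, "KS8-1")
print(subplot_accuracy, transect_accuracy, sens, spec)
```

Scoring every algorithm on the same held-out labels in this way is what makes the accuracies of different classifiers directly comparable.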

When choosing the most appropriate algorithm, it is important to take the dataset into account. Neural Networks and Support Vector Machines are more complex algorithms than Decision Trees, Random Forests, and K-Nearest Neighbors. For example, the simpler Decision Trees and Random Forest methods perform better with discrete and categorical data because they search the variables for the most discriminative one to classify on and repeat this process until all of the data are categorized (51). Support Vector Machines and Neural Networks essentially find the maximal margin that distinguishes the different classes, which results in a highly comprehensible model but can at times over-fit the data (30, 51, 54). Therefore, Support Vector Machines and Neural Networks are capable of working with high-dimensional and continuous data, but they require variable selection and do not perform well with a large number of irrelevant variables (51, 56).
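The tree-building step described above, finding the most discriminative variable and then repeating, can be sketched as follows. The example assumes Gini impurity as the splitting criterion, which is one common choice and not necessarily the one used in this study; the two-feature profiles and the class labels "A" and "B" are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (weighted impurity, feature index, threshold) of the best split.

    Every midpoint between consecutive distinct values of every feature is
    tried; the split with the lowest weighted Gini impurity wins.
    """
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, feature, threshold)
    return best


# Hypothetical relative abundances of two amplicons in four samples.
rows = [(0.10, 0.02), (0.12, 0.30), (0.11, 0.01), (0.09, 0.28)]
labels = ["A", "B", "A", "B"]

impurity, feature, threshold = best_split(rows, labels)
print(impurity, feature, threshold)
```

A full decision tree would apply `best_split` recursively to each resulting subset, and a Random Forest would repeat that over many bootstrap samples and random feature subsets before taking a majority vote.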

Variable selection can significantly influence the performance of machine learning tools. Random Forest and Support Vector Machines were re-evaluated under progressively higher minimum thresholds, which continuously reduced the number and intensity of the peaks retained. The Support Vector Machines classification accuracy was expected to increase with more stringent electrophoretic thresholds, because presumed “irrelevant” variables (i.e., low intensity peaks) would be removed. A previous study found that using the “top (highest peak intensity) 40 peaks” of a bacterial profile generated with universal primers was as effective in discriminating soil samples as using all of the electrophoretic peaks and similarity indices (27). Meyers and Foran (2008) determined that restricting the analysis to the top 40 peaks reduced the inclusion of small non-reproducible peaks that can arise from slight differences in the amount of DNA injected into the capillary (27). In the current study, the four-taxa profiles showed that as the minimum RFU threshold was raised, the majority of the peaks removed were archaea, fungi, and plant peaks, resulting in a decrease in classification accuracy. Increasing the threshold resulted in losing peaks that represented distinguishing taxa. Therefore, these “rare” peaks, which represent various members of the community, are important to the specific habitat and provide “uniqueness” to the sample, which is important in forensics and provenance studies. This implication was supported by the discriminatory amplicons from the SIMPER and Random Forest analyses (Fig. 5). These results show that as the spatial resolution increases from soil type to subplot level, all the LH-PCR amplicons are important in discriminating locations (Fig. 5).
A peak threshold between 1–5% was needed for all representative taxa to categorize an unknown sample to its approximate origin, demonstrating the significance of using four taxa to yield higher accuracy and discrimination between sites.
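The effect of raising the peak threshold can be illustrated with a small sketch. It assumes, for illustration only, that the threshold is applied as a fraction of the total fluorescence of a combined four-taxa profile; the amplicon lengths and RFU values are invented and the function names are hypothetical.

```python
def relative_abundance(profile):
    """Convert raw peak heights (RFU) to fractions of the profile total."""
    total = sum(profile.values())
    return {peak: rfu / total for peak, rfu in profile.items()}


def apply_threshold(profile, min_fraction):
    """Drop peaks below `min_fraction` and renormalise the remainder."""
    fractions = relative_abundance(profile)
    kept = {peak: frac for peak, frac in fractions.items() if frac >= min_fraction}
    kept_total = sum(kept.values())
    return {peak: frac / kept_total for peak, frac in kept.items()}


def taxa_present(profile):
    """Set of taxa that still have at least one peak in the profile."""
    return {taxon for taxon, _length in profile}


# Hypothetical profile keyed by (taxon, amplicon length in bp) -> RFU.
raw_profile = {
    ("bacteria", 340): 5000,
    ("bacteria", 352): 3000,
    ("fungi", 610): 400,
    ("archaea", 480): 150,
    ("plant", 95): 90,
}

for min_fraction in (0.01, 0.05):
    filtered = apply_threshold(raw_profile, min_fraction)
    print(min_fraction, sorted(taxa_present(filtered)))
```

With these made-up values, a 1% threshold keeps peaks from all four taxa, whereas a 5% threshold leaves only the bacterial peaks, which is the kind of information loss described above.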

Spatial scale is important to consider, as accuracy rates decrease at smaller scales. Soil type classification resulted in significantly higher accuracy than transect and subplot classification (p < 0.007). The smaller number of samples available at the finer spatial scales (i.e., soil type (N = 144–288), transect (N = 72), subplot (N = 12)) could also be a reason for the decrease in accuracy. The reduced sample size left fewer samples with which to train the algorithms to recognize the hidden patterns that are essential to classifying the samples correctly. Regardless, this study shows the power of bioinformatics to pinpoint the origin of a soil sample accurately in 67% of cases at the subplot level and in approximately 99% of cases at larger spatial levels (i.e., soil type, transect). The decrease in accuracy at the finer scale (subplot level) was not surprising for some transects, as SIMPER analysis illustrated that the within-transect variability was less than the variability between transects (Table 3). Multiple studies support these results and have also found that bacterial profiles within a habitat are more similar to each other than to those from other ecosystems (9, 15, 57–59). For example, in this study, site KS8 had a 71% within-site similarity compared to a 23% similarity between transects (Table 3). This can be attributed to more similar and homogeneous microbial flora and fauna within some transects. Similarly, Mummey and Stahl (2003) illustrated that homogeneous grasslands had a highly similar bacterial community and a lower within-site variability than shrublands (57).
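The within-site versus between-site comparison can be sketched with the Bray-Curtis similarity, the measure on which SIMPER is conventionally based. This is a simplified illustration and not the SIMPER procedure itself: it only averages pairwise similarities within and between groups. The four profiles and their abundances are invented, and the site names are reused only as group labels.

```python
from itertools import combinations


def bray_curtis_similarity(a, b):
    """Bray-Curtis similarity between two abundance profiles (dicts)."""
    peaks = set(a) | set(b)
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in peaks)
    return 2 * shared / (sum(a.values()) + sum(b.values()))


def mean_within_between(samples, groups):
    """Mean pairwise similarity within groups and between groups."""
    within, between = [], []
    for i, j in combinations(range(len(samples)), 2):
        similarity = bray_curtis_similarity(samples[i], samples[j])
        (within if groups[i] == groups[j] else between).append(similarity)
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical profiles keyed by amplicon length (bp) -> relative abundance.
samples = [
    {340: 60, 352: 40},
    {340: 50, 352: 50},
    {340: 10, 610: 90},
    {340: 20, 610: 80},
]
groups = ["KS8", "KS8", "CH", "CH"]

within, between = mean_within_between(samples, groups)
print(f"within = {within:.2f}, between = {between:.2f}")
```

When the within-group similarity is much higher than the between-group similarity, samples are easy to assign to a transect but hard to separate at the subplot level, which is consistent with the pattern reported for the more homogeneous transects.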

Local heterogeneity can be due to different soil properties and multiple environmental factors, such as mixed or unique plant species, sunlight penetration through the plant canopy, and different moisture content (58, 59). A meta-analysis by Shade et al. (2013) showed that the temporal dynamics of microbial communities can also depend on habitat type (60). Therefore, it is important to understand the temporal variability of soil microbial communities and how this variability compares among different soil types. Seasonal dissimilarities also varied between transects (Table 3); however, they did not alter the classification of the biotic profiles at different spatial scales. Overall, this study showed that stable profiles may allow comparison between evidence and a possible crime scene despite the time lapse between sample collections. Previous studies have also reported a large level of spatial and temporal variability within habitats; however, the variability did not have a substantially negative influence on the ability to categorize soils by habitat from samples collected throughout a one-year time frame (15, 29, 59).

This study provides a foundation for further studies, as it shows that soil communities have a hidden pattern and that bioinformatic approaches can be used for soil provenance. However, the authors acknowledge that before a robust tool can be applied in forensic applications, the uncertainty associated with any comparison and the parameters that significantly influence variability in profiles must be determined. Further studies are needed to address other important questions such as temporal variability (e.g., over a time span of several years). Spatial distribution and the sensitivity of the analysis method to detect differences in soil communities from similar soil types (i.e., chemical and physical properties) and at local scales (i.e., similar location) are important in a forensic context. Therefore, this study focused on multiple spatial scales across two seasons to address the sensitivity of the approach: from the large scale (i.e., soil type, with similar physical and chemical properties), to the transect (i.e., sites within 100 m in length), and down to the subplot (sites within 1 m²). The biotic analyses can be conducted with the DNA expertise and instrumentation already employed in many crime laboratories, which makes the approach easy to implement, and they require ≤ 250 mg of soil (26, 28). Further studies should be conducted to determine the sampling design, such as the number of samples collected and the distance between samples across different habitats, needed to utilize soil community profiling for intelligence-based forensic investigations and ultimately to establish a usable database for soil provenance.

Supplementary Material

Supp TableS1
Supp TableS2
Supp info1
Supp info2

Acknowledgments

The authors would like to acknowledge the Forensic DNA Profiling Facility within the International Forensic Research Institute at Florida International University. The authors thank Yanie Oliva and Ashley Diaz for assistance with soil sampling and data collection. D.K. Mills was supported for this work by the National Geospatial Intelligence Agency [HM1582‐09–1‐0011], N. Damaso was supported by the McNair Graduate Fellowship, and J. Mendel was supported by the National Institutes of Health/National Institute of General Medical Sciences [R25 GM061347]. The aforementioned funders had no role in the study design, data collection and analysis, decision to publish, or the preparation of the manuscript.

Footnotes

*

The content in this manuscript is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

*

Presented in part at the American Society for Microbiology Florida Branch Meeting, April 13, 2013, in Islamorada, FL; the International Forensic Research Institute Annual Symposium, April 30, 2014, in Miami, FL; and the USDA National Cooperative Soil Survey National Conference, June 8, 2015, in Duluth, MN.

References

  • 1. Fitzpatrick RW, Raven MD, Forrester ST. A systematic approach to soil forensics: criminal case studies involving transference from crime scene to forensic evidence. In: Ritz K, Dawson L, Miller D, editors. Criminal and environmental soil forensics. Berlin, Germany: Springer; 2009. pp. 105–27.
  • 2. Sugita R, Marumo Y. Validity of color examination for forensic soil identification. Forensic Sci Int. 1996 Dec;83(3):201–10.
  • 3. Petraco N, Kubic TA, Petraco NDK. Case studies in forensic soil examinations. Forensic Sci Int. 2008 Jul;178(2–3):e23–e27. doi: 10.1016/j.forsciint.2008.03.008.
  • 4. Murray RC. Forensic examination of soils. In: Kobilinsky L, editor. Forensic chemistry handbook. Hoboken, NJ: John Wiley & Sons; 2012. pp. 109–30.
  • 5. Pye K, Blott SJ. Development of a searchable major and trace element database for use in forensic soil comparisons. Sci Justice. 2009 Sep;49(3):170–81. doi: 10.1016/j.scijus.2009.02.007.
  • 6. Jantzi SC, Almirall JR. Characterization and forensic analysis of soil samples using laser-induced breakdown spectroscopy (LIBS). Anal Bioanal Chem. 2011 Jul;400(10):3341–51. doi: 10.1007/s00216-011-4869-7.
  • 7. Horswell J, Cordiner SJ, Maas EW, Martin TM, Sutherland KBW, Speir TW, et al. Forensic comparison of soils by bacterial community DNA profiling. J Forensic Sci. 2002 Mar;47(2):350–3.
  • 8. Moreno LI, Mills DK, Entry J, Sautter RT, Mathee K. Microbial metagenome profiling using amplicon length heterogeneity-polymerase chain reaction proves more effective than elemental analysis in discriminating soil specimens. J Forensic Sci. 2006 Nov;51(6):1315–22. doi: 10.1111/j.1556-4029.2006.00264.x.
  • 9. Heath LE, Saunders VA. Assessing the potential of bacterial DNA profiling for forensic soil comparisons. J Forensic Sci. 2006 Sep;51(5):1062–8. doi: 10.1111/j.1556-4029.2006.00208.x.
  • 10. Macdonald LM, Singh BK, Thomas N, Brewer MJ, Campbell CD, Dawson LA. Microbial DNA profiling by multiplex terminal restriction fragment length polymorphism for forensic comparison of soil and the influence of sample condition. J Appl Microbiol. 2008 Apr;105(3):813–21. doi: 10.1111/j.1365-2672.2008.03819.x.
  • 11. Concheri G, Bertoldi D, Polone E, Otto S, Larcher R, Squartini A. Chemical elemental distribution and soil DNA fingerprints provide the critical evidence in murder case investigation. PloS One. 2011 Jun;6(6):e20222. doi: 10.1371/journal.pone.0020222.
  • 12. Quaak FCA, Kuiper I. Statistical data analysis of bacterial t-RFLP profiles in forensic soil comparisons. Forensic Sci Int. 2011 Jul;210(1):96–101. doi: 10.1016/j.forsciint.2011.02.005.
  • 13. Larson SA. Developing a high throughput protocol for using soil molecular biology as trace evidence [thesis]. Lincoln, NE: University of Nebraska-Lincoln; 2012.
  • 14. Khodakova AS, Smith RJ, Burgoyne L, Abarno D, Linacre A. Random whole metagenomic sequencing for forensic discrimination of soils. PloS One. 2014 Aug;9(8):e104996. doi: 10.1371/journal.pone.0104996.
  • 15. Young JM, Weyrich LS, Cooper A. Forensic soil DNA analysis using high-throughput sequencing: a comparison of four molecular markers. Forensic Sci Int Genet. 2014 Nov;13:176–84. doi: 10.1016/j.fsigen.2014.07.014.
  • 16. Young JM, Weyrich LS, Breen J, Macdonald LM, Cooper A. Predicting the origin of soil evidence: high throughput eukaryote sequencing and MIR spectroscopy applied to a crime scene scenario. Forensic Sci Int. 2015 Jun;251:22–31. doi: 10.1016/j.forsciint.2015.03.008.
  • 17. Jesmok EM, Hopkins JM, Foran DR. Next-generation sequencing of the bacterial 16S rRNA gene for forensic soil comparison: a feasibility study. J Forensic Sci. 2016 May;61(3):607–17. doi: 10.1111/1556-4029.13049.
  • 18. Demanèche S, Schauser L, Dawson L, Franqueville L, Simonet P. Microbial soil community analyses for forensic science: application to a blind test. Forensic Sci Int. 2017 Jan;270:153–8. doi: 10.1016/j.forsciint.2016.12.004.
  • 19. Habtom H, Demanèche S, Dawson L, Azulay C, Matan O, Robe P, et al. Soil characterisation by bacterial community analysis for forensic applications: a quantitative comparison of environmental technologies. Forensic Sci Int Genet. 2017 Jan;26:21–9. doi: 10.1016/j.fsigen.2016.10.005.
  • 20. Maron P, Mougel C, Ranjard L. Soil microbial diversity: methodological strategy, spatial overview and functional interest. Comptes Rendus Biologies. 2011 May;334(5):403–11. doi: 10.1016/j.crvi.2010.12.003.
  • 21. Martiny JB, Eisen JA, Penn K, Allison SD, Horner-Devine MC. Drivers of bacterial beta-diversity depend on spatial scale. Proc Natl Acad Sci U S A. 2011 May;108(19):7850–4. doi: 10.1073/pnas.1016308108.
  • 22. Pasternak Z, Al-Ashhab A, Gatica J, Gafny R, Avraham S, Minz D, et al. Spatial and temporal biogeography of soil microbial communities in arid and semiarid regions. PLoS One. 2013 Jul;8(7):e69705. doi: 10.1371/journal.pone.0069705.
  • 23. Moreno LI, Mills D, Fetscher J, John-Williams K, Meadows-Jantz L, McCord B. The application of amplicon length heterogeneity PCR (LH-PCR) for monitoring the dynamics of soil microbial communities associated with cadaver decomposition. J Microbiol Methods. 2011 Mar;84(3):388–93. doi: 10.1016/j.mimet.2010.11.023.
  • 24. Giampaoli S, Berti A, Di Maggio R, Pilli E, Valentini A, Valeriani F, et al. The environmental biological signature: NGS profiling for forensic comparison of soils. Forensic Sci Int. 2014 Jul;240:41–7. doi: 10.1016/j.forsciint.2014.02.028.
  • 25. Mills DK, Entry JA, Gillevet PM, Mathee K. Assessing microbial community diversity using amplicon length heterogeneity polymerase chain reaction. Soil Sci Soc Am J. 2007 Apr;71(2):572–8.
  • 26. Doud M, Zeng E, Schneper L, Narasimhan G, Mathee K. Approaches to analyse dynamic microbial communities such as those seen in cystic fibrosis lung. Hum Genomics. 2009 Apr;3(3):246–56. doi: 10.1186/1479-7364-3-3-246.
  • 27. Meyers MS, Foran DR. Spatial and temporal influences on bacterial profiling of forensic soil samples. J Forensic Sci. 2008 May;53(3):652–60. doi: 10.1111/j.1556-4029.2008.00728.x.
  • 28. Sensabaugh GF. Microbial community profiling for the characterisation of soil evidence: forensic considerations. In: Ritz K, Dawson L, Miller D, editors. Criminal and environmental soil forensics. Berlin, Germany: Springer; 2009. pp. 49–60.
  • 29. Foran DR, Jesmok EM, Hopkins JM. Soil bacteria as trace evidence. In: Carter DO, Tomberlin JK, Benbow E, Metcalf JL, editors. Forensic microbiology. Chichester, U.K.: John Wiley & Sons; 2017. pp. 339–57.
  • 30. Yang C, Mills D, Mathee K, Wang Y, Jayachandran K, Sikaroodi M, et al. An ecoinformatics tool for microbial community studies: supervised classification of amplicon length heterogeneity (ALH) profiles of 16S rRNA. J Microbiol Methods. 2006 Apr;65(1):49–62. doi: 10.1016/j.mimet.2005.06.012.
  • 31. Tarca AL, Carey VJ, Chen X, Romero R, Drăghici S. Machine learning and its applications to biology. PLoS Comput Biol. 2007 Jun;3(6):e116. doi: 10.1371/journal.pcbi.0030116.
  • 32. Entry JA, Mills D, Mathee K, Jayachandran K, Sojka R, Narasimhan G. Influence of irrigated agriculture on soil microbial diversity. Applied Soil Ecol. 2008 Sep;40(1):146–54.
  • 33. Kampichler C, Wieland R, Calmé S, Weissenberger H, Arriaga-Weiss S. Classification in conservation biology: a comparison of five machine-learning methods. Ecol Inform. 2010 Nov;5(6):441–50.
  • 34. Young J, Austin J, Weyrich L. Soil DNA metabarcoding and high-throughput sequencing as a forensic tool: Considerations, potential limitations and recommendations. FEMS Microbiol Ecol. 2017 Feb;93(2). doi: 10.1093/femsec/fiw207.
  • 35. Noble CV, Drew RW, Slabaugh JD. Soil survey of Dade county area, Florida. Gainesville, FL: USDA NRCS; 1996.
  • 36. Duever M, Meeder J, Meeder L, McCollom J. The climate of south Florida and its role in shaping the everglades ecosystem. In: Davis SM, Ogden JC, editors. Everglades: the ecosystem and its restoration. Boca Raton, FL: St Lucie Press; 1994. pp. 225–48.
  • 37. Suzuki M, Rappe MS, Giovannoni SJ. Kinetic bias in estimates of coastal picoplankton community structure obtained by measurements of small-subunit rRNA gene PCR amplicon length heterogeneity. Appl Environ Microbiol. 1998 Nov;64(11):4522–9. doi: 10.1128/aem.64.11.4522-4529.1998.
  • 38. Bai Y, Sun Q, Wen D, Tang X. Abundance of ammonia-oxidizing bacteria and archaea in industrial and domestic wastewater treatment systems. FEMS Microbiol Ecol. 2012 May;80(2):323–30. doi: 10.1111/j.1574-6941.2012.01296.x.
  • 39. White TJ, Bruns T, Lee S, Taylor J. Amplification and direct sequencing of fungal ribosomal RNA genes for phylogenetics. In: Innis MA, Gelfand DH, Sninsky JJ, White TJ, editors. PCR protocols: a guide to methods and applications. San Diego, CA: Academic Press; 1990. pp. 315–22.
  • 40. Taberlet P, Gielly L, Pautou G, Bouvet J. Universal primers for amplification of three non-coding regions of chloroplast DNA. Plant Mol Biol. 1991 Nov;17(5):1105–19. doi: 10.1007/BF00037152.
  • 41. Dray S, Dufour AB. The ade4 package: Implementing the duality diagram for ecologists. J Stat Softw. 2007 Sep;22(4):1–20.
  • 42. Ripley BD. Pattern recognition and neural networks. Cambridge, U.K.: Cambridge University Press; 2007.
  • 43. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. Boca Raton, FL: CRC Press; 1984.
  • 44. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002 Dec;2(3):18–22.
  • 45. Günther F, Fritsch S. Neuralnet: Training of neural networks. R Journal. 2010 Jun;2(1):30–8.
  • 46. Dimitriadou E, Hornik K, Leisch F, Meyer D, Weingessel A. Misc functions of the department of statistics (e1071), TU Wien. 2008;1:5–24. R Package Version.
  • 47. Hand DJ, Till RJ. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learning. 2001 Nov;45(2):171–86.
  • 48. Mori AS, Ota AT, Fujii S, Seino T, Kabeya D, Okamoto T, et al. Biotic homogenization and differentiation of soil faunal communities in the production forest landscape: taxonomic and functional perspectives. Oecologia. 2015 Feb;177(2):533–44. doi: 10.1007/s00442-014-3111-7.
  • 49. Ribeiro-Neto JD, Arnan X, Tabarelli M, Leal IR. Chronic anthropogenic disturbance causes homogenization of plant and ant communities in the brazilian caatinga. Biodivers Conserv. 2016 May;25(5):943–56.
  • 50. Rito KF, Tabarelli M, Leal IR. Euphorbiaceae responses to chronic anthropogenic disturbances in caatinga vegetation: from species proliferation to biotic homogenization. Plant Ecol. 2017 Jun;218(6):749–59.
  • 51. Tan AC, Gilbert D. An empirical comparison of supervised machine learning techniques in bioinformatics. In: Proceedings of the First Asia-Pacific Bioinformatics Conference on Bioinformatics, Vol. 19; 2003 Feb 4–7; Adelaide, Australia. Darlinghurst, Australia: Australian Computer Society; 2003. pp. 219–22.
  • 52. Caruana R, Niculescu-Mizil A. An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning (ICML 2006); 2006 Jun 25–29; Pittsburgh, PA. New York, NY: ACM; 2006. pp. 161–8.
  • 53. Amancio DR, Comin CH, Casanova D, Travieso G, Bruno OM, Rodrigues FA, et al. A systematic comparison of supervised classifiers. PloS One. 2014 Apr;9(4):e94137. doi: 10.1371/journal.pone.0094137.
  • 54. Cracknell MJ, Reading AM. Geological mapping using remote sensing data: a comparison of five machine learning algorithms, their response to variations in the spatial distribution of training data and the use of explicit spatial information. Comput Geosci. 2014 Feb;63:22–33.
  • 55. Beck D, Foster JA. Machine learning techniques accurately classify microbial communities by bacterial vaginosis characteristics. PLoS One. 2014 Feb 3;9(2):e87830. doi: 10.1371/journal.pone.0087830.
  • 56. Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014 Dec;40(1):16–28.
  • 57. Mummey DL, Stahl PD. Spatial and temporal variability of bacterial 16S rDNA-based T-RFLP patterns derived from soil of two Wyoming grassland ecosystems. FEMS Microbiol Ecol. 2003 Aug;46(1):113–20. doi: 10.1016/S0168-6496(03)00208-3.
  • 58. Franklin RB, Mills AL. Multi-scale variation in spatial heterogeneity for microbial community structure in an eastern Virginia agricultural field. FEMS Microbiol Ecol. 2003 Jun 1;44(3):335–46. doi: 10.1016/S0168-6496(03)00074-6.
  • 59. Lenz EJ, Foran DR. Bacterial profiling of soil using genus-specific markers and multidimensional scaling. J Forensic Sci. 2010 Nov;55(6):1437–42. doi: 10.1111/j.1556-4029.2010.01464.x.
  • 60. Shade A, Caporaso JG, Handelsman J, Knight R, Fierer N. A meta-analysis of changes in bacterial and archaeal communities with time. ISME J. 2013 Aug;7(8):1493–506. doi: 10.1038/ismej.2013.54.
