2020 Apr 3;11:393. doi: 10.3389/fmicb.2020.00393

TABLE 3.

Key challenges that arise in microbiome data analysis with examples and suggested solutions.

Challenges in microbiome data analysis Examples and solutions
(1) Compositional quantities: Metagenomic data processing provides read counts for discovered entities such as genes, species, and OTUs in a given sample. These read counts are only meaningful relative to the other counts within the same sample. Example: Metagenomic analysis of fecal samples tells us that person A has 5 reads mapped to the bacterium Escherichia coli, while person B has 10. Can we conclude that this bacterium is more abundant in the gut of person B than in that of person A? Answer: No, raw read counts cannot be compared across samples, because the total number of reads (the sequencing depth) differs from sample to sample.
Solutions: (I) Convert read counts to relative abundances before comparison. (II) If an optimization problem is defined using read counts, add a constraint on the total count per sample.
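Solution (I) for challenge (1) can be illustrated with a minimal Python sketch. The function name `relative_abundances` and the per-sample totals are assumptions made for illustration; the table only gives the 5 and 10 reads mapped to Escherichia coli.

```python
def relative_abundances(read_counts):
    """Convert one sample's raw read counts to fractions that sum to 1."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {taxon: count / total for taxon, count in read_counts.items()}


# Hypothetical totals: person B was sequenced far more deeply than person A.
person_a = {"Escherichia coli": 5, "other taxa": 95}
person_b = {"Escherichia coli": 10, "other taxa": 990}

rel_a = relative_abundances(person_a)
rel_b = relative_abundances(person_b)

print(rel_a["Escherichia coli"])  # 0.05
print(rel_b["Escherichia coli"])  # 0.01
```

Under these assumed totals, person B has twice as many E. coli reads but a five-fold lower relative abundance, which is why the raw counts alone cannot answer the question.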
(2) High dimensionality: Metagenomic data processing discovers many entities such as genes and species in each sample, and these may not be shared across samples. During data aggregation, one dimension is assigned to each entity, so the number of dimensions becomes high compared to the number of samples. Example: Metagenomic data processing of fecal samples from 20 individuals results in relative abundances for 10 microbial families per sample. Can we use classical linear regression to predict an individual’s age using relative abundances from the aggregated data? Answer: No. Because the samples do not all contain the same families, aggregating 20 samples can yield more than 20 distinct microbial families, and with more predictors than samples classical linear regression has no unique solution.
Solutions: (I) Use dimensionality reduction such as PCA prior to regression. (II) Use regularized linear regression such as Lasso. (III) Use microbial abundances at higher taxonomic ranks such as phylum instead of family.
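Solution (III) for challenge (2) amounts to summing abundances over a taxonomy lookup. The sketch below assumes a small hand-written family-to-phylum mapping and a hypothetical sample; a real analysis would take the mapping from the taxonomy assigned during data processing.

```python
from collections import defaultdict

# Small illustrative lookup covering a handful of gut bacterial families.
FAMILY_TO_PHYLUM = {
    "Bacteroidaceae": "Bacteroidetes",
    "Prevotellaceae": "Bacteroidetes",
    "Lachnospiraceae": "Firmicutes",
    "Ruminococcaceae": "Firmicutes",
    "Enterobacteriaceae": "Proteobacteria",
}


def collapse_to_phylum(family_abundances, lookup=FAMILY_TO_PHYLUM):
    """Sum family-level relative abundances into phylum-level abundances."""
    phylum_abundances = defaultdict(float)
    for family, abundance in family_abundances.items():
        # Families missing from the lookup are pooled as "Unclassified".
        phylum_abundances[lookup.get(family, "Unclassified")] += abundance
    return dict(phylum_abundances)


# Hypothetical family-level profile of one sample.
sample = {
    "Bacteroidaceae": 0.30,
    "Prevotellaceae": 0.10,
    "Lachnospiraceae": 0.25,
    "Ruminococcaceae": 0.20,
    "Enterobacteriaceae": 0.15,
}

phylum_profile = collapse_to_phylum(sample)
print(len(sample), "family features ->", len(phylum_profile), "phylum features")
```

Collapsing reduces the number of predictors at the cost of resolution, so it trades detail for a regression problem that has fewer features than samples.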
(3) Multiple hypotheses: The high-dimensional nature of metagenomic data allows the researcher to generate a large number of hypotheses, which leads to seeing patterns that occur simply due to random chance. This is sometimes called “the high probability of low probability events.” Example: Metagenomic data processing provides relative microbial abundances at the species level from fecal samples of 200 individuals, half of whom are diagnosed with Crohn’s disease while the rest are healthy. Performing a t-test for each species identifies 40 species (among 1,000) whose relative abundances differ significantly between the microbiota of sick and healthy individuals (p-value < 0.05). Is this result reliable? Answer: No, the standard p-value threshold of 0.05 is only acceptable when a single hypothesis is tested, whereas here the t-test is performed 1,000 times. At that threshold, about 50 species (0.05 × 1,000) would be expected to pass by chance alone even if no species truly differed, so many of the 40 may be false discoveries.
Solution: Calculate FDR-adjusted p-values (i.e., q-values) and apply the 0.05 threshold to them to control the false discovery rate.
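The table does not name a specific FDR procedure. As one common choice, the Benjamini-Hochberg adjustment can be sketched in plain Python as follows; the function name and the eight p-values are illustrative assumptions, not data from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each q-value is the minimum of
    # p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values from eight per-species tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
q_values = benjamini_hochberg(p_values)

raw_hits = [i for i, p in enumerate(p_values) if p < 0.05]
fdr_hits = [i for i, q in enumerate(q_values) if q < 0.05]
print(len(raw_hits), "tests pass p < 0.05;", len(fdr_hits), "pass q < 0.05")
```

In this toy input, five tests pass the unadjusted threshold but only the two smallest p-values remain below 0.05 after adjustment.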
(4) Hierarchical relationships: Assumptions of independence do not hold in microbiome data, since taxonomic variables (e.g., species and OTUs) have known hierarchical relationships due to genetic and phenotypic similarities. Therefore, common statistical techniques that assume independence between variables are problematic. Example: Beta-diversity can be used to quantify how much microbiome samples differ from one another. Can we simply calculate beta-diversity as the standard Euclidean distance between relative abundances at a given taxonomic rank? Answer: No, Euclidean distance doesn’t take into account the phylogenetic similarity between species.
Solution: Use a phylogeny-aware metric such as UniFrac distance instead, which takes the phylogenetic tree into account when calculating distances.
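The contrast between Euclidean and UniFrac distance can be seen on a toy tree. The sketch below implements unweighted UniFrac as the fraction of observed branch length that leads to only one of the two samples. The four-taxon tree `((A:1,B:1):2,(C:1,D:1):2)`, its encoding as (branch length, descendant leaves) pairs, and the three samples are all assumptions chosen for illustration.

```python
import math

# Hypothetical tree ((A:1,B:1):2,(C:1,D:1):2), encoded as one
# (branch_length, leaves_below_branch) pair per branch.
BRANCHES = [
    (1.0, {"A"}),
    (1.0, {"B"}),
    (2.0, {"A", "B"}),
    (1.0, {"C"}),
    (1.0, {"D"}),
    (2.0, {"C", "D"}),
]


def unweighted_unifrac(branches, taxa_1, taxa_2):
    """Unique branch length divided by total observed branch length."""
    unique = 0.0
    observed = 0.0
    for length, leaves in branches:
        in_1 = bool(leaves & taxa_1)
        in_2 = bool(leaves & taxa_2)
        if in_1 or in_2:
            observed += length
            if in_1 != in_2:  # branch leads to only one of the two samples
                unique += length
    return unique / observed if observed else 0.0


# Three single-taxon samples; A and B are sister taxa, C is distant from both.
s1, s2, s3 = {"A"}, {"B"}, {"C"}

# Relative-abundance vectors over the taxa (A, B, C, D) for the same samples.
v1, v2, v3 = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]

print(math.dist(v1, v2) == math.dist(v1, v3))   # True: Euclidean sees no difference
print(unweighted_unifrac(BRANCHES, s1, s2))     # 0.5
print(unweighted_unifrac(BRANCHES, s1, s3))     # 1.0
```

Euclidean distance rates both pairs as equally far apart, whereas the tree-based measure places the sample containing the sister taxon B closer to A than the sample containing the distant taxon C.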
(5) Missing quantities: Metagenomic data often lack direct information about the functions of the microbial communities, which can only be estimated using meta-transcriptomics or meta-proteomics. However, deciphering the microbiota’s function is a major goal in microbiome studies. Example: Metagenomic data processing of marker-gene data has provided us with relative abundances at the genus level, but we do not know the possible functions of the microbiota in terms of the proteins that it can produce. Should we abandon further analysis? Answer: No, although we don’t have direct information about proteins, we can infer them.
Solution: Reference databases such as Greengenes map marker-gene sequences to identified taxa at various taxonomic ranks; combined with the sequenced genomes of those taxa, they can be used for gene and protein inference.
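The inference step for challenge (5) can be pictured as weighting each taxon's known gene content by its abundance. The sketch below is a deliberately simplified illustration: the genus names, gene names, and copy numbers are placeholders rather than real reference data, and corrections that dedicated tools may apply (for example, for marker-gene copy number) are omitted.

```python
def predict_gene_abundances(genus_abundances, gene_copies_per_genus):
    """Sum, over genera, relative abundance times gene copy number."""
    predicted = {}
    for genus, abundance in genus_abundances.items():
        # Genera without a reference genome contribute nothing.
        for gene, copies in gene_copies_per_genus.get(genus, {}).items():
            predicted[gene] = predicted.get(gene, 0.0) + abundance * copies
    return predicted


# Placeholder genus-level profile from marker-gene data.
genus_abundances = {"genus_X": 0.6, "genus_Y": 0.4}

# Placeholder gene content, as would be read from reference genomes.
gene_copies_per_genus = {
    "genus_X": {"gene_1": 2, "gene_2": 0},
    "genus_Y": {"gene_1": 1, "gene_2": 3},
}

predicted = predict_gene_abundances(genus_abundances, gene_copies_per_genus)
print(predicted)
```

The result is a predicted functional profile, not a measurement; it reflects what the identified genera could encode according to the reference genomes.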