TABLE 6.
Glossary of terms.
Alpha diversity. A measure that quantifies the species diversity in a given sample. It can be calculated by several methods including richness (i.e. the number of unique species) as well as the Shannon index which relies on the relative abundance of unique species. | Marker gene sequencing. Primer-based strategy (such as 16S rRNA) that targets a specific region of a gene of interest to characterize microbial phylogenies of a sample. |
Beta diversity. A measure that quantifies the difference between species abundances across samples. It can be calculated by several methods including the Jaccard index (i.e. the ratio of shared to total unique species in a pair of samples) as well as the weighted Jaccard index, which also considers the number of times each species is observed. | Multiple-hypothesis testing. A problem that arises when tests of statistical significance are applied multiple times to different hypotheses, which increases the chance of false positives. |
Classification. A type of supervised learning problem where the dependent variables are categorical. | Overfitting. A problem that arises in machine learning where the parameter values of a model are fit too closely to the training data, so the model performs poorly on new data and is not useful in practice. |
Cluster analysis. Unsupervised learning methodology to identify groups of similar datapoints automatically. | Rarefaction. A bias correction technique used to enable comparison of diversity measures between communities with unequal sample sizes. |
Collaborative filtering. Recommendation system methodology which relies on similarities amongst user preferences for new recommendations. | Recommendation system. “Any system that guides a user in a personalized way to interesting or useful objects in a large space of possible options or that produces such objects as output.” (Burke, 2002) |
| Regression. Supervised learning tasks in which the dependent variables are numerical. |
Compositional quantities. Dataset attributes whose absolute quantities are meaningful only relative to one another within a sample and cannot be compared directly across different samples. | Regularization. Machine learning technique that dampens the variability of model parameters, leading to a less complex model. It is usually used to mitigate overfitting. |
Content-based filtering. Recommendation system methodology in which recommendations are made based on the features for both items and users. | Stability metric. A quantitative measure to assess whether properties of a community (e.g., gut microbes) are preserved over time. |
Curse of dimensionality. A set of challenges, such as the need for exponentially more samples to train a model and increased computational complexity, that appear when the dimensionality of the data or model increases. | Supervised learning. Learning tasks that require labeled data. They involve learning a function to predict the correct label for a new sample given input attributes. |
Data imputation. Substitution of missing values in a given dataset. | Unsupervised learning. Learning tasks that do not rely on labeled data. They involve learning hidden structures, features, or patterns within the data. |
Diversity metric. Quantitative measure that represents the number of unique entity types (e.g., species) in a community and the evenness of their relative abundances. | Variation analysis. Statistical methods, such as analysis of variance (ANOVA), used to identify the amount of variance in a dependent variable that can be explained using independent variables. |
Dimensionality. Number of attributes available for each sample in a given dataset. A dataset with relatively few attributes is considered low-dimensional while a dataset with many attributes is referred to as high-dimensional. | |
Labeled/unlabeled samples. Samples that have been tagged using particular labels describing the value of a dependent variable are called labeled. This is in contrast to unlabeled samples for which such labels are unavailable. Note that labels can be categorical or numerical. | Whole metagenomic sequencing. A sequencing strategy that targets the whole genome of all microbial species within a sample. This is also called shotgun metagenomics. |
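The alpha and beta diversity entries above can be made concrete with a small sketch. The Python below is illustrative only: the function names and the toy samples are hypothetical, each sample is assumed to be a non-empty mapping from species name to observed count, and the weighted Jaccard index is written in one common min/max form, which is an assumption rather than a formula given in this glossary.

```python
import math


def richness(counts):
    """Alpha diversity as richness: number of species with a nonzero count."""
    return sum(1 for c in counts.values() if c > 0)


def shannon_index(counts):
    """Alpha diversity as the Shannon index, H = -sum(p_i * ln(p_i)),
    where p_i is the relative abundance of species i."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log(c / total) for c in counts.values() if c > 0
    )


def jaccard_index(a, b):
    """Ratio of shared species to total unique species in a pair of samples."""
    present_a = {s for s, c in a.items() if c > 0}
    present_b = {s for s, c in b.items() if c > 0}
    return len(present_a & present_b) / len(present_a | present_b)


def weighted_jaccard_index(a, b):
    """Weighted variant (assumed min/max form): sum of per-species minimum
    counts divided by the sum of per-species maximum counts."""
    species = set(a) | set(b)
    shared = sum(min(a.get(s, 0), b.get(s, 0)) for s in species)
    total = sum(max(a.get(s, 0), b.get(s, 0)) for s in species)
    return shared / total


# Hypothetical toy samples: species name -> number of observations.
sample_a = {"A": 4, "B": 4, "C": 0}
sample_b = {"A": 2, "B": 0, "D": 6}

print(richness(sample_a))
print(shannon_index(sample_a))
print(jaccard_index(sample_a, sample_b))
print(weighted_jaccard_index(sample_a, sample_b))
```

In this toy case the two samples share one species out of three observed overall, so the unweighted index depends only on presence and absence, while the weighted form also reflects how unevenly the shared species is counted in each sample.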