Nat Plants. 2021 Oct 14;7(10):1330–1334. doi: 10.1038/s41477-021-01003-y

Cannabis labelling is associated with genetic variation in terpene synthase genes

Sophie Watts 1, Michel McElroy 1, Zoë Migicovsky 1, Hugo Maassen 2, Robin van Velzen 2,3, Sean Myles 1
PMCID: PMC8516649  PMID: 34650264

Abstract

Analysis of over 100 Cannabis samples quantified for terpene and cannabinoid content and genotyped for over 100,000 single nucleotide polymorphisms indicated that Sativa- and Indica-labelled samples were genetically indistinct on a genome-wide scale. Instead, we found that Cannabis labelling was associated with variation in a small number of terpenes whose concentrations are controlled by genetic variation at tandem arrays of terpene synthase genes.

Subject terms: Genomics, Natural variation in plants


By quantifying over 100 Cannabis samples for terpene and cannabinoid content and genotyping them for over 100,000 single nucleotide polymorphisms, this study finds that Cannabis labelling is associated with genetic variants in terpene synthase genes.

Main

Cannabis has been consumed for its psychoactive properties for over 2,500 years, and its estimated global market value is US$340 billion1–3. Because it is a widely used drug that is increasingly being legalized for medicinal and recreational use, it is critical that Cannabis’s genetic and chemical variation be accurately quantified and communicated. The vernacular labels Sativa and Indica (not to be confused with the taxonomic names C. sativa sativa L. and C. sativa indica Lam.) are routinely assigned to Cannabis cultivars by breeders, retailers and users to describe a cultivar’s morphology, aromas and/or psychoactive effects4. However, it is unclear whether these labels capture meaningful information about Cannabis genetic and chemical variation.

Cannabis genomics research has thus far largely focused on the characterization of genes underlying the production of the cannabinoids cannabidiol (CBD) and tetrahydrocannabinol (THC)5–8. However, Cannabis produces hundreds of aromatic terpenes that drive consumer preference and are frequently associated with Sativa and Indica labels4,9. In addition, there is evidence to suggest that a cultivar’s terpene profile affects its psychoactive properties10,11. To date, various terpene synthase genes have been identified in Cannabis; however, the genetic control of terpene variation across Cannabis cultivars remains largely unexplored12–15.

Here we re-analysed 297 samples of drug-type Cannabis that were previously quantified for 40 terpenes and cannabinoids using gas chromatography–mass spectrometry (GC–MS)16 (Supplementary Table 1 and Extended Data Fig. 1), and we paired these data with 116,296 newly generated single nucleotide polymorphisms (SNPs) from 137 of these samples from which sufficient high-quality DNA could be extracted. We determined the degree to which the genomic and GC–MS data corresponded to a five-point labelling scale ranging from 1 (100% Sativa) to 5 (100% Indica) as reported by sample sources.

Extended Data Fig. 1. CBD vs THC content.

Plot of percent CBD versus percent THC content.

Principal component analysis (PCA) of the genomic data showed no clear clustering according to sample labels (Fig. 1a). Even though PC1 and PC2 were significantly correlated with the Sativa–Indica scale, the variance explained by the primary PCs was low (PC1: R² = 0.12, P = 2.1 × 10⁻⁵; PC2: R² = 0.12, P = 1.8 × 10⁻⁵). Furthermore, the overall genetic structure (captured by including the first ten PCs of the genomic data in a linear model) explained only 37% of the variance in labelling (Fig. 1c). Sativa–Indica labels thus do not accurately reflect genetic relatedness, which is consistent with previous work17,18. In addition, we determined that pairs of samples with identical cultivar names (for example, OG Kush) were often as genetically and chemically distant from each other as pairs of samples with different names (Extended Data Fig. 2). This is consistent with previous studies indicating that cultivar names were not reliable indicators of a sample’s genetic or chemical identity17,19–21.

Fig. 1. PCA.

a, Genome-wide SNP data. b, Terpenes and cannabinoids. Each dot represents a Cannabis sample and is coloured by the labelling scale ranging from 100% Sativa to 100% Indica. c, The percent variance explained by PCs from the genome-wide SNP data (blue), from the terpene and cannabinoid data (green) and from both the genetic and chemical data (purple). The y axis shows the percent variance explained as PCs are added to linear models where the Sativa–Indica labelling scale is the dependent variable.

Extended Data Fig. 2. Pairwise chemical and genetic distances.

Histograms of a) pairwise chemical distances and b) pairwise genetic distances among all pairs of samples. Vertical lines indicate the median distance between pairs of samples with the same name.

Similar to the PCA of the genome-wide SNP data, the PCA of the terpene and cannabinoid profiles provided poor separation of samples according to their Sativa–Indica labels (Fig. 1b). Nevertheless, we observed significant correlations between the first two PCs and the Sativa–Indica scale (PC1: R² = 0.049, P = 7.5 × 10⁻⁵; PC2: R² = 0.24, P = 3.7 × 10⁻¹⁹). Including the first ten PCs from the terpene and cannabinoid profiles in a linear model accounted for only 41% of the variance in labelling (Fig. 1c). The pairwise genetic and chemical relatedness matrices were correlated (Mantel r = 0.21, P = 1 × 10⁻³, Extended Data Fig. 3), and a linear model including the first ten PCs from both the genomic and chemical profiles captured only 41% (Fig. 1c; P = 3.1 × 10⁻¹⁰) of the variance in labelling. Since the overall patterns of genetic and chemical relatedness could not fully account for the labels applied to Cannabis samples, we aimed to determine which individual chemicals were the strongest predictors of Sativa–Indica labelling.

Extended Data Fig. 3. Correlation of chemical and genetic pairwise distances.

Plot of genetic distance versus chemical distance between pairs of samples. The Mantel r statistic and p-value are reported.

Of the 40 measured terpenes and cannabinoids, 12 (30%) were correlated with the Sativa–Indica scale at P < 0.01 (Fig. 2a and Supplementary Fig. 1). Sativa content was positively correlated with the concentrations of bergamotene (R² = 0.12, P = 9.26 × 10⁻⁸) and farnesene (R² = 0.11, P = 1.09 × 10⁻⁷), which impart tea-like and fruity aromas, respectively22,23. This is consistent with descriptions of Sativa cultivars as having a ‘sweet’ or ‘herbal’ aroma4,9. The strongest correlation was between Indica content and myrcene, whose concentration explained 21.2% of the variation in labelling (P = 2.29 × 10⁻¹⁵; Fig. 2a). The sedative effect and earthy aroma attributed to high myrcene content are often reported by recreational users to be characteristic of Indica cultivars10,24–26. We also observed significant positive correlations between Indica labelling and three sesquiterpenes: guaiol (R² = 0.18, P = 7.7 × 10⁻¹³), γ-eudesmol (R² = 0.11, P = 3.8 × 10⁻⁷) and β-eudesmol (R² = 0.21, P = 8.2 × 10⁻¹⁵). Hillig27 found that these three sesquiterpenes were associated with plants from Afghanistan, which is considered the region of origin for Indica cultivars.

Fig. 2. The genetic control of terpenes underlying Cannabis labelling.

a, The percent variance of the five-point Sativa–Indica labelling scale that is explained by terpene and cannabinoid concentrations from Pearson correlations. The P values were Bonferroni-adjusted for multiple comparisons. The asterisks denote chemicals with tentative identifications. GWAS results are shown for chemicals highlighted in grey. b,c, Manhattan plots of mixed linear model (MLM) GWAS for myrcene on chromosome 5 (b) and for guaiol, γ-eudesmol and β-eudesmol on chromosome 6 (c). The significance thresholds from the MLM are shown as horizontal dashed lines. Significant SNPs from the MLMM GWAS are red. Terpene synthase gene clusters are green. Below the Manhattan plots are heat maps of the pairwise LD (R²) between pairs of SNPs that appear in the Manhattan plots.

Previous chemical analyses of Cannabis have suggested that the distinction between Sativa and Indica is best explained by differences in the concentrations of specific monoterpenes and sesquiterpenes19,28–30. In addition, the contrasting aromas that have been associated with Sativa (that is, sweet) and Indica (that is, earthy) were key discriminators in a sensory evaluation of Cannabis cultivars and mediated customers’ perceptions of potency and quality9. As a previous study suggested31, we hypothesize that Cannabis growers and breeders have been assigning labels to cultivars primarily on the basis of aroma profiles and purported effects, rather than genetic ancestry or overall chemical similarity. The primary differences between cultivars labelled as Sativa and Indica may thus be driven by a small set of genomic regions controlling the concentrations of a small number of contrasting aromas. To examine this, we conducted a genome-wide association study (GWAS) of the 40 chemicals examined here (Supplementary Fig. 2 and Supplementary Table 2).

We identified three regions of the Cannabis genome associated with the four terpenes most strongly associated with Sativa–Indica labelling (Fig. 2). The optimal model from the multilocus mixed linear model (MLMM) GWAS for myrcene identified two significantly associated SNPs 1.2 megabases apart that tag independent blocks of linkage disequilibrium (LD) on the proximal end of chromosome 5 (Fig. 2b). The first SNP (chr5:1348048) is located 6.4 kilobases (kb) from a block of terpene synthase genes composed of four copies of TPS30, which is known to encode myrcene synthase12 (Supplementary Table 3). The second SNP (chr5:2576403) is 46.7 kb from another tandem array of terpene synthase genes spanning ~200 kb (Supplementary Table 3). Within this gene cluster are two sequences highly similar to the myrcene synthase gene, TPS3 (refs. 12,13). These observations suggest that myrcene synthesis is mediated by genetic variants at two independent terpene synthase gene clusters on chromosome 5. The other three sesquiterpenes (guaiol, β-eudesmol and γ-eudesmol) strongly associated with Sativa–Indica labelling are correlated with each other (Extended Data Fig. 4) and share a common GWAS hit on chromosome 6: the single SNP identified from the MLMM (chr6:76790611) is 51.9 kb from a gene cluster comprising sesquiterpene synthase genes related to TPS7FN (δ-selinene synthase), TPS8FN (γ-eudesmol/valencene synthase)12 and TPS20CT13 (hedycaryol synthase) (Fig. 2c and Supplementary Table 3).

Extended Data Fig. 4. Chemical correlation heatmap.

Heatmap displaying the Pearson correlation between the concentrations of the 40 terpenes and cannabinoids.

Our results demonstrate that the Sativa–Indica scale currently used to label Cannabis poorly captures overall genomic and metabolomic variation. Cannabis labelling is instead probably driven primarily by a small number of key terpenes whose concentrations contribute to the characteristic aromas commonly associated with Sativa and Indica and whose variation we genetically mapped to tandem arrays of terpene synthase genes on chromosomes 5 and 6. While the vernacular labels ‘Sativa’ and ‘Indica’ are derived from taxonomic names that were originally used to categorize plants according to ancestry4, these terms have been co-opted by contemporary Cannabis culture and now probably reflect locus-specific genetic variation affecting terpene synthesis. Our results suggest that a practical and reliable classification system for Cannabis that is consistent with contemporary understanding of the terms ‘Sativa’ and ‘Indica’ may be achievable by quantifying a small number of terpenes and/or genotyping genetic markers associated with key Cannabis aromas.

Methods

Samples

The samples come from a previous study of 460 Cannabis chemotypes16. The samples were collected from Bedrocan International BV (n = 37), HempFlax (n = 205) and Dutch ‘coffee shops’ either directly or indirectly through the TRIMBOS Institute (n = 55). Samples labelled as ‘Hemp’ were excluded from the analysis. We retained and analysed 297 samples that were classified along a five-point scale according to ancestries reported by the sources: ‘Sativa’ (100% Sativa), ‘Hybrid-Sativa’ (75% Sativa, 25% Indica), ‘Hybrid’ (50% Sativa, 50% Indica), ‘Hybrid-Indica’ (25% Sativa, 75% Indica) and ‘Indica’ (100% Indica). These five groups were encoded as 1 (100% Sativa) to 5 (100% Indica) for the statistical analyses described below.
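The label encoding described above can be sketched in a few lines of Python. This is only an illustration: the names `LABEL_SCALE`, `encode_labels` and `percent_sativa` are hypothetical and do not come from the study's code, which was written in R.

```python
# Hypothetical mapping from the source-reported label to the 1-to-5 scale.
LABEL_SCALE = {
    "Sativa": 1,         # 100% Sativa
    "Hybrid-Sativa": 2,  # 75% Sativa, 25% Indica
    "Hybrid": 3,         # 50% Sativa, 50% Indica
    "Hybrid-Indica": 4,  # 25% Sativa, 75% Indica
    "Indica": 5,         # 100% Indica
}


def encode_labels(labels):
    """Return the scale value for each label, skipping samples labelled 'Hemp'."""
    return [LABEL_SCALE[label] for label in labels if label != "Hemp"]


def percent_sativa(score):
    """Convert a 1-to-5 scale value back to the reported percent Sativa."""
    return 100 - 25 * (score - 1)


if __name__ == "__main__":
    print(encode_labels(["Sativa", "Hemp", "Hybrid-Indica"]))
    print(percent_sativa(2))
```

Treating the five categories as equally spaced integers is what allows the scale to be used as the dependent variable in the correlations and linear models described below.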

Gas chromatography

A total of 297 samples were previously quantified for terpene and cannabinoid content, and we re-analyse those data here. The chemical analyses of the samples are described in detail in ref. 16. Briefly, for each sample, 500 mg of ground homogenized dried flower material was mixed with 40 ml of ethanol, agitated for 10 minutes and centrifuged. The supernatant was collected, and the process was repeated twice more on the pellet. An internal standard consisting of 200 μl of 1% solution of 1-octanol was added to the combined supernatant, the volume was adjusted to 100 ml with ethanol and the combined sample was centrifuged again. The combined sample was analysed using an Agilent GC 6890 series (Agilent Technologies) equipped with a 7683 autosampler and a flame ionization detector. The instrument was equipped with a DB-5 column (length, 30 m; internal diameter, 0.25 mm; film thickness, 0.25 μm; J&W Scientific). Peaks from the sample chromatograms were manually integrated, and the peak area was recorded with correction for the internal standard peak area. Peak identification was conducted by analysing selected samples using GC–MS and then comparing compounds’ mass spectra and retention times with authentic standards and literature reports as described in ref. 16. Compounds without authentic standards are marked with an asterisk in the figures to indicate that they were tentative identifications. Peak areas of monoterpenes, sesquiterpenes and cannabinoids were quantified (in mg per g of plant material) using calibrated standards of β-pinene, α-humulene and CBD, respectively. We re-assessed the compound identifications in Hazekamp et al.16, and in certain cases we renamed compounds on the basis of the inability to distinguish stereoisomers using a DB-5 column. For example, in the case of the compound listed by Hazekamp et al.16 as ‘(−)-linalool’, we renamed this to ‘linalool’. Two compounds could not be reliably identified; they are listed as ‘unidentified compounds’ (Supplementary Table 3). THC, δ-8-THC and CBN were combined into a single value, ‘Total THC’, because δ-8-THC and CBN are degradation products of THC. Peaks of R-limonene and β-phellandrene were indistinguishable and were therefore combined into a single value and reported as ‘limonene’. Thymoquinone, geraniol, thymol and carvacrol were removed because they were not present in any samples, and cineol was removed because it was present in only one sample. Pearson correlations were calculated between each pair of chemicals using the cor.test function in R (v.3.5.1)32. According to previous work33, the samples analysed here were nearly all drug-type Cannabis (that is, type I) (Extended Data Fig. 1), except nine samples with THC > 0.3% and CBD > 0.5% (that is, type II).
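As a rough illustration of two of the data-cleaning steps above (summing THC with its degradation products, and dropping compounds detected in fewer than two samples), the following Python sketch operates on a list of per-sample dictionaries. The function name, the dictionary keys and the `min_samples_present` parameter are assumptions made for this example; they are not taken from the study's code.

```python
def combine_and_filter(samples, min_samples_present=2):
    """Clean a list of {compound name: concentration in mg/g} dictionaries.

    1. THC, delta-8-THC and CBN are summed into a single 'Total THC' value.
    2. Compounds detected (concentration > 0) in fewer than
       `min_samples_present` samples are dropped from every sample.
    """
    merged = []
    for sample in samples:
        row = dict(sample)  # copy so the caller's data are not modified
        row["Total THC"] = sum(
            row.pop(name, 0.0) for name in ("THC", "delta-8-THC", "CBN")
        )
        merged.append(row)

    compounds = sorted({name for row in merged for name in row})
    kept = [
        name
        for name in compounds
        if sum(1 for row in merged if row.get(name, 0.0) > 0) >= min_samples_present
    ]
    return [{name: row.get(name, 0.0) for name in kept} for row in merged]


if __name__ == "__main__":
    toy = [
        {"THC": 150.0, "delta-8-THC": 1.0, "CBN": 2.0, "myrcene": 3.0, "cineol": 0.1},
        {"THC": 100.0, "CBN": 1.0, "myrcene": 1.5, "cineol": 0.0},
    ]
    for row in combine_and_filter(toy):
        print(row)
```

With the default threshold of two samples, a compound present in only one sample (as cineol was) is removed along with compounds present in none.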

Genomic analysis

Whole-genome DNA was extracted using a NucleoSpin 96 Plant II kit (Macherey-Nagel) and quantified using the QuantiFluor dsDNA System and the GloMax-Multi+ Microplate Multimode Reader with Instinct (Promega). Genotyping-by-sequencing libraries were prepared using the restriction enzyme ApeKI34, and the libraries were sequenced on two lanes of an Illumina HiSeq 4000 (Illumina). The DNA sequence data are available as NCBI BioProject PRJNA713792. Calling of SNPs was performed in TASSEL (v.5.0)35 by aligning to the CBDRx reference genome8. SNP calling was performed before the implementation of the new chromosome numbering of the CBDRx genome in April 2020. Chromosomes were recoded for analyses to reflect the new chromosome numbering system. We used VCFtools (v.0.1.15)36 to retain only bi-allelic SNPs and samples with <70% missing data, which resulted in 155 remaining samples and 284,988 SNPs. Genotype imputation was performed using LinkImputeR37 with a minor allele frequency threshold of 0.01, a minimum read depth for masking of 20 and the number of masked genotypes set to 5,000. We chose to impute with a minimum read count of 2 and a maximum missingness threshold of 70%, which resulted in an imputation accuracy of 92.88%. After imputation, 149 samples remained. An additional 12 samples were removed because they had no phenotype data. This resulted in a final set of 137 samples with both genetic and chemical data. The SNP data were filtered using PLINK (v.1.90)38 to exclude SNPs with a minor allele frequency less than 0.05 and SNPs with excess heterozygosity resulting in Hardy–Weinberg P values less than 1 × 10⁻⁵. The final SNP dataset used for GWAS consisted of 116,296 SNPs from 137 samples. For PCA, 1,257 unanchored SNPs were removed, and the remaining 115,039 SNPs were LD-pruned using PLINK (command: --indep-pairwise 10 3 0.5), resulting in 80,939 SNPs.
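To make the two simplest filters concrete, the sketch below applies a per-sample missingness cut-off (<70%) and a per-SNP minor allele frequency cut-off (≥0.05) to a toy genotype table. It is a simplified stand-in written for illustration, not a reimplementation of VCFtools, LinkImputeR or PLINK. The genotype coding (0, 1 or 2 copies of the alternative allele, `None` for a missing call) and all function names are assumptions of this example.

```python
def missing_rate(calls):
    """Fraction of genotype calls that are missing (None)."""
    return sum(call is None for call in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF from calls coded as 0/1/2 alternative-allele counts; None is ignored."""
    called = [call for call in calls if call is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def keep_samples(genotypes, max_missing=0.70):
    """Retain samples whose missing-call rate is below `max_missing`."""
    return {
        sample: calls
        for sample, calls in genotypes.items()
        if missing_rate(calls) < max_missing
    }


def keep_snp_indices(genotypes, min_maf=0.05):
    """Return the column indices of SNPs with MAF of at least `min_maf`."""
    rows = list(genotypes.values())
    n_snps = len(rows[0]) if rows else 0
    return [
        j
        for j in range(n_snps)
        if minor_allele_frequency([row[j] for row in rows]) >= min_maf
    ]


if __name__ == "__main__":
    toy = {
        "s1": [0, 1, 2, 0],
        "s2": [0, 1, None, 0],
        "s3": [None, None, None, 2],  # 75% missing, so it is dropped
        "s4": [0, 2, 2, 0],
    }
    kept = keep_samples(toy)
    print(sorted(kept), keep_snp_indices(kept))
```

The Hardy–Weinberg filter and LD pruning are omitted here because they need an exact test and pairwise LD estimates, respectively.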

Genetic and chemical analysis

The chemical distance between cultivars was calculated as the Euclidean distance using the ‘dist’ function in R from the matrix of metabolomic data (that is, 40 terpenes and cannabinoids quantified across 297 samples). The genetic similarity between samples was calculated as an inverse identity-by-state matrix generated in PLINK. The correlation between the two matrices was computed using a Mantel test in R32 after first reducing the chemical matrix to the 137 samples with both chemical and genetic data. PCA was performed on the scaled genetic and chemical data using the prcomp function in R. To calculate the variance in labelling explained by the chemical and genetic data, we fitted linear models that included the top ten PCs from the genetic data, from the chemical data, and from both datasets together. Pearson correlations between chemical concentration and the 1-to-5 Sativa–Indica scale were calculated with the cor.test function in R. A Bonferroni correction was applied to the P values from the correlation test between chemical concentration and the Sativa–Indica scale.
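For readers unfamiliar with the Mantel test, the following Python sketch shows the general idea under simple assumptions: Euclidean distances are computed between samples, the upper triangles of two distance matrices are correlated, and significance is assessed by permuting the sample order of one matrix. The function names, the one-sided permutation P value and the default of 999 permutations are choices made for this illustration; the study used R, and its exact settings are not stated in the text.

```python
import math
import random


def euclidean_distance_matrix(rows):
    """Pairwise Euclidean distances between samples (one row of values each)."""
    n = len(rows)
    return [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Flatten the upper triangle after reordering samples by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """Return (Mantel r, one-sided permutation P value) for two distance matrices."""
    identity = list(range(len(dist_a)))
    flat_a = upper_triangle(dist_a, identity)
    observed = pearson_r(flat_a, upper_triangle(dist_b, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute samples (rows and columns together)
        if pearson_r(flat_a, upper_triangle(dist_b, order)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    chem = euclidean_distance_matrix([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
    gen = euclidean_distance_matrix([[0.1], [0.9], [1.2], [6.0]])
    print(mantel_test(chem, gen, permutations=99))
```

Because the observed arrangement is counted among the permutations, the smallest P value attainable with 999 permutations is 1/1,000.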

Genome-wide association

We performed GWAS for 40 terpene and cannabinoid phenotypes, using both normalized and non-normalized data. Normalization expressed each chemical’s concentration in a sample relative to the total abundance of its chemical class (that is, monoterpene, sesquiterpene or cannabinoid) in that sample. Thus, a sample’s myrcene content was divided by the total concentration of all monoterpenes in that sample to generate a normalized value for myrcene. GWAS was performed using an MLMM39 accounting for relatedness using a kinship matrix created in TASSEL (v.5.0)35. The MLMM incorporates significant SNPs as cofactors using stepwise regression (maxsteps = 10), and the optimal model was chosen on the basis of the extended Bayesian information criterion. We also present the first step of the MLMM, which is equivalent to an MLM where relatedness is accounted for but no SNPs are included as cofactors. Using the simpleM40 package in R, the effective number of independent tests (Meff) was generated, and the threshold for significance was then calculated using −log10(α/Meff), where α = 0.05. Quantile–quantile and Manhattan plots were created using the qq function in R. Genomic regions with significant GWAS hits were explored, and the physical locations of genes within these regions were retrieved using annotations from the CBDRx reference genome8 in Geneious Prime (v.2020.1.2). The GWAS results and LD regions of interest were visualized using code adapted from ref. 41.
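The within-class normalization and the significance threshold −log10(α/Meff) are both simple calculations, sketched here in Python. The function names and the toy concentrations are hypothetical, and the Meff value in the usage example is a round number chosen for illustration, not the value estimated in the study.

```python
import math


def normalize_by_class(concentrations, class_members):
    """Divide each compound's concentration by the total for its chemical class.

    `concentrations` maps compound name -> concentration in one sample;
    `class_members` lists the compounds belonging to one class
    (for example, all monoterpenes).
    """
    total = sum(concentrations[name] for name in class_members)
    if total == 0:
        return {name: 0.0 for name in class_members}
    return {name: concentrations[name] / total for name in class_members}


def significance_threshold(m_eff, alpha=0.05):
    """Return -log10(alpha / m_eff), the threshold on the -log10(P) scale."""
    return -math.log10(alpha / m_eff)


if __name__ == "__main__":
    sample = {"myrcene": 3.0, "limonene": 1.0, "guaiol": 0.5}
    print(normalize_by_class(sample, ["myrcene", "limonene"]))
    print(significance_threshold(50_000))  # illustrative Meff only
```

For orientation, Supplementary Table 2 reports a cut-off of P < 6.69 × 10⁻⁷. If that cut-off was derived from this formula with α = 0.05, it would correspond to an Meff of roughly 74,700, although the text does not state Meff directly.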

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Supplementary information

Supplementary Information (32.4MB, pdf)

Supplementary Figs. 1 and 2.

Reporting Summary (67.6KB, pdf)
Supplementary Tables (236.6KB, xlsx)

Supplementary Table 1: Chemical concentrations and labels across 297 Cannabis samples. The asterisks denote chemicals with tentative identifications. Supplementary Table 2: Significant SNPs from the MLMM GWAS for myrcene and three sesquiterpenes. Only SNPs identified as significantly (P < 6.69 × 10⁻⁷) associated with a trait according to the MLMM GWAS are shown. The genomic coordinates and annotations, P value, R² value and nearby candidate genes are shown. Supplementary Table 3: A list of compound names identified by Hazekamp et al.16, a list of authentic standards used for compound identification and a list of compound names based on re-analysis of the methods used in Hazekamp et al.16.

Acknowledgements

We thank A. Hazekamp, M. E. Schranz and F. F. M. Becker for their contributions to this work. We thank C. F. Forney and T. Soomro for their assistance. This work was funded primarily by Bedrocan but was also supported by the National Science Foundation Plant Genome Research Programme grant no. 154686 to Z.M. and a Vanier Scholarship from the Natural Sciences and Engineering Research Council of Canada to S.W.

Author contributions

S.M., R.v.V., H.M. and M.M. conceived and designed the study. S.W., R.v.V., M.M. and Z.M. performed the analyses. S.W., R.v.V. and S.M. wrote the manuscript.

Data availability

The authors declare that the data supporting the findings are available within the paper. The sequence data are available in the NCBI Short Read Archive under BioProject No. PRJNA713792. The genotype files are available at 10.5061/dryad.gqnk98smm.

Code availability

All code used for the analyses is available through GitHub at https://github.com/MylesLab/cannabis-labelling.

Competing interests

R.v.V. and H.M. are employed by Bedrocan. Bedrocan funded this work, and R.v.V. played a role in the conceptualization, design, data collection, analysis, decision to publish and preparation of the manuscript. The remaining authors declare no competing interests.

Footnotes

Peer review information Nature Plants thanks Mahmoud A ElSohly, Andrea Mastinu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data is available for this paper at 10.1038/s41477-021-01003-y.

Supplementary information The online version contains supplementary material available at 10.1038/s41477-021-01003-y.

References

  • 1. Lawler A. Mountain high: oldest clear signs of pot use. Science. 2019;364:1018. doi: 10.1126/science.364.6445.1018.
  • 2. Naville, S. $340 billion: the global cannabis market. Geneva Business News https://www.gbnews.ch/340-billion-the-global-cannabis-market/ (2019).
  • 3. Bonini SA, et al. Cannabis sativa: a comprehensive ethnopharmacological review of a medicinal plant with a long history. J. Ethnopharmacol. 2018;227:300–315. doi: 10.1016/j.jep.2018.09.004.
  • 4. Guy, G. W. & McPartland, J. M. Models of Cannabis taxonomy, cultural bias, and conflicts between scientific and vernacular names. Bot. Rev. doi: 10.1007/s12229-017-9187-0 (2017).
  • 5. Laverty KU, et al. A physical and genetic map of Cannabis sativa identifies extensive rearrangements at the THC/CBD acid synthase loci. Genome Res. 2019;29:146–156. doi: 10.1101/gr.242594.118.
  • 6. McKernan, K. J. et al. Sequence and annotation of 42 cannabis genomes reveals extensive copy number variation in cannabinoid synthesis and pathogen resistance genes. Preprint at bioRxiv doi: 10.1101/2020.01.03.894428 (2020).
  • 7. Vergara D, et al. Gene copy number is associated with phytochemistry in Cannabis sativa. AoB Plants. 2019;11:plz074. doi: 10.1093/aobpla/plz074.
  • 8. Grassa CJ, et al. A new Cannabis genome assembly associates elevated cannabidiol (CBD) with hemp introgressed into marijuana. N. Phytol. 2021;230:1665–1679. doi: 10.1111/nph.17243.
  • 9. Gilbert AN, DiVerdi JA. Consumer perceptions of strain differences in Cannabis aroma. PLoS ONE. 2018;13:e0192247. doi: 10.1371/journal.pone.0192247.
  • 10. Russo EB. Taming THC: potential cannabis synergy and phytocannabinoid–terpenoid entourage effects. Br. J. Pharmacol. 2011;163:1344–1364. doi: 10.1111/j.1476-5381.2011.01238.x.
  • 11. Koltai H, Namdar D. Cannabis phytomolecule ‘entourage’: from domestication to medical use. Trends Plant Sci. 2020;25:976–984. doi: 10.1016/j.tplants.2020.04.007.
  • 12. Booth JK, Page JE, Bohlmann J. Terpene synthases from Cannabis sativa. PLoS ONE. 2017;12:e0173911. doi: 10.1371/journal.pone.0173911.
  • 13. Zager, J. J., Lange, I., Srividya, N., Smith, A. & Lange, B. M. Gene networks underlying cannabinoid and terpenoid accumulation in Cannabis. Plant Physiol. doi: 10.1104/pp.18.01506 (2019).
  • 14. Günnewich, N., Page, J. E., Köllner, T. G., Degenhardt, J. & Kutchan, T. M. Functional expression and characterization of trichome-specific (−)-limonene synthase and (+)-α-pinene synthase from Cannabis sativa. Nat. Prod. Commun. doi: 10.1177/1934578X0700200301 (2007).
  • 15. Livingston SJ, et al. Cannabis glandular trichomes alter morphology and metabolite content during flower maturation. Plant J. 2020;101:37–56. doi: 10.1111/tpj.14516.
  • 16. Hazekamp, A., Tekalova, K. & Papadimitriou, S. Cannabis: from cultivar to chemovar II—a metabolomics approach to cannabis classification. Cannabis Cannabinoid Res. doi: 10.1089/can.2016.0017 (2016).
  • 17. Sawler J, et al. The genetic structure of marijuana and hemp. PLoS ONE. 2015;10:e0133292. doi: 10.1371/journal.pone.0133292.
  • 18. Lynch RC, et al. Genomic and chemical diversity in Cannabis. Crit. Rev. Plant Sci. 2017;35:349–363. doi: 10.1080/07352689.2016.1265363.
  • 19. Henry P, et al. A single nucleotide polymorphism assay sheds light on the extent and distribution of genetic diversity, population structure and functional basis of key traits in cultivated North American cannabis. J. Cannabis Res. 2020;2:26. doi: 10.1186/s42238-020-00036-y.
  • 20. Schwabe AL, McGlaughlin ME. Genetic tools weed out misconceptions of strain reliability in Cannabis sativa: implications for a budding industry. J. Cannabis Res. 2019;1:3. doi: 10.1186/s42238-019-0001-1.
  • 21. Smith, C. J., Vergara, D., Keegan, B. & Jikomes, N. The phytochemical diversity of commercial cannabis in the United States. Preprint at bioRxiv doi: 10.1101/2021.07.05.451212 (2021).
  • 22. Compound card for α-trans-bergamotene. Cannabis Database https://cannabisdatabase.ca/compounds/CDB000306 (2020).
  • 23. Russo, E. B. & Marcu, J. in Advances in Pharmacology Vol. 80 (eds Kendall, D. & Alexander, S. P. H.) 67–134 (Academic Press, 2017).
  • 24. Pearce DD, Mitsouras K, Irizarry KJ. Discriminating the effects of Cannabis sativa and Cannabis indica: a web survey of medical cannabis users. J. Altern. Complement. Med. 2014;20:787–791. doi: 10.1089/acm.2013.0190.
  • 25. Temple LM, Leikin JB. Tetrahydrocannabinol—friend or foe? Debate. Clin. Toxicol. 2020;58:75–81. doi: 10.1080/15563650.2019.1610567.
  • 26. Hartsel, J. A., Eades, J., Hickory, B. & Makriyannis, A. in Nutraceuticals (ed. Gupta, R. C.) 735–754 (Academic Press, 2016); doi: 10.1016/B978-0-12-802147-7.00053-X.
  • 27. Hillig KW. A chemotaxonomic analysis of terpenoid variation in Cannabis. Biochem. Syst. Ecol. 2004;32:875–891. doi: 10.1016/j.bse.2004.04.004.
  • 28. Elzinga S, Fischedick J, Podkolinski R, Raber JC. Cannabinoids and terpenes as chemotaxonomic markers in cannabis. Nat. Prod. Chem. Res. 2015;3:181. doi: 10.4172/2329-6836.1000163.
  • 29. Casano S, Grassi G, Martini V, Michelozzi M. Variations in terpene profiles of different strains of Cannabis sativa L. Acta Hortic. 2011;925:115–121. doi: 10.17660/ActaHortic.2011.925.15.
  • 30. Fischedick JT, Hazekamp A, Erkelens T, Choi YH, Verpoorte R. Metabolic fingerprinting of Cannabis sativa L., cannabinoids and terpenoids for chemotaxonomic and drug standardization purposes. Phytochemistry. 2010;71:2058–2073. doi: 10.1016/j.phytochem.2010.10.001.
  • 31. Mudge EM, Brown PN, Murch SJ. The terroir of cannabis: terpene metabolomics as a tool to understand Cannabis sativa selections. Planta Med. 2019;85:781–796. doi: 10.1055/a-0915-2550.
  • 32. R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2018).
  • 33. Small E, Beckstead HD, Chan A. The evolution of cannabinoid phenotypes in cannabis. Econ. Bot. 1975;29:219–232. doi: 10.1007/BF02873168.
  • 34. Elshire RJ, et al. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE. 2011;6:e19379. doi: 10.1371/journal.pone.0019379.
  • 35. Bradbury PJ, et al. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics. 2007;23:2633–2635. doi: 10.1093/bioinformatics/btm308.
  • 36. Danecek P, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330.
  • 37. Money D, Migicovsky Z, Gardner K, Myles S. LinkImputeR: user-guided genotype calling and imputation for non-model organisms. BMC Genomics. 2017;18:523. doi: 10.1186/s12864-017-3873-5.
  • 38. Purcell S, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
  • 39. Segura V, et al. An efficient multi-locus mixed-model approach for genome-wide association studies in structured populations. Nat. Genet. 2012;44:825–830. doi: 10.1038/ng.2314.
  • 40. Gao X, Starmer J, Martin ER. A multiple testing correction method for genetic association studies using correlated single nucleotide polymorphisms. Genet. Epidemiol. 2008;32:361–369. doi: 10.1002/gepi.20310.
  • 41. Hu Z, Olatoye MO, Marla S, Morris PG. An integrated genotyping-by-sequencing polymorphism map for over 10,000 sorghum genotypes. Plant Genome. 2019;12:180044. doi: 10.3835/plantgenome2018.06.0044.


Articles from Nature Plants are provided here courtesy of Nature Publishing Group
