Journal of Insect Science. 2022 Jul 3;22(4):2. doi: 10.1093/jisesa/ieac038

An Outsider’s Perspective on Why We Climb Mountains and Why Projects Like the i5k Matter

David C Molik
Editor: Phyllis Weintraub
PMCID: PMC9250708  PMID: 35780386

Abstract

Initiatives like the i5k are creating ever more genome assemblies. These initiatives are resource heavy, and their justifications and economics deserve attention. Scientifically, these initiatives are important: they pave the way for cross-species analysis, require the building of new computational analyses and tools, and create other new resources. However, an open question remains of how to quantitatively measure the impact of genomes and, by extension, of these initiatives. This forum article discusses one such method, which is to look at the publications about a species over time. This method, however, does not show any signal from a published genome, leaving open the question of how to measure impact.

Keywords: computer science, genetic research/engineering, genetics


The 5,000 Insect Genomes Project (i5k) (i5K Consortium 2013; Sills et al. 2011) is creating a large number of genome assemblies for insects. The i5k can be likened to initiatives like the Genome 10K Project (Koepfli et al. 2015), the Global Invertebrate Genomics Alliance (Giribet et al. 2014), and the 10,000 Plant Genomes Project (Twyford 2018), among others. Sequencing efforts like the i5k can be traced back to around 2009 (Levine 2011); large sequencing initiatives were started to investigate the genomics of whole families and orders, the implication being that through multiple genomes the biology of these clades, especially adaptation, speciation, and molecular pathways, would become more apparent (Sills et al. 2011). This was possible because of new sequencing technology that had become available: 2005 saw the release of the GS20, the first ‘next generation sequencer’; the first Solexa (now Illumina) sequencer came out in 2006; and 2008 saw the 1,000 Genomes Project, a project to catalog human genetic variation (Siva 2008). Many genome sequencing initiatives are not centrally funded; they are loose organizations pushing for the sequencing of clades. For instance, in the i5k, a portion of the genomes is funded through federal agencies like the USDA (Childers et al. 2021), but others are funded through university research labs. The creation of genomes has a monetary cost, so it follows that the genomes being completed are generally those that constituent organization members need to complete their funded projects. Some of these projects have a broad mission whose goal is either purely additional genome assemblies or would otherwise allow genomes that may not have immediate personal-research benefit to the initiative’s members. As an example, the i5k had a pilot project to sequence 28 species as a test study for the i5k concept as a whole; these insects were in part selected by the community, so the motivations varied (i5K Consortium 2013).

Genome sequencing initiatives provide genomic resources to the scientific community; it is important to understand the impact that these initiatives have on the communities they serve. There is a lack of methodological resources for measuring the scientific impact of sequencing initiatives. This forum article explores one possible method, the use of linear models fit before and after genome publication, but is intended mainly as a starting point for a conversation about impact and justification in sequencing initiatives in general. Determining the impact of genomic resources is not without precedent (Wadman 2013); understanding the impact of novel genome assemblies would help the organizations that create them justify their resources. There is some question, however, about which quantifiable variable affected by genome assemblies should be measured.

Ideally, some primary data product of these sequencing initiatives would be used to gauge a genome assembly’s effect; citations of a genome paper, or citations of a BioProject ID number from NCBI, are examples (Sayers 2021). Unfortunately, while this would give some indication of the use of a genome, it would be unable to account for the total impact of a genome: papers that came out before the genome assembly and referenced the genome would not be accounted for, nor would papers that did not directly cite the genome assembly. To account for the total impact of a genome assembly, it would be necessary to use the number of papers that have mentioned the genome assembly, with or without a trackable citation. A considerable problem is what effect a genome assembly would leave on its relevant articles. Many methods would not be able to differentiate the effect of a genome assembly from a background linear increase or from a simply low number of publications per year; these are the kinds of problems that arise when an Ordinary Least Squares (OLS) method is used on time-series data. An autoregressive model may be more appropriate for time-series data such as the number of publications per year. However, to build an understanding of the general trend of the data, rather than of its cause, an OLS method might be used, although the conclusions that can be drawn will be limited. Finding and using an anomaly in the time-series data would be ideal: if genome assemblies acted as a workforce multiplier, then this signal would appear as an increase in the rate of publications, as the scientific community around an insect would benefit from the genome assembly.

If the publication of journal articles and other archivable documents is taken to be the measurable output of a scientific field, then the number of research articles mentioning a species should be tied to the total amount of work being done on that species. While imperfect, data mining research articles is one of the easiest ways to access data on the work output of a scientific community, defined here as a group of researchers, possibly unorganized, working on a common study target. Arguably the easiest way to access publication data is through NCBI’s PubMed API (Sayers 2021); by using the PubMed API, it should be possible to get a rough unit of work for each insect in the i5k.

One way to implement an approach that accounts for the rate of publication is to compare the adjusted trends of publications mentioning a particular genome before and after genome assembly publication. If publications mentioning a genome assembly came out at an increasing rate after the publication of the assembly, and this trend was seen over a number of different novel genome assemblies, then there would be a strong indication that the publication of a genome assembly has an effect on the community of researchers studying a particular species or, as with i5k genomes, insect. If the method showed an increasing rate, there would be an understandable and accessible scientific justification for these initiatives; if it did not, it would raise questions about how to measure the work output of a scientific community and what other methods could be used in justifications. When this analysis is carried out, the expected increase in rate after the publication of a genome assembly is not seen, which would seem to indicate that the publication of a genome assembly has no effect on the rate of publication of the community working on any insect. Exploring a journal article-based methodology for the scientific justification of genome assembly initiatives provides insights for justification methodologies in general, but the lack of an effect leaves open the question of why no effect is seen.

While the method described in this forum article does not show an increased publication rate, it is one metric of many, and it would be worth exploring other ways in which the impact of a new genome can be measured. While useful, measuring the impact of a given genome in this way is flawed, as the method fails to account for singularly high-impact studies or important novel discoveries. In framing this result, it is important to consider that every year more publications are published than the last. This necessitates a method that can account for a constant increase in publications regardless of the publication of the genome assembly itself, one that uses the rate of publication instead of the number of articles mentioning a genome.

To determine the change of rate before and after the publication of each i5k genome assembly, the formal Latin names of each of the 476 species/genomes listed on the i5k website were ascertained (for the list see Childers 2017, supplement). Each scientific name was submitted to PubMed to obtain the number of hits for each of the years 2000–2020, and a time-series dataset of each species’ publications per year was built. Scientific (Latin) names were used because common names may not be unique or may not exist for some species, some common names are composed of words that may be found in nonsubject-area topics, and some species have multiple common names. The scientific name only needs to be mentioned somewhere in the research article for the article to be counted in the search. The genome assembly publication date was found by using the attached genome assembly ID number and the NCBI search API (Winter 2017). Since it is assumed that the number of publications on any species will go up over time, we analyze the rate at which new publications about a species are published before and after a genome is published, a date which differs for each species (Fig. 1) (Fire and Guestrin 2019).
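The per-year PubMed lookup described above can be sketched as follows. This is a minimal illustration, not the code used for the article (which cites the rentrez R package): it only builds NCBI E-utilities ESearch request URLs and parses a count from a JSON response, and it makes no network call. Searching the binomial as a quoted phrase, the helper names, and the reduced response shape in `parse_count` are assumptions made for the example.

```python
import json
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_count_query(scientific_name, year):
    """Return an ESearch URL requesting the PubMed hit count for one name and one year."""
    params = {
        "db": "pubmed",
        "term": f'"{scientific_name}"',  # assumption: search the binomial as a phrase
        "datetype": "pdat",              # restrict by publication date
        "mindate": str(year),
        "maxdate": str(year),
        "rettype": "count",
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_count(response_text):
    """Extract the hit count from an ESearch JSON response body."""
    return int(json.loads(response_text)["esearchresult"]["count"])


def yearly_queries(scientific_name, first_year=2000, last_year=2020):
    """Map each year in the study window to its ESearch query URL."""
    return {
        year: build_count_query(scientific_name, year)
        for year in range(first_year, last_year + 1)
    }
```

Fetching each URL (for example with `urllib.request.urlopen`) and passing the body to `parse_count` would give one data point of the time series; NCBI's usage guidelines on request rates and API keys would apply to any real run over several hundred species.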

Fig. 1.


Example diagram of how the change in the rate of publishing before and after a genome is published is determined, for a single instance of the method. In the method, this would be run independently for all insects. A linear model is fit to the years preceding the genome assembly publishing date, and another linear model is fit to the years following it. The genome assembly publishing year changes from insect to insect. The three years before and the three years after the publishing year are not taken into account in the preceding and following analyses, because of possible effects caused by the publishing of the genome assembly itself. The example shown is hypothetical: an insect whose genome assembly was published in 2010. Cutoffs of three years would be 2007 and 2014. The slope from 2000 to 2006 would be compared to the slope from 2014 to 2020. In this example, a greater slope would be found in the following model. However, the exact years accounted for would be determined by the publishing year of the insect in question.
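The comparison in this caption can be sketched for a single species as below. The function names and the invented counts are hypothetical, and plain least-squares slopes stand in for whatever model-fitting routine was actually used; with a genome year of 2010 and a three-year buffer, the windows are 2000–2006 and 2014–2020, matching the example in the caption.

```python
def ols_slope(years, counts):
    """Slope of the ordinary least squares line through (year, count) points."""
    n = len(years)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = sum(years) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
    return sxy / sxx


def before_after_slopes(counts_by_year, genome_year, buffer_years=3):
    """Fit one line before and one line after the genome year, skipping the buffer."""
    before = sorted(y for y in counts_by_year if y < genome_year - buffer_years)
    after = sorted(y for y in counts_by_year if y > genome_year + buffer_years)
    slope_before = ols_slope(before, [counts_by_year[y] for y in before])
    slope_after = ols_slope(after, [counts_by_year[y] for y in after])
    return slope_before, slope_after


# Invented series: 2 more papers per year up to 2010, 4 more per year afterwards.
example_counts = {
    year: 5 + 2 * (year - 2000) if year <= 2010 else 25 + 4 * (year - 2010)
    for year in range(2000, 2021)
}
example_before, example_after = before_after_slopes(example_counts, genome_year=2010)
```

For this invented series the post-publication slope is the larger of the two, which is the pattern the method was designed to detect.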

Two preliminary analyses were conducted, only the second of which tested the specific assumption that a genome assembly causes an increase in publishing rate per species. In the first analysis, a general linear model was fit to each species over time without regard to the genome publishing date. R-squared values tended to be high (0.95 for a model of the number of papers by year, factorized by species), indicating that linear models fit each species well (see Fig. 2C and D; see Supplement for further details). A well-fit linear model was an indication that genome publication is not influencing the number of papers over time, as an increased rate at some point in the publication history would result in a lower R-squared. A second analysis was added to determine whether the genome publication was having any effect on the number of publications over time. In an analysis borrowed from the field of economics, an event study was conducted. Event studies determine whether a particular event influenced an outcome; they are generally used in securities trading (Autor et al. 2017). The eventStudy R package was used, and the mean P-value was greater than 0.27, indicating no effect from the genome publication event (Novgorrodsky and Setzler 2019).
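The reasoning behind the first preliminary analysis, that a change in rate partway through a series should lower the R-squared of a single straight-line fit, can be illustrated with a small sketch. The two series are invented, and fitting one line per series is a simplification of the article's model, which was fit across species with species as a factor.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a single OLS line through (x, y) points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for a constant series")
    return 1 - ss_res / ss_tot


years = list(range(2000, 2021))

# Invented series with a constant rate of 3 additional papers per year.
steady = [10 + 3 * (year - 2000) for year in years]

# Invented series whose rate jumps from 1 to 6 papers per year after 2010.
kinked = [
    (year - 2000) if year <= 2010 else 10 + 6 * (year - 2010)
    for year in years
]

r2_steady = r_squared(years, steady)
r2_kinked = r_squared(years, kinked)
```

The steady series is fit essentially perfectly, while the kinked series scores lower; how much lower depends on the size and timing of the change, so a high R-squared on its own is suggestive rather than conclusive.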

Fig. 2.


(A) Trends of articles of species after a genome assembly was published. (B) Accumulation of articles of species with greater than zero articles; a linear model of the log of sorted articles greater than zero is added. The model had a P-value of < 0.00, indicating that the number of articles generally followed an exponential increase per species (i.e., a few species have many articles). (C) Number of articles per species (a line per species) from 2000 to 2020, with the genome publication date for each species marked as a red dot. (D) Number of articles per species (a line per species) from 2000 to 2020, with a trend line added for each species as in the first preliminary analysis.

In the main analysis, a linear model was fit to the years before the publication of the genome, and another model to the years after the year of publication (Fig. 2A and B). The average R-squared value across species for the linear model fit before genome publication was 0.23, and the average R-squared for the model fit after genome publication was 0.30. Species were sorted into three categories. The first category had an increase in trend after publication: the slope of the first model was less than the slope of the second model. The second category had no change in trend after genome publication: the slope of the before-publication model was the same as the slope of the after-publication model; generally these are insects with no publications outside of the genome assembly publication year. The third category had a decrease in trend after publication: the slope of the first model was greater than the slope of the second model (for the general method see Fig. 1). This analysis gave no indication that the rate of publication increases when a genome is published. Similarly, when the same analysis is done on some model organisms (i.e., Homo sapiens, Drosophila melanogaster, Zea mays, and Caenorhabditis elegans) for the years 1990 to 2020, no correlation is found between the published genome and an increased publishing rate (see Supplement for details). When the same analysis was conducted with the scientific name of the insect but also requiring the word ‘genome’, similar results were found (see Supplement for details).
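The three-way sorting described above amounts to comparing two slopes per species. A minimal sketch follows; the category labels, the `tolerance` parameter (the text simply says the slopes are "the same"), and the placeholder species keys are assumptions for illustration.

```python
from collections import Counter


def classify_trend(slope_before, slope_after, tolerance=0.0):
    """Assign a species to one of the three categories by comparing its two slopes."""
    difference = slope_after - slope_before
    if difference > tolerance:
        return "increase"
    if difference < -tolerance:
        return "decrease"
    return "no change"


def tally_categories(slopes_by_species, tolerance=0.0):
    """Count species per category from a mapping of name -> (slope_before, slope_after)."""
    return Counter(
        classify_trend(before, after, tolerance)
        for before, after in slopes_by_species.values()
    )


# Placeholder entries, not real i5k results.
example_slopes = {
    "species_a": (2.0, 4.0),   # faster after publication
    "species_b": (0.0, 0.0),   # e.g. no papers outside the genome year
    "species_c": (3.0, 1.5),   # slower after publication
}
example_tally = tally_categories(example_slopes)
```

A nonzero `tolerance` would treat near-equal slopes as unchanged, which may be preferable when slopes come from noisy fits such as those with the low average R-squared values reported here.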

From the perspective of a sequencing initiative like the i5k, it is tempting to prioritize new genome assemblies for insects that have a large corpus of work around them. However, there are cases where a genome would provide little benefit for the researchers working on an insect; for instance, an insect may be studied for nonmolecular ecological reasons. Ideally, initiatives that create genome assemblies prioritize genomes that have communities already working on the insect in question; communities that are working on a particular insect should leave a footprint in published articles. But the absence of a relationship between the rate of publication and the publication of a genome would call this argument into question. It would indicate that a better argument for genome sequencing initiatives might rest on the ability of genome assembly data to answer questions that only a large number of genomes could answer, or on a piecemeal case that a particular genome assembly is needed for questions only a genome assembly could answer.

A possible reason for the lack of an effect in the method is that in some insects there is not a clear signal of publications at all. This would happen if a stochastic variable were affecting the number of publications each year; a number of such insects, where the signal was lost because of a stochastic variable, would cause the event study to return a false negative. A linear model fits publications fairly well, but if there were insects with only a handful of publications a year, a linear model would still fit fairly well. Such a hypothesis does not easily explain why model organisms also seem to show no effect from a genome assembly by these methods; however, the preliminary linear model that does not consider the genome publication, which fits well, suggests that the genome assemblies’ effect, if any, was outweighed by the general trend of more publications every year regardless. If the effect size is nonexistent or too small, some other metric should be found. Another possible explanation for the lack of an increase in trend in the main analysis is a possible lack of data, although any genome that did not have at least three data points before and after publication of the genome would not have been included in the analysis (i.e., no genome would have been considered if the publication of the genome was before 2006 or after 2015).

An important point is that these initiatives exist for scientific reasons, which may not include a measurable impact on the community around an insect. Examples of scientific justifications include the need for phylogenetic trees in the study of evolution, the understanding of the mechanisms of symbiosis, and understanding how the chromosome evolved, among other questions that can only be answered by filling out the tree of life (Blaxter et al. 2022, Richards 2015). Famed mountaineer George Mallory is reputed to have answered the question ‘Why do you climb mountains?’ with ‘Because they are there.’ Mallory’s poetic answer probably says more about scientists and explorers in general than about any sequencing initiative; nevertheless, the exploratory sentiment finds its application: these initiatives are explorations in the mechanisms of the genome and its evolution, and analysis of their effect is a worthwhile endeavor.

Supplementary Material

ieac038_suppl_Supplementary_Materials

Acknowledgments

We would like to thank Scott Geib, PhD at the United States Department of Agriculture (USDA) Agricultural Research Service (ARS) Tropical Pest Genetics and Molecular Biology Research Unit for helpful comments and ideas regarding the substance of this forum paper. David Molik is supported by the USDA ARS HQ Research Associate program in Big Data. This work used resources provided by the SCINet project of the USDA ARS, ARS project number 0500-00093-001-00-D. The USDA is an equal opportunity lender, provider, and employer. Mention of trade names or commercial products in this report is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA.

Author Contributions

DCM: Conceptualization; Formal analysis; Investigation; Methodology; Software; Writing—original draft, review, and editing.

References Cited

  1. Autor, D., Kostol A. R., Mogstad M., and Setzler B. 2017. Disability benefits, consumption insurance, and household labor supply. Working Paper 23466, National Bureau of Economic Research. doi: 10.1257/aer.20151231
  2. Blaxter, M., Archibald J. M., Childers A. K., Coddington J. A., Crandall K. A., Di Palma F., Durbin R., Edwards S. V., Graves J. A. M., Hackett K. J., et al. 2022. Why sequence all eukaryotes? Proc. Natl. Acad. Sci. U.S.A. 119. doi: 10.1073/pnas.2115636118
  3. Childers, A. 2017. Sequenced arthropod genomes - i5K. http://i5k.github.io/arthropod_genomes_at_ncbi. [Website] [Accessed: December 3rd 2022]
  4. Childers, A. K., Geib S. M., Sim S. B., Poelchau M. F., Coates B. S., Simmonds T. J., Scully E. D., Smith T. P. L., Childers C. P., Corpuz R. L., et al. 2021. The USDA-ARS Ag100Pest initiative: high-quality genome assemblies for agricultural pest arthropod research. Insects. 12. doi: 10.3390/insects12070626
  5. Fire, M., and Guestrin C. 2019. Over-optimization of academic publishing metrics: observing Goodhart’s Law in action. GigaScience. 8: giz053. doi: 10.1093/gigascience/giz053
  6. Giribet, G., Bracken-Grissom H., Collins A. G., Collins T., Crandall K., Distel D., Dunn C., Haddock S., Knowlton N., Martindale M., et al. 2014. The Global Invertebrate Genomics Alliance (GIGA): developing community resources to study diverse invertebrate genomes. J. Hered. 105: 1–18. doi: 10.1093/jhered/est084
  7. i5K Consortium. 2013. The i5K initiative: advancing arthropod genomics for knowledge, human health, agriculture, and the environment. J. Hered. 104: 595–600. doi: 10.1093/jhered/est050
  8. Koepfli, K.-P., Paten B., and O’Brien S. J. 2015. The genome 10K project: a way forward. Annu. Rev. Anim. Biosci. 3: 57–111. doi: 10.1146/annurev-animal-090414-014900
  9. Levine, R. 2011. i5k: the 5,000 insect genome project. Am. Entomol. 57: 110–113. doi: 10.1093/ae/57.2.110
  10. Novgorrodsky, D., and Setzler B. 2019. eventStudy (0.1.0). https://github.com/setzler/eventStudy/ [Software]. [Accessed: May 10th 2022].
  11. Richards, S. 2015. It’s more than stamp collecting: how genome sequencing can unify biological research. Trends Genet. 31: 411–421. doi: 10.1016/j.tig.2015.04.007
  12. Sayers, E. W., Beck J., Bolton E. E., Bourexis D., Brister J. R., Canese K., Comeau D. C., Funk K., Kim S., Klimke W., et al. 2021. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 49: D10–D17. doi: 10.1093/nar/gkv1290
  13. Sills, J., Robinson G. E., Hackett K. J., Purcell-Miramontes M., Brown S. J., Evans J. D., Goldsmith M. R., Lawson D., Okamuro J., Robertson H. M., et al. 2011. Creating a buzz about insect genomes. Science. 331: 1386–1386. doi: 10.1126/science.331.6023.1386
  14. Siva, N. 2008. 1000 genomes project. Nat. Biotechnol. 26: 256.
  15. Twyford, A. D. 2018. The road to 10,000 plant genomes. Nat. Plants. 4: 312–313. doi: 10.1038/s41477-018-0165-2
  16. Wadman, M. 2013. Economic return from Human Genome Project grows. Nature. doi: 10.1038/nature.2013.13187
  17. Winter, D. J. 2017. rentrez: An R package for the NCBI eUtils API. R J. 9: 520–526. doi: 10.32614/RJ-2017-058


Articles from Journal of Insect Science are provided here courtesy of University of Wisconsin Libraries
