Bioinformatics. 2024 Nov 26;40(12):btae713. doi: 10.1093/bioinformatics/btae713

STRprofiler: efficient comparisons of short tandem repeat profiles for biomedical model authentication

Jared M Andrews, Michael W Lloyd, Steven B Neuhauser, Margaret Bundy, Emily L Jocoy, Susan D Airhart, Carol J Bult, Yvonne A Evrard, Jeffrey H Chuang, Suzanne Baker
Editor: Jonathan Wren
PMCID: PMC11634538  PMID: 39589865

Abstract

Summary

Short tandem repeat (STR) profiling is commonly performed for authentication of biomedical models of human origin, yet no tools exist to easily compare sets of STR profiles to each other or an existing database in a high-throughput manner. Here, we present STRprofiler, a Python package, command line tool, and Shiny application providing methods for STR profile comparison and cross-contamination detection. STRprofiler can be run with custom databases or used to query against the Cellosaurus cell line database.

Availability and implementation

STRprofiler is freely available as a Python package with a rich CLI from PyPI https://pypi.org/project/strprofiler/ with source code available under the MIT license on GitHub https://github.com/j-andrews7/strprofiler and at https://zenodo.org/records/10989034. A web server hosting an example STRprofiler Shiny application backed by a database with data from the National Cancer Institute-funded PDXNet consortium and The Jackson Laboratory PDX program is available at https://sj-bakerlab.shinyapps.io/strprofiler/. Full documentation is available at https://strprofiler.readthedocs.io/en/latest/.

1 Introduction

Human cell lines and patient-derived xenograft (PDX) models are invaluable and commonly used tools for biological and pharmaceutical research (Rosfjord et al. 2014, Mirabelli et al. 2019, Abdolahi et al. 2022), but they are at risk for cross-contamination and mislabeling. Misidentified cell lines have been a known issue for over 50 years (Gartler 1968), leading to erroneous conclusions and a massive waste of resources. A 2021 study estimating the financial impact of the usage of Intestine 407 and HEp-2 found that over $900 million was spent to publish nearly 10 000 articles in which these two HeLa-contaminated cell lines were used (Korch and Capes-Davis 2021). A 2017 study found 32 755 published articles using known misidentified cell lines, which were cited by roughly 500 000 other papers (Horbach and Halffman 2017). Worryingly, the authors also found that the rate of contaminated cell line usage is not decreasing over time and is geographically widespread.

Numerous studies have identified 15%–45% misidentification across hundreds of cell lines (Ye et al. 2015, Drexler et al. 2017, Huang et al. 2017, Gu et al. 2022). Despite this, many studies are published each year using known contaminated or misidentified cell lines (Buehring et al. 2004, Makowska et al. 2024). Due to these alarming trends, increasing numbers of journals are implementing cell line authentication policies, including Nature, Cell Press, EMBO Press, the International Journal of Cancer, and the American Association for Cancer Research. A 2022 study of the mandatory cell line authentication at the International Journal of Cancer revealed that at least 5% of human cell lines in submitted manuscripts between July 2018 and June 2021 were misidentified (Souren et al. 2022).

As with cell lines, PDX models have been used extensively in pharmacology and preclinical drug testing. Ensuring that patient-derived model samples are not mishandled, and that results are not attributed to misidentified models, is equally important; however, misidentification in PDX models across studies has not been as widely documented. Historically, the validation of PDX models has primarily been concerned with reproducing the primary disease state (Rosfjord et al. 2014), but determination of the genetic identity of models is strongly recommended (Mattar et al. 2018).

Ensuring the authenticity and purity of cell lines and patient-derived models is critical for the integrity of scientific research and efficient use of resources. Short tandem repeat (STR) profiling is the recommended method for model authentication (Masters et al. 2001, Capes-Davis et al. 2013, El-Hoss et al. 2016, Mattar et al. 2018); it requires comparing the STR profile of a test sample with that of a previously generated reference sample. Generation of cell lines and PDX models from primary samples is expensive and time-consuming, so it is best practice for groups that generate and utilize them to serially profile them over time or after transfer between groups to ensure their authenticity. Doing so also provides a way to measure genetic drift over time in culture, which can be significant for certain lines (Parson et al. 2005). Additionally, the NCI Patient-Derived Models Repository (PDMR) requires STR profiling before submission of material for banking and recommends repeated profiling over the course of any PDX-based experimental process. It is also important to note that natural intra-model allelic variation in STR profiles does exist (Zhang et al. 2018). Such variation is most commonly seen as allelic drop-out due to loss of heterozygosity, as demonstrated in models with microsatellite instability (Poetsch et al. 2004, Vauhkonen et al. 2004). Documentation of such intra-model variation may be important for model validation if there is a concern of cross-contamination.

Despite the growing necessity of this practice, no software exists to easily and quickly compare arbitrary STR profiles to identify matches or potential contaminants. STRprofiler was developed to fill this critical gap and to support reproducible results from research that relies on model systems.

2 Methods and results

STRprofiler is a Python package and Shiny application for comparing short tandem repeat (STR) profiles. Profiles are frequently generated after model generation and throughout maintenance to authenticate their identity, determine sample mixing, and measure genetic drift during culture and PDX passaging. STRprofiler provides a simple command line interface (CLI) or interactive web application for comparing STR profiles and detecting sample mixing, which can prove a time-consuming and tedious task when performed manually. While there exist comprehensive STR profile databases and tools, such as the Cellosaurus STR Similarity Search Tool (CLASTR; https://www.cellosaurus.org/str-search/) (Robin et al. 2020), that allow comparisons to established and published cell lines in a low-throughput manner, these databases and tools are inadequate for research groups that generate, utilize, and maintain their own models not yet available through public repositories. STRprofiler allows researchers to compare all their STR profiles quickly and easily to each other or to existing databases, running on collections of thousands of STR profiles in seconds on a typical laptop.

STRprofiler was designed to be simple to use and interpret. It utilizes flexible and straightforward STR profile formats as input, supporting popular STR platforms such as PowerPlex. For all input STR profiles, each profile will be compared to every other profile or to each sample in the database provided and scored for similarity using the Tanabe (Tanabe et al. 1999), Masters (versus query), and Masters (versus reference) algorithms (Masters et al. 2001):

  • Tanabe, also known as the Sørensen–Dice coefficient:
    $\text{Score} = \dfrac{2 \times \text{no. shared alleles}}{\text{no. query alleles} + \text{no. reference alleles}}$
  • Masters (versus query):
    $\text{Score} = \dfrac{\text{no. shared alleles}}{\text{no. query alleles}}$
  • Masters (versus reference):
    $\text{Score} = \dfrac{\text{no. shared alleles}}{\text{no. reference alleles}}$
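
To make the three scores concrete, the following is a minimal sketch in Python and not STRprofiler's own implementation. It assumes a profile is represented as a mapping from marker name to a set of allele calls, and that only markers present in both profiles are counted; the names `Profile` and `str_scores` are hypothetical and chosen for illustration.

```python
from typing import Dict, Set

# Assumed representation: marker name -> set of allele calls (as strings).
Profile = Dict[str, Set[str]]


def str_scores(query: Profile, reference: Profile) -> Dict[str, float]:
    """Compute Tanabe and both Masters scores for two STR profiles.

    Assumption: only markers present in both profiles contribute
    to the allele counts.
    """
    common = query.keys() & reference.keys()
    shared = sum(len(query[m] & reference[m]) for m in common)
    n_query = sum(len(query[m]) for m in common)
    n_ref = sum(len(reference[m]) for m in common)
    total = n_query + n_ref
    return {
        "tanabe": 2 * shared / total if total else 0.0,
        "masters_query": shared / n_query if n_query else 0.0,
        "masters_reference": shared / n_ref if n_ref else 0.0,
    }


# Toy profiles: the reference has lost one TH01 allele relative to the query.
query = {"D5S818": {"11", "12"}, "TH01": {"6", "9.3"}, "TPOX": {"8"}}
reference = {"D5S818": {"11", "12"}, "TH01": {"6"}, "TPOX": {"8"}}

scores = str_scores(query, reference)
print({name: round(value, 3) for name, value in scores.items()})
```

In this toy case four alleles are shared, the query has five alleles and the reference has four, so the Masters (versus reference) score is 1.0 while the Masters (versus query) score is 0.8. This asymmetry is what makes the two Masters variants informative when one profile contains the other.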

The Masters and Tanabe formulae used with an 80% match threshold (the default in STRprofiler) have been demonstrated to accurately identify matching profiles in 98%–99% of cases, with most failures arising in models with known variation due to microsatellite instability (Capes-Davis et al. 2013). In such models, use of additional loci or validation with alternative approaches (e.g. SNP analysis) may be necessary. To identify potential sample contamination, STRprofiler flags profiles with three or more alleles detected for three or more loci for further investigation. The Masters algorithms are particularly useful for determining the potential contaminating samples when unintentional mixing occurs.
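
The match threshold and the mixing flag described above can be sketched as follows. This is an illustrative sketch rather than the package's code: the function names `is_match` and `flag_mixing` are hypothetical, the profile representation (marker name mapped to a set of alleles) is an assumption, and treating the 80% threshold as inclusive is also an assumption.

```python
from typing import Dict, Set

Profile = Dict[str, Set[str]]  # assumed: marker name -> set of allele calls


def is_match(score: float, threshold: float = 0.80) -> bool:
    """Return True if a similarity score meets the match threshold.

    Assumption: the threshold is inclusive (score >= threshold).
    """
    return score >= threshold


def flag_mixing(profile: Profile, min_alleles: int = 3, min_loci: int = 3) -> bool:
    """Flag a profile as potentially mixed.

    A profile is flagged when at least `min_loci` markers each carry
    at least `min_alleles` distinct alleles.
    """
    crowded = [m for m, alleles in profile.items() if len(alleles) >= min_alleles]
    return len(crowded) >= min_loci


# Toy example: three loci show three alleles each, so the profile is flagged.
suspect = {
    "D5S818": {"11", "12", "13"},
    "TH01": {"6", "7", "9.3"},
    "TPOX": {"8", "9", "11"},
    "vWA": {"16", "18"},
}
print(flag_mixing(suspect))
```

A flagged profile is a prompt for closer inspection rather than proof of contamination, since the text notes that flagged profiles are marked for further investigation.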

STRprofiler returns two types of output files. The first is a summary file containing a record for each STR profile, its top matches with the Tanabe algorithm, all matches for each algorithm that meet the scoring threshold, and a flag for potential sample mixing based on the number of markers with three or more alleles detected. The second is a profile-specific file with that profile queried against all others. This file allows for closer interrogation of samples with potential mixing or genetic drift.

STRprofiler also contains a command to compare profiles to the Cellosaurus database using the CLASTR API, enabling convenient, high-throughput access to the most expansive STR profile database available (Robin et al. 2020). This functionality is available from both the command line and through the Shiny application, returning rich results in Excel format with links out to the relevant cell line pages on Cellosaurus.

In conclusion, STRprofiler fills a critical gap in ensuring the integrity and authenticity of models broadly used in biomedical research. It aids in the prevention of erroneous data generation and interpretation, which is a major concern in this field.

Acknowledgements

The authors thank Lawryn Kasper for her helpful feedback and testing of the software and the PDX Development and Trial Centers Research Network (PDXNet) for supporting this work.

Contributor Information

Jared M Andrews, Department of Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, TN 38105, United States.

Michael W Lloyd, The Jackson Laboratory, Bar Harbor, ME 04609, United States.

Steven B Neuhauser, The Jackson Laboratory, Bar Harbor, ME 04609, United States.

Margaret Bundy, The Jackson Laboratory, Sacramento, CA 95838, United States.

Emily L Jocoy, The Jackson Laboratory, Sacramento, CA 95838, United States.

Susan D Airhart, The Jackson Laboratory, Bar Harbor, ME 04609, United States.

Carol J Bult, The Jackson Laboratory, Bar Harbor, ME 04609, United States.

Yvonne A Evrard, Leidos Biomedical Research, Inc, Frederick National Laboratory for Cancer Research, Frederick, MD 21701, United States.

Jeffrey H Chuang, The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, United States.

Suzanne Baker, Department of Developmental Neurobiology, St. Jude Children's Research Hospital, Memphis, TN 38105, United States.

Conflict of interest

None declared.

Funding

This work was supported by the National Cancer Institute (NCI) of the National Institutes of Health under award numbers P01CA096832 and U24CA224067. This project was also supported in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government.

Data availability statement

No new data were generated or analysed in support of this research. However, the data underlying the example STRprofiler Shiny application are publicly available from The Jackson Laboratory PDX Program (https://tumor.informatics.jax.org/mtbwi/pdxSearch.do) and the NCI's Patient-Derived Model Repository (https://pdmr.cancer.gov/).

References

  1. Abdolahi S, Ghazvinian Z, Muhammadnejad S. et al. Patient-derived xenograft (PDX) models, applications and challenges in cancer research. J Transl Med 2022;20:206.
  2. Buehring GC, Eby EA, Eby MJ. et al. Cell line cross-contamination: how aware are Mammalian cell culturists of the problem and how to monitor it? In Vitro Cell Dev Biol Anim 2004;40:211–5.
  3. Capes-Davis A, Reid YA, Kline MC. et al. Match criteria for human cell line authentication: where do we draw the line? Int J Cancer 2013;132:2510–9.
  4. Drexler HG, Dirks WG, MacLeod RAF. et al. False and mycoplasma-contaminated leukemia-lymphoma cell lines: time for a reappraisal. Int J Cancer 2017;140:1209–14.
  5. El-Hoss J, Jing D, Evans K. et al. A single nucleotide polymorphism genotyping platform for the authentication of patient derived xenografts. Oncotarget 2016;7:60475–90.
  6. Gartler SM. Apparent HeLa cell contamination of human heteroploid cell lines. Nature 1968;217:750–1.
  7. Gu M, Yang M, He J. et al. A silver lining in cell line authentication: short tandem repeat analysis of 1373 cases in China from 2010 to 2019. Int J Cancer 2022;150:502–8.
  8. Horbach SPJM, Halffman W. The ghosts of HeLa: how cell line misidentification contaminates the scientific literature. PLoS One 2017;12:e0186281.
  9. Huang Y, Liu Y, Zheng C. et al. Investigation of cross-contamination and misidentification of 278 widely used tumor cell lines. PLoS One 2017;12:e0170384.
  10. Korch CT, Capes-Davis A. The extensive and expensive impacts of HEp-2 [HeLa], intestine 407 [HeLa], and other false cell lines in journal publications. SLAS Discov 2021;26:1268–79.
  11. Makowska A, Kontny U, Weiskirchen R. et al. HeLa cells cross-contaminated nasopharyngeal carcinoma cell lines: still a common problem. Br J Cancer 2024;130:1885–6.
  12. Masters JR, Thomson JA, Daly-Burns B. et al. Short tandem repeat profiling provides an international reference standard for human cell lines. Proc Natl Acad Sci USA 2001;98:8012–7.
  13. Mattar M, McCarthy CR, Kulick AR. et al. Establishing and maintaining an extensive library of patient-derived xenograft models. Front Oncol 2018;8:19.
  14. Mirabelli P, Coppola L, Salvatore M. et al. Cancer cell lines are useful model systems for medical research. Cancers (Basel) 2019;11:1098.
  15. Parson W, Kirchebner R, Mühlmann R. et al. Cancer cell line identification by short tandem repeat profiling: power and limitations. FASEB J 2005;19:434–6.
  16. Poetsch M, Petersmann A, Woenckhaus C. et al. Evaluation of allelic alterations in short tandem repeats in different kinds of solid tumors—possible pitfalls in forensic casework. Forensic Sci Int 2004;145:1–6.
  17. Robin T, Capes-Davis A, Bairoch A. et al. CLASTR: the cellosaurus STR similarity search tool—a precious help for cell line authentication. Int J Cancer 2020;146:1299–306.
  18. Rosfjord E, Lucas J, Li G. et al. Advances in patient-derived tumor xenografts: from target identification to predicting clinical response rates in oncology. Biochem Pharmacol 2014;91:135–43.
  19. Souren NY, Fusenig NE, Heck S. et al. Cell line authentication: a necessity for reproducible biomedical research. EMBO J 2022;41:e111307.
  20. Tanabe H, Takada Y, Minegishi D. et al. Cell line individualization by str multiplex system in the cell bank found cross-contamination between Ecv304 and Ej-1/T24. Tissue Culture Res Commun 1999;18:329–38.
  21. Vauhkonen H, Hedman M, Vauhkonen M. et al. Evaluation of gastrointestinal cancer tissues as a source of genetic information for forensic investigations by using STRs. Forensic Sci Int 2004;139:159–67.
  22. Ye F, Chen C, Qin J. et al. Genetic profiling reveals an alarming rate of cross-contamination among human cell lines used in China. FASEB J 2015;29:4268–72.
  23. Zhang P, Zhu Y, Li Y. et al. Forensic evaluation of STR typing reliability in lung cancer. Leg Med (Tokyo) 2018;30:38–41.



Articles from Bioinformatics are provided here courtesy of Oxford University Press
