MiPRIME: an integrated and intelligent platform for mining primer and probe sequences of microbial species

Zhiming Zhang; Jing Ren; Lili Ren; Lanying Zhang; Qubo Ai; Haixin Long; Yi Ren; Kun Yang; Huiying Feng; Sabrina Li; Xu Li

doi:10.1093/bioinformatics/btae429

Bioinformatics. 2024 Jul 2;40(7):btae429. doi: 10.1093/bioinformatics/btae429

Zhiming Zhang ¹,=, Jing Ren ²,=, Lili Ren ³,=, Lanying Zhang ⁴,=, Qubo Ai ⁵, Haixin Long ⁶, Yi Ren ⁷, Kun Yang ⁸, Huiying Feng ⁹, Sabrina Li ¹⁰,✉, Xu Li ¹¹,✉

Editor: Jonathan Wren

¹ Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China

² Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China

³ Equipment technology research institute, Science and Technology Research Center of China Customs, Tianshuiyuan street No. 6, Chaoyang District, Beijing, 100026, China

⁴ Research and Development Department, Coyote Diagnostics Lab (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 100095, China

⁵ Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China

⁶ Research and Development Department, Coyote Diagnostics Lab (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 100095, China

⁷ Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China

⁸ Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China

⁹ Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China

¹⁰ Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China

¹¹ Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China

✉ Corresponding authors. Research and Development Department, Coyote Bioscience (Beijing) Co., Ltd., Building 22, Zone 3, Gaolizhang Road, Haidian District, Beijing, 10095, China. E-mail: tclxcccc@126.com (X.L.); sabrina@coyotebio.com (S.L.)

= Zhiming Zhang, Jing Ren, Lili Ren and Lanying Zhang contributed equally.

Roles

Zhiming Zhang: Conceptualization, Methodology, Software, Visualization

Jing Ren: Investigation, Visualization, Writing - original draft, Writing - review & editing

Lili Ren: Conceptualization, Funding acquisition, Investigation, Writing - review & editing

Lanying Zhang: Investigation, Methodology

Qubo Ai: Conceptualization, Supervision

Haixin Long: Conceptualization, Supervision

Yi Ren: Supervision

Kun Yang: Supervision

Huiying Feng: Project administration, Supervision

Sabrina Li: Project administration, Supervision

Xu Li: Conceptualization, Project administration, Writing - original draft, Writing - review & editing

Jonathan Wren: Associate Editor

PMCID: PMC11246166 PMID: 38954836

Abstract

Motivation

Accurately detecting pathogenic microorganisms requires effective primer and probe designs. Literature-derived primers are a valuable resource because they have been tested and proven effective in previous research. However, manually mining primers from published texts is time-consuming and limited in species scope.

Results

To address these challenges, we developed MiPRIME, a real-time Microbial Primer Mining platform for extracting primer/probe sequences of pathogenic microorganisms, with three highlights: (i) Comprehensive integration. Covering >40 million articles and 548 942 organisms, the platform enables high-frequency microbial gene discovery from a global perspective, facilitating user-defined primer design and advancing microbial research. (ii) A BioBERT-based text mining model with 98.02% accuracy, which greatly reduces information processing time. (iii) A primer ranking score, $PR_{score}$, for intelligent recommendation of species-specific primers. Overall, MiPRIME is a practical tool for primer mining in the pan-microbial field, saving the time and cost of trial-and-error experiments.

Availability and implementation

The web server is available at https://www.ai-bt.com.

1 Introduction

Microbial detection is a crucial technology in health and safety fields such as clinical diagnosis (Lee et al. 2021), food safety, environmental monitoring, and biotechnology (Li et al. 2023, Nachega et al. 2023). Accurate detection of pathogenic microorganisms requires well-designed primers and probes that can specifically and sensitively identify and amplify the target sequences. However, primer and probe design remains a challenging process that involves various evaluation indices such as sequence specificity, thermodynamic stability, secondary structure, and multiplex compatibility. Several algorithm-based software tools have been developed to assist primer and probe design, such as Primer3 (Koressaar and Remm 2007, Untergasser et al. 2012), Primer-BLAST (Ye et al. 2012), and QuantPrime (Arvidsson et al. 2008). However, due to the complexity and diversity of microorganisms, these tools are limited by predefined databases, a lack of validation information, and uncertain experimental performance.

Literature-curated primers are valuable reference materials for primer and probe design, offering comprehensive information, experimental validation, and a high success rate (García-Remesal et al. 2010). However, manually screening thousands of references to find primers for a target microbe is highly time-consuming and laborious. In addition, existing literature-sourced primer and probe platforms are limited to specific species or regions, such as PrimerBank (Spandidos et al. 2010, Wang et al. 2012) for human and mouse primers, ProbeBase (Loy et al. 2003, Loy et al. 2007, Greuter et al. 2016) for ribosomal RNA genes, LCPDb-ARG (Gorecki et al. 2019) for antibiotic resistance genes, and MRPrimerV (Kim et al. 2017) for RNA viruses. Moreover, when multiple sequences are returned for the same microbial species, the specificity of literature-derived primers and probes is generally not evaluated sufficiently.

In this study, we constructed a novel real-time AI (artificial intelligence)-based microbial primer mining platform, MiPRIME (https://www.ai-bt.com). It extracts species-specific primer and probe sequences from the literature, with three highlights. First, MiPRIME is an integrated platform that captures primers for all species from the full-text literature in real time. To enable this, we constructed three databases as data mining references: a literature database with over 40 million abstracts or full texts, a species-wide genome database covering 548 942 species, and a database focused on antimicrobial resistance and virulence factor genes. These databases facilitate massive-scale data mining, overcome the limitation on species, and make the detection of pathogen-specific nucleic acid targets possible. Second, the MiPRIME platform uses a BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) (Lee et al. 2020)-based text mining model to recognize species-specific primers accurately. This text mining model was fine-tuned to high performance on a self-built, manually annotated corpus, which significantly reduces the time needed for manual reading and information processing. Third, by jointly considering multiple primer features, MiPRIME uses a $PR_{score}$ to automatically recommend species-specific primers. The suggested primers were experimentally validated and showed high specificity, greatly reducing the cost of trial and error. In addition, by gathering high-frequency microbial genes from a global perspective, the MiPRIME platform provides documentary evidence for user-defined primer design. In summary, to our knowledge, the MiPRIME platform is currently the only species-wide primer and probe sequence mining platform in the world. The platform fills a gap in the pan-microbial field of literature-sourced primer platforms and provides a powerful and user-friendly tool for microbial detection and diagnosis.

2 Materials and methods

2.1 Data collection and database construction

2.1.1 Literature collection and processing

We constructed an Elasticsearch (ES)-based literature database with >35 million abstracts collected from PubMed and 8.9 million full-text articles from PubMed Central® (PMC, https://www.ncbi.nlm.nih.gov/pmc/). The Python package lxml was then used to parse the full text of each paper for the title, abstract, authors, and publication date. The literature database can be freely accessed at http://106.37.92.187:1234/miprimer/.
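The parsing step can be illustrated with a minimal sketch. It is not the platform's code: it uses the standard-library `xml.etree.ElementTree` as a stand-in for lxml, the sample record is hand-written, and the tag names (`article-title`, `abstract`, `name`, `pub-date`) are assumed to follow the JATS-style layout used by PMC full-text XML.

```python
import xml.etree.ElementTree as ET

# Hand-written, JATS-like sample record (an assumption for illustration only).
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <title-group><article-title>Example title</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author"><name><surname>Doe</surname><given-names>Jane</given-names></name></contrib>
        <contrib contrib-type="author"><name><surname>Roe</surname><given-names>Richard</given-names></name></contrib>
      </contrib-group>
      <pub-date><day>2</day><month>7</month><year>2024</year></pub-date>
      <abstract><p>First sentence. <italic>Second</italic> sentence.</p></abstract>
    </article-meta>
  </front>
</article>"""


def text_of(element):
    """Return the whitespace-normalized text inside an element ('' if missing)."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def parse_article(xml_string):
    """Extract title, abstract, authors, and publication date from one record."""
    root = ET.fromstring(xml_string)

    authors = []
    for name in root.iter("name"):
        given = text_of(name.find("given-names"))
        surname = text_of(name.find("surname"))
        authors.append(f"{given} {surname}".strip())

    # Build an ISO-like date from whichever of year/month/day are present.
    date_parts = []
    pub_date = root.find(".//pub-date")
    if pub_date is not None:
        for tag in ("year", "month", "day"):
            value = text_of(pub_date.find(tag))
            if value:
                date_parts.append(value.zfill(2))

    return {
        "title": text_of(root.find(".//article-title")),
        "abstract": text_of(root.find(".//abstract")),
        "authors": authors,
        "date": "-".join(date_parts),
    }


record = parse_article(SAMPLE_XML)
print(record)
```

The resulting dictionary is the kind of flat record that could be indexed in a search engine such as ES; the field names here are illustrative.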

2.1.2 A species-wide database with reference genomes

A species-wide genome database was constructed to ensure species-specific primers through sequence alignment. To overcome the limited number of species in existing resources, we collected >548 942 microbial species from the Taxdb database (https://www.ncbi.nlm.nih.gov/Taxonomy/) (Schoch et al. 2020), including 7180 viruses, 24 784 bacteria, 55 881 fungi, etc. Reference genome sequences for all species were obtained from GenBank (https://ftp.ncbi.nlm.nih.gov/genomes/genbank/) (Benson et al. 2012) or RefSeq (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/) (O'Leary et al. 2016). The selection criteria for the standard genome sequence of each species were as follows. First, reference genomes were preferred for species with multiple versions of the genome, and representative sequences were used as the reference genomes for species without reference genomes. Second, complete genome assembly was preferred, followed by chromosome-level genome assembly. Otherwise, a complete genome was randomly selected. These genome sequences were stored in an ES-based species-wide database.
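One possible reading of these selection criteria is sketched below. The field names and values (`refseq_category`, `assembly_level`, "reference genome", "Complete Genome") mirror those found in NCBI assembly summary files, but the record layout, the ordering of the two criteria, and the random tie-break are assumptions made for illustration, not the authors' implementation.

```python
import random

# Lower rank = more preferred. Unknown values fall to the last rank.
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1}
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1}


def rank(assembly):
    """Rank by genome category first, then by assembly level (assumed order)."""
    return (
        CATEGORY_RANK.get(assembly["refseq_category"], 2),
        LEVEL_RANK.get(assembly["assembly_level"], 2),
    )


def pick_genome(assemblies, rng=None):
    """Pick one assembly for a species; ties are broken at random."""
    if not assemblies:
        raise ValueError("no assemblies available for this species")
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    best = min(rank(a) for a in assemblies)
    candidates = [a for a in assemblies if rank(a) == best]
    return rng.choice(candidates)


# Hypothetical accessions, for illustration only.
assemblies = [
    {"accession": "ASM_A", "refseq_category": "na", "assembly_level": "Complete Genome"},
    {"accession": "ASM_B", "refseq_category": "reference genome", "assembly_level": "Chromosome"},
    {"accession": "ASM_C", "refseq_category": "na", "assembly_level": "Scaffold"},
]
chosen = pick_genome(assemblies)
print(chosen["accession"])
```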

2.1.3 A database of antibiotic resistance and virulence genes

Since the resistance and virulence of pathogenic microorganisms are important in pathogen detection (Bradley et al. 2019), a database of 11 532 virulence genes and 5170 antimicrobial resistance (AMR) genes was constructed for gene classification and annotation. The AMR genes and sequences were obtained from the Comprehensive Antibiotic Resistance Database (CARD, https://card.mcmaster.ca/) (Alcock et al. 2019, Alcock et al. 2023), whereas the virulence genes and sequences were obtained from the Virulence Factor Database (VFDB, http://www.mgc.ac.cn/VFs/main.htm) (Chen et al. 2005). All the data were saved in a MySQL database.

2.1.4 A corpus of manually labeled microbial species and primer sequences

We constructed a Microbial Primer Corpus (MPC) with expert annotations to optimize the text mining model for microbial primers. A total of 500 full-text papers published from 2016 to 2022 and related to “microorganisms” AND “primers” were randomly selected from PMC. Following the corpus tagging rules (detailed in Supplementary Information S1), each article was manually annotated with forward primers, reverse primers, probes, species, genes, virulence genes, resistance genes, etc. In total, a corpus with 8458 labels of different types was constructed as the standard for the following Natural Language Processing (NLP) models.

2.2 A text mining model for the primer and probe sequence

2.2.1 Data cleaning

To identify sequences of primers and probes, we used two Python packages, lxml and pubmed_parser, to parse the XML file of a paper and obtain table information. Specifically, for primers targeting microbial antimicrobial resistance and virulence genes, we first selected papers with keywords such as “Drug Resistance, Microbial” [Mesh] or “Virulence” [Mesh].

2.2.2 Sequences of primers and probes extracted from literature

Primer and probe sequences were extracted from the full text and tables using a regular expression pipeline that matches strings of length 8–30 bp. These strings are composed of four nucleotide bases: adenine (A), cytosine (C), guanine (G), and thymine (T), as well as wildcard characters (such as B, D, H, K, M, N, S, V, W, and Y).
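A minimal sketch of such a pattern is shown below. The exact expressions used by the platform are not given in the text, so the character class (the four bases plus only the wildcard codes listed above), the uppercase-only matching, and the letter-boundary checks are assumptions.

```python
import re

# Four bases plus the wildcard codes named in the text; other IUPAC codes
# (for example R) could be added to the class if needed.
NUCLEOTIDES = "ACGTBDHKMNSVWY"

# 8-30 consecutive nucleotide characters that are not part of a longer word.
PRIMER_PATTERN = re.compile(
    rf"(?<![A-Za-z])[{NUCLEOTIDES}]{{8,30}}(?![A-Za-z])"
)


def extract_candidates(text):
    """Return candidate primer/probe strings in order of appearance."""
    return PRIMER_PATTERN.findall(text)


sample = (
    "F: 5'-GCACTCGCTACTATTTCTTACCTCAA-3', "
    "R: 5'-GTCACAATGTCTTGGAAACCAGTAAT-3'; "
    "degenerate ACGTNNACGTWS; too short ACGT."
)
candidates = extract_candidates(sample)
print(candidates)
```

Because the boundary checks reject any match that is flanked by letters, runs longer than 30 characters are skipped entirely rather than truncated. Candidates found this way still need downstream filtering, since some uppercase abbreviations can also satisfy the pattern.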

2.2.3 The identification of species and genes

Microbial species and genes were identified and extracted using a hybrid approach combining dictionary-based, BioBERT-based, and GNormPlus (Wei et al. 2015) methods. GNormPlus is a common integrative NER (Named Entity Recognition) approach for tagging genes, gene families, and protein domains.

2.2.4 Primer targeting relation extraction and correction

The relations between primers and species were extracted by semantic similarity based on word co-occurrence analysis. Because these relationships are uncertain, we used a sequence alignment tool, BLASTn (Johnson et al. 2008), to verify the targeted segment, and used qcovhsp (query coverage per high-scoring pair) to screen for confident species-primer connections. Details of the screening criteria are given in Supplementary data S2. Interactions between primers and resistance or virulence genes were identified and corrected in the same way.
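The screening step could look roughly like the sketch below, which filters BLASTn tabular output by qcovhsp. The actual criteria are in Supplementary data S2 and are not reproduced here, so both thresholds, the column order, and the sample hits are hypothetical. The input is assumed to come from a command such as `blastn -task blastn-short -outfmt "6 qseqid sseqid pident qcovhsp"`.

```python
def filter_hits(blast_tsv, min_qcovhsp=90.0, min_pident=90.0):
    """Keep hits whose query coverage per HSP and identity pass the cutoffs.

    Expects tab-separated lines with the columns:
    qseqid, sseqid, pident, qcovhsp. Both thresholds are placeholders.
    """
    kept = []
    for line in blast_tsv.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        qseqid, sseqid, pident, qcovhsp = line.split("\t")[:4]
        if float(qcovhsp) >= min_qcovhsp and float(pident) >= min_pident:
            kept.append((qseqid, sseqid, float(pident), float(qcovhsp)))
    return kept


# Made-up hits for two primers against hypothetical genome identifiers.
sample_hits = "\n".join([
    "primer_F\tgenome_target\t100.0\t100",
    "primer_F\tgenome_other\t95.0\t60",
    "primer_R\tgenome_target\t96.2\t100",
    "primer_R\tgenome_other\t85.0\t100",
])
confident = filter_hits(sample_hits)
print(confident)
```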

2.3 The accuracy of text mining models

The MPC corpus of 500 full texts was divided into three datasets of 300, 100, and 100 articles as the training, development, and test sets, respectively. The training and development sets were used for parameter optimization and model building, while the test set was used to evaluate the models. Four measurements (accuracy, precision, recall, and F1 score) were calculated to assess model performance. The accuracies of the text mining models were above 0.9. For more details, please refer to Supplementary data S3.
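For reference, the four measurements follow the standard definitions based on true/false positives and negatives. The sketch below computes them from confusion counts; the counts are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 90 true positives, 10 false positives,
# 5 false negatives, 95 true negatives.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(metrics)
```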

2.4 PR_score: a novel evaluation index for microbial primer recommendation

Numerous variables, including the primer’s specificity, coverage, melting temperature (Tm), GC content, hairpin structure, self-dimer, cross-dimer, etc., affect the performance of a primer for the detection of pathogens. Taking these factors into account, we proposed a weighted primer evaluation metric, the Primer Ranking score ($PR_{score}$), to thoroughly assess the performance of a primer. Literature-sourced primers on MiPRIME were ranked in descending order by $PR_{score}$, and those with high scores were recommended for experiments. The $PR_{score}$ was calculated using the following formula:

$$PR_{score} = P_{specificity} + P_{coverage} + L_{Tm} + L_{GC\%} - N_{hairpin} - N_{dimer} + N_{literature} \times \left(\frac{P_{specificity}}{100}\right)^{2} \qquad (1)$$

where $P_{specificity}$ refers to the percentage of the primers successfully mapped to the reference genome; $P_{coverage}$ refers to the percentage of the sequences successfully aligned; $L_{Tm}$ indicates whether the Tm is between 50°C and 80°C (Rozen and Skaletsky 2000), where 1 is yes and 0 is no; $L_{GC\%}$ indicates whether the GC content is between 40% and 60% (Lohoff et al. 2021, Takei et al. 2021), where 1 is yes and 0 is no; $N_{hairpin}$ is the sum of the forward primer hairpin score level and the reverse primer hairpin score level, where the hairpin score level indicates whether the ΔG of the hairpin calculated by MFEprimer (Wang et al. 2019) is <4 (1 is yes, 0 is no); $N_{dimer}$ is the sum of the forward primer self-dimer score level, the reverse primer self-dimer score level, and the forward/reverse cross-dimer score level, where the dimer score level indicates whether the ΔG of the dimer calculated by MFEprimer is <4 (1 is yes, 0 is no); and $N_{literature}$ is the number of supporting articles. We scaled the literature weight by the specificity of the primers, since primers with lower specificity usually have higher numbers of supporting references.
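Equation (1) can be evaluated directly once the individual terms are known. The sketch below is an illustration only: the `PrimerPair` container, its field names, and the example values are hypothetical, and the terms are taken as precomputed inputs (for example, the hairpin and dimer levels would come from a tool such as MFEprimer).

```python
from dataclasses import dataclass


@dataclass
class PrimerPair:
    name: str
    specificity: float  # P_specificity, percent (0-100)
    coverage: float     # P_coverage, percent (0-100)
    tm_ok: bool         # L_Tm: Tm within 50-80 degrees C
    gc_ok: bool         # L_GC%: GC content within 40-60%
    n_hairpin: int      # N_hairpin: 0-2 (forward + reverse)
    n_dimer: int        # N_dimer: 0-3 (two self-dimers + cross-dimer)
    n_literature: int   # N_literature: number of supporting articles


def gc_in_range(sequence, low=40.0, high=60.0):
    """Return True if the GC percentage of the sequence lies in [low, high]."""
    seq = sequence.upper()
    gc_percent = 100.0 * sum(base in "GC" for base in seq) / len(seq)
    return low <= gc_percent <= high


def pr_score(p):
    """Evaluate Equation (1) term by term."""
    return (
        p.specificity
        + p.coverage
        + int(p.tm_ok)
        + int(p.gc_ok)
        - p.n_hairpin
        - p.n_dimer
        + p.n_literature * (p.specificity / 100) ** 2
    )


# Two made-up primer pairs with the same literature support.
pairs = [
    PrimerPair("low_specificity", 50, 90, True, True, 0, 1, 5),
    PrimerPair("high_specificity", 100, 90, True, gc_in_range("GGCCAATT"), 0, 1, 5),
]
ranked = sorted(pairs, key=pr_score, reverse=True)
for pair in ranked:
    print(pair.name, pr_score(pair))
```

With identical literature counts, the pair with 50% specificity receives only a quarter of the literature bonus (5 × 0.25 = 1.25 instead of 5), which shows how the squared specificity factor discounts heavily cited but unspecific primers.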

2.5 Development tools

The MiPRIME web server was designed to be user-friendly and is accessible via computers, tablets, and mobile phones at https://www.ai-bt.com. The front end of the MiPRIME platform was developed using CSS, PHP, JavaScript, HTML, XML, etc., whereas the back end was implemented using MySQL, ES, Python, etc.

3 Results

3.1 MiPRIME: an integrated primer text mining platform for all species

In this study, we constructed MiPRIME, one of the few literature-derived primer/probe mining platforms, which enables users to obtain high-quality primer and probe sequences for any target microbial species from 40 million published articles. As shown in Fig. 1, the construction of the MiPRIME platform consists of three major steps. First, database construction (Fig. 1A): to build a pan-species primer/probe mining platform, we assembled a comprehensive in-house data source for worldwide primer/probe information, comprising three databases: (i) a biomedical literature database with 40 million open-access articles that provide sources of worldwide published pan-species microbial primer/probe sequences; (ii) a pan-species sequence database with 548 942 microorganism reference genomes to validate the accuracy of the discovered primer/probe sequences; and (iii) an AMR and virulence factor (VF) database for annotating the resistance and toxin genes of target microbial species. Second, the primer/probe extraction model (Fig. 1B): to accurately identify target primer/probe information from the massive body of literature, we developed a BioBERT-based text mining model that was fine-tuned on our in-house manually annotated corpus. In this model, entity recognition methods such as regular expression matching, dictionary-based matching, and GNormPlus (Wei et al. 2015) were used to identify primers and species, and the associations between them were supported and confirmed by both word co-occurrence analysis and sequence alignment. Third, best primer/probe recommendation (Fig. 1C): to automatically recommend species-specific primers with the best coverage and high PCR success rates, we proposed a primer evaluation metric, $PR_{score}$, which is a weighted sum of primer specificity, coverage, GC content, Tm, hairpin, dimer, etc. In general, primers with the highest $PR_{score}$ are preferred.

Figure 1. The pipeline of the MiPRIME platform, which consists of three major steps: (A) A comprehensive in-house database as a data source for worldwide primer/probe information, including: (i) a biomedical literature database with 40 million open-access articles; (ii) a pan-species sequence database with 548 942 microorganism reference genomes to validate the accuracy of discovered primer/probe sequences; (iii) an AMR and VF database for annotating the resistance and toxin genes of target microbial species; and (iv) a microbial primer corpus for the optimization of the BioBERT-based model. (B) Accurate BioBERT-based text mining models for species-specific primers and probes. (C) $PR_{score}$: a novel index for microbial primer recommendation

3.2 The main functions of the MiPRIME platform: a case of Streptococcus pyogenes

Here we provide a case study on the usage and functionality of the MiPRIME platform, using Streptococcus pyogenes [also known as group A streptococci (GAS)] as an example (Fig. 2). The MiPRIME platform allows users to enter a species name or Tax ID to mine primers from the literature (Fig. 2A). The search box features an autocomplete function for convenience. After a real-time calculation, users can review the task results in their personal center. The results are grouped into three panels. The first panel (upper panel) presents a comprehensive list of literature-derived primers ranked by $PR_{score}$ (Fig. 2B). For GAS, 285 primer pairs and 6 probes were collected from 3685 articles. According to $PR_{score}$, the primer pair (forward primer: GCACTCGCTACTATTTCTTACCTCAA; reverse primer: GTCACAATGTCTTGGAAACCAGTAAT) and probe (CCGCAACTCATCAAGGATTTCTGTTACCA) were the preferred option. For species-specific detection, an additional 90 primers with high levels of specificity were proposed as optional primers. The second panel (lower panel) reveals the detection hotspots of a given species from the perspective of the global literature (Fig. 2C). The hotspot genes of GAS include speB (Deng et al. 2022), covS, covR, emm, enn, hasA, ropB, scpA/B, etc. Some of these genes are virulence genes that are highly correlated with microbial pathogenicity, while others are commonly used for strain typing. The statistics on these hotspot primer/probe design regions will assist researchers in designing their own primers/probes. The third panel is a submodule of MiPRIME for mining the primer/probe sequences of resistance and toxin genes. Here, a total of 3167 primers for 368 resistance genes were identified and extracted from all published literature, revealing the general rules of microbial resistance, pathogenicity, and epidemics (see Supplementary data S4 and S5). In the case of GAS, primers for eight resistance genes were found in 2001 articles (Fig. 2D). Among these genes, mel had the largest amount of supporting literature. We linked this antibiotic resistance gene to an independently developed AI literature platform (http://106.37.92.187:1234/miprimer/) to directly observe its global distribution, associated species, and diseases. This highlights the significance of detecting this gene.

Figure 2. The usage and functionality of the MiPRIME platform. (A) Home page; (B) the summary list of literature-sourced primers and probes, with primers sorted by $PR_{score}$ and divided into four groups by specificity level; (C) the detection hotspots of a given species from the perspective of the global literature; (D) the submodule of MiPRIME for drug resistance and toxin gene detection; (E) the major functions of MiPRIME

Although we use GAS as an example in this study, the MiPRIME platform has wide applicability across various microorganisms, including bacteria, fungi, and viruses. To validate MiPRIME's performance and reliability, we selected three representative species, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, the cause of Covid-19), Candida albicans, and Chlamydia trachomatis, and tested both high-specificity and low-specificity primers by RT-qPCR (Real-time quantitative polymerase chain reaction). Furthermore, we assessed primer specificity using human coronavirus 229E (HCoV-229E), Candida tropicalis, and Chlamydia pneumoniae. Our analysis of amplification curves, melting curves, and agarose gel electrophoresis revealed that primers with a high $PR_{score}$ are more effective in producing target-specific PCR products. Conversely, primers with a low $PR_{score}$ were associated with anomalies such as amplification of nontarget species sequences, formation of primer dimers, or absence of amplification products. Detailed experimental procedures and results can be found in Supplementary data S6.

Overall, the MiPRIME platform serves as a valuable resource for researchers in the field of microbial detection and diagnosis. Its approach to literature-derived primer mining, combined with its user-friendly interface and comprehensive functionality, makes it a useful tool for advancing scientific understanding in various domains related to microbial detection.

4 Discussion

In this study, we developed MiPRIME, one of the few platforms for real-time mining of primers and probes across all microbial species. It addresses the existing challenges in primer design by providing a comprehensive solution that combines literature mining, species-specific database integration, and advanced text mining techniques.

Our platform addresses the problem of time-consuming and labor-intensive manual screening of thousands of references for target primers/probes. Additionally, it overcomes the limitations of existing primer platforms, which are often restricted to specific species or regions and lack validation information. By integrating multiple databases and using a refined biomedical NLP model, the MiPRIME platform enables efficient extraction of species-specific primers with high accuracy. It also offers users the flexibility to design their own primers while considering various primer features. Importantly, MiPRIME fills a significant gap in the field of literature-derived primer platforms by providing a species-wide primer mining solution. It addresses the need for comprehensive information, experimental validation, and high success rates in primer design. Furthermore, the platform compiles statistics on primer design hotspots from a global perspective, assisting researchers in designing their own primers and gaining insights into pathogen detection.

More interestingly, these primers and genes extend the potential applications of MiPRIME, such as the detection of a single given pathogen (Fig. 3A) or of multiple pathogen species (Fig. 3B). For example, in designing a multiplex respiratory pathogen panel for influenza A, influenza B, and Covid-19, the MiPRIME platform allows us to collect experimentally validated species-specific primers for each species, which can be combined into a candidate list for multiplex PCR primer selection. The stability of the combined primers is then assessed across multiple dimensions, such as Tm, hairpin structure, homodimers, and heterodimers. Finally, an optimal combination of primers is recommended for multiple pathogen detection.

Figure 3. The application of the MiPRIME platform in pathogen detection. (A) For a given pathogen, the highly specific primer is recommended according to $PR_{score}$ or $P_{specificity}$. (B) The potential application of multi-species pathogen detection

Overall, our platform represents a creative and efficient tool that improves primer/probe design and contributes to advancements in microbial detection and diagnosis. Although the primer/probe sequences originate from the experimental sections of published work, which makes them reliable and valuable for beginners, users should still evaluate the outcomes derived from these primers/probes critically, in view of the following limitations: (i) Due to copyright issues with certain journals, we could include only a limited number of articles with full texts, which limits the primer/probe sequence information available. In addition, we did not account for the potential bias introduced by outdated information or writing errors in the literature, so users should exercise caution when using these primer and probe sequences. (ii) The platform currently does not distinguish between different forms of PCR, such as conventional PCR, real-time PCR, nested PCR, reverse-transcription PCR, and multiplex PCR. Users therefore need to follow the links to the full texts for specific details. In the future, we aim to overcome these shortcomings and enhance the user experience by improving the platform.

Acknowledgements

We thank the developers of PubMed, PMC, Taxdb, CARD, and VFDB for the continuous development of their respective databases over the past several years, which has greatly facilitated studies in microbiology. We also thank the developers of PHP, Python, and ES, which have furthered the development of our site.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported by the National Key Research and Development Program of China [2022YFC2602400].

Data availability

The data underlying this article are available in the article and in its online supplementary material. Additionally, related data and materials can be accessed through the platform provided at https://www.ai-bt.com.

References

Alcock BP, Huynh W, Chalil R et al. CARD 2023: expanded curation, support for machine learning, and resistome prediction at the comprehensive antibiotic resistance database. Nucleic Acids Res 2023;51:D690–9.
Alcock BP, Raphenya AR, Lau TTY et al. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res 2019;48:D517–25.
Arvidsson S, Kwasniewski M, Riaño-Pachón DM et al. QuantPrime – a flexible tool for reliable high-throughput primer design for quantitative PCR. BMC Bioinformatics 2008;9:465.
Benson DA, Cavanaugh M, Clark K et al. GenBank. Nucleic Acids Res 2012;41:D36–42.
Bradley P, den Bakker HC, Rocha EPC et al. Ultrafast search of all deposited bacterial and viral genomic data. Nat Biotechnol 2019;37:152–9.
Chen L, Yang J, Yu J et al. VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res 2005;33:D325–8.
Deng W, Bai Y, Deng F et al. Streptococcal pyrogenic exotoxin B cleaves GSDMA and triggers pyroptosis. Nature 2022;602:496–502.
García-Remesal M, Cuevas A, López-Alonso V et al. A method for automatically extracting infectious disease-related primers and probes from the literature. BMC Bioinformatics 2010;11:410.
Gorecki A, Decewicz P, Dziurzynski M et al. Literature-based, manually-curated database of PCR primers for the detection of antibiotic resistance genes in various environments. Water Res 2019;161:211–21.
Greuter D, Loy A, Horn M et al. probeBase – an online resource for rRNA-targeted oligonucleotide probes and primers: new features 2016. Nucleic Acids Res 2016;44:D586–9.
Johnson M, Zaretskaya I, Raytselis Y et al. NCBI BLAST: a better web interface. Nucleic Acids Res 2008;36:W5–9.
Kim H, Kang N, An K et al. MRPrimerV: a database of PCR primers for RNA virus detection. Nucleic Acids Res 2017;45:D475–81.
Koressaar T, Remm M. Enhancements and modifications of primer design program Primer3. Bioinformatics 2007;23:1289–91.
Lee J, Yoon W, Kim S et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020;36:1234–40.
Lee CY, Degani I, Cheong J et al. Development of integrated systems for on-site infection detection. Acc Chem Res 2021;54:3991–4000.
Li H, Xie Y, Chen F et al. Amplification-free CRISPR/Cas detection technology: challenges, strategies, and perspectives. Chem Soc Rev 2023;52:361–82.
Lohoff T, Ghazanfar S, Missarova A et al. Integration of spatial and single-cell transcriptomic data elucidates mouse organogenesis. Nat Biotechnol 2021;40:74–85.
Loy A, Horn M, Wagner M et al. probeBase: an online resource for rRNA-targeted oligonucleotide probes. Nucleic Acids Res 2003;31:514–6.
Loy A, Maixner F, Wagner M et al. probeBase – an online resource for rRNA-targeted oligonucleotide probes: new features 2007. Nucleic Acids Res 2007;35:D800–4.
Nachega JB, Nsanzimana S, Rawat A et al. Advancing detection and response capacities for emerging and re-emerging pathogens in Africa. Lancet Infect Dis 2023;23:e185–9.
O'Leary NA, Wright MW, Brister JR et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 2016;44:D733–45.
Schoch CL, Ciufo S, Domrachev M. et al. NCBI taxonomy: a comprehensive update on curation, resources and tools. Database, 2020;2020:baaa062. [DOI] [PMC free article] [PubMed] [Google Scholar]
Spandidos A, Wang X, Wang H. et al. PrimerBank: a resource of human and mouse PCR primer pairs for gene expression detection and quantification. Nucleic Acids Res 2010;38:D792–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rozen S, Skaletsky H.. Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol (Clifton, N.J.) 2000;132:365–86. [DOI] [PubMed] [Google Scholar]
Takei Y, Yun J, Zheng S. et al. Integrated spatial genomics reveals global architecture of single nuclei. Nature 2021;590:344–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
Untergasser A, Cutcutache I, Koressaar T. et al. Primer3—new capabilities and interfaces. Nucleic Acids Res 2012;40:e115. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang K, Li H, Xu Y. et al. MFEprimer-3.0: quality control for PCR primers. Nucleic Acids Res 2019;47:W610–3. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang X, Spandidos A, Wang H. et al. PrimerBank: a PCR primer database for quantitative gene expression analysis, 2012 update. Nucleic Acids Res 2012;40:D1144–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wei C-H, Kao H-Y, Lu Z. et al. GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. Biomed Res Int 2015;2015:918710–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ye J, Coulouris G, Zaretskaya I. et al. Primer-BLAST: a tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics 2012;13:134. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

btae429_Supplementary_Data: btae429_supplementary_data.zip (7.8 MB, zip)
