Abstract
The wealth of sequence data available on public databases is increasing at an exponential rate, and while tremendous efforts are being made to make access to these resources easier, these data can be challenging for researchers to reuse because submissions are made from numerous laboratories with different biological objectives, resulting in inconsistent naming conventions and sequence content. Researchers can manually inspect each sequence and curate a dataset by hand but automating some of these steps will reduce this burden. This paper is a step-by-step guide describing how to identify all proteins containing a specific domain with the Conserved Protein Domain Architecture Retrieval Tool, download all associated amino acid sequences from NCBI Entrez, tabulate, and clean the data. I will also describe how to extract the full taxonomic information and computationally predict some physicochemical properties of the proteins based on amino acid sequence. The resulting data are applicable to a wide range of bioinformatic analyses where publicly available data are utilized.
• Step-by-step guide to gathering, cleaning, and parsing data from publicly available databases for computational analysis, plus supplementation of taxonomic data and physicochemical characteristics from sequence data.
• This strategy allows for reuse of existing large-scale publicly available data for different downstream applications to answer novel biological questions.
Keywords: RNA-glycosylase, Ribosome inactivating protein, Gene tree, Phylogenetic inference, Bioinformatic analysis, Protein domain, Sequence conservation, Data mining
Abbreviations: RIP, Ribosome inactivating protein
Graphical abstract
Specifications table
| Subject Area; | Bioinformatics |
| More specific subject area; | Preparation of protein domain-based mined data for phylogenetic and computational analysis |
| Method name; | Text manipulation for mined biological data |
| Name and reference of original method; | No original method used |
| Resource availability; |
|
1. Identify all protein sequences containing the domain of interest
The example used here was the input data for the analyses described in Dougherty and Hudak [3]. The Conserved Domains section of NCBI (https://www.ncbi.nlm.nih.gov/cdd/) contains a database of protein domains collected from a variety of external databases. Here you can search for your domain of interest; for this example the ribosome inactivating protein (RIP domain) will be used (Fig. 1). When you select your domain of interest you will be redirected to a page which outlines details about the domain, including protein structure, related domain families, and representative sequences (Fig. 2). Under the drop-down window called “Links” select “Architectures” to be redirected to the conserved domain architecture retrieval tool [4,9].
Fig. 1.
Screenshot of the Conserved Domains entry for pfam00161: RIP (https://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=395109).
Fig. 2.
Screenshot of the results page on the Conserved Domain Architecture Retrieval Tool using ‘pfam00161: RIP’ as the query.
Here you will see a graphical view of all the proteins in NCBI with annotations for your query domain, and any other domains that have been annotated as well; they will be separated into combinations of domains. These results can be filtered by taxonomy from the drop-down menu at the top. Under “Filter by taxonomy” select “NCBI taxonomy tree”, select your taxonomic group of interest (in this case plants), select “Include” at the bottom, and click “Apply” at the top to apply the changes.
To access the amino acid sequence data of the identified proteins, navigate to the domain configuration of interest and click “Lookup sequences in Entrez”. This will redirect you to the search results in the Proteins section of NCBI [7]. Download all sequences by selecting “Send to:” > “File” > “FASTA” > “Create file” in the top right corner. If you are interested in investigating more than one domain configuration, as is the case in this example, go back to the previous page and repeat this process for each domain configuration, then copy and paste the sequences into a single FASTA file. The raw data used in this example are available in Supplementary Data 1.
2. Clean and tabulate data in R
The following code blocks are all in the R programming language and were written in RStudio as a markdown file. This file, along with its accompanying HTML output which includes the results of each intermediate step, is available in Supplementary Data 2 and 3, respectively.
2.1. Load in data
Open RStudio and load the required packages: seqinr [2], Biostrings [6], Peptides [5], tidyverse [8], and taxize [1].
library(seqinr) # Biological Sequences Retrieval and Analysis, CRAN v4.2-8
library(Biostrings) # Efficient manipulation of biological strings, Bioconductor v2.62.0
library(Peptides) # Calculate Indices and Theoretical Physicochemical Properties of Protein Sequences, CRAN v2.4.4
library(tidyverse) # Many useful packages for data manipulation and plotting, CRAN v1.3.1
library(taxize) # Taxonomic Information from Around the Web, CRAN v0.9.99
2.2. Import the FASTA file, convert to table
fasta1 <- readAAStringSet("sequence.fasta", use.names=TRUE)
dataset_fasta1 <- data.frame(names(fasta1), paste(fasta1))
colnames(dataset_fasta1) <- c("Name","Sequence")
# Count how many sequences
print(paste0("Number of sequences before filtering: " , nrow(dataset_fasta1)))
2.3. Filter by character patterns
Most sequences will have flags in the FASTA description line indicating if a sequence is incomplete or low quality; therefore, you can remove these sequences with specific keyword searches. The commands shown below are not exhaustive but instead show some examples of potential keywords that can be used for protein data. Partial sequences can be further filtered by removing sequences that do not start with a methionine. The results of this code are visualized in Fig. 3.
Fig. 3.
Output of ‘head(dataset_fasta1, n=10)’ from the code block in Section 2.4.
dataset_fasta1 <- dataset_fasta1[!str_detect(dataset_fasta1$Name,"partial"),]
dataset_fasta1 <- dataset_fasta1[!str_detect(dataset_fasta1$Name,"[Cc]hain"),]
dataset_fasta1 <- dataset_fasta1[!str_detect(dataset_fasta1$Name,"fragment"),]
dataset_fasta1 <- dataset_fasta1[!str_detect(dataset_fasta1$Name,"LOW "),]
dataset_fasta1 <- dataset_fasta1[!str_detect(dataset_fasta1$Name,"truncated"),]
dataset_fasta1 <- dataset_fasta1[!str_detect(dataset_fasta1$Name,"protein product"),]
# Select only sequences that start with methionine
dataset_fasta1 <- dataset_fasta1[str_detect(dataset_fasta1$Sequence,"^M"),]
# Remove gaps/stop codons (can cause errors in other programs)
dataset_fasta1$Sequence <- gsub("\\-", "", dataset_fasta1$Sequence)
# Count how many sequences survived this filtering process
print(paste0("Number of sequences after this filtering step: ", nrow(dataset_fasta1)))
# Inspect the table (Table 1)
head(dataset_fasta1, n=10)
2.4. Identify missing species instances
Some entries will not have the standard notation for species name, which are surrounded by square brackets. Find sequences without species names within the table, then use the accession number to find the species of origin on NCBI and add them manually to the FASTA file. Then you can reload the updated FASTA file into R and continue to the next step. If the output of the following command is empty, then there are no sequences with missing species names and no action is required.
dataset_fasta1[!str_detect(dataset_fasta1$Name,"\\["),]
2.5. Extract species names
Extract the species names by selecting the characters between the square brackets in the ‘Names’ column. It may also be useful to replace or delete ‘special characters’ such as periods and spaces, as they can cause errors for other programs in future analyses.
gene_tax1 <- sub(".*\\[([^][]+)].*", "\\1", dataset_fasta1$Name)
# Replace the spaces with underscores
gene_tax1 <- gsub(" ","_",gene_tax1)
# Remove the periods
gene_tax1 <- gsub("\\.","",gene_tax1)
# Inspect
head(gene_tax1, n=20)
2.6. Clean gene IDs
Clean gene IDs by removing everything except the accession number. Not all submissions will follow the same naming conventions, but all information about a sequence can be retrieved with the accession number so it is the only piece that is necessary to keep. Again, this code is not exhaustive but merely shows some examples of what can be done; be sure to inspect your sequence names to see what kinds of details you need to consider.
gene_ID1 <- dataset_fasta1$Name
# Remove any lowercase letters plus a vertical bar present before the accession number (eg. sp|P22851)
gene_ID1 <- str_remove(gene_ID1, "^[a-z]+\\|")
# Keep only the accession number, plus the version
# This is denoted by a combination of capital letters and numbers and sometimes underscores, followed by a period then a single number
gene_ID1 <- str_extract(gene_ID1, "^[A-Z0-9_]+\\.[0-9]")
# Optional: remove version number
gene_ID1 <- str_remove(gene_ID1, "\\.[0-9]")
head(gene_ID1, n=20)
2.7. Add the accession number and species name to separate columns of the original table
The results of this code are visualized in Fig. 4.
Fig. 4.
Output of ‘head(dataset_fasta1, n=10)’ from the code block in Section 2.7.
The results of this code are visualized in Table 2.
dataset_fasta1$Gene_tax <- gene_tax1
dataset_fasta1$Gene_ID <- gene_ID1
# Remove the old ‘Names’ column
dataset_fasta1 <- dataset_fasta1[,c("Gene_ID", "Gene_tax", "Sequence")]
# Inspect (Table 2)
head(dataset_fasta1, n=10)
2.8. Check that there are no empty cells in the table
This command will return no results if all cells contain data. If any results are missing an accession number, you can use the amino acid sequence to search your raw data FASTA file and see if this number is missing or if some part of the code caused it to be lost. Row 41 in this example is missing the accession number (Fig. 5), which corresponds to line 484 of the raw FASTA file (Supplementary Data 1). The accession number provided there lacks the version number, which means that the code used in Section 2.6 above for extracting this information found no match with the expected pattern. Because the accession number is present in the FASTA file, the missing data can be added into the table.
Fig. 5.
Output of ‘dataset_fasta1[is.na(dataset_fasta1),]’ in Section 2.8.
# Check for missing data (Table 3)
dataset_fasta1[is.na(dataset_fasta1),]
# Find out which rows are affected (output to console in this case will be: 41)
which(is.na(dataset_fasta1))
# Add missing accession number
dataset_fasta1$Gene_ID[41] <- "2019502A"
# Check again (should be an empty data frame)
dataset_fasta1[is.na(dataset_fasta1),]
2.9. Check for duplicate sequences
Check that there are no duplicate sequences by calculating the pairwise percentage identity of all sequences. This is necessary because there are instances where different researchers submitted the sequence of the same gene to NCBI at different times, but the sequences were not 100% identical. The following code will iterate through each sequence and do a pairwise comparison with every other sequence, tabulate the results, and save the entries with a sequence identity over 99% into a new table.
Note, the speed of this process will greatly vary depending on the number of sequences searched and the computational power allocated to R. The dataset used in this example contained approximately 820 sequences and took several minutes to run. The results of this code are visualized in Fig. 6.
Fig. 6.
Output of ‘head(table_pairwise_I, n=5)’ from the code block in Section 2.9.
end <- length(dataset_fasta1$Gene_ID)
count <- 1:end
table_pairwise_I <- data.frame(gene1 = character(), gene2 = character(), Percent_identity = double())
for (i in count){
pairwise <- pairwiseAlignment(pattern = dataset_fasta1$Sequence[i:end], subject = dataset_fasta1$Sequence[i])
pi <- data.frame(Percent_identity=pid(pairwise), gene_id_query=dataset_fasta1$Gene_ID[i], gene_id_test=dataset_fasta1$Gene_ID[i:end], gene_tax_query=dataset_fasta1$Gene_tax[i], gene_tax_test=dataset_fasta1$Gene_tax[i:end], sequence_test=dataset_fasta1$Sequence[i:end])
table_pairwise_I <- rbind(table_pairwise_I, pi[pi$Percent_identity > 99,])
}
# Inspect output (Table 4)
head(table_pairwise_I, n=5)
2.10. Make a table of the duplicates
The output will be all pairwise comparisons in your dataset, including those between other species and to itself. If you are dealing with multiple species, this may result in the identification of orthologs rather than actual duplicates, so these should be excluded. The results of this code are visualized in Fig. 7, and the csv file saved at this step is available under Supplementary Data 4.
Fig. 7.
Output of ‘head(table_pairwise_I2, n=5)’ from the code block in Section 2.10.
# Remove the entries where the query is the same as the test
table_pairwise_I2 <- table_pairwise_I[table_pairwise_I$gene_id_query != table_pairwise_I$gene_id_test,]
# Remove ones where the query and test are from different species
table_pairwise_I2 <- table_pairwise_I2[table_pairwise_I2$gene_tax_query == table_pairwise_I2$gene_tax_test,]
# Save results to a file, for reference (Supplementary Data 4)
write.csv(table_pairwise_I2, "pairwise_percent_identity_over_99.csv",row.names = FALSE)
# Inspect output (Table 5)
head(table_pairwise_I2, n=5)
2.11. Remove duplicates
Remove all test sequences that matched with over 99% similarity between two sequences in the same species. The results of this code are visualized in Fig. 8.
Fig. 8.
Output of ‘head(dataset_fasta1, n=10)’ from the code block in Section 2.11.
dataset_fasta1 <- dataset_fasta1[! dataset_fasta1$Gene_ID %in% table_pairwise_I2$gene_id_query, ]
# See how many sequences survived through to this stage of the filtering process
print(paste0("Number after filtering: ", nrow(dataset_fasta1)))
# Inspect table (Table 6)
head(dataset_fasta1, n=10)
2.12. Save cleaned and filtered data as a FASTA file
The file generated from this code is available in Supplementary Data 5.
write.fasta(strsplit(dataset_fasta1$Sequence,""), paste(dataset_fasta1$Gene_ID, datset_fasta1$Gene_tax, sep = "-"), "filtered_sequences.fasta", open = "w", as.string = F)
3. Add physicochemical properties and detailed taxonomic information for each sequence
3.1. Tabulate species representation
If you are working with a large dataset from a variety of species, as is the case in this example, it is useful to tabulate the species representation and how many proteins are associated with each species. This can be repeated later for any taxonomic level by replacing ‘Gene_tax’ with the column name of the taxonomic level of interest. The results of this code are visualized in Fig. 9.
Fig. 9.
Output of ‘head(table_summary, n=10)’ from the code block in Section 3.1.
table_summary <- as.data.frame(table(dataset_fasta1$Gene_tax))
colnames(table_summary) <- c("Species", "Number_of_sequences")
# How many species are represented in this dataset?
print(paste0("Total number of species: ", length(table_summary$Species)))
# Inspect table (Table 7)
head(table_summary, n=10)
3.2. Make a table of the full taxonomy of each species based on the NCBI taxonomy classification
Note that this will take several minutes to run as retrieving the data for each species takes a couple of seconds. The results of this code are visualized in Fig. 10, and the csv file generated from this code is available under Supplementary Data 6.
Fig. 10.
Output of ‘head (taxonomy_summary, n=5)’ from the code block in Section 3.2.
# Convert from data type ‘factor’ to ‘character’
table_summary$Species <- as.character(table_summary$Species)
nspecies <- length(table_summary$Species)
# Make empty data frame
full_tax <- data.frame(Species=table_summary$Species, Genus=character(nspecies), Family=character(nspecies), Order=character(nspecies), Class=character(nspecies), Phylum=character(nspecies))
# Fill in data frame for each protein
for (i in full_tax$Species){
full_tax$Genus[full_tax$Species == i] <- tax_name(i, get = "genus", db = "ncbi")$genus
full_tax$Family[full_tax$Species == i] <- tax_name(i, get = "family", db = "ncbi")$family
full_tax$Order[full_tax$Species == i] <- tax_name(i, get = "order", db = "ncbi")$order
full_tax$Class[full_tax$Species == i] <- tax_name(i, get = "class", db = "ncbi")$class
full_tax$Phylum[full_tax$Species == i] <- tax_name(i, get = "phylum", db = "ncbi")$phylum
}
taxonomy_summary <- merge(table_summary,full_tax, by= "Species")
# Inspect (Table 8)
head(taxonomy_summary, n=5)
# Save results to a file, for reference (Supplementary Data 6)
write.csv(full_tax, file = "detailed_taxonomy.csv", row.names = FALSE)
3.3. Calculate physicochemical properties
Computationally infer physicochemical properties for each amino acid sequence: aliphatic index, Bowman potential protein interaction index, theoretical net charge, hydrophobicity index, instability index, molecular weight, monoisotopic mass over charge ratio, and isoelectric point. This package can calculate more properties than what is shown here, so this is just an example of some of them. Note: if there are unusual characters in your sequence (e.g., B, U, X, Z, *, or any number) then this code will produce an error. You can remove these sequences in the same way you removed those that did not start with a methionine. Alternatively, you can replace the amino acid with another character or with nothing (i.e., empty quotes) the same way as was done to remove special characters from the taxonomic names.
dataset_fasta1$aliphatic_index <- aIndex(dataset_fasta1$Sequence)
dataset_fasta1$Boman_Potential_Protein_Interaction_index <- boman(dataset_fasta1$Sequence)
dataset_fasta1$theoretical_net_charge <- charge(dataset_fasta1$Sequence, pH = 7, pKscale = "Lehninger")
dataset_fasta1$hydrophobicity_index <- hydrophobicity(dataset_fasta1$Sequence, scale = "KyteDoolittle")
dataset_fasta1$instability_index <- instaIndex(dataset_fasta1$Sequence)
dataset_fasta1$molecular_weight <- mw(dataset_fasta1$Sequence)
dataset_fasta1$monoisotopic_mass_over_charge_ratio <- mz(dataset_fasta1$Sequence)
dataset_fasta1$isoelectic_point <- pI(dataset_fasta1$Sequence, pKscale = "EMBOSS")
dataset_fasta1[order(dataset_fasta1$Gene_ID),]
head(dataset_fasta1, n=10)
3.4. Combine results into a single table
The csv file generated from this code is available under Supplementary Data 7, and the text file is available under Supplementary Data 8.
for (i in dataset_fasta1$Gene_tax){
dataset_fasta1$Genus[dataset_fasta1$Gene_tax == i] <- taxonomy_summary$Genus[taxonomy_summary$Species == i]
dataset_fasta1$Family[dataset_fasta1$Gene_tax == i] <- taxonomy_summary$Family[taxonomy_summary$Species == i]
dataset_fasta1$Order[dataset_fasta1$Gene_tax == i] <- taxonomy_summary$Order[taxonomy_summary$Species == i]
dataset_fasta1$Class[dataset_fasta1$Gene_tax == i] <- taxonomy_summary$Class[taxonomy_summary$Species == i]
dataset_fasta1$Phylum[dataset_fasta1$Gene_tax == i] <- taxonomy_summary$Phylum[taxonomy_summary$Species == i]
}
# Save full table (Supplementary Data 7)
write.csv(dataset_fasta1, file = "tabulated_cleaned_data.csv", row.names = FALSE)
# Save accession numbers only (Supplementary Data 8)
write.table(dataset_fasta1$Gene_ID, “accessions.txt”, quote=FALSE, row.names=FALSE, col.names=FALSE)
# Inspect
head(dataset_fasta1, n=20)
The final output of this process is included in Dougherty and Hudak [3], Supplementary Data 1 and Supplementary Data 2.
4. Manual inspection of sequences
The protein sequence data on NCBI come from a variety of sources with different experimental purposes. Therefore, it may be necessary to assess the quality of all sequences and filter any that do not meet the standards of your experiment. While many of these steps have been done in R, some manual inspection is still advised by reading the GenPept entries of each sequence. Some useful details available on these pages are 1: whether sequences are genomic in origin or clones from cDNA, and 2: whether sequences are annotated as mature peptides. To view the GenPept pages of only the sequences that passed previous filtering steps you can use Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez, [7]). Upload the text file containing the NCBI accession numbers (accessions.txt), select “Protein” and select “Retrieve”. You will be redirected to a page indicating how many records were successfully retrieved. Click the link “Retrieve records”, and you will be redirected again to the Proteins database on NCBI where you can inspect each sequence or download the GenPept files.
5. Method validation
Because each step is performed in RStudio, and because the cleaning and reorganizing of data are done in stages, it is straightforward to inspect the data at each step to ensure that the changes being made are expected. This inspection was done in the case of the example used here; all relevant data were retained and all data from incomplete sequences, low quality sequences, and duplicates were removed. In addition, all irrelevant information from the FASTA description lines was removed and the filtered sequences were successfully written to a new FASTA file and their description lines contained only the NCBI accession number and the species name. These cleaned data have many potential bioinformatic applications; for further detail on the subsequent analyses used with this dataset see Dougherty and Hudak [3].
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
FUNDING: This work was supported by a Discovery Grant to K.A.H. from the Natural Sciences and Engineering Research Council of Canada, and a Canada Graduate Scholarship – Master's (CGS M) to K.D.
Footnotes
Related research article: K. Dougherty, K.A. Hudak Phylogeny and domain architecture of plant ribosome inactivating proteins Phytochemistry, 202 (2022), pp. 113337, 10.1016/j.phytochem.2022.113337
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.mex.2022.101846.
Appendix. Supplementary materials
Supplementary Data 1 – The raw FASTA file of the sequences downloaded from the NCBI Proteins database (referred to as sequence.fasta).
Supplementary Data 2 – R markdown file with all the code used in this paper.
Supplementary Data 3 – HTML output of the R markdown file in Supplementary Data 2.
Supplementary Data 4 – Table containing all the results of the pairwise comparisons for all sequences with over 99% similarity (referred to as pairwise_percent_identity_over_99.csv).
Supplementary Data 5 – Cleaned and filtered FASTA file (referred to as filtered_sequences.fasta).
Supplementary Data 6 – Detailed taxonomic information for each species in the dataset (referred to as detailed_taxonomy.csv).
Supplementary Data 7 – Final table (referred to as tabulated_cleaned_data.csv).
Supplementary Data 8 – Text file containing all accession numbers remaining after filtering (referred to as accessions.txt).
Data availability
All code and data are available in supplementary materials.
References
- 1.Chamberlain S.A., Szöcs E. Taxize: taxonomic search and retrieval in R. F1000 Res. 2013;2:191. doi: 10.12688/f1000research.2-191.v1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Charif D., Lobry J.R. Seqin{R} 1.0-2: a contributed package to the {R} project for statistical computing devoted to biological sequences retrieval and analysis. Struct. Approaches Seq. Evol. 2007:207–232. doi: 10.1007/978-3-540-35306-5_10. [DOI] [Google Scholar]
- 3.Dougherty K., Hudak K.A. Phylogeny and domain architecture of plant ribosome inactivating proteins. Phytochemistry. 2022;202 doi: 10.1016/j.phytochem.2022.113337. [DOI] [PubMed] [Google Scholar]
- 4.Geer L.Y., Domrachev M., Lipman D.J., Bryant S.H. CDART: protein homology by domain architecture. Genome Res. 2002;12:1619–1623. doi: 10.1101/gr.278202. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Osorio D., Rondon-Villarreal P., Torres R. Peptides: a package for data mining of antimicrobial peptides. R. J. 2015;7:4–14. doi: 10.32614/RJ-2015-001. [DOI] [Google Scholar]
- 6.H. Pagès, P. Aboyoun, R. Gentleman, S. DebRoy. Biostrings: efficient manipulation of biological strings R package version 2.62.0. (2021) https://bioconductor.org/packages/Biostrings.
- 7.Sayers E.W., Bolton E.E, Brister J.R., Canese K., Chan J., Comeau D.C., Connor R., Funk K., Kelly C., Kim S., Madej T., Marchler-Bauer A., Lanczycki C., Lathrop S., Lu Z., Thibaud-Nissen F., Murphy T., Phan L., Skripchenko Y., Tse T., Wang J., Williams R., Trawick B.W., Pruitt K.D., Sherry S.T. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022;50 doi: 10.1093/nar/gkab1112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Wickham H., Averick M., Bryan J., Chang W., D'Agostino McGowan L., François R., Grolemund G., Hayes A., Henry L., Hester J., Kuhn M., Pedersen T.L., Miller E., Bache S.M., Müller K., Ooms J., Robinson D., Seidel D.P., Spinu V., Takahashi K., Vaughan D., Wilke C., Woo K., Yutani H. Welcome to the tidyverse. J. Open Source Softw. 2019;4:1686. doi: 10.21105/joss.01686. [DOI] [Google Scholar]
- 9.Yang M., Derbyshire M.K., Yamashita R.A., Marchler-Bauer A. NCBI's conserved domain database and tools for protein domain analysis. Curr. Protoc. Bioinform. 2020;69 doi: 10.1002/cpbi.90. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Supplementary Data 1 – The raw FASTA file of the sequences downloaded from the NCBI Proteins database (referred to as sequence.fasta).
Supplementary Data 2 – R markdown file with all the code used in this paper.
Supplementary Data 3 – HTML output of the R markdown file in Supplementary Data 2.
Supplementary Data 4 – Table containing all the results of the pairwise comparisons for all sequences with over 99% similarity (referred to as pairwise_percent_identity_over_99.csv).
Supplementary Data 5 – Cleaned and filtered FASTA file (referred to as filtered_sequences.fasta).
Supplementary Data 6 – Detailed taxonomic information for each species in the dataset (referred to as detailed_taxonomy.csv).
Supplementary Data 7 – Final table (referred to as tabulated_cleaned_data.csv).
Supplementary Data 8 – Text file containing all accession numbers remaining after filtering (referred to as accessions.txt).
Data Availability Statement
All code and data are available in supplementary materials.











