Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach

Gciniwe S Dlamini; Stephanie J Muller; Rebone L Meraba; Richard A Young; James Mashiyane; Tapiwa Chiwewe; Darlington S Mapiye

doi:10.1109/ACCESS.2020.3031387

IEEE Access. 2020 Oct 15;8:195263–195273.

PMCID: PMC8675546 PMID: 34976561

Abstract

The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole genome sequence data of eight pathogenic species, including SARS-CoV-2. The genome sequences were transformed into dinucleotide relative frequencies and classified using the extreme gradient boosting (XGBoost) model. The classification models were trained to a) distinguish between the sequences of all eight species and b) distinguish between sequences of SARS-CoV-2 that originate from different geographic regions. Our method attained 100% in all performance metrics and for all tasks in the eight-species classification problem. Moreover, the models achieved 67% balanced accuracy for the task of classifying the SARS-CoV-2 sequences into the six continental regions and 86% balanced accuracy for the task of classifying SARS-CoV-2 samples as either originating from Asia or not. Analysis of the dinucleotide genomic profiles of the eight species revealed a similarity between the SARS-CoV-2 and MERS-CoV viral sequences. Further analysis of SARS-CoV-2 viral sequences from the six continents revealed that samples from Oceania had the highest frequency of TT dinucleotides as well as the lowest CG frequency compared to the other continents. The dinucleotide signatures of AC, AG, CA, CT, GA, GT, TC, and TG were well conserved across most genomes, while the frequencies of other dinucleotide signatures varied considerably. Altogether, the results from this study demonstrate the utility of dinucleotide relative frequencies for discriminating and identifying similar species.

Keywords: Alignment-free sequence analysis, COVID-19, dinucleotide frequencies, feature representations, genomic signatures, human pathogens, machine learning, XGBoost

I. Introduction

Coronaviruses (CoVs) are enveloped, linear, positive-sense, single-stranded ribonucleic acid (RNA) viruses approximately 30 kb in length [1]. Belonging to the family Coronaviridae and the subfamily Orthocoronavirinae, members of the Betacoronavirus genus have been shown to cause infection in humans [2]. Three coronavirus outbreaks have caused moderate to severe respiratory diseases in the last two decades: the 2002 Severe Acute Respiratory Syndrome (SARS) outbreak [3], the 2012 Middle East Respiratory Syndrome (MERS) outbreak [4], and the current Coronavirus Disease 2019 (COVID-19) pandemic. In December 2019, the first cases of infection with the novel SARS-related CoV-2 (SARS-CoV-2) were reported in Wuhan, China, with subsequent spread to more than 180 countries, resulting in nearly 15 million COVID-19 cases and more than 600 000 deaths worldwide [5].

During a virus outbreak, the taxonomic classification of a pathogenic species and understanding its relatedness to other pathogens may aid in the development of appropriate mitigation strategies. For example, global efforts to design and develop a vaccine for SARS-CoV-2 and therapeutic drugs may benefit greatly from the early identification of SARS-CoV-2 as a close relative to MERS-CoV and SARS [6], through improved understanding of possible disease progression, host pathogen interactions and potential treatment strategies.

Several alignment-based methods have been used to determine species relatedness [7], [8], but as sequencing technologies improve and available datasets become significantly larger, application of these methods to multiple sequences becomes computationally inefficient [9], [10]. Alignment-based methods also rely on the availability of well-characterized reference sequences, thereby limiting the discovery of novel characteristics embedded within a species’ genome [8]. Thus, several alignment-free methods have been proposed for rapid sequence analyses [11].

This work establishes the usefulness of an alignment-free and machine learning-based taxonomic classification approach using the dinucleotide genomic signatures of several pathogenic species. Specifically, this study examines the relative frequencies of 16 dinucleotide pairs derived from fully assembled whole genome sequences (WGS) of eight human infecting species, including SARS-CoV-2. The same examination was implemented on SARS-CoV-2 sequences only, with an aim to evaluate the usefulness of this approach in an effort to understand a novel pathogen. Understanding the variability of dinucleotide genomic profiles among viruses and other related species is of particular importance during a pandemic. Currently, it is unclear to what extent host species or virus family influence dinucleotide frequencies and whether dinucleotide genomic signatures can be used to accurately predict species using machine learning approaches.

The significance of dinucleotide patterns was first reported in the 1960s when biochemical experiments performed on genomic DNA unraveled remarkable species-specific dinucleotide genomic patterns [12]. However, detailed exploration of dinucleotide patterns was hampered by the limited availability of complete genomes, rendering most earlier inferences speculative [13]. As the field of genomics continues to evolve, more WGS are being made available, providing tremendous opportunities for detailed exploration of dinucleotide genomic signatures. In the aforementioned biochemical experiments, dinucleotide genomic signatures were found to be associated with repair-based enzymes, structural features and replication mechanisms [14], and dinucleotide frequencies were found to be more homogeneous in GC-rich genomes than in AT-rich genomes [15]. Although the main reason for this difference is not clearly understood, genomes with an abundance of the AT genomic signature have often been associated with smaller genomes consisting of fewer genes [16]. In addition, these genomes appear to be prone to mutational bias possibly due to a loss of repair genes [17] or relaxed selective pressures [18]. These early experiments served as motivation for the continued exploration of underlying dinucleotide patterns.

II. Related Work

To date, numerous alignment-free numerical DNA characterization or representation schemes have been proposed [11]. One such widely-used class of representation schemes produces fixed-length numerical representations based on frequency mappings, and these fixed-length vectors are convenient as they facilitate an efficient comparison between DNA sequences. The rationale for using frequency-based mappings is that the occurrence of nucleotides differs in the different regions of the genome both within a species and between species [19]. As a result of these differences, nucleotides can be encoded by their frequency of occurrence. For instance, mononucleotide frequencies are known to differ in the coding and non-coding regions of the genome [20], and have been effectively used in detecting these regions. Besides the use of mononucleotide frequencies, frequencies of dinucleotides, trinucleotides [21]–[25] and tetranucleotides [26], [27] have also been used to numerically represent DNA data. One shortcoming of these methods is the potential loss of valuable genomic information arising when condensing a possibly lengthy DNA sequence into a small fixed set of statistical descriptors [21]. Another popular class of alignment-free representation schemes is one that is based on information theory principles such as entropy, although numerous other representations do not fall in either of these two categories [11]. A comprehensive review of the recent numerical encoding schemes can be found in [28].

The conversion of DNA sequences from nucleotides to numerical representation is a critical component of pipelines in many computational genomics applications. One such application area is taxonomy classification where species are classified into groups based purely on their genomic sequences. For this task, alignment-free approaches have been implemented such as in [29]–[31], where in [29] a deep learning approach was taken to distinguish viral sequences from non-viral sequences in a pool of diverse genomic human samples. Whilst [30] proposed a model to classify subtypes of HIV-1 genomes based on varying sub-sequence lengths (lengths ranging from one through ten), [31] used sub-sequences of length seven in conjunction with chaos game numerical representations to build machine learning models for the purposes of classifying COVID-19 genomic sequences. Specifically, different machine learning models were trained to classify viral genomes at different levels of taxonomy, and these models were utilized to predict the correct classification of the COVID-19 samples within the different taxonomic levels.

III. Materials and Methods

A. Data Collection

Fully assembled WGS data in FASTA format were retrieved for eight pathogenic species, namely SARS-CoV-2, MERS-CoV, Dengue Virus (DENV), Zaire Ebolavirus (EBOV), Hepatitis B virus (HBV), Hepacivirus C (HCV), Human Immunodeficiency Virus 1 (HIV-1) and Mycobacterium tuberculosis (M. tb). These datasets were included for three reasons. Firstly, SARS-CoV-2, MERS-CoV, and M. tb all cause diseases affecting the human respiratory system. Secondly, M. tb and HIV-1 are well-established co-infections in low- and middle-income countries such as South Africa [32]. Lastly, EBOV [33], DENV [34], HBV [35], and HCV [36] are responsible for epidemics in the tropical regions, causing similar vascular symptoms to those seen in COVID-19 patients [37].

Complete, high coverage sequences for SARS-CoV-2 from Africa, Asia, Europe, Oceania, North America and South America were downloaded from the GISAID database [38]. In addition, viral and bacterial sequences were sourced from several publicly accessible databases. HCV and HIV-1 sequences were downloaded from the Los Alamos National Laboratory (LANL) database [39], while MERS-CoV, DENV, EBOV, HBV, and M. tb sequences were downloaded from the National Center for Biotechnology Information (NCBI) database [40]. Ethical approval was not required for this study as the samples used were sourced from publicly accessible websites and contain no personally identifiable information. Hereafter, any analyses using the eight species’ sequences are referred to as between species. In addition, any analysis using only the SARS-CoV-2 sequences will be referred to as within species. Analyses were conducted using the Python programming language [41] and R statistical language [42].

B. Data Preprocessing

To ensure that only high quality WGS data was included in this analysis, several preprocessing steps were followed (Fig. 1). To remove duplicate sequences, an in-house Python script was used to identify any sequences that had the same accession number and genomic sequence. Where duplicates were found, only one sequence from each duplicate set was retained for further analysis. An additional in-house Python script was used to detect and identify ambiguous nucleotides. For each species, sequences having any other nucleotides besides A, T, C, and G were excluded, as the presence of ambiguous nucleotides may potentially mask the genomic signature encoded within dinucleotide frequencies. Additionally, samples from Georgia were removed from the dataset due to its transcontinental location between Europe and Asia [43].

FIGURE 1. Generalized flow diagram showing the methodology.
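The two filtering steps described above can be sketched as follows. This is a minimal illustration and not the authors' in-house script: the function name `filter_sequences`, the (accession, sequence) tuple layout, and the upper-casing of sequences are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Only these four unambiguous nucleotides are allowed in a retained sequence.
VALID_NUCLEOTIDES = frozenset("ATCG")


def filter_sequences(records: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop duplicates and sequences containing ambiguous nucleotides.

    `records` holds (accession, sequence) pairs, assumed to be already
    parsed from FASTA files (hypothetical input layout).
    """
    seen = set()
    kept = []
    for accession, sequence in records:
        sequence = sequence.upper()  # assumption: case carries no meaning
        key = (accession, sequence)
        if key in seen:
            continue  # duplicate: same accession number and same sequence
        seen.add(key)
        # Keep the record only if every character is one of A, T, C, G.
        if sequence and set(sequence) <= VALID_NUCLEOTIDES:
            kept.append((accession, sequence))
    return kept
```

Only the first record of each duplicate set is retained, which matches the description of keeping one sequence per duplicate set.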

C. Dinucleotide Frequency Representation

Given the four nucleotides A, T, C, G, there are 4² = 16 unique dinucleotide pairs that can be constructed from them, namely: $D = \{AT, AA, AC, AG, TT, TA, TC, TG, GT, GA, GC, GG, CT, CA, CC, CG\}$. If we denote by $f_d$ the frequency of the dinucleotide $d \in D$, then a genomic sequence can be represented by a 16-dimensional feature vector:

$$F = (f_{AT}, f_{AA}, \ldots, f_{CG})$$

Due to the varying sequence lengths of the different species’ genomes, the relative frequencies of the dinucleotides are computed by dividing each frequency by the total number of dinucleotide pairs, $N$, extracted from the entire genome sequence. Letting $L$ be the length of a genome sequence, and assuming a sliding window of length 1, then there are $N = L - 1$ dinucleotide pairs. The refined feature vector is then defined as:

$$F_{rel} = \frac{F}{N} = \left(\frac{f_{AT}}{N}, \frac{f_{AA}}{N}, \ldots, \frac{f_{CG}}{N}\right)$$

where the fraction bar depicts element-wise vector division and each component is the relative frequency of each dinucleotide pair in that sequence.

Relative dinucleotide frequency feature vectors were computed for all genome sequences used in this study and they were used as a numerical representation for all sequence analyses.
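As an illustration of this representation, the sketch below counts overlapping dinucleotides with a window that advances one nucleotide at a time and divides by the number of pairs (sequence length minus one). The function name and the alphabetical ordering of the 16 features are assumptions made for the example; the input is assumed to contain only A, C, G, and T.

```python
from itertools import product
from typing import List

# The 16 dinucleotides in a fixed order (alphabetical here; the order is arbitrary
# as long as it is the same for every sequence).
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def relative_dinucleotide_frequencies(sequence: str) -> List[float]:
    """Return the 16-dimensional relative dinucleotide frequency vector."""
    if len(sequence) < 2:
        raise ValueError("a sequence needs at least two nucleotides")
    counts = {d: 0 for d in DINUCLEOTIDES}
    # Overlapping pairs: positions (0,1), (1,2), ..., (len-2, len-1).
    for i in range(len(sequence) - 1):
        counts[sequence[i:i + 2]] += 1
    total_pairs = len(sequence) - 1
    return [counts[d] / total_pairs for d in DINUCLEOTIDES]
```

For the toy sequence ATCGAT, the five overlapping pairs are AT, TC, CG, GA, and AT, so the AT component is 2/5 and the 16 components sum to 1.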

D. Dinucleotide Frequency Analysis

1). Exploratory Data Analysis

Two approaches were used to investigate the patterns of the genomic sequences as represented by the dinucleotide features. Firstly, dimensionality reduction techniques were employed to embed the 16-dimensional feature space in two dimensions for visualization. Specifically, principal component analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) were used. In the second approach, an unsupervised learning approach using agglomerative hierarchical clustering was utilized to uncover any underlying group structures of genomic sequences. This hierarchical approach was chosen because unlike other clustering algorithms such as k-means [44], this method does not require the number of clusters to be specified. To enable visualization of clustering results, only ten sequences were randomly sampled from each class in both within and between species analysis, resulting in 60 and 80 sequences for the two analyses, respectively. The dinucleotide relative frequency vectors of these sampled sequences were used to construct dendrograms through average linkage of the Euclidean distance matrix of the feature vectors. Validation of the clustering results, which is usually not an easy task, was performed in this study through the known class labels of the samples.
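To make the clustering step concrete, the following is a deliberately naive, standard-library sketch of agglomerative clustering with average linkage on Euclidean distances. It is not the implementation used in the study; the function names and the merge-history return format are assumptions made for the example, and the cubic-time search is only reasonable for small samples such as the 60 or 80 sequences mentioned above.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(
    vectors: Sequence[Sequence[float]],
) -> List[Tuple[List[int], List[int], float]]:
    """Return the merge history as (members of A, members of B, distance)."""
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                total = sum(
                    euclidean(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((clusters[i], clusters[j], dist))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```

The merge history carries the same information a dendrogram displays: which groups were joined and at what height. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` would be the usual choice.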

2). Statistical Inference

The Kruskal-Wallis test was used to compare the distribution of the relative frequencies across the different species and continents of SARS-CoV-2 sequences. Pairwise comparisons were conducted using the Wilcoxon Rank Sum test, with adjustment for multiple comparisons using the Bonferroni correction. A p-value of less than 0.05 was considered statistically significant.
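The Bonferroni step can be illustrated with a short sketch: each raw pairwise p-value is multiplied by the number of comparisons and capped at 1. The function name and the p-values used in the usage note are hypothetical; only the count of pairwise comparisons follows from the six continents and eight species named in the text.

```python
from itertools import combinations
from typing import Dict, Tuple

Pair = Tuple[str, str]


def bonferroni_adjust(p_values: Dict[Pair, float]) -> Dict[Pair, float]:
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    n_comparisons = len(p_values)
    return {pair: min(1.0, p * n_comparisons) for pair, p in p_values.items()}


# Six continents give 15 pairwise comparisons; eight species give 28.
CONTINENTS = ["Africa", "Asia", "Europe", "North America", "Oceania", "South America"]
CONTINENT_PAIRS = list(combinations(CONTINENTS, 2))
```

For example, with 15 comparisons a raw p-value of 0.002 becomes 0.03 after adjustment and remains below the 0.05 threshold, whereas a raw p-value of 0.01 becomes 0.15 and does not.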

3). Classification

For the supervised learning task, two classification problems were investigated for each of the between species and within species analyses. Firstly, the problem was framed directly as a multi-class classification problem where the goal was to classify each sequence into $K$ different classes, where $K = 8$ for the different species in the between species analysis and $K = 6$ for the different continents in the within species analysis. Secondly, to discriminate SARS-CoV-2 from the other species and to investigate differences between SARS-CoV-2 sampled from different continents, the multi-class classification problem was binarized through the one-vs-all approach. Specifically, one classification model was built to distinguish SARS-CoV-2 from all the other species and another to distinguish SARS-CoV-2 samples originating in Asia from those originating from all the other continents.

In the one-vs-all approach, the sequences in the class of interest (SARS-CoV-2 and Asia for the two analyses) were greatly outnumbered by the sequences in the respective complement classes (i.e. the combination of all classes other than the class of interest). The complement classes were therefore undersampled to match the size of the classes of interest, producing balanced classes.
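A minimal sketch of this balancing step is given below. It assumes simple random undersampling without replacement, which the text does not specify; the function name and the seed parameter are likewise illustrative.

```python
import random
from typing import List, Sequence


def undersample_complement(
    labels: Sequence[str], positive: str, seed: int = 0
) -> List[int]:
    """Return indices of a balanced one-vs-all subset.

    All samples of the class of interest are kept, and an equally sized
    random subset of the complement class is drawn without replacement.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == positive]
    complement = [i for i, y in enumerate(labels) if y != positive]
    sampled = rng.sample(complement, k=min(len(positives), len(complement)))
    return sorted(positives + sampled)
```

The returned indices can then be used to select the matching feature vectors and to relabel the samples as the class of interest versus the rest.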

Two resampling techniques were utilized in this classification system: k-fold cross-validation for hyperparameter tuning and the bootstrap for final model evaluation. For both the within and between species analyses, the data was randomly split, in a stratified manner, into 70% for training and hyperparameter tuning, with the remaining 30% used for final testing of the fully specified classifier. Randomized hyperparameter search was implemented through stratified 10-fold cross-validation on the 70% of the data reserved for training. Stratification was utilized to maintain the proportion of samples for each class in the train and test sets, as well as in the folds during the 10-fold cross-validation procedure. Stratification is especially important when dealing with imbalanced data (such as in the multi-class classification settings). Ten folds were chosen for the cross-validation because they provide a good compromise between model bias and computational efficiency [45]. A smaller number of folds, such as two or three, has a high bias but is computationally efficient. On the other hand, a large number of folds (the extreme case being leave-one-out cross-validation) has a low bias but is computationally inefficient [45]. Moreover, [46] showed that leave-one-out and 10-fold cross-validation yielded similar results, indicating that using ten folds is more appealing from a computational efficiency perspective.
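The stratified 70/30 split and the stratified folds can be sketched with the standard library alone. This is an illustration of the idea rather than the authors' code: the function names, the rounding rule for the per-class test size, and the round-robin assignment to folds are assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split indices into train and test, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def stratified_folds(
    labels: Sequence[str], indices: Sequence[int], k: int = 10, seed: int = 0
) -> List[List[int]]:
    """Partition `indices` into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds: List[List[int]] = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin so every fold receives a similar share.
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds
```

During tuning, each fold in turn serves as the validation set while the other nine are used for fitting, and the held-out 30% is never touched until the final evaluation.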

The randomized hyperparameter tuning through stratified 10-fold cross-validation yielded the best model configuration, where “best” was determined through the balanced accuracy metric [47], [48]. This configuration was then used to train the model on the entirety of the training data before it was evaluated on the held-out 30% testing data. Balanced accuracy is an alternative to the standard accuracy measure that is especially useful when working with imbalanced data. It is defined as the average of the recall scores obtained in each class, whereas standard accuracy is simply the proportion of all correctly predicted class labels. For evaluation of the final model, 20 000 bootstrap resamples of the unseen test data were evaluated on the trained model. In each bootstrap iteration, the balanced accuracy, precision, recall and F1 score performance metrics were computed. The average values of these metrics, together with 95% confidence intervals to quantify uncertainty, were computed and reported. The methods used for hyperparameter optimization and model evaluation are described in greater detail in Fig. S3 and Fig. S4.
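The evaluation procedure can be sketched as below: balanced accuracy is the mean of per-class recalls, and the bootstrap repeatedly resamples the test predictions with replacement. The text does not state how the 95% confidence intervals were derived, so the percentile method used here is an assumption, as are the function names and default arguments.

```python
import random
from typing import Sequence, Tuple


def balanced_accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Average of the recall scores obtained in each class present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        members = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(y_pred[i] == cls for i in members)
        recalls.append(correct / len(members))
    return sum(recalls) / len(recalls)


def bootstrap_balanced_accuracy(
    y_true: Sequence,
    y_pred: Sequence,
    n_resamples: int = 20000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile interval (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Draw n test samples with replacement and score the fixed predictions.
        sample = [rng.randrange(n) for _ in range(n)]
        scores.append(
            balanced_accuracy([y_true[i] for i in sample], [y_pred[i] for i in sample])
        )
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n_resamples, lower, upper
```

As a small check of the definition, with true labels [0, 0, 0, 0, 1, 1] and predictions [0, 0, 0, 1, 1, 0] the recalls are 3/4 and 1/2, giving a balanced accuracy of 0.625, while standard accuracy is 4/6. Note that the model is trained once; only the test predictions are resampled.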

IV. Results

A. Data Collection and Preprocessing

A total of 60 063 WGS were downloaded from publicly accessible databases including GISAID, NCBI, and the LANL (Table 1). Of these, 54.8% of the sequences contained only A, T, C, and G nucleotides (Table 1) and were used for the between species dinucleotide frequency analysis. Of the eight species, SARS-CoV-2 retained the smallest proportion of its downloaded sequences (29.4%) for the between species dinucleotide frequency analysis (Table 1). Most of the SARS-CoV-2 sequences used in this study were sampled in Europe, while South America, Oceania, and Africa had the fewest sequences (Table S1).

TABLE 1. Number of Sequences Downloaded and Selected for Analysis.

Species	Total number of sequences downloaded	Selected sequences (ATCG only)
SARS-CoV-2	28 067	8 252 (29.4%)
MERS-CoV	256	198 (77.3%)
DENV	5 448	4 749 (87.2%)
EBOV	1 547	1 222 (79.0%)
HBV	8 627	6 990 (81.0%)
HCV	3 288	2 453 (74.6%)
HIV-1	12 538	8 890 (70.9%)
M. tb	292	145 (49.7%)
Total	60 063	32 899 (54.78%)
