Input: NCBI GenBank nucleotide sequences
Output: Biomarkers extracted from the genome
Let us denote the set of input nucleotide sequences as S and the set of extracted biomarkers as B. The pseudo-code can then be expressed mathematically as follows:
● Collect nucleotide sequences:
● S = {s1, s2, …, sn}
● Filter and screen the sequences:
● S′ = {s ∈ S | s passes the filtering and screening criteria}
● Transform the genomes into k-mers:
● K = {k1, k2, …, km}, where each ki is a k-mer (a substring of length k) of a sequence s ∈ S′
● Train the BERT tokenizer on the k-mers:
● T = Tokenizer.train(K)
● Use SMOTE to balance imbalanced genomic data samples:
● S″ = SMOTE(S′)
● Perform additional preprocessing steps for the BERT model:
● Convert nucleotide sequences to DNA-specific tokens using T
● Apply the necessary transformations to prepare the data for the BERT model
● Preprocess each nucleotide sequence for the custom BERT model:
● Tokenize the sequence using the proposed DNA/RNA tokenizer T
● Pad gaps or missing sequence regions with dedicated padding tokens
● Encode the input k-mers into a bidirectional representation using the BERT model’s bidirectional encoder:
● E = Encoder.encode(S″)
● Extract specific biomarkers from the genome in an unsupervised manner using the BERT model:
● B = Biomarker.extract(E)
● Pass these biomarkers into a deep neural network-based classifier:
● Classifier.train(B)
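The k-mer and tokenizer-training steps above can be sketched in Python. The helpers `kmerize` and `build_vocab` below are illustrative stand-ins, not the actual BERT tokenizer training (which would typically use a subword-tokenizer library on the k-mer corpus):

```python
def kmerize(seq, k=3):
    """Slide a window of length k over a nucleotide sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(sequences, k=3,
                specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Toy stand-in for T = Tokenizer.train(K): assign an integer id
    to each BERT special token and to every observed k-mer."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for seq in sequences:
        for kmer in kmerize(seq, k):
            vocab.setdefault(kmer, len(vocab))
    return vocab

kmers = kmerize("ATGCGT", k=3)          # ['ATG', 'TGC', 'GCG', 'CGT']
vocab = build_vocab(["ATGCGT", "ATGAAA"], k=3)
```

Overlapping (stride-1) k-mers are shown here; a non-overlapping variant simply changes the step of the range.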
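The balancing step S″ = SMOTE(S′) would in practice call a library implementation such as imblearn's SMOTE; the NumPy sketch below (`smote_oversample` is a hypothetical helper) shows only the core idea — synthesizing a new minority sample by interpolating between an existing minority sample and one of its nearest minority neighbours:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=2, rng=None):
    """Minimal SMOTE-style oversampling: pick a minority sample, pick one
    of its k nearest minority neighbours, and create a synthetic point at
    a random position on the segment between the two."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                        # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(new)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = smote_oversample(minority, n_new=4, rng=0)
```

Applied to sequence data, SMOTE operates on the numeric feature vectors derived from the sequences, not on the raw nucleotide strings.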