Input: NCBI GenBank nucleotide sequences
Output: Biomarkers extracted from the genome
Let us denote the set of input nucleotide sequences as S and the set of extracted biomarkers as B. The pseudo-code can then be expressed mathematically as follows:
● Collect nucleotide sequences:
● S = {s1, s2, …, sn}
● Filter and screen the sequences:
● S′ = {s ∈ S | s passes the filtering and screening criteria}
● Transform the genomes into k-mers:
● K = {k1, k2, …, km}, where each ki is a k-mer (a substring of length k) of a sequence s ∈ S′
● Train the BERT tokenizer on the k-mers:
● T = Tokenizer.train(K)
● Use SMOTE to balance imbalanced genomic data samples:
● S″ = SMOTE(S′)
● Perform additional preprocessing steps for the BERT model:
● Convert nucleotide sequences to DNA-specific tokens using T
● Apply the necessary transformations to prepare the data for the BERT model
● Preprocess each nucleotide sequence for the custom BERT model:
● Tokenize the sequence using the proposed DNA/RNA tokenizer T
● Pad gaps or missing sequence regions with dedicated padding tokens
● Encode the input k-mers into a bidirectional representation using the BERT model’s bidirectional encoder:
● E = Encoder.encode(S″)
● Extract specific biomarkers from the genome in an unsupervised manner using the BERT model:
● B = Biomarker.extract(E)
● Pass these biomarkers into a deep neural network-based classifier:
● Classifier.train(B)
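The k-mer and tokenizer-training steps above can be sketched in Python. The helpers `kmerize` and `build_vocab` below are illustrative stand-ins, not the actual BERT tokenizer training (which would typically use a subword-tokenizer library on the k-mer corpus):

```python
def kmerize(seq, k=3):
    """Slide a window of length k over a nucleotide sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(sequences, k=3,
                specials=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Toy stand-in for T = Tokenizer.train(K): assign an integer id
    to each BERT special token and to every observed k-mer."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for seq in sequences:
        for kmer in kmerize(seq, k):
            vocab.setdefault(kmer, len(vocab))
    return vocab

kmers = kmerize("ATGCGT", k=3)          # ['ATG', 'TGC', 'GCG', 'CGT']
vocab = build_vocab(["ATGCGT", "ATGAAA"], k=3)
```

Overlapping (stride-1) k-mers are shown here; a non-overlapping variant simply changes the step of the range.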
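The balancing step S″ = SMOTE(S′) would in practice call a library implementation such as imblearn's SMOTE; the NumPy sketch below (`smote_oversample` is a hypothetical helper) shows only the core idea — synthesizing a new minority sample by interpolating between an existing minority sample and one of its nearest minority neighbours:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=2, rng=None):
    """Minimal SMOTE-style oversampling: pick a minority sample, pick one
    of its k nearest minority neighbours, and create a synthetic point at
    a random position on the segment between the two."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                        # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(new)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = smote_oversample(minority, n_new=4, rng=0)
```

Applied to sequence data, SMOTE operates on the numeric feature vectors derived from the sequences, not on the raw nucleotide strings.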