Author manuscript; available in PMC: 2025 Oct 15.
Published in final edited form as: Anal Chem. 2024 Oct 4;96(41):16115–16120. doi: 10.1021/acs.analchem.4c03201

Nucleo-SAFARI: Automated Identification of Fragment Ions in Top-Down MS/MS Spectra of Nucleic Acids

Michael B Lanzillotti 1, Jennifer S Brodbelt 2
PMCID: PMC11533214  NIHMSID: NIHMS2030666  PMID: 39365982

Abstract

Recent progress in top-down mass spectrometry analysis of progressively larger nucleic acids has enabled in-depth characterization of intact, modified RNA molecules. Development of methods for desalting and MS/MS fragmentation allows rapid acquisition of high-quality top-down MS/MS spectra of nucleic acids up to 100 nt, which has spurred the need for development of software approaches to identify and validate nucleic acid fragment ions. We have implemented an R-based approach to aid in analysis of MS/MS spectra of nucleic acids based on fragment ions observed directly in the m/z domain. This program, entitled Shiny Application for Fragment Assignment by Relative Isotopes (Nucleo-SAFARI), utilizes the Shiny HTML framework for deployment of a user-friendly application for automated annotation of top-down MS/MS spectra of nucleic acids recorded on Orbitrap mass spectrometer platforms. This approach proceeds through in silico generation of fragment ions and their isotopic distributions, followed by algorithmic assessment of the experimental isotopic distributions. Nucleo-SAFARI is available for download at https://github.com/mblanzillotti/Nucleo-SAFARI.


INTRODUCTION

Chemical modifications are known to have crucial effects on the biological functions of nucleic acids.1–7 While broadly identified in tRNAs8,9 and rRNAs,10 the >150 modifications profiled to date11,12 have likewise been discovered in several other classes of RNA (e.g., mRNAs,13 siRNAs,14,15 and miRNAs,16 among others) where they can affect the structure and intermolecular interactions of the RNA.2,17 Understanding the dynamics of RNA modifications is critical to gain insight into their downstream effects; however, characterizing the locations and frequencies of covalent modifications on RNA molecules presents unique challenges.18–22 Many types of modified nucleosides have been discovered by mass spectrometry,18–21,23,24 next generation sequencing methods,20 and highly sensitive biochemical assays.25

Recent developments in top-down mass spectrometry of RNAs have enabled accurate identification of multiple RNA modifications in tandem while also localizing them to specific nucleotides.26–31 Top-down approaches leverage the capabilities of modern high-resolution MS platforms to ionize, fragment, and detect intact RNA molecules without prior enzymatic digestion.28–31 Characterization of intact RNAs via MS/MS affords extensive molecular fingerprints that may reveal multiple covalent modifications co-occurring on each RNA.20,22 Advances in top-down analysis of RNAs have focused on improving sample preparation, increasing ionization efficiency, and optimizing fragmentation methods to characterize modifications present on synthetic RNA therapeutics.19,20,22,24,26–33 Many ion activation techniques are amenable to characterization of nucleic acids, including collisionally activated dissociation (CAD),31–33 and, more recently, ultraviolet photodissociation (UVPD)27,30 and activated-electron photodetachment dissociation (a-EPD).34,35 Each activation method yields a defined set of fragment ions, typically originating from bond cleavages across the phosphodiester backbone (illustrated in Figure S1A) that generate product ions containing either the 3′ or 5′ terminus. These product ions can be annotated to display identified fragments throughout the primary structure using the type of encoding shown in Figure S1B.32 Top-down mass spectrometry can provide comprehensive information about the identity and location of various covalent modifications in addition to mapping the sequence of the nucleic acid.

Given the complexity inherent in top-down MS/MS spectra of even short (<20 nt) oligonucleotides, for which fragment ions may exist in multiple charge states, identification and validation of fragment ions is nontrivial.20 An early data processing platform for nucleic acid MS/MS data, Simple Oligonucleotide Sequencer, enabled ab initio sequencing of small oligonucleotides.36 One of the first database search tools developed for RNA analysis was Ariadne.37 This approach involved MS-based profiling of RNA components of RNA-protein complexes by analysis of RNase digestion products via a two-step MS/MS search and nucleotide mapping approach.37 Another foundational pair of software tools, OMA (oligonucleotide mass assembler) and OPA (oligonucleotide peak analyzer), was specifically developed to support characterization of double-stranded and modified RNAs or DNAs based on fragment ions profiled in pioneering studies.32,38 MS/MS data analysis was extended for modified RNAs with RoboOligo, which focused on de novo sequencing of a single digested RNA to map modifications based on c and y fragments produced by CAD.39 This approach was extended with the development of RNAModMapper, where in silico digestion products of target RNA sequences and their CAD fragments (including c, y, w, a-B, and other neutral loss ions) are generated and identified in LC-MS/MS data sets, enabling localization of RNA modifications on multiple precursors in tandem.40,41 An additional approach was developed utilizing NIST spectral library matching of manually annotated RNase T1 digestion products, enhancing analysis throughput.42 More recently, the Nucleic Acid Search Engine (NASE) platform was released, employing an open-source database searching approach, incorporating optimization for high-resolution mass spectrometers, and affording substantial increases in search speed over its predecessors.43 Pytheas, one of the most recent additions to the RNA-MS software field, incorporated MS/MS search tools into a Python environment for rapid assessment of bottom-up RNA LC-MS/MS data with additional statistical validation by decoy searching.44 Finally, MIND4OLIGOS recently introduced an algorithm for rapid determination of oligonucleotide monoisotopic mass.45 Each of these tools has offered substantial progress in the field of RNA MS, leveraging its strengths to characterize biologically relevant molecules. Each of these software methods, however, focuses on the interrogation of LC-MS/MS data of small (typically <30 nt) RNAs produced by digestion with restriction enzymes, limiting their applicability to top-down analyses of nucleic acids.

To expand the software tools that enable interpretation of MS/MS spectra of modified nucleic acids in both positive and negative polarities, we have developed an R-based application, built in the Shiny framework, that generates theoretical fragment ions in silico from a user-provided sequence and identifies them by their isotopic distributions directly in the m/z domain. Herein we showcase the use of this software platform, named Shiny Application for Fragment Annotation by Relative Isotopes (Nucleo-SAFARI), on several oligonucleotides (both DNA and RNA) of varying length (10–80 nt) and highlight its utility in characterizing covalent modifications via top-down fragment ion searches.

EXPERIMENTAL SECTION

A 60 mer DNA was synthesized by IDT (Coralville, IA), while all RNA oligonucleotides (miR145, let7 precursors) were provided by the Xhemalce group (University of Texas at Austin). Sequences and associated monoisotopic masses are provided in Table S1. Nanoflow online desalting was performed as described previously.30 Briefly, an isocratic mobile phase (4:1 acetonitrile:water, 10 mM piperidine, pH ~10) at a flow rate of 1 μL/min was utilized to desalt nucleic acid samples with an in-house packed Waters XBridge Phenyl trap column (5 cm × 100 μm). Oligonucleotides were ionized in the negative mode using an electrospray voltage of 1600 V applied to a stainless-steel tee and pulled fused silica emitter (New Objective, Littleton, MA). Each acquisition utilized 0.5 μL of sample solution at 10 μM in 4:1 acetonitrile:water; the remainder of the fluidics volume consisted solely of mobile phase. All spectra were collected on an Orbitrap Lumos mass spectrometer (Thermo Scientific, San Jose, CA), and averaged and exported using QualBrowser. All MS/MS spectra utilized quadrupole isolation with a width of 2 m/z. CID activation used an NCE value of 25 and a q value of 0.14. 193 nm UVPD activation utilized 1 pulse (2 ms) at an energy of 1 mJ, while 213 nm UVPD activation employed 50 pulses (20 ms) at a fixed pulse energy of ~2 μJ per pulse. Fragment identification was carried out in Nucleo-SAFARI (R 4.3.3 “Angel Food Cake”) using a 10 ppm m/z error tolerance and 30% intensity error tolerance. A full module diagram of the application is shown in Figure S2.

RESULTS AND DISCUSSION

Application Functions.

Data analysis using Nucleo-SAFARI involves three major steps, illustrated in Figure 1. First, a user-input sequence listed from 5′ to 3′ is parsed based on the specific sugars, nitrogenous bases, and backbone compositions defined, along with any potential modifications listed by their chemical formulas, represented in Figure 1A. For each nucleotide, a nucleobase must be listed from a predefined set with corresponding chemical formulas (located in the Prerequisites and Helpers file in the application); however, sugar and backbone moieties are assumed to be deoxyribose and phosphodiester if not otherwise specified symbolically. Covalent modifications are listed in parentheses after the symbol upon which they occur. These can be listed as either predefined symbols corresponding to modifications (e.g., methyl, acetyl) or as chemical formulas, shown in Table S2. Furthermore, modifications affecting sugars are listed based on the carbon atom (1′−5′) to which they are covalently bound. Defining any 3′ or 5′ modifications in this way is important for accurate calculation of fragment ion masses, as any manual modification to the 5′ or 3′ carbon is considered to replace the 5′ or 3′ oxygen within the phosphodiester linkage. Once the sequence is parsed, the corresponding chemical formulas at each sugar, nucleobase, and backbone position along the nucleic acid are collated.

Figure 1.

Application body with annotations describing (A) sequence input based on sugar, base, and backbone symbols and conversion to its corresponding chemical formula, (B) generation of m/z domain isotopic distributions for precursor and fragment ions, and (C) empirical fragment identification based on m/z and intensity tolerances relative to theoretical values, represented for a valid identification as blue-shaded boxes, and a rejected identification as red boxes. Last, (D) identified ions can be visualized, here shown as a sequence coverage map.

Second, represented in Figure 1B, the net chemical formulas of the intact nucleic acid and its fragment ions are determined as a total sum of each element for the intact precursor, or as a rolling cumulative sum from the 5′ or 3′ direction for fragment ions. By default, chemical formulas for all fragment ions generated by backbone cleavage (a/w, b/x, c/y, d/z) are calculated, and fragments arising from neutral loss of the neo-terminal nucleobase can be included with the “neutral loss” toggle. Monoisotopic masses of each fragment ion are calculated directly from the generated chemical formulas. Fine isotopic structures are generated up to a total probability of 97% (where the theoretical abundances of all possible isotopologues total 100%) using the IsoSpecR package,46 and the probabilities of each isotopologue are then condensed by nominal mass for use in fragment identification. In this way, the most abundant isotopes in a given isotopic distribution are generated consistently, while very low-abundance isotopic peaks are not generated. Based on a user-input precursor charge state, m/z values of fragment ions are calculated and included in the searches based on proximity to the precursor m/z; Nucleo-SAFARI by default considers the three charge states of the fragment ions that are closest in charge density to the precursor.
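The bookkeeping in this second step can be illustrated with a minimal sketch. Nucleo-SAFARI itself is written in R; the Python below is an independent illustration, not the application's code. The residue table, the function names, and the reading of "closest in charge density" as "closest to the precursor's charge per unit mass" are all assumptions. The terminal groups and the offsets specific to each fragment type (a/w, b/x, c/y, d/z) are deliberately left out.

```python
from collections import Counter
from itertools import accumulate

# Monoisotopic masses (Da) of the lightest stable isotopes.
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048,
        "O": 15.99491461956, "P": 30.97376163}
PROTON = 1.00727646688

# Assumed lookup: elemental composition of each DNA residue as it sits in a
# chain (nucleoside monophosphate minus H2O). Not the application's own table.
RESIDUE = {
    "A": Counter(C=10, H=12, N=5, O=5, P=1),
    "C": Counter(C=9, H=12, N=3, O=6, P=1),
    "G": Counter(C=10, H=12, N=5, O=6, P=1),
    "T": Counter(C=10, H=13, N=2, O=7, P=1),
}

def running_formulas(sequence):
    """Cumulative element counts from the 5' end, one entry per residue."""
    return list(accumulate((RESIDUE[base] for base in sequence),
                           lambda total, residue: total + residue))

def mono_mass(formula):
    """Monoisotopic mass of an element-count mapping."""
    return sum(MONO[element] * count for element, count in formula.items())

def mz_negative(neutral_mass, z):
    """m/z of the deprotonated ion [M - zH]^z- for a neutral mass M."""
    return (neutral_mass - z * PROTON) / z

def nearest_charges(fragment_mass, precursor_mass, precursor_z, n=3):
    """The n integer charges closest to the precursor's charge per unit mass."""
    ideal = precursor_z * fragment_mass / precursor_mass
    candidates = range(1, precursor_z + 1)
    return sorted(sorted(candidates, key=lambda z: abs(z - ideal))[:n])
```

A call such as `nearest_charges(3000.0, 18000.0, 24)` picks the charges around the ideal value of 4, which keeps the search list short while still covering the charge states a fragment of that size is most likely to carry.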

Third, the m/z values of the centroids extracted from an averaged spectrum (Thermo .raw file processed with the “Export” function within Xcalibur) are searched against the theoretical m/z list of the precursor and fragment isotopes with a user-defined relative m/z tolerance (default 10 ppm), and subsequently validated based on predicted isotopic abundances, shown schematically in Figure 1C. Various visualizations are generated from the identified fragment ions, with an example sequence coverage map shown in Figure 1D.
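As a hedged illustration of the relative-tolerance search, the sketch below matches a list of centroid m/z values against theoretical m/z values within a ppm window. It is written in Python rather than the application's R, the function names are hypothetical, and keeping only the closest centroid per theoretical value is an assumption.

```python
from bisect import bisect_left, bisect_right

def ppm_error(observed, theoretical):
    """Signed relative m/z error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def match_centroids(centroid_mz, theoretical_mz, tol_ppm=10.0):
    """Return {theoretical m/z: closest centroid within tol_ppm} for each hit."""
    centroids = sorted(centroid_mz)
    hits = {}
    for theo in theoretical_mz:
        half_width = theo * tol_ppm * 1e-6  # the window scales with m/z
        lo = bisect_left(centroids, theo - half_width)
        hi = bisect_right(centroids, theo + half_width)
        window = centroids[lo:hi]
        if window:
            hits[theo] = min(window, key=lambda mz: abs(mz - theo))
    return hits
```

Because the tolerance is relative, a 10 ppm window is about 0.009 m/z wide on each side at m/z 894 and narrower at lower m/z.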

A more detailed version of the fragment identification procedure is outlined in Figure 2. Figure 2A displays a well-resolved fragment ion in the m/z domain, which is identified against theoretical m/z values of various fragment ions within a 10 ppm m/z tolerance in Figure 2B. Four theoretical fragment ions are displayed as patterns of short color-coded bars in Figure 2B. From the resulting m/z identifications, predicted abundances are determined for each isotope based on the probabilities of the theoretical isotopic distribution and total abundance of the observed isotopic distribution in Figure 2C. At the same time, the first validation steps take place, where the identified isotopes must be sequential, and the total theoretical probability of these identified theoretical isotopes must exceed 70%, encompassing a majority of the predicted isotopic abundance. In practice, this step mitigates false identification of fragment ions that would be removed during validation as well as false identifications of isotopic distributions with half the charge and mass of the theoretical fragment. Last, in Figure 2D, the observed isotopic distribution is validated against the predicted isotopic abundances based on a weighted average error between the observed and predicted abundance of each isotope (weighted by isotope probability). If this average error is less than a user-defined tolerance (default 25%), the fragment ion is considered a match, as is the case with the w12⁶⁻ ion of a 60 mer DNA strand in Figure 2.
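The three checks described here (sequential isotopes, more than 70% cumulative probability, and a probability-weighted intensity error below the tolerance) can be sketched as follows. This is one plausible reading of the text, written in Python for illustration; the exact formulas used inside Nucleo-SAFARI may differ, and the function name and input layout are hypothetical.

```python
def validate_isotopes(theo_probs, observed, min_coverage=0.70, max_error=0.25):
    """theo_probs: isotopologue probabilities in nominal-mass order.
    observed: matched intensity per isotope, or None where no centroid matched."""
    matched = [i for i, intensity in enumerate(observed) if intensity is not None]
    if not matched:
        return False
    # Check 1: the matched isotopes must form one unbroken run.
    if matched != list(range(matched[0], matched[-1] + 1)):
        return False
    # Check 2: the run must exceed the required share of theoretical probability.
    covered = sum(theo_probs[i] for i in matched)
    if covered <= min_coverage:
        return False
    # Predicted abundances: the observed total redistributed by theoretical ratios.
    total_observed = sum(observed[i] for i in matched)
    weighted_error = 0.0
    for i in matched:
        predicted = total_observed * theo_probs[i] / covered
        weighted_error += theo_probs[i] * abs(observed[i] - predicted) / predicted
    # Check 3: probability-weighted mean relative error under the tolerance.
    return weighted_error / covered < max_error
```

Weighting by isotope probability means that a poor fit on a minor isotope costs less than the same relative error on the most abundant isotope.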

Figure 2.

Representation of fragment ion identification procedure where (A) the isotopic pattern of an observed fragment ion is (B) overlaid with theoretical fragment m/z values, with each isotope represented as a colored bar with a defined width corresponding to the m/z tolerance (10 ppm). Four theoretical fragment ions shown as short colored bars are displayed. Next, (C) a sequential set of isotopes must cumulatively encompass 70% of the total predicted isotopic abundance, and predicted abundances for each isotope are calculated. The predicted abundances (blue) overlaid with the observed fragment ion (D) are validated based on an intensity tolerance (25%). The fragment ion in this example was generated by 193 nm UVPD of 60 mer DNA (z = 24-).

An additional feature of the fragment identification procedure involves the modeling of potential hydrogen shifts. Transfer of one or more hydrogens during bond cleavages induced by UVPD is well-documented in proteins47 and can also occur during photoactivation of nucleic acids. To capture these contributions to an isotopic distribution of a fragment ion, Nucleo-SAFARI employs a non-negative least-squares model to predict the relative amounts of the overlaid species arising from hydrogen shifts.47 A similar approach has been employed previously to perform relative quantitation of potential cytidine and uridine misincorporation in synthetic gRNAs.29
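To make the non-negative least-squares idea concrete, the sketch below fits an observed isotopic pattern as a non-negative mixture of a theoretical pattern and a copy shifted by one nominal mass unit (standing in for a species that gained one hydrogen). It is a Python illustration under stated assumptions: the solver is a simple coordinate-descent routine chosen for brevity, not necessarily the solver the application uses, and the function names and the example pattern are hypothetical.

```python
def nnls_coordinate_descent(columns, target, sweeps=500):
    """Minimize ||sum_j x_j * columns[j] - target||^2 subject to x_j >= 0."""
    x = [0.0] * len(columns)
    residual = list(target)  # target minus the current fitted sum
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            norm = sum(c * c for c in col)
            if norm == 0.0:
                continue
            # Best update for x_j with the others held fixed, clipped at zero.
            step = sum(c * r for c, r in zip(col, residual)) / norm
            new_xj = max(0.0, x[j] + step)
            delta = new_xj - x[j]
            residual = [r - delta * c for r, c in zip(residual, col)]
            x[j] = new_xj
    return x

def shifted(pattern, shift, length):
    """Place an isotopic pattern `shift` nominal-mass units higher."""
    out = [0.0] * length
    for i, p in enumerate(pattern):
        out[i + shift] = p
    return out

# Hypothetical condensed isotopic pattern, for illustration only.
pattern = [0.5, 0.3, 0.15, 0.05]
base = shifted(pattern, 0, 5)      # unshifted fragment
plus_one = shifted(pattern, 1, 5)  # fragment with one extra hydrogen
observed = [0.7 * b + 0.3 * p for b, p in zip(base, plus_one)]
fractions = nnls_coordinate_descent([base, plus_one], observed)
```

The non-negativity constraint matters because an unconstrained fit can assign a negative amount to a species that is absent, which has no physical meaning for ion abundances.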

Identified fragment ions can be visualized in both a sequence coverage map and an interactive annotated spectrum, as well as in several built-in visualizations shown in Figures S3–S6. These figures are generated within the application and display a summary of the total identified fragment abundances by type (Figure S3), a summary of total identified fragment ion abundance by both type and cleavage position (Figure S4), observed m/z error for all identified fragments (Figure S5), and a summary of charge site localization based on the abundance of fragment ions in different charge states by cleavage position (Figure S6). In addition to various visualizations, Nucleo-SAFARI also reports sequence coverage and a P-score. In brief, P-scores are calculated using a sequence tag approach, modified to account for the characteristics of nucleic acid fragmentation and consideration of fragment ions directly in the m/z domain.48,49 A more detailed description of the P-score calculation is included in Supporting Information. Finally, .csv files containing the masses of the theoretical fragment ions (monoisotopic or isotopic distributions), fragment m/z values, and fragment identifications can be downloaded from within the application.

Fragment Identification Performance.

An example of fragment identification results from Nucleo-SAFARI executed on a UVPD mass spectrum of a synthetic, methylated (m5C 66) let7 miRNA precursor (78 nt, 25 kDa) is shown in Figure 3. This spectrum was acquired utilizing an online nanoflow desalting method to introduce the nucleic acid by ESI, and the MS/MS spectrum was then searched against the corresponding sequence (shown in Table S1) with an m/z error tolerance of 10 ppm and an intensity tolerance of 30%. These parameters resulted in identification of 127 fragment ions, yielding a sequence coverage of 88.3% and a P-score of 1.03e-180. This oligonucleotide contains a 5-methylcytidine nucleobase at position 66 incorporated with a deoxyribose sugar. In this spectrum, well-resolved w13 and w12 fragments were identified (displayed in insets) that bracket the modified residue and confirm its location in the let7 precursor sequence. These fragment ions and the modified residue are highlighted by red boxes in Figure 3. The types of fragment ions identified by Nucleo-SAFARI agree with the expected preferential production of d and w ions noted previously for analysis of RNA by UVPD.27,28,30 The summed intensity of these fragment ions likewise followed established trends, where d and w fragments were not only the most numerous but also had the highest total abundance, summarized in Figure S3. Annotated MS/MS spectra and sequence maps for the m6A 56 let7 precursor and unmodified let7 precursor are shown in Figures S7 and S8.
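The relationship between fragment labels, cleavage positions, and sequence coverage can be sketched as below. This is a Python illustration, not the application's code: it assumes coverage is the fraction of the n − 1 backbone positions with at least one identified fragment, and it handles only plain labels such as d5 or w13 (neutral-loss labels are out of scope). Under this definition, the reported 88.3% for a 78 nt sequence would correspond to 68 of 77 positions, but whether the application counts positions exactly this way is an assumption.

```python
import re

FIVE_PRIME = set("abcd")   # fragments containing the 5' terminus
THREE_PRIME = set("wxyz")  # fragments containing the 3' terminus

def cleavage_position(label, length):
    """Map a label such as 'd5' or 'w13' to a backbone position counted from
    the 5' end (position k lies between residues k and k + 1)."""
    match = re.fullmatch(r"([abcdwxyz])(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized fragment label: {label}")
    kind, index = match.group(1), int(match.group(2))
    return index if kind in FIVE_PRIME else length - index

def sequence_coverage(labels, length):
    """Fraction of the length - 1 backbone positions with at least one fragment."""
    covered = {cleavage_position(label, length) for label in labels}
    covered = {p for p in covered if 1 <= p < length}
    return len(covered) / (length - 1)
```

For a 78 nt sequence, w13 and w12 map to positions 65 and 66, the two cleavage sites on either side of residue 66, which is why this pair brackets the modified nucleotide.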

Figure 3.

Annotated UVPD (213 nm, 50 pulses) spectrum of m5C 66 let7 precursor (z = 28-, m/z 894) with fragment ions identified by Nucleo-SAFARI using an m/z tolerance of 10 ppm and an intensity tolerance of 30%. These parameters resulted in a sequence coverage of 88.3% and a P-score of 1.03e-180. Eight representative fragment ions are expanded in the insets. Selected fragment ion annotations outlined in red boxes display two key w fragments bracketing a methylated cytidine nucleobase (m5C) at position 66, which also contains a deoxyribose sugar.

An important application of top-down MS for analysis of nucleic acids is differentiation of positional isomers, which is particularly critical when localizing covalent modifications. UVPD mass spectra were collected for three variants of the synthetic let7 precursor RNA, including an unmodified species and two methylated isomers, m6A 56 and m5C 66. These oligonucleotides contained deoxyribose sugars at the methylated nucleobases, resulting in only a 2 Da mass difference. Fragment ion searches carried out for each isoform in Nucleo-SAFARI yielded sequence coverages of 83.1%, 85.7%, and 88.3%, and P-scores of 1.24e-163, 2.66e-179, and 1.03e-182 for the unmodified, m6A 56, and m5C 66 variants, respectively. Fragment ions illustrating characterization of the different isoforms are shown in Figure S9. While bracketing fragments were not observed at positions 56 and 66 for all three isoforms, w fragments originating from phosphodiester backbone cleavages on either side of and between the methylation sites were observed in each spectrum. Each of the four w fragment ions at positions 51, 60, 62, and 70 (i.e., w27, w18, w16, and w8) displays an isotopic distribution dependent on inclusion of the modified nucleotide, as mapped by Nucleo-SAFARI in Figure S9. A fragment ion indicates the presence of a modified nucleotide when the ion observed for the m6A 56 or m5C 66 variant exhibits a lower m/z value than the corresponding ion of the unmodified let7 precursor. The m6A 56 variant exhibits this m/z difference for only the w27¹⁰⁻ ion, whereas the m5C 66 variant exhibits the characteristic m/z shift in the w27¹⁰⁻, w18⁷⁻, and w16⁶⁻ ions, while no −2 Da shift is observed for the w8³⁻ ion. The 1 Da discrepancy between the unmodified let7 precursor and the m6A 56 and m5C 66 variants in the w8³⁻ ion could be attributable to a sequence mutation from cytidine to uridine at the 3′ terminus or deamidation elsewhere in the molecule, which is also observable as a 3 Da difference in the precursor isotopic distributions, shown in Figure S10. These observations (e.g., 1 and 3 Da differences) are unlikely to be reflected in P-scores calculated for fragment ion searches of unmodified let7 precursor with and without a potential 3′ terminal cytidine to uridine mutation, owing to the small number of identified fragments that would differ between the two searches.
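The mass differences discussed above follow from simple elemental bookkeeping, sketched here in Python for illustration. The sketch assumes that methylation adds CH2, that a deoxyribose sugar differs from ribose by one oxygen, and that a cytidine to uridine change replaces NH with O; the names are hypothetical.

```python
# Monoisotopic masses (Da) of the lightest stable isotopes.
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}

def delta_mass(added, removed):
    """Monoisotopic mass change for a set of atoms gained and lost."""
    gain = sum(MONO[element] * n for element, n in added.items())
    loss = sum(MONO[element] * n for element, n in removed.items())
    return gain - loss

# Methylation (+CH2) combined with ribose -> deoxyribose (-O): about -1.979 Da.
METHYL_DEOXY = delta_mass({"C": 1, "H": 2}, {"O": 1})
# Cytidine -> uridine (NH replaced by O): about +0.984 Da.
C_TO_U = delta_mass({"O": 1}, {"N": 1, "H": 1})

def mz_shift(mass_shift, z):
    """Shift observed in the m/z domain for an ion carrying z charges."""
    return mass_shift / z
```

At a charge state of 10, the roughly 2 Da methyl-plus-deoxy difference appears as a shift of only about 0.2 m/z, which is why the comparison relies on well-resolved isotopic distributions.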

CONCLUSIONS

Top-down mass spectrometry offers a versatile strategy for characterization of modified nucleic acids. Despite the limitations of mass spectrometry in generating sequence information on the scale of contemporary next-generation sequencing methods, the information provided by MS1 and MS/MS spectra can provide valuable insight into the identity and location of covalent modifications. Coupled to high-throughput sample introduction methods, optimized MS/MS fragmentation of modified nucleic acids can rapidly characterize numerous isoforms. Continued development of automated software approaches to process this wealth of top-down data is critical for advancing the field.

Nucleo-SAFARI adds to the burgeoning field of nucleic acid mass spectrometry data analysis tools and provides an R-based workflow designed for top-down spectra of increasingly large (25 kDa, 80 nt) nucleic acids. This tool provides a critical validation capability, where nucleic acid sequences and covalent modifications thereof can be confirmed based on fragment ions generated by MS/MS. In this way, known sequences can be verified, and potential permutations of a covalent modification can be investigated based on the specific masses of identified fragment ions. Fragment identification in the m/z domain for DNA and RNA molecules in the positive or negative polarities coupled with detailed results and visualization outputs enable straightforward analysis of nucleic acid MS/MS spectra with support for a variety of synthetic and biological modifications. Moreover, identification of fragments directly in the m/z domain without reliance on deconvolution and deisotoping algorithms enables direct evaluation of fragment ion abundances and accurate consideration of noncanonical heteroatoms. This approach is being utilized to continue investigations into the fragmentation propensities of various oligonucleotides, increasing the depth to which top-down mass spectrometry methods can characterize modified nucleic acids and distinguish positional isomers of specific covalent modifications based on their fragmentation trends. Continued development of this data processing approach will extend the methodology to simultaneous consideration of permutations of covalent modification, along with additional automation to enable batch processing of multiple spectra and sequences within the application.


Funding

This research was supported by the National Institutes of Health (R35GM13965) and the Robert A. Welch Foundation [F-1155].

Footnotes

The authors declare no competing financial interest.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.analchem.4c03201.

Sequences and calculated monoisotopic masses for the DNA 60 mer and let7 precursors, definition of predefined symbols for encoding nucleic acid sequences in Nucleo-SAFARI, a module diagram for Nucleo-SAFARI, visualizations of identified fragment ions from UVPD of a m5C66 let7 precursor, annotated spectra of the unmodified and m6A56 let7 precursors, and a comparison of the let7 precursor isotopic distributions (PDF)

Contributor Information

Michael B. Lanzillotti, Department of Chemistry, University of Texas at Austin, Austin, Texas 78712, United States

Jennifer S. Brodbelt, Department of Chemistry, University of Texas at Austin, Austin, Texas 78712, United States

Data Availability Statement

All data utilized herein including raw files, theoretical fragments, and identified fragments can be found at https://dataverse.tdl.org/dataverse/Nucleo-SAFARI, and the contents are described in the Supporting Information.

REFERENCES
