PLOS One. 2024 Apr 18;19(4):e0300350. doi: 10.1371/journal.pone.0300350

Bioinformatics pipeline for the systematic mining genomic and proteomic variation linked to rare diseases: The example of monogenic diabetes

Ksenia G Kuznetsova 1,2,*, Jakub Vašíček 1,2, Dafni Skiadopoulou 1,2, Janne Molnes 1,3, Miriam Udler 4,5,6, Stefan Johansson 1,3, Pål Rasmus Njølstad 1,7, Alisa Manning 4,5,6, Marc Vaudel 1,2,8,*
Editor: Kazunori Nagasaka 9
PMCID: PMC11025945  PMID: 38635808

Abstract

Monogenic diabetes is characterized as a group of diseases caused by rare variants in single genes. As for other rare diseases, multiple genes have been linked to monogenic diabetes with different measures of pathogenicity, but the information on the genes and variants is not unified among different resources, making it challenging to process them informatically. We have developed an automated pipeline for collecting and harmonizing data on genetic variants linked to monogenic diabetes. Furthermore, we have translated variant genetic sequences into protein sequences accounting for all protein isoforms and their variants. This allows researchers to consolidate information on variant genes and proteins linked to monogenic diabetes and facilitates their study using proteomics or structural biology. Our open and flexible implementation using Jupyter notebooks enables tailoring and modifying the pipeline and its application to other rare diseases.

Introduction

The most common forms of monogenic diabetes include maturity-onset diabetes of the young (MODY), neonatal diabetes [1], inherited lipodystrophies, and mitochondrial diabetes, among others [2]. Today, international guidelines are available for the diagnosis and follow-up of patients with suspected MODY [3]. These patients may now receive a molecular genetic diagnosis using diagnostic gene sequencing panels (https://panelapp.genomicsengland.co.uk/panels/472). This allows precise MODY subtyping and, depending on the diagnosis, the opportunity to avoid lifelong insulin medication and complications through lifestyle management or alternative treatment using oral antidiabetic drugs [4]. Furthermore, because an early and correct diagnosis may enable successful treatment with low doses of sulfonylurea or diet alone and postpone complications, the timing of diagnosis and care is crucial [4]. It is estimated that around 80% of monogenic diabetes cases remain undiagnosed due to symptomatic similarity to other types of diabetes [5]. These patients and their relatives remain unaware of their familial condition and do not benefit from adapted care.

The first challenge in establishing a firm diagnosis in all cases of MODY is the mapping of all genes that can cause monogenic diabetes. To date, multiple genes have been discovered to have associations with familial forms of diabetes [2]. Fourteen of them are notably often referred to as “the MODY genes” [3], although this list is subject to debate in the literature [6] and is being systematically assessed by international experts using established guidelines to determine gene-disease relationships (https://clinicalgenome.org/affiliation/40016). A second challenge is the difficulty in evaluating the pathogenicity of genetic variants [7], for which there is also an ongoing international effort to establish guidelines and provide expert variant curations in the ClinVar database (https://clinicalgenome.org/affiliation/50016) [8]. Furthermore, the response to alternative treatment might differ between populations [9]. One of the ways to address the challenge of precise diagnostics is to complement genetic screening with additional data, combining both molecular and clinical dimensions [10].

The recent advent of high-performance computational models for protein structures notably holds the promise to increase the throughput at which the structural consequences of genetic variation can be studied [11]. Studying the protein sequences encoded by specific alleles can, for example, help understand whether their structure and properties are affected, hence shedding light on the pathogenicity of variants found during genetic diagnosis [12]. The adoption of these approaches is, however, impaired by the difficulty of mapping variants linked to rare diseases to the different forms of proteins that they encode. First, given that these variants are rare, the coverage by genomic databases is low. Maintaining an updated list of variants requires monitoring and mining of the literature by experts. Second, variants reported in the literature often lack standardization in their identifiers and coordinates, making it challenging to map them to a given genome build and requiring manual variant mapping. Third, inferring the consequences on protein products is still a daunting task for some variants (those alleles affecting splice sites and untranslated regions [UTRs], for example). Fourth, a given protein-coding sequence might encode different protein isoforms, which will produce different forms of proteins upon folding and post-translational modification [13], hence for a given variant multiple protein sequences need to be investigated. Mapping genetic variants linked to monogenic diabetes from genes to proteins is therefore neither tractable nor sustainable without automation using dedicated bioinformatic tools.

Here, we describe a new open-source modular pipeline based on Jupyter notebooks (https://eprints.soton.ac.uk/403913) that allows for the systematic collection of variants linked to monogenic diabetes and their mapping to Ensembl [14] and ClinVar [8]. We demonstrate how the different genes linked to monogenic diabetes harbor variants of different clinical significance. Finally, we translate the variant sequences to the protein level and provide the resulting sequences in a standard format that can readily be used for proteomic search and structural proteomic analyses using mass spectrometry or protein structure modeling.

Methods

General architecture

The pipeline consists of seven independent modules written in Python using Jupyter notebooks (Fig 1). The notebooks are chained together as a pipeline, but they can also be used as standalone applications, or integrated into other pipelines. First, the pipeline takes a list of genes and extracts exonic variants from Ensembl, retaining only those variants that are predicted to affect protein sequences. Next, the program integrates the variants from ClinVar and maps them to Ensembl. Similarly, variants are extracted from the literature, here using the literature mining by Rafique et al. [15], and mapped to Ensembl. Subsequently, the harmonized collection of variants is consolidated in a database stored in the form of a text file that can easily be parsed and reused. Finally, the table of variants is mapped to all the transcripts linked by Ensembl to the genes of interest to obtain protein sequences of all the possible isoforms encoded by these genes. In this last step, the DNA sequences are translated to amino acid sequences and stored as protein FASTA files; here, the user can also filter them to the isoforms of interest. To visually inspect the results, a separate module allows overlaying all the variants that can possibly affect a given gene onto the corresponding amino acid sequences.

Fig 1. General architecture of the pipeline.

Fig 1

Module 1: extract variants from Ensembl affecting genes linked to monogenic diabetes. Module 2: filter variants by consequence on the protein. Module 3: extract variants from ClinVar affecting genes linked to monogenic diabetes. Module 4 and Module 5: extract the variants from the literature using the mining by Rafique et al. Module 6: consolidate the variants in a single table. Translation: produce the possible variant protein sequences. MODY—maturity-onset diabetes of the young, MD—monogenic diabetes, ND—neonatal diabetes, API—application programming interface, PMID—identifiers of scientific publications from the PubMed database, dbSNP identifiers—identifiers of the genomic variants from the dbSNP database, which start with “rs”.

Module 1—Mining variants in Ensembl

In Module 1, a list of genes is mapped to Ensembl (here and further version 108) genes, transcripts, and exon identifiers using the Ensembl REST API [16]. Subsequently, the Ensembl REST API is queried using the exon identifiers to return all variants in Ensembl overlapping with the corresponding regions along with their annotation (identifiers, coordinates, consequences, etc.). For multi-allelic variants, we treat every alternative allele as a different entry, and add it as a new line to the table produced. The reference table is given in the Supplementary materials, S1 Table.
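
As a rough illustration of the multi-allelic handling described above, the sketch below expands one variant record into one row per alternative allele. The record layout (keys such as `id`, `seq_region_name`, `alleles`, `consequence_type`) follows what the Ensembl REST `overlap` endpoint typically returns for `feature=variation`, but the helper names, the sample record, and the assumption that the first allele is the reference are ours and are not taken from the pipeline's notebooks.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def overlap_url(exon_id: str) -> str:
    """Build the URL listing variants that overlap an Ensembl feature such as an exon."""
    return f"{ENSEMBL_REST}/overlap/id/{exon_id}?feature=variation"


def fetch_variants(exon_id: str) -> list:
    """Query the REST API (needs network access; not called in this sketch)."""
    request = urllib.request.Request(
        overlap_url(exon_id), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def split_alleles(record: dict) -> list:
    """Return one row per alternative allele, assuming alleles[0] is the reference."""
    ref, *alts = record["alleles"]
    return [
        {
            "id": record["id"],
            "chrom": record["seq_region_name"],
            "start": record["start"],
            "end": record["end"],
            "ref": ref,
            "alt": alt,
            "consequence": record.get("consequence_type"),
        }
        for alt in alts
    ]


# Illustrative record, not real data.
example = {
    "id": "rs0000001",
    "seq_region_name": "12",
    "start": 1000,
    "end": 1000,
    "alleles": ["G", "A", "T"],
    "consequence_type": "missense_variant",
}
rows = split_alleles(example)
```

Here `rows` holds two entries (G>A and G>T) that share the same identifier and coordinates, which is the one-line-per-alternative-allele layout described for the reference table.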

Module 2—Categorizing by consequence and pathogenicity

Module 2 takes as an input the table of exonic variants from Ensembl generated in Module 1. For each consequence type, the prevalence of variants with different levels of pathogenicity is computed and visualized as heat maps (Fig 2A). The same is done for the data from ClinVar (version October 2022) with all the variants regardless of the associated phenotype (Fig 2B).
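
The prevalence computation can be sketched as follows. This is a minimal illustration, assuming that percentages are normalized within each consequence type and that each variant contributes one (consequence type, clinical significance) pair; the function name and the toy input are hypothetical.

```python
from collections import Counter, defaultdict


def pathogenicity_percentages(variants):
    """Map consequence type -> {pathogenicity label: percentage of its variants}."""
    counts = defaultdict(Counter)
    for consequence, significance in variants:
        counts[consequence][significance] += 1
    table = {}
    for consequence, by_label in counts.items():
        total = sum(by_label.values())
        table[consequence] = {
            label: 100.0 * n / total for label, n in by_label.items()
        }
    return table


# Toy input: (consequence type, clinical significance) pairs, not real data.
toy = [
    ("missense_variant", "pathogenic"),
    ("missense_variant", "benign"),
    ("missense_variant", "pathogenic"),
    ("missense_variant", "likely pathogenic"),
    ("synonymous_variant", "benign"),
]
table = pathogenicity_percentages(toy)
```

The resulting nested dictionary has one row per consequence type and one column per pathogenicity label, which is the shape a heat map plotting routine would expect.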

Fig 2. Distribution of the consequence types within pathogenicity categories of the variants.

Fig 2

A: Variants from Ensembl, B: Variants from ClinVar. The heatmap shows the percentage of variants within pathogenicity categories for each consequence type. Arrows indicate the consequence types that were left in the Ensembl reference table created in Module 2 after filtering. “Stop lost” is marked with a blue arrow as it has conflicting evidence in the two sources (see details in the text).

By default, only variants with consequence types whose variants are mostly classified as likely pathogenic or pathogenic are retained for further analyses: “missense_variant”, “protein_altering_variant”, “coding_sequence_variant”, “frameshift_variant”, “splice_donor_variant”, “splice_acceptor_variant”, “splice_donor_5th_base_variant”, “start_lost”, “stop_gained”, “stop_lost”, “inframe_deletion”, “inframe_insertion”.
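
A minimal sketch of this default filter is given below. The set of retained terms is copied from the list above; the row layout (a `consequences` list per variant) and the rule that a variant is kept when any of its terms is retained are assumptions made for illustration.

```python
# Consequence terms retained by default, as listed in the text.
RETAINED_CONSEQUENCES = {
    "missense_variant",
    "protein_altering_variant",
    "coding_sequence_variant",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
    "splice_donor_5th_base_variant",
    "start_lost",
    "stop_gained",
    "stop_lost",
    "inframe_deletion",
    "inframe_insertion",
}


def keep_variant(consequences):
    """Return True if at least one consequence term is in the retained set."""
    return any(term in RETAINED_CONSEQUENCES for term in consequences)


def filter_variants(rows):
    """Keep only rows with a retained consequence (hypothetical row layout)."""
    return [row for row in rows if keep_variant(row["consequences"])]


# Toy rows, not real data.
rows = [
    {"id": "var1", "consequences": ["missense_variant"]},
    {"id": "var2", "consequences": ["synonymous_variant"]},
    {"id": "var3", "consequences": ["intron_variant", "splice_donor_variant"]},
]
kept = filter_variants(rows)
```

Because the set is a plain Python constant, tailoring the filter to another disease would amount to editing `RETAINED_CONSEQUENCES`.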

Module 3. Mapping of the variants from ClinVar

Module 3 takes as an input three tables exported from ClinVar containing all the variants returned after querying three phenotypes: “MODY”, “monogenic diabetes”, and “neonatal diabetes”. ClinVar annotates variants with different levels of pathogenicity: “pathogenic”, “likely pathogenic”, “uncertain significance”, “likely benign” or “benign”. The variants presenting dbSNP (release 154) identifiers are passed to Module 2. The others are mapped to Ensembl using genomic coordinates and alleles, and predicted consequences are obtained with the Ensembl Variant Effect Predictor (VEP) [17] called on all variants using the Ensembl REST API. The variants with predicted consequences according to Ensembl are filtered in the same way as in Module 2 and listed in S2 and S3 Tables in the Supplementary materials. The variants where no consequence type was returned were all either short deletions, short insertions, or other short fragment replacements. These variants are listed in S4 and S5 Tables in the Supplementary materials. This subset is directly passed to Module 6. Note that the full set of tables is stored on GitHub.
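
To illustrate how variants without dbSNP identifiers could be submitted to VEP by coordinates and alleles, the sketch below builds the JSON body for a POST request to the VEP region endpoint of the Ensembl REST API, which accepts VCF-style strings. The helper names and the example coordinates are hypothetical, and the actual notebooks may call VEP differently.

```python
import json

# VEP region endpoint of the public Ensembl REST server (POST requests).
VEP_REGION_URL = "https://rest.ensembl.org/vep/human/region"


def vcf_like_line(chrom, pos, ref, alt, identifier="."):
    """Format one variant as a VCF-style string: CHROM POS ID REF ALT and three empty fields."""
    return f"{chrom} {pos} {identifier} {ref} {alt} . . ."


def vep_payload(variants):
    """Build the JSON body listing all variants to annotate in one request."""
    return json.dumps({"variants": [vcf_like_line(**v) for v in variants]})


# Toy coordinates, not real variants.
variants = [
    {"chrom": "7", "pos": 44145000, "ref": "G", "alt": "A"},
    {"chrom": "12", "pos": 120990000, "ref": "C", "alt": "T"},
]
payload = vep_payload(variants)
```

The payload would be sent with `Content-Type: application/json` and `Accept: application/json` headers; variants for which the response carries no consequence term would then be handled separately, as described above for short insertions and deletions.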

Module 4. Mapping variants from the literature (step 1 of 2)

Module 4 takes variants linked to monogenic diabetes according to the literature to cover the variants where the association with monogenic diabetes is not yet consolidated in ClinVar or Ensembl. As input we used the variants mined in the review by Rafique et al. [15] and provided as a supplementary table in their publication. The module uses the Vcfanno [18] library to annotate the variants with DNA coordinates and the dbSNP identifiers where possible. The Ensembl REST API is queried as in Module 1, and the results are passed to Module 6. The resulting variants in this step are given in Supplementary S6 Table.

Module 5. Mapping variants from the literature (step 2 of 2)

For the variants that could not be mapped automatically in Module 4, the title of the publication as obtained from Module 4 is queried against the Entrez Programming Utilities API (https://www.ncbi.nlm.nih.gov/books/NBK25500) to return the PubMed (https://pubmed.ncbi.nlm.nih.gov) identifiers (PMIDs) of these articles. Note that some of the PMIDs were not mapped and had to be added manually. Next, these PMIDs are used to query the same API and return the “rs” identifiers of the variants mentioned in these publications. Finally, the variants are mapped to Ensembl using their identifiers and passed to Module 6. Note that not all these variants could be mapped automatically, and those that did not map to Ensembl were formatted manually for input to Module 6. In particular, the variants from ClinVar that did not map and, as a result, do not have “rs” identifiers were added to the table with their identifiers from ClinVar. Besides the identifiers, the table contains the chromosome number, the position of the variant, and the reference and alternative alleles. In total, 8 variants were added to the VCF manually. The resulting variants in this step are given in Supplementary S7 Table.
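
The two E-utilities queries described above can be sketched as follows. This is an illustration under stated assumptions: it uses the `esearch` and `elink` endpoints with JSON output, restricts the search to the `[Title]` field, and assumes that dbSNP UIDs returned by `elink` are rs numbers without the “rs” prefix. The function names and the sample responses are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_title_url(title):
    """URL searching PubMed for an article by its title."""
    query = urlencode({"db": "pubmed", "term": f"{title}[Title]", "retmode": "json"})
    return f"{EUTILS}/esearch.fcgi?{query}"


def pmids_from_esearch(result):
    """Extract PMIDs from a decoded esearch JSON response."""
    return result.get("esearchresult", {}).get("idlist", [])


def elink_snp_url(pmid):
    """URL listing dbSNP records linked to a PubMed article."""
    query = urlencode({"dbfrom": "pubmed", "db": "snp", "id": pmid, "retmode": "json"})
    return f"{EUTILS}/elink.fcgi?{query}"


def rsids_from_elink(result):
    """Extract rs identifiers from a decoded elink JSON response (assumed layout)."""
    rsids = []
    for linkset in result.get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            if linksetdb.get("dbto") == "snp":
                rsids.extend(f"rs{uid}" for uid in linksetdb.get("links", []))
    return rsids


# Hand-written sample responses, not real query results.
sample_search = {"esearchresult": {"idlist": ["12345678"]}}
sample_links = {"linksets": [{"linksetdbs": [{"dbto": "snp", "links": ["42"]}]}]}
pmids = pmids_from_esearch(sample_search)
rsids = rsids_from_elink(sample_links)
```

Titles that return no PMID, or PMIDs that return no linked variant, would fall into the manually curated subset mentioned in the text.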

Module 6—Consolidation as table and VCF file

Module 6 combines all the tables produced by the previous modules and creates Venn diagrams showing the number of variants obtained from the different sources and their overlap. This table is then used to create a Variant Calling Format (VCF) file listing the site of all the variants.
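
As an illustration of the last step, a consolidated table can be rendered as a minimal sites-only VCF along the following lines. The row keys (`chrom`, `pos`, `id`, `ref`, `alt`) are hypothetical column names, the header is reduced to the bare minimum, and chromosomes are sorted as strings for simplicity; the VCF produced by the actual notebook may contain more metadata.

```python
def to_vcf(rows):
    """Render variant rows as a minimal sites-only VCF string."""
    header = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Sort by chromosome name (as text) and then by numeric position.
    ordered = sorted(rows, key=lambda row: (str(row["chrom"]), int(row["pos"])))
    body = [
        "\t".join(
            [
                str(row["chrom"]),
                str(row["pos"]),
                row.get("id") or ".",  # missing identifiers become "."
                row["ref"],
                row["alt"],
                ".",
                ".",
                ".",
            ]
        )
        for row in ordered
    ]
    return "\n".join(header + body) + "\n"


# Toy rows, not real variants.
rows = [
    {"chrom": "7", "pos": 200, "id": "rs0000002", "ref": "C", "alt": "T"},
    {"chrom": "12", "pos": 100, "id": None, "ref": "G", "alt": "A"},
]
vcf_text = to_vcf(rows)
```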

We further categorized the variants into two levels of pathogenicity evidence. Level 1 variants include the variants from ClinVar obtained in Module 3 after querying three phenotypes: “MODY”, “monogenic diabetes”, or “neonatal diabetes” regardless of their reported clinical significance, complemented with the variants extracted from the literature in Module 4 and Module 5. Level 2 variants undergo stricter filtration criteria. For the variants obtained from ClinVar in Module 3, all variants labeled as benign, likely benign, or of uncertain significance were filtered out. For the variants obtained from the literature in Module 4 and Module 5, variants in BLK, KLF11, and PAX4 were removed as these genes were reported to lack pathogenicity in MODY in more recent literature [6]. The output tables of this module are given as Supplementary S8 and S9 Tables for the level 1 and level 2 variants, respectively.
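
The level 2 criteria can be expressed as a small predicate. The sketch below assumes a hypothetical record layout in which each variant carries a single `source` (`clinvar` or `literature`), a `significance` label, and a `gene` symbol; how variants present in both sources are handled in the actual notebooks is not covered here.

```python
# ClinVar labels removed when building the level 2 set, as described in the text.
EXCLUDED_SIGNIFICANCE = {"benign", "likely benign", "uncertain significance"}
# Genes whose literature-derived variants are removed for level 2.
EXCLUDED_LITERATURE_GENES = {"BLK", "KLF11", "PAX4"}


def is_level2(variant):
    """Return True if a level 1 variant also passes the stricter level 2 criteria."""
    if variant["source"] == "clinvar":
        return variant["significance"].lower() not in EXCLUDED_SIGNIFICANCE
    if variant["source"] == "literature":
        return variant["gene"] not in EXCLUDED_LITERATURE_GENES
    return False


# Toy level 1 records, not real data.
level1 = [
    {"id": "v1", "source": "clinvar", "significance": "Pathogenic", "gene": "GCK"},
    {"id": "v2", "source": "clinvar", "significance": "Uncertain significance", "gene": "ABCC8"},
    {"id": "v3", "source": "literature", "significance": "", "gene": "PAX4"},
    {"id": "v4", "source": "literature", "significance": "", "gene": "HNF1A"},
]
level2 = [variant for variant in level1 if is_level2(variant)]
```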

Translation of the variant sequences into protein sequences

The translation step was performed using the ProVar tool (https://github.com/ProGenNo/ProHap). Briefly, the variants from the table obtained in Module 6 were mapped to canonical cDNA sequences from Ensembl to retrieve all the transcripts of the same gene with the annotation of start and stop codons. After mapping, all the sequences were translated to their amino acid sequences and written into a protein FASTA file. Both the table output and FASTA examples are given in Supplementary materials deposited on Figshare (https://doi.org/10.6084/m9.figshare.21444963.v2).
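
The principle of this step can be illustrated with a toy example: substitute a single base in a coding sequence, translate the result with the standard genetic code, and write a FASTA entry. This is not the ProVar implementation; the function names, the toy sequence, and the header format are hypothetical, and only single nucleotide substitutions within the coding sequence are handled.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_snv(cds, position, ref, alt):
    """Substitute one base at a 1-based position after checking the reference base."""
    index = position - 1
    if cds[index] != ref:
        raise ValueError(f"expected {ref} at position {position}, found {cds[index]}")
    return cds[:index] + alt + cds[index + 1:]


def translate(cds):
    """Translate a coding sequence from its first base up to the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def fasta_entry(header, sequence, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{header}\n" + "\n".join(lines) + "\n"


reference_cds = "ATGGGCTAA"  # toy sequence: Met-Gly-stop
variant_cds = apply_snv(reference_cds, 4, "G", "C")  # GGC (Gly) becomes CGC (Arg)
variant_protein = translate(variant_cds)
record = fasta_entry("toy_transcript|c.4G>C", variant_protein)
```

In the pipeline itself, the same variant is applied to every transcript of the gene, so one genomic variant can yield several protein sequences.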

Sequence overlay

All the variants from the resulting database were overlaid with the reference protein sequences obtained from Ensembl for all the transcripts of all genes. The variants are represented using two separate rows corresponding to the two levels of pathogenicity confidence (see example in Fig 3). FASTA files are parsed using the Pyteomics library [19]. The protein sequences and variants are plotted using the Matplotlib library [20].

Fig 3.

Fig 3

A: All isoforms of the protein product of HNF1A with the amino acid variants from the database mapped on the sequence. B: Examples of the random fragments of the products of the canonical isoforms of HNF1A and ABCC8 showing the difference in density of variants from level 1 and level 2 databases. The top row of blue dots represents the positions of the variants from the level 2 database, whilst the bottom green row represents the positions of the variants from the level 1 database. Being the most well-studied gene, HNF1A has an almost equal number of dots in both rows as most of the reported variants are revised and confirmed as pathogenic. ABCC8 has a lot of variants in the level 1 database that are not confirmed as pathogenic and, thus, are not in the level 2 database.

Results

Unlike more common forms of diabetes like type 2 diabetes (T2D), where large numbers of samples are available and federated initiatives consolidate information on genetic variants and their consequences in aggregated and harmonized forms (e.g. https://t2d.hugeamp.org), monogenic diabetes, as a rarer disease, relies on small cohorts and information on genetic variants is scattered in the literature and online databases. The aggregation and comparison of variants possibly linked with monogenic diabetes therefore currently relies on expert manual curation and annotation.

Mining variants in genes linked to monogenic diabetes

We mined variants based on a list of 109 genes linked to monogenic diabetes, aiming at being as comprehensive as possible. These 109 genes consist of: i. 14 MODY genes taken from OMIM [21] or other reviews on MODY such as [3]; ii. 10 genes linked to neonatal diabetes, lipodystrophy, and insulin signaling taken from [2]; iii. 77 genes having variants with any evidence of association with either MODY, neonatal diabetes, or just the condition referred to as “monogenic diabetes” by ClinVar.

Categorizing by consequence and pathogenicity

Since monogenic diabetes, being a Mendelian disease, is determined mostly by rare, highly penetrant coding variants [22], we focused on exonic variants when mining Ensembl. Nevertheless, some variants reported in ClinVar and the literature mapped to untranslated regions (UTR) and splice regions. In these cases, we decided whether to keep these variants or not based on the pathogenicity of variants in each consequence type category. To produce protein sequences, we focused on predicted consequences: “missense variant”, “protein altering variant”, or “coding sequence variant”, and ignored variants on a transcript that were “non-coding transcript variant” or “synonymous variant”. For other types of consequences, we focused on those enriched with pathogenic and likely pathogenic variants: “missense variant”, “protein altering variant”, “coding sequence variant”, “frameshift variant”, “splice donor variant”, “splice acceptor variant”, “stop retained”, “stop gained”, “inframe insertion”, and “inframe deletion” (Fig 2). The “stop lost” consequences yielded different prevalence of pathogenic or likely pathogenic variants when considering the consequences reported by Ensembl vs. ClinVar (Fig 2), which can be explained by the differences between Ensembl and ClinVar. Here, the Ensembl dataset consists of the exonic variants, but is not linked to any pathological phenotype. The ClinVar dataset, though, was not narrowed down to any particular regions in the genes of interest, but all the variants in it are linked to clinical conditions. Therefore, the Ensembl dataset is enriched with protein coding variants, whereas the ClinVar dataset is enriched with pathogenic variants. The “stop lost” consequences were included in our analysis. Altogether, the resulting table contained 69,256 unique variants located in 109 genes.

Categorizing by consequence and pathogenicity

We mapped 2,701 of these variants to variants linked to monogenic diabetes according to ClinVar or the literature, termed level 1 variants hereafter: 2,220 (82%) mapped uniquely to ClinVar, 136 (5%) to publications reviewed by Rafique et al., and 345 (13%) to both (Fig 4). An effect on protein sequences was predicted for 2,624 (97%) of them, resulting in 12,643 different protein sequences when accounting for all isoforms. The reason for some variants not reaching the final translated sequences is that some transcripts in Ensembl do not have an associated canonical protein product. These are included when selecting exons in Module 1 but do not yield protein sequences. On the other hand, genetic variation can cause translation of UTRs and other normally untranslated regions [23], and genetic variation in the UTRs and splice regions might affect the translation of the proteins in an indirect way [24, 25], but the effects of these variants on amino acid sequences remain challenging to predict. We further filtered the variants to retain only the pathogenic and likely pathogenic variants, termed level 2 variants hereafter, yielding 876 variants, of which 641 (73%) mapped uniquely to ClinVar, 160 (18%) to publications reviewed by Rafique et al., and 75 (9%) to both (Fig 4). An effect on protein sequences was predicted for 714 (82%) of them, resulting in 3,776 distinct protein sequences.
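
The shares reported above follow directly from the three counts per level. As a quick arithmetic check, the hypothetical helper below recomputes the totals and the rounded percentages from the numbers given in the text.

```python
def source_shares(clinvar_only, literature_only, both):
    """Return the total and the rounded percentage contributed by each source category."""
    counts = {
        "clinvar_only": clinvar_only,
        "literature_only": literature_only,
        "both": both,
    }
    total = sum(counts.values())
    shares = {name: round(100 * n / total) for name, n in counts.items()}
    return total, shares


# Counts taken from the text for the level 1 and level 2 variant sets.
level1_total, level1_shares = source_shares(2220, 136, 345)
level2_total, level2_shares = source_shares(641, 160, 75)
```

With these inputs, the totals are 2,701 and 876, and the shares round to 82%, 5%, and 13% for level 1 and to 73%, 18%, and 9% for level 2, in agreement with the reported values.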

Fig 4. Venn diagram representing the number of variants in two levels of the database and the number of variants taken from different sources.

Fig 4

Level 1 database (at the left) consists of i. ClinVar variants retrieved when querying three phenotypes, i.e. “MODY”, “monogenic diabetes”, and “neonatal diabetes” mapped to Ensembl; ii. All variants from Rafique et al. mapped to Ensembl. Level 2 database (at the right) consists of i. ClinVar “pathogenic” + “likely pathogenic” variants; ii. Variants from Rafique et al. excluded BLK, KLF11, and PAX4.

In both cases, the overlap between variants from ClinVar and the literature is limited. This can be explained by the fact that the ClinVar dataset consisted of the variants linked to all kinds of monogenic diabetes (109 genes) and the Rafique et al. dataset consisted only of the MODY variants (14 genes). Thus, we re-ran the pipeline considering only MODY variants from ClinVar. The overlap of the ClinVar MODY dataset was then 18% and 10% for the level 1 and 2 variants, respectively. While the overlap was improved, it is still limited. This illustrates the importance of considering different sources of information when studying monogenic diabetes, and rare diseases in general.

It should be noted here that, in their work, Rafique et al. also used ClinVar as a source of variants. The variants from ClinVar that were not found by the literature text mining algorithm are listed in a separate supplementary file in their publication. We decided not to include this list in our work as we have included all the monogenic diabetes-associated variants from ClinVar anyway. This is another reason why the overlap of the variants taken from Rafique et al. and ClinVar seems limited in our analysis.

Distribution of variants among the genes

We observed strong disparities in the number of variants linked to monogenic diabetes among the different genes. HNF1A, GCK, and HNF1B are the genes presenting the most variants in both level 1 and level 2 databases, indicating that most of the monogenic diabetes-associated variants have been reported in these genes, as well as most of the ones confirmed as pathogenic or likely pathogenic (Fig 3). Conversely, genes like ABCC8 and KCNJ11 feature many level 1 variants, while only a few of those are confirmed to be pathogenic or likely pathogenic (Fig 3). In fact, of all the genes observed, just about a third bear more than 20 variants, and only around 5% have more than 100 variants. Fig 5 represents the number of gene variants in both levels of the database, where the number of variants in the gene is more than 20. S10 Table of the Supplementary materials and figure “number of variants per gene” in GitHub represent the number of variants in each gene in both levels of the database. Besides the biological reasons that some genes have more association with monogenic diabetes than others, this effect can be due to study biases or heterogeneous levels of information on these variants. For example, the most common genes known to cause MODY (HNF1A, GCK, HNF1B, and HNF4A) [2] might have received more attention than others (e.g. ABCC8 and KCNJ11), yielding a lower confirmation rate for the pathogenicity of variants in the latter. It is also worth noting that, at the time of writing, only variants in HNF1A, GCK, and HNF4A are reviewed by the expert panel on ClinVar (https://clinicalgenome.org/affiliation/40016). At the same time, ABCC8 is reported to be also associated with other types of diabetes [26] and definitely deserves attention from the perspective of MODY associations as this may bring researchers closer to an overall understanding of the causal mechanisms of diabetes and to more precise predictions of treatment outcomes.
The variant distribution in the studied genes can be observed in the figures visualizing the variants in the protein sequences (Fig 3 and others in the “figure” directory at https://github.com/kuznetsovaks/MD_variants).

Fig 5. Number of variants in the genes with the largest number of variants.

Fig 5

Selected are the genes in which the number of variants in the level 1 database is more than 20. The full bars represent the number of variants in the level 1 database, and the brown part of the bars represents the number of variants in the level 2 database.

Pipeline reproducibility

In order to check whether our pipeline is applicable to other rare diseases and show the reproducibility of the approach, we have run it on Hajdu-Cheney syndrome (HCS). Similar to monogenic diabetes, this is a rare monogenic condition that can be caused by a number of variants in protein-coding regions. All the intermediate files and the results of this analysis can be found in the “HCS” directory in the GitHub repository. Finally, we collected all the variants reported to be linked to HCS and highlighted the ones that are classified as pathogenic by ClinVar. All the variant sequences have been translated and added to the human proteome FASTA database, and the positions of the variants in the protein sequence have been visualized.

Discussion

In this work, we presented a computational pipeline that allows for systematic monogenic diabetes-linked variant collection and mapping. The sources of information on the genetic variation are not unified, which makes mapping of the variants challenging. An automated and reproducible pipeline for variant mapping has been developed and is available for public use. A database of variant protein sequences was created for the gene products of variants linked to monogenic diabetes. All known variants reported to be linked to monogenic diabetes published by the beginning of 2023 have been included in the database. The database contains variants with two levels of clinical significance: variants ever reported as linked to monogenic diabetes and pathogenic variants. Here, we considered variants pathogenic if they had pathogenic or likely pathogenic clinical significance regardless of the star status according to ClinVar. All the monogenic diabetes-linked variants have been translated into protein products and can be compared to the canonical protein sequences. This will help predict the effect of genetic variation on the resulting protein structure and function.

The workflow is automated and aims to gather multiple variants from different sources and avoid their manual annotation. The implementation in Jupyter notebooks provides a good trade-off between automation and flexibility. For example, researchers can execute the entire pipeline as is, adapt it to specific use cases, execute only modules of interest, or completely change the set of genes to study another disease. The public availability, extensive documentation, and permissive license further enable the reuse of our work.

The collected genetic variants linked to monogenic diabetes have been translated to protein sequences and mapped to all known protein isoforms, resulting in the collection of all predicted protein variant sequences. Our database is represented in the form of tables along with FASTA files and can be accessed both manually and automatically, allowing implementation in various workflows. The localization of variants on proteins and protein domains can shed light on their possible consequences and pathogenicity. In turn, overlaying variant pathogenicity on protein sequences can help in understanding protein function. Mutations in protein-coding regions of genes sometimes lead not to a complete stop of expression or protein degradation but rather to structural changes affecting the function of a protein. For example, in human glucokinase encoded by the GCK gene, sequence variants in a particular region do not affect the catalytic activity of the enzyme. Instead, they increase the rate of degradation and aggregation of the protein, contributing to the molecular mechanism of GCK-MODY disease [27]. The FASTA files can be supplemented with other proteins and used for proteomic search of mass spectrometry data. The variant protein sequences can also be used in protein structure modeling using tools like AlphaFold [11] and pathogenicity prediction using such tools as AlphaMissense [28].

This work illustrates how the different genes linked to monogenic diabetes show very different levels of annotation. Besides well-investigated genes, featuring a high number of variants with unambiguous consequences, many understudied genes bear variants lacking evidence of pathogenicity. Finally, after mining different sources, we have created a collection of variants reported to be linked to monogenic diabetes. These variants map to 109 genes. After filtering out all the non-coding variants and the variants in the regions not enriched with pathogenic variants (i.e. untranslated regions etc.) the list of genes consists of 76 genes. Filtering out all the benign and likely benign variants shortens the list of genes down to 36 genes, which, in turn, contains all the 14 previously recognized “MODY genes”. All these genes along with the source of the information on their variants and their expert panel review status are presented in Supplementary S10 Table.

Furthermore, some variants simply lack basic genomic annotation, and are reported as amino acid changes, e.g. “Gly292Argfs”. An amino acid substitution cannot always be mapped to a single genetic variant. Furthermore, most of the genes encode several protein isoforms and knowing an amino acid change does not give information on which isoform it affects and how it maps to other isoforms. Thus, reporting single amino acid substitutions impairs their inclusion in genomic and bioinformatic studies. In this work, we manually curated these variants and were able to match some of them to genomic coordinates using other variants and aligning protein isoforms using the IsoAligner tool [29]. Moreover, for complex proteomic research in humans, it is important to account for common variation [30]. In future work, we are going to combine our variant sequences with the database of human protein haplotypes [31].

Our work focused on single nucleotide variants (SNVs) and short indels and therefore does not cover larger insertion/deletion mutations causing monogenic diabetes. Insertions or deletions included in our database did not exceed 30 base pairs. Large genomic rearrangements, such as full deletion of the HNF1B gene [32] or full deletion of the 17q12 locus, have been shown to cause HNF1B-MODY (MODY5) and can be missed by conventional point mutation screening [33, 34]. These events cannot be directly translated to protein sequences and, as a result, are not reflected in our database.

Furthermore, we would like to emphasize that our work was aimed at creating a pipeline that helps collect variant-level information. While most of the works and reviews provide the clinical context of monogenic diabetes and summarize the collection of genes linked to it, for example, a recent review [34], we collected information on variant positions and allele alternatives in each gene.

In our work, we have analyzed the distribution of pathogenicity among the variants with known consequence types. Based on this analysis we have included variants classified as “splice donor variant”, “splice acceptor variant” and filtered out the “splice region variants”. In future work, the question of splice region variant consequences should be given more attention as this is an understudied field, and these variants can play a significant role in rare diseases [35].

Research on monogenic diabetes is a rapidly developing field, and new variants are constantly being reported from different cohorts. With the pipeline we have developed, we and others can easily update MODY variant collections as new variants are reported. Our pipeline can further be used as a whole or in parts to study other diseases. This can enable researchers to automatically and reproducibly collect variants linked to phenotypes of interest and consolidate them into a unified format. In research on rare diseases, the availability of flexible pipelines based on notebooks represents a good compromise between manual expert curation that lacks reproducibility and automated pipelines that cannot be tailored to the application.

Supporting information

S1 Table. Exonic variants extracted from Ensembl in Module 1.

(CSV)

pone.0300350.s001.csv (27.1MB, csv)
S2 Table. Result of the 1st step of mapping ClinVar to Ensembl in Module 3.

(CSV)

pone.0300350.s002.csv (1,021.6KB, csv)
S3 Table. Result of the 2nd step of mapping ClinVar to Ensembl in Module 3.

(CSV)

pone.0300350.s003.csv (51.2KB, csv)
S4 Table. Indels extracted from ClinVar for the level 1 database in Module 3.

(CSV)

S5 Table. Indels extracted from ClinVar for the level 2 database in Module 3.

(CSV)

pone.0300350.s005.csv (3.5KB, csv)
S6 Table. Results of the 1st step mapping variants from Rafique et al. to Ensembl in Module 4.

(CSV)

pone.0300350.s006.csv (99.2KB, csv)
S7 Table. Results of the 2nd step mapping variants from Rafique et al. to Ensembl in Module 5.

(CSV)

pone.0300350.s007.csv (167KB, csv)
S8 Table. Table ready for VCF file creation for the level 1 database produced in Module 6.

(CSV)

pone.0300350.s008.csv (76.5KB, csv)
S9 Table. Table ready for VCF file creation for the level 2 database produced in Module 6.

(CSV)

pone.0300350.s009.csv (25.7KB, csv)
S10 Table. Summary table of genes and their appearance in different sources.

(PDF)

pone.0300350.s010.pdf (81KB, pdf)

Data Availability

The data underlying the results presented in the study are available from https://github.com/kuznetsovaks/MD_variants.

Funding Statement

This research was funded, in whole or in part, by the Research Council of Norway (project #301178 to MV), the University of Bergen, and the Novo Nordisk Foundation (project NNF20OC0063872 to SJ). The publication of the manuscript was funded by Bergen Universitetsfond (2023/04/FOL to K.K.). A CC BY or equivalent license is applied to any Author Accepted Manuscript (AAM) version arising from this submission, in accordance with the grant’s open access conditions. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Beltrand J, Busiah K, Vaivre-Douret L, Fauret AL, Berdugo M, Cavé H, et al. Neonatal Diabetes Mellitus. Frontiers in Pediatrics. 2020;8:540718. doi: 10.3389/fped.2020.540718
  • 2. Kavvoura FK, Owen KR. Monogenic diabetes. Medicine. 2019;47(1):16–21. doi: 10.1016/j.mpmed.2018.10.007
  • 3. Aarthy R, Aston-Mourney K, Mikocka-Walus A, Radha V, Amutha A, Anjana RM, et al. Clinical features, complications and treatment of rarer forms of maturity-onset diabetes of the young (MODY) - A review. Journal of Diabetes and its Complications. 2021;35(1):107640. doi: 10.1016/j.jdiacomp.2020.107640
  • 4. Shepherd MH, Shields BM, Hudson M, Pearson ER, Hyde C, Ellard S, et al. A UK nationwide prospective study of treatment change in MODY: genetic subtype and clinical characteristics predict optimal glycaemic control after discontinuing insulin and metformin. Diabetologia. 2018;61(12):2520–2527. doi: 10.1007/s00125-018-4728-6
  • 5. Shields BM, Hicks S, Shepherd MH, Colclough K, Hattersley AT, Ellard S. Maturity-onset diabetes of the young (MODY): how many cases are we missing? Diabetologia. 2010;53(12):2504–2508. doi: 10.1007/s00125-010-1799-4
  • 6. Laver TW, Wakeling MN, Knox O, Colclough K, Wright CF, Ellard S, et al. Evaluation of Evidence for Pathogenicity Demonstrates That BLK, KLF11, and PAX4 Should Not Be Included in Diagnostic Testing for MODY. Diabetes. 2022;71(5):1128–1136. doi: 10.2337/db21-0844
  • 7. Duzkale H, Shen J, McLaughlin H, Alfares A, Kelly MA, Pugh TJ, et al. A systematic approach to assessing the clinical significance of genetic variants. Clinical Genetics. 2013;84(5):453–463. doi: 10.1111/cge.12257
  • 8. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Research. 2018;46(D1):D1062–D1067. doi: 10.1093/nar/gkx1153
  • 9. Martagón AJ, Bello-Chavolla OY, Arellano-Campos O, Almeda-Valdés P, Walford GA, Cruz-Bautista I, et al. Mexican Carriers of the HNF1A p.E508K Variant Do Not Experience an Enhanced Response to Sulfonylureas. Diabetes Care. 2018;41(8):1726–1731. doi: 10.2337/dc18-0384
  • 10. Tebani A, Afonso C, Marret S, Bekri S. Omics-Based Strategies in Precision Medicine: Toward a Paradigm Shift in Inborn Errors of Metabolism Investigations. International Journal of Molecular Sciences. 2016;17(9):1555. doi: 10.3390/ijms17091555
  • 11. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research. 2022;50(D1):D439–D444. doi: 10.1093/nar/gkab1061
  • 12. Kind L, Raasakka A, Molnes J, Aukrust I, Bjørkhaug L, Njølstad PR, et al. Structural and biophysical characterization of transcription factor HNF-1A as a tool to study MODY3 diabetes variants. The Journal of Biological Chemistry. 2022;298(4):101803. doi: 10.1016/j.jbc.2022.101803
  • 13. Aebersold R, Agar JN, Amster IJ, Baker MS, Bertozzi CR, Boja ES, et al. How many human proteoforms are there? Nature Chemical Biology. 2018;14(3):206–214. doi: 10.1038/nchembio.2576
  • 14. Cunningham F, Allen JE, Allen J, Alvarez-Jarreta J, Amode M, Armean I, et al. Ensembl 2022. Nucleic Acids Research. 2022;50(D1):D988–D995. doi: 10.1093/nar/gkab1049
  • 15. Rafique I, Mir A, Saqib MAN, Naeem M, Marchand L, Polychronakos C. Causal variants in Maturity Onset Diabetes of the Young (MODY) - A systematic review. BMC Endocrine Disorders. 2021;21(1):223. doi: 10.1186/s12902-021-00891-7
  • 16. Yates A, Beal K, Keenan S, McLaren W, Pignatelli M, Ritchie GRS, et al. The Ensembl REST API: Ensembl Data for Any Language. Bioinformatics (Oxford, England). 2015;31(1):143–145. doi: 10.1093/bioinformatics/btu613
  • 17. McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GRS, Thormann A, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17(1):122. doi: 10.1186/s13059-016-0974-4
  • 18. Pedersen BS, Layer RM, Quinlan AR. Vcfanno: fast, flexible annotation of genetic variants. Genome Biology. 2016;17(1):118. doi: 10.1186/s13059-016-0973-5
  • 19. Levitsky LI, Klein JA, Ivanov MV, Gorshkov MV. Pyteomics 4.0: Five Years of Development of a Python Proteomics Framework. Journal of Proteome Research. 2019;18(2):709–714. doi: 10.1021/acs.jproteome.8b00717
  • 20. Hunter JD. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering. 2007;9(3):90–95. doi: 10.1109/MCSE.2007.55
  • 21. Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research. 2005;33(Database issue):D514–517. doi: 10.1093/nar/gki033
  • 22. Sun BB, Kurki MI, Foley CN, Mechakra A, Chen CY, Marshall E, et al. Genetic associations of protein-coding variants in human disease. Nature. 2022;603(7899):95–102. doi: 10.1038/s41586-022-04394-w
  • 23. Umer HM, Audain E, Zhu Y, Pfeuffer J, Sachsenberg T, Lehtiö J, et al. Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides. Bioinformatics. 2022;38(5):1470–1472. doi: 10.1093/bioinformatics/btab838
  • 24. Hull J, Campino S, Rowlands K, Chan MS, Copley RR, Taylor MS, et al. Identification of Common Genetic Variation That Modulates Alternative Splicing. PLOS Genetics. 2007;3(6):e99. doi: 10.1371/journal.pgen.0030099
  • 25. Steri M, Idda ML, Whalen MB, Orrù V. Genetic Variants in mRNA Untranslated Regions. Wiley interdisciplinary reviews RNA. 2018;9(4):e1474. doi: 10.1002/wrna.1474
  • 26. Haghverdizadeh P, Sadat Haerian M, Haghverdizadeh P, Sadat Haerian B. ABCC8 genetic variants and risk of diabetes mellitus. Gene. 2014;545(2):198–204. doi: 10.1016/j.gene.2014.04.040
  • 27. Negahdar M, Aukrust I, Johansson BB, Molnes J, Molven A, Matschinsky FM, et al. GCK-MODY diabetes associated with protein misfolding, cellular self-association and degradation. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease. 2012;1822(11):1705–1715. doi: 10.1016/j.bbadis.2012.07.005
  • 28. Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023;381(6664):eadg7492. doi: 10.1126/science.adg7492
  • 29. Hanimann J, Moch H, Zoche M, Kahraman A. IsoAligner: dynamic mapping of amino acid positions across protein isoforms [version 1; peer review: 2 approved with reservations]. F1000Research. 2022;11(382). doi: 10.12688/f1000research.76154.1
  • 30. Spooner W, McLaren W, Slidel T, Finch DK, Butler R, Campbell J, et al. Haplosaurus computes protein haplotypes for use in precision drug design. Nature Communications. 2018;9(1):4128. doi: 10.1038/s41467-018-06542-1
  • 31. Vašíček J, Skiadopoulou D, Kuznetsova KG, Wen B, Johansson S, Njølstad PR, et al. Finding haplotypic signatures in proteins. GigaScience. 2023;12:giad093. doi: 10.1093/gigascience/giad093
  • 32. Bellanné-Chantelot C, Clauin S, Chauveau D, Collin P, Daumont M, Douillard C, et al. Large genomic rearrangements in the hepatocyte nuclear factor-1beta (TCF2) gene are the most frequent cause of maturity-onset diabetes of the young type 5. Diabetes. 2005;54(11):3126–3132. doi: 10.2337/diabetes.54.11.3126
  • 33. Mefford H, Clauin S, Sharp A, Moller R, Ullmann R, Kapur R, et al. Recurrent Reciprocal Genomic Rearrangements of 17q12 Are Associated with Renal Disease, Diabetes, and Epilepsy. American Journal of Human Genetics. 2007;81(5):1057–1069. doi: 10.1086/522591
  • 34. Bonnefond A, Unnikrishnan R, Doria A, Vaxillaire M, Kulkarni RN, Mohan V, et al. Monogenic diabetes. Nature Reviews Disease Primers. 2023;9(1):12. doi: 10.1038/s41572-023-00421-w
  • 35. Lord J, Baralle D. Splicing in the Diagnosis of Rare Disease: Advances and Challenges. Frontiers in Genetics. 2021;12. doi: 10.3389/fgene.2021.689892

Decision Letter 0

Kazunori Nagasaka

1 Dec 2023

PONE-D-23-29592

Systematically mining genomic and proteomic variation linked to rare diseases: the example of monogenic diabetes

PLOS ONE

Dear Dr. Kuznetsova,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 15 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Kazunori Nagasaka

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

3. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

4. Note from Emily Chenette, Editor in Chief of PLOS ONE, and Iain Hrynaszkiewicz, Director of Open Research Solutions at PLOS: Did you know that depositing data in a repository is associated with up to a 25% citation advantage (https://doi.org/10.1371/journal.pone.0230416)? If you’ve not already done so, consider depositing your raw data in a repository to ensure your work is read, appreciated and cited by the largest possible audience. You’ll also earn an Accessible Data icon on your published paper if you deposit your data in any participating repository (https://plos.org/open-science/open-data/#accessible-data).

5. Thank you for stating the following financial disclosure:

“The work is supported by the Research Council of Norway https://www.forskningsradet.no/ Grant #301178 to M.V.”

Please state what role the funders took in the study.  If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

6. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“This work was supported by the Research Council of Norway (project #301178 to MV), the University of Bergen, and the Novo Nordisk Foundation (project NNF20OC0063872 to SJ). This research was funded, in whole or in part, by the Research Council of Norway 301178. A CC BY or equivalent license is applied to any Author Accepted Manuscript (AAM) version arising from this submission, in accordance with the grant’s open access conditions.”

We note that you have provided funding information that is currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“The work is supported by the Research Council of Norway https://www.forskningsradet.no/ Grant #301178 to M.V.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

7. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

8. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Dear Authors,

Thank you very much for your submission to Plos One.

I think the manuscript is informative and describes some important points.

Sincerely,

Plos one editorial office

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Partly

Reviewer #4: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: N/A

Reviewer #3: N/A

Reviewer #4: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The study addresses the issue of pathogenic variation in monogenic diabetes. They developed an automated pipeline to mine the pathogenic variants in over 100 genes considering to be related with monogenic diabetes. By mining the variation data from both ClinVar and literatures and multiple processes, they identified the pathogenic variants in 36 genes, and further sorted out these causing translational changes. All the data from the study are available. The study is certainly a plus for further exploration in the subject.

Comments for improvement:

The definition of short deletions, short insertions, or other short fragment replacements needs to be given and justified. For example, how many bases deleted to be used as the cut off for short deletion?

The use of isoforms in the pipeline can cause mapping trouble as presented by the authors, as the number of isoforms in different genes can be substantial and lack of standardization and quality control. This is the reason to have RefSeq instead of isoforms in genomic annotation. The authors need to justify why use isoform instead of isoforms in their analysis.

“Note that not all these variants could be mapped automatically and those that did not map to Ensembl were formatted manually for input to Module 6.” Detailed description needs to be given for formatted “manually”, and indicate the number of the variants under manually check.

ClinVar doesn’t have the classification of “unknown pathogenicity”. It should be the variants of unknown significance or VUS.

“For other types of consequence we focused on those presenting more than 80 % pathogenic and likely pathogenic variants”. Justification for 80% needs to be given, and the number at this cut-off needs to be given.

“We could map 2,701 of these variants to variants linked to monogenic diabetes according to ClinVar or the literature”: should be “We mapped 2701 of….”

The version numbers for Ensembl, ClinVar and dbSNP need to be indicated. Different versions can be a reason for the inconsistence of the mapping results, such as “why the overlap of the variants taken from Rafique et al. and ClinVar seems limited in our analysis”.

The value of translation products has been repeatedly indicated to justify the development of translational products, as it “help predict the effect of genetic variation on the resulting protein structure and function”. However, there is no single example to justify their claim. I would suggest to present an example of coding changed protein with structural alteration as a proof of principle, considering they have generated rich coding-change pathogenic variants in multiple genes, like HNF1A.

The resolution is poor for all figures.

Reviewer #2: The authors present an interesting paper on the computational assessment of coding variants on protein sequences, categorizing these variants into different tiers depending on predicted pathogenicity.

Some minor comments to the authors

- The authors use the article by Rafique et al (2021). However, they should also mention the paper published by Bonnefond et al (2023) Nature Reviews particularly in their Discussion section (page 14) and how this compares to what they included in their study.

- Pg 4: Change port to import

- Page 3 “It is estimated that around 77 % of monogenic diabetes cases remain undiagnosed”. Specify why.

- Pg 3: Perhaps specify the total number of genes (to date) related to monogenic diabetes not just the 14 MODY genes

- I think it should be specified before (on page 5) which type of exonic variants were retained (listed on page 6)

- It would be good to specify the list of in silico tools used by Ensembl (VEP) to predict pathogenicity

- Did the authors consider cross checking some of the level 1/2 variants with ACMG/AMP predictions?

- Did the authors apply an alternative allele frequency cut-off? Perhaps some of the e.g. nonsense variants might be common.

- Page 8: Variants in BLK, KLF11 and PAX4 should be removed for dominant monogenic diabetes and it is still nonetheless considered controversial.

- Supp Table 5: would change to Yes/No in Rafique column

- Perhaps Links should be included as footnotes (rather than references in text)

Reviewer #3: This manuscript provides a modular Jupyter notebook approach for collecting information on monogenic diabetes variants across 108 genes. The authors compile their database from ClinVar and literature to generate a uniform list with annotations.

The work presented is thorough, though I was unable to personally validate the coding implementation. Additionally, the authors did not demonstrate this approach generalizes well to other diseases, making it difficult to assess wider utility.

Some of the approaches rely on circular logic - for example, the variants in Module 1 already include ClinVar data and use the Ensembl Variant Effect Predictor, which provides comprehensive annotations.

While this tool aims to fill a need, numerous established resources (DECIPHER, VEP etc.) and their continuous expansions already provide rich variant annotation and data integration. As more addons targeting monogenic diseases are actively developed and included in these tools actively the niche for this specific tool may diminish over time due to manual update needed for this work.

Overall, this manuscript describes a thoughtful effort to aggregate information on monogenic diabetes variants. However, given the breadth of existing tools, I am unable to clearly recognize the unmet need this resource fills in the community.

Reviewer #4: Authors developed automated and reproducible bioinformatics pipeline for mapping variants in genes related to MODY, and further collecting and harmonizing data about them. Their work addresses current challenges that geneticist face in the process of interpreting effect of rare/new genetic variants.

The pipeline has been described in detail and is made available for public use.

Additional value is that parts of this pipeline could be used to develop similar pipelines for other rare genetic disorders / groups of rare disorders. This is a needed tool in the rare disease ecosystem.

Therefore, I highly recommend publishing this manuscript in the current form.

Minor comment: I suggest including words “bioinformatics pipeline” in the title to facilitate bioinformaticians to find this valuable manuscript.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: Yes: Maja Stojiljkovic

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Apr 18;19(4):e0300350. doi: 10.1371/journal.pone.0300350.r002

Author response to Decision Letter 0


30 Jan 2024

We thank the editor and all the reviewers for their attention to our work and valuable comments. We have implemented all suggestions and as a result the manuscript is substantially improved.

We are providing the revised version in the form of a LaTeX document, as requested by the journal, and are also attaching the MS Word file with the changes highlighted.

We have not made any changes in the figures, but have added one supplementary table (S10_table), renamed the supplementary files, and updated all the in-text citations as required.

We have modified the reference list and added 5 more references (28, 30, 31, 32, 34). The updated reference list is included at the end of the LaTeX file as required by the template.

We have added a license file to the GitHub repository and a file containing all the dependencies for creating the conda environment for running our pipeline with explanations in a readme file.

Reviewer #1:

We thank reviewer 1 for the thorough examination of our work and for their appreciation of its contribution to monogenic disease research. We have implemented all suggestions, as detailed in the answers below.

The definition of short deletions, short insertions, or other short fragment replacements needs to be given and justified. For example, how many bases deleted to be used as the cut off for short deletion?

When filtering the variants, we deliberately did not set any length limit; however, among the variants from ClinVar linked to monogenic diabetes, no indel was longer than 30 bp. We have added this information to page 17.

The use of isoforms in the pipeline can cause mapping trouble as presented by the authors, as the number of isoforms in different genes can be substantial and lack of standardization and quality control. This is the reason to have RefSeq instead of isoforms in genomic annotation. The authors need to justify why use isoform instead of isoforms in their analysis.

In our study, we translated all variant genes into the possibly encoded proteins to enable protein-level analyses (e.g. structural or mass spectrometry-based analysis). In order to comprehensively cover all possibly encoded proteins, we chose to include all possible isoforms as provided by Ensembl. The reviewer is correct that analysts usually reduce the number of isoforms and in most cases use only canonical sequences, as provided by RefSeq or Ensembl. In the study of rare diseases, and monogenic diabetes in particular, it is important to be able to select between isoforms. For example, different isoforms of GCK are specific for different tissues https://www.degruyter.com/document/doi/10.1515/hsz-2018-0109/html?lang=de. It is therefore valuable for analysts to have all isoforms available and choose the ones relevant to their analysis, instead of being limited to canonical isoforms. This has been clarified in the text.

“Note that not all these variants could be mapped automatically and those that did not map to Ensembl were formatted manually for input to Module 6.” A detailed description needs to be given for “formatted manually”, and the number of variants checked manually needs to be indicated.

The description is added to page 8.

ClinVar doesn’t have the classification of “unknown pathogenicity”. It should be the variants of unknown significance or VUS.

We apologize for the incorrect formulation. It has been corrected using the denomination from ClinVar, as suggested.

“For other types of consequence we focused on those presenting more than 80 % pathogenic and likely pathogenic variants”. Justification for 80% needs to be given, and the number at this cut-off needs to be given.

“We could map 2,701 of these variants to variants linked to monogenic diabetes according to ClinVar or the literature”: should be “We mapped 2701 of….”

This has been corrected on page 10.

The version numbers for Ensembl, ClinVar and dbSNP need to be indicated. Different versions can be a reason for the inconsistence of the mapping results, such as “why the overlap of the variants taken from Rafique et al. and ClinVar seems limited in our analysis”.

This has been added.

The value of translation products has been repeatedly indicated to justify the development of translational products, as it “help predict the effect of genetic variation on the resulting protein structure and function”. However, there is no single example to justify their claim. I would suggest to present an example of coding changed protein with structural alteration as a proof of principle, considering they have generated rich coding-change pathogenic variants in multiple genes, like HNF1A.

On page 16, we have added an example of structural changes in GCK caused by a mutation in the coding region (doi.org/10.1016/j.bbadis.2012.07.005), together with variant pathogenicity prediction using AlphaMissense, which is based on the structural predictor AlphaFold (doi/epdf/10.1126/science.adg7492). We thank the reviewer for this suggestion.

The resolution is poor for all figures.

We have provided high-resolution figures as separate files along with the manuscript as well as on GitHub. Unfortunately, it seems that the compression applied by the submission system altered the resolution. We will ensure that the resolution is sufficient in the resubmission. Alternatively, all the files are available for download from https://github.com/kuznetsovaks/MD_variants

Reviewer #2:

We thank reviewer 2 for the careful analysis of our work and for pointing out our approach to dividing variants into tiers. We have implemented all suggestions, as detailed in our answers below.

- The authors use the article by Rafique et al (2021). However, they should also mention the paper published by Bonnefond et al (2023) Nature Reviews particularly in their Discussion section (page 14) and how this compares to what they included in their study.

We thank the reviewer for pointing out this important paper. We have now included this review in the paper (see page 17 and Supplementary Table 10). However, unlike the work by Rafique et al., this manuscript provides an overview of monogenic diabetes at the gene level and lists the linked genes but not the variants. We have checked the overlap between the gene list in the review and the gene list in our work. As expected, the overlap is not perfect, but discussing the differences between the gene lists provided by different reviews is beyond the scope of our work.

- Pg 4: Change port to import

This has been corrected.

- Page 3 “It is estimated that around 77 % of monogenic diabetes cases remain undiagnosed”. Specify why.

Thank you for this suggestion, we have extended the text accordingly.

- Pg 3: Perhaps specify the total number of genes (to date) related to monogenic diabetes not just the 14 MODY genes

One of the points and conclusions of our study is that the list of genes related to monogenic diabetes is a matter of continuous discussion and revision. A gene can be included or excluded from this list based on the pathogenicity, frequency etc. of its genetic variants, which in turn is constantly being revised. We have extended the result section and discussion accordingly, but for the sake of clarity the introduction does not expand on these considerations.

- I think it should be specified before (on page 5) which type of exonic variants were retained (listed on page 6)

This has been added, thank you for the suggestion.

- It would be good to specify the list of in silico tools used by Ensembl (VEP) to predict pathogenicity

This has been specified.

- Did the authors consider cross checking some of the level 1/2 variants with ACMG/AMP predictions?

The main aim of our work was to collect, standardize, and translate the variants linked to monogenic diabetes using open-source and freely available tools and databases. There are many different tools for ACMG/AMP classification such as older ones (doi: 10.1186/s13059-017-1353-5) and newer machine learning-based ones (doi: 10.1038/s41598-022-06547-3) as well as commercial tools. To our understanding, these classifiers are designed to be used on datasets of variants called from actual sequencing data rather than genomic knowledgebases. We find this analysis an important and interesting part of the patient classification framework but it is outside the scope of our study.

- Did the authors apply an alternative allele frequency cut-off? Perhaps some of the e.g. nonsense variants might be common.

The main goal of our work was to collect and standardize the information on the possible protein alterations connected to monogenic diabetes. For this reason, we based our variant selection primarily on the variant location (exonic variants), secondarily on protein consequences (missense SNP, frameshift etc.), and thirdly on reported pathogenicity. For the latter, we separated our database into two levels: one including all the variants and a second one including just the pathogenic ones. This approach addresses the limitations of the ‘street light effect’ that appears in mass spectrometry-based proteomics. Since proteomics spectra are analyzed against a protein sequence database, sequences not included in the database have no chance of being found in the samples. For this reason, we tried to keep the level 2 database as thorough as possible and create a strict version of the database limited to only pathogenic variants. Here we did not use any allele frequency cut-off. Nevertheless, we checked the alternative allele frequencies of the variants for which they were available using the Ensembl REST API and found that the level 2 database contains approximately 8.5% of common and 11% of low-frequency variants, while the level 1 database contains 0.3% of low-frequency variants and no common variants. We did not include this analysis in the manuscript as we believe that it would introduce complexity and distract attention from the main purpose of our study.
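The frequency check mentioned in this answer can be sketched as follows. This is a minimal illustration, not the authors' code: the 5% and 1% thresholds for "common" and "low-frequency" variants are conventional values assumed here (the answer does not state which thresholds were used), the function names are hypothetical, and the allele frequencies are made-up inputs standing in for values retrieved from the Ensembl REST API. Shares are computed only over variants with an available frequency, which is also an assumption.

```python
from collections import Counter

# Assumed thresholds; not stated in the answer above.
COMMON_THRESHOLD = 0.05
LOW_FREQUENCY_THRESHOLD = 0.01


def frequency_class(allele_frequency: float) -> str:
    """Assign an alternative allele frequency to a frequency class."""
    if allele_frequency >= COMMON_THRESHOLD:
        return "common"
    if allele_frequency >= LOW_FREQUENCY_THRESHOLD:
        return "low-frequency"
    return "rare"


def class_shares(allele_frequencies) -> dict:
    """Share of each class among variants with a known frequency (None is skipped)."""
    classes = [frequency_class(af) for af in allele_frequencies if af is not None]
    counts = Counter(classes)
    total = len(classes)
    if total == 0:
        return {}
    return {name: counts[name] / total for name in ("common", "low-frequency", "rare")}


# Made-up frequencies; None marks a variant without frequency data.
example_frequencies = [0.2, 0.03, 0.001, None, 0.0005]
print(class_shares(example_frequencies))
```

Applying the same function to each database level separately would give percentages comparable in form to those quoted above, under the stated threshold assumptions.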

- Page 8: Variants in BLK, KLF11 and PAX4 should be removed for dominant monogenic diabetes and it is still nonetheless considered controversial.

We thank the reviewer for this suggestion. These genes have been discussed at length in our lab over the past years. We agree with the reviewer and have removed them from the level 2 database but kept them in level 1, as level 1 is the most complete collection of variants that have been reported. We mention the controversy around these three genes and cite the paper discussing it on page 12.

- Supp Table 5: would change to Yes/No in Rafique column

This has been corrected.

- Perhaps Links should be included as footnotes (rather than references in text)

This has been corrected in the LaTeX version following the template of the journal.

Reviewer #3:

We thank reviewer 3 for their critical review of our work. We have implemented all suggestions as detailed in the comments below.

The work presented is thorough, though I was unable to personally validate the coding implementation. Additionally, the authors did not demonstrate this approach generalizes well to other diseases, making it difficult to assess wider utility.

We have reproduced the pipeline using Hajdu-Cheney syndrome as an example. All the files can be found on GitHub in the "HCS" directory. The text has been extended accordingly (see page 14).

Some of the approaches rely on circular logic - for example, the variants in Module 1 already include ClinVar data and use the Ensemble Variant Effect Predictor, which provides comprehensive annotations.

While this tool aims to fill a need, numerous established resources (DECIPHER, VEP etc.) and their continuous expansions already provide rich variant annotation and data integration. As more addons targeting monogenic diseases are actively developed and included in these tools actively the niche for this specific tool may diminish over time due to manual update needed for this work.

Overall, this manuscript describes a thoughtful effort to aggregate information on monogenic diabetes variants. However, given the breadth of existing tools, I am unable to clearly recognize the unmet need this resource fills in the community.

We are sorry to read that we failed to convey how our pipeline addresses a need of the biomedical community. The main aim of our work is to collect genetic variants linked to monogenic diabetes and translate them into protein sequences. While we acknowledge the value of resources like VEP, DECIPHER, and ClinVar, the nature of rare diseases makes it challenging for generalist solutions to stay updated with field-specific advances. When creating proteomic databases enhanced with genetic data to analyze patients' proteomes, the community currently often relies on manual sequence analysis. While we share the optimism of the reviewer that specialist addons will be included, eventually streamlining the work on rare diseases, we believe that, to date, semi-automated approaches like ours provide a good compromise between exhaustivity and reproducibility.

Reviewer #4:

We thank Prof. Stojiljković for their appreciation of our contribution and for the valuable suggestion to change the title. We have changed it accordingly: “Bioinformatics pipeline for the systematic mining genomic and proteomic variation linked to rare diseases: the example of monogenic diabetes”

Attachment

Submitted filename: answers_to_reviewers.pdf

pone.0300350.s011.pdf (94.7KB, pdf)

Decision Letter 1

Kazunori Nagasaka

27 Feb 2024

Bioinformatics pipeline for the systematic mining genomic and proteomic variation linked to rare diseases: the example of monogenic diabetes

PONE-D-23-29592R1

Dear Dr. Kuznetsova,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Kazunori Nagasaka

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Dear Authors,

Thank you so much for submitting your manuscript to JOGR.

Now your manuscript is acceptable for publication in Plos One.

We look forward to your future submission.

Sincerely,

Kazunori Nagasaka

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

Reviewer #4: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The revision has addressed my questions with satisfaction. The quality of revision is substantially improved over the original version.

Reviewer #2: (No Response)

Reviewer #4: The authors made changes in line with the first round of the review. I do not have any further comments.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: San Ming Wang

Reviewer #2: No

Reviewer #4: Yes: Maja Stojiljkovic

**********

Acceptance letter

Kazunori Nagasaka

25 Mar 2024

PONE-D-23-29592R1

PLOS ONE

Dear Dr. Kuznetsova,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Kazunori Nagasaka

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Exonic variants extracted from Ensembl in Module 1.

    (CSV)

    pone.0300350.s001.csv (27.1MB, csv)
    S2 Table. Result of the 1st step of mapping ClinVar to Ensembl in Module 3.

    (CSV)

    pone.0300350.s002.csv (1,021.6KB, csv)
    S3 Table. Result of the 2nd step of mapping ClinVar to Ensembl in Module 3.

    (CSV)

    pone.0300350.s003.csv (51.2KB, csv)
    S4 Table. Indels extracted from ClinVar for the level 1 database in Module 3.

    (CSV)

    S5 Table. Indels extracted from ClinVar for the level 2 database in Module 3.

    (CSV)

    pone.0300350.s005.csv (3.5KB, csv)
    S6 Table. Results of the 1st step mapping variants from Rafique et al. to Ensembl in Module 4.

    (CSV)

    pone.0300350.s006.csv (99.2KB, csv)
    S7 Table. Results of the 2nd step mapping variants from Rafique et al. to Ensembl in Module 5.

    (CSV)

    pone.0300350.s007.csv (167KB, csv)
    S8 Table. Table ready for VCF file creation for the level 1 database produced in Module 6.

    (CSV)

    pone.0300350.s008.csv (76.5KB, csv)
    S9 Table. Table ready for VCF file creation for the level 2 database produced in Module 6.

    (CSV)

    pone.0300350.s009.csv (25.7KB, csv)
    S10 Table. Summary table of genes and their appearance in different sources.

    (PDF)

    pone.0300350.s010.pdf (81KB, pdf)
    Attachment

    Submitted filename: answers_to_reviewers.pdf

    pone.0300350.s011.pdf (94.7KB, pdf)

    Data Availability Statement

    The data underlying the results presented in the study are available from https://github.com/kuznetsovaks/MD_variants.


    Articles from PLOS ONE are provided here courtesy of PLOS
