Functional and Structural Characterization of Pathogenicity of Human Arginine-Histidine Variants

Nirav Modha; Emil Alexov

doi:10.1142/s2737416526400041

Author manuscript; available in PMC: 2026 Jan 31.

Published before final editing as: J Comput Biophys Chem. 2025 Dec 20:10.1142/s2737416526400041. doi: 10.1142/s2737416526400041

PMCID: PMC12858162 NIHMSID: NIHMS2134374 PMID: 41626145

Abstract

Missense variants that change arginine to histidine or histidine to arginine (R>H; H>R) preserve positive charge yet alter pH-dependent behavior near neutrality, creating mutation type-specific but context-dependent effects on protein function, and thus could be pathogenic. To reveal the factors causing pathogenicity, we assembled high-confidence human R>H and H>R variants from ClinVar annotated as pathogenic or benign. For both R>H and H>R, pathogenic variants were strongly enriched in cores and ordered regions, while benign variants were found on surfaces and in coils. Secondary structure analysis showed mutation type specificity: R>H pathogenic variants were enriched in helices, while H>R pathogenic variants were enriched in β-strands. Regarding the pH-optimum of activity, most R>H and H>R variants fell in physiological/near-physiological pH ranges, but R>H benign variants were more frequent in the neutral/physiological pH bin, whereas H>R pathogenic variants were overrepresented in the same neutral/physiological pH range. The last observation is consistent with histidine’s pK_a being tunable near the physiological range, while arginine’s side chain introduces a permanent positive charge, so that an H>R substitution eliminates the wild-type pH-dependence. Functional protein analyses highlighted that pathogenic variants are overrepresented in binding/interface-heavy proteins (e.g., transcription factors) and selected enzymatic classes (e.g., oxidoreductases, ion channels, transporters, ligases). Interestingly, in the vast majority of cases, the proteins in our dataset had either R>H or H>R mutations, but not both. Proteins harboring both variant types, R>H and H>R, were very few, and typically both variants in such a protein shared the same classification, either pathogenic or benign.

Keywords: Proteins, Missense mutations, Arginine, Histidine, pH-dependence

1. Introduction

Missense mutations occur when a single nucleotide change within the genome results in the substitution of one amino acid for another, and are considered to be the major contributor to diseases¹. While many missense variants are benign, others can be highly deleterious, disrupting the protein’s function, stability, or regulation². While classification of missense mutations into pathogenic or benign categories is an important task, it is equally important to reveal the molecular effect causing the disease to guide the development of therapeutic solutions. Recent studies have shown that a significant fraction of pathogenic missense mutations, particularly in monogenic disorders, can be traced to changes in protein folding stability, providing a quantitative and biophysical basis for variant interpretation and drug development³.

Proteins such as enzymes and other biological macromolecules tend to function within specific pH ranges, often showing a characteristic pH-optimum at which their stability and physiological activity are maximal^{4, 5}. This pH-optimum reflects their intrinsic structural properties and is adapted to the characteristic pH of their cellular environment^5–8. This is also consistent with proteome-wide analyses showing that predicted pH of maximal stability correlates with subcellular pH across organelles⁹. Changes in the native pH-optimum can alter properties such as folding, binding, or catalytic efficiency, which can lead to a loss of function and disease¹⁰. Hence, maintaining compatibility between a protein’s pH-optimum and its subcellular or tissue environment is essential for normal physiology, and mutations that disturb this balance may have deleterious effects⁸. Additionally, mutations affecting the pH sensitivity, either increasing or decreasing it, might represent a rather subtle but widespread mechanism of pathogenicity^{11, 12}.

The pKₐ values of ionizable residues within the macromolecule govern the pH-dependence of protein stability and function⁵. Among these ionizable residues, histidine is unique in having an unperturbed pKₐ near neutrality (~6.0–6.8), making it the dominant contributor to pH sensitivity around the physiological pH of 7.2. Most proteins in the bloodstream, extracellular fluid, and normal tissues have a pH optimum near 7.2⁴. While other titratable amino acid residues, such as aspartate, glutamate, lysine, and arginine, are essential for the function of macromolecules, they generally contribute less to pH-dependence at neutral pH because their pKₐ values are farther from neutral pH. However, this does depend on the local protein environment. Buried ionizable residues can experience significant pKₐ shifts, as seen when lysine residues engineered into the hydrophobic core of a protein exhibited pKₐ values several units lower than those of solvent-exposed lysines¹³. While such environment-induced pKₐ shifts allow other titratable residues to contribute under specific conditions, histidine remains unique in modulating pH dependence near neutrality, and any deviation from a wild-type sequence that inserts or removes a histidine residue is expected to change the pH dependence of the corresponding macromolecule at neutral pH. Of particular interest for this study are missense mutations that change a wild-type arginine to histidine (R>H) or vice versa. Such mutations result from a single nucleotide change and preserve the positive charge of the site at pH values below neutral, but alter the pH-dependence at higher pH values. Thus, an R>H substitution introduces, in the mutant, a pH-dependence of the corresponding reaction at neutral pH that is not present in the wild-type. In the opposite case, the histidine-to-arginine (H>R) mutation, the mutant will be pH-independent in contrast to the wild-type protein.
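
The charge argument in this paragraph can be illustrated with the Henderson-Hasselbalch relation. The sketch below is illustrative only: the model pKₐ values (6.5 for histidine, 12.5 for arginine) are assumed textbook-style numbers rather than values computed in this study, and the function name is hypothetical.

```python
def protonated_fraction(ph: float, pka: float) -> float:
    """Fraction of a basic side chain that is protonated (carries +1 charge).

    Henderson-Hasselbalch for a base: f = 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Assumed model pKa values (illustrative, not taken from the study).
PKA_HIS = 6.5
PKA_ARG = 12.5

if __name__ == "__main__":
    for ph in (6.0, 7.2, 8.0):
        his = protonated_fraction(ph, PKA_HIS)
        arg = protonated_fraction(ph, PKA_ARG)
        print(f"pH {ph:.1f}: His charged {his:.2f}, Arg charged {arg:.2f}")
```

Under these assumed values, the histidine fraction changes substantially across the pH 6 to 8 window, while the arginine fraction stays essentially constant, which is the contrast the R>H and H>R comparison relies on.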

Missense mutations can impact a protein’s function through various mechanisms, such as destabilizing the folded structure, altering catalytic residues, disrupting binding interfaces, or altering subcellular targeting². These effects are often tied to changes in charge, polarity, or hydrophobicity¹⁴. Still, some substitutions preserve core biophysical properties; these are deemed “conservative” and may produce subtler functional shifts. The arginine-to-histidine (R>H) and histidine-to-arginine (H>R) substitutions exemplify this phenomenon: both residues can carry a positive charge, but histidine’s shorter side chain and its pKₐ (~6.5–6.8, near physiological pH) introduce nuances in geometry, electrostatics, and pH-dependent regulation. Recent work has demonstrated that R>H substitutions in cancer can grant mutant proteins pH-sensitive activity, which gives an adaptive advantage in tumor microenvironments with elevated intracellular pH, and can change nucleic acid-binding or protein–protein interactions, a capability not present in wild-type arginine-containing proteins¹⁵.

R>H mutations are frequently seen in cancer genomes and have been proposed to confer novel biophysical properties to mutant proteins. While both arginine and histidine can carry a positive charge near physiological pH, histidine’s side chain has a pKₐ near 6.5, enabling it to act as a molecular pH sensor¹⁵. This allows histidine-containing proteins to change their charge state, and potentially function, within narrow pH ranges, a feature not shared by arginine. Tumor cells often exhibit altered intracellular pH, which is a hallmark of metabolic reprogramming in cancer. R>H and H>R mutations in oncogenes can introduce pH-sensitive regulatory features that may enable cancer cells to modulate protein activity, localization, or binding in response to changes in the tumor microenvironment. Despite the conservative nature of the substitution, this gain-of-function effect suggests that R>H/H>R mutations may provide selective advantages under altered pH conditions, with implications for signaling, chromatin regulation, and enzyme catalysis in malignancy¹⁵.

Although considered chemically conservative substitutions, R>H and H>R mutations appear in both benign and pathogenic contexts, raising important questions about the molecular determinants that distinguish the two. In this study, we present a comparative analysis of human R>H and H>R missense mutations, aiming to understand the structural, evolutionary, and biochemical features that contribute to their classification as benign or pathogenic. Using curated datasets derived from ClinVar¹⁶, we compare high-confidence benign and pathogenic R>H and H>R variants across a diverse array of proteins. Our analysis integrates structural mapping (including core/surface classification and secondary structure via DSSP), evolutionary conservation (via Shannon entropy and amino acid frequency), disorder determination from UniProt, disorder prediction from AlphaFold confidence scores, and pH context via subcellular compartment and enzyme pH optimum. By systematically examining these aspects, our goal is to uncover whether and how R>H and H>R mutations differ when classified as benign vs pathogenic, and whether these differences reveal common molecular patterns that could aid future variant interpretation.

2. Methods

2.1. Dataset preparation

We compiled mutation type-specific variant sets for R>H and H>R substitutions from ClinVar and generated four datasets with standardized annotations.

2.1.1. ClinVar sourcing and primary filters

We downloaded the latest ClinVar release available at the time of analysis. We restricted to human, single-nucleotide missense variants that change codons to produce either arginine-to-histidine (R>H) or histidine-to-arginine (H>R). Variants labeled “Pathogenic” or “Benign” were retained as the two clinical classes; entries with conflicting or indeterminate significance (e.g., “Likely”, “Uncertain significance”, “Conflicting interpretations”) were excluded. We captured the protein-level change (e.g., p.Arg187His) and gene symbol from the ClinVar annotation for each retained record.

2.1.2. Mutation type and class stratification

We separated variants by substitution type (R>H vs H>R) and by clinical label (Pathogenic vs Benign), producing four primary ClinVar lists:

R>H Pathogenic Mutations (n = 806)
R>H Benign Mutations (n = 997)
H>R Pathogenic Mutations (n = 241)
H>R Benign Mutations (n = 252)

2.1.3. Gene and protein mapping

For each variant, we mapped the gene to a reviewed human UniProt (Swiss-Prot) accession ID where available, prioritizing canonical isoforms. Records lacking a confident human Swiss-Prot mapping were excluded from downstream structural/functional annotation to avoid species/isoform mismatches.

2.1.4. Expansion to gene analysis

We then produced datasets listing all the genes involved in the R>H and H>R mutations. Genes were compiled one per row, with the harmonized columns required for downstream analyses (function, subcellular compartments, and pH context). These finalized analysis tables, each containing one row per unique gene per mutation class (R>H; H>R), are:

R>H Pathogenic Genes (n = 291)
R>H Benign Genes (n = 491)
H>R Pathogenic Genes (n = 126)
H>R Benign Genes (n = 143)

2.1.5. Column schema (summary)

The expanded tables include:

Identifiers: Gene, UniProt_ID, Protein_change (e.g., p.Arg187His), Mutation_Position, WT_AA, Mut_AA, Mutation Type (R>H or H>R), Clinical_Class (Pathogenic/Benign), and Disease_Categories (from ClinVar “Condition(s)”).
Structural context (best available model): best_pdb_id, experimental method and resolution for all matches, DSSP-derived dssp_secondary_structure_best (Helix/Strand/Coil), dssp_rsa_best, and dssp_core_surface_best classification.
Disorder/structure features: UniProt feature flags for disorder and AlphaFold coverage (pLDDT for AF-only rows when applicable).
Evolutionary constraint: conservation metrics and Shannon entropy at the mutation position derived from ortholog MSAs after ≥80% coverage filtering (human set as reference).
Functional and localization: merged GO terms (biological process, molecular function, cellular component), summarized functional class labels, and subcellular compartment assignments.
pH context: enzyme pH-optimum annotations where available, with cleaned range handling and row-level averages/medians (neutral values emphasized for interpretation).

2.1.6. Quality control

QC included: (i) verification that the wild-type residue at Mutation_Position matched the expected R or H in the selected structure/model, (ii) removal of non-human mappings, (iii) exclusion of variants lacking resolvable protein positions, (iv) spot checks that DSSP coverage encompassed the position, and (v) consistency checks across mutation type/class splits.

2.2. Variant collection and filtering

We initiated our analysis by querying the ClinVar database using the keywords “p.Arg” and “p.His” to retrieve all reported variants where arginine or histidine is mutated to any other amino acid. From this initial pool, we filtered the dataset to include only missense variants. A custom Python script was developed to further isolate cases involving arginine-to-histidine (R>H) and histidine-to-arginine (H>R) substitutions. This curation produced four clean datasets: pathogenic R>H, benign R>H, pathogenic H>R, and benign H>R. The clean datasets were then expanded so that each row in the resulting files corresponds to a single R>H or H>R mutation per gene (overview shown in Fig. 1B), enabling accurate mapping and downstream structural or functional comparison.
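
The isolation step can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes that each ClinVar record name embeds an HGVS-style protein change such as p.Arg187His, and the function name and the example record strings are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches HGVS-style protein changes such as "p.Arg187His".
PROTEIN_CHANGE = re.compile(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})")


def classify_substitution(name: str) -> Optional[Tuple[str, int]]:
    """Return (mutation_type, position) for R>H or H>R changes, else None."""
    match = PROTEIN_CHANGE.search(name)
    if match is None:
        return None
    wild_type, mutant = match.group(1), match.group(3)
    position = int(match.group(2))
    if (wild_type, mutant) == ("Arg", "His"):
        return "R>H", position
    if (wild_type, mutant) == ("His", "Arg"):
        return "H>R", position
    return None  # any other substitution is outside the scope of the datasets


if __name__ == "__main__":
    # Hypothetical record names, for illustration only.
    for record in ("GENE1 c.560G>A (p.Arg187His)", "GENE2 c.29A>G (p.His10Arg)"):
        print(record, "->", classify_substitution(record))
```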

Fig. 1. Clinical disease categories and relative abundance of R>H and H>R mutations from ClinVar. (a) Distribution of condition-level ClinVar annotations across curated disease categories for pathogenic R>H (white bars) and H>R (hatched bars) substitutions. Each ClinVar record can list multiple conditions, so entries were split at the level of individual condition labels, yielding 1,750 condition annotations for R>H variants and 362 for H>R variants. Significance was assessed with two-sided Fisher’s exact tests and Benjamini–Hochberg FDR correction; * q < 0.05, *** q < 0.001. (b) Overall fraction of benign vs pathogenic variants for R>H and H>R substitutions. For R>H (n = 1,803 variants; 997 benign, 806 pathogenic), 55.3% of variants are benign and 44.7% are pathogenic. For H>R (n = 493 variants; 252 benign, 241 pathogenic), 51.1% are benign and 48.9% are pathogenic. The pathogenic fraction is modestly higher for H>R than for R>H.
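
For readers unfamiliar with the Benjamini–Hochberg correction named in the caption, a minimal pure-Python sketch is given below. It shows the standard step-up procedure only; the function name is hypothetical, the example p-values are made up, and the study's own implementation is not described in the text.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


if __name__ == "__main__":
    # Made-up p-values, one per disease category, for illustration only.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```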

2.3. ClinVar Disease-Category Mapping

For all ClinVar variants in the R>H and H>R datasets, we extracted the text disease labels from the “Condition(s)” field. When multiple conditions were listed for a variant, each entry was treated as an independent condition-level annotation. Labels corresponding to unspecified indications (“not provided”, “not specified”, “no assertion provided”) were removed while retaining any other conditions listed in the same record. The remaining condition strings were mapped to higher-level disease categories (e.g., cardiovascular, neurological/neurodevelopmental, developmental/syndromic, metabolic/endocrine, hematologic/immune, musculoskeletal/connective, renal/urologic, pulmonary, cancer/neoplasm, dermatologic/hair, sensory [vision/hearing], reproductive/genital, pharmacogenetic/drug response, gastrointestinal/hepatic, general/genetic broad, and other) using curated keyword lists (for example “cardiomyopathy”, “arrhythmia”, “heart failure”, and related terms for cardiovascular; “epilepsy”, “seizure”, “intellectual disability”, “autism”, and related terms for neurological/neurodevelopmental). All disease-category analyses in Fig. 1A were done at the condition level (number of condition annotations per category).
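
A minimal sketch of the condition-level mapping is shown below. The keyword lists are a short, hypothetical excerpt built from the examples quoted above (the curated lists used in the study are longer), and the pipe separator between conditions is an assumption about the export format.

```python
from typing import List

# Hypothetical excerpt of the curated keyword lists (assumption).
CATEGORY_KEYWORDS = {
    "cardiovascular": ["cardiomyopathy", "arrhythmia", "heart failure"],
    "neurological/neurodevelopmental": [
        "epilepsy", "seizure", "intellectual disability", "autism",
    ],
}
UNSPECIFIED = {"not provided", "not specified", "no assertion provided"}


def split_conditions(field: str, sep: str = "|") -> List[str]:
    """Split a Condition(s) field into labels and drop unspecified ones."""
    labels = [label.strip() for label in field.split(sep)]
    return [label for label in labels if label and label.lower() not in UNSPECIFIED]


def categorize(label: str) -> str:
    """Map one condition label to the first category whose keyword it contains."""
    lowered = label.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"
```

Each retained label is counted once, so a record listing two specific conditions contributes two condition-level annotations.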

2.4. Gene and protein mapping

Gene symbols were extracted and cross-verified from each record’s “Name” and “Gene” fields. Using the UniProt REST API¹⁷, we mapped each gene to its reviewed (Swiss-Prot) human protein entry. We retrieved the corresponding UniProt ID and canonical protein name for each mapped gene. Mismatched or ambiguous mappings were manually inspected and corrected, ensuring the accuracy of gene-to-protein relationships.

2.5. Functional annotation

Using the validated UniProt IDs, we retrieved detailed protein-level annotations from UniProt, including:

Organism verification, confirming human protein origin
Protein function summaries, based on curated literature and experimental data
Subcellular localization, including predicted and experimentally validated compartments
Gene Ontology (GO) annotations: Molecular Function, Biological Process, Cellular Component

2.6. Protein functional annotation using Gene Ontology terms

To functionally categorize proteins affected by R>H or H>R mutations, we performed structured annotation based on Gene Ontology (GO) molecular function terms. GO terms provide a standardized vocabulary for describing molecular activities of gene products, enabling biologically meaningful grouping across large datasets.

Our goal was to classify each protein into one or more high-level functional classes based on its GO molecular function annotations. These GO terms were extracted from the GO_Molecular_Function column of our curated benign and pathogenic datasets, with each protein potentially associated with multiple GO IDs.

To interpret these GO terms, we first retrieved the GO term names (labels) for each GO ID using the EMBL-EBI QuickGO REST API^{18, 19}. This provided us with descriptive term names (e.g., “protein kinase activity”) that could be semantically compared.

Next, we used a manually curated mapping dictionary to assign each GO term to one or more high-level functional categories. The classification was based on the term name (label) and biological interpretation, matching keywords and functional relevance. For instance, terms containing “kinase” were mapped to the Kinase category, while terms with “channel” were mapped to Ion Channel. This mapping step allowed us to abstract raw GO annotations into more interpretable functional groupings.

The high-level functional classes used were:

Binding
Kinase
Phosphatase
Hydrolase
Transferase
Oxidoreductase
Ligase
Receptor
Transporter
Ion Channel
Motor
Chaperone
Enzyme (for uncategorized “-ase” terms)
Structural
Transcription Factor
Miscellaneous

GO terms with no initial match were flagged as “Other” and further reviewed. We applied rule-based refinements for these terms using common molecular function keywords:

Terms containing “motor” were reassigned to Motor
Terms with “oxidase”, “reductase”, or “dehydrogenase” were reassigned to Oxidoreductase
Terms ending in “-ase” were labeled Enzyme unless already assigned

This classification framework enabled biologically relevant comparisons between the molecular functions of proteins carrying pathogenic vs benign R>H and H>R mutations, allowing us to analyze functional category enrichments, distributions, and patterns across mutation types.
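
The two-stage assignment described in this section (dictionary lookup, then rule-based refinement) might look like the sketch below. The dictionary is a small hypothetical excerpt, and treating "ending in -ase" as "any word in the term name ends in ase" is our interpretation rather than a documented rule.

```python
from typing import Set

# Hypothetical excerpt of the manually curated keyword-to-class dictionary.
PRIMARY_KEYWORDS = {
    "kinase": "Kinase",
    "channel": "Ion Channel",
    "binding": "Binding",
}


def functional_classes(term_name: str) -> Set[str]:
    """Assign a GO molecular function label to one or more high-level classes."""
    name = term_name.lower()
    classes = {cls for keyword, cls in PRIMARY_KEYWORDS.items() if keyword in name}
    if classes:
        return classes
    # Rule-based refinements for terms with no initial match.
    if "motor" in name:
        return {"Motor"}
    if any(k in name for k in ("oxidase", "reductase", "dehydrogenase")):
        return {"Oxidoreductase"}
    if any(word.endswith("ase") for word in name.split()):
        return {"Enzyme"}
    return {"Other"}
```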

2.7. Subcellular Compartment Annotation and pH Environment Mapping

To evaluate the potential role of compartment-specific pH in modulating the impact of R>H and H>R mutations, we performed a systematic mapping of subcellular localizations for each protein in our benign and pathogenic datasets and assigned corresponding pH values based on curated physiological ranges.

2.7.1. GO term extraction and resolution

We began by extracting Gene Ontology (GO) terms associated with the cellular component ontology from the GO_Cellular_Component column of each dataset^{18, 19}. These entries often contained multiple GO terms per protein, separated by pipe characters (|). We applied a regular expression filter to extract all valid GO terms matching the pattern GO:nnnnnnn.
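
As an illustration of the extraction step, the sketch below applies a regular expression for the GO:nnnnnnn pattern to a pipe-separated cell. The function name and the example cell layout are assumptions; only the identifier pattern comes from the text.

```python
import re
from typing import List

# Seven-digit GO identifiers, e.g. GO:0005829.
GO_ID = re.compile(r"\bGO:\d{7}\b")


def extract_go_ids(cell: str) -> List[str]:
    """Return unique GO identifiers from a cell, in first-seen order."""
    return list(dict.fromkeys(GO_ID.findall(cell)))
```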

To convert these GO identifiers into standardized subcellular compartment names, we used the Ontology Lookup Service (OLS) REST API (https://www.ebi.ac.uk/ols/api/ontologies/go)²⁰. For each unique GO term, we queried the OLS endpoint to retrieve the corresponding term label (label field) representing the cellular structure or localization. GO terms that could not be resolved (due to OLS limitations or deprecated terms) were excluded from subsequent mapping.

2.7.2. Assignment of compartment pH values

We constructed a compartment-to-pH mapping dictionary based on values for typical organelle pH ranges. The dictionary mapped standardized compartment names to their approximate intracellular pH under physiological conditions:

pH dictionary: { cytoplasm: 7.2, cytosol: 7.2, nucleus: 7.2, nucleoplasm: 7.2, mitochondrion: 7.8, lysosome: 4.8, endosome: 5.5, golgi: 6.5, endoplasmic reticulum: 7.1, peroxisome: 7.5, plasma membrane: 7.4, cell membrane: 7.4, extracellular: 7.4, secreted: 7.4 }

Each resolved GO label was checked for the presence of a matching compartment term (e.g., “nucleoplasm” in “nucleoplasm part”). If a match was found, the corresponding pH value was assigned. For proteins associated with multiple compartments, multiple pH values were assigned and separated using a pipe character (|). Entries with unmatched or unresolvable GO terms were marked as “NA”. The final pH assignments were appended to the datasets under a new column GO_Compartment_pH.
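
The assignment rule can be sketched as a case-insensitive substring match against the dictionary above. The function name is hypothetical, and letting the first matching compartment win for each label is an assumption, since the text does not say how a label matching several compartment names is handled.

```python
from typing import List

COMPARTMENT_PH = {
    "cytoplasm": 7.2, "cytosol": 7.2, "nucleus": 7.2, "nucleoplasm": 7.2,
    "mitochondrion": 7.8, "lysosome": 4.8, "endosome": 5.5, "golgi": 6.5,
    "endoplasmic reticulum": 7.1, "peroxisome": 7.5, "plasma membrane": 7.4,
    "cell membrane": 7.4, "extracellular": 7.4, "secreted": 7.4,
}


def compartment_ph(labels: List[str]) -> str:
    """Return pipe-separated pH values for the GO labels, or "NA" if none match."""
    values = []
    for label in labels:
        lowered = label.lower()
        for compartment, ph in COMPARTMENT_PH.items():
            if compartment in lowered:
                values.append(str(ph))
                break  # assumption: first matching compartment wins per label
    return "|".join(values) if values else "NA"
```

Note that plain substring matching, as sketched here, would not catch adjectival forms such as "lysosomal" or "mitochondrial"; handling those would require additional keys or stemming.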

2.8. EC Number Annotation and pH Optimum Mapping

2.8.1. EC number annotation

UniProt accession IDs for each protein in our benign and pathogenic R>H and H>R mutation datasets were mapped to their corresponding Enzyme Commission (EC) numbers using the UniProt REST API¹⁷. For each UniProt ID, EC numbers were extracted from the recommendedName.ecNumbers field in the proteinDescription section of the JSON response. Additional fallback extraction was performed from the comments section under entries with commentType equal to CATALYTIC_ACTIVITY.

Multiple EC numbers assigned to a single protein were retained and stored as pipe-separated values (e.g., 3.4.17.-|3.4.17.24). This ensured full annotation coverage while preserving the ambiguity of partially specified EC numbers (e.g., 3.4.17.-) when the specific enzymatic function remained unclassified.

2.8.2. Extraction of pH Optimum from BRENDA

The BRENDA database (version 2025.1)¹⁰ was downloaded in JSON format and parsed locally²¹. For each EC number present in our annotated datasets, we queried the “ph_optimum” field to extract all pH optimum values corresponding to Homo sapiens proteins.

Organism-specific entries were matched via the protein section of each EC block in the BRENDA JSON, using the associated organism metadata. pH values were only retained if the organism was explicitly annotated as Homo sapiens. If a protein was annotated with multiple EC numbers (e.g., 3.4.17.-|3.4.17.24), pH values from all of them were combined.

The extracted pH values were stored as pipe-separated strings in a new column (pH_Optimum) for both benign and pathogenic datasets. These combined pH optimum values were then averaged if multiple values were present per EC number. This average pH optimum value was then used for further downstream analysis.
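
The averaging step can be illustrated with the short sketch below. It assumes the pH_Optimum cell holds plain numbers separated by pipes; the handling of ranges and other non-numeric entries in the actual pipeline is not specified here, so this sketch simply skips tokens that do not parse as numbers.

```python
from statistics import mean
from typing import Optional


def average_ph_optimum(cell: str) -> Optional[float]:
    """Average a pipe-separated pH_Optimum string; None if it holds no numbers."""
    values = []
    for token in cell.split("|"):
        try:
            values.append(float(token.strip()))
        except ValueError:
            continue  # skip blanks and non-numeric placeholders such as "NA"
    return mean(values) if values else None
```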

2.9. Conservation score and Shannon entropy analysis of R>H and H>R mutations

To evaluate the evolutionary aspect at each mutation site, we quantified residue-level conservation for both benign and pathogenic R>H and H>R mutations using a combination of BLAST-based homology search, multiple sequence alignment (MSA), and entropy-based scoring.

2.9.1. Data preparation

We began with our four curated datasets containing R>H and H>R missense variants, each annotated with a UniProt accession ID (UniProt_ID) and mutation site (Mutation_Position). Each dataset was then processed separately but identically.

2.9.2. Protein sequence retrieval

For each unique UniProt_ID, the canonical protein sequence was retrieved in FASTA format from UniProt using their REST API¹⁷. Sequences were saved locally and used as queries for homology search.

2.9.3. Homology search

We ran local blastp searches against the curated Swiss-Prot protein database using NCBI BLAST+ (v2.15.0)²². For each query, we retrieved up to 100 top-scoring homologous protein sequences with an E-value threshold of 1e-5. Results were extracted in tabular format (outfmt 6) and reformatted into FASTA files containing only the aligned sequences.
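
A search with these settings could be launched from Python roughly as follows. The sketch only assembles the argument list; the file names and the database name are hypothetical placeholders, and the exact command used in the study is not given in the text.

```python
from typing import List


def blastp_command(query_fasta: str, database: str, output_path: str) -> List[str]:
    """Build a blastp argument list with the thresholds described above."""
    return [
        "blastp",
        "-query", query_fasta,
        "-db", database,            # hypothetical local Swiss-Prot database name
        "-evalue", "1e-5",          # E-value threshold
        "-max_target_seqs", "100",  # keep up to 100 hits per query
        "-outfmt", "6",             # BLAST+ tabular output
        "-out", output_path,
    ]


if __name__ == "__main__":
    # Pass the list to subprocess.run(cmd, check=True) on a machine with BLAST+ installed.
    print(" ".join(blastp_command("query.fasta", "swissprot", "hits.tsv")))
```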

2.9.4. Multiple sequence alignment

Homologous sequences for each query were aligned using Clustal Omega (v1.2.4) and MUSCLE, two protein multiple sequence alignment programs, for alignment accuracy^{23, 24}. Alignments were output in FASTA format and saved individually for each protein.

2.9.5. Conservation score calculation

To evaluate the evolutionary conservation of residues mutated from arginine-to-histidine (R>H) and histidine-to-arginine (H>R) in our datasets, we performed a rigorous multi-step analysis using orthologous protein sequences.

First, the human protein sequence corresponding to each UniProt ID was used as the reference. To construct a high-quality multiple sequence alignment (MSA), homologous sequences were retrieved and filtered to retain only those that shared at least 80% sequence coverage relative to the human sequence. This filtering step ensured sufficient positional correspondence across the orthologs.

Next, all kept sequences were stripped of alignment gaps, yielding ungapped sequences that maintained their biological reading frame. These gap-free sequences were then subjected to a second round of alignment using Clustal Omega and MUSCLE, generating refined MSAs with improved column consistency and alignment quality.

For each aligned dataset, we calculated residue-level conservation scores only for alignment files containing five or more orthologous sequences to ensure statistical reliability. The conservation score at the mutation site was defined as the proportion of aligned sequences that contain an arginine (R; for the R>H dataset) or a histidine (H; for the H>R dataset) at the position corresponding to the human mutation site:

\text{Conservation Score}_{R>H} = \frac{n_{R}}{N_{R>H}} \qquad (1)

\text{Conservation Score}_{H>R} = \frac{n_{H}}{N_{H>R}} \qquad (2)

Where:

n_R = number of sequences with arginine at the aligned position (for the R>H dataset)
n_H = number of sequences with histidine at the aligned position (for the H>R dataset)
N_R>H, N_H>R = total number of sequences in the respective multiple sequence alignments

The alignment position corresponding to the human mutation site was identified by mapping the ungapped residue index in the reference sequence to its corresponding aligned index in the MSA. Gaps and non-R or non-H residues were included in the denominator to accurately reflect residue variability at that site, as shown in Eq. (1) and Eq. (2).

Only those residues in the human sequence that matched the expected wild-type residue at the mutation position (arginine for the R>H dataset, histidine for the H>R dataset) were included in the scoring analysis. The resulting conservation scores reflect the positional evolutionary constraint across orthologs and were used to distinguish conserved from variable regions within both pathogenic and benign mutation datasets.

Sites that could not be mapped to the alignment due to missing sequence coverage or coordinate mismatches were excluded from analysis.
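
The index mapping and the score in Eqs. (1) and (2) can be sketched as below. The MSA is represented as a plain list of equal-length aligned strings with "-" as the gap character; the function names and this representation are assumptions made for illustration.

```python
from typing import List


def aligned_index(aligned_reference: str, position: int) -> int:
    """Map a 1-based ungapped residue position to a 0-based MSA column."""
    seen = 0
    for column, residue in enumerate(aligned_reference):
        if residue != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError("position lies outside the reference sequence")


def conservation_score(msa: List[str], reference_row: int,
                       position: int, wild_type: str) -> float:
    """Fraction of aligned sequences carrying the wild-type residue at the site."""
    column = aligned_index(msa[reference_row], position)
    if msa[reference_row][column] != wild_type:
        raise ValueError("reference residue does not match the expected wild type")
    hits = sum(1 for sequence in msa if sequence[column] == wild_type)
    return hits / len(msa)  # gaps and other residues stay in the denominator
```

For an R>H site, `wild_type` would be "R"; for an H>R site, "H".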

2.9.6. Shannon entropy calculation

To quantify positional variability across orthologous protein sequences, we calculated Shannon entropy at each R>H or H>R mutation site using multiple sequence alignments (MSAs). The aim was to measure the degree of amino acid conservation or diversity at the exact mutation position, using the human sequence as the reference.

We began with aligned MSAs from Clustal Omega and MUSCLE, constructed from orthologs of each protein identified by its UniProt ID. Only sequences meeting ≥80% alignment coverage with the human reference sequence were retained. Gaps were removed from these filtered sequences to avoid introducing spurious entropy at poorly aligned positions. The resulting sequences were then re-aligned using Clustal Omega and MUSCLE to generate high-quality MSAs for entropy analysis.

Entropy was only calculated for positions where the aligned amino acid in the human reference sequence at the annotated mutation site was arginine (for the R>H dataset) or histidine (for the H>R dataset), and the re-aligned MSA contained a minimum of five orthologous sequences, including the human sequence.

For each qualifying position, we identified the aligned column corresponding to the human arginine (R>H dataset) or histidine (H>R dataset) residue and computed Shannon entropy using the formula:

H_{R>H}(X) = -\sum_{i} p_{R>H}(x_{i}) \log_{2} p_{R>H}(x_{i}) \qquad (3)

H_{H>R}(X) = -\sum_{i} p_{H>R}(x_{i}) \log_{2} p_{H>R}(x_{i}) \qquad (4)

Where:

H_R>H (X), H_H>R (X) = Shannon entropy of R>H or H>R multiple sequence alignment
∑_i = sum over all possible outcomes
p_R>H (x_i), p_H>R (x_i) = the relative frequency (probability) of residue x_i at the mutation position in the corresponding dataset
– p(x_i) log₂ p(x_i) = the probability of residue x_i multiplied by the logarithm of that probability. Because log₂ p(x_i) ≤ 0, the minus sign ensures entropy is non-negative

Higher entropy values represent greater sequence variability, while values near zero indicate strong conservation.

The calculated entropy scores were appended to the respective benign and pathogenic mutation datasets as a new column, Shannon_Entropy. Mutations with insufficient ortholog coverage or non-R/non-H residues at the mutation position were excluded from entropy analysis and annotated accordingly.
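
A minimal sketch of Eqs. (3) and (4) applied to a single alignment column is given below. The column is passed as a string of one-letter codes; counting every character in the column, including any gap symbol, as its own outcome is an assumption of this sketch.

```python
from collections import Counter
from math import log2


def shannon_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residues observed in one MSA column."""
    counts = Counter(column)
    total = len(column)
    # p * log2(1/p) summed over outcomes equals -sum(p * log2(p)).
    return sum((n / total) * log2(total / n) for n in counts.values())
```

A fully conserved column gives 0 bits, and a column split evenly between two residues gives 1 bit.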

2.10. Structural coverage identification and analysis

We analyzed the four curated datasets of single-point mutations (pathogenic and benign R>H, and pathogenic and benign H>R), classified according to ClinVar. Each entry contained the gene symbol, UniProt ID, mutation position, and protein sequence data. These datasets were further annotated with structural metrics to explore the context of each mutation.

2.10.1. Protein structure retrieval

We retrieved protein structures in a manner that maximized structural coverage. As a primary source, we queried the RCSB PDB API for each UniProt ID to collect all available experimental structures derived from Homo sapiens²⁵. To ensure biological relevance, structures were filtered based on three criteria: they had to originate from human proteins, span the mutation position of interest, and contain the wild-type residue at that position (arginine in the R>H datasets or histidine in the H>R datasets).

For cases in which no suitable experimental structures were available, we used AlphaFold-predicted human models as a secondary source^{26, 27}. These models were retained only if they covered the mutation position and contained the expected wild-type residue (arginine for R>H or histidine for H>R). AlphaFold structure files were parsed directly to verify the residue identity at the mutation site; the B-factor field of these files encodes the predicted Local Distance Difference Test (pLDDT) scores.

2.10.2. PDB chain filtering

In structures with multiple chains, we validated that at least one chain contained the expected wild-type residue (arginine or histidine) at the mutation position. Only those chains were retained for downstream analysis. A new column (chains_with_arg) was added to denote the subset of chains that passed this criterion.

2.10.3. Best PDB selection

For entries with multiple qualifying PDB structures, we selected the best representative structure per row based on:

Experimental method preference: X-ray > EM > NMR
Resolution score: Preference for resolution < 3.5 Å
Scoring Formula: Score = Method Weight + 1(Resolution < 3.5 Å) + (5 − Resolution), where 1(·) equals 1 when the condition holds and 0 otherwise, and Resolution is in Å
If only one PDB was available, it was selected by default.
AlphaFold models were automatically selected when they were the sole available structure.
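
The selection rule above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the numeric method weights are hypothetical placeholders that only preserve the stated order X-ray > EM > NMR, the PDB identifiers are made up, and treating a missing resolution as contributing nothing to the resolution terms is an assumption.

```python
from typing import Optional

# Hypothetical weights: the text gives only the ordering X-ray > EM > NMR.
METHOD_WEIGHT = {
    "X-RAY DIFFRACTION": 3.0,
    "ELECTRON MICROSCOPY": 2.0,
    "SOLUTION NMR": 1.0,
}

def structure_score(method: str, resolution: Optional[float]) -> float:
    """Score = method weight + I(resolution < 3.5) + (5 - resolution)."""
    score = METHOD_WEIGHT.get(method.upper(), 0.0)
    if resolution is not None:  # assumption: no resolution -> no resolution terms
        score += 1.0 if resolution < 3.5 else 0.0
        score += 5.0 - resolution
    return score

def pick_best(candidates):
    """candidates: list of (pdb_id, method, resolution); returns the top-scoring id."""
    return max(candidates, key=lambda c: structure_score(c[1], c[2]))[0]

# Made-up candidate structures for one variant row.
candidates = [
    ("AAAA", "X-RAY DIFFRACTION", 2.0),
    ("BBBB", "ELECTRON MICROSCOPY", 3.8),
    ("CCCC", "SOLUTION NMR", None),
]
best = pick_best(candidates)
```

With these placeholder weights the 2.0 Å crystal structure wins, because it collects the method weight, the sub-3.5 Å bonus, and the largest (5 − Resolution) term.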

2.10.4. Secondary structure and solvent accessibility

We ran DSSP (Define Secondary Structure of Proteins) to extract²⁸:

Secondary structure code and descriptive label
Relative Solvent Accessibility (RSA) values
Core vs Surface classification with Core RSA ≤ 0.20 and Surface RSA > 0.20²⁹

These metrics were added to the dataset under columns:

dssp_secondary_structure_best
dssp_structure_label_best
dssp_rsa_best
dssp_core_surface_best
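
The core/surface rule can be written in a few lines. The sketch below assumes that RSA has already been computed as a fraction between 0 and 1; the label strings "Core" and "Surface" and the blank value for a missing RSA are illustrative choices, not taken from the source.

```python
from typing import Optional

CORE_RSA_MAX = 0.20  # threshold stated in the text: core if RSA <= 0.20

def classify_rsa(rsa: Optional[float]) -> str:
    """Return 'Core' or 'Surface' for an RSA value, or '' when RSA is missing."""
    if rsa is None:
        return ""
    return "Core" if rsa <= CORE_RSA_MAX else "Surface"

# Example row using the column names listed above (values are made up).
row = {"dssp_rsa_best": 0.12}
row["dssp_core_surface_best"] = classify_rsa(row["dssp_rsa_best"])
```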

2.10.5. Disorder annotation via UniProt

For each UniProt accession present in our datasets, we queried the UniProt REST API and parsed annotations. Residue intervals with type = “Disordered region” or type = “Region” whose description contained “Disordered” were extracted and merged into non-redundant ranges per protein. For every variant, the mutation position was intersected with these ranges: sites falling inside a disordered interval were labeled disordered, while sites outside were labeled ordered (blank if no relevant UniProt feature was available for that protein). This classification was recorded in a new column: UniProt_Disorder, containing either “Ordered” or “Disordered”.
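
The merge-and-intersect step described above can be sketched as below. The sketch starts from intervals that have already been parsed from UniProt (the API call itself is not shown), and it assumes that directly adjacent intervals are merged as well as overlapping ones; the source does not specify this detail.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) residue ranges (1-based, inclusive)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

def disorder_label(position, intervals):
    """'Disordered' inside a merged range, 'Ordered' outside, '' if no ranges exist."""
    if not intervals:
        return ""  # no relevant UniProt feature for this protein
    ranges = merge_intervals(intervals)
    inside = any(start <= position <= end for start, end in ranges)
    return "Disordered" if inside else "Ordered"

# Made-up disordered ranges for one protein.
ranges = [(1, 10), (5, 20), (30, 40)]
labels = [disorder_label(p, ranges) for p in (15, 25, 35)]
```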

2.10.6. Disorder prediction via AlphaFold Confidence

For all AlphaFold-based entries, we extracted pLDDT scores from the B-factor field of the AlphaFold PDB files. The pLDDT value for each mutation position was retrieved from the Cα atom of the relevant residue. Based on established AlphaFold interpretation guidelines, mutation positions with pLDDT < 70 were flagged as disordered, while those with pLDDT ≥ 70 were considered ordered³⁰. This classification was recorded in a new column: alphafold_confidence, containing either “ordered” or “disordered.” Experimental structures were excluded from this step, as they do not contain pLDDT confidence scores.
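
As an illustration of this step, the sketch below reads the B-factor column of the Cα record from fixed-column PDB text and applies the 70 cutoff. It relies only on the standard PDB ATOM column layout; the helper that builds example records and the pLDDT values are made up for the demonstration.

```python
def make_atom_line(serial, name, resname, chain, resseq, b_factor):
    """Build a fixed-column PDB ATOM record (coordinates zeroed for brevity)."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b_factor:6.2f}")

def plddt_at(pdb_lines, chain, resseq):
    """Return (residue name, pLDDT) from the CA atom of the requested residue."""
    for line in pdb_lines:
        if (line.startswith("ATOM")
                and line[12:16].strip() == "CA"
                and line[21] == chain
                and int(line[22:26]) == resseq):
            return line[17:20], float(line[60:66])  # B-factor columns 61-66
    return None

def confidence_label(plddt):
    """pLDDT < 70 -> 'disordered', otherwise 'ordered' (cutoff from the text)."""
    return "disordered" if plddt < 70 else "ordered"

# Made-up AlphaFold-style records.
lines = [
    make_atom_line(1, " N  ", "ARG", "A", 10, 91.50),
    make_atom_line(2, " CA ", "ARG", "A", 10, 91.50),
    make_atom_line(3, " CA ", "HIS", "A", 11, 42.10),
]
resname, plddt = plddt_at(lines, "A", 10)
label = confidence_label(plddt)
```

Returning the residue name together with the score lets the same lookup serve the wild-type identity check.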

2.10.7. Residue verification and quality control

Before running DSSP, we performed residue identity checks to confirm the presence of the expected wild-type residue (arginine for R>H or histidine for H>R) at the mutation position in each selected structure. Mutations failing this criterion were excluded from DSSP. Additional checks were included for:

Chain presence
Missing coordinates
AlphaFold alignment issues

We also independently re-ran DSSP on SOLUTION NMR cases that lack resolution metadata, to ensure completeness.

2.11. Computational Resources

All statistical analyses and visualizations were performed using Python (Pandas, SciPy, Seaborn, Matplotlib). Most computational tasks, including BLAST searches, multiple sequence alignments, structure validation, and DSSP analysis, were performed either locally or on the Palmetto High Performance Computing (HPC) cluster³¹.

3. Results

While the primary focus of this study is to investigate the impact of R>H and H>R missense mutations on pH-dependent properties, we also use clinical annotations to compare benign and pathogenic variants with respect to structural and sequence-based protein features. This comparison evaluates whether the pH-dependent change expected from R>H at physiological pH (via histidine titration), or the removal of pH-dependence expected from H>R, is linked with evolutionary constraints and structural/functional classifications.

To begin our investigation of R>H and H>R substitutions in a disease context, we first classified ClinVar “Condition(s)” annotations for pathogenic variants into disease categories. We compared their distributions across our R>H and H>R pathogenic datasets (Fig. 1A). For both R>H and H>R variants, neurological/neurodevelopmental and metabolic/endocrine phenotypes were predominant, accounting for ~20% and ~11% of total cases, respectively, with no significant enrichment (Fisher’s exact test with Benjamini–Hochberg FDR, q = 1.0 for both categories). Several categories, however, showed significant differences between R>H and H>R. Renal/urologic phenotypes were strongly enriched among R>H variants (6.1% vs 0.6% of condition annotations for R>H and H>R, respectively; q = 5×10⁻⁶), and musculoskeletal/connective diagnoses and pharmacogenetic/drug-response annotations were also more frequent for R>H substitutions (5.4% vs 2.2% and 1.6% vs 0%, respectively; q = 0.032). In contrast, H>R variants were relatively enriched in developmental/syndromic and hematologic/immune conditions (19.9% vs 14.3% and 9.7% vs 5.7% of condition annotations, respectively; q = 0.032 for both categories). A complementary overview of variant recurrence (Fig. 1B) shows that R>H substitutions occur far more frequently than H>R substitutions in ClinVar (n = 1803 vs n = 493 total variants). However, the proportion of variants classified as pathogenic is similar between the two directions (44.7% vs 48.9% pathogenic for R>H and H>R, respectively).

To understand the occurrence of R>H and H>R missense mutations further, we examined the biophysical and biochemical nature of our datasets. The wild-type residue for the R>H datasets, arginine, is positively charged across almost the entire pH range. In contrast, the wild-type residue for the H>R datasets, histidine, can switch its charge state near neutral or physiological pH because of its side-chain pK_a. Some arginine residues may be buried or partially buried for a particular biochemical function, and thus, a mutation replacing them with a shorter-chain histidine residue is expected to alter protein function. However, a substitution of a wild-type arginine residue with histidine may have little impact on functionality unless the wild-type arginine is involved in specific interactions. On the other hand, histidine is frequently involved in specific interactions and is often localized in the interior of a protein, so a substitution with a bulky arginine residue may have significant structural and functional effects.

Using DSSP-derived RSA values, we classified each mutation site as core (buried; RSA ≤ 0.20) or surface (solvent-exposed; RSA > 0.20) and summarized the distribution by mutation type and clinical class (Fig. 2 and Fig. S2). In both R>H and H>R, benign variants are overwhelmingly found on the surface, whereas pathogenic variants occur at buried sites far more often. For R>H, 507 of 531 benign variants (~95%) are surface-exposed and only 24 (~5%) are in the core, while among the 401 pathogenic R>H variants, 227 (~57%) occur at the surface and 174 (~43%) in the core (Fisher’s exact test, p = 1.46×10⁻⁴⁸). For H>R, 106 of 121 benign variants (~88%) are at the surface and 15 (~12%) in the core, compared with 38 of 131 pathogenic variants (~29%) on the surface and 93 (~71%) in the core (p = 1.62×10⁻²¹). Thus, although benign variants remain strongly surface-skewed for both substitution types, pathogenic variants in both R>H and H>R are much more likely to occur at buried positions than their benign counterparts. Across all four groups, pathogenic H>R variants show the strongest enrichment in the core, while benign R>H mutations are the most surface-enriched. These results reinforce the notion that mutations of core wild-type residues are associated with pathogenicity, while mutations at surface-exposed positions tend to be benign.

Fig. 2. — Core vs surface distribution of R>H and H>R variants. Bars show, for each mutation type (R>H total = 932; H>R total = 252), the percentage of all variants in that mutation type that fall in the protein core (blue) or surface (pink). Counts per dataset: R>H Pathogenic 174 core/227 surface (401 total), R>H Benign 24 core/507 surface (531 total), H>R Pathogenic 93 core/38 surface (131 total), H>R Benign 15 core/106 surface (121 total). Asterisks indicate Fisher’s exact tests (two-sided): ***p < 0.001, **p < 0.01, *p < 0.05.
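
For readers who want to recompute this kind of comparison, the sketch below implements a two-sided Fisher's exact test for a 2×2 table using only the standard library, and applies it to the core/surface counts given in the caption. It is an illustrative implementation (summing all hypergeometric probabilities no larger than the observed one), not necessarily the routine used in the study, so small numerical differences from the reported p-values are possible.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of observing k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in map(prob, range(k_min, k_max + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)

# Rows: pathogenic, benign; columns: core, surface (counts from Fig. 2).
p_rh = fisher_exact_two_sided(174, 227, 24, 507)
p_hr = fisher_exact_two_sided(93, 38, 15, 106)
```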

3.1. Secondary structure type

We next examined whether R>H and H>R mutations exhibit preferential localization to specific secondary structure types. We compared the normalized distributions of benign and pathogenic mutations using DSSP-assigned labels grouped into Helix, Strand, and Coil categories (Fig. 3). Our analysis showed clear mutation type-specific biases in secondary structure localization of R>H and H>R mutations. In the R>H cohort (N=932), pathogenic variants were significantly enriched in helical regions (192/932, 20.6%) compared with benign variants (168/932, 18.0%; p = 6.6×10⁻⁷, ***). Conversely, coil regions were strongly depleted in pathogenic mutations (139/932, 14.9%) relative to benign (294/932, 31.5%; p = 3.9×10⁻¹⁰, ***). The proportion of strand-localized mutations was similar between pathogenic (70/932, 7.5%) and benign (69/932, 7.4%) groups, showing no significant difference (p = 0.063).

Fig. 3. — Normalized distribution of mutation sites across the major secondary structure classes (Helix, Strand, Coil). Bars show the percentage of variants occurring in Helix, Strand, or Coil (DSSP) for R>H and H>R, split by Benign and Pathogenic groups (normalized within each group; R>H total N = 932 and H>R total N = 252). Within R>H, pathogenic variants are enriched in helices and depleted in coils, with a modest, non-significant difference in strands. Within H>R, pathogenic variants are enriched in strands, show no significant difference in helices, and are depleted in coils. Benign cases in both R>H and H>R are enriched in coils. Asterisks indicate Fisher’s exact tests (two-sided): ***p < 0.001, **p < 0.01, *p < 0.05.

The pattern was very distinct for the H>R cohort (N=252). Pathogenic mutations were enriched in β-strands (33/252, 13.1%) relative to benign mutations (11/252, 4.4%; p = 8.2×10⁻⁴, ***). By contrast, coil regions were underrepresented in pathogenic variants (55/252, 21.8%) compared with benign (76/252, 30.2%; p = 1.1×10⁻³, **). No significant difference was observed in helical regions (pathogenic 43/252, 17.1% vs benign 34/252, 13.5%; p = 0.494).

Together, these findings demonstrate that pathogenic R>H mutations preferentially affect helical elements and are depleted in coil regions, whereas pathogenic H>R mutations are biased toward β-strand environments and away from coils, revealing a structural difference between the two substitution types. This observation contrasts with our previous finding that pathogenic and benign mutations in general show no preference for a particular secondary structure element². It highlights the distinct behavior of R>H and H>R and shows that such mutations do not follow the general trend seen for the rest of the variant types.

3.2. Ordered vs disordered regions

Intrinsically disordered regions (IDRs) are parts of a protein that do not adopt stable tertiary structures under physiological conditions. Instead, these regions remain relatively flexible and often contain shorter motifs, as well as post-translational modification sites, that mediate and regulate transient interactions within these regions. Since they do not contribute to a tightly packed hydrophobic core, IDRs can frequently tolerate substantial sequence variation without catastrophic effects on global folding. On the other hand, residues within ordered regions are located in well-defined and stable secondary and tertiary structure elements, such that even conservative substitutions like R>H and H>R can disrupt local packing, longer-range electrostatic networks, or binding interfaces. Consequently, mutations in ordered regions are generally more likely to perturb structural integrity and/or critical interactions, and thus are associated with pathogenic phenotypes, whereas disordered segments are comparatively more permissive to sequence change.

Since structural disorder can buffer mutations, we compared the frequency of R>H and H>R substitutions in ordered vs disordered regions. To this end, we queried disordered-region intervals from the UniProt REST API to annotate each mutation site as “Ordered” or “Disordered,” and then compared compositions between Benign and Pathogenic variants within each mutation type (Fig. 4). For R>H, benign variants were more often found in disordered regions (23.3% Disordered / 76.7% Ordered; n = 730), whereas pathogenic variants were predominantly localized in ordered regions (10.7% / 89.3%; n = 458). The difference was highly significant (Fisher’s exact, p < 0.001; OR_{Disordered, Path vs Benign} = 0.39, 95% CI 0.28–0.56), indicating that pathogenic R>H variants are markedly less likely to occur in disordered regions. For H>R, the same pattern holds: benign variants were more often found in disordered regions (27.5% / 72.5%; n = 178), whereas pathogenic variants were found mostly in ordered regions (10.5% / 89.5%; n = 143; Fisher’s exact, p = 1.33 × 10⁻⁴; OR = 0.31, 95% CI 0.16–0.58). These observations support the view that intrinsic disorder can buffer the structural impact of R>H and H>R substitutions, whereas ordered domains are less tolerant, leading to a higher proportion of pathogenic variants at ordered sites.

Fig. 4. — Pathogenic variants preferentially occur in ordered regions, which are less tolerant to structural disturbances than intrinsically disordered segments for both R>H and H>R. 100% stacked bars show the within-group composition of Disordered (blue) and Ordered (orange) UniProt annotations for Benign and Pathogenic variants in each mutation class. R>H: Benign 23.3% Disordered / 76.7% Ordered (n = 730); Pathogenic 10.7% / 89.3% (n = 458). H>R: Benign 27.5% / 72.5% (n = 178); Pathogenic 10.5% / 89.5% (n = 143). Statistical comparisons (2×2 Fisher’s exact; reported in text) indicate significant enrichment of ordered contexts among pathogenic variants in both types of substitutions.

3.3. Conservation score and Shannon entropy

To assess evolutionary constraints at R>H and H>R mutation positions, we compared the conservation scores and Shannon entropy values of mutation sites between benign and pathogenic datasets. Violin plots revealed a stark contrast in distribution patterns. The conservation scores (Fig. 5) for pathogenic mutations were significantly skewed toward high conservation (near 1.0), indicating that these mutations tend to occur at highly conserved residues. In contrast, benign mutations displayed a broader and more symmetric distribution, with a substantial portion of scores in the low to intermediate conservation range.

Fig. 5. — Distribution of conservation scores for R>H and H>R mutations in benign and pathogenic datasets. Violin plots show that pathogenic mutations are significantly enriched at highly conserved residues (score near 1.0), while benign mutations are more evenly distributed across the conservation spectrum in R>H and H>R datasets.

Similarly, Shannon entropy scores (Fig. 6), which represent the sequence variability at each aligned site, were markedly lower in the pathogenic group. Pathogenic mutations clustered around entropy values near 0, suggesting high conservation and minimal variability at these positions. Conversely, benign mutations showed a wider distribution of entropy scores, with many mutations occurring at moderate or highly variable residues.
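
As a reminder of what the entropy score measures, the sketch below computes the Shannon entropy H = −Σ p_i log p_i of a single alignment column. The logarithm base (2 here) and the choice to ignore gap characters are assumptions; this section does not state which conventions were used, so absolute values may differ from those reported.

```python
from collections import Counter
from math import log2

def shannon_entropy(column, gap_chars="-."):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    return sum(-(k / n) * log2(k / n) for k in counts.values())

# Made-up columns: fully conserved, two residues, four residues.
conserved = shannon_entropy("RRRRRRRR")
two_state = shannon_entropy("RRHH")
four_state = shannon_entropy("RHKQ")
```

A fully conserved column gives 0, and the value grows as more residue types share the column more evenly, which is why pathogenic sites cluster near 0.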

Using Mann–Whitney U tests, we also compared the conservation score and Shannon entropy between benign and pathogenic variants. For R>H, conservation scores were higher for pathogenic (n=347, median = 0.959; mean = 0.790; P10–P90 = 0.279–1.000) than benign (n=303, median = 0.350; mean = 0.423; 0.087–0.927), Mann–Whitney U = 20,791; p = 3.95×10⁻⁴¹. Shannon entropy showed the reciprocal pattern: benign (n=303, median = 1.594; mean = 1.599; 0.000–2.857) exceeded pathogenic (n=347, median = 0.280; mean = 0.493; 0.000–1.745), U = 78,250.5; p = 1.92×10⁻²⁷. For H>R, conservation was likewise higher at pathogenic sites (n=107, median = 0.886; mean = 0.780; 0.246–1.000) than benign (n=80, median = 0.232; mean = 0.334; 0.060–0.701), U = 1,057.5; p = 7.03×10⁻¹⁹; and entropy was higher for benign (n=80, median = 1.864; mean = 1.845; 0.000–2.781) than pathogenic (n=107, median = 0.491; mean = 0.773; 0.000–2.324), U = 6,852; p = 1.42×10⁻¹². Collectively, the Mann–Whitney tests show that pathogenic variants in both mutation types occur at significantly more conserved, lower-entropy positions than benign variants. High conservation across homologs typically marks residue positions that are structurally or functionally essential, for example, catalytic bases, ligands that bind metals or cofactors, or positions that stabilize tightly packed cores and interaction interfaces. These patterns therefore point to pathogenic R>H and H>R substitutions preferentially occurring at sites where any amino acid change is more likely to disrupt protein function. On the other hand, benign mutations tend to occur at positions that are less conserved among homologs and tolerate a broader range of substitutions without perturbing folding, catalysis, or binding. Our conservation score and Shannon entropy findings are consistent with this notion.
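
To make the U statistic concrete, the sketch below counts, for every benign/pathogenic pair, which score is larger, with ties counting one half; this is why half-integer values such as 78,250.5 can arise. The scores are made up, the p-value calculation is omitted, and which group's U is reported in the text is not stated, so the example shows both.

```python
def mann_whitney_u(x, y):
    """U for sample x: number of (x_i, y_j) pairs with x_i > y_j; ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

# Hypothetical conservation scores, for illustration only.
benign = [0.10, 0.35, 0.35, 0.60]
pathogenic = [0.35, 0.90, 0.95, 1.00]

u_benign = mann_whitney_u(benign, pathogenic)
u_pathogenic = mann_whitney_u(pathogenic, benign)
n_pairs = len(benign) * len(pathogenic)  # the two U values always sum to this
```

The two statistics sum to n₁·n₂, so a U far from n₁·n₂/2 indicates that one group tends to have systematically larger values.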

3.4. Subcellular distribution

We then examined whether R>H and H>R variants are enriched in proteins localized in particular subcellular compartments, given that these differ in their pH environments. We analyzed the subcellular distribution of benign and pathogenic mutations by mapping each variant’s associated protein to Gene Ontology (GO) cellular compartment terms. After identifying the top 30 most frequently occurring GO terms, we merged semantically related compartments into higher-level categories (e.g., Nucleus, Cytoplasm, Plasma Membrane). The normalized frequencies of each compartment were compared between benign and pathogenic variants (Fig. 7). Across both mutation types, the largest fraction of variants map to nuclear proteins, with pathogenic variants more frequent than benign ones in this compartment. A marked difference was also observed for mitochondria, where pathogenic variants are substantially more frequent than benign variants.

Fig. 7. — Normalized distribution of merged GO compartments by group. Bar plot showing the relative frequency (%) of R>H and H>R mutations mapped to merged GO cellular compartment categories, separated by group. Only the top 30 most frequently occurring GO compartments were considered, and similar GO terms were merged into biologically meaningful categories (e.g., Nucleus, Cytoplasm, Plasma Membrane). Frequencies were normalized within each group to show the proportional distribution within each dataset.

We also compared the distribution of GO Cellular Component annotations between Pathogenic and Benign variants within each mutation type using Fisher’s exact tests (gene-level coverage; genes counted if they had ≥1 annotation in a bin). For R>H, Mitochondrion was enriched among pathogenic variants (p = 9.03×10⁻⁴), whereas Endosome was enriched among benign variants (p = 6.66×10⁻³). For H>R, pathogenic variants were again enriched in Mitochondrion (p = 3.49×10⁻⁴) and also in Lysosome (p = 2.61×10⁻²), while benign variants were enriched in Extracellular Matrix (p = 1.11×10⁻³), Ribosome (p = 3.45×10⁻²), and Cytoskeleton (p = 4.45×10⁻²).

3.5. pH optima results

To directly connect R>H and H>R substitutions with pH dependence, we integrated enzyme pH-optimum data to test whether pathogenic and benign variants differ in their distribution of functional pH ranges. Analysis of pH-optimum distribution for R>H and H>R mutations revealed that the majority of substitutions occur within the physiological and near-physiological ranges (Fig. 8).

Fig. 8. — Distribution of R>H and H>R substitutions across pH bins under two normalizations. (a) Proportions normalized across the entire mutation type dataset (R>H; H>R). Bars show the fraction of all observed substitutions that fall into each pH bin, split by clinical class (benign = blue; pathogenic = orange) and mutation type (R>H = solid; H>R = hatched). The largest share of substitutions occurs in the neutral (6.6–7.6) bin, with a secondary concentration at 7.6–8.6. R>H benign is most enriched in the neutral bin, while H>R pathogenic is also disproportionately represented there. (b) Proportions normalized within each mutation type and clinical class (benign vs pathogenic are scaled within R>H and within H>R separately). This controls for different totals in the R>H and H>R datasets. The mutation type patterns mirror 8a: R>H benign remains maximally concentrated around neutral (6.6–7.6) with a smaller contribution at 7.6–8.6, and H>R pathogenic remains elevated in the neutral range. Substitutions are relatively rare in acidic (<5.6) and strongly alkaline (>8.6) bins in both mutation types.

Using Enzyme Commission (EC) mappings, we retrieved experimentally determined pH-optimum values from BRENDA for a subset of genes for each mutation type. Since the R>H dataset was substantially larger than the H>R dataset (782 vs 269 genes), this imbalance was also mirrored in the pH-optima subset. For R>H, we obtained pH optima for 173 proteins (91 benign, 82 pathogenic), representing 22.1% of the full R>H dataset (782 genes; 491 benign, 291 pathogenic). Within this R>H pH-optima subset, 52.6% were benign and 47.4% were pathogenic. For H>R, we obtained pH optima for 90 proteins (29 benign, 61 pathogenic), representing 33.5% of the full H>R dataset (269 genes; 143 benign, 126 pathogenic). Within this H>R pH-optima subset, 32.2% were benign and 67.8% were pathogenic. Across both mutation types combined, we analyzed 263 proteins with pH optima, which corresponds to 25.0% of the total gene set (263/1,051).

For R>H variants, pathogenic mutations were most frequently observed in the neutral (6.6–7.6) bin (23.7% of all pathogenic variants), and benign variants were more enriched in the same bin (32.4%). H>R variants also showed strong enrichment in the neutral bin, with 41.1% of pathogenic and 16.7% of benign substitutions falling in this range. At more alkaline pH values (7.6–8.6), both R>H and H>R mutations showed moderate representation (R>H: 15.6% pathogenic vs 16.8% benign; H>R: 16.7% pathogenic vs 10.0% benign). In acidic bins (<6.6), both mutation classes occurred less frequently, though small enrichments were present (e.g., R>H pathogenic 4–6% range; H>R pathogenic 4–6% range).
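
The binning used for this comparison can be sketched as follows. The bin edges (5.6, 6.6, 7.6, 8.6) are taken from the ranges named in the text and in Fig. 8; placing a value that falls exactly on an edge into the higher bin is an assumption, and the pH values in the example are made up.

```python
from bisect import bisect_right
from collections import Counter

EDGES = [5.6, 6.6, 7.6, 8.6]
LABELS = ["<5.6", "5.6-6.6", "6.6-7.6", "7.6-8.6", ">8.6"]

def ph_bin(ph_optimum):
    """Map a pH optimum to its bin label (edge values go to the higher bin)."""
    return LABELS[bisect_right(EDGES, ph_optimum)]

def bin_fractions(ph_values):
    """Fraction of values in each bin, normalized within the given group."""
    counts = Counter(ph_bin(v) for v in ph_values)
    return {label: counts[label] / len(ph_values) for label in LABELS}

# Made-up pH optima for one group (e.g., one mutation type and clinical class).
fractions = bin_fractions([7.0, 7.2, 8.0, 5.0])
```

Normalizing within each group, as in `bin_fractions`, corresponds to the second normalization described for Fig. 8 and removes the effect of the different dataset sizes.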

3.6. Protein functional classes

We then classified variants by protein function to investigate whether specific protein functions were disproportionately associated with benign or pathogenic R>H or H>R mutations. We examined the distribution of functional classes annotated for each protein (Fig. 9). After removing the Miscellaneous bin (proteins not mapped to a major class), we counted class labels per gene and compared benign vs pathogenic proportions within each mutation type.

Fig. 9. — Functional classes of genes carrying R>H and H>R variants (per gene). Horizontal lollipop plots show, for each functional class, the percentage of genes in that dataset that contain the class at least once in the Functional_Classes field. Left, R>H (Benign vs Pathogenic); right, H>R. Values are normalized by the total number of genes in each dataset. Brackets with asterisks mark Benign vs Pathogenic differences significant by two-sided Fisher’s exact test (* p≤0.05; ** p≤0.01; *** p≤0.001; **** p≤1e−4).

R>H pathogenic genes were significantly more likely than benign genes to be annotated as Binding (83.2% [242/291] vs 70.5% [346/491]; OR=0.49, 95% CI 0.34–0.70, p = 7.6×10⁻⁵), Transcription Factor (13.1% [38/291] vs 5.7% [28/491]; OR=0.40, 0.24–0.67, p = 4.9×10⁻⁴), Enzyme (29.6% [86/291] vs 20.8% [102/491]; OR=0.63, 0.45–0.87, p = 7.2×10⁻³), Ion Channel (10.0% [29/291] vs 5.1% [25/491]; OR=0.49, 0.28–0.84, p = 1.26×10⁻²), Oxidoreductase (6.2% [18/291] vs 2.6% [13/491]; OR=0.42, 0.20–0.85, p = 2.14×10⁻²), and Transporter (8.9% [26/291] vs 4.9% [24/491]; OR=0.53, 0.30–0.93, p = 3.34×10⁻²). Other classes showed no significant differences. Percentages are listed as pathogenic vs benign, whereas the odds ratios are expressed for benign relative to pathogenic genes, so OR < 1 corresponds to enrichment among pathogenic genes.

H>R pathogenic genes were enriched for Binding (87.3% [110/126] vs 72.7% [104/143]; OR=0.40, 0.21–0.74, p = 3.81×10⁻³) and Ligase (10.3% [13/126] vs 2.1% [3/143]; OR=0.21, 0.063–0.70, p = 7.78×10⁻³). Chaperone appeared only in pathogenic (3.2% [4/126] vs 0% [0/143]; p = 4.69×10⁻²). All other classes were not significant.
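
The odds ratios in this section can be recomputed from the bracketed counts. The sketch below assumes three things that the text does not state explicitly: the ratio is taken as benign relative to pathogenic, 0.5 is added to every cell (a Haldane–Anscombe correction), and the interval is a Woolf-type log interval. Under these assumptions the R>H Binding entry is reproduced, but this is an inference from the numbers rather than a documented method.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.959964, correction=0.5):
    """Odds ratio (a/b)/(c/d) with a log-scale 95% CI; `correction` is added to each cell."""
    a, b, c, d = (v + correction for v in (a, b, c, d))
    ratio = (a / b) / (c / d)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return ratio, exp(log(ratio) - z * se), exp(log(ratio) + z * se)

# R>H, Binding class: benign 346 of 491 genes, pathogenic 242 of 291 genes.
benign_yes, benign_no = 346, 491 - 346
path_yes, path_no = 242, 291 - 242
ratio, ci_low, ci_high = odds_ratio_ci(benign_yes, benign_no, path_yes, path_no)
```

Swapping the two groups in the call gives the reciprocal ratio (about 2), which expresses the same enrichment from the pathogenic side.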

3.7. Analysis of proteins harboring both pathogenic and benign mutations

Finally, to separate gene-level from residue-level effects, we searched for genes that harbor both pathogenic and benign mutations, and investigated whether there are characteristics within the same protein that make some sites more predisposed to pathogenic rather than benign mutations, or vice versa.

First, we compared benign and pathogenic variants occurring in R>H and H>R datasets. Our study showed 14 overlapping genes between R>H pathogenic and R>H benign datasets, and no overlapping genes between H>R pathogenic and H>R benign variants. To better understand the molecular distinctions, we compared the structural and evolutionary variables of the 14 overlapping genes across the R>H clinical classifications. These overlapping proteins offer a unique opportunity to assess how context, rather than identity alone, determines mutation pathogenicity.

We began by mapping the distribution of the R>H mutations across 14 individual genes shared between the benign and pathogenic datasets (Fig. 10). The positional analysis showed that in several overlapping genes, pathogenic mutations appeared in clusters near conserved functional domains, while benign mutations were more scattered, suggesting differential structural or functional constraints within the same protein.

As seen in Fig. 10, the top absolute plot reveals that overlapping genes vary substantially in protein length, with some genes, like PIEZO1 and NSD1, showing mutations spread across a wide range of amino acid positions, while others, such as AHSG and TULP3, have mutations concentrated at the N-terminal regions. Certain clusters of pathogenic mutations can be seen at higher residue positions in long proteins (e.g., TG, NSD1, MYO15A), suggesting possible functional domains toward the C-terminus may be sensitive to R>H substitutions. This visualization highlights the importance of absolute position context, especially in long multi-domain proteins.

Additionally, regarding the bottom relative plot, mutations are broadly distributed along the protein sequences with no strong enrichment at either terminus. However, a few clusters of pathogenic mutations appear near the C-terminus in certain genes (e.g., NSD1, TG). The general observation is that pathogenic and benign positions are typically away from each other, with very few exceptions.

Next, we examined whether the local secondary structure influenced pathogenicity (Fig. 11). Pathogenic R>H mutations were more likely to occur within α-helices and β-strands, whereas benign mutations were more frequently associated with coils and turns. This trend supports the hypothesis that pathogenic mutations tend to disrupt well-structured regions of the protein where residue substitution is more likely to affect folding or stability; it is also in line with the secondary structure results shown in Fig. 3.

Relative solvent accessibility (RSA) analysis further distinguished the two groups (Fig. 12). Pathogenic mutations spanned a wide range of RSA values, including many in moderately buried or fully buried positions. In contrast, benign mutations were skewed toward more solvent-exposed residues. A summary classification of these mutations into “core” (buried) and “surface” (exposed) categories (Fig. 13) confirmed this trend: a greater proportion of pathogenic mutations occurred in core regions compared to benign mutations, which were primarily found on the surface. This suggests that pathogenic mutations may interfere with protein folding or stability by altering densely packed core regions, following the overall trend we reported in Fig. 2 and Fig. S2.

Fig. 13. — Core vs Surface Mutation Counts. This stacked barplot categorizes mutations into core or surface regions based on RSA thresholds. Each bar represents the total number of mutations per group, split into core (low RSA) and surface (high RSA) components, giving a summary of spatial localization differences across the two groups.

Finally, we investigated whether evolutionary conservation could explain the differential impact of these mutations (Fig. 14). Pathogenic mutations showed significantly higher conservation scores, most near 1.0, indicating that these residues are highly conserved across homologs and likely under strong evolutionary constraint. On the other hand, benign mutations occurred at much more variable positions, with a broader distribution and lower median conservation values. This finding suggests that pathogenic R>H mutations are more likely to disrupt critical, evolutionarily preserved functions.

Together, all the results of our gene overlap investigation demonstrate that benign and pathogenic R>H substitutions differ systematically across multiple molecular contexts, including structural environment, conservation, disorder, functional category, and pH-related properties. These observations suggest that R>H mutations are context-dependent, which we further explore in the Discussion.

To further extend our studies into overlapping genes, we also searched for genes that spanned all our datasets. While there was no single gene that overlapped all of our datasets (R>H pathogenic/benign and H>R pathogenic/benign), there was an overlap that appeared in three out of the four datasets.

We compared four gene sets: R>H Pathogenic (291), H>R Pathogenic (126), R>H Benign (491), and H>R Benign (143). The largest pairwise overlaps occurred within group type across mutation types: R>H P ∩ H>R P contained 32 genes (Jaccard 0.083; examples include TP53, CFTR, VHL, G6PD, SCN2A, TGFBR2) and R>H B ∩ H>R B contained 33 genes (Jaccard 0.055). In contrast, cross-group overlaps were very small: R>H P ∩ R>H B = 14 (Jaccard 0.018), H>R P ∩ R>H B = 5 (Jaccard 0.008), R>H P ∩ H>R B = 1 (Jaccard 0.002), and H>R P ∩ H>R B = 0. Only one three-way intersection was observed (R>H P ∩ H>R P ∩ R>H B = NSD1), and there was no four-way intersection (Fig. 15).

Fig. 15. — Gene-set intersections visualized with an UpSet plot. Each column is an intersection of the four sets; filled dots indicate which sets participate, with a vertical line connecting them. Top bars give the size of each intersection (genes). The right panel shows total set sizes. Labels use R>H and H>R with B = Benign and P = Pathogenic. The largest intersections are the within-group, across mutation type pairs, R>H P ∩ H>R P (n=32) and R>H B ∩ H>R B (n=33), while cross-group overlaps are small to absent (e.g., H>R P ∩ H>R B = 0). A single three-way intersection is present (R>H B ∩ R>H P ∩ H>R P = 1), and no four-way overlap is observed.
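
The Jaccard indices quoted for the pairwise overlaps follow directly from the set sizes, since |A ∪ B| = |A| + |B| − |A ∩ B|. The sketch below shows the calculation both for explicit gene sets and for sizes alone; the set sizes and intersection counts are those given in the text, while the small gene sets in the usage lines are illustrative.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two gene sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def jaccard_from_sizes(size_a: int, size_b: int, size_intersection: int) -> float:
    """Jaccard index computed from set sizes and the intersection size."""
    return size_intersection / (size_a + size_b - size_intersection)

# Sizes from the text: R>H P = 291, H>R P = 126, R>H B = 491, H>R B = 143.
j_path = jaccard_from_sizes(291, 126, 32)    # R>H P vs H>R P
j_benign = jaccard_from_sizes(491, 143, 33)  # R>H B vs H>R B
j_rh = jaccard_from_sizes(291, 491, 14)      # R>H P vs R>H B

# Illustrative use with explicit (tiny) gene sets.
j_demo = jaccard({"TP53", "CFTR", "VHL"}, {"TP53", "NSD1"})
```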

4. Discussion

We hypothesized that R>H/H>R substitutions, although chemically conservative, may act as context-dependent modulators of protein function because histidine’s side-chain pK_a lies near neutrality. This property allows histidine to change its charge state and activity within physiological pH ranges, so a substitution may be either benign (tolerated) or pathogenic (disease-associated), depending on its structural, evolutionary, and cellular context.

Our analysis showed that pathogenic R>H and H>R variants are significantly enriched in buried/core residues, while benign substitutions are more frequently surface-exposed. This result demonstrates a strong association between core localization and pathogenicity in R>H and H>R mutations, which confirms our previous finding and larger-scale studies showing that disease-causing substitutions are more often found in the protein core, where they can disrupt packing and stability^{2, 14}.

While both arginine and histidine are positively charged residues, substitutions in tightly packed, buried environments are more likely to destabilize protein structure due to differences in side-chain geometry, hydrogen bonding, or packing constraints. The disproportionate occurrence of benign variants on the surface suggests that solvent-exposed positions better tolerate side-chain substitutions. This finding supports long-standing observations that core mutations tend to disrupt protein folding and thermodynamic stability, and it emphasizes the importance of considering three-dimensional structural context when interpreting the potential pathogenicity of missense variants^{3, 15}.

The context-dependent nature of R>H and H>R substitutions is also reflected at the evolutionary level, as seen in our conservation and Shannon entropy analyses (Fig. 5 and Fig. 6). Pathogenic variants cluster at highly conserved, low-entropy positions, whereas benign variants more often occupy sites that are variable across orthologs. Conserved residues are generally observed at functional and structural “hotspots”, such as catalytic residues, metal-binding motifs, and residues that support interaction interfaces or stabilize the protein core. Introducing or removing pH sensitivity at such positions through R>H or H>R substitutions is therefore especially likely to perturb enzyme activity, binding specificity, or conformational regulation, in line with their association with pathogenic outcomes.
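As an illustration of the entropy measure referred to above, the sketch below computes the Shannon entropy of a single alignment column. It assumes a base-2 logarithm and that gap characters are ignored; the exact conventions used in the study (log base, gap handling, normalization) are not given in this section, and the example columns are invented.

```python
from collections import Counter
from math import log2


def column_entropy(column, ignore_gaps=True):
    """Shannon entropy (bits) of one multiple-sequence-alignment column."""
    residues = [r for r in column.upper() if not (ignore_gaps and r == "-")]
    if not residues:
        return 0.0
    total = len(residues)
    # H = sum over residue types of p * log2(1 / p)
    return sum((n / total) * log2(total / n) for n in Counter(residues).values())


# Invented example columns (one character per ortholog):
conserved = "RRRRRRRR"   # a fully conserved arginine
variable = "RHKQRHKQ"    # four residue types, equally frequent

print(column_entropy(conserved))
print(column_entropy(variable))
```

A fully conserved column has zero entropy, and entropy grows as more residue types share the position, which is why low-entropy sites correspond to the conserved positions where pathogenic variants cluster.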

In addition to the evolutionary analyses, we examined the distribution of R>H and H>R missense mutations across DSSP-based secondary structure elements (SSE). Our findings suggest that secondary structure context contributes to the functional consequences of these substitutions. Helices and β-strands form the stable core elements of a protein’s architecture and are often involved in key structural or functional roles, such as ligand binding, catalytic activity, or protein-protein interactions. Mutations in these regions are more likely to disrupt folding, stability, or biological activity, which may explain their enrichment in the pathogenic group. In contrast, coil regions are more structurally flexible and may tolerate amino acid substitutions with fewer deleterious effects, consistent with the higher prevalence of benign variants in these environments. This pattern contrasts with our prior analyses of disease-associated vs benign substitutions, which found little difference in secondary structure distribution^{2}. Hence, secondary structure context seems to amplify the disruptive potential of R>H and H>R substitutions, providing an additional layer of explanation for their pathogenicity in structured regions. A similar trend was observed for intrinsically disordered vs ordered regions: pathogenic mutations are overrepresented in ordered regions and underrepresented in disordered ones. The preference is not as prominent as for SSE, perhaps because intrinsically disordered regions are frequently involved in macromolecular interactions, and disruption of such interactions could be deleterious. This also aligns with prior studies showing that disordered regions tend to harbor more benign or neutral mutations, whereas pathogenic variants are more often located in structured regions critical for folding, interactions, or enzymatic activity^{32}.
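To make the helix/strand/coil grouping concrete, the sketch below collapses per-residue DSSP codes into three classes and reports class fractions for a set of variant positions. The reduction used here (H, G, I as helix; E, B as strand; everything else as coil) is a common convention and is an assumption on our part; the mapping applied in the study may differ.

```python
from collections import Counter

# Assumed three-class reduction of DSSP codes (a common convention).
DSSP_TO_CLASS = {
    "H": "helix", "G": "helix", "I": "helix",
    "E": "strand", "B": "strand",
    "T": "coil", "S": "coil", "-": "coil", " ": "coil",
}


def sse_class(dssp_code):
    """Map one DSSP code to helix/strand/coil; unknown codes default to coil."""
    return DSSP_TO_CLASS.get(dssp_code, "coil")


def sse_fractions(codes):
    """Fraction of positions in each class for an iterable of DSSP codes."""
    counts = Counter(sse_class(c) for c in codes)
    total = sum(counts.values())
    if total == 0:
        return {"helix": 0.0, "strand": 0.0, "coil": 0.0}
    return {k: counts.get(k, 0) / total for k in ("helix", "strand", "coil")}


# Invented DSSP codes at variant positions, for illustration only:
print(sse_fractions("HHEE-T S"))
```

Comparing these fractions between pathogenic and benign variant sets, separately for R>H and H>R, would reproduce the kind of contrast described in the text.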

Subcellular pH is a crucial regulator of macromolecular activity, and proteins have evolved to maintain their pH optima in line with the pH of their resident compartment. Mitochondria maintain a relatively alkaline matrix (pH ~7.8–8.0), whereas endosomes and lysosomes are progressively acidified (pH ~6 and pH ~4.5–5.0, respectively), and the extracellular matrix can be mildly acidic, especially in the tumor microenvironment. In our GO cellular compartment analysis, R>H and H>R substitutions were more likely to be pathogenic when they occurred in compartments with strong or tightly regulated pH biases, notably mitochondria (for both mutation types) and lysosomes (for H>R). In contrast, benign variants were more frequent in compartments with greater modularity or redundancy, such as the extracellular matrix, ribosome, and cytoskeleton, or in trafficking intermediates such as endosomes (for R>H). One interpretation is that compartments with more extreme or constrained pH, such as mitochondria and lysosomes, are less tolerant to changes in local pH-sensing or charge networks introduced by R>H or H>R substitutions, whereas more permissive environments can buffer their effects. Although benign variants were enriched in extracellular matrix proteins in our dataset, many secreted metalloproteases are activated under acidic conditions, and solid tumors often exhibit an acidified extracellular environment. In such contexts, the gain or loss of histidine-based pH sensitivity in extracellular proteins could influence protease activity, matrix remodeling, and signaling, and will be important to examine in more targeted future studies^{33, 34}. The strong presence of nucleus-localized proteins seen across both mutation types likely reflects not only classical DNA-binding transcription factors but also chromatin regulators, epigenetic readers and writers, and other nuclear enzymes and scaffolding proteins^{35, 36}. In addition, several nuclear bodies and condensates are now known to exhibit pH-sensitive phase behavior, providing a potential route by which introducing or removing a titratable histidine side chain could modulate nuclear signaling, chromatin state, or gene-regulatory programs^{37, 38}.

Subcellular localization provides one aspect of environmental context; however, protein function is also determined by the pH range at which stability and activity are optimized. A protein’s pH optimum reflects its structural features, its function, and its external environment, and changes in this optimum can impair folding, binding, or catalysis^{4, 5}. To test whether R>H and H>R substitutions exploit this constraint, we compared the distribution of benign and pathogenic variants across proteins with experimentally measured pH optima. The overrepresentation of benign mutations in proteins with acidic pH optima likely reflects functional buffering in low-pH environments. In contrast, the enrichment of pathogenic mutations in proteins with neutral to slightly basic pH optima suggests that these environments may be more susceptible to disruptions in charge state or hydrogen bonding caused by histidine substitution.

Since histidine has a pK_a near neutral/physiological pH (~6.0–6.5), it can gain or lose a proton depending on its local environment, potentially altering protein folding, function, or catalytic activity. In proteins optimized for near-neutral pH, the introduction of histidine may lead to unintended protonation behavior and thus functional impairment. Conversely, in acidic environments where histidine is more likely to be consistently protonated, its impact may be more predictable or functionally tolerated. These findings are consistent with the hypothesis that the protein environment, including pH, plays a vital role in modulating the pathogenic potential of missense mutations, especially those involving titratable residues such as arginine and histidine.

A key motivation for exploring the enzymatic pH optima of proteins harboring R>H and H>R mutations was to investigate whether pH-dependent biochemical context could help explain the pathogenicity seen across protein environments. Because histidine has a titratable side chain with a pK_a near neutral/physiological pH (~6.0–6.5), its protonation state, and thus its electrostatic behavior, can shift depending on the local pH environment. In contrast, arginine maintains a consistently positive charge across physiological pH ranges. Therefore, R>H substitutions may introduce local charge variability or disrupt electrostatic interactions in a pH-sensitive manner. Conversely, H>R substitutions would remove this pH sensitivity. By analyzing the pH optima of proteins affected by these mutations, our goal was to identify whether certain pH environments (acidic, neutral, or basic) are more vulnerable to pathogenic outcomes. Ultimately, this knowledge can guide the design of small molecules that stabilize the mutated protein’s function or electrostatics within its native pH context, potentially opening new therapeutic avenues for pH-sensitive R>H-driven or pH-insensitive H>R-driven diseases.
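The contrast between a titratable histidine and a constitutively charged arginine follows from the Henderson–Hasselbalch relation, under which the protonated fraction of a basic side chain is 1 / (1 + 10^(pH − pK_a)). The sketch below evaluates this for model pK_a values. The histidine value (6.5) lies within the range quoted above; the arginine value (12.5) and the cytosolic pH (7.2) are textbook approximations that we assume here. Real pK_a values in a folded protein can shift substantially with the local environment, so this is an idealized illustration only.

```python
def protonated_fraction(ph, pka):
    """Fraction of a basic side chain that is protonated (charged) at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Model pK_a values (assumptions; local environment can shift them).
PKA_HIS = 6.5
PKA_ARG = 12.5

# Approximate compartment pH values; the cytosol value is an assumption,
# the others follow the ranges quoted in the Discussion.
compartments = {
    "lysosome": 4.5,
    "endosome": 6.0,
    "cytosol": 7.2,
    "mitochondrial matrix": 8.0,
}

for name, ph in compartments.items():
    his = protonated_fraction(ph, PKA_HIS)
    arg = protonated_fraction(ph, PKA_ARG)
    print(f"{name:22s} pH {ph:.1f}  His+ {his:.3f}  Arg+ {arg:.6f}")
```

In this idealized picture histidine goes from almost fully protonated in the lysosome to mostly neutral in the mitochondrial matrix, while arginine stays essentially fully charged throughout, which is the asymmetry that R>H introduces and H>R removes.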

The pH-optima analysis mapped the distribution of R>H and H>R mutations across pH bins and underscores the importance of pH dependence near the physiological range in shaping the pathogenic potential of these substitutions between basic residues. The neutral bin (6.6–7.6) was consistently the most populated for both benign and pathogenic variants, consistent with the fact that most human proteins function under near-neutral intracellular conditions. However, an asymmetry emerged between the two types of substitution. R>H mutations showed a relatively higher benign enrichment in the neutral range, while H>R mutations were skewed toward pathogenic enrichment in the same bin. This may reflect the distinct biophysical consequences of introducing vs removing pH sensitivity at near-neutral conditions: R>H mutations introduce a histidine side chain with a pK_a near neutrality, which may be tolerated in some situations, whereas H>R mutations eliminate pH responsiveness and may disrupt regulatory mechanisms. The modest representation of both R>H and H>R substitutions in the acidic and alkaline ranges suggests that mutations in proteins operating outside physiological pH are relatively rare, potentially due to stronger constraints in extreme environments. Thus, the pH-optima analysis indicates that the influence of pH context on pathogenicity differs between R>H and H>R missense mutations in enzymes.
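The binning step described above can be sketched as follows. Only the neutral bin boundaries (6.6–7.6) are taken from the text; treating everything below as "acidic" and everything above as "alkaline" is a simplifying assumption, since the full bin scheme is not restated in this section. The records are invented placeholders.

```python
from collections import Counter

# Neutral bin boundaries from the text; the two flanking bins are assumed.
NEUTRAL_LOW, NEUTRAL_HIGH = 6.6, 7.6


def ph_bin(ph_optimum):
    """Assign a pH optimum to an acidic, neutral, or alkaline bin."""
    if ph_optimum < NEUTRAL_LOW:
        return "acidic"
    if ph_optimum <= NEUTRAL_HIGH:
        return "neutral"
    return "alkaline"


def bin_counts(records):
    """Count (label, bin) pairs for an iterable of (label, pH optimum) records."""
    return Counter((label, ph_bin(ph)) for label, ph in records)


# Invented records: (clinical label, pH optimum of the affected enzyme).
records = [
    ("pathogenic", 7.0), ("pathogenic", 7.4), ("pathogenic", 8.1),
    ("benign", 5.0), ("benign", 7.2),
]

for (label, bin_name), n in sorted(bin_counts(records).items()):
    print(label, bin_name, n)
```

Per-bin counts of this form, tabulated separately for R>H and H>R, are what an enrichment comparison between benign and pathogenic variants would be built on.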

Overall, these observations support the idea that R>H and H>R substitutions occur in proteins that are already evolutionarily tuned to their operating pH. In enzymes adapted to near-neutral conditions, histidine often plays a direct role in determining the activity of that protein. Therefore, introducing or removing histidine at these sites can alter, shift, or distort that activity profile in ways that are more likely to compromise function and present as pathogenic variants. By contrast, enzymes optimized for acidic environments appear more tolerant to R>H and H>R substitutions, either because histidine is constitutively protonated in their native environment or because other residues carry the burden of pH sensing. Thus, the pH-optimum results are consistent with a model in which R>H and H>R mutations are most deleterious when they interfere with finely tuned pH-sensitive chemistry near physiological pH, rather than acting as uniformly disruptive charge changes across all proteins.

Because certain protein functions, such as DNA binding, redox activity, or ion transport, are tightly coupled to pH and electrostatics, we hypothesized that R>H or H>R substitutions may be particularly disruptive in these classes^{39}. To explore this, we compared the distribution of benign and pathogenic variants across functional categories. Across both mutation types, the Binding functional class emerges as the dominant and pathogenic-enriched class, consistent with the notion that R>H and H>R substitutions can disrupt electrostatic contacts, DNA/RNA/protein interfaces, and sequence-specific recognition. For R>H, additional enrichments in transcription factors, ion channels, transporters, and oxidoreductases/enzymes point to the sensitivity of charge-coupled mechanisms, such as ligand-gated conduction, active transport, redox catalysis, and active-site protonation, to R>H changes.

For H>R, enrichment in ligases and a small signal for chaperones suggest vulnerabilities in ATP-dependent conjugation or quality-control pathways. While several effects are modest, the pattern aligns with our broader theme that functional contexts that rely on precise electrostatics or interface recognition are more likely to harbor pathogenic R>H and H>R substitutions. Future work could further refine these observations by stratifying within superfamilies (e.g., DNA- vs protein-binding) and by adjusting for multiple testing across the different classes.
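One way the multiple-testing adjustment mentioned above might be carried out is the Benjamini–Hochberg false discovery rate procedure. The sketch below is offered only as an example of such an adjustment; it is not a method used in this study, and the p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented per-class p-values, for illustration only.
raw = {"binding": 0.005, "ion channel": 0.01, "ligase": 0.03, "chaperone": 0.04}
for name, q in zip(raw, benjamini_hochberg(list(raw.values()))):
    print(f"{name:12s} adjusted p = {q:.3f}")
```

Classes whose adjusted value stays below the chosen threshold would be reported as enriched, which guards against over-interpreting the modest per-class effects noted in the text.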

While certain functional categories appear particularly sensitive to R>H substitutions, our dataset suggested that substitutions can have divergent outcomes even within the same protein. To directly test whether gene-level effects or local context drive pathogenicity, we compared benign and pathogenic variants occurring in the same genes. We identified 14 genes shared between the R>H pathogenic and R>H benign datasets, but no genes shared between the H>R pathogenic and H>R benign datasets. The presence of both benign and pathogenic R>H mutations within the same proteins highlights the importance of local context in the effect of mutations. Even though these R>H mutations involve the same residue type and gene, their structural location, solvent accessibility, evolutionary conservation, and predicted disorder state differ significantly.

This divergence underscores the limitations of gene-level annotations for variant classification and emphasizes the value of integrating residue-level structural and functional information. By dissecting these properties in overlapping genes, this analysis shows how seemingly similar mutations can have vastly different phenotypic consequences, depending on their precise biophysical and evolutionary context.

Across all four datasets, the UpSet plot (Fig. 15) shows that overlap is concentrated within groups across mutation types: R>H and H>R share many of the same pathogenic genes (n=32) and likewise share a sizable benign set (n=33), while benign–pathogenic overlaps are small or absent. This pattern implies that, for a subset of genes, the clinical outcome does not depend on whether the substitution is R>H or H>R; their disease liability likely stems from context-dependent biophysics (charge networks, stability, or interface geometry) rather than the substitution type. On the other hand, the near-separation of benign and pathogenic genes within a mutation type indicates that tolerated and deleterious changes generally occur in distinct structural contexts, consistent with our earlier results showing pathogenic enrichment in cores, ordered regions, and defined secondary structure and depletion in coils/surface.
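The gene-set overlaps summarized in the UpSet plot can be tabulated by recording, for every gene, the exact combination of datasets in which it appears. The sketch below assumes exclusive intersections (each gene counted once, under its full membership combination), which is how UpSet plots are commonly drawn; the gene identifiers are placeholders and not genes from the study.

```python
from collections import Counter


def exclusive_intersections(gene_sets):
    """Count genes by the exact combination of named sets that contain them."""
    membership = {}
    for name, genes in gene_sets.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())


# Placeholder gene identifiers, for illustration only.
gene_sets = {
    "R>H P": {"GENE1", "GENE2", "GENE3"},
    "H>R P": {"GENE2", "GENE3"},
    "R>H B": {"GENE3", "GENE4"},
    "H>R B": {"GENE4"},
}

counts = exclusive_intersections(gene_sets)
for combo, n in sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(" ∩ ".join(sorted(combo)), "=", n)
```

Because the result is a `Counter`, a combination that never occurs (for example, a gene present in both H>R pathogenic and H>R benign sets only) simply reports zero.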

Taken together, our structural, evolutionary, localization, and functional analyses reveal a unifying theme: R>H and H>R substitutions are most likely to be pathogenic when they disrupt the biophysical adaptation of a protein to its environment. Pathogenic variants preferentially occur at buried, structurally ordered, and evolutionarily conserved positions, in proteins operating within constrained pH landscapes (for example, mitochondrial, lysosomal, or nuclear chromatin contexts), and in functional classes where precisely positioned charges at binding or catalytic sites are critical. In contrast, benign variants are enriched at solvent-exposed, flexible, or redundant sites, in compartments and proteins that can buffer changes in protonation state. This investigation sheds light on why chemically conservative substitutions between two basic residues can have very different clinical outcomes, suggesting that future studies should explicitly integrate pH-dependent biophysical properties alongside sequence-based and structure-based analyses.

Supplementary Material

NIHMS2134374-supplement-supplementary_material.docx (326.3 KB, docx)

Acknowledgments

This research utilized resources and services provided by Clemson University’s HPC infrastructure. We gratefully acknowledge Clemson University’s Palmetto Cluster for providing high-performance computing resources and support services that enabled this work.

Funding Information

This work was supported by a grant from NIH, grant number R35GM151964.

Footnotes

Statement of Usage of Artificial Intelligence

Artificial Intelligence (AI) tools were used to prepare this paper. ChatGPT was used for code debugging. Grammarly AI was used for grammar, punctuation, clarity, and overall readability. No AI tools were used for the scientific results or interpretations presented in this manuscript.

Conflict of Interest

The authors declare no conflicts of interest.

Contributor Information

Nirav Modha, Department of Physics, College of Science, Clemson University, 118 Kinard Laboratory, Clemson, South Carolina, 29634, USA; Medical Biophysics Graduate Program, Clemson University, 118 Kinard Laboratory, Clemson, South Carolina, 29634, USA.

Emil Alexov, Computational Biophysics & Bioinformatics, 118 Kinard Laboratory, Clemson, South Carolina, 29634, USA; Department of Physics, College of Science, Clemson University, 118 Kinard Laboratory, Clemson, South Carolina, 29634, USA; Medical Biophysics Graduate Program, Clemson University, 118 Kinard Laboratory, Clemson, South Carolina, 29634, USA; Clemson University Center for Human Genetics, 106 Gregor Mendel Circle, Greenwood, South Carolina, 29646, USA.

Data Availability

The curated datasets are available for download from the Computational Biophysics and Bioinformatics Lab webpage (http://compbio.clemson.edu/lab/downloads/).

References

1. Zhang Z; Miteva MA; Wang L; Alexov E Analyzing effects of naturally occurring missense mutations. Computational and mathematical methods in medicine 2012, 2012 (1), 805827.
2. Petukh M; Kucukkal TG; Alexov E On human disease-causing amino acid variants: Statistical study of sequence and structural patterns. Human mutation 2015, 36 (5), 524–534.
3. Pandey P; Alexov E Most monogenic disorders are caused by mutations altering protein folding free energy. International Journal of Molecular Sciences 2024, 25 (4), 1963.
4. Alexov E Numerical calculations of the pH of maximal protein stability: The effect of the sequence composition and three-dimensional structure. European Journal of Biochemistry 2004, 271 (1), 173–185.
5. Talley K; Alexov E On the pH-optimum of activity and stability of proteins. Proteins: Structure, Function, and Bioinformatics 2010, 78 (12), 2699–2706.
6. Mitra RC; Zhang Z; Alexov E In silico modeling of pH-optimum of protein–protein binding. Proteins: Structure, Function, and Bioinformatics 2011, 79 (3), 925–936.
7. Garcia-Moreno B Adaptations of proteins to cellular and subcellular pH. Journal of biology 2009, 8 (11), 98.
8. Koirala M; Shashikala HM; Jeffries J; Wu B; Loftus SK; Zippin JH; Alexov E Computational investigation of the pH dependence of stability of melanosome proteins: implication for melanosome formation and disease. International journal of molecular sciences 2021, 22 (15), 8273.
9. Chan P; Lovrić J; Warwicker J Subcellular pH and predicted pH-dependent features of proteins. Proteomics 2006, 6 (12), 3494–3501.
10. Webb BA; Chimenti M; Jacobson MP; Barber DL Dysregulated pH: a perfect storm for cancer progression. Nature Reviews Cancer 2011, 11 (9), 671–677.
11. Spassov VZ; Yan L pH-selective mutagenesis of protein–protein interfaces: In silico design of therapeutic antibodies with prolonged half-life. Proteins: Structure, Function, and Bioinformatics 2013, 81 (4), 704–714.
12. Wei W; Sulea T Sequence-based engineering of pH-sensitive antibodies for tumor targeting or endosomal recycling applications. In MAbs, 2024; Taylor & Francis: Vol. 16, p 2404064.
13. Isom DG; Castañeda CA; Cannon BR; García-Moreno EB Large shifts in pKa values of lysine residues buried inside a protein. Proceedings of the National Academy of Sciences 2011, 108 (13), 5260–5265.
14. Yue P; Li Z; Moult J Loss of protein structure stability as a major causative factor in monogenic disease. Journal of molecular biology 2005, 353 (2), 459–473.
15. White KA; Ruiz DG; Szpiech ZA; Strauli NB; Hernandez RD; Jacobson MP; Barber DL Cancer-associated arginine-to-histidine mutations confer a gain in pH sensing to mutant proteins. Science signaling 2017, 10 (495), eaam9931.
16. Landrum MJ; Lee JM; Benson M; Brown GR; Chao C; Chitipiralla S; Gu B; Hart J; Hoffman D; Jang W ClinVar: improving access to variant interpretations and supporting evidence. Nucleic acids research 2018, 46 (D1), D1062–D1067.
17. Consortium TU UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Research 2024, 53 (D1), D609–D617. DOI: 10.1093/nar/gkae1010 (accessed 8/23/2025).
18. Ashburner M; Ball CA; Blake JA; Botstein D; Butler H; Cherry JM; Davis AP; Dolinski K; Dwight SS; Eppig JT Gene ontology: tool for the unification of biology. Nature genetics 2000, 25 (1), 25–29.
19. Aleksander SA; Balhoff J; Carbon S; Cherry JM; Drabkin HJ; Ebert D; Feuermann M; Gaudet P; Harris NL The gene ontology knowledgebase in 2023. Genetics 2023, 224 (1), iyad031.
20. Jupp S; Burdett T; Leroy C; Parkinson HE A new Ontology Lookup Service at EMBL-EBI. SWAT4LS 2015, 2, 118–119.
21. Chang A; Jeske L; Ulbrich S; Hofmann J; Koblitz J; Schomburg I; Neumann-Schaal M; Jahn D; Schomburg D BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic acids research 2021, 49 (D1), D498–D508.
22. Camacho C; Coulouris G; Avagyan V; Ma N; Papadopoulos J; Bealer K; Madden TL BLAST+: architecture and applications. BMC bioinformatics 2009, 10 (1), 421.
23. Sievers F; Wilm A; Dineen D; Gibson TJ; Karplus K; Li W; Lopez R; McWilliam H; Remmert M; Söding J Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular systems biology 2011, 7 (1), 539.
24. Edgar RC MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research 2004, 32 (5), 1792–1797.
25. Berman HM; Westbrook J; Feng Z; Gilliland G; Bhat TN; Weissig H; Shindyalov IN; Bourne PE The protein data bank. Nucleic acids research 2000, 28 (1), 235–242.
26. Jumper J; Evans R; Pritzel A; Green T; Figurnov M; Ronneberger O; Tunyasuvunakool K; Bates R; Žídek A; Potapenko A Highly accurate protein structure prediction with AlphaFold. nature 2021, 596 (7873), 583–589.
27. Varadi M; Bertoni D; Magana P; Paramval U; Pidruchna I; Radhakrishnan M; Tsenkov M; Nair S; Mirdita M; Yeo J AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic acids research 2024, 52 (D1), D368–D375.
28. Joosten RP; Te Beek TA; Krieger E; Hekkelman ML; Hooft RW; Schneider R; Sander C; Vriend G A series of PDB related databases for everyday needs. Nucleic acids research 2010, 39 (suppl_1), D411–D419.
29. Savojardo C; Manfredi M; Martelli PL; Casadio R Solvent accessibility of residues undergoing pathogenic variations in humans: from protein structures to protein sequences. Frontiers in molecular biosciences 2021, 7, 626363.
30. Krokengen OC; Raasakka A; Kursula P The intrinsically disordered protein glue of the myelin major dense line: Linking AlphaFold2 predictions to experimental data. Biochemistry and biophysics reports 2023, 34, 101474.
31. Antao A; Burton JD; Dawson D; Gemmill J; Gerstener Z; Godfrey B; Groel S; Jordan Z; Ligon B; Smith D Modernizing Clemson University’s Palmetto Cluster: Lessons Learned from 17 Years of HPC Administration. In Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 2024; pp 1–9.
32. Vacic V; Markwick PR; Oldfield CJ; Zhao X; Haynes C; Uversky VN; Iakoucheva LM Disease-associated mutations disrupt functionally important regions of intrinsic protein disorder. 2012.
33. Estrella V; Chen T; Lloyd M; Wojtkowiak J; Cornnell HH; Ibrahim-Hashim A; Bailey K; Balagurunathan Y; Rothberg JM; Sloane BF Acidity generated by the tumor microenvironment drives local invasion. Cancer research 2013, 73 (5), 1524–1535.
34. Kato Y; Lambert CA; Colige AC; Mineur P; Noël A; Frankenne F; Foidart J-M; Baba M; Hata R-I; Miyazaki K Acidic extracellular pH induces matrix metalloproteinase-9 expression in mouse metastatic melanoma cells through the phospholipase D-mitogen-activated protein kinase signaling. Journal of Biological Chemistry 2005, 280 (12), 10938–10944.
35. Musselman CA; Lalonde M-E; Côté J; Kutateladze TG Perceiving the epigenetic landscape through histone readers. Nature structural & molecular biology 2012, 19 (12), 1218–1227.
36. Honer MA; Ferman BI; Gray ZH; Bondarenko EA; Whetstine JR Epigenetic modulators provide a path to understanding disease and therapeutic opportunity. Genes & Development 2024, 38 (11–12), 473–503.
37. Stoffel F; Papp M; Küffner AM; Benítez-Mateos A; Jacquat RP; Gil-Garcia M; Galvanetto N; Faltova L; Arosio P Enhancement of Enzymatic Activity by Biomolecular Condensates through pH Buffering. BioRxiv 2024, 2024.2010. 2008.617196.
38. Lee DS; Strom AR; Brangwynne CP The mechanobiology of nuclear phase separation. APL bioengineering 2022, 6 (2).
39. Kisor KP; Ruiz DG; Jacobson MP; Barber DL A role for pH dynamics regulating transcription factor DNA-binding selectivity. Nucleic Acids Research 2025, 53 (10), gkaf474.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

supplementary material

NIHMS2134374-supplement-supplementary_material.docx^{(326.3KB, docx)}

Data Availability Statement

The curated datasets are available for download from the Computational Biophysics and Bioinformatics Lab webpage (http://compbio.clemson.edu/lab/downloads/).

[R1] 1.Zhang Z; Miteva MA; Wang L; Alexov E Analyzing effects of naturally occurring missense mutations. Computational and mathematical methods in medicine 2012, 2012 (1), 805827. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] 2.Petukh M; Kucukkal TG; Alexov E On human disease-causing amino acid variants: Statistical study of sequence and structural patterns. Human mutation 2015, 36 (5), 524–534. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Pandey P; Alexov E Most monogenic disorders are caused by mutations altering protein folding free energy. International Journal of Molecular Sciences 2024, 25 (4), 1963. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] 4.Alexov E Numerical calculations of the pH of maximal protein stability: The effect of the sequence composition and three-dimensional structure. European Journal of Biochemistry 2004, 271 (1), 173–185. [DOI] [PubMed] [Google Scholar]

[R5] 5.Talley K; Alexov E On the pH-optimum of activity and stability of proteins. Proteins: Structure, Function, and Bioinformatics 2010, 78 (12), 2699–2706. [Google Scholar]

[R6] 6.Mitra RC; Zhang Z; Alexov E In silico modeling of pH-optimum of protein–protein binding. Proteins: Structure, Function, and Bioinformatics 2011, 79 (3), 925–936. [Google Scholar]

[R7] 7.Garcia-Moreno B Adaptations of proteins to cellular and subcellular pH. Journal of biology 2009, 8 (11), 98. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] 8.Koirala M; Shashikala HM; Jeffries J; Wu B; Loftus SK; Zippin JH; Alexov E Computational investigation of the pH dependence of stability of melanosome proteins: implication for melanosome formation and disease. International journal of molecular sciences 2021, 22 (15), 8273. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Chan P; Lovrić J; Warwicker J Subcellular pH and predicted pH-dependent features of proteins. Proteomics 2006, 6 (12), 3494–3501. [DOI] [PubMed] [Google Scholar]

[R10] 10.Webb BA; Chimenti M; Jacobson MP; Barber DL Dysregulated pH: a perfect storm for cancer progression. Nature Reviews Cancer 2011, 11 (9), 671–677. [DOI] [PubMed] [Google Scholar]

[R11] 11.Spassov VZ; Yan L pH-selective mutagenesis of protein–protein interfaces: In silico design of therapeutic antibodies with prolonged half-life. Proteins: Structure, Function, and Bioinformatics 2013, 81 (4), 704–714. [Google Scholar]

[R12] 12.Wei W; Sulea T Sequence-based engineering of pH-sensitive antibodies for tumor targeting or endosomal recycling applications. In MAbs, 2024; Taylor & Francis: Vol. 16, p 2404064. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Isom DG; Castañeda CA; Cannon BR; García-Moreno EB Large shifts in pKa values of lysine residues buried inside a protein. Proceedings of the National Academy of Sciences 2011, 108 (13), 5260–5265. [Google Scholar]

[R14] 14.Yue P; Li Z; Moult J Loss of protein structure stability as a major causative factor in monogenic disease. Journal of molecular biology 2005, 353 (2), 459–473. [DOI] [PubMed] [Google Scholar]

[R15] 15.White KA; Ruiz DG; Szpiech ZA; Strauli NB; Hernandez RD; Jacobson MP; Barber DL Cancer-associated arginine-to-histidine mutations confer a gain in pH sensing to mutant proteins. Science signaling 2017, 10 (495), eaam9931. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Landrum MJ; Lee JM; Benson M; Brown GR; Chao C; Chitipiralla S; Gu B; Hart J; Hoffman D; Jang W ClinVar: improving access to variant interpretations and supporting evidence. Nucleic acids research 2018, 46 (D1), D1062–D1067. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] 17.Consortium TU UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Research 2024, 53 (D1), D609–D617. DOI: 10.1093/nar/gkae1010 (accessed 8/23/2025). [DOI] [Google Scholar]

[R18] 18.Ashburner M; Ball CA; Blake JA; Botstein D; Butler H; Cherry JM; Davis AP; Dolinski K; Dwight SS; Eppig JT Gene ontology: tool for the unification of biology. Nature genetics 2000, 25 (1), 25–29. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] 19.Aleksander SA; Balhoff J; Carbon S; Cherry JM; Drabkin HJ; Ebert D; Feuermann M; Gaudet P; Harris NL The gene ontology knowledgebase in 2023. Genetics 2023, 224 (1), iyad031. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] 20.Jupp S; Burdett T; Leroy C; Parkinson HE A new Ontology Lookup Service at EMBL-EBI. SWAT4LS 2015, 2, 118–119. [Google Scholar]

[R21] 21.Chang A; Jeske L; Ulbrich S; Hofmann J; Koblitz J; Schomburg I; Neumann-Schaal M; Jahn D; Schomburg D BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic acids research 2021, 49 (D1), D498–D508. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Camacho C; Coulouris G; Avagyan V; Ma N; Papadopoulos J; Bealer K; Madden TL BLAST+: architecture and applications. BMC bioinformatics 2009, 10 (1), 421. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] 23.Sievers F; Wilm A; Dineen D; Gibson TJ; Karplus K; Li W; Lopez R; McWilliam H; Remmert M; Söding J Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular systems biology 2011, 7 (1), 539. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] 24.Edgar RC MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic acids research 2004, 32 (5), 1792–1797. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] 25.Berman HM; Westbrook J; Feng Z; Gilliland G; Bhat TN; Weissig H; Shindyalov IN; Bourne PE The protein data bank. Nucleic acids research 2000, 28 (1), 235–242. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] 26. Jumper J; Evans R; Pritzel A; Green T; Figurnov M; Ronneberger O; Tunyasuvunakool K; Bates R; Žídek A; Potapenko A Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596 (7873), 583–589.

[R27] 27. Varadi M; Bertoni D; Magana P; Paramval U; Pidruchna I; Radhakrishnan M; Tsenkov M; Nair S; Mirdita M; Yeo J AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Research 2024, 52 (D1), D368–D375.

[R28] 28. Joosten RP; Te Beek TA; Krieger E; Hekkelman ML; Hooft RW; Schneider R; Sander C; Vriend G A series of PDB related databases for everyday needs. Nucleic Acids Research 2010, 39 (suppl_1), D411–D419.

[R29] 29. Savojardo C; Manfredi M; Martelli PL; Casadio R Solvent accessibility of residues undergoing pathogenic variations in humans: from protein structures to protein sequences. Frontiers in Molecular Biosciences 2021, 7, 626363.

[R30] 30. Krokengen OC; Raasakka A; Kursula P The intrinsically disordered protein glue of the myelin major dense line: Linking AlphaFold2 predictions to experimental data. Biochemistry and Biophysics Reports 2023, 34, 101474.

[R31] 31. Antao A; Burton JD; Dawson D; Gemmill J; Gerstener Z; Godfrey B; Groel S; Jordan Z; Ligon B; Smith D Modernizing Clemson University’s Palmetto Cluster: Lessons Learned from 17 Years of HPC Administration. In Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, 2024; pp 1–9.

[R32] 32. Vacic V; Markwick PR; Oldfield CJ; Zhao X; Haynes C; Uversky VN; Iakoucheva LM Disease-associated mutations disrupt functionally important regions of intrinsic protein disorder. 2012.

[R33] 33. Estrella V; Chen T; Lloyd M; Wojtkowiak J; Cornnell HH; Ibrahim-Hashim A; Bailey K; Balagurunathan Y; Rothberg JM; Sloane BF Acidity generated by the tumor microenvironment drives local invasion. Cancer Research 2013, 73 (5), 1524–1535.

[R34] 34. Kato Y; Lambert CA; Colige AC; Mineur P; Noël A; Frankenne F; Foidart J-M; Baba M; Hata R-I; Miyazaki K Acidic extracellular pH induces matrix metalloproteinase-9 expression in mouse metastatic melanoma cells through the phospholipase D-mitogen-activated protein kinase signaling. Journal of Biological Chemistry 2005, 280 (12), 10938–10944.

[R35] 35. Musselman CA; Lalonde M-E; Côté J; Kutateladze TG Perceiving the epigenetic landscape through histone readers. Nature Structural & Molecular Biology 2012, 19 (12), 1218–1227.

[R36] 36. Honer MA; Ferman BI; Gray ZH; Bondarenko EA; Whetstine JR Epigenetic modulators provide a path to understanding disease and therapeutic opportunity. Genes & Development 2024, 38 (11–12), 473–503.

[R37] 37. Stoffel F; Papp M; Küffner AM; Benítez-Mateos A; Jacquat RP; Gil-Garcia M; Galvanetto N; Faltova L; Arosio P Enhancement of Enzymatic Activity by Biomolecular Condensates through pH Buffering. bioRxiv 2024, 2024.10.08.617196.

[R38] 38. Lee DS; Strom AR; Brangwynne CP The mechanobiology of nuclear phase separation. APL Bioengineering 2022, 6 (2).

[R39] 39. Kisor KP; Ruiz DG; Jacobson MP; Barber DL A role for pH dynamics regulating transcription factor DNA-binding selectivity. Nucleic Acids Research 2025, 53 (10), gkaf474.
