Gauge fixing for sequence-function relationships

Anna Posfai; Juannan Zhou; David M McCandlish; Justin B Kinney

doi:10.1101/2024.05.12.593772

This is a preprint.

It has not yet been peer reviewed by a journal.


[Preprint]. 2024 Jun 24:2024.05.12.593772. Originally published 2024 May 13. [Version 2] doi: 10.1101/2024.05.12.593772


PMCID: PMC11118547 PMID: 38798671

Abstract

Quantitative models of sequence-function relationships are ubiquitous in computational biology, e.g., for modeling the DNA binding of transcription factors or the fitness landscapes of proteins. Interpreting these models, however, is complicated by the fact that the values of model parameters can often be changed without affecting model predictions. Before the values of model parameters can be meaningfully interpreted, one must remove these degrees of freedom (called “gauge freedoms” in physics) by imposing additional constraints (a process called “fixing the gauge”). However, strategies for fixing the gauge of sequence-function relationships have received little attention. Here we derive an analytically tractable family of gauges for a large class of sequence-function relationships. These gauges are derived in the context of models with all-order interactions, but an important subset of these gauges can be applied to diverse types of models, including additive models, pairwise-interaction models, and models with higher-order interactions. Many commonly used gauges are special cases of gauges within this family. We demonstrate the utility of this family of gauges by showing how different choices of gauge can be used both to explore complex activity landscapes and to reveal simplified models that are approximately correct within localized regions of sequence space. The results provide practical gauge-fixing strategies and demonstrate the utility of gauge-fixing for model exploration and interpretation.

Keywords: regression, non-identifiability, model interpretability, epistasis, sequence space

Introduction

One of the central challenges of biology is to understand how functionally relevant information is encoded within the sequences of DNA, RNA, and proteins. Unlike the genetic code, most sequence-function relationships are quantitative in nature, and understanding them requires finding mathematical functions that, upon being fed unannotated sequences, return values that quantify sequence activity (1). Multiplex assays of variant effects (MAVEs), functional genomics methods, and other high-throughput techniques are rapidly increasing the ease with which sequence-function relationships can be experimentally studied. And while quantitative modeling efforts based on these high-throughput data are becoming increasingly successful, in that they yield models with ever-increasing predictive ability, major open questions remain about how to interpret both the parameters (2–12) and the predictions (13–17) of the resulting models. One major open question is how to deal with the presence of gauge freedoms.

Gauge freedoms are directions in parameter space along which changes in model parameters have no effect on model predictions (18). Not only can the values of model parameters along gauge freedoms not be determined from data, differences in parameters along gauge freedoms have no biological meaning even in principle. Many commonly used models of sequence-function relationships exhibit numerous gauge freedoms (19–35), and interpreting the parameters of these models requires imposing additional constraints on parameter values, a process called “fixing the gauge”.

The gauge freedoms of sequence-function relationships are currently most completely understood in the context of additive models [commonly used to describe transcription factor binding to DNA (19, 22, 35)] and pairwise-interaction models [commonly used to describe proteins (20, 21, 23–34)]. Recently, some gauge-fixing strategies have been described for all-order interaction models, again in the context of protein sequence-function relationships (30, 31, 34). However, a unified gauge-fixing strategy applicable to diverse models of sequence-function relationships has yet to be developed.

Here we provide a general treatment of the gauge fixing problem for sequence-function relationships, focusing on the important case where the set of gauge-fixed parameters forms a vector space, thus ensuring that differences between vectors of gauge-fixed parameter values are directly interpretable. We first demonstrate the relationship between these linear gauges and $L_{2}$ regularization on parameter vectors, and then derive a mathematically tractable family of gauges for the all-order interaction model. Importantly, a subset of these gauges, the “hierarchical gauges,” can be applied to diverse lower-order models (including additive models, pairwise-interaction models, and higher-order interaction models) and includes as special cases two types of gauges that are commonly used in practice [“zero-sum gauges” (23, 28) and “wild-type gauges” (9, 23, 33)]. We then illustrate the properties of this family of gauges by analyzing two example sequence-function relationships: a simulated all-order interaction landscape on short binary sequences, and an empirical pairwise-interaction landscape for the B1 domain of protein G (GB1). The GB1 analysis, in particular, shows how different hierarchical gauges can be used to explore, simplify, and interpret complex functional landscapes. A companion paper (36) further explores the mathematical origins of gauge freedoms in models of sequence-function relationships, and shows how gauge freedoms arise as a consequence of the symmetries of sequence space.

Results

Preliminaries and background.

In this section we review how gauge freedoms arise in commonly used models of sequence-function relationships, as well as strategies commonly used to fix the gauge. In doing so, we establish notation and concepts that are used in subsequent sections, as well as in our companion paper (36).

Linear models.

We define quantitative models of sequence-function relationships as follows. Let $𝓐$ denote an alphabet comprising $α$ distinct characters (written $c_{1}, \dots, c_{α}$ ), let $𝓢$ denote the set of sequences of length $L$ built from these characters, and let $N = α^{L}$ denote the number of sequences in $𝓢$ . A quantitative model of a sequence-function relationship (henceforth “model”) is a function $f (s; \vec{θ})$ that maps each sequence $s$ in $𝓢$ to a real number. The vector $\vec{θ}$ represents the parameters on which this function depends and is assumed to comprise $M$ real numbers. $s_{l}$ denotes the character at position $l$ of sequence $s$ . We use $l$ , $l^{'}$ , etc. to index positions (ranging from 1 to $L$ ) in a sequence and $c$ , $c^{'}$ , etc. to index characters in $𝓐$ .

A linear model is a model that is a linear function of $\vec{θ}$ . Linear models have the form

f (s; \vec{θ}) = \vec{θ} \cdot \vec{x} (s) = \sum_{i = 1}^{M} θ_{i} x_{i} (s),

[1]

where $\vec{x} (\cdot)$ is a vector of $M$ distinct sequence features and each sequence feature $x_{i} (\cdot)$ is a function that maps sequences to the real numbers. We refer to the space $ℝ^{M}$ in which $\vec{x} (\cdot)$ lives as feature space, and the specific vector $\vec{x} (s)$ as the embedding of sequence $s$ in feature space. We use $S$ to denote the vector space spanned by the set of embeddings $\vec{x} (s)$ for all sequences $s$ in $𝓢$ .

One-hot models.

One-hot models are linear models based on sequence features that indicate the presence or absence of specific characters at specific positions within a sequence (1). Such models play a central role in scientific reasoning concerning sequence-function relationships because their parameters can be interpreted as quantitative contributions to the measured function due to the presence of specific biochemical entities (e.g. nucleotides or amino acids) in specific positions in the sequence. These one-hot models include additive models, pairwise-interaction models, all-order interaction models, and more. Additive models have the form

f_{add} (s) = θ_{0} x_{0} (s) + \sum_{l} \sum_{c} θ_{l}^{c} x_{l}^{c} (s),

[2]

where $x_{0} (s)$ is the constant feature (equal to one for every sequence $s$ ) and $x_{l}^{c} (s)$ is an additive feature (equal to one if sequence $s$ has character $c$ at position $l$ and equal to zero otherwise). Pairwise interaction models have the form

f_{pair} (s) = θ_{0} x_{0} (s) + \sum_{l} \sum_{c} θ_{l}^{c} x_{l}^{c} (s) + \sum_{l < l^{'}} \sum_{c, c^{'}} θ_{l l^{'}}^{c c^{'}} x_{l l^{'}}^{c c^{'}} (s),

[3]

where $x_{l l^{'}}^{c c^{'}} (s)$ is a pairwise feature (equal to one if $s$ has character $c$ at position $l$ and character $c^{'}$ at position $l^{'}$ , and equal to zero otherwise). All-order interaction models include interactions of all orders, and are written

f_{all} (s) = \sum_{K = 0}^{L} \sum_{l_{1} < \dots < l_{K}} \sum_{c_{1}, \dots, c_{K}} θ_{l_{1} \dots l_{K}}^{c_{1} \dots c_{K}} x_{l_{1} \dots l_{K}}^{c_{1} \dots c_{K}} (s),

[4]

where $x_{l_{1} l_{2} \dots l_{K}}^{c_{1} c_{2} \dots c_{K}} (s)$ is a $K$ -order feature (equal to one if $s$ has character $c_{k}$ at position $l_{k}$ for all $k$ , and equal to zero otherwise; $K = 0$ corresponds to the constant feature).

Gauge freedoms.

Gauge freedoms are transformations of model parameters that leave all model predictions unchanged. The gauge freedoms of a general sequence-function relationship $f (\cdot, \cdot)$ are vectors $\vec{g}$ in $ℝ^{M}$ that satisfy

f (s; \vec{θ}) = f (s; \vec{θ} + \vec{g}) for all s \in 𝓢 .

[5]

For linear models, gauge freedoms $\vec{g}$ satisfy

X \vec{g} = \vec{0},

[6]

where $X$ is the $N \times M$ design matrix having rows $\vec{x} (s)$ for $s \in 𝓢$ . In linear models, gauge freedoms thus arise when sequence features (i.e., the columns of $X$ ) are not linearly independent. In such cases, the space $S$ spanned by sequence embeddings is a proper subspace of $ℝ^{M}$ , as is the space $G$ of gauge freedoms, and $G$ is the orthogonal complement of $S$ .

Each linear relation between multiple columns of $X$ yields a gauge freedom. For example, additive models have $L$ gauge freedoms arising from the $L$ linear relations,

x_{0} (s) = \sum_{c} x_{l}^{c} (s),

[7]

for all positions $l$ . Pairwise models have $L$ gauge freedoms arising from the $L$ additive model linear relations in Eq. (7), and $\binom{L}{2} (2 α - 1)$ additional gauge freedoms arising from the linear relations

x_{l}^{c} (s) = \sum_{c^{'}} x_{l l^{'}}^{c c^{'}} (s) and x_{l^{'}}^{c^{'}} (s) = \sum_{c} x_{l l^{'}}^{c c^{'}} (s)

[8]

for all characters $c$ , $c^{'}$ and all positions $l$ and $l^{'}$ , with $l < l^{'}$ (see SI Sec. 2 for details). More generally, the gauge freedoms of one-hot models arise from the fact that summing any $K$ -order feature $x_{l_{1} \dots l_{K}}^{c_{1} \dots c_{K}}$ over all characters $c_{k}$ at any chosen position $l_{k}$ yields a feature of order $K - 1$ .
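The counts of gauge freedoms quoted above can be checked numerically as the dimension of the null space of the design matrix $X$ in Eq. (6). The following is a minimal sketch, assuming NumPy; the function names `one_hot_design_matrix` and `num_gauge_freedoms` and the values of $L$ and $α$ are illustrative choices, not taken from the paper.

```python
import itertools
import numpy as np

def one_hot_design_matrix(L, alpha, max_order):
    """Design matrix X: one row per sequence, one column per one-hot feature
    of order 0..max_order (order 0 is the constant feature)."""
    seqs = list(itertools.product(range(alpha), repeat=L))
    columns = []
    for K in range(max_order + 1):
        for positions in itertools.combinations(range(L), K):
            for chars in itertools.product(range(alpha), repeat=K):
                # Feature equals 1 if the sequence has the given characters
                # at the given positions, and 0 otherwise.
                columns.append([
                    float(all(s[l] == c for l, c in zip(positions, chars)))
                    for s in seqs
                ])
    return np.array(columns).T  # shape (N, M)

def num_gauge_freedoms(X):
    """Dimension of the null space of X, i.e. of the space G in Eq. (6)."""
    return X.shape[1] - np.linalg.matrix_rank(X)

L, alpha = 4, 3  # illustrative sizes
n_additive = num_gauge_freedoms(one_hot_design_matrix(L, alpha, max_order=1))
n_pairwise = num_gauge_freedoms(one_hot_design_matrix(L, alpha, max_order=2))
print(n_additive, n_pairwise)
```

For these sizes the two counts should agree with $L$ and $L + \binom{L}{2} (2 α - 1)$ respectively.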

Parameter values depend on choice of gauge.

Gauge freedoms pose problems for the interpretation of model parameters because, when they are present, different choices of model parameters can give exactly the same predictions. Thus, unless constraints are placed on the values of allowable parameters, individual parameters will have little biological meaning when viewed in isolation. To interpret model parameters, one therefore needs to adopt constraints that eliminate gauge freedoms and, as a result, make the values of model parameters unique. These constraints are called the “gauge” in which parameters are expressed, and this process of choosing constraints is called “fixing the gauge”. There are many different gauge-fixing strategies. For example, Fig. 1 shows an additive model of the DNA binding energy of CRP [an important transcription factor in Escherichia coli (37)] expressed in three different choices of gauge.

Fig. 1A shows parameters expressed in the “zero-sum gauge” (23, 28) [also called the “Ising gauge” (28), or the “hierarchical gauge” (9)]. In the zero-sum gauge, the constant parameter is the mean sequence activity and the additive parameters quantify deviations from this mean activity. The name of the gauge comes from the fact that the additive parameters at each position sum to zero. The zero-sum gauge is commonly used in additive models of protein-DNA binding (35, 38–43). As we will see, zero-sum gauges are readily defined for models with pairwise and higher-order interactions as well.

Fig. 1B shows parameters expressed in the “wild-type gauge” (9, 23, 33) [also called the “lattice-gas gauge” (28), or the “mis-match gauge” (35)]. In the wild-type gauge, the constant parameter is equal to the activity of a chosen wild-type sequence (denoted $s^{wt}$ ), and additive parameters are the changes in activity that result from mutations away from the wild-type sequence. The wild-type gauge is commonly used to visualize the results of mutational scanning experiments on proteins (44–48) or on long DNA regulatory sequences (49–54). As we will see, wild-type gauges are also readily defined for models with pairwise and higher-order interactions.

Fig. 1C shows parameters expressed in what we call the “maximum gauge”. In the maximum gauge, the constant parameter is equal to the activity of the highest-activity sequence, and additive parameters are the changes in activity that result from mutations away from the highest-activity sequence. The maximum gauge is less common in the literature than the zero-sum and the wild-type gauge, but has been used in multiple publications (55, 56).

Gauge spaces.

We now turn our attention to strategies for fixing the gauge. For every parameter vector $\vec{θ}$ in $ℝ^{M}$ , there is a corresponding “gauge orbit” defined by the set of vectors that can be obtained from $\vec{θ}$ by adding a vector $\vec{g}$ in the space of gauge freedoms $G$ . We remove the gauge freedoms of a model (a process called “fixing the gauge”) by restricting valid parameter vectors to a specified “gauge space” $Θ$ , a subset of $ℝ^{M}$ that intersects the gauge orbit of each possible parameter vector $\vec{θ}$ at exactly one point. That one point, denoted by ${\vec{θ}}_{fixed}$ , is called the “gauge-fixed” value of $\vec{θ}$ .

For any model of a sequence-function relationship with gauge freedoms, there are an infinite number of possible choices for the gauge space $Θ$ . Fig. 2 illustrates the three gauge spaces corresponding to the three different gauges (zero-sum, wild-type, and maximum) used in Fig. 1. In the zero-sum gauge (Fig. 2A), the $α$ additive parameters at each position are restricted to a linear subspace of dimension $α - 1$ in which the sum of the parameters is zero. In the wild-type gauge (Fig. 2B), the additive parameters at each position are restricted to a linear subspace in which the parameters that contribute to the activity of the wild-type sequence are zero. In the maximum gauge (Fig. 2C), the additive parameters at each position are restricted to a nonlinear subspace in which all parameters are less than or equal to zero and, at every point in the subspace, at least one parameter is equal to zero.
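For an additive model, the three gauges described above can each be reached by shifting the additive parameters at every position by a constant and absorbing the total shift into the constant parameter. The sketch below illustrates this, assuming NumPy and randomly generated parameters; the function names, the array layout `theta[l, c]`, and the wild-type sequence `wt` are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
L, alpha = 4, 4
theta0 = rng.normal()                 # constant parameter
theta = rng.normal(size=(L, alpha))   # additive parameters theta[l, c]

def predict(theta0, theta, seq):
    """Additive model of Eq. (2) for a sequence given as integer characters."""
    return theta0 + sum(theta[l, c] for l, c in enumerate(seq))

def shift_gauge(theta0, theta, shift):
    """Subtract shift[l] from every parameter at position l and add the
    total shift to the constant parameter; predictions are unchanged."""
    return theta0 + shift.sum(), theta - shift[:, None]

def zero_sum(theta0, theta):
    return shift_gauge(theta0, theta, theta.mean(axis=1))

def wild_type(theta0, theta, wt):
    return shift_gauge(theta0, theta, theta[np.arange(len(wt)), wt])

def maximum(theta0, theta):
    return shift_gauge(theta0, theta, theta.max(axis=1))

wt = np.array([0, 3, 1, 2])  # hypothetical wild-type sequence
gauges = {
    "zero-sum": zero_sum(theta0, theta),
    "wild-type": wild_type(theta0, theta, wt),
    "maximum": maximum(theta0, theta),
}
```

In each case the shift moves the parameters along the gauge freedoms arising from Eq. (7), so the gauge-fixed parameters describe the same landscape as the original ones.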

Linear gauges.

Here and throughout the rest of this paper we focus on linear gauges, i.e., choices of $Θ$ that are linear subspaces of feature space (as in Fig. 2A,B). Linear gauges are the most mathematically tractable family of gauges. Linear gauges also have the attractive property that the difference between any two parameter vectors in $Θ$ is also in $Θ$ . This property makes the comparison of models within the same gauge straightforward.

Parameters can be fixed to any chosen linear gauge via a corresponding linear projection. Formally, for any linear gauge $Θ$ there exists an $M \times M$ projection matrix $P$ that projects any vector ${\vec{θ}}_{init}$ along the space of gauge freedoms $G$ to an equivalent vector ${\vec{θ}}_{fixed}$ that lies in $Θ$ , i.e.

{\vec{θ}}_{fixed} = P {\vec{θ}}_{init} .

[9]

See SI Sec. 3 for a proof. We emphasize that $P$ depends on the choice of $Θ$ , and that $P$ is an orthogonal projection only for the specific choice $Θ = S$ .

Parameters can also be gauge-fixed through a process of constrained optimization. Let $Λ$ be any positive-definite $M \times M$ matrix, and let $\vec{y} = X {\vec{θ}}_{init}$ be the $N$ -dimensional vector of model predictions on all sequences. Then $Λ$ specifies a unique gauge-fixed set of parameters that preserves $\vec{y}$ via

{\vec{θ}}_{fixed} = \underset{\vec{θ} : X \vec{θ} = \vec{y}}{argmin} {‖ \vec{θ} ‖}_{Λ}^{2}, where {‖ \vec{θ} ‖}_{Λ}^{2} = {\vec{θ}}^{⊤} Λ \vec{θ} .

[10]

The resulting gauge space comprises the set of vectors that minimize the $Λ$ -norm in each gauge orbit. The corresponding projection matrix is

P = Λ^{- 1 / 2} {(X Λ^{- 1 / 2})}^{+} X,

[11]

where ‘+’ indicates the Moore-Penrose pseudoinverse. See SI Sec. 3 for a proof. In what follows, the connection between the penalization matrix $Λ$ and the projection matrix $P$ will be used to help interpret the constraints imposed by the gauge space $Θ$ .
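Eq. (11) can be evaluated directly with standard linear algebra routines. The following minimal sketch, assuming NumPy, does so for a single-position additive model with three characters and a randomly chosen positive-definite $Λ$ ; the function name `gauge_projection` and all numerical values are illustrative assumptions.

```python
import numpy as np

def gauge_projection(X, Lam):
    """P = Lam^{-1/2} (X Lam^{-1/2})^+ X for symmetric positive-definite Lam."""
    w, V = np.linalg.eigh(Lam)                # Lam = V diag(w) V^T
    Lam_inv_sqrt = (V / np.sqrt(w)) @ V.T     # V diag(w^{-1/2}) V^T
    return Lam_inv_sqrt @ np.linalg.pinv(X @ Lam_inv_sqrt) @ X

# Single position, alpha = 3: features are (x_0, x^{c1}, x^{c2}, x^{c3}).
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 1.0]])
g = np.array([1.0, -1.0, -1.0, -1.0])         # the one gauge freedom, X g = 0

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Lam = A @ A.T + np.eye(4)                     # arbitrary positive-definite matrix
theta_init = rng.normal(size=4)

P = gauge_projection(X, Lam)
theta_fixed = P @ theta_init
```

Here `theta_fixed` gives the same predictions as `theta_init`, and moving away from it along `g` cannot decrease the $Λ$ -norm, which is equivalent to the stationarity condition $\vec{g}^{⊤} Λ {\vec{θ}}_{fixed} = 0$ .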

One consequence of Eq. (10) is that parameter inference carried out using a positive-definite $L_{2}$ regularizer $Λ$ on model parameters will result in gauge-fixed model parameters in the specific linear gauge determined by $Λ$ (see SI Sec. 3). While it might then seem that $L_{2}$ regularizing parameter values during inference solves the gauge fixing problem, it is important to understand that regularizing during model inference will also change model predictions, whereas gauge-fixing proper influences only the model parameters while keeping the model predictions fixed. In addition, we show in SI Sec. 3 that, for any desired positive-definite regularizer on model predictions and choice of linear gauge $Θ$ , we can construct a positive-definite penalization matrix for model parameters $Λ$ that imposes the desired regularization on model predictions and yields inferred parameters in the desired gauge. Thus while $L_{2}$ regularization during parameter inference can simultaneously fix the gauge and regularize model predictions, the regularization imposed on model predictions does not constrain the choice of gauge.
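The distinction between regularization and gauge fixing can be illustrated with ridge regression on a small additive model. The sketch below, assuming NumPy, uses a length-2 binary additive model with made-up activity values `y` and an arbitrary regularization strength `gamma`; none of these numbers come from the paper.

```python
import numpy as np

# Additive model on binary sequences of length 2.
# Columns: x_0, x_1^0, x_1^1, x_2^0, x_2^1. Rows: sequences 00, 01, 10, 11.
X = np.array([[1.0, 1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0, 1.0]])
y = np.array([1.0, 2.0, 0.5, 3.0])   # made-up activities
gamma = 1.0                          # arbitrary regularization strength

# Ridge regression: minimize |y - X theta|^2 + gamma |theta|^2.
theta_ridge = np.linalg.solve(X.T @ X + gamma * np.eye(5), X.T @ y)

# Unregularized least squares (pinv returns the minimum-norm solution).
theta_ls = np.linalg.pinv(X) @ y

# Basis for the two gauge freedoms of this model, from Eq. (7).
gauge_vectors = np.array([[1.0, -1.0, -1.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0, -1.0, -1.0]])

# Components of the ridge solution along the gauge freedoms.
gauge_components = gauge_vectors @ theta_ridge
```

The ridge solution has no component along the gauge freedoms, so it lies in the gauge selected by the identity penalty, but its predictions `X @ theta_ridge` differ from the least-squares predictions `X @ theta_ls`.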

Unified approach to gauge fixing.

We now derive strategies for fixing the gauge of the all-order interaction model. We first introduce a geometric formulation of the all-order interaction model embedding. We then construct a parametric family of gauges for the all-order interaction model, and derive formulas for the corresponding projection and penalizing matrices. Next, we highlight specific gauges of interest in this parametric family. We focus in particular on the “hierarchical gauges,” which can be applied to a variety of commonly used models in addition to the all-order interaction model. The results provide explicit gauge-fixing formulae that can be applied to diverse quantitative models of sequence-function relationships.

All-order interaction models.

To aid in our discussion of the all-order interaction model [Eq. (4)], we define an augmented alphabet $𝓐^{'} = \{ *, c_{1}, \dots, c_{α} \}$ , where $c_{1}, \dots, c_{α}$ are the characters in $𝓐$ and * is a wild-card character that is interpreted as matching any character in $𝓐$ . Let $𝓢^{'}$ denote the set of sequences of length $L$ comprising characters from $𝓐^{'}$ . For each augmented sequence $s^{'} \in 𝓢^{'}$ , we define the sequence feature $x_{s^{'}} (s)$ to be 1 if a sequence $s$ matches the pattern described by $s^{'}$ and to be 0 otherwise. In this way, each augmented sequence $s^{'}$ serves as a regular expression against which bona fide sequences are compared.

Assigning one parameter $θ_{s^{'}}$ to each of the $M = {(α + 1)}^{L}$ augmented sequences $s^{'}$ , the all-order interaction model can be expressed compactly as

f_{all} (s; \vec{θ}) = \sum_{s^{'} \in 𝓢^{'}} θ_{s^{'}} x_{s^{'}} (s) .

[12]

In this notation, the constant parameter $θ_{0}$ is written $θ_{* \dots *}$ , each additive parameter $θ_{l}^{c}$ is written $θ_{* \dots c \dots *}$ , each pairwise-interaction parameter $θ_{l l^{'}}^{c c^{'}}$ is written $θ_{* \dots c \dots c^{'} \dots *}$ , and so on. (Here $c$ occurs at position $l$ , $c^{'}$ occurs at position $l^{'}$ , and … denotes a run of * characters). We thus see that augmented sequences provide a convenient way to index the features and parameters of the all-order interaction model.

Next we observe that $x_{s^{'}}$ can be expressed in a form that factorizes across positions. For each position $l$ , we define $x_{l}^{*} (s) = 1$ for all sequences $s$ and take $x_{l}^{c_{1}}, \dots, x_{l}^{c_{α}}$ to be the standard one-hot sequence features. $x_{s^{'}}$ can then be written in the factorized form,

x_{s^{'}} (s) = \prod_{l = 1}^{L} x_{l}^{s_{l}^{'}} (s) .

[13]

From this it is seen that the embedding for the all-order interaction model, ${\vec{x}}_{all} (s)$ , can be formulated geometrically as a tensor product:

{\vec{x}}_{all} (s) = \otimes_{l = 1}^{L} {\vec{x}}_{l}^{'} (s), where {\vec{x}}_{l}^{'} (s) = (\begin{matrix} x_{l}^{*} (s) \\ x_{l}^{c_{1}} (s) \\ ⋮ \\ x_{l}^{c_{α}} (s) \end{matrix}) .

[14]

See SI Sec. 4 for details.
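The tensor-product form of Eq. (14) maps directly onto a Kronecker product of per-position vectors. The following minimal sketch, assuming NumPy, builds the embedding this way and compares it with pattern matching against every augmented sequence; the DNA alphabet, the example sequence, and the function names are illustrative assumptions.

```python
import itertools
from functools import reduce
import numpy as np

alphabet = "ACGT"        # illustrative alphabet
aug = "*" + alphabet     # augmented alphabet, wild card first
L = 3

def position_embedding(ch):
    """Per-position vector of Eq. (14): wild-card feature, then one-hot."""
    v = np.zeros(len(aug))
    v[0] = 1.0                  # x_l^* equals 1 for every sequence
    v[aug.index(ch)] = 1.0      # one-hot feature for the observed character
    return v

def embed_all_order(seq):
    """All-order embedding as a Kronecker product over positions."""
    return reduce(np.kron, [position_embedding(ch) for ch in seq])

def matches(pattern, seq):
    """True if seq matches the augmented sequence, with '*' matching anything."""
    return all(p == "*" or p == c for p, c in zip(pattern, seq))

# Augmented sequences in the same order as the Kronecker-product entries
# (the last position varies fastest).
aug_seqs = ["".join(t) for t in itertools.product(aug, repeat=L)]

seq = "GAT"
x = embed_all_order(seq)
x_by_matching = np.array([float(matches(sp, seq)) for sp in aug_seqs])
```

Each sequence activates $2^{L}$ of the ${(α + 1)}^{L}$ features, since at every position it matches either the wild card or its own character.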

Parametric family of gauges.

We now define a useful parametric family of gauges for the all-order interaction model. Each gauge in this family is defined by two parameters, $λ$ and $p$ . $λ$ is a non-negative real number that governs how much higher-order versus lower-order sequence features are penalized [in the sense of Eq. (10)]. $p$ is a probability distribution on sequence space that governs how strongly the specific characters at each position are penalized. This distribution is assumed to have the form

p (s) = p_{1}^{s_{1}} p_{2}^{s_{2}} \dots p_{L}^{s_{L}},

[15]

where $p_{l}^{c}$ denotes the probability of character $c$ at position $l$ . As we show below, choosing appropriate values for $λ$ and $p$ recovers the most commonly used linear gauges, including the zero-sum gauge, the wild-type gauge, and more.

Gauges in the parametric family have analytically tractable projection matrices because they can be expressed as tensor products of single-position gauge spaces. Let $Θ_{l}^{λ, p}$ be the $α$ -dimensional subspace of $ℝ^{α + 1}$ defined by

Θ_{l}^{λ, p} = V_{λ} \oplus V_{⊥}^{p_{l}},

[16]

where $V_{λ}$ (a 1-dimensional subspace) and $V_{⊥}^{p_{l}}$ [an $(α - 1)$-dimensional subspace] are defined by

V_{λ} = \mathrm{span} \left\{ \begin{pmatrix} λ \\ 1 \\ ⋮ \\ 1 \end{pmatrix} \right\}, \quad V_{⊥}^{p_{l}} = \left\{ \begin{pmatrix} 0 \\ v_{c_{1}} \\ ⋮ \\ v_{c_{α}} \end{pmatrix} : \sum_{i = 1}^{α} p_{l}^{c_{i}} v_{c_{i}} = 0 \right\} .

[17]

The full parametric gauge, denoted by $Θ^{λ, p}$ , is defined to be the tensor product of these single-position gauges:

Θ^{λ, p} = \otimes_{l = 1}^{L} Θ_{l}^{λ, p} .

[18]

As detailed in SI Sec. 5, the corresponding projection matrix $P^{λ, p}$ is found to have elements given by

P_{s^{'} t^{'}}^{λ, p} = \prod_{\substack{l \text{ s.t.} \\ s_{l}^{'} \in 𝓐 \\ t_{l}^{'} \in 𝓐}} (δ_{s_{l}^{'} t_{l}^{'}} - p_{l}^{t_{l}^{'}} η) \times \prod_{\substack{l \text{ s.t.} \\ s_{l}^{'} = * \\ t_{l}^{'} \in 𝓐}} (p_{l}^{t_{l}^{'}} η) \times \prod_{\substack{l \text{ s.t.} \\ s_{l}^{'} \in 𝓐 \\ t_{l}^{'} = *}} (1 - η) \times \prod_{\substack{l \text{ s.t.} \\ s_{l}^{'} = * \\ t_{l}^{'} = *}} η,

[19]

where $η = λ / (1 + λ)$ and where the augmented sequences $s^{'}$ and $t^{'}$ index rows and columns. We thus obtain an explicit formula for the projection matrix needed to project any parameter vector into any gauge in the parametric family.

Gauges in the parametric family also have penalizing matrices of a simple diagonal form. Specifically, if $0 < λ < \infty$ and $p (s^{'}) > 0$ everywhere, Eq. (10) is satisfied by the penalization matrix $Λ$ having elements

Λ_{s^{'} t^{'}} = p (s^{'}) λ^{o (s^{'})} δ_{s^{'} t^{'}},

[20]

where $o (s^{'})$ denotes the order of interaction described by $s^{'}$ (i.e., the number of non-star characters in $s^{'}$ ) and $p (s^{'})$ is defined as in Eq. (15) but with $p_{l}^{s_{l}^{'}} = 1$ when $s_{l}^{'} = *$ . See SI Sec. 5 for a proof. Note that, although Eq. (20) does not hold when $λ = 0$ , $λ = \infty$ , or any $p_{l}^{c} = 0$ , one can interpret $Θ^{λ, p}$ [which is well-defined in Eq. (18) and Eq. (19)] as arising from Eq. (10) under a limiting series of penalizing matrices.
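Eq. (19) and Eq. (20) can be cross-checked numerically: because Eq. (19) is a product over positions, $P^{λ, p}$ is a Kronecker product of $(α + 1) \times (α + 1)$ single-position matrices, and it should coincide with the projection obtained from Eq. (11) using the diagonal $Λ$ of Eq. (20). The sketch below, assuming NumPy, performs this comparison for a small model; the function names, sizes, and random choice of $p$ are illustrative.

```python
import itertools
from functools import reduce
import numpy as np

def embed(s, alpha):
    """All-order embedding of Eq. (14); index 0 at each position is '*'."""
    factors = []
    for c in s:
        v = np.zeros(alpha + 1)
        v[0] = 1.0
        v[c + 1] = 1.0
        factors.append(v)
    return reduce(np.kron, factors)

def projection_eq19(p, lam):
    """Kronecker product of the single-position factors of Eq. (19)."""
    eta = lam / (1.0 + lam)
    factors = []
    for p_l in p:
        a = len(p_l)
        P_l = np.empty((a + 1, a + 1))        # rows s'_l, columns t'_l
        P_l[0, 0] = eta                       # s' = *, t' = *
        P_l[1:, 0] = 1.0 - eta                # s' in A, t' = *
        P_l[0, 1:] = eta * p_l                # s' = *, t' in A
        P_l[1:, 1:] = np.eye(a) - eta * p_l[None, :]   # both in A
        factors.append(P_l)
    return reduce(np.kron, factors)

def penalty_diag_eq20(p, lam):
    """Diagonal of Lambda in Eq. (20): p(s') * lam ** order(s')."""
    return reduce(np.kron, [np.concatenate(([1.0], lam * p_l)) for p_l in p])

rng = np.random.default_rng(0)
L, alpha, lam = 2, 3, 2.5
p = rng.dirichlet(np.ones(alpha), size=L)     # p[l, c], strictly positive
X = np.array([embed(s, alpha)
              for s in itertools.product(range(alpha), repeat=L)])

P19 = projection_eq19(p, lam)

# Eq. (11) with the diagonal Lambda of Eq. (20).
lam_inv_sqrt = 1.0 / np.sqrt(penalty_diag_eq20(p, lam))
P11 = lam_inv_sqrt[:, None] * (np.linalg.pinv(X * lam_inv_sqrt[None, :]) @ X)
```

Under these assumptions `P19` and `P11` should agree to numerical precision, and both should leave the predictions `X @ theta` unchanged for any parameter vector.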

Trivial gauge.

Choosing $λ = 0$ yields what we call the “trivial gauge”. In the trivial gauge, $θ_{s^{'}} = 0$ if $s^{'}$ contains one or more star characters (by Eq. (19)), and so the only nonzero parameters correspond to interactions of order $L$ . As a result,

f_{all} (s, \vec{θ}) = θ_{s}

[21]

for every sequence $s \in 𝓢$ . Note in particular that the trivial gauge is unaffected by $p$ . Thus, the trivial gauge essentially represents sequence-function relationships as catalogs of activity values, one value for every sequence. See SI Sec. 6 for details.

Euclidean gauge.

Choosing $λ = α$ and choosing $p$ to be the uniform distribution recovers what we call the “Euclidean gauge”. In the Euclidean gauge, the penalizing norm in Eq. (10) is the standard Euclidean norm, i.e.

{‖ \vec{θ} ‖}_{Λ}^{2} = \sum_{s^{'}} θ_{s^{'}}^{2} .

[22]

It is readily seen that the Euclidean gauge is orthogonal to the space of gauge freedoms $G$ and therefore equal to the embedding space $S$ . It is also readily seen that parameter inference using standard $L_{2}$ regularization (i.e. choosing $Λ$ to be a positive multiple of the identity matrix) will yield parameters in the Euclidean gauge. See SI Sec. 6 for details.
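The Euclidean gauge can be checked at the level of a single position, where the only gauge freedom is the vector $(1, - 1, \dots, - 1)$ . The following minimal sketch, assuming NumPy and an illustrative alphabet size, confirms that $λ = α$ with uniform $p$ gives a unit penalty on every parameter and a symmetric single-position projection equal to the orthogonal projection away from that gauge freedom.

```python
import numpy as np

alpha = 4                               # illustrative alphabet size
lam = float(alpha)
p_l = np.full(alpha, 1.0 / alpha)       # uniform distribution
eta = lam / (1.0 + lam)

# Per-position factor of the diagonal of Lambda in Eq. (20).
penalty = np.concatenate(([1.0], lam * p_l))

# Single-position factor of the projection matrix in Eq. (19).
P_l = np.empty((alpha + 1, alpha + 1))
P_l[0, 0] = eta
P_l[1:, 0] = 1.0 - eta
P_l[0, 1:] = eta * p_l
P_l[1:, 1:] = np.eye(alpha) - eta * p_l[None, :]

# Orthogonal projection onto the complement of the single-position gauge freedom.
g = np.concatenate(([1.0], -np.ones(alpha)))
P_orth = np.eye(alpha + 1) - np.outer(g, g) / (g @ g)
```

Since the full projection matrix is a Kronecker product of such factors, symmetry of each factor implies symmetry of the whole, consistent with the projection being orthogonal.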

Equitable gauge.

Choosing $λ = 1$ and letting $p$ vary recovers what we call the “equitable gauge”. In the equitable gauge, the penalizing norm is

{‖ \vec{θ} ‖}_{Λ}^{2} = \sum_{s^{'}} p (s^{'}) θ_{s^{'}}^{2} = \sum_{s^{'}} {〈 f_{s^{'}}^{2} 〉}_{p} = \sum_{s^{'}} {‖ f_{s^{'}} ‖}_{p}^{2},

[23]

where $f_{s^{'}} = θ_{s^{'}} x_{s^{'}}$ denotes the contribution to the activity landscape corresponding to the sequence feature $s^{'}$ , ${〈 \cdot 〉}_{p}$ denotes an average over sequences drawn from $p$ , and ${‖ f ‖}_{p}^{2} = \sum_{s \in 𝓢} p (s) f {(s)}^{2}$ is the squared norm of a function $f$ on sequence space with respect to $p$ . The equitable gauge thus penalizes each parameter $θ_{s^{'}}$ in proportion to the fraction of sequences that parameter applies to. Equivalently, the equitable gauge can be thought of as minimizing the sum of the squared norms of the landscape contributions ${‖ f_{s^{'}} ‖}_{p}^{2}$ rather than the squared norm of the parameter values themselves. Unlike the Euclidean gauge, the equitable gauge accounts for the fact that different model parameters can affect vastly different numbers of sequences and can thereby have vastly different impacts on the activity landscape. See SI Sec. 6 for details.

Hierarchical gauge.

Choosing $p$ freely and letting $λ \to \infty$ yields what we call the “hierarchical gauge”. When expressed in the hierarchical gauge, model parameters obey the marginalization property,

\sum_{c_{k}} p_{l_{k}}^{c_{k}} θ_{l_{1} \dots l_{K}}^{c_{1} \dots c_{K}} = 0 .

[24]

This marginalization property has important consequences that we now summarize. See SI Sec. 7 for proofs of these results.

A first consequence of Eq. (24) is that, when parameters are expressed in the hierarchical gauge, the mean activity among sequences matched by an augmented sequence $s^{'}$ can be expressed as a simple sum of parameters. For example,

{〈 f_{all} 〉}_{p} = θ_{0},

[25]

{〈 f_{all} | c at l 〉}_{p} = θ_{0} + θ_{l}^{c},

[26]

{〈 f_{all} | c at l, c^{'} at l^{'} 〉}_{p} = θ_{0} + θ_{l}^{c} + θ_{l^{'}}^{c^{'}} + θ_{l l^{'}}^{c c^{'}},

[27]

and so on. Consequently, the parameters themselves can also be expressed in terms of differences of these average values. For instance, $θ_{l}^{c} = {〈 f_{all} | c at l 〉}_{p} - {〈 f_{all} 〉}_{p}$ . Because $p$ factorizes by position, conditioning on having particular characters in a subset of positions is equivalent to the probability distribution produced by drawing sequences from $p$ and then fixing those positions in the drawn sequences to those specific characters. Thus, $θ_{l}^{c}$ can also be interpreted as the average effect of mutating position $l$ to character $c$ when sequences are drawn from $p$ . Similarly, $θ_{l l^{'}}^{c c^{'}}$ is the average effect of fixing positions $l$ to $c$ and $l^{'}$ to $c^{'}$ when drawing from $p$ beyond what would be expected based on the effects of changing $l$ to $c$ and $l^{'}$ to $c^{'}$ individually (i.e. epistasis), and higher-order coefficients have a similar interpretation. The hierarchical gauge thus provides an ANOVA-like decomposition of activity landscapes.
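The marginalization property of Eq. (24) and the averages in Eqs. (25) and (26) can be verified on a random all-order model. The sketch below, assuming NumPy, projects random parameters onto the hierarchical gauge using the $η = 1$ limit of Eq. (19); the sizes, random seed, and function names are illustrative assumptions.

```python
import itertools
from functools import reduce
import numpy as np

rng = np.random.default_rng(1)
L, alpha = 3, 3
p = rng.dirichlet(np.ones(alpha), size=L)     # factorized distribution p[l, c]

def embed(s):
    """All-order embedding of Eq. (14); index 0 at each position is '*'."""
    factors = []
    for c in s:
        v = np.zeros(alpha + 1)
        v[0] = 1.0
        v[c + 1] = 1.0
        factors.append(v)
    return reduce(np.kron, factors)

def hierarchical_factor(p_l):
    """Single-position factor of Eq. (19) in the limit eta = 1."""
    P_l = np.zeros((alpha + 1, alpha + 1))
    P_l[0, 0] = 1.0
    P_l[0, 1:] = p_l
    P_l[1:, 1:] = np.eye(alpha) - p_l[None, :]
    return P_l

seqs = list(itertools.product(range(alpha), repeat=L))
X = np.array([embed(s) for s in seqs])
P = reduce(np.kron, [hierarchical_factor(p_l) for p_l in p])

theta_init = rng.normal(size=(alpha + 1) ** L)
# Gauge-fixed parameters as a tensor with one axis per position.
theta = (P @ theta_init).reshape((alpha + 1,) * L)

# Landscape values and averages with respect to p.
F = (X @ theta_init).reshape((alpha,) * L)
prob = reduce(np.multiply.outer, p)           # p(s) as an L-axis array
mean_f = (prob * F).sum()
# <f | c at position 1>_p for each character c (axis 0 holds position 1).
cond_mean_pos1 = np.einsum("j,k,ijk->i", p[1], p[2], F)
```

In this layout `theta[0, 0, 0]` is the constant parameter and `theta[c + 1, 0, 0]` is the additive parameter for character $c$ at the first position, so they can be compared with `mean_f` and `cond_mean_pos1 - mean_f`.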

A second consequence of Eq. (24) is that the activity landscape, when expressed in the hierarchical gauge, naturally decomposes into mutually orthogonal components. Let $σ$ denote a set comprising all augmented sequences that have the same pattern of star and non-star positions, and let $f_{σ} = \sum_{s^{'} \in σ} θ_{s^{'}} x_{s^{'}}$ be the corresponding component of $f_{all}$ . These landscape components are $p$ -orthogonal when expressed in the hierarchical gauge:

{〈 f_{σ} f_{τ} 〉}_{p} = δ_{σ τ} \sum_{s^{'} \in σ} p (s^{'}) θ_{s^{'}}^{2},

[28]

where $σ$ and $τ$ represent any two such sets of augmented sequences. One implication of this orthogonality relation is that the variance of the landscape (with respect to $p$ ) is the sum of contributions from interactions of different orders:

{var}_{p} [f] = \sum_{k = 0}^{L} {var}_{p} [f_{k}],

[29]

where $f_{k}$ denotes the sum of $k$ -order terms that contribute to $f_{all}$ . Another implication is that the hierarchical gauge minimizes the variance attributable to different orders of interaction in a hierarchical manner: higher-order terms are prioritized for variance minimization over lower-order terms, and within a given order parameters are penalized in proportion to the fraction of sequences they apply to.
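The variance decomposition of Eq. (29) can be checked in the same way. The following minimal sketch, assuming NumPy, splits a random binary landscape expressed in the hierarchical gauge into its order-$k$ components and computes their variances under $p$ ; the sizes, random seed, and function names are illustrative.

```python
import itertools
from functools import reduce
import numpy as np

rng = np.random.default_rng(2)
L, alpha = 3, 2
p = rng.dirichlet(np.ones(alpha), size=L)     # factorized distribution p[l, c]

def embed(s):
    """All-order embedding of Eq. (14); index 0 at each position is '*'."""
    factors = []
    for c in s:
        v = np.zeros(alpha + 1)
        v[0] = 1.0
        v[c + 1] = 1.0
        factors.append(v)
    return reduce(np.kron, factors)

def hierarchical_factor(p_l):
    """Single-position factor of Eq. (19) in the limit eta = 1."""
    P_l = np.zeros((alpha + 1, alpha + 1))
    P_l[0, 0] = 1.0
    P_l[0, 1:] = p_l
    P_l[1:, 1:] = np.eye(alpha) - p_l[None, :]
    return P_l

seqs = list(itertools.product(range(alpha), repeat=L))
X = np.array([embed(s) for s in seqs])
prob = np.array([np.prod([p[l, c] for l, c in enumerate(s)]) for s in seqs])

P = reduce(np.kron, [hierarchical_factor(p_l) for p_l in p])
theta = P @ rng.normal(size=(alpha + 1) ** L)  # hierarchical-gauge parameters

# Interaction order of each parameter: number of non-star characters.
orders = np.array([sum(c != 0 for c in t)
                   for t in itertools.product(range(alpha + 1), repeat=L)])

def var_p(values):
    """Variance of a function on sequence space with respect to p."""
    mean = prob @ values
    return prob @ (values - mean) ** 2

f = X @ theta
f_by_order = [X @ np.where(orders == k, theta, 0.0) for k in range(L + 1)]
total_var = var_p(f)
var_by_order = [var_p(f_k) for f_k in f_by_order]
```

The order-0 component is constant and so contributes no variance; the remaining entries of `var_by_order` should sum to `total_var`.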

A third consequence of Eq. (24) is that hierarchical gauges preserve the form of a large class of one-hot models that are equivalent to all-order interaction models with certain parameters fixed at zero (specifically, these models satisfy the condition that if a parameter for a sequence feature is fixed at zero, all higher-order sequence features contained within that sequence feature also have their parameters fixed at zero). These models, which we call the “hierarchical models,” include all-order interaction models in which the parameters above a specified order are zero (e.g., additive models and pairwise-interaction models), but also include other models, such as nearest-neighbor interaction models. Projecting onto the hierarchical gauge (but not other parametric family gauges) is guaranteed to produce a parameter vector where the appropriate entries are still fixed to be zero.

Zero-sum gauge.

The zero-sum gauge (illustrated in Figs. 1A and 2A) is the hierarchical gauge for which $p$ is the uniform distribution. The name of this gauge comes from the fact that, when $p$ is uniform, Eq. (24) becomes

\sum_{c_{k}} θ_{l_{1} \dots l_{K}}^{c_{1} \dots c_{K}} = 0.

[30]

Prior studies (12, 15) have characterized the zero-sum gauge for the all-order interaction model. Our formulation of the hierarchical gauge extends those findings and generalizes them to gauges defined by non-uniformly weighted sums of parameters.

Wild-type and generalized wild-type gauges.

The wild-type gauge (illustrated in Figs. 1B and 2B) is a hierarchical gauge that arises in the limit as $p$ approaches an indicator function for some “wild-type sequence,” $s^{wt}$ . In the wild-type gauge, only the parameters $θ_{s^{'}}$ for which $s^{'}$ matches $s^{wt}$ receive any penalization, and all these penalized $θ_{s^{'}}$ (except for $θ_{0}$ ) are driven to zero. Consequently, $θ_{0}$ quantifies the activity of the wild-type sequence, each $θ_{l}^{c}$ quantifies the effect of a single mutation to the wild-type sequence, each $θ_{l l^{'}}^{c c^{'}}$ quantifies the epistatic effect of two mutations to the wild-type sequence, and so on. Moreover, viewing the wild-type gauge as a special case of the hierarchical gauge makes it possible to generalize the wild-type gauge by using a $p$ that is not the indicator function on a single sequence but rather defines a distribution over one or more alleles per position that can be considered as being “wild-type” (equivalently, the frequencies of some subset of position-specific characters are set to zero). These gauges all inherit from the hierarchical gauge the property that their coefficients relate to the average effect of taking draws from the probability distribution defined by $p$ and setting a subset of positions to the characters specified by that coefficient. More rigorously, these gauges are defined by considering the limit $\lim_{ϵ \to 0^{+}}$ of the hierarchical gauge with factorizable distribution

p_{ϵ} (s) = \prod_{l} [(1 - ϵ) p_{l}^{s_{l}} + \frac{ϵ}{α}],

[31]

where the $p_{l}^{s_{l}} \geq 0$ are the position-specific factors of the desired nonnegative vector of probabilities $p$ .
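The limiting construction in Eq. (31) can be illustrated numerically. The sketch below assumes an additive model and assumes that the hierarchical gauge for such a model is characterized by the $p$-weighted condition $\sum_{c} p_{l}^{c} θ_{l}^{c} = 0$ at every position; the wild-type sequence, the function names, and the random parameters are hypothetical.

```python
import numpy as np

def smoothed_factors(p, eps):
    """Position-specific factors (1 - eps) * p[l, c] + eps / alpha from Eq. (31)."""
    p = np.asarray(p, dtype=float)
    alpha = p.shape[1]
    return (1.0 - eps) * p + eps / alpha

def hierarchical_gauge_additive(theta0, theta, p):
    """Additive model in the hierarchical gauge for position-specific p (assumed form)."""
    means = (p * theta).sum(axis=1, keepdims=True)  # p-weighted mean at each position
    return theta0 + means.sum(), theta - means

rng = np.random.default_rng(1)
L, alpha = 3, 4
theta0, theta = 0.0, rng.normal(size=(L, alpha))

wt = np.array([0, 2, 1])                 # hypothetical wild-type sequence (character indices)
p_wt = np.zeros((L, alpha))
p_wt[np.arange(L), wt] = 1.0             # indicator function on the wild-type sequence

for eps in (1e-1, 1e-3, 1e-6):
    t0, t = hierarchical_gauge_additive(theta0, theta, smoothed_factors(p_wt, eps))
    # the parameters of wild-type characters shrink toward zero as eps decreases
    print(eps, np.abs(t[np.arange(L), wt]).max())

# For an additive model the projection is continuous in p, so eps = 0 gives the limit directly.
t0_lim, t_lim = hierarchical_gauge_additive(theta0, theta, smoothed_factors(p_wt, 0.0))
```

In the limit, the parameters attached to wild-type characters vanish and the constant parameter equals the model's prediction for the wild-type sequence, which matches the interpretation of the wild-type gauge given above.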

Applications.

We now demonstrate the utility of our results on two example models of complex sequence-function relationships. First, we study how the parameters of the all-order interaction model behave under different parametric gauges in the context of a simulated landscape on short binary sequences. We observe that model parameters exhibit nontrivial collective behavior across different choices of gauge. Second, we examine the parameters of an empirical pairwise-interaction model for protein GB1 using the zero-sum and multiple generalized wild-type gauges. We observe how these different hierarchical gauges enable different interpretations of model parameters and facilitate the derivation of simplified models that are approximately correct in different localized regions of sequence space. The results provide intuition for the behavior of the various parametric gauges, and show in particular how hierarchical gauges can be used to explore and interpret real sequence-function relationships.

Gauge-fixing a simulated landscape on short binary sequences.

To illustrate the consequences of choosing gauges in the parametric family, we consider a simulated random landscape on short binary sequences. Consider sequences of length $L = 3$ built from the alphabet $𝓐 = \{0, 1\}$, and assume that the activities of these sequences are as shown in Fig. 3A. The corresponding all-order interaction model has ${(α + 1)}^{L} = 27$ parameters, which we index using augmented sequences: 1 constant parameter $(θ_{* * *})$, 6 additive parameters $(θ_{0 * *}, θ_{1 * *}, θ_{* 0 *}, θ_{* 1 *}, θ_{* * 0}, θ_{* * 1})$, 12 pairwise parameters $(θ_{00 *}, θ_{01 *}, θ_{10 *}, θ_{11 *}, θ_{0 * 0}, θ_{0 * 1}, θ_{1 * 0}, θ_{1 * 1}, θ_{* 00}, θ_{* 01}, θ_{* 10}, θ_{* 11})$, and 8 third-order parameters $(θ_{000}, θ_{001}, θ_{010}, θ_{011}, θ_{100}, θ_{101}, θ_{110}, θ_{111})$.

Fig. 3. — Binary landscape expressed in various parametric family gauges. (A) Simulated random activity landscape for binary sequences of length $L = 3$ . (B) Parameters of the all-order interaction model for the binary landscape as functions of $η = λ / (1 + λ)$ . Values of $η$ corresponding to different named gauges are indicated. Note: because the uniform distribution is assumed in all these gauges, the hierarchical gauge is also the zero-sum gauge.

We now consider what happens to the values of these 27 parameters when they are expressed in different parametric gauges, $Θ^{λ, p}$. Specifically, we assume that $p$ is the uniform distribution and vary the parameter $λ$ from 0 to $\infty$ (equivalently, $η$ varies from 0 to 1). Note that each entry in the projection matrix $P^{λ, p}$ (Eq. 19) is a cubic function of $η$, due to $L = 3$. Consequently, each of the 27 gauge-fixed model parameters is a cubic function of $η$ [Fig. 3B]. In the trivial gauge ($λ = 0, η = 0$), only the 8 third-order parameters are nonzero, and the values of the 8 third-order parameters correspond to the values of the landscape at the 8 corresponding sequences. In the equitable gauge ($λ = 1, η = 1 / 2$), the spread of the 8 third-order parameters about zero is larger than that of the 12 pairwise parameters, which is larger than that of the 6 additive parameters, which is larger than that of the constant parameter. In the euclidean gauge ($λ = 2, η = 2 / 3$), the parameters of all orders exhibit a similar spread about zero. In the hierarchical gauge ($λ = \infty, η = 1$), the spread of the 8 third-order parameters about zero is smaller than that of the 12 pairwise parameters, which is smaller than that of the 6 additive parameters, which is smaller than that of the constant parameter. Moreover, the marginalization and orthogonality properties of the hierarchical gauge fix certain parameters to be equal or opposite to each other, e.g., we must have $θ_{1 * *} = - θ_{0 * *}$, and the third-order parameters are all equal up to their sign, which depends only on whether the corresponding sequence feature has an even or odd number of “1”s.
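The constraints that the hierarchical (here zero-sum) gauge places on the binary landscape can be checked with a short script. The sketch below does not use the projection matrix of Eq. (19); instead it assumes the standard inclusion-exclusion (ANOVA-style) construction, which yields parameters that reproduce the landscape and sum to zero over each character index. The random landscape is a stand-in for the values in Fig. 3A, and all names are illustrative.

```python
import itertools
import numpy as np

L, alphabet = 3, (0, 1)
seqs = list(itertools.product(alphabet, repeat=L))
rng = np.random.default_rng(0)
f = {s: rng.normal() for s in seqs}      # stand-in random landscape (not the values in Fig. 3A)

def matches(feature, s):
    """True if sequence s carries every character the augmented sequence specifies."""
    return all(a == '*' or a == c for a, c in zip(feature, s))

def cond_mean(feature):
    """Uniform average of the landscape over sequences matching the augmented sequence."""
    return np.mean([f[s] for s in seqs if matches(feature, s)])

features = list(itertools.product(('*',) + alphabet, repeat=L))  # 27 augmented sequences

theta = {}
for feat in features:
    fixed = [l for l in range(L) if feat[l] != '*']
    total = 0.0
    # inclusion-exclusion over sub-features obtained by starring out fixed positions
    for k in range(len(fixed) + 1):
        for keep in itertools.combinations(fixed, k):
            sub = tuple(feat[l] if l in keep else '*' for l in range(L))
            total += (-1) ** (len(fixed) - k) * cond_mean(sub)
    theta[feat] = total

def predict(s):
    """Sum the parameters of every feature that sequence s matches."""
    return sum(v for feat, v in theta.items() if matches(feat, s))
```

With these parameters, `theta[(1, '*', '*')]` equals `-theta[(0, '*', '*')]`, and the eight third-order parameters share one magnitude with a sign set by the parity of the number of 1s, as described above.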

This example illustrates generic features of the parametric gauges. For any all-order interaction model on sequences of length $L$ , the entries of the projection matrix $P^{λ, p}$ will be $L$ -order polynomials in $η$ . Consequently, the values of model parameters, when expressed in the gauge $Θ^{λ, p}$ , will also be $L$ -order polynomials in $η$ . In the trivial gauge, only the highest-order parameters will be nonzero. In the equitable gauge, the spread about zero will tend to be smaller for lower-order parameters relative to higher-order parameters. In the euclidean gauge, parameters of all orders will exhibit similar spread about zero. In the zero-sum gauge, the spread about zero will tend to be minimized for higher-order parameters relative to lower-order parameters. The nontrivial quantitative behavior of model parameters in different parametric gauges thus underscores the importance of choosing a specific gauge before quantitatively interpreting parameter values.

Hierarchical gauges of an empirical landscape for protein GB1.

Projecting model parameters onto different hierarchical gauges can facilitate the exploration and interpretation of sequence-function relationships. To demonstrate this application of gauge fixing, we consider an empirical sequence-function relationship describing the binding of the GB1 protein to immunoglobulin G (IgG). Wu et al. (59) performed a deep mutational scanning experiment that measured how nearly all 20⁴ = 160,000 amino acid combinations at positions 39, 40, 41, and 54 of GB1 affect GB1 binding to IgG. These data report log₂ enrichment values for each assayed sequence relative to the wild-type sequence at these positions, VDGV (Fig. 4A,B). Using these data and least-squares regression, we inferred a pairwise interaction model for log₂ enrichment as a function of protein sequence at these $L = 4$ variable positions. The resulting pairwise interaction model comprises 1 constant parameter, 80 additive parameters, and 2400 pairwise parameters. Fig. S1 illustrates the performance of this model. To understand the structure of the activity landscape described by the pairwise interaction model, we now examine the values of model parameters in multiple hierarchical gauges. Explicit formulas for implementing hierarchical gauges for pairwise-interaction models are given in SI Sec. 8.
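The parameter counts quoted above follow from simple combinatorics, as the following short Python check illustrates. The variable names are illustrative only.

```python
from math import comb

L, alpha = 4, 20                      # variable positions and amino-acid alphabet size

n_sequences = alpha ** L              # all amino-acid combinations at the four positions
n_constant = 1
n_additive = L * alpha                # one parameter per (position, character)
n_pairwise = comb(L, 2) * alpha ** 2  # one per (position pair, character pair)
n_total = n_constant + n_additive + n_pairwise

print(n_sequences, n_additive, n_pairwise, n_total)
```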

Fig. 4C shows the parameters of the pairwise interaction model expressed in the hierarchical gauge corresponding to a uniform probability distribution on sequence space (i.e., the zero-sum gauge). In the zero-sum gauge, the constant parameter $θ_{0}$ equals the average activity of all sequences. We observe $θ_{0} = - 4.68$, indicating that a typical random sequence is depleted approximately 20-fold relative to the wild-type sequence, which the pairwise interaction model assigns a score of −0.21. This finding confirms the expectation that a random sequence should be substantially less functional than the wild-type sequence.

The additive parameters in the zero-sum gauge are shown in the rectangular heat map in Fig. 4C, and each additive parameter is equal to the difference between the mean activity of the set of sequences containing the corresponding amino acid at the relevant position and the mean activity of random sequences. We observe that the wild-type sequence receives positive or near-zero contributions at every position, including a contribution from the most positive additive parameter, corresponding to G at position 41. The additive parameters at positions 39, 40, and 54 that contribute to the wild-type sequence, however, are not the largest additive parameters at these positions. Moreover, the additive parameters that contribute to the wild-type sequence only sum to 2.32, meaning that, of the total difference (4.47) between the wild-type sequence score and the average sequence score, almost half (2.15) is due to contributions from pairwise parameters. This finding quantifies the importance of epistatic interactions at positions 39, 40, 41, and 54 for the IgG binding activity of wild-type GB1.
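The decomposition of the wild-type score described above can be reproduced from the reported numbers. The snippet below is a simple arithmetic check using only the values quoted in the text; the variable names are illustrative.

```python
theta0 = -4.68        # constant parameter in the zero-sum gauge (mean activity)
wt_score = -0.21      # pairwise-model score of the wild-type sequence VDGV
additive_sum = 2.32   # sum of additive parameters contributing to the wild type

total_diff = wt_score - theta0              # wild-type score relative to the mean
pairwise_sum = total_diff - additive_sum    # remainder carried by pairwise parameters
pairwise_fraction = pairwise_sum / total_diff
fold_depletion = 2 ** total_diff            # scores are log2 enrichments

print(round(total_diff, 2), round(pairwise_sum, 2))
print(round(pairwise_fraction, 2), round(fold_depletion, 1))
```

The pairwise share comes out slightly below one half, and the corresponding fold-change is a little over 20, in line with the "almost half" and "approximately 20-fold" statements above.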

The pairwise parameters in the zero-sum gauge are shown in the triangular heat map in Fig. 4C, where each pairwise parameter is equal to the difference between the observed mean activity of the sequences containing the specified pair of characters at the specified pair of positions and the expected mean activity based on the mean activity of sequences containing the individual characters and the grand mean activity. We observe that the three largest-magnitude pairwise contributions to the wild-type sequence are from the pairs G41V54 (1.25), V39G41 (0.91), and D40G41 (−0.44), indicating that position 41 is a major hub of epistatic interactions contributing to the wild-type sequence. Moving to the landscape as a whole, we observe that the largest-magnitude pairwise interactions link positions 41 and 54. Moreover, the strongest positive pairwise contributions are obtained when a small amino acid (G or A) is present at position 54, and a G, C, A, L, or P is present at position 41 [see also (45)]. This finding provides insight into the chemical nature of the epistatic interactions that facilitate wild-type GB1 binding to IgG.
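The verbal descriptions of the zero-sum gauge parameters given above can be written compactly. As a sketch, let $f$ denote the prediction of the pairwise interaction model and let $\langle \cdot \rangle$ denote an average over uniformly distributed sequences, with $\langle \cdot \mid \cdot \rangle$ the same average restricted to sequences carrying the indicated characters (this bracket notation is introduced here for illustration and is not used elsewhere in the paper). The constant and additive parameters then read

θ_{0} = \langle f \rangle, \qquad θ_{l}^{c} = \langle f \mid s_{l} = c \rangle - \langle f \rangle,

and the pairwise parameters are the observed pair mean minus the mean expected from the two single-character means and the grand mean:

θ_{l l^{'}}^{c c^{'}} = \langle f \mid s_{l} = c, s_{l^{'}} = c^{'} \rangle - [ \langle f \mid s_{l} = c \rangle + \langle f \mid s_{l^{'}} = c^{'} \rangle - \langle f \rangle ].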

Previous work (60, 61) identified three disjoint regions of high-activity sequences (region 1, region 2, and region 3) in the GB1 landscape measured by Wu et al. (59). Region 1 comprises sequences with G at position 41; region 2 comprises sequences with L or F at position 41 and G at position 54; and region 3 comprises sequences with C or A at position 41 and A at position 54. To investigate the structure of the GB1 landscape within the three regions, we defined probability distributions that were uniform in each region of sequence space and zero outside (Fig. 4D; see SI Sec. 8 for formal definitions of these regions). We then examined the values of the parameters of the pairwise-interaction model, with the parameters expressed in the hierarchical gauges corresponding to the probability distribution $p (s)$ for each of the three regions (the “region 1 hierarchical gauge”, “region 2 hierarchical gauge”, and “region 3 hierarchical gauge”). Since some characters at positions 41 and 54 have had their frequencies set to zero, these hierarchical gauges are in fact generalized wild-type gauges, and the additive and pairwise parameters can be interpreted in terms of the mean effects of introducing mutations to these specific regions of sequence space.
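The region-specific gauges described above can be sketched in Python as follows. The region distributions are built from the main-text description (the formal definitions are in SI Sec. 8 and may differ in detail), and the projection formulas are derived here from the requirement that $p$-weighted sums of parameters vanish, rather than copied from the SI. The model parameters are random stand-ins, not the fitted GB1 model, and all function names are hypothetical.

```python
import itertools
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
positions = (39, 40, 41, 54)
L, alpha = len(positions), len(AA)

def region_distribution(allowed):
    """Position-specific probabilities, uniform over the allowed characters.

    allowed maps a position to a string of permitted amino acids; positions
    not listed are left uniform over all 20 amino acids (an assumption).
    """
    p = np.zeros((L, alpha))
    for l, pos in enumerate(positions):
        chars = allowed.get(pos, AA)
        p[l, [AA.index(a) for a in chars]] = 1.0 / len(chars)
    return p

p_region = {
    "uniform": region_distribution({}),
    "region 1": region_distribution({41: "G"}),
    "region 2": region_distribution({41: "LF", 54: "G"}),
    "region 3": region_distribution({41: "CA", 54: "A"}),
}

def hierarchical_gauge_pairwise(theta0, add, pair, p):
    """Project a pairwise model onto the hierarchical gauge defined by p.

    add  : (L, alpha) additive parameters.
    pair : dict {(l, m): (alpha, alpha) array} for l < m.
    """
    new0 = theta0 + (p * add).sum()
    new_add = add - (p * add).sum(axis=1, keepdims=True)
    new_pair = {}
    for (l, m), T in pair.items():
        r = T @ p[m]            # average over position m, one value per character at l
        q = p[l] @ T            # average over position l, one value per character at m
        t = p[l] @ T @ p[m]     # average over both positions
        new_pair[(l, m)] = T - r[:, None] - q[None, :] + t
        new_add[l] += r - t     # push marginal effects down to the additive terms
        new_add[m] += q - t
        new0 += t               # and the overall mean down to the constant
    return new0, new_add, new_pair

def predict(theta0, add, pair, s):
    """Pairwise-model prediction for a sequence given as character indices."""
    out = theta0 + sum(add[l, c] for l, c in enumerate(s))
    return out + sum(T[s[l], s[m]] for (l, m), T in pair.items())

rng = np.random.default_rng(0)     # random stand-in parameters, not the fitted GB1 model
theta0 = rng.normal()
add = rng.normal(size=(L, alpha))
pair = {lm: rng.normal(size=(alpha, alpha)) for lm in itertools.combinations(range(L), 2)}

g0, g_add, g_pair = hierarchical_gauge_pairwise(theta0, add, pair, p_region["region 1"])
```

In the region 1 gauge produced by this sketch, the additive parameter for G at position 41 is exactly zero, so the remaining additive parameters at that position measure effects of mutating away from G, consistent with the generalized wild-type interpretation.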

In the region 1 hierarchical gauge (Fig. 4E, top), the additive parameters for position 41 quantify the effect of mutations away from G, and the additive parameters for positions 39, 40, and 54 quantify the average effect of mutations conditional on G at position 41. From the additive parameters at position 54, we observe that cysteine (C) and hydrophobic residues (A, V, I, L, M, or F) increase binding, and that proline (P) and charged residues (E, D, R, K) decrease binding. From the additive parameters at position 40, we observe that amino acids with a 5-carbon or 6-carbon ring (H, F, Y, W) increase binding, suggesting the presence of structural constraints on side chain shape, rather than constraints on hydrophobicity or charge. The largest pairwise parameters all involve mutations from G at position 41 to another amino acid, and careful inspection of these pairwise parameters shows that they are roughly equal and opposite to the additive effects of mutations at the other three positions. This indicates a classical form of masking epistasis, where the typical effect of a mutation at position 41 results in a more or less complete loss of function, after which mutations at the remaining three positions no longer have a substantial effect.

In the region 2 hierarchical gauge (Fig. 4E, middle), the additive parameters at position 54 quantify the average effect of mutations away from G contingent on L or F at position 41, the additive parameters at position 41 quantify the average effects of mutations away from L or F contingent on G at position 54, and the additive parameters at positions 39 and 40 quantify the average effects of mutations contingent on L or F at position 41 and on G at position 54. From the values of the additive parameters, we observe that mutations away from L or F at position 41 in the presence of G at position 54 are typically strongly deleterious (mean effect −3.39), and that mutations away from G at position 54 in the presence of L or F at position 41 are also strongly deleterious (mean effect −3.75). However, the pairwise parameters linking positions 41 and 54 are strongly positive (mean effect 2.85), again indicating a masking effect where the first deleterious mutation at position 41 or 54 results in a more or less complete loss of function, so that an additional mutation at the other position has little effect (note the similar but less extreme pattern of masking between the large-effect mutations at positions 41 and 54 with the milder mutations at positions 40 and 41, whose interaction coefficients are of the opposite sign of the additive effects at positions 40 and 41). Similar results hold for the region 3 hierarchical gauge, where mutations at positions 41 and 54 have masking effects on each other as well as on mutations in the other two positions (Fig. 4E, bottom). However, we can also contrast patterns of mutational effects between these regions. For example, mutating position 54 to G (a mutation leading towards region 2) on average has little effect in region 1 but would be deleterious in region 3. Similarly, if we consider mutations leading from region 2 to region 3, we can see that mutating position 41 to C in region 2 typically has little effect, whereas mutating position 41 to A is more deleterious.

Besides using the interpretation of hierarchical gauge parameters as average effects of mutations to understand how mutational effects differ in different regions of sequence space, we hypothesized that by applying different hierarchical gauges to the pairwise interaction model, one might be able to obtain simple additive models that are accurate in different regions of sequence space. Our hypothesis was motivated by the fact that the parameters of all-order interaction models in the zero-sum gauge are chosen to maximize the fraction of variance in the sequence-function relationship that is explained by lower-order parameters. To test our hypothesis, we defined an additive model for each of the four hierarchical gauges described above (uniform, region 1, region 2, and region 3) by projecting pairwise interaction model parameters onto the hierarchical gauge for that region and then setting all the pairwise parameters to zero. We then evaluated the predictions of each additive model on sequences randomly drawn from each of the four corresponding probability distributions (uniform, region 1, region 2, and region 3). The results (Fig. 5) show that the activities of sequences sampled uniformly from the sequence space are best explained by the additive model derived from the zero-sum gauge, that the activities of region 1 sequences are best explained by the additive model derived from the region 1 hierarchical gauge, and so on for regions 2 and 3. This shows that projecting a pairwise interaction model (or other hierarchical one-hot model) onto the hierarchical gauge corresponding to a specific region of sequence space can sometimes be used to obtain simplified models that approximate predictions by the original model in that region.
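The truncation procedure described above can be illustrated on a toy example. The sketch below assumes that the constant and additive hierarchical-gauge parameters equal the $p$-averaged activity and the mean effect of fixing one character while the other positions are drawn from $p$. It uses a small random landscape in place of the GB1 pairwise model, a hypothetical region with one fixed character, and exact averages over all sequences instead of 500 sampled sequences.

```python
import itertools
import numpy as np

L, alpha = 3, 4
seqs = np.array(list(itertools.product(range(alpha), repeat=L)))
rng = np.random.default_rng(0)
f = rng.normal(size=len(seqs))     # toy landscape standing in for pairwise-model predictions

def seq_probs(p):
    """Probability of every sequence under a position-factorized distribution p."""
    return np.prod(p[np.arange(L), seqs], axis=1)

def additive_truncation(p):
    """Constant and additive hierarchical-gauge parameters for distribution p."""
    w = seq_probs(p)
    theta0 = w @ f
    theta_add = np.zeros((L, alpha))
    for l in range(L):
        p_rest = p.copy()
        p_rest[l] = 1.0                       # drop position l from the weights
        w_rest = seq_probs(p_rest)
        for c in range(alpha):
            sel = seqs[:, l] == c
            theta_add[l, c] = w_rest[sel] @ f[sel] - theta0   # mean effect of setting s_l = c
    return theta0, theta_add

def predict_additive(theta0, theta_add):
    """Additive-model predictions for all sequences."""
    return theta0 + theta_add[np.arange(L), seqs].sum(axis=1)

def mse(pred, p):
    """Mean squared error of pred against f, averaged over sequences drawn from p."""
    return seq_probs(p) @ (pred - f) ** 2

p_uniform = np.full((L, alpha), 1.0 / alpha)
p_region = p_uniform.copy()
p_region[1] = [1.0, 0.0, 0.0, 0.0]            # hypothetical region: character 0 fixed at position 1

pred_uniform = predict_additive(*additive_truncation(p_uniform))
pred_region = predict_additive(*additive_truncation(p_region))

print(mse(pred_region, p_region), mse(pred_uniform, p_region))
print(mse(pred_uniform, p_uniform), mse(pred_region, p_uniform))
```

In this toy setting, each truncated model has the lower error on sequences drawn from its own distribution, which mirrors the pattern reported in Fig. 5; how large the advantage is for a real landscape depends on the structure of that landscape.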

Fig. 5. — Model coarse-graining using hierarchical gauges. Predictions of additive models for GB1 derived by model truncation using region-specific zero-sum gauges (from Fig. 4C,E), plotted against predictions of the full pairwise-interaction model, are shown for 500 sequences randomly sampled from each of the four distributions listed in Fig. 4D (i.e., uniform, region 1, region 2, and region 3). Diagonals indicate equality. GB1: domain B1 of protein G.

Discussion

Here we report a unified strategy for fixing the gauge of commonly used models of sequence-function relationships. First we defined a family of analytically tractable gauges for the all-order interaction model. We then derived explicit formulae for imposing any of these gauges on model parameters, and used these formulae to investigate the mathematical properties of these gauges. The results show that these gauges include multiple commonly used gauges, and that a subset of these gauges (the hierarchical gauges) can be applied to diverse lower-order models (including additive models, pairwise-interaction models, and higher-order interaction models).

Next, we demonstrated the family of gauges in two contexts: a simulated all-order interaction landscape on short binary sequences, and an empirical pairwise-interaction landscape for the protein GB1. The GB1 results, in particular, show how applying different hierarchical gauges can facilitate the biological interpretation of complex models of sequence-function relationships and the derivation of simplified models that are approximately correct in localized regions of sequence space.

Our study was limited to linear models of sequence-function relationships. Although linear models are used in many computational biology applications, more complex models are becoming increasingly common. For example, linear-nonlinear models [which include global epistasis models (9, 62–64) and thermodynamic models (56, 57, 65–68)] are commonly used to describe fitness landscapes and/or sequence-dependent biochemical activities. In addition to the gauge freedoms of their linear components, linear-nonlinear models can have additional gauge freedoms, such as diffeomorphic modes (69, 70), that also need to be fixed before parameter values can be meaningfully interpreted.

Sloppy modes are another important issue to address when interpreting quantitative models of sequence-function relationships. Sloppy modes are directions in parameter space that (unlike gauge freedoms) do affect model predictions but are nevertheless poorly constrained by data (71, 72). Understanding the mathematical structure of sloppy modes, and developing systematic methods for fixing these modes, is likely to be more challenging than understanding gauge freedoms. This is because sloppy modes arise from a confluence of multiple factors: the mathematical structure of a model, the distribution of data in feature space, and measurement uncertainty. Nevertheless, understanding sloppy modes is likely to be as important in many applications as understanding gauge freedoms. We believe the study of sloppy modes in quantitative models of sequence-function relationships is an important direction for future research.

Deep neural network (DNN) models present perhaps the biggest challenge for parameter interpretation. DNN models have had remarkable success in quantitatively modeling biological sequence-function relationships, most notably in the context of protein structure prediction (73, 74), but also in the context of other processes including gene regulation (75–77), epigenetics (78–80), and mRNA splicing (81, 82). It remains unclear, however, how researchers might gain insights into the molecular mechanisms of biological processes from inferred DNN models. DNNs are by nature highly over-parameterized (83–85), making the direct interpretation of DNN parameters infeasible. Instead, a variety of attribution methods have been developed to facilitate DNN model interpretations (86–89). Existing attribution methods can often be thought of as providing additive models that approximate DNN models in localized regions of sequence space (90), and the presence of gauge freedoms in these additive models needs to be addressed when interpreting attribution method output [as in (91, 92)]. We anticipate that, as DNN models become more widely adopted for mechanistic studies in biology, there will be a growing need for attribution methods that provide more complex quantitative models that approximate DNN models in localized regions of sequence space (16). If so, a comprehensive mathematical understanding of gauge freedoms in parametric models of sequence-function relationships will be needed to aid in these DNN model interpretations.

Materials and Methods

See Supplemental Information for detailed derivations of mathematical results. All data and Python scripts used to generate the figures are available at https://github.com/jbkinney/23_posfai.

Supplementary Material

Supplement 1

media-1.pdf (467 KB, pdf)

Significance Statement.

Computational biology relies heavily on mathematical models that predict biological activities from DNA, RNA, or protein sequences. Interpreting the parameters of these models, however, remains difficult. Here we address a core challenge for model interpretation: the presence of “gauge freedoms”, i.e., ways of changing model parameters without affecting model predictions. The results unify commonly used methods for eliminating gauge freedoms and show how these methods can be used to simplify complex models in localized regions of sequence space. This work thus overcomes a major obstacle in the interpretation of quantitative sequence-function relationships.

ACKNOWLEDGMENTS.

We thank Peter Koo for helpful conversations and Samantha Petti for comments on the manuscript. This work was supported by NIH grant R35 GM133613 (AP, JZ, DMM), NIH grant R35 GM133777 (AP, JBK), NIH grant R01 HG011787 (JBK), the Alfred P. Sloan Foundation (DMM), as well as additional funding from the Simons Center for Quantitative Biology at CSHL (DMM, JBK) and the College of Liberal Arts and Sciences at the University of Florida (JZ).

References

1.Kinney JB, McCandlish DM, Massively parallel assays and quantitative sequence-function relationships. Annu. Rev. Genomics Hum. Genet. 20, 99–127 (2019). [DOI] [PubMed] [Google Scholar]
2.Weinberger ED, Fourier and Taylor series on fitness landscapes. Biol. cybernetics 65, 321–330 (1991). [Google Scholar]
3.Stadler PF, Landscapes and their correlation functions. J. Math. chemistry 20, 1–45 (1996). [Google Scholar]
4.Weinreich DM, Lan Y, Wylie CS, Heckendorn RB, Should evolutionary geneticists worry about higher-order epistasis? Curr. opinion genetics & development 23, 700–707 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Poelwijk FJ, Krishna V, Ranganathan R, The context-dependence of mutations: a linkage of formalisms. PLoS computational biology 12, e1004771 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Ferretti L, et al. Measuring epistasis in fitness landscapes: The correlation of fitness effects of mutations. J. theoretical biology 396, 132–143 (2016). [DOI] [PubMed] [Google Scholar]
7.Bank C, Matuszewski S, Hietpas RT, Jensen JD, On the (un) predictability of a large intragenic fitness landscape. Proc. Natl. Acad. Sci. 113, 14085–14090 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Poelwijk FJ, Socolich M, Ranganathan R, Learning the pattern of epistasis linking genotype and phenotype in a protein. Nat. communications 10, 4213 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Tareen A, et al. MAVE-NN: learning genotype-phenotype maps from multiplex assays of variant effect. Genome Biol. 23, 98 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Brookes DH, Aghazadeh A, Listgarten J, On the sparsity of fitness functions and implications for learning. Proc. Natl. Acad. Sci. 119, e2109649118 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Faure AJ, Lehner B, Miró Pina V, Colome CS, Weghorn D, An extension of the walsh- hadamard transform to calculate and model epistasis in genetic landscapes of arbitrary shape and complexity. bioRxivpp. 2023–03 (2023). [DOI] [PMC free article] [PubMed]
12.Metzger BP, Park Y, Starr TN, Thornton JW, Epistasis facilitates functional evolution in an ancient transcription factor. bioRxiv p. 2023.04.19.537271 (2024). [DOI] [PMC free article] [PubMed]
13.Novakovsky G, Dexter N, Libbrecht MW, Wasserman WW, Mostafavi S, Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat. Rev. Genet. 24, 125–137 (2023). [DOI] [PubMed] [Google Scholar]
14.Koo PK, Majdandzic A, Ploenzke M, Anand P, Paul SB, Global importance analysis: An interpretability method to quantify importance of genomic features in deep neural networks. PLoS computational biology 17, e1008925 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Park Y, Metzger BP, Thornton JW, The simplicity of protein sequence-function relationships. bioRxiv p. 2023.09.02.556057 (2023). [DOI] [PMC free article] [PubMed]
16.Seitz EE, McCandlish DM, Kinney JB, Koo PK, Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models. bioRxiv (2023). [DOI] [PMC free article] [PubMed]
17.Dupic T, Phillips AM, Desai MM, Protein sequence landscapes are not so simple: on reference-free versus reference-based inference. bioRxiv p. 2024.01.29.577800 (2024).
18.Jackson JD, Okun LB, Historical roots of gauge invariance. Rev. modern physics 73, 663 (2001). [Google Scholar]
19.Kinney JB, Tkacik G, Callan CG, Precise physical models of protein-DNA interaction from high-throughput data. Proc. Natl. Acad. Sci. 104, 501–506 (2007) Wrote. [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Weigt M, White RA, Szurmant H, Hoch JA, Hwa T, Identification of direct residue contacts in protein-protein interaction by message passing. Proc. Natl. Acad. Sci. 106, 67–72 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Marks DS, et al. Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS ONE 6, e28766 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Stormo GD, Maximally efficient modeling of DNA sequence motifs at all levels of complexity. Genetics 187, 1219–1224 (2011-April). [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Ekeberg M, Lovkvist C, Lan Y, Weigt M, Aurell E, Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Phys. Rev. E 87, 012707 (2013). [DOI] [PubMed] [Google Scholar]
24.Ekeberg M, Hartonen T, Aurell E, Fast pseudolikelihood maximization for direct-coupling analysis of protein structure from many homologous amino-acid sequences. J. Comput. Phys. 276,341–356 (2014). [Google Scholar]
25.Stein RR, Marks DS, Sander C, Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models. PLoS Comput. Biol. 11, e1004182 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Barton JP, Leonardis ED, Coucke A, Cocco S, ACE: adaptive cluster expansion for maximum entropy graphical model inference. Bioinformatics 32, 3089–3097 (2016). [DOI] [PubMed] [Google Scholar]
27.Haldane A, Flynn WF, He P, Levy RM, Coevolutionary Landscape of Kinase Family Proteins: Sequence Probabilities and Functional Motifs. Biophys. J. 114, 21–31 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Cocco S, Feinauer C, Figliuzzi M, Monasson R, Weigt M, Inverse statistical physics of protein sequences: a key issues review. Reports on Prog. Phys. 81, 032601 (2018). [DOI] [PubMed] [Google Scholar]
29.Haldane A, Levy RM, Influence of multiple-sequence-alignment depth on Potts statistical models of protein covariation. Phys. Rev. E 99, 032405 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Zamuner S, Rios PDL, Interpretable Neural Networks based classifiers for categorical inputs. arXiv (2021).
31.Feinauer C, Meynard-Piganeau B, Lucibello C, Interpretable pairwise distillations for generative protein sequence models. PLoS Comput. Biol. 18, e1010219 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Gerardos A, Dietler N, Bitbol AF, Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences. PLoS Comput. Biol. 18, e1010147 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Hsu C, Nisonoff H, Fannjiang C, Listgarten J, Learning protein fitness models from evolutionary and assay-labeled data. Nat. Biotechnol. 40,1114–1122 (2022). [DOI] [PubMed] [Google Scholar]
34.Feinauer C, Borgonovo E, Mean Dimension of Generative Models for Protein Sequences. bioRxiv p. 2022.12.12.520028 (2022).
35.Rube HT, et al. Prediction of protein-ligand binding affinity from sequencing data with interpretable machine learning. Nat. Biotechnol. 40, 1520–1527 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
36.Posfai A, McCandlish DM, Kinney JB, Symmetry, gauge freedoms, and the interpretability of sequence-function relationships. bioRxiv (2024). [DOI] [PMC free article] [PubMed]
37.Busby S, Ebright RH, Transcription activation by catabolite activator protein (CAP). J Mol Biol 293, 199–213 (1999). [DOI] [PubMed] [Google Scholar]
38.Foat B, Morozov A, Bussemaker H, Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 22, e141–9 (2006). [DOI] [PubMed] [Google Scholar]
39.Rube HT, Rastogi C, Kribelbauer JF, Bussemaker HJ, A unified approach for quantifying and interpreting DNA shape readout by transcription factors. Mol. Syst. Biol. 14, e7902 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Hu Y, et al. Evolution of DNA replication origin specification and gene silencing mechanisms. Nat. Commun. 11,5175 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Chen WC, Tareen A, Kinney JB, Density estimation on small data sets. Phys. Rev. Lett. 121, 160605 (2018) Wrote. [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Skalenko KS, et al. Promoter-sequence determinants and structural basis of primer-dependent transcription initiation in Escherichia coli. Proc. Natl. Acad. Sci. 118, e2106388118 (2021) Co-authored. [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Pukhrambam C, et al. Structural and mechanistic basis of σ-dependent transcriptional pausing. bioRxiv p. 2022.01.24.477500 (2022). [DOI] [PMC free article] [PubMed]
44.Fowler DM, et al. High-resolution mapping of protein sequence-function relationships. Nat Methods 7, 741–746 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
45.Olson CA, Wu NC, Sun R, A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Curr. biology : CB 24, 2643–2651 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Adams RM, Mora T, Walczak AM, Kinney JB, Measuring the sequence-affinity landscape of antibodies with massively parallel titration curves. eLife 5, e23156 (2016) Wrote. [DOI] [PMC free article] [PubMed] [Google Scholar]
47.Esposito D, et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biol. 20, 223 (2019) Read on 19.11.15 Looks like valuable database. Pissed off that their long list of refs misses my 2010 paper and most of my other work. I wrote the authors about this. [DOI] [PMC free article] [PubMed] [Google Scholar]
48.Starr TN, et al. Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. Cell 182,1295–1310.e20 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
49.Patwardhan RP, et al. High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nat Biotechnol 27,1173–1175 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
50.Patwardhan RP, et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nat Biotechnol 30, 265–270 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
51.Kwasnieski JC, Mogno I, Myers CA, Corbo JC, Cohen BA, Complex effects of nucleotide variants in a mammalian cis-regulatory element. Proc Natl Acad Sci USA 109, 19498–19503 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
52.Julien P, Miñana B, Baeza-Centurion P, Valcárcel J, Lehner B, The complete local genotype-phenotype landscape for the alternative splicing of a human exon. Nat. Commun. 7, 11558 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
53.Kircher M, et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat. Commun. 10, 3583 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
54.Urtecho G, et al. Genome-wide Functional Characterization of Escherichia coli Promoters and Regulatory Elements Responsible for their Function. bioRxiv p. 2020.01.04.894907 (2020).
55.Berg O, Hippel Pv, Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol 193, 723–750 (1987). [DOI] [PubMed] [Google Scholar]
56.Kinney JB, Murugan A, Callan CG, Cox EC, Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl. Acad. Sci. 107, 9158–9163 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
57.Tareen A, Kinney JB, Logomaker: beautiful sequence logos in Python. Bioinforma. (Oxford, England) 36, 2272–2274 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
58.Kuszewski J, Gronenborn AM, Clore GM, Improving the Packing and Accuracy of NMR Structures with a Pseudopotential for the Radius of Gyration. J. Am. Chem. Soc. 121, 2337–2338 (1999). [Google Scholar]
59.Wu NC, Dai L, Olson CA, Lloyd-Smith JO, Sun R, Adaptation in protein fitness landscapes is facilitated by indirect paths. eLife 5, 1965 (2016). [DOI] [PMC free article] [PubMed]
60.Zhou J, McCandlish DM, Minimum epistasis interpolation for sequence-function relationships. Nat. Commun. 11, 1782 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
61.Rozhonova H, Marti-Gomez C, McCandlish DM, Payne JL, Protein evolvability under rewired genetic codes. bioRxiv pp. 2023–06 (2023). [DOI] [PMC free article] [PubMed]
62.Sarkisyan KS, et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
63.Sailer ZR, Harms MJ, Detecting High-Order Epistasis in Nonlinear Genotype-Phenotype Maps. Genetics 205, 1079–1088 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
64.Otwinowski J, McCandlish DM, Plotkin JB, Inferring the shape of global epistasis. Proc Natl Acad Sci USA 115, E7550–E7558 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
65.Mogno I, Kwasnieski JC, Cohen BA, Massively parallel synthetic promoter assays reveal the in vivo effects of binding site variants. Genome Res 23, 1908–1915 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
66.Otwinowski J, Biophysical Inference of Epistasis and the Effects of Mutations on Protein Stability and Function. Mol Biol Evol 35, 2345–2354 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
67.Belliveau NM, et al. Systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria. Proc. Natl. Acad. Sci. 115, 201722055 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
68.Faure AJ, et al. Mapping the energetic and allosteric landscapes of protein binding domains. Nature 604, 175–183 (2022). [DOI] [PubMed] [Google Scholar]
69.Kinney JB, Atwal GS, Parametric Inference in the Large Data Limit Using Maximally Informative Models. Neural computation 26, 637–653 (2014). [DOI] [PubMed] [Google Scholar]
70.Atwal GS, Kinney JB, Learning Quantitative Sequence-Function Relationships from Massively Parallel Experiments. J. Stat. Phys. 162, 1203–1243 (2016). [Google Scholar]
71.Machta BB, Chachra R, Transtrum MK, Sethna JP, Parameter space compression underlies emergent theories and predictive models. Science 342, 604–607 (2013). [DOI] [PubMed] [Google Scholar]
72.Transtrum MK, et al. Perspective: Sloppiness and emergent theories in physics, biology, and beyond. The J. Chem. Phys. 143, 010901–14 (2015). [DOI] [PubMed] [Google Scholar]
73.Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
74.Lin Z, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023). [DOI] [PubMed] [Google Scholar]
75.Avsec Ž, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
76.Karbalayghareh A, Sahin M, Leslie CS, Chromatin interaction–aware gene regulatory modeling with graph attention networks. Genome Res. 32, 930–944 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
77.de Almeida BP, Reiter F, Pagani M, Stark A, DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat. Genet. 54, 613–624 (2022). [DOI] [PubMed] [Google Scholar]
78.Avsec Ž, et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
79.Chen KM, Wong AK, Troyanskaya OG, Zhou J, A sequence-based global map of regulatory activity for deciphering human genetics. Nat. Genet. 54, 940–949 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
80.Toneyan S, Tang Z, Koo PK, Evaluating deep learning for predicting epigenomic profiles. Nat. Mach. Intell. pp. 1–13 (2022). [DOI] [PMC free article] [PubMed]
81.Jaganathan K, et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535–548 (2019). [DOI] [PubMed] [Google Scholar]
82.Cheng J, Çelik MH, Kundaje A, Gagneur J, MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome Biol. 22, 1–19 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
83.Raghu M, Poole B, Kleinberg J, Ganguli S, Dickstein JS, On the expressive power of deep neural networks in Proceedings of the 34th International Conference on Machine Learning - Volume 70. pp. 2847–2854 (2017). [Google Scholar]
84.Kaplan J, et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
85.Nakkiran P, et al. Deep double descent: Where bigger models and more data hurt. J. Stat. Mech. Theory Exp. 2021, 124003 (2021). [Google Scholar]
86.Simonyan K, Vedaldi A, Zisserman A, Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013).
87.Shrikumar A, Greenside P, Kundaje A, Learning important features through propagating activation differences in Proceedings of the 34th International Conference on Machine Learning - Volume 70. pp. 3145–3153 (2017). [Google Scholar]
88.Lundberg SM, Lee SI, A unified approach to interpreting model predictions in Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 4768–4777 (2017). [Google Scholar]
89.Jha A, Aicher JK, Gazzara MR, Singh D, Barash Y, Enhanced integrated gradients: improving interpretability of deep learning models using splicing codes as a case study. Genome biology 21, 1–22 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
90.Han T, Srinivas S, Lakkaraju H, Which explanation should I choose? A function approximation perspective to characterizing post hoc explanations. arXiv preprint arXiv:2206.01254 (2022).
91.Majdandzic A, Rajesh C, Koo PK, Correcting gradient-based interpretations of deep neural networks for genomics. Genome Biol. 24, 109 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
92.Sasse A, Chikina M, Mostafavi S, Quick and effective approximation of in silico saturation mutagenesis experiments with first-order Taylor expansion. bioRxiv pp. 2023–11 (2023). [DOI] [PMC free article] [PubMed]

Associated Data

Supplementary Materials

Supplement 1

media-1.pdf (467KB, pdf)
