bioRxiv
[Preprint]. 2025 Jan 13:2025.01.09.632231. [Version 1] doi: 10.1101/2025.01.09.632231

CONSTRUCTING BIOLOGICALLY CONSTRAINED RNNS VIA DALE’S BACKPROP AND TOPOLOGICALLY-INFORMED PRUNING

Aishwarya H Balwani 1,*, Alex Q Wang 2, Farzaneh Najafi 3, Hannah Choi 4,*
PMCID: PMC11760306  PMID: 39868098

Abstract

Recurrent neural networks (RNNs) have emerged as a prominent tool for modeling cortical function, and yet their conventional architecture is lacking in physiological and anatomical fidelity. In particular, these models often fail to incorporate two crucial biological constraints: i) Dale’s law, i.e., sign constraints that preserve the “type” of projections from individual neurons, and ii) Structured connectivity motifs, i.e., highly sparse yet defined connections amongst various neuronal populations. Both constraints are known to impair learning performance in artificial neural networks, especially when trained to perform complicated tasks; but as modern experimental methodologies allow us to record from diverse neuronal populations spanning multiple brain regions, using RNN models to study neuronal interactions without incorporating these fundamental biological properties raises questions regarding the validity of the insights gleaned from them. To address these concerns, our work develops methods that let us train RNNs which respect Dale’s law whilst simultaneously maintaining a specific sparse connectivity pattern across the entire network. We provide mathematical grounding and guarantees for our approaches incorporating both types of constraints, and show empirically that our models match the performance of RNNs trained without any constraints. Finally, we demonstrate the utility of our methods for inferring multi-regional interactions by training RNN models of the cortical network to reconstruct 2-photon calcium imaging data during visual behaviour in mice, whilst enforcing data-driven, cell-type specific connectivity constraints between various neuronal populations spread across multiple cortical layers and brain areas. In doing so, we find that the interactions inferred by our model corroborate experimental findings in agreement with the theory of predictive coding, thus validating the applicability of our methods.

1. Introduction

Recent years have seen the increasing adoption of artificial neural networks (ANNs) for modeling brain function both mechanistically and algorithmically [1, 2, 3, 4, 5]. In particular, recurrent neural networks (RNNs) are now an established tool in computational neuroscience research [6, 7], being used to study neuronal computation at varying scales, ranging from subsets of neurons sampled within a single brain region, to two interacting regions [8, 9, 10], and even to numerous populations spread across multiple interacting brain regions [11, 12]. By way of reproducing desired behaviours [13, 14, 15] or task-driven responses [16, 17, 18], or by fitting to recorded neural data [19, 20], RNNs have been shown to successfully capture latent dynamics typical of neural circuits [21, 22], thus making them especially useful for modeling phenomena observed across the cortex. The degree to which ANNs can effectively approximate neural data, however, depends on two key considerations: i) The literature suggests a direct correlation between the ability of an ANN to learn well on a task and the extent to which its behaviour and learnt representations match real neural data [23, 24, 25], and ii) More biologically realistic architectures aid in the learning of representations that better match real neuronal data [26, 27, 28, 29]. These factors make it essential that the ability of ANNs to learn and represent a wide range of function classes be unrestricted, that their training not suffer from hindrances, and that their construction respect important anatomical principles, especially when they are used as models of the brain to study neuroscientific phenomena [30, 31].

Of the various discrepancies that conventional RNN-based neuroscientific models have with their biological counterparts (Fig. 1.A1) [32], two notable ones are their lack of adherence to Dale's principle [33], i.e., the principle that restricts a presynaptic neuron to having exclusively either an excitatory or inhibitory effect on all its postsynaptic connections, and their lack of structured sparse connectivity amongst neuronal populations, a fundamental feature of brain organization observed across various species [34, 35, 36] and brain regions [37]. Unfortunately, directly incorporating these constraints oftentimes decreases the capacity and flexibility of the network to fit the training data, which leads to a drop in learning performance [38, 39]. While there has been active research towards addressing these issues in both the machine learning (ML) [40, 41, 42, 43] and computational neuroscience communities [38, 44, 45, 46, 47, 48, 49], these efforts have mostly incorporated the constraints into the network structure individually, rather than in conjunction as one would see in biological brains [50]. Consequently, there remains a need for ways to construct sparse, sign-constrained deep neural networks that can also achieve performance levels comparable to conventional ANNs.

Figure 1: Schematic for Constructing Biologically Constrained RNN Models.

(A) Illustration of conventional vs. biologically constrained RNN models (A1 vs. A2). Conventional RNNs consist of general purpose neurons that project a mix of excitatory and inhibitory signals, with no specific connectivity structure within or across populations. Biologically constrained RNNs restrict populations of neurons to be either strictly excitatory (red) or inhibitory (blue), with anatomically-informed connectivity motifs both within and across populations. (B) Optimization in parameter space when training with conventional backpropagation (black) vs. Dale's backprop (blue). Purple contours represent level sets for the positions the algorithms take in parameter space at different time steps. (C) Enforcing anatomically-consistent connectivity motifs. Dashed lines represent connections that are set to 0 during the pruning process. Solid lines represent connections that are retained post-pruning.

Our work therefore introduces methods that allow us to easily incorporate both neuronal sign constraints and sparse connectivity motifs into the conventional backpropagation-based RNN training pipeline (Fig. 1.A2). Specifically, we first train a dense network that respects a pre-determined set of sign constraints via a modified version of standard backpropagation [51] which we call Dale's backpropagation (Fig. 1.B), after which we prune away weights using a probabilistic pruning rule we call top-prob pruning (Fig. 1.C) to achieve a target connectivity pattern. Finally, we retrain the sparse sub-network retained post-pruning, once again with Dale's backprop. Importantly, both of our methods are mathematically grounded. With Dale's backprop, we provide theoretical guarantees on the linear convergence of the algorithm under specific conditions, ensuring that training respects anatomical constraints while achieving optimal learning performance. Our pruning rule is motivated by topological principles, particularly the preservation of high-magnitude weights that contribute to the network's zeroth-order connectivity structure, thus enhancing the functional and anatomical plausibility of the model.

Besides being convenient, easily implementable, and scalable using standard ML packages, our approach also aligns with the biological processes of synaptic development and refinement. Synaptic connections initially form abundantly, with many later pruned based on activity and functional relevance; analogously, by first learning a dense set of weights with Dale's backprop, the network can capture a rich set of connections that adhere to Dale's law, reflecting excitatory and inhibitory roles at a fundamental level. Subsequent application of top-prob pruning mirrors the refinement phase, where weaker, less functionally critical synapses are eliminated, retaining only the most effective pathways. This pruning rule not only emphasizes synaptic efficacy [52] (preserving stronger synapses) but also adheres to principles of synaptic scaling [53] in the retraining phase by maintaining a balanced level of activity within the network. Overall, this process of initial dense learning followed by selective refinement echoes how biological systems develop, and ensures computational efficiency by optimizing for both anatomical and functional plausibility [54, 55].

We demonstrate the suitability of our methods for studying neuroscientific data by applying them to RNNs trained to fit a two-photon calcium imaging dataset exploring multi-regional interactions that underlie visual behavior in mice when performing a change detection task [56]. We find that our models successfully recapitulate both long- and short-timescale interactions among neuronal populations, capturing transient dynamics as well as sustained signals that are critical for complex perceptual processing. Moreover, our model outputs align with the predictive coding hypothesis [57], as they reflect anticipatory and feedback-driven patterns observed experimentally, suggesting that our approach is well-suited to modeling the layered processing of sensory information in the brain.

Taken together, our results on synthetic and real-world datasets indicate that our methods offer a robust framework for fitting and modeling neural dynamics in a biologically faithful manner. By capturing both anatomically realistic connectivity patterns and functional interactions, our approach provides a set of powerful tools for understanding the complex, hierarchical processing of information across different cell-types, populations, and brain areas. These tools subsequently enable models to better reflect anatomical structures, thereby imparting greater confidence in their findings and enhancing the alignment between RNNs and real neural circuitry.

2. Training Networks with Dale’s Backpropagation

In this section, we introduce our sign-constrained learning rule, Dale's backpropagation, and validate its performance. In particular, we provide intuition for the algorithm, describe it in detail, present the statements of the theoretical analyses performed (i.e., convergence guarantees and error bounds), and finally provide empirical results demonstrating its utility on a set of neuroscience-inspired and ML tasks.

2.1. Dale’s Backpropagation: Algorithm

Dale’s backpropagation enforces Dale’s principle by integrating sign constraints into the conventional backpropagation process. Specifically, it employs a projection step (similar to that of projected gradient descent) on the learnt parameters at every iteration to ensure that the weights remain non-negative for excitatory neurons and non-positive for inhibitory neurons, thus adhering to biological constraints (Fig. 1.B).

Consider the typical Elman RNN [58], whose hidden state $h_t$ at time $t$ is updated as per the rule

$$h_t = \phi(W_{hi} x_t + b_{hi} + W_{hh} h_{t-1} + b_{hh}) \tag{1}$$

where $\phi$ is the non-linear activation function, $W_{hh}$ is the recurrent weight matrix, and $W_{hi}$ is the projection matrix that acts on inputs $x_t$. Biases corresponding to the input and hidden states are denoted $b_{hi}$ and $b_{hh}$, respectively.

When $h_t$ is non-negative, respecting Dale's law simplifies to constraining the recurrent weights $W$ such that, if $i$ is the pre-synaptic neuron and $j$ is the post-synaptic neuron,

$$W_{ji} \geq 0 \text{ if neuron } i \text{ is excitatory}, \qquad W_{ji} \leq 0 \text{ if neuron } i \text{ is inhibitory.}$$

At initialization, the recurrent matrix $W$ can satisfy the sign constraints by construction. However, given the standard gradient descent-based backpropagation update

$$W^{(i+1)} = W^{(i)} - \eta \nabla \mathcal{L}(W^{(i)}) \tag{2}$$

with step size $\eta$ and loss function $\mathcal{L}$, there is no guarantee that the updated weights $W^{(i+1)}$ at the next iteration will satisfy the sign constraints set by Dale's law, even if they are respected by the matrix $W^{(i)}$ at iteration $i$.

We note however that our sign constraints always form a convex set [59], enabling us to adapt any gradient-based optimization scheme (e.g., SGD, ADAM, RMSprop, etc.) into its projected version [60]. Hence, after the standard backprop update at every iteration, we project the weights onto their feasible set – the orthant in parameter space where the sign constraints of all individual synaptic weights are met. Mathematically, this new update rule can be expressed as

$$W_D^{(i)} = \mathcal{P}_{\mathcal{C}}(W^{(i)}) = \mathcal{P}_{\mathcal{C}}\left(W_D^{(i-1)} - \eta \nabla \mathcal{L}(W_D^{(i-1)})\right) = \max\left(0, W_{[N^+]}^{(i)}\right) \oplus \min\left(0, W_{[N^-]}^{(i)}\right) \tag{3}$$

where $W_D^{(i)}$ represents the weight matrix that satisfies Dale's law at iteration $i$, $\mathcal{P}_{\mathcal{C}}$ denotes the projection operator, and $\oplus$ denotes the column-wise concatenation of the weight subsets corresponding to excitatory ($[N^+]$) and inhibitory ($[N^-]$) neurons. The full derivation of the update is provided in Appendix 6.1. The explicit algorithm for the update under gradient descent as the optimizer is shown in Appendix 6.2.

Moreover, this projection onto the feasible set has both a simple interpretation and a simple implementation. At every iteration, weights that violate their assigned sign constraints are set to zero, while those that comply are retained at their updated values. The projection itself can be efficiently implemented by multiplying the weights with a binary mask after each update. This flexibility makes it easy to apply our method across various architectures, seamlessly integrating sign constraints within standard backpropagation frameworks.
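As a minimal illustration, the projection can be written as a single masking operation; the sketch below (in PyTorch, with names of our own choosing, not the paper's released code) assumes a sign mask with $+1$ entries in the columns of excitatory neurons and $-1$ entries in those of inhibitory neurons.

```python
import torch

def dale_projection(W: torch.Tensor, sign_mask: torch.Tensor) -> torch.Tensor:
    """Project W onto the feasible orthant defined by Dale's law (Eq. 3).

    sign_mask: +1 in columns of excitatory neurons, -1 in columns of
    inhibitory neurons. Weights whose sign violates the mask are zeroed;
    compliant weights are kept at their updated values.
    """
    # Flip inhibitory columns positive, clamp violations to zero, flip back.
    return torch.clamp(W * sign_mask, min=0.0) * sign_mask
```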

Consequently, for a single-layer RNN with N neurons of which N+ are excitatory and N are inhibitory, our entire algorithm for Dale’s backpropagation can be summarized as follows

Algorithm 1: Dale's Backpropagation

1. Initialize $W_D^{(0)} = W^{(0)} \in \mathbb{R}^{N \times N}$ such that $W^{(0)} = W_{[N^+]}^{(0)} \oplus W_{[N^-]}^{(0)}$, where $W_{[N^+]}^{(0)}$ and $W_{[N^-]}^{(0)}$ represent the weights from the excitatory and inhibitory neurons respectively.
2. Enforce Dale's law by setting $W_{[N^+]}^{(0)} \geq 0^{N \times N^+}$ and $W_{[N^-]}^{(0)} \leq 0^{N \times N^-}$.
3. Sample $W_{[N^+]}^{(0)}$ from $\mathcal{U}[0, \frac{1}{\sqrt{N}}]$ and $W_{[N^-]}^{(0)}$ from $\mathcal{U}[-\frac{1}{\sqrt{N}}, 0]$.
4. Initialize $h_0 \leftarrow 0$.
5. For each time step $t$, compute and threshold the hidden state $h_t$ to be non-negative as:
   $h_t^+ = \left(\phi(W_{hi} x_t + W_D h_{t-1}^+ + b_{hh})\right)_+$
6. For each iteration $i$:
   (a) Compute $W^{(i+1)}$ using standard backpropagation.
   (b) Update the weights by setting:
   $W_D^{(i+1)} = \max\left(0, W_{[N^+]}^{(i+1)}\right) \oplus \min\left(0, W_{[N^-]}^{(i+1)}\right)$
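A minimal end-to-end sketch of Algorithm 1 in PyTorch is given below; it is our own illustrative rendering (the sizes, toy target, and all variable names are ours), not the authors' released implementation.

```python
import math
import torch

torch.manual_seed(0)

N, n_exc, n_in, T = 128, 102, 1, 100     # 102 excitatory, 26 inhibitory
bound = 1.0 / math.sqrt(N)

# Steps 1-3: initialize recurrent weights column-wise by presynaptic type.
W = torch.empty(N, N)
W[:, :n_exc].uniform_(0.0, bound)        # excitatory columns: U[0, 1/sqrt(N)]
W[:, n_exc:].uniform_(-bound, 0.0)       # inhibitory columns: U[-1/sqrt(N), 0]
W.requires_grad_(True)
W_in = (torch.randn(N, n_in) * bound).requires_grad_(True)
b = torch.zeros(N, requires_grad=True)

sign_mask = torch.ones(N, N)
sign_mask[:, n_exc:] = -1.0

opt = torch.optim.Adam([W, W_in, b], lr=1e-3)

def forward(x):                          # x: (T, n_in)
    h, hs = torch.zeros(N), []
    for t in range(T):
        # Step 5: tanh non-linearity, then threshold rates at zero.
        h = torch.relu(torch.tanh(W_in @ x[t] + W @ h + b))
        hs.append(h)
    return torch.stack(hs)

x, target = torch.randn(T, n_in), torch.zeros(T, N)   # toy data
for it in range(200):
    loss = ((forward(x) - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()                           # Step 6a: standard backprop update
    with torch.no_grad():                # Step 6b: project onto the feasible orthant
        W.copy_(torch.clamp(W * sign_mask, min=0.0) * sign_mask)
```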

We note that our initialization scheme relates closely to popularly used schemes such as Glorot [61] and He [62] initialization. In particular, the scheme aligns with (and is equivalent to, when the numbers of excitatory and inhibitory neurons are equal) common weight initialization practices in RNNs where weights are initialized with zero mean and scaled variance, often using the distribution $\mathcal{U}[-\frac{1}{\sqrt{N}}, \frac{1}{\sqrt{N}}]$, to maintain consistent activation variances, facilitating effective training and convergence [38, 63].

We also observe that thresholding $h_t$ to be non-negative mimics the behavior of biological neurons, in that neuronal firing rates cannot be negative. Since our goal is always to ensure that $h_t$ is non-negative, we can also increase the threshold to any value greater than 0, depending on the application. Furthermore, it is of interest that previous works were able to enforce sign constraints with FORCE learning [13] in spiking neural networks (SNNs) [47] using an update rule that is the same as ours; but since $h_t$ is non-negative by definition in an SNN, this is not a step that they explicitly needed to incorporate into their training algorithm. Additionally, we can show that despite their different formulations, Dale's backpropagation reinforces a set of synaptic connections that overlaps with those reinforced by Hebbian learning [64], thus preserving key aspects of biologically plausible learning dynamics (Appendix 7).

Finally, it is worth mentioning that the Dale's backpropagation update is guaranteed to find the weights $W_D$ that are the closest projection of $W$ under the sign constraints (Theorem 5), with respect to the Frobenius norm (Appendix 6.3). Other methods that are ideologically similar to ours choose to enforce Dale's law by always using a ReLU activation function and consequently constraining the post-synaptic weights to be positive (or negative) for excitatory (or inhibitory) neurons [50, 45] by multiplying these non-negative weights with a mask composed of $\pm 1$ entries. While this method (which we call rectified backprop) is equivalent to ours in the case of weights projecting from excitatory neurons, it essentially reverses the sets of weights that are kept vs. zeroed out in the case of inhibitory neurons. In terms of the final update, the new weights learnt by rectified backprop are therefore much farther from those originally computed by conventional backpropagation (Corollary 6).

2.2. Dale’s backpropagation: Theoretical Results

We now present our key mathematical analyses for Dale's backprop when it utilizes gradient descent as its optimizer. First, we derive the rate of convergence of the algorithm under the assumption of restricted optima, i.e., when we can assume that the optimal set of parameters has the same sign pattern as the one imposed. Second, we quantify the differences between Dale's backprop and standard backpropagation, both in terms of the weights learnt and the final solutions obtained. Together, these results establish a solid theoretical foundation for Dale's backprop, demonstrating its ability to learn effectively and efficiently, thus validating its use in modeling neural data.

2.2.1. Analyzing convergence of Dale’s backpropagation under the restricted optimum assumption

We start by examining the behavior of Dale’s backpropagation algorithm under the assumption that the optimal set of parameters for a task shares the same sign pattern as the one imposed – a condition we refer to as the restricted optima assumption – and show that despite having to learn with constraints, under this assumption, Dale’s backprop converges linearly to the optimal solution (Theorem 2). Biologically, this assumption mirrors the idea that the arrangement of excitatory and inhibitory neurons in the network is optimized for such tasks.

Proving this theorem relies on the geometric observation that the restricted optima assumption ensures that the globally optimal set of weights ($W^*$) lies within the same orthant as our point of initialization ($W^{(0)}$). We subsequently prove optimal sign pattern preservation (Lemma 1), which guarantees that no backpropagation iteration leaves this orthant, implying the signs of the weights remain constant throughout the optimization process. As a result, Dale's backpropagation behaves identically to unconstrained gradient descent within this orthant, making the projection step redundant since the optimization path does not approach the boundaries of the orthant. Consequently, the algorithm can take the most direct path to the optimum without any detours induced by constraint enforcement, allowing it to achieve a linear convergence rate under the Polyak-Łojasiewicz condition.

The statements of our results are presented below, with full proofs deferred to the supplement (Appendix 8.1).

Lemma 1 (Optimal sign pattern preservation).

Let the vector of learnt weights be $W \in \mathbb{R}^n$ with components $w_j$, where $j \in \{1, 2, \ldots, n\}$. Let $L$ be the Lipschitz constant of the gradients $\nabla\mathcal{L}(W)$, where $\mathcal{L}$ is a loss function.

Given a gradient descent-based, component-wise sign-preserving learning rule that uses the projection operator $\mathcal{P}_{\mathcal{C}} : \mathbb{R}^n \to \mathbb{R}^n$ defined as

$$\mathcal{P}_{\mathcal{C}}(w_j) = \begin{cases} w_j & \text{if } \operatorname{sign}(z_j) = \operatorname{sign}(w_j) \\ 0 & \text{if } \operatorname{sign}(z_j) \neq \operatorname{sign}(w_j) \end{cases}$$

where $z_j = w_j - \frac{1}{L} \nabla\mathcal{L}(w_j)$, $\operatorname{sign}(z_j) = \frac{z_j}{|z_j|}$ for $z_j \neq 0$, and $\operatorname{sign}(0) = 0$: if $\operatorname{sign}(W^*) = \operatorname{sign}(W^{(0)})$, where $W^*$ is the set of weights that achieves the optimal loss on $\mathcal{L}$, it holds for any iteration $i$ of regular gradient descent that

$$\operatorname{sign}(W^{(i)}) = \operatorname{sign}(W^{(0)}) = \operatorname{sign}(W^*) \; \forall i, \quad \text{and} \quad \mathcal{P}_{\mathcal{C}}(w_j) = w_j \text{ for } j \in \{1, 2, \ldots, n\}.$$

Theorem 2 (Convergence of Dale’s Backpropagation).

Let $\mathcal{L}$ be a loss function satisfying the $\mu$-Polyak-Łojasiewicz condition, with gradients that are $L$-Lipschitz such that $L \geq \mu > 0$. Consider the sequence of weights $\{W_D^{(i)}\}$ generated according to the Dale's backpropagation update, with a step size of $\frac{1}{L}$. Given an optimal loss $\mathcal{L}^* = \mathcal{L}(W^*)$, where $W^* = \operatorname{argmin} \mathcal{L}(W_D)$ has the same sign pattern as all $W_D^{(i)}$, and a specified error $\varepsilon > 0$, it holds for iteration $i$ that

$$\mathcal{L}(W_D^{(i)}) - \mathcal{L}^* \leq \varepsilon \quad \text{when} \quad i \geq \frac{\log\left((\mathcal{L}(W_D^{(0)}) - \mathcal{L}^*)/\varepsilon\right)}{\log\left(L/(L - \mu)\right)}$$

Notably, our analysis reveals that under the restricted optima condition, Dale's backpropagation achieves a linear convergence rate which matches that of unconstrained backpropagation [65]. This is significant, as it demonstrates that under the right conditions, imposing biological constraints through Dale's principle does not necessarily come at the cost of convergence speed. Furthermore, it also suggests that the brain's neural circuitry, despite being constrained by Dale's law, might be functionally organized to facilitate efficient learning and task performance. However, it is important to note that these guarantees rely not only on the restricted optima assumption, but also on the gradients satisfying Lipschitzness and the Polyak-Łojasiewicz condition, which from a mathematically rigorous perspective may not always hold in practice.
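To make the rate concrete, consider a worked instance of the bound in Theorem 2 with illustrative constants of our own choosing (not taken from the paper):

```latex
% Assume L = 10, \mu = 1, an initial suboptimality
% \mathcal{L}(W_D^{(0)}) - \mathcal{L}^* = 1, and a target error \varepsilon = 10^{-3}.
i \;\geq\; \frac{\log\left(1 / 10^{-3}\right)}{\log\left(10 / (10 - 1)\right)}
  \;=\; \frac{\log 10^{3}}{\log (10/9)}
  \;\approx\; \frac{6.91}{0.105}
  \;\approx\; 66
```

That is, roughly 66 iterations suffice to reach a suboptimality of $10^{-3}$ under these assumed constants, with the iteration count growing only logarithmically in $1/\varepsilon$, as expected of a linear (geometric) rate.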

2.2.2. Analyzing Dale’s backpropagation relative to conventional backpropagation

Dale's backprop also lends itself well to analysis of its behaviour relative to standard backpropagation when we do not make the restricted optima assumption. Specifically, in the case of a single-layer recurrent neural network (without biases), we can characterize the distance between the weights found using standard backprop and Dale's backprop (Lemma 3), and subsequently the distances between the outputs found using the two weight update schemes, allowing us to bound the final error of the solution found using Dale's backprop in terms of that found using standard backprop (Theorem 4). Formally, we express the above as follows:

Lemma 3 (Distance between learnt weights).

Let $W^{(i)}$ and $W_D^{(i)}$ be the weights at iteration $i$ for standard backpropagation and Dale's backpropagation, respectively. Assume the gradients $\nabla\mathcal{L}(W)$ and $\nabla\mathcal{L}(W_D)$ are upper bounded in magnitude by $G$ and Lipschitz continuous with constant $L$. Then, the distance between the two sets of weights at any iteration $i$, denoted $\|\delta^{(i)}\|_2 = \|W^{(i)} - W_D^{(i)}\|_2$, is bounded by:

$$\|\delta^{(i)}\|_2 \leq \frac{G}{L}\left((1 + \eta L)^i - 1\right)$$

where $\eta$ is the learning rate.

Theorem 4 (Differences in errors between solutions).

Let $f(W)$ be the function represented by a single-layer RNN unrolled over $T$ timesteps, with weights $W$. Let $W_D$ be the weights learnt using Dale's backpropagation, and $W$ be the weights learnt using standard backpropagation. Assume the non-linearity $\phi$ is either tanh or ReLU. Then, the error of the solution found using Dale's backpropagation with respect to the ground truth $y$ is bounded by:

$$\|f(W_D) - y\|_2^2 \leq 2\left(\|\delta\|_2^2 \sum_{t=1}^{T} (L_{f_t})^2 + \sum_{t=1}^{T} (\varepsilon_t)^2\right)$$

where $f(W_D)$ is the output after $K$ training iterations, $\delta = \frac{G}{L}\left((1 + \eta L)^K - 1\right)$, $L_{f_t} = \max(L_{f_t}(W), L_{f_t}(W_D))$ is the maximum of the Lipschitz constants of the two RNNs at timestep $t$, and $\varepsilon_t = f_t(W) - y_t$ is the error of the solution found using conventional backpropagation at timestep $t$.

The lemma on the distance between learnt weights quantifies how Dale’s backpropagation diverges from standard backpropagation over time due to the sign constraints, showing that this divergence grows but remains bounded, influenced by factors like the learning rate, the loss landscape’s smoothness, and gradient magnitudes. Building on this, the theorem on error between solutions relates the performance of Dale’s backpropagation to standard backpropagation, indicating that the error of Dale’s method is bounded by both the divergence of the weights (bounded by the lemma) and the sensitivity of the network’s output to weight changes, alongside the error of the standard method. Together, these results provide theoretical assurances that, while biological constraints impact learning dynamics, they do not cause uncontrolled error growth, supporting the use of Dale’s principle in neural network training.

Full proofs for both the lemma and theorem are provided in the supplement (Appendix 8.2).

2.3. Dale’s backpropagation: Empirical results

We evaluate the performance of Dale's backpropagation across three tasks of interest (Fig. 2.A). The first is a 1-bit flip-flop task (Fig. 2.A - top row - left), in which the network is required to maintain and toggle between different states in response to a series of binary inputs. Specifically, the network output is meant to start at zero, following which it takes the value $\pm 1$ to match the input signal whenever one is presented. It is then expected to switch signs if presented with a signal of the opposing sign, or else maintain the same output as before. Next is a wave reconstruction task (Fig. 2.A - bottom row), where the excitatory and inhibitory neurons are each presented with individual sinusoidal waveforms. The network is tasked with accurately reconstructing both signals simultaneously, reflecting the roles of excitation and inhibition in modulating distinct aspects of signal processing in neural circuits. Finally, we also test our methods on the sequential MNIST task (Fig. 2.A - top row - right), a variation of the classical digit classification task in which, instead of receiving the entire image as the input, the network receives the rows of the image sequentially.

Figure 2: Training with Dale's Backprop.

(A) Task examples: 1-bit flip-flop, Sequential MNIST, Wave reconstruction. (B) Distribution of weights: Weight matrices post-training (top row), Weight histograms at initialization and after training (middle row), Relative divergence in weight distributions with rectified and Dale's backpropagation vs. conventional backpropagation (bottom row). (C) Test performance of models across different tasks when trained with conventional backpropagation (black), rectified backpropagation (gray), and Dale's backpropagation (green). All statistics computed over 5 independent runs.

For each of the tasks we train RNNs (over 5 runs) with 128 hidden neurons, of which ~80% (102) are excitatory and ~20% (26) are inhibitory. In addition to conventional backpropagation, we also include in our experiments "rectified backprop", a similar method from the literature [50, 45] that is also amenable to incorporating sparsity constraints simultaneously. Our experiments justify our proposed weight update and subsequent theoretical analyses, both in terms of how the weights evolve and in terms of learning performance. We start by analyzing the distribution of weights before and after training for the three learning rules (Fig. 2.B), averaged across all tasks. While we initialize all three methods identically (Fig. 2.B - middle row - left), we notice that their distributions post-training are visibly different (Fig. 2.B - middle row - right). Specifically, the weights learnt using conventional backprop (black) show the smallest peak around zero and the greatest deviation, followed by those learnt using Dale's backprop (green), and finally rectified backprop (gray). This trend is also visible in the weight matrices themselves (Fig. 2.B - top row), where the negative weights learnt using rectified backprop are practically zero (Fig. 2.B - top row - middle). At first glance the weight matrices for conventional and Dale's backprop seem almost the same, but closer inspection (black boxes – bottom right of the weight matrices) reveals that some of the weights which should have been negative have flipped signs with conventional backprop (Fig. 2.B - top row - left), whereas there are no such discrepancies with Dale's backprop (Fig. 2.B - top row - right). Finally, we empirically quantify the differences in learnt weights for the two sign-preserving methods by measuring the Kullback-Leibler (KL) divergence of their post-training weight distributions with respect to conventional backpropagation (Fig. 2.B - bottom row). We observe that across all three tasks, the divergence shown by rectified backprop (gray) from the weights learnt by conventional backprop is much higher than that shown by Dale's backprop (green).

Finally, the learning performance (Fig. 2.C) of Dale's backprop (green) matches that of conventional backprop (black) for all three tasks, at times even learning faster. We conjecture this is a consequence of the regularization introduced by adhering to the sign constraints: by restricting the optimization to the orthant where the signs are preserved, the search space is effectively reduced. This focused parameter space allows for more efficient learning dynamics, as the optimizer concentrates on adjusting weight magnitudes without expending effort on sign changes that violate the constraints. The fixed signs lead to more stable and directed weight updates, resulting in a smoother optimization landscape. Consequently, Dale's backprop can converge more rapidly in the early stages of training, while ultimately achieving similar final performance to standard backpropagation. The learning performance of rectified backprop (gray), however, is both slower and less competitive than the other two methods, especially on the more complicated sequential MNIST task. We note that this might be a consequence of the fact that rectified backprop does not allow for activation functions that have any negative outputs (e.g., tanh). This restriction likely leads to exploding gradients and dead neurons, the latter of which might also be inferred visually from its weights post-training. Dale's backpropagation does not suffer from such limitations and can use non-linearities that have negative outputs, as long as they are centered around 0, i.e., they keep positive and negative pre-activations positive and negative, respectively.

3. Sparsifying Networks via Topology-Informed Local Pruning

Having established a method that lets us train sign-constrained RNNs leveraging the machinery of autograd-based backpropagation [66, 67], in this section we describe topologically-informed probabilistic (top-prob) pruning as a way of sparsifying dense neural networks to reflect a target connectivity pattern amongst neuronal populations (Fig. 1.C). We first describe the pruning rule formally, while also motivating it from both a mathematical and neuroscientific perspective. Our subsequent empirical analyses demonstrate the applicability of our method in conjunction with Dale’s backpropagation, wherein it outperforms the random pruning baseline on different tasks.

3.1. Topologically-informed probabilistic pruning rule

Consider a weight matrix $W \in \mathbb{R}^{N \times N}$ comprised of the synaptic weights $w_{ji}$, $i, j \in \{1, 2, 3, \ldots, N\}$, connecting neuron $i \to j$. The sparsified matrix $W^{\text{sparse}} \in \mathbb{R}^{N \times N}$ is obtained using the pruning rule

$$w_{ji}^{\text{sparse}} = \begin{cases} w_{ji} & \text{with probability } \kappa |w_{ji}|, \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $\kappa \in \mathbb{R}_+$ is a non-negative scalar that controls the sparsity of the resulting matrix and is defined as

$$\kappa = \frac{(1 - s) N^2}{\|W\|_{L_1}} \tag{5}$$

where $s \in [0, 1]$ is the target sparsity of $W^{\text{sparse}}$ and $\|W\|_{L_1} = \sum_{i=1}^{N} \sum_{j=1}^{N} |w_{ji}|$.
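A minimal NumPy sketch of this rule, under our reading of Eqs. 4-5 (with retention probabilities clipped at 1; the paper's re-normalization of $\kappa$ is treated more carefully in its Appendix 9), might look as follows. Note that with this choice of $\kappa$, the expected number of surviving weights is $\sum_{j,i} \kappa |w_{ji}| = (1-s)N^2$, i.e., the target sparsity $s$ is met in expectation.

```python
import numpy as np

def top_prob_prune(W: np.ndarray, s: float, rng: np.random.Generator) -> np.ndarray:
    """One-shot top-prob pruning sketch.

    Each weight w_ji is independently retained with probability kappa * |w_ji|
    (Eq. 4), with kappa chosen so that the expected number of survivors is
    (1 - s) * N^2, i.e. target sparsity s (Eq. 5).
    """
    N = W.shape[0]
    kappa = (1.0 - s) * N**2 / np.abs(W).sum()          # Eq. 5
    keep_prob = np.clip(kappa * np.abs(W), 0.0, 1.0)    # clipping at 1 is our assumption
    mask = rng.random(W.shape) < keep_prob              # Eq. 4
    return W * mask

# e.g. prune a dense 128 x 128 matrix to ~90% sparsity
rng = np.random.default_rng(0)
W_sparse = top_prob_prune(rng.uniform(-1, 1, (128, 128)), s=0.9, rng=rng)
```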

While it is evident that the top-prob pruning rule operates by probabilistically retaining weights of higher magnitude while eliminating weaker ones, we emphasize that this approach mirrors fundamental aspects of synaptic plasticity in biological neural networks. The rule’s local nature – where pruning decisions depend solely on individual synaptic weights – aligns with biological constraints, as real neurons modify their connections based only on local synaptic properties rather than global network states. Furthermore, when coupled with Dale’s backpropagation, the top-prob pruning mechanism has a propensity for preserving exactly those weights that align with Hebbian learning principles (Appendix 7), thereby ensuring the maintenance of biologically meaningful functional connectivity whilst simultaneously achieving network sparsification.

The top-prob approach is also grounded from an ML and mathematical standpoint. Magnitude-based pruning has a long history [48, 68, 69] and, in its iterative form, is still a highly competitive empirical baseline for neural network compression via pruning [39, 40]. It also closely relates to methods that look to preserve weights that maintain the dynamics of the network in the spectral sense [49, 70, 71, 72]. Finally, it provides us with an elegant way of maintaining the structural integrity of the network. Recent works have established that the zeroth-order topological information of a graph is fully encapsulated by its maximum spanning tree (MST) [74, 75, 76, 73]. Ergo, probabilistically maintaining the higher-magnitude weights of the network preserves this topological structure (Appendix 10.1).

Throughout the remainder of this work, we use top-prob pruning in the one-shot sense [41], wherein we prune to a target sparsity level and connectivity pattern in a single step, followed by a single retraining phase to help restore the model's performance. We note, however, that our approach can just as easily be used at initialization to sub-select a sparse network pre-training, or in the iterative manner that is more typical in the ML community, especially for tasks that are more complicated and less amenable to drastic jumps in sparsity from a fully-trained dense configuration. Additional explanations of how we derive and re-normalize our hyper-parameter $\kappa$ across different contingencies, as well as how we adjust the parameter $s$, are provided in Appendices 9.1, 9.2, and 9.3, respectively.

3.2. Topologically-informed probabilistic pruning: Empirical results

We study the behaviour and performance of top-prob pruning by first examining how it impacts the weight distribution and structural integrity of the original dense network, followed by its ability to maintain functional capacity. Our results suggest that top-prob pruning does indeed preserve key network properties, leading to highly sparse yet robust models that do not require extensive retraining to regain performance.

As a preliminary check, we observe the distribution of non-zero weights (Fig. 3.A) for the dense weight matrix (black) vs. that of a matrix pruned to 90% sparsity using the top-prob pruning rule (green) and random pruning (grey). As expected, we see the dense matrix has weights that are almost uniformly distributed, since the weights were sampled from the distribution $\mathcal{U}[-\frac{1}{\sqrt{N}}, \frac{1}{\sqrt{N}}]$, and the randomly pruned network stays faithful to this distribution. The weights retained using top-prob pruning, however, are heavily skewed towards higher magnitudes, demonstrating that it successfully prioritizes stronger connections while eliminating weaker ones. We subsequently quantify the notion of "structural integrity preservation" by plotting the fraction overlap of retained weights with the MST of the dense matrix as sparsity increases (Fig. 3.B) for both pruning methods. Again, we observe that top-prob pruning shows a much higher overlap with the MST (green dots) compared to random sparsification, both in practice (grey crosses) and in theory (black crosses) (Appendix 10.2). It also shows less variation in the amount of overlap.
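The overlap statistic itself is straightforward to compute; below is a sketch with networkx under our own simplifying assumptions (treating $|w|$ as the edge weight of an undirected graph, symmetrized by the larger of the two directed magnitudes), not the paper's exact procedure.

```python
import networkx as nx
import numpy as np

def mst_overlap(W_dense: np.ndarray, W_sparse: np.ndarray) -> float:
    """Fraction of the dense matrix's MST edges that survive pruning."""
    # Symmetrize directed weights so the MST is well-defined (our choice).
    A = np.maximum(np.abs(W_dense), np.abs(W_dense).T)
    mst = nx.maximum_spanning_tree(nx.from_numpy_array(A))
    kept = sum(1 for u, v in mst.edges()
               if W_sparse[u, v] != 0 or W_sparse[v, u] != 0)
    return kept / mst.number_of_edges()
```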

Figure 3: Topologically-informed probabilistic pruning.

(A) Distribution of non-zero weights for dense (black), and sparse matrices pruned randomly (grey) and with top-prob pruning (green). (B) Fraction overlap that retained weights have with the MST of the dense matrix. (C) Errors of pruned models, without any retraining, using random (grey) and top-prob pruning (green). (D) Performance of sparsified and fine-tuned models across different tasks, when pruned randomly (grey) vs. with top-prob pruning (green). All statistics computed over 5 independent runs. * indicates p ≤ 0.05.

Moving on to the pruned network's ability to retain information and functional capacity, we note that the errors post-pruning but before fine-tuning (Fig. 3.C) are always higher for models that are pruned randomly (grey) than for those pruned with the top-prob pruning rule (when pruned to 90% sparsity). Moreover, the difference in errors becomes more significant as the task complexity increases (p = 0.022 for the flip-flop and wave reconstruction tasks, and p = 0.012 for sequential MNIST). Fine-tuning with Dale's backpropagation for ~50% of the number of epochs of the original training, while retaining the sparse structure identified via pruning, follows a similar trend (Fig. 3.D), with networks that were pruned randomly (grey) showing higher errors at every epoch of retraining than those pruned with top-prob pruning (green). While both methods lead to models that seem to eventually approach the original model's performance (dashed line) after fine-tuning, top-prob pruning consistently starts from a better initial error, converges faster to optimal performance, and shows more stable learning. We therefore conclude that our pruning rule effectively identifies and retains functionally important weights. The preserved connectivity aligns well with the network's core topology (MST), which leads to efficient and robust performance of the sparsely structured RNN in conjunction with Dale's backpropagation.

4. Application to Visual Behaviour in Mice: Functional Connectivity and Predictive Coding

Having established the efficacy of our methods in successfully constructing and training highly sparse RNNs which respect Dale’s law, we apply them to study visual behaviour in mice under the predictive coding hypothesis [57, 77]. Specifically, we model data from the Allen Institute Visual Behavior dataset [78, 79], which comprises two-photon (2p) calcium imaging recordings from mice performing a change detection task, when presented with expected and unexpected stimuli. This experimental paradigm allows us to investigate how different neuronal populations in the cortical circuit interact and process information under varying predictive contexts, shedding light on how prediction errors may be communicated across hierarchically-related cortical regions. Moreover, since our modeling framework captures these interactions while respecting anatomical connectivity and signaling constraints, our analyses reveal how functional connectivity between populations adapts to the inherent biological scaffolding to support such predictive processing, providing insights into the circuit-level implementations of predictive coding in the visual cortex.

In the following subsections we provide details of the specific experimental setup and curated dataset, followed by our model architecture and training methodology. Our results align well with previous observations made in the experimental literature studying the data, and strongly support the predictive coding hypothesis. Furthermore, by “learning” the functional connectivity under various conditions, our approach not only corroborates previous experimental results, but also gives us a way to generate new hypotheses about how different prediction violations engage distinct patterns of feedforward and feedback connectivity across cortical layers and cell types, offering novel insights into the principles governing cortical circuit organization in predictive processing.

4.1. Dataset and experimental setup

The Visual Behavior dataset [78, 79] entails a visually-guided go/no-go task in which mice are shown a continuous series of briefly presented natural images and earn water rewards by correctly reporting when the identity of the image changes [80]. Responses from the mice are collected as they are presented with two different sets of images: a familiar set (Fig. 4.B - top row) comprising images that they were trained on, and a novel set (Fig. 4.B - bottom row) that is only presented at test time, during the recordings. While the trials themselves are longitudinal, spanning multiple image changes, we restrict ourselves to modeling two full image presentations (Fig. 4.C - Top). If the identity of the second image is the same as that of the first one, we refer to the condition as no change (Fig. 4.C - second row). If the identity of the second image is different from that of the first one, we refer to it as the change condition (Fig. 4.C - third row). Both images are always from the same set, i.e., they are both either familiar or novel. In a small subset of the trials (~5%), the second image is omitted and instead replaced by a blank screen (Fig. 4.C - first row), allowing for analysis of expectation signals. We call this the omission condition. For more details on the experimental setup, see [79].

Figure 4: Dataset, network structure, and task schematics.

(A) General architecture of the CelltypeRNN. (B) Familiar and Novel image sets used for training mice on the visual change detection task. Reproduced from the Allen Institute Visual Behavior-2p dataset (open source) [78]. (C) Top - Examples of different stimuli conditions in the visual change detection task, depiction of full and half set presentation timescales. Bottom - An example of target activity (dashed black curve), dense RNN output (solid grey curve), and sparse RNN output (solid green curve) from the LM L5 Vip population. (D) Examples of inferred functional connectivity: Left - Complete neuron-to-neuron connectivity at initialization, after training with Dale's backprop, and after sparsification (and retraining) with top-prob pruning. Neurons are ordered by area (V1 followed by LM), within which they are ordered by layer (L4, L2/3, L5) and type (Pyr, Sst, Vip). Right - Example of the sparse connectivity matrix where activity is averaged by cell type in every layer (bigger circles imply higher dispersion and darker colours imply stronger connections; dispersion is computed as the fraction of standard deviations to the mean activity in the population).

For each of our conditions we consider two temporal windows. In the full-set presentation (Fig. 4.C - indicated at the bottom), we model neural activity across the entire two-image sequence (first image (250 ms), inter-stimulus interval (500 ms), second presentation/omission (250 ms), and post-stimulus interval (500 ms)), which allows us to capture the sustained dynamics underlying predictive computation across time. In contrast, the half-set presentation (Fig. 4.C - indicated at the top) models neural activity following the second presentation/omission, enabling us to isolate the transient neural responses that implement the mechanistic components of prediction and error signalling. This complementary approach provides insights into both the overarching dynamics and the immediate neural interactions that support predictive coding, and gives us the flexibility to infer both long-term and short-term functional interactions.

The complete dataset includes multi-regional 2-photon data from two hierarchically adjacent areas, VISp (i.e., primary visual cortex, or V1) and VISl (i.e., the lateromedial area, or LM). For both areas we collect recordings at depths roughly corresponding to layers 2/3, 4, and 5 of the cortical column, for excitatory, i.e., pyramidal (Pyr), neurons and two types of inhibitory neurons, viz. somatostatin (Sst) and vasoactive intestinal peptide (Vip) expressing interneurons (sampling depth distributions are provided in Appendix 11). In total, we therefore model the activities of 18 different interacting populations (Fig. 4.A).

To curate the training data for our RNNs, we compute the neuron-averaged response for every experiment corresponding to each of our individual neuronal populations (e.g., LM L5 Vip) from the Allen Institute Visual Behavior-2P dataset [78]. We then randomly sample (with replacement) 100 averaged responses from the total set of averaged responses, take their mean, and pass the same through a 1-D Gaussian filter (σ=1) to produce a single training sample (Fig. 4.C - dashed black curve). We subsequently produce 2000 such samples for each of our individual neuronal populations.
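This curation step maps directly onto a few lines of NumPy/SciPy; the sketch below is our own paraphrase of the procedure (array shapes and names are illustrative, and the toy data stands in for the real responses).

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def make_training_sample(pop_responses: np.ndarray,
                         rng: np.random.Generator,
                         n_draw: int = 100) -> np.ndarray:
    """Build one training trace for a single population.

    pop_responses: (n_experiments, T) neuron-averaged responses for one
    population (e.g. LM L5 Vip). Draw n_draw responses with replacement,
    average them, and smooth with a 1-D Gaussian filter (sigma = 1).
    """
    idx = rng.integers(0, len(pop_responses), size=n_draw)
    return gaussian_filter1d(pop_responses[idx].mean(axis=0), sigma=1.0)

# e.g. 2000 samples per population, here from toy data of shape (40, 120)
rng = np.random.default_rng(0)
toy_responses = rng.random((40, 120))
samples = np.stack([make_training_sample(toy_responses, rng) for _ in range(2000)])
```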

4.2. CelltypeRNN: Architecture and training

We model the data described above with the anatomically constrained CelltypeRNN, which replicates the inter-areal structure of the canonical cortical microcircuit with two hierarchically related cortical areas (Fig. 4.A) [81, 82, 83, 84] whilst simultaneously enforcing intra-areal lateral connectivity among different cell types within the cortical column, as established by [85]. Moreover, since the CelltypeRNN is constructed to replicate experimentally obtained response patterns in different cell populations, as specified by their cell type, cortical layer, and area, the learnt connection weights in turn represent the inferred functional interactions between the populations across the cortical circuit [20, 11] under different stimulus conditions.

We first train a dense, bias-free Elman RNN using Dale's backpropagation, following which we prune the network's recurrent connections block-wise with top-prob pruning (Fig. 4.D, left) to achieve their individual target connectivity sparsities. We subsequently fine-tune the post-pruning non-zero RNN weights to achieve an overall performance that is at least as good as that of the RNN pre-pruning (Fig. 4.C, bottom). In our specific instantiation, the ratio of Pyr:Sst:Vip neurons in every layer is 12:2:1 (making the excitatory:inhibitory neuronal ratio 4:1), which with a scaling factor of 16 gives us a total of 240 neurons per layer and 1440 overall in the model. Our lateral connectivity probabilities across populations follow experimental data [85] and are explicitly stated in Appendix 12. Longer-range inter-areal projections are sparsified to have a connection probability of 0.3, and are strictly excitatory, i.e., feedforward connections: V1 L2/3 Pyr → LM L4 Pyr, Sst, Vip; feedback connections: V1 L2/3 Pyr, Sst, Vip ← LM L5 Pyr and V1 L5 Pyr, Sst, Vip ← LM L5 Pyr.
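As an illustration, this block-wise structure can be realized as a binary mask assembled per population pair; the helper below is our own (sizes are illustrative), keeping each connection in a block with that block's target probability.

```python
import torch

def block_connectivity_mask(n_post: int, n_pre: int, p: float,
                            g: torch.Generator) -> torch.Tensor:
    """Binary mask for one population-to-population block of the recurrent
    matrix, retaining each connection independently with probability p."""
    return (torch.rand(n_post, n_pre, generator=g) < p).float()

g = torch.Generator().manual_seed(0)
# e.g. the V1 L2/3 Pyr -> LM L4 Pyr feedforward block: with the 12:2:1 ratio
# and scaling factor 16, each layer has 192 Pyr neurons; p = 0.3 as above.
ff_mask = block_connectivity_mask(192, 192, 0.3, g)
```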

In addition to the weights of the RNN – i.e., input weights $W_{hi}$ and recurrent weights $W_{hh}$ – we also have readout weights that project the recurrent RNN activity of individual neuronal populations onto their respective output spaces, using randomly initialized, fully connected linear layers. The readout weights are frozen at the time of initialization of the dense RNN itself, and remain so throughout the training procedure. By doing so, we ensure that any changes in the model's behavior come from changes in the recurrent dynamics, and not from the model "cheating" by simply adjusting its output mapping. Freezing the readout also makes it easier to interpret and compare how the internal representations and computations change across conditions. To that end, we also mask the input weight matrix $W_{hi}$ so that recurrent neurons corresponding to a specific population do not receive inputs from any other populations.

Our training objective requires each individual population to be able to reconstruct its activity predictively one timestep into the future (Fig. 4.C, bottom), giving us the loss function

$$\mathcal{L}_{\text{total}} = \frac{1}{n_{\text{pop}}} \sum_{n=1}^{n_{\text{pop}}} \sum_{t=1}^{T-1} \|x_{n,t+1} - \hat{x}_{n,t+1}\|_2^2 \tag{6}$$

where $n_{\text{pop}}$ is the number of interacting neuronal populations and $T$ is the total number of timesteps in the sequence. $x_{n,t+1}$ is the input that will be received by population $n$ at timestep $t+1$, while $\hat{x}_{n,t+1}$ is that predicted by the RNN. The loss function is kept the same during both the dense training (Fig. 4.C, bottom - grey curve) and the post-pruning fine-tuning stage (Fig. 4.C, bottom - green curve). However, we fine-tune for only half the number of epochs (50) as we train the dense network (100).
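In code, the objective of Eq. 6 amounts to comparing each timestep's prediction against the next timestep's input; a minimal PyTorch sketch (the tensor layout is our assumption) follows.

```python
import torch

def predictive_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """One-step-ahead reconstruction loss of Eq. 6.

    x:     (n_pop, T) recorded population traces.
    x_hat: (n_pop, T) RNN readouts, where x_hat[:, t] is the model's
           prediction of x[:, t + 1].
    """
    # Squared errors summed over time, averaged over populations.
    return ((x[:, 1:] - x_hat[:, :-1]) ** 2).sum(dim=1).mean()
```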

We train separate models for each of our twelve different conditions (Familiar/Novel × Change/No Change/Omission × Full Set/Half Set presentation) and compare their connection weights across various spatial scales, the results of which are discussed in the following subsection.

Our codebase to download and pre-process the data, as well as construct and train the celltypeRNN models across various conditions and timescales is made publicly available at https://hchoilab.github.io/biologicalRNNs.

4.3. Insights and results

Using the anatomically-constrained CelltypeRNN architecture, we examine how distinct cell types across the layers and hierarchy of the visual cortex communicate both expected and unexpected information, by comparing inferred connectivity patterns across different experimental conditions and timescales. By fitting the neuronal responses of interacting populations through one-step-ahead predictive modeling, we capture the dynamic temporal dependencies inherent in neural activity, and the RNN's resulting connectivity matrix serves as a functional proxy for interactions amongst populations, reflecting how signals propagate within the cortical network. Analyzing how the RNN adjusts its connectivity across varying predictive contexts and timescales provides insights into circuit-level implementations of predictive coding, particularly into prediction error communication and the modulation of feedforward and feedback pathways, over both entire stimulus sequences and immediate neural responses to prediction confirmations or violations. Our results can be broken down into three key comparisons:

Familiar No Change vs. Familiar Change (Full-set Presentation):

In the full-set presentation of familiar images, we observe significant differences in the inter-areal feedforward and feedback connections (Fig. 5.A). When the activities are averaged across layers, there is a stark increase in the projection V1 L2/3 → LM L4 when there is a change in the image compared to when there is not, suggesting that the expectation violation causes enhanced forward communication from V1 to LM (Fig. 5.A - left, middle). Likewise, the feedback projections V1 L2/3 ← LM L5 and V1 L5 ← LM L5 are strengthened as well in the change case (Fig. 5.A - left, middle). Even at the scale of cell types, we observe that the change condition leads to an increase in functional connectivity for both inter-areal feedforward and feedback projections (Fig. 5.A - right, magenta boxes). Additionally, we note that the changes are predominantly red, i.e., the inferred weights in the change condition are generally higher than those in the case of expected stimuli (Fig. 5.A - middle). This trend also holds when we compare familiar and novel stimuli (Appendix 13, Fig. 7) in both the change and no change cases, i.e., the introduction of novelty leads to increased inter/intra-area connectivity. In agreement with previous literature [86], this suggests that novelty and unexpectedness increase the brain's excitability, which in turn could facilitate plasticity and aid learning.

Figure 5: Connectivity differences across timescales and test conditions.

(A) Familiar No Change vs. Familiar Change (Full set presentation). (B) Familiar No Change vs. Familiar Change (Half set presentation). (C) Familiar Change vs. Familiar Omission (Full set presentation). All differences are computed as Second condition - First condition; blue implies higher weights in the first condition, while red indicates higher weights in the second. In all three cases, the left and middle plots are a graphical representation of the weights averaged across layers, while the rightmost plot averages weights by cell-type within each layer. Magenta boxes highlight the feedforward and feedback connections, i.e., those originating at V1 L2/3 and LM L5 respectively. Green boxes highlight all Sst-Vip interactions in L2/3 of both V1 and LM.

Familiar No Change vs. Familiar Change (Half-set Presentation):

When focusing on the half-set presentation for familiar images, however, we find a contrasting pattern of connectivity in the feedback signaling (Fig. 5.B - left, middle). While we see almost no difference in the weights V1 L5 ← LM L5 between the change and no change cases, the weights V1 L2/3 ← LM L5 are distinctly higher in the no change case than in the change case. The feedforward projection V1 L2/3 → LM L4, however, still maintains the same trend as in the full presentation case, wherein it is higher when an image change occurs than when it does not. This suggests that while feedforward communication is relatively immediate, operating over shorter timescales, propagating feedback information occurs over a longer timescale [83, 87, 88], making a case for further investigation of the role of inter-areal, cortico-cortical time-delays [89] when studying predictive coding [84]. The cell-type-specific analysis further reveals that Vip neurons in L2/3 are less inhibited by Sst neurons of the same layer in the change case in both V1 and LM (Fig. 5.B - right, green boxes), compared to the full presentation case (Fig. 5.A - right, green boxes), once again speaking to the transience of the change and also supporting the idea that Vip neurons could encode unexpectedness, as hypothesized by predictive coding theory.

Familiar Change vs. Familiar Omission:

Our setup also allows us to compare how the network processes different types of expectation violations, by contrasting the learnt weights in the case of an image change vs. an image omission. In the setting of familiar images over the length of a full set presentation (Fig. 5.C), we notice that while most of the weights are quite similar for both types of expectation violations, there is an appreciable increase in the feedback projection V1 L5 ← LM L5 in the case of image omission (Fig. 5.C - left, middle). Additionally, this increase seems to be driven by an increase in the feedback connection weight to the Vip cells in V1 L5 (Fig. 5.C - right). These observations are in agreement with experimental findings that omissions trigger signaling in Vip neurons in V1 [90]. We also note that across the RNN, weights to the Vip neurons from locally adjacent Sst neurons are reduced in the omission case, suggesting that these neurons are not as inhibited during the processing of omissions, thus potentially emphasizing their role in processing prediction violations. In the supplement, we also provide results studying the No Change vs. Omission case with both familiar and novel images (Appendix 13, Fig. 8) for fair comparison.

Collectively, our analysis demonstrates a hierarchical organization of predictive processing in the visual cortex operating over different timescales. We find that feedforward projections are consistently enhanced during all prediction violations across both long and short timescales, emphasizing their crucial role in transmitting prediction error signals (and fundamentally driving synaptic plasticity in the brain [91, 92, 93], facilitating learning and adaptation). In contrast, feedback projections are modulated by both the type of prediction error and the temporal window over which neuronal responses are modeled. Notably, the behavior of feedback projections differs when targeting different cortical layers: feedback projections to L2/3 are more prominently modulated during unviolated predictions over shorter timescales (Fig. 5.B), while feedback projections to L5 are more responsive during negative prediction errors, such as omissions of expected visual input [94] (Fig. 5.C). This differential modulation suggests that while feedforward pathways rapidly convey unexpected sensory information, feedback pathways adjust more selectively based on the context, timing, and targeted cortical layer of the prediction error. These patterns are further corroborated by our observations comparing change, no change, and omission across familiar and novel conditions during the full-set presentation (Appendix 13, Fig. 7). In particular, we note that the presentation of a novel image always increases the feedforward connectivity (and ergo the projection) from V1 L2/3 → LM L4. On the finer-scale level of cell types instead of entire layers, there is consistent, prominent involvement of Vip interneurons during prediction violations, which highlights their critical role in modulating cortical circuits in response to unexpected stimuli. Overall, our findings therefore provide circuit-level evidence supporting the predictive coding framework, illustrating how the brain dynamically adjusts its functional neural connectivity in response to varying predictive contexts, timescales, and cortical layers. The dynamic interplay of feedforward and feedback mechanisms facilitates efficient processing of sensory information, enabling the brain to anticipate and adapt to constantly changing environments.

Results for all comparisons across both the full set and half set presentations are publicly available at the project website.

5. Discussion

Our work develops methods for constructing RNNs that simultaneously incorporate two fundamental biological constraints: Dale’s law and structured sparse connectivity motifs. We provide mathematical grounding for these methods, including convergence guarantees and error bounds, demonstrating that they can match the performance of unconstrained RNNs. Empirical results on standard synthetic tasks support the efficacy of our approach, demonstrating that our biologically constrained RNNs can achieve performance comparable to conventional, unconstrained networks. Furthermore, by aligning computational models more closely with biological reality, we enhance their utility for neuroscientific research, providing tools for more accurate modeling of neural dynamics and brain function.

Our approach also differs significantly from CURBD [20], an existing method in the literature for inferring multi-regional interactions, in two key aspects. First, while CURBD successfully models neural dynamics and interactions, it does not incorporate sign constraints during training, limiting its ability to differentiate between excitatory and inhibitory cellular mechanisms. Second, and more critically, CURBD’s reliance on FORCE training makes it poorly suited to implementing experimentally-informed sparse connectivity patterns among neuronal populations. Every FORCE iteration is a dense least-squares update that does not respect the sparsity pattern of the matrix at the previous iteration; it is non-trivial to subsequently re-enforce the sparsity pattern, and the alternative of solving a recursive least-squares update for every sub-matrix defined by the sparsity pattern at each update quickly becomes computationally infeasible. These limitations motivated our development of a backpropagation-based weight update method that efficiently handles both Dale’s law constraints and structured sparsity.
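Schematically, a backpropagation-based update can accommodate both constraints precisely because they can be re-imposed after every gradient step, something the dense recursive least-squares updates of FORCE do not permit. The following sketch shows the general pattern under assumed conventions (columns index presynaptic neurons); it is a simplified illustration, not a verbatim statement of our Dale’s backprop rule.

```python
import torch

def constrained_update(W, grad, lr, sign, mask):
    """One schematic weight update respecting both constraints.
    `sign`: tensor of shape (n,), +1 for excitatory / -1 for inhibitory
    presynaptic neurons (columns); `mask`: fixed binary connectivity
    motif (1 = connection allowed)."""
    W = W - lr * grad                # ordinary gradient step
    W = W * mask                     # re-impose the structured sparsity
    # Dale's law: zero any weight whose sign conflicts with its
    # presynaptic neuron's type (a projection onto the feasible set).
    W = torch.where(W * sign < 0, torch.zeros_like(W), W)
    return W
```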

Applying our methods to the Allen Institute Visual Behavior dataset, we inferred multi-regional neuronal interactions underlying visual behavior in mice performing a change detection task. Our anatomically and physiologically constrained celltypeRNNs not only replicated the experimental data but also provided insights consistent with the theory of predictive coding. Specifically, the models revealed a dynamic interplay between feedforward and feedback mechanisms across cortical layers and cell types, capturing how the brain adjusts functional neural connectivity in response to varying predictive contexts and timescales.

We note that much of our methodological work extends readily to other deep architectures and is not, in fact, restricted to RNNs. That said, a key avenue for incorporating additional biological realism lies in how we solve the credit assignment problem. Backpropagation requires a global error signal and weight symmetry [95], motivating more biologically plausible learning rules that can still learn as effectively. One hypothesis is that local learning rules may contribute to the emergence of more modular network representations by promoting the formation of localized activity clusters, yielding deeper insights into how functional specialization arises in neural systems and its role in facilitating learning.

Furthermore, our findings highlight differential neural responses to different types of prediction errors, emphasizing the importance of the nature of the violations in shaping neural dynamics. In our study, the change in the familiar image case represents a “global oddball” – an unexpected stimulus that violates established patterns while maintaining the local context. Conversely, the omission of an expected stimulus constitutes a “local oddball”, introducing a novel scenario for the network. This distinction is significant, as recent work [96] has found that global oddballs elicit responses in non-granular layers, differing from local oddballs that evoke early responses in superficial layers 2/3, consistent with conventional predictive coding theory. Our findings align with this pattern for global oddballs but present discrepancies in the case of local oddballs (omissions). This underscores the need for further exploration into stimulus dependency in error encoding [97], and suggests that normative predictive coding computations may need to account for the type of prediction error to fully capture neural processing dynamics.

Finally, we note that an important consideration in our study is the limited scope of recorded cell types and brain regions, which poses challenges in interpreting our results. Specifically, we do not have recordings from all interacting cell types and areas that may be involved in the visual processing tasks we modeled. This limitation means that our models might capture neural responses that are correlational rather than causal, as they are based solely on the observed data from recorded populations. The absence of such data could lead to incomplete or biased representations of neural interactions, especially at the finer-grained level of cell types as opposed to the coarser level of layers, where the absence of a particular subpopulation’s influence is more easily subsumed within aggregate dynamics. To address this gap, future work could involve developing methods that account for unobserved interactions, perhaps by incorporating prior knowledge of anatomical and functional connectivity or by using computational techniques to infer missing information. Additionally, expanding experimental recordings to include more brain areas and cell types would provide a more comprehensive dataset, enabling our models to capture the full complexity of neural dynamics and leading to more causally robust conclusions.

Supplementary Material

Supplement 1

Acknowledgments

We thank Anqi Wu for insightful comments and feedback. This work was supported by the Alfred P. Sloan Foundation Fellowships in Neuroscience (to H.C.) and the National Eye Institute of the National Institutes of Health under Award Number R00 EY030840 (to H.C.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Footnotes

2

During the retraining phase, the synaptic strengths within the network are dynamically adjusted, effectively rescaling the remaining connections to maintain overall network activity and prevent neuron underutilization. This mirrors the biological process of synaptic scaling, ensuring that the network retains its capacity to learn and generalize despite the reduced number of connections.
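As a minimal illustration of the rescaling described above (under the assumptions that rows index postsynaptic neurons and that total incoming weight magnitude is the conserved quantity, neither of which is necessarily the exact mechanism in our retraining phase), one could proceed as follows:

```python
import numpy as np

def rescale_after_pruning(W_pruned, W_orig):
    """Illustrative synaptic-scaling step: rescale each neuron's surviving
    incoming weights so its total input magnitude matches the pre-pruning
    value (rows = postsynaptic neurons, assumed)."""
    old = np.abs(W_orig).sum(axis=1, keepdims=True)
    new = np.abs(W_pruned).sum(axis=1, keepdims=True)
    # avoid division by zero for neurons left with no inputs
    scale = np.divide(old, new, out=np.ones_like(old), where=new > 0)
    return W_pruned * scale
```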

3

$\|\cdot\|_2$ always corresponds to the operator norm $\|\cdot\|_{\mathrm{op}}$ induced by the 2-norm, which is the Euclidean norm if $W$ and $W_D$ are vectorized, and $\sigma_{\max}(\cdot)$, i.e., the largest singular value of the matrix, if $W$ and $W_D$ are considered in their matrix forms.

4

While the methods of [38, 63] using DANNs would allow us to train with sign constraints, adapting them to respect structured sparse motifs is unfortunately non-trivial. We therefore refrain from incorporating any comparisons with such methods in this work.

5

For a more thorough treatment of this topic, we refer the interested reader to Sections 2 & 3 of [73].

6

As well as an overall increase in the connectivity weights targeted to V1 L5 Vip neurons.

7

We note, however, that these interactions do not directly confirm the experimental results of [56, 79]: they show a coding change in the relevant populations, but not in what would seem to be the same direction, i.e., the Sst→Vip weight is reduced in the case of novelty or omission.

Contributor Information

Aishwarya H. Balwani, School of Electrical & Computer Engineering, Georgia Institute of Technology.

Alex Q. Wang, Computational Science and Engineering Program, Georgia Institute of Technology.

Farzaneh Najafi, School of Biological Sciences, Georgia Institute of Technology.

Hannah Choi, School of Mathematics, Georgia Institute of Technology.

References

  • [1]. Cohen Yarden, Engel Tatiana A, Langdon Christopher, Lindsay Grace W, Ott Torben, Peters Megan AK, Shine James M, Breton-Provencher Vincent, and Ramaswamy Srikanth. Recent advances at the interface of neuroscience and artificial neural networks. Journal of Neuroscience, 42(45):8514–8523, 2022.
  • [2]. Saxe Andrew, Nelli Stephanie, and Summerfield Christopher. If deep learning is the answer, what is the question? Nature Reviews Neuroscience, 22(1):55–67, 2021.
  • [3]. Richards Blake A, Lillicrap Timothy P, Beaudoin Philippe, Bengio Yoshua, Bogacz Rafal, Christensen Amelia, Clopath Claudia, Costa Rui Ponte, de Berker Archy, Ganguli Surya, et al. A deep learning framework for neuroscience. Nature Neuroscience, 22(11):1761–1770, 2019.
  • [4]. Kietzmann Tim C, McClure Patrick, and Kriegeskorte Nikolaus. Deep neural networks in computational neuroscience. bioRxiv, page 133504, 2017.
  • [5]. Yamins Daniel LK and DiCarlo James J. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356–365, 2016.
  • [6]. Barak Omri. Recurrent neural networks as versatile tools of neuroscience research. Current Opinion in Neurobiology, 46:1–6, 2017.
  • [7]. Yang Guangyu Robert and Wang Xiao-Jing. Artificial neural networks for neuroscientists: a primer. Neuron, 107(6):1048–1070, 2020.
  • [8]. Kaufman Matthew T, Churchland Mark M, Ryu Stephen I, and Shenoy Krishna V. Cortical activity in the null space: permitting preparation without movement. Nature Neuroscience, 17(3):440–448, 2014.
  • [9]. Perich Matthew G, Gallego Juan A, and Miller Lee E. A neural population mechanism for rapid learning. Neuron, 100(4):964–976, 2018.
  • [10]. Semedo João D, Zandvakili Amin, Machens Christian K, Yu Byron M, and Kohn Adam. Cortical areas interact through a communication subspace. Neuron, 102(1):249–259, 2019.
  • [11]. Perich Matthew G and Rajan Kanaka. Rethinking brain-wide interactions through multi-region ‘network of networks’ models. Current Opinion in Neurobiology, 65:146–151, 2020.
  • [12]. Kozachkov Leo, Ennis Michaela, and Slotine Jean-Jacques. RNNs of RNNs: Recursive construction of stable assemblies of recurrent neural networks. Advances in Neural Information Processing Systems, 35:30512–30527, 2022.
  • [13]. Sussillo David and Abbott Larry F. Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4):544–557, 2009.
  • [14]. DePasquale Brian, Cueva Christopher J, Rajan Kanaka, Escola G Sean, and Abbott LF. full-FORCE: A target-based method for training recurrent networks. PLoS ONE, 13(2):e0191527, 2018.
  • [15]. Sussillo David, Churchland Mark M, Kaufman Matthew T, and Shenoy Krishna V. A neural network that finds a naturalistic solution for the production of muscle activity. Nature Neuroscience, 18(7):1025–1033, 2015.
  • [16]. Mante Valerio, Sussillo David, Shenoy Krishna V, and Newsome William T. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84, 2013.
  • [17]. Pandarinath Chethan, O’Shea Daniel J, Collins Jasmine, Jozefowicz Rafal, Stavisky Sergey D, Kao Jonathan C, Trautmann Eric M, Kaufman Matthew T, Ryu Stephen I, Hochberg Leigh R, et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature Methods, 15(10):805–815, 2018.
  • [18]. Russo Abigail A, Khajeh Ramin, Bittner Sean R, Perkins Sean M, Cunningham John P, Abbott Laurence F, and Churchland Mark M. Neural trajectories in the supplementary motor area and motor cortex exhibit distinct geometries, compatible with different classes of computation. Neuron, 107(4):745–758, 2020.
  • [19]. Rajan Kanaka, Harvey Christopher D, and Tank David W. Recurrent network models of sequence generation and memory. Neuron, 90(1):128–142, 2016.
  • [20]. Perich Matthew G, Arlt Charlotte, Soares Sofia, Young Megan E, Mosher Clayton P, Minxha Juri, Carter Eugene, Rutishauser Ueli, Rudebeck Peter H, Harvey Christopher D, et al. Inferring brain-wide interactions using data-constrained recurrent neural network models. bioRxiv, pages 2020–12, 2020.
  • [21]. Maheswaranathan Niru, Williams Alex, Golub Matthew, Ganguli Surya, and Sussillo David. Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics. Advances in Neural Information Processing Systems, 32, 2019.
  • [22]. Sussillo David and Barak Omri. Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation, 25(3):626–649, 2013.
  • [23]. Yamins Daniel LK, Hong Ha, Cadieu Charles F, Solomon Ethan A, Seibert Darren, and DiCarlo James J. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.
  • [24]. Kell Alexander JE, Yamins Daniel LK, Shook Erica N, Norman-Haignere Sam V, and McDermott Josh H. A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron, 98(3):630–644, 2018.
  • [25]. Kubilius Jonas, Schrimpf Martin, Kar Kohitij, Rajalingham Rishi, Hong Ha, Majaj Najib, Issa Elias, Bashivan Pouya, Prescott-Roy Jonathan, Schmidt Kailyn, et al. Brain-like object recognition with high-performing shallow recurrent ANNs. Advances in Neural Information Processing Systems, 32, 2019.
  • [26]. Schrimpf Martin, Kubilius Jonas, Hong Ha, Majaj Najib J, Rajalingham Rishi, Issa Elias B, Kar Kohitij, Bashivan Pouya, Prescott-Roy Jonathan, Geiger Franziska, et al. Brain-Score: Which artificial neural network for object recognition is most brain-like? bioRxiv, page 407007, 2018.
  • [27]. Michaels Jonathan A, Schaffelhofer Stefan, Agudelo-Toro Andres, and Scherberger Hansjörg. A neural network model of flexible grasp movement generation. bioRxiv, page 742189, 2019.
  • [28]. Nayebi Aran, Bear Daniel, Kubilius Jonas, Kar Kohitij, Ganguli Surya, Sussillo David, DiCarlo James J, and Yamins Daniel L. Task-driven convolutional recurrent models of the visual system. Advances in Neural Information Processing Systems, 31, 2018.
  • [29]. Lindsay Grace W. Convolutional neural networks as a model of the visual system: Past, present, and future. Journal of Cognitive Neuroscience, 33(10):2017–2031, 2021.
  • [30]. Hassabis Demis, Kumaran Dharshan, Summerfield Christopher, and Botvinick Matthew. Neuroscience-inspired artificial intelligence. Neuron, 95(2):245–258, 2017.
  • [31]. Schaeffer Rylan, Khona Mikail, and Fiete Ila. No free lunch from deep learning in neuroscience: A case study through models of the entorhinal-hippocampal circuit. Advances in Neural Information Processing Systems, 35:16052–16067, 2022.
  • [32]. Marblestone Adam H, Wayne Greg, and Kording Konrad P. Toward an integration of deep learning and neuroscience. Frontiers in Computational Neuroscience, 10:215943, 2016.
  • [33]. Eccles John Carew. From electrical to chemical transmission in the central nervous system: the closing address of the Sir Henry Dale Centennial Symposium, Cambridge, 19 September 1975. Notes and Records of the Royal Society of London, 30(2):219–230, 1976.
  • [34]. Eavani Harini, Satterthwaite Theodore D, Filipovych Roman, Gur Raquel E, Gur Ruben C, and Davatzikos Christos. Identifying sparse connectivity patterns in the brain using resting-state fMRI. NeuroImage, 105:286–299, 2015.
  • [35]. Kaiser Marcus. Connectomes: from a sparsity of networks to large-scale databases. Frontiers in Neuroinformatics, 17:1170337, 2023.
  • [36]. Lappalainen Janne K, Tschopp Fabian D, Prakhya Sridhama, McGill Mason, Nern Aljoscha, Shinomiya Kazunori, Takemura Shin-ya, Gruntman Eyal, Macke Jakob H, and Turaga Srinivas C. Connectome-constrained networks predict neural activity across the fly visual system. Nature, pages 1–9, 2024.
  • [37]. Giacopelli Giuseppe, Tegolo Domenico, Spera Emiliano, and Migliore Michele. On the structural connectivity of large-scale models of brain networks at cellular level. Scientific Reports, 11(1):4345, 2021.
  • [38]. Cornford Jonathan, Kalajdzievski Damjan, Leite Marco, Lamarquette Amélie, Kullmann Dimitri M, and Richards Blake. Learning to live with Dale’s principle: ANNs with separate excitatory and inhibitory units. bioRxiv, pages 2020–11, 2020.
  • [39]. Frankle Jonathan and Carbin Michael. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
  • [40]. Tanaka Hidenori, Kunin Daniel, Yamins Daniel L, and Ganguli Surya. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33:6377–6389, 2020.
  • [41]. Lee Namhoon, Ajanthan Thalaiyasingam, and Torr Philip HS. SNIP: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.
  • [42]. Wang Chaoqi, Zhang Guodong, and Grosse Roger. Picking winning tickets before training by preserving gradient flow. arXiv preprint arXiv:2002.07376, 2020.
  • [43]. Han Song, Pool Jeff, Tran John, and Dally William. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28, 2015.
  • [44]. Miconi Thomas. Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks. eLife, 6:e20899, 2017.
  • [45]. Minni Sun, Li Ji-An, Moskovitz Theodore, Lindsay Grace, Miller Kenneth, Dipoppa Mario, and Yang Guangyu Robert. Understanding the functional and structural differences across excitatory and inhibitory neurons. bioRxiv, page 680439, 2019.
  • [46]. Ingrosso Alessandro and Abbott LF. Training dynamically balanced excitatory-inhibitory networks. PLoS ONE, 14(8):e0220547, 2019.
  • [47]. Nicola Wilten and Clopath Claudia. Supervised learning in spiking neural networks with FORCE training. Nature Communications, 8(1):2208, 2017.
  • [48]. LeCun Yann, Denker John, and Solla Sara. Optimal brain damage. Advances in Neural Information Processing Systems, 2, 1989.
  • [49]. Moore Eli and Chaudhuri Rishidev. Using noise to probe recurrent neural network structure and prune synapses. Advances in Neural Information Processing Systems, 33:14046–14057, 2020.
  • [50]. Song H Francis, Yang Guangyu R, and Wang Xiao-Jing. Training excitatory-inhibitory recurrent neural networks for cognitive tasks: a simple and flexible framework. PLoS Computational Biology, 12(2):e1004792, 2016.
  • [51]. Rumelhart David E, Hinton Geoffrey E, and Williams Ronald J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • [52]. Zhang Lixiao, Wang Xianwei, Cueto Ramón, Effi Comfort, Zhang Yuling, Tan Hongmei, Qin Xuebin, Ji Yong, Yang Xiaofeng, and Wang Hong. Biochemical basis and metabolic interplay of redox regulation. Redox Biology, 26:101284, 2019.
  • [53]. Tetzlaff Christian, Kolodziejski Christoph, Timme Marc, and Wörgötter Florentin. Synaptic scaling in combination with many generic plasticity mechanisms stabilizes circuit connectivity. Frontiers in Computational Neuroscience, 5:47, 2011.
  • [54]. Huttenlocher Peter R et al. Synaptic density in human frontal cortex: developmental changes and effects of aging. Brain Research, 163(2):195–205, 1979.
  • [55]. Bullmore Ed and Sporns Olaf. The economy of brain network organization. Nature Reviews Neuroscience, 13(5):336–349, 2012.
  • [56]. Garrett Marina, Manavi Sahar, Roll Kate, Ollerenshaw Douglas R, Groblewski Peter A, Ponvert Nicholas D, Kiggins Justin T, Casal Linzy, Mace Kyla, Williford Ali, et al. Experience shapes activity dynamics and stimulus coding of VIP inhibitory cells. eLife, 9:e50340, 2020.
  • [57]. Rao Rajesh PN and Ballard Dana H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87, 1999.
  • [58]. Elman Jeffrey L. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
  • [59]. Boyd Stephen P and Vandenberghe Lieven. Convex Optimization. Cambridge University Press, 2004.
  • [60]. Bertsekas Dimitri P. Nonlinear Programming. Athena Scientific, third edition, 2016.
  • [61]. Glorot Xavier and Bengio Yoshua. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
  • [62]. He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.
  • [63]. Li Pingsheng, Cornford Jonathan, Ghosh Arna, and Richards Blake. Learning better with Dale’s law: A spectral perspective. Advances in Neural Information Processing Systems, 36, 2024.
  • [64]. Hebb Donald Olding. The Organization of Behavior: A Neuropsychological Theory. Psychology Press, 2005.
  • [65]. Ye Jong Chul. Geometry of Deep Learning. Springer, 2022.
  • [66]. Paszke Adam, Gross Sam, Chintala Soumith, Chanan Gregory, Yang Edward, DeVito Zachary, Lin Zeming, Desmaison Alban, Antiga Luca, and Lerer Adam. Automatic differentiation in PyTorch. 2017.
  • [67]. Paszke Adam, Gross Sam, Massa Francisco, Lerer Adam, Bradbury James, Chanan Gregory, Killeen Trevor, Lin Zeming, Gimelshein Natalia, Antiga Luca, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  • [68]. Mozer Michael C and Smolensky Paul. Skeletonization: A technique for trimming the fat from a network via relevance assessment. Advances in Neural Information Processing Systems, 1, 1988.
  • [69]. Hanson Stephen and Pratt Lorien. Comparing biases for minimal network construction with back-propagation. Advances in Neural Information Processing Systems, 1, 1988.
  • [70]. Spielman Daniel A and Srivastava Nikhil. Graph sparsification by effective resistances. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 563–568, 2008.
  • [71]. Spielman Daniel A and Teng Shang-Hua. Spectral sparsification of graphs. SIAM Journal on Computing, 40(4):981–1025, 2011.
  • [72]. Batson Joshua, Spielman Daniel A, Srivastava Nikhil, and Teng Shang-Hua. Spectral sparsification of graphs: theory and algorithms. Communications of the ACM, 56(8):87–94, 2013.
  • [73]. Balwani Aishwarya and Krzyston Jakob. Zeroth-order topological insights into iterative magnitude pruning. In Topological, Algebraic and Geometric Learning Workshops 2022, pages 6–16. PMLR, 2022.
  • [74]. Rieck Bastian, Togninalli Matteo, Bock Christian, Moor Michael, Horn Max, Gumbsch Thomas, and Borgwardt Karsten. Neural persistence: A complexity measure for deep neural networks using algebraic topology. arXiv preprint arXiv:1812.09764, 2018.
  • [75]. Doraiswamy Harish, Tierny Julien, Silva Paulo JS, Nonato Luis Gustavo, and Silva Claudio. TopoMap: A 0-dimensional homology preserving projection of high-dimensional data. IEEE Transactions on Visualization and Computer Graphics, 27(2):561–571, 2020.
  • [76]. Lacombe Théo, Ike Yuichi, Carriere Mathieu, Chazal Frédéric, Glisse Marc, and Umeda Yuhei. Topological uncertainty: Monitoring trained neural networks through persistence of activation graphs. arXiv preprint arXiv:2105.04404, 2021.
  • [77]. Friston Karl. A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1456):815–836, 2005.
  • [78]. Visual Behavior - 2P - brain-map.org. https://portal.brain-map.org/circuits-behavior/visual-behavior-2p. (Accessed on 11/20/2024).
  • [79]. Garrett Marina, Groblewski Peter, Piet Alex, Ollerenshaw Doug, Najafi Farzaneh, Yavorska Iryna, Amster Adam, Bennett Corbett, Buice Michael, Caldejon Shiella, et al. Stimulus novelty uncovers coding diversity in visual cortical circuits. bioRxiv, pages 2023–02, 2023.
  • [80]. Groblewski Peter A, Ollerenshaw Douglas R, Kiggins Justin T, Garrett Marina E, Mochizuki Chris, Casal Linzy, Cross Sissy, Mace Kyla, Swapp Jackie, Manavi Sahar, et al. Characterization of learning, motivation, and visual perception in five transgenic mouse lines expressing GCaMP in distinct cell populations. Frontiers in Behavioral Neuroscience, 14:104, 2020.
  • [81]. Douglas Rodney J, Martin Kevan AC, and Whitteridge David. A canonical microcircuit for neocortex. Neural Computation, 1(4):480–488, 1989.
  • [82]. Mountcastle Vernon B. The columnar organization of the neocortex. Brain: A Journal of Neurology, 120(4):701–722, 1997.
  • [83]. Bastos Andre M, Usrey W Martin, Adams Rick A, Mangun George R, Fries Pascal, and Friston Karl J. Canonical microcircuits for predictive coding. Neuron, 76(4):695–711, 2012.
  • [84]. Balwani Aishwarya, Cho Suhee, and Choi Hannah. Exploring the architectural biases of the canonical cortical microcircuit. bioRxiv, 2024.
  • [85]. Campagnola Luke, Seeman Stephanie C, Chartrand Thomas, Kim Lisa, Hoggarth Alex, Gamlin Clare, Ito Shinya, Trinh Jessica, Davoudian Pasha, Radaelli Cristina, et al. Local connectivity and synaptic dynamics in mouse and human neocortex. Science, 375(6585):eabj5861, 2022.
  • [86]. Schulz Auguste, Miehl Christoph, Berry Michael J II, and Gjorgjieva Julijana. The generation of cortical novelty responses through inhibitory plasticity. eLife, 10:e65309, 2021.
  • [87]. Bosman Conrado A, Schoffelen Jan-Mathijs, Brunet Nicolas, Oostenveld Robert, Bastos Andre M, Womelsdorf Thilo, Rubehn Birthe, Stieglitz Thomas, De Weerd Peter, and Fries Pascal. Attentional stimulus selection through selective synchronization between monkey visual areas. Neuron, 75(5):875–888, 2012.
  • [88]. Semedo João D, Jasper Anna I, Zandvakili Amin, Krishna Aravind, Aschner Amir, Machens Christian K, Kohn Adam, and Yu Byron M. Feedforward and feedback interactions between visual cortical areas use different population activity patterns. Nature Communications, 13(1):1099, 2022.
  • [89]. Moon Joon-Young, Müsch Kathrin, Schroeder Charles E, Valiante Taufik A, and Honey Christopher J. Interregional delays fluctuate in the human cerebral cortex. bioRxiv, pages 2022–06, 2022.
  • [90]. Najafi Farzaneh, Russo Simone, and Lecoq Jerome. Unexpected events modulate context signaling in VIP and excitatory cells of the visual cortex. bioRxiv, pages 2024–05, 2024.
  • [91]. Hertäg Loreen and Sprekeler Henning. Learning prediction error neurons in a canonical interneuron circuit. eLife, 9:e57541, 2020.
  • [92]. Haarsma Joost, Fletcher PC, Griffin JD, Taverne HJ, Ziauddeen Hisham, Spencer TJ, Miller Chantal, Katthagen Teresa, Goodyer I, Diederen KMJ, et al. Precision weighting of cortical unsigned prediction error signals benefits learning, is mediated by dopamine, and is impaired in psychosis. Molecular Psychiatry, 26(9):5320–5333, 2021.
  • [93]. Starkweather Clara Kwon and Uchida Naoshige. Dopamine signals as temporal difference errors: recent advances. Current Opinion in Neurobiology, 67:95–105, 2021.
  • [94]. Hertäg Loreen and Clopath Claudia. Prediction-error neurons in circuits with multiple neuron types: Formation, refinement, and functional implications. Proceedings of the National Academy of Sciences, 119(13):e2115699119, 2022.
  • [95]. Richards Blake A and Lillicrap Timothy P. Dendritic solutions to the credit assignment problem. Current Opinion in Neurobiology, 54:28–36, 2019.
  • [96]. Westerberg Jacob A, Xiong Yihan S, Nejat Hamed, Sennesh Eli, Durand Séverine, Cabasco Hannah, Belski Hannah, Gillis Ryan, Loeffler Henry, Bawany Ahad, Peene Carter R, Han Warren, Nguyen Katrina, Ha Vivian, Johnson Tye, Grasso Conor, Hardcastle Ben, Young Ahrial, Swapp Jackie, Ouellete Ben, Caldejon Shiella, Williford Ali, Groblewski Peter A, Olsen Shawn R, Kiselycznyk Carly, Lecoq Jerome A, Maier Alexander, and Bastos André M. Stimulus history, not expectation, drives sensory prediction errors in mammalian cortex. bioRxiv, 2024.
  • [97]. Furutachi Shohei, Franklin Alexis D, Aldea Andreea M, Mrsic-Flogel Thomas, and Hofer Sonja B. Cooperative thalamocortical circuit mechanism for sensory prediction errors. Nature, 633:398–406, 2024.
  • [98]. Schneider Rolf. Convex Bodies: The Brunn–Minkowski Theory, volume 151. Cambridge University Press, 2013.
  • [99]. Kim Hyunjik, Papamakarios George, and Mnih Andriy. The Lipschitz constant of self-attention. In International Conference on Machine Learning, pages 5562–5571. PMLR, 2021.
  • [100]. Federer Herbert. Geometric Measure Theory. Springer, 2014.
