PLOS Computational Biology
. 2025 Dec 30;21(12):e1013789. doi: 10.1371/journal.pcbi.1013789

When predict can also explain: Few-shot prediction to select better neural latents

Kabir V Dabholkar 1,*, Omri Barak 2
Editor: Yuanning Li
PMCID: PMC12779162  PMID: 41468520

Abstract

Latent variable models serve as powerful tools to infer underlying dynamics from observed neural activity. Ideally, the inferred dynamics should align with the true ones. However, due to the absence of ground truth data, prediction benchmarks are often employed as proxies. One widely used method, co-smoothing, involves jointly estimating latent variables and predicting observations along held-out channels to assess model performance. In this study, we reveal the limitations of the co-smoothing prediction framework and propose a remedy. Using a student-teacher setup, we demonstrate that models with high co-smoothing can have arbitrary extraneous dynamics in their latent representations. To address this, we introduce a secondary metric—few-shot co-smoothing, which performs regression from the latent variables to held-out neurons in the data using fewer trials. Our results indicate that among models with near-optimal co-smoothing, those with extraneous dynamics underperform on few-shot co-smoothing compared to ‘minimal’ models devoid of such dynamics. We provide analytical insights into the origin of this phenomenon and further validate our findings on four standard neural datasets using a state-of-the-art method: STNDT. In the absence of ground truth, we suggest a novel measure to validate our approach. By cross-decoding the latent variables of all model pairs with high co-smoothing, we identify models with minimal extraneous dynamics. We find a correlation between few-shot co-smoothing performance and this new measure. In summary, we present a novel prediction metric designed to yield latent variables that more accurately reflect the ground truth, offering a significant improvement for latent dynamics inference.

Author summary

The availability of large scale neural recordings encourages the development of methods to fit models to data. How do we know that the fitted models are loyal to the true underlying dynamics of the brain? A common approach is to use prediction scores that use one part of the available data to predict another part. The advantage of predictive scores is that they are general: a wide variety of modelling methods can be evaluated and compared against each other. But does a good predictive score guarantee that we capture the true dynamics in the model? We investigate this by generating synthetic neural data from one model, fitting another model to it, ensuring a high predictive score, and then checking if the two are similar. The result: only partially. We find that the high scoring models always contain the truth, but may also contain additional ‘made-up’ features. We remedy this issue with a secondary score that tests the model’s generalisation to another set of neurons with just a few examples. We demonstrate its applicability with synthetic and real neural data.

Introduction

In neuroscience, we often have access to simultaneously recorded neurons during certain behaviors. These observations, denoted X, offer a window onto the actual hidden (or latent) dynamics of the relevant brain circuit, denoted Z [1]. Although, in general, these dynamics can be complex and high-dimensional, capturing them in a concrete mathematical model opens doors to reverse-engineering, revealing simpler explanations and insights [2,3]. Inferring a model of the Z variables, Z^, also known as latent variable modeling (LVM), is part of the larger field of system identification with applications in many areas outside of neuroscience, such as fluid dynamics [4] and finance [5].

Because we don’t have ground truth for Z, prediction metrics on held-out parts of X are commonly used as a proxy [6]. However, it has been noted that prediction and explanation are often distinct endeavors [7]. For instance, [8] use an example where ground truth is available to show how different models that all achieve good prediction nevertheless have varied latents that can differ from the ground truth. Such behavior might be expected when using highly expressive models with large latent spaces. Bad prediction with good latents is demonstrated by [9] for the case of chaotic dynamics.

Various regularisation methods on the latents have been suggested to improve the similarity of Z to the ground truth, such as recurrence and priors on external inputs [10], low-dimensionality of trajectories [11], low-rank connectivity [12,13], injectivity constraints from latent to predictions [8], low-tangling [14], and piecewise-linear dynamics [15]. However, the field lacks a quantitative, prediction-based metric that credits the simplicity of the latent representation—an aspect essential for interpretability and ultimately scientific discovery, while still enabling comparisons across a wide range of LVM architectures.

Here, we characterise the diversity of model latents achieving high co-smoothing, a standard prediction-based framework for Neural LVMs, and demonstrate potential pitfalls of this framework (see Methods for a glossary of terms). We propose a few-shot variant of co-smoothing which, when used in conjunction with co-smoothing, differentiates varying latents. We verify this approach both on synthetic data settings and a state-of-the-art method on neural data, providing an analytical explanation of why it works in simple settings.

Results

Co-smoothing: A cross-validation framework

Let $X \in \mathbb{Z}_{\geq 0}^{T \times N}$ be spiking neural activity of $N$ channels recorded over a finite window of time, i.e., a trial, and subsequently quantised into $T$ time-bins. $X_{t,n}$ represents the number of spikes in channel $n$ during time-bin $t$. The dataset $\mathcal{X} := \{X^{(i)}\}_{i=1}^{S}$, partitioned as $\mathcal{X}^{\text{train}}$ and $\mathcal{X}^{\text{test}}$, consists of $S$ trials of the experiment. The latent-variable model (LVM) approach posits that each time-point in the data $x_t^{(i)}$ is a noisy measurement of a latent state $z_t^{(i)}$.

To infer the latent trajectory $Z$ is to learn a mapping $f: X \mapsto \hat{Z}$. On what basis do we validate the inferred $\hat{Z}$? We cannot access the ground truth $Z$, so instead we test the ability of $\hat{Z}$ to predict unseen or held-out data. Data may be held out in time, e.g., predicting future data points from the past, or in space, e.g., predicting neural activities of one set of neurons (or channels) based on those of another set. The latter is called co-smoothing [6].

The set of $N$ available channels is partitioned into two: $N_{\text{in}}$ held-in channels and $N_{\text{out}}$ held-out channels. The $S$ trials are partitioned into train and test. During training, both channel partitions are available to the model; during testing, only the held-in partition is available. During evaluation, the model must generate the $T \times N_{\text{out}}$ rate-predictions $R_{:,\text{out}}$ for the held-out partition. This framework is visualised in Fig 1A.

Fig 1. Prediction framework and its relation to ground truth.

Fig 1

A. To evaluate a neural LVM with co-smoothing, the dataset is partitioned along the neurons and trials axes. B. The held-in neurons are used to infer latents $\hat{z}$, while the held-out neurons serve as targets for evaluation. The encoder $f$ and decoder $g$ are trained jointly to maximise co-smoothing $Q$. After training, the composite mapping $g \circ f$ is evaluated on the test set. C. We hypothesise that models with high co-smoothing may have an asymmetric relationship to the true system: the model representation contains the ground truth, but not vice versa. We reveal this in a synthetic student(S)-teacher(T) setting by the unequal performance of regression on the states in the two directions. $D_{u \to v}$ denotes the error of decoding model $v$'s latents $z_v$ from model $u$'s latents $z_u$.

Importantly, the encoding step, or inference of the latents, is done using the full time-window, i.e., analogous to smoothing in the control-theoretic literature, whereas the decoding step, mapping the latents to predictions of the data, is done on individual time-steps:

$\hat{z}_t = f(X_{:,\text{in}};\, t)$ (1)
$r_{t,\text{out}} = g(\hat{z}_t)$, (2)

where the subscripts ‘in’ and ‘out’ denote partitions of the neurons (Fig 1B). During evaluation, the held-out data from test trials $X_{:,\text{out}}$ is compared to the rate-predictions $R_{:,\text{out}}$ from the model using the co-smoothing metric $Q$, defined as the normalised log-likelihood:

$Q(R_{t,n}, X_{t,n}) := \frac{1}{\mu_n} \log_2\!\left(\frac{L(R_{t,n};\, X_{t,n})}{L(\bar{r}_n;\, X_{t,n})}\right)$ (3)
$Q^{\text{test}} := \sum_{n \in \text{held-out}} \sum_{i \in \text{test}} \sum_{t=1}^{T} Q(R^{(i)}_{t,n}, X^{(i)}_{t,n})$, (4)

where $L$ is the Poisson likelihood, $\bar{r}_n = \frac{1}{TS}\sum_{i}\sum_{t} X^{(i)}_{t,n}$ is the mean rate for channel $n$, and $\mu_n := \sum_{i}\sum_{t} X^{(i)}_{t,n}$ is the total number of spikes, following [6].
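To make Eqs (3) and (4) concrete, here is a minimal numpy sketch of the co-smoothing score. This is not the official benchmark implementation: the function name and array layout are our own, and the $\log x!$ term of the Poisson log-likelihood is dropped since it cancels in the ratio.

```python
import numpy as np

def co_smoothing_bits_per_spike(rates, spikes):
    """Sketch of the co-smoothing score Q (Eqs 3-4): normalised Poisson
    log-likelihood ratio against a mean-rate baseline, in bits per spike.
    rates, spikes: arrays of shape (trials, time, neurons)."""
    eps = 1e-9
    # Mean rate per neuron over all trials and time bins (the null model r_bar).
    mean_rate = spikes.mean(axis=(0, 1), keepdims=True)
    # Poisson log-likelihood up to the log(x!) term, which cancels in the ratio.
    ll_model = spikes * np.log(rates + eps) - rates
    ll_null = spikes * np.log(mean_rate + eps) - mean_rate
    # Normalise per neuron by its total spike count mu_n, then sum over neurons.
    mu = spikes.sum(axis=(0, 1))
    q_per_neuron = (ll_model - ll_null).sum(axis=(0, 1)) / (mu * np.log(2) + eps)
    return float(q_per_neuron.sum())
```

A model that predicts exactly the per-neuron mean rate scores $Q = 0$, matching the baseline described in the Fig 2 caption.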

Thus, the inference of LVM parameters is performed through the optimisation:

$f^*, g^* = \underset{f,g}{\arg\max}\; Q^{\text{train}}$ (5)

using Xtrain, without access to the test trials from Xtest. For clarity, apart from (5), we report only Qtest, omitting the superscript.

Good co-smoothing does not guarantee correct latents

It is common to assume that being able to predict held-out parts of $X$ guarantees that the inferred latent aligns with the true one [6,14,16–28]. To test this assumption, we use a student-teacher scenario where we know the ground truth. To compare how two models $(u, v)$ align, we infer the latents of both from $\mathcal{X}^{\text{test}}$, then regress from the latents of $u$ to those of $v$. The regression error is denoted $D_{u \to v}$ (i.e., $D_{T \to S}$ for teacher-to-student decoding). Contrary to the above assumption, we hypothesise that good prediction guarantees that the true latents are contained within the inferred ones (low $D_{S \to T}$), but not vice versa (Fig 1C). It is possible that the inferred latents possess additional features, unexplained by the true latents (high $D_{T \to S}$).
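The directional decoding error $D_{u \to v}$ can be estimated with a simple ridge regression between the two sets of latents. The helper below is our own sketch (the paper's exact regression details may differ); it illustrates the hypothesised asymmetry, where a student containing the teacher plus extra dimensions decodes the teacher well but not vice versa.

```python
import numpy as np

def decoding_error(z_u, z_v, reg=1e-6):
    """Sketch of D_{u->v}: error of linearly decoding model v's latents from
    model u's latents. z_u: (samples, dim_u), z_v: (samples, dim_v).
    Returns 1 - R^2 (0 = perfect decoding)."""
    Zu = np.hstack([z_u, np.ones((len(z_u), 1))])  # add a bias column
    # Ridge-regularised least squares: W = (Zu'Zu + reg*I)^-1 Zu' Zv
    W = np.linalg.solve(Zu.T @ Zu + reg * np.eye(Zu.shape[1]), Zu.T @ z_v)
    resid = z_v - Zu @ W
    ss_res = (resid ** 2).sum()
    ss_tot = ((z_v - z_v.mean(axis=0)) ** 2).sum()
    return float(ss_res / ss_tot)
```

With a "student" built as the teacher's latents plus independent noise dimensions, student-to-teacher decoding succeeds while teacher-to-student decoding fails, mirroring Fig 1C.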

We demonstrate this phenomenon in three different student-teacher scenarios: task-trained RNNs, Hidden Markov Models (HMMs) and linear Gaussian state-space models (LGSSMs). We start with RNNs, as they are a standard tool for investigating computation through dynamics in neuroscience [29], and expand upon the other models in the appendix. A 128-unit RNN teacher (Methods) is trained on a 2-bit flip-flop task, inspired by working-memory experiments. The network receives input pulses and has to maintain the identity of the last pulse (see Methods). The student is a sequential autoencoder, where the encoder $f$ is composed of a neural network that converts observations into an initial latent state, and a recurrent neural network that advances the latent state dynamics [29] (see Methods).

We generated a dataset of observations from this teacher, and then trained 30 students with latent dimensionality 3–64 on the same teacher data using gradient-based methods (see Methods). Co-smoothing scores of the students increase with latent size and are high for models with 5–15-dimensional latents (S1 Fig). Consistent with our hypothesis, the ability to decode the teacher from the student was highly correlated with the co-smoothing score (Fig 2, top left). In contrast, the ability to decode the student from the teacher shows a very different pattern. For students with low co-smoothing, this decoding is good – but meaningless. For students with high co-smoothing, there is large variability and little correlation with the co-smoothing score (Fig 2, top right). In this simple example, it would seem that one only needs to increase the dimensionality of the latent until co-smoothing saturates; this minimal value would satisfy both demands. This is not the case for real data, as will be shown below.

Fig 2. Upper panel.

Fig 2

Several students, sequential autoencoders (SAE, see Methods), are trained on a dataset generated by a single teacher, a noisy GRU RNN trained on a 2-bit flip flop (2BFF, see Methods). The Student→Teacher decoding error $D_{S \to T}$ is low and tightly related to the co-smoothing score. The Teacher→Student decoding error $D_{T \to S}$ is more varied and uncorrelated with co-smoothing. A score of $Q = 0$ corresponds to predicting the mean firing-rate for each neuron at all trials and time points. Green and red points are representative “Good” and “Bad” students respectively, whose latents are visualised below alongside the ground truth T. The visualisations are projections of the latents along the top three principal components of the data. The ground truth latents are characterised by 4 stable states capturing the $2^2$ memory values. This structure is captured in the “Good” student. The bad student also includes this structure, in addition to extraneous variability along the third component. Lower panel. The same experiment conducted with HMMs. The teacher is a nearly deterministic 4-cycle and students are fit to its noisy emissions. Dynamics in selected models are visualised. Circles represent states, and arrows represent transitions. Circle area and edge thickness reflect the fraction of visitations, or volume of traffic, after sampling the HMM over several trials. The colours also reflect the same quantity – brighter for higher traffic. Edges with values below 0.01 are removed for clarity (S5 Fig). The teacher (M = 4) is a 4-cycle. Note the prominent 4-cycles (orange) present in the good student (M = 10) and the bad student (M = 8). In the good student, the extra states are seldom visited, whereas in the bad student there is significant extraneous dynamics involving these states (dark arrows).

What is it about a student model that produces good co-smoothing with the wrong latents? It is easiest to see this in a setting with discrete latents, so we first show the HMM teacher and two exemplar students – named “Good” and “Bad” (marked by green and red arrows in S3 Fig A,B) – and visualise their states and transitions as graphs in Fig 2. The teacher is a cycle of 4 steps. The good student contains such a cycle (orange), and its initial distribution is restricted to that cycle, rendering the other states irrelevant. In contrast, the bad student also contains this cycle (orange), but its initial distribution is not consistent with the cycle, leading to an extraneous branch converging to the cycle, as well as a departure from the main cycle (both components in dark colour). Note that this does not interfere with co-smoothing, because the emission probabilities of the extra states are consistent with the true states, i.e., the emission matrix conceals the extraneous dynamics. In the RNN, we see a qualitatively similar picture, with the bad students having dynamics in task-irrelevant dimensions (Fig 2, “Bad” S).

Few-shot prediction selects better models

Because our objective is to obtain latent models that are close to the ground truth, the co-smoothing prediction scores described above are not satisfactory. Can we devise a new prediction score that will be correlated with ground truth similarity? The advantage of prediction benchmarks is that they can be optimised, and serve as a common language for the community as a whole to produce better algorithms [30].

We suggest few-shot co-smoothing as a complementary prediction score to co-smoothing, to be used on models with good scores on the latter. As in standard co-smoothing, the functions $g$ and $f$ are trained using all trials of the training data (Fig 3A). The key difference is that a separate group of $N_{k\text{-out}}$ neurons is set aside (Table 1), and only $k$ trials of these neurons are used to estimate a mapping $g': \hat{Z}_{t,:} \to R_{t,k\text{-out}}$ (Fig 3B), similar to $g$ in (2). The neural LVM $(f, g, g')$ is then evaluated on both the standard co-smoothing $Q$ using $g \circ f$ and the few-shot version $Q^k$ using $g' \circ f$ (Fig 3C).

Fig 3. Co-smoothing and few-shot co-smoothing; a composite evaluation framework for Neural LVMs.

Fig 3

A. The encoder $f$ and decoder $g$ are trained jointly using held-in and held-out neurons. B. A separate decoder $g'$ is trained to read out the $k$-out neurons using only $k$ trials, while $f$ and $g$ are frozen. C. The neural LVM is evaluated on the test set, resulting in two scores: co-smoothing $Q$ and $k$-shot co-smoothing $Q^k$.

Table 1. Dimensions of real and synthetic datasets.

Number of train and test trials $S^{\text{train}}$, $S^{\text{test}}$; time-bins per trial for co-smoothing $T$ and forward-prediction $T_{\text{fp}}$; held-in, held-out and $k$-out neurons $N_{\text{in}}$, $N_{\text{out}}$, $N_{k\text{-out}}$. In all the NLB [6] datasets, as well as the RNN dataset, we use the same set of neurons for $N_{\text{out}}$ and $N_{k\text{-out}}$.

DATASET                                  S_train  S_test  T    T_fp  N_in  N_out  N_k-out
Synthetic noisy GRU RNN (Methods) [29]   800      200     500  –     50    10     10
Synthetic HMM (Methods)                  2000     100     10   –     20    50     50
Synthetic LGSSM (S4 Fig)                 20       500     10   –     5     30     30
mc_maze_20 [35]                          1721     574     35   10    137   45     45
mc_rtt_20 [36]                           810      270     30   10    98    32     32
dmfc_rsg_20 [37]                         748      258     75   10    40    14     14
area2_bump_20 [38]                       272      92      30   10    49    16     16

For small values of $k$, the $Q^k$ scores can be highly variable. To reduce this variability, we repeat the procedure $s$ times on independently resampled sets of $k$ trials, producing $s$ estimates of $g'$, each with its own score $Q^k$. For each student $S$, we then report the average score $Q^k_S$ across the $s$ resamples. A theoretical analysis of the choice of $k$ is given in the next section, with practical guidelines provided in S2 Fig. The number of resamples $s$ is chosen empirically to ensure high confidence in the estimated average (Methods).
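The resampling procedure above can be sketched as follows, with a plain least-squares readout standing in for the Poisson GLM decoder $g'$ and an $R^2$-style score standing in for $Q^k$ (both simplifications; the function name and defaults are ours):

```python
import numpy as np

def few_shot_score(latents, spikes, k, s=10, seed=0):
    """Sketch of the k-shot procedure: resample k trials s times, fit a fresh
    linear readout g' from frozen latents to the k-out neurons, and average a
    held-out R^2 score. latents: (trials, time, d); spikes: (trials, time, n)."""
    rng = np.random.default_rng(seed)
    n_trials = latents.shape[0]
    scores = []
    for _ in range(s):
        idx = rng.choice(n_trials, size=k, replace=False)
        test = np.setdiff1d(np.arange(n_trials), idx)
        Ztr = latents[idx].reshape(-1, latents.shape[-1])
        Xtr = spikes[idx].reshape(-1, spikes.shape[-1])
        # Fit g' on the k trials only (least-squares stand-in for a Poisson GLM).
        W, *_ = np.linalg.lstsq(np.hstack([Ztr, np.ones((len(Ztr), 1))]),
                                Xtr, rcond=None)
        Zte = latents[test].reshape(-1, latents.shape[-1])
        Xte = spikes[test].reshape(-1, spikes.shape[-1])
        pred = np.hstack([Zte, np.ones((len(Zte), 1))]) @ W
        ss_res = ((Xte - pred) ** 2).sum()
        ss_tot = ((Xte - Xte.mean(0)) ** 2).sum()
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))
```

On synthetic data, latents padded with extraneous noise dimensions score worse at small $k$ than minimal latents carrying the same signal, anticipating the results below.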

To demonstrate the utility of the proposed prediction score, we return to the RNN students from Fig 2 and evaluate $Q^k_S$ for each. This score provides complementary information about the models: it is uncorrelated with standard co-smoothing (Fig 4A), and it is not merely a stricter version of co-smoothing (S6 Fig). Since we are only interested in models with good co-smoothing, we restrict attention to students satisfying $Q_S > Q_T - 10^{-3}$. Among these students, despite their nearly identical co-smoothing scores, the $k$-shot scores $Q^k_S$ are strongly correlated with the ground-truth measure $D_{T \to S}$ (Fig 4B). Together, these findings suggest that simultaneously maximising $Q_S$ and $Q^k_S$—both prediction-based objectives—produces models with low $D_{S \to T}$ and $D_{T \to S}$, yielding a more complete measure of model similarity to the ground truth.

Fig 4. Few-shot prediction selects better models.

Fig 4

A. Few-shot measures something new. Student models with high co-smoothing have highly variable 2-shot co-smoothing, which is uncorrelated to co-smoothing. Error bars reflect standard error of the mean across several few-shot regressions (see Methods). B. For the set of students with high co-smoothing, i.e., satisfying Q>0.034, 2-shot co-smoothing to held-out neurons is negatively correlated with decoding error from teacher-to-student. Green and red points represent the example “Good” and “Bad” models (Fig 2).

Why does few-shot work?

The example HMM and RNN students of Fig 2 can help us understand why few-shot prediction identifies good models. The students differ in that the bad student has more than one state corresponding to the same teacher state. Because these states produce the same output, this feature does not hurt co-smoothing. In the few-shot setting, however, the output of all states must be estimated using a limited amount of data, so the information from the same number of observations has to be distributed across more states. We make this data-efficiency argument more precise in three settings: linear regression (LR), HMMs, and binary classification prototype learning (BCPL).

In the case of LR, the teacher latent is a scalar random variable $z$ and the student latent $\hat{z}$ is a random $p$-vector, whose first coordinate is $z$ and the remaining $p-1$ coordinates are extraneous noise:

$\hat{z} := [\, z \;\; \underbrace{\xi_1 \;\; \xi_2 \;\; \cdots \;\; \xi_{p-1}}_{\text{extraneous noise}} \,]^T$, (6)

where $\xi_j \sim \mathcal{N}(0, \sigma^2_{\text{ext}})$. In other words, a single teacher state is represented by several possible student states.

Next, we model the neural data – noisy observations of the teacher latent, $x := z + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2_{\text{obs}})$. The few-shot learning is captured by minimum-norm $k$-shot least-squares linear regression:

$\hat{w} := \underset{w}{\arg\min} \left\{ \|w\|^2 : w \text{ minimises } \sum_{i=1}^{k} \left( x^{(i)} - w^T \hat{z}^{(i)} \right)^2 \right\}$, (7)

where $\|\cdot\|$ is the 2-norm.

The generalisation error of the few-shot learner is given by:

$R_k = \left\langle \left( \hat{z}^T w^* - \hat{z}^T \hat{w} \right)^2 \right\rangle_{z,\, \xi_1, \ldots, \xi_{p-1},\, \epsilon}$, (8)

where $w^* = [1\; 0\; \cdots\; 0]^T$ is the true mapping.

We solve for $R_k$ as $k, p \to \infty$ with $p/k \to \gamma \in (0, \infty)$, using the theory of [31], and demonstrate a good fit to numerical simulations at finite $p, k$ (Methods). We perform similar analyses for Bernoulli HMM latents with maximum-likelihood estimation of the emission parameters (Methods) and for BCPL [32] (Methods).

Across the three scenarios, model performance decreases with extraneous variability (Fig 5). Crucially, this difference appears at small $k$ and vanishes as $k \to \infty$. With HMMs and BCPL the decrease is gradual, while in LR there is a known critical transition at $p = k$ [31,33,34].
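A Monte-Carlo version of the LR setting (Eqs 6–8) reproduces this behaviour numerically: generalisation error grows with $\sigma_{\text{ext}}$ at small $k$ and shrinks once $k$ is large. The sketch below uses numpy's minimum-norm `lstsq` in place of the asymptotic theory of [31]; the function name and defaults are our own.

```python
import numpy as np

def lr_generalisation_error(k, p, sigma_ext, sigma_obs=0.3, n_rep=200, seed=0):
    """Monte-Carlo estimate of R_k (Eq. 8) for the student latent of Eq. (6):
    first coordinate is the teacher z, the remaining p-1 are extraneous noise."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_rep):
        z = rng.normal(size=k)
        xi = sigma_ext * rng.normal(size=(k, p - 1))
        Zhat = np.column_stack([z, xi])            # k training examples of z_hat
        x = z + sigma_obs * rng.normal(size=k)      # noisy observations (Eq. x = z + eps)
        # Minimum-norm least-squares solution (Eq. 7).
        w_hat, *_ = np.linalg.lstsq(Zhat, x, rcond=None)
        # Evaluate on fresh samples; the true readout is w* = [1, 0, ..., 0].
        z_t = rng.normal(size=1000)
        xi_t = sigma_ext * rng.normal(size=(1000, p - 1))
        Zt = np.column_stack([z_t, xi_t])
        errs.append(np.mean((Zt @ w_hat - z_t) ** 2))
    return float(np.mean(errs))
```

At $k \ll p$, larger extraneous noise forces the minimum-norm solution to spread weight onto the noise coordinates, raising the test error; at $k \gg p$ the error collapses regardless of $\sigma_{\text{ext}}$.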

Fig 5. Theoretical analysis of k-shot learner performance as a function of k and extraneous noise σext, in three different settings.

Fig 5

Points show numerical simulations and dashed lines show analytical theory. A. Hidden Markov Models (HMMs) with Bernoulli observations and the MLE estimator (Methods). B. Minimum-norm least-squares linear regression with $\sigma_{\text{obs}} = 0.3$ and $p = 50$ (main text and Methods). C. Binary classification with prototype learning (Methods).

Interestingly, the scenarios differ in the bias-variance decomposition of their performance deficits. In LR, extraneous noise leads to increased bias with identical variance (Methods, Claim 2), whereas in the HMM and BCPL, it leads to increased variance and zero bias (Methods, (28) and (52) respectively).

How does one choose the value of $k$ in practice? The intuition and theoretical results suggest that we want the smallest possible value. In real data, however, we expect many sources of noise that could make small values impractical. For instance, for low firing rates, small $k$ values can mean that some neurons will not have any spikes in $k$ trials, and thus there will be nothing to regress from. Our suggestion is therefore to use the smallest value of $k$ that allows robust estimation of few-shot co-smoothing. S2 Fig shows the effect of this choice for various datasets.
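One possible heuristic implementation of this guideline (our own sketch, not the paper's exact procedure) scans upward from $k = 1$ until every candidate neuron fires at least once in each of a set of resampled $k$-trial subsets:

```python
import numpy as np

def smallest_viable_k(spikes, n_samples=100, seed=0):
    """Return the smallest k such that, across n_samples resampled k-trial
    subsets, every neuron fires at least once (so the few-shot regression
    always has a target to fit). spikes: (trials, time, neurons)."""
    rng = np.random.default_rng(seed)
    n_trials = spikes.shape[0]
    spikes_per_trial = spikes.sum(axis=1)          # (trials, neurons)
    for k in range(1, n_trials + 1):
        ok = True
        for _ in range(n_samples):
            idx = rng.choice(n_trials, size=k, replace=False)
            if (spikes_per_trial[idx].sum(axis=0) == 0).any():
                ok = False                          # a silent neuron in this draw
                break
        if ok:
            return k
    return n_trials
```

Densely firing populations admit $k = 1$, while a neuron that is silent on many trials pushes the viable $k$ upward, matching the low-firing-rate concern above.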

SOTA LVMs on neural data

In previous sections, we showed that models with near perfect co-smoothing may possess latents with extraneous dynamics. We established this in a synthetic student-teacher setting with RNNs, HMMs and LGSSM models.

To show applicability in more realistic scenarios, we consider four datasets – mc_maze_20 [35], mc_rtt_20 [36], dmfc_rsg_20 [37] and area2_bump_20 [38] – from the Neural Latents Benchmark suite [6] (see Methods). They consist of neural activity (spikes) recorded from various cortical regions of monkeys performing specific tasks. The suffix 20 indicates that spikes were binned into 20 ms time bins. We trained several SpatioTemporal Neural Data Transformers (STNDTs) [39–42] that achieve near state-of-the-art (SOTA) co-smoothing on these datasets. We evaluate co-smoothing on a test set of trials and define the set of models with the best co-smoothing (see Methods and Table 1).

A key component of training modern neural network architectures such as STNDT is the random sweep of hyperparameters, a natural step in identifying an optimal model for a specific data set [19]. This process generates several candidate solutions to the optimisation problem (5), yielding models with similar co-smoothing scores but, as we demonstrate in this section, varying amounts of extraneous dynamics.

Two proxies for $D_{T \to S}$: cycle consistency and cross-decoding.

To reveal extraneous dynamics in the synthetic examples (RNNs, HMMs), we had access to ground truth that enabled us to directly compare the student latents to those of the teacher. With real neural data, we do not have this privilege. This limitation has been recognised in the past and a proxy was suggested [8,29,43] – cycle consistency. Instead of decoding the student latent from the teacher latent, cycle consistency attempts to decode the student latent $\hat{z}$ from the student's own rate prediction $r$. In our notation this is $D_{r \to \hat{z}}$ (Fig 6A and Methods). If the student has perfect co-smoothing (see S3 Appendix), this should be equivalent to $D_{T \to S}$, as it would ensure that teacher and student have the same rate-predictions $r$.
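A linear sketch of cycle consistency follows; the cited works may use nonlinear decoders for $g^{-1}$, and the function name and $1 - R^2$ error definition are our own. Latent dimensions that do not influence the rate predictions cannot be recovered from them, so they inflate $D_{r \to \hat{z}}$.

```python
import numpy as np

def cycle_consistency_error(rates, latents, reg=1e-6):
    """Sketch of D_{r->z_hat}: error of linearly decoding a model's own
    latents z_hat from its rate predictions r (an approximate inverse g^-1).
    rates: (samples, n), latents: (samples, d). Returns 1 - R^2."""
    R = np.hstack([rates, np.ones((len(rates), 1))])  # bias column
    # Ridge-regularised least squares for the inverse map g^-1.
    W = np.linalg.solve(R.T @ R + reg * np.eye(R.shape[1]), R.T @ latents)
    resid = latents - R @ W
    return float((resid ** 2).sum() /
                 ((latents - latents.mean(0)) ** 2).sum())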

Fig 6. Cycle consistency and cross-decoding as a proxy for distance to the ground truth in the absence of ground-truth.

Fig 6

A. Cycle consistency $D_{r \to \hat{z}}$ [8,29,43] involves learning a mapping $g^{-1}$ from the rates $r$ back to the latents $\hat{z}$ (see Methods). B. The latents of each pair of models are cross-decoded from one another. Minimal models can be fully decoded by all models, but extraneous models only by some. C. Cross-decoding matrix for SAE NODE models trained on data from the NoisyGRU (Fig 2). D, E. For models with high co-smoothing ($Q > 0.035$), the proxy metrics – the cross-decoding column average $\langle D_{u \to v} \rangle_u$ and cycle consistency $D_{r \to \hat{z}}$ – are both highly correlated with the ground truth $D_{T \to S}$.

Because we cannot rely on perfect co-smoothing, we also suggest a novel metric – cross-decoding – where we compare the models to each other. The key idea is that all high co-smoothing models contain the teacher latent. One can then imagine that each student contains a selection of several extraneous features. The best student is the one containing the fewest such features, which would imply that all other students can decode its latents, while it cannot decode theirs (Fig 6B). Instead of computing $D_{S \to T}$ and $D_{T \to S}$ as in Fig 2, we perform decoding from the latents of model $u$ to those of model $v$ ($D_{u \to v}$) for every pair of models $u$ and $v$, using linear regression and evaluating an $R^2$ score for each mapping (see Methods). In Fig 6C the results are visualised as a $U \times U$ matrix with entries $D_{u \to v}$ for all pairs of models $u$ and $v$. The ideal model $v^*$ would have no extraneous dynamics; therefore, all other models should be able to decode its latents perfectly, i.e., $D_{u \to v^*} = 0 \;\; \forall u$. Provided a large and diverse population of models, only the ‘pure’ ground truth would satisfy this condition. To evaluate how close a model $v$ is to the ideal $v^*$ we propose a simple metric: the column average $\langle D_{u \to v} \rangle_u$. This serves as a proxy for the distance to ground truth, analogous to $D_{T \to S}$ in Fig 4. We validate this procedure using the RNN student-teacher setting in Fig 6D, where we show that $\langle D_{u \to v} \rangle_u$ is highly correlated with the ground-truth measure $D_{T \to S}$. We also validate cycle consistency $D_{r \to \hat{z}}$ against $D_{T \to S}$ in the RNN setting (Fig 6E). In both cases we find a high correlation between the metrics.
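The cross-decoding procedure can be sketched as follows (our own minimal implementation with linear ridge decoders; the paper's exact regression and scoring details may differ). The model with the lowest column average is the one whose latents every other model can explain.

```python
import numpy as np

def cross_decoding_matrix(latents_list, reg=1e-6):
    """Build the U x U matrix D[u, v] = error (1 - R^2) of decoding model v's
    latents from model u's latents, then score each model by its column
    average <D_{u->v}>_u. latents_list: list of (samples, d_model) arrays."""
    U = len(latents_list)
    D = np.zeros((U, U))
    for u in range(U):
        Zu = np.hstack([latents_list[u], np.ones((len(latents_list[u]), 1))])
        # Precompute the ridge pseudo-inverse of model u's design matrix.
        G = np.linalg.solve(Zu.T @ Zu + reg * np.eye(Zu.shape[1]), Zu.T)
        for v in range(U):
            Zv = latents_list[v]
            resid = Zv - Zu @ (G @ Zv)
            D[u, v] = (resid ** 2).sum() / ((Zv - Zv.mean(0)) ** 2).sum()
    col_avg = D.mean(axis=0)   # proxy for distance to ground truth
    return D, col_avg
```

With one minimal model and several models padded with extraneous dimensions, the minimal model attains the smallest column average, as in Fig 6B.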

Having developed proxies for the ground truth, we can now correlate them with the few-shot co-smoothing $Q^{k\text{-shot}}$ to held-out neurons. Following the discussion in the previous section, we choose the smallest value of $k$ that ensures no trials with zero spikes (S2 Fig). Fig 7 shows a negative correlation of $Q^{k\text{-shot}}$ with both proxy measures $D_{r \to \hat{z}}$ and $\langle D_{u \to v} \rangle_u$ across the STNDT models in the four datasets. Moreover, regular co-smoothing $Q$ for the same models is relatively uncorrelated with these measures. As an illustration of the latents of different models, Fig 7 (bottom) shows the PCA projection of latents from two STNDT models trained on mc_maze_20. Both have high co-smoothing scores but differ in their few-shot scores $Q^{k\text{-shot}}$. We note smoother trajectories and better clustering of conditions in the model with the higher $Q^{k\text{-shot}}$. We also quantified the ability to decode behaviour from these two models, and found that the top PCs perform better in the “Good” model (S7 Fig).

Fig 7. Few-shot scores $Q^{k\text{-shot}}$ correlate with the proxies of distance to the ground truth: cycle consistency $D_{r \to \hat{z}}$ and the cross-decoding column average $\langle D_{u \to v} \rangle_u$.

Fig 7

We train several STNDT models on four neural recordings from monkeys [35–38], curated by [6], and filter for models with high co-smoothing, $Q > 0.8 \times \max(Q)$. The few-shot co-smoothing scores $Q^{k\text{-shot}}$ correlate negatively with the two proxies $D_{r \to \hat{z}}$ and $\langle D_{u \to v} \rangle_u$ (orange points), while regular co-smoothing $Q$ (turquoise points) does not (one-tailed p-values shown for p < 0.05 and *** for p < 0.001). Green and red arrows indicate the extreme models whose latents are visualised below. $Q$ values may be compared against an EvalAI leaderboard [6]. Note that we evaluate using an offline train-test split, not the true test set used for the leaderboard scores, for which held-out neuron data is not publicly accessible. (Bottom) Principal component analysis of the latent trajectories of two STNDT models trained on mc_maze_20 with similar co-smoothing scores but contrasting few-shot co-smoothing. The “Good” model scores $Q = 0.341$, $Q^{64\text{-shot}} = 0.292$ and the “Bad” model $Q = 0.342$, $Q^{64\text{-shot}} = 0.012$. The trajectories are coloured by task condition; each starts at a circle and ends at a triangle.

Discussion

Latent variable models (LVMs) aim to infer the underlying latents using observations of a target system. We showed that co-smoothing, a common prediction measure of the goodness of such models, cannot discriminate between LVMs containing only the true latents and those with additional extraneous dynamics.

We propose a complementary prediction measure: few-shot co-smoothing. After training the encoder that translates data observations to latents, we use only a few (k) trials to train a new decoder. Using several synthetic datasets generated from trained RNNs and two other state-space architectures, we show numerically and analytically that this measure correlates with the distance of model latents to the ground truth.

We demonstrate the applicability of this measure to four datasets of monkey neural recordings with a transformer architecture [39,40] that achieves near state-of-the-art (SOTA) results on all datasets. This required developing a new proxy to ground truth – cross-decoding. For each pair of models, we try to decode the latents of one from the latents of the other. Models with extraneous dynamics showed up as poor target latents on average, and vice versa.

Our work is related to a recent study that addresses benchmarking LVMs for neural data by developing benchmarks and metrics using only synthetic data – the Computation-through-Dynamics Benchmark [29]. This study similarly tackles the issue of extraneous dynamics, primarily using ground-truth comparisons and cycle consistency. Our cross-decoding metric complements cycle consistency [8,29] as a proxy for ground truth. Cycle consistency has the advantage that it is defined on single models, compared with cross-decoding, which depends on the specific population of models used. Cycle consistency has the disadvantage that it uses the rate predictions as proxies to the true dynamics. In the datasets we analysed here, both measures provided very similar results. An interesting extension would be to use the cross-decoding metric as another method to select good models. However, its computational cost is high, as it requires training a population of models and comparing them pairwise. Additionally, it is less universal and standardised than few-shot co-smoothing, as it depends on a specific ‘jury’ of models.

Several works address the issue of extraneous dynamics through regularisation of dimensionality, picking the minimal dimensional or rank-constrained model that still fits the data [8,1113]. Usually, these constraints are accompanied by poorer co-smoothing scores compared to their unconstrained competitors, and the simplicity of these constrained models often goes uncredited by standard prediction-based metrics. Classical measures like AIC [44] and BIC [45] address the issue of overfitting by penalising the number of parameters, but are less applicable given the success of overparameterised models [33]. We believe these approaches may not scale well to increasingly larger datasets [46], noting studies reporting that neural activity is not finite-dimensional but exhibits a scale-free distribution of variance [47,48]. Our few-shot co-smoothing metric, by contrast, does not impose dimensional constraints and instead leverages predictive performance on limited data to identify models closer to the true latent dynamics, potentially offering better scalability for complex, large-scale neural datasets. Furthermore, limiting the method to prediction offers other advantages. Prediction benchmarks are a common language for the community to optimise inference methods, without requiring access to the latents, which could be model-specific.

While the combination of student-teacher and SOTA results presents a compelling argument, we address a few limitations of our work. Regarding few-shot regression, while the Bernoulli HMM and linear regression scenarios have closed-form solutions, the Poisson GLM regression for SOTA models is optimised iteratively and is sensitive to the L2 hyperparameter α. In our results, we select a minimal α that is sufficient to stabilise optimisation.

A broader limitation concerns LVM architectures with varying decoder (g) parameterisations, which would in general require different few-shot learning procedures for the auxiliary decoder (g). Our results show that few-shot scores are indicative of model extraneousness when comparing models with a fixed decoder architecture. In our SOTA experiments, we use a conventional linear–exponential–Poisson decoder. However, when comparing models with substantially different decoder architectures—such as multi-layer nonlinear decoders [11] or linear–Gaussian emission models [27,49]—differences in few-shot performance may reflect strengths or weaknesses of the few-shot learning procedure in the respective setting, rather than differences in the extraneousness of the inferred latents.

Overall, our work advances latent dynamics inference in general and prediction frameworks in particular. By exposing a failure mode of standard prediction metrics, we guide the design of inference algorithms that account for this issue. Furthermore, the few-shot co-smoothing metric can be incorporated into existing benchmarks, helping the community build models that are closer to the desired goal of uncovering latent dynamics in the brain.

Methods

Glossary

  • Latent variable model (LVM) (f and g) : A function mapping neural time-series data to an inferred latent space (f). The latents can then be used to predict held-out data (g).

  • Smoothing : mapping a sequence of observations X_{1:T} to a sequence of inferred latents Ẑ_{1:T}. It is often formalised as a conditional probability p(Ẑ_{1:T} | X_{1:T}).

  • Extraneous dynamics : the notion that inferred latent variables may contain features and temporal structure not present in the true system from which the data was observed.

  • Co-smoothing (Q) : A metric evaluating LVMs by their ability to predict the activity of held-out neurons X_{1:T,out} given held-in neural activity X_{1:T,in} over a window of time. The two sets of neurons are typically random subsets of a single population.

  • Few-shot co-smoothing (Qk-shot) : A variant of co-smoothing in which the mapping from latents to held-out neurons (g) is learned from a small number of trials.

  • State-of-the-art (SOTA) : the best-performing method or model currently available in the field. This is usually based on a specific benchmark, i.e., a dataset and associated evaluation metric. In active fields the SOTA is constantly improving.

  • Cycle consistency (D_{r→ẑ}) : a measure of extraneousness of model latents as compared to their rate predictions. Computed by learning and evaluating the inverse mapping from rate predictions to latents.

  • Cross-decoding (D_{u→v}) : another measure of model extraneousness. It is evaluated on a population of models trained on the same dataset. It involves regressing from the latents of one model to those of another, for all pairs in the population. A scalar measure is then obtained for each model: the cross-decoding column mean ⟨D_{u→v}⟩_u. It reflects the average ‘decodability’ of a model by all the other models.

Student-teacher Recurrent Neural Networks (RNN)

Both teacher and student are based on an adapted version of [29]. In the following, we provide a brief description.

Teacher

We train a noisy 64-unit Gated Recurrent Unit RNN (NoisyGRU) [50] on a 2-bit flip-flop (2BFF) task [3], implemented by [29]. The GRU RNN follows standard dynamics, which we repeat here using the typical notation of GRUs. This notation is not consistent with the Results section, and we explain the relation below.

h_0 = μ + η;  η ~ N(0, 0.05) (9)
z_t = σ(W_z x_t + U_z h_{t−1} + b_z) (10)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r) (11)
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h + ξ_t);  ξ_t ~ N(0, 0.01) (12)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t, (13)

where μ, W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h are trainable parameters, ⊙ denotes the element-wise product, and η and ξ_t are noise terms. The latent used in the Results section (z) is the hidden-unit activity h. After model training, the NoisyGRU units are subsampled, centered, normalised, and rectified to give synthetic neural firing rates, which are the r of the Results section. These firing rates define a stochastic Poisson process used to generate the synthetic neural data.
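As a concrete illustration, a minimal numpy sketch of one trajectory under the dynamics (9)–(13) follows. The parameter values and input sizes are illustrative stand-ins for the trained teacher (they are not the published weights), and we read N(0, 0.05) as specifying a variance.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 64, 2  # hidden units and task inputs; sizes are illustrative

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Randomly initialised parameters stand in for the trained teacher's.
shapes = {"Wz": (N, D), "Uz": (N, N), "bz": (N,),
          "Wr": (N, D), "Ur": (N, N), "br": (N,),
          "Wh": (N, D), "Uh": (N, N), "bh": (N,)}
p = {k: rng.normal(scale=0.1, size=s) for k, s in shapes.items()}
mu = np.zeros(N)  # trainable initial state

def noisy_gru_step(h, x):
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])   # update gate, Eq (10)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])   # reset gate, Eq (11)
    xi = rng.normal(scale=np.sqrt(0.01), size=N)       # process noise, Eq (12)
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"] + xi)
    return (1 - z) * h + z * h_tilde                   # convex update, Eq (13)

h = mu + rng.normal(scale=np.sqrt(0.05), size=N)       # noisy initial state, Eq (9)
for t in range(10):
    h = noisy_gru_step(h, rng.normal(size=D))
```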

Students

The student models are sequential autoencoders (SAEs) consisting of a bidirectional GRU that predicts the initial latent state, a Neural ODE (NODE) that evolves the latent dynamics (together these form the encoder, f, in our notation), and a linear readout layer mapping the latent states to the data (the decoder, g). We train several randomly initialised models over a range of latent dimensionalities (3, 5, 8–16, 32, 64). Models are trained to minimise a Poisson negative log-likelihood reconstruction loss using the Adam optimiser [51].

Student-teacher Hidden Markov Models (HMMs)

We choose both student and teacher to be discrete-space, discrete-time Hidden Markov Models (HMMs). As a teacher model, they simulate two important properties of neural time-series data: its dynamical nature and its stochasticity. As a student model, they are perhaps the simplest LVM for time-series, yet they are expressive enough to capture real neural dynamics (Q of 0.29 for HMMs vs. 0.24 for GPFA and 0.35 for LFADS, on mc_maze_20). The HMM has a state space z ∈ {1, 2, …, M}, and produces observations (emissions in HMM notation) along neurons X, with a state transition matrix A, emission model B and initial state distribution π. More explicitly:

A_{m,l} = p(z_{t+1} = l | z_t = m)  ∀ m, l
B_{m,n} = p(x_{n,t} = 1 | z_t = m)  ∀ m, n
π_m = p(z_0 = m)  ∀ m (14)

The same HMM can serve two roles: a) data-generation by sampling from (14) and b) inference of the latents from data on a trial-by-trial basis:

ξ_{t,m}^{(i)} = f_m((X_{:,in})^{(i)}) = p(z_t^{(i)} = m | (X_{:,in})^{(i)}), (15)

i.e., smoothing, computed exactly with the forward-backward algorithm [52]. Note that although z is the latent state of the HMM, we use its posterior probability mass function ξt as the relevant intermediate representation because it reflects a richer representation of the knowledge about the latent state than a single discrete state estimate. To make predictions of the rates of held-out neurons for co-smoothing we compute:

R_{n,t}^{(i)} = g_n(ξ_t^{(i)}) = Σ_m B_{m,n} ξ_{t,m}^{(i)}. (16)
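Equations (15) and (16) can be sketched in numpy with the standard normalised forward-backward recursions for a Bernoulli HMM; the function and variable names below are ours, not from a specific library.

```python
import numpy as np

def smooth_and_predict(X_in, A, B_in, B_out, pi):
    """Smoothing (Eq 15) via normalised forward-backward, then
    held-out rate prediction (Eq 16). X_in: (T, N_in) binary array."""
    T, M = X_in.shape[0], len(pi)
    # per-step emission likelihoods p(x_t | z_t = m) under Bernoulli emissions
    lik = np.prod(np.where(X_in[:, None, :] == 1, B_in[None], 1 - B_in[None]), axis=2)
    alpha = np.zeros((T, M))
    beta = np.ones((T, M))
    alpha[0] = pi * lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                    # forward pass
        alpha[t] = (alpha[t - 1] @ A) * lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):           # backward pass
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    xi = alpha * beta
    xi /= xi.sum(axis=1, keepdims=True)      # posterior xi_{t,m}, Eq (15)
    return xi, xi @ B_out                    # rates R = sum_m B_{m,n} xi_{t,m}, Eq (16)
```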

As a teacher, we constructed a 4-state model of a noisy chain, A_{m,l} ∝ I[l = (m+1) mod M] + ε, with ε = 10^{−2}, π_m = 1/M, and B_{m,n} ~ Unif(0,1) sampled once and frozen (Fig 2, left). We generated a dataset of observations from this teacher (see Table 1).
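A minimal sketch of this teacher construction and sampling, with illustrative choices for the number of neurons and trial length:

```python
import numpy as np

rng = np.random.default_rng(0)
M, eps, N, T = 4, 1e-2, 50, 20   # states, noise; N and T are illustrative

# noisy chain: A_{m,l} ∝ I[l = (m+1) mod M] + eps, rows normalised to distributions
A = np.full((M, M), eps)
A[np.arange(M), (np.arange(M) + 1) % M] += 1.0
A /= A.sum(axis=1, keepdims=True)
pi = np.full(M, 1.0 / M)
B = rng.uniform(size=(M, N))     # Bernoulli emissions, sampled once and frozen

# sample one trial of states and spikes
z = np.zeros(T, dtype=int)
z[0] = rng.choice(M, p=pi)
for t in range(1, T):
    z[t] = rng.choice(M, p=A[z[t - 1]])
X = (rng.uniform(size=(T, N)) < B[z]).astype(int)
```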

For each student, we evaluate the few-shot score Q_{k-shot}. This involves estimating the Bernoulli emission parameters B̂_{m,k-out} given the latents ξ_{t,m}^{(i)} using (26), and then generating rate predictions for the k-out neurons using (16).

HMM training

HMMs are traditionally trained with expectation maximisation, but they can also be trained using gradient-based methods. We focus here on the latter as these are used ubiquitously and apply to a wide range of architectures. We use an existing implementation of HMMs with differentiable parameters: dynamax [53] – a library of differentiable state-space models built with jax.

We seek HMM parameters θ := (A, B_{[in,out]}, π) that minimise the negative log-likelihood loss L of the held-in and held-out neurons in the training trials:

L(θ; X_{[in,out]}^{train}) = −log p(X_{[in,out]}^{train}; θ) (17)
= −Σ_{i ∈ train} log p((X_{1:T,[in,out]})^{(i)}; θ) (18)

To find the minimum we do full-batch gradient descent on L, using dynamax together with the Adam optimiser [51].

Decoding across HMM latents

Consider two HMMs u and v, of sizes M^{(u)} and M^{(v)}, both candidate models of a dataset X. Following (15), each HMM can be used to infer latents from the data, defining encoder mappings f^u and f^v. These map a single trial i of the data, (X_{:,in})^{(i)} ∈ X, to (ξ_t^{(i)})^u and (ξ_t^{(i)})^v.

Since HMM latents are probability mass functions, we do not use linear regression to learn the mappings across model latents. Instead, we perform a multinomial regression from (ξ_t^{(i)})^u to (ξ_t^{(i)})^v:

p_t^{(i)} = h((ξ_t^{(i)})^u) (19)
h(ξ) = σ(Wξ + b) (20)

where W ∈ R^{M^{(v)} × M^{(u)}}, b ∈ R^{M^{(v)}} and σ is the softmax. During training we sample states from the target PMFs, (ẑ_t^{(i)})^v ~ (ξ_t^{(i)})^v, thus arriving at a more well-known problem scenario: classification of M^{(v)} classes. We optimise W and b to minimise a cross-entropy loss to the target (ẑ_t^{(i)})^v using the fit() method of sklearn.linear_model.LogisticRegression.

We define decoding error, as the average Kullback-Leibler divergence DKL between target and predicted distributions:

D_{u→v} := (1 / (S_{test} T)) Σ_{i ∈ test} Σ_{t=1}^{T} D_{KL}(p_t^{(i)} ‖ (ξ_t^{(i)})^v) (21)

where DKL is implemented with scipy.special.rel_entr.
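A small helper illustrating the decoding error of (21) with scipy.special.rel_entr; the array layout below is an assumption for illustration.

```python
import numpy as np
from scipy.special import rel_entr

def kl(p, q):
    """D_KL(p || q) for discrete PMFs; rel_entr computes p * log(p / q) element-wise."""
    return rel_entr(np.asarray(p, float), np.asarray(q, float)).sum()

def decoding_error(pred, target):
    """Eq (21): average KL divergence between predicted and target state posteriors.
    pred, target: (n_trials, T, M) arrays of PMFs over the M states."""
    M = pred.shape[-1]
    flat_p, flat_q = pred.reshape(-1, M), target.reshape(-1, M)
    return np.mean([kl(p, q) for p, q in zip(flat_p, flat_q)])
```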

In the corresponding Results section and Fig 1, the data X is sampled from a single teacher HMM, T, and we evaluate D_{T→S} and D_{S→T} for each student, denoted simply as S.

Analysis of LVMs without access to ground truth

We denote the set of high co-smoothing models, those satisfying Q > 0.034 for Fig 4 and Q > 0.8 × Q_{best model} in Fig 7, as F := {(f_u, g_u)}_{u=1}^{U}, with f_u and g_u the encoders and decoders respectively. Note that STNDT is a deep neural network given by the composition g ∘ f, and the choice of intermediate layer whose activity is deemed the ‘latent’ Z is arbitrary. Here we take g to be the last ‘read-out’ layer and f to represent all the layers up to g.

Few-shot co-smoothing

To perform few-shot co-smoothing, we learn an auxiliary decoder that takes the same form as g: a Poisson Generalised Linear Model (GLM) for each held-out neuron. We use sklearn.linear_model.PoissonRegressor, which has a hyperparameter alpha, the amount of l2 regularisation. For the results in the main text, Q_{k-shot} in Fig 7, we select α = 10^{−3}. We partition the training data into several random subsets of k trials and train an independently initialised GLM on each subset. Each GLM is then evaluated on a fixed test set of trials (Fig 3), yielding a score for each subset. We report the mean over 5 × S_{train}/k such repetitions, Q_{k-shot}, along with the standard error of the mean (error bars in Fig 4, Fig 7). Scores are more variable at small k, so we need more repetitions to better estimate the average score. To implement this in a standardised way, we incorporate this chunking of data into several subsets in the nlb_tools library (Data and code availability). This way we ensure that all models are trained and tested on identical subsets. We report the compute time for few-shot co-smoothing in S2 Appendix.
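A sketch of the few-shot regression step on synthetic latents and spike counts follows; the sizes and the synthetic generative model are illustrative assumptions, but the estimator (a Poisson GLM with a small l2 penalty) matches the one described above.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
M, k, T = 8, 10, 25                           # latent dim, shots, time bins (illustrative)
Z = rng.normal(size=(k * T, M))               # inferred latents, flattened over trials and time
w_true = rng.normal(scale=0.3, size=M)
y = rng.poisson(np.exp(Z @ w_true - 1.0))     # synthetic spike counts of one k-out neuron

# one GLM per held-out neuron, with a small l2 penalty (alpha) as in the main text
glm = PoissonRegressor(alpha=1e-3, max_iter=1000)
glm.fit(Z, y)
rates = glm.predict(Z)                        # predicted firing rates used for scoring
```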

Cross-decoding

We perform cross-decoding from the latents of model u, (Z_{t,:})^u, to those of model v, (Z_{t,:})^v, for every pair of models u and v, using a linear mapping h(z) := Wz + b implemented with sklearn.linear_model.LinearRegression:

(Ẑ_{t,:}^{(i)})^v = h_{u→v}((Z_{t,:}^{(i)})^u) (22)

minimising a mean squared error loss. We then evaluate an R² score (sklearn.metrics.r2_score) of the predictions (Ẑ)^v against the target (Z)^v for each mapping. We define the decoding error D_{u→v} := 1 − (R²)_{u→v}. The results are accumulated into a U × U matrix (see Fig 6).
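The asymmetry that cross-decoding is designed to expose can be illustrated on synthetic latents: a model with extraneous dimensions can decode a minimal model, but not vice versa. The construction below is a toy example, not the STNDT pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def cross_decoding_error(Z_u, Z_v):
    """D_{u->v} = 1 - R^2 of the linear map (22) from model u's latents to model v's."""
    h = LinearRegression().fit(Z_u, Z_v)
    return 1.0 - r2_score(Z_v, h.predict(Z_u))

rng = np.random.default_rng(0)
core = rng.normal(size=(500, 3))                    # shared 'true' latents
Z_u = np.hstack([core, rng.normal(size=(500, 4))])  # model u: core + extraneous dimensions
Z_v = core @ rng.normal(size=(3, 3))                # model v: invertible mixing of the core

D_uv = cross_decoding_error(Z_u, Z_v)  # low: v is fully decodable from u
D_vu = cross_decoding_error(Z_v, Z_u)  # high: u's extraneous dims are not decodable from v
```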

Cycle consistency

We evaluate cycle consistency [8,29] for a model u using a linear mapping from its rate predictions R back to its latents Ẑ, also implemented with sklearn.linear_model.LinearRegression:

(Ẑ_{t,:}^{(i)})^u = h_{r→ẑ}((R_{t,out}^{(i)})^u), (23)

again minimising a squared error loss. As in cross-decoding, we evaluate the R² score (sklearn.metrics.r2_score) and define the decoding error D_{r→ẑ} := 1 − (R²)_{r→ẑ} (Fig 6A).

Summary of Neural Latent Benchmark (NLB) datasets

Here are brief descriptions of the datasets used in this study. All datasets were collected from macaque monkeys performing sensorimotor or cognitive tasks. More comprehensive details can be found in the Neural Latents Benchmark paper [6].

  • mc_maze [35] Motor cortex recordings during a delayed reaching task where monkeys navigated around virtual barriers to reach visually cued targets. The task involved 108 unique maze configurations, with several repeated trials for each one, thus serving as a “neuroscience MNIST”. We choose this dataset to visualise the latents in Fig 7.

  • mc_rtt [36] Motor cortex recordings during naturalistic, continuous reaching toward randomly appearing targets without imposed delays. The task lacks trial structure and includes highly variable movements, emphasizing the need for modeling unpredictable inputs and non-autonomous dynamics.

  • dmfc_rsg [37] Recordings from dorsomedial frontal cortex during a time-interval reproduction task, where monkeys estimated and reproduced time intervals between visual cues using eye or hand movements. The task involves internal timing, variable priors, and mixed sensory-motor demands.

  • area2_bump [38] Somatosensory cortex recordings during a visually guided reach task in which unexpected mechanical bumps to the limb occurred in half of the trials. The task probes proprioceptive feedback processing and requires modeling input-driven neural responses.

Dimensions of datasets

We analyse several datasets in this work: three synthetic datasets generated by an RNN, an HMM (Methods, Fig 2) and an LGSSM (S4 Fig), and the four datasets from the Neural Latents Benchmark (NLB) suite [6,35–38]. In Table 1, we summarise the dimensions of all these datasets. To evaluate k-shot on the existing SOTA methods while maintaining the NLB evaluations, we conserved the forward-prediction aspect. During model training, models output rate predictions for T_{fp} future time bins in each trial, i.e., (1) and (2) are evaluated for 1 ≤ t ≤ T_{fp} while the input remains X_{1:T,in}. Although we do not discuss the forward-prediction metric in our work, we note that the SOTA models receive gradients from this portion of the data.

In all the NLB datasets as well as the RNN dataset we reuse held-out neurons as k-out neurons. We do this to preserve NLB evaluation metrics on the SOTA models, as opposed to re-partitioning the dataset resulting in different scores from previous works. This way existing co-smoothing scores are preserved and k-shot co-smoothing scores can be directly compared to the original co-smoothing scores. The downside is that we are not testing the few-shot on ‘novel’ neurons. Our numerical results (Fig 7) show that our concept still applies.

Theoretical analysis of few-shot learning in HMMs.

Consider a student-teacher scenario as above. We let T = 2 and use a stationary teacher, z_1^{(i)} = z_2^{(i)}. Now consider two examples of inferred students. To ensure a fair comparison, we use two latent states for both students. In the good student, ξ, the two state posteriors do not depend on time, and therefore it has no extraneous dynamics. In contrast, the bad student, μ, uses one state for the first time step and the other for the second. A particular example of such students is:

ξ_t = [0.5, 0.5]^T  ∀ t ∈ {1, 2} (24)
μ_{t=1} = [1, 0]^T;  μ_{t=2} = [0, 1]^T (25)

where each vector corresponds to the two states, and we only consider two time steps.

We can now evaluate the maximum likelihood estimator of the emission matrix from k trials for both students. In the case of Bernoulli HMMs, the maximum likelihood estimate of g given a fixed f and k trials has a closed form:

B̂_{m,n} = [ Σ_{i ∈ k-shot trials} Σ_{t=1}^{T} I[X_{t,n}^{(i)} = 1] ξ_{t,m}^{(i)} ] / [ Σ_{i ∈ k-shot trials} Σ_{t=1}^{T} ξ_{t,m}^{(i)} ]  ∀ 1 ≤ m ≤ M and n ∈ k-out neurons (26)
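Equation (26) in numpy, checked on the 'good' student of (24), for which the estimate reduces to the empirical mean spike count; the array shapes are our convention.

```python
import numpy as np

def bernoulli_mle(X, xi):
    """Closed-form MLE of Bernoulli emissions, Eq (26).
    X: (k, T, N) binary spikes of k-out neurons; xi: (k, T, M) state posteriors."""
    num = np.einsum('itn,itm->mn', X.astype(float), xi)  # sum_i sum_t I[x=1] xi_{t,m}
    den = xi.sum(axis=(0, 1))[:, None]                   # sum_i sum_t xi_{t,m}
    return num / den

rng = np.random.default_rng(0)
k, T, N, B_star = 200, 2, 1, 0.4
X = (rng.uniform(size=(k, T, N)) < B_star).astype(int)
xi_good = np.full((k, T, 2), 0.5)       # the 'good' student of Eq (24)
B_hat = bernoulli_mle(X, xi_good)       # both rows equal the empirical mean spike count
```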

We consider a single neuron, and thus omit n, reducing the estimates to:

B̂_1(ξ) = B̂_2(ξ) = 0.5(C_1 + C_2)/(0.5 kT);  B̂_1(μ) = C_1/k;  B̂_2(μ) = C_2/k (27)

where C_t is the number of trials with x = 1 at time t among the k trials. We see that C_t is a sum of k i.i.d. Bernoulli random variables (RVs) with the teacher parameter B*, for both t = 1, 2.

Thus, B̂_m(ξ) and B̂_m(μ) are scaled binomial RVs with the following statistics:

E[B̂_1(ξ)] = E[B̂_2(ξ)] = B*;  E[B̂_1(μ)] = E[B̂_2(μ)] = B*
Cov[B̂(ξ)] = (1/(2k)) B*(1 − B*) [[1, 1], [1, 1]];  Cov[B̂(μ)] = (1/k) B*(1 − B*) [[1, 0], [0, 1]] (28)
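The covariance statement (28) can be checked by Monte Carlo: the factor-two gap in estimator variance between the two students is what drives their few-shot difference. The parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B_star, k, T, reps = 0.3, 50, 2, 20000
X = (rng.uniform(size=(reps, k, T)) < B_star)  # one neuron's spikes, many repetitions

C = X.sum(axis=1)                 # C_t: spike counts at each time, shape (reps, T)
B_good = C.sum(axis=1) / (k * T)  # good student's estimate pools both time bins, Eq (27)
B_bad = C[:, 0] / k               # bad student's B_1 uses time bin 1 only, Eq (27)

# Eq (28): Var[B_good] = B*(1-B*)/(2k), Var[B_bad] = B*(1-B*)/k
var_good, var_bad = B_good.var(), B_bad.var()
```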

The test loss is given by L(B̂) = E (1/T) Σ_t log p(X_t^{(i)}; B̂) = (1/T) Σ_t [ B* log(R_t) + (1 − B*) log(1 − R_t) ]. For ξ, R_t = 0.5(B̂_1 + B̂_2) for both values of t, and for μ, R_1 = B̂_1 and R_2 = B̂_2. Ultimately,

L_ξ(B̂(ξ)) = (1/T) Σ_t [ B* log(0.5(B̂_1 + B̂_2)) + (1 − B*) log(1 − 0.5(B̂_1 + B̂_2)) ] (29)
L_μ(B̂(μ)) = (1/T) Σ_t [ B* log(B̂_t) + (1 − B*) log(1 − B̂_t) ] (30)

To see how these variations affect the test log-likelihood L of the few-shot regression on average, we do a Taylor expansion around B*, recognising that the function is maximised at B*, so ∂L/∂B |_{B*} = 0:

E_{B̂_k} L(B̂_k) = E_{B̂_k}[ L(B*) + ½ (B̂_k − B*)ᵀ (∂²L/∂B²)|_{B*} (B̂_k − B*) + … ] (31)
≈ L(B*) + E_{B̂_k} ½ (B̂_k − B*)ᵀ (∂²L/∂B²)|_{B*} (B̂_k − B*) (32)
= L(B*) + ½ (E[B̂_k] − B*)ᵀ (∂²L/∂B²)|_{B*} (E[B̂_k] − B*)  [bias]  + ½ Tr[ Cov(B̂_k) (∂²L/∂B²)|_{B*} ]  [variance] (33)

This second-order truncation of the log-likelihood decomposes into a bias and a variance term. The bias term vanishes because the estimator is unbiased (28). To compute the variance term, we compute the Hessians, which differ for the two models:

(∂²L_ξ/∂B²)|_{B*} = −(η/4) [[1, 1], [1, 1]],  (∂²L_μ/∂B²)|_{B*} = −(η/2) [[1, 0], [0, 1]], (34)

where η = 1/(B*(1 − B*)).

Incorporating these hessians into (33), we obtain:

E_{B̂_k} L_ξ(B̂_k(ξ)) ≈ L(B*) − (1/(8k)) Tr[[2, 2], [2, 2]] = L(B*) − 1/(2k), (35)
E_{B̂_k} L_μ(B̂_k(μ)) ≈ L(B*) − (1/(2k)) Tr[[1, 0], [0, 1]] = L(B*) − 1/k. (36)

Fig 5A shows these analytical results against the left-hand sides of (35) and (36) evaluated numerically.

Theoretical analysis of ridgeless least squares regression with extraneous noise.

Teacher latents z_i* ~ N(0, 1) generate observations x_i:

x_i = z_i* + ε_i, (37)

where ε_i ~ N(0, σ_obs²) is observation noise.

In this setup there is no time index: we consider only a single sample index i.

We consider candidate student latents, z ∈ R^p, that contain the teacher along with extraneous noise, i.e.:

z_i := [z_i*, ξ_i]^T, (38)

where ξ_i ~ N(0, σ_ext² I_{p−1}) is a vector of i.i.d. extraneous noise, and I_{p−1} is the (p−1) × (p−1) identity matrix.

We study the minimum l2 norm least squares regression estimator on k training samples:

ŵ = argmin { ‖w‖₂ : w minimises Σ_{i=1}^{k} (x_i − wᵀz_i)² }, (39)

with the regression weights w ∈ R^p. More succinctly, z_i ~ N(0, Σ), where Σ = diag([1, σ_ext², …, σ_ext²]).

Note that, by construction, the true mapping is:

w* = [1, 0, …, 0]^T. (40)

Test loss or risk is a mean squared error:

R(ŵ; w*) = E_{z_0} (z_0ᵀw* − z_0ᵀŵ)², (41)

given a test sample z0. The error can be decomposed as:

R(ŵ; w*) = ‖E[ŵ] − w*‖²_Σ  [bias, B]  + Tr[Cov(ŵ) Σ]  [variance, V], (42)

The scenario described above is a special case of [31]. What follows is a direct application of their theory, which studies the risk R in the limit k, p → ∞ with p/k → γ ∈ (0, ∞), to our setting. The alignment of the theory with numerical simulations is demonstrated in Fig 5B.

Claim 1. γ<1, i.e., the underparameterised case k > p.

B = 0 and the risk is just variance and is given by:

lim_{k,p→∞, p/k→γ} R(ŵ; w*) = σ_obs² γ/(1 − γ), (43)

with no dependence on σext.

Proof: This is a direct restatement of Proposition 2 in [31].

Claim 2. γ>1, i.e., the overparameterised case k < p.

The following holds as k, p → ∞ and p/k → γ:

lim_{k,p→∞, p/k→γ} B = γ(γ − 1)/(γ − 1 + 1/σ_ext²)² (44)
lim_{k,p→∞, p/k→γ} V = σ_obs²/(γ − 1) (45)

Proof: For the non-isotropic case, [31] define the following distributions based on the eigendecomposition of Σ:

dĤ(s) = (1/p) δ(s − 1) + ((p − 1)/p) δ(s − σ_ext²) (46)
dĜ(s) = δ(s − 1) (47)

In the limit p → ∞ we take dĤ(s) → δ(s − σ_ext²). This greatly simplifies calculations and nevertheless provides a good fit to numerical results with finite k and p. We solve for c₀(γ, Ĥ) using equation 12 in [31]:

γc₀ = 1/((γ − 1) σ_ext²) (48)

We then compute the limiting values of B and V:

B = ‖w*‖² (1 + γc₀σ_ext²) / (1 + γc₀)² (49)
V = σ_obs² γc₀ σ_ext². (50)

Substituting γc0 completes the proof.

The extraneous noise σ_ext influences the risk of ridgeless regression only in the regime k < p, and its effect is confined to the bias term, leaving the variance unaffected. In contrast, observation noise contributes exclusively to the variance term. Consequently, the dependence of the risk on σ_ext persists even in the absence of observation noise, i.e., when σ_obs = 0.

Fig 5B presents the theoretical predictions alongside the empirical average k-shot performance of minimum-norm least-squares regression, computed numerically using the function numpy.linalg.lstsq.
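The effect of extraneous noise on the minimum-norm estimator can be reproduced numerically with numpy.linalg.lstsq, which returns the minimum-l2-norm solution in the overparameterised regime; γ = 2 and the noise levels below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
k, p, sigma_obs = 100, 200, 0.0   # gamma = p/k = 2: overparameterised regime

def avg_risk(sigma_ext, trials=20):
    w_star = np.zeros(p)
    w_star[0] = 1.0                                                # Eq (40)
    var = np.concatenate([[1.0], np.full(p - 1, sigma_ext ** 2)])  # diagonal of Sigma
    risks = []
    for _ in range(trials):
        Z = rng.normal(size=(k, p)) * np.sqrt(var)          # student latents, Eq (38)
        x = Z @ w_star + sigma_obs * rng.normal(size=k)     # observations, Eq (37)
        w_hat = np.linalg.lstsq(Z, x, rcond=None)[0]        # min-l2-norm solution, Eq (39)
        risks.append(((w_hat - w_star) ** 2 * var).sum())   # exact risk under Eq (41)
    return float(np.mean(risks))

# With sigma_obs = 0 the entire risk is bias, which grows with sigma_ext (Eq 44).
risk_low, risk_high = avg_risk(0.1), avg_risk(3.0)
```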

Theoretical analysis of prototype learning for binary classification with extraneous noise

Teacher latents are distributed as p(z_i*) = ½ δ(z_i* − ½) + ½ δ(z_i* + ½); that is, z_i* is either ½ or −½ with probability ½, representing two classes a and b respectively.

We consider candidate student latents, z ∈ R^{2M+1}, that contain the teacher along with extraneous noise, i.e.:

z_i := [z_i*, ξ_i, 0]^T if z_i* = ½;  z_i := [z_i*, 0, ξ_i]^T if z_i* = −½ (51)

where ξ_i ~ N(0, σ_ext² I_M) is an M-vector of i.i.d. extraneous noise, I_M is the M × M identity matrix, and 0 ∈ R^M.

We consider the prototype learner w = z̄_a − z̄_b, b = ½(z̄_a + z̄_b), where z̄_a and z̄_b are the sample means of k latents from class a and k latents from class b respectively. The classification rule is given by the sign of wᵀ(x − b): the input x is classified as a if positive and b otherwise.

This setting is a special case of [54], which provides a theoretical prediction for the average few-shot classification error rate for class a, ε_a, given by ε_a = H(SNR_a), where H(x) = (1/√(2π)) ∫_x^∞ dt exp(−t²/2) is a monotonically decreasing function:

SNR_a = [ ½ ‖Δx₀‖² + (R_b²/R_a² − 1)/k ] / √( D_a^{−1}/k + (Δx₀ᵀU_b)²/k + (Δx₀ᵀU_a)² ). (52)

Here Δz = z̄_a − z̄_b is the difference of the population centroids of the two classes.

In our case this reduces to:

SNR_a ≈ √(Mk)/σ_ext² (53)

To obtain this, we note that the radii of manifold a are [0, σ_ext, …, σ_ext, 0, …, 0], giving an average radius R = R_a = R_b = √(M σ_ext²/(2M+1)) and participation ratio D_a = (Σ_i (R_a^i)²)²/Σ_i (R_a^i)⁴ = M.

We substitute ‖Δx₀‖² = 1/R² = (2M+1)/(M σ_ext²) ≈ 2/σ_ext².

The bias term (R_b²/R_a² − 1)/k is zero since R_a = R_b.

The Δx₀ᵀU_a and Δx₀ᵀU_b terms are both zero.

The participation ratio D_a = M. Our construction is symmetric, so SNR_a = SNR_b.

The classification error ε decreases monotonically with the number of samples k, tending to zero as k → ∞ for all finite values of σ_ext. In contrast, ε increases monotonically with the extraneous noise σ_ext, deviating significantly from zero once σ_ext² ≳ √(Mk).

Fig 5C presents the numerically computed error in comparison with the theoretical prediction given in (53).

Data and code availability

The experiments in this work are largely based on code repositories from previous works. The repositories used or developed in this work are listed under Data Availability.

Supporting information

S1 Fig. Student-Teacher RNNs: co-smoothing as a function of model size.

Finding the correct model is not just a matter of tuning the latent-size hyperparameter: NODE SAE students over a range of sizes (5–15) achieve high co-smoothing on the same 64-unit noisy GRU teacher performing 3BFF (Methods).

(TIFF)

S2 Fig. How to choose k for your dataset?

Our theoretical analysis in “Why does few-shot work?” reveals that extraneous models are best discriminated when the shot number, k, is small. So how small can we go? In the case of sparse data like neural spike counts we may obtain k-trial subsets in which some neurons are silent. In this scenario the few-shot decoder g receives no signal for those neurons. To avoid this pathological scenario, for each dataset, we pick the smallest possible k that ensures that the probability of encountering silent neurons in a k-trial subset is safely near zero. This must be computed for each dataset independently since some datasets are more sparse than others. We compute the frequency of such silences for different k, for each NLB [6] dataset, and show the values of k (dashed lines) chosen for the analysis in the main text.

(TIFF)

S1 Appendix. Decoding across HMM latents: fitting and evaluation.

(PDF)

S3 Fig. Good co-smoothing does not guarantee correct latents in Hidden Markov Models (HMMs).

In the main text, we show how good prediction of held-out neural activity, i.e., co-smoothing, does not guarantee a match between model and true latents. We did this in the student-teacher setting of RNNs and NODE SAEs (Fig 2). Here we replicate the results in HMMs (see Methods). Similar to Fig 2, several student HMMs are trained on a dataset generated by a single teacher HMM, a noisy 4-cycle. The student→teacher decoding error D_{S→T} is low and tightly related to the co-smoothing score. The teacher→student decoding error D_{T→S} is more varied and uncorrelated with co-smoothing. The arrows mark the “Good” and “Bad” transition matrices shown in Fig 2 (lower).

(TIFF)

S4 Fig. Student-teacher results in Linear Gaussian State Space Models.

We demonstrate that our results are not unique to the RNN or HMM settings by simulating another simple scenario: linear Gaussian state space models (LGSSM), i.e., Kalman smoothing.

The model is defined by parameters (μ_0, Σ_0, F, b, G, H, c, R). A major difference from HMMs is that the latent states z ∈ R^M are continuous. They follow the dynamics given by:
z_0 ~ N(μ_0, Σ_0) (54)
z_t ~ N(F z_{t−1} + b, G) (55)
x_t ~ N(H z_t + c, R) (56)

Given these dynamics, the latents z can be inferred from observations x using Kalman smoothing, analogous to (15). Here we use the jax based dynamax implementation.

We use a teacher LGSSM with M = 4, with parameters chosen randomly (using the dynamax defaults) and then fixed. Student LGSSMs are also initialised randomly and optimised with Adam [51] to minimise the negative log-likelihood on the training data (see the dataset dimensions section for the dimensions of the synthetic dataset). D_{S→T} and D_{T→S} are computed with linear regression (sklearn.linear_model.LinearRegression), and predictions are evaluated against the target using R² (sklearn.metrics.r2_score). We define D_{u→v} := 1 − (R²)_{u→v}. Few-shot regression from z to x_{k-out} is also performed using linear regression.

In line with our results for RNNs and HMMs (Fig 2 and Fig 4), we show that among the models with high test log-likelihood (> −55), D_{S→T}, but not D_{T→S}, is highly correlated with test log-likelihood, while D_{T→S} shows a close relationship to the average 10-shot MSE. For these linear Gaussian state space models, we report log-likelihood instead of co-smoothing, and k-shot MSE instead of k-shot co-smoothing, demonstrating the same pattern of results across model classes.

(TIFF)

S5 Fig. HMM network visualisations.

In the main text Fig 2 we visualised the teacher and two student HMMs as graphs of fractional traffic volume on states and transitions. For clarity, we dropped low-probability edges with values below 0.01. Here we show the same models with all edges visualised, including the low-probability transitions omitted from the main text figure.

(TIFF)

S6 Fig. Few-shot co-smoothing is not simply hard co-smoothing (variations of HMM student-teacher experiments).

Few-shot co-smoothing is a more difficult metric than standard co-smoothing. Thus, it might seem that any increase in difficulty would yield similar results. To show this is not the case, we use standard co-smoothing with fewer held-in neurons. The score is lower (because the task is more difficult), but it does not discriminate between models.

We demonstrate this through two variations of the HMM student-teacher experiments. In the first variation, we increase the number of held-out neurons from N_out = 50 to N_out = 100, making the co-smoothing problem harder. The top three panels show: (1) decoder student-teacher original simple, (2) decoder teacher-student original simple (same as main text Fig 1CD), and (3) decoder teacher-student 6-shot best (same as main text Fig 4B). In the second variation, we decrease the number of held-in and held-out neurons to N_in = 5, N_out = 5, N_{k-out} = 50, further increasing difficulty. The bottom three panels show the same three decoder configurations as the top row. While the score does decrease because the problem is harder, co-smoothing is still not indicative of good models, while few-shot co-smoothing remains discriminative.

(TIFF)

S2 Appendix. Time cost of computing few-shot co-smoothing.

(PDF)

S7 Fig. Classifying task variables from latents in models with contrasting few-shot performance.

In main text Fig 7(lower panel), we compare two STNDT models trained on mc_maze_20 that perform identically under standard co-smoothing but diverge under 64-shot co-smoothing. Projecting their latents onto the top two principal components reveals differences in trajectory smoothness and task-condition separation. Quantitatively, the “Bad” model exhibits higher latent dimensionality, as reflected by the slower growth of variance explained across PCs (left panel), and yields poorer binary classification of maze barrier presence—especially when using only the top two principal components (right panel).

(TIFF)

S3 Appendix. Illustrative example of the difference between cycle consistency and cross-decoding.

(PDF)


Financial disclosure

This work was supported by the Israel Science Foundation (grant No. 1442/21 to OB) and Human Frontiers Science Program (HFSP) research grant (RGP0017/2021 to OB). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability

Code for the few-shot evaluation is available at https://github.com/KabirDabholkar/nlb_tools_fewshot. Code for the HMM simulations is available at https://github.com/KabirDabholkar/hmm_analysis.


References

  • 1.Vyas S, Golub MD, Sussillo D, Shenoy KV. Computation through neural population dynamics. Annu Rev Neurosci. 2020;43:249–75. doi: 10.1146/annurev-neuro-092619-094115 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Barak O. Recurrent neural networks as versatile tools of neuroscience research. Curr Opin Neurobiol. 2017;46:1–6. doi: 10.1016/j.conb.2017.06.003 [DOI] [PubMed] [Google Scholar]
  • 3.Sussillo D, Barak O. Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Comput. 2013;25(3):626–49. doi: 10.1162/NECO_a_00409 [DOI] [PubMed] [Google Scholar]
  • 4.Vinuesa R, Brunton SL. Enhancing computational fluid dynamics with machine learning. Nat Comput Sci. 2022;2(6):358–66. doi: 10.1038/s43588-022-00264-7 [DOI] [PubMed] [Google Scholar]
  • 5.Bauwens L, Veredas D. The stochastic conditional duration model: a latent factor model for the analysis of financial durations. 2005. https://papers.ssrn.com/abstract=685421
  • 6.Pei FC, Ye J, Zoltowski DM, Wu A, Chowdhury RH, Sohn H, et al. Neural latents benchmark ‘21: evaluating latent variable models of neural population activity. 2021.
  • 7.Shmueli G. To explain or to predict?. Statistical Science. 2010;25(3):289–310. doi: 10.1214/10-STS330 [DOI] [Google Scholar]
  • 8.Versteeg C, Sedler AR, McCart JD, Pandarinath C. Expressive dynamics models with nonlinear injective readouts enable reliable recovery of latent features from neural activity. In: Proceedings of the 2nd NeurIPS Workshop on Symmetry and Geometry in Neural Representations, 2024. p. 255–78. https://proceedings.mlr.press/v228/versteeg24a.html
  • 9.Koppe G, Toutounji H, Kirsch P, Lis S, Durstewitz D. Identifying nonlinear dynamical systems via generative recurrent neural networks with applications to fMRI. PLoS Comput Biol. 2019;15(8):e1007263. doi: 10.1371/journal.pcbi.1007263 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Pandarinath C, O’Shea DJ, Collins J, Jozefowicz R, Stavisky SD, Kao JC, et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nat Methods. 2018;15(10):805–15. doi: 10.1038/s41592-018-0109-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Sedler AR, Versteeg C, Pandarinath C. Expressive architectures enhance interpretability of dynamics-based neural population models. Neuron Behav Data Anal Theory. 2023;2023:10.51628/001c.73987. doi: 10.51628/001c.73987 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Valente A, Pillow JW, Ostojic S. Extracting computational mechanisms from neural data using low-rank RNNs. In: Oh AH, Agarwal A, Belgrave D, Cho K, editors. Advances in Neural Information Processing Systems. 2022.
  • 13.Gloeckler M, Macke JH, Pals M, Pei F, Sağtekin AE. Inferring stochastic low-rank recurrent neural networks from neural data. In: Advances in Neural Information Processing Systems 37. 2024. p. 18225–64. 10.52202/079017-0579 [DOI]
  • 14.Perkins SM, Cunningham JP, Wang Q, Churchland MM. Simple decoding of behavior from a complicated neural manifold. eLife. 2023. 10.7554/elife.89421.1 [DOI]
  • 15.Linderman S, Johnson M, Miller A, Adams R, Blei D, Paninski L. Bayesian learning and inference in recurrent switching linear dynamical systems. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. 2017. p. 914–22. https://proceedings.mlr.press/v54/linderman17a.html
  • 16.Macke JH, Buesing L, Cunningham JP, Yu BM, Shenoy KV, Sahani M. Empirical models of spiking in neural populations. In: Advances in Neural Information Processing Systems. 2011. https://proceedings.neurips.cc/paper_files/paper/2011/file/7143d7fbadfa4693b9eec507d9d37443-Paper.pdf
  • 17.Wu A, Pashkovski S, Datta SR, Pillow JW. Learning a latent manifold of odor representations from neural responses in piriform cortex. In: Advances in Neural Information Processing Systems. 2018. https://proceedings.neurips.cc/paper_files/paper/2018/file/17b3c7061788dbe82de5abe9f6fe22b3-Paper.pdf
  • 18.Meghanath G, Jimenez B, Makin JG. Inferring population dynamics in macaque cortex. J Neural Eng. 2023;20(5):10.1088/1741-2552/ad0651. doi: 10.1088/1741-2552/ad0651 [DOI] [PubMed] [Google Scholar]
  • 19.Keshtkaran MR, Sedler AR, Chowdhury RH, Tandon R, Basrai D, Nguyen SL, et al. A large-scale neural network training framework for generalized estimation of single-trial population dynamics. Nat Methods. 2022;19(12):1572–7. doi: 10.1038/s41592-022-01675-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Keeley S, Aoi M, Yu Y, Smith S, Pillow JW. Identifying signal and noise structure in neural population activity with gaussian process factor models. In: Advances in Neural Information Processing Systems. 2020. p. 13795–805. https://proceedings.neurips.cc/paper_files/paper/2020/file/9eed867b73ab1eab60583c9d4a789b1b-Paper.pdf
  • 21.Le T, Shlizerman E. STNDT: modeling neural population activity with spatiotemporal transformers. In: Advances in Neural Information Processing Systems. 2022. p. 17926–39. https://proceedings.neurips.cc/paper_files/paper/2022/file/72163d1c3c1726f1c29157d06e9e93c1-Paper-Conference.pdf
  • 22.She Q, Wu A. Neural dynamics discovery via gaussian process recurrent neural networks. In: Proceedings of the 35th Uncertainty in Artificial Intelligence Conference. 2020. p. 454–64. https://proceedings.mlr.press/v115/she20a.html
  • 23.Wu A, Roy NA, Keeley S, Pillow JW. Gaussian process based nonlinear latent structure discovery in multivariate spike train data. In: Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/b3b4d2dbedc99fe843fd3dedb02f086f-Paper.pdf [PMC free article] [PubMed]
  • 24.Zhao Y, Park IM. Variational latent gaussian process for recovering single-trial dynamics from population spike trains. Neural Comput. 2017;29(5):1293–316. doi: 10.1162/NECO_a_00953 [DOI] [PubMed] [Google Scholar]
  • 25.Schimel M, Kao T-C, Jensen KT, Hennequin G. iLQR-VAE: control-based learning of input-driven dynamics with applications to neural data. In: International Conference on Learning Representations. 2022. https://openreview.net/forum?id=wRODLDHaAiW
  • 26.Mullen TSO, Schimel M, Hennequin G, Machens CK, Orger M, Jouary A. Learning interpretable control inputs and dynamics underlying animal locomotion. In: The Twelfth International Conference on Learning Representations. 2024. https://openreview.net/forum?id=MFCjgEOLJT
  • 27.Gokcen E, Jasper AI, Semedo JD, Zandvakili A, Kohn A, Machens CK, et al. Disentangling the flow of signals between populations of neurons. Nat Comput Sci. 2022;2(8):512–25. doi: 10.1038/s43588-022-00282-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Yu BM, Cunningham JP, Santhanam G, Ryu S, Shenoy KV, Sahani M. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. In: Advances in Neural Information Processing Systems. 2008. https://papers.nips.cc/paper_files/paper/2008/hash/ad972f10e0800b49d76fed33a21f6698-Abstract.html [DOI] [PMC free article] [PubMed]
  • 29.Versteeg C, McCart JD, Ostrow M, Zoltowski DM, Washington CB, Driscoll L, et al. Computation-through-dynamics benchmark: simulated datasets and quality metrics for dynamical models of neural activity. bioRxiv. 2025. https://www.biorxiv.org/content/10.1101/2025.02.07.637062v2
  • 30.Deng J, Dong W, Socher R, Li L-J, Kai Li, Li Fei-Fei. ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009. p. 248–55. 10.1109/cvpr.2009.5206848 [DOI]
  • 31.Hastie T, Montanari A, Rosset S, Tibshirani RJ. Surprises in high-dimensional ridgeless least squares interpolation. Ann Stat. 2022;50(2):949–86. doi: 10.1214/21-aos2133 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Sorscher B, Ganguli S, Sompolinsky H. Neural representational geometry underlies few-shot concept learning. Proc Natl Acad Sci U S A. 2022;119(43):e2200800119. doi: 10.1073/pnas.2200800119 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Belkin M, Hsu D, Ma S, Mandal S. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc Natl Acad Sci U S A. 2019;116(32):15849–54. doi: 10.1073/pnas.1903070116 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Nakkiran P, Kaplun G, Bansal Y, Yang T, Barak B, Sutskever I. Deep double descent: where bigger models and more data hurt*. J Stat Mech. 2021;2021(12):124003. doi: 10.1088/1742-5468/ac3a74 [DOI] [Google Scholar]
  • 35.Churchland MM, Cunningham JP, Kaufman MT, Ryu SI, Shenoy KV. Cortical preparatory activity: representation of movement or first cog in a dynamical machine? Neuron. 2010;68(3):387–400. doi: 10.1016/j.neuron.2010.09.015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.O’Doherty JE, Cardoso MMB, Makin JG, Sabes PN. Nonhuman primate reaching with multichannel sensorimotor cortex electrophysiology: broadband for indy 2016 0927 06. 2018. https://zenodo.org/records/1432819
  • 37.Sohn H, Narain D, Meirhaeghe N, Jazayeri M. Bayesian computation through cortical latent dynamics. Neuron. 2019;103(5):934-947.e5. doi: 10.1016/j.neuron.2019.06.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Chowdhury RH, Glaser JI, Miller LE. Area 2 of primary somatosensory cortex encodes kinematics of the whole arm. Elife. 2020;9:e48198. doi: 10.7554/eLife.48198 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Le T, Shlizerman E. STNDT: modeling neural population activity with spatiotemporal transformers. Advances in Neural Information Processing Systems. 2022;35:17926–39. [Google Scholar]
  • 40.Ye J, Pandarinath C. Representation learning for neural population activity with neural data transformers. Neurons, Behavior, Data analysis, and Theory. 2021;5(3):1–18. doi: 10.51628/001c.27358 [DOI] [Google Scholar]
  • 41.Nguyen TQ, Salazar J. Transformers without tears: Improving the normalization of self-attention. In: Proceedings of the 16th International Conference on Spoken Language Translation, Hong Kong, 2019. https://aclanthology.org/2019.iwslt-1.17/
  • 42.Huang XS, Perez F, Ba J, Volkovs M. Improving transformer optimization through better initialization. In: Proceedings of the 37th International Conference on Machine Learning. 2020. p. 4475–83. https://proceedings.mlr.press/v119/huang20f.html
  • 43.Zhu J-Y, Park T, Isola P, Efros AA. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: 2017 IEEE International Conference on Computer Vision (ICCV). 2017. p. 2242–51. 10.1109/iccv.2017.244 [DOI]
  • 44.Akaike H. Information theory and an extension of the maximum likelihood principle. Springer Series in Statistics. Springer New York. 1998. p. 199–213. 10.1007/978-1-4612-1694-0_15 [DOI]
  • 45.Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6(2):461–4. doi: 10.1214/aos/1176344136 [DOI] [Google Scholar]
  • 46.Altan E, Solla SA, Miller LE, Perreault EJ. Estimating the dimensionality of the manifold underlying multi-electrode neural recordings. PLoS Comput Biol. 2021;17(11):e1008591. doi: 10.1371/journal.pcbi.1008591 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Stringer C, Pachitariu M, Steinmetz N, Carandini M, Harris KD. High-dimensional geometry of population responses in visual cortex. Nature. 2019;571(7765):361–5. doi: 10.1038/s41586-019-1346-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Stringer C, Pachitariu M, Steinmetz N, Reddy CB, Carandini M, Harris KD. Spontaneous behaviors drive multidimensional, brainwide activity. Science. 2019;364(6437):255. doi: 10.1126/science.aav7893 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Gokcen E, Jasper A, Xu A, Kohn A, Machens CK, Yu BM. Uncovering motifs of concurrent signaling across multiple neuronal populations. In: Advances in Neural Information Processing Systems. 2023. p. 34711–22. https://proceedings.neurips.cc/paper_files/paper/2023/hash/6cf7a37e761f55b642cf0939b4c64bb8-Abstract-Conference.html
  • 50.Chung J, Gulcehre C, Cho KH, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint 2014. http://arxiv.org/abs/1412.3555
  • 51.Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint 2017. http://arxiv.org/abs/1412.6980
  • 52.Barber D. Bayesian reasoning and machine learning. Cambridge University Press; 2012.
  • 53.Linderman SW, Chang P, Harper-Donnelly G, Kara A, Li X, Duran-Martin G, et al. Dynamax: a python package for probabilistic state space modeling with JAX. JOSS. 2025;10(108):7069. doi: 10.21105/joss.07069 [DOI] [Google Scholar]
  • 54.Sorscher B, Ganguli S, Sompolinsky H. Neural representational geometry underlies few-shot concept learning. Proc Natl Acad Sci U S A. 2022;119(43):e2200800119. doi: 10.1073/pnas.2200800119 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013789.r001

Decision Letter 0

Hugues Berry, Yuanning Li

16 Apr 2025

PCOMPBIOL-D-25-00336

When predict can also explain: few-shot prediction to select better neural latents

PLOS Computational Biology

Dear Dr. Dabholkar,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. As you will see in the attached comments, one of the reviewers has raised serious concerns about unclear method descriptions, overuse of toy examples, lack of real-data validation and statistical rigor, and formatting issues. We feel substantial revisions are necessary to address these concerns. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days, by Jun 16 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter

We look forward to receiving your revised manuscript.

Kind regards,

Yuanning Li

Academic Editor

PLOS Computational Biology

Hugues Berry

Section Editor

PLOS Computational Biology

Journal Requirements:

1) Please ensure that the CRediT author contributions listed for every co-author are completed accurately and in full.

At this stage, the following Authors/Authors require contributions: Kabir Vinay Dabholkar, and Omri Barak. Please ensure that the full contributions of each author are acknowledged in the "Add/Edit/Remove Authors" section of our submission form.

The list of CRediT author contributions may be found here: https://journals.plos.org/ploscompbiol/s/authorship#loc-author-contributions

2) We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex. If you are providing a .tex file, please upload it under the item type 'LaTeX Source File' and leave your .pdf version as the item type 'Manuscript'.

3) Your manuscript is missing the following sections: Results, and Methods.  Please ensure all required sections are present and in the correct order. Make sure section heading levels are clearly indicated in the manuscript text, and limit sub-sections to 3 heading levels. An outline of the required sections can be consulted in our submission guidelines here:

https://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-parts-of-a-submission 

4) Please upload all main figures as separate Figure files in .tif or .eps format. For more information about how to convert and format your figure files please see our guidelines: 

https://journals.plos.org/ploscompbiol/s/figures

5) Please upload a copy of Figures 4C, and D which you refer to in your text on page 8. Or, if the subfigures are no longer to be included as part of the submission please remove all reference to them within the text.

6) Please ensure that all Figure files have corresponding citations and legends within the manuscript. Currently, Figure 5 in your submission file inventory does not have an in-text citation. Please include the in-text citation of the figure.

7) We notice that your supplementary Figures, Tables, and information are included in the manuscript file. Please remove them and upload them with the file type 'Supporting Information'. Please ensure that each Supporting Information file has a legend listed in the manuscript after the references list. Please cite and label the supplementary tables and figures as “S1 Table” and “S2 Table,” "S1 Figure", S2 Figure" and so forth.

8)  Thank you for stating "Code for the HMM simulations is available at https://github.com/KabirDabholkar/hmm_analysis." This link reaches a 404 error page. Please amend this to a working link.

9) Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.

1) State the initials, alongside each funding source, of each author to receive each grant. For example: "This work was supported by the National Institutes of Health (####### to AM; ###### to CJ) and the National Science Foundation (###### to AM)."

2) State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

10) Please ensure that the funders and grant numbers match between the Financial Disclosure field and the Funding Information tab in your submission form. Note that the funders must be provided in the same order in both places as well. Currently, the order of the funders is different in both places.

11) Please provide a completed 'Competing Interests' statement, including any COIs declared by your co-authors. If you have no competing interests to declare, please state "The authors have declared that no competing interests exist". 

Reviewers' comments:

Reviewer's Responses to Questions

Reviewer #1: This manuscript introduces a novel and interesting few-shot prediction metric to enhance the alignment between inferred latent variables and the true latent. Such a tool holds significant potential for modern neuroscience research, particularly in the context of large-scale population recordings. Nevertheless, the paper lacks sufficient results to support their conclusion and the presentation of the work is kind of messy.

1. When the authors challenge the common assumption on the effectiveness of co-smoothing, they can provide concrete examples in real neural data, instead of using a simplified HMM as a toy example.

2. line 95-96, "It is common to assume that being able to predict held-out parts of X will guarantee that the inferred latent aligns with the true one"

I don't think this is a common assumption. Although co-smoothing is commonly used for benchmarking LVMs, the purpose is to demonstrate that the inferred latents recover the true latent signals in some way (e.g., up to rotation, permutation, linear combinations, etc.).

Actually, this is consistent with your statement in line 101: "we hypothesize that good prediction guarantees that the true latents are contained within the inferred ones".

3. Over-parameterization usually achieves better prediction scores. By regularizing the model to have parsimonious parameters, we may achieve good alignment between the inferred latents and the true latents. This can be verified in Fig 1E, where a small number of latents yields smaller D_{T->S} values. I think the authors should also comment on their few-shot approach vs the model selection approach (using AIC/BIC or other regularizers described in line 18-20). One is using limited data, while the other is using limited model parameters.

4. The organization of the manuscript needs major revision. The manuscript devotes too much space to discussing the toy HMM example and fails to validate its applicability to real LVMs on neural data, whose latents are continuous values instead of discrete states. Most people will agree that 'a high prediction score does not necessarily yield good alignment with the true latents', so there is no need to use a full page to verify this with an HMM model. Simply using LFADS or STNDT results to show the diversity of inferred latent dynamics would suffice. Alternatively, you can use simulated data to provide ground-truth latents.

5. In the discussion of 'why does few-shot work', you can keep just the more insightful comments. The current version is too lengthy and obscure. You can also move this material to the supplementary materials. I think the manuscript should include more examples of neural data to validate the performance of this few-shot prediction score.

6. The few-shot approach is very similar to bagging in ensemble learning, which aims to reduce the variance of the model estimate. In bagging, a random sample of the training set is selected with replacement. After generating several data samples, these weak models (g' in your work) are then trained independently. Probably you can comment on your method by connecting it with bagging.

7. The few-shot prediction approach, which is the main contribution of the manuscript, is not clearly described.

a. Line 152, you used N^{k-out}. What's the typical value for N^{in}, N^{out}, N^{k_out}?

b. The caption of Fig 3B says that 'f and g are frozen'. If so, how does Q^k help with the model training? Are you only using few-shot prediction for model evaluation?

c. If the few-shot prediction is simply used for evaluation, the metric is only used for comparing the inferred models. What if all models fail to align with the true latent?

8. The format of the manuscript needs major changes. It looks like a rushed conversion from a conference submission. The authors should follow the PLOS journal format and organize their manuscript according to journal standards. For example, the supporting information has subsections with a prefix of A-Z, and different sections read like randomly compiled contents without inherent connections.

9. The time cost in computing few-shot prediction metric should be presented or discussed.

10. What if the data has no trial structure? For example, LFADS uses single trials to infer the latents. Maybe you have different definitions of trials?

11. The manuscript only uses mc_maze_20 for real-world experiments. It needs more realistic evidence, and maybe some neuroscientific insights about how few-shot decoding can better recover latent dynamics.

12. No statistical representation of performance comparison, but only descriptive words and figures. Need hypothesis testing to justify your assertion about data distributions (un-correlated, negatively correlated, etc).

13. The authors realized using only HMMs for synthetic data is not enough, but just discussed in the appendices and provided additional results with LGSSMs. Why not integrate with the main text in the first place?

Minor:

- Fig 2 is a bit confusing.

- The good student is on the right but is mentioned first in the figure caption.

- What is "edge width"? Do you mean edge value or edge weight?

- There shouldn't be "invisible edges" in a graph. Maybe use lower alpha value or other colors.

- Line 201, typo, "the good student"

- L223-L225 the paragraph is just one sentence, which could be OK in some scenarios. But it looks more like the manuscript is not well organized.

- Fig 5 is not mentioned in your main text, instead you referred to Figure 19, which has duplicated contents with Fig 5.

- In your writing, you should consistently use only one of "Fig", "Fig.", and "Figure".

- The real-world dataset needs a demonstration to address a non-expert audience:

- What is the experiment setting? Why should we model it with LVMs?

- How to interpret the latent variables underlying the dataset? And what does "extraneous dynamics" mean in this scenario?

- Without background information, a non-expert reader might not understand your experiments or your results, such as the "trajectories" in Fig 5 and 19, thus doubting the whole work.

- Some important notions (smoothing, co-smoothing, few-shot co-smoothing, cross-decoding) have similar and non-intuitive names, maybe list a name table in the main text or appendix, and use abbreviations in the main text.

- Some typos ("a the" in line 89, "LGSMM" in line 478, missing section index in line 184).

Reviewer #2: This manuscript presents a new metric for evaluating latent variable models used in neuroscience, the few-shot co-smoothing score. When combined with standard co-smoothing, the new metric can identify models that fit the observed data well but that have unnecessarily complicated latent states (where the unneeded complexity is hidden by the observation model). In other words, the metric can help to identify models that have more vs less parsimonious representations of the latent dynamics. It is extremely common in modern neuroscience to fit latent state models and interpret the inferred latents in scientific terms -- so it is a huge problem that inferred latent states can be unnecessarily complicated, even when a model is a good fit to data. Hence the current manuscript is a timely and valuable contribution to the literature. It is also well-written, convincing, and thorough.

My only substantive comment is that, as the authors demonstrate, the few-shot relationship to ground truth is very sensitive to the choice of the number of few-shot neurons k, and the appropriate choice depends on the model class. So I think it is important to discuss the choice of k in the main text rather than only discussing it in the appendix.

I include some minor additional comments below.

Minor comments:

- Equation 1: I think it would be clearer to use "Z hat" here, because f is estimating the true unknown latent state

- Page 5 line 118: It would be good to justify your use of xi rather than z hat, e.g. "we use its posterior probability mass function as the relevant intermediate representation because it reflects a richer representation of the knowledge about the latent state than a single discrete state estimate" or "... because it captures the degree of belief in a given latent state rather than just the most likely discrete state" or "... because the true latent state z is unknown, and xi completely summarizes the current knowledge of it"

- Figure 4: I believe the color still represents M in this figure -- if so, please include M in the legend (like in 1D,E). Same comment for all similar figures.

- Equations 9, 10: It wasn't obvious to me right away that xi and mu were the (posterior) probabilities of the latent states at time 1, 2. It would be good to say it explicitly.

- Equation 11: missing Bhat_1(xi) subscript

- Page 8 line 197: "we see that" -- you aren't showing the bias/variance properties here, so you should instead refer the reader to the appendix.

- Page 9 line 211: Acronym SOTA used without definition ("state of the art" is used on page 2 line 29)

- Page 9 line 236: Missing reference "as in Section we"

- Page 12 line 286: Typo "arguement"

- Page 12 line 294: Wording "may be thus can evaluated"

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Emily P Stephen

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013789.r003

Decision Letter 1

Hugues Berry, Yuanning Li

18 Sep 2025

PCOMPBIOL-D-25-00336R1

When predict can also explain: few-shot prediction to select better neural latents

PLOS Computational Biology

Dear Dr. Dabholkar,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. One of the reviewers raised additional concerns and suggestions that should be addressed. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days, by Nov 18 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter

We look forward to receiving your revised manuscript.

Kind regards,

Yuanning Li

Academic Editor

PLOS Computational Biology

Hugues Berry

Section Editor

PLOS Computational Biology

Journal Requirements:

1) We note that your Manuscript files are duplicated on your submission. Please remove any unnecessary or old files from your revision, and make sure that only those relevant to the current version of the manuscript are included.

2) Your manuscript is missing the following section: Results.  Please ensure all required sections are present and in the correct order. Make sure section heading levels are clearly indicated in the manuscript text, and limit sub-sections to 3 heading levels. An outline of the required sections can be consulted in our submission guidelines here:

https://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-parts-of-a-submission 

3) We notice that your supplementary information (Appendices) is included in the manuscript file. Please remove them and upload them with the file type 'Supporting Information'. Please ensure that each Supporting Information file has a legend listed in the manuscript after the references list.

4) Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.

a) State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

5) Please ensure that the funders and grant numbers match between the Financial Disclosure field and the Funding Information tab in your submission form. Note that the funders must be provided in the same order in both places as well. Currently, the grants are listed in a different order in the two places.

Note: If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Reviewers' comments:

Reviewer's Responses to Questions

Reviewer #2: The revisions address my issues with the first submission, and I approve of the additional changes.

Minor comments:

- In the section "Why does few-shot work?", you present the linear regression case first without saying that's what you're doing. That is, on p7 line 142, you introduce three models (LR, HMM, prototype). The next paragraph would be clearer if you started it with "For the linear regression case..." or similar.

- In Figure 5, it took me a second to notice that in Panel A higher is better (likelihood), while in Panels B and C lower is better (error). So I didn't get right away that all three were showing the same trend with respect to extraneous noise. It would be helpful just to state it explicitly in the text and/or caption.

- On p8 lines 156-158, you compare the methods in terms of their bias/variance decompositions. I assume you are referring to the analysis in the Methods sections "Theoretical analysis of...". If so, please refer the reader to these methods sections (again).

Reviewer #3: This manuscript highlights the problem of extraneous/spurious dynamics in latent variable models and introduces two evaluation approaches for identifying it: few-shot co-smoothing and cross-decoding. This problem is important and hampers interpretation and scientific conclusions drawn from these models, so suitable evaluation frameworks are a timely and valuable contribution. After the first revision, the manuscript is much clearer and more well-motivated. However, I do still have a few questions/comments about the work:

1. Given that 1) few-shot co-smoothing is still used in conjunction with standard co-smoothing, and 2) cycle consistency and cross-decoding also indicate presence of spurious dynamics (but not co-smoothing quality), what does few-shot co-smoothing exactly offer that the combination of co-smoothing and e.g., cycle consistency does not? Can few-shot co-smoothing be sufficient alone, without also evaluating standard co-smoothing? Why is it important to specifically have a prediction-based metric for parsimony of latents?

2. It is stated in the text that cycle consistency relies on the models having “perfect co-smoothing” or having “rate predictions [that are] perfect proxies of the true dynamics,” while cross-decoding does not. However, it is also stated in the text that cross-decoding also relies on the assumption that “high co-smoothing models contain the teacher latent.” How is this assumption different from that of cycle consistency?

3. Though linear-exponential-poisson readouts are most common, some LVMs do not have this emissions model (e.g., linear-softplus, MLP). I assume this makes (linear) cross-decoding not directly applicable, and I wonder if few-shot co-smoothing scores are comparable across readout models. I assume the few-shot generalization behavior would vary, especially for a higher parameter count, neural network-based readout as in ODIN (Versteeg et al. 2023), so I would appreciate some empirical exploration and/or discussion of this limitation (if I am correct in assuming that it is a limitation).

4. Though it is maybe obvious, I think the two-bit flip flop is potentially a good example to briefly discuss the consequence of spurious dynamics on accurate interpretation of the system. Are the red and green stars in Fig 2 unstable and stable fixed points? Do the spurious dynamics in the “bad” model lead to incorrect fixed point topology (as computed in Maheswaranathan et al. 2019, for example)?

5. Though maybe only directly applicable to models with strictly linear emissions models, I would appreciate some discussion of Procrustes-style metrics from Alex Williams and others, which also penalize spurious dynamics without needing two separate metrics.

6. I appreciate the qualitative difference in smoothness and separation of latents in Fig 7. Can you perform any quantitative evaluations that support this point? For example, can conditions be more accurately classified from initial conditions of the “good” model?

7. Minor typo: disucssion (line 214)

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Emily P Stephen

Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission:

While revising your submission, we strongly recommend that you use PLOS's NAAS tool (https://ngplosjournals.pagemajik.ai/artanalysis) to test your figure files. NAAS can convert your figure files to the TIFF file type so that they meet basic requirements (such as print size and resolution), or provide you with a report on issues that do not meet our requirements and that NAAS cannot fix.

After uploading your figures to PLOS's NAAS tool - https://ngplosjournals.pagemajik.ai/artanalysis, NAAS will process the files provided and display the results in the "Uploaded Files" section of the page once processing is complete. If the uploaded figures meet our requirements (or NAAS is able to fix the files to meet our requirements), the figure will be marked as "fixed". If NAAS is unable to fix the files, a red "failed" label will appear. When NAAS has confirmed that the figure files meet our requirements, please download the files via the download option and include these NAAS-processed figure files when submitting your revised manuscript.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013789.r005

Decision Letter 2

Hugues Berry, Yuanning Li

26 Nov 2025

Dear Mr Dabholkar,

We are pleased to inform you that your manuscript 'When predict can also explain: few-shot prediction to select better neural latents' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please also consider addressing the final comments from the reviewer regarding more discussion in the text on the limitations of the proposed methods.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Yuanning Li

Academic Editor

PLOS Computational Biology

Hugues Berry

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #3: The authors have largely addressed all of my comments and misunderstandings. I only have one remaining minor comment, which is that I would appreciate a bit more discussion in the text on the limitations of the methods proposed here for comparing models with different readout/emissions models. I think the current manuscript convincingly shows that the proposed metrics are effective for model selection across models of the same architecture, but I think it remains unclear how to compare raw few-shot co-smoothing values across architectures, especially when they have different readout models, which is essential if the metric is to be used widely for benchmarking.

There are more exceptions to the conventional linear-exp-Poisson readout than just ODIN [1]. Old-school methods like GPFA [2] and some of their extensions like (m)DLAG [3] still (unfortunately) use a linear-Gaussian emissions model, and others like SLDS often use linear-softplus (for example, in the SLDS NLB baseline). There are also methods incorporating spike history [4] and binomial/negative-binomial count models [5]. Most critically, I think there is growing interest in modelling neural dynamics on nonlinear manifolds, which has maybe most prominently led to CEBRA [6] (not really an LVM of course) but also e.g., LVMs with tuning-curve-based readout models [7,8,9].

None of these methods are really state-of-the-art in current benchmarking settings, so I don’t think this is a concerning limitation of few-shot co-smoothing and I don’t think you need to cite every single one of these, but I think it would be better to more clearly acknowledge this potential limitation.

[1] Versteeg, C., Sedler, A. R., McCart, J. D., & Pandarinath, C. (2023). Expressive dynamics models with nonlinear injective readouts enable reliable recovery of latent features from neural activity. ArXiv, arXiv-2309.

[2] Yu, B. M., Cunningham, J. P., Santhanam, G., Ryu, S., Shenoy, K. V., & Sahani, M. (2008). Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. Advances in neural information processing systems, 21.

[3] Gokcen, E., Jasper, A., Xu, A., Kohn, A., Machens, C. K., & Yu, B. M. (2023). Uncovering motifs of concurrent signaling across multiple neuronal populations. Advances in Neural Information Processing Systems, 36, 34711-34722.

[4] Zhao, Y., & Park, I. M. (2017). Variational latent gaussian process for recovering single-trial dynamics from population spike trains. Neural computation, 29(5), 1293-1316.

[5] Keeley, S., Zoltowski, D., Yu, Y., Smith, S., & Pillow, J. (2020). Efficient non-conjugate Gaussian process factor models for spike count data using polynomial approximations. In International conference on machine learning (pp. 5177-5186). PMLR.

[6] Schneider, S., Lee, J. H., & Mathis, M. W. (2023). Learnable latent embeddings for joint behavioural and neural analysis. Nature, 617(7960), 360-368.

[7] Wu, A., Roy, N. A., Keeley, S., & Pillow, J. W. (2017). Gaussian process based nonlinear latent structure discovery in multivariate spike train data. Advances in neural information processing systems, 30.

[8] Jensen, K., Kao, T. C., Tripodi, M., & Hennequin, G. (2020). Manifold GPLVMs for discovering non-Euclidean latent structure in neural data. Advances in Neural Information Processing Systems, 33, 22580-22592.

[9] Genkin, M., Shenoy, K. V., Chandrasekaran, C., & Engel, T. A. (2025). The dynamics and geometry of choice in the premotor cortex. Nature, 1-9.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013789.r006

Acceptance letter

Hugues Berry, Yuanning Li

PCOMPBIOL-D-25-00336R2

When predict can also explain: few-shot prediction to select better neural latents

Dear Dr Dabholkar,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

For Research, Software, and Methods articles, you will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Student-Teacher RNNs: co-smoothing as a function of model size.

    Finding the correct model is not just about tuning the latent-size hyperparameter. NODE SAE students over a range of sizes (5-15) all achieve high co-smoothing on the same teacher: a 64-unit noisy GRU performing 3BFF (Methods).

    (TIFF)

    pcbi.1013789.s001.tif (3.9MB, tif)
    S2 Fig. How to choose k for your dataset?

    Our theoretical analysis in “Why does few-shot work?” reveals that extraneous models are best discriminated when the shot number, k, is small. So how small can we go? In the case of sparse data like neural spike counts we may obtain k-trial subsets in which some neurons are silent. In this scenario the few-shot decoder g receives no signal for those neurons. To avoid this pathological scenario, for each dataset, we pick the smallest possible k that ensures that the probability of encountering silent neurons in a k-trial subset is safely near zero. This must be computed for each dataset independently since some datasets are more sparse than others. We compute the frequency of such silences for different k, for each NLB [6] dataset, and show the values of k (dashed lines) chosen for the analysis in the main text.

    (TIFF)

    pcbi.1013789.s002.tif (5.9MB, tif)
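The silent-neuron check described in the S2 Fig legend can be sketched as follows. This is an illustrative reconstruction in Python, not the authors' code; the array shapes, function name, and synthetic data are assumptions.

```python
# Sketch: estimate how often a k-trial subset contains silent neurons,
# in order to pick the smallest safe shot number k.
# All names and shapes here are illustrative, not from the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def silent_neuron_frequency(spikes, k, n_subsets=1000):
    """spikes: array (n_trials, n_time, n_neurons) of spike counts.
    Returns the fraction of random k-trial subsets in which at least
    one neuron fires no spikes at all."""
    n_trials = spikes.shape[0]
    silent = 0
    for _ in range(n_subsets):
        idx = rng.choice(n_trials, size=k, replace=False)
        # total spike count per neuron over the k trials and all time bins
        totals = spikes[idx].sum(axis=(0, 1))
        if (totals == 0).any():
            silent += 1
    return silent / n_subsets

# Synthetic sparse Poisson spike counts: 200 trials, 50 bins, 30 neurons
spikes = rng.poisson(0.05, size=(200, 50, 30))
freqs = {k: silent_neuron_frequency(spikes, k) for k in (2, 5, 10, 20)}
```

Scanning such frequencies over a grid of k and taking the smallest k whose frequency is safely near zero reproduces the selection rule described in the legend.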
    S1 Appendix. Decoding across HMM latents: fitting and evaluation.

    (PDF)

    pcbi.1013789.s003.pdf (108.6KB, pdf)
    S3 Fig. Good co-smoothing does not guarantee correct latents in Hidden Markov Models (HMMs).

    In the main text, we show how good prediction of held-out neural activity, i.e., co-smoothing, does not guarantee a match between model and true latents. We did this in the student-teacher setting of RNNs and NODE SAEs (Fig 2). Here we replicate the results in HMMs (see Methods). Similar to Fig 2, several student HMMs are trained on a dataset generated by a single teacher HMM, a noisy 4-cycle. The Student→Teacher decoding error D_{S→T} is low and tightly related to the co-smoothing score. The Teacher→Student decoding error D_{T→S} is more varied and uncorrelated with co-smoothing. The arrows mark the "Good" and "Bad" transition matrices shown in Fig 2 (lower).

    (TIFF)

    S4 Fig. Student-teacher results in Linear Gaussian State Space Models.

    We demonstrate that our results are not unique to the RNN or HMM settings by simulating another simple scenario: linear Gaussian state space models (LGSSMs), i.e., Kalman smoothing.

    The model is defined by parameters (μ_0, Σ_0, F, b, G, H, c, R). A major difference to HMMs is that the latent states z ∈ ℝ^M are continuous. They follow the dynamics given by:
    z_0 ~ N(μ_0, Σ_0) (54)
    z_t ~ N(F z_{t−1} + b, G) (55)
    x_t ~ N(H z_t + c, R) (56)
    Given these dynamics, the latents z can be inferred from observations x using Kalman smoothing, analogous to (15). Here we use the JAX-based dynamax implementation.
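A minimal sketch of sampling from the generative model in Eqs (54)-(56); the parameter values here are illustrative, not the dynamax defaults used in the experiments.

```python
# Sketch: sample a trajectory from the LGSSM of Eqs (54)-(56).
# Parameter values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
M, N, T = 4, 10, 100                      # latent dim, observed dim, timesteps
mu0, Sigma0 = np.zeros(M), np.eye(M)      # initial state distribution
F = 0.9 * np.eye(M)                       # stable latent dynamics
b, G = np.zeros(M), 0.1 * np.eye(M)       # dynamics bias and noise covariance
H = rng.normal(size=(N, M))               # emission matrix
c, R = np.zeros(N), 0.1 * np.eye(N)       # emission bias and noise covariance

z = np.zeros((T, M))
x = np.zeros((T, N))
z[0] = rng.multivariate_normal(mu0, Sigma0)          # Eq (54)
x[0] = rng.multivariate_normal(H @ z[0] + c, R)      # Eq (56)
for t in range(1, T):
    z[t] = rng.multivariate_normal(F @ z[t-1] + b, G)  # Eq (55)
    x[t] = rng.multivariate_normal(H @ z[t] + c, R)    # Eq (56)
```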

    We use a teacher LGSSM with M = 4, with parameters chosen randomly (using the dynamax defaults) and then fixed. Student LGSSMs are also initialised randomly and optimised with Adam [51] to minimise the negative loglikelihood on the training data (see the dataset dimensions section for the dimensions of the synthetic data set). D_{S→T} and D_{T→S} are computed with linear regression (sklearn.linear_model.LinearRegression) and predictions are evaluated against the target using R² (sklearn.metrics.r2_score). We define D_{u→v} := 1 − (R²)_{u→v}. Few-shot regression from z to x_{k-out} is also performed using linear regression.
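The decoding-error computation D = 1 − R² can be sketched as follows; the latent arrays and mixing matrix here are synthetic stand-ins for the teacher and student latents, not data from the experiments.

```python
# Sketch: cross-decoding error D := 1 - R^2 between two sets of latents,
# using the sklearn tools named in the text. Shapes are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def decoding_error(z_source, z_target):
    """z_source, z_target: arrays (n_samples, dim). Returns D = 1 - R^2
    of predicting z_target from z_source with linear regression."""
    reg = LinearRegression().fit(z_source, z_target)
    pred = reg.predict(z_source)
    return 1.0 - r2_score(z_target, pred)

rng = np.random.default_rng(1)
z_teacher = rng.normal(size=(500, 4))
# a student whose latents are a linear mix of the teacher's, plus small noise
mix = rng.normal(size=(4, 6))
z_student = z_teacher @ mix + 0.01 * rng.normal(size=(500, 6))
d_st = decoding_error(z_student, z_teacher)  # near 0: teacher is recoverable
```

A student containing the teacher latents gives a small Student→Teacher error, whereas regression onto unrelated latents leaves the error near 1.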

    In line with our results with RNNs and HMMs (Fig 2 and Fig 4), we show that among the models with high test loglikelihood (> −55), D_{S→T}, but not D_{T→S}, is highly correlated with test loglikelihood, while D_{T→S} shows a close relationship to the average 10-shot MSE. For these linear Gaussian state space models, we report loglikelihood instead of co-smoothing, and k-shot MSE instead of k-shot co-smoothing, demonstrating the same pattern of results across model classes.

    (TIFF)

    pcbi.1013789.s005.tif (5.4MB, tif)
    S5 Fig. HMM network visualisations.

    In the main text Fig 2 we visualised the teacher and two student HMMs as graphs of fractional traffic volume on states and transitions. For clarity we dropped the low-probability edges with values lower than 0.01. Here we show the same models with all edges visualised, including the low-probability transitions that were omitted in the main text figure.

    (TIFF)

    S6 Fig. Few-shot co-smoothing is not simply hard co-smoothing (variations of HMM student-teacher experiments).

    Few-shot co-smoothing is a more difficult metric than standard co-smoothing. Thus, it might seem that any increase in the difficulty of the metric would yield similar results. To show this is not the case, we use standard co-smoothing with fewer held-in neurons. The score is lower (because the task is more difficult), but it does not discriminate models.

    We demonstrate this through two variations of HMM student-teacher experiments. In the first variation, we increase the number of held out neurons from Nout=50 to Nout=100, making the co-smoothing problem harder. The top three panels show: (1) decoder student-teacher original simple, (2) decoder teacher-student original simple (same as main text Fig 1CD), and (3) decoder teacher-student 6-shot best (same as main text Fig 4B). In the second variation, we decrease the number of held-in and held-out neurons to Nin=5, Nout=5, Nk−out = 50, further increasing difficulty. The bottom three panels show the same three decoder configurations as the top row. While the score does decrease because the problem is harder, co-smoothing is still not indicative of good models while few-shot co-smoothing remains discriminative.

    (TIFF)

    pcbi.1013789.s007.tif (25.7MB, tif)
    S2 Appendix. Time cost of computing few-shot co-smoothing.

    (PDF)

    pcbi.1013789.s008.pdf (66.6KB, pdf)
    S7 Fig. Classifying task variables from latents in models with contrasting few-shot performance.

    In main text Fig 7(lower panel), we compare two STNDT models trained on mc_maze_20 that perform identically under standard co-smoothing but diverge under 64-shot co-smoothing. Projecting their latents onto the top two principal components reveals differences in trajectory smoothness and task-condition separation. Quantitatively, the “Bad” model exhibits higher latent dimensionality, as reflected by the slower growth of variance explained across PCs (left panel), and yields poorer binary classification of maze barrier presence—especially when using only the top two principal components (right panel).

    (TIFF)

    S3 Appendix. Illustrative example of the difference between cycle consistency and cross-decoding.

    (PDF)

    pcbi.1013789.s010.pdf (87.4KB, pdf)
    Attachment

    Submitted filename: few shot rebuttal.docx

    pcbi.1013789.s011.docx (14.1KB, docx)
    Attachment

    Submitted filename: Few shot rebuttal.pdf

    pcbi.1013789.s012.pdf (220KB, pdf)

    Data Availability Statement

    Code for the few-shot evaluation is available at https://github.com/KabirDabholkar/nlb_tools_fewshot. Code for the HMM simulations is available at https://github.com/KabirDabholkar/hmm_analysis.

    The experiments done in this work are largely based on code repositories from previous works. The following repositories were used or developed in this work:

