Abstract
Predicting future neural activity is a core challenge in modeling brain dynamics, with applications ranging from scientific investigation to closed-loop neurotechnology. While recent models of population activity emphasize interpretability and behavioral decoding, neural forecasting—particularly across multi-session, spontaneous recordings—remains underexplored. We introduce POCO, a unified forecasting model that combines a lightweight univariate forecaster with a population-level encoder to capture both neuron-specific and brain-wide dynamics. Trained across five calcium imaging datasets spanning zebrafish, mice, and C. elegans, POCO achieves state-of-the-art accuracy at cellular resolution in spontaneous behaviors. After pre-training, POCO rapidly adapts to new recordings with minimal fine-tuning. Notably, POCO’s learned unit embeddings recover biologically meaningful structure—such as brain region clustering—without any anatomical labels. Our comprehensive analysis reveals several key factors influencing performance, including context length, session diversity, and preprocessing. Together, these results position POCO as a scalable and adaptable approach for cross-session neural forecasting and offer actionable insights for future model design. By enabling accurate, generalizable forecasting models of neural dynamics across individuals and species, POCO lays the groundwork for adaptive neurotechnologies and large-scale efforts for neural foundation models. Code is available at https://github.com/yuvenduan/POCO.
1. Introduction
The ability to predict future states from the past is a critical benchmark for models of complex systems such as the brain [32]. Models capable of rapidly and accurately forecasting future neural activity across large spatial-temporal scales—rather than merely fitting historical data—are critical for applied technologies such as closed-loop optogenetic control [19]. Here, we focus on a time-series forecasting (TSF) setup: Given a recent history of measured neural activity, can we predict the neural population dynamics in the near future?
Recently, the increasing ability to simultaneously record from large populations of neurons has motivated a wide range of models of neural population dynamics [15, 27, 46, 60]. However, existing work has primarily focused on interpreting features of these high-dimensional dynamics, whereas neural forecasting is relatively under-explored. In addition, previous work has been mostly limited to a handful of brain regions during controlled behavioral tasks, often using short, trial-based data from individual animals. While trial-based data make modeling tractable due to strong behavioral constraints, a comprehensive understanding of neural dynamics benefits from whole-brain recordings during spontaneous, task-free behaviors. Growing evidence for shared neural motifs across individuals [42, 12, 37], along with the rise of large-scale, multi-animal datasets, motivates the development of foundation models—models trained across individuals that generalize to unseen subjects [5, 6, 2]. However, classical models of population dynamics predominantly focus on fitting single-session recordings [40, 27, 15], limiting their ability to utilize larger datasets and capture common motifs shared across animals.
To address these gaps, we developed POCO (POpulation-COnditioned forecaster), a unified predictive model for forecasting spontaneous, brain-wide neural activity. Trained on multi-animal calcium imaging datasets, POCO predicts cellular-resolution dynamics up to ~15 seconds into the future. It combines a simple univariate forecaster for individual neuron dynamics with a population encoder that models the influence of global brain state on each neuron, using Feature-wise Linear Modulation (FiLM) [39] to condition forecasts on population-level structure. For the population encoder, we adapt POYO [5]—originally developed for behavioral decoding in primates—to summarize high-dimensional population activity across sessions. We benchmark POCO against standard baselines and state-of-the-art TSF models trained on five datasets from zebrafish, mice, and C. elegans.
In sum, our key contributions are: (1) We introduce POCO, a novel architecture that combines a local forecaster with a population encoder, for cellular-level neural dynamics forecasting. (2) We benchmark the forecasting performance of a wide range of models on five diverse calcium imaging datasets spanning different species, with a focus on neural recording during spontaneous behaviors. (3) We demonstrate that POCO scales effectively with longer recordings and additional sessions, and pre-trained POCO can quickly adapt to new sessions. (4) We conduct extensive analyses on factors that affect performance, including context length, dataset pre-processing, multi-dataset training, and similarity between individuals. These analyses provide critical insights for future work on multi-session neural forecasting.
2. Related Work
Neural Foundation Models.
A growing body of work aims to develop unified models trained on neural data across multiple subjects, tasks, and datasets. This foundation model approach has been applied to spiking data in primates [5, 53, 52, 56, 2], human EEG and fMRI recordings [8, 10, 48], and calcium imaging in mice [6]. While much of this work emphasizes improving behavioral decoding performance [5, 6, 53], some studies have explored forward prediction [17, 2, 56, 38]. However, these efforts are largely confined to spiking data recorded during short, trial-based motor or decision-making tasks. Neural prediction has also been explored in C. elegans [44], but the setup is limited to next-step prediction using autoregressive models.
Models of Population Dynamics.
To understand high-dimensional population dynamics, one line of work has focused on inferring low-dimensional latent representations from observations, with models including RNNs [15, 35, 40], switching linear dynamical systems [18, 27], sequential variational autoencoders (VAEs) [46, 57, 37, 60], and latent diffusion models [22]. While some of these models could, in theory, be adapted for forecasting, their focus to date has been on gaining interpretable insights, especially in constrained neuroscience tasks.
Time-Series Forecasting.
Time series forecasting is a general problem that emerges in a variety of domains [51, 29, 58, 32]. Deep learning has led to a wide range of TSF architectures, including RNNs [43], temporal convolutional networks (TCNs) [28, 33], and Transformer-based models [58, 59, 47]. Simpler models—such as MLP-based architectures [9, 54, 13] and even linear models [49, 55]—often perform competitively or even outperform more complex alternatives. In neuroscience, Zapbench [32] is a recent benchmark for forecasting, though recordings are limited to a single larval zebrafish.
3. Method
3.1. Problem Setup
In this work, we consider a multi-session TSF problem. For a session $s$, we use $x^{(s)}_t \in \mathbb{R}^{N_s}$ to denote the neural activity at time step $t$, where $N_s$ is the number of neurons recorded in session $s$. Given the population activity of the last $L$ time steps, $X^{(s)}_{t-L+1:t} = [x^{(s)}_{t-L+1}, \ldots, x^{(s)}_t]$, we hope to find a predictor $F$ that forecasts the next $T$ steps $X^{(s)}_{t+1:t+T}$ and minimizes the mean squared error $\mathcal{L}$:

$$\mathcal{L} = \frac{1}{N_s T} \left\| F\big(X^{(s)}_{t-L+1:t}\big) - X^{(s)}_{t+1:t+T} \right\|_F^2 \tag{1}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. In most experiments, we use $T = 16$. Importantly, the number of neurons varies across sessions, and neurons in different animals do not have one-to-one correspondences; yet, the forecasting problems for different sessions are closely linked due to the similarity in neural dynamics between animals [5, 6, 2], which distinguishes this setting from standard multivariate TSF setups.
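As a concrete illustration, the forecasting objective above can be computed in a few lines of NumPy. The shapes and the context length used here are illustrative placeholders, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
N_s, L, T = 100, 48, 16  # neurons, context length, horizon (illustrative values)

context = rng.standard_normal((N_s, L))  # observed history X_{t-L+1:t}
target = rng.standard_normal((N_s, T))   # future activity X_{t+1:t+T}

def forecast_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Squared Frobenius norm of the error, normalized by N_s * T."""
    return float(np.linalg.norm(pred - target, ord="fro") ** 2 / pred.size)

# a trivial predictor: repeat the last observed time step T times
pred = np.tile(context[:, -1:], (1, T))
loss = forecast_mse(pred, target)
```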
3.2. Population-Conditioned Forecaster
We first introduce the overall framework of POCO (Figure 1A). Consider an MLP forecaster with hidden size $D$ that takes the past activity $x \in \mathbb{R}^L$ of a single neuron as input:

$$\mathrm{MLP}(x) = W_2\, \sigma(W_1 x) \tag{2}$$

where $W_1 \in \mathbb{R}^{D \times L}$ and $W_2 \in \mathbb{R}^{T \times D}$ are weight matrices and $\sigma$ is a nonlinearity. The MLP forecaster is univariate, meaning that the prediction for a neuron only depends on its own history, capturing individual auto-correlative properties and simple temporal patterns. Prior work has shown that these simple univariate forecasters perform surprisingly well even for multivariate data [47, 13, 54, 55].
Figure 1: Model Architecture.
(A) POCO combines a univariate MLP forecaster (orange part) and a population encoder that conditions the MLP (blue part). This is a schematic for illustration only; traces and feature maps shown are not actual model input, output, or embedding. (B) The population encoder is adapted from POYO [5]. We split the trace of each neuron into several tokens, encode the tokens with POYO, and then use unit embedding to query the conditioning parameters. See the Method section for more details.
Building on the MLP forecaster, we add a population encoder that modulates the prediction of the MLP through Feature-wise Linear Modulation (FiLM) [39]. Specifically, the population encoder gives the conditioning parameters $\gamma, \beta \in \mathbb{R}^D$, which are of the same shape as the hidden activations in the MLP. We then define POCO as

$$\mathrm{POCO}(x) = W_2 \big( \gamma \odot \sigma(W_1 x) + \beta \big) \tag{3}$$

where $\odot$ denotes element-wise multiplication. Intuitively, the FiLM conditioning allows the population encoder to modulate how each neuron’s past activity is interpreted, effectively tailoring the MLP forecaster to the broader brain state at each time point. This enables the model to account for context-dependent dynamics while maintaining neuron-specific predictions.
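To make the FiLM-conditioned forward pass concrete, here is a minimal NumPy sketch of Equations 2 and 3. The single hidden layer, ReLU nonlinearity, and all dimensions are illustrative assumptions, not the exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, D = 48, 16, 64  # context length, horizon, hidden size (illustrative)
N = 10                # neurons in a toy session

W1 = rng.standard_normal((D, L)) * 0.1  # input weights (Eq. 2)
W2 = rng.standard_normal((T, D)) * 0.1  # output weights

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forecast(x):
    """Plain univariate MLP (Eq. 2): x is (N, L), output is (N, T)."""
    return relu(x @ W1.T) @ W2.T

def poco_forecast(x, gamma, beta):
    """FiLM-conditioned forecast (Eq. 3); gamma and beta are (N, D),
    produced per neuron by the population encoder."""
    h = relu(x @ W1.T)               # (N, D) hidden activations
    return (gamma * h + beta) @ W2.T

x = rng.standard_normal((N, L))
# with gamma = 1 and beta = 0, POCO reduces to the plain MLP forecaster
same = poco_forecast(x, np.ones((N, D)), np.zeros((N, D)))
```

The identity check below illustrates the design: the population encoder perturbs, rather than replaces, the univariate forecaster.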
3.3. Population Encoder
We then need to define a population encoder capable of modeling how the population state influences each neuron. To this end, we adapt a recent architecture, POYO [5, 6], which combines the Perceiver-IO architecture [21] and a tokenization scheme for neural data (Figure 1B). Specifically, for each neuron $i$, we partition the trace $x_i \in \mathbb{R}^L$ into segments of length $P$; each segment forms a token, creating $L/P$ tokens per neuron. Then for the $k$-th token of neuron $i$, we define the embedding as

$$e_{i,k} = W_{\mathrm{proj}}\, x_{i,\,(k-1)P+1:kP} + \mathrm{UnitEmbed}(i) + \mathrm{SessionEmbed}(s) \tag{4}$$

where $W_{\mathrm{proj}} \in \mathbb{R}^{D_e \times P}$ is a linear projection, $[(k-1)P+1,\, kP]$ defines the temporal boundary of the token, and both UnitEmbed and SessionEmbed are learnable embeddings in $\mathbb{R}^{D_e}$. Intuitively, after learning, the unit embedding can define the dynamical or functional properties of the neuron, while the session embedding can account for different recording conditions (e.g., sampling rates, raw fluorescence magnitude) in different sessions. These token embeddings are arranged into a matrix $E \in \mathbb{R}^{(N_s L/P) \times D_e}$, which is then processed by Perceiver-IO [21]. Specifically, we use learnable latents $Z$ as the query for the first cross-attention layer. After the self-attention layers over the latents, the final attention layer uses the unit embedding as queries to extract the conditioning parameters $\gamma_i, \beta_i$:

$$\gamma_i = W_\gamma h_i, \qquad \beta_i = W_\beta h_i, \qquad h_i = \mathrm{Attention}\big(\mathrm{UnitEmbed}(i),\ Z'\big) \tag{5}$$

where Attention is a multi-head attention layer [50] with $\mathrm{UnitEmbed}(i)$ as the query and the processed latents $Z'$ as keys and values, and $W_\gamma, W_\beta$ are learned weight matrices. The encoder has $K$ attention layers in total, with $K$ kept small in most experiments. Following POYO, we use rotary position embeddings [45] (details are omitted above for simplicity). One advantage of the Perceiver-IO architecture is that its time complexity scales only linearly with the number of neurons, allowing the model to efficiently scale to recordings of large neural populations. Although we refer to individual neurons throughout, the same framework can also be applied to reduced representations of neural activity, such as principal components (PCs).
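The tokenization of Equation 4 can be sketched as follows. The dimensions, the random initial embeddings, and the use of a single shared projection are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, P, D_e = 5, 48, 8, 32  # neurons, context, token length, embed dim (illustrative)

traces = rng.standard_normal((N, L))
W_proj = rng.standard_normal((D_e, P)) * 0.1
unit_embed = rng.standard_normal((N, D_e))   # learnable per-neuron embedding
session_embed = rng.standard_normal((D_e,))  # one embedding for this session

# split each neuron's trace into L / P non-overlapping segments
segments = traces.reshape(N, L // P, P)      # (N, L/P, P)

# Eq. (4): project each segment, then add unit and session embeddings
tokens = segments @ W_proj.T + unit_embed[:, None, :] + session_embed
tokens = tokens.reshape(N * (L // P), D_e)   # flat token matrix fed to Perceiver-IO
```

Note how the token count, $N \cdot L / P$, shrinks as $P$ grows; this is the knob that keeps the encoder tractable for large populations.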
Our population encoder has two notable differences from the recent POYO+ model, which is also designed for calcium data [6]. First, POYO+ is designed for behavioral decoding, so its unit embedding is only used for tokenization. In contrast, here the unit embedding is reused in the last layer to query how the population drives each neuron. Second, in POYO+, the token length $P$ is always fixed to 1, which creates a massive number of tokens when the context length and the number of neurons are large. We discuss the effect of $P$ in Figure S15.
4. Benchmark
4.1. Datasets
To comprehensively test the predictive capability of the models, we used five different datasets from different labs (Table 1). Most recordings were collected during task-free spontaneous behavior, though the Ahrens zebrafish dataset involves responses to visual stimuli [11]. Details of the segmentation pipelines used to extract fluorescence traces are described in the original dataset publications. We z-scored all fluorescence traces to zero mean and unit variance. For experiments involving predicting PCs, we computed PCs after z-scoring individual neurons, and the magnitudes of PCs were preserved. We first cut each session into 1K-step segments, then partitioned each segment into training, validation, and test sets by 3:1:1. See the Appendix for more details.
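A minimal NumPy sketch of this preprocessing (per-neuron z-scoring, 1K-step segments, 3:1:1 splits) follows. The assumption that each segment is split into contiguous train/validation/test blocks is ours:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy (neurons, time) fluorescence traces with slow drift
activity = rng.standard_normal((200, 5000)).cumsum(axis=1)

# z-score each neuron's trace to zero mean and unit variance
mean = activity.mean(axis=1, keepdims=True)
std = activity.std(axis=1, keepdims=True)
z = (activity - mean) / std

# cut the session into 1K-step segments, then split each segment 3:1:1
seg_len = 1000
n_seg = z.shape[1] // seg_len
splits = {"train": [], "val": [], "test": []}
for k in range(n_seg):
    seg = z[:, k * seg_len:(k + 1) * seg_len]
    splits["train"].append(seg[:, :600])    # 3/5 of the segment
    splits["val"].append(seg[:, 600:800])   # 1/5
    splits["test"].append(seg[:, 800:])     # 1/5
```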
Table 1: Overview of the five datasets.
$f_s$ is the sampling frequency. The number of neurons, recording length, and sampling frequency vary by session; approximate averages are shown here.

| Species | Lab | #Sessions | #Neurons | #Steps | $f_s$ | Ca2+ Indicator |
|---|---|---|---|---|---|---|
| Larval Zebrafish | Deisseroth[1] | 19 | 11K | 4.3K | 1.1Hz | GCaMP6s |
| Larval Zebrafish | Ahrens[11] | 15 | 77K | 3.9K | 2.1Hz | GCaMP6f |
| Mice | Harvey[40, 3] | 12 | 1.6K | 15K | 5.4Hz | GCaMP6s |
| C. elegans | Zimmer[23] | 5 | 126 | 3.1K | 2.8Hz | GCaMP5K |
| C. elegans | Flavell[4] | 40 | 136 | 1.6K | 1.7Hz | GCaMP7f |
4.2. Baselines
We compared POCO against a diverse set of baselines, from basic linear and auto-regressive models to state-of-the-art methods for time-series forecasting and dynamical system reconstruction. For all models, we used the AdamW [30] optimizer with learning rate $3 \times 10^{-4}$ and weight decay $10^{-4}$. (1) MLP is the POCO model without conditioning (Equation 2). (2) NLinear and DLinear [55] are variants of univariate linear models that use a linear projection from the context to predictions. (3) Latent_PLRNN uses a piecewise linear RNN to model underlying dynamical states linked to observations through a linear projection [24, 35]. (4) TSMixer [16] is an all-MLP architecture for TSF based on mixing modules for the time and feature dimensions. (5) TexFilter [54] learns context-dependent frequency filters for time-series processing. (6) AR_Transformer is a basic autoregressive Transformer [50]. (7) NetFormer [31] infers inter-neuron connection strengths via an attention layer, learning a dynamic interaction graph. Here, we add a softmax to the attention weights for more stable training under multi-step prediction. (8) TCN denotes ModernTCN [33], a recent multivariate pure-convolution architecture for TSF. More details about architectures and training can be found in the Appendix.
4.3. Copy Baseline and Prediction Score
Lastly, we considered a naive baseline that copies the last observation:

$$\mathrm{Copy}\big(X^{(s)}_{t-L+1:t}\big) = [\,x^{(s)}_t, \ldots, x^{(s)}_t\,] \in \mathbb{R}^{N_s \times T} \tag{6}$$

Although extremely simple, the copy baseline is strong due to the slow dynamics of calcium traces, especially in short-term forecasting [44]. As a more intuitive metric than the raw MSE loss, we define the prediction score as the relative performance improvement compared to the copy baseline, i.e.,

$$\mathrm{Score} = 1 - \frac{\mathcal{L}_{\mathrm{model}}}{\mathcal{L}_{\mathrm{copy}}} \tag{7}$$

which is similar to $R^2$, but uses the last observed time step in place of the sample mean.
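Equations 6 and 7 amount to the following NumPy sketch; the toy sinusoidal traces are for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def prediction_score(pred, target, context):
    """Eq. (7): relative MSE improvement over the copy baseline (Eq. 6)."""
    copy_pred = np.repeat(context[:, -1:], target.shape[1], axis=1)
    mse_model = np.mean((pred - target) ** 2)
    mse_copy = np.mean((copy_pred - target) ** 2)
    return 1.0 - mse_model / mse_copy

# smooth toy traces: calcium-like slow dynamics make the copy baseline strong
t = np.linspace(0, 4 * np.pi, 64)
traces = np.sin(t)[None, :] + 0.01 * rng.standard_normal((20, 64))
context, target = traces[:, :48], traces[:, 48:]

perfect = prediction_score(target, target, context)      # upper bound: 1
copy = np.repeat(context[:, -1:], target.shape[1], axis=1)
zero = prediction_score(copy, target, context)           # copy itself scores 0
```

A score above 0 means the model beats the copy baseline; negative scores (as for AR_Transformer in Table 2) mean it does worse than simply holding the last frame.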
5. Results
Multi-Session POCO Outperforms Baselines.
We tested POCO against baselines on five calcium imaging datasets (Table 2). Sample prediction traces for POCO are shown in Figure 2C. For the zebrafish datasets, we first considered the more tractable problem of predicting the first 512 principal components (PCs) due to the large number of neurons. We compared training a separate model for each session (single-session models, SS) with training a shared model for all sessions (multi-session models, MS). POCO consistently benefited from multi-session training, outperforming all baselines on four out of five datasets. Other models, such as PLRNN, TexFilter, and NetFormer, also showed performance gains from multi-session training. We obtained similar results when measuring prediction errors by MSE and MAE (mean absolute error), as shown in Tables S6 and S7.
Table 2: POCO achieves highest prediction scores across species and datasets.
Prediction scores across five datasets show that POCO consistently outperforms baselines, especially in the multi-session setting. Zebrafish data are reduced to 512 PCs. 95% CI from 4 seeds.
| Model | Zebrafish (Deisseroth), 512 PCs | Zebrafish (Ahrens), 512 PCs | Mice | C. elegans (Zimmer) | C. elegans (Flavell) |
|---|---|---|---|---|---|
| Single-Session Models | | | | | |
| POCO | 0.466 ±0.019 | 0.433 ±0.008 | 0.415 ±0.001 | 0.329 ±0.009 | 0.079 ±0.017 |
| MLP | 0.399 ±0.001 | 0.388 ±0.001 | 0.409 ±0.000 | 0.336 ±0.001 | 0.236 ±0.002 |
| NLinear | 0.167 ±0.000 | 0.211 ±0.000 | 0.348 ±0.000 | 0.250 ±0.000 | 0.217 ±0.001 |
| Latent_PLRNN | 0.064 ±0.024 | 0.212 ±0.002 | 0.335 ±0.001 | 0.143 ±0.015 | 0.170 ±0.005 |
| TexFilter | 0.419 ±0.006 | 0.378 ±0.003 | 0.389 ±0.000 | 0.333 ±0.005 | 0.230 ±0.002 |
| NetFormer | 0.204 ±0.013 | 0.208 ±0.008 | 0.329 ±0.000 | 0.145 ±0.004 | 0.168 ±0.002 |
| AR_Transformer | −0.875 ±0.027 | −0.054 ±0.005 | 0.312 ±0.005 | −0.320 ±0.063 | −1.024 ±0.038 |
| DLinear | 0.211 ±0.000 | 0.290 ±0.000 | 0.394 ±0.000 | 0.267 ±0.000 | 0.221 ±0.000 |
| TCN | 0.153 ±0.010 | 0.240 ±0.004 | 0.360 ±0.000 | 0.305 ±0.003 | 0.226 ±0.004 |
| TSMixer | −0.550 ±0.036 | 0.016 ±0.009 | 0.390 ±0.001 | 0.129 ±0.012 | −0.199 ±0.039 |
| Multi-Session Models | | | | | |
| MS_POCO | 0.525 ±0.004 | 0.440 ±0.003 | 0.420 ±0.002 | 0.364 ±0.005 | 0.213 ±0.030 |
| MS_MLP | 0.417 ±0.002 | 0.370 ±0.001 | 0.409 ±0.000 | 0.348 ±0.002 | 0.274 ±0.004 |
| MS_NLinear | 0.165 ±0.000 | 0.202 ±0.000 | 0.347 ±0.000 | 0.253 ±0.000 | 0.221 ±0.000 |
| MS_Latent_PLRNN | 0.149 ±0.002 | 0.248 ±0.007 | 0.355 ±0.000 | 0.118 ±0.011 | 0.183 ±0.006 |
| MS_TexFilter | 0.440 ±0.005 | 0.349 ±0.000 | 0.389 ±0.000 | 0.346 ±0.000 | 0.256 ±0.002 |
| MS_NetFormer | 0.221 ±0.008 | 0.220 ±0.002 | 0.331 ±0.000 | 0.150 ±0.002 | 0.217 ±0.001 |
| MS_AR_Transformer | −0.777 ±0.005 | 0.002 ±0.012 | 0.317 ±0.003 | −0.333 ±0.043 | −0.675 ±0.022 |
Figure 2: POCO maintains accuracy advantage over time and benefits from longer context.
(A) MSE increases when forecasting longer into the future. Results are shown for two different datasets; see Figure S6 for additional datasets. Error bars show SEM over 3 random seeds. (B) Model performance improves with longer context. (C) Sample prediction traces produced by POCO, where the first $L$ steps are given to the model as context.
We further evaluated a subset of efficient multi-session models on two zebrafish datasets at single-cell resolution (Table 3). Multi-session POCO again outperformed all baselines, demonstrating its effectiveness in modeling both original neural activity and PCA-reduced activity.
Table 3: POCO outperforms baselines at single-cell resolution in zebrafish.
Prediction scores on raw neural traces (not PCA-reduced) from Ahrens and Deisseroth datasets. All models are multi-session. 95% CI from 4 seeds.
| Model | Zebrafish (Ahrens) | Zebrafish (Deisseroth) |
|---|---|---|
| MS_POCO | 0.429 ±0.003 | 0.251 ±0.004 |
| MS_MLP | 0.417 ±0.001 | 0.254 ±0.001 |
| MS_NLinear | 0.367 ±0.000 | 0.172 ±0.001 |
| MS_TexFilter | 0.398 ±0.001 | 0.232 ±0.003 |
Effect of Context and Prediction Length.
We observed that prediction error gradually increases over prediction steps, but POCO generated relatively accurate predictions across all time steps (Figure 2A). To evaluate the impact of context length, we also tested shorter context lengths $L$, adjusting POCO’s token length $P$ proportionally to keep the number of tokens constant. We found that univariate models perform poorly with short contexts, consistent with results in recent work [32]. POCO outperformed baselines across context lengths (Figure 2B). Prediction accuracy increased with context length but eventually plateaued, suggesting that most of the predictive information is contained within a relatively short temporal window.
POCO Performance Improvements Scale with Recording Length.
We next tested how POCO performance scales with dataset size. First, instead of using the full training partition in each session, we tested model performance when only an initial portion of each session’s training data is used (Figure 3A, S7). We found that POCO shows steady improvement when longer recordings are used for training, whereas TexFilter shows slower improvements and NLinear shows no apparent improvement. Second, we split the sessions in one dataset into several approximately even partitions and trained one model for each partition (Figure 3B, S8). We found that POCO consistently benefits from training on more sessions. Taken together, these results suggest that POCO effectively leverages longer recordings across sessions to learn complex neural dynamics.
Figure 3: POCO performance improves with longer recordings and more sessions.
(A) Prediction score vs. training recording length (x-axis in log scale) for two different datasets. Models were trained using increasing portions of each session’s data. Error bars show SEM across 3 random seeds. (B) We split all sessions in one dataset into approximately equal partitions and trained one model on each partition, then averaged model prediction scores across partitions. Average prediction score vs. the number of partitions is shown for two datasets.
POCO Does Not Consistently Benefit from Multi-Species Training.
To test whether training on datasets from multiple species improves performance, we trained POCO on different datasets simultaneously. Specifically, in each model update step, we aggregated the loss computed from one random batch of data from each dataset. We found that the model generally does not benefit from multi-species training (Table 4). This may be due to differences across datasets in pre-processing pipelines, recording conditions, and, perhaps more importantly, differences in the underlying neural dynamics of the recorded species, the animals’ behavioral states, and the specific brain regions (Table 1). Given this result and a recent report that pre-training on other specimens does not improve neural forecasting performance in zebrafish [20], we hypothesize that model performance only significantly benefits from additional sessions when the modeled systems are sufficiently similar. We explored this hypothesis in the following simulation experiment.
Table 4: POCO benefits from multi-session, but not multi-species, training.
Comparing single-session, multi-session, multi-species, and zebrafish-only POCO variants. Within-species training yields the best performance. 95% CI from 4 seeds.
| Model | Zebrafish (Deisseroth), 512 PCs | Zebrafish (Ahrens), 512 PCs | Mice | C. elegans (Zimmer) | C. elegans (Flavell) |
|---|---|---|---|---|---|
| Single-Session POCO | 0.466 ±0.019 | 0.433 ±0.008 | 0.415 ±0.001 | 0.329 ±0.009 | 0.079 ±0.017 |
| Multi-Session POCO | 0.525 ±0.004 | 0.440 ±0.003 | 0.420 ±0.002 | 0.364 ±0.005 | 0.213 ±0.030 |
| Multi-Species POCO | 0.499 ±0.003 | 0.441 ±0.003 | 0.403 ±0.000 | 0.330 ±0.009 | 0.252 ±0.011 |
| Zebrafish POCO | 0.500 ±0.005 | 0.442 ±0.004 | — | — | — |
Simulation.
Here, we used simulated data to test how similarity between individuals influences multi-session model performance. To generate neural data from a synthesized cohort, we first randomly sample a template connectivity matrix $W$. Then for each synthesized individual $i$, we set

$$W_i = \sqrt{1 - \alpha^2}\, W + \alpha\, \Delta W_i$$

where $\Delta W_i$ is a random deviation from the template connectivity matrix and $\alpha \in [0, 1]$ controls the similarity between individuals. The coefficient $\sqrt{1 - \alpha^2}$ on $W$ ensures that the entries of $W_i$ keep unit variance. We then used each $W_i$ as the connectivity of a noisy spontaneous RNN to generate synthetic neural data (see the Appendix for details) and trained POCO on the resulting multi-session data (Figure 4A). POCO showed greater benefit from multi-session training when individuals shared similar connectivity patterns (small $\alpha$) than when their connectivity was independent ($\alpha = 1$).
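The cohort construction can be sketched as follows. The network size, cohort size, and the $1/N$ entry variance of the connectivity matrices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_indiv = 64, 16  # network size and cohort size (illustrative)
alpha = 0.5          # 0: identical individuals, 1: independent connectivity

W_template = rng.standard_normal((N, N)) / np.sqrt(N)
cohort = []
for _ in range(n_indiv):
    dW = rng.standard_normal((N, N)) / np.sqrt(N)  # per-individual deviation
    W_i = np.sqrt(1 - alpha**2) * W_template + alpha * dW
    cohort.append(W_i)

# the sqrt(1 - alpha^2) coefficient keeps the entry variance of W_i constant:
# (1 - alpha^2)/N + alpha^2/N = 1/N for any alpha
var = np.var(np.stack(cohort))
```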
Figure 4: Multi-session POCO improves when individuals are similar; POCO can quickly adapt to new sessions.
(A) Performance gain of multi-session POCO compared to single-session POCO on synthetic data for different values of $\alpha$. Larger $\alpha$ means individuals are less similar. We randomly generated 16 cohorts, each with 16 individuals. Each blue cross represents a cohort. Error bars are SEM. (B) Validation loss curves when fine-tuning pre-trained POCO (Pre-POCO) and when training POCO, NLinear, and MLP from scratch. We also compared full fine-tuning with tuning only the embeddings. Dashed gray lines represent the copy baseline. Error shading represents SEM over 3 random seeds. Two sample sessions from two datasets are shown here; see Figure S12 for more sessions.
We also compared POCO with baselines on single-session simulation data. Surprisingly, models including PLRNN, auto-regressive Transformer, and TSMixer performed significantly better than POCO, despite their relatively poor performance on real neural datasets (Figure S11). Real neural data likely exhibits greater non-stationarity, multi-scale dependencies, and heterogeneous noise profiles than our current simulations, properties which POCO’s architecture may be better suited to handle than models excelling in the simulated regime.
Finetuning on New Sessions.
A core benefit of the foundation model approach is that it allows rapid adaptation to new sessions. We pre-trained POCO on 80% of sessions and fine-tuned this model on each of the remaining ones (see the Appendix for details). For finetuning, we compared full finetuning with only finetuning the unit and session embedding. We found that pre-trained models achieved reasonable predictive performance in only tens of training steps (Figure 4B). In addition, full finetuning leads to limited or no improvements compared to only fine-tuning the unit embedding (Table S9). Rapid adaptation is crucial for real-time, closed-loop applications: in our setup, fine-tuning the embedding for 200 training steps takes less than 15 seconds, and the forecasting inference time is only 3.5 ms (see the Appendix for details). Consistent with previous results on multi-dataset training, pre-training on different datasets does not improve performance (Table S10).
Analyzing Unit Embedding.
Recent work shows that unit embeddings in POYO learn region- and cell-type-specific information when trained for behavioral decoding [6]. We analyzed the trained multi-session POCO model to determine whether similar structure emerges. We found that when trained on PCs, unit embeddings are consistently distributed according to the order of the PCs (Figure 5A). When trained on neurons, we found significant clustering of neurons by brain region, as indicated by the average pairwise cosine similarity (Figure 5C). In particular, retrosplenial cortex (RSP) neurons in mice form a particularly distinct cluster (Figure 5B). Thus, POCO learns to encode functional dynamical properties in the unit embedding when trained for forecasting, even when no prior knowledge (e.g., neuron location) is given to the model.
Figure 5: POCO learns meaningful unit embeddings without supervision.
(A) UMAP [34] visualization of unit embeddings after training POCO on the first 512 PCs in a zebrafish dataset. (B) Visualization of unit embedding after training POCO on the mice dataset, where neurons are colored by the brain region. One sample session is shown for (A) and (B), see Figure S9 for more sessions. (C) Normalized average cosine similarity of unit embeddings between each pair of regions. Each row is normalized to [0, 1] and then averaged across 4 runs. Patterns are consistent for different seeds (Figure S10). See the Appendix for more details.
Effect of Filters.
Filtering out high- or low-frequency components is a common preprocessing technique in calcium imaging, used to remove slow drifts or fast noise that are not directly related to neural dynamics [23, 36]. By default, we used no temporal filter to maximally preserve neural signals. However, we found that with low-pass filtering, POCO still outperforms baselines except on the C. elegans datasets (Table S8). Low-pass filters improved model performance relative to the copy baseline in most datasets, suggesting that high-frequency components are generally harder to predict (Figure S13). Low-pass filtering also helped POCO benefit more from multi-session training (Figure S8).
Zapbench Evaluation.
Although the main focus of this work is on multi-session datasets, we also tested our model on a recent neural population forecasting benchmark, Zapbench [32], which contains light-sheet microscopy recordings of 71,721 neurons over 7,879 time steps from one zebrafish. We followed the Zapbench setup to partition the dataset and evaluated our model under both its short-context and long-context settings. With the short context, POCO outperforms other trace-based methods and performs comparably to UNet, a computationally expensive model operating directly on raw volumetric videos rather than segmented neural traces [20]. With the long context, POCO underperforms relative to UNet (Figure S14). See the Appendix for more details.
Ablation Study.
To test the necessity of the components of POCO, we removed or replaced parts of the model and compared performance. Specifically, we tested (1) directly using the POYO model to generate the 16-step prediction instead of generating conditioning parameters (POYO only); (2) the MLP without conditioning; and (3) the MLP conditioned by a univariate Transformer that takes the calcium trace of a single neuron as input instead of encoding the whole population (see the Appendix for more details). Both the MLP forecaster and the population encoder were necessary for full performance (Table 5). We also tested how key model hyperparameters influence performance, including the token length $P$ (Figure S15), embedding dimension (Figure S16), number of layers (Figure S17), number of latents (Figure S18), learning rate (Figure S19), and weight decay (Figure S20). We found that POCO performance is relatively stable across a range of hyperparameter settings, and even a small POYO encoder is sufficient.
Table 5: Both MLP forecaster and population encoder are necessary for POCO.
Ablation study shows performance drops when either component is removed or simplified. 95% CI from 4 seeds.
| Model | Zebrafish (Deisseroth), 512 PCs | Zebrafish (Ahrens), 512 PCs | Mice |
|---|---|---|---|
| Full POCO | 0.525 ±0.004 | 0.440 ±0.003 | 0.420 ±0.002 |
| POYO only | −0.971 ±0.015 | −0.057 ±0.001 | 0.332 ±0.001 |
| MLP only | 0.417 ±0.002 | 0.370 ±0.001 | 0.409 ±0.000 |
| MLP conditioned by univariate Transformer | 0.463 ±0.001 | 0.408 ±0.002 | 0.411 ±0.000 |
6. Discussion
In this work, we introduced POCO, a population-conditioned forecaster that combines a local univariate predictor with a global population encoder to capture both neuron-level dynamics and shared brain-state structure. Across five calcium-imaging datasets in zebrafish, mice, and C. elegans, POCO achieves state-of-the-art overall forecasting accuracy. Beyond raw performance, we show that POCO rapidly adapts to new sessions with only tens of fine-tuning steps of its embeddings, making it feasible for real-time adaptation during live recordings. Our analysis of POCO’s learned unit embeddings demonstrates that the model autonomously uncovers meaningful population structure such as brain regions, even though no anatomical labels were provided. These findings underscore POCO’s dual strength in accurate prediction and in learning interpretable representations of units. Finally, our results extend the recent progress on scaling up models for behavioral decoding [5, 6] to the realm of neural prediction on spontaneous neural recordings in different species.
We note a few limitations that can be opportunities for future research. First, our results indicate that factors such as calcium indicator dynamics, preprocessing pipelines, and species differences can significantly affect model performance, yet a systematic understanding of their influence remains lacking. Second, our findings highlight the difficulty of multi-dataset and multi-species training—a challenge that may be mitigated by improved architectures or alignment strategies. Third, while POCO learns biologically meaningful unit embeddings within datasets, it is unclear whether these representations are comparable across species or generalize to unseen brain regions. Finally, while we focus on calcium imaging during spontaneous behavior, extending POCO to spiking data and standardized behavioral tasks could enable the use of larger datasets and support modeling of neural dynamics in more structured settings [14, 26].
Acknowledgments and Disclosure of Funding
This work was supported by the NIH (RF1DA056403), James S. McDonnell Foundation (220020466), Simons Foundation (Pilot Extension-00003332-02), McKnight Endowment Fund, CIFAR Azrieli Global Scholar Program, and NSF (2046583).
We would like to thank Sabera Talukder, Krystal Xuejing Pan, and Viren Jain for helpful discussions.
References
- [1] Andalman A. S., Burns V. M., Lovett-Barron M., Broxton M., Poole B., Yang S. J., Grosenick L., Lerner T. N., Chen R., Benster T., et al. Neuronal dynamics regulating brain and behavioral state transitions. Cell, 177(4):970–985, 2019.
- [2] Antoniades A., Yu Y., Canzano J., Wang W., and Smith S. L. Neuroformer: Multimodal and multitask generative pretraining for brain data. arXiv preprint arXiv:2311.00136, 2023.
- [3] Arlt C., Barroso-Luque R., Kira S., Bruno C. A., Xia N., Chettih S. N., Soares S., Pettit N. L., and Harvey C. D. Cognitive experience alters cortical involvement in goal-directed navigation. eLife, 11:e76051, 2022.
- [4] Atanas A. A., Kim J., Wang Z., Bueno E., Becker M., Kang D., Park J., Kramer T. S., Wan F. K., Baskoylu S., et al. Brain-wide representations of behavior spanning multiple timescales and states in C. elegans. Cell, 186(19):4134–4151, 2023.
- [5] Azabou M., Arora V., Ganesh V., Mao X., Nachimuthu S., Mendelson M., Richards B., Perich M., Lajoie G., and Dyer E. A unified, scalable framework for neural population decoding. Advances in Neural Information Processing Systems, 36:44937–44956, 2023.
- [6] Azabou M., Pan K. X., Arora V., Knight I. J., Dyer E. L., and Richards B. A. Multi-session, multi-task neural decoding from distinct cell-types and brain regions. In The Thirteenth International Conference on Learning Representations, 2025.
- [7] Brenner M., Hess F., Mikhaeil J. M., Bereska L. F., Monfared Z., Kuo P.-C., and Durstewitz D. Tractable dendritic RNNs for reconstructing nonlinear dynamical systems. In International Conference on Machine Learning, pages 2292–2320. PMLR, 2022.
- [8] Caro J. O., Fonseca A. H. d. O., Averill C., Rizvi S. A., Rosati M., Cross J. L., Mittal P., Zappala E., Levine D., Dhodapkar R. M., et al. BrainLM: A foundation model for brain activity recordings. bioRxiv, 2023.
- [9] Challu C., Olivares K. G., Oreshkin B. N., Ramirez F. G., Canseco M. M., and Dubrawski A. NHITS: Neural hierarchical interpolation for time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 6989–6997, 2023.
- [10] Chau G., Wang C., Talukder S., Subramaniam V., Soedarmadji S., Yue Y., Katz B., and Barbu A. Population Transformer: Learning population-level representations of neural activity. arXiv preprint, 2024.
- [11] Chen X., Mu Y., Hu Y., Kuan A. T., Nikitchenko M., Randlett O., Chen A. B., Gavornik J. P., Sompolinsky H., Engert F., et al. Brain-wide organization of neuronal activity and convergent sensorimotor transformations in larval zebrafish. Neuron, 100(4):876–890, 2018.
- [12] Churchland M. M., Cunningham J. P., Kaufman M. T., Foster J. D., Nuyujukian P., Ryu S. I., and Shenoy K. V. Neural population dynamics during reaching. Nature, 487(7405):51–56, 2012.
- [13] Das A., Kong W., Leach A., Mathur S., Sen R., and Yu R. Long-term forecasting with TiDE: Time-series dense encoder. arXiv preprint arXiv:2304.08424, 2023.
- [14] de Vries S. E., Lecoq J. A., Buice M. A., Groblewski P. A., Ocker G. K., Oliver M., Feng D., Cain N., Ledochowitsch P., Millman D., et al. A large-scale standardized physiological survey reveals functional organization of the mouse visual cortex. Nature Neuroscience, 23(1):138–151, 2020.
- [15] Durstewitz D. A state space approach for piecewise-linear recurrent neural networks for identifying computational dynamics from neural measurements. PLoS Computational Biology, 13(6):e1005542, 2017.
- [16] Ekambaram V., Jati A., Nguyen N., Sinthong P., and Kalagnanam J. TSMixer: Lightweight MLP-Mixer model for multivariate time series forecasting. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 459–469, 2023.
- [17] Filipe A. C. and Park I. M. One model to train them all: A unified diffusion framework for multi-context neural population forecasting, 2025. URL https://openreview.net/forum?id=R9feGbYRG7.
- [18] Glaser J., Whiteway M., Cunningham J. P., Paninski L., and Linderman S. Recurrent switching dynamical systems models for multiple interacting neural populations. Advances in Neural Information Processing Systems, 33:14867–14878, 2020.
- [19] Grosenick L., Marshel J. H., and Deisseroth K. Closed-loop and activity-guided optogenetic control. Neuron, 86(1):106–139, 2015.
- [20] Immer A., Lueckmann J.-M., Chen A. B.-Y., Li P. H., Petkova M. D., Iyer N. A., Dev A., Ihrke G., Park W., Petruncio A., et al. Forecasting whole-brain neuronal activity from volumetric video. arXiv preprint arXiv:2503.00073, 2025.
- [21] Jaegle A., Borgeaud S., Alayrac J.-B., Doersch C., Ionescu C., Ding D., Koppula S., Zoran D., Brock A., Shelhamer E., et al. Perceiver IO: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795, 2021.
- [22] Kapoor J., Schulz A., Vetter J., Pei F., Gao R., and Macke J. H. Latent diffusion for neural spiking data. Advances in Neural Information Processing Systems, 37:118119–118154, 2024.
- [23] Kato S., Kaplan H. S., Schrödel T., Skora S., Lindsay T. H., Yemini E., Lockery S., and Zimmer M. Global brain dynamics embed the motor command sequence of Caenorhabditis elegans. Cell, 163(3):656–669, 2015.
- [24] Koppe G., Toutounji H., Kirsch P., Lis S., and Durstewitz D. Identifying nonlinear dynamical systems via generative recurrent neural networks with applications to fMRI. PLoS Computational Biology, 15(8):e1007263, 2019.
- [25] Kumar A., Gilra A., Gonzalez-Soto M., Meunier A., and Grosse-Wentrup M. BundDLe-Net: Neuronal manifold learning meets behaviour. bioRxiv, 2023.
- [26] International Brain Laboratory, Benson B., Benson J., Birman D., Bonacchi N., Bougrova K., Bruijns S. A., Carandini M., Catarino J. A., Chapuis G. A., et al. A brain-wide map of neural activity during complex behaviour. bioRxiv, 2023.
- [27] Linderman S. W., Miller A. C., Adams R. P., Blei D. M., Paninski L., and Johnson M. J. Recurrent switching linear dynamical systems. arXiv preprint arXiv:1610.08466, 2016.
- [28] Liu M., Zeng A., Chen M., Xu Z., Lai Q., Ma L., and Xu Q. SCINet: Time series modeling and forecasting with sample convolution and interaction. Advances in Neural Information Processing Systems, 35:5816–5828, 2022.
- [29] Liu X., Xia Y., Liang Y., Hu J., Wang Y., Bai L., Huang C., Liu Z., Hooi B., and Zimmermann R. LargeST: A benchmark dataset for large-scale traffic forecasting. Advances in Neural Information Processing Systems, 36:75354–75371, 2023.
- [30] Loshchilov I. and Hutter F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [31] Lu Z., Zhang W., Le T., Wang H., Sümbül U., Shea-Brown E. T., and Mi L. NetFormer: An interpretable model for recovering dynamical connectivity in neuronal population dynamics. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=bcTjW5kS4W.
- [32] Lueckmann J.-M., Immer A., Chen A. B.-Y., Li P. H., Petkova M. D., Iyer N. A., Hesselink L. W., Dev A., Ihrke G., Park W., et al. ZAPBench: A benchmark for whole-brain activity prediction in zebrafish. arXiv preprint arXiv:2503.02618, 2025.
- [33] Luo D. and Wang X. ModernTCN: A modern pure convolution structure for general time series analysis. In The Twelfth International Conference on Learning Representations, pages 1–43, 2024.
- [34] McInnes L., Healy J., and Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
- [35] Mikhaeil J., Monfared Z., and Durstewitz D. On the difficulty of learning chaotic dynamics with RNNs. Advances in Neural Information Processing Systems, 35:11297–11312, 2022.
- [36] Pachitariu M., Stringer C., Schröder S., Dipoppa M., Rossi L. F., Carandini M., and Harris K. D. Suite2p: beyond 10,000 neurons with standard two-photon microscopy. bioRxiv, page 061507, 2016.
- [37] Pandarinath C., O'Shea D. J., Collins J., Jozefowicz R., Stavisky S. D., Kao J. C., Trautmann E. M., Kaufman M. T., Ryu S. I., Hochberg L. R., et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature Methods, 15(10):805–815, 2018.
- [38] Pei F., Ye J., Zoltowski D., Wu A., Chowdhury R. H., Sohn H., O'Doherty J. E., Shenoy K. V., Kaufman M. T., Churchland M., et al. Neural Latents Benchmark '21: Evaluating latent variable models of neural population activity. arXiv preprint arXiv:2109.04463, 2021.
- [39] Perez E., Strub F., De Vries H., Dumoulin V., and Courville A. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- [40] Perich M. G., Arlt C., Soares S., Young M. E., Mosher C. P., Minxha J., Carter E., Rutishauser U., Rudebeck P. H., Harvey C. D., et al. Inferring brain-wide interactions using data-constrained recurrent neural network models. bioRxiv, 2020.
- [41] Pnevmatikakis E. A., Soudry D., Gao Y., Machado T. A., Merel J., Pfau D., Reardon T., Mu Y., Lacefield C., Yang W., et al. Simultaneous denoising, deconvolution, and demixing of calcium imaging data. Neuron, 89(2):285–299, 2016.
- [42] Safaie M., Chang J. C., Park J., Miller L. E., Dudman J. T., Perich M. G., and Gallego J. A. Preserved neural dynamics across animals performing similar behaviour. Nature, 623(7988):765–771, 2023.
- [43] Salinas D., Flunkert V., Gasthaus J., and Januschowski T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
- [44] Simeon Q., Venâncio L., Skuhersky M. A., Nayebi A., Boyden E. S., and Yang G. R. Scaling properties for artificial neural network models of a small nervous system. In SoutheastCon 2024, pages 516–524. IEEE, 2024.
- [45] Su J., Ahmed M., Lu Y., Pan S., Bo W., and Liu Y. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [46] Sussillo D., Jozefowicz R., Abbott L., and Pandarinath C. LFADS: Latent factor analysis via dynamical systems. arXiv preprint arXiv:1608.06315, 2016.
- [47] Talukder S., Yue Y., and Gkioxari G. TOTEM: Tokenized time series embeddings for general time series analysis. arXiv preprint arXiv:2402.16412, 2024.
- [48] Thomas A., Ré C., and Poldrack R. Self-supervised learning of brain dynamics from broad neuroimaging data. Advances in Neural Information Processing Systems, 35:21255–21269, 2022.
- [49] Toner W. and Darlow L. An analysis of linear time series forecasting models. arXiv preprint arXiv:2403.14587, 2024.
- [50] Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., and Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [51] Wu H., Zhou H., Long M., and Wang J. Interpretable weather forecasting for worldwide stations with a unified deep model. Nature Machine Intelligence, 5(6):602–611, 2023.
- [52] Ye J. and Pandarinath C. Representation learning for neural population activity with neural data transformers. arXiv preprint arXiv:2108.01210, 2021.
- [53] Ye J., Collinger J., Wehbe L., and Gaunt R. Neural Data Transformer 2: multi-context pretraining for neural spiking activity. Advances in Neural Information Processing Systems, 36:80352–80374, 2023.
- [54] Yi K., Fei J., Zhang Q., He H., Hao S., Lian D., and Fan W. FilterNet: Harnessing frequency filters for time series forecasting. Advances in Neural Information Processing Systems, 37:55115–55140, 2024.
- [55] Zeng A., Chen M., Zhang L., and Xu Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 11121–11128, 2023.
- [56] Zhang Y., Wang Y., Jiménez-Benetó D., Wang Z., Azabou M., Richards B., Tung R., Winter O., Dyer E., Paninski L., et al. Towards a "universal translator" for neural dynamics at single-cell, single-spike resolution. Advances in Neural Information Processing Systems, 37:80495–80521, 2024.
- [57] Zhou D. and Wei X.-X. Learning identifiable and interpretable latent models of high-dimensional neural activity using pi-VAE. Advances in Neural Information Processing Systems, 33:7234–7247, 2020.
- [58] Zhou H., Zhang S., Peng J., Zhang S., Li J., Xiong H., and Zhang W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11106–11115, 2021.
- [59] Zhou T., Ma Z., Wen Q., Wang X., Sun L., and Jin R. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pages 27268–27286. PMLR, 2022.
- [60] Zhu F., Grier H. A., Tandon R., Cai C., Agarwal A., Giovannucci A., Kaufman M. T., and Pandarinath C. A deep learning framework for inference of single-trial neural population dynamics from calcium imaging with subframe temporal resolution. Nature Neuroscience, 25(12):1724–1734, 2022.