eLife. 2026 Mar 10;14:RP105081. doi: 10.7554/eLife.105081

Movie reconstruction from mouse visual cortex activity

Joel Bauer 1,2, Troy W Margrie 1, Claudia Clopath 1,2
Editors: Rachel Denison 3, Yanchao Bi 4
PMCID: PMC12975128  PMID: 41804097

Abstract

The ability to reconstruct images represented by the brain has the potential to give us an intuitive understanding of what the brain sees. Reconstruction of visual input from human fMRI data has garnered significant attention in recent years. Comparatively less focus has been directed towards vision reconstruction from single-cell recordings, despite its potential to provide a more direct measure of the information represented by the brain. Here, for the first time, we achieve high-quality reconstructions of natural movies presented to mice from the activity of neurons in their visual cortex. Using our method of video optimization via backpropagation through a state-of-the-art dynamic neural encoding model, we reliably reconstruct 10 s movies at 30 Hz from two-photon calcium imaging data. We achieve a pixel-level correlation of 0.57 between ground-truth movies and single-trial reconstructions. Previous reconstructions based on awake mouse V1 neuronal responses to static images achieved a pixel-level correlation of 0.24 over a similar retinotopic area. We find that the number of neurons in the dataset and the use of model ensembling are critical for high-quality reconstructions. This paves the way for movie reconstruction to be used as a tool to investigate a variety of visual processing phenomena.

Research organism: Mouse

Introduction

One fundamental aim of neuroscience is to eventually gain insight into the ongoing perceptual experience of humans and animals. Reconstruction of visual perception directly from brain activity has the potential to give us a deeper understanding of how the brain represents visual information. Over the past decade, there have been considerable advances in reconstructing images and videos from human brain activity (Nishimoto et al., 2011; Shen et al., 2019a; Shen et al., 2019b; Rakhimberdina et al., 2021; Ren et al., 2021; Takagi and Nishimoto, 2023; Ozcelik and VanRullen, 2023; Ho et al., 2023; Scotti et al., 2023; Chen et al., 2023; Benchetrit et al., 2023; Kupershmidt et al., 2022). These advances have largely leveraged deep learning techniques to interpret fMRI or MEG recordings, taking advantage of the fact that spatially separated clusters of neurons have distinct visual and semantic response properties (Rakhimberdina et al., 2021). Due to the low resolution of fMRI and MEG, relative to single neurons, the most successful models heavily rely on extracting semantic content and use diffusion models to generate semantically similar images and videos. Some approaches combine low-level perceptual (retinotopic) and semantic information in separate modules to achieve even better image similarity (Ren et al., 2021; Ozcelik and VanRullen, 2023; Scotti et al., 2023). However, the pixel-level similarities are still relatively low. These methods are highly useful in humans, but their focus on semantic content may make them less useful when applied to non-human subjects or when using the reconstructed images to investigate visual processing.

Less attention has been given to image reconstruction from non-human brains. This is surprising given the advantages of large-scale single-cell-resolution recording techniques available in animal models, particularly mice. In the past, reconstructions using linear summation of receptive fields or Gabor filters have shown some success using responses from retinal ganglion cells (Brackbill et al., 2020), thalamo-cortical neurons in lateral geniculate nucleus (Stanley et al., 1999), and primary visual cortex (Garasto et al., 2019; Yoshida and Ohki, 2020). Recently, deep nonlinear neural networks have been used with promising results to reconstruct static images from mouse retina (Zhang et al., 2020; Li et al., 2023) and visual cortex (Cobos et al., 2022), and in particular from monkey V4 extracellular recordings (Li et al., 2023; Pierzchlewicz et al., 2023).

Here, we present a method for the reconstruction of 10 s movie clips using two-photon calcium imaging data recorded in mouse V1 (Turishcheva et al., 2023). Our method takes advantage of a state-of-the-art (SOTA) dynamic neural encoding model (DNEM) (Baikulov, 2023a) which predicts neuronal activity based on video input, as well as behavior. Our method allows us to successfully reconstruct videos despite the fact that V1 neuronal activity in awake mice is heavily modulated by behavioral factors, such as running speed (Niell and Stryker, 2010) and pupil diameter (correlated with arousal; Reimer et al., 2014). We then quantify the spatio-temporal limits of this reconstruction approach and identify key aspects of our method necessary for optimal performance.

Results

Video reconstruction using state-of-the-art dynamic neural encoding models

We used publicly available data provided by the Sensorium 2023 competition (Turishcheva et al., 2023; Turishcheva et al., 2024). The data included movies that were presented to mice and the evoked activity of V1 neurons along with pupil position, pupil diameter, and running speed. The neuronal activity was measured using two-photon imaging of GCaMP6s (Chen et al., 2013) fluorescence from 10 mice, with ≈8000 neurons from each mouse. In total, we reconstructed ten 10 s natural movies from each of 5 mice (50 videos).

We used the winning model of the Sensorium 2023 competition (Baikulov, 2023a; Turishcheva et al., 2024), which predicts neuronal activity from video and behavioral input (Figure 1A, Figure 1—figure supplement 1B−C). This state-of-the-art (SOTA) dynamic neural encoding model (DNEM), called DwiseNeuro, is composed of three parts: core, cortex, and readout. The model takes the video as input with the behavioral data (pupil position, pupil diameter, and running speed) broadcast to four additional channels of the video. This model achieved an average single-trial correlation between predicted and ground truth neural activity of 0.291 during the competition; this was later improved to 0.301. For context, the competition benchmark models achieved 0.106, 0.164, and 0.197 single-trial correlation, while the second and third place models achieved 0.265 and 0.243. Across the models entered in the competition, a variety of architectural components were used, including 2D and 3D convolutional layers, recurrent layers, and transformers, to name just a few. We did not use the original weights of the winning model, to avoid reconstructing movies the model was trained on. Instead, we retrained 7 instances of the model using the same training data, which did not include the movies reserved for reconstruction. Beyond this point, the weights of the model were frozen, i.e., not influenced by future movie presentations.
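The input format described above can be illustrated with a minimal sketch in plain Python. It assumes that pupil position contributes two channels (x and y), which together with pupil diameter and running speed gives the four behavioral channels; the function name `add_behavior_channels` and the nested-list layout are hypothetical and are not taken from the DwiseNeuro code.

```python
def add_behavior_channels(video, pupil_x, pupil_y, pupil_diameter, running_speed):
    """Stack a grayscale video with four spatially constant behavior channels.

    video: list of frames, each frame a list of rows of pixel values.
    The behavior arguments are lists holding one scalar per frame.
    Returns a list of frames, each holding 5 channels of equal height and width.
    Assumption: pupil position is split into x and y channels.
    """
    stacked = []
    for t, frame in enumerate(video):
        height, width = len(frame), len(frame[0])
        channels = [frame]
        for value in (pupil_x[t], pupil_y[t], pupil_diameter[t], running_speed[t]):
            # Broadcast: every pixel of this channel carries the same scalar.
            channels.append([[value] * width for _ in range(height)])
        stacked.append(channels)
    return stacked


video = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]  # 3 frames of 2x2 gray pixels
inputs = add_behavior_channels(video, [0.1] * 3, [0.2] * 3, [0.3] * 3, [0.0, 1.0, 2.0])
print(len(inputs), len(inputs[0]), inputs[1][4])  # 3 5 [[1.0, 1.0], [1.0, 1.0]]
```

Because the behavioral channels are part of the model input, they can be held fixed while only the video channel is optimized during reconstruction.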

Figure 1. Video reconstruction from neuronal activity in mouse V1 (data provided by the Sensorium 2023 competition; Turishcheva et al., 2023; Turishcheva et al., 2024) using a state-of-the-art (SOTA) dynamic neural encoding model (DNEM; Baikulov, 2023a).

(A) Dynamic neural encoding models (DNEMs) predict neuronal activity from mouse primary visual cortex, given a video and behavioral input. (B) We use a SOTA DNEM to reconstruct part of the input video given neuronal population activity, using gradient descent to optimize the input. (C) Poisson negative log likelihood loss across training steps between ground truth neuronal activity and predicted neuronal activity in response to reconstructed videos. Left: all 50 videos from 5 mice for one model. Right: average loss across all videos for seven model instances. (D) Spatio-temporal (pixel-by-pixel) correlation between reconstructed video and ground truth video.

Figure 1—source data 1. Source data to Figure 1.

Figure 1—figure supplement 1. Summary ethogram of state-of-the-art (SOTA) dynamic neural encoding model (DNEM) inputs, output predictions, and video reconstruction over time for three videos from three mice (same as Figure 2A).

(A) Top: motion energy of the input video. Bottom: pupil diameter and running speed of the mouse during the video. (B) Ground truth neuronal activity. (C) Predicted neuronal activity in response to input video and behavioral parameters. (D) Predicted neuronal activity given reconstructed video and ground truth behaviour as input. (E) Frame-by-frame correlation between reconstructed and ground truth video.
Figure 1—figure supplement 1—source data 1. Source data to Figure 1—figure supplement 1.
Figure 1—figure supplement 2. Variations on the reconstruction method.

(A) Example movie frames (from mouse 4 trial 1) using variations of the reconstruction method. (B) Same as A but using different training mask thresholds. No evaluation mask is applied here. (C) Left: reconstruction performance for different reconstruction versions. Version without contrast and luminance adjustment is not included because video correlation is always calculated before contrast and luminance adjustment. Reconstruction from predicted vs standard method (34.7% increase; paired t-test p=1.5 × 10⁻⁵, n=5 mice), standard method vs gradient ensembling (3.00% decrease; paired t-test p=0.0198, n=5 mice), standard method vs no Gaussian smoothing (6.28% decrease; paired t-test p=6.26 × 10⁻⁶, n=5 mice). Middle: reconstruction performance with different evaluation mask thresholds compared across the three training mask thresholds shown in B. Right: same as middle but plotting the mask diameter for each evaluation mask threshold. (D) Neural activity prediction performance for different movie inputs. Left: Poisson loss, used to train the dynamic neural encoding model (DNEM) and movie reconstruction. Predicted activity from full video vs masked with alpha = 0.5 (5.02% increase, t-test p=3.71 × 10⁻⁶, n=5 mice), reconstruction vs reconstruction after contrast & luminance matching to ground truth video (0.227% increase, paired t-test p=0.776, n=5 mice). Right: correlation across all neurons and frames (note this is a different metric to the one used in the Sensorium competition). Predicted activity from full video vs masked with alpha = 0.5 (3.73% increase, t-test p=2.49 × 10⁻⁵, n=5 mice), reconstruction vs reconstruction after contrast & luminance matching to ground truth video (2.49% increase, paired t-test p=0.0195, n=5 mice). In C-D, dashed lines are single mice, and solid lines are means across mice.
Figure 1—figure supplement 2—source data 1. Source data to Figure 1—figure supplement 2.
Figure 1—figure supplement 3. Receptive fields and transparency masks.

(A) One example receptive field for one neuron from each mouse mapped using on & off patch stimuli in silico. (B) Average population receptive fields from each mouse. (C) Distribution of on and off receptive field centers for each mouse. (D) Unthresholded alpha masks, i.e., transparency masks, for each mouse. (E) Pixel-wise temporal correlation between ground truth and reconstructed videos with either the training or the evaluation mask applied. Dashed lines in C-E indicate retinotopic eccentricity in steps of 10°. Plot limits correspond to screen size.
Figure 1—figure supplement 3—source data 1. Source data to Figure 1—figure supplement 3.

To reconstruct the videos presented to mice, we iteratively optimized an initially blank input video to the SOTA DNEM until the predicted activity in response to this input matched the ground truth recorded neuronal activity. In effect, we optimized the input video to be perceptually similar with respect to the recorded neurons. To achieve this, we optimized the input through gradient descent, an approach inspired by the optimization of maximally exciting images (Walker et al., 2019) and the reconstruction of static images (Cobos et al., 2022; Pierzchlewicz et al., 2023). The input videos were initialized as uniform gray values and the behavioral parameters (Figure 1—figure supplement 1A) were added as additional channels, i.e., these were not reconstructed but given. The neuronal activity in response to the input video was predicted using the SOTA DNEM for a sliding window of 32 frames (1.067 s) with a stride of eight frames. We saw slightly better results with a stride of two frames, but in our case, this did not warrant the increase in training time. For each window, the difference between the predicted and ground truth responses was calculated, and this loss was backpropagated to the pixels of the input video to get the gradient of the loss with respect to each pixel. In effect, the input pixels were thus treated as if they were model weights. The gradients for each pixel were then averaged across all windows and the pixels of the input video updated accordingly (see Supplementary Algorithm 1).
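The optimization loop described above can be sketched with a toy example in plain Python. This is not the authors' implementation: the encoder is a hypothetical stand-in in which each frame is a single pixel and neuron n fires at rate exp(w_n · x_t), so that the gradient of the Poisson negative log likelihood can be written by hand instead of being obtained by backpropagation through the DNEM. The learning rate, step count, and the handling of frames not covered by any full window are assumptions of the sketch.

```python
import math


def window_starts(n_frames, window=32, stride=8):
    """Start indices of all full windows that fit into the video."""
    return list(range(0, n_frames - window + 1, stride))


def predict_rates(frames, weights):
    """Toy encoder (hypothetical): rate of neuron n at frame t is exp(w_n * x_t)."""
    return [[math.exp(w * x) for w in weights] for x in frames]


def reconstruct(target, weights, n_frames, steps=500, lr=0.1, window=32, stride=8):
    video = [0.0] * n_frames  # uniform "gray" initialization
    starts = window_starts(n_frames, window, stride)
    for _ in range(steps):
        grad_sum = [0.0] * n_frames
        count = [0] * n_frames
        for s in starts:
            rates = predict_rates(video[s:s + window], weights)
            for i, frame_rates in enumerate(rates):
                t = s + i
                # Poisson NLL per frame: sum_n (rate_n - y_n * log(rate_n)).
                # Its derivative with respect to x_t is sum_n (rate_n - y_n) * w_n.
                grad_sum[t] += sum((r - y) * w
                                   for r, y, w in zip(frame_rates, target[t], weights))
                count[t] += 1
        # Average each pixel's gradient over the windows that contain it.
        # Frames not covered by any full window stay unchanged (sketch assumption).
        video = [x - lr * g / c if c else x
                 for x, g, c in zip(video, grad_sum, count)]
    return video


weights = [1.0, -0.5, 0.8]                           # hypothetical neuron weights
truth = [math.sin(t / 5.0) for t in range(64)]       # toy one-pixel "movie"
target = predict_rates(truth, weights)               # noise-free target activity
recon = reconstruct(target, weights, n_frames=64)
max_error = max(abs(a - b) for a, b in zip(truth, recon))
```

In this noise-free toy setting the loss is convex in each pixel, so the reconstruction converges to the original input. With a deep encoding model and recorded activity, no such guarantee holds.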

The data from the Sensorium competition provided the activity of neurons within a 630 by 630 μm field of view for each mouse, i.e., covering roughly one-fifth of mouse V1. Given the retinotopic organization of V1, we therefore did not expect to get good reconstructions of the entire video frame. However, gradients still propagated to the full video frame and produced nonsensical results along the periphery of the video frames (Figure 1—figure supplement 2B). Inspired by previous work (Mordvintsev et al., 2018; Willeke et al., 2026), we therefore applied a mask during training and evaluation. To generate these masks, we optimized a transparency layer placed at the input to the SOTA DNEM. High values are given to pixels that contribute to the accurate prediction of neuronal activity and represent the collective receptive field of the neural population. None of the reconstructed movies were used in the optimization of this transparency mask. The transparency masks are aligned with, but not identical to, the On-Off receptive field distribution maps obtained using sparse noise (Figure 1—figure supplement 3). This mask was applied during the optimization of the reconstructed movies (training mask: binarized with threshold α=0.5) and applied again to the final reconstruction (evaluation mask: binarized with threshold α=1) (see Supplementary Algorithm 2). Applying the mask in two stages first boosts the performance of reconstruction itself and separately allows evaluation of the reconstruction in a region of high confidence, given the neural population available (Figure 1—figure supplement 2).
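The two-stage masking can be illustrated with a small sketch. The alpha values below are made up, the `>=` comparison is an assumed thresholding convention, and filling masked-out pixels with a gray background value is likewise an assumption rather than the procedure from Supplementary Algorithm 2.

```python
def binarize(alpha_mask, threshold):
    """Return a 0/1 mask; pixels with alpha >= threshold are kept (assumed convention)."""
    return [[1 if a >= threshold else 0 for a in row] for row in alpha_mask]


def apply_mask(frame, mask, background=0.5):
    """Keep masked-in pixels and replace all others with a gray background value."""
    return [[p if m else background for p, m in zip(frame_row, mask_row)]
            for frame_row, mask_row in zip(frame, mask)]


# Hypothetical unthresholded transparency mask for a 3x3 frame.
alpha = [[0.1, 0.6, 0.2],
         [0.7, 1.0, 0.6],
         [0.0, 0.5, 0.3]]

training_mask = binarize(alpha, 0.5)    # looser mask used during optimization
evaluation_mask = binarize(alpha, 1.0)  # stricter mask used for evaluation

frame = [[0.9, 0.8, 0.7],
         [0.6, 0.4, 0.3],
         [0.2, 0.1, 0.0]]
masked = apply_mask(frame, evaluation_mask)
```

The evaluation mask is a subset of the training mask by construction, which matches the idea of scoring the reconstruction only in the region of highest confidence.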

As the loss between predicted (Figure 1—figure supplement 1D) and ground truth responses (Figure 1—figure supplement 1B) decreased, the similarity between the reconstructed and ground truth input video increased (Figure 1C–D). We generated seven separate reconstructions from seven neural encoding models (trained on the same data) and averaged them. Finally, we applied a 3D Gaussian filter with sigma 0.5 pixels to reduce the remaining static noise (Figure 1—figure supplement 2) and applied the evaluation mask. When presenting videos in this paper, we normalize the mean and standard deviation of the reconstructions to the average and standard deviation of the corresponding ground truth movie before applying the evaluation masks, but this is not done for quantification except in Figure 1—figure supplement 2D. The Gaussian filter was not applied when evaluating spatial or temporal resolution (Figure 4, Figure 4—figure supplement 1, Figure 4—figure supplement 2).
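The luminance and contrast matching used for display can be sketched as a simple rescaling. The function name is hypothetical, pixels are given as flat lists, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def match_luminance_contrast(recon, truth):
    """Rescale recon so that its mean and standard deviation equal those of truth.

    Both arguments are flat lists of pixel values taken across the whole video.
    """
    r_mean, r_std = fmean(recon), pstdev(recon)
    t_mean, t_std = fmean(truth), pstdev(truth)
    if r_std == 0:
        # A constant reconstruction carries no contrast to rescale.
        return [t_mean] * len(recon)
    return [(p - r_mean) / r_std * t_std + t_mean for p in recon]


recon = [0.4, 0.5, 0.6]   # low-contrast reconstruction (made-up values)
truth = [0.0, 0.5, 1.0]   # ground truth pixels (made-up values)
matched = match_luminance_contrast(recon, truth)
```

A rescaling of this kind is a positive affine transform of all pixel values, so it leaves a Pearson correlation computed over the same set of pixels unchanged.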

High-quality video reconstruction

As can be seen in Figure 2 and Video 1, the reconstructed videos capture much of the spatial and temporal dynamics of the original input video. Because our optimization of the movies was based on a perceptual loss function, we were interested in how closely these movies matched the originals on the pixel level. To evaluate the performance of the video reconstructions, we therefore computed either the correlation across all pixels from all time points between ground truth and reconstructed videos (Pearson’s correlation r=0.569; to quantify temporal and spatial similarity), or the average correlation between corresponding frames (Pearson’s correlation r=0.512; to quantify just spatial similarity) (Figure 2B). This represents a ≈2x higher pixel-level correlation over previous single-trial static image reconstructions from V1 in awake mice (image correlation 0.238 ± 0.054 s.e.m.) (Yoshida and Ohki, 2020) over a similar retinotopic area (≈43°×43°) while also capturing temporal dynamics. However, we would like to stress that directly comparing static image reconstruction methods with movie reconstruction approaches is fundamentally problematic, as they rely on different data types both during training and evaluation (temporally averaged vs continuous neural activity, images flashed at fixed intervals vs continuous movies).
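The difference between the two metrics can be made concrete with a small sketch. The function names are hypothetical, frames are flat lists of pixel values, and the example data are made up to show a case where the two metrics disagree.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def video_correlation(truth, recon):
    """One correlation over all pixels of all frames (spatial and temporal)."""
    flat_truth = [p for frame in truth for p in frame]
    flat_recon = [p for frame in recon for p in frame]
    return pearson(flat_truth, flat_recon)


def mean_frame_correlation(truth, recon):
    """Correlate each pair of corresponding frames, then average (spatial only)."""
    per_frame = [pearson(t, r) for t, r in zip(truth, recon)]
    return sum(per_frame) / len(per_frame)


# Two frames of four pixels. The reconstruction follows the brightness change
# over time but has the spatial pattern of every frame reversed.
truth = [[0, 1, 2, 3], [10, 11, 12, 13]]
recon = [[3, 2, 1, 0], [13, 12, 11, 10]]
r_video = video_correlation(truth, recon)       # high: temporal change is captured
r_frame = mean_frame_correlation(truth, recon)  # negative: spatial layout is wrong
```

In this constructed case the video correlation is 190/210 ≈ 0.905 while the mean frame correlation is −1, which shows why the frame-wise metric isolates spatial similarity.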

Figure 2. Reconstruction performance.

(A) Three reconstructions of 10 s videos from different mice (see Video 1 for the full set). Reconstructions have been luminance (mean pixel value across video) and contrast (standard deviation of pixel values across video) matched to ground truth. (B) The reconstructed videos have high correlation to ground truth in both spatio-temporal correlation (mean Pearson’s correlation r=0.569 with 95% CIs 0.542–0.596, t-test between ground truth and random video p=6.69 × 10⁻⁴⁹, n=50 videos from 5 mice) and mean frame correlation (mean Pearson’s correlation r=0.512 with 95% CIs 0.481–0.543, t-test between ground truth and random video p=4.29 × 10⁻⁴⁵, n=50 videos from 5 mice).

Figure 2—source data 1. Source data to Figure 2.

Figure 2—figure supplement 1. Reconstruction performance correlates with frame contrast but not with behavioral parameters.

(A) Pearson’s correlation between mean frame correlation per movie, and three movie parameters, and three behavioral parameters. Linear fit as black line. (B) Left: Pearson’s correlation between activity prediction accuracy and movie reconstruction accuracy. Right: cross-correlation plot of frame-by-frame activity prediction accuracy and video frame correlation. In other words, the more predictable the neural activity, the better the reconstruction performance.
Figure 2—figure supplement 1—source data 1. Source data to Figure 2—figure supplement 1.
Video 1. Reconstructed natural videos from mouse brain activity.

Odd rows are ground truth (GT) movie clips presented to mice. Even rows are the reconstructed movies from the activity of ≈8000 V1 neurons. Reconstructed movies are smoothed (σ=0.5 pixels), masked, and contrast (std) and luminance (mean) matched to ground truth movies.

Reconstruction quality, however, was not consistent across movies (Figure 2B) or constant throughout the 10 s videos (Figure 1—figure supplement 1E). We, therefore, investigated which factors may cause these fluctuations by correlating video motion energy, contrast, and luminance, as well as running speed, pupil diameter, and eye movement, with frame correlation. We found that contrast correlated with frame correlation, but only to a moderate degree. Video motion energy showed a trend, but this was not significant (Figure 2—figure supplement 1A). We also found that the ability of the SOTA DNEM to predict neural activity correlated with reconstruction performance. This could be because some frames are harder to reconstruct due to their content (high temporal and spatial frequencies) or because neural activity in these moments was influenced by factors that the model cannot take into account.

Ensembling

We found that the seven instances of the SOTA DNEMs by themselves performed similarly in terms of reconstructed video correlation (Figure 1D), but that this correlation was significantly increased by taking the average across reconstructions from different models (Figure 3), a technique known as bagging and, more generally, ensembling (Breiman, 1996). We averaged over seven model instances, which gave a performance increase of 28.0%, but the largest single gain in performance, 13.7%, came from averaging across just two models (Figure 3). Doubling the number of models to four increased the performance by another 8.32%. Individual models produced reconstructions with high-frequency noise in the temporal and spatial domains. We, therefore, think the increase in performance from ensembling is mostly an effect of averaging out this high-frequency noise. On the other hand, it is possible that averaging over separately optimized reconstructions degrades high-frequency information. We, therefore, tested whether averaging pixel gradients from all models at each iteration rather than averaging the final movies yields higher performance, but we observed no improvement (Figure 1—figure supplement 2C). Overall, although ensembling over models trained on separate data splits is a computationally expensive method, it substantially improved reconstruction quality.
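The noise-averaging interpretation can be illustrated with a toy simulation. It assumes that each model's reconstruction equals the ground truth plus independent Gaussian noise, which is a strong simplification: real model errors are partly shared, so the gain in this sketch is larger than the one reported above. All values are synthetic.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


random.seed(0)
truth = [random.gauss(0, 1) for _ in range(2000)]  # synthetic "pixels"

# Seven hypothetical reconstructions: ground truth plus independent noise.
recons = [[p + random.gauss(0, 1) for p in truth] for _ in range(7)]

# Ensemble by averaging the reconstructions pixel by pixel.
ensemble = [sum(pixels) / len(recons) for pixels in zip(*recons)]

single = sum(pearson(truth, r) for r in recons) / len(recons)
combined = pearson(truth, ensemble)
ensembling_effect = combined - single  # ensembled minus mean individual correlation
```

Averaging k reconstructions with independent noise divides the noise variance by k, so most of the benefit arrives with the first few models, consistent with the diminishing returns seen in Figure 3.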

Figure 3. Model ensembling.

Mean video correlation is improved when predictions from multiple models are averaged. Dashed lines are individual animals, and the solid line is the mean. One-way repeated measures ANOVA p=1.11 × 10⁻¹⁶. Bonferroni-corrected paired t-test outcomes between consecutive ensemble sizes are all p<0.001, n=5 mice.

Figure 3—source data 1. Source data to Figure 3.

Not all spatial and temporal frequencies are reconstructed equally

While the reconstructed videos achieve high correlation to ground truth, it is not entirely clear if the remaining deviations are due to the limitations of the model or arise from the recorded neurons themselves. To assess the resolution limits of our reconstruction process, we tested the model’s ability to reconstruct synthetic stimuli at varying spatial and temporal resolutions in a noise-free scenario.

To quantify which spatial and temporal frequencies our reconstruction approach is able to capture, we used a Gaussian noise stimulus set generated using a Gaussian process (https://github.com/TomGeorge1234/gp_video; George, 2024; Figure 4). The dataset consisted of 49, 2 s, 36 by 36 pixel videos at 30 Hz, which varied in the spatial and temporal length constants. As we did not have ground truth neuronal activity in response to this stimulus set, we first predicted the neuronal responses given these videos using the ensembled SOTA DNEMs. We then used gradient descent to reconstruct the original input using these predicted neuronal responses as the target. In this way, we generated reconstructions in an ideal case with no biological noise and assuming the SOTA DNEM perfectly predicts neuronal activity (Figure 4B and Video 2). This means the video reconstruction quality loss reflects the inefficiency of the reconstruction process itself without the additional loss or transformation of information by processes, such as top-down modulation, e.g., predictive coding or selective feature attention (see Discussion). We found that the reconstruction process failed at high spatial frequencies (<1 pixel, or <3.4° retinotopy) and performed worse at high temporal frequencies (<1 frame, or >30 Hz) (Figure 4C and Video 2). We repeated this analysis using full-field high-contrast square gratings drifting in the four cardinal directions and similarly found that high spatial and temporal frequencies were not reconstructed as well as low-spatial and temporal frequency gratings (Figure 4—figure supplement 1 and Video 3). We also found that beyond the spatial reconstruction limit, the reconstructions from phase-inverted Gaussian noise stimuli had higher correlation with each other than with their ground truth stimuli (Figure 4D). 
Nevertheless, even when the reconstructions were not captured on the pixel level, they did capture some of the spatial entropy and motion energy of the ground truth stimuli (Figure 4—figure supplement 2A–B).

Figure 4. Reconstruction of Gaussian noise across the spatial and temporal spectrum using predicted activity.

(A) Example Gaussian noise stimulus set with evaluation mask for one mouse. Shown is the last frame of a 2 s video. (B) Reconstructed Gaussian stimuli with state-of-the-art (SOTA) dynamic neural encoding model (DNEM) predicted neuronal activity as the target (see also Video 2). (C) Video correlation between ground truth (A) and reconstructed videos (B) across the range of spatial and temporal length constants. For each stimulus type, the average correlation across five movies reconstructed from the SOTA DNEM of 3 mice is given. (D) Video correlation between reconstructions from phase-inverted Gaussian noise stimuli.

Figure 4—figure supplement 1. Reconstruction of drifting grating stimuli with different spatial and temporal frequencies using predicted activity.

(A) Example drifting grating stimuli (rightwards moving) masked with the evaluation mask for one mouse. Shown is the 31st frame of a 2 s video. (B) Reconstructed drifting grating stimuli with state-of-the-art (SOTA) dynamic neural encoding model (DNEM) predicted neuronal activity as the target (see also Video 3). (C) Video correlation between ground truth (A) and reconstructed videos (B) across the range of spatial and temporal frequencies. For each stimulus type, the average correlation across four directions (up, down, left, right) reconstructed from the SOTA DNEM of 1 mouse is given. Interestingly, video correlation at 15 cycles/s (half the video frame rate of 30 Hz) is much higher than 7.5 cycles/s. This is an artifact of using predicted responses rather than true neural responses. The DNEM input layer convolution has dilation 2. The predicted activity is, therefore, based on every second frame, with the effect that the activity is predicted as the response of two static images which are then interleaved.
Figure 4—figure supplement 1—source data 1. Source data to Figure 4—figure supplement 1.
Figure 4—figure supplement 2. Gaussian noise reconstruction: Shannon entropy, motion energy.

(A) Average frame Shannon entropy (a measure of variance in the spatial domain) across Gaussian noise stimuli with various spatial and temporal Gaussian length constants. Left: ground truth stimuli. Right: reconstructions from predicted activity. (B) Same as A but for motion energy (a measure of variance in the temporal domain). (C) Ensembling effect for each stimulus. Video correlation for ensembled prediction (average videos from seven model instances) minus the mean video correlation across the seven individual model instances.
Figure 4—figure supplement 2—source data 1. Source data to Figure 4—figure supplement 2.
Video 2. Gaussian noise stimuli and reconstructions.

Odd rows are ground truth (GT) video inputs to the model. Even rows are the reconstructed videos from the predicted neuronal activity for one mouse. Reconstructed movies are masked, and contrast (std) and luminance (mean) matched to ground truth videos.

Video 3. Drifting grating stimuli and reconstructions.

Odd rows are ground truth (GT) video inputs to the model. Even rows are the reconstructed videos from the predicted neuronal activity for one mouse. Reconstructed movies are masked, and contrast (std) and luminance (mean) matched to ground truth videos.

To test if model ensembling improves Gaussian noise reconstruction quality across all spatial and temporal length constants uniformly, we subtracted the average video correlation across the seven model instances from the video correlation of the average video (i.e. ensembled video reconstruction quality minus unensembled video reconstruction quality; Figure 4—figure supplement 2C). We found that, in particular, short temporal and spatial length constant stimuli improved in correlation, supporting our hypothesis that ensembling mitigates the high-frequency noise we observed in the reconstruction from individual models.

Neuronal population size

To design future in vivo experiments that investigate visual processing using our video reconstruction approach, it would be useful to know how reconstruction performance scales with the number of recorded neurons. This is vital for prioritizing experimental parameters, such as weighing between sampling density within a similar retinotopic area and retinotopic coverage to maximize both video reconstruction quality and visual coverage. We, therefore, performed an in silico ablation experiment, dropping either 50%, 75%, or 87.5% of the total recorded population of ≈8000 neurons per mouse by setting their activity to 0 (Figure 5). We found that dropping 50% of the neurons reduced the video correlation by only 9.96% while dropping 75% reduced the performance by 24.9%. We would, therefore, argue that ≈4000–8000 neurons within a 630 by 630 μm area (≈10000–20000 neurons/mm²) of mouse V1 would provide a balance when compromising between density and 2D coverage.
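The ablation and the density estimate can be sketched as follows. Choosing the dropped neurons at random is an assumption, since the selection rule is not stated here, and the function names and toy activity values are hypothetical. The density calculation only restates the arithmetic in the text.

```python
import random


def ablate(activity, drop_fraction, seed=0):
    """Zero the activity of a randomly chosen fraction of neurons.

    activity: list of per-neuron activity traces (one list per neuron).
    Assumption: dropped neurons are picked uniformly at random.
    """
    n_neurons = len(activity)
    n_drop = round(n_neurons * drop_fraction)
    dropped = set(random.Random(seed).sample(range(n_neurons), n_drop))
    return [[0.0] * len(trace) if i in dropped else list(trace)
            for i, trace in enumerate(activity)]


def density_per_mm2(n_neurons, side_um=630):
    """Neurons per square millimeter for a square field of view."""
    side_mm = side_um / 1000
    return n_neurons / side_mm ** 2


activity = [[1.0, 2.0] for _ in range(8)]  # 8 toy neurons, 2 time points each
kept = {fraction: sum(1 for trace in ablate(activity, fraction) if any(trace))
        for fraction in (0.5, 0.75, 0.875)}
print(kept)  # {0.5: 4, 0.75: 2, 0.875: 1}

print(round(density_per_mm2(4000)), round(density_per_mm2(8000)))  # 10078 20156
```

With a 0.63 mm × 0.63 mm field of view (0.3969 mm²), 4000 to 8000 neurons correspond to roughly 10000 to 20000 neurons/mm², matching the range given above.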

Figure 5. Video reconstruction using fewer neurons (i.e. population ablation) leads to lower reconstruction quality.

Dashed lines are individual animals, and the solid line is the mean. One-way repeated measures ANOVA p=5.70 × 10⁻¹³. Bonferroni-corrected paired t-test outcomes between consecutive drops in population size are all p<0.001, n=5 mice.

Figure 5—source data 1. Source data to Figure 5.

Visualization of reconstruction error

One advantage of stimulus reconstruction compared to stimulus identity decoding (i.e. classification) is that it is possible to visualize the deviation of the reconstructed stimuli from what is expected. This is interesting because reconstruction performance is not stable over time but fluctuates (Figure 6A), likely because the DNEM does not have access to all possible factors that influence neural activity. When using our reconstruction method, it is not the input stimulus similarity that is optimized, but the similarity of the activity evoked by the stimulus. As a consequence, the predicted neural response from the reconstructed movie is more similar to the experimental neural response than is the predicted neural response evoked by the original ground truth movie (Figure 6B). It is possible to visualize this deviation on a pixel level by subtracting the in silico simulation-derived movie reconstruction (i.e. first predict activity based on the ground truth video and then reconstruct the movie based on the resulting simulated neural activity) from the experimentally derived movie reconstruction (i.e. based on measured neural responses) (Figure 6 and Video 4). With the current dataset, it is not possible to test if these deviations reflect failures of the encoding model to predict neural activity given the sensory stimulus or true deviations of the images represented by the neural population from the sensory stimulus, but this approach may be an interesting method for investigating when and why model predictions of neural activity deviate from the experimentally measured activity.
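The error map itself is a pixel-wise difference between two reconstructions. The sketch below follows the sign convention stated in the Video 4 caption (reconstruction from experimental activity minus reconstruction from expected activity); the function name and the pixel values are made up for illustration.

```python
def error_map(recon_experimental, recon_expected):
    """Pixel error per frame: experimental minus expected reconstruction.

    Positive values mark pixels that the experimental reconstruction
    overestimates relative to the expected one, negative values mark
    underestimates.
    """
    return [[a - b for a, b in zip(frame_exp, frame_sim)]
            for frame_exp, frame_sim in zip(recon_experimental, recon_expected)]


# One frame of three pixels per reconstruction (hypothetical values).
recon_from_recorded_activity = [[0.75, 0.25, 0.5]]
recon_from_predicted_activity = [[0.5, 0.5, 0.5]]
errors = error_map(recon_from_recorded_activity, recon_from_predicted_activity)
print(errors)  # [[0.25, -0.25, 0.0]]
```

Because both reconstructions pass through the same optimization procedure, errors introduced by the reconstruction process itself should largely cancel in the difference, leaving the part attributable to the mismatch between recorded and predicted activity.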

Figure 6. Comparison of reconstructions from experimental responses vs expected responses and their visualization as error maps.

(A) Frame-by-frame correlation between reconstructed and ground truth video for mouse 1 trial 7 (same as Figure 1—figure supplement 1E). (B) From left to right: experimental (ground truth) neural activity $y$, neural activity predicted by dynamic neural encoding model (DNEM) from ground truth video $\hat{y}$, neural activity predicted by DNEM based on reconstructed movie $\hat{y}_{\hat{x}}$. (C) Difference between the correlation of true neural response $y$ with predicted neural response from the ground truth movie $\hat{y}$, and the correlation of true neural response $y$ with predicted neural response from the reconstructed movie $\hat{y}_{\hat{x}}$. (D) 9 frames from mouse 1 trial 7. From top to bottom: reconstructed movie $\hat{x}$, reconstructed movie from predicted neural response to ground truth movie $\hat{x}_{\hat{y}}$, ground truth movie $x$ with overlaid heatmap of the difference between $\hat{x}$ and $\hat{x}_{\hat{y}}$ (error map). (E) Error map of one frame from all 50 movie clips. Each row is 10 trials from one mouse. See also Video 4.

Figure 6—source data 1. Source data to Figure 6.
Video 4. Reconstruction error maps.

Pixel error = reconstructions from experimental neural activity – reconstructions from expected neural activity. Over- and underestimations of pixel values are shown as hot and cold heat maps, respectively. Ground truth movies are shown in gray. Each row is 10 trials from one mouse, played at half speed.

Discussion

Stimulus identification vs reconstruction

Stimulus identification, i.e., identifying the most likely stimulus from a constrained set, has been a popular approach for quantifying whether a population of neurons encodes the identity of a particular stimulus (Földiák, 1993; Kay et al., 2008). This approach has, for instance, been used to decode frame identity within a movie (Deitch et al., 2021; Xia et al., 2021; Schneider et al., 2023; Chen et al., 2024). Some of these approaches have also been used to reorder the frames of the ground truth movie (Schneider et al., 2023) based on the decoded frame identity. Importantly, stimulus identification methods are distinct from stimulus reconstruction, where the aim is to recreate the sensory content of a neuronal code in a way that generalizes to new sensory stimuli (Rakhimberdina et al., 2021). This is inherently a more demanding task because the range of possible solutions is much larger. Although stimulus identification is a valuable tool for understanding the information content of a population code, stimulus reconstruction could provide a more generalizable approach, because it can be applied to novel stimuli.

Comparison to other reconstruction methods

There has recently been a growing number of publications in the field of image reconstruction, primarily from fMRI data, and a comprehensive review of all the approaches is outside the scope of this paper. However, we will briefly summarize the most common approaches and how they relate to our own method. In general, image reconstruction methods can be categorized into one of four groups: direct decoding models, encoder-decoder models, invertible encoding models, and encoder model input optimization.

Direct decoders directly decode the input image/videos from neuronal activity with deep neuronal networks (Shen et al., 2019a; Zhang et al., 2020; Li et al., 2023). When training direct decoders, the decoders can be pretrained (Ren et al., 2021) or additional constraints can be added to the loss function to encourage the decoder to produce images that adhere to learned image statistics (Shen et al., 2019a; Kupershmidt et al., 2022). A direct decoder approach has been used for video reconstruction in mice (Chen et al., 2024), but in that case, the training and test movies were the same, meaning it is unclear if out-of-training set generalization was achieved (a key distinction between sensory reconstruction and stimulus identification, see previous section).

In encoder-decoder models, the aim is to combine separately trained brain encoders (brain activity to latent space) and decoders (latent space to image/video). Recently, this approach has become particularly popular because it allows the use of SOTA generative image models, such as stable diffusion (Rombach et al., 2021; Takagi and Nishimoto, 2023; Scotti et al., 2023; Chen et al., 2023; Benchetrit et al., 2023). The encoder part of the model is first trained to translate brain activity into a latent space that the pretrained generative networks can interpret. Because these latent spaces are often conditioned on semantic information, this lends itself to separate processing of low-level visual and high-level semantic information from brain activity (Scotti et al., 2023).

Invertible encoding models are encoding models which, once trained to predict neuronal activity, can implicitly be inverted to predict sensory input given brain activity. We would also include those models in this class which first compute the receptive field or preferred stimulus of neurons (or voxels) and reconstruct the input as the weighted sum of the receptive fields by their activity (Stanley et al., 1999; Thirion et al., 2006; Garasto et al., 2019; Brackbill et al., 2020; Yoshida and Ohki, 2020; Nishimoto et al., 2011). The downside of this approach is that invertible linear models generally underperform in terms of capturing the coding properties of neurons compared to more complex deep neural networks (Willeke et al., 2023).

Encoder input optimization also involves first training an encoder which predicts the activity of neurons or voxels given sensory input. Once trained, the encoder is fixed, and the input to the network is optimized using backpropagation until the predicted activity matches the observed activity (Pierzchlewicz et al., 2023). Unlike with invertible encoding models, any SOTA neuronal encoding model can be used. But like invertible models, the networks are not specifically trained to reconstruct images, so they may be less likely to extrapolate information encoded by the brain by learning general image statistics. There is some evidence to support this: static image reconstructions that were optimized to evoke similar in silico predicted neural activity also evoke more similar neural responses in vivo when compared to other methods that optimized image similarity directly (Cobos et al., 2022).

Although outlined here as four distinct classes, these approaches can be combined. For instance, encoder input optimization can be combined with image diffusion (Pierzchlewicz et al., 2023) and in principle, invertible models could also be combined in such a way.

We chose to pursue a pure encoder input optimization approach for single-cell mouse visual cortex activity for two reasons. First, there have been considerable advances in the performance of neuronal encoding models for dynamic visual stimuli (Sinz et al., 2018; Wang et al., 2025; Turishcheva et al., 2024) and we aimed to take advantage of these developments. Second, the addition of a generative decoder trained to produce high-quality images brings with it the risk of extrapolating information based on general image statistics rather than interpreting what the brain is representing. In some cases, the brain may not be encoding coherent images, and in those cases, we would argue image reconstruction should fail, rather than producing an image when only the semantic information is present.

Key contributions and limitations

We demonstrate high-quality video reconstruction from mouse V1 using SOTA DNEMs to iteratively optimize the input video to match the resulting predicted activity with the recorded neuronal activity. Key to achieving high-quality reconstructions is model ensembling and using a large enough number of recorded neurons over a given retinotopic area.

While we averaged the video reconstructions from several models, an alternative method would be to average the gradients calculated by multiple models at each epoch, as has been done for the generation of maximally exciting images in the past (Walker et al., 2019). When using video models, this can be an impractical solution due to the amount of GPU memory required, but in principle, there might be situations in which averaging gradients yields better reconstructions. For instance, there may be multiple solutions for the activation pattern of a neural population, e.g., if their responses are translation/phase invariant (Ito et al., 1995; Tacchetti et al., 2018). In such a case, averaging ‘misaligned’ reconstructions from multiple models might degrade overall quality. However, we observed no performance improvement when ensembling with gradients instead of ensembling with reconstructions.

The SOTA DNEM we used takes video data at an angular resolution of 3.4°/pixel at the center of the screen, which is about three times coarser than the visual acuity of mice (≈0.5 cycles/°; Prusky and Douglas, 2004). As our model can reconstruct Gaussian noise stimuli down to a spatial length constant of 1 pixel, and drifting gratings up to a spatial frequency of 0.071 cycles/°, there is still some potential for improving spatial resolution. To close this gap and achieve reconstructions equivalent to the limit of mouse visual acuity, a different dataset and model would likely need to be developed. However, the frame rate of the videos the SOTA DNEM takes as input (30 Hz) is faster than the flicker fusion frequency of mice (14 Hz; Nomura et al., 2019) and our tests with Gaussian noise and drifting grating stimuli show that the temporal resolution of reconstruction is close to this expected limit. Future efforts should, therefore, focus on the spatial resolution of video reconstruction rather than the temporal resolution.
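
As a rough check of the spatial numbers above, the sampling limit of the model input can be compared with mouse acuity. The sketch below assumes that the resolution limit of a pixel grid is its Nyquist frequency (one cycle per two pixels); this interpretation, and all function and variable names, are illustrative assumptions and are not taken from the published code.

```python
def nyquist_cycles_per_degree(deg_per_pixel):
    """Highest spatial frequency a pixel grid can represent (one cycle spans two pixels)."""
    return 1.0 / (2.0 * deg_per_pixel)

pixel_size_deg = 3.4     # angular resolution of the model input, from the text
mouse_acuity = 0.5       # cycles/degree, value cited in the text
highest_grating = 0.071  # cycles/degree, highest reconstructed grating, from the text

nyquist = nyquist_cycles_per_degree(pixel_size_deg)
print(round(nyquist, 3))                    # 0.147 cycles/degree
print(round(mouse_acuity / nyquist, 1))     # 3.4, i.e. acuity is ~3x finer than the input grid
print(round(nyquist / highest_grating, 1))  # 2.1, headroom left within the input grid
```

Under this assumption, the input grid itself accounts for the roughly threefold gap to mouse acuity, while the reconstructed gratings stay about a factor of two below the grid limit.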

It is, however, unclear how closely the representation of vision by the brain is expected to match the actual input. There are a number of visual processing phenomena that have previously been identified, which lead us to suspect that some deviations between video reconstructions and ground truth input are to be expected. One such phenomenon is predictive coding (Rao and Ballard, 1999; Fiser et al., 2016). It is possible that the unexpected parts of visual stimuli are sharper and have higher contrast compared to the expected parts when reconstructed from neuronal activity. Alternatively, perceptual learning is a phenomenon where visual stimulus detection or discriminability is enhanced through prolonged training (Li, 2016) and is associated with changes in the tuning distribution of neurons in the visual system (Goltstein et al., 2013; Poort et al., 2015; Jurjut et al., 2017; Schumacher et al., 2022). Similarly, selective feature attention can modulate the response amplitude of neurons that have a preference for the features that are currently being attended to (Kanamori and Mrsic-Flogel, 2022). Visual task engagement and training could, therefore, alter the accuracy and biases of what features of a video can accurately be reconstructed from the neuronal activity. Visualizing differences between movie reconstructions from experimentally derived recordings and those from predicted activity, as we have done, may be an interesting approach for investigating such phenomena.

Although fMRI-based reconstruction techniques are starting to be used to investigate visual phenomena in humans (such as illusions; Cheng et al., 2023, and mental imagery; Shen et al., 2019b; Koide-Majima et al., 2024; Kalantari et al., 2025), visual processing phenomena are likely difficult to investigate using existing fMRI-based reconstruction approaches, due to the low spatial and temporal resolution of the data. Additionally, many of these fMRI-based reconstruction approaches rely on the use of pretrained generative diffusion models to achieve more naturalistic and semantically interpretable images (Takagi and Nishimoto, 2023; Ozcelik and VanRullen, 2023; Scotti et al., 2023; Chen et al., 2023), but very likely at the cost of introducing information that may not be present in the actual neuronal representation. In contrast, our video reconstruction approach using single-trial single-cell resolution recordings, without a pretrained generative model, provides a more accurate method to investigate visual processing phenomena, such as predictive coding, perceptual learning, and selective feature attention.

In conclusion, we reconstruct videos presented to mice based on single-trial activity of neurons in the mouse visual cortex. This paves the way to using movie reconstruction as a tool to investigate a variety of visual processing phenomena.

Methods

Source data

The data was provided by the Sensorium 2023 competition (Turishcheva et al., 2023; Turishcheva et al., 2024) and downloaded from https://gin.g-node.org/pollytur/Sensorium2023Data and https://gin.g-node.org/pollytur/sensorium_2023_dataset. The data included grayscale movies presented to the mice at 30 Hz on a 31.8 by 56.5 cm monitor 15 cm from and perpendicular to the left eye. The movies were provided as spatially downsampled versions of the original screen resolution to 36 by 64 pixels, corresponding to an angular resolution of 3.4°/pixel at the center of the screen. The pupil position and diameter were recorded at 20 Hz and the running at 100 Hz. The neuronal activity was measured using two-photon imaging (Denk et al., 1990) of GCaMP6s (Chen et al., 2013) fluorescence at 8 Hz, extracted and deconvolved using the CAIMAN pipeline (Giovannucci et al., 2019). For each of the 10 mice, the activity of ≈8000 neurons was provided. The different data types were resampled to 30 Hz.
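
The stated angular resolution can be checked from the monitor geometry given above. This is a minimal sketch that assumes a flat screen viewed perpendicularly, with the central pixel centered on the line of sight; the function name is hypothetical.

```python
import math

def central_pixel_angle_deg(screen_cm, n_pixels, distance_cm):
    """Visual angle subtended by one pixel at the screen center."""
    pixel_cm = screen_cm / n_pixels
    return math.degrees(2.0 * math.atan(pixel_cm / (2.0 * distance_cm)))

# Screen of 31.8 cm by 56.5 cm, downsampled to 36 by 64 pixels, viewed from 15 cm
vertical = central_pixel_angle_deg(31.8, 36, 15.0)
horizontal = central_pixel_angle_deg(56.5, 64, 15.0)
print(round(vertical, 1), round(horizontal, 1))  # 3.4 3.4
```

Pixels further from the screen center subtend smaller angles, which is why the value is quoted for the center only.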

State-of-the-art dynamic neural encoding model

We used the winning model of the Sensorium 2023 competition, DwiseNeuro (Turishcheva et al., 2023; Turishcheva et al., 2024). The code for the SOTA DNEM was downloaded from https://github.com/lRomul/sensorium (Baikulov, 2023b). The winning model consists of 3 main components: core, cortex, and readout. The core largely consisted of factorized 3D convolution blocks with residual connections, positional encoding (Vaswani et al., 2017), and SiLU activations (Elfwing et al., 2018) followed by spatial average pooling. The cortex consisted of three fully connected layers. The readout consisted of a 1D convolution for each mouse with a final Softplus nonlinearity that gives activity predictions for all neurons of each mouse. The kernel of the input layer had size 16 with a dilation of 2 in the time dimension, so spanned 32 video frames.

The original ensemble of models consisted of 7 model instances trained on a sevenfold cross-validation split of all available Sensorium 2023 competition data (≈1 hr of training data and ≈8 min of cross-validation data per fold from each mouse). Each model instance was trained on 6 of 7 data folds, with different validation data excluded from training for each model. To allow ensembled reconstructions of videos without test set contamination, we instead retrained the models with a shared validation fold, i.e., we retrained the models leaving out the same validation data for all seven model instances. The only other difference in the training procedure was that we retrained the models using a batch size of 24 instead of 32. This did not change the performance of neuronal response prediction on the withheld data folds (mean validation fold predicted vs ground truth response correlation for original weights: 0.293; and retrained weights: 0.291). We also did not use model distillation, while the original model did (see https://github.com/lRomul/sensorium; Baikulov, 2023b).

We chose the first 10 movies in data fold 0 (assigned as part of the DNEM code using a video hashing function) for reconstructions. We additionally excluded nine movies which were incorrectly assigned to fold 0 and replaced them with other movie clips from fold 0.

Additional visual stimuli

The Gaussian noise stimuli were downloaded from https://github.com/TomGeorge1234/gp_video (George, 2024) and spanned a range of 0–32 pixels in spatial length constant and 0–32 frames in temporal length constant used in the Gaussian process. Five movies of 2 s each were generated separately and combined with their phase-inverted versions to give a total of 10 trials.

The drifting grating stimuli were produced using PsychoPy (Peirce et al., 2019) and ranged from 0.5 to 0.062 cycles/degree and 0.5–0 cycles/s, with 2 s of movie for each cardinal direction. These ranges were chosen to avoid aliasing effects in the 36 by 64 pixel videos. The highest temporal frequency corresponds to a flicker stimulus.

The receptive field mapping stimulus, i.e., sparse noise stimulus, consisted of a pre-stimulus gray (gray value 127) screen period of 0.5 s, a 0.5 s stimulus period where one pixel was set to either 0 (Off) or 255 (On), and a 0.5 s post-stimulus gray screen period. The full stimulus set consisted of 4608 stimuli, one On and one Off stimulus for every pixel of the 36 by 64 movie.
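
The sparse noise stimulus set described above can be sketched as a generator. The frame count per period follows from 0.5 s at 30 Hz; the ordering of the stimuli and the function name are assumptions made for illustration.

```python
def sparse_noise_stimuli(height=36, width=64, fps=30, period_s=0.5, gray=127):
    """Yield ((row, col, value), frames) for every On (255) and Off (0) single-pixel stimulus."""
    n = round(fps * period_s)  # 15 frames per 0.5 s period
    blank = [[gray] * width for _ in range(height)]
    for value in (255, 0):
        for r in range(height):
            for c in range(width):
                probe = [row[:] for row in blank]  # copy the gray frame
                probe[r][c] = value                # set a single pixel On or Off
                # pre-stimulus gray, stimulus, post-stimulus gray
                yield (r, c, value), [blank] * n + [probe] * n + [blank] * n

n_stimuli = sum(1 for _ in sparse_noise_stimuli())
print(n_stimuli)  # 4608 = 2 * 36 * 64
```

A generator is used so that only one 45-frame stimulus is held in memory at a time.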

Mask training

To generate the transparency masks, we used an alpha blending approach inspired by Mordvintsev et al. (2018) and Willeke et al. (2026). A transparency layer was placed at the input to the SOTA DNEM. This transparency layer was used to alpha blend the true video V with another randomly selected background video BG from the data:

$$V_{BG} = V \cdot \alpha + BG \cdot (1 - \alpha)$$ (1)

where $\alpha$ is the 2D transparency mask and $V_{BG}$ is the blended input video. This mask was optimized using stochastic gradient descent (for 1000 epochs with learning rate 10) with mean squared error (MSE) loss between the true responses $y$ and the predicted responses $\hat{y}$ scaled by the average weight of the transparency mask $\bar{\alpha}$:

$$\mathrm{MSE}(y,\hat{y}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$ (2)
$$\mathrm{Loss} = \mathrm{MSE}(y,\hat{y}\cdot(1-\bar{\alpha}))$$ (3)

where n is the total number of neurons. The mask was initialized as uniform noise between 0 and 0.05. At each epoch, the neuronal activity in response to a randomly selected 32-frame video segment from the training set was predicted and the gradients of the loss (Equation 3) with respect to the pixels in the transparency mask α were calculated for each video frame. The gradients were normalized by their matrix norm, clipped to between –1 and 1, and averaged across frames. The gradients were smoothed with a 2D Gaussian kernel of σ = 5 and subtracted from the transparency mask. The transparency mask was only calculated using one SOTA DNEM instance using its validation fold. See Supplementary Algorithm 2.
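
The blending and loss in Equations 1–3 can be illustrated on toy data. This sketch assumes that Equation 3 multiplies the predicted responses by one minus the mean mask value, so that a larger mask is penalized by shrinking the prediction; the function names and the nested-list data layout are illustrative, and no encoding model is involved.

```python
def blend(video, background, alpha):
    """Equation 1 per pixel; alpha is a 2D mask shared by all frames."""
    return [[[v * a + b * (1.0 - a) for v, b, a in zip(v_row, b_row, a_row)]
             for v_row, b_row, a_row in zip(v_frame, b_frame, alpha)]
            for v_frame, b_frame in zip(video, background)]

def mask_loss(y, y_hat, alpha):
    """Equations 2 and 3: MSE between y and y_hat scaled by (1 - mean(alpha)) (assumed form)."""
    alpha_mean = sum(map(sum, alpha)) / sum(len(row) for row in alpha)
    scale = 1.0 - alpha_mean
    return sum((t - p * scale) ** 2 for t, p in zip(y, y_hat)) / len(y)

video = [[[200, 200], [200, 200]]]       # one 2x2 frame of the true video
background = [[[100, 100], [100, 100]]]  # one 2x2 frame of the background video
alpha = [[1.0, 0.0], [0.5, 0.0]]

print(blend(video, background, alpha))         # [[[200.0, 100.0], [150.0, 100.0]]]
print(mask_loss([1.0, 2.0], [1.0, 2.0], alpha))  # 0.3515625
```

Even with perfect predictions, the loss in this toy case is non-zero because the mean mask value (0.375) scales the prediction down, which is what discourages unnecessarily large masks.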

The transparency mask was thresholded and binarized at 0.5 for the masked gradients $\nabla_{masked}$ or at 1 for the masked videos used for evaluation $V_{eval}$:

$$\nabla_{masked} = \nabla \cdot (\alpha > 0.5)$$ (4)
$$V_{eval} = V \cdot (\alpha \geq 1)$$ (5)

where $\nabla$ denotes the gradients of the loss with respect to each pixel in the video and $V$ is the reconstructed video before masking. These masks were trained independently for each mouse using one model instance with the original weights of the model https://github.com/lRomul/sensorium (Baikulov, 2023b), not the retrained models used in the rest of this paper to reconstruct the videos.
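
The two thresholds can be illustrated as follows. This is a sketch only: whether the evaluation threshold is inclusive is an assumption here, and the helper names are hypothetical.

```python
def binarize(alpha, threshold, inclusive=False):
    """Return a 0/1 mask from a continuous transparency mask."""
    keep = (lambda a: a >= threshold) if inclusive else (lambda a: a > threshold)
    return [[1.0 if keep(a) else 0.0 for a in row] for row in alpha]

def apply_mask(frames, mask):
    """Multiply every frame (gradients or pixel values) by the 2D mask."""
    return [[[p * m for p, m in zip(p_row, m_row)]
             for p_row, m_row in zip(frame, mask)] for frame in frames]

alpha = [[0.2, 0.7], [1.0, 1.3]]
gradient_mask = binarize(alpha, 0.5)                    # used during optimization
evaluation_mask = binarize(alpha, 1.0, inclusive=True)  # stricter mask used for evaluation (assumed inclusive)

print(gradient_mask)    # [[0.0, 1.0], [1.0, 1.0]]
print(evaluation_mask)  # [[0.0, 0.0], [1.0, 1.0]]
print(apply_mask([[[10, 20], [30, 40]]], evaluation_mask))  # [[[0.0, 0.0], [30.0, 40.0]]]
```

The looser threshold lets the optimization use pixels around the population receptive field, while the stricter one restricts evaluation to the best-covered region.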

Video reconstruction

To reconstruct the input video, we initialized the video as uniform gray values and concatenated the ground truth behavioral parameters. The SOTA DNEM took 32 frames at a time, and we shifted this window by eight frames until all frames of the whole 10 s video were covered. For each 32-frame window, the Poisson negative log-likelihood loss between the predicted and true neuronal responses was calculated:

$$\mathrm{Loss}(y,\hat{y}) = \sum \left(\hat{y}_{neuron} - y_{neuron}\cdot\log(\hat{y}_{neuron} + 10^{-8})\right)$$ (6)

where $\hat{y}$ are the predicted responses and $y$ are the ground truth responses. The gradients of the loss with respect to each pixel of the input video were calculated for each window of frames and averaged across all windows. The gradients for each pixel were normalized by the matrix norm across all gradients and clipped to between –1 and 1. The gradients were masked (Equation 4) and applied to the input video using Adam (β1 = 0.9) without second-order momentum (Kingma and Ba, 2014) for 1000 epochs with a learning rate of 1000 and a learning rate warm-up over the first 10 epochs. After each epoch, the video was clipped to between 0 and 255. The seven reconstructions from the seven model instances were averaged, denoised with a 3D Gaussian filter (σ = 0.5, unless specified otherwise), and masked with the evaluation mask. See Supplementary Algorithm 1. Optimizing each 10 s video with one model instance for 1000 epochs took ≈60 min on a desktop with an RTX4070 GPU.
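
The sliding-window scheme and the loss can be sketched without the encoding model. The window layout below (stride of eight frames plus one final window aligned to the end of the video) follows Supplementary Algorithm 1; summing the Poisson term over neurons is an assumption, and the function names are illustrative.

```python
import math

def window_starts(n_frames=300, window=32, stride=8):
    """Start indices of 32-frame windows; a last window is aligned to the video end."""
    starts = list(range(0, n_frames - window + 1, stride))
    if starts[-1] != n_frames - window:
        starts.append(n_frames - window)
    return starts

def poisson_nll(y, y_hat, eps=1e-8):
    """Poisson negative log-likelihood (up to a constant), summed over neurons."""
    return sum(p - t * math.log(p + eps) for t, p in zip(y, y_hat))

starts = window_starts()
print(len(starts), starts[:3], starts[-2:])  # 35 [0, 8, 16] [264, 268]
# For a fixed response y, the loss is lowest when the prediction equals y
print(poisson_nll([2.0], [2.0]) < poisson_nll([2.0], [3.0]))  # True
```

For a 300-frame video this gives 35 windows, so most frames contribute gradients from four overlapping windows.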

Reconstruction quality assessment

To evaluate the similarity between reconstructed and ground truth videos, we used the mean Pearson’s correlation between pixels of corresponding frames to evaluate spatial similarity:

$$\text{mean frame correlation} = \frac{1}{f}\sum_{i=1}^{f}\frac{\mathrm{cov}(x_i,\hat{x}_i)}{\sigma_{x_i}\sigma_{\hat{x}_i}}$$ (7)

where $f$ is the number of frames, and $x_i$ and $\hat{x}_i$ are the ground truth and reconstructed frames. To evaluate temporal and spatial similarity between ground truth and reconstructed videos, we used the Pearson’s correlation between all pixels of the whole movie:

$$\text{video correlation} = \frac{\mathrm{cov}(x,\hat{x})}{\sigma_{x}\sigma_{\hat{x}}}$$ (8)
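
The difference between the two metrics in Equations 7 and 8 can be seen on a toy example. This is a minimal sketch with hypothetical helper names; videos are nested lists of frames, rows, and pixels.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def flatten(frame):
    return [p for row in frame for p in row]

def mean_frame_correlation(video, recon):
    """Equation 7: correlate each frame pair, then average across frames."""
    return sum(pearson(flatten(f), flatten(g)) for f, g in zip(video, recon)) / len(video)

def video_correlation(video, recon):
    """Equation 8: a single correlation over all pixels of all frames."""
    return pearson([p for f in video for p in flatten(f)],
                   [p for f in recon for p in flatten(f)])

video = [[[0, 1], [2, 3]], [[10, 11], [12, 13]]]
recon = [[[0, 2], [4, 6]], [[13, 12], [11, 10]]]  # frame 2 is spatially inverted

print(round(mean_frame_correlation(video, recon), 3))  # 0.0
print(round(video_correlation(video, recon), 3))       # 0.928
```

In this toy case the video correlation stays high because the overall brightness change between frames is captured, while the mean frame correlation exposes the spatial error in the second frame.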

To calculate the Shannon entropy, we first computed the intensity histogram of the pixels inside the evaluation mask for every frame (25 bins between 0 and 255). Shannon entropy of one frame ($H_f$) was then calculated as:

$$H_f = -\sum_{k=1}^{n} p_k \log_2 p_k$$ (9)

where $p_k$ is the normalized histogram count of bin $k$ (only including non-zero bins). For each movie, the average Shannon entropy across frames is taken. $n$ is the total number of non-zero bins.

The motion energy of a frame ($E_f$) is calculated as:

$$E_f = \frac{1}{n}\sum_{i=1}^{n}\left|V_{f,i} - V_{f-1,i}\right|$$ (10)

where $V_{f,i}$ is the intensity value for one pixel $i$ inside the evaluation mask at frame $f$, and $n$ is the total number of pixels inside the mask.
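
The entropy and motion energy definitions above (Equations 9 and 10) can be sketched as follows. Frames are passed as flat lists of the pixels inside the evaluation mask; the equal-width binning and the function names are assumptions made for illustration.

```python
import math

def frame_entropy(pixels, n_bins=25, lo=0.0, hi=255.0):
    """Shannon entropy (bits) of the intensity histogram of one frame."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for p in pixels:
        k = min(int((p - lo) / width), n_bins - 1)  # the value 255 falls in the last bin
        counts[k] += 1
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def motion_energy(prev_pixels, pixels):
    """Mean absolute intensity change between consecutive frames."""
    return sum(abs(a - b) for a, b in zip(pixels, prev_pixels)) / len(pixels)

print(frame_entropy([0, 0, 255, 255]))     # 1.0 (two equally filled bins)
print(frame_entropy([0, 80, 160, 255]))    # 2.0 (four equally filled bins)
print(round(motion_energy([10, 20, 30], [12, 20, 25]), 3))  # 2.333
```

A uniform frame has zero entropy, and a static video has zero motion energy.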

Retinotopic mapping

To calculate the receptive fields of neurons in silico, we predicted each neuron's response to the full sparse noise stimulus set using the ensemble prediction of seven SOTA DNEM instances. The response map across pixels for each neuron ($OnR_{h,w,n}$) was defined as:

$$OnR_{h,w,n} = \frac{R_{stim} - R_{pre}}{R_{pre}}$$ (11)

where $h$ and $w$ denote the position of the pixel on the screen, and $n$ the neuron. $R_{stim}$ is the predicted response of the neuron during the stimulus period and $R_{pre}$ during the pre-stimulus period. $OnR_{h,w,n}$ was thresholded at 0.1. The same procedure was used to calculate $OffR_{h,w,n}$. The $OnR$ and $OffR$ maps were smoothed using a 2D Gaussian filter with σ = 2 and then normalized by the maximum value for each neuron. The On and Off receptive field centers were defined as the pixel with the maximum value for each neuron. We calculate the On-Off receptive fields for example neurons as:

$$\text{On-Off receptive fields} = \frac{OnR_n - OffR_n}{\mathrm{Max}(OnR_n - OffR_n)}$$ (12)

and calculate the population On or Off response as:

$$\text{population On or Off response} = \frac{OnR + OffR}{\mathrm{Max}(|OnR + OffR|)}$$ (13)
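
The receptive field calculation for a single neuron can be sketched on a 2 by 2 toy screen. The Gaussian smoothing step is omitted for brevity, and interpreting the threshold as setting values at or below 0.1 to zero is an assumption; all names are illustrative.

```python
def relative_response(r_stim, r_pre, threshold=0.1):
    """Equation 11 per pixel; values not exceeding the threshold are set to 0 (assumption)."""
    out = []
    for stim_row, pre_row in zip(r_stim, r_pre):
        row = []
        for s, p in zip(stim_row, pre_row):
            rel = (s - p) / p
            row.append(rel if rel > threshold else 0.0)
        out.append(row)
    return out

def normalize_by_max(rf):
    peak = max(max(row) for row in rf)
    return [[v / peak for v in row] for row in rf] if peak > 0 else rf

def rf_center(rf):
    """(row, col) of the pixel with the maximum value."""
    return max((v, (i, j)) for i, row in enumerate(rf) for j, v in enumerate(row))[1]

def on_off_rf(on_r, off_r):
    """Equation 12: On minus Off map, normalized by its maximum."""
    diff = [[a - b for a, b in zip(a_row, b_row)] for a_row, b_row in zip(on_r, off_r)]
    return normalize_by_max(diff)

r_pre = [[1.0, 1.0], [1.0, 1.0]]
r_stim = [[1.0, 1.5], [1.05, 3.0]]
on_r = normalize_by_max(relative_response(r_stim, r_pre))
print(on_r)             # [[0.0, 0.25], [0.0, 1.0]]
print(rf_center(on_r))  # (1, 1)
print(on_off_rf(on_r, [[0.5, 0.0], [0.0, 0.0]]))  # [[-0.5, 0.25], [0.0, 1.0]]
```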

Reconstruction area calculation

To calculate the retinotopic diameter of a mask, we first computed the retinotopic area of each pixel of the movie based on the screen size (31.8 cm by 56.5 cm) and distance from the mouse eye (15 cm). Strictly speaking, this is the visuotopic area as it does not take eye position into account, but we refer to it as retinotopic for simplicity. We then take the sum of all pixel areas for an evaluation mask with a given α threshold. Then we define the retinotopic diameter of this area (A) as:

$$\text{retinotopic diameter} = 2\sqrt{\frac{A}{\pi}}$$ (14)
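
Equation 14 is the diameter of a circle with the same area as the mask. The sketch below checks this on a known circle and then applies it to a hypothetical mask; treating every pixel as 3.4° by 3.4° is a simplifying assumption that only holds near the screen center, whereas the text computes a separate area for each pixel.

```python
import math

def retinotopic_diameter(area_deg2):
    """Diameter (degrees) of a circle whose area equals the summed pixel area."""
    return 2.0 * math.sqrt(area_deg2 / math.pi)

# Sanity check: a circle of radius 10 degrees has area 100*pi and diameter 20
print(round(retinotopic_diameter(100.0 * math.pi), 6))  # 20.0

# Hypothetical mask of 100 central pixels, each approximated as 3.4 x 3.4 degrees
mask_area = 100 * 3.4 * 3.4
diameter = retinotopic_diameter(mask_area)
```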

Error map calculation

To calculate the error maps, we reconstructed movie clips either from the experimental neural responses or from the predicted neural responses given the ground truth movie and took the difference:

$$\text{positive error} = \hat{x} - \hat{x}_{\hat{y}} > 0$$ (15)
$$\text{negative error} = \hat{x} - \hat{x}_{\hat{y}} < 0$$ (16)

where $x$ is the ground truth video, $y$ is the experimental neural activity, $\hat{x}$ is the reconstructed movie from $y$, $\hat{y}_{\hat{x}}$ is the predicted neural activity from $\hat{x}$, $\hat{y}$ is the predicted neural activity from $x$, and $\hat{x}_{\hat{y}}$ is the reconstructed movie from $\hat{y}$. Using Fiji (Schindelin et al., 2012), the positive error (LUT: red hot, range 25–75), negative error (LUT: cyan hot, range –25 to –75), and ground truth video (LUT: gray scale, range 0–255) were then combined into a composite image.
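
The split into positive and negative error maps can be sketched as below. This assumes the two maps are the positive and negative parts of the pixel-wise difference, with zeros elsewhere; the function name is hypothetical, and the subsequent compositing in Fiji is not reproduced.

```python
def error_maps(recon_experimental, recon_predicted):
    """Split the pixel-wise difference of two reconstructions into positive and negative parts."""
    positive, negative = [], []
    for frame_e, frame_p in zip(recon_experimental, recon_predicted):
        pos_frame, neg_frame = [], []
        for row_e, row_p in zip(frame_e, frame_p):
            diff = [a - b for a, b in zip(row_e, row_p)]
            pos_frame.append([d if d > 0 else 0.0 for d in diff])  # overestimated pixels
            neg_frame.append([d if d < 0 else 0.0 for d in diff])  # underestimated pixels
        positive.append(pos_frame)
        negative.append(neg_frame)
    return positive, negative

x_hat = [[[100.0, 50.0]]]    # reconstruction from experimental activity (one 1x2 frame)
x_hat_y = [[[80.0, 90.0]]]   # reconstruction from predicted activity
pos, neg = error_maps(x_hat, x_hat_y)
print(pos)  # [[[20.0, 0.0]]]
print(neg)  # [[[0.0, -40.0]]]
```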

Acknowledgements

We would like to thank Emmanuel Bauer, Sandra Reinert, and the anonymous reviewers for useful input and discussions, and Tom George for the Gaussian noise stimulus set. TWM is funded by The Wellcome Trust (306384/Z/23/Z; 318818/Z/24/Z) and Gatsby Charitable Foundation (GAT4057), JB is funded by EMBO (ALTF 415–2024), and CC is funded by The Wellcome Trust (200790/Z/16/Z), The Simons Foundation (564408), EPSRC (EP/R035806/1), and the ERC (MotorAdapt 101169605).

Appendix 1

Supplemental material

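Appendix 1 – algorithm 1. Movie reconstruction.
Parameters:
 Dynamic Neural Encoding Model (DNEM), ground truth neuronal response $y$, predicted neuronal responses $\hat{y}$,
 predicted input video $\hat{x}$, video width $w=64$, video height $h=36$, number of frames $n=300$,
 transparency mask $\alpha$, sliding window size $k=32$, sliding window stride $s=8$, total number of windows $N \leftarrow [2+(n-k)/s]$,
 learning rate $lr=1000$, $\beta_1=0.9$, loss function $\mathrm{Loss}(y,\hat{y}) = \sum \left(\hat{y}_{neuron} - y_{neuron}\cdot\log(\hat{y}_{neuron}+10^{-8})\right)$,
 number of epochs $epochs=1000$, number of model instances $q=7$

Objective: $F(y,\hat{x}) = \mathrm{Loss}(y,\mathrm{DNEM}(\hat{x}))$

Initialize variables:
 $\hat{x}_{f,h,w} \leftarrow 125$
 $G_{i,f,h,w} \leftarrow 0$
 $g_{f,h,w} \leftarrow 0$
 $m_{f,h,w} \leftarrow 0$

for iteration $t = 1,\dots,epochs$ do
  Gradients across all windows:
  for iteration $i = 1,\dots,N$ do
    if $i < N$ then
      $f \leftarrow [s(i-1),\dots,k+s(i-1)]$
    else if $i = N$ then
      $f \leftarrow [n-k,\dots,n]$
    end if
    $G_{i,f} \leftarrow \nabla F(y_f,\hat{x}_f)$
    $G_{i,f} \leftarrow \frac{G_{i,f}}{\lVert G_{i,f} \rVert + 10^{-6}}$
  end for
  Average gradients across windows: $g \leftarrow \frac{1}{N}\sum_{i=1}^{N} G_i$

  Clip and mask gradients:
  $g \leftarrow \mathrm{clip}(g,-1,1)$
  $g_{f,h,w} \leftarrow g_{f,h,w}\cdot(\alpha_{h,w} > 0.5)$

  Update movie:
  if $t < 10$ then
    $lr_{current} \leftarrow lr \cdot \frac{t}{10}$
  else if $t \geq 10$ then
    $lr_{current} \leftarrow lr$
  end if
  $lr_{current} \leftarrow lr_{current}\cdot\sqrt{1-0.999^{t}}/(1-\beta_1^{t})$
  $m \leftarrow \beta_1 \cdot m + (1-\beta_1)\cdot g$
  $\hat{m} \leftarrow \frac{m}{1-\beta_1^{t}}$
  $\hat{x} \leftarrow \hat{x} - lr_{current}\cdot\hat{m}$
  Clip movie: $\hat{x} \leftarrow \mathrm{clip}(\hat{x},0,255)$
end for

Ensembling: $\hat{x} \leftarrow \frac{1}{q}\sum_{i=1}^{q}\hat{x}$

Denoise (optional): $\hat{x} \leftarrow \mathrm{GaussianBlur3D}(\hat{x},\sigma=(0.5,0.5,0.5))$

Mask movie: $\hat{x}_{f,h,w} \leftarrow \hat{x}_{f,h,w}\cdot(\alpha_{h,w} > 1)$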
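
The update portion of Algorithm 1 can be sketched for a flattened video, with the gradients supplied directly instead of being computed by a DNEM. This is an illustration under stated assumptions: each window gradient is taken to be zero-padded to the full video length, 0.999 is treated as the usual Adam β2 under a square root, and all function names are hypothetical.

```python
import math

def effective_lr(t, lr=1000.0, beta1=0.9, beta2=0.999, warmup=10):
    """Warm-up over the first epochs, then the Adam-style step-size correction."""
    warm = lr * t / warmup if t < warmup else lr
    return warm * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)

def normalize(grad, eps=1e-6):
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / (norm + eps) for g in grad]

def update_step(x, window_grads, keep, m, t, beta1=0.9):
    """One epoch of the movie update for a flattened video x (list of pixel values)."""
    normed = [normalize(g) for g in window_grads]            # per-window normalization
    n = len(normed)
    g = [sum(col) / n for col in zip(*normed)]               # average across windows
    g = [max(-1.0, min(1.0, gi)) * k for gi, k in zip(g, keep)]  # clip, then mask
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]  # first moment only
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    step = effective_lr(t)
    x = [max(0.0, min(255.0, xi - step * mh)) for xi, mh in zip(x, m_hat)]
    return x, m

x0 = [125.0, 125.0, 125.0]   # gray initialization
grads = [[3.0, 0.0, -4.0]]   # one toy window gradient
keep = [1.0, 1.0, 0.0]       # binarized mask: last pixel lies outside the mask
x1, m1 = update_step(x0, grads, keep, [0.0, 0.0, 0.0], t=1)
print([round(v, 2) for v in x1])  # [106.03, 125.0, 125.0]
```

Only the first pixel moves in this example: the second has zero gradient and the third is masked out, so it keeps its gray initialization.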

Appendix 1 – algorithm 2. Mask training.
Parameters:
 Dynamic Neural Encoding Model (DNEM), ground truth neuronal response $y$, predicted neuronal responses $\hat{y}$, ground
 truth input video $x$, background video $background$, video width $w=64$, video height $h=36$, frames $f$, number of frames $n=32$, transparency mask $\alpha$, learning rate $lr=10$, loss function $\mathrm{Loss}(y,\hat{y},\alpha) = \mathrm{MSE}(y,\hat{y}\cdot(1-\bar{\alpha}))$,
 number of epochs $epochs=1000$, alpha blending function $\mathrm{Blend}(x,background,\alpha) = x\cdot\alpha + background\cdot(1-\alpha)$

Objective:
 $F(y,x,background,\alpha) = \mathrm{Loss}(y,\mathrm{DNEM}(\mathrm{Blend}(x,background,\mathrm{clip}(\alpha,-1,1))),\alpha)$

Initialize variables:
 $\alpha_{h,w} \leftarrow U(0,0.5)$

for iteration $t = 1,\dots,epochs$ do
  $x \leftarrow$ get random video
  $background \leftarrow$ get different random video
  $background \leftarrow \mathrm{Flip}(background, dim=1)$
  $background \leftarrow \mathrm{Flip}(background, dim=2)$
  $background \leftarrow \mathrm{Flip}(background, dim=3)$

  Gradients with respect to transparency mask:
  $G \leftarrow \frac{\partial F(y,x,background,\alpha)}{\partial \alpha}$
  $G \leftarrow \frac{G}{\lVert G \rVert + 10^{-6}}$

  Average gradients across frames:
  $g \leftarrow \frac{1}{n}\sum_{f=1}^{n} G_f$

  Clip and smooth gradients:
  $g \leftarrow \mathrm{clip}(g,-1,1)$
  $g \leftarrow \mathrm{GaussianBlur2D}(g,\sigma=(5,5))$

  Update mask:
  $\alpha \leftarrow \alpha - lr\cdot g$
end for
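
The gradient post-processing and mask update of Algorithm 2 can be sketched as follows, again with the per-frame gradients supplied directly rather than computed by a DNEM. The separable Gaussian blur with replicated edges and a 3σ kernel radius, the use of a single norm over all frames, and the function names are assumptions made for illustration.

```python
import math

def gaussian_kernel(sigma):
    radius = int(3 * sigma)
    weights = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(img, kernel):
    """1D Gaussian blur along each row, replicating edge pixels."""
    r = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        out.append([sum(k * row[min(max(j + d, 0), n - 1)]
                        for d, k in zip(range(-r, r + 1), kernel))
                    for j in range(n)])
    return out

def blur2d(img, sigma):
    kernel = gaussian_kernel(sigma)
    rows = blur_rows(img, kernel)
    cols = blur_rows([list(c) for c in zip(*rows)], kernel)  # blur along the other axis
    return [list(r) for r in zip(*cols)]

def mask_update(alpha, grad_frames, lr=10.0, sigma=5.0, eps=1e-6):
    """Normalize, average across frames, clip, smooth, then take one descent step."""
    norm = math.sqrt(sum(g * g for f in grad_frames for row in f for g in row))
    n, h, w = len(grad_frames), len(alpha), len(alpha[0])
    g = [[sum(f[i][j] for f in grad_frames) / (n * (norm + eps)) for j in range(w)]
         for i in range(h)]
    g = [[max(-1.0, min(1.0, v)) for v in row] for row in g]
    g = blur2d(g, sigma)
    return [[a - lr * v for a, v in zip(a_row, g_row)] for a_row, g_row in zip(alpha, g)]

alpha0 = [[0.0] * 4 for _ in range(3)]
grads = [[[-1.0] * 4 for _ in range(3)] for _ in range(2)]  # two frames of constant gradient
alpha1 = mask_update(alpha0, grads)
print(round(alpha1[0][0], 3))  # 2.041, i.e. 10 / sqrt(24)
```

With a constant gradient the blur has no effect and the mask rises uniformly, which makes the normalization and step size easy to verify by hand.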

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication. For the purpose of Open Access, the authors have applied a CC BY public copyright license to any Author Accepted Manuscript version arising from this submission.

Contributor Information

Joel Bauer, Email: joel.bauer@ucl.ac.uk.

Rachel Denison, Boston University, United States.

Yanchao Bi, Peking University, China.

Funding Information

This paper was supported by the following grants:

  • Wellcome Trust 318818/Z/24/Z to Troy W Margrie.

  • Wellcome Trust 10.35802/306384 to Troy W Margrie.

  • Gatsby Charitable Foundation GAT4057 to Troy W Margrie.

  • European Molecular Biology Organization ALTF 415-2024 to Joel Bauer.

  • Wellcome Trust 10.35802/200790 to Claudia Clopath.

  • Simons Foundation 564408 to Claudia Clopath.

  • Engineering and Physical Sciences Research Council EP/R035806/1 to Claudia Clopath.

  • European Research Council MotorAdapt 101169605 to Claudia Clopath.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Software, Formal analysis, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing.

Conceptualization, Resources, Supervision, Funding acquisition, Project administration, Writing – review and editing.

Conceptualization, Resources, Supervision, Funding acquisition, Project administration, Writing – review and editing.

Additional files

MDAR checklist

Data availability

The code is available at https://github.com/Joel-Bauer/movie_reconstruction_code (copy archived at Bauer, 2025).

The following previously published datasets were used:

Fahey P, Turishcheva P, Hansel L, Froebe R, Ponder K, Vystrcilová M, Qiu Y, Willeke K, Bashiri M, Tolias A, Sinz A, Ecker A. 2023. The Dynamic Sensorium competition for predicting large-scale mouse visual cortex activity from videos - Dataset. G-Node Gin. Sensorium2023Data

Fahey P, Turishcheva P, Hansel L, Froebe R, Ponder K, Vystrcilová M, Qiu Y, Willeke K, Bashiri M, Tolias A, Sinz A, Ecker A. 2023. The Dynamic Sensorium competition for predicting large-scale mouse visual cortex activity from videos - Dataset. G-Node Gin. sensorium_2023_data

References

  1. Baikulov R. Solution for sensorium 2023 competition. v23.11.22Zenodo. 2023a doi: 10.5281/zenodo.10155151. [DOI]
  2. Baikulov R. Sensorium. 6849050Github. 2023b https://github.com/lRomul/sensorium
  3. Bauer J. Movie_reconstruction_code. swh:1:rev:eb5ad2143b5d54aad4ca44cb26b919cfab987920Software Heritage. 2025 https://archive.softwareheritage.org/swh:1:dir:64f8fef9fa2f753d3b9f3874a5674bb90a04149d;origin=https://github.com/Joel-Bauer/movie_reconstruction_code;visit=swh:1:snp:8c410d2b0042f98984a09abb4509996007223d58;anchor=swh:1:rev:eb5ad2143b5d54aad4ca44cb26b919cfab987920
  4. Benchetrit Y, Banville H, King JR. Brain decoding: toward real-time reconstruction of visual perception. arXiv. 2023 doi: 10.48550/arxiv.2310.19812. [DOI]
  5. Brackbill N, Rhoades C, Kling A, Shah NP, Sher A, Litke AM, Chichilnisky EJ. Reconstruction of natural images from responses of primate retinal ganglion cells. eLife. 2020;9:e58516. doi: 10.7554/eLife.58516. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Breiman L. Stacked Regressions. Machine Learning. 1996;24:49–64. doi: 10.1023/A:1018046112532. [DOI] [Google Scholar]
  7. Chen T-W, Wardill TJ, Sun Y, Pulver SR, Renninger SL, Baohan A, Schreiter ER, Kerr RA, Orger MB, Jayaraman V, Looger LL, Svoboda K, Kim DS. Ultrasensitive fluorescent proteins for imaging neuronal activity. Nature. 2013;499:295–300. doi: 10.1038/nature12354. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Chen Z, Qing J, Zhou JH. Cinematic mindscapes: high-quality video reconstruction from brain activity. arXiv. 2023 doi: 10.48550/arxiv.2305.11675. [DOI]
  9. Chen Y, Beech P, Yin Z, Jia S, Zhang J, Yu Z, Liu JK. Decoding dynamic visual scenes across the brain hierarchy. PLOS Computational Biology. 2024;20:e1012297. doi: 10.1371/journal.pcbi.1012297.
  10. Cheng FL, Horikawa T, Majima K, Tanaka M, Abdelhack M, Aoki SC, Hirano J, Kamitani Y. Reconstructing visual illusory experiences from human brain activity. Science Advances. 2023;9:eadj3906. doi: 10.1126/sciadv.adj3906.
  11. Cobos E, Muhammad T, Fahey PG, Ding Z, Ding Z, Reimer J, Sinz FH, Tolias AS. It takes neurons to understand neurons: digital twins of visual cortex synthesize neural metamers. bioRxiv. 2022. doi: 10.1101/2022.12.09.519708.
  12. Deitch D, Rubin A, Ziv Y. Representational drift in the mouse visual cortex. Current Biology. 2021;31:4327–4339. doi: 10.1016/j.cub.2021.07.062.
  13. Denk W, Strickler JH, Webb WW. Two-photon laser scanning fluorescence microscopy. Science. 1990;248:73–76. doi: 10.1126/science.2321027.
  14. Elfwing S, Uchibe E, Doya K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks. 2018;107:3–11. doi: 10.1016/j.neunet.2017.12.012.
  15. Fiser A, Mahringer D, Oyibo HK, Petersen AV, Leinweber M, Keller GB. Experience-dependent spatial expectations in mouse visual cortex. Nature Neuroscience. 2016;19:1658–1664. doi: 10.1038/nn.4385.
  16. Földiák P. Computation and Neural Systems. Springer; 1993.
  17. Garasto S, Nicola W, Bharath AA, Schultz SR. Neural Sampling Strategies for Visual Stimulus Reconstruction from Two-photon Imaging of Mouse Primary Visual Cortex. 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER); 2019. pp. 566–570.
  18. George T. Gp_video. 6465c31. GitHub. 2024. https://github.com/TomGeorge1234/gp_video
  19. Giovannucci A, Friedrich J, Gunn P, Kalfon J, Brown BL, Koay SA, Taxidis J, Najafi F, Gauthier JL, Zhou P, Khakh BS, Tank DW, Chklovskii DB, Pnevmatikakis EA. CaImAn an open source tool for scalable calcium imaging data analysis. eLife. 2019;8:e38173. doi: 10.7554/eLife.38173.
  20. Goltstein PM, Coffey EBJ, Roelfsema PR, Pennartz CMA. In vivo two-photon Ca2+ imaging reveals selective reward effects on stimulus-specific assemblies in mouse visual cortex. The Journal of Neuroscience. 2013;33:11540–11555. doi: 10.1523/JNEUROSCI.1341-12.2013.
  21. Ho JK, Horikawa T, Majima K, Cheng F, Kamitani Y. Inter-individual deep image reconstruction via hierarchical neural code conversion. NeuroImage. 2023;271:120007. doi: 10.1016/j.neuroimage.2023.120007.
  22. Ito M, Tamura H, Fujita I, Tanaka K. Size and position invariance of neuronal responses in monkey inferotemporal cortex. Journal of Neurophysiology. 1995;73:218–226. doi: 10.1152/jn.1995.73.1.218.
  23. Jurjut O, Georgieva P, Busse L, Katzner S. Learning enhances sensory processing in Mouse V1 before improving behavior. The Journal of Neuroscience. 2017;37:6460–6474. doi: 10.1523/JNEUROSCI.3485-16.2017.
  24. Kalantari F, Faez K, Amindavar H, Nazari S. Improved image reconstruction from brain activity through automatic image captioning. Scientific Reports. 2025;15:4907. doi: 10.1038/s41598-025-89242-3.
  25. Kanamori T, Mrsic-Flogel TD. Independent response modulation of visual cortical neurons by attentional and behavioral states. Neuron. 2022;110:3907–3918. doi: 10.1016/j.neuron.2022.08.028.
  26. Kay KN, Naselaris T, Prenger RJ, Gallant JL. Identifying natural images from human brain activity. Nature. 2008;452:352–355. doi: 10.1038/nature06713.
  27. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv. 2014. doi: 10.48550/arxiv.1412.6980.
  28. Koide-Majima N, Nishimoto S, Majima K. Mental image reconstruction from human brain activity: Neural decoding of mental imagery via deep neural network-based Bayesian estimation. Neural Networks. 2024;170:349–363. doi: 10.1016/j.neunet.2023.11.024.
  29. Kupershmidt G, Beliy R, Gaziv G, Irani M. A penny for your (visual) thoughts: self-supervised reconstruction of natural movies from brain activity. arXiv. 2022. doi: 10.48550/arxiv.2206.03544.
  30. Li W. Perceptual learning: use-dependent cortical plasticity. Annual Review of Vision Science. 2016;2:109–130. doi: 10.1146/annurev-vision-111815-114351.
  31. Li W, Zheng S, Liao Y, Hong R, He C, Chen W, Deng C, Li X. The brain-inspired decoder for natural visual image reconstruction. Frontiers in Neuroscience. 2023;17:1130606. doi: 10.3389/fnins.2023.1130606.
  32. Mordvintsev A, Pezzotti N, Schubert L, Olah C. Differentiable image parameterizations. Distill. 2018;3:00012. doi: 10.23915/distill.00012.
  33. Niell CM, Stryker MP. Modulation of visual responses by behavioral state in mouse visual cortex. Neuron. 2010;65:472–479. doi: 10.1016/j.neuron.2010.01.033.
  34. Nishimoto S, Vu AT, Naselaris T, Benjamini Y, Yu B, Gallant JL. Reconstructing visual experiences from brain activity evoked by natural movies. Current Biology. 2011;21:1641–1646. doi: 10.1016/j.cub.2011.08.031.
  35. Nomura Y, Ikuta S, Yokota S, Mita J, Oikawa M, Matsushima H, Amano A, Shimonomura K, Seya Y, Koike C. Evaluation of critical flicker-fusion frequency measurement methods using a touchscreen-based visual temporal discrimination task in the behaving mouse. Neuroscience Research. 2019;148:28–33. doi: 10.1016/j.neures.2018.12.001.
  36. Ozcelik F, VanRullen R. Natural scene reconstruction from fMRI signals using generative latent diffusion. Scientific Reports. 2023;13:15666. doi: 10.1038/s41598-023-42891-8.
  37. Peirce J, Gray JR, Simpson S, MacAskill M, Höchenberger R, Sogo H, Kastman E, Lindeløv JK. Psychopy2: Experiments in behavior made easy. Behavior Research Methods. 2019;51:195–203. doi: 10.3758/s13428-018-01193-y.
  38. Pierzchlewicz PA, Willeke KF, Nix AF, Elumalai P, Restivo K, Shinn T, Nealley C, Rodriguez G, Patel S, Franke K, Tolias AS, Sinz FH. Energy guided diffusion for generating neurally exciting images. bioRxiv. 2023. doi: 10.1101/2023.05.18.541176.
  39. Poort J, Khan AG, Pachitariu M, Nemri A, Orsolic I, Krupic J, Bauza M, Sahani M, Keller GB, Mrsic-Flogel TD, Hofer SB. Learning enhances sensory and multiple non-sensory representations in primary visual cortex. Neuron. 2015;86:1478–1490. doi: 10.1016/j.neuron.2015.05.037.
  40. Prusky GT, Douglas RM. Characterization of mouse cortical spatial vision. Vision Research. 2004;44:3411–3418. doi: 10.1016/j.visres.2004.09.001.
  41. Rakhimberdina Z, Jodelet Q, Liu X, Murata T. Natural image reconstruction from fMRI using deep learning: a survey. Frontiers in Neuroscience. 2021;15:795488. doi: 10.3389/fnins.2021.795488.
  42. Rao RPN, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience. 1999;2:79–87. doi: 10.1038/4580.
  43. Reimer J, Froudarakis E, Cadwell CR, Yatsenko D, Denfield GH, Tolias AS. Pupil fluctuations track fast switching of cortical states during quiet wakefulness. Neuron. 2014;84:355–362. doi: 10.1016/j.neuron.2014.09.033.
  44. Ren Z, Li J, Xue X, Li X, Yang F, Jiao Z, Gao X. Reconstructing seen image from brain activity by visually-guided cognitive representation and adversarial learning. NeuroImage. 2021;228:117602. doi: 10.1016/j.neuroimage.2020.117602.
  45. Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); New Orleans, LA, USA. 2021.
  46. Schindelin J, Arganda-Carreras I, Frise E, Kaynig V, Longair M, Pietzsch T, Preibisch S, Rueden C, Saalfeld S, Schmid B, Tinevez J-Y, White DJ, Hartenstein V, Eliceiri K, Tomancak P, Cardona A. Fiji: an open-source platform for biological-image analysis. Nature Methods. 2012;9:676–682. doi: 10.1038/nmeth.2019.
  47. Schneider S, Lee JH, Mathis MW. Learnable latent embeddings for joint behavioural and neural analysis. Nature. 2023;617:360–368. doi: 10.1038/s41586-023-06031-6.
  48. Schumacher JW, McCann MK, Maximov KJ, Fitzpatrick D. Selective enhancement of neural coding in V1 underlies fine-discrimination learning in tree shrew. Current Biology. 2022;32:3245–3260. doi: 10.1016/j.cub.2022.06.009.
  49. Scotti PS, Banerjee A, Goode J, Shabalin S, Nguyen A, Cohen E, Dempster AJ, Verlinde N, Yundler E, Weisberg D, Norman KA, Abraham TM. Reconstructing the mind’s eye: fMRI-to-image with contrastive learning and diffusion priors. arXiv. 2023. doi: 10.48550/arxiv.2305.18274.
  50. Shen G, Dwivedi K, Majima K, Horikawa T, Kamitani Y. End-to-end deep image reconstruction from human brain activity. Frontiers in Computational Neuroscience. 2019a;13:21. doi: 10.3389/fncom.2019.00021.
  51. Shen G, Horikawa T, Majima K, Kamitani Y. Deep image reconstruction from human brain activity. PLOS Computational Biology. 2019b;15:e1006633. doi: 10.1371/journal.pcbi.1006633.
  52. Sinz FH, Ecker AS, Fahey PG, Walker EY, Cobos E, Froudarakis E, Yatsenko D, Pitkow X, Reimer J, Tolias AS. Stimulus domain transfer in recurrent models for large scale cortical population prediction on video. Advances in neural information processing systems 31; 2018.
  53. Stanley GB, Li FF, Dan Y. Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus. The Journal of Neuroscience. 1999;19:8036–8042. doi: 10.1523/JNEUROSCI.19-18-08036.1999.
  54. Tacchetti A, Isik L, Poggio TA. Invariant recognition shapes neural representations of visual input. Annual Review of Vision Science. 2018;4:403–422. doi: 10.1146/annurev-vision-091517-034103.
  55. Takagi Y, Nishimoto S. High-resolution image reconstruction with latent diffusion models from human brain activity. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023.
  56. Thirion B, Duchesnay E, Hubbard E, Dubois J, Poline JB, Lebihan D, Dehaene S. Inverse retinotopy: inferring the visual content of images from brain activation patterns. NeuroImage. 2006;33:1104–1116. doi: 10.1016/j.neuroimage.2006.06.062.
  57. Turishcheva P, Fahey PG, Hansel L, Froebe R, Ponder K, Vystrčilová M, Willeke KF, Bashiri M, Wang E, Ding Z, Tolias AS, Sinz FH, Ecker AS. The dynamic sensorium competition for predicting large-scale mouse visual cortex activity from videos. arXiv. 2023. doi: 10.48550/arxiv.2305.19654.
  58. Turishcheva P, Fahey P, Vystrčilová M, Hansel L, Froebe R. Retrospective for the dynamic sensorium competition for predicting large-scale mouse primary visual cortex activity from videos. Advances in Neural Information Processing Systems 37; 2024.
  59. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN. Attention is all you need. arXiv. 2017. doi: 10.48550/arxiv.1706.03762.
  60. Walker EY, Sinz FH, Cobos E, Muhammad T, Froudarakis E, Fahey PG, Ecker AS, Reimer J, Pitkow X, Tolias AS. Inception loops discover what excites neurons most using deep predictive models. Nature Neuroscience. 2019;22:2060–2065. doi: 10.1038/s41593-019-0517-x.
  61. Wang EY, Fahey PG, Ding Z, Papadopoulos S, Ponder K, Weis MA, Chang A, Muhammad T, Patel S, Ding Z, Tran D, Fu J, Schneider-Mizell CM, Reid RC, Collman F, da Costa NM, Franke K, Ecker AS, Reimer J, Pitkow X, Sinz FH, Tolias AS, MICrONS Consortium. Foundation model of neural activity predicts response to new stimulus types. Nature. 2025;640:470–477. doi: 10.1038/s41586-025-08829-y.
  62. Willeke KF, Fahey PG, Bashiri M, Hansel L, Blessing C, Lurz KK, Burg MF, Cadena SA, Ding Z, Ponder K. Retrospective on the sensorium 2022 competition. In: Willeke KF, editor. NeurIPS 2022 Competition Track. PMLR; 2023. pp. 314–333.
  63. Willeke KF, Restivo K, Franke K, Nix AF, Cadena SA, Shinn T, Nealley C, Rodriguez G, Patel S, Ecker AS, Sinz FH, Tolias AS. Deep learning-driven characterization of single cell tuning in primate visual area V4 supports topological organization. eLife. 2026;15:RP109875. doi: 10.7554/eLife.109875.1.
  64. Xia J, Marks TD, Goard MJ, Wessel R. Stable representation of a naturalistic movie emerges from episodic activity with gain variability. Nature Communications. 2021;12:5170. doi: 10.1038/s41467-021-25437-2.
  65. Yoshida T, Ohki K. Natural images are reliably represented by sparse and variable populations of neurons in visual cortex. Nature Communications. 2020;11:872. doi: 10.1038/s41467-020-14645-x.
  66. Zhang Y, Jia S, Zheng Y, Yu Z, Tian Y, Ma S, Huang T, Liu JK. Reconstruction of natural visual scenes from neural spikes with deep neural networks. Neural Networks. 2020;125:19–30. doi: 10.1016/j.neunet.2020.01.033.

eLife Assessment

Rachel Denison 1

This valuable study uses state-of-the-art neural encoding and video reconstruction methods to achieve a substantial improvement in video reconstruction quality from mouse neural data. It provides a convincing demonstration of how reconstruction performance can be improved by combining these methods. The goal of the study was improving reconstruction performance rather than advancing theoretical understanding of neural processing, so the results will be of practical interest to the brain decoding community.

Reviewer #2 (Public review):

Anonymous

Summary:

This is an interesting study exploring methods for reconstructing visual stimuli from neural activity in the mouse visual cortex. Specifically, it uses a competition dataset (published in the Dynamic Sensorium benchmark study) and a recent winning model architecture (DNEM, dynamic neural encoding model) to recover visual information stored in ensembles of mouse visual cortex.

Strengths:

This is a great start for a project addressing visual reconstruction. It is based on physiological data obtained at a single-cell resolution, the stimulus movies were reasonably naturalistic and representative of the real world, the study did not ignore important correlates such as eye position and pupil diameter, and of course, the reconstruction quality exceeded anything achieved by previous studies. There appear to be no major technical flaws in the study, and some potential confounds were addressed upon revision. The study is an enjoyable read.

Weaknesses:

The study is technically competent and benchmark-focused, but without significant conceptual or theoretical advances. The inclusion of neuronal data broadens the study's appeal, but the work does not explore potential principles of neural coding, which limits its relevance for neuroscience and may disappoint some neuroscientists. The authors are transparent that their goal was methodological rather than explanatory, but this raises the question of why neuronal data were necessary at all, as more significant reconstruction improvements might be achievable using noiseless artificial video encoders alone (network-to-network decoding approaches have been done well by teams such as Han, Poggio, and Cheung, 2023, ICML). Yet, even within the methodological domain, the study does not articulate clear principles or heuristics that could guide future progress. The finding that more neurons improve reconstruction aligns with well-established results in the literature showing that higher neuronal numbers improve decoding in general (for example, Hung, Kreiman, Poggio, and DiCarlo, 2005) and thus may not constitute a novel insight.

Specific issues:

(1) The study showed that it could achieve high-quality video reconstructions from mouse visual cortex activity using a neural encoding model (DNEM), recovering 10-second video sequences and approaching a two-fold improvement in pixel-by-pixel correlation over previous attempts. As a reader, I was left with the question: okay, does this mean that we should all switch to DNEM for our investigations of mouse visual cortex? What makes this encoding model special? It is introduced as "a winning model of the Sensorium 2023 competition which achieved a score of 0.301...single trial correlation between predicted and ground truth neuronal activity," but as someone who does not follow this competition (most eLife readers are not likely to do so, either), I do not know how to gauge my response. Is this impressive? What is the best theoretical score, given noise and other limitations? Is the model inspired by the mouse brain in terms of mechanisms or architecture, or was it optimized to win the competition by overfitting it to the nuances of the data set? Of course, I know that as a reader, I am invited to read the references, but the study would stand better on its own if it clarified how its findings depended on this model.

The revision helpfully added context to the Methods about the range of scores achieved by other models, but this information remains absent from the Abstract and other important sections. For instance, the Abstract states, "We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses," yet this point estimate (presented without confidence intervals or comparisons to controls) lacks meaning for readers who are not told how it compares to prior work or what level of performance would be considered strong. Without such context, the manuscript undercuts potentially meaningful achievements.

(2) Along those lines, the authors conclude that "the number of neurons in the dataset and the use of model ensembling are critical for high-quality reconstructions." If true, these principles should generalize across network architectures. I wondered whether the same dependencies would hold for other network types, as this could reveal more general insights. The authors replied that such extensions are expected (since prior work has shown similar effects for static images) but argued that testing this explicitly would require "substantial additional work," be "impractical," and likely not produce "surprising results." While practical difficulty alone is not a sufficient reason to leave an idea untested, I agree that the idea that "more neurons would help" would be unsurprising. The question then becomes: given that this is a conclusion already in the field, what new principle or understanding has been gained in this study?

(3) One major claim was that the quality of the reconstructions depended on the number of neurons in the dataset. There were approximately 8000 neurons recorded per mouse. The correlation difference between the reconstruction achieved by 1000 neurons and 8000 neurons was ~0.2. Is that a lot or a little? One might hypothesize that 7000 additional neurons could contribute more information, but perhaps, those neurons were redundant if their receptive fields are too close together or if they had the same orientation or spatiotemporal tuning. How correlated were these neurons in response to a given movie? Why did so many neurons offer such a limited increase in correlation? Originally, this question was meant to prompt deeper analysis of the neural data, but the authors did not engage with it, suggesting a limited understanding of the neuronal aspects of the dataset.

(4) We appreciated the experiments testing the capacity of the reconstruction process, by using synthetic stimuli created under a Gaussian process in a noise-free way. But this originally further raised questions: what is the theoretical capability for reconstruction of this processing pipeline, as a whole? Is 0.563 the best that one could achieve given the noisiness and/or neuron count of the Sensorium project? What if the team applied the pipeline to reconstruct the activity of a given artificial neural network's layer (e.g., some ResNet convolutional layer), using hidden units as proxies for neuronal calcium activity? In the revision, this concern was addressed nicely in Supplementary Figure 3C. Also, one appreciates that as a follow-up, the team produced error maps (New Figure 6) that highlight where in the frames the reconstructions are likely to fail. But the maps were not analyzed further, and I am not sure if there was a systematic trend in the errors.

(5) I was encouraged by Figure 4, which shows how the reconstructions succeeded or failed across different spatial frequencies. The authors note that "the reconstruction process failed at high spatial frequencies," yet it also appears to struggle with low spatial frequencies, as the reconstructed images did not produce smooth surfaces (e.g., see the top rows of Figures 4A and 4B). In regions where one would expect a single continuous gradient, the reconstructions instead display specular, high-frequency noise. This issue is difficult to overlook and might deserve further discussion.

Reviewer #3 (Public review):

Anonymous

Summary:

This paper presents a method for reconstructing input videos shown to a mouse from the simultaneously recorded visual cortex activity (two-photon calcium imaging data). The publicly available experimental dataset is taken from a recent brain-encoding challenge, and the (publicly available) neural network model that serves to reconstruct the videos is the winning model from that challenge (by distinct authors). The present study applies gradient-based input optimization by backpropagating the brain-encoding error through this selected model (a method that has been proposed in the past, with other datasets). The main contribution of the paper is, therefore, the choice of applying this existing method to this specific dataset with this specific neural network model. The quantitative results appear to go beyond previous attempts at video input reconstruction (although measured with distinct datasets). The conclusions have potential practical interest for the field of brain decoding, and theoretical interest for possible future uses in functional brain exploration.

Strengths:

The authors use a validated optimization method on a recent large-scale dataset, with a state-of-the-art brain encoding model. The use of an ensemble of 7 distinct model instances (trained on distinct subsets of the dataset, with distinct random initializations) significantly improves the reconstructions. The exploration of the relation between reconstruction quality and number of recorded neurons will be useful to those planning future experiments.

Weaknesses:

The main contribution is methodological, and the methodology combines pre-existing components without adding any original component.

eLife. 2026 Mar 10;14:RP105081. doi: 10.7554/eLife.105081.3.sa3

Author response

Joel Bauer 1, Troy W Margrie 2, Claudia Clopath 3

The following is the authors’ response to the current reviews.

Public Reviews:

Reviewer #2 (Public review):

Summary:

This is an interesting study exploring methods for reconstructing visual stimuli from neural activity in the mouse visual cortex. Specifically, it uses a competition dataset (published in the Dynamic Sensorium benchmark study) and a recent winning model architecture (DNEM, dynamic neural encoding model) to recover visual information stored in ensembles of mouse visual cortex.

Strengths:

This is a great start for a project addressing visual reconstruction. It is based on physiological data obtained at a single-cell resolution, the stimulus movies were reasonably naturalistic and representative of the real world, the study did not ignore important correlates such as eye position and pupil diameter, and of course, the reconstruction quality exceeded anything achieved by previous studies. There appear to be no major technical flaws in the study, and some potential confounds were addressed upon revision. The study is an enjoyable read.

Weaknesses:

The study is technically competent and benchmark-focused, but without significant conceptual or theoretical advances. The inclusion of neuronal data broadens the study's appeal, but the work does not explore potential principles of neural coding, which limits its relevance for neuroscience and may disappoint some neuroscientists. The authors are transparent that their goal was methodological rather than explanatory, but this raises the question of why neuronal data were necessary at all, as more significant reconstruction improvements might be achievable using noiseless artificial video encoders alone (network-to-network decoding approaches have been done well by teams such as Han, Poggio, and Cheung, 2023, ICML). Yet, even within the methodological domain, the study does not articulate clear principles or heuristics that could guide future progress. The finding that more neurons improve reconstruction aligns with well-established results in the literature showing that higher neuronal numbers improve decoding in general (for example, Hung, Kreiman, Poggio, and DiCarlo, 2005) and thus may not constitute a novel insight.

We thank the reviewer for this second round of comments and hope we were able to address the remaining points below.

Indeed, using surrogate noiseless data is interesting and useful when developing such methods, or to demonstrate that they work in principle. But in order to evaluate if they really work in practice, we need to use real neuronal data. While we did not try movie reconstruction from layers within artificial neural networks as surrogate data, in Supplementary Figure 3C we provide the performance of our method using simulated/predicted neuronal responses from the dynamic neural encoding model alongside real neuronal responses.

Specific issues:

(1) The study showed that it could achieve high-quality video reconstructions from mouse visual cortex activity using a neural encoding model (DNEM), recovering 10-second video sequences and approaching a two-fold improvement in pixel-by-pixel correlation over previous attempts. As a reader, I was left with the question: okay, does this mean that we should all switch to DNEM for our investigations of mouse visual cortex? What makes this encoding model special? It is introduced as "a winning model of the Sensorium 2023 competition which achieved a score of 0.301...single trial correlation between predicted and ground truth neuronal activity," but as someone who does not follow this competition (most eLife readers are not likely to do so, either), I do not know how to gauge my response. Is this impressive? What is the best theoretical score, given noise and other limitations? Is the model inspired by the mouse brain in terms of mechanisms or architecture, or was it optimized to win the competition by overfitting it to the nuances of the data set? Of course, I know that as a reader, I am invited to read the references, but the study would stand better on its own if it clarified how its findings depended on this model.

The revision helpfully added context to the Methods about the range of scores achieved by other models, but this information remains absent from the Abstract and other important sections. For instance, the Abstract states, "We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses," yet this point estimate (presented without confidence intervals or comparisons to controls) lacks meaning for readers who are not told how it compares to prior work or what level of performance would be considered strong. Without such context, the manuscript undercuts potentially meaningful achievements.

We appreciate that the additional information about the performance of the SOTA DNEM in predicting neural responses could be made more visible in the paper and will therefore move it from the Methods to the Results section:

Line 348 “This model achieved an average single-trial correlation between predicted and ground truth neural activity of 0.291 during the competition, this was later improved to 0.301. The competition benchmark models achieved 0.106, 0.164 and 0.197 single-trial correlation, while the third and second place models achieved 0.243 and 0.265. Across the models, a variety of architectural components were used, including 2D and 3D convolutional layers, recurrent layers, and transformers, to name just a few.” will be moved to the results.

With regard to the lack of context for the performance of our reconstruction in the abstract, we may have overcorrected in the previous revision round and have tried to find a compromise which gives more context to the pixel-level correlation value:

Abstract: “We achieve a pixel-level correlation of 0.57 (95% CI [0.54, 0.60]) between ground-truth movies and single-trial reconstructions. Previous reconstructions based on awake mouse V1 neuronal responses to static images achieved a pixel-level correlation of 0.238 over a similar retinotopic area.”
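
A minimal sketch of how a pixel-level correlation and a confidence interval of this kind could be computed. The response does not state how the interval was obtained, so the percentile bootstrap over movies, the choice to correlate over all pixels and frames of a movie at once, the function names, and the synthetic data are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np

def pixel_correlation(truth, recon):
    """Pearson correlation between two movies, pooled over all pixels and frames."""
    t = np.asarray(truth, dtype=float).ravel()
    r = np.asarray(recon, dtype=float).ravel()
    t = t - t.mean()
    r = r - r.mean()
    denom = np.sqrt((t @ t) * (r @ r))
    return float(t @ r / denom) if denom > 0 else float("nan")

def bootstrap_ci(values, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-movie scores."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample movies with replacement and take the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)

# Synthetic stand-in data (assumption): 20 "movies" of 30 frames at 36 x 64 pixels.
rng = np.random.default_rng(1)
truth = rng.normal(size=(20, 30, 36, 64))
recon = truth + rng.normal(scale=1.5, size=truth.shape)  # noisy "reconstructions"

scores = [pixel_correlation(t, r) for t, r in zip(truth, recon)]
mean_score = float(np.mean(scores))
ci_low, ci_high = bootstrap_ci(scores)
print(f"mean r = {mean_score:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

Resampling whole movies rather than individual pixels treats each movie as one observation, which seems the more conservative choice here because pixels within a movie are strongly correlated.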

(2) Along those lines, the authors conclude that "the number of neurons in the dataset and the use of model ensembling are critical for high-quality reconstructions." If true, these principles should generalize across network architectures. I wondered whether the same dependencies would hold for other network types, as this could reveal more general insights. The authors replied that such extensions are expected (since prior work has shown similar effects for static images) but argued that testing this explicitly would require "substantial additional work," be "impractical," and likely not produce "surprising results." While practical difficulty alone is not a sufficient reason to leave an idea untested, I agree that the idea that "more neurons would help" would be unsurprising. The question then becomes: given that this is a conclusion already in the field, what new principle or understanding has been gained in this study?

As mentioned in our previous round of revisions, we chose not to pursue the comparison of reconstructions using different model architectures in this manuscript because we did not think it would add significant insights to the paper given the amount of work it would require, and we are glad the reviewer agrees.

While the fact that more neurons result in better reconstructions is unsurprising, how quickly performance drops off will depend on the robustness of the method and on the dimensionality of the decoding/reconstruction task (decoding grating orientation likely requires fewer neurons than grayscale image reconstruction, which in turn likely requires fewer neurons than full-color movie reconstruction). How strongly input-optimization-based image/movie reconstruction depends on population size has not been shown, so we felt it was useful for readers to know how well movie reconstruction works with our method when recording from smaller numbers of neurons.
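
The dependence on population size can be illustrated with a toy version of input optimization. The sketch below is not the method used in the paper: it replaces the dynamic neural encoding model with a hypothetical random linear encoder, assumes that encoder is known exactly, and recovers a single 64-pixel "frame" by plain gradient descent on the response-prediction error. All names, sizes, and noise levels are assumptions chosen only to make the example run quickly.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def reconstruct(W, responses, steps=500, lr=1.0):
    """Recover a stimulus by gradient descent on 0.5 * ||W @ x - responses||**2."""
    x = np.zeros(W.shape[1])               # start from a blank (zero) stimulus
    for _ in range(steps):
        err = W @ x - responses            # predicted minus recorded responses
        x -= lr * (W.T @ err)              # gradient of the loss with respect to the stimulus
    return x

def run_demo(n_pixels=64, population_sizes=(16, 64, 256), noise_sd=0.1, seed=0):
    rng = np.random.default_rng(seed)
    stimulus = rng.normal(size=n_pixels)   # hypothetical flattened 8 x 8 "frame"
    results = {}
    for n_neurons in population_sizes:
        # Toy linear encoder (assumption). The scaling keeps the largest eigenvalue
        # of W.T @ W close to 1, so a step size of 1.0 remains stable.
        scale = np.sqrt(n_neurons) + np.sqrt(n_pixels)
        W = rng.normal(size=(n_neurons, n_pixels)) / scale
        responses = W @ stimulus + rng.normal(scale=noise_sd, size=n_neurons)
        results[n_neurons] = pearson(stimulus, reconstruct(W, responses))
    return results

results = run_demo()
for n_neurons, r in results.items():
    print(f"{n_neurons:4d} neurons: pixel correlation = {r:.2f}")
```

With fewer neurons than pixels the problem is underdetermined, so gradient descent from a blank stimulus can only recover the part of the frame that the population constrains. How steeply quality falls off in a real setting would depend on the encoder, the noise, and the stimulus, which this toy does not capture.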

(3) One major claim was that the quality of the reconstructions depended on the number of neurons in the dataset. There were approximately 8000 neurons recorded per mouse. The correlation difference between the reconstruction achieved by 1000 neurons and 8000 neurons was ~0.2. Is that a lot or a little? One might hypothesize that 7000 additional neurons could contribute more information, but perhaps, those neurons were redundant if their receptive fields are too close together or if they had the same orientation or spatiotemporal tuning. How correlated were these neurons in response to a given movie? Why did so many neurons offer such a limited increase in correlation? Originally, this question was meant to prompt deeper analysis of the neural data, but the authors did not engage with it, suggesting a limited understanding of the neuronal aspects of the dataset.

We apologize that we did not engage with this comment enough in the previous round. We assumed that the question arose because there was a misunderstanding about Figure 5: 1000 neurons, not 1 neuron, are sufficient to reconstruct the movies to a pixel-level correlation of 0.344. Of course, the fact that increasing the number of neurons from 1000 to 8000 only increased the reconstruction performance from 0.344 to 0.569 (a 65% increase in correlation) is still worth discussing. To illustrate this drop in performance qualitatively, we show three example frames from movie reconstructions using 1000-8000 neurons in Author response image 1.

Author response image 1. 3 example frames from reconstructions using different numbers of neurons.


As the reviewer points out, the diminishing returns of additional neurons to reconstruction performance are at least partly due to redundancy in how a population of neurons represents visual stimuli. In supplementary figure S2, we inferred the on-off receptive fields of the neurons and show in panel C that visual space is oversampled in terms of the receptive field positions. However, the exact slope/shape of the performance vs population size curve we show in Figure 5 will also depend on the maximum performance of our reconstruction method, which is limited in spatial resolution (Figure 4 & Supplementary Figure S5). It is possible that future reconstruction approaches will require fewer neurons than ours, so we interpret this curve as a description of the reconstruction method itself rather than as a feature of the underlying neuronal code. For that reason, we chose caution and refrained from making any claims about neuronal coding principles based on this plot.

(4) We appreciated the experiments testing the capacity of the reconstruction process, by using synthetic stimuli created under a Gaussian process in a noise-free way. But this originally further raised questions: what is the theoretical capability for reconstruction of this processing pipeline, as a whole? Is 0.563 the best that one could achieve given the noisiness and/or neuron count of the Sensorium project? What if the team applied the pipeline to reconstruct the activity of a given artificial neural network's layer (e.g., some ResNet convolutional layer), using hidden units as proxies for neuronal calcium activity? In the revision, this concern was addressed nicely in the review in Supplementary Figure 3C. Also, one appreciates that as a follow up, the team produced error maps (New Figure 6) that highlight where in the frames the reconstruction are likely to fail. But the maps went unanalyzed further, and I am not sure if there was a systematic trend in the errors.

We are happy to hear that we were able to answer the reviewer’s question of what the maximum theoretical performance of our reconstruction process is in figure 3C. Regarding the error maps, we also did not observe any clear systematic trends. If anything, we noticed that some moving edges were shifted, but we do not think we can quantify this effect with this particular dataset.

(5) I was encouraged by Figure 4, which shows how the reconstructions succeeded or failed across different spatial frequencies. The authors note that "the reconstruction process failed at high spatial frequencies," yet it also appears to struggle with low spatial frequencies, as the reconstructed images did not produce smooth surfaces (e.g., see the top rows of Figures 4A and 4B). In regions where one would expect a single continuous gradient, the reconstructions instead display specular, high-frequency noise. This issue is difficult to overlook and might deserve further discussion.

Thank you for pointing this out; this is indeed true. The reconstructions do have high-frequency noise. We mention this briefly in line 102: “Finally, we applied a 3D Gaussian filter with sigma 0.5 pixels to remove the remaining static noise (Figure S3) and applied the evaluation mask.” In revisiting this sentence, we think it is more appropriate to replace “remove” with “reduce”. This noise is more visible in the Gaussian noise stimuli (Figure 4) because we did not apply the 3D Gaussian filter to these reconstructions, in case it interfered with the estimates of the reconstruction resolution limits.

Given that the Gaussian noise and drifting grating stimuli reconstructions were from predicted activity (“noise-free”), this high-frequency noise is not biological in origin and must therefore come from errors in our reconstruction process. This kind of high-frequency noise has previously been observed in feature visualization (optimizing input to maximize the activity of a specific node within a neural network to visualize what that node encodes; Olah, et al., "Feature Visualization", https://distill.pub/2017/feature-visualization/, 2017). It is caused by a kind of overfitting, whereby a solution to the optimization is found that is not “realistic”. Ways of combating this kind of noise include gradient smoothing, image smoothing, and image transformations during optimization, but these methods can restrict the resolution of the features that are recovered. Since we were more interested in determining the maximum resolution of stimuli that can be reconstructed in Figure 4 and Supplementary Figures 5-6, we chose not to apply these methods.

Reviewer #3 (Public review):

Summary:

This paper presents a method for reconstructing input videos shown to a mouse from the simultaneously recorded visual cortex activity (two-photon calcium imaging data). The publicly available experimental dataset is taken from a recent brain-encoding challenge, and the (publicly available) neural network model that serves to reconstruct the videos is the winning model from that challenge (by distinct authors). The present study applies gradient-based input optimization by backpropagating the brain-encoding error through this selected model (a method that has been proposed in the past, with other datasets). The main contribution of the paper is, therefore, the choice of applying this existing method to this specific dataset with this specific neural network model. The quantitative results appear to go beyond previous attempts at video input reconstruction (although measured with distinct datasets). The conclusions have potential practical interest for the field of brain decoding, and theoretical interest for possible future uses in functional brain exploration.

Strengths:

The authors use a validated optimization method on a recent large-scale dataset, with a state-of-the-art brain encoding model. The use of an ensemble of 7 distinct model instances (trained on distinct subsets of the dataset, with distinct random initializations) significantly improves the reconstructions. The exploration of the relation between reconstruction quality and number of recorded neurons will be useful to those planning future experiments.

Weaknesses:

The main contribution is methodological, and the methodology combines pre-existing components without any new original component.

We thank the reviewer for their balanced assessment of our manuscript.

The following is the authors’ response to the original reviews.

Public Reviews:

Reviewer #1 (Public review):

Summary:

This paper presents a method for reconstructing videos from mouse visual cortex neuronal activity using a state-of-the-art dynamic neural encoding model. The authors achieve high-quality reconstructions of 10-second movies at 30 Hz from two-photon calcium imaging data, reporting a 2-fold increase in pixel-by-pixel correlation compared to previous methods. They identify key factors for successful reconstruction including the number of recorded neurons and model ensembling techniques.

Strengths:

(1) A comprehensive technical approach combining state-of-the-art neural encoding models with gradient-based optimization for video reconstruction.

(2) Thorough evaluation of reconstruction quality across different spatial and temporal frequencies using both natural videos and synthetic stimuli.

(3) Detailed analysis of factors affecting reconstruction quality, including population size and model ensembling effects.

(4) Clear methodology presentation with well-documented algorithms and reproducible code.

(5) Potential applications for investigating visual processing phenomena like predictive coding and perceptual learning.

We thank the reviewer for taking the time to provide this valuable feedback. We would like to add that in our eyes one additional main contribution is the step of going from reconstruction of static images to dynamic videos. We trust that in the revised manuscript, we have now made the point more explicit that static image reconstruction relies on temporally averaged responses, which negates the necessity of having to account for temporal dynamics altogether.

Weaknesses:

The main metric of success (pixel correlation) may not be the most meaningful measure of reconstruction quality:

High correlation may not capture perceptually relevant features.

Different stimuli producing similar neural responses could have low pixel correlations.

The paper doesn't fully justify why high pixel correlation is a valuable goal.

This is a very relevant point. In retrospect, perhaps we did not justify this enough. Sensory reconstruction typically aims to reconstruct sensory input based on brain activity as faithfully as possible. A brain-to-image decoder might therefore be trained to produce images as close to the original input as possible. The loss function to train the decoder would therefore be image similarity on the pixel level. In that case, evaluating reconstruction performance based on pixel correlation is somewhat circular.

However, when reconstructing videos, we optimize the input video in terms of its perceptual similarity to the original video and only then evaluate pixel-level similarity. The perceptual similarity metric we optimize for is the estimate of how the neurons in mouse V1 respond to that video. We then evaluate the similarity of this perceptually optimized video to the original input video with pixel-level correlation. In other words, we optimize for perceptual similarity and then evaluate pixel similarity. If our method optimized pixel-level similarity, then we would agree that perceptual similarity is a more relevant evaluation metric. We do not think it was clear in our original submission that our optimization loss function is a perceptual loss function, and have now made this clearer in Figure 1C-D and have clarified this in the results section, line 70:

“In effect, we optimized the input video to be perceptually similar with respect to the recorded neurons.”

And in line 110:

“Because our optimization of the movies was based on a perceptual loss function, we were interested in how closely these movies matched the originals on the pixel level.”
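The distinction between the optimization loss (in neural response space) and the evaluation metric (in pixel space) can be illustrated with a toy example. This is a minimal sketch only: the random linear encoder `W`, the squared-error loss, and the function name `reconstruct` are assumptions made for illustration and stand in for the far larger dynamic neural encoding model and its loss.

```python
import numpy as np

def reconstruct(encoder_weights, target_response, n_steps=500):
    """Gradient descent on the input of a toy linear encoder (hypothetical stand-in)."""
    x = np.zeros(encoder_weights.shape[1])
    # Step size 1/L, where L is the largest eigenvalue of W^T W, ensures descent.
    lr = 1.0 / np.linalg.norm(encoder_weights, 2) ** 2
    for _ in range(n_steps):
        # The loss 0.5 * ||W x - r||^2 lives in response space, not pixel space.
        residual = encoder_weights @ x - target_response
        x -= lr * (encoder_weights.T @ residual)
    return x

rng = np.random.default_rng(0)
n_neurons, n_pixels = 400, 64
W = rng.normal(size=(n_neurons, n_pixels))         # toy encoder, not a trained model
stimulus = rng.normal(size=n_pixels)               # "ground truth" input
response = W @ stimulus + rng.normal(scale=0.5, size=n_neurons)  # noisy single trial

recon = reconstruct(W, response)
# Pixel-level similarity is only computed afterwards, as an evaluation.
pixel_corr = np.corrcoef(stimulus, recon)[0, 1]
```

The optimizer never sees `stimulus`; it only sees the noisy `response`, so `pixel_corr` is an independent check rather than the quantity being minimized.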

We chose to use pixel correlation to measure pixel-level similarity for several reasons: (1) it has been used in the past to evaluate reconstruction performance (Yoshida et al., 2020), (2) it is insensitive to contrast and luminance, and (3) correlation is a common metric, so most readers will have an intuitive understanding of how it relates to the data.
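The contrast and luminance insensitivity of Pearson correlation can be checked directly. The sketch below is illustrative only: the helper `pixel_correlation`, the array shape, and the choice to correlate over all pixels of the movie at once (rather than per frame) are assumptions, not a description of the evaluation code.

```python
import numpy as np

def pixel_correlation(a, b):
    """Pearson correlation over all pixels of two equally shaped arrays."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

rng = np.random.default_rng(1)
movie = rng.random((30, 36, 64))        # hypothetical frames x height x width

# Lower contrast (gain 0.3) and higher luminance (offset 0.5).
washed_out = 0.3 * movie + 0.5
r_same = pixel_correlation(movie, washed_out)

# A contrast-inverted copy is perfectly anti-correlated.
r_inverted = pixel_correlation(movie, 1.0 - movie)
```

Any positive gain and any offset leave the correlation unchanged, which is why a separate contrast and luminance matching step does not by itself alter this metric.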

To further highlight why pixel similarity might be interesting to visualize, we have included additional analysis in Figure 6 illustrating pixel-level differences between reconstructions from experimentally recorded activity and predicted activity.

We expect that the type of perceptual similarity the reviewer is alluding to is pretrained neural network image embedding similarity (Zhang et al., 2018: https://doi.org/10.48550/arXiv.1801.03924). While these metrics seem to match human perceptual similarity, it is unclear if they reflect mouse vision. We did try to compare the embedding similarity from pretrained networks such as VGG16, but got results suggesting the reconstructed frames were no more similar to the ground truth than random frames, which is obviously not true. This might be because the ground truth videos were too different in resolution from the training data of these networks and because these metrics are typically very sensitive to decreases in resolution.

The best alternative approach to evaluate mouse perceptual similarity would be to show the reconstructed videos to the same animals while recording the same neurons and to compare these neural activation patterns to those evoked by the original ground truth videos. This has been done for static images in the past: Cobos et al., 2022, found that static image reconstructions generated using gradient descent evoked more similar trial-averaged (40 trials) responses to those evoked by ground truth images compared to other reconstruction methods. Unfortunately, we are currently not able to perform these in vivo experiments, which is why we used publicly available data for the current paper. We plan to use this method in the future. But this method is also not flawless as it assumes that the average response to an image is the best reflection of how that image is represented, which may not be the case for an individual trial.

As far as we are aware, there is currently no method that, given a particular activity pattern in response to an image/video, can produce an image/video that induces a neural activity pattern that is closer to the original neural response than simply showing the same image/video again. Hypothetically, such a stimulus exists because of various visual processing phenomena we mention in our discussion (e.g., predictive coding and selective attention), which suggest that the image that is represented by a population of neurons likely differs from the original sensory input. In other words, what the brain represents is an interpretation of reality not a pure reflection. Experimentally verifying this is difficult, as these variations might be present on a single trial level. The first step towards establishing a method that captures the visual representation of a population of neurons is sensory reconstruction, where the aim is to get as close as possible to the original sensory input. We think pixel-level correlation is a stringent and interpretable metric for this purpose, particularly when optimizing for perceptual similarity rather than image similarity directly.

Comparison to previous work (Yoshida et al.) has methodological concerns: Direct comparison of correlation values across different datasets may be misleading; Large differences in the number of recorded neurons (10x more in the current study); Different stimulus types (dynamic vs static) make comparison difficult; No implementation of previous methods on the current dataset or vice versa.

Yes, we absolutely agree that direct comparison to previous static image reconstruction methods is problematic. We primarily do so because we think it is standard practice to give related baselines. We agree that direct comparison of the performance of video reconstruction methods to image reconstruction methods is not really possible. It does not make sense to train and apply a dynamic model on a static image data set where neural activity is time-averaged, as the temporal kernels could not be learned. Conversely, for a static model, which expects a single image as input and predicts time-averaged responses, it does not make sense to feed it a series of temporally correlated movie frames and to simply concatenate the resulting activity predictions. The static model would need to be substantially augmented to incorporate temporal dynamics, which in turn would make it a new method. This puts us in the awkward position of being expected to compare our video reconstruction performance to previous image reconstruction methods without a fair way of doing so. We have now added these caveats in line 119:

“However, we would like to stress that directly comparing static image reconstruction methods with movie reconstruction approaches is fundamentally problematic, as they rely on different data types both during training and evaluation (temporally averaged vs continuous neural activity, images flashed at fixed intervals vs continuous movies).”

We have also toned down the language emphasising the comparison to previous image reconstruction performance in the abstract, results, and conclusion.

Abstract: We removed “We achieve a ~2-fold increase in pixel-by-pixel correlation compared to previous state-of-the-art reconstructions of static images from mouse V1, while also capturing temporal dynamics.” and replaced it with “We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses.”

Discussion: we removed “In conclusion, we reconstruct videos presented to mice based on the activity of neurons in the mouse visual cortex, with a ~2-fold improvement in pixel-by-pixel correlation compared to previous static image reconstruction methods.” and replaced with “In conclusion, we reconstruct videos presented to mice based on single-trial activity of neurons in the mouse visual cortex.”

We have also removed the performance table and have instead added supplementary figure 3 with in-depth comparison across different versions of our reconstruction method (variations of masking, ensembling, contrast & luminance matching, and Gaussian blurring).

Limited exploration of how the reconstruction method could provide insights into neural coding principles beyond demonstrating technical capability.

The aim of this paper was not to reveal principles of neural coding. Instead, we aimed to achieve the best possible performance of video reconstructions and to quantify the limitations. But to highlight its potential we have added two examples of how sensory reconstruction has been applied in human vision research in line 321:

“Although fMRI-based reconstruction techniques are starting to be used to investigate visual phenomena in humans (such as illusions [Cheng et al., 2023] and mental imagery [Shen et al., 2019; Koide-Majima et al., 2024; Kalantari et al., 2025]), visual processing phenomena are likely difficult to investigate using existing fMRI-based reconstruction approaches, due to the low spatial and temporal resolution of the data.”

We have also added a demonstration of how this method could be used to investigate which parts of a reconstruction from a single-trial response differ from the model's prediction (Figure 6). We do this by calculating pixel-level differences between reconstructions from the recorded neural activity and reconstructions from the expected neural activity (the activity predicted by the neural encoding model). Although difficult to interpret, this pixel-by-pixel error map could represent trial-by-trial deviations of the neural code from pure sensory representation. But at this point we cannot know whether these errors are nothing more than errors in the reconstruction process. Deriving meaningful interpretations of these maps would require a substantial amount of additional work and in vivo experiments and so is outside the scope of this paper, but we include this additional analysis now (a) to highlight why pixel-level similarity might be interesting to quantify and visualize and (b) to demonstrate how video reconstruction could be used to provide insights into neural coding, namely as a tool to identify how sensory representations differ from a pure reflection of the visual input.

The claim that "stimulus reconstruction promises a more generalizable approach" (line 180) is not well supported with concrete examples or evidence.

What we mean by generalizable is the ability to apply reconstruction to novel stimuli, which is not possible for stimulus classification. We now explain this better in the paragraph in line 211:

“Stimulus identification, i.e. identifying the most likely stimulus from a constrained set, has been a popular approach for quantifying whether a population of neurons encodes the identity of a particular stimulus [Földiák, 1993, Kay et al., 2008]. This approach has, for instance, been used to decode frame identity within a movie [Deitch et al., 2021, Xia et al., 2021, Schneider et al., 2023, Chen et al.,2024]. Some of these approaches have also been used to reorder the frames of the ground truth movie [Schneider et al., 2023] based on the decoded frame identity. Importantly, stimulus identification methods are distinct from stimulus reconstruction where the aim is to recreate what the sensory content of a neuronal code is in a way that generalizes to new sensory stimuli [Rakhimberdina et al., 2021]. This is inherently a more demanding task because the range of possible solutions is much larger. Although stimulus identification is a valuable tool for understanding the information content of a population code, stimulus reconstruction could provide a more generalizable approach, because it can be applied to novel stimuli.”

None of the stimuli we reconstructed were in the training set of the model, i.e., all were novel. We have also toned down the claim: we have replaced “promises” with “could provide”.

The paper would benefit from addressing how the method handles cases where different stimuli produce similar neural responses, particularly for high-speed moving stimuli where phase differences might be lost in calcium imaging temporal resolution.

Thank you for this suggestion; we think this is a great question. Calcium dynamics are slow and some of the high temporal frequency information could indeed be lost, particularly phase information. In other words, when the stimulus has high temporal frequency information, it is harder to decode spatial information because of the slow calcium dynamics. Ideally, we would look at this effect using the drifting grating stimuli; however, this is problematic because we rely on predicted activity from the SOTA DNEM, and due to the dilation of the first convolution, the periodic grating stimulus causes aliasing. At 15 Hz, when the temporal frequency of the stimulus is half the movie frame rate, the model is actually being given two static images, and so the predicted activity is the interleaved activity evoked by two static images. We therefore do not think using the grating stimuli is a good idea. Instead, we have used the Gaussian stimuli, which are not periodic and are therefore less of a problem.
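The statement that a 15 Hz grating at a 30 Hz frame rate reduces to two static images follows from the sampling alone and can be verified with a one-dimensional grating. This is a minimal sketch: the function `drifting_grating`, its parameters, and the spatial frequency are hypothetical, and it does not model the DNEM or its dilated convolution.

```python
import numpy as np

def drifting_grating(n_frames, width, temporal_freq_hz, frame_rate_hz,
                     cycles_per_width=4.0):
    """Sinusoidal grating drifting along x, sampled at the movie frame rate."""
    x = np.arange(width) / width
    t = np.arange(n_frames) / frame_rate_hz
    phase = 2 * np.pi * (cycles_per_width * x[None, :]
                         - temporal_freq_hz * t[:, None])
    return np.sin(phase)  # shape: (n_frames, width)

# Temporal frequency equal to half the frame rate: phase advances by pi per frame.
fast = drifting_grating(n_frames=8, width=64, temporal_freq_hz=15.0, frame_rate_hz=30.0)
repeats_every_two = np.allclose(fast[0], fast[2])     # frame n+2 equals frame n
alternates_in_sign = np.allclose(fast[1], -fast[0])   # frame n+1 is contrast-inverted

# A slower grating does not collapse in this way.
slow = drifting_grating(n_frames=8, width=64, temporal_freq_hz=5.0, frame_rate_hz=30.0)
slow_repeats = np.allclose(slow[0], slow[2])
```

At half the frame rate the sampled movie is a flicker between one image and its contrast-inverted copy, so the drift direction is no longer defined by the frames themselves.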

We have now also reconstructed phase-inverted Gaussian noise stimuli and plotted the video correlation between the reconstructions from activity evoked by phase-inverted stimuli. On the one hand, we find that even for the fastest changing stimuli, the correlation between the reconstructions from phase-inverted stimuli is negative, meaning phase information is not lost at high temporal frequencies. On the other hand, for the highest spatial frequency stimuli, the correlation is positive. So, the predicted neural activity (and therefore the reconstructions) is phase-insensitive when the spatial frequency is higher than the reconstruction resolution limit we identified (spatial length constant of 1 pixel, or 3.38 degrees). Beyond this limit, the DNEM predicts activity in response to phase-inverted stimuli which, when used for reconstruction, results in movies that are more similar to each other than to the stimuli that actually evoked them.

However, not all information is lost at these high spatial frequencies. If we plot the Shannon entropy in the spatial domain or the motion energy in the temporal domain, we find that even when the reconstructions fail to capture the stimulus at a pixel-specific level (spatial length constant of 1 pixel, or 3.38 degrees), they do capture the general spatial and temporal qualities of the videos.
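For readers unfamiliar with the first of these measures, one common way to compute a Shannon entropy for an image is from its pixel-intensity histogram. The sketch below shows that definition only; the response does not state how the entropy was computed, so the histogram-based definition, the bin count, the intensity range, and the function name `frame_entropy_bits` are all assumptions. Motion energy is not sketched here.

```python
import numpy as np

def frame_entropy_bits(frame, n_bins=256, value_range=(0.0, 255.0)):
    """Shannon entropy (in bits) of the pixel-intensity histogram of one frame."""
    counts, _ = np.histogram(frame, bins=n_bins, range=value_range)
    p = counts[counts > 0] / counts.sum()   # empirical probabilities, zeros dropped
    return float(-(p * np.log2(p)).sum())

# A uniform gray frame carries no intensity variation.
flat = np.full((36, 64), 128.0)
h_flat = frame_entropy_bits(flat)

# A frame of independent random intensities approaches the 8-bit maximum.
rng = np.random.default_rng(2)
noisy = rng.integers(0, 256, size=(36, 64)).astype(float)
h_noisy = frame_entropy_bits(noisy)
```

A summary statistic of this kind ignores where each pixel is, which is why it can match between a stimulus and its reconstruction even when the pixel-level correlation is low.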

We have added these additional analyses to Figure 4 and Supplementary Figure 5.

Reviewer #2 (Public review):

This is an interesting study exploring methods for reconstructing visual stimuli from neural activity in the mouse visual cortex. Specifically, it uses a competition dataset (published in the Dynamic Sensorium benchmark study) and a recent winning model architecture (DNEM, dynamic neural encoding model) to recover visual information stored in ensembles of the mouse visual cortex.

This is a great project - the physiological data were measured at a single-cell resolution, the movies were reasonably naturalistic and representative of the real world, the study did not ignore important correlates such as eye position and pupil diameter, and of course, the reconstruction quality exceeded anything achieved by previous studies. Overall, it is great that teams are working towards exploring image reconstruction. Arguably, reconstruction may serve as an endgame method for examining the information content within neuronal ensembles - an alternative to training interminable numbers of supervised classifiers, as has been done in other studies. Put differently, if a reconstruction recovers a lot of visual features (maybe most of them), then it tells us a lot about what the visual brain is trying to do: to keep as much information as possible about the natural world in which its internal motor circuits may act consequently.

While we enjoyed reading the manuscript, we admit that the overall advance was in the range of those that one finds in a great machine learning conference proceedings paper. More specifically, we found no major technical flaws in the study, only a few potential major confounds (which should be addressable with new analyses), and the manuscript did not make claims that were not supported by its findings, yet the specific conceptual advance and significance seemed modest. Below, we will go through some of the claims, and ask about their potential significance.

We thank the reviewer for the positive feedback on our paper.

(1) The study showed that it could achieve high-quality video reconstructions from mouse visual cortex activity using a neural encoding model (DNEM), recovering 10-second video sequences and approaching a two-fold improvement in pixel-by-pixel correlation over previous attempts. As a reader, I am left with the question: okay, does this mean that we should all switch to DNEM for our investigations of the mouse visual cortex? What makes this encoding model special? It is introduced as "a winning model of the Sensorium 2023 competition which achieved a score of 0.301... single-trial correlation between predicted and ground truth neuronal activity," but as someone who does not follow this competition (most eLife readers are not likely to do so, either), I do not know how to gauge my response. Is this impressive? What is the best achievable score, in theory, given data noise? Is the model inspired by the mouse brain in terms of mechanisms or architecture, or was it optimized to win the competition by overfitting it to the nuances of the data set? Of course, I know that as a reader, I am invited to read the references, but the study would stand better on its own if it clarified how its findings depended on this model.

This is a very good point. We do not think that everyone should switch to using this particular DNEM to investigate the mouse visual cortex, but we think DNEMs and stimulus reconstruction in general have a lot of potential. We think static neural encoding models have already been demonstrated to be an extremely valuable tool to investigate visual coding (Walker et al., 2019; Yoshida et al., 2021; Willeke et al., bioRxiv 2023). DNEMs are less common, largely because they are very large and are technically more demanding to train and use. That makes static encoding models more practical for some applications, but they do not have temporal kernels and are therefore only used for static stimuli. They cannot, for instance, encode direction tuning, only orientation tuning. But both static and dynamic encoding models have advantages over stimulus classification methods, which we outline in our discussion. Here we provide the first demonstration that previous achievements in static image reconstruction are transferable to movies.

It has been shown in the past for static neural encoding models that choosing a better-performing model produces reconstructed static images that are closer to the original image (Pierzchlewicz et al., 2023). The factors in choosing this particular DNEM were its capacity to predict neural activity (benchmarked against other models), it was open source, and the data it was designed for was also available.

To give more context to the model used in the paper, we have included the following, line 348:

“This model achieved an average single-trial correlation between predicted and ground truth neural activity of 0.291 during the competition, this was later improved to 0.301. The competition benchmark models achieved 0.106, 0.164 and 0.197 single-trial correlation, while the third and second place models achieved 0.243 and 0.265. Across the models, a variety of architectural components were used, including 2D and 3D convolutional layers, recurrent layers, and transformers, to name just a few.”

Concerning biologically inspired model design: the winning model contained 3 fully connected layers comprising the “Cortex” just before the final readout of neural activity, but we would consider this level of biological inspiration as minor. We do not think that the exact architecture of the model is particularly important, as the crucial aspect of such neural encoders is their ability to predict neural activity irrespective of how they achieve it. There has been a move towards creating foundation models of the brain (Wang et al., 2025) and the priority so far has been on predictive performance over mechanistic interpretability or similarity to biological structures and processes.

Finally, we would like to note that we do not know what the maximum theoretical score for single-trial responses might be, and don't think there is a good way of estimating it in this context.

(2) Along those lines, two major conclusions were that "critical for high-quality reconstructions are the number of neurons in the dataset and the use of model ensembling." If true, then these principles should be applicable to networks with different architectures. How well can they do with other network types?

This is a good question. Our method critically relies on the accurate prediction of neural activity in response to new videos. It is therefore expected that a model that better predicts neural responses to stimuli will also be better at reconstructing those stimuli given population activity. This was previously shown for static images (Pierzchlewicz et al., 2023). It is also expected that whenever the neural activity is accurately predicted, the corresponding reconstructed frames will also be more similar to the ground truth frames. We have now demonstrated this relationship between prediction accuracy and reconstruction accuracy in supplementary figure 4.

Although it would be interesting to compare the movie reconstruction performance of many different models with different architectures and activity prediction performances, this would involve quite substantial additional work because movie reconstruction is very resource- and time-intensive. Finding optimal hyperparameters to make such a comparison fair and informative would therefore be impractical and likely not yield surprising results.

We also think it is unlikely that ensembling would not improve reconstruction performance in other models because ensembling across model predictions is a common way of improving single-model performance in machine learning. Likewise, we think it is unlikely that the relationship between neural population size and reconstruction performance would differ substantially when using different models, because using more neurons means that a larger population of noisy neurons is “voting” on what the stimulus is. However, we would expect that if the model were worse at predicting neural activity, then more neurons are needed for an equivalent reconstruction performance. In general, we would recommend choosing the best possible DNEM available, in terms of neural activity prediction performance, when reconstructing movies using input optimization through gradient descent.
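The intuition that averaging several noisy estimates improves on any single one can be made concrete with a small simulation. This is a sketch under strong assumptions: each of the 7 hypothetical ensemble members is modeled as the ground truth plus independent Gaussian error, which real model instances will not satisfy exactly (their errors are partly shared, so the real gain is smaller), and it does not reproduce how ensembling is implemented in the reconstruction pipeline.

```python
import numpy as np

rng = np.random.default_rng(3)
n_members, n_pixels = 7, 5000

truth = rng.normal(size=n_pixels)   # stand-in for the ground-truth pixels

# Each hypothetical member: truth plus its own independent error (assumption).
members = truth + rng.normal(scale=1.5, size=(n_members, n_pixels))

single_corrs = [np.corrcoef(truth, m)[0, 1] for m in members]
ensemble_corr = np.corrcoef(truth, members.mean(axis=0))[0, 1]
```

Under these assumptions the error variance of the average shrinks by the number of members, so `ensemble_corr` exceeds every entry of `single_corrs`. The same averaging argument underlies the "voting" intuition for larger neural populations.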

(3) One major claim was that the quality of the reconstructions depended on the number of neurons in the dataset. There were approximately 8000 neurons recorded per mouse. The correlation difference between the reconstruction achieved by 1 neuron and 8000 neurons was ~0.2. Is that a lot or a little? One might hypothesize that ~7,999 additional neurons could contribute more information, but perhaps, those neurons were redundant if their receptive fields were too close together or if they had the same orientation or spatiotemporal tuning. How correlated were these neurons in response to a given movie? Why did so many neurons offer such a limited increase in correlation?

In the population ablation experiments, we compared the performance using ~1000, ~2000, ~4000, and ~8000 neurons, and found an attenuation of 39.5% in video correlation when dropping 87.5% of the neurons (~1000 neurons remaining). We did not try reconstruction using just 1 neuron.

(4) On a related note, the authors address the confound of RF location and extent. The study resorted to the use of a mask on the image during reconstruction, applied during training and evaluation (Line 87). The mask depends on pixels that contribute to the accurate prediction of neuronal activity. The problem for me is that it reads as if the RF/mask estimate was obtained during the very same process of reconstruction optimization, which could be considered a form of double-dipping (see the "Dead salmon" article, https://doi.org/10.1016/S1053-8119(09)71202-9). This could inflate the reconstruction estimate. My concern would be ameliorated if the mask was obtained using a held-out set of movies or image presentations; further, the mask should shift with eye position, if it indeed corresponded to the "collective receptive field of the neural population." Ideally, the team would also provide the characteristics of these putative RFs, such as their weight and spatial distribution, and whether they matched the biological receptive fields of the neurons (if measured independently).

We can reassure the reviewer that there is no double-dipping. We would like to clarify that the mask was trained only on videos from the training set of the DNEM and not the videos which were reconstructed. We have added the sentence, line 91:

“None of the reconstructed movies were used in the optimization of this transparency mask.”

Making the mask dependent on eye position would be difficult to implement with the current DNEM, where eye position is fed to the model as an additional channel. When using a model where the image is first transformed into retinotopic coordinates in an eye position-dependent manner (such as in Wang et al., 2025), the mask could be applied in retinotopic coordinates and therefore be dependent on eye position.

Effectively, the alpha mask defines the relative level of influence each pixel contributes to neural activity prediction. We agree it is useful to compare the shape of the alpha mask with the location of traditional on-off receptive fields (RFs) to clarify what the alpha mask represents and characterise the neural population available for our reconstructions. We therefore presented the DNEM with on-off patches to map the receptive fields of single neurons in an in silico experiment (the experimentally derived RFs are not available). As expected, there is a rough overlap between the alpha mask (Supplementary Figure 2D), the average population receptive field (Supplementary Figure 2B), and the location of receptive field peaks (Supplementary Figure 2C). In principle, all three could be used during training or evaluation for masking, but we think that defining a mask based on the general influence of images on neural activity, rather than just on-off patch responses, is a more elegant solution.

One idea of how to go a step further would be to first set the alpha mask threshold during training based on the % loss of neural activity prediction performance that threshold induces (in our case alpha=0.5 corresponds to ~3% loss in correlation between predicted vs recorded neural responses, see Supplementary Figure 3D), and second base the evaluation mask on a pixel correlation threshold (see example pixel correlation map in Supplementary Figure 2E) instead to avoid evaluating areas of the image with low image reconstruction confidence.

We referred to this figure in the Results section, line 83:

“The transparency masks are aligned with but not identical to the On-Off receptive field distribution maps using sparse-noise (Figure S2).”

We have also done additional analysis on the effect of masking during training and evaluation with different thresholds in Supplementary Figure 3.

(5) We appreciated the experiments testing the capacity of the reconstruction process, by using synthetic stimuli created under a Gaussian process in a noise-free way. But this further raised questions: what is the theoretical capability for the reconstruction of this processing pipeline, as a whole? Is 0.563 the best that one could achieve given the noisiness and/or neuron count of the Sensorium project? What if the team applied the pipeline to reconstruct the activity of a given artificial neural network's layer (e.g., some ResNet convolutional layer), using hidden units as proxies for neuronal calcium activity?

That’s a very interesting point. It is very hard to know what the theoretical best reconstruction performance of the model would be. Reconstruction performance could be decreased due to neural variability, experimental noise, the temporal kernel of the calcium indicator and the imaging frame rate, information compression along the visual hierarchy, visual processing phenomena (such as predictive coding and selective attention), failure of the model to predict neural activity correctly, or failure of the reconstruction process to find the best possible image which explains the neural activity. We don't think we can disentangle the contribution of all these sources, but we can provide a theoretical maximum assuming that the model and the reconstruction process are optimal. To that end, we performed additional simulations and reconstructed the natural videos using the predicted activity of the neurons in response to the natural videos as the target (similar to the synthetic stimuli) and got a correlation of 0.766. So, the single trial performance of 0.569 is ~75% of this theoretical maximum. This difference can be interpreted as a combination of the losses due to neuronal variability, measurement noise, and actual deviations in the images represented by the brain compared to reality.

We thank the reviewer for this suggestion, as it gave us the idea of looking at error maps (Figure 6), where the pixel-level deviation of the reconstructions from recorded vs predicted activity is overlaid on the ground truth movie.

(6) As the authors mentioned, this reconstruction method provided a more accurate way to investigate how neurons process visual information. However, this method consisted of two parts: one was the state-of-the-art (SOTA) dynamic neural encoding model (DNEM), which predicts neuronal activity from the input video, and the other part reconstructed the video to produce a response similar to the predicted neuronal activity. Therefore, the reconstructed video was related to neuronal activity through an intermediate model (i.e., SOTA DNEM). If one observes a failure in reconstructing certain visual features of the video (for example, high-spatial frequency details), the reader does not know whether this failure was due to a lack of information in the neural code itself or a failure of the neuronal model to capture this information from the neural code (assuming a perfect reconstruction process). Could the authors address this by outlining the limitations of the SOTA DNEM encoding model and disentangling failures in the reconstruction from failures in the encoding model?

To test if a better neural prediction by the DNEM would result in better reconstructions, we ran additional simulations and now show that neural activity prediction performance correlates with reconstruction performance (Supplementary Figure 4B). This is consistent with Pierzchlewicz et al. (2023), who showed that static image reconstructions using better encoding models lead to better reconstruction performance. As also mentioned in the answer to the previous comment, untangling the relative contributions of reconstruction losses is hard, but we think that improvements to the DNEM performance are key. Suggestions for improving the DNEM we used would be to translate the input image into retinotopic coordinates and shift this image relative to eye position before passing it to the first convolutional layer (as is done in Wang et al. 2025), to use movies which are not spatially downsampled as heavily, to not use a dilation of 2 in the temporal convolution of the first layer, and to train on a larger dataset.

(7) The authors mentioned that a key factor in achieving high-quality reconstructions was model ensembling. However, this averaging acts as a form of smoothing, which reduces the reconstruction's acuity and may limit the high-frequency content of the videos (as mentioned in the manuscript). This averaging constrains the tool's capacity to assess how visual neurons process the low-frequency content of visual input. Perhaps the authors could elaborate on potential approaches to address this limitation, given the critical importance of high-frequency visual features for our visual perception.

This is exactly what we also thought. To answer this point more specifically, we ran additional simulations where we also reconstruct the movies using gradient ensembling instead of reconstruction ensembling. Here, the gradients of the loss with respect to each pixel of the movie are calculated for each of the model instances and averaged at every iteration of the reconstruction optimization. In essence, this means that one reconstruction solution is found, and the averaging across reconstructions, which could degrade high-frequency content, is skipped. The reconstructions from both methods look very similar, and the video correlation is, if anything, slightly worse (Supplemental Figure 3A&C). This indicates that our original ensembling approach did not limit reconstruction performance, but that both approaches can be used, depending on what is more convenient given hardware restrictions.
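The difference between the two ensembling strategies can be illustrated with a toy sketch. It is not the authors' implementation: the linear encoders, the squared-error loss, the weight matrices, and the names predict, grad, and optimise are all assumptions made for illustration, whereas the actual method backpropagates through the DNEM.

```python
def predict(W, x):
    """Toy linear encoder: predicted response of each neuron to stimulus x."""
    return [sum(w * p for w, p in zip(row, x)) for row in W]

def grad(W, x, target):
    """Gradient of 0.5 * ||W x - target||^2 with respect to the pixels x."""
    err = [r - t for r, t in zip(predict(W, x), target)]
    return [sum(W[i][j] * err[i] for i in range(len(W))) for j in range(len(x))]

def optimise(models, target, n_pixels, steps=300, lr=0.1):
    """Gradient descent on the stimulus, averaging gradients over `models`."""
    x = [0.0] * n_pixels
    for _ in range(steps):
        grads = [grad(W, x, target) for W in models]
        mean_grad = [sum(g[j] for g in grads) / len(grads) for j in range(n_pixels)]
        x = [xj - lr * gj for xj, gj in zip(x, mean_grad)]
    return x

# Three hypothetical model instances (3 neurons x 2 pixels) that roughly agree.
models = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.9, 0.1], [0.1, 1.1], [1.0, 0.9]],
    [[1.1, -0.1], [-0.1, 0.9], [0.9, 1.1]],
]
stimulus = [1.0, -0.5]                  # toy "ground truth" pixels
target = predict(models[0], stimulus)   # stand-in for recorded activity

# Reconstruction ensembling: one optimisation per model, then average the results.
per_model = [optimise([W], target, n_pixels=2) for W in models]
recon_ensemble = [sum(x[j] for x in per_model) / len(per_model) for j in range(2)]

# Gradient ensembling: a single optimisation driven by the averaged gradient.
grad_ensemble = optimise(models, target, n_pixels=2)

print("reconstruction ensembling:", [round(v, 3) for v in recon_ensemble])
print("gradient ensembling:      ", [round(v, 3) for v in grad_ensemble])
```

In this toy setting both strategies land close to the stimulus. Reconstruction ensembling runs one optimisation per model instance, so the instances can be processed one at a time, whereas gradient ensembling needs every instance at each iteration, which is one way hardware restrictions could favour one approach over the other.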

Reviewer #3 (Public review):

Summary:

This paper presents a method for reconstructing input videos shown to a mouse from the simultaneously recorded visual cortex activity (two-photon calcium imaging data). The publicly available experimental dataset is taken from a recent brain-encoding challenge, and the (publicly available) neural network model that serves to reconstruct the videos is the winning model from that challenge (by distinct authors). The present study applies gradient-based input optimization by backpropagating the brain-encoding error through this selected model (a method that has been proposed in the past, with other datasets). The main contribution of the paper is, therefore, the choice of applying this existing method to this specific dataset with this specific neural network model. The quantitative results appear to go beyond previous attempts at video input reconstruction (although measured with distinct datasets). The conclusions have potential practical interest for the field of brain decoding, and theoretical interest for possible future uses in functional brain exploration.

Strengths:

The authors use a validated optimization method on a recent large-scale dataset, with a state-of-the-art brain encoding model. The use of an ensemble of 7 distinct model instances (trained on distinct subsets of the dataset, with distinct random initializations) significantly improves the reconstructions. The exploration of the relation between reconstruction quality and the number of recorded neurons will be useful to those planning future experiments.

Weaknesses:

The main contribution is methodological, and the methodology combines pre-existing components without any new original components.

We thank the reviewer for taking the time to review our paper and for their overall positive assessment. We would like to emphasise that combining pre-existing machine learning techniques to achieve top results in a new modality does require iteration and innovation. While gradient-based input optimization by backpropagating the brain-encoding error through a neural encoding model has been used in 2D static image optimization to generate maximally exciting images and reconstruct static images, we are the first to have applied it to movies, which required accounting for the time domain. Previous methods used time-averaged responses and were limited to the reconstruction of static images presented with fixed image intervals.

The movie reconstructions include a learned "transparency mask" to concentrate on the most informative area of the frame; it is not clear how this choice impacts the comparison with prior experiments. Did they all employ this same strategy? If not, shouldn't the quantitative results also be reported without masking, for a fair comparison?

Yes, absolutely. All reconstruction approaches limit the field of view in some way, whether this is due to the size of the screen, the size of the image on the screen, or cropping of the presented/reconstructed images during analysis due to the retinotopic coverage of the recorded neurons. Note that we reconstruct a larger field of view than Yoshida et al. In Yoshida et al., the reconstructed field of view was 43 by 43 retinal degrees. We show the size of an example evaluation mask in comparison.

To address the reviewer’s concern more specifically, we performed additional simulations and now also show the performance using a variety of different training and evaluation masks, including different alpha thresholds for training and evaluation masks as well as the effective retinotopic coverage at different alpha thresholds. Despite these comparisons, we would also like to highlight that the comparison to the benchmark is problematic itself. This is because image and movie reconstruction are not directly comparable. It does not make sense to train and apply a dynamic model on a static image dataset where neural activity is time averaged. Conversely, it does not make sense to train or apply a static model that expects time-averaged neural responses on continuous neural activity unless it is substantially augmented to incorporate temporal dynamics, which in turn would make it a new method. This puts us in the awkward position of being expected to compare our video reconstruction performance to previous image reconstruction methods without a fair way of doing so. We have therefore de-emphasised the phrasing comparing our method to previous publications in the abstract, results, and discussion.

Abstract: replaced “We achieve a ~2-fold increase in pixel-by-pixel correlation compared to previous state-of-the-art reconstructions of static images from mouse V1, while also capturing temporal dynamics.” with “We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses.”

Results: “This represents a ~2x higher pixel-level correlation over previous single-trial static image reconstructions from V1 in awake mice (image correlation 0.238 +/- 0.054 s.e.m. for awake mice) [Yoshida et al., 2020] over a similar retinotopic area (~43° x 43°) while also capturing temporal dynamics. However, we would like to stress that directly comparing static image reconstruction methods with movie reconstruction approaches is fundamentally problematic, as they rely on different data types both during training and evaluation (temporally averaged vs continuous neural activity, images flashed at fixed intervals vs continuous movies).”

Discussion: replaced “In conclusion, we reconstruct videos presented to mice based on the activity of neurons in the mouse visual cortex, with a ~2-fold improvement in pixel-by-pixel correlation compared to previous static image reconstruction methods.” with “In conclusion, we reconstruct videos presented to mice based on single-trial activity of neurons in the mouse visual cortex.”

We have also removed the performance table and have instead added supplementary figure 3 with in-depth comparison across different versions of our reconstruction method (variations of masking, ensembling, contrast & luminance matching, and Gaussian blurring).

We believe that we have given enough information in our paper now so that readers can make an informed decision whether our movie reconstruction method is appropriate for the questions they are interested in.

Recommendations for the authors:

Reviewer #2 (Recommendations for the authors):

(1) "Reconstructions have been luminance (mean pixel value across video) and contrast (standard deviation of pixel values across video) matched to ground truth." This was not clear: was it done by the investigating team? I imagine that one of the most easily captured visual features is luminance and contrast, why wouldn't the optimization titrate these well?

The contrast and luminance matching of the reconstructions to the ground truth videos was done by us, but this was only done to help readers assess the quality of the reconstructions by eye. Our performance metrics (frame and video correlation) are contrast and luminance insensitive. To clarify this, we have also added examples of non-adjusted frames in Supplementary Figure 3A, and added a sentence in the results, line 103:

“When presenting videos in this paper we normalize the mean and standard deviation of the reconstructions to the average and standard deviation of the corresponding ground truth movie before applying the evaluation masks, but this is not done for quantification except in Supplementary Figure 3D.”
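The statement that the correlation metrics are insensitive to contrast and luminance follows from Pearson correlation being unchanged by a positive rescaling and offset. A minimal sketch with made-up pixel values is given below; the function names pearson and match_luminance_contrast are hypothetical, and the flattened six-pixel "videos" are not data from the paper.

```python
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of pixel values."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def match_luminance_contrast(recon, truth):
    """Rescale `recon` so its mean and standard deviation equal those of `truth`."""
    r_mean, r_std = statistics.fmean(recon), statistics.pstdev(recon)
    t_mean, t_std = statistics.fmean(truth), statistics.pstdev(truth)
    return [(p - r_mean) / r_std * t_std + t_mean for p in recon]

# Toy flattened "videos" (hypothetical pixel values).
truth = [10.0, 40.0, 80.0, 120.0, 200.0, 90.0]
recon = [0.31, 0.36, 0.47, 0.50, 0.62, 0.44]  # dim, low-contrast reconstruction

matched = match_luminance_contrast(recon, truth)

# The matched version has the ground-truth mean and standard deviation,
# yet its correlation with the ground truth is the same as before matching.
print("correlation before matching:", pearson(recon, truth))
print("correlation after matching: ", pearson(matched, truth))
```

Because the adjustment is a positive linear transform of the pixel values, it changes how the frames look but leaves the correlation with the ground truth unchanged up to floating-point error.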

We were also initially surprised that contrast and luminance are not captured well by our reconstruction method, but this makes sense as V1 is largely luminance invariant (O’Shea et al., 2025 https://doi.org/10.1016/j.celrep.2024.115217) and contrast only has a gain effect on V1 activity (Tring et al., 2024). Decoding absolute contrast is likely unreliable because it is probably not the only factor modulating the overall gain of the neural population.

To address the reviewer’s comment more fully, we ran additional experiments. More specifically, to test why contrast and luminance are not recovered in the reconstructions, we checked how the predicted activity between the reconstruction and the contrast/luminance corrected reconstructions differs. Contrast and luminance adjustment had little impact on predicted response similarity on average. This makes the reconstruction optimization loss function insensitive to overall contrast and luminance so it cannot be decoded. There is a small effect on activity correlation, however, so we cannot completely rule out that contrast and luminance could be reconstructed with a different loss function.

(2) The authors attempted to investigate the variability in reconstruction quality across different movies and 10-second snippets of a movie by correlating various visual features, such as video motion energy, contrast, luminance, and behavioral factors like running speed, pupil diameter, and eye movement, with reconstruction success. However, it would also be beneficial if the authors correlated the response loss (Poisson loss between neural responses) with reconstruction quality (video correlation) for individual videos, as these metrics are expected to be correlated if the reconstruction captures neural variance.

We thank the reviewer for this suggestion. We have now included this analysis and find that if the neural activity was better predicted by the DNEM then the reconstruction of the video was also more similar to the ground truth video. We further found that this effect is shift-dependent (in time), meaning the prediction of activity based on proximal video frames is more influential on reconstruction performance.

Reviewer #3 (Recommendations for the authors):

(1) I was confused about the choice of applying a transparency mask thresholded with alpha>0.5 during training and alpha>1 during evaluation. Why treat the two situations differently? Also, shouldn't we expect alpha to be in the [0,1] range, in which case, what is the meaning of alpha>1? (And finally, as already described in "Weaknesses", how does this choice impact the comparison with prior experiments? Did they also employ a similar masking strategy?)

We found that applying a mask during training increased performance regardless of the size of the evaluation mask. Using a less stringent mask during training than during evaluation increases performance slightly, but also allows inspection of the reconstruction in areas where the model will be less confident without sacrificing performance, if this is desired. The thresholds of 0.5 and 1 were chosen through trial and error, but the exact values do not hold intrinsic meaning. The alpha mask values can go above 1 during their optimization. We could have clipped alpha during the training procedure (algorithm 1), but we decided this was not worth redoing at this stage, as the alphas used for testing were not above 1. All reconstruction approaches in previous publications limit the field of view in some form, whether this is due to the size of the screen, the size of the image on the screen, or the cropping of the presented/reconstructed images during analysis.
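As a rough illustration of how two thresholds on the same alpha map give nested masks, the sketch below binarises a made-up alpha map at 0.5 and at 1 and computes a correlation only inside the stricter mask. The alpha values, the pixel values, and the helper names are hypothetical, and how the mask enters the training loss in the actual method is not reproduced here.

```python
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of pixel values."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def binary_mask(alpha, threshold):
    """True where the (unclipped) alpha value exceeds the threshold."""
    return [a > threshold for a in alpha]

def masked(values, mask):
    """Keep only the pixels that fall inside the mask."""
    return [v for v, m in zip(values, mask) if m]

# Hypothetical alpha map over 8 pixels; values may exceed 1 because alpha is not clipped.
alpha = [0.1, 0.4, 0.7, 1.2, 1.5, 1.1, 0.6, 0.2]
train_mask = binary_mask(alpha, 0.5)  # less stringent mask, as used during training
eval_mask = binary_mask(alpha, 1.0)   # more stringent mask, as used for evaluation

truth = [10.0, 40.0, 80.0, 120.0, 200.0, 90.0, 33.0, 7.0]
recon = [50.0, 50.0, 70.0, 110.0, 180.0, 100.0, 60.0, 50.0]

print("pixels in training mask:  ", sum(train_mask))
print("pixels in evaluation mask:", sum(eval_mask))
print("correlation inside evaluation mask:",
      pearson(masked(truth, eval_mask), masked(recon, eval_mask)))
```

With a higher threshold the evaluation mask is always a subset of the training mask, so the area outside it but inside the training mask can still be inspected without entering the performance metric.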

To address the reviewer’s comment in detail, we have added extensive additional analysis to evaluate the coverage of the reconstruction achieved in this paper and how different masking strategies affect performance, as well as how the mask relates to more traditional receptive field mapping.

(2) I would not use the word "imagery" in the first sentence of the abstract, because this might be interpreted by some readers as reconstruction of mental imagery, a very distinct question.

We changed imagery to images in the abstract.

(3) Line 145-146: "<1 frame, or <30Hz" should be "<1 frame, or >30Hz".

We have corrected the error.

(4) Algorithm 1, Line 5, a subscript variable 'g' should be changed to 'h'

We have corrected the error.

Additional Changes

(1) Minor grammatical errors

(2) Addition of citations: We were previously not aware of a bioRxiv preprint from 2022 (Cobos et al., 2022), which used gradient descent-based input optimization to reconstruct static images but without the addition of a diffusion model. Instead, we had cited for this method Pierzchlewicz et al., 2023 bioRxiv/NeurIPS. In Cobos et al., 2022, they compare static image reconstruction similarity to ground truth images and the similarity of the in vivo evoked activity across multiple reconstruction methods. Performance values are only given for reconstructions from trial-averaged responses across ~40 trials (in the absence of original data or code we are also not able to retrospectively calculate single-trial performance). The authors find that optimizing for evoked activity rather than image similarity produces image reconstructions that evoke more similar in vivo responses compared to reconstructions optimized for image similarity itself. We have now added and discussed the citation in the main text.

(3) Workaround for error in the open-source code from https://github.com/lRomul/sensorium for video hashing function in the SOTA DNEM: By checking the most correlated first frame for each reconstructed movie, we discovered there was a bug in the open-source code and 9/50 movies we originally used for reconstruction were not properly excluded from the training data between DNEM instances. The reason for this error was that some of the movies are different by only a few pixels, and the video hashing function used to split training and test set folds in the original DNEM code classified these movies as different and split them across folds. We have replaced these 9 movies and provide a figure below showing the next closest first frame for every movie clip we reconstruct. This does not affect our claims. Excluding these 9 movie clips did not affect the reconstruction performance (video correlation went from 0.563 to 0.568), so there was no overestimation of performance due to test set contamination. However, they should still be removed, so some of the values in the paper have changed slightly. The only statistical test that was affected was the correlation between video correlation and mean motion energy (Supplementary Figure 4A), which went from p = 0.043 to 0.071.
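The failure mode described above, and the check used to catch it, can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 stands in for the hashing function of the original DNEM code (which is not shown here), the 8-pixel frames are made up, and the 0.99 threshold and the names frame_hash and most_correlated are hypothetical.

```python
import hashlib
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length sequences of pixel values."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def frame_hash(frame):
    """Exact content hash: any single-pixel change yields a different digest."""
    return hashlib.sha256(bytes(frame)).hexdigest()

def most_correlated(frame, training_frames):
    """Return (index, correlation) of the training first frame closest to `frame`."""
    scores = [pearson(frame, f) for f in training_frames]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

test_frame = [12, 40, 80, 120, 200, 90, 33, 7]  # hypothetical 8-pixel first frame
training_frames = [
    [200, 10, 60, 30, 5, 150, 90, 40],          # unrelated clip
    [12, 40, 80, 121, 200, 90, 33, 7],          # same clip, one pixel differs
]

# Exact hashing treats the near-identical clips as different.
hashes_differ = frame_hash(test_frame) != frame_hash(training_frames[1])

# Correlating first frames still flags the near-duplicate.
best_index, best_corr = most_correlated(test_frame, training_frames)
is_duplicate = best_corr > 0.99  # hypothetical threshold

print("hashes differ:", hashes_differ)
print("closest training frame:", best_index, "flagged as duplicate:", is_duplicate)
```

An exact hash is sensitive to every pixel by design, so a similarity measure such as first-frame correlation is better suited to finding clips that differ only slightly.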

Author response image 2. Exclusion of movie clips with duplicates in the DNEM training data.

A) Example frame of a reconstructed movie (ground truth) and the most correlated first frame from the training data. B) All movie clips and their corresponding most correlated clip from the training data. Red boxes indicate excluded duplicates.

Associated Data

    Data Citations

    1. Fahey P, Turishcheva P, Hansel L, Froebe R, Ponder K, Vystrcilová M, Qiu Y, Willeke K, Bashiri M, Tolias A, Sinz A, Ecker A. 2023. The Dynamic Sensorium competition for predicting large-scale mouse visual cortex activity from videos - Dataset. G-Node Gin. Sensorium2023Data
    2. Fahey P, Turishcheva P, Hansel L, Froebe R, Ponder K, Vystrcilová M, Qiu Y, Willeke K, Bashiri M, Tolias A, Sinz A, Ecker A. 2023. The Dynamic Sensorium competition for predicting large-scale mouse visual cortex activity from videos - Dataset. G-Node Gin. sensorium_2023_data

    Supplementary Materials

    Figure 1—source data 1. Source data to Figure 1.
    Figure 1—figure supplement 1—source data 1. Source data to Figure 1—figure supplement 1.
    Figure 1—figure supplement 2—source data 1. Source data to Figure 1—figure supplement 2.
    Figure 1—figure supplement 3—source data 1. Source data to Figure 1—figure supplement 3.
    Figure 2—source data 1. Source data to Figure 2.
    Figure 2—figure supplement 1—source data 1. Source data to Figure 2—figure supplement 1.
    Figure 3—source data 1. Source data to Figure 3.
    Figure 4—figure supplement 1—source data 1. Source data to Figure 4—figure supplement 1.
    Figure 4—figure supplement 2—source data 1. Source data to Figure 1—figure supplement 2.
    Figure 5—source data 1. Source data to Figure 5.
    Figure 6—source data 1. Source data to Figure 6.
    MDAR checklist

    Data Availability Statement

    The code is available at https://github.com/Joel-Bauer/movie_reconstruction_code (copy archived at Bauer, 2025).

    The following previously published datasets were used:

    Fahey P, Turishcheva P, Hansel L, Froebe R, Ponder K, Vystrcilová M, Qiu Y, Willeke K, Bashiri M, Tolias A, Sinz A, Ecker A. 2023. The Dynamic Sensorium competition for predicting large-scale mouse visual cortex activity from videos - Dataset. G-Node Gin. Sensorium2023Data

    Fahey P, Turishcheva P, Hansel L, Froebe R, Ponder K, Vystrcilová M, Qiu Y, Willeke K, Bashiri M, Tolias A, Sinz A, Ecker A. 2023. The Dynamic Sensorium competition for predicting large-scale mouse visual cortex activity from videos - Dataset. G-Node Gin. sensorium_2023_data


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
