Abstract
The human visual cortex extracts both spatial and temporal visual features to support perception and guide behavior. Deep convolutional neural networks (CNNs) provide a computational framework to model cortical representation and organization for spatial visual processing, but they are unable to explain how the brain processes temporal information. To overcome this limitation, we extended a CNN by adding recurrent connections to different layers of the CNN, allowing spatial representations to be remembered and accumulated over time. The extended model, or the recurrent neural network (RNN), embodied a hierarchical and distributed model of process memory as an integral part of visual processing. Unlike the CNN, the RNN learned spatiotemporal features from videos to enable action recognition. The RNN better predicted cortical responses to natural movie stimuli than the CNN in all visual areas, especially those along the dorsal stream. As a fully observable model of visual processing, the RNN also revealed a cortical hierarchy of temporal receptive windows, the dynamics of process memory, and spatiotemporal representations. These results support the hypothesis of process memory, and demonstrate the potential of using the RNN for in-depth computational understanding of dynamic natural vision.
Keywords: neural encoding, deep learning, recurrent neural network, natural vision, temporal receptive window, process memory
INTRODUCTION
Human behavior depends heavily on vision. The brain’s visual system works efficiently and flexibly to support a variety of tasks, such as visual recognition, tracking, and attention, to name a few. Although a computational model of natural vision remains incomplete, such models have evolved from shallow to deep to better explain brain activity (Kriegeskorte, 2015; Khaligh-Razavi et al., 2017), predict human behaviors (Canziani and Culurciello, 2015; Fragkiadaki et al., 2015; Mnih et al., 2015), and support artificial intelligence (AI) (LeCun et al., 2015; Silver et al., 2016). In particular, convolutional neural networks (CNNs), trained with millions of labeled natural images (Russakovsky et al., 2015), have enabled computers to recognize images with human-like performance (He et al., 2015). CNNs bear representational structures similar to those of the visual cortex (Khaligh-Razavi and Kriegeskorte, 2014; Cichy et al., 2016) and predict brain responses to natural stimuli (Yamins et al., 2014; Güçlü and van Gerven, 2015a; Wen et al., 2017a, 2017b; Eickenberg et al., 2017). They thus provide new opportunities for understanding cortical representations of vision (Yamins and DiCarlo, 2016; Khaligh-Razavi et al., 2017).
Nevertheless, CNNs designed for image recognition are incomplete models of the visual system. CNNs are intended and trained to analyze images in isolation, rather than videos, in which the temporal relationships among individual frames carry information about action. In natural viewing conditions, the brain integrates information not only in space (Hubel and Wiesel, 1968) but also in time (Hasson et al., 2008). Both spatial and temporal information is processed by cascaded areas with increasing spatial receptive fields (Wandell et al., 2007) and temporal receptive windows (TRWs) (Hasson et al., 2008) along the visual hierarchy. That is, neurons at progressively higher levels of visual processing accumulate past information over increasingly longer temporal windows to shape their current activity. For such a hierarchical system of spatiotemporal processing, Hasson et al. proposed the notion of “process memory” (Hasson et al., 2015). Unlike the traditional view of memory as being restricted to a few localized reservoirs, process memory is hypothesized to be intrinsic to information processing that unfolds throughout the brain on multiple timescales (Hasson et al., 2015). However, CNNs model only spatial processing via feedforward computation, and lack any mechanism for processing temporal information.
An effective way to model temporal processing is to use recurrent neural networks (RNNs), which learn representations from sequential data (Goodfellow et al., 2016). As its name indicates, an RNN processes the incoming input while also considering its own internal states in the past. In AI, RNNs have made impressive progress in speech and action recognition (Jozefowicz et al., 2015; Greff et al., 2016; Donahue et al., 2015), demonstrating the potential to match human performance on such tasks. In addition, an RNN can be designed with an architecture that resembles the notion of “process memory”, accumulating information in time as an integral part of ongoing sensory processing (Hasson et al., 2015). Therefore, the RNN is a logical step forward from the CNN toward modeling and understanding the inner workings of the visual system in dynamic natural vision.
In this study, we designed, trained, and tested an RNN to model and explain cortical processes for spatial and temporal visual processing. This model began with a static CNN pre-trained for image recognition (Simonyan and Zisserman, 2014a). Recurrent connections were added to different layers in the CNN to embed process memory into spatial processing, so that layer-wise spatial representations could be remembered and accumulated over time to form video representations. While keeping the CNN intact and fixed, the parameters of the recurrent connections were optimized by training the entire model for action recognition with a large set of labeled videos (Soomro et al., 2012). Then, we evaluated how well this RNN model matched the human visual cortex up to a linear transform. Specifically, linear encoding models based on the RNN were trained to predict functional magnetic resonance imaging (fMRI) responses to natural movie stimuli. The prediction accuracy of the RNN was compared with that of the CNN, to address whether and where the recurrent connections allowed the RNN to better model cortical representations given dynamic natural stimuli. Through the RNN, we also characterized and mapped the cortical topography of temporal receptive windows and the dynamics of process memory. By doing so, we attempted to use a fully observable model of process memory to explain the hierarchy of temporal processing, as a way to directly test the hypothesis of process memory (Hasson et al., 2015).
METHODS AND MATERIALS
Experimental Data
The experimental data were from our previous studies (Wen et al., 2017a, 2017b, 2017c), acquired according to a research protocol approved by the Institutional Review Board at Purdue University. Briefly, we acquired fMRI scans from three healthy subjects while they were watching natural videos. The video-fMRI data were split into two datasets to train and test the encoding models, respectively, for predicting fMRI responses given any natural visual stimuli. The training movie contained 12.8 hours of videos for Subject 1 and 2.4 hours for the other subjects (Subjects 2 & 3). The testing movie for every subject contained 40 minutes of videos presented ten times during fMRI (for a total of 400 minutes). These movies included a total of ~9,300 continuous video clips without abrupt scene transitions, covering a wide range of realistic visual experiences. These clips were concatenated and then split into 8-min movie sessions, each of which was used as the stimuli in a single fMRI experiment. Subjects watched each movie session through binocular goggles (20.3°×20.3°) with their eyes fixating on a red cross at the center of the screen. Although fixation was not enforced, our prior study demonstrated the ability to use this video-fMRI dataset to map the retinotopic organization in early visual areas (Wen et al., 2017a), lending indirect support to stable eye fixation. Whole-brain fMRI scans were acquired at 3 T with an isotropic resolution of 3.5 mm and a repetition time of 2 s. The fMRI data were preprocessed and co-registered onto a standard cortical surface (Glasser et al., 2013). More details about the stimuli, data acquisition, and preprocessing are described in (Wen et al., 2017a, 2017b).
Convolutional Neural Network (CNN)
Similar to our prior studies (Wen et al., 2017a, 2017b, 2017c), a pre-trained CNN, known as VGG16 (Simonyan and Zisserman, 2014a), was used to extract hierarchical feature representations of every video frame as the outputs of artificial neurons (or units). This CNN contained 16 layers of units stacked in a feedforward network for processing the spatial information in the input. Among the 16 layers, the first 13 were divided into five blocks (or sub-models). Each block started with multiple convolutional layers with rectified linear units (ReLU) (Nair and Hinton, 2010) and ended with spatial max-pooling (Boureau et al., 2010). To simplify terminology, hereafter we refer to these blocks as layers. The outputs from every layer were organized as three-dimensional arrays (known as feature maps). For the 1st through 5th layers, the sizes of the feature maps were 64×112×112, 128×56×56, 256×28×28, 512×14×14, and 512×7×7, where the 1st dimension was the number of features, and the 2nd and 3rd dimensions specified the width and height (the spatial dimensions). From lower to higher layers, the number of features increased while the spatial dimension per feature decreased. This CNN was implemented in PyTorch (http://pytorch.org/).
Recurrent Neural Network (RNN)
An RNN was constructed by adding recurrent connections to four of the five layers in the CNN. The first layer was excluded to reduce the computational demand, as in a prior study (Ballas et al., 2015). The recurrent connections served to model distributed process memory (Hasson et al., 2015), which allowed the model to memorize and accumulate visual information over time for temporal processing.
Fig. 1A illustrates the design of the RNN for extracting layer-wise feature representations of an input video. Let the input video be a time series of color (RGB) frames with 224×224 pixels per frame. For the video frame x_t at time t, x_t ∈ ℝ^{3×224×224}. The internal states of the RNN at layer l, denoted as h_t^l, were updated at each moment according to the incoming information x_t and the history states h_{t−1}^l, as expressed in Eq. (1).
Figure 1. The recurrent model of vision.
A) The architectural design of the RNN. B) The model training strategy. The gray blocks indicate the CNN layers; the orange blocks indicate the RNN layers. The CNN was pre-trained and fixed, while the RNN was optimized on the task of action recognition.
h_t^l = f_t^l ∘ φ_l(x_t) + (1 − f_t^l) ∘ h_{t−1}^l        (1)
where φ_l(·) denotes the spatial features encoded at layer l in the pre-trained CNN, so φ_l(x_t) was the extracted feature representation of the current input x_t. Importantly, f_t^l was the so-called “forget gate” essential to learning long-term temporal dependency (Pascanu et al., 2013). As its name indicates, the forget gate determined the extent to which the history states should be “forgotten”, or conversely the extent to which the incoming information should be “remembered”. As such, the forget gate controlled, moment by moment, how information was stored into vs. retrieved from process memory. Given a higher value of the forget gate, the RNN’s current states were updated by retrieving less from its “memory” h_{t−1}^l and learning more from the representation of the current input φ_l(x_t). This notion was expressed as the weighted sum of the two terms on the right-hand side of Eq. (1), where ∘ stands for the Hadamard product and the weights of the two terms sum to 1. In short, the RNN embedded an explicit model of “process memory” (Hasson et al., 2015).
Note that the forget gate f_t^l was time dependent but a function of the time-invariant weights, denoted as ω_l, of the recurrent connections, as expressed below.
f_t^l = σ( ω_l ∗ cat( h_{t−1}^l, φ_l(x_t), maxpool(h_t^{l−1}) ) )        (2)
where σ(·) is the sigmoid function whose output ranges from 0 to 1.
As expressed in Eq. (2), the forget gate was computed from a weighted sum of three terms: the RNN’s previous output h_{t−1}^l, the CNN’s current output φ_l(x_t), and the RNN’s current input from the lower layer, maxpool(h_t^{l−1}). Here, maxpool(·) stands for the max-pooling operation, which in this study used a kernel size of 2×2 and a stride of 2 to spatially subsample the RNN’s output at layer l−1 by half and fed the result as the input to layer l in the RNN. Note that the weighted summation was in practice implemented by convolving a 3×3 kernel (with a padding of 1) across all three input terms concatenated together, as expressed by cat(·) and the convolution ∗ in Eq. (2). This reduced the number of unknown parameters to be trained. In other words, ω_l ∈ ℝ^{M×N×3×3}, where M and N were the numbers of output and input feature maps, respectively.
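As a concrete illustration, the sketch below implements one such recurrent layer in PyTorch (the framework noted above for the CNN), following Eqs. (1)-(2). The class name, variable names, and example sizes are illustrative assumptions, not the code used in this study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentLayer(nn.Module):
    """A sketch of one recurrent layer of the RNN (Eqs. 1-2).

    The layer mixes the CNN features of the current frame with its own
    previous state through a convolutional "forget" gate. Names and sizes
    are illustrative assumptions, not the authors' implementation.
    """
    def __init__(self, n_feat, n_feat_below):
        super().__init__()
        # omega_l: a single 3x3 convolution over the concatenated inputs
        # (previous state, current CNN features, pooled lower-layer state).
        self.gate_conv = nn.Conv2d(2 * n_feat + n_feat_below, n_feat,
                                   kernel_size=3, padding=1)

    def forward(self, phi_t, h_prev, h_below):
        # Spatially subsample the lower recurrent layer's current state
        # (2x2 max-pooling, stride 2) to match this layer's resolution.
        h_below = F.max_pool2d(h_below, kernel_size=2, stride=2)
        # Eq. (2): forget gate from the three concatenated terms.
        f_t = torch.sigmoid(self.gate_conv(
            torch.cat([h_prev, phi_t, h_below], dim=1)))
        # Eq. (1): weighted mix of current CNN features and memory.
        h_t = f_t * phi_t + (1.0 - f_t) * h_prev
        return h_t

# Example: a middle recurrent layer (256 features at 28x28), fed from a
# lower layer with 128 features at 56x56; batch size 1.
layer = RecurrentLayer(n_feat=256, n_feat_below=128)
phi_t = torch.randn(1, 256, 28, 28)     # CNN features of the current frame
h_prev = torch.zeros(1, 256, 28, 28)    # previous state of this layer
h_below = torch.randn(1, 128, 56, 56)   # current state of the layer below
h_t = layer(phi_t, h_prev, h_below)
```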
Training the RNN for Action Recognition
The RNN was trained for video action recognition by using the first split of the UCF101 dataset (Soomro et al., 2012). The dataset included 9,537 training videos and 3,783 validation videos from 101 labeled action categories, which included five general types of human actions: human-object interaction, body motion, human-human interaction, playing musical instruments, and sports. See Appendix for the list of all action categories. All videos were resampled at 5 frames per second, and preprocessed as described elsewhere (Ballas et al., 2015), except that we did not artificially augment the dataset with random crops.
To train the RNN with labeled action videos, a linear softmax classifier was added to the RNN to classify every training video frame as one of the 101 action categories. As expressed by Eq. (3), the inputs to the classifier were the feature representations from all layers in the RNN, and its outputs were the normalized probabilities, by which a given video frame was classified into pre-defined categories (Fig. 1B).
ŷ_t = softmax( δᵀ [ avgpool(h_t^l) ]_{∀l} )        (3)
where avgpool(·) reduced each 3-D feature array h_t^l to a 1-D feature vector by averaging over the spatial dimensions (i.e. average pooling); [·]_{∀l} further concatenated the feature vectors across all layers in the RNN; δ ∈ ℝ^{P×101} was a trainable linear function that transformed the concatenated feature vector into a score for each category; and softmax(·) converted the scores into a probability distribution, ŷ_t, to report the result of action categorization given each input video frame.
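A minimal sketch of this readout (Eq. (3)) is shown below, assuming the layer-wise feature-map sizes listed earlier; the variable names are hypothetical.

```python
import torch
import torch.nn as nn

# Hidden states of the four recurrent layers for one frame (batch of 1);
# sizes follow the feature maps listed in the CNN section.
states = [torch.randn(1, 128, 56, 56), torch.randn(1, 256, 28, 28),
          torch.randn(1, 512, 14, 14), torch.randn(1, 512, 7, 7)]

# Eq. (3): average-pool each layer over space, concatenate across layers,
# then apply a trainable linear map (delta) followed by softmax.
pooled = [s.mean(dim=(2, 3)) for s in states]      # each: 1 x n_features
features = torch.cat(pooled, dim=1)                # 1 x P (P = 1408 here)
delta = nn.Linear(features.shape[1], 101)          # the P x 101 projection
probs = torch.softmax(delta(features), dim=1)      # y_hat_t over 101 classes
```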
The loss function for training the RNN was defined as below.
L(δ, [ω_l]_{∀l}) = − (1/T) Σ_{t=1}^{T} log ŷ_t(y_t)        (4)
where yt stands for the true action category labeled for the input xt. Here, the learning objective was to maximize the average (over T samples) log probability of correct classification conditioned on the input {xt}∀t and parameterized by linear projection δ and the recurrent parameters [ωl]∀l.
The RNN was trained by using mini-batch gradient descent and back-propagation through time (Werbos, 1990). The parameters [ω_l]_{∀l} were initialized with random values drawn from a uniform distribution between −0.01 and 0.01. For the training configuration, the batch size was set to 10. The sequence length was 20 frames, so that the losses were accumulated over 20 consecutive frames before back-propagation. A dropout rate of 0.7 was used to train δ. The norm of the gradient vector was clipped at 5. The gradient descent algorithm was based on the Adam optimizer (Kingma and Ba, 2014) with the learning rate initialized as 1e−3. The learning rate was multiplied by 0.1 every 10 epochs, while the learning iterated across all training videos in each epoch.
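The sketch below condenses this training configuration (loss accumulated over a 20-frame clip, Adam with step-wise learning-rate decay, gradient clipping, and dropout on the classifier input). It uses a toy stand-in for the recurrent feature extractor and interprets the gradient normalization as norm clipping at 5; these are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Stand-ins for the recurrent feature extractor and the softmax readout;
# the real model passes recurrent states between frames (omitted here).
feature_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                            nn.AdaptiveAvgPool2d(1), nn.Flatten())
classifier = nn.Linear(8, 101)
dropout = nn.Dropout(p=0.7)                       # dropout used to train delta

params = list(feature_net.parameters()) + list(classifier.parameters())
optimizer = optim.Adam(params, lr=1e-3)           # Adam, initial lr 1e-3
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = nn.CrossEntropyLoss()

# One toy mini-batch: 10 clips of 20 frames, each a 224x224 RGB image.
frames = torch.randn(10, 20, 3, 224, 224)
labels = torch.randint(0, 101, (10,))

optimizer.zero_grad()
loss = 0.0
for t in range(frames.shape[1]):                  # accumulate loss over 20 frames
    logits = classifier(dropout(feature_net(frames[:, t])))
    loss = loss + criterion(logits, labels)
(loss / frames.shape[1]).backward()               # back-propagate the clip loss
nn.utils.clip_grad_norm_(params, max_norm=5.0)    # clip the gradient norm at 5
optimizer.step()
scheduler.step()                                  # lr x 0.1 every 10 epochs
```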
To evaluate the RNN on the task of action recognition, we computed the top-1 accuracy on the validation videos, where a prediction was counted as top-1 accurate if the most probable category matched the label. In addition, we also trained a linear softmax classifier based on the feature representations extracted from the CNN, with the same training data and learning objective, and evaluated its top-1 accuracy for model comparison.
Encoding Models
For each subject, a voxel-wise encoding model (Naselaris et al., 2011) was established for predicting the fMRI response to natural movie stimuli based on the features of the movie extracted by the RNN (or the CNN for comparison). A linear regression model was trained separately for each voxel to project feature representations to voxel responses, similar to prior studies (Güçlü and van Gerven, 2015a, b; Wen et al., 2017a, 2017b, 2017c; Eickenberg et al., 2017). As described below, the same training methods were used regardless of whether the RNN or the CNN was used as the feature model.
Using the RNN (or the CNN), the feature representations of the training movie were extracted and sampled every second. Note that the feature dimension was identical for the CNN and the RNN, both including feature representations from four layers with exactly matched numbers of units in each layer. For each of the four layers, the numbers of units were 401,408, 200,704, 100,352, and 25,088. Combining features across these layers yielded a very high-dimensional feature space. To reduce its dimension, principal component analysis (PCA) was applied first to each layer and then to all layers, similar to our prior studies (Wen et al., 2017a, 2017b, 2017c). The principal components (PCs) were identified based on the feature representations of the training movie and explained 90% of the variance. These PCs defined a set of orthogonal basis vectors spanning a subspace of the original feature space (i.e. the reduced feature space). Applying this basis set as a linear operator, B, to any representation, X, in the original feature space converted it to the reduced feature space, as expressed in Eq. (5); applying the transpose of B to any representation, Z, in the reduced feature space converted it back to the original feature space.
Z = X B        (5)
where X ∈ ℝ^{T×q} stands for the representation of the RNN (or the CNN) with T samples and q units; B is a q-by-p matrix whose columns are the PCs identified with the training movie; and Z ∈ ℝ^{T×p} stands for the p-dimensional feature representation after dimension reduction (p < q).
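The following sketch illustrates the two-stage PCA and the projection of Eq. (5) with scikit-learn, using toy dimensions; it is not the exact pipeline used in the study.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_dimension(X, variance=0.9):
    """Return a PCA basis B (q x p) that keeps the given fraction of variance.

    X: T x q feature matrix from one layer (toy sizes in this sketch).
    """
    pca = PCA(n_components=variance, svd_solver='full')
    pca.fit(X)
    return pca.components_.T          # columns are the principal components

# Toy example: two layers of features for T = 600 time samples.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((600, 500)), rng.standard_normal((600, 300))]

# Step 1: PCA within each layer; project each layer onto its own basis.
per_layer = []
for X in layers:
    B = reduce_dimension(X)
    per_layer.append(X @ B)           # Eq. (5): Z = X B

# Step 2: PCA on the concatenated layer-wise components.
X_all = np.concatenate(per_layer, axis=1)
B_all = reduce_dimension(X_all)
Z = X_all @ B_all                     # reduced feature space used for encoding
```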
The feature representations after dimension reduction (i.e. columns in Z) were individually convolved with a canonical hemodynamic response function (HRF) (Buxton et al., 2004) and down-sampled to match the sampling rate for fMRI. Then, Z was used to fit each voxel’s response during the training movie through a voxel-specific linear regression model, expressed as Eq. (6).
r_v = Z w_v + ε_v        (6)
where r_v is the fMRI response time series at voxel v, w_v is a column vector of regression coefficients specific to voxel v, and ε_v is the error term. To estimate w_v, L2-regularized least-squares estimation was used, as expressed in Eq. (7), where the regularization parameter λ was determined based on five-fold cross-validation.
ŵ_v = argmin_{w_v} ( ‖r_v − Z w_v‖₂² + λ ‖w_v‖₂² )        (7)
To train this linear regression model, we used the fMRI data acquired during the training movie. The model training was performed separately for the two feature models (the RNN and the CNN) using the same training algorithm. Afterwards, we used the trained encoding models to predict cortical fMRI responses to the independent testing movie. The prediction accuracy was quantified as the temporal correlation (r) between the observed and predicted responses at each voxel. As in our previous studies (Wen et al., 2017a, 2017b, 2017c), the statistical significance of the prediction accuracy was evaluated voxel by voxel with a block-permutation test (Adolf et al., 2014) corrected at the false discovery rate (FDR) q < 0.01.
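A compact sketch of this voxel-wise encoding pipeline is given below: each reduced feature is convolved with a double-gamma HRF (a common approximation to the canonical HRF), downsampled to the fMRI sampling rate, and regressed onto a simulated voxel response with cross-validated ridge regression (Eqs. (6)-(7)). All data and parameter values are toy placeholders, and the accuracy here is computed on the same data only for brevity (the study used an independent testing movie).

```python
import numpy as np
from scipy.stats import gamma
from sklearn.linear_model import RidgeCV

def canonical_hrf(dt=1.0, duration=30.0):
    """A double-gamma HRF sampled every dt seconds (a common approximation)."""
    t = np.arange(0.0, duration, dt)
    h = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return h / h.sum()

# Toy data: reduced features Z sampled at 1 Hz; fMRI sampled every 2 s (TR).
rng = np.random.default_rng(1)
T, p, TR = 480, 50, 2
Z = rng.standard_normal((T, p))

# Convolve each component with the HRF and downsample to the fMRI rate.
hrf = canonical_hrf(dt=1.0)
Z_hrf = np.apply_along_axis(lambda z: np.convolve(z, hrf)[:T], 0, Z)
Z_fmri = Z_hrf[::TR]

# Simulate one voxel's response and fit Eq. (6) with L2 regularization,
# choosing lambda by cross-validation (5-fold, as in the text).
w_true = rng.standard_normal(p)
r_v = Z_fmri @ w_true + rng.standard_normal(Z_fmri.shape[0])
model = RidgeCV(alphas=np.logspace(-2, 4, 13), cv=5).fit(Z_fmri, r_v)

# Prediction accuracy: temporal correlation between predicted and observed.
r_pred = model.predict(Z_fmri)
accuracy = np.corrcoef(r_pred, r_v)[0, 1]
```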
Given the dimension reduction of the feature space, ŵv described the contributions to voxel v from individual basis vectors in the reduced feature space (i.e. columns of B in Eq. (5)). Since the dimension reduction was through linear transform, the voxel-wise encoding models (Eq. (6)) could be readily rewritten with the regressors specific to individual units (instead of basis vectors) in the RNN (or the CNN). In this equivalent encoding model, the regression coefficients, denoted as β̂v, reported the contribution from every unit to each voxel, and could be directly computed from ŵv as below.
β̂_v = B ŵ_v        (8)
For each voxel, we further identified a subset of units in the RNN that contributed to the voxel’s response relatively more than other units. To do so, half of the maximum absolute value of β̂_v was taken as a threshold. The units whose corresponding regression coefficients had absolute values greater than this threshold were included in a subset (denoted as I_v) associated with voxel v.
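The unit selection can be sketched in a few lines, assuming toy dimensions for B and ŵ_v:

```python
import numpy as np

rng = np.random.default_rng(2)
q, p = 1000, 50                 # toy numbers of units and principal components
B = rng.standard_normal((q, p))           # PCA basis from Eq. (5)
w_v = rng.standard_normal(p)              # fitted coefficients in the reduced space

beta_v = B @ w_v                          # Eq. (8): unit-wise coefficients
threshold = 0.5 * np.abs(beta_v).max()    # half of the maximum |beta|
I_v = np.flatnonzero(np.abs(beta_v) > threshold)   # contributing units for voxel v
```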
Model Evaluation and Comparison
After training them using the same training data and training algorithms, we compared the encoding models based on the RNN with those based on the CNN. For this purpose, the encoding performance was evaluated as the accuracy of predicting the cortical responses to every session of the testing movie. The prediction accuracy was measured as the temporal correlation (r) and converted to a z score by Fisher’s z-transformation. For each voxel, the z score was averaged across movie sessions and subjects, and the difference in the average z score between the RNN and the CNN was computed voxel by voxel. This voxel-wise difference (Δz) was evaluated for statistical significance using a paired t-test across movie sessions and subjects (p < 0.01). The differences were also assessed for different ROIs, defined based on a cortical parcellation (Glasser et al., 2016), and evaluated for statistical significance using a paired t-test across voxels (p < 0.01). The voxels where the RNN significantly outperformed the CNN were further divided into early visual areas, dorsal-stream areas, and ventral-stream areas. We evaluated whether the improvement in encoding performance (Δz) was significantly higher for the dorsal stream than for the ventral stream, by applying a two-sample t-test to the voxel-wise Δz values in the dorsal vs. ventral visual areas with a significance level of 0.01.
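A small sketch of this comparison, assuming toy accuracy values, is shown below:

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_ind

# Toy accuracies: correlation r for each (session x subject) pair, per voxel.
rng = np.random.default_rng(3)
n_pairs, n_voxels = 15, 200
r_rnn = np.clip(rng.normal(0.35, 0.1, (n_pairs, n_voxels)), -0.99, 0.99)
r_cnn = np.clip(rng.normal(0.30, 0.1, (n_pairs, n_voxels)), -0.99, 0.99)

# Fisher z-transform, then a paired t-test across sessions/subjects per voxel.
z_rnn, z_cnn = np.arctanh(r_rnn), np.arctanh(r_cnn)
delta_z = z_rnn.mean(axis=0) - z_cnn.mean(axis=0)
t_vals, p_vals = ttest_rel(z_rnn, z_cnn, axis=0)

# Dorsal vs. ventral comparison of the improvement (two-sample t-test).
dorsal, ventral = delta_z[:100], delta_z[100:]   # toy voxel grouping
t_dv, p_dv = ttest_ind(dorsal, ventral)
```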
We also compared the encoding performance against the “noise ceiling”, or the upper limit of the prediction accuracy (Nili et al., 2014). The noise ceiling was lower than 1 because the measured fMRI data contained ongoing noise or activity unrelated to the external stimuli, and thus could not be entirely predicted from the stimuli even if the model were perfect. As described elsewhere (Kay et al., 2013), the response (evoked by the stimuli) and the noise (unrelated to the stimuli) were assumed to be additive, independent, and normally distributed. These response and noise distributions were estimated from the data. For each subject, the testing movie was presented ten times. For each voxel, the mean of the noise was assumed to be zero; the variance of the noise was estimated as the mean of the standard errors in the data across the 10 repetitions; the mean of the response was taken as the voxel signal averaged across the 10 repetitions; and the variance of the response was taken as the difference between the variance of the data and the variance of the noise. From the estimated response and noise distributions, we conducted Monte Carlo simulations to draw samples of the response and the noise, and to simulate noisy data by adding the response and noise samples. The correlation between the simulated response and the noisy data was calculated for each of 1,000 repetitions of the simulation, yielding a distribution of noise ceilings at each voxel or ROI.
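The sketch below illustrates one plausible reading of this noise-ceiling estimation for a single voxel; the interpretation of “the mean of the standard errors” as the squared standard error of the repetition mean, as well as all data, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_rep, n_time = 10, 1200
# Toy data for one voxel: a common stimulus-evoked "response" plus
# repetition-specific noise, for 10 repeated presentations.
response = np.sin(np.arange(n_time) / 20.0)
data = response + rng.standard_normal((n_rep, n_time))

# Estimate the noise and response variances from the 10 repetitions.
data_mean = data.mean(axis=0)
noise_var = ((data.std(axis=0, ddof=1) / np.sqrt(n_rep)) ** 2).mean()
resp_var = max(data_mean.var(ddof=1) - noise_var, 0.0)

# Monte Carlo: draw response and noise samples, form noisy data, and
# correlate the simulated response with the simulated noisy data.
ceilings = np.empty(1000)
for k in range(1000):
    sim_resp = np.sqrt(resp_var) * rng.standard_normal(n_time)
    sim_data = sim_resp + np.sqrt(noise_var) * rng.standard_normal(n_time)
    ceilings[k] = np.corrcoef(sim_resp, sim_data)[0, 1]
noise_ceiling_mean, noise_ceiling_sd = ceilings.mean(), ceilings.std()
```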
Mapping the Cortical Hierarchy for Spatiotemporal Processing
We also used the RNN-based encoding models to characterize the functional properties of each voxel, by summarizing the fully-observable properties of the RNN units that were most predictive of that voxel. As mentioned, each voxel was associated with a subset of RNN units Iv. In this subset, we calculated the percentage of the units belonging to each of the four layers (indexed by 1 through 4) in the RNN, multiplied the layer-wise percentage by the corresponding layer index, and summed the result across all layers to yield a number (between 1 and 4). This number was assigned to the given voxel v, indicating this voxel’s putative “level” in the visual hierarchy. Mapping the voxel-wise level revealed the hierarchical cortical organization for spatiotemporal visual processing.
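A minimal sketch of this layer-index computation, with hypothetical unit-to-layer assignments, is given below:

```python
import numpy as np

# layer_of_unit[i] gives the RNN layer (1-4) of unit i; I_v indexes the
# subset of units contributing to voxel v (both are toy placeholders).
rng = np.random.default_rng(5)
layer_of_unit = rng.integers(1, 5, size=1000)
I_v = rng.choice(1000, size=80, replace=False)

layers = np.arange(1, 5)
fractions = np.array([(layer_of_unit[I_v] == l).mean() for l in layers])
level_index = (fractions * layers).sum()   # the voxel's putative level, in [1, 4]
```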
Estimating Temporal Receptive Windows
We also quantified the temporal receptive window (TRW) at each voxel v by summarizing the “temporal dependency” of its contributing units I_v in the RNN. For each unit i ∈ I_v, its forget gate, denoted as f_t^i, controlled the memory storage vs. retrieval at each moment t. For simplicity, let us define a “remember” gate, r_t^i = 1 − f_t^i, to act oppositely to the forget gate. From Eq. (1), the current state (or unit activity) h_t^i was expressed as a function of the current and past inputs {x_{t−τ} | 0 ≤ τ ≤ t−1}.
h_t^i = ( ∏_{τ=1}^{t} r_τ^i ) h_0^i + Σ_{τ=0}^{t−1} θ_i(t, τ) φ_i(x_{t−τ})        (9)
where θ_i(t, τ) = f_{t−τ}^i ∏_{s=0}^{τ−1} r_{t−s}^i (with the empty product for τ = 0 equal to 1). In Eq. (9), the first term was zero given the initial state h_0^i = 0. The second term was the result of applying a time-variant filter to the time series of the spatial representations {φ_i(x_t)}_{∀t} extracted by the CNN from every input frame {x_t}_{∀t}. In this filter, each element θ_i(t, τ) reflected the effect of the past visual input x_{t−τ} (with an offset τ) on the current state h_t^i. As it varied in time, we averaged the filter θ_i(t) across time, yielding θ̄_i to represent the average temporal dependency of each unit i.
From the observable temporal dependency of every unit, we derived the temporal dependency of each voxel by using the linear unit-to-voxel relationships established in the encoding model. For each voxel v, the average temporal dependency was expressed as a filter θ̄_v, which was derived as the weighted average of the filters associated with its contributing RNN units {θ̄_i | i ∈ I_v}, as in Eq. (10).
θ̄_v = ( Σ_{i∈I_v} |β̂_v(i)| θ̄_i ) / ( Σ_{i∈I_v} |β̂_v(i)| )        (10)
The elements θ̄_v(τ) of θ̄_v delineated the dependency of the current response at voxel v on the past visual input with an offset τ prior to the current time. The accumulation of temporal information was measured as the sum of θ̄_v(τ) across different offsets τ within a given time window. The window size that accounted for 95% of the accumulative effect integrated over an infinite past period was taken as the TRW for voxel v. At the level of ROIs, the TRW was averaged across voxels within each pre-defined ROI. The difference in TRW between different ROIs was evaluated using two-sample t-tests (p < 0.01).
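The sketch below illustrates this procedure for toy forget-gate time series: it builds each unit's average temporal filter by unrolling Eq. (9), combines the filters across contributing units as in Eq. (10) (here weighted by |β̂_v|, an assumption), and reads off the window covering 95% of the cumulative dependency.

```python
import numpy as np

def mean_temporal_filter(f, max_lag=60):
    """Average temporal filter theta_bar(tau) of one unit (from Eq. 9).

    f: time series of the unit's forget gate (values in [0, 1]).
    theta(t, tau) = f[t - tau] * prod_{s < tau} (1 - f[t - s]), averaged over t.
    """
    T = len(f)
    r = 1.0 - f                                  # the "remember" gate
    theta = np.zeros((T, max_lag))
    for t in range(max_lag, T):
        keep = 1.0
        for tau in range(max_lag):
            theta[t, tau] = keep * f[t - tau]
            keep *= r[t - tau]
    return theta[max_lag:].mean(axis=0)

def trw(theta_bar, fraction=0.95):
    """Window size covering the given fraction of the cumulative dependency."""
    c = np.cumsum(theta_bar) / theta_bar.sum()
    return int(np.searchsorted(c, fraction)) + 1

# Toy example: two contributing units with different gate dynamics.
rng = np.random.default_rng(6)
f_units = [rng.beta(8, 2, 600),      # "fast" unit (high forget gate on average)
           rng.beta(2, 8, 600)]      # "slow" unit (low forget gate on average)
beta = np.array([0.9, 0.4])          # |beta_v| weights for the two units

filters = np.array([mean_temporal_filter(f) for f in f_units])
theta_v = (np.abs(beta)[:, None] * filters).sum(0) / np.abs(beta).sum()  # Eq. (10)
print('TRW (in time steps):', trw(theta_v))
```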
Spectral Analysis of Forget Gate Dynamics
We also characterized the temporal fluctuation of the forget gate at each unit in the RNN. The forget gate behaved as a switch controlling, moment by moment, how information was stored into vs. retrieved from process memory. As such, the forget-gate fluctuation reflected the dynamics of process memory in the RNN given natural video inputs.
To characterize the forget-gate dynamics, its power spectral density (PSD) was evaluated. The PSD followed a power-law distribution that was fitted with a descending line in the double-logarithmic scale. The slope of this line, or the power-law exponent (PLE) (Miller et al., 2009; Wen and Liu, 2016), characterized the balance between slow (low-frequency) and fast (high-frequency) dynamics. A higher PLE implied that slow dynamics dominated fast dynamics; a lower PLE implied the opposite. After the PLE was evaluated for each unit, we derived the PLE for each voxel v as a weighted average of the PLE of every unit i that contributed to this voxel (i ∈ Iv), in a similar way as expressed in Eq. (10).
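A minimal sketch of the PLE estimation (a Welch PSD estimate plus a line fit in double-logarithmic coordinates) is given below, using toy gate time series sampled at 5 Hz (the video frame rate); the specific PSD estimator is an assumption.

```python
import numpy as np
from scipy.signal import welch

def power_law_exponent(x, fs=5.0):
    """Fit PSD ~ 1/f^alpha in log-log space and return the exponent alpha."""
    freqs, psd = welch(x, fs=fs, nperseg=256)
    keep = freqs > 0                         # exclude the DC component
    slope, _ = np.polyfit(np.log10(freqs[keep]), np.log10(psd[keep]), 1)
    return -slope                            # PLE: negative of the log-log slope

# Toy forget-gate time series: faster (white) vs. slower (pink-like) dynamics.
rng = np.random.default_rng(7)
fast = rng.standard_normal(3000)
slow = np.cumsum(fast) * 0.05 + rng.standard_normal(3000) * 0.1  # toy slow signal

print('PLE (fast dynamics):', power_law_exponent(fast))
print('PLE (slow dynamics):', power_law_exponent(slow))
```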
RESULTS
RNN learned video representations for action recognition
We used a recurrent neural network (RNN) to model and predict cortical fMRI responses to natural movie stimuli. This model extended a pre-trained CNN (VGG16) (Simonyan and Zisserman, 2014a) by adding recurrent connections to different layers in the CNN (Fig. 1). While fixing the CNN, the weights of the recurrent connections were optimized by supervised learning with >13,000 labeled videos from 101 action categories (Soomro et al., 2012). After training, the RNN was able to categorize independent test videos with a 76.7% top-1 accuracy. This accuracy was much higher than the 65.09% accuracy obtained with the CNN, and close to the 78.3% accuracy obtained with the benchmark RNN model (Ballas et al., 2015).
Unlike the CNN, the RNN explicitly embodied a network architecture to learn hierarchically organized video representations for action recognition. When taking isolated images as the input, the RNN behaved as a feedforward CNN for image categorization. In other words, the addition of recurrent connections enabled the RNN to recognize actions in videos, without losing the already learned ability for recognizing objects in images.
RNN better predicted cortical responses to natural movies
Beyond its improved recognition performance, the RNN learned to utilize the temporal relationships between video frames, whereas the CNN treated individual frames independently. We asked whether the RNN constituted a better model of the visual cortex than the CNN by evaluating and comparing how well the two models predicted cortical fMRI responses to natural movie stimuli. The prediction was based on voxel-wise linear regression models, through which the representations of the movie stimuli, as extracted by either the RNN or the CNN, were projected onto each voxel’s response to the stimuli. These regression models were trained and tested with different sets of video stimuli (12.4 or 2.4 hours for training, 40 minutes for testing) to ensure unbiased model evaluation and comparison. Both the RNN and the CNN explained significant variance of the movie-evoked responses over widespread cortical areas (Fig. 2A & 2B). The RNN consistently performed better than the CNN, showing significantly (p<0.01, paired t-test) higher prediction accuracy for nearly all visual areas (Fig. 2D), especially for cortical locations along the dorsal visual stream relative to the ventral stream (p<0.01, two-sample t-test) (Fig. 2C). The response predictability given the RNN was about half of the “noise ceiling” – the upper limit by which the measured response was predictable given the presence of any ongoing “noise” or activity unrelated to the stimuli (Fig. 2D). This finding was consistently observed for each of the three subjects (Fig. 3).
Figure 2. Prediction accuracies of the cortical responses to novel movie stimuli.
A) Performance of the CNN-based encoding model, averaged across testing movie sessions and subjects. B) Performance of the RNN-based encoding model, averaged across testing movie sessions and subjects. C) Significant difference in performance between the RNN and the CNN (p<0.01). The difference values were computed by subtracting A) from B). D) Comparison of performance at different ROIs with noise ceilings. The accuracy at each ROI is the voxel mean within the region, where the red bars indicate the standard error of accuracies across voxels. The gray blocks indicate the lower and upper bounds of the noise ceilings, and the gray bars indicate the mean and standard deviation of the noise ceilings at each ROI.
Figure 3. Prediction accuracies of the cortical responses to novel movie stimuli for individual subjects.
A) Performance of the CNN-based encoding model, averaged across testing movie sessions. B) Performance of the RNN-based encoding model, averaged across testing movie sessions.
RNN revealed a gradient in temporal receptive windows (TRWs)
Prior studies have provided empirical evidence that visual areas are hierarchically organized to integrate information not only in space (Kay et al., 2013), but also in time (Hasson et al., 2008). Units in the RNN learned to integrate information over time through the unit-specific “forget gate”, which controlled how past information shaped processing at the present time. Through the linear model that related RNN units to each voxel, the RNN’s temporal “gating” behaviors were passed from units to voxels in the brain. As such, this model allowed us to characterize the TRWs, in which past information was carried over and integrated over time to affect and explain the current response at each specific voxel or region.
Fig. 4A shows, for each location, the accumulative effect integrated over a varying period (or window) prior to the current moment. On average, the response at V1 reflected integrated effects over the shortest period, suggesting the shortest TRW at V1. Cortical areas farther along the ventral or dorsal stream integrated information over progressively longer TRWs (Fig. 4A). Mapping the voxel-wise TRW showed a spatial gradient aligned with the visual streams, suggesting a hierarchy of temporal processing in the visual cortex (Fig. 4B). At the ROI level, the TRWs were significantly shorter for early visual areas than for higher-order ventral or dorsal areas, and the dorsal areas tended to have longer TRWs than the ventral areas (Fig. 4C). While Fig. 4 shows the results for Subject 1, similar results were also observed in the other subjects (Fig. S1 and Fig. S2). We interpret the TRW as a measure of the average capacity of process memory at each cortical location involved in visual processing.
Figure 4. Model-estimated TRWs in the visual cortex of Subject 1.
A) The accumulation of information at different ROIs along ventral and dorsal streams. Window size represents the period to the past, and temporal integration indicates the relative amount of accumulated information. B) The cortical map of TRWs estimated by the RNN. The color bar indicates the window sizes at individual voxels. C) Average TRWs at individual ROIs. The blue bars represent the early visual cortex, the green bars the ventral areas, and the red bars the dorsal areas. The black error bars indicate the standard errors across voxels.
RNN revealed the slow vs. fast dynamics of process memory
In the RNN, the forget gate varied from moment to moment, indicating how the past vs. current information was mixed together to determine the representation at each moment. Given the testing movie stimuli, the dynamics of the forget gate was scale free, showing a power-law relationship in the frequency domain. The power-law exponent (PLE) reported on the balance between slow and fast dynamics: a higher exponent indicated a tendency for slow dynamics, and a lower exponent indicated a tendency for fast dynamics.
After projecting the PLEs from units to voxels, we mapped the distribution of the voxel-wise PLE to characterize the dynamics of process memory (Hasson et al., 2015) at each cortical location. As shown in Fig. 5, the PLE was lower in early visual areas, but became increasingly larger along the downstream pathways in higher-order visual areas. This trend was similar to the gradient in TRWs (Fig. 4B), where the TRWs were shorter in early visual areas and longer in higher-order visual areas. In general, lower PLEs were associated with areas with shorter TRWs; higher PLEs were associated with areas with longer TRWs.
Figure 5. Model-estimated memory dynamics in the visual cortex.
Consistent across subjects, lower PLEs are associated with early visual areas, and higher PLEs are associated with later stages of visual processing.
We further evaluated the correlation (across voxels) between the PLE and the improvement in encoding performance given the RNN relative to the CNN. The correlation was marginally significant (r=0.16±0.04, p=0.04), suggesting a weak tendency for the RNN to better explain cortical responses at voxels with relatively slower dynamics.
RNN revealed the cortical hierarchy of spatiotemporal processing
CNNs have revealed the hierarchical organization of spatial processing in the visual cortex (Güçlü and van Gerven, 2015a; Wen et al., 2017a; Eickenberg et al., 2017; Horikawa and Kamitani, 2017). By using the RNN as a network model for spatiotemporal processing, we further mapped the hierarchical cortical organization of spatiotemporal processing. To do so, every voxel whose response was predictable by the RNN was assigned an index ranging continuously between 1 and 4. This index reported the “level” at which a voxel was involved in the visual hierarchy: a lower index implied an earlier stage of processing; a higher index implied a later stage of processing. The topography of the voxel-wise level index showed a cortical hierarchy (Fig. 6). Locations from striate to extra-striate areas were progressively involved in early to late stages of processing the information in both space and time.
Figure 6. Model-estimated hierarchical organization of spatiotemporal processing.
Consistent across subjects, lower layer indices are assigned to early visual areas, and higher layer indices are assigned to later stages of visual processing. The color bar indicates the range of layer assignment, from layer 1 to 4.
DISCUSSION
Here, we designed and trained a recurrent neural network (RNN) to learn video representations for action recognition and to predict cortical responses to natural movies. This RNN extended a pre-trained CNN by adding layer-wise recurrent connections to allow visual information to be remembered and accumulated over time. In line with the hypothesis of process memory (Hasson et al., 2015), such recurrent connections formed a hierarchical and distributed model of memory as an integral part of the network for processing dynamic natural visual input. Compared to the CNN, the RNN supported both image and action recognition, and better predicted cortical responses to natural movie stimuli in all visual areas, especially those along the dorsal stream. More importantly, the RNN provided a fully observable computational model to characterize and map temporal receptive windows, the dynamics of process memory, and a cortical representational hierarchy for dynamic natural vision.
A network model of process memory
Our work was in part inspired by the notion of “process memory” (Hasson et al., 2015). In this notion, memory is a continuous and distributed process as an integral part of information processing, as opposed to an encapsulated functional module separate from the neural circuits that process sensory information. Process memory provides a mechanism for the cortex to process the temporal information in natural stimuli, in a similarly hierarchical way as cortical processing of spatial information (Hasson et al., 2015). As explored in this study, the RNN uses an explicit model of process memory to account for dynamic interactions between incoming stimuli and the internal states of the neural network, or the state-dependent computation (Buonomano and Maass, 2009). In the RNN, the “forget gate” controls, separately for each unit in the network, how much its next state depends on the incoming stimuli vs. its current state. As such, the forget gate behaves as a switch of process memory to control how much new information should be stored into memory and how much history information should be retrieved from memory. This switch varies moment to moment, allowing memory storage and retrieval to occur simultaneously and continuously.
As demonstrated in this study, this model of process memory could be trained, with supervised learning, such that the RNN classified videos into action categories with a much higher accuracy than the CNN, which lacks any mechanism for temporal processing. This suggests that integrating process memory into a network for spatial processing indeed makes the network capable of spatiotemporal processing, as implied in previous theoretical work (Buonomano and Maass, 2009).
From theoretical modeling to empirical evidence of process memory
A unique contribution of this study is that computational modeling of process memory is able to explain previous empirical evidence for process memory. One of the strongest pieces of evidence for process memory is that the cortex exhibits a topography of temporal receptive windows (Hasson et al., 2008; Honey et al., 2012), which may be interpreted as the voxel-wise capacity of process memory. To probe the TRW, one experimental approach is to scramble the temporal structure of natural stimuli at multiple timescales and measure the resulting effects on cortical responses (Hasson et al., 2008). The TRW measured in this way increases orderly from early sensory areas to higher-order perceptual or cognitive areas (Hasson et al., 2015), suggesting a hierarchical organization of temporal processing. With this approach, the brain is viewed as a “black box” and is studied by examining its output given controlled perturbations to its input.
In this study, we have reproduced the hierarchically organized TRWs by using a model-driven approach. The RNN attempts to model the inner workings of the visual cortex as a computable system, such that the system’s output can be computed from its input. If the model uses the same computational and organizational principles as the brain itself, the model’s output should match the brain’s response given the same input (Wu et al., 2006; Naselaris et al., 2011). By “matching”, we do not mean that the unit activity in the model should match the voxel response in the brain in one-to-one correspondence, but only up to a linear transform (Yamins and DiCarlo, 2016), because it is unrealistic to model the brain exactly. This approach allows computational models to be tested against experimental findings. The fact that the model of process memory explains the topography of TRWs (i.e. the hallmark evidence for process memory) lends synergistic support to process memory as a fundamental principle for spatiotemporal processing of natural visual stimuli.
RNN extends CNN as both a brain model and an AI
Several recent studies explored deep-learning models as predictive models of cortical responses during natural vision (Yamins et al., 2014; Khaligh-Razavi et al., 2014; Güçlü and van Gerven, 2015a, b; Wen et al., 2017a, 2017b, 2017c; Cichy et al., 2016; Eickenberg et al., 2017; Horikawa and Kamitani, 2017). Most of the prior studies used CNNs that extracted spatial features to support image recognition, and demonstrated the CNN to be a good model of the feedforward process along the ventral visual stream (Yamins et al., 2014; Khaligh-Razavi et al., 2014; Güçlü and van Gerven, 2015a; Eickenberg et al., 2017). In our recent study (Wen et al., 2017a), the CNN was further found to partially explain dorsal-stream activity in humans watching natural movies; however, the predictive power of the CNN was lower in the dorsal stream than in the ventral stream. Indeed, the dorsal stream is known for its functional roles in temporal processing and action recognition in vision (Goodale and Milner, 1992; Rizzolatti and Matelli, 2003; Shmuelof and Zohary, 2005). It is thus expected that the limited ability of the CNN to explain dorsal-stream activity is due to its lack of any mechanism for temporal processing.
Extending from the CNN, the RNN established in this study offered a network mechanism for temporal processing and improved the performance in action recognition. Along with this improvement toward humans’ perceptual ability, the RNN also better explained human brain activity than did the CNN (Fig. 2). The improvement was more apparent in areas along the dorsal stream than in those along the ventral stream (Fig. 2). It is worth noting that when the input is an image rather than a video, the RNN behaves as the CNN to support image classification. In other words, the RNN extends the CNN to learn a new ability (i.e. action recognition) without losing the already learned ability (i.e. image recognition). On the other hand, the RNN, as a model of the visual cortex, improves its ability to predict brain activity not only in areas where the CNN falls short (i.e. the dorsal stream), but also in areas where the CNN excels (i.e. the ventral stream). As shown in this study, the RNN better explained the dorsal stream without losing the already established ability to explain the ventral stream (Fig. 2).
This brings us to a perspective on developing brain models or brain-inspired AI systems. As humans continuously learn from experience to support different intelligent behaviors, it is desirable for an AI model to continuously learn to expand its capabilities while keeping existing ones. When the model is also taken as a model of the brain, it should become increasingly more predictive of brain responses in new areas, while retaining its predictive power in areas where the model already predicts well. This perspective is arguably valuable for designing a brain-inspired system that learns continuously, as the brain does.
Our finding that the RNN outperformed the CNN in explaining cortical responses most notably in the dorsal stream might also be due to the fact that the RNN was trained for action recognition. In fact, action recognition is commonly associated with dorsal visual areas, whereas object recognition is associated with ventral visual areas (Yoon et al., 2012). As a side exploration in this study, we also used a meta-analysis tool (neurosynth.org) to map the cortical activations associated with visual action-related tasks; these activations were found primarily in the supramarginal gyrus, pre-/post-central sulcus, intraparietal sulcus, superior parietal gyrus, and inferior frontal gyrus. Such areas overlapped with those where we found significantly greater encoding performance with the RNN than with the CNN. However, this overlap should not simply be taken as evidence that the better model prediction is due to the goal of action recognition, rather than the model’s memory mechanism. The memory mechanism supports action recognition, and the action-recognition task allows the mechanism to be parameterized through model training. As such, the goal and the mechanism are tightly interconnected aspects of the model. Further insights await future studies.
Comparison with related prior work
Other than an RNN, a 3-dimensional (3-D) CNN may also learn spatiotemporal features for action recognition from videos (Tran et al., 2014). A 3-D CNN shares the same computational principles as a 2-D CNN, except that the input to the former is a time series of video frames with a specific duration, whereas the input to the latter is a single video frame or image. Previously, the 3-D CNN was shown to explain cortical fMRI responses to natural movie stimuli (Güçlü and van Gerven, 2015b). However, it is unlikely that the brain works in a similar way as a 3-D CNN. The brain processes visual information continuously delivered from 2-D retinal input, rather than processing time blocks of 3-D visual input as required by a 3-D CNN. Although it is a valid AI model, the 3-D CNN is not appropriate for modeling or understanding the brain’s mechanism of dynamic natural vision.
It is worth noting the fundamental difference between the RNN model in this study and that in a recently published study (Güçlü and van Gerven, 2017). Here, we used the RNN as the feature model, or the model of the visual cortex, whereas Güçlü and van Gerven used an RNN as the response model in an attempt to better describe the complex relationships between the CNN and the brain. Although a complex response model is potentially useful, it defeats our purpose of seeking a computational model that matches the visual cortex up to a linear transform. It has been our intention to find a model that shares similar computational and organizational principles with the brain. Toward this goal, the response model needs to be as simple as possible, independent of the visual input, and based on a canonical or independently defined HRF.
Future Directions
The focus of this study is on vision. However, the RNN is expected to be useful, or even more useful, for modeling other perceptual or cognitive systems beyond vision. RNNs have been successful in computer vision (Donahue et al., 2015), natural language processing (Hinton et al., 2012; Mikolov et al., 2010), attention (Mnih et al., 2014; Xu et al., 2015; Sharma et al., 2015), memory (Graves et al., 2014), and planning (Zaremba and Sutskever, 2015). It is conceivable that such RNNs would provide a good starting point for modeling the corresponding neural systems, to facilitate the understanding of the network basis of complex perceptual or cognitive functions.
The RNN offers a computational account of temporal processing. If the brain performs similar computations, how are they implemented? The biological implementation of recurrent processing may be based on lateral or feedback connections (Lamme et al., 1998; Kafaligonul et al., 2015). The latter is of particular interest, since the brain has abundant feedback connections that exert top-down control over feedforward processes (Itti et al., 1998; de Fockert et al., 2001). Feedback connections were not taken into account in this study, but they may be incorporated into future models by using such brain principles as predictive coding (Rao and Ballard, 1999) or the free-energy principle (Friston, 2010). Recent efforts along this line (Lotter et al., 2016; Canziani and Culurciello, 2017) are promising and merit further investigation.
Supplementary Material
Figure S1. Model-estimated TRWs in the visual cortex of a second subject. A) The accumulation of information at different ROIs along ventral and dorsal streams. Window size represents the period to the past, and temporal integration indicates the relative amount of accumulated information. B) The cortical map of TRWs estimated by the RNN. The color bar indicates the window sizes at individual voxels. C) Average TRWs at individual ROIs. The blue bars represent the early visual cortex, the green bars the ventral areas, and the red bars the dorsal areas.
Figure S2. Model-estimated TRWs in the visual cortex of a third subject. A) The accumulation of information at different ROIs along ventral and dorsal streams. Window size represents the period to the past, and temporal integration indicates the relative amount of accumulated information. B) The cortical map of TRWs estimated by the RNN. The color bar indicates the window sizes at individual voxels. C) Average TRWs at individual ROIs. The blue bars represent the early visual cortex, the green bars the ventral areas, and the red bars the dorsal areas.
Acknowledgments
This work was supported in part by NIH R01MH104402. The authors would like to acknowledge the input from Dr. Eugenio Culurciello in discussions of deep neural networks. The authors have no conflicts of interest.
Appendix: the full list of action categories in UCF101 dataset
Apply Eye Makeup, Apply Lipstick, Archery, Baby Crawling, Balance Beam, Band Marching, Baseball Pitch, Basketball Shooting, Basketball Dunk, Bench Press, Biking, Billiards Shot, Blow Dry Hair, Blowing Candles, Body Weight Squats, Bowling, Boxing Punching Bag, Boxing Speed Bag, Breaststroke, Brushing Teeth, Clean and Jerk, Cliff Diving, Cricket Bowling, Cricket Shot, Cutting In Kitchen, Diving, Drumming, Fencing, Field Hockey Penalty, Floor Gymnastics, Frisbee Catch, Front Crawl, Golf Swing, Haircut, Hammer Throw, Hammering, Handstand Pushups, Handstand Walking, Head Massage, High Jump, Horse Race, Horse Riding, Hula Hoop, Ice Dancing, Javelin Throw, Juggling Balls, Jump Rope, Jumping Jack, Kayaking, Knitting, Long Jump, Lunges, Military Parade, Mixing Batter, Mopping Floor, Nun chucks, Parallel Bars, Pizza Tossing, Playing Guitar, Playing Piano, Playing Tabla, Playing Violin, Playing Cello, Playing Daf, Playing Dhol, Playing Flute, Playing Sitar, Pole Vault, Pommel Horse, Pull Ups, Punch, Push Ups, Rafting, Rock Climbing Indoor, Rope Climbing, Rowing, Salsa Spins, Shaving Beard, Shotput, Skate Boarding, Skiing, Skijet, Sky Diving, Soccer Juggling, Soccer Penalty, Still Rings, Sumo Wrestling, Surfing, Swing, Table Tennis Shot, Tai Chi, Tennis Swing, Throw Discus, Trampoline Jumping, Typing, Uneven Bars, Volleyball Spiking, Walking with a dog, Wall Pushups, Writing On Board, Yo Yo
References
- Adolf D, Weston S, Baecke S, Luchtmann M, Bernarding J, Kropf S. Increasing the reliability of data analysis of functional magnetic resonance imaging by applying a new blockwise permutation method. Frontiers in Neuroinformatics. 2014;8. doi: 10.3389/fninf.2014.00072.
- Ballas N, Yao L, Pal C, Courville A. Delving deeper into convolutional networks for learning video representations. 2015. arXiv preprint arXiv:1511.06432.
- Boureau Y-L, Ponce J, LeCun Y. A theoretical analysis of feature pooling in visual recognition. Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010.
- Buonomano DV, Maass W. State-dependent computations: spatiotemporal processing in cortical networks. Nature Reviews Neuroscience. 2009;10(2):113. doi: 10.1038/nrn2558.
- Buxton RB, Uludağ K, Dubowitz DJ, Liu TT. Modeling the hemodynamic response to brain activation. NeuroImage. 2004;23:S220–S233. doi: 10.1016/j.neuroimage.2004.07.013.
- Canziani A, Culurciello E. Visual attention with deep neural networks. Information Sciences and Systems (CISS), 2015 49th Annual Conference on. 2015.
- Canziani A, Culurciello E. CortexNet: a generic network family for robust visual temporal representations. 2017. arXiv preprint arXiv:1706.02735.
- Cichy RM, Khosla A, Pantazis D, Torralba A, Oliva A. Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports. 2016;6:27755. doi: 10.1038/srep27755.
- de Fockert JW, Rees G, Frith CD, Lavie N. The role of working memory in visual selective attention. Science. 2001;291(5509):1803–1806. doi: 10.1126/science.1056496.
- Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T. Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
- Eickenberg M, Gramfort A, Varoquaux G, Thirion B. Seeing it all: Convolutional network layers map the function of the human visual system. NeuroImage. 2017;152:184–194. doi: 10.1016/j.neuroimage.2016.10.001.
- Fragkiadaki K, Levine S, Felsen P, Malik J. Recurrent network models for human dynamics. Proceedings of the IEEE International Conference on Computer Vision. 2015.
- Friston K. The free-energy principle: a unified brain theory? Nature Reviews Neuroscience. 2010;11(2):127–138. doi: 10.1038/nrn2787.
- Glasser MF, Coalson TS, Robinson EC, Hacker CD, Harwell J, Yacoub E, … Jenkinson M. A multi-modal parcellation of human cerebral cortex. Nature. 2016;536(7615):171–178. doi: 10.1038/nature18933.
- Glasser MF, Sotiropoulos SN, Wilson JA, Coalson TS, Fischl B, Andersson JL, … Polimeni JR. The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage. 2013;80:105–124. doi: 10.1016/j.neuroimage.2013.04.127.
- Goodale MA, Milner AD. Separate visual pathways for perception and action. Trends in Neurosciences. 1992;15(1):20–25. doi: 10.1016/0166-2236(92)90344-8.
- Goodfellow I, Bengio Y, Courville A. Deep Learning. MIT Press; 2016.
- Graves A, Wayne G, Danihelka I. Neural Turing machines. 2014. arXiv preprint arXiv:1410.5401.
- Greff K, Srivastava RK, Koutník J, Steunebrink BR, Schmidhuber J. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems. 2016. doi: 10.1109/TNNLS.2016.2582924.
- Güçlü U, van Gerven MA. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience. 2015a;35(27):10005–10014. doi: 10.1523/JNEUROSCI.5023-14.2015.
- Güçlü U, van Gerven MA. Increasingly complex representations of natural movies across the dorsal stream are shared between subjects. NeuroImage. 2015b. doi: 10.1016/j.neuroimage.2015.12.036.
- Güçlü U, van Gerven MA. Modeling the dynamics of human brain activity with recurrent neural networks. Frontiers in Computational Neuroscience. 2017;11. doi: 10.3389/fncom.2017.00007.
- Hasson U, Chen J, Honey CJ. Hierarchical process memory: memory as an integral component of information processing. Trends in Cognitive Sciences. 2015;19(6):304–313. doi: 10.1016/j.tics.2015.04.006.
- Hasson U, Yang E, Vallines I, Heeger DJ, Rubin N. A hierarchy of temporal receptive windows in human cortex. Journal of Neuroscience. 2008;28(10):2539–2550. doi: 10.1523/JNEUROSCI.5487-07.2008.
- He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision. 2015.
- Hinton G, Deng L, Yu D, Dahl GE, Mohamed A-r, Jaitly N, … Sainath TN. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine. 2012;29(6):82–97.
- Honey CJ, Thesen T, Donner TH, Silbert LJ, Carlson CE, Devinsky O, … Hasson U. Slow cortical dynamics and the accumulation of information over long timescales. Neuron. 2012;76(2):423–434. doi: 10.1016/j.neuron.2012.08.011.
- Horikawa T, Kamitani Y. Generic decoding of seen and imagined objects using hierarchical visual features. Nature Communications. 2017;8. doi: 10.1038/ncomms15037.
- Hubel DH, Wiesel TN. Receptive fields and functional architecture of monkey striate cortex. The Journal of Physiology. 1968;195(1):215–243. doi: 10.1113/jphysiol.1968.sp008455.
- Itti L, Koch C, Niebur E. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1998;20(11):1254–1259.
- Jozefowicz R, Zaremba W, Sutskever I. An empirical exploration of recurrent network architectures. Proceedings of the 32nd International Conference on Machine Learning (ICML-15). 2015.
- Kafaligonul H, Breitmeyer BG, Öğmen H. Feedforward and feedback processes in vision. Frontiers in Psychology. 2015;6. doi: 10.3389/fpsyg.2015.00279.
- Kay KN, Winawer J, Mezer A, Wandell BA. Compressive spatial summation in human visual cortex. Journal of Neurophysiology. 2013;110(2):481–494. doi: 10.1152/jn.00105.2013.
- Khaligh-Razavi SM, Henriksson L, Kay K, Kriegeskorte N. Fixed versus mixed RSA: Explaining visual representations by fixed and mixed feature sets from shallow and deep computational models. Journal of Mathematical Psychology. 2017;76:184–197. doi: 10.1016/j.jmp.2016.10.007.
- Khaligh-Razavi SM, Kriegeskorte N. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology. 2014;10(11):e1003915. doi: 10.1371/journal.pcbi.1003915.
- Kingma D, Ba J. Adam: A method for stochastic optimization. 2014. arXiv preprint arXiv:1412.6980.
- Kriegeskorte N. Deep neural networks: a new framework for modeling biological vision and brain information processing. Annual Review of Vision Science. 2015;1:417–446. doi: 10.1146/annurev-vision-082114-035447.
- Lamme VA, Super H, Spekreijse H. Feedforward, horizontal, and feedback processing in the visual cortex. Current Opinion in Neurobiology. 1998;8(4):529–535. doi: 10.1016/s0959-4388(98)80042-1.
- LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–444. doi: 10.1038/nature14539.
- Lotter W, Kreiman G, Cox D. Deep predictive coding networks for video prediction and unsupervised learning. 2016. arXiv preprint arXiv:1605.08104.
- Mikolov T, Karafiát M, Burget L, Cernocký J, Khudanpur S. Recurrent neural network based language model. Interspeech. 2010.
- Miller KJ, Sorensen LB, Ojemann JG, Den Nijs M. Power-law scaling in the brain surface electric potential. PLoS Computational Biology. 2009;5(12):e1000609. doi: 10.1371/journal.pcbi.1000609.
- Mnih V, Heess N, Graves A. Recurrent models of visual attention. Paper presented at the Advances in neural information processing systems.2014. [Google Scholar]
- Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, … Ostrovski G. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–533. doi: 10.1038/nature14236. [DOI] [PubMed] [Google Scholar]
- Nair V, Hinton GE. Rectified linear units improve restricted boltzmann machines. Paper presented at the Proceedings of the 27th international conference on machine learning (ICML-10).2010. [Google Scholar]
- Naselaris T, Kay KN, Nishimoto S, Gallant JL. Encoding and decoding in fMRI. Neuroimage. 2011;56(2):400–410. doi: 10.1016/j.neuroimage.2010.07.073. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nili H, Wingfield C, Walther A, Su L, Marslen-Wilson W, Kriegeskorte N. A toolbox for representational similarity analysis. PLoS computational biology. 2014;10(4):e1003553. doi: 10.1371/journal.pcbi.1003553. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pascanu R, Mikolov T, Bengio Y. On the difficulty of training recurrent neural networks. Paper presented at the International Conference on Machine Learning.2013. [Google Scholar]
- Rao RP, Ballard DH. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience. 1999;2(1) doi: 10.1038/4580. [DOI] [PubMed] [Google Scholar]
- Rizzolatti G, Matelli M. Two different streams form the dorsal visual system: anatomy and functions. Experimental brain research. 2003;153(2):146–157. doi: 10.1007/s00221-003-1588-0. [DOI] [PubMed] [Google Scholar]
- Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, … Bernstein M. Imagenet large scale visual recognition challenge. International Journal of Computer Vision. 2015;115(3):211–252. [Google Scholar]
- Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention. 2015 arXiv preprint arXiv:1511.04119. [Google Scholar]
- Shmuelof L, Zohary E. Dissociation between ventral and dorsal fMRI activation during object and action recognition. Neuron. 2005;47(3):457–470. doi: 10.1016/j.neuron.2005.06.034. [DOI] [PubMed] [Google Scholar]
- Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, … Lanctot M. Mastering the game of Go with deep neural networks and tree search. Nature. 2016;529(7587):484–489. doi: 10.1038/nature16961. [DOI] [PubMed] [Google Scholar]
- Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014a arXiv preprint arXiv:1409.1556. [Google Scholar]
- Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems. 2014b:568–576. [Google Scholar]
- Soomro K, Zamir AR, Shah M. UCF101: A dataset of 101 human actions classes from videos in the wild. 2012 arXiv preprint arXiv:1212.0402. [Google Scholar]
- Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3d convolutional networksa. Proceedings of the IEEE international conference on computer vision; 2015. pp. 4489–4497. [Google Scholar]
- Wandell BA, Dumoulin SO, Brewer AA. Visual field maps in human cortex. Neuron. 2007;56(2):366–383. doi: 10.1016/j.neuron.2007.10.012. [DOI] [PubMed] [Google Scholar]
- Wen H, Liu Z. Separating fractal and oscillatory components in the power spectrum of neurophysiological signal. Brain topography. 2016;29(1):13–26. doi: 10.1007/s10548-015-0448-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wen H, Shi J, Zhang Y, Lu K-H, Cao J, Liu Z. Neural Encoding and Decoding with Deep Learning for Dynamic Natural Vision. Cerebral Cortex. 2017a doi: 10.1093/cercor/bhx268. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wen H, Shi J, Chen W, Liu Z. Deep Residual Network Reveals a Nested Hierarchy of Distributed Cortical Representation for Visual Categorization. bioRxiv. 2017b:151142. [Google Scholar]
- Wen H, Shi J, Chen W, Liu Z. Transferring and Generalizing Deep-Learning-based Neural Encoding Models across Subjects. bioRxiv. 2017c:171017. doi: 10.1016/j.neuroimage.2018.04.053. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Werbos PJ. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE. 1990;78(10):1550–1560. [Google Scholar]
- Wu MCK, David SV, Gallant JL. Complete functional characterization of sensory neurons by system identification. Annu Rev Neurosci. 2006;29:477–505. doi: 10.1146/annurev.neuro.29.051605.113024. [DOI] [PubMed] [Google Scholar]
- Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, … Bengio Y. Show, attend and tell: Neural image caption generation with visual attention; Paper presented at the International Conference on Machine Learning.2015. [Google Scholar]
- Yamins DL, DiCarlo JJ. Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience. 2016;19(3):356. doi: 10.1038/nn.4244. [DOI] [PubMed] [Google Scholar]
- Yamins DL, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences. 2014;111(23):8619–8624. doi: 10.1073/pnas.1403112111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yoon EY, Humphreys GW, Kumar S, Rotshtein P. The Neural Selection and Integration of Actions and Objects: An fMRI Study. Journal of Cognitive Neuroscience. 2012;24(11):2268–2279. doi: 10.1162/jocn_a_00256. [DOI] [PubMed] [Google Scholar]
- Zaremba W, Sutskever I. Reinforcement learning neural turing machines. 2015:419. arXiv preprint arXiv:1505.00521. [Google Scholar]
Supplementary Materials
A) The accumulation of information at different ROIs along the ventral and dorsal streams. The window size indicates how far into the past each window extends, and the temporal integration indicates the relative amount of information accumulated over that window. B) The cortical map of TRWs estimated by the RNN. The color bar indicates the window sizes at individual voxels. C) Average TRWs at individual ROIs. Blue bars represent the early visual areas, green bars the ventral areas, and red bars the dorsal areas.