Abstract
Brain decoding based on functional magnetic resonance imaging has recently enabled the identification of visual perception and mental states. However, due to the limitations of sample size and the lack of an effective reconstruction model, accurate reconstruction of natural images is still a major challenge. The current, rapid development of deep learning models provides the possibility of overcoming these obstacles. Here, we propose a deep learning-based framework that includes a latent feature extractor, a latent feature decoder, and a natural image generator, to achieve the accurate reconstruction of natural images from brain activity. The latent feature extractor is used to extract the latent features of natural images. The latent feature decoder predicts the latent features of natural images based on the response signals from the higher visual cortex. The natural image generator is applied to generate reconstructed images from the predicted latent features of natural images and the response signals from the visual cortex. Quantitative and qualitative evaluations were conducted with test images. The results showed that the reconstructed images accurately reproduced the presented images in terms of both high-level semantic category information and low-level pixel information. The framework we propose shows promise for decoding brain activity.
Electronic supplementary material
The online version of this article (10.1007/s12264-020-00613-4) contains supplementary material, which is available to authorized users.
Keywords: Brain decoding, fMRI, Deep learning
Introduction
Brain-reading based on brain activity has made notable achievements in the past decade. Functional magnetic resonance imaging (fMRI) studies have shown that visual features such as orientation, spatial frequency [1], motion direction [2, 3], object category [4–11], perceptual imagination [12], dreams [13], and even memory [14] can be decoded from fMRI activity patterns by classification-based machine-learning methods, which learn the linear or nonlinear mapping between a brain activity pattern and a stimulus category from a training dataset. More precisely, these studies are based on stimulus category identification.
Beyond category decoding, scientists have shown considerable interest and enthusiasm in visual image reconstruction decoding, because reconstruction can provide intuitive and vivid pictures regarding the objects a person is viewing. However, visual image reconstruction is more challenging than classification or identification, especially for complex natural images. Miyawaki et al. established a multiscale reconstruction model based on sparse multinomial logistic regression and reconstructed simple binary contrast images for the first time [15]. Compared with simple contrast images, natural images may include multiple objects and have more complex statistical characteristics, such as depth, color, and texture. Therefore, natural image reconstruction from very weak fMRI signals is extremely difficult. To date, some methods have been developed to decode natural images. Naselaris et al. [16] proposed a Bayesian reconstruction framework of gray images that uses two different encoding models to integrate information from functionally distinct visual areas: a structural model that describes how image information is represented in early visual areas, and a semantic encoding model that describes how image information is represented in the anterior visual areas. Nishimoto et al. [17] introduced motion energy information based on the Gabor wavelet pyramid model and proposed a Bayes decoder by combining estimated encoding models with a sampled natural movie prior. The decoder roughly reconstructed the viewed movies. Cowen et al. [18] proposed applying principal component analysis to identify components (eigenfaces) that efficiently represent face images in a relatively low-dimensional space. Then, a partial least squares regression algorithm was used to map patterns of fMRI activity to individual eigenfaces to reconstruct human faces.
With the rapid development of deep learning technology, some researchers are using its powerful feature extraction and fitting ability to build reconstruction models [19]. Du et al. [20] proposed a deep generative multiview model that includes a deep neural network architecture for visual image generation and a sparse Bayesian linear model for fMRI activity generation. Their model realized a better-resolution reconstruction of simple binary contrast images and handwritten characters. Güclütürk et al. and VanRullen et al. [21, 22] explored face reconstruction from brain activation with deep generative adversarial network (GAN) decoding. For natural image reconstruction, Zhang et al. and Seeliger et al. [23, 24] attempted grayscale natural image reconstruction with a convolutional neural network and GAN models. St-Yves et al. [25] proposed a conditional GAN (C-GAN) to reconstruct natural scenes. Although some advances were made by the above studies, the current methods can only reconstruct the basic outline or provide figures/features similar to the perceived natural stimuli. High-quality and high-resolution reconstruction from brain activity still requires considerable work.
In this paper, we propose a deep learning-based reconstruction framework to achieve high-resolution reconstruction of natural images. Compared with previous reconstruction studies based on GAN models [21, 24, 26], our proposed model enables a better reconstruction of high-resolution natural images based on fMRI activity.
Materials and Methods
Experimental Design and Data Acquisition
Participants
Five volunteers (3 males and 2 females aged 23–27 years) participated in the MRI scans. All participants were neurologically healthy, right-handed, and had normal or corrected-to-normal vision. All participants provided written, informed consent before the experiments, and the protocols were approved by the Institutional Review Board of the Institute of Biophysics, Chinese Academy of Sciences.
Visual Stimulus and Experimental Design
The experiment consisted of two sessions: (1) a bar retinotopic mapping session and (2) a natural image presentation session.
The retinotopic mapping session was used to delineate the borders between visual cortical areas and to identify the retinotopic map on the flattened cortical surface. The session was conducted with the conventional protocol that uses flashing checkerboard bars drifting in different directions. The drifting bars were presented in the center of the display (spatial extent, 20°) with a central fixation cross. These bars were constructed as a 100% contrast checkerboard with eight different motion directions and 22 equidistant positions in each direction. Each run contained 176 stimulation blocks, and each block lasted 2 s at a flicker frequency of 10 Hz. Extra rest periods (12 s) were added at the beginning and end of each run. Each run had a duration of [12 s + 176 blocks × 2 s + 12 s = 6 min 16 s], as shown in Fig. 1A. The bar run was repeated four times.
Fig. 1.
Experimental design. A Bar retinotopic mapping session. The moving bar is presented in the center of the display with a central fixation cross. B Natural image session. Natural images are presented in the center of the display with a central fixation cross. The color of the fixation cross randomly changes from white to red for 1 s to remind the participant to focus by pressing a button.
In the natural image presentation session, each run contained 50 stimulus blocks. Each block included a 2-s flickering natural image (spatial length, 20°; temporal frequency, 5 Hz) followed by a 4–8 s intervening rest period (duration drawn as a random integer; mean, 6 s). A random rest time was adopted to avoid the periodic influence of visual stimuli. Extra rest periods were added at the beginning (12 s) and the end (8 s) of each run. Each run had a duration of [12 s + 50 blocks × (2 s + 6 s) + 8 s = 7 min], as shown in Fig. 1B. In each block, an image was presented on a gray background with a central fixation cross. The color of the fixation cross randomly changed from white to red for 1 s to remind the participant to focus. The participants were asked to press a button if they observed the color change. Fifty-five runs were conducted, and a total of 2750 natural images were presented to each participant. These natural images were from five categories of the ImageNet [27] dataset: "Flower", "Horse", "Building", "Fruit", and "Landscape", with 1103, 999, 995, 1056, and 1044 images, respectively. In each run, 10 images were randomly selected from the five category sets and presented in random order.
MRI Data Acquisition
MRI data were acquired with a 3-T Prisma scanner (Siemens, Erlangen, Germany) at the Institute of Biophysics, Chinese Academy of Sciences, using a 20-channel head coil. An interleaved T2*-weighted gradient-echo echo-planar imaging (EPI) scan was performed to acquire functional images that covered the entire occipital lobe (TR, 1,000 ms; TE, 31.2 ms; flip angle, 50°; FOV, 194 × 194 mm2; voxel size, 1.8 × 1.8 × 1.8 mm3; slice gap, 0 mm; number of slices, 48). T1-weighted magnetization-prepared rapid-acquisition gradient-echo structural images of the whole head were also acquired (TR, 2,300 ms; TE, 3.49 ms; TI, 1,050 ms; flip angle, 8°; field of view, 256 × 256 mm2; voxel size, 1.0 × 1.0 × 1.0 mm3).
Data Preprocessing
Data were preprocessed using SPM8 (http://www.fil.ion.ucl.ac.uk/spm). The first 12 s of each run were discarded to avoid the effects of MRI scanner instability. The remaining fMRI data underwent slice-timing correction and three-dimensional motion correction. The data were then co-registered to a high-resolution anatomical image acquired in the same slices as the EPI images, and subsequently registered to a full-head high-resolution anatomical image [28, 29]. The co-registered data were then re-interpolated to 3 × 3 × 3-mm voxels.
Regions of Interest (ROIs)
An inflated cortical surface was generated based on the T1-weighted anatomical image. SamSrf (https://ndownloader.figshare.com/files/9342553) was used to map the population receptive field. The eccentricity map and the polar map were overlaid on the inflated cortical surface. The borders of retinotopic areas were defined using standard methods [30]. The early visual cortex areas (V1, V2, and V3) and visual cortex (VC) are identified in Fig. 2. The higher VC (HVC) area was defined as the area of VC remaining after removal of V1, V2, and V3. The numbers of voxels in each area are listed in Table 1.
Fig. 2.
Definition of visual cortical areas. A The areas are delineated on the inflated cortical surface for one participant. Note that the ventral and dorsal representations for early visual cortex areas (V1, V2, and V3) were defined separately. B The VC (left panel) and the early visual cortex areas (V1, V2, and V3) (right panel) in the three-dimensional space of the structural brain.
Table 1.
The number of voxels in different visual areas.
| Participant | V1 | V2 | V3 | HVC |
|---|---|---|---|---|
| Participant 1 | 1701 | 1731 | 1660 | 32134 |
| Participant 2 | 1897 | 1958 | 1753 | 34624 |
| Participant 3 | 1952 | 1395 | 1275 | 35724 |
| Participant 4 | 1455 | 1420 | 956 | 30626 |
| Participant 5 | 1695 | 1496 | 1503 | 35207 |
HVC: higher visual cortex.
Methods
Natural Image Reconstruction Framework
Brain responses while participants viewed complex natural images were measured by fMRI. Next, multi-voxel response signals from the HVC and the primary visual cortex (V1) were obtained. Here, the V1 response at the 6th second was extracted, whereas the HVC signals from 2–6 s after the presented image disappeared were used. The multi-time HVC response signal was selected because the latent feature decoder, constructed with an LSTM (long short-term memory) network, extracted the temporal information from the response signal, which achieved better decoding.
The 2750 natural images corresponded to 2750 data samples. We randomly selected 50 of the 2750 samples as the test set, and the remaining 2700 were used as the training set. An F-score [31, 32] feature-selection algorithm was used to calculate the F-value of each voxel in the HVC. To facilitate subsequent calculations, the number of voxels used from each area was unified across participants: early visual areas (V1, V2, and V3) that contained fewer than 2,000 voxels were upsampled to 2,000 voxels by nearest-neighbor interpolation, and 2,000 voxels from each ROI were selected for subsequent analysis. Each data sample contained a natural image (256 × 256 × 3), a natural image label (1 × 5, one-hot), a V1 response vector (1 × 2000), and an HVC response matrix (5 × 2000, 2–6 s response signals).
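As an illustration of this voxel-selection and resampling step, the following is a minimal Python sketch; it uses scikit-learn's ANOVA F-statistic as a stand-in for the F-score criterion of [31, 32], and the array shapes, placeholder data, and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Hypothetical trial-wise data: 2750 trials, HVC responses (random placeholders here),
# and an integer category label (0-4) per trial.
rng = np.random.RandomState(0)
hvc_responses = rng.randn(2750, 30000).astype(np.float32)   # trials x HVC voxels
labels = rng.randint(0, 5, size=2750)                        # five image categories

# Score every HVC voxel with an ANOVA F-value (a stand-in for the F-score in [31, 32])
# and keep the 2,000 voxels with the largest values.
f_values, _ = f_classif(hvc_responses, labels)
selected = np.argsort(f_values)[-2000:]
hvc_2000 = hvc_responses[:, selected]                        # trials x 2000

# Early visual areas with fewer than 2,000 voxels are resampled to 2,000 voxels
# by nearest-neighbor interpolation over the voxel index.
v1 = rng.randn(2750, 1701).astype(np.float32)                # e.g. participant 1, V1
idx = np.round(np.linspace(0, v1.shape[1] - 1, 2000)).astype(int)
v1_2000 = v1[:, idx]                                         # trials x 2000
```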
The proposed natural image reconstruction framework (Fig. 3) consisted of a latent feature extractor, a latent feature decoder, and a natural image generator, built on a convolutional autoencoder (CAE), an LSTM, and a conditional progressively growing GAN (C-PG-GAN), respectively. The three deep networks in the reconstruction framework were trained on the training samples to obtain optimal parameters, and their performance was then tested on the test samples.
Fig. 3.
Natural image reconstruction framework. A Latent feature extractor built with the convolutional autoencoder model, which extracts the latent features of complex natural images in low-dimensional space. B Overview of the natural image reconstruction framework. fMRI activity is measured while the participant views natural images. A latent feature decoder is trained to predict the values of the latent features of the presented images from multi-voxel and multi-time fMRI signals. Then, conditional features are obtained by combining the predicted latent features with the fMRI signals from V1. Finally, the conditional features are input into the natural image generator to reconstruct the natural image.
Latent Feature Extractor
The CAE extracted the latent features of the natural images in low-dimensional space. The CAE model included convolution (conv; kernel size, 3 × 3), rectified linear unit (ReLU), batch normalization, max-pooling, reshape, fully-connected, upsampling, and tanh operations. The CAE network also introduced dropout and regularization mechanisms. The framework and detailed structure of the CAE network are shown in Fig. 4. It contained an encoder and a decoder. In the encoder, stages E1–E4 each applied convolution, ReLU, batch normalization, and max-pooling operations, with dropout and regularization. After these operations (E1–E4), the natural image (size: 256 × 256 × 3) was transformed into a feature tensor (size: 16 × 16 × 1). Next, the feature tensor was reshaped into a vector of size 256. Finally, the latent features of size 250 were obtained through a fully-connected layer with the ReLU activation function. In the decoder, a feature vector of size 256 was obtained by applying a fully-connected layer with the ReLU activation function to the latent features. Next, this vector was reshaped into a tensor of size 16 × 16 × 1. Stages D3–D5 applied upsampling, convolution, ReLU, and batch normalization operations, with dropout and regularization. After these operations (D3–D5), the feature tensor (size: 16 × 16 × 1) was transformed into a feature tensor of size 128 × 128 × 16. Finally, the output image (Img*, size: 256 × 256 × 3) was obtained through upsampling, convolution, and tanh operations. Dropout was used to reduce overfitting.
Fig. 4.
Latent feature extractor built with the CAE model. A Diagram of the structure of the CAE model. The dashed box shows the loss calculation process for high-level semantic category information. B Detailed parameters of the CAE model, which includes an encoder that encodes the image into a latent feature, and a decoder that decodes the latent feature into an input image.
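To make the encoder-decoder structure concrete, the following is a minimal tf.keras sketch of a CAE of this general shape (256 × 256 × 3 input, 250-dimensional latent feature, 256 × 256 × 3 tanh output); the channel widths, dropout rate, and regularization strength are illustrative assumptions rather than the exact parameters listed in Fig. 4.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # conv (3x3) -> ReLU -> batch norm -> dropout -> max-pool, as in encoder stages E1-E4
    x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                      kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
    return layers.MaxPooling2D(2)(x)

def up_block(x, filters):
    # upsampling -> conv -> ReLU -> batch norm -> dropout, as in decoder stages D3-D5
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    return layers.Dropout(0.2)(x)

# Encoder: 256x256x3 image -> 16x16x1 tensor -> 250-dim latent feature
img_in = layers.Input(shape=(256, 256, 3))
x = img_in
for f in (16, 32, 16, 1):           # illustrative channel widths for E1-E4
    x = conv_block(x, f)
x = layers.Reshape((256,))(x)        # 16 * 16 * 1 = 256
latent = layers.Dense(250, activation="relu", name="latent")(x)

# Decoder: 250-dim latent -> 16x16x1 tensor -> 256x256x3 reconstruction
y = layers.Dense(256, activation="relu")(latent)
y = layers.Reshape((16, 16, 1))(y)
for f in (4, 8, 16):                 # illustrative channel widths for D3-D5
    y = up_block(y, f)
y = layers.UpSampling2D(2)(y)
img_out = layers.Conv2D(3, 3, padding="same", activation="tanh")(y)

cae = tf.keras.Model(img_in, img_out, name="cae")
```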
We trained the CAE network by minimizing a loss function of the following general form:

$$L_{\mathrm{CAE}} = \|Img - Img^{*}\|_{2}^{2} + \lambda_{1}\|W\|_{2}^{2} + \lambda_{2}\,\mathrm{CE}(C, C^{*}) + \lambda_{3}\,\mathrm{Var}(R)$$

where Img and Img* represent the input and output images of the CAE model, respectively; C (size: 1 × 5) represents the semantic category label of the image; R (size: 1 × 250) represents the latent features of the image; C* (size: 1 × 5) represents the predicted category labels of the image, obtained by performing a softmax nonlinear transformation after averaging R; Var(R) represents the variance of the latent features; CE denotes the category loss; λ1–λ3 are weighting coefficients; and W represents the set of network weight parameters. The total CAE loss thus included an autoencoder loss, a regularization loss, a category loss, and a variance loss. The regularization loss alleviated overfitting of the model. The category loss made the learned latent features contain the category information of the image. The variance loss limited the amplitude of the learned latent features to a certain range. The Adam optimization algorithm was used to optimize the CAE model, with the default parameter settings of the Adam optimizer (β1 = 0.9, β2 = 0.999, ε = 10⁻⁸) and a learning rate of 0.0001. The CAE network was implemented in TensorFlow. Note that only the training data were used to train the CAE network; the test data did not participate in training.
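A hedged sketch of how the four loss terms might be combined in TensorFlow is given below; the weighting coefficients, the cross-entropy form of the category loss, and the way the category logits are produced from the latent features are assumptions, not the published settings.

```python
import tensorflow as tf

def cae_loss(img, img_rec, labels_onehot, logits, latent, l2_losses,
             lam_cat=1.0, lam_var=0.1):
    """Illustrative combination of the four loss terms (autoencoder, regularization,
    category, variance); lam_cat and lam_var are assumed weights, and `logits` is the
    category prediction derived from the latent features."""
    ae = tf.reduce_mean(tf.square(img - img_rec))                     # pixel reconstruction
    reg = tf.add_n(l2_losses) if l2_losses else tf.constant(0.0)      # weight regularization
    cat = tf.reduce_mean(                                             # softmax category loss
        tf.nn.softmax_cross_entropy_with_logits(labels=labels_onehot, logits=logits))
    var = tf.reduce_mean(tf.math.reduce_variance(latent, axis=1))     # bounds latent amplitude
    return ae + reg + lam_cat * cat + lam_var * var

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)              # learning rate as stated
```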
Latent Feature Decoder
The HVC response signals from 2–6 s were input to the LSTM network. First, the LSTM network mapped the selected 2000 HVC response signals to the 5000 neurons of the first fully-connected layer. Then, the first fully-connected output was input to the two-layer LSTM module (double-LSTM). Finally, the second fully-connected layer mapped the output of the double-LSTM module to a predicted latent feature of size 250. The framework and detailed structure of the LSTM network are shown in Fig. 5. The two fully connected layers in the LSTM network used the ReLU activation function. We trained the LSTM network by minimizing the loss function as follows:
$$L_{\mathrm{LSTM}} = \|R - R^{*}\|_{2}^{2}$$

where R and R* represent the latent features and the predicted latent features of the image, respectively. The Adam optimization algorithm was used to optimize the LSTM network, with the default parameter settings of the Adam optimizer (β1 = 0.9, β2 = 0.999, ε = 10⁻⁸) and a learning rate of 0.001. The LSTM network was implemented in TensorFlow.
Fig. 5.
Latent feature decoder built with the LSTM model. Multi-voxel and multi-time response signals in the HVC were added to the model. The model consists of a double-LSTM module and two fully-connected layers. The output of the model is the predicted latent feature of the natural image.
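The following is a minimal tf.keras sketch of a decoder with this structure (5 × 2000 HVC input, a 5000-unit fully-connected layer, a double-LSTM module, and a 250-dimensional output); the LSTM state size and the per-time-step application of the first fully-connected layer are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Input: 5 time points (2-6 s) x 2000 selected HVC voxels; output: 250-dim latent feature.
hvc_in = layers.Input(shape=(5, 2000))
x = layers.TimeDistributed(layers.Dense(5000, activation="relu"))(hvc_in)  # first FC layer
x = layers.LSTM(512, return_sequences=True)(x)                             # double-LSTM module
x = layers.LSTM(512)(x)                                                    # (state size assumed)
latent_pred = layers.Dense(250, activation="relu")(x)                      # second FC layer

decoder = tf.keras.Model(hvc_in, latent_pred, name="latent_feature_decoder")
decoder.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
```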
Natural Image Generator
The network architecture of the C-PG-GAN contained a synchronized, progressively growing generator and discriminator, as shown in Fig. 6. The PG-GAN model made it possible to produce high-resolution images. The key idea was to grow both the generator and the discriminator progressively, starting from a low resolution and gradually adding new layers, so that the model captured increasingly fine details as training progressed. In this way, the PG-GAN model was trained smoothly to produce high-quality images. The network first generated 8 × 8 images, and layers were then continuously added until the generator finally output 128 × 128 images. The transition from each low-resolution stage to the next high-resolution stage used a progressive-growing fade-in mechanism, formulated as follows:
$$I = (1 - \alpha)\,I_{\mathrm{low}} + \alpha\,I_{\mathrm{high}}$$

where I, I_low, and I_high represent the current generated image, the upsampled low-resolution image, and the current high-resolution image, respectively. The weight α changed from 0 to 1 as the iterations increased.
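A minimal sketch of this fade-in blending is shown below; the tensor shapes are illustrative.

```python
import tensorflow as tf

def fade_in(low_res_up, high_res, alpha):
    # Blend the upsampled low-resolution output with the new high-resolution output;
    # alpha grows from 0 to 1 as training of the new resolution progresses.
    return (1.0 - alpha) * low_res_up + alpha * high_res

low = tf.random.normal([1, 64, 64, 3])                 # previous-resolution output
low_up = tf.keras.layers.UpSampling2D(2)(low)          # 2x2 replication to 128 x 128
high = tf.random.normal([1, 128, 128, 3])              # output of the newly added layers
blended = fade_in(low_up, high, alpha=0.3)
```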
Fig. 6.
Natural image generator built with the C-PG-GAN model.
Notably, we introduced a conditional feature mechanism into the progressively growing GAN model so that similar features would produce similar images. For each resolution, we input the same conditional features into the generator and the discriminator. The conditional feature vectors were formed by merging the 6th-second response signals in V1 with the predicted latent features of the natural images. We randomly scrambled the positions of the elements in the conditional feature vectors before feeding them into the generator and discriminator. The image generator, which comprised a series of operations (upsampling, convolution, ReLU, and full connection), was then trained. Because the conditional vectors included the fMRI signal features and the latent features of natural images for a certain category, the generated images reflected the brain's perception of that category, showing characteristics similar to those of the corresponding stimuli. In this way, a C-PG-GAN was constructed. The network parameters of the full-resolution generator and discriminator are shown in Fig. S1-1. The random vector corresponded to random points in a 256-dimensional space. Leaky ReLU with a leakiness of 0.2 was used in all layers of both networks, except for the last layer, where linear activation was used. The weight and bias parameters were initialized from a standard normal distribution and as constant zero, respectively. The upsampling and downsampling operations in Fig. S1-1 correspond to 2 × 2 element replication and average pooling, respectively. The Wasserstein GAN with gradient penalty (WGAN-GP [33]) loss was calculated, and the Adam optimization algorithm was applied to optimize the model, with the default parameter settings of the Adam optimizer (β1 = 0.9, β2 = 0.999, ε = 10⁻⁸) and a learning rate of 0.001.
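The following sketch illustrates how such a conditional feature vector might be assembled from the 6th-second V1 response and the predicted latent features; the use of a fixed permutation for the scrambling, and the concatenation of the random vector with the conditional features, are assumptions, since the text only states that element positions were randomly scrambled.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical inputs: the 6th-second V1 response (2000 voxels) and the
# 250-dim latent feature predicted by the LSTM decoder.
v1_signal = rng.randn(2000).astype(np.float32)
latent_pred = rng.randn(250).astype(np.float32)

# Merge into a single conditional feature vector.
condition = np.concatenate([v1_signal, latent_pred])   # length 2250

# Scramble element positions; a fixed permutation shared across samples is assumed here.
perm = rng.permutation(condition.size)
condition = condition[perm]

# A 256-dim random vector accompanies the conditional features as generator input
# (how the two are combined is an assumption).
z = rng.randn(256).astype(np.float32)
generator_input = np.concatenate([z, condition])
```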
In summary, the general purpose of reconstruction decoding was to explore the relationship between neural activity and visual stimulus features, and then to establish a reverse mapping model that generates images from human brain activity. Here, three steps were performed. First, the latent features of the images were extracted by the CAE. Then, the latent features were predicted by the LSTM from the fMRI signals in the HVC. Finally, the predicted latent features and the fMRI signals in V1 were combined into a conditional vector and input into the generator and discriminator of the C-PG-GAN model to reconstruct the corresponding images.
Results
Reconstructed Images
After the three deep network models (CAE, LSTM, and C-PG-GAN) in the reconstruction framework had been trained to their optimal parameters, the 50 test samples were input into the trained LSTM and C-PG-GAN models to obtain the reconstructed images. Figure 7 illustrates three reconstructed examples from the five categories (horses, buildings, flowers, fruits, and landscapes) for the five participants. The reconstruction results for all the test samples from the V1 and HVC joint brain responses of the five participants are shown in Fig. S1-2–6. The reconstruction results for all the test samples from the V2, V3, and HVC joint brain responses of the five participants are shown in Fig. S1-7–16. The experimental results showed that most of the reconstructed images captured the outline and the main characteristics of the original stimulus images, especially the semantic category and texture information.
Fig. 7.
Reconstructed samples of the test stimuli from the V1 and HVC joint brain responses of the five participants. The first row shows the test images (horse, building, flower, fruit and landscape) that the participants viewed during fMRI data recording. The lower five rows show the corresponding reconstructed images from the brain responses of the five participants.
To clarify the function of the random vectors and the conditional features in the C-PG-GAN reconstruction model, we compared the reconstructed samples after inputting only random vectors, only conditional features, and both (Explanation 1). The results showed that the role of the random vectors was to make the distribution of the reconstructed image close to that of the natural image. The main function of the conditional feature vectors was to make the reconstructed images with similar conditional features have similar semantic representations. Using both random vectors and conditional features, the reconstructed images showed characteristics similar to the natural images and remained semantically consistent with the stimulus images, achieving the best reconstructed result.
In addition, to verify the effect of the combination of the latent feature extractor and the latent feature decoder in the decoding framework, comparisons of the ablation learning of the three deep network modules are shown and explained in Explanation 2.
Reconstruction Quality Evaluation
To quantify the effect of the reconstruction framework, we calculated the similarity between the reconstructed and presented images using a variety of algorithms: cosine similarity (CSIM), earth mover's distance (EMD [34]), histogram similarity (HSIM [35]), Kullback-Leibler divergence (KLD), mutual information (MI [36]), the Pearson correlation coefficient (PCC), histogram intersection (HI [37]), and the structural similarity index (SSIM). For these measures, larger values of CSIM, HSIM, MI, PCC, HI, and SSIM indicate greater similarity between the reconstructed and presented images, whereas smaller values of EMD and KLD indicate that the distributions of the reconstructed and presented images are closer.
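For reference, the following is a minimal Python sketch of several of these measures for a single pair of grayscale images in [0, 1]; the histogram binning and normalization choices are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr, entropy, wasserstein_distance
from skimage.metrics import structural_similarity

def similarity_scores(rec, ref, bins=64):
    """A few of the listed measures for one image pair (2-D arrays scaled to [0, 1])."""
    r, f = rec.ravel(), ref.ravel()
    csim = np.dot(r, f) / (np.linalg.norm(r) * np.linalg.norm(f) + 1e-12)  # cosine similarity
    pcc, _ = pearsonr(r, f)                                                 # Pearson correlation
    ssim = structural_similarity(rec, ref, data_range=1.0)                  # structural similarity
    h_r, _ = np.histogram(r, bins=bins, range=(0, 1))
    h_f, _ = np.histogram(f, bins=bins, range=(0, 1))
    h_r, h_f = h_r / h_r.sum(), h_f / h_f.sum()
    hi = np.minimum(h_r, h_f).sum()                                         # histogram intersection
    kld = entropy(h_f + 1e-12, h_r + 1e-12)                                 # Kullback-Leibler divergence
    emd = wasserstein_distance(np.arange(bins), np.arange(bins), h_f, h_r)  # earth mover's distance
    return dict(CSIM=csim, PCC=pcc, SSIM=ssim, HI=hi, KLD=kld, EMD=emd)

rec, ref = np.random.rand(128, 128), np.random.rand(128, 128)  # placeholder images
print(similarity_scores(rec, ref))
```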
Table S1 shows the mean and variance of the similarity under the eight measures from the V1 and HVC joint brain responses of the five participants. The similarity results obtained by replacing V1 with V2 or V3 are shown in Tables S2 and S3. In addition, the means and standard deviations of the similarity for the different VCs are shown in Fig. S3-1. The results show that, except for MI, all evaluation measures indicated no significant difference in the reconstruction effect among V1, V2, and V3 joined with the HVC. Although V2 joined with the HVC showed worse performance on the MI index, V1, V2, and V3 generally did not differ in their contribution to the reconstruction.
In addition, to demonstrate the role of progressive training in high-resolution image reconstruction, a C-GAN model without progressive training was used as a baseline. The framework and parameters of the C-GAN were the same as those of the C-PG-GAN. The reconstruction results for all the test samples from the V1 and HVC joint brain responses of the five participants obtained by the C-GAN are shown in Fig. S3-2–6. Paired-sample t-tests were used to compare the similarity of the test samples obtained by the C-PG-GAN and the C-GAN under the eight measures. The results showed that, except for CSIM, all evaluation measures indicated that the reconstruction by the C-PG-GAN was significantly better than that by the C-GAN (Fig. 8). The visual and quantitative results both indicated that the progressive-growing training strategy played an effective role in natural image reconstruction.
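A paired-sample t-test of this kind can be computed as in the following sketch, shown here with hypothetical per-image scores rather than the actual results.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-test-image similarity scores (50 images) for the two models.
rng = np.random.RandomState(0)
ssim_cpggan = rng.rand(50)
ssim_cgan = rng.rand(50)

# Paired-sample t-test comparing C-PG-GAN with the C-GAN baseline on one measure.
t_stat, p_value = ttest_rel(ssim_cpggan, ssim_cgan)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```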
Fig. 8.
Comparison of reconstruction effects of C-PG-GAN and C-GAN under the eight measures (*P < 0.01, paired sample t-test).
Discussion
In this study, we presented a reconstruction framework that reconstructs high-definition natural images from the fMRI response signals of the visual cortex. The reconstruction framework consisted of three deep learning networks: CAE, LSTM, and C-PG-GAN. We used the CAE network to extract the latent features of natural images as a substitute for the images themselves. If a deep network model were directly established between a complex natural image and the response signal in the VC, the reconstruction would be limited by the image size, and it would be difficult to train the large number of parameters in the deep model. Therefore, it was necessary to replace the natural image with its latent features and to establish a predictive model from the response signals in the VC. This replacement greatly reduced the number of parameters in the predictive model and alleviated overfitting of the deep network. Previous research by Horikawa and Kamitani [38] showed that, via hierarchical visual feature representations, arbitrary object categories seen and imagined by participants can be predicted from fMRI signals in the human visual cortex. Recent face or natural image reconstruction has been achieved with pre-trained GAN [21, 24, 39] or autoencoder [40] models. These studies can reconstruct high-definition images thanks to external image datasets, such as ImageNet [27], that are used to pre-train the image-generation network in the reconstruction framework. Compared to pre-trained deep convolutional neural networks, GANs, or autoencoder networks, our proposed C-PG-GAN model provided another route to accurate reconstruction without pre-training on external data. The CAE network we used was trained on the small dataset from the visual experiments. In addition, we introduced image category information into the total loss function of the CAE network, which made the latent features of the low-dimensional space contain both the underlying pixel information and the high-level semantic category information of the natural image. We also introduced the variance loss of the latent features into the total loss function, which contributed to controlling the amplitude of the latent features.
In addition, we did not use the visual fMRI response signal at a single time point after the delay [26, 38]; instead, we made better use of the spatiotemporal fMRI signals of the 2–6-s visual response. The LSTM network processed the 2–6-s response signals and modeled the sequential relationship among the multiple response signals. The network extracted the temporal information of the VC response and thus better predicted the latent features of the natural image in the low-dimensional space from the fMRI response signals of the VC. Afterwards, the predicted latent features were fused with the response signal from V1 and input as a condition into the C-PG-GAN model, which was capable of generating high-definition images and progressively achieving accurate reconstruction of the natural images. We also used V2 and V3 in place of V1. Quantitative analysis of most evaluation indexes revealed no significant difference in the reconstruction effect among V1, V2, and V3. This suggests that most of the information that plays a key role in reconstruction comes from perception signals in the HVC. In addition, the proposed C-PG-GAN model received the predicted latent features and the response signal from V1, and used them as conditions that were input synchronously to the generator and discriminator. The assumption was that similar conditions produce similar images. That is, the C-PG-GAN model achieved accurate reconstruction of natural images by introducing progressive-growing and conditional mechanisms. The reconstruction evaluation indexes obtained by comparing the C-PG-GAN and the C-GAN fully indicated that the progressive growth mechanism was effective for reconstruction.
Although our results demonstrated that the framework can reconstruct high-resolution natural images by combining brain activity and deep neural networks, this research still had some limitations. First, only 2700 training samples were used because of the considerable scanning time required for each participant, and only 50 test images were reconstructed and evaluated. Although the CAE network we used was trained on this small dataset, more training samples would be expected to yield better results. Second, the CAE model captured the image category attributes, which greatly contributed to the high-level semantic reconstruction by providing an overall category-semantic space. However, given the variety of the low-level pixel space and the textures of natural images, combining the high-level semantic space with the low-level pixel space to achieve more accurate reconstruction is worth further study. Finally, the results showed that reconstruction was especially poor for participant 5. One reason lies in individual differences. Participant 5 was asked to perform the retinotopic mapping experiment twice, and the category discrimination accuracy from fMRI was found to be very low due to a poor mental state. As the signal-to-noise ratio of fMRI is extremely low, subtle head movements or inattentive mental activity may have a critical influence on the reconstruction results. Collecting fMRI data with a high signal-to-noise ratio is especially critical for reconstructing complex natural images.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61773094, 61533006, U1808204, 31730039, 31671133, and 61876114), the Ministry of Science and Technology of China (2015CB351701), the National Major Scientific Instruments and Equipment Development Project (ZDYZ2015-2), and a Chinese Academy of Sciences Strategic Priority Research Program B grant (XDB32010300).
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
Contributor Information
Hongmei Yan, Email: hmyan@uestc.edu.cn.
Zhentao Zuo, Email: zuozt@163.com.
Huafu Chen, Email: chenhf@uestc.edu.cn.
References
- 1. Naselaris T, Olman CA, Stansbury DE, Ugurbil K, Gallant JL. A voxel-wise encoding model for early visual areas decodes mental images of remembered scenes. Neuroimage. 2015;105:215–228. doi: 10.1016/j.neuroimage.2014.10.018.
- 2. Haynes J-D, Rees G. Predicting the orientation of invisible stimuli from activity in human primary visual cortex. Nat Neurosci. 2005;8:686–691. doi: 10.1038/nn1445.
- 3. Kamitani Y, Tong F. Decoding the visual and subjective contents of the human brain. Nat Neurosci. 2005;8:679–685. doi: 10.1038/nn1444.
- 4. Haxby JV, Gobbini MI, Furey ML, Ishai A, Schouten JL, Pietrini P. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science. 2001;293:2425. doi: 10.1126/science.1063736.
- 5. Cox DD, Savoy RL. Functional magnetic resonance imaging (fMRI) "brain reading": detecting and classifying distributed patterns of fMRI activity in human visual cortex. Neuroimage. 2003;19:261–270. doi: 10.1016/S1053-8119(03)00049-1.
- 6. Mitchell TM, Shinkareva SV, Carlson A, Chang KM, Malave VL, Mason RA, et al. Predicting human brain activity associated with the meanings of nouns. Science. 2008;320:1191–1195. doi: 10.1126/science.1152876.
- 7. Song S, Zhan Z, Long Z, Zhang J, Yao L. Comparative study of SVM methods combined with voxel selection for object category classification on fMRI data. PLoS ONE. 2011;6:e17191. doi: 10.1371/journal.pone.0017191.
- 8. Huth AG, Nishimoto S, Vu AT, Gallant JL. A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron. 2012;76:1210–1224. doi: 10.1016/j.neuron.2012.10.014.
- 9. Huth AG, Lee T, Nishimoto S, Bilenko NY, Vu AT, Gallant JL. Decoding the semantic content of natural movies from human brain activity. Front Syst Neurosci. 2016;10:81. doi: 10.3389/fnsys.2016.00081.
- 10. Wang C, Yan H, Huang W, Li J, Yang J, Li R, et al. "When" and "what" did you see? A novel fMRI-based visual decoding framework. J Neural Eng. 2020;17:056013. doi: 10.1088/1741-2552/abb691.
- 11. Huang W, Yan H, Wang C, Li J, Yang X, Li L, et al. Long short-term memory-based neural decoding of object categories evoked by natural images. Hum Brain Mapp. 2020;41:4442–4453. doi: 10.1002/hbm.25136.
- 12. Reddy L, Tsuchiya N, Serre T. Reading the mind's eye: decoding category information during mental imagery. Neuroimage. 2010;50:818–825. doi: 10.1016/j.neuroimage.2009.11.084.
- 13. Horikawa T, Tamaki M, Miyawaki Y, Kamitani Y. Neural decoding of visual imagery during sleep. Science. 2013;340:639–642. doi: 10.1126/science.1234330.
- 14. Postle BR. The cognitive neuroscience of visual short-term memory. Curr Opin Behav Sci. 2015;1:40–46. doi: 10.1016/j.cobeha.2014.08.004.
- 15. Miyawaki Y, Uchida H, Yamashita O, Sato MA, Morito Y, Tanabe HC, et al. Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron. 2008;60:915–929. doi: 10.1016/j.neuron.2008.11.004.
- 16. Naselaris T, Prenger RJ, Kay KN, Oliver M, Gallant JL. Bayesian reconstruction of natural images from human brain activity. Neuron. 2009;63:902–915. doi: 10.1016/j.neuron.2009.09.006.
- 17. Nishimoto S, Vu AT, Naselaris T, Benjamini Y, Yu B, Gallant JL. Reconstructing visual experiences from brain activity evoked by natural movies. Curr Biol. 2011;21:1641–1646. doi: 10.1016/j.cub.2011.08.031.
- 18. Cowen AS, Chun MM, Kuhl BA. Neural portraits of perception: reconstructing face images from evoked brain activity. Neuroimage. 2014;94:12–22. doi: 10.1016/j.neuroimage.2014.03.018.
- 19. Huang W, Yan H, Wang C, Li J, Zuo Z, Zhang J, et al. Perception-to-image: Reconstructing natural images from the brain activity of visual perception. Ann Biomed Eng. 2020;48:2323–2332. doi: 10.1007/s10439-020-02502-3.
- 20. Du C, Du C, Huang L, He H. Reconstructing perceived images from human brain activities with Bayesian deep multiview learning. IEEE Trans Neural Netw Learn Syst. 2018;30:2310–2323. doi: 10.1109/TNNLS.2018.2882456.
- 21. VanRullen R, Reddy L. Reconstructing faces from fMRI patterns using deep generative neural networks. Commun Biol. 2019;2:1–10. doi: 10.1038/s42003-019-0438-y.
- 22. Güçlütürk Y, Güçlü U, Seeliger K, Bosch S, van Lier R, van Gerven MA. Reconstructing perceived faces from brain activations with deep adversarial neural decoding. Adv Neural Inform Process Syst. 2017;30:4246–4257.
- 23. Zhang C, Qiao K, Wang L, Li T, Zeng Y, Yan B. Constraint-free natural image reconstruction from fMRI signals based on convolutional neural network. Front Hum Neurosci. 2018;12:242. doi: 10.3389/fnhum.2018.00242.
- 24. Seeliger K, Güçlü U, Ambrogioni L, Güçlütürk Y, van Gerven MA. Generative adversarial networks for reconstructing natural images from brain activity. Neuroimage. 2018;181:775–785. doi: 10.1016/j.neuroimage.2018.07.043.
- 25. St-Yves G, Naselaris T. Generative adversarial networks conditioned on brain activity reconstruct seen images. bioRxiv. 2018:304774.
- 26. Shen G, Horikawa T, Majima K, Kamitani Y. Deep image reconstruction from human brain activity. PLoS Comput Biol. 2019;15:e1006633. doi: 10.1371/journal.pcbi.1006633.
- 27. Deng J, Dong W, Socher R, Li LJ, Li K, Li FF. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2009: 248–255.
- 28. Qian A, Wang X, Liu H, Tao J, Zhou J, Ye Q, et al. Dopamine D4 receptor gene associated with the frontal-striatal-cerebellar loop in children with ADHD: A resting-state fMRI study. Neurosci Bull. 2018;34:497–506. doi: 10.1007/s12264-018-0217-7.
- 29. Wang X, Yu A, Zhu X, Yin H, Cui L. Cardiopulmonary comorbidity, radiomics and machine learning, and therapeutic regimens for a cerebral fMRI predictor study in psychotic disorders. Neurosci Bull. 2019;35:955–957. doi: 10.1007/s12264-019-00409-1.
- 30. Engel SA, Glover GH, Wandell BA. Retinotopic organization in human visual cortex and the spatial precision of functional MRI. Cereb Cortex. 1997;7:181–192. doi: 10.1093/cercor/7.2.181.
- 31. Huang W, Yan H, Liu R, Zhu L, Zhang H, Chen H. F-score feature selection based Bayesian reconstruction of visual image from human brain activity. Neurocomputing. 2018;316:202–209. doi: 10.1016/j.neucom.2018.07.068.
- 32. Polat K, Güneş S. A new feature selection method on classification of medical datasets: Kernel F-score feature selection. Expert Syst Appl. 2009;36:10367–10373. doi: 10.1016/j.eswa.2009.01.041.
- 33. Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC. Improved training of Wasserstein GANs. Adv Neural Inform Process Syst. 2017:5767–5777.
- 34. Rubner Y, Tomasi C, Guibas LJ. The earth mover's distance as a metric for image retrieval. Int J Comput Vis. 2000;40:99–121. doi: 10.1023/A:1026543900054.
- 35. Ma Y, Gu X, Wang Y. Histogram similarity measure using variable bin size distance. Comput Vis Image Underst. 2010;114:981–989. doi: 10.1016/j.cviu.2010.03.006.
- 36. Pluim JP, Maintz JA, Viergever MA. Mutual-information-based registration of medical images: a survey. IEEE Trans Med Imaging. 2003;22:986–1004. doi: 10.1109/TMI.2003.815867.
- 37. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process. 2004;13:600–612. doi: 10.1109/TIP.2003.819861.
- 38. Horikawa T, Kamitani Y. Generic decoding of seen and imagined objects using hierarchical visual features. Nat Commun. 2017;8:1–15. doi: 10.1038/ncomms15037.
- 39. Mozafari M, Reddy L, VanRullen R. Reconstructing natural scenes from fMRI patterns using BigBiGAN. arXiv:2001.11761. 2020.
- 40. Han K, Wen H, Shi J, Lu KH, Zhang Y, Liu Z. Variational autoencoder: An unsupervised model for modeling and decoding fMRI activity in visual cortex. Neuroimage. 2019;198:125–136.