J Cogn Neurosci. 2022 Nov 1;34(12):2406–2435. doi: 10.1162/jocn_a_01916

Predicting Identity-Preserving Object Transformations in Human Posterior Parietal Cortex and Convolutional Neural Networks

Viola Mocz 1, Maryam Vaziri-Pashkam 2, Marvin Chun 1, Yaoda Xu 1
PMCID: PMC9988239  NIHMSID: NIHMS1874115  PMID: 36122358

Abstract

Previous research shows that, within human occipito-temporal cortex (OTC), we can use a general linear mapping function to link visual object responses across nonidentity feature changes, including Euclidean features (e.g., position and size) and non-Euclidean features (e.g., image statistics and spatial frequency). Although the learned mapping is capable of predicting responses of objects not included in training, these predictions are better for categories included than those not included in training. These findings demonstrate a near-orthogonal representation of object identity and nonidentity features throughout human OTC. Here, we extended these findings to examine the mapping across both Euclidean and non-Euclidean feature changes in human posterior parietal cortex (PPC), including functionally defined regions in inferior and superior intraparietal sulcus. We additionally examined responses in five convolutional neural networks (CNNs) pretrained with object classification, as CNNs are considered the current best models of the primate ventral visual system. We separately compared results from PPC and CNNs with those of OTC. We found that a linear mapping function could successfully link object responses in different states of nonidentity transformations in human PPC and CNNs for both Euclidean and non-Euclidean features. Overall, we found that object identity and nonidentity features are represented in a near-orthogonal, rather than completely orthogonal, manner in PPC and CNNs, just as they are in OTC. Meanwhile, some differences existed among OTC, PPC, and CNNs. These results demonstrate the similarities and differences in how visual object information across an identity-preserving image transformation may be represented in OTC, PPC, and CNNs.

INTRODUCTION

The hallmark of the primate ventral visual pathway is its ability to form transformation-tolerant, identity-preserving object representations. Object identity representations are argued to become increasingly tolerant to nonidentity image changes from lower to higher visual regions (e.g., Isik, Meyers, Leibo, & Poggio, 2013; Rust & DiCarlo, 2010; DiCarlo & Cox, 2007; Rolls, 2000). This enables us to recognize objects even as their appearances change under different viewing conditions. Meanwhile, nonidentity features, such as the position of an object, can also be decoded independently of object identities throughout the primate ventral visual processing regions (Xu & Vaziri-Pashkam, 2021b; Vaziri-Pashkam, Taylor, & Xu, 2019; Hong, Yamins, Majaj, & DiCarlo, 2016; Hung, Kreiman, Poggio, & DiCarlo, 2005). Although the representations of nonidentity features are irrelevant for object recognition, they are essential for our interactions with objects. Actions such as grasping critically rely on features including the position and size of the objects.

Past research has examined the low-dimensional representational space of a collection of neurons to understand object representation in the primate brain. This approach has revealed that although each neuron may be simultaneously tuned to object identity and nonidentity features, the neuronal representational space nevertheless exhibits a more or less orthogonal structure (see Figure 1A for schematics of orthogonal and near-orthogonal representations) that enables largely independent readout of object identity and nonidentity information in OTC, as shown in monkey neurophysiological and human fMRI studies (e.g., Xu & Vaziri-Pashkam, 2021b; Vaziri-Pashkam et al., 2019; Hong et al., 2016; Zhang, Liu, & Xu, 2015; Schwarzlose, Swisher, Dang, & Kanwisher, 2008; Hung et al., 2005). In these studies, a classifier trained to decode object identity at one location or size, for example, can also decode object identity successfully at another location or size. This led DiCarlo, Zoccolan, and Rust (2012) to propose that the goal of ventral visual processing is to untangle object identity and nonidentity features to gain access to transformation-tolerant object identity features in high-level vision. This, they argue, underlies the remarkable ability of high-level primate vision to recognize objects across different viewing conditions. Thus, a fully orthogonal representation can be coded by neurons tuned to multiple object features, so long as the representational space contains an orthogonal structure for the different feature dimensions. However, these decoding studies do not reveal the extent of orthogonality across brain regions: whether identity and nonidentity features are coded in a completely orthogonal manner or in a near-orthogonal manner because of some interactions in the coding of these two types of features. This is because decoding methods are not sensitive to changes in the representational space as long as objects remain on the correct side of the decoding decision boundary. Thus, until recently, the extent of orthogonality of identity and nonidentity information across the ventral visual system was unknown.

Figure 1.

Possible neural representational space structures, experimental details, and analyses used. (A) A schematic illustration of how object identity and nonidentity features may be represented together in high-dimensional neural representational space, using size as a nonidentity feature. (Left) Completely orthogonal representations of these two types of features, with the object responses across the two states of the size transformation being equidistant for each object in the representational space. (Right) Near-orthogonal representations of these two types of features, with the object responses across the two states of the size transformation for each object being different in the representational space. (B) The eight natural object categories used. Each category contained 10 different exemplars varying in identity, pose (for the animate categories only), and viewing angle to minimize the low-level image similarities among them. (C) The four types of nonidentity transformations examined, including position, size, image stats, and SF. Each transformation included two states. (D) An illustration of the block design paradigm used. Participants performed a 1-back repetition detection task on the images. An actual block in the experiment contained 10 images with two repetitions per block. (E) Inflated brain surfaces from a representative participant showing the ROIs examined. They included topographically defined regions in OTC and PPC, including V1–V4, V3a, V3b, IPS0–IPS4, functionally defined higher ventral object processing regions LOT and VOT, and functionally defined parietal object processing regions inferior IPS and superior IPS. (F–G) Illustrations of the analyses performed to evaluate the predicted pattern. (F) To evaluate pattern predictability, the predicted and true patterns were directly correlated; to evaluate pattern selectivity, the correlation between the predicted and true patterns from the same category was compared to the correlation between the predicted and true patterns from different categories. (G) To evaluate the effect of category similarity on pattern prediction, a prediction similarity matrix was constructed, where each cell reflects how well two categories may predict each other, and a category similarity matrix was also constructed, where each cell reflects the pairwise correlation of the true patterns of two categories across the two halves of the data. The off-diagonal elements of these two matrices were vectorized and then correlated. See Methods for more details.

Previous work has shown that a linear mapping function can link fMRI responses from different affine states of an object, such as size and orientation, in human lateral occipital (LO) cortex, even for objects not included in training (Ward, Isik, & Chun, 2018). This further demonstrates some independence in how our brain represents object identity and nonidentity features. We replicated this work in a recent study and expanded it in two significant ways (Mocz, Vaziri-Pashkam, Chun, & Xu, 2021). First, in addition to higher human visual processing regions, we examined the entire human ventral processing pathway, including early visual areas V1–V4 and higher visual object processing regions in lateral occipito-temporal (LOT) and ventral occipito-temporal (VOT) regions. Second, in addition to affine/Euclidean transformations such as position and size, we examined the coding of non-Euclidean features including image stats and the spatial frequency (SF) content of an object image. In the image stats manipulation, we examined object identity representation both in the original image format and in images in which image stats such as luminance, contrast, and SF were equated. In the SF manipulation, we examined object identity representation in images appearing in two different SF ranges.

The representation of identity and nonidentity information can be considered fully orthogonal if the predicted patterns from the linear mapping functions show a significant correlation with the true patterns and if there is no difference in correlation for categories included and not included in training (Figure 1A, left). For a near-orthogonal representation, the predicted patterns from the linear mapping functions would still show a significant correlation with the true patterns, but predictions would be significantly better for categories included in training (Figure 1A, right). For both Euclidean and non-Euclidean transformations, we found that object responses in different states of nonidentity transformations could be linked through linear mapping functions. These mapping functions, however, are not entirely identity independent, with predictions being consistently better for objects included than those not included in training. This suggests that object identity and nonidentity features are represented in a near, rather than a completely, orthogonal manner in the human ventral visual stream.

Although visual information processing has been primarily linked to the occipito-temporal cortex (OTC) in the ventral pathway, more than two decades of neurophysiological and fMRI studies have revealed that direct representations of a diverse array of visual information also exist in the primate dorsal pathway in posterior parietal cortex (PPC) along the intraparietal sulcus (IPS; see Xu, 2018a, 2018b, for a recent review of this literature; see also Freud, Plaut, & Behrmann, 2016). Recent studies show that object representations in the PPC exhibit similar tolerance to various feature transformations as those in higher regions of OTC (Vaziri-Pashkam & Xu, 2019; Vaziri-Pashkam et al., 2019; see also Konen & Kastner, 2008). Despite these similarities, OTC and PPC differ in their object representational structures (Vaziri-Pashkam & Xu, 2019) and in how their responses may be modulated by attention and task (Xu & Vaziri-Pashkam, 2019; Bracci, Daniels, & Op de Beeck, 2017; Vaziri-Pashkam & Xu, 2017; Jeong & Xu, 2016). A recent review argues that whereas OTC is involved in the invariant aspect of visual representation, providing us with a detailed and stable representation of the visual world, PPC is involved in the dynamic and adaptive aspect of visual representation, allowing us to selectively represent salient and task-relevant information to interact flexibly and efficiently with the world (Xu, 2018a; see also Xu, 2018b). Despite these advances, the nature of PPC visual representation is not fully understood, and whether a linear mapping exists in the PPC between different states of a transformation has never been tested. Determining whether such a mapping exists in both OTC and PPC will further our understanding of how the nature of visual representation may be similar or different between these two brain regions.

To address this question, in this study, we used data from two existing fMRI data sets (Vaziri-Pashkam & Xu, 2019; Vaziri-Pashkam et al., 2019) where participants viewed objects undergoing two Euclidean transformations (i.e., position and size) and two non-Euclidean transformations (i.e., changes in image statistics and SF). We examined responses from topographically defined IPS areas V3a, V3b, and IPS0–IPS4 (Silver & Kastner, 2009; see Figure 1E). We additionally examined two functionally defined PPC regions in the inferior and superior IPS regions (referred to as inferior IPS and superior IPS, respectively, for simplicity). These two regions have been shown to be involved in object individuation/selection and object identification, respectively (Xu & Chun, 2009; see also Xu, 2008a, 2008b, 2010). Superior IPS in particular has been shown to track visual working memory (VWM) capacity and plays an important role in VWM storage (Bettencourt & Xu, 2016a; Jeong & Xu, 2012; Xu, 2008b, 2010; Xu & Chun, 2006, 2007; Todd & Marois, 2004, 2005; for reviews, see Xu, 2017, 2020, 2021). Both inferior IPS and superior IPS overlap with topographically defined IPS areas, with inferior IPS showing the greatest amount of overlap with V3a and V3b and superior IPS with IPS0–IPS4 (Bettencourt & Xu, 2016b). To streamline the report of the results, we include here only results from inferior and superior IPS. Similar results were obtained in topographically defined PPC regions.

The recent decade has witnessed the development of convolutional neural networks (CNNs) capable of achieving human-like performance on object recognition tasks and identifying objects across a variety of identity-preserving (sometimes quite challenging) image transformations (e.g., Serre, 2019; Yamins & DiCarlo, 2016; Kriegeskorte, 2015). This has led to the view that CNNs likely form transformation-tolerant object representations in their final stages of visual processing similar to those seen in the primate brain (Tacchetti et al., 2018; Hong et al., 2016; Yamins & DiCarlo, 2016). Consistent with this view, CNN object representations show some correspondence with the primate ventral visual stream (Xu & Vaziri-Pashkam, 2021a; Eickenberg, Gramfort, Varoquaux, & Thirion, 2017; Cichy, Khosla, Pantazis, Torralba, & Oliva, 2016; Güçlü & van Gerven, 2015; Khaligh-Razavi & Kriegeskorte, 2014), leading some to consider CNNs the current best models of the primate ventral visual system (Cichy & Kaiser, 2019; Kubilius, Schrimpf, & Hong, 2019). Meanwhile, large discrepancies also exist in visual representation and performance between the CNNs and the primate brain (e.g., Xu & Vaziri-Pashkam, 2021a, 2021b; Serre, 2019; Geirhos et al., 2018). Although CNNs are fully accessible, they are at the same time extremely complex models with tens of thousands or more parameters, making their general operating principles at the algorithmic level largely unknown. Using the linear mapping method from Mocz and colleagues (2021), here we examined whether a representational structure for object identity and nonidentity features similar to that found in the human ventral regions may exist in CNNs pretrained on object recognition. We examined both shallower networks (Alexnet and VGG-19) and deeper networks (Googlenet and Resnet-50; He, Zhang, Ren, & Sun, 2016; Szegedy et al., 2015; Simonyan & Zisserman, 2014; Krizhevsky, Sutskever, & Hinton, 2012). We also included a recurrent network, Cornet-S, which has been shown to capture the recurrent processing in macaque IT cortex with a shallower structure (Kar, Kubilius, Schmidt, Issa, & DiCarlo, 2019; Kubilius et al., 2019). This CNN has recently been argued to be the current best model of the primate ventral visual processing regions.

As an additional update to Mocz and colleagues (2021), and to understand how some of the results from that study may have been affected by overfitting, we also systematically investigated the impact of the penalty parameter (which controls the balance between overfitting and underfitting) on the generated predictions. This allowed us to choose a more optimal penalty parameter for both the brain and the CNN analyses reported here.

We found that a linear mapping function could successfully link object responses in different states of nonidentity transformations in human PPC and CNNs for both Euclidean and non-Euclidean transformations. Overall, we found that object identity and nonidentity features are represented in a near-orthogonal, rather than completely orthogonal, manner in PPC and CNNs, just as they are in OTC. Meanwhile, some differences existed among OTC, PPC, and CNNs. These results demonstrate the similarities and differences in how visual object information across an identity-preserving image transformation may be represented in OTC, PPC, and CNNs.

METHODS

The details of the four fMRI experiments included here were first reported in two previous publications (Vaziri-Pashkam & Xu, 2019; Vaziri-Pashkam et al., 2019). The details of the transformation analysis and its evaluation were first reported in another study (Mocz et al., 2021). They are summarized here for the readers' convenience.

Participants

Seven (four women), seven (four women), six (four women), and ten (five women) healthy human participants with normal or corrected-to-normal visual acuity, all right-handed, and aged between 18 and 35 years took part in Experiments 1–4, respectively. The number of participants included was determined by the number of participants included in other similar published studies in object representation (e.g., n = 11 in Sawamura, Georgieva, Vogels, Vanduffel, and Orban [2005]; n = 6 in Konen and Kastner [2008]; n = 12 in Bettencourt and Xu [2016b]; and n = 11 or 13 in Jeong and Xu [2016]). All participants gave their informed consent before the experiments and received payment for their participation. The experiments were approved by the Committee on the Use of Human Subjects at Harvard University.

Experimental Design and Procedures

Main Experiments

In all four experiments, participants performed a 1-back object repetition detection task while viewing blocks of grayscale images of real-world object categories (Figure 1B–D). Experiments 1–3 included eight real-world object categories: body, car, cat, chair, elephant, face, house, and scissors. Experiment 4 included six real-world object categories: body, car, chair, elephant, face, and house. These sets of categories cover a broad range of real-world objects and include small/large, animate/inanimate, and natural/manmade objects. Similar sets have been used in previous investigations of object category representations in the OTC (e.g., Haxby et al., 2011; Kriegeskorte et al., 2008). Each category contained 10 exemplars that varied in identity, pose (for the animal and body categories only), and viewing angle to minimize the low-level similarities among them. All images were placed on a dark gray square and displayed on a light gray background. Although Experiment 4 included only six of the eight total object categories, the exemplars used for each category were the same as in the other experiments.

We analyzed two Euclidean and two non-Euclidean transformations. The two Euclidean transformations were as follows: (1) position (Experiment 1), with object image (subtended 5.7° × 5.7°) appearing above versus below fixation by 3.08°, and (2) size (Experiment 2), with object image shown in small (4.6° × 4.6°) versus large (11.4° × 11.4°) size. The two non-Euclidean transformations were as follows: (1) image stats (Experiment 3), with object images (subtended 9.13° × 9.13°) shown in original versus controlled format, and (2) SF (Experiment 4), with object images (subtended 7.8° × 7.8°) shown in high (i.e., high-pass filtered using a finite impulse response filter with a cutoff of 4.40 cycles per degree) versus low (i.e., low-pass filtered using a finite impulse response filter with a cutoff of 0.62 cycles per degree) SF (Figure 1C). Controlled images were generated using the SHINE technique to achieve spectrum, histogram, and intensity normalization and equalization (Willenbockel et al., 2010). Controlled images also appeared in Experiments 1 and 2 to better equate low-level differences among the images from the different categories.

During the experiment, blocks of images were shown. Each block contained a random sequential presentation of 10 exemplars from the same object category and the same transformation condition (e.g., for Experiment 1, in one block of a run, all the images were of cats positioned above fixation). Each image was presented for 200 msec followed by a 600-msec blank interval between the images (Figure 1D). Two image repetitions occurred randomly in each block. Participants were asked to view the images and report the repetitions by pressing a key on an MR-compatible button box. To ensure proper fixation, participants fixated at a central dot throughout the experiment, and eye movements were monitored in all four experiments using an SR-research EyeLink 1000 eye tracker. Each block lasted 8 sec followed by an 8-sec fixation period. In Experiments 1–3, there was an additional fixation period of 8 sec at the beginning of each run. In Experiment 4, there was an additional fixation period of 12 sec at the beginning of each run.

Each run within Experiments 1–3 contained 16 blocks, one for each of the eight object categories in each of the two states of the transformation. Each run within Experiment 4 contained 18 blocks, one for each of the six object categories in the low-SF condition, the high-SF condition, and the full-SF condition. The full-SF condition blocks were not used in the transformation analysis. Experiments 1–3 included 16 runs, with each run lasting 4 min 24 sec. Experiment 4 included 18 runs, with each run lasting 5 min. The order of the object categories and the two states of the transformation were counterbalanced across runs and participants.

Localizer Experiments

Topographic visual regions.

In each participant, we examined ROIs from topographically localized areas within occipital cortex, including V1, V2, V3, and V4, as well as within parietal cortex, including V3a, V3b, IPS0, IPS1, IPS2, IPS3, and IPS4 (Figure 1E, top left). To streamline the presentation of the main results for the parietal regions, because V3a and V3b greatly overlap with inferior IPS and because IPS0–IPS4 greatly overlap with superior IPS (Bettencourt & Xu, 2016b), the main results only include the data from inferior and superior IPS.

The topographical regions were mapped with flashing checkerboards using standard techniques (Swisher, Halko, Merabet, McMains, & Somers, 2007; Sereno et al., 1995), with parameters optimized following Swisher and colleagues (2007). Specifically, a polar angle wedge with an arc of 72° swept across the entire screen (23.4° × 17.5° of visual angle). The wedge had a sweep period of 55.467 sec, flashed at 4 Hz, and swept for 12 cycles in each run (for more details, see Swisher et al., 2007). Participants completed four to six runs, each lasting 11 min 56 sec. The task varied slightly across participants. All participants were asked to detect a dimming in the visual display. For some participants, the dimming occurred only at fixation; for some, it occurred only within the polar angle wedge; and for others, it could occur in both locations, commensurate with the various methodologies used in the literature (Bressler & Silver, 2010; Swisher et al., 2007). No differences were observed in the maps obtained through each of these methods.

LOT and VOT.

VOT and LOT (Figure 1E, bottom left and right) loosely correspond to the location of LO and posterior fusiform areas (Kourtzi & Kanwisher, 2000; Grill-Spector et al., 1998; Malach et al., 1995) but extend further into the temporal cortex in an effort to include as many object-selective voxels as possible in the OTC. To identify LOT and VOT ROIs, following Kourtzi and Kanwisher (2000), participants viewed blocks of face, scene, object, and scrambled object images (all subtended approximately 12.0° × 12.0°). Only the contrast between objects and scrambled objects was used to localize LOT and VOT. The other object categories were included to localize other brain regions not examined in the present study. The images were photographs of gray-scaled male and female faces, common objects (e.g., cars, tools, and chairs), indoor and outdoor scenes, and phase-scrambled versions of the common objects. Participants monitored for a slight spatial jitter that occurred randomly once in every 10 images. Each run contained four blocks each of scenes, faces, objects, and phase-scrambled objects. Each block lasted 16 sec and contained 20 unique images, with each appearing for 750 msec and followed by a 50-msec blank display. Besides the stimulus blocks, 8-sec fixation blocks were included at the beginning, middle, and end of each run. Each participant was tested with two or three runs, each lasting 4 min 40 sec.

Inferior IPS.

To identify the inferior IPS ROI (Figure 1E, top right), we used the procedure developed by Xu and Chun (2006) and implemented by Xu and Jeong (2015), where participants viewed blocks of objects and noise images. The object images were similar to the images in the superior IPS localizer, except that in all trials, four images were presented on the display. The noise images were generated by phase-scrambling the entire object images. Each block lasted 16 sec and contained 20 images, each appearing for 500 msec followed by a 300-msec blank display. Participants were asked to detect the direction of a slight spatial jitter (either horizontal or vertical), which occurred randomly once in every 10 images. Each run contained eight object blocks and eight noise blocks. Each participant was tested with two or three runs, each lasting 4 min 40 sec.

Superior IPS.

To identify the superior IPS ROI (Figure 1E, top right) previously shown to be involved in VWM storage (Xu & Chun, 2006; Todd & Marois, 2004), we followed the procedure developed by Xu and Chun (2006) and implemented by Xu and Jeong (2015). In an event-related object VWM experiment, participants viewed a brief sample display containing one to four everyday objects and, after a short delay, judged whether a new probe object in the test display matched the category of the object shown at the same position in the sample display. A match occurred in 50% of the trials. Gray-scaled photographs of objects from four categories (shoes, bikes, guitars, and couches) were used. Objects could appear above, below, to the left, or to the right of the central fixation. Object locations were marked by white rectangular placeholders that were always visible during the trial. The placeholders subtended 4.5° × 3.6° and were 4.0° away from the fixation (center to center). The entire display subtended 12.5° × 11.8°. Each trial lasted 6 sec and contained the following: fixation (1000 msec), sample display (200 msec), delay (1000 msec), test display/response (2500 msec), and feedback (1300 msec). With a counterbalanced trial history design (Xu & Chun, 2006; Todd & Marois, 2004), each run contained 15 trials for each set size and 15 fixation trials in which only the fixation dot was present for 6 sec. Two filler trials, which were excluded from the analysis, were added at the beginning and end of each run, respectively, for practice and trial history-balancing purposes. Participants were tested with two runs, each lasting 8 min.

MRI Methods

MRI data were collected using a Siemens MAGNETOM Trio, A Tim System 3-T scanner, with a 32-channel receiver array head coil. In Experiment 4, data from the last four participants were collected after the scanner was upgraded to a Prisma system. Participants lay on their back inside the MRI scanner and viewed the back-projected LCD with a mirror mounted inside the head coil. The display had a refresh rate of 60 Hz and a spatial resolution of 1024 × 768. An Apple MacBook Pro laptop was used to present the stimuli and collect the motor responses. For topographic mapping, the stimuli were presented using VisionEgg (Straw, 2008). All other stimuli were presented with MATLAB running Psychtoolbox extensions (Brainard, 1997).

Each participant completed three to six MRI scan sessions to obtain data for the high-resolution anatomical scans, topographic maps, functional ROIs, and experimental scans. Of these MRI scan sessions, one to four sessions were experimental sessions for each participant (with some participants taking part in more than one experiment). Using standard parameters, a T1-weighted high-resolution (1.0 × 1.0 × 1.3 mm3) anatomical image was obtained for surface reconstruction. For all the fMRI scans, a T2*-weighted gradient echo pulse sequence was used. For the experimental scans, 33 axial slices parallel to the AC–PC line (3 mm thick, 3 × 3 mm2 in-plane resolution with 20% skip) were used to cover the whole brain (repetition time [TR] = 2 sec, echo time [TE] = 29 msec, flip angle = 90°, matrix = 64 × 64). For the LOT/VOT and inferior IPS localizer scans, 30–31 axial slices parallel to the AC–PC line (3 mm thick, 3 × 3 mm2 in-plane resolution with no skip) were used to cover occipital and temporal lobes (TR = 2 sec, TE = 30 msec, flip angle = 90°, matrix = 72 × 72). For the superior IPS localizer scans, 24 axial slices parallel to the AC–PC line (5 mm thick, 3 × 3 mm2 in-plane resolution with no skip) were used to cover most of the brain with priority given to parietal and occipital cortices (TR = 1.5 sec, TE = 29 msec, flip angle = 90°, matrix = 72 × 72). For topographic mapping, 42 slices (3 mm thick, 3.125 × 3.125 mm2 in-plane resolution with no skip) just off parallel to the AC–PC line were collected to cover the whole brain (TR = 2.6 sec, TE = 30 msec, flip angle = 90°, matrix = 64 × 64). Different slice prescriptions were used here for the different localizers to be consistent with the parameters used in our previous studies. Because the localizer data were projected into the volume view and then onto individual participants' flattened cortical surface, the exact slice prescriptions used had minimal impact on the final results.

Data Analysis

fMRI data were analyzed using FreeSurfer (surfer.nmr.mgh.harvard.edu), FsFast (Dale, Fischl, & Sereno, 1999), and in-house MATLAB and Python code. fMRI data preprocessing included 3-D motion correction, slice timing correction, and linear and quadratic trend removal. No smoothing was applied to the data. All the analyses for the main experiment were performed in the volume. The ROIs were selected on the surface and then projected back to the volume for further analysis.

ROI Definitions

Topographic maps.

Following the procedures described in Swisher and colleagues (2007) and by examining phase reversals in the polar angle maps, we identified topographic areas in occipital cortex including V1 and V2 in each participant (Figure 1E).

Superior IPS.

To identify this ROI (Figure 1E), fMRI data from the superior IPS localizer were analyzed using a linear regression analysis to determine voxels whose responses correlated with a given participant's behavioral VWM capacity estimated using Cowan's k (Cowan, 2001). In a parametric design, each stimulus presentation was weighted by the estimated Cowan's k for that set size. After convolving the stimulus presentation boxcars (lasting 6 sec) with a hemodynamic response function, a linear regression with two parameters (a slope and an intercept) was fitted to the data from each voxel. Superior IPS was defined as a region in parietal cortex showing a significantly positive slope in the regression analysis overlapping or near the Talairach coordinates previously reported for this region (Todd & Marois, 2004). As in Vaziri-Pashkam and Xu (2017), we defined superior IPS initially with a threshold of p < .001 (uncorrected). However, in Experiments 1–3, for five participants, this produced an ROI that contained too few voxels for multivoxel pattern analysis (MVPA) decoding. We therefore used p < .001 (uncorrected) in two participants and relaxed the threshold to .05 in three participants or .1 in two participants to obtain a reasonably large superior IPS with at least 100 voxels across hemispheres. In Experiment 4, there was a similar issue, and so we used p < .001 (uncorrected) in four participants and relaxed the threshold to .01, .05, or .1 for the other six participants to obtain a reasonably large superior IPS. The resulting ROIs on average had 234 voxels across all the participants.
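
For reference, Cowan's k in this type of change-detection paradigm is conventionally computed as

k = N \times (H - FA)

where N is the set size, H the hit rate, and FA the false alarm rate. We note this standard formula (Cowan, 2001) only for illustration; the exact estimation procedure used for the parametric weighting is described in the original studies.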

Inferior IPS.

This ROI (Figure 1E) was defined as a cluster of contiguous voxels in the inferior part of IPS that responded more (p < .001 uncorrected) to the original than to the scrambled object images in the inferior IPS localizer and did not overlap with the superior IPS ROI. In Experiment 4, for one participant, the threshold was relaxed to .1 to obtain a region with a reasonable number of voxels across hemispheres (the final size of inferior IPS was 90 voxels for this individual).

LOT and VOT.

These two ROIs (Figure 1E) were defined as clusters of contiguous voxels in the lateral and ventral occipital cortex, respectively, that responded more (p < .001 uncorrected) to the original than to the scrambled object images. LOT and VOT loosely correspond to the location of LO and posterior fusiform (Kourtzi & Kanwisher, 2000; Grill-Spector et al., 1998; Malach et al., 1995) but extend further into the temporal cortex in an effort to include as many object-selective voxels as possible in OTC regions. In Experiment 4, for two participants for LOT and for one participant for VOT, the threshold of p < .001 resulted in too few voxels, so the threshold was relaxed to p < .01 to have at least 100 voxels across the two hemispheres.

MVPA

In the main experiments, to generate fMRI response patterns for each condition in each run, we first convolved the 8-sec stimulus presentation boxcars with a hemodynamic response function. Then, for Experiments 1–3, we conducted a general linear model analysis with 16 factors (2 states of the transformation × 8 object categories) to extract a beta value for each condition in each voxel in each ROI. This was done separately for each run. For Experiment 4, we performed a general linear model analysis with 18 factors (3 SF conditions × 6 object categories) to extract the beta values in a similar way, again separately for each run. Although the full-SF condition was analyzed in Vaziri-Pashkam and colleagues (2019), it was not included in the analyses here. We z-normalized the beta values across all voxels for each condition in a given ROI in each run to remove amplitude differences between conditions, ROIs, and runs.
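
As an illustration of this normalization step, the following minimal sketch z-scores the beta pattern across voxels for each condition within a run (array names and shapes are ours, not the original analysis code):

```python
import numpy as np

def z_normalize_patterns(betas):
    """z-score beta values across voxels, separately for each condition.

    betas: array of shape (n_conditions, n_voxels) for one run and one ROI.
    Returns the same-shaped array with zero mean and unit variance across
    voxels within each condition, removing overall amplitude differences.
    """
    mean = betas.mean(axis=1, keepdims=True)
    std = betas.std(axis=1, keepdims=True)
    return (betas - mean) / std
```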

Reliability-based Voxel Selection

As pattern decoding to a large extent depends on the total number of voxels in an ROI, to equate the number of voxels in different ROIs to facilitate comparisons across ROIs and to increase power, we selected the 75 most reliable voxels in each ROI using reliability-based voxel selection (Tarhan & Konkle, 2020). Across the experiments, the ROIs ranged from 75 to 901 voxels before voxel selection. This method selects voxels whose response profiles are consistent across odd and even halves of the runs and works well when there are around 15 conditions. To implement this method, for each voxel, we calculated the split-half reliability by first averaging the runs within the odd and even halves and then correlating the resulting averaged responses for all conditions (12 or 16 in total for the six or eight image categories and two states of a transformation) across the even and odd halves. We then selected the top 75 voxels with the highest correlations. To avoid circularity in analysis, these 75 voxels were selected using only the training runs in each iteration of a leave-one-out cross-validation procedure (which will be explained in more detail in the next section). Within each ROI, the selected voxels showed an overlap of 33–69 voxels across the different training–testing iterations for the position transformation, 31–63 voxels for the size transformation, 40–72 voxels for the image stats transformation, and 28–70 voxels for the SF transformation. The 75 voxels chosen maintained a high split-half reliability of at least r = .70 for each participant and ROI while providing an optimal number of features for subsequent ridge regression analysis.
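
A minimal sketch of this reliability-based voxel selection, assuming the z-normalized condition patterns of the training runs are stacked into a single array (names and shapes are illustrative, not the original code):

```python
import numpy as np

def select_reliable_voxels(patterns, n_select=75):
    """Select the voxels with the highest odd/even split-half reliability.

    patterns: array of shape (n_runs, n_conditions, n_voxels) holding the
        z-normalized beta patterns of the training runs for one ROI.
    Returns the indices of the n_select most reliable voxels.
    """
    odd_mean = patterns[0::2].mean(axis=0)   # (n_conditions, n_voxels)
    even_mean = patterns[1::2].mean(axis=0)  # (n_conditions, n_voxels)
    reliability = np.array([
        np.corrcoef(odd_mean[:, v], even_mean[:, v])[0, 1]
        for v in range(patterns.shape[-1])
    ])
    return np.argsort(reliability)[::-1][:n_select]
```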

CNN Details

We tested five CNNs in our analyses (see Table 1). They included both shallower networks (Alexnet and VGG-19) and deeper networks (Googlenet and Resnet-50; He et al., 2016; Szegedy et al., 2015; Simonyan & Zisserman, 2014; Krizhevsky et al., 2012). We also included a recurrent network, Cornet-S, that has been shown to capture the recurrent processing in macaque IT cortex with a shallower structure (Kar et al., 2019; Kubilius et al., 2019). This CNN has been recently argued to be the current best model of the primate ventral visual processing regions. All the CNNs used were trained with ImageNet images.

Table 1.

CNNs and Layers Examined

| CNN Name | Depth/Blocks | Total Layers | Layers Sampled | Sampled Layer Names and Locations (layer index in parentheses) |
| --- | --- | --- | --- | --- |
| Alexnet | 8 | 25 | 6 | 'pool1' (5), 'pool2' (9), 'pool5' (16), 'fc6' (17), 'fc7' (20), 'fc8' (23) |
| Cornet-S | 4 | 42 | 6 | 'V1_output' (8), 'V2_output' (18), 'V4_output' (28), 'IT_output' (38), 'decoder_avgpool' (39), 'decoder_output' (42) |
| Googlenet | 22 | 144 | 6 | 'pool1-3×3_s2' (4), 'pool2-3×3_s2' (11), 'pool3-3×3_s2' (40), 'pool4-3×3_s2' (111), 'pool5-7×7_s1' (140), 'loss3-classifier' (142) |
| Resnet-50 | 50 | 177 | 6 | 'max_pooling2d_1' (5), 'activation_10_relu' (37), 'activation_22_relu' (79), 'activation_40_relu' (141), 'avg_pool' (174), 'fc1000' (175) |
| VGG-19 | 19 | 47 | 8 | 'pool1' (6), 'pool2' (11), 'pool3' (20), 'pool4' (29), 'pool5' (38), 'fc6' (39), 'fc7' (42), 'fc8' (45) |

For each CNN, to streamline the main analyses, the first two sampled layers were averaged to represent the lower visual layers. Of the remaining sampled layers, those that showed the best correspondence with higher ventral visual regions as shown by Xu and Vaziri-Pashkam (2021a) were averaged to represent the higher visual layers.

Following O'Connell and Chun (2018), we sampled between six and eight mostly pooling and fully connected (FC) layers of each CNN (see Table 1 for the specific CNN layers sampled). Pooling layers were selected because they typically mark the end of processing for a block of layers when information is pooled to be passed on to the next block of layers. When there were no obvious pooling layers present, the last layer of a block was chosen. It has been shown previously that such a sampling procedure captures the evolution of the representation trajectory fairly well, if not fully, as adjacent layers exhibit identical or very similar representations (Taylor & Xu, 2021). For a given CNN layer, we created a set of activation patterns comparable to the fMRI response that included the same number of “runs” as the fMRI data. In each run, for each category, we extracted the full activation of the CNN layer to nine unique exemplars with one repeat, mimicking the 1-back repetition detection task used in the fMRI experiments, and then we performed a simple average of these activations to create an overall category representation. For each run, the repeated exemplar was randomly chosen. Cornet-S was implemented in Python. All other CNNs were implemented in MATLAB. The output from all CNNs was analyzed and compared with brain responses using Python and R.

Because many of the layers included tens of thousands of activation units, to prevent overfitting in the linear mapping procedure, following O'Connell and Chun (2018), we reduced the dimensionality of the activations using principal component analysis (PCA) and derived the linear mapping between two states of a transformation in the PCA space. Similar to the reliability-based voxel selection, we determined the PCA space using only the training data in each iteration of the leave-one-out cross-validation procedure and then projected the testing data onto these PCA dimensions to allow us to derive a predicted response pattern from the testing data. Within each CNN layer, we selected the maximum possible number of PCA components based on the size of the training set (15 [runs] × 6 or 8 [object categories] = 90 or 120 dimensions). We also derived a transformation matrix to convert the predicted CNN response patterns in the PCA space back to the original CNN activation space to allow us to directly compare the predicted and actual response patterns.
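
The dimensionality reduction and back-projection steps can be sketched as follows (a scikit-learn based illustration under our own naming assumptions, not the original scripts):

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_space(train_acts, max_components=120):
    """Fit PCA on the training 'runs' only.

    train_acts: (n_train_observations, n_units) CNN layer activations, where
        observations are training runs x object categories (90 or 120 here).
    """
    n_components = min(max_components, train_acts.shape[0], train_acts.shape[1])
    pca = PCA(n_components=n_components)
    pca.fit(train_acts)
    return pca

# Example use within one cross-validation fold:
# pca = fit_pca_space(train_acts)
# train_scores = pca.transform(train_acts)      # used to learn the mapping
# test_scores = pca.transform(test_acts)        # left-out data projected in
# predicted_acts = pca.inverse_transform(predicted_scores)  # back to unit space
```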

Deriving Linear Mapping between Two States of a Transformation

For the fMRI response data, we utilized an updated version of the linear mapping analysis used by Mocz and colleagues (2021) with a split-half analysis on the 75 most reliable voxels for each training iteration in each ROI (explained in more detail later in the Methods section). A split-half analysis helps to ensure that any differences between predicted and real patterns are not because of the noise across runs. Training and testing of the linear mapping were computed separately for each participant, each ROI, and each state of the transformation in each experiment. For each ROI, the data were first split into odd and even runs. Within each half of the data, a leave-one-out cross-validation procedure was conducted where one run served as the testing run whereas the remaining runs served as training runs. For Experiments 1–3, which had 16 runs, this meant that first the data were split into two groups with eight runs each and, within each group, there were seven training runs and one testing run. For Experiment 4, which had 18 runs, this meant that first the data were split into two groups with nine runs each and, within each group, there were eight training runs and one testing run. The training runs of the odd and even runs were used to select the top 75 most reliable voxels of an ROI, as described in the previous section. These 75 most reliable voxels were then applied to the testing runs. In addition, for Experiments 1–3, the number of object categories included in training ranged from 1 to 7 as these experiments included eight categories in total. For Experiment 4, the number of object categories included in training ranged from 1 to 5 as this experiment included six categories in total. All possible combinations of training categories were used.

During training of each leave-one-out cross-validation fold, we derived linear mapping matrices to link the two states of a transformation in both directions (e.g., from small to large and from large to small) using ridge regression. The end result was that the responses of the 75 voxels in one state, after being multiplied by the trained linear mapping matrix, would predict the responses of the 75 voxels in the other state. This was accomplished in the following way. For each object category, we first constructed two 75 [voxels] × (7 or 8) [training folds] matrices corresponding to the original and transformed states, respectively. If more than one object category was included in training, we concatenated each object category's matrix so that the full pattern matrix was 75 [voxels] × (7 or 8) [training folds] × (1 to 7) [training categories]. Using ridge regression, we derived the linear mapping β between the original state (matrix X) and the transformed state (matrix Y) as follows:

\beta = (X^{\top} X + \alpha I)^{-1} X^{\top} Y

where β is a 75 [voxels] × 75 [voxels] linear mapping matrix determining how each voxel in one transformation state should be weighted to predict each voxel in the other transformation state, α is the ridge regression penalty (set as α = 10 as detailed below), and I is the identity matrix. Regularizing the linear model with a penalty is necessary as there were fewer observations (Training Folds × Training Categories) than features (voxels). β was then applied to the left-out data to predict the response pattern in one transformation state given the response pattern of the other transformation state.
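
The closed-form ridge solution above can be written compactly as follows; in this sketch, observations (training folds × training categories) are rows and voxels are columns, so that β is a 75 × 75 matrix (illustrative code, not the original analysis scripts):

```python
import numpy as np

def fit_ridge_mapping(X, Y, alpha=10.0):
    """Linear mapping between two transformation states via ridge regression.

    X, Y: arrays of shape (n_observations, n_voxels) holding the patterns of
        the original and transformed states; n_observations = training folds x
        training categories, n_voxels = 75.
    Returns beta of shape (n_voxels, n_voxels), i.e., (X'X + alpha*I)^-1 X'Y.
    """
    n_voxels = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_voxels), X.T @ Y)

# Predicting the held-out pattern of the other state for each test category:
# Y_predicted = X_test @ beta    # X_test: (n_test_categories, n_voxels)
```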

We performed a modified linear mapping analysis for the CNNs where, instead of a split-half analysis, we did a simple leave-one-run-out cross-validation. Specifically, we used data from n − 1 “runs” (see the earlier descriptions on this) to train the linear mapping function between two states of a transformation. We then used the response pattern from one state of the transformation from the nth “run” to predict the response pattern of the other state of the transformation using the learned linear mapping function. Before learning the linear mapping, the dimensions of the activations were reduced with PCA, as described in the previous section. This meant that the full pattern matrix was (90 or 120) [PCA dimensions] × 15 [training folds] × (1 to 7) [training categories]. The predicted pattern was then converted back to the original CNN activation space to compare with the true pattern.

Evaluating the Predictions of the Learned Linear Mapping

To test how well the learned linear mapping could predict fMRI response patterns or CNN activations between the two states of a given transformation, we first generated predicted patterns using the learned linear mapping. We then compared the predicted and true patterns using Pearson correlation. Specifically, for the fMRI response patterns, within each half of the data, for each leave-one-out cross-validation fold, we first generated two predicted patterns (one for each state) for each object category from the left-out data by using the learned linear mapping matrices. We then correlated each predicted pattern from one half of the data with the corresponding true pattern based on the average of all the runs in the other half of the data (Figure 1F). For the CNN patterns, because we did not use a split-half analysis, for each leave-one-out cross-validation fold, we simply correlated each predicted pattern with the corresponding true pattern.

We further calculated a reliability measure for the fMRI responses. Because of the presence of measurement noise, even patterns of the same condition across odd and even runs may not show 100% correlation. Consequently, how well the predicted patterns may correlate with the true patterns across odd and even runs should only be assessed when we compare this to how true patterns are correlated across odd and even runs (which is a split-half reliability measure). Because noise may differ for the different brain regions, reliability may also differ. Thus, obtaining the reliability measure in different brain regions additionally allows us to correct for them in the response patterns and facilitate cross-region comparisons. We derived the reliability measure, which we called Averaged-run Ceiling, following the method used in Kietzmann and colleagues (2019). Specifically, within each ROI, we averaged all but one run within one half of the data (similar to how we used all but one run to train a linear function to generate a prediction) and averaged all runs in the other half of the data. We then correlated these two averaged patterns for each condition and transformation state and also averaged all the resulting correlations as our measure of split-half reliability for a given ROI. This measure of reliability involves the same amount of data that goes into generating the predicted pattern and the true pattern. This reliability measure allows us to ask: Is the predicted pattern derived from data not included in training using the trained linear function as good as the average of the data included in training (i.e., the average of the true patterns used for training)?
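
A sketch of how this Averaged-run Ceiling can be computed from the run-wise patterns (one plausible reading of the procedure; names and shapes are our assumptions):

```python
import numpy as np

def averaged_run_ceiling(half1_patterns, half2_patterns):
    """Split-half reliability in the spirit of the Averaged-run Ceiling.

    half1_patterns, half2_patterns: arrays of shape
        (n_runs_in_half, n_conditions, n_voxels), where conditions index
        category x transformation state.
    Average all but one run in one half (matching the amount of data used to
    train the mapping), average all runs in the other half, correlate the two
    averaged patterns per condition, then average the correlations.
    """
    half1_mean = half1_patterns[:-1].mean(axis=0)   # all but one run
    half2_mean = half2_patterns.mean(axis=0)         # all runs
    corrs = [np.corrcoef(half1_mean[c], half2_mean[c])[0, 1]
             for c in range(half1_mean.shape[0])]
    return float(np.mean(corrs))
```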

Besides examining the correlation between the predicted pattern and the corresponding true pattern, we also examined whether such a correlation was higher for same-category than for different-category pairs. This allowed us to further evaluate the category selectivity of the predicted pattern.
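
Concretely, selectivity can be summarized as the difference between same-category and different-category correlations of predicted and true patterns, for example (illustrative sketch):

```python
import numpy as np

def pattern_selectivity(predicted, true):
    """Same- minus different-category correlation between predicted and true patterns.

    predicted, true: arrays of shape (n_categories, n_voxels).
    """
    n = predicted.shape[0]
    corr = np.corrcoef(predicted, true)[:n, n:]   # predicted (rows) x true (cols)
    same = np.mean(np.diag(corr))
    different = np.mean(corr[~np.eye(n, dtype=bool)])
    return same - different
```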

Within each transformation type and each ROI/CNN layer, we also assessed the effect of category (i.e., prediction performance for categories included or not included in training) and the effect of training set size (i.e., how prediction performance depended on the number of categories included in the training data) and their interaction. These analyses allowed us to test the generalizability of the learned linear mapping: how well a linear mapping learned from one set of categories could successfully predict the patterns of categories not included in training.

Evaluating the Optimal fMRI Response and CNN Activation Patterns

Ward and colleagues (2018) set the ridge regression penalty α to 1 in their study. In an effort to replicate that study, we also set α to 1 in our previous study (Mocz et al., 2021). We noted that a very low α can lead to overfitting (i.e., the predicted patterns are too specific to the training data and do not fit the testing data well), whereas a very high α can lead to underfitting (i.e., the predicted patterns capture neither the training data nor the testing data well, with many of the pattern predictions dropping toward 0). Although an α of 1 seems to be a reasonable value, whether it is indeed the optimal parameter has not been systematically tested. In Mocz and colleagues (2021), we noticed that although we could successfully predict the patterns for both trained and untrained categories throughout the ventral visual stream, predictions for untrained categories did not improve with more categories included in training and in fact trended downward. This suggests that there was some overfitting of the data. Moreover, an optimal α for the brain data may not be optimal for the CNN data, given differences in data quality and noise. It is therefore critical that we test a range of different α values for both the brain and CNN data to find the optimal value for each.

For the brain data, to find the optimal α parameter, for each transformation and brain region, using both the lowest and highest training set sizes, we examined both pattern predictability and selectivity with α values of 1, 10, 100, and 1000. For the CNN data, because the training procedure was computationally intensive and long, on the basis of results from preliminary testing, we examined a smaller set of α values: 0.1, 10, and 1000. To visualize both predictability and selectivity on the same plot, and to account for the fact that the correlations for predictability were much higher than those for selectivity, we normalized the predictability score in each ROI and CNN layer by subtracting the predictability score when α was 1 (such that a value of 0 represented an α of 1 and all other values were relative to this value). We examined the changes in both predictability and selectivity for categories included in training and those not included in training as α varied. To provide a quantitative assessment, for the brain data, we performed paired t tests to evaluate whether there were significant differences in predictability and selectivity (separately for trained and untrained categories) between each successive pair of α values (i.e., α = 1 vs. α = 10, α = 10 vs. α = 100, and finally, α = 100 vs. α = 1000). The p values of these three pairs of comparisons were then corrected with the Benjamini–Hochberg method.
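
The Benjamini–Hochberg correction for such a small family of comparisons can be applied with standard tools, for example in Python via statsmodels (in R, which the authors used for their statistical tests, p.adjust(p, method = "BH") is the equivalent):

```python
from statsmodels.stats.multitest import multipletests

# p values from the three successive alpha comparisons (illustrative numbers)
pvals = [0.012, 0.048, 0.20]
reject, p_corrected, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
```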

To help us choose the optimal α value for the brain data, for each ROI, we counted the number of comparisons that were significantly different across the four transformations, where p < .05 (i.e., significant) counted as 1 point, .05 < p < .10 (i.e., marginally significant) counted as .5 points, and p > .1 (i.e., nonsignificant) counted as 0 points. We then divided this number by the total number of comparisons to calculate the percentage of comparisons that showed a difference (see summary plots in Figure 2). Predictability increased with a larger α for both the lowest and highest training set sizes, but this plateaued between α of 100 and α of 1000 for all of the ROIs (including those from OTC and PPC). Selectivity did not differ greatly across the different α values for the lowest training set size. However, selectivity for the largest training set sizes, particularly in LOT and VOT, although it changed very little from α of 1 to α of 10, decreased significantly for the larger α values. To maximize both predictability and selectivity, we chose an α of 10 for all of our analyses, as predictability did not increase much further beyond α of 10 whereas selectivity became worse. In addition, with an α of 10, we no longer saw the pattern of results in Mocz and colleagues (2021) where predictability for untrained categories slightly decreased when more categories were included in training, consistent with overfitting. Instead, we saw that predictability for untrained categories increased when more categories were included in training, showing that this new α did not cause overfitting. Otherwise, the overall results remained the same as in the original study (e.g., predictions were category dependent; higher-level visual regions displayed category selectivity that generalized to untrained categories but lower-level regions did not).

Figure 2.

The percentage of paired t tests with significant differences in predictability and selectivity (separately for trained and untrained categories) for each successive pair of α parameters (i.e., α of 1 vs. α of 10, α of 10 vs. α of 100, and finally, α of 100 vs. α of 1000) across the four transformations for both the largest and smallest training set sizes for each OTC and PPC region. When calculating the percentage, p < .05 (i.e., significant) counted as 1 point, .05 < p < .10 (i.e., marginally significant) counted as .5 points, and p > .1 (i.e., nonsignificant) counted as 0 points. InfIPS = inferior IPS; SupIPS = superior IPS.

For all CNNs other than Cornet-S, there was very little change in predictability and selectivity from α of 0.1 to α of 10, with a slight drop in performance from α of 10 to α of 1000. In Cornet-S, there was a significant improvement in predictability from α of 0.1 to α of 10 (in some cases, an increase of correlation greater than .4), but not much change from α of 10 to α of 1000. Selectivity was roughly the same for all tested values of α. These results show that α either does not matter or attains maximum performance at a value of 10. Consequently, we also chose α of 10 for all of the CNNs to be consistent with the brain data.

Relating Category Similarity to Pattern Prediction Generalization

In this analysis, we examined how the generalizability of linear mapping may depend on the similarity between the different categories in a given brain region or CNN layer. To do so, for the brain data, for each ROI, using the cross-validated split-half analysis described earlier, we first obtained the linear mapping between the two states of a given transformation using data from only one category as the training data to predict the response of all other categories from one state to the other state. We then correlated the predicted pattern and the true pattern of the same category across the two halves of the data, as described earlier. The resulting correlation coefficient was used as the prediction score for how well the training data of a given category could successfully predict the pattern of a different category. We repeated this analysis by including the data from each category as the training data to predict the response patterns of all other categories from one state to the other state. The results were averaged between the two states of a transformation, the two directions of predictions (i.e., using x to predict y and using y to predict x), and two halves of the brain data to construct a prediction similarity matrix in which each cell of the matrix reflects how well two categories may predict each other. To obtain the similarity between the object categories, we constructed a category similarity matrix with each cell of the matrix being the pairwise correlation of the true patterns of two categories across the two halves of the data. We vectorized the off-diagonal elements of these matrices and correlated them (see Figure 1G for a visualization). If the generalizability of transformations depends on the similarity between the different categories in a given brain region, then we expected to see a high correlation between prediction similarity and category similarity. We did not correct the prediction similarity matrix and the category similarity matrix by the split-half reliability of each ROI here for the brain data, as such correction would not affect the final correlation results (because values were z-normalized during correlation calculation and thus any scaling factor would have no effect).

For the CNN data, to create the prediction similarity matrix, for each sampled CNN layer, we again used data from one category to predict the response of all other categories across a transformation, and in lieu of a split-half analysis, the predicted pattern was correlated with the average true pattern of all the runs. We applied this analysis to all other categories from one state to the other state so that each cell of the matrix reflects how well two categories may predict each other. To create the category similarity matrix, which shows how initially similar each object category is to another, for each cell, for all potential leave-one-run-out folds, we took the correlation of the average true pattern of all but one of the runs for one category (to mimic the training data to create the predictions) and the average true pattern of all runs for another category and then averaged the correlations across the folds. We again vectorized the off-diagonal elements of these matrices and correlated them to determine the role of initial category similarity on pattern prediction in CNNs.

Experimental Design and Statistical Analyses

Seven, seven, six, and ten human participants took part in Experiments 1–4, respectively. These numbers were chosen based on prior published studies (e.g., Haxby et al., 2011; Kamitani & Tong, 2005). To simplify and streamline the analyses and the description of the results, we included only the average of the two lowest brain regions/CNN layers sampled and the average of the two highest brain regions/CNN layers sampled (see Table 1 for the layers and regions we averaged). Similarly, we included only the lowest and highest training set sizes in our main analyses. The factors described in the remaining Methods sections were evaluated at the group level using repeated-measures ANOVA and post hoc t tests. One-tailed t tests were performed when the comparison in one direction was meaningful. We corrected for multiple comparisons in all post hoc analyses using the Benjamini–Hochberg method (Benjamini & Hochberg, 1995). For the ANOVA results, we calculated effect size using a partial eta-squared measure (Cohen, 1973), because it is a less biased measure of effect size for ANOVAs that is comparable across study designs (Richardson, 2011). It is defined as

$$\eta_P^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{s/cells}}}$$

where $SS_{\text{effect}}$ is the sum of squares of the effect of interest and $SS_{\text{s/cells}}$ is the sum of squares for within-subject error. For the pairwise t tests, we calculated effect size using Cohen's d (Cohen, 1969, 1988).
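For illustration, both effect size measures can be computed with a few lines of Python (the sums of squares and pattern values below are hypothetical; Cohen's d is shown here in its paired form, the mean difference divided by the standard deviation of the differences):

import numpy as np

def partial_eta_squared(ss_effect, ss_error):
    # eta_p^2 = SS_effect / (SS_effect + SS_error), with SS_error the
    # within-subject error term for the effect of interest.
    return ss_effect / (ss_effect + ss_error)

def cohens_d_paired(x, y):
    # Cohen's d for a paired comparison: mean difference / SD of the differences.
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return diff.mean() / diff.std(ddof=1)

print(partial_eta_squared(ss_effect=12.4, ss_error=6.8))         # ~0.65
print(cohens_d_paired([0.91, 0.88, 0.93], [0.84, 0.82, 0.86]))   # effect size for a paired contrast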

To make a direct comparison between parietal regions and ventral regions as well as between CNNs and ventral regions, we obtained the following difference scores: lower-level visual regions/CNN layers versus higher-level visual regions/CNN layers, lowest training set sizes versus highest training set sizes, and untrained categories versus trained categories. We then compared these difference scores between parietal and ventral regions using a paired t test as well as between CNNs and ventral regions using a one-sample t test where the individual subject data of the ventral regions were compared to the CNN scores. Again, we corrected for multiple comparisons using the Benjamini–Hochberg method. All of the above statistical tests were conducted using R (R Core Team, 2018).
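Although these tests were run in R, the same steps can be sketched in Python for illustration (the difference scores below are hypothetical; statsmodels' fdr_bh option implements the Benjamini–Hochberg procedure):

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical per-participant difference scores (e.g., trained minus untrained predictability).
parietal = np.array([0.12, 0.09, 0.15, 0.11, 0.08, 0.13])
ventral = np.array([0.05, 0.07, 0.06, 0.04, 0.09, 0.05])
cnn_score = 0.02  # hypothetical difference score for one CNN

# Parietal vs. ventral regions: paired t test on the per-participant difference scores.
t_pv, p_pv = stats.ttest_rel(parietal, ventral)

# CNN vs. ventral regions: one-sample t test of the ventral scores against the CNN score.
t_cv, p_cv = stats.ttest_1samp(ventral, cnn_score)

# Correct the family of p values with the Benjamini-Hochberg procedure.
reject, p_adj, _, _ = multipletests([p_pv, p_cv], alpha=0.05, method="fdr_bh")
print(p_adj, reject)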

RESULTS

Previous fMRI work has shown that we can derive linear mapping functions to successfully predict visual object responses across both Euclidean and non-Euclidean transformations throughout the human ventral visual processing stream (Mocz et al., 2021; Ward et al., 2018). Because a diverse array of visual object information also exists in the primate dorsal pathway in PPC, in this study, we evaluated whether a similar approach may be used to predict object responses across transformations in the human dorsal visual stream. The regions we investigated included topographically defined parietal regions V3a, V3b, and IPS0–IPS4 as well as functionally defined inferior IPS and superior IPS. As CNNs are considered by some as the current best models of the primate ventral visual system, we performed a similar analysis on the visual object responses formed in five CNNs pretrained using ImageNet images (Deng et al., 2009) to perform object categorization. These CNNs are Alexnet, Cornet-S, Googlenet, Resnet-50, and VGG-19.

We used data from two existing fMRI data sets (Vaziri-Pashkam & Xu, 2019; Vaziri-Pashkam et al., 2019) where participants viewed objects undergoing two Euclidean transformations (i.e., position and size) and two non-Euclidean transformations (i.e., changes in image statistics and SF). During each run of the experiments, human participants viewed blocks of images, each containing 10 exemplars from one of six or eight real-world object categories (faces, houses, bodies, cats, elephants, cars, chairs, and scissors), and performed a 1-back repetition detection task. We selected the 75 most reliable voxels from each ROI in our analysis to equate the number of voxels across ROIs and to increase power (Tarhan & Konkle, 2020). Note that all results remained very similar when we included all voxels from each ROI, indicating the stability and robustness of the results, which did not depend on including the most reliable voxels. We extracted the fMRI response patterns corresponding to each object category in each run. Using a split-half approach to account for noise (see Methods), following Ward and colleagues (2018) and Mocz and colleagues (2021), we derived a linear mapping function to predict category responses across two states of a transformation. For all CNNs, we created a comparable set of activation patterns following the structure of the fMRI data (see Methods). For the CNN data, we used a leave-one-run-out cross-validation training and testing procedure without a split-half approach, as the CNN activations contained no noise to account for.

We evaluated the predicted patterns in two different ways: how well they correlated with the true patterns and whether they showed category selectivity (i.e., were more similar to the true patterns of the same than different categories). We also examined the effect of training category (i.e., whether or not a category was included in the training data), the effect of training set size (i.e., the number of categories included in the training data), the effect of ROI/CNN layer (i.e., whether or not the effect differed among the different brain regions or CNN layers), and their interactions. We further compared these three effects between parietal regions and ventral regions as well as between CNNs and ventral regions to understand the similarities and differences between these visual systems. We additionally examined how the ability to use the response of one category to predict that of another category was determined by the similarity between these two categories in a given brain region/CNN layer. Before conducting the main analyses, we conducted an extensive set of tests to select the optimal α penalty parameter for the ridge regression used in our analyses (see Methods for more details). To maximize both pattern predictability and category selectivity, we chose α of 10 for analyzing both the brain and CNN data.

Because inferior IPS and superior IPS overlap to a great extent with lower and higher parietal topographic areas, to streamline the report of the results, we only included the results for inferior IPS and superior IPS in the main results. To compare and contrast with the results from the PPC regions, we also included results from OTC regions. Because the latter results have been reported extensively in a prior study (Mocz et al., 2021), we only included results from the low visual areas (i.e., the average of areas V1 and V2) and higher visual areas (i.e., the average of LOT and VOT) in the main analysis and comparisons. Likewise, to streamline the report of the results and the comparison between CNNs and regions in OTC, we only included the results from lower CNN layers (i.e., the average of the first two sampled layers) and higher CNN layers (i.e., the average of the two higher layers that showed the best correspondence with the brain data as shown by Xu & Vaziri-Pashkam, 2021a). To further streamline the report of the results, we only included analyses from the lowest and highest training set sizes. In the full set of results, changes from lower to higher visual layers and changes from lowest to highest training set sizes were typically monotonic, allowing us to streamline the report of the results without much loss of information.

Evaluating the Predicted fMRI Response and CNN Activation Patterns

In this analysis, we examined how well the predicted fMRI or CNN response patterns correlated with the actual pattern. If identity and nonidentity information are represented orthogonally or near orthogonally, then the predicted pattern should show a significant correlation with the actual pattern.

PPC Regions

To document the responses from the PPC regions, we examined responses from inferior and superior IPS as they overlap substantially with lower and higher IPS topographic areas, respectively (Bettencourt & Xu, 2016a). These two IPS regions have been shown to be involved in different aspects of visual information processing, with inferior IPS participating in object individuation/selection and superior IPS participating in object identification and VWM storage (Xu, 2017, 2020, 2021; Bettencourt & Xu, 2016b; Jeong & Xu, 2012; Xu & Chun, 2006, 2009; Todd & Marois, 2004, 2005).

Overall, we found that a linear mapping was able to capture a significant amount of the changes associated with the transformations throughout the dorsal visual stream, but predictions were tailored to categories included in training, suggesting that identity and nonidentity information are represented in a near-orthogonal manner. There was also no significant difference between inferior and superior IPS. The full details of the results are described below.

Following the procedure of Mocz and colleagues (2021), to compare across ROIs and to account for differences in data reliability across the different ROIs, we calculated a split-half reliability measure, Averaged-run Ceiling, that involved the correlation of the training runs in one half of the runs with the average of all the runs in the other half (see Methods for details). Comparison with Averaged-run Ceiling allowed us to evaluate whether the predicted patterns are as good as the true patterns. We normalized the correlation coefficients between the predicted and true patterns by dividing them by the Averaged-run Ceiling to then compare across ROIs. The results were evaluated at the participant group level using statistical tests.
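The normalization step can be illustrated with the simplified sketch below, in which each half of the runs is averaged and correlated to give the ceiling, and a raw prediction-to-true correlation is then divided by that ceiling (all arrays and values are hypothetical stand-ins; the exact averaging scheme follows the Methods):

import numpy as np

def averaged_run_ceiling(half1_runs, half2_runs):
    # Correlate the mean pattern of the runs in one half with the mean pattern
    # of the runs in the other half (a simplified stand-in for the split-half
    # reliability measure described in the text).
    return np.corrcoef(half1_runs.mean(axis=0), half2_runs.mean(axis=0))[0, 1]

rng = np.random.default_rng(2)
half1 = rng.standard_normal((8, 75))                 # hypothetical: 8 runs x 75 voxels
half2 = half1 + 0.5 * rng.standard_normal((8, 75))   # second half, sharing signal with the first

ceiling = averaged_run_ceiling(half1, half2)
raw_r = 0.62                                         # hypothetical prediction-to-true correlation
print(f"ceiling = {ceiling:.2f}, normalized predictability = {raw_r / ceiling:.2f}")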

If identity and nonidentity information are completely orthogonal to each other, a linear mapping would explain a significant amount of variance of the true data (i.e., the correlation between the predicted and true patterns would be significantly greater than 0) and would predict the patterns of categories included and not included in training equally well. If, instead, identity and nonidentity information are near orthogonal, the linear mapping would explain a significant amount of variance of the true data, but predictions would be significantly better for categories included than those not included in training. Finally, if identity and nonidentity information are nonorthogonal, then the learned mapping would not generalize to categories not included in training (i.e., the correlation between the predicted and true patterns would be no different from 0 for categories not included in training).

For all four transformation types and across both the training category and training set size manipulations, we found that normalized predicted and true pattern correlations were overall quite high, ranging between .85 and .98, and were all significantly above 0 (Figure 3; ts > 14.73, ps < .001, ds > 14.9; one-tailed, as only a comparison in one direction was meaningful here and in all similar analyses reported below; all pairwise comparisons reported here and below were corrected for multiple comparisons using the Benjamini–Hochberg method; see Benjamini & Hochberg, 1995). However, nearly all of the correlations were significantly less than 1 across the different ROIs, training categories, training set sizes, and transformations (ts > 2.21, ps < .05, ds > 2.2). Thus, although a linear mapping could capture a significant amount of the changes associated with these transformations, the predicted patterns were generally not as good as the average of the true patterns used for training. This shows that, if the goal is simply to obtain the best estimate of the true pattern, it is better to use the average of the true patterns than to predict it through training and a linear mapping function.

Figure 3.

Pattern predictability results plotted by brain region/CNN and transformation type showing pattern prediction for categories included in the training data (Trained Categories) and those that were not (Untrained Categories) as a function of the number of categories included in the training data. All predictions were calculated using the optimal penalty in the ridge regression (with α of 10). Pattern predictability is computed as the correlation between the predicted and actual patterns. Predictability from a brain region is further normalized by the corresponding noise ceiling to facilitate comparisons across brain regions. Results from the brain were computed from the 75 most reliable voxels in each region. In OTC, results from lower visual areas were the averaged results of V1 and V2, whereas results from higher visual areas were the averaged results of LOT and VOT. In PPC, lower and higher visual areas corresponded to inferior and superior IPS, respectively. For CNNs, results from lower CNN layers were the average results of the first two sampled layers, whereas results from higher CNN layers were the average of two higher CNN layers showing the best correspondence with higher OTC regions as determined by a prior study. The error bar represents the between-participant 95% confidence interval of the mean.

To better understand the generalizability of linear mapping in pattern prediction, within each transformation, using ANOVA, we also examined the effects of ROI, Training Set Size, Training Category, and their interactions. In most cases, there was no significant main effect of ROI or Training Set Size, except for SF, where predictability was higher for inferior than superior IPS and higher for the lowest than the highest training set size (Fs > 5.59, ps < .04, ηP² > .38; Fs < 2.37, ps > .18, ηP² < .32 for all others). Meanwhile, there was a consistent main effect of Training Category across all transformations (Fs > 11.36, ps < .05, ηP² > .65), with predictions being better for categories included than those not included in training, providing evidence for a near-orthogonal representation of identity and nonidentity information. There were also interactions between Training Category and Training Set Size for all the transformations (Fs > 11.04, ps < .05, ηP² > .59), such that predicted patterns for categories not included in training improved with a larger training set size, showing greater generalizability with more categories included in training.

These PPC results show some similarities to our previously reported OTC results (Mocz et al., 2021; see Figure 3). Similar to the PPC regions, within OTC regions, we previously found that for all transformations, a linear mapping could capture a significant amount, but not all, of the variance of an object category's fMRI pattern between two states of a transformation. Furthermore, pattern predictions tended to be better for the categories included in the training data, but there was typically no main effect of including more categories in the training data. However, just like PPC, OTC also showed an interaction effect of Training Category and Training Set Size such that predictions improved for categories not included in training with a larger training set size. There were no consistent differences between lower and higher visual regions in either OTC or PPC. Overall, the linear mapping functions derived during training are not entirely category independent but interact with the specific categories included in training, suggesting that identity and nonidentity information are represented near-orthogonally in both the human ventral and dorsal visual systems.

We do not think that this near-orthogonal representation is because of overfitting during the training procedure. Had the object representational space contained category-independent orthogonal representations of the nonidentity features, when more training data were included with increasing training set size, we would expect to see better prediction performance regardless of whether or not a category was included in the training data. However, such a benefit was only seen for categories not included in training.

To directly compare the PPC results with those from the OTC, we obtained three scores in both the OTC and PPC: (1) the difference in predictability between the lowest and highest training set sizes, (2) the difference in predictability between lower and higher visual regions, and (3) the difference in predictability between trained and untrained categories. We then directly compared these scores for each transformation and for each of the four possible training conditions (i.e., predictions for categories included in training within lower visual layers, predictions for categories included in training within higher visual layers, predictions for categories not included in training within lower visual layers, and predictions for categories not included in training within higher visual layers) with paired t tests (see Table 2 for a summary of the results). Of the 48 total possible comparisons, most were not significant (68.75%). Where differences were found, they largely reflect quantitative, rather than qualitative, differences between the two brain regions.

Table 2.

Summary of Statistical Results Comparing the Predicted Patterns between PPC and OTC as Well as between CNNs and OTC

Difference Transformation Correlation Type Predictability Selectivity
PPC Alexnet Cornet-S Googlenet Resnet-50 VGG-19 Parietal Alexnet Cornet-S Googlenet Resnet-50 VGG-19
Regions (lower–higher) Position Set Size 1, trained             *     *    
Set Size 1, not trained             * **       **
Set size max, trained             *     **    
Set size max, not trained             *      
Size Set Size 1, trained             *   *** *** *** *
Set Size 1, not trained     ** ** *   *   *** *** **  
Set size max, trained             ** *** *** *** **
Set size max, not trained     ** ** *   **   **      
Image stats Set Size 1, trained             * ** *** *** ** *
Set Size 1, not trained   * *** ** * *   * *** *** ** *
Set size max, trained           * ** *** *** ** *
Set size max, not trained   * *** ** * * **   ** *    
SF Set Size 1, trained *             *** ***   ***  
Set Size 1, not trained * ** ***   **     ** ***   * *
Set size max, trained *           * *** ***   ***  
Set size max, not trained *   ***   *   * *** ***  
 
Training set size (1–max) Position Lower, trained             * *** *** *** *** ***
Lower, not trained ** ***   **   ** ** ***   ***   ***
Higher, trained   * * * * *   ** ** ** ** **
Higher, not trained     *            
Size Lower, trained   * * * * *   *** *** *** *** ***
Lower, not trained   *** ***   *** * *** * ** *  
Higher, trained   * * * * *   ** ** ** ** **
Higher, not trained * * ** * * **        
Image stats Lower, trained   * * * * * * ** ** ** ** **
Lower, not trained       *   **
Higher, trained   ** ** ** ** **   *** *** *** *** ***
Higher, not trained *   *     **      
SF Lower, trained   * * * * * * *** *** *** *** ***
Lower, not trained   *** * ***   ***   ***   **    
Higher, trained   * * * * * *** *** *** *** ***
Higher, not trained **   *** *** *** *** *** * **      
 
Training (not trained–trained) Position Lower, Set Size 1 ***   *** *** * ***   *** * ***
Lower, set size max   *** ** ***   *** * *** *** *** *** ***
Higher, Set Size 1       *         **    
Higher, set size max     *** ***   * *** *** ** **
Size Lower, Set Size 1   ***   *** * *** ***   *** ** ***
Lower, set size max   *** *** *** ** *** * *** *** *** *** ***
Higher, Set Size 1 * ** *** ** ** ** *** *** ** ***
Higher, set size max   *** *** *** *** ***   *** *** *** *** ***
Image stats Lower, Set Size 1         *         **
Lower, set size max   * *** ** * ***   *** *** ***   ***
Higher, Set Size 1 ** * *** ***   * ** ** *** ***   **
Higher, set size max * *** *** *** *** *** ** *** *** *** *** ***
SF Lower, Set Size 1   *** ***   *** ***   ***   ***
Lower, set size max   *** *** *** * ***   *** *** *** *** ***
Higher, Set Size 1 *   *** *** * ** *   *** *** ** ***
Higher, set size max   ** *** *** *** *** * *** *** *** *** ***
  Percentage of filled cells, total 31.25 56.25 62.5 70.83 58.33 58.33 75 72.92 77.08 79.17 66.67 68.75
Percentage of filled cells, region effect 31.25 18.75 37.5 25 37.5 12.5 81.25 56.25 81.25 68.75 56.25 50
Percentage of filled cells, set size effect 31.25 75 75 93.75 62.5 81.25 62.5 81.25 81.25 75 68.75 62.5
Percentage of filled cells, training effect 31.25 75 75 93.75 75 81.25 81.25 81.25 68.75 93.75 75 93.75

The predicted patterns were evaluated in terms of their pattern correlation with the true response patterns and their category selectivity (i.e., testing if the predicted pattern was more similar to the true pattern of the same than different categories). In both analyses, we obtained difference scores for three contrasts in both brain regions and CNNs: lower-level visual regions versus higher-level visual regions, lowest training set sizes versus highest training set sizes, and categories included versus those not included in training. We then compared these difference scores across PPC and OTC regions using a paired t test as well as across CNNs and OTC regions using a one-sample t test where the individual subject data of the OTC regions were compared against the average score of the CNN across the leave-one-out cross-validation folds (see Methods). The comparisons that reached significance are marked with asterisks. All tests were corrected for multiple comparisons using the Benjamini–Hochberg method. At the bottom of the table, the total percentage of significant differences (i.e., filled cells), as well as the percentage of significant differences for each of the three contrasts, is shown. For the CNNs, the smallest percentages for an effect are bolded, indicating that the CNN has the fewest number of differences from OTC.

∼ .05 < p < .10. * p < .05. ** p < .01. *** p < .001.

CNN Layers

We performed a similar analysis within CNN layers to understand how well linear mapping functions can capture CNN activation pattern differences across object transformations at different layers. For each CNN, we averaged the first two sampled layers, and we also averaged the two higher layers that showed the best correspondence with the brain data as shown by Xu and Vaziri-Pashkam (2021a; see Table 1 for the CNN layers selected and averaged). As CNNs do not have noise, it was not necessary to perform a split-half analysis to normalize by reliability. Instead, we used a leave-one-run-out cross-validation procedure (see Methods).

For all four transformation types and all CNNs, the correlation between the predicted and true patterns for categories included in training was virtually equal to 1 for both the lowest and highest training set sizes, showing that linear mapping could fully capture the variance associated with the transformations (see Figure 3). Correlations for categories not included in training were overall quite high, but lower than those for categories included in training. For categories not included in training, with the exception of the lower layers of Resnet-50 for the position transformation, whose correlation was 1, all other correlations ranged from .55 to just below 1. For all CNNs and transformations, predictability either stayed the same or increased from the lowest to the highest training set size, showing that performance tended to improve with more training data. For 17 of the 20 combinations of CNN and transformation type, predictability was better for lower than higher layers for categories not included in training (the exceptions are Alexnet for the SF transformation as well as VGG-19 for the position and SF transformations).

These results show both similarities and differences between CNNs and OTC regions. For both CNNs and OTC, predictability was better for categories included than those not included in training. Furthermore, predictability for categories not included in training improved with more categories included in training for both CNNs and OTC. However, whereas the variance could never be fully captured by linear methods in the OTC regions, it was fully captured in all CNNs for categories included in training. In addition, in the vast majority of cases, lower CNN layers showed better performance than higher layers, whereas OTC did not consistently show significant differences between lower and higher visual regions. Again, we do not think that overfitting played a role here, because with more categories included in training, predictions only improved for the categories not included in training.

To quantify the observed differences above, we performed a similar comparison between CNNs and OTC regions as we did between PPC and OTC regions. Instead of a paired t test, here we performed a one-sample t test, comparing all of the subjects from the OTC regions with the average CNN score across all cross-validation runs (see Table 2 for a summary of the results). Alexnet showed the fewest differences with OTC regions, with 21 of 48 possible comparisons not significantly different (43.75%), followed by Resnet-50 (20/48 or 41.67%), VGG-19 (20/48 or 41.67%), Cornet-S (18/48 or 37.5%), and Googlenet (14/48 or 29.17%). Alexnet thus appeared to be most similar to the ventral visual stream overall in how well a linear mapping may capture responses after transformation, with Resnet-50 and VGG-19 as close competitors.

Evaluating the Selectivity of Predicted fMRI Response and CNN Activation Patterns

A successful linear prediction across a transformation should not only produce predicted patterns that are similar to the true patterns but also predicted patterns that are more similar to the true patterns of the same than of different categories in an ROI or a CNN layer. This would demonstrate that a linear mapping is capable of generating a prediction across a transformation that shows category selectivity. As in Mocz and colleagues (2021), to evaluate this, we tested whether the predicted pattern correlated more with the true patterns of the same than of different categories. We computed a correlation difference score as our measure of selectivity and tested whether it was greater than 0.
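The correlation difference score can be illustrated with the minimal sketch below, in which a predicted pattern is compared against the true pattern of its own category and against the true patterns of the other categories (all patterns are random stand-ins for illustration):

import numpy as np

def category_selectivity(pred, true_same, true_others):
    # Correlation of the predicted pattern with the true pattern of the same category,
    # minus its mean correlation with the true patterns of the other categories;
    # selectivity is present when this difference is greater than 0.
    r_same = np.corrcoef(pred, true_same)[0, 1]
    r_other = np.mean([np.corrcoef(pred, other)[0, 1] for other in true_others])
    return r_same - r_other

rng = np.random.default_rng(3)
true_same = rng.standard_normal(75)                   # true pattern of the predicted category
pred = true_same + 0.8 * rng.standard_normal(75)      # hypothetical predicted pattern
others = [rng.standard_normal(75) for _ in range(7)]  # true patterns of the other categories
print(f"selectivity = {category_selectivity(pred, true_same, others):.3f}")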

PPC Regions

Overall, we found significant selectivity in superior and inferior IPS for categories included in training, but not for categories not included in training. For most transformations, there was no difference between superior and inferior IPS for selectivity. The full details of the results are described below.

For the categories included in training, correlation differences for all transformation types and ROIs, particularly for lower training set sizes, were significantly greater than 0 (see Figure 4; ts > 2.789, ps < .05, ds > 1.1). For categories not included in training, both superior and inferior IPS showed correlation differences that were either less than 0 (ts > 3.97, ps < .05, ds > 1.47) or not different from 0 (ts < 2.71, ps > .1, ds < 0.46). These results showed that, when a category was included in the training data, category selectivity was preserved in the predicted patterns across all ROIs, training set sizes, and transformations. However, when a category was not included in the training data, category selectivity was not preserved.

Figure 4.

Category selectivity results plotted by brain region/CNN and transformation type showing pattern prediction for categories included in the training data (Trained Categories) and those that were not (Untrained Categories) as a function of the number of categories included in the training data. All predictions were calculated using the optimal penalty in the ridge regression (with α of 10). Category selectivity is calculated as the correlation difference between the predicted and true patterns of the same category and that of different categories. Results from the brain were computed from the 75 most reliable voxels in each region. In OTC, results from lower visual areas were the averaged results of V1 and V2, whereas results from higher visual areas were the averaged results of LOT and VOT. In PPC, lower and higher visual areas corresponded to inferior and superior IPS, respectively. For CNNs, results from lower CNN layers were the average results of the first two sampled layers, whereas results from higher CNN layers were the average of two higher CNN layers showing the best correspondence with higher OTC regions as determined by a prior study. The error bar represents the between-participant 95% confidence interval of the mean.

To better understand the generalizability of linear mapping in pattern selectivity, within each transformation, using ANOVA, we examined the effects of ROI, Training Set Size, Training Category, and their interactions. The effect of ROI was only present for the image stats transformation, with inferior IPS showing greater selectivity than superior IPS, F(1, 5) = 50.54, p < .001, ηP² = .91, and was absent for all other transformations (Fs < 1.84, ps > .22, ηP² < .23). For all transformations, there was a consistent main effect of Training Set Size (Fs > 21.76, ps < .01, ηP² > .78), where selectivity was overall better for the lowest than for the highest training set size. There was also a consistent main effect of Training Category (Fs > 21.38, ps < .01, ηP² > .81), with selectivity being higher for categories included than those not included in training. There were also interactions between Training Category and Training Set Size for all of the transformations (Fs > 22.76, ps < .01, ηP² > .81), such that the difference in selectivity between the trained and untrained categories decreased for the largest set size. This was driven both by categories not included in training showing the same or better selectivity with increasing training set size and by categories included in training showing the same or worse selectivity with increasing training set size. Overall, these results are consistent with a near-orthogonal representational structure of identity and nonidentity information, with the linear mapping function more tailored to the categories included than those not included in training. This is because, with only a single category X included in training, the linear mapping function is highly tailored to X. When more categories are included in training and the representational space structure is not completely orthogonal, selectivity (and predictability) for X will decrease. For the same reason, training on X would not yield the best selectivity (or predictability) for Y. However, when more categories (other than Y) are included in training, the chance that a category similar to Y is included in training increases. Thus, increasing the number of categories included in training (other than Y) would increase selectivity (and predictability) for Y.

These results show both similarities and differences to the lower (average of V1 and V2) and higher (average of LOT and VOT) visual regions of OTC (Mocz et al., 2021; see Figure 4 for a full summary of results). Similar to the two PPC regions examined, within OTC regions, we found significant selectivity in all ROIs for categories included in training. However, for categories not included in training, we found significant selectivity in higher-level OTC regions, particularly when more categories were included in training, but not in either superior or inferior IPS. In this regard, superior and inferior IPS are more similar to lower than higher visual regions of OTC.

To quantify the differences between OTC and PPC regions, we also directly compared these brain regions. We computed the same three difference scores as we did for pattern predictability and compared them between OTC and PPC regions (see Table 2 for the complete stats). Of 48 possible comparisons, most (36, or 75%) were significant. Thus, there is some similarity between OTC and PPC regions in how selectivity is affected by the number of training categories, particularly for categories included in training, but less similarity in the effects of higher versus lower visual regions and of trained versus untrained categories. This is likely because, although selectivity improves in higher-level visual regions in OTC, there is not much difference between superior and inferior IPS. Furthermore, for large training set sizes, selectivity for untrained categories greatly improves in higher-level visual regions in OTC, but we do not see the same in PPC regions.

CNN Layers

We performed a similar analysis within CNN layers to understand how well linear mapping functions can capture CNN activation pattern selectivity across object transformations at different layers. To test whether there was some amount of category selectivity, we compared the correlation difference scores to 0.

For categories not included in training, for almost all of the CNNs and layer types, selectivity was very close to or under 0 for the lowest training set size but generally increased with training set size. Whereas similar selectivity was obtained regardless of layers at the lowest training set size, at the largest training set size, selectivity in general increased from the lowest to the highest layers in Alexnet, Googlenet, Resnet-50, and VGG-19, but not in Cornet-S. For categories included in training, selectivity tended to be greater than 0, generally better than the selectivity of categories not included in training, and greater for higher than lower layers for all the networks. Meanwhile, there was no effect of training set size. For most of the CNNs, the difference between trained and untrained categories decreased with more training categories.

These results show both similarities and differences between CNNs and OTC regions. For both CNNs and OTC, higher-level visual layers/regions showed better category selectivity and better generalization to untrained categories than lower-level ones. In addition, selectivity was tailored to the categories included in training, but the difference between trained and untrained categories decreased at larger training set sizes for both CNNs and OTC. Meanwhile, the magnitudes of these effects differed significantly between CNNs and OTC. For example, for CNNs, there was virtually no difference in selectivity between the lowest and highest training set sizes for categories included in training, whereas there was a difference in OTC. There was also a much greater difference between higher and lower layers in CNNs than between higher and lower regions in OTC.

To quantify the differences between CNN layers and OTC regions, we performed a similar comparison as before (see Table 2 for a summary of the results). Resnet-50 showed the fewest differences from OTC regions, with 32 of 48 possible comparisons significantly different (66.67%), followed by VGG-19 (33/48, or 68.75%), then Alexnet (35/48, or 72.92%), Cornet-S (37/48, or 77.08%), and finally Googlenet (38/48, or 79.17%). Overall, it appears that Resnet-50 is the most similar to the ventral visual stream for pattern selectivity with linear mapping methods, followed closely by VGG-19.

Relating Category Similarity to Pattern Prediction Generalization

Given that predictions from linear mapping depended on the categories included in training in both OTC and PPC regions as well as in CNNs, here we examined if categories that are similarly represented in a given ROI or CNN layer could also better predict each other's response patterns with linear mapping. This would provide further evidence for a near-orthogonal representation of identity and nonidentity information, such that, in addition to a linear component, a nonlinear component tailored to each category is needed to fully predict the true pattern of a category after a transformation. For the fMRI response, in each ROI, using the same linear mapping procedure, we used data from one category to predict the responses of all other categories from one transformation state to the other state. We then correlated the predicted pattern and the true pattern of the same category across the two halves of the data. This analysis was rotated across all the categories with each serving as the training category to predict the other categories. The results were averaged between the two states of a transformation and the two directions of predictions (i.e., using A to predict B and using B to predict A) and were used to construct a prediction similarity matrix in which each cell of the matrix reflects how well two categories may predict each other. To obtain the similarity between the object categories, we constructed a category similarity matrix with each cell of the matrix being the pairwise correlation of the true patterns of two categories across the two halves of the data within the same state of the transformation averaged over the two states. We vectorized the off-diagonal elements of these matrices and correlated them.

For the CNN data, to create the prediction similarity matrix, for each layer of the CNN, we again used data from only one category to then predict the response of all other categories, and in lieu of a split-half analysis, the predicted pattern was correlated with the average true pattern of all the runs. We applied this analysis to all other categories from one state to the other state so that each cell of the matrix reflects how well two categories may predict each other. To create the category similarity matrix, which shows how initially similar each object category is to another, each cell is the average correlation across all possible leave-one-run-out folds of the average true pattern of all but one of the runs for one category (to mimic the training data to create the predictions) and the average true pattern of all runs for another category. We again vectorized the off-diagonal elements of these matrices and correlated them to determine the role of initial category similarity on pattern prediction in CNNs.

PPC Regions

For both superior and inferior IPS and for all transformations, pattern prediction showed a high correlation with category similarity, with the correlation ranging between .7 and .96. At the same time, these correlations were all significantly less than 1 (see the significance levels reported in Figure 5 for all the comparisons). Paired t tests revealed that inferior IPS showed a greater effect of category similarity than superior IPS (ts > 2.45, ps < .05, ds > 0.89).

Figure 5.

The correlation between category similarity and pattern predictability. All predictions were calculated using the optimal penalty in the ridge regression (with α of 10). Results from the brain were computed from the 75 most reliable voxels in each region. In OTC, results from lower visual areas were the averaged results of V1 and V2, whereas results from higher visual areas were the averaged results of LOT and VOT. In PPC, lower and higher visual areas corresponded to inferior and superior IPS, respectively. For CNNs, results from lower CNN layers were the average results of the first two sampled layers, whereas results from higher CNN layers were the average of two higher CNN layers showing the best correspondence with higher OTC regions as determined by a prior study. For each brain region, significant values for pairwise t tests against 1 are marked with asterisks at the top of each plot. All t tests were corrected for multiple comparisons using the Benjamini–Hochberg method within each transformation type, totaling two comparisons (higher vs. lower layers). ∼.05 < p < .10, *p < .05, **p < .01, ***p < .001. The error bar represents the between-participant 95% confidence interval of the mean.

Previously, we showed that OTC regions also showed a high correlation between pattern prediction performance and category similarity (Mocz et al., 2021), with the correlation being higher in higher than lower OTC regions, the opposite of what was present for the PPC regions. A direct comparison between these two regions showed that the difference between the lower and higher regions in OTC and PPC was significant for all transformations (see Table 3 for the detailed stats).

Table 3.

Summary of Statistical Results Evaluating the Correlation between Pattern Prediction Performance and Category Similarity for OTC, PPC, and CNNs

Transformation Parietal Alexnet Cornet-S Googlenet Resnet-50 VGG-19
Position ** *** ** ** ** **
Size *** *** ** *** *** ***
Image stats * * *** * *
SF *** * * * *

In this analysis, we obtained difference scores between lower and higher visual regions/CNN layers. We then compared these scores across PPC and OTC using a paired t test as well as across CNNs and OTC using a one-sample t test where the individual subject data from OTC were compared to the average score of the CNN across the leave-one-out cross-validation folds (see Methods). The comparisons that reached significance are marked with asterisks. All tests were corrected for multiple comparisons using the Benjamini–Hochberg method.

∼ .05 < p < .10. * p < .05. ** p < .01. *** p < .001.

CNN Layers

The correlations between pattern prediction performance and category similarity were high for all transformations and layer types for each of the CNNs, ranging between .8 and 1. However, unlike in OTC, here we did not see higher correlations for higher than lower layers. Rather, the correlations either stayed roughly the same or decreased from lower to higher layers for 16 of the 20 combinations of CNN and transformation type. When we quantitatively compared the two as we did between OTC and PPC, we found a significant or marginally significant difference in all cases, with Cornet-S showing the most similar qualitative pattern to the human ventral visual stream (see Table 3 for the detailed stats).

DISCUSSION

In the present study, we used a representational transformation analysis to examine the representation of object identity and nonidentity information in both human PPC and CNNs pretrained for object classification, building off of previous work that has shown that a linear mapping function can predict, to a significant extent, object representations after different Euclidean and non-Euclidean transformations in the entire human ventral visual stream (Mocz et al., 2021; Ward et al., 2018). Specifically, we examined and compared object category responses from inferior IPS and superior IPS to both Euclidean (i.e., changes in position and size) and non-Euclidean (i.e., changes in image stats and SF) transformations of objects. We also examined responses from five pretrained CNNs (Alexnet, Cornet-S, Googlenet, Resnet-50, and VGG-19). If the representations of object identity and nonidentity information are orthogonal, then the predicted patterns from the linear mapping functions should show a significant correlation with the true patterns and there should be no difference in correlation for categories included and not included in training. If the representation is only near orthogonal, the predicted patterns from the linear mapping functions should still show a significant correlation with the true patterns, but predictions would be significantly better for categories included in training. We found that, just like OTC, human PPC and CNNs can link object responses in different states of nonidentity transformations through linear mapping functions for both Euclidean and non-Euclidean transformations. In addition, similar to OTC, these predictions are more tailored to the object included than not included in the training data, suggesting that object identity and nonidentity information are represented in a near-orthogonal, rather than completely orthogonal, manner in PPC and CNNs. Meanwhile, some differences were also found among OTC, PPC, and CNNs. We discuss the specific comparisons below.

Comparing PPC and OTC Regions

Although both OTC and PPC contain similar representations of visual information (see Xu, 2018a, 2018b, for a recent review of this literature; see also Freud et al., 2016), OTC and PPC differ in their object representational structures (Vaziri-Pashkam & Xu, 2019) and how their responses may be modulated by attention and task (Xu & Vaziri-Pashkam, 2019; Bracci et al., 2017; Vaziri-Pashkam & Xu, 2017; Jeong & Xu, 2016). Thus, it remains an open question whether a linear mapping function can predict object representations in PPC for different transformation states like it does in OTC.

Within PPC regions, we performed three evaluations: (1) how well the predicted patterns of a category correlated with the true pattern of the same category (i.e., pattern predictability), (2) whether the predicted patterns were closer to the true pattern of the same than different categories (i.e., pattern selectivity), and (3) whether category similarity may increase pattern prediction (i.e., whether there was any positive correlation between the two). As an update to the linear mapping method in Mocz and colleagues (2021), we further tested the role of the penalty parameter α used in our ridge regression (which controls the degree of regularization and thus the balance between overfitting and underfitting) in predicting responses. We sampled an array of α parameters and chose an α that maximized both pattern predictability (i.e., how well the predicted response pattern correlated with the actual response pattern) and pattern selectivity (i.e., how well the predicted patterns differentiated the different object categories) for the data.

Overall, pattern predictability is very similar between PPC and OTC. Just like the OTC regions, for all transformations, a linear mapping could capture a significant amount, but not all, of the variance of an object category's fMRI pattern between two states of a transformation, with noise-normalized correlations lying between .85 and .98 for both OTC and PPC. There were also no consistent differences between lower and higher visual regions in either OTC or PPC for pattern prediction. In addition, pattern predictions were better for categories included in the training data than for categories not included in the training data, showing that predictions are category-dependent in both PPC and OTC. Thus, we see support for a near-orthogonal representation of identity and nonidentity information in both PPC and OTC. There were also similarities in pattern selectivity between PPC and OTC, with significant pattern selectivity obtained for categories included in the training data across all ROIs and transformations in both PPC and OTC. Finally, in both PPC and OTC, pattern predictability significantly correlated with category similarity, providing additional evidence supporting a near-orthogonal, rather than a completely orthogonal, representation of identity and nonidentity information in both of these visual areas. These similarities are in line with previous literature showing that both PPC and OTC can represent a diverse set of visual stimuli and exhibit tolerance to nonidentity image transformations (e.g., Vaziri-Pashkam & Xu, 2019; Vaziri-Pashkam et al., 2019; see Xu, 2018a, for a recent review on this literature).

Meanwhile, there were also differences between OTC and PPC. There were some quantitative differences in the effects of region, training set size, and training category between OTC and PPC, as documented in Table 2, in line with prior work showing that there are differences in object representational structures between OTC and PPC (Vaziri-Pashkam & Xu, 2019). In addition, no PPC region showed significant selectivity that generalized to objects not included in training, whereas higher-level regions in OTC like VOT and LOT did. PPC regions are thus more similar to lower-level OTC regions like V1–V4 in this regard. For both PPC and V1–V4, although identity and nonidentity information are partially untangled, as evidenced by the success of a linear mapping function, the predictions generated by such a function are not sufficient to differentiate the categories. How can pattern predictability be high for both categories included and not included in training while pattern selectivity is only significant for categories included in training? Because object representations are less distinctive, and thus more correlated, in PPC and lower ventral regions than in higher ventral regions (owing to enhanced object processing in the latter; Xu & Vaziri-Pashkam, 2021b; Vaziri-Pashkam & Xu, 2019), even when all these regions could equally well predict an object category's response pattern after training, category selectivity likely remains greater for higher ventral regions than for the other regions.

The largest qualitative difference between OTC and PPC was in the effect of category similarity on pattern prediction. Whereas category similarity showed a higher correlation with predictability in higher than lower OTC regions, the opposite was found in PPC, where higher PPC regions showed a lower such correlation than lower PPC regions. This could again be driven by the strength of object representations: regions containing stronger object representations, such as higher OTC and, to some extent, lower PPC (Vaziri-Pashkam & Xu, 2019), show more object-specific transformations. Lower PPC and higher OTC regions are also anatomically closer to each other and may share some of the same response properties.

Comparing CNNs and OTC Regions

CNNs have been considered by some as the current best models of the primate ventral visual system (Cichy & Kaiser, 2019; Kubilius et al., 2019). Meanwhile, large discrepancies in object representation also exist between the primate brain and CNNs (e.g., Xu & Vaziri-Pashkam, 2021a, 2021b; Serre, 2019; Geirhos et al., 2018). To better understand the inner workings of CNNs, we tested whether a linear mapping function can also successfully predict object representations in CNNs across transformations just like it does in OTC. To do so, we performed a similar set of analyses on CNNs. Taking into account all three analyses, Resnet-50 and VGG-19 are the most similar to the human ventral visual system in how identity and nonidentity information are represented.

We found some similarities between CNNs and OTC. For pattern predictability, for the most part, pattern predictions were better for categories included in training than those not included in training, showing that predictions are category dependent, similar to OTC regions. For pattern selectivity, for categories included in training, there was significant selectivity in all layers, and for categories not included in training, there was only significant selectivity in higher-level layers, again similar to OTC. Overall, selectivity was more tailored to the categories included in training for both OTC and CNNs, showing support for a near-orthogonal representation of identity and nonidentity information in both. Finally, the correlation of category similarity and pattern prediction was large in CNNs, ranging between .85 and 1, similar to those found in OTC regions, again lending support for a near-orthogonal representation. With CNNs demonstrating human-level performance on object recognition tasks (e.g., Serre, 2019; Rajalingham et al., 2018; Yamins & Dicarlo, 2016; Kriegeskorte, 2015), our results show that CNNs are indeed able to capture certain aspects of visual processing in the primate brain, consistent with some prior reports (Xu & Vaziri-Pashkam, 2021a; Eickenberg et al., 2017; Cichy et al., 2016; Güçlü & van Gerven, 2015; Khaligh-Razavi & Kriegeskorte, 2014).

Meanwhile, there were also differences in object representation between OTC and CNNs. For pattern predictability, the correlations between predicted and true patterns for categories included in training were virtually equal to 1 for both the lowest and highest training set sizes, showing that linear mapping could fully capture the variance associated with the transformations, unlike OTC, where linear mapping could only partially capture such variance. However, given that there is little noise in the CNN training data, this finding is not too surprising. In addition, whereas there was no significant difference in predictability across OTC regions, predictability tended to be better for lower than higher layers of CNNs, particularly for categories not included in training. This shows some difference between CNNs and OTC in how identity and nonidentity information are represented across the different layers/regions. It suggests that, in higher layers of CNNs, an extra nonlinear component is needed to represent the relationship between identity and nonidentity information that is not present in the lower layers. Likewise, whereas the correlation between category similarity and pattern predictability increases from lower to higher OTC regions, in CNNs (except for Cornet-S), this correlation tended to either stay the same or (very slightly) decrease from lower to higher layers, again showing a difference in object representations between CNNs and the brain.

Identity vs. Shape Representation

In the present study, we extracted the average response from a block of trials containing exemplars/members of the same object class (e.g., cars). At such a basic object level, shape features and object identity are highly correlated, as members of the basic objects share similar shapes and can be identified from the averaged shapes of the members of the same class (Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976; see also Rosch & Mervis, 1975). The neural responses we examined here thus could reflect the representations of object shape, identity, or both. Because we were working with existing data sets in the present study, we were limited by the design of the data sets and the manipulations we could carry out. It would be interesting in future research to examine responses from a single exemplar as well as the averaged responses from different exemplars at the superordinate object category level that do not share similar shapes (such as vehicles and musical instruments). Doing so would allow us to better dissociate identity and shape and their unique contribution to the linear mapping responses in the human brain and CNNs.

Conclusion

In summary, using a representational transformation analysis, we show that human PPC and CNNs can link object responses in different states of nonidentity transformations through linear mapping functions for both Euclidean and non-Euclidean transformations. We found some differences in pattern predictability across layers/regions between CNNs and OTC as well as differences in the effect of category similarity among OTC, PPC, and CNNs. However, we also found many similarities in pattern predictability and selectivity, showing that object identity and nonidentity information are represented in a near-orthogonal manner in human OTC and PPC as well as in CNNs trained to perform object classification. Our study provides a useful framework to characterize similarities and differences in object processing between the ventral and dorsal visual streams and CNNs.

Acknowledgments

We thank members of the Visual Cognitive Neuroscience Lab, Turk-Browne Lab, Holmes Lab, and Yale Cognitive and Neural Computation Lab for their helpful feedback on this project. This research was supported by National Institutes of Health grants 1R01EY030854 and 1R01EY022355 to Y. X.

Reprint requests should be sent to Viola Mocz or Yaoda Xu, 2 Hillhouse Ave., New Haven, CT 06520, or via e-mail: viola.mocz@yale.edu (V. M.), yaoda.xu@yale.edu (Y. X.).

Author Contributions

Viola Mocz: Conceptualization; Formal analysis; Investigation; Methodology; Software; Visualization; Writing–Original draft; Writing–Review & editing. Maryam Vaziri-Pashkam: Data curation; Investigation; Writing–Review & editing. Marvin Chun: Conceptualization; Supervision; Writing–Review & editing. Yaoda Xu: Conceptualization; Data curation; Funding acquisition; Methodology; Supervision; Writing–Original draft; Writing–Review & editing.

Data Availability Statement

Data and code have been made available via OSF: https://osf.io/7u65t.

Funding Information

Yaoda Xu, Foundation for the National Institutes of Health (https://dx.doi.org/10.13039/501100000038), grant numbers: 1R01EY022355 and 1R01EY030854.

Diversity in Citation Practices

Retrospective analysis of the citations in every article published in this journal from 2010 to 2021 reveals a persistent pattern of gender imbalance: Although the proportions of authorship teams (categorized by estimated gender identification of first author/last author) publishing in the Journal of Cognitive Neuroscience (JoCN) during this period were M(an)/M = .407, W(oman)/M = .32, M/W = .115, and W/W = .159, the comparable proportions for the articles that these authorship teams cited were M/M = .549, W/M = .257, M/W = .109, and W/W = .085 (Postle and Fulvio, JoCN, 34:1, pp. 1–3). Consequently, JoCN encourages all authors to consider gender balance explicitly when selecting which articles to cite and gives them the opportunity to report their article's gender citation balance.

REFERENCES

1. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B: Methodological, 57, 289–300. 10.1111/j.2517-6161.1995.tb02031.x
2. Bettencourt, K. C., & Xu, Y. (2016a). Decoding the content of visual short-term memory under distraction in occipital and parietal areas. Nature Neuroscience, 19, 150–157. 10.1038/nn.4174
3. Bettencourt, K. C., & Xu, Y. (2016b). Understanding location- and feature-based processing along the human intraparietal sulcus. Journal of Neurophysiology, 116, 1488–1497. 10.1152/jn.00404.2016
4. Bracci, S., Daniels, N., & Op de Beeck, H. (2017). Task context overrules object- and category-related representational content in the human parietal cortex. Cerebral Cortex, 27, 310–321. 10.1093/cercor/bhw419
5. Brainard, D. H. (1997). The psychophysics toolbox. Spatial Vision, 10, 433–436. 10.1163/156856897X00357
6. Bressler, D. W., & Silver, M. A. (2010). Spatial attention improves reliability of fMRI retinotopic mapping signals in occipital and parietal cortex. Neuroimage, 53, 526–533. 10.1016/j.neuroimage.2010.06.063
7. Cichy, R. M., & Kaiser, D. (2019). Deep neural networks as scientific models. Trends in Cognitive Sciences, 23, 305–317. 10.1016/j.tics.2019.01.009
8. Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports, 6, 27755. 10.1038/srep27755
9. Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic Press. 10.4324/9780203771587
10. Cohen, J. (1973). Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educational and Psychological Measurement, 33, 107–112. 10.1177/001316447303300111
11. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum. 10.4324/9780203771587
12. Cowan, N. (2001). Metatheory of storage capacity limits. Behavioral and Brain Sciences, 24, 154–176. 10.1017/S0140525X0161392X
13. Dale, A. M., Fischl, B., & Sereno, M. I. (1999). Cortical surface-based analysis: I. Segmentation and surface reconstruction. Neuroimage, 9, 179–194. 10.1006/nimg.1998.0395
14. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255). 10.1109/CVPR.2009.5206848
15. DiCarlo, J. J., & Cox, D. D. (2007). Untangling invariant object recognition. Trends in Cognitive Sciences, 11, 333–341. 10.1016/j.tics.2007.06.010
16. DiCarlo, J. J., Zoccolan, D., & Rust, N. C. (2012). How does the brain solve visual object recognition? Neuron, 73, 415–434. 10.1016/j.neuron.2012.01.010
17. Eickenberg, M., Gramfort, A., Varoquaux, G., & Thirion, B. (2017). Seeing it all: Convolutional network layers map the function of the human visual system. Neuroimage, 152, 184–194. 10.1016/j.neuroimage.2016.10.001
18. Freud, E., Plaut, D. C., & Behrmann, M. (2016). ‘What’ is happening in the dorsal visual pathway. Trends in Cognitive Sciences, 20, 773–784. 10.1016/j.tics.2016.08.003
19. Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2018). ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. https://arxiv.org/abs/1811.12231
20. Grill-Spector, K., Kushnir, T., Hendler, T., Edelman, S., Itzchak, Y., & Malach, R. (1998). A sequence of object-processing stages revealed by fMRI in the human occipital lobe. Human Brain Mapping, 6, 316–328. 10.1002/(SICI)1097-0193(1998)6:4<316::AID-HBM9>3.0.CO;2-6
21. Güçlü, U., & van Gerven, M. A. J. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. Journal of Neuroscience, 35, 10005–10014. 10.1523/JNEUROSCI.5023-14.2015
22. Haxby, J. V., Guntupalli, J. S., Connolly, A. C., Halchenko, Y. O., Conroy, B. R., Gobbini, M. I., et al. (2011). A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72, 404–416. 10.1016/j.neuron.2011.08.026
23. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). 10.1109/CVPR.2016.90
24. Hong, H., Yamins, D. L. K., Majaj, N. J., & DiCarlo, J. J. (2016). Explicit information for category-orthogonal object properties increases along the ventral stream. Nature Neuroscience, 19, 613–622. 10.1038/nn.4247
25. Hung, C. P., Kreiman, G., Poggio, T., & DiCarlo, J. J. (2005). Fast readout of object identity from macaque inferior temporal cortex. Science, 310, 863–866. 10.1126/science.1117593
26. Isik, L., Meyers, E. M., Leibo, J. Z., & Poggio, T. (2013). The dynamics of invariant object recognition in the human visual system. Journal of Neurophysiology, 111, 91–102. 10.1152/jn.00394.2013
27. Jeong, S. K., & Xu, Y. (2012). Neural representation of targets and distractors during object individuation and identification. Journal of Cognitive Neuroscience, 25, 117–126. 10.1162/jocn_a_00298
28. Jeong, S. K., & Xu, Y. (2016). Behaviorally relevant abstract object identity representation in the human parietal cortex. Journal of Neuroscience, 36, 1607–1619. 10.1523/JNEUROSCI.1016-15.2016
29. Kar, K., Kubilius, J., Schmidt, K., Issa, E. B., & DiCarlo, J. J. (2019). Evidence that recurrent circuits are critical to the ventral stream's execution of core object recognition behavior. Nature Neuroscience, 22, 974–983. 10.1038/s41593-019-0392-5
30. Khaligh-Razavi, S.-M., & Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10, e1003915. 10.1371/journal.pcbi.1003915
31. Kietzmann, T. C., Spoerer, C. J., Sörensen, L. K. A., Cichy, R. M., Hauk, O., & Kriegeskorte, N. (2019). Recurrence is required to capture the representational dynamics of the human visual system. Proceedings of the National Academy of Sciences, U.S.A., 116, 21854–21863. 10.1073/pnas.1905544116
32. Konen, C. S., & Kastner, S. (2008). Two hierarchically organized neural systems for object information in human visual cortex. Nature Neuroscience, 11, 224–231. 10.1038/nn2036
33. Kourtzi, Z., & Kanwisher, N. (2000). Cortical regions involved in perceiving object shape. Journal of Neuroscience, 20, 3310–3318. 10.1523/JNEUROSCI.20-09-03310.2000
34. Kriegeskorte, N. (2015). Deep neural networks: A new framework for modeling biological vision and brain information processing. Annual Review of Vision Science, 1, 417–446. 10.1146/annurev-vision-082114-035447
35. Kriegeskorte, N., Mur, M., Ruff, D. A., Kiani, R., Bodurka, J., Esteky, H., et al. (2008). Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60, 1126–1141. 10.1016/j.neuron.2008.10.043
36. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105. 10.1145/3065386
37. Kubilius, J., Schrimpf, M., & Hong, H. (2019). Brain-like object recognition with high-performing shallow recurrent ANNs. In Thirty-Third Conference on Neural Information Processing Systems (NeurIPS 2019). San Diego, CA: Neural Information Processing Systems. 10.48550/arXiv.1909.06161
38. Malach, R., Reppas, J. B., Benson, R. R., Kwong, K. K., Jiang, H., Kennedy, W. A., et al. (1995). Object-related activity revealed by functional magnetic resonance imaging in human occipital cortex. Proceedings of the National Academy of Sciences, U.S.A., 92, 8135–8139. 10.1073/pnas.92.18.8135
39. Mocz, V., Vaziri-Pashkam, M., Chun, M. M., & Xu, Y. (2021). Predicting identity-preserving object transformations across the human ventral visual stream. Journal of Neuroscience, 41, 7403–7419. 10.1523/JNEUROSCI.2137-20.2021
40. O'Connell, T. P., & Chun, M. M. (2018). Predicting eye movement patterns from fMRI responses to natural scenes. Nature Communications, 9, 5159. 10.1038/s41467-018-07471-9
41. Rajalingham, R., Issa, E. B., Bashivan, P., Kar, K., Schmidt, K., & DiCarlo, J. J. (2018). Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience, 38, 7255–7269. 10.1523/JNEUROSCI.0388-18.2018
42. R Core Team. (2018). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. Available at https://www.R-project.org/.
43. Richardson, J. T. E. (2011). Eta squared and partial eta squared as measures of effect size in educational research. Educational Research Review, 6, 135–147. 10.1016/j.edurev.2010.12.001
44. Rolls, E. T. (2000). Functions of the primate temporal lobe cortical visual areas in invariant visual object and face recognition. Neuron, 27, 205–218. 10.1016/S0896-6273(00)00030-1
45. Rosch, E., & Mervis, C. B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7, 573–605. 10.1016/0010-0285(75)90024-9
46. Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382–439. 10.1016/0010-0285(76)90013-X
47. Rust, N. C., & DiCarlo, J. J. (2010). Selectivity and tolerance (“invariance”) both increase as visual information propagates from cortical area V4 to IT. Journal of Neuroscience, 30, 12978–12995. 10.1523/JNEUROSCI.0179-10.2010
48. Sawamura, H., Georgieva, S., Vogels, R., Vanduffel, W., & Orban, G. A. (2005). Using functional magnetic resonance imaging to assess adaptation and size invariance of shape processing by humans and monkeys. Journal of Neuroscience, 25, 4294–4306. 10.1523/JNEUROSCI.0377-05.2005
49. Schwarzlose, R. F., Swisher, J. D., Dang, S., & Kanwisher, N. (2008). The distribution of category and location information across object-selective regions in human visual cortex. Proceedings of the National Academy of Sciences, U.S.A., 105, 4447–4452. 10.1073/pnas.0800431105
50. Sereno, M. I., Dale, A. M., Reppas, J. B., Kwong, K. K., Belliveau, J. W., Brady, T. J., et al. (1995). Borders of multiple visual areas in humans revealed by functional magnetic resonance imaging. Science, 268, 889–893. 10.1126/science.7754376
51. Serre, T. (2019). Deep learning: The good, the bad, and the ugly. Annual Review of Vision Science, 5, 399–426. 10.1146/annurev-vision-091718-014951
52. Silver, M. A., & Kastner, S. (2009). Topographic maps in human frontal and parietal cortex. Trends in Cognitive Sciences, 13, 488–495. 10.1016/j.tics.2009.08.005
53. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. https://arxiv.org/abs/1409.1556
54. Straw, A. D. (2008). Vision Egg: An open-source library for realtime visual stimulus generation. Frontiers in Neuroinformatics, 2, 4. 10.3389/neuro.11.004.2008
55. Swisher, J. D., Halko, M. A., Merabet, L. B., McMains, S. A., & Somers, D. C. (2007). Visual topography of human intraparietal sulcus. Journal of Neuroscience, 27, 5326–5337. 10.1523/JNEUROSCI.0991-07.2007
56. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1–9). 10.1109/CVPR.2015.7298594
57. Tarhan, L., & Konkle, T. (2020). Reliability-based voxel selection. Neuroimage, 207, 116350. 10.1016/j.neuroimage.2019.116350
58. Taylor, J., & Xu, Y. (2021). Conjunctive coding of color and shape in convolutional neural networks. Journal of Vision, 20, 400. 10.1167/jov.20.11.400
59. Todd, J. J., & Marois, R. (2004). Capacity limit of visual short-term memory in human posterior parietal cortex. Nature, 428, 751–754. 10.1038/nature02466
60. Todd, J. J., & Marois, R. (2005). Posterior parietal cortex activity predicts individual differences in visual short-term memory capacity. Cognitive, Affective & Behavioral Neuroscience, 5, 144–155. 10.3758/CABN.5.2.144
61. Vaziri-Pashkam, M., Taylor, J., & Xu, Y. (2019). Spatial frequency tolerant visual object representations in the human ventral and dorsal visual processing pathways. Journal of Cognitive Neuroscience, 31, 49–63. 10.1162/jocn_a_01335
62. Vaziri-Pashkam, M., & Xu, Y. (2017). Goal-directed visual processing differentially impacts human ventral and dorsal visual representations. Journal of Neuroscience, 37, 8767–8782. 10.1523/JNEUROSCI.3392-16.2017
63. Vaziri-Pashkam, M., & Xu, Y. (2019). An information-driven 2-pathway characterization of occipitotemporal and posterior parietal visual object representations. Cerebral Cortex, 29, 2034–2050. 10.1093/cercor/bhy080
64. Ward, E. J., Isik, L., & Chun, M. M. (2018). General transformations of object representations in human visual cortex. Journal of Neuroscience, 38, 8526–8537. 10.1523/JNEUROSCI.2800-17.2018
65. Willenbockel, V., Sadr, J., Fiset, D., Horne, G. O., Gosselin, F., & Tanaka, J. W. (2010). Controlling low-level image properties: The SHINE toolbox. Behavior Research Methods, 42, 671–684. 10.3758/BRM.42.3.671
66. Xu, Y. (2008a). Representing connected and disconnected shapes in human inferior intraparietal sulcus. Neuroimage, 40, 1849–1856. 10.1016/j.neuroimage.2008.02.014
67. Xu, Y. (2008b). Distinctive neural mechanisms supporting visual object individuation and identification. Journal of Cognitive Neuroscience, 21, 511–518. 10.1162/jocn.2008.21024
68. Xu, Y. (2010). The neural fate of task-irrelevant features in object-based processing. Journal of Neuroscience, 30, 14020–14028. 10.1523/JNEUROSCI.3011-10.2010
69. Xu, Y. (2018a). A tale of two visual systems: Invariant and adaptive visual information representations in the primate brain. Annual Review of Vision Science, 4, 311–336. 10.1146/annurev-vision-091517-033954
70. Xu, Y. (2018b). Sensory cortex is nonessential in working memory storage. Trends in Cognitive Sciences, 22, 192–193. 10.1016/j.tics.2017.12.008
71. Xu, Y., & Chun, M. M. (2006). Dissociable neural mechanisms supporting visual short-term memory for objects. Nature, 440, 91–95. 10.1038/nature04262
72. Xu, Y., & Chun, M. M. (2007). Visual grouping in human parietal cortex. Proceedings of the National Academy of Sciences, U.S.A., 104, 18766–18771. 10.1073/pnas.0705618104
73. Xu, Y., & Chun, M. M. (2009). Selecting and perceiving multiple visual objects. Trends in Cognitive Sciences, 13, 167–174. 10.1016/j.tics.2009.01.008
74. Xu, Y., & Jeong, S. K. (2015). The contribution of human superior intraparietal sulcus to visual short-term memory and perception. In Jolicoeur, P., Lefebvre, C., & Martinez-Trujillo, J. (Eds.), Mechanisms of sensory working memory: Attention and performance XXV (pp. 33–42). Elsevier Academic Press. 10.1016/B978-0-12-801371-7.00004-1
75. Xu, Y., & Vaziri-Pashkam, M. (2021a). Limits to visual representational correspondence between convolutional neural networks and the human brain. Nature Communications, 12, 2065. 10.1038/s41467-021-22244-7
76. Xu, Y., & Vaziri-Pashkam, M. (2021b). Examining the coding strength of object identity and nonidentity features in human occipito-temporal cortex and convolutional neural networks. Journal of Neuroscience, 41, 4234–4252. 10.1523/JNEUROSCI.1993-20.2021
77. Yamins, D. L. K., & DiCarlo, J. J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19, 356–365. 10.1038/nn.4244
78. Zhang, J., Liu, J., & Xu, Y. (2015). Neural decoding reveals impaired face configural processing in the right fusiform face area of individuals with developmental prosopagnosia. Journal of Neuroscience, 35, 1539–1548. 10.1523/JNEUROSCI.2646-14.2015
