Response mode differences in perspective taking: Differences in representation or differences in retrieval?

Jonathan W Kelly; Timothy P McNamara

doi:10.3758/mc.36.4.863

Author manuscript; available in PMC: 2009 May 21.

Published in final edited form as: Mem Cognit. 2008 Jun;36(4):863–872. doi: 10.3758/mc.36.4.863

PMCID: PMC2685252 NIHMSID: NIHMS108446 PMID: 18604967

Abstract

Three experiments explored whether response mode differences in perspective taking result from different spatial representations or different retrieval processes. Participants learned object locations and then, while blindfolded, pointed to or verbally described object locations from perspectives aligned or misaligned with their facing direction and aligned or misaligned with the learning perspective. Pointing was facilitated from the perspective aligned with the body during testing. Similar facilitation occurred when verbally labeling, but only when conducted in the context of pointing (e.g., after pointing). Without this pointing context, or after third-person strategy instructions, the effect of body alignment was eliminated for verbal responses. Pointing was less responsive to context and strategy. Across all conditions, performance was facilitated for the learning perspective. Taken together, these experiments indicate that response mode differences are due to differences in the retrieval process, which varies with strategy, rather than differences in the organization of the underlying spatial memory.

In daily navigation, we regularly rely on our memories of spaces to guide our movements and decisions. Whether planning a detour to avoid road congestion or describing the location of the bookstore to a campus visitor, we are often required to imagine perspectives that we do not currently occupy. Our ability to imagine these non-occupied perspectives depends on a number of factors, including our representation of the environment in long-term memory and our current location and orientation within that environment (Mou, McNamara, Valiquette & Rump, 2004).

The ease with which new perspectives can be imagined has been shown to depend greatly on the presence of self-motion cues during the imagined movement. Spatial updating, the process of updating the remembered locations of previously learned objects during self-movement, is typically quite accurate after physical rotations and translations with eyes closed (Philbeck, Loomis & Beall, 1997), especially in contrast to the relatively poor performance after imagined movement (Rieser, 1989). In a seminal paper on imaginal repositioning, Rieser compared the relative ease with which people could point to updated object locations after imagined vs. real rotations and translations. Participants were able to point to objects equally well after imagined and physical translations, but performance after imagined rotations was worse than after physical rotations.

Presson and Montello (1994; see also Easton & Sholl, 1995 and May, 2004) theorized that the difference between imagined rotations and imagined translations was due to a reference frame conflict between imagined and actual perspectives for rotations, but not translations. Specifically, for participants to correctly point to objects after an imagined rotation, they had to compute the correct response from the imagined perspective and map that response onto their physically occupied perspective. After an imagined translation, no such reference frame conflict occurred, because the imagined perspective was aligned with the participant’s actual facing direction, and so the chosen response from the imagined perspective could be directly executed without any further transformation.

The detrimental effect of imagined rotation on perspective taking ability has been widely reported and replicated. Interestingly, this effect may be specific to certain response modalities employed in those experiments, namely body-based responses such as aiming a joystick or other pointing device (Easton & Sholl, 1995; Kelly, Avraamides & Loomis, in press; Presson & Montello, 1994; May, 2004; Mou et al., 2004; Rieser, 1989) or turning one’s body to indicate the desired direction (Klatzky, Loomis, Beall, Chance & Golledge, 1998; Waller, Montello, Richardson & Hegarty, 2002). When verbal responses (e.g., “front-left”) are used to indicate object locations after imagined rotation, the difficulties found when using a body-based response can be reduced (Avraamides, Ioannidou & Kyranidou, in press; Wraga, 2003) or eliminated altogether (Avraamides, Klatzky, Loomis & Golledge, 2004; de Vega & Rodrigo, 2001; Wang, 2004).

At least two different possibilities have been proposed to explain the relative ease associated with verbal responses compared to body-based responses after imagined rotation. As suggested by Wang (2004), unique representations and/or processes may subserve actions (e.g., body-based pointing) versus judgments (e.g., verbal descriptions). Regarding the separate representations hypothesis, Wang proposed that body-based responses might be based on an orientation-dependent representation whereas verbal responses could be based on an orientation-independent representation. Although the preponderance of evidence suggests that spatial memories are orientation-dependent (Diwadkar & McNamara, 1997; Kelly et al., in press; Kelly & McNamara, in press; McNamara, Rump & Werner, 2003; Mou et al., 2004; Mou & McNamara, 2002; Roskos-Ewoldsen, McNamara, Shelton & Carr, 1998; Shelton & McNamara, 1997, 2001; Werner & Schmidt, 1999), these data have generally been gathered using body-based pointing responses. Such studies indicate that spatial memories are organized around one or two primary reference directions, and retrieval of inter-object relationships (e.g., “Imagine you are standing at x facing y, point to z,” where x, y and z represent learned object locations) is facilitated along those encoded reference directions, relative to other directions. The reference directions are selected through a combination of egocentric experience and environmental structure (Diwadkar & McNamara, 1997; Kelly & McNamara, in press; McNamara, Rump & Werner, 2003; Shelton & McNamara, 1997, 2001), and the resulting facilitation for aligned perspectives (compared to misaligned perspectives) has been shown under remote (after removal from the remembered environment; Shelton & McNamara, 2001) and situated (when located within the remembered environment; Kelly et al., in press; Mou et al., 2004) testing conditions. 
However, verbal responses have not been used with the explicit purpose of measuring the orientation dependence of long-term spatial memory, so the generality of these conclusions to other response modes is unclear.

Regarding the separate processes hypothesis, a common representation might underlie both body-based and verbal judgments, but the response mode could modify the manner in which that representation is accessed. Specifically, Avraamides et al. (in press) proposed that body-based responses require an additional step during response computation: once the correct spatial relationship is inferred, it must be mapped onto body coordinates before the pointing response can be executed. This response mapping from the imagined perspective onto the body in its actual perspective demands cognitive effort. In contrast, verbal responses can be made directly from the imagined perspective, without implication of the body and its egocentric frame of reference. In this way, verbal responses avoid the reference frame conflict that plagues body-based responses.

Recent work demonstrates the potential to distinguish between orientation dependency in long-term spatial memory and conflicts resulting from body-based responses, two factors which are critical to assessing the separate representations and separate processes hypotheses, respectively. Mou et al. (2004; see also Kelly et al., in press) had participants learn a spatial layout from a single perspective, a procedure that has previously been shown to produce orientation-dependent spatial memories with privileged access to spatial relations from imagined perspectives aligned with the learning perspective (e.g., Shelton & McNamara, 2001). After learning, blindfolded participants were asked to imagine different perspectives within the learned layout, and these imagined perspectives could be aligned or misaligned with the learning perspective. Additionally, participants physically turned to assume different facing directions prior to performing the perspective-taking task, so that the imagined perspective could also be aligned or misaligned with their actual perspective during retrieval. In this way, Mou et al. found separate evidence of facilitation for imagined perspectives aligned, compared to misaligned, with 1) the reference direction used to organize the spatial memory (established at the learning perspective) and 2) the reference frame of the body in its current orientation.

The experimental design used by Mou et al. (2004) provides a promising tool for uncovering the nature of response mode differences. First, it should prove useful in evaluating the separate representations hypothesis proposed by Wang (2004). Given prior evidence, pointing responses should be based on an orientation-dependent representation, and the learning conditions in the current experiments have been previously shown to produce orientation-dependent spatial memories with a preferred reference direction parallel to the learning perspective. Similar to the findings of Mou et al., pointing responses should be facilitated when imagining perspectives aligned, compared to misaligned, with this reference direction in long-term memory. If verbal responses are based on an orientation-independent representation, then there should be no benefit for perspectives aligned with the learning perspective when responding verbally. The same predictions should hold whether participants are oriented to their surroundings (i.e., when they are aware of their position and orientation within the immediately surrounding environment) or disoriented. Second, the Mou et al. design can be employed to evaluate the separate processes hypothesis. Based on prior evidence (Kelly et al., in press; May, 2004; Mou et al., 2004; Presson & Montello, 1994; Rieser, 1989), pointing responses should also be facilitated when imagining perspectives aligned, compared to misaligned, with the body in its current orientation. If verbal responses activate a different retrieval process that does not rely on the body’s actual orientation, then there should be no benefit for perspectives aligned with the body under verbal response conditions. After disorientation, performance should be similar on perspectives aligned and misaligned with the body, for both pointing and verbal responses.

Experiment 1

Method

Participants

Sixteen undergraduate students (8 males) at Vanderbilt University participated in exchange for course credit.

Stimuli and Design

The experiment was conducted in a 5 × 7 m room (see Figure 1) containing the test objects and a laptop computer used to present the experimental trials and collect responses. Each object set consisted of eight objects evenly spaced around a circle (3 m in diameter), centered in the middle of the room. Four such sets of objects were learned over the course of the experiment, and each set was chosen from a different semantic category.

Figure 1. Plan view of the stimuli and room environment used in Experiments 1–3. Circles represent object locations and the cross represents the participant’s location.

Participants faced 0° (see Figure 1) during learning, and faced either 90° or 270° during blindfolded testing. They arrived at 90° or 270° directly or indirectly depending on the disorientation condition, explained below. After participants were positioned for testing, each subsequent trial consisted of two objects selected from the eight-object array, and required participants to locate one object as if they were facing a second object (e.g., “Face the pear, find the apple”). Trials were presented via wireless headphones (HDR-130 from Sennheiser, Old Lyme, CT).

The three primary independent variables were response mode, disorientation, and imagined perspective. All three variables were manipulated within participants. Response mode was either verbal labeling or pointing, and participants either remained oriented or were disoriented prior to testing. The imagined perspective was either aligned with the learning perspective (0°, termed the “learning” perspective), aligned with the participant’s facing direction at test (90° when facing 90°, or 270° when facing 270°, termed the “body-aligned” perspective), or 180° misaligned with the participant’s facing direction at test (e.g., 90° when facing 270°, termed the “misaligned” perspective).
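The three levels of imagined perspective can be stated compactly in code. The following is a minimal illustrative sketch, not the authors' software: the function name `classify_perspective` and the constant `LEARNING_HEADING` are hypothetical, and headings are assumed to be whole degrees in the room coordinates of Figure 1.

```python
LEARNING_HEADING = 0  # degrees; participants faced 0° during learning


def classify_perspective(imagined_deg: int, facing_deg: int) -> str:
    """Label an imagined heading relative to the learning view and the body.

    facing_deg is the participant's physical heading at test (90 or 270 in
    Experiment 1), so the three categories never overlap.
    """
    imagined = imagined_deg % 360
    facing = facing_deg % 360
    if imagined == LEARNING_HEADING:
        return "learning"
    if imagined == facing:
        return "body-aligned"
    if imagined == (facing + 180) % 360:
        return "misaligned"
    raise ValueError("heading is not one of the three tested perspectives")
```

For example, an imagined heading of 270° is body-aligned for a participant facing 270° but misaligned for a participant facing 90°.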

Factorial combinations of response mode and disorientation were blocked and block order was counterbalanced using a balanced Latin square design. A new set of objects was learned before each block. Within each block of trials, imagined perspective was pseudo-randomized so that the same object never appeared on two consecutive trials, either as an orienting object or as a target object. Additionally, a single object never served as both the orienting and target object for the same trial. Three imagined perspectives were combined with seven pointing directions, resulting in 21 trials per block. The dependent measures, defined in detail below, were decision time, response time and absolute angular error. Data were recorded on a laptop computer using Vizard software (WorldViz, Santa Barbara, CA).
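The counterbalancing and trial-ordering constraints described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: objects are indexed 0 to 7 in 45° steps (so indices 0, 2, and 6 stand for the orienting objects at 0°, 90°, and 270°), the block orders come from a standard Williams-type balanced Latin square, and the consecutive-trial constraint is read strictly, meaning that two successive trials share no object in either role. The function names are hypothetical.

```python
import random


def balanced_latin_square(n: int) -> list[list[int]]:
    """Williams design for even n: each condition follows every other once."""
    if n % 2:
        raise ValueError("this construction needs an even number of conditions")
    first, lo, hi = [0], 1, n - 1
    for k in range(1, n):
        if k % 2:                # odd positions count up: 1, 2, ...
            first.append(lo)
            lo += 1
        else:                    # even positions count down: n-1, n-2, ...
            first.append(hi)
            hi -= 1
    return [[(c + row) % n for c in first] for row in range(n)]


def ordered_trials(orienting=(0, 2, 6), n_objects=8, seed=0, max_tries=10000):
    """Return 21 (orienting, target) pairs with no object repeated on
    consecutive trials. Uses randomized greedy construction with restarts."""
    rng = random.Random(seed)
    trials = [(o, t) for o in orienting for t in range(n_objects) if t != o]
    for _ in range(max_tries):
        remaining, order = trials[:], []
        while remaining:
            ok = [tr for tr in remaining
                  if not order or not (set(tr) & set(order[-1]))]
            if not ok:           # dead end: start this attempt over
                break
            pick = rng.choice(ok)
            remaining.remove(pick)
            order.append(pick)
        if not remaining:
            return order
    raise RuntimeError("no valid ordering found; increase max_tries")
```

With four blocks (two response modes by two orientation conditions), `balanced_latin_square(4)` yields four block orders, so sixteen participants would be assigned four per order.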

Procedure

After providing informed consent, participants were presented with a sample set of eight objects for training purposes. Prior to training, participants were outfitted with a wireless joystick (Freedom 2.4 by Logitech, Fremont, CA) affixed to a small table, which was suspended in front of their waist using shoulder straps. In this way, the joystick was always in front of the participant. During training, the experimenter explained the perspective taking task, and described the two response modes. Participants were told to imagine facing one object and then to decide where a second object would be from that imagined perspective. They were shown a button on the joystick, which they were told to press just prior to making their responses, regardless of the response mode. They were instructed to press the button only when they were ready to respond, and then immediately give their response. For the pointing response, they were told to deflect the joystick in the direction of the target object from the imagined perspective. For the verbal response, they were told to verbally describe the direction of the target object from the imagined perspective, using one of eight response options: front, front-right, right, back-right, back, back-left, left, or front-left. These verbal labels were chosen because they provided sufficient precision to perform the task. Participants were shown how these verbal labels matched up with the regularly spaced object array. Participants then performed four practice trials for both response modes using the training objects, which were perceptually available at all times during training. Any errors during practice were corrected.

After training with the pointing and verbal response procedures, participants were escorted to the experiment room on a different floor of the same building. Prior to entering the room, participants donned the blindfold and were led directly into the center of the object array, facing 0°. Once positioned, participants removed the blindfold and learning began. They were instructed to study the object locations for 60 s and then, with eyes closed, point with their hand to each object called out in a random order by the experimenter. Participants were instructed not to move their feet during learning, and to rotate at the neck and waist to view all of the objects. The study-test sequence ended when participants were able to successfully point to all object locations.

After learning the layout, participants donned the blindfold. In the disorientation condition, participants were told to rotate in place for 60 s. At random times during rotation, the experimenter instructed them to change directions. Participants were told that the experimenter would walk around them as they rotated, so that the experimenter’s location would not be a stable orientation cue. After the disorientation procedure, participants were turned to face 90° or 270°. In the other condition, oriented participants turned 90° to their right or left to face 90° or 270°, respectively, and then stood idle for 60 s without going through the disorientation procedure.

Once participants were positioned, they were instructed as to which type of response to use on the ensuing block of trials. Sound files for each trial were pre-recorded and presented via wireless headphones. Decision time was defined as the time between the termination of the sound file and the participant’s button press on the joystick, indicating readiness to respond. Response time was defined as the time between the button press and completion of the response. Decision time and response time were measured separately because verbal responses could take longer to produce than pointing responses, and this difference in response production could obfuscate the effects of the independent variables. For pointing trials, the response was completed when the joystick was deflected by 30° from vertical. For verbal trials, the response was completed when the experimenter pressed a key corresponding to the participant’s verbal response. Decision time was expected to be a more informative measure than response time. After completing a block of trials, oriented participants turned directly to face 0°, while disoriented participants underwent the disorientation procedure again before being returned to 0°. This was done to prevent disoriented participants from receiving feedback about their facing direction during testing, in light of the repeated measures design. After participants were returned to 0°, a new set of objects was laid on the floor and the procedure began again.
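The two latency measures defined above amount to simple differences between three event times on each trial. A minimal sketch, assuming timestamps in seconds on a common clock; the `TrialTimestamps` record and its field names are hypothetical and are not taken from the experiment software:

```python
from dataclasses import dataclass


@dataclass
class TrialTimestamps:
    sound_offset: float       # end of the pre-recorded trial instruction
    button_press: float       # participant signals readiness to respond
    response_complete: float  # joystick passes 30° deflection, or the
                              # experimenter keys in the verbal response


def decision_time(ts: TrialTimestamps) -> float:
    """Time from the end of the sound file to the button press."""
    return ts.button_press - ts.sound_offset


def response_time(ts: TrialTimestamps) -> float:
    """Time from the button press to completion of the response."""
    return ts.response_complete - ts.button_press
```

Keeping the two intervals separate means that the slower production of a spoken label affects only `response_time`, leaving `decision_time` comparable across response modes.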

Analysis

Facilitation for perspectives aligned, compared to misaligned, with body orientation at test should be revealed by a performance difference between the body-aligned and misaligned perspectives. Additionally, facilitation due to alignment with an orientation-dependent spatial memory should be revealed by a performance difference between the learning and misaligned perspectives. These two indicators will serve as the primary evidence of the processes and representations used when performing the perspective-taking task under verbal and pointing conditions.

Although the joystick measured pointing responses continuously between 0° and 360°, pointing responses were quantized in 45° increments. This was done to make pointing responses more comparable to verbal responses, which were limited to the same 45° intervals. Participants were made aware of the layout’s spatial regularity during training, and so presumably would never intend to produce response angles in anything other than multiples of 45°. Absolute pointing error was calculated by computing the absolute value of the difference between the correct responses and participants’ quantized pointing responses.
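The quantization and error computation can be illustrated with a short sketch. Two details are assumptions, since the text does not specify them: responses exactly halfway between two bins are rounded up, and the difference is wrapped so that the absolute error never exceeds 180°. The function names are hypothetical.

```python
import math


def quantize(angle_deg: float, step: float = 45.0) -> float:
    """Snap a joystick angle to the nearest multiple of `step` in [0, 360)."""
    bins = math.floor((angle_deg % 360.0) / step + 0.5)  # ties round up (assumed)
    return (bins * step) % 360.0


def absolute_angular_error(response_deg: float, correct_deg: float) -> float:
    """Smallest absolute angle between response and correct direction."""
    diff = abs(response_deg - correct_deg) % 360.0
    return min(diff, 360.0 - diff)  # wrap-around handling is an assumption
```

Under these assumptions, a raw joystick reading of 100° is quantized to 90°, and if the correct direction is 45° the absolute error is 45°.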

Results

Latency

Decision time (the elapsed time between trial presentation and the button press indicating preparedness to respond, shown in Figure 2) was analyzed in a 2 (gender) × 2 (response mode: verbal labeling or pointing) × 2 (orientation: disoriented or oriented) × 3 (imagined perspective: learning, body-aligned, or misaligned perspective) mixed-model ANOVA. The analysis revealed a significant main effect of perspective [F(2,28)=15.25, p<.001, η_p²=.52], qualified by a significant interaction between orientation and perspective [F(2,28)=7.55, p=.002, η_p²=.35]. The three-way interaction between orientation, response mode, and imagined perspective was not significant [F(2,28)=0.64, p=.54, η_p²=.04].
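The reported effect sizes can be checked against the F ratios using the standard identity η_p² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies that identity to the three decision-time effects above; it is a reader's consistency check, not part of the original analysis, and the function name is hypothetical.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Decision-time effects reported for Experiment 1, all with df = (2, 28)
for label, f_value in [("perspective", 15.25),
                       ("orientation x perspective", 7.55),
                       ("three-way interaction", 0.64)]:
    print(label, round(partial_eta_squared(f_value, 2, 28), 2))
```

The rounded values agree with the effect sizes reported in the text (.52, .35, and .04).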

Figure 2. Decision time as a function of response mode, disorientation, and imagined perspective in Experiment 1. Error bars are standard errors estimated from the ANOVA.

To further evaluate a priori hypotheses, performance on the misaligned perspective was compared with performance on the body-aligned and learning perspectives for each combination of response mode and orientation. These contrasts indicated that when participants used the pointing response and remained oriented to the environment, performance on the learning and body-aligned perspectives was faster than on the misaligned perspective [F(1,14)=15.08, p=.002, η_p²=.52, and F(1,14)=13.58, p=.002, η_p²=.49, respectively]. When pointing after disorientation, judgments were faster for the learning perspective than the misaligned perspective [F(1,14)=20.04, p<.001, η_p²=.59], but there was no difference between the body-aligned and misaligned perspectives. When participants used the verbal labeling response and remained oriented to the environment, performance on the learning and body-aligned perspectives was faster than on the misaligned perspective [F(1,14)=7.56, p=.016, η_p²=.35, and F(1,14)=5.53, p=.034, η_p²=.28, respectively]. When verbally labeling after disorientation, performance was better on the learning perspective than the misaligned perspective [F(1,14)=5.93, p=.029, η_p²=.30], but there was no difference between the body-aligned and misaligned perspectives.

Response time (the elapsed time between the button press indicating preparedness to respond and recording of the response) was also analyzed in a 2 (gender) × 2 (response mode: verbal labeling or pointing) × 2 (orientation: disoriented or oriented) × 3 (imagined perspective: learning, body-aligned, or misaligned perspective) mixed-model ANOVA. Results showed only a main effect of response mode [F(1,14)=268.15, p<.001, η_p²=.95], with faster responses overall for pointing (M=0.53 s, SE=0.06) compared to verbal labeling (M=1.64 s, SE=0.06).

Accuracy

Absolute error (Figure 3) was analyzed in a 2 (gender) × 2 (response mode: verbal labeling or pointing) × 2 (orientation: disoriented or oriented) × 3 (imagined perspective: learning, body-aligned, or misaligned perspective) mixed-model ANOVA. Verbal responses (M=7.65°, SE=2.73) were more accurate overall than pointing responses [M=24.20°, SE=3.55; F(1,14)=88.85, p<.001, η_p²=.86]. The main effect of perspective [F(2,28)=10.07, p=.001, η_p²=.42] was qualified by an orientation by perspective interaction [F(2,28)=4.14, p=.027, η_p²=.23].

Figure 3. Absolute angular error as a function of response mode, disorientation, and imagined perspective in Experiment 1. Error bars are standard errors estimated from the ANOVA.

Contrasts conducted to evaluate a priori hypotheses showed that when participants pointed while oriented to the environment, performance was more accurate on the learning and body-aligned perspectives than on the misaligned perspective [F(1,14)=4.37, p=.055, η_p²=.24, and F(1,14)=5.94, p=.029, η_p²=.30, respectively]. When pointing after disorientation, performance was better for the learning perspective than the misaligned perspective [F(1,14)=14.40, p=.002, η_p²=.51], but there was no difference between the body-aligned and misaligned perspectives. When participants verbally responded and remained oriented to the environment, performance was more accurate on the learning perspective and the body-aligned perspective than on the misaligned perspective [F(1,14)=4.78, p=.046, η_p²=.25, and F(1,14)=9.76, p=.007, η_p²=.41, respectively]. When verbally labeling after disorientation, performance was better on the learning perspective than the misaligned perspective [F(1,14)=3.47, p=.084, η_p²=.20], but there was no difference between the body-aligned and misaligned perspectives.

Discussion

The separation of decision time and response time proved to be an effective means of accounting for the increased latency associated with producing verbal compared to joystick responses. The fact that response time was unaffected by the manipulations of imagined perspective and disorientation suggests that participants followed instructions, and did not expend further cognitive effort after indicating their preparedness to respond. Accordingly, only decision time and accuracy are discussed. The overall larger errors in pointing, compared to verbal labeling, are attributed to added noise in the joystick response. Whereas verbal responses allowed for high precision, joystick responses were performed with only proprioceptive feedback (participants were blindfolded during responding). Although attempts were made to reduce this noise by quantizing the pointing responses in 45° intervals, this was not sufficient to equate the two responses, presumably because noise in the joystick response often exceeded the 45° quantization.

Pointing and verbal labeling responses were faster and more accurate when participants imagined the learning perspective compared to the misaligned perspective. This evidence for an orientation-dependent representation occurred under both oriented and disoriented conditions, for both verbal and pointing responses. Previous research on long-term spatial memory predicts this finding, not only because of the saliency of the learning view (Diwadkar & McNamara, 1997; Kelly et al., in press; Shelton & McNamara, 1997) but also because of its alignment with the long axis of the room (Kelly & McNamara, in press; Shelton & McNamara, 2001). One or both of these factors most likely caused participants to organize their memories for the objects around the 0°–180° axis. The fact that both response modes showed evidence of orientation-dependent memories casts doubt on the hypothesis that pointing responses are based on orientation-dependent representations and verbal labeling responses are based on orientation-independent representations. Instead, it appears that a common orientation-dependent representation underlies both response modalities.

Additionally, when participants remained oriented to the environment, pointing and verbal responses were faster and more accurate when participants imagined their current perspective, compared to the misaligned perspective. Although the pointing result replicates much of the previous work on imagined rotations (e.g., Presson & Montello, 1994; Rieser, 1989), the verbal labeling result fails to replicate previous work indicating equivalent access to occupied and non-occupied perspectives (Avraamides et al., in press; de Vega & Rodrigo, 2001; Wang, 2004; Wraga, 2003). This finding indicates that the same retrieval process governed both response modes in the current task. To explain the superior performance for imagined perspectives aligned with the body, Presson and Montello suggested that the misaligned perspective required participants to calculate the response from the imagined perspective and then map that response onto their actual perspective, and that this remapping process results in interference. Although this is a reasonable explanation when a pointing response is used, it is not clear why the same interference would occur for verbal responses, which are not dependent on the body.

Before concluding that verbal labeling and pointing responses share common representations as well as common retrieval processes, it is important to understand the differences between this experiment and those that do report response mode differences. Previous research has typically manipulated response mode between participants, in contrast to the within-participants manipulation in Experiment 1, where participants received training on both response modes before beginning the experiment. In previous studies, different groups of participants may have used different retrieval strategies (resulting in potentially different retrieval processes) for verbal labeling and pointing, whereas participants in the current study may have elected to use a consistent strategy for the two tasks. Verbal labeling responses, but not pointing responses, are commonly used both egocentrically and non-egocentrically, depending on one’s goal. In the former case, an observer can verbally describe object locations relative to him/herself, where verbal labels correspond to egocentric directions (e.g., “The chair is 3 m to my right”). Alternatively, verbal labeling can also be done non-egocentrically, from a third-person framework, by describing object locations relative to another person or object (e.g., “The chair is 3 m to your right”). This third-person strategy is not consistent with the typical definition of a perspective-taking task, where the observer is expected to imagine egocentrically occupying the new perspective, rather than imagining someone or something else occupying that perspective. In contrast to the potential flexibility of the verbal response, pointing is reserved for indicating egocentric locations¹.

It is possible that the discrepancy between the results of Experiment 1 and those reported elsewhere is due to differences in participant strategy. The within-participants manipulation of response mode in Experiment 1 may have encouraged participants to use the same egocentric strategy for both verbal and pointing responses, a more parsimonious solution to the two tasks. In contrast, participants in the verbal labeling condition of previous experiments may have used a third-person strategy, which does not incur the same interference costs as an egocentric strategy. In Experiment 2, participants were instructed to use a third-person strategy for both response modes. This non-egocentric strategy should eliminate the cost associated with imagining non-occupied perspectives when using a verbal response, which can be used to indicate non-egocentric directions. The third-person strategy should not, however, reduce the costs associated with pointing from a non-occupied perspective, because the correct response will still need to be transformed into body coordinates.

Experiment 2

Participants in Experiment 2 were instructed to imagine an arrow pointing toward one object and then indicate the location of a second object relative to that arrow. This modified instruction set was expected to induce a third-person strategy when solving the task. The disorientation condition was removed in Experiment 2 because it did not greatly aid in the identification of response mode differences, as the independent effects of alignment with the body and the learning perspective were both apparent within the oriented test conditions.