Abstract
Absolute pitch (AP) possessors can identify musical notes without an external reference. Most AP studies have used musical instruments and pure tones for testing, rather than the human voice. However, the voice is crucial for human communication in both speech and music, and evidence for voice-specific neural processing mechanisms and brain regions suggests that AP processing of voice may be different. Here, musicians with AP or relative pitch (RP) completed online AP or RP note-naming tasks, respectively. Four synthetic sound categories were tested: voice, viola, simplified voice, and simplified viola. Simplified sounds had the same long-term spectral information but no temporal fluctuations (such as vibrato). The AP group was less accurate in judging the note names for voice than for viola in both the original and simplified conditions. A smaller, marginally significant effect was observed in the RP group. A voice disadvantage effect was also observed in a simple pitch discrimination task, even with simplified stimuli. To reconcile these results with voice-advantage effects in other domains, it is proposed that voices are processed in a manner that facilitates voice- or speech-relevant features at the expense of features that are less relevant to voice processing, such as fine-grained pitch information.
I. INTRODUCTION
Absolute pitch (AP), colloquially known as “perfect pitch,” refers to the ability to identify the note name (chroma) of a given tone and/or to produce the pitch corresponding to a given note name without external reference (Levitin and Rogers, 2005). AP ability differs from relative pitch (RP) ability, a common auditory skill in trained musicians. While AP possessors can identify the note name of a tone played in isolation, RP possessors need to infer the note name from the musical interval between the tone and a given reference tone.
AP has long been considered a rare talent even among trained musicians, although the proportion of AP possessors in the general population remains unclear. Among music students at conservatories or universities, the proportion of AP possessors varies across countries and cultures from near zero to above 50%, with higher incidence in Asia than in Europe or the US (Miyazaki et al., 2018), probably due to differences in music training methods (Miyazaki et al., 2012) and possibly language (Deutsch et al., 2004; but see Schellenberg and Trehub, 2008). Although one recent study reported that two adult musicians with high auditory working memory abilities reached levels of performance indistinguishable from genuine AP possessors after extensive training (Van Hedger et al., 2019), it is generally believed that the acquisition of AP is associated with musical training within an early critical period in development (Levitin and Rogers, 2005). Learning a fixed-pitch instrument seems to contribute to the acquisition of AP (Vanzella and Schellenberg, 2010).
Many studies have treated AP possessors as a single cohort, but they are not homogeneous (Bahr et al., 2005). Two commonly discussed subtypes are known as “true AP” and “quasi AP” (Bachem, 1955; Levitin and Rogers, 2005; Kim and Knösche, 2017). True AP possessors can name notes with high accuracy and low latency. In contrast, quasi AP possessors may be somewhat less accurate, particularly at the extreme ends of the note range, and may take longer to respond, possibly because they compare the heard tone with an inner standard tone they remember. However, both subtypes can achieve high performance in typical pitch-labeling tasks. A proposed way to distinguish between these two subtypes is by measuring reaction time (RT) (Levitin and Rogers, 2005), although its empirical effectiveness in distinguishing between the two groups remains unknown.
Although AP can be a useful skill in musical tasks like transcription and sight singing, it is not always helpful. For instance, AP possessors' interval naming performance can be degraded when one or both of the two notes of the interval are mistuned (Miyazaki, 1995). When reading music notation and listening to a tone at the same time, the note name of the heard tone will affect the performance of reading the written note, but not the converse (Akiva-Kabiri and Henik, 2012). These findings indicate that note naming is highly automatic for at least some AP possessors. For them, AP cannot be “turned off” even when the task encourages them to do so, resulting in interference, and hence poorer performance, in such tasks compared to non-AP musicians.
Given the potential for AP abilities to interfere with task requirements, it is also possible that a task or stimulus may interfere with AP performance. One such example was reported by Vanzella and Schellenberg (2010). They found that AP possessors were less accurate in naming the pitches of recorded sung or synthesized voice than of piano notes or pure tones. The same pattern of results was replicated in AP possessors and extended to non-AP musicians (albeit at a much lower overall level of performance) by Weiss et al. (2015). Vanzella and Schellenberg (2010) suggested that voice-specific mechanisms may interfere with the process of note naming, leading to poorer performance. There is certainly neurophysiological and behavioral evidence for voice-specific processing, ranging from potentially voice-selective cortical regions (Belin et al., 2000; Agus et al., 2017) and responses (Charest et al., 2009) to perceptual asymmetries making it easier to recognize a target voice sound in a sequence of non-voice distractors than vice versa (Agus et al., 2012; Isnard et al., 2019). Finally, when distinguishing previously heard melodies from novel melodies, both AP and non-AP musicians, as well as non-musicians, have been shown to have higher accuracy for melodies presented with voice than with instrumental timbres (Weiss et al., 2015).
Although voice-specific mechanisms may interfere with AP note naming (Vanzella and Schellenberg, 2010), some questions remain. First, because only performance, and not RT, was measured, it may be that the poorer performance with voice stimuli was due to a speed-accuracy trade-off (e.g., Heitz, 2014), perhaps reflecting particularly rapid processing of voice stimuli, rather than a decrease in underlying sensitivity. Second, if the effects are due to interference between the identity of the vowel carried by the voice (e.g., /a/) and the verbal labels of the pitches, then similar interference effects should be observed in RP tasks that involve naming the notes in musical intervals. If the effect is instead specific to AP, then another interpretation may be necessary. Finally, the decrease in performance may reflect the “vocal generosity effect,” whereby both musicians and non-musicians have been found to be poorer in both melody intonation judgments and pitch discrimination for voice than for non-voice stimuli (Hutchins et al., 2012). Although the vocal generosity effect may be due in part to the greater vibrato typically found in singing (Van Besouw et al., 2008), this cannot explain the results found by Vanzella and Schellenberg (2010), as they reported a similar effect with natural and synthesized voice, despite the lack of pronounced variations in fundamental frequency (F0) in the synthesized version.
The aim of the present study was to address the three questions outlined above: (1) Is the voice-disadvantage effect specific to AP, or does it extend to RP note-naming tasks? (2) Does the voice-disadvantage effect reflect a speed-accuracy trade-off or a true decrease in sensitivity? (3) To what extent is the voice-disadvantage effect a reflection of differences in basic F0 discrimination between voice and non-voice stimuli, as suggested by the vocal generosity effect? In our online Experiment 1, we addressed the first two questions by adopting a note name judgment task to test AP and RP musicians' ability to label the pitches of vocal (spoken vowel) and non-vocal (viola) stimuli, where a reference pitch with known note name was provided in the RP, but not the AP, task. The stimuli were either unprocessed or were manipulated to remove any time-varying features (such as vibrato) that might be more pronounced in the vocal stimuli. To detect any potential speed-accuracy trade-off, we also measured RT. Finally, to further assess the generalizability of the voice disadvantage effect and the potential influence of the vocal generosity effect, our in-person Experiment 2 adopted a basic F0 discrimination task for the same vocal and non-vocal stimuli (with and without time-varying features) that were used in Experiment 1.
II. EXPERIMENT 1: ABSOLUTE AND RELATIVE NOTE NAME JUDGMENTS
A. Methods
1. Participants
A total of 321 participants (age range: 18–46 years, mean = 21.3, standard deviation, SD = 4.2) took part in an online AP-RP screening test. Fifty-five participants passed the initial AP screening test (age range: 18–39 years, mean = 22.1, SD = 4.5; duration of musical training range: 3–25 years, mean = 12.6, SD = 4.9), and 40 of the remaining participants passed the RP screening test (age range: 18–32 years, mean = 21.5, SD = 3.6; duration of musical training range: 5–27 years, mean = 11.9, SD = 4.1). Two RP participants were excluded from the analysis comparing AP and RP participants due to technical errors in the online test. Participants were recruited through email lists, introductory psychology courses, social media, and word of mouth. At the time of recruitment, the participants were informed that the study was about musicians' pitch labeling abilities and that their task was to complete a questionnaire and potentially some listening tasks. To be eligible for the study, participants needed to consider themselves familiar with Western note names (e.g., C/C#/D). This study was approved by the University of Minnesota's Institutional Review Board. All the participants completed an online consent form prior to the study and were awarded a digital gift card or extra course credit upon completion.
2. Music background questionnaire
All participants completed a questionnaire about their musical background, self-evaluation of passive (i.e., perception) and active (i.e., production) AP proficiency, and (if applicable) AP strategies used for different types of sounds. Specifically, the available options for the strategies were “automatic” and “comparative.” “Automatic” was explained in terms of being able to come up with the note name immediately without recalling any inner standard tone for reference, and “comparative” was explained in terms of needing to compare the presented tone with some inner standard tone to determine the note name. Participants were allowed to select “other” and explain their strategies in a free-response box. The full questionnaire is provided in the supplementary materials.1 The questionnaire and all of the experimental tasks were delivered online via PsyToolkit (Stoet, 2010, 2017) and were only accessible via a laptop or desktop computer. Participants were instructed to complete the tasks in a quiet environment. Headphones were not required, and there were no questions or tests checking for the use of headphones.
3. Screening test and qualification for AP and RP
Prior to the experiment, a 5-s white noise with the same overall level as the tones in the task was played to the participants, so that they could adjust the volume on their device to a comfortable level. The participants were able to replay the noise as many times as they wished before clicking a button to proceed to the experiment once they were satisfied with the loudness.
Following the level adjustment, all participants took part in a screening test that involved assessment of AP and RP skills. The AP section contained 24 trials. In each trial, the participants heard a piano tone and saw a note name (e.g., “C”) displayed on the screen at the same time. Their task was to judge whether the note name matched the tone and to press “1” for match or “2” for no match on the keyboard. Each note in the range of C3–B4 was played exactly once, and each of the 12 note names was displayed on the screen exactly twice. The black-key notes were displayed with both alternative names (e.g., “C#/Db” or “D#/Eb”). In half the trials, the note name matched the piano tone; in the other half, the note names were equally likely to be ±1 or ±4 semitones (STs) deviant from the piano tone. The correspondence between the piano tones and note names was counterbalanced across participants.
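The trial-list constraints above can be sketched as follows. This is a hypothetical Python illustration, not the PsyToolKit implementation used in the study, and it enforces only the tone and match/mismatch constraints; jointly satisfying the additional constraint that each of the 12 note names appears exactly twice would require an extra constraint-satisfaction step, which is omitted here.

```python
import random

# Display names; black keys show both enharmonic spellings, as in the task.
NOTE_NAMES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
              "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]

def make_ap_screening_trials(rng):
    """Sketch one 24-trial AP screening list: each piano tone C3-B4 is
    played exactly once, half the trials match, and mismatched note
    names deviate by +/-1 or +/-4 semitones from the tone."""
    tones = list(range(24))          # 0 = C3, ..., 23 = B4
    rng.shuffle(tones)
    trials = []
    for tone in tones[:12]:          # matching trials
        trials.append({"tone": tone,
                       "name": NOTE_NAMES[tone % 12],
                       "match": True})
    for tone in tones[12:]:          # mismatching trials
        offset = rng.choice([-4, -1, 1, 4])
        trials.append({"tone": tone,
                       "name": NOTE_NAMES[(tone + offset) % 12],
                       "match": False})
    rng.shuffle(trials)              # randomize presentation order
    return trials
```

In this sketch, regenerating the list with different random seeds would play the role of counterbalancing tone-name correspondences across participants.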
The piano tones were exported from GarageBand v.10.3.5 (Apple Inc., 2020). Each tone was restricted to a total duration of 250 ms by gating it off with a 30-ms raised-cosine ramp. To preserve a naturalistic piano timbre, no additional onset ramp was applied. The note name appeared on the screen at the onset of the tone and was displayed until either the subject made a response or 5 s after the tone onset, whichever occurred first. No feedback was provided, but a highlighted “Miss” was displayed if the participant failed to respond within 5 s. Trials were interspersed with a 500-ms burst of spectro-temporally rippled noise with falling frequency sweeps (Aronoff and Landsberger, 2013), presented at the same overall level as the tones and gated with 100-ms linear onset and offset ramps, followed by a 1100-ms silent gap. The noise was included to reduce any possible carryover of pitch information across trials.
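The truncation and offset gating applied to the piano tones can be illustrated with a short sketch (an assumed Python/NumPy rendering, not the original export pipeline; the sample rate and function name are illustrative):

```python
import numpy as np

def truncate_and_gate(x, fs, dur_ms=250.0, ramp_ms=30.0):
    """Truncate x to dur_ms and apply a raised-cosine offset ramp of
    ramp_ms to avoid an abrupt cutoff; no onset ramp is applied, which
    preserves the natural attack of the tone."""
    y = np.array(x[:int(round(fs * dur_ms / 1000.0))], dtype=float)
    n = int(round(fs * ramp_ms / 1000.0))
    ramp = 0.5 * (1.0 + np.cos(np.linspace(0.0, np.pi, n)))  # 1 -> 0
    y[-n:] *= ramp
    return y
```

At fs = 44100 Hz, for example, the output is 11 025 samples long, with only the final 30 ms attenuated.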
Participants scoring at least 75% correct in the AP section (N = 55) were assigned to the AP group and skipped the remaining part of the screening test. All others were then presented with the RP screening test, which also contained 24 trials. The RP section was identical to the AP section, with the exception that a 250-ms standard piano tone (A3 = 220 Hz) was presented 100 ms after the offset of the noise burst. The silent gap between the offset of the standard piano tone and the onset of the test tone was 750 ms. The standard tone was played before every test tone, and the participants were instructed at the beginning of each block that the note name of the standard tone was always A.
The participants who undertook the RP section and scored at least 75% correct (N = 40) were assigned to the RP group. All other participants, who had scored lower than 75% correct in both the AP and RP sections, were excluded from the remainder of the study.
4. Note name judgment task
Participants in the AP and RP groups completed their respective versions of the note name judgment task. The RP participants did not have the ability to reliably make AP judgments, as was shown in the screening test, and so did not complete the AP note name judgment task. The AP participants could potentially have used either AP or RP abilities to undertake the RP tasks (Miyazaki, 1995), so they were not tested on the RP note name judgment task. Each version consisted of four blocks of trials, each of which was very similar to the corresponding (AP or RP) version of the screening task. The differences were: (a) each note in the range of C3–B3 was used exactly twice; (b) each of the four blocks used a different type of sound, as described below. Within each block, the same type of sound was used as both the standard and test tone. The presentation order of the four blocks was randomized for each participant independently. Participants were able to take breaks between blocks.
The four sound types were original voice, original viola, simplified voice, and simplified viola. The original voice stimuli were based on a female spoken /a/ sound, synthesized and transposed to different pitch heights using the TD-PSOLA algorithm in Praat (Boersma, 2001), whereby the periodicity of the sound is altered but the spectral envelope and duration remain the same. The original viola stimuli were based on a synthesized viola sound (the note F3) exported from Avid Sibelius, again transposed to different pitch heights (i.e., the 12 notes between C3 and B3) using Praat. Viola was chosen to match the pitch range and spectral richness of voice, as in previous studies (e.g., Agus et al., 2012). Each tone had a total duration of 250 ms and was gated off with a 30-ms raised-cosine ramp. The simplified voice and simplified viola stimuli were constructed by taking one cycle from the middle of the waveform of the voice or viola note F3, respectively, repeating it throughout the 250-ms duration of the tone in MATLAB R2018b (MathWorks Inc., Natick, MA), gating the tone on and off with 30-ms raised-cosine ramps, and transposing it to different pitch heights using Praat. Therefore, the long-term spectral and formant information was maintained but any fluctuations in spectrum, F0, or level over the course of the sound were eliminated, along with any transient effects. The reduction in random fluctuations as a result of editing was confirmed by calculating the SDs for F0 (in STs) and level (in dB) during the steady-state segment of the stimuli (from 70 to 220 ms). For F0 estimates, Praat was used to calculate the F0 in 10-ms windows; for level estimates, the root-mean-square (rms) level was calculated with Hanning windows of 10-ms half-amplitude duration. The resulting SDs for F0 (in ST) were 0.094 for the original voice, 0.00012 for the simplified voice, 0.027 for the original viola, and 0.00012 for the simplified viola.
The resulting SDs for level (in dB) were 1.46 for the original voice, 0.038 for the simplified voice, 0.40 for the original viola, and 0.032 for the simplified viola. Examples of the stimulus waveforms and spectra are shown in Fig. 1. Audio of the four examples shown in Fig. 1 can be found in the supplementary materials.1
FIG. 1.
(Color online) Waveforms and spectrograms of the four stimulus categories: (a) voice, original; (b) voice, simplified; (c) viola, original; and (d) viola, simplified. The spectra (0–16 kHz) of the four stimulus categories are shown in (e). The note displayed is F3 (F0 = 174.6 Hz). In the spectrograms, the lines represent the F0 contour.
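The cycle-repetition procedure and the level-fluctuation check described above can be sketched in Python. This is an illustrative approximation of the MATLAB/Praat pipeline, not the study's code: rectangular 10-ms frames stand in for the Hanning windows, and the example below uses a cycle length chosen so that analysis frames align with whole cycles.

```python
import numpy as np

def simplify(cycle, fs, dur_ms=250.0, ramp_ms=30.0):
    """Tile a single waveform cycle to dur_ms and gate on/off with
    raised-cosine ramps, removing all within-note fluctuations while
    preserving the long-term spectrum of the cycle."""
    n_total = int(round(fs * dur_ms / 1000.0))
    reps = int(np.ceil(n_total / len(cycle)))
    y = np.tile(np.asarray(cycle, dtype=float), reps)[:n_total]
    n_ramp = int(round(fs * ramp_ms / 1000.0))
    ramp = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, n_ramp)))  # 0 -> 1
    y[:n_ramp] *= ramp            # onset ramp
    y[-n_ramp:] *= ramp[::-1]     # offset ramp
    return y

def level_sd_db(x, fs, t0_ms=70.0, t1_ms=220.0, frame_ms=10.0):
    """SD of frame-wise rms level (in dB) over the steady-state
    segment, as a measure of level fluctuation."""
    seg = x[int(fs * t0_ms / 1000):int(fs * t1_ms / 1000)]
    n = int(fs * frame_ms / 1000)
    frames = seg[:len(seg) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return float(np.std(20.0 * np.log10(rms)))
```

By construction, a tiled single cycle yields a level SD near zero, whereas any amplitude modulation (e.g., vibrato-like fluctuation) inflates the metric.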
Immediately following the AP or RP note-naming task, participants listened to the note A3 from each of the four sound categories again and answered a set of questions. In the first part, they rated the similarity of each stimulus to a human voice and to a musical instrument, as well as how synthetic it sounded, on three separate 5-point scales (1 = not at all; 5 = very much). In the second part, they answered whether they were able to extract a vowel from each of the stimuli and, if so, which vowel. These questions were designed to measure the perceived identity of the test sounds.
5. Data analysis
The dependent variables of interest were accuracy and RT. Each response was coded as “1 = correct” or “0 = incorrect.” The raw RT was defined as the time interval between the onset of the test stimulus and the participant's response. For the missed trials, both RT and response were treated as missing. Across all conditions and participants, the percentage of missed trials was low (∼1%). The responses and RTs of all non-missed trials were included in the data analysis.
To answer the main questions of this study, effects of group (AP vs RP), stimulus (voice vs viola), and editing (simplified vs original), as well as their interactions, on the log-transformed RT values [10*log10(RT)] were assessed via mixed-effects linear models. A logarithmic transformation was used to produce a more normal distribution of the RTs. The effects on the accuracy of the same set of factors and their interactions were assessed via mixed-effects logistic models. The effects of stimulus rating and AP strategy (comparative vs automatic) on RT and accuracy were also examined in subsequent analyses. Mixed-effects linear models and mixed-effects logistic models were implemented using the lme4 package (Bates et al., 2015) and the GLMMadaptive package (Rizopoulos, 2021), respectively. The input data for the mixed-effects logistic regressions were 0 or 1 (i.e., incorrect or correct) for each individual trial, and the input data for the mixed-effects linear regressions were RT (in log-transformed milliseconds) for each individual trial. The p-values for the regression coefficients were obtained using the lmerTest package (Kuznetsova et al., 2017). All analyses were performed in R v4.0.2 (R Core Team, 2020). To assist with interpretability, the two possible values of each of the dichotomous predictors (group, AP/RP; stimulus, voice/viola; editing, original/simplified; AP strategy, automatic/comparative) were recoded as 0.5 and –0.5 (Schielzeth, 2010; Brauer and Curtin, 2018). The effects of within-subjects variables (stimulus and editing) were treated as random at the individual level, and participant was included as a random effect. All the estimates reported in the following section are at the population level (i.e., the overall difference across experimental conditions, after accounting for variability between participants). For mixed-effects linear models, Cohen's d was calculated as described by Brysbaert and Stevens (2018). 
For mixed-effects logistic models, odds ratios (ORs) were calculated as a measure of effect size. The further the OR deviates from 1 in either direction, the larger the effect.
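For reference, the OR and its 95% CI follow from a logistic coefficient b and its standard error in the standard way, OR = exp(b) with Wald interval exp(b ± 1.96·SE), and the ±0.5 predictor coding can be written as a one-line recoding. A minimal sketch (illustrative Python, not the R analysis code; function names are ours):

```python
import math

def odds_ratio(b, se, z=1.96):
    """Convert a logistic-regression coefficient and its standard error
    into an odds ratio with a Wald 95% confidence interval."""
    return math.exp(b), (math.exp(b - z * se), math.exp(b + z * se))

def recode(level, levels):
    """Center a dichotomous predictor: first level -> -0.5,
    second level -> +0.5, for interpretable main effects."""
    return -0.5 if level == levels[0] else 0.5
```

For example, the main effect of stimulus reported in Sec. II B (b = −0.42, SE = 0.10) gives OR ≈ 0.66 with CI ≈ (0.54, 0.80), matching the reported OR = 0.65, 95% CI [0.54, 0.79] to within the rounding of b.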
To further validate the results, variables related to participants' musical backgrounds and parameters of each trial were adjusted for in a separate model when possible. These variables were the amount of musical training (years), current music playing time (hours per week), current music listening time (hours per week), first and primary instruments (four categories: piano, strings, voice, other), age when musical training commenced (years), native language (tonal or non-tonal), interval between sound stimulus and displayed note name (0, 1, or 4 STs), and direction of mismatch (1 = note name higher than sound stimulus, 0 = correct correspondence, –1 = note name lower than sound stimulus). The results reported in the next section are not adjusted for these variables unless stated otherwise. In general, incorporating these variables did not affect the conclusions.
B. Results
1. Screening differentiates between AP and RP possessors
The distributions of the screening scores of all participants are shown in Fig. 2. In both cases, the mode of the distribution was near 12 out of 24, or 50% correct, representing chance performance. However, both distributions appear positively skewed, indicating better-than-chance performance for some of the participants. For the AP test in particular, there was evidence of a bimodal distribution, with a secondary peak reflecting very high performance (23–24 correct); a similar form of distribution, presumably reflecting a subset of listeners with AP, has been reported with a much larger sample (Athos et al., 2007). Those who passed each of the tests (at least 75% correct) are shown in the distributions with darker bars (N = 55 for AP; N = 40 for RP). The 55 AP participants were also included in an additional exploratory analysis to assess potential differences in performance between those using automatic and comparative AP strategies.
FIG. 2.
Distribution of scores (number of correct trials) in AP (left panel, N = 321) and RP (right panel, N = 266) screening tests. Darker bars represent those who passed each screening test.
2. Voice disadvantage is demonstrated more in AP participants' accuracy
To answer the main questions of this study, a mixed-effects logistic regression was carried out to examine the effects of group (AP vs RP), stimulus (voice vs viola), editing (simplified vs original), and their interactions on the accuracy of responses (Table I). Significant main effects were found of group (b = 0.84, standard error, SE = 0.27, p = 0.00167, OR = 2.32, 95% confidence interval, CI [1.37, 3.93]), with higher overall performance by the AP than the RP group, and stimulus (b = –0.42, SE = 0.10, p < 0.0001; OR = 0.65, 95% CI [0.54, 0.79]), with lower accuracy for voice than for viola sounds, along with a significant interaction between group and stimulus (b = –0.35, SE = 0.16, p = 0.0290). These effects held after adjusting for participants' musical backgrounds and parameters of each trial (i.e., the musical interval between sound stimulus and displayed note name, and the direction of mismatch). Based on the interaction between group and stimulus, we carried out separate mixed-effects logistic regressions for the AP and RP groups. A significant main effect of stimulus was found in the AP group (b = –0.71, SE = 0.19, p = 0.000151; OR = 0.49, 95% CI [0.34, 0.71]), with lower accuracy for voice than for viola sounds. A similar trend was observed in the RP group, but it just failed to reach significance (b = –0.21, SE = 0.11, p = 0.0525; OR = 0.81, 95% CI [0.66, 1.00]). Thus, the AP group showed a stronger voice disadvantage effect than the RP group (as evidenced by the group-stimulus interaction), although the effect remained marginally significant even in the RP group. The lack of effect of (or interactions with) editing suggests that within-stimulus variability did not contribute significantly to the effect.
TABLE I.
Mixed-effects logistic regression analysis predicting accuracy from group (AP or RP), stimulus (voice or viola), editing (original or simplified), and their interactions; b refers to the estimated unstandardized regression coefficients with standard errors in parentheses; p-values less than 0.05 are considered statistically significant.
| Predictor | b | p |
|---|---|---|
| Group | 0.84 (0.27) | 0.00166 |
| Stimulus | −0.42 (0.10) | <0.0001 |
| Editing | −0.01 (0.09) | 0.956 |
| Stimulus * Editing | 0.25 (0.13) | 0.0659 |
| Group * Stimulus | −0.35 (0.16) | 0.0290 |
| Group * Editing | −0.03 (0.15) | 0.865 |
| Group * Stimulus * Editing | −0.16 (0.27) | 0.548 |
A mixed-effects linear regression analysis was also carried out to examine the effects of group (AP vs RP), stimulus (voice vs viola), and editing (original vs simplified) on the log-transformed RTs (Table II). A significant main effect of group was found (b = –0.58, SE = 0.28, p = 0.0397, d = 0.27, 95% CI [0.02, 0.52]), with faster responses by the AP than the RP participants. A significant interaction between group and editing was found (b = –0.20, SE = 0.10, p = 0.0459, d = 0.09, 95% CI [0.0029, 0.18]), with AP but not RP participants responding faster to the simplified than to the original stimuli (b = –0.17, SE = 0.06, p = 0.00925, d = 0.08, 95% CI [0.02, 0.13]). These effects held after adjusting for participants' musical backgrounds and parameters of each trial, except that the main effect of group was reduced to a trend that just missed significance. No other significant main effects or interactions were found, although there was a marginally significant trend (b = 0.12, SE = 0.05, p = 0.0524, d = 0.05, 95% CI [0.00, 0.11]) for RTs with voice stimuli to be longer than with non-voice stimuli.
TABLE II.
Mixed-effects linear regression analysis predicting log-transformed RT from group (AP or RP), stimulus (voice or viola), editing (original or simplified), and their interactions; b refers to estimated unstandardized regression coefficients with standard errors in parentheses; p values less than 0.05 are considered statistically significant.
| Predictor | b | p |
|---|---|---|
| Group | −0.58 (0.28) | 0.0397 |
| Stimulus | 0.12 (0.06) | 0.0524 |
| Editing | −0.07 (0.05) | 0.131 |
| Stimulus * Editing | −0.02 (0.07) | 0.729 |
| Group * Stimulus | 0.22 (0.12) | 0.0686 |
| Group * Editing | −0.20 (0.10) | 0.0459 |
| Group * Stimulus * Editing | −0.24 (0.14) | 0.0967 |
In summary, the results demonstrate that a voice disadvantage effect in pitch labeling accuracy exists in AP musicians performing an AP task and that it is greater than a similar (but non-significant) trend in RP musicians performing an RP task (Fig. 3). This effect cannot be attributed to a shift in criterion due to a speed-accuracy trade-off because the decreased accuracy observed with voice judgments was not accompanied by reduced RTs. Although the AP group had shorter RTs to the simplified than to the original version of the stimuli, this effect of editing did not seem to interact with the stimulus type (voice vs viola), indicating that the difference in performance between voice and instrument sounds cannot be ascribed to greater random fluctuations in the pitch and quality of the voice than of the viola.
FIG. 3.
(Color online) Accuracy (top row) and RT (bottom row) of AP (left panels) and RP (right panels) participants under different conditions. Black horizontal lines denote median values, shaded areas are the interquartile ranges (IQR), and vertical lines denote the range of the values that are no further than 1.5IQR away from the shaded area. Outliers (defined as values more than 1.5IQR away from the shaded area) are shown as black dots. Colored lines connect all the data from individual participants.
3. Subjective measures of original and simplified stimuli also reflect a voice disadvantage in the AP task
For each of the four sound categories, participants rated the sound's similarity to a human voice and to a musical instrument, as well as how synthetic it sounded, on three separate 5-point scales. Two-way repeated-measures analyses of variance (ANOVAs) were performed on the subjective ratings, with stimulus (voice vs viola) and editing (original vs simplified) as within-subject factors. The analyses showed that simplifying the sound stimuli made them sound less similar to their respective category (i.e., voice or instrumental sound), and more synthetic. In addition, there was an asymmetrical change in perceived sound categories: simplifying a voice made it sound more like an instrumental sound, but simplifying a viola did not make it sound more like a voice.
For voice similarity ratings, the mean rating score of voices was significantly higher than that of viola sounds [F(1,92) = 262.06, p < 0.0001, ηp2 = 0.74], the mean rating score of original sounds was significantly higher than that of the simplified versions [F(1,92) = 22.62, p < 0.0001, ηp2 = 0.20], and there was a significant interaction between stimulus and editing [F(1,92) = 13.61, p = 0.000381, ηp2 = 0.13]. Post hoc analysis with Bonferroni correction showed that the simplified voice was rated as less similar to voice than the original voice (p < 0.0001), but editing did not affect viola sounds' perceived (dis)similarity to voice (p = 0.203).
For instrument similarity ratings, the mean rating score of the viola was significantly higher than that of the voice [F(1,92) = 241.87, p < 0.0001, ηp2 = 0.72], and there was a significant interaction between stimulus and editing [F(1,92) = 46.86, p < 0.0001, ηp2 = 0.34], but no main effect of editing [F(1,92) = 1.63, p = 0.204, ηp2 = 0.02]. Post hoc analysis with Bonferroni correction showed that the simplified voice was rated as more similar to an instrument sound than the original voice (p < 0.0001), and that the simplified viola sound was rated as less similar to an instrument than the original viola sound (p < 0.0001).
For synthetic ratings, voices were rated as more synthetic than viola sounds [F(1,92) = 4.44, p = 0.0379, ηp2 = 0.05], and simplified versions were rated as more synthetic than original versions [F(1,92) = 23.41, p < 0.0001, ηp2 = 0.20], but there was no significant interaction between stimulus and editing [F(1,92) = 3.38, p = 0.0691, ηp2 = 0.04].
Similarly, for each of the four sound categories, participants answered whether they could extract a vowel from the sound. The responses were coded as 0 (for not being able to extract a vowel) and 1 (for extracting any vowel). A mixed-effects logistic regression revealed that participants were significantly more likely to extract a vowel from voice than from viola stimuli (b = 8.80, SE = 2.87, p = 0.00219, OR = 6.63 × 103, 95% CI [23.76, 1.85 × 106]), but no differences were observed between the original and the simplified versions (b = 0.73, SE = 0.79, p = 0.355, OR = 2.07, 95% CI [0.44, 9.70]). The interaction between stimulus (voice vs viola) and editing (simplified vs original) was not analyzed due to the limited amount of data.
While voice and viola sounds were treated as separate sound categories in this study, the above analyses show that perceived similarities to voice and instrumental sounds are not always complementary (e.g., a decrease in perceived similarity to instrument does not necessarily mean increased similarity to voice). To examine whether the voice-disadvantage effect that we observed when the sounds were grouped based on physical characteristics was also found when the subjective measures were used to categorize the sounds, a set of mixed-effects linear and logistic regressions similar to the models described in the last section were constructed, where subjective measures of the stimuli were used in lieu of the physical stimulus characteristics themselves (voice vs viola and original vs simplified). The effects of voice rating, instrumental rating, synthetic rating, and vowel extraction were assessed through four separate sets of regressions, with each set consisting of a mixed-effects linear regression for RT and a mixed-effects logistic regression for accuracy. As expected, the results were largely in line with those obtained when considering the physical sound characteristics: the AP group was consistently more accurate (voice model: b = 0.81, SE = 0.26, p = 0.00163, OR = 2.26, 95% CI [1.36, 3.74]; instrument model: b = 0.81, SE = 0.27, p = 0.00219, OR = 2.25, 95% CI [1.34, 3.79]; synthetic model: b = 0.76, SE = 0.28, p = 0.00638, OR = 2.15, 95% CI [1.24, 3.72]; vowel extraction model: b = 0.88, SE = 0.27, p = 0.00128, OR = 2.41, 95% CI [1.41, 4.11]) and faster (voice model: b = –0.57, SE = 0.28, p = 0.0442, d = 0.27, 95% CI [0.01, 0.53]; instrument model: b = –0.56, SE = 0.28, p = 0.0493, d = 0.26, 95% CI [0.0043, 0.28]; synthetic model: b = –0.62, SE = 0.28, p = 0.0312, d = 0.30, 95% CI [0.03, 0.56]; vowel extraction model: b = –0.61, SE = 0.28, p = 0.0343, d = 0.28, 95% CI [0.02, 0.54]) than the RP group.
For accuracy, a higher perceived similarity to voice (voice model: b = –0.11, SE = 0.03, p = 0.00120, OR = 0.89, 95% CI [0.84, 0.96]), a lower perceived similarity to instrumental sound (instrument model: b = 0.13, SE = 0.04, p = 0.00298, OR = 1.14, 95% CI [1.06, 1.22]), and an identifiable vowel (vowel extraction model: b = –0.38, SE = 0.13, p = 0.00288, OR = 0.68, 95% CI [0.53, 0.88]) were all significantly associated with lower accuracy. For RT, a significant interaction between group and instrument rating was found (b = –0.08, SE = 0.04, p = 0.0470, d = 0.04, 95% CI [0.0012, 0.08]), where a lower perceived similarity to instrumental sound was associated with a slower response in the AP (b = –0.08, SE = 0.02, p = 0.00160, d = 0.04, 95% CI [0.02, 0.06]) but not the RP group. No other main effects or interactions were observed.
4. Voice disadvantage is larger in self-reported automatic than comparative AP participants
As a preliminary exploration of how the voice disadvantage effect is manifested in different subgroups of AP participants, we tested whether and how it differs between AP participants with self-reported automatic and comparative strategies (Fig. 4). Some participants reported using different strategies for different types of sounds, so we recoded their strategies as either automatic or comparative, based on the strategy used most often for the timbres used in this experiment (i.e., piano, voice, viola). The automatic and comparative strategies correspond loosely to the “true AP” and “quasi AP” subtypes discussed in previous literature (e.g., Levitin and Rogers, 2005), respectively. Thirty-one AP participants were in the automatic strategy subgroup, and 22 AP participants were in the comparative strategy subgroup. Two AP participants were not included in this section's analyses because they reported using other unspecified strategies and did not clearly belong to either category.
FIG. 4.
(Color online) Accuracy (top row) and RT (bottom row) of automatic AP (left panels) and comparative AP (right panels) participants under different conditions. See Fig. 3 for a more detailed description.
A mixed-effects logistic regression was carried out to examine the effects of AP strategy (comparative vs automatic), stimulus (voice vs viola), editing (simplified vs original), and their interactions on the accuracy of responses (Table III). As expected, there was a significant effect of stimulus (b = –0.66, SE = 0.18, p = 0.000283; OR = 0.52, 95% CI [0.36, 0.74]), with lower accuracy for voice than for viola sounds. There was also a significant effect of AP strategy (b = 1.19, SE = 0.44, p = 0.00620, OR = 3.29; 95% CI [1.40, 7.74]), with higher overall performance by the automatic than the comparative AP participants. The two-way interaction between AP strategy and stimulus (b = –0.54, SE = 0.25, p = 0.0292) and the three-way interaction (b = 1.03, SE = 0.41, p = 0.0120) also reached significance. These effects held even after adjusting for participants' musical backgrounds and parameters of each trial. When the responses from the automatic and the comparative AP participants were analyzed separately, the main effect of stimulus (reflecting lower accuracy for voice than for viola stimuli) was observed in both groups (automatic group: b = –0.78, SE = 0.31, p = 0.0123, OR = 0.46, 95% CI [0.25, 0.84]; comparative group: b = –0.44, SE = 0.21, p = 0.0368, OR = 0.64, 95% CI [0.42, 0.97]). A significant interaction between stimulus and editing was found for the automatic group only (b = 0.75, SE = 0.33, p = 0.0213); however, subsequent contrast analyses indicated no effect of editing for either the voice or non-voice stimuli and no effect of stimulus for either the original or simplified sounds. Therefore, a voice disadvantage effect in terms of accuracy was observed in both automatic and comparative AP participants. In summary, the overall accuracy of the comparative AP participants was lower than that of the automatic AP participants, as expected, and the voice disadvantage effect was smaller for the comparative than for the automatic AP participants.
TABLE III.
Mixed-effects logistic regression analysis predicting accuracy from AP strategy, stimulus, editing, and their interactions; b refers to estimated unstandardized regression coefficients with standard errors in parentheses; p values less than 0.05 are considered statistically significant.
| Predictor | b | p |
|---|---|---|
| Stimulus | −0.66 (0.18) | 0.000282 |
| Editing | 0.02 (0.16) | 0.892 |
| Strategy | 1.19 (0.44) | 0.00620 |
| Stimulus * Editing | 0.22 (0.21) | 0.292 |
| Strategy * Stimulus | −0.54 (0.25) | 0.0285 |
| Strategy * Editing | −0.13 (0.22) | 0.563 |
| Strategy * Stimulus * Editing | 1.03 (0.41) | 0.0120 |
A similar mixed-effects linear regression was performed on the log-transformed RT values (Table IV). Interestingly, there was now a significant main effect of stimulus (b = 0.22, SE = 0.07, p = 0.00443, d = 0.11, 95% CI [0.04, 0.18]), with longer RT for voice than for viola sounds. There were also significant main effects of strategy (b = –1.32, SE = 0.36, p = 0.000516, d = 0.64, 95% CI [0.30, 0.98]), with faster responses by the automatic than by the comparative AP participants, and of editing (b = –0.17, SE = 0.07, p = 0.0127, d = 0.08, 95% CI [0.02, 0.15]), with longer RT for the original than for the simplified version of the stimuli. No interactions were found. This pattern of results also held after adjusting for participants' musical backgrounds and parameters of each trial.
TABLE IV.
Mixed-effects linear regression analysis predicting log-transformed RT from AP strategy, stimulus, editing, and their interactions; b refers to estimated unstandardized regression coefficients with standard errors in parentheses; p values less than 0.05 are considered statistically significant.
| Predictor | b | p |
|---|---|---|
| Stimulus | 0.22 (0.07) | 0.00443 |
| Editing | −0.17 (0.07) | 0.0127 |
| Strategy | −1.32 (0.36) | 0.000516 |
| Stimulus * Editing | −0.11 (0.09) | 0.197 |
| Strategy * Stimulus | 0.04 (0.15) | 0.797 |
| Strategy * Editing | −0.02 (0.13) | 0.907 |
| Strategy * Stimulus * Editing | −0.26 (0.17) | 0.140 |
Overall, the analyses focusing only on AP participants agree with the results from the initial analyses. Both automatic and comparative AP participants showed voice disadvantage effects in the form of decreased accuracy. When data from the AP group were analyzed separately, the increase in RT also reached significance, although it had not been significant as a main effect or interaction when considering both the AP and RP groups. As expected, automatic AP participants had higher accuracy and shorter RT compared to comparative AP participants. The current results suggest an interaction between stimulus and AP strategy, since the voice disadvantage effect was larger in automatic than in comparative AP participants, at least in terms of accuracy.
5. Weak but significant relationships between self-reported musical background and screening performance
In previous studies, musical training has been established as an important factor contributing to the development of AP ability (e.g., Miyazaki et al., 2012; Levitin and Rogers, 2005; Vanzella and Schellenberg, 2010). To examine the relationship between participants' self-reported musical background and their AP screening performance, we conducted a linear regression with all the participants who completed the questionnaire and the screening test (N = 321), including AP and RP participants, as well as those who did not pass screening. The predictors included native language (tonal vs non-tonal), the amount of musical training (years), current music playing and listening time (hours per week), first and primary instruments (piano vs others), and age when musical training began (years). The dependent variable was the percentage of correct responses in the AP screening test (range: 0 to 100). We found significant positive effects of current music playing time (b = 0.25, SE = 0.12, p = 0.0311, ηp² = 0.015) and of the amount of musical training (b = 1.00, SE = 0.25, p < 0.0001, ηp² = 0.048). Among all the continuous predictors (i.e., all except native language), weak significant correlations were observed between the amount of music training and current music playing time (Pearson's r = 0.31, p < 0.0001), and between the amount of music training and the age of music training onset (Pearson's r = –0.40, p < 0.0001). The correlations between all other possible pairs of continuous predictors were small (Pearson's r < 0.15 in all cases). We also conducted a linear regression in the non-AP subset (N = 266) to examine the same set of predictors' effects on RP screening performance. We found a significant positive effect of the amount of musical training (b = 0.95, SE = 0.21, p < 0.0001, ηp² = 0.072). The correlations between the other continuous predictors in the non-AP subset shared the same pattern as when both AP and non-AP participants were included.
A correlation test between AP and RP screening scores in non-AP participants showed a weak but significant correlation (Pearson's r = 0.29, p < 0.0001).
To evaluate the potential of using self-assessment of AP proficiency as a quick index of AP ability in lieu of a screening test, we determined how well the self-assessment of being “always accurate” (for note-naming tasks) or “accurate - the error is less than 50 cents” (for singing and tuning tasks) in a variety of items and self-report of being an AP possessor predict qualification for the AP group (i.e., at least 75% correct in the AP screening test) in terms of sensitivity, specificity, and d′. The candidate self-assessment items were: accuracy in naming the chromas (e.g., C/C#/D) of tones played on the piano, tones played on instruments that the participant plays, tones played on instruments that the participant does not play, voice, synthetic sound, or environmental sound without a reference tone; accuracy in singing a pitch without a reference tone; and self-report of AP possession. The item “tuning an instrument without a reference tone” was dropped because many participants reported that their instruments did not need tuning. The different items varied in their power to distinguish between participants who did and did not pass the AP screening (Table V). Considering the low prevalence of AP among all participants, a screening test is still a preferable procedure to eliminate the false positives in the self-assessment without falsely rejecting too many AP participants.
TABLE V.
Sensitivity, specificity, and d′ of using different self-report items as an indicator for AP screening performance. For the item with 100% specificity, d′ was calculated using the Hautus (1995) adjustment method.
| Item | Sensitivity (hits) | Specificity (1-FA) | d′ |
|---|---|---|---|
| Passive AP: Piano | 63.6% | 94.7% | 1.97 |
| Passive AP: Instrument that I play | 58.1% | 94.4% | 1.79 |
| Passive AP: Instrument that I don't play | 38.2% | 100% | 2.60 |
| Passive AP: Voice | 32.7% | 98.5% | 1.72 |
| Passive AP: Synthetic sound | 21.8% | 99.2% | 1.65 |
| Passive AP: Environmental sound | 12.7% | 98.5% | 1.03 |
| Active AP: Singing a note | 54.5% | 97.4% | 2.05 |
| Self-report of AP possession | 54.5% | 94.4% | 1.70 |
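The d′ values in Table V can be reproduced from the tabulated rates as d′ = z(hit rate) − z(false-alarm rate), with z the inverse normal CDF. A brief sketch follows; the counts of 55 AP and 266 non-AP participants used for the Hautus-adjusted row are inferred from the sample sizes reported above (N = 321 overall, N = 266 non-AP) and are thus an assumption:

```python
from statistics import NormalDist

def dprime(hit, fa, n_ap=None, n_non_ap=None):
    """d' = z(hit rate) - z(false-alarm rate). When a rate is exactly
    0 or 1, z() is undefined, so apply the log-linear (Hautus, 1995)
    correction: add 0.5 to each count and 1 to each total."""
    if hit in (0.0, 1.0) or fa in (0.0, 1.0):
        hit = (hit * n_ap + 0.5) / (n_ap + 1)
        fa = (fa * n_non_ap + 0.5) / (n_non_ap + 1)
    z = NormalDist().inv_cdf
    return z(hit) - z(fa)

# "Passive AP: Piano": sensitivity 63.6%, specificity 94.7% -> FA = 5.3%
piano = dprime(0.636, 0.053)        # ~1.96 (1.97 in Table V; rates are rounded)
# 100%-specificity item, assuming 55 AP and 266 non-AP participants
no_play = dprime(0.382, 0.0, n_ap=55, n_non_ap=266)   # ~2.60, as in Table V
```

With the assumed counts, the adjusted calculation lands on the tabulated 2.60 for the perfect-specificity item, which is why that row is the only one requiring the Hautus method.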
III. EXPERIMENT 2: FUNDAMENTAL FREQUENCY DISCRIMINATION
The voice disadvantage effect was significantly greater in the AP task than in the RP task, where the effect was marginal. It is nevertheless possible that the voice disadvantage effect reflects a more general phenomenon that affects any task involving fine-grained pitch coding and discrimination. To test the generalizability of the voice disadvantage effect, an in-person lab experiment was carried out to measure basic F0 difference limens (F0DLs) for the four types of sounds (i.e., original and simplified versions of voice and viola stimuli) used in Experiment 1.
A. Methods
1. Participants
Eighteen participants (age range: 18–31 years, mean = 20.7, SD = 3.3; duration of musical training range: 0–13 years, mean = 4.8, SD = 4.3) took part in the F0 discrimination experiment. Three participants were excluded from the analysis because their thresholds were not measurable using the current setup (i.e., their thresholds were likely greater than 1 ST). All but one participant reported not possessing AP. Participants were recruited through introductory psychology courses without specific requirements regarding musical experience. This study was approved by the University of Minnesota's Institutional Review Board. All the participants completed a consent form prior to the study and were awarded a digital gift card or extra course credit upon completion.
2. Stimuli and procedure
Each participant completed three adaptive runs for each of the four sound types: original viola, original voice, simplified viola, and simplified voice. The stimuli were presented diotically over Sennheiser HD650 headphones (Sennheiser, Wedemark, Germany) at 70 dB sound pressure level (SPL). The task involved a 3-interval, 3-alternative forced-choice (3I3AFC) paradigm in conjunction with a 2-down 1-up adaptive staircase procedure that tracks the 70.7% correct point on the psychometric function (Levitt, 1971). The stimuli were generated and the procedure controlled within MATLAB R2018b using the AFC package (Ewert, 2013). The stimuli were created by shifting the pitch (via Praat) of the note F3 (174.6 Hz) used in Experiment 1. The three 250-ms notes in each trial were separated by 500-ms silent interstimulus intervals. Two of the intervals (reference intervals) were below F3 and one interval (target interval, selected at random on each trial) was above F3. The listener's task was to select the interval containing the target. Feedback was provided after each trial.
At the beginning of each block, the F0 difference (ΔF0) between the target and the non-target stimuli was set to 80 cents. The initial step size of the adaptive procedure was set to 20 cents. After two reversals in the direction of the adaptive procedure, the step size was reduced to 10 cents. After another two reversals, the step size was reduced to its final value of 5 cents. The run then continued for another six reversals at the final step size. Threshold was defined as the mean value of ΔF0 (in cents) at the last six reversal points. Each listener's final F0DL was the mean across the three runs for each stimulus type. The stimulus types were presented in random order for each listener independently, with the constraint that each of the four stimulus types was tested before any was repeated.
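The staircase logic described above can be sketched in a few lines. This is an illustrative simulation, not the actual AFC-package implementation; the deterministic "listener" in the usage line (correct whenever ΔF0 ≥ 30 cents) is an assumption used only to exercise the track:

```python
def staircase(respond, start=80.0):
    """2-down 1-up adaptive track (Levitt, 1971), following the rules in
    the text: delta-F0 starts at 80 cents; the step shrinks 20 -> 10 -> 5
    cents after 2 reversals each; the run ends 6 reversals after reaching
    the final step size; threshold = mean delta-F0 at those 6 reversals."""
    steps = [20.0, 10.0, 5.0]
    delta, streak, direction, step_idx = start, 0, 0, 0
    reversals = []
    while True:
        if respond(delta):              # simulated listener's answer
            streak += 1
            if streak < 2:
                continue                # need 2 correct in a row to go down
            streak, move = 0, -1
        else:
            streak, move = 0, +1
        if direction and move != direction:
            reversals.append(delta)     # direction change = reversal
            if step_idx < 2 and len(reversals) == 2 * (step_idx + 1):
                step_idx += 1           # shrink the step size
            elif step_idx == 2 and len(reversals) >= 10:
                return sum(reversals[-6:]) / 6
        direction = move
        delta = max(0.0, delta + move * steps[step_idx])

# Deterministic "listener" correct whenever delta-F0 >= 30 cents:
print(staircase(lambda d: d >= 30))  # returns 27.5, straddling 30 cents
```

With a deterministic listener the track oscillates around the 30-cent boundary at the final 5-cent step, so the returned mean of the last six reversals lands just below it; with a probabilistic listener it converges near the 70.7% correct point instead.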
B. Results
A repeated-measures ANOVA revealed significant effects of stimulus [voice vs viola; F(1,14) = 24.23, p = 0.000225, ηp² = 0.63], editing [simplified vs original; F(1,14) = 6.68, p = 0.0216, ηp² = 0.32] and their interaction [F(1,14) = 10.58, p = 0.00579, ηp² = 0.43] on the F0DLs (Fig. 5). Post hoc analysis with Bonferroni correction showed that F0DLs were lower for the simplified voice than for the original voice (p = 0.0029), but that editing did not affect F0DLs for the viola sounds (p = 0.314). Pitch discrimination thresholds were higher for voice than for viola in both simplified (p = 0.0038) and original (p = 0.0004) conditions.
FIG. 5.
(Color online) Individual and summary F0DLs for the two stimuli (voice and viola) under simplified and original conditions. Details of the lines and boxplots are as in Fig. 3.
The results suggest that some voice disadvantage effect exists for basic F0 discrimination. Although some of the effects may be due to larger spectro-temporal variations (e.g., vibrato) in the voice than in the viola, such variations cannot completely explain the effect, as some differences between voice and viola remained, even for the simplified stimuli.
IV. DISCUSSION
A. Voice disadvantage effect in AP participants
The results of Experiment 1 show that AP participants are less accurate in identifying the pitch of voice than of viola tones. RTs for responses to the voice were also slower than to the viola tones in the AP group when their data were analyzed separately. This voice disadvantage in AP participants is consistent with the findings from the Vanzella and Schellenberg (2010) study, where AP participants had lower accuracy in identifying the pitch of a recorded or synthesized voice, compared with piano and pure tones. The lack of an effect of removing spectro-temporal variations in the stimuli on response accuracy is also consistent with the findings of Vanzella and Schellenberg (2010) and Weiss et al. (2015) that a synthesized voice produces a voice disadvantage effect in terms of accuracy similar to that of a recorded voice, despite having smaller inherent pitch variations than a recorded voice, comparable to those of instrumental sounds. Although AP participants responded faster to the stimuli where spectro-temporal variations were removed than to the original versions, this effect did not differ between voice and viola sounds. Taken together, it appears that the voice disadvantage effect is not due primarily to natural fluctuations in F0 and/or temporal envelope over time. In addition to confirming prior findings, our results extend them by demonstrating that decreased accuracy for the voice stimuli was not accompanied by a decrease in RT, ruling out the possibility that the previously observed voice disadvantage effect was simply a criterion shift, as reflected in a speed-accuracy trade-off.
B. Voice disadvantage effect is greater in the AP than the RP group
We observed a trend that a voice disadvantage effect in accuracy may also exist in RP participants, who were tested with RP tasks in our study, but to a lesser extent than in AP participants. It therefore remains possible that phonetic or semantic interference and difficulty in extracting pitch from voice do play a role in the voice disadvantage effects in AP participants/tasks, but the unequal voice disadvantage in AP and RP groups suggests that at least a part of the effect is specific to AP. However, it is important to note that the current study cannot distinguish between AP tasks and AP possession, since AP and RP participants were tested with separate tasks.
A possible factor underlying the larger voice disadvantage in AP than in RP participants is the relative timing of voice information perception (e.g., perceiving a vowel) and note name perception. Vanzella and Schellenberg (2010) speculated that the voice may automatically activate neural mechanisms for processing linguistic and paralinguistic information, which interferes with pitch identification. It is possible that the temporal proximity of voice information processing and note name perception plays an important role in the interference effect. Specifically, RP participants receive the voice and pitch information of the test tone simultaneously, but must then derive the note name of the test tone by comparing its pitch to that of the reference. Thus, there is a potential time lag between the vowel information and the derivation of the note name. In contrast, AP participants receive both the voice and note-name information simultaneously upon presentation of the test tone. The simultaneous processing of both the voice and note-name information in the AP condition may explain the greater interference observed in the AP group.
The trend that RP participants were worse at identifying the pitch of voice, compared to viola tones, is in line with the previously observed vocal generosity effect in melody intonation judgments, where both musicians and nonmusicians are worse at telling in-tune from out-of-tune melodies sung by voice compared to those played on a violin (Hutchins et al., 2012). However, it is unclear whether the current results and the previously observed vocal generosity effect reflect the same phenomenon. It is worth noting that the RP tasks in Experiment 1 only involved recognizing musical intervals consisting of an integer number of STs and associating them with categorical labels, and did not test the ability to detect frequency deviations of less than 1 ST, as in the previous experiments testing the vocal generosity effect.
C. Voice disadvantage effect generalizes to basic F0 discrimination
In Experiment 2, we observed poorer F0 discrimination thresholds for voice than for viola sounds, replicating the vocal generosity effect (Hutchins et al., 2012). Furthermore, spectro-temporal fluctuations appear to contribute, but only partially, to the voice disadvantage effect in F0 discrimination, since the effect was smaller in the simplified condition than in the original condition, but was not absent. Taken together with the results of Experiment 1, the results suggest that the voice disadvantage effect generally applies to fine-grained pitch perception of single tones, rather than just note naming based on AP or musical interval perception.
However, it is unlikely that the voice disadvantage in note naming can be completely attributed to differences in F0DLs. The two tasks diverge in that reducing spectro-temporal fluctuations reduced the voice disadvantage effect in the pitch discrimination task but not in the AP or RP note name judgment tasks. Moreover, F0DLs for all four sound types fell in the range of 10–25 cents for most participants, whereas a shift of more than 50 cents is needed for an in-tune tone to fall under another note category. Nevertheless, it is possible that the voice disadvantage effects in note naming and pitch discrimination share the same underlying cause, such as a coarser pitch height representation for voice than for non-vocal harmonic sounds.
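For intuition about these cent values, the standard cents-to-frequency conversion can be applied at the F3 reference (174.6 Hz) used in Experiment 2; a minimal sketch (the specific cent values below are illustrative):

```python
def cents_to_hz(f0, cents):
    """Frequency offset (Hz) corresponding to a shift of `cents`
    above a base frequency f0 (100 cents = 1 semitone)."""
    return f0 * (2 ** (cents / 1200) - 1)

f3 = 174.6  # Hz, the F3 reference from Experiment 2
for c in (10, 25, 50):
    print(f"{c:>2} cents at F3 = {cents_to_hz(f3, c):.2f} Hz")
```

At F3, the typical 10–25 cent F0DLs correspond to roughly 1.0–2.5 Hz, while the 50-cent half-semitone boundary corresponds to about 5.1 Hz, underlining how much finer the discrimination thresholds are than the note-category boundaries.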
D. Voice disadvantage or instrument advantage?
Can the voice disadvantage effect in note naming be interpreted as an “instrument advantage,” in that familiarity with instrumental sounds in music settings may enhance the accuracy for labeling instrumental tones? Although we did not include stimuli other than voice and viola tones in this study, we consider this interpretation unlikely. In music education, aural skills are usually trained through listening to piano tones and singing; strings are rarely used for this purpose. We chose viola as a non-voice comparison in our experiments, and participants would generally be expected to be no less familiar with voice than with viola in the context of tasks similar to those used in ear training. Even if some string players in Experiment 1 were indeed more familiar with viola sounds, the pattern of results did not change after controlling for participants' first or primary instrument, along with other variables related to participants' musical background. In addition, previous studies have found that AP participants' accuracy for labeling pure tones was significantly higher than that for recorded sung voice or synthesized voices, but the pitch labeling accuracy for the two types of voices did not differ (Vanzella and Schellenberg, 2010; Weiss et al., 2015), suggesting that the difference between accuracy for labeling instrument and voice pitch cannot be fully explained by a familiarity or instrument advantage effect.
E. Automatic AP participants are more accurate and faster than comparative AP participants
We examined the differences between AP participants who reported using either comparative (quasi AP) or automatic (true AP) strategies in Experiment 1. The participants using automatic strategies had higher accuracy and shorter RTs compared with participants using comparative strategies, and while voice disadvantage effects in terms of accuracy and RT were observed in both groups, the voice disadvantage on accuracy was larger in automatic AP participants than comparative AP participants. RT has been proposed as a possible measure for distinguishing between true and quasi AP possessors (Levitin and Rogers, 2005), and accuracy has been used to categorize AP possessors into subgroups (e.g., Athos et al., 2007). It is worth noting that while the proposed subtypes of AP were theoretically defined by inner processes or strategies for pitch labeling, whether accuracy and RT truly differ between AP possessors using different inner processes or strategies remains unclear. Our results support the proposal that RT and accuracy have the potential to be used as indicators for categorizing AP participants' strategies. To further explore this possibility, we used the mean RTs and mean accuracies in the AP version of the note name judgment task to predict AP participants' self-reported strategies. For a given RT threshold, AP participants with a longer mean RT than the threshold were categorized as comparative AP, and the other AP participants were categorized as automatic AP. Similarly, AP participants with a lower mean accuracy than the accuracy threshold were categorized as comparative AP, and the others were categorized as automatic AP. The d′ values for the categorizations of the two self-reported strategies were calculated when different threshold values and different stimuli were used (Fig. 6). Classification performance was generally better when RT thresholds, rather than accuracy thresholds, were used. 
The best performance was reached when the piano sounds and RT thresholds between 1900 and 2300 ms were used, where d′ values were between 1.85 and 1.96, and categorization accuracy was between 75.5% and 81.1%.
FIG. 6.
(Color online) Values of d′ for predicting AP participants' self-report strategy (automatic or comparative) from accuracy (top panel) or RT (lower panel), when different thresholds and stimuli were used.
F. Towards a holistic account: The Feature Relevance Hypothesis
Previous studies have observed voice advantage effects in timbre recognition (Agus et al., 2012; Isnard et al., 2019) and in melody memorization (Weiss et al., 2015). To reconcile these results with the voice disadvantage effect, we propose a holistic account of voice processing, termed the Feature Relevance Hypothesis, which states that voice processing mechanisms are feature specific. According to this hypothesis, features relevant to voice information processing are facilitated, whereas features irrelevant to voice information processing are unchanged or suppressed. This hypothesis seems qualitatively consistent with the voice (dis)advantage effects observed so far: timbre and pitch contours are important for extracting linguistic and paralinguistic information from voice, and voice advantage effects have been observed for these features. By contrast, although gross estimates of pitch height can contribute to speaker gender identification (e.g., Pernet and Belin, 2012) and speaker recognition (McPherson and McDermott, 2018), fine-grained estimates of pitch height (accurate to within 1 ST) are generally not crucial for voice information extraction. For instance, the speaker recognition accuracy was reduced by less than 15% when the F0 was shifted by 3 STs in either direction (McPherson and McDermott, 2018), showing a larger tolerance to shifts in pitch height than is required to perform AP or RP note-naming tasks, where a shift of 1 ST would change the perceived category. Correspondingly, there is no voice advantage effect observed for pitch chroma and fine-tuned pitch identification or discrimination. It may be that these voice (dis)advantage effects are due to listeners' extensive experience in voice and speech perception, so that the relevant features are processed more efficiently than (and may, in turn, interfere with the processing of) the irrelevant features. 
Such a difference in the efficiency of feature processing would result in a pattern of voice perception similar to what the past and the present studies have observed and what the hypothesis predicts. The neural correlates of these voice (dis)advantage effects could be distinctive neural structures and pathways for voice processing, or a single pool of neural resources being recruited in an optimal way based on the acoustical properties of incoming signals, or a combination of both (Zatorre and Gandour, 2008). To examine the neural basis of the voice disadvantage effect in pitch perception, future neuroimaging studies could compare the level of activation and lateralization in Heschl's gyrus, a pitch-sensitive area where structural and functional differences are associated with musical abilities and pitch perception preference (Schneider et al., 2005; but see Seither-Preisler et al., 2007), in response to vocal and non-vocal stimuli. This hypothesis could be tested further by examining the perception of other vocal features (e.g., direction and slope of a continuous pitch contour, as in some tonal languages) and by determining whether the voice (dis)advantage effects are innate perceptual tendencies or acquired through listening experience.
G. Limitations and future directions
One limitation of this study is that it remains unclear whether the voice disadvantage effect observed in the pitch labeling task is specific to AP possessors, AP abilities, or both, since the current study only tested AP and RP participants with AP and RP tasks, respectively. The previous finding that non-AP musicians showed a voice disadvantage in AP labeling tasks comparable to AP musicians (Weiss et al., 2015) suggests that the voice disadvantage effect observed in AP musicians may not be specific to AP possessors. However, it is worth noting that in the study by Weiss et al. (2015), the non-AP musicians were able to perform note-naming tasks significantly above chance, and therefore may not be representative of the majority of non-AP listeners. Although typical non-AP listeners, especially non-musicians, would not be able to perform note-naming tasks without an external reference tone, some level of absolute memory for pitch has been observed in nonmusicians in both production (Levitin, 1994) and perception tasks (Schellenberg and Trehub, 2003; Smith and Schmuckler, 2008; Van Hedger et al., 2016); thus, a task involving such absolute memory for pitch that can nevertheless be completed by non-musicians could be undertaken with both voice and non-voice stimuli to determine whether a voice disadvantage can be observed in non-AP possessors. The question of whether the effect is specific to AP abilities could in principle also be addressed by testing AP possessors on an RP task; however, care would need to be taken to ensure that the AP musicians were not able to perform the task using AP skills (Miyazaki, 1995).
Although the decreased accuracy and increased RT for voice conditions observed in the analysis that focused on automatic and comparative AP participants suggest that pitch labeling is a less automatic process for voice than for instrumental sounds, it is still unclear whether stimulus timbre affects the level of automaticity of pitch labeling in AP possessors. A possible way to answer this question would be via auditory Stroop tasks (Akiva-Kabiri and Henik, 2012). An asymmetrical Stroop effect specific to AP possessors was previously observed when synthesized piano tones were used as auditory stimuli, where the incongruence between auditory stimuli and musical notation negatively affected the naming of the musical notation, but not the naming of the auditory stimuli (Akiva-Kabiri and Henik, 2012). This approach could be used in future studies with voice and other timbres. If there is indeed a voice disadvantage effect in the automaticity of AP possessors' pitch labeling, the AP possessor-specific Stroop effects may be weaker when tested with voice than with other timbres.
More generally, it could be further examined how the automaticity of fine-grained pitch extraction compares between vocal and non-vocal stimuli. In an early study, Semal et al. (1996) tested participants in a pitch comparison task where the two target tones were separated by four consecutive distractors. The effects of vocal and non-vocal distractors were asymmetric: when complex tones were used as targets, vocal distractors had a smaller interference effect than non-vocal distractors; however, when voices were used as targets, the interference effects from vocal and non-vocal distractors were comparable. A possible explanation is that the processing of the fine-grained pitch of voice is less automatic than that of non-voice, so that when voice is the attended timbre and non-voice is not, the attentional preference for voice could compensate for the voice disadvantage. Future studies could further investigate how attention may interact with the voice disadvantage effect using more strictly controlled vocal and non-vocal stimuli.
Finally, the fact that the simplified sounds were similarly capable of inducing the voice disadvantage effect as the original sounds confirms that fluctuations in the fundamental frequency and temporal envelope over time (which may be greater for voice than for musical instrument sounds) do not play an important role in the effect. Further work is necessary to elucidate what specific acoustic features are crucial for inducing the voice disadvantage effect and/or whether the stimulus needs to be perceptually identified as a voice for the effect to occur.
ACKNOWLEDGMENTS
This work was supported by NIH Grant No. R01 DC005216. We thank Gun Joo Lim for assisting in recruitment and data collection. Two reviewers and the Associate Editor, Emily Buss, provided helpful comments on an earlier version.
Footnotes
See supplementary material at https://www.scitation.org/doi/suppl/10.1121/10.0010123 for the full questionnaire and audio examples.
References
- Agus, T. R., Paquette, S., Suied, C., Pressnitzer, D., and Belin, P. (2017). “Voice selectivity in the temporal voice area despite matched low-level acoustic cues,” Sci. Rep. 7, 11526. 10.1038/s41598-017-11684-1
- Agus, T. R., Suied, C., Thorpe, S. J., and Pressnitzer, D. (2012). “Fast recognition of musical sounds based on timbre,” J. Acoust. Soc. Am. 131(5), 4124–4133. 10.1121/1.3701865
- Akiva-Kabiri, L., and Henik, A. (2012). “A unique asymmetrical Stroop effect in absolute pitch possessors,” Exp. Psychol. 59(5), 272–278. 10.1027/1618-3169/a000153
- Apple Inc. (2020). “GarageBand for Mac,” https://www.apple.com/mac/garageband/ (Last viewed 7/1/2021).
- Aronoff, J. M., and Landsberger, D. M. (2013). “The development of a modified spectral ripple test,” J. Acoust. Soc. Am. 134(2), EL217–EL222. 10.1121/1.4813802
- Athos, E. A., Levinson, B., Kistler, A., Zemansky, J., Bostrom, A., Freimer, N., and Gitschier, J. (2007). “Dichotomy and perceptual distortions in absolute pitch ability,” Proc. Natl. Acad. Sci. U.S.A. 104(37), 14795–14800. 10.1073/pnas.0703868104
- Bachem, A. (1955). “Absolute pitch,” J. Acoust. Soc. Am. 27, 1180–1185. 10.1121/1.1908155
- Bahr, N., Christensen, C. A., and Bahr, M. (2005). “Diversity of accuracy profiles for absolute pitch recognition,” Psychol. Music 33(1), 58–93. 10.1177/0305735605048014
- Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). “Fitting linear mixed-effects models using lme4,” J. Stat. Softw. 67(1), 1–48. 10.18637/jss.v067.i01
- Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., and Pike, B. (2000). “Voice-selective areas in human auditory cortex,” Nature 403(6767), 309–312. 10.1038/35002078
- Boersma, P. (2001). “Praat: A system for doing phonetics by computer,” Glot. Int. 5(9), 341–345.
- Brauer, M., and Curtin, J. J. (2018). “Linear mixed-effects models and the analysis of nonindependent data: A unified framework to analyze categorical and continuous independent variables that vary within-subjects and/or within-items,” Psychol. Methods 23(3), 389–411. 10.1037/met0000159
- Brysbaert, M., and Stevens, M. (2018). “Power analysis and effect size in mixed effects models: A tutorial,” J. Cogn. 1(1), 9. 10.5334/joc.10
- Charest, I., Pernet, C. R., Rousselet, G. A., Quiñones, I., Latinus, M., Fillion-Bilodeau, S., Chartrand, J.-P., and Belin, P. (2009). “Electrophysiological evidence for an early processing of human voices,” BMC Neurosci. 10(1), 127. 10.1186/1471-2202-10-127
- Deutsch, D., Henthorn, T., and Dolson, M. (2004). “Absolute pitch, speech, and tone language: Some experiments and a proposed framework,” Music Percept. 21(3), 339–356. 10.1525/mp.2004.21.3.339
- Ewert, S. D. (2013). “AFC—A modular framework for running psychoacoustic experiments and computational perception models,” in Proceedings of the AIA-DAGA, March 18–21, Merano, Italy, pp. 1326–1329.
- Hautus, M. J. (1995). “Corrections for extreme proportions and their biasing effects on estimated values of d′,” Behav. Res. Methods Instrum. Comput. 27(1), 46–51. 10.3758/BF03203619
- Heitz, R. P. (2014). “The speed-accuracy tradeoff: History, physiology, methodology, and behavior,” Front. Neurosci. 8, 150. 10.3389/fnins.2014.00150
- Hutchins, S., Roquet, C., and Peretz, I. (2012). “The vocal generosity effect: How bad can your singing be?,” Music Percept. 30(2), 147–159. 10.1525/mp.2012.30.2.147
- Isnard, V., Chastres, V., Viaud-Delmon, I., and Suied, C. (2019). “The time course of auditory recognition measured with rapid sequences of short natural sounds,” Sci. Rep. 9, 8005. 10.1038/s41598-019-43126-5
- Kim, S.-G., and Knösche, T. R. (2017). “On the perceptual subprocess of absolute pitch,” Front. Neurosci. 11, 557–562. 10.3389/fnins.2017.00557
- Kuznetsova, A., Brockhoff, P. B., and Christensen, R. H. B. (2017). “lmerTest package: Tests in linear mixed effects models,” J. Stat. Softw. 82(13), 1–26. 10.18637/jss.v082.i13
- Levitin, D. J. (1994). “Absolute memory for musical pitch: Evidence from the production of learned melodies,” Percept. Psychophys. 56(4), 414–423. 10.3758/BF03206733
- Levitin, D. J., and Rogers, S. E. (2005). “Absolute pitch: Perception, coding, and controversies,” Trends Cogn. Sci. 9(1), 26–33. 10.1016/j.tics.2004.11.007
- Levitt, H. (1971). “Transformed up-down methods in psychoacoustics,” J. Acoust. Soc. Am. 49(2B), 467–477. 10.1121/1.1912375
- McPherson, M. J., and McDermott, J. H. (2018). “Diversity in pitch perception revealed by task dependence,” Nat. Hum. Behav. 2(1), 52–66. 10.1038/s41562-017-0261-8
- Miyazaki, K. (1995). “Perception of relative pitch with different references: Some absolute-pitch listeners can't tell musical interval names,” Percept. Psychophys. 57(7), 962–970. 10.3758/BF03205455
- Miyazaki, K., Makomaska, S., and Rakowski, A. (2012). “Prevalence of absolute pitch: A comparison between Japanese and Polish music students,” J. Acoust. Soc. Am. 132(5), 3484–3493. 10.1121/1.4756956
- Miyazaki, K., Rakowski, A., Makomaska, S., Jiang, C., Tsuzaki, M., Oxenham, A. J., Ellis, G., and Lipscomb, S. D. (2018). “Absolute pitch and relative pitch in music students in the East and the West: Implications for aural-skills education,” Music Percept. 36(2), 135–155. 10.1525/mp.2018.36.2.135
- Pernet, C. R., and Belin, P. (2012). “The role of pitch and timbre in voice gender categorization,” Front. Psychol. 3, 23. 10.3389/fpsyg.2012.00023
- R Core Team (2020). “R: A language and environment for statistical computing,” R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org/ (Last viewed 7/1/2021).
- Rizopoulos, D. (2021). “GLMMadaptive: Generalized linear mixed models using adaptive Gaussian quadrature. R package version 0.8-0,” https://CRAN.R-project.org/package=GLMMadaptive (Last viewed 7/1/2021).
- Schellenberg, E. G., and Trehub, S. E. (2003). “Good pitch memory is widespread,” Psychol. Sci. 14(3), 262–266. 10.1111/1467-9280.03432
- Schellenberg, E. G., and Trehub, S. E. (2008). “Is there an Asian advantage for pitch memory?,” Music Percept. 25(3), 241–252. 10.1525/mp.2008.25.3.241
- Schielzeth, H. (2010). “Simple means to improve the interpretability of regression coefficients,” Methods Ecol. Evol. 1(2), 103–113. 10.1111/j.2041-210X.2010.00012.x
- Schneider, P., Sluming, V., Roberts, N., Scherg, M., Goebel, R., Specht, H. J., Dosch, H. G., Bleeck, S., Stippich, C., and Rupp, A. (2005). “Structural and functional asymmetry of lateral Heschl's gyrus reflects pitch perception preference,” Nat. Neurosci. 8(9), 1241–1247. 10.1038/nn1530
- Seither-Preisler, A., Johnson, L., Krumbholz, K., Nobbe, A., Patterson, R., Seither, S., and Lütkenhöner, B. (2007). “Tone sequences with conflicting fundamental pitch and timbre changes are heard differently by musicians and nonmusicians,” J. Exp. Psychol. Hum. Percept. Perform. 33(3), 743–751. 10.1037/0096-1523.33.3.743
- Semal, C., Demany, L., Ueda, K., and Hallé, P. (1996). “Speech versus nonspeech in pitch memory,” J. Acoust. Soc. Am. 100(2), 1132–1140. 10.1121/1.416298
- Smith, N. A., and Schmuckler, M. A. (2008). “Dial A440 for absolute pitch: Absolute pitch memory by non-absolute pitch possessors,” J. Acoust. Soc. Am. 123(4), EL77–EL84. 10.1121/1.2896106
- Stoet, G. (2010). “PsyToolkit: A software package for programming psychological experiments using Linux,” Behav. Res. Methods 42(4), 1096–1104. 10.3758/BRM.42.4.1096
- Stoet, G. (2017). “PsyToolkit: A novel web-based method for running online questionnaires and reaction-time experiments,” Teach. Psychol. 44(1), 24–31. 10.1177/0098628316677643
- Van Besouw, R. M., Brereton, J. S., and Howard, D. M. (2008). “Range of tuning for tones with and without vibrato,” Music Percept. 26(2), 145–155. 10.1525/mp.2008.26.2.145
- Van Hedger, S. C., Heald, S. L. M., and Nusbaum, H. C. (2016). “What the [bleep]? Enhanced absolute pitch memory for a 1000 Hz sine tone,” Cognition 154, 139–150. 10.1016/j.cognition.2016.06.001
- Van Hedger, S. C., Heald, S. L. M., and Nusbaum, H. C. (2019). “Absolute pitch can be learned by some adults,” PLoS One 14(9), e0223047. 10.1371/journal.pone.0223047
- Vanzella, P., and Schellenberg, E. G. (2010). “Absolute pitch: Effects of timbre on note-naming ability,” PLoS One 5(11), e15449. 10.1371/journal.pone.0015449
- Weiss, M. W., Vanzella, P., Schellenberg, E. G., and Trehub, S. E. (2015). “Rapid communication: Pianists exhibit enhanced memory for vocal melodies but not piano melodies,” Q. J. Exp. Psychol. 68(5), 866–877. 10.1080/17470218.2015.1020818
- Zatorre, R. J., and Gandour, J. T. (2008). “Neural specializations for speech and pitch: Moving beyond the dichotomies,” Philos. Trans. R. Soc. London, Ser. B 363(1493), 1087–1104. 10.1098/rstb.2007.2161