Abstract
Absolute pitch (AP) possessors can identify musical notes without an external reference. Most AP studies have used musical instruments and pure tones for testing, rather than the human voice. However, the voice is crucial for human communication in both speech and music, and evidence for voice-specific neural processing mechanisms and brain regions suggests that AP processing of voice may be different. Here, musicians with AP or relative pitch (RP) completed online AP or RP note-naming tasks, respectively. Four synthetic sound categories were tested: voice, viola, simplified voice, and simplified viola. Simplified sounds had the same long-term spectral information but no temporal fluctuations (such as vibrato). The AP group was less accurate in judging the note names for voice than for viola in both the original and simplified conditions. A smaller, marginally significant effect was observed in the RP group. A voice disadvantage effect was also observed in a simple pitch discrimination task, even with simplified stimuli. To reconcile these results with voice-advantage effects in other domains, it is proposed that voices are processed in a manner that facilitates voice- or speech-relevant features at the expense of features that are less relevant to voice processing, such as fine-grained pitch information.
I. INTRODUCTION
Absolute pitch (AP), colloquially known as “perfect pitch,” refers to the ability to identify the note name (chroma) of a given tone and/or to produce the pitch corresponding to a given note name without external reference (Levitin and Rogers, 2005). AP ability differs from relative pitch (RP) ability, a common auditory skill in trained musicians. While AP possessors can identify the note name of a tone played in isolation, RP possessors need to infer the note name from the musical interval between the tone and a given reference tone.
AP has long been considered a rare talent even among trained musicians, although the proportion of AP possessors in the general population remains unclear. Among music students at conservatories or universities, the proportion of AP possessors varies across countries and cultures from near zero to above 50%, with higher incidence in Asia than in Europe or the US (Miyazaki et al., 2018), probably due to differences in music training methods (Miyazaki et al., 2012) and possibly language (Deutsch et al., 2004; but see Schellenberg and Trehub, 2008). Although one recent study reported that two adult musicians with high auditory working memory abilities reached levels of performance indistinguishable from genuine AP possessors after extensive training (Van Hedger et al., 2019), it is generally believed that the acquisition of AP is associated with musical training within an early critical period in development (Levitin and Rogers, 2005). Learning a fixed-pitch instrument seems to contribute to the acquisition of AP (Vanzella and Schellenberg, 2010).
Many studies have treated AP possessors as a single cohort, but they are not homogeneous (Bahr et al., 2005). Two commonly discussed subtypes are known as “true AP” and “quasi AP” (Bachem, 1955; Levitin and Rogers, 2005; Kim and Knösche, 2017). True AP possessors can name notes with high accuracy and low latency. In contrast, quasi AP possessors may be somewhat less accurate, particularly at the extreme ends of the note range, and may take longer to respond, possibly because they compare the heard tone with an inner standard tone they remember. However, both subtypes can achieve high performance in typical pitch-labeling tasks. A proposed way to distinguish between these two subtypes is by measuring reaction time (RT) (Levitin and Rogers, 2005), although its empirical effectiveness in distinguishing between the two groups remains unknown.
Although AP can be a useful skill in musical tasks like transcription and sight singing, it is not always helpful. For instance, AP possessors' interval naming performance can be degraded when one or both of the two notes of the interval are mistuned (Miyazaki, 1995). When reading music notation and listening to a tone at the same time, the note name of the heard tone will affect the performance of reading the written note, but not the converse (Akiva-Kabiri and Henik, 2012). These findings indicate that note naming is highly automatic for at least some AP possessors. For them, AP cannot be “turned off” even when the task encourages them to do so, resulting in interference, and hence poorer performance, in such tasks compared to non-AP musicians.
Given the potential for AP abilities to interfere with task requirements, it is also possible that a task or stimulus may interfere with AP performance. One such example was reported by Vanzella and Schellenberg (2010). They found that AP possessors were less accurate in naming the pitches of recorded sung or synthesized voice than of piano notes or pure tones. The same pattern of results was replicated in AP possessors and extended to non-AP musicians (albeit at a much lower overall level of performance) by Weiss et al. (2015). Vanzella and Schellenberg (2010) suggested that voice-specific mechanisms may interfere with the process of note naming, leading to poorer performance. There is certainly neurophysiological and behavioral evidence for voice-specific processing, ranging from potentially voice-selective cortical regions (Belin et al., 2000; Agus et al., 2017) and responses (Charest et al., 2009) to perceptual asymmetries making it easier to recognize a target voice sound in a sequence of non-voice distractors than vice versa (Agus et al., 2012; Isnard et al., 2019). Finally, when distinguishing previously heard melodies from novel melodies, both AP and non-AP musicians, as well as non-musicians, have been shown to have higher accuracy for melodies presented with voice than with instrumental timbres (Weiss et al., 2015).
Although voice-specific mechanisms may interfere with AP note naming (Vanzella and Schellenberg, 2010), some questions remain. First, because only performance, and not RT, was measured, it may be that the poorer performance with voice stimuli was due to a speed-accuracy trade-off (e.g., Heitz, 2014), perhaps reflecting particularly rapid processing of voice stimuli, rather than a decrease in underlying sensitivity. Second, if the effects are due to interference between the identity of the vowel carried by the voice (e.g., /a/) and the verbal labels of the pitches, then similar interference effects should be observed in RP tasks that involve naming the notes in musical intervals. If the effect is instead specific to AP, then another interpretation may be necessary. Finally, the decrease in performance may reflect the “vocal generosity effect,” whereby both musicians and non-musicians have been found to be poorer in both melody intonation judgments and pitch discrimination for voice than for non-voice stimuli (Hutchins et al., 2012). Although the vocal generosity effect may be due in part to the greater vibrato typically found in singing (Van Besouw et al., 2008), this cannot explain the results found by Vanzella and Schellenberg (2010), as they reported a similar effect with natural and synthesized voice, despite the lack of pronounced variations in fundamental frequency (F0) in the synthesized version.
The aim of the present study was to address the three questions outlined above: (1) Is the voice-disadvantage effect specific to AP, or does it extend to RP note-naming tasks? (2) Does the voice-disadvantage effect reflect a speed-accuracy trade-off or a true decrease in sensitivity? (3) To what extent is the voice-disadvantage effect a reflection of differences in basic F0 discrimination between voice and non-voice stimuli, as suggested by the vocal generosity effect? In our online Experiment 1, we addressed the first two questions by adopting a note name judgment task to test AP and RP musicians' ability to label the pitches of vocal (spoken vowel) and non-vocal (viola) stimuli, where a reference pitch with known note name was provided in the RP, but not the AP, task. The stimuli were either unprocessed or were manipulated to remove any time-varying features (such as vibrato) that might be more pronounced in the vocal stimuli. To detect any potential speed-accuracy trade-off, we also measured RT. Finally, to further assess the generalizability of the voice disadvantage effect and the potential influence of the vocal generosity effect, our in-person Experiment 2 adopted a basic F0 discrimination task for the same vocal and non-vocal stimuli (with and without time-varying features) that were used in Experiment 1.
II. EXPERIMENT 1: ABSOLUTE AND RELATIVE NOTE NAME JUDGMENTS
A. Methods
1. Participants
A total of 321 participants (age range: 18–46 years, mean = 21.3, standard deviation, SD = 4.2) took part in an online AP-RP screening test. Fifty-five participants passed the initial AP screening test (age range: 18–39 years, mean = 22.1, SD = 4.5; duration of musical training range: 3–25 years, mean = 12.6, SD = 4.9), and 40 of the remaining participants passed the RP screening test (age range: 18–32 years, mean = 21.5, SD = 3.6; duration of musical training range: 5–27 years, mean = 11.9, SD = 4.1). Two RP participants were excluded from the analysis comparing AP and RP participants due to technical errors in the online test. Participants were recruited through email lists, introductory psychology courses, social media, and word of mouth. At the time of recruitment, the participants were informed that the study was about musicians' pitch labeling abilities and that their task was to complete a questionnaire and potentially some listening tasks. To be eligible for the study, participants needed to consider themselves familiar with Western note names (e.g., C/C#/D). This study was approved by the University of Minnesota's Institutional Review Board. All the participants completed an online consent form prior to the study and were awarded a digital gift card or extra course credit upon completion.
2. Music background questionnaire
All participants completed a questionnaire about their musical background, self-evaluation of passive (i.e., perception) and active (i.e., production) AP proficiency, and (if applicable) AP strategies used for different types of sounds. Specifically, the available options for the strategies were “automatic” and “comparative.” “Automatic” was explained in terms of being able to come up with the note name immediately without recalling any inner standard tone for reference, and “comparative” was explained in terms of needing to compare the presented tone with some inner standard tone to determine the note name. Participants were allowed to select “other” and explain their strategies in a free-response box. The full questionnaire is provided in the supplementary materials.1 The questionnaire and all of the experimental tasks were delivered online via PsyToolkit (Stoet, 2010, 2017) and were only accessible via a laptop or desktop computer. Participants were instructed to complete the tasks in a quiet environment. Headphones were not required, and there were no questions or tests checking for the use of headphones.
3. Screening test and qualification for AP and RP
Prior to the experiment, a 5-s white noise with the same overall level as the tones in the task was played to the participants, so that they could adjust the volume on their device to a comfortable level. The participants were able to replay the noise as many times as they wished before clicking a button to proceed to the experiment once they were satisfied with the loudness.
Following the level adjustment, all participants took part in a screening test that involved assessment of AP and RP skills. The AP section contained 24 trials. In each trial, the participants heard a piano tone and saw a note name (e.g., “C”) displayed on the screen at the same time. Their task was to judge whether the note name matched the tone and to press “1” for match or “2” for no match on the keyboard. Each note in the range of C3–B4 was played exactly once, and each of the 12 note names was displayed on the screen exactly twice. The black-key notes were displayed with both alternative names (e.g., “C#/Db” or “D#/Eb”). In half the trials, the note name matched the piano tone; in the other half, the note names were equally likely to be ±1 or ±4 semitones (STs) deviant from the piano tone. The correspondence between the piano tones and note names was counterbalanced across participants.
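The trial-list constraints above can be sketched as follows. This is a hypothetical Python illustration, not the PsyToolKit implementation used in the study, and it enforces only the tone and match/mismatch constraints; jointly satisfying the additional constraint that each of the 12 note names appears exactly twice would require an extra constraint-satisfaction step, which is omitted here.

```python
import random

# Display names; black keys show both enharmonic spellings, as in the task.
NOTE_NAMES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
              "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]

def make_ap_screening_trials(rng):
    """Sketch one 24-trial AP screening list: each piano tone C3-B4 is
    played exactly once, half the trials match, and mismatched note
    names deviate by +/-1 or +/-4 semitones from the tone."""
    tones = list(range(24))          # 0 = C3, ..., 23 = B4
    rng.shuffle(tones)
    trials = []
    for tone in tones[:12]:          # matching trials
        trials.append({"tone": tone,
                       "name": NOTE_NAMES[tone % 12],
                       "match": True})
    for tone in tones[12:]:          # mismatching trials
        offset = rng.choice([-4, -1, 1, 4])
        trials.append({"tone": tone,
                       "name": NOTE_NAMES[(tone + offset) % 12],
                       "match": False})
    rng.shuffle(trials)              # randomize presentation order
    return trials
```

In this sketch, regenerating the list with different random seeds would play the role of counterbalancing tone-name correspondences across participants.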
The piano tones were exported from GarageBand v.10.3.5 (Apple Inc., 2020). Each tone was restricted to a total duration of 250 ms by gating it off with a 30-ms raised-cosine ramp. To preserve a naturalistic piano timbre, no additional onset ramp was applied. The note name appeared on the screen at the onset of the tone and was displayed until either the subject made a response or 5 s after the tone onset, whichever occurred first. No feedback was provided, but a highlighted “Miss” was displayed if the participant failed to respond within 5 s. Trials were interspersed with a 500-ms burst of spectro-temporally rippled noise with falling frequency sweeps (Aronoff and Landsberger, 2013), presented at the same overall level as the tones and gated with 100-ms linear onset and offset ramps, followed by a 1100-ms silent gap. The noise was included to reduce any possible carryover of pitch information across trials.
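The truncation and offset gating applied to the piano tones can be illustrated with a short sketch (an assumed Python/NumPy rendering, not the original export pipeline; the sample rate and function name are illustrative):

```python
import numpy as np

def truncate_and_gate(x, fs, dur_ms=250.0, ramp_ms=30.0):
    """Truncate x to dur_ms and apply a raised-cosine offset ramp of
    ramp_ms to avoid an abrupt cutoff; no onset ramp is applied, which
    preserves the natural attack of the tone."""
    y = np.array(x[:int(round(fs * dur_ms / 1000.0))], dtype=float)
    n = int(round(fs * ramp_ms / 1000.0))
    ramp = 0.5 * (1.0 + np.cos(np.linspace(0.0, np.pi, n)))  # 1 -> 0
    y[-n:] *= ramp
    return y
```

At fs = 44100 Hz, for example, the output is 11 025 samples long, with only the final 30 ms attenuated.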
Participants scoring at least 75% correct in the AP section (N = 55) were assigned to the AP group and skipped the remaining part of the screening test. All others were then presented with the RP screening test, which also contained 24 trials. The RP section was identical to the AP section, with the exception that a 250-ms standard piano tone (A3 = 220 Hz) was presented 100 ms after the offset of the noise burst. The silent gap between the offset of the standard piano tone and the onset of the test tone was 750 ms. The standard tone was played before every test tone, and the participants were instructed at the beginning of each block that the note name of the standard tone was always A.
The participants who undertook the RP section and scored at least 75% correct (N = 40) were assigned to the RP group. All other participants, who had scored lower than 75% correct in both the AP and RP sections, were excluded from the remainder of the study.
4. Note name judgment task
Participants in the AP and RP groups completed their respective versions of the note name judgment task. The RP participants did not have the ability to reliably make AP judgments, as was shown in the screening test, and so did not complete the AP note name judgment task. The AP participants could potentially have used either AP or RP abilities to undertake the RP tasks (Miyazaki, 1995), so they were not tested on the RP note name judgment task. Each version consisted of four blocks of trials, each of which was very similar to the corresponding (AP or RP) version of the screening task. The differences were: (a) each note in the range of C3–B3 was used exactly twice; (b) each of the four blocks used a different type of sound, as described below. Within each block, the same type of sound was used as both the standard and test tone. The presentation order of the four blocks was randomized for each participant independently. Participants were able to take breaks between blocks.
The four sound types were original voice, original viola, simplified voice, and simplified viola. The original voice stimuli were based on a female spoken /a/ sound, synthesized and transposed to different pitch heights using the TD-PSOLA algorithm in Praat (Boersma, 2001), whereby the periodicity of the sound is altered but the spectral envelope and duration remain the same. The original viola stimuli were based on a synthesized viola sound (the note F3) exported from Avid Sibelius, again transposed to different pitch heights (i.e., the 12 notes between C3 and B3) using Praat. Viola was chosen to match the pitch range and spectral richness of voice, as in previous studies (e.g., Agus et al., 2012). Each tone had a total duration of 250 ms and was gated off with a 30-ms raised-cosine ramp. The simplified voice and simplified viola stimuli were constructed by taking one cycle from the middle of the waveform of the voice or viola note F3, respectively, repeating it throughout the 250-ms duration of the tone in MATLAB R2018b (MathWorks Inc., Natick, MA), gating the tone on and off with 30-ms raised-cosine ramps, and transposing it to different pitch heights using Praat. Therefore, the long-term spectral and formant information was maintained but any fluctuations in spectrum, F0, or level over the course of the sound were eliminated, along with any transient effects. The reduction in random fluctuations as a result of editing was confirmed by calculating the SDs for F0 (in STs) and level (in dB) during the steady-state segment of the stimuli (from 70 to 220 ms). For F0 estimates, Praat was used to calculate the F0 in 10-ms windows; for level estimates, the root-mean-square (rms) level was calculated with Hanning windows of 10-ms half-amplitude duration. The resulting SDs for F0 (in ST) were 0.094 for the original voice, 0.00012 for the simplified voice, 0.027 for the original viola, and 0.00012 for the simplified viola.
The resulting SDs for level (in dB) were 1.46 for the original voice, 0.038 for the simplified voice, 0.40 for the original viola, and 0.032 for the simplified viola. Examples of the stimulus waveforms and spectra are shown in Fig. 1. Audio of the four examples shown in Fig. 1 can be found in the supplementary materials.1
FIG. 1.
(Color online) Waveforms and spectrograms of the four stimulus categories: (a) voice, original; (b) voice, simplified; (c) viola, original; and (d) viola, simplified. The spectra (0–16 kHz) of the four stimulus categories are shown in (e). The note displayed is F3 (F0 = 174.6 Hz). In the spectrograms, the lines represent the F0 contour.
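The cycle-repetition procedure and the level-fluctuation check described above can be sketched in Python. This is an illustrative approximation of the MATLAB/Praat pipeline, not the study's code: rectangular 10-ms frames stand in for the Hanning windows, and the example below uses a cycle length chosen so that analysis frames align with whole cycles.

```python
import numpy as np

def simplify(cycle, fs, dur_ms=250.0, ramp_ms=30.0):
    """Tile a single waveform cycle to dur_ms and gate on/off with
    raised-cosine ramps, removing all within-note fluctuations while
    preserving the long-term spectrum of the cycle."""
    n_total = int(round(fs * dur_ms / 1000.0))
    reps = int(np.ceil(n_total / len(cycle)))
    y = np.tile(np.asarray(cycle, dtype=float), reps)[:n_total]
    n_ramp = int(round(fs * ramp_ms / 1000.0))
    ramp = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, n_ramp)))  # 0 -> 1
    y[:n_ramp] *= ramp            # onset ramp
    y[-n_ramp:] *= ramp[::-1]     # offset ramp
    return y

def level_sd_db(x, fs, t0_ms=70.0, t1_ms=220.0, frame_ms=10.0):
    """SD of frame-wise rms level (in dB) over the steady-state
    segment, as a measure of level fluctuation."""
    seg = x[int(fs * t0_ms / 1000):int(fs * t1_ms / 1000)]
    n = int(fs * frame_ms / 1000)
    frames = seg[:len(seg) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return float(np.std(20.0 * np.log10(rms)))
```

By construction, a tiled single cycle yields a level SD near zero, whereas any amplitude modulation (e.g., vibrato-like fluctuation) inflates the metric.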
Immediately following the AP or RP note-naming task, participants listened to the note A3 from each of the four sound categories again and answered a set of questions. In the first part, they rated the similarity of each stimulus to a human voice and to a musical instrument, as well as how synthetic it sounded, on three separate 5-point scales (1 = not at all; 5 = very much). In the second part, they answered whether they were able to extract a vowel from each of the stimuli and, if so, which vowel. These questions were designed to measure the perceived identity of the test sounds.
5. Data analysis
The dependent variables of interest were accuracy and RT. Each response was coded as “1 = correct” or “0 = incorrect.” The raw RT was defined as the time interval between the onset of the test stimulus and the participant's response. For the missed trials, both RT and response were treated as missing. Across all conditions and participants, the percentage of missed trials was low (∼1%). The responses and RTs of all non-missed trials were included in the data analysis.
To answer the main questions of this study, effects of group (AP vs RP), stimulus (voice vs viola), and editing (simplified vs original), as well as their interactions, on the log-transformed RT values [10*log10(RT)] were assessed via mixed-effects linear models. A logarithmic transformation was used to produce a more normal distribution of the RTs. The effects on the accuracy of the same set of factors and their interactions were assessed via mixed-effects logistic models. The effects of stimulus rating and AP strategy (comparative vs automatic) on RT and accuracy were also examined in subsequent analyses. Mixed-effects linear models and mixed-effects logistic models were implemented using the lme4 package (Bates et al., 2015) and the GLMMadaptive package (Rizopoulos, 2021), respectively. The input data for the mixed-effects logistic regressions were 0 or 1 (i.e., incorrect or correct) for each individual trial, and the input data for the mixed-effects linear regressions were RT (in log-transformed milliseconds) for each individual trial. The p-values for the regression coefficients were obtained using the lmerTest package (Kuznetsova et al., 2017). All analyses were performed in R v4.0.2 (R Core Team, 2020). To assist with interpretability, the two possible values of each of the dichotomous predictors (group, AP/RP; stimulus, voice/viola; editing, original/simplified; AP strategy, automatic/comparative) were recoded as 0.5 and –0.5 (Schielzeth, 2010; Brauer and Curtin, 2018). The effects of within-subjects variables (stimulus and editing) were treated as random at the individual level, and participant was included as a random effect. All the estimates reported in the following section are at the population level (i.e., the overall difference across experimental conditions, after accounting for variability between participants). For mixed-effects linear models, Cohen's d was calculated as described by Brysbaert and Stevens (2018). 
For mixed-effects logistic models, odds ratios (ORs) were calculated as a measure of effect size. The further the OR deviates from 1 in either direction, the larger the effect.
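For reference, the OR and its 95% CI follow from a logistic coefficient b and its standard error in the standard way, OR = exp(b) with Wald interval exp(b ± 1.96·SE), and the ±0.5 predictor coding can be written as a one-line recoding. A minimal sketch (illustrative Python, not the R analysis code; function names are ours):

```python
import math

def odds_ratio(b, se, z=1.96):
    """Convert a logistic-regression coefficient and its standard error
    into an odds ratio with a Wald 95% confidence interval."""
    return math.exp(b), (math.exp(b - z * se), math.exp(b + z * se))

def recode(level, levels):
    """Center a dichotomous predictor: first level -> -0.5,
    second level -> +0.5, for interpretable main effects."""
    return -0.5 if level == levels[0] else 0.5
```

For example, the main effect of stimulus reported in Sec. II B (b = −0.42, SE = 0.10) gives OR ≈ 0.66 with CI ≈ (0.54, 0.80), matching the reported OR = 0.65, 95% CI [0.54, 0.79] to within the rounding of b.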
To further validate the results, variables related to participants' musical backgrounds and parameters of each trial were adjusted for in a separate model when possible. These variables were the amount of musical training (years), current music playing time (hours per week), current music listening time (hours per week), first and primary instruments (four categories: piano, strings, voice, other), age when musical training commenced (years), native language (tonal or non-tonal), interval between sound stimulus and displayed note name (0, 1, or 4 STs), and direction of mismatch (1 = note name higher than sound stimulus, 0 = correct correspondence, –1 = note name lower than sound stimulus). The results reported in the next section are not adjusted for these variables unless stated otherwise. In general, incorporating these variables did not affect the conclusions.
B. Results
1. Screening differentiates between AP and RP possessors
The distributions of the screening scores of all participants are shown in Fig. 2. In both cases, the mode of the distribution was near 12 out of 24, or 50% correct, representing chance performance. However, both distributions appear positively skewed, indicating better-than-chance performance for some of the participants. For the AP test in particular, there was evidence of a bimodal distribution, with a secondary peak reflecting very high performance (23–24 correct); a similar form of distribution, presumably reflecting a subset of listeners with AP, has been reported with a much larger sample (Athos et al., 2007). Those who passed each of the tests (at least 75% correct) are shown in the distributions with darker bars (N = 55 for AP; N = 40 for RP). The 55 AP participants were also included in an additional exploratory analysis to assess potential differences in performance between those using automatic and comparative AP strategies.
FIG. 2.
Distribution of scores (number of correct trials) in AP (left panel, N = 321) and RP (right panel, N = 266) screening tests. Darker bars represent those who passed each screening test.
2. Voice disadvantage is demonstrated more in AP participants' accuracy
To answer the main questions of this study, a mixed-effects logistic regression was carried out to examine the effects of group (AP vs RP), stimulus (voice vs viola), editing (simplified vs original), and their interactions on the accuracy of responses (Table I). Significant main effects were found of group (b = 0.84, standard error, SE = 0.27, p = 0.00167, OR = 2.32, 95% confidence interval, CI [1.37, 3.93]), with higher overall performance by the AP than the RP group, and stimulus (b = –0.42, SE = 0.10, p < 0.0001; OR = 0.65, 95% CI [0.54, 0.79]), with lower accuracy for voice than for viola sounds, along with a significant interaction between group and stimulus (b = –0.35, SE = 0.16, p = 0.0290). These effects held after adjusting for participants' musical backgrounds and parameters of each trial (i.e., the musical interval between sound stimulus and displayed note name, and the direction of mismatch). Based on the interaction between group and stimulus, we carried out separate mixed-effects logistic regressions for the AP and RP groups. A significant main effect of stimulus was found in the AP group (b = –0.71, SE = 0.19, p = 0.000151; OR = 0.49, 95% CI [0.34, 0.71]), with lower accuracy for voice than for viola sounds. A similar trend was observed in the RP group, but it just failed to reach significance (b = –0.21, SE = 0.11, p = 0.0525; OR = 0.81, 95% CI [0.66, 1.00]). Thus, the AP group showed a stronger voice disadvantage effect than the RP group (as evidenced by the group-stimulus interaction), although the effect remained marginally significant even in the RP group. The lack of effect of (or interactions with) editing suggests that within-stimulus variability did not contribute significantly to the effect.
TABLE I.
Mixed-effects logistic regression analysis predicting accuracy from group (AP or RP), stimulus (voice or viola), editing (original or simplified), and their interactions; b refers to the estimated unstandardized regression coefficients with standard errors in parentheses; p-values less than 0.05 are considered statistically significant.
| Predictor | b | p |
|---|---|---|
| Group | 0.84 (0.27) | 0.00166 |
| Stimulus | −0.42 (0.10) | <0.0001 |
| Editing | −0.01 (0.09) | 0.956 |
| Stimulus * Editing | 0.25 (0.13) | 0.0659 |
| Group * Stimulus | −0.35 (0.16) | 0.0290 |
| Group * Editing | −0.03 (0.15) | 0.865 |
| Group * Stimulus * Editing | −0.16 (0.27) | 0.548 |
A mixed-effects linear regression analysis was also carried out to examine the effects of group (AP vs RP), stimulus (voice vs viola), and editing (original vs simplified) on the log-transformed RTs (Table II). A significant main effect of group was found (b = –0.58, SE = 0.28, p = 0.0397, d = 0.27, 95% CI [0.02, 0.52]), with faster responses by the AP than the RP participants. A significant interaction between group and editing was found (b = –0.20, SE = 0.10, p = 0.0459, d = 0.09, 95% CI [0.0029, 0.18]), with AP but not RP participants responding faster to the simplified than to the original stimuli (b = –0.17, SE = 0.06, p = 0.00925, d = 0.08, 95% CI [0.02, 0.13]). These effects held after adjusting for participants' musical backgrounds and parameters of each trial, except that the main effect of group was reduced to a trend that just missed significance. No other significant main effects or interactions were found, although there was a marginally significant trend (b = 0.12, SE = 0.05, p = 0.0524, d = 0.05, 95% CI [0.00, 0.11]) for RTs with voice stimuli to be longer than with non-voice stimuli.
TABLE II.
Mixed-effects linear regression analysis predicting log-transformed RT from group (AP or RP), stimulus (voice or viola), editing (original or simplified), and their interactions; b refers to estimated unstandardized regression coefficients with standard errors in parentheses; p values less than 0.05 are considered statistically significant.
| Predictor | b | p |
|---|---|---|
| Group | −0.58 (0.28) | 0.0397 |
| Stimulus | 0.12 (0.06) | 0.0524 |
| Editing | −0.07 (0.05) | 0.131 |
| Stimulus * Editing | −0.02 (0.07) | 0.729 |
| Group * Stimulus | 0.22 (0.12) | 0.0686 |
| Group * Editing | −0.20 (0.10) | 0.0459 |
| Group * Stimulus * Editing | −0.24 (0.14) | 0.0967 |
In summary, the results demonstrate that a voice disadvantage effect in pitch labeling accuracy exists in AP musicians performing an AP task and that it is greater than a similar (but non-significant) trend in RP musicians performing an RP task (Fig. 3). This effect cannot be attributed to a shift in criterion due to a speed-accuracy trade-off because the decreased accuracy observed with voice judgments was not accompanied by reduced RTs. Although the AP group had shorter RTs to the simplified than to the original version of the stimuli, this effect of editing did not seem to interact with the stimulus type (voice vs viola), indicating that the difference in performance between voice and instrument sounds cannot be ascribed to greater random fluctuations in the pitch and quality of the voice than of the viola.
FIG. 3.
(Color online) Accuracy (top row) and RT (bottom row) of AP (left panels) and RP (right panels) participants under different conditions. Black horizontal lines denote median values, shaded areas are the interquartile ranges (IQR), and vertical lines denote the range of the values that are no further than 1.5IQR away from the shaded area. Outliers (defined as values more than 1.5IQR away from the shaded area) are shown as black dots. Colored lines connect all the data from individual participants.
3. Subjective measures of original and simplified stimuli also reflect a voice disadvantage in the AP task
For each of the four sound categories, participants rated the sound's similarity to a human voice and to a musical instrument, as well as how synthetic it sounded, on three separate 5-point scales. Two-way repeated-measures analyses of variance (ANOVAs) were performed on the subjective ratings, with stimulus (voice vs viola) and editing (original vs simplified) as within-subject factors. The analyses showed that simplifying the sound stimuli made them sound less similar to their respective category (i.e., voice or instrumental sound), and more synthetic. In addition, there was an asymmetrical change in perceived sound categories: simplifying a voice made it sound more like an instrumental sound, but simplifying a viola did not make it sound more like a voice.
For voice similarity ratings, the mean rating score of voices was significantly higher than that of viola sounds [F(1,92) = 262.06, p < 0.0001, ηp2 = 0.74], the mean rating score of original sounds was significantly higher than that of the simplified versions [F(1,92) = 22.62, p < 0.0001, ηp2 = 0.20], and there was a significant interaction between stimulus and editing [F(1,92) = 13.61, p = 0.000381, ηp2 = 0.13]. Post hoc analysis with Bonferroni correction showed that the simplified voice was rated as less similar to voice than the original voice (p < 0.0001), but editing did not affect viola sounds' perceived (dis)similarity to voice (p = 0.203).
For instrument similarity ratings, the mean rating score of the viola was significantly higher than that of the voice [F(1,92) = 241.87, p < 0.0001, ηp2 = 0.72], and there was a significant interaction between stimulus and editing [F(1,92) = 46.86, p < 0.0001, ηp2 = 0.34], but no main effect of editing [F(1,92) = 1.63, p = 0.204, ηp2 = 0.02]. Post hoc analysis with Bonferroni correction showed that the simplified voice was rated as more similar to an instrument sound than the original voice (p < 0.0001), and that the simplified viola sound was rated as less similar to an instrument than the original viola sound (p < 0.0001).
For synthetic ratings, voices were rated as more synthetic than viola sounds [F(1,92) = 4.44, p = 0.0379, ηp2 = 0.05], and simplified versions were rated as more synthetic than original versions [F(1,92) = 23.41, p < 0.0001, ηp2 = 0.20], but there was no significant interaction between stimulus and editing [F(1,92) = 3.38, p = 0.0691, ηp2 = 0.04].
Similarly, for each of the four sound categories, participants answered whether they could extract a vowel from the sound. The responses were coded as 0 (for not being able to extract a vowel) and 1 (for extracting any vowel). A mixed-effects logistic regression revealed that participants were significantly more likely to extract a vowel from voice than from viola stimuli (b = 8.80, SE = 2.87, p = 0.00219, OR = 6.63 × 103, 95% CI [23.76, 1.85 × 106]), but no differences were observed between the original and the simplified versions (b = 0.73, SE = 0.79, p = 0.355, OR = 2.07, 95% CI [0.44, 9.70]). The interaction between stimulus (voice vs viola) and editing (simplified vs original) was not analyzed due to the limited amount of data.
While voice and viola sounds were treated as separate sound categories in this study, the above analyses show that perceived similarities to voice and instrumental sounds are not always complementary (e.g., a decrease in perceived similarity to instrument does not necessarily mean increased similarity to voice). To examine whether the voice-disadvantage effect that we observed when the sounds were grouped based on physical characteristics was also found when the subjective measures were used to categorize the sounds, a set of mixed-effects linear and logistic regressions similar to the models described in the last section were constructed, where subjective measures of the stimuli were used in lieu of the physical stimulus characteristics themselves (voice vs viola and original vs simplified). The effects of voice rating, instrumental rating, synthetic rating, and vowel extraction were assessed through four separate sets of regressions, with each set consisting of a mixed-effects linear regression for RT and a mixed-effects logistic regression for accuracy. As expected, the results were largely in line with those obtained when considering the physical sound characteristics: the AP group was consistently more accurate (voice model: b = 0.81, SE = 0.26, p = 0.00163, OR = 2.26, 95% CI [1.36, 3.74]; instrument model: b = 0.81, SE = 0.27, p = 0.00219, OR = 2.25, 95% CI [1.34, 3.79]; synthetic model: b = 0.76, SE = 0.28, p = 0.00638, OR = 2.15, 95% CI [1.24, 3.72]; vowel extraction model: b = 0.88, SE = 0.27, p = 0.00128, OR = 2.41, 95% CI [1.41, 4.11]) and faster (voice model: b = –0.57, SE = 0.28, p = 0.0442, d = 0.27, 95% CI [0.01, 0.53]; instrument model: b = –0.56, SE = 0.28, p = 0.0493, d = 0.26, 95% CI [0.0043, 0.28]; synthetic model: b = –0.62, SE = 0.28, p = 0.0312, d = 0.30, 95% CI [0.03, 0.56]; vowel extraction model: b = –0.61, SE = 0.28, p = 0.0343, d = 0.28, 95% CI [0.02, 0.54]) than the RP group.
For accuracy, a higher perceived similarity to voice (voice model: b = –0.11, SE = 0.03, p = 0.00120, OR = 0.89, 95% CI [0.84, 0.96]), a lower perceived similarity to instrumental sound (instrument model: b = 0.13, SE = 0.04, p = 0.00298, OR = 1.14, 95% CI [1.06, 1.22]), and an identifiable vowel (vowel extraction model: b = –0.38, SE = 0.13, p = 0.00288, OR = 0.68, 95% CI [0.53, 0.88]) were all significantly associated with lower accuracy. For RT, a significant interaction between group and instrument rating was found (b = –0.08, SE = 0.04, p = 0.0470, d = 0.04, 95% CI [0.0012, 0.08]), where a lower perceived similarity to instrumental sound was associated with a slower response in the AP (b = –0.08, SE = 0.02, p = 0.00160, d = 0.04, 95% CI [0.02, 0.06]) but not the RP group. No other main effects or interactions were observed.
4. Voice disadvantage is larger in self-reported automatic than comparative AP participants
As a preliminary exploration of how the voice disadvantage effect is manifested in different subgroups of AP participants, we tested whether and how it differs between AP participants with self-reported automatic and comparative strategies (Fig. 4). Some participants reported using different strategies for different types of sounds, so we recoded their strategies as either automatic or comparative, based on the strategy used most often for the timbres used in this experiment (i.e., piano, voice, viola). The automatic and comparative strategies correspond loosely to the “true AP” and “quasi AP” subtypes discussed in previous literature (e.g., Levitin and Rogers, 2005), respectively. Thirty-one AP participants were in the automatic strategy subgroup, and 22 AP participants were in the comparative strategy subgroup. Two AP participants were not included in this section's analyses because they reported using other unspecified strategies and did not clearly belong to either category.
FIG. 4.
(Color online) Accuracy (top row) and RT (bottom row) of automatic AP (left panels) and comparative AP (right panels) participants under different conditions. See Fig. 3 for a more detailed description.
A mixed-effects logistic regression was carried out to examine the effects of AP strategy (comparative vs automatic), stimulus (voice vs viola), editing (simplified vs original), and their interactions on the accuracy of responses (Table III). As expected, there was a significant effect of stimulus (b = –0.66, SE = 0.18, p = 0.000283; OR = 0.52, 95% CI [0.36, 0.74]), with lower accuracy for voice than for viola sounds. There was also a significant effect of AP strategy (b = 1.19, SE = 0.44, p = 0.00620, OR = 3.29; 95% CI [1.40, 7.74]), with higher overall performance by the automatic than the comparative AP participants. The two-way interaction between AP strategy and stimulus (b = –0.54, SE = 0.25, p = 0.0292) and the three-way interaction (b = 1.03, SE = 0.41, p = 0.0120) also reached significance. These effects held even after adjusting for participants' musical backgrounds and parameters of each trial. When the responses from the automatic and the comparative AP participants were analyzed separately, the main effect of stimulus (reflecting lower accuracy for voice than for viola stimuli) was observed in both groups (automatic group: b = –0.78, SE = 0.31, p = 0.0123, OR = 0.46, 95% CI [0.25, 0.84]; comparative group: b = –0.44, SE = 0.21, p = 0.0368, OR = 0.64, 95% CI [0.42, 0.97]). A significant interaction between stimulus and editing was found for the automatic group only (b = 0.75, SE = 0.33, p = 0.0213); however, subsequent contrast analyses indicated no effect of editing for either the voice or non-voice stimuli and no effect of stimulus for either the original or simplified sounds. Therefore, a voice disadvantage effect in terms of accuracy was observed in both automatic and comparative AP participants. In summary, the overall accuracy of the comparative AP participants was lower than that of the automatic AP participants, as expected, and the voice disadvantage effect was smaller for the comparative than for the automatic AP participants.
TABLE III.
Mixed-effects logistic regression analysis predicting accuracy from AP strategy, stimulus, editing, and their interactions; b refers to estimated unstandardized regression coefficients with standard errors in parentheses; p values less than 0.05 are considered statistically significant.
| Predictor | b | p |
|---|---|---|
| Stimulus | −0.66 (0.18) | 0.000282 |
| Editing | 0.02 (0.16) | 0.892 |
| Strategy | 1.19 (0.44) | 0.00620 |
| Stimulus * Editing | 0.22 (0.21) | 0.292 |
| Strategy * Stimulus | −0.54 (0.25) | 0.0285 |
| Strategy * Editing | −0.13 (0.22) | 0.563 |
| Strategy * Stimulus * Editing | 1.03 (0.41) | 0.0120 |
A similar mixed-effects linear regression was performed on the log-transformed RT values (Table IV). Interestingly, there was now a significant main effect of stimulus (b = 0.22, SE = 0.07, p = 0.00443, d = 0.11, 95% CI [0.04, 0.18]), with longer RT for voice than for viola sounds. There were also significant main effects of strategy (b = –1.32, SE = 0.36, p = 0.000516, d = 0.64, 95% CI [0.30, 0.98]), with faster responses by the automatic than by the comparative AP participants, and of editing (b = –0.17, SE = 0.07, p = 0.0127, d = 0.08, 95% CI [0.02, 0.15]), with longer RT for the original than for the simplified version of the stimuli. No interactions were found. This pattern of results also held after adjusting for participants' musical backgrounds and parameters of each trial.
TABLE IV.
Mixed-effects linear regression analysis predicting log-transformed RT from AP strategy, stimulus, editing, and their interactions; b refers to estimated unstandardized regression coefficients with standard errors in parentheses; p values less than 0.05 are considered statistically significant.
| Predictor | b | p |
|---|---|---|
| Stimulus | 0.22 (0.07) | 0.00443 |
| Editing | −0.17 (0.07) | 0.0127 |
| Strategy | −1.32 (0.36) | 0.000516 |
| Stimulus * Editing | −0.11 (0.09) | 0.197 |
| Strategy * Stimulus | 0.04 (0.15) | 0.797 |
| Strategy * Editing | −0.02 (0.13) | 0.907 |
| Strategy * Stimulus * Editing | −0.26 (0.17) | 0.140 |
Overall, the analyses focusing only on AP participants agree with the results from the initial analyses. Both automatic and comparative AP participants showed voice disadvantage effects in the form of decreased accuracy. When data from the AP group were analyzed separately, the increase in RT also reached significance, although it had not been significant as a main effect or interaction when considering both the AP and RP groups. As expected, automatic AP participants had higher accuracy and shorter RT compared to comparative AP participants. The current results suggest an interaction between stimulus and AP strategy, since the voice disadvantage effect was larger in automatic than in comparative AP participants, at least in terms of accuracy.
5. Weak but significant relationships between self-reported musical background and screening performance
In previous studies, musical training has been established as an important factor contributing to the development of AP ability (e.g., Miyazaki et al., 2012; Levitin and Rogers, 2005; Vanzella and Schellenberg, 2010). To examine the relationship between participants' self-reported musical background and their AP screening performance, we conducted a linear regression with all the participants who completed the questionnaire and the screening test (N = 321), including AP and RP participants, as well as those who did not pass screening. The predictors included native language (tonal vs non-tonal), the amount of musical training (years), current music playing and listening time (hours per week), first and primary instruments (piano vs others), and age when musical training began (years). The dependent variable was the percentage of correct responses in the AP screening test (range: 0 to 100). We found significant positive effects of current music playing time (b = 0.25, SE = 0.12, p = 0.0311, ηp² = 0.015) and of the amount of musical training (b = 1.00, SE = 0.25, p < 0.0001, ηp² = 0.048). Among all the continuous predictors (i.e., all except native language), weak significant correlations were observed between the amount of music training and current music playing time (Pearson's r = 0.31, p < 0.0001), and between the amount of music training and the age of music training onset (Pearson's r = –0.40, p < 0.0001). The correlations between all other possible pairs of continuous predictors were small (Pearson's r < 0.15 in all cases). We also conducted a linear regression in the non-AP subset (N = 266) to examine the same set of predictors' effects on RP screening performance. We found a significant positive effect of the amount of musical training (b = 0.95, SE = 0.21, p < 0.0001, ηp² = 0.072). The correlations between the other continuous predictors in the non-AP subset shared the same pattern as when both AP and non-AP participants were included.
A correlation test between AP and RP screening scores in non-AP participants showed a weak but significant correlation (Pearson's r = 0.29, p < 0.0001).
To evaluate the potential of using self-assessment of AP proficiency as a quick index of AP ability in lieu of a screening test, we determined how well the self-assessment of being “always accurate” (for note-naming tasks) or “accurate - the error is less than 50 cents” (for singing and tuning tasks) in a variety of items and self-report of being an AP possessor predict qualification for the AP group (i.e., at least 75% correct in the AP screening test) in terms of sensitivity, specificity, and d′. The candidate self-assessment items were: accuracy in naming the chromas (e.g., C/C#/D) of tones played on the piano, tones played on instruments that the participant plays, tones played on instruments that the participant does not play, voice, synthetic sound, or environmental sound without a reference tone; accuracy in singing a pitch without a reference tone; and self-report of AP possession. The item “tuning an instrument without a reference tone” was dropped because many participants reported that their instruments did not need tuning. The different items varied in their power to distinguish between participants who did and did not pass the AP screening (Table V). Considering the low prevalence of AP among all participants, a screening test is still a preferable procedure to eliminate the false positives in the self-assessment without falsely rejecting too many AP participants.
TABLE V.
Sensitivity, specificity, and d′ of using different self-report items as an indicator for AP screening performance. For the item with 100% specificity, d′ was calculated using the Hautus (1995) adjustment method.
| Item | Sensitivity (hits) | Specificity (1-FA) | d′ |
|---|---|---|---|
| Passive AP: Piano | 63.6% | 94.7% | 1.97 |
| Passive AP: Instrument that I play | 58.1% | 94.4% | 1.79 |
| Passive AP: Instrument that I don't play | 38.2% | 100% | 2.60 |
| Passive AP: Voice | 32.7% | 98.5% | 1.72 |
| Passive AP: Synthetic sound | 21.8% | 99.2% | 1.65 |
| Passive AP: Environmental sound | 12.7% | 98.5% | 1.03 |
| Active AP: Singing a note | 54.5% | 97.4% | 2.05 |
| Self-report of AP possession | 54.5% | 94.4% | 1.70 |
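The d′ values in Table V can be reproduced from the tabulated rates as d′ = z(hit rate) − z(false-alarm rate), with z the inverse normal CDF. A brief sketch follows; the counts of 55 AP and 266 non-AP participants used for the Hautus-adjusted row are inferred from the sample sizes reported above (N = 321 overall, N = 266 non-AP) and are thus an assumption:

```python
from statistics import NormalDist

def dprime(hit, fa, n_ap=None, n_non_ap=None):
    """d' = z(hit rate) - z(false-alarm rate). When a rate is exactly
    0 or 1, z() is undefined, so apply the log-linear (Hautus, 1995)
    correction: add 0.5 to each count and 1 to each total."""
    if hit in (0.0, 1.0) or fa in (0.0, 1.0):
        hit = (hit * n_ap + 0.5) / (n_ap + 1)
        fa = (fa * n_non_ap + 0.5) / (n_non_ap + 1)
    z = NormalDist().inv_cdf
    return z(hit) - z(fa)

# "Passive AP: Piano": sensitivity 63.6%, specificity 94.7% -> FA = 5.3%
piano = dprime(0.636, 0.053)        # ~1.96 (1.97 in Table V; rates are rounded)
# 100%-specificity item, assuming 55 AP and 266 non-AP participants
no_play = dprime(0.382, 0.0, n_ap=55, n_non_ap=266)   # ~2.60, as in Table V
```

With the assumed counts, the adjusted calculation lands on the tabulated 2.60 for the perfect-specificity item, which is why that row is the only one requiring the Hautus method.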
III. EXPERIMENT 2: FUNDAMENTAL FREQUENCY DISCRIMINATION
The voice disadvantage effect was significantly greater in the AP task than in the RP task, where the effect was marginal. It is nevertheless possible that the voice disadvantage effect reflects a more general phenomenon that affects any task involving fine-grained pitch coding and discrimination. To test the generalizability of the voice disadvantage effect, an in-person lab experiment was carried out to measure basic F0 difference limens (F0DLs) for the four types of sounds (i.e., original and simplified versions of voice and viola stimuli) used in Experiment 1.
A. Methods
1. Participants
Eighteen participants (age range: 18–31 years, mean = 20.7, SD = 3.3; duration of musical training range: 0–13 years, mean = 4.8, SD = 4.3) took part in the F0 discrimination experiment. Three participants were excluded from the analysis because their thresholds were not measurable using the current setup (i.e., their thresholds were likely greater than 1 ST). All but one participant reported not possessing AP. Participants were recruited through introductory psychology courses without specific requirements regarding musical experience. This study was approved by the University of Minnesota's Institutional Review Board. All the participants completed a consent form prior to the study and were awarded a digital gift card or extra course credit upon completion.
2. Stimuli and procedure
Each participant completed three adaptive runs for each of the four sound types: original viola, original voice, simplified viola, and simplified voice. The stimuli were presented diotically over Sennheiser HD650 headphones (Sennheiser, Wedemark, Germany) at 70 dB sound pressure level (SPL). The task involved a 3-interval, 3-alternative forced-choice (3I3AFC) paradigm in conjunction with a 2-down 1-up adaptive staircase procedure that tracks the 70.7% correct point on the psychometric function (Levitt, 1971). The stimuli were generated and the procedure controlled within MATLAB R2018b using the AFC package (Ewert, 2013). The stimuli were created by shifting the pitch (via Praat) of the note F3 (174.6 Hz) used in Experiment 1. The three 250-ms notes in each trial were separated by 500-ms silent interstimulus intervals. Two of the intervals (reference intervals) were below F3 and one interval (target interval, selected at random on each trial) was above F3. The listener's task was to select the interval containing the target. Feedback was provided after each trial.
At the beginning of each block, the F0 difference (ΔF0) between the target and the non-target stimuli was set to 80 cents. The initial step size of the adaptive procedure was set to 20 cents. After two reversals in the direction of the adaptive procedure, the step size was reduced to 10 cents. After another two reversals, the step size was reduced to its final value of 5 cents. The run then continued for another six reversals at the final step size. Threshold was defined as the mean value of ΔF0 (in cents) at the last six reversal points. Each listener's final F0DL was the mean across the three runs for each stimulus type. The stimulus types were presented in random order for each listener independently, with the constraint that each of the four stimulus types was tested before any was repeated.
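The staircase logic described above can be sketched in a few lines. This is an illustrative simulation, not the actual AFC-package implementation; the deterministic "listener" in the usage line (correct whenever ΔF0 ≥ 30 cents) is an assumption used only to exercise the track:

```python
def staircase(respond, start=80.0):
    """2-down 1-up adaptive track (Levitt, 1971), following the rules in
    the text: delta-F0 starts at 80 cents; the step shrinks 20 -> 10 -> 5
    cents after 2 reversals each; the run ends 6 reversals after reaching
    the final step size; threshold = mean delta-F0 at those 6 reversals."""
    steps = [20.0, 10.0, 5.0]
    delta, streak, direction, step_idx = start, 0, 0, 0
    reversals = []
    while True:
        if respond(delta):              # simulated listener's answer
            streak += 1
            if streak < 2:
                continue                # need 2 correct in a row to go down
            streak, move = 0, -1
        else:
            streak, move = 0, +1
        if direction and move != direction:
            reversals.append(delta)     # direction change = reversal
            if step_idx < 2 and len(reversals) == 2 * (step_idx + 1):
                step_idx += 1           # shrink the step size
            elif step_idx == 2 and len(reversals) >= 10:
                return sum(reversals[-6:]) / 6
        direction = move
        delta = max(0.0, delta + move * steps[step_idx])

# Deterministic "listener" correct whenever delta-F0 >= 30 cents:
print(staircase(lambda d: d >= 30))  # returns 27.5, straddling 30 cents
```

With a deterministic listener the track oscillates around the 30-cent boundary at the final 5-cent step, so the returned mean of the last six reversals lands just below it; with a probabilistic listener it converges near the 70.7% correct point instead.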
B. Results
A repeated-measures ANOVA revealed significant effects of stimulus [voice vs viola; F(1,14) = 24.23, p = 0.000225, ηp² = 0.63], editing [simplified vs original; F(1,14) = 6.68, p = 0.0216, ηp² = 0.32] and their interaction [F(1,14) = 10.58, p = 0.00579, ηp² = 0.43] on the F0DLs (Fig. 5). Post hoc analysis with Bonferroni correction showed that F0DLs were lower for the simplified voice than for the original voice (p = 0.0029), but that editing did not affect F0DLs for the viola sounds (p = 0.314). Pitch discrimination thresholds were higher for voice than for viola in both simplified (p = 0.0038) and original (p = 0.0004) conditions.
FIG. 5.
(Color online) Individual and summary F0DLs for the two stimuli (voice and viola) under simplified and original conditions. Details of the lines and boxplots are as in Fig. 3.
The results suggest that some voice disadvantage effect exists for basic F0 discrimination. Although some of the effects may be due to larger spectro-temporal variations (e.g., vibrato) in the voice than in the viola, such variations cannot completely explain the effect, as some differences between voice and viola remained, even for the simplified stimuli.
IV. DISCUSSION
A. Voice disadvantage effect in AP participants
The results of Experiment 1 show that AP participants are less accurate in identifying the pitch of voice than of viola tones. RTs for responses to the voice were also slower than to the viola tones in the AP group when their data were analyzed separately. This voice disadvantage in AP participants is consistent with the findings from the Vanzella and Schellenberg (2010) study, where AP participants had lower accuracy in identifying the pitch of a recorded or synthesized voice, compared with piano and pure tones. The lack of an effect of removing spectro-temporal variations in the stimuli on response accuracy is also consistent with the findings of Vanzella and Schellenberg (2010) and Weiss et al. (2015) that a synthesized voice produces a voice disadvantage effect in terms of accuracy similar to that of a recorded voice, despite having smaller inherent pitch variations than a recorded voice, comparable to those of instrumental sounds. Although AP participants responded faster to the stimuli where spectro-temporal variations were removed than to the original versions, this effect did not differ between voice and viola sounds. Taken together, it appears that the voice disadvantage effect is not due primarily to natural fluctuations in F0 and/or temporal envelope over time. In addition to confirming prior findings, our results extend them by demonstrating that decreased accuracy for the voice stimuli was not accompanied by a decrease in RT, ruling out the possibility that the previously observed voice disadvantage effect was simply a criterion shift, as reflected in a speed-accuracy trade-off.
B. Voice disadvantage effect is greater in the AP than the RP group
We observed a trend that a voice disadvantage effect in accuracy may also exist in RP participants, who were tested with RP tasks in our study, but to a lesser extent than in AP participants. It therefore remains possible that phonetic or semantic interference and difficulty in extracting pitch from voice do play a role in the voice disadvantage effects in AP participants/tasks, but the unequal voice disadvantage in AP and RP groups suggests that at least a part of the effect is specific to AP. However, it is important to note that the current study cannot distinguish between AP tasks and AP possession, since AP and RP participants were tested with separate tasks.
A possible factor underlying the larger voice disadvantage in AP than in RP participants is the relative timing of voice information perception (e.g., perceiving a vowel) and note name perception. Vanzella and Schellenberg (2010) speculated that the voice may automatically activate neural mechanisms for processing linguistic and paralinguistic information, which interferes with pitch identification. It is possible that the temporal proximity of voice information processing and note name perception plays an important role in the interference effect. Specifically, RP participants receive the voice and pitch information of the test tone simultaneously, but must then derive the note name of the test tone by comparing its pitch to that of the reference. Thus, there is a potential time lag between the vowel information and the derivation of the note name. In contrast, AP participants receive both the voice and note-name information simultaneously upon presentation of the test tone. The simultaneous processing of both the voice and note-name information in the AP condition may explain the greater interference observed in the AP group.
The trend that RP participants were worse at identifying the pitch of voice, compared to viola tones, is in line with the previously observed vocal generosity effect in melody intonation judgments, where both musicians and nonmusicians are worse at telling in-tune from out-of-tune melodies sung by voice compared to those played on a violin (Hutchins et al., 2012). However, it is unclear whether the current results and the previously observed vocal generosity effect reflect the same phenomenon. It is worth noting that the RP tasks in Experiment 1 only involved recognizing musical intervals consisting of an integer number of STs and associating them with categorical labels, and did not test the ability to detect frequency deviations of less than 1 ST, as in the previous experiments testing the vocal generosity effect.
C. Voice disadvantage effect generalizes to basic F0 discrimination
In Experiment 2, we observed poorer F0 discrimination thresholds for voice than for viola sounds, replicating the vocal generosity effect (Hutchins et al., 2012). Furthermore, spectro-temporal fluctuations appear to contribute, but only partially, to the voice disadvantage effect in F0 discrimination, since the effect was smaller in the simplified condition than in the original condition, but was not absent. Taken together with the results of Experiment 1, the results suggest that the voice disadvantage effect generally applies to fine-grained pitch perception of single tones, rather than just note naming based on AP or musical interval perception.
However, it is unlikely that the voice disadvantage in note naming can be completely attributed to differences in F0DLs. The two tasks diverge in that reducing spectro-temporal fluctuations reduced the voice disadvantage effect in the pitch discrimination task but not in the AP or RP note name judgment tasks. Moreover, F0DLs for all four sound types fell in the range of 10–25 cents for most participants, whereas a shift of more than 50 cents is needed for an in-tune tone to fall under another note category. Nevertheless, it is possible that the voice disadvantage effects in note naming and pitch discrimination share the same underlying cause, such as a coarser pitch height representation for voice than for non-vocal harmonic sounds.
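For intuition about these cent values, the standard cents-to-frequency conversion can be applied at the F3 reference (174.6 Hz) used in Experiment 2; a minimal sketch (the specific cent values below are illustrative):

```python
def cents_to_hz(f0, cents):
    """Frequency offset (Hz) corresponding to a shift of `cents`
    above a base frequency f0 (100 cents = 1 semitone)."""
    return f0 * (2 ** (cents / 1200) - 1)

f3 = 174.6  # Hz, the F3 reference from Experiment 2
for c in (10, 25, 50):
    print(f"{c:>2} cents at F3 = {cents_to_hz(f3, c):.2f} Hz")
```

At F3, the typical 10–25 cent F0DLs correspond to roughly 1.0–2.5 Hz, while the 50-cent half-semitone boundary corresponds to about 5.1 Hz, underlining how much finer the discrimination thresholds are than the note-category boundaries.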
D. Voice disadvantage or instrument advantage?
Can the voice disadvantage effect in note naming be interpreted as an “instrument advantage,” in that familiarity with instrumental sounds in music settings may enhance the accuracy for labeling instrumental tones? Although we did not include stimuli other than voice and viola tones in this study, we consider this interpretation unlikely. In music education, aural skills are usually trained through listening to piano tones and singing; strings are rarely used for this purpose. We chose viola as a non-voice comparison in our experiments, and participants would generally be expected to be no less familiar with voice than with viola in the context of tasks similar to those used in ear training. Even if some string players in Experiment 1 were indeed more familiar with viola sounds, the pattern of results did not change after controlling for participants' first or primary instrument, along with other variables related to participants' musical background. In addition, previous studies have found that AP participants' accuracy for labeling pure tones was significantly higher than that for recorded sung voice or synthesized voices, but the pitch labeling accuracy for the two types of voices did not differ (Vanzella and Schellenberg, 2010; Weiss et al., 2015), suggesting that the difference between accuracy for labeling instrument and voice pitch cannot be fully explained by a familiarity or instrument advantage effect.
E. Automatic AP participants are more accurate and faster than comparative AP participants
We examined the differences between AP participants who reported using either comparative (quasi AP) or automatic (true AP) strategies in Experiment 1. The participants using automatic strategies had higher accuracy and shorter RTs compared with participants using comparative strategies, and while voice disadvantage effects in terms of accuracy and RT were observed in both groups, the voice disadvantage on accuracy was larger in automatic AP participants than comparative AP participants. RT has been proposed as a possible measure for distinguishing between true and quasi AP possessors (Levitin and Rogers, 2005), and accuracy has been used to categorize AP possessors into subgroups (e.g., Athos et al., 2007). It is worth noting that while the proposed subtypes of AP were theoretically defined by inner processes or strategies for pitch labeling, whether accuracy and RT truly differ between AP possessors using different inner processes or strategies remains unclear. Our results support the proposal that RT and accuracy have the potential to be used as indicators for categorizing AP participants' strategies. To further explore this possibility, we used the mean RTs and mean accuracies in the AP version of the note name judgment task to predict AP participants' self-reported strategies. For a given RT threshold, AP participants with a longer mean RT than the threshold were categorized as comparative AP, and the other AP participants were categorized as automatic AP. Similarly, AP participants with a lower mean accuracy than the accuracy threshold were categorized as comparative AP, and the others were categorized as automatic AP. The d′ values for the categorizations of the two self-reported strategies were calculated when different threshold values and different stimuli were used (Fig. 6). Classification performance was generally better when RT thresholds, rather than accuracy thresholds, were used. 
The best performance was reached when the piano sounds and RT thresholds between 1900 and 2300 ms were used, where d′ values were between 1.85 and 1.96, and categorization accuracy was between 75.5% and 81.1%.
FIG. 6.
(Color online) Values of d′ for predicting AP participants' self-report strategy (automatic or comparative) from accuracy (top panel) or RT (lower panel), when different thresholds and stimuli were used.
F. Towards a holistic account: The Feature Relevance Hypothesis
Previous studies have observed voice advantage effects in timbre recognition (Agus et al., 2012; Isnard et al., 2019) and in melody memorization (Weiss et al., 2015). To reconcile these results with the voice disadvantage effect, we propose a holistic account of voice processing, termed the Feature Relevance Hypothesis, which states that voice processing mechanisms are feature specific. According to this hypothesis, features relevant to voice information processing are facilitated, whereas features irrelevant to voice information processing are unchanged or suppressed. This hypothesis seems qualitatively consistent with the voice (dis)advantage effects observed so far: timbre and pitch contours are important for extracting linguistic and paralinguistic information from voice, and voice advantage effects have been observed for these features. By contrast, although gross estimates of pitch height can contribute to speaker gender identification (e.g., Pernet and Belin, 2012) and speaker recognition (McPherson and McDermott, 2018), fine-grained estimates of pitch height (accurate to within 1 ST) are generally not crucial for voice information extraction. For instance, the speaker recognition accuracy was reduced by less than 15% when the F0 was shifted by 3 STs in either direction (McPherson and McDermott, 2018), showing a larger tolerance to shifts in pitch height than is required to perform AP or RP note-naming tasks, where a shift of 1 ST would change the perceived category. Correspondingly, there is no voice advantage effect observed for pitch chroma and fine-tuned pitch identification or discrimination. It may be that these voice (dis)advantage effects are due to listeners' extensive experience in voice and speech perception, so that the relevant features are processed more efficiently than (and may, in turn, interfere with the processing of) the irrelevant features. 
Such a difference in the efficiency of feature processing would result in a pattern of voice perception similar to what the past and the present studies have observed and what the hypothesis predicts. The neural correlates of these voice (dis)advantage effects could be distinctive neural structures and pathways for voice processing, or a single pool of neural resources being recruited in an optimal way based on the acoustical properties of incoming signals, or a combination of both (Zatorre and Gandour, 2008). To examine the neural basis of the voice disadvantage effect in pitch perception, future neuroimaging studies could compare the level of activation and lateralization in Heschl's gyrus, a pitch-sensitive area where structural and functional differences are associated with musical abilities and pitch perception preference (Schneider et al., 2005; but see Seither-Preisler et al., 2007), in response to vocal and non-vocal stimuli. This hypothesis could be tested further by examining the perception of other vocal features (e.g., direction and slope of a continuous pitch contour, as in some tonal languages) and by determining whether the voice (dis)advantage effects are innate perceptual tendencies or acquired through listening experience.
G. Limitations and future directions
One limitation of this study is that it remains unclear whether the voice disadvantage effect observed in the pitch labeling task is specific to AP possessors, AP abilities, or both, since the current study only tested AP and RP participants with AP and RP tasks, respectively. The previous finding that non-AP musicians showed a voice disadvantage in AP labeling tasks comparable to AP musicians (Weiss et al., 2015) suggests that the voice disadvantage effect observed in AP musicians may not be specific to AP possessors. However, it is worth noting that in the study by Weiss et al. (2015), the non-AP musicians were able to perform note-naming tasks significantly above chance, and therefore may not be representative of the majority of non-AP listeners. Although typical non-AP listeners, especially non-musicians, would not be able to perform note-naming tasks without an external reference tone, some level of absolute memory for pitch has been observed in nonmusicians in both production (Levitin, 1994) and perception tasks (Schellenberg and Trehub, 2003; Smith and Schmuckler, 2008; Van Hedger et al., 2016); thus, a task involving such absolute memory for pitch that can nevertheless be completed by non-musicians could be undertaken with both voice and non-voice stimuli to determine whether a voice disadvantage can be observed in non-AP possessors. The question of whether the effect is specific to AP abilities could in principle also be addressed by testing AP possessors on an RP task; however, care would need to be taken to ensure that the AP musicians were not able to perform the task using AP skills (Miyazaki, 1995).
Although the decreased accuracy and increased RT for voice conditions observed in the analysis that focused on automatic and comparative AP participants suggest that pitch labeling is a less automatic process for voice than for instrumental sounds, it is still unclear whether stimulus timbre affects the level of automaticity of pitch labeling in AP possessors. A possible way to answer this question would be via auditory Stroop tasks (Akiva-Kabiri and Henik, 2012). An asymmetrical Stroop effect specific to AP possessors was previously observed when synthesized piano tones were used as auditory stimuli, where the incongruence between auditory stimuli and musical notation negatively affected the naming of the musical notation, but not the naming of the auditory stimuli (Akiva-Kabiri and Henik, 2012). This approach could be used in future studies with voice and other timbres. If there is indeed a voice disadvantage effect in the automaticity of AP possessors' pitch labeling, the AP possessor-specific Stroop effects may be weaker when tested with voice than with other timbres.
More generally, it could be further examined how the automaticity of fine-grained pitch extraction compares between vocal and non-vocal stimuli. In an early study, Semal et al. (1996) tested participants in a pitch comparison task where the two target tones were separated by four consecutive distractors. The effects of vocal and non-vocal distractors were asymmetric: when complex tones were used as targets, vocal distractors had a smaller interference effect than non-vocal distractors; however, when voices were used as targets, the interference effects from vocal and non-vocal distractors were comparable. A possible explanation is that the processing of the fine-grained pitch of voice is less automatic than that of non-voice, so that when voice is the attended timbre and non-voice is not, the attentional preference for voice could compensate for the voice disadvantage. Future studies could further investigate how attention may interact with the voice disadvantage effect using more strictly controlled vocal and non-vocal stimuli.
Finally, the fact that the simplified sounds were similarly capable of inducing the voice disadvantage effect as the original sounds confirms that fluctuations in the fundamental frequency and temporal envelope over time (which may be greater for voice than for musical instrument sounds) do not play an important role in the effect. Further work is necessary to elucidate what specific acoustic features are crucial for inducing the voice disadvantage effect and/or whether the stimulus needs to be perceptually identified as a voice for the effect to occur.
ACKNOWLEDGMENTS
This work was supported by NIH Grant No. R01 DC005216. We thank Gun Joo Lim for assisting in recruitment and data collection. Two reviewers and the Associate Editor, Emily Buss, provided helpful comments on an earlier version.
Footnotes
See supplementary material at https://www.scitation.org/doi/suppl/10.1121/10.0010123 for the full questionnaire and audio examples.
References
- Agus, T. R., Paquette, S., Suied, C., Pressnitzer, D., and Belin, P. (2017). “Voice selectivity in the temporal voice area despite matched low-level acoustic cues,” Sci. Rep. 7, 11526. 10.1038/s41598-017-11684-1
- Agus, T. R., Suied, C., Thorpe, S. J., and Pressnitzer, D. (2012). “Fast recognition of musical sounds based on timbre,” J. Acoust. Soc. Am. 131(5), 4124–4133. 10.1121/1.3701865
- Akiva-Kabiri, L., and Henik, A. (2012). “A unique asymmetrical Stroop effect in absolute pitch possessors,” Exp. Psychol. 59(5), 272–278. 10.1027/1618-3169/a000153
- Apple Inc. (2020). “GarageBand for Mac,” https://www.apple.com/mac/garageband/ (Last viewed 7/1/2021).
- Aronoff, J. M., and Landsberger, D. M. (2013). “The development of a modified spectral ripple test,” J. Acoust. Soc. Am. 134(2), EL217–EL222. 10.1121/1.4813802
- Athos, E. A., Levinson, B., Kistler, A., Zemansky, J., Bostrom, A., Freimer, N., and Gitschier, J. (2007). “Dichotomy and perceptual distortions in absolute pitch ability,” Proc. Natl. Acad. Sci. U.S.A. 104(37), 14795–14800. 10.1073/pnas.0703868104
- Bachem, A. (1955). “Absolute pitch,” J. Acoust. Soc. Am. 27, 1180–1185. 10.1121/1.1908155
- Bahr, N., Christensen, C. A., and Bahr, M. (2005). “Diversity of accuracy profiles for absolute pitch recognition,” Psychol. Music 33(1), 58–93. 10.1177/0305735605048014
- Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). “Fitting linear mixed-effects models using lme4,” J. Stat. Softw. 67(1), 1–48. 10.18637/jss.v067.i01
- Belin, P., Zatorre, R. J., Lafaille, P., Ahad, P., and Pike, B. (2000). “Voice-selective areas in human auditory cortex,” Nature 403(6767), 309–312. 10.1038/35002078
- Boersma, P. (2001). “Praat: A system for doing phonetics by computer,” Glot. Int. 5(9), 341–345.
- Brauer, M., and Curtin, J. J. (2018). “Linear mixed-effects models and the analysis of nonindependent data: A unified framework to analyze categorical and continuous independent variables that vary within-subjects and/or within-items,” Psychol. Methods 23(3), 389–411. 10.1037/met0000159
- Brysbaert, M., and Stevens, M. (2018). “Power analysis and effect size in mixed effects models: A tutorial,” J. Cogn. 1(1), 9. 10.5334/joc.10
- Charest, I., Pernet, C. R., Rousselet, G. A., Quiñones, I., Latinus, M., Fillion-Bilodeau, S., Chartrand, J.-P., and Belin, P. (2009). “Electrophysiological evidence for an early processing of human voices,” BMC Neurosci. 10(1), 127. 10.1186/1471-2202-10-127
- Deutsch, D., Henthorn, T., and Dolson, M. (2004). “Absolute pitch, speech, and tone language: Some experiments and a proposed framework,” Music Percept. 21(3), 339–356. 10.1525/mp.2004.21.3.339
- Ewert, S. D. (2013). “AFC—A modular framework for running psychoacoustic experiments and computational perception models,” in Proceedings of the AIA-DAGA, March 18–21, Merano, Italy, pp. 1326–1329.
- Hautus, M. J. (1995). “Corrections for extreme proportions and their biasing effects on estimated values of d′,” Behav. Res. Methods Instrum. Comput. 27(1), 46–51. 10.3758/BF03203619
- Heitz, R. P. (2014). “The speed-accuracy tradeoff: History, physiology, methodology, and behavior,” Front. Neurosci. 8, 150. 10.3389/fnins.2014.00150
- Hutchins, S., Roquet, C., and Peretz, I. (2012). “The vocal generosity effect: How bad can your singing be?,” Music Percept. 30(2), 147–159. 10.1525/mp.2012.30.2.147
- Isnard, V., Chastres, V., Viaud-Delmon, I., and Suied, C. (2019). “The time course of auditory recognition measured with rapid sequences of short natural sounds,” Sci. Rep. 9, 8005. 10.1038/s41598-019-43126-5
- Kim, S.-G., and Knösche, T. R. (2017). “On the perceptual subprocess of absolute pitch,” Front. Neurosci. 11, 557–562. 10.3389/fnins.2017.00557
- Kuznetsova, A., Brockhoff, P. B., and Christensen, R. H. B. (2017). “lmerTest package: Tests in linear mixed effects models,” J. Stat. Softw. 82(13), 1–26. 10.18637/jss.v082.i13
- Levitin, D. J. (1994). “Absolute memory for musical pitch: Evidence from the production of learned melodies,” Percept. Psychophys. 56(4), 414–423. 10.3758/BF03206733
- Levitin, D. J., and Rogers, S. E. (2005). “Absolute pitch: Perception, coding, and controversies,” Trends Cogn. Sci. 9(1), 26–33. 10.1016/j.tics.2004.11.007
- Levitt, H. (1971). “Transformed up-down methods in psychoacoustics,” J. Acoust. Soc. Am. 49(2B), 467–477. 10.1121/1.1912375
- McPherson, M. J., and McDermott, J. H. (2018). “Diversity in pitch perception revealed by task dependence,” Nat. Hum. Behav. 2(1), 52–66. 10.1038/s41562-017-0261-8
- Miyazaki, K. (1995). “Perception of relative pitch with different references: Some absolute-pitch listeners can't tell musical interval names,” Percept. Psychophys. 57(7), 962–970. 10.3758/BF03205455
- Miyazaki, K., Makomaska, S., and Rakowski, A. (2012). “Prevalence of absolute pitch: A comparison between Japanese and Polish music students,” J. Acoust. Soc. Am. 132(5), 3484–3493. 10.1121/1.4756956
- Miyazaki, K., Rakowski, A., Makomaska, S., Jiang, C., Tsuzaki, M., Oxenham, A. J., Ellis, G., and Lipscomb, S. D. (2018). “Absolute pitch and relative pitch in music students in the East and the West: Implications for aural-skills education,” Music Percept. 36(2), 135–155. 10.1525/mp.2018.36.2.135
- Pernet, C. R., and Belin, P. (2012). “The role of pitch and timbre in voice gender categorization,” Front. Psychol. 3, 23. 10.3389/fpsyg.2012.00023
- R Core Team (2020). “R: A language and environment for statistical computing,” R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org/ (Last viewed 7/1/2021).
- Rizopoulos, D. (2021). “GLMMadaptive: Generalized linear mixed models using adaptive Gaussian quadrature. R package version 0.8-0,” https://CRAN.R-project.org/package=GLMMadaptive (Last viewed 7/1/2021).
- Schellenberg, E. G., and Trehub, S. E. (2003). “Good pitch memory is widespread,” Psychol. Sci. 14(3), 262–266. 10.1111/1467-9280.03432
- Schellenberg, E. G., and Trehub, S. E. (2008). “Is there an Asian advantage for pitch memory?,” Music Percept. 25(3), 241–252. 10.1525/mp.2008.25.3.241
- Schielzeth, H. (2010). “Simple means to improve the interpretability of regression coefficients,” Methods Ecol. Evol. 1(2), 103–113. 10.1111/j.2041-210X.2010.00012.x
- Schneider, P., Sluming, V., Roberts, N., Scherg, M., Goebel, R., Specht, H. J., Dosch, H. G., Bleeck, S., Stippich, C., and Rupp, A. (2005). “Structural and functional asymmetry of lateral Heschl's gyrus reflects pitch perception preference,” Nat. Neurosci. 8(9), 1241–1247. 10.1038/nn1530
- Seither-Preisler, A., Johnson, L., Krumbholz, K., Nobbe, A., Patterson, R., Seither, S., and Lütkenhöner, B. (2007). “Tone sequences with conflicting fundamental pitch and timbre changes are heard differently by musicians and nonmusicians,” J. Exp. Psychol. Hum. Percept. Perform. 33(3), 743–751. 10.1037/0096-1523.33.3.743
- Semal, C., Demany, L., Ueda, K., and Hallé, P. (1996). “Speech versus nonspeech in pitch memory,” J. Acoust. Soc. Am. 100(2), 1132–1140. 10.1121/1.416298
- Smith, N. A., and Schmuckler, M. A. (2008). “Dial A440 for absolute pitch: Absolute pitch memory by non-absolute pitch possessors,” J. Acoust. Soc. Am. 123(4), EL77–EL84. 10.1121/1.2896106
- Stoet, G. (2010). “PsyToolkit: A software package for programming psychological experiments using Linux,” Behav. Res. Methods 42(4), 1096–1104. 10.3758/BRM.42.4.1096
- Stoet, G. (2017). “PsyToolkit: A novel web-based method for running online questionnaires and reaction-time experiments,” Teach. Psychol. 44(1), 24–31. 10.1177/0098628316677643
- Van Besouw, R. M., Brereton, J. S., and Howard, D. M. (2008). “Range of tuning for tones with and without vibrato,” Music Percept. 26(2), 145–155. 10.1525/mp.2008.26.2.145
- Van Hedger, S. C., Heald, S. L. M., and Nusbaum, H. C. (2016). “What the [bleep]? Enhanced absolute pitch memory for a 1000 Hz sine tone,” Cognition 154, 139–150. 10.1016/j.cognition.2016.06.001
- Van Hedger, S. C., Heald, S. L. M., and Nusbaum, H. C. (2019). “Absolute pitch can be learned by some adults,” PLoS One 14(9), e0223047. 10.1371/journal.pone.0223047
- Vanzella, P., and Schellenberg, E. G. (2010). “Absolute pitch: Effects of timbre on note-naming ability,” PLoS One 5(11), e15449. 10.1371/journal.pone.0015449
- Weiss, M. W., Vanzella, P., Schellenberg, E. G., and Trehub, S. E. (2015). “Rapid communication: Pianists exhibit enhanced memory for vocal melodies but not piano melodies,” Q. J. Exp. Psychol. 68(5), 866–877. 10.1080/17470218.2015.1020818
- Zatorre, R. J., and Gandour, J. T. (2008). “Neural specializations for speech and pitch: Moving beyond the dichotomies,” Philos. Trans. R. Soc. London, Ser. B 363(1493), 1087–1104. 10.1098/rstb.2007.2161