PLOS One. 2021 Feb 22;16(2):e0247473. doi: 10.1371/journal.pone.0247473

A new approach to measuring absolute pitch on a psychometric theory of isolated pitch perception: Is it disentangling specific groups or capturing a continuous ability?

Nayana Di Giuseppe Germano, Hugo Cogo-Moreira, Fausto Coutinho-Lourenço, Graziela Bortz
Editor: Karl Bang Christensen
PMCID: PMC7899369  PMID: 33617560

Abstract

Absolute Pitch (AP) is commonly defined as a rare ability that allows an individual to identify any pitch by name. Most researchers use classificatory tests for AP that track the number of correct answers to isolated stimuli. However, each researcher chooses their own procedure for what should be considered correct or incorrect in measuring this ability. Consequently, it is impossible to evaluate whether different stimuli and criteria classify individuals in the same way. We thus adopted a psychometric perspective, approaching AP as a latent trait. Via the Latent Variable Model, we evaluated the consistency and validity of a measure to test for AP ability. A total of 783 undergraduate music students participated in the test. The test battery comprised 10 isolated pitches. All collected data were analyzed with two different rating criteria (perfect and imperfect) under three Latent Variable Model approaches: continuous (Item Response Theory with two and three parameters), categorical (Latent Class Analysis), and hybrid. According to model fit indices, the measurement model for the perfect approach (only exact pitch responses counted as correct) fit better under the trait (continuous) specification. This contradicts the usual assumption of a division between AP and non-AP possessors. Alternatively, the two-class categorical solution proved best for the imperfect approach (exact pitch responses and semitone deviations counted as correct).

Introduction

The phenomenon of Absolute Pitch (AP) was first scientifically described by Stumpf [1], although it was alluded to much earlier, in Mozart’s era [2, 3]. AP ability has attracted attention from musicians, psychologists, and neuroscientists, leading to a large body of research [4–7].

AP has not yet been accurately and consensually defined among the academic community [8], leading to significant variations among AP evidence and AP classification. Consequently, conclusions regarding AP classification may not be comparable due to the lack of consensus on criteria (e.g., the time required to identify a tone, or the degree of precision in tone identification).

Only a few defining criteria for AP ability are agreed upon among authors, such as the automatic association between a certain pitch and a learned verbal label [9], and the definition of AP as a rare ability that refers to a long-term internal representation for pitches. Consequently, AP typically manifests behaviorally as the ability to identify any given pitch by name (according to the traditional pattern of musical notation learned by a subject), or by producing a given musical tone on demand, with no external reference, e.g., without a diapason [3, 10–13].

The extant AP literature references certain limitations of AP possessors in pitch identification. Timbre limitation is mentioned by several studies [1, 6, 14–18], although it has not yet been universally specified or quantified, and its causes have not yet been scientifically explained. The same can be said of register limitation among AP individuals [1, 14, 19, 20]. We consider these to be examples of relevant non-consensual criteria excluded from most AP definitions. An important consequence of this methodological decision is that individuals with difficulties in tone recognition due to certain configurations of musical parameters (mostly regarding timbre and/or register) must be considered as non-AP possessors.

The AP phenomenon is generally considered to involve instantaneous pitch recognition, and some studies adopt a brief response-time window (e.g., three or four seconds), assuming it is sufficient to confirm an immediate response in a given task [21, 22]. It is also assumed that certain procedures in AP tests can limit (or completely eliminate) the use of Relative Pitch (RP), such as granting a brief response time, separating tones by an interval larger than an octave, or placing brown noise between stimuli [16, 23, 24].

There are two different theoretical perspectives regarding the RP definition, namely, the broad and the narrow perspectives [4, 5]. In the broad perspective, RP ability is assigned to anyone (musician or non-musician) who is capable of performing basic music perceptual tasks, such as recognizing familiar music when it is transposed or played on different instruments or singing a familiar song in tune [25]. These are predominantly intuitive unconscious abilities, and most people accomplish them instinctively. In the narrow perspective, RP is assigned to individuals who can name intervals and other musical elements (including triads, tonalities, harmonic progressions, and scales, among others). Musicians must be able to recognize familiar music, like non-musicians, and also aurally recognize and name basic musical elements used in compositions (e.g., whether the heard musical interval was a minor or major second) [26]. Hence, RP in the narrow perspective is acquired through years of intense training.

Thus, the use of classificatory tests to separate AP possessors from RP possessors should be approached with caution. Since the study of music perception in undergraduate music schools and conservatoires includes sight-singing and ear training, all students develop RP ability to some degree. Since most participants in AP studies are musicians, they have all received some training in music perception. Thus, it is reasonable to expect that all test participants possess some degree of RP ability, even AP possessors. Consequently, it is impossible to ascertain whether RP ability can be completely eliminated by the use of a short response time or any other methodology. In fact, the possibility that an individual may possess both abilities, i.e., that the AP and RP phenomena are not mutually exclusive, must be considered [5, 27, 28].

As posited by Levitin and Rogers [29], “AP is neither ‘absolute’ nor ‘perfect’ in the ordinary uses of those words”. AP possessors not only exhibit limitations for timbre and registers, as mentioned in previous paragraphs, but they also frequently make octave and semitone errors. This occurs so commonly that a substantial portion of classificatory tests for AP consider semitone errors as correct (or partially correct) answers [16, 21, 22, 30–34]. This leads to a core methodological issue found in AP literature, that is, a lack of agreement for criteria regarding cut-offs in AP classificatory tests, which are arbitrarily defined. Moreover, this affects what is considered a correct or an incorrect answer to the stimuli. An example can be observed in Dohn et al. [30], who used a pitch identification test described and provided by Athos et al. [22], which was originally developed by Baharloo et al. [21]. Although all three studies adopted the same test, they did not apply exactly the same methodology, nor the same scoring criteria. This lack of common criteria or a gold-standard tool to measure the same phenomenon leads to difficulties in comparing results, even when researchers intend to utilize the same test.

We aimed to develop a test for isolated pitch recognition from a psychometric perspective. That is, we considered this ability to be a latent phenomenon, evaluating a) the best model solution underlying isolated pitch recognition, and b) how different rating approaches commonly used in the literature might influence the choice of the best model for isolated pitch recognition tasks. The use of a latent approach elucidates item-level functioning, providing evidence for construct validity.

Materials and methods

Participants and study design

A total of 783 undergraduate music students (male: n = 512, 65.4%) were recruited for this study. Participants ranged from the first to the tenth semester of study at seven different Brazilian universities, five located in the city of São Paulo and two in Curitiba. This study was approved by the relevant ethics board (Ethics Committee’s Approval CAAE: 60855816.3.0000.5477) and written consent was provided by all participants before the test/evaluation. The study was conducted during the first semester of the 2017 academic year. The participants’ mean age was 24.7 years (range = 17 to 72) and they had an average of 10.29 years of music practice (SD = 6.7; range = 1 to 65). All participants were exposed to ear training and sight-singing classes during their music studies.

The perception task consisted of five different batteries: isolated pitches, melodic intervals, harmonic intervals, fundamental position triads, and first position triads. Each battery included 10 stimuli. In this study, only the first battery (isolated pitches) will be discussed, as this is the common procedure used to track AP. It must be emphasized that we considered isolated pitch recognition without reference as a latent trait, without automatically assuming that this ability and AP ability are the same. Therefore, our main priority was that the items measure the ability to identify isolated pitches without a reference.

The first author of this study collected all data, giving exactly the same instructions to all subjects and guaranteeing adequate standardization of method. The protocol was applied collectively, with prior authorization obtained from professors and from the legal representative of each institution. Each stimulus was played once for 3 seconds, with a 15-second pause in between. No reference pitch was provided. Pitches and registers were highly variable among items. Timbres were chosen to represent each family of musical instruments. The stimuli were recorded in a studio by a professional and played on CD during the tests.

We attempted to limit the use of RP in our test by not providing any reference pitch, playing each stimulus only once, and changing the timbre and register between each stimulus. However, due to the issues discussed in the introduction section, we considered that it is methodologically impossible to completely prevent the use of RP in any isolated pitch recognition task. Each participant has a unique way of identifying pitches which can employ a combination of AP and RP, and common isolated pitch recognition tasks are incapable of evaluating the underlying mechanism being used. Consequently, we did not evaluate reaction time, providing a 15-second window between each stimulus in our test. With a longer response time, subjects had sufficient time to look at the response sheet, choose their answer, confirm where the right response was located on the drawn piano-keyboard, and mark their answer. This decreased the chance of errors unrelated to pitch discrimination. Notably, all participants in this research were required to pass an aural skill test to be admitted to music programs in Brazilian Universities. This indicates that all participants received some degree of training in music perception. Thus, it is reasonable to expect that all participants possessed some degree of RP ability, even AP possessors.

Participants were instructed to indicate the pitch they thought was correct on a piano-keyboard drawn on paper. It was expected that some participants would not be fluent in reading and/or writing traditional musical notations if they had not yet taken the appropriate courses. The drawn piano-keyboard allowed us to delimit specifically 12 possible answers (12 chromatic notes). The piano-keyboard was also chosen because it allowed for easier visualization and identification of each possible answer through key position. It also contained the verbally written note names.

Participants were informed that the first battery was composed of only isolated pitches. They were also informed about the duration of each stimulus and the time interval between them. This was necessary to avoid confusion and surprise among subjects. No information was provided regarding timbre. The drawn keyboard had only one octave, as the object of our study was pitch class recognition. Therefore, the octave parameter was not considered in this task, and was disregarded by all subjects.

The first battery contained 10 isolated pitches in 5 different timbres (piano, violin, flute, tuba, and voice). The voice was recorded from two professional vocalists in a studio. All other instruments were recorded with the software Kontakt, using professionally recorded samples. The piano was taken from Piano in 162, the violin from Spitfire Solo Strings, the flute from 8dio Claire Flute, and the tuba from Spitfire Symphonic Brass (Fig 1).

Fig 1. Theoretical Model for isolated pitch recognition trait.

Fig 1

Theoretical Model proposed for Latent Trait AIPWR: Ability to Identify Isolated Pitches Without Reference. Items are 10 isolated pitches in various timbres and registers without reference (a–j). The arrows point from the latent trait to the items. Figure adapted from Germano et al. [8].

In Fig 1, the circle represents the latent trait, which we referred to as the ability to identify isolated pitches without reference (AIPWR). Because we were unable to measure any latent trait directly, the 10 stimuli constituted a set of items that could be measured and tested directly. These items, represented by rectangles, are similar to symptoms of psychological disorders, which can be directly observed. The stimuli are composed of three tone dimensions: register, timbre, and pitch class. We chose these 10 items to correspond to a summary of the vast range of stimuli commonly used to measure AP ability. Thus, they were purposely very heterogeneous stimuli, encompassing all the different ranges of timbre, pitch, and register necessary to access Isolated Pitch Recognition Ability.

Data analysis

To evaluate the psychometric features of the isolated pitches battery, we used Mplus version 8.0 [35] and the R program [36]. All collected data were analyzed under three approaches: continuous, categorical, and hybrid (the factor mixture model). The first, the Item Response Theory (IRT) approach, assumes that there is a continuous latent measure (or “trait”) underlying the 10 items. That is, each participant would have some ability to identify isolated pitches without reference, similar to other continuous cognitive measurements such as intelligence quotient, psychopathology, and language skills. Two different IRT models were used:

  1. An IRT model with two parameters for each stimulus: the discrimination parameter (parameter a), which describes the ability of the stimuli to distinguish between persons with low and high pitch identification ability; and the item location parameter (parameter b), representing the level of pitch identification ability where there is a 50% chance of correctly identifying the pitch of the stimulus;

  2. An IRT model with three parameters (discrimination, difficulty, and guessing [also known as the lower-bound asymptote]), where the additional parameter, guessing, is the probability that a person with very low pitch identification ability still provides a correct answer for a given stimulus. The guessing parameter was recently implemented in Mplus and uses a maximum likelihood prior that helps the model converge [37].
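As a reference for the two models above, the logistic forms of the 2PL and 3PL item response functions can be sketched as follows (a minimal illustration; the parameter values used in the tests are arbitrary, not the fitted estimates from Table 3):

```python
import math

def p_2pl(theta, a, b):
    """2PL: probability of a correct answer given ability theta,
    discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_3pl(theta, a, b, c):
    """3PL: adds a lower-asymptote (guessing) parameter c, the
    probability of success for examinees with very low ability."""
    return c + (1.0 - c) * p_2pl(theta, a, b)
```

At theta = b, the 2PL gives exactly a 50% chance of a correct answer, matching the definition of the location parameter above; with c = 1/12, the 3PL curve never falls below chance level for a 12-key response sheet.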

We used a piano keyboard to track the participants’ answers. Out of 12 keys, participants chose only one. Therefore, the prior guessing likelihood was 1/12 for the perfect rating criterion and 3/12 for the imperfect rating criterion. According to Baker [38], discrimination parameter cutoffs are: 0 (none); 0.1 to 0.34 (very low); 0.35 to 0.64 (low); 0.65 to 1.34 (moderate); 1.35 to 1.69 (high); 1.70 and above (very high); and positive infinity (perfect).

For the IRT analysis, we used the Maximum Likelihood estimator and logit parameterization (theta). The factor is assumed to be normally distributed, with its mean fixed at zero and its variance at 1. That is, the IRT analysis is centered on the person sample at 0 logits, and the item difficulty parameters are provided relative to this. For the difficulty parameter, values closer to 3 indicated more difficulty, and values closer to -3 indicated less difficulty; values around zero indicated the middle point between both extremes. To evaluate model fit for the IRT models, a Pearson chi-square test for categorical outcomes was used, with p-values higher than 0.05 indicating good fit. Item-level fit was evaluated via the S-X2 statistic implemented in the R package mirt, as per Orlando and Thissen [39].

For the categorical approach, we used Latent Class Analysis (LCA), which identifies unobserved subpopulations, with class membership inferred from the data. LCA has some similarities with the prima prattica of AP research, where subjects are classified into homogeneous groups. However, LCA does not demand a predefined cut-off or a gold-standard measure as reference. For example, in traditional AP research, participants are considered AP possessors if they achieve an arbitrarily defined score (e.g., AP possessors must score 6 points or higher on an isolated pitch test). While previous research provides theoretical justifications for the cut-off choices, there is no statistical justification for choosing one cut-off threshold over another. By contrast, in LCA, class membership is inferred from the underlying patterns of responses across items. In this study, different numbers of classes were considered and evaluated.
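To make the LCA logic concrete, the posterior class-membership probability for a given response pattern can be sketched as below (a minimal illustration under the conditional-independence assumption; the class weights and item probabilities are hypothetical values, not the fitted Mplus estimates):

```python
def lca_posterior(responses, class_weights, item_probs):
    """Posterior class-membership probabilities for a K-class LCA
    with binary items, assuming conditional independence of items
    within each class.
    responses: list of 0/1 answers across the items
    class_weights: prior class proportions (pi_k, summing to 1)
    item_probs: item_probs[k][j] = P(correct on item j | class k)"""
    joints = []
    for pi_k, probs in zip(class_weights, item_probs):
        lik = pi_k  # joint probability: prior times item likelihoods
        for x, p in zip(responses, probs):
            lik *= p if x == 1 else (1.0 - p)
        joints.append(lik)
    total = sum(joints)
    return [j / total for j in joints]
```

For example, a participant answering all 10 items correctly would be assigned to the higher-ability class with near-certain posterior probability, without any predefined cut-off score.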

The factor (IRT) mixture model was estimated based on Muthén [40], given that it is a generalization of the latent class model, where the assumption of conditional independence between the latent class indicators within a class is relaxed using a factor that influences the items within each class [40–42]. The factor represents individual variations in response probabilities within a class. Therefore, this model allows for heterogeneity within each class. As described in Mplus User’s Guide (Example 7.27) [37], this model can be considered as an Item Response Theory (IRT) mixture model.

All three latent models were run twice, considering the two different rating criteria commonly used to define correct and incorrect answers. This choice was based on the AP literature, as semitone errors can be considered incorrect [24, 43–45] or correct [31–33, 46] depending on how restrictively AP is defined. In our test, we adopted the two following criteria:

  1. Perfect: only exact pitch responses were considered correct; all other responses were incorrect (e.g., aural stimulus = C, correct response = C);

  2. Imperfect: exact pitch responses and semitone deviations were considered correct; all other responses were incorrect (e.g., aural stimulus = C, correct response = C, B, or C#/Db).
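The two criteria amount to a simple scoring rule over the 12 pitch classes. The sketch below (a hypothetical helper, not part of the original protocol) uses circular semitone distance, so that, for example, B counts as one semitone below C:

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def score(stimulus, response, criterion):
    """Return 1 (correct) or 0 (incorrect) for a keyboard response.
    'perfect': exact pitch-class match only.
    'imperfect': exact match or a one-semitone deviation (circular)."""
    s = PITCH_CLASSES.index(stimulus)
    r = PITCH_CLASSES.index(response)
    dist = min((s - r) % 12, (r - s) % 12)  # circular semitone distance
    if criterion == "perfect":
        return 1 if dist == 0 else 0
    return 1 if dist <= 1 else 0
```

Thus score("C", "B", "imperfect") returns 1, while score("C", "B", "perfect") returns 0.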

The collected data formed a portrait of the latent trait distribution among the participants and were used to evaluate and validate the proposed test, i.e., how well it measures the latent trait. The model fit indices used to evaluate and compare the IRT and LCA models were the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Sample Size Adjusted Bayesian Information Criterion (SSABIC). The lower the AIC, BIC, and SSABIC, the better the model relative to those being compared. In our case, we compared continuous versus categorical models under the same approach.
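The three indices are simple functions of the log-likelihood, the number of free parameters k, and the sample size n. The sketch below uses the standard formulas (with the SSABIC sample-size adjustment n* = (n + 2)/24, as used by Mplus) and reproduces, to within rounding, the perfect two-parameter row of Table 5:

```python
import math

def info_criteria(loglik, k, n):
    """AIC, BIC, and sample-size-adjusted BIC (SSABIC) from a model's
    log-likelihood, number of free parameters k, and sample size n."""
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n)
    ssabic = -2.0 * loglik + k * math.log((n + 2.0) / 24.0)
    return aic, bic, ssabic

# Perfect approach, 2-parameter IRT (Table 5): loglik = -3474.444,
# 20 free parameters, n = 783 participants
aic, bic, ssabic = info_criteria(-3474.444, 20, 783)
```

Plugging in the Table 5 values yields AIC ≈ 6988.89, BIC ≈ 7082.15, and SSABIC ≈ 7018.64, matching the reported row.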

Due to the non-independence of sampling (i.e., students nested in universities), the IRT and LCA models were run using robust maximum likelihood, which produces standard errors and a chi-square test of model fit that account for the multilevel structure of the data [47, 48]. Lastly, a comparison between IRT and LCA models was conducted using BIC and AIC. Notably, given that there were three approaches to statistical modeling (i.e., IRT, LCA, and hybrid modeling), comparisons were always made within the same criterion.

Results

Ordinary descriptive statistics, with proportions and counts under both rating criteria (perfect and imperfect), are shown in Table 1. The sums of correct answers for both rating criteria are shown in Table 2. It can be observed that the perfect criterion reduces the probability of a correct answer.

Table 1. Frequency distribution.

Perfect (%) Imperfect (%)
Item a 6.9 40.5
Item b 39.2 46.4
Item c 11.4 31.2
Item d 28.2 34.0
Item e 43.4 53.1
Item f 42.1 47.6
Item g 27.6 37.0
Item h 29.1 49.4
Item i 7.7 34.4
Item j 9.6 47.0

This table provides the percentage of correct answers for perfect and imperfect approaches for each item.

Table 2. Frequency distribution of the simple sum of correct answers.

Sum of correct answers Perfect (%) Imperfect (%)
0 16.9 2.9
1 26.2 9.6
2 19.3 18.9
3 13.5 15.5
4 8.3 15.8
5 3.4 10.9
6 5.0 6.3
7 3.1 4.7
8 2.2 5.4
9 1.0 5.2
10 1.1 4.9

This table provides the distribution of participants’ total scores (simple sum of correct answers) under the perfect and imperfect approaches.

The results for IRT with two parameters (discrimination and difficulty) and three parameters (discrimination, difficulty, and guessing) for both criteria ratings (perfect and imperfect) are provided in Table 3.

Table 3. Item response theory results: Two and three parameters for perfect and imperfect approaches.

IRT– 2 Parameters IRT– 3 Parameters
Perfect Discrimination SE Difficulty SE Discrimination SE Difficulty SE Guessing SE
Item a 1.927 0.197 2.05 0.267 2.942 0.555 1.886 0.237 0.018 0.008
Item b 1.209 0.202 0.448 0.243 5.289 1.688 0.976 0.213 0.259 0.030
Item c 1.391 0.295 1.937 0.434 3.484 0.482 1.664 0.240 0.048 0.008
Item d 1.25 0.140 0.949 0.227 2.416 0.877 1.154 0.174 0.129 0.036
Item e 1.129 0.164 0.277 0.190 1.490 0.371 0.547 0.348 0.117 0.102
Item f 1.85 0.138 0.24 0.128 2.521 0.340 0.350 0.119 0.058 0.026
Item g 1.929 0.391 0.767 0.205 2.999 0.640 0.889 0.179 0.070 0.028
Item h 1.83 0.233 0.724 0.120 2.615 1.012 0.828 0.167 0.061 0.039
Item i 1.303 0.239 2.419 0.261 3.021 0.489 1.974 0.096 0.034 0.010
Item j 1.356 0.149 2.141 0.249 3.509 0.830 1.812 0.143 0.046 0.009
Imperfect Item a 1.272 0.201 0.37 0.166 1.913 0.296 0.587 0.139 0.115 0.031
Item b 1.189 0.219 0.131 0.212 3.669 1.308 0.836 0.258 0.309 0.062
Item c 1.385 0.221 0.749 0.240 3.219 0.980 1.002 0.230 0.154 0.032
Item d 1.293 0.135 0.652 0.203 2.421 0.719 0.897 0.167 0.142 0.039
Item e 1.101 0.209 -0.164 0.199 2.469 0.866 0.513 0.413 0.295 0.118
Item f 1.697 0.244 0.045 0.145 3.612 2.368 0.458 0.235 0.210 0.118
Item g 1.458 0.212 0.474 0.179 8.414 5.810 0.865 0.168 0.218 0.028
Item h 1.209 0.193 -0.001 0.186 8.401 5.824 0.832 0.144 0.363 0.038
Item i 0.781 0.160 0.926 0.262 1.990 0.439 1.261 0.232 0.210 0.023
Item j 0.614 0.177 0.203 0.107 1.137 0.332 0.911 0.377 0.240 0.089

This table provides the Item Response Theory results for each item for two and three parameters, in both the perfect and imperfect approaches. Item Response for two parameters shows discrimination and difficulty results. Item response for three parameters shows discrimination, difficulty, and guessing results. SE = Standard Error.

The perfect approach under IRT with two parameters revealed item discrimination as moderate, high, and very high. The most discriminating item was item g (G6 on violin; 1.929) and the most difficult item was item i (G#1 on tuba; 2.419). The imperfect approach with two parameters showed item discrimination as low, moderate, and high. Item f displayed the highest item discrimination (C5 on piano; 1.697) and item i (G#1 on tuba; 0.926), as in the perfect approach, showed the highest item difficulty.

For IRT with three parameters, results for the perfect approach showed item discriminations with high and very high values. The most discriminative item was item b (A5 on violin; 5.289) and the most difficult item, as in the IRT with two parameters, was item i (G#1 on tuba; 1.974). The guessing parameter demonstrated that item b had a high probability of being answered correctly (25.9%), even among those with very low ability to identify isolated pitches under the perfect approach. The imperfect approach with three parameters indicated item discrimination as moderate, high, and very high. Item g (G6 on violin; 8.414) demonstrated the highest discrimination parameter and item i (G#1 on tuba; 1.261) showed the highest item difficulty parameter. Under the imperfect approach, the probabilities of guessing increased across all items. Item h indicated the highest guessing probability (36.3%), followed by item b (30.9%). Importantly, the standard errors (SE) were larger under IRT with three parameters than with two, regardless of the adopted rating criterion, as commonly described in the literature [49, 50].

Table 4 shows the item-level fit. Under IRT with two parameters, the imperfect rating indicated that all items had a good fit (S-X2 p > 0.05). However, under the perfect approach, two of the ten items (items c and d) showed statistically significant misfit. Under IRT with three parameters, the majority of items displayed a reduction in p-values compared to two parameters, for both approaches.

Table 4. Item level fit for perfect and imperfect ratings.

2 Parameters 3 Parameters
Perfect Continuous S-X2 df(S-X2) RMSEA p-value S-X2 df(S-X2) RMSEA p-value
Item a 7.448 7 0.009 0.384 4.651 6 <0.001 0.589
Item b 11.039 6 0.033 0.087 4.946 3 0.029 0.176
Item c 18.967 7 0.047 0.008 13.763 5 0.047 0.017
Item d 14.791 6 0.043 0.022 15.843 5 0.053 0.007
Item e 7.582 6 0.018 0.27 4.322 4 0.010 0.364
Item f 5.037 5 0.003 0.411 7.979 4 0.036 0.092
Item g 1.466 5 <0.001 0.917 1.58 4 <0.001 0.812
Item h 1.225 5 <0.001 0.942 1.444 4 <0.001 0.837
Item i 7.017 7 0.002 0.427 4.891 6 <0.001 0.558
Item j 11.588 7 0.029 0.115 11.044 6 0.033 0.087
Imperfect Item a 11.949 7 0.03 0.102 13.602 6 0.040 0.034
Item b 5.057 7 <0.001 0.653 4.82 6 <0.001 0.567
Item c 9.43 7 0.021 0.223 7.738 6 0.019 0.258
Item d 5.828 7 <0.001 0.560 6.707 6 0.012 0.349
Item e 8.401 7 0.016 0.299 7.975 6 0.021 0.24
Item f 5.24 7 <0.001 0.631 8.738 5 0.031 0.12
Item g 7.96 7 0.013 0.336 6.334 5 0.018 0.275
Item h 6.738 7 <0.001 0.457 15.606 5 0.052 0.008
Item i 3.038 7 <0.001 0.881 3.978 6 <0.001 0.68
Item j 6.991 7 <0.001 0.430 5.039 6 <0.001 0.539

This table provides the item-level fit values for each item for two and three parameters, in both perfect and imperfect approaches. S-X2 is an item fit index for dichotomous item response theory models. df(S-X2) is the degree of freedom for item fit index for dichotomous item response theory models. RMSEA = (Root Mean Square Error of Approximation).

A reason for the better item-level fit of the imperfect approach under two parameters may be the increase in the probability of answering the items correctly (i.e., proportions and counts were higher, since it was a less strict criterion). For items c and d, scored with the perfect rating criterion, misfit is illustrated by comparing the predicted and observed proportions of correct results (Figs 2 and 3). In particular, higher-than-expected proportions of correct answers are seen for theta scores slightly above 1 and slightly below -1.

Fig 2. Empirical plots (item c) for perfect model with 2 parameters.

Fig 2

Confidence intervals for the probability of correctly endorsing item c, given the amount of AIPWR, are represented by dashed red lines. The estimated item characteristic curve for item c is indicated by a continuous blue line.

Fig 3. Empirical plots (item d) for perfect model with 2 parameters.

Fig 3

Confidence intervals for the probability of correctly endorsing item d, given the amount of AIPWR, are represented by dashed red lines. The estimated item characteristic curve for item d is indicated by a continuous blue line.

Table 5 depicts the model fit for the IRT models (two and three item parameters) for the perfect and imperfect ratings. Considering the perfect approach, the lowest BIC favored the IRT model with two parameters. However, for the imperfect approach, the lowest BIC favored the IRT model with three parameters. Notably, for perfect scoring, evaluations of item fit showed significant misfit for items c and d, suggesting that these two items are problematic as indicators of the latent trait. Moreover, for the imperfect approach, the standard errors of the discrimination parameters were high. Therefore, for both the perfect and imperfect models, we concluded that the two-parameter model fits better than the three-parameter model.

Table 5. Model fit information for IRT models—perfect and imperfect, two and three parameters.

Number of Classes Free Parameters Loglikelihood Correction Factor for MLR Loglikelihood (HO value) Akaike (AIC) Bayesian (BIC) SSA (BIC)
Perfect Continuous 2 Par. ---- 20 1.6634 -3474.444 6988.887 7082.150 7018.640
3 Par. ---- 30 1.3837 -3449.099 6958.198 7098.092 7002.827
Imperfect 2 Par. ---- 20 2.1810 -4816.900 9673.801 9767.064 9703.554
3 Par. ---- 30 1.6942 -4766.699 9593.398 9733.291 9638.026

This table provides the model fit information for two and three parameters, in both the perfect and imperfect approaches. MLR (Maximum Likelihood Robust). AIC (Akaike Information Criterion). BIC (Bayesian Information Criterion). SSA (BIC) (Sample Size Adjusted Bayesian Information Criterion).

LCA results indicate that the two-class solution is best for both the perfect and imperfect approaches, as illustrated in Table 6.

Table 6. Latent class analysis results for perfect and imperfect approaches.

Number of Classes Free Parameters Loglikelihood Correction Factor for MLR Loglikelihood (HO value) Akaike (AIC) Bayesian (BIC) SSA (BIC)
VLMR LRT (p-value) LMR LR adjusted test Entropy
Perfect Categorical 1 10 4.2098 -3925.508 7871.016 7917.648 7885.893 ----- ----- -----
2 21 1.8203 -3482.636 7007.273 7105.199 7038.513 0.0212 0.0219 0.914
3 32 1.5176 -3444.222 6952.444 7101.664 7000.048 0.5114 0.5130 0.656
4 43 1.4185 -3423.966 6933.931 7134.446 6997.899 0.5677 0.5684 0.678
5 54 1.2412 -3407.555 6923.11 7174.919 7003.442 0.4955 0.4960 0.675
6 65 1.1888 -3392.677 6915.335 7218.438 7012.031 0.4577 0.4581 0.718
7 76 1.2128 -3384.837 6921.745 7276.143 7034.805 0.6390 0.6393 0.732
Imperfect 1 10 4.6859 -5243.614 10507.229 10553.86 10522.105 ----- ----- -----
2 21 2.0328 -4784.795 9611.589 9709.515 9642.829 0.0160 0.0165 0.893
3 32 1.6635 -4755.888 9575.775 9724.995 9623.379 0.5189 0.5200 0.688
4 43 1.6857 -4736.214 9558.428 9758.943 9622.396 0.7612 0.7614 0.768
5 54 1.4823 -4716.346 9540.692 9792.501 9621.024 0.4518 0.4523 0.797
6 65 1.4254 -4708.111 9546.223 9849.326 9642.919 0.6568 0.6576 0.651
7 76 1.3179 -4697.236 9546.471 9900.869 9659.531 0.4955 0.4951 0.709
8 87 1.2404 -4690.011 9554.022 9959.714 9683.445 0.5387 0.5393 0.742
9 98 1.2126 -4685.470 9566.94 10023.927 9712.728 0.5037 0.5038 0.685

This table provides the Latent Class Analysis results for both the perfect (up to 7 classes) and imperfect (up to 9 classes) approaches. MLR (Maximum Likelihood Robust). AIC (Akaike Information Criterion). BIC (Bayesian Information Criterion). SSA (BIC) (Sample Size Adjusted Bayesian Information Criterion). VLMR LRT (Vuong-Lo-Mendell-Rubin Likelihood Ratio Test). LMR (Lo-Mendell-Rubin).

The best class solution was two classes, given the strongest decline in AIC and BIC values. There was still a reduction in BIC and AIC from the two- to the three-class solution, which was expected. However, this information gain is insufficient to justify an additional extracted class when compared to the information gain (i.e., the reduction of BIC and AIC) from one to two classes. Figs 4 and 5 show the LCA results for the perfect and imperfect approaches.
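The “strongest decline” argument can be checked directly by differencing the perfect-approach BIC column of Table 6:

```python
# BIC values from Table 6, perfect approach, 1 to 7 classes
bics = [7917.648, 7105.199, 7101.664, 7134.446,
        7174.919, 7218.438, 7276.143]
# drop in BIC at each added class (positive = improvement)
drops = [bics[i] - bics[i + 1] for i in range(len(bics) - 1)]
# The 1 -> 2 class step drops BIC by about 812 points; the 2 -> 3
# step by only about 3.5; and BIC rises again from 3 classes onward
```

The same pattern holds for the imperfect approach, supporting the two-class choice.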

Fig 4. Isolated pitch perfect approach—latent class analysis for two classes.

Fig 4

Class 1 (red line—16.3%) represents the population with greater ability to identify pitches in various registers and timbres, without reference. Class 2 (blue line—83.7%) represents the population with less ability to identify pitches in various registers and timbres, without reference. The y-axis represents the probability of a correct answer and the x-axis represents each item tested. Figure adapted from Germano et al. [8].

Fig 5. Isolated pitch imperfect approach—latent class analysis for two classes.

Fig 5

Class 1 (red line—20.9%) represents the population with greater ability to identify pitches with semitone deviations in various registers and timbres without reference. Class 2 (blue line—79.1%) represents the population with less ability to identify pitches with semitone deviations in various registers and timbres without reference. The y-axis represents the probability of a correct answer and the x-axis represents each item tested.

The figures illustrate that one group had a higher probability of correctly identifying the pitches (depicted by the red line, representing 16.3% of the sample for the perfect approach and 20.9% for the imperfect approach). Conversely, the other group had a lower probability of correctly identifying the stimuli (blue line; 83.7% for the perfect approach and 79.1% for the imperfect approach). Notably, even the red group did not achieve a value of 1 for any of the items, which would indicate a 100% probability of answering a given stimulus correctly. Moreover, the group with the higher probabilities of correctly identifying the pitches was less prevalent than the group with the lower probabilities.

Based on the results from the continuous and categorical options, the hybrid model was fitted by merging the features of the best solutions obtained from both modeling approaches: a two-class solution and a unidimensional factor.

The hybrid model fit information is given in Table 7.

Table 7. Hybrid model for perfect and imperfect ratings.

Approach    Classes  Free Parameters  MLR Correction Factor  Loglikelihood (H0)  AIC       BIC       SSA-BIC   Entropy
Perfect     2        43               1.2784                 -3418.603           6923.206  7123.721  6987.174  0.876
Imperfect   2        43               2.8625                 -4736.865           9559.731  9760.245  9623.699  0.759

This table provides the hybrid model for the perfect and imperfect ratings, with two latent classes and a unidimensional underlying latent factor. MLR (robust maximum likelihood estimator). AIC (Akaike's Information Criterion). BIC (Bayesian Information Criterion). SSA-BIC (Sample Size Adjusted Bayesian Information Criterion).

Based on model fit information, we conclude that the continuous solution was the best for the perfect approach, with a lower BIC than the categorical and hybrid solutions. This indicates that, under the perfect rating criterion, the ability to recognize isolated pitches in different timbres and registers without reference is better modeled as a continuous ability than as latent categories or a hybrid of the two. Alternatively, the categorical solution provided the best fit for the imperfect approach, with a lower BIC than the continuous and hybrid solutions. This indicates that, when flexibility is adopted in isolated pitch recognition without reference (semitone deviations counted as correct), the ability is better modeled as latent groups.

Fig 6 shows a histogram of the continuous distribution of the ability to recognize isolated pitches under the perfect approach, and Fig 7 shows the same for the imperfect approach. The perfect approach (Fig 6) displays a half-normal distribution, while the imperfect approach (Fig 7) displays a log-Cauchy-like distribution.

Fig 6. Isolated pitch perfect approach—histograms (sample values, estimated factor scores, estimated values, residuals).

Fig 6

Perfect approach ability. AIPWR: Ability to identify Isolated Pitch Without Reference. The y-axis represents the number of individuals. The x-axis represents the ability divided into 20 columns.

Fig 7. Isolated pitch imperfect approach—histograms (sample values, estimated factor scores, estimated values, residuals).

Fig 7

Imperfect approach ability. AIPWR: Ability to identify Isolated Pitch Without Reference. The y-axis represents the number of individuals. The x-axis represents the ability divided into 20 columns.

Discussion

Our results demonstrate a good fit for measuring the ability to recognize isolated pitches without reference with a continuous solution under the perfect rating criterion. When the imperfect approach is used as the rating criterion, a categorical solution is preferred.

Moreover, in the two-parameter IRT model for the perfect scoring approach, all items showed high discrimination values. This indicates that our set of stimuli was appropriate for discriminating between subjects with high and low abilities to recognize isolated pitches. The items' difficulty values in both the perfect and imperfect approaches were high (with the exception of items e and h in the two-parameter imperfect model). This was expected, as identifying pitches without any reference is considered an exceedingly challenging task for most musicians. Notably, we could not formally compare the imperfect and perfect approaches regarding superiority, because they are not nested models [51].
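The roles of the discrimination (a), difficulty (b), and guessing (c) parameters can be illustrated with the item response function itself. The parameter values below are hypothetical, chosen only to mimic a hard, highly discriminating stimulus; they are not our estimated item parameters:

```python
import math

def p_correct(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """3PL IRT model: P = c + (1 - c) / (1 + exp(-a * (theta - b))).
    With c = 0 this reduces to the 2PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical parameters for a hard, highly discriminating stimulus:
a, b = 3.0, 2.0  # high discrimination, high difficulty
print(p_correct(0.0, a, b))  # average-ability examinee: near-zero chance
print(p_correct(2.0, a, b))  # examinee at the item's difficulty: exactly 0.5
print(p_correct(3.0, a, b))  # high-ability examinee: high chance

# A guessing floor (3PL) lifts the lower asymptote; with 12 pitch-class
# labels, pure guessing suggests c near 1/12:
print(p_correct(0.0, a, b, c=1/12))  # ~0.086, close to the guessing floor
```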

When comparing LCA to IRT, our results indicated that, for the perfect approach, the ability to recognize isolated pitches was better represented by a continuous model: a continuum along which participants are arranged according to their degree of ability (Fig 6). This is a highly unexpected result, because the consensually adopted methodology in AP research is to divide subjects into two categories. Here, we labeled the two groups high-skilled and low-skilled.

Alternatively, the common division adopted by most AP research (dividing the population into two groups) is the best solution only when the imperfect approach is used as the rating criterion. Crucially, our results demonstrate how the two rating approaches commonly used in the AP literature (perfect and imperfect) can influence the choice of the best model underlying isolated pitch recognition ability.

In theory, both the perfect and imperfect approaches were expected to be better represented by a categorical model, because this is the status quo in the field of AP. However, for the perfect approach, the continuous model was the best solution. This was greatly unexpected, as the perfect approach uses more restrictive criteria than the imperfect approach does, and, according to the literature, AP possessors make many semitone errors. We thus hypothesize that these restrictions allowed us to capture finer-grained variations in isolated pitch identification skills across the participant sample. Using the perfect scoring approach, 1.1% of participants answered all items correctly. According to the IRT model, these participants would be expected to have greater skill in isolated pitch recognition tasks than participants with fewer correct responses. In contrast, for the imperfect scoring approach, the LCA model assumes that 20.9% of participants have high skills in isolated pitch recognition tasks. Within this group, further differentiation in skill cannot be made: the 4.9% who had all 10 responses correct under the imperfect scoring approach were simply luckier than the remaining 16% in the high-skill group. More research is necessary to examine the causes of the differences in the underlying models.
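This "luckier" interpretation follows from the LCA assumption of local independence: within a class, the chance of a perfect score is simply the product of the class-conditional item probabilities. A sketch with hypothetical item probabilities (not the estimated class profile from the figures):

```python
# Under LCA's local-independence assumption, the probability that a member of
# the high-skill class answers all 10 items correctly is the product of the
# class-conditional item probabilities. The per-item value of 0.80 below is a
# hypothetical placeholder, not an estimate from our data.
item_probs = [0.80] * 10  # assumed probability of a correct answer per item

p_all_correct = 1.0
for p in item_probs:
    p_all_correct *= p
print(round(p_all_correct, 3))  # 0.107: only ~11% of this class scores 10/10

# Pure guessing among 12 pitch-class labels, for contrast:
p_guess_all = (1 / 12) ** 10
print(p_guess_all)  # ~1.6e-11: a perfect score is essentially never luck alone
```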

It is especially important to understand that a high ability to identify pitches without reference (as a latent group) is not necessarily synonymous with being an AP possessor, nor is low skill necessarily synonymous with not being one. We cannot deduce that high performance in isolated pitch recognition is due to the presence of AP ability, since well-trained musicians who are not AP possessors can also perform well. This kind of test is not capable of assessing whether a participant is automatically associating a pitch with a verbal label.

In many areas, it is common procedure to choose a cut-off threshold to categorize subjects into groups, even when the original measure is continuous. Results from LCA may be exported from Mplus (or other statistical packages for mixture modeling), generating a most likely class membership; each subject then has a conditional probability for each group, that is, the probability of being classified as more likely to answer correctly and the probability of being classified as less likely to answer correctly.
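These conditional probabilities follow from Bayes' rule under local independence. A minimal sketch: the class prevalences match our two-class imperfect-approach solution, but the item-probability profiles are illustrative assumptions, not the estimated Mplus output:

```python
# Posterior probability of class membership for an observed response pattern,
# computed with Bayes' rule under local independence. Prevalences are from our
# two-class imperfect solution; item profiles are assumed for illustration.
prevalence = {"high": 0.209, "low": 0.791}
item_probs = {"high": [0.85] * 10, "low": [0.15] * 10}

def pattern_likelihood(cls: str, responses: list[int]) -> float:
    """P(response pattern | class) = product over items of p or (1 - p)."""
    lik = 1.0
    for p, r in zip(item_probs[cls], responses):
        lik *= p if r == 1 else 1.0 - p
    return lik

def posterior(responses: list[int]) -> dict[str, float]:
    """P(class | pattern), proportional to prevalence * likelihood."""
    joint = {c: prevalence[c] * pattern_likelihood(c, responses)
             for c in prevalence}
    total = sum(joint.values())
    return {c: joint[c] / total for c in joint}

# A participant with 8 of 10 correct is almost certainly in the high class:
print(posterior([1] * 8 + [0] * 2))
```

A hard cut-off then amounts to assigning each subject to the class with the larger posterior, discarding the uncertainty these probabilities carry.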

Interestingly, we observed that none of the individual stimuli were answered correctly 100% of the time, even by the group classified as having a high probability of choosing the correct answer (less than 20% of the 783 participants). These error rates among those classified as having higher probabilities of performing well in isolated pitch discrimination tasks corroborate previous research indicating that participants are fallible and can make a considerable number of mistakes. In terms of limitations, future studies may investigate more detailed psychometric elements, such as local dependency for each of the models (IRT and LCA), and invariance testing by sex, years of study, and instrument played.

Conclusion

The latent approach elucidates the psychometric features of the measurement of isolated pitch recognition ability in a large-scale evaluation, which can be adopted by future researchers. According to model fit information indices, the test measures the proposed latent trait of AIPWR ability very well, given that the stimuli varied in difficulty and discrimination. The perfect approach was better fitted by a continuum, and the imperfect approach by dividing the population into two groups. It is important to note that the ten stimuli did not evaluate whether a participant made an automatic association between a certain pitch and a learned verbal label. Consequently, we could not conclude that a high score on our test indicates that a participant possesses AP, or that a low score indicates that they do not. The only plausible conclusion is that a higher score indicates a higher level of the latent trait, and a lower score a lower level. These findings may contribute to a better theoretical understanding of AP ability, showing that different rating criteria in AP tests greatly influence test results and the measurement of AP ability.

Supporting information

S1 Data

(XLSX)

S1 File

(DOCX)

Acknowledgments

We thank all students who volunteered in this research, as well as the professors and universities for allowing us to conduct this test.

Data Availability

All relevant data are within the paper and its Supporting Information files.

Funding Statement

Germano, N: FAPESP 2016/08377-4 (Fundação de Amparo à Pesquisa do Estado de São Paulo) provided funding for the research and publication of this article. Cogo-Moreira, H: CAPES (Thesis award) Grant no. 0374/2016, process no. 23038.009191/2013-76) and CAPES-Alexander von Humboldt senior research fellowship (Grant 88881.145593/2017-01). Bortz, G: FAPESP 2019/02133-4 (Fundação de Amparo à Pesquisa do Estado de São Paulo) provided funding for the publication of this article.

References

Decision Letter 0

Karl Bang Christensen

1 May 2020

PONE-D-20-04410

Absolute Pitch as a Latent Trait

PLOS ONE

Dear Dra. Germano,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

I found the manuscript interesting in that it tries to evaluate if absolute pitch is a trait or a categorical variable. The choice of methodology is appropriate, but the reporting is lacking in quality. The comments from the two reviewers are very constructive and should enable you to improve the manuscript. 

We would appreciate receiving your revised manuscript by Jun 15 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Karl Bang Christensen, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments:

Please provide a revised version addressing the comments from the two reviewers, who are very constructive and have provided comments that should enable you to improve the manuscript. The choice of methodology is appropriate, but the reporting is lacking in quality. You must address tests of validity much more rigorously in a revised version. Evaluate fit of a one-dimensional CFA model (reporting chi-square, df and P-value). If you also want to report the RMSEA with corresponding confidence interval or other indices of close fit, that is OK.

The figures are attached on their own with no legends or titles. Some of them are not even mentioned in the main text of the paper. One example line 145 states 'fig. 1', but the next line appears to discuss fig. 2?

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at:

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please modify the title to ensure that it is meeting PLOS’ guidelines (https://journals.plos.org/plosone/s/submission-guidelines#loc-title).

In particular, the title should be "specific, descriptive, concise, and comprehensible to readers outside the field" and in this case it is not informative and specific about your study's scope and methodology.  When modifying the title please be sure to amend both the title on the online submission form (via Edit Submission) and the title in the manuscript so that they are identical.

3. Your ethics statement must appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section. Please also ensure that your ethics statement is included in your manuscript, as the ethics section of your online submission will not be published alongside your manuscript.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: PONE-D-20-04410

This is an interesting application of modern psychometric methods. The comparison of IRT models and latent class models appear well suited for the theoretical problem posed. I have some suggestions for the data analysis and presentation of results.

1. You should present some basic descriptive information, e.g. the frequency distribution for each item (e.g. Perfect, Imperfect, Wrong), and the frequency distribution of a simple sum of the items (or two sums according to your two approaches). This allows the reader to get some sense of your data.

2. You estimate a 2-parameter and a 3-parameter IRT model. However, it is unclear which of these models you deem is having the best fit. Also, it was unclear to me, which of these models you compare with the latent class model and for which IRT model you present the score distribution in figure 5. Please clarify.

3. Fit is evaluated through a global chi-square test. However, chi-square tests with app. 1000 DF are not optimal. You report χ2(991) = 6635.882, p-value =0.999 (line 239). This must be a typo, if the chisq value is correct, p=0. I suggest you do two things.

3a. For global tests of fit and comparisons of the 2-P and 3-P models, use AIC and BIC as in your comparison with latent class models.

3b. Evaluate item level fit, e.g. by the item fit tests suggested by Orlando and Thissen (Applied Psychological Measurement 2000). Such tests are available in the IRTPRO software and the free R package mirt (https://cran.r-project.org/web/packages/mirt/mirt.pdf). Such item based fit test may identify some stimuli that are not well modeled by IRT.

4. Your IRT parameter estimates have very large standard errors, in particular for the discrimination parameter in the 3-P model. Difficulties of estimating the discrimination parameter in 3-P models is a known problem, but I am still concerned. You may want to use a prior for the discrimination parameter in addition to a prior for the guessing parameter.

5. I am also a bit worried about the magnitude of the guessing parameter for some items. For pure guessing, you would expect a guessing parameter around 1/12 = 0.08. Is there any theoretical reason that some items would have a lower asymptote of 0.3?

6. In choosing between latent class model, you argue that the improved fit of the models with more than 2 classes should be ignored, due to the complexity of these models. However, in comparisons between IRT and latent class models for the perfect approach, you suddenly argue that a difference in fit of the same magnitude is important and should be interpreted. You cannot have it both ways. If you compare e.g. a 3 class model with the IRT model, you get the following results:

             AIC        BIC        SSA-BIC
IRT (D)      6988.887   7082.150   7018.640
3-class LCA  6952.444   7101.664   7000.048

(Perfect approach; D = dimensional IRT model.)

This comparison shows no particular superiority of the IRT model in terms of fit. The same reasoning could be applied for a 4-CL or a 5-CL model. You may want to keep a 2-CL model in the comparisons for conceptual reasons, but you should include a latent class model with better fit (e.g. 3 classes or 4 classes). Based on the results I have seen, I would conclude that for the “perfect” approach, a latent class model with 3, 4, or 5 classes has equally good fit as an IRT model. For the “imperfect” approach, the IRT model seems clearly better.

7. You present the score distribution for one IRT model, presumably for the “perfect” approach (the text in lines 272-273 seems to have the numbering of figures wrong). The score distribution is clearly skewed. A large group of people seem to have the same level of ability to identify the pitch of a tone (i.e. no ability). This may pose a problem for standard IRT model estimation, since a normal distribution is assumed for the latent trait. For this reason, a better model for your data may be a latent mixture distribution model with 2 latent classes. Class one consists of persons who are not able to identify the pitch regardless of the stimulus. Class two consists of persons with at least some ability to identify pitch. Within class two, responses might follow a 2-P IRT model. Such a model can be estimated in Mplus. It is fairly complex, but you may want to at least discuss it.

8. With regards to the psychometric lingo, I would suggest that :

8a. “Psychometric” is better than “Psychometrical”

8b. “Latent variable models” is better than “structural equation model”. SEM refers to a particular type of latent variable models, a type you do not use in your analysis.

8c. “Continuous” is better than “Dimensional”

Specific suggestions:

Line 24. I suggest “Through Latent Variable Models (LVM) we can evaluate consistency validity…”

Line 27. I suggest “… two LVM approaches: continuous latent variables (LV)…”

Line 37, I suggest “The phenomenon of absolute pitch (AP)…”

Line 65. It might be helpful to define relative pitch.

Line 115. Is there a reference regarding the use of a combination of AP and RP to identify pitches?

Line 159. I suggest “…continuous and categorical.”

Line 160. I suggest writing “The former approach, Item Response Theory (IRT), …”

Lines 166-170. I suggest writing: “a) An IRT model with two parameters for each stimuli: the discrimination parameter (also called parameter a), which describe the ability of this stimuli to distinguish between persons with low and high pitch identification ability and the item location parameter (also called parameter b), representing the level of pitch identification ability where you have 50% chance of correctly identifying the pitch of this stimulus.”

Lines 172+173. I suggest writing: “… is the probability of a person with very low pitch identification ability still getting a correct answer for a given stimulus.”

Reviewer #2: The ultimate aim of this research project is not explicitly stated, and it remains a little unclear. The authors have created a new test of assessing absolute pitch, and this was investigated using an item-response theory (IRT) approach and a latent class analysis (LCA) approach. These models were then compared to see which offered the best fit.

Although the foundation and rationale for the study seems reasonable, in that the authors wish to create a measure to determine whether individuals have the ability to identify absolute pitch, the applied methodologies are confusing and it should be better explained as to why they are appropriate.

• Latent Class Analysis categorises people into groups, under the assumption that the same thing is being measured for all people, by a standardised count, process, or measurement device/scale.

• IRT is used to determine whether a set of items are delivering a valid total score of an unobservable latent trait.

Thus it should be explained in more detail why the fit of these models should be compared as they look at different things. It should be emphasized that the categorisation of people relies on the measurement process being valid and stable, so in some sense LCA should not be considered until the measure has been validated.

For the IRT scale assessment approach, there are also many aspects that have been neglected.

There is no indication of item fit. No investigation of response dependency. No assessment of reliability or targeting.

Was a single parameter model considered? - A single-parameter (Rasch) model would be appropriate for scale development and validation purposes, and for determining whether a total score from a set of items is a sufficient statistic to assess the level of a latent trait.

Additionally, for the ‘imperfect approach’, it may be worth the authors considering a partial-credit model, where an exact pitch classification is awarded a score of 2, a semitone deviation is awarded 1, and all other pitches are scored 0.

There are also some further issues within the manuscript that would need attention:

The model fit statistics are dubious, and there is no real interpretation of the fit statistics that are presented. Certainly a test with 1000 degrees of freedom will have no statistical power.

In the manuscript, it is stated that items are centred around the 0 location, but there are no item locations reported below 0 – where are they centred?

The pitch test is based across different musical instruments – have these instruments been calibrated? Has the pitch been externally verified in some way?

Additionally, the manuscript is currently in need of a language edit and the Figures are incorrectly labelled.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jakob Bue Bjorner

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Feb 22;16(2):e0247473. doi: 10.1371/journal.pone.0247473.r002

Author response to Decision Letter 0


14 Jul 2020

PONE-D-20-04410

Absolute Pitch as a Latent Trait

PLOS ONE

Dear Dra. Germano,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

I found the manuscript interesting in that it tries to evaluate if absolute pitch is a trait or a categorical variable. The choice of methodology is appropriate, but the reporting is lacking in quality. The comments from the two reviewers are very constructive and should enable you to improve the manuscript.

Answer: Dear Editor, we are thankful for the opportunity to answer the insightful comments we received. We would also like to state that, to solve one of the issues, we needed extra help, which was provided by a newly added co-author, Fausto Lourenco Coutinho. He was fundamental in answering and dealing with some issues involving R.


Additional Editor Comments:

Please provide a revised version addressing the comments from the two reviewers, who are very constructive and have provided comments that should enable you to improve the manuscript. The choice of methodology is appropriate, but the reporting is lacking in quality. You must address tests of validity much more rigorously in a revised version. Evaluate fit of a one-dimensional CFA model (reporting chi-square, df and P-value). If you also want to report the RMSEA with corresponding confidence interval or other indices of close fit, that is OK.

Answer: We thank you for the opportunity to answer all the issues raised by the referees. An important detail regarding our IRT models: unlike the Mplus default estimator, WLSMV (probit link function), which generates CFI, TLI, and RMSEA, we used the robust maximum-likelihood estimator, which, for dichotomous items, does not generate the global model fit indices cited above. To improve the quality of our report, we conducted extra data analyses (as suggested by Reviewer #1); the results are detailed in the following answers. Because the extra analyses and interpretation involved R, we invited Fausto Lourenco Coutinho, an expert in R and a PhD student supervised by Hugo Cogo-Moreira, to join us on this manuscript.

The figures are attached on their own with no legends or titles. Some of them are not even mentioned in the main text of the paper. One example: line 145 states 'fig. 1', but the next line appears to discuss fig. 2?

Answer: Legends and better descriptions for the figures and tables were provided.

Journal Requirements: When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at:

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Answer: The manuscript was adjusted to meet PLOS ONE’s style requirements. We appreciate the links provided.

2. Please modify the title to ensure that it is meeting PLOS’ guidelines (https://journals.plos.org/plosone/s/submission-guidelines#loc-title).

In particular, the title should be "specific, descriptive, concise, and comprehensible to readers outside the field" and in this case it is not informative and specific about your study's scope and methodology. When modifying the title please be sure to amend both the title on the online submission form (via Edit Submission) and the title in the manuscript so that they are identical.

Answer: We changed the title to a more specific and descriptive version, which should be more comprehensible to readers outside the field.

3. Your ethics statement must appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section. Please also ensure that your ethics statement is included in your manuscript, as the ethics section of your online submission will not be published alongside your manuscript.

Answer: The ethics statement is now located at “Participants and study design” subsection (line 113).

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: No

________________________________________

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: No

________________________________________

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

Answer: The data are now provided as supporting information.

________________________________________

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

Answer: The first version of the manuscript was revised by a professional English-language editing service. The new version of the manuscript was also fully revised by another professional English-language editing service.

________________________________________

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: PONE-D-20-04410

This is an interesting application of modern psychometric methods. The comparison of IRT models and latent class models appear well suited for the theoretical problem posed. I have some suggestions for the data analysis and presentation of results.

Answer: We thank you for all the suggestions provided. They were insightful and contributed significantly to improving the work.

1. You should present some basic descriptive information, e.g. the frequency distribution for each item (e.g. Perfect, Imperfect, Wrong), and the frequency distribution of a simple sum of the items (or two sums according to your two approaches). This allows the reader to get some sense of your data.

Answer: We agree with the reviewer and are thankful for the suggestion. More information can be found on page 11, Table 1, which provides the percentage of correct answers for each item under the perfect and imperfect approaches, and on page 12, Table 2, which provides the simple sum of correct answers for each item under the perfect and imperfect approaches.

2. You estimate a 2-parameter and a 3-parameter IRT model. However, it is unclear which of these models you deem is having the best fit. Also, it was unclear to me, which of these models you compare with the latent class model and for which IRT model you present the score distribution in figure 5. Please clarify.

Answer: We followed your recommendation and now use AIC/BIC plus a likelihood-ratio test for the difference. Regarding the second issue raised: because there are two rating criteria, the model comparisons between IRT and latent class analysis are always conducted within the same rating criterion. The revised manuscript, on page 17, now reads as follows:

“In terms of model fit, Table 5 depicts the model fit for the IRT models (two and three item parameters) under perfect and imperfect rating. Considering the perfect approach, the lowest BIC favored the IRT model with two parameters, whereas for the imperfect approach the lowest BIC favored the IRT model with three parameters.”
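The BIC comparison described in the quoted passage can be illustrated with a minimal sketch; the log-likelihoods and parameter counts below are hypothetical stand-ins, not the authors' Mplus output:

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2*lnL (lower is better)."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k*ln(n) - 2*lnL (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical comparison: a 2-PL model (2 params x 10 items = 20) vs a
# 3-PL model (30 params), both fitted to the 783 respondents.
ll_2pl, ll_3pl = -3450.0, -3442.0   # illustrative log-likelihoods
print(bic(ll_2pl, 20, 783))   # 2-PL BIC
print(bic(ll_3pl, 30, 783))   # 3-PL BIC: higher here, despite better lnL
```

BIC penalizes each of the ten extra guessing parameters by ln(783) ≈ 6.66, so a modest gain in log-likelihood is not enough to favor the 3-PL model in this toy example.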

3. Fit is evaluated through a global chi-square test. However, chi-square tests with approx. 1000 df are not optimal. You report χ2(991) = 6635.882, p-value = 0.999 (line 239). This must be a typo; if the chi-square value is correct, p = 0. I suggest you do two things.

3a. For global tests of fit and comparisons of the 2-P and 3-P models, use AIC and BIC as in your comparison with latent class models.

Answer: We agree and confirm that the comparison was conducted based on AIC and BIC, although the other fit statistics are still reported.

3b. Evaluate item level fit, e.g. by the item fit tests suggested by Orlando and Thissen (Applied Psychological Measurement 2000). Such tests are available in the IRTPRO software and the free R package mirt (https://cran.r-project.org/web/packages/mirt/mirt.pdf). Such item based fit test may identify some stimuli that are not well modeled by IRT.

Answer: We deeply appreciated the suggestion to show item-level fit. We calculated it as described by Orlando and Thissen (2000) using the R mirt package. We added Table 4 and a set of figures containing empirical plots for all the items under the perfect approach as an illustration. Given these additions, the following information was inserted under the subheading for the dimensional solution (pages 15-16):

“Under the IRT model with two parameters, the imperfect rating indicated that all items had a good fit (all S-X² p-values > 0.05), whereas, under the perfect approach, two of the ten items were statistically significant (i.e., items c and d). Under the IRT model with three parameters, the majority of the items displayed a reduction in their p-values compared with two parameters, for both approaches.

One reason why the imperfect approach fits better at the item level under two parameters is likely the increase in the probabilities of answering the items correctly (i.e., proportions and counts are higher, since it is a less strict criterion). In the case of items c and d under the perfect rating, as shown in the empirical plots (Figs 2 and 3), the closer to the negative end of theta (AIPWR), the greater the deviation between what is expected by the model and what is actually estimated (i.e., the confidence interval [red lines] lies far above the item characteristic curve [blue]). Therefore, the observed probability of correctly answering the item at the low end of the AIPWR spectrum is higher than would be expected.”

4. Your IRT parameter estimates have very large standard errors, in particular for the discrimination parameter in the 3-P model. Difficulties of estimating the discrimination parameter in 3-P models is a known problem, but I am still concerned. You may want to use a prior for the discrimination parameter in addition to a prior for the guessing parameter.

Answer: We understand the concern regarding the SEs for the 3-P model. To avoid an extra model (i.e., a 3-P model with priors on both guessing and discrimination), we added to the Results section a note on the difficulty of estimating the discrimination parameter, exactly as you pointed out. Moreover, we added a reference regarding the parsimony cost of estimating an extra parameter.

Now, on page 15, it reads as follows:

“It is important to note that the standard errors (SE) are larger under the IRT model with three parameters, regardless of the adopted rating criterion, as commonly described in the literature [40, 41].”

5. I am also a bit worried about the magnitude of the guessing parameter for some items. For pure guessing, you would expect a guessing parameter around 1/12 = 0.08. Is there any theoretical reason why some items would have a lower asymptote of 0.3?

Answer: The reviewer is correct about the lower asymptote of 0.3. This happens under the imperfect approach. We added extra information about that point and corrected our guessing prior for the imperfect approach, updating it to 3/12, because the rating criterion counts exact pitch responses and semitone deviations as correct (e.g., aural stimulus: C; correct responses: C, B, or C#/Db). Based on that, we recalculated the guessing prior and the model parameters for the imperfect rating. We thank you very much for the very attentive reading of our manuscript. Now, at line 207, it reads as follows:

“Because the keyboard used to report the answers had 12 possibilities, the a priori value for the guessing parameter was 1/12 for the perfect rating; for the imperfect rating criterion, it was 3/12 (see the description of the rating below).”
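The effect of that prior can be seen in the standard three-parameter item response function, where the guessing parameter c sets the lower asymptote of the curve; the discrimination and difficulty values below are illustrative, not the fitted estimates:

```python
import math

def p_3pl(theta, a, b, c):
    """3-PL item response function: P(correct) = c + (1-c)/(1 + exp(-a*(theta-b)))."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Guessing priors implied by the rating rules: 1 of 12 keys under the
# perfect rating, 3 of 12 (target plus a semitone either side) under
# the imperfect rating.
c_perfect, c_imperfect = 1 / 12, 3 / 12

# At very low ability the curve flattens to the guessing floor,
# so the lower asymptotes are about 0.08 and 0.25 respectively.
print(round(p_3pl(-10, a=1.5, b=0.0, c=c_perfect), 3))
print(round(p_3pl(-10, a=1.5, b=0.0, c=c_imperfect), 3))
```

An empirical lower asymptote of 0.3 would thus sit above even the imperfect-rating chance level of 0.25, which is the reviewer's concern.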

6. In choosing between latent class models, you argue that the improved fit of the models with more than 2 classes should be ignored due to the complexity of these models. However, in comparisons between IRT and latent class models for the perfect approach, you suddenly argue that a difference in fit of the same magnitude is important and should be interpreted. You cannot have it both ways. If you compare, e.g., a 3-class model with the IRT model, you get the following results:

          AIC (D)    AIC (3CL)   BIC (D)    BIC (3CL)   SSABIC (D)   SSABIC (3CL)

Perfect   6988.887   6952.444    7082.150   7101.664    7018.640     7000.048

This comparison shows no particular superiority of the IRT model in terms of fit. The same reasoning could be applied to a 4-CL or a 5-CL model. You may want to keep a 2-CL model in the comparisons for conceptual reasons, but you should include a latent class model with better fit (e.g., 3 or 4 classes). Based on the results I have seen, I would conclude that for the “perfect” approach, a latent class model with 3, 4, or 5 classes fits as well as an IRT model. For the “imperfect” approach, the IRT model seems clearly better.

Answer: This was a crucial point. We revised all the Mplus outputs and found some inconsistencies in our previous report of the AICs and BICs. We are deeply sorry and grateful at the same time, as it made us revise all the numbers and, consequently, our results. Moreover, we added a newly suggested modeling approach (the hybrid solution). If the reviewer requires the Mplus outputs, we would be happy to share them. Now, on page 25 (line 406), our new results are described as follows:

“Comparing the continuous, categorical, and hybrid solutions, we concluded, based on model fit information values, that the continuous solution was the best for the perfect approach, with a lower BIC than the categorical and hybrid solutions. This indicates that the ability to recognize isolated pitches in different timbres and registers without reference is better modeled as a continuous ability when the perfect rating approach is considered. Alternatively, the categorical solution was the best for the imperfect approach, with a lower BIC than the continuous and hybrid solutions. This indicates that adopting flexibility in isolated pitch recognition without reference (with semitone deviations considered correct) is better modeled as latent groups.”

7. You present the score distribution for one IRT model, presumably for the “perfect” approach (the text in lines 272-273 seems to have the numbering of figures wrong). The score distribution is clearly skewed. A large group of people seem to have the same level of ability to identify the pitch of a tone (i.e., no ability). This may pose a problem for standard IRT model estimation, since a normal distribution is assumed for the latent trait. For this reason, a better model for your data may be a latent mixture distribution model with 2 latent classes. Class one consists of persons who are not able to identify the pitch regardless of the stimulus. Class two consists of persons with at least some ability to identify pitch. Within class two, responses might follow a 2-P IRT model. Such a model can be estimated in Mplus. It is fairly complex, but you may want to at least discuss it.

Answer: We agree with the reviewer and have added the analysis of a hybrid model; it was an excellent suggestion.

8. With regards to the psychometric lingo, I would suggest that :

8a. “Psychometric” is better than “Psychometrical”

8b. “Latent variable models” is better than “structural equation model”. SEM refers to a particular type of latent variable models, a type you do not use in your analysis.

8c. “Continuous” is better than “Dimensional”

Answer: We accepted all the corrections and suggestions.

Specific suggestions:

Line 24. I suggest “Through Latent Variable Models (LVM) we can evaluate consistency validity…”

Line 27. I suggest “… two LVM approaches: continuous latent variables (LV)…”

Line 37, I suggest “The phenomenon of absolute pitch (AP)…”

Line 65. It might be helpful to define relative pitch.

Line 115. Is there a reference regarding the use of a combination of AP and RP to identify pitches?

Line 159. I suggest “…continuous and categorical.”

Line 160. I suggest writing “The former approach, Item Response Theory (IRT), …”

Lines 166-170. I suggest writing: “a) An IRT model with two parameters for each stimulus: the discrimination parameter (also called parameter a), which describes the ability of the stimulus to distinguish between persons with low and high pitch identification ability, and the item location parameter (also called parameter b), representing the level of pitch identification ability at which there is a 50% chance of correctly identifying the pitch of the stimulus.”

Lines 172-173. I suggest writing: “… is the probability that a person with very low pitch identification ability still gets a correct answer for a given stimulus.”

Answer: All the specific suggestions were accepted, and a new English-language editing service was contracted to double-check spelling, coherence, and fluency.
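The two-parameter model paraphrased in the suggestion for lines 166-170 can be sketched as follows (illustrative parameter values, not the fitted Mplus estimates): at theta equal to the location parameter b, the probability of a correct answer is exactly 50%, and a larger discrimination a steepens the curve around that point.

```python
import math

def p_2pl(theta, a, b):
    """2-PL item response function: P(correct) = 1 / (1 + exp(-a*(theta-b)))."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# At theta == b the probability is exactly 0.5 (the meaning of the
# location parameter b).
print(p_2pl(theta=1.0, a=2.0, b=1.0))                        # 0.5

# A higher discrimination a separates low and high ability more sharply.
print(p_2pl(1.5, a=2.0, b=1.0) > p_2pl(1.5, a=0.5, b=1.0))   # True
```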

Reviewer #2: The ultimate aim of this research project is not explicitly stated, and it remains a little unclear. The authors have created a new test of assessing absolute pitch, and this was investigated using an item-response theory (IRT) approach and a latent class analysis (LCA) approach. These models were then compared to see which offered the best fit.

Although the foundation and rationale for the study seem reasonable, in that the authors wish to create a measure to determine whether individuals have the ability to identify absolute pitch, the applied methodologies are confusing, and it should be better explained why they are appropriate.

Answer: We thank the reviewer for the positive feedback and comments.

• Latent Class Analysis categorises people into groups, under the assumption that the same thing is being measured for all people, by a standardised count, process, or measurement device/scale.

• IRT is used to determine whether a set of items are delivering a valid total score of an unobservable latent trait.

Thus it should be explained in more detail why the fit of these models should be compared as they look at different things. It should be emphasized that the categorisation of people relies on the measurement process being valid and stable, so in some sense LCA should not be considered until the measure has been validated.

Answer: We improved the information regarding the CFA, correcting the model fit and adding item fit indices as required below. It was not clear to us what Reviewer #2 meant by the “measurement process being valid and stable, so in some sense LCA should not be considered until the measure has been validated.” Does this mean conducting a CFA before the LCA? CFA/IRT presumes homogeneity of parameter values across cases. LCA implies differences in parameter values, with each case belonging to the different populations to a different degree. The hybrid modeling now added, CFA *with* LCA, accounts for heterogeneity by representing the data as a mixture of populations. Granting that LCA is primarily driven by differences in means rather than differences in variances, we still would not trust parameter estimates (or fit assessment) from a CFA without LCA if we believed that a mixture model was best for the data. Here, we saw exactly that: depending on the approach, the best way to model the data changes.

For the IRT scale assessment approach, there are also many aspects that have been neglected. There is no indication of item fit. No investigation of response dependency. No assessment of reliability or targeting.

Answer: This point was also raised by reviewer#1 and now, at page 9, it might be read as follows:

“To evaluate model fit for the IRT models, a Pearson chi-square test for categorical outcomes was used, with p-values higher than 0.05 indicating a good fit. Item-level fit was evaluated via Pearson’s X2 (S-X2) as implemented in the R package mirt, following Orlando and Thissen [34].”

Was a single parameter model considered? - A single-parameter (Rasch) model would be appropriate for scale development and validation purposes, and for determining whether a total score from a set of items is a sufficient statistic to assess the level of a latent trait.

Answer: We did not consider a single-parameter model, retaining only the two- and three-parameter models. We did not find references discussing the superiority of two- or three-parameter modeling over single-parameter modeling for scale development and validation purposes. Based on Deborah Bandalos's 2018 book, in the chapter on validity, evidence based on internal consistency (commonly called construct validity) is achieved by techniques involving a latent variable approach, without directly specifying the number of parameters to be estimated.

Additionally, for the ‘imperfect approach’, it may be worth the authors considering a partial-credit model, where an exact pitch classification is awarded a score of 2, a semitone deviation is awarded 1, and all other pitches are scored 0.

Answer: We understand the point raised by the reviewer and the underlying idea of grading the answers to the tasks. However, for two main reasons, we decided not to proceed this way. First, such a scoring system (2, 1, 0) is not commonly adopted in the absolute pitch literature. Second, after adding the new (hybrid) model and other additions, the manuscript has already grown in complexity. Adding another model, which is only meaningful for the imperfect approach, would further increase the complexity and length of our manuscript. This idea will certainly be explored in future work.
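For reference, the reviewer's 2/1/0 partial-credit rule could be implemented as below; this is a hypothetical sketch of a scheme the manuscript deliberately does not adopt. Note that pitch-class distance must wrap around the 12-tone circle so that, e.g., B counts as one semitone from C:

```python
# Pitch classes as integers 0-11 (C=0, C#=1, ..., B=11).

def semitone_distance(target: int, response: int) -> int:
    """Smallest distance between two pitch classes on the 12-tone circle."""
    d = abs(target - response) % 12
    return min(d, 12 - d)

def partial_credit(target: int, response: int) -> int:
    """Reviewer's suggested rule: 2 exact, 1 for a semitone deviation, 0 otherwise."""
    d = semitone_distance(target, response)
    return 2 if d == 0 else 1 if d == 1 else 0

print(partial_credit(0, 0))    # exact C           -> 2
print(partial_credit(0, 11))   # B, semitone off   -> 1
print(partial_credit(0, 6))    # F#, tritone away  -> 0
```

Summing these polytomous scores over items would then feed a partial-credit (polytomous IRT) model rather than the dichotomous 2-PL/3-PL models used here.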

There are also some further issues within the manuscript that would need attention:

The model fit statistics are dubious, and there is no real interpretation of the fit statistics that are presented. Certainly a test with 1000 degrees of freedom will have no statistical power.

Answer: We excluded the chi-square test with 1000 degrees of freedom and improved the interpretability of the figures and of the new and existing tables.

In the manuscript, it is stated that items are centred around the 0 location, but there are no item locations reported below 0 – where are they centred?

Answer: We are sorry for our imprecision. Now, on page 9, it reads as follows:

“Under the maximum-likelihood estimator and logit parameterization (theta), the constant 1.7 in the logit gives only an approximate closeness to the normal. The translation to IRT parameter values uses the factor mean and factor variance to bring them to the N~(0,1) metric used in IRT.”

The pitch test is based across different musical instruments – have these instruments been calibrated? Has the pitch been externally verified in some way?

Answer: The samples used were professionally recorded in a studio and are extremely accurate in terms of tuning, based on the A=440 Hz standard. Since they are commercially used for a variety of soundtracks in cinema and television, their precision and quality are well known among professional musicians. Nevertheless, the pitch of every sample used was checked before the test battery was assembled. The voice samples were recorded and edited by a professional studio technician; two professional vocalists were assigned to the task. The pitches of the recorded voice samples were also checked (A=440 Hz standard) and any deviation was corrected.
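As a small illustration of the A=440 Hz reference mentioned in the answer, equal-temperament frequencies (each semitone multiplies the frequency by 2^(1/12)) give the targets against which such samples can be checked; the helper below is ours, not part of the study materials:

```python
def pitch_freq(semitones_from_a4: int) -> float:
    """Equal-temperament frequency relative to the A4 = 440 Hz standard."""
    return 440.0 * 2 ** (semitones_from_a4 / 12)

print(round(pitch_freq(0), 1))     # A4 -> 440.0
print(round(pitch_freq(3), 1))     # C5 -> 523.3
print(round(pitch_freq(-12), 1))   # A3 -> 220.0
```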

Additionally, the manuscript is currently in need of a language edit and the Figures are incorrectly labelled.

Answer: We agree with the reviewer that a language review of the whole manuscript was needed. We sent it to a new professional English-language editing service for a full review. We deeply thank you for all the recommendations.

________________________________________

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jakob Bue Bjorner

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: response_to_reviwers.docx

Decision Letter 1

Karl Bang Christensen

8 Sep 2020

PONE-D-20-04410R1

A new Approach to Measuring Absolute Pitch on a Psychometric Theory of Isolated Pitch Perception: Is it Disentangling two Groups or Capturing a Continuous Ability?

PLOS ONE

Dear Dr. Germano,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Thank you for your careful revision. I agree with the reviewers that the manuscript has improved. The comments from the reviewers on the revised version illustrate that more work is needed. I agree with the comment that the manuscript is currently too long and too difficult to read. I am also concerned about the results you obtain from the 3PL model. The very large standard errors and the large guessing parameters make me worry that these results cannot be trusted. One way to improve the manuscript would be to put less emphasis on these results.

Please carefully consider all the points raised in the attached reviews and submit your revised manuscript by Oct 23 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Karl Bang Christensen, Ph.D.

Academic Editor

PLOS ONE


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?


Reviewer #1: No

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?


Reviewer #1: No

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?


Reviewer #1: No

Reviewer #2: Yes

**********

6. Review Comments to the Author


Reviewer #1: Thanks for the responses to my previous comments. I find that your revisions have clarified the data analyses, but there is still some way to go.

You have two scoring options for the assessment of isolated pitch perception: a. Perfect: only the correct note is classified as a correct response; b. Imperfect: also including the semitone below and above. You compare 1. IRT analyses using either a 2-PL or a 3-PL model, 2. latent class analyses, 3. mixture model analyses. You conclude that perfect scoring is best fitted by an IRT model, while imperfect scoring is best fitted by a two-class latent class model.

1. I understand that your choice of model is guided by the global indices such as BIC and AIC. However, from a conceptual point of view, this conclusion does not make sense to me. If you conclude that isolated pitch perception with perfect scoring represents a continuous latent trait, how could the use of a less stringent scoring criterion suddenly change this ability to two latent classes? It seems to me that this interpretation of the results is theoretically incoherent. I think you need to discuss this and make an overall interpretation of whether isolated pitch perception is best considered a continuous trait or two latent classes.

2. Related to the discussion above, it seems to me that the latent class solution for imperfect scoring has an unfortunate interpretation. Even the best group has only a 67% chance of getting item i right and a 72% chance of getting item j right within +/- a half tone. This does not concur with the common understanding of absolute pitch. Maybe the absolute pitch group is only a subgroup within the current best latent class. In that case, you may need more than two latent classes. Please discuss whether your current model is theoretically plausible.
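The class-membership probabilities at issue follow from Bayes' rule under the local-independence assumption of LCA. A hypothetical sketch (the class priors and item probabilities below are invented for illustration, not the fitted two-class solution):

```python
def lca_posterior(responses, item_probs_by_class, class_priors):
    """Posterior class probabilities for a binary response pattern,
    assuming local independence (the standard LCA assumption)."""
    joint = []
    for prior, item_probs in zip(class_priors, item_probs_by_class):
        lik = prior
        for resp, p in zip(responses, item_probs):
            lik *= p if resp == 1 else (1.0 - p)
        joint.append(lik)
    total = sum(joint)
    return [j / total for j in joint]

# Invented two-class example: a small 'high-accuracy' class (~70% per item)
# versus a larger 'low-accuracy' class (~10% per item).
posterior = lca_posterior(
    responses=[1, 1, 1],
    item_probs_by_class=[[0.7, 0.7, 0.7], [0.1, 0.1, 0.1]],
    class_priors=[0.2, 0.8],
)
```

Even with per-item probabilities well below 100% in the high class, a run of correct responses pushes the posterior strongly toward that class, which is why moderate item probabilities (like the 67%/72% cited above) can still separate groups cleanly.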

3. For perfect scoring, you find that an IRT model represents the best model for the data. However, for perfect scoring, evaluation of item fit finds significant misfit for item c and item d. This suggests problems for these two items as indicators of the latent trait. This is not discussed, but should be. The plots and fit tests do not suggest that a 3-PL model provides a better fit to the data than a 2-PL model.

4. I continue having concerns about your 3-PL model. The discrimination parameter is very high for item b, and so is the guessing parameter. Also, the standard errors of the discrimination parameter are high. The BIC suggests that the 2-PL model might be the best. You should conclude which model you regard as the final model.
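For context, the guessing parameter questioned here is the lower asymptote of the 3-PL item response function, and the 2-PL is the special case of a zero asymptote. A minimal sketch of the two generic model equations (not the study's estimates):

```python
import math

def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """3-PL item response function: discrimination a, difficulty b,
    and lower asymptote c (the 'guessing' parameter)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2-PL is the 3-PL with the lower asymptote fixed at c = 0."""
    return p_correct_3pl(theta, a, b, 0.0)
```

Because c bounds the success probability from below even for very low theta, a large estimated c trades off against a and b, which is one reason inflated discrimination estimates and large standard errors often accompany a large guessing parameter.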

5. You aim to develop a test for isolated pitch recognition. I assume that part of this development is to decide whether your test is best scored using the perfect or the imperfect approach. I think you should provide your recommendation and the reasoning behind it.

6. The paper is a long and complex read because so many combinations of options are examined. You would increase readability by focusing the main paper on what you consider the best solution and present other options as supplemental analyses. For example, you could focus on perfect scoring, use the 2-PL model as your IRT model to be compared with a latent class and a mixture analysis. Other options could be alluded to briefly and results for these other models could be presented in a web appendix. I think such an approach would make your paper much more readable.

7. You write that the differences in item difficulty and discrimination pose difficulties for a simple sum score approach. However, items may be summed without problems even if they have widely different item difficulty. For example, in the Rasch model (where all items have the same discrimination, but may differ in item difficulty) the sum score is a sufficient statistic for the latent trait. So the real issue is whether the items vary so much in item discrimination that a simple sum is inappropriate. For the 2-PL model and perfect scoring, I do not think this is the case. Item discrimination varies between 1.2 and 1.9, not a dramatic variation. It is possible that a Rasch (i.e. 1-PL) model may fit these items. Please revise this discussion.
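The reviewer's sufficiency argument can be checked numerically: under the Rasch model, two response patterns with the same sum score have a likelihood ratio that is constant in theta, so theta inference depends on the responses only through the sum score. A small sketch with made-up item difficulties:

```python
import math

def rasch_likelihood(theta, pattern, difficulties):
    """Likelihood of a binary response pattern under the Rasch (1-PL) model."""
    lik = 1.0
    for x, b in zip(pattern, difficulties):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        lik *= p if x == 1 else (1.0 - p)
    return lik

# Made-up difficulties; both patterns below have sum score 2.
b = [-1.0, 0.0, 1.0]
ratio_low = rasch_likelihood(-2.0, [1, 1, 0], b) / rasch_likelihood(-2.0, [1, 0, 1], b)
ratio_high = rasch_likelihood(2.0, [1, 1, 0], b) / rasch_likelihood(2.0, [1, 0, 1], b)
# ratio_low == ratio_high at every theta: the sum score is sufficient.
```

Under a 2-PL model with unequal discriminations this ratio would vary with theta, which is the precise sense in which unequal discrimination, not unequal difficulty, threatens a simple sum score.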

8. While the analyses seem to be well done, the interpretation and discussion of results could use input from English-language researchers with psychometric/IRT expertise. Some description of the models is not well structured (e.g. the discussion of IRT models on lines 205-219). Also, while most of the paper is well written, some parts of the psychometric discussion still deviate from the normal language of the field, e.g. in the abstract [my suggestions in square brackets]:

“We decided to adopt a psychometric perspective, approaching AP as a latent trait. Via Latent Variable Model (LVM) we can provide [evaluate] consistency and validity for a measurement [measure] to test for AP ability. A total of 783 undergraduate music students took part in the test. The battery test [test battery] consisted of 10 isolated pitches.”

Reviewer #2: I would like to thank the authors for responding to the reviewers’ requests and making appropriate amendments to their paper. They have clearly invested time and effort into this, and I believe that the manuscript is now presented better and is much clearer in terms of the purpose of the paper and the process that has been carried out.

There are a few additional amendment suggestions that I have, and these are provided below:

I would suggest that the title is amended to state ‘specific groups’ or ‘ability groups’ rather than ‘two groups’. As you are using LCA, it is not known a priori how many latent classes will be identified.

The Intro reads well, provides good background and makes sense. However, this is an exploratory study, to see how the items in the AP test work among the group tested. Do the items work together to form a measure of an underlying latent continuum (IRT)? Or do they work better as a set of indicator items that can classify people into groups (LCA)? I would suggest that this may be clarified for readers if the authors were to provide a statement in both the abstract and the introduction to state that this is an exploratory data modelling study, to determine which type of model best fits the data for the tested sample.

For the IRT analysis, there is still no test of local dependency among the items. This is perhaps unnecessary for the purpose of the current study, but perhaps it could be identified as a potential limitation, or the authors could suggest that it could be assessed in future work if a latent trait IRT approach is pursued further.

The authors state ‘Under Maximum Likelihood estimator and using logit parameterization (theta), the constant 1.7 in the logit gives only an approximate closeness to the normal. The translation to IRT parameter values uses factor mean and factor variance to bring them to the N~(0,1) metric used in IRT.’ To clarify this for the reader, I would suggest that the authors might also add a sentence to state that this means that the IRT analysis is centred on the person sample being at 0 logits, and that the item difficulty parameters are provided relative to this.
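The approximation mentioned here (the constant 1.7 in the logit) can be verified numerically: with that scaling, the logistic curve stays within about 0.01 of the standard normal ogive everywhere. A quick sketch, independent of the manuscript's analyses:

```python
import math

def scaled_logistic(x: float) -> float:
    """Logistic function with the classic 1.7 scaling constant."""
    return 1.0 / (1.0 + math.exp(-1.7 * x))

def normal_ogive(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Scan a grid over [-4, 4]; the largest gap (near x ~ 0.6) stays below 0.01,
# which is the 'approximate closeness to the normal' the authors refer to.
max_gap = max(abs(scaled_logistic(x / 100.0) - normal_ogive(x / 100.0))
              for x in range(-400, 401))
```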

A few additional very minor corrections are as follows:

Line 189 states that MPlus is used. R also needs to be added here.

Line 287. I believe this should say difficulty rather than discrimination.

Line 409. Should be timbres rather than timbers.

A list of abbreviations would also be useful, so that the reader can refer back to them without scrolling through all of the manuscript to find the relevant abbreviation.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jakob Bue Bjorner

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Feb 22;16(2):e0247473. doi: 10.1371/journal.pone.0247473.r004

Author response to Decision Letter 1


21 Dec 2020

PONE-D-20-04410R1

A new Approach to Measuring Absolute Pitch on a Psychometric Theory of Isolated Pitch Perception: Is it Disentangling two Groups or Capturing a Continuous Ability?

PLOS ONE

Dear Dr. Germano,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Thank you for your careful revision. I agree with the reviewers that the manuscript has improved. The comments from the reviewers on the revised version illustrate that more work is needed. I agree with the comment that the manuscript is currently too long and too difficult to read. I am also concerned about the results you obtain from the 3-PL model. The very large standard errors and the large guessing parameters make me worry that these results cannot be trusted. One way to make the manuscript better would be to put less emphasis on these results.

Please carefully consider all the points raised in the attached reviews and submit your revised manuscript by Oct 23 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

• A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

• A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

• An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Karl Bang Christensen, Ph.D.

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

________________________________________

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

Reviewer #2: Yes

________________________________________

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: I Don't Know

________________________________________

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

________________________________________

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

Answer: The first and second manuscripts were revised by a professional English-language editing service. The new version of the manuscript was also fully revised by another professional English-language editing service.

________________________________________

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thanks for the responses to my previous comments. I find that your revisions have clarified the data analyses, but there is still some way to go.

You have two scoring options for assessment of isolated pitch perception: a. Perfect: only the correct note is classified as a correct response, b. Imperfect: including the half-note below and above. You compare 1. IRT analyses using either a 2-PL or a 3-PL model, 2. Latent class analyses, 3. Mixture model analyses. You conclude that perfect scoring is best fitted using an IRT model, while imperfect scoring is best fitted with a two-class latent class model.

1. I understand that your choice of model is guided by the global indices such as BIC and AIC. However, from a conceptual point of view, this conclusion does not make sense to me. If you conclude that isolated pitch perception with perfect scoring represents a continuous latent trait, how could the use of a less stringent scoring criterion suddenly change this ability to two latent classes? It seems to me that this interpretation of results is theoretically incoherent. I think you need to discuss this and make an overall interpretation of whether isolated pitch perception is best considered a continuous trait or two latent classes.

Response: Thank you for your observation. The AP literature shows different methods to measure AP ability, which led to the two rating criteria (perfect and imperfect approaches) adopted in our test. To clarify, we conducted a brief review of the two methods (see table below), showing that both methods are widely used. Our aim was not to directly compare these two approaches. Rather, we aimed to evaluate how they might influence the decision of the best model for isolated pitch recognition tasks (line 111), since they are both used arbitrarily in the literature. One of our hypotheses was that different measurement methods lead to different results that reflect different theoretical perspectives. This is a methodological issue that is not adequately addressed by the AP literature. Our results corroborate this hypothesis, as the two rating approaches showed distinct underlying models (a continuous trait for the perfect approach and two latent classes for the imperfect approach). Also, our aim was to contribute to a better theoretical understanding of AP ability, showing that different rating criteria greatly influence test results and the underlying model. We added the imperfect result in the abstract to highlight this duality regarding AP measurement (line 37). In the conclusion, we added our contribution to the AP literature (line 511). In the discussion, we added an observation regarding the change from a continuous model to a two-latent-class model resulting from the adoption of a less stringent scoring criterion, as mentioned (line 466).

2. Related to the discussion above, it seems to me that the latent class solution for imperfect scoring has an unfortunate interpretation. Even the best group has only a 67% chance of getting item i right and a 72% chance of getting item j right within +/- a half tone. This does not concur with the common understanding of absolute pitch. Maybe the absolute pitch group is only a subgroup within the current best latent class. In that case, you may need more than two latent classes. Please discuss whether your current model is theoretically plausible.

Response:

We thank the reviewer for raising this issue.

AP possessors do not have exceptional pitch acuity. Absolute pitch is neither ‘absolute’ nor ‘perfect’ in the ordinary uses of those words (Levitin and Rogers, 2005). The AP literature shows that it is quite common for AP possessors to make semitone errors in pitch judgments (Levitin and Rogers, 2005; Miyazaki, 1988; Baggaley, 1974; Brady, 1970). This results in semitone errors being considered as partially or totally correct answers in some tests (Baharloo et al., 1998; Athos et al., 2007; Schulze et al., 2009). In our study, under the imperfect approach, the better group had a 67% chance of getting item i right and a 72% chance of getting item j right. These results are consistent with previous research using the imperfect approach: Li (2020) reported 73% correct not allowing for semitone errors and 80% correct allowing for semitone errors.

Therefore, in contrast to what Reviewer #1 mentioned, and as described by Levitin and Rogers (2005), page 28, “... ‘absolute’ refers to judgments established independently, rather than by comparison. The terms ‘absolute’ and ‘perfect’ both imply in the lay mind a level of precision not typically present in AP possessors, who frequently make octave errors (confusing tones that are half or double the frequency), and semitone errors (confusing tones that are 6% apart) [Levitin, D.J. (1999) Absolute pitch: Self-reference and human memory. International Journal of Computing Anticipatory Systems 4, 255–266; Miyazaki, K. (1988) Musical pitch identification by absolute pitch possessors. Percept. Psychophys. 44, 501–512]. Like most human traits, AP is not an all-or-none ability, but rather, exists along a continuum [Levitin, D.J. (2004) L’oreille absolue. L’Année Psychologique 104, 103–120; Levitin, D.J. (1999) Absolute pitch: Self-reference and human memory. International Journal of Computing Anticipatory Systems 4, 255–266; Deutsch, D. (2002) The puzzle of absolute pitch. Curr. Dir. Psychol. Sci. 11, 200–204; Vitouch, O. (2003) Absolutist models of absolute pitch are absolutely misleading. Music Perception 21, 111–117]. Self-identified AP possessors score well above chance (which would be 1 out of 12, or 8.3%) on AP tests, typically scoring between 50 and 100% correct [Miyazaki, K. (1988) Musical pitch identification by absolute pitch possessors. Percept. Psychophys. 44, 501–512], and even musicians not claiming AP score up to 40% [Lockhead, G.R. and Byrd, R. (1981) Practically perfect pitch. J. Acoust. Soc. Am. 70, 387–389]. Still, even those who score better than 90% show similar discrimination thresholds to, and are typically no better than, other musicians at noticing when one tone is out of tune with respect to another [Burns, E.M. and Campbell, S.L. (1994) Frequency and frequency-ratio resolution by possessors of absolute and relative pitch: Examples of categorical perception? J. Acoust. Soc. Am. 96, 2704–2719; Levitin, D.J. (1999) Absolute pitch: Self-reference and human memory. International Journal of Computing Anticipatory Systems 4, 255–266]. Clearly, there is nothing ‘perfect’ about AP; rather AP is the ability to place or produce tones within nominal categories.

This discussion was included in line 96. We also conducted a three-latent-class analysis, but the two-class solution was best, as presented in line 379. If there were volunteers who were nearly infallible in isolated pitch recognition tasks in our experiment, they could not be identified as a separate group among the participants. This discussion was added in line 466.

3. For perfect scoring, you find that an IRT model represents the best model for the data. However, for perfect scoring, evaluation of item fit finds significant misfit for item c and item d. This suggests problems for these two items as indicators of the latent trait. This is not discussed, but should be. The plots and fit tests do not suggest that a 3-PL model provides a better fit to the data than a 2-PL model.

Response: We thank the Reviewer for this suggestion. We agree with the Reviewer, especially considering the parsimony principle. We modified our text accordingly, stating that the 2-PL is the best solution for both ratings under the IRT approach.

4. I continue having concerns about your 3-PL model. The discrimination parameter is very high for item b, and so is the guessing parameter. Also, the standard errors of the discrimination parameter are high. The BIC suggests that the 2-PL model might be the best. You should conclude which model you regard as the final model.

Response: Thank you for your considerations. We concluded that the 2-PL model is the best model. The text was modified in lines 379-384 and in line 461.

5. You aim to develop a test for isolated pitch recognition. I assume that part of this development is to decide whether your test is best scored using the perfect or the imperfect approach. I think you should provide your recommendation and the reasoning behind it.

Response: As previously discussed, our aim was not to compare these two scoring approaches (because the models are not nested). We sought instead to evaluate how they might influence the decision of the best model for isolated pitch recognition tasks (line 112). As scoring approaches reflect different theoretical perspectives, we believe our results contribute to a better theoretical understanding of AP ability, showing that different rating criteria greatly influence test results and the nature of the underlying latent variable. Also, we cannot formally compare the imperfect and perfect approaches regarding superiority because they are not nested models (line 453).

The test comprised auditory stimuli and was administered collectively. Volunteers were given a keyboard drawn on paper on which they marked the answer that they thought was correct. This procedure allowed us to standardize the evaluation. Scoring was conducted in two ways (perfect and imperfect approaches), according to the literature.

Given below are examples of studies that use these two approaches:

Auditory Stroop and Absolute Pitch: An fMRI Study - Katrin Schulze, Karsten Mueller, and Stefan Koelsch - Human Brain Mapping - Imperfect (answers within one semitone were regarded as a correct answer)

Intracortical Myelination in Musicians With Absolute Pitch: Quantitative Morphometry Using 7-T MRI - Seung-Goo Kim and Thomas R. Knösche - Human Brain Mapping - Imperfect (an error of a semitone was considered as a correct response)

Increased Volume and Function of Right Auditory Cortex as a Marker for Absolute Pitch - Martina Wengenroth, Maria Blatow, Armin Heinecke, Julia Reinhardt, Christoph Stippich, Elke Hofmann and Peter Schneider - Cerebral Cortex - Imperfect (for semitone errors 0.5 point was credited)

Gray and White Matter Anatomy of Absolute Pitch Possessors - Anders Dohn, Eduardo A. Garza-Villarreal, M. Mallar Chakravarty, Mads Hansen, Jason P. Lerch, and Peter Vuust - Cerebral Cortex - Imperfect (participants were given ¾ point for each error of a semitone)

Perceiving pitch absolutely: Comparing absolute and relative pitch possessors in a pitch memory task - Katrin Schulze, Nadine Gaab and Gottfried Schlaug - BMC Neuroscience - Imperfect (answers within one semitone of the presented pitch were regarded as a correct answer)

The Neurocognitive Components of Pitch Processing: Insights from Absolute Pitch - Sarah J. Wilson, Dean Lusher, Catherine Y. Wan, Paul Dudgeon and David C. Reutens - Cerebral Cortex - Perfect (semitone errors were coded as incorrect for all participants)

Absolute Pitch and the P300 Component of the Event-Related Potential: An Exploration of Variables That May Account for Individual Differences - Laura Renninger, Roni Granot, Emanuel Donchin - Music Perception - Imperfect (credit was given to subjects who came within 1 semitone of the correct pitch)

Absolute Pitch—Functional Evidence of Speech-Relevant Auditory Acuity Mathias S. Oechslin, Martin Meyer and Lutz Jäncke - Cerebral Cortex - Perfect (the semitone errors were taken as incorrect to increase the discriminatory power)

Absolute Pitch and Planum Temporale- Julian Paul Keenan, Ven Thangaraj, Andrea Halpern, Gottfried Schlaug NeuroImage - Imperfect (we regarded a response within 1/2 tone difference of the presented tone as a correct response)

Absolute Pitch: An Approach for Identification of Genetic and Nongenetic Components- Siamak Baharloo, Paul A. Johnston, Susan K. Service, Jane Gitschier, and Nelson B. Freimer Am. J. Hum. Genet. Imperfect (we decided to score a full point for semitone errors made by individuals >45 years of age)

A Distribution of Absolute Pitch Ability as Revealed by Computerized Testing - Patrick Bermudez and Robert J. Zatorre - Music Perception - Perfect (in the percent correct score, only exactly correct responses are counted [0 semitone deviation])

Absolute pitch is associated with a large auditory digit span: A clue to its genesis (L)- Diana Deutsch and Kevin Dooley J. Acoust. Soc. Am. Perfect (not allowing for semitone errors).

Absolute pitch among American and Chinese conservatory students: Prevalence differences, and evidence for a speech-related critical period - Diana Deutsch, Trevor Henthorn, Elizabeth Marvin, HongShuai Xi J. Acoust. Soc. Am. - Perfect and Imperfect (no semitone errors allowed and semitone errors allowed).

Dichotomy and perceptual distortions in absolute pitch ability - Alexandra Athos, Barbara Levinson, Amy Kistler, Jason Zemansky, Alan Bostrom, Nelson Freimer, and Jane Gitschier - PNAS - Imperfect (partial [3/4 point] credit for an answer deviating by one semitone)

Effects of Musical Training and Absolute Pitch on a Pitch Memory Task: an Event-related Potential Study- Edwin C Hantz, Kelley G. Kreilick, Amy L. Braveman, Kenneth P. Swartz Psychomusicology - Perfect (no points were given for any other answers; that is, no points were awarded for near hits half-step errors)

The effects of timbre on absolute pitch judgment Xiaonuo Li Psychology of Music (new research)- Perfect and Imperfect (no semitone errors allowed and semitone errors allowed).

Absolute Pitch as an Inability: Identification of musical intervals in a tonal context - Ken’ichi Miyazaki - Music Perception - Imperfect (semitone errors are counted as correct)

Perfect Pitch - Joseph Profita and T. George Bidder - American Journal of Medical Genetics - Perfect (Subjects were defined as having perfect pitch if they were able to identify 90% or more of the total number of tones).

Absolute Pitch: Effects of Timbre on Note-Naming Ability - Patrícia Vanzella, E. Glenn Schellenberg - PLoS ONE - Imperfect (semitone errors were considered as correct)

Effects of musical training and absolute pitch ability on event-related activity in response to sine tones - John W. Wayman, Robert D. Frisina and Joseph P. Walton - Acoustical Society of America - Imperfect (half-step errors were given half credit)

How Stable is Pitch Labeling Accuracy in Absolute Pitch Possessors? - Wilfried Gruhn, Reet Ristmägi, Peter Schneider, Arun D'souza, Kristi Kiilu - Empirical Musicological Review - Imperfect (for semitone errors 0.5 point was credited)

Absolute pitch memory: Its prevalence among musicians and dependence on the testing context - Yetta Kwailing Wong & Alan C.N. Wong - Psychon Bull Rev - Perfect (confusions between neighboring semitones [e.g., treating a “C” as a “C#”] are regarded as errors) (Takeuchi & Hulse, 1993; Zatorre, 2003).

Multiple coding strategies in the retention of musical tones by possessors of absolute pitch - Robert J. Zatorre, Christine Beckett Memory & Cognition - Perfect (0% correct by semitone transposition)

6. The paper is a long and complex read because so many combinations of options are examined. You would increase readability by focusing the main paper on what you consider the best solution and present other options as supplemental analyses. For example, you could focus on perfect scoring, use the 2-PL model as your IRT model to be compared with a latent class and a mixture analysis. Other options could be alluded to briefly and results for these other models could be presented in a web appendix. I think such an approach would make your paper much more readable.

Response: Thank you for your comment. We agree that this paper is a long and complex read. Nevertheless, all these analyses are of paramount importance to researchers in the AP area. We believe that the audience and the impact of this paper will be expanded if these two scoring approaches, commonly cited and used in the AP literature, are included in the main text. Please see the table above, where we mapped the manuscripts, number of citations, the adopted approaches, and other bibliometric features. Applying psychometrics to AP may guide future studies, since our work provides different analytical perspectives for a set of stimuli commonly used to track AP. That is, if our paper addresses these two different approaches (imperfect versus perfect), it will be increasingly cited in future AP research. As mentioned above, the two scoring approaches reflect different theoretical perspectives which are commonly adopted. We believe that this full comparative and statistical discussion will contribute to a better theoretical understanding of AP ability, showing that different rating criteria greatly influence test results and the underlying model.

7. You write that the differences in item difficulty and discrimination pose difficulties for a simple sum score approach. However, items may be summed without problems even if they have widely different item difficulty. For example, in the Rasch model (where all items have the same discrimination, but may differ in item difficulty) the sum score is a sufficient statistic for the latent trait. So the real issue is whether the items vary so much in item discrimination that a simple sum is inappropriate. For the 2-PL model and perfect scoring, I do not think this is the case. Item discrimination varies between 1.2 and 1.9, not a dramatic variation. It is possible that a Rasch (i.e. 1-PL) model may fit these items. Please revise this discussion.

Response: We agree with the Reviewer. The process of parceling (summing or averaging items) has shown its robustness based on the law of large numbers and aggregation principles (see Little et al., 2002). The sentence has been removed from the text.

Little, T. D., Cunningham, W. A., Shahar, G., & Widaman, K. F. (2002). To parcel or not to parcel: Exploring the question, weighing the merits. Structural equation modeling, 9(2), 151-173.

8. While the analyses seem to be well done, the interpretation and discussion of results could use input from English-language researchers with psychometric/IRT expertise. Some descriptions of the models are not well structured (e.g. the discussion of IRT models on lines 205-219). Also, while most of the paper is well written, some parts of the psychometric discussion still deviate from the normal language of the field, e.g. in the abstract [my suggestions in square brackets]:

“We decided to adopt a psychometric perspective, approaching AP as a latent trait. Via Latent Variable Model (LVM) we can provide [evaluate] consistency and validity for a measurement [measure] to test for AP ability. A total of 783 undergraduate music students took part in the test. The battery test [test battery] consisted of 10 isolated pitches.”

Response: Thank you for your suggestions. We changed the words in the abstract and in the lines indicated.

Reviewer #2: I would like to thank the authors for responding to the reviewers’ requests and making appropriate amendments to their paper. They have clearly invested time and effort into this, and I believe that the manuscript is now presented better and is much clearer in terms of the purpose of the paper and the process that has been carried out.

There are a few additional amendment suggestions that I have, and these are provided below:

I would suggest that the title is amended to state ‘specific groups’ or ‘ability groups’ rather than ‘two groups’. As you are using LCA, it is not known a priori how many latent classes will be identified.

Response: Thank you for the suggestion. We changed “two groups” to “specific groups”.

The Intro reads well, provides good background and makes sense. However, this is an exploratory study, to see how the items in the AP test work among the group tested. Do the items work together to form a measure of an underlying latent continuum (IRT)? Or do they work better as a set of indicator items that can classify people into groups (LCA)? I would suggest that this may be clarified for readers if the authors were to provide a statement in both the abstract and the introduction to state that this is an exploratory data modelling study, to determine which type of model best fits the data for the tested sample.

Response: Thank you for your comments. The AP literature indicates different approaches to rating the stimuli of isolated pitch perception, which led to the two rating criteria (the perfect and imperfect approaches) adopted in our test. Our aim was not to compare these two approaches, but to evaluate how they might influence the choice of the best model for isolated pitch recognition tasks within each approach (line 112). One of our hypotheses was that different approaches lead to different results. It is important to note that both approaches are used in AP research and, to our knowledge, this psychometric issue has not been adequately addressed in the AP literature. Our results corroborate this hypothesis, as the two rating approaches yielded latent variables of distinct natures (a continuous trait for the perfect approach and two latent classes for the imperfect approach). Also, our aim was to contribute to a better theoretical understanding of AP ability, showing that different rating criteria greatly influence test results and how the latent variable might be measured. We agree that this issue needed further clarification in the manuscript. Therefore, we included more information in lines 96, 466 and 509.

For the IRT analysis, there is still no test of local dependency among the items. This is perhaps unnecessary for the purpose of the current study, but perhaps it could be identified as a potential limitation, or the authors could suggest that it could be assessed in future work if a latent trait IRT approach is pursued further.

Response: We agree with the Reviewer. We added the following: “Future studies may investigate more detailed elements of psychometrics, such as local dependency for each of the models (IRT and LCA), invariance testing per sex, time of studies, and played instruments” (lines 496-498).

The authors state ‘Under Maximum Likelihood estimator and using logit parameterization (theta), the constant 1.7 in the logit gives only an approximate closeness to the normal. The translation to IRT parameter values uses factor mean and factor variance to bring them to the N~(0,1) metric used in IRT.’ To clarify this for the reader, I would suggest that the authors might also add a sentence to state that this means that the IRT analysis is centred on the person sample being at 0 logits, and that the item difficulty parameters are provided relative to this.

Response: We thank the Reviewer for this comment. It now reads in line 216 as: “The factor is assumed to be normally distributed, with the mean fixed at zero and the variance at 1. That is, the IRT analysis is centered on the person sample being at 0 logits, and the item difficulty parameters are provided relative to this.”
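For readers unfamiliar with the scaling constant D = 1.7 discussed here, a small Python sketch (with hypothetical item parameters, not the paper's estimates) shows how the logistic 2-PL curve with D = 1.7 tracks the normal-ogive curve on the N(0,1) theta metric:

```python
import math

def two_pl_logistic(theta, a, b, D=1.7):
    """2-PL item characteristic curve in logistic form, with scaling D."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def two_pl_ogive(theta, a, b):
    """2-PL item characteristic curve in normal-ogive form: Phi(a(theta - b))."""
    z = a * (theta - b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical parameters: discrimination a = 1.2, difficulty b = 0.5,
# both expressed relative to a person sample centered at theta = 0.
a, b = 1.2, 0.5

# With D = 1.7 the logistic curve stays within roughly 0.01 of the ogive
# over the whole theta range, which is why the constant is used.
max_gap = max(abs(two_pl_logistic(t / 10.0, a, b) - two_pl_ogive(t / 10.0, a, b))
              for t in range(-40, 41))
print(f"max |logistic - ogive| over theta in [-4, 4]: {max_gap:.4f}")
```

The near-equivalence of the two parameterizations is what allows logit-estimated parameters to be reported on the probit-style N(0,1) metric conventional in IRT.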

A few additional very minor corrections are as follows:

Line 189 states that MPlus is used. R also needs to be added here.

Response: Thank you for the observation. The R program was included.

Line 287. I believe this should say difficulty rather than discrimination.

Response: Thank you. The word has been changed.

Line 409. Should be timbres rather than timbers.

Response: You are correct. The revision has been made.

A list of abbreviations would also be useful, so that the reader can refer back to them without scrolling through all of the manuscript to find the relevant abbreviation.

Response: Thank you for your suggestion. The list of abbreviations was included as Supporting Information.

________________________________________

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jakob Bue Bjørner

Reviewer #2: No


Attachment

Submitted filename: Response_to_Reviewers.docx

Decision Letter 2

Karl Bang Christensen

9 Feb 2021

A new approach to measuring absolute pitch on a psychometric theory of Isolated Pitch Perception: Is it disentangling specific groups or capturing a continuous ability?

PONE-D-20-04410R2

Dear Dr. Germano,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Karl Bang Christensen, Ph.D.

Academic Editor

PLOS ONE

Additional Comments:

Please edit the manuscript according to these helpful comments from the two reviewers listed below. 

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thanks for the responses to my previous comments. I find that the manuscript is further improved. I only have some suggestions for improvement of language.

1. Line 193. I suggest writing “and the R program” and cite e.g.: R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

2. Line 194-195. I suggest writing: “The former Item Response Theory (IRT) approach …”

3. Line 198: I suggest writing: “Two different IRT models were used”

4. Line 223: I suggest writing: “Pearson’s X2 (S-X2) implemented in the R package mirt, as per Orlando and Thissen [38]”

5. Line 263, I suggest writing: “given that there were three approaches to statistical modeling…”

6. Lines 315-321: I do not agree with your comments on figures 2 and 3. There are indications of both overestimation and underestimation for low score levels. I suggest just stating: “For items c and d – scored with the criterion of perfect rating – misfit is illustrated by comparisons of predicted and observed proportions of correct results (Fig 2 and 3). In particular, higher than expected proportions of correct answers are seen for theta scores a little higher than 1 and for theta scores a little lower than -1.”

7. Line 417, I suggest writing: “Based on model fit information, we conclude that the continuous…”

8. Line 446, I suggest writing: ”Moreover, in a two-parameter IRT model for the perfect scoring approach, all the items showed…”

9. Line 454-455, I suggest dropping the first part of the sentence and just write: “When comparing LCA to IRT …”

10. Line 472-476, You write “If there are participants who are near infallible in isolated pitch recognition tasks, their prevalence will be reduced as the scores increases (i.e., the higher the score, the lower the number of subjects endorsing all the stimuli correctly). However, under the imperfect approach, all the participants that committed semitone errors were separated from the group that committed more broad errors. More research is necessary to examine the causes for the differences in the underlying models” . This can be misunderstood. I suggest writing: “Using the perfect scoring approach, 1.1% of participants had all items correct. According to the IRT model, these participants would be expected to have greater skills in isolated pitch recognition tasks than participants with lower numbers of correct responses. In contrast, for the imperfect scoring approach, the LCA model assumes that 20.9% of participants have high skills in isolated pitch recognition tasks. Within this group further differentiation in skills cannot be made. The 4.9% who had all 10 responses correct using the imperfect scoring approach were just luckier than the remaining 16% in the high-skill group. More research is necessary to examine the causes for the differences in the underlying models.”
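The reviewer's point about the “luckier” perfect scorers under the LCA can be made concrete with a toy calculation. Assuming, purely for illustration, a high-skill class whose item-endorsement probabilities are all 0.85 (hypothetical values, not the paper's estimates), the chance that a member of that class answers all 10 items correctly is modest:

```python
# Under a latent class model, responses are independent given class
# membership, so the probability of a perfect 10-item score for a
# class member is the product of the class's endorsement probabilities.
# The 0.85 values are hypothetical, not the paper's estimates.
endorsement = [0.85] * 10

p_perfect = 1.0
for p in endorsement:
    p_perfect *= p

print(round(p_perfect, 3))  # → 0.197
```

So even within a high-skill class, a perfect score is largely a matter of chance, consistent with the reviewer's reading that the perfect scorers under the imperfect approach were luckier rather than categorically more able than the rest of their class.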

Reviewer #2: The authors have addressed all of my comments, and I would like to thank them for considering the suggestions of the reviewers.

I have no further amendments to request, except some minor editing changes as listed below:

p.8

R is now mentioned – does this need a reference or software version number?

Line 308 states:

‘Table 4 shows the items level fit.’

Suggest this is changed to:

‘Table 4 shows item-level fit’

Line 324 states: ‘This table provides the item level for each item...’

Suggest this is changed to:

‘This table provides the item-level fit values for each item...’

Lines 340-342 state:

‘Considering the perfect approach, the lowest BIC was in favor of an IRT with two parameters. However, for the imperfect approach, the lowest BIC was in favor of an IRT with three parameters.’

Suggest this is changed to:

Considering the perfect approach, the lowest BIC was in favor of an IRT model with two parameters. However, for the imperfect approach, the lowest BIC was in favor of an IRT model with three parameters.

Line 345 states:

‘Therefore, for both perfect and imperfect models, we concluded that the two-parameters models fit better than three-parameters models’

Suggest this is changed to:

Therefore, for both perfect and imperfect models, we concluded that the two-parameter model fits better than the three-parameter model.

Line 404 states:

‘Notably, the red group did not achieve 1, indicating a 100% probability of answering correctly for a giving stimulus’

Suggest this is changed to:

‘Notably, even the red group did not achieve a value of 1 for any of the items, which would indicate a 100% probability of answering correctly for a given stimulus’

Line 409 states:

‘This indicates that the ability to recognize isolated pitches in different timbres and registers without reference is better modeled as a continuous ability when the perfect rating approach is considered in comparison with a categorical and hybrid model’

Suggest this is changed to:

‘This indicates that the ability to recognize isolated pitches in different timbres and registers without reference is better modeled as a continuous ability, rather than when the perfect rating approach is considered with either a categorical or a hybrid model’

Line 446 states:

‘Moreover, in the perfect approach for two parameters, all the items showed high values of discrimination.'

Suggest this is changed to:

‘Moreover, for the two-parameter model of the perfect approach, all the items showed high values of discrimination.'

Line 490 states:

‘Interestingly, we observed that even the group classified as showing a high probability of choosing the correct answer (less than 20% of the 783 participants) across all the stimuli did not display 100% probability of answering correctly.’

Suggest this is changed to:

‘Interestingly, we observed that none of the individual stimuli were answered correctly 100% of the time, even among the group classified as showing a high probability of choosing the correct answer (less than 20% of the 783 participants).’

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Jakob Bue Bjørner

Reviewer #2: No

Acceptance letter

Karl Bang Christensen

12 Feb 2021

PONE-D-20-04410R2

A new approach to measuring absolute pitch on a psychometric theory of Isolated Pitch Perception: Is it disentangling specific groups or capturing a continuous ability?

Dear Dr. Germano:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Karl Bang Christensen

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Data

    (XLSX)

    S1 File

    (DOCX)

    Attachment

    Submitted filename: response_to_reviwers.docx

    Attachment

    Submitted filename: Response_to_Reviewers.docx

    Data Availability Statement

    All relevant data are within the paper and its Supporting Information files.


    Articles from PLoS ONE are provided here courtesy of PLOS
