Author manuscript; available in PMC: 2021 Feb 20.
Published in final edited form as: Cogn Sci. 2020 Sep;44(9):e12883. doi: 10.1111/cogs.12883

STIMULUS PARAMETERS UNDERLYING SOUND-SYMBOLIC MAPPING OF AUDITORY PSEUDOWORDS TO VISUAL SHAPES

Simon Lacey 1,2,4, Yaseen Jamal 5, Sara M List 4,5, Kelly McCormick 4,5, K Sathian 1,2,3,4,5, Lynne C Nygaard 5
PMCID: PMC7896554  NIHMSID: NIHMS1666043  PMID: 32909637

Abstract

Sound symbolism refers to non-arbitrary mappings between the sounds of words and their meanings and is often studied by pairing auditory pseudowords like “maluma” and “takete” with rounded and pointed visual shapes, respectively. However, it is unclear what auditory properties of pseudowords contribute to their perception as rounded or pointed. Here, we compared perceptual ratings of the roundedness/pointedness of large sets of pseudowords and shapes to their acoustic and visual properties using a novel application of representational similarity analysis (RSA). Representational dissimilarity matrices (RDMs) of the auditory and visual ratings of roundedness/pointedness were significantly correlated crossmodally. The auditory perceptual RDM correlated significantly with RDMs of spectral tilt, the temporal fast Fourier transform (FFT) and the speech envelope. Conventional correlational analyses showed that ratings of pseudowords transitioned from rounded to pointed as vocal roughness (as measured by the harmonics-to-noise ratio, pulse number, fraction of unvoiced frames, mean autocorrelation, shimmer and jitter) increased. The visual perceptual RDM correlated significantly with RDMs of global indices of visual shape (the simple matching coefficient, image silhouette, image outlines and Jaccard distance). Crossmodally, the RDMs of the auditory spectral parameters correlated weakly but significantly with those of the global indices of visual shape. Our work establishes the utility of RSA for analysis of large stimulus sets and offers novel insights into the stimulus parameters underlying sound symbolism, showing that sound-to-shape mapping is driven by acoustic properties of pseudowords and suggesting audiovisual cross-modal correspondence as a basis for language users’ sensitivity to this type of sound symbolism.

Keywords: language, multisensory, representational similarity analysis, sound symbolism

1. INTRODUCTION

It is commonly held that arbitrariness is a fundamental property of language, i.e., that the sound structure of a word bears no relation to the thing it describes (de Saussure, 2011; but see Joseph, 2015). Whether this is always the case or whether natural relationships between sound and meaning exist in natural language has been debated since at least the Platonic dialog of Cratylus (Ademollo, 2011). One aspect of language that is non-arbitrary is sound symbolism (Perniss & Vigliocco, 2014), which includes a broad set of phenomena in which there is a perceived resemblance between speech sounds and their referents. Examples include onomatopoeia, in which the sound of a word resembles the sound it represents, e.g., “slap” or “splash” (Catricalà & Guidi, 2015; Schmidtke et al., 2014), and mimetic words in Japanese, e.g., “kirakira” (flickering light: Akita & Tsujimura, 2016). It is important to note, however, that while sound symbolism may be contrasted with arbitrariness, the two are not mutually exclusive and may exist alongside one another in natural language (Lockwood & Dingemanse, 2015).

Sound symbolism often involves examples of crossmodal correspondence, i.e., the near-universally experienced associations between seemingly arbitrary stimulus features in different senses (Spence, 2011). For example, high and low auditory pitch are consistently associated with small and large visual size (Gallace & Spence, 2006; Evans & Treisman, 2010), and with high and low visuospatial elevation respectively (Ben-Artzi & Marks, 1995; Lacey et al., 2016; Jamal et al., 2017). A well-known example of sound-symbolic crossmodal correspondence was first described by Köhler (1929, 1947) in which individuals consistently assigned the pseudoword “maluma” to a curvy, cloud-like shape and the pseudoword “takete” to an angular star-like shape. Such crossmodal sound-symbolic associations occur not only for pseudowords but also for real words, e.g., “balloon” and “spike” for rounded and pointed shapes (Sučević et al., 2015).

Since Köhler’s early work, sound symbolism has been demonstrated across different languages (Blasi et al., 2016), with both similarities (e.g., Davis, 1961) and differences (e.g., Rogers & Ross, 1975; Bremner et al., 2013; Styles & Gawne, 2017) between Western and non-Western cultures. These studies show that sound symbolism in language is both prolific and robust. Further, language users are sensitive to sound-symbolic associations in that they can correctly assign meaning to synonym-antonym pairs in an unfamiliar foreign language at above-chance levels (Nygaard et al., 2009a; Revill et al., 2014; Tzeng et al., 2016). Sound symbolism may also play a role in language processing and early word learning (Imai & Kita, 2014). For example, children of pre-reading age exhibit sensitivity to sound-symbolic crossmodal associations (Imai et al., 2015; Maurer et al., 2006; Ozturk et al., 2013), and recent studies have suggested that sound symbolism is important for specific word-to-meaning associations in young children with limited vocabularies (Gasser, 2004; Tzeng et al., 2017). In adults, sound symbolism may offer linguistic processing advantages for categorization and word learning (Brand et al., 2018; Gasser, 2004; Revill et al., 2018), and for rehabilitation of patients with aphasia (Meteyard et al., 2015). More recently, neuroimaging studies have begun to reveal the neural correlates of sound symbolism (Revill et al., 2014; McCormick et al., 2018; Peiffer-Smadja & Cohen, 2019).

However, it remains an open question whether sound-symbolic correspondences are essentially based in auditory features of words and, if so, what auditory features are mapped onto which visual (or other) features of the referents they sound-symbolically describe (and vice versa). For sound-to-shape mapping, research has largely studied the pointed/rounded dimension, mostly concentrating on phonological features, e.g. consonants vs. vowels (Nielsen & Rendall, 2011; Fort et al., 2015), voiced vs. unvoiced consonants (McCormick et al., 2015; Cuskley et al., 2017), rounded vs. unrounded vowels (Maurer et al., 2006; McCormick et al., 2015), obstruents vs. sonorants (McCormick et al., 2015), or vowel formants1 (Knoeferle et al., 2017). These phonemic feature differences are important: Styles & Gawne (2017) suggest that failures to replicate sound-to-shape mapping cross-culturally occur because the chosen pseudowords did not conform to the sound structure of the language spoken in the target culture. But while different phonemic categories, e.g. consonants vs. vowels or obstruents vs. sonorants, have different acoustic properties, few studies have measured those properties directly in order to assess their contribution to sound symbolism. Of 21 studies of sound symbolism relating to pointedness/roundedness listed by Westbury et al. (2018, Table 1), only Monaghan et al. (2012) and Ozturk et al. (2013) measured acoustic properties (frequency, amplitude, and duration), and then only to confirm whether word groups differed on these rather than to examine their contributions. Nonetheless, such acoustic differences may be important. A notable exception to the list in Westbury et al. (2018) is the study of Parise & Pavani (2011), who measured participants’ vocalizations of a single vowel sound in response to the shape, luminance, or size of visual stimuli.
These vocalizations were louder for complex (dodecahedron) compared to simple (triangle) shapes, and for brighter than darker stimuli, while the frequency of F3 was higher for triangles, a shape that is perhaps more obviously pointed than a dodecahedron. Subsequently, Knoeferle et al. (2017) showed that the frequencies of F2 and F3 are related to perceptual ratings of pointedness/roundedness. Pseudowords with lower/higher F2 were rated as more rounded/pointed respectively while roundedness ratings increased with higher values of F3, which reflects the amount of articulatory lip-rounding (Knoeferle et al., 2017). Note that there is a distinction to be made between how a speaker produces an utterance, i.e., any word can be spoken with higher or lower pitch, and the acoustic characteristics of particular phonemic elements, i.e., rounded vowels will always have a higher F3 compared to unrounded vowels.

Similarly, few studies have measured the properties of the visual shapes employed to study sound-symbolic crossmodal correspondences. To our knowledge, the sole exception is the cross-cultural study of Chen et al. (2016), who created shapes using radial frequency patterns: parametric sinusoidal modulations around the circumference of a circle that varied in frequency (number of modulations per unit length of the circumference), amplitude (magnitude of modulation), and ‘spikiness’ (magnitude of a triangular wave function added to the sinusoid). Interestingly, while all three factors predicted sound-to-shape mapping regardless of culture, North Americans weighted amplitude more heavily than ‘spikiness’ but the reverse was true for Taiwanese participants, perhaps reflecting cultural preferences for analytic and holistic processing respectively (Chen et al., 2016). Although some studies have addressed the visual shape effects of orthography (Cuskley et al., 2017) or typography (De Carolis et al., 2018), these factors are obviously less relevant when pseudowords are presented auditorily.

Here, we investigate both acoustic and visual parameters of large sets of pseudowords (537) and visual shapes (90), respectively, in relation to perceptual ratings of their roundedness/pointedness. To do this, we employ a method novel to studies of sound symbolism: representational similarity analysis (RSA). RSA was originally developed as a method for analyzing functional magnetic resonance imaging (fMRI) data and has also been applied to various kinds of neurophysiological data (Kriegeskorte et al., 2008). In the context of fMRI, RSA compares the pairwise spatial distribution of activity for stimuli across voxels. This spatial pattern should be similar for stimulus pairs that are similar in some respect, e.g., leopards and cheetahs, but dissimilar for stimulus pairs that are not, e.g., leopards and polar bears (both mammalian quadrupeds but differing in size, appearance, taxonomy, and habitat). Computationally, the activity levels for each stimulus are vectorized and the first order pairwise correlation is calculated: similar pairs should be positively correlated and dissimilar pairs should be negatively correlated. Operationally, the results are displayed as a representational dissimilarity matrix (RDM) in which each cell value is 1-r: for very similar, highly positively correlated, pairs this value should approach 0 (1–1 = minimum dissimilarity); for very dissimilar, highly negatively correlated, pairs this value should approach 2 (1− (−1) = maximum dissimilarity). Such RDMs can then be compared, via second-order correlations, to reference RDMs based on, for example, (dis)similarity in habitat or taxonomy, as in our animal categorization example, or formal computational models, in order to test hypotheses about how information is organized in a particular brain region. Our approach here was to calculate RDMs for pseudowords based on ratings of their perceived roundedness/pointedness. 
If both members of a pair of pseudowords are considered rounded, these ratings will be more or less positively correlated; but for a pair containing a rounded and a pointed pseudoword, the ratings will be more or less negatively correlated, reflecting the degree of similarity or dissimilarity, respectively. Similarly, we could compute RDMs based on measurements of acoustic properties of the pseudowords (see below) and compare these to the perceptual ratings by way of second-order correlations between the perceptual and acoustic RDMs. To the extent that perception of the pseudowords as rounded/pointed correlates with an acoustic property, that property can be said to contribute to the sound-symbolic mapping of sound to shape. The same computations and principles apply to ratings and measurements of visual shapes. The advantages of the RSA approach over conventional correlational analyses are, firstly, that it allows analysis of stimulus properties that involve multiple measurements or samples per stimulus, and secondly, that RSA compares every item to every other item, analyzing similarity across and between all possible stimulus pairs. Conventional correlations allow assessment of the association of only a single acoustic measure with perceptual ratings, and consider pairs only on a list-wise basis rather than examining all possible pairs. Thus, RSA allowed us to evaluate whether similarity across and between stimuli for a particular acoustic characteristic mirrored perceptual similarity across and between stimuli.

For the pseudowords, we chose acoustic parameters that would reflect the overall acoustic form of each word, capturing both the acoustic properties associated with phonemic content and aspects of the vocal characteristics of the speaker. We did so because we reasoned that the rating of a pseudoword as rounded or pointed could depend on the acoustic characteristics resulting from the phonemic content of the particular word (e.g., voicing or manner of articulation of a phonetic segment) and/or the vocal properties of the speaker’s voice. For example, Tzeng et al. (2017) found that speakers produced pseudowords referring to bright colors with higher fundamental frequency and amplitude and shorter duration than those for darker colors and that listeners could reliably assign pseudowords to their target color using these prosodic cues. In other words, the acoustic-phonetic instantiation of spoken language depends on both what a speaker says and how they say it (Nygaard et al., 2009b). These two factors may not be easily separable but, depending on the measure, both contribute to the acoustic form of the speech signal. As such, we chose three parameters, speech envelope, spectral tilt, and the temporal fast Fourier transform (FFT), that captured the distribution of amplitude and frequency over time. In addition, we chose parameters that reflect the acoustic consequences of voicing, or the extent to which the vocal folds vibrate creating a voiced or periodic signal. Each measure reflected the amount and regularity of voicing as reflected in periodicity in the speech waveform: the fraction of unvoiced frames (FUF), mean autocorrelation, pulse number, jitter, shimmer, the standard deviation of the fundamental frequency and the mean harmonics-to-noise ratio (HNR). Interestingly, recent work suggests that simple acoustic features such as the amplitude envelope are sufficient to decode cortical responses to speech (Daube et al., 2019).
Full details of the acoustic parameters are provided in Section 2.3.1. Briefly, we expected that parameters that captured low and high frequency information would reflect roundedness and pointedness respectively, as has been demonstrated with low- and high-pitched auditory tones, i.e., non-linguistic stimuli (e.g., Marks, 1987; Walker et al., 2010). In contrast, we expected that parameters capturing spectrotemporal aspects of the speech waveform would reflect roundedness and pointedness to the extent that the waveform reflected a speech pattern that was smooth and continuous as opposed to one that was uneven or contained abrupt transitions, as has also been demonstrated in non-linguistic contexts using sinusoidal and square waveforms (Parise & Spence, 2012) and by varying the ‘roughness’ of electronically produced auditory noise (Liew et al., 2018).

In choosing visual parameters for the shapes, we were somewhat constrained by the fact that the shapes were all irregular, thus it would not be possible to employ radial frequency measures (Wilkinson et al., 1998). We chose the Jaccard distance and the simple matching coefficient (SMC) which essentially measure pairwise shape similarity by the amount of overlap when the shapes are superimposed, together with image silhouette and outline which code object shape either taking account of area (silhouette) or independently of area (outline). Full details of the visual parameters are provided in section 2.3.2; for all these measures we expected that, as the shapes transitioned from rounded to pointed, there would be a graded transition from positive to negative correlation as dissimilarity increased.
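The overlap-based measures described above can be sketched for binary silhouette images. Here is a minimal Python/NumPy illustration (the published analyses were run in MATLAB; the function names and the boolean-mask representation of shapes are our assumptions):

```python
import numpy as np

def smc(a, b):
    """Simple matching coefficient: fraction of pixels that agree
    (both inside the shape or both outside) when the two silhouettes
    are superimposed."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    return float(np.mean(a == b))

def jaccard_distance(a, b):
    """1 - |intersection| / |union| of the two silhouettes; unlike the
    SMC, shared background pixels do not count toward similarity."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union
```

Note that the Jaccard distance ignores shared background, whereas the SMC counts matching background pixels as agreement, so the two measures can rank the same shape pairs differently.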

In order to assess whether an acoustic parameter contributed to perception of a pseudoword as rounded or pointed, we compared the RDM for each parameter to that for the auditory perceptual ratings: a significant correlation between the RDMs would indicate that the parameter influenced the mapping of sound to shape. We carried out the same analysis for visual parameters and perceptual ratings of the shapes. Although ratings of the shapes could not directly influence ratings of the pseudowords (auditory and visual ratings were provided by separate groups: see Methods), we also compared acoustic and visual parameters crossmodally. To the extent that an acoustic parameter was significantly related to roundedness/pointedness ratings of the pseudowords and also, crossmodally to a visual parameter that significantly captured visual roundedness/pointedness, this provided a supplementary confirmation of the acoustic parameter’s relevance to sound-to-shape mapping. We follow this acoustic-visual-crossmodal sequence in the Methods, Results, and Discussion.

2. MATERIALS AND METHODS

2.1. Perceptual ratings

The RSA described in Section 2.2 is based on perceptual ratings of auditory pseudowords collected by McCormick and colleagues, who also created and recorded these stimuli (McCormick et al., 2015), and ratings of visual shapes created and collected by McCormick & Nygaard (unpublished data). The rest of Section 2.1 summarizes the methods for the creation and rating of these two data sets.

2.1.1. Participants

A total of 61 Emory University students (28 male, 33 female; mean age ± standard deviation, 20 ± 4 years) gave informed consent and received course credit for their participation. Thirty participated in the rating task for visual shapes (14 male, 16 female) and a separate 31 participated in the rating task for auditory pseudowords (14 male, 17 female). All participants were native English (American) speakers and reported normal or corrected-to-normal vision and no known hearing, speech, or language disorders. All procedures were approved by the Emory University Institutional Review Board.

2.1.2. Auditory Pseudowords

We used a set of 537 two-syllable pseudowords of the form ‘consonant, vowel, consonant, vowel’ (CVCV) devised by McCormick et al. (2015). These were constructed using only phonemes and combinations of phonemes that occur in the English language, and items deemed to be homophones of real words (33 items out of an original array of 570) were removed. Consonants were sampled from sonorants, fricatives/affricates, and stops; of the obstruents, including fricatives/affricates and stops, half were voiced and half were unvoiced. Vowels were either front/rounded or back/unrounded. The pseudowords were recorded in random order by a female native speaker of American English (KM) with neutral intonation in a sound-attenuated room, using a Zoom 2 Cardioid microphone, and digitized at a 44.1 kHz sampling rate. Two independent judges listened to the recordings in order to assess whether each pseudoword was recorded with neutral intonation, sounded consistent with other recordings (e.g., the pseudoword was not spoken faster/slower or louder than others), and conformed to the target phonemic content. For those pseudowords where the judges agreed that the token did not conform on any aspect, that item was re-recorded and judged again. Items were also re-recorded if the two judges disagreed on any aspect. A total of 54 pseudowords were re-recorded and re-assessed, if necessary multiple times, before being considered acceptable. Each pseudoword was then down-sampled to 22.05 kHz, which is standard for speech, and amplitude-normalized using PRAAT speech analysis software (Boersma & Weenink, 2012). The pseudowords had a mean duration of 457 ± 62 ms. Briefly, McCormick et al. (2015) showed that judgments of ‘roundedness’ for this set of stimuli were more associated with voiced (e.g., /b/, /d/) than unvoiced (e.g., /t/, /k/) consonants, and with back rounded vowels like /u/ or /o/.
Judgments of ‘pointedness’ were more associated with stops like /p/ and /t/ than sonorants like /m/ or /l/, and with front unrounded vowels like /i/ or /e/. It is important to note, however, that the graded nature of the ratings suggests that judgments were based on more than individual phonetic features and likely involved processing/analysis at the segment or even whole-word level (McCormick et al., 2015; see also Thompson & Estes, 2011). For a complete description of the stimulus set, see McCormick et al. (2015).
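The CVCV construction can be illustrated with a short Python sketch. The phoneme inventories below are small hypothetical samples for illustration only, not the study's actual sets (which used phonetic criteria, removed homophones, and yielded 537 items from an original 570):

```python
from itertools import product

# Hypothetical, illustrative phoneme samples (NOT the study's inventories):
# consonants drawn from sonorants, fricatives, and stops; vowels drawn
# from rounded and unrounded categories.
consonants = ["m", "l", "b", "g", "t", "k"]
vowels = ["u", "o", "i", "e"]

# Every CVCV combination of these samples.
cvcv = ["".join(p) for p in product(consonants, vowels, consonants, vowels)]
# 6 * 4 * 6 * 4 = 576 candidate items with these toy inventories; the
# study's own pipeline would then filter out homophones of real words.
```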

2.1.3. Visual shapes

We used 90 shapes, consisting of gray line drawings (RGB: 240, 240, 240) on a white background, created in Adobe Illustrator (Ventura, CA: McCormick et al., unpublished data; McCormick et al., 2018: see Fig. 1 for examples) following a method similar to that of Monaghan et al. (2012). Shapes had four, five, or six protuberances and were constructed using a template of three concentric circles (25, 35, and 45 mm radii), the outer circle serving as a bounding border with protuberances, either rounded or pointed, extending to its perimeter. The two inner circles served to define the inward extent of each protuberance. Thinner protuberances (30 shapes) extended all the way to the innermost circle; thicker protuberances (30 shapes) extended only to the middle circle; the remaining 30 shapes were constructed with a mix between thin and thick protuberances. For each shape of one category (rounded or pointed), there was a corresponding shape in the other category with the same outer and inner anchor points, resulting in fifteen thick, fifteen thin, and fifteen mixed shapes in each category.

Figure 1.

Analysis pipeline. Step 1: perceptual ratings of roundedness/pointedness for pseudowords and shapes were used to create reference RDMs. Step 2: crossmodal comparison of RDMs for perceptual ratings of pseudowords and shapes. Step 3: within-modal comparison of RDMs for perceptual ratings to those for acoustic and visual parameters of pseudowords and shapes respectively. Step 4: crossmodal comparison of RDMs for selected acoustic and visual parameters.

2.1.4. Perceptual rating tasks

Participants were randomly assigned to rate either pseudowords or shapes using one of two 7-point Likert-type scales. In order to avoid response bias, one of the scales rated roundedness from 1 (not rounded) to 7 (very rounded) and the other rated pointedness from 1 (not pointed) to 7 (very pointed). For pseudowords, 15 participants used the roundedness scale and 16 the pointedness scale (n = 31). To discourage participants from matching pseudowords with a specific word in the instructions (e.g. ‘teti’ and ‘pointed’), the instructions included several related terms for the concepts of rounded and pointed. For the shapes, 17 participants used the roundedness scale and 13 the pointedness scale (n = 30).

The auditory pseudowords were presented over Beyerdynamic DT100 headphones at approximately 75 dB SPL. The visual shapes were presented sequentially at the center of a desktop computer screen using E-Prime software Version 2.0.8.22 (Schneider et al., 2002). For both pseudowords and shapes, the 7-point rating scale appeared on the screen on each trial, either in the center of the screen for pseudowords or below each shape. The response keyboard always had 1–7 listed from left to right. All stimuli were presented only once and in random order.

2.2. Representational Similarity Analysis

We implemented RSA in MATLAB 2016a (The MathWorks, Natick MA). In outline, we created reference RDMs for the pseudowords and shapes from the perceptual ratings of their roundedness and pointedness. We then compared these, via second-order correlations, both to each other and to RDMs derived from measurements of selected acoustic and visual parameters (see Section 2.5 for details of these) in order to assess how these parameters related to perception of roundedness and pointedness. We performed this latter step both within-modally (e.g., comparing perceptual ratings of the visual shapes to visual parameters) and crossmodally for selected parameters (i.e., comparing visual parameters to acoustic parameters). A schematic of the analysis pipeline is shown in Fig. 1 and we describe each step in more detail below.

As a first step, we created reference RDMs for pseudowords and shapes based on the perceptual ratings of their roundedness and pointedness. In these matrices, items were ordered left to right from the most rounded to the most pointed based on the mean rating for each item. In order to achieve this, one of the two rating scales was recoded so that 1 was equal to ‘not pointed’ on one scale and ‘very rounded’ on the other and 7 was equal to ‘very pointed’ on the first scale and ‘not rounded’ on the second, i.e., the ‘roundedness’ scale was recoded to the ‘pointedness’ scale. Since the two scales are in opposition to each other, they should be strongly negatively correlated and this was, in fact, the case (pseudowords: r(535) = −.65, p < .001; shapes: r(88) = −.96, p < .001; Fig. 2). Thus, the two scales were comparable and recoding to a single scale was justified. The correlation was stronger for shapes than for pseudowords, presumably because visual ratings assess visual roundedness/pointedness directly, whereas roundedness/pointedness is primarily a visual property that auditory ratings of the pseudowords can access only indirectly. (With the exception of the pseudoword pointed scale, ratings were non-normally distributed, but testing these relationships with the non-parametric Spearman correlation produced the same pattern of results.)
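The recoding of the 7-point roundedness scale onto the pointedness scale amounts to reversing the scale. A minimal Python sketch (the function name is ours):

```python
def recode_roundedness_to_pointedness(rating):
    """Reverse a 7-point roundedness rating so that 7 ('very rounded')
    maps to 1 (the most-rounded end of the common scale) and
    1 ('not rounded') maps to 7 (the most-pointed end)."""
    assert 1 <= rating <= 7
    return 8 - rating
```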

Figure 2.

The roundedness and pointedness scales were strongly negatively correlated indicating that ratings of pseudowords (left) and shapes (right) were comparable across the two scales.

Once the pseudowords and shapes had been ordered in this way, the RDMs were constructed using the original un-recoded data since the RDMs reflected how the patterns of ratings were dissimilar across items regardless of the rating scale that any individual participant used. In order to create the reference RDMs (Fig. 1, Step 1), we calculated the first-order correlation (Pearson’s r) between the perceptual ratings for each pair of pseudowords or shapes: pairwise dissimilarity is given by 1-r and this is the value entered in each cell of the RDM. Having created these reference RDMs, we could compare them to each other, via a second-order, non-parametric correlation (Spearman’s r [rs]: Fig. 1, Step 2)2, to assess the extent to which the perceptual rating matrices were crossmodally consistent.
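Steps 1 and 2 can be sketched in Python/NumPy (the paper's implementation was in MATLAB; these function names are ours): a first-order RDM of 1 − r values computed from per-item rating vectors, followed by a second-order Spearman correlation over the off-diagonal cells of two RDMs.

```python
import numpy as np
from scipy.stats import spearmanr

def rating_rdm(ratings):
    """ratings: items x raters array. Returns an items x items RDM in
    which each cell is 1 - Pearson r between the two items' rating
    vectors (0 = identical patterns, 2 = perfectly opposed)."""
    return 1.0 - np.corrcoef(ratings)

def compare_rdms(rdm_a, rdm_b):
    """Second-order Spearman correlation between two same-sized RDMs,
    computed over the upper-triangle (off-diagonal) cells only, since
    the diagonal is zero by construction and the matrices are symmetric."""
    iu = np.triu_indices_from(rdm_a, k=1)
    rho, p = spearmanr(rdm_a[iu], rdm_b[iu])
    return rho, p
```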

The next stage was to create RDMs reflecting the pairwise dissimilarity for acoustic parameters of the pseudowords and visual parameters of the shapes and to compare these to the RDMs for the auditory and visual perceptual ratings, respectively (Fig. 1, Step 3). These second-order correlations would enable us to see, for example, which acoustic parameters might contribute to perception of the pseudowords as rounded or pointed. Full details of the acoustic and visual parameters and the calculation of their RDMs are provided in Section 2.3.

Finally, to the extent that RDMs for the acoustic and visual parameters were significantly correlated with those for auditory and visual perceptual ratings, respectively, we could compare the RDMs of those parameters crossmodally (Fig. 1, Step 4). This comparison served two purposes. Firstly, because the auditory and visual perceptual ratings were carried out by independent groups of participants, it was possible that each group judged pointedness and roundedness on a different basis. If this were so, the RDMs of the acoustic and visual parameters would not necessarily be correlated with each other crossmodally. But if both groups were employing a common perceptual framework in the rating tasks regardless of modality, then the RDMs of parameters that correlated with the RDMs of perceptual ratings should also be crossmodally correlated. Secondly, and relevant to the study aims, in this data-driven approach a further test of whether an acoustic parameter is a likely candidate to drive sound-symbolic mapping of sound to shape would be that its RDM is correlated not only with the RDM for perceptual ratings of the pseudowords as rounded/pointed, but also crossmodally with the RDM for a visual parameter that predicts perceptual ratings of roundedness/pointedness for the shapes. Note that the reference RDM for pseudowords is a 537 × 537 matrix while that for shapes is 90 × 90. In order to perform the crossmodal second-order correlation, the matrices must be the same size and therefore we down-sampled the pseudoword matrix by selecting every 6th word to create a 90 × 90 matrix (the number of samples per item remained unchanged, i.e. 31 rating scores).
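The down-sampling of the 537 × 537 pseudoword RDM to 90 × 90 amounts to selecting every 6th row and column of the rank-ordered matrix. A Python/NumPy sketch (the paper's implementation was in MATLAB; the function name is ours):

```python
import numpy as np

def subsample_rdm(rdm, step=6):
    """Select every `step`-th item (row and column) of a square RDM,
    preserving the items' rank ordering. For a 537-item RDM and
    step=6 this yields a 90 x 90 sub-matrix."""
    idx = np.arange(0, rdm.shape[0], step)
    return rdm[np.ix_(idx, idx)]
```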

2.3. Stimulus parameters

2.3.1. Acoustic parameters of pseudowords

As noted in the Introduction, we chose acoustic parameters that would reflect the overall acoustic form of each pseudoword, capturing the acoustic consequences of both their phonemic content and aspects of the vocal characteristics of the speaker. Therefore, we chose the speech envelope, spectral tilt, and temporal FFT, since these capture the distribution of both amplitude and frequency over time. Additionally, we chose parameters that reflect the proportion and regularity of voicing as reflected in acoustic periodicity in the speech waveform. The FUF, mean autocorrelation, and pulse number reflect the proportion of voiced segments and the remaining parameters were chosen to reflect the regularity of voicing or voice quality during the production of each stimulus item: jitter, shimmer, the standard deviation of the fundamental frequency (the speech analysis software PRAAT (Boersma & Weenink, 2012) refers to this as ‘pitch standard deviation’ and we adopt this term here), and the mean HNR. Each parameter is described in detail below. Note that these parameters are not necessarily independent of each other (for example, both FUF and pulse number reflect how often the vocal folds open and close). Additionally, although some parameters are most often studied in the context of voice pathology, the pseudowords were recorded by a speaker with a healthy voice and these parameters can certainly vary in a healthy voice (see Brockman et al., 2011). It should also be noted that, given the nature of each particular parameter or property, the number of measurements used to calculate the first-order correlation differed across the acoustic parameters.
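Two of the voicing-regularity measures have simple definitions: local jitter is the mean absolute difference between consecutive glottal periods divided by the mean period, and local shimmer is the analogous quantity for successive peak amplitudes. A Python sketch of these standard 'local' definitions (PRAAT offers several variants and computes these from its own pulse detection; this is an illustration, not the paper's code):

```python
import numpy as np

def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive
    glottal-cycle periods, divided by the mean period. 0 for a
    perfectly regular voice; larger values indicate rougher voicing."""
    p = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(p))) / np.mean(p)

def local_shimmer(amplitudes):
    """Local shimmer: mean absolute difference between consecutive
    cycle peak amplitudes, divided by the mean amplitude."""
    a = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(a))) / np.mean(a)
```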

For speech envelope, spectral tilt, and temporal FFT, we normalized the duration of all pseudowords to the mean of 457 ms, by removing and interpolating data points from longer and shorter items respectively, using the resampling function in MATLAB. Although this was a necessary step in order to achieve common vector lengths for parameter estimation, it necessarily introduced some noise; however, since we resampled to the mean duration, the introduced noise would be proportional in magnitude to the standard deviation of the duration, which was small (standard deviation/mean = 62/457 ms, i.e. 13.5%). At the original sampling rate of 22050 Hz (see Section 2.1.2), this gave a vector of 10077 data points (22050 × 0.457 ≈ 10077). This enabled us to obtain a common vector length for calculating these parameters across pseudowords that varied in duration and therefore equal numbers of data points per pseudoword for the pairwise correlations that form the RDMs. However, for these parameters, measurements were taken from that vector in different ways, e.g. different window lengths, such that the number of measurements underlying the pairwise correlations for each of these parameters differed (see Supplementary Material). Speech envelope, spectral tilt, and temporal FFT were calculated in MATLAB 2016a while the remaining acoustic parameters were measured using the standard voice report settings in PRAAT (Boersma & Weenink, 2012); the RDMs were prepared using MATLAB 2016a. Speech envelope, spectral tilt, and the temporal FFT were all based on multiple samples for each pseudoword and therefore pairwise first-order correlations could be calculated at the item level resulting in a 537 × 537 matrix that could be compared directly to the 537 × 537 perceptual matrix.
However, all the other acoustic parameters were expressed as a single value per pseudoword and therefore, in order to compute the first-order correlations, these single values were binned into an 18 × 18 matrix with 30 pseudowords per cell (comparable to the 31 participants who provided the perceptual ratings). For the second-order correlations for these parameters, the RDM for the perceptual ratings was similarly created by binning the mean rating for each pseudoword into an 18 × 18 matrix, also with 30 pseudowords per cell.
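The duration-normalization step described above can be sketched in a few lines. The study used MATLAB's resampling function, so the numpy linear interpolation and the helper name `normalize_duration` below are illustrative stand-ins, not the actual pipeline:

```python
import numpy as np

SR = 22050                       # original sampling rate (Hz)
TARGET_LEN = round(SR * 0.457)   # 10077 samples at the 457 ms mean duration

def normalize_duration(waveform):
    """Map a pseudoword waveform onto the common 457 ms time grid.

    Linear interpolation is used here as a simple stand-in for
    MATLAB's resampling function.
    """
    old_t = np.linspace(0.0, 1.0, len(waveform))
    new_t = np.linspace(0.0, 1.0, TARGET_LEN)
    return np.interp(new_t, old_t, waveform)
```

Applied to every item, this yields equal-length vectors (10077 points each), so pairwise correlations across pseudowords of different original durations become well-defined.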

Speech envelope:

This is a measure of the amplitude profile across time which primarily reflects changes corresponding to phonemic properties and syllabic transitions (Aiken & Picton, 2008). A visual depiction of the speech envelope can capture the ‘shape’ of the sound by showing these transitions. To the extent that transitions are abrupt, the amplitude profile will appear uneven or jagged, which should be associated with pointed pseudowords; to the extent that they are more gradual, the profile will appear smoother and more continuous, which should be associated with rounded pseudowords. This expectation parallels the findings of Thoret et al. (2014), who showed that participants could retrieve visual shape from the friction sounds produced when a shape was drawn; compare also, for example, the left and right panels of Fig. 5C, which display speech envelopes for rounded and pointed pseudowords, respectively.
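The paper does not specify an envelope-extraction algorithm, so the sketch below uses one common minimal approach under that assumption: full-wave rectify the waveform, then smooth it with a short moving-average window (the 10 ms window is an arbitrary illustrative choice, not the study's setting):

```python
import numpy as np

def speech_envelope(waveform, sr=22050, win_ms=10):
    """Crude amplitude envelope: rectify, then smooth with a short
    moving average (a stand-in for the low-pass or Hilbert-transform
    methods typically used for envelope extraction)."""
    win = max(1, int(sr * win_ms / 1000))
    kernel = np.ones(win) / win
    return np.convolve(np.abs(waveform), kernel, mode="same")
```

On this definition, abrupt syllabic transitions show up as jagged excursions in the returned vector, while gradual transitions yield a smooth contour.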

Figure 5.


(A) Spectral tilt, the overall slope of power spectral density, for the rounded pseudoword ‘mumo’ (left) and the pointed pseudoword ‘kete’ (right). Spectral tilt is steeper for ‘mumo’ where power is concentrated in low frequency bands, but flatter for ‘kete’ as power migrates to high frequency bands. (B) The waveform (top panels) and spectrogram (bottom panels) illustrate aspects of the temporal FFT; the spectrogram captures more abrupt changes in power, especially at higher frequencies, for ‘kete’ compared to ‘mumo’. (C) Speech envelope for ‘mumo’ is continuous and smoother compared to ‘kete’ which is discontinuous and uneven (similar to the waveform for these pseudowords in B: top panels). All examples produced using PRAAT speech analysis software (Boersma & Weenink, 2012).

Spectral tilt:

This gives an estimate of the overall slope of the power spectrum sampling over the duration of the utterance. Spectral tilt occurs because high frequencies typically have less power than low frequencies and therefore the power spectrum slopes downward from low to high frequencies. Flattening spectral tilt, i.e., migrating power to high frequency bands, improves the intelligibility of speech in noise (Lu & Cooke, 2009). Spectral tilt may relate to roundedness/pointedness in that a steep slope, in which power is concentrated in the low frequency bands, is more likely to reflect sonorants and back rounded vowels that are associated with roundedness (McCormick et al., 2015). However, the slope should flatten out for pseudowords containing obstruents and/or front unrounded vowels associated with pointedness (McCormick et al., 2015) as power migrates to the higher frequencies associated with these phonemic properties.
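A minimal sketch of how spectral tilt could be estimated, assuming a straight-line fit to the log power spectrum over a nominal 50 Hz–5 kHz speech band; the band limits, units, and fitting details are illustrative assumptions, not the study's exact procedure:

```python
import numpy as np

def spectral_tilt(waveform, sr=22050):
    """Slope of a straight line fit to the log power spectrum
    (dB per kHz here): a steeper negative slope means power is
    concentrated at low frequencies."""
    power = np.abs(np.fft.rfft(waveform)) ** 2
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sr)
    keep = (freqs > 50) & (freqs < 5000)      # nominal speech band
    power_db = 10 * np.log10(power[keep] + 1e-12)
    slope, _ = np.polyfit(freqs[keep] / 1000.0, power_db, 1)
    return float(slope)
```

On this definition, a pseudoword dominated by low-frequency energy (sonorants, back rounded vowels) yields a more negative slope than one dominated by high-frequency energy (obstruents, front unrounded vowels).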

Temporal FFT:

The FFT converts temporal or spatial signals into the corresponding frequency domain. The FFT analysis of temporal data, such as the acoustic speech signal in our pseudowords, derives the frequency components of that signal, some with more energy than others, and can be calculated over the duration of the sound signal. Thus, this parameter reflects the power spectrum of the frequency composition across time (Singh, 2015). To the extent that there is more power at the lower/higher frequencies, associated with roundedness/pointedness respectively, the temporal FFT should reflect the shape associations of the pseudowords.
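The notion of a power spectrum tracked across time can be illustrated with a simple short-time FFT; the window and hop sizes below are arbitrary choices for the sketch, not the study's settings:

```python
import numpy as np

def temporal_fft(waveform, sr=22050, win=512, hop=256):
    """Short-time power spectrum: FFT in successive overlapping
    windows, giving the frequency content of the signal as it
    evolves over time (what a spectrogram displays)."""
    frames = []
    for start in range(0, len(waveform) - win, hop):
        frame = waveform[start:start + win] * np.hanning(win)
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    return np.array(frames)   # shape: (time frames, win // 2 + 1 freqs)
```

Rounded pseudowords would be expected to concentrate power in the low-frequency columns of this matrix, pointed pseudowords in the high-frequency columns, with more abrupt frame-to-frame changes.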

Fraction of unvoiced frames:

This is a measurement of voice stability over time, with the number of unvoiced elements expressed as the percentage of measurement windows that do not engage the vocal folds (Boersma & Weenink, 2012). FUF depends on the phonemic content of an utterance, particularly when measured across the duration of the pseudoword, and will obviously increase for utterances that include unvoiced elements like obstruents and decrease for those containing voiced elements, typically long vowels (Mezzedimi et al., 2017). Since auditory ‘roundedness’ is more associated with voiced than unvoiced elements (McCormick et al., 2015), we would expect FUF to increase as ratings of pseudowords transition from rounded to pointed.

Mean autocorrelation:

This is a measure of the similarity, or correlation, between a sound and a delayed copy of itself. As such, it is a measure of the periodicity of a signal, wherein 0 indicates a white noise signal and 1 a perfectly periodic signal (Boersma & Weenink, 2012). When a single phoneme is sustained, for example a long vowel like ‘ooo’ or a consonant like ‘mmm’, each successive segment should sound very similar to the one before, i.e. they should be highly correlated. Higher autocorrelation values indicate a smoother voice pattern and/or more voiced segments, which should be reflected in roundedness ratings, while lower values indicate an uneven pattern and/or fewer voiced or periodic segments, which should be reflected in ratings of pointedness.
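The underlying computation can be illustrated directly; `normalized_autocorr` is a hypothetical helper rather than PRAAT's implementation, which estimates periodicity more robustly across candidate lags:

```python
import numpy as np

def normalized_autocorr(x, lag):
    """Correlation of a signal with a copy of itself delayed by `lag`
    samples: near 1 for a periodic (voiced) signal when `lag` matches
    its period, near 0 for white noise."""
    a, b = x[:-lag], x[lag:]
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))
```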

Pulse number:

This is the number of glottal pulses, i.e. opening and closing of the vocal folds, during production of vowels or voiced consonants measured across the whole utterance (Boersma & Weenink, 2012). In order to understand how this manifests in the voice, it is necessary to consider an extreme form of phonation, known as pulse register phonation, in which rapid glottal pulses are followed by a long closed phase (Hollien et al., 1977; Whitehead et al., 1984). The auditory perception of this vocal register has been described as a ‘creaky voice’ (Ishi et al., 2008) or – onomatopoeically – as a ‘glottal rattle’ (Hornibrook et al., 2018). As such, it is a measure of vocal roughness or unevenness; lower pulse numbers indicate a rougher, more uneven voice pattern, and/or fewer voiced segments, which should be associated with pointed pseudowords, while higher pulse numbers indicate a smoother voice pattern, and/or more voiced segments, which should be associated with rounded pseudowords.

Jitter:

This is a measure of voice quality that indexes variation in the vibration of the vocal cords (Teixeira & Fernandes, 2014). Jitter is defined as the cycle-to-cycle variation in the period of the waveform, expressed as a percentage; here, we calculated local jitter, the mean absolute difference in duration between consecutive periods of the speech waveform divided by the mean period and expressed as a percentage (Boersma & Weenink, 2012). Perceptually, high values of jitter manifest as a ‘breaking’ or rough voice, i.e., one that varies in the consistency and length of each period of the waveform corresponding to each opening and closing of the vocal cords. Jitter is typically measured for long vowel sounds, where little frequency variation would be expected, and therefore high levels of jitter indicate voice pathology (Teixeira & Fernandes, 2014). Jitter and shimmer (see below) have also been associated with changes in emotion and stress in speech (Van Puyvelde et al., 2018), suggesting that this acoustic measure can convey non-linguistic information. In the production of the pseudowords, cycle-to-cycle frequency variation or jitter should increase from rounded to pointed pseudowords, reflecting increased vocal instability or variation and perceived roughness.

Shimmer:

In contrast to jitter, shimmer is a measure of voice quality that indexes period-to-period variation in amplitude (Brockmann et al., 2011). While minor variations in amplitude are normal, substantial variability can indicate voice pathology stemming from glottal resistance, i.e. stiffness of the vocal cords, which manifests as breathiness or hoarseness (Brockmann et al., 2011; Teixeira & Fernandes, 2014). Since shimmer reflects vocal instability, low shimmer manifests in a smooth speech pattern whereas high shimmer results in an uneven speech pattern; these should be associated with roundedness and pointedness, respectively. Here, we measured local shimmer, defined as the mean absolute difference in amplitude between consecutive periods divided by the mean amplitude and expressed as a percentage (Boersma & Weenink, 2012).
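Given the period durations and peak amplitudes of successive glottal cycles (which PRAAT extracts from the waveform), local jitter (as defined in PRAAT's voice report: mean absolute difference between consecutive periods, divided by the mean period) and the local shimmer defined above reduce to simple ratios. This sketch assumes those per-cycle measurements are already available:

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive period durations,
    as a percentage of the mean period."""
    diffs = np.abs(np.diff(periods))
    return 100.0 * float(diffs.mean() / np.mean(periods))

def local_shimmer(amplitudes):
    """Mean absolute difference between consecutive peak amplitudes,
    as a percentage of the mean amplitude."""
    diffs = np.abs(np.diff(amplitudes))
    return 100.0 * float(diffs.mean() / np.mean(amplitudes))
```

A perfectly regular voice gives zero for both; alternating cycle lengths or amplitudes, as in a rough or breaking voice, inflate the values.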

Pitch standard deviation (PSD):

The PSD indicates the variation in the fundamental frequency present in the speech signal. This is a measure of vocal inflection with low PSD manifesting as a level, monotone voice and high PSD as a ‘lively’ voice (Kliper et al., 2016). As such, PSD can reflect an individual’s emotional state (Kliper et al., 2016), suggesting that this acoustic measure is also capable of conveying non-linguistic information. For present purposes, low and high PSD/vocal inflection should indicate roundedness and pointedness, respectively.

Mean harmonics-to-noise ratio (HNR):

This parameter measures the ratio between the dominant periodic, or harmonic, element of the speech signal and the aperiodic, or noise, element, thus providing an estimate of the overall periodicity of the sound, expressed in dB (Teixeira & Fernandes, 2014). The noise element arises from turbulent airflow at the glottis when the vocal cords do not close properly (Ferrand, 2002). As the noise element increases, and therefore mean HNR decreases, the voice becomes increasingly hoarse or quavery (Ferrand, 2002). In other words, as mean HNR decreases, the speech pattern becomes progressively less smooth and more uneven, reflecting a transition from roundedness to pointedness.

2.3.2. Visual parameters of shapes

As mentioned in the Introduction, the choice of visual parameters of the shapes was constrained by the fact that the shapes were all irregular; thus, we were unable to employ radial frequency measures (Wilkinson et al., 1998). For irregular shapes, one option would be to adopt the particle morphology measure of ‘roundness’ used in geology to classify grain shape by curve fitting (Folk, 1965; Boggs, 2009)3. However, while this would be possible here for the rounded shapes, it would not be meaningful for the pointed shapes because their protuberances end in a single point, i.e. curvature is zero. In practice, geologists generally classify particles by reference to visual analog scales (Folk, 1965, p. 10), highly similar to the approach we took here for perceptual ratings. More recently, particle morphology has been assessed using Fourier analyses (Boggs, 2009) similar to the spatial FFT described below. As a first step in calculating the visual parameters, we removed excessive background from the images to arrive at the smallest area that contained all the shapes overlaid on one another. This area was 200 × 200 pixels, giving 40,000 data points for each shape for all visual parameters (see Supplementary Figs. 1 and 2). Note that the visual parameters are indices of global shape, whereas the acoustic parameters reflect specific acoustic aspects.

Jaccard distance:

The Jaccard distance is a measure of dissimilarity between two sets or items that uses a present/absent coefficient (Ricotta & Pavoine, 2015) and has been used as a measure of shape similarity (e.g., Devaprakash et al., 2019; Davico et al., 2019). The Jaccard similarity coefficient, J (Jaccard, 1901), can be interpreted as the intersection of the two shapes divided by their union, i.e., the more the two shapes overlap when superimposed on each other, the larger their intersection and the greater the similarity coefficient. The starting point is to designate pixels in the shape as 1 and pixels in the background as 0 (e.g., Devereux et al., 2013) and to calculate J for each pair of shapes. The coefficient J is given by a/(a+b+c) where a = pixels present in both shapes, b = pixels present in the first shape but not the second, and c = pixels present in the second shape but not the first. The Jaccard distance is then 1-J and in constructing the RDM this pairwise measure replaces 1-r.

Simple matching coefficient:

The SMC also reflects shape similarity, being calculated in the same way as the Jaccard similarity coefficient except that it includes an additional term, d, representing pixels that are absent from both shapes in the particular pair under consideration but present in other shapes in the set (Ricotta & Pavoine, 2015). This term appears in both numerator and denominator, such that the SMC is given by (a+d)/(a+b+c+d). The RDM is constructed by replacing 1-r with the pairwise 1-SMC as a measure of dissimilarity. By taking account of pixels that are present in other shapes in the set, the SMC provides a measure of the similarity of particular shapes not only to each other but also in relation to the remaining shapes in the set.
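Both coefficients can be computed from binarized shape masks (1 = shape pixel, 0 = background); this sketch takes the full image frame as the universe of pixels for the d term:

```python
import numpy as np

def jaccard_and_smc_distances(shape1, shape2):
    """Return (Jaccard distance, 1 - SMC) for two binary shape masks."""
    s1, s2 = shape1.astype(bool), shape2.astype(bool)
    a = int(np.sum(s1 & s2))      # pixels present in both shapes
    b = int(np.sum(s1 & ~s2))     # present in the first shape only
    c = int(np.sum(~s1 & s2))     # present in the second shape only
    d = int(np.sum(~s1 & ~s2))    # absent from both (frame background)
    j = a / (a + b + c)           # Jaccard similarity coefficient
    smc = (a + d) / (a + b + c + d)
    return 1 - j, 1 - smc
```

For the 200 × 200 masks used here, two identical shapes give distances of (0, 0); because d counts shared background, 1 − SMC is generally smaller than the Jaccard distance for the same pair.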

Image silhouette:

Image silhouettes enable us to compare shapes on the basis of low-level visual feature information (e.g., Kriegeskorte et al., 2008; Devereux et al., 2013). Images are binarized such that pixels in the shape = 1 and pixels in the background = 0 (Devereux et al., 2013), i.e. essentially separating figure from ground. Thus, this parameter explicitly codes roundedness and pointedness and, by including pixels within the shape, also accounts for area. The resulting 2-D information is reduced to a single vector for each image (for example, the vectors 1,0,1,0,1…n and 1,1,1,1,0…n describe different shapes by recording the presence/absence of shape pixels at specific positions in the vector) and the pairwise correlation of these vectors forms the basis of 1-r in the RDM.

Image outlines:

In this case, perimeter pixels forming the shape outline = 1 and all other pixels = 0. The results are again vectorized for each image and the pairwise correlation of these vectors forms the basis of 1-r in the RDM. In contrast to the image silhouette parameter, the image outline provides an index of roundedness and pointedness independent of area by focusing only on the perimeter pixels.
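For both the silhouette and outline parameters, the pairwise RDM entry reduces to 1 − r between vectorized binary images, for example:

```python
import numpy as np

def dissimilarity_1_minus_r(mask1, mask2):
    """Vectorize two binary images and return 1 - Pearson r, the
    pairwise dissimilarity entering the silhouette/outline RDMs."""
    v1 = mask1.ravel().astype(float)
    v2 = mask2.ravel().astype(float)
    r = np.corrcoef(v1, v2)[0, 1]
    return float(1.0 - r)
```

Identical masks give 0 and perfectly complementary masks give 2, matching the 0–2 dissimilarity range noted for these parameters.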

Spatial FFT:

The spatial FFT is based on greyscale value variations at each point in space, i.e. here, at each pixel across the whole image, and captures how often these variations repeat per unit of distance, i.e. their spatial frequency. Thus, analogously to the temporal FFT, this parameter reflects the power spectrum of frequency distribution across space: concentrations of power at low or high frequencies should reflect roundedness or pointedness, respectively.
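A minimal version of the 2-D power spectrum underlying this parameter; partitioning the power into low- and high-spatial-frequency bands, as a roundedness index would require, is omitted from the sketch:

```python
import numpy as np

def spatial_power_spectrum(image):
    """2-D FFT power spectrum of a greyscale image; after fftshift,
    low spatial frequencies sit at the centre of the returned array
    and high spatial frequencies at the edges."""
    f = np.fft.fftshift(np.fft.fft2(image))
    return np.abs(f) ** 2
```

A smooth, rounded shape should concentrate power near the centre of this array, while sharp protuberances spread power outward to high spatial frequencies.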

Note that the RDMs for image silhouette, outlines, and the spatial FFT are calculated using the pairwise first-order Pearson correlation and therefore 1-r can range from 0 to 2, whilst the simple matching coefficient and Jaccard distance RDMs are calculated with coefficients that cannot exceed 1 and therefore 1-J and 1-SMC range from 0 to 1.

3. RESULTS

3.1. Crossmodal comparison of visual and auditory perceptual ratings

Examination of the RDMs for the perceptual ratings of the pseudowords and shapes suggests that roundedness/pointedness ratings for the pseudowords were relatively graded: while some were rated as very rounded or pointed, leading to high similarity values at either end of the diagonal and a cluster of high dissimilarity values at the corners, there was a wide range of pseudowords rated at intermediate points on the scale, leading to a range of dissimilarity values (Fig. 3, left panel). By contrast, roundedness/pointedness ratings for the shapes were essentially binary (Fig. 3, right panel): participants largely rated the shapes as either highly rounded or highly pointed with few shapes considered intermediate, leading to dissimilarity values that tended to be uniformly high or low. Despite this apparent qualitative difference, there was a significant positive second-order correlation between the RDMs for the auditory and visual perceptual ratings (rs(4003) = .64, p < .0001), indicating that the ratings were crossmodally consistent even though they were made by independent groups of participants. This is important because the auditory pseudowords were rated for a property that is primarily defined visually and therefore crossmodal consistency was not guaranteed a priori. In the present context, this crossmodal consistency serves as a verification of sound-symbolic associations between the auditory pseudowords and visual shapes used.

Figure 3.


RDMs for perceptual ratings of roundedness/pointedness for auditory pseudowords (left) and visual shapes (right). Items in each RDM are ordered left to right from most rounded to most pointed. Color bar shows pairwise dissimilarity where 0 = zero dissimilarity (items are identical) and 2 = maximum dissimilarity (items are completely different). Auditory and visual RDMs were significantly positively correlated (rs(4003) = .64, p < .0001).

3.2. Comparison of auditory perceptual ratings to acoustic parameters of pseudowords

To determine which of the selected acoustic parameters were related to the auditory perceptual ratings of roundedness/pointedness, we computed the second-order correlation between the RDM for auditory perceptual ratings and that for each of the acoustic parameters (Fig. 1, Step 3). After correction for multiple comparisons (Bonferroni-corrected α for 10 tests = .005), there were significant positive correlations for spectral tilt (rs(143914) = .43, p < .0001), the temporal FFT (rs(143914) = .25, p < .0001), and speech envelope (rs(143914) = .14, p < .0001): Fig. 4 shows the RDMs for these, illustrating the 537 × 537 matrices for the entire pseudoword set. The mean autocorrelation (rs(151) = −.2, p = .02) and mean HNR (rs(151) = −.16, p = .04) were also (negatively) correlated, but these correlations did not survive Bonferroni correction (Supplementary Fig. 3 shows the down-sampled 18 × 18 RDMs for these). The remaining acoustic parameters were uncorrelated (rs(151) = −.05 to −.1, all p > .1: Supplementary Fig. 3). As the Bonferroni correction method is relatively conservative, minimizing Type I error, we also tested for significance with the less restrictive modified Bonferroni correction suggested by Holm (1979), but the pattern of results was unchanged, indicating that Type II error was unlikely.

Figure 4.


RDMs for (A) auditory perceptual ratings of roundedness/pointedness for the pseudowords; (B) spectral tilt; (C) temporal FFT; (D) speech envelope. Items in each RDM are ordered left to right from most rounded to most pointed. Interpretation of the color bar as for Fig. 3; rs = Spearman correlation coefficient for B–D vs A; df = 143914 in all cases.

Notwithstanding the different numbers of samples per pseudoword, the acoustic parameters whose RDMs were correlated with the RDM for perceptual ratings of the pseudowords are likely to be important for auditory perception of pointedness/roundedness. The spectral tilt, temporal FFT, and speech envelope parameters all included multiple measurements per pseudoword, thus preventing a simple correlation between these and the single rating value for each pseudoword. Therefore we illustrate their relationships with perceptions of roundedness/pointedness qualitatively, by providing examples that show how these parameters vary between a more rounded pseudoword (mumo: ‘moo-moh’) and a more pointed pseudoword (kete: ‘keh-teh’). For each pseudoword, Fig. 5A illustrates spectral tilt, the overall slope of the spectrum from low to high frequencies. As predicted, the slope is steeper for the rounded pseudoword ‘mumo’ (Fig. 5A, left panel), because power is concentrated in the low frequency bands associated with the sonorants and back rounded vowels that reflect roundedness (McCormick et al., 2015). By contrast, the slope is flatter for the pointed pseudoword ‘kete’ (Fig. 5A, right panel) because more power is present at the higher frequencies of the obstruents and/or front unrounded vowels associated with pointedness (McCormick et al., 2015). Fig. 5B shows the waveforms and the spectrograms resulting from the temporal FFT; these both clearly distinguish ‘mumo’, with its voiced segments, from ‘kete’, which contains unvoiced consonants. The spectrogram (Fig. 5B, bottom panels) plots amplitude (the degree of dark shading) for multiple frequencies (y-axis) across time (x-axis) and shows smoother changes for ‘mumo’ compared to ‘kete’, where these are more abrupt, especially at high frequencies. Fig. 5C shows the speech envelope for each pseudoword: as predicted, the envelope is continuous and smoother for ‘mumo’ compared to ‘kete’, which has an envelope that is discontinuous and uneven. Note that the speech envelope is related to the waveform, where this continuous-smooth/discontinuous-uneven relationship for rounded vs. pointed words can also be seen (Fig. 5B, top panels: compare also the waveforms for ‘mumo’ and ‘kete’ to the intermediate pseudoword ‘zuvu’ in Fig. 1, Step 1).

Although RSA showed that none of the acoustic voice quality parameters was significantly related to auditory perceptual ratings of the pseudowords, the power of these analyses was limited by the smaller matrix size (18 × 18) for the RDMs of these parameters. Therefore, we supplemented RSA with conventional correlation analyses between the parameter values (one per pseudoword) and their perceptual ratings. These analyses showed that many of these parameters were related to auditory perception of roundedness/pointedness, demonstrating the relationships predicted in Section 2.5.2. After correction for multiple comparisons (Bonferroni-corrected α for 7 tests = .007), the FUF, shimmer, and jitter were all significantly positively correlated with perceptual ratings (Fig. 6A–C, respectively), reflecting increasing variation in aspects of voice quality as the pseudowords transitioned from rounded to pointed. Mean HNR, pulse number, and mean autocorrelation were all significantly negatively correlated with perceptual ratings (Fig. 6D–F); for these parameters, lower values indicate greater vocal variability or roughness in voice quality, as well as the presence or absence of voiced segments in each pseudoword, and were associated with higher ratings indicating pointedness. However, the correlation between perceptual ratings and PSD was relatively weak (r(535) = .1) and did not pass the Bonferroni-corrected α (see Supplementary Fig. 4). With the exception of PSD and the mean autocorrelation, all the voice parameter measurements were non-normally distributed; we therefore also tested these relationships with the non-parametric Spearman test and, although the correlations were slightly weaker, we obtained the same pattern of results. Neither set of results was changed by reference to the modified Bonferroni correction (Holm, 1979).

Figure 6.


Correlation between auditory perceptual ratings of the pseudowords and acoustic parameters of voice quality (note that the rating scale is truncated because there are no values less than 2 or greater than 6; the mean autocorrelation scale is truncated because there are no values below .7). df = 535 in all cases.

3.3. Comparison of visual perceptual ratings to visual shape parameters

We carried out within-modal second-order correlations between the RDM of the visual ratings of roundedness/pointedness and that for each visual parameter of the shapes (Fig. 1, Step 3). After correction for multiple comparisons (Bonferroni-corrected α for 5 tests = .01), there were significant positive correlations for the SMC (rs(4003) = .28, p < .0001), silhouette (rs(4003) = .14, p < .0001), image outlines (rs(4003) = .13, p < .0001), and Jaccard distance (rs(4003) = .1, p < .0001: Fig. 7), but not for the spatial FFT (rs(4003) = .01, p = .6: Supplementary Fig. 5). This pattern of results was unchanged by reference to the modified Bonferroni correction (Holm, 1979). Note that the RDM for image outlines (Fig. 7D) indicates relatively higher dissimilarity between shapes than that for the other visual parameters, and that its dissimilarity values (1 − r) were fairly uniform around 1, indicating that the majority of shapes were only weakly correlated with each other, whether positively or negatively. The reason for this can be seen in Supplementary Fig. 1, which shows all 90 image outlines overlaid on one another with darker/lighter areas indicating the intersection of more/fewer outlines. There are relatively few very dark intersections, indicating that shape outlines rarely overlapped substantially with other shapes and that all the shapes therefore differed from one another to a large degree.

Figure 7.


RDMs for (A) perceptual ratings of roundedness/pointedness for visual shapes; (B) SMC; (C) image silhouette; (D) image outlines; (E) Jaccard distance. Items in each RDM are ordered left to right from most rounded to most pointed. Interpretation of the color bar as for Fig. 3; rs = Spearman correlation coefficient for B–E vs A; df = 4003 in all cases. Note that the maximum dissimilarity value for the simple matching coefficient (B) and Jaccard distance (E) is 1 while that for image silhouette and outlines is 2 (see Section 2.3.2).

3.4. Comparison of acoustic and visual parameters

Finally, to the extent that they were significantly correlated with their within-modal perceptual ratings, we compared RDMs of the acoustic and visual parameters to each other crossmodally (Fig. 1, Step 4). We compared RDMs of the three most strongly correlated visual parameters: the SMC, silhouette, and image outlines, and RDMs of the three most strongly correlated acoustic parameters: spectral tilt, temporal FFT, and speech envelope, down-sampling the acoustic RDMs (by selecting every 6th word from rounded to pointed) in order to maintain a consistent matrix size of 90 × 90.

After correction for multiple comparisons (Bonferroni-corrected α for nine tests = .0056), the auditory temporal FFT was significantly correlated only with the visual SMC (rs(4003) = .06, p < .0001), while auditory spectral tilt was significantly correlated with the visual parameters of SMC (rs(4003) = .17, p < .0001), silhouette (rs(4003) = .01, p < .0001), and image outlines (rs(4003) = .05, p = .002: Figure 8); all other correlations were non-significant (rs(4003) = −.002 to .02, all p > .1). This pattern of results was unchanged by reference to the modified Bonferroni correction (Holm, 1979). As laid out in Section 2.2, these significant, albeit weak, crossmodal correlations between parameters suggest that the independent groups involved in the rating exercise were likely employing a common perceptual framework for pointedness/roundedness ratings regardless of modality. More specifically, the temporal FFT and spectral tilt may capture aspects of the speech signal that are related to the spatial aspects of shape captured by the SMC, image outlines, and silhouette.

Figure 8:


(A) RDMs for temporal FFT and the simple matching coefficient were correlated; (B) the RDM for spectral tilt was correlated with the SMC, silhouette, and image outlines. Interpretation of the color bar as for Fig. 3; rs = Spearman correlation coefficient for the RDMs of the visual parameters relative to the RDM of the acoustic parameter at the left of each row; df = 4003 in all cases.

4. DISCUSSION

4.1. Value of RSA

The present study is the first to examine how acoustic and visual parameters contribute to sound-to-shape mapping in the same experimental paradigm. Previous studies have only examined these separately (acoustic: Parise & Pavani, 2011; Knoeferle et al., 2017; visual: Chen et al., 2016). In addition, the number of pseudowords used here, and thus the sampling of the potential phonemic space, is much more comprehensive than in earlier investigations of pseudoword-shape mapping: 537 vs. 100 in Knoeferle et al. (2017), the largest previous set to our knowledge. While the method of creating rounded/pointed shapes was very different, our set size of 90 is comparable to that of 72 in Chen et al. (2016), again the largest that we are aware of in earlier work. We also demonstrate the utility of a novel application of RSA (Kriegeskorte et al., 2008) for assessing the relevance of a particular acoustic parameter to the perception of an auditory pseudoword as rounded or pointed, and for facilitating crossmodal comparison with visual parameters. The use of RSA allows very different stimulus sets and physical parameters to be compared on an equal footing, that of their pairwise dissimilarity (Kriegeskorte et al., 2008). The advantages of RSA are two-fold. Firstly, it allows comparison of parameters that involve multiple measurements/samples per pseudoword, as in the case of a number of the acoustic and visual parameters studied here. For these parameters, a conventional correlational analysis is not appropriate for assessing their relationship to perceptual ratings, or for crossmodal comparisons. The RSA approach enables such comparisons by constructing dissimilarity matrices in which each cell contains a single value representing the pairwise dissimilarity between stimuli, effectively compressing the large number of samples per stimulus. The resulting matrices are then simply compared using non-parametric (Spearman) correlation.
Secondly, regardless of the parameter, RSA compares each item to every other item, so that all possible pairs enter the analysis, as opposed to conventional correlational analyses, which treat items only as a group. This RSA approach, combined with a substantially larger set of pseudowords than in many previous studies, allowed us to investigate sound-to-shape mapping not only at a more granular level but also more comprehensively. A potential drawback of RSA, however, is that for parameters that are expressed as a single value for each item, the items must be binned into sets of stimuli to allow computation of dissimilarity; since the number of such item sets is only a fraction of the total number of items, this results in a loss of sensitivity. For such parameters, more conventional correlational analyses may remain appropriate, as discussed further below.
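The two RSA steps (first-order RDM construction, then second-order comparison of RDMs) can be sketched as follows; the Spearman step is a simplified rank-then-Pearson version that assumes no tied dissimilarity values, rather than a full statistical implementation:

```python
import numpy as np

def rdm(feature_matrix):
    """Representational dissimilarity matrix: 1 - Pearson r between
    every pair of items (rows = items, columns = measurements)."""
    return 1.0 - np.corrcoef(feature_matrix)

def second_order_spearman(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two RDMs:
    rank the pairwise dissimilarities, then Pearson-correlate the
    ranks (simple ranking; assumes no ties)."""
    iu = np.triu_indices_from(rdm_a, k=1)   # each pair counted once
    a, b = rdm_a[iu], rdm_b[iu]
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return float(np.corrcoef(rank(a), rank(b))[0, 1])
```

Because only the upper triangle enters the second-order correlation, an n-item RDM contributes n(n−1)/2 pairs, which is where the degrees of freedom reported above (e.g. 143914 for 537 items, 4003 for 90) come from.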

The RSA approach could prove fruitful in investigating other sound-symbolic mappings. For example, in sound-to-size mapping (e.g., ‘mil’ is small and ‘mal’ is large: Sapir, 1929), we would expect that spectral tilt would be flatter for ‘small’ words, since these would involve the higher frequencies associated with small size, and steeper for ‘large’ words since these would involve lower frequencies, a prediction that can be made not only from the findings of Knoeferle et al. (2017) but also from the well-known crossmodal correspondence between auditory pitch and visual size (Gallace & Spence, 2006; Evans & Treisman, 2010). In fact, non-linguistic crossmodal correspondences might be a good source of predictions about sound-symbolic mapping for pseudowords: for example, we could expect pseudowords reflecting the brightness/darkness dimension to be modulated by their amplitude and/or pitch (see Spence, 2011). Indeed, if these non-linguistic correspondences are found to effectively predict sound-symbolic linguistic correspondences, it would suggest that natural languages incorporate these general perceptual or cognitive constraints into the mapping of sound to meaning (Blasi et al., 2016; Namy & Nygaard, 2008; Revill et al., 2014). Additionally, other types of associations, for example between roundedness/pointedness and female/male first names respectively, previously demonstrated by shape-matching and phonemic analysis (Sidhu & Pexman, 2015) could also be reflected in acoustic analyses of those names.

4.2. Relationships between acoustic parameters and pseudoword ratings

The present study used a large set of pseudowords rather than real words. This had the advantage of being a highly constrained set of stimuli, controlling for, and systematically sampling variation in, vowel quality, consonant voicing, manner and place of articulation, and syllable structure (McCormick et al., 2015), thus enabling us to assess the roundedness/pointedness of a wide range of speech sounds that were as free of semantic associations as possible. Perception of a pseudoword as either rounded or pointed may depend on the acoustic consequences of the phonological content of the word itself and/or the vocal properties of the speaker’s voice. RSA showed that there were relatively strong correlations between the RDM of auditory perceptual ratings and the RDMs for spectral tilt, temporal FFT, and speech envelope, parameters that likely primarily reflect phonetic content. Spectral tilt was steeper for rounded pseudowords where power is concentrated at the lower frequencies associated with sonorants and back rounded vowels that reflect roundedness, but flatter for pointed pseudowords as spectral power migrated to the higher frequencies for obstruents and/or front unrounded vowels associated with pointedness (McCormick et al., 2015). Underlying the temporal FFT relationship, changes in the distribution of spectral power over time were smoother and occurred at lower frequencies for rounded pseudowords while for pointed pseudowords these transitions were more abrupt and occurred at higher frequencies, consistent with previous work showing that formant frequencies are higher for more pointed words (Knoeferle et al., 2017) and for vocalizations produced in response to more pointed shapes (Parise & Pavani, 2011). The speech envelope was smoother and more continuous for rounded words, whereas for pointed words it was more uneven and discontinuous.

However, RSA did not show significant relationships between perceptual ratings and any of the measures of voice quality. These measures were each based on a single value per pseudoword; thus, to create RDMs, the data had to be combined across multiple contiguous stimuli (we chose 30) in the larger matrix, allowing computation of dissimilarity between sets of stimuli rather than between individual stimuli, as was possible for spectral tilt, temporal FFT, and speech envelope. Since the resulting matrices (18×18) were substantially smaller than the full matrices (537×537), the statistical power of RSA was necessarily limited for comparisons based on these variables. To overcome this limitation, we also conducted conventional correlational analyses for these parameters. These analyses showed that several measures of vocal quality (FUF, jitter, shimmer, mean HNR, pulse number, and mean autocorrelation) predicted perceptual ratings of the pseudowords. As the variability of voice parameters increased or voice quality changed, reflected in higher FUF, jitter, and shimmer values and lower mean HNR, pulse number, and mean autocorrelation values, the increasingly noisy or rough quality of the speech pattern became less associated with roundedness and more associated with pointedness. Although these parameters are typically used to distinguish characteristics of individual speakers and to characterize and assess properties of voice quality, it is notable that in the present context, variation along these dimensions relates to differences among pseudoword productions by a single speaker (see Brockmann et al., 2011).
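The core RSA logic used throughout this discussion can be sketched compactly: pairwise dissimilarities between stimuli form an RDM, and two RDMs are compared via a second-order correlation over one off-diagonal half (the matrices being symmetric). The sketch below is a minimal illustration under our own assumptions; the function names, the Euclidean dissimilarity, and the Pearson second-order correlation are illustrative choices, not necessarily those of the published analysis.

```python
import numpy as np

def rdm(features):
    """Representational dissimilarity matrix: pairwise Euclidean
    distances between stimuli's feature vectors (rows of `features`)."""
    diffs = features[:, None, :] - features[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

def rdm_correlation(rdm_a, rdm_b):
    """Second-order correlation between two RDMs, computed over one
    off-diagonal half only, to avoid artificially inflating the
    degrees of freedom (the matrices are symmetric)."""
    tri = np.triu_indices_from(rdm_a, k=1)  # upper triangle, no diagonal
    return np.corrcoef(rdm_a[tri], rdm_b[tri])[0, 1]

# Toy example: perceptual ratings vs a noisy acoustic correlate of them;
# the two RDMs should be strongly (though not perfectly) correlated.
rng = np.random.default_rng(1)
ratings = rng.standard_normal((20, 1))                # e.g., roundedness ratings
acoustic = ratings + 0.1 * rng.standard_normal((20, 1))
r = rdm_correlation(rdm(ratings), rdm(acoustic))
```

Note that the comparison operates on dissimilarity structure, not on the raw values, which is what allows ratings, acoustic parameters, and visual shape indices to be compared within a common framework.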
The relationship between parameters of voice quality and sound-to-shape mapping is novel and raises the question of whether different vocal registers can be manipulated to influence this mapping; for example, whether rounded words spoken with the ‘glottal rattle’ of the pulse register (Hornibrook et al., 2018) would be perceived as less rounded than when spoken in the modal register of normal speech (Nygaard et al., 2009b; Tzeng et al., 2018). However, because we compared across many different pseudowords sampling an array of phonetic features, and assessed these voice measures across the entire utterance, the effects may not have exclusively reflected changes in vocal quality since, as noted in Section 2.3.1, these parameters can be influenced by phonemic content as well. In this context, we also observed no significant relationship between perceptual ratings of the pseudowords and variation in fundamental frequency or PSD (whether assessed via conventional correlational analyses or RSA). Since the speaker deliberately recorded the words with minimal inflection, it is possible that pitch did not vary enough for an effect of PSD to be detected. Thus, while the other voice parameters may reflect both the acoustic correlates of the speech sounds that each pseudoword contains and voice quality differences, PSD may in this case have primarily reflected how these speech sounds were produced.

In using continuous measurements of acoustic parameters and perceptual ratings, the present results extend previous work which relied on categorical linguistic contrasts, for example between consonants and vowels (Nielsen & Rendall, 2011; Fort et al., 2015) or voiced and unvoiced phonemes (McCormick et al., 2015). McCormick et al. (2015) suggested that perceivers make rounded/pointed judgments of a pseudoword by reference to both its specific individual phonemic components and to the overall inventory of features (acoustic, linguistic, or articulatory) within the utterance. Measuring acoustic parameters of the entire speech signal allowed us to provide some evidence in support of the idea that perceivers based their shape judgments on a global auditory assessment of each pseudoword. Spectral tilt, the temporal FFT, and the speech envelope are all complex measures of the complete pseudoword and cannot be reduced to a single value. Consistent with such holistic processing of the speech signal, these parameters were all related to roundedness/pointedness ratings, as demonstrated using RSA. Styles & Gawne (2017) reported a failure to replicate sound-to-shape mapping across languages and suggested that this was because the pseudowords employed, ‘kiki’ and ‘bubu’, did not conform to the phonological structure of the target language. However, failures to replicate invariably involved categorical responses (Rogers & Ross, 1975; Bremner et al., 2013; Styles & Gawne, 2017) so this may only be a partial explanation. This could be explored further using RSA for larger stimulus sets and continuous measurements of acoustic parameters.

4.3. Relationships between visual parameters and shape ratings

Since the visual shapes employed here were asymmetric, we were limited in our measurement choices. Nonetheless, RSA showed that the SMC, silhouette, image outlines and Jaccard distance were related to perception of the shapes as rounded or pointed. The SMC and Jaccard distance are pairwise measures of global shape matching, while the silhouette and image outlines are vectorized measures of the shapes that lend themselves to computation of pairwise dissimilarity. Thus, the RDMs based on these measures were all constructed from estimates of the pairwise dissimilarity of global shape. However, it should be noted that, apart from the SMC, the relationships between these RDMs and the RDM for visual perceptual ratings were modest, possibly reflecting the limited degrees of freedom used to generate the shapes: they were compositionally very similar, all consisting of grey outlines on a white background, so that the greyscale contrast was identical across all shapes, and they all lacked internal patterns. Interestingly, the RDMs for the visual ratings and the spatial FFT were uncorrelated, suggesting that spatial frequency is not a critical parameter underlying sound symbolism, at least for the visual shapes used here.
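The two pairwise shape-matching measures named above are simple to state for binarized images: the SMC is the fraction of pixels on which two images agree (shape or background alike), while the Jaccard distance considers only the shape (foreground) pixels, ignoring shared background. The sketch below is illustrative; the study's exact image preprocessing (resolution, binarization) may differ.

```python
import numpy as np

def smc(a, b):
    """Simple matching coefficient: fraction of pixels on which two
    binarized images agree (both shape or both background)."""
    return np.mean(a == b)

def jaccard_distance(a, b):
    """Jaccard distance: 1 - |intersection| / |union| of the shape
    (foreground) pixels. Unlike the SMC, shared background pixels
    do not contribute."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return 1.0 - inter / union

# Toy 3x3 "images": True = shape pixel, False = background.
# x and y differ at a single shape pixel.
x = np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]], dtype=bool)
y = np.array([[0, 1, 0],
              [0, 1, 1],
              [0, 1, 0]], dtype=bool)
```

For these toy images the SMC is 8/9 (agreement on eight of nine pixels) while the Jaccard distance is 0.2 (four shared shape pixels out of five in the union), showing how the two measures weight shared background differently.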

4.4. Crossmodal relationships

Of the crossmodal comparisons, the RDM for acoustic spectral tilt was significantly correlated with the RDMs for the visual SMC, silhouette and image outlines; the RDMs for the acoustic temporal FFT and the visual SMC were also correlated. Although these relationships were fairly weak, it is worth noting that spectral tilt (the parameter most strongly correlated with ratings of auditory roundedness/pointedness) was correlated with three visual indices, indicating that it may indeed be related to some aspects of visual shape and thus relevant to the kind of sound-symbolic crossmodal correspondence studied here. The crossmodal relationship between the SMC and the temporal FFT (another auditory spectral parameter) may also be tapping into this sound-symbolic correspondence. It is interesting that the spectral parameters of the pseudowords were related to their auditory ratings on the rounded-to-pointed dimension, and that both the spectral parameters we tested were related to global indices of visual shape, which were themselves related to the visual ratings of the shapes on the rounded-to-pointed dimension. We propose that these relationships may be particularly relevant for sound symbolism. However, further work is needed to confirm that auditory spectral parameters and global indices of visual shape, but not the spatial frequency spectrum of visual shapes, underpin sound-symbolic crossmodal correspondences. Further work should also examine why these crossmodal relationships are fairly weak: it may be that the underlying relationships are not linear or it may simply be that the auditory parameters are a proxy for roundedness/pointedness while the visual parameters measure this more explicitly. 
Although a number of voice quality measures were related to auditory perceptual ratings, we did not attempt to directly connect them to the visual shape measures that were related to the visual perceptual ratings, since the former relationships were based on conventional correlation and the latter on RSA.

An interesting point is that the sensory modality most associated with a particular word can change, or be added to, over time4 (Marks, 1978). For example, in Old English the word ‘sharp’ originally applied primarily to the sense of touch before becoming associated with taste during the eleventh century, and with visual shape and audition during the fourteenth century in Middle English (Marks, 1978). Relatedly, it is well known that the ‘hierarchy of the senses’ has changed over time (see Kambaskovic & Wolfe, 2016): for example, both touch and hearing have been considered more primary than vision at different times. While the timescales involved probably preclude empirical enquiry, it may be worth considering whether a set of pseudowords has stronger connections to sound-symbolic mappings in one modality than another, and whether this follows the current sensory hierarchy.

4.5. Limitations and future directions

An obvious limitation of the present study is that, inevitably, we did not test all possible acoustic and visual parameters of the pseudowords and shapes; we may therefore have omitted parameters that turn out to be equally, or more, important. However, to the extent that the parameters examined here were not significantly related to the perceptual ratings or crossmodally, either using RSA or conventional correlations, our results help focus the search space for future studies. Also, we only tested the roundedness/pointedness dimension; acoustic and visual parameters might be differently weighted for other dimensions in other domains relevant to sound symbolism (see Knoeferle et al., 2017, for different weightings for shape and size). For instance, acoustic parameters that do not contribute to perception of roundedness/pointedness might still be important for onomatopoeic words, like “bang”, “splash”, or “slap”, that reflect auditory rather than visual properties. Alternatively, it may also be the case that some parameters do not contribute to sound-symbolic mapping across a range of target domains. Testing across different domains might help to explain why this may be so and thus even non-relevant parameters could further our understanding of sound symbolism, albeit in a negative sense.

It might be objected that measures of voice quality were correlated with perceptual ratings because the speaker who recorded the pseudowords pronounced them differently according to her expectations of their roundedness/pointedness, researchers not being immune to, or unaware of, sound-symbolic mappings. We think this unlikely for several reasons. Firstly, the speaker made a conscious effort to speak with neutral intonation, and sound files were selected (from multiple takes) by two independent judges on the basis that they sounded both neutral and consistent with the other recordings. Since Parise & Pavani (2011) showed that people spontaneously vocalize differently to different stimulus attributes, the requirement to employ a neutral intonation may actually have reduced the true effect. Secondly, unlike other acoustic parameters such as amplitude or pitch, it would be hard to consciously modulate complex parameters such as shimmer or mean HNR. Even if this could be achieved, it is unlikely that it could be sustained over a set of more than 500 items in such a way as to produce the correlations seen in Fig. 6A, particularly when the items were recorded in a random, rather than a fixed, order along the rounded-to-pointed scales.

A drawback of our use of pseudowords is that they are not part of actual language, although they were sampled from linguistic segments and conformed to the phonological constraints of standard American English (McCormick et al., 2015). Disadvantages of using real words, e.g., mimetics or onomatopoeic words, include the loss of the control that we were able to exercise in using a carefully constructed stimulus set, and very small set sizes: for example, controlling for word length by choosing only two-syllable onomatopoeic words would sharply restrict the set, and set size would likely be diminished still further if one wanted a set of such words that all relate, as here, to a single dimension. However, the present study is exploratory, demonstrating the viability of RSA as a method and the importance of some acoustic and visual parameters but not others. Future work could proceed to examine these parameters in relation to real words indicating roundedness/pointedness, e.g., spike vs. balloon (Sučević et al., 2015), or to other kinds of shapes. Since the smaller set sizes for such words would entail smaller sets of shapes, the perceptual ratings of words and shapes could be carried out as a within-participant factor rather than, as here, a between-participant factor. This design aspect is a further limitation of the current study since it means that the pseudowords were never explicitly assigned to an actual shape. Thus, while we can reach some conclusions about sound symbolism, it is less easy to draw conclusions about the sound-shape crossmodal correspondence, since the pseudowords and shapes were never explicitly compared by participants. This might not be problematic if the association between visual roundedness/pointedness and auditory pseudowords were, as seems likely, relative rather than absolute, as with the crossmodal correspondence between auditory pitch and visuospatial elevation (Spence, 2019).
However, the effect of these acoustic and visual parameters on the crossmodal correspondence could now be tested further with smaller stimulus sets, since the relationships between acoustic spectral parameters and global indices of visual shape may underlie sound-to-shape mapping (see also Daube et al., 2019). In future work, we could more closely examine the relationship of the present work to sound-symbolic crossmodal correspondences by having people assign words to shapes and then examining the relationship between acoustic parameters and the visual properties of the shapes that people choose.

A final limitation is that, although we report the effects of the parameters individually, some parameters are likely interdependent either conceptually (for example, both the pulse number and fraction of unvoiced frames reflect how often the vocal folds open and close or relative amount of voicing in an utterance) or computationally (for example, the simple matching coefficient is a variation of the formula for the Jaccard distance – and both of these might be related to image silhouette even though the computation of the latter is different). Where different parameters are not independent of each other, it is hard to assess their unique contribution; however, it is unlikely that a single parameter is determinative of either auditory or visual roundedness/pointedness.

5. CONCLUSIONS

Our novel application of RSA to large sets of 537 auditory pseudowords and 90 visual shapes, previously constructed and rated on the rounded-to-pointed dimension, led to the following conclusions: (1) The auditory and visual ratings were closely interrelated, in keeping with the well-known crossmodal correspondence between auditory pseudowords and the roundedness or pointedness of visual shapes. (2) Global acoustic measures of the pseudowords, the speech envelope and spectral measures (spectral tilt and the temporal FFT), were related to the auditory ratings. For rounded compared to pointed pseudowords, the speech envelope and the changes in spectral power over the pseudoword were smoother, and spectral tilt was steeper, with greater concentration of power at lower frequencies. (3) Multiple global indices of visual shape (the SMC, silhouette, image outlines, and Jaccard distance), but not their spatial FFT, were related to the visual ratings. (4) Among the acoustic and visual parameters that were related to the corresponding perceptual ratings, the acoustic spectral measures were crossmodally related to the global indices of visual shape. (5) While voice quality measures were not found to be related to the auditory ratings using RSA, many of them (the HNR, pulse number, FUF, mean autocorrelation, shimmer, and jitter) were shown by conventional analyses to be correlated with the auditory ratings; however, their potential relationship to the relevant visual measures was not examined here. Overall, our findings extend those of previous studies (Parise & Pavani, 2011; Knoeferle et al., 2017; Chen et al., 2016) by providing new insights into the stimulus features that may mediate sound-symbolic crossmodal correspondences. Here, we show for the first time that the sound-symbolic mapping of sound to shape is related to acoustic properties of pseudowords.
Further research is required to establish whether these factors contribute consistently across a range of sound-symbolic mappings or whether they are differently weighted across different mappings, and to understand their neural basis.

Supplementary Material

Lacey et al Cognitive Science suppl

ACKNOWLEDGMENTS

This work was supported by grants to KS and LCN from the National Eye Institute at the NIH (R01EY025978) and the Emory University Research Council. Support to KS from the Veterans Administration and to SML from the Laney Graduate School is also acknowledged. We thank Jee Young Kim and Valentin Lazar for their advice and assistance.

Footnotes

1

Broadly speaking, vowels can be identified by their fundamental frequency (F0) and the relative frequencies of their formants – the resonance frequencies of the vocal tract when producing the vowel sound. The first three formants, F1–F3, are the most informative about vowel identity with higher formants contributing to speaker identity (Knoeferle et al., 2017).

2

Note that in order to avoid artificially inflating the degrees of freedom (df), the second-order correlations between matrices were calculated using one half of the off-diagonal data rather than the entire matrix: df is therefore given by ((n² − n)/2) − 2, where n² is the number of cells in an n × n matrix, subtracting n removes the diagonal cells, and dividing by 2 removes the redundant half of the cells, the matrices being symmetric across the diagonal.
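This degrees-of-freedom calculation can be checked directly; the function name below is ours, for illustration only.

```python
def rdm_df(n):
    """Degrees of freedom for a second-order correlation computed over
    one off-diagonal half of an n-by-n symmetric RDM:
    ((n**2 - n) / 2) - 2."""
    return (n * n - n) // 2 - 2

# Full 537x537 pseudoword matrices vs the 18x18 voice-quality matrices
full_df = rdm_df(537)   # -> 143914
small_df = rdm_df(18)   # -> 151
```

The contrast between 143,914 and 151 degrees of freedom makes concrete why statistical power was so much more limited for the voice-quality comparisons.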

3

We thank an anonymous reviewer for drawing this possibility to our attention. In this system, roundness is measured as the mean radius of the curvatures best fitting each of the outward ‘corners’ divided by the radius of the largest inscribed circle, i.e., the circle best fitting all the inward ‘corners’ (Folk, 1965; Boggs, 2009).
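The roundness index described in this footnote can be written compactly as follows (notation ours: r_i is the radius of the circle best fitting the i-th of N outward ‘corners’, and R_ins is the radius of the largest inscribed circle):

```latex
\mathrm{roundness} = \frac{\frac{1}{N}\sum_{i=1}^{N} r_i}{R_{\mathrm{ins}}}
```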

4

We thank an anonymous reviewer for drawing this to our attention.

DATA AVAILABILITY

Stimuli are available at https://osf.io/ekpgh/ and rating data and scripts for the acoustic analysis are available at https://osf.io/y9zjc/.

REFERENCES

  1. Ademollo F (2011). The Cratylus of Plato: A Commentary. Cambridge University Press: Cambridge, UK.
  2. Aiken SJ, & Picton TW (2008). Human cortical responses to the speech envelope. Ear and Hearing, 29, 139–157.
  3. Akita K & Tsujimura N (2016). Mimetics. In Kageyama T & Kishimoto H (Eds.) Handbook of Japanese Lexicon & Word Formation, pp. 133–160. Walter de Gruyter Inc., Boston, USA.
  4. Ben-Artzi E & Marks LE (1995). Visual-auditory interaction in speeded classification: Role of stimulus difference. Perception & Psychophysics, 57, 1151–1162.
  5. Blasi DE, Wichmann S, Hammarström H, Stadler PF, & Christiansen MH (2016). Sound–meaning association biases evidenced across thousands of languages. Proceedings of the National Academy of Sciences, 113, 10818–10823.
  6. Boersma P & Weenink D (2012). PRAAT: doing phonetics by computer. Accessed at http://www.praat.org/.
  7. Boggs S Jr. (2009). Petrology of Sedimentary Rocks, 2nd edition. Cambridge University Press: NY, USA.
  8. Brand J, Monaghan P, & Walker P (2018). The Changing Role of Sound Symbolism for Small Versus Large Vocabularies. Cognitive Science, 42(Suppl 2), 578–590.
  9. Bremner AJ, Caparos S, Davidoff J, de Fockert J, Linnell KJ et al. (2013). “Bouba” and “Kiki” in Namibia? A remote culture make similar shape-sound matches, but different shape-taste matches to Westerners. Cognition, 126, 165–172.
  10. Brockmann M, Drinnan MJ, Storck C & Carding PN (2011). Reliable jitter and shimmer measurements in voice clinics: the relevance of vowel, gender, vocal intensity, and fundamental frequency effects in a typical clinical task. Journal of Voice, 25, 44–53.
  11. Catricalà M & Guidi A (2015). Onomatopoeias: a new perspective around space, image schemas, and phoneme clusters. Cognitive Processing, 16(Suppl 1), S175–S178.
  12. Chen Y-C, Huang P-C, Woods A & Spence C (2016). When “bouba” equals “kiki”: cultural commonalities and cultural differences in sound-shape correspondences. Scientific Reports, 6, 26681. doi: 10.1038/srep26681.
  13. Cuskley C, Simner J, & Kirby S (2017). Phonological and orthographic influences in the bouba–kiki effect. Psychological Research, 81, 119–130.
  14. Daube C, Ince RAA & Gross J (2019). Simple acoustic features can explain phoneme-based predictions of cortical responses to speech. Current Biology, 29, 1924–1937.
  15. Davico G, Pizzolato C, Killen BA, Barzan M, Suwarganda EK et al. (2019). Best methods and data to reconstruct paediatric lower limb bones for musculoskeletal modelling. Biomechanics & Modeling in Mechanobiology, in press. doi: 10.1007/s10237-019-01245-y.
  16. Davis R (1961). The fitness of names to drawings: a cross-cultural study in Tanganyika. British Journal of Psychology, 52, 259–268.
  17. De Carolis L, Marsico E, Arnaud V & Coupé C (2018). Assessing sound symbolism: investigating phonetic forms, visual shapes, and letter fonts in an implicit bouba-kiki experimental paradigm. PLoS ONE, 13, e0208874. doi: 10.1371/journal.pone.0208874.
  18. de Saussure FD (2011). General principles: Nature of the linguistic sign. In Meisel P & Saussy H (Eds.), Course in general linguistics (Baskin W, Trans.) (pp. 65–70). New York, NY: Columbia University Press.
  19. Devaprakash D, Lloyd DG, Barrett RS, Obst SJ, Kennedy B et al. (2019). Magnetic resonance imaging and freehand 3-D ultrasound provide similar estimates of free Achilles tendon shape and 3-D geometry. Ultrasound in Medicine & Biology, 45, 2898–2905.
  20. Devereux BJ, Clarke A, Marouchos A, & Tyler LK (2013). Representational similarity analysis reveals commonalities and differences in the semantic processing of words and objects. Journal of Neuroscience, 33, 18906–18916.
  21. Evans KK and Treisman A (2010). Natural cross-modal mappings between visual and auditory features. Journal of Vision, 10, 6. doi: 10.1167/10.1.6.
  22. Ferrand CT (2002). Harmonics-to-noise ratio: An index of vocal aging. Journal of Voice, 16, 480–487.
  23. Folk RL (1965). Petrology of Sedimentary Rocks. Hemphill Publishing Company, TX, USA. Retrieved from the Walter Geology Library, University of Texas: https://web.archive.org/web/20060214063526/http://www.lib.utexas.edu/geo/folkready/folkprefrev.html
  24. Fort M, Martin A & Peperkamp S (2015). Consonants are more important than vowels in the Bouba-Kiki effect. Language & Speech, 58, 247–266.
  25. Gallace A and Spence C (2006). Multisensory synesthetic interactions in the speeded classification of visual size. Perception & Psychophysics, 68, 1191–1203.
  26. Gasser M (2004). The origins of arbitrariness in language. Proceedings of the Annual Meeting of the Cognitive Science Society, 26, 434–439.
  27. Hollien H, Girard GT, & Coleman RF (1977). Vocal fold vibratory patterns of pulse register phonation. Folia Phoniatrica et Logopaedica, 29, 200–205.
  28. Holm S (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.
  29. Hornibrook J, Ormond T & Maclagan M (2018). Creaky voice or extreme vocal fry in young women. New Zealand Medical Journal, 131, 36–40.
  30. Imai M, & Kita S (2014). The sound symbolism bootstrapping hypothesis for language acquisition and language evolution. Philosophical Transactions of the Royal Society B: Biological Sciences, 369, 20130298.
  31. Imai M, Miyazaki M, Yeung HH, Hidaka S, Kantartzis K, Okada H, & Kita S (2015). Sound symbolism facilitates word learning in 14-month-olds. PLoS ONE, 10, e0116494.
  32. Ishi CT, Sakakibara K-I, Ishiguro H & Hagita N (2008). A method for automatic detection of vocal fry. IEEE Transactions on Audio, Speech, & Language Processing, 16, 47–56.
  33. Jaccard P (1901). Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579.
  34. Jamal Y, Lacey S, Nygaard L & Sathian K (2017). Interactions between auditory elevation, auditory pitch, and visual elevation during multisensory perception. Multisensory Research, 30, 287–306.
  35. Joseph JE (2015). Iconicity in Saussure’s Linguistic Work, and why it does not contradict the arbitrariness of the sign. Historiographia Linguistica, 42, 85–105.
  36. Kambaskovic D & Wolfe CT (2016). The Senses in Philosophy and Science: From the nobility of sight to the materialism of touch. In Roodenburg H (ed.) A Cultural History of the Senses in the Renaissance, pp. 107–125. Bloomsbury Press: London.
  37. Kliper R, Portuguese S, & Weinshall D (2016). Prosodic analysis of speech and the underlying mental state. In Serino S et al. (eds.) Pervasive Computing Paradigms for Mental Health: MindCare 2015 Selected Papers, pp. 52–62. Springer, Switzerland.
  38. Knoeferle K, Li J, Maggioni E, & Spence C (2017). What drives sound symbolism? Different acoustic cues underlie sound-size and sound-shape mappings. Scientific Reports, 7, 5562. doi: 10.1038/s41598-017-05965-y.
  39. Köhler W (1929). Gestalt Psychology. New York: Liveright Publishing Corporation.
  40. Köhler W (1947). Gestalt Psychology: An Introduction to New Concepts in Modern Psychology. Liveright: New York, NY.
  41. Kriegeskorte N, Mur M, Ruff DA, Kiani R, Bodurka J, Esteky H et al. (2008). Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60, 1126–1141.
  42. Lacey S, Martinez MO, McCormick K and Sathian K (2016). Synesthesia strengthens sound-symbolic cross-modal correspondences. European Journal of Neuroscience, 44, 2716–2721.
  43. Liew K, Lindborg P, Rodrigues R & Styles SJ (2018). Cross-modal perception of noise-in-music: Audiences generate spiky shapes in response to auditory roughness in a novel electroacoustic concert setting. Frontiers in Psychology, 9, 178. doi: 10.3389/fpsyg.2018.00178.
  44. Lockwood G & Dingemanse M (2015). Iconicity in the lab: a review of behavioral, developmental, and neuroimaging research into sound-symbolism. Frontiers in Psychology, 6, 1246. doi: 10.3389/fpsyg.2015.01246.
  45. Lu Y, & Cooke M (2009). The contribution of changes in F0 and spectral tilt to increased intelligibility of speech produced in noise. Speech Communication, 51, 1253–1262.
  46. Marks LE (1978). The unity of the senses: Interrelations among the modalities. New York, NY: Academic Press.
  47. Marks LE (1987). On cross-modal similarity: Auditory–visual interactions in speeded discrimination. Journal of Experimental Psychology: Human Perception and Performance, 13, 384–394.
  48. Maurer D, Pathman T, & Mondloch CJ (2006). The shape of boubas: Sound-shape correspondences in toddlers and adults. Developmental Science, 9, 316–322.
  49. McCormick K, Kim JY, List S & Nygaard LC (2015). Sound to meaning mappings in the bouba-kiki effect. Proceedings of the 37th Annual Meeting of the Cognitive Science Society, 1565–1570.
  50. McCormick K, Lacey S, Stilla R, Nygaard LC & Sathian K (2018). Neural basis of the sound-symbolic crossmodal correspondence between auditory pseudowords and visual shapes. bioRxiv. doi: 10.1101/478347.
  51. Meteyard L, Stoppard E, Snudden D, Cappa SF, & Vigliocco G (2015). When semantics aids phonology: A processing advantage for iconic word forms in aphasia. Neuropsychologia, 76, 264–275.
  52. Mezzedimi C, di Francesco M, Livi W, Spinosi MC & De Felice C (2017). Objective evaluation of presbyphonia: spectroacoustic study on 142 patients with Praat. Journal of Voice, 31, 257.e25–257.e32.
  53. Monaghan P, Mattock K, & Walker P (2012). The role of sound symbolism in language learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 38, 1152–1164.
  54. Namy LL, & Nygaard LC (2008). Perceptual-motor constraints on sound to meaning correspondence in language. Behavioral and Brain Sciences, 31, 528–529.
  55. Nielsen A, & Rendall D (2011). The sound of round: Evaluating the sound-symbolic role of consonants in the classic Takete-Maluma phenomenon. Canadian Journal of Experimental Psychology, 65, 115–124.
  56. Nygaard LC, Cook AE, & Namy LL (2009a). Sound to meaning correspondences facilitate word learning. Cognition, 112, 181–186.
  57. Nygaard LC, Herold DS & Namy LL (2009b). The semantics of prosody: acoustic and perceptual evidence of prosodic correlates to word meaning. Cognitive Science, 33, 127–146.
  58. Ozturk O, Krehm M, & Vouloumanos A (2013). Sound symbolism in infancy: Evidence for sound–shape cross-modal correspondences in 4-month-olds. Journal of Experimental Child Psychology, 114, 173–186.
  59. Parise CV, & Pavani F (2011). Evidence of sound symbolism in simple vocalizations. Experimental Brain Research, 214, 373–380.
  60. Parise CV, & Spence C (2012). Audiovisual crossmodal correspondences and sound symbolism: a study using the implicit association test. Experimental Brain Research, 220, 319–333.
  61. Peiffer-Smadja N, & Cohen L (2019). The cerebral bases of the bouba-kiki effect. NeuroImage, 186, 679–689.
  62. Perniss P, & Vigliocco G (2014). The bridge of iconicity: From a world of experience to the experience of language. Philosophical Transactions of the Royal Society B: Biological Sciences, 369, 20130300.
  63. Revill KP, Namy LL, Defife LC, & Nygaard LC (2014). Cross-linguistic sound symbolism and crossmodal correspondence: Evidence from fMRI and DTI. Brain and Language, 128, 18–24.
  64. Revill KP, Namy LL, & Nygaard LC (2018). Eye movements reveal persistent sensitivity to sound symbolism during word learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 44, 680–698.
  65. Ricotta C, & Pavoine S (2015). Measuring similarity among plots including similarity among species: an extension of traditional approaches. Journal of Vegetation Science, 26, 1061–1067.
  66. Rogers S, & Ross A (1975). A cross-cultural test of the maluma–takete phenomenon. Perception, 5, 105–106.
  67. Sapir E (1929). A study in phonetic symbolism. Journal of Experimental Psychology, 12, 225–239.
  68. Schmidtke DS, Conrad M & Jacobs AM (2014). Phonological iconicity. Frontiers in Psychology, 5, 80. doi: 10.3389/fpsyg.2014.00080.
  69. Schneider W, Eschman A, & Zuccolotto A (2002). E-Prime user’s guide. Pittsburgh: Psychology Software Tools Inc.
  70. Sidhu DM & Pexman PM (2015). What’s in a name? Sound symbolism and gender in first names. PLoS ONE, 10, e0126809. doi: 10.1371/journal.pone.0126809.
  71. Singh L (2015). Speech signal analysis using FFT and LPC. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 4, 1658–1660.
  72. Spence C (2011). Crossmodal correspondences: A tutorial review. Attention, Perception, and Psychophysics, 73, 971–995.
  73. Spence C (2019). On the relative nature of (pitch-based) crossmodal correspondences. Multisensory Research, 32, 235–265.
  74. Styles SJ & Gawne L (2017). When does Maluma/Takete fail? Two key failures and a meta-analysis suggest that phonology and phonotactics matter. i-Perception, 8. doi: 10.1177/2041669517724807.
  75. Sučević J, Savić AM, Popović MB, Styles SJ & Ković V (2015). Balloons and bavoons versus spikes and shikes: ERPs reveal shared neural processes for shape-sound-meaning congruence in words, and shape-sound congruence in pseudowords. Brain & Language, 145/146, 11–22.
  76. Teixeira JP, & Fernandes PO (2014). Jitter, shimmer and HNR classification within gender, tones and vowels in healthy voices. Procedia Technology, 16, 1228–1237. [Google Scholar]
  77. Thompson PD & Estes Z (2011). Sound symbolic naming of novel objects is a graded function. Quarterly Journal of Experimental Psychology, 64, 2392–2404. [DOI] [PubMed] [Google Scholar]
  78. Thoret E, Aramaki M, Kronland-Martinet R, Velay J-L, & Ystad S (2014, January 20). From Sound to Shape: Auditory Perception of Drawing Movements. Journal of Experimental Psychology: Human Perception and Performance, 40, 983–994. [DOI] [PubMed] [Google Scholar]
  79. Tzeng CY, Duan J, Namy LL, & Nygaard LC (2018). Prosody in speech as a source of referential information. Language, Cognition and Neuroscience, 33, 512–526. [Google Scholar]
  80. Tzeng CY, Nygaard LC, & Namy LL (2016). The Specificity of Sound Symbolic Correspondences in Spoken Language. Cognitive Science, 41, 2191–2220. [DOI] [PMC free article] [PubMed] [Google Scholar]
  81. Tzeng CY, Nygaard LC, & Namy LL (2017). Developmental change in children’s sensitivity to sound symbolism. Journal of Experimental Child Psychology, 160, 107–118. [DOI] [PubMed] [Google Scholar]
  82. Van Puyvelde M, Neyt X, McGlone F & Pattyn N (2018). Voice stress analysis: A new framework for voice and effort in human performance. Frontiers in Psychology, 9, 1994 doi: 10.3389/fpsyg.2018.01994 [DOI] [PMC free article] [PubMed] [Google Scholar]
  83. Walker P, Bremner JG, Mason U, Spring J, Mattock K, Slater A, Johnson SP (2010). Preverbal infants’ sensitivity to synesthetic cross-modality correspondences. Psychological Science, 21, 21–25. [DOI] [PubMed] [Google Scholar]
  84. Westbury C, Hollis G, Sidhu DM & Pexman PM (2018). Weighing up the evidence for sound symbolism: distributional properties predict cue strength. Journal of Memory & Language, 99, 125–150. [Google Scholar]
  85. Whitehead RL, Metz DE & Whitehead BH (1984). Vibratory patterns of the vocal folds during pulse register phonation. Journal of the Acoustical Society of America, 75, 1293–1297. [DOI] [PubMed] [Google Scholar]
  86. Wilkinson F, Wilson HR & Habak C (1998). Detection and recognition of radial frequency patterns. Vision Research, 38, 3555–3568. [DOI] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Lacey et al Cognitive Science suppl

Data Availability Statement

Stimuli are available at https://osf.io/ekpgh/ and rating data and scripts for the acoustic analysis are available at https://osf.io/y9zjc/.