Abstract
This study investigated how Chinese learners of English perceive the effectiveness of different multimodal input for vocabulary learning. Forty participants rated 14 combinations of visual, auditory, tactile, and gestural modalities across various word types and proficiency levels through an online questionnaire. Results revealed three key findings. First, adding more modalities did not automatically increase perceived effectiveness; what mattered was which specific modalities were combined. Visual-inclusive combinations consistently received the highest ratings, while gestural input alone was rated lowest. Second, learners showed distinct preferences for different word types: for concrete words they favored visual and auditory input, for action words visual and gestural input, and for emotion words auditory and gestural input. Abstract words elicited no clear consensus. Third, proficiency influenced these patterns. Advanced learners rated multimodal input more favorably overall and converged on visual and auditory combinations for abstract words, whereas intermediate learners remained divided and reported higher processing difficulty. These findings document systematic patterns in learner perceptions and provide a foundation for experimental studies testing whether such preferences correspond to actual vocabulary learning outcomes.
Keywords: Chinese EFL learners, learner proficiency, multimodal input, perceived effectiveness, vocabulary learning, word type
1. Introduction
How learners receive and process lexical information may shape their perceptions of vocabulary learning effectiveness. Traditional classroom instruction has relied primarily on written text and spoken input, but multimedia technology now enables a broader range of options. Learners can engage with vocabulary through sight, sound, touch, and movement—combining these channels is known as multimodal input (Cárdenas-Claros et al., 2023). Such input is increasingly common in language learning contexts, from video-based instruction to gesture-integrated activities (Montero Perez, 2020).
Two theoretical frameworks generate differential predictions about the effectiveness of multimodal vocabulary input. Paivio and Walsh’s (1993) Dual Coding Theory (DCT) proposes that cognition operates through two independent but interconnected systems: a verbal system for language and a nonverbal system for mental imagery. When learners encounter information through both systems simultaneously—such as a visual image paired with spoken narration—dual encoding creates two retrieval pathways, increasing the probability of recall. This advantage is predicted to be strongest for concrete words, which have clear sensory referents and are more readily encoded as mental images, and for action words, where bodily movement may activate sensorimotor representations that support encoding (Kana et al., 2012; Macedonia, 2014). For abstract words, however, the nonverbal system offers fewer accessible images, which limits the dual-coding advantage (Borghi et al., 2017; Mayer, 2024).
Cognitive Load Theory (CLT) generates a partially competing prediction. Sweller et al. (2011) argue that working memory capacity is finite, and that the total cognitive load imposed by a learning task constrains what can be processed at any given time. When input channels increase from single to dual to triple modalities, the cumulative demands may exceed available capacity, particularly for learners with limited proficiency who lack the schemas needed to manage multiple information streams efficiently (Kalyuga, 2007). Under this account, adding modalities does not automatically improve learning and may reduce perceived effectiveness when cognitive resources are strained.
Taken together, the two frameworks make distinct predictions. DCT predicts that visual-inclusive combinations should be perceived as most effective, particularly for concrete and action words with clear imageable referents. CLT predicts that lower-proficiency learners—who lack the schemas to manage multiple information streams—may find complex combinations more cognitively demanding than advanced learners (Kalyuga, 2007), such that modality quantity alone does not determine perceived effectiveness. Mayer’s (2009) cognitive theory of multimedia learning similarly holds that visual and auditory channels operate with limited capacity, such that effective instruction distributes rather than saturates processing demands.
Empirical research offers partial support for both frameworks, though findings are distributed across studies that examined different modality pairings and populations. Bisson et al. (2014) exposed learners of Welsh to target words through text, images, and audio, finding that even two trimodal exposures improved translation recognition relative to untreated words. Xing and Zhang (2025) compared visual-only, auditory-only, and audiovisual conditions among 90 Chinese learners and found no group differences on immediate tests, but the audiovisual condition produced significantly stronger retention on delayed posttests. These findings suggest that dual-channel benefits are not guaranteed by combining input types and may depend on the time available for consolidation.
Word type moderates these effects in systematic ways. Farley et al. (2014) found that visual input significantly improved abstract word recall but did not benefit concrete word recall, suggesting that concrete words may already carry sufficient visual-semantic associations. Huang et al. (2019) demonstrated that gestural input improved action verb recognition by 8–10%, consistent with embodied cognition accounts linking physical enactment to sensorimotor representation. Altarriba and Basnight-Brown (2012) examined how English speakers learned Spanish concrete, abstract, and emotion words through visual and auditory input: emotion words showed faster initial processing but lower translation accuracy and higher error rates under interference, which the authors attributed to the absence of affective context in decontextualized learning. Together, these findings indicate that word type systematically shapes which modality pairings support encoding.
Input sequence and proficiency represent two further dimensions. Yuan and Tang (2025) compared subtitle conditions among Chinese EFL learners and found that presenting L1 subtitles before bilingual subtitles outperformed other orders on both form and meaning recall, with the lowest associated cognitive load, F(3, 158) = 96.21, p < 0.001, partial η2 = 0.65. Yu and Liu (2022) similarly found that presenting pictures before L1 translations produced stronger retention than the reverse order on both immediate (Cohen’s d = 1.27) and delayed posttests (d = 0.73). Regarding proficiency, Ting (2013) found that approximately 70% of lower-proficiency learners relied on subtitles as necessary support, whereas 57% of upper-intermediate learners deliberately avoided them to reduce dependency—a pattern consistent with CLT’s prediction that more advanced learners process familiar input with lower cognitive cost (Kalyuga, 2007).
Chinese EFL learners constitute a theoretically meaningful population for examining these questions. English instruction in Chinese secondary and tertiary education has historically been organized around teacher-centered delivery, written vocabulary lists, and repetitive copying exercises, with comparatively greater emphasis on visual and written input, and limited systematic incorporation of auditory and gestural channels (Hu, 2002; Liu and Ren, 2024; Fu, 2021). This instructional pattern means that Chinese learners typically enter multimodal learning contexts with extended experience in visual and tactile input—specifically reading and handwriting—while auditory and gestural channels remain less institutionally established. According to CLT, familiarity with a channel reduces the extraneous load associated with processing it (Kalyuga, 2007), which suggests that Chinese learners may evaluate familiar modalities more favorably not because those modalities produce stronger learning, but because they impose lower processing demands. Mayer’s (2009) cognitive theory of multimedia learning was developed largely in Western educational contexts and assumes that learners bring relatively balanced experience across visual and auditory channels. Whether this assumption holds for learners whose instructional history has systematically emphasized particular modalities over others has not been empirically examined.
Despite these findings, several gaps remain. First, prior studies have examined specific modality pairings—primarily visual–auditory combinations or isolated single-modality comparisons—without systematically comparing all combinations of visual, auditory, tactile, and gestural input within the same study. Second, existing research has measured learning outcomes but has not documented how learners perceive the relative effectiveness of different combinations, or how these perceptions interact with word type and proficiency level. Learner perceptions matter because they influence strategy selection and engagement (Liaw and Huang, 2013), and divergences between perceived and actual effectiveness represent a concern for instructional design. Third, the extent to which Chinese learners’ distinct instructional background shapes modality preferences—and whether this pattern varies with proficiency—has not been directly investigated.
Given these gaps, this study examines Chinese EFL learners’ perceived effectiveness and cognitive load across four input channels: visual, auditory, gestural, and tactile. Gestural input refers to whole-body movements representing lexical meaning, such as acting out “jump” (Macedonia, 2014). Tactile input, operationally defined in this study, encompasses learning activities involving physical hand engagement with written materials, including handwriting, copying, and manipulating flashcards. Research has demonstrated that handwriting creates motor memory traces through graphomotor processes, which facilitate subsequent character recognition (Longcamp et al., 2008). While these activities vary in their specific motor patterns, they share the common feature of repetitive fine motor movements and haptic interaction with vocabulary materials—a pedagogical practice widely employed in Chinese EFL classrooms.
RQ1: Which input combinations do learners perceive as most effective for vocabulary learning, and how does input complexity relate to perceived effectiveness?
RQ2: Which combination types and presentation sequences do learners prefer for learning different vocabulary categories?
RQ3: What differences exist in input preferences among learners at different proficiency levels?
These questions examine learner perceptions and preferences, which may inform instructional design while requiring empirical validation through performance-based studies.
2. Method
2.1. Participants
Forty Chinese EFL learners participated (36 female; ages 18 and above), categorized by proficiency based on recent standardized test scores where available, or by self-assessment otherwise: lower intermediate (IELTS below 6; n = 10), intermediate (IELTS 6.5–7.5; n = 20), and advanced (IELTS 8–9; n = 10). The questionnaire used the term “beginner” for the lowest proficiency group; however, since IELTS below 6 corresponds approximately to CEFR B1–B2, “lower-intermediate” is used throughout this paper to avoid confusion with absolute beginners. Prior experience with input types varied: visual (n = 38), auditory (n = 38), tactile (n = 29), and gestural (n = 16). Such experience was not required, as the questionnaire described each input type with concrete examples; however, differential familiarity with modalities may have influenced ratings and should be considered a potential confound.
2.2. Materials
A Qualtrics questionnaire measured perceived effectiveness of multimodal input combinations across five sections. First, it collected participants’ basic demographic and language learning background information. Second, it assessed learners’ perceptions of the effectiveness of different multimodal input types (visual, auditory, tactile, gestural) and their combinations for vocabulary learning. Third, it explored processing challenges and cognitive load when using multiple input types simultaneously. Fourth, it examined how different word types (concrete, abstract, action, emotion words) interact with various input combinations. Fifth, it investigated learner preferences regarding input presentation sequence for different word types. Concrete examples were provided throughout the questionnaire to ensure participants understood each input type and combination.
No fixed vocabulary items were used as learning targets. Example words were included to help participants understand each word type category (e.g., table for concrete nouns, run for action verbs, freedom for abstract words, joy for emotion words), not as stimuli to be learned or recalled. Since participants rated their perceptions of input modalities at the category level, systematic control of word frequency or imageability was not required for the present purpose. That said, individual example words may have shaped how participants interpreted each category, and future studies would benefit from using multiple exemplars or validated word lists.
Face validity was established through iterative review of item wording by both authors to ensure adequate coverage of the target constructs. Content validity was supported by grounding item content in established theoretical definitions—for instance, gestural input was operationalized following Macedonia (2014), and cognitive load items were informed by Sweller et al. (2011). No formal pilot study was conducted prior to administration.
The questionnaire used 5-point Likert scales throughout. Perceived effectiveness items ranged from 1 (Not effective at all) to 5 (Very effective). Processing difficulty was assessed with five items rated from 1 (Never) to 5 (Always), covering information overload, processing time, attentional focus, input confusion, and mental fatigue. For input sequence preferences, participants selected from three ordering options (e.g., visual → auditory → gestural). For best combination questions, participants selected one option from six dual-modality combinations. The processing difficulty scale showed strong internal consistency (Cronbach’s α = 0.91, 95% CI = [0.87, 0.94]), with item-total correlations ranging from r = 0.74 to r = 0.87, indicating that all items contributed to a coherent measure of perceived cognitive load.
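As an illustration of how the reliability figures can be computed, the sketch below implements Cronbach’s α and corrected item-total correlations in Python (the study’s own analyses were run in R; the six-respondent Likert matrix here is invented purely for demonstration):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, k_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def item_total_correlations(items: np.ndarray) -> np.ndarray:
    """Corrected item-total correlations (item vs. sum of the other items)."""
    k = items.shape[1]
    totals = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], totals - items[:, j])[0, 1] for j in range(k)
    ])

# Toy 1-5 Likert responses for six respondents on five difficulty items
scores = np.array([
    [5, 4, 5, 4, 5],
    [2, 2, 1, 2, 2],
    [4, 4, 4, 3, 4],
    [1, 2, 1, 1, 2],
    [3, 3, 4, 3, 3],
    [5, 5, 4, 5, 5],
])
print(round(cronbach_alpha(scores), 2))
```

With this toy data the scale is highly consistent (α near 1); real item matrices would be read from the questionnaire export.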
2.3. Procedure
Participants were recruited voluntarily via social media. After providing informed consent, participants completed an online questionnaire administered through Qualtrics. The survey took approximately 20–25 min. Participants indicated their English proficiency level based on their most recent test scores (e.g., IELTS) or self-assessment, then completed all sections in a fixed order.
2.4. Data analysis
All analyses were conducted in R using lme4 and lmerTest for mixed-effects modeling (Bates et al., 2015; Kuznetsova et al., 2017). Likert-scale responses (1–5) were treated as continuous, following evidence that parametric analyses are generally robust with such data (Norman, 2010).
For RQ1, a linear mixed-effects model (LMM) examined how perceived effectiveness ratings varied across the 14 input combinations, with Complexity (single/dual/triple) and Proficiency as fixed effects and participants as random intercepts. An exploratory analysis then examined self-reported processing difficulty as a factor potentially associated with proficiency differences. Processing difficulty was computed as the mean of five items assessing information overload, processing time, focus difficulty, confusion, and mental fatigue. Although these items tap conceptually distinct aspects of processing demands, their high internal consistency (Cronbach’s α = 0.91) suggests participants experienced them as a unified construct. A Pearson correlation assessed the association between perceived effectiveness and processing difficulty.
For RQ2, an LMM explored whether perceived effectiveness of multimodal input differed by Word Type (concrete/abstract/action/emotion) and Proficiency. Chi-square goodness-of-fit tests examined whether participants’ combination preferences deviated from uniform distributions within each word type. An omnibus chi-square test on the 4 × 6 contingency table assessed whether preference patterns differed across word categories. Participants’ preferred input sequences were similarly analyzed using chi-square tests. To address the risk of Type I error across multiple chi-square comparisons, Bonferroni correction was applied to two families of tests: proficiency-related combination preference tests (four word types; corrected α = 0.0125) and sequence-related tests (four comparisons; corrected α = 0.0125). Uncorrected p-values are also reported given the exploratory nature of the study.
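The Bonferroni step is a simple division of the family-wise α. A minimal Python sketch, plugging in the uncorrected p-values reported in the Results for the four word-type preference tests, confirms that none falls below the corrected threshold:

```python
# Bonferroni adjustment for a family of four chi-square tests
alpha_family = 0.05
n_tests = 4
alpha_corrected = alpha_family / n_tests   # 0.05 / 4 = 0.0125

# Uncorrected p-values from the proficiency x preference tests (Results)
p_values = {"concrete": 0.148, "action": 0.555, "emotion": 0.064, "abstract": 0.032}
significant = {word_type: p < alpha_corrected for word_type, p in p_values.items()}
print(alpha_corrected, significant)
```

Note that the abstract-word test (p = 0.032) is the only one below the uncorrected α = 0.05, which is why it is flagged as suggestive but not significant after correction.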
For RQ3, proficiency was included as a fixed effect in all LMMs to explore potential differences among proficiency groups. Chi-square tests examined whether combination preferences and beliefs about sequence importance differed across proficiency levels. Given sparse cells in some crosstabulations, Monte Carlo simulated p-values supplemented asymptotic results.
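For readers unfamiliar with simulated p-values, the sketch below illustrates the idea in Python for the simpler goodness-of-fit case, drawing multinomial samples under a uniform null (the contingency-table version instead resamples tables with fixed row and column margins, as R’s chisq.test with simulate.p.value = TRUE does; the counts here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)

def monte_carlo_gof_p(observed, n_sim=10_000):
    """Monte Carlo p-value for a chi-square goodness-of-fit test against
    a uniform null -- useful when expected cell counts are small."""
    observed = np.asarray(observed, dtype=float)
    n, k = observed.sum(), observed.size
    expected = n / k
    stat_obs = ((observed - expected) ** 2 / expected).sum()
    # Resample cell counts under the uniform null and recompute the statistic
    sims = rng.multinomial(int(n), np.full(k, 1.0 / k), size=n_sim)
    stats = ((sims - expected) ** 2 / expected).sum(axis=1)
    # Proportion of simulated statistics at least as large (add-one correction)
    return (np.sum(stats >= stat_obs) + 1) / (n_sim + 1)

# Hypothetical counts: 40 learners choosing among six dual-modality options
p_sim = monte_carlo_gof_p([20, 11, 4, 2, 2, 1])
print(p_sim)
```

Because the simulation conditions on the observed total rather than relying on the asymptotic chi-square distribution, it remains valid when several expected counts fall below the conventional threshold of 5.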
For all LMMs, Type III F-tests used Satterthwaite approximation. Significant effects were followed by Tukey-adjusted pairwise comparisons. Partial η2 was computed to estimate effect sizes, with values of 0.01, 0.06, and 0.14 representing small, medium, and large effects (Cohen, 1988). Cramér’s V indexed association strength for chi-square tests, with 0.10, 0.30, and 0.50 as small, medium, and large effect benchmarks (Cohen, 1988).
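Both effect-size indices can be recovered directly from reported test statistics via η²ₚ = F·df1 / (F·df1 + df2) and Cramér’s V = √(χ² / (n·(min(r, c) − 1))). The Python sketch below reproduces two values from the Results; n = 160 for the preference table is an assumption (40 participants × 4 word-type choices):

```python
import math

def partial_eta_squared(F: float, df1: float, df2: float) -> float:
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (F * df1) / (F * df1 + df2)

def cramers_v(chi2: float, n: int, r: int, c: int) -> float:
    """Cramer's V for an r x c contingency table with n observations."""
    return math.sqrt(chi2 / (n * (min(r, c) - 1)))

# Proficiency effect on perceived effectiveness: F(2, 38.4) = 7.99
print(round(partial_eta_squared(7.99, 2, 38.4), 2))  # -> 0.29

# Word-type preference table: chi2(15) = 80.48, assumed n = 160, 4 x 6 table
print(round(cramers_v(80.48, 160, 4, 6), 2))         # -> 0.41
```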
3. Results
3.1. Perceived effectiveness of input combinations
Participants rated the perceived effectiveness of 14 input combinations on a 5-point scale. Figure 1 presents mean ratings across proficiency levels. The three combinations receiving the highest perceived effectiveness ratings all included visual input: visual alone (M = 4.40, SD = 0.74), visual + auditory (M = 4.37, SD = 0.87), and visual + auditory + tactile (M = 4.28, SD = 0.72). Gestural alone received the lowest rating (M = 3.37, SD = 1.17).
Figure 1.
Heatmap of mean perceived effectiveness ratings by proficiency level and input combination complexity. Visual-based inputs (single visual and V + A) consistently received the highest scores, while triple combinations showed no clear advantage.
A linear mixed-effects model with participants as random intercepts examined the effects of input complexity and proficiency level on perceived effectiveness ratings (Table 1). The main effect of complexity was not significant, F(2, 514) = 1.54, p = 0.216, η2ₚ = 0.01, a small effect, with estimated marginal means (EMMs) of 3.94, 3.79, and 3.83 for single, dual, and triple combinations. The complexity × proficiency interaction was also non-significant, F(4, 514) = 0.93, p = 0.446, η2ₚ = 0.01. However, the main effect of proficiency was significant, F(2, 38) = 7.99, p = 0.001, η2ₚ = 0.29, a large effect. Advanced learners (EMM = 4.21) gave higher ratings than intermediate (EMM = 3.75, p = 0.002) and lower-intermediate learners (EMM = 3.70, p = 0.004); the latter two groups did not differ (p = 0.919). To explore factors that might be associated with these proficiency differences in perceived effectiveness, we examined participants’ self-reported processing difficulty.
Table 1.
Linear mixed model (LMM) analyses for perceived input effectiveness, helpfulness, and processing difficulty.
| Analysis | Effect | F | df | p | ηp² |
|---|---|---|---|---|---|
| Input perceived effectiveness | Complexity (C) | 1.54 | (2, 514.0) | 0.216 | 0.01 |
| | Proficiency (P) | 7.99 | (2, 38.4) | **0.001** | 0.29 |
| | C × P | 0.93 | (4, 514.0) | 0.446 | 0.01 |
| Word type helpfulness | Word type (W) | 5.06 | (3, 111.0) | **0.003** | 0.12 |
| | Proficiency (P) | 11.63 | (2, 37.0) | **<0.001** | 0.39 |
| | W × P | 2.98 | (6, 111.0) | **0.010** | 0.14 |
| Perceived processing difficulty | Dimension (D) | 1.91 | (4, 148.0) | 0.111 | 0.05 |
| | Proficiency (P) | 22.18 | (2, 37.0) | **<0.001** | 0.55 |
| | D × P | 1.57 | (8, 148.0) | 0.139 | 0.08 |
df, degrees of freedom; ηp2, partial eta squared. Significant effects (p < 0.05) are shown in bold. Proficiency Level yielded significant main effects across all measures, while a significant interaction between word type and proficiency (W × P) was observed for helpfulness ratings.
An exploratory LMM examining perceived processing difficulty revealed a significant main effect of Proficiency, F(2, 37) = 22.18, p < 0.001, η2p = 0.55, a large effect by conventional benchmarks (Cohen, 1988). Advanced learners reported the lowest perceived difficulty (M = 2.26), followed by intermediate (M = 3.67) and lower-intermediate learners (M = 4.02). Perceived processing difficulty correlated negatively with perceived effectiveness, r = −0.48, p = 0.002, suggesting that learners who found multimodal input more demanding also rated it as less effective.
3.2. Word type perceived effects and proficiency interaction
Participants rated the perceived effectiveness of multimodal input for four word types (1 = not helpful at all, 5 = very helpful). A linear mixed-effects model with participants as random intercepts examined the effects of word type and proficiency level on perceived effectiveness ratings (Table 1).
The main effect of word type was significant, F(3, 111) = 5.06, p = 0.003, η2ₚ = 0.12, a medium effect.
Estimated marginal means were highest for concrete words (EMM = 4.42), followed by action words (EMM = 4.30), emotion words (EMM = 4.12), and abstract words (EMM = 3.85). Tukey-adjusted pairwise comparisons indicated that abstract words were rated significantly lower than concrete words (p = 0.002) and action words (p = 0.023); no other pairwise differences reached significance (ps > 0.22).
The main effect of proficiency was also significant, F(2, 37) = 11.63, p < 0.001, η2ₚ = 0.39, a large effect, as was the word type × proficiency interaction, F(6, 111) = 2.98, p = 0.010, η2ₚ = 0.14, a large effect (Figure 2). Advanced learners gave the highest ratings across all word types (EMMs = 4.60–5.00). For concrete, action, and emotion words, lower-intermediate and intermediate learners showed similar ratings (EMMs = 3.70–4.20). Ratings for abstract words showed a U-shaped trend across proficiency levels: lower-intermediate (M = 3.90), intermediate (M = 3.05), and advanced (M = 4.60). Advanced learners rated abstract words significantly higher than both intermediate (p = 0.046) and lower-intermediate learners (p = 0.017), while the difference between lower-intermediate and intermediate learners was not significant (p = 0.70).
Figure 2.
Interaction of proficiency level and word type on mean helpfulness ratings. A significant drop in ratings is observed for intermediate learners specifically in the abstract category.
3.3. Preferred input combinations by word type
Participants selected their preferred dual-modality combination for each of four word types (Figure 3). The distribution of combination preferences differed significantly across word types, χ2 (15) = 80.48, p < 0.001, Cramér’s V = 0.41.
Figure 3.
Distribution of learner preferences for input combinations across word types. Learners favored V + A for concrete/action words but significantly preferred A + G for emotion words.
For concrete words, visual + auditory was most frequently selected (50.0%), followed by visual + tactile (27.5%). For action words, visual + gestural was the dominant choice (52.5%). For emotion words, auditory + gestural was most preferred (45.0%), with visual + gestural second (20.0%). Preferences for abstract words were more dispersed: visual + auditory was selected most often (37.5%), but auditory + tactile (17.5%), visual + gestural (15.0%), and visual + tactile (15.0%) were also selected.
Exploratory analyses examined whether combination preferences varied by proficiency level. Applying Bonferroni correction for multiple comparisons across four word types (α = 0.05/4 = 0.0125), none reached statistical significance after correction. For abstract words, the uncorrected chi-square test suggested a potential association, χ2 (10) = 19.70, p = 0.032, corroborated by Fisher’s exact test (p = 0.032) and Monte Carlo simulation with 10,000 replications (Psim = 0.022). Among advanced learners, 80.0% (8 of 10) selected visual + auditory, whereas intermediate learners’ preferences were distributed across visual + auditory (20.0%), auditory + tactile (25.0%), and auditory + gestural (25.0%). However, given small cell sizes and the exploratory nature of this analysis, this pattern should be interpreted cautiously. Combination preferences for the remaining word types showed no association with proficiency (concrete: Psim = 0.148; action: Psim = 0.555; emotion: Psim = 0.064).
The following analyses of input sequence preferences are exploratory in nature and should be interpreted accordingly. The questionnaire also assessed preferences for input presentation order among three modalities (visual, auditory, and gestural). Regarding initial modality, visual was most frequently preferred (52.5%), followed by auditory (15.0%), with 17.5% indicating that their preference depended on word type, χ2 (4) = 28.25, p < 0.001. Preferences for overall trimodal sequence also varied by word type, χ2 (6) = 19.15, p = 0.004, Cramér’s V = 0.24. For concrete words, 75.0% of participants preferred visual → auditory → gestural; for action words, 60.0% selected this same sequence. Preferences were more variable for abstract and emotion words. For abstract words, 45.0% preferred auditory → visual → gestural, while 35.0% preferred visual → auditory → gestural. For emotion words, 47.5% selected visual → auditory → gestural, and 30.0% preferred gestural → visual → auditory.
Most participants reported believing that presentation sequence affected vocabulary learning effectiveness, χ2 (2) = 13.85, p < 0.001: 55.0% reported perceiving a large effect, 37.5% a small effect, and 7.5% no effect. These responses differed by proficiency level, χ2 (4) = 12.44, p = 0.011 (Monte Carlo p, B = 10,000); 30.0% of advanced learners reported that sequence did not matter, whereas no lower-intermediate or intermediate learners selected this option. When asked about learning outcomes when inputs were presented in their preferred order, 52.5% selected “much better” and 42.5% selected “a little better” (M = 1.55, SD = 0.68; 1 = much better, 5 = much worse).
4. Discussion
A fundamental caveat applies throughout: all findings reflect learner perceptions of hypothetical input scenarios, not measured vocabulary acquisition. Research has repeatedly shown that learners sometimes prefer strategies that feel easier yet produce weaker retention—a phenomenon known as desirable difficulties (Bjork and Bjork, 2011; Deslauriers et al., 2019). The present findings should therefore be read as documentation of what learners believe works, not as evidence of what actually works, and require experimental validation before instructional conclusions can be drawn.
Previous research on multimodal vocabulary learning has predominantly examined specific combinations, particularly audiovisual input (Bisson et al., 2014; Yu and Liu, 2022), with limited systematic comparison across broader modal configurations. The present survey study addressed this gap by documenting learners’ perceptions across 14 input combinations, four word types, and three proficiency levels, providing a more comprehensive picture of how learners evaluate multimodal vocabulary input.
4.1. Perceived effectiveness and input complexity
Learners’ ratings showed no significant difference across single, dual, and triple combinations (EMMs: 3.94, 3.79, 3.83), suggesting that adding more modalities did not automatically increase perceived effectiveness. This pattern is broadly consistent with Cognitive Load Theory: when multiple input channels compete for limited working memory resources, added complexity may offset potential benefits rather than enhance them (Sweller et al., 2011). Taken together, these findings partially support CLT over DCT in the context of modality quantity: adding channels did not increase perceived effectiveness, consistent with the prediction that working memory constraints limit the benefit of additional input streams (Sweller et al., 2011). DCT’s prediction that dual-channel encoding enhances processing was not reflected in complexity ratings, though the consistently high ratings for visual-inclusive combinations suggest that the type of channel matters more than the number—a distinction both frameworks can accommodate.
Within this overall pattern, visual input stood out consistently. The three highest-rated combinations—visual only (M = 4.40), visual + auditory (M = 4.37), and visual + auditory + tactile (M = 4.28)—all included a visual component. Gestural input alone received the lowest rating (M = 3.37). Dual Coding Theory offers one possible explanation. Visual input may activate the nonverbal imagery system alongside verbal processing, creating two independent retrieval pathways (Paivio and Walsh, 1993). Chinese EFL learners’ instructional background may have reinforced this pattern. Extended experience with reading and handwriting builds familiarity with visual and tactile channels, which may reduce the extraneous load these channels impose (Kalyuga, 2007; Hu, 2002). Whether learners’ ratings reflect genuine encoding benefits or simply greater comfort with familiar modalities cannot be determined from the present data.
Proficiency was also associated with ratings. Advanced learners tended to score combinations higher than intermediate and lower-intermediate learners (η2ₚ = 0.29), and this difference coincided with variation in perceived processing difficulty. Higher difficulty correlated negatively with perceived effectiveness (r = −0.48), which aligns with evidence that cognitive effort can color learners’ evaluations of their own performance (Seufert, 2020). The proficiency effect was even more pronounced for processing difficulty (η2ₚ = 0.55), where proficiency accounted for over half the variance in how demanding learners found multimodal input—considerably more than input complexity itself (η2ₚ = 0.01). Together, these effect sizes indicate that learner proficiency accounted for approximately 29–55% of variance in perceptions, compared to only 1% for modality quantity, suggesting that who is learning may matter more than how many channels are presented when it comes to perceived effectiveness and cognitive load. This pattern aligns with CLT’s prediction that schema availability—which increases with proficiency—determines processing efficiency (Kalyuga, 2007), though the cross-sectional design limits causal interpretation.
4.2. Word type-specific combination preferences
Learners demonstrated distinct combination preferences across word categories, χ2 (15) = 80.48, p < 0.001, Cramér’s V = 0.41. For concrete nouns, visual + auditory was preferred by 50.0% of participants. For action verbs, visual + gestural was the most common choice (52.5%). This pattern is compatible with embodied cognition perspectives linking physical actions with motor simulation (Lu and Yang, 2025; Kosmas and Zaphiris, 2020), though whether learners consciously drew on such principles remains unclear.
For emotion words, auditory + gestural was most preferred (45.0%), possibly because prosodic features naturally convey affective states (Liebenthal et al., 2016) and expressive gestures enhance emotional communication (Aslan et al., 2024).
These preferences may also reflect learners’ instructional background. Chinese EFL classrooms have traditionally emphasized visual and tactile input through reading and handwriting (Hu, 2002; Zhou et al., 2025). Auditory and gestural channels, being less common in this context, may have been perceived as supplementary rather than primary input. Combination preferences documented here may therefore not generalize to learners from other instructional traditions.
These differentiated patterns suggest that learners perceive certain modality pairings as more appropriate for specific semantic categories. While experimental studies have examined how different modalities affect vocabulary acquisition (Farley et al., 2014; Huang et al., 2019; Lin et al., 2016), the present findings complement this work by documenting how learners themselves conceptualize preferred input for different word types. Whether these preferences correspond to actual learning advantages requires experimental validation.
For abstract words, preferences were more dispersed: although visual + auditory was most frequently selected (37.5%), other combinations each received 15–17.5% support. This lack of consensus may reflect the inherent difficulty of grounding abstract concepts in sensory experience (Dove, 2016; Kaushanskaya and Rechtzigel, 2012), and suggests that learners may be less certain about which modality combinations suit this word type.
Beyond combination types, 52.5% of participants preferred visual input as the starting modality, and 92.5% believed that presentation order affected learning outcomes. While this indicates a prevalent perception among learners, interpretation should consider potential demand characteristics (Orne, 1962)—when explicitly asked whether order matters, participants may overestimate its importance. That said, verifying whether these preferences yield actual learning benefits is advisable, given that self-reported perceptions of learning may not always align with actual performance (Deslauriers et al., 2019). Future experimental designs would benefit from systematically comparing different presentation orders while controlling for expectancy effects.
4.3. Differences across proficiency levels
RQ3 examined proficiency-related differences in input preferences. As reported above, proficiency was significantly associated with overall perceived effectiveness ratings (ηₚ² = 0.29), with advanced learners rating input combinations higher than intermediate and lower-intermediate learners. Beyond this general pattern, a significant word type × proficiency interaction emerged, F(6, 111.0) = 2.98, p = 0.010, ηₚ² = 0.14, which, by conventional benchmarks (Cohen, 1988), represents a large effect, suggesting that proficiency-specific patterns in word type perceptions are not trivial.
For abstract words, intermediate learners reported the lowest helpfulness ratings (EMM = 3.05), below both lower-intermediate (EMM = 3.90) and advanced learners (EMM = 4.60), forming an unexpected U-shaped pattern.
The U-shaped pattern across proficiency levels can be interpreted through CLT’s distinction between extraneous and germane cognitive load (Sweller et al., 2011). For lower-intermediate learners, cognitive resources may be largely consumed by basic decoding demands, leaving insufficient capacity to register the particular difficulty abstract words pose for sensory grounding. Intermediate learners, having reduced their extraneous load through accumulated exposure, may become more sensitive to this difficulty—yet lack the developed schemas needed to allocate germane load toward deeper semantic encoding. The result may be a gap between awareness of difficulty and the capacity to address it, reflected in the lowest ratings in the sample. Advanced learners, by contrast, may bring sufficiently elaborated schemas to the task that germane resources can be directed toward meaning construction—for instance, connecting abstract words to concrete anchor examples or generating contextual definitions (Schmitt, 2008)—allowing them to engage more productively with multimodal input regardless of combination type. Given the post hoc nature of this interpretation and the small subgroup sizes, however, replication with direct measures of strategy use and schema development remains necessary.
Abstract words were also the only category showing potential proficiency-related differences in combination preferences. Although the association did not survive Bonferroni correction for multiple comparisons (uncorrected p = 0.032), converging evidence from chi-square, Fisher’s exact, and Monte Carlo tests suggests the pattern merits attention: 80.0% of advanced learners preferred visual + auditory, whereas intermediate learners’ choices were distributed across multiple combinations (visual + auditory: 20.0%; auditory + tactile: 25.0%; auditory + gestural: 25.0%). This suggests that preference consolidation may occur as proficiency increases, at least for abstract vocabulary, and may reflect the emergence of more systematic encoding strategies at higher proficiency levels (Schmitt, 2008), though the small sample size requires cautious interpretation.
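The Bonferroni logic behind "did not survive correction" can be made concrete. A minimal sketch, assuming one test per word type (m = 4 comparisons—our assumption; the number of comparisons is not stated in this excerpt):

```python
def bonferroni_adjust(p: float, m: int) -> float:
    """Bonferroni-adjusted p-value: multiply the raw p-value by the
    number of comparisons m, capping the result at 1."""
    return min(p * m, 1.0)

# Uncorrected p = 0.032 for the abstract-word association; with an
# assumed family of m = 4 tests, the adjusted p exceeds 0.05,
# hence the association "did not survive" correction.
p_adj = bonferroni_adjust(0.032, 4)
print(p_adj, p_adj < 0.05)  # 0.128 False
```

Equivalently, the raw p of 0.032 can be compared against the adjusted threshold 0.05 / 4 = 0.0125; the conclusion is the same.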
4.4. Pedagogical implications
The differentiated preferences across word types suggest that modality selection may matter more than modality quantity. Rather than uniformly increasing input channels, materials could be tailored to word characteristics: pairing images with narrated sentence contexts for concrete nouns, incorporating gestural demonstrations such as physically enacting collapse or grasp for action verbs, and using audio recordings that model emotional prosody for emotion words. The comparable perceived effectiveness ratings across single, dual, and triple combinations—alongside visual-only input achieving the highest mean rating (M = 4.40)—further suggest that a carefully designed single-modality resource may serve learners better than a hastily assembled multimodal alternative, which carries practical implications for resource-constrained settings. These suggestions remain grounded in learner perceptions rather than performance data and should be treated as hypotheses awaiting experimental confirmation.
The proficiency-related findings for abstract vocabulary point to a more specific instructional need. Intermediate learners showed the most dispersed combination preferences and the lowest perceived effectiveness ratings for this word type (EMM = 3.05), suggesting that varying modality alone may be insufficient. Instructors might consider making encoding strategies explicit—guiding learners to construct mental images, link abstract words to concrete anchor examples, or practice verbal elaboration by generating their own definitions or example sentences (Schmitt, 2008). For presentation sequences, the strong preference for visual-first input (52.5%) offers a tentative starting point, consistent with evidence that visual input may support subsequent linguistic encoding (Yu and Liu, 2022), though instructors should remain cautious about rigid sequencing given that 30% of advanced learners reported order made no difference.
5. Limitations
This study has several limitations. The sample was small (N = 40) and unevenly distributed across proficiency levels (10 lower-intermediate, 20 intermediate, 10 advanced), which limited power for detecting interactions. Some patterns that appeared potentially meaningful—such as how advanced learners preferred visual + auditory for abstract words while intermediate learners’ choices were dispersed—did not survive Bonferroni correction and require replication with larger samples. The sample also skewed heavily female (90%), so we cannot determine whether gender affects modality preferences.
The design introduced additional constraints. Because all data came from self-reports about hypothetical scenarios, we cannot verify whether perceived effectiveness aligns with actual learning or whether reported processing difficulty reflects genuine cognitive load. The questionnaire’s explicit questions about presentation sequence may have inflated its perceived importance through demand characteristics. Additionally, our operational definition of “tactile input” combined distinct activities—handling flashcards versus writing/copying—which may involve different cognitive processes even though both engage the hands. The absence of formal pilot testing also limits available evidence for the questionnaire’s construct validity.
These constraints point to clear next steps. Experimental studies comparing retention across modality combinations would reveal whether preferences translate into learning advantages. Such experiments could also test whether sequence actually matters by manipulating order while controlling for expectancy effects. Tracking learners longitudinally would show whether preferences shift as proficiency develops, and adding objective measures like eye-tracking could clarify what drives modality choices.
The present findings nonetheless document systematic patterns in how Chinese EFL learners perceive multimodal vocabulary input. Learners distinguish among modalities based on word type and judge specific combinations rather than sheer quantity of input channels. These perceptions vary with proficiency in ways that suggest developmental changes in how learners approach vocabulary learning. Whether these patterns predict actual learning outcomes remains an open question for experimental research.
Acknowledgments
We thank the participants for their time and contribution to this study.
Funding Statement
The author(s) declared that financial support was not received for this work and/or its publication.
Footnotes
Edited by: Hassan Banaruee, University of Education Weingarten, Germany
Reviewed by: Ronald Leow, Georgetown University, United States
Rachid Ed-Dali, Cadi Ayyad University, Morocco
Data availability statement
The original contributions presented in the study are included in the article/Supplementary material; further inquiries can be directed to the corresponding authors.
Ethics statement
This study was approved by the University of Auckland Human Participants Ethics Committee (Reference: UAHPEC29609). All participants provided informed consent electronically via Qualtrics prior to accessing the survey.
Author contributions
JL: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Validation, Writing – original draft, Writing – review & editing, Software, Visualization. HW: Conceptualization, Data curation, Formal analysis, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing, Investigation, Resources.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that Generative AI was used in the creation of this manuscript. Generative AI was used only to assist with language editing and improving readability. The authors reviewed, verified, and take full responsibility for all content.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2026.1783303/full#supplementary-material
References
- Altarriba J., Basnight-Brown D. M. (2012). The acquisition of concrete, abstract, and emotion words in a second language. Int. J. Bil. 16, 446–452. doi: 10.1177/1367006911429511
- Aslan Z., Özer D., Göksun T. (2024). Exploring emotions through co-speech gestures: the caveats and new directions. Emot. Rev. 16, 265–275. doi: 10.1177/17540739241277820
- Bates D., Mächler M., Bolker B., Walker S. (2015). Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48. doi: 10.18637/jss.v067.i01
- Bisson M.-J., van Heuven W. J. B., Conklin K., Tunney R. J. (2014). The role of repeated exposure to multimodal input in incidental acquisition of foreign language vocabulary. Lang. Learn. 64, 855–877. doi: 10.1111/lang.12085
- Bjork E. L., Bjork R. A. (2011). “Making things hard on yourself, but in a good way: creating desirable difficulties to enhance learning,” in Psychology and the Real World: Essays Illustrating Fundamental Contributions to Society, eds. Gernsbacher M. A., Pew R. W., Hough L. M., Pomerantz J. R. (New York: Worth Publishers), 56–64.
- Borghi A. M., Binkofski F., Castelfranchi C., Cimatti F., Scorolli C., Tummolini L. (2017). The challenge of abstract concepts. Psychol. Bull. 143, 263–292. doi: 10.1037/bul0000089
- Cárdenas-Claros M. S., Sydorenko T., Huntley E., Montero Perez M. (2023). Teachers’ voices on multimodal input for second or foreign language learning. Lang. Teach. Res. doi: 10.1177/13621688231216044
- Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences. Rev. Edn. Hillsdale, NJ: Lawrence Erlbaum Associates.
- Deslauriers L., McCarty L. S., Miller K., Callaghan K., Kestin G. (2019). Measuring actual learning versus feeling of learning in response to being actively engaged in the classroom. PNAS 116, 19251–19257. doi: 10.1073/pnas.1821936116
- Dove G. (2016). Three symbol ungrounding problems: abstract concepts and the future of embodied cognition. Psychon. Bull. Rev. 23, 1109–1121. doi: 10.3758/s13423-015-0825-4
- Farley A., Pahom O., Ramonda K. (2014). Is a picture worth a thousand words? Using images to create a concreteness effect for abstract words: evidence from beginning L2 learners of Spanish. Hispania 97, 634–650. doi: 10.1353/hpn.2014.0106
- Fu S. (2021). Chinese EFL university students’ self-reported use of vocabulary learning strategies. Engl. Lang. Teach. 14:117. doi: 10.5539/elt.v14n12p117
- Hu G. (2002). Potential cultural resistance to pedagogical imports: the case of communicative language teaching in China. Lang. Cult. Curr. 15, 93–105. doi: 10.1080/07908310208666636
- Huang X., Kim N., Christianson K. (2019). Gesture and vocabulary learning in a second language. Lang. Learn. 69, 177–197. doi: 10.1111/lang.12326
- Kalyuga S. (2007). Expertise reversal effect and its implications for learner-tailored instruction. Educ. Psychol. Rev. 19, 509–539. doi: 10.1007/s10648-007-9054-3
- Kana R. K., Blum E. R., Ladden S. L., Ver Hoef L. W. (2012). “How to do things with words”: role of motor cortex in semantic representation of action words. Neuropsychologia 50, 3403–3409. doi: 10.1016/j.neuropsychologia.2012.09.006
- Kaushanskaya M., Rechtzigel K. (2012). Concreteness effects in bilingual and monolingual word learning. Psychon. Bull. Rev. 19, 935–941. doi: 10.3758/s13423-012-0271-5
- Kosmas P., Zaphiris P. (2020). Words in action: investigating students’ language acquisition and emotional performance through embodied learning. Innov. Lang. Learn. Teach. 14, 317–332. doi: 10.1080/17501229.2019.1607355
- Kuznetsova A., Brockhoff P. B., Christensen R. H. B. (2017). lmerTest package: tests in linear mixed effects models. J. Stat. Softw. 82, 1–26. doi: 10.18637/jss.v082.i13
- Liaw S.-S., Huang H.-M. (2013). Perceived satisfaction, perceived usefulness and interactive learning environments as predictors to self-regulation in e-learning environments. Comput. Educ. 60, 14–24. doi: 10.1016/j.compedu.2012.07.015
- Liebenthal E., Silbersweig D. A., Stern E. (2016). The language, tone and prosody of emotions: neural substrates and dynamics of spoken-word emotion perception. Front. Neurosci. 10:506. doi: 10.3389/fnins.2016.00506
- Lin J. J. H., Lee Y.-H., Wang D.-Y., Lin S. S. J. (2016). Reading subtitles and taking enotes while learning scientific materials in a multimedia environment: cognitive load perspectives on EFL students. Educ. Technol. Soc. 19, 47–58. doi: 10.30191/ETS.201610_19(4).0005
- Liu Y., Ren W. (2024). Task-based language teaching in a local EFL context: Chinese university teachers’ beliefs and practices. Lang. Teach. Res. 28, 2234–2250. doi: 10.1177/13621688211044247
- Longcamp M., Boucard C., Gilhodes J.-C., Anton J.-L., Roth M., Nazarian B., et al. (2008). Learning through hand- or typewriting influences visual recognition of new graphic shapes: behavioral and functional imaging evidence. J. Cogn. Neurosci. 20, 802–815. doi: 10.1162/jocn.2008.20504
- Lu X., Yang J. (2025). Second language embodiment of action verbs: the impact of bilingual experience as a multidimensional spectrum. Bilingual. Lang. Cogn. 28, 1117–1133. doi: 10.1017/S1366728924000981
- Macedonia M. (2014). Bringing back the body into the mind: gestures enhance word learning in foreign language. Front. Psychol. 5:1467. doi: 10.3389/fpsyg.2014.01467
- Mayer R. E. (2009). Multimedia Learning. 2nd Edn. Cambridge, UK: Cambridge University Press.
- Mayer R. E. (2024). The past, present, and future of the cognitive theory of multimedia learning. Educ. Psychol. Rev. 36:8. doi: 10.1007/s10648-023-09842-1
- Montero Perez M. (2020). Multimodal input in SLA research. Stud. Second. Lang. Acquis. 42, 653–663. doi: 10.1017/S0272263120000145
- Norman G. (2010). Likert scales, levels of measurement and the “laws” of statistics. Adv. Health Sci. Educ. 15, 625–632. doi: 10.1007/s10459-010-9222-y
- Orne M. T. (1962). On the social psychology of the psychological experiment: with particular reference to demand characteristics and their implications. Am. Psychol. 17, 776–783. doi: 10.1037/h0043424
- Paivio A., Walsh M. (1993). “Psychological processes in metaphor comprehension and memory,” in Metaphor and Thought. 2nd Edn. ed. Ortony A. (Cambridge, UK: Cambridge University Press), 307–328.
- Schmitt N. (2008). Review article: instructed second language vocabulary learning. Lang. Teach. Res. 12, 329–363. doi: 10.1177/1362168808089921
- Seufert T. (2020). Building bridges between self-regulation and cognitive load—an invitation for a broad and differentiated attempt. Educ. Psychol. Rev. 32, 1151–1162. doi: 10.1007/s10648-020-09574-6
- Sweller J., Ayres P., Kalyuga S. (2011). Cognitive Load Theory. New York, NY: Springer.
- Ting K. (2013). Input in multimodality for language learning. Int. J. Literacies 19, 47–56. doi: 10.18848/2327-0136/CGP/v19i01/48795
- Xing B., Zhang H. (2025). A study of the effect of multimodal input on vocabulary acquisition: evidence from online Chinese language learners. Lang. Teach. Res. doi: 10.1177/13621688241313017
- Yu J., Liu X. (2022). Text first or picture first? Evaluating two modes of multimodal input for EFL vocabulary meaning acquisition. SAGE Open 12. doi: 10.1177/21582440221119469
- Yuan X., Tang X. (2025). Effects of the sequential use of L1 and bilingual subtitles on incidental English vocabulary learning: a cognitive load perspective. Br. J. Educ. Psychol. 95, 565–577. doi: 10.1111/bjep.12740
- Zhou J., Li C., Cheng Y. (2025). Transforming pedagogical practices and teacher identity through multimodal (inter)action analysis: a case study of novice EFL teachers in China. Behav. Sci. 15:1050. doi: 10.3390/bs15081050