Perceptual and Motor Skills. 2023 Mar 2;130(3):958–983. doi: 10.1177/00315125231159688

Portuguese Physical Literacy Assessment Questionnaire (PPLA-Q) for Adolescents: Validity and Reliability of the Psychological and Social Modules using Mokken Scale Analysis

João Mota, João Martins, Marcos Onofre
PMCID: PMC10233509  PMID: 36861939

Abstract

We examined the construct validity and reliability of the previously developed Psychological and Social modules of the Portuguese Physical Literacy Assessment Questionnaire (PPLA-Q) using Mokken Scale Analysis in a sample of 508 Portuguese adolescents in public schools in Lisbon. We used a retest subsample (n = 73) to calculate the Intraclass Correlation Coefficient. Eight PPLA-Q scales can be interpreted as moderate-to-strong Mokken scales (H = .47–.66) with good total-score reliability (ρ = .83–.94), and moderate-to-excellent test-retest reliability (ICC95%CI = .51–.95); four scales had an interpretable invariant item ordering. All but the Physical Regulation scale functioned similarly across sex. Scale-scores correlated as expected, with low-to-moderate correlations across domains supporting convergent and discriminant validity. These results support the construct validity and reliability of the PPLA-Q to assess the psychological and social domains of physical literacy in Portuguese adolescents (15–18 years) enrolled in physical education.

Keywords: physical literacy, assessment, physical education, construct validity, reliability, high-school, adolescence

Introduction

Physical literacy (PL) is a holistic concept referring to the skills and attributes that individuals demonstrate through physical activity (PA) and movement throughout their lives, enabling them to lead healthy and fulfilling lifestyles (Physical Literacy for Life, 2021) and to reap the widely documented physical, cognitive, affective, and social benefits of PA (Australian Government Department of Health, 2019; World Health Organization, 2020). Additionally, PL might help counter the grim scenario in which 27.5% of adults and 81% of adolescents (11–17 years) worldwide fail to meet the World Health Organization's PA guidelines (Guthold et al., 2018, 2020). Others have advocated quality physical education (PE), as a mandatory and free school subject, as a central means of boosting the development of PL (Guthold et al., 2020; UNESCO, 2015). If this goal is to be achieved, assessment of PL will be essential for tracking progress (Corbin, 2016).

To provide a feasible and integrated assessment of PL within the Portuguese PE context, we developed the Portuguese Physical Literacy Assessment (PPLA). This instrument targets students in grades 10–12 (15–18 years old), since studies in Portugal have shown that high school adolescents have lower PA levels and more sedentary behavior than their younger peers (Baptista et al., 2012; Matos & Equipa Aventura Social, 2018). This tool was inspired by the Australian Physical Literacy Framework (APLF; Sport Australia, 2019) and by the outcomes and didactic philosophy of the Portuguese PE syllabus (Ministério da Educação [Ministry of Education], 2001a, 2001b).

The PPLA is composed of two parts: a self-report questionnaire (PPLA-Q; Mota et al., 2021) and an observational instrument (PPLA-O; Mota et al., 2022). The PPLA-Q features three modules: Psychological, Social, and Cognitive; collectively, they assess a selection of elements from the APLF (Mota et al., 2021). The Psychological and Social modules comprise Likert-type items measuring diverse socio-affective elements that have been posited as determinants of adolescents' PA participation (e.g., Cortis et al., 2017) and associated with multiple beneficial outcomes inside and outside of PE (e.g., Li et al., 2008; Pozo et al., 2018). Within the Psychological module are the (a) Motivation, (b) Confidence, (c) Emotional Regulation, and (d) Physical Regulation scales; within the Social module are the (a) Culture & Society, (b) Ethics, (c) Collaboration, and (d) Relationships scales. Each of these elements is posited to have an underlying learning continuum (Mota et al., 2021; Sport Australia, 2019) based on an integration of the Structure of Observed Learning Outcomes taxonomy (Biggs & Collis, 1982) and Bloom's affective taxonomy (Krathwohl et al., 1964). To operationalize this underlying conceptualization, two Likert-type subscales of differing difficulty were initially developed within each element: one to measure foundational skills and attitudes (Foundation), and one targeting a higher development of these skills (Mastery; Mota et al., 2021), in line with Superficial/Rote and Deep Learning, as discussed by Biggs and Collis (1982).

The choice to separate each element into two subscales was based on an initial Classical Test Theory (CTT) framework, whose dimensionality assessment methods (i.e., linear factor analysis) favor grouping together items of similar difficulty (i.e., creating a method factor) (Sijtsma & van der Ark, 2021; van Schuur, 2003). However, Item Response Theory (IRT) models provide another solution to this issue by explicitly modeling difficulty as a parameter. Within this large class of models, nonparametric IRT (NIRT) models, like those included in Mokken Scale Analysis (MSA; Sijtsma & van der Ark, 2021), have been seen as particularly useful for affective variables (e.g., Reise & Waller, 2009), since their underlying response processes might not conform to the more rigidly defined response patterns implied by parametric models (Sijtsma & van der Ark, 2022; van Schuur, 2003; Wind, 2017). These NIRT models might allow for a more flexible and in-depth analysis of the dimensionality of the Psychological and Social modules, an essential characteristic for the validity and interpretation of scores derived from any instrument (American Educational Research Association et al., 2014; Chan, 2014).

Previous research has also highlighted sex differences among adolescents both in PA participation (Guthold et al., 2020) and in other variables similar to those included in the PPLA-Q scales (e.g., Motivation, Emotional Intelligence; Vaquero-Diego et al., 2020; Vasconcellos et al., 2019). However, before any meaningful sex comparisons can be drawn, differential item and test functioning (DIF and DTF) analyses are warranted to provide evidence of measurement invariance across sexes at both the item and scale level (Gamerman et al., 2019; Moorer et al., 2001; Teresi et al., 2008). Similarly, test-retest reliability is crucial in distinguishing random short-term score differences from true change (Polit, 2014), allowing reliable tracking of learning of these PL elements.

In this study, we aimed to investigate (a) the dimensionality, convergent and discriminant validity, and measurement invariance (DIF and DTF), and (b) the reliability (total-score and test-retest) of the Psychological and Social modules of the PPLA-Q among Portuguese high school adolescents in grades 10–12 (15–18 years) through MSA. This study was part of a larger project to validate all measures of the PPLA.

Method

Participants

Main Study (Baseline)

All work was done in Portugal, as part of the doctoral project of the lead author, as approved by the appropriate Institutional Review Board and governance committees. We used a convenience sample of 521 students (out of 611 available students; this 15% nonparticipation rate was lower than the 30% rate observed in a pilot study) in grades 10–12 from 25 classes in six public schools in the metropolitan Lisbon area. Due to COVID-19 restrictions, we selected only schools with a pre-service PE protocol with the Faculty of Human Kinetics. Participant recruitment was stratified by grade and course major, according to student numbers reported for the 2017–18 school year (Ministério da Educação [Ministry of Education], 2019). We selected schools from as diverse socioeconomic backgrounds as possible, based on information from each school's educational project. Before participation, we obtained signed informed consent from all students and from their legal guardians when students were minors. Thirteen students missed class on the day of data collection and were removed from this study. Table 1 shows the main characteristics of the final sample used in the data analyses; this sample size conformed to acceptable sample sizes for MSA (Mokkink et al., 2018; Straat et al., 2014).

Table 1.

Sample Characteristics.

Characteristic       Baseline (N = 508)a   Retest (N = 73)a
Sex
 Female              299 (59%)             41 (56%)
 Male                209 (41%)             32 (44%)
Age, years           16 (1)                16 (1)
Grade
 10                  204 (40%)             —
 11                  137 (27%)             73 (100%)
 12                  167 (33%)             —
Major
 Economics           75 (15%)              —
 Humanities          165 (32%)             —
 STEM                268 (53%)             73 (100%)
School
 School 1            39 (7.7%)             —
 School 2            61 (12%)              —
 School 3            21 (4.1%)             —
 School 4            69 (14%)              —
 School 5            207 (41%)             73 (100%)
 School 6            111 (22%)             —

Note: STEM = Sciences, Technology, Engineering and Math.

aStatistics presented: n (%); M (SD). — = no retest participants in this category.

Retest Study Phase

We selected a subsample of 73 students from available classes for retest administration of the PPLA-Q (Table 1). This sample size was based on a minimum of 64 students, estimated using an online power analysis tool (Arifin, 2020) for an expected Intraclass Correlation Coefficient (ICC) of .80, with .10 precision in a 95% confidence interval (Bonett, 2002), while also accounting for 20% participant attrition. Given time and COVID-19 constraints, no stratification of participants was possible for this phase.
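
For readers wishing to verify this calculation, a minimal base-R sketch of Bonett's (2002) approximation is shown below; we assume the online tool implements this formula, and icc_n is a hypothetical helper name.

```r
# Bonett (2002) approximation of the n needed to estimate an ICC with a given
# 95% CI width; k = number of repeated measurements, w = total CI width
# (twice the desired precision). 'icc_n' is a hypothetical helper.
icc_n <- function(rho, k = 2, precision = 0.10, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  w <- 2 * precision
  ceiling(8 * z^2 * (1 - rho)^2 * (1 + (k - 1) * rho)^2 / (k * (k - 1) * w^2) + 1)
}

n_min <- icc_n(rho = 0.80)     # 51 completers for a two-occasion design
ceiling(n_min / (1 - 0.20))    # 64 after allowing for 20% attrition
```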

Measures

As noted above, the PPLA-Q is a questionnaire developed to assess the psychological, social, and part of the cognitive domains of physical literacy in Portuguese adolescents. Evidence supporting the content validity of the PPLA-Q was previously established and reported (Mota et al., 2021). The Psychological and Social modules (v0.6) of the PPLA-Q comprised 46 and 43 Likert-type items, respectively, divided into eight scales: (a) Motivation, (b) Confidence, (c) Emotional Regulation, and (d) Physical Regulation in the Psychological module; and (a) Culture & Society, (b) Ethics, (c) Collaboration, and (d) Relationships in the Social module (a full item listing is available in Supplemental File 1, Tables S1 and S2). All items used a consistent 5-point unipolar response scale. Response points were fully labeled with both numeric and verbal labels (0 = Not at all; 1 = Slightly; 2 = Moderately; 3 = Quite a lot; 4 = Totally), measuring the student's degree of identification with each statement (using the general stem: “How much do these statements describe you?”).

Procedures

Main Study (Baseline)

The PPLA-Q (v0.6) was self-administered during PE classes under the supervision of the lead author, from January to March 2021. Data collection began in paper-and-pencil format; however, a COVID-19 lockdown forced us to resume data collection in an online format, so only three of the 25 sampled classes completed the PPLA-Q on paper (n = 60). A standardized verbal instruction informed participants of the questionnaire's goals and the anonymity of collected data; it also encouraged participants to provide honest answers. The lead author clarified any questions throughout the administration of the PPLA-Q. The average completion time (n = 452) was 5.5 (2.2) min and 4.6 (1.8) min for the Psychological and Social modules, respectively.

Retest Study Phase

The PPLA-Q was re-administered in online format 15 days after the first administration, an interval chosen to reduce carryover effects (Nunnally & Bernstein, 1994). All remaining procedures were identical across the two administrations. On re-administration, the average completion time (n = 73) was 3.8 (1.2) min and 3.3 (1.1) min for the Psychological and Social modules, respectively.

Data Analysis

All analyses were performed in RStudio (RStudio Team, 2020) with R 4.1.0 (R Core Team, 2020). We reverse-scored the negatively stated item S15 (Ethics scale) and items P2–P6 (Motivation scale) so that an increase in score would correspond to an increase in each assessed element. Nine items had one missing response each (0.2%), due to the initial paper-and-pencil administration. For these responses, we used two-way imputation (Bernaards & Sijtsma, 2000).
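
As an illustration of this preprocessing, the sketch below reverses the flagged items and applies two-way imputation; it assumes a data frame ppla holding one 0–4 item score per column (a hypothetical name) and a recent version of the mokken package, which ships a twoway() implementation.

```r
library(mokken)

# 'ppla' is a hypothetical data frame with one 0-4 item score per column,
# named after the paper's item labels (P1...P46, S1...S43).
reversed <- c("S15", paste0("P", 2:6))
ppla[reversed] <- 4 - ppla[reversed]  # reflect 0-4 scores so higher = more of the trait

# Two-way imputation of the few missing item scores (Bernaards & Sijtsma, 2000).
ppla <- twoway(ppla)
```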

Dimensionality

Instead of separating each element of the PPLA-Q into two different subscales based on difficulty, we used MSA to analyze each element as a single scale, in coherence with the initial development logic of an underlying learning continuum across all items (Keegan et al., 2019; Mota et al., 2021; Sport Australia, 2019). This was possible due to the cumulative nature of MSA models, which recognizes that different items might have different difficulty levels, influencing their endorsement (van Schuur, 2003).

Before using MSA, we flagged outlier participants by calculating Guttman errors on each scale, using Tukey's upper fence as the cutoff (Zijlstra et al., 2011). The number of outlier participants ranged from 23 to 35, depending on the scale (5–7% of the sample size). A sensitivity analysis revealed that these outliers affected the scalability coefficients for each scale, so we removed them from further analyses (Sijtsma & van der Ark, 2017). We assessed the dimensionality and total-score reliability of each scale using confirmatory MSA within the mokken package (van der Ark, 2012), fitting the polytomous Monotone Homogeneity Model (MHM) and the polytomous Double Monotonicity Model (DMM). We tested the unidimensionality assumption using the 95% confidence intervals for scalability coefficients at the item (Hi) and scale level (H). For Hi, we used a .30 cutoff (van der Ark, 2012); non-conforming items were eliminated stepwise, after evaluating the impact on content representativeness and their scalability with other items on the scale. We evaluated final scales according to their H as strong, medium, and weak scales, with thresholds of .50, .40, and .30, respectively (Sijtsma & Molenaar, 2002).
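
A minimal sketch of this screening step for a single scale follows; the item set and the object names (motivation, gplus, clean) are illustrative, and the ci argument of coefH() assumes a recent mokken release.

```r
library(mokken)

# Guttman-error-based outlier screening for one scale (item set per Table 2).
motivation <- ppla[, paste0("P", c(1:11, 43))]
gplus <- check.errors(motivation, returnGplus = TRUE)$Gplus
fence <- quantile(gplus, 0.75) + 1.5 * IQR(gplus)  # Tukey's upper fence
clean <- motivation[gplus <= fence, ]

# Scalability with 95% CIs: Hi per item and H for the whole scale.
coefH(clean, ci = 0.95)
```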

We used the conditional association procedure to assess local independence (Sijtsma et al., 2015; Straat et al., 2016). We examined the content of each pair of items flagged by the mokken package for positive local dependence (PLD; W1 and W2 statistics) or negative local dependence (NLD; W3 statistic) and deleted the least representative item of each pair. We then reran the analysis until no offending pairs were detected.
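
In code, this check reduces to a single call; a sketch under the same assumptions as above (the exact output components follow the mokken documentation):

```r
# Conditional association procedure (Straat et al., 2016): W1/W2 index
# positive local dependence, W3 indexes negative local dependence.
ca <- check.ca(clean, Windex = TRUE)
ca$InScale  # items still considered locally independent at each step
ca$Index    # W1, W2, and W3 statistics
ca$Flagged  # items flagged by each statistic
```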

We assessed the monotonicity and invariant item ordering (IIO) assumptions through the crit statistic for each item, using a cutoff of crit < 40 (Stochl et al., 2012). Assessment of IIO was supplemented by a graphical analysis of pairwise Item Response Functions (IRF) to assess non-intersection (Sijtsma et al., 2011; Wind, 2017). After IIO was established, we calculated the Htrans (HT) coefficient, using the manifest invariant item ordering (MIIO) method, to assess the accuracy and usefulness of said IIO; we evaluated scales based on their HT as having high, medium, and low accuracy, using thresholds of .50, .40, and .30, respectively (Ligtvoet et al., 2010). We performed further exploratory analyses on scales in which clusters of unscalable items and/or borderline scalability coefficients (Hi95%CI ≈ .30) were identified. Their specific methods and results are available in Supplemental File 2 (Tables S3–S5).
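
Both assumptions can be checked with dedicated mokken functions; a sketch, again on the hypothetical clean item matrix:

```r
# Monotonicity: 'crit' summarizes the severity of any violations
# (crit < 40 was treated as negligible, per Stochl et al., 2012).
summary(check.monotonicity(clean))

# Invariant item ordering via the MIIO method; the summary reports
# violations per item and the HT coefficient for the final item set.
summary(check.iio(clean, method = "MIIO"))
```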

Measurement Invariance

We assessed whether DIF and DTF according to sex were present in each scale by calculating scalability for each item (Hi) and scale (H) for each subgroup (Sijtsma & van der Ark, 2017; Wind, 2017). Then, we analyzed the magnitude of any existing difference, along with its statistical significance (i.e., non-intersection of 95% CIs).
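
This subgroup comparison amounts to computing the coefficients separately by sex; a sketch assuming a vector sex (hypothetical name) aligned with the rows of the clean item matrix:

```r
# Scalability by subgroup: compare item (Hi) and total (H) coefficients
# and their 95% CIs between female and male respondents.
H_female <- coefH(clean[sex == "female", ], ci = 0.95)
H_male   <- coefH(clean[sex == "male", ],  ci = 0.95)
# DIF screening: |Hi(female) - Hi(male)| > .10 with non-overlapping CIs;
# DTF screening: the same comparison for the total H coefficient.
```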

Reliability

We calculated Molenaar and Sijtsma's (1988) ρ within mokken as an unbiased measure of total-score reliability for each of the final scales. Its interpretation followed the same cutoffs as Cronbach's α (Cronbach, 1951), with .70 and .80 as thresholds for acceptable (Nunnally & Bernstein, 1994) and good reliability (Price, 2017), respectively. For comparison with previous studies and to accommodate readers accustomed to CTT, we also computed the α coefficient. To establish test-retest reliability, we computed the ICC and its 95%CI in the irr package (Gamer et al., 2019)—according to a single-rater, absolute-agreement, two-way mixed-effects model (formula 2.1 in Koo & Li, 2016)—using sum-scores of the final scales at both time points (Liljequist et al., 2019). We used ICC values of .90, .75, and .50, respectively, as thresholds for excellent, good, and moderate test-retest reliability (Koo & Li, 2016).
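
Both reliability estimates map onto standard package calls; a sketch, where clean_baseline and clean_retest are hypothetical item matrices for the same scale at the two time points:

```r
library(mokken)
library(irr)

# Total-score reliability: MS is the Molenaar-Sijtsma rho; alpha is Cronbach's alpha.
check.reliability(clean)[c("MS", "alpha")]

# Test-retest reliability on scale sum-scores: single-rater, absolute-agreement,
# two-way model (the point estimate matches Koo & Li's formula 2.1).
scores <- data.frame(t1 = rowSums(clean_baseline), t2 = rowSums(clean_retest))
icc(scores, model = "twoway", type = "agreement", unit = "single")
```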

Discriminant and Convergent Validity

We calculated bivariate Spearman correlations (and their 95%CI) among total summed scores using the RVAideMemoire package (Hervé, 2021) with 1000 bootstrap replications. Then, we deattenuated the correlations for measurement error using the respective ρ coefficients, r_xy/√(ρ_x ρ_y) (Murphy & Davidshofer, 2005). We used the deattenuated values to evaluate discriminant validity (with a threshold of .85 to discern whether variables were statistically different) and convergent validity (by comparing with magnitudes reported in similar studies). Interpretation of magnitudes followed guidelines of .90, .70, .50, and .30 as thresholds for very high, high, moderate, and low correlations, respectively (Hinkle et al., 2003).
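
A sketch of one such pairwise estimate follows, using hypothetical sum-score vectors motivation_sum and confidence_sum and the ρ values from Table 4; the deattenuated value reproduces the .83 reported in Table 5.

```r
library(RVAideMemoire)

# Bootstrapped 95% CI for the Spearman correlation between two scale scores.
spearman.ci(motivation_sum, confidence_sum, nrep = 1000, conf.level = 0.95)

# Deattenuation for measurement error: r_xy / sqrt(rho_x * rho_y).
r_obs <- cor(motivation_sum, confidence_sum, method = "spearman")  # ~.73 (Table 5)
r_obs / sqrt(0.83 * 0.94)  # Motivation and Confidence rho values -> ~.83
```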

Results

Item Response Frequencies and Difficulty Level

In the interest of parsimony, Supplementary Table S1 displays the response frequencies in each response category, as well as the mean for each item in the Psychological and Social modules of the PPLA-Q. We obtained a balanced distribution of responses across options: no response option had a frequency higher than 55%, and nine items (10%) had no responses in their lowest response option (0 = “Not at all”). As expected, items developed to represent a higher development in each element (i.e., Mastery) had overall lower mean values (i.e., higher difficulty) than their less complex (i.e., Foundation) counterparts.

Dimensionality

Scalability

In the Psychological module, nine items were deemed unscalable, as their Hi confidence interval included the cutoff value of .30 (or lower) (Table 2): four were in the Motivation scale, with items P3 and P4—both about introjected regulation—displaying high scalability with each other (Hij = .74); four were in the Emotional Regulation scale, three of which (P24, P26, and P27) had high scalability with each other (Hij = .64 to .78), suggesting an item cluster about the capacity to evaluate others' emotions (e.g., P27—“I understand what others feel”); and the remaining item was in the Physical Regulation scale.

Table 2.

Mokken Scaling Analysis (MSA) Abbreviated Results for the Psychological Module of the PPLA-Q.

Label  M (SD)  Reason for removal (confirmatory MSA)  Final Hi [95%CI]  DIF/DTF ΔHi
Motivation (N = 481; nfemale = 284 / nmale = 197)
 P1 2.6 (0.9) .56 [.52, .61] −.01
 P2 3.0 (1.0) .38 [.31, .45] −.01
 P3 1.9 (1.2) us (2)
 P4 1.8 (1.2) us (1)
 P5 3.3 (0.9) .39 [.32, .46] −.12
 P6 3.6 (0.7) us (4)
 P7 2.6 (1.0) .46 [.40, .51] .04
 P8 3.3 (0.8) .50 [.45, .55] .00
 P9 2.4 (1.2) PLDP11 and PLDP2(5)
 P10 3.3 (0.7) us (3)
 P11 2.2 (1.2) .54 [.49, .58] −.05
 P43 2.5 (1.1) .46 [.41, .51] −.04
H [95%CI] .47 [.43, .51] −.03
Confidence (N = 474; nfemale = 279 / nmale = 195)
 P13 2.7 (1.1) .71 [.68, .75] −.07
 P14 2.4 (1.0) .70 [.66, .74] −.06
 P15 2.6 (0.9) IIO (2)
 P16 2.5 (1.0) .67 [.64, .71] −.05
 P17 2.5 (1.0) IIO (1)
 P18 2.5 (1.0) .71 [.67, .74] −.01
 P19 2.5 (1.0) .64 [.60, .68] −.05
 P20 2.1 (1.1) .61 [.57, .66] .00
 P21 2.4 (1.1) .65 [.60, .69] −.07
 P22 2.3 (1.1) .64 [.60, .69] −.06
 P44 2.5 (1.1) .58 [.53, .64] −.09
H [95%CI] .66 [.62, .69] −.05
Emotional regulation (N = 482; nfemale = 285 / nmale = 197)
 P23 2.4 (1.0) .56 [.50, .62] .08
 P24 2.9 (0.8) us (2)
 P25 2.9 (0.9) .57 [.51, .63] .02
 P26 2.8 (0.8) us (3)
 P27 2.7 (0.8) us (4)
 P28 2.7 (0.9) .58 [.52, .64] .05
 P29 2.2 (0.9) .51 [.45, .57] .01
 P30 2.6 (0.9) .57 [.52, .62] .00
 P31 2.5 (0.9) .61 [.57, .66] .08
 P32 2.4 (1.0) .64 [.60, .69] .07
 P45 1.9 (1.1) us (1)
H [95%CI] .58 [.53, .62] .05
Physical regulation (N = 485; nfemale = 288 / nmale = 197)
 P33 2.7 (0.8) .46 [.40, .52] −.13
 P34 3.3 (0.7) us (1)
 P35 3.3 (0.8) .46 [.40, .53] −.23*
 P36 3.2 (0.8) .45 [.38, .51] −.16
 P37 2.8 (0.9) .41 [.35, .47] −.12
 P38 3.0 (0.8) .49 [.43, .55] −.16
 P39 2.3 (1.0) .52 [.47, .57] −.20*
 P40 2.4 (1.0) PLDP39(2)
 P41 2.0 (0.9) .46 [.40, .51] −.20*
 P42 2.9 (1.0) .42 [.36, .48] −.14
 P46 2.5 (1.0) PLDP39(3)
H [95%CI] .46 [.41, .50] −.17*

Note. All items indicated no violations of the monotonicity assumption (crit = 0). DIF = Differential Item Functioning; us = unscalable item; PLDk = Positive Local Dependence (subscripted item pair); NLD = Negative Local Dependence (subscripted item pair); IIO = Invariant Item Ordering. *Statistically significant difference in scalability between sexes (non-overlapping 95% CIs).

(1) – (5) Item removal order in the scale.

In the Social module, six items were unscalable (Table 3): two on the Culture scale; two on the Ethics scale, one of which was the single reverse-scored item of the scale; one on the Collaboration scale; and one on the Relationships scale. None of the unscalable items displayed a clustering pattern (i.e., high scalability between otherwise unscalable items); however, four of these six unscalable items had been developed to assess the highest level of development in each corresponding scale—the capability to transfer the social skills developed in a PA context to other contexts. All 15 unscalable items (nine Psychological, six Social) were removed stepwise.

Table 3.

Mokken Scaling Analysis (MSA) Abbreviated Results for the Social Module of the PPLA-Q.

Label  M (SD)  Reason for removal (confirmatory MSA)  Final Hi [95%CI]  DIF/DTF ΔHi
Culture (N = 490; nfemale = 288 / nmale = 202)
 S1 2.5 (1.0) PLDS7(3)
 S2 1.6 (1.3) .55 [.50, .60] −.02
 S3 2.2 (1.1) .56 [.50, .61] −.03
 S4 3.0 (1.1) us (2)
 S5 2.4 (1.2) .66 [.63, .70] .00
 S6 2.5 (1.3) .67 [.64, .71] −.02
 S7 2.0 (1.1) .64 [.60, .69] −.03
 S8 1.6 (1.3) .65 [.60, .69] .01
 S9 1.4 (1.2) .63 [.59, .68] −.04
 S40 1.3 (1.2) us (1)
H [95%CI] .62 [.59, .66] −.02
Ethics (N = 473; nfemale = 280 / nmale = 193)
 S12 2.4 (0.7) .49 [.42, .56] .09
 S13 3.4 (0.7) .50 [.43, .57] −.02
 S14 3.4 (0.7) .52 [.46, .59] .00
 S15 3.3 (0.9) us (2)
 S16 3.2 (0.8) IIO (6)
 S17 3.4 (0.7) .59 [.53, .64] .05
 S18 3.5 (0.7) PLDS21(3)
 S19 2.9 (1.0) .53 [.46, .59] −.01
 S20 3.2 (0.7) PLDS12(4)
 S21 3.2 (0.8) .62 [.57, .67] .02
 S22 2.7 (0.9) PLDS19(5)
 S41 1.9 (1.1) us (1)
H [95%CI] .54 [.49, .59] .02
Collaboration (N = 490; nfemale = 290 / nmale = 200)
 S23 3.3 (0.7) .66 [.60, .71] .06
 S24 3.2 (0.7) .59 [.52, .65] .04
 S25 3.1 (0.7) IIO (4)
 S26 3.5 (0.6) .69 [.63, .75] .15
 S27 3.4 (0.6) .71 [.67, .76] .10
 S28 3.1 (0.8) .62 [.56, .67] .08
 S29 3.0 (0.9) PLDS29(2)
 S30 3.0 (0.8) .62 [.57, .68] .11
 S31 3.1 (0.7) PLDS28 and PLDS27(3)
 S42 2.0 (1.1) us (1)
H [95%CI] .64 [.60, .69] .09
Relationships (N = 482; nfemale = 283 / nmale = 199)
 S32 3.2 (0.7) .64 [.59, .70] −.02
 S33 3.1 (0.8) .66 [.61, .71] .04
 S34 2.8 (0.9) .55 [.48, .61] .08
 S35 2.7 (0.9) PLDS34 and PLDS38(3)
 S36 2.8 (0.9) .61 [.55, .67] .03
 S37 2.9 (0.9) .64 [.59, .69] .03
 S38 2.6 (0.9) .58 [.52, .64] −.01
 S39 2.9 (1.0) PLDS37(2)
 S43 2.0 (1.1) us (1)
H [95%CI] .61 [.57, .66] .02

Note. All items indicated no violations of the monotonicity assumption (crit = 0). DIF = Differential Item Functioning; us = unscalable item; PLDk = Positive Local Dependence (subscripted item pair); IIO = Invariant Item Ordering.

(1) – (6) Item removal order.

Local Independence

Using the conditional association procedure, three Psychological module items were flagged as likely being in a PLD pair with one or more other items on the same scale (Table 2; one on the Motivation scale and two on Physical Regulation). For the Social module, this number increased to eight items (Table 3; one on the Culture scale, three on the Ethics scale, two on the Collaboration scale, and two on the Relationships scale). Most identified pairs were within the same lower-level structure (i.e., Foundation or Mastery), within the same specific trait (e.g., P9 and P11, sharing the same motivational regulation), or had similar wording. Within each pair, we chose which item to remove according to the content coverage of the scale, resulting in the removal of 11 items in total.

Monotonicity

Graphical analysis of each Item Response Function (IRF; not shown), supplemented by the crit statistic in the mokken package, revealed no significant violations of the monotonicity assumption (all crit = 0). Thus, all scales conformed with the Monotone Homogeneity Model, suggesting that the relative ordering of students according to each construct (scale) is consistent across its items.

Invariant Item Ordering (IIO)

During IIO analysis of both the IRFs and the corresponding crit statistic, two items in the Confidence scale (P15 and P17) and one item each in the Ethics and Collaboration scales (S16 and S25, respectively) revealed intersections with other IRFs within the same scale (crit > 40) and were removed so that the scales conformed with the additional requirement of the Double Monotonicity Model. Table 4 displays the resulting scales' total scalability coefficients (H) and IIO coefficients (HT). Based on their 95%CI, two scales (Motivation and Physical Regulation) formed medium-to-strong Mokken hierarchical scales, while the remaining six formed strong ones (H estimates ranging from .54 to .66). Despite displaying formal IIO (through non-intersection of IRFs), four of the scales (Confidence, Emotional Regulation, Collaboration, and Relationships) had an estimated HT lower than .30 (Table 4). The remaining four scales displayed better prospects for such ordering, with their IIO accuracy being weak (Motivation and Culture), medium (Physical Regulation), and strong (Ethics).

Table 4.

Scalability, Invariant Ordering, and Reliability Indexes for the Psychological and Social Modules of PPLA.

Scale – number of items  Scalability H [95% CI]  Invariant item ordering HT (item ordering)a  Molenaar-Sijtsma ρ  Cronbach's α  Test-retest ICC2.1 [95% CI]b (15-day interval, N = 73)  Mean score baseline [95%CI]  Mean score retest [95%CI]
Psychological
 Motivation – 7 items .47 [.43, .51] .33 (P5, P8, P2, P1, P7, P43, P11) .83 .83 .82 [.72, .89] 19.8 [18.7, 20.9] 19.0 [18.1, 20.0]
 Confidence – 9 items .66 [.62, .69] .08 (P13, P17, P44, P16, P19, P14, P21, P22, P20) .94 .93 .92 [.87, .95] 22.2 [20.6, 23.8] 22.2 [20.6, 23.8]
 Emotional regulation – 7 items .58 [.53, .62] .19 (P25, P28, P30, P31, P23, P32, P29) .90 .88 .77 [.66, .85] 17.5 [16.3, 18.6] 17.6 [16.5, 18.7]
 Physical regulation – 8 items .46 [.41, .50] .41 (P35, P36, P38, P42, P37, P33, P39, P41) .84 .84 .66 [.51, .77] 22.3 [21.2, 23.4] 21.7 [20.6, 22.7]
Social
 Culture – 7 items .62 [.59, .66] .32 (S6, S5, S3, S7, S2, S8, S9) .91 .91 .88 [.82, .92] 14.0 [12.6, 15.5] 14.4 [12.8, 16.0]
 Ethics – 6 items .54 [.49, .59] .52 (S13, S14, S17, S21, S22, S12) .86 .85 .71 [.58, .81] 18.7 [18.0, 19.5] 19.1 [18.4, 19.8]
 Collaboration – 6 items .64 [.60, .69] .26 (S26, S27, S23, S24, S28, S30) .88 .87 .70 [.56, .80] 19.4 [18.6, 20.1] 18.8 [18.1, 19.5]
 Relationships – 6 items .61 [.57, .66] .22 (S32, S33, S37, S36, S34, S38) .88 .88 .68 [.53, .78] 17.0 [16.1, 17.9] 16.4 [15.5, 17.3]

Note. ICC = Intraclass Correlation Coefficient.

aInvariant item ordering according to the manifest invariant item ordering (MIIO) method (Ligtvoet et al., 2011).

bIntraclass Correlation formula 2.1 – two-way mixed effects model accounting for single measurement (Koo & Li, 2016).

Measurement Invariance

To assess whether items presented differential item functioning (DIF) according to sex, we calculated total and item scalability coefficients by subgroup on each final scale. Items P5 (“I feel pressured by others to practice PA”), S26 (“I respect others”), S27 (“I cooperate with others”), and S30 (“I help others achieve success”) presented a difference in item scalability (i.e., DIF) according to sex (Hi difference > .10; Tables 2 and 3, column 5): P5 with higher scalability for males, and the others with higher scalability for females. However, these differences were not statistically significant (i.e., their 95% confidence intervals overlapped), and they produced no appreciable effect on total scalability (H difference < .10; no DTF).

The Physical Regulation scale had slight-to-moderate differences in item scalability (DIF) in all of its items (Table 2; magnitudes ranging from .12 to .23), with statistically significant differences in items P35 (“I can recognize changes in my breathing”), P39 (“I use strategies to manage my effort”), and P41 (“I can control my fatigue”), resulting in a statistically significant difference in total scalability (H difference = .17; DTF) and borderline total scalability for females (H = .38 [.32, .43]). To further investigate these differences, we calculated reliability coefficients for both subgroups (not shown): female (ρ = .81, α = .79) and male (ρ = .89, α = .88).

Reliability

Total-Score Reliability

Rho estimates ranged from .83 to .94, above the recommended cutoff of .80 (Table 4, column 4). Similarly, α coefficients were all above .80 (ranging from .83 to .93).

Test-Retest Reliability

The Motivation, Confidence, and Culture scales attained good to excellent test-retest reliability (ICC95%CI lower bound ranging from .72 to .87, and upper bound from .89 to .95; Table 4, column 6); while the remaining scales attained moderate-to-good reliability. Mean scores were stable across time points, with slight decreases in four of the eight scales (Table 4, columns 7 and 8).

Discriminant and Convergent Validity

Estimated deattenuated correlations among scale-scores within the Psychological domain ranged from .31 to .83 (Table 5). Of these, the Emotional Regulation scale was the lowest common correlate. The Motivation and Confidence scales correlated above the .85 threshold (upper CI bound), higher than warranted for two theoretically distinct scales. Estimated deattenuated correlations within the Social module ranged from .21 to .74 (Table 5). Of these, Culture was the lowest common correlate, with the other scales correlating moderately-to-strongly. Correlations across domains were low, except for Culture and Relationships, which showed moderate deattenuated correlations with scale-scores in the Psychological domain.

Table 5.

Bivariate Correlation (Spearman) Matrix.

Scale Psychological Domain Social Domain
1 2 3 4 5 6 7 8
1. Motivation .83 [.77, .87] .31 [.22, .42] .61 [.53, .70] .54 [.46, .62] .27 [.17, .37] .26 [.15, .36] .42 [.32, .51]
2. Confidence .73 [.69, .77] .47 [.37, .54] .65 [.54, .69] .50 [.43, .60] .22 [.13, .31] .22 [.12, .31] .45 [.36, .53]
3. Emotional regulation .27 [.19, .36] .43 [.35, .51] .49 [.38, .54] .18 [.08, .27] .26 [.15, .37] .19 [.09, .28] .22 [.12, .33]
4. Physical Regulation .51 [.44, .58] .57 [.50, .64] .43 [.35, .49] .44 [.35, .52] .42 [.31, .51] .37 [.27, .48] .46 [.37, .54]
5. Culture .47 [.39, .54] .46 [.38, .53] .16 [.07, .25] .39 [.31, .46] .21 [.10, .29] .23 [.14, .32] .36 [.26, .45]
6. Ethics .23 [.14, .31] .20 [.10, .29] .23 [.14, .31] .36 [.28, .44] .18 [.10, .27] .74 [.64, .77] .52 [.42, .59]
7. Collaboration .22 [.14, .31] .20 [.11, .29] .17 [.08, .26] .32 [.23, .39] .20 [.11, .29] .64 [.58, .70] .69 [.60, .73]
8. Relationships .36 [.28, .44] .41 [.32, .48] .20 [.11, .28] .40 [.32, .47] .32 [.23, .40] .45 [.37, .52] .61 [.54, .67]

Note. Raw bivariate correlation below diagonal, deattenuated correlations above diagonal.

Discussion

We sought to establish evidence for construct validity and reliability of the Psychological and Social modules of the PPLA-Q for adolescents in grades 10–12 (aged 15–18 years) through investigation of their dimensionality, measurement invariance, and reliability (total-score and test-retest).

Dimensionality

We used MSA to gather evidence on the dimensionality of the eight scales composing the Psychological and Social modules of the PPLA-Q. Most local dependencies occurred within items initially designed for the same difficulty level (i.e., Foundation or Mastery), within the same specific trait (e.g., P9 and P11 with the same motivational regulation) or with similar wording. This was expected, since scale development ensured desirable redundancy (DeVellis, 2017).

All eight scales, after removal of offending items, adhered to the assumptions of the MHM (scalability, local independence, and monotonicity), with total scale scalability coefficient estimates (H) ranging from .46 to .66. Thus, all scales were evaluated as moderate-to-strong in dimensionality. These values support the convergent validity (at item level) of each scale (Sijtsma et al., 2011). Sum-scores of items in these scales can be considered a sufficient indicator of the respondents' position on the latent trait measured (Wind, 2017).

In all eight scales, the additional IIO assumption was tenable, adhering to the DMM. This evidence supports the interpretation that an invariant ordering of the items' difficulty can be established across different ranges of development for all students on the respective constructs (Wind, 2017), as warranted in the initial development of these scales. However, four of the eight scales (Confidence, Emotional Regulation, Collaboration, and Relationships) had an HT coefficient of less than .30, meaning their IRFs are close to each other, likely because respondents did not distinguish one item from its neighbor in terms of item difficulty (Sijtsma et al., 2011). Thus, although these scales still provide an overall valid assessment of the students' (and items') positions on a continuum of difficulty, no specific use of this ordering is recommended for these four scales (e.g., application of scales from an estimated difficulty point onwards).

On the Motivation scale, items formed a weakly accurate continuum (HT = .33), from items pertaining to controlled forms of motivation to more autonomous forms of motivation (Table 4). Despite this, the continuum found does not entirely adhere to the order posited by the Organismic Integration mini-theory of Self-Determination Theory (Ryan & Deci, 2017): P8 (“I feel good when I practice PA”), developed to assess intrinsically regulated motivation, was deemed easier (i.e., higher mean score) than P2, targeting externally regulated motivation at the opposite end of the theoretical continuum. We argue that the wording of P8 might target a general well-being perception, making it easier to endorse than targeted expressions of intrinsic motivation, like pleasure or satisfaction. We recommend rewriting this item so that it adheres to the expected difficulty range. Similarly, P7 (developed to assess intrinsically regulated motivation, mean = 2.6) and P11 (integrated regulation, mean = 2.2) switched places, as the former is usually expected to reflect the most autonomous form of motivational regulation. These results echo previous bifactor modeling, suggesting that these two regulations might be closely placed on the continuum (Howard et al., 2016). For the intended application of the scale, however, this switch might have little consequence, as we discuss in the next paragraphs.

On the Physical Regulation scale, items formed a moderately accurate (HT = .41) continuum adhering to the a priori expectation. This continuum ranged from items targeting identification of physiological signs of effort and awareness of physical limits, to items targeting the use of strategies to manage effort during PA. The wording of P42 (“I take action to improve my physical skills”, mean = 2.9) might need adjustment in the future, as it appears to be interpreted as identical, difficulty-wise, to P37 (“I recognize my physical limits”, mean = 2.8), although the two items were designed to differ in difficulty (P42 being developmentally more complex than P37).

On the Culture scale, items formed a weakly accurate (HT = .32) continuum, ranging from items framing participation in the movement culture through use of specific PA terminology, to items related to endorsing and encouraging others to participate in said culture. Albeit designed to be among the easier items in this scale, S2 (“I participate in PA rituals (e.g., greetings, hymns/chants, cheers, applause)”) figured, difficulty-wise, among the harder items in this scale; this might result from a misunderstanding of what “rituals” in a movement context truly mean, despite examples being provided in the item. As such, this item might merit further scrutiny in the future. Also, the wording of S6 (“I like to keep up with PA events (e.g., competitions, spectacles, shows)”) might be refined to differentiate it from S5 (“I watch PA events (e.g., competitions, spectacles, shows)”) in terms of difficulty.

For the Ethics scale, items formed a strongly accurate (HT = .52) continuum adhering to the a priori development expectations, based on Gibbs's (2014) model. Items ranged from those targeting immature (i.e., pragmatic) forms of moral development to items targeting mature (i.e., value-based) forms of moral development.

Global items, designed to generally represent each latent construct/element (P1, P13, P23, P33, S1, S12, S23, S32) and act as convergent validity indicators in future analyses (Cheah et al., 2018), had adequate scalability on all scales, strengthening the evidence for scale convergent validity. Only on the Culture scale was one of these items (S1) flagged for local dependence, likely due to similar wording, and this item was removed. Difficulty-wise, on scales with interpretable IIO (HT > .30), these global items were in the middle-upper range of the difficulty continuum (i.e., lower mean score). This was to be expected, as these items were based on the operational definition of each element (Mota et al., 2021), which states the development of each skill/construct in its final stages. The usefulness of these items should be further examined (i.e., to see whether they are essential to scale scalability and validity), since their removal might slightly increase the practicality of subsequent administrations of this questionnaire without loss of content representation.

Items developed to assess Relational Thinking, the highest development stage in the Structure of Observed Learning Outcomes taxonomy (Biggs & Collis, 1982)—items P43–P46 and S40–S44—did not fit the tested models, either for being unscalable or for being in local dependence pairs, except those on the Motivation and Confidence scales. We suggest that (a) endorsement of these items might depend on the capacity of students to draw a connection between their psychological and social skills in PA and their application in other contexts (e.g., being able to apply emotional regulation strategies developed or recurrently applied in PA contexts to daily stressful events), which might represent a different skill altogether; and (b) the wording might not be clear enough to capture this phenomenon among adolescents. Further efforts are needed to refine these items and analyze their dimensionality, either as part of each scale or as a separate latent trait in itself.

Additional explorations of the dimensionality of the Motivation, Emotional Regulation, and Physical Regulation scales using the Automated Item Selection Procedures (AISP) and Genetic Algorithms (GA) revealed an alternative cluster structure for these scales (Supplementary Tables S3-S5). Although these alternative dimensionality structures could be supported for these scales, we recommend their use as single scales within the PPLA framework, as their total scalability coefficients showed enough unidimensionality to locate an individual respondent on each of these latent traits. Other research applications (e.g., theory development) might benefit from the use and exploration of these alternative structures.
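
For reference, these exploratory analyses map onto the aisp() function in mokken; a minimal sketch on the hypothetical clean item matrix, using the .30 lower bound applied for item screening:

```r
# Automated Item Selection Procedure: the default stepwise search and the
# genetic algorithm can suggest alternative item clusters for a scale.
aisp(clean, search = "normal", lowerbound = 0.30)
aisp(clean, search = "ga", lowerbound = 0.30)
```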

Regarding IIO, refinements to item difficulty in scales with below-standard or borderline IIO accuracy (HT ≈ .30) are warranted to better target different development stages across each construct. Use of parametric IRT models might support this effort, although their restrictive assumptions regarding the shape of item response functions might not fit those observed in this study.

Measurement Invariance

DIF and DTF analyses suggested that all scales function similarly for male and female adolescent respondents, except for the Physical Regulation scale, which showed a sex bias, albeit with borderline total scalability and acceptable reliability for females. This sex bias might stem from a different interpretation of these items (relating to concepts of physical signs and fatigue during PA) among males and females. We advise caution in the interpretation and comparison of between-sexes score differences on this scale. Since previous literature on this construct is sparse, further investigation and refinement of this construct and its items are recommended through complementary quantitative (e.g., logistic regression/parametric Item Response Theory; Choi et al., 2011) and qualitative methodologies.

Reliability

All scales showed adequate total-score reliability, further supporting the use of a total sum-score. These estimates were an improvement upon those of the pilot phase of this research, in which 37% of scales failed to reach adequate reliability (Mota et al., 2021). For ease of interpretation and comparability between scales, we recommend that scores on each scale be transformed into a 0–100 metric using the maximum number of summed points as the upper bound. Since the scales mostly retain a balanced number of items designed to measure Foundation- and Mastery-level skills, the midpoint score (50%) can be used as a heuristic cutoff to identify students transitioning into a deeper phase of learning.
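
This rescaling is a one-line computation; a sketch (to_percent is a hypothetical helper):

```r
# Rescale a sum-score to 0-100: divide by the maximum attainable score
# (4 points per item) and multiply by 100.
to_percent <- function(sum_score, n_items) 100 * sum_score / (4 * n_items)
to_percent(19.8, 7)  # e.g., mean baseline Motivation score (7 items) -> ~70.7
```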

Additionally, ICC results provided evidence of moderate-to-excellent test-retest reliability of the scales. Since sample mean scores were stable, variation across time points originated from individual differences. These differences are a plausible consequence of the lockdown and school closure in Portugal (concurrent with data collection), especially for constructs related to social interactions (Ethics, Collaboration, and Relationships), which were likely hampered during this period. Despite evidence that the scales are adequate to detect changes in these constructs over time, further research using IRT methods (e.g., growth models) in a setting outside COVID-19 impositions might reinforce these findings.

Discriminant and Convergent Validity

Deattenuated bivariate correlations suggest that the Motivation and Confidence scales might not have adequate discriminant validity (upper bound bordering on the usual .85 guideline; Brown, 2015) and, thus, may measure the same construct. These findings should be interpreted tentatively, since (a) deattenuated correlations might over-inflate estimates (Murphy & Davidshofer, 2005), and (b) previous research has identified a moderate-to-strong correlation between analogous constructs (r = .64; Sweet et al., 2012), similar in magnitude to our estimated raw correlation. For more robust interpretations, and to evaluate whether these two scales should be collapsed to improve questionnaire feasibility, we suggest integrating the resulting scale-scores as indicators of a higher-order latent variable along with the other Psychological scales, as posited in our PPLA model (Mota et al., 2021). Item refinement and replication in further studies are also warranted.

Correlations among the scale-scores of Relationships, Confidence, and Motivation were coherent in magnitude with those observed in previous studies (Sweet et al., 2012). This was expected, as the first two measure constructs akin to Perceived Relatedness and Perceived Competence, respectively—core psychological needs in Self-Determination Theory (Ryan & Deci, 2017). Similarly, the Collaboration–Motivation correlation echoes previous results (Li et al., 2008). These results, along with low-to-moderate correlations across domains, support the convergent and discriminant validity of these scales. This assertion could be further supported by higher-order modeling in the next phases of validation of the PPLA.

Strengths and Limitations of this Study

This study built upon preliminary reliability evidence collected during pilot testing of the PPLA-Q (Mota et al., 2021) and sought to refine the quality of the scales of the Psychological and Social modules using MSA, a nonparametric scaling technique whose cumulative model allows items to differ in their difficulty along a latent trait, providing an improvement over the CTT models used in linear factor analysis (van Schuur, 2003). This conception closely aligns with the a priori specification of an underlying learning continuum with multiple learning stages. The resulting scales can be feasibly applied in a PE context by summing their item points (i.e., sum-score) to provide an assessment of a student's position on a continuum of each of these skills. We were also able to recruit a diverse sample, despite the pandemic context imposed by COVID-19. Our sample mimicked the relative composition of grade 10–12 students in Portugal according to grade and course major. Nonetheless, this was a convenience sample, and caution is needed in generalizing these findings to other contexts. Further test-retest reliability studies with more diverse samples, under more stable circumstances, and preferably using IRT-based procedures, are recommended. Notwithstanding the strengths of the MSA method, we acknowledge other reports of its limited value for assessing dimensionality (Smits et al., 2012); complementary methods for assessing dimensionality could be employed in the future.

Conclusions

Here, we have reported evidence supporting the dimensionality, convergent and discriminant validity, and reliability (total-score and test-retest) of the eight scales of the Psychological and Social modules of the PPLA-Q, refined through Mokken Scale Analysis when applied to Portuguese adolescents. The summed score of all final items in each scale (Supplemental File 3) can be used as an indicator of each latent construct. Further refinement to the wording of items is warranted to increase the accuracy of difficulty ordering within each scale, and discriminant validity of the Motivation and Confidence scales. We identified differential item and test functioning across sexes in one scale (Physical Regulation), which should be further scrutinized before any between-sexes comparisons are made on this construct. All other scales obtained evidence in support of their measurement invariance. These scales can be integrated into the PPLA framework and used to provide a feasible and integrated assessment of the individual journey of each grade 10–12 (15–18 years) student in Portuguese PE.

Supplemental Material

Supplemental Material - Portuguese Physical Literacy Assessment Questionnaire (PPLA-Q) for Adolescents: Validity and Reliability of the Psychological and Social Modules using Mokken Scale Analysis

Supplemental Material for Portuguese Physical Literacy Assessment Questionnaire (PPLA-Q) for Adolescents: Validity and Reliability of the Psychological and Social Modules using Mokken Scale Analysis by João Mota, João Martins, and Marcos Onofre in Perceptual and Motor Skills

Author Biographies

João Mota is a lecturer at University College Cork on the Sports Studies and Physical Education programme. He has a degree in Sports Science (minor in Exercise and Health), a master's in Physical Education, and a PhD in Education from the Faculty of Human Kinetics, University of Lisbon. His research interests focus on physical literacy, pedagogy/didactics of physical activities and sports, educational assessment, and psychometrics.

João Martins has bachelor's and master's degrees in Physical Education and a PhD in Educational Sciences from the Faculty of Human Kinetics, University of Lisbon (FHK-UL), as well as a master's in Epidemiology (Faculty of Medicine, University of Lisbon). He is an Assistant Professor at FHK-UL and a researcher at the Institute of Education, University of Lisbon. He is the national coordinator of, and a member of, several funded projects on physical education, physical activity, and health. He is the vice-president of the Portuguese Society of Physical Education and a member of the Directors Board of AIESEP (Association Internationale des Ecoles Supérieures d'Education Physique).

Marcos Onofre is an associate professor at the Faculty of Human Kinetics, University of Lisbon, where he is a member of the Scientific Council, former coordinator of LaPED, former chair of the Department of Education, Social Sciences and Humanities, and coordinator of the master's degree in Physical Education. He coordinates the Physical Education Didactics group of the Research and Development Unit on Education and Training of the Institute of Education at the University of Lisbon. He is a former member of the AIESEP (Association Internationale des Ecoles Supérieures d'Education Physique) Directors Board and former chair of its Scientific Committee, and vice-president of EUPEA (European Physical Education Association). He has dedicated his life to researching sport pedagogy and physical education and studying the quality of teaching and teacher education.

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research work was funded by a PhD Scholarship from the University of Lisbon credited to the lead author.

Ethical Approval: All the work was done in Portugal, as part of the doctoral project of the lead author, approved by the Ethics Committee of the Faculty of Human Kinetics and the Directorate-General of Education.

Consent to Participate: Before participation, signed informed consent was required of all students and their legal guardians (when students were minors).

Supplemental Material: Supplemental material for this article is available online.

ORCID iD

João Mota https://orcid.org/0000-0002-5229-9206

References

  1. American Educational Research Association . (2014). American psychological association, & national council on measurement in education. Standards for educational and psychological testing. American Educational Research Association. [Google Scholar]
  2. Arifin W. N. (2020). Sample size calculator. http://wnarifin.github.io [Google Scholar]
  3. Ark L. A. v. d. (2012). New developments in mokken scale analysis in R. Journal of Statistical Software, 48(5), 1. 10.18637/jss.v048.i05 [DOI] [Google Scholar]
  4. Australian Government Department of Health (2019). Australian 24-hour movement guidelines for children (5-12 years) and young people (13-17 years): An integration of physical activity, sedentary behaviour, and sleep. Australian Government Department of Health. [Google Scholar]
  5. Baptista F., Santos D. A., Silva A. M., Mota J., Santos R., Vale S., Ferreira J. P., Raimundo A. M., Moreira H., Sardinha L. B. (2012). Prevalence of the Portuguese population attaining sufficient physical activity. Medicine and Science in Sports and Exercise, 44(3), 466–473. 10.1249/MSS.0b013e318230e441 [DOI] [PubMed] [Google Scholar]
  6. Bernaards C. A., Sijtsma K. (2000). Influence of imputation and EM methods on factor analysis when item nonresponse in questionnaire data is nonignorable. Multivariate Behavioral Research, 35(3), 321–364. 10.1207/S15327906MBR3503_03 [DOI] [PubMed] [Google Scholar]
  7. Biggs J., Collis K. (1982). Evaluating the quality of learning: The SOLO taxonomy (structure of observed learning outcomes). Academic Press. [Google Scholar]
  8. Bonett D. G. (2002). Sample size requirements for estimating intraclass correlations with desired precision. Statistics in Medicine, 21(9), 1331–1335. 10.1002/sim.1108 [DOI] [PubMed] [Google Scholar]
  9. Brown T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.). The Guilford Press. [Google Scholar]
  10. Chan E. K. H. (2014). Standards and guidelines for validation practices: Development and evaluation of measurement instruments. In Zumbo B. D., Chan E. K. H. (Eds.), Validity and validation in social, behavioral, and Health Sciences (pp. 9–24). Springer International Publishing. 10.1007/978-3-319-07794-9_2 [DOI] [Google Scholar]
  11. Cheah J.-H., Sarstedt M., Ringle C. M., Ramayah T., Ting H. (2018). Convergent validity assessment of formatively measured constructs in PLS-SEM: On using single-item versus multi-item measures in redundancy analyses. International Journal of Contemporary Hospitality Management, 30(11), 3192–3210. 10.1108/IJCHM-10-2017-0649 [DOI] [Google Scholar]
  12. Choi S. W., Gibbons L. E., Crane P. K. (2011). lordif: An R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software, 39(8), 1–30. 10.18637/jss.v039.i08 [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Corbin C. B. (2016). Implications of physical literacy for research and practice: A commentary. Research Quarterly for Exercise and Sport, 87(1), 14–27. 10.1080/02701367.2016.1124722 [DOI] [PubMed] [Google Scholar]
  14. Cortis C., Puggina A., Pesce C., Aleksovska K., Buck C., Burns C., Cardon G., Carlin A., Simon C., Ciarapica D., Condello G., Coppinger T., D’Haese S., De Craemer M., Di Blasio A., Hansen S., Iacoviello L., Issartel J., Izzicupo P., Boccia S. (2017). Psychological determinants of physical activity across the life course: A “DEterminants of DIet and physical ACtivity” (DEDIPAC) umbrella systematic literature review. PLOS ONE, 12(8), e0182709. 10.1371/journal.pone.0182709 [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Cronbach L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334. 10.1007/BF02310555 [DOI] [Google Scholar]
  16. DeVellis R. (2017). Scale development: Theory and applications (4th ed.). SAGE Publications Ltd. https://us.sagepub.com/en-us/nam/scale-development/book246123 [Google Scholar]
  17. Gamer M., Lemon J., Singh I. (2019). irr: Various coefficients of interrater reliability and agreement. R package version 0.84.1. https://CRAN.R-project.org/package=irr
  18. Gamerman D., Gonçalves F. B., Soares T. M. (2019). Differential item functioning. In: Handbook of item response theory volume 3. Chapman and Hall/CRC. [Google Scholar]
  19. Gibbs J. C. (2014). Moral development and reality: Beyond the theories of Kohlberg, hoffman, and haidt (3rd ed.). Oxford University Press. [Google Scholar]
  20. Guthold R., Stevens G. A., Riley L. M., Bull F. C. (2018). Worldwide trends in insufficient physical activity from 2001 to 2016: A pooled analysis of 358 population-based surveys with 1·9 million participants. The Lancet Global Health, 6(10), e1077-e1086. 10.1016/S2214-109X(18)30357-7 [DOI] [PubMed] [Google Scholar]
  21. Guthold R., Stevens G. A., Riley L. M., Bull F. C. (2020). Global trends in insufficient physical activity among adolescents: A pooled analysis of 298 population-based surveys with 1·6 million participants. The Lancet Child & Adolescent Health, 4(1), 23–35. 10.1016/S2352-4642(19)30323-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Hervé M. (2021). RVAideMemoire: Testing and plotting procedures for Biostatistics. R package version 0.9-80. [Google Scholar]
  23. Hinkle D. E., Wiersma W., Jurs S. G. (2003). Applied statistics for the behavioral sciences (Vol. 663). Houghton Mifflin College Division. [Google Scholar]
24. Howard J. L., Gagné M., Morin A. J. S., Forest J. (2016). Using bifactor exploratory structural equation modeling to test for a continuum structure of motivation. Journal of Management, 44(7), 2638–2664. 10.1177/0149206316645653
25. Keegan R. J., Barnett L. M., Dudley D. A., Telford R. D., Lubans D. R., Bryant A. S., Roberts W. M., Morgan P. J., Schranz N. K., Weissensteiner J. R., Vella S. A., Salmon J., Ziviani J., Okely A. D., Wainwright N., Evans J. R. (2019). Defining physical literacy for application in Australia: A modified Delphi method. Journal of Teaching in Physical Education, 38(2), 105–118. 10.1123/jtpe.2018-0264
26. Koo T. K., Li M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163. 10.1016/j.jcm.2016.02.012
27. Krathwohl D. R., Bloom B. S., Masia B. B. (Eds.). (1964). Taxonomy of educational objectives: The classification of educational goals. Handbook II: Affective domain. David McKay.
28. Li W., Wright P. M., Rukavina P. B., Pickering M. (2008). Measuring students’ perceptions of personal and social responsibility and the relationship to intrinsic motivation in urban physical education. Journal of Teaching in Physical Education, 27(2), 167–178. 10.1123/jtpe.27.2.167
29. Ligtvoet R., van der Ark L. A., Bergsma W. P., Sijtsma K. (2011). Polytomous latent scales for the investigation of the ordering of items. Psychometrika, 76(2), 200–216. 10.1007/s11336-010-9199-8
30. Ligtvoet R., van der Ark L. A., te Marvelde J. M., Sijtsma K. (2010). Investigating an invariant item ordering for polytomously scored items. Educational and Psychological Measurement, 70(4), 578–595. 10.1177/0013164409355697
31. Liljequist D., Elfving B., Skavberg Roaldsen K. (2019). Intraclass correlation – a discussion and demonstration of basic features. PLOS ONE, 14(7), e0219854. 10.1371/journal.pone.0219854
32. Matos M. G., Equipa Aventura Social (2018). A saúde dos adolescentes portugueses após a Recessão—Dados nacionais 2018 [The health of Portuguese adolescents after the recession: National data 2018]. Faculdade de Motricidade Humana. http://aventurasocial.com/arquivo/1437158618_RELATORIO_HBSC_2014e.pdf
33. Ministério da Educação [Ministry of Education]. (2001a). Programa nacional Educação Física: Ensino secundário [National Physical Education syllabus: Secondary education]. DES.
34. Ministério da Educação [Ministry of Education]. (2001b). Programa nacional Educação Física (Reajustamento): Ensino Básico, 3.º Ciclo [National Physical Education syllabus (readjustment): Basic education, 3rd cycle]. DEB.
35. Ministério da Educação [Ministry of Education]. (2019). Infoescolas—Estatísticas do Ensino Básico e Secundário [Infoescolas: Statistics on basic and secondary education]. http://infoescolas.mec.pt/
36. Mokkink L. B., de Vet H. C. W., Prinsen C. A. C., Patrick D. L., Alonso J., Bouter L. M., Terwee C. B. (2018). COSMIN risk of bias checklist for systematic reviews of patient-reported outcome measures. Quality of Life Research, 27(5), 1171–1179. 10.1007/s11136-017-1765-4
37. Molenaar I. W., Sijtsma K. (1988). Mokken’s approach to reliability estimation extended to multicategory items. Kwantitatieve Methoden: Nieuwsbrief Voor Toegepaste Statistiek En Operationele Research, 9(28), 115–126.
38. Moorer P., Suurmeijer Th. P. B. M., Foets M., Molenaar I. W. (2001). Psychometric properties of the RAND-36 among three chronic diseases (multiple sclerosis, rheumatic diseases and COPD) in the Netherlands. Quality of Life Research, 10(7), 637–645. 10.1023/A:1013131617125
39. Mota J., Martins J., Onofre M. (2021). Portuguese physical literacy assessment questionnaire (PPLA-Q) for adolescents (15–18 years) from grades 10–12: Development, content validation and pilot testing. BMC Public Health, 21(1), 2183. 10.1186/s12889-021-12230-5
40. Mota J., Martins J., Onofre M. (2022). Portuguese Physical Literacy Assessment - Observation (PPLA-O) for adolescents (15–18 years) from grades 10–12: Development and initial validation through item response theory. Frontiers in Sports and Active Living, 4, 1033648. 10.3389/fspor.2022.1033648
41. Murphy K. R., Davidshofer C. O. (2005). Psychological testing: Principles and applications (6th ed.). Pearson/Prentice Hall.
42. Nunnally J. C., Bernstein I. H. (1994). Psychometric theory (3rd ed.). McGraw-Hill.
43. Physical Literacy for Life. (2021). What is physical literacy. https://physical-literacy.isca.org/update/36/what-is-physical-literacy-infographic
44. Polit D. F. (2014). Getting serious about test–retest reliability: A critique of retest research and some recommendations. Quality of Life Research, 23(6), 1713–1720. 10.1007/s11136-014-0632-9
45. Pozo P., Grao-Cruces A., Pérez-Ordás R. (2018). Teaching personal and social responsibility model-based programmes in physical education: A systematic review. European Physical Education Review, 24(1), 56–75. 10.1177/1356336X16664749
46. Price L. R. (2017). Psychometric methods: Theory into practice. The Guilford Press.
47. R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. http://www.R-project.org/
48. Reise S. P., Waller N. G. (2009). Item response theory and clinical measurement. Annual Review of Clinical Psychology, 5(1), 27–48. 10.1146/annurev.clinpsy.032408.153553
49. RStudio Team. (2020). RStudio: Integrated development for R. RStudio, PBC. http://www.rstudio.com/
50. Ryan R. M., Deci E. L. (2017). Self-determination theory: Basic psychological needs in motivation, development, and wellness. Guilford Press.
51. Sijtsma K., van der Ark L. A. (2021). Measurement models for psychological attributes. CRC Press.
52. Sijtsma K., Meijer R. R., van der Ark L. A. (2011). Mokken scale analysis as time goes by: An update for scaling practitioners. Personality and Individual Differences, 50(1), 31–37. 10.1016/j.paid.2010.08.016
53. Sijtsma K., Molenaar I. W. (2002). Introduction to nonparametric item response theory. SAGE Publications, Inc. 10.4135/9781412984676
54. Sijtsma K., Straat J. H., van der Ark L. A. (2015). Goodness-of-fit methods for nonparametric IRT models. In van der Ark L. A., Bolt D. M., Wang W.-C., Douglas J. A., Chow S.-M. (Eds.), Quantitative psychology research (Vol. 140, pp. 109–120). Springer International Publishing. 10.1007/978-3-319-19977-1_9
55. Sijtsma K., van der Ark L. A. (2017). A tutorial on how to do a Mokken scale analysis on your test and questionnaire data. British Journal of Mathematical and Statistical Psychology, 70(1), 137–158. 10.1111/bmsp.12078
56. Sijtsma K., van der Ark L. A. (2022). Advances in nonparametric item response theory for scale construction in quality-of-life research. Quality of Life Research, 31(1), 1–9. 10.1007/s11136-021-03022-w
57. Smits I. A. M., Timmerman M. E., Meijer R. R. (2012). Exploratory Mokken scale analysis as a dimensionality assessment tool: Why scalability does not imply unidimensionality. Applied Psychological Measurement, 36(6), 516–539. 10.1177/0146621612451050
58. Sport Australia. (2019). The Australian physical literacy framework. https://nla.gov.au/nla.obj-2341259417
59. Stochl J., Jones P. B., Croudace T. J. (2012). Mokken scale analysis of mental health and well-being questionnaire item responses: A non-parametric IRT method in empirical research for applied health researchers. BMC Medical Research Methodology, 12(1), 74. 10.1186/1471-2288-12-74
60. Straat J. H., van der Ark L. A., Sijtsma K. (2014). Minimum sample size requirements for Mokken scale analysis. Educational and Psychological Measurement, 74(5), 809–822. 10.1177/0013164414529793
61. Straat J. H., van der Ark L. A., Sijtsma K. (2016). Using conditional association to identify locally independent item sets. Methodology, 12(4), 117–123. 10.1027/1614-2241/a000115
62. Sweet S. N., Fortier M. S., Strachan S. M., Blanchard C. M. (2012). Testing and integrating self-determination theory and self-efficacy theory in a physical activity context. Canadian Psychology/Psychologie Canadienne, 53(4), 319–327. 10.1037/a0030280
63. Teresi J. A., Ramirez M., Lai J.-S., Silver S. (2008). Occurrences and sources of differential item functioning (DIF) in patient-reported outcome measures: Description of DIF methods, and review of measures of depression, quality of life and general health. Psychology Science Quarterly, 50(4), 538.
64. UNESCO. (2015). Quality physical education (QPE): Guidelines for policy makers. UNESCO Publishing.
65. van Schuur W. H. (2003). Mokken scale analysis: Between the Guttman scale and parametric item response theory. Political Analysis, 11(2), 139–163. 10.1093/pan/mpg002
66. Vaquero-Diego M., Torrijos-Fincias P., Rodriguez-Conde M. J. (2020). Relation between perceived emotional intelligence and social factors in the educational context of Brazilian adolescents. Psicologia: Reflexão e Crítica, 33(1), 1. 10.1186/s41155-019-0139-y
67. Vasconcellos D., Parker P. D., Hilland T., Cinelli R., Owen K. B., Kapsal N., Lee J., Antczak D., Ntoumanis N., Ryan R. M., Lonsdale C. (2019). Self-determination theory applied to physical education: A systematic review and meta-analysis. Journal of Educational Psychology, 112(7), 1444–1469. 10.1037/edu0000420
68. Wind S. A. (2017). An instructional module on Mokken scale analysis. Educational Measurement: Issues and Practice, 36(2), 50–66. 10.1111/emip.12153
69. World Health Organization. (2020). WHO guidelines on physical activity and sedentary behaviour. World Health Organization. https://apps.who.int/iris/handle/10665/336656
70. Zijlstra W. P., van der Ark L. A., Sijtsma K. (2011). Outliers in questionnaire data: Can they be detected and should they be removed? Journal of Educational and Behavioral Statistics, 36(2), 186–212. 10.3102/1076998610366263

Supplementary Materials

Supplemental Material for Portuguese Physical Literacy Assessment Questionnaire (PPLA-Q) for Adolescents: Validity and Reliability of the Psychological and Social Modules using Mokken Scale Analysis by João Mota, João Martins, and Marcos Onofre in Perceptual and Motor Skills

