Abstract
Objective:
To examine the role of administration setting (remote vs. in-person) on performance on four self-administered, smartphone-based neuropsychological measures delivered via the Mobile Toolbox (MTB) and MyCog Mobile (MCM) platforms.
Methods:
Participants self-administered four cognitive tasks (i.e., Executive Function, Face Memory, Picture Memory, and Working Memory) on a smartphone either in the lab or remotely (in-person N = 292, remote N = 701). Robust methods were used to examine performance differences across in-person and remote samples while controlling for covariates.
Results:
Remote testing was associated with higher Picture Memory Trial 1 scores (β = 1.114, p = 0.002) and lower Face Memory First Letter scores (β = −3.194, p < 0.001); the latter effect was moderated by education and version. Picture Memory Trial 2 showed only an indirect effect of setting through an interaction with age (β = 0.048, p < 0.001). No setting effects were found for Executive Function or Working Memory tasks. All observed setting effect sizes were small (partial η² ≤ .014). Perfect scores were more common remotely on memory tasks; however, sparse perfect scores for Trial 1 in the in-person group and ceiling effects on Trial 2 limit interpretation.
Conclusions:
Performance on self-administered smartphone tasks was largely comparable across settings, with only small, subtest-specific differences in performance on memory tasks. These results support remote self-administration for research use while highlighting the need for design strategies that preserve score validity across settings.
Keywords: self-administration, cognitive assessment, smartphone assessment, neuropsychological tests, remote testing
Introduction
Neuropsychological measures that can be self-administered on participants’ own smartphones have enormous potential to transform assessment by removing obstacles to testing including geographical barriers, mobility limitations, transportation challenges, and financial constraints. The long-standing need for accessible assessment options, accelerated by the COVID-19 pandemic, remains critical for reaching underserved populations, individuals with disabilities, and those in rural or resource-limited settings (McBride-Henry et al., 2023; Montes et al., 2021). Remote, self-administered approaches can also increase sample diversity, a pervasive challenge in psychology (Syed, 2021). However, the ease and flexibility of self-administered measures come at the cost of tightly controlled testing environments, which raises concerns about interpretability of scores in remote settings. Remote scores can be influenced by factors like distractions, increased participant comfort, or use of external score enhancement strategies, like cheating (Arioli et al., 2022; Atkins et al., 2022; Madero et al., 2021).
External score enhancement refers to deliberate action taken to artificially improve test performance that violates the intended measurement process. In remote cognitive testing, this might include writing down stimuli that should be memorized, taking photos of test materials, using external aids (like notes or calculators), seeking assistance from others, or having another person complete the assessment entirely. However, it’s important to acknowledge that apparent performance enhancements in remote settings may not always stem from deliberate attempts to manipulate results (i.e., intentional cheating). Various legitimate factors could explain score differences, including reduced performance anxiety in familiar environments, elimination of the “white coat effect” that might occur in clinical settings (Schlemmer & Desrichard, 2018), decreased fatigue from not having to travel to research sites, and the comfort of testing in one’s own space. Additionally, for some participants, the home environment may be less distracting than a laboratory setting. Moreover, participants may not see anything wrong with using artificial enhancement strategies if testing expectations are not clear at the outset.
Performance enhancement on remote tests may depend on which cognitive domains are measured and the degree to which their measurement affords artificial enhancement strategies (Arioli et al., 2022; Atkins et al., 2022). Some strategies could be used across measures, such as handing the test to another person to complete the measures; however, this does not necessarily guarantee a high score. Other strategies can only be used on certain types of tasks (e.g., taking a photo of stimuli). Atkins et al. (2022) found that episodic memory tests showed greater susceptibility to external enhancement strategies compared to working memory or executive function measures in remote settings. This differential vulnerability most likely stems from task characteristics: memory tasks often provide extended encoding periods and explicit retention instructions, allowing time to employ external aids (e.g., recording stimuli). In contrast, executive function and working memory tasks often require immediate, speeded responses that limit opportunities for external aids (Atkins et al., 2022).
This study aims to compare performance on four self-administered, smartphone-based measures completed in a lab setting (albeit unproctored) with performance in a completely remote setting (also unproctored). The respective measures were adapted from NIH Toolbox® for Assessment of Neurological and Behavioral Function Cognition Battery (NIHTB-CB) measures (Gershon et al., 2013) for self-administration on a personal smartphone within both the Mobile Toolbox (Gershon et al., 2022; Figure 1a) and the MyCog Mobile apps (Young et al., 2023; Young, Dworak, Byrne, et al., 2024; Figure 1b). Evidence of the feasibility, reliability, and validity of the measures has been presented in multiple studies (Jutten et al., 2023; Novack et al., 2024; Young, Dworak, Byrne, et al., 2024; Young, Dworak, Novack et al., 2024), but performance has yet to be compared across remote and in-person environments.
Figure 1.

(a) MyCog Mobile measures. (b) Mobile Toolbox measures.
Our research questions included the following: RQ1) Do overall scores differ when the measures are self-administered remotely compared to in the lab setting? and RQ2) Is the proportion of participants achieving perfect scores higher on the measures when self-administered remotely? We consider these questions across four types of tasks: episodic memory, associative memory, executive function, and working memory.
Based on findings from Atkins et al. (2022) outlined above, we developed the following directional hypotheses: H1) We hypothesized that remote participants would score significantly higher than in-person participants on the episodic and associative memory tasks. This hypothesis is based on evidence that these task types allow time for external enhancement strategies (e.g., recording stimuli). H2) We hypothesized that remote and in-person participants would show no significant performance differences on executive function and working memory tasks. This hypothesis stems from the nature of these tasks: they require immediate, speeded responses to rapidly changing stimuli, making external aids impractical or impossible to use. H3) We hypothesized that the proportion of perfect scores would be higher in remote settings specifically for memory tasks but not for executive function or working memory tasks. We examined perfect scores as a complementary metric to overall mean differences because external enhancement strategies could theoretically make maximum scores more achievable on susceptible tasks, thus increasing the prevalence of perfect scores when external aids are available.
While these hypotheses test whether score patterns align with what would be expected if external enhancement strategies were employed, there are several caveats that must be acknowledged in the interpretation of findings. Most importantly, without formal performance validity tests, we cannot definitively attribute score patterns to specific behaviors. Moreover, ceiling effects (where tests are insufficiently difficult) could produce high perfect score rates independent of setting. Finally, the presence of higher scores in either setting does not establish causality or prove use of external strategies, as various legitimate factors could explain performance differences. Our aim is to identify patterns that could inform future remote self-administered test design and interpretation, recognizing that certain patterns – particularly on tests vulnerable to enhancement strategies – warrant careful examination as potential indicators of measurement concerns.
Method
All methods were approved by the Northwestern University Institutional Review Board (STU00207455; STU00214921). Participants provided informed written consent prior to participation and were compensated. All data collection procedures adhered to ethical guidelines for research with human subjects, including appropriate data privacy and security protocols for both in-person and remote testing conditions.
Sample
This study sourced data collected from several studies (Table 1) using two versions of the same tasks, the Mobile Toolbox version (Gershon et al., 2022) and the MyCog Mobile version (Young, Dworak, Byrne, et al., 2024). All participants provided informed consent prior to participation and were compensated for participation. The Mobile Toolbox and MyCog Mobile samples were English-speaking adults recruited by a third-party market research agency. While the MyCog Mobile study specifically focused on older adults aged 65+, the Mobile Toolbox study enrolled adult participants 18 years and older across different age groups. Participants completed the self-administered test either: (1) in-person on study-provided smartphones or (2) remotely on a personal smartphone. All data used in this study was collected from iOS devices (iPhones) version 8 or higher.
Table 1.
Sample characteristics (M [SD] or N[%]) and statistical comparison of in-person vs remote groups.
| Characteristic | Mobile Toolbox: In-Person | Mobile Toolbox: Remote | Mobile Toolbox: p-value | MyCog Mobile: In-Person | MyCog Mobile: Remote | MyCog Mobile: p-value |
|---|---|---|---|---|---|---|
| N | 92 | 650 | | 200 | 51 | |
| Age | | | < .001*** | | | 0.116 |
| Mean (SD) | 49.27 (17.65) | 38.99 (20.75) | | 72.69 (5.10) | 74.20 (6.25) | |
| Range | [20, 84] | [18, 90] | | [65, 87] | [65, 90] | |
| Gender | | | 0.158 | | | 0.885 |
| Female | 62 (67.39) | 384 (59.08) | | 109 (45.5) | 29 (56.86) | |
| Male | 30 (32.61) | 266 (40.92) | | 91 (54.5) | 22 (43.14) | |
| Racial Identity | | | < .001*** | | | 0.822 |
| White/Caucasian | 48 (52.17) | 479 (73.69) | | 156 (78.00) | 42 (82.35) | |
| Black/African American | 30 (32.61) | 79 (12.15) | | 35 (17.50) | 9 (17.65) | |
| Asian | 9 (9.78) | 42 (6.46) | | 0 (0.00) | 0 (0.00) | |
| Middle Eastern/North African | 0 (0) | 9 (1.2) | | 1 (0.50) | 0 (0.00) | |
| Native Hawaiian/Pacific Islander | 1 (1.09) | 5 (0.67) | | 0 (0.00) | 0 (0.00) | |
| Native American/Alaska Native | 0 (0) | 0 (0) | | 0 (0.00) | 0 (0.00) | |
| Multiracial/> One Race | 4 (4.35) | 20 (3.08) | | 1 (0.50) | 0 (0.00) | |
| Another Race | 1 (1.09) | 13 (2.0) | | 6 (3.00) | 0 (0.00) | |
| Not Identified | 0 (0) | 12 (1.85) | | 1 (0.50) | 0 (0.00) | |
| Ethnic Identity | | | < .001*** | | | 0.136 |
| Not Hispanic/Latino | 91 (98.91) | 537 (82.62) | | 160 (80) | 46 (90.20) | |
| Hispanic/Latino | 1 (1.09) | 113 (17.38) | | 40 (20) | 5 (9.80) | |
| Education | | | < .001*** | | | 0.177 |
| Less than HS | 2 (2.17) | 9 (1.38) | | 3 (1.53) | 0 (0.00) | |
| HS Diploma or GED | 50 (54.35) | 199 (30.62) | | 40 (20.41) | 17 (33.33) | |
| Some College | 19 (20.65) | 238 (36.62) | | 58 (29.59) | 10 (19.61) | |
| 4-year College Degree | 14 (15.22) | 135 (20.77) | | 60 (30.61) | 12 (23.53) | |
| Graduate or Professional Degree | 7 (7.61) | 69 (10.62) | | 35 (17.86) | 12 (23.53) | |
Note. We evaluated significant differences across demographic groups by setting for each platform. Welch’s t-tests were used for continuous age. Gender and Ethnic Identity were compared with Pearson’s χ² tests (with Yates correction for 2 × 2 tables when appropriate), and Racial Identity and Education Level were compared using Fisher’s exact tests with Monte Carlo simulation due to sparse expected counts in several categories. Significance levels are denoted as *** p < .001, ** p < .01, * p < .05.
Measures
We used data from four measures designed as counterparts to the measures from the NIHTB-CB, which were adapted to be self-administered on a personal smartphone. The two versions of the measures use the same task paradigms, with slight differences in the introductions, practice items, and user interface for the MyCog Mobile version, as it was designed to maximize usability for older adults. The tasks have different names in each platform to maximize understandability and brand consistency for the respective target user groups.
Executive Functioning task
The Executive Functioning task is called “Shape-Color Sorting” in the Mobile Toolbox and “MySorting” in MyCog Mobile, and the task is identical in both versions. Adapted for self-administration from the NIHTB Dimensional Change Card Sorting Task (Zelazo et al., 2014), this executive functioning task measures cognitive flexibility, impulse inhibition, and processing speed. Participants sort bivalent images by shape or color, as indicated by a cue word on the screen. The rate correct score is the ratio of number of correct responses to response speed.
Face Memory task
The Face Memory task is known as “Faces and Names” in the Mobile Toolbox and “MyFaces” in MyCog Mobile, and the task is identical in both versions. The task is an associative memory test that was originally developed by Rentz and colleagues to predict cerebral amyloid beta burden (Rentz et al., 2011), which was also adapted for the NIHTB-CB (Bauer & Zelazo, 2013). Participants are shown 12 photos of faces paired with their names. After a short delay (~5 minutes), they complete three subtests: First Letter (indicating the first letter of the name of the displayed face), Name Matching (selecting the correct name to match a face from three options), and Recognition (selecting the correct face seen before from three options). Each subtest generates a raw accuracy score with a maximum score of 12.
Picture Memory task
The Picture Memory task is known as “Arranging Pictures” in the Mobile Toolbox and “MyPictures” in MyCog Mobile. It is adapted from the NIHTB-CB Picture Sequence Memory test (Dikmen et al., 2014) to measure episodic memory. The paradigm is the same across versions; both measures use a 14-item sequence for the two live trials, describing common activities. Mobile Toolbox presents images related to “a day with a friend,” while MyCog Mobile presents images related to “chores.” The items are presented in portrait view in MyCog Mobile and in landscape view in Mobile Toolbox. Both versions start with a 4-item practice, but MyCog Mobile includes an additional 8-item practice. Participants move the images to their positions via a tapping gesture in MyCog Mobile and a dragging gesture in Mobile Toolbox. After Trial 1, participants see the exact same sequence again and are asked to reorder again in Trial 2. Scores are based on the number of images correctly placed for Trial 1 and Trial 2 with a maximum score of 14 on each trial. We also examined a change score (Trial 2-Trial 1) to assess improvement from Trial 1 to Trial 2.
Working memory task
The Working Memory task is known as “Sequences” in the Mobile Toolbox and “MySequences” in MyCog Mobile, and the task is the same across both versions. It is a letter-number sequencing task that is a counterpart to but not a direct adaptation of the NIHTB-CB List Sorting Working Memory Test (Tulsky et al., 2014). Participants are presented with a sequence of letters and numbers and asked to recall the letters first, in alphabetical order, then the numbers, in ascending order. Trial difficulty increases until participants score incorrectly on all three trials of the same length. Raw scores reflect the number of correct trials with a maximum score of 30.
Procedure
For both Mobile Toolbox and MyCog Mobile, the respective in-person samples were collected via in-person validation studies in a laboratory setting where participants self-administered the measures on study-provided devices as part of a larger battery of external measures (Young, Dworak, Byrne, et al., 2024; Young, Dworak, Novack et al., 2024). Study staff were present during the in-person study but did not closely monitor self-administration. Participants were randomized to complete either the gold standard battery or the smartphone measures first to control for order effects. In the remote samples, participants were given instructions on how to access the measures and self-administer them on their own devices with no supervision. Participants did not have previous exposure to the smartphone measures, eliminating concerns for potential practice effects.
Analysis
All analyses were performed using R (R Core Team, 2025). Prior to primary analyses, we compared demographic variables between study samples using Pearson’s chi-square tests (with Yates continuity correction for 2 × 2 tables) or Fisher’s exact tests for categorical variables, and Welch’s two-sample t-tests for continuous variables (Table 1). Missing data analysis indicated less than 5% missingness across measures, leading us to conduct all analyses on complete case data (N = 272–292 for in-person groups and N = 635–686 for remote groups; see Table 2). Prior to analyses, these data were assessed for normality using both visual inspection of Q–Q plots and residual diagnostics. We screened for heteroskedasticity using the Breusch–Pagan test (and Levene’s test by setting). Significant departures from normality and evidence of heteroskedasticity were observed for the measures examined. Given these violations of parametric assumptions, we employed robust methods throughout.
Table 2.
Continuous score analysis: robust regression with bootstrapped standard errors.
| Measure/Score | Parameter | Estimate | SE | Partial η² | 95% CI | p-value |
|---|---|---|---|---|---|---|
| Executive Function/Rate Correct | (Intercept) | 0.909 | 0.069 | – | – | – |
| | Remote Setting | −0.134 | 0.087 | 0.002 | [−0.300, 0.030] | 0.126 |
| | Age | −0.006 | 0.002 | 0.016 | [−0.009, −0.003] | < 0.001*** |
| | Version (MTB) | 0.076 | 0.053 | 0.002 | [−0.023, 0.180] | 0.149 |
| | Education | 0.003 | 0.016 | 0 | [−0.028, 0.034] | 0.869 |
| | Remote × Age | −0.002 | 0.002 | 0.001 | [−0.005, 0.001] | 0.227 |
| | Remote × Version (MTB) | 0.121 | 0.073 | 0.003 | [−0.021, 0.273] | 0.096 |
| | Remote × Education | 0.032 | 0.02 | 0.002 | [−0.009, 0.071] | 0.111 |
| Face Memory/Recognition | (Intercept) | 9.514 | 0.156 | – | – | – |
| | Remote Setting | −0.006 | 0.117 | 0 | [−0.234, 0.221] | 0.959 |
| | Age | −0.022 | 0.002 | 0.073 | [−0.027, −0.018] | < 0.001*** |
| | Version (MTB) | 1.004 | 0.137 | 0.041 | [0.736, 1.273] | < 0.001*** |
| | Education | 0.12 | 0.039 | 0.007 | [0.044, 0.195] | 0.002** |
| Face Memory/First Letter | (Intercept) | 4.757 | 0.662 | – | – | – |
| | **Remote Setting** | −3.194 | 0.829 | 0.014 | [−4.787, −1.552] | < 0.001*** |
| | Age | −0.062 | 0.015 | 0.018 | [−0.090, −0.034] | < 0.001*** |
| | Version (MTB) | −1.257 | 0.497 | 0.007 | [−2.221, −0.304] | 0.012* |
| | Education | −0.078 | 0.147 | 0 | [−0.362, 0.203] | 0.597 |
| | Remote × Age | 0.011 | 0.016 | 0.001 | [−0.018, 0.040] | 0.473 |
| | **Remote × Version (MTB)** | 2.01 | 0.695 | 0.009 | [0.678, 3.273] | 0.004** |
| | **Remote × Education** | 0.501 | 0.184 | 0.007 | [0.157, 0.864] | 0.007** |
| Face Memory/Matching | (Intercept) | 9.132 | 0.252 | – | – | – |
| | Remote Setting | −0.24 | 0.189 | 0.002 | [−0.608, 0.127] | 0.203 |
| | Age | −0.018 | 0.004 | 0.028 | [−0.025, −0.011] | < 0.001*** |
| | Version (MTB) | 0.538 | 0.221 | 0.007 | [0.106, 0.973] | 0.015* |
| | Education | 0.094 | 0.063 | 0.002 | [−0.028, 0.215] | 0.139 |
| Working Memory/Total Score | (Intercept) | 8.949 | 0.453 | – | – | – |
| | Remote Setting | −0.447 | 0.371 | 0.001 | [−1.202, 0.242] | 0.228 |
| | Age | −0.036 | 0.007 | 0.022 | [−0.049, −0.024] | < 0.001*** |
| | Version (MTB) | 1.996 | 0.426 | 0.018 | [1.182, 2.806] | < 0.001*** |
| | Education | 0.502 | 0.115 | 0.014 | [0.269, 0.722] | < 0.001*** |
| Picture Memory/Trial 1 | (Intercept) | 5.909 | 0.473 | – | – | – |
| | **Remote Setting** | 1.114 | 0.353 | 0.011 | [0.414, 1.762] | 0.002** |
| | Age | −0.034 | 0.007 | 0.029 | [−0.047, −0.021] | < 0.001*** |
| | Version (MTB) | 1.171 | 0.419 | 0.009 | [0.331, 1.995] | 0.005** |
| | Education | 0.202 | 0.114 | 0.003 | [−0.022, 0.442] | 0.077 |
| Picture Memory/Trial 2 | (Intercept) | 10.331 | 0.618 | – | – | – |
| | Remote Setting | −0.434 | 0.78 | 0 | [−1.901, 1.122] | 0.578 |
| | Age | −0.075 | 0.014 | 0.02 | [−0.101, −0.048] | < 0.001*** |
| | Version (MTB) | 0.719 | 0.452 | 0.002 | [−0.203, 1.692] | 0.112 |
| | Education | −0.1 | 0.142 | 0 | [−0.366, 0.196] | 0.48 |
| | **Remote × Age** | 0.048 | 0.014 | 0.007 | [0.018, 0.075] | < 0.001*** |
| | Remote × Version (MTB) | 0.588 | 0.62 | 0.001 | [−0.704, 1.741] | 0.343 |
| | Remote × Education | 0.307 | 0.174 | 0.002 | [−0.048, 0.609] | 0.077 |
| Picture Memory/Change Score | (Intercept) | 2.502 | 0.355 | – | – | – |
| | Remote Setting | −0.394 | 0.265 | 0.002 | [−0.919, 0.092] | 0.137 |
| | Age | 0.003 | 0.005 | 0 | [−0.007, 0.013] | 0.531 |
| | Version (MTB) | 0.443 | 0.314 | 0.002 | [−0.187, 1.061] | 0.159 |
| | Education | −0.073 | 0.086 | 0.001 | [−0.241, 0.106] | 0.392 |
Note. Robust regression models (MM-estimation) were selected using an AIC-based procedure that compared main-effects and interaction specifications while controlling for age (centered), education (1–5 ordinal scale), and test version (Mobile Toolbox vs. MyCog Mobile). Standard errors and p-values were estimated with 1,000 bootstrap resamples. Statistically significant effects are indicated by * p < .05, ** p < .01, and *** p < .001.
Significant setting effects are bolded.
To address RQ1, we fitted models for all eight continuous scores: Executive Functioning Rate Correct; Face Memory First Letter, Name Matching, and Recognition; Working Memory Total Correct; and Picture Memory Trial 1, Trial 2, and Change (Trial 2 − Trial 1). Age was centered at the sample mean for improved interpretability of potential interaction effects. We fitted models with testing setting as the primary predictor, controlling for age (centered), test version (Mobile Toolbox vs. MyCog Mobile), and education level (1–5 ordinal scale) as covariates. To balance parsimony with the need to test key moderations, we compared a main-effects model with an interaction model that included Setting × Age, Setting × Version, and Setting × Education. Because robust regression does not provide likelihood-based indices, model selection was conducted with ordinary least squares (OLS) using the Akaike Information Criterion (AIC), retaining the interaction model if AIC was reduced by at least two points. The final specification was then re-fit with robust MM-estimation (Yohai, 1987), which provides resistance to non-normality and heteroskedasticity. Standard errors, confidence intervals, and p-values were obtained from 1,000 bootstrap resamples. Effect sizes for the robust linear regressions were quantified as partial eta-squared (ηp²), calculated as t²/(t² + df_residual), with values categorized as small ≥ .0099, medium ≥ .059, and large ≥ .14 (Richardson, 2011). Power analyses indicated that our continuous score analyses were able to detect effects as small as η² ≈ .008 with 80% power at α = .05.
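The effect-size formula and the AIC retention rule described above can be illustrated with a short Python sketch. The analyses themselves were run in R; the function names, the example inputs, and the "negligible" label for values below the small cutoff are assumptions introduced here for illustration and are not the authors' code.

```python
def partial_eta_squared(t_value: float, df_residual: float) -> float:
    """Partial eta-squared from a coefficient's t statistic: t^2 / (t^2 + df)."""
    t_sq = t_value ** 2
    return t_sq / (t_sq + df_residual)


def effect_size_label(eta_sq: float) -> str:
    """Apply the benchmarks quoted in the text (Richardson, 2011)."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.059:
        return "medium"
    if eta_sq >= 0.0099:
        return "small"
    return "negligible"  # assumed label for values under the 'small' cutoff


def select_model(aic_main: float, aic_interaction: float, min_drop: float = 2.0) -> str:
    """Retain the interaction model only if it lowers AIC by at least `min_drop`."""
    if (aic_main - aic_interaction) >= min_drop:
        return "interaction"
    return "main effects"


if __name__ == "__main__":
    # Hypothetical inputs, not values taken from the study.
    eta = partial_eta_squared(t_value=2.0, df_residual=96)
    print(round(eta, 3), effect_size_label(eta))
    print(select_model(aic_main=100.0, aic_interaction=97.5))
```

Note that the t statistic here would come from the fitted model; because the reported standard errors are bootstrapped, recomputing t as estimate/SE from Table 2 may not reproduce the tabled η² values exactly.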
To address RQ2, we first examined the distribution of perfect scores. The Executive Functioning task was excluded from the perfect score analyses as there is no fixed maximal count of items correct. For the other measures, we examined ceiling effects as a broad distributional problem where a substantial portion of participants cluster near the maximum score, limiting the measure’s ability to discriminate among high performers. In addition to the overall percentage of perfect scores, we calculated a “ceiling room” index, computed as (maximum possible score – mean score)/standard deviation, with values less than 2.0 suggesting potential ceiling effects that could artificially constrain performance variability at the top of the ability range. Comparisons of both perfect scores and ceiling room help to identify if there is a true ceiling effect, i.e., the test is too easy for individuals of higher abilities, and thus there isn’t enough room for variability at this end of the ability range, regardless of setting.
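As a minimal sketch of the ceiling room index defined above, the following Python function computes the percentage of perfect scores and (maximum possible score − mean score)/standard deviation for a vector of scores. The use of the sample standard deviation, the function name, and the example scores are assumptions; the text does not specify which standard deviation estimator was used.

```python
from statistics import mean, stdev


def ceiling_summary(scores, max_score, threshold=2.0):
    """Summarize perfect scores and ceiling room for one measure.

    Ceiling room = (max possible score - mean score) / SD.
    The sample SD (n - 1 denominator) is assumed here.
    """
    n_perfect = sum(1 for s in scores if s == max_score)
    room = (max_score - mean(scores)) / stdev(scores)
    return {
        "n": len(scores),
        "pct_perfect": 100.0 * n_perfect / len(scores),
        "ceiling_room": room,
        # Values under 2.0 are treated in the text as a potential ceiling effect.
        "possible_ceiling": room < threshold,
    }


if __name__ == "__main__":
    # Hypothetical Picture Memory-style scores with a maximum of 14.
    example = ceiling_summary([10, 12, 14, 12], max_score=14)
    print(example)
```

Computing the index separately for in-person and remote groups, as in the study, makes it possible to distinguish a test that is too easy in both settings from one where scores cluster at the maximum only remotely.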
To analyze the effect of setting on perfect score achievement, controlling for covariates and potential interactions, we used logistic regression with heteroskedasticity-consistent (HC3) standard errors, which provide more reliable inference in the presence of heteroskedasticity and small-subsample bias. Only Face Memory Recognition, Face Memory Matching, and the Picture Memory subtests had enough perfect scores for inclusion in the logistic regression analyses. We used the same AIC-based model selection procedure as in RQ1 to select for setting interaction effects that improved model fit, except for Picture Memory Trial 1, which did not have enough perfect scores (only 8 in the in-person sample) to accommodate complex models. As such, only a main effects model was fit for Picture Memory Trial 1. Effect sizes for the logistic regressions were quantified as Cohen’s d, with conventional benchmarks of small (d ≈ 0.20), medium (d ≈ 0.50), and large (d ≈ 0.80). Minimum detectable effect sizes ranged from d = 0.19–0.21 for the Face Memory tasks to d = 0.82 for Picture Memory Trial 2 and d = 1.68 for Picture Memory Trial 1. The study was adequately powered at 80% (α = .05) to detect small-to-medium effects on the Face Memory tasks but underpowered for the Picture Memory tasks, where there were fewer perfect scores in each subgroup.
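The text does not state how odds ratios were converted to Cohen’s d. One common conversion is the logit method, d = ln(OR) × √3/π, and the values reported in Table 4 appear consistent with it. The Python sketch below assumes that conversion; the function names are hypothetical.

```python
import math


def odds_ratio_to_cohens_d(odds_ratio: float) -> float:
    """Convert an odds ratio to Cohen's d via the logit method.

    d = ln(OR) * sqrt(3) / pi. This conversion is assumed here;
    the source does not name the method it used.
    """
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    return math.log(odds_ratio) * math.sqrt(3) / math.pi


def cohens_d_label(d: float) -> str:
    """Label |d| against the conventional 0.20 / 0.50 / 0.80 benchmarks."""
    magnitude = abs(d)
    if magnitude >= 0.80:
        return "large"
    if magnitude >= 0.50:
        return "medium"
    if magnitude >= 0.20:
        return "small"
    return "negligible"  # assumed label for values under 0.20


if __name__ == "__main__":
    # Odds ratio for Remote Setting on Picture Memory Trial 1 perfect scores (Table 4).
    d = odds_ratio_to_cohens_d(4.9)
    print(round(d, 3), cohens_d_label(d))  # 0.876 large
```

An odds ratio below 1 yields a negative d, which matches the sign convention in Table 4 (for example, the Age rows).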
Results
In Mobile Toolbox, in-person and remote groups differed on age, race, ethnicity, and education (Table 1). In MyCog Mobile, these demographics did not differ significantly by setting. Table 2 presents results from robust regression analyses examining setting effects on continuous cognitive performance scores. AIC-based model selection identified interaction models as optimal for three outcome scores: Executive Functioning, Face Memory First Letter, and Picture Memory Trial 2; main effects models were selected for the remaining five measures (Face Memory Recognition, Face Memory Matching, Working Memory, Picture Memory Trial 1, and Picture Memory Change Score). In the optimal models, setting (i.e., in-person vs. remote) had a significant direct effect for Picture Memory Trial 1 (with higher scores observed in the remote group) and Face Memory First Letter (with lower scores in the remote group). However, for Face Memory First Letter, the effect of setting was significantly moderated by education and version. There was no significant direct effect of setting on Picture Memory Trial 2, but a significant interaction was found with age, such that younger participants showed a larger remote – in-person difference. No significant effects of setting were found for the Executive Functioning, Face Memory Matching or Recognition, Working Memory, or the Picture Memory Change Score. Across measures, partial η² values for setting were consistently small to negligible.
Descriptive statistics indicated large discrepancies in the proportion of perfect scores between settings for several memory measures, with remote participants more likely to earn perfect scores than in-person participants (Table 3). Remote participants achieved higher rates of perfect scores on Picture Memory (18.8% remote vs. 2.8% in-person for Trial 1; 46.2% vs. 15.1% for Trial 2) as well as on Face Memory tasks. Several measures also showed evidence of potential ceiling effects: ceiling room was low (< 2) on Picture Memory Trial 2, Face Memory Recognition, and Face Memory Matching in both remote and in-person groups, whereas Picture Memory Trial 1 showed low ceiling room only in the remote group. Despite these descriptive differences, logistic regression models controlling for covariates (Table 4) found a significant effect of remote setting on perfect scores only for Picture Memory Trial 1; however, it should be noted that interaction effects could not be estimated in this model. There was a significant indirect effect of setting moderated by age for Trial 2 perfect scores.
Table 3.
Percent of perfect scores and ceiling room analysis.
| Measure | Score Type | In-person: N | In-person: Perfect Scores n (%) | In-person: Ceiling Room | Remote: N | Remote: Perfect Scores n (%) | Remote: Ceiling Room |
|---|---|---|---|---|---|---|---|
| Face Memory | Recognition | 288 | 49 (17.0) | 1.16 | 674 | 291 (43.2) | 0.74 |
| | First Letter | 292 | 0 (0.0) | 3.99 | 674 | 10 (1.5) | 2.62 |
| | Name Matching | 292 | 42 (14.4) | 1.32 | 674 | 134 (19.9) | 1.16 |
| Working Memory | Total Score | 272 | 0 (0.0) | 4.48 | 686 | 1 (0.1) | 3.77 |
| Picture Memory | Trial 1 | 285 | 8 (2.8) | 2.57 | 676 | 127 (18.8) | 1.41 |
| | Trial 2 | 285 | 43 (15.1) | 1.53 | 676 | 312 (46.2) | 0.82 |
Note. Ceiling Room = (Maximum Score – Mean Score)/Standard Deviation. Lower ceiling room values indicate stronger ceiling effects. Maximum possible scores are as follows: Face Memory Recognition = 12, Face Memory First Letter = 12, Face Memory Name Matching = 12, Working Memory Total Score = 30, Picture Memory Trial 1 = 14, Picture Memory Trial 2 = 14. Executive Function was excluded because it has no defined maximum score.
Table 4.
Perfect score analysis: logistic regression with heteroscedasticity-consistent estimators.
| Measure/Score | Parameter | Odds Ratio | SE(log OR) | Cohen’s d | 95% CI | p-value |
|---|---|---|---|---|---|---|
| Face Memory/Recognition | (Intercept) | 0.1 | 0.36 | – | – | – |
| | Remote Setting | 1.18 | 0.23 | 0.091 | [0.75, 1.85] | 0.477 |
| | Age | 0.52 | 0.092 | −0.361 | [0.44, 0.62] | < 0.001*** |
| | Version (MTB) | 3.03 | 0.315 | 0.611 | [1.63, 5.61] | < 0.001*** |
| | Education | 1.21 | 0.078 | 0.105 | [1.03, 1.40] | 0.017* |
| Face Memory/Matching | (Intercept) | 0.14 | 0.388 | – | – | – |
| | Remote Setting | 0.65 | 0.256 | −0.238 | [0.39, 1.07] | 0.087 |
| | Age | 0.7 | 0.104 | −0.197 | [0.57, 0.86] | < 0.001*** |
| | Version (MTB) | 2.25 | 0.338 | 0.447 | [1.16, 4.37] | 0.016* |
| | Education | 1.03 | 0.088 | 0.016 | [0.87, 1.22] | 0.731 |
| Picture Memory/Trial 1 | (Intercept) | 0.02 | 0.484 | – | – | – |
| | **Remote Setting** | 4.9 | 0.564 | 0.876 | [1.62, 14.82] | 0.005** |
| | Age | 0.68 | 0.125 | −0.213 | [0.53, 0.87] | 0.002** |
| | Version (MTB) | 1.24 | 0.58 | 0.119 | [0.40, 3.88] | 0.706 |
| | Education | 1.15 | 0.098 | 0.077 | [0.95, 1.40] | 0.144 |
| Picture Memory/Trial 2 | (Intercept) | 0.21 | 0.839 | – | – | – |
| | Remote Setting | 1.84 | 0.959 | 0.336 | [0.28, 12.06] | 0.524 |
| | Age | 0.36 | 0.331 | −0.563 | [0.19, 0.70] | 0.002** |
| | Version (MTB) | 3.27 | 0.588 | 0.653 | [1.03, 10.35] | 0.044* |
| | Education | 0.91 | 0.182 | −0.052 | [0.64, 1.31] | 0.621 |
| | **Remote × Age** | 2.07 | 0.344 | 0.401 | [1.05, 4.07] | 0.035* |
| | Remote × Version (MTB) | 0.6 | 0.726 | −0.282 | [0.15, 2.50] | 0.484 |
| | Remote × Education | 1.12 | 0.2 | 0.062 | [0.76, 1.66] | 0.575 |
Note. Logistic regression models examined the effect of administration setting (remote vs. in-person) on the likelihood of achieving a perfect score, controlling for age (centered), education (1–5 ordinal scale), and test version (Mobile Toolbox vs. MyCog Mobile). Standard errors were estimated using heteroscedasticity-consistent (HC3) estimators, and p-values were derived from 1,000 bootstrap resamples. For Picture Memory Trial 1, only a main-effects model was fit due to sparse perfect scores in the in-person group. Statistically significant effects are indicated by * p < .05, ** p < .01, and *** p < .001.
Significant setting effects are bolded.
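As a reading aid, the odds ratios, standard errors, and intervals in the table can be related to one another with a few lines of arithmetic. The sketch below assumes that the intervals are Wald-type intervals formed on the log-odds scale, exp(ln OR ± 1.96·SE), and that the third numeric column is ln(OR) rescaled by √3/π. Both are inferences from the reported numbers rather than statements in the note, and the function names are illustrative only.

```python
import math

def wald_ci_from_or(odds_ratio, se_log_odds, z=1.96):
    """Return the (lower, upper) Wald interval on the odds-ratio scale.

    Assumes se_log_odds is the standard error of ln(OR); this is an
    inference from the table, not something the note states.
    """
    log_or = math.log(odds_ratio)
    return (math.exp(log_or - z * se_log_odds),
            math.exp(log_or + z * se_log_odds))

def standardized_log_or(odds_ratio):
    """Rescale ln(OR) by sqrt(3)/pi (a common logit-to-d conversion)."""
    return math.log(odds_ratio) * math.sqrt(3) / math.pi

# Remote Setting row for Picture Memory Trial 1: OR = 4.9, SE = 0.564
lower, upper = wald_ci_from_or(4.9, 0.564)
print(round(lower, 2), round(upper, 2))
print(round(standardized_log_or(4.9), 3))
```

Under these assumptions the interval comes out at roughly 1.62 to 14.8 and the rescaled coefficient at roughly 0.876, close to the reported [1.62, 14.82] and 0.876. Small gaps are expected because the tabled values are rounded.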
Discussion
Overall, performance on the self-administered smartphone measures was broadly comparable across remote and in-person settings, with differences that were small and specific to certain tasks rather than systematic across domains. These findings suggest that self-administered cognitive assessments can generally be administered remotely without the setting meaningfully compromising scores, although some memory tasks may be modestly sensitive to testing environment, and further research is warranted.
In terms of our hypotheses, H1, which predicted higher scores for remote participants on memory tasks, was partially supported, with inconsistent results across the different types of memory tests. Remote participants performed significantly better on Picture Memory Trial 1, and the effect for Picture Memory Trial 2 emerged only through an interaction with age. In contrast, Face Memory First Letter scores were higher in person, although this effect was moderated by education and version. The remaining Face Memory subtests showed no direct setting effects. H2 was fully supported, as there were no significant differences between remote and in-person administration for either the working memory or the executive functioning task. Finally, H3, which predicted a higher proportion of perfect scores in remote settings for memory tasks but not for executive function or working memory, was partially supported. Remote participants were more likely to achieve perfect scores on Picture Memory Trial 1, and an age-moderated setting effect was observed for Trial 2. Although raw perfect-score rates for Face Memory tasks were higher remotely, adjusted models indicated that these differences were better explained by covariates than by setting. Moreover, ceiling effects, undefined perfect scores, or sparse perfect scores limited our ability to fully examine this hypothesis for many of the subtests.
Although our study was well powered to detect meaningful effects on continuous scores, the Picture Memory tests were not well powered for the perfect score analyses. The effect of setting on Trial 1 perfect scores was large and significant yet should be interpreted cautiously given the small number of perfect scores in the in-person group (n = 8) compared to the remote group (n = 127). Together with the significant influence of setting on continuous scores, findings suggest remote setting has a true effect on performance for Picture Memory Trial 1. The indirect effect of remote setting on perfect scores in Trial 2 was also still significant despite lower power, but different significant effects may have been detected in a larger sample, and interpretation warrants caution.
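The caution about sparse cells can be made concrete with the textbook large-sample (Woolf) variance of an unadjusted log odds ratio, which sums the reciprocals of the four cell counts. The sketch below uses only the two perfect-score counts reported above (8 in-person, 127 remote). Because the two non-perfect cells can only add to the variance, the result is a lower bound. This illustrates a crude 2 × 2 table, not the covariate-adjusted HC3 model used in the study, and the function name is hypothetical.

```python
import math

def min_se_log_or(perfect_in_person, perfect_remote):
    """Lower bound on the Woolf SE of an unadjusted ln(OR).

    The full formula is sqrt(1/a + 1/b + 1/c + 1/d); the two
    non-perfect cells are omitted here, so the true SE is larger.
    """
    return math.sqrt(1 / perfect_in_person + 1 / perfect_remote)

se_floor = min_se_log_or(8, 127)
# Share of that variance floor contributed by the sparse in-person cell
sparse_share = (1 / 8) / (1 / 8 + 1 / 127)
# Minimum ratio of upper to lower 95% limit on the odds-ratio scale
ci_width_factor = math.exp(2 * 1.96 * se_floor)

print(round(se_floor, 3), round(sparse_share, 2), round(ci_width_factor, 1))
```

In this simplified setting the eight in-person perfect scores account for roughly 94% of the variance floor, so the upper 95% limit of a crude odds ratio would be at least about four times its lower limit no matter how many participants were tested in total.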
The discrepancy between Trial 1 and Trial 2 findings may be partially explained by a ceiling effect noted for Trial 2 in both the in-person and remote samples, suggesting that even under relatively more controlled in-person conditions it is easier to get a perfect score on Trial 2, which may mask the role of setting. Setting did not significantly influence the change in performance across trials, which suggests participants did not begin using external strategies on Trial 2 after learning the task on Trial 1 (e.g., by working out how to cheat on the task).
In contrast to Picture Memory, on Face Memory First Letter, another task where external score enhancement strategies could be used (e.g., writing down the name), participants actually did better in the in-person setting, though this relationship varied by education and test version. There were not enough perfect scores to statistically evaluate the role of setting on perfect score achievement, suggesting the test is quite difficult and ceiling effects are not a problem. The other Face Memory subtests did not demonstrate significant setting effects for either continuous scores or perfect scores. However, both measures demonstrated potential ceiling effects, which could mask the role of setting.
As we expected, neither the Executive Function nor the Working Memory task demonstrated significant setting effects. Neither could be examined properly for perfect score achievement, as Executive Function does not have a defined perfect score and only one participant achieved a perfect score on Working Memory. Ceiling effects were not a concern for these tasks, which offer ample room to capture variance among high achievers.
The implications of domain-specific effects of setting in our study are mixed. Consistent with existing research (Atkins et al., 2022), we found that setting played a significant role in performance on memory measures, but in a complex and inconsistent pattern. Notably, the direction of effects was not uniformly favorable to remote testing – while remote participants scored higher on Picture Memory Trial 1, they actually performed worse on Face Memory First Letter (albeit moderated by education and test version), suggesting that simple explanations like widespread use of external strategies are insufficient to explain our findings. The interaction effects with age, education, and test version further complicate interpretation and suggest that individual and contextual factors may moderate how testing environment influences performance.
From a practical standpoint, these findings suggest that while remote administration may introduce some performance variability, the effects are generally small and may not compromise the overall validity of cognitive assessments for research purposes. While we report conventional statistical effect sizes for comparison with other research, we recognize that clinical meaningfulness depends on context (Zakzanis, 2001). In clinical assessment, even statistically “small” effects could be meaningful if they affect diagnostic decisions or treatment planning, and these small performance differences may warrant further consideration in clinical or other high-stakes contexts.
The domain-specific nature of these effects highlights the importance of task design considerations. Memory tasks that provide clear opportunities for external aids (such as those with extended encoding periods or explicit recall instructions) may benefit from design modifications that minimize such opportunities, such as shorter presentation times, unexpected recall elements, or strategic task instructions that emphasize the importance of unassisted performance. Ensuring that remote tests offer a wide range of item difficulties (i.e., “a high ceiling”) may also help identify potential issues with cheating. For executive function and working memory tasks, our results provide reassuring evidence that remote administration does not substantially alter performance, likely due to their real-time, speeded nature that inherently limits opportunities for external assistance.
Limitations
Our study offers many strengths including a large sample size, examination of multiple cognitive domains across two different smartphone platforms, and use of robust statistical methods. However, there are important limitations that should be considered in the interpretation of findings and to guide future studies into remote assessments. First, it is essential to acknowledge that the present findings do not imply causality – we cannot explain definitively why there might be higher scores remotely. Without direct observation of remote participants, we cannot determine whether score differences stem from external enhancement strategies, environmental factors that legitimately enhance performance (e.g., reduced anxiety), or demographic differences not fully controlled for in our analyses.
It is important to note that external score enhancement strategies are not unique to remote testing. In-person testing environments, even when supervised, can be subject to similar confounding factors. Self-administration was not closely monitored in our in-person sample, meaning participants could have potentially used external strategies similar to those available in remote settings (e.g., writing down stimuli without staff noticing). The key difference may lie more in the perceived opportunity and social acceptability of such strategies rather than their absolute availability (Mol et al., 2020).
Unfortunately, we were not able to use a within-subjects design to directly compare the same participants’ performance across both settings in this study. This between-subjects approach means individual differences may have influenced our results despite controlling for many demographic factors. A future within-subjects design would provide stronger evidence by allowing each participant to serve as their own control, thereby isolating the effect of setting more precisely. Although we maintained adequate power (80%) for detecting meaningful effects on continuous outcomes, our unequal sample sizes between settings (particularly for MyCog Mobile with only 51 remote participants) and the between-subjects design reduced statistical power for our perfect score analyses. Future studies should consider using more balanced samples in specific populations of interest (e.g., those with low technology experience).
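The cost of unequal group sizes can be illustrated with a simple two-group calculation. The sketch below uses a normal-approximation z-test for a standardized mean difference, with the overall group sizes given in the abstract (in-person 292, remote 701). It is not the power analysis conducted for this study, which involved robust regression with covariates, and the function names are hypothetical.

```python
from statistics import NormalDist

def min_detectable_d(n1, n2, alpha=0.05, power=0.80):
    """Smallest standardized mean difference detectable in a two-sided,
    two-group z-test (normal approximation, equal variances assumed)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * (1 / n1 + 1 / n2) ** 0.5

def relative_efficiency(n1, n2):
    """Precision of an unequal split relative to an equal split of the
    same total N; equals 1.0 when n1 == n2."""
    return 4 * n1 * n2 / (n1 + n2) ** 2

# Group sizes taken from the abstract (in-person 292, remote 701)
print(round(min_detectable_d(292, 701), 3))
print(round(relative_efficiency(292, 701), 2))
```

With these inputs the minimum detectable difference is about d = 0.20 and the relative efficiency about 0.83, meaning the 292/701 split carries roughly the information of a balanced sample 17% smaller. Binary perfect-score outcomes with few events carry less information per participant than continuous scores, which is consistent with the reduced power noted above.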
Another important limitation is the absence of formal performance validity tests (PVTs) or embedded validity indicators in our study design. Without these measures, we cannot definitively establish the credibility of individual responses or identify participants who may have provided invalid data due to insufficient effort, external assistance, or other factors. Future studies should incorporate PVTs to better understand the relationship between response validity and administration setting. However, our pattern of results – showing inconsistent and domain-specific effects rather than uniformly higher remote scores – suggests that systematic invalid responding was not prevalent.
Finally, our study design involved data collection from separate validation studies with different research objectives, which limited our ability to control for test administration order and timing. Because remote participants used personal devices, we could not standardize screen size, processor, or connectivity; we restricted analyses to iPhones (version 8+) but did not collect detailed device metadata. Future studies may consider providing standardized devices or systematically collecting device specifications and environmental data to better isolate the effects of administration setting.
Conclusion
Overall, our results suggest artificial score enhancement strategies (e.g., “cheating”) were not a major problem in our studies; otherwise, we would expect stronger remote setting effects for tests with more enhancement opportunities, larger improvement between repeated trials, and a higher proportion of perfect scores remotely. These findings generally support the use of remote self-administered measures in research contexts, though further investigation is warranted. Our results also highlight methodological considerations for researchers developing self-administered cognitive measures to safeguard against artificial enhancement strategies. Tasks designed to mitigate opportunities to cheat, such as those incorporating unexpected recall elements and strategic instructions, may alleviate concerns about cheating and enhance confidence in performance across settings. Other approaches like honor code reminders may also prove effective (Mukherjee et al., 2023). Further studies should explore performance differences, confirm the presence and extent of artificial enhancement, and test potential solutions to ensure performance validity, including validated performance validity measures to assess response credibility in remote settings.
Funding
This work was supported by the National Institutes of Health [1R01AG074245-01] and [U2CAG060426].
Footnotes
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- Arioli F, Pedroli E, Cipresso P, & Riva G (2022). Validation of at-home application of a digital cognitive screener for older adults. Frontiers in Aging Neuroscience, 14, 907496. 10.3389/fnagi.2022.907496
- Atkins AS, Kraus M, Welch M, Yuan P, Stevens A, Welsh-Bohmer K, & Keefe RSE (2022). Remote self-administration of digital cognitive tests using the brief assessment of cognition: Feasibility, reliability, and sensitivity to subjective cognitive decline. Frontiers in Psychiatry, 13, 910896. 10.3389/fpsyt.2022.910896
- Bauer PJ, & Zelazo PD (2013). NIH Toolbox Cognition Battery (CB): Summary, conclusions, and implications for cognitive development. Monographs of the Society for Research in Child Development, 78(4), 133–146. 10.1111/mono.12039
- Dikmen SS, Bauer PJ, Weintraub S, Mungas D, Slotkin J, Beaumont JL, Gershon R, Temkin NR, & Heaton RK (2014). Measuring episodic memory across the lifespan: NIH Toolbox Picture Sequence Memory Test. Journal of the International Neuropsychological Society, 20(6), 611–619. 10.1017/S1355617714000460
- Gershon RC, Sliwinski MJ, Mangravite L, King JW, Kaat AJ, Weiner MW, & Rentz DM (2022). The mobile toolbox for monitoring cognitive function. Lancet Neurology, 21(7), 589–590. 10.1016/S1474-4422(22)00225-3
- Gershon RC, Wagster MV, Hendrie HC, Fox NA, Cook KF, & Nowinski CJ (2013). NIH Toolbox for assessment of neurological and behavioral function. Neurology, 80(11 Supplement 3), S2–S6. 10.1212/WNL.0b013e3182872e5f
- Jutten RJ, Burling J, Campbell EC, Roy C, Properzi MJ, Amariglio RE, Marshall GA, Johnson KA, Sperling RA, Papp KV, & Rentz DM (2023). The mobile toolbox for assessing cognition in older adults: Associations with standardized cognitive testing and amyloid and tau PET. Alzheimer’s & Dementia, 19(S18), e073522. 10.1002/alz.073522
- Madero EN, Crook TH, Reynolds C, & Buckwalter JG (2021). Environmental distractions during unsupervised remote digital cognitive assessment. The Journal of Prevention of Alzheimer’s Disease, 8(3), 263–266. 10.14283/jpad.2021.25
- McBride-Henry K, Nazari Orakani S, Good G, Roguski M, & Officer TN (2023). Disabled people’s experiences accessing healthcare services during the COVID-19 pandemic: A scoping review. BMC Health Services Research, 23, 346.
- Mol JM, van der Heijden ECM, & Potters JJM (2020). (Not) alone in the world: Cheating in the presence of a virtual observer. Experimental Economics, 23(4), 961–978. 10.1007/s10683-020-09644-0
- Mukherjee S, Rohles B, Distler V, Lenzini G, & Koenig V (2023). The effects of privacy-non-invasive interventions on cheating prevention and user experience in unproctored online assessments: An empirical study. Computers and Education, 207, 104925.
- Novack MA, Young SR, Dworak EM, Kaat AJ, Slotkin J, Nowinski C, Yao L, Adam H, Stoeger J, Hosseinian Z, Amagai S, Pila S, Varela Diaz M, Almonte Correa A, Alperin K, Carlson S, Kellen M, Omberg L, Camacho MR … Weiner MW (2024). Mobile toolbox (MTB) remote measures of executive function and processing speed: Development and validation. Journal of the International Neuropsychological Society: JINS, 30(7), 1–9. 10.1017/S1355617724000225
- R Core Team. (2025). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
- Rentz DM, Amariglio RE, Becker JA, Frey M, Olson LE, Frishe K, Carmasin J, Maye JE, Johnson KA, & Sperling RA (2011). Face-name associative memory performance is related to amyloid burden in normal elderly. Neuropsychologia, 49(9), 2776–2783. 10.1016/j.neuropsychologia.2011.06.006
- Richardson JTE (2011). Eta squared and partial eta squared as measures of effect size in educational research. Educational Research Review, 6(2), 135–147. 10.1016/j.edurev.2010.12.001
- Schlemmer M, & Desrichard O (2018). Is medical environment detrimental to memory? A test of a white coat effect on older people’s memory performance. Clinical Gerontologist, 41(1), 77–81. 10.1080/07317115.2017.1307891
- Syed M (2021). Reproducibility, diversity, and the crisis of inference in psychology. OSF. 10.31234/osf.io/89buj
- Tulsky DS, Carlozzi N, Chiaravalloti ND, Beaumont JL, Kisala PA, Mungas D, Conway K, & Gershon R (2014). NIH Toolbox cognition battery (NIHTB-CB): List Sorting test to measure Working Memory. Journal of the International Neuropsychological Society, 20(6), 599–610. 10.1017/S135561771400040X
- Yohai VJ (1987). High breakdown-point and high efficiency robust estimates for regression. Annals of Statistics, 15(2), 642–656. 10.1214/aos/1176350366
- Young SR, Dworak EM, Byrne GJ, Jones CM, Yao L, Benavente JNY, Diaz MV, Curtis L, Gershon R, Wolf M, & Nowinski CJ (2024). Remote self-administration of cognitive screeners for older adults prior to a primary care visit: Pilot cross-sectional study of the reliability and usability of the MyCog Mobile screening app. JMIR Formative Research, 8(1), e54299. 10.2196/54299
- Young SR, Dworak EM, Novack MA, Kaat AJ, Adam H, Nowinski CJ, Hosseinian Z, Slotkin J, Stoeger J, Amagai S, Varela Diaz M, Almonte Correa A, Alperin K, Omberg L, Kellen M, Camacho MR, Landavazo B, Nosheny RL, Weiner MW, & Gershon R (2024). Development and validation of an episodic memory measure in the Mobile Toolbox (MTB): Arranging pictures. Journal of Clinical and Experimental Neuropsychology, 46(4), 1–10. 10.1080/13803395.2024.2353945
- Young SR, Lattie EG, Berry ABL, Bui L, Byrne GJ, Benavente JNY, Bass M, Gershon RC, Wolf MS, & Nowinski CJ (2023). Remote cognitive screening of healthy older adults for primary care with the MyCog Mobile app: Iterative design and usability evaluation. JMIR Formative Research, 7(1), e42416. 10.2196/42416
- Zakzanis KK (2001). Statistics to tell the truth, the whole truth, and nothing but the truth: Formulae, illustrative numerical examples, and heuristic interpretation of effect size analyses for neuropsychological researchers. Archives of Clinical Neuropsychology, 16(7), 653–667. 10.1016/S0887-6177(00)00076-7
- Zelazo PD, Anderson JE, Richler J, Wallner-Allen K, Beaumont JL, Conway KP, Gershon R, & Weintraub S (2014). NIH Toolbox cognition battery (CB): Validation of Executive function measures in adults. Journal of the International Neuropsychological Society: JINS, 20(6), 620–629. 10.1017/S1355617714000472
