. 2026 Jan 9;19:1690499. doi: 10.3389/fnbeh.2025.1690499

Table 3.

Risk of bias.

Study	Selection bias	Performance bias	Detection bias	Attrition bias	Reporting bias
Ulrich et al. (2014)	Low risk. Volunteer sample of students; within-subject design minimizes group differences. (All conditions experienced by all participants.)	Some concerns. Participants not blinded to task difficulty; possible expectation of “optimal” condition. Task order not specified (could influence engagement).	Low risk (objective perfusion MRI data; flow ratings used but likely collected uniformly).	Low (27 entered, 27 analyzed; no dropouts reported).	Low (All hypothesized regions reported, both increases and decreases).
Ulrich et al. (2016)	Some concerns. Sample of 23 male students only; limits external validity, though internally participants serve as their own controls.	Some concerns. Difficulty conditions are obvious to participants; short blocks might cause carry-over (counterbalancing unclear). Experimenters knew condition sequences.	Low risk (fMRI outcomes objective; flow verified via EDA and self-report collected similarly across conditions).	Low (all 23 analyzed; no mention of exclusions).	Low (Reported both activation and deactivation findings; analysis plan followed prior study, reducing selective reporting risk).
Ulrich et al. (2022a)	Some concerns. 41 male participants; results may not generalize to females. Otherwise, within-subject comparisons are sound.	Some concerns. Likely not blinded to condition; however, the task was automated. Unclear whether the “flow” block could be anticipated by participants.	Low (connectivity analysis objective; investigators presumably blind during data processing. Self-report flow data not heavily featured).	Low (no dropouts reported; adequate data from all).	Low (Focus was on insula connectivity as the pre-registered hypothesis; full results for that reported. Unreported analyses unlikely).
Huskey et al. (2018)	Low. Recruited gamers; all participants experienced all difficulty conditions. Groups not an issue (within-subject design).	Some concerns. Participants know when game is easy vs. hard. However, the engaging nature of the game might reduce demand characteristics.	Low (fMRI and psychophysiological data collected identically across conditions. Self-reports of challenge/skill likely obtained immediately, reducing bias).	Low (retention good; any data loss due to technical issues was minor).	Low (All relevant outcomes discussed, including cases where DMN increased in non-flow, supporting no selective omission).
de Sampaio Barros et al. (2018)	Low. Within-subject with counterbalanced game conditions (assumed; if not, order could be an issue, but orders were likely rotated). Sample is somewhat small (18) but acceptable.	Some concerns. Participants could identify easy vs. hard levels, which might affect effort. Also, if the autonomy condition always came last, there is an order bias. No blinding of experimenters to condition.	Some concerns. NIRS outcome objective, but flow feelings were self-reported (might bias toward thinking “optimal = flow”). However, probe questions were used to catch mind-wandering, which is a strength.	Low (no mention of dropouts; all provided data for each condition).	Low (Reported physiological, self-report, and NIRS measures even when some were non-significant; appears comprehensive).
Beaty et al. (2016)	Low. Large sample; all did both creative and control tasks in random order. Individual differences in creativity were accounted for in analysis.	Low. Task order was counterbalanced across subjects; participants were not aware of the specific hypothesis about network coupling.	Low (fMRI connectivity analysis is objective. Creativity outcomes (divergent thinking scores) were assessed outside scanner by independent judges).	Low (all participants included; data complete. Any participants with excessive motion were likely removed per standard practice, which should not introduce bias).	Low (Well-reported: they published even null results for some analyses and did confirmatory analyses. No obvious selective reporting).
Beaty et al. (2018)	Low. Two independent samples are used for model training and testing, enhancing robustness. Participants were recruited without regard to creativity (range of scores achieved).	Low. Resting-state scan, so no task performance to bias. Participants did not know their connectivity was to be correlated with creativity, so no performance expectations.	Low (Functional connectivity at rest measured objectively; creativity testing was standardized and scoring was presumably blinded).	Low (nearly all who were scanned provided usable data; those with high head motion were excluded per protocol, which is unlikely to introduce systematic bias).	Low (Outcomes pre-specified: prediction of creativity. Paper reports both successful and unsuccessful predictions for comparison models; likely no selective omission).
Rosen et al. (2024)	Some concerns. Two non-randomized groups (experts vs. non-experts); group differences could confound results (e.g., age, music training aside from expertise). Both groups performed the same task to mitigate that.	Some concerns. Impossible to blind musicians to their skill level or task; they knew they were improvising. Researchers knew who was an expert. Performance setting was standardized, though.	Low (EEG data collection uniform. Flow ratings might be subjective, but expert or novice status would not bias self-ratings except as a true reflection of experience. Any experimenters scoring performances for creativity should ideally be blinded to expert/novice status; unclear if done).	Low (32 entered, possibly all completed. If any EEG data were excluded, it was likely due to noise, unrelated to flow level).	Low (Main outcomes (frontal deactivation in experts, etc.) reported. Qualitative descriptions align with hypothesis. No indication that contrary data were hidden; novices’ data were reported as showing less of the effect, as expected).
Ulrich et al. (2022b)	Some concerns. Convenience sample of healthy male university students only (N = 41); limits generalizability (sex/age/culture). Within-subject design reduces between-group confounding.	Some concerns. Participants can infer condition difficulty (boredom/flow/overload), so expectation/demand effects are possible; however, two predefined block sequences were counterbalanced, reducing order effects.	Low risk. Primary outcomes are objective (BOLD fMRI + EDA). Subjective flow ratings were compromised by an item error and therefore not used for replication, which reduces risk of biased subjective outcome interpretation (but weakens manipulation-check depth).	Low risk. No dropouts/exclusions are indicated for the replication fMRI sample; analyses are reported for the full replication sample.	Low risk. Clearly framed as a confirmatory replication of pre-specified quadratic (U/inverted-U) effects with replication Bayes factor quantification; results presented for the primary effects and physiological marker.