Frontiers in Digital Health
2026 Feb 26;7:1679603. doi: 10.3389/fdgth.2025.1679603

AI testing, evaluation, verification and validation for accessibility: a comprehensive framework

Gabriella Waters 1,2,*
PMCID: PMC12980396  PMID: 41835950

Abstract

As artificial intelligence (AI) systems become increasingly prevalent in society, ensuring their accessibility for all users, including those with disabilities, is of great importance. This paper presents a comprehensive framework for AI Testing, Evaluation, Verification and Validation (TEVV) focused on accessibility. The proposed methodology incorporates methods for red teaming, model testing, and field testing, with a particular emphasis on usability testing for accessibility. The results demonstrate, through detailed case studies, that systematically evaluating AI systems for accessibility barriers and biases improves the inclusivity and effectiveness of AI technologies for diverse user populations. The findings suggest that this accessibility-focused TEVV framework provides a structured approach for developing more equitable and universally usable AI systems that benefit all members of society.

Keywords: accessibility (for disabled), AI evaluation, AI evaluation framework, AI testing, artificial intelligence

1. Introduction

Artificial intelligence (AI) technologies are rapidly transforming numerous aspects of modern life, from healthcare and education to transportation and communication. As these systems become more deeply integrated into essential services and daily interactions, ensuring they are accessible to all users, including those with disabilities, is not just a matter of inclusivity but a fundamental necessity.

Accessibility in AI refers to the design and development of AI systems that can be effectively used by people with a wide range of abilities and disabilities. This includes considerations for visual, auditory, motor, and cognitive impairments, as well as neurodiversity and situational limitations. Prioritizing accessibility in AI development and deployment allows for the creation of more equitable technologies that empower all individuals to participate fully in an increasingly AI-driven world (1).

However, achieving true accessibility in AI systems presents unique challenges. Unlike traditional software applications, AI models often exhibit complex, dynamic behaviors that can be difficult to predict and evaluate comprehensively (2). Additionally, harmful biases in training data and algorithmic design can lead to discriminatory outcomes that disproportionately affect users with disabilities.

Despite progress in digital accessibility, most AI systems are evaluated using practices that prioritize accuracy and generalizability over inclusivity, which can lead to persistent access barriers for individuals with disabilities. The TEVV framework reconceptualizes evaluation as an iterative, multi-layered process that centers accessibility from project inception through deployment, bridging the gap between regulatory compliance, technical rigor, and user experience with diverse populations.

This is where a robust framework for AI testing, evaluation, verification and validation (TEVV) focused on accessibility becomes necessary. Systematic assessment of AI systems for potential barriers, harmful biases, and usability issues related to accessibility can help identify and address problems early in the development process. This proactive approach not only improves the inclusivity of AI technologies but also enhances their overall effectiveness and user satisfaction across diverse populations.

The TEVV for accessibility framework was developed through an integrative literature synthesis and conceptual modeling approach. Key components were derived from a review of accessibility standards (ISO 9241-210; WCAG 2.1), empirical and meta-analytic findings in AI accessibility and evaluation (1, 2), and best practices identified in human-computer interaction and disability studies. Iterative comparison with published frameworks enabled the refinement of unique metrics, taxonomy structures, and process stages. Expert commentaries and consensus reports from the literature (3, 4) guided the framework's multi-modal and continuous validation orientation. No original human or empirical data collection was conducted; all recommendations, examples, and use-case scenarios are derived or adapted from prior research syntheses and are illustrative of how the framework can be operationalized in real-world contexts (5, 6).

The proposed framework builds upon existing TEVV methodologies while introducing specialized techniques for evaluating accessibility in AI contexts. It incorporates methods for red teaming, which simulates adversarial scenarios to uncover potential accessibility vulnerabilities; model testing to rigorously evaluate AI model performance across diverse user profiles; field testing to assess real-world utility and accessibility in authentic contexts; and usability testing to conduct in-depth evaluations with users who have various disabilities. The framework aims to provide a comprehensive strategy for ensuring that AI systems are truly accessible and inclusive from the ground up through the integration of these approaches.

This paper explores the key components of this accessibility-focused TEVV framework, compares it to current practices, and demonstrates its application through detailed example case studies. Adoption of this systematic approach to accessibility testing in AI development can contribute to the creation of more equitable, effective, and universally usable AI technologies that benefit all members of society.

1.2. Significance and contributions of this work

This paper attempts to address a critical gap in the field of AI development and evaluation by proposing a framework for testing, evaluation, verification, and validation specifically focused on accessibility. The rapid proliferation of AI systems across many domains of daily life, in tandem with the potential of these technologies to exacerbate or alleviate barriers faced by individuals with disabilities, highlights the need for this work.

Current research in AI accessibility has primarily focused on specific applications or types of disabilities and frequently treats accessibility as an add-on rather than a fundamental aspect of AI development. This paper contributes to the field by offering a holistic approach that integrates accessibility considerations throughout the entire AI lifecycle. The combination of methods from red teaming, model testing, field testing, and usability testing provides the proposed framework with a structured methodology for identifying and addressing a wide range of accessibility issues that may arise in AI systems.

While there has been significant work in AI ethics and fairness, there is a noticeable lack of standardized methodologies for comprehensive AI TEVV in the context of accessibility. The framework presented in this paper offers a systematic approach that can be adapted across different AI applications and domains. The framework's emphasis on iterative testing and diverse user involvement addresses the dynamic nature of AI systems and the varied needs of users with disabilities, which are aspects that are often overlooked in traditional AI testing methods.

The paper also makes contributions in the form of specific metrics and evaluation criteria for AI accessibility. These quantifiable measures provide developers and researchers with concrete tools to assess and improve the accessibility of their AI systems and facilitate more objective comparisons across the industry. The example case studies presented demonstrate the practical application and effectiveness of the proposed framework and offer insights into how organizations can implement comprehensive accessibility testing in real-world AI development scenarios. Together, these contributions offer ways to bridge the gap between theory and practical implementation.

By addressing the intersectionality of accessibility with other aspects of AI development, such as harmful bias mitigation and ethical considerations, this work can also contribute to a more nuanced understanding of inclusive AI design. Accessibility testing can uncover and address issues whose resolution benefits all users, not just those with disabilities, making a case for accessibility as a driver of overall AI quality and usability.

Throughout the manuscript, we use several terms with closely related but distinct meanings as outlined in the Glossary of Key Terms below. Consistency in terminology enhances both the internal clarity of the framework and its alignment with ongoing scholarly and normative discourse.

Glossary of key terms

  • Accessibility: The design of AI systems to be usable by people with a range of abilities and disabilities, including sensory, motor, cognitive, and situational limitations (ISO 9241-210; WCAG 2.2).

  • Accessibility Gap: The measurable disparity between the accessibility features, outcomes, or experiences of disabled and non-disabled users in a system (7).

  • Equity: The process of ensuring fair treatment, opportunities, and outcomes for all individuals, with recognition of specific needs or barriers faced by marginalized groups, including people with disabilities (8).

  • Fairness: The absence of bias, discrimination, or favoritism toward any group in AI systems, including ensuring equal opportunity and treatment for people with disabilities (9).

  • Harmful Bias: Systematic error or unfair outcomes produced by an AI system due to data, design, or model artifacts that disadvantage certain groups (2).
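The glossary's notion of an accessibility gap can be made concrete as a simple disparity measure. The sketch below is illustrative only, not from the paper: the session structure, the `completed` field, and the example groups are hypothetical assumptions. It computes the absolute difference in task completion rates between sessions from disabled and non-disabled users.

```python
# Illustrative sketch of an "accessibility gap" metric: the disparity in task
# success rates between disabled and non-disabled users. Field names and data
# are hypothetical, for demonstration only.

def task_success_rate(sessions):
    """Fraction of sessions in which the user completed the task."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if s["completed"]) / len(sessions)

def accessibility_gap(disabled_sessions, non_disabled_sessions):
    """Absolute difference in success rate between the two user groups."""
    return abs(task_success_rate(non_disabled_sessions)
               - task_success_rate(disabled_sessions))

if __name__ == "__main__":
    # Hypothetical usability-test results: 2/4 vs. 3/4 completions.
    disabled = [{"completed": True}, {"completed": False},
                {"completed": False}, {"completed": True}]
    non_disabled = [{"completed": True}, {"completed": True},
                    {"completed": True}, {"completed": False}]
    print(f"accessibility gap: {accessibility_gap(disabled, non_disabled):.2f}")
```

In a real TEVV pipeline this would be computed per task and per disability group, with the gap tracked across releases rather than as a single snapshot.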

2. Literature review

Recent research underscores the inadequacy of conventional, siloed evaluation methods for capturing the lived realities and evolving needs of users with disabilities. Studies from theoretical, empirical, and standards perspectives converge on the need for continuous, multi-modal validation, affirming the TEVV paradigm as an essential evolution in accessibility theory and practice.

2.1. Accessibility testing in AI

Chemnad and Othman's (2) bibliometric analysis and systematic review of 43 articles from 2018 to 2023 on AI applications for digital accessibility reveals a disproportionate focus on visual impairments compared with motor and cognitive disabilities, highlighting gaps in addressing speech/hearing impairments and neurodiverse needs. Their classification framework informs testing protocol development through technical, ethical, and implementation categories, with machine learning algorithms demonstrating the capability to recognize patterns and identify accessibility barriers in digital content through image recognition for alternative text generation and natural language processing for cognitive accessibility support. This work is essential for AI TEVV for accessibility research as it provides a comprehensive taxonomy of accessibility testing approaches and identifies systematic gaps in current testing methodologies that must be addressed in verification and validation frameworks.

Fuglerud et al. (10) developed and evaluated machine learning prototypes for automated WCAG compliance checks through implementing computer vision algorithms for image analysis and natural language processing for content evaluation. Their experimental results demonstrated high precision rates for color contrast violations and alternative text assessment but revealed concerning false positive rates that required human intervention. The study compared automated detection capabilities against manual expert evaluation and found that AI systems excelled at identifying technical violations but struggled with contextual accessibility judgments, such as determining whether decorative images require alternative text descriptions. This evaluation of AI-powered accessibility testing tools directly supports TEVV frameworks by establishing performance benchmarks and identifying validation requirements for automated accessibility assessment systems.
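Automated conformance checks of the kind Fuglerud et al. describe typically begin with deterministic rules before any machine learning is involved. As one hedged illustration (a minimal sketch, not their implementation), the WCAG 2.x relative-luminance and contrast-ratio formulas can be computed directly from foreground and background colors:

```python
# Sketch of a deterministic WCAG color-contrast check. The formulas follow
# WCAG 2.1 (relative luminance and contrast ratio); the API shape here is
# our own illustration, not a specific tool's interface.

def _linearize(channel_8bit):
    """sRGB channel (0-255) to linear-light value per WCAG 2.1."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    r, g, b = (_linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """(L_lighter + 0.05) / (L_darker + 0.05); ranges from 1:1 to 21:1."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)),
                    reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

def passes_aa(fg, bg, large_text=False):
    """WCAG 2.1 SC 1.4.3 (AA): 4.5:1 for normal text, 3:1 for large text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)

print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1))  # 21.0
print(passes_aa((119, 119, 119), (255, 255, 255)))  # mid-gray on white
```

Checks like this correspond to the "technical violations" the study found AI systems handle well; the contextual judgments it found difficult (e.g., whether an image is decorative) have no such closed-form rule, which is why human review remained necessary.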

Guo et al. (3) conducted a multi-round Delphi study with 24 AI ethics experts from academia and industry to identify critical gaps in AI accessibility evaluation. Through three rounds of structured questionnaires and consensus-building exercises, the study revealed three fundamental deficiencies: lack of assistive technology interoperability standards, absence of longitudinal evaluation methods accounting for progressive disabilities, and inadequate adaptive performance metrics that adjust to individual user capabilities (76% agreement). The research established expert consensus on priority areas for AI accessibility research, including the need for standardized testing protocols that can evaluate AI systems across diverse assistive technologies and the development of dynamic assessment frameworks that account for changing user needs over time. This expert consensus methodology provides foundational requirements for TEVV frameworks by establishing evidence-based priorities for comprehensive accessibility validation protocols that must be incorporated into continuous testing systems.

Nwokoye et al. (11) present a systematic survey of Accessible, Explainable Artificial Intelligence (AXAI) that examines how explanations of AI decisions are made accessible, highlighting gaps at the intersection of explainability and disability inclusion similar to those addressed in this work. The authors focus on people with sight loss and other visual challenges and note that most explainable AI techniques continue to rely on visually oriented modalities (e.g., charts, graphical overlays) that often exclude blind and low-vision users unless non-visual or low-vision optimized channels are provided (e.g., haptic or auditory feedback). Their findings align with the present paper's argument that accessibility must be addressed across the full TEVV lifecycle, treating core functionality, model transparency, interpretability, and user-facing explanations as integral parts of accessibility evaluation rather than optional add-ons.

2.2. Harmful bias in AI accessibility

El Morr et al. (8) conducted a systematic scoping review analyzing 64 peer-reviewed studies on AI and disability to examine bias manifestations across AI systems. Their analysis revealed three primary mechanisms of bias amplification: dataset underrepresentation, where training data contained insufficient representation of disabled users; ableist problem framing, where AI development prioritized normative abilities over accessibility needs; and inappropriate evaluation metrics, which failed to account for diverse interaction patterns. The review demonstrated that AI systems consistently performed worse for disabled users across multiple domains, with facial recognition showing higher error rates for wheelchair users and voice recognition exhibiting reduced accuracy for users with speech disabilities. This comprehensive bias analysis provides essential validation requirements for TEVV frameworks by establishing systematic approaches to identify and measure discriminatory outcomes in AI accessibility systems. It establishes verification requirements for harmful bias detection and validation protocols that should be considered for integration into testing frameworks to ensure equitable AI system performance across diverse disability populations.
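A TEVV harness can operationalize the kind of group-disparity findings El Morr et al. report by comparing per-group error rates directly. The sketch below is illustrative only; the group labels, data layout, and numbers are hypothetical assumptions, not drawn from the reviewed studies.

```python
# Illustrative group-disparity check: compute per-group error rates and a
# disparity ratio for a protected group vs. a reference group. Labels and
# data below are hypothetical.
from collections import defaultdict

def error_rates_by_group(records):
    """records: iterable of (group_label, correct: bool) pairs."""
    totals, errors = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        if not correct:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

def disparity_ratio(rates, protected, reference):
    """>1.0 means the protected group experiences more errors."""
    if rates[reference] == 0:
        return float("inf") if rates[protected] > 0 else 1.0
    return rates[protected] / rates[reference]

# Hypothetical speech-recognition outcomes: 3/10 vs. 1/10 errors.
records = (
    [("speech_disability", False)] * 3 + [("speech_disability", True)] * 7
    + [("typical_speech", False)] * 1 + [("typical_speech", True)] * 9
)
rates = error_rates_by_group(records)
print(rates)
print(round(disparity_ratio(rates, "speech_disability", "typical_speech"), 2))
```

A validation gate might fail the build when the ratio exceeds an agreed tolerance; the appropriate threshold is a policy decision, not a technical constant.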

Kane et al. (12) conducted an online survey with 40 adults with physical disabilities across the United States to gather open-ended descriptions about participants' experiences with various sensing systems including motion sensors, biometric sensors, speech input, and touch/gesture systems. Their qualitative analysis used affinity diagramming and identified ten key challenge areas that sensing technologies present for people with physical disabilities: premature timeouts, poor device positioning, being “invisible” to sensors, mismatches between users' abilities and sensors' fidelity for range of motion, variability of users' abilities over time, difficulty setting up sensing systems, biometric failures, security vulnerabilities, incorrect inferences, and data validation problems. The study also revealed four patterns of response that participants used to mitigate these challenges: seeking assistance from others, developing custom adaptations, avoiding sensing technologies entirely, and abandoning technologies altogether. Their findings demonstrated specific accessibility barriers including automatic doors timing out before wheelchair users could pass through, sensor buttons positioned too high or in awkward locations for wheelchair access, motion sensors failing to detect wheelchair users due to mounting height and angle issues, and step-counters generating invalid data when used by wheelchair users rather than pedestrians. This comprehensive mixed-methods evaluation methodology provides validation protocols for TEVV frameworks by demonstrating how to systematically document and categorize accessibility barriers in AI-powered sensing systems through structured user experience research that captures both quantitative prevalence data and qualitative insights into real-world usage challenges.

Trewin et al. (9) conducted a community-centered, design-oriented inquiry into fairness in AI systems as it relates to people with disabilities. The authors presented empirical case studies and analyses drawn from industry, research, and accessibility advocacy. The authors examined multiple real-world AI systems, such as automated hiring tools, facial recognition, and virtual assistants, to illustrate how standard definitions of fairness often overlook or disadvantage users with disabilities. The main issues they identified included the absence of disability-related demographic data in many datasets (limiting bias detection), the inadequacy of fairness metrics that assume static user traits, and the failure of systems to accommodate users with assistive technologies or non-normative interaction patterns. The paper demonstrated how AI systems produce disparate outcomes for users with disabilities like facial recognition systems inaccurately identifying wheelchair users, or job-matching algorithms penalizing non-standard educational timelines associated with disability. The authors proposed a set of actionable considerations for inclusive AI design, including integrating flexibility into AI models, embedding inclusive testing methodologies, and actively involving disabled people in the development pipeline. This work offers essential guidance for integrating multi-dimensional bias detection and mitigation practices into TEVV frameworks, ensuring that AI systems are verified and validated with attention to inclusive fairness and representational completeness.

Whittaker et al. (40) conducted a comprehensive analysis of AI systems' impact on disabled communities through policy review, case study analysis, and stakeholder interviews with disability rights organizations. Their investigation revealed that traditional privacy protection methods like k-anonymity become ineffective for small disability populations, where unique characteristics make individuals easily identifiable even in anonymized datasets. The study documented specific cases where AI systems created disproportionate surveillance risks for disabled users, including biometric authentication systems that required multiple attempts for users with motor disabilities, creating detailed behavioral profiles, and predictive policing algorithms that disproportionately targeted disabled individuals through behavioral pattern analysis. Their analysis identified three critical privacy vulnerabilities: insufficient anonymization for rare disabilities, excessive data collection requirements for accessibility features, and inadequate consent mechanisms for users with cognitive disabilities. This comprehensive privacy impact analysis provides essential ethical validation requirements for TEVV frameworks by establishing protocols for assessing and mitigating privacy risks specific to disabled populations in AI accessibility systems.
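The k-anonymity failure mode Whittaker et al. document can be demonstrated with a minimal check: count how many records share each quasi-identifier combination and flag equivalence classes smaller than k. The field names and records below are hypothetical, chosen only to show how a rare disability value collapses a class to a single, re-identifiable individual.

```python
# Hedged sketch of a k-anonymity audit. A dataset is k-anonymous when every
# combination of quasi-identifier values is shared by at least k records;
# rare disability attributes routinely violate this. Field names hypothetical.
from collections import Counter

def k_anonymity_violations(records, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k records."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in classes.items() if n < k}

records = [
    {"age_band": "30-39", "zip3": "212", "disability": "none"},
    {"age_band": "30-39", "zip3": "212", "disability": "none"},
    {"age_band": "30-39", "zip3": "212", "disability": "none"},
    {"age_band": "30-39", "zip3": "212", "disability": "none"},
    {"age_band": "30-39", "zip3": "212", "disability": "none"},
    # A rare condition forms a singleton class: identifiable even "anonymized".
    {"age_band": "30-39", "zip3": "212", "disability": "rare_motor_condition"},
]
violations = k_anonymity_violations(
    records, ["age_band", "zip3", "disability"], k=5)
print(violations)  # the singleton class with its count of 1
```

This is exactly the structural problem described above: the majority class satisfies k = 5, but the one record with a rare disability attribute cannot be protected by suppression of direct identifiers alone.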

2.3. Accessibility metrics and evaluation criteria

Goggin et al. (1) conducted a comparative policy analysis of digital accessibility frameworks across Australia, the United States, the European Union, and the United Kingdom, examining regulatory approaches, implementation strategies, and compliance mechanisms through document analysis and stakeholder interviews. Their investigation identified three emerging standardization trends: mandatory conformance testing requirements, establishment of certified evaluation bodies for third-party assessment, and transparency reporting obligations requiring public disclosure of accessibility compliance status. The study revealed significant variations in enforcement mechanisms, with some jurisdictions relying on complaint-based systems while others implemented proactive monitoring programs. Their analysis demonstrated that countries with stronger regulatory frameworks achieved higher accessibility compliance rates across government digital services. This comparative framework analysis provides essential standardization guidance for TEVV methodologies by establishing internationally recognized benchmarks and compliance requirements that must be integrated into comprehensive testing and validation protocols. Work on Accessible, Explainable Artificial Intelligence (AXAI) reinforces the need for accessibility-aware metrics by emphasizing that evaluation of AI explanations must account for modality, cognitive effort, and usability for disabled users, not exclusively for their technical faithfulness to underlying models (11).

Kumar et al. (4) proposed a new benchmarking protocol for PDF accessibility, using large-scale automated and LLM-based approaches to demonstrate how harmonized, statistical metrics can standardize accessibility assessment across contexts. Paddison and Englefield (13) and Andreia et al. (14) demonstrated the synergy of expert-driven heuristics and concrete accessibility guidelines, offering evidence that blended evaluation models yield comprehensive outcomes for TEVV implementations.

Morris et al. (5) conducted a mixed-methods investigation into the accessibility and use of Twitter by people who are blind. The study combined an online survey of 132 blind Twitter users with large-scale quantitative analysis of user profiles and tweets, including comparisons to a matched group of sighted users. Survey questions explored motivations, experiences, and barriers associated with Twitter use, as well as specific challenges related to image-based content and profile customization. Their findings can be grouped into five areas:

  • Profile and content customization barriers: many participants cited accessibility challenges, lack of awareness, or the difficulty of confirming visual changes as reasons for retaining default images.

  • Increasing visual content and access gaps: despite the sharp rise in embedded imagery on Twitter and image-based tweets, none of the images included textual descriptions, making them inaccessible to screen reader users and limiting blind users' participation in image-rich interactions and trending content.

  • Distinctive usage patterns: blind Twitter users wrote longer bios, used a more advocacy-oriented set of hashtags, and were substantially less likely to retweet or post images compared to sighted users; their tweet volume per day, however, was statistically similar, underscoring high engagement despite accessibility challenges.

  • Privacy risks and predictive features: logistic regression showed that blind users could be identified with over 90% accuracy based on profile and usage patterns, raising concerns about inadvertent privacy and discrimination risks associated with disability status signals.

  • Community-reported needs: the majority of blind users voiced strong interest in more consistent and meaningful image descriptions; difficulty accessing multimedia content drove some to unfollow accounts or use workarounds, reflecting ongoing social exclusion as Twitter's ecosystem became more visual in nature.

These findings highlight the limitations of traditional usability assessment approaches, which may miss accessibility obstacles specific to evolving content formats and social interactions. The methodology, blending direct user feedback, behavioral data, and quantitative analyses, demonstrates an effective approach for TEVV protocols in AI accessibility, supporting the development of measurement frameworks sensitive to real-world usage barriers, representational differences, and evolving platform practices.

2.4. User-centered approaches

Kane et al. (12) and Sarsenbayeva et al. (6) showed, through survey and co-word analysis, the importance of involving diverse populations in all evaluation phases to surface hidden accessibility failures and iteratively refine AI systems (6, 12). Xiao et al. (15) further advocate for ability-diverse collaboration through systematic, long-term participatory methods (15). Correia et al. (16) confirmed, through comparative testing, that multi-modal interfaces (e.g., voice and touch controls) can mitigate barriers for motor-impaired users, reinforcing the need for user-driven multi-modal testing protocols in TEVV. Additional work on participatory and co-design approaches with disabled people further demonstrates how collaborative methods can surface context-specific accessibility barriers and inform AI design decisions across domains, including health, education, and digital services (27–31).

2.5. Ethical considerations

Korada et al. (17) present a detailed conceptual framework synthesizing artificial intelligence and established accessibility principles to guide the development of inclusive technologies for people with disabilities. The authors performed literature analysis and critical engagement with policy, technical, and social perspectives to emphasize that AI deployments must be grounded in both universal design and rights-based models of disability. The framework articulates the need for adaptable and configurable AI systems: tools that prioritize flexibility to meet the individualized requirements of diverse users across learning, employment, and social domains. The operational guidance offered includes explicit recommendations for collaborative development processes, continuous post-deployment monitoring to detect and address emergent bias, systematic stakeholder engagement, and context-sensitive evaluation metrics that reflect the lived experiences and evolving needs of disabled users. This informs TEVV's operational practices for continuous stakeholder input and bias monitoring, echoing findings from Bobek et al. (18), who demonstrate the impact of user-centric explainability and transparency measures. These approaches establish foundational validation criteria for TEVV protocols: they ensure that ethical review, privacy, fairness, and participatory design are systematically prioritized in all stages of accessible AI system development and operation. Broader ethical and sociotechnical scholarship on AI, disability, and accessibility also highlights the need for continuous monitoring, cross-sector collaboration, and shared accountability mechanisms that extend beyond any single AI system to the wider media and information ecosystem (36–39).

2.6. Regulatory and standards-based foundations

Recent work on accessibility standards demonstrates the central role of formal guidelines in modern AI evaluation. Sharma et al. (19) conducted a large-scale analysis of web content from ophthalmology social media to systematically assess compliance with WCAG 2.1 guidelines using both automated tools and manual inspection. Their findings revealed widespread accessibility gaps, especially in contrast and alt-text quality, despite broad awareness of these standards. This study's dual-method evaluation process illustrates best practices for integrating regulatory metrics into TEVV and highlights the persistent need for human-in-the-loop validation when auditing AI for accessibility. The World Wide Web Consortium (W3C) provides foundational documentation for WCAG 2.0 and 2.1, which establish essential benchmarks for accessibility testing in all relevant digital environments.

ISO standards have recently informed robust evaluation criteria for usability. The ISO 9241-210 (2019) standard defines human-centered design processes and operationalizes usability constructs for interactive systems. It provides prescriptive guidance on user involvement, iterative testing, and context-specific evaluation, which directly supports the structure of TEVV, allowing for systematic incorporation of user feedback and iterative refinement throughout the AI lifecycle.

Nielsen's usability heuristics (Nielsen, 1994) continue to be influential in accessibility testing. Zhang and Walji (41) applied extended heuristic evaluation methods to clinical dashboard design and demonstrated how the use of expert-based heuristics contributed to improved interface accessibility for clinicians with varied expertise. These findings reinforce the importance of usability heuristics in the TEVV framework, both as a means of expert validation and as an actionable checklist for real-world evaluation. Recent analyses of evolving legal requirements and standards for digital and AI accessibility reinforce that future-facing TEVV frameworks must align not only with WCAG but also with emerging national and international regulations governing government and healthcare websites, AI-enabled services, and data governance (32–35).

2.7. TEVV, fairness frameworks, and inclusive AI

Standards-based research is now complemented by more specialized frameworks that target AI risk and fairness. The NIST AI Risk Management Framework (2022) offers a structured approach for identifying and mitigating risks in generative AI systems, including rigorous documentation and evaluation procedures. Its modularity and focus on risk profiles allow for tailored testing strategies that are essential for disability-inclusive TEVV.

Recent medical informatics research also addresses disability equity in AI. Umucu (42) proposes an integrated evaluation framework that includes participatory design, iterative user feedback, and specific fairness metrics tailored to disability contexts. Their findings show that collaborative development processes can significantly reduce harmful bias and exclusion in AI-driven healthcare applications. These methodological innovations are directly relevant for structuring TEVV protocols and emphasize the need to move beyond surface-level compliance toward outcomes-based fairness validation.

2.8. Empirical insights from HCI and disability studies

Longitudinal and user-centered approaches have become vital in rigorous accessibility evaluation. Alonzo and Hasaan (20) reviewed 25 years of HCI research on digital reading for people with disabilities and documented how modality-specific barriers persist in mainstream digital environments. Their systematic review identifies electronic reading formats and assistive technologies as key leverage points for accessibility improvements in AI. This historical breadth provides TEVV frameworks with validated methods for benchmarking progress and structuring long-term evaluation.

Aljedaani (44) analyzed how government websites' compliance with accessibility standards directly influenced user satisfaction and navigation rates. Their study confirmed that systematic adoption of accessibility regulations leads to measurable gains in usability for disabled users, reinforcing the imperative to incorporate international standards into all phases of TEVV. Current scoping reviews of AI and disability research by Umucu et al. (43) offer comprehensive harmful bias and usability evaluations identifying persistent disparities in accuracy, representation, and adaptivity for disabled populations. Their synthesis provides evidence-based priorities for continuous testing and verification, which strengthens the empirical foundations of the accessibility TEVV framework.

Umucu (42) reviewed collaborative work models for people of varying abilities and found that active co-design and longitudinal participation yield significantly better outcomes in AI system usability and satisfaction. Their systematic review provides methodological support for participant recruitment, mixed methods validation, and continuous evaluation in accessibility-focused TEVV.

3. Background and current practices

3.1. The need for accessibility in AI

Ensuring the accessibility of AI is not just a matter of compliance or good practice, but a fundamental necessity for creating an inclusive society [(1); WCAG 2.0; W3C]. AI technologies are being deployed in a variety of ways across all segments of society. These systems are being used in healthcare for diagnosis and treatment planning, in education for personalized learning, in employment for resume screening and candidate selection, and in public services for information access and decision-making. If these tools are not accessible to people with disabilities, it can lead to significant barriers in accessing essential services, employment opportunities, and information.

Inaccessible AI can perpetuate and even amplify existing harmful societal biases and inequalities (21). For example, if speech recognition systems struggle to understand diverse speech patterns, including those associated with certain disabilities, it could lead to exclusion from voice-controlled devices and services. Similarly, if computer vision algorithms are not trained on diverse datasets that include people with various disabilities, they may fail to accurately identify or assist these individuals in applications ranging from autonomous vehicles to security systems (21).

3.2. Accessibility features and universal benefits

Before turning to the specifics of the proposed TEVV framework for AI accessibility, it is important to understand the broader context of accessibility features and their universal benefits. This section provides an overview of key accessibility features across different modalities and illustrates how these features, while essential for users with disabilities, often enhance the experience for all users. Examining the wide-ranging benefits of accessibility improvements allows for a deeper appreciation of the importance of integrating comprehensive accessibility testing into AI development processes. Table 1 outlines various accessibility features, their specific benefits for users with accessibility needs, and the extrapolated universal advantages they offer to all users. The patterns summarized in Table 1 are consistent with prior accessibility and universal design research, which shows that features such as high contrast modes, captions, voice control, and adjustable pacing function as classic “curb cut” effects: they are essential for some users with disabilities, but also measurably improve usability, flexibility, satisfaction, and utility for the broader user population (1, 5, 11).

Table 1.

Outlines various accessibility features, their specific benefits for users with accessibility needs, and the universal advantages they offer to all users.

Modality Accessibility feature Benefits for accessibility needs Universal benefits
Visual High contrast modes Improves readability for users with low vision Enhances visibility in bright environments for all users
Screen reader compatibility Enables navigation for blind users Supports hands-free operation and multitasking
Adjustable text size Aids users with visual impairments Improves readability on various screen sizes
Auditory Closed captions Essential for deaf and hard of hearing users Useful in noisy environments or when audio is unavailable
Transcripts Provides text alternative for audio content Allows quick scanning of content and improves searchability
Volume normalization Helps users with hearing sensitivity Enhances audio clarity in varying environments
Motor Voice control Crucial for users with limited mobility Enables hands-free operation for convenience
Keyboard navigation Essential for users who can't use a mouse Improves efficiency for power users
Customizable input methods Accommodates various physical limitations Allows personalization for comfort and efficiency
Cognitive Clear, simple language Aids users with cognitive disabilities Improves comprehension and reduces cognitive load for all
Consistent layout Helps users with memory or attention difficulties Enhances usability and reduces learning curve
Customizable pacing Accommodates varying processing speeds Allows users to consume content at their preferred pace

How Are Improvements in Accessibility Beneficial to Everyone?

Enhancing accessibility in AI systems and technologies can offer widespread benefits that extend far beyond the specific needs of users with disabilities. These improvements can lead to more versatile, user-friendly, and robust products that benefit all users.

Accessibility features frequently improve the overall usability of a system. For example, clear and simple language designed to aid users with cognitive disabilities also reduces cognitive load for all users, making interfaces more intuitive and easier to navigate. Many accessibility features prove valuable in various situations. Closed captions, originally designed for deaf users, benefit anyone watching video in noisy environments or when they need to keep the volume low. These universal benefits indicate that accessibility features should be treated as core quality attributes from a TEVV perspective, with explicit test conditions and metrics that capture both disability-specific and general usability benefits.

Features like voice control and keyboard navigation, which can benefit users with motor disabilities, offer all users more flexibility in how they interact with technology. Implementation of features like these has the potential to lead to increased efficiency and reduced physical strain. Likewise, designing for screen readers often results in better-structured content, which can potentially improve navigation and comprehension for all users. Well-structured content also tends to enhance search engine optimization and overall information architecture.

Voice recognition technology is another example of a tool that was initially developed to aid users with disabilities. It has since evolved into widely used virtual assistants. Considering accessibility from the outset can cultivate a more inclusive design process that leads to products that are more versatile and accommodating of diverse user needs and preferences.

3.3. Current TEVV practices

Traditional software testing methodologies, while foundational, have significant limitations when applied to AI systems, especially in the context of accessibility (7, 18). These established approaches typically emphasize metrics like accuracy and general system performance (10), and they often fall short in comprehensively addressing the nuanced and dynamic accessibility needs of people with disabilities.

3.4. Why traditional TEVV falls short for accessibility

Surface-Level Compliance: Standard TEVV practices tend to focus on pass/fail checks (e.g., whether alt text is present), but may overlook whether accessibility features are contextually appropriate or provide genuine utility for people with disabilities. Treating accessibility as a post-deployment add-on rather than an integral part of the AI development process can lead to superficial implementations that don't truly address user needs.

Limited Contextual Understanding: Traditional and even some AI-driven tools struggle to interpret the real-world experiences of users with disabilities, such as navigating a website with a screen reader or interacting with assistive technologies.

Lack of Specialized Accessibility Metrics: The lack of specialized accessibility metrics in many existing evaluation frameworks hinders the comprehensive assessment of AI systems' accessibility performance. While these gaps remain, it is difficult to quantify and compare the accessibility features of different AI solutions.

Insufficient Representation: Insufficient representation of people with disabilities in test datasets and user studies leads to significant evaluation gaps. This underrepresentation can result in AI systems that fail to address the unique needs and challenges faced by diverse user groups.

Static Testing Approaches: Traditional testing methods may not capture the dynamic nature of AI systems and how they interact with users with disabilities over time. AI behaviors can evolve and adapt, which necessitates more flexible and ongoing evaluation methods.

3.5. Proposed framework for AI TEVV for accessibility

To address the limitations of current practices and ensure comprehensive accessibility evaluation of AI systems, the following framework has been proposed:

3.5.1. Framework overview

The proposed AI TEVV for accessibility framework consists of four main components:

  • Red Teaming

  • Model Testing

  • Field Testing

  • Usability Testing

At the core of the conceptual framework, visualized in Figure 1, is the recognition that testing, evaluation, verification, and validation are not discrete phases but overlapping, mutually informing activities. By explicitly integrating red teaming, model testing, field testing, and usability testing, and linking each to both specialized accessibility metrics and real-world user feedback, the TEVV model creates a continuous feedback loop. This enables the identification and remediation of harmful biases, usability barriers, and technical failures across diverse scenarios and populations. Ultimately, the model seeks to move accessibility from a compliance checkbox to a catalyst for innovation in AI system design and deployment.

Figure 1.

AI TEVV for Accessibility Conceptual Model diagram depicting a circular process with four sections: Red Teaming, Usability Testing, Model Testing, and Field Testing. Each section includes example metrics such as “Accessibility Exploit Count” and “Inclusive Accuracy Rate.” The center reads “AI TEVV for Accessibility” with arrows indicating iterative feedback and mutual influence.

AI TEVV for accessibility conceptual model. Icons adapted from Adobe Stock: “Set of computer user icons with wi-fi symbols in various styles” by Coosh448, “green checkmark and red x cross icons in circles for yes no approval rejection decision isolated on white background” by Samaia, “Accesibility Icon Collection Glyph & Mixed Style” by Naba and “group glyph icon” by Sentya, licensed under Enhanced License.

The AI TEVV for accessibility framework is grounded in the normative and methodological tenets of universal design, AI ethics, and established technical standards. Universal design, as articulated in ISO/IEC 25010 and referenced in WCAG 2.2 and EN 301 549, requires that systems accommodate diverse users by default, rather than as an add-on consideration. The AI TEVV for accessibility approach operationalizes this mandate by embedding accessibility and harmful bias mitigation within each phase of AI system development (ISO/IEC 25010; WCAG 2.2; EN 301 549). The framework also incorporates guidance from contemporary AI ethics, mandating participatory involvement of disabled stakeholders, transparency, and iterative fairness auditing to align with both European and international accessibility legislation and technical standards (9, 22). This explicit integration provides conceptual coherence to situate the AI TEVV for accessibility framework within the broader movement toward rights-based, ethically sound, and universally inclusive technology systems.

Taxonomies play an important role in structuring and standardizing the complex space of accessibility evaluation in artificial intelligence systems. By systematically categorizing the various types of user needs, interaction modalities, barriers, evaluation methods, and potential system vulnerabilities, the Taxonomy of Accessibility Considerations in AI Systems (Table 2) and the Taxonomy of TEVV Methods for AI Accessibility (Table 3) provide a foundational framework that guides both the design of testing protocols and the interpretation of findings. The goal of organizing accessibility challenges and verification activities into clearly defined categories is to help researchers and practitioners apply rigorous, repeatable assessment strategies, compare outcomes across studies, and target improvements more precisely. The taxonomies, grounded in both empirical studies and established standards, outline the essential dimensions and methods addressed by the TEVV framework, and serve as a blueprint for comprehensive and iterative accessibility evaluation.

Table 2.

Taxonomy of accessibility considerations in AI systems.

Category Subcategory Examples
Sensory Visual, Auditory Screen reader compatibility, color contrast, image descriptions, captioning, audio descriptions, volume control
Motor Input Methods, Physical Interaction Voice control, eye-tracking, switch access, Gesture recognition, touch sensitivity
Cognitive Information Processing, Learning Clear language, adjustable pace, memory aids, adaptive interfaces, customizable difficulty levels
Speech Recognition, Output Diverse accent handling, speech impediment accommodation, natural language generation, speech synthesis quality
Neurodiversity Attention, Sensory Processing Distraction reduction, focus assistance, Adjustable sensory input, overload prevention
Situational Environmental, Temporary Impairments Noise adaptation, lighting adjustment, one-handed operation, simplified interfaces
Table 3.

Taxonomy of TEVV methods for AI accessibility.

Method Focus Techniques Outcomes
Red Teaming Vulnerability Discovery, Bias Uncovering Adversarial testing, edge case simulation, scenario development, systematic probing Identified accessibility exploits and weaknesses, revealed hidden biases against users with disabilities
Model Testing Performance Evaluation, Bias Detection Diverse dataset testing, accessibility-specific metrics, statistical analysis, fairness testing Quantitative assessment of model accessibility, identified and quantified accessibility biases
Field Testing Real-world Performance, Integration Assessment Long-term deployment, environmental testing, assistive technology compatibility testing Insights into real-world accessibility challenges, evaluation of system interoperability
Usability Testing User Experience, Accessibility heuristics Task-based scenarios, think-aloud protocols, expert evaluation, accessibility checklist application Qualitative feedback on accessibility and usability, systematic identification of usability barriers

Each component plays a role in evaluating different aspects of AI accessibility. The framework is designed to be iterative, with insights from each stage informing improvements and further testing in others.

3.6. Red teaming for accessibility

Red teaming in the context of AI accessibility involves simulating adversarial scenarios to uncover potential vulnerabilities or biases that could disproportionately affect users with disabilities. The process helps to identify edge cases and unexpected behaviors that might not be captured in standard testing procedures.

Key Aspects of Red Teaming for Accessibility:

1. Diverse Adversarial Team: Assembling a diverse adversarial team is necessary for effective red teaming in accessibility (12). This team should include experts in accessibility, disability rights advocates, and individuals with various disabilities to provide a comprehensive perspective on potential vulnerabilities.

2. Proxy Scenario Development: Proxy scenario development involves creating realistic situations that challenge the AI system's ability to accommodate different disabilities and assistive technologies (NIST ARIA, Evaluation Plan, 2024). These scenarios should cover a wide range of potential use cases and edge cases.

3. Ethical Considerations: Ethical considerations are critically important in red teaming activities to ensure they don't cause harm or distress to participants. Careful planning and execution of tests, with proper safeguards in place, are required in this domain.

4. Systematic Probing: Systematic probing methodically tests the system's responses to various inputs and interactions that simulate different disabilities and use cases. This structured approach helps uncover hidden biases or accessibility barriers.

5. Documentation and Analysis: Thorough documentation and analysis of findings are essential for identifying patterns of accessibility failures or biases and form the basis for targeted improvements and further testing.
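To make the systematic probing step concrete, the following is a minimal sketch of a probing harness. The `query_model` callable and the scenario catalogue are hypothetical stand-ins for a real AI system under test and a team-authored set of accessibility scenarios; the names and structure are illustrative, not part of the framework itself.

```python
# Sketch of a systematic probing harness for accessibility red teaming.
# `query_model`, ProbeScenario, and the acceptability predicates are
# illustrative assumptions, not a prescribed implementation.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ProbeScenario:
    name: str
    category: str            # e.g. "speech", "motor", "cognitive"
    inputs: list             # simulated user inputs for this scenario
    is_acceptable: Callable  # predicate over the model's output

@dataclass
class ProbeReport:
    failures: list = field(default_factory=list)
    total: int = 0

def run_probes(query_model, scenarios):
    """Run every scenario input through the model and log failures
    so that patterns of accessibility barriers can be analyzed."""
    report = ProbeReport()
    for sc in scenarios:
        for x in sc.inputs:
            report.total += 1
            out = query_model(x)
            if not sc.is_acceptable(out):
                report.failures.append((sc.name, sc.category, x, out))
    return report
```

A red team could group the resulting failure tuples by category to surface the kinds of patterns the documentation-and-analysis step calls for.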

3.7. Model testing for accessibility

Model testing focuses on rigorously evaluating the AI model's performance across diverse user profiles, with a particular emphasis on accessibility considerations (16). This stage involves both quantitative and qualitative assessments of the model's outputs and behaviors.

Key Components of Model Testing for Accessibility:

1. Diverse Test Datasets: Developing and using diverse test datasets that represent a wide range of disabilities, assistive technologies, and user scenarios is necessary for comprehensive model testing to ensure the AI system is evaluated across a broad spectrum of potential users.

2. Accessibility-Specific Metrics: Implementing specialized metrics to measure the model's performance in accessibility-related tasks, such as accuracy in speech recognition for diverse speech patterns, provides quantitative insights into the system's accessibility performance.

3. Harmful Bias Detection: Employing statistical techniques for the detection of harmful bias helps identify potential discriminatory behaviors against users with disabilities in the model's outputs. Subtle biases that might not be apparent through other testing methods can be revealed through these techniques. Table 4 describes statistical methods for accessibility bias detection, and Table 5 outlines analyses that are beneficial for bias detection in accessibility.

4. Robustness Testing: Robustness testing evaluates the model's performance under various conditions that simulate real-world challenges faced by users with disabilities. This includes testing with imperfect inputs or in challenging environmental conditions.

5. Explainability Analysis: Assessing the interpretability of the model's decision-making process for decisions that impact accessibility is essential for building trust and ensuring the system's reasoning aligns with accessibility principles.

Table 4.

Statistical methods for bias detection.

Method Description Key steps
Disparate Impact Analysis Compares outcomes for users with and without disabilities • Calculate selection rate for each group
• Compute ratio of rates (bias indicator if < 0.8)
Demographic Parity Assesses equality of positive outcome probability across groups • Compare positive outcome proportions
• Use chi-square tests for significance
Equal Opportunity Difference Focuses on true positive rates across groups • Calculate true positive rate for each group
• Compare rates using z-tests or t-tests
Average Odds Difference Combines equal opportunity and false positive rates • Compute true and false positive rates for each group
• Average differences for a single bias metric
Theil Index Measures outcome entropy across groups • Calculate entropy of model predictions for each group
• Higher values indicate more unequal distribution
Propensity Score Matching Isolates disability status effect on outcomes • Match users based on relevant features
• Compare outcomes between matched pairs

Table 5.

Beneficial analyses for bias detection.

Analysis type Key steps Description
Intersectional Bias Analysis • Use multivariate regression to assess interaction effects
Examines intersection of disability with other characteristics
Temporal Bias Analysis • Conduct longitudinal studies
• Use time series analysis for trends
Investigates bias changes over time
Feature Importance Analysis • Apply SHAP values
• Identify disproportionate influence of disability features
Determines features contributing to bias
Subgroup Fairness Analysis • Stratify data by disability type/severity
• Compare performance metrics across subgroups
Assesses performance across disability types/severities
Counterfactual Analysis • Create synthetic data by altering disability features
• Analyze prediction changes
Generates “what-if” scenarios
Bias Amplification Analysis • Compare outcome distributions in training data vs. predictions
• Quantify bias increase from data to output
Investigates model amplification of training data biases

The statistical methods in Table 4 draw on widely used group fairness metrics in machine learning, including disparate impact, demographic parity, and equal opportunity, which have been extensively discussed and compared in the fairness metrics literature (23, 24). Theil-index-based disparity measures and propensity score matching have similarly been used to quantify outcome inequality and isolate the effects of protected attributes in applied fairness studies, particularly in health and clinical AI contexts (24).
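The first two methods in Table 4 can be sketched in a few lines of code. The snippet below assumes binary positive/negative outcome labels per user and applies the four-fifths (0.8) threshold noted in the table; the function names and data layout are illustrative assumptions.

```python
# Minimal sketch of disparate impact analysis and demographic parity
# (the first two rows of Table 4), assuming binary outcome labels
# (1 = positive outcome) for each group of users.
def selection_rate(outcomes):
    """Fraction of positive (1) outcomes in a group."""
    return sum(outcomes) / len(outcomes)

def disparate_impact(outcomes_disabled, outcomes_nondisabled):
    """Ratio of selection rates; a value below 0.8 (the four-fifths
    rule) is treated as an indicator of potential bias."""
    ratio = selection_rate(outcomes_disabled) / selection_rate(outcomes_nondisabled)
    return ratio, ratio < 0.8

def demographic_parity_diff(outcomes_a, outcomes_b):
    """Absolute difference in positive-outcome probability across
    groups; a significance test (e.g. chi-square) would follow."""
    return abs(selection_rate(outcomes_a) - selection_rate(outcomes_b))
```

In practice the parity difference would be accompanied by a chi-square or similar significance test, as the table suggests; that step is omitted here to keep the sketch dependency-free.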

The analysis types summarized in Table 5 are consistent with emerging best practices in AI fairness assessment. Intersectional and subgroup fairness analyses have been emphasized as necessary to reveal compounded harms at the intersection of multiple protected attributes (25). Temporal bias analysis addresses concerns that fairness properties can drift over time in deployed systems (24). Counterfactual and causal-inference-based approaches have been used to test whether predictions remain stable when protected attributes are hypothetically altered, which helps to detect and reason about discriminatory decision boundaries (25). Bias amplification analysis, which compares disparities in training data vs. model outputs, has been highlighted as essential for understanding how models may exacerbate existing inequities encoded in source data (24).
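Bias amplification analysis of this kind can be illustrated with a small sketch that compares the between-group outcome gap in the training labels against the same gap in the model's predictions. The group keys and data layout below are assumptions made purely for illustration.

```python
# Hedged sketch of bias amplification analysis (Table 5): compare the
# positive-outcome gap between groups in the training labels with the
# same gap in model predictions. Group keys are illustrative.
def outcome_gap(labels_by_group):
    """Gap in positive-outcome rate between the two groups
    (nondisabled minus disabled)."""
    rate = lambda ys: sum(ys) / len(ys)
    return rate(labels_by_group["nondisabled"]) - rate(labels_by_group["disabled"])

def bias_amplification(train_labels, predictions):
    """Positive values mean the model widened the gap already present
    in the training data; zero means no amplification."""
    return outcome_gap(predictions) - outcome_gap(train_labels)
```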

3.8. Key metrics for AI accessibility

Accessibility metrics are necessary for quantifying an AI system's performance in accommodating users with diverse abilities. These metrics were created to provide objective measures to assess improvements, compare different systems, and ensure that accessibility is consistently evaluated throughout the development process. This emphasis on measurable accessibility outcomes is consistent with recent AXAI scholarship, which argues that explanation quality should be assessed in terms of how accessible, understandable, and cognitively manageable explanations are for users with disabilities, not just for their correctness (11).

The proposed key metrics for AI accessibility have been introduced and outlined below. Table 6 describes the metrics and their proposed target values.

  • Inclusive Accuracy Rate (IAR): This metric measures the AI's accuracy across diverse user groups, including those with disabilities.
IAR=(Correct predictions for users with disabilities/Total predictions for users with disabilities) * 100

  • Accessibility Disparity Index (ADI): This index quantifies the difference in performance between users with and without disabilities.
ADI=(Accuracy for users without disabilities - Accuracy for users with disabilities)/Accuracy for users without disabilities

  • Assistive Technology Compatibility Score (ATCS): This score evaluates how well the AI system integrates with common assistive technologies.
ATCS=(Number of successfully integrated assistive technologies/Total number of tested assistive technologies) * 100

  • Cognitive Load Index (CLI): This metric assesses the cognitive effort required to interact with the AI system, which is particularly important for users with cognitive disabilities.
CLI=Weighted sum of factors like task completion time, error rate, and user-reported difficulty

  • Multimodal Interaction Ratio (MIR): This ratio measures the AI's ability to provide and process information through multiple modalities.
MIR=Number of supported interaction modalities/Total number of possible interaction modalities

  • Accessibility Adaptation Time (AAT): This metric measures how quickly the AI system adapts to individual user needs and preferences.
AAT=Time taken for the AI to reach optimal performance for a user with specific accessibility needs

  • Error Recovery Rate (ERR): This rate assesses how effectively users can recover from errors when interacting with the AI system.
ERR=(Number of successfully recovered errors/Total number of errors) * 100
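Four of the count-based metrics above (IAR, ADI, ATCS, and ERR) translate directly into code. The sketch below implements the stated formulas over simple counts; the function names are chosen for illustration and are not prescribed by the framework.

```python
# Illustrative implementations of four proposed accessibility metrics,
# following the formulas given in the text.
def inclusive_accuracy_rate(correct_disabled, total_disabled):
    """IAR: percentage of correct predictions for users with disabilities."""
    return correct_disabled / total_disabled * 100

def accessibility_disparity_index(acc_nondisabled, acc_disabled):
    """ADI: relative performance gap between user groups (0 = no gap)."""
    return (acc_nondisabled - acc_disabled) / acc_nondisabled

def at_compatibility_score(integrated, tested):
    """ATCS: percentage of tested assistive technologies successfully integrated."""
    return integrated / tested * 100

def error_recovery_rate(recovered, total_errors):
    """ERR: percentage of errors users successfully recovered from."""
    return recovered / total_errors * 100
```

CLI, MIR, and AAT follow the same pattern but depend on study-specific weights, modality inventories, and session logs, so they are omitted from this sketch.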

Table 6.

Calculating the key metrics of accessibility.

Metric Formula Purpose Target value
Inclusive Accuracy Rate (IAR) (Correct predictions for users with disabilities/Total predictions for users with disabilities) * 100 Ensure AI performs well for users with disabilities ≥ 95%
Accessibility Disparity Index (ADI) (Accuracy for users without disabilities - Accuracy for users with disabilities)/Accuracy for users without disabilities Measure performance gap between user groups ≤ 0.05
Assistive Technology Compatibility Score (ATCS) (Successfully integrated ATs/Total tested ATs) * 100 Evaluate integration with assistive technologies ≥ 90%
Cognitive Load Index (CLI) Weighted sum of task time, error rate, and user-reported difficulty Assess ease of use for cognitive accessibility ≤ 3 (on a 1-10 scale)
Multimodal Interaction Ratio (MIR) Supported interaction modalities/Total possible modalities Measure flexibility in interaction methods ≥ 0.8
Accessibility Adaptation Time (AAT) Time to reach optimal performance for users with specific needs Evaluate AI's adaptability to individual requirements ≤ 5 interactions
Error Recovery Rate (ERR) (Successfully recovered errors/Total errors) * 100 Assess system's ability to help users recover from mistakes ≥ 90%

The quantitative metrics used in the TEVV for accessibility framework were derived through a mixed approach: some indices adapt established measures from the human–computer interaction (HCI) and accessibility standards literature, while others are newly defined to address gaps in current evaluation practices.

Metrics such as error recovery rate, task success rate, and cognitive load index are adapted from widely used HCI usability measures and align with international standards such as ISO 9241-210 and the systems/software quality models in ISO/IEC 25010. Additional accessibility-driven measures such as the Inclusive Accuracy Rate (IAR), Accessibility Disparity Index (ADI), and Assistive Technology Compatibility Score (ATCS) were developed based on recommendations and operational definitions provided in sources like WCAG 2.2, EN 301 549, and meta-analytic reviews (1, 2, 4, 7). When devising new indices (e.g., IAR, ADI), formulas were constructed to reflect proportionality and subgroup fairness as outlined in recent accessibility and AI fairness research (9).

Threshold values for these indices were established in two ways:

  • For standard usability/accessibility metrics (e.g., task completion > 80%, error rate < 5%), target values were adopted from benchmarks commonly reported in HCI and accessibility evaluation studies and cross-referenced with threshold guidance in international standards (ISO/IEC 25010; WCAG 2.2; EN 301 549).

  • For newly defined or adapted metrics (e.g., ADI, ATCS), thresholds were initially set to match prevailing legal/regulatory cutoffs for accessibility conformance (e.g., WCAG AA, EN 301 549), and further justified through calibration using published case studies and empirical findings (5, 10).

All metrics and thresholds were reviewed for interpretability, relevance, and feasibility with reference to meta-analytic work and cross-case validation in the literature (2, 8). The resulting suite of indices allows for both standardized and context-sensitive evaluation of AI accessibility to enable benchmarking, iterative improvement, and meaningful comparison across different system implementations and user groups.
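One way to operationalize these thresholds is a small benchmarking helper that compares measured metric values against their targets. The dictionary below transcribes the targets and comparison directions from Table 6; the helper itself is an illustrative sketch, not part of the framework.

```python
# Illustrative threshold checker for the Table 6 targets. Metric
# abbreviations are the paper's; the helper structure is an assumption.
TARGETS = {
    "IAR":  (">=", 95.0),   # percent
    "ADI":  ("<=", 0.05),
    "ATCS": (">=", 90.0),   # percent
    "CLI":  ("<=", 3.0),    # on a 1-10 scale
    "MIR":  (">=", 0.8),
    "AAT":  ("<=", 5.0),    # interactions
    "ERR":  (">=", 90.0),   # percent
}

def check_targets(measured):
    """Return {metric: True/False} indicating whether each measured
    value meets its Table 6 target."""
    results = {}
    for name, value in measured.items():
        op, target = TARGETS[name]
        results[name] = value >= target if op == ">=" else value <= target
    return results
```

In an iterative TEVV process, failing entries from such a report would feed directly into the Assessment and Plan sections of the testing notes described later.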

These metrics provide a comprehensive framework for evaluating AI accessibility. They cover various aspects of accessibility, from overall performance and fairness to specific considerations like assistive technology compatibility and cognitive load. By consistently measuring and optimizing these metrics, developers can create AI systems that are more inclusive and effective for users with diverse abilities.

Incorporating these metrics into the AI development process helps ensure that accessibility is not an afterthought but a fundamental consideration throughout. They provide concrete goals for improvement and allow for objective comparison between different AI systems or versions. These metrics can also help identify specific areas where an AI system may be falling short in terms of accessibility to assist in guiding targeted improvements and resource allocation.

It's important to note that these metrics should be used in conjunction with qualitative feedback and real-world testing to provide a holistic view of an AI system's accessibility. The target values provided are suggestions and may need to be adjusted based on the specific context and requirements of each AI application.

3.9. Field testing

Field testing involves assessing the AI system's real-world usability and accessibility in authentic contexts. This stage provides valuable insights into how the system performs outside of controlled environments.

Key Elements of Field Testing for Accessibility:

1. Diverse Testing Environments: Conducting tests in various real-world settings relevant to the AI system's intended use provides valuable insights into its performance across different environments. Examples might include homes, workplaces, and public spaces.

2. Long-term Evaluation: Long-term evaluation assesses the system's performance and accessibility over extended periods to capture evolving user needs and system behaviors (12). A longitudinal approach reveals how the AI adapts and performs over time.

3. Integration Testing: Integration testing evaluates how the AI system interacts with various assistive technologies and accessibility features of different devices to ensure compatibility and seamless operation with the tools many users rely on.

4. Contextual Factors: Considering environmental factors such as lighting, noise, and connectivity helps identify how these variables impact accessibility. This contextual understanding is needed to make certain the AI functions effectively in diverse real-world conditions.

5. User Feedback Collection: Gathering qualitative feedback from users with disabilities about their experiences with the AI system in real-world contexts provides invaluable insights that may not be captured through quantitative metrics alone.

3.10. Usability testing for accessibility

Usability testing focuses on conducting in-depth evaluations with users who have various disabilities to assess the system’s effectiveness, efficiency, and satisfaction from an accessibility perspective.

Key Components of Usability Testing for Accessibility:

1. Diverse Participant Recruitment: Diverse participant recruitment involves including users with a wide range of disabilities, assistive technology preferences, and skill levels to ensure the usability testing captures a comprehensive range of user experiences.

2. Task-Based Scenarios: Developing realistic tasks that reflect use cases for the AI system, tailored to different disability profiles, helps assess the system's effectiveness in real-world scenarios. These tasks should cover the full spectrum of the AI's intended functionality.

3. Mixed-Methods Approach: A mixed-methods approach combines quantitative metrics like task completion rates and error rates with qualitative feedback from think-aloud protocols and interviews for a holistic view of the system's usability.

4. Accessibility Heuristics: Applying specialized accessibility heuristics to evaluate the AI system's interface and interactions helps identify specific usability issues related to accessibility. These heuristics are designed to uncover common barriers faced by users with disabilities.

5. Iterative Testing: Conducting multiple rounds of usability testing throughout the development process allows for tracking improvements and identifying persistent issues. This iterative approach ensures continuous refinement of the AI system's accessibility features.

3.11. AI TEVV for accessibility notes

The AI TEVV for Accessibility Notes provide a structured method for documenting the accessibility testing process throughout the development lifecycle of AI systems. This format is designed to ensure comprehensive coverage of all aspects of the TEVV framework while maintaining consistency and clarity in documentation. The template is included below:

Template Structure

Project Name:

Date:

Tester Name:

Red Teaming (RT)

Adversarial Team Composition:
Scenarios Tested:
Key Findings:
Ethical Considerations:

Model Testing (MT)

Test Datasets Used:
Accessibility-Specific Metrics:
Bias Detection Results:
Robustness Test Outcomes:
Explainability Analysis:

Field Testing (FT)

Testing Environments:
Duration of Evaluation:
Integration Test Results:
Contextual Factors Observed:
User Feedback Summary:

Usability Testing (UT)

Participant Demographics:
Tasks Evaluated:
Quantitative Metrics:
Qualitative Feedback:
Accessibility Heuristics Applied:

Assessment (A)

Current Accessibility Status:
Key Issues Identified:
Root Cause Analysis:
Comparison to Accessibility Goals:

Plan (P)

Immediate Actions:
Long-term Strategies:
Monitoring and Maintenance Plan:
Next Testing Iteration:

Guidelines for use

  • Complete all sections for each testing iteration.

  • Be specific and detailed in your observations and findings.

  • Use objective language and provide concrete examples where possible.

  • Update the notes regularly throughout the development process.

  • Ensure all team members have access to and understand how to use the notes.

  • Use the Assessment and Plan sections to drive continuous improvement.

Example of Filled-Out Notes

Project Name: VoiceAssist AI

Date: August 15, 2024

Tester Name: Alex Johnson

Red Teaming (RT)

Adversarial Team Composition: Speech therapist, deaf individual, AAC expert
Scenarios Tested: Understanding diverse speech patterns, non-verbal commands
Key Findings: System struggled with certain speech impediments and accents, limited AAC compatibility
Ethical Considerations: Ensured testing didn't cause distress to participants

Model Testing (MT)

Test Datasets Used: Diverse speech recordings including various disorders and accents
Accessibility-Specific Metrics: Speech recognition accuracy for impaired speech
Bias Detection Results: Lower accuracy for certain speech impairments
Explainability Analysis: Difficulty interpreting model's speech recognition decisions

Field Testing (FT)

Testing Environments: 10 homes of users with various disabilities
Duration of Evaluation: 3 months
Integration Test Results: Issues with some popular AAC devices
Contextual Factors Observed: Background noise significantly impacted performance
User Feedback Summary: Frustration with limited visual feedback options

Usability Testing (UT)

Participant Demographics: 15 users with diverse disabilities (deaf, hard of hearing, speech disorders, motor impairments)
Tasks Evaluated: System setup, common commands, error recovery
Quantitative Metrics: Task completion rates, time on task, error rates
Qualitative Feedback: Difficulties in error recovery for non-verbal users
Accessibility Heuristics Applied: WCAG 2.2, Inclusive Design Principles

Assessment (A)

Current Accessibility Status: Significant improvements needed in speech recognition and AAC compatibility
Key Issues Identified: Poor performance with impaired speech, limited visual feedback, difficult error recovery
Root Cause Analysis: Insufficient diverse training data, lack of multimodal interaction options
Comparison to Accessibility Goals: Falls short in speech recognition accuracy and AAC integration targets

Plan (P)

Immediate Actions: Enhance speech recognition model with more diverse data, improve AAC device integration
Long-term Strategies: Develop visual interface, implement adaptive learning for individual users
Monitoring and Maintenance Plan: Monthly performance reviews, quarterly user feedback sessions
Next Testing Iteration: Scheduled for November 15, 2024, focus on speech recognition improvements and new visual interface

Teams using this template can ensure all critical aspects of accessibility testing are consistently documented; easily track progress and improvements over multiple iterations; facilitate clear communication among team members and stakeholders; and maintain a comprehensive record of accessibility considerations throughout the AI development process.

To implement the notes in practice, create a digital version of this template (e.g., in a shared document or specialized software); integrate it into existing development workflows and documentation processes; train team members on how to use and update the template throughout the development lifecycle; and regularly review and discuss the notes as a team to inform decision-making and prioritize accessibility improvements.

3.12. Framework integration

To maximize the effectiveness of the TEVV process, these components should be integrated and inform each other iteratively. For example, insights from red teaming can guide the development of more robust model testing scenarios to ensure that the AI system is evaluated against a comprehensive range of potential accessibility challenges. Field testing observations can inform the design of more realistic usability testing tasks to bridge the gap between controlled testing environments and real-world usage scenarios. Usability testing feedback can help prioritize areas for focused red teaming and model improvements to make certain that development efforts are directed towards the most impactful accessibility enhancements. Systematic application of this framework throughout the AI development lifecycle can support organizations in creating more accessible, inclusive, and effective AI systems.

3.12.1. Comparison with current methods

Table 7 compares the proposed framework with current TEVV practices and illustrates its advantages.

Table 7.

Comparison with current methods.

Aspect Current methods Proposed framework
Focus General performance and fairness Specific accessibility considerations
User Representation Limited diversity in test data Comprehensive inclusion of users with disabilities
Testing Approach Often static and controlled Dynamic, including real-world scenarios
Metrics Generic accuracy and bias metrics Specialized accessibility metrics
Adversarial Testing General security focus Accessibility-specific red teaming
User Involvement Limited, often late in development Extensive, throughout the development process
Contextual Factors Often overlooked Explicitly considered in field testing
Iterative Process Variable Integral to the framework
Explainability General focus Accessibility-specific interpretability

The proposed framework addresses many limitations of current methods by:

  • Explicitly focusing on accessibility as a core consideration throughout the TEVV process.

  • Incorporating diverse perspectives and experiences of users with disabilities, which can lead to more inclusive and representative testing scenarios.

  • Combining controlled testing with real-world evaluation to provide a comprehensive assessment of the AI system's accessibility performance.

  • Developing and applying specialized metrics for accessibility performance to allow for more precise and relevant evaluation of AI systems in terms of their usability for diverse users.

  • Emphasizing the importance of context and long-term usability, recognizing that accessibility needs can change over time and in different environments.

3.13. Example case studies

To demonstrate utility and sufficiency, the TEVV for accessibility framework is illustrated through two applied scenarios drawn and synthesized from published studies and reports. Each scenario operationalizes the proposed methodology by mapping its phases, metrics, and outcomes to documented user experiences, as found in the inclusive design and accessibility evaluation literature (12, 16). These cases are not original empirical studies but serve to clarify how the framework would be implemented in practice and benchmarked against existing results. Comparative assessment tables highlight where TEVV for accessibility addresses limitations and evidence gaps documented in previous frameworks (7, 9, 10).

Example Case Study 1: Voice-Controlled Smart Home Assistant

Background: A technology company developed an AI-powered voice-controlled smart home assistant designed to help users manage various aspects of their home environment, including lighting, temperature, security, and entertainment systems.

Application of the Framework:

Red Teaming:

  • A team including speech therapists, deaf individuals, and experts in augmentative and alternative communication (AAC) devices was assembled.

  • They developed scenarios to test the system's ability to understand diverse speech patterns, respond to non-verbal commands, and integrate with various assistive technologies.

  • The red team uncovered that the system struggled with certain speech impediments and had limited compatibility with popular AAC devices.

Model Testing:

  • Diverse test datasets were created, including recordings of speech from individuals with various speech disorders, accents, and age groups.

  • Specialized metrics were developed to measure the system's accuracy in understanding and responding to diverse speech inputs.

  • Harmful bias detection revealed that the model had lower accuracy rates for users with certain types of speech impairments.
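
One concrete form such a specialized metric can take is word error rate (WER) disaggregated by speaker group. The sketch below, with invented transcripts and group labels, shows how a gap between typical and impaired speech would surface:

```python
def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # DP row: distance to each hyp prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, or substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)

# Invented (group, reference transcript, recognizer output) samples
samples = [
    ("typical_speech",  "turn on the kitchen lights", "turn on the kitchen lights"),
    ("impaired_speech", "turn on the kitchen lights", "turn of kitchen light"),
]
by_group = {}
for group, ref, hyp in samples:
    by_group.setdefault(group, []).append(word_error_rate(ref, hyp))
mean_wer = {g: sum(v) / len(v) for g, v in by_group.items()}
```

Reporting mean WER per group, rather than a single aggregate figure, is what makes the kind of accuracy gap described above measurable at all.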

Field Testing:

  • The system was deployed in homes of users with various disabilities for a three-month period.

  • Researchers observed how environmental factors like background noise and room acoustics affected the system's performance for different users.

  • Long-term testing revealed that the system's performance degraded over time for some users as it failed to adapt to subtle changes in their speech patterns.

Usability Testing:

  • A diverse group of participants, including individuals who are deaf, hard of hearing, have speech disorders, and motor impairments, were recruited.

  • Tasks included setting up the system, performing common commands, and troubleshooting issues.

  • Think-aloud protocols and post-test interviews revealed frustrations with the system's limited visual feedback options and difficulties in error recovery for non-verbal users.

Outcomes: Based on the results of the AI TEVV for accessibility process, the development team obtained information that informed several improvements:

  • Improved the speech recognition model with more diverse training data (21).

  • Implemented better integration with popular AAC devices.

  • Developed a visual interface to complement voice controls.

  • Improved error handling and recovery mechanisms.

  • Implemented a learning algorithm to adapt to individual users' speech patterns over time.

These changes would significantly improve the system's accessibility and usability for a wider range of users.

Example Case Study 2: AI-Powered Job Application Screening System

Background: A large corporation implemented an AI system to screen job applications and identify promising candidates for human review. The system analyzed resumes, cover letters, and online assessments to rank applicants.

Application of the Framework:

Red Teaming:

  • The red team included disability rights lawyers, career counselors specializing in placing candidates with disabilities, and individuals with various disabilities who had experience with job searching.

  • They created scenarios to test how the system handled applications from candidates with disability-related employment gaps, alternative qualifications, and those requiring accommodations.

  • The team uncovered that the system was inadvertently penalizing candidates with non-traditional career paths often associated with certain disabilities.

Model Testing:

  • Test datasets were created that included a diverse range of resumes and application materials from candidates with various disabilities.

  • Metrics were developed to measure the system's fairness in ranking candidates with and without disabilities who had equivalent qualifications.

  • Harmful bias detection revealed that the model was less likely to highly rank candidates who disclosed certain disabilities or who had used disability-specific terms in their applications (9).
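
One widely used metric of this kind is the adverse impact ratio (the "four-fifths rule" from US employment-selection guidance): each group's screening pass rate divided by the highest group's rate. The field names and outcome data in this sketch are invented for illustration:

```python
def adverse_impact_ratio(outcomes, group_key):
    """Each group's selection rate relative to the best-performing group.

    `outcomes` holds dicts with hypothetical fields: group_key and
    'advanced' (True if the candidate passed the AI screen). Ratios
    below 0.8 conventionally flag possible adverse impact.
    """
    counts = {}
    for o in outcomes:
        passed, total = counts.get(o[group_key], (0, 0))
        counts[o[group_key]] = (passed + o["advanced"], total + 1)
    rates = {g: p / t for g, (p, t) in counts.items()}
    top = max(rates.values())
    return {g: r / top for g, r in rates.items()}

# Invented screening outcomes: 2/5 vs. 4/5 candidates advanced
outcomes = (
    [{"disclosed_disability": "yes", "advanced": i < 2} for i in range(5)]
    + [{"disclosed_disability": "no", "advanced": i < 4} for i in range(5)]
)
ratios = adverse_impact_ratio(outcomes, "disclosed_disability")
```

In this toy data the disclosing group's ratio of 0.5 falls well below the 0.8 threshold, mirroring the disparity that harmful bias detection uncovered.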

Field Testing:

  • The system was deployed in a controlled manner for actual job openings across several departments.

  • Researchers compared the AI system's rankings with those of human recruiters who were trained in inclusive hiring practices.

  • Long-term tracking of hired candidates' performance revealed no significant differences between those highly ranked by the AI system and those who had been potentially undervalued but hired through human intervention (12).

Usability Testing:

  • A diverse group of job seekers with various disabilities were recruited to go through a simulated application process using the AI system.

  • Tasks included uploading resumes, completing online assessments, and navigating the application portal.

  • Usability testing revealed significant barriers for users with visual impairments in completing certain types of online assessments, and difficulties for users with cognitive disabilities in understanding complex application instructions.

Outcomes: The results of the AI TEVV for accessibility process led to several important changes in the AI screening system:

  • Retraining of the model with more diverse datasets and adjusted weighting of factors to reduce harmful bias against non-traditional career paths.

  • Implementation of features to allow candidates to provide context for employment gaps or alternative qualifications.

  • Redesign of online assessments to be more accessible and offer multiple formats for demonstrating skills (17).

  • Improvement of the application portal's accessibility features, including better screen reader compatibility and simplified instructions.

  • Introduction of a human-in-the-loop process for reviewing applications flagged as potentially undervalued by the AI system due to disability-related factors.

These modifications could result in a more equitable screening process, increased diversity in the candidate pool, and improved overall satisfaction with the application experience for all users.

4. Discussion

The example case studies demonstrate the effectiveness of the proposed AI TEVV for accessibility framework in uncovering and addressing critical issues that might have been overlooked using traditional testing methods. Systematic application of red teaming, model testing, field testing, and usability testing with a focus on accessibility can help organizations develop AI systems that are truly inclusive and effective for diverse user populations.

While the AI TEVV for accessibility framework is designed for comprehensive, lifecycle-wide evaluation, its modular structure enables practical adaptation for small and medium-sized organizations (SMEs). Each TEVV component, such as red teaming or field testing, can be implemented incrementally depending on available resources, with open-source toolkits and template guides provided for basic evaluations. Table 8 summarizes recommended minimal and optimal practice sets for organizations of varying sizes, emphasizing methods compatible with limited budgets and staff. Partnerships with disability organizations, use of synthetic or literature-sourced case studies, and phased metric adoption further facilitate scalable implementation (6, 26).

Table 8.

AI TEVV for accessibility framework adaptation for varying organization sizes.

Organization size Minimal required practices Scalable enhancements
Small Automated accessibility checkers; heuristic review; user feedback from community orgs Add contextual field tests; partner with local advocacy groups
Medium All small org. practices+participatory usability tests, red teaming with internal/external volunteers Full metrics suite; cross-department training
Large All above+longitudinal evaluation, expert/facilitated red teaming, in-depth metrics and integration testing Continuous evaluation cycle, external audit, benchmark reporting

Existing frameworks [including those by (1, 10)] mainly treat accessibility as a compliance feature or post hoc evaluation. In contrast, the proposed TEVV framework centers accessibility as an iterative design and engineering concern. Evidence for this gap is supplied through examples in Table 9 below.

Table 9.

Contrasting examples of existing frameworks with AI TEVV for accessibility.

Feature/approach Existing frameworks (1, 10) Proposed AI TEVV for accessibility framework Supporting evidence/citations
Accessibility focus Compliance, post hoc or checklist-based evaluation Continuous, iterative, and central in design/development (2, 7, 8)
Integration across lifecycle Most apply accessibility after main development or as secondary consideration Accessibility embedded as an ongoing, dynamic process (1, 4)
Evaluation methods Emphasis on static audits, technical guidelines, or automated tools Multi-modal: red teaming, usability, model, and field testing (9, 10)
User participation Limited or delayed, typically in summative/usability testing Diverse user involvement in all phases (design, metrics, validation) (5, 12)
Bias, fairness, and context sensitivity Under-addressed or unreported in typical frameworks Explicitly targets bias detection, contextual factors, and fairness (68)
Adaptability and real-world readiness Often assumes static requirements and environments Iterative, designed for adaptive deployment in evolving contexts (2, 3)
Validation benchmarks Rarely benchmarked with international standards or continuous improvement Benchmarked against ISO, NIST, and WCAG, with clear improvement cycles (4); National Institute of Standards and Technology, 2022

4.1. Key insights

Intersectionality of Accessibility Concerns

The framework reveals how accessibility issues often intersect with other forms of harmful bias and usability concerns, highlighting the need for a holistic approach to AI development that considers multiple dimensions of diversity and inclusion.

Importance of Diverse Perspectives

Involving individuals with various disabilities throughout the TEVV process uncovers issues that might not be apparent to developers or traditional testers, underscoring the importance of diverse perspectives in AI development.

Dynamic Nature of Accessibility

The field-testing component highlights how accessibility needs can change over time and in different contexts and emphasizes the need for AI systems to be adaptable and continuously evaluated.

Complementary Nature of Framework Components

Each component of the framework provides unique insights that inform and enhance the others, which can lead to more comprehensive improvements than any single testing method alone.

Broader Benefits of Accessibility Focus

Many of the improvements made to enhance accessibility also benefit users without disabilities, which can lead to overall better user experiences and more robust AI systems.

4.2. Challenges in implementation

While the proposed framework offers significant benefits, several challenges may arise in its implementation.

Resource Intensity: The comprehensive accessibility testing required by this framework demands significant time, expertise, and financial resources, which may be challenging for smaller organizations or projects with limited budgets.

Recruitment of Diverse Participants: Finding and engaging a truly diverse group of individuals with disabilities for testing can be challenging, especially for specialized AI applications that may require participants with specific experiences or skills.

Balancing Competing Priorities: Organizations may struggle to balance accessibility improvements with other development priorities, especially when facing time or budget constraints. This requires careful project management and a strong commitment to accessibility.

Evolving Accessibility Standards: As technology and societal understanding of disability evolve, accessibility standards and best practices may change, necessitating ongoing updates to the TEVV process to remain current and effective.

Ethical Considerations: Involving individuals with disabilities in testing, especially in red teaming exercises, requires careful ethical consideration to ensure participants are not exploited or exposed to undue stress (40). This requires robust ethical guidelines and oversight.

Addressing intersectionality in the TEVV framework involves collecting and analyzing data that reflects multiple, interacting demographic and experience variables, such as disability status, gender, linguistic background, socioeconomic class, and age. During model and field testing, evaluation datasets should be intentionally stratified (or oversampled) for intersectional subgroups, and statistical techniques like multivariate regression and subgroup fairness metrics (e.g., conditional parity, intersectional bias indices) should be applied to analyze interactions (8, 9). Example: fairness analyses examining “accuracy for non-native speakers with vision impairment” vs. “accuracy for other users” can surface compound disparities overlooked by single-category tests. Researchers should report intersectional statistics, monitor adverse impact ratios for subgroups, and adjust red teaming and usability protocols to ensure scenarios reflecting these cross-cutting identities are developed and tested (2, 7). Additional granularity can be achieved via participatory feedback and qualitative think-aloud protocols targeting intersectional user experiences. Table 10 provides examples of these approaches.

Table 10.

Intersectional measurement examples.

Variable 1 Variable 2 Example interaction Measurement approach
Disability Gender Differential effect of accessibility features Cross-tabulation; subgroup regression
Disability Linguistic status Voice UI bias against non-native speakers with impairment Disaggregated accuracy by subgroup
Disability Socioeconomic class Access to assistive tech/modality preference Survey/qualitative stratified analysis
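
Disaggregated accuracy over interacting variables can be sketched as follows. The field names, groups, and counts are invented for illustration; each record carries the demographic keys plus whether the model's output was correct, and accuracy is reported per cell of the cross-product rather than per single category:

```python
from collections import defaultdict

def subgroup_accuracy(records, keys):
    """Model accuracy per cell of the cross-product of demographic keys."""
    tallies = defaultdict(lambda: [0, 0])  # cell -> [correct, total]
    for r in records:
        cell = tuple(r[k] for k in keys)
        tallies[cell][0] += r["correct"]
        tallies[cell][1] += 1
    return {cell: c / n for cell, (c, n) in tallies.items()}

def make(n_correct, n_total, vision_impairment, native_speaker):
    """Generate illustrative records with a fixed per-cell accuracy."""
    return [{"vision_impairment": vision_impairment,
             "native_speaker": native_speaker,
             "correct": i < n_correct} for i in range(n_total)]

# Marginal (single-category) accuracies look tolerable here,
# but the intersection does not
records = (make(18, 20, True, True) + make(1, 4, True, False)
           + make(45, 50, False, True) + make(18, 20, False, False))
acc = subgroup_accuracy(records, ["vision_impairment", "native_speaker"])
```

In this invented data, the marginal accuracy for vision-impaired users and for non-native speakers is each about 0.79, a modest gap against 0.9 elsewhere, yet the (vision-impaired, non-native) cell sits at 0.25: exactly the compound disparity that single-category reporting hides.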

4.3. Future directions

The proposed framework opens up several avenues for future research and development.

1. Automated Accessibility Testing: Developing AI-powered tools that can automatically detect certain types of accessibility issues in AI systems could complement human-led testing efforts and potentially increase the efficiency and coverage of accessibility evaluations (3).

2. Standardized Accessibility Metrics: Creating and/or adopting industry-wide standardized metrics for evaluating AI accessibility would allow for better comparison and benchmarking of different systems and promote transparency while driving improvements across the field.

3. Adaptive AI for Accessibility: Exploring how AI systems can dynamically adapt to individual users' accessibility needs over time, informed by insights from long-term field testing, could lead to more personalized and effective accessibility solutions.

4. Intersectional Analysis: Developing more sophisticated methods for analyzing how accessibility interacts with other dimensions of diversity and inclusion in AI systems could provide a more nuanced understanding of the complex dynamics between various user characteristics.

5. Regulatory Frameworks: Investigating how this TEVV framework could inform the development of regulatory standards for AI accessibility could help establish clear guidelines and expectations for accessible AI development across industries.

6. Educational Integration: Incorporating accessibility-focused TEVV methodologies into AI and computer science curricula to cultivate a new generation of accessibility-minded AI developers creates opportunities for innovation with methods for enhancing accessibility that are “baked in” and not “bolted on.”

While the proposed framework synthesizes the strongest elements of AI accessibility evaluation research, its validation is limited to theoretical modeling and adaptation from the literature. Because no new empirical data were collected, the real-world generalizability of its implementation and user experience remains to be tested in future work (6, 18). The framework's novelty rests in its explicit, multi-phase integration and in the demonstration, by comparative analysis, that prior models primarily treat accessibility as an adjunct or afterthought rather than as a central, iterative driver of system design (1, 2, 4).

5. Conclusion

The proposed framework for AI Testing, Evaluation, Verification, and Validation (TEVV) focused on accessibility represents a significant step forward in ensuring that AI systems are truly inclusive and effective for all users. This approach integrates specialized methods for red teaming, model testing, field testing, and usability testing, to provide a comprehensive strategy for identifying and addressing accessibility barriers in AI technologies.

The example case studies demonstrate the framework's effectiveness in uncovering critical issues that might have been missed by traditional testing methods. They also highlight how addressing accessibility concerns can lead to overall improvements in AI system performance and user satisfaction for all users, not just those with disabilities.

As AI continues to play an increasingly central role in our society, the importance of accessibility in these systems cannot be overstated. Inaccessible AI risks perpetuating and even amplifying existing inequalities, while accessible AI has the potential to dramatically improve the lives of people with disabilities and create more inclusive digital and physical environments for everyone.

The challenges in implementing this framework are not insignificant, and require substantial resources, expertise, and commitment from organizations developing AI systems. However, the potential benefits—both ethical and practical—make this investment worthwhile. By adopting this accessibility-focused AI TEVV framework, organizations can:

  • Develop more robust and versatile AI systems.

  • Expand their potential user base and market reach.

  • Mitigate legal and reputational risks associated with inaccessible technologies.

  • Contribute to a more inclusive and equitable technological landscape.

As we move forward, continued research and collaboration between AI developers, accessibility experts, and individuals with disabilities will be essential in refining and expanding this framework. Prioritizing accessibility in AI development and rigorously testing for it, provides a path towards a future where AI technologies empower and include all members of society, regardless of ability.

The journey towards fully accessible AI is ongoing, but with frameworks like the one proposed here, the tools to make significant strides in the right direction may be within reach. It is incumbent upon all stakeholders in the AI ecosystem—from developers and researchers to policymakers and end-users—to advocate for and implement robust accessibility testing in AI development to ensure that the transformative potential of AI is realized equitably for all.

Funding Statement

The author(s) declared that financial support was not received for this work and/or its publication.

Footnotes

Edited by: Daisuke Akiba, The City University of New York, United States

Reviewed by: Silvio Marcello Pagliara, University of Cagliari, Italy

Mario Truss, Adobe Systems, United States

Chukwunonso Henry Nwokoye, Ontario Tech University, Canada

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author/s.

Ethics statement

No human studies are presented in the article.

Author contributions

GW: Conceptualization, Writing – original draft, Writing – review & editing.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was used in the creation of this manuscript. Perplexity AI (https://www.perplexity.ai/) was used to locate references.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

  • 1.Goggin G, Hollier S, Hawkins W. Internet accessibility and disability policy: lessons for digital inclusion and equality from Australia. Internet Policy Review. (2019) 6(1):1–18. 10.14763/2017.1.452 [DOI] [Google Scholar]
  • 2.Chemnad K, Othman A. Digital accessibility in the era of artificial intelligence—bibliometric analysis and systematic review. Fron Artif Intell. (2024) 7:1349668. 10.3389/frai.2024.1349668 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Guo A, Kamar E, Vaughan JW, Wallach H, Morris MR. Toward fairness in AI for people with disabilities SBG@a research roadmap. ACM SIGACCESS Accessibil Comput. (2020) 125:1. 10.1145/3386296.3386298 [DOI] [Google Scholar]
  • 4.Kumar A, Padath T, Wang LL. Benchmarking PDF accessibility evaluation: a dataset and framework for assessing automated and LLM-based approaches for accessibility testing. ASSETS. (2025) 2025:1–24. 10.1145/3663547.3746380 [DOI] [Google Scholar]
  • 5.Morris MR, Zolyomi A, Yao C, Bahram S, Bigham JP, Kane SK. “With most of it being pictures now, I rarely use it”: understanding twitter’s evolving accessibility to blind users. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (2020). p. 5506–16. 10.1145/2858036.2858116 [DOI] [Google Scholar]
  • 6.Sarsenbayeva Z, Van Berkel N, Hettiachchi D, Tag B, Velloso E, Goncalves J, et al. Mapping 20 years of accessibility research in HCI: a co-word analysis. Int J Hum Comput Stud. (2023) 175:103018. 10.1016/j.ijhcs.2023.103018 [DOI] [Google Scholar]
  • 7.Ara J, Sik-Lanyi C, Kelemen A. Accessibility engineering in web evaluation process: a systematic literature review. Univ Access Inform Soc. (2023) 23:1–34. Advance online publication. 10.1007/s10209-023-00967-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.El Morr C, Singh D, Sawhney V, Fernandes S, El-Lahib Y, Gorman R. Exploring the intersection of AI and inclusive design for people with disabilities. Stud Health Technol Inform. (2024) 316:556–9. 10.3233/SHTI240475 [DOI] [PubMed] [Google Scholar]
  • 9.Trewin S, Basson S, Muller M, Branham S, Treviranus J, Gruen D, et al. Considerations for AI fairness for people with disabilities. AI Matters. (2019) 5(3):40–63. 10.1145/3362077.3362086 [DOI] [Google Scholar]
  • 10.Fuglerud KS, Halbach T, Utseth I, Waldeland AU. Exploring the use of AI for enhanced accessibility testing of web solutions. Stud Health Technol Inform. (2024) 320:453–60. 10.3233/SHTI241041 [DOI] [PubMed] [Google Scholar]
  • 11.Nwokoye CH, Peixoto MJP, Pandey L, Sikhai M, Lewis PR. A survey of accessible explainable artificial intelligence research (arXiv:2407.17484). arXiv. (2024). Available online at: https://arxiv.org/abs/2407.17484 (Accessed July 07, 2025).
  • 12.Kane S, Guo A, Morris M. Sense and accessibility: understanding people with physical disabilities’ experience with sensing systems. ASSETS ‘20 Proceedings of the 22nd International ACM SIGACCESS Conference on Computers and Accessibility (2020). 10.1145/3373625.3416990 [DOI] [Google Scholar]
  • 13.Paddison C, Englefield P. Applying heuristics to accessibility inspections. Interact Comput. (2004) 16(3):507–21. 10.1016/j.intcom.2004.04.007 [DOI] [Google Scholar]
  • 14.Casare AR, da Silva CG, Martins PS, Moraes RLO. Usability heuristics and accessibility guidelines: a comparison of heuristic evaluation and WCAG. Proceedings of the 31st Annual ACM Symposium on Applied Computing (Pisa, Italy) (SAC '16); New York, NY, USA: Association for Computing Machinery; (2016). p. 213–5. 10.1145/2851613.285191 [DOI] [Google Scholar]
  • 15.Xiao L, Bandukda M, Angerbauer K, Lin W, Bhatnagar T, Sedlmair M, et al. A systematic review of ability-diverse collaboration through ability-based Lens in HCI. CHI. (2024) 2024:1–21. 10.1145/3613904.3641930 [DOI] [Google Scholar]
  • 16.Correia W, Macedo J, Penha M, Quintino J, Pellegrini F, Anjos M, et al. Virtual assistants: an accessibility assessment in virtual assistants for people with motor disability on mobile devices. Adv Intell Syst Comput. (2019) 972:239–49. 10.1007/978-3-030-19135-1_24 [DOI] [Google Scholar]
  • 17.Korada L. AI & accessibility: a conceptual framework for inclusive technology. Int J Intell Syst Appl Engin. (2024) 12(1):89–103. 10.18201/ijisae.2024.12.1.7094 [DOI] [Google Scholar]
  • 18.Bobek S, Korycińska P, Krakowska M, Mozolewski M, Rak D, Zych M, et al. User-centric evaluation of explainability of AI with and for humans: a comprehensive empirical study. Int J Hum Comput Stud. (2025) 205:103625. 10.1016/j.ijhcs.2025.103625 [DOI] [Google Scholar]
  • 19.Sharma M, Huertas L, Shah S, Gil A, Bitrian E, Chang TC. Compliance with web content accessibility guidelines in ophthalmology social media posts. Sci Rep. (2024) 14(1):9142. 10.1038/s41598-024-59838-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Alonzo O, Hassan SL. A review of 25 years of human-computer interaction research on digital reading for people with disabilities. ASSETS. (2025) 2025:1–21. 10.1145/3663547.3746355 [DOI] [Google Scholar]
  • 21.Morr CE, Kundi B, Mobeen F, Taleghani S, El-Lahib Y, Gorman R. AI and disability: a systematic scoping review. Health Informatics J. (2024) 30(3):14604582241285743. 10.1177/14604582241285743 [DOI] [PubMed] [Google Scholar]
  • 22.National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1 (2023). 10.6028/NIST.AI.100-1 [DOI]
  • 23.Castelnovo A, Crupi R, Greco G, Regoli D, Penco I, Cosentini A. A clarification of the nuances in the fairness metrics landscape. Sci Rep. (2022) 12:4209. 10.1038/s41598-022-07939-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Chen R, Wang J, Williamson D, Chen T, Lipkova J, Lu M, et al. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat. Biomed. Eng. (2023) 7:719–42. 10.1038/s41551-023-01056-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Liu M, Ning Y, Teixayavong S, Liu X, Mertens M, Shang Y, et al. A scoping review and evidence gap analysis of clinical AI fairness. NPJ Digit Med. (2025) 8:360. 10.1038/s41746-025-01667-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Silvera-Tawil D, Higgins L, Packer K, Bayor AA, Walker JG, Li J, et al. AI-enabled AT framework: a principles-based approach to emerging assistive technology. Disabil Rehabil Assist Technol. (2025) 20:1–20. 10.1080/17483107.2025.2479838 [DOI] [PubMed] [Google Scholar]
  • 27.Ara J, Sik-Lanyi C, Kelemen A. An inclusive framework for automated web content accessibility evaluation. Univ Access Inf Soc. (2025) 24:1581–607. 10.1007/s10209-024-01164-5 [DOI] [Google Scholar]
  • 28.Doush IA, Kassem R. Evaluating AI-generated web code for accessibility compliance: a metric-driven approach. DSAI. (2024) 2024:338–44. 10.1145/3696593.3696614 [DOI] [Google Scholar]
  • 29.Droutsas N, Spyridonis F, Daylamani-Zad D, Ghinea G. Web accessibility barriers and their cross-disability impact in eSystems: a scoping review. Comput Stand Interfaces. (2025) 92:103923. 10.1016/j.csi.2024.103923 [DOI] [Google Scholar]
  • 30.European Telecommunications Standards Institute. Accessibility requirements for ICT products and services (EN 301 549 V3.2.1) (2023). Available online at: https://www.etsi.org/deliver/etsi_en/301500_301599/301549/03.02.01_60/en_301549v030201p.pdf (Accessed September 05, 2023).
  • 31.Fathallah N, Hernández D, Staab S. Accessguru: leveraging LLMs to detect and correct web accessibility violations in HTML code. ASSETS. (2025) 2025:1–22. 10.1145/3663547.3746360 [DOI] [Google Scholar]
  • 32.Gunasekeran DV, Zheng F, Lim GYS, Chong CCY, Zhang S, Ng WY, et al. Acceptance and perception of artificial intelligence usability in eye care (APPRAISE) for ophthalmologists: a multinational perspective. Front Med (Lausanne). (2022) 9:875242. 10.3389/fmed.2022.875242 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.International Organization for Standardization. Ergonomics of human-system interaction—Part 210: Human-centred design for interactive systems (ISO 9241-210:2019) (2019). Available online at: https://www.iso.org/standard/77520.html (Accessed July 20, 2023).
  • 34.International Organization for Standardization/International Electrotechnical Commission. Systems and software engineering—Systems and software quality requirements and evaluation (SQuaRE)—System and software quality models (ISO/IEC 25010:2011) (2011). Available online at: https://www.iso.org/standard/35733.html (Accessed July 20, 2023).
  • 35.National Institute of Standards and Technology. Assessing Risks and Impacts of AI (ARIA) Evaluation Plan (2024). Available online at: https://ai-challenges.nist.gov/aria/docs/evaluation_plan.pdf (Accessed August 20, 2024).
  • 36.Obidat AH. Bibliometric analysis of global scientific literature on the accessibility of an integrated E-learning model for students with disabilities. Contemp Educ Technol. (2022) 14(3):ep374. 10.30935/cedtech/12064 [DOI] [Google Scholar]
  • 37.Strobel L, Gerling K. HCI, disability and sport: a literature review. ACM Trans Comput Hum Interact. (2025) 32(3):1–41. 10.1145/3716136 [DOI] [Google Scholar]
  • 38.Teixeira P, Teixeira L, Eusébio C. Evaluating both usability and accessibility: the case of access@tour by action—a digital solution for accessible tourism. Disabil Rehabil Assist Technol. (2025) 20(4):1159–75. 10.1080/17483107.2024.2409810 [DOI] [PubMed] [Google Scholar]
  • 39.World Wide Web Consortium. Web Content Accessibility Guidelines (WCAG) 2.2 (2023). Available online at: https://www.w3.org/TR/WCAG22/ (Accessed July 20, 2023). [Google Scholar]
  • 40.Whittaker M, Alper M, Bennett CL, Hendren S, Kaziunas L, Mills M, et al. Disability, bias, and AI. AI Now Institute. (2019). [Google Scholar]
  • 41.Zhang J, Walji MF. TURF: toward a unified framework of EHR usability. J Biomed Inform. (2011) 44(6):1056–67. 10.1016/j.jbi.2011.08.005 [DOI] [PubMed] [Google Scholar]
  • 42.Umucu E. Artificial intelligence and health equity for people with disabilities: an integrated framework for disability-inclusive AI design. Inquiry-J Health Car. (2025) 62:469580251365472. 10.1177/00469580251365472 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Umucu E, Vernon AA, Pan D, Qin S, Solis G, Campa R, Lee B. Health inequities among persons with disabilities: a global scoping review. Front Public Health. (2025) 13:1538519. 10.3389/fpubh.2025.1538519 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Aljedaani W. Evaluating the accessibility of E-government websites in the United States: empirical evidence and insights. Univ Access Inf Soc. (2025) 24:2845–66. 10.1007/s10209-025-01229-z [DOI] [Google Scholar]

Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author/s.


Articles from Frontiers in Digital Health are provided here courtesy of Frontiers Media SA