Beyond Visual Perception: Insights from Smartphone Interaction of Visually Impaired Users with Large Multimodal Models

JINGYI XIE; RUI YU; HE ZHANG; SYED MASUM BILLAH; SOOYEON LEE; JOHN M CARROLL

doi:10.1145/3706598.3714210

Author manuscript; available in PMC: 2025 Aug 11.

Published in final edited form as: Proc SIGCHI Conf Hum Factor Comput Syst. 2025 Apr 25;25:62. doi: 10.1145/3706598.3714210


PMCID: PMC12338113 NIHMSID: NIHMS2046489 PMID: 40792292

Abstract

Large multimodal models (LMMs) have enabled new AI-powered applications that help people with visual impairments (PVI) receive natural language descriptions of their surroundings through audible text. We investigated how this emerging paradigm of visual assistance transforms how PVI perform and manage their daily tasks. Moving beyond basic usability assessments, we examined both the capabilities and limitations of LMM-based tools in personal and social contexts, while exploring design implications for their future development. Through interviews with 14 visually impaired users and analysis of image descriptions from both participants and social media using Be My AI (an LMM-based application), we identified two key limitations. First, these systems’ context awareness suffers from hallucinations and misinterpretations of social contexts, styles, and human identities. Second, their intent-oriented capabilities often fail to grasp and act on users’ intentions. Based on these findings, we propose design strategies for improving both human-AI and AI-AI interactions, contributing to the development of more effective, interactive, and personalized assistive technologies.

Keywords: People with visual impairments, remote sighted assistance, large multimodal models, visual question answering, Human-AI interaction, Be My AI

1. INTRODUCTION

People with visual impairments (PVI) face challenges in perceiving their surroundings due to the absence of visual cues. Traditional AI-powered systems like Seeing AI [40] and human-assisted services like remote sighted assistance [3] have helped PVI interpret their environment and complete daily tasks. Recent advances in computer vision (CV) and natural language processing (NLP) have enabled AI systems to identify objects and text in scenes while responding to photo-based queries from PVI [6, 21, 28, 41, 43]. The emergence of large multimodal models (LMMs) [64], particularly GPT-4 [4], has transformed visual question-answering (VQA) capabilities. Researchers have begun exploring how these powerful LMMs might benefit PVI [63, 66]. At the forefront of this development is Be My AI [2], the first publicly available LMM-based system designed specifically to help PVI with visual interpretation and question answering. Built on OpenAI’s GPT-4 models [4], Be My AI offers capabilities that surpass those of similar applications.

Prior work [5] has investigated how PVI use and access generative AI tools, focusing on information access, acquisition, and content creation through platforms like Be My AI. Our work broadens this investigation beyond content-oriented interactions to examine real-life scenarios involving visual descriptions, task performance, social interactions, and navigation. By analyzing the capabilities and limitations of LMM-based VQA systems, particularly Be My AI, we identify gaps between current technological capabilities and PVI’s practical needs and expectations. This study offers timely insights into rapidly evolving LMMs while seeking broader understanding that will remain valuable as foundational knowledge even as specific technologies advance.

More specifically, we explore the following research questions:

What are the capabilities and limitations of LMM-based assistance in the daily lives of people with visual impairments?
How do people with visual impairments mitigate these limitations?

To answer these questions, we conducted an exploratory study using two complementary data sources. First, we interviewed 14 visually impaired users about their experiences with Be My AI. This was our primary data source. Second, we analyzed image descriptions generated by Be My AI that were shared by both our interview participants and users on social media platforms (X, Facebook, Reddit, and blogposts). This secondary data source allowed us to understand PVI’s lived experiences with Be My AI while capturing concrete examples of their interactions in various real-life scenarios.

Our study revealed that Be My AI’s context-aware capabilities help participants better understand their spatial surroundings, support social interactions, interpret stylistic elements, and convey human identities. Yet several limitations undermine these benefits: the AI sometimes hallucinates non-existent details, makes subjective interpretations about human-animal interactions and fashion choices, and misidentifies people’s age or gender. When encountering these limitations, participants draw on their spatial memory and auditory cues, apply personal judgment, or turn to human assistants. Moreover, Be My AI struggles to grasp users’ intentions, provide actionable support, or offer real-time feedback. Participants compensate for this by actively guiding Be My AI through prompting, seeking human assistance, or depending on their orientation and mobility skills.

Informed by these findings, we discuss strategies to improve handoff between PVI, Be My AI, and remote sighted assistants. We propose streamlined interactions to reduce redundancy and envision new paradigms to achieve more accurate identity recognition, improve subjective interpretations, and mitigate AI hallucinations. We also discuss the benefits of multi-agent systems, including both human-human and human-AI collaborations, and explore future possibilities of AI-AI cooperation to aid PVI in tasks requiring specialized knowledge.

Our research examines how state-of-the-art AI, particularly LMM-based systems like Be My AI, creates new opportunities for PVI through advanced human-like language and vision capabilities. Rather than contrasting Be My AI with other VQA systems, we characterized this as a new genre of AI-driven prosthetic and identified how its early-stage usage could better align with established assistive approaches such as remote sighted assistance [22, 31, 36]. Through detailed analysis of Be My AI’s capabilities and limitations, based on rich narratives from PVI about their first-hand experiences, we provide insights into how LMM-based systems are reshaping accessibility tools. These insights help us understand how to enhance both the context awareness and intent-oriented capabilities of LMMs, laying the groundwork for future advances in intelligent, interactive, and personalized assistive technology that better serve PVI’s needs.

2. BACKGROUND AND RELATED WORK

In this section, we review the literature on AI-powered and human-assisted visual interpretation and question answering systems, as well as information needed in visual interpretations for PVI.

2.1. AI-Powered Visual Interpretation and Question Answering Systems

The advancements in deep learning models for computer vision and natural language processing have significantly enhanced the capabilities of AI-powered visual assistive systems [6, 43]. These systems leverage photos taken by PVI to identify objects or text within the scene, as well as respond to queries about the contents of the images [28, 41]. Such applications are now widely adopted, exemplified by Microsoft’s Seeing AI [40], providing support to PVI in various scenarios [21].

Recent advancements in LMMs, such as GPT-4 [4], have demonstrated exceptional performance in multimodal tasks [64], prompting exploration into their potential for assisting PVI. State-of-the-art LMMs have been leveraged to create assistive systems capable of evaluating image quality, suggesting retakes, answering queries about captured images [63, 66], and even integrating multiple functions to assist PVI in tasks such as navigation [65] and text input [39]. In the commercial domain, Be My Eyes [3] introduced the Be My AI feature powered by GPT-4 [4].

Prior work has examined the use of AI-powered visual assistive systems by PVI in their daily lives [21, 22, 32]. However, these studies did not incorporate the latest LMM technologies and thus may not fully represent the user experience of cutting-edge AI-powered systems. Bendel [9] documented his experience with GPT-4-based Be My AI, yet his account remains subjective. Therefore, in-depth investigation into PVI’s daily utilization of state-of-the-art AI-powered systems is imperative. This paper addresses this gap by exploring the daily usage of Be My AI among 14 visually impaired users.

2.2. Human-Assisted Visual Interpretation and Question Answering Systems

Human-assisted VQA systems offer prosthetic support for PVI by facilitating connections to sighted people through remote assistance. These systems utilize image-based and video-based modalities to meet varied needs and situations.

Image-based human-assisted VQA systems allow PVI to submit photos along with their queries and receive responses after some time. An example of this is VizWiz, where PVI can upload images accompanied by audio-recorded questions and receive text-based answers through crowdsourced assistance [11]. This method has been successfully applied in tasks such as reading text, identifying colors, locating objects, obtaining fashion advice, and supporting social interactions [11, 12, 15, 23]. However, the single-photo, single-query limitation of image-based VQA systems [11] makes them less suitable for addressing complex or contextually deep inquiries [34].

Conversely, video-based human-assisted VQA systems facilitate real-time, interactive support, allowing PVI to receive immediate assistance tailored to their specific environmental context. This approach enables the visual interpretation of real-time scenes and supports a dynamic, back-and-forth VQA process, which is essential for addressing more specific and complex contextual inquiries effectively. The evolution of this technology has progressed from wearable digital cameras [8, 20, 29] and webcams [14, 19, 50] to the utilization of mobile video applications [1, 3, 27, 61, 62] and real-time screen sharing technologies [33, 34, 60]. Services like Be My Eyes [3], which connects PVI with untrained volunteers, and Aira [1], which connects PVI with trained professional assistants, exemplify the application of video-based VQA in scenarios that require immediate feedback. These services prove effective in navigation [31, 37, 59, 61], shopping [37, 59, 60], and social interaction [17, 35, 36]. In this study, we revealed specific situations where participants used human-assisted VQA systems to address limitations in Be My AI’s assistance.

2.3. Information Needed in Visual Interpretations for Visually Impaired Users

Both AI-powered and human-assisted visual interpretation face the challenge of identifying what specific information is necessary to meet PVI needs. Existing research primarily focuses on what information PVI seek in online image descriptions. The Web Content Accessibility Guidelines [16] advise on creating alternative text (alt text), but their one-size-fits-all approach limits context-specific applicability. Similarly, AI-based tools like Seeing AI [40] adhere to this uniform design model.

Earlier studies [23, 24] relied on sighted volunteers to decide image descriptions for PVI, while recent research [10, 51, 52] emphasizes user-centered approaches. For example, Stangl et al. [51] found that PVI preferences for descriptions vary by platform (e.g., news, social media, eCommerce), with both common needs (e.g., identifying gender, naming objects) and platform-specific details (e.g., hair color on dating websites). Follow-up work [52] showed that preferences also depend on users’ goals, with subjective details rarely included. Bennett et al. [10] examined screen reader users’ views on describing race, gender, and disability in appearance descriptions. For video content, Jiang et al. [30] examined PVI preferences across genres (e.g., how-to, music videos), recommending that entertainment descriptions focus on subjects, while how-to videos prioritize actions and tools. Natalie et al. [45] studied PVI needs for customization in video descriptions, such as length, speed, and voice. For synchronous systems like real-time video interpretation, Lee et al. [36] identified PVI needs in remote sighted assistance (RSA), including both objective details (e.g., text, spatial information) and subjective input (e.g., opinions on clothing). In this study, we examine how Be My AI processes visual information and provide design insights based on the information needed by PVI.

3. METHOD: DATA COLLECTION AND ANALYSIS

In this exploratory study, we employed a qualitative approach to study how PVI use Be My AI, an emerging LMM technology, in their daily lives. This section details our data collection and analysis processes.

3.1. Data Collection

We collected data from two sources in this exploratory study: interviews as the primary data source and Be My AI-generated image descriptions as the secondary data source. Interview data captured accounts of PVI’s lived experiences with Be My AI, enabling participants to articulate personal interactions, challenges, and perceptions of the technology. The image descriptions enriched and supplemented interview data by contextualizing PVI’s experiences across a broader range of real-life scenarios, including daily chores, professional tasks, and travel experiences.

We collected the two data sources sequentially. First, we interviewed 14 visually impaired users, focusing on their emerging practices with Be My AI in their daily lives. At the end of the interviews, we invited participants to share their image descriptions. However, most participants did not have examples to share or did not want to share them because of the demanding process or the sensitivity of personal content (e.g., images capturing their face). Respecting and prioritizing participant privacy, we ensured that sharing image descriptions was voluntary. Consequently, only 4 participants shared their descriptions. Meanwhile, we accessed existing online data from reliable platforms where PVI share their experiences.

The combination of text-based interview data and text/visual image descriptions provided richer insights into PVI’s experiences with emerging LMM technologies than either source alone would have offered. We detail our data collection process in the following section.

3.1.1. Primary data source: Interview Data.

Participants.

We recruited a total of 14 visually impaired participants (4 males and 10 females, 10 blind and 4 low-vision) through our prior contacts and snowball sampling. Each participant used and was familiar with Be My AI. The most common age group was 35–40. Three participants were students, two were unemployed, and the rest were employed. Table 1 presents their demographics. Each participant received a $30 gift card per session for their time and effort.

Table 1. Participants’ demographics.

ID	Gender	Age Group	Condition of Vision Impairment	Age of Onset	Occupation Type	Be My AI Usage Frequency
P1	F	45–50	Totally blind, retinopathy of prematurity	Since birth	IT consultant	3 or 4 times a day
P2	F	35–40	Low vision, cone-rod dystrophy	Since birth	Program director in a nonprofit	a few times a week
P3	F	30–35	Totally blind, Leber’s congenital amaurosis	Since birth	Elementary school teacher	5 times a week
P4	M	25–30	Totally blind, Pale optic nerves	More than 12 yrs ago	Criminal law employee	2 to 3 times a week
P5	F	40–45	Totally blind, retinopathy of prematurity	Since birth	Manager of digital accessibility	2 times a day
P6	F	25–30	Totally blind, microcephaly and detached retina	Since birth	Student	once a week
P7	F	35–40	Low vision, retinopathy of prematurity	Since birth	Part-time employee	2 times a week
P8	M	40–45	Totally blind, detached retina	Since birth	Insurance	a few times a day
P9	F	30–35	Totally blind, retinopathy of prematurity	Since birth	In-between jobs	a few times a week
P10	F	30–35	Low vision, retinitis pigmentosa	Since birth	Student	3 or 4 times a day
P11	M	35–40	Totally blind, retinopathy of prematurity	Since birth	Stay-at-home parent	3 or 4 times a day
P12	M	20–25	Low vision, Leber’s hereditary optic neuropathy	Since 14 yrs old	Student	2 times a week
P13	F	40–45	Totally blind, retinitis pigmentosa	Low vision since infancy, totally blind since 2021	Human service employee	5 times a week
P14	F	35–40	Totally blind, retinopathy of prematurity	Since a few months old	Assistive technology specialist	5 times a week


Procedure.

We conducted 14 semi-structured interviews via Zoom, each lasting between 50 and 76 minutes. All interviews were recorded with participants’ consent. One or two researchers were present in each session.

First, we invited participants to share their personal use cases for the Be My AI app. To aid in their recall and to enrich the discussion, we presented a list of common use cases provided by Be My Eyes [2], asking participants to identify any similar experiences they had encountered. Follow-up questions were posed to further investigate their experiences with each identified use case.

Second, we gathered participants’ feedback regarding the quality of visual interpretations provided by Be My AI. This included evaluating the visual interpretations in terms of accuracy, level of detail, errors, and the appropriateness of how people’s identities were interpreted and described.

Third, we explored participants’ experiences with Be My AI in the broader context of their use of various assistive tools, including both human-assisted and AI-powered VQA systems. This exploration helped contextualize the unique advantages and challenges of Be My AI.

Finally, we collected participants’ demographic information and inquired about their willingness to share copies of the visual interpretations from Be My AI with our research team for detailed analysis.

3.1.2. Secondary data source: Image Descriptions Generated by Be My AI.

The image descriptions we collected from interview participants and social media platforms served as a secondary data source to complement our interview findings. These descriptions included text-only copies, or original images sent to Be My AI paired with their corresponding descriptions. Some descriptions also included follow-up questions and responses between users and Be My AI.

From Participants.

Participants were given the flexibility to select their preferred method of documentation, including copying text or taking screenshots. In total, we gathered 22 image descriptions from 4 participants, as most participants either lacked examples or were reluctant to share due to the demanding process or personal content concerns. Of these descriptions, 4 were original images with descriptions, while the majority consisted of text copies, which participants found easier to manage. Additionally, 6 of the 22 descriptions included follow-up questions that participants posed to Be My AI for further clarification or additional details.

From Social Media Platforms.

We also collected image descriptions by scanning publicly available posts from 4 different social media platforms: X, Facebook, Reddit, and blogposts. The platforms were chosen to cover both image-sharing sites (X, Facebook) and discussion forums (Reddit, blogposts).

We used the search terms “BeMyAI” and “#bemyai” to find related posts on X and Facebook. On Reddit, we searched for “AI” and “BeMyAI” within the r/Blind community. Similarly, for blogposts, we searched for “BeMyAI” and the hashtag “#bemyai.”

We gathered posts that met the following criteria: (i) written in English, (ii) published from September 25, 2023, the official release date of Be My AI, through March 31, 2024, and (iii) contained image descriptions generated by Be My AI. To avoid redundancy, if identical content was found across multiple platforms, only the earliest published post was recorded. If one post included multiple descriptions, each description was documented as a separate entry. If there were multiple posts published around the same time frame about a single topic (e.g., an X or Reddit thread), they were included as a single entry.
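The selection rules above can be read as a simple filtering procedure. The following Python sketch is illustrative only and is not the tooling used in the study: the `Post` record, its fields, and the sample posts are hypothetical, and "identical content" is approximated by comparing normalized text. The rules for splitting multi-description posts and merging threads are not modeled.

```python
from dataclasses import dataclass
from datetime import date

# Collection window stated in the text (both ends treated as inclusive).
RELEASE_DATE = date(2023, 9, 25)
CUTOFF_DATE = date(2024, 3, 31)


@dataclass(frozen=True)
class Post:
    """Hypothetical record for one public post; not a real platform schema."""
    platform: str
    text: str
    language: str          # assumed language tag, e.g. "en"
    published: date
    has_description: bool  # True if the post contains a Be My AI description


def meets_criteria(post: Post) -> bool:
    """Apply criteria (i)-(iii): English, within the window, has a description."""
    return (
        post.language == "en"
        and RELEASE_DATE <= post.published <= CUTOFF_DATE
        and post.has_description
    )


def select_posts(posts: list[Post]) -> list[Post]:
    """Keep eligible posts; for identical content keep only the earliest one."""
    earliest: dict[str, Post] = {}
    for post in filter(meets_criteria, posts):
        key = " ".join(post.text.lower().split())  # assumption: crude text match
        kept = earliest.get(key)
        if kept is None or post.published < kept.published:
            earliest[key] = post
    return sorted(earliest.values(), key=lambda p: p.published)


# Made-up sample data, for illustration only.
sample = [
    Post("X", "Be My AI described my kitchen.", "en", date(2023, 10, 2), True),
    Post("Reddit", "Be My AI described my kitchen.", "en", date(2023, 10, 5), True),
    Post("Facebook", "Trying the app today.", "en", date(2023, 11, 1), False),
    Post("X", "Early beta screenshot.", "en", date(2023, 9, 1), True),
    Post("Blog", "Be My AI read a menu for me.", "en", date(2024, 3, 31), True),
]

selected = select_posts(sample)
for post in selected:
    print(post.platform, post.published)
```

In this made-up sample, the Reddit post is dropped as a later duplicate, the Facebook post lacks a description, and the beta screenshot predates the release date, so only the first X post and the blog post remain.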

We verified that the social media posts were authored by PVI. Posts came either from the official Be My Eyes account or from users who explicitly identified themselves as visually impaired in their profiles. The platforms we chose are trustworthy and dependable sources where PVI share their experiences.

In total, we collected 28 image descriptions generated by Be My AI from 4 social media platforms. Of these, 23 included original images sent to Be My AI and 5 were text-only interactions. Fifteen descriptions contained follow-up questions that users sent to Be My AI for additional details or clarification. Specifically, we gathered (i) 17 image descriptions from X, including 16 original images and 6 follow-up questions, (ii) 3 image descriptions from Facebook, including 3 original images, (iii) 3 image descriptions from Reddit, including 1 follow-up question, and (iv) 5 image descriptions from blogposts, including 4 original images and 8 follow-up questions. To explore initial user experience and interactions with Be My AI during its early phase, we collected these image descriptions from social media platforms shortly after its release. Although this dataset is limited in size, it provides valuable insights into Be My AI’s early-stage performance and user experience.
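As a quick consistency check, the per-platform figures reported above can be tallied against the stated totals. The Python sketch below uses only numbers from the text. Treating counts that the text does not state (follow-up questions on Facebook, original images on Reddit) as zero is an assumption, as is the dictionary layout.

```python
# Per-platform counts as reported in the text.
# Assumption: counts not stated in the text are recorded as 0.
counts = {
    "X":         {"descriptions": 17, "original_images": 16, "follow_ups": 6},
    "Facebook":  {"descriptions": 3,  "original_images": 3,  "follow_ups": 0},
    "Reddit":    {"descriptions": 3,  "original_images": 0,  "follow_ups": 1},
    "blogposts": {"descriptions": 5,  "original_images": 4,  "follow_ups": 8},
}

fields = ("descriptions", "original_images", "follow_ups")

# Sum each field across the four platforms.
totals = {field: sum(platform[field] for platform in counts.values())
          for field in fields}

# Text-only interactions are the descriptions without an original image.
text_only = totals["descriptions"] - totals["original_images"]

print(totals)
print("text-only:", text_only)
```

Under that assumption, the tallies agree with the reported totals of 28 descriptions, 23 original images, 15 follow-ups, and 5 text-only interactions.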

3.2. Data Analysis

We used a bottom-up approach to analyze the interview data. The first author performed inductive thematic analysis [13] on the transcribed interviews, used open coding to develop initial codes, and generated themes and categories through iterative collating and grouping. The themes and categories (Table 2 in Appendix) were reviewed and finalized during weekly meetings with all authors.

We used a top-down approach to analyze the image descriptions. The first author reviewed each image description along with any follow-up user questions and Be My AI’s responses, identified the context in which the image was captured, and then used the codebook developed from interview data (Table 2 in Appendix) to analyze the descriptions deductively.

The following example illustrates our data analysis process from inductive to deductive approach. In the interviews, participants mentioned several scenarios where Be My AI initially described the general scene rather than the specific content users expected, so participants had to guide it through subsequent questions. We labeled this interaction as “Goal Understanding Dialogue.” During the analysis of image descriptions, we encountered similar patterns (Figure 2); therefore, these instances were classified under the pre-established “Goal Understanding Dialogue” category.

Fig. 2. On the left is the original image sent to Be My AI. On the right is Be My AI’s description of eggs in a frying pan, followed by a question checking for the presence of eggshells. This example was originally drawn from X.

Through this process, we first identified Be My AI’s capabilities and limitations from the interview data, then used the image descriptions to complement the interview findings with concrete examples of user questions and Be My AI’s responses.

4. FINDINGS

In this section, we first explore Be My AI’s context-aware capabilities across diverse settings, assessing its effectiveness and the challenges it faces in interpreting physical environments, social and stylistic cues, and people’s identity. Next, we examine its intent-oriented capabilities, highlighting how it performs in understanding and addressing user needs. In our analysis of each aspect, we first describe how PVI utilize Be My AI and reveal its capabilities under the heading “Capabilities of Be My AI.” We then present specific scenarios and concrete examples under the “Example” headings that illustrate key challenges users encountered and their strategies for overcoming them.

4.1. Context Awareness

This section examines how Be My AI supports users across diverse environments and interactions. We analyze its roles in enhancing spatial awareness in physical settings, interpreting social and stylistic context, and conveying human identities. Throughout, we assess Be My AI’s effectiveness in providing enriched, context-aware descriptions alongside the challenges posed by its technology and subjective interpretations.

4.1.1. Physical Environments.

We explore how participants utilized Be My AI to enhance their perception of both indoor and outdoor settings through detailed scene descriptions. From indoor environments like theaters (P2) and room layouts (P3, P7, P9, P13) to outdoor scenes like holiday decorations (P5) and street views (P11, P12), Be My AI offers structured visual information that broadens spatial awareness. However, these benefits are occasionally compromised by AI hallucinations or require active guidance from participants to ensure accurate scanning.

Capabilities of Be My AI.

Seven participants (P2, P3, P7, P9, P11, P13, P14) highlighted the value of Be My AI’s comprehensive scene descriptions in enhancing their spatial awareness. These descriptions often uncover spatial details previously unnoticed by users, as noted by P11, “It gives me unexpected information about things that I didn’t even know were there.”

However, this enhanced awareness comes with its challenges, notably AI hallucinations – errors where AI inaccurately identifies objects that are not present. Such errors can disrupt the user’s context understanding, leading to confusion or misinterpretation of the environment. To mitigate these errors, participants often require human assistance or rely on their pre-existing mental models of the environment to cross-verify Be My AI’s descriptions.

Participants have also used auditory cues to guide Be My AI effectively. When attempting to locate dropped objects, for example, participants use their auditory perception and spatial memory to adjust the camera angle precisely, facilitating more accurate scanning by Be My AI. This integration of participants’ auditory inputs guides and refines Be My AI’s performance, leading to more accurate and useful descriptions.

Example 1: AI Hallucinations of Adding Non-existent Details.

Be My AI enriches participants’ understanding of their surroundings by providing detailed descriptions of objects’ colors, sizes (e.g., “small,” “medium,” “large,” and “tall”), shapes (e.g., “round,” “square,” “fluffy,” “wispy,” “dense,” and “open lattice structure”), and spatial orientations (e.g., “on the left,” “to the right,” “in the foreground,” “in the middle,” and “in the background”). Despite these details, 4 participants (P3, P5, P11, P14) encountered issues with AI hallucinations, where Be My AI added non-existent details to scenes.

To address these inaccuracies, participants either sought verification from human assistants or relied on their own spatial memory. For example, P3 experienced a situation where Be My AI reported a “phantom” object behind her. After failing to find anything upon personal investigation, she consulted a human assistant who confirmed that there was indeed nothing there. P3’s experience emphasized the importance of human validation:

“Apparently, Be My AI said there was an object behind me, and there really wasn’t. So, then, when I went and asked somebody, ‘You know what is behind me?’ and they said there wasn’t anything behind you in the picture.”

(P3)

Similarly, P5 encountered Be My AI inaccurately adding details to an image of her home, claiming the presence of objects next to her dogs. Utilizing her familiarity with the environment, she independently identified these errors. This ability to discern inaccuracies based on personal knowledge of one’s environment illustrates how participants can effectively use their spatial memory to challenge and correct Be My AI’s interpretations, particularly in instances of AI hallucinations.

Example 2: Need for User Support in Locating Dropped Objects.

Participants (P4, P12, P13) used Be My AI’s sequential descriptions to locate dropped objects, such as earbuds and hair ties. Be My AI systematically details the environment in a “top to bottom, left to right” sequence, aiding in the identification of specific locations. For instance, P13 was informed of “a picture of a carpet with a hair tie in the upper right hand corner,” and P12 received guidance that “the earphones are directly in front of you, between your feet.” These descriptions allow participants to pinpoint lost objects with greater accuracy.

However, effectively positioning the camera for accurate scanning often required participants’ auditory cues or spatial memory to initially estimate the location of dropped objects. This adjustment allowed Be My AI to focus on the intended search area, rather than starting the search at random points. For instance, P4 utilized his “listening skills” to detect the location of a fallen object, subsequently directing the camera to that specific spot. Similarly, P12 shifted his position slightly backward from the seating area to capture a better view of the floor, optimizing the camera angle for the dropped earbuds. These examples illustrate how participants synergize their understanding of the environment with Be My AI’s technological capabilities to manage tasks that require spatial awareness.

4.1.2. Social and Stylistic Contexts.

We analyze how Be My AI has been employed in interpreting social and stylistic contexts. In social interactions, nine participants (P2, P4, P5, P8–11, P13, P14) have leveraged Be My AI to take leisure pictures of animals like cats, dogs, birds, and horses, to better understand and monitor the animals’ statuses. In stylistic applications, Be My AI aids in identifying clothing colors (P2, P7, P9, P13) and patterns (P2, P9), suggesting harmonious combinations of outfits, shoes, and jewelry (P2, P4, P5, P6, P12), and assisting with makeup checks (P5, P10). This section examines Be My AI’s subjective interpretations within these contexts, highlighting its utility in enriching users’ experiences and the challenges in providing accurate, context-aware descriptions.

Capabilities of Be My AI.

Be My AI usually concludes its descriptions by adding subjective interpretations, which enrich users’ understanding of the depicted scenarios. These interpretations encompass people’s or animals’ emotional states (e.g., “a cheerful expression on his face,” “smiling slightly,” “friendly and approachable,” “curious expression”), body language (e.g., “his arms are open as if he is engaging in a conversation or greeting the other man”), and the ambiance (e.g., “peaceful and natural,” “serene and peaceful atmosphere,” “cozy and cheerful holiday vibe,” “warm and inviting atmosphere,” “the atmosphere seems to be lively and festive”).

These subjective interpretations enable participants to engage more deeply in social interactions and gain better awareness of animals’ states and behaviors. Participants (P5, P9–11) have used Be My AI to grasp nuances in animals’ facial expressions, activities, and body language (Figure 1). P10, in particular, highlighted Be My AI’s utility during dog walks: “Sometimes it’s hard for me to know if the dog is peeing or what the dog is doing.”

Fig. 1. Be My AI’s description of a dog, including subjective interpretations of the dog’s emotions. This screenshot was provided by P9.

Furthermore, the subjectivity extends to fashion suggestions, with Be My AI recommending stylistically coordinated outfits based on colors and patterns on clothing, as well as assessing makeup. For instance, P5 employed Be My AI to check the color, placement, and overall balance of her makeup, P2 and P9 consulted it to harmonize colors of tops and bottoms, and P12 used it to select a tie that complemented his shirt.

However, concerns arise as participants have noted that Be My AI sometimes introduces excessive subjectivity into its interpretations. This subjective input, unverified by a human, could distort the context awareness that Be My AI aims to enhance and undermine users’ agency. Consequently, participants favored their own subjective interpretations or feedback from sighted assistants over Be My AI in contexts involving human-animal interactions and fashion.

Example 1: Subjective Interpretations in Human-Animal Interactions.

While Be My AI provides valuable information in human-animal interactions, participants (P2, P5) raised concerns about how it infers animals’ emotions, highlighting worries about the subjective interpretations introduced in image descriptions. For instance, P5 noted cases where Be My AI describes a dog “appears to be relaxed” or “appears to be happy.” Likewise, P2 pointed out that Be My AI often adds subjective commentary like “That’s a white cat curled up on a fuzzy blanket. She looks peaceful and happy and rested.”

These interpretations extend beyond merely describing the objective visual elements to inferring emotional states of animals, which may not always accurately reflect reality. As a result, participants prefer more straightforward, factual descriptions without the inclusion of subjective interpretations. P2 articulated a desire for less editorializing, stating, “Maybe I don’t want it to editorialize, you know, maybe I just literally only want the facts of it.”

Participants (P2, P5) emphasized the importance of human agency in interpreting the animals’ behaviors, preferring their personal judgment over the tool’s subjective interpretations. P5, in particular, appreciated the ability to override Be My AI’s assumptions about her dog’s expressions. She valued the flexibility to modify or reject Be My AI’s interpretations, reinforcing her sense of control over how she perceives her pets’ behaviors.

“Some blind people think, ‘How does it know that the dogs are happy? Why does it assume?’ Some people don’t like that it’s making assumptions about the picture. I like having access to that information, but I like to be able to change it if I want.”

(P5)

In summary, the subjective inferences made by Be My AI can sometimes add depth to descriptions. However, they can also lead to inaccuracies that compromise the authenticity and usefulness of context awareness provided by Be My AI. Maintaining human agency allows users to interpret and adjust the descriptions based on their judgment.

Example 2: Subjective Interpretations in Fashion Help.

Be My AI’s subjective interpretations extend beyond human-animal interactions to fashion help. Participants (P6, P7, P10) expressed concerns about Be My AI’s subjectivity in this context. While they utilized Be My AI to obtain descriptions of colors and patterns, they hesitated to rely on its fashion suggestions. These participants preferred making their own judgments and choices, or consulting human assistance for outfit matching, emphasizing the importance of human subjectivity in fashion decisions.

P6 articulated worries about the AI’s ability to replicate human subjectivity, remarking, “It’s interesting how AI is being taught to simulate kind of the human factor of things.” She linked her skepticism to instances where AI-generated responses were “strange” and “complete nonsense,” which contrasted with human subjectivity and nuanced understanding.

“No, no, no, no, I would never use it to do anything that required human subjectivity… I just don’t trust AI with a task that is supposed to be subjective like that, particularly visual like that. Have you ever seen AI weirdness?… I think that just goes to show why I’m not gonna trust AI with my fashion yet.”

(P6)

Similarly, P10 preferred human feedback, saying, “I’m still more confident asking a sighted person to provide me with the feedback.” This preference demonstrates the significance of human agency in areas reliant on subjective interpretation. By favoring human judgment over Be My AI, participants enhance their agency and maintain control over their personal fashion choices.

4.1.3. Identity Accuracy and Sensitivity.

We analyze how Be My AI handles the depiction of people’s identities in images. Nine participants (P2, P3, P5–7, P9–11, P13) shared their experiences with Be My AI’s descriptions of identity while viewing pictures of their families, friends, or images posted on social media. This exploration focuses on Be My AI’s capability to convey identity attributes like age and gender, examining both its strengths and its limitations in handling this information with sensitivity and accuracy.

Capabilities of Be My AI.

Be My AI describes people’s identity in terms of gender (e.g., “woman,” “man”), age (e.g., “old,” “late twenties or early thirties”), appearance (e.g., “long, dark brown hair,” “wavy brown hair,” “His hair falls past his ears, with a slightly messy but stylish look”) and ethnicity (e.g., “East Asian”). Participants appreciated such detailed descriptions, which provide a richer perception of people depicted in images, as exemplified by P2’s comment: “I enjoyed hearing, I knew my friend was East Asian when I saw her picture. It was kind of cool to see that because… as a blind person, you don’t know that all the time. So, I’d like access to it.”

Despite the appreciable benefits, challenges arise particularly around the sensitivity and accuracy of such descriptions. Eight participants (P2, P3, P5–7, P9, P10, P13) reported instances where Be My AI inaccurately assigned identity details, such as people’s gender or age. These inaccuracies highlight a gap in Be My AI’s context-awareness capabilities, underscoring the importance of AI’s ability to interpret complex cues that go beyond mere visual appearances to include cultural, situational, and personal context in its analysis.

Example: Inaccurate Identification of People’s Gender and Age.

Three participants (P3, P10, P13) reported inaccuracies in Be My AI’s gender identification. P13 mentioned an instance where Be My AI misidentified gender based on visible cues – mistaking a hidden ponytail for short hair, which “just looked like short-cropped hair.” This led to a misclassification of the person’s gender. Similarly, P10 encountered a limitation when Be My AI struggled with gender identification for individuals whose physical attributes do not conform to typical gender norms, noting, “the person had a short hair, and was a female”. In response, Be My AI defaulted to a neutral identification, “Be My AI didn’t tell me it was a woman or a man, just said a person.” These examples illustrate Be My AI’s reliance on stereotypical indicators and reveal its challenges in interpreting non-visible details.

Additionally, two participants (P5, P9) discussed Be My AI’s challenges in accurately interpreting age. P9 shared an example where Be My AI attempted to estimate the age of her daughter but was incorrect, guessing the child to be a year older than she actually was. Likewise, P5 observed sensitivity issues when Be My AI associated grey hair with being “an old woman,” which some people found offensive. To address these age-related inaccuracies, P5 suggested that Be My AI should adopt more objective, factual descriptions rather than subjective or potentially stigmatizing labels, such as “a woman with grey hair, instead of an older woman.” These examples indicate Be My AI’s difficulties with precise age estimation and the potential inaccuracy when AI makes assumptions based on appearance alone.

In summary, these examples illustrate Be My AI’s limited context awareness and its challenge in accurately interpreting people’s identities, such as gender and age. Be My AI’s assessments heavily rely on visible cues and superficial attributes. These misjudgements are rooted in Be My AI’s inability to integrate and interpret broader, non-visible contextual elements like cultural norms and personal styling choices.

4.2. Intent-Oriented Capabilities

In this section, we explore Be My AI’s limitations in comprehending users’ goals, providing actionable support to fulfill these objectives, and offering real-time feedback. Through specific examples, we elucidate the extent of Be My AI’s intent-oriented capabilities and highlight areas where it falls short in adapting to user needs.

4.2.1. Agentic Interaction.

Be My AI often struggles to infer users’ specific intentions from images. This section illustrates how users compensate by guiding Be My AI through targeted prompting.

Capabilities of Be My AI.

The “ask more” function of Be My AI enables participants to inquire further into aspects of images that were not covered in initial descriptions. This feature facilitates deeper engagement by allowing participants to seek specific details according to their personal needs or interests. Eleven participants (P1–3, P5, P6, P8–12, P14) utilized the “ask more” function to match outfits (P2, P5, P12), check makeup (P5, P10), suggest cooking recipes (P12), assist with household appliances (P5, P12), examine text or objects (P6, P8, P11, P12, P14), gain more details about people’s facial expressions, attire, or actions (P2, P3, P5, P8, P9, P10, P11), check the status of animals (P3, P5, P11, P14), and edit the descriptions for social media posts (P2, P3).

Seven participants (P1–3, P5, P8–10) valued the flexibility of posing follow-up questions to Be My AI, enhancing their independence by reducing reliance on human assistance. As P3 expressed, “I can ask follow-up questions, so I have a good way to sort of figure out what’s in the image independently, which is something I was not able to do before this app came out.”

Despite these advantages, participants identified a common issue: Be My AI is unable to discern users’ goals in its initial response, often producing generalized descriptions without knowing which aspects to emphasize. P12 elaborated on this challenge, stating, “Be My AI loves to make general descriptions, and it doesn’t know what to focus on.” Consequently, users actively direct Be My AI through their prompts to understand their specific intents.

Example 1: Check for the Presence of Eggshells.

In Figure 2, the initial response from Be My AI involves a detailed description of what is included in a frying pan. The user, likely influenced by prior experiences of inadvertently leaving unwanted items in the pan while cooking, specifically intended to check for such elements. However, Be My AI did not infer that the user was specifically looking for unwanted elements like eggshells, rather than seeking a general description of the scene. To clarify her objective, the user posed a follow-up question, thus directing Be My AI to recognize and prioritize her specific concerns and intentions.

In this example, although Be My AI can describe the visual elements accurately and comprehensively (from the condition of the eggs to the surroundings), it failed to identify unusual visual elements that are challenging for PVI to detect but are important to their understanding of the scene. Through user prompting, Be My AI can learn to focus on these unusual elements, thereby enhancing its ability to interpret and respond to user-specific intents more effectively.

Example 2: Adjust Rotary Control Appliances.

Six participants noted Be My AI’s inadequacy in comprehending their goal of adjusting rotary control appliances like washers (P1) and thermostats (P2, P4, P5, P13, P14). This task requires Be My AI to interpret the current settings on these appliances as users modify dials for time, mode, or temperature. However, Be My AI often offered broad descriptions of visual elements without homing in on the user’s specific objectives.

For example, P13 reported that although Be My AI could recognize a thermostat on the wall, it did not provide necessary details such as the current temperature setting or instructions for adjusting the temperature.

To overcome these limitations, participants guided Be My AI to better understand their goal through prompting. For instance, P5 posed targeted questions to Be My AI, such as “What is the arrow pointed at right now on the current setting?” This inquiry prompted Be My AI to provide more precise and relevant information, focusing on specific elements crucial for adjusting the appliance accurately, rather than delivering broad, scene-wide descriptions.

4.2.2. Consistency and Follow-Through.

In Section 4.2.1, we revealed instances where Be My AI failed to grasp users’ intentions, leading users to explicitly guide Be My AI towards their objectives through prompting. This section examines Be My AI’s challenges with maintaining consistency and follow-through even when it recognizes users’ goals.

Capabilities of Be My AI.

Be My AI often fails to proactively suggest subsequent steps or continue support until the user’s objectives are fully achieved. Effective consistency should entail maintaining focus on the user’s objective throughout the interaction. This includes offering progressively specific and relevant assistance, and ensuring all responses align with reaching the user’s intended outcome.

Due to these shortcomings, participants turn to human assistants. Human assistants can interpret the context of tasks and goals more dynamically, providing conversational guidance and adapting to users’ actions to facilitate goal achievement.

Example 1: Inadequacy in Identifying Central Puzzle Piece.

P14 engaged Be My AI to differentiate between various puzzles by describing the images on the boxes like a scene of cats or bears. However, when attempting to identify the centerpiece for a nine-piece puzzle based on its box image, Be My AI fell short in providing the necessary specificity for more detailed tasks. Initially, Be My AI read the text on the back of the puzzle box, which specified the piece to be placed in the center. When P14 sought further details to locate the centerpiece, Be My AI merely replied that it was square, a characteristic shared by multiple pieces. This response was too vague to facilitate effective puzzle assembly.

“When Be My AI read me the text of the box on the back of the puzzle that said, you know, this particular piece should be in the center of the puzzle. As a follow-up question I asked, ‘Could I have more information about the piece in the center?’ And it said, ‘This piece is a square piece,’ but I mean, there were many different square pieces, so I could not tell from that.”

(P14)

This example demonstrates Be My AI’s capability to process visual inputs and grasp the user’s objective but highlights its deficiency in delivering sufficiently detailed information pertinent to achieving the user’s goals. Consequently, P14 sought help from a family member.

Example 2: Lack of Instruction for Camera Adjustment.

A common challenge for participants (P3, P5, P6, P8, P10–12, P14) was aligning the camera properly to capture a clear view of the intended area. Be My AI can notify participants about the quality of the images or if they contained incomplete information, such as “the picture got cut off” (P5), “the picture is blurry” (P6), or the example shown in Figure 3. However, Be My AI is unable to provide further actionable guidance on adjusting the camera to improve image clarity. This limitation was evident as Be My AI could recognize the user’s goal of capturing specific areas and understand that the provided images lack complete information but fell short in suggesting practical steps to address these issues. As P6 expressed, “It was very frustrating because it said the picture is blurry. Can you please put the label in the frame? But I didn’t know how to put the label on the frame.”

Fig. 3. — Be My AI’s description of a conference room, with the original image cropped. This example was drawn from X.

Eight participants (P1–3, P7, P8, P12–14) addressed this challenge by turning to human assistants, who can offer adaptive support to achieve their goals, especially in tasks that require continuous feedback. P13 shared an example where a human assistant not only understood her goal but also guided her in adjusting the camera and instantly reminded her to turn on the light to achieve her objective.

“You know, having that ability to communicate and say, ‘Hey, this is what I’m looking for.’ Or one time, I was looking for something and I had the lights off, and they’re like, ‘You need to turn the lights on.’ And I go, ‘Okay,’ as opposed to, you know, if I tried using AI for that, it would just say ‘dark room.’”

(P13)

This example highlights human assistants’ ability to adapt their communication based on the situation and the user’s implied needs, providing solutions that directly support achieving the user’s goals. However, Be My AI often lacks this level of adaptability and practical assistance, even though it understands the literal request.

4.2.3. Real-Time Feedback.

We explore the application of Be My AI for navigation tasks, particularly its use in localizing and orientating participants. This section analyzes the limitations of Be My AI in providing real-time feedback and comprehensive navigational information from static images, and its ineffectiveness in facilitating immediate interactions with surroundings.

Capabilities of Be My AI.

Participants (P2, P5, P10) employed Be My AI to aid in localization and orientation while navigating to their destinations. For instance, P2 utilized it to identify gate numbers at the airport, P5 employed it to read signage directing toward the airport’s transportation area, and P10 used it to recognize her surroundings when disoriented in her neighborhood. Despite these positive applications, several participants reported challenges using Be My AI for navigation purposes. These difficulties were primarily attributed to the limited camera view and practical mobility issues.

Example 1: Limited Navigational Information in Static Images.

Six participants (P2, P5, P6, P10, P11, P13) noted that Be My AI’s reliance on static images rather than real-time videos makes it difficult to capture comprehensive navigational information, such as obstacles and signage, in a single shot. As P6 described, users “have to stand there and keep taking pictures and taking pictures,” check what’s captured, determine if it’s helpful for navigation, and adjust the angle for another picture. This iterative process can be time-consuming: “It’ll be too much of a task that’s supposed to take maybe 10 minutes would probably take like 30.”

P10 elaborated on the challenge and risk of simultaneously taking pictures and navigating, particularly when attempting to identify and navigate around obstacles. The uncertainty of capturing all potential hazards was a significant concern.

“It’s still hard to know how to capture, you know, all the obstacles. I think that’s the issue. Like, how to know that I captured the right obstacle on my path? I mean, it depends [on] what I’m able to capture with the camera. You know, that’s the tricky part for somebody without vision to capture the obstacle.”

(P10)

Consequently, participants doubted Be My AI’s reliability and safety for real-time navigation, often opting for human assistance through video-based interactions instead. In addition to the adaptability of human assistants in adjusting their guidance (Section 4.2.2), the dynamic nature of video allows for real-time interaction, a critical feature absent in static images. Participants (P2, P6, P13) pointed out the potential improvements if Be My AI could interpret videos in real time, which would alleviate the need for repeated picture-taking to capture more comprehensive navigational information.

“If you could hold the camera and it could do it in real time, versus having to stop, take a [picture], then assess it… If that were the case, then you could move on to doing things like uploading videos and getting it to describe actual videos and things like that, versus just still images. That would be great. You know, then you could describe more.”

(P2)

Enabling real-time video analysis would allow users to move the camera continuously, adjusting in real-time to capture necessary details without the need to stop and review each image, thus streamlining the navigation process.

Example 2: Irreplaceable Role of Orientation & Mobility Skills in Navigation.

Besides the value of real-time visual interpretations from external resources like human assistants, participants (P5, P10, P13, P14) emphasized the indispensable role of real-time feedback through their Orientation and Mobility (O&M) skills for safe navigation. P5 and P13 pointed out that even human assistance, while adaptive in guiding PVI away from navigational hazards, cannot substitute for essential O&M skills required for tasks such as street crossing. Consequently, participants are more cautious about relying on emerging tools like Be My AI for navigation. P14 reinforced this perspective, saying, “it’s not a replacement for our mobility skills or just any skills in general. It can aid and augment the skills but it’s not a replacement for them. It shouldn’t be.”

Participants (P5, P13) further delineated vital information provided by O&M skills or discerned through O&M tools like white canes or guide dogs, which current AI or human assistance cannot replace. First, immediate surroundings. Be My AI can identify obstacles “that are a little far away” (P5) but may not provide timely feedback on immediate surrounding hazards. Second, distance and proximity. Although Be My AI can indicate the presence of obstacles, it lacks the capability to measure their distance. Third, directional details. Essential navigational information, such as the direction of stairs (up or down) or the presence of railings, is not always detectable through Be My AI. With O&M skills and tools, users can swiftly adjust their movements based on direct interaction with their environment in real time. This level of responsiveness is currently unattainable with Be My AI or human assistance.

“[Be My AI] says, you know, ‘stairs in front,’ and it’s like, ‘Okay, that’s great, but where are they? How far are they? Are they going up? Are they going down? Is there a railing?’ which would be information that the dog or the cane could tell you. So, I would say use it as a tool along with, but definitely not by itself.”

(P13)

In summary, neither Be My AI nor human assistants can substitute the essential, real-time feedback and adaptive capabilities offered by O&M skills and tools. These elements are vital for ensuring safe navigation by allowing users to directly interact with their environment.

5. DISCUSSION

In this section, we examine the current state of handoff between users, Be My AI, and remote sighted assistants, and propose new paradigms to address the challenges identified in our findings. Next, we explore how multi-agent systems, both human-human and human-AI interactions, assist visually impaired users, and envision the transition toward AI-AI collaborations for tasks requiring specialized knowledge. Finally, we discuss the potential advantages of real-time video processing in the next generation of AI-powered VQA systems.

5.1. Handoff Between Users, Be My AI, and Remote Sighted Assistants

In this study, we illustrated the advantages of Be My AI in enhancing spatial awareness through detailed scene descriptions of objects’ colors, sizes, shapes, and spatial orientations (Section 4.1.1), enriching users’ understanding in social and stylistic contexts by detailing emotional states of people or animals, their body language, and ambiance (Section 4.1.2), and identity recognition (Section 4.1.3). Also, Be My AI aids in localizing and orientating users by reading signage (Section 4.2.3).

Our findings indicate that, despite these benefits, there are challenges that Be My AI alone cannot overcome. Be My AI still requires human intervention, either from the blind user or the remote sighted assistant (RSA), to guide or validate its outputs. Users seek confirmation from RSAs or depend on their own spatial memory to overcome AI hallucinations, where Be My AI inaccurately adds non-existent details to scenes. Users also rely on auditory cues and spatial memory to locate dropped objects and direct Be My AI toward the intended search areas (Section 4.1.1). Moreover, users actively prompt Be My AI to understand their specific objectives, such as checking for eggshells in a frying pan or adjusting appliance dials (Section 4.2.1). There are also instances where users require assistance from RSAs when Be My AI fails to provide adequate support to fulfill users’ objectives, such as identifying the centerpiece of puzzles or adjusting the camera angle (Section 4.2.2). Human assistance or users’ O&M skills are necessary to receive real-time feedback for safe and smooth navigation (Section 4.2.3).

Furthermore, our findings revealed that Be My AI might produce inaccurate or controversial interpretations. Users express skepticism towards Be My AI’s subjective interpretations of animals’ emotions and fashion suggestions (Section 4.1.2), and have encountered inaccuracies in Be My AI’s identification of people’s gender and age (Section 4.1.3). These instances underline potential areas where human judgment is necessary to corroborate or correct Be My AI’s descriptions.

Next, we discuss the handoff [44] between users, Be My AI, and RSAs to mitigate the aforementioned challenges.

5.1.1. Status Quo of Interactions Between Users and Be My AI.

Through the “ask more” function on Be My AI, users are able to request additional details about the images that were not covered in Be My AI’s initial descriptions. This functionality facilitates a shift in interaction dynamics between users and Be My AI, even if Be My AI may not accurately understand or answer users’ questions on the first attempt. In these interactions, users are not merely passive recipients of AI-generated outputs; they actively guide Be My AI with specific prompts to better align the responses with their objectives.

Our findings reported one instance where Be My AI initially fails to grasp the user’s intent to check for the presence of eggshells (Figure 4 top). First, the user submits an image of eggs in the pan to Be My AI. In response to the image, Be My AI describes the quantity and object (“Inside the frying pan, there are three eggs”) and the states of the yolks and whites (“whites separated”, “yolk has broken”, “mixing with the egg white”). Next, the user clarifies her inquiry by asking, “are there any shells in my eggs?” This prompts Be My AI to understand the user’s goal and reevaluate the image, subsequently confirming the presence (“Yes, there is a small piece of eggshell in the frying pan”) and location of an eggshell to help her remove it (“near the bottom left of the broken egg yolk”).

Fig. 4. — The top shows the status quo of handoff between the user and Be My AI. The bottom illustrates our proposed simplified interaction.

This interaction exemplifies the status quo of handoff, where the user and Be My AI engage in a back-and-forth dialogue to refine the descriptions based on Be My AI’s “ask more” function and the user’s precise prompts. While this iterative process allows Be My AI to eventually understand the users’ intent without RSAs’ intervention, it places cognitive burden on users who must carefully craft and iteratively refine their prompts. The cognitive load increases as users mentally track what information they’ve already received, analyze gaps between their needs and Be My AI’s responses, and develop increasingly specific queries.

To reduce users’ cognitive load, we propose enabling Be My AI to adopt a mechanism that combines multi-source data input with long-term and short-term memory capabilities [67]. With explicit user consent, future versions of Be My AI could integrate data from users’ mobile devices (e.g., location information, time data) alongside historical interaction data within Be My AI (e.g., contexts and follow-up questions) to recognize user preferences and common inquiries, and infer their needs, thereby generating responses more effectively in similar contexts. Long-term memory serves as a repository for capturing generalized user preferences, behavior patterns, and aggregated insights across multiple users. This long-term memory is particularly effective for improving system intelligence by identifying common user needs and optimizing general responses [47]. Meanwhile, short-term memory can focus on task-specific optimization within a single interaction session. It retains context from the immediate conversation, such as recent user inputs and system responses, to enhance relevance and coherence in real time. Short-term memory operates dynamically, clearing retained data once the session ends or the task is completed, thereby ensuring privacy and preventing unnecessary data retention.

For example, when identifying unusual elements during cooking (e.g., eggshells in cooking eggs), Be My AI could utilize the user’s immediate input while referencing short-term memory from the current session or recent similar interactions. Additionally, by leveraging long-term memory, Be My AI can learn from the user’s past questions and query patterns to better match their habits and preferences, i.e., user’s typical needs. Furthermore, multi-source data input, such as time or location information, can assist Be My AI in inferring the user’s current context, for instance, recognizing that the user is preparing a specific meal at a particular time or place, which allows the system to provide more relevant and context-aware assistance. This approach enables Be My AI to proactively anticipate user intent and deliver targeted responses, reducing the need for multiple clarifying prompts (Figure 4 bottom).
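The memory design proposed above can be made concrete with a small sketch. The Python below is illustrative only and is not Be My AI’s implementation: the class names (`ShortTermMemory`, `LongTermMemory`), the `infer_context` rule, the scene labels, and the `min_count` threshold are all hypothetical, and location is the only device signal used, for brevity.

```python
from collections import Counter, deque
from dataclasses import dataclass, field


@dataclass
class ShortTermMemory:
    """Holds recent turns for one session; cleared when the session ends."""
    max_turns: int = 10
    turns: deque = field(default_factory=deque)

    def add(self, user_input: str, response: str) -> None:
        self.turns.append((user_input, response))
        while len(self.turns) > self.max_turns:
            self.turns.popleft()

    def clear(self) -> None:
        # Called at session end so no conversational data is retained.
        self.turns.clear()


@dataclass
class LongTermMemory:
    """Counts follow-up questions per context across past sessions."""
    follow_ups: dict = field(default_factory=dict)

    def record(self, context: str, question: str) -> None:
        self.follow_ups.setdefault(context, Counter())[question] += 1

    def likely_question(self, context: str, min_count: int = 2):
        counts = self.follow_ups.get(context)
        if not counts:
            return None
        question, count = counts.most_common(1)[0]
        return question if count >= min_count else None


def infer_context(scene_labels: set, location: str) -> str:
    """Toy multi-source rule (assumption): image labels plus device location."""
    if location == "home" and {"frying pan", "eggs"} & scene_labels:
        return "cooking"
    return "general"


def build_prompt(scene_labels, location, long_term, short_term, consent):
    """Compose a description request, adding an anticipated question."""
    prompt = "Describe the image."
    if not consent:  # personal data is used only with explicit consent
        return prompt
    anticipated = long_term.likely_question(infer_context(scene_labels, location))
    if anticipated:
        prompt += f" Also answer: {anticipated}"
    if short_term.turns:
        prompt += f" Previous question this session: {short_term.turns[-1][0]}"
    return prompt
```

In this sketch, a user who has repeatedly asked about eggshells while cooking would have that question appended to the first request, which corresponds to the simplified interaction in Figure 4 (bottom). Without consent, the request falls back to a plain description.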

Cognitive Load Theory (CLT) suggests that well-designed interactions can significantly reduce users’ extraneous load while enhancing the effective management of germane load [18, 53]. Following the principles of CLT, we recommend using the above design to enable Be My AI to minimize unnecessary clarifying prompts, thereby reducing users’ cognitive load.

5.1.2. AI Deferral Learning for Identity Interpretations.

Our findings elucidated Be My AI’s capabilities and limitations in interpreting identity attributes. Although Be My AI can describe aspects like gender, age, appearance, and ethnicity, it may make errors due to its reliance on stereotypical indicators and its inability to interpret non-visible details (Section 4.1.3). However, Stangl et al.’s work [51] pointed out that PVI seek identity interpretations from AI assistants across various contents, including browsing social networking sites where our participants reported using Be My AI.

This reveals a tension between PVI’s interests in knowing about identity attributes and the AI’s challenges in providing reliable information [26]. The conflict arises because attributes such as age and gender are not purely perceptual and cannot be accurately identified by visual cues alone. However, RSAs can draw on contextual clues, past interactions, and cultural knowledge to make more nuanced observations about these human traits. These social strategies are not typically accessible to AI systems.

To mitigate these issues, we propose adopting a deferral learning architecture [42, 48], where an AI model learns when to defer decisions to humans and when to make decisions by itself. As detailed by Han et al. [25] and illustrated in Figure 5, this architecture creates a three-stage information flow:

Stage 1: It begins when users submit image-based queries to Be My AI. At this stage, the system uses a detection mechanism to identify sensitive contents, focusing particularly on those involving human physical traits. Current large language models have already incorporated such mechanisms [7, 46]; however, they still struggle to interpret human identity with consistent accuracy [26].
Stage 2: Rather than declining sensitive requests outright, Be My AI redirects these queries to RSAs. This maintains the system’s helpfulness while ensuring accurate responses.
Stage 3: RSAs provide descriptions by leveraging contextual understanding, such as analyzing the user’s current environment and cultural background.

Fig. 5. — Handoff between the user, Be My AI, and RSA for identity interpretations.
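The three-stage flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Query`, `IDENTITY_TERMS`, `is_identity_sensitive`, and `route` are hypothetical, and a fixed keyword rule stands in for the learned deferral policy described in [42, 48], which would learn from data when to defer.

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical term list; a real system would use a trained detector instead.
IDENTITY_TERMS = {"age", "gender", "ethnicity", "race"}


@dataclass
class Query:
    image_id: str   # placeholder for the submitted image
    question: str   # the user's question about the image


def is_identity_sensitive(query: Query) -> bool:
    """Stage 1: flag questions that mention a human identity attribute."""
    # Match whole words so that "image" does not trigger on "age".
    words = set(re.findall(r"[a-z]+", query.question.lower()))
    return bool(words & IDENTITY_TERMS)


def route(
    query: Query,
    ai_describe: Callable[[Query], str],
    rsa_describe: Callable[[Query], str],
) -> Tuple[str, str]:
    """Stages 2 and 3: defer flagged queries to an RSA, else answer with AI.

    Returns the responder label together with the response text so the
    user can always tell who produced the description.
    """
    if is_identity_sensitive(query):
        return "rsa", rsa_describe(query)
    return "ai", ai_describe(query)
```

Returning the responder label alongside the description is one way to keep the handoff transparent to the user; in a learned variant, the keyword rule would be replaced by a model trained on past RSA responses.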

In contrast to prior work that addresses stereotypical identity interpretation through purely computational approaches [49, 55, 57], our proposed AI deferral learning takes a hybrid human-AI approach that combines AI capabilities with human expertise. While previous AI-only solutions have made progress in reducing bias, they still struggle with identity interpretation [26]. The challenges arise not only from technical issues but also from the ontological and epistemological limitations of social categories (e.g., the inherent instability of identity categories), as well as from social context and salience (e.g., describing a photograph of the Kennedy assassination merely as “ten people, car”). Our system leverages RSAs, who are likely to be more experienced and more successful at interpreting people’s identities because they can draw on human perception and real-world experience. RSAs can interpret subtle contextual cues, understand cultural nuances, and adapt to diverse presentation styles that may challenge AI systems. Through the AI deferral learning architecture, AI assistants can learn continuously from human assistants’ responses and improve their ability to handle similar situations. The three-way interactions between users, AI assistants, and RSAs generate rich contextual data that can enhance the AI system’s identity detection mechanisms.

5.1.3. Fact-Checking for AI Hallucination Problem.

Our findings highlighted that the AI-generated detailed descriptions helped users understand their physical surroundings. However, there were instances where Be My AI hallucinated, i.e., incorrectly added non-existent details to the descriptions, which led to confusion (Section 4.1.1). In fact, hallucination is a known problem for the large language models upon which Be My AI is built [21]. Current approaches to address this problem include chain-of-thought (CoT) prompting [58], self-consistency [56], and retrieval-augmented generation (RAG) [38].

In CoT prompting [58], users ask an AI model to show its reasoning steps, like solving a math problem step by step rather than simply giving the final answer. It is similar to the “think aloud” protocol in HCI. Self-consistency [56] is an extension of CoT prompting. Instead of generating just one chain of thought, the model is asked to generate multiple different reasoning paths for the same task. Each reasoning path might arrive at a different answer. The model then takes a “majority vote” among these different answers to determine the final response. In RAG [38], AI models are provided with relevant information retrieved from a vector store as “context” to reduce factual errors in their responses.
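The “majority vote” step of self-consistency can be illustrated with a minimal Python sketch. It assumes the final answers from several reasoning paths have already been collected as short strings; the function name `majority_vote` and the simple lowercase normalization are illustrative choices, not part of the method in [56].

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the share of paths that agree.

    Answers are compared after trimming whitespace and lowercasing, which
    is a simplifying assumption for this sketch.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    normalized = [a.strip().lower() for a in answers]
    # most_common(1) returns the top (answer, count) pair; on a tie,
    # the answer that appeared first in the list is returned.
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)
```

The agreement share gives a rough confidence signal: a low share suggests that the reasoning paths disagree and the response deserves closer checking.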

In light of these techniques, users can adopt various prompting strategies that mirror CoT prompting, self-consistency, and RAG. We outline some potential strategies below.

Part-Whole Prompting: This strategy parallels CoT prompting. A user sends an image to Be My AI and requests an initial overall description, followed by a systematic breakdown that justifies this description. For example, users might first ask for a description of the image as a “whole”, then request that the image be divided into smaller “parts”, like a 3 × 3 grid, with each grid cell described individually. If the descriptions of the individual parts align coherently, it increases the likelihood that the overall description is accurate. This approach requires processing more information; however, it provides users with greater confidence in the AI’s response, as it enables them to verify consistency between the whole and its constituent parts.
Prompting from Multiple Perspectives: This strategy resembles the self-consistency technique. A user sends an image to Be My AI and requests multiple descriptions from different perspectives. For example, users might ask for one description that focuses on the background and another that emphasizes foreground objects. Users can also request descriptions from the viewpoint of objects within the image (e.g., “How would a person sitting on a chair see this scene?” and “How would a person sitting on the floor see this scene?”). While gathering descriptions from multiple perspectives may increase the likelihood of hallucination, it can also help identify common elements that appear consistently across different viewpoints, potentially indicating true features of the image.
Prompting with Human Knowledge: This strategy resembles the RAG approach. A user sends an image to Be My AI, provides their current understanding of the image and its context, and requests a description that complements their knowledge. For example, in Figure 2, users can specify that someone took the picture in a kitchen environment and that it should show a frying pan containing eggs. Users possess this knowledge through their familiarity with physical environments, self-exploration, spatial memory, and touch [21]. The user-provided knowledge will help the AI model ground its response in an accurate context.
Pairing with Remote Human Assistants: While the previous three strategies rely on multiple prompting and response aggregation to identify facts, this approach leverages the traditional remote sighted assistance framework. This strategy (shown in Figure 6) differs from the deferral learning framework (Section 5.1.2) in that users forward the AI responses to human assistants, rather than the AI assistant deferring to humans for the response. In this strategy, a user first sends an image to Be My AI to receive a description. When users suspect inaccuracies through triangulation [21], such as descriptions that conflict with their spatial memory or common sense (e.g., implausible objects like a palm tree in a cold region), they can request an RSA to fact-check the description. The RSA then verifies the description and sends corrected information back to the user. This verification process is likely easier and faster for an RSA than composing a description from scratch, as the RSA’s work involves checking rather than creating content.

Fig. 6. — Handoff between the user, Be My AI, and RSA for fact-checking.
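To make the Part-Whole Prompting strategy above more concrete, the following Python sketch shows one possible way to divide an image into a 3 × 3 grid and to compare the objects named in the whole-image description against those named in the part descriptions. It is an assumption-laden illustration: `grid_cells` and `unsupported_objects` are hypothetical helpers, and the sketch presumes that object names have already been extracted from each description.

```python
from typing import Iterable, List, Set, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_cells(width: int, height: int, rows: int = 3, cols: int = 3) -> List[Box]:
    """Return pixel boxes that tile the image, in row-major order."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            # Integer division keeps the boxes contiguous and ensures the
            # last row and column end exactly at the image border.
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def unsupported_objects(
    whole_objects: Iterable[str], part_objects: Iterable[Iterable[str]]
) -> Set[str]:
    """Objects named in the whole description but in none of the parts.

    Such objects are candidates for hallucination and may merit a
    follow-up question or a human check.
    """
    seen = set().union(*part_objects)
    return set(whole_objects) - seen
```

An object that appears only in the whole-image description is not necessarily hallucinated (it may straddle several cells and be missed in each), so the result is best treated as a prompt for verification rather than a verdict.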

In summary, AI hallucination presents both challenges and opportunities. By addressing these issues, future work will strengthen the way users, AI models, and human assistants interact with each other.

5.2. Towards Multi-Agent Systems for Assisting Visually Impaired Users

This section examines the transition from human-human interactions to human-AI and AI-AI systems in supporting PVI. We explore how these multi-agent systems, which involve the collaborative efforts of multiple agents (AI or human), are designed to adaptively meet the diverse needs of PVI.

Lee et al. [36] identified four contexts in which a professional human-assisted VQA system (Aira) offers support to PVI. The type of information required by PVI is incremental across these contexts. First, scene description and object identification require information about “what is it.” Second, navigation requires descriptions of PVI’s surroundings and obstacles (“what is it”) as well as directional information (“where is it” and “how to get to the destination”). Third, task performance, such as putting on lipstick, cooking, or teaching a class, requires description (“what is it”) and domain knowledge on “how to do it.” Fourth, social engagement, such as helping PVI in public spaces or interacting with other people, requires description (“what is it”), directional information to navigate in social space, and discreet communication (PVI prefer not to disclose their use of VQA systems).

Our study reported how participants used Be My AI for tasks like matching outfits and assessing makeup, which fall under the category of task performance. Some participants raised concerns about the accuracy of Be My AI’s interpretations and suggestions, indicating their preference for human subjectivity in this context. In contrast, Lee et al. [36] highlighted that remote sighted assistants (RSAs), even professional RSAs from Aira, sometimes lack the specialized information or domain knowledge required for task performance and therefore need to collaborate with other RSAs to find solutions.

Furthering this investigation, Xie et al. [60] paired two RSAs to assist one visually impaired user in synchronous sessions, validating the need for RSAs to complement each other’s description in task performance like aiding the user in applying makeup and matching outfits. They also explored the challenges in this human-human collaboration, revealing collaboration breakdowns between two opinionated RSAs. To address these issues, they proposed a collaboration modality in which one “silent” RSA supports the other RSA by researching but not directly communicating. This approach suggested that two RSAs in this multi-agent system should not deliver information simultaneously but have a clear division of labor, designating who takes the lead, to avoid overwhelming PVI with information.

Transitioning from human-human to human-AI collaboration, the handoff between the user, Be My AI, and RSA (Section 5.1) opens up new opportunities for multi-agent systems. Our proposed modality of human-AI collaboration integrates the scalable, on-demand capabilities of AI-based visual assistance with the contextual understanding and adaptability of RSAs. This multi-agent system involves the AI system recognizing its own limitations and seamlessly handing off tasks to an RSA when appropriate. This collaboration aligns with prior work [60], where AI (Be My AI) and human (RSA) maintain a clear division of labor, minimizing cumbersome back-and-forth and reducing potential confusion for PVI.

Looking ahead, we envision the potential for AI-AI collaboration as part of the future multi-agent systems to assist PVI, especially for task performance. A domain-specific AI expert can be trained to handle more specialized tasks such as matching outfits, performing mathematical computations, or answering chemistry-related questions. Be My AI, as the core AI system, can provide general visual descriptions (“what is it”) and delegate more specialized tasks requiring domain knowledge to the domain-specific AI expert. This approach is in line with the human-AI collaboration (Section 5.1) by ensuring effective handoffs when necessary. By leveraging AI agents with more specialized capabilities, this multi-agent system can better adapt to PVI’s needs.

However, similar to concerns around human-human and human-AI interactions, these AI-AI collaborations must be carefully designed with clear protocols and handoff points for transitioning tasks between AI agents. It is important to make these transitions as seamless and transparent as possible to PVI, thereby avoiding any complexity or confusion.
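One way to picture the envisioned AI-AI delegation with explicit handoff points is the small Python sketch below. Everything in it is hypothetical: the `Coordinator` class, the domain labels, and the stub agents are illustrative stand-ins, and how a question's domain would be detected is left open.

```python
from typing import Callable, Dict, Optional, Tuple

Agent = Callable[[str], str]  # takes a question, returns a response


class Coordinator:
    """Core assistant that delegates to domain experts when one is registered."""

    def __init__(self, general: Agent) -> None:
        self.general = general
        self.experts: Dict[str, Agent] = {}

    def register(self, domain: str, expert: Agent) -> None:
        """Declare a handoff point for one specialized domain."""
        self.experts[domain] = expert

    def handle(self, question: str, domain: Optional[str] = None) -> Tuple[str, str]:
        """Answer a question and report which agent produced the answer.

        Unknown or missing domains fall back to the general agent, so a
        handoff only happens along an explicitly registered path.
        """
        if domain is not None and domain in self.experts:
            return domain, self.experts[domain](question)
        return "general", self.general(question)
```

Reporting the responding agent with every answer is one simple means of keeping transitions transparent to PVI, since the interface can announce a handoff instead of switching agents silently.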

5.3. Towards Real-Time Video Processing in LMM-based VQA Systems

One of the most significant advantages of Be My AI and other LMM-based assistive tools is their ability to provide contextually relevant and personalized assistance to users. By leveraging machine learning and natural language understanding, these systems can understand and respond to a wide range of user queries. This level of contextual awareness represents a significant advancement over pre-LMM-based assistive technologies, which often fail to adapt to the diverse needs and preferences of individual users.

However, our findings also identified several challenges and limitations associated with the reliance on static images by current LMM-based assistive tools. Participants in our study reported frustration with the need to take multiple pictures to capture the desired information, a process they found time-consuming and cognitively demanding (Section 4.2.3). This iterative process hinders efficiency and also poses safety risks, as participants struggled with taking images while navigating around obstacles.

To mitigate these issues, integrating real-time video processing capabilities into future LMM-based VQA systems could offer significant benefits. Our findings suggest that the dynamic nature of video serves as a foundation for subsequent guidance (Section 4.2.3), which is currently provided by human assistants through video-based remote sighted assistance. Shifting to real-time video processing would allow LMM-based VQA systems to transition from identifying objects (answering “what is it”) to offering practical advice (addressing “how to do it”), such as how to adjust the camera angle or how to navigate to a destination. By continuously analyzing the user’s surroundings through real-time video feeds, these systems can dynamically interpret changes and provide immediate feedback, thus eliminating the need for static image captures. This capability would enhance the user experience by offering seamless navigation aid in real time.

The feasibility of real-time video processing is supported by existing technologies demonstrated in commercial products and research prototypes. For instance, systems that utilize sophisticated algorithms for real-time object segmentation in video streams [54] have shown significant potential in other domains. Building on these techniques for video analysis could significantly extend the capabilities of future LMM-based VQA systems.

Transitioning from static image analysis to real-time video processing can alleviate the burden of iteratively taking pictures and adjusting angles experienced by users. It can also enhance the utility and safety of LMM-based VQA systems, particularly during navigation. This progression, driven by ongoing advancements in machine learning and computer vision, is essential for the development of more adaptive and responsive assistive technologies that align with the dynamic nature of real-world environments.

5.4. Limitations

This work has several limitations. First, we completed the interviews prior to March 2024 and gathered Be My AI’s image descriptions from its official release until March 31, 2024. Thus, our analysis did not encompass versions of Be My AI released after March 2024. Given the frequent updates and evolution of Be My AI, there may be feature changes or enhancements introduced after March 2024 that were not considered in our study. Future research can investigate the capabilities and limitations of newer versions of Be My AI to provide updated insights. Second, while our focus on initial user experiences provided insights into early-stage interactions with Be My AI, this early-phase data collection resulted in a limited dataset. Future work can incorporate larger real-world datasets over extended periods of use and explore more diverse use cases to provide a more comprehensive understanding of Be My AI’s capabilities and limitations. Third, the interview participants live in the United States, and only image descriptions written in English were collected and analyzed. Thus, this study may not reflect perspectives outside the United States or from non-English-speaking contexts. Future research can explore the usage and limitations of Be My AI across broader and more diverse cultural backgrounds. Fourth, Be My AI cannot store conversation history. This limited functionality prevented users from sharing their ongoing interactions with Be My AI. Thus, we were unable to fully understand users’ prior experiences or analyze the data within broader contexts.

6. CONCLUSION

This study investigates the application of LMMs, particularly through Be My AI, to enhance accessibility for PVI. Our research explores both the capabilities and limitations of Be My AI by interviewing 14 visually impaired users and analyzing image descriptions generated by it. We identify significant limitations in Be My AI’s context-aware and intent-oriented capabilities, including AI hallucinations, subjective interpretations in social and stylistic contexts, inaccurate recognition of people’s identities, and inconsistent support in understanding and acting on user intentions. These challenges often lead users to rely on human assistance or personal strategies to compensate. Informed by these findings, we propose strategies to enhance interaction between PVI, Be My AI, and remote sighted assistants, emphasizing streamlined interactions, more accurate identity recognition, and reduced AI errors. We also highlight the potential of multi-agent systems, fostering collaboration among humans and AI, and suggest exploring AI-AI cooperation for tasks requiring specialized knowledge.

CCS Concepts:

• Human-centered computing → Accessibility; Empirical studies in accessibility.

A. APPENDIX

Table 2.

Codebook with Category, Example Tasks, and Codes

Category	Explanation	Example Tasks	Codes
Physical Environment Understanding	Be My AI enhances spatial awareness in both indoor and outdoor settings through detailed and structured visual information. Challenges include AI hallucinations that require verification through user’s existing spatial knowledge or human assistance, and the need for users to complement the tool with their auditory perception for accurate object location.	Describe scenes like theaters, room layouts, neighborhood, and holiday decorations; Locate dropped objects like earbuds	Environmental familiarity, Object location identification, Concerns in AI hallucinations, Human verification needs, Multi-modal perception integration
Subjective Interpretations and Suggestions	Be My AI provides subjective interpretations in animal interaction and fashion contexts, such as animals’ emotional states and fashion suggestions. Challenges include lack of accuracy or context sensitivity, leading users to rely on personal judgment or sighted assistance for validation.	Human-animal interactions like taking pictures of birds, dogs, horses; Fashion help like describing or matching colors and patterns of outfits, checking makeup	Subjective interpretation capability, Untrust in subjective interpretations, Agency in personal judgment
Identity Recognition	Be My AI enhances the perception of people by conveying identity attributes like age, gender, and ethnicity in image descriptions. Challenges in accuracy underscore the need for sensitivity and contextual awareness in identity depiction.	Human social interactions like describing online photos, checking photos before posting online	Identity attribute recognition, User concerns in identity recognition
Goal Understanding Dialogue	Be My AI’s “ask more” function enables users to inquire specific information, supporting independent information gathering. However, Be My AI often requires explicit user guidance to focus on relevant details, as it struggles to infer user intentions from initial queries.	Cooking and dining like checking eggshells; Helping with household appliances like rotating dials on washers and thermostats	Query refinement capability, Intention inference limitations, User prompting for goal understanding
Goal Achievement Support	Be My AI understands users’ goals but faces challenges in proactively suggesting further actions to align with the goals, requiring human assistance to maintain focus and achieve goals.	Identifying central puzzle piece, Adjusting camera angle	Proactive guidance limitations, Human assistance for continuous and actionable guidance
Real-time Navigation Assistance	Be My AI assists with localization and orientation tasks through static image interpretations, but shows limitations in providing real-time and comprehensive navigational feedback. Users often need to complement its capabilities with human assistance and O&M skills for safe navigation.	Navigating users in airports like reading signage and gate numbers; Finding restaurants	Localization and orientation capability, Static image limitations, Navigational risks, Human assistance for real-time interactions, Essential role of O&M skills

Contributor Information

JINGYI XIE, Pennsylvania State University, USA.

RUI YU, University of Louisville, USA.

HE ZHANG, Pennsylvania State University, USA.

SYED MASUM BILLAH, Pennsylvania State University, USA.

SOOYEON LEE, New Jersey Institute of Technology, USA.

JOHN M. CARROLL, Pennsylvania State University, USA.

REFERENCES

[1].2024. Aira. https://aira.io/.
[2].2024. Announcing “Be My AI,” Soon Available for Hundreds of Thousands of Be My Eyes Users. Retrieved September 1, 2024 from https://www.bemyeyes.com/blog/announcing-be-my-ai
[3].2024. Be My Eyes - See the world together. https://www.bemyeyes.com/.
[4].Achiam Josh, Adler Steven, Agarwal Sandhini, Ahmad Lama, Akkaya Ilge, Aleman Florencia Leoni, Almeida Diogo, Altenschmidt Janko, Altman Sam, Anadkat Shyamal, et al. 2023. GPT-4 Technical Report. arXiv:2303.08774 (2023).
[5].Adnin Rudaiba and Das Maitraye. 2024. “I look at it as the king of knowledge”: How Blind People Use and Understand Generative AI Tools. people 16, 54 (2024), 92.
[6].Ahmetovic Dragan, Sato Daisuke, Oh Uran, Ishihara Tatsuya, Kitani Kris, and Asakawa Chieko. 2020. Recog: Supporting blind people in recognizing personal objects. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–12.
[7].Bai Yuntao, Jones Andy, Ndousse Kamal, Askell Amanda, Chen Anna, DasSarma Nova, Drain Dawn, Fort Stanislav, Ganguli Deep, Henighan Tom, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022).
[8].Baranski Przemyslaw and Strumillo Pawel. 2015. Field trials of a teleassistance system for the visually impaired. In 2015 8th International Conference on Human System Interaction (HSI). IEEE, 173–179.
[9].Bendel Oliver. 2024. How Can Generative AI Enhance the Well-being of Blind? arXiv:2402.07919 (2024).
[10].Bennett Cynthia L, Gleason Cole, Scheuerman Morgan Klaus, Bigham Jeffrey P, Guo Anhong, and To Alexandra. 2021. “It’s complicated”: Negotiating accessibility and (mis) representation in image descriptions of race, gender, and disability. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–19.
[11].Bigham Jeffrey P, Jayant Chandrika, Ji Hanjie, Little Greg, Miller Andrew, Miller Robert C, Miller Robin, Tatarowicz Aubrey, White Brandyn, White Samual, et al. 2010. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23nd annual ACM symposium on User interface software and technology. 333–342.
[12].Bigham Jeffrey P, Jayant Chandrika, Miller Andrew, White Brandyn, and Yeh Tom. 2010. VizWiz:: LocateIt-enabling blind people to locate objects in their environment. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops. IEEE, 65–72.
[13].Braun Virginia and Clarke Victoria. 2006. Using thematic analysis in psychology. Qualitative research in psychology 3, 2 (2006), 77–101.
[14].Bujacz M, Baranski P, Moranski M, Strumillo P, and Materka A. 2008. Remote guidance for the blind—A proposed teleassistance system and navigation trials. In 2008 Conference on Human System Interactions. IEEE, 888–892.
[15].Burton Michele A, Brady Erin, Brewer Robin, Neylan Callie, Bigham Jeffrey P, and Hurst Amy. 2012. Crowdsourcing subjective fashion advice using VizWiz: challenges and opportunities. In Proceedings of the 14th international ACM SIGACCESS conference on Computers and accessibility. ACM, 135–142.
[16].Caldwell Ben, Cooper Michael, Reid Loretta Guarino, Vanderheiden Gregg, Chisholm Wendy, Slatin John, and White Jason. 2008. Web content accessibility guidelines (WCAG) 2.0. WWW Consortium (W3C) 290, 1–34 (2008), 5–12.
[17].Carroll John M., Lee Sooyeon, Reddie Madison, Beck Jordan, and Rosson Mary Beth. 2020. Human-Computer Synergies in Prosthetic Interactions. IxD&A 44 (2020), 29–52. http://www.mifav.uniroma2.it/inevent/events/idea2010/doc/44_2.pdf
[18].Chandler Paul and Sweller John. 1991. Cognitive load theory and the format of instruction. Cognition and instruction 8, 4 (1991), 293–332. doi:10.1207/s1532690xci0804_2
[19].Chaudary Babar, Paajala Iikka, Keino Eliud, and Pulli Petri. 2017. Tele-guidance based navigation system for the visually impaired and blind persons. In eHealth 360. Springer, 9–16.
[20].Garaj Vanja, Jirawimut Rommanee, Ptasinski Piotr, Cecelja Franjo, and Balachandran Wamadeva. 2003. A system for remote sighted guidance of visually impaired pedestrians. British Journal of Visual Impairment 21, 2 (2003), 55–63.
[21].Gonzalez Penuela Ricardo E, Collins Jazmin, Bennett Cynthia, and Azenkot Shiri. 2024. Investigating Use Cases of AI-Powered Scene Description Applications for Blind and Low Vision People. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ‘24). Association for Computing Machinery, New York, NY, USA, Article 901, 21 pages. doi:10.1145/3613904.3642211
[22].Granquist Christina, Sun Susan Y, Montezuma Sandra R, Tran Tu M, Gage Rachel, and Legge Gordon E. 2021. Evaluation and comparison of artificial intelligence vision aids: Orcam myeye 1 and seeing ai. Journal of Visual Impairment & Blindness 115, 4 (2021), 277–285.
[23].Gurari Danna, Li Qing, Stangl Abigale J, Guo Anhong, Lin Chi, Grauman Kristen, Luo Jiebo, and Bigham Jeffrey P. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3608–3617.
[24].Gurari Danna, Zhao Yinan, Zhang Meng, and Bhattacharya Nilavra. 2020. Captioning images taken by people who are blind. In European Conference on Computer Vision. 417–434.
[25].Han Chaeeun, Mitra Prasenjit, and Billah Syed Masum. 2024. Uncovering Human Traits in Determining Real and Spoofed Audio: Insights from Blind and Sighted Individuals. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–14.
[26].Hanley Margot, Barocas Solon, Levy Karen, Azenkot Shiri, and Nissenbaum Helen. 2021. Computer vision and conflicting values: Describing people with automated alt text. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 543–554.
[27].Holmes Nicole and Prentice Kelly. 2015. iPhone video link facetime as an orientation tool: remote O&M for people with vision impairment. International Journal of Orientation & Mobility 7, 1 (2015), 60–68.
[28].Hong Jonggi, Gandhi Jaina, Mensah Ernest Essuah, Zeraati Farnaz Zamiri, Jarjue Ebrima, Lee Kyungjun, and Kacorri Hernisa. 2022. Blind Users Accessing Their Training Images in Teachable Object Recognizers. In Proceedings of the 24th International ACM SIGACCESS Conference on Computers and Accessibility. 1–18.
[29].Hunaiti Ziad, Garaj Vanja, and Balachandran Wamadeva. 2006. A remote vision guidance system for visually impaired pedestrians. The Journal of Navigation 59, 3 (2006), 497–504.
[30].Jiang Lucy, Jung Crescentia, Phutane Mahika, Stangl Abigale, and Azenkot Shiri. 2024. “It’s Kind of Context Dependent”: Understanding Blind and Low Vision People’s Video Accessibility Preferences Across Viewing Scenarios. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–20.
[31].Kamikubo Rie, Kato Naoya, Higuchi Keita, Yonetani Ryo, and Sato Yoichi. 2020. Support strategies for remote guides in assisting people with visual impairments for effective indoor navigation. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–12.
[32].Kupferstein Elizabeth, Zhao Yuhang, Azenkot Shiri, and Rojnirun Hathaitorn. 2020. Understanding the use of artificial intelligence based visual aids for people with visual impairments. Investigative Ophthalmology & Visual Science 61, 7 (2020), 932–932.
[33].Lasecki Walter S, Murray Kyle I, White Samuel, Miller Robert C, and Bigham Jeffrey P. 2011. Real-time crowd control of existing interfaces. In Proceedings of the 24th annual ACM symposium on User interface software and technology. ACM, 23–32.
[34].Lasecki Walter S, Thiha Phyo, Zhong Yu, Brady Erin, and Bigham Jeffrey P. 2013. Answering visual questions with conversational crowd assistants. In Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility. ACM, 18.
[35].Lee Sooyeon, Reddie Madison, Gurdasani Krish, Wang Xiying, Beck Jordan, Rosson Mary Beth, and Carroll John M.. 2018. Conversations for Vision: Remote Sighted Assistants Helping People with Visual Impairments. arXiv:1812.00148 [cs.HC]
[36].Lee Sooyeon, Reddie Madison, Tsai Chun-Hua, Beck Jordan, Rosson Mary Beth, and Carroll John M. 2020. The emerging professional practice of remote sighted assistance for people with visual impairments. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–12.
[37].Lee Sooyeon, Yu Rui, Xie Jingyi, Billah Syed Masum, and Carroll John M. 2022. Opportunities for human-AI collaboration in remote sighted assistance. In 27th International Conference on Intelligent User Interfaces. 63–78.
[38].Lewis Patrick, Perez Ethan, Piktus Aleksandara, Petroni Fabio, Karpukhin Vladimir, Goyal Naman, Küttler Heinrich, Lewis Mike, Yih Wen-tau, Rocktäschel Tim, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems.
[39].Liu Zhe, Chen Chunyang, Wang Junjie, Chen Mengzhuo, Wu Boyu, Huang Yuekai, Hu Jun, and Wang Qing. 2024. Unblind Text Inputs: Predicting Hint-text of Text Input in Mobile Apps via LLM. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ‘24). Association for Computing Machinery, New York, NY, USA, Article 51, 20 pages. doi:10.1145/3613904.3642939
[40].Microsoft. 2022. Seeing AI - Talking camera app for those with a visual impairment. https://www.microsoft.com/en-us/ai/seeing-ai.
[41].Morrison Cecily, Grayson Martin, Marques Rita Faia, Massiceti Daniela, Longden Camilla, Wen Linda, and Cutrell Edward. 2023. Understanding Personalized Accessibility through Teachable AI: Designing and Evaluating Find My Things for People who are Blind or Low Vision. In Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility. 1–12.
[42].Mozannar Hussein and Sontag David. 2020. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning. PMLR, 7076–7087.
[43].Mukhiddinov Mukhriddin, Abdusalomov Akmalbek Bobomirzaevich, and Cho Jinsoo. 2022. Automatic fire detection and notification system based on improved YOLOv4 for the blind and visually impaired. Sensors 22, 9 (2022), 3307.
[44].Mulligan Deirdre K and Nissenbaum Helen. 2020. The concept of handoff as a model for ethical analysis and design. The Oxford handbook of ethics of AI 1, 1 (2020), 233.
[45].Natalie Rosiana, Chang Ruei-Che, Sheshadri Smitha, Guo Anhong, and Hara Kotaro. 2024. Audio Description Customization. arXiv preprint arXiv:2408.11406 (2024).
[46].Perez Ethan, Huang Saffron, Song Francis, Cai Trevor, Ring Roman, Aslanides John, Glaese Amelia, McAleese Nat, and Irving Geoffrey. 2022. Red teaming language models with language models. arXiv preprint arXiv:2202.03286 (2022).
[47].Priyadarshini Ishaani, Sharma Rohit, Bhatt Dhowmya, and Al-Numay M. 2023. Human activity recognition in cyber-physical systems using optimized machine learning techniques. Cluster Computing 26, 4 (2023), 2199–2215. doi:10.1007/s10586-022-03662-8
[48].Raghu Maithra, Blumer Katy, Corrado Greg, Kleinberg Jon, Obermeyer Ziad, and Mullainathan Sendhil. 2019. The algorithmic automation problem: Prediction, triage, and human effort. arXiv preprint arXiv:1903.12220 (2019).
[49].Ramaswamy Vikram V, Kim Sunnie SY, and Russakovsky Olga. 2021. Fair attribute classification through latent space de-biasing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9301–9310.
[50].Scheggi Stefano, Talarico A, and Prattichizzo Domenico. 2014. A remote guidance system for blind and visually impaired people via vibrotactile haptic feedback. In 22nd Mediterranean Conference on Control and Automation. IEEE, 20–23.
[51].Stangl Abigale, Morris Meredith Ringel, and Gurari Danna. 2020. “Person, Shoes, Tree. Is the Person Naked?” What People with Vision Impairments Want in Image Descriptions. In Proceedings of the 2020 chi conference on human factors in computing systems. 1–13.
[52].Stangl Abigale, Verma Nitin, Fleischmann Kenneth R, Morris Meredith Ringel, and Gurari Danna. 2021. Going beyond one-size-fits-all image descriptions to satisfy the information wants of people who are blind or have low vision. In Proceedings of the 23rd International ACM SIGACCESS Conference on Computers and Accessibility. 1–15.
[53].Sweller John. 1988. Cognitive load during problem solving: Effects on learning. Cognitive science 12, 2 (1988), 257–285. doi:10.1016/0364-0213(88)90023-7
[54].Wang Haochen, Jiang Xiaolong, Ren Haibing, Hu Yao, and Bai Song. 2021. Swiftnet: Real-time video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1296–1305. [Google Scholar]
[55].Wang Tianlu, Zhao Jieyu, Yatskar Mark, Chang Kai-Wei, and Ordonez Vicente. 2019. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE/CVF international conference on computer vision. 5310–5319. [Google Scholar]
[56].Wang Xuezhi, Wei Jason, Schuurmans Dale, Le Quoc, Chi Ed, Narang Sharan, Chowdhery Aakanksha, and Zhou Denny. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171 (2022). [Google Scholar]
[57].Wang Zeyu, Qinami Klint, Karakozis Ioannis Christos, Genova Kyle, Nair Prem, Hata Kenji, and Russakovsky Olga. 2020. Towards fairness in visual recognition: Effective strategies for bias mitigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8919–8928. [Google Scholar]
[58].Wei Jason, Wang Xuezhi, Schuurmans Dale, Bosma Maarten, Xia Fei, Chi Ed, Le Quoc V, Zhou Denny, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837. [Google Scholar]
[59].Xie Jingyi, Reddie Madison, Lee Sooyeon, Billah Syed Masum, Zhou Zihan, Tsai Chun-hua, and Carroll John M. 2022. Iterative Design and Prototyping of Computer Vision Mediated Remote Sighted Assistance. ACM Transactions on Computer-Human Interaction (TOCHI) 29, 4 (2022), 1–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
[60].Xie Jingyi, Yu Rui, Cui Kaiming, Lee Sooyeon, Carroll John M., and Billah Syed Masum. 2023. Are Two Heads Better than One? Investigating Remote Sighted Assistance with Paired Volunteers. In Proceedings of the 2023 ACM Designing Interactive Systems Conference (DIS’23). 1810–1825. [DOI] [PMC free article] [PubMed] [Google Scholar]
[61].Xie Jingyi, Yu Rui, Lee Sooyeon, Lyu Yao, Billah Syed Masum, and Carroll John M. 2022. Helping Helpers: Supporting Volunteers in Remote Sighted Assistance with Augmented Reality Maps. In Designing Interactive Systems Conference. 881–897. [DOI] [PMC free article] [PubMed] [Google Scholar]
[62].Xie Jingyi, Yu Rui, Zhang He, Lee Sooyeon, Billah Syed Masum, and Carroll John M. 2024. BubbleCam: Engaging Privacy in Remote Sighted Assistance. In Proceedings of the CHI Conference on Human Factors in Computing Systems. 1–16. [DOI] [PMC free article] [PubMed] [Google Scholar]
[63].Yang Bufang, He Lixing, Liu Kaiwei, and Yan Zhenyu. 2024. VIAssist: Adapting Multi-modal Large Language Models for Users with Visual Impairments. arXiv:2404.02508 (2024). [Google Scholar]
[64].Yu Weihao, Yang Zhengyuan, Li Linjie, Wang Jianfeng, Lin Kevin, Liu Zicheng, Wang Xinchao, and Wang Lijuan. 2023. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490 (2023). [Google Scholar]
[65]. Zhang He, Falletta Nicholas J., Xie Jingyi, Yu Rui, Lee Sooyeon, Billah Syed Masum, and Carroll John M. 2024. Enhancing the Travel Experience for People with Visual Impairments through Multimodal Interaction: NaviGPT, A Real-Time AI-Driven Mobile Navigation System. arXiv:2410.04005 [cs.HC] https://arxiv.org/abs/2410.04005
[66]. Zhao Yi, Zhang Yilin, Xiang Rong, Li Jing, and Li Hillming. 2024. VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models. arXiv:2402.01735 (2024).
[67]. Zhong Wanjun, Guo Lianghong, Gao Qiqi, Ye He, and Wang Yanlin. 2024. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 19724–19731. doi:10.1609/aaai.v38i17.29946

