PLOS One. 2025 Apr 29;20(4):e0322078. doi: 10.1371/journal.pone.0322078

Urban walkability through different lenses: A comparative study of GPT-4o and human perceptions

Musab Wedyan 1, Yu-Chen Yeh 2, Fatemeh Saeidi-Rizi 1,*, Tai-Quan Peng 3, Chun-Yen Chang 4
Editor: Abeer Elshater
PMCID: PMC12040139  PMID: 40299853

Abstract

Urban environments significantly shape our well-being, behavior, and overall quality of life. Assessing urban environments, particularly walkability, has traditionally relied on computer vision and machine learning algorithms. However, these approaches often fail to capture the subjective and emotional dimensions of walkability, due to their limited ability to integrate human-centered perceptions and contextual understanding. Recently, large language models (LLMs) have gained traction for their ability to process and analyze unstructured data. With the increasing reliance on LLMs in urban studies, it is essential to critically evaluate their potential to accurately capture human perceptions of walkability and contribute to the design of more pedestrian-friendly environments. Therefore, a critical question arises: can LLMs such as GPT-4o accurately reflect human perceptions of urban environments? This study addresses this question by comparing GPT-4o's evaluations of visual urban scenes with human perceptions, specifically in the context of urban walkability. The research involved human participants and GPT-4o evaluating street-level images based on key dimensions of walkability, including overall walkability, feasibility, accessibility, safety, comfort, and liveliness. To analyze the data, text mining techniques were employed, examining keyword frequency, coherence scores, and similarity indices between participant- and GPT-4o-generated responses. The findings revealed that GPT-4o and participants aligned in their evaluations of overall walkability, feasibility, accessibility, and safety. In contrast, notable differences emerged in the assessment of comfort and liveliness. Human participants demonstrated broader thematic diversity and addressed a wider range of topics, whereas GPT-4o produced more focused and cohesive responses, particularly in relation to comfort and safety.
In addition, similarity scores between GPT-4o and the responses of participants indicated a moderate level of alignment between GPT-4o’s reasoning and human judgments. The study concludes that human input remains essential for fully capturing human-centered evaluations of walkability. Furthermore, it underscores the importance of refining LLMs to better align with human perceptions in future walkability studies.

1. Introduction

Walkability has garnered considerable attention across various disciplines such as urban planning, public health, and transportation [1–3]. The quality of the walking environment is also recognized as a crucial component in enhancing community development [4], improving the human experience at historical sites [5], and reducing carbon emissions [6,7].

Previous studies have constructed models to measure perceived walkability using panoramic street view images and virtual reality [8–10]. Additionally, machine learning techniques, such as ResNet, have been employed to objectively quantify walkability based on pedestrian visual perception [11]. Researchers have also applied deep learning algorithms to create walkability indices from micro- and macro-level urban features [12–14]. Collectively, the extensive application of street view imagery and deep learning algorithms has enabled the development of methods to assess pedestrian walkability.

Recently, alongside computer vision techniques in urban studies, large language models (LLMs) have become increasingly capable of performing a wide range of tasks, including text completion, sentiment analysis [15,16], and cross-language translation [17]. LLMs have also found applications in social science research, where they simulate human responses to survey questions on attitudes and behaviors [18,19]. The release of ChatGPT at the end of 2022 attracted global attention [20–22]. Building on this momentum, the newly introduced GPT-4o model, with its multimodal capabilities, has further expanded the possibilities. For example, it has been applied to medical data [23–26], fake news detection [27], education [28,29], business [30,31], agriculture [32], and social science [33]. These studies show that LLMs have been applied across diverse domains. However, although generative methods in the field of walkability are expected to grow [34] and their use in urban tasks to expand [35], the application of LLMs in the urban domain remains limited.

Overall, previous research has extensively utilized street view imagery and computer vision techniques to assess the physical attributes of walkable environments. However, the performance of LLMs in the field of urban walkability remains unexplored. Addressing this gap, we aimed to explore the alignment between human perceptions of walkability and those of GPT-4o, a representative LLM. We address the following question: how accurately do these models capture real-world human experiences of visual appeal? We examined the capabilities of GPT-4o in assessing the visual perception of walkability in urban areas by having it evaluate overall walkability, feasibility, accessibility, safety, comfort, and liveliness. By comparing paired images, we assessed its ratings, text responses, and sentiment scores against those of human participants. Our findings highlight the limitations of GPT-4o in accurately perceiving urban environments and point to opportunities for refining LLMs to better align with human perspectives.

2. Literature review

2.1. Walkability perception

Walkability is increasingly acknowledged as a key element in promoting healthy communities [36], as well as enhancing social interaction and economic vitality within neighborhoods [37]. Walkability is typically characterized as the degree to which a built environment is accessible and appealing to individuals [38], whether they walk out of necessity, preference, or social engagement [39]. It also refers to individuals’ perceptions of a street as a suitable place for walking [40]. As a subjective measure, walkability reflects the perceived quality of the environment and is shaped by personal assessments of its suitability for walking, making it challenging to quantify and assess objectively [41]. Collectively, these elements shape what is often referred to as perceived walkability [42,43].

The investigation of perceived walkability through subjective assessments has emerged as an effective approach to deepen our comprehension of the walking environment [44,45]. Among the various factors influencing perceived walkability, visual variety has emerged as a critical determinant of pedestrian satisfaction. Visual variety captures the richness and diversity of urban design elements that engage and attract pedestrians, enhancing the overall appeal of a space [46]. Drawing on Maslow's hierarchy of needs, perceived walkability has been proposed to consist of five dimensions: feasibility, accessibility, safety, comfort, and pleasurability [46]; other researchers evaluate it through four key dimensions: comfort, safety, utility, and appeal [46,47]. Further studies have referred to visual variety using terms such as imageability, complexity, transparency [40], and positive sensory experiences [48]. Building on this foundation, walkability has been systematically evaluated using the five visual indicators established in early research [46].

According to [46], these five indicators, together with an overall visual walkability measure, form the six categories of Visual Walkability Perception, providing a comprehensive framework for determining whether an environment visually supports walking. The visual walkability indicator gives an overall assessment of whether a location visually supports walking. Feasibility refers to factors that encourage walking, influenced by land use types and the diversity of available facilities. Accessibility addresses visible obstacles, such as dead-end streets or restricted access areas. Safety assesses a street’s security based on crime, traffic accidents, and visual cues like graffiti, litter, and neglected buildings. Comfort examines how the street environment enhances the pedestrian experience, factoring in elements like street furniture, sidewalk width, urban design features, and accessibility facilities. Finally, pleasurability assesses the appeal of public spaces, reflecting how diverse, lively, enjoyable, and interesting they are for walking.

Yet walkability remains inherently subjective and challenging to quantify. A recent review of walkability research over time concluded that measurement methods have shifted dramatically [49]. Early studies largely relied on measurement-based methods such as GIS-based assessments and physical and image audits [50–56], as well as mixed-method approaches [57]. Recent years, however, have seen a growing emphasis on micro-level, street-based evaluations and street view imagery (SVI) [58,59], with applications including measuring psychological greenery and visual crowdedness [60] and object importance [61].

Although convolutional neural network (CNN)-based analysis of SVI has been effective in identifying physical features [62], it comes with notable limitations. SVI and CNNs do not inherently integrate textual opinions or subjective feedback from participants, which are crucial for understanding perceived walkability attributes such as safety, aesthetics, or comfort [63,64]. Research has also highlighted the importance of textual opinions, such as those gathered from social media or surveys, to complement SVI in capturing the emotional and subjective dimensions of walkability [65,66]. This lack of interpretability makes it challenging for urban planners and policymakers to grasp the “reasoning” behind an algorithm’s assessment.

Unlike CNNs, which focus on recognizing visual features, LLMs can analyze both text and images, allowing them to interpret the broader context surrounding an environment [35]. By leveraging both visual and textual data, LLMs bridge the gap between physical environment analysis and subjective user experiences. For example, LLMs can integrate data from computer vision analyses of street view imagery with textual user feedback, enabling a comprehensive understanding of urban spaces [67]. This multimodal approach not only addresses the limitations of single-modality methods but also provides richer, more actionable insights for creating walkable cities.

2.2. ChatGPT advancements and uses

GPT-4o has showcased remarkable multimodal capabilities in decoding visual and textual content [68]. It has shown precision in functions such as detecting and identifying visual elements [69]. Another study explored the performance of GPT-4o in image classification by integrating images with textual descriptions, demonstrating that such a combination can significantly improve classification accuracy [70]. In addition, GPT-4o can identify and rank the perceived risk in traffic scenarios in images to some extent, although its evaluations do not always align with human judgment [71].

ChatGPT has attracted attention from individuals across different backgrounds such as healthcare and academia [20–22,72]. In the medical field, GPT-4 has played roles in enhancing radiology report assessments [73], conducting reviews on the digital twin concept [74], undertaking medical writing tasks [75], aiding in licensing exams [76] and medical education [77,78], and visualizing internal body structures for diagnosis and research. A recent overview also summarized GPT’s applications across mathematics [79], physics [80–82], and communication [83].

Despite the wide use of LLMs in different domains, there is a significant gap in understanding how these models perceive and evaluate outdoor environments compared to human perspectives. While previous studies have extensively explored human perceptions of urban environments through traditional methodologies such as surveys, mixed-methods approaches, and deep learning, it is important and methodologically desirable to compare GPT-4o's perceptions of outdoor environments, in terms of visual walkability, with human perceptions. This research is the first to investigate the alignment or divergence between LLM-based evaluations and human perceptions of walkability. It seeks to evaluate the potential of LLMs as reliable tools for assessing environmental factors influencing human experiences, such as walkability, an area that has not been systematically compared in prior research. Based on this, the research addresses the following questions: how do the keyword frequencies in GPT-4o-generated descriptions of paired images compare to those in human-generated descriptions? In what ways do the sentiment scores of GPT-4o-generated descriptions differ from those of human-generated descriptions? Additionally, how do the coherence and similarity indices of GPT-4o-generated responses compare to those of human responses?

3. Methodology

Fig 1 presents a detailed overview of the research process across three distinct phases. The first phase involves the evaluation of paired images by both human participants and GPT-4o. The second phase consists of two scenarios: one where both GPT-4o and human participants choose the first image as having a higher rating, and another where both select the second image. After aligning the responses based on the selected image, the third phase synthesizes the findings by analyzing response coherence and keywords to identify themes more highly rated by either GPT-4o or human participants. This phase also compares sentiment scores across responses, performs LDA topic modeling, and analyzes keyword frequencies.

Fig 1. Research Methodology.


3.1. Data collection

3.1.1. Image selection and questionnaire structure.

Images were collected from Lansing, East Lansing, and Williamston, Michigan, using a horizontally held iPhone 14 for consistency. The selected images aim to represent multiple perspectives of the urban environment, encompassing a broad spectrum of development in the chosen areas to illustrate the diversity present in city landscapes; the assessment of this variation was based on the authors' subjective judgment. The selection covered areas with various levels of greenery, pavement conditions, population density, vehicle and pedestrian flow, and spatial openness or constriction. It included both tree-lined streets and concrete-dominated areas, as well as crowded streets and sparsely populated spaces, reflecting a wide range of urban conditions. These diverse images provided a holistic view of walkability, as influenced by both natural elements and urban infrastructure, capturing how these factors shape perceptions of walkability for both human participants and GPT-4o.

Each image was assigned a unique identifier ranging from 1 to 106 to facilitate randomization and organization. The randomization was conducted using a Python script, which paired the images to create sets for comparison. This approach was employed to avoid bias in the selection process and to ensure a fair representation of diverse urban environments. Some images were excluded from the final survey: images 25, 32, 58, 59, 72, 84, 88, 96, and 102 were not selected during randomization. This resulted in a final selection of 48 unique image pairs that were included in the survey. The finalized image set was designed to provide a comprehensive range of urban conditions, capturing both natural and built features that influence walkability perceptions. The systematic pairing ensured that participants evaluated images reflecting real-world variability in urban design, allowing for robust comparisons of human and GPT-4o perceptions of walkability. Table 1 summarizes the number of respondents and the image pairs included in the survey and subsequent analysis.
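The shuffle-and-pair step described above can be sketched as follows. The paper does not publish its script, so the function name `pair_images` and the fixed seed are illustrative assumptions; the idea is simply to shuffle the 106 identifiers and take consecutive elements as pairs.

```python
import random

def pair_images(image_ids, seed=42):
    """Shuffle image identifiers and pair consecutive ones.

    Returns the list of pairs plus any odd leftover identifier.
    The seed is arbitrary; it just makes the pairing reproducible.
    """
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    pairs = [(ids[i], ids[i + 1]) for i in range(0, len(ids) - 1, 2)]
    leftover = ids[len(pairs) * 2:]
    return pairs, leftover

# 106 identifiers yield 53 candidate pairs; the study then dropped
# some pairs/images, ending with 48 pairs in the final survey.
pairs, unused = pair_images(range(1, 107))
```

A fixed seed is worth keeping in such a script so the pairing can be audited and regenerated later.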

Table 1. Number of respondents and image pairs used in different groups of human survey.
Group | Respondents | Overall walkability | Feasibility | Accessibility | Safety | Comfort | Liveliness
1 | 38 | 44, 1 | 60, 56 | 53, 100 | 64, 3 | 103, 87 | 8, 67
2 | 26 | 77, 19 | 105, 26 | 75, 83 | 35, 17 | 95, 101 | 41, 81
3 | 13 | 28, 94 | 98, 14 | 30, 2 | 42, 91 | 11, 43 | 15, 90
4 | 11 | 48, 12 | 65, 68 | 66, 50 | 86, 104 | 73, 9 | 18, 20
5 | 39 | 27, 45 | 23, 16 | 46, 51 | 34, 52 | 49, 62 | 61, 70
6 | 22 | 69, 54 | 37, 21 | 22, 57 | 85, 63 | 4, 71 | 36, 33
7 | 13 | 24, 97 | 40, 82 | 10, 89 | 31, 79 | 13, 99 | 106, 47
8 | 6 | 6, 39 | 74, 80 | 38, 55 | 29, 92 | 76, 93 | 78, 5

Participants in the main survey compared pairs of images of urban settings based on six aspects of walkability: overall walkability (ease and attractiveness of walking), feasibility (practicality of walking based on individual and environmental factors), accessibility (how well the area accommodates diverse abilities), safety (perceived security), comfort (pedestrian comfort level), and liveliness (vibrancy of the area). Before the primary survey, a pilot study was carried out to assess the survey design and the definitions of walkability. Participants understood the six key aspects, but minor adjustments were made to improve clarity and ensure consistent interpretation, and the survey was shortened for the main study due to its length. The final survey, consisting of 12 question sets, took about 15–20 minutes to complete, and participants were instructed to respond without using AI tools to ensure authentic responses. S1 Table shows the responses of the human participants for paired images.

This study was conducted in compliance with the ethical standards set forth by the Institutional Review Board (IRB) of Michigan State University (MSU). Approval was granted under protocol MSU Study ID: STUDY00010749, covering the period from May 13, 2024, to July 14, 2024. All participants provided written informed consent before participating. The consent process adhered to the guidelines approved by the MSU IRB to ensure participants were fully informed about the objectives of the study, the procedures, and their rights, including the option to withdraw at any time. After IRB approval, minor modifications were made to the study protocol: the survey was shortened based on pilot-study feedback to improve participant engagement. These modifications did not alter the core research objectives and remained within the scope of the original IRB approval.

The participants for this study were randomly selected through an online survey platform, which was distributed to a wide audience to ensure diversity in responses. The recruitment process did not target specific age groups, professions, or cultural backgrounds, allowing for a broad participant pool. This random selection approach helps mitigate recruitment biases and enhances the generalizability of the findings. While specific demographic targeting was not employed, the survey yielded 174 responses from individuals with diverse characteristics, including a range of ages (from 18 to over 65 years), professions (e.g., students, healthcare workers, educators, and urban planning professionals), and cultural backgrounds, as detailed in Section 4.1.

3.1.2. ChatGPT prompting.

Various strategies have been employed to enhance the output of GPT-4o, such as the use of composite images [84], comparing images in pairs [85], or employing multimodal cooperation [86]. Another technique is converting visual information into text using prompts like “What’s in this image?”. This method shows significant potential, especially when processing large volumes of images that appear in a temporal sequence [87]. However, minor variations in prompts can lead to inconsistent outputs [88]. To address this, methods such as self-consistency or bootstrapping re-prompt the model multiple times with different text permutations and aggregate the results; the aggregated output typically achieves higher accuracy than a single prompt [89–91].

In our study, we ensured consistency by using the same pairs of images in both the surveys and the GPT-4o web interface. For example, the image pair (44, 1) was submitted to GPT-4o 15 times, and in each run GPT-4o was prompted to provide its evaluation of overall walkability. Image pairs were uploaded as composite images into GPT-4o, and self-consistency techniques were used to improve the reliability of the model’s output. By aligning the image pairs used in both the surveys and the GPT-4o web interface, we ensured that the results were directly comparable. The individual images had a size of 4023 × 3024 pixels. GPT-4o treats the image on the left as the first image and the image on the right as the second. We prompted GPT-4o between 15 July and 1 August 2024. Each prompt was issued in a new chat window using GPT-4o’s temporary chat feature. The use of temporary chat windows ensured that each prompt was processed independently, avoiding any carryover effects or contextual memory from previous interactions. This minimized bias and ensured that the model’s output for each prompt was generated without influence from earlier conversations, enhancing the consistency and reliability of the results. In each prompt, we asked GPT-4o to rate the images from 1 to 10 and describe its perception of each walkability dimension. For example, when asking about overall walkability, we wrote the following prompt: “How do you rate the Walkability of this environment for both photos from 1–10? Based on the photo you rated higher, why do you think it is more Walkable? (Please explain your opinion in at least 20 words). Overall Walkability: This measures the ease and appeal of walking around the area shown in the image.”
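The self-consistency step (15 runs per image pair, then aggregation) can be sketched as below. The paper does not specify an aggregation rule, so the majority-vote choice, mean ratings, and the `aggregate_runs` helper with its run schema are illustrative assumptions:

```python
from collections import Counter
from statistics import mean

def aggregate_runs(runs):
    """Aggregate repeated GPT-4o evaluations of one image pair.

    Each run is a dict like {"choice": 1 or 2, "rating_1": x, "rating_2": y};
    this schema is assumed for illustration, not taken from the study.
    """
    choice_counts = Counter(r["choice"] for r in runs)
    majority, count = choice_counts.most_common(1)[0]
    return {
        "majority_choice": majority,
        "agreement": count / len(runs),  # share of runs picking the majority image
        "mean_rating_1": mean(r["rating_1"] for r in runs),
        "mean_rating_2": mean(r["rating_2"] for r in runs),
    }
```

For the 41 pairs where GPT-4o never switched images, `agreement` would be 1.0; for the 7 alternating pairs it would fall below 1.0, flagging them for closer inspection.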

4. Results

4.1. Demographic variables of respondents in the survey

The survey included 174 participants, with a balanced gender representation: 47% female, 50% male, and 2% preferring not to disclose their gender. The sample primarily consisted of younger adults, with 34% aged 18–25 and 33% aged 26–35, while participation decreased in older age groups (15% aged 36–45, 7% aged 46–55, 6% aged 56–65, and 1% over 65). Geographically, the majority of participants resided in urban areas (87%) compared to rural areas (12%). Most respondents were from the United States (61%), followed by Taiwan (15%), with smaller contributions from Jordan, Canada, China, Germany, and other countries (each 4% or less). Walking habits varied among participants: 22% reported walking daily, 27% walked four to five times a week, and 25% walked two to three times a week, while 12% walked once a week and 12% rarely walked. In terms of weekly walking duration, 52% walked for 30 minutes to 1 hour, 39% for less than 30 minutes, and 8% for 1–2 hours, while only 0.5% walked for more than 2 hours.

To maintain the integrity and consistency of the analysis, responses were filtered according to the six specified dimensions of walkability: overall walkability, feasibility, accessibility, safety, comfort, and liveliness. Exclusions were made based on predetermined criteria: responses were deemed “Equal” if participants assigned the same ratings to both images, and “Invalid” if they were incomplete, lacked significant differentiation, or were inconsistent for particular images. Notably, this exclusion process was conducted at the level of individual responses for specific images, rather than dismissing entire participants, thereby preserving valid responses for other perceptions from the same individuals. Following this filtering, the final counts of analyzed responses were 168 for overall walkability, 164 for feasibility, 162 for accessibility, 161 for safety, 165 for comfort, and 162 for liveliness. This selective methodology maximized participants’ contributions, facilitating a thorough comparison of human and GPT-4o responses across all perceptions. The total number of responses from participants for each perception included in the analysis is shown in Table 2.

Table 2. Number of responses of GPT-4o and participants’ responses for each perception.

Decision | Overall walkability | Feasibility | Accessibility | Safety | Comfort | Liveliness
Participants: First image | 68 (39%) | 69 (40%) | 70 (40%) | 74 (43%) | 109 (62%) | 73 (42%)
Participants: Second image | 100 (57%) | 95 (55%) | 92 (52%) | 87 (50%) | 56 (33%) | 89 (51%)
Participants: Equal | 2 (1%) | 4 (2%) | 2 (1%) | 3 (1%) | 3 (2%) | 9 (5%)
Participants: Invalid | 4 (3%) | 6 (3%) | 10 (7%) | 10 (6%) | 6 (3%) | 3 (2%)
Participants: Included in analysis | 168 | 164 | 162 | 161 | 165 | 162
Participants: Total responses | 174
GPT-4o: First image | 45 (38%) | 49 (41%) | 53 (44%) | 45 (38%) | 56 (47%) | 45 (38%)
GPT-4o: Second image | 75 (62%) | 71 (59%) | 67 (56%) | 75 (62%) | 64 (53%) | 75 (62%)
GPT-4o: Equal | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
GPT-4o: Invalid | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
GPT-4o: Total responses | 120

4.2. Consistency of GPT-4o responses

GPT-4o’s responses displayed a clear answering pattern: for 41 of the 48 image pairs, it consistently selected the same image (either the first or the second) across all prompts, while in the remaining 7 pairs it alternated between the two images. To determine the number of responses chosen for analysis, we calculated the similarity index across the 15 generated responses. S2 Table shows which image GPT-4o ranked higher and the number of chosen responses out of 15. For example, under the feasibility category, for image pairs (60, 56), (65, 68), and (23, 16), the second image was consistently chosen.

The similarity index was calculated by comparing responses, with the first response serving as the baseline reference. Pairwise comparisons were conducted across all 15 responses: the first response was compared to the second, then to the third, and so forth, until all responses had been evaluated. This enabled an assessment of the cumulative consistency of the responses as additional outputs were generated. S1–S6 Figs in the supporting information present the similarity index for each perception of walkability relative to the number of responses generated by GPT-4o. Although some fluctuation in the similarity index was observed, the values remained within a relatively narrow range. Based on this stability, 15 responses were selected for further analysis.
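The cumulative comparison against the first response can be sketched as follows. The paper does not state which text-similarity metric underlies this consistency check, so the simple token-overlap (Jaccard) measure and the `consistency_curve` helper here are illustrative stand-ins:

```python
def tokens(text):
    """Crude tokenizer: lowercase words with basic punctuation stripped."""
    return {w.strip(".,").lower() for w in text.split() if w}

def consistency_curve(responses):
    """Similarity of each later response to the first (baseline) response,
    reported cumulatively as more responses are generated."""
    base = tokens(responses[0])
    sims, curve = [], []
    for r in responses[1:]:
        t = tokens(r)
        sims.append(len(base & t) / len(base | t))  # Jaccard overlap
        curve.append(sum(sims) / len(sims))         # running average
    return curve
```

A flat curve over the 15 responses would mirror the stability the study reports; a drifting curve would suggest the model's rationale changes as prompts are repeated.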

4.3. Alignment between GPT-4o and human responses

To assess the alignment between the responses from GPT-4o and those from human participants, we matched the responses where GPT-4o and participants chose the same image (image 1 or image 2) in each pair. In 21 image pairs both chose the first image, and in 29 pairs both chose the second. The number of participants based on the chosen images is shown in Table 2. The results highlighted notable differences in image selection across perceptual categories. In terms of overall walkability, feasibility, accessibility, and safety, both participants and GPT-4o tended to select the second image, with participants choosing it 57%, 55%, 52%, and 50% of the time, respectively, while GPT-4o showed comparable preferences at 62%, 59%, 56%, and 62%. However, a key divergence was observed for comfort, where participants significantly favored the first image (62%) while GPT-4o leaned towards the second (53%). A similar discrepancy emerged for liveliness. Additionally, participant responses displayed more variability, with some instances of “equal” or “invalid” responses, the latter referring to cases where participants provided identical answers across multiple questions.

4.4. Comparing GPT-4o and human ratings

We compared the ratings between GPT-4o and human participants for two sets of images: Image 1 and Image 2. Independent samples t-tests were conducted to evaluate whether there were significant differences between the ratings assigned by humans and by GPT-4o. For Image 1, participants gave a mean rating of M = 7.80 (SD = 0.60), while GPT-4o assigned a mean rating of M = 7.42 (SD = 0.70). The analysis did not reveal a statistically significant difference between the two (t(N) = 1.91, p = .063), suggesting that GPT-4o’s ratings were generally similar to those of human participants for this set of images. For Image 2, participants rated the images with a mean of M = 7.65 (SD = 0.81), whereas GPT-4o provided a mean of M = 7.76 (SD = 0.47). The t-test indicated no significant difference between the two groups (t(N) = -0.64, p = .526), implying that GPT-4o’s ratings were not substantially different from those of the human participants in this case.
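Such independent-samples t-tests are normally run with a statistics package; as a self-contained sketch of the underlying computation, a pooled-variance Student's t statistic can be obtained as below (`students_t` is an illustrative helper, not the authors' code):

```python
import math
from statistics import mean, variance

def students_t(a, b):
    """Pooled-variance independent-samples t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    # Pooled sample variance across the two groups.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2
```

The p-value then comes from the t distribution with the returned degrees of freedom (e.g. via `scipy.stats`, if available).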

4.5. Content analysis

4.5.1. Similarity index between GPT-4o and participants’ responses.

We assessed the alignment between the textual responses of GPT-4o and human participants by examining the average similarity index to compare their reasoning and descriptive alignment. This was applied to all responses regardless of the perception. Text preprocessing steps, such as tokenization and stop-word removal, were applied before vectorization to ensure meaningful comparisons. This approach follows methodologies established in previous studies on text similarity [92].

The textual data was then vectorized, and cosine similarity was used to measure alignment between GPT-4o and human responses. Cosine similarity is a widely accepted metric for quantifying textual similarity by comparing vectorized representations of text, as discussed in [93, 94]. The threshold value for cosine similarity varies depending on the context, dataset, and application. It is often empirically determined or adaptively set to meet the needs of specific tasks, such as clustering, text classification, or similarity searches. Typical thresholds for high similarity range from 0.7 to 0.9, with values closer to 1.0 indicating stronger similarity, particularly in normalized datasets [95,96]. In the context of this study, a score above 0.4 is generally interpreted as moderate alignment, while scores closer to 0.6 or higher suggest stronger alignment.
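As a minimal sketch of this pipeline (tokenization, stop-word removal, vectorization, cosine similarity): the study's exact preprocessing and vector weighting are not specified, so the tiny stop-word list and raw term-frequency vectors below are illustrative assumptions.

```python
import math
from collections import Counter

# Toy stop-word list for illustration; real pipelines use a full list.
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "it"}

def vectorize(text):
    """Term-frequency vector after lowercasing, punctuation stripping,
    and stop-word removal."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOPWORDS)

def cosine_similarity(text_a, text_b):
    va, vb = vectorize(text_a), vectorize(text_b)
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

With this scale, identical wording scores 1.0, disjoint vocabularies score 0.0, and values around 0.45, as reported below for GPT-4o versus participants, indicate partial vocabulary overlap.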

The findings revealed a moderate degree of alignment, with an average similarity score of 0.4575 when the first image was chosen and 0.4615 for the second image. While this indicates that GPT-4o is capable of partially mimicking human decision-making processes, the results suggest that there are notable differences in how the two approach visual tasks, particularly in terms of depth and nuance.

4.5.2. Topic modeling.

We compared the responses from GPT-4o and human participants by looking into the topic modeling results and coherence scores across six categories of perception. Fig 2 shows the number of topics when the first image was selected by both groups. Human participants identified nine topics relating to overall walkability perception, with a coherence score of 0.358. In comparison, GPT-4o generated three topics, but with a slightly lower coherence score of 0.317. Regarding feasibility perception, humans produced five topics with a coherence score of 0.360, whereas GPT-4o identified only two topics, which resulted in a coherence score of 0.283. When it comes to accessibility perception, human participants identified eight distinct topics, achieving a coherence score of 0.392, while GPT-4o generated four topics with a score of 0.351. In terms of safety perception, both GPT-4o and human participants had similar outcomes. Human participants generated four topics with a coherence score of 0.377, whereas GPT-4o identified three topics and attained a slightly lower coherence score of 0.368. Interestingly, GPT-4o excelled in the comfort perception. Although it identified fewer topics (three), it achieved a higher coherence score of 0.378 compared to the four topics and a 0.356 score produced by human participants. For liveliness perception, human participants identified two topics with a notably higher coherence score of 0.458, while GPT-4o generated five topics, but with a lower score of 0.317.

Fig 2. Number of Topics (First Image).

Fig 2
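The first-image results above can be collected into a small lookup structure. The sketch below simply re-expresses the reported numbers as (topic count, coherence score) pairs and checks in which categories GPT-4o's coherence exceeded the human score:

```python
# Reported first-image results: (number of topics, coherence score)
results_first_image = {
    "walkability":   {"human": (9, 0.358), "gpt4o": (3, 0.317)},
    "feasibility":   {"human": (5, 0.360), "gpt4o": (2, 0.283)},
    "accessibility": {"human": (8, 0.392), "gpt4o": (4, 0.351)},
    "safety":        {"human": (4, 0.377), "gpt4o": (3, 0.368)},
    "comfort":       {"human": (4, 0.356), "gpt4o": (3, 0.378)},
    "liveliness":    {"human": (2, 0.458), "gpt4o": (5, 0.317)},
}

def more_coherent(category: str) -> str:
    """Return which group attained the higher coherence score in a category."""
    human_score = results_first_image[category]["human"][1]
    gpt_score = results_first_image[category]["gpt4o"][1]
    return "gpt4o" if gpt_score > human_score else "human"

# Categories where GPT-4o was the more coherent group for the first image
gpt4o_wins = [c for c in results_first_image if more_coherent(c) == "gpt4o"]
```

Running the check confirms the pattern described above: for the first image, comfort is the only category where GPT-4o's coherence score exceeds the human participants'.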

Fig 3 shows the number of topics when the second image was chosen. Both GPT-4o and human participants identified three topics related to walkability, but human responses were more cohesive, achieving a coherence score of 0.363 compared to GPT-4o’s 0.332. In assessing feasibility, humans identified five topics with a higher coherence score of 0.398, while GPT-4o found three topics with a coherence score of 0.348. For accessibility, human participants identified nine topics with a coherence score of 0.363, while GPT-4o generated only four topics with a much lower score of 0.277, suggesting that human responses captured a broader spectrum of accessibility-related themes in a more cohesive manner. For safety, GPT-4o’s coherence score was considerably lower (0.285) than the human participants’ score of 0.387, indicating that human responses were more coherent and comprehensive. In the comfort category, humans identified ten topics with a coherence score of 0.355, whereas GPT-4o identified seven topics with a slightly lower score of 0.333. Finally, for liveliness, human participants identified ten topics with a coherence score of 0.391, whereas GPT-4o identified only two topics, with a coherence score of 0.297.

Fig 3. Number of Topics (Second Image).

Fig 3

4.5.3. Top keywords.

When analyzing the responses to the first image, notable differences in the top keywords between human participants and GPT-4o reveal distinct approaches to perceiving outdoor spaces. For example, human responses frequently included experiential and descriptive words like “trees,” “shade,” “people,” and “comfortable,” reflecting a focus on the sensory experience and aesthetic quality of the environment. Humans often described how the space made them feel and how specific natural elements contributed to comfort and liveliness. Words such as “traffic” and “obstacles” further indicated that humans were concerned with practical aspects of safety. In contrast, GPT-4o focused on more structured and functional terms like “pedestrian,” “path,” “mobility,” and “amenities.” This language suggests that GPT-4o approached the image from a more technical perspective, emphasizing the design and infrastructure of the outdoor space, such as the presence of sidewalks, accessibility features, and pedestrian pathways. GPT-4o’s responses centered on how the space functions for movement and public use, with less attention to subjective feelings or sensory details.

In the second image, the differences between human and GPT-4o responses became even more pronounced. Humans continued to focus on the visual and sensory aspects of the environment, with frequent use of terms like “green,” “grass,” “quiet,” and “lively.” These keywords highlight the participants’ attention to natural features and the ambience of the space, indicating a more holistic perception that integrates the physical appearance of the area with its emotional impact. Human participants evaluated the space based on how peaceful or active it seemed, using language that suggested an assessment of its overall vibrancy and aesthetic appeal. On the other hand, GPT-4o’s responses were dominated by keywords such as “urban,” “area,” “street,” and “mobility,” once again emphasizing the functional design of the space. While humans highlighted specific natural features and the emotional atmosphere, GPT-4o remained more objective, concentrating on infrastructure and the practical usability of the environment.
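The keyword comparison rests on simple frequency counts over the collected responses. The following is a minimal sketch of that step, using an illustrative stop-word list and hypothetical responses; the study's actual preprocessing is not specified here.

```python
import re
from collections import Counter

# Illustrative stop-word list; the study's actual list is not published.
STOP_WORDS = {"the", "a", "and", "is", "are", "to", "of", "it", "for", "very"}

def top_keywords(responses, n=5):
    """Return the n most frequent content words across a list of responses."""
    tokens = [
        token
        for response in responses
        for token in re.findall(r"[a-z]+", response.lower())
        if token not in STOP_WORDS
    ]
    return [word for word, _ in Counter(tokens).most_common(n)]

# Hypothetical human responses to the first image
human_responses = [
    "the trees give shade and the street feels comfortable",
    "people walking under trees, shade everywhere, very comfortable",
]
keywords = top_keywords(human_responses, n=3)
```

With these example responses, the top keywords are "trees," "shade," and "comfortable," mirroring the experiential vocabulary the study reports for human participants.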

5. Discussion

Prior research has shown that machine learning and computer vision methods are effective in analyzing image datasets, including those from Google Street View, to predict various factors such as scene complexity, safety, and socioeconomic conditions [64,97,98]. The emergence of vision-language models, such as ChatGPT [99], PaLM [100], and LLaMA [101], offers new opportunities for evaluating images in more comprehensive ways. Despite their promise, the extent to which these models align with human perception, particularly in urban walkability assessments, remains insufficiently explored. Building on this foundation, our study explored GPT-4o’s potential as a tool for assessing visual walkability in urban environments by comparing its evaluations with those of human participants. Our findings reveal a strong alignment between GPT-4o and human participants in assessing overall walkability, feasibility, accessibility, and safety. However, notable differences emerged in the assessment of comfort and liveliness, where human participants provided more thematically diverse insights, while GPT-4o’s responses were more structured and cohesive, particularly regarding comfort and safety. Similarity scores suggest a moderate level of alignment between GPT-4o’s reasoning and human judgments, highlighting its potential to complement human evaluations of urban walkability with a systematic and focused approach. These findings contribute to ongoing discussions on AI’s role in urban design and suggest promising avenues for integrating AI-driven assessments into urban planning processes.

First, the results in Table 2 indicated that GPT-4o and human participants matched each other in their choice of images for perceptions such as general walkability, feasibility, accessibility, and safety. However, human participants sometimes rated paired images as “equal,” indicating a nuance in human decision-making that GPT-4o did not replicate. This suggests that even in areas where GPT-4o can mimic human preferences, its decision-making may not possess the same complexity and flexibility as human judgment on more subjective perceptions.

Second, while the similarity scores (0.4575 for the first image and 0.4615 for the second) indicated a moderate level of agreement, they also suggest that GPT-4o’s responses do not perfectly mirror human judgments. The slight difference in scores shows that GPT-4o might align more closely with human participants when evaluating certain images, but its reasoning does not fully capture the nuances of human perception. This moderate alignment points to GPT-4o’s potential to assist in tasks that require subjective judgment, such as urban design and walkability assessments, but it also highlights the need for caution when relying solely on AI models for decisions that involve complex human-centered evaluations.

Third, the comparison between GPT-4o and human participants shows that humans consistently identified a wider variety of topics across most perception categories, such as walkability, feasibility, and liveliness, although their responses varied in cohesiveness. In contrast, GPT-4o generated fewer topics but delivered more cohesive and focused responses, especially regarding comfort and safety. This indicates that while humans capture a broader range of themes, GPT-4o provides more structured and streamlined interpretations based on the perception being assessed. Notably, GPT-4o had difficulty with more complex and abstract perceptions like liveliness and accessibility, where it identified fewer topics and had lower coherence scores. On the other hand, human participants showed a more consistent ability to recognize a wide array of factors across different perceptions and images. Therefore, we argue that humans excel in identifying thematic diversity, while GPT-4o is skilled at organizing and simplifying more straightforward perceptions.

Fourth, regarding the top keywords, while humans emphasized specific natural features and the emotional atmosphere, GPT-4o remained more objective, concentrating on infrastructure and the practical usability of the environment. This aligns with earlier research on human perceptions of walkability in urban spaces, which emphasized the importance of physical infrastructure like sidewalks and road conditions in shaping walkability perceptions [40,102]. In contrast, GPT-4o’s broader focus on the usability of space presents a different approach, one that is more grounded in specific infrastructural details. This divergence mirrors the findings of [60], who argued that while visual elements are essential, the subjective nature of walkability makes it challenging to capture the full scope of human experience through automated tools alone. Our study reinforces this notion by demonstrating that while GPT-4o can provide a consistent overview of environmental quality, it often misses the context-specific insights that human evaluators offer. Therefore, integrating human perspectives is needed to fully capture the depth of contextual and experiential detail that shapes human emotions in urban spaces [103]. This further supports a balanced approach: by combining GPT-4o with human judgment, urban planners can benefit from the strengths of both, leading to more effective and responsive urban design solutions.

Overall, this study demonstrates the potential practical use of LLMs, such as GPT-4o, to complement human-centered urban planning by providing structured, scalable assessments of walkability. While GPT-4o showed notable alignment with human evaluations in aspects such as overall walkability and feasibility, its applications can extend beyond simple evaluations to play a more dynamic role in urban design processes. For instance, LLMs could be employed in automated urban audits, analyzing street-level imagery to identify infrastructure gaps such as the absence of pedestrian crossings, narrow sidewalks, or insufficient greenery. This capability could save significant time and resources, particularly for large-scale urban projects. Another promising application lies in scenario simulations, where planners could upload mock-up designs or proposed changes to urban spaces and receive AI-driven feedback on how these alterations might influence walkability indicators such as comfort, accessibility, or safety. Additionally, LLMs could enhance public engagement by acting as intermediaries in community participation initiatives, translating technical urban design elements into more accessible language for residents and helping stakeholders better understand proposed plans and prioritize elements that align with public sentiment.

Despite these promising applications, the limitations of GPT-4o must be acknowledged, particularly in addressing cultural and personal factors that heavily influence human judgments of walkability. Cultural norms shape perceptions of aspects such as safety, liveliness, and accessibility differently across regions. For instance, a vibrant urban space in one cultural context might be perceived as chaotic or unsafe in another. Similarly, personal preferences, including mobility needs or aesthetic values, add layers of subjectivity that GPT-4o struggles to capture without explicit input. These limitations stem from GPT-4o’s reliance on text and image data, which, while powerful, cannot fully account for experiential or emotional connections to urban spaces. To address these subjective differences, future studies should incorporate more diverse datasets that reflect the cultural and geographic variability of urban spaces. Additionally, GPT-4o’s capabilities could be enhanced through multimodal data integration, including pedestrian movement patterns, audio cues, and climate data, to better simulate human sensory experiences and contextualize walkability evaluations. However, even with these advancements, human-centered design decisions require nuanced, context-specific insights that LLMs cannot replicate. LLMs should therefore be seen as tools to augment human expertise, particularly in subjective and culturally sensitive areas of urban planning, rather than as a replacement for human judgment. Incorporating studies that compare different urban typologies, such as high-density versus low-density areas, could provide a new understanding of the role of built-environment characteristics. Furthermore, exploring the application of LLM-driven tools in collaborative design processes with stakeholders, including urban planners, architects, and community members, could bridge the gap between data-driven models and practical implementation.
In addition, the results emphasize GPT-4o’s proficiency in delivering structured and coherent assessments; however, they also reveal certain limitations, particularly regarding its ability to grasp more abstract and subjective metrics such as liveliness and accessibility. These perceptions are intricately linked to experiential and contextual elements that may not be entirely quantifiable with the existing LLMs and methodologies utilized in this research. Recognizing these limitations is essential to maintaining realistic expectations for LLM applications within urban studies. Additionally, the images used in this study were taken in urban areas in Michigan, which may limit the applicability of the findings to other regions with distinct urban characteristics. The reliance on specific image types and the exclusion of certain human responses, such as biased or equal ratings, may have restricted the depth of the analysis. Future research should address these limitations to enhance the understanding of the role of LLMs in urban planning.

6. Conclusion

In conclusion, our results showed that while GPT-4o can help measure people’s perceptions of urban walkability, it cannot fully replicate the depth of human perception. The integration of LLMs into urban planning should be approached with caution, ensuring that these tools are used to augment, rather than replace, people’s perceptions. By refining LLM algorithms and incorporating human feedback, there is potential to develop more effective and responsive tools for urban analysis, ultimately leading to better design. However, this study has several limitations. The relatively small and demographically homogeneous sample may not fully capture the diversity of urban walkability perceptions across different populations. Our main conclusion, therefore, is that humans retain final authority in decision-making. Rather than seeking to replace urban planners, future research should concentrate on creating customized LLM solutions for urban studies.

Supporting information

S1 Table. Human participants’ responses to paired images.

(DOCX)

pone.0322078.s001.docx (20.4KB, docx)
S2 Table. GPT-4V responses to paired images.

(DOCX)

pone.0322078.s002.docx (18.4KB, docx)
S1 Fig. Similarity Index of GPT-4V Responses Regarding Overall Walkability Perception.

(TIF)

pone.0322078.s003.tif (564.7KB, tif)
S2 Fig. Similarity Index of GPT-4V Responses Regarding Feasibility Perception.

(TIF)

pone.0322078.s004.tif (599.3KB, tif)
S3 Fig. Similarity Index of GPT-4V Responses Regarding Accessibility Perception.

(TIF)

pone.0322078.s005.tif (524.4KB, tif)
S4 Fig. Similarity Index of GPT-4V Responses Regarding Safety Perception.

(TIF)

pone.0322078.s006.tif (579.6KB, tif)
S5 Fig. Similarity Index of GPT-4V Responses Regarding Comfort Perception.

(TIF)

pone.0322078.s007.tif (569.6KB, tif)
S6 Fig. Similarity Index of GPT-4V Responses Regarding Liveliness Perception.

(TIF)

pone.0322078.s008.tif (577KB, tif)

Acknowledgments

This study was enriched by collaborative discussions with members of the HealthScape Lab at Michigan State University. We are also deeply thankful to Tai-Quan Peng and Chun-Yen Chang for their insightful feedback and contributions during the review of the manuscript. We gratefully acknowledge Michigan State University for offering the infrastructure and resources necessary to carry out this research. Our heartfelt thanks also go to the editorial team and the anonymous reviewers for their valuable comments and suggestions on the earlier draft of this paper.

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1.Wedyan M, Saeidi-Rizi F. Assessing the impact of walkability indicators on health outcomes using machine learning algorithms: A case study of Michigan. Travel Behaviour and Society. 2025;39:100983. doi: 10.1016/j.tbs.2025.100983 [DOI] [Google Scholar]
  • 2.Saelens BE, Handy SL. Built environment correlates of walking: A review. Med Sci Sports Exerc. 2008;40(7 Suppl):S550 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Giles-Corti B, et al. Encouraging walking for transport and physical activity in children and adolescents: How important is the built environment? Sports Medicine. 2009;39: 995–1009. [DOI] [PubMed] [Google Scholar]
  • 4.Pivo G, Fisher JD. The walkability premium in commercial real estate investments. Real Estate Economics. 2011;39(2):185–219. doi: 10.1111/j.1540-6229.2010.00296.x [DOI] [Google Scholar]
  • 5.Maniei H, et al. The influence of urban design performance on walkability in cultural heritage sites of Isfahan, Iran. Land. 2024;13(9):1523. [Google Scholar]
  • 6.Marshall JD, Brauer M, Frank LD. Healthy neighborhoods: Walkability and air pollution. Environ Health Perspect. 2009;117(11):1752–1759. doi: 10.1289/ehp.0900595 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Fonseca F, Ribeiro PJG, Conticelli E, Jabbari M, Papageorgiou G, Tondelli S, et al. Built environment attributes and their influence on walkability. Int J Sustain Transp. 2022;16(7):660–679. doi: 10.1080/15568318.2021.1914793 [DOI] [Google Scholar]
  • 8.Li Y, Yabuki N, Fukuda T. Measuring visual walkability perception using panoramic street view images, virtual reality, and deep learning. Sustain Cities Soc. 2022;86: 104140. [Google Scholar]
  • 9.Huang G, Yu Y, Lyu M, Sun D, Zeng Q, Bart D. Using Google Street View panoramas to investigate the influence of urban coastal street environment on visual walkability. Environ Res Commun. 2023;5(6):065017. doi: 10.1088/2515-7620/acdecf [DOI] [Google Scholar]
  • 10.Vo DC, Kim J. Exploring perceived walkability in one-way commercial streets: An application of 360° immersive videos. PloS One. 2024;19(12): e0315828. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Biswas G, Roy TK. Measuring objective walkability from pedestrian-level visual perception using machine learning and GSV in Khulna, Bangladesh. Geomatics and Environmental Engineering. 2023;17(6). [Google Scholar]
  • 12.Ki D. A novel walkability index using google street view and deep learning. Sustainable Cities and Society. 2023;99:104896. [Google Scholar]
  • 13.Li Y, Yabuki N, Fukuda T. Integrating GIS, deep learning, and environmental sensors for multicriteria evaluation of urban street walkability. Landsc Urban Plan. 2023;230:104603. [Google Scholar]
  • 14.Kang Y, Kim J, Park J, Lee J. Assessment of perceived and physical walkability using street view images and deep learning technology. ISPRS Int J Geo-Inf. 2023;12(5):186. doi: 10.3390/ijgi12050186 [DOI] [Google Scholar]
  • 15.Devlin J, et al. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
  • 16.Sun C, Huang L, Qiu X. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. arXiv preprint arXiv:1903.09588. 2019.
  • 17.Zhu J, et al. Incorporating bert into neural machine translation. arXiv preprint arXiv:2002.06823. 2020.
  • 18.Milička J, Marklová A, VanSlambrouck K, Pospíšilová E, Šimsová J, Harvan S, et al. Large language models are able to downplay their cognitive abilities to fit the persona they simulate. PLoS One. 2024;19(3):e0298522. doi: 10.1371/journal.pone.0298522 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Gorenz D, Schwarz N. How funny is ChatGPT? A comparison of human-and AI-produced jokes. 2024 [DOI] [PMC free article] [PubMed]
  • 20.Lund BD, Wang T. Chatting about ChatGPT: How may AI and GPT impact academia and libraries?. Library Hi Tech News. 2023;40(3):26–29. doi: 10.1108/lhtn-01-2023-0009 [DOI] [Google Scholar]
  • 21.Biswas SS. Role of Chat GPT in public health. Ann Biomed Eng. 2023;51(5):868–869. doi: 10.1007/s10439-023-03172-7 [DOI] [PubMed] [Google Scholar]
  • 22.Sallam M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare. 2023. MDPI. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Brin D, et al. Assessing GPT-4 multimodal performance in radiological image analysis. Eur Radiol. 2024:1–7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Waisberg E, Ong J, Masalkhi M, Zaman N, Sarker P, Lee AG, et al. GPT-4 and medical image analysis: strengths, weaknesses and future directions. Journal of Medical Artificial Intelligence. 2023;6:29–29. doi: 10.21037/jmai-23-94 [DOI] [Google Scholar]
  • 25.Korngiebel DM, Mooney SD. Considering the possibilities and pitfalls of Generative Pre-trained Transformer 3 (GPT-3) in healthcare delivery. NPJ Digit Med. 2021;4(1):93. doi: 10.1038/s41746-021-00464-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Li Y, Li J, He J, Tao C. AE-GPT: Using large language models to extract adverse events from surveillance reports – A use case with influenza vaccine adverse events. PLoS One. 2024;19(3):e0300919. doi: 10.1371/journal.pone.0300919 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Wang J, Zhu Z, Liu C, Li R, Wu X. LLM-Enhanced multimodal detection of fake news. PloS One. 2024;19(10):e0312240. doi: 10.1371/journal.pone.0312240 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Zhang J, et al. Graph-to-tree learning for solving math word problems . Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. [Google Scholar]
  • 29.Kasneci E, Sessler K, Küchemann S, Bannert M, Dementieva D, Fischer F, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences. 2023;103:102274. doi: 10.1016/j.lindif.2023.102274 [DOI] [Google Scholar]
  • 30.Frederico GF. ChatGPT in supply chains: initial evidence of applications and potential research agenda. Logistics. 2023;7(2):26. [Google Scholar]
  • 31.Mich L, Garigliano R. ChatGPT for e-Tourism: a technological perspective. Information Technology & Tourism. 2023;25(1): 1–12. [Google Scholar]
  • 32.Biswas S. Importance of chat GPT in agriculture: According to chat GPT. Available from: SSRN 4405391. 2023.
  • 33.Lee S, et al. Can large language models estimate public opinion about global warming? An empirical assessment of algorithmic fidelity and bias. PLOS Climate. 2024;3(8):e0000429. doi: 10.1371/journal.pclm.0000429 [DOI] [Google Scholar]
  • 34.Yang J, Fricker P, Jung A. From intangible to tangible: The role of big data and machine learning in walkability studies. Computers, Environment and Urban Systems. 2024;109:102087. [Google Scholar]
  • 35.Feng J, et al. CityBench: Evaluating the capabilities of large language model as world model. arXiv preprint arXiv:2406.13945. 2024.
  • 36.Braveman P, Gottlieb L. The social determinants of health: It’s time to consider the causes of the causes. Public Health Rep. 2014;129(2):19–31. doi: 10.1177/00333549141291S206 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Duncan DT, Aldstadt J, Whalen J, Melly SJ, Gortmaker SL. Validation of Walk Score® for estimating neighborhood walkability: an analysis of four US metropolitan areas. Int J Environ Res Public Health. 2011;8(11):4160–4179. doi: 10.3390/ijerph8114160 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Abley S, Hill E. Designing living streets – a guide to creating lively, walkable neighbourhoods. 2005.
  • 39.Cerin E, et al. Destinations that matter: associations with walking for transport. Health & Place. 2007;13(3): 713–724. [DOI] [PubMed] [Google Scholar]
  • 40.Ewing R, Handy S. Measuring the unmeasurable: urban design qualities related to walkability. J Urban Des. 2009;14(1):65–84. doi: 10.1080/13574800802451155 [DOI] [Google Scholar]
  • 41.Wang H, Yang Y. Neighbourhood walkability: a review and bibliometric analysis. Cities. 2019;93:43–61. doi: 10.1016/j.cities.2019.04.015 [DOI] [Google Scholar]
  • 42.Wang W, et al. Exploring determinants of pedestrians’ satisfaction with sidewalk environments: case study in Korea. J Urban Plan Dev. 2012;138(2): 166–172. [Google Scholar]
  • 43.Lee E, Dean J. Perceptions of walkability and determinants of walking behaviour among urban seniors in Toronto, Canada. J Transp Health. 2018;9: 309–320.
  • 44.Gan Z, Yang M, Zeng Q, Timmermans HJP. Associations between built environment, perceived walkability/bikeability and metro transfer patterns. Transp Res A: Policy Pract. 2021;153:171–187. doi: 10.1016/j.tra.2021.09.007 [DOI] [Google Scholar]
  • 45.Koohsari MJ, McCormack GR, Shibata A, Ishii K, Yasunaga A, Nakaya T, et al. The relationship between walk score® and perceived walkability in ultrahigh density areas. Prev Med Rep. 2021;23:101393. doi: 10.1016/j.pmedr.2021.101393 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Alfonzo MA. To walk or not to walk? The hierarchy of walking needs. Environ Behav. 2005. 37(6): 808–836. [Google Scholar]
  • 47.Speck J. Walkable city: how downtown can save America, One Step At A Time. 2013: Macmillan. [Google Scholar]
  • 48.Gehl J. Cities for People. 2013: Island press. [Google Scholar]
  • 49.Hasan MM, Oh J-S, Kwigizile V. Exploring the trend of walkability measures by applying hierarchical clustering technique. J Transp Health. 2021;22:101241. [Google Scholar]
  • 50.Park S, Deakin E, Lee JS. Perception-based walkability index to test impact of microlevel walkability on sustainable mode choice decisions. Transp Res Rec. 2014. 2464(1): 126–134. [Google Scholar]
  • 51.Kelly CE, et al. A comparison of three methods for assessing the walkability of the pedestrian environment. J Transp Geography. 2011. 19(6): 1500–1508. [Google Scholar]
  • 52.Hasan MM, Oh J-S, Kwigizile V. Exploring the relationship between human walking trajectories and physical-visual environmental: an application of artificial intelligence and spatial analysis. 2021.
  • 53.Frank LD, et al. The development of a walkability index: application to the neighborhood quality of life study. British Journal of Sports Medicine. 2010;44(13): 924–933. [DOI] [PubMed] [Google Scholar]
  • 54.Gu P, et al. Using open source data to measure street walkability and bikeability in China: a case of four cities. Transportation research record. 2018;2672(31) 63–75. [Google Scholar]
  • 55.Lu Y. The association of urban greenness and walking behavior: using google street view and deep learning techniques to estimate residents’ exposure to urban greenness. Int J Environ Res Public Health. 2018;15(8):1576. doi: 10.3390/ijerph15081576 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Lu Y. Using Google Street View to investigate the association between street greenery and physical activity. Landsc Urban Plan. 2019;191: 103435. [Google Scholar]
  • 57.Xue F, et al. Personalized walkability assessment for pedestrian paths: An as-built BIM approach using ubiquitous augmented reality (AR) smartphone and deep transfer learning. in Proceedings of the 23rd International Symposium on the Advancement of Construction Management and Real Estate, Guiyang, China. 2018. [Google Scholar]
  • 58.Biljecki F, Ito K. Street view imagery in urban analytics and GIS: a review. Landsc Urban Plan. 2021;215:104217. doi: 10.1016/j.landurbplan.2021.104217 [DOI] [Google Scholar]
  • 59.Wedyan M, Saeidi-Rizi F. Assessing the impact of urban environments on mental health and perception using deep learning: a review and text mining analysis. J Urban Health. 2024;101(2):327–343. doi: 10.1007/s11524-024-00830-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Zhou H, He S, Cai Y, Wang M, Su S. Social inequalities in neighborhood visual walkability: Using street view imagery and deep learning technologies to facilitate healthy city planning. Sustain Cities Soc. 2019;50:101605. doi: 10.1016/j.scs.2019.101605 [DOI] [Google Scholar]
  • 61.Wu C, Peng N, Ma X, Li S, Rao J. Assessing multiscale visual appearance characteristics of neighbourhoods using geographically weighted principal component analysis in Shenzhen, China. Comput Environ Urban Syst. 2020;84:101547. doi: 10.1016/j.compenvurbsys.2020.101547 [DOI] [Google Scholar]
  • 62.Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. [DOI] [PubMed] [Google Scholar]
  • 63.Dai L, Zheng C, Dong Z, Yao Y, Wang R, Zhang X, et al. Analyzing the correlation between visual space and residents’ psychology in Wuhan, China using street-view images and deep-learning technique. City Environ Interac. 2021;11:100069. doi: 10.1016/j.cacint.2021.100069 [DOI] [Google Scholar]
  • 64.Zhang F, et al. Measuring human perceptions of a large-scale urban region using machine learning. Landsc Urban Plan. 2018;180:148–160. doi: 10.1016/j.landurbplan.2018.08.020 [DOI] [Google Scholar]
  • 65.Song J, et al. The effect of eye-level street greenness exposure on walking satisfaction: the mediating role of noise and PM2. 5. Urban Forestry & Urban Greening. 2022. 77: 127752. [Google Scholar]
  • 66.Tang Y, et al. Exploring the impact of built environment attributes on social followings using social media data and deep learning. ISPRS Int J Geo-Inf. 2022;11(6):325. [Google Scholar]
  • 67.Feng J, et al. CityGPT: empowering urban spatial cognition of large language models. arXiv preprint arXiv:2406.13948. 2024.
  • 68.OpenAI R. Gpt-4 technical report. arxiv 2303.08774. View in Article. 2023;2:13. [Google Scholar]
  • 69. Johnson O, Mohammed Alyasiri O, Akhtom D, Johnson OE. Image analysis through the lens of ChatGPT-4. Journal of Applied Artificial Intelligence. 2023;4(2). doi: 10.48185/jaai.v4i2.870
  • 70. Ding N, et al. Can large pre-trained models help vision models on perception tasks? arXiv preprint arXiv:2306.00693. 2023.
  • 71. Driessen T, et al. Putting ChatGPT Vision (GPT-4V) to the test: risk perception in traffic images. 2023.
  • 72. Dale R. GPT-3: What’s it good for? Nat Lang Eng. 2021;27(1):113–118.
  • 73. Lecler A, Duron L, Soyer P. Revolutionizing radiology with GPT-based models: current applications, future possibilities and limitations of ChatGPT. Diagn Interv Imaging. 2023;104(6):269–274. doi: 10.1016/j.diii.2023.02.003
  • 74. Aydın Ö, Karaarslan E. OpenAI ChatGPT generated literature review: digital twin in healthcare. Available at SSRN 4308687. 2022.
  • 75. Kitamura FC. ChatGPT is shaping the future of medical writing but still requires human judgment. Radiology. 2023:e230171.
  • 76. Gilson A, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Medical Education. 2023;9(1):e45312.
  • 77. Kung TH, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digital Health. 2023;2(2):e0000198.
  • 78. Hu M, et al. Advancing medical imaging with language models: a journey from n-grams to ChatGPT. arXiv preprint arXiv:2304.04920. 2023.
  • 79. Lu P, et al. MathVista: evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255. 2023.
  • 80. Lehnert K. AI insights into theoretical physics and the Swampland program: a journey through the cosmos with ChatGPT. arXiv preprint arXiv:2301.08155. 2023.
  • 81. Kortemeyer G. Could an artificial-intelligence agent pass an introductory physics course? Phys Rev Phys Educ Res. 2023;19(1):010132.
  • 82. West CG. AI and the FCI: can ChatGPT project an understanding of introductory physics? arXiv preprint arXiv:2303.01067. 2023.
  • 83. Guo S, et al. Semantic communications with ordered importance using ChatGPT. arXiv preprint arXiv:2302.07142. 2023.
  • 84. Li Y, et al. A comprehensive evaluation of GPT-4V on knowledge-intensive visual question answering. arXiv preprint arXiv:2311.07536. 2023.
  • 85. Zhang X, et al. GPT-4V(ision) as a generalist evaluator for vision-language tasks. arXiv preprint arXiv:2311.01361. 2023.
  • 86. Ye Q, et al. mPLUG-Owl2: revolutionizing multi-modal large language model with modality collaboration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
  • 87. Liu Y, et al. Rec-GPT4V: multimodal recommendation with large vision-language models. arXiv preprint arXiv:2402.08670. 2024.
  • 88. Huang J, et al. GPT-4V takes the wheel: evaluating promise and challenges for pedestrian behavior prediction. arXiv preprint arXiv:2311.14786. 2023.
  • 89. Tabone W, de Winter J. Using ChatGPT for human-computer interaction research: a primer. R Soc Open Sci. 2023;10(9):231053. doi: 10.1098/rsos.231053
  • 90. Tang R, et al. Found in the middle: permutation self-consistency improves listwise ranking in large language models. arXiv preprint arXiv:2310.07712. 2023.
  • 91. Wang X, et al. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171. 2022.
  • 92. Mikolov T, et al. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
  • 93. Lahitani AR, Permanasari AE, Setiawan NA. Cosine similarity to determine similarity measure: study case in online essay assessment. In: 2016 4th International Conference on Cyber and IT Service Management. IEEE; 2016.
  • 94. Vijaymeena M, Kavitha K. A survey on similarity measures in text mining. Machine Learning and Applications: An International Journal. 2016;3(2):19–28.
  • 95. Xia P, Zhang L, Li F. Learning similarity with cosine similarity ensemble. Information Sciences. 2015;307:39–52.
  • 96. Kryszkiewicz M. Determining cosine similarity neighborhoods by means of the Euclidean distance. In: Rough Sets and Intelligent Systems – Professor Zdzisław Pawlak in Memoriam: Volume 2. 2013:323–345.
  • 97. Dubey A, et al. Deep learning the city: quantifying urban perception at a global scale. In: Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I. Springer; 2016.
  • 98. Fan Z, Zhang F, Loo BPY, Ratti C. Urban visual intelligence: uncovering hidden city profiles with street view images. Proc Natl Acad Sci U S A. 2023;120(27):e2220417120. doi: 10.1073/pnas.2220417120
  • 99. Bahrini A, et al. ChatGPT: applications, opportunities, and threats. In: 2023 Systems and Information Engineering Design Symposium (SIEDS). IEEE; 2023.
  • 100. Chowdhery A, et al. PaLM: scaling language modeling with pathways. Journal of Machine Learning Research. 2023;24(240):1–113.
  • 101. Touvron H, et al. LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971. 2023.
  • 102. Southworth M. Designing the walkable city. J Urban Plann Dev. 2005;131(4):246–257. doi: 10.1061/(ASCE)0733-9488(2005)131:4(246)
  • 103. Malekzadeh M, et al. Urban visual appeal according to ChatGPT: contrasting AI and human insights. arXiv preprint arXiv:2407.14268. 2024.

Decision Letter 0

Abeer Elshater

26 Dec 2024

PONE-D-24-54765

Urban Walkability Through Different Lenses: A Comparative Study of GPT-4o and Human Perceptions

PLOS ONE

Dear Dr. Saeidi-Rizi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

Note from associate editor

The manuscript offers valuable insights into AI and human perception regarding walkability. It attempts to answer whether AI can perceive the built environment in a manner similar to humans.

The associate editor and the reviewer have provided comments below, which should be addressed carefully. 

Associate Editor's comments

1- The introduction needs to be restructured by addressing the gap in literature, research problem and research aim.   

2- The sample size of the survey should be clarified. Providing a detailed description of the sample size is essential. 

3- The titles end with punctuation marks such as full stops and semicolons. Such marks should be removed. 

4- The discussion section needs to add a couple of sentences that discuss the current findings from conducting a comparative study between GPT-4o and human responses. Additionally, the research limitations of utilising a qualitative study should also be discussed.

5- The conclusion section summarizes the central finding, but it is important to expand on suggestions for future research.    

6- It is essential to review the latest article published by PLOS One to ensure the continuity of the research direction.

7- For compliance with PLOS One requirements, it is essential to provide the approval for human subjects research from the Institutional Review Board (IRB) or an equivalent ethics committee at Michigan State University, where the authors are affiliated.

==============================


Please submit your revised manuscript by Feb 09 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Prof. Abeer Elshater

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please include a complete copy of PLOS’ questionnaire on inclusivity in global research in your revised manuscript. Our policy for research in this area aims to improve transparency in the reporting of research performed outside of researchers’ own country or community. The policy applies to researchers who have travelled to a different country to conduct research, research with Indigenous populations or their lands, and research on cultural artefacts. The questionnaire can also be requested at the journal’s discretion for any other submissions, even if these conditions are not met.  Please find more information on the policy and a link to download a blank copy of the questionnaire here: https://journals.plos.org/plosone/s/best-practices-in-research-reporting. Please upload a completed version of your questionnaire as Supporting Information when you resubmit your manuscript.

3. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

4. We note that you have indicated that there are restrictions to data sharing for this study. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

Before we proceed with your manuscript, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., a Research Ethics Committee or Institutional Review Board, etc.). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of recommended repositories, please see

https://journals.plos.org/plosone/s/recommended-repositories. You also have the option of uploading the data as Supporting Information files, but we would recommend depositing data directly to a data repository if possible.

We will update your Data Availability statement on your behalf to reflect the information you provide.

5. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

6. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well.

7. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information .

8. We note that this data set consists of interview transcripts. Can you please confirm that all participants gave consent for interview transcript to be published?

If they DID provide consent for these transcripts to be published, please also confirm that the transcripts do not contain any potentially identifying information (or let us know if the participants consented to having their personal details published and made publicly available). We consider the following details to be identifying information:

- Names, nicknames, and initials

- Age more specific than round numbers

- GPS coordinates, physical addresses, IP addresses, email addresses

- Information in small sample sizes (e.g. 40 students from X class in X year at X university)

- Specific dates (e.g. visit dates, interview dates)

- ID numbers

Or, if the participants DID NOT provide consent for these transcripts to be published:

- Provide a de-identified version of the data or excerpts of interview responses

- Provide information regarding how these transcripts can be accessed by researchers who meet the criteria for access to confidential data, including:

a) the grounds for restriction

b) the name of the ethics committee, Institutional Review Board, or third-party organization that is imposing sharing restrictions on the data

c) a non-author, institutional point of contact that is able to field data access queries, in the interest of maintaining long-term data accessibility.

d) Any relevant data set names, URLs, DOIs, etc. that an independent researcher would need in order to request your minimal data set.

For further information on sharing data that contains sensitive participant information, please see: https://journals.plos.org/plosone/s/data-availability#loc-human-research-participant-data-and-other-sensitive-data

If there are ethical, legal, or third-party restrictions upon your dataset, you must provide all of the following details (https://journals.plos.org/plosone/s/data-availability#loc-acceptable-data-access-restrictions):

1. A complete description of the dataset

2. The nature of the restrictions upon the data (ethical, legal, or owned by a third party) and the reasoning behind them

3. The full name of the body imposing the restrictions upon your dataset (ethics committee, institution, data access committee, etc)

4. If the data are owned by a third party, confirmation of whether the authors received any special privileges in accessing the data that other researchers would not have

5. Direct, non-author contact information (preferably email) for the body imposing the restrictions upon the data, to which data access requests can be sent


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Firstly, I would like to thank the authors for preparing this manuscript. This research presents a very intriguing idea with substantial importance in terms of novelty. It contributes to the development of AI-related advances and their applicability in urban design research, advocating for the utility and functionality of such recently developed tools, which are becoming considerably widespread nowadays. Therefore, I would like to suggest the publication of this paper, subject to addressing the following comments:

1. There are several unusual mistakes in the writing (e.g., line 4 of the abstract: repetitive statement of GPT-4o (GPT-4o)). Additionally, the numbering of sections and subsections is incorrect, and the conclusion section does not have a number. Moreover, the citation style is inconsistent throughout the paper; some references use a number-based format, while others follow APA. The authors must ensure consistency in the style of the paper. In general, the fluency and grammatical accuracy of the text need to be thoroughly checked.

2. The Introduction section is too brief. It should be expanded to better contextualize the topic, clearly define the variables associated with this study, and enhance its bibliographic resources.

3. Since the first subsection of the literature review forms the core of the analysis, it would benefit from a wider range of bibliographical resources. This includes elaborating on pedestrian-friendly infrastructure, the five visual elements of walking, the perception of security, street furniture, and urban design features influencing walkable urban spaces. Additionally, addressing the limitations of image semantic segmentation in the last paragraph of the ‘Walkability Perception’ subsection requires supporting evidence.

4. The clarity of Figure 1 must be improved. It appears to have been screenshotted while in modifiable mode, and some arrows are incomplete. Furthermore, each figure must be cited in the text and accompanied by suitable explanations. Likewise, figure captions should be described in detail.

5. The authors need to provide information about the sample size of the conducted survey in the Methods section, along with the rationale behind the chosen number of participants.

6. Demographic data should not be presented in such a chaotic manner. It would be more effective to summarize the participation range of each group using percentages to enhance the fluency of the text.

7. One of the most significant limitations of this study is that some of the indicators of walkability mentioned (e.g., livability and accessibility) may not be measurable using the current tools. This limitation should be explicitly noted in the Conclusion section. Additionally, the suggestions for further research should be improved by elaborating on other potential applications of AI in interpreting, organizing, and analyzing urban planning research.

Reviewer #2: Overall

The manuscript presents an interesting and innovative approach to examining urban walkability through the lens of both GPT-4 and human perceptions. The attempt to bridge AI-driven analysis with human judgment in this domain is both timely and significant. However, the study still requires further refinement and clarification in several areas to improve its overall rigor and readability.

Q1:

The overall quality of the images in the manuscript is suboptimal. There are several areas where the visuals could be improved for better clarity and more precise communication. For example, in Figure 1, the arrows below the image are unclear regarding what they are pointing to or connecting. Additionally, the image appears to have been captured as a screenshot, which is not ideal for a scientific publication. This screenshot retains an active selection box.

Q2

Section 2 Methodology: The description of how images were selected for evaluation is unclear. It is crucial to elaborate on the selection criteria for images and the participant demographics (e.g., age, profession, cultural background) to ensure the generalizability of results.

It would be helpful to mention the number of human participants and the diversity of the urban areas assessed. More specific details on the data (e.g., demographic diversity of participants or add more showcase of cityscape images to better illustrate) would strengthen the findings.

Q3

In the result section, the manuscript presents similarity scores (e.g., 0.4575) without sufficient explanation of how these scores were calculated, their statistical significance, or what constitutes a significant alignment between GPT-4o and human evaluations. The concept of a "coherence score" is not mentioned or defined, which may be critical for understanding the reliability and meaningfulness of the results.

Suggestion:

1. Provide a detailed explanation of how the similarity scores were computed, including the citations, equations, or models used to derive these values.

2. Clarify the significance of the similarity scores in the context of human and AI evaluation alignment. For example, what threshold value of the score represents a significant alignment between GPT-4o and human responses? Are there statistical tests that were performed to validate these scores?

3. If "coherence score" is a part of the analysis, please define it explicitly and explain its role in assessing the results.

Q4

The paper mentions the potential of AI to assist in urban planning tasks but stops short of providing a clear set of recommendations for the practical implementation of AI in urban design. The study could benefit from a deeper discussion of how AI-based assessments of urban environments could influence urban planning. Are there specific applications where GPT-4o might be more useful? What are the limitations for human-centred decisions?

For example, how do cultural or personal factors influence human judgments of walkability? How might GPT-4o handle these subjective differences?

Q5

There is a minor error of the section serial number in the manuscript, where Section 2 is followed directly by Section 4, without Section 3 being included.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes:  Waishan Qiu

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

PLoS One. 2025 Apr 29;20(4):e0322078. doi: 10.1371/journal.pone.0322078.r003

Author response to Decision Letter 1


5 Feb 2025

Thank you for your comments and suggestions, which have helped make this manuscript much better.

Attachment

Submitted filename: Response to Reviewers.docx

pone.0322078.s010.docx (28KB, docx)

Decision Letter 1

Abeer Elshater

18 Mar 2025

Urban Walkability Through Different Lenses: A Comparative Study of GPT-4o and Human Perceptions

PONE-D-24-54765R1

Dear Dr. Saeidi-Rizi,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Abeer Elshater

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Abeer Elshater

PONE-D-24-54765R1

PLOS ONE

Dear Dr. Saeidi-Rizi,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Abeer Elshater

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Human participants’ responses to paired images.

    (DOCX)

    pone.0322078.s001.docx (20.4KB, docx)
    S2 Table. GPT-4V responses to paired images.

    (DOCX)

    pone.0322078.s002.docx (18.4KB, docx)
    S1 Fig. Similarity Index of GPT-4V Responses Regarding Overall Walkability Perception.

    (TIF)

    pone.0322078.s003.tif (564.7KB, tif)
    S2 Fig. Similarity Index of GPT-4V Responses Regarding Feasibility Perception.

    (TIF)

    pone.0322078.s004.tif (599.3KB, tif)
    S3 Fig. Similarity Index of GPT-4V Responses Regarding Accessibility Perception.

    (TIF)

    pone.0322078.s005.tif (524.4KB, tif)
    S4 Fig. Similarity Index of GPT-4V Responses Regarding Safety Perception.

    (TIF)

    pone.0322078.s006.tif (579.6KB, tif)
    S5 Fig. Similarity Index of GPT-4V Responses Regarding Comfort Perception.

    (TIF)

    pone.0322078.s007.tif (569.6KB, tif)
    S6 Fig. Similarity Index of GPT-4V Responses Regarding Liveliness Perception.

    (TIF)

    pone.0322078.s008.tif (577KB, tif)
    Attachment

    Submitted filename: Response to Reviewers.docx

    pone.0322078.s010.docx (28KB, docx)

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLOS One are provided here courtesy of PLOS
