PLOS One
2026 Feb 12;21(2):e0340380. doi: 10.1371/journal.pone.0340380

Multimodal generative AI for automated pavement condition assessment: Benchmarking model performance

Chang Xu 1, Lei Shu 1, Anh Dao 2, Yue Cui 1,*
Editor: Junghwan Kim
PMCID: PMC12900301  PMID: 41678462

Abstract

Accurate and efficient pavement condition assessment is essential for maintaining roadway safety and optimizing maintenance investments. However, conventional assessment methods such as manual visual inspections and specialized sensing equipment are often time-consuming, expensive, and difficult to scale across large networks. Recent advancements in generative artificial intelligence (GAI) have introduced new opportunities for automating visual interpretation tasks using street-level imagery. This study evaluates the performance of seven multimodal large language models (MLLMs) for road surface condition assessment, including three proprietary models (Gemini 2.5 Pro, OpenAI o1, and GPT-4o) and four open-source models (Gemma 3, Llama 3.2, LLaVA v1.6 Mistral, and LLaVA v1.6 Vicuna). The models were tested across four task categories relevant to pavement management: distress and feature identification, spatial pattern recognition, severity evaluation, and maintenance interval estimation. Model performance was assessed across five dimensions: response rate, response correctness, consistency, multimodal errors, and overall computational intensity and cost. Results indicate that MLLMs can interpret street-level imagery and generate task-relevant outputs in a cost-effective manner. Among the evaluated models, we recommend GPT-4o as the preferred option, as it balances responsiveness, accuracy, and computational cost.

Introduction

Road condition assessment is a key component of transportation infrastructure management. It provides information necessary to evaluate the current state of roadways, schedule maintenance activities, and allocate resources efficiently. Accurate assessments can contribute to improved safety, reduced travel disruption, and lower vehicle operating costs [1,2]. Over time, road condition monitoring also supports infrastructure asset management by informing long-term planning, helping prioritize maintenance needs, and reducing overall life-cycle costs [3–5]. Beyond structural and safety benefits, effective pavement management also reduces fuel consumption, tire wear, and mechanical damage caused by surface irregularities, offering both economic and environmental advantages [6].

Road condition assessment methods range from structural evaluations that measure subsurface integrity using techniques such as falling weight deflectometer testing [7,8] and ground-penetrating radar [9,10], to surface-level inspections, which analyze visible pavement conditions such as cracking, rutting, raveling, and potholes [11–13]. This study focuses on road surface assessment, which evaluates road condition based on the physical characteristics of the pavement surface.

Conventional road surface assessments have traditionally relied on manual inspection techniques, which are often time-consuming, labor-intensive, and subject to observer variability. In response, recent developments in computer vision and artificial intelligence have enabled the adoption of automated, image-based methods [14–16]. Deep learning [14,17] and other machine learning techniques [18,19] are widely applied in pavement condition assessment to enhance accuracy, consistency, and operational efficiency [18,20,21]. Developing these models requires large volumes of high-quality, accurately labeled training data [22,23]. Acquiring such data is often challenging and time-consuming due to the need for extensive manual annotation [24]. In addition, most models are designed for a single, task-specific function [25]. Since road surface assessment involves multiple stages, including defect detection, severity evaluation, and maintenance prioritization, separate models are typically required to address each component of the workflow.

More recently, advances in generative artificial intelligence (GAI), particularly multimodal models, have gained attention for their potential to move beyond traditional support functions and autonomously generate task-relevant outputs with minimal contextual input. Unlike conventional machine learning models that require extensive training datasets and pre-defined rules, GAI models can respond adaptively to a range of inputs and generate task-relevant outputs with limited contextual information [26]. Leveraging these capabilities, recent studies have begun to apply MLLMs in urban research, particularly using street-view imagery to analyze built environments, urban form, and neighborhood conditions [27–29].

However, the application of MLLMs to surface-level road condition assessment using street-view imagery remains largely unexplored. Pavement evaluation typically involves detecting surface distress, classifying severity, and recommending maintenance actions, which have traditionally required multiple specialized models. With their multimodal understanding and generative capabilities, MLLMs have the potential to integrate these tasks within a single framework and streamline the assessment process. Before MLLMs can be reliably adopted for pavement evaluation using street-view images, it is necessary to benchmark their current capabilities, assess their performance across key road condition assessment tasks, and determine the degree of fine-tuning and adaptation required for reliable application. Beyond establishing baseline benchmarks, this study expands prior work by comparing the performance, reliability, and computational efficiency of multiple MLLMs. Although earlier studies have explored the use of individual MLLMs with street-view imagery, they primarily focused on demonstrating the potential of a single model. To address this gap, the present study evaluates the capability of multiple MLLMs for surface-level road condition assessment and examines performance variations across key tasks. Both proprietary and open-source models are included to capture a comprehensive view of current technological capacity, as they differ in architecture, training data, and accessibility. Evaluating both types makes it possible to assess whether open-source alternatives can match the performance of commercial systems, which has implications for cost-effectiveness and the feasibility of large-scale adoption by public agencies and researchers. The analysis also accounts for computational cost, providing insight into the trade-offs between performance and efficiency.

Building on this framework, this study examines the potential of multimodal large language models (MLLMs) for assessing road surface conditions based on the physical state of pavement. The evaluation focuses on four key tasks relevant to pavement management workflows: (1) surface distress and feature identification (e.g., cracks, potholes, and patches), (2) spatial pattern recognition (e.g., whether defects are isolated, uniformly distributed, or concentrated along edges or drainage areas), (3) severity evaluation (estimating the extent or seriousness of observed defects), and (4) maintenance interval estimation (inferring the timing or urgency of potential maintenance needs). These tasks were selected for their direct relevance to operational decision-making within pavement condition assessment frameworks.

Surface distress identification facilitates the initial detection and classification of visible pavement defects [30]. Spatial pattern recognition supports the identification of localized deterioration, helping to inform targeted maintenance interventions and resource allocation [31]. Severity evaluation translates observed pavement defects into standardized condition ratings that guide maintenance prioritization and investment decisions [32,33]. Finally, maintenance interval estimation involves predicting the timing of future repairs based on the current condition of the pavement, supporting long-term maintenance planning [31].

Model performance is assessed across five performance dimensions: response rate, response correctness, consistency, multimodal errors, and overall computational intensity and cost.

Literature review

Traditionally, road surface condition assessments have relied on manual inspections, structured surveys, and standardized scoring systems based on the physical condition of the pavement [34,35]. While these methods have served as the foundation for pavement and road management, they are labor-intensive, time-consuming, and susceptible to human error [36]. In response to these limitations, scholars and industry practitioners have increasingly advocated for the adoption of automated, sensor-based approaches. These methods improve the objectivity of assessments, reduce reliance on periodic manual inspections, and enable continuous, real-time monitoring of road conditions [35,37,38]. Technologies such as accelerometers, GPS, cameras, and LiDAR are frequently used to capture high-resolution images of surface defects [39–41].

Building on these technological advances, machine learning [18,42] and deep learning techniques [19,43–45] have been widely applied to road surface image analysis. These computational approaches automate condition assessment, reducing the time and cost associated with traditional manual inspections [17,20]. Beyond improving efficiency, they enhance diagnostic accuracy, streamline analytical workflows, and contribute to more effective road infrastructure management [46].

However, these models are not fully autonomous. They typically integrate human expertise into the decision-making process [47,48] and are unable to make independent judgments about condition severity or maintenance needs. Their effectiveness also depends heavily on large, high-quality annotated datasets, which are often computationally intensive to process [49,50] and may be limited or inconsistently available across different contexts [51].

Recent advancements in GAI have opened new avenues for automating assessment processes across various domains [52,53]. Although MLLMs have not yet reached the point of full automation, they demonstrate versatility across tasks [54]. Traditional machine learning approaches for surface-level road condition assessment are typically designed and trained for a single, well-defined objective such as defect detection [55–57], defect classification [58–60], or surface segmentation [61,62]. In contrast, MLLMs can address multiple tasks within a unified framework by combining visual perception with linguistic reasoning [54]. In addition, when baseline performance is insufficient, these models can be fine-tuned to improve task-specific accuracy [63].

MLLMs appear to have technical capabilities that may support road surface condition assessment using street-level images. These models integrate computer vision functions [64] that allow them to detect, classify, and quantify road defects such as cracks, potholes, and other surface irregularities. In addition to visual processing, they are designed to generate descriptive textual outputs from image content [65], which may allow for context-aware evaluations that combine standardized scoring frameworks with natural language reporting. Furthermore, MLLMs may have the potential to interpret road conditions autonomously and produce preliminary maintenance recommendations, thereby reducing the need for expert intervention. In fact, scholars have utilized street-view images to evaluate built environments [29], streetscape quality [66], walkability [28,67], and neighborhood livability [66], demonstrating that MLLMs can interpret complex visual cues associated with urban design and infrastructure. However, these capabilities have not yet been empirically validated within the domain of pavement assessment.

To address this gap, this study evaluates the performance of MLLMs across four tasks relevant to pavement condition assessment. The selection of tasks reflects common practices in surface road condition assessment: (1) surface distress and feature identification [21,34,68,69], (2) spatial pattern recognition (i.e., crack positioning) [70,71], (3) severity evaluation [32,72,73], and (4) maintenance interval estimation [32,74].

Among the selected models, three are proprietary. Gemini 2.5 Pro, developed by Google DeepMind and released in March 2025, incorporates advanced multimodal reasoning capabilities [75]. OpenAI o1, released in late 2024, and GPT-4o, released in May 2024, represent a reasoning-focused and a flagship model, respectively. OpenAI o1 emphasizes multi-step reasoning for vision-language tasks [76], whereas GPT-4o offers enhanced capabilities in visual perception, language understanding, and rapid multimodal generation [77]. The remaining models are open source. Gemma 3, released by Google DeepMind in March 2025, is a publicly accessible and compact model tailored for ease of fine-tuning and adaptation in academic and research settings [78]. Llama 3.2, developed by Meta and released in September 2024, extends the Llama series with improved contextual reasoning and instruction-following [79]. LLaVA v1.6 Mistral, released jointly by UW-Madison and Microsoft in January 2024, integrates the Mistral backbone for visual-language alignment in multimodal reasoning [80]. LLaVA v1.6 Vicuna, also released by UW-Madison, builds on the Vicuna model and focuses on instruction-following in visual question answering [81].

Methods

Data

This study used a multimodal dataset comprising street-level images and textual prompts as model inputs, along with benchmark references for performance evaluation.

Multimodal input.

The input to the MLLMs consisted of street-level road images and textual prompts designed to reflect real-world road condition assessment tasks. The images were sourced from Google Street View and selected based on PCI records, as detailed in the section on benchmark data. To ensure temporal alignment between pavement condition data and Street View imagery, only images captured within one year of a PCI assessment were included. PCI records prior to 2014 were excluded due to the lack of reliably dated images. The dataset size is relatively small primarily because real-world PCI data are scarce and currently available only from the City of San Francisco. The available dataset contains PCI values and geographic coordinates but does not include street-level imagery, requiring the retrieval of corresponding Google Street View images for each road segment. Furthermore, many segments in San Francisco have perfect PCI scores (i.e., PCI = 100), resulting in limited representation of visible pavement defects. To mitigate potential sampling bias, a random subset of those records was selected to balance the distribution of pavement conditions. This process resulted in a dataset of 261 road segment images.
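The temporal-alignment rule described above can be sketched as a simple filter. This is an illustration only: the function name is ours, and "within one year" is interpreted here as ±365 days, which the paper does not state explicitly.

```python
from datetime import date

def within_one_year(pci_date: date, image_date: date) -> bool:
    """Keep only image/PCI pairs captured within one year of each other,
    and drop PCI records assessed before 2014 (no reliably dated imagery).
    Illustrative sketch; threshold interpretation (±365 days) is assumed."""
    if pci_date.year < 2014:
        return False
    return abs((image_date - pci_date).days) <= 365

# A 2016 PCI assessment paired with imagery captured eight months later passes;
# a pre-2014 record is excluded regardless of image date.
print(within_one_year(date(2016, 3, 1), date(2016, 11, 1)))
print(within_one_year(date(2013, 6, 1), date(2013, 8, 1)))
```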

Each image was paired with a prompt designed to instruct the model to perform a specific infrastructure-related task. A total of 39 unique prompts were developed, as detailed in S1 Table. These prompts request the identification of observable features (e.g., presence, type, and quantity of cracks or potholes), descriptions of spatial patterns (e.g., clustering near intersections or drainage), assessments of condition severity, and judgments on the urgency of potential maintenance actions. Prompts also ask for explanatory reasoning or justifications. The prompts were written in natural language and structured as multi-part instructions to encourage detailed responses.

Benchmark data.

To evaluate model outputs, this study used three types of benchmark data: PCI, estimated road maintenance intervals, and manually annotated labels for observable surface features and spatial patterns.

PCI records were obtained from the City of San Francisco’s open data portal in February 2025 [82]. The dataset includes street-level information such as PCI assessment dates, numerical scores ranging from 0 to 100, severity categories, road functional classifications, and geographic coordinates. These records were also used to guide image selection, as described in the Multimodal input section above.

Road maintenance intervals were estimated by reviewing sequential Google Street View images for visible signs of surface repair. Segments showing repairs within one year of the initial image were labeled as short-term maintenance. Repairs that occurred between one and three years later were labeled as long-term. Segments with no visible repairs within three years were classified as having no maintenance. These labels provide a basis for evaluating the models’ ability to estimate repair urgency. While the use of Google Street View imagery may introduce some uncertainty in identifying the exact timing or type of repair, this approach remains a practical solution given the absence of datasets that contain both PCI records and detailed maintenance histories. To minimize potential bias, we aligned image selection with PCI assessment years and verified repairs at the same locations to ensure consistency.
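The labeling scheme above reduces to a three-way threshold rule. The sketch below is ours, not the authors' code; the function name and its argument (years until a repair first becomes visible, `None` if none is seen within three years) are hypothetical.

```python
def maintenance_label(years_until_repair):
    """Label a road segment by when a visible repair first appears in
    sequential Street View imagery. `None` means no repair was observed
    within the three-year review window. Illustrative helper only."""
    if years_until_repair is None or years_until_repair > 3:
        return "no maintenance"
    if years_until_repair <= 1:
        return "short-term"
    return "long-term"
```

For example, a repair appearing six months after the initial image yields "short-term", while one appearing after two years yields "long-term".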

In addition, manual annotations were conducted by members of the research team to support the evaluation of the models’ identification capabilities. Each image was independently reviewed by annotators, who answered a series of structured questions corresponding to the model prompts. A total of 29 questions were answered for each image. These questions focused on the presence and type of pavement defects, their spatial distribution, and relevant environmental or contextual conditions. Specifically, annotators assessed (1) surface features, such as whether cracks or potholes were present, and if so, the general type of cracking (transverse, longitudinal, or alligator); (2) spatial distribution patterns, indicating whether defects appeared isolated, spread across the pavement surface, or concentrated along joints, edges, or drainage areas; and (3) environmental and contextual conditions, including visible intersections, drainage features, patched or sealed areas, signs of resurfacing, and evidence of poor drainage or standing water that could exacerbate pavement deterioration. In cases where annotators disagreed, the discrepancies were discussed collectively until a consensus was reached. These annotations serve as benchmarks for evaluating the accuracy of model responses to prompts related to identification tasks.

Assessment.

This study evaluated the performance of seven MLLMs: Gemini 2.5 Pro, OpenAI o1, GPT-4o, Gemma 3, Llama 3.2, LLaVA v1.6 Mistral, and LLaVA v1.6 Vicuna. Fig 1 shows the general workflow of the MLLM evaluation process. These models were tested across five performance dimensions to assess their ability to complete four categories of road condition assessment tasks of varying complexity. Each image was prompted once, as the 39 prompts were designed primarily as closed-ended questions with well-defined terms and clearly worded instructions (e.g., yes/no, multiple choice, or numerical rating), which limit open-ended variability in model outputs. The output from each model consisted of text-based responses generated in direct answer to the prompt questions, which were then compared with human-annotated responses to evaluate correctness, consistency, and reasoning quality. All models were configured with a temperature of 0.3 and a top-p of 1.0, while other parameters were kept at their default settings to ensure consistency across platforms. All evaluations were conducted on a high-performance computing cluster equipped with NVIDIA V100 GPUs (32 GB memory) and Intel Xeon Gold 6148 CPUs (2.40 GHz).
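A minimal sketch of how one image–prompt query might be assembled under the stated decoding settings (temperature 0.3, top-p 1.0). The payload follows the common OpenAI-style chat-completions format for vision inputs; the helper name and exact field layout are illustrative assumptions, not the authors' code.

```python
import base64

def build_request(model, prompt, image_path):
    """Pair one street-view image with one task prompt in a
    chat-completion-style payload, using the study's decoding settings
    (temperature 0.3, top-p 1.0; other parameters at defaults).
    Illustrative sketch of the request structure only."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "temperature": 0.3,
        "top_p": 1.0,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
            ],
        }],
    }
```

The same payload shape (text part plus base64-encoded image part) is accepted by OpenAI-compatible endpoints; open-source models served locally would need an equivalent adapter.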

Fig 1. Workflow of the MLLM evaluation process (Images are used for illustrative purposes only and comply with applicable copyright and licensing requirements; no images are sourced from Google Street View or other proprietary datasets).


Task design.

Four task categories, which represent a progression from low-level visual recognition to higher-level condition diagnosis and intervention planning in surface road assessment, were designed to evaluate model capabilities: (1) Pavement surface distress and feature identification, which addresses basic defect detection critical for maintenance prioritization; (2) Spatial pattern recognition, which evaluates whether deterioration is localized or widespread and thus supports infrastructure-level planning; (3) Pavement condition description and severity evaluation, which provides higher-level diagnostic insights into pavement health and deterioration risks; and (4) Maintenance interval estimation, which directly informs scheduling of interventions and long-term resource planning. Table 1 summarizes the four task categories, detailing the associated tasks and assessment focus of each.

Table 1. Task categories and associated prompts.
Task category: Pavement surface distress and feature identification
Task detail: Identify the presence of cracks; identify the presence of potholes; specify the type of crack (e.g., transverse, longitudinal, alligator); identify surface repair features (e.g., patching, resurfacing); detect other surface anomalies or treatments.
Assessment focus: Detect basic surface defects for maintenance need assessment.

Task category: Spatial pattern recognition
Task detail: Describe the spatial distribution of cracks (e.g., isolated, widespread, along joints); describe the spatial distribution of potholes (e.g., near intersections, along edges); identify whether defects are clustered or evenly distributed; specify if defects appear near key infrastructure features (e.g., drains, curbs).
Assessment focus: Assess spread and clustering to inform localized vs. systemic responses.

Task category: Pavement condition description and severity evaluation
Task detail: Provide a written description of the overall pavement condition; assess the severity of observed defects; estimate the PCI; describe the visual cues used to justify the severity assessment.
Assessment focus: Diagnose pavement health for condition-based decision making.

Task category: Maintenance interval estimation
Task detail: Recommend a repair timeline (e.g., short-term, long-term, or no maintenance needed).
Assessment focus: Translate surface condition into actionable planning guidance.

Assessment dimension.

This study assessed the MLLMs’ performance across five dimensions, each corresponding to an aspect of model performance: response rate, response correctness, consistency, multimodal errors, and overall computational intensity and cost.

Response rate was used to assess model responsiveness. It was defined as the proportion of prompts for which the model generated a valid and interpretable output. A higher response rate indicates that the model can engage with a wide range of task instructions, while a lower rate suggests difficulty in interpreting or addressing certain prompt types. In this study, the response rate was calculated programmatically as the percentage of images that yielded a valid response for each question. Response rates were first computed for each of the 39 prompts for each model and then aggregated to represent overall responsiveness at the model level.
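A sketch of this computation follows; the function and variable names are illustrative, and a "valid response" is approximated here as any non-empty output.

```python
def response_rate(responses):
    """Percentage of images yielding a valid, interpretable output for one
    prompt. `responses` maps image id -> raw model output, where None or an
    empty/whitespace-only string counts as a non-response (an assumption;
    the paper's validity check may be stricter)."""
    valid = sum(1 for r in responses.values() if r is not None and r.strip())
    return 100.0 * valid / len(responses)

def model_response_rate(per_prompt_rates):
    """Aggregate the per-prompt rates (39 in the study) into one
    model-level responsiveness score."""
    return sum(per_prompt_rates) / len(per_prompt_rates)
```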

Response correctness was used to evaluate the factual accuracy of model outputs by comparing them against established benchmark references, including PCI scores, maintenance interval estimates, and manually annotated surface features. Standard classification metrics were calculated: accuracy, precision, recall, and F1 score. These metrics quantify the degree to which model predictions align with ground-truth annotations and serve as objective measures of task performance.
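For the binary (yes/no) prompts, these metrics reduce to the usual confusion-matrix quantities. The helper below is a minimal sketch of that computation, not the study's code; a library routine such as scikit-learn's `precision_recall_fscore_support` would serve equally well.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary (yes/no) prompts,
    computed against benchmark annotations. Inputs are parallel lists of
    booleans (True = feature present). Illustrative sketch only."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # true positives
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))  # true negatives
    fp = sum(not t and p for t, p in zip(y_true, y_pred))      # false positives
    fn = sum(t and not p for t, p in zip(y_true, y_pred))      # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}
```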

Consistency served as a measure of inter-model agreement. It was computed using Cohen’s Kappa, which evaluates the level of agreement between different models when responding to the same prompt, while adjusting for chance-level concordance. All model responses, including uncertain or incomplete ones such as “Not sure”, were included in the calculation to reflect real-world deployment conditions.
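The agreement computation can be sketched as follows. This is a minimal pure-Python version of Cohen's Kappa written for illustration; in practice a library routine such as scikit-learn's `cohen_kappa_score` would serve the same purpose.

```python
from collections import Counter

def cohens_kappa(answers_a, answers_b):
    """Chance-corrected agreement between two models' answers to the same
    prompts; uncertain replies such as "Not sure" stay in as categories.
    kappa = (p_observed - p_expected) / (1 - p_expected)."""
    n = len(answers_a)
    observed = sum(a == b for a, b in zip(answers_a, answers_b)) / n
    freq_a, freq_b = Counter(answers_a), Counter(answers_b)
    # Expected agreement by chance, from each model's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```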

Multimodal errors were evaluated to identify failures in the model’s ability to process, integrate, or reason across visual and textual information, specifically within the context of street-view imagery. This aspect of the analysis focused on the model’s reasoning and interpretive behavior, aiming to explore why incorrect or fabricated content was generated. Errors were examined qualitatively to identify recurring patterns and potential underlying causes. As part of this evaluation, hallucinations were included as a specific subtype of multimodal error. In this study, hallucinations were defined as responses that introduced fabricated, unsupported, or irrelevant content not grounded in the input image or prompt. A response was classified as a hallucination when the described feature or condition could not be verified in the corresponding image.

Overall computational intensity and cost were used to evaluate efficiency, quantified by recording each model’s token usage and response generation time. Statistical significance of model differences was tested using the Kruskal-Wallis test, followed by pairwise Wilcoxon rank-sum tests with Bonferroni correction. To complement these tests, effect sizes were calculated to assess the magnitude and practical significance of the observed differences. For pairwise comparisons, the rank-biserial correlation (r) was used, and for overall group differences, eta squared (η²) was computed from the Kruskal–Wallis statistic. For proprietary models, token usage was further translated into estimated monetary cost based on publicly available pricing, enabling comparison of cost-efficiency.
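The effect-size computations can be sketched as below. The significance tests themselves are standard library calls (e.g., `scipy.stats.kruskal` and `scipy.stats.mannwhitneyu`); the helper names here are ours, and the eta-squared formula is the common Kruskal-Wallis-based estimate, which the paper does not spell out.

```python
def rank_biserial(x, y):
    """Rank-biserial correlation for one pairwise comparison: the share of
    (x, y) pairs where x exceeds y minus the share where y exceeds x
    (ties favor neither), equivalent to the U-based formula r = 1 - 2U/(n1*n2)."""
    greater = sum(xi > yj for xi in x for yj in y)
    less = sum(xi < yj for xi in x for yj in y)
    return (greater - less) / (len(x) * len(y))

def eta_squared_from_h(h, k, n):
    """Eta squared for the overall Kruskal-Wallis comparison:
    eta^2 = (H - k + 1) / (n - k), for k groups and n total observations."""
    return (h - k + 1) / (n - k)
```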

Results

Response rate

The evaluated MLLMs demonstrated the ability to produce outputs in response to domain-specific prompts related to road surface condition assessment. As shown in Fig 2, the overall average response rate, calculated by aggregating model outputs across four task categories, was 63.65%, indicating a moderate capacity to produce interpretable outputs. Both the highest and lowest response rates were observed among the proprietary models. Gemini 2.5 Pro achieved the highest rate at 76.01%, demonstrating strong responsiveness to task prompts. In contrast, OpenAI o1 and GPT-4o recorded the lowest rates, at 51.24% and 51.67%, respectively. Among the open-source models, LLaVA v1.6 Mistral, Gemma 3, and LLaVA v1.6 Vicuna achieved relatively high response rates of 70.85%, 69.44%, and 66.58%, respectively. Llama 3.2 had the lowest response rate within this group, at 59.73%, slightly below the overall average.

Fig 2. Overall response rates of MLLMs across all task categories.


Table 2 reports model response rates by the four task categories. Overall, the MLLMs responded well to prompts focused on pavement surface distress and feature identification (71.82%), pavement condition description and severity evaluation (77.86%), and maintenance interval estimation (94.86%). However, the average response rate dropped markedly for spatial pattern recognition tasks (47.06%). These findings suggest that current MLLMs are generally capable of responding to tasks involving object identification and text-based reasoning, such as describing pavement distress and estimating maintenance needs. However, they struggle with tasks that require interpreting visual information related to spatial distribution patterns.

Table 2. Average response rates (%) by task category.

MLLMs Pavement surface distress and feature identification Spatial pattern recognition Pavement condition description and severity evaluation Maintenance interval estimation
Gemini 2.5 Pro 88.30 57.14 89.87 93.49
OpenAI o1 52.05 29.58 83.53 96.93
GPT-4o 53.49 28.98 84.20 98.47
Gemma 3 78.72 48.61 89.66 100.00
Llama 3.2 70.32 47.39 62.50 94.64
LLaVA v1.6 Mistral 81.52 59.24 73.43 94.64
LLaVA v1.6 Vicuna 78.34 58.48 61.86 85.82
Overall 71.82 47.06 77.86 94.86

Pavement surface distress and feature identification exhibited varied response rates across models. Gemini 2.5 Pro (88.30%), LLaVA v1.6 Mistral (81.52%), and Gemma 3 (78.72%) showed higher overall response rates for this category, whereas OpenAI o1 (52.05%) and GPT-4o (53.49%) recorded markedly lower rates. Reviewing the subtask-level results in Table 3, most models responded well to general questions, such as identifying whether cracks are present, with response rates close to or at 100% (e.g., 99.62% for OpenAI o1 and 100.00% for GPT-4o). However, their responsiveness dropped markedly for fine-grained classification tasks. For example, in the subtask of identifying transverse cracks, the response rates of OpenAI o1 and GPT-4o fell to 42.91% and 44.06%, respectively. These findings suggest that the lower overall responsiveness of models like OpenAI o1 and GPT-4o is largely due to their limited performance on fine-grained tasks involving specific distress types.

Table 3. Response rates (%) of MLLMs by subtask in pavement surface distress and feature identification.

MLLMs | Cracks: in general, transverse, longitudinal, alligator, other types | Potholes: present, number | Other infrastructure: utility cuts or joints, drainage | Pavement surface treatments: repairs in general, patched areas, sealed cracks, resurfacing
Gemini 2.5 Pro 94.25 80.84 80.84 80.84 77.78 94.25 90.80 94.25 89.66 92.34 91.95 90.80 89.27
OpenAI o1 99.62 42.91 42.91 42.91 41.76 99.62 45.21 99.23 28.35 36.78 36.78 36.78 23.75
GPT-4o 100.00 44.06 44.06 44.06 41.76 100.00 44.06 98.85 28.74 41.76 41.76 41.76 24.52
Gemma 3 100.00 83.14 83.14 83.14 82.76 100.00 34.48 99.62 70.11 74.33 73.95 73.95 64.75
Llama 3.2 99.62 90.80 90.80 90.80 73.56 99.62 89.66 83.14 2.30 63.98 63.98 63.98 1.92
LLaVA v1.6 Mistral 100.00 99.62 99.62 99.62 78.93 99.62 90.42 95.79 26.82 81.99 81.61 81.23 24.52
LLaVA v1.6 Vicuna 100.00 99.62 98.85 98.47 42.91 100.00 98.85 96.93 1.53 93.49 93.87 93.49 0.38
Overall 99.07 77.28 77.17 77.12 62.78 99.02 70.50 95.40 35.36 69.24 69.13 68.86 32.73

Spatial pattern recognition had the lowest response rates among all tasks, with GPT-4o performing the weakest at just 28.98% and LLaVA v1.6 Mistral achieving the highest at 59.24%. These results indicate that most models struggle with tasks involving the spatial distribution, patterns, or layout of road damage. Open-source models generally outperformed proprietary models in this category: LLaVA v1.6 Mistral (59.24%), LLaVA v1.6 Vicuna (58.48%), and Gemma 3 (48.61%) achieved higher response rates than GPT-4o (28.98%) and OpenAI o1 (29.58%), suggesting that open-source architectures may be better suited for spatial reasoning tasks. Nevertheless, all models showed markedly lower response rates in spatial pattern recognition than in other tasks, underscoring the persistent difficulty of capturing spatial context and relational features in road surface imagery.

Pavement condition description and severity evaluation presented strong response rates across models, with an overall average of 77.86%. Most models were capable of generating concise textual descriptions of the road environment and estimating damage severity. Gemini 2.5 Pro (89.87%) and Gemma 3 (89.66%) achieved the highest response rates in this task category, while Llama 3.2 recorded the lowest at 62.50%.

Maintenance interval estimation had the highest response rates among all task categories, with an overall average of 94.86%. Gemma 3 reached 100%, followed by GPT-4o (98.47%), OpenAI o1 (96.93%), and Gemini 2.5 Pro (93.49%). The remaining models also performed well, each with response rates above 85%. These results indicate that most MLLMs are able to generate outputs when prompted to estimate appropriate maintenance timing based on road condition inputs.

Response correctness

Response correctness varied across models, with proprietary models generally outperforming their open-source counterparts. Fig 3 compares the response correctness of the seven MLLMs across four metrics: (a) accuracy, (b) precision, (c) recall, and (d) F1 score; the red dashed line in each panel marks the mean score for that metric. GPT-4o and OpenAI o1 (both proprietary) achieved the highest scores across all metrics, indicating stronger performance in generating accurate, relevant, and complete outputs for road condition assessment tasks. In contrast, open-source models consistently underperformed, particularly in precision and recall, suggesting a higher likelihood of producing false positives and omitting relevant features.

Fig 3. Overall response correctness of MLLMs across all task categories.


As shown in Fig 3(a), GPT-4o achieved the highest accuracy at 67.80%, followed closely by OpenAI o1 at 67.24%. LLaVA v1.6 Vicuna was the only open-source model to exceed the group average of 61.49%, achieving an accuracy of 63.20%. Other models such as Gemini 2.5 Pro (58.85%), Gemma 3 (58.09%), and LLaVA v1.6 Mistral (55.37%) fell below the average, indicating more frequent generation of incorrect outputs. These results suggest that GPT-based models are more effective at producing correct responses across road condition assessment tasks.

In terms of precision, shown in Fig 3(b), OpenAI o1 led with 50.34%, followed by GPT-4o at 45.29%. Although these values were the highest among the evaluated models, they indicate that even the top-performing models occasionally labeled incorrect outputs as correct. Precision values for Gemma 3, LLaVA v1.6 Mistral, and LLaVA v1.6 Vicuna ranged between 32.69% and 43.25%, suggesting a greater tendency to produce false positives. Gemini 2.5 Pro recorded a precision of 40.46%, which was below the group mean of 41.64%.

Recall scores, displayed in Fig 3(c), followed a similar pattern to accuracy and precision. OpenAI o1 (60.19%), GPT-4o (58.55%), and Gemini 2.5 Pro (56.00%) all outperformed the group average of 51.96%, indicating they were better at correctly identifying relevant information for surface road condition assessment tasks. In contrast, all open-source models had lower recall scores, ranging from 44.49% to 49.84%, suggesting they were more likely to miss important features during the assessment.

Fig 3(d) presents F1 scores, which balance precision and recall, providing an overall measure of response correctness. OpenAI o1 and GPT-4o again led this metric, with scores of 63.53% and 63.25%, respectively. Gemma 3 achieved a moderately strong score of 59.47%, while Gemini 2.5 Pro lagged behind at 54.21% due to its lower precision. The remaining open-source models, LLaVA v1.6 Mistral (50.31%), LLaVA v1.6 Vicuna (55.16%), and Llama 3.2 (53.34%), all scored below the group mean of 57.03%. These results confirm that proprietary models are better equipped to deliver correct responses for the assessment tasks.
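The four correctness metrics follow the standard confusion-matrix definitions. A minimal sketch for binary yes/no answers (the toy labels below are illustrative, not the study's evaluation data):

```python
def correctness_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 score for binary answers (1 = yes)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy ground truth vs. model answers to "is a crack present?"
truth = [1, 1, 0, 0, 1, 0, 1, 0]
pred  = [1, 0, 0, 1, 1, 0, 1, 0]
print(correctness_metrics(truth, pred))  # (0.75, 0.75, 0.75, 0.75)
```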

Table 4 presents response correctness metrics across the four task categories. Results show that models performed better on visual recognition tasks, such as pavement surface distress and feature identification as well as spatial pattern recognition, but showed weaker performance on interpretive reasoning tasks, including pavement condition evaluation and maintenance interval estimation.

Table 4. Response correctness metrics (%) by task category.

MLLMs Pavement surface distress and feature identification Spatial pattern recognition Pavement condition evaluation Maintenance interval estimation
Accuracy Precision Recall F1 score Accuracy Precision Recall F1 score Accuracy Precision Recall F1 score Accuracy Precision Recall F1 score
Gemini 2.5 Pro 73.04 49.49 65.35 64.85 56.43 37.37 54.98 52.06 17.50 24.00 18.20 22.30 23.00 30.50 40.90 21.90
OpenAI o1 77.55 60.65 69.21 67.72 67.16 48.68 60.02 68.53 19.85 19.10 17.65 20.85 44.70 29.50 40.10 32.70
GPT-4o 78.99 55.32 66.79 69.50 67.11 41.69 58.50 66.55 19.80 25.30 19.45 19.80 45.90 22.00 40.80 33.30
Gemma 3 68.21 38.53 58.48 66.46 62.63 56.55 49.63 65.40 13.20 18.30 11.95 13.25 3.10 18.20 23.10 5.60
Llama 3.2 67.41 44.70 55.81 57.33 63.99 39.68 52.88 58.93 14.65 17.60 12.05 19.65 9.30 17.80 22.10 13.20
LLaVA v1.6 Mistral 57.25 34.05 48.08 50.17 65.44 36.49 47.91 60.07 11.45 12.90 12.60 15.45 7.70 13.20 25.90 9.20
LLaVA v1.6 Vicuna 68.74 45.85 52.38 58.44 68.47 41.79 47.86 60.58 18.40 10.40 10.80 23.35 24.60 22.60 20.60 27.30
Overall 70.17 46.94 59.44 62.07 64.46 43.18 53.11 61.73 16.41 18.23 14.67 19.24 22.61 21.97 30.50 20.46

Among the visual recognition tasks, the highest performance was observed in pavement surface distress and feature identification. GPT-4o (Accuracy: 78.99%, F1: 69.50%) and OpenAI o1 (Accuracy: 77.55%, F1: 67.72%) demonstrated strong capabilities in detecting visible surface conditions such as cracks and potholes. Similarly, in spatial pattern recognition, both models maintained relatively high performance, with OpenAI o1 achieving an F1 score of 68.53% and GPT-4o scoring 66.55%.

In contrast, interpretive reasoning tasks proved more challenging, as they require models to infer severity levels or predict future repair actions from complex and often ambiguous visual and contextual cues. Consequently, model performance on these tasks remains low. All models recorded F1 scores below 25% for pavement condition evaluation and below 35% for maintenance interval estimation, highlighting their limited capability to produce accurate and consistent predictions.

Consistency

Fig 4 shows that proprietary models tend to produce more consistent outputs, while open-source models generate more variable and less predictable responses. Specifically, proprietary models exhibited stronger inter-model agreement than open-source models. The highest pairwise agreement was observed between GPT-4o and OpenAI o1 (0.621), indicating a high level of consistency in their outputs, likely due to similar underlying architectures or training strategies. Gemini 2.5 Pro also showed moderate agreement with GPT-4o (0.181) and OpenAI o1 (0.196), further supporting the coherence within proprietary models. In contrast, open-source models demonstrated low or near-zero agreement both among themselves and with proprietary models. For example, LLaVA v1.6 Vicuna showed almost no agreement with other models, with values ranging from 0.002 to 0.029, and even a slightly negative agreement with LLaVA v1.6 Mistral (−0.028). Similarly, LLaVA v1.6 Mistral and Llama 3.2 showed weak inter-model alignment, suggesting inconsistent output behavior.

Fig 4. Average inter-model agreement between MLLMs.


Fig 5 presents inter-model agreements across four task categories. Each heatmap panel shows agreement on a road condition assessment task: (a) pavement surface distress and feature identification, (b) spatial pattern recognition, (c) road condition description and severity evaluation, and (d) maintenance interval estimation. GPT-4o and OpenAI o1 showed high agreement in all four task categories, with particularly strong alignment in pavement surface distress identification and spatial pattern recognition. This suggests that GPT-based models are more consistent in visual recognition tasks. Gemini 2.5 Pro showed moderate agreement with both GPT-4o and OpenAI o1 across all tasks, indicating relatively stable response behavior and internal consistency among proprietary models. In contrast, open-source models showed low agreement both among themselves and with proprietary models across all task categories. For example, LLaVA v1.6 Vicuna and Mistral frequently yielded near-zero or even negative agreement scores, indicating that they often produced divergent outputs when processing the same input. Such inconsistency suggests that open-source MLLMs may be less dependable for road condition assessment tasks, particularly those that require consistent recognition and interpretation of visual and contextual information.

Fig 5. Pairwise inter-model agreement across MLLMs.

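The paper does not name the agreement statistic, but the slightly negative values reported are characteristic of a chance-corrected measure such as Cohen's kappa. A minimal sketch under that assumption, using hypothetical yes/no answers from two models:

```python
def cohens_kappa(a, b):
    """Cohen's kappa between two models' categorical answers.

    Values near 1 indicate strong agreement; negative values indicate
    agreement below the level expected by chance.
    """
    n = len(a)
    po = sum(1 for x, y in zip(a, b) if x == y) / n  # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance agreement
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)

# Toy answers from two models on the same six images.
m1 = ["yes", "yes", "no", "no", "yes", "no"]
m2 = ["yes", "no", "no", "yes", "yes", "no"]
print(round(cohens_kappa(m1, m2), 3))  # 0.333
```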

Multimodal errors

Pavement surface distress and feature identification.

MLLMs often struggle to identify specific types of surface distress when performing fine-grained classification. While most models reliably detected the general presence of cracks or potholes, they frequently misidentified the specific type of distress. In several instances, models hallucinated crack types, for example labeling transverse cracks as longitudinal or falsely detecting alligator cracks in unrelated patterns. These hallucinations commonly occurred when distress features were subtle, partially obscured, or visually similar, making it difficult for the model to apply precise visual reasoning. This pattern suggests that although object detection is often accurate at a general level, the step of assigning a detailed label is prone to error. Such misclassifications introduce risks when outputs are used to inform infrastructure repair decisions, particularly in workflows where the type of surface distress directly affects prioritization or treatment strategy.

Spatial pattern recognition.

MLLMs often encounter difficulty when interpreting how surface damage is distributed across roadways, particularly in tasks that require recognizing spatial layouts or structural alignment. Unlike general damage detection, spatial pattern recognition demands a more nuanced understanding of how cracks or potholes relate to features such as joints, edges, or intersections. In these cases, models frequently misclassified or hallucinated spatial arrangements. These errors were especially common when features were irregular, subtle, or embedded in complex visual contexts. Rather than relying solely on observed visual cues, models may infer complexity where none exists, leading to exaggerated or inaccurate spatial descriptions. This poses a challenge for applications that require accurate diagnosis of localized deterioration, such as planning targeted maintenance interventions.

Pavement severity assessment.

Discrepancies were observed between MLLM-generated severity assessments and PCI scores provided by the City of San Francisco. These differences do not necessarily indicate model error, as the PCI scores are based on human evaluations that may introduce subjectivity. As shown in S2 Table, in some cases, MLLM assessments more accurately reflect the visible surface conditions captured in the street view images. This suggests that PCI labels may not always provide the most up-to-date or visually consistent representation of roadway quality, and that MLLMs can offer a valid alternative based on visual evidence. At the same time, it is important to note that in this study, the prompt requested only a PCI value without requiring justification. Without an explicit requirement for reasoning, the model may generate estimated values based on assumptions or unsupported visual cues.

Pavement maintenance interval estimation.

Multimodal errors in responses to maintenance interval estimation occurred when models relied excessively on visible surface distress, leading them to suggest short-term repairs for roads that were not actually scheduled for maintenance. This may have happened because important contextual factors such as long-term transportation plans, budget limitations, and traffic management needs were not considered, as shown in S3 Table. Although the road conditions appeared to be deteriorated, real-world maintenance decisions often prioritize these broader considerations, leading to delayed repairs. This behavior could represent a form of hallucination in which the model inferred unwarranted maintenance needs based solely on surface appearance, generating recommendations that were not grounded in actual maintenance schedules or operational priorities.

Overall computational intensity and cost

Fig 6 compares the processing time (left) and token count (right) across all evaluated MLLMs. Red lines indicate median values, annotated above each box. Proprietary models exhibited longer processing times per street image. OpenAI o1 and GPT-4o had median durations of 67.25 and 52.08 seconds, respectively, while Gemini 2.5 Pro was faster at 40.79 seconds. In contrast, open-source models showed consistently shorter processing times with minimal variability.

Fig 6. Comparison of processing time and token count across MLLMs.


Token usage varied by model. Among the proprietary models, Gemini 2.5 Pro had the highest median token usage (3,445), followed by GPT-4o (3,304) and OpenAI o1 (3,196). Interestingly, with input prompts and image embeddings kept consistent across cases, the near-constant token usage of open-source models suggests a fixed-length output behavior. This pattern suggests that open-source models may rely more heavily on templated responses, while proprietary models adjust output length dynamically based on input content.

A Kruskal-Wallis test revealed significant differences in token usage among the three proprietary models (p = 0.0066), with a small effect size (η² = 0.010). Post hoc pairwise Wilcoxon rank-sum tests with Bonferroni correction showed that Gemini 2.5 Pro used a significantly different number of tokens compared to both GPT-4o (adjusted p = 0.027, r = 0.11) and OpenAI o1 (adjusted p = 0.018, r = 0.12), while no significant difference was found between GPT-4o and OpenAI o1 (adjusted p = 1.000, r = 0.04). Despite statistical significance, the small effect sizes suggest that these differences in token usage may have limited practical relevance on their own.
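The Kruskal-Wallis H statistic underlying this test can be computed directly from pooled ranks. A minimal sketch with synthetic per-image token counts (illustrative values only, not the paper's data); in practice the p-value is then obtained from a chi-squared distribution with k − 1 degrees of freedom:

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic over k independent samples.

    Ties receive average ranks; the tie-correction factor is omitted
    for brevity. Under H0, H is approximately chi-squared (k - 1 dof).
    """
    pooled = sorted(v for g in groups for v in g)
    ranks, i = {}, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # average of 1-based ranks i+1..j
        i = j
    n = len(pooled)
    return 12 / (n * (n + 1)) * sum(
        sum(ranks[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)

# Synthetic token counts for three models (fully separated groups).
gemini = [3500, 3400, 3450]
gpt4o = [3300, 3310, 3290]
o1 = [3200, 3190, 3210]
print(round(kruskal_h(gemini, gpt4o, o1), 2))  # 7.2
```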

Translating token usage into monetary cost, analyzing one street image with 39 prompts related to road surface condition assessment costs approximately $0.02 using Gemini 2.5 Pro, $0.04 using GPT-4o, and $0.12 using OpenAI o1. Although Gemini 2.5 Pro typically consumes more tokens, its lower per-token rate results in a lower overall cost compared to OpenAI o1. The per-token cost of OpenAI o1 is approximately seven times higher than that of Gemini 2.5 Pro and at least 3.75 times higher than that of GPT-4o. Consequently, OpenAI o1 incurs the highest cost per image, while Gemini 2.5 Pro and GPT-4o offer more cost-efficient alternatives.

Discussion, limitations and future work

The potential and current limitations of existing MLLMs in road surface condition assessment

This study assessed the performance of both proprietary and open-source MLLMs on four tasks relevant to road surface condition assessment. Results indicate that these models can analyze street-level imagery and generate relevant, structured responses that align with task objectives, confirming the potential of MLLMs to support road surface condition assessment workflows.

Beyond performance, MLLMs are characterized by their user-friendly design and low operational costs, which support their applicability in pavement condition assessment. The user-friendliness of MLLMs is reflected in two aspects. First, they allow users to interact with the model using natural language prompts [83], removing the need for specialized programming expertise. Second, MLLMs can produce structured assessment outputs directly from raw street-level images, without requiring extensive data preprocessing or formatting. This streamlined input process reduces the technical burden typically associated with automated assessment tools and facilitates more efficient deployment in real-world pavement evaluation workflows.

In terms of cost, MLLMs are also relatively affordable. The cost of analysis ranges from $0.02 to $0.12 per image. In contrast, conventional road condition assessments can be significantly more expensive, with costs starting at approximately $3,772 for evaluating a roadway of at least one mile in length [84]. Assuming one image is captured every 25 feet, or approximately 212 images per mile, an MLLM-based assessment would cost between $4.24 and $25.44 for a one-mile segment.
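The per-mile cost arithmetic can be reproduced as follows (the spacing, per-image prices, and resulting totals are the figures stated above):

```python
import math

# One street image every 25 ft gives ~212 images per one-mile segment;
# per-image analysis cost ranges from $0.02 (Gemini 2.5 Pro) to $0.12 (OpenAI o1).
FEET_PER_MILE = 5280
IMAGE_SPACING_FT = 25
images_per_mile = math.ceil(FEET_PER_MILE / IMAGE_SPACING_FT)  # 212

for model, per_image in [("Gemini 2.5 Pro", 0.02),
                         ("GPT-4o", 0.04),
                         ("OpenAI o1", 0.12)]:
    print(f"{model}: ${per_image * images_per_mile:.2f} per one-mile segment")
```

Running this prints $4.24, $8.48, and $25.44 per mile, two to three orders of magnitude below the roughly $3,772 starting cost of a conventional one-mile assessment.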

However, these advantages in usability and affordability are accompanied by several limitations. MLLM performance is not consistent across all task types and tends to vary with task complexity. The models perform well on general tasks that require minimal contextual understanding, such as identifying the presence of cracks or potholes. Their performance weakens as tasks require more detailed or specific classifications. For example, the average response rate drops from 99.07% when identifying general cracks to 77.28% when identifying transverse cracks. Accuracy also declines, with the average F1 score decreasing from 0.62 for object identification to 0.20 for maintenance interval estimation.

Another factor affecting model performance is the clarity and completeness of the input prompt. Models often produce inaccurate responses when the input lacks sufficient context. For instance, in this study, the prompt only asked the models to recommend maintenance intervals based on street-view images, without providing any additional background information. As a result, the models focused exclusively on visible pavement conditions and overlooked non-visual determinants of maintenance needs. Incorporating these contextual factors into model prompts or fine-tuning datasets could improve both the reliability and practical utility of MLLM-generated outputs.

Furthermore, generative models may produce probabilistic outputs, meaning that repeated runs do not necessarily yield identical results unless generation parameters are fixed. This stochastic variability parallels observer inconsistency in manual inspections and can introduce uncertainty into the evaluation process. Standardizing model parameters and employing well-structured prompts can mitigate this randomness, thereby enhancing the reproducibility and reliability of MLLM outputs.

A further limitation lies in the explainability of MLLM outputs. Although the models can generate decisions related to surface road condition assessment, their reasoning processes remain opaque. For instance, when asked to provide a PCI based on an image, the models typically return only a numerical value without clarifying how it was derived. While MLLMs can produce descriptive explanations, these are not always grounded in verifiable causal relationships and may reflect post-hoc rationalizations rather than genuine interpretive reasoning. Prompting models to explicitly state the rationale behind their decisions could improve transparency and facilitate more trustworthy applications in pavement assessment.

Finally, although GAI can reduce reliance on manual inspections by automating visual assessments, human oversight remains necessary to validate individual cases and ensure alignment with engineering standards. While GAI applied to street-view imagery can assist with surface condition assessment tasks, effective pavement management requires data beyond surface imagery, such as structural integrity and traffic loading. GAI should therefore be viewed as a decision-support toolkit that enhances human judgment rather than a replacement for professional expertise and field validation.

Pathways to improving MLLM performance

Enhancing the effectiveness of MLLMs on road surface condition assessment will require improvements in several areas. One key area is the need for additional domain-specific training data to support fine-tuning and improve model performance [85]. Many core assessment tasks, such as identifying specific types of pavement cracks or determining what types of repairs have been made to the road (e.g., patched areas, resurfacing), require detailed recognition capabilities that current models often lack. Compared to general object detection, the response rate for these fine-grained tasks is significantly lower, indicating that MLLMs need further tuning to recognize subtle distinctions in road surface conditions. Similarly, tasks such as pavement severity estimation and maintenance interval prediction depend on contextual variables beyond the visible defect itself. These tasks require reasoning based on factors such as budget, environmental exposure, or traffic patterns, which are typically not part of the model's default knowledge. Fine-tuning with annotated datasets that include these features is necessary to improve model performance on these complex tasks.

In parallel, knowledge distillation can serve as a complementary strategy [86]. By transferring knowledge from high-performing proprietary models to open-source models, distillation can enhance performance without the computational cost of training from scratch. This is particularly useful for tasks like spatial pattern recognition, where response rates may be low but the accuracy of responses, when provided, is relatively high. This suggests that models possess some capacity for spatial reasoning, but that it is not consistently activated. Distillation can help encode this reasoning more effectively, enabling smaller models to perform better in both classification and spatial interpretation tasks.

At the same time, prompt engineering plays an important role in improving model outputs, particularly for context-dependent tasks [87]. For instance, when asking when a road should be repaired, the model must consider information not visible in the image, such as usage intensity, historical repair records, or surrounding infrastructure. Designing prompts that incorporate this contextual information and guide the model to consider relevant variables can significantly improve response quality. This approach is especially valuable in situations where retraining or fine-tuning is not feasible, as it allows users to optimize model performance through carefully structured input design.
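An illustration of such a context-enriched prompt is sketched below. The field names and values are hypothetical and are not drawn from the study's actual prompt set (S1 Table); the point is only that non-visual context can be injected through structured input design:

```python
# Hypothetical template adding non-visual context (traffic, repair history,
# budget) to a maintenance-interval question. All fields are illustrative.
TEMPLATE = (
    "You are a pavement engineer. Based on the attached street-view image "
    "and the context below, recommend a maintenance interval in years.\n"
    "Average daily traffic: {adt}\n"
    "Last resurfacing: {last_repair}\n"
    "Annual maintenance budget tier: {budget_tier}\n"
    "Answer with a single number and a one-sentence justification."
)

prompt = TEMPLATE.format(adt=12000, last_repair="2018", budget_tier="medium")
print(prompt)
```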

Proprietary vs. open-source models: trade-offs and implications for future development

This study compared the performance of proprietary and open-source MLLMs across multiple dimensions, including responsiveness, accuracy, internal agreement, and cost. While both categories of models demonstrate similar levels of responsiveness, proprietary models generally achieve higher accuracy and show stronger internal agreement, producing similar outputs when presented with the same or slightly modified prompts. These advantages make proprietary models highly reliable for operational deployment; however, their use involves financial costs and limited transparency. Most proprietary models cannot be fine-tuned or adapted for domain-specific applications, and their token-based pricing structures can substantially increase computational expenses, especially in large-scale analyses.

In contrast, open-source models are freely available and offer greater flexibility for customization and domain adaptation. Although they typically perform slightly worse than proprietary models in accuracy and consistency, their transparency, accessibility, and modifiability make them valuable for research and development. Open-source models can be fine-tuned on task-specific datasets, enabling iterative improvement without recurring costs. This balance between performance and accessibility suggests that open-source models are a practical foundation for future pavement assessment applications, particularly in settings where reproducibility and budget constraints are critical.

Among the open-source models evaluated, Gemma 3 is recommended as the base model for fine-tuning. It demonstrates high response rates across all tasks, including a perfect score (100%) for maintenance interval estimation. Gemma 3 also achieves the highest F1 score among all open-source models, exceeding even that of Gemini 2.5 Pro, which indicates strong overall reliability and balanced performance. Furthermore, Gemma 3 shows higher consistency with proprietary models compared to other open-source models. In contrast, the relatively low consistency observed among the LLaVA and Llama models suggests that their responses may be more variable or randomly generated. Overall, Gemma 3 achieved the best performance across all tasks, suggesting it is a strong candidate for fine-tuning and domain-specific adaptation. For applications involving knowledge distillation, GPT-4o is the most suitable proprietary model. It achieves high accuracy and internal consistency, comparable to OpenAI o1, but at a lower operational cost.

Conclusion

In this study, we evaluated the performance of seven MLLMs, including three proprietary models (GPT-4o, OpenAI o1, Gemini 2.5 Pro) and four open-source models (LLaVA v1.6 Mistral, LLaVA v1.6 Vicuna, Llama 3.2, Gemma 3), across four task categories: (1) pavement surface distress and feature identification, (2) spatial pattern recognition, (3) pavement condition severity evaluation, and (4) pavement maintenance interval estimation. The models were assessed across five dimensions: response rate, accuracy, consistency, multimodal errors, and computational cost.

The results empirically verified the transformative potential of MLLMs, which lies in their strong responsiveness and low cost. On average, the models achieved a 63.65% response rate, and the cost of conducting an assessment based on a single image ranged from approximately $0.02 to $0.12, depending on the model used. While accuracy requires further optimization, this work establishes a benchmark proving that with domain-specific refinement, MLLMs can transition from assistive tools to scalable, cost-effective solutions for pavement monitoring and management.

Supporting information

S1 Table. List of prompts used for model evaluation.

(DOCX)

pone.0340380.s001.docx (16.2KB, docx)
S2 Table. Cases illustrating discrepancies between MLLM-generated severity assessments and PCI-based condition labels.

(DOCX)

pone.0340380.s002.docx (16.7KB, docx)
S3 Table. Comparison between maintenance intervals estimated by MLLMs and actual repair intervals.

(DOCX)

pone.0340380.s003.docx (16.2KB, docx)

Data Availability

We will share the data in a public GitHub repository once this paper is accepted.

Funding Statement

The project is funded by the Great Lakes–Northern Forest (GLNF) Cooperative Ecosystem Studies Unit (CESU), award number W912HZ-19-SOI-0002. The funder had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The content of this publication does not necessarily reflect the views or policies of the funder.

References

  • 1.Li Y, Huang J. Safety Impact of Pavement Conditions. Transportation Research Record: Journal of the Transportation Research Board. 2014;2455(1):77–88. doi: 10.3141/2455-09 [DOI] [Google Scholar]
  • 2.Verma D, Singh V. Road maintenance effect in reducing road accident. International Journal for Scientific Research & Development. 2015;3(1):303–7. [Google Scholar]
  • 3.Wiegmann J, Yelchuru B. Resource allocation logic framework to meet highway asset preservation. Transportation Research Board. 2012. [Google Scholar]
  • 4.Elmansouri O, Alossta A, Badi I. Pavement condition assessment using pavement condition index and multi-criteria decision-making model. Mechatron Intell Transp Syst. 2022;1:57–68. [Google Scholar]
  • 5.Ezzati S, Palma CD, Bettinger P, Eriksson LO, Awasthi A. An integrated multi-criteria decision analysis and optimization modeling approach to spatially operational road repair decisions. Can J For Res. 2021;51(3):465–83. doi: 10.1139/cjfr-2020-0016 [DOI] [Google Scholar]
  • 6.Burningham S, Stankevich N. Why road maintenance is important and how to get it done. 2005.
  • 7.Elbagalati O, Elseifi M, Gaspard K, Zhang Z. Development of the pavement structural health index based on falling weight deflectometer testing. International Journal of Pavement Engineering. 2016;19(1):1–8. doi: 10.1080/10298436.2016.1149838 [DOI] [Google Scholar]
  • 8.Jiang X, Gabrielson J, Huang B, Bai Y, Polaczyk P, Zhang M, et al. Evaluation of inverted pavement by structural condition indicators from falling weight deflectometer. Construction and Building Materials. 2022;319:125991. doi: 10.1016/j.conbuildmat.2021.125991 [DOI] [Google Scholar]
  • 9.Huang YH. Pavement analysis and design. Upper Saddle River, NJ: Pearson/Prentice Hall. 2004. [Google Scholar]
  • 10.Liu Z, Yang Q, Gu X. Assessment of Pavement Structural Conditions and Remaining Life Combining Accelerated Pavement Testing and Ground-Penetrating Radar. Remote Sensing. 2023;15(18):4620. doi: 10.3390/rs15184620 [DOI] [Google Scholar]
  • 11.Yao Y, Tung S-TE, Glisic B. Crack detection and characterization techniques-An overview. Struct Control Health Monit. 2014;21(12):1387–413. doi: 10.1002/stc.1655 [DOI] [Google Scholar]
  • 12.Arianto T, Suprapto M. Pavement condition assessment using IRI from Roadroid and surface distress index method on national road in Sumenep Regency. In: IOP Conference Series: Materials Science and Engineering, 2018. [Google Scholar]
  • 13.Yang CH, Kim JG, Shin SP. Road Hazard Assessment Using Pothole and Traffic Data in South Korea. Journal of Advanced Transportation. 2021;2021:1–10. doi: 10.1155/2021/5901203 [DOI] [Google Scholar]
  • 14.Merkle N, Henry C, Azimi SM, Kurz F. Road condition assessment from aerial imagery using deep learning. ISPRS Ann Photogramm Remote Sens Spatial Inf Sci. 2022;V-2–2022:283–9. doi: 10.5194/isprs-annals-v-2-2022-283-2022 [DOI] [Google Scholar]
  • 15.Lopes Amaral Loures L, Rezazadeh Azar E. Condition Assessment of Unpaved Roads Using Low-Cost Computer Vision–Based Solutions. J Transp Eng, Part B: Pavements. 2023;149(1). doi: 10.1061/jpeodx.pveng-1006 [DOI] [Google Scholar]
  • 16.Moussa G, Hussain K. A new technique for automatic detection and parameters estimation of pavement crack. In: 2011. [Google Scholar]
  • 17.Aslan OD, Gultepe E, Ramaji IJ, Kermanshachi S. Using Artifical Intelligence for Automating Pavement Condition Assessment. In: International Conference on Smart Infrastructure and Construction 2019 (ICSIC), 2019. 337–41. doi: 10.1680/icsic.64669.337 [DOI] [Google Scholar]
  • 18.Sholevar N, Golroo A, Esfahani SR. Machine learning techniques for pavement condition evaluation. Automation in Construction. 2022;136:104190. doi: 10.1016/j.autcon.2022.104190 [DOI] [Google Scholar]
  • 19.Eisenbach M, Stricker R, Sesselmann M, Seichter D, Gross H. Enhancing the quality of visual road condition assessment by deep learning. In: 2019.
  • 20.Satheesan D, Talib M, Li S, Yuan A. An Automated Method for Pavement Surface Distress Evaluation. Int Arch Photogramm Remote Sens Spatial Inf Sci. 2024;XLVIII-M-4-2024:47–53. doi: 10.5194/isprs-archives-xlviii-m-4-2024-47-2024
  • 21.Qureshi WS, Hassan SI, McKeever S, Power D, Mulry B, Feighan K, et al. An Exploration of Recent Intelligent Image Analysis Techniques for Visual Pavement Surface Condition Assessment. Sensors (Basel). 2022;22(22):9019. doi: 10.3390/s22229019
  • 22.Ding J, Li X. An Approach for Validating Quality of Datasets for Machine Learning. In: 2018 IEEE International Conference on Big Data (Big Data), 2018. doi: 10.1109/bigdata.2018.8622640
  • 23.Gharawi AA, Alsubhi J, Ramaswamy L. Impact of Labeling Noise on Machine Learning: A Cost-aware Empirical Study. In: 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), 2022. 936–9. doi: 10.1109/icmla55696.2022.00156
  • 24.Whang SE, Lee J-G. Data collection and quality challenges for deep learning. Proc VLDB Endow. 2020;13(12):3429–32. doi: 10.14778/3415478.3415562
  • 25.Drosatos G, Efraimidis PS, Arampatzis A. Federated and Transfer Learning Applications. MDPI. 2023:11722.
  • 26.Bandi A, Adapa PVSR, Kuchi YEVPK. The Power of Generative AI: A Review of Requirements, Models, Input–Output Formats, Evaluation Metrics, and Challenges. Future Internet. 2023;15(8):260. doi: 10.3390/fi15080260
  • 27.Lu K, Zhao X, Li M, Wang Z, Zhou Y, Wang J. StreetSenser: a novel approach to sensing street view via a fine-tuned multimodal large language model. International Journal of Geographical Information Science. 2025:1–29.
  • 28.Blečić I, Saiu V, Trunfio A. Enhancing urban walkability assessment with multimodal large language models. In: International Conference on Computational Science and Its Applications, 2024.
  • 29.Jang KM, Kim J. Multimodal Large Language Models as Built Environment Auditing Tools. The Professional Geographer. 2024;77(1):84–90. doi: 10.1080/00330124.2024.2404894
  • 30.Ragnoli A, De Blasiis MR, Di Benedetto A. Pavement Distress Detection Methods: A Review. Infrastructures. 2018;3(4):58. doi: 10.3390/infrastructures3040058
  • 31.Mishalani RG, Koutsopoulos HN. Role of spatial dimension in infrastructure condition assessment and deterioration modeling. Transportation Research Record. 1995;1508.
  • 32.Karim FMA, Rubasi KAH, Saleh AA. The Road Pavement Condition Index (PCI) Evaluation and Maintenance: A Case Study of Yemen. Organization, Technology and Management in Construction: an International Journal. 2016;8(1):1446–55. doi: 10.1515/otmcj-2016-0008
  • 33.Putra DA, Suprapto M. Assessment of the road based on PCI and IRI roadroid measurement. MATEC Web of Conferences. 2018.
  • 34.Loprencipe G, Pantuso A. A Specified Procedure for Distress Identification and Assessment for Urban Road Surfaces Based on PCI. Coatings. 2017;7(5):65. doi: 10.3390/coatings7050065
  • 35.Cafiso S, D’Agostino C, Delfino E, Montella A. From manual to automatic pavement distress detection and classification. In: 2017 5th IEEE International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS), 2017. 433–8. doi: 10.1109/mtits.2017.8005711
  • 36.Schnebele E, Tanyu BF, Cervone G, Waters N. Review of remote sensing methodologies for pavement management and assessment. Eur Transp Res Rev. 2015;7(2). doi: 10.1007/s12544-015-0156-6
  • 37.Bou-Saab G, Nlenanya I, Alhasan A. Correlating visual–windshield inspection pavement condition to distresses from automated surveys using classification trees. In: 2019.
  • 38.Ibragimov E, Kim Y, Lee JH, Cho J, Lee J-J. Automated Pavement Condition Index Assessment with Deep Learning and Image Analysis: An End-to-End Approach. Sensors (Basel). 2024;24(7):2333. doi: 10.3390/s24072333
  • 39.Radopoulou SC, Brilakis I, Doycheva K, Koch C. A framework for automated pavement condition monitoring. In: Construction Research Congress 2016, 2016.
  • 40.Abdel Raheem M, El-Melegy M. Drive-By Road Condition Assessment Using Internet of Things Technology. In: 2019 International Conference on Advances in the Emerging Computing Technologies (AECT), 2020. 1–6. doi: 10.1109/aect47998.2020.9194190
  • 41.Jang J, Smyth AW, Yang Y, Cavalcanti D. Road surface condition monitoring via multiple sensor-equipped vehicles. In: 2015 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2015. 43–4. doi: 10.1109/infcomw.2015.7179334
  • 42.Marcelino P, Lurdes Antunes M de, Fortunato E. Comprehensive performance indicators for road pavement condition assessment. Structure and Infrastructure Engineering. 2018;14(11):1433–45. doi: 10.1080/15732479.2018.1446179
  • 43.Majidifard H, Adu-Gyamfi Y, Buttlar WG. Deep machine learning approach to develop a new asphalt pavement condition index. Construction and Building Materials. 2020;247:118513. doi: 10.1016/j.conbuildmat.2020.118513
  • 44.Huang Y-T, Jahanshahi MR, Shen F, Mondal TG. Deep Learning–Based Autonomous Road Condition Assessment Leveraging Inexpensive RGB and Depth Sensors and Heterogeneous Data Fusion: Pothole Detection and Quantification. J Transp Eng, Part B: Pavements. 2023;149(2). doi: 10.1061/jpeodx.pveng-1194
  • 45.Hasanaath AA, Moinuddeen A, Mohammad N, Khan MA, Hussain AA. Continuous and Realtime Road Condition Assessment Using Deep Learning. In: 2022 International Conference on Connected Systems & Intelligence (CSI), 2022. 1–7. doi: 10.1109/csi54720.2022.9924135
  • 46.Jagatheesaperumal SK, Bibri SE, Ganesan S, Jeyaraman P. Artificial Intelligence for road quality assessment in smart cities: a machine learning approach to acoustic data analysis. Comput Urban Sci. 2023;3(1). doi: 10.1007/s43762-023-00104-y
  • 47.Martorell S, Sánchez A, Villamizar M, Clemente G. Maintenance modelling and optimization integrating strategies and human resources: Theory and case study. Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability. 2008;222(3):347–57. doi: 10.1243/1748006xjrr128
  • 48.Verma AK, Srividya A, Ramesh P, Deshpande A, Sadiq R. Expert knowledge base in integrated maintenance models for engineering plants. Recent developments and new direction in soft-computing foundations and applications: Selected papers from the 4th World Conference on Soft Computing, May 25–27, 2014, Berkeley. Springer. 2016.
  • 49.Tomescu D, Heiman A, Badescu A. An Automatic Remote Monitoring System for Large Networks. In: 2019 IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), 2019. 71–3. doi: 10.1109/cse/euc.2019.00023
  • 50.Štefančič M, Jerič D, Štefančič M, Čebokli P. Dealing with large amount of data from automated pest monitoring system. Massendatenmanagement in der Agrar- und Ernährungswirtschaft – Erhebung – Verarbeitung – Nutzung. Gesellschaft für Informatik eV. 2013.
  • 51.Eisenbach M, Stricker R, Seichter D, Amende K, Debes K, Sesselmann M, et al. How to get pavement distress detection ready for deep learning? A systematic approach. In: 2017 International Joint Conference on Neural Networks (IJCNN), 2017. 2039–47. doi: 10.1109/ijcnn.2017.7966101
  • 52.Pearce J, Chiavaroli N. Rethinking assessment in response to generative artificial intelligence. Med Educ. 2023;57(10):889–91. doi: 10.1111/medu.15092
  • 53.Nguyen Thanh B, Vo DTH, Nguyen Nhat M, Pham TTT, Thai Trung H, Ha Xuan S. Race with the machines: Assessing the capability of generative AI in solving authentic assessments. AJET. 2023;39(5):59–81. doi: 10.14742/ajet.8902
  • 54.Chen Z, Dai J, Lai Z, Liu Z, Lu L, Lu T, et al. VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks. In: Advances in Neural Information Processing Systems 37, 2024. 69925–75. doi: 10.52202/079017-2235
  • 55.Song H, Baek K, Byun Y. Pothole detection using machine learning. Advanced Science and Technology. 2018:151–5.
  • 56.Egaji OA, Evans G, Griffiths MG, Islas G. Real-time machine learning-based approach for pothole detection. Expert Systems with Applications. 2021;184:115562. doi: 10.1016/j.eswa.2021.115562
  • 57.Anaissi A, Khoa NLD, Rakotoarivelo T, Alamdari MM, Wang Y. Smart pothole detection system using vehicle-mounted sensors and machine learning. J Civil Struct Health Monit. 2019;9(1):91–102. doi: 10.1007/s13349-019-00323-0
  • 58.Hoang N-D, Nguyen Q-L. A novel method for asphalt pavement crack classification based on image processing and machine learning. Engineering with Computers. 2018;35(2):487–98. doi: 10.1007/s00366-018-0611-9
  • 59.Nguyen TH, Nguyen TL, Sidorov DN, Dreglea AI. Machine learning algorithms application to road defects classification. IDT. 2018;12(1):59–66. doi: 10.3233/idt-170323
  • 60.Inkoom S, Sobanjo J, Barbu A, Niu X. Prediction of the crack condition of highway pavements using machine learning models. Structure and Infrastructure Engineering. 2019;15(7):940–53. doi: 10.1080/15732479.2019.1581230
  • 61.Abderrahim NYQ, Abderrahim S, Rida A. Road Segmentation using U-Net architecture. In: 2020 IEEE International Conference of Moroccan Geomatics (Morgeo), 2020. doi: 10.1109/morgeo49228.2020.9121887
  • 62.Munteanu A, Selea T, Neagul M. Deep learning techniques applied for road segmentation. In: 2019.
  • 63.Zhang S, Mu H, Liu T. Improving Accuracy and Generalizability via Multi-Modal Large Language Models Collaboration. In: 2024 International Joint Conference on Neural Networks (IJCNN), 2024. 1–8. doi: 10.1109/ijcnn60899.2024.10651475
  • 64.Ashqar HI, Alhadidi TI, Elhenawy M, Khanfar NO. Leveraging Multimodal Large Language Models (MLLMs) for Enhanced Object Detection and Scene Understanding in Thermal Images for Autonomous Driving Systems. Automation. 2024;5(4):508–26. doi: 10.3390/automation5040029
  • 65.Tan W, Ding C, Jiang J, Wang F, Zhan Y, Tao D. Harnessing the power of MLLMs for transferable text-to-image person ReID. In: 2024.
  • 66.Zhou Q, Zhang J, Zhu Z. Evaluating Urban Visual Attractiveness Perception Using Multimodal Large Language Model and Street View Images. Buildings. 2025;15(16):2970. doi: 10.3390/buildings15162970
  • 67.Ki D, Lee H, Park K, Ha J, Lee S. Measuring nuanced walkability: Leveraging ChatGPT’s vision reasoning with multisource spatial data. Computers, Environment and Urban Systems. 2025;121:102319. doi: 10.1016/j.compenvurbsys.2025.102319
  • 68.Stricker R, Aganian D, Sesselmann M, Seichter D, Engelhardt M, Spielhofer R, et al. Road Surface Segmentation - Pixel-Perfect Distress and Object Detection for Road Assessment. In: 2021 IEEE 17th International Conference on Automation Science and Engineering (CASE), 2021. 1789–96. doi: 10.1109/case49439.2021.9551591
  • 69.Wang KC, Gong W. Automated pavement distress survey: a review and a new direction. In: 2002.
  • 70.Mukhlisin M, Khiyon KN. The Effects of Cracking on Slope Stability. Journal of the Geological Society of India. 2018;91(6):704–10. doi: 10.1007/s12594-018-0927-5
  • 71.Grantab R, Shenoy VB. Location- and Orientation-Dependent Progressive Crack Propagation in Cylindrical Graphite Electrode Particles. J Electrochem Soc. 2011;158(8):A948. doi: 10.1149/1.3601878
  • 72.Suryoto, Siswoyo DP, Setyawan A. The Evaluation of Functional Performance of National Roadway using Three Types of Pavement Assessments Methods. Procedia Engineering. 2017;171:1435–42. doi: 10.1016/j.proeng.2017.01.463
  • 73.Kumar AV. Pavement surface condition assessment: a state-of-the-art research review and future perspective. Innov Infrastruct Solut. 2024;9(12). doi: 10.1007/s41062-024-01755-4
  • 74.Svenson K. Estimated Lifetimes of Road Pavements in Sweden Using Time-to-Event Analysis. J Transp Eng. 2014;140(11). doi: 10.1061/(asce)te.1943-5436.0000712
  • 75.Kavukcuoglu K. Gemini 2.5: Our most intelligent AI model. 2025. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
  • 76.OpenAI. Introducing OpenAI o1. 2024. https://openai.com/o1/
  • 77.OpenAI. Hello GPT-4o. 2024. https://openai.com/index/hello-gpt-4o/
  • 78.Farabet CTW. Introducing Gemma 3: The most capable model you can run on a single GPU or TPU. 2025. https://blog.google/technology/developers/gemma-3/
  • 79.Meta. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. 2024. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
  • 80.Liu H. Llava-v1.6-mistral-7b. 2024. https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b
  • 81.Liu H. Llava-v1.6-vicuna-7b. 2024. https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
  • 82.City and County of San Francisco. Streets Data – Pavement Condition Index (PCI) Scores. https://data.sfgov.org/City-Infrastructure/Streets-Data-Pavement-Condition-Index-PCI-Scores/5aye-4rtt
  • 83.Kuang J, Shen Y, Xie J, Luo H, Xu Z, Li R, et al. Natural Language Understanding and Inference with MLLM in Visual Question Answering: A Survey. ACM Comput Surv. 2025;57(8):1–36. doi: 10.1145/3711680
  • 84.Pavement L. Pavement analysis services request for services attachment A. 2020.
  • 85.Zhou X, He J, Ke Y, Zhu G, Gutierrez Basulto V, Pan J. An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models. In: Findings of the Association for Computational Linguistics ACL 2024, 2024. 10057–84. doi: 10.18653/v1/2024.findings-acl.598
  • 86.Cai Y, Zhang J, He H, He X, Tong A, Gan Z. Llava-kd: A framework of distilling multimodal large language models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
  • 87.Chen B, Zhang Z, Langrené N, Zhu S. Unleashing the potential of prompt engineering for large language models. Patterns (N Y). 2025;6(6):101260. doi: 10.1016/j.patter.2025.101260

Decision Letter 0

Junghwan Kim

6 Oct 2025

PONE-D-25-46997
Multimodal Generative AI for Automated Pavement Condition Assessment: Benchmarking Model Performance
PLOS ONE

Dear Dr. Cui,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Nov 20 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Junghwan Kim

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Please note that PLOS One has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for stating the following financial disclosure:

“CESU grant

Funding Opportunity No: W912HZ-19-SOI-0002”

Please state what role the funders took in the study.  If the funders had no role, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

If this statement is not correct you must amend it as needed.

Please include this amended Role of Funder statement in your cover letter; we will change the online submission form on your behalf.

4. In the online submission form, you indicated that your data will be submitted to a repository upon acceptance. We strongly recommend all authors deposit their data before acceptance, as the process can be lengthy and hold up publication timelines. Please note that, though access restrictions are acceptable now, your entire minimal dataset will need to be made freely accessible if your manuscript is accepted for publication. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If you are unable to adhere to our open data policy, please kindly revise your statement to explain your reasoning and we will seek the editor's input on an exemption.

5. If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Additional Editor Comments (if provided):

Please ensure that you address the reviewers' comments thoroughly. Additionally, please expand the literature review of recent papers on this topic to discuss gaps in the existing studies and why those gaps are significant. There are many papers on similar or relevant topics (e.g., LLMs + street-view images) published in geography and urban planning journals that have not been extensively discussed in this manuscript.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper proposes using large language models (LLMs) to audit pavement condition. I think many technical details need to be described and considered more carefully.

Abstract

1. Page 1: The abstract starts by saying LLMs show promise in automated assessment of many things, but automated pavement condition assessment is unexplored. I don’t think this is a strong motivation (there are countless unexplored applications of LLMs). Instead, the authors should start with current challenges in pavement condition assessment and then explain how LLMs could help address these challenges.

Introduction

2. Page 2 (lines 36-39): Proper references should be cited for existing work on “structural evaluation” and “surface-level evaluation.” The authors should also specify which physical characteristics these studies focus on, with a few examples.

3. Page 3 (lines 46-56): Several claims here seem factually inaccurate. First, when you say existing deep learning models require “high-quality, structured image data,” do you mean training data? Even for LLMs, you still need high-quality image data to support detection and decision-making. Second, while deep learning models need human oversight, so do LLMs, which can hallucinate and require human review (so they can’t truly autonomously make decisions). Perhaps the authors mean that a single LLM could potentially handle not just detection but also evaluation of severity and maintenance timing, tasks that would require multiple deep learning models.

Methods

4. Page 7: First, what are the 39 prompts used? These should be spelled out either here or in an appendix, as it is currently unclear what the authors are asking the LLMs to do. Second, how are the model parameters set, such as temperature, top-k, etc.? Third, for each LLM, is it only prompted once per street view image? Typically, to enhance robustness in LLM-assisted assessment, each LLM should be prompted multiple times per image, because responses can vary due to inherent randomness.

5. Page 8: More details are needed about the manual annotation process. For example, where were the annotators recruited? Provide a detailed list of labels used for surface features and spatial distribution patterns, not just examples. The examples given here could instead go in the Introduction (page 3, lines 57–63), because terms like “spatial distribution pattern” are unclear until this point.

6. Page 11: How do you distinguish between response correctness and hallucination in practice? Both result in a mismatch between human-annotated data and LLM responses.

7. Echoing my earlier points, the authors should be explicit about what they instruct the LLMs to do, what outputs are expected, and what is contained in the human-annotated data. Without this clarity, it is difficult to understand how performance evaluation is conducted. How do we know whether the human-annotated results and LLM responses are comparable?

Reviewer #2: The manuscript titled “Multimodal Generative AI for Automated Pavement Condition Assessment: Benchmarking Model Performance” addresses an important and timely problem by exploring the application of multimodal generative AI to pavement condition assessment. The study provides a clear comparative evaluation of multiple proprietary and open-source models and presents results that highlight relative performance differences across tasks and dimensions. Here are my comments.

- While the study demonstrates that multimodal generative AI models can produce pavement maintenance recommendations, one important limitation not addressed in the manuscript is the lack of explainability of model outputs. For infrastructure management such as pavement condition assessment, where decisions directly affect safety, costs, and policy priorities, explainability is critical. Without interpretable reasoning or transparent links between input features (e.g., crack severity, traffic data) and recommendations, practitioners may be reluctant to trust or adopt such AI systems. The authors should discuss how LLMs address this.

- As one of the limitations of manual inspections, the paper notes that manual inspections are subject to observer variability (ln 40-41). This also applies to generative AI: many models generate probabilistic outputs, meaning repeated runs may not always yield identical results unless parameters are fixed. This needs to be acknowledged in the study.

- For the 39 prompts used in this study, was any domain expertise involved in their design? Given that prompt clarity and completeness strongly affect model outputs, it would benefit this paper to clarify whether experts in pavement management were consulted and what criteria guided prompt construction.

- Response rate was used as one of the assessment dimensions. The paper does not clearly explain how response rate was measured in practice. Was it calculated manually?

Reviewer #3: The manuscript presents a timely contribution to the growing field of using MLLMs for imagery analysis. The integration of GenAI into road surface monitoring is novel, and the systematic evaluation across different proprietary and open-source models provides useful insights. The study is well motivated and relevant to both urban/transportation planning and AI research. Overall, however, the manuscript is not yet ready for publication. With clearer descriptions, stronger methodology, and better figure clarity, it could become a solid contribution to AI in transportation infrastructure. MAJOR REVISIONS REQUIRED.

Abstract:

23-25: Clear, but the claim that GPT-4o provides the most favorable balance between accuracy and cost should be stated carefully

Introduction:

- Highlights the importance of pavement condition assessment and the potential of generative AI, but it does not sufficiently discuss the study in relation to existing work.

- prior work in deep learning and computer vision for crack detection and PCI estimation is mentioned, but the paper does not reference or contrast with these efforts in detail

- novelty of benchmarking multimodal generative models is not sharply presented

- The introduction would benefit from more explicitly stating how this study differs from prior research

Methods

- dataset size is small: needs justification

- 157–160: “ground truth” labels for maintenance interval are based on temporal comparisons of Google Street View imagery, which may not reliably capture the actual timing or type of repairs. This introduces potential uncertainty into the benchmark labels

- 162-166: The manual annotation procedure is not described in enough detail. How many annotators were involved? Was inter-rater agreement measured?

- 205-208: The definition appears very subjective. Was any formal coding protocol or double-checking used?

- Statistical testing is ok, but effect sizes should also be reported to quantify practical differences, not just p-values

Results:

- GPT-4o and OpenAI o1 are highlighted as top-performing, but Gemma 3 is under-discussed even though it reached a 100% response rate in one task; being open-source, it might be more important for reproducibility and cost.

- Hallucination analysis would benefit from systematic quantification across tasks and models

- the figures are difficult to read at this scale

Limitations:

- Actual pavement management requires more than just surface imagery?

- Fine-tuning and prompt engineering are mentioned but not fully discussed with respect to domain-specific needs

- The comparison between open-source and proprietary models could be expanded

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 1

Junghwan Kim

12 Dec 2025

PONE-D-25-46997R1
Multimodal Generative AI for Automated Pavement Condition Assessment: Benchmarking Model Performance
PLOS ONE

Dear Dr. Cui,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 26 2026 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Junghwan Kim

Academic Editor

PLOS ONE

Journal Requirements:

If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

I have received all the reviewers' reports and completed the evaluation of the revised manuscript. Reviewer #2 still has an outstanding item. Please address this. Thank you!


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: Many thanks to the authors for addressing my comments. I can see that the new version is well improved. All my concerns have been addressed. However, I recommend that the authors ensure that the claims added are adequately supported by appropriate references before the final version is published.

Reviewer #3: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

To ensure your figures meet our technical requirements, please review our figure guidelines: https://journals.plos.org/plosone/s/figures

You may also use PLOS’s free figure tool, NAAS, to help you prepare publication quality figures: https://journals.plos.org/plosone/s/figures#loc-tools-for-figure-preparation.

NAAS will assess whether your figures meet our technical requirements by comparing each figure against our figure specifications.

PLoS One. 2026 Feb 12;21(2):e0340380. doi: 10.1371/journal.pone.0340380.r004

Author response to Decision Letter 2


17 Dec 2025


Manuscript PONE-D-25-46997R1

Response to Editor and Reviewers

# regarding funding source

We have spelled out the CESU acronym as the Great Lakes–Northern Forest (GLNF) Cooperative Ecosystem Studies Unit (CESU) in the submission portal, and have added the following statement in the cover letter: The funder had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The content of this publication does not necessarily reflect the views or policies of the funder.

# regarding data availability statement

We have added a data availability statement in the submitted material.

# response to the Reviewer #2

Additional Editor Comments

I have received all the reviewers' reports and completed the evaluation of the revised manuscript. Reviewer #2 still has an outstanding item. Please address this. Thank you!

Authors’ response: Thank you very much for your time and for reviewing the revised manuscript. We have addressed the remaining comment from Reviewer #2. A detailed explanation of the revisions made in response to this comment is provided in our response to Reviewer #2 below.

Reviewers' Comments to the Authors:

Reviewer #2

Many thanks to the authors for addressing my comments. I can see that the new version is well improved. All my concerns have been addressed. However, I recommend that the authors ensure that the claims added are adequately supported by appropriate references before the final version is published.

Authors’ response: We sincerely thank the reviewer for the positive assessment of the revised manuscript and for this helpful suggestion. In response, we carefully reviewed the claims added in the previous revision and extended this review to the references supporting all claims throughout the manuscript to ensure appropriate scholarly support.

To further strengthen the academic rigor of the manuscript, we revisited references that cited non–peer-reviewed sources. When a claim was already supported by multiple peer-reviewed studies, the corresponding non–peer-reviewed reference was removed. When a claim relied primarily on a non–peer-reviewed source, we replaced it with an appropriate peer-reviewed citation. In cases where no suitable peer-reviewed reference could be identified to support a given claim, both the claim and the corresponding non–peer-reviewed reference were removed.

References removed

References 13, 27, 58, 59, 71, 73, 76, and 80 were removed.

References updated or replaced

• Reference 25 was replaced with:

Whang SE, Lee JG. Data collection and quality challenges for deep learning. Proceedings of the VLDB Endowment. 2020;13(12):3429–3432.

This supports the claim: “Acquiring such data is often challenging and time-consuming due to the need for extensive manual annotation.”

• Reference 28 was replaced with:

Bandi A, Adapa PV, Kuchi YE. The power of generative AI: A review of requirements, models, input–output formats, evaluation metrics, and challenges. Future Internet. 2023;15(8):260.

This supports the claim: “Unlike conventional machine learning models that require extensive training datasets and pre-defined rules, generative artificial intelligence (GAI) models can respond adaptively to a range of inputs and generate task-relevant outputs with limited contextual information”

• Reference 29 was replaced with:

Lu K, Zhao X, Li M, Wang Z, Zhou Y, Wang J. StreetSenser: a novel approach to sensing street view via a fine-tuned multimodal large language model. International Journal of Geographical Information Science. 2025.

This supports the claim: “Leveraging these capabilities, recent studies have begun to apply MLLMs in urban research, particularly using street-view imagery to analyze built environments, urban form, and neighborhood conditions.”

• Reference 30 was replaced with:

Blečić I, Saiu V, Trunfio GA. Enhancing urban walkability assessment with multimodal large language models. In: International Conference on Computational Science and Its Applications. Springer; 2024.

This further supports the same claim as reference 29.

• Reference 57 was replaced with:

Zhang S, Mu H, Liu T. Improving accuracy and generalizability via multimodal large language model collaboration. Proceedings of the IJCNN. 2024.

This supports the claim: “When baseline performance is insufficient, these models can be fine-tuned to improve task-specific accuracy.”

• References 68 and 69 were replaced with:

Wu J, Zhong M, Xing S, Lai Z, Liu Z, Chen Z, Wang W, Zhu X, Lu L, Lu T, Luo P. VisionLLM v2: An end-to-end generalist multimodal large language model for hundreds of vision–language tasks. Advances in Neural Information Processing Systems. 2024.

This supports the claim: “In contrast, MLLM can address multiple tasks within a unified framework by combining visual perception with linguistic reasoning.”

• Reference 96 was replaced with:

Kuang J, Shen Y, Xie J, Luo H, Xu Z, Li R, Li Y, Cheng X, Lin X, Han Y. Natural language understanding and inference with MLLMs in visual question answering: A survey. ACM Computing Surveys. 2025.

This supports the claim: “First, they allow users to interact with the model using natural language prompts, removing the need for specialized programming expertise.”

• Reference 98 was replaced with:

Zhou X, He J, Ke Y, Zhu G, Gutiérrez-Basulto V, Pan J. An empirical study on parameter-efficient fine-tuning for multimodal large language models. Findings of ACL. 2024.

This supports the claim: “One key area is the need for additional domain-specific training data to support fine-tuning and improve model performance.”

• Reference 99 has now been published as a peer-reviewed article, and the citation has been updated accordingly.

• Reference 100 was replaced with:

Chen B, Zhang Z, Langrené N, Zhu S. Unleashing the potential of prompt engineering for large language models. Patterns. 2025.

This supports the claim: “At the same time, prompt engineering plays an important role in improving model outputs, particularly for context-dependent tasks.”

In addition, to further enhance academic precision, we refined the wording of several claims while preserving their original meaning. For example, Reference 56 was replaced with:

Wu J, Zhong M, Xing S, Lai Z, Liu Z, Chen Z, Wang W, Zhu X, Lu L, Lu T, Luo P. VisionLLM v2: An end-to-end generalist multimodal large language model for hundreds of vision–language tasks. Advances in Neural Information Processing Systems. 2024;37:69925–75.

This supports the claim: “Although MLLMs have not yet reached the point of full automation, they demonstrate versatility across tasks.”

Finally, after updating the references, we slightly reorganized the relevant paragraph on page 6 to improve clarity and logical flow. The revised paragraph now reads:

“Recent advancements in GAI have opened new avenues for automating assessment processes across various domains (54, 55). Although MLLMs have not yet reached the point of full automation, they demonstrate versatility across tasks (56). Traditional machine learning approaches for surface-level road condition assessment are typically designed and trained for a single, well-defined objective such as defect detection (60-62), defect classification (63-65), or surface segmentation (66, 67). In contrast, MLLM can address multiple tasks within a unified framework by combining visual perception with linguistic reasoning (68, 69). In addition, when baseline performance is insufficient, these models can be fine-tuned to improve task-specific accuracy (57).”

Attachment

Submitted filename: 20251217_Response to Reviewers.docx

pone.0340380.s006.docx (26.2KB, docx)

Decision Letter 2

Junghwan Kim

21 Dec 2025

Multimodal Generative AI for Automated Pavement Condition Assessment: Benchmarking Model Performance

PONE-D-25-46997R2

Dear Dr. Cui,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the 'Update My Information' link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Junghwan Kim

Academic Editor

PLOS ONE


Acceptance letter

Junghwan Kim

PONE-D-25-46997R2

PLOS ONE

Dear Dr. Cui,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so it may take a few days for us to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Junghwan Kim

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. List of prompts used for model evaluation.

    (DOCX)

    pone.0340380.s001.docx (16.2KB, docx)
    S2 Table. Cases illustrate discrepancies between MLLM-generated severity assessments and PCI-based condition label.

    (DOCX)

    pone.0340380.s002.docx (16.7KB, docx)
    S3 Table. The comparison between repair maintenance intervals estimated by MLLMs and actual intervals.

    (DOCX)

    pone.0340380.s003.docx (16.2KB, docx)
    Attachment

    Submitted filename: Response to Reviewers.docx

    pone.0340380.s005.docx (44.4KB, docx)
    Attachment

    Submitted filename: 20251217_Response to Reviewers.docx

    pone.0340380.s006.docx (26.2KB, docx)

    Data Availability Statement

    We will share the data in a public GitHub repository as soon as this paper is accepted.

