Author manuscript; available in PMC: 2026 Mar 7.
Published in final edited form as: Proc SPIE Int Soc Opt Eng. 2025 Apr 10;13413:134130L. doi: 10.1117/12.3047352

Instruction tuning for colorectal histopathology: a multimodal vision-language assistant on human-evaluated data

Usman Afzaal 1, Ziyu Su 1, Usama Sajjad 1, Thomas Stack 2, Mostafa Rezapour 3, Hao Lu 4, Shuo Niu 5, Metin Nafi Gurcan 4, Wendy Frankel 1, Wei Chen 1, Muhammad Khalid Khan Niazi 1
PMCID: PMC12965619  NIHMSID: NIHMS2079619  PMID: 41799657

Abstract

Despite previous research on general-purpose Multimodal Large Language Models (MLLMs) and histopathology multimodal chatbots, their applications remain underexplored, particularly for colorectal cancer (CRC) histopathology slides. The success demonstrated by the Large Language and Vision Assistant (LLaVA) as an MLLM in the natural-image domain suggests its suitability for downstream tuning in histopathology slide analysis. In this study, we present CChat, a novel adaptation of the LLaVA model, to investigate its utility in CRC computational pathology. We accomplish this by fine-tuning LLaVA on a custom human-validated dataset curated from CRC histopathology slides. We also generate a benchmark dataset to empirically compare CChat with other state-of-the-art chatbots; on this benchmark, our model achieved BERTScore, BLEU, and ROUGE-L scores of 93.23, 44.91, and 44.85, respectively, illustrating our methodology's potential for generating high-quality instruction datasets.

Keywords: Multimodal Large Language Models, Instruction-tuning, Colorectal cancer

1. INTRODUCTION

Digital pathology, integral to modern clinical practice, has advanced with the introduction of whole slide imaging (WSI), which enables high-resolution imaging and storage of slides and facilitates AI applications in pathology. This technology allows for extracting complex image data beyond human capability, enhancing diagnostic accuracy and precision, especially in analyzing gigapixel-sized images for diseases like cancer [1, 2]. WSI produces complex images rich in information, including color, multi-scale data, and z-stack levels, augmenting human capacity for visual information extraction. When integrated into pathology workflows with advanced algorithms, WSI extends the capabilities of pathologists beyond traditional microscopic examinations and enables telepathology and clinical use [1, 2].

AI has profoundly impacted computational pathology, transforming tasks such as nuclei detection, nuclear segmentation, cancer subtyping, and grading [3–10]. By analyzing gigapixel-sized images and integrating histopathological data with -omics and clinical records, AI provides more consistent, reproducible, and accurate results than traditional human methods [11–13]. These capabilities are essential for addressing the increasing complexity of precision medicine, where human visual inspection alone is insufficient [14]. On a related front, large language models (LLMs) like GPT-3 [15] and GPT-4 [16] have revolutionized Natural Language Processing (NLP) tasks by understanding and generating human-like text, transforming applications such as coding, writing, summarization, and other language generation tasks [17, 18]. In pathology, LLMs hold the potential to enhance AI by processing complex histopathological data for improved diagnostic accuracy. These models leverage vast textual data from clinical reports and patient records for training, enabling instruction-tuning to follow specific user instructions [18–20]. Instruction tuning is a specialized training approach where the model learns to respond appropriately to natural language instructions, allowing it to understand and execute specific tasks when given clear directions. Integrating NLP with AI in computational pathology enhances model capabilities for tasks such as interpreting medical reports and diagnosing complex histopathological conditions [19–21].
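To make the notion of instruction tuning concrete, the snippet below shows a minimal, hypothetical training record in the LLaVA-style conversation format; the field names follow the public LLaVA data format, and the identifier, image path, and answer text are placeholders rather than actual CChat data.

```python
# A minimal, illustrative instruction-tuning record in the LLaVA-style
# conversation format; the identifier, path, and answer are placeholders.
example_record = {
    "id": "patch_0001",                       # hypothetical patch identifier
    "image": "patches/patch_0001.png",        # hypothetical image path
    "conversations": [
        {"from": "human",
         "value": "<image>\nWhat cell types are visible in this colorectal tissue patch?"},
        {"from": "gpt",
         "value": "The patch shows clusters of epithelial cells surrounded by "
                  "lymphocytes and scattered plasma cells."},
    ],
}
```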

While multimodal vision-language assistants enable open-ended interactions, datasets used for instruction-tuning MLLMs often rely on web-sourced data that is noisy and inconsistent, impacting model reliability [22–24]. Addressing this limitation requires creating high-quality datasets that align textual annotations with visual data. Traditional augmentation methods, such as image rotation or scaling, fail to account for these relationships, introducing inconsistencies that compromise learning [25]. Ensuring alignment between images and captions is crucial to developing robust MLLMs for histopathology.

In this study, we develop a general-purpose MLLM for analyzing colorectal cancer slides. Our model is trained on a human-validated instruction-tuning dataset designed to enhance accuracy and ensure consistency. This dataset avoids common pitfalls associated with noisy web-sourced data by maintaining strict alignment between visual content and textual annotations. The resulting model demonstrates robust capabilities and high-quality predictions for analyzing colon histopathology slides. We evaluated it against state-of-the-art multimodal language models, including LLaVA-1.5 [26, 27] and GPT-4 [16], on a curated test set of 25 images using a Llama3-based [28] evaluation framework. Our evaluation metrics included BERTScore, BLEU, and ROUGE-L to assess text generation quality, along with LLM-based evaluation to measure accuracy on domain-specific tasks.

Our key contributions are:

  1. Developed a human-validated instruction-tuning dataset tailored for colorectal cancer histopathology.

  2. Ensured strict alignment between visual content and textual annotations to enhance dataset quality.

  3. Developed a general-purpose MLLM, achieving state-of-the-art performance in analyzing colorectal cancer histopathology slides.

2. METHOD

2.1. Dataset

CChat was trained using a combination of three multimodal datasets across three distinct stages. In the first stage, an image-caption dataset was used to align the representations of the image and text modalities. In the second and third stages, a pre-training dataset and a human-validated instruction-tuning dataset, respectively, were used to fine-tune the baseline LLaVA [26] architecture in a multi-stage fashion. Figure 1 shows our method's overall pipeline.

Figure 1.

This figure shows CChat's three-stage development pipeline. The process begins with vision-language pretraining, where the QUILT-1M dataset is used to train the CLIP architecture and produce a pathology-adapted vision encoder. The second stage involves MLLM pretraining, utilizing a pre-training dataset to train the MLP connector. The final stage focuses on instruction tuning, where the Lizard dataset undergoes human validation to create the specialized CC-Instruct dataset.

2.2. Image-Caption Dataset

For the first stage of training, QUILT-1M, a dataset consisting of 1.02 million image-caption pairs, was utilized to pre-train the vision encoder [29]. This extensive dataset provided a broad range of visual and textual data, and the wide variety of image-caption pairs ensured comprehensive pre-training.

2.3. Pretraining Dataset

In the second training stage, we performed a histopathology-focused filtration of the original QUILT-1M dataset. To reduce dataset noise, we used a 260-word vocabulary of medical and pathology-focused terms, generated with GPT [15, 16], to filter the original dataset. Stemming was then applied to each word in the vocabulary and to the captions associated with the images in the image-caption dataset; Table 1 provides examples of the words contained in the vocabulary. The stemmed vocabulary was matched against the stemmed captions from QUILT-1M, and pairs with at least one positive match were retained for the final pre-training dataset. This approach yielded a filtered set of 650k image-caption pairs, which was further augmented with 3,309 image-caption pairs from ARCH, a computational pathology (CP) multiple-instance captioning dataset [30]. These image-caption pairs were then converted into a question-answer (QA) format using a set of questions querying the image content, following [27]; the image captions served as the answers. Finally, this pre-training dataset was used to train the MLP connector layers during the first step of MLLM training. Figure 2 illustrates the pipeline for generating the pre-training dataset.

Table 1.

A few example words, along with their stemmed versions, from the medically focused vocabulary.

Original Word      Stemmed Version
sarcoidosis        sarcoidosi
proliferation      prolifer
metastasis         metastasi
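As a concrete illustration of this filtration step, the sketch below stems both the vocabulary and the captions and retains only pairs with at least one match. Using NLTK's Porter stemmer is an assumption (it reproduces the stemmed forms in Table 1); the vocabulary excerpt and caption records are placeholders.

```python
# Sketch of the vocabulary-based caption filter. NLTK's Porter stemmer is an
# assumption; the vocabulary excerpt and caption records are placeholders.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical excerpt of the 260-term GPT-generated medical vocabulary.
vocabulary = ["sarcoidosis", "proliferation", "metastasis", "adenocarcinoma"]
stemmed_vocab = {stemmer.stem(w.lower()) for w in vocabulary}

def keep_pair(caption: str) -> bool:
    """Retain an image-caption pair if any stemmed caption token matches the vocabulary."""
    tokens = (tok.strip(".,;:()") for tok in caption.lower().split())
    return any(stemmer.stem(tok) in stemmed_vocab for tok in tokens)

# pairs: (image_path, caption) tuples drawn from QUILT-1M (placeholder values).
pairs = [("img_001.jpg", "Nodular sarcoidosis with epithelioid granulomas."),
         ("img_002.jpg", "A caption with no pathology-related terms.")]
pretraining_pairs = [(img, cap) for img, cap in pairs if keep_pair(cap)]  # keeps img_001 only
```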

Figure 2.

The figure illustrates the key components and workflow of the histopathology pre-training dataset preparation process. The pipeline begins with two primary datasets: QUILT-1M and ARCH. The QUILT-1M dataset passes through a filtration process, guided by word stemming applied to a custom medical vocabulary. This medical vocabulary, containing 260 medical terms, serves as the foundation for the filtering criteria. The filtered QUILT-1M data is then combined with the ARCH dataset, culminating in the final pre-training dataset. The filtration pipeline ensures the resulting dataset is optimized for histopathology-specific training.

2.4. Instruction-Tuning Dataset

We developed our instruction-tuning dataset starting from the 231 images in Lizard, a large-scale dataset for colonic nuclear instance segmentation and classification [31]. These images were split into patches of approximately 500x500 pixels, with the bounding box annotations re-normalized to patch coordinates. The original Lizard dataset provides annotations for six cell types: epithelial cells, lymphocytes, plasma cells, neutrophils, eosinophils, and connective tissue cells. The bounding box information for each image was then converted into QA pairs using GPT-4, creating five conversational-style question-answer pairs per image, with the prompt outlined in Figure 3 and following the methodology described in [27].
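A minimal sketch of the patching and bounding-box re-normalization step is shown below; the annotation fields, the exact patch size, and the rule of keeping only boxes fully contained within a patch are assumptions rather than the authors' exact procedure.

```python
# Sketch of patch extraction with bounding-box re-normalization. The annotation
# fields and the containment rule are assumptions, not the authors' exact code.
import numpy as np

PATCH = 500  # approximate patch side length in pixels

def split_into_patches(image: np.ndarray, boxes: list):
    """Yield (patch, patch_boxes) with box coordinates normalized to [0, 1] within each patch."""
    h, w = image.shape[:2]
    for y0 in range(0, h, PATCH):
        for x0 in range(0, w, PATCH):
            y1, x1 = min(y0 + PATCH, h), min(x0 + PATCH, w)
            patch = image[y0:y1, x0:x1]
            patch_boxes = []
            for b in boxes:  # b: {"label", "x_min", "y_min", "x_max", "y_max"} in image coordinates
                if x0 <= b["x_min"] and b["x_max"] <= x1 and y0 <= b["y_min"] and b["y_max"] <= y1:
                    patch_boxes.append({
                        "label": b["label"],
                        "box": [(b["x_min"] - x0) / (x1 - x0), (b["y_min"] - y0) / (y1 - y0),
                                (b["x_max"] - x0) / (x1 - x0), (b["y_max"] - y0) / (y1 - y0)],
                    })
            yield patch, patch_boxes
```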

Figure 3.

The prompt used to condition GPT-4 for the generation of conversational question-answer pairs, based on the input bounding box coordinates for each image.
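Since the exact prompt appears only in Figure 3, the snippet below is a purely hypothetical prompt of this kind, written in the spirit of the LLaVA conversation-generation methodology [27]; none of the wording should be read as the authors' actual prompt.

```python
# Hypothetical GPT-4 prompt skeleton for turning bounding-box annotations into
# conversational QA pairs; illustrative only, not the prompt from Figure 3.
system_prompt = (
    "You are an AI assistant with expertise in colorectal histopathology. "
    "You are given the cell types present in an image patch together with their "
    "bounding boxes as normalized (x_min, y_min, x_max, y_max) coordinates. "
    "Generate five conversational question-answer pairs about the image, answering "
    "as if you were directly observing the tissue. Do not mention the bounding boxes."
)

annotation_message = (  # placeholder annotations for one patch
    "epithelial cell: [0.12, 0.08, 0.18, 0.15]\n"
    "lymphocyte: [0.41, 0.52, 0.44, 0.56]\n"
    "plasma cell: [0.63, 0.30, 0.66, 0.34]"
)
```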

The resulting question-answer pairs were then reviewed by a human evaluator, who scored each pair on the accuracy of its description of the given image, in terms of both histopathology content and general language structure, and filtered out erroneous pairs. Pairs were scored on a scale of 1–5, and only those with a minimum score of three were retained, resulting in a colorectal cancer-focused QA dataset of 11.5k pairs. An additional 24.5k QA pairs, generated with the same methodology but without human validation, were included from the DigestPath colonoscopy tissue segmentation dataset [32], leading to the final CC-Instruct dataset. The CC-Instruct dataset comprised 36k question-answer pairs in total, which were used for the instruction-tuning stage of the MLLM. Unlike previous methods that rely on large-scale web-sourced datasets for instruction-tuning, we emphasized dataset quality through thorough human evaluation to ensure accuracy and relevance. By creating a colorectal cancer-focused instruction-tuning dataset, our approach significantly improved both the quality of the dataset and the accuracy of the model's predictions.
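A minimal sketch of the human-validation filter is given below; the record fields and scores are placeholder assumptions.

```python
# Minimal sketch of the human-validation filter; record fields are assumptions.
def retain_validated(qa_pairs: list, min_score: int = 3) -> list:
    """Keep only QA pairs whose human-assigned quality score (scale 1-5) is at least min_score."""
    return [qa for qa in qa_pairs if qa["human_score"] >= min_score]

scored = [
    {"question": "Which cells dominate this patch?", "answer": "Lymphocytes.", "human_score": 4},
    {"question": "Is cartilage visible?", "answer": "Yes.", "human_score": 1},  # rejected
]
validated = retain_validated(scored)  # keeps only the first pair
```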

2.5. Train/Test Splits

A test set of 25 images was curated from the complete generated instruction dataset; the test images and their corresponding QA pairs were withheld from the training data. For each test image, only one corresponding question-answer pair was selected for the final test set.
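The sketch below shows one way such an image-level split could be produced, holding out 25 images and sampling one QA pair per held-out image; the data structures and random seed are assumptions.

```python
# Sketch of an image-level hold-out split (25 test images, one QA pair each);
# the data layout and seed are assumptions.
import random

def make_split(dataset: dict, n_test: int = 25, seed: int = 0):
    """dataset maps image_id -> list of QA pairs; splitting at the image level avoids leakage."""
    rng = random.Random(seed)
    test_ids = set(rng.sample(sorted(dataset), n_test))
    train = {i: qas for i, qas in dataset.items() if i not in test_ids}
    test = {i: [rng.choice(dataset[i])] for i in test_ids}  # one QA pair per test image
    return train, test
```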

2.6. Network Architecture

Initially, we adopted CLIP [33] as our baseline vision-language model and performed vision-language pretraining with the QUILT-1M pathology image-caption dataset, aligning the image representation space with the pathology text representation. CLIP is a dual-encoder architecture consisting of an image encoder and a text encoder; it learns a multimodal embedding space by jointly training the two encoders to maximize the cosine similarity of correct image-text pairs while minimizing it for incorrect pairs, using a cross-modal contrastive loss in a temperature-scaled N-way classification framework [34]. The resulting vision encoder was then used in a two-step training pipeline for the MLLM, following [27]. The MLLM consists of (a) the large language model Vicuna-13B [35], (b) a two-layer MLP vision-language connector, and (c) the vision encoder, as illustrated in Figure 4. The first step trains the weights of the two-layer MLP connector using the 650k pre-training dataset; this connector projects the image features into the LLM's embedding space, so that the LLM can process the resulting visual tokens together with the tokenized language instructions from the user and integrate them into a multimodal output. In the second step, we instruction-tuned the model on our human-validated instruction-tuning dataset; the vision encoder remained frozen during this stage while the rest of the architecture was fine-tuned. An auto-regressive training loss was used for both training steps, following LLaVA.
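To make the two core components concrete, the sketch below shows (i) a temperature-scaled symmetric contrastive loss of the kind used for CLIP-style pretraining and (ii) a two-layer MLP connector that projects vision features into the LLM embedding space. It is a minimal PyTorch sketch, not the authors' training code; the hidden dimensions and temperature are placeholder assumptions.

```python
# Minimal PyTorch sketch of (i) the temperature-scaled CLIP-style contrastive
# objective and (ii) a two-layer MLP vision-language connector. Dimensions and
# hyperparameters are placeholders, not the authors' training configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric N-way classification over a batch of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)              # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)          # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

class MLPConnector(nn.Module):
    """Projects vision-encoder features into the LLM embedding space as visual tokens."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(vision_dim, llm_dim), nn.GELU(),
                                  nn.Linear(llm_dim, llm_dim))

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        return self.proj(visual_features)
```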

Figure 4.

(a) Illustrates the complete training pipeline. First, CLIP is trained on the image-caption dataset. Next, the LLaVA architecture is pre-trained using the pre-trained vision encoder. Finally, instruction tuning is conducted with our CC-Instruct dataset. (b) The overall architecture of the MLLM. The vision encoder is pre-trained using QUILT-1M in the CLIP framework. It is then incorporated into LLaVA for further visual instruction tuning, during which it is kept frozen.

3. RESULTS

Our model was compared with other MLLMs using Llama3-8B as the evaluator. Llama 3 offers state-of-the-art reasoning capabilities among publicly available LLM checkpoints [28]. For each test image, Llama3 received a system command that outlined the comparison task between model predictions and ground-truth answers. The LLM was then conditioned to provide a rating from 1 to 10 based on answer accuracy, which was subsequently scaled to a percentage score. This method enabled Llama3 to deliver objective ratings by evaluating the semantic correctness of the model predictions. We also report results on standard NLP metrics: BLEU [36], ROUGE-L [37], and BERTScore [38]. The results for all models are presented in Table 2, and Figure 5 shows an example of CChat's response.
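As an illustration of how such an evaluation can be run, the sketch below computes the reported NLP metrics with common Python packages (nltk, rouge-score, and bert-score) and shows a hypothetical judging instruction for Llama3-8B; the judging prompt wording is an assumption, not the authors' exact system command.

```python
# Sketch of the automatic evaluation (packages: nltk, rouge-score, bert-score);
# the Llama3 judging prompt below is illustrative, not the authors' exact wording.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bertscore

def nlp_metrics(prediction: str, reference: str) -> dict:
    """Compute BLEU, ROUGE-L, and BERTScore F1 for a single prediction-reference pair."""
    bleu = sentence_bleu([reference.split()], prediction.split(),
                         smoothing_function=SmoothingFunction().method1)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True) \
        .score(reference, prediction)["rougeL"].fmeasure
    _, _, f1 = bertscore([prediction], [reference], lang="en")
    return {"BLEU": bleu, "ROUGE-L": rouge_l, "BERTScore": f1.item()}

judge_prompt = (  # hypothetical system command for the Llama3-8B judge
    "Compare the model prediction with the ground-truth answer for the same "
    "histopathology image and rate the prediction's accuracy on a scale of 1 to 10. "
    "Reply with the number only."
)
```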

Table 2.

Results of the MLLMs on the test set, as evaluated by Llama3, together with standard NLP metrics.

Model          Llama3 Accuracy (%)   BLEU    ROUGE-L   BERTScore
Ours (CChat)   69.19                 44.91   44.85     93.23
LLaVA-1.5      62.8                  30.34   26.41     89.45
GPT-4V         55.59                 17.01   8.5       84.78

Figure 5.

CChat can identify key medical regions within a histopathology patch and support reasoning toward a diagnosis based on the observed features.

4. DISCUSSION AND CONCLUSIONS

The integration of large language models in computational pathology presents both opportunities and challenges, particularly in analyzing complex histopathological data for colorectal cancer diagnosis. A fundamental challenge has been developing MLLMs that can effectively process and interpret histopathology slides while maintaining accuracy and reliability. Our study addressed this through CChat, a specialized MLLM that demonstrates enhanced performance in analyzing colorectal cancer histopathology slides, as validated by its superior performance across multiple evaluation metrics compared to existing models like LLaVA-1.5 and GPT-4V.

A key contribution of our work lies in the methodology used to create a high-quality instruction tuning dataset. Unlike conventional approaches that rely on web-sourced data, our human-validated instruction-tuning dataset ensures strict alignment between visual content and textual annotations. We specifically addressed challenges such as noise in pretraining datasets through vocabulary-based filtration and human evaluation, ensuring high-quality data for both pretraining and fine-tuning stages. This rigorous curation process, combined with our multi-stage training approach, resulted in more reliable and accurate predictions, demonstrating that prioritizing data quality over quantity can yield substantial benefits in specialized medical applications.

Our results underscore the effectiveness of domain-specific adaptation in MLLMs. While general-purpose models like GPT-4V and LLaVA-1.5 have shown impressive capabilities across various tasks, our focused approach to colorectal cancer histopathology demonstrates the value of specialized training. CChat's superior performance, achieving 69.19% accuracy compared to LLaVA-1.5 (62.8%) and GPT-4V (55.59%), validates our overall approach. Furthermore, our approach's strong performance across NLP metrics, with BLEU, ROUGE-L, and BERTScore values of 44.91, 44.85, and 93.23 respectively, demonstrates the effectiveness of our instruction tuning approach in generating contextually relevant and accurate predictions. The multi-stage training pipeline we implemented, beginning with CLIP-based vision-language pretraining and progressing through carefully curated instruction tuning, provides a robust framework for developing specialized MLLMs in other medical domains. This approach holds particular promise in areas where data quality and interpretation accuracy are crucial for clinical decision-making.

However, several limitations need to be addressed. While our instruction-tuning dataset emphasizes quality through human validation, its relatively small size (36k question-answer pairs) may limit the model's ability to generalize across the full spectrum of colorectal cancer presentations. Additionally, our evaluation on a test set of 25 images represents a limited sample of possible histopathological variations. Future research should focus on expanding the instruction-tuning dataset while maintaining rigorous human validation and exploring advanced vision-language architectures for improved model capabilities. These developments, combined with robust evaluation frameworks, could enhance the model's potential as a reliable assistive tool in pathological diagnosis.

In conclusion, this study introduces CChat, an innovative adaptation of the LLaVA model, tailored for computational pathology with a focus on colorectal histopathology slides. CChat demonstrates impressive performance gains in slide analysis, achieving the highest test set accuracy of 69.19% and outperforming existing models, including LLaVA-1.5 and GPT-4V, across NLP metrics with BLEU, ROUGE-L, and BERTScore values of 44.91, 44.85, and 93.23, respectively. These results underscore the importance of combining domain-specific datasets with advanced vision-language models to address the challenges of computational pathology. By leveraging instruction-tuned MLLMs, CChat not only improves diagnostic accuracy but also supports clinical decision-making in colorectal cancer pathology. Future work will focus on expanding this framework to other pathology domains and enhancing its generalizability to diverse medical imaging tasks.

ACKNOWLEDGEMENT

This project was supported by R01 CA276301 (PIs: Niazi, Chen) from the National Cancer Institute. The project was also supported by The Ohio State University Comprehensive Cancer Center, and the Department of Pathology. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health, National Cancer Institute.

REFERENCES

  • [1].Farahani N, Parwani AV, and Pantanowitz L, “Whole slide imaging in pathology: advantages, limitations, and emerging perspectives,” Pathology and Laboratory Medicine International, 23–33 (2015). [Google Scholar]
  • [2].Niazi MKK, Parwani AV, and Gurcan MN, “Digital pathology and artificial intelligence,” The lancet oncology, 20(5), e253–e261 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Xing F, and Yang L, “Robust Nucleus/Cell Detection and Segmentation in Digital Pathology and Microscopy Images: A Comprehensive Review,” IEEE Rev Biomed Eng, 9, 234–63 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [4].Kumar N, Verma R, Sharma S et al. , “A Dataset and a Technique for Generalized Nuclear Segmentation for Computational Pathology,” IEEE Transactions on Medical Imaging, 36(7), 1550–1560 (2017). [DOI] [PubMed] [Google Scholar]
  • [5].Mahmood F, Borders D, Chen RJ et al. , “Deep Adversarial Training for Multi-Organ Nuclei Segmentation in Histopathology Images,” IEEE Trans Med Imaging, 39(11), 3257–3267 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Litjens G, Sánchez CI, Timofeeva N et al. , “Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis,” Sci Rep, 6, 26286 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [7].Lu MY, Chen TY, Williamson DFK et al. , “AI-based pathology predicts origins for cancers of unknown primary,” Nature, 594(7861), 106–110 (2021). [DOI] [PubMed] [Google Scholar]
  • [8].Zhu L, Shi H, Wei H et al. , “An accurate prediction of the origin for bone metastatic cancer using deep learning on digital pathological images,” EBioMedicine, 87, 104426 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Coudray N, Ocampo PS, Sakellaropoulos T et al. , “Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning,” Nature medicine, 24(10), 1559–1567 (2018). [Google Scholar]
  • [10].Bulten W, Pinckaers H, van Boven H et al. , “Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study,” The Lancet Oncology, 21(2), 233–241 (2020). [DOI] [PubMed] [Google Scholar]
  • [11].Natrajan R, Sailem H, Mardakheh FK et al. , “Microenvironmental Heterogeneity Parallels Breast Cancer Progression: A Histology-Genomic Integration Analysis,” PLoS Med, 13(2), e1001961 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Rezapour M, Wesolowski R, and Gurcan MN, “Identifying Key Genes Involved in Axillary Lymph Node Metastasis in Breast Cancer Using Advanced RNA-Seq Analysis: A Methodological Approach with GLMQL and MAS,” International journal of molecular sciences, 25(13), 7306 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [13].Rezapour M, Walker SJ, Ornelles DA et al. , “A comparative analysis of RNA-Seq and NanoString technologies in deciphering viral infection response in upper airway lung organoids,” Frontiers in Genetics, 15, 1327984 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [14].Go H, “Digital Pathology and Artificial Intelligence Applications in Pathology,” Brain Tumor Res Treat, 10(2), 76–82 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Brown T, Mann B, Ryder N et al. , “Language models are few-shot learners,” Advances in neural information processing systems, 33, 1877–1901 (2020). [Google Scholar]
  • [16].Achiam J, Adler S, Agarwal S et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, (2023). [Google Scholar]
  • [17].Mann B, Ryder N, Subbiah M et al. , “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 1, (2020). [Google Scholar]
  • [18].Javed S, Mahmood A, Ganapathi II et al. , "CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment." 11450–11459. [Google Scholar]
  • [19].Alsentzer E, Murphy JR, Boag W et al. , “Publicly available clinical BERT embeddings,” arXiv preprint arXiv:1904.03323, (2019). [Google Scholar]
  • [20].Singhal K, Tu T, Gottweis J et al. , “Towards expert-level medical question answering with large language models,” arXiv preprint arXiv:2305.09617, (2023). [Google Scholar]
  • [21].Nori H, King N, McKinney SM et al. , “Capabilities of gpt-4 on medical challenge problems,” arXiv preprint arXiv:2303.13375, (2023). [Google Scholar]
  • [22].Aubreville M, Ganz J, Ammeling J et al. , “Model-based Cleaning of the QUILT-1M Pathology Dataset for Text-Conditional Image Synthesis,” arXiv preprint arXiv:2404.07676, (2024). [Google Scholar]
  • [23].Lu MY, Chen B, Williamson DF et al. , “A foundational multimodal vision language AI assistant for human pathology,” arXiv preprint arXiv:2312.07814, (2023). [Google Scholar]
  • [24].Seyfioglu MS, Ikezogwo WO, Ghezloo F et al. , "Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos." 13183–13192. [Google Scholar]
  • [25].Aldabbas H, Asad M, Ryalat MH et al. , “Data augmentation to stabilize image caption generation models in deep learning,” Int J Adv Comput Sci Appl, 10(10), 571–9 (2019). [Google Scholar]
  • [26].Liu H, Li C, Li Y et al. , "Improved baselines with visual instruction tuning." 26296–26306. [Google Scholar]
  • [27].Liu H, Li C, Wu Q et al. , “Visual instruction tuning,” Advances in neural information processing systems, 36, (2024). [Google Scholar]
  • [28].AI@Meta, [Llama 3 Model Card], (2024). [Google Scholar]
  • [29].Ikezogwo W, Seyfioglu S, Ghezloo F et al. , “Quilt-1m: One million image-text pairs for histopathology,” Advances in neural information processing systems, 36, (2024). [Google Scholar]
  • [30].Gamper J, and Rajpoot N, "Multiple instance captioning: Learning representations from histopathology textbooks and articles." 16549–16559. [Google Scholar]
  • [31].Graham S, Jahanifar M, Azam A et al. , "Lizard: A large-scale dataset for colonic nuclear instance segmentation and classification." 684–693. [Google Scholar]
  • [32].Da Q, Huang X, Li Z et al. , “DigestPath: A benchmark dataset with challenge review for the pathological detection and segmentation of digestive-system,” Medical Image Analysis, 80, 102485 (2022). [DOI] [PubMed] [Google Scholar]
  • [33].Radford A, Kim JW, Hallacy C et al. , "Learning transferable visual models from natural language supervision." 8748–8763. [Google Scholar]
  • [34].Sohn K, “Improved deep metric learning with multi-class n-pair loss objective,” Advances in neural information processing systems, 29, (2016). [Google Scholar]
  • [35].Chiang W-L, Li Z, Lin Z et al. , “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3), 6 (2023). [Google Scholar]
  • [36].Papineni K, Roukos S, Ward T et al. , "Bleu: a method for automatic evaluation of machine translation." 311–318. [Google Scholar]
  • [37].Lin C-Y, "Rouge: A package for automatic evaluation of summaries." 74–81. [Google Scholar]
  • [38].Zhang T, Kishore V, Wu F et al. , “Bertscore: Evaluating text generation with bert,” arXiv preprint arXiv:1904.09675, (2019). [Google Scholar]
