Abstract
The rapid development of Large Language Models (LLMs) has opened up new possibilities for their role in supporting research. This study assesses whether LLMs can generate “thoughtful” research plans in the domain of Medical Informatics and whether LLM-generated critiques can improve such plans. Using an LLM pipeline, we prompt four LLMs to generate primary research plans. Subsequently, these plans are mutually critiqued and then the LLMs are prompted to refine their outputs based on these critiques. These original and improved responses are then reviewed by human evaluators for errors, hallucinations, etc. We employ ROUGE scores, cosine similarity, and length differences to quantify similarities across responses. Our findings reveal variations in outputs among four LLMs, the impact of critiques, and differences between primary and secondary outputs. All LLMs produce cogent outputs and critiques, integrating feedback when generating improved outputs. Human evaluators can distinguish between primary and secondary responses in most cases.
Introduction
The remarkable progress of Large Language Models (LLMs) has been widely demonstrated by their adoption across numerous domains, touching almost every aspect of modern life. While early applications of LLMs primarily focused on natural language processing tasks such as machine translation, text summarization, and question answering, newer studies explore their potential in complex reasoning and structured problem solving. Recent advancements have examined LLMs’ ability to perform self-improvement via iterative feedback mechanisms, often leveraging external critiques to refine their outputs. LLMs are now used both as information generators and as active participants in problem solving, and such feedback mechanisms have helped mitigate issues such as factual inconsistencies, hallucinations1, and biases. Inspired by recent work on LLM-based self-correction frameworks such as CRITIC2 and ensemble-based critique models3, this study investigates whether LLMs can effectively support Medical Informatics research through a structured critique-and-improve approach. Specifically, we analyze 1) whether LLM-generated critiques contribute “thoughtfully” to refining an initial research output, 2) whether multiple LLMs converge or diverge in their critiques, and 3) how human experts evaluate the resulting improvements in comparison to the primary responses. We note that the use of the word “thoughtful” throughout this paper does not imply that the LLMs are sentient or are reasoning in the way humans would. Rather, it is used to indicate that if a human had made the same statements as the LLMs, they would be deemed well-reasoned arguments.
Our goal is to investigate whether LLMs can be used to support research in Medical Informatics, specifically research about medical ontologies and terminologies. Objectives include comparing the quality of results of different LLMs for a research prompt and determining the effect of using LLMs to critique the results of other LLMs. Specifically, we formulate the following research questions:
Can LLMs be used as legitimate tools to support research for a given well-defined problem posed by an expert in the topic area of the research, specifically, in the domain of medical ontologies and terminologies?
If one LLM-C (Critic) is prompted to critique the research plan generated by another LLM-R (Research), can the LLM-C generate a “thoughtful” critique of the plan provided by the LLM-R?
How significant are errors and hallucinations in both the research plan and the critique, according to the judgment of a human expert(s)? Are the results human-interpretable?
Will feeding back the original research plan together with the critique of the LLM-C to the LLM-R as a new prompt enable the LLM-R to generate a secondary research plan? (We call the original research plan the primary research plan.)
If several LLM-Rs are fed the same prompt for generating a research plan, how do they differ with respect to basic metrics such as word count, and what is the measured degree of overlap and similarity between them, using standard metrics, e.g., cosine similarity and ROUGE scores?
The same questions mentioned for multiple LLM-Rs can be raised about critiques, if several LLM-C critics are used. Are the critiques mostly the same, widely different, or somewhere in between, as expressed by a numerical score?
How does the primary result of an LLM-R (LLM-RP) differ from the secondary research plan (LLM-RS), using the same metrics mentioned above for comparisons between LLM outputs?
To what degree do human experts agree with the resulting primary and secondary research plans? Are they able to rank the research plans by subjective goodness? Are they able to clearly distinguish between the primary and the secondary research plan, if they receive them in randomized order?
Background
The rapid evolution of LLMs, advanced by models like GPT-4.5 and its derivatives, has transformed natural language processing and has impacted a wide array of domains including healthcare, medical informatics, and even academic research practices.4,5 LLMs have demonstrated exceptional capabilities in reasoning, information processing, and code generation, among many other tasks, and have even served as collaborators in research processes. However, these “revolutionary” models are not without limitations: they are known to produce factually incorrect statements, referred to as “hallucinations,” and occasionally fail to maintain logical consistency or accuracy for prompts in specialized domains such as medicine6.
Human researchers routinely depend on iterative feedback and critiques to improve their work, primarily through the process of peer review and collaborative refinement. Inspired by this iterative human workflow, recent studies have explored how LLMs themselves might benefit from similar feedback loops. For instance, researchers have proposed self-reflection or self-criticism mechanisms within individual models to enhance reliability7. Yet, the potential of using multiple LLMs in structured critique-and-refine workflows, where different models critique each other’s responses and facilitate iterative improvements, remains relatively unexplored, particularly in specialized research domains.
This paper aims to address this knowledge gap by examining whether cross-model critiques between LLMs can meaningfully enhance the quality of research outputs in medical informatics, thereby mitigating the inherent limitations of individual models and contributing to more robust and reliable AI-assisted research methods.
Related Work
The recent advancements of LLMs have underscored the role of feedback mechanisms for improving model reliability, accuracy, and alignment with human expectations. The CRITIC framework2 enables LLMs to autonomously validate and refine their outputs through iterative interactions with external validation tools. The framework employs a three-step process: generate initial responses, leverage tools to detect errors, and refine the responses based on the detected errors. This approach effectively addresses several common issues LLMs are prone to, including hallucinations and faulty information generation, and yields significant improvements across tasks such as free-form question answering and code synthesis.
Similarly, the RL4F framework utilizes a reinforcement learning-based approach for generating natural language feedback aimed at guiding LLMs towards improved performance8. Its authors demonstrate notable gains in various language generation tasks, including summarization and action planning. Another approach highlights the incorporation of automated feedback loops, specifically within question-answering systems, showing that critical assessments by a secondary critic model notably enhance citation accuracy, response correctness, and linguistic fluency9. Additionally, a comprehensive survey highlights various methods for integrating human feedback into natural language generation, underscoring its effectiveness in ensuring that LLM outputs align more closely with human values, preferences, and intended meanings10.
Another study highlights a multiturn critique method where LLMs iteratively refine their responses through multiple rounds of self-critique11. The framework includes identifying specific errors, providing detailed explanations of the issues, and generating improved solutions. The results show significant performance improvements across reasoning and problem-solving tasks, particularly for complex queries that require multi-step thinking.
Our work differs from the above approaches in that we are using multiple LLMs in tandem. Furthermore, we are applying our methods to the task of Medical Informatics research design, while previous approaches focus on tasks such as computer code generation and natural language processing.
Overview of Multi-LLM refinement pipeline
To investigate the capability of LLMs to meaningfully contribute to the research process in Medical Informatics, particularly in the critique and iterative refinement of research outputs, we designed a structured Multi-LLM refinement pipeline (Figure 1). The experiment uses n well-known LLMs. Our methodology comprises initial response generation, cross-model critiques, iterative refinement, and human evaluation. Because the evaluation is an integral part of this work and involves many pairwise comparisons, Figure 2 overlays all of the comparison operations, augmenting Figure 1 with more detail (while keeping Figure 1 readable).
Figure 1.
Experimental design (Part 1).
Figure 2.
Experimental design (Part 2): partial display. Dots indicate omitted comparisons.
Prompt Design Requesting a Research Plan for Medical Informatics
We formulated a prompt for a challenging, clearly defined, and domain-specific research problem relevant to medical ontology development as follows:
“I am developing a formula for the utility of a new concept relative to an existing ontology. If the new concept is useful for this ontology, it should be inserted. If the utility is low, it should not be inserted. The formula would be a function of the current content of the ontology. What would such a formula look like? What factors would such a formula consider? Provide scientific notations in plain text.”
This prompt was chosen, as it is specific to Medical Informatics, sufficiently complex, and well-suited to test the reasoning capabilities and factual correctness of each LLM.
Experimental Setup
Phase 1: Prompt Engineering and Distribution: We presented the above prompt to n LLMs. Based on the availability and accessibility of high-performing LLMs12, we selected n = 4: GPT-4 Turbo (GPT-4.5)13, DeepSeek-R114, Mistral-Large-241115 and Llama-3 (Llama-3.3-70B-Instruct)16. We set the temperature to 0.7 for all the LLMs in our experiments, following the OpenAI API default, with the intention of balancing factuality and creativity of outputs17.
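To make the distribution step concrete, the following is a minimal sketch of how the same prompt could be sent to each model. It assumes an OpenAI-compatible chat endpoint (the models were accessed through GitHub Marketplace12); the endpoint URL, model identifiers, and the helper function query_llm are illustrative placeholders rather than the exact configuration used.

```python
# Minimal sketch: send the same research prompt to each of the n = 4 LLMs.
# Assumes an OpenAI-compatible endpoint; URL and model names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://models.example.com/v1", api_key="YOUR_KEY")

MODELS = ["gpt-4-turbo", "deepseek-r1", "mistral-large-2411", "llama-3.3-70b-instruct"]

def query_llm(model: str, prompt: str, temperature: float = 0.7) -> str:
    """Return the model's completion for a single user-message prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # 0.7 for all models, as in Phase 1
    )
    return response.choices[0].message.content

research_prompt = "I am developing a formula for the utility of a new concept ..."
primary_responses = {m: query_llm(m, research_prompt) for m in MODELS}
```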
Phase 2: Initial Quality Assurance: Human expert(s) review the results for all n LLMs, to make sure that there are no obvious mistakes or cases of hallucinations. Responses containing significant factual inaccuracies, obvious errors, or hallucinations are identified, documented, and corresponding text segments are excluded from subsequent analysis to maintain data integrity.
Phase 3: Cross-Model Critiquing: Each initial response (Rx) produced by an LLMx is critiqued independently by the remaining LLMs (LLMy, where y≠x). Specifically, the original prompt and the response from LLMx are provided as input to the critic models, along with explicit instructions to generate detailed, constructive critiques. This process results in n – 1 distinct critiques for each original response.
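A minimal sketch of this critique step, continuing the illustrative helpers from the previous sketch; the paper does not reproduce the critique instruction verbatim, so the wording below is a stand-in.

```python
# Each primary response R_x is critiqued by every other model (y != x),
# yielding n - 1 critiques per response. The instruction wording is a stand-in.
critiques = {}  # (responder, critic) -> critique text
for responder, response in primary_responses.items():
    for critic in MODELS:
        if critic == responder:
            continue
        critique_prompt = (
            f"Here is a research question: {research_prompt}\n\n"
            f"Here is a proposed research plan:\n{response}\n\n"
            "Please provide a detailed, constructive critique of this plan."
        )
        critiques[(responder, critic)] = query_llm(critic, critique_prompt)
```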
Phase 4: Iterative Refinement Based on Critiques: Each original LLMx receives: the prompt provided initially, its own initial response and one of the n – 1 critiques along with a new prompt to produce an improved (secondary) response addressing the critique. Specifically, each LLM receives a new prompt as follows:
“I previously asked you {prompt}. You responded with the following {original_response}. I passed your response to another system with the task to critique it. The other system returned this critique from {critique_model} {critique}. Please improve your original answer to the question (I repeat it here below) based on the critique mentioned above {prompt}.”
Curly brackets indicate variables that are replaced by the appropriate text. Thus, every model generates an improved (secondary) response in reaction to this prompt.
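As a sketch, the quoted template can be filled programmatically with Python's str.format, reusing the illustrative primary_responses and critiques dictionaries from the earlier sketches.

```python
# Fill the refinement prompt template and request the secondary (improved) response.
REFINE_TEMPLATE = (
    "I previously asked you {prompt}. You responded with the following "
    "{original_response}. I passed your response to another system with the task "
    "to critique it. The other system returned this critique from {critique_model} "
    "{critique}. Please improve your original answer to the question "
    "(I repeat it here below) based on the critique mentioned above {prompt}."
)

secondary_responses = {}  # (responder, critic) -> improved (secondary) response
for (responder, critic), critique in critiques.items():
    refine_prompt = REFINE_TEMPLATE.format(
        prompt=research_prompt,
        original_response=primary_responses[responder],
        critique_model=critic,
        critique=critique,
    )
    secondary_responses[(responder, critic)] = query_llm(responder, refine_prompt)
```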
Phase 5: Quantitative Evaluation of Original Responses (Rx): ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-LCS)18, cosine similarities, and document lengths (in words) were used for quantitative evaluations. Specifically, if LLMx generates the primary research plan LLMxP, with n=4, there will be LLM1P, LLM2P, LLM3P, and LLM4P. Now every LLMx will be critiqued by each LLMy with y≠x. Thus, there will be 4 * 3=12 critiques. Each of the critiques will be fed back independently to the LLM that generated the primary response, and will be used to generate a secondary response. Thus, there will be three secondary responses for each of the four LLMs (12 secondary responses).
There are three categories of evaluations (see Figure 2). Category 1: We measure the length of each primary result and then compare the four primary results pairwise with each other. If we call those P1, P2, P3, and P4, then we compare (P1, P2), (P1, P3), (P1, P4), (P2, P3), (P2, P4), and (P3, P4). Category 2: First we record the lengths of the secondary responses. Because each primary response is mapped into three secondary responses (see Figure 1), there will be 12 lengths. Then we compare each primary LLMxP with its three secondary LLMxS outputs. Category 3: We measure the length of each critique individually. Then we compare the critiques of each LLM with each other. If the critiques are C1x, C2x, and C3x for the primary result x, then we compare those pairwise. For the pairwise comparisons we use ROUGE-1, ROUGE-2, ROUGE-LCS, and cosine similarities.
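As a concrete illustration of these pairwise comparisons, the sketch below computes word counts and pairwise ROUGE scores for the four primary responses (Category 1), continuing the illustrative primary_responses dictionary from the earlier sketches; the choice of the rouge-score package is an assumption, since the paper does not name its ROUGE implementation.

```python
# Word counts and pairwise ROUGE scores for the four primary responses (Category 1).
# The rouge-score package is an assumed implementation choice.
from itertools import combinations
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

lengths = {name: len(text.split()) for name, text in primary_responses.items()}

pairwise_rouge = {}
for (name_a, text_a), (name_b, text_b) in combinations(primary_responses.items(), 2):
    scores = scorer.score(text_a, text_b)  # first argument is treated as the reference
    pairwise_rouge[(name_a, name_b)] = {
        key: round(value.fmeasure, 6) for key, value in scores.items()
    }
```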
Phase 6: Evaluation of Improvement through critique: We compute the same quantitative metrics used in Phase 5 to compare each original response with its corresponding refined response, resulting in four additional sets of values. This provides quantitative evidence regarding the effectiveness of critique-driven iterative refinement.
Phase 7: Expert Human Evaluation: To complement the quantitative assessments, human evaluators were recruited. Specifically, each evaluator was provided with the four original responses (one from each LLM) and a randomly selected subset of six out of the twelve improved responses, presented in a randomized order to avoid bias. This structured human evaluation adds insights beyond numerical metrics, providing clarity on the interpretability and practical utility of the refined outputs.
Phase 8: Analysis and Conclusions: Based on the results of Phases 5, 6, and 7, we conclude which LLM or which pair of LLMs provides the best research support for the given problem, and whether the research support goes beyond what a human researcher could have come up with in a comparable amount of time.
Results
During our research we generated LLM output (primary results, secondary results, and critiques) equivalent to 19,868 words. This number was derived from Tables 1, 3, and 5. This total is about four times the length of the current manuscript. Therefore, we made the primary LLM responses, critiques, and secondary (improved) responses available at https://github.com/narenkhatwani/LLMs-responses. One remarkable observation is that all the LLMs produced not just English text but formulas for computing goodness that were as good as or better than what our team had previously suggested. For example, DeepSeek (after improvement by the critique of GPT-4.5) proposed a formula for the Utility (U) of a new concept for a given ontology that combines the following factors:
Relevance(C, O): Use domain-specific embeddings (e.g., Word2Vec, BERT) to compute similarity scores. Normalize to [0, 1].
Coverage(C, O): Compute the minimum distance of (C) to clusters of concepts in (O). Inverse distance (closer = higher coverage).
Redundancy(C, O): Compare attributes/properties of (C) to existing concepts (e.g., Jaccard index).
Connectivity(C, O): Count candidate relationships (e.g., parent/child, synonyms) divided by total possible edges.
InformationGain(C, O): Use entropy: Entropy(O) − Entropy(O ∪ C).
Consistency(C, O): Formal verification (e.g., description logic reasoning).
Complexity(C, O): Ratio of new edges/nodes added versus existing size.
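The combined expression is not reproduced here; the full responses are available in the repository linked above. A plausible form, consistent with the factor list and with the weighted-sum mechanism quoted later in this paper, would be a normalized weighted combination in plain-text notation, e.g., U(C, O) = w1*Relevance(C, O) + w2*Coverage(C, O) + w3*Connectivity(C, O) + w4*InformationGain(C, O) + w5*Consistency(C, O) − w6*Redundancy(C, O) − w7*Complexity(C, O), with non-negative weights wi summing to 1 and with C inserted into O only if U(C, O) exceeds a chosen threshold. The signs, weights, and threshold shown here are illustrative assumptions rather than DeepSeek’s exact formulation.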
Table 1.
LLMs’ primary response lengths (measured in words)
| LLM Names | Length of Responses (in words) |
|---|---|
| GPT-4 Turbo (GPT-4.5) | 505 |
| DeepSeek-R1 | 1210 |
| Mistral-Large-2411 | 409 |
| Llama-3 (Llama-3.3-70B-Instruct) | 464 |
Table 3.
LLMs’ secondary response lengths (measured in words)
| LLM Names | Length of Responses (in words) |
|---|---|
| DeepSeek improved by GPT-4.5 critique | 766 |
| DeepSeek improved by Llama-3 critique | 1003 |
| DeepSeek improved by Mistral critique | 762 |
| GPT-4.5 improved by DeepSeek critique | 562 |
| GPT-4.5 improved by Llama-3 critique | 586 |
| GPT-4.5 improved by Mistral critique | 560 |
| Llama-3 improved by DeepSeek critique | 735 |
| Llama-3 improved by GPT-4.5 critique | 886 |
| Llama-3 improved by Mistral critique | 750 |
| Mistral improved by DeepSeek critique | 685 |
| Mistral improved by GPT-4.5 critique | 733 |
| Mistral improved by Llama-3 critique | 968 |
Table 5.
LLMs’ critiques’ lengths (measured in words)
| Critiques by LLMs | # of Words |
|---|---|
| DeepSeek critique by GPT-4.5 | 553 |
| DeepSeek critique by Llama-3 | 522 |
| DeepSeek critique by Mistral | 855 |
| GPT-4.5 critique by DeepSeek | 985 |
| GPT-4.5 critique by Llama-3 | 702 |
| GPT-4.5 critique by Mistral | 680 |
| Llama-3 critique by DeepSeek | 949 |
| Llama-3 critique by GPT-4.5 | 535 |
| Llama-3 critique by Mistral | 593 |
| Mistral critique by DeepSeek | 839 |
| Mistral critique by GPT-4.5 | 468 |
| Mistral critique by Llama-3 | 603 |
Category 1 – Original Responses
The Category 1 results consist of two parts. First, we record the lengths of all primary responses, independently from each other (Table 1). While Mistral, Llama, and GPT-4.5 have response lengths of the same order of magnitude, DeepSeek’s response is more than twice as long. This could indicate that DeepSeek provides a deeper analysis.
The second part of Category 1 (Table 2) performs a pairwise comparison of the primary responses of the four LLMs, using ROUGE and cosine metrics. Cosine similarities are computed by embedding each complete primary response using the all-MiniLM-L6-v2 model from the SentenceTransformers library, which generates vectors of length l = 384. Cosine similarities are generally high, with five out of six between 0.8 and 0.9; only one is below 0.8. This shows that (measured by embeddings) the four primary responses are fairly similar to each other. That is an encouraging result, because if the LLMs produced widely different outputs, then each result would be more doubtful and more likely to contain hallucinations. Just as consensual reviews by human evaluators increase our confidence in their assessments, the similarity among the primary responses strengthens our confidence that the LLMs are correctly addressing the task at hand. The ROUGE-1 score computes unigram overlap, and values above 0.5 are considered moderate overlap. ROUGE-2 (bigram overlap) and ROUGE-LCS (longest common subsequence overlap) values are lower. ROUGE-LCS values between 0 and 0.2 are considered minimal similarity. Notably, all three values in this range are between DeepSeek and the other three LLMs. This could be attributed to the extra length of the DeepSeek results. The other three pairs have ROUGE-LCS values between 0.27 and 0.31, which indicates “some similarity”18. Thus, these results are not as supportive of similarity as the cosine values.
Table 2.
ROUGE and cosine similarity scores comparison between pairs of LLMs’ primary responses
| LLM Primary Responses Compared | ROUGE-1 | ROUGE-2 | ROUGE-L | Cosine Similarity |
|---|---|---|---|---|
| Llama-3 vs. Mistral | 0.552901 | 0.264538 | 0.309443 | 0.848429 |
| Llama-3 vs. DeepSeek | 0.435308 | 0.163438 | 0.174123 | 0.811927 |
| Llama-3 vs. GPT-4.5 | 0.540655 | 0.190476 | 0.276663 | 0.887778 |
| Mistral vs. DeepSeek | 0.399753 | 0.142063 | 0.172733 | 0.762535 |
| Mistral vs. GPT-4.5 | 0.551422 | 0.219298 | 0.284464 | 0.895726 |
| DeepSeek vs. GPT-4.5 | 0.438129 | 0.130409 | 0.170515 | 0.818125 |
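For reference, a minimal sketch of the cosine similarity computation described above, using the all-MiniLM-L6-v2 SentenceTransformers model named in the text; the function name and example variables are illustrative, not the exact notebook code.

```python
# Embed two complete responses with all-MiniLM-L6-v2 (384-dimensional vectors)
# and compute their cosine similarity, as described for Table 2.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity(text_a: str, text_b: str) -> float:
    emb = embedder.encode([text_a, text_b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

# Example: similarity between two primary responses (illustrative keys).
# score = cosine_similarity(primary_responses["gpt-4-turbo"],
#                           primary_responses["deepseek-r1"])
```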
Category 2 – Improved Responses
Following the pattern of Category 1, we start by recording the lengths of the secondary responses (Table 3). Table 4 shows the comparisons between each primary response and the three secondary responses it gives rise to, using the same ROUGE and cosine metrics. In Table 1, the range between the shortest and longest response is 801 (= 1210 − 409) with only four values. In Table 3, even though there are 12 responses as opposed to four, the range is reduced to 443 (= 1003 − 560). Thus, critiquing appears to lead to a convergence of lengths. The longest secondary response is shorter than the longest primary output, and the shortest secondary response is longer than the shortest primary output.
Table 4.
ROUGE and cosine similarity scores comparison between primary and secondary response pairs
| Primary vs. Secondary Responses Compared | ROUGE-1 | ROUGE-2 | ROUGE-L | Cosine Similarity |
|---|---|---|---|---|
| DeepSeek v/s DeepSeek improved by GPT-4.5 | 0.615463 | 0.487955 | 0.474142 | 0.765776 |
| DeepSeek v/s DeepSeek improved by Llama-3 | 0.665142 | 0.463370 | 0.430924 | 0.725186 |
| DeepSeek v/s DeepSeek improved by Mistral | 0.634476 | 0.503119 | 0.484943 | 0.806456 |
| GPT-4.5 v/s GPT-4.5 improved by DeepSeek | 0.730733 | 0.478551 | 0.540438 | 0.933458 |
| GPT-4.5 v/s GPT-4.5 improved by Llama-3 | 0.748593 | 0.560150 | 0.606004 | 0.946357 |
| GPT-4.5 v/s GPT-4.5 improved by Mistral | 0.716650 | 0.517073 | 0.609542 | 0.950972 |
| Llama-3 v/s Llama-3 improved by DeepSeek | 0.674556 | 0.563929 | 0.596788 | 0.876670 |
| Llama-3 v/s Llama-3 improved by GPT-4.5 | 0.629685 | 0.545045 | 0.575712 | 0.904020 |
| Llama-3 v/s Llama-3 improved by Mistral | 0.629100 | 0.377422 | 0.415475 | 0.865848 |
| Mistral v/s Mistral improved by DeepSeek | 0.688554 | 0.544000 | 0.567879 | 0.747582 |
| Mistral v/s Mistral improved by GPT-4.5 | 0.653130 | 0.515254 | 0.544839 | 0.774463 |
| Mistral v/s Mistral improved by Llama-3 | 0.603854 | 0.587563 | 0.599572 | 0.983189 |
High values in Table 4 indicate that the secondary LLM-generated research plan is semantically similar to the primary output. The highest observed values are for GPT-4.5 with all of its critiques and for Mistral “improved” by Llama.
Category 3 – Critiques
In Category 3 we measure whether critiques for one specific primary result are similar to each other (Table 6). As before, we start with the lengths, in this case of the critiques (Table 5). The range between the shortest and longest critique is 517 (=985–468).
Table 6.
Comparison of LLMs’ critiques by ROUGE scores and Cosine Similarity Scores
| Critiqued LLM | Two LLM Critiques Compared | ROUGE-1 | ROUGE-2 | ROUGE-L | Cosine Similarity |
|---|---|---|---|---|---|
| GPT-4.5 | Llama-3 vs. Mistral | 0.550405 | 0.194547 | 0.263429 | 0.822089 |
| GPT-4.5 | Mistral vs. DeepSeek | 0.553165 | 0.205538 | 0.212661 | 0.846475 |
| GPT-4.5 | Llama-3 vs. DeepSeek | 0.569544 | 0.187275 | 0.247002 | 0.812235 |
| Llama-3 | GPT-4.5 vs. Mistral | 0.599103 | 0.219227 | 0.252915 | 0.870375 |
| Llama-3 | Mistral vs. DeepSeek | 0.568393 | 0.236702 | 0.241700 | 0.817433 |
| Llama-3 | GPT-4.5 vs. DeepSeek | 0.517647 | 0.177408 | 0.232526 | 0.912402 |
| Mistral | GPT-4.5 vs. Llama-3 | 0.566972 | 0.189338 | 0.280734 | 0.797380 |
| Mistral | Llama-3 vs. DeepSeek | 0.575017 | 0.206848 | 0.256804 | 0.788717 |
| Mistral | GPT-4.5 vs. DeepSeek | 0.524942 | 0.164489 | 0.233308 | 0.843629 |
| DeepSeek | GPT-4.5 vs. Llama-3 | 0.575619 | 0.156107 | 0.230981 | 0.779789 |
| DeepSeek | Llama-3 vs. Mistral | 0.557902 | 0.211524 | 0.243263 | 0.793387 |
| DeepSeek | GPT-4.5 vs. Mistral | 0.532578 | 0.161702 | 0.191218 | 0.759125 |
The longest critique is shorter than the longest primary response. The shortest critique is longer than the shortest primary output. The two longest critiques are generated by DeepSeek, replicating to some degree the observation made about the length of the primary output of DeepSeek. The cosine similarities in Table 6 are high, ranging from 0.76 to 0.91. This indicates a high degree of similarity between the different critiques. These numbers are quite close to the results obtained for the primary responses (0.76 to 0.89).
The implementation, including code and data preprocessing scripts, is available as several Jupyter notebooks at https://github.com/narenkhatwani/llms-for-concept-utility.
Human Evaluation
Evaluation Rubric
We constructed two response datasets, each containing 10 responses: the 12 secondary LLM responses were divided into two equal sets of six, and each set was combined with the four primary responses (one from each LLM). Each of the three human evaluators received one of these datasets. The responses within each set were randomized, ensuring that evaluators would not be able to infer any patterns.
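A minimal sketch of how such evaluation sets could be assembled and shuffled; the seed, function name, and splitting order are illustrative, not the exact procedure used.

```python
# Build two evaluation sets of 10 responses each: 6 of the 12 secondary responses
# plus the 4 primary responses, then shuffle so ordering reveals no pattern.
import random

def build_evaluation_sets(primary: list[str], secondary: list[str], seed: int = 0):
    assert len(primary) == 4 and len(secondary) == 12
    rng = random.Random(seed)
    shuffled_secondary = secondary[:]
    rng.shuffle(shuffled_secondary)
    evaluation_sets = []
    for half in (shuffled_secondary[:6], shuffled_secondary[6:]):
        responses = primary + half
        rng.shuffle(responses)  # randomize presentation order within the set
        evaluation_sets.append(responses)
    return evaluation_sets
```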
Additionally, we removed any explicit indicators, such as “here is an improved version” or “this response is based on a critique,” to prevent bias in evaluator judgments, even though these phrases were contained in the original prompts. This ensured that evaluators assessed responses purely on content quality, coherence, and correctness. The experts were asked the following questions to gauge the effectiveness of the iterative refinement process:
We have posed a research question to an LLM and asked it for a solution. After we received the answer, we asked the same LLM for an improved version of the solution.
For each pair of responses, can you identify which one is the original and which one is the improved version? Please base your decision on the content and coherence of the responses.
Do you notice any outright errors or significant mistakes in either the original or the improved version? If so, please briefly record or summarize these errors.
Can you estimate, in percentage terms, how much better the improved version is compared to the original version?
Finally, could you rank the top three answers overall, from best to third best?
Key Findings
For question 1, two evaluators correctly classified 8 out of 10 responses as primary or secondary, while the third evaluator correctly classified 7 out of 10. This suggests that the critique-and-refine process led to noticeable improvements, though some refinements were subtle. The results highlight that LLM-generated critiques can enhance response quality in the Medical Informatics research domain.
For question 2, two human evaluators responded that they did not see anything obviously wrong. One evaluator noted inconsistencies in the notation and formatting.
For question 3, we judged the results to be unusable, probably due to the vague formulation of the question and the complexity of the text.
For question 4, results are shown in Table 7. We only list primary versus secondary responses, because different evaluators had slightly different data sets to work with. The results are not conclusive.
Table 7.
Results of Human Evaluators on Ranking Task
| Rank | Evaluator 1 | Evaluator 2 | Evaluator 3 |
|---|---|---|---|
| 1 | Secondary Response | Secondary Response | Primary Response |
| 2 | Secondary Response | Primary Response | Primary Response |
| 3 | Primary Response | Secondary Response | Secondary Response |
Manual Annotation of the Improved Responses in comparison with the Primary Responses
In a final step of manual analysis, the authors compared the primary and secondary responses of the LLMs. During our analysis of the secondary (improved) LLM responses, we manually categorized content into three types: repeated information (highlighted in the text in yellow), newly introduced information (highlighted in green), and hallucinated information (highlighted in red). Across the dataset, we observed approximately 110 instances in which improved responses retained content from the original outputs. In about 25 instances, content was newly generated, indicating meaningful refinements. In one instance, the content was identified as a hallucination, meaning that the information was incorrect or not supported by the original context.
For example, in “Implement the utility calculation within an ontology management tool like Protégé.” the phrase “within an ontology management tool like Protégé” (red), introduces a factual inaccuracy that was not present in the original. In contrast, in “To dynamically adjust the weights, implement a feedback mechanism that considers the outcomes of previous insertions. Here’s a simple mechanism: 1. Initially, set equal weights: w1(0) = w2(0) = w3(0) = w4(0) = w5(0) = w6(0) = 1/6. 2) After each insertion decision, evaluate the change in the ontology’s performance (e.g., using metrics like accuracy, completeness, or F-measure). 3) Adjust the weights based on the gradient of the performance change with respect to each factor. This can be done using optimization algorithms like gradient descent.” the addition of “a feedback mechanism that considers the outcomes of previous insertions” (green) enhances clarity by incorporating relevant details from the critique while maintaining factual correctness. The text “Operationalization: Calculate the ratio of meaningful relationships that can be formed between the new concept and existing ones to the total possible relationships” retains “Calculate the ratio of meaningful relationships” (yellow) showing consistency with the original response, but without significant improvement. These examples indicate that while LLM-based iterative refinement may successfully introduce new content into a response, it typically does not deviate much from the primary result.
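As an illustration only, the weight-update mechanism quoted in the green example could be read as a simple gradient-style update. The sketch below is one possible interpretation; the learning rate, the clipping and renormalization steps, and the performance metric are assumptions, not part of the LLM’s quoted response.

```python
# One possible reading of the quoted mechanism: start with equal weights,
# observe the change in an ontology quality metric after each insertion decision,
# and nudge each weight in proportion to its factor's contribution.
def update_weights(weights, factor_scores, perf_delta, lr=0.05):
    """weights, factor_scores: equal-length lists; perf_delta: change in a
    quality metric (e.g., F-measure) after the last insertion decision."""
    updated = [w + lr * perf_delta * f for w, f in zip(weights, factor_scores)]
    updated = [max(w, 0.0) for w in updated]   # keep weights non-negative
    total = sum(updated) or 1.0
    return [w / total for w in updated]        # renormalize so weights sum to 1

weights = [1 / 6] * 6  # w1(0) = ... = w6(0) = 1/6, as in the quoted response
```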
Discussion and Limitations
Returning to our initial research questions, we propose the following answers.
LLMs can be used as legitimate tools to support research in the domain of medical ontologies and terminologies. However, the LLMs did not discover all of our own research techniques. For example, we had developed a new criterion of multi-lingual brevity (as of now unpublished)19, which none of the LLMs discovered.
All four LLMs were able to generate “thoughtful” critiques of research plans generated by other LLMs. Among them, DeepSeek consistently produced more verbose outputs compared to the other models, as shown in Tables 1, 3, and 5, which report the output length for each LLM.
We discovered only one significant case of hallucination.
Each LLM generated a secondary research plan, distinct from its primary research plan, after being provided with critical feedback.
We have computed cosine similarities and ROUGE scores comparing the different primary LLM outputs in Table 2. ROUGE-LCS scores indicate that different primary results are textually different, even though they are all cogent for a human reader.
The same observations can be made about critiques. Table 6 shows the comparisons of the different critiques, and they are substantially different at the phrase level.
The primary results differ from the secondary research plans. In Table 4, a low ROUGE-LCS score indicates that the revised response does not have substantial word-for-word overlap with the original response. That suggests that there were structural or lexical changes. However, a high cosine similarity indicates that despite these surface-level changes, the semantic content remains closely related, meaning the improved response retains the core meaning while possibly adding new relevant information or improving coherence. This implies that the LLM did not merely rephrase the text, but likely expanded or refined the response while preserving its original intent. These results support the idea that LLMs can be used for research support in both roles, as primary generators of research plans and as critics of research plans.
The research prompt fed to the four LLMs was derived directly from a research project of the authors, and we were impressed by the quality of the research plans of all four LLMs. While the LLMs “had some ideas” that the authors did not have, the opposite was also true, and there was a good amount of overlap with the work of the authors. Our evaluators were overwhelmingly able to distinguish between primary and secondary research plans, but they had a hard time with ranking a bigger set of LLM outputs.
While we conclude that LLMs can be used with care for research support “in our niche,” we consider it inconceivable and irresponsible to use LLMs for research support outside of our own area of expertise. As researchers we have to be able to thoroughly evaluate the correctness of any research plan generated with the support of an LLM. While an error in an ontology is likely not going to cause any true damage, an error in a clinical domain research plan might cause irreparable harm to patients. The transferability of these conclusions into adjacent domains is not a given.
Limitations
There are several limitations of the current work. One limitation is in the human determination of which response is primary and which is secondary, because the secondary responses tend to be longer (but not always, see Results). Thus, evaluators have a “shortcut” to make their decisions, potentially without even reading the LLM outputs. Additionally, the 12 final responses were split into two groups of six, and each evaluator only reviewed one group. A second limitation is that we constrained critiquing to one round. Two or more rounds of critiquing might lead to better results. However, if all n − 1 LLMs are used in every cycle, this would lead to a “combinatorial explosion” (i.e., a much bigger version of Figure 1). This issue notwithstanding, we had intended to use more LLMs (e.g., OpenAI’s Pro subscription model), but did not have free or cheap access to them. Thirdly, we did not explore the option of having an LLM critiquing its own response and then feeding the primary output and critique back to that same LLM to generate a secondary response. Lastly, more human evaluators would produce more significant results.
Conclusions and Future Work
It appears that LLMs can be added to the toolbox of Medical Informatics researchers. This study indicates that LLMs can successfully function not only as generators of research plans but also as critics of the output of other LLMs. Finally, feedback cycles where critiques are passed back to the original LLMs to achieve improvements are also possible. In the hands of an experienced researcher, they are likely to support the development of cogent research plans. However, no researcher should use LLMs for research support outside of their areas of expertise.
In future work, we hope to address the limitations mentioned above. We are particularly interested in investigating whether multiple iterations of the critique-improve cycle will converge towards a “best” solution. Thus, if the critique of a secondary result is fed back to the LLM-R, does a tertiary result provide any new points, or are the results converging in just two iterations? We also hope to get access to more LLMs and newer versions of the LLMs used in this study. Additionally, we propose including a human expert as a fifth “LLM” in the critique pipeline, to serve as a benchmark for evaluating the quality and relevance of the critiques generated by the four LLMs.
Acknowledgement
Research reported in this publication was supported by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UM1TR004789. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors would like to thank the following human evaluators for their generous contributions of time and expertise to this research: Oliver Alvarado Rodriguez, Sarang Patil, and Varad Rane.
References
- 1. Xu Z, Jain S, Kankanhalli M. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817. 2024.
- 2. Gou Z, Shao Z, Gong Y, Shen Y, Yang Y, Duan N, Chen W. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing. arXiv. 2023.
- 3. Mousavi S, Gutierrez RL, Rengarajan D, et al. Enhancing Large Language Models with Ensemble of Critics for Mitigating Toxicity and Hallucination.
- 4. Bubeck S, Chandrasekaran V, Eldan R, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv. 2023.
- 5. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine. 2023;388(13):1233–1239. doi: 10.1056/NEJMsr2214184.
- 6. Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Computing Surveys. 2023;55(12):1–38.
- 7. Akyürek AF, Akyürek E, Madaan A, et al. RL4F: Generating natural language feedback with reinforcement learning for repairing model outputs. arXiv preprint arXiv:2305.08844. 2023.
- 8. Akyurek AF, Akyurek E, Kalyan A, Clark P, Wijaya DT, Tandon N. RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs. Association for Computational Linguistics. 2023:7716–7733.
- 9. Lee D, Whang T, Lee C, Lim H. Towards Reliable and Fluent Large Language Models: Incorporating Feedback Learning Loops in QA Systems. arXiv preprint. 2023.
- 10. Fernandes P, Madaan A, Liu E, et al. Bridging the gap: A survey on integrating (human) feedback for natural language generation. Transactions of the Association for Computational Linguistics. 2023;11:1643–1668.
- 11. Zhou J, Chen Z, Wang B, Huang M. Facilitating Multi-turn Emotional Support Conversation with Positive Emotion Elicitation: A Reinforcement Learning Approach. Association for Computational Linguistics. 2023:1714–1729.
- 12. GitHub. GitHub Marketplace - AI Models. 2025.
- 13. OpenAI. GPT-4 Turbo (GPT-4.5). https://openai.com/index/introducing-gpt-4-5/ (accessed 7/3/25).
- 14. DeepSeek AI. DeepSeek-R1: Advanced large language model for general tasks. https://github.com/deepseek-ai/DeepSeek-R1 (accessed 7/3/25).
- 15. Jiang AQ, Sablayrolles A, Mensch A, Haziza D, de las Casas D, Lample G, Lachaux M-A, Stock P, Lavril T, Lacroix T. Mistral-Large-Instruct-2411. https://huggingface.co/mistralai/Mistral-Large-Instruct-2411 (accessed 7/3/25).
- 16. Meta AI. meta-llama/Llama-3.3-70B-Instruct. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct.
- 17. OpenAI. OpenAI API documentation. 2024. https://platform.openai.com/docs/api-reference/responses/create#responses-create-temperature (accessed 7/3/25).
- 18. Lin C-Y. ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. 2004:74–81.
- 19. Khatwani N, Wang L, Geller J. A Concept Utility Framework for Incremental Ontology Expansion. eHealth 2025, at the 19th Multi Conference on Computer Science and Information Systems (MCCSIS 2025); in press.


