Optimizing document retrieval using massive text embeddings and LLM prompt engineering

Goran Mitrov; Boris Stanoev; Vladimir Trajkovik; Biljana Risteska Stojkoska; Lasko Basnarkov; Petre Lameski; Martin Kampel; Eftim Zdravevski

2026 Apr 14;15:173. doi: 10.1186/s13643-026-03155-4

PMCID: PMC13200469 PMID: 41981616

Abstract

Background

The rapid expansion of digital data poses a unique challenge for retrieving relevant and insightful information efficiently. In particular, the increasing volume of scientific publications has made literature reviews time-consuming. The emergence of large language models (LLMs) offers new opportunities to streamline this process.

Methods

This paper explores the use of generative artificial intelligence (GenAI) for query reformulation and evaluates the performance of nine massive text embedding models, varying in size and fine-tuning strategies, in the context of document retrieval. We apply multiple prompt engineering techniques to evaluate the ability of LLMs to generate effective queries, comparing them with human-crafted queries. These queries are used to retrieve documents with each of the nine embedding models. The evaluation spans five datasets and uses metrics such as recall, average precision, and rank-based measures.

Results

Results show that embedding models fine-tuned for semantic similarity consistently outperform general-purpose models, with UAE Large proving most robust across diverse domains. Furthermore, queries generated using zero-shot and few-shot prompting techniques often surpass the performance of human-formulated queries.

Conclusion

These findings highlight the value of integrating LLMs and massive text embeddings to reduce manual effort in literature reviews. GenAI provides a reliable starting point for query formulation, with human input reserved for refinement when needed.

Keywords: Massive text embeddings, LLMs, Prompt engineering, Document retrieval, Information retrieval, Vector indexes, Systematic review automation, Automated surveys

Introduction

In an era of unprecedented data growth and an overwhelming flood of information, the ability to sift through vast amounts of raw documents and quickly extract the most relevant insights has become a critical need. Consequently, Information Retrieval (IR) systems play a crucial role in facilitating efficient access to desired resources. Recent advances in Natural Language Processing (NLP) have significantly enhanced the performance of IR systems by improving their ability to understand and interpret the complexities of natural language [1].

At the core of many IR applications is document retrieval, a process that provides a ranked list of relevant documents in response to a user’s query. This process is widely used across various fields, such as healthcare, law, and business, where quick access to pertinent information is crucial. In academia, whether researchers seek to quickly explore a specific niche or gain a broad overview of their field, document retrieval is a key component in efficiently conducting literature reviews [2]. Following the broader trend of data explosion in this digital era, the volume of scientific publications has been steadily increasing, with an annual growth rate of approximately 4% [3]. As a result, to navigate this vast sea of publications, the past decade has seen a growing number of review types, each incorporating different adaptations of the document retrieval process [4]. Most modern digital libraries offer various search functionalities, but they still face significant challenges. These include retrieving semantically relevant documents, as keyword-based search methods remain the dominant approach, handling ambiguous or vague user queries, and supporting the synthesis of relevant findings. As a result, much of the labor-intensive work is still left to researchers. The processes of database searching and paper selection during literature reviews are among the most time-consuming tasks, with experts highlighting these areas as the ones most in need of improved tool support [5].

As the volume of digital textual data has rapidly increased, the advancements in NLP techniques have kept pace. Over the past decade, we have seen the rise of text embeddings, which represent words as numerical vectors, followed by the development of transformer architecture, capable of capturing long-range dependencies in text. Most recently, we have witnessed the revolutionary emergence of large language models (LLMs) and their ability to comprehend and generate human language at an unprecedented scale. LLMs have made an immediate impact due to their power and versatility, demonstrating the ability to solve a wide range of tasks and showing immense potential for transforming fields such as medicine, education, finance, engineering, law, and more [6, 7]. LLMs have fundamentally transformed IR by enhancing each step of the document retrieval process, from improving query understanding and retrieval accuracy to enabling a more refined re-ranking of results, ultimately enriching the user experience with more context-sensitive and semantically relevant document matches [8]. One opportunity to improve IR is to assist users in constructing an ideal query. Zhai [9] claims that LLMs can bridge the vocabulary gap between queries and documents, clarify user intent, and assist in query transformation or reformulation.

We focus on literature reviews because they represent a challenging document retrieval task, involving high-recall search, ambiguous query formulation, and domain-specific semantic relevance. While this study is centered on this use case, the techniques we explore are applicable to a wide range of document retrieval scenarios. In the field of automating document retrieval for literature reviews, many existing tools utilize techniques from NLP and machine learning to assist researchers [10, 11]. However, LLMs offer the potential to further optimize these processes by improving accuracy and scalability while also holding promise for better integration into literature review workflows—motivating the exploration of their current capabilities in this area [12].

In previous work, we explored automating literature reviews by focusing on keyword-matching between paper abstracts and specific study properties [13]. However, this approach was limited by its reliance on exact lexical overlap, which often failed to capture the broader semantic context and nuanced terminology of diverse scientific domains. Building on this foundation, this study aims to integrate state-of-the-art technologies, such as massive text embeddings, LLMs, and vector databases, to enhance the capabilities and expand the range of functionalities. Figure 1 presents a high-level overview of our framework, visually summarizing the entire flow from the researcher’s initial idea to the retrieval of ranked scientific publications alongside the specific research goals labeled A, B, and C. The process begins with the researcher’s idea, goal, and aim, which are translated into an input consisting of search terms and properties (specific words and phrases acting as queries in the document retrieval process). The search terms are used to query digital libraries such as IEEE Xplore, Springer, MDPI, PubMed, and ScienceDirect, gathering an initial set of documents. We extract and clean the titles and abstracts from these documents, which are then embedded using massive text embeddings and stored in a vector database. Finally, the query is executed to retrieve and rank the most relevant documents.

Fig. 1 — A high-level overview of the framework architecture

In our study, we focus primarily on three key research questions, each targeting a critical part of the retrieval pipeline. The first aim (label A) is to investigate the differences between human experts and Generative AI in transforming a research goal into search terms and queries. With this, we want to examine whether, and in which scenarios, LLMs can assist in reformulating queries. We will compare the performance of human experts with three distinct prompt engineering techniques, measuring precision, recall, and the time spent on the retrieval process. Understanding the differences is crucial for designing assistive systems that enhance research productivity. Our second goal (label B) is to compare multiple massive text embeddings and determine which model performs best in the context of our use case. We investigate how variations in model size, training types, and fine-tuning strategies impact semantic similarity performance in the document retrieval process. Lastly, we will assess the impact of different storage options by comparing three vector indexes based on time and memory performance metrics and their effect on precision and recall (label C).

The remainder of this article is structured as follows: Subsection Related works reviews related work in information retrieval, large language models, and their integration. Section Background introduces the foundational concepts essential to our study. Section Methodology outlines our research process, detailing the datasets, methodology, and experimental setup. Section Results presents the main results of our experiments, followed by a discussion in Section Discussion that highlights key findings and insights. Finally, Section Conclusion concludes our work.

Related works

Information retrieval has been an essential field for decades, emerging nearly 70 years ago. Since its inception, it has steadily grown and evolved, beginning with foundational techniques like indexing, retrieval, TF-IDF, keyword-based search engines, and over time, integrating machine learning techniques [14]. Today, like many other fields, IR is experiencing exponential growth due to the enormous amount of data generated every second and the continuous breakthroughs in artificial intelligence (AI). In a recent study, [1] provide a comprehensive overview of modern IR systems, demonstrating that recent advancements in deep learning techniques, large labeled datasets, and increased computing power have significantly improved the systems’ performance and made them more capable of handling the complexity of natural language queries. In their latest research, [15] discuss generative information retrieval (GenIR) as a new direction in IR. They present GenIR as a novel retrieval paradigm that shifts from traditional document retrieval using indexes, and instead uses a generative model to encode documents into its internal parameters, offering deeper semantic understanding. However, the limitations of these models include challenges with scalability, memory capacity, and training complexity, and they are beyond the scope of our research.

The use of LLMs in IR can be explored from multiple angles. For instance, [16] provide an extensive review on the development and applications of dense retrieval models that leverage pre-trained LLMs to encode text into dense vectors, enabling more contextually rich retrieval. Similarly, [17] discuss integrating the generative capabilities of LLMs with traditional retrieval techniques to address challenges such as computational efficiency, factual accuracy, and domain-specific adaptability. They propose a paradigm that combines LLMs, IR models, and human input to enhance user modeling, dynamic indexing, and semantic matching. While these works establish a vital theoretical framework, our study moves beyond the theoretical paradigms to provide cross-domain validation of multiple text embedding models, testing the practical accuracy and scalability.

The nature of LLMs is to use statistical patterns and likelihoods to generate responses, and they are designed to serve as general-purpose models. As their outputs may sometimes lack specificity or relevance, prompt engineering emerged as a process to guide and refine LLM output, addressing these inherent limitations and maximizing their utility and accuracy [18]. In their survey, Azad and Deepak [19] provide an in-depth historical overview of query expansion (QE) methods and their impact on IR, categorizing the techniques into manual, automatic, and interactive approaches. As part of automatic query expansion approaches, [20] propose using LLMs to aid in query expansion, concluding that their application shows promising results and can improve retrieval performance metrics such as recall, mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG). Building on these studies, we adopt LLM-based query expansion using multiple prompting techniques to evaluate whether automated queries can mitigate the inherent limitations of human subjectivity in the document retrieval process.

Vector databases and LLMs work in synergy, with vector databases providing efficient methods for storing, retrieving, and managing the high-dimensional vectors intrinsic to LLM operations. When acting as a cost-effective semantic cache and a robust memory layer, vector databases can address several challenges LLMs face, such as hallucinations, high commercial application costs, and memory limitations [21]. To accelerate the IR process in high-dimensional embedding spaces and enable efficient lookup in large-scale environments, vector databases support indexes that implement approximate nearest neighbor (ANN) techniques to efficiently identify the closest points, improving speed and reducing computational costs. Aumüller et al. [22] present a benchmarking tool for evaluating these ANN indexes, assessing their performance and quality across standard datasets. In our study, these evaluations informed the selection of our vector indexing platform, ensuring that the chosen architecture balanced computational efficiency with the high retrieval accuracy that systematic reviews require.

In recent years, the automation and semi-automation of conducting literature reviews have gained momentum, with numerous researchers exploring the area using NLP, machine learning, and text-mining techniques [23–26]. In their study, [27] explored the process of conducting a literature review using AI, reporting a positive experience with approximately 77% of time saved. However, they also highlight the need for specific remedies to address scenarios that could compromise the methodological quality of the review. Dennstädt et al. [28] utilized an LLM by constructing an instructional prompt that included the title, abstract, and relevance criteria to evaluate whether an article should be included in a literature review. They measured accuracy, recall, and precision based on the classification of articles, and their findings indicated promising results. However, such approaches often rely on binary classification for full automation, which carries the inherent risk of excluding relevant studies without human oversight. In contrast, we focus on identifying the most semantically relevant documents through optimized retrieval and ranking, providing a decision-support tool that empowers the human researcher. Similarly, [29] developed an R package designed to automate the title and abstract screening process using GPT-4. In the validation process, the package demonstrated an overall accuracy of 84%, with specificities and sensitivities of 89% and 71%, respectively, when compared to human consensus decisions. In a complementary study, [30] evaluated the effectiveness of GPT-4 in identifying relevant titles and abstracts from real-world clinical review datasets, comparing its performance against ground truth labels provided by two independent human reviewers. They concluded that using LLMs as a support tool rather than a replacement can lead to more accurate and reliable conclusions in medical research.

These studies underscore the potential of LLMs and AI-assisted tools in automating literature reviews. However, challenges such as optimizing accuracy, scalability, and domain adaptability remain, highlighting the need for continued research and innovation in this space.

Background

In this section, we provide a brief technical background on the key concepts explored in our study. Specifically, we will cover text embeddings and their underlying architecture, vector indexes and databases, large language models (LLMs), and prompt engineering techniques.

Embeddings

Embeddings are the transformation of unstructured data, such as text, into a structured format by mapping the semantic meaning of words, phrases, or entire documents into a continuous vector space of numerical values. Word embeddings, developed to capture semantic meaning and context, project words as vectors into a multi-dimensional space, where the distance and direction between vectors reflect the similarity and relationships among words [31, 32]. Text embeddings extend the concept of capturing contextual meaning to larger units of text, such as sentences or paragraphs. The emergence of transformer architecture marked a significant breakthrough, as its self-attention mechanism captures long-range dependencies and bidirectional context [33]. Massive text embeddings are large-scale, contextually rich representations generated from vast corpora of text using pre-trained models with millions of parameters, and they play a crucial role in our research.

Vector databases and vector indexes

The increased use of embeddings in NLP, combined with the limitations of traditional databases in handling high-dimensional data, has led to the development of specialized data structures known as vector databases. These vector databases are optimized for storing high-dimensional vector data and enabling fast, efficient, and scalable searches [34]. Their storage capabilities and retrieval techniques make them well-suited for a wide range of applications, including recommendation systems, text generation and augmentation, and efficient document retrieval [35].

At the core of vector databases are vector indexes, which are structures designed to optimize similarity searches in high-dimensional spaces by enabling quick retrieval of vectors most similar to a query vector. These indexes support approximate nearest neighbor (ANN) search, a technique that significantly speeds up searches in large, high-dimensional datasets by finding neighbors close to the query with a certain level of approximation [36]. When comparing vector indexes, the three key parameters for evaluation are accuracy, time, and memory, and the challenge lies in finding the right balance between them. Our research will focus on three specific vector indexes: the flat index, the inverted file index (IVF), and the product quantization index (PQ).

The flat index is the simplest vector index, performing a brute-force exhaustive search by calculating the distance between the query vector and all data points. It is highly accurate and requires no training, but it is computationally expensive and does not scale well. Figure 2A illustrates the search process using the flat index.
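The exhaustive search described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the study: the Euclidean metric, the function names, and the toy two-dimensional vectors are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, vectors, k):
    """Exhaustive search: score every stored vector, return the k closest ids."""
    scored = [(euclidean(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort()
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings" (hypothetical values, for illustration only).
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
nearest = flat_search([1.0, 1.0], vectors, k=2)
```

Every query touches every stored vector, so the cost grows linearly with the collection size, which is the scaling limitation noted above.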

The IVF index speeds up the search process by reducing the search scope and using approximation. It partitions the vector space by clustering vectors into a predefined number of groups, finding centroids, and constructing a Voronoi cell diagram where each data point belongs to one cell [37]. This is illustrated in Fig. 2B. During the search, the query vector is mapped to a cluster, and the closest neighbors are searched within that cluster or neighboring clusters rather than across the entire dataset. This significantly reduces search time, although it requires initial training, and may affect accuracy depending on the clustering quality.
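The following sketch illustrates the cell-based search idea under simplifying assumptions: the centroids are given directly instead of being learned by clustering, and the parameter name `nprobe` (the number of cells scanned) and the toy data are hypothetical choices for the example.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector id to the cell of its nearest centroid."""
    cells = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest_cell = min(range(len(centroids)), key=lambda c: dist(vec, centroids[c]))
        cells[nearest_cell].append(idx)
    return cells


def ivf_search(query, vectors, centroids, cells, k, nprobe=1):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in cells[c]]
    candidates.sort(key=lambda idx: dist(query, vectors[idx]))
    return candidates[:k]


# Hypothetical toy data; in practice the centroids come from a clustering step.
vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [2.4, 2.4]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
cells = build_ivf(vectors, centroids)
one_cell = ivf_search([2.6, 2.6], vectors, centroids, cells, k=1, nprobe=1)
two_cells = ivf_search([2.6, 2.6], vectors, centroids, cells, k=1, nprobe=2)
```

In this toy case the true nearest neighbor (id 4) sits just across a cell boundary from the query, so scanning a single cell misses it while scanning both cells finds it. This is the accuracy cost of restricting the search scope.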

Product quantization (PQ) compresses high-dimensional data into a smaller space to reduce the memory footprint of indexes [38]. As shown in Fig. 2C, the PQ process begins by splitting the original vector into smaller sub-vectors. Each sub-vector is placed into its own subspace, along with sub-vectors from other original vectors. In each subspace, clustering is performed, and the sub-vectors are mapped to specific cluster centroids. The centroid values are then replaced with unique IDs, and the subspaces are merged to form the quantized vector. This method improves both speed and memory efficiency but sacrifices some accuracy due to lossy compression.
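The encoding steps above can be illustrated with a small sketch. It assumes that the per-subspace centroids (the codebooks) are already available; in practice they would be learned by clustering, and all names and values here are hypothetical.

```python
def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    size = len(vec) // m
    return [vec[i * size:(i + 1) * size] for i in range(m)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec, codebooks):
    """Replace each sub-vector with the ID of its nearest centroid."""
    subs = split(vec, len(codebooks))
    return [
        min(range(len(book)), key=lambda cid: sq_dist(sub, book[cid]))
        for sub, book in zip(subs, codebooks)
    ]


def pq_decode(code, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    out = []
    for cid, book in zip(code, codebooks):
        out.extend(book[cid])
    return out


# Hypothetical codebooks: 2 subspaces with 2 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
code = pq_encode([0.9, 1.1, 0.8, 0.1], codebooks)
approx = pq_decode(code, codebooks)
```

The four-dimensional vector is stored as two small integer IDs, and decoding returns the centroids rather than the original values, which shows where the lossy compression comes from.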

Large language models

LLMs represent the most significant breakthrough in NLP, with the ability to understand, process, and generate human language. Predominantly built on transformer-based architecture, these models rely on massive text embeddings and learn complex semantic relationships from vast amounts of data through self-supervised and semi-supervised training processes [39]. The number of parameters in LLMs is measured in billions, and they are pre-trained on massive datasets containing general world knowledge. These models can also be fine-tuned on specific datasets and tasks, allowing them to adapt to particular applications and domains. Some of the most prominent and best-performing models include the GPT family [40], Gemini [41], LLaMA [42], Claude [43], Mistral [44], Gemma [45], and others [46]. LLMs are a subset of generative artificial intelligence (GenAI), computational techniques capable of producing original and meaningful content, such as text, images, audio, or code, that often seems indistinguishable from what humans might produce [47]. We will leverage the GenAI capabilities of LLMs in our research by applying various prompt engineering techniques for query expansion and query reformulation.

Prompt engineering techniques

Communication with LLMs is done through a prompt, which is the input text provided to the model to obtain a specific response or prompt the model to perform a task. A prompt usually includes an instruction, along with optional components such as context, additional background information to guide the response, input data, and output indicators or constraints. Prompt engineering is an iterative process of refining prompts to guide LLMs toward more accurate and relevant responses. A prompt engineering technique is a strategy for structuring the input so that the model generates more accurate and relevant responses [48]. The field of prompt engineering is thriving, with numerous diverse methods and techniques, each with its own methodology and application [49]. Our research will focus on three prompt engineering techniques: zero-shot, few-shot, and prompt chaining.

Zero-shot prompting eliminates the need for training data and relies on prompts that ask the model to perform a task without providing examples to guide the response. This requires the model to generate a response based solely on its pre-existing knowledge [50]. Few-shot prompting is a technique that enables in-context learning by providing a few input-output examples to guide the model in understanding the task pattern, unlike zero-shot prompting [51]. Prompt chaining is useful for accomplishing complex tasks that require detailed prompts. With this technique, the task is broken down into sub-tasks, each performed in a separate prompt, with the output of one prompt serving as the input for the next until the desired outcome is achieved.

Methodology

In this section, we present our research methodology, beginning with an overview of the datasets used in our study. We then discuss the selection of massive text embedding models, the design and implementation of prompts, and the choice of vector indexes. Following this, we describe the full experimental setup and conclude with the evaluation metrics we employ to assess the performance.

Datasets

The datasets used in this study were obtained through close collaboration with domain experts conducting real research studies, during which they utilized our NLP tool. Beginning with their initial ideas and research goals defined in a textual form, and leveraging their domain-specific knowledge, we worked together to reformulate these goals into inputs suitable for the tool. This includes deriving keyword-based search strings to query the digital libraries and defining semantic “properties” used to construct the query vector for ranking the documents. Using the search strings, the tool gathers a set of candidate documents containing information such as DOI, title, abstract, and other relevant details. This pool of documents, referred to as papers provided in Table 1, is then presented to the researchers for evaluation. Through a thorough manual review process, the experts select the documents they consider relevant for inclusion in their research. These selected documents, listed as papers selected in Table 1, serve as the labeled ground-truth for evaluating retrieval performance in our study. In Table 1, we provide an overview of the datasets, including the number of papers initially available to the researchers and the final number of articles selected as relevant.

Table 1.

Datasets overview

Dataset	Papers provided	Papers selected	Reference
Driver healthcare monitoring	13,518	30	Awaiting publication
Venture capital	17,133	150	Awaiting publication
Relational learning	18,711	23	[52]
Ambient assisted living	26,331	108	[53]
10-m walks	6,708	22	[54]

The following is a detailed overview of the datasets, including their origin and focus areas:

Driving healthcare monitoring with IoT and wearable devices: a systematic review—this dataset is derived from a systematic review exploring the use of IoT and wearable devices in monitoring drivers’ health.
Venture capital: a bibliometric analysis—this dataset is obtained from a bibliometric and structural review and highlights three primary topics: environmental, social, and governance (ESG) factors, innovation, and exit strategies within the venture capital field.
Automating feature extraction from entity-relation models: experimental evaluation of machine learning methods for relational learning—this dataset is curated from a study that included a comprehensive review of the literature on relational learning and further explores machine learning methods for feature extraction from entity-relation models.
Ambient assisted living (AAL): scoping review of artificial intelligence (AI) models, domains, technology, and concerns—this dataset originates from a comprehensive scoping review that identifies, analyzes, and extracts literature on AI models in AAL.
Mobile and wearable technologies for the analysis of 10-m walk test: a concise, systematic review—this dataset is derived from a systematic review focused on using mobile and wearable devices to measure physical parameters during the 10-m walk test, analyzing test performance.

At the start of our investigation, we conducted an exploratory analysis focusing on the length of the documents in each dataset. After cleaning the title and abstract of each article, we tokenized the text, treating each word as a separate token, and created a histogram to visualize the token count distribution, as illustrated in Fig. 3. The analysis reveals that all datasets exhibit an approximately normal distribution, with similar means and standard deviations, showing no statistically significant differences. This allows us to exclude document length as a factor when comparing metrics during embedding creation, which is crucial for scalability and comparing vector indexes in terms of time performance. The remaining differentiating factor between the datasets will be the number of documents.
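A minimal sketch of the length analysis described above is given below. It counts whitespace-separated words per document and summarizes them; the sample strings are hypothetical stand-ins for cleaned titles and abstracts, and the histogram plotting step is omitted.

```python
import statistics


def token_count(text):
    """Count whitespace-separated words, treating each word as one token."""
    return len(text.split())


# Hypothetical cleaned title+abstract strings, for illustration only.
docs = [
    "wearable sensors for driver health monitoring",
    "venture capital and innovation",
    "relational learning with graph features for entity relation models",
]
counts = [token_count(d) for d in docs]
mean_len = statistics.mean(counts)
std_len = statistics.stdev(counts)
```

Computing the mean and standard deviation per dataset gives the summary values that would be compared across the five datasets.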

Fig. 3 — Histogram and distribution for the input document text length per dataset

Massive text embeddings

This subsection provides an overview of the massive text embeddings used in our research. For the selection of the models, we consulted the massive text embedding benchmark (MTEB) [55] which evaluates models across diverse embedding tasks. Our goal was to include models that vary in size and functionality. We selected nine models based on their overall benchmark scores, as well as their performance in specific tasks such as retrieval and semantic text similarity. In Table 2, we present the key characteristics of each model, including the number of parameters (in millions), memory usage (in GB), the number of dimensions, and the average model loading time (in seconds).

Table 2.

Overview of the massive text embedding models

Embedding model	Number of parameters (millions)	Memory usage (GB)	Number of dimensions	Average load time (seconds)
MiniLM	23	0.085	384	2.36
BGE Large v1.5	335	1.249	1024	2.52
BGE M3	560	2.11	1024	4.48
E5 Large v2	335	1.25	1024	3.32
UAE Large	335	1.25	1024	3.05
Snowflake Arctic L	334	1.24	1024	3.30
QWEN2	1776	9.25	1536	6.65
Stella v5	1543	9.25	1024	9.40
E5 Mistral	7111	26.49	4096	13.79

Furthermore, Table 3 complements the previous table by providing a qualitative analysis of the training methodologies, the types of data the models were trained on, and the specific strengths or tasks for which they are best suited. Through this, we aim to examine the qualitative aspects of the models and offer deeper insight into the underlying factors that may explain the performance variations.

Table 3.

Qualitative comparison of the massive text embedding models

Embedding model	Training type	Training data domain	Strengths/use cases
MiniLM [56, 57]	Teacher-student distillation; Contrastive learning	Open-domain text; Sentence pairs	Low-resource environments; Sentence similarity; Quick inference
BGE Large v1.5 [58]	Contrastive learning; Task-specific fine tuning	General web corpora; Curated QA datasets	Semantic search; Document ranking; Versatility
BGE M3 [59]	Multi-stage training; Self-knowledge distillation	Multi-lingual; Multi-format retrieval data	Cross-lingual search; Hybrid retrieval
E5 Large v2 [60]	Contrastive learning	CCPairs (query-passage); QA pairs	Question answering; Re-ranking
UAE Large [61]	Contrastive learning; Angle optimization	General text; Semantic similarity tuning	Fine-grained semantic similarity; Robust generalization
Snowflake Arctic L [62]	Contrastive learning; Retrieval-specific fine-tuning	Stratified IR corpora; Enterprise datasets	Scalable IR tasks
QWEN2 1.5B Instruct [63]	Contrastive learning; Instruction tuning	Code; Forums; Social media	Instruction-following tasks; Classification
Stella 1.5B v5 [64]	Matryoshka Representation Learning	Open-domain multi-task sets	Multi-task
E5 Mistral 7B Instruct [44, 65]	Contrastive learning; Multilingual fine tuning	Multilingual text corpora	Multilingual IR; Document re-ranking

Prompt engineering techniques

In this subsection, we provide the implementation details for the prompt engineering techniques used in our study. As shown in Fig. 1, the second step of our framework involves transforming the researcher’s study goal into search terms and queries for the document retrieval process in a format suitable for our NLP tool. Initially, this query reformulation was conducted manually, relying on researcher expertise, which made the process time-consuming. To address this, we introduce GenAI to automate the query reformulation phase, exploring three prompt engineering techniques: zero-shot, few-shot, and prompt chaining. These techniques are widely used, effective, and relatively straightforward to implement.

In designing our prompts, we followed the best practices recognized by the community, including the clear separation of instruction, context, and input data, specifying the desired length and format of the output, and avoiding ambiguity by providing clear handling for edge cases.

Zero-shot prompting

Zero-shot prompting refers to the technique in which a prompt is provided to the model without any examples or demonstrations. The model is instructed to perform the task based solely on the given instruction, without additional examples to guide its response. In our implementation, the prompt consists of an instruction, context, desired output format, and the input data. Below is the instruction used in our zero-shot prompting approach.

[Image: zero-shot instruction prompt (13643_2026_3155_Figa_HTML.jpg)]

To provide the model with further guidance on how to handle the instruction, we include additional context, which is demonstrated below.

[Image: zero-shot context prompt (13643_2026_3155_Figb_HTML.jpg)]

Next, we provide explicit guidance on the model’s expected outcome, as outlined below.

[Image: zero-shot output format prompt (13643_2026_3155_Figc_HTML.jpg)]

Together with the study goal as input data, these elements form the complete prompt sent to the LLM to generate a response.
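As a rough illustration of how these four elements could be combined, the following Python sketch joins an instruction, context, output format, and study goal into one prompt. The function name, the section labels, the `###` delimiter, and all placeholder strings are assumptions made for illustration; the study's actual prompt text is not reproduced in this sketch.

```python
def build_zero_shot_prompt(instruction: str, context: str,
                           output_format: str, study_goal: str) -> str:
    """Join the four prompt elements, keeping each in its own labeled block."""
    sections = [
        ("Instruction", instruction),
        ("Context", context),
        ("Output format", output_format),
        ("Input", study_goal),
    ]
    # Hypothetical layout: one '###' header per section keeps instruction,
    # context, and input data clearly separated.
    return "\n\n".join(f"### {name}\n{text.strip()}" for name, text in sections)


# Placeholder strings stand in for the real prompt elements.
prompt = build_zero_shot_prompt(
    instruction="<instruction text>",
    context="<context text>",
    output_format="<output format text>",
    study_goal="<study goal written by the researcher>",
)
```

Keeping the study goal as the last block means the same template can be reused for every dataset by changing only the final argument.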

Few-shot prompting

Few-shot prompting is a technique that builds upon zero-shot prompting by enabling in-context learning, where we provide demonstrations to guide the model toward better performance. These examples serve as conditioning for subsequent instances where we want the model to generate a response. In our use case, we utilized the same prompt structure as in zero-shot prompting but added four examples (one from each of the remaining datasets), making it 4-shot prompting. In each example, we used the study goal as the input and the ground truth query constructed by the researcher as the output. The examples were formatted as shown below.

[Image: few-shot example format (13643_2026_3155_Figd_HTML.jpg)]
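The selection of demonstrations from the remaining datasets can be sketched as follows. This is a minimal illustration, assuming a simple `Input:`/`Output:` layout; the helper names, the layout, and the placeholder goals and queries are hypothetical and do not reproduce the study's actual example format.

```python
def select_examples(queries_by_dataset: dict, target_dataset: str) -> list:
    """Return (study_goal, query) pairs from every dataset except the target."""
    return [pair for name, pair in queries_by_dataset.items()
            if name != target_dataset]


def format_examples(examples: list) -> str:
    """Render each demonstration as an input/output pair (hypothetical layout)."""
    return "\n\n".join(
        f"Input: {goal}\nOutput: {query}" for goal, query in examples
    )


# Placeholder study goals and ground truth queries for the five datasets.
queries_by_dataset = {
    "Driver healthcare monitor": ("<goal 1>", "<query 1>"),
    "Venture capital": ("<goal 2>", "<query 2>"),
    "Relational learning": ("<goal 3>", "<query 3>"),
    "Ambient assisted living": ("<goal 4>", "<query 4>"),
    "10-m walks": ("<goal 5>", "<query 5>"),
}

# When generating a query for one dataset, the other four act as demonstrations.
examples = select_examples(queries_by_dataset, "Venture capital")
few_shot_block = format_examples(examples)
```

Excluding the target dataset ensures the model never sees the ground truth query it is being compared against.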

Prompt chaining

Prompt chaining is a technique to accomplish complex tasks that an LLM might struggle to handle effectively when given a single, detailed prompt. This approach divides the task into multiple subtasks, and the LLM is first prompted with one subtask. Its response is then used as input for the next prompt, creating a chain of prompts. In our use case, we address the complex task of query reformulation by breaking it into three subtasks: extraction, expansion, and formatting. We handle these transformations in separate prompts to achieve the final desired output.

We begin with the first prompt by providing the initial instruction to extract information from the input, as demonstrated below:

[Image: prompt chaining, extraction prompt (13643_2026_3155_Fige_HTML.jpg)]

Next, we use the LLM’s output from the first prompt to provide additional instruction for expanding on the extracted information, as shown below:

[Image: prompt chaining, expansion prompt (13643_2026_3155_Figf_HTML.jpg)]

We conclude the chain with the final prompt, where we finalize the content of the desired outcome and provide specific formatting for the response.

[Image: prompt chaining, formatting prompt (13643_2026_3155_Figg_HTML.jpg)]
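The three-step chain can be sketched as a loop in which each response becomes the input of the next prompt. The sketch below is only an illustration: `run_chain`, the template strings, and the `fake_llm` stand-in are hypothetical, and any real LLM client would replace the stub.

```python
from typing import Callable


def run_chain(study_goal: str, llm: Callable[[str], str]) -> str:
    """Run extraction, expansion, and formatting as three consecutive prompts."""
    # Placeholder templates; the study's actual prompt wording is not shown here.
    templates = [
        "<extraction instruction>\n\nInput:\n{previous}",
        "<expansion instruction>\n\nInput:\n{previous}",
        "<formatting instruction>\n\nInput:\n{previous}",
    ]
    previous = study_goal
    for template in templates:
        # The response to one prompt becomes the input of the next.
        previous = llm(template.format(previous=previous))
    return previous


# Stand-in for a real LLM call, used only to make the sketch runnable.
calls = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response {len(calls)}"


final_query = run_chain("<study goal>", fake_llm)
```

Passing the LLM as a callable keeps the chain logic independent of any particular model or API.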

Vector indexes

In this subsection, we provide the implementation details for the vector indexes used in our research. These indexes store the vector embeddings generated by the massive text embedding models and retrieve ranked documents when a query is made. To focus on comparing the performance of different types of indexes and avoid potential overhead introduced by vector databases, we chose to work with FAISS [66], a lightweight library for efficient similarity search on dense vectors. For this research, we selected the flat index as a baseline, along with two additional indexes: the IVF index to compare the trade-off between time consumption and performance and the PQ index to examine the trade-off between memory usage and performance.

Flat index

For the flat index implementation, we use the FAISS IndexFlatL2 with the number of vector dimensions as the only parameter. This implementation encodes the vectors into fixed-size codes and stores them in an array without compression or additional overhead. During the search phase, an exhaustive brute-force search is performed, where all indexed vectors are decoded sequentially and compared to the query vector using the L2 (Euclidean distance) metric.
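To make the exhaustive search concrete, the sketch below reimplements the idea in plain Python rather than calling FAISS. It ranks by squared Euclidean distance, which yields the same ordering as the L2 distance; the function name and the toy vectors are hypothetical.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustively compare the query with every stored vector.

    Returns the k nearest (squared_distance, index) pairs, closest first.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance; it preserves the L2 ordering.
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()
    return scored[:k]


# Toy two-dimensional "embeddings" used only for illustration.
docs = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.5, 0.0]]
top2 = flat_l2_search(docs, query=[0.0, 0.1], k=2)
nearest_ids = [idx for _, idx in top2]
```

Every stored vector is visited once per query, so the cost grows linearly with the collection size regardless of how many neighbors are requested.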

Inverted file index (IVF)

For the IVF index, we use the IndexIVFFlat implementation, which requires a quantizer index, the number of vector dimensions, and the number of clusters. The quantizer groups data into clusters using k-means clustering and assigns each data point to a cluster during indexing. In the search process, it identifies the nearest clusters and searches within their vectors. We use the flat index as the quantizer, and for the number of clusters, we follow the library’s recommendation of using the square root of the total number of documents in the index.

During the search, we also adjust the nprobe parameter, which determines how many of the closest clusters are searched. We iteratively experimented to develop a dynamic formula for the parameter to ensure that at least N closest neighbors are found while searching the fewest clusters possible.
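The two parameter choices can be illustrated with a short sketch. The first helper applies the square-root rule for the number of clusters. The second is not the dynamic formula used in the study, which is not given here; it only illustrates the stated goal of probing the fewest clusters that together contain at least N vectors. Both function names and all numbers are hypothetical.

```python
import math


def recommended_nlist(n_documents: int) -> int:
    """Number of clusters as the square root of the index size (rounded)."""
    return max(1, round(math.sqrt(n_documents)))


def fewest_clusters_for(n_neighbors: int, cluster_sizes_by_distance: list) -> int:
    """Smallest number of nearest clusters that together hold n_neighbors vectors.

    cluster_sizes_by_distance lists cluster sizes ordered from the cluster
    closest to the query to the farthest one.
    """
    covered = 0
    for probed, size in enumerate(cluster_sizes_by_distance, start=1):
        covered += size
        if covered >= n_neighbors:
            return probed
    # Probing every cluster is the upper bound.
    return len(cluster_sizes_by_distance)


nlist = recommended_nlist(10_000)
nprobe = fewest_clusters_for(100, [40, 35, 30, 50, 20])
```

In this toy case the three closest clusters hold 105 vectors, so three probes are enough to return 100 neighbors.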

Product quantization index (PQ)

For the PQ index, we use the IndexPQ implementation, which requires the number of vector dimensions, the number of sub-vector splits, and the number of bits necessary to represent each sub-vector in its compressed form. We set the number of sub-vectors to d/8, where d is the vector dimension. This ensures a balanced sub-vector size and efficient quantization. For the nbits, we choose a value of 8, corresponding to 256 centroids per sub-vector space, providing sufficient granularity for approximation and effective memory optimization. Consequently, during a search, we do not directly compare the original high-dimensional vectors but instead operate in a compressed domain.
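The memory effect of these settings can be estimated with simple arithmetic. The sketch below assumes float32 storage (4 bytes per component), a dimension divisible by 8, and no bookkeeping overhead; the function name and the example collection size are hypothetical.

```python
def pq_memory_bytes(n_vectors: int, d: int, nbits: int = 8) -> dict:
    """Estimate storage for raw float32 vectors versus PQ codes with d/8 splits."""
    m = d // 8                      # number of sub-vectors
    dsub = d // m                   # dimensions per sub-vector (8 here)
    centroids = 2 ** nbits          # 256 centroids per sub-space for nbits=8
    flat = n_vectors * d * 4        # uncompressed float32 vectors
    codes = n_vectors * m * nbits // 8          # one code of nbits per sub-vector
    codebooks = m * centroids * dsub * 4        # centroid tables, shared by all vectors
    return {"flat": flat, "pq_codes": codes, "pq_codebooks": codebooks}


# Hypothetical collection: 10,000 documents with 1024-dimensional embeddings.
est = pq_memory_bytes(n_vectors=10_000, d=1024)
ratio = est["flat"] / (est["pq_codes"] + est["pq_codebooks"])
```

Under these assumptions the codes alone are 32 times smaller than the raw vectors, but the fixed codebook cost lowers the net ratio for small collections (about 17.6 in this example).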

Experiments

In this subsection, we describe the experimental setup, intermediate findings, and key observations made during the process without focusing on the final results. Our experiments began with clean datasets that included a document identifier, title, abstract, and label indicating whether the document was considered relevant in the corresponding research process. The experiments were divided into two phases: one for generating embeddings and the other for querying and retrieval.

All experiments, except those involving the Mistral model, were conducted on a server equipped with an NVIDIA TITAN V GPU, featuring 640 tensor cores, 5120 CUDA cores, and 12 GB of high-bandwidth memory (HBM). Due to the specific hardware requirements for the Mistral model, those experiments were performed on a cloud-based virtual machine with an NVIDIA A100 GPU, offering 432 tensor cores, 6912 CUDA cores, and 40 GB of HBM.

In the first phase, we iterated through each dataset and embedding model, generating vector embeddings for the documents and storing them as .npy files, the standard binary format in NumPy for persisting arrays on disk. As part of the embedding generation phase, we measured the time required to compute embeddings for each dataset using different text embedding models. The results, shown in Fig. 4, provide a comparative view of each model’s embedding times (in seconds) across the datasets.

Fig. 4 — Clustered bar chart for times necessary for embedding generation per model and dataset

MiniLM consistently exhibits the shortest times, while E5 Mistral requires the longest time due to differences in the number of parameters and the output vector embedding size. Using the time required for generating embeddings with MiniLM as a baseline, models such as BGE Large, BGE M3, E5 Large, UAE Large, and Snowflake Arctic L took approximately 5 to 6 times longer. QWEN2 and Stella required around 18 times more time, while Mistral took about 50 to 60 times longer for the same task. Our analysis also shows that embedding times increase linearly as a function of the number of documents. This suggests that the computational load scales predictably with dataset size, reinforcing the importance of selecting an appropriate model based on the size of the dataset and time constraints.

In the second phase, the process involved two steps. First, we loaded the pre-generated embeddings for each dataset and embedding model and created the three types of vector indexes. Additionally, we trained the indexes that required a training phase. During this step, we measured three main metrics: training time, build time (the time required to populate the indexes), and memory consumption. The average aggregated results of these metrics per dataset and index type are presented in Table 4.

Table 4.

Time and memory requirements for building vector indexes

Dataset	Index type	Training time (s)	Build time (s)	Memory consumption (MB)
Driver healthcare monitor	Flat	0	0.031	65.42
	IVF	0.472	0.059	66.1
	PQ	10.553	0.399	3.36
Venture capital	Flat	0	0.027	82.48
	IVF	0.718	0.082	83.25
	PQ	13.332	0.499	3.89
Relational learning	Flat	0	0.036	82.17
	IVF	0.67	0.078	82.94
	PQ	13.28	0.498	3.88
Ambient assisted living	Flat	0	0.053	130.04
	IVF	1.3	0.148	131.05
	PQ	21.16	0.782	5.38
10-m walks	Flat	0	0.014	30.43
	IVF	0.189	0.022	30.87
	PQ	5.234	0.199	2.27

The results reveal important insights into the trade-offs among the index types. All build times are below 1 s, making their impact negligible. Taking the flat index as a baseline, IVF introduces a small training overhead, ranging from about 0.2 to 1.3 s, and consumes slightly more memory. While this might make IVF seem less efficient than the flat index in terms of these metrics, its primary advantage lies in reducing query times, which is not reflected in this table. On the other hand, the PQ index offers a significant reduction in memory consumption, approximately 20 times less than the flat index on average (between roughly 13 and 24 times, depending on the dataset). However, this memory efficiency comes at the cost of increased training times, which range from about 5 to 21 s.
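The memory comparison can be recomputed directly from Table 4. The short sketch below uses only the flat and PQ memory values reported in the table; the variable names are arbitrary.

```python
# Memory consumption (MB) copied from Table 4: (flat, PQ) per dataset.
memory_mb = {
    "Driver healthcare monitor": (65.42, 3.36),
    "Venture capital": (82.48, 3.89),
    "Relational learning": (82.17, 3.88),
    "Ambient assisted living": (130.04, 5.38),
    "10-m walks": (30.43, 2.27),
}

# How many times more memory the flat index uses than the PQ index.
ratios = {name: flat / pq for name, (flat, pq) in memory_mb.items()}
mean_ratio = sum(ratios.values()) / len(ratios)

for name, value in ratios.items():
    print(f"{name}: flat uses {value:.1f}x the memory of PQ")
print(f"mean: {mean_ratio:.1f}x")
```

The per-dataset ratios fall between roughly 13 and 24, with a mean close to 20, and the smallest dataset shows the smallest saving.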

Finally, after setting up the indexes, we concluded our experiments by sending queries to the indexes and retrieving a ranked list of the closest documents. We experimented with different numbers of closest documents to retrieve (100, 200, 500, and 1000) to assess the performance of each index type. We also measured the time required for each index type to compare and return the results during this step. To evaluate the performance, we computed the mean query time for each index type across the datasets and embedding models, which is presented in Fig. 5.

Fig. 5 — Mean query response time per index type

From the figure, we can observe that both the Flat and PQ indexes have constant, straight lines because they perform an exhaustive search, regardless of the number of closest documents returned. This results in a fixed computational load for both indexes. However, the PQ index returns results approximately 8.5 times faster than the Flat index due to its operation in the compressed vector domain. On the other hand, the IVF index shows faster performance for the smaller number of closest documents, returning results in under 1 ms per query for 100 closest documents. As the number of closest documents increases, the query time gradually rises, matching the PQ index’s performance for 1000 documents.

Evaluation metrics

In this subsection, we provide definitions for the evaluation metrics used to assess the performance of our document retrieval system, focusing on comparing the different embedding models and query formulations. Specifically, we use the order-unaware metric Recall@K, as well as order-aware metrics such as Average Precision@K (AP@K), Average Rank (AR), Median Rank (MR), and Reciprocal Rank (RR).

Recall@K

Recall@K (R@K) is one of the most widely used and interpretable evaluation metrics in IR systems. It measures the proportion of relevant documents retrieved out of all relevant documents in the entire dataset. In our use case, we assess how many of the relevant documents selected by a researcher were retrieved by the system for different values of K (at 100, 500, and 1000). If a researcher selected N relevant articles, and only M of them are among the top K results, the recall is calculated as

\begin{matrix} Recall@K = \frac{M}{N} \end{matrix}
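As a small worked example of this definition, the following sketch computes Recall@K from a ranked list of document identifiers and the set of relevant ones. The function name and the toy identifiers are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)  # M
    return hits / len(relevant)                                        # M / N


ranked = ["d7", "d2", "d9", "d4", "d1", "d5"]
relevant = ["d2", "d4", "d8"]
r_at_3 = recall_at_k(ranked, relevant, k=3)   # only d2 is in the top 3
r_at_5 = recall_at_k(ranked, relevant, k=5)   # d2 and d4 are in the top 5
```

Here N = 3, so the two calls give 1/3 and 2/3; the document `d8` is never retrieved and therefore caps the achievable recall for this list.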

Average Precision@K

Average Precision@K (AP@K) is an order-aware metric used to evaluate the relevance of a ranked list of retrieved documents. It measures the precision of the results up to a specific position, K, accounting for both the quality and ranking of relevant items within the result set. To calculate AP, the first step is to compute Precision@K, which measures the proportion of relevant documents out of the top K retrieved documents. If a researcher selected N relevant articles, and M of them appear among the top K results, the precision is calculated as:

\begin{matrix} Precision@K = \frac{M}{K} \end{matrix}

AP@K extends this by averaging the precision values at each position where a relevant document appears in the ranked list of K documents, providing a more fine-grained assessment of the system’s ability to rank relevant documents higher. It is calculated as in Eq. 3, where $rel_{k}$ is a relevance indicator equal to 1 when the $k^{th}$ item is relevant and 0 otherwise.

\begin{matrix} Average Precision@K = \frac{\sum_{k = 1}^{K} Precision@k \cdot rel_{k}}{M} \end{matrix}
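The following sketch shows one way to compute Precision@K and AP@K as defined above, with the precision evaluated at each position that holds a relevant document and the sum divided by M, the number of relevant documents found in the top K. The function names and the toy data are hypothetical, and returning 0 when no relevant document is retrieved is an assumed convention.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant (M / K)."""
    relevant = set(relevant_ids)
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


def average_precision_at_k(ranked_ids, relevant_ids, k):
    """Mean of the precision values at each relevant position within the top k."""
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:                 # rel_k = 1
            hits += 1
            precision_sum += hits / position   # precision at this position
    # Divide by M; assumed convention: 0 when nothing relevant is retrieved.
    return precision_sum / hits if hits else 0.0


ranked = ["d7", "d2", "d9", "d4", "d1", "d5"]
relevant = ["d2", "d4", "d8"]
p_at_5 = precision_at_k(ranked, relevant, k=5)
ap_at_5 = average_precision_at_k(ranked, relevant, k=5)
```

The relevant documents sit at positions 2 and 4, giving precisions 1/2 and 2/4, so AP@5 is 0.5 while Precision@5 is 0.4.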

Average rank

Average rank (AR) is a simple yet insightful metric that considers the entire ranked list of documents. It calculates the mean rank position of each relevant document, with a lower average rank indicating better retrieval effectiveness. If the researcher has selected N relevant documents, the value for this metric is calculated as in Eq. 4, where $D_{i}$ is the $i^{th}$ relevant document from the dataset.

\begin{matrix} Average Rank = \frac{\sum_{i = 1}^{i = N} Rank (D_{i})}{N} \end{matrix}

Median rank

Median rank (MR) is similar to AR, but it measures the central tendency of the ranks of relevant documents. It identifies the middle rank, providing a more robust measure of where the majority of relevant documents are located in the list. If the researcher has selected N relevant documents, the value for this metric is calculated as follows, where the terms are the ranks of the relevant documents sorted in ascending order:

\begin{matrix} Median Rank = \begin{cases} {(\frac{N + 1}{2})}^{th} term & \text{if } N \text{ is odd}, \\ \frac{{(\frac{N}{2})}^{th} term + {(\frac{N}{2} + 1)}^{th} term}{2} & \text{if } N \text{ is even}. \end{cases} \end{matrix}

We take into consideration both metrics, with AR offering a holistic view of how relevant documents are distributed across the entire ranked list, whereas MR highlights where most of the relevant documents tend to cluster.
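Both rank metrics can be computed from the positions of the relevant documents in the full ranked list. The sketch below is a minimal illustration that assumes every relevant document appears somewhere in the list; the helper name and the toy data are hypothetical.

```python
from statistics import mean, median


def relevant_ranks(ranked_ids, relevant_ids):
    """1-based positions of the relevant documents in the full ranked list."""
    relevant = set(relevant_ids)
    return [pos for pos, doc_id in enumerate(ranked_ids, start=1)
            if doc_id in relevant]


ranked = ["d7", "d2", "d9", "d4", "d1", "d5", "d8", "d3"]
ranks = relevant_ranks(ranked, ["d2", "d4", "d8", "d3"])   # [2, 4, 7, 8]
average_rank = mean(ranks)     # (2 + 4 + 7 + 8) / 4
median_rank = median(ranks)    # mean of the two middle ranks, 4 and 7
```

With an even number of relevant documents, the median is the mean of the two middle ranks (5.5 here), while the average rank is 5.25.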

Reciprocal rank

Reciprocal rank (RR) is an order-aware metric that measures how quickly the first relevant document appears in a ranked list of documents. It is calculated as the reciprocal of the rank at which the first relevant document is found. Mathematically, if the first relevant document appears at position k, the reciprocal rank is calculated as follows:

\begin{matrix} Reciprocal Rank = \frac{1}{k} \end{matrix}

In this study, we refrain from using mean reciprocal rank (MRR) since our research involves a single query per prompt engineering technique.
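A minimal sketch of the reciprocal rank computation is given below. The function name and the toy data are hypothetical, and returning 0 when no relevant document is retrieved is an assumed convention rather than something stated in the study.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant document in the ranked list."""
    relevant = set(relevant_ids)
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0  # assumed convention when nothing relevant is retrieved


ranked = ["d7", "d2", "d9", "d4", "d1", "d5"]
rr = reciprocal_rank(ranked, ["d2", "d4", "d8"])   # first relevant hit at position 2
```

Under this definition, values of 1, 0.5, and 0.333 correspond to the first relevant document appearing at positions 1, 2, and 3, respectively.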

Results

In this section, we present the results of our experiments, focusing on the performance of nine text embedding models across five datasets, using four query formulation approaches: the human-crafted ground truth queries and the three prompt engineering techniques. We provide a separate table for each approach, showing the performance on the evaluation metrics defined in Subsection Evaluation metrics.

We begin by presenting the results for the embedding models utilizing the ground truth queries formulated by human experts, as outlined in Table 5. For the Driver Healthcare Monitor dataset, BGE Large and BGE M3 lead the performance across different metrics, each excelling in specific areas. In the Venture Capital and Relational Learning datasets, UAE Large emerges as the top performer, with BGE Large and MiniLM also showing strong results in some metrics. This trend of varying model performance continues with the Ambient Assisted Living dataset, where BGE Large delivers the highest scores. Finally, Snowflake Arctic L and UAE Large demonstrate superior performance for the 10-m walks dataset.

Table 5.

Ground truth results

Dataset	Model	R@100	R@500	R@1000	AP@100	AP@500	AP@1000	AR	MR	RR
Driver healthcare monitor	MiniLM	0.033	0.2	0.267	0.032	0.014	0.013	2952	2584	0.032
	BGE Large	0.033	0.233	0.5	0.333	0.061	0.038	1648	1071	0.333
	BGE M3	0.067	0.267	0.4	0.306	0.096	0.069	2837	1821	0.5
	E5 Large	0	0.1	0.167	0	0.007	0.007	3983	4103	0.006
	UAE Large	0.033	0.2	0.467	0.167	0.043	0.027	1702	1162	0.167
	Snowflake Arctic L	0	0.067	0.233	0	0.005	0.007	4418	3929	0.004
	QWEN2	0.067	0.133	0.2	0	0	0.002	3327	2433	0.001
	Stella	0.033	0.167	0.333	0.012	0.012	0.012	3622	3045	0.012
	Mistral	0.067	0.167	0.267	0.019	0.016	0.014	2536	1836	0.017
Venture capital	MiniLM	0.093	0.36	0.547	0.206	0.156	0.135	1452	848	1
	BGE Large	0.093	0.273	0.447	0.128	0.105	0.094	1713	1087	0.2
	BGE M3	0.067	0.253	0.38	0.326	0.147	0.121	2389	1601	1
	E5 Large	0.053	0.2	0.307	0.295	0.126	0.1	2400	1848	1
	UAE Large	0.1	0.32	0.567	0.216	0.143	0.122	1429	797	0.5
	Snowflake Arctic L	0.06	0.22	0.38	0.21	0.115	0.093	2067	1392	1
	QWEN2	0.007	0.047	0.073	0.011	0.012	0.013	5542	5737	0.011
	Stella	0.093	0.2	0.293	0.24	0.154	0.121	3450	2356	0.5
	Mistral	0.06	0.3	0.507	0.133	0.104	0.097	1921	967	0.062
Relational learning	MiniLM	0.261	0.913	0.957	0.139	0.091	0.089	239	136	0.333
	BGE Large	0.739	1	1	0.276	0.235	0.235	81	41	0.143
	BGE M3	0	0.217	0.391	0	0.007	0.008	2173	1546	0.004
	E5 Large	0	0.348	0.739	0	0.012	0.015	1416	722	0.007
	UAE Large	0.739	1	1	0.272	0.237	0.237	66	41	0.167
	Snowflake Arctic L	0.435	0.826	0.913	0.124	0.101	0.094	321	140	0.077
	QWEN2	0	0.13	0.13	0	0.008	0.008	3963	3602	0.007
	Stella	0.13	0.217	0.435	0.37	0.226	0.118	2482	1223	1
	Mistral	0.217	0.478	0.609	0.117	0.077	0.064	1014	557	0.167
Ambient assisted living	MiniLM	0.046	0.185	0.269	0.179	0.082	0.067	3935	2534	0.5
	BGE Large	0.102	0.296	0.407	0.121	0.094	0.082	2782	1383	0.25
	BGE M3	0.046	0.13	0.204	0.195	0.096	0.07	3614	2637	0.333
	E5 Large	0.074	0.194	0.352	0.1	0.069	0.057	3091	1401	0.333
	UAE Large	0.065	0.259	0.333	0.105	0.079	0.071	3077	1740	0.25
	Snowflake Arctic L	0.028	0.093	0.176	0.112	0.052	0.037	4972	2961	0.111
	QWEN2	0.009	0.009	0.019	0.014	0.014	0.008	12576	13391	0.014
	Stella	0.019	0.056	0.111	0.511	0.18	0.097	7955	4840	1
	Mistral	0.046	0.102	0.213	0.332	0.167	0.094	3914	2777	1
10-m walks	MiniLM	0.091	0.545	0.773	0.022	0.022	0.022	679	433	0.02
	BGE Large	0.182	0.545	0.773	0.043	0.043	0.036	583	357	0.032
	BGE M3	0.091	0.364	0.636	0.036	0.029	0.023	1008	703	0.037
	E5 Large	0.045	0.273	0.591	0.024	0.012	0.013	980	778	0.024
	UAE Large	0.136	0.636	0.818	0.031	0.037	0.034	581	372	0.027
	Snowflake Arctic L	0.318	0.636	0.727	0.061	0.051	0.048	812	337	0.056
	QWEN2	0.045	0.045	0.091	0.019	0.019	0.008	1850	1328	0.019
	Stella	0	0	0.045	0	0	0.002	3003	3539	0.002
	Mistral	0.091	0.455	0.818	0.03	0.03	0.024	610	690	0.036

R@K Recall@K, AP@K Average Precision@K, AR average rank, MR median rank, RR reciprocal rank

Bold numbers = best choices within the table

In Table 6, we present the performance of the embedding models using queries generated by the zero-shot prompting technique. Our analysis shows that, for the Driver Healthcare Monitor dataset, BGE Large and BGE M3 stand out. BGE Large excels in recall, average rank, and median rank, while BGE M3 achieves higher performance in average precision.

Table 6.

Zero-shot results

Dataset	Model	R@100	R@500	R@1000	AP@100	AP@500	AP@1000	AR	MR	RR
Driver healthcare monitor	MiniLM	0.067	0.1	0.267	0.035	0.026	0.015	1949	1518	0.038
	BGE Large	0.1	0.467	0.767	0.107	0.045	0.037	906	617	0.25
	BGE M3	0.067	0.233	0.367	0.306	0.097	0.067	1850	1662	0.5
	E5 Large	0.033	0.133	0.3	0.111	0.037	0.022	2342	2050	0.111
	UAE Large	0.133	0.333	0.667	0.1	0.056	0.039	968	762	0.25
	Snowflake Arctic L	0.033	0.167	0.433	0.091	0.029	0.018	1818	1296	0.091
	QWEN2	0	0.2	0.267	0	0.011	0.011	2164	1651	0.007
	Stella	0.1	0.267	0.433	0.033	0.031	0.025	2393	1124	0.027
	Mistral	0.1	0.2	0.467	0.084	0.051	0.031	1328	1059	0.125
Venture capital	MiniLM	0.08	0.267	0.507	0.594	0.245	0.165	1322	996	1
	BGE Large	0.1	0.28	0.447	0.212	0.151	0.122	1611	1152	1
	BGE M3	0.06	0.22	0.387	0.478	0.189	0.134	2531	1558	1
	E5 Large	0.107	0.247	0.367	0.245	0.161	0.129	2283	1626	0.167
	UAE Large	0.107	0.293	0.5	0.28	0.176	0.138	1416	993	1
	Snowflake Arctic L	0.06	0.18	0.347	0.152	0.095	0.075	2137	1431	0.1
	QWEN2	0.007	0.067	0.153	0.031	0.019	0.021	3695	2785	0.031
	Stella	0.033	0.147	0.227	0.251	0.101	0.08	3812	2719	1
	Mistral	0.093	0.353	0.573	0.12	0.116	0.109	1608	794	0.056
Relational learning	MiniLM	0.652	0.913	0.957	0.345	0.277	0.266	157	45	1
	BGE Large	0.522	0.913	0.957	0.273	0.193	0.186	216	100	1
	BGE M3	0.261	0.609	0.783	0.11	0.071	0.061	794	311	0.25
	E5 Large	0.348	0.696	0.87	0.285	0.164	0.137	444	267	1
	UAE Large	0.522	0.913	0.957	0.335	0.233	0.224	198	80	1
	Snowflake Arctic L	0.304	0.87	1	0.295	0.144	0.129	244	186	1
	QWEN2	0	0.087	0.348	0	0.008	0.008	2926	2469	0.009
	Stella	0.261	0.652	0.783	0.229	0.123	0.107	1393	246	1
	Mistral	0.348	0.783	0.87	0.14	0.093	0.087	399	184	0.25
Ambient assisted living	MiniLM	0.037	0.102	0.204	0.072	0.047	0.035	4373	3168	0.091
	BGE Large	0.019	0.074	0.157	0.101	0.04	0.029	5446	4397	0.111
	BGE M3	0.009	0.046	0.13	0.019	0.018	0.015	6034	5488	0.019
	E5 Large	0	0.028	0.074	0	0.007	0.008	5943	4521	0.006
	UAE Large	0.019	0.083	0.148	0.038	0.02	0.02	5619	4613	0.045
	Snowflake Arctic L	0.009	0.065	0.111	0.011	0.013	0.013	6725	5542	0.011
	QWEN2	0.009	0.037	0.056	0.012	0.01	0.01	8254	7112	0.012
	Stella	0.028	0.056	0.139	0.036	0.03	0.021	6573	4210	0.033
	Mistral	0.046	0.102	0.204	0.12	0.07	0.047	3921	2827	0.25
10-m walks	MiniLM	0.5	0.818	0.955	0.2	0.151	0.135	218	122	0.2
	BGE Large	0.409	0.864	0.955	0.186	0.123	0.114	251	140	0.5
	BGE M3	0.455	0.864	0.909	0.407	0.247	0.235	285	129	0.5
	E5 Large	0.364	0.818	1	0.309	0.179	0.152	240	133	0.25
	UAE Large	0.409	0.864	0.955	0.186	0.122	0.113	262	173	0.5
	Snowflake Arctic L	0.5	0.818	0.909	0.322	0.224	0.205	274	106	0.25
	QWEN2	0.136	0.636	0.727	0.097	0.049	0.046	874	276	0.2
	Stella	0.318	0.727	0.864	0.074	0.065	0.059	473	186	0.048
	Mistral	0.682	0.955	1	0.224	0.192	0.185	103	68	0.333

R@K Recall@K, AP@K Average Precision@K, AR average rank, MR median rank, RR reciprocal rank

Bold numbers = best choices within the table

In the Venture Capital dataset, we see a shift in the top performers, with Mistral achieving the best results for recall and rank metrics, and MiniLM leading in average precision. For the Relational Learning dataset, MiniLM dominates across most metrics, with other models also achieving near-perfect recall, approaching 100%.

In the final two datasets, Ambient Assisted Living and 10-m walks, Mistral clearly outperforms the other models, although BGE M3 stands out for its superior average precision in the 10-m walks dataset.

In Table 7, we summarize the results obtained using the few-shot prompting technique. The results for the Driver Healthcare Monitor dataset are consistent with those from the zero-shot approach, with BGE Large and BGE M3 emerging as the top performers. In the Venture Capital dataset, Mistral achieves the highest recall values, while MiniLM excels in precision at smaller cutoffs, and Snowflake Arctic L shows strong precision at larger cutoffs.

Table 7.

Few-shot results

Dataset	Model	R@100	R@500	R@1000	AP@100	AP@500	AP@1000	AR	MR	RR
Driver healthcare monitor	MiniLM	0.067	0.133	0.233	0.11	0.06	0.037	2387	1901	0.167
	BGE Large	0.1	0.433	0.7	0.193	0.069	0.052	921	592	0.5
	BGE M3	0.067	0.233	0.333	0.562	0.172	0.124	1984	1665	1
	E5 Large	0.067	0.167	0.267	0.127	0.063	0.043	2481	2524	0.143
	UAE Large	0.1	0.367	0.667	0.385	0.124	0.078	1000	712	1
	Snowflake Arctic L	0.033	0.133	0.3	0.2	0.059	0.031	2163	1414	0.2
	QWEN2	0	0.067	0.067	0	0.009	0.009	8430	9613	0.006
	Stella	0	0.267	0.333	0	0.017	0.016	2995	2262	0.008
	Mistral	0.067	0.3	0.533	0.061	0.028	0.023	1096	809	0.091
Venture capital	MiniLM	0.073	0.413	0.607	0.318	0.155	0.139	1192	736	1
	BGE Large	0.073	0.273	0.507	0.224	0.134	0.109	1634	992	0.333
	BGE M3	0.047	0.233	0.48	0.486	0.169	0.121	2021	1102	1
	E5 Large	0.08	0.227	0.413	0.442	0.209	0.144	1921	1217	1
	UAE Large	0.087	0.32	0.567	0.394	0.188	0.145	1388	883	1
	Snowflake Arctic L	0.093	0.233	0.44	0.43	0.225	0.152	1856	1212	1
	QWEN2	0.013	0.047	0.14	0.038	0.024	0.023	4675	3546	0.038
	Stella	0.08	0.28	0.413	0.116	0.106	0.095	2796	1586	0.067
	Mistral	0.107	0.42	0.66	0.147	0.148	0.136	1388	630	0.167
Relational learning	MiniLM	0.609	0.913	0.957	0.228	0.18	0.173	194	82	1
	BGE Large	0.652	0.913	0.957	0.32	0.26	0.25	168	78	1
	BGE M3	0.304	0.609	0.783	0.216	0.132	0.108	738	358	1
	E5 Large	0.348	0.696	0.87	0.301	0.179	0.149	406	214	1
	UAE Large	0.696	0.957	0.957	0.376	0.305	0.305	135	67	1
	Snowflake Arctic L	0.565	0.87	1	0.258	0.202	0.181	156	88	1
	QWEN2	0	0.087	0.174	0	0.003	0.004	3693	2801	0.002
	Stella	0.435	0.739	0.826	0.206	0.151	0.138	1174	137	1
	Mistral	0.435	0.696	0.87	0.197	0.145	0.123	361	149	0.25
Ambient assisted living	MiniLM	0.009	0.093	0.148	0.037	0.022	0.02	4819	3582	0.037
	BGE Large	0.028	0.102	0.194	0.068	0.036	0.03	4749	3666	0.111
	BGE M3	0.009	0.065	0.111	0.016	0.015	0.015	5615	4858	0.016
	E5 Large	0	0.028	0.083	0	0.009	0.009	5915	4316	0.008
	UAE Large	0.028	0.093	0.167	0.04	0.027	0.023	4966	3775	0.053
	Snowflake Arctic L	0.009	0.065	0.13	0.013	0.021	0.017	5312	3944	0.013
	QWEN2	0.009	0.037	0.056	0.013	0.013	0.011	8613	7188	0.013
	Stella	0.037	0.083	0.185	0.075	0.049	0.033	5782	3926	0.077
	Mistral	0.046	0.139	0.231	0.148	0.07	0.053	3438	2493	0.2
10-m walks	MiniLM	0.409	0.818	0.955	0.125	0.1	0.091	234	147	0.2
	BGE Large	0.455	0.864	0.955	0.156	0.108	0.101	264	185	0.5
	BGE M3	0.5	0.818	0.864	0.39	0.263	0.251	313	115	0.5
	E5 Large	0.409	0.818	1	0.155	0.109	0.095	243	146	0.25
	UAE Large	0.409	0.864	0.955	0.166	0.11	0.103	263	147	0.5
	Snowflake Arctic L	0.545	0.864	0.909	0.349	0.248	0.237	259	87	0.5
	QWEN2	0.091	0.227	0.455	0.182	0.085	0.048	1375	1181	0.333
	Stella	0.182	0.636	0.773	0.065	0.053	0.047	692	215	0.071
	Mistral	0.636	0.955	1	0.26	0.206	0.199	119	61	0.5

R@K Recall@K, AP@K Average Precision@K, AR average rank, MR median rank, RR reciprocal rank

Bold numbers = best choices within the table

UAE Large performs exceptionally well for the Relational Learning dataset, achieving near-perfect recall and the highest scores across rank and precision metrics. This dataset demonstrates strong performance across multiple models, similar to the zero-shot approach. Finally, Mistral stands out across most metrics in the last two datasets, outperforming other models by a substantial margin. The only exception is BGE M3, which excels in average precision.

Finally, we conclude our results with the performance of the models using the prompt chaining technique, as outlined in Table 8. Compared to the zero-shot and few-shot techniques, we observe greater variability in model performance, particularly in the first two datasets. Despite this variability, BGE Large, BGE M3, and UAE Large stand out as the top performers for these datasets.

Table 8.

Prompt chaining results

Dataset	Model	R@100	R@500	R@1000	AP@100	AP@500	AP@1000	AR	MR	RR
Driver healthcare monitor	MiniLM	0.033	0.1	0.133	0.011	0.011	0.009	2721	2094	0.011
	BGE Large	0.1	0.4	0.6	0.177	0.063	0.049	1068	768	0.333
	BGE M3	0.1	0.3	0.367	0.15	0.062	0.053	1720	1425	0.25
	E5 Large	0	0.1	0.2	0	0.006	0.006	3108	3146	0.008
	UAE Large	0.133	0.267	0.533	0.203	0.113	0.065	1203	960	0.5
	Snowflake Arctic L	0.033	0.1	0.233	0.067	0.03	0.017	2195	1781	0.067
	QWEN2	0.033	0.1	0.133	0.013	0.01	0.009	3602	3154	0.013
	Stella	0.033	0.3	0.433	0.022	0.023	0.021	2300	1389	0.022
	Mistral	0.067	0.3	0.533	0.153	0.055	0.039	1109	785	0.125
Venture capital	MiniLM	0.087	0.4	0.66	0.118	0.124	0.12	1045	669	0.091
	BGE Large	0.06	0.227	0.46	0.234	0.123	0.097	1613	1214	1
	BGE M3	0.08	0.273	0.473	0.331	0.169	0.13	1962	1083	1
	E5 Large	0.08	0.227	0.407	0.285	0.158	0.117	1903	1262	1
	UAE Large	0.093	0.253	0.56	0.259	0.153	0.117	1391	846	1
	Snowflake Arctic L	0.073	0.193	0.32	0.121	0.096	0.079	2078	1435	0.25
	QWEN2	0	0.02	0.067	0	0.006	0.008	6514	6190	0.003
	Stella	0.053	0.28	0.453	0.08	0.092	0.086	2461	1275	0.053
	Mistral	0.047	0.267	0.447	0.056	0.078	0.075	1922	1133	0.02
Relational learning	MiniLM	0.435	0.783	0.957	0.25	0.176	0.15	259	128	1
	BGE Large	0.522	0.913	0.913	0.262	0.187	0.187	214	92	1
	BGE M3	0.174	0.522	0.826	0.071	0.046	0.037	832	465	0.091
	E5 Large	0.304	0.696	0.913	0.168	0.099	0.082	424	335	0.2
	UAE Large	0.522	0.913	0.957	0.34	0.236	0.227	186	85	1
	Snowflake Arctic L	0.391	0.913	0.957	0.29	0.161	0.155	217	179	1
	QWEN2	0	0.043	0.087	0	0.003	0.002	6059	6285	0.003
	Stella	0.174	0.565	0.739	0.109	0.055	0.047	1468	400	0.25
	Mistral	0.348	0.652	0.826	0.107	0.078	0.068	463	258	0.077
Ambient assisted living	MiniLM	0.056	0.13	0.194	0.063	0.048	0.041	4230	3202	0.062
	BGE Large	0.056	0.12	0.259	0.062	0.049	0.037	4363	3448	0.038
	BGE M3	0.009	0.074	0.157	0.01	0.013	0.015	5531	5024	0.01
	E5 Large	0.046	0.12	0.213	0.048	0.035	0.03	4524	2689	0.053
	UAE Large	0.019	0.111	0.241	0.025	0.023	0.025	4622	3492	0.019
	Snowflake Arctic L	0.019	0.074	0.12	0.017	0.018	0.016	5831	4209	0.011
	QWEN2	0	0.009	0.019	0	0.002	0.003	9808	8927	0.002
	Stella	0.019	0.037	0.139	0.02	0.015	0.013	7119	5244	0.016
	Mistral	0.037	0.083	0.176	0.075	0.049	0.033	4618	3536	0.125
10-m walks	MiniLM	0.5	0.955	1	0.149	0.115	0.111	175	107	0.125
	BGE Large	0.364	0.864	1	0.184	0.121	0.109	220	139	0.5
	BGE M3	0.455	0.864	0.955	0.353	0.221	0.202	239	120	1
	E5 Large	0.409	0.864	1	0.22	0.142	0.126	220	120	0.25
	UAE Large	0.364	0.864	0.955	0.193	0.124	0.115	227	145	0.5
	Snowflake Arctic L	0.455	0.818	0.909	0.317	0.209	0.192	238	139	0.25
	QWEN2	0.136	0.545	0.727	0.351	0.113	0.09	786	406	1
	Stella	0.318	0.636	0.909	0.104	0.083	0.066	442	181	0.25
	Mistral	0.545	1	1	0.29	0.202	0.202	110	71	0.5

R@K Recall@K, AP@K Average Precision@K, AR average rank, MR median rank, RR reciprocal rank

Bold numbers = best choices within the table

For the Relational Learning dataset, UAE Large remains the best performer, maintaining consistency with the results from the previous techniques. In contrast, multiple models achieve top scores across different metrics for the Ambient Assisted Living dataset, but even the best-performing models show only modest results. The final dataset, 10-m walks, has results consistent with previous techniques, with Mistral and BGE M3 continuing to lead across most metrics.

Overall, we observe notable differences in performance across datasets, regardless of the prompt engineering technique or embedding model used. Interestingly, while the top-performing models vary from one dataset to another, the same models tend to lead within a given dataset regardless of the prompt technique. This suggests that certain embedding models are better aligned with the semantic and domain characteristics of specific datasets. The Relational Learning and 10-m walks datasets consistently demonstrate high performance across metrics. The selection process in these datasets includes more general articles that broadly reflect the set of semantic properties used in the query vector. In contrast, the Ambient Assisted Living dataset shows lower performance levels, likely due to its narrow focus on highly specific articles that correspond to only a small subset of the defined semantic properties. The remaining two datasets fall somewhere in between, with performance varying depending on the technique and model used.

Discussion

In this section, we will reflect on the key findings of our research, analyzing trends observed in the performance of the different prompt engineering techniques, embedding models, and vector indexes. By examining the results in detail, we aim to answer the three main questions posed in this study. Specifically, we will identify the embedding models that demonstrate the most consistent and reliable performance, explore the effectiveness of GenAI in query reformulation, and highlight the trade-offs between computational efficiency and retrieval performance when comparing vector indexes.

Massive text embeddings for document retrieval

The first focus of our investigation was to identify the best-performing massive embedding model for document retrieval, given our selection of models that vary in size and training data. Initial observations reveal a variation in top performers depending on the dataset and the prompt engineering technique applied. However, four models (BGE Large, UAE Large, MiniLM, and Mistral) consistently rank among the top performers in various scenarios. To address our research question of selecting an embedding model that performs robustly across different datasets and domains, we proceed with a more generalized evaluation using ranking-based assessment.

Our evaluation of the best-performing, domain-independent embedding model follows a multi-step ranking framework. We use dense ranking throughout, where models with identical scores share the same rank, and the next model receives the subsequent position without skipping ranks. First, we individually rank the models for each metric within each dataset. Next, to obtain a dataset-independent ranking, we aggregate the ranks for each metric across all datasets and apply another dense ranking. This provides a ranked list of embedding models per prompt engineering technique and metric, as presented in Table 9. Finally, we aggregate the ranking positions across all metrics using a simple average to establish an overall ranking, assigning equal weight to each metric. We then apply the dense ranking again, producing a final ranking for each prompt engineering technique.
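The ranking steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's implementation: the function names, the nested `results[dataset][metric][model]` layout, and the toy scores are hypothetical, and the per-metric aggregation across datasets is assumed to be a sum of ranks, since the text does not state the aggregation function.

```python
from collections import defaultdict


def dense_rank(scores, higher_is_better=True):
    """Dense ranking: tied scores share a rank and no rank is skipped."""
    ordered = sorted(set(scores.values()), reverse=higher_is_better)
    position = {value: i + 1 for i, value in enumerate(ordered)}
    return {key: position[value] for key, value in scores.items()}


def rank_models(results, higher_is_better):
    """results[dataset][metric][model] -> score (hypothetical layout)."""
    rank_sums = defaultdict(lambda: defaultdict(int))
    for metrics in results.values():
        for metric, scores in metrics.items():
            # Step 1: rank the models for each metric within each dataset.
            ranks = dense_rank(scores, higher_is_better[metric])
            for model, rank in ranks.items():
                rank_sums[metric][model] += rank
    # Step 2: aggregate ranks across datasets (assumed: sum), then rank again.
    per_metric = {
        metric: dense_rank(dict(sums), higher_is_better=False)
        for metric, sums in rank_sums.items()
    }
    # Step 3: equal-weight average over metrics, then a final dense ranking.
    models = list(next(iter(per_metric.values())))
    mean_rank = {
        model: sum(per_metric[m][model] for m in per_metric) / len(per_metric)
        for model in models
    }
    return per_metric, dense_rank(mean_rank, higher_is_better=False)


# Toy scores for illustration only; these are not values from the study.
toy = {
    "dataset_a": {
        "R@100": {"m1": 0.9, "m2": 0.9, "m3": 0.5},
        "MR": {"m1": 10, "m2": 20, "m3": 30},
    },
    "dataset_b": {
        "R@100": {"m1": 0.8, "m2": 0.7, "m3": 0.6},
        "MR": {"m1": 12, "m2": 15, "m3": 40},
    },
}
direction = {"R@100": True, "MR": False}  # for median rank, lower is better
per_metric, final = rank_models(toy, direction)
```

Note that rank-type metrics such as average rank and median rank need the opposite sort direction from recall and average precision, which is why the sketch takes an explicit direction per metric.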

Table 9.

Rankings of embedding models

Technique	Model	R@100	R@500	R@1000	AP@100	AP@500	AP@1000	AR	MR	RR	Final rank
Ground truth	UAE Large	2	1	1	4	1	1	1	2	3	1
	BGE Large	1	2	2	2	1	2	2	1	3	1
	MiniLM	3	3	3	6	3	5	3	3	2	2
	Mistral	4	4	4	5	4	3	3	4	2	3
	BGE M3	5	5	6	1	2	4	4	6	1	4
	Snowflake Arctic L	3	6	5	6	5	6	5	5	5	5
	Stella	6	8	7	3	2	4	6	8	4	6
	E5 Large	6	7	6	7	6	7	5	7	6	7
	QWEN2	7	9	8	8	7	8	7	9	7	8
Zero shot	MiniLM	2	4	4	1	1	1	1	2	5	1
	UAE Large	2	1	3	3	3	2	3	4	2	2
	BGE Large	3	2	2	4	5	4	3	3	1	3
	Mistral	1	3	1	7	4	5	2	1	4	4
	BGE M3	6	6	6	2	2	3	6	7	3	5
	Snowflake Arctic L	5	5	5	6	7	7	4	5	7	6
	E5 Large	4	8	7	5	6	6	5	7	6	7
	Stella	7	7	8	8	8	8	7	6	6	8
	QWEN2	8	9	9	9	9	9	8	8	8	9
Few shot	UAE Large	2	1	3	2	1	1	2	2	1	1
	Mistral	1	2	1	6	4	4	1	1	3	2
	BGE Large	3	3	2	4	3	3	3	3	1	3
	Snowflake Arctic L	3	5	5	3	2	2	5	5	3	4
	MiniLM	4	3	4	7	6	5	4	4	4	5
	BGE M3	6	6	7	1	3	4	7	6	2	6
	E5 Large	6	7	7	5	5	6	6	8	5	7
	Stella	5	4	6	8	7	7	8	7	6	8
	QWEN2	7	8	8	9	8	8	9	9	7	9
Prompt chaining	UAE Large	2	3	1	1	1	1	2	1	1	1
	BGE Large	3	2	1	2	2	2	1	1	1	2
	MiniLM	1	1	2	6	5	3	2	1	3	3
	Mistral	4	3	3	4	4	4	3	2	2	4
	BGE M3	5	5	4	4	3	3	6	4	2	5
	E5 Large	5	4	5	5	5	5	4	3	4	6
	Snowflake Arctic L	4	7	7	3	4	5	5	5	3	7
	Stella	6	6	6	7	6	6	7	6	5	8
	QWEN2	7	8	8	8	7	7	8	7	6	9


R@K: Recall@K; AP@K: Average Precision@K; AR: average rank; MR: median rank; RR: reciprocal rank

The results of our evaluation framework indicate that the UAE Large embedding model emerges as the top performer across all prompt engineering techniques, except in the zero-shot setting, where it ranks second behind MiniLM. BGE Large also demonstrates strong, stable performance, placing within the top three for all techniques and sharing the top spot with UAE Large in the ground truth rankings. MiniLM and Mistral follow as solid contenders. QWEN2 ranks last in every setting, and Stella occupies the second-lowest position for all three prompting techniques, placing just above E5 Large only in the ground truth rankings. We observe that the highest-performing models employ fine-tuning strategies specifically targeting tasks such as information retrieval, document ranking, or semantic similarity. On the other hand, the lower-ranked models, despite having significantly more parameters, employ fine-tuning strategies for broader or unrelated tasks, which likely explains their underperformance.

In a similar manner, Table 10 presents the average rankings of the embedding models, calculated according to the approach proposed by [67]. We prioritize the Recall@500 metric for this analysis because high recall is essential in literature reviews where the objective is to retrieve as many relevant documents as possible for assessment. The cutoff at 500 reflects a practical limit, as reviewing the top 500 documents is typically a manageable and effective scope in academic settings. The results reinforce the findings from the previous analysis, with UAE Large and BGE Large taking the top 2 positions.

Table 10.

Average rankings of embedding models (Friedman) for Recall@500

Embedding model	Ranking
UAE Large	2.333
BGE Large	2.567
Mistral	3.400
MiniLM	3.933
Snowflake Arctic L	5.600
BGE M3	5.633
Stella	6.367
E5 Large	6.733
QWEN2	8.433


Bold numbers = best choices within the table
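An average ranking of the Friedman type can be sketched as follows. This is an illustration under stated assumptions, not the study's code: each block is taken to be one hypothetical evaluation setting scored with Recall@500, tied models receive the mean of the positions they span, and the function names and toy scores are invented for the example.

```python
def average_ranks(scores, higher_is_better=True):
    """Rank one block; tied models get the mean of the positions they span."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the end of the run of models tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        tied_rank = ((i + 1) + (j + 1)) / 2
        for model in ordered[i:j + 1]:
            ranks[model] = tied_rank
        i = j + 1
    return ranks


def friedman_average_ranking(blocks, higher_is_better=True):
    """Average each model's within-block rank over all blocks (lower is better)."""
    totals = {}
    for scores in blocks:
        for model, rank in average_ranks(scores, higher_is_better).items():
            totals[model] = totals.get(model, 0.0) + rank
    return {model: total / len(blocks) for model, total in totals.items()}


# Toy Recall@500 values for two hypothetical blocks; not data from the study.
blocks = [
    {"a": 0.9, "b": 0.8, "c": 0.8},
    {"a": 0.7, "b": 0.9, "c": 0.5},
]
ranking = friedman_average_ranking(blocks)
```

A useful sanity check is that the average ranks of n models always sum to n(n+1)/2; the nine values in Table 10 sum to approximately 45, as expected.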

When considering additional parameters such as time and memory efficiency, we observe that UAE Large and BGE Large require 5 to 6 times more time than MiniLM for the initial embedding of documents. Additionally, with vector dimensions approximately 2.7 times larger than those of MiniLM, the indexes created from UAE Large and BGE Large embeddings are also 2.7 times larger, potentially impacting query times. UAE Large and BGE Large do not differ meaningfully in this respect, as both models exhibit comparable embedding generation times and index sizes.
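The relationship between vector dimension and index size can be illustrated with a short calculation. The sketch assumes an uncompressed index that stores each component as a 32-bit float, and it assumes output dimensions of 384 for MiniLM and 1024 for UAE Large and BGE Large; these dimensions are the publicly documented sizes of common checkpoints and may not match the exact variants used in the study. The function name is hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def flat_index_bytes(num_documents, dimension):
    """Approximate size of an uncompressed index: one float32 per component."""
    return num_documents * dimension * BYTES_PER_FLOAT32


# Assumed embedding dimensions (see the lead-in); not stated in this section.
MINILM_DIM = 384
LARGE_DIM = 1024  # assumed for both UAE Large and BGE Large

num_documents = 25_000  # the larger collection size mentioned in this section
small = flat_index_bytes(num_documents, MINILM_DIM)
large = flat_index_bytes(num_documents, LARGE_DIM)
ratio = large / small

print(f"MiniLM index: {small / 1e6:.1f} MB")
print(f"Large-model index: {large / 1e6:.1f} MB")
print(f"Size ratio: {ratio:.2f}")
```

Under these assumptions the ratio depends only on the dimensions, not on the number of documents, which is consistent with the factor of roughly 2.7 reported above.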

Our findings indicate that information-retrieval-specific embedding models generally outperform general-purpose models for automated document retrieval in literature reviews. Among the evaluated models, UAE Large stands out as the best overall choice for this purpose. However, MiniLM offers a practical alternative due to its efficiency in scenarios where time or memory constraints are critical.

Human versus GenAI in query reformulation for document retrieval

The second goal of our study was to compare the effectiveness of human experts and generative AI in transforming a researcher’s initial idea into a well-defined query for automated document retrieval in literature reviews. Our results show that GenAI-generated queries consistently outperformed those formulated by human experts across all evaluation metrics for the Driver Healthcare Monitoring and 10-m walks datasets. The differences were minimal for the Venture Capital and Relational Learning datasets, with GenAI showing slight improvements or performing comparably to the human queries. However, for the Ambient Assisted Living dataset, the AI-generated queries underperformed the human-formulated ones. We attribute this to the considerably more refined human query for this dataset: the expert researchers iterated over the process of defining semantic properties more than 10 times to align them closely with the requirements, whereas the other datasets involved only one or a few refinement iterations. We proceed with a more in-depth statistical and numerical analysis to gain a deeper understanding of these variations.

We conducted a dataset-specific pairwise comparison between the techniques, including the ground truth, to gain further insights. For each pair of techniques, we took the result obtained by a specific embedding model on a given metric for Technique A and compared it to the result for the same model and metric for Technique B. If Technique A outperformed Technique B, we incremented the count for Technique A, and vice versa. These comparative results are presented in Table 11.

Table 11.

Prompt engineering technique pairwise comparison

Dataset	GT (vs ZS)	ZS (vs GT)	GT (vs FS)	FS (vs GT)	GT (vs PC)	PC (vs GT)	ZS (vs FS)	FS (vs ZS)	ZS (vs PC)	PC (vs ZS)	FS (vs PC)	PC (vs FS)
Driver healthcare monitoring	10	66	13	64	19	54	38	34	49	25	47	28
Venture capital	30	46	12	60	45	34	22	56	46	32	56	18
Relational learning	21	53	16	60	32	46	16	42	51	16	64	6
Ambient assisted living	68	9	65	14	73	5	18	50	37	39	43	36
10-m walks	0	81	0	81	0	81	39	25	26	38	18	51
Total	129	255	106	279	169	220	133	207	209	150	228	139


GT: ground truth; ZS: zero-shot; FS: few-shot; PC: prompt chaining

Bold numbers = best choices within the table
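The pairwise counting procedure behind Table 11 can be sketched as follows for a single dataset. The `results[technique][model][metric]` layout, the function name, and the toy scores are hypothetical, and ties are assumed to count for neither technique. With nine models and nine metrics each pair involves 81 comparisons, which matches the 0 versus 81 split reported for the 10-m walks dataset; pairs whose counts sum to fewer than 81 would then correspond to ties under this assumption.

```python
from itertools import combinations


def pairwise_wins(results, higher_is_better):
    """Count, for each pair of techniques, how often each one wins.

    results[technique][model][metric] -> score for one dataset (hypothetical layout).
    """
    wins = {}
    for tech_a, tech_b in combinations(results, 2):
        count_a = count_b = 0
        for model in results[tech_a]:
            for metric, better_high in higher_is_better.items():
                score_a = results[tech_a][model][metric]
                score_b = results[tech_b][model][metric]
                if score_a == score_b:
                    continue  # assumption: a tie increments neither count
                # A wins if its score is higher on a higher-is-better metric,
                # or lower on a lower-is-better metric such as median rank.
                if (score_a > score_b) == better_high:
                    count_a += 1
                else:
                    count_b += 1
        wins[(tech_a, tech_b)] = (count_a, count_b)
    return wins


# Toy scores for two techniques and two models; not data from the study.
toy = {
    "GT": {"m1": {"R@100": 0.5, "MR": 40}, "m2": {"R@100": 0.6, "MR": 30}},
    "ZS": {"m1": {"R@100": 0.7, "MR": 40}, "m2": {"R@100": 0.5, "MR": 20}},
}
direction = {"R@100": True, "MR": False}
counts = pairwise_wins(toy, direction)
```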

Our findings confirm that both zero-shot and few-shot techniques outperform the ground truth on four of the five datasets, the exception being the Ambient Assisted Living dataset. When directly comparing zero-shot and few-shot techniques, we observe no statistically significant difference, although few-shot generally shows a slight advantage. The key observation, however, is that both techniques outperform the ground truth in aggregate.

Having identified UAE Large as the top-performing embedding model, we used its results to compare the ground truth with the three prompt engineering techniques. We then averaged the results for each metric across all datasets, yielding a single value per metric for each prompt engineering technique and allowing a more general comparison that is not dataset-specific. Using the metrics that provide results between 0 and 1, we constructed radar charts to visualize the performance of each technique, with each metric represented as a variable, as shown in Fig. 6.

Fig. 6. Prompt engineering technique radar chart

From the figure, it is evident that the polygon generated for the few-shot prompt engineering technique is larger than the others, indicating better overall performance using the UAE Large model. This observation is further supported by the area each polygon covers: 0.39 for few-shot, 0.31 for prompt chaining, 0.30 for zero-shot, and 0.23 for the ground truth.
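One common way to compute the area of a radar-chart polygon is to sum the triangles formed by adjacent axes. The sketch below assumes equally spaced axes and values already scaled to the range 0 to 1; the function name is hypothetical, and the exact procedure used to obtain the reported areas is not described in the text. For n axes with values r_1, ..., r_n, the area is (1/2) sin(2π/n) Σ r_i r_{i+1}, with indices taken cyclically.

```python
import math


def radar_polygon_area(values):
    """Area of a radar-chart polygon with equally spaced axes.

    Each pair of adjacent axes spans a triangle with sides r_i and r_(i+1)
    and enclosed angle 2*pi/n, so its area is 0.5 * r_i * r_(i+1) * sin(2*pi/n).
    """
    n = len(values)
    if n < 3:
        raise ValueError("a polygon needs at least three axes")
    angle = 2 * math.pi / n
    cyclic_products = sum(values[i] * values[(i + 1) % n] for i in range(n))
    return 0.5 * math.sin(angle) * cyclic_products


# Sanity checks on regular shapes (unit values on every axis).
square = radar_polygon_area([1.0, 1.0, 1.0, 1.0])   # square with diagonal 2
hexagon = radar_polygon_area([1.0] * 6)             # regular hexagon, side 1
```

Because the area depends on the product of neighbouring values, it changes with the order in which metrics are placed around the chart, so areas are only comparable between techniques when the same axis order is used.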

In summary, regarding the choice between human expertise and GenAI for query reformulation, our findings suggest that GenAI is generally the preferable option due to its superior performance and the ability to reduce manual labor and time investment. However, in cases where human experts are willing and able to invest substantial time in multiple iterations and careful tuning of semantic criteria, they may outperform GenAI, as observed in the Ambient Assisted Living dataset. Therefore, as a recommended approach, GenAI could be used for the initial query formulation, with human experts potentially involved in a subsequent evaluation phase, thus minimizing their effort and time commitment.

Vector indexes trade-offs

In the final part of our research, we examine the trade-offs between speed, memory efficiency, and retrieval accuracy in different vector indexing techniques. Although the flat index serves as our baseline and performs well within the scope of our current datasets, scalability considerations become essential as the datasets grow. We recognize the potential need for faster response times for larger datasets, initial domain-specific collections of documents, or a high volume of queries to enhance retrieval robustness. In such scenarios, exploring alternative indexing strategies could offer substantial gains.

When comparing the IVF index to the Flat index, we begin by examining memory efficiency. Both indexes are similar in size across all datasets, with the IVF index requiring only minimal additional memory to store centroid metadata used in the query phase. For the time parameter, two aspects are considered: initial build and training time, and response time. Both indexes complete the initial vector build and training in less than a second, making this factor negligible. However, in terms of query response time, the IVF index performs significantly faster, returning the top 100 closest documents around 18 times faster than the Flat index, and the top 1000 documents about 9 times faster. While these differences are small in single-query scenarios because we measure in milliseconds, they become impactful when handling large query volumes. For instance, executing 1000 queries would take approximately 18 s with the Flat index, compared to just 1–2 s with the IVF index. Lastly, regarding retrieval accuracy, the IVF index performs comparably to the Flat index, with only a slight decrease overall, and in some cases, the IVF index even shows better results.
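The difference between exhaustive (flat) search and inverted-file (IVF) search can be illustrated with a toy pure-Python version. This is a conceptual sketch only, not the indexing library used in the study: the function names, the number of clusters, and the random vectors are hypothetical, and the clustering uses a few plain k-means iterations.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors, query, k):
    """Exhaustive search: compare the query with every stored vector."""
    order = sorted(range(len(vectors)), key=lambda i: (sq_dist(vectors[i], query), i))
    return order[:k]


def assign(vectors, centroids):
    """Place each vector id in the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for i, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        buckets[nearest].append(i)
    return buckets


def build_ivf(vectors, nlist, iterations=5, seed=0):
    """Train nlist centroids with a few k-means steps, then bucket the vectors."""
    rng = random.Random(seed)
    centroids = [list(vec) for vec in rng.sample(vectors, nlist)]
    dim = len(vectors[0])
    for _ in range(iterations):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    return centroids, assign(vectors, centroids)


def ivf_search(vectors, centroids, buckets, query, k, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in probe[:nprobe] for i in buckets[c]]
    candidates.sort(key=lambda i: (sq_dist(vectors[i], query), i))
    return candidates[:k]


# Random toy vectors; not data from the study.
rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
query = [rng.gauss(0, 1) for _ in range(8)]
centroids, buckets = build_ivf(vectors, nlist=8)

exact = flat_search(vectors, query, k=10)
all_lists = ivf_search(vectors, centroids, buckets, query, k=10, nprobe=8)
two_lists = ivf_search(vectors, centroids, buckets, query, k=10, nprobe=2)
```

Probing every bucket reproduces the flat result exactly, while probing fewer buckets scans only a fraction of the vectors and may miss some true neighbours. This is the mechanism behind the speed gain and the slight accuracy loss described above.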

The primary advantage of the PQ index lies in memory compression, consuming roughly 20 times less memory than the Flat index across all datasets. However, this memory efficiency comes at the expense of build and training time, where PQ takes approximately 20 times longer than the Flat index, with figures around 10 s for 12,000 documents and up to 20 s for 25,000 documents. Despite the initial requirements, PQ compensates with faster query response times, performing around 9 times faster in retrieval speed. This advantage becomes meaningful in high-query scenarios where, with 1000 or more queries, PQ’s reduced response time offsets its slower initial build. As for retrieval accuracy, PQ performs similarly to the IVF index, maintaining results comparable to the Flat index but with a slight reduction in overall performance. Comprehensive results for both IVF and PQ indexes are provided in the supplementary materials.
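The point at which the PQ index's faster queries offset its slower build can be estimated with a short calculation. The sketch uses approximate figures from this section: about 18 ms per query for the Flat index (18 s per 1000 queries), a speed-up of about 9 times for PQ, and a PQ build time of about 10 s or 20 s. It treats the whole PQ build time as extra cost, ignoring the sub-second Flat build; the function name is hypothetical.

```python
import math


def break_even_queries(extra_build_ms, flat_query_ms, speedup):
    """Number of queries after which the faster index has repaid its build cost."""
    fast_query_ms = flat_query_ms / speedup
    saving_per_query_ms = flat_query_ms - fast_query_ms
    return math.ceil(extra_build_ms / saving_per_query_ms)


# Approximate figures taken from the text; the break-even values are estimates.
small_collection = break_even_queries(10_000, flat_query_ms=18, speedup=9)
large_collection = break_even_queries(20_000, flat_query_ms=18, speedup=9)

print(f"~12,000 documents: break-even after about {small_collection} queries")
print(f"~25,000 documents: break-even after about {large_collection} queries")
```

Under these assumptions the break-even point falls between several hundred and somewhat over a thousand queries, which is in line with the observation that the advantage becomes meaningful at 1000 or more queries.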

Our findings suggest that the Flat index adequately supports the current requirements, making it a suitable and sufficient choice for this use case. However, we recognize that if future requirements demand greater scalability, whether due to an increase in the number of documents or queries, the IVF and PQ indexes provide viable alternatives that offer faster response times and comparable retrieval performance.

Limitations

In this study, embeddings and similarity search were computed using only paper titles and abstracts rather than full-text articles. This is a limitation because studies with short or less informative abstracts may be ranked lower even when the full text is relevant, while studies whose abstracts mention many key terms may be ranked higher even if the full text is not closely aligned. As a result, the reported performance may not fully generalize to full-text screening or later stages of evidence synthesis such as data extraction. We made this design choice because titles and abstracts are broadly accessible, whereas full texts are often unavailable due to paywalls or other access restrictions, reflecting a common constraint in human review processes.

LLM-based query reformulation also introduces reproducibility limitations. Even when using the same prompts and the same model, generated outputs can vary across runs due to stochastic decoding and implementation details. Consequently, reproducing the query reformulation step may not produce exactly the same performance numbers. To improve transparency and support replication and reproduction, we report the full prompts used in the study and provide the complete LLM conversation logs for the query reformulation step in our code and data repository.

Conclusion

In this study, we investigated the potential of generative AI in query reformulation, evaluated the effectiveness of massive text embedding models in enhancing automated document retrieval for literature reviews, and examined the trade-offs between speed, memory, and retrieval accuracy offered by different vector indexes. All experiments were conducted on real-world datasets curated through our NLP toolkit that supports the paper selection phase in literature reviews.

Our findings indicate that GenAI-generated queries, specifically those produced with zero-shot and few-shot prompting, generally outperform the queries designed by humans. This suggests that GenAI can reduce the manual labor involved in query formulation, providing researchers with efficient, high-quality assistance in formulating suitable queries.

We also observed that embedding models designed or fine-tuned specifically for information retrieval consistently outperformed general-purpose models. UAE Large emerged as the most reliable performer across diverse datasets spanning healthcare, finance, social care, and machine learning domains.

Finally, our evaluation of vector indexes shows that the flat index is a sufficient choice for storing and querying documents in the current context of automated literature reviews. However, the IVF and PQ indexes present viable alternatives for scalability if the dataset size or query volume increases significantly.

Acknowledgements

Not applicable.

Authors' contributions

Conceptualization: GM, VT, and EZ; data curation: GM, BS, and EZ; funding acquisition: MK and EZ; investigation: GM; methodology: GM, BRS, and PL; project administration: PL, MK, and EZ; resources: BRS and LB; software: GM and EZ; supervision: VT and EZ; validation: BS, LB, MK, and EZ; visualization: GM and BS; writing—original draft: GM and BS; writing—review and editing: GM, PL, VT, BRS, LB, and EZ. All authors have read and agreed to the published version of the manuscript.

Funding

Open access funding provided by TU Wien (TUW).

• All authors affiliated with TU Wien acknowledge that this work was partially funded by the Computer Vision Lab at TU Wien in Vienna, Austria.

• All authors affiliated with Magix.AI acknowledge that this work was partially funded by Magix.AI DOOEL Skopje, North Macedonia.

• All authors affiliated with FCSE acknowledge that this work was partially funded by the Faculty of Computer Science and Engineering at the Ss. Cyril and Methodius University in Skopje, North Macedonia.

Data availability

The data and the code we utilize in this study are openly available on https://gitlab.com/mitrovg/article-analysis-study/-/tree/main.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Hambarde KA, Proença H. Information retrieval: recent advances and beyond. IEEE Access. 2023;11:76581–604. 10.1109/ACCESS.2023.3295776.
2. Kitchenham B, Charters S, et al. Guidelines for performing systematic literature reviews in software engineering. Durham: University of Durham; 2007.
3. Bornmann L, Haunschild R, Mutz R. Growth rates of modern science: a latent piecewise growth curve approach to model publication numbers from established and new literature databases. Humanit Soc Sci Commun. 2021. 10.1057/s41599-021-00903-w.
4. Sutton A, Clowes M, Preston L, Booth A. Meeting the review family: exploring review types and associated information retrieval requirements. Health Inf Libr J. 2019;36(3):202–22. 10.1111/hir.12276.
5. Carver JC, Hassler E, Hernandes E, et al. Identifying barriers to the systematic literature review process. In: 2013 ACM / IEEE International Symposium on Empirical Software Engineering and Measurement. Baltimore; 2013. pp. 203–212. 10.1109/ESEM.2013.28.
6. Hadi MU, Tashi QA, Shah A, Qureshi R, Muneer A, Irfan M, et al. Large Language Models: A Comprehensive Survey of its Applications, Challenges, Limitations, and Future Prospects. 2024. 10.36227/techrxiv.23589741.v6
7. Raiaan MAK, Mukta MSH, Fatema K, Fahad NM, Sakib S, Mim MMJ, et al. A review on Large Language Models: architectures, applications, taxonomies. Open issues and challenges. IEEE Access. 2024;12:26839–74. 10.1109/ACCESS.2024.3365742.
8. Zhu Y, Yuan H, Wang S, Liu J, Liu W, Deng C, et al. Large Language Models for Information Retrieval: A Survey. 2024. arXiv:2308.07107.
9. Zhai C. Large Language Models and Future of Information Retrieval: Opportunities and Challenges. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’24. New York: Association for Computing Machinery; 2024. pp. 481–490. 10.1145/3626772.3657848.
10. Cierco Jimenez R, Lee T, Rosillo N, Cordova R, Cree IA, Gonzalez A, et al. Machine learning computational tools to assist the performance of systematic reviews: a mapping review. BMC Med Res Methodol. 2022. 10.1186/s12874-022-01805-4.
11. Khalil H, Pollock D, McInerney P, Evans C, Moraes EB, Godfrey CM, et al. Automation tools to support undertaking scoping reviews. Res Synth Methods. 2024. 10.1002/jrsm.1731.
12. Qureshi R, Shaughnessy D, Gill K, Robinson K, Li T, Agai E. Are ChatGPT and large language models “the answer” to bringing us closer to systematic review automation? Syst Rev. 2023;12:72. 10.1186/s13643-023-02243-z.
13. Zdravevski E, Lameski P, Trajkovik V, Chorbev I, Goleva R, Pombo N, et al. In: Ganchev I, Garcia NM, Dobre C, Mavromoustakis CX, Goleva R, editors. Automation in Systematic, Scoping and Rapid Reviews by an NLP Toolkit: A Case Study in Enhanced Living Environments. Cham: Springer International Publishing; 2019. pp. 1–18. 10.1007/978-3-030-10752-9_1.
14. Sanderson M, Croft WB. The history of information retrieval research. Proc IEEE. 2012;100(Special Centennial Issue):1444–51. 10.1109/JPROC.2012.2189916.
15. Li X, Jin J, Zhou Y, Zhang Y, Zhang P, Zhu Y, et al. From Matching to Generation: A Survey on Generative Information Retrieval. 2024. arXiv:2404.14851.
16. Zhao WX, Liu J, Ren R, Wen JR. Dense text retrieval based on pretrained language models: a survey. ACM Trans Inf Syst. 2024. 10.1145/3637870.
17. Ai Q, Bai T, Cao Z, Chang Y, Chen J, Chen Z, et al. Information retrieval meets Large Language Models: a strategic report from Chinese IR community. AI Open. 2023;4:80–90. 10.1016/j.aiopen.2023.08.001.
18. Chen B, Zhang Z, Langrené N, Zhu S. Unleashing the potential of prompt engineering in Large Language Models: a comprehensive review. 2024. arXiv:2310.14735.
19. Azad HK, Deepak A. Query expansion techniques for information retrieval: a survey. Inf Process Manage. 2019;56(5):1698–735. 10.1016/j.ipm.2019.05.009.
20. Jagerman R, Zhuang H, Qin Z, Wang X, Bendersky M. Query Expansion by Prompting Large Language Models. 2023. arXiv:2305.03653.
21. Jing Z, Su Y, Han Y, Yuan B, Xu H, Liu C, et al. When Large Language Models Meet Vector Databases: A Survey. 2024. arXiv:2402.01763.
22. Aumüller M, Bernhardsson E, Faithfull A. ANN-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. Inf Syst. 2020;87:101374. 10.1016/j.is.2019.02.006.
23. van Dinter R, Tekinerdogan B, Catal C. Automation of systematic literature reviews: a systematic literature review. Inf Softw Technol. 2021;136:106589. 10.1016/j.infsof.2021.106589.
24. Sundaram G, Berleant D. Automating Systematic Literature Reviews with Natural Language Processing and Text Mining: a Systematic Literature Review. 2022. arXiv:2211.15397.
25. Zala K, Acharya B, Mashru M, Palaniappan D, Gerogiannis VC, Kanavos A, et al. Transformative Automation: AI in Scientific Literature Reviews. International Journal of Advanced Computer Science and Applications. 2024;15(1). 10.14569/IJACSA.2024.01501122
26. Mitrov G, Stanoev B, Gievska S, Mirceva G, Zdravevski E. Combining semantic matching, word embeddings, transformers, and LLMs for enhanced document ranking: application in systematic reviews. Big Data Cogn Comput. 2024. 10.3390/bdcc8090110.
27. van Dijk SHB, Brusse-Keizer MGJ, Bucsán CC, van der Palen J, Doggen CJM, Lenferink A. Artificial intelligence in systematic reviews: promising when appropriately used. BMJ Open. 2023. 10.1136/bmjopen-2023-072254.
28. Dennstädt F, Zink J, Putora PM, Hastings J, Cihoric N. Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain. Syst Rev. 2024. 10.1186/s13643-024-02575-4.
29. Wilkins D. Automated title and abstract screening for scoping reviews using the GPT-4 Large Language Model. 2023. arXiv:2311.07918.
30. Guo E, Gupta M, Deng J, Park YJ, Paget M, Naugler C. Automated paper screening for clinical reviews using large language models: data analysis study. J Med Internet Res. 2024;26:e48996. 10.2196/48996.
31. Almeida F, Xexéo G. Word Embeddings: A Survey. 2023. arXiv:1901.09069.
32. Pak A, Ziyaden A, Saparov T, Akhmetov I, Gelbukh A. Word embeddings: a comprehensive survey. Computación y Sistemas. 2024;28(4):2005–29.
33. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. 2023. arXiv:1706.03762.
34. Taipalus T. Vector database management systems: fundamental concepts, use-cases, and current challenges. Cogn Syst Res. 2024;85:101216. 10.1016/j.cogsys.2024.101216.
35. Han Y, Liu C, Wang P. A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge. 2023. arXiv:2310.11703.
36. Andoni A, Indyk P, Razenshteyn I. Approximate Nearest Neighbor Search in High Dimensions. 2018. arXiv:1806.09823.
37. Aoyama K, Saito K, Ikeda T. Inverted-File k-Means Clustering: Performance Analysis. 2020. arXiv:2002.09094.
38. Jégou H, Douze M, Schmid C. Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell. 2011;33(1):117–28. 10.1109/TPAMI.2010.57.
39. Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language Models are Unsupervised Multitask Learners. 2019. https://api.semanticscholar.org/CorpusID:160025533. Accessed 19 Feb 2026.
40. OpenAI. GPT-4 Technical Report. 2024. arXiv:2303.08774.
41. Gemini T. Gemini: A Family of Highly Capable Multimodal Models. 2024. arXiv:2312.11805.
42. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al. LLaMA: Open and Efficient Foundation Language Models. 2023. arXiv:2302.13971.
43. Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. 2024. https://api.semanticscholar.org/CorpusID:270640496.
44. Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Chaplot DS, de las Casas D, et al. Mistral 7B. 2023. arXiv:2310.06825.
45. Gemma T. Gemma 2: Improving open language models at a practical size. 2024. arXiv:2408.00118.
46. Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, et al. A Survey of Large Language Models. 2024. arXiv:2303.18223.
47. Feuerriegel S, Hartmann J, Janiesch C, Zschech P. Generative AI. Business & Information Systems Engineering. 2023;66(1):111–26. 10.1007/s12599-023-00834-7.
48. Schulhoff S, Ilie M, Balepur N, Kahadze K, Liu A, Si C, et al. The Prompt Report: A Systematic Survey of Prompting Techniques. 2024. arXiv:2406.06608.
49. Sahoo P, Singh AK, Saha S, Jain V, Mondal S, Chadha A. A systematic survey of prompt engineering in large language models: techniques and applications. 2024. arXiv:2402.07927.
50. Li Y. A Practical Survey on Zero-shot Prompt Design for In-context Learning. In: Proceedings of the Conference Recent Advances in Natural Language Processing - Large Language Models for Natural Language Processings. RANLP. INCOMA Ltd., Shoumen; 2023. pp. 641–647. 10.26615/978-954-452-092-2_069.
51. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language Models are Few-Shot Learners. 2020. arXiv:2005.14165.
52. Stanoev B, Mitrov G, Kulakov A, Mirceva G, Lameski P, Zdravevski E. Automating feature extraction from entity-relation models: experimental evaluation of machine learning methods for relational learning. Big Data Cogn Comput. 2024. 10.3390/bdcc8040039.
53. Jovanovic M, Mitrov G, Zdravevski E, Lameski P, Colantonio S, Kampel M, et al. Ambient assisted living: scoping review of artificial intelligence models, domains, technology, and concerns. J Med Internet Res. 2022;24(11):e36553. 10.2196/36553.
54. Gabriel CL, Pires IM, Coelho PJ, Zdravevski E, Lameski P, Mewada H, et al. Mobile and wearable technologies for the analysis of Ten Meter Walk Test: a concise systematic review. Heliyon. 2023;9(6):e16599. 10.1016/j.heliyon.2023.e16599.
55. Muennighoff N, Tazi N, Magne L, Reimers N. MTEB: Massive Text Embedding Benchmark. In: Vlachos A, Augenstein I, editors. Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Dubrovnik, Croatia: Association for Computational Linguistics; 2023. pp. 2014–2037. 10.18653/v1/2023.eacl-main.148
56. Wang W, Wei F, Dong L, Bao H, Yang N, Zhou M. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. 2020. arXiv:2002.10957.
57. Wang W, Bao H, Huang S, Dong L, Wei F. MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers. 2020. arXiv:2012.15828.
58. Xiao S, Liu Z, Zhang P, Muennighoff N, Lian D, Nie JY. C-Pack: Packed Resources For General Chinese Embeddings. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’24. New York: Association for Computing Machinery; 2024. pp. 641–649. 10.1145/3626772.3657878.
59. Chen J, Xiao S, Zhang P, Luo K, Lian D, Liu Z. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In: Ku LW, Martins A, Srikumar V, editors. Findings of the Association for Computational Linguistics ACL 2024. Bangkok and virtual meeting: Association for Computational Linguistics; 2024. pp. 2318–2335. 10.18653/v1/2024.findings-acl.137
60. Wang L, Yang N, Huang X, Jiao B, Yang L, Jiang D, et al. Text Embeddings by Weakly-Supervised Contrastive Pre-training. 2022. arXiv:2212.03533.
61. Li X, Li J. AnglE-optimized Text Embeddings. 2023. arXiv:2309.12871.
62. Merrick L, Xu D, Nuti G, Campos D. Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models. 2024. arXiv:2405.05374.
63. Li Z, Zhang X, Zhang Y, Long D, Xie P, Zhang M. Towards general text embeddings with multi-stage contrastive learning. 2023. arXiv:2308.03281.
64. Kusupati A, Bhatt G, Rege A, Wallingford M, Sinha A, Ramanujan V, et al. Matryoshka Representation Learning. 2024. arXiv:2205.13147.
65. Wang L, Yang N, Huang X, Yang L, Majumder R, Wei F. Improving text embeddings with large language models. 2023. arXiv:2401.00368.
66. Douze M, Guzhva A, Deng C, Johnson J, Szilvasy G, Mazaré PE, et al. The Faiss library. 2024. arXiv:2401.08281.
67. García S, Fernández A, Luengo J, Herrera F. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci. 2010;180(10):2044–64. 10.1016/j.ins.2009.12.010.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The data and the code we utilize in this study are openly available on https://gitlab.com/mitrovg/article-analysis-study/-/tree/main.
