Author manuscript; available in PMC: 2025 Jul 3.
Published in final edited form as: Proc Conf Empir Methods Nat Lang Process. 2024 Nov;2024:22525–22545. doi: 10.18653/v1/2024.emnlp-main.1255

Improving Minimum Bayes Risk Decoding with Multi-Prompt

David Heineman 1, Yao Dou 1, Wei Xu 1
PMCID: PMC12226151  NIHMSID: NIHMS2092954  PMID: 40612446

Abstract

While instruction fine-tuned LLMs are effective text generators, sensitivity to prompt construction makes performance unstable and sub-optimal in practice. Relying on a single ‘best’ prompt cannot capture all differing approaches to a generation problem. Using this observation, we propose multi-prompt decoding, where many candidate generations are decoded from a prompt bank at inference-time. To ensemble candidates, we use Minimum Bayes Risk (MBR) decoding, which selects a final output using a trained value metric. We show multi-prompt improves MBR across a comprehensive set of conditional generation tasks (Figure 1), and show this is a result of estimating a more diverse and higher quality candidate space than that of a single prompt. Further experiments confirm multi-prompt improves generation across tasks, models and metrics.

1. Introduction

Minimum Bayes Risk (MBR) decoding (Bickel and Doksum, 1977) improves the generation quality of large language models (LLMs) over standard, single-output decoding methods, such as beam search and sampling. MBR generates a set of candidates and selects the one with the highest expected utility, using all other hypotheses as references (see Fig. 2, left), following a simple intuition that a desirable output should be highly probable and consistent with others. MBR has been applied across a variety of NLP generation tasks (Amrhein and Sennrich, 2022; Shi et al., 2022; Suzgun et al., 2023; Jain et al., 2023). In particular, self-consistency (Wang et al., 2023), a special case of MBR, has become widely used to improve LLM reasoning capabilities by ensembling reasoning paths.

Figure 2:

Multi-prompt MBR generates candidates using a human- or model-written prompt bank and selects the highest pairwise score with a trained value metric.

A central question in improving the generation quality of MBR decoding is how to balance diversity and adequacy within the candidate set. Prior work has found success using sampling-based decoding to generate diverse hypotheses (Eikema and Aziz, 2020; Freitag et al., 2022a, 2023a). However, naively increasing the sampling temperature eventually degrades the quality of the candidates. Recently, instruction fine-tuned LLMs (Ouyang et al., 2022; Chung et al., 2022) have opened up the possibility of writing prompts in various formats to elicit higher diversity generations. As these models are observed to be sensitive to prompt design, a slight change in phrasing or the inclusion of a more relevant example can significantly impact model behavior (Srivastava et al., 2023; White et al., 2023).

Taking advantage of the prompt sensitivity of LLMs, we introduce multi-prompt MBR decoding, which samples candidates using a bank of human- or model-written prompts (see Figure 2, right). Intuitively, exploring a variety of prompts enables the generation of diverse, high quality hypotheses that provide a closer representation of the true output distribution. By guiding the model towards different regions of the output space, each prompt captures unique sequences that are coherent and relevant to the specific input example.

We experiment with three distinct generation tasks: text simplification (Maddela et al., 2023), machine translation (Kocmi et al., 2022), and code generation (Chen et al., 2021). Each task assesses the impact of a different prompt component on multi-prompt MBR: instance-level prompts for code, task descriptions for simplification, and in-context examples for translation. To account for the relative quality between prompts, we develop two strategies for selecting prompts that outperform a baseline random choice: sampling prompts from a large prompt bank based on their usage on an unlabeled set of task data, and selecting prompts using embedding-based heuristics without any examples.

We evaluate multi-prompt MBR on a broad range of LLMs including open-source models such as Llama 2 (Touvron et al., 2023) and state-of-the-art closed-source models such as GPT-4 (Achiam et al., 2023). Our results show multi-prompt MBR consistently improves single-prompt MBR across all three tasks and model scales, with gains of up to 7% on HumanEval (Chen et al., 2021) and 5 points of LENS score on SimpEval (Maddela et al., 2023). Figure 1 displays results for models at the 7B scale. Finally, we study the dynamics between different utility and evaluation metrics, revealing that multi-prompt MBR with one metric improves performance universally across metrics.

Figure 1:

Multi-prompt and single prompt MBR results for code generation on HumanEval, text simplification on SimpEval, and translation on WMT ’22 En-Cs generated with open-source 7B LLMs (details in §4).

2. Preliminaries

Instruction fine-tuned LLMs are trained to follow arbitrary natural language task descriptions (Wei et al., 2022a). Given an input $x$ and prompt $\rho$, an autoregressive language model $\pi_\theta$ parameterized by $\theta$ estimates an output sequence $y \sim \pi_\theta(x, \rho)$ using a decoding algorithm that samples each next token conditioned on the input, $\pi_\theta(y_i \mid y_{<i}, x, \rho)$. The decoding algorithm aims to generate $y$ by maximizing the sequence likelihood over the language model distribution $\pi_\theta(y \mid x, \rho) = \prod_{i=1}^{T} \pi_\theta(y_i \mid y_{<i}, x, \rho)$.

Minimum Bayes Risk Decoding.

In practice, the highest likelihood sequence does not necessarily yield the highest quality generation (Jaeger and Levy, 2006). From this observation, MBR decoding (Bickel and Doksum, 1977; Eikema and Aziz, 2020) first samples a set of hypotheses $\mathcal{H}$ from the model $\pi_\theta$, approximating the true distribution of the output space $\mathcal{Y}$, then selects the output $\hat{y}_{\text{MBR}}$ that maximizes the expected utility (or minimizes the expected loss in the traditional formulation) with respect to a set of references $\mathcal{R}$:

$$\hat{y}_{\text{MBR}} = \underset{y \in \mathcal{H}}{\arg\max}\; \mathbb{E}_{\mathcal{R} \sim \pi_\theta}\left[U(y, \mathcal{R})\right], \tag{1}$$

where $U(y, \mathcal{R}) = \mathbb{E}_{y' \sim \mathcal{R}}\left[u(y, y')\right]$ and $u(y, y')$ is a utility function that evaluates hypothesis $y$ against a reference $y'$. In practice, $\mathcal{R}$ is also sampled from the same model $\pi_\theta$ under the assumption that the model produces reliable outputs in expectation, and is usually set as identical to the hypothesis set $\mathcal{H}$.

Many existing techniques to improve LLMs’ performance such as self-consistency (Wang et al., 2023) and output ensemble (Kobayashi, 2018) are special cases of MBR. For instance, self-consistency can be viewed as MBR using the utility function $u(y, y') = \mathbb{1}\left[\text{ans}(y) = \text{ans}(y')\right]$, where $\text{ans}(y)$ is the answer extracted from the reasoning path $y$ (Bertsch et al., 2023).
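The selection rule in Equation 1 can be illustrated with a minimal sketch. It assumes the references are the other hypotheses in the candidate set (excluding the candidate itself, which is one possible convention) and that the utility is any pairwise function. The names `mbr_select` and `exact_match`, and the last-token answer extractor, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence


def mbr_select(candidates: Sequence[str],
               utility: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest mean utility against all others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        # Every other hypothesis acts as a pseudo-reference (assumption: self excluded).
        refs = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], r) for r in refs) / len(refs)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


def exact_match(y: str, y_ref: str) -> float:
    """Self-consistency utility: 1 if the extracted answers agree, else 0."""
    def ans(s: str) -> str:
        # Hypothetical answer extractor: take the last whitespace-separated token.
        return s.strip().split()[-1]
    return 1.0 if ans(y) == ans(y_ref) else 0.0


if __name__ == "__main__":
    paths = ["so the answer is 4", "thus 4", "I get 5"]
    print(mbr_select(paths, exact_match))
```

With the 0/1 utility, the selection reduces to a majority vote over extracted answers, which is the sense in which self-consistency is a special case of MBR.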

3. Multi-Prompt MBR Decoding

Prior work on MBR decoding primarily uses models trained or fine-tuned for a specific generation task (Freitag et al., 2022a; Fernandes et al., 2022). With instruction fine-tuned LLMs, the input x is contained within a structured prompt ρ, consisting of task instruction and/or in-context examples. Earlier studies have extensively documented that the design of the prompt has a dramatic impact on overall performance (Mishra et al., 2022; Khashabi et al., 2022; Lu et al., 2022; Sclar et al., 2023).

To investigate this phenomenon, we show in Figure 3a (bottom) the likelihoods and quality of samples from 10 prompts of varying performance for a text simplification task, measuring quality as the Lens metric score against a set of gold references. Greedy sampling (τ=0) estimates different sequences for each instruction, with single prompt (Figure 3a, top) generating a single sequence. As we increase temperature τ, generations from a single prompt simply exhibit noise centered around the mode of the highest likelihood sequence, while multi-prompt estimates generations around modes uniquely defined by each prompt. For instance, one of the prompts (i.e., Prompt 9 highlighted in green) produces the highest quality generation for this one input sentence, despite having a low performance over the entire dataset. In fact, no prompt consistently produces the highest quality sequences, as illustrated in Figure 3b; rather, prompts are most effective at different inputs.

Figure 3:

(a) Lens score and sequence probability for 1000 generations on a single text simplification example decoded from Llama 2 7B Chat with temperatures τ=[0,0.1,0.5] using a single prompt (top) and multiple prompts (bottom). As the temperature increases, we find each prompt estimates candidate sequences centered at different modes. (b) Lens scores of the best generation per-prompt for the first 20 sentences in SimpEval, showing no single prompt produces the best overall output. (c) Dataset-level LENS performance of each prompt when performing single prompt MBR vs. multi-prompt MBR.

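Building upon these insights, we propose multi-prompt MBR decoding, depicted in Figure 2, where the MBR hypothesis set $\mathcal{H}$ consists of outputs sampled from $n$ distinct prompts $\rho$:

$$\mathcal{H} = \bigcup_{i=1}^{n} \mathcal{H}_i, \quad \text{where } \mathcal{H}_i = \left\{ y \mid y \sim \pi_\theta(x, \rho_i) \right\}. \tag{2}$$

Bertsch et al. (2023) show that MBR seeks the mode of some distribution $q$ over a quality feature $\phi(y)$ applied to the output space rather than the mode of the model’s distribution:

$$\hat{y}_{\text{MBR}} \approx \underset{y \in \mathcal{H}}{\arg\max}\; q(\phi(y) \mid x). \tag{3}$$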

We hypothesize, in expectation, the mode of ϕ(y) across outputs from multiple prompts has higher downstream performance compared to that derived from a single prompt. This is empirically supported by our example, where Figure 3c shows that multi-prompt MBR outperforms individual single-prompt MBR across the full task dataset.
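As a rough sketch of Equation 2, the hypothesis set can be built by looping over the prompts and pooling the samples. The function name `multi_prompt_candidates` and the `generate(x, rho)` callable are assumptions standing in for an actual LLM sampling call; duplicates are kept here, on the assumption that repeated outputs should carry more weight during MBR.

```python
from typing import Callable, List, Sequence


def multi_prompt_candidates(x: str,
                            prompts: Sequence[str],
                            generate: Callable[[str, str], str],
                            samples_per_prompt: int = 1) -> List[str]:
    """Pool samples from every prompt rho_i into one hypothesis set H."""
    hypotheses: List[str] = []
    for rho in prompts:
        # H_i: samples drawn from the model conditioned on (x, rho_i).
        for _ in range(samples_per_prompt):
            hypotheses.append(generate(x, rho))
    return hypotheses


if __name__ == "__main__":
    # Toy generator standing in for a sampled LLM call (hypothetical).
    fake_llm = lambda x, rho: f"{rho} -> {x}"
    print(multi_prompt_candidates("a complex sentence", ["Simplify:", "Rewrite simply:"], fake_llm))
```

The pooled list can then be passed to any MBR selection routine in place of the single-prompt candidate set.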

Although multi-prompt ensembles hypothesis spaces between prompts, some notion of objective quality still exists when constructing the prompt bank. As shown in Figure 3c, the majority of the 10 human-written prompts fall within a 10-point range of LENS scores when evaluated on the task dataset but a few prompts consistently produce low-quality generation. Therefore, to account for the hierarchy in prompt quality, we propose two methods for choosing the prompts used at generation time from a prompt bank 𝒫: sampling from a learned distribution of prompts, based on a small unlabeled train set (§3.1); and selecting a subset of prompts based on heuristics in the absence of a train set (§3.2).

3.1. Prompt Sampling

In this approach, we first calculate the probability of each prompt p(ρ) as the proportion of times that prompt generates the highest scoring output on a separate training set. At inference time, prompts are sampled with replacement from this learned probability distribution, and candidate outputs are then generated given these prompts.
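A minimal sketch of this procedure follows, assuming we already have, for each training input, a score for the output of every prompt (how those scores are produced is left open). The names `estimate_prompt_usage` and `sample_prompts` are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def estimate_prompt_usage(scores: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """p(rho): fraction of training inputs on which rho yields the top-scoring output.

    scores[i][rho] is the score of prompt rho's output on training input i.
    """
    wins = Counter(max(row, key=row.get) for row in scores)
    return {rho: wins[rho] / len(scores) for rho in scores[0]}


def sample_prompts(usage: Dict[str, float], n: int, seed: int = 0) -> List[str]:
    """Draw n prompts with replacement, weighted by their estimated usage."""
    rng = random.Random(seed)
    names = list(usage)
    return rng.choices(names, weights=[usage[r] for r in names], k=n)


if __name__ == "__main__":
    train_scores = [
        {"a": 0.9, "b": 0.1},
        {"a": 0.8, "b": 0.2},
        {"a": 0.1, "b": 0.7},
        {"a": 0.6, "b": 0.5},
    ]
    usage = estimate_prompt_usage(train_scores)
    print(usage, sample_prompts(usage, n=5))
```

Because sampling is with replacement, a frequently winning prompt can contribute several candidates to the same hypothesis set.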

Top-p Prompt Sampling.

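Inspired by the principle of nucleus sampling (Holtzman et al., 2020), our goal is to keep the prompts with high probability and truncate the least used prompts by setting their probabilities to zero. We define the top-p prompt set as the minimal set $\mathcal{P}_{\text{top-}p} \subseteq \mathcal{P}$ such that:

$$\sum_{i=0}^{|\mathcal{P}_{\text{top-}p}|} p(\rho_i) \geq p. \tag{4}$$

We then re-normalize the distribution of $\mathcal{P}_{\text{top-}p}$ and sample prompts from the new distribution:

$$p'(\rho) = \begin{cases} \dfrac{p(\rho)}{\sum_{\rho' \in \mathcal{P}_{\text{top-}p}} p(\rho')} & \text{if } \rho \in \mathcal{P}_{\text{top-}p} \\ 0 & \text{otherwise.} \end{cases} \tag{5}$$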
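Equations 4 and 5 amount to a truncate-then-renormalize step, sketched below. The function name `top_p_prompt_distribution` is hypothetical, and ties between equally likely prompts are broken by insertion order, which is an assumption of this sketch.

```python
from typing import Dict


def top_p_prompt_distribution(usage: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest high-probability prompt set with mass >= p, then renormalize."""
    ranked = sorted(usage.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = set(), 0.0
    for rho, prob in ranked:
        kept.add(rho)
        mass += prob
        if mass >= p:  # Eq. 4: stop once the cumulative mass reaches p
            break
    # Eq. 5: rescale kept prompts, zero out the rest.
    return {rho: (prob / mass if rho in kept else 0.0) for rho, prob in usage.items()}


if __name__ == "__main__":
    print(top_p_prompt_distribution({"a": 0.5, "b": 0.3, "c": 0.2}, p=0.6))
```

In this toy example prompts `a` and `b` are kept (cumulative mass 0.8), so their renormalized probabilities are 0.625 and 0.375, while `c` is truncated to zero.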

3.2. Prompt Selection

Prompt selection chooses a fixed subset $\mathcal{P}_{\text{best}} \subset \mathcal{P}$ of $|\mathcal{P}_{\text{best}}| = k$ prompts based on heuristics. Compared to sampling, this does not require an additional training set to evaluate prompt efficacy. We consider the following heuristics for selecting $\mathcal{P}_{\text{best}}$: prompts that have the closest similarity and greatest dissimilarity with others, and prompts that are randomly selected from each k-NN cluster, which is also useful when a training set is available, allowing the selection of high-performing prompts within each cluster. We calculate the semantic (dis)similarity of prompts based on SentenceBERT (Reimers and Gurevych, 2019) embeddings.
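One plausible reading of the closest and farthest similarity heuristics is to rank each prompt by its mean cosine similarity to all other prompts, as sketched below. This ranking criterion is an assumption, not a confirmed detail of the method, and the toy vectors stand in for SentenceBERT embeddings. The name `select_by_similarity` is hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_by_similarity(embeddings: Dict[str, Sequence[float]],
                         k: int,
                         farthest: bool = True) -> List[str]:
    """Pick k prompts by mean cosine similarity to all other prompts.

    farthest=True keeps the most dissimilar prompts; False keeps the most similar.
    Requires at least two prompts.
    """
    names = list(embeddings)

    def mean_sim(name: str) -> float:
        others = [n for n in names if n != name]
        return sum(cosine(embeddings[name], embeddings[o]) for o in others) / len(others)

    # Ascending mean similarity puts the most dissimilar prompts first.
    ranked = sorted(names, key=mean_sim, reverse=not farthest)
    return ranked[:k]


if __name__ == "__main__":
    toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    print(select_by_similarity(toy, k=1, farthest=True))
    print(select_by_similarity(toy, k=1, farthest=False))
```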

4. Experiment Setup

In this section, we describe the experimental details for evaluating the efficacy of multi-prompt MBR decoding across tasks, prompt setups, models, and utility metrics, with results and analyses in §5.

4.1. Tasks & Datasets

Unlike previous work applying MBR to a single generation task (Shi et al., 2022; Eikema and Aziz, 2022), we deliberately select three unique tasks to demonstrate the universality of multi-prompt: text simplification with task-level instructions, code generation with example-level instructions, and machine translation with in-context examples.

Code Generation.

We use the HumanEval (Chen et al., 2021) benchmark, where models are tasked with generating a Python program given a description with unit tests. Since each example is a unique coding task, we generate a unique prompt bank for each input. Following Zhang et al. (2023), we reject empty, degenerate (e.g., pass, return None), or non-compiling programs before applying MBR.
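The rejection step could look roughly like the sketch below, which uses Python's `ast` module and treats a syntax error as a stand-in for "non-compiling". The exact filtering rules of Zhang et al. (2023) are not reproduced here; the function name `is_valid_program` and the specific checks are assumptions.

```python
import ast


def is_valid_program(source: str) -> bool:
    """Reject empty, non-parsing, or trivially degenerate function definitions."""
    if not source.strip():
        return False
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    if not funcs:
        return False
    for fn in funcs:
        # Ignore docstrings and other bare constant expressions (e.g. `...`).
        body = [s for s in fn.body
                if not (isinstance(s, ast.Expr) and isinstance(s.value, ast.Constant))]
        if not body:
            return False
        if len(body) == 1:
            stmt = body[0]
            if isinstance(stmt, ast.Pass):
                return False
            if isinstance(stmt, ast.Return) and (
                    stmt.value is None
                    or (isinstance(stmt.value, ast.Constant) and stmt.value.value is None)):
                return False
    return True


if __name__ == "__main__":
    print(is_valid_program("def f(x):\n    return x + 1"))
    print(is_valid_program("def f(x):\n    pass"))
```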

Text Simplification.

We use the SimpEval2022 test set (Maddela et al., 2023), containing complex sentences from Wikipedia, paired with human-written simplifications. The prompt bank is generated based on author-written examples (Table 4) and is used for the entire dataset.

Machine Translation.

We intentionally choose the EN → CS language pair from the WMT 22 (Kocmi et al., 2022) newstest corpus, ensuring its exclusion from the training data of recent translation LLMs or metrics (Xu et al., 2024). Results on additional language pairs are in Appendix C.2.

4.2. Constructing the Prompt Bank

For text simplification and code generation experiments, we first collect a small set of manually written seed prompts and construct the full prompt set by using GPT-4 Turbo to generate diverse paraphrases of the seed prompts. The authors manually write 10 seed prompts for text simplification (Table 4) and use the original HumanEval instruction from each example as the seed prompt for code generation. For translation experiments, we use randomly sampled in-context examples taken from previous WMT shared tasks as the prompt bank instead of generating translation instructions. In our preliminary experiments, we found translation LLM performance to be more sensitive to varying examples rather than translation instructions.

For multi-prompt experiments, we select from the prompt bank with top-p prompt sampling (§5.2) using p=0.6, where the prompt usage p(ρ) is calculated using a held-out 20% split of each dataset. For our single prompt baselines, we use a randomly selected prompt from the prompt bank. Human-written prompts and prompt generation instructions are included in Appendix A.

4.3. Models

Our main experiments are performed with Llama 2 7B Chat (Touvron et al., 2023) for simplification, ALMA-7B-R (Xu et al., 2024) for translation and CodeLLaMA-13B Instruct (Roziere et al., 2023) for code generation, all fine-tuned to follow instructions. In §5.3 we further explore a wide range of model architectures and sizes, including state-of-the-art and task-specific fine-tuned models. Unless otherwise specified, we generate the hypothesis set using nucleus sampling (Holtzman et al., 2020) with τ=0.9, p=0.95. We include a detailed review of all models in this work in Appendix B.2.

4.4. Utility Metrics & Evaluation

Our core experiments use the trained Lens (Maddela et al., 2023) for simplification and Comet (Rei et al., 2020) for translation as the candidate selection metric. For code generation, we use MBR-Exec (Shi et al., 2022), which executes each candidate program against a set of test cases, selecting the program with the highest agreement over all test cases’ outputs. As in Zhang et al. (2023), we use the docstring examples as test cases for MBR-Exec and evaluate with pass@1. Given the growing body of work on metric development, we verify our multi-prompt results across a broad range of utility and evaluation metrics in §5.4.
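To make the execution-based selection concrete, the sketch below runs each candidate on shared test inputs and picks the program whose full output vector matches the most other candidates. It is a simplified illustration in the spirit of MBR-Exec, not the implementation of Shi et al. (2022). The names `run_candidate` and `mbr_exec` are hypothetical, and `exec` is used without any sandboxing, which would be unsafe for untrusted model output.

```python
from typing import Any, List, Sequence, Tuple


def run_candidate(source: str, entry_point: str, inputs: Sequence[Tuple]) -> List[Any]:
    """Execute a candidate on each test input; record outputs or an error marker."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOTE: unsafe outside a sandbox
        fn = namespace[entry_point]
    except Exception:
        return ["<error>"] * len(inputs)
    outputs = []
    for args in inputs:
        try:
            outputs.append(repr(fn(*args)))
        except Exception:
            outputs.append("<error>")
    return outputs


def mbr_exec(candidates: Sequence[str], entry_point: str, inputs: Sequence[Tuple]) -> str:
    """Select the program whose outputs agree with the most other programs."""
    results = [run_candidate(c, entry_point, inputs) for c in candidates]

    def agreement(i: int) -> int:
        # 0/1 agreement: two programs agree only if all test outputs match.
        return sum(results[i] == results[j] for j in range(len(candidates)) if j != i)

    return candidates[max(range(len(candidates)), key=agreement)]


if __name__ == "__main__":
    progs = [
        "def add(a, b):\n    return a + b",
        "def add(a, b):\n    return b + a",
        "def add(a, b):\n    return a - b",
    ]
    print(mbr_exec(progs, "add", [(1, 2), (3, 4)]))
```

Note that in this sketch programs that all fail would "agree" with one another, which is one reason to filter degenerate candidates before selection.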

5. Experiment Results

We compare multi-prompt decoding to traditional MBR (§5.1), ablate the prompt sampling mechanism (§5.2), vary model architectures (§5.3), evaluate across utility metrics (§5.4) and finally evaluate multi-prompt on efficient MBR alternatives (§5.5).

5.1. How does multi-prompt MBR perform?

Multi-prompt Improves MBR.

We report our main results in Figure 1, which compares single prompt and multi-prompt performance when generating up to 500 candidates. Multi-prompt consistently outperforms standard MBR for all tasks.

Candidate Diversity ≠ Quality.

To measure the impact of temperature on the candidate set quality, we report performance and diversity, as measured by novel bi-grams, across temperatures in Figure 4. For low temperatures, we find that multi-prompt generates a consistently more diverse candidate space, which directly translates to higher-quality generation. Single prompt MBR performance improves with temperature τ>1, but even when single prompt generates a candidate set of equal or greater diversity than multi-prompt, multi-prompt MBR still produces higher quality candidates. As τ → 2, the quality of single and multi-prompt MBR begins to degrade as their candidate sets become too noisy to generate high-quality sequences. Framing the decoding process as each prompt estimating a unique distribution of candidate generations (§3), the ability of multi-prompt to achieve higher quality generation as a result of candidate set diversity is intuitively the byproduct of combining multiple candidate distributions defined by each instruction.
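The exact definition of the novel bi-gram measure is not spelled out here. As one hedged possibility, the sketch below reports the fraction of distinct candidate bi-grams that do not appear in the source sentence, using whitespace tokenization. Both the definition and the name `novel_bigram_ratio` are assumptions made for illustration.

```python
from typing import Sequence, Set, Tuple


def bigrams(text: str) -> Set[Tuple[str, str]]:
    """Distinct lowercase word bi-grams of a whitespace-tokenized string."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def novel_bigram_ratio(source: str, candidates: Sequence[str]) -> float:
    """Fraction of distinct candidate bi-grams that never occur in the source."""
    cand: Set[Tuple[str, str]] = set()
    for c in candidates:
        cand |= bigrams(c)
    if not cand:
        return 0.0
    return len(cand - bigrams(source)) / len(cand)


if __name__ == "__main__":
    print(novel_bigram_ratio("the cat sat", ["the cat sat", "a cat sat down"]))
```

Under this definition, a candidate set that merely copies the input scores 0, and the score rises as candidates introduce wording not present in the source.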

Figure 4:

Candidate set diversity and Lens scores on SimpEval for 200 repetitions of single-prompt and multi-prompt at various temperatures. At low temperatures, the increased candidate diversity from multi-prompt directly translates to improved performance.

We include additional results on our main experiments in Appendix C, notably that multi-prompt outperforms beam search and that the choice of the single prompt impacts the baseline performance.

5.2. What is the impact of the prompt bank?

Sampling Prompts Improves Candidate Quality.

Table 1 (top) reports results for multi-prompt across different prompt sampling methods for text simplification and translation. We perform a hypothesis test for the statistical significance of each variation of multi-prompt outperforming single prompt MBR using bootstrap sampling with 1000 iterations (Koehn, 2004). Note that code generation results are omitted, as a unique set of prompts is generated for each HumanEval example. We find that sampling prompts by usage and truncating the top-p prompts improves multi-prompt over a random selection baseline, with top-p prompt sampling performing the best on both tasks.
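A paired bootstrap test in the style of Koehn (2004) can be sketched as follows, assuming per-example metric scores for both systems on the same test set and a corpus score that is a simple sum (or mean) of those scores. The function name `paired_bootstrap_p_value` and its defaults are hypothetical.

```python
import random
from typing import Sequence


def paired_bootstrap_p_value(baseline: Sequence[float],
                             system: Sequence[float],
                             iterations: int = 1000,
                             seed: int = 0) -> float:
    """Estimate how often `system` fails to beat `baseline` on resampled test sets."""
    if len(baseline) != len(system) or not baseline:
        raise ValueError("score lists must be non-empty and aligned")
    rng = random.Random(seed)
    n = len(baseline)
    losses = 0
    for _ in range(iterations):
        # Resample test examples with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(system[i] for i in idx) <= sum(baseline[i] for i in idx):
            losses += 1
    return losses / iterations


if __name__ == "__main__":
    single = [0.70, 0.72, 0.68, 0.75, 0.71]
    multi = [0.74, 0.73, 0.72, 0.78, 0.70]
    print(paired_bootstrap_p_value(single, multi))
```

A returned value below 0.05 would correspond to the significance threshold used for the starred entries in Table 1.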

Table 1:

Results for prompt sampling using 100 prompts (top) and subset selection using 10 of 100 prompts (bottom).

Method                            pass@1    Lens      Comet
Single Prompt (|ℋ| = 100)         48.78     74.67     88.93
Multi-Prompt + Prompt Sampling (|𝒫| = 100)
  Random Selection                -         74.91*    89.98*
  Prompt Sampling                 -         78.29*    90.33*
  Top-p Prompt Random             -         78.61*    90.11*
  Top-p Prompt Sampling           -         79.08*    90.36*
Single Prompt (|ℋ| = 10)          41.55     61.26     87.24
Multi-Prompt + Prompt Selection (𝒫best ⊂ 𝒫, |𝒫best| = 10)
  Random Selection                39.63     60.00     87.81*
  k-NN Cluster Random             40.24     58.73     87.80*
  Farthest Similarity             44.51*    58.32     88.14*
  Closest Similarity              37.80     61.53*    87.73*
  Highest Performance             -         62.43*    87.65
  k-NN Cluster Performance        -         66.12*    87.73*

* = Statistically significant improvement with p<0.05.

Sampling from a weighted, truncated distribution improves multi-prompt across candidate set sizes.

A Higher Quality Prompt Bank Improves Multi-prompt.

Table 1 (bottom) reports results for different prompt subset selection methods, which use heuristics to select a smaller set of prompts for multi-prompt to maximize performance. The best selection method for each task had a significant impact on performance when compared to a single prompt MBR (+2.9 pass@1, +4.9 LENS and +0.9 Comet). For text simplification, decoding with the 10 highest performing prompts is further improved by selecting prompts from a k-NN clustering of prompt embeddings, which enforces dissimilarity between prompts. However, translation and code generation benefit from using the farthest similarity, or semantically distant prompts. These results highlight multi-prompt’s sensitivity to the prompt construction, and show that enforcing both diversity via multi-prompt and performance via prompt selection improves candidate generation. A direct comparison between prompt sampling and selection using the same candidate set size is included in Table 6 in Appendix C.4.

5.3. Does multi-prompt MBR improve quality across model architectures and sizes?

Multi-prompt Improves MBR Across Models.

Figure 5 reports improvement of multi-prompt over single prompt across widely used LLMs as a Δ change in score, with per-model results in Appendix C.5. In all cases, multi-prompt outperforms single prompt using a sufficiently large candidate set size, showing an increasing or constant metric improvement. In fact, smaller models surpass their larger counterparts’ single output decoding at large enough candidate set sizes (Fig. 10). For instance, CodeLlama 13B outperforms its 70B variant using multi-prompt with 18 candidates (48.26 > 47.99 pass@1) and TowerInstruct 7B outperforms 13B with 5 candidates (81.73 > 80.14 Comet).

Figure 5:

Δ metric improvement from single prompt to multi-prompt across model sizes and architectures, reported with a 95% CI bootstrapped over 20 iterations. For absolute performance, see Figure 10.

LLMs with Multi-prompt Outperform Finetuned Models.

Whether general-purpose, instruction fine-tuned LLMs outperform models trained on a specific generation task is still an active question (Qin et al., 2023), so we compare state-of-the-art results from each task dataset using single prompt MBR to instruction fine-tuned LLMs using multi-prompt MBR with top-p prompt sampling. In Table 2, we report previous SOTA results for each task: an 11B T5-based text simplification model with control tokens for simplification operations (Sheang and Saggion, 2021), the En-Cs results for the WMT ’22 winning submission (Kocmi et al., 2022) and StarCoder 15B, a code infilling and generation LLM (Li et al., 2023), not explicitly trained to follow natural language instructions. LLMs surpass fine-tuned model performance when using multi-prompt, for instance Llama 2 13B shows +5.8 Lens over fine-tuned T5 11B.

Table 2:

Metric scores for state-of-the-art systems compared to LLMs with multi-prompt using |ℋ| candidates. Translation and simplification baselines are as reported in Hendy et al. (2023) and Maddela et al. (2023).

System               Single Prompt   Multi-prompt   Cand. Bleu (MP on SP)   Cand. Bleu (SP on MP)
Code Generation (|ℋ| = 20) – HumanEval (pass@1)
  StarCoder 2 15B    44.51           49.39          49.69                   50.13
  CodeLlama 7B       37.80           40.85          62.05                   63.32
  CodeLlama 13B      43.29           48.17          59.49                   60.76
  CodeLlama 34B      45.73           52.44          61.59                   62.92
  CodeLlama 70B      61.59           68.90          63.15                   65.12
  GPT-3.5            68.29           73.78          83.07                   89.86
  GPT-4              81.71           82.93          81.72                   89.82
Text Simplification (|ℋ| = 100) – SimpEval (LENS)
  Ctrl T5 3B         72.6            -              -                       -
  Ctrl T5 11B        74.4            -              -                       -
  Llama 2 7B Chat    75.71           80.38          80.71                   74.68
  Llama 2 13B Chat   78.19           80.27          79.30                   77.65
  Llama 2 70B Chat   82.21           83.28          74.11                   70.65
  GPT-3.5            76.87           81.25          94.18                   85.56
  GPT-4              76.47           81.56          96.74                   81.05
Translation (|ℋ| = 100) – WMT ’22 En-Cs (COMET)
  WMT ’22 Winners    91.9            -              -                       -
  MS Translate API   90.6            -              -                       -
  ALMA 7B R          89.17           89.94          87.22                   81.20
  ALMA 13B R         89.41           90.45          89.75                   84.74
  GPT-3.5            91.27           91.35          99.26                   95.47
  GPT-4              92.24           92.47          90.21                   90.85

Candidate Set Overlap May Explain the Performance Similarity for Large Models.

Finally, in Table 2, we observe that stronger systems, such as GPT-4 on translation, show smaller differences between single and multi-prompt. One explanation may be that stronger models generate similar candidate sets under both methods. To understand this behavior, we measure the similarity between the candidate sets generated by multi-prompt and single prompt, where a more similar candidate set may indicate a smaller improvement from multi-prompt. We report the ‘Candidate Bleu (target on references)’ score, which measures the n-gram overlap of a set of target sequences over the bank of references. In our results, we find that stronger models produce single prompt candidate sets which contain more multi-prompt n-grams (as shown in ‘SP on MP’), and that candidate sets show a higher n-gram coverage as models improve. This increasing similarity between the candidates may explain the decreasing performance improvement for multi-prompt.

5.4. Does multi-prompt MBR over-fit to the utility metric?

An inherent challenge of evaluating MBR is that the utility metric used to select candidates is typically also used for the final evaluation. In such cases, it is difficult to attribute the metric improvement to higher quality generation (Bertsch et al., 2023). Given growing attention to metric development, we leverage various trained metrics to test whether multi-prompt using one utility metric improves performance across all other utility metrics. We experiment with traditional overlap-based metrics (Bleu, Sari), embedding similarity (BertScore), small (~100M parameter) trained metrics with references (Lens, Comet-22) and without references (CometKiwi, Lens-Salsa, Sle), and large (3B+ parameter) trained metrics (xComet, MetricX, MetricX-QE). These metrics represent diverse text evaluation approaches and encompass the full state of evaluation in both tasks. We include a full description of metric architectures in Appendix B.1.

Multi-prompt MBR Improves Across Metrics.

Table 3 reports results for cross-metric evaluation, with the diagonal reflecting the traditional MBR evaluation setup (i.e., calculate MBR and evaluate using the same metric) and other cells indicating generalization from one metric to all others. Multi-prompt improves performance on most evaluation setups, with a few notable exceptions such as disagreement between trained and overlap-based metrics for simplification and Comet-based metrics for translation. For simplification, trained metrics’ failure when evaluated by Sari and BertScore may be a byproduct of the test set size, as these metrics typically require a substantial number of references for stable evaluation (Alva-Manchego et al., 2020), more than what is provided in SimpEval. Interestingly, the magnitude of performance improvement varies widely with the specific utility metric, with no clear relationship between the metric architecture and the improvement of multi-prompt, but typically a lower baseline performance indicates multi-prompt performs better (see Table 8 in the Appendix for more details).

Table 3:

Δ metric improvement from single prompt to multi-prompt across metrics.

[Table 3 appears as an image in the source manuscript.]

rf = Reference-free reranker.

* = Statistically significant improvement with p<0.05.

For absolute performance, see Table 8.

5.5. How does the metric type impact multi-prompt MBR?

As discussed by Fernandes et al. (2022), the MBR operation requires each candidate to be evaluated against every other candidate (i.e., $\mathcal{O}(n^2)$ comparisons), which becomes inefficient in practice for a large $n$, especially when using a trained utility metric. Therefore, we explore multi-prompt MBR alternatives using reference-free utility metrics:

  • Reranker ($\mathcal{O}(n)$). Re-ranking directly estimates the quality of each candidate using a reference-free metric: $\hat{y}_{\text{MBR}} = \arg\max_{y \in \mathcal{H}} \left[U(y)\right]$. We use the trained Lens-Salsa for simplification (Heineman et al., 2023) and Comet-MQM (Rei et al., 2021) for translation. For code generation, we use Code Reviewer (Shi et al., 2022), which calculates agreement between the per-token probability of the generation given the docstring and the original docstring given the generation. Reference-free re-ranking only requires $n$ metric calculations to directly estimate quality.

  • Reranker + MBR ($\mathcal{O}(n + m^2)$). We use a two-stage selection in which we first rerank all $n$ candidates and select the top $m$ (with $m < n$) to use for MBR, so that the cheap re-ranker distills the candidate set and the expensive MBR metric performs the final selection.

  • Multi-turn MBR ($\mathcal{O}(n^2 + m^2)$). Similar to the previous approach, we perform MBR and then re-compute MBR using the top $m$ candidates.
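The three alternatives listed above can be sketched with a generic pairwise `utility` and a reference-free `scorer`. Both are placeholders for trained metrics, and all function names here are hypothetical. The comments note the number of metric calls each variant requires.

```python
from typing import Callable, List, Sequence

Utility = Callable[[str, str], float]  # reference-based: u(y, y')
Scorer = Callable[[str], float]        # reference-free: U(y)


def mbr_rank(cands: Sequence[str], utility: Utility) -> List[str]:
    """Order candidates by mean pairwise utility, best first (O(n^2) metric calls)."""
    def score(i: int) -> float:
        others = [c for j, c in enumerate(cands) if j != i]
        return sum(utility(cands[i], o) for o in others) / max(len(others), 1)
    order = sorted(range(len(cands)), key=score, reverse=True)
    return [cands[i] for i in order]


def rerank(cands: Sequence[str], scorer: Scorer) -> str:
    """Reranker: one reference-free score per candidate (O(n))."""
    return max(cands, key=scorer)


def rerank_then_mbr(cands: Sequence[str], scorer: Scorer, utility: Utility, m: int) -> str:
    """Reranker + MBR: keep the top m by the cheap scorer, then run MBR (O(n + m^2))."""
    top_m = sorted(cands, key=scorer, reverse=True)[:m]
    return mbr_rank(top_m, utility)[0]


def multi_turn_mbr(cands: Sequence[str], utility: Utility, m: int) -> str:
    """Multi-turn MBR: run MBR, keep the top m, run MBR again (O(n^2 + m^2))."""
    top_m = mbr_rank(cands, utility)[:m]
    return mbr_rank(top_m, utility)[0]


if __name__ == "__main__":
    # Toy metrics: token-set Jaccard overlap and a preference for shorter outputs.
    def jaccard(a: str, b: str) -> float:
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / len(sa | sb)
    shortness = lambda y: -len(y.split())
    pool = ["a b c", "a b d", "a b", "x y z"]
    print(rerank(pool, shortness),
          rerank_then_mbr(pool, shortness, jaccard, m=3),
          multi_turn_mbr(pool, jaccard, m=2))
```

For a sense of scale, with n = 100 and m = 10 the two-stage variant needs on the order of 100 + 100 metric calls rather than roughly 10,000 pairwise comparisons for full MBR.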

Results.

We report results across candidate selection methods in Figure 6, finding that multi-prompt achieves performance improvement across reference-based and reference-free metrics, yet the relative performance of methods varies between tasks. With text simplification, the methods that first narrow the candidate set (‘Reranker + MBR’) or iteratively perform MBR (‘Multi-turn MBR’) either match or out-perform vanilla MBR. We speculate the first pass may prune the lowest quality generations such that the second pass only considers a distilled candidate set, which better informs the MBR calculation. For translation, the more efficient re-ranker outperforms vanilla MBR, which is consistent with recent work finding that trained reference-based and reference-free MT metrics are approaching a similar quality (Freitag et al., 2023b). For code generation, the re-ranker under-performs MBR, which may be reflective of the performance of Code Reviewer compared to MBR-Exec, as the latter has access to multiple test cases.

Figure 6:

Alternative MBR formulations for multi-prompt across candidate set sizes for code generation, text simplification and translation. Efficient MBR methods show inconsistent results, dependent on task and metric.

6. Related Work

Output Selection.

Ensembling outputs across a generation set has become a widely used technique for improving LLM performance in classification tasks, such as using a majority vote over reasoning chains (Wang et al., 2023), or merging outputs from multiple models (Kobayashi, 2018; Martínez Lorenzo et al., 2023). This work applies the same underlying concept to text generation by leveraging trained automatic evaluation metrics. To our knowledge, it is the first to propose a multi-prompt decoding scheme for text generation.

MBR Decoding.

MBR decoding has been previously used to improve generation quality for machine translation (Kumar and Byrne, 2004; Eikema and Aziz, 2020; Müller and Sennrich, 2021), text simplification (Maddela et al., 2023), summarization and style transfer (Suzgun et al., 2023). Bertsch et al. (2023) highlight the growing popularity of MBR as a simple technique in machine translation and in reporting shared task results. While our work is the first to propose generating the MBR hypothesis space using a prompt bank, Farinhas et al. (2023) perform preliminary experiments with paraphrases of a single sentence prompt, but found no difference in performance. Recent work argues that sampling strategies like nucleus (Eikema and Aziz, 2022) or epsilon (Freitag et al., 2023a) sampling offer slightly better performance over beam search for MBR; this work extends their findings by attributing candidate set quality to sampling diversity.

Prompt Selection.

Current work on prompting for text generation has focused on optimization, such as in-context example selection (Min et al., 2022), example ordering (Lu et al., 2022) and prompt selection (Gonen et al., 2023). Notably, Agrawal et al. (2023) show that selecting in-context examples for MT by maximizing n-gram overlap between the source and examples improves few-shot performance. Zhou et al. (2023) experiment with LLMs as prompt generators, and Yang et al. (2023) show that using LLMs to iteratively rewrite prompts on a development set can distill a single, high-performing prompt. Our work builds on LLM-written prompts and basic heuristics for distilling the prompt bank to further improve multi-prompt.

7. Conclusion

In this work, we propose multi-prompt, a generalized case of MBR for conditional text generation. Multi-prompt successfully ensembles outputs of instruction fine-tuned language models across prompt constructions and in-context examples. We highlight the importance of prompt selection and sampling when constructing the prompt bank with top-p prompt sampling and further verify our results across tasks, models and utility metrics.

Limitations

We limit our study of the prompt bank to a basic set of seed prompts and GPT-written paraphrases. Notably, we do not study the impact of prompt formats (e.g., passage:{}\n answer{} vs. Passage::{} Answer::{}, Sclar et al., 2023), in-context example ordering (Lu et al., 2022) or example selection (Agrawal et al., 2023) on multi-prompt performance, although multi-prompt may extend to such methods. We leave the question of exhaustively constructing a prompt bank to future work.

An inherent limitation of MBR is the increase in inference time, where we generate up to 500 samples in our experiments, and use a neural utility metric with either linear or quadratic comparisons between candidates. To illustrate this, the wall clock time for the main experiment setup (Figure 1) using standard decoding on a single A40 GPU is 4.73, 2.10, 2.21 seconds per input sentence and for multi-prompt with 100 candidates is 38.76, 183.81, 124.70 seconds per input sentence for code generation, simplification and translation respectively.
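To make the linear versus quadratic distinction concrete, the following Python sketch counts utility-metric calls per input and computes the slowdown implied by the wall-clock figures above. The function name `utility_calls` is hypothetical, and counting every ordered pair of distinct candidates is an assumption; a symmetric metric or a different implementation may need fewer calls.

```python
def utility_calls(num_candidates: int, reference_free: bool) -> int:
    """Number of utility-metric evaluations needed for one input."""
    if reference_free:
        # A reference-free metric scores each candidate once (linear).
        return num_candidates
    # Assumption: MBR scores every ordered pair of distinct candidates (quadratic).
    return num_candidates * (num_candidates - 1)


# Seconds per input reported above: (standard decoding, multi-prompt with 100 candidates).
timings = {
    "code generation": (4.73, 38.76),
    "simplification": (2.10, 183.81),
    "translation": (2.21, 124.70),
}

# Ratio of multi-prompt time to standard decoding time for each task.
slowdown = {task: multi / single for task, (single, multi) in timings.items()}

for task, ratio in slowdown.items():
    print(f"{task}: {ratio:.1f}x slower than standard decoding")

print("pairwise calls for 100 candidates:", utility_calls(100, reference_free=False))
print("reference-free calls for 100 candidates:", utility_calls(100, reference_free=True))
```

Under these assumptions, 100 candidates require 9,900 pairwise comparisons but only 100 reference-free evaluations, which is consistent with the metric, rather than generation, being the bottleneck for large candidate sets.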

In practice, the generation time was significantly lowered by decoding in parallel and by using memory-efficient attention techniques such as paged and flash attention, as implemented in the vLLM library (Kwon et al., 2023). The computational bottleneck for large candidate set sizes was instead evaluating the utility metrics across all pairs of generated candidates. To lower the number of metric comparisons, promising results have been demonstrated by pruning low-scoring candidates during the MBR process (Cheng and Vlachos, 2023), aggregating embedding representations of candidates (Vamvas and Sennrich, 2024) or selecting a subset of references for each candidate using heuristics on reference embeddings (Deguchi et al., 2024). Similarly, we show in §5.5 that efficient alternatives to MBR, such as reference-free metrics, largely preserve the benefits from multi-prompt.

Along with MBR, many widely used methods for improving LLM abilities trade increased compute at inference time for higher performance, such as using chain-of-thought to decode a reasoning chain for a single answer or using self-consistency to select an answer among multiple reasoning chains (Wei et al., 2022b; Wang et al., 2023).

Acknowledgments

The authors would like to thank Alan Ritter and Y-lan Boureau for discussions and Duong Le for his feedback on a draft manuscript. This research is supported in part by the NSF awards IIS-2144493 and IIS-2112633, NIH award R01LM014600, ODNI and IARPA via the HIATUS program (contract 2022-22072200004). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of NSF, NIH, ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

A. Prompt Bank Construction

Table 4 contains the human-written prompts for text simplification. These human-written prompts are provided as examples to GPT-4 when automatically generating prompts for large-scale experiments in §5. For code generation, we extract the docstring in the original HumanEval examples as the human-written prompt, and provide it as an example prompt to GPT-4. For machine translation, our few-shot examples were sampled randomly from the WMT newstest19 test corpus (Barrault et al., 2019).

Table 4:

Text simplification prompts used for the decoding experiment in Figure 3 and used as examples to write GPT-4 prompts for experiments in §5.

Human-Written Text Simplification Prompt
I am writing a sentence, please take a look at this sentence and write a simpler version such that a non-english speaker or an individual with disabilities could better understand the sentence.
Rewrite the following complex sentence in order to make it easier to understand by non-native speakers of English. You can do so by replacing complex words with simpler synonyms (i.e. paraphrasing), deleting unimportant information (i.e. compression), and/or splitting a long complex sentence into several simpler ones. The final simplified sentence needs to be grammatical, fluent, and retain the main ideas of its original counterpart without altering its meaning.
You are an artificial intelligence designed to simplify human written text. The text you are given will contain complex ideas, phrases or concepts and your job is to rewrite that text in a simple and easy to understand way. Your simplification should be completely fluent and retain the ideas of the simplification.
I would like you to simplify the following sentence such that the text is as concise and easy to read as possible.
You are to act as a text simplification bot. As a text simplification bot, you will simplify the following sentence such that it is syntactically easier to read and semantically easier to understand. Please do not make the text more complex, longer or difficult for a reader.
Make this sentence more approachable for a non-english speaker or an individual with a disability.
Rewrite the following sentence in simpler terms to help non-native English speakers and people with disabilities understand it better.
This is a sentence from Wikipedia, rewrite it such that it could appear on Simple English Wikipedia
You are an AI assistant that writes text simplification. Text simplification can be defined as any process that reduces the syntactic or lexical complexity of a text while attempting to preserve its meaning and information content. The aim of text simplification is to make text easier to comprehend for a human user, or process by a program. Please simplify the following sentence.
The following sentence has a high CEFR rating. Can you please rewrite it such that it will have a lower CEFR classification?

B. Detailed System Descriptions

In this section, we include a full description of the generation models and utility metrics used in experiments throughout §5.3 and §5.4. All experiments were inference-based and were run on up to 4x NVIDIA A40 GPUs, depending on the requirements of the specific model or utility metric. The use of models, metrics and datasets in this project follows their respective licenses and intended use.

Table 5:

Instruction templates provided to GPT-4 when generating task instructions for code generation (top) and text simplification (bottom).

Prompt-Generation Instruction
Please write a variation of the following instruction for a coding task. You may be creative in proposing potential solutions, or explaining the nature of the task. Please do not write any examples.
Example: {example_prompt}
Prompt:
Create a prompt for a language model to simplify a sentence, this prompt will explain the text simplification task and instructions for how to perform the task. The prompt should be diverse, include a description of simplification and clearly state what is expected of the language model.
Example: {example_prompt_1}
Example: {example_prompt_2}
Prompt:
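As a rough illustration of how the text simplification template in Table 5 could be filled before being sent to a prompt-writing model, the Python sketch below substitutes two seed prompts into the template. The helper `build_request` is hypothetical, and drawing the two examples uniformly at random from the seed prompts is an assumption; the call to the language model itself is deliberately left out.

```python
import random

# Template text copied from Table 5 (text simplification).
SIMPLIFICATION_TEMPLATE = (
    "Create a prompt for a language model to simplify a sentence, this prompt "
    "will explain the text simplification task and instructions for how to "
    "perform the task. The prompt should be diverse, include a description of "
    "simplification and clearly state what is expected of the language model.\n"
    "Example: {example_prompt_1}\n"
    "Example: {example_prompt_2}\n"
    "Prompt:"
)

# Two of the human-written seed prompts from Table 4.
SEED_PROMPTS = [
    "I would like you to simplify the following sentence such that the text "
    "is as concise and easy to read as possible.",
    "Make this sentence more approachable for a non-english speaker or an "
    "individual with a disability.",
    "This is a sentence from Wikipedia, rewrite it such that it could appear "
    "on Simple English Wikipedia",
]


def build_request(seed_prompts, rng):
    """Fill the template with two distinct seed prompts (assumed: chosen at random)."""
    first, second = rng.sample(seed_prompts, 2)
    return SIMPLIFICATION_TEMPLATE.format(
        example_prompt_1=first, example_prompt_2=second
    )


request = build_request(SEED_PROMPTS, random.Random(0))
print(request)
```

Repeating this with different random draws would yield varied requests, each of which could produce one new entry for the prompt bank.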

B.1. Utility Metrics

B.1.1. Code Generation

MBR-Exec (Shi et al., 2022) executes candidate generations on a series of test cases, and selects the candidate with the highest agreement on its output with all other candidates. While the authors do not evaluate on HumanEval, we replicate the setup in Zhang et al. (2023) by using the test cases in the docstring to calculate the agreement. We use a soft loss over all test cases, as many HumanEval docstring examples are trivial or edge cases. If two candidates have the same MBR score, we break ties using the candidate with higher probability under the language model.

Code Reviewer (Zhang et al., 2023) attempts to find a consensus between the likelihood of the generated program p(y|x) and the likelihood of the original docstring given a minified version of the generation p(x|y). We use their implementation for rejecting degenerate samples, minifying code and calculating the reviewer score. We use the same models for generation and re-ranking.
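A minimal Python sketch of this re-ranking idea is shown below, assuming the two log-likelihoods have already been computed by a language model. The function name `reviewer_rerank` and the toy numbers are hypothetical, and any length normalization or filtering of degenerate samples is omitted.

```python
def reviewer_rerank(candidates, coder_log_probs, reviewer_log_probs):
    """Pick the candidate maximizing log p(y|x) + log p(x|y).

    coder_log_probs[i]    : log p(y_i | x), the program given the docstring
    reviewer_log_probs[i] : log p(x | y_i), the docstring given the program
    """
    scores = [
        coder + reviewer
        for coder, reviewer in zip(coder_log_probs, reviewer_log_probs)
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Toy log-likelihoods (illustrative values only, not model outputs).
candidates = ["program_a", "program_b", "program_c"]
coder_log_probs = [-2.0, -1.0, -1.5]
reviewer_log_probs = [-1.0, -4.0, -1.0]

selected, scores = reviewer_rerank(candidates, coder_log_probs, reviewer_log_probs)
print(selected, scores)
```

Here the candidate with the highest generation likelihood (`program_b`) is not selected, because the docstring is unlikely given that program; the combined score favors a candidate that is reasonably likely in both directions.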

B.1.2. Simplification

Sari (Xu et al., 2016) is an n-gram overlap based metric that compares edits on inputs, outputs and a bank of references.

BERTScore (Zhang et al., 2020) calculates a word-level cosine similarity of BERT embeddings. Alva-Manchego et al. (2021) find BERTScore is an adequate measure of generation quality, but that it does not correlate with simplicity.
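The word-level matching behind this kind of metric can be illustrated with the pure-Python sketch below, which greedily matches each token vector to its most similar counterpart and combines precision and recall into an F1 score. The toy two-dimensional vectors stand in for contextual BERT embeddings, the function names are hypothetical, and optional components of the real metric such as idf weighting are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(candidate_vecs, reference_vecs):
    """F1 over greedy token matching with cosine similarity."""
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy embeddings: the candidate covers only one of the two reference tokens.
candidate = [[1.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]

f1 = greedy_match_f1(candidate, reference)
print(round(f1, 3))
```

In this example precision is 1.0 and recall is 0.5, so the score penalizes the missing content but says nothing about whether the candidate is simpler, in line with the observation above.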

Lens (Maddela et al., 2023) is a RoBERTa-based metric trained using human ratings of text simplification model outputs. The authors train on an adaptive loss to allow a high score for generations that are close to any references, encouraging the metric to consider different simplification types.

Lens-Salsa (Heineman et al., 2023) extends the Lens architecture by fine-tuning on a dual sentence- and word-level quality objective. The authors show Lens-Salsa is more sensitive to specific edit operations, while not requiring any reference simplifications.

Sle (Cripwell et al., 2023) is a RoBERTa-based metric trained to estimate the simplicity of text, with the simplicity score defined as the difference in simplicity between the complex and simplified sentences. Sle was trained on 0–4 readability scores of news articles in the Newsela corpus (Xu et al., 2015), with an additional label softening for individual sentences in each article.

B.1.3. Translation

Bleu (Papineni et al., 2002) is an n-gram overlap based metric comparing a translation to a bank of references. BLEU remains a widely-used standard for automatic evaluation, despite lower correlation to human judgement compared to learned metrics (Freitag et al., 2022b). We use the SacreBLEU implementation (Post, 2018).

Comet (Rei et al., 2020) is a widely used RoBERTa-based metric, trained on direct assessments of translation quality. For reference-free evaluation, we use the CometKiwi-XXL variant (Rei et al., 2022, 2023), trained to predict sentence- and word-level scores simultaneously.

xComet (Guerreiro et al., 2023) is a fine-tuned XLM-R model (Goyal et al., 2021) based on the CometKiwi architecture, but scaling the model size and training data, including with synthetic data created by randomly swapping n-grams or entire sentences with unrelated translations. We use the 11B xComet-xxl in our experiments.

MetricX (Juraska et al., 2023) is a recent finetuned 11B mT5-XXL (Xue et al., 2021) trained on DA data from 2015–20, MQM data from 2020–21 (Freitag et al., 2021) and synthetic data based on the MQM and DEMETR (Karpinska et al., 2022) taxonomies of translation errors. Notably, the MetricX architecture encodes both candidates and references together, while COMET encodes both separately and combines the outputs to calculate the final score. We also use the reference-free variant MetricX-QE. The WMT ’22 test data used in this work is not included in the training data of any translation metrics we considered.

B.2. Model Architectures

B.2.1. Code Generation

StarCoder 2 (Li et al., 2023) is trained from scratch on 4T tokens from 600+ programming languages. Although the model is not instruction fine-tuned, we see a slight performance improvement with multi-prompt, likely because comments and code descriptions are included in its pre-training.

CodeLLaMA (Roziere et al., 2023) is a fine-tuned Llama 2 model on 500B-1T tokens of code-related datasets, including Python, substantially outperforming the base Llama 2 model on HumanEval.

B.2.2. Simplification

Instruction Fine-tuned Models. We experiment with widely used instruction fine-tuned LLMs, aiming for a broad coverage of current models: Llama 2 Chat (Touvron et al., 2023), Gemma (Team et al., 2024) and Mistral (Jiang et al., 2023).

Fine-tuned Control T5 (Sheang and Saggion, 2021) is a T5-based text simplification model fine-tuned on the Wiki-Auto (Jiang et al., 2020) dataset of aligned English-Simple English Wikipedia articles. We use their same control token setup: <NC_0.95> <LS_0.75> <DR_0.75> <WR_0.75>.

B.2.3. Translation

Alma-r (Xu et al., 2024) is a class of translation LLMs. The base Alma (Xu et al., 2023) is a fine-tuned LLaMA model trained on monolingual text in each target language and further trained using parallel data. Alma-r (Xu et al., 2024) is an extension trained on a contrastive preference loss on ratings of translation quality.

TowerInstruct (Alves et al., 2024) is a fine-tuned Llama 2 model on multi-lingual instructions, aiming to incorporate tasks beyond translation, such as paraphrasing, post editing and grammar error correction.

Aya 101 (Üstün et al., 2024) is an mT5-based model fine-tuned on multi-lingual data in 101 languages. While mT5 is an instruction-following model, Aya is not fine-tuned on instruction data.

Additionally, we provide results from the WMT ’22 winning submission, and the Microsoft Translate API, as reported in Hendy et al. (2023).

Figure 7:

Multi-prompt, single prompt and beam search MBR decoding performance across candidate set sizes for code generation, text simplification and translation. Results are an average over 5 repetitions.

C. Further Results

C.1. Beam Search & Oracle Performance

Following related work in MBR, we report upper-bound ‘oracle’ results (similar to Shi et al., 2022) and a lower-bound beam search baseline (similar to Freitag et al., 2023a) in comparison to our main results (Figure 1) in Figure 7.

Beam Search.

The MBR candidate set historically has consisted of the top beam search candidates, but as language models have become better generators recent work has argued sampling leads to a better estimation of the hypothesis space (Freitag et al., 2023a). For this reason, we exclusively use nucleus sampling in §5, but we report beam search as a baseline in Figure 7, with a ‘candidate set size’ of n corresponding to the top n beam candidates, or n candidates with nucleus sampling for other results.
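For readers unfamiliar with nucleus sampling, the Python sketch below shows the truncation step on a single toy distribution and then draws n samples from it. It is only an illustration under simplifying assumptions: real decoders apply this filter to the next-token distribution at every generation step, and the names `nucleus_filter` and `sample_candidates` are hypothetical.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of top items whose cumulative mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to one.
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


def sample_candidates(probs, top_p, n, rng):
    """Draw n samples from the truncated, renormalized distribution."""
    nucleus = nucleus_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=n)


# Toy distribution over four outputs.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
nucleus = nucleus_filter(probs, top_p=0.9)
draws = sample_candidates(probs, top_p=0.9, n=20, rng=random.Random(0))
print(sorted(nucleus), draws)
```

Unlike taking the top n beams, repeated sampling can return the same output more than once, so a candidate set of size n built this way reflects how probable each output is rather than only listing the n most likely ones.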

Oracle.

As the final MBR performance can be impacted both by the quality of the candidate set and the choice of utility metric, we report an upper-bound performance by deliberately selecting the best candidate generations. Given a test set with gold-standard references $\mathcal{R}^*$, we define the oracle performance as the set of the highest scoring possible selection of candidates:

$$\mathrm{Oracle}(\mathcal{R}^*) = \sum_{r \in \mathcal{R}^*} \max_{y \in \mathcal{H}} \left[ U(y, r) \right] \tag{6}$$
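A minimal Python sketch of the oracle computation follows. It assumes one candidate set and one gold reference per test item, uses a toy token-overlap (Jaccard) utility in place of a trained metric, and returns the per-item best scores so that they can be summed or averaged as needed; the names `jaccard` and `oracle_scores` are hypothetical.

```python
def jaccard(candidate, reference):
    """Toy utility: token-set overlap between a candidate and a reference."""
    cand_tokens, ref_tokens = set(candidate.split()), set(reference.split())
    return len(cand_tokens & ref_tokens) / len(cand_tokens | ref_tokens)


def oracle_scores(candidate_sets, references, utility):
    """For each test item, the best utility any candidate reaches against the gold reference."""
    return [
        max(utility(candidate, reference) for candidate in candidates)
        for candidates, reference in zip(candidate_sets, references)
    ]


# Two test items, each with its own candidate set and gold reference.
candidate_sets = [["a b c", "a b"], ["x y", "z"]]
references = ["a b", "x z"]

per_item = oracle_scores(candidate_sets, references, jaccard)
print(per_item, sum(per_item))
```

Because the oracle selects with access to the gold reference, it isolates the quality of the candidate set from the quality of the utility metric used for selection.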

Since code generation is evaluated using pass@1, its oracle uses expected pass@k (Shi et al., 2022), which measures whether at least one candidate within the candidate set passes all unit tests 𝒯:

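$$\mathrm{ExPass@}K = \mathbb{E}_{|\mathcal{H}|=K} \left[ \max_{y \in \mathcal{H}} \min_{t \in \mathcal{T}} \mathbb{1}[t(y)] \right] \tag{7}$$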
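The expectation in this definition can be computed in two equivalent ways, sketched below in Python: by enumerating every size-K subset of the sampled candidates, or with the closed form 1 - C(n - c, K) / C(n, K), where n is the number of sampled candidates and c the number that pass all tests. The function names are hypothetical, and treating the K-subsets as drawn uniformly from the n available candidates is an assumption.

```python
from itertools import combinations
from math import comb


def expected_pass_brute_force(results, k):
    """Average over all size-k subsets of: does any candidate pass every test?

    results[i][j] is True if candidate i passes unit test j.
    """
    subsets = list(combinations(range(len(results)), k))
    hits = sum(
        # max over candidates of (min over tests of the pass indicator)
        any(all(results[i]) for i in subset)
        for subset in subsets
    )
    return hits / len(subsets)


def expected_pass_closed_form(num_candidates, num_passing, k):
    """Closed form: one minus the chance that a size-k subset has no passing candidate."""
    return 1.0 - comb(num_candidates - num_passing, k) / comb(num_candidates, k)


# Four candidates, two unit tests; only the first candidate passes both.
results = [
    [True, True],
    [True, False],
    [False, False],
    [False, True],
]
num_passing = sum(all(row) for row in results)

brute = expected_pass_brute_force(results, k=2)
closed = expected_pass_closed_form(len(results), num_passing, k=2)
print(brute, closed)
```

The closed form avoids enumerating subsets, which matters when the candidate set contains hundreds of generations.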

Results.

As oracle performance measures candidate set quality independent of the utility metric, we find an increase in oracle performance coincides with an improvement when using multi-prompt, indicating that a utility metric can naturally select candidates when the candidate set is of higher quality. This suggests improving utility metrics may be a promising direction to bridge the gap between candidate quality and candidate selection. Beam search was a strong baseline for small candidate set sizes, particularly for code generation, but it benefits less as the candidate set size increases. Additionally, as code generation is evaluated using the binary pass@1 metric, rather than a scalar quality metric as used by translation and simplification, there is a large gap between MBR and oracle performance, also observed by Shi et al. (2022).

C.2. En-XX Translation Results

For brevity, we limit our multi-prompt experiments to only the English-Czech language pair, but report results across the full ALMA test set, including WMT ’22 test data and a subset of NTREX (Federmann et al., 2022), in Figure 8, where we observe improvement with multi-prompt is dependent on the language pair. Generally, high resource languages (such as French, German, Russian) do not have a substantial difference, which may be a result of the low prompt sensitivity for such pairs.

C.3. Additional Multi-Prompt Results

In our main experiments, the single prompt setup uses a randomly selected prompt from the prompt bank. Instead, we experiment with using the prompt with the highest prompt usage p(ρ) on the held-out 20% of each dataset. In Figure 9, we report the performance of each method using the same setup as the main experiment (Figure 1) but using the alternative single prompt setup. For translation, we observe single-prompt and multi-prompt show a smaller performance difference. For text simplification, the highest usage prompt outperforms multi-prompt for small candidate sizes.
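The alternative baseline can be sketched in Python as below, assuming prompt usage is estimated as the fraction of held-out inputs for which the selected output came from a given prompt. The function name `prompt_usage` and the toy prompt identifiers are hypothetical.

```python
from collections import Counter


def prompt_usage(selected_prompt_ids):
    """Estimate usage as the share of selected outputs produced by each prompt."""
    counts = Counter(selected_prompt_ids)
    total = len(selected_prompt_ids)
    return {prompt: count / total for prompt, count in counts.items()}


# For each held-out input, the prompt whose generation was selected by MBR (toy data).
selected_on_held_out = ["prompt_3", "prompt_1", "prompt_3", "prompt_2", "prompt_3"]

usage = prompt_usage(selected_on_held_out)
best_single_prompt = max(usage, key=usage.get)
print(usage, best_single_prompt)
```

The highest-usage prompt then serves as the single prompt baseline, in place of a prompt drawn at random from the prompt bank.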

Figure 8:

Multi-prompt and single prompt performance of ALMA 7B R across En-XX translation pairs. For low resource language pairs (e.g., Urdu, Turkish, Czech) we observe larger performance improvements compared to high resource pairs (e.g., French, German, Russian).

Figure 9:

Multi-prompt and single prompt MBR results from the setup in Figure 1 with a different single prompt baseline. The single prompt was chosen as the highest usage p(ρ) on the held-out dataset.

C.4. Additional Prompt Selection Results

To further compare prompt sampling and prompt selection with the same candidate set size, we replicate the same experiment as Table 1, but modify prompt selection (bottom) to use 10 candidates for each prompt, such that both sampling and selection use 100 candidates. We find similar results when comparing between prompt selection methods, where at least one selection method leads to a statistically significant improvement on each task. However, all prompt selection methods under-perform prompt sampling. This underscores the benefit of the increased diversity from generating using a full prompt bank with multi-prompt.
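The two ways of spending a budget of 100 candidates can be sketched in Python as follows. This is an illustration under assumptions rather than the experimental code: prompt sampling is assumed to draw one prompt per candidate in proportion to a usage weight, prompt selection is assumed to split the budget evenly across the chosen prompts, and the function names and uniform toy weights are hypothetical.

```python
import random
from collections import Counter


def prompt_sampling(usage, num_candidates, rng):
    """Draw one prompt per candidate, with probability proportional to usage."""
    prompts = list(usage)
    weights = [usage[prompt] for prompt in prompts]
    return rng.choices(prompts, weights=weights, k=num_candidates)


def prompt_selection(selected_prompts, num_candidates):
    """Split the candidate budget evenly across a fixed subset of prompts."""
    per_prompt, remainder = divmod(num_candidates, len(selected_prompts))
    if remainder:
        raise ValueError("budget must divide evenly across the selected prompts")
    return [prompt for prompt in selected_prompts for _ in range(per_prompt)]


# A bank of 100 prompts with uniform toy usage weights.
bank = [f"prompt_{i}" for i in range(100)]
usage = {prompt: 1.0 for prompt in bank}

sampled = prompt_sampling(usage, num_candidates=100, rng=random.Random(0))
selected = prompt_selection(bank[:10], num_candidates=100)

print(len(set(sampled)), "distinct prompts used by sampling")
print(Counter(selected).most_common(3))
```

Both allocations produce 100 candidates, but selection concentrates them on 10 prompts while sampling can spread them across the whole bank, which is the difference in diversity the comparison above is designed to isolate.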

C.5. Detailed Multi-Model Results

Figure 10 contains separate results for multi-prompt and single prompt for each model, as reported in Figure 5 and discussed in §5.3.

Table 6:

Results for prompt sampling using 100 prompts (top) and subset selection with 100 candidates using 10 of 100 prompts (bottom).

Method                          pass@1    Lens      Comet
Single Prompt (|ℋ|=100)         48.78     74.67     88.93
Multi-Prompt + Prompt Sampling (|𝒫|=100, |ℋ|=100)
Random Selection                -         74.91*    89.98*
Prompt Sampling                 -         78.29*    90.33*
Top-p Prompt Random             -         78.61*    90.11*
Top-p Prompt Sampling           -         79.08*    90.36*
Single Prompt (|ℋ|=100)         48.78     74.67     88.93
Multi-Prompt + Prompt Selection (|𝒫_best|=10, |ℋ|=100)
Random Selection                47.40     70.95     89.90*
k-NN Cluster Random             45.73     72.04     90.14*
Farthest Similarity             49.17*    71.64     90.18*
Closest Similarity              45.73     72.17     90.87*
Highest Performance             -         72.56     90.27*
k-NN Cluster Performance        -         75.88*    90.43*

* = Statistically significant improvement with p < 0.05.

C.6. Detailed Cross Metric Evaluation

Table 8 contains the full results for the MBR experiments across metrics as discussed in §5.4. While using the same metric for MBR and the final evaluation exhibits the highest improvement (see entries on the diagonal), we find that multi-prompt using any value metric universally improves performance when evaluated on any other metric. Recent neural metrics, which achieve higher correlation with human judgements, also have a higher overall performance. Note that MetricX scores fall within the range [0, 25], corresponding to an MQM rating where lower is better, and Sle scores fall within the range [0, 4], corresponding to a Newsela simplification rating where higher is better. For clarity, we negate the MetricX results in Table 3 such that all the green cells indicate a metric improvement.

Table 7:

Prompts with highest usage for multi-prompt using the held-out split for simplification and translation.

Top 10 GPT-4 Generated Text Simplification Prompts (Sorted by No. Generations Selected)
Rewrite the following sentence in a simplified manner, making sure the same meaning and message are still conveyed clearly. The simplification should be done such that it can be read and understood easily by an individual who may not have knowledge of the English language or any disabilities that limit their understanding.
Please simplify the following sentence so that it is easy to understand by people with disabilities or those who are unfamiliar with English. Try to use shorter words, fewer clauses, and a simpler structure.
Simplify this sentence such that a non-English speaker or a person with disabilities is able to understand the sentence. Focus on replacing complex words and structures with simpler ones, while keeping the meaning intact. You can remove unnecessary words, break up longer phrases, and generally make the text more readable.
Text simplification is an important task in natural language processing for creating a simplified version of a sentence that conveys the same meaning as the original sentence but with less complex language. For this task, you will be given a sentence and asked to rewrite it using simpler words and structures so that a non-English speaker or an individual with disabilities can better understand it. Please use semantic compression to create a simplified version of the following sentence.
You are an artificial intelligence designed to simplify written text. The text you are given may be complex, and your job is to rewrite it in a way that a non-english speaker or an individual with disabilities could easily understand. While you simplify the text, you should make sure it is grammatically correct and retains the original meaning of the text.
You are an AI assistant tasked with creating a simpler version of a text. Text simplification can be defined as the reduction of the syntactic or lexical complexity of a text without changing its meaning. The aim of text simplification is to make the text easier to understand for a human or process by a program. Please simplify the following sentence.
Rewrite this sentence in a simple and easy to understand way. Make sure to retain the meaning and ideas of the original sentence while using shorter words and sentences.
Create a simpler version of the sentence below so that it can be better understood by non-English speakers or individuals with disabilities. Text simplification techniques should be used to reduce the complexity of the language while preserving the original meaning and information.
You are an AI assistant that writes text simplification. Text simplification can be defined as any process that reduces the syntactic or lexical complexity of a text while attempting to preserve its meaning and information content. The aim of text simplification is to make text easier to comprehend for a human user, or process by a program. Your task is to take the following sentence and produce a simplified version that would be easier for a non-English speaker or someone with disabilities to understand. Please simplify the sentence.
This prompt asks you to simplify the given sentence. In order to do so, reduce the sentence to its most basic and clear components. Remove unnecessary words, clauses, and phrases that can be inferred from the context. Use shorter, more concise words where possible. After simplifying, the resulting sentence should still convey the same essential message.
Top 5 Randomly Sampled Few-shot Translation Instructions (Sorted by No. Generations Selected)
Anglická věta: To do this, simply access your order page, tap ‘Help and support’ and choose the option ‘Call rider’.
Česká věta: Chcete-li to provést, jednoduše přejděte na stránku objednávky, klikněte na „Nápověda a podpora“ a vyberte možnost „Zavolat jezdci“.
Anglická věta: A private mass and the national anthem preceded the ceremony, which featured a portrait of De Klerk between two candles and a choir decorated with white flowers.
Česká věta: Soukromá mše a státní hymna předcházely tomuto ceremoniálu, který představil portrét De Klerka mez dvěma svíčkami a sbor ozdobený bílými květy.
Anglická věta: After that, we cannot offer an estimate on delivery times as it comes down to individual country’s postal service and customs if outside of the EU.
Česká věta: Poté nemůžeme odhadnout dobu dodání, protože záleží na poštovních a celních službách v jednotlivých zemích, pokud se nacházejí mimo EU.
Anglická věta: This item is an original American comic and is in English!
Česká věta: Tato položka je originální americký komiks a je v angličtině!
Anglická věta: If they cannot find you they will surely call.
Česká věta: Pokud vás nenajdou, určitě zavolají.
Anglická věta: New Zealand’s computer emergency response team was among the first to report that the flaw was being “actively exploited in the wild” just hours after it was publicly reported Thursday and a patch released.
Česká věta: Tým Nového Zélandu pro reakci na počítačové ohrožení byl mezi prvními, kdo nahlásil, že tato závada se „aktivně divocĕ zneužívá“ jen pár hodin po tom, co byla veřejně nahlášena ve čtvrtek a byla vydána záplata.
Anglická věta: Not sure, but I don’t think we had any way of having them pay.
Česká věta: Nejsem si jistý, ale nemyslím si, že bychom měli nějaký způsob, aby museli zaplatit.
Anglická věta: Luckily, the guy was honest and rather than trying to charge the higher price, he sold me the tires for the price I had on my printout.
Česká věta: Naštěstí byl ten chlapík čestný a než aby se pokoušel účtovat vyšší cenu, prodal mi pneumatiky za cenu, kterou jsem měl na mém výtisku.
Anglická věta: The Cowboys just made sure Zeke and his teammates got that opportunity.
Česká věta: Cowboys se právě postarali o to, aby Zeke a jeho spoluhráči tuto příležitost dostali.
Anglická věta: Description Please scroll to the bottom of the listing for more pictures.
Česká věta: Popis Pro více obrázků sjeďte na konec nabídky.
Anglická věta: This is on a quote only basis and you need to supply us with your address for a quotation.
Česká věta: Tato služba je poskytována pouze na základě cenové nabídky dle vámi poskytnuté adresy.
Anglická věta: Fed up completely, she asks “Are you even going to work today?”
Česká věta: Totálně znechucená se ptá: „Budeš dnes vůbec pracovat?“
Anglická věta: So there was the usual gentle chaos that attends any gathering of toddlers.
Česká věta: Takže nastal obvyklý mírný chaos, který provází každé setkání batolat.
Anglická věta: We currently do not have the exact information on what happened to the rider as well as to your order.
Česká věta: V současné době nemáme přesné informace o tom, co se stalo s jezdcem, stejně jako s vaší objednávkou.
Anglická věta: UK media reported that “thousands” were eager to raise cash for the protesters by purchasing the gray T-shirt, which depicts an empty plinth with “Bristol” written above it.
Česká věta: Média ve Velké Británii hlásila, že „tisíce lidí“ nedočkavé vybírali hotovost pro protestující zakoupením šedého trička, které zobrazuje prázdný podstavec s napsaným Bristol nad ním.
Anglická věta: A. No, we do not include receipts in packages unless requested.
Česká věta: A. Ne, účtenku nepřikládáme, pokud to není požadováno.
Anglická věta: Russia warned of ‘consequences’ if Ukraine attacked
Česká věta: Rusko bylo varováno před “následky“, pokud napadne Ukrajinu
Anglická věta: He noted that up to 90% of all Russian investments in the Arab world are made in the UAE.
Česká věta: Poznamenal, že až 90 % ruských investicí v arabském světě jsou prováděny v SAE.
Anglická věta: Many view the Softie 12 Osprey the ultimate four season synthetic fill sleeping bag available.
Česká věta: Mnohými je spací pytel Softie 12 Osprey považován za nejlepší dostupný čtyřsezónní spacák se syntetickou výplní.
Anglická věta: - Sign out and signing back in to your eReader.
Česká věta: - Odhlaste se a přihlaste se znovu do vaší e-čtečky.
Anglická věta: I told ya so….
Česká věta: Říkala jsem vám to…
Anglická věta: All information about the products on our website is provided for information purposes only.
Česká věta: Všechny informace o produktech na našich internetových stránkách mají pouze informativní charakter.
Anglická věta: I’m in HR and have worked payroll in the past.
Česká věta: Jsem na personálním oddělení a v minulosti jsem pracoval na mzdovém.
Anglická věta: Years ago, I worked at a cabinet shop.
Česká věta: Před lety jsem pracoval v obchodě se skříněmi.
Anglická věta: De Klerk’s foundation issued a posthumous video apologizing “for the pain, hurt, indignity and damage that apartheid has done” to South Africa’s non-white populations.
Česká věta: Fond De Klerka vydal posmrtné video omlouvající se „za bolest, zranění, ponížení a škodu, kterou apartheid udělal „jihoafrickému nebělošskému obyvatelstvu“.

Figure 10:

Results of multi-prompt MBR compared to single prompt MBR across model sizes and architectures. Multi-prompt MBR consistently improves performance across architectures and as models scale. A candidate size of 1 is equivalent to standard, single-output decoding.

Table 8:

Multi-prompt and single prompt performance across metrics.

rf = Reference-free reranker.

Footnotes

1. Our experiment code, data and prompts are available at https://github.com/davidheineman/multi-prompt.

References

  1. Achiam Josh, Adler Steven, Agarwal Sandhini, Ahmad Lama, Akkaya Ilge, Aleman Florencia Leoni, Almeida Diogo, Altenschmidt Janko, Altman Sam, Anadkat Shyamal, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774. [Google Scholar]
  2. Agrawal Sweta, Zhou Chunting, Lewis Mike, Zettlemoyer Luke, and Ghazvininejad Marjan. 2023. Incontext examples selection for machine translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8857–8873, Toronto, Canada. Association for Computational Linguistics. [Google Scholar]
  3. Alva-Manchego Fernando, Martin Louis, Bordes Antoine, Scarton Carolina, Sagot Benoît, and Specia Lucia. 2020. ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4668–4679, Online. Association for Computational Linguistics. [Google Scholar]
  4. Alva-Manchego Fernando, Scarton Carolina, and Specia Lucia. 2021. The (un) suitability of automatic evaluation metrics for text simplification. Computational Linguistics, 47(4):861–889. [Google Scholar]
  5. Alves Duarte M, Pombal José, Guerreiro Nuno M, Martins Pedro H, Alves João, Farajian Amin, Peters Ben, Rei Ricardo, Fernandes Patrick, Agrawal Sweta, et al. 2024. Tower: An open multilingual large language model for translation-related tasks. arXiv preprint arXiv:2402.17733.
  6. Amrhein Chantal and Sennrich Rico. 2022. Identifying weaknesses in machine translation metrics through minimum Bayes risk decoding: A case study for COMET. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1125–1141, Online only. Association for Computational Linguistics.
  7. Barrault Loïc, Bojar Ondřej, Costa-jussà Marta R., Federmann Christian, Fishel Mark, Graham Yvette, Haddow Barry, Huck Matthias, Koehn Philipp, Malmasi Shervin, Monz Christof, Müller Mathias, Pal Santanu, Post Matt, and Zampieri Marcos. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.
  8. Bertsch Amanda, Xie Alex, Neubig Graham, and Gormley Matthew. 2023. It’s MBR all the way down: Modern generation techniques through the lens of minimum Bayes risk. In Proceedings of the Big Picture Workshop, pages 108–122, Singapore. Association for Computational Linguistics.
  9. Bickel Peter J and Doksum Kjell A. 1977. Mathematical statistics: Basic ideas and selected topics, volumes I-II package. Chapman and Hall/CRC.
  10. Chen Mark, Tworek Jerry, Jun Heewoo, Yuan Qiming, de Oliveira Pinto Henrique Ponde, Kaplan Jared, Edwards Harri, Burda Yuri, Joseph Nicholas, Brockman Greg, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  11. Cheng Julius and Vlachos Andreas. 2023. Faster minimum Bayes risk decoding with confidence-based pruning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12473–12480, Singapore. Association for Computational Linguistics.
  12. Chung Hyung Won, Hou Le, Longpre Shayne, Zoph Barret, Tay Yi, Fedus William, Li Eric, Wang Xuezhi, Dehghani Mostafa, Brahma Siddhartha, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  13. Cripwell Liam, Legrand Joël, and Gardent Claire. 2023. Simplicity level estimate (SLE): A learned referenceless metric for sentence simplification. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12053–12059, Singapore. Association for Computational Linguistics.
  14. Deguchi Hiroyuki, Sakai Yusuke, Kamigaito Hidetaka, Watanabe Taro, Tanaka Hideki, and Utiyama Masao. 2024. Centroid-based efficient minimum Bayes risk decoding. arXiv preprint arXiv:2402.11197.
  15. Eikema Bryan and Aziz Wilker. 2020. Is MAP decoding all you need? the inadequacy of the mode in neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4506–4520, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  16. Eikema Bryan and Aziz Wilker. 2022. Sampling-based approximations to minimum Bayes risk decoding for neural machine translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10978–10993, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  17. Farinhas António, de Souza José G. C., and Martins André F. T. 2023. An empirical study of translation hypothesis ensembling with large language models. Preprint, arXiv:2310.11430.
  18. Federmann Christian, Kocmi Tom, and Xin Ying. 2022. NTREX-128 – news test references for MT evaluation of 128 languages. In Proceedings of the First Workshop on Scaling Up Multilingual Evaluation, pages 21–24, Online. Association for Computational Linguistics.
  19. Fernandes Patrick, Farinhas António, Rei Ricardo, de Souza José G. C., Ogayo Perez, Neubig Graham, and Martins Andre. 2022. Quality-aware decoding for neural machine translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1396–1412, Seattle, United States. Association for Computational Linguistics.
  20. Freitag Markus, Foster George, Grangier David, Ratnakar Viresh, Tan Qijun, and Macherey Wolfgang. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.
  21. Freitag Markus, Ghorbani Behrooz, and Fernandes Patrick. 2023a. Epsilon sampling rocks: Investigating sampling strategies for minimum Bayes risk decoding for machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9198–9209, Singapore. Association for Computational Linguistics.
  22. Freitag Markus, Grangier David, Tan Qijun, and Liang Bowen. 2022a. High quality rather than high model probability: Minimum Bayes risk decoding with neural metrics. Transactions of the Association for Computational Linguistics, 10:811–825.
  23. Freitag Markus, Mathur Nitika, Lo Chi-kiu, Avramidis Eleftherios, Rei Ricardo, Thompson Brian, Kocmi Tom, Blain Frederic, Deutsch Daniel, Stewart Craig, Zerva Chrysoula, Castilho Sheila, Lavie Alon, and Foster George. 2023b. Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent. In Proceedings of the Eighth Conference on Machine Translation, pages 578–628, Singapore. Association for Computational Linguistics.
  24. Freitag Markus, Rei Ricardo, Mathur Nitika, Lo Chi-kiu, Stewart Craig, Avramidis Eleftherios, Kocmi Tom, Foster George, Lavie Alon, and Martins André F. T. 2022b. Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 46–68, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
  25. Gonen Hila, Iyer Srini, Blevins Terra, Smith Noah, and Zettlemoyer Luke. 2023. Demystifying prompts in language models via perplexity estimation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10136–10148, Singapore. Association for Computational Linguistics.
  26. Goyal Naman, Du Jingfei, Ott Myle, Anantharaman Giri, and Conneau Alexis. 2021. Larger-scale transformers for multilingual masked language modeling. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 29–33, Online. Association for Computational Linguistics.
  27. Guerreiro Nuno M, Rei Ricardo, van Stigt Daan, Coheur Luisa, Colombo Pierre, and Martins André FT. 2023. xCOMET: Transparent machine translation evaluation through fine-grained error detection. arXiv preprint arXiv:2310.10482.
  28. Heineman David, Dou Yao, Maddela Mounica, and Xu Wei. 2023. Dancing between success and failure: Edit-level simplification evaluation using SALSA. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3466–3495, Singapore. Association for Computational Linguistics.
  29. Hendy Amr, Abdelrehim Mohamed, Sharaf Amr, Raunak Vikas, Gabr Mohamed, Matsushita Hitokazu, Kim Young Jin, Afify Mohamed, and Awadalla Hany Hassan. 2023. How good are GPT models at machine translation? A comprehensive evaluation. arXiv preprint arXiv:2302.09210.
  30. Holtzman Ari, Buys Jan, Du Li, Forbes Maxwell, and Choi Yejin. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.
  31. Jaeger T and Levy Roger. 2006. Speakers optimize information density through syntactic reduction. Advances in neural information processing systems, 19.
  32. Jain Siddhartha, Ma Xiaofei, Deoras Anoop, and Xiang Bing. 2023. Self-consistency for open-ended generations. arXiv preprint arXiv:2307.06857.
  33. Jiang Albert Q, Sablayrolles Alexandre, Mensch Arthur, Bamford Chris, Chaplot Devendra Singh, de las Casas Diego, Bressand Florian, Lengyel Gianna, Lample Guillaume, Saulnier Lucile, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.
  34. Jiang Chao, Maddela Mounica, Lan Wuwei, Zhong Yang, and Xu Wei. 2020. Neural CRF model for sentence alignment in text simplification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7943–7960, Online. Association for Computational Linguistics.
  35. Juraska Juraj, Finkelstein Mara, Deutsch Daniel, Siddhant Aditya, Mirzazadeh Mehdi, and Freitag Markus. 2023. MetricX-23: The Google submission to the WMT 2023 metrics shared task. In Proceedings of the Eighth Conference on Machine Translation, pages 756–767, Singapore. Association for Computational Linguistics.
  36. Karpinska Marzena, Raj Nishant, Thai Katherine, Song Yixiao, Gupta Ankita, and Iyyer Mohit. 2022. DEMETR: Diagnosing evaluation metrics for translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9540–9561, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  37. Khashabi Daniel, Lyu Xinxi, Min Sewon, Qin Lianhui, Richardson Kyle, Welleck Sean, Hajishirzi Hannaneh, Khot Tushar, Sabharwal Ashish, Singh Sameer, and Choi Yejin. 2022. Prompt waywardness: The curious case of discretized interpretation of continuous prompts. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3631–3643, Seattle, United States. Association for Computational Linguistics.
  38. Kobayashi Hayato. 2018. Frustratingly easy model ensemble for abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4165–4176, Brussels, Belgium. Association for Computational Linguistics.
  39. Kocmi Tom, Bawden Rachel, Bojar Ondřej, Dvorkovich Anton, Federmann Christian, Fishel Mark, Gowda Thamme, Graham Yvette, Grundkiewicz Roman, Haddow Barry, Knowles Rebecca, Koehn Philipp, Monz Christof, Morishita Makoto, Nagata Masaaki, Nakazawa Toshiaki, Novák Michal, Popel Martin, and Popović Maja. 2022. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
  40. Koehn Philipp. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.
  41. Kumar Shankar and Byrne William. 2004. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176, Boston, Massachusetts, USA. Association for Computational Linguistics.
  42. Kwon Woosuk, Li Zhuohan, Zhuang Siyuan, Sheng Ying, Zheng Lianmin, Yu Cody Hao, Gonzalez Joseph E., Zhang Hao, and Stoica Ion. 2023. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  43. Li Raymond, Allal Loubna Ben, Zi Yangtian, Muennighoff Niklas, Kocetkov Denis, Mou Chenghao, Marone Marc, Akiki Christopher, Li Jia, Chim Jenny, et al. 2023. StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161.
  44. Lu Yao, Bartolo Max, Moore Alastair, Riedel Sebastian, and Stenetorp Pontus. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland. Association for Computational Linguistics.
  45. Maddela Mounica, Dou Yao, Heineman David, and Xu Wei. 2023. LENS: A learnable evaluation metric for text simplification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16383–16408, Toronto, Canada. Association for Computational Linguistics.
  46. Martínez Lorenzo Abelardo Carlos, Huguet Cabot Pere Lluís, and Navigli Roberto. 2023. AMRs assemble! learning to ensemble with autoregressive models for AMR parsing. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1595–1605, Toronto, Canada. Association for Computational Linguistics.
  47. Min Sewon, Lyu Xinxi, Holtzman Ari, Artetxe Mikel, Lewis Mike, Hajishirzi Hannaneh, and Zettlemoyer Luke. 2022. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  48. Mishra Swaroop, Khashabi Daniel, Baral Chitta, Choi Yejin, and Hajishirzi Hannaneh. 2022. Reframing instructional prompts to GPTk’s language. In Findings of the Association for Computational Linguistics: ACL 2022, pages 589–612, Dublin, Ireland. Association for Computational Linguistics.
  49. Müller Mathias and Sennrich Rico. 2021. Understanding the properties of minimum Bayes risk decoding in neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 259–272, Online. Association for Computational Linguistics.
  50. Ouyang Long, Wu Jeffrey, Jiang Xu, Almeida Diogo, Wainwright Carroll, Mishkin Pamela, Zhang Chong, Agarwal Sandhini, Slama Katarina, Ray Alex, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  51. Papineni Kishore, Roukos Salim, Ward Todd, and Zhu Wei-Jing. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  52. Post Matt. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
  53. Qin Chengwei, Zhang Aston, Zhang Zhuosheng, Chen Jiaao, Yasunaga Michihiro, and Yang Diyi. 2023. Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.
  54. Rei Ricardo, Farinha Ana C, Zerva Chrysoula, van Stigt Daan, Stewart Craig, Ramos Pedro, Glushkova Taisiya, Martins André F. T., and Lavie Alon. 2021. Are references really needed? Unbabel-IST 2021 submission for the metrics shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 1030–1040, Online. Association for Computational Linguistics.
  55. Rei Ricardo, Guerreiro Nuno M., Pombal José, van Stigt Daan, Treviso Marcos, Coheur Luisa, de Souza José G. C., and Martins André. 2023. Scaling up CometKiwi: Unbabel-IST 2023 submission for the quality estimation shared task. In Proceedings of the Eighth Conference on Machine Translation, pages 841–848, Singapore. Association for Computational Linguistics.
  56. Rei Ricardo, Stewart Craig, Farinha Ana C, and Lavie Alon. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online. Association for Computational Linguistics.
  57. Rei Ricardo, Treviso Marcos, Guerreiro Nuno M., Zerva Chrysoula, Farinha Ana C, Maroti Christine, de Souza José G. C., Glushkova Taisiya, Alves Duarte, Coheur Luisa, Lavie Alon, and Martins André F. T. 2022. CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–645, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
  58. Reimers Nils and Gurevych Iryna. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
  59. Roziere Baptiste, Gehring Jonas, Gloeckle Fabian, Sootla Sten, Gat Itai, Tan Xiaoqing Ellen, Adi Yossi, Liu Jingyu, Remez Tal, Rapin Jérémy, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
  60. Sclar Melanie, Choi Yejin, Tsvetkov Yulia, and Suhr Alane. 2023. Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324.
  61. Sheang Kim Cheng and Saggion Horacio. 2021. Controllable sentence simplification with a unified text-to-text transfer transformer. In Proceedings of the 14th International Conference on Natural Language Generation, pages 341–352, Aberdeen, Scotland, UK. Association for Computational Linguistics.
  62. Shi Freda, Fried Daniel, Ghazvininejad Marjan, Zettlemoyer Luke, and Wang Sida I. 2022. Natural language to code translation with execution. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3533–3546, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  63. Srivastava Aarohi, Rastogi Abhinav, Rao Abhishek, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
  64. Suzgun Mirac, Melas-Kyriazi Luke, and Jurafsky Dan. 2023. Follow the wisdom of the crowd: Effective text generation via minimum Bayes risk decoding. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4265–4293, Toronto, Canada. Association for Computational Linguistics.
  65. Team Gemma, Mesnard Thomas, Hardin Cassidy, Dadashi Robert, Bhupatiraju Surya, Pathak Shreya, Sifre Laurent, Rivière Morgane, Kale Mihir Sanjay, Love Juliette, et al. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.
  66. Touvron Hugo, Martin Louis, Stone Kevin, Albert Peter, Almahairi Amjad, Babaei Yasmine, Bashlykov Nikolay, Batra Soumya, Bhargava Prajjwal, Bhosale Shruti, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  67. Üstün Ahmet, Aryabumi Viraat, Yong Zheng-Xin, Ko Wei-Yin, D’souza Daniel, Onilude Gbemileke, Bhandari Neel, Singh Shivalika, Ooi Hui-Lee, Kayid Amr, et al. 2024. Aya model: An instruction finetuned open-access multilingual language model. arXiv preprint arXiv:2402.07827.
  68. Vamvas Jannis and Sennrich Rico. 2024. Linear-time minimum Bayes risk decoding with reference aggregation. arXiv preprint arXiv:2402.04251.
  69. Wang Xuezhi, Wei Jason, Schuurmans Dale, Le Quoc V, Chi Ed H., Narang Sharan, Chowdhery Aakanksha, and Zhou Denny. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
  70. Wei Jason, Bosma Maarten, Zhao Vincent, Guu Kelvin, Yu Adams Wei, Lester Brian, Du Nan, Dai Andrew M., and Le Quoc V. 2022a. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
  71. Wei Jason, Wang Xuezhi, Schuurmans Dale, Bosma Maarten, Xia Fei, Chi Ed, Le Quoc V, Zhou Denny, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
  72. White Jules, Fu Quchen, Hays Sam, Sandborn Michael, Olea Carlos, Gilbert Henry, Elnashar Ashraf, Spencer-Smith Jesse, and Schmidt Douglas C. 2023. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382.
  73. Xu Haoran, Kim Young Jin, Sharaf Amr, and Awadalla Hany Hassan. 2023. A paradigm shift in machine translation: Boosting translation performance of large language models. arXiv preprint arXiv:2309.11674.
  74. Xu Haoran, Sharaf Amr, Chen Yunmo, Tan Weiting, Shen Lingfeng, Van Durme Benjamin, Murray Kenton, and Kim Young Jin. 2024. Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. arXiv preprint arXiv:2401.08417.
  75. Xu Wei, Callison-Burch Chris, and Napoles Courtney. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.
  76. Xu Wei, Napoles Courtney, Pavlick Ellie, Chen Quanze, and Callison-Burch Chris. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.
  77. Xue Linting, Constant Noah, Roberts Adam, Kale Mihir, Al-Rfou Rami, Siddhant Aditya, Barua Aditya, and Raffel Colin. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
  78. Yang Chengrun, Wang Xuezhi, Lu Yifeng, Liu Hanxiao, Le Quoc V, Zhou Denny, and Chen Xinyun. 2023. Large language models as optimizers. arXiv preprint arXiv:2309.03409.
  79. Zhang Tianyi, Kishore Varsha, Wu Felix, Weinberger Kilian Q., and Artzi Yoav. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  80. Zhang Tianyi, Yu Tao, Hashimoto Tatsunori, Lewis Mike, Yih Wen-tau, Fried Daniel, and Wang Sida. 2023. Coder reviewer reranking for code generation. In International Conference on Machine Learning, pages 41832–41846. PMLR.
  81. Zhou Yongchao, Muresanu Andrei Ioan, Han Ziwen, Paster Keiran, Pitis Silviu, Chan Harris, and Ba Jimmy. 2023. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations.
