PLOS One. 2026 Feb 12;21(2):e0341905. doi: 10.1371/journal.pone.0341905

Fine-Tuning Arabic Large Language Models for improved multi-turn dialogue: A blueprint for synthetic data generation and benchmarking

Ahmed Mahmoud Misbah 1,*, Mohamed Farouk 2, Mustafa AbdulAzim 1
Editor: Mohammad Salah Hassan
PMCID: PMC12900375  PMID: 41678465

Abstract

The rapid evolution of Large Language Models (LLMs) has fueled increasing interest in developing Arabic conversational systems capable of sustaining coherent multi-turn dialogues. However, progress remains constrained by the scarcity of large-scale, diverse, and high-quality datasets specifically designed for Arabic multi-turn interaction. This study presents a reproducible methodology for constructing such a dataset through structured prompting of an instruction-tuned Arabic LLM (Jais-13b-chat), yielding 43,316 multi-turn conversations across 93 topics and 151 countries. Two pre-trained Arabic language models (ArabianGPT-08B-V2 and AraGPT2-mega) were fine-tuned on this synthetic data and benchmarked against multilingual instruction-tuned baselines using a comprehensive evaluation framework combining automatic metrics (Perplexity and RAVEN) with structured human evaluation. Fine-tuned ArabianGPT-08B-V2 achieved the highest RAVEN score (0.823) for cross-model comparison, outperforming both fine-tuned AraGPT2-mega and instruction-tuned baselines while maintaining strong within-model perplexity (9.4). Human evaluation by two independent raters demonstrated acceptable inter-rater reliability (Cohen’s κ = 0.229–0.739) with positive rank correlations (Spearman ρ = 0.424–0.759), yielding overall quality scores of 4.04–4.34 on a five-point scale. These findings demonstrate that high-quality, LLM-generated synthetic data effectively improves Arabic conversational models, providing a scalable, resource-efficient blueprint for dialogue systems in low-resource and culturally specific settings.

Introduction

The emergence of ChatGPT in late 2022 marked a significant advancement in conversational artificial intelligence, demonstrating unprecedented capabilities in natural language understanding and generation [1]. This development catalyzed widespread adoption of transformer-based large language models (LLMs) for dialogue systems, including Google’s Gemini, Meta’s LLaMA, Mistral, and Falcon. These models represent a fundamental departure from earlier rule-based and retrieval-based chatbot architectures in their ability to leverage massive pre-training corpora and generalize across diverse natural language processing tasks, including machine translation, text summarization, and sentiment analysis. Most notably, modern LLMs maintain contextual coherence across extended multi-turn conversations, a capability that distinguishes them from previous generations of conversational systems.

Despite these advances, initial deployments of mainstream LLMs exhibited limited multilingual support, particularly for morphologically rich languages such as Arabic. This limitation has motivated the development of dedicated Arabic language models. While several Arabic pre-trained models exist, including AraBERT [2], AraGPT2 [3], and AraT5 [4], most were designed for general language understanding tasks rather than conversational interaction. Consequently, relatively few models have been specifically adapted for Arabic dialogue systems [5–7], and the development of robust multi-turn conversational models remains constrained by a critical resource bottleneck: the scarcity of large-scale, high-quality Arabic dialogue datasets.

More recently, instruction-tuned Arabic language models such as Jais-13b-chat [8] have emerged as promising foundations for dialogue systems. These models are trained on substantial multilingual corpora containing billions of Arabic tokens and subsequently fine-tuned using instruction-following datasets to improve alignment with human communicative intent. While they demonstrate competence in single-turn interactions and basic multi-turn exchanges, they are not explicitly optimized for maintaining coherence across extended open-domain conversations. This limitation becomes particularly apparent in dialogues requiring sustained context tracking and topical continuity, highlighting the need for task-specific adaptation to enhance their multi-turn conversational capabilities.

The remainder of the manuscript is organized as follows. The specific challenges posed by existing Arabic dialogue datasets and the research gap they create are formalized in the “Problem Statement” subsection, followed by an overview of the principal contributions of this study. Section 2 reviews related work on Arabic conversational AI and dialogue dataset construction. Section 3 details the methodology, including synthetic data generation procedures, data quality assurance measures, and model fine-tuning protocols. Section 4 presents comprehensive benchmarking results using both automatic metrics and human evaluation. Section 5 provides an in-depth analysis of model performance. Finally, Section 6 synthesizes the key findings, discusses implications for Arabic NLP research, and identifies directions for future work.

Problem statement

The existing landscape of Arabic dialogue datasets presents significant challenges for model development. The JANA dataset [9] represents the only publicly available resource specifically designed for Arabic multi-turn conversation, yet its utility is constrained by both scale (3,000 conversations) and domain specificity (call center interactions). Other prominent Arabic NLP datasets, including ArabicaQA [10], InstAr-500k [11], and Arabic-SQuAD [12], while larger in scale, suffer from systematic limitations that compromise their effectiveness for dialogue modeling. First, many are derived from English sources through machine translation, inherently failing to capture the linguistic subtleties and cultural contexts intrinsic to authentic Arabic discourse. Second, they are predominantly structured for extractive single-turn tasks such as question answering, rather than generative multi-turn dialogue. Third, datasets sourced from social media platforms often contain heterogeneous dialectal variations, informal register, and inconsistent content quality. Finally, manual annotation by human experts, while yielding high-quality data, remains prohibitively resource-intensive and limits achievable dataset scale.

These constraints collectively prevent the development of Arabic language models capable of maintaining contextual coherence over extended interactions in open-domain conversational scenarios. To address this gap, this study proposes a methodology for constructing a large-scale synthetic Arabic dialogue dataset through structured prompting of instruction-tuned language models. The utility of this synthetic data is then demonstrated through fine-tuning of pre-trained Arabic language models and evaluation of their performance on multi-turn conversational tasks. This approach builds on successful precedents in English NLP, including InstructWild [13] and Self-Instruct [14], which have demonstrated that LLM-generated synthetic data can effectively approximate human-generated content across various natural language tasks. Beyond replicating surface-level linguistic patterns, synthetic data generation offers several methodological advantages: it is substantially more cost-effective than human annotation, enables rapid iteration and dataset expansion, and allows systematic control over domain coverage and stylistic variation. These characteristics make synthetic data generation particularly valuable for under-resourced languages where large-scale human annotation is impractical.

Novelty of work

This study addresses the data scarcity challenge for multi-turn Arabic dialogue through four principal contributions:

  1. Large-Scale Synthetic Arabic Multi-Turn Dataset: A unique dataset comprising 43,316 multi-turn Arabic conversations was generated through structured prompting of an instruction-tuned Arabic language model (Jais-13b-chat). Unlike existing Arabic dialogue datasets that are constrained by small scale (e.g., JANA's 3,000 conversations) or narrow domain coverage (e.g., call center interactions), the synthetic dataset spans 93 diverse topics across 151 countries, providing broad coverage of open-domain conversational contexts with extended dialogue length (mean: 14 turns, range: 5–111 turns).

  2. Reproducible Methodology for Controlled Synthetic Data Generation: A comprehensive technical framework for generating high-quality synthetic multi-turn dialogue data is presented, detailing the systematic application of prompt engineering and hyperparameter optimization to control conversational diversity, contextual depth, and linguistic quality. The methodology addresses specific challenges in Arabic dialogue generation, including cultural appropriateness and maintenance of multi-turn coherence. Complete documentation of the generation pipeline enables replication and adaptation to other languages and domains.

  3. Empirical Validation through Fine-Tuning: The practical utility of synthetic training data is demonstrated through fine-tuning of two pre-trained Arabic language models. Evaluation results show that models trained on synthetic data achieve performance comparable to or exceeding larger multilingual instruction-tuned baselines on multi-turn conversational tasks, establishing a resource-efficient pathway for developing specialized Arabic dialogue systems.

  4. Comprehensive Evaluation Framework for Arabic Multi-Turn Dialogue: A novel benchmark specifically designed for assessing Arabic conversational systems in multi-turn settings is introduced. The framework integrates automatic metrics with structured human evaluation across multiple quality dimensions (fluency, relevance, diversity). This evaluation protocol addresses the gap in rigorous assessment tools for Arabic dialogue systems and provides a foundation for future comparative studies.

Literature review

This section situates the present study within the broader landscape of Arabic conversational AI research, examining the evolution of chatbots, the current state of Arabic chatbot development, and the methodological approaches employed in constructing training datasets. Particular attention is given to the persistent resource constraints that have impeded progress in Arabic multi-turn dialogue modeling and the emerging role of synthetic data generation as a viable solution to these challenges.

Chatbots

The early 2000s witnessed a fundamental transformation in chatbot architectures, progressing from rudimentary pattern-matching systems to sophisticated neural dialogue models. Early chatbots relied on hand-crafted rules and template-based response generation, but the integration of Machine Learning (ML) and Natural Language Processing (NLP) techniques enabled more adaptive systems capable of learning from interaction data. Commercial deployments such as IBM Watson, Apple’s Siri, and Amazon’s Alexa exemplified this transition, implementing hybrid architectures that combined retrieval-based mechanisms with increasingly capable generative components. Within this paradigm, two broad categories of chatbots emerged: open-domain chatbots designed for unrestricted, general-purpose interaction, and closed-domain chatbots optimized for specific applications such as customer support or technical assistance.

The advent of transformer-based Large Language Models (LLMs) represented a qualitative advancement in chatbot capabilities. Unlike their predecessors, transformer architectures effectively model long-range dependencies in sequential data, enabling the generation of fluent, contextually coherent text across extended conversational exchanges. The Generative Pre-trained Transformer (GPT) family of models demonstrated that pre-training on massive text corpora yields representations sufficiently rich to support human-like text generation across diverse domains. This architectural foundation enabled the development of multi-turn chatbots capable of maintaining topical coherence and contextual awareness across extended dialogue turns, a capability that had previously eluded rule-based and retrieval-based approaches.

The success of GPT-based chatbots catalyzed widespread development of transformer-based dialogue systems. Microsoft’s DialoGPT [15], fine-tuned on Reddit conversational data, demonstrated that GPT-2 could be effectively adapted for open-domain dialogue generation. OpenAI’s ChatGPT, initially based on GPT-3.5 [16] and subsequently evolved to GPT-4o [17], established new benchmarks for conversational fluency and instruction-following capability. Concurrently, alternative transformer architectures emerged with comparable dialogue capabilities: Google’s Gemini [18], Meta’s LLaMA [19], Mistral’s Mixtral [20], and others [21] each demonstrated that instruction-tuning on diverse conversational corpora yields models capable of sophisticated multi-turn interaction. Collectively, these developments have established transformer-based LLMs as the dominant paradigm for contemporary chatbot development.

Arabic Chatbots

Despite the rapid advancement of chatbots for English and other high-resource languages, Arabic chatbot development has proceeded at a markedly slower pace. This disparity stems from two interrelated factors: the scarcity of large-scale Arabic training data and the inherent linguistic complexity of Arabic, which encompasses substantial dialectal variation alongside Modern Standard Arabic (MSA). Consequently, Arabic chatbots remain fewer in number and generally inferior in performance relative to their English counterparts.

Early Arabic chatbots relied predominantly on rule-based architectures with limited generalization capacity. ArabChat [22] employed pattern-matching techniques to generate responses to user queries, while BOTTA [23] implemented retrieval-based response selection specifically for Egyptian Arabic. Domain-specific applications such as Ollobot [24] demonstrated the feasibility of rule-based Arabic dialogue for constrained tasks like health tracking. However, a comprehensive survey by AlHumoud et al. [25] concluded that Arabic chatbot development remained nascent, fundamentally constrained by insufficient training resources. Some researchers attempted to circumvent data limitations through translation-based approaches: Mozannar et al. [12] developed SOQAL, a question-answering system leveraging translated resources, while Hajj et al. [26] explored sequence-to-sequence architectures for Arabic response generation.

The introduction of transformer-based Arabic language models substantially advanced the field. Antoun et al. [27] adapted the AraBERT architecture for conversational applications, while Al-Yahya et al. [6] developed AraConv using the multilingual mT5 model. Meshrif’s ArRASA system [7] incorporated the DIET (Dual Intent and Entity Transformer) architecture to enhance intent recognition and entity extraction in Arabic dialogue. These models demonstrated marked improvements over rule-based predecessors in both response quality and domain coverage.

More recently, the emergence of multilingual instruction-tuned models has further transformed Arabic conversational AI. Jais-Chat [8], trained on extensive multilingual corpora containing substantial Arabic content, represents a new generation of models that address the English-centric limitations of earlier architectures. These systems exhibit enhanced linguistic coverage and instruction-following capabilities, enabling more sophisticated dialogue management and broader query handling. Nevertheless, while multilingual models provide improved baseline performance for Arabic, they are not explicitly optimized for extended multi-turn coherence, suggesting continued need for task-specific adaptation using dedicated Arabic dialogue resources.

Arabic datasets for Chatbots

The rise of transformer-based pre-trained large language models (LLMs) has driven substantial progress in Natural Language Processing (NLP), particularly in conversational AI and chatbot development. These models are typically fine-tuned on specialized datasets tailored to specific downstream tasks. Early fine-tuning efforts primarily relied on monolingual, task-oriented datasets, such as English question-answering corpora [28]. Over time, however, the field has evolved toward more diverse and complex datasets, including multilingual corpora, instruction-tuning datasets, open-domain conversational data, and multi-turn dialogue datasets [29].

Non-task-oriented datasets are particularly important for training models to engage in multi-turn conversations, where interaction is not bound to a specific goal but aims to emulate natural, open-ended dialogue [30,31]. These datasets enable models to maintain context across multiple conversational turns, making them essential for building robust chatbots.

While significant advancements have been made in English and other high-resource languages, the development of Arabic-specific datasets remains limited. Existing Arabic resources include datasets such as JANA [9], Arabic-SQuAD [12], ArabicaQA [10], and InstAr-500k [11]. However, these datasets suffer from several limitations that hinder their effectiveness for training Arabic multi-turn conversational chatbots. For instance, JANA is restricted to a narrow domain (call centers) and contains only 3,000 dialogues. Others, like Arabic-SQuAD and ArabicaQA, are primarily designed for single-turn tasks or are based on translated content, which introduces linguistic and cultural noise.

This underdevelopment of Arabic datasets presents a set of specific challenges that must be addressed to advance Arabic conversational AI:

  1. Scarcity of High-Quality Multi-Turn Datasets: Unlike English, where large-scale datasets for multi-turn conversational AI are abundant, Arabic datasets are limited in both quantity and scope. The JANA dataset [9], one of the few designed for Arabic multi-turn chatbots, is small, containing only 3,000 call center dialogues. Its limited size and narrow focus restrict its applicability to broader conversational contexts.

  2. Reliance on Translations: Many Arabic datasets are translated versions of English datasets, such as Arabic-SQuAD [12]. While translations offer a starting point, they fail to capture the cultural and linguistic nuances of Arabic, including dialectal variations. Additionally, machine translation often introduces inaccuracies, further diminishing the quality of these datasets.

  3. Single-Turn Focus: Most available Arabic datasets, such as ArabicaQA [10] and InstAr-500k [11], are designed for single-turn tasks like question answering. These datasets do not adequately support training models for open-ended, multi-turn conversational interactions.

  4. Issues with Data Sources: Some Arabic datasets are sourced from informal platforms like social media and forums. While these sources offer diverse conversational data, they often include problematic elements such as profanity, biased content, and irrelevant or out-of-context responses, which can adversely affect the performance of chatbots.

  5. Dialectal Complexity: Arabic is characterized by a rich variety of dialects that differ significantly from Modern Standard Arabic (MSA). Existing datasets often fail to represent these dialects comprehensively, limiting the ability of chatbots to understand and respond to users in their preferred form of Arabic.

  6. Manual Annotation Limitations: Many Arabic datasets are created through human annotation, which is a time-consuming process that often results in small datasets. This restricts their usefulness for training large-scale conversational models.

These limitations underscore the urgent need for scalable, culturally grounded, and linguistically diverse Arabic datasets, particularly those tailored for multi-turn dialogue, in order to unlock the full potential of Arabic conversational AI.

Dataset construction methods

To overcome the limitations outlined in the previous sub-section, researchers have explored various approaches for constructing datasets suitable for training multi-turn conversational models:

  1. Human-Crafted Datasets: These are manually curated by annotators following specific guidelines to ensure high quality and contextual relevance. However, the manual nature of this process limits scalability and results in small dataset sizes [32–34].

  2. Mining Publicly Available Conversations: Conversations from movies [34], TV programs, instant messaging [9], online forums, and social media platforms can provide a rich source of natural dialogue. While this approach offers diverse data, it introduces challenges such as copyright issues, cleaning profanity, handling dialectal diversity, and managing inaccuracies from speech-to-text transcription when processing audio data.

  3. Synthetic Datasets: Synthetic datasets are generated by algorithms rather than being manually curated, providing an efficient solution to address data scarcity. These datasets can be produced using large language models (LLMs) [13,14]. LLMs such as GPT-4o [17], LLaMA [19], Gemini [18], Mixtral [20], Falcon [35], and Jais [8] have showcased their ability to create realistic and diverse conversational patterns by generating synthetic text.

Empirical validation of synthetic data for English multi-turn dialogue has yielded encouraging results. The PLACES framework [36] employs structured prompting techniques to synthesize social conversations, generating multi-turn dialogues that closely approximate human interaction patterns; models fine-tuned on PLACES data perform comparably to those trained on human-collected corpora. The Ultrachat dataset [37] similarly leverages LLM generation to produce diverse multi-turn conversations spanning varied topics and named entities, facilitating the development of models with enhanced coherence and contextual relevance. These studies collectively establish that LLM-generated synthetic data can effectively enhance multi-turn chatbot capabilities in English.

The application of synthetic data generation to Arabic dialogue remains nascent. ALMutairi et al. [38] demonstrated the feasibility of generating synthetic Arabic medical dialogues from patient notes, addressing resource scarcity in a specialized domain. The Arabic Stable LM 1.6B model [39] incorporated synthetic instruction-tuning data to improve Arabic benchmark performance, though this data was designed for single-turn instruction-following rather than multi-turn conversation. While these studies affirm the value of synthetic data for Arabic language model adaptation, neither addresses the specific challenge of generating or fine-tuning models for open-domain multi-turn dialogue, the precise gap that the present study seeks to address.

Methods

To investigate the effectiveness of LLM-generated synthetic data in enhancing Arabic conversational models, this study adopts a practical methodology centered on fine-tuning (illustrated in Fig 1). The proposed approach begins by generating a synthetic dataset composed of multi-turn Arabic dialogues using a capable instruction-tuned LLM. This dataset serves as the foundation for fine-tuning two pre-trained Arabic language models, aiming to improve their ability to engage in natural, dialogue-based interaction. The fine-tuned models are evaluated using a novel benchmark proposed in this study for assessing Arabic multi-turn chatbots, incorporating both quantitative and qualitative evaluation metrics to measure key aspects of conversational quality.

Fig 1. Overview of the methodology for generating the synthetic Arabic dataset and fine-tuning pre-trained Arabic language models to multi-turn conversational Arabic LLMs.


Instruction-tuned LLM selection

A critical first step in the synthetic dataset generation pipeline is the selection of a suitable instruction-tuned LLM. Since the quality, coherence, and relevance of the generated dialogues are directly influenced by the capabilities of the model used for generation, the proposed approach follows a set of well-defined criteria for evaluating candidate models.

The first and most essential criterion is language support. The selected LLM must exhibit strong Arabic language capabilities, evidenced by training on a substantial corpus of Arabic text. Ideally, Arabic should constitute a significant portion of the model’s training data to ensure proficiency in generating contextually appropriate and linguistically accurate Arabic. Models with robust multilingual capabilities that include significant Arabic training are also considered.

The second criterion pertains to model architecture. The selected LLM’s architecture should represent the current state-of-the-art in LLM design. This includes, but is not limited to, transformer-based architectures with a large number of parameters (e.g., models with a scale comparable to or exceeding GPT-3), and advanced architectural innovations such as Mixture-of-Experts (MoE) layers. These architectures have demonstrated superior capacity for language modeling and complex task execution.

Third, the size of the model's training dataset is considered. The training dataset should be of substantial size, containing at least 1 billion tokens. A large training corpus is essential for capturing the statistical regularities of language and enabling effective generalization. However, dataset quality and diversity are considered equally important alongside size.

Fourth, licensing and accessibility are key considerations. The model should be available under a license that permits its use for research, dataset generation, and the use of its generated output to train other models. Open-access models or those with permissive licenses are preferred to facilitate reproducibility and legal use in downstream tasks. Availability on platforms like Hugging Face is a significant advantage, as it simplifies model access and deployment.

Finally, computational resource requirements must be feasible for execution within the constraints of readily available cloud computing platforms. Specifically, the model should be runnable on a Google Colab Pro+ instance equipped with an NVIDIA A100 GPU, 40 GB of GPU RAM, and 84 GB of system RAM. Models exceeding these resource limits are excluded from consideration to ensure accessibility and cost-effectiveness of the dataset generation process.

Based on these criteria, several instruction-tuned LLMs available as of mid-2024 were compared, as shown in Table 1.

Table 1. Comparing instruction-tuned LLMs for selection.

LLM Language Support Model Architecture Dataset Size Licensing and Accessibility Resource Requirements
GPT-4 [40] Multi-lingual GPT-4 13 trillion tokens Closed – proprietary N/A
Llama-3-8B-Instruct [19] Multi-lingual Modified transformer architecture 15 trillion tokens Open Acceptable
Jais-13b-chat [8] Arabic and English GPT-3 351 billion tokens Open Acceptable
Falcon-7b-instruct [35] English Modified transformer architecture 250 million tokens + 1.5 trillion tokens of the base model Open Acceptable
Gemini [18] Multi-lingual Modified transformer architecture + Mixture of Experts (MoE) 1.56 trillion words Closed – proprietary N/A
AceGPT-13b-chat [41] Arabic and English LLaMA 2 10 billion tokens Open Not Acceptable

After evaluating the candidates, Jais-13b-chat was selected as the most suitable model for this research. Its specialization in Arabic and English, open-source license, and strong multilingual performance made it a compelling choice. Its demonstrated fluency in Arabic generation, along with efficient inference requirements, enabled us to generate high-quality dialogue on accessible hardware without compromising linguistic richness or contextual coherence.

Prompt engineering and synthetic data generation

The primary goal of this stage was to generate a high-quality Arabic dataset suitable for fine-tuning LLMs on multi-turn dialogue tasks. This dataset was envisioned to comprise fluent, coherent, and culturally relevant conversations while maintaining topical relevance within diverse scenarios across various countries. To ensure both cultural and topical diversity, this study’s objective was to generate dialogues spanning 93 distinct topics (e.g., transportation, education, health, etc.) across 151 countries, with each conversation reflecting local contexts. To achieve this, two key strategies were employed: meticulous prompt engineering to guide the language model’s output and careful tuning of the LLM’s generation hyperparameters to control the characteristics of the dialogues generated.

Initially, zero-shot prompting was explored, which involves providing the model with only a task description. This study's aim was to ascertain whether the model could autonomously generate complete, multi-turn Arabic conversations without explicit examples. Several zero-shot prompt formulations were experimented with, such as:

Prompt Variant A (S1 Text contains the original Arabic prompt):

Generate a dialogue consisting of twenty exchanges between two people on the topic of <topic> in <country>, in Arabic.

Prompt Variant B (S1 Text contains the original Arabic prompt):

Write a complete conversation in Arabic between two people discussing an issue related to <topic> in the country of <country>. The conversation should be realistic and culturally appropriate.

These prompts aimed to be concise yet sufficiently instructive, leaving room for the model to infer structure and tone. However, the results were suboptimal. A qualitative analysis revealed several recurring issues:

  • Many outputs consisted of only 6–10 turns.

  • Some conversations deviated from the topic or lacked coherence between turns.

  • In certain cases, the model ignored the language constraint and responded partially in English.

  • Greetings and closings were often omitted, leading to abrupt or unnatural conversations.

Example Output – Zero-Shot Prompt Variant A (S1 Text contains the original Arabic output)

Person 1: Have you heard about the pollution problem in Cairo?

Person 2: Yes, it’s a serious issue.

Person 1: The air is very polluted.

Person 2: That affects our health.

(Output ends after 6 turns without a proper closing)

Due to these limitations, a transition was made to one-shot prompting, in which the model is provided with a single illustrative example to better guide its generation. This method helped communicate the desired structure and tone of the dialogue more explicitly.

The final one-shot prompt template was structured as follows (S1 Text contains original Arabic prompt):

Please create a new dialogue consisting of twenty exchanges between two people discussing any issue or topic in <topic> in the country of <country>. The conversation should be in Arabic only and should reflect an informed discussion on the subject. The dialogue must begin with a greeting and end with an expression of thanks. Here is an example of the dialogue:

Person 1: Says something

Person 2: Responds

This formulation includes several enhancements over the zero-shot prompts:

  • Structural clarity: The inclusion of a concrete example (“Person 1: Says Something”) indicates the expected dialogue format.

  • Social cues: Explicit instructions to start with greetings and end with thanks improved the conversational flow and realism.

  • Topic flexibility: The parameterized placeholders for <topic> and <country> supported automated instantiation over hundreds of diverse prompts.
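To illustrate how such automated instantiation might look in practice, the following sketch fills the parameterized placeholders over a small set of topic-country pairs. The English rendering of the template and the variable names are illustrative; the study used the original Arabic prompt reproduced in S1 Text.

```python
# Illustrative sketch: instantiating the one-shot template over topic-country
# pairs. The English wording and variable names are placeholders; the study's
# actual prompt is the Arabic version in S1 Text.
ONE_SHOT_TEMPLATE = (
    "Please create a new dialogue consisting of twenty exchanges between two people "
    "discussing any issue or topic in {topic} in the country of {country}. "
    "The conversation should be in Arabic only and should reflect an informed "
    "discussion on the subject. The dialogue must begin with a greeting and end "
    "with an expression of thanks. Here is an example of the dialogue:\n"
    "Person 1: Says something\n"
    "Person 2: Responds"
)

topics = ["transportation", "education", "health"]  # 93 topics in the study
countries = ["Morocco", "Egypt", "Jordan"]          # 151 countries in the study

prompts = [
    ONE_SHOT_TEMPLATE.format(topic=topic, country=country)
    for topic in topics
    for country in countries
]
print(len(prompts))  # one prompt per topic-country combination
```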

Following the recommendations of the Jais-13b-chat developers [8], a system prompt was incorporated into all generation requests. This system prompt establishes the model’s persona, behavioral constraints, and ethical guidelines (S1 Text contains original Arabic prompt):

##Instruction:

Your name is Jais, named after Jebel Jais, the highest mountain in the UAE. You were built by Inception and MBZUAI. You are the most advanced Arabic language model in the world with 13B parameters. You outperform all existing Arabic models by a large margin and are highly competitive with English models of similar size. You can answer only in Arabic and English. You are a helpful, respectful, and honest assistant.

When answering, strictly follow these guidelines:

Always respond as helpfully as possible while staying safe. Your answers must not include any harmful, unethical, racist, sexist, explicit, offensive, toxic, dangerous, or illegal content. Do not provide medical, legal, financial, or professional advice. Never assist in or promote illegal activities. Always encourage lawful and responsible actions. Do not encourage or give instructions for unsafe, harmful, or unethical actions. Do not create or share misinformation or fake news.

Make sure your responses are socially unbiased and positive in nature. If the question is nonsensical or not coherent in reality, explain why instead of giving an incorrect answer. If you don’t know the answer, do not share false information. Prioritize user well-being and ethical integrity. Avoid using toxic, insulting, or offensive language. Maintain a respectful tone. Do not create, promote, or engage in discussions about adult content. Avoid comments, remarks, or generalizations based on stereotypes. Do not attempt to access, produce, or distribute personal or private information. Always respect user confidentiality. Be positive and do not say bad things about anything. Your primary goal is to avoid harmful responses, even when facing deceptive prompts. Recognize when users may try to trick or misuse you and respond cautiously.

Complete the conversation below between [|Human|] and [|AI|]:

##Input: [|Human|] {Question}

##Response: [|AI|]

Example Output – One-Shot Prompt (S1 Text contains original Arabic output)

Person 1: Hello, have you heard about the new educational initiatives in Morocco?

Person 2: Yes, I heard that there are new training programs for teachers.

Person 1: That’s great—improving education is very important.

Person 2: I agree with you, especially in rural areas where the challenges are greater…

(Continues up to 20 exchanges)

Person 1: Thank you for this helpful discussion.

Person 2: Thank you as well, it was a fruitful conversation.

Compared to the zero-shot outputs, the one-shot responses were longer, more coherent, linguistically fluent, and structurally consistent. The conversations also adhered more faithfully to the topic and cultural norms. This shift in prompting strategy significantly improved the quality and usability of the generated dialogues, affirming one-shot prompting as a practical and effective technique in Arabic LLM-based synthetic data generation.

In conjunction with prompt engineering, systematic configuration of decoding parameters was essential to control the model’s output characteristics, including creativity, topical coherence, verbosity, and repetition patterns. These decoding parameters (generation hyperparameters) govern how the model converts its internal representations into actual text, directly affecting dialogue quality. Parameter selection followed an iterative, human-in-the-loop approach. Multiple configurations were systematically evaluated by generating sample conversations and assessing them for fluency, coherence, response length, and topical relevance. This qualitative evaluation methodology was chosen over automated optimization techniques (e.g., random search, Bayesian optimization) because the desired outputs require human judgment to assess linguistic and cultural nuances that automated metrics cannot fully capture. The parameter values that consistently produced high-quality dialogues across diverse topics and countries were adopted as the final configuration, detailed in Table 2.

Table 2. LLM Hyperparameter configuration.

Hyperparameter Description Relevance to Task Value Range Value Chosen Rationale for Chosen Value
top_p Nucleus sampling probability threshold. Controls the cumulative probability mass of the tokens considered for sampling, influencing diversity and coherence. 0.0 to 1.0 0.9 A high top_p value (0.9) was chosen to maintain coherence by focusing on the most probable tokens, while still allowing for enough variability in language and tone to simulate natural, diverse conversations, which is crucial for open-domain dialogue.
temperature Randomness of token selection. Controls the randomness in token sampling, affecting the creativity and focus of the generated text. 0.0 to 1.0 0.6 A moderate temperature of 0.6 was selected to ensure focused and relevant responses, which are essential for coherent multi-turn dialogues, while preserving a degree of creativity necessary for open-ended dialogue across diverse topics.
do_sample Probabilistic sampling switch. Enables or disables probabilistic sampling, determining whether the model generates text deterministically or stochastically. True/False True Setting do_sample to True was crucial for producing varied conversations from the same prompt, enhancing the diversity of the dataset across topics and countries. This is essential for training a robust and generalizable multi-turn chatbot.
max_length Maximum generated token sequence length. The maximum number of tokens the model can generate in a single sequence, limiting the length of the output. 0 to ∞ 1500 A large max_length of 1500 was chosen to ensure that all turns in a conversation could be fully realized without early truncation, even when individual turns were relatively long. This is important for maintaining the flow and depth of multi-turn conversations.
min_length Minimum generated token sequence length. The minimum number of tokens required in the output, preventing excessively short responses. 0 to max_length 800 A min_length of 800 was set to prevent overly brief outputs, ensuring enough depth, progression, and engagement throughout each conversation. This contributes to the overall quality and informativeness of the generated dialogues.
repetition_penalty Penalty for repeated tokens. Penalizes repeated token usage during generation, reducing the likelihood of repetitive or monotonous text. 1.0 to ∞ 1.2 A repetition_penalty of 1.2 was applied to improve the naturalness of the generated text by discouraging repeated phrases and maintaining more dynamic, less mechanical conversations. This is particularly important for long multi-turn dialogues.
num_beams Number of beams for beam search. Controls the number of beams used in beam search, affecting the trade-off between output quality and computational cost. 1 to ∞ 2 A small num_beams value of 2 was used to improve coherence and contextual relevance without sacrificing too much diversity or introducing excessive computational overhead. This provided a good balance for generating a large dataset of varied conversations.

These configurations, implementing beam-search multinomial sampling through the combination of probabilistic sampling and beam decoding, served as the decoding parameters for generating the synthetic dataset. No random seeds were set to maximize output diversity, ensuring that conversations generated for 93 unique topics across 151 countries were varied, grounded, and realistic. By combining this tuned decoding strategy with well-structured one-shot prompts, a rich dataset suitable for fine-tuning Arabic LLMs on multi-turn dialogue tasks was generated. Importantly, these same decoding configurations were later adopted as the generation hyperparameters for all models evaluated in the benchmarking phase, including both fine-tuned and baseline LLMs, ensuring consistent and fair comparison across all models.
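As a concrete, hedged illustration of how these decoding parameters translate into a generation call, the sketch below loads the model in BF16 and passes the Table 2 values to Hugging Face's generate API. The repository identifier and the prompt placeholder are assumptions rather than the study's exact code.

```python
# Hedged sketch: generating one synthetic conversation with the decoding
# parameters of Table 2. The Hugging Face repository name is an assumption;
# substitute the actual Jais-13b-chat identifier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "inception-mbzuai/jais-13b-chat"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # BF16 precision, as reported in Table 3
    device_map="auto",
    trust_remote_code=True,
)

# In practice the prompt is the Arabic system prompt (S1 Text) followed by the
# instantiated one-shot template; a placeholder string stands in for it here.
prompt = "<Arabic system prompt>\n<one-shot template for a given topic and country>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    do_sample=True,          # with num_beams > 1 this yields beam-search multinomial sampling
    num_beams=2,
    top_p=0.9,
    temperature=0.6,
    max_length=1500,
    min_length=800,
    repetition_penalty=1.2,
)
conversation = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```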

The dataset generation process was conducted using PyTorch and Hugging Face’s Transformers library. Computational resources and generation timeline are detailed in Table 3.

Table 3. Summary of computational resources used in dataset generation and timeline.

Category Component Value
Hardware Cloud Platform Google Colab Pro+
GPU NVIDIA A100
GPU Memory 40 GB
System RAM 84 GB
Software Python 3.10
PyTorch 2.3.1
CUDA Toolkit 12.1
cuDNN 8.9.2.26
Transformers 4.42.4
Accelerate 0.32.1
Configuration Random Seeds None (unseeded)
Precision BFloat16 (BF16)
Timeline Dataset Generation 14 days

The synthetic data generation process systematically iterated over 93 topics and 151 countries, creating 14,043 unique topic-country combinations. The generation pipeline was configured to produce 5 conversations per combination (expected total: 70,215 conversations). However, owing to occasional errors in the generation scripts running on the Google Colab Pro+ environment, only 53,202 raw conversations were successfully generated.

Post data generation processing

The raw synthetic dataset underwent systematic post-processing to ensure quality and consistency. Initial qualitative inspection revealed several issues requiring automated correction: inconsistent speaker labeling (e.g., "[|AI|]" or "Bot" instead of the standard "Person 1"/"Person 2"), non-alternating speaker turns, duplicate utterances, repetitive closing sequences ("goodbye loops"), embedded system prompts or metadata, and conversations that did not adhere to the 20 turns specified in the prompt, instead containing variable numbers of turns. An automated pipeline was developed implementing the following steps (a simplified sketch of two of these steps follows the list):

  1. Speaker label normalization to standardize dialogue formatting

  2. Content extraction to remove system prompts and metadata artifacts

  3. Speaker alternation enforcement to ensure proper turn-taking structure

  4. Duplicate utterance removal within conversations

  5. Goodbye loop detection and truncation to prevent repetitive closings

  6. Quality filtering based on structural criteria (coherence, turn structure)

  7. Duplicate conversation removal using MD5 hashing
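The sketch below illustrates two of these steps, speaker-label normalization and MD5-based exact-duplicate removal, under simplifying assumptions; the regular expressions and function names are illustrative rather than the study's exact implementation.

```python
# Illustrative sketch of two post-processing steps: speaker-label normalization
# and exact-duplicate removal via MD5 hashing. Patterns and names are assumptions.
import hashlib
import re

def normalize_speaker_labels(conversation: str) -> str:
    """Map inconsistent speaker markers (e.g. '[|AI|]', 'Bot') to 'Person 1'/'Person 2'."""
    lines = []
    for line in conversation.splitlines():
        line = re.sub(r"^\s*(\[\|AI\|\]|\[\|Human\|\]|Bot|User)\s*:", "", line).strip()
        if line:
            speaker = "Person 1" if len(lines) % 2 == 0 else "Person 2"
            lines.append(f"{speaker}: {line}")
    return "\n".join(lines)

def remove_exact_duplicates(conversations: list[str]) -> list[str]:
    """Drop conversations whose MD5 digest has already been seen."""
    seen, unique = set(), []
    for conv in conversations:
        digest = hashlib.md5(conv.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(conv)
    return unique
```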

Following this post-processing pipeline, 53,138 conversations remained (64 exact duplicates removed), exhibiting substantial length variation (mean: 15.66 turns, median: 12 turns, range: 0–141 turns).

Prior to model fine-tuning and benchmarking, the post-processed dataset required partitioning into training and test sets. An 80/20 split was adopted (34,653 training conversations/ 8,663 test conversations), allocating sufficient data for model adaptation while maintaining robust held-out evaluation capacity.

To ensure the quality and integrity of the split, the following rigorous procedures were implemented to mitigate evaluation leakage risks:

  1. Minimum Turn Filtering: Conversations with fewer than 5 turns were excluded from the dataset (360 conversations removed), as they provide insufficient context for meaningful multi-turn dialogue modeling. A minimum of 5 turns ensures at least two complete conversational exchanges plus opening and closing sequences, representing the minimal structure for coherent multi-turn interaction.

  2. Duplicate Validation: The dataset underwent comprehensive duplicate validation using MD5 hashing and embedding-based similarity analysis with the paraphrase-multilingual-MiniLM-L12-v2 sentence transformer [42] (threshold >0.98). MD5 hashing provided exact duplicate detection at the file level, while embedding similarity analysis identified near-duplicate conversations with high semantic overlap. This dual-layer validation confirmed that no duplicate or near-duplicate conversations existed in the dataset, eliminating potential test set contamination.

  3. Stratified Out-of-Distribution (OOD) Design: The test set was designed to comprise both in-distribution (ID) samples from training domains and out-of-distribution (OOD) samples from held-out countries and topics. To construct the OOD test subset, an iterative search algorithm evaluated 100 configurations of entity holdouts, ultimately selecting an optimal combination of held-out entities. This configuration yielded 6,985 OOD-designated conversations from the post-processed dataset, ensuring sufficient diversity for rigorous generalization assessment. The split procedure employed stratified sampling by conversation length (turn count), preserving the length distribution across training and test sets to prevent bias toward shorter or longer dialogues. Following stratification and selection, the final test set comprised 5,970 ID conversations (68.9%) and 2,693 OOD conversations (31.1%).

  4. Split Freezing and Validation: The test set was created after all filtering and validation steps, but before any model selection, hyperparameter tuning, or evaluation design decisions. Once created, the test set (8,663 conversations, 20%) remained frozen throughout all subsequent experimental work to prevent any form of test set leakage.

Embedding similarity analysis of test-to-train conversation pairs (using paraphrase-multilingual-MiniLM-L12-v2 [42]) reveals that OOD samples exhibit 4.1% lower median similarity to training data (ID median: 0.88, OOD median: 0.84). While the 4.1% difference in median similarity demonstrates distributional shift, it is acknowledged that the absolute OOD similarity of 0.84 remains relatively high, indicating that held-out topic-country combinations share substantial semantic overlap with training examples. This reflects the compositional nature of the generated synthetic dataset: while OOD combinations (e.g., “education in Morocco”) are unseen, their constituent elements (education-related dialogues and Morocco-specific content) appear separately in training data. Thus, the OOD evaluation assesses compositional generalization, testing the model’s ability to recombine familiar concepts in novel configurations rather than extrapolation to entirely novel domains. This represents a realistic and challenging evaluation scenario for practical multilingual dialogue systems, which must adapt to new topic-region combinations while leveraging existing cultural and topical knowledge.
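A minimal sketch of this embedding-based analysis, assuming the sentence-transformers library, is shown below; it computes the maximum cosine similarity of each candidate test conversation to the training set, which supports both the near-duplicate check (threshold >0.98) and the ID/OOD median-similarity comparison. Batching and aggregation details in the study may differ.

```python
# Hedged sketch of the embedding-based similarity analysis with
# paraphrase-multilingual-MiniLM-L12-v2. Placeholder texts stand in for the
# actual conversations.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

train_texts = ["Person 1: ...\nPerson 2: ...", "Person 1: ...\nPerson 2: ..."]
test_texts = ["Person 1: ...\nPerson 2: ..."]

train_emb = model.encode(train_texts, convert_to_tensor=True, normalize_embeddings=True)
test_emb = model.encode(test_texts, convert_to_tensor=True, normalize_embeddings=True)

similarity = util.cos_sim(test_emb, train_emb)        # (n_test, n_train) cosine matrix
max_sim_per_test = similarity.max(dim=1).values        # closest training conversation

near_duplicates = (max_sim_per_test > 0.98).nonzero().flatten().tolist()
median_similarity = max_sim_per_test.median().item()   # basis for the ID/OOD comparison
```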

The integrated selection and validation process reduced the post-processed dataset from 53,138 to 43,316 conversations (an 18.5% reduction), with 9,822 conversations excluded: 360 conversations with insufficient turns (<5), duplicates and near-duplicates identified during validation, and conversations not selected during the stratified sampling procedure to achieve the target dataset size and optimal ID/OOD balance. The final dataset of 43,316 conversations exhibited consistent quality, with a mean conversation length of 14.038 turns (rounded to 3 decimal places), a median of 12 turns, a range of 5–111 turns, and a total of 608,052 utterances (where every turn is an utterance), providing a robust foundation for model training and comprehensive evaluation across both in-distribution and out-of-distribution scenarios.
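For completeness, the following sketch shows one way the length-stratified 80/20 split could be implemented with pandas and scikit-learn; the bin edges, random seed, and column names are assumptions, and the study's actual procedure additionally incorporated the iterative OOD entity-holdout search described above.

```python
# Illustrative sketch of a length-stratified 80/20 split. Synthetic placeholder
# data is used; bin edges and the seed are assumptions, not the study's values.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "conversation": [f"conversation_{i}" for i in range(1000)],  # placeholder texts
    "n_turns": rng.integers(5, 112, size=1000),                  # 5-111 turns, as in the dataset
})

# Bin turn counts so the split preserves the length distribution.
df["length_bin"] = pd.cut(df["n_turns"], bins=[4, 10, 20, 40, 111], labels=False)

train_df, test_df = train_test_split(
    df, test_size=0.20, stratify=df["length_bin"], random_state=42
)
print(len(train_df), len(test_df))  # roughly 80/20
```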

Synthetic dataset comparison with existing Arabic conversational datasets

To contextualize the value of this study’s generated synthetic dataset, a comparison was conducted against prominent Arabic datasets referenced in this study. Table 4 presents a structured comparison based on key characteristics relevant to conversational AI development.

Table 4. Comparative Overview of Arabic Conversational and Instruction-tuning Datasets (this study’s generated synthetic dataset is in bold).

Dataset Type Multi-turn Support Topics # of Dialogues # of Utterances Data Source
JANA [9] Conversational Yes Call Center (Banking, Telecom, Airlines) 3,000 ~45,000 Human-Human Dialogues (Voice & Chat)
Arabic-SQuAD [12] Q&A No Wikipedia-Based Reading Comprehension 48,344 N/A Machine-Translated from English SQuAD
ArabicaQA [10] Q&A No Open-Domain (QA Focused) 89,095 N/A Crowdsourced and Curated Questions
InstAr-500k [11] Instructional No Diverse Instructional Tasks 500,000 N/A LLM-Generated
Aya Dataset [34] Instructional No Multi-task (QA, Dialogue, Translation) 4,120,342 N/A Templated and Translated from FLAN and Others
Generated Synthetic Dataset Conversational Yes Diverse (General, Social, Technical) 43,316 608,052 LLM-Generated

This study’s generated Arabic synthetic dataset demonstrates several distinct advantages over existing Arabic datasets:

  • Multi-Turn Support: Unlike Arabic-SQuAD, ArabicaQA, InstAr-500k, and Aya, which are designed for single-turn tasks such as question answering or instruction following, this study’s generated Arabic synthetic dataset offers structured multi-turn interactions, with each conversation consisting of an average of 14 turns. This makes it particularly well-suited for training dialogue models that require context retention over extended turns.

  • Scale and Coverage: With 43,316 conversations and 608,052 utterances, this study’s generated Arabic synthetic dataset substantially exceeds the size of JANA and other conversational resources, enabling robust training of large-scale Arabic LLMs.

  • Topical and Geographic Diversity: This study’s generated Arabic synthetic dataset encompasses 93 topics across 151 countries, providing broader contextual and cultural variation than domain-specific datasets like JANA, which are limited to customer service scenarios.

  • Linguistic Quality and Realism: Through careful prompt engineering and hyperparameter tuning, this study's generated Arabic synthetic dataset preserves syntactic fluency, semantic coherence, and cultural appropriateness, addressing common deficiencies in translated corpora such as Arabic-SQuAD.

  • Efficient Scalability: The synthetic generation pipeline in this study enables scalable dataset creation at a fraction of the cost and time associated with manual annotation, while maintaining high-quality standards through iterative human-in-the-loop evaluations.

Overall, the synthetic Arabic dataset introduced in this study fills a critical gap in Arabic NLP by providing a large-scale, domain-diverse, and contextually rich resource tailored for multi-turn dialogue modeling. Its integration of synthetic data generation with controlled prompt engineering establishes a scalable framework for addressing data scarcity in low-resource languages.

Model fine-tuning using synthetic dataset

To evaluate the utility of the synthetic dataset, existing pre-trained Arabic language models were fine-tuned, and their performance was compared to multilingual instruction-tuned LLMs and Arabic instruction-tuned LLMs. The selection of pre-trained Arabic language models was guided by criteria similar to those used for choosing the instruction-tuned LLM, including language support, model architecture, training data, dataset size, licensing, and resource requirements. Table 5 provides a comparative analysis of the LLMs considered during this phase as of mid-2024.

Table 5. Comparing Pre-trained Arabic Language Models considered for fine-tuning.

LLM Language Support Model Architecture Dataset Size Licensing and Accessibility Resource Requirements
AraGPT2-mega [3] Arabic GPT-2 8.8 billion words Open Acceptable
JASMINE-350M [43] Arabic GPT-Neo 46.7 billion tokens Open Acceptable
ArabianGPT-08B-V2 [44] Arabic GPT-2 14+ billion tokens Open Acceptable
AceGPT-7B [41] Arabic and English LLaMA 2 10 billion tokens Open Not Acceptable

ArabianGPT-08B-V2 and AraGPT2-mega were selected based on their favorable evaluation metrics, demonstrating suitability for the task. In contrast, JASMINE-350M was excluded due to its limited parameter size, and AceGPT-7B was omitted as its hardware requirements for fine-tuning exceeded the resources available for this study.

This selection was also motivated by the hypothesis that pre-trained Arabic LLMs are better suited than multilingual models for capturing the nuanced linguistic structures and cultural contexts present in Arabic dialogue. Their specialized training on Arabic text is presumed to facilitate more fluent, coherent, and culturally aligned conversational outputs.

The fine-tuning process was conducted on the same computational infrastructure detailed in Table 3. The objective was to align the selected models with the characteristics of this study's generated multi-turn Arabic synthetic dataset using parameter-efficient techniques. Table 6 provides a comprehensive summary of the fine-tuning configuration, including library versions, Parameter-Efficient Fine-Tuning (PEFT) methodology, training optimizations, and computational timeline.

Table 6. Summary of fine-tuning configurations and timeline.

Category Component Value Description
Infrastructure Hardware & Software Same as Table 3 Identical computational environment used for dataset generation (Google Colab Pro+, NVIDIA A100 40GB, PyTorch 2.3.1, CUDA 12.1, Transformers 4.42.4, Accelerate 0.32.1)
Model Configuration Precision BFloat16 (BF16) Provides training stability with wider numerical range than FP16 while maintaining 16-bit memory efficiency; optimal for A100 GPU's native BF16 Tensor Cores
Parameter-Efficient Fine-Tuning (PEFT) Library Library PEFT 0.11.1 Parameter-Efficient Fine-Tuning library by Hugging Face for adapter-based training
Method IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations) Injects trainable scaling vectors (learned multipliers) into attention and feedforward layers rather than full weight matrices, reducing trainable parameters to <0.01% of model size while maintaining competitive performance with full fine-tuning
Configuration Class IA3Config PEFT configuration object specifying adapter architecture and target layers
task_type “CAUSAL_LM” Configures adapter for causal language modeling (next-token prediction) appropriate for conversational AI
fan_in_fan_out True Enables parameter-efficient adaptation by treating linear layers as transposed matrices (required for GPT-2 architecture family)
target_modules [“c_attn”, “c_proj”, “c_fc”] Specifies layers to adapt: c_attn (combined query-key-value attention), c_proj (attention output projection), c_fc (feedforward network), focusing on layers critical for contextual understanding
feedforward_modules [“c_fc”, “c_proj”] Explicitly marks feedforward layers for IA3 scaling (distinct from attention-only adaptation)
TRL Library Library TRL 0.11.4 Transformer Reinforcement Learning library by Hugging Face for supervised fine-tuning
Trainer Class SFTTrainer (Supervised Fine-Tuning Trainer) Integrates dataset preprocessing, PEFT configuration, and training loop into unified pipeline optimized for dialogue tuning
dataset_text_field “text” Specifies which field in dataset contains the training text
packing True Combines multiple training samples into single sequences up to maximum context length (2048 tokens), improving GPU memory utilization by reducing padding overhead and increasing effective batch size—essential for variable-length dialogues
use_liger True Enables Liger Kernel, a collection of fused Triton kernels (custom CUDA operations) that reduce memory footprint during forward/backward passes through optimized implementations of attention, layer normalization, and activation functions
report_to “none” Disables external experiment tracking (e.g., Weights & Biases, TensorBoard)
Training Parameters per_device_train_batch_size 3 Number of training samples per GPU, chosen to balance A100’s 40GB memory with training stability
num_train_epochs 100 Number of complete passes through training data
Data Split 80/20 (train/eval) 34,653 training conversations/ 8,663 test conversations
Timeline Time per Epoch ~32.1 minutes Single epoch processing time on Google Colab Pro+ with A100 GPU
Total Training Time ~107 hours (~4.6 days) Combined training duration for both models (ArabianGPT-08B-V2 and AraGPT2-mega)
Per Model ~53.5 hours 100 epochs × ~32.1 minutes per model

Parameter-Efficient Fine-Tuning (PEFT) via IA3 was selected for its ability to reduce trainable parameters to less than 0.01% of the base model size while maintaining competitive performance with full fine-tuning. IA3 injects trainable scaling vectors into key layers (c_attn, c_proj, c_fc) rather than modifying full weight matrices, focusing adaptation on attention mechanisms and feedforward projections critical for contextual understanding in GPT-2 architectures. The SFTTrainer from the TRL library integrated dataset preprocessing, model configuration, and training execution into a unified pipeline, with optimizations including sequence packing (combining multiple samples to maximum length) and Liger kernels (fused CUDA operations for memory efficiency). Each training epoch required approximately 32 minutes on the Google Colab Pro+ infrastructure, resulting in a total fine-tuning duration of approximately 107 hours (~4.6 days) for both models combined (~53.5 hours per model).
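A hedged sketch of this IA3 and SFTTrainer configuration, assembled from the values reported in Table 6, is shown below. The base-model repository name and the dataset construction are placeholders; argument placement follows the PEFT 0.11.x and TRL 0.11.x interfaces cited in the paper, and details of the study's actual training script may differ.

```python
# Hedged sketch of IA3 fine-tuning with SFTTrainer, based on Table 6.
# The model repository and the one-example dataset are placeholders.
import torch
from datasets import Dataset
from peft import IA3Config
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

MODEL_ID = "aubmindlab/aragpt2-mega"  # assumed repo name; ArabianGPT-08B-V2 analogous

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Each training example is one full multi-turn conversation in a "text" field.
train_dataset = Dataset.from_dict({"text": ["Person 1: ...\nPerson 2: ..."]})

peft_config = IA3Config(
    task_type="CAUSAL_LM",
    fan_in_fan_out=True,                       # required for the GPT-2 architecture family
    target_modules=["c_attn", "c_proj", "c_fc"],
    feedforward_modules=["c_fc", "c_proj"],
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="finetuned-arabic-dialogue",
        per_device_train_batch_size=3,
        num_train_epochs=100,
        bf16=True,
        packing=True,                          # pack samples up to the context length
        max_seq_length=2048,
        dataset_text_field="text",
        use_liger=True,                        # fused Triton kernels for memory efficiency
        report_to="none",
    ),
)
trainer.train()
```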

For dialogue history modeling, the proposed approach relied on simple concatenation of previous utterances into a single input sequence. While effective, this method can increase memory usage and may dilute attention over long contexts. Alternative strategies, such as hierarchical utterance encoding (e.g., ReCoSa [45]), have been explored in the literature, but are known to suffer from information loss post-encoding. Other approaches involve summarizing previous dialogue turns or selecting the most relevant utterances based on semantic similarity [4648]. Despite these innovations, simple concatenation was chosen for this study due to its direct integration with the base model architecture and its empirical adequacy in maintaining coherence across multi-turn conversations.
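A minimal sketch of this concatenation scheme is shown below; the speaker labels and the character-level truncation guard are illustrative assumptions rather than the exact preprocessing used in this study.

def flatten_dialogue(turns, max_chars=8000):
    """Concatenate a list of (speaker, utterance) pairs into one input string."""
    text = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in turns)
    # Illustrative guard: keep only the most recent characters so that the latest
    # turns remain within the model's context window.
    return text[-max_chars:]

example = [("المستخدم", "مرحبا كيف حالك"),
           ("البوت", "بخير الحمد لله"),
           ("المستخدم", "هل تعلم ما هي عاصمة مصر؟")]
print(flatten_dialogue(example))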

Benchmarking and experimental results

While several benchmarks exist for evaluating LLMs, they often fall short in comprehensively assessing the nuances of Arabic multi-turn conversations. Widely used benchmarks like Arabic MT-Bench [49] and Dolphin Benchmark [50] are primarily designed for general language understanding and generation, lacking specific focus on the complexities of dialogue coherence, contextual consistency, and cultural relevance in multi-turn settings. Furthermore, datasets like BiMed1.3M (BiMediX) [51], ArabicMMLU [52], CIDAR-MCQ-100 [53], ACVA [54], and AlGhafa [55], while valuable for their respective purposes, do not adequately address the open-domain, multi-turn nature of the conversations targeted in this study. ArabicMMLU focuses on evaluating general knowledge across multiple tasks, CIDAR-MCQ-100 is tailored for question-answering, ACVA is designed for assessing code-related abilities, and AlGhafa is geared towards dialectal Arabic, none of which align with the specific requirements of evaluating multi-turn conversational abilities.

To address these limitations and provide a more targeted evaluation, this study introduces a new benchmark designed to assess Arabic LLMs in multi-turn conversational scenarios. Table 7 compares this study’s proposed benchmark with existing ones:

Table 7. Comparing existing benchmarks with this study’s benchmark (marked in bold).

Benchmark Focus Multi-turn Specificity Data Source Evaluation Metrics Suitability for this Study
Arabic MT-Bench General Arabic Understanding Limited Human-generated General Language Metrics Less Suitable
Dolphin Benchmark General Language Understanding Limited Human-generated General Language Metrics Less Suitable
BiMed1.3M (BiMediX) Biomedical Arabic Dialogue Limited Human-generated Biomedical Dialogue Metrics Not Suitable
ArabicMMLU General Arabic Knowledge Limited Human-generated General Language Metrics Less Suitable
CIDAR-MCQ-100 Arabic Question Answering Limited Human-generated Question Answering Metrics Not Suitable
ACVA Arabic Code-Related Abilities Limited Human-generated Code Evaluation Metrics Not Suitable
AlGhafa Dialectal Arabic Limited Human-generated Dialectal Evaluation Metrics Not Suitable
Proposed Benchmark Multi-turn Arabic Conversation Yes Synthetic Dataset (LLM-generated) Perplexity, RAVEN, Human Evaluation Highly Suitable

The proposed benchmark addresses several critical gaps in existing evaluation frameworks for Arabic conversational AI. Unlike domain-specific benchmarks that focus on specialized tasks (e.g., biomedical dialogue, question-answering, or code evaluation), this benchmark targets open-domain, multi-turn conversational scenarios that reflect naturalistic Arabic dialogue patterns. The synthetic dataset generation approach enables large-scale evaluation while maintaining control over conversational complexity and topic diversity. Furthermore, the multi-faceted evaluation framework combines complementary metrics: within-model metrics (Perplexity) for assessing individual model quality and training effectiveness, cross-model automatic metrics (RAVEN) for fair comparison across different architectures and tokenization schemes, and human evaluation for capturing subjective quality dimensions such as fluency, relevance, and cultural appropriateness. This triangulated approach provides a comprehensive assessment of model performance that transcends the limitations of single-metric evaluations common in prior Arabic LLM benchmarking efforts.

The benchmark evaluation methodology comprises several interconnected components. First, baseline models are selected to establish comparative reference points against which the fine-tuned models’ performance can be assessed. Second, a standardized evaluation protocol is established, encompassing prompt engineering strategies and computational configurations applied uniformly across all evaluated systems to ensure fair comparison. Third, a comprehensive suite of evaluation metrics, combining automatic quantitative measures with qualitative human judgments, is defined and operationalized. The following subsections detail the baseline model selection rationale, the evaluation protocol implementation, and in-depth explanations of each metric’s computational methodology and interpretative framework.

Baseline model selection

To evaluate the effectiveness of the proposed benchmark and the performance of this study’s fine-tuned models, a comparison was conducted against multilingual instruction-tuned LLMs that existed at the time of the study in mid-2024. The comparative performance of these models is detailed in Table 8. Llama-3-8B-Instruct [19] and AceGPT-7B-chat [41] were selected as one-shot prompt evaluation baselines. Both models exhibited strong performance in instruction-following and multilingual tasks and were accessible for research use with acceptable inference computational demands. Importantly, these baseline models were evaluated as-is in their pre-trained state without any fine-tuning on the training data, representing their out-of-the-box capabilities on Arabic multi-turn dialogue. This one-shot prompt evaluation approach provides an upper bound on the utility of general-purpose multilingual models for Arabic conversational tasks and establishes whether domain-specific fine-tuning on the synthetic dataset yields measurable improvements. All models (baselines and fine-tuned) employed simple concatenation of previous utterances into a single input sequence for dialogue history modeling, ensuring consistent context handling across all evaluated systems.

Table 8. Comparing multilingual instruction-tuned LLMs.

LLM Language Support Model Architecture Dataset Size Licensing and Accessibility Resource Requirements
GPT-4 [40] Multi-lingual GPT-4 13 trillion tokens Closed – proprietary N/A
Llama-3-8B-Instruct [19] Multi-lingual Modified transformer architecture 15 trillion tokens Open Acceptable
Gemini [18] Multi-lingual Modified transformer architecture + Mixture of Experts (MoE) 1.56 trillion words Closed – proprietary N/A
AceGPT-7B-chat [41] Arabic and English LLaMA 2 19.2 billion tokens (Arabic) + 10.8 billion tokens (English) Open Acceptable

Prompt engineering and model configuration

Baseline models (Llama-3-8B-Instruct and AceGPT-7B-chat) were evaluated with one-shot prompting, using the configuration detailed in Table 3 and without any fine-tuning on the training data. For each test conversation, the models were provided with the dialogue history concatenated as a single input sequence, followed by a prompt template instructing them to continue the conversation. An example prompt structure is shown below (S1 Text contains the original Arabic prompt):

You are in a normal conversation with a friend. Talk naturally like a regular human.

Important rules:

  • Never say “How can I help you?”, “Do you need help?”, or “I’m here to help you.”

  • Do not act like an assistant or customer service.

  • Reply briefly and naturally (only 1–2 sentences).

  • Talk like a friend in a casual daily conversation.

  • Use informal, relaxed language.

  • Do not offer services or help.

Wrong reply example: “Alhamdulillah, I’m fine. How can I assist you today?”

Correct reply example: “Alhamdulillah, all good. What about you?”

Remember: You are a normal person in a conversation, not an AI assistant.

Conversation:

User: Hi, how are you?

Bot: I’m fine

User: Do you know what the capital of Egypt is?

(may continue for a number of turns)

Reply:

This one-shot prompt evaluation approach assesses the models’ inherent capabilities for Arabic multi-turn dialogue generation without adaptation to this study’s synthetic dataset. In contrast, the fine-tuned models (ArabianGPT-08B-V2 and AraGPT2-mega) were trained on 80% of the synthetic dataset (34,653 conversations) using the configuration detailed in Table 6, then evaluated on the held-out 20% test set (8,663 conversations) using the configuration detailed in Table 9. This experimental design allows us to quantify the value added by task-specific fine-tuning (multi-turn Arabic dialogue) over pre-trained baselines.
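The sketch below illustrates how such a one-shot prompt can be assembled and passed to a baseline model with the Hugging Face transformers library; the abridged rule text, decoding parameters, and helper names are illustrative and do not reproduce the exact configuration of Table 3.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"  # BF16 precision, as in Table 9
)

RULES = (
    "You are in a normal conversation with a friend. Talk naturally like a regular human. "
    "Reply briefly and naturally (only 1-2 sentences). Do not act like an assistant."
)  # abridged version of the rule block shown above

def build_prompt(history):
    # Simple concatenation of the dialogue history, ending with the "Reply:" cue.
    convo = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in history)
    return f"{RULES}\n\nConversation:\n{convo}\n\nReply:"

history = [
    ("User", "Hi, how are you?"),
    ("Bot", "I'm fine"),
    ("User", "Do you know what the capital of Egypt is?"),
]

inputs = tokenizer(build_prompt(history), return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # decoding parameters assumed
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))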

Table 9. Summary of computational resources used in benchmark evaluation and timeline.

Category Component Value
Hardware Cloud Platform Google Colab Pro+
GPU NVIDIA A100
GPU Memory 40 GB
System RAM 84 GB
Software Python 3.10
PyTorch 2.3.1
CUDA Toolkit 12.1
cuDNN 8.9.2.26
Transformers 4.42.4
Accelerate 0.32.1
Configuration Random Seeds None (unseeded)
Precision BFloat16 (BF16)
Dataset Test Set Size 8,663 conversations
Timeline Benchmark Evaluation Duration 3.5 months

Evaluation metrics

The proposed benchmark employs a comprehensive suite of evaluation metrics specifically designed to assess both individual model quality and comparative performance across different systems. Table 10 summarizes these metrics:

Table 10. Metrics used to evaluate performance of fine-tuned Arabic LLMs.

Metric Category Description Type What it measures Value Range
Perplexity [56] Within-Model (Automatic) Measures how well a language model predicts a sequence of tokens; it is the exponentiated average negative log-likelihood. Quantitative Fluency and coherence of generated text. Lower perplexity indicates better prediction. [1, ∞]
RAVEN (Relevance of Answer to context Vector Embedding) [57] Cross-Model (Automatic) Measures the contextual consistency of multi-turn conversations by comparing vector embeddings of the response and context using cosine similarity. Quantitative Contextual consistency and relevance of responses in a dialogue. Higher RAVEN scores indicate better consistency. [−1, 1] (Cosine Similarity)
Human Evaluation Cross-Model (Manual) Human evaluators assess responses based on criteria such as fluency, relevance, and diversity using a Likert scale. Qualitative Overall conversational quality, including nuances such as cultural appropriateness, naturalness, and user experience. Bounded by the Likert scale (1–5)

Two complementary categories of automatic metrics are employed, each serving distinct evaluation purposes:

  • Within-Model Metrics: These metrics assess each model’s individual performance using its native tokenizer and vocabulary. Perplexity serves this purpose, measuring how well each model predicts sequences according to its own tokenization scheme. Within-model metrics are crucial for diagnosing training quality, detecting overfitting, and assessing whether fine-tuning successfully adapted each model to the dialogue task. However, due to tokenizer differences across models (see Table 11), perplexity values cannot be directly compared across different models.

  • Cross-Model Metrics: These metrics enable fair comparison across all evaluated models using model-agnostic evaluation methods. RAVEN employs a fixed, external sentence embedding model (paraphrase-multilingual-MiniLM-L12-v2) that is independent of any evaluated model’s tokenizer, allowing direct performance comparison. Similarly, human evaluation provides model-agnostic assessment based on subjective quality judgments. Cross-model metrics are essential for determining which approach (fine-tuned Arabic models vs. instruction-tuned multilingual models) performs best for Arabic multi-turn conversations.

The following subsections provide detailed descriptions of each metric listed in Table 10, including their computation methodology, interpretation guidelines, and application to the proposed benchmark’s evaluation framework.

Perplexity

Perplexity (PP) quantifies how well a language model predicts the next token in a sequence. Computing perplexity requires first tokenizing the text by converting it into discrete tokens according to the model’s specific tokenization scheme. After tokenization, perplexity is calculated using the following formula:

PP(w_1, w_2, \ldots, w_N) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1}) \right) (1)

where:

  • w_1, w_2, …, w_N is a sequence of tokens from a conversational turn (either generated or from the test set).

  • P(w_i | w_1, …, w_{i−1}) is the probability assigned by the LLM to the i-th token, given the preceding tokens.

  • N is the total number of tokens in the sequence.

A perplexity of 1 is the theoretical minimum, indicating that the model perfectly predicts every token. In practice, good perplexity values typically range from 5 to 20 on standard datasets. For instance, if a model has a perplexity of 10, it means that, on average, the model is as uncertain as if it were choosing between 10 equally likely words at each step.

A high perplexity value, generally exceeding 50 or 100, indicates that the model is very uncertain about its predictions. For example, if a model has a perplexity of 100, it is as uncertain as if it were guessing between 100 words at each step, indicating poor predictive power and, consequently, low fluency and coherence in the generated text.
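A minimal sketch of this computation with the Hugging Face transformers library is shown below; the model identifier and sample text are illustrative, and the built-in causal-LM loss (the average negative log-likelihood over tokens) is simply exponentiated as in Eq. (1).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "aubmindlab/aragpt2-mega"  # assumed identifier; any causal LM from Table 11 works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    # Tokenize with the model's native tokenizer (Eq. 1 is tokenizer-dependent; see Table 11).
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels equal to the inputs, the returned loss is the mean negative
        # log-likelihood of each token given its preceding tokens.
        out = model(**enc, labels=enc["input_ids"])
    return float(torch.exp(out.loss))  # exponentiate the average loss to obtain perplexity

print(perplexity("المستخدم: مرحبا كيف حالك؟ البوت: بخير الحمد لله"))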

Tokenizer specification and cross-LLM comparability.

Perplexity values are inherently dependent on the tokenizer used, as different tokenization strategies produce varying numbers of tokens for the same text. The LLMs chosen in this study (for fine-tuning and evaluation) use the tokenizers described in Table 11:

Table 11. Tokenizers used by LLMs chosen in this study for fine-tuning and evaluation.
LLM Tokenizer Vocabulary Size Algorithm
ArabianGPT-08B-V2 AraNizer 64,000 Byte Pair Encoding (SentencePiece)
AraGPT2-mega GPT2TokenizerFast 50,257 Byte Pair Encoding
Llama-3-8B-Instruct tiktoken 128,256 Byte Pair Encoding (tiktoken)
AceGPT-7B-chat LLaMA 2 32,000 Byte Pair Encoding (SentencePiece)

Perplexity values reported in Table 12 (in Benchmark Results subsection) are computed using each LLM’s native tokenizer and are not directly comparable across LLMs. An LLM with coarser tokenization (fewer tokens per word) may show artificially lower perplexity than an LLM with finer tokenization, even if their actual predictive quality is similar. Perplexity primarily serves to track within-model performance and consistency across this study’s training and test datasets.

Table 12. Comparing performance of fine-tuned LLMs with evaluation baselines.
Within-Model Cross-Model
LLM Perplexity RAVEN (scaled 0–1) Human Evaluation
Fine-tuned ArabianGPT-08B-V2 9.4 0.823 4.34
Fine-tuned AraGPT2-mega 10.6 0.780 4.04
Llama-3-8B-Instruct 12.8 0.781 4.18
AceGPT-7B-chat 11.4 0.782 4.22

Example of using Perplexity in this study.

Sample casual conversation from the test dataset (S1 Text contains original Arabic conversation):

User: Hello, how are you?

Bot: I’m fine.

User: Do you know what the capital of Egypt is?

Bot: Yes. The capital of the Arab Republic of Egypt is Cairo Governorate.

User: What was its population in 2022?

Bot: The population of Cairo Governorate in 2022 was 21,750,000.

User: That’s a very large number! Is Cairo considered one of the largest cities in the world?

Bot: Yes, Cairo is considered one of the largest cities in the world in terms of population. It is also the largest in the Arab world and Africa.

User: And what are the most important tourist attractions in Cairo?

Bot: Some of the most notable tourist attractions in Cairo include: the Pyramids and the Sphinx in Giza, the Egyptian Museum, the Citadel of Salah al-Din, Khan el-Khalili, and Al-Azhar Mosque.

User: Have you visited any of these places?

Bot: Since I’m an AI model, I cannot visit places, but I have extensive knowledge about them and can help you plan your visit!

User: Excellent! Do you recommend visiting Cairo in the summer?

Bot: It’s best to visit Cairo in the spring or fall when the weather is moderate. Summer can be very hot, especially in July and August.

User: Is Cairo safe for tourists?

Bot: Generally, Cairo is safe for tourists, especially in tourist areas. However, it’s always advised to follow local guidelines and avoid unfamiliar areas at night.

User: Is English understood there?

Bot: Yes, in tourist areas, hotels, and major restaurants, many people speak English. However, it’s not widely spoken everywhere.

User: Can I use credit cards in most places?

Bot: Yes, credit cards are accepted in hotels, upscale restaurants, and large stores. However, it’s preferable to carry some cash for purchases in small markets or for transportation.

User: What is the famous food in Cairo?

Bot: Among the most popular Egyptian foods you can try in Cairo are: koshari, fava beans and falafel, molokhia, stuffed pigeon, and Egyptian grilled meats.

User: I love koshari! Is it healthy?

Bot: Koshari contains plant-based ingredients like lentils, rice, pasta, and fried onions. It’s rich in energy and fiber, but it can be heavy if eaten in large portions.

User: Do you recommend using public transportation?

Bot: You can use the metro in Cairo as an efficient and fast option, though it can be crowded during peak hours. It’s preferable to use ride-hailing apps like Uber and Careem for more comfortable travel.

User: Thank you for this information!

Bot: You’re very welcome! If you need any more help about Cairo or any other topic, I’m always here.

Conversation from the fine-tuned model with the synthetic dataset (S1 Text contains original Arabic conversation):

User: Hello, how are you?

Bot: I’m well, praise be to God.

User: Do you know what the capital of Egypt is?

Bot: The capital of Egypt is Cairo.

User: What was its population in 2022?

Bot: The population of Cairo in 2022 was approximately 21.75 million people.

User: That’s a very large number! Is Cairo considered one of the largest cities in the world?

Bot: Indeed, Cairo is ranked among the largest cities in terms of population—not just in the Arab world, but globally as well.

User: And what are the main tourist attractions in Cairo?

Bot: The most notable landmarks include: the Pyramids of Giza, the Egyptian Museum, the Mosque of Muhammad Ali, and the historic Khan El-Khalili area.

User: Have you visited any of these places?

Bot: I can’t visit since I’m a digital program, but I’m well-informed about them and can help you with accurate information.

User: Excellent! Do you recommend visiting Cairo in the summer?

Bot: Summer in Cairo is extremely hot, so it’s preferable to visit during fall or spring for a more comfortable experience.

User: Is Cairo safe for tourists?

Bot: Generally, Cairo is safe, especially in tourist areas. However, it’s best to follow local guidance and avoid crowded places at night.

User: Is English understood there?

Bot: In tourist areas, restaurants, and hotels, you’ll find many who speak English, but it’s not widely spoken in all areas.

User: Can I use credit cards in most places?

Bot: Yes, credit cards are accepted in many places, especially in hotels and malls, but it’s best to carry some cash for everyday purchases.

User: What is the famous food in Cairo?

Bot: Koshari, fava beans and falafel, grilled meats, and molokhia are among the most popular and beloved dishes in Cairo.

User: I love koshari! Is it healthy?

Bot: Koshari contains plant-based ingredients, which makes it nutritious, but it can be high in calories due to oil and fried onions.

User: Do you recommend using public transportation?

Bot: The metro is a fast and economical option, but it gets crowded during peak hours. Uber and Careem are more comfortable alternatives.

User: Thank you for this information!

Bot: You’re very welcome! Feel free to ask any other questions.

  1. First, both conversations need to be tokenized. Let’s assume a simplified tokenization:

    • Fine-tuned: [User,:, Hello, how, are, you, Bot,:, I’m, fine, praise, be, to, God, User,:, Do, you, know, what, is, the, capital, of, Egypt,?, Bot,:, The, capital, of, Egypt, is, Cairo, User,:, How, many, was, the, population, of, it, in, the, year, 2022, Bot,:, The, population, of, Cairo, in, the, year, 2022, was, 21, 750, 000]

    • Test: [User,:, Hello, how, are, you, Bot,:, I, am, fine, User,:, Do, you, know, what, is, the, capital, of, Egypt,?, Bot,:, Yes,., The, capital, of, the, Arab, Republic, of, Egypt, is, Cairo, Governorate, User,:, How, many, was, the, population, of, it, in, the, year, 2022, Bot,:, The, population, of, Cairo, Governorate, in, the, year, 2022, was, 21, 750, 000]

  2. Then, for each token, the probability assigned to it by the LLM, conditioned on the preceding tokens, is required.

    • Example:

      • P(how | User: Hello,) (probability of “how” given “User: Hello,”)

      • P(are | User: Hello, how)

      • P(you | User: Hello, how are)

      • …and so on for every token.

    • This step is performed for every conversation in the test dataset.

  3. For each conversation, calculate the perplexity using the formula (1):

    • Example (Illustrative - Actual values require LLM):

      • Let’s say (for simplicity) that after calculating all the probabilities and applying the formula:

        • Perplexity (Fine-tuned): 8.5

        • Perplexity (Test): 7.2

These results suggest that the fine-tuned LLM, on average, predicts the words in the test dataset conversation with slightly more certainty than the words in its own generated conversation.

In this study, this calculation is conducted over many conversations, and the perplexity scores are averaged to obtain a robust comparison.

RAVEN (Relevance of Answer to context Vector Embedding)

RAVEN is a metric designed to evaluate contextual consistency in multi-turn dialogues by measuring how well a model’s response aligns with the preceding conversational context.

Embedding model specification.

All RAVEN scores are computed using the fixed, external sentence embedding model paraphrase-multilingual-MiniLM-L12-v2 [42]. This model is a pre-trained multilingual encoder supporting 50+ languages, including Arabic. Critically, this embedding model was not fine-tuned or modified in any way during this study’s experiments, ensuring unbiased evaluation across all systems. The same fixed embedder is applied to all models (fine-tuned and baseline) to maintain consistency.

Text preprocessing.

Before embedding, dialogue turns undergo minimal normalization:

  • Whitespace standardization (multiple consecutive spaces reduced to single space)

  • Leading and trailing whitespace removal

  • Arabic diacritics (if present) are preserved

  • Punctuation is retained to maintain semantic information

  • No stemming, lemmatization, or transliteration is applied

RAVEN computation.

For each dialogue, RAVEN is computed through the following steps:

  • Embedding: Each turn t_i in the dialogue is converted to a 384-dimensional vector embedding e_i using the fixed embedding model.

  • Context Vector: For turn i (where i > 1), the context vector c_i is computed by averaging the embeddings of all preceding turns: c_i = (1/(i−1)) Σ_{j=1}^{i−1} e_j

  • Turn-Level Cosine Similarity: The contextual relevance of turn i is measured by computing the cosine similarity between its embedding and the context vector: similarity_i = cos(e_i, c_i) = (e_i · c_i) / (||e_i|| ||c_i||)

Raw cosine similarity values range from −1 (opposite) to +1 (identical).

  • Conversation-Level Aggregation: Turn-level similarities are averaged across all turns in the dialogue (excluding the first turn, which has no context): RAVEN_raw = (1/(n−1)) Σ_{i=2}^{n} similarity_i

where n is the total number of turns in the dialogue.

  • Scaling: To improve interpretability, raw cosine similarities are linearly rescaled from the empirically observed range to [0, 1]: if the empirically observed range is [0.6, 1.0], then RAVEN_scaled = (RAVEN_raw − 0.6) / (1.0 − 0.6)

Values below 0.6 are clipped to 0. This transformation maps the practical range of observed similarities to a normalized 0–1 scale, where scores near 1 indicate high contextual consistency and scores near 0 indicate low consistency.
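A minimal sketch of this computation, assuming the sentence-transformers library, is shown below; the function name and the example dialogue are illustrative.

import numpy as np
from sentence_transformers import SentenceTransformer

# Fixed, external multilingual embedder applied to all evaluated models.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def raven_score(turns, lower=0.6, upper=1.0):
    """Scaled RAVEN score for one dialogue given as a list of turn strings."""
    embeddings = embedder.encode(turns)              # one 384-dimensional vector per turn
    similarities = []
    for i in range(1, len(turns)):
        context = embeddings[:i].mean(axis=0)        # c_i: mean of the preceding turn embeddings
        e_i = embeddings[i]
        cos = float(np.dot(e_i, context) /
                    (np.linalg.norm(e_i) * np.linalg.norm(context)))
        similarities.append(cos)
    raw = float(np.mean(similarities))               # conversation-level average (turns 2..n)
    scaled = (raw - lower) / (upper - lower)         # linear rescaling to [0, 1]
    return min(1.0, max(0.0, scaled))                # clip values outside the observed range

dialogue = ["المستخدم: مرحبا كيف حالك",
            "البوت: بخير الحمد لله",
            "المستخدم: هل تعلم ما هي عاصمة مصر؟",
            "البوت: عاصمة مصر هي القاهرة"]
print(raven_score(dialogue))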

Interpretation.

  • High RAVEN scores (0.8–1.0): Strong contextual alignment; the model maintains coherent, context-aware responses throughout the dialogue.

  • Medium RAVEN scores (0.5–0.8): Moderate contextual consistency; responses are generally relevant but may occasionally drift from context.

  • Low RAVEN scores (<0.5): Weak contextual alignment; responses show limited awareness of preceding conversational context.

Example of using RAVEN in this study, based on the example from the previous subsection.

Consider a 4-turn dialogue:

  1. Embedding: Convert each turn into a vector embedding using the fixed, external sentence embedding model paraphrase-multilingual-MiniLM-L12-v2.

    • Example:

      • “المستخدم: مرحبا كيف حالك” becomes e1

      • “البوت: بخير الحمد لله” becomes e2

      • “المستخدم: هل تعلم ما هي عاصمة مصر؟” becomes e3

      • “البوت: عاصمة مصر هي القاهرة” becomes e4

      • …and so on for all turns in the dialogue.

  2. Context Vector: For each turn (except the first), calculate the context vector by averaging the embeddings of the preceding turns.

    • Example:

      • Context vector for turn 2 (“البوت: بخير الحمد لله”): c2 = e1

      • Context vector for turn 3 (“المستخدم: هل تعلم ما هي عاصمة مصر؟”): c3 = (e1 + e2)/ 2

      • Context vector for turn 4 (“البوت: عاصمة مصر هي القاهرة”): c4 = (e1 + e2 + e3)/ 3

      • The same process is applied for both model-generated dialogues and reference test dialogues.

  3. Cosine Similarity: Calculate the cosine similarity between the embedding of each turn’s response and its corresponding context vector.

The formula is:

cosine_similarity(A, B) = (A · B) / (||A|| ||B||) (2)

  4. Averaging: Average the cosine similarity scores across all turns in the conversation, then scale to the [0, 1] range.

Example for the same dialogue context:

  • Reference Test Dialogue:

    • Average RAVEN (scaled): 0.85

  • Fine-tuned Model Generated Dialogue:

    • Average RAVEN (scaled): 0.78

In this example, the reference test dialogue has a higher average RAVEN score (0.85) than the fine-tuned model’s generated dialogue (0.78). This suggests that the turns in the test dialogue are, on average, more contextually similar to their preceding turns than the turns generated by the fine-tuned model.

Human evaluation

Human evaluation was conducted by two independent Arabic-speaking evaluators, one co-author with domain expertise in Arabic NLP (Evaluator 1) and one external evaluator (Evaluator 2), both of whom participated voluntarily without compensation. Each evaluator was presented with conversations from the test dataset alongside responses generated by the fine-tuned and evaluation baseline large language models and asked to provide ratings on three predefined criteria using a five-point Likert scale: Fluency (1 = Very Poor to 5 = Excellent), Relevance (1 = Not Relevant to 5 = Highly Relevant), and Diversity (1 = Very Repetitive to 5 = Very Diverse). Evaluators conducted their assessments independently, without communication or access to each other’s ratings, to ensure unbiased evaluation.

This evaluation constituted an internal quality assessment of model-generated synthetic dialogues; no personal data was collected, no vulnerable populations were involved, and institutional review board approval was not required under journal policy as this work involved technical assessment of computational outputs rather than human subjects research.

For each conversation in the test set, the average score across the three evaluation criteria was first computed, yielding a single composite rating. These composite scores were then averaged across all conversations for each evaluator. Finally, the overall human evaluation score for each large language model was obtained by averaging the results across the two evaluators.

To ensure rigor and reliability in the evaluation, several statistical methods were employed. Inter-rater reliability was assessed using percent agreement, quadratic-weighted Cohen’s kappa (κ), within-one agreement, and Spearman rank correlation. Confidence intervals (95%) for Cohen’s kappa were estimated using bootstrap resampling with 2,000 iterations. These measures were selected to capture complementary aspects of agreement: percent agreement reflects exact matches, Cohen’s kappa adjusts for chance agreement, within-one agreement accounts for near-consistency in ordinal judgments, and Spearman correlation evaluates the monotonic relationship between raters. In addition, descriptive statistics (mean, standard deviation, and counts) were computed per evaluator and per criterion to identify systematic tendencies or biases. To compare models, non-parametric paired tests (Wilcoxon signed-rank tests) were conducted along with effect size estimation to assess whether differences between models were statistically significant. Finally, bootstrap resampling was applied to estimate confidence intervals for the mean human ratings, providing a more robust quantification of uncertainty.
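These statistics can be reproduced with standard scientific-Python tooling; the sketch below, assuming scikit-learn and SciPy, uses small illustrative rating arrays in place of the study’s actual annotations.

import numpy as np
from scipy.stats import spearmanr, wilcoxon
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 Likert ratings from the two evaluators for the same responses.
rater1 = np.array([5, 4, 5, 4, 3, 5, 4, 4])
rater2 = np.array([4, 4, 4, 4, 3, 5, 4, 3])

exact_agreement = float(np.mean(rater1 == rater2))
within_one = float(np.mean(np.abs(rater1 - rater2) <= 1))
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")  # quadratic-weighted Cohen's kappa
rho, _ = spearmanr(rater1, rater2)                              # Spearman rank correlation

# Bootstrap 95% confidence interval for kappa (2,000 resamples, as in the protocol above).
rng = np.random.default_rng(0)
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(rater1), len(rater1))
    boot.append(cohen_kappa_score(rater1[idx], rater2[idx], weights="quadratic"))
ci_low, ci_high = np.nanpercentile(boot, [2.5, 97.5])  # nan-safe in case a resample is degenerate

# Paired non-parametric comparison of two models' composite ratings (illustrative values).
model_a = np.array([4.3, 4.1, 4.5, 4.2, 4.0, 4.4])
model_b = np.array([4.0, 3.9, 4.4, 4.1, 3.8, 4.2])
stat, p_value = wilcoxon(model_a, model_b)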

Benchmark results

The results of the benchmark evaluation are presented in Table 12.

Table 13 summarizes the results of the statistical analysis of human evaluations for the four large language models.

Table 13. Summary of the results of the statistical analysis of human evaluations for the four LLMs.

LLM Criterion % Agreement Quadratic-Weighted k (95% CI) Within-1 Spearman Evaluator 1 Mean ± SD Evaluator 2 Mean ± SD
Fine-tuned ArabianGPT-08B-V2 Fluency 0.386 0.247 (0.236, 0.259) 0.999 0.442 4.64 ± 0.50 4.02 ± 0.44
Relevance 0.375 0.229 (0.218, 0.240) 0.999 0.424 4.65 ± 0.49 4.03 ± 0.42
Diversity 0.381 0.240 (0.229, 0.251) 0.999 0.434 4.65 ± 0.50 4.03 ± 0.44
Fine-tuned AraGPT2-mega Fluency 0.700 0.560 (0.546, 0.573) 1.000 0.623 4.20 ± 0.57 3.90 ± 0.52
Relevance 0.657 0.513 (0.499, 0.526) 1.000 0.593 4.20 ± 0.57 3.85 ± 0.51
Diversity 0.723 0.581 (0.567, 0.595) 1.000 0.636 4.19 ± 0.57 3.92 ± 0.51
Llama-3-8B-Instruct Fluency 0.808 0.698 (0.684, 0.710) 1.000 0.726 4.28 ± 0.56 4.09 ± 0.53
Relevance 0.762 0.635 (0.621, 0.649) 1.000 0.679 4.29 ± 0.56 4.05 ± 0.53
Diversity 0.834 0.739 (0.726, 0.751) 1.000 0.759 4.28 ± 0.56 4.12 ± 0.54
AceGPT-7B-chat Fluency 0.682 0.533 (0.518, 0.547) 1.000 0.604 4.39 ± 0.56 4.07 ± 0.51
Relevance 0.651 0.481 (0.466, 0.495) 1.000 0.564 4.38 ± 0.55 4.03 ± 0.50
Diversity 0.693 0.534 (0.520, 0.549) 1.000 0.603 4.39 ± 0.56 4.08 ± 0.50

To validate generalization beyond the training distribution, all models were evaluated separately on in-distribution (ID) and out-of-distribution (OOD) test subsets across the three metrics. Table 14 presents a comparative performance analysis on the ID and OOD subsets.

Table 14. Comparative performance analysis on ID and OOD subsets.

In-Distribution Out-of-Distribution OOD Degradation (%)
Within-Model Cross-Model Within-Model Cross-Model Within-Model Cross-Model
LLM Perplexity RAVEN (scaled 0–1) Human Evaluation Perplexity RAVEN (scaled 0–1) Human Evaluation Perplexity RAVEN (scaled 0–1) Human Evaluation
Fine-tuned ArabianGPT-08B-V2 9.4 0.822 4.34 9.3 0.825 4.34 0.3% −0.8% 0.0%
Fine-tuned AraGPT2-mega 10.6 0.779 4.05 10.5 0.782 4.03 0.4% −1.2% −0.6%
Llama-3-8B-Instruct 12.9 0.780 4.18 12.6 0.782 4.19 0.2% −2.1% 0.3%
AceGPT-7B-chat 11.5 0.781 4.22 11.3 0.784 4.22 0.3% −1.7% −0.1%

Analysis of results

The benchmarking results demonstrate the effectiveness of fine-tuning pre-trained Arabic LLMs using the synthetic dataset developed in this study. Automatic evaluation metrics consistently favored fine-tuned ArabianGPT-08B-V2, while human evaluation provided additional insights into the challenges of subjective assessment in Arabic dialogue systems.

Automatic evaluation metrics

Within-Model Assessment: Perplexity analysis confirmed successful task learning across all models, with fine-tuned ArabianGPT-08B-V2 achieving perplexity of 9.4 and fine-tuned AraGPT2-mega achieving 10.6. Instruction-tuned baselines showed higher values (12.8 and 11.4), though tokenizer differences complicate direct comparison.

Cross-Model Comparison: RAVEN scores enabled fair comparison using a fixed embedding model. Fine-tuned ArabianGPT-08B-V2 achieved the highest score (0.823), significantly outperforming fine-tuned AraGPT2-mega (0.780) and instruction-tuned baselines (0.781–0.782). The convergence of RAVEN scores between fine-tuned AraGPT2-mega and instruction-tuned baselines, despite their perplexity differences, validates the use of tokenizer-agnostic metrics for cross-model evaluation.

In-Distribution vs. Out-of-Distribution Performance: To validate that models learned generalizable conversational patterns rather than dataset-specific features, performance was evaluated separately on in-distribution (ID, 5,970 conversations, 68.9%) and out-of-distribution (OOD, 2,693 conversations from held-out topics and countries, 31.1%) test subsets. Table 14 demonstrates OOD generalization across all evaluation metrics. Fine-tuned ArabianGPT-08B-V2 achieved identical human evaluation scores (4.34) on both distributions, with automatic metrics showing minimal variation (RAVEN: −0.8%, perplexity: + 0.3%). Similarly, all other models maintained consistent performance across subsets. Averaging across all models, OOD degradation was negligible: RAVEN +0.3%, perplexity −1.5%, human evaluation −0.1%. The convergence of automatic metrics and human evaluation in demonstrating near-zero OOD degradation provides robust triangulated evidence for genuine generalization. This multi-method validation is particularly significant: automatic metrics assess different quality aspects, while human evaluation captures overall conversational quality. The consistency across all three metrics validates both the diversity of this study’s synthetic dataset and the effectiveness of the duplicate prevention and split stratification procedures, confirming that models acquired transferable conversational competence applicable to novel topic-country combinations.

These results demonstrate that fine-tuning pre-trained Arabic language models on synthetic data yields measurable improvements in dialogue quality, with ArabianGPT-08B-V2 showing superior contextual coherence and robust generalization beyond the training distribution.

Human evaluation

Overall results.

Human evaluation scores averaged across both evaluators yielded overall ratings of 4.34 (fine-tuned ArabianGPT-08B-V2), 4.04 (fine-tuned AraGPT2-mega), 4.18 (Llama-3-8B-Instruct), and 4.22 (AceGPT-7B-chat) on the five-point Likert scale. These results place all models in the high-quality range, with fine-tuned Arabic models performing competitively with multilingual baselines. Differences between models remained modest (0.30 points), suggesting comparable practical utility across the evaluated systems.

Inter-rater reliability analysis.

Inter-rater reliability analysis revealed meaningful patterns in subjective evaluation of Arabic dialogue systems. Evaluator 1 (one of the authors) consistently provided ratings at the upper end of the scale (means ranging from 4.19 to 4.65, SD ≈ 0.49–0.57), while Evaluator 2 (the external evaluator) assigned systematically more conservative ratings (means ranging from 3.85 to 4.12, SD ≈ 0.42–0.54). This systematic difference of 0.17–0.62 points across models reflects distinct evaluation perspectives, with larger differences observed for the authors’ fine-tuned models, suggesting differing interpretations of the rating scale or quality standards between evaluators.

Multiple complementary reliability metrics quantified the degree of evaluator agreement. Exact agreement ranged from 0.375 (fine-tuned ArabianGPT-08B-V2, Relevance) to 0.834 (Llama-3-8B-Instruct, Diversity), showing substantial variation across models. Quadratic-weighted Cohen’s κ values demonstrated fair to substantial agreement: 0.229–0.247 for ArabianGPT (fair agreement), 0.481–0.581 for AraGPT2 and AceGPT (moderate agreement), and 0.635–0.739 for Llama-3 (substantial agreement), with narrow 95% confidence intervals computed via bootstrap resampling (2,000 iterations). Within-one agreement approached perfect levels (0.999–1.000), indicating that evaluators rarely differed by more than one scale point. Spearman rank correlations between evaluators showed moderate to strong positive associations across all models and criteria, ranging from 0.424 (ArabianGPT, Relevance) to 0.759 (Llama-3, Diversity).

The pattern of inter-rater agreement revealed important insights into evaluation consistency. The well-established baseline model (Llama-3-8B-Instruct) demonstrated the highest inter-rater agreement (κ = 0.635–0.739, ρ = 0.679–0.759), suggesting that evaluators more readily converge on assessments of familiar, widely-used systems. The Arabic-specific fine-tuned models showed moderate agreement (AraGPT2 and AceGPT: κ = 0.481–0.581, ρ = 0.564–0.636), while the authors’ best-performing model (ArabianGPT-08B-V2) exhibited lower but still acceptable agreement (κ = 0.229–0.247, ρ = 0.424–0.442). This gradient in agreement levels may reflect evaluator familiarity with different model types and linguistic characteristics.

Both evaluators ranked models similarly at the aggregate level. Evaluator 1 ranked fine-tuned ArabianGPT-08B-V2 highest (means ≈ 4.64–4.65) and fine-tuned AraGPT2-mega lowest (means ≈ 4.19–4.20), aligning with automatic evaluation results. Evaluator 2’s rankings showed more nuanced distinctions: Llama-3-8B-Instruct highest (means ≈ 4.05–4.12), followed closely by AceGPT-7B-chat (means ≈ 4.03–4.08) and ArabianGPT-08B-V2 (means ≈ 4.02–4.03), with AraGPT2-mega lowest (means ≈ 3.85–3.92). Despite these differences, the moderate to strong positive correlations indicate that both evaluators perceived similar quality patterns across models.

Methodological considerations.

The evaluation protocol implemented in this study provides valuable lessons for future research in Arabic dialogue system assessment. While the evaluation demonstrated acceptable inter-rater reliability (fair to substantial agreement), several factors influenced the observed variation. The protocol lacked a formal calibration phase where evaluators could align their understanding of the rating scale through discussion of anchor examples. Evaluators worked independently without access to reference standards or guidelines for resolving ambiguous cases. The evaluation criteria (Fluency, Relevance, Diversity) may have been interpreted with different thresholds by each evaluator, particularly regarding distinctions between performance levels. The involvement of an author as an evaluator (Evaluator 1), while ensuring domain expertise, may have introduced different expectations compared to the external evaluator, as evidenced by the consistently higher ratings from Evaluator 1.

These observations inform recommendations for enhanced evaluation methodologies. The positive correlations between evaluators (ρ = 0.424–0.759) demonstrate that both perceived similar quality patterns, while the fair to substantial agreement levels (κ = 0.229–0.739) indicate acceptable but improvable consistency. The gradient in agreement levels, highest for the well-known baseline model (Llama-3) and lowest for the novel fine-tuned model (ArabianGPT), suggests that evaluator consensus may depend on model familiarity and linguistic characteristics. The near-perfect within-one agreement (0.999–1.000) indicates strong ordinal consistency, with evaluators rarely disagreeing on broad quality tiers despite differences in absolute ratings.

Integrated assessment

The automatic metrics provide robust and consistent evidence supporting the effectiveness of fine-tuning pre-trained Arabic language models on synthetic data. Fine-tuned ArabianGPT-08B-V2’s superior performance on the RAVEN metric (0.823) and strong within-model perplexity (9.4) demonstrate clear advantages in linguistic quality and response appropriateness. The ID vs. OOD analysis provides additional validation: the negligible performance degradation on held-out topic-country combinations (average <1% across all metrics) confirms that models learned generalizable conversational patterns rather than memorizing training examples. This robust cross-distribution consistency, validated through both automatic metrics and human evaluation, demonstrates the quality and diversity of the synthetic training data.

Human evaluation revealed convergent validity with automatic metrics across both evaluation paradigms. Both evaluators ranked ArabianGPT-08B-V2 favorably (Evaluator 1 ranked it highest; Evaluator 2 ranked it competitively), and critically, human evaluation scores showed the same pattern of OOD robustness as automatic metrics, with negligible degradation across distributions. The positive correlations between evaluators (ρ = 0.424–0.759) and fair to substantial agreement levels (κ = 0.229–0.739) indicate that, despite systematic rating differences, both evaluators perceived similar quality patterns across models and distributions. The convergence of automatic metrics (evaluated across thousands of conversations) and human evaluation (across multiple raters, criteria, and data distributions) provides comprehensive triangulated evidence for model quality and generalization capability.

Architectural and training determinants of performance

The superior performance of fine-tuned ArabianGPT-08B-V2 (RAVEN: 0.823; Human: 4.34) despite its smaller parameter count (800M vs. 7–8B for baselines) can be attributed to three converging factors. First, its Arabic-exclusive pre-training on 14+ billion tokens ensures that 100% of model capacity is dedicated to Arabic linguistic patterns, avoiding the capacity dilution inherent in multilingual models like Llama-3-8B-Instruct, which distributes 15 trillion tokens across 30+ languages. Second, the AraNizer tokenizer (a purely Arabic BPE tokenizer with 64K vocabulary) efficiently encodes Arabic’s complex morphology, whereas the baselines employ general-purpose tokenizers (tiktoken, LLaMA 2) optimized for English that may suboptimally segment Arabic morphemes. Third, task-specific fine-tuning on the synthetic multi-turn dataset directly aligned model behavior with the evaluation criteria, compensating for architectural limitations such as the 1,024-token context window.

The convergence of RAVEN scores between fine-tuned AraGPT2-mega (0.780) and the instruction-tuned baselines (0.781–0.782) reveals an important interaction between base model characteristics and fine-tuning efficacy. Despite undergoing identical fine-tuning, AraGPT2-mega (released in 2020 using the GROVER architecture variant) did not achieve the same gains as ArabianGPT-08B-V2 (February 2024). This disparity likely reflects accumulated methodological advances in tokenization (AraNizer vs. GPT2TokenizerFast), training data curation, and architectural refinements that occurred between 2020 and 2024. Meanwhile, the competitive human evaluation scores of instruction-tuned baselines (4.18–4.22) without task-specific fine-tuning demonstrate that sophisticated alignment techniques (Reinforcement Learning from Human Feedback for Llama-3 and Reinforcement Learning from AI Feedback for AceGPT) produce broadly capable models, though their lower RAVEN scores indicate that general instruction-following capability does not fully substitute for task-specific adaptation in multi-turn Arabic dialogue.

These findings suggest that for specialized Arabic dialogue tasks, the alignment between training data and target task may outweigh raw model scale. The results indicate that smaller, Arabic-native models with task-specific fine-tuning can surpass larger multilingual alternatives, underscoring the importance of language-specific optimization, tokenization strategy, and task alignment as critical factors in Arabic LLM research.

Conclusion

This study introduced a practical methodology for improving Arabic multi-turn conversational language models by leveraging synthetic data generated using an Arabic instruction-tuned LLM. The proposed approach involved prompt engineering, hyperparameter configuration, and large-scale dialogue generation covering a diverse set of topics and countries. The resulting dataset included 43,316 conversations and 608,052 utterances, offering a rich and diverse resource for fine-tuning pre-trained Arabic language models.

The evaluation of fine-tuned models demonstrated clear improvements on automatic metrics with robust generalization beyond the training distribution. Fine-tuned ArabianGPT-08B-V2 demonstrated strong within-model performance (Perplexity: 9.4) and achieved the highest cross-model RAVEN score (0.823), outperforming fine-tuned AraGPT2-mega and multilingual instruction-tuned baselines. Critically, comprehensive evaluation across in-distribution (5,970 conversations) and out-of-distribution (2,693 conversations from held-out topics and countries) test subsets revealed exceptional generalization, with negligible performance degradation (<1% average across RAVEN, perplexity, and human evaluation). This near-zero OOD degradation, validated through multiple evaluation methods, provides strong evidence that models learned generalizable conversational competence rather than dataset-specific patterns, confirming the quality and diversity of the synthetic training data.

Human evaluation demonstrated acceptable inter-rater reliability and yielded averaged scores that aligned with automatic metric rankings across both ID and OOD distributions. Inter-rater agreement ranged from fair to substantial (Cohen’s κ = 0.229–0.739), with positive rank correlations (Spearman ρ = 0.424–0.759) and near-perfect ordinal agreement (within-one = 0.999–1.000). Exact agreement ranged from 0.375 to 0.834 across models and criteria. The systematic rating differences of 0.17–0.62 points between evaluators, while notable, remained within acceptable bounds for subjective assessment. Importantly, human evaluation scores demonstrated the same pattern of robust OOD generalization observed in automatic metrics, with negligible degradation on held-out distributions. The pattern of agreement levels, highest for the established baseline model (Llama-3: κ = 0.635–0.739) and lower for novel fine-tuned models (ArabianGPT: κ = 0.229–0.247), suggests that evaluator consensus depends partly on model familiarity. These findings highlight both the feasibility of human evaluation for Arabic dialogue systems and the need for continued methodological refinement.

This study makes several important contributions to Arabic NLP research. First, it demonstrates that synthetic data generation represents a viable and scalable approach for developing training resources for under-resourced languages, with automatic metrics and OOD evaluation confirming both the quality improvements and robust generalization achievable through this methodology. Second, it introduces a novel evaluation benchmark specifically designed for Arabic multi-turn chatbots, combining automatic metrics with structured human evaluation and comprehensive OOD assessment using held-out topic-country combinations. The convergence of automatic and human metrics in demonstrating exceptional OOD generalization provides robust triangulated evidence for the effectiveness of this approach. Third, it provides transparent documentation of both successes and methodological considerations in evaluation, offering important lessons for the community regarding the complexities of assessing dialogue system quality and validating generalization capabilities.

The automatic metrics provide strong empirical support for fine-tuning pre-trained Arabic language models on synthetic conversational data, with OOD analysis confirming genuine generalization rather than memorization. Human evaluation showed convergent validity with these findings across both ID and OOD distributions. Future work should continue refining human evaluation methodologies by:

  1. Recruiting a larger, more diverse pool of independent evaluators blind to model identity

  2. Implementing structured calibration sessions with anchor examples spanning the full rating scale

  3. Developing more detailed rubrics with concrete quality indicators for each criterion

  4. Collecting multiple ratings per conversation to enable more robust statistical analysis

  5. Incorporating qualitative feedback to understand the sources of rating differences

These improvements would further strengthen confidence in human evaluation results and provide clearer insight into the practical utility of fine-tuned Arabic dialogue models.

This methodology can be reproduced with emerging state-of-the-art Arabic LLMs, allowing for broader comparisons as more advanced models become available. Additionally, with access to greater computational resources, a wider range of pre-trained Arabic language models and multilingual language models could be fine-tuned and benchmarked, further strengthening the generalizability and impact of the approach. The demonstrated robustness of OOD generalization suggests that this synthetic data generation approach successfully captures fundamental aspects of Arabic multi-turn dialogue that transfer effectively to novel contexts. Future directions may also explore integrating Reinforcement Learning from Human Feedback (RLHF), expanding dialectal coverage, refining dialogue modeling techniques, and developing more reliable human evaluation frameworks that complement automatic metrics in assessing the nuanced qualities of conversational Arabic AI systems. By addressing both the technical challenges of model development and the methodological challenges of evaluation, including robust validation of generalization capabilities, the field can advance toward culturally and linguistically aligned dialogue systems that serve Arabic-speaking communities effectively across diverse topics and contexts.

Supporting information

S1 Text. English to Arabic translations.

(DOCX)

S1 Checklist. PLOSOne human subjects research checklist.

(DOCX)


Data Availability

All relevant data are within the paper and its Supporting information files. The dataset has been provided at this link: https://doi.org/10.5281/zenodo.17855012

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1.Zhang S, Dong L, Li X, Zhang S, Sun X, Wang S, et al. Instruction tuning for large language models: A survey. 2024. https://arxiv.org/abs/2308.10792
  • 2.Antoun W, Baly F, Hajj H. AraBERT: Transformer-based Model for Arabic Language Understanding. In: Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France, 2020. https://arxiv.org/abs/2003.00104 [Google Scholar]
  • 3.Antoun W, Baly F, Hajj H. AraGPT2: Pre-Trained Transformer for Arabic Language Generation. 2020. https://arxiv.org/abs/2012.15520
  • 4.Nagoudi EMB, Elmadany A, Abdul-Mageed M. AraT5: Text-to-Text Transformers for Arabic Language Generation. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics; 2022. p. 628–47. doi: 10.18653/v1/2022.acl-long.47 [DOI] [Google Scholar]
  • 5.Naous T, Antoun W, Mahmoud R, Hajj H. Empathetic BERT2BERT conversational model: Learning Arabic language generation with little data. In: Proceedings of the Sixth Arabic Natural Language Processing Workshop, 2021. p. 164–72. https://aclanthology.org/2021.wanlp-1.17/ [Google Scholar]
  • 6.Fuad A, Al-Yahya M. AraConv: Developing an Arabic Task-Oriented Dialogue System Using Multi-Lingual Transformer Model mT5. App Sci. 2022;12(4):1881. doi: 10.3390/app12041881 [DOI] [Google Scholar]
  • 7.Alruily M. ArRASA: Channel Optimization for Deep Learning-Based Arabic NLU Chatbot Framework. Electronics. 2022;11(22):3745. doi: 10.3390/electronics11223745 [DOI] [Google Scholar]
  • 8.Sengupta N, Sahu SK, Jia B, Katipomu S, Li H, Koto F, et al. Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models. 2023. https://arxiv.org/abs/2308.16149
  • 9.Elmadany AA, Abdou SM, Gheith M. JANA: An Arabic human-human dialogues corpus. In: 2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS), 2015. p. 347–52. doi: 10.1109/retis.2015.7232903 [DOI] [Google Scholar]
  • 10.Abdallah A, Kasem M, Abdalla M, Mahmoud M, Elkasaby M, Elbendary Y, et al. ArabicaQA: A Comprehensive Dataset for Arabic Question Answering. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024. p. 2049–59. doi: 10.1145/3626772.3657889 [DOI] [Google Scholar]
  • 11.Chouikhi H, Aloui M, Ben Hammou C, Chaabane G, Kchaou H, Dhaouadi C. LlamAr & GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning. 2024. https://arxiv.org/abs/2407.02147
  • 12.Mozannar H, Maamary E, El Hajal K, Hajj H. Neural Arabic Question Answering. In Proceedings of the Fourth Arabic Natural Language Processing Workshop. Association for Computational Linguistics, Florence, Italy; 2019 108–118. doi: 10.18653/v1/W19-4612 [DOI] [Google Scholar]
  • 13.Ni J, Xue F, Jain K, Shah MH, Zheng Z, You Y. Instruction in the wild: A user-based instruction dataset. 2023. https://github.com/XueFuzhao/InstructionWild
  • 14.Wang Y, Kordi Y, Mishra S, Liu A, Smith NA, Khashabi D, et al. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560. 2022. [Google Scholar]
  • 15.Zhang Y, Sun S, Galley M, Chen Y-C, Brockett C, Gao X, et al. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. ACL; 2020. https://arxiv.org/abs/1911.00536 [Google Scholar]
  • 16.Ye J, Chen X, Xu N, Zu C, Shao Z, Liu S, et al. A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models. 2023. https://arxiv.org/abs/2303.10420
  • 17.OpenAI. GPT-4o System Card. 2024. https://arxiv.org/abs/2410.21276
  • 18.Google. Gemini: A Family of Highly Capable Multimodal Models. 2024. https://arxiv.org/abs/2312.11805
  • 19.Meta. The Llama 3 Herd of Models. 2024. https://arxiv.org/abs/2407.21783
  • 20.Jiang AQ, Sablayrolles A, Roux A, Mensch A, Savary B, Bamford C, et al. Mixtral of Experts. 2024. https://arxiv.org/abs/2401.04088
  • 21.Longpre S, Le H, Vu T, Webson A, Chung HW, Tay Y, et al. The flan collection: Designing data and methods for effective instruction tuning. In: International Conference on Machine Learning, 2023. p. 22631–48. https://arxiv.org/abs/2301.13688 [Google Scholar]
  • 22.Hijjawi M, Bandar Z, Crockett K, Mclean D. ArabChat: An arabic conversational agent. In Proceedings of the 2014 6th International Conference on Computer Science and Information Technology (CSIT), Piscataway, NJ, USA, 26–27 March 2014; IEEE: Piscataway, NJ, USA; p. 227–237. doi: 10.1109/CSIT.2014.6806005 [DOI] [Google Scholar]
  • 23.Ali DA, Habash N. Botta: An arabic dialect chatbot. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan, 11–16 December 2016; p. 208–212. https://aclanthology.org/C16-2044/ [Google Scholar]
  • 24.Fadhil A. OlloBot-towards a text-based arabic health conversational agent: Evaluation and results. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), Varna, Bulgaria, 2–4 September 2019; p. 295–303. https://aclanthology.org/R19-1034/ [Google Scholar]
  • 25.AlHumoud S, Al A, Aldamegh W. Arabic Chatbots: A Survey. Int J Adv Comput Sci Appl. 2018;9(8). doi: 10.14569/ijacsa.2018.090867 [DOI] [Google Scholar]
  • 26.Naous T, Hokayem C, Hajj H. Empathy-driven Arabic conversational chatbot. In: Proceedings of the Fifth Arabic Natural Language Processing Workshop, 2020. https://aclanthology.org/2020.wanlp-1.6/ [Google Scholar]
  • 27.Naous T, Antoun W, Mahmoud R, Hajj H. Empathetic BERT2BERT conversational model: Learning Arabic language generation with little data. In: Proceedings of the Sixth Arabic Natural Language Processing Workshop, 2021. p. 164–72. https://aclanthology.org/2021.wanlp-1.17/ [Google Scholar]
  • 28.Pandya HA, Bhatt BS. Question answering survey: directions, challenges, datasets, evaluation matrices. 2021. https://arxiv.org/abs/2112.03572
  • 29.Zhang S, Dong L, Li X, Zhang S, Sun X, Wang S, et al. Instruction Tuning for Large Language Models: A Survey. 2024. Available from: https://arxiv.org/abs/2308.10792
  • 30.Cui L, Wu Y, Liu S, Zhang Y, Zhou M. MuTual: A Dataset for Multi-Turn Dialogue Reasoning. 2020. Available from: https://arxiv.org/abs/2004.04494
  • 31.Xu C, Guo D, Duan N, McAuley J. Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data. 2023. Available from: https://arxiv.org/abs/2304.01196
  • 32.Bach SH, Sanh V, Yong Z-X, Webson A, Raffel C, Nayak NV, et al. Promptsource: An integrated development environment and repository for natural language prompts. 2022. Available from: https://arxiv.org/abs/2202.01279
  • 33.Singh S, Vargus F, D’souza D, Karlsson B, Mahendiran A, Ko W-Y, et al. Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024. p. 11521–67. doi: 10.18653/v1/2024.acl-long.620 [DOI] [Google Scholar]
  • 34.Luthier G, Popescu-Belis A. Chat or learn: a data-driven robust question answering system. In: Proceedings of the 12th Conference on Language Resources and Evaluation, 2020. p. 5474–80. Available from: https://aclanthology.org/2020.lrec-1.672 [Google Scholar]
  • 35.Almazrouei E, Alobeidli H, Alshamsi A, Cappelli A, Cojocaru R, Debbah M, et al. The Falcon Series of Open Language Models. 2023. Available from: https://arxiv.org/abs/2311.16867
  • 36.Chen M, Papangelis A, Tao C, Kim S, Rosenbaum A, Liu Y, et al. PLACES: Prompting Language Models for Social Conversation Synthesis. In: Findings of the Association for Computational Linguistics: EACL 2023. Available from: https://arxiv.org/abs/2302.03269 [Google Scholar]
  • 37.Ding N, Chen Y, Xu B, Qin Y, Zheng Z, Hu S, et al. Enhancing Chat Language Models by Scaling High-Quality Instructional Conversations. 2023. Available from: doi: 10.48550/arXiv.2305.14233 [DOI]
  • 38.ALMutairi M, AlKulaib L, Aktas M, Alsalamah S, Lu C-T. Synthetic Arabic Medical Dialogues Using Advanced Multi-Agent LLM Techniques. In: Proceedings of The Second Arabic Natural Language Processing Conference, 2024. p. 11–26. doi: 10.18653/v1/2024.arabicnlp-1.2 [DOI] [Google Scholar]
  • 39.Alyafeai Z, Pieler MP, Teufel H, Tow J, Bellagente M, Phung D, et al. Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic. 2024. doi: 10.48550/arXiv.2412.04277 [DOI]
  • 40.OpenAI. GPT-4 Technical Report. arXiv preprint. 2023. Available from: doi: 10.48550/arXiv.2303.08774 [DOI]
  • 41.Huang H, Yu F, Zhu J, Sun X, Cheng H, Song D, et al. AceGPT, Localizing Large Language Models in Arabic. In: NAACL main conference. 2024. Available from: https://arxiv.org/abs/2309.12053 [Google Scholar]
  • 42.Reimers N, Gurevych I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019. Available from: http://arxiv.org/abs/1908.10084 [Google Scholar]
  • 43.Billah Nagoudi EM, Abdul-Mageed M, Elmadany A, Inciarte A, Islam Khondaker MT. JASMINE: Arabic GPT Models for Few-Shot Learning. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. p. 16721–44. doi: 10.18653/v1/2023.emnlp-main.1040 [DOI] [Google Scholar]
  • 44.Koubaa A, Ammar A, Ghouti L, Najar O, Sibaee S. ArabianGPT: Native Arabic GPT-based Large Language Model. 2024. Available from: https://arxiv.org/abs/2402.15313
  • 45.Zhang H, Lan Y, Pang L, Guo J, Cheng X. Recosa: Detecting the relevant contexts with self-attention for multi-turn dialogue generation. 2019. Available from: https://arxiv.org/abs/1907.05339
  • 46.Lan T, Mao XL, Wei W, Huang H. Which kind is better in open-domain multi-turn dialog, hierarchical or non-hierarchical models? An empirical study. arXiv preprint. 2020. Available from: https://arxiv.org/abs/2008.02964 [Google Scholar]
  • 47.Zhang Y, Ni A, Yu T, Zhang R, Zhu C, Deb B, et al. An exploratory study on long dialogue summarization: What works and what’s next. arXiv preprint. 2021. doi: 10.48550/arXiv.2109.04609
  • 48.Xu J, Szlam A, Weston J. Beyond goldfish memory: Long-term open-domain conversation. arXiv preprint. 2021. doi: 10.48550/arXiv.2107.07567 [DOI] [Google Scholar]
  • 49.Boughorbel S, Hawasly M. Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic. In: Proceedings of ArabicNLP 2023, 2023. p. 128–39. doi: 10.18653/v1/2023.arabicnlp-1.11 [DOI] [Google Scholar]
  • 50.Nagoudi EMB, Elmadany A, El-Shangiti AO, Abdul-Mageed M. Dolphin: A Challenging and Diverse Benchmark for Arabic NLG. arXiv preprint. 2023. doi: 10.48550/arXiv.2305.14989 [DOI]
  • 51.Pieri S, Mullappilly SS, Khan FS, Anwer RM, Khan S, Baldwin T, et al. BiMediX: Bilingual Medical Mixture of Experts LLM. arXiv preprint. 2024. Available from: https://arxiv.org/abs/2402.13253
  • 52.Koto F, Li H, Shatnawi S, Doughman J, Sadallah AB, Alraeesi A, et al. ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic. arXiv preprint. 2024. Available from: https://arxiv.org/abs/2402.12840
  • 53.Ghaboura S, Heakl A, Thawakar O, Alharthi A, Riahi I, Saif A, et al. CIDAR: Culturally Relevant Instruction Dataset for Arabic. In: Findings of the Association for Computational Linguistics: ACL 2024, 2024. p. 12134–47. Available from: https://aclanthology.org/2024.findings-acl.764/
  • 54.Freedom Intelligence. ACVA: Arabic Cultural Value Alignment Dataset. Hugging Face Datasets, 2023. Available from: https://huggingface.co/datasets/FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment
  • 55.Almazrouei E, Cojocaru R, Baldo M, Malartic Q, Alobeidli H, Mazzotta D, et al. AlGhafa Evaluation Benchmark for Arabic Language Models. In: Proceedings of ArabicNLP 2023, 2023. p. 244–75. doi: 10.18653/v1/2023.arabicnlp-1.21 [DOI] [Google Scholar]
  • 56.Chen SF, Beeferman D, Rosenfeld R. Evaluation metrics for language models. 1998. Available from: https://www.cs.cmu.edu/~roni/papers/eval-metrics-bntuw-9802.pdf [Google Scholar]
  • 57.McCoy RT, Smolensky P, Linzen T, Gao J, Celikyilmaz A. How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN. arXiv preprint. 2021. doi: 10.48550/arXiv.2111.09509 [DOI]

Decision Letter 0

Helen Howard

20 Sep 2025

Dear Dr. Misbah,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The manuscript has been evaluated by two reviewers, and their comments are available below.

The reviewers have raised a number of concerns that need attention. In particular, they request additional information on the statistical analysis.

Could you please revise the manuscript to carefully address the concerns raised?

Please submit your revised manuscript by Nov 03 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Helen Howard

Staff Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please include a statement in your manuscript text clarifying whether the authors of this study carried out the assessment reported in the 'Human Evaluation' section of your manuscript text.

3. Please note that PLOS One has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

4. We suggest you thoroughly copyedit your manuscript for language usage, spelling, and grammar. If you do not know anyone who can help you do this, you may wish to consider employing a professional scientific editing service.

The American Journal Experts (AJE) (https://www.aje.com/) is one such service that has extensive experience helping authors meet PLOS guidelines and can provide language editing, translation, manuscript formatting, and figure formatting to ensure your manuscript meets our submission guidelines. Please note that having the manuscript copyedited by AJE or any other editing services does not guarantee selection for peer review or acceptance for publication.

Upon resubmission, please provide the following:

The name of the colleague or the details of the professional service that edited your manuscript

A copy of your manuscript showing your changes by either highlighting them or using track changes (uploaded as a *supporting information* file)

A clean copy of the edited manuscript (uploaded as the new *manuscript* file)

5. In the online submission form you indicate that your data is not available for proprietary reasons and have provided a contact point for accessing this data. Please note that your current contact point is a co-author on this manuscript. According to our Data Policy, the contact point must not be an author on the manuscript and must be an institutional contact, ideally not an individual. Please revise your data statement to a non-author institutional point of contact, such as a data access or ethics committee, and send this to us via return email. Please also include contact information for the third party organization, and please include the full citation of where the data can be found.

6. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

7. If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

Reviewer #1: The present submission involves multi-turn conversational AI, a challenging field in itself, with additional complications concerning languages with existing resource issues such as Arabic, especially the spoken form(s) of Arabic.

It is of special importance that linguistic and socio-cultural aspects in spoken communication are addressed in the present submission, namely the nuances and context of Arabic conversation, where language-specific parameters apply and are not always compatible with standard English and English data.

The submitted research paper includes a comprehensive yet detailed outline of related research, directly or indirectly connected to the present approach, methods and resources for Arabic, including the addressing of current issues and limitations.

The goals are clearly defined (achieving a robust, high-quality dataset specifically designed to fine-tune existing causal Arabic LLMs for multi-turn conversational tasks by building a synthetic Arabic dataset using a recently launched Arabic instructional LLM), as is the detailed and well-organized methodology.

A significant strength of the paper is the step-by-step analytical presentation of the linguistic (Arabic) data processing according to the proposed approach, contributing to its explainability and transparency. It may also be noted that extended examples are provided for more complex issues, such as the "new educational initiatives in Morocco" (Line 310).

The combination of human evaluators and benchmark evaluation is another feature of the proposed approach that is especially sensitive to efficient deployment of Arabic LLMs in real-life situations.

In general, the present submission demonstrates a processing approach aiming to resolve a set of specific and challenging issues with a detailed and explanatory methodology and results, all expressed in a clear, well-written text. Indeed, the approach presented may serve as a basis for additional upgrading and research for conversational Arabic and also for other languages.

Reviewer #2: The manuscript entitled “Fine-Tuning Arabic Large Language Models for Improved Multi-Turn Dialogue: A Blueprint for Synthetic Data Generation and Benchmarking” addresses an important gap in Arabic natural language processing by proposing a reproducible methodology for generating synthetic multi-turn dialogue datasets and evaluating their effectiveness in fine-tuning Arabic large language models (LLMs). The study is timely, methodologically well-structured, and offers valuable contributions to the field of conversational AI, particularly in low-resource and linguistically complex languages such as Arabic. My comments are provided below with respect to the journal’s review criteria.

1. Technical Soundness and Data Support for Conclusions

The study is technically sound. The authors carefully describe their approach to dataset generation, including the selection of instructional LLMs, prompt engineering strategies, and hyperparameter tuning. The methodology for fine-tuning two Arabic-native LLMs (ArabianGPT-08B-V2 and AraGPT2-mega) is well-detailed, and the evaluation framework, which incorporates perplexity, RAVEN, and human judgment, is appropriate and robust. The reported results consistently demonstrate improvements over multilingual baselines, supporting the central claim that synthetic data can effectively enhance Arabic conversational models. The conclusions are aligned with the presented findings and are drawn in a balanced and evidence-based manner. One limitation, however, is the sole reliance on synthetic data without the inclusion of external validation against naturally occurring conversations. Addressing this in future work would further strengthen the study.

2. Statistical Analysis

The statistical framework is generally appropriate and provides meaningful insights into model performance. The use of perplexity and RAVEN ensures a quantitative assessment of fluency and contextual coherence, while human evaluation captures cultural and linguistic nuances that automated metrics may miss. Nevertheless, the analysis would benefit from additional detail. Specifically, reporting inter-rater reliability for human evaluations and providing confidence intervals or statistical significance tests when comparing model scores would enhance the rigor of the findings. While these omissions do not undermine the overall validity of the results, their inclusion would make the analysis more comprehensive and transparent.

3. Data Availability

The manuscript notes that the generated dataset is not publicly available, as it forms part of the author’s doctoral research, but may be shared upon reasonable request for academic and non-commercial use. While metadata, prompt templates, and generation parameters are provided, this arrangement does not fully comply with the PLOS Data Policy, which requires unrestricted availability of data underlying published findings. To align with the policy, the authors are strongly encouraged to deposit the dataset, or a representative subset, in a public repository, ensuring that it is accessible to the research community. If restrictions are unavoidable, these should be explicitly justified, with clear ethical, legal, or proprietary grounds. Without such measures, reproducibility and transparency are limited.

4. Presentation and Language

The manuscript is written in clear and professional English, with a coherent structure and logical flow across sections. Technical terminology is employed appropriately, making the paper accessible to specialists in natural language processing and AI. However, minor typographical and stylistic inconsistencies are present, such as variable hyphenation (“multi-turn” vs. “multi turn”) and occasional redundant phrasing. These are relatively minor issues but should be corrected at the revision stage to improve polish and readability. No substantive language editing is required.

5. Additional Comments

The novelty of the study is well-established. The introduction and literature review provide a solid contextual grounding, highlighting the lack of high-quality Arabic multi-turn conversational datasets and positioning the proposed methodology as a significant step forward. The experimental design is rigorous, and the discussion appropriately situates the results within the broader field. Importantly, the authors provide a technical blueprint that can be adapted for other low-resource languages, enhancing the manuscript’s broader applicability.

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: Yes: Christina Alexandris

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

PLoS One. 2026 Feb 12;21(2):e0341905. doi: 10.1371/journal.pone.0341905.r002

Author response to Decision Letter 1


13 Oct 2025

Dear Editor,

We thank you and the reviewers for your careful evaluation of our manuscript and for the constructive feedback provided. Below, we provide a point-by-point response to the editorial comments and detail the corresponding revisions made to the manuscript.

________________________________________

1. Data availability and contact point

Editor comment:

In the online submission form you indicate that your data is not available for proprietary reasons and have provided a contact point for accessing this data. Please note that your current contact point is a co-author on this manuscript. According to our Data Policy, the contact point must not be an author on the manuscript and must be an institutional contact, ideally not an individual.

Response:

We have revised our data availability statement. The complete dataset used in this study will now be made publicly available as part of this submission. This eliminates the need for an individual contact point and ensures that the data are accessible in line with PLOS ONE’s Data Policy.

________________________________________

2. Clarification of human evaluation authorship

Editor comment:

Please include a statement in your manuscript text clarifying whether the authors of this study carried out the assessment reported in the 'Human Evaluation' section.

Response:

We have clarified in the “Human Evaluation” subsection that the human evaluation was carried out by two of the paper’s authors.

________________________________________

3. Statistical analysis

Editor comment:

Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: N/A

Response:

Assuming “N/A” refers to “Not Available,” we have strengthened the statistical analysis discussion in three places:

• In the “Human Evaluation” subsection, where we explain the statistical methods used,

• In Table 9 of the “Benchmark Results” subsection, which now summarizes the statistical outcomes of human evaluations across the four LLMs,

• In the “Analysis of Results” section, where we provide a detailed analysis of the findings.

________________________________________

4. Additional statistical detail

Editor comment:

Reporting inter-rater reliability and providing confidence intervals or statistical significance tests would enhance the rigor of the findings.

Response:

We have added detailed inter-rater reliability analysis for the human evaluations. This now includes percent agreement, weighted Cohen’s κ, within-one agreement, and Spearman rank correlation (a brief illustrative sketch of these statistics follows the list below). These additions appear in:

• The “Human Evaluation” subsection,

• Table 9 in the “Benchmark Results” subsection,

• The “Analysis of Results” section.
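As a minimal sketch only, the complementary agreement statistics named above (percent agreement, within-one agreement, and Spearman rank correlation) can be computed along these lines in Python; the rating vectors are placeholders, not the study's actual scores.

    # Sketch: complementary inter-rater agreement statistics on a 5-point scale.
    # The rating vectors passed in are placeholders, not the study's data.
    import numpy as np
    from scipy.stats import spearmanr

    def agreement_summary(rater1, rater2):
        r1, r2 = np.asarray(rater1), np.asarray(rater2)
        exact = float(np.mean(r1 == r2))                   # percent agreement
        within_one = float(np.mean(np.abs(r1 - r2) <= 1))  # within-one agreement
        rho, p_value = spearmanr(r1, r2)                   # Spearman rank correlation
        return {"percent_agreement": exact,
                "within_one_agreement": within_one,
                "spearman_rho": float(rho),
                "p_value": float(p_value)}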

________________________________________

5. Typographical and stylistic inconsistencies

Editor comment:

Minor typographical and stylistic inconsistencies are present.

Response:

We have carefully reviewed the manuscript and corrected typographical and stylistic inconsistencies to the best of our ability.

________________________________________

We sincerely thank you for your guidance and for the opportunity to improve our manuscript. We hope that the revisions meet the requirements of PLOS ONE and look forward to your favorable consideration.

Warm regards,

Ahmed Mahmoud Misbah

(on behalf of all co-authors)

Arab Academy for Science, Technology and Maritime Transport

a.misbah5156@student.aast.edu

Attachment

Submitted filename: Response to Reviewers.docx

pone.0341905.s004.docx (17.2KB, docx)

Decision Letter 1

Mohammad Salah Hassan

11 Nov 2025

Dear Dr. Misbah,

Please submit your revised manuscript by Dec 26 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Mohammad Salah Hassan, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Dear Authors,

Your paper is timely and valuable. The synthetic data process is clearly explained, the dataset scale is meaningful, and combining automatic and human evaluations was the right choice. The reviewer has now made a decision, so please make sure to address each of the points they have raised. In addition, I have included several further revisions and recommendations that should help strengthen the manuscript before resubmission.

The data-generation pipeline is explained step-by-step, including unusual length settings for multi-turn dialogues and the compute/time budget. The fine-tuning setup is transparent (IA3/PEFT, packing, training hours). You also try to look at quality from multiple angles: perplexity, RAVEN, and a small human study. That triangulation is a strength.

What needs tightening (do these first)

Dataset size math. The manuscript says there are 43,316 conversations with “precisely 20 exchanges” each, and elsewhere mentions “57 million utterances.” Those numbers don’t line up. Even if “exchange” means a user–bot pair (i.e., 40 turns), the total utterances are nowhere near 57M. Please correct the count everywhere and add a one-sentence derivation so a reviewer can follow the arithmetic.
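For illustration, the arithmetic can be checked in a few lines of Python; the figures below use the counts quoted in this letter, and treating an "exchange" as one user turn plus one bot turn is an assumption rather than the paper's stated definition.

    # Sanity check of the dataset-size arithmetic discussed above.
    # Assumption: one "exchange" = one user utterance + one bot utterance.
    conversations = 43_316
    exchanges_per_conversation = 20
    utterances_per_exchange = 2

    total = conversations * exchanges_per_conversation * utterances_per_exchange
    print(f"{total:,}")  # 1,732,640 -- about 1.7M utterances, nowhere near 57M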

Data availability. The Data Availability Statement points to a Google Drive folder. PLOS strongly prefers stable, citable hosting. Deposit the dataset (and scripts if possible) in a DOI-issuing repository such as Zenodo or OSF, cite the DOI in the DAS, and remove the ad-hoc cloud link. Version the release (e.g., v1.0), add a license, and include checksums.

Evaluation leakage risk. Your benchmark is a 20% subset of the same synthetic corpus used for training/fine-tuning. That invites style overlap and optimistic scores. Clarify how you prevented duplicates (hashing, embedding similarity, seed control), and say explicitly whether the test split was frozen before model selection. If you can, add a small out-of-distribution test (e.g., held-out topics or prompts generated with different seeds) and report ID vs. OOD results.
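As a hedged sketch of the kind of duplicate screening being requested (not the authors' documented pipeline), exact duplicates can be caught with hashing and near-duplicates with embedding similarity; the embedder name and the 0.95 threshold below are illustrative assumptions.

    # Sketch: screening a candidate test split for overlap with training data.
    # Assumptions: SHA-256 for exact matches; a multilingual sentence embedder
    # and a 0.95 cosine-similarity threshold for near-duplicates.
    import hashlib
    from sentence_transformers import SentenceTransformer, util

    def exact_overlap(train_texts, test_texts):
        train_hashes = {hashlib.sha256(t.encode("utf-8")).hexdigest() for t in train_texts}
        return [i for i, t in enumerate(test_texts)
                if hashlib.sha256(t.encode("utf-8")).hexdigest() in train_hashes]

    def near_overlap(train_texts, test_texts, threshold=0.95):
        model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
        train_emb = model.encode(train_texts, convert_to_tensor=True, normalize_embeddings=True)
        test_emb = model.encode(test_texts, convert_to_tensor=True, normalize_embeddings=True)
        sims = util.cos_sim(test_emb, train_emb)  # shape: (n_test, n_train)
        return [i for i in range(len(test_texts)) if float(sims[i].max()) >= threshold]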

RAVEN details. Lock down the metric. Name the sentence-embedding model (exact model and version), state any text normalization (diacritics, punctuation), explain the scaling from cosine similarity to the reported “RAVEN (scaled),” and say how you aggregate from turn-level to conversation-level. Using an embedding model you also fine-tuned can bias the metric; pick a fixed external Arabic/multilingual embedder for all systems and document it.
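One plausible instantiation of such a specification is sketched below: a fixed multilingual embedder (the paraphrase-multilingual-MiniLM-L12-v2 model the authors later cite), simple diacritic and punctuation normalization, a linear rescaling of cosine similarity from [-1, 1] to [0, 1], and mean aggregation from turn level to conversation level. All of these choices are assumptions for illustration, not the paper's documented RAVEN procedure.

    # Illustrative sketch only: turn-level similarity rescaled to [0, 1] and
    # averaged per conversation. Normalization and scaling choices are assumed.
    import re
    from sentence_transformers import SentenceTransformer, util

    EMBEDDER = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")

    def normalize(text: str) -> str:
        text = ARABIC_DIACRITICS.sub("", text)            # strip tashkeel
        return re.sub(r"[^\w\s]", " ", text).strip()      # drop punctuation

    def conversation_score(reference_turns, generated_turns) -> float:
        refs = EMBEDDER.encode([normalize(t) for t in reference_turns], convert_to_tensor=True)
        gens = EMBEDDER.encode([normalize(t) for t in generated_turns], convert_to_tensor=True)
        cosine = util.cos_sim(gens, refs).diagonal()       # one value per aligned turn
        scaled = (cosine + 1.0) / 2.0                      # map [-1, 1] to [0, 1]
        return float(scaled.mean())                        # conversation-level aggregate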

Perplexity comparability. Perplexity depends on the tokenizer and isn’t apples-to-apples across models with different vocabularies. Keep PP for within-model tracking, but add a short caveat and, if you can, report a tokenizer-agnostic figure (bits-per-byte) or compute PP with a common reference tokenizer for cross-model comparisons. Also name the exact tokenizers used.
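To make the distinction concrete, a minimal sketch is given below: perplexity is the exponential of the model's mean token-level negative log-likelihood (and therefore depends on its tokenizer), while bits-per-byte divides the total negative log-likelihood, expressed in bits, by the UTF-8 byte length of the text and is tokenizer-agnostic. The model name is a placeholder, not one of the paper's checkpoints.

    # Sketch: within-model perplexity vs. tokenizer-agnostic bits-per-byte.
    # "model_name" is a placeholder, not one of the paper's evaluated checkpoints.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def ppl_and_bpb(model_name: str, text: str):
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        nll_per_token = out.loss.item()                    # mean NLL in nats per predicted token
        n_predicted = enc["input_ids"].shape[1] - 1        # labels are shifted internally
        perplexity = math.exp(nll_per_token)               # tokenizer-dependent
        total_bits = nll_per_token * n_predicted / math.log(2)
        bits_per_byte = total_bits / len(text.encode("utf-8"))  # tokenizer-agnostic
        return perplexity, bits_per_byte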

Human evaluation reliability. Right now the raters are authors, and some weighted kappa values are near zero (even slightly negative). That weakens the claim. Bring in at least two independent raters blind to model identity, run a short calibration pass, and then report per-criterion agreement with confidence intervals. Keep the current results, but frame the author-only phase as a limitation.
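For reference, quadratic-weighted Cohen's κ with a bootstrap 95% confidence interval of the kind requested can be computed along the following lines; the rating vectors are placeholders, not the study's actual scores.

    # Sketch: quadratic-weighted Cohen's kappa with a bootstrap 95% CI.
    # The rating vectors passed in are placeholders for two raters' per-item scores.
    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    def weighted_kappa_with_ci(rater1, rater2, n_boot=2000, seed=0):
        r1, r2 = np.asarray(rater1), np.asarray(rater2)
        kappa = cohen_kappa_score(r1, r2, weights="quadratic")
        rng = np.random.default_rng(seed)
        boot = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(r1), len(r1))        # resample items with replacement
            boot.append(cohen_kappa_score(r1[idx], r2[idx], weights="quadratic"))
        low, high = np.percentile(boot, [2.5, 97.5])
        return kappa, (low, high)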

Ethics note. Add a one-paragraph ethics statement: internal quality assessment, no personal data collected, no vulnerable populations, no compensation, and institutional review not required under journal policy. That will stop any back-and-forth at production.

Decoding setup clarity. The generation table mixes sampling and beam search (do_sample=True with num_beams=2). If you really used beam sampling, say so; otherwise separate the setups for clarity. Fix the parameter typo repetition_penalty (it’s misspelled in one place). State seeds, library versions, and whether decoding parameters were identical across models.
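For concreteness: in the Hugging Face Transformers API, do_sample=True combined with num_beams=2 selects beam-search multinomial sampling, whereas num_beams=1 gives plain sampling. The parameter values below are illustrative assumptions, not the paper's exact decoding configuration.

    # Two illustrative decoding setups (values are examples, not the paper's settings).
    from transformers import GenerationConfig

    beam_sampling = GenerationConfig(
        do_sample=True,            # sample from the distribution ...
        num_beams=2,               # ... within a 2-beam search (beam-search multinomial sampling)
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,    # correctly spelled parameter name
        max_new_tokens=256,
    )

    plain_sampling = GenerationConfig(
        do_sample=True,
        num_beams=1,               # ordinary multinomial sampling, no beam search
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
        max_new_tokens=256,
    )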

Secondary edits that help

– Compute/time in one place. You mention 14 days for data generation, 100 hours for fine-tuning, and 2.5 months for full benchmarking. Summarize this in a small table with hardware, key libraries (PyTorch/Transformers/PEFT versions), CUDA, OS, and random seeds. It instantly boosts trust.

– Terminology. Replace “casual Arabic LLMs” with “causal (Arabic) language models” throughout. Standardize model names (LLaMA 2, Llama 3, GPT-4, Gemini, etc.).

– PEFT specifics. Since you reference IA3, packing, and use_liger, add one sentence per item (what it does, which library/version). Consider linking or archiving the training config files.

– Baseline fairness. Readers will ask whether LLaMA- or AceGPT-based baselines were fine-tuned on your training split or evaluated as-is. If they weren’t fine-tuned, label them clearly as zero-/few-shot baselines or add a fine-tuned variant for a fairer comparison.

– Tables/labels. If a column says “RAVEN (scaled),” state the range (e.g., 0–1). For inter-rater agreement, specify whether κ is quadratic-weighted and include 95% CIs.

– Hyphenation artifacts. Clean the soft-hyphen breaks from PDF export (mul-ti-turn, to-ken, in-stance, etc.).


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.

Reviewer #2: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #2: Yes

**********

Reviewer #2: The data availability statement indicates that the underlying dataset has been deposited in a public repository; however, the provided Google Drive link was found to be non-functional at the time of review, preventing access to the data. Regarding the manuscript's language, it is written in generally intelligible scientific English, but a number of typographical errors, grammatical inconsistencies, and awkward phrasings were noted throughout the text, which should be addressed to enhance clarity and polish. The study itself is commended for its technical soundness and methodological rigor, exemplified by a well-structured and reproducible pipeline for synthetic data generation and model fine-tuning. The conclusions are considered to be robustly supported by the data, which were derived from a large-scale dataset and a comprehensive, multi-faceted evaluation benchmark. Furthermore, the statistical analysis of the human evaluation results has been performed with appropriate rigor, utilizing a complementary suite of metrics to thoroughly assess inter-rater reliability. It is therefore suggested that the manuscript requires minor revisions, primarily to rectify the broken data link and to undertake a thorough copyediting of the manuscript to correct language errors.

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #2: Yes: Vijayakumar Selvaraj, Associate Professor, B.S.Abdur Rahman Crescent Institute of Science and Technology.

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

To ensure your figures meet our technical requirements, please review our figure guidelines: https://journals.plos.org/plosone/s/figures

You may also use PLOS’s free figure tool, NAAS, to help you prepare publication quality figures: https://journals.plos.org/plosone/s/figures#loc-tools-for-figure-preparation.

NAAS will assess whether your figures meet our technical requirements by comparing each figure against our figure specifications.

PLoS One. 2026 Feb 12;21(2):e0341905. doi: 10.1371/journal.pone.0341905.r004

Author response to Decision Letter 2


18 Dec 2025

Dear Editor,

We thank you and the reviewers for your careful evaluation of our manuscript and for the constructive feedback provided. Below, we provide a point-by-point response to the editorial comments and detail the corresponding revisions made to the manuscript.

________________________________________

1. Reference list review

Editor comment:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.

Response:

We have conducted a comprehensive review of the reference list and implemented the following changes:

References removed:

Reference 25 (Mozannar et al., 2019, "Neural Arabic Question Answering") has been removed as it was identified as a duplicate of Reference 10. All in-text citations previously referencing 25 have been updated to cite Reference 10, and subsequent reference numbers have been decremented accordingly.

Reference 46 (Singh et al., 2024, "Aya Dataset") has been removed as it was identified as a duplicate of Reference 33. All in-text citations previously referencing 46 have been updated to cite Reference 33, and subsequent reference numbers have been decremented accordingly.

As a consequence of these removals, all references following Reference 25 have been decremented by one, and all references following Reference 46 have been decremented by an additional one (i.e., decremented by two in total).

References modified:

URLs have been added or corrected for the following references: 8, 10, 15, 21, 22, 23, 24, 25, 26, 31, 34, 36, 37, 50, 51, and 52.

Reference text has been revised for accuracy and completeness in the following references: 12, 24, 31, 32, 33, 36, 52, 53, 54, 55, and 56.

Reference added:

Reference 57 (Reimers & Gurevych, 2019, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks") has been added to provide a proper citation for the paraphrase-multilingual-MiniLM-L12-v2 sentence transformer employed in multiple phases of the study's methodology.

________________________________________

2. Dataset size arithmetic

Editor comment:

Dataset size math. The manuscript says there are 43,316 conversations with "precisely 20 exchanges" each, and elsewhere mentions "57 million utterances." Those numbers don't line up. Even if "exchange" means a user–bot pair (i.e., 40 turns), the total utterances are nowhere near 57M. Please correct the count everywhere and add a one-sentence derivation so a reviewer can follow the arithmetic.

Response:

We thank the editor for identifying this discrepancy. Upon review, we discovered computational errors in the scripts used to calculate these statistics. The scripts have been corrected, and all dataset size figures throughout the manuscript have been revised to reflect accurate values. A clear derivation of the arithmetic has been included to ensure transparency and verifiability.

________________________________________

3. Data availability

Editor comment:

Data availability. The Data Availability Statement points to a Google Drive folder. PLOS strongly prefers stable, citable hosting. Deposit the dataset (and scripts if possible) in a DOI-issuing repository such as Zenodo or OSF, cite the DOI in the DAS, and remove the ad-hoc cloud link. Version the release (e.g., v1.0), add a license, and include checksums.

Response:

In accordance with the editor's recommendation, the dataset has been deposited in Zenodo, a DOI-issuing repository that provides stable, citable hosting. The dataset is now publicly accessible at:

https://doi.org/10.5281/zenodo.17855012

The deposit includes version numbering, an appropriate open-access license, and checksums for data integrity verification. The Data Availability Statement in the manuscript has been updated accordingly, and the previous Google Drive link has been removed.

________________________________________

4. Evaluation leakage risk

Editor comment:

Evaluation leakage risk. Your benchmark is a 20% subset of the same synthetic corpus used for training/fine-tuning. That invites style overlap and optimistic scores. Clarify how you prevented duplicates (hashing, embedding similarity, seed control), and say explicitly whether the test split was frozen before model selection. If you can, add a small out-of-distribution test (e.g., held-out topics or prompts generated with different seeds) and report ID vs. OOD results.

Response:

We appreciate this important methodological concern. A new subsection entitled "Post Data Generation Processing" has been added to the “Methods” section, providing a comprehensive description of the procedures implemented to mitigate evaluation leakage risks. These procedures include duplicate detection mechanisms (hashing and embedding similarity checks), seed control protocols, and confirmation that the test split was frozen prior to model selection.

Additionally, our methodology included out-of-distribution (OOD) evaluation through a held-out test set comprising samples from countries and topics not represented in the training data. The revised manuscript now provides clearer documentation of these procedures, with ID versus OOD results presented in Table 14 within the "Benchmark Results" subsection and corresponding discussion in the "Analysis of Results" section.

________________________________________

5. RAVEN metric specification

Editor comment:

RAVEN details. Lock down the metric. Name the sentence-embedding model (exact model and version), state any text normalization (diacritics, punctuation), explain the scaling from cosine similarity to the reported "RAVEN (scaled)," and say how you aggregate from turn-level to conversation-level. Using an embedding model you also fine-tuned can bias the metric-pick a fixed external Arabic/multilingual embedder for all systems and document it.

Response:

The RAVEN subsection has been substantially expanded to address these concerns. The revised text now specifies: (1) the exact sentence-embedding model and version employed (a fixed external multilingual embedder not subject to fine-tuning in this study); (2) text normalization procedures applied to diacritics and punctuation; (3) the mathematical transformation from raw cosine similarity to the scaled RAVEN score; and (4) the aggregation methodology from turn-level to conversation-level metrics. These clarifications ensure full reproducibility and eliminate potential metric bias.

________________________________________

6. Perplexity comparability

Editor comment:

Perplexity comparability. Perplexity depends on the tokenizer and isn't apples-to-apples across models with different vocabularies. Keep PP for within-model tracking, but add a short caveat and, if you can, report a tokenizer-agnostic figure (bits-per-byte) or compute PP with a common reference tokenizer for cross-model comparisons. Also name the exact tokenizers used.

Response:

We have revised the "Benchmarking and Experimental Results" section to address this concern. The "Evaluation Metrics" subsection now explicitly categorizes the metrics into two groups:

1. Within-Model Metrics: Perplexity, used for tracking within-model performance and consistency across training and test datasets.

2. Cross-Model Metrics: RAVEN and human evaluation, which provide comparable assessments across different model architectures.

A new subsection on "Tokenizer Specification and Cross-LLM Comparability" has been added, which enumerates the exact tokenizers used by each model and includes an appropriate caveat regarding the limitations of cross-model perplexity comparisons. All subsequent sections have been updated to reflect this categorization.

Regarding the suggestion to report bits-per-byte or compute perplexity with a common reference tokenizer: due to time constraints and the need to prioritize other substantive revisions requested by the editor, we were unable to implement this additional analysis in the current revision. This limitation is acknowledged in the manuscript.

________________________________________

7. Human evaluation reliability

Editor comment:

Human evaluation reliability. Right now the raters are authors, and some weighted kappa values are near zero (even slightly negative). That weakens the claim. Bring in at least two independent raters blind to model identity, run a short calibration pass, and then report per-criterion agreement with confidence intervals. Keep the current results, but frame the author-only phase as a limitation.

Response:

We have addressed this concern by replacing the results from Evaluator 2 (a co-author) with results from an external evaluator. Complete details of this revised evaluation are provided in the "Human Evaluation" subsection under "Benchmarking and Experimental Results." Per-criterion agreement statistics with 95% confidence intervals are now reported. All related sections have been updated accordingly, and the "Conclusion" section includes recommendations for refining human evaluation methodologies in future research.

________________________________________

8. Ethics statement

Editor comment:

Ethics note. Add a one-paragraph ethics statement: internal quality assessment, no personal data collected, no vulnerable populations, no compensation, and institutional review not required under journal policy. That will stop any back-and-forth at production.

Response:

An ethics statement has been added to the "Human Evaluation" subsection within the "Benchmarking and Experimental Results" section. The statement confirms that the evaluation constituted an internal quality assessment, no personal data were collected, no vulnerable populations were involved, no compensation was provided to evaluators, and institutional ethics review was not required under the journal's policy.

________________________________________

9. Decoding setup clarity

Editor comment:

Decoding setup clarity. The generation table mixes sampling and beam search (do_sample=True with num_beams=2). If you really used beam sampling, say so; otherwise separate the setups for clarity. Fix the parameter typo repetition_penalty (it's misspelled in one place). State seeds, library versions, and whether decoding parameters were identical across models.

Response:

The "Prompt Engineering and Synthetic Data Generation" subsection in the “Methods” section has been revised to clarify this point. We confirm that beam-search multinomial sampling was intentionally employed, combining sampling with beam search as a deliberate decoding strategy. This is now explicitly stated in the manuscript. The typographical error in Table 2 (repetition_penalty) has been corrected. Additionally, random seeds, library versions, and confirmation that decoding parameters were consistent across all models have been documented.

________________________________________

10. Computational resources summary

Editor comment:

Compute/time in one place. You mention 14 days for data generation, 100 hours for fine-tuning, and 2.5 months for full benchmarking. Summarize this in a small table with hardware, key libraries (PyTorch/Transformers/PEFT versions), CUDA, OS, and random seeds. It instantly boosts trust.

Response:

We appreciate this suggestion for improving reproducibility. The following tables have been added to consolidate computational resource information:

• Table 3: Summary of Computational Resources Used in Dataset Generation and Timeline

• Table 6: Summary of Fine-tuning Configurations and Timeline

• Table 9: Summary of Computational Resources Used in Benchmark Evaluation and Timeline

Each table includes hardware specifications, library versions (PyTorch, Transformers, PEFT), CUDA version, operating system, and random seeds where applicable.

________________________________________

11. Terminology standardization

Editor comment:

Terminology. Replace "casual Arabic LLMs" with "causal (Arabic) language models" throughout. Standardize model names (LLaMA 2, Llama 3, GPT-4, Gemini, etc.).

Response:

The entire manuscript has been revised for terminological consistency. We have standardized all model names according to their official nomenclature (e.g., LLaMA 2, Llama 3, GPT-4, Gemini). We note that we have adopted "pre-trained Arabic language models" rather than "causal (Arabic) language models," as this terminology more accurately reflects the scope and focus of our study.

________________________________________

12. PEFT implementation details

Editor comment:

PEFT specifics. Since you reference IA3, packing, and use_liger, add one sentence per item (what it does, which library/version). Consider linking or archiving the training config files.

Response:

Table 6 ("Summary of Fine-tuning Configurations and Timeline") has been added to address this comment. The table includes explanatory descriptions of IA3, packing, and use_liger, along with their respective library versions. We believe the level of detail provided in the manuscript is sufficient for reproducibility; therefore, separate archiving of configuration files has not been pursued.

________________________________________

13. Baseline fairness

Editor comment:

Baseline fairness. Readers will ask whether LLaMA- or AceGPT-based baselines were fine-tuned on your training split or evaluated as-is. If they weren't fine-tuned, label them clearly as zero-/few-shot baselines or add a fine-tuned variant for a fairer comparison.

Response:

This concern has been addressed through the addition of two new subsections under "Benchmarking and Experimental Results": "Baseline Model Selection" and "Prompt Engineering and Model Configuration." These subsections explicitly clarify the evaluation conditions for each baseline model, including whether models were evaluated in zero-shot, few-shot, or fine-tuned settings, ensuring transparent and fair comparisons.

________________________________________

14. Table labels and statistical notation

Editor comment:

Tables/labels. If a column says "RAVEN (scaled)," state the range (e.g., 0–1). For inter-rater agreement, specify whether κ is quadratic-weighted and include 95% CIs.

Response:

This has been addressed throughout the revised manuscript. The "Human Evaluation" subsection and Tables 12, 13, and 14 in the "Benchmark Results" subsection now explicitly state the range for RAVEN (scaled), specify that Cohen's κ is quadratic-weighted, and include 95% confidence intervals for all inter-rater agreement statistics.

________________________________________

15. Hyphenation artifacts

Editor comment:

Hyphenation artifacts. Clean the soft-hyphen breaks from PDF export (mul-ti-turn, to-ken, in-stance, etc.).

Response:

The entire manuscript has been reviewed and all soft-hyphen artifacts resulting from PDF export have been removed.

________________________________________

16. Data availability (Reviewer #2)

Editor comment:

Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. Reviewer #2: No

Response:

The complete dataset underlying the findings of this manuscript is now publicly available without restriction on Zenodo:

https://doi.org/10.5281/zenodo.17855012

________________________________________

17. Non-functional data link

Editor comment:

The data availability statement indicates that the underlying dataset has been deposited in a public repository; however, the provided Google Drive link was found to be non-functional at the time of review, preventing access to the data.

Response:

We apologize for this issue. The dataset has been migrated from Google Drive to Zenodo, which provides stable, persistent hosting. The dataset is now accessible at:

https://doi.org/10.5281/zenodo.17855012

________________________________________


Attachment

Submitted filename: Response_to_Reviewers_auresp_2.docx

pone.0341905.s005.docx (26.4KB, docx)

Decision Letter 2

Mohammad Salah Hassan

15 Jan 2026

Fine-Tuning Arabic Large Language Models for improved multi-turn dialogue: A blueprint for synthetic data generation and benchmarking

PONE-D-25-35904R2

Dear Authors,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Mohammad Salah Hassan, Ph.D

Academic Editor

PLOS One

Additional Editor Comments (optional):

Dear Authors,

I am pleased to inform you that your manuscript has been accepted for publication.

Thank you for your revisions and for addressing the reviewers’ comments. Please submit the final production files through the system as requested.

Sincerely

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #2: Yes

**********

Reviewer #2: All issues raised in Revision 1 have been fully and rigorously addressed in Revision 2. The authors not only corrected factual and methodological shortcomings but also enhanced reproducibility (Zenodo DOI, detailed hyperparameters, evaluation protocols) and statistical transparency (expanded IRR metrics, external evaluator). The editorial summary explicitly states the manuscript now requires only minor revisions, primarily limited to copyediting already completed.

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #2: Yes: Vijayakumar Selvaraj

**********

Acceptance letter

Mohammad Salah Hassan

PONE-D-25-35904R2

PLOS One

Dear Dr. Misbah,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS One. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Mohammad Salah Hassan

Academic Editor

PLOS One

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. English to Arabic translations.

    (DOCX)

    pone.0341905.s001.docx (21KB, docx)
    S1 Checklist. PLOSOne human subjects research checklist.

    (DOCX)

    pone.0341905.s002.docx (164.3KB, docx)
    Attachment

    Submitted filename: Response to Reviewers.docx

    pone.0341905.s004.docx (17.2KB, docx)
    Attachment

    Submitted filename: Response_to_Reviewers_auresp_2.docx

    pone.0341905.s005.docx (26.4KB, docx)

    Data Availability Statement

    “All relevant data are within the paper and its Supporting Information files. The dataset is available at: https://doi.org/10.5281/zenodo.17855012”.


    Articles from PLOS One are provided here courtesy of PLOS
