Scientific Data. 2025 Aug 11;12:1405. doi: 10.1038/s41597-025-05727-w

A simulated dataset for proactive robot task inference from streaming natural language dialogues

Haifeng Xu 1, Chunwen Li 1, Xiaohu Yuan 2, Tao Zhi 3, Huaping Liu 2
PMCID: PMC12339746  PMID: 40789873

Abstract

This paper introduces a dataset designed to support research on proactive robots that infer human needs from natural language conversations. Unlike traditional human-robot interaction datasets focused on explicit commands, this dataset captures implicit task requests within multi-party dialogues. It simulates realistic workplace environments, spanning 10 diverse scenarios such as biotechnology research centers, legal consulting firms, and game development studios. The dataset includes 10,000 synthetic dialogues generated using a large language model-based pipeline, covering a wide range of topics, from task-related discussions to casual conversations. It focuses on common workplace tasks, such as borrowing, distributing, and processing items, and provides a resource for advancing proactive robotic systems, enabling research in natural language understanding, intent recognition, and autonomous task inference.

Subject terms: Scientific data, Computer science, Information technology

Background & Summary

Most service robots operate by following predefined routes or executing pre-programmed tasks, responding only to explicit human commands or behaviors. Real-world scenarios, however, involve mixed-initiative interactions, in which robots must not only react to human actions but also proactively take initiative. Effective human-robot collaboration therefore requires understanding human intentions and predicting user behavior: robots should comprehend human intent even before an interaction begins and offer relevant services proactively. With advances in artificial intelligence (AI) and robotics, robots are increasingly expected to exhibit a degree of proactivity, and the widespread application of AI-equipped service robots in various service environments is transforming the nature of service interaction1.

These robots, known as proactive robots, offer assistance without waiting for explicit requests. Studies have shown that in long-term human-robot interactions, users tend to prefer robots that offer proactive help over those that require users to explicitly request assistance2–4. Xie et al.5 conducted experiments and field studies demonstrating that higher robot proactivity leads to greater user co-creation intention.

Existing research in this field primarily focuses on human-robot interaction challenges concerning robot proactivity, i.e., situations where a robot takes initiative to assist humans without explicit instructions6,7. For instance, Baraglia et al.8 explored optimal timing for proactive actions in human-robot collaborative tasks. Patel et al.9 proposed a capability for robots to predict object motion time patterns, allowing them to adaptively arrange objects to meet user needs. Patel et al.10 further investigated how robots could learn user behavioral patterns in household tasks, enabling proactive item delivery before explicit requests. Buyukgoz et al.11 analyzed two methods for achieving robot proactivity: recognizing human intentions and reasoning about potential future threats or opportunities.

Teaching robots proactive behaviors without explicitly modeling user intentions is a complex task12–14. An essential aspect of human perception is anticipation, which is widely used in daily interactions with others and the environment. Predicting what humans will do next and how they will act enables assistive robots to plan in advance. Moreover, anticipation can even enhance the accuracy of detecting past activities. The ability to detect current human actions and predict subsequent behaviors is crucial. Numerous studies have explored human activity detection from 2D RGB videos15,16, inertial/position sensors17, and RGB-D videos. These works typically convert input sensor streams into spatio-temporal representations and infer labels based on these inputs.

We summarize representative works on non-linguistic proactive robot systems in Table 1. Most existing research on proactive robots relies on physical cues for prediction. Various methods have been proposed, such as gaze direction18, body orientation19, and trajectory tracking20. For example, Koppula et al.21 used graphical models based on object positions and human postures to predict human actions, enabling robots to perform anticipatory actions such as opening doors for humans. Bohus et al.22 leveraged visual processing (e.g., face detection, distance estimation) and speech recognition to predict visitor engagement in directional guidance scenarios. Mascaro et al.23 demonstrated that by recognizing human intentions, such as preparing a drink, and understanding actions like picking up a cup, robots can proactively provide necessary assistance, such as pouring water from a bottle. Therefore, by recognizing and predicting human-object interactions, robots can better understand human intentions and cater to their needs. Merely detecting actions is insufficient; predictive capabilities are required to minimize response delays in robot-assisted interactions24,25. Abbate et al.26 proposed a learning-based method for predicting the probability of human-robot interaction initiation before actual engagement. This method employs self-supervision, where robots automatically label interactions based on post-encounter engagement outcomes.

Table 1. Summary of proactive robot research using non-linguistic cues.

Work | Scenario | Modality and Proactivity Cues | Goal
Pandey et al.12 | Human-robot cooperation tasks | Visuo-spatial reasoning, head motion | Assist in task completion, reduce user effort and confusion
Koppula et al.21 | Daily human activities | RGB-D video, object affordances | Anticipate human activities (e.g., microwaving, taking medicine)
Unhelkar et al.46 | Automotive assembly | Human motion prediction | Deliver parts to human associates during assembly tasks
Harman et al.47 | Smart kitchen in domestic environment | Action observation via pervasive sensors | Predict and perform human’s next actions
Oh et al.48 | Cooking tasks | Action recognition combined with activity-level knowledge bank | Assist in sequential cooking tasks by delivering objects
Patel et al.9 | Household routines | Temporal sequences of object movements | Anticipate and arrange objects
Patel et al.10 | Household routines | Object usage history, user actions, queries | Predict routine object usage
Mascaro et al.23 | Kitchen scenario, pouring drink | Human-object interaction from videos | Assist in pouring a drink task
Nemlekar et al.49 | Human-robot assembly tasks | Learning and updating human action preferences from demonstrations and interactions | Assist by predicting and adapting to preferred action sequences
Abbate et al.26 | Office break area | User pose and motion features | Interact proactively before interaction begins

With the rapid advancement of robotics, service robots are increasingly integrated into daily life, particularly in households, hotels, and office environments. This trend imposes higher demands on robots, requiring them to understand natural language and execute tasks based on human instructions27. Autonomy is a crucial component in achieving artificial general intelligence in robotics. However, research on proactive robots remains limited. One reason is that predicting and tracking human intentions necessitates advanced AI techniques such as computer vision and machine learning, which are not yet sufficiently developed to achieve satisfactory levels of proactivity. Additionally, the scarcity of relevant datasets further constrains progress in this domain.

Compared to gaze tracking18, body orientation19, and trajectory tracking20, which require complex modeling mechanisms, natural language provides a rich and intuitive means for robots to analyze and recognize human intentions. In real-world applications, such as office environments, daily communication among personnel is primarily conducted through spoken dialogue. Robots deployed in such environments can utilize speech recognition technology to listen to nearby conversations and extract relevant content. Additionally, when humans communicate via group chats on social platforms, robots can access these conversations through dedicated group accounts. By analyzing these conversations, robots can identify potential human needs and autonomously generate and execute tasks without explicit human commands. As illustrated in Fig. 1, all office members communicate through a social chat application, while an embedded proactive robot account extracts task-related information from the chat and executes tasks on behalf of humans.

Fig. 1. Illustration of a proactive robot operating in a research office environment. The layout includes offices, workstations, a meeting room, a break room, and a laboratory. Colleagues communicate through a work chat group to manage daily tasks. When a task emerges in the conversation, the robot proactively recognizes it, autonomously generates the task, and executes it without requiring explicit human commands. This allows users to stay focused on their work without needing to issue instructions or manually operate the robot.

Despite the potential benefits of natural language-based proactive robots, achieving this vision remains a complex challenge. The core difficulty lies in the robot’s ability to understand and process human natural language, as well as generate and execute tasks within informal conversations. Unlike traditional task execution systems that rely on explicit commands, proactive robots must identify and act upon implicit requests embedded in ongoing, unstructured dialogues. These conversations may contain noise, ambiguity, and irrelevant information, making it challenging for robots to distinguish tasks from casual chatter.

Research on proactive robots remains in its early stages, particularly in the areas of task reasoning and human-robot collaboration through natural language dialogue. Dedicated studies in these areas are still limited. Nevertheless, research in related domains offers indirect support for the development of proactive robotic systems.

MultiWOZ28 is a widely studied multi-domain, task-oriented dialogue dataset covering scenarios such as hotel booking and restaurant reservations. It emphasizes explicit multi-turn interactions for task completion and serves as a benchmark for dialogue state tracking and natural language understanding. The AMI Meeting Corpus29 has been extensively explored in studies of multi-party dialogue, particularly in the context of team collaboration; it features simulated corporate meetings in which participants discuss topics like product design and project management. Similarly, the ICSI Meeting Corpus30 has been central to research on real-world academic meeting interactions, providing recordings and transcriptions of natural multi-party conversations. However, research based on these two corpora does not focus on concrete task execution. ALFRED31 has driven research in embodied AI, emphasizing agents performing multi-step everyday tasks within 3D environments. RoboCup@Home32 has motivated studies on task execution by service robots in domestic settings, with a focus on how robots interact with humans to carry out routine activities. In both ALFRED and RoboCup@Home, the research centers on robots completing tasks based on explicit human instructions. Table 2 summarizes the comparison between our work and the current research in this area.

Table 2. A comparison with research related to natural language-based proactive robots.

Dataset | Domain focus | Task-oriented | Multi-party dialogue | Implicit tasks | Office/home
MultiWOZ28 | Service booking and information inquiry | ✓ | ✗ | ✗ | ✗
AMI Meeting Corpus29 | Team collaboration meeting | ✗ | ✓ | ✗ | ✓
ICSI Meeting Corpus30 | Academic meeting | ✗ | ✓ | ✗ | ✓
ALFRED31 | Instruction following | ✓ | ✗ | ✗ | ✓
RoboCup@Home32 | Domestic human-robot interaction | ✓ | ✗ | ✗ | ✓
ProactiveDialog (ours) | Proactive task inference | ✓ | ✓ | ✓ | ✓

In recent years, large language models (LLMs) have achieved remarkable breakthroughs in various fields, including conversation, reasoning33,34, mathematical problem-solving35, and code generation36. The application of LLMs in robotics is gaining traction, with researchers exploring their use in planning, reasoning, manipulation, navigation, and data generation. LLMs have also demonstrated success and efficiency in generating scientific texts37 and synthesizing utterances38 and dialogues39,40. The development of large language model technology has introduced new approaches for automatically constructing datasets, effectively addressing the issue of data scarcity in various application scenarios.

To address the lack of data in this field, we present a dataset generated using LLMs combined with structured prompt engineering techniques to automatically create multi-person dialogues. The prompts designed for the LLM were formulated to elicit dialogues enriched with contextual scene descriptions and concise summaries of their main content. We considered 10 different workplace scenarios and produced a total of 10,000 dialogue samples.

We employ a prompt engineering framework tailored for generating multi-turn, task-oriented dialogues situated in realistic workplace contexts. Each prompt integrates structured elements, such as scenario descriptions, relevant object inventories, and topic constraints, to ensure that the resulting dialogues are both contextually coherent and functionally relevant. This design enables the simulation of naturalistic conversations in which proactive robots are expected to understand user goals, infer intent, and reason about appropriate actions. The resulting dataset supports research and development in natural language-based proactive robotics, facilitating applications such as method design, model training, and system evaluation.

Method

We followed a prompt engineering approach, which has been successfully used in previous studies utilizing large language models for dataset generation37–41. Building on prior prompt-based generation work, our method introduces a highly structured, multi-stage prompting framework tailored to simulate realistic workplace dialogues. Rather than using a set of generic prompts, we developed a hierarchical pipeline that progressively refines and expands the available information at each stage. Each stage incorporates contextual constraints, such as scenario descriptions, object inventories, and topical guidance, to ensure that the resulting dialogues are contextually coherent, task-relevant, and diverse. This design allows for fine-grained control over the content, enhancing the realism and applicability of the dataset.

Specifically, our approach is designed to generate datasets for proactive robots by simulating natural human dialogues across various environments. These dialogues encompass both casual conversations and task-oriented discussions, with the latter enabling robots to autonomously perform specific tasks. The methodology consists of scenario design, generation of common objects, generation of topics, streaming topics design, and dialogue generation, ensuring that the resulting dataset is both realistic and diverse. Figure 2 illustrates the overall process.

Fig. 2. Overall dialogue synthesis pipeline based on large language models.

We used GPT-4o mini (accessed via API in April 2025), a lightweight and efficient variant of the widely adopted GPT-4o series that offers lower cost and latency for large-scale generation; small-scale preliminary experiments showed no significant difference from the full model in the quality of the generated dialogue data.
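
For illustration, the following minimal Python sketch shows how such a generation call could be issued through the official openai client; the helper name, sampling temperature, and lack of retry logic are our assumptions rather than the authors' exact configuration.

    from openai import OpenAI

    client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

    def generate(prompt: str, temperature: float = 1.0) -> str:
        # Send one prompt to GPT-4o mini and return the plain-text reply.
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        return response.choices[0].message.content.strip()

The prompt templates in the following subsections can then be filled in (e.g., with str.format) and passed to such a helper one pipeline stage at a time.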

Scenario Design

Different scenarios lead to distinct conversation styles and contents. To maximize realism and diversity, we consider a variety of office environments spanning multiple industries. The following scenarios are selected to ensure a broad range of real-world situations:

(1) Biotechnology Research Center, (2) Game Development Studio, (3) Legal Consulting Firm, (4) Architecture and Interior Design Company, (5) Electric Vehicle Technology R&D Center, (6) Smart Home Device Manufacturing Company, (7) High-End Fashion and Functional Apparel Design Company, (8) Advanced 3D Printing and Smart Manufacturing Laboratory, (9) Visual Effects and Animation Production Studio, (10) Drone Research and Testing Center. We use the following prompt template to generate detailed information for a given scenario.

“Please generate workplace scenario information based on my requirements. Include industry, job responsibilities, characteristics, workplace layout, personnel (no more than 20 people), and job positions. Keep it concise and not too long. You may refer to the example.

For example:

{scene_example}

Please generate a scenario using the following text as the beginning:

Scenario: {scene}

Provide the response in plain text without any numbering, bullet points, dashes, or other extraneous characters. Do not include any additional text, formatting.”

An example of such a scenario is illustrated in Fig. 2. Each scenario is further enriched with contextual details, including the workplace layout, industry-specific characteristics, number of employees, their respective roles, and typical daily activities. These elements can be manually specified or automatically generated using LLMs, with subsequent human refinement to improve accuracy and ensure realism.

Generation of Common Objects

Each scenario is characterized by unique physical environments and work requirements. To ensure dialogues are relevant to robotic tasks, we define a set of common objects frequently used in each setting. These objects are categorized as follows:

  • Specific work-related items: Items specific to a particular scenario, such as batteries and controllers in a robotics lab, or contracts and product samples in an office.

  • Office-related items: General office supplies such as staplers, pens, and printers.

  • Daily life items: Personal items commonly found in offices, including water bottles, snacks, and chargers.

We use the following prompt template to generate these three categories of items.

“This is a scene with multiple colleagues working together.

Scene Description:

{scene}

Based on the above scene, generate a list of common items found in this setting.

Requirements:

1. These items must be frequently found in the scene and must be tangible, physical objects. They should not be virtual or conceptual. The items should be visible, touchable, and small to medium-sized-something that people in the scene commonly use and pass around.

2. (for specific work-related items) The items should be specific work-related items (relevant to the given scene, such as controllers, batteries in a robotics lab; beakers, samples in a chemistry lab; products, contract documents in a company), rather than generic office supplies.

2. (for office-related items) These items should be office-related items (such as staplers, pens, mice, etc.). They are commonly found in various office work environments and are not specific to any particular industry or setting.

2. (for daily life items) These items should be daily life items (such as packages, snacks, water bottles, etc.); they are commonly used for personal daily activities in an office environment and are not strongly related to specific work or industry settings.

3. Generate at least 30 such items, keeping each item’s description concise, with no more than five words.

4. Output each item on a new line, without any numbering, bullet points, dashes, or other extraneous characters. Do not include any additional text, formatting, or blank lines.”

The inclusion of such objects ensures that dialogues contain task-relevant discussions where proactive robots can intervene effectively. Each category is populated with at least 30 items, which can be manually specified or generated using LLMs with subsequent human refinement.
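
As an illustration, one way to drive this stage programmatically is sketched below. It reuses the generate helper sketched in the previous subsection, abbreviates the prompt requirements for brevity, and the function and dictionary names are hypothetical.

    CATEGORY_CLAUSES = {
        "work": "The items should be specific work-related items relevant to the given scene, rather than generic office supplies.",
        "office": "These items should be office-related items commonly found in various office work environments.",
        "daily": "These items should be daily life items commonly used for personal daily activities in an office environment.",
    }

    def generate_items(scene: str, category: str) -> list[str]:
        # Assemble the object-generation prompt (abbreviated from the template above).
        prompt = (
            "This is a scene with multiple colleagues working together.\n"
            f"Scene Description:\n{scene}\n"
            "Based on the above scene, generate a list of common items found in this setting.\n"
            "Requirements:\n"
            "1. These items must be frequently found in the scene and must be tangible, physical objects.\n"
            f"2. {CATEGORY_CLAUSES[category]}\n"
            "3. Generate at least 30 such items, keeping each item's description concise, with no more than five words.\n"
            "4. Output each item on a new line, without any numbering, bullet points, dashes, or other extraneous characters.\n"
        )
        raw = generate(prompt)
        # Requirement 4 makes parsing trivial: one item per non-empty line.
        return [line.strip() for line in raw.splitlines() if line.strip()]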

Generation of Topics

Real-life conversations are typically structured around topics. To model this, we first generate a collection of dialogue summaries that represent key topics within a given time period. Topics are categorized into two groups:

  • Robot task-related topics: Discussions involving the use or transfer of objects, such as borrowing, returning, distributing, and collecting items, document processing, and assisting colleagues. Our envisioned proactive robotic framework takes the most prevalent and widely deployed type of mobile robot as a representative example: one equipped with autonomous navigation and object transportation capabilities. Such robots can take over the routine delivery and handling operations frequently observed in real-world offices, so the robot task-related topics in our dataset predominantly concern commonly encountered office items and standard delivery tasks within office settings.

  • Non-task-related topics: Broader conversations, including work-related discussions (e.g., project updates, meeting notifications), daily affairs (e.g., equipment repairs, announcements), team activities (e.g., event planning, holiday greetings), and casual chats (e.g., food recommendations, news, technology trends).

For each scenario, we generate 1,000 robot task-related topics and 100 non-task-related topics.

For generating robot task-related topics, we sample objects relevant to the scenario and instruct LLMs to generate topic summaries incorporating these objects. An excerpt of the prompt is as follows:

“In common office scenarios, colleagues frequently engage in activities involving the use of everyday objects, which often includes the exchange of items, such as: Borrowing Items...... Returning Items......

Below are some specific examples of conversation topics that have occurred......

Now, consider the following specific scenario:

{scene}

Based on the above scenario, please generate 10 task topics. Be creative and avoid repetition and homogenization.

The topics should involve items that can be referenced from the following list:

{selected_items}

Requirements:

1. Topics must fit the current scene and environmental settings.

2. Each topic should be concise and clearly describe the task......

3. Provide the response in plain text......

4. Each topic should reflect everyday interactions and the need for item exchanges in the work environment as much as possible.”

Non-task-related topics are generated with greater flexibility. An excerpt of the prompt is provided below.

“In an office setting, colleagues often discuss various topics through work-related group chats, such as: Work-related matters...... Daily affairs...... Casual chats...... Other miscellaneous conversations......

For a conversation focused on a specific topic, it can often be summarized in one sentence using a simple template, such as......

Here are some examples......

Now, consider the following specific scenario:

{scene}

Based on the above scenario, please generate 10 examples of conversation summaries. Be creative and avoid repetition and homogenization.

Requirements:

1. Topics must fit the current scene and environmental settings.

2. Each topic should be a concise and clear statement summarizing a short conversation in one sentence......

3. Topics can be work-related or completely unrelated......

4. Provide the response in plain text......”

Streaming Topics Design

Dialogues in real-world scenarios unfold chronologically in a streaming format. Within a given time frame, conversations typically encompass multiple topics, including various tasks and casual discussions. To simulate this continuous dialogue flow, we construct a streaming topic structure consisting of five topics. We consider five different configurations, where each segment contains N robot task-related topics (N ∈ {1, 2, 3, 4, 5}) and 5 − N non-task-related topics. This approach models varying frequencies of task occurrences in realistic settings. To ensure diversity, we sample 200 instances for each configuration, yielding a total of 5 × 200 = 1,000 streaming topic sequences per scenario.
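
A minimal sketch of this sampling procedure is given below; the variable names, the use of sampling without replacement, and the shuffle that interleaves task and non-task topics along the timeline are our assumptions.

    import random

    def build_streams(task_topics, non_task_topics, per_config=200):
        # Returns 5 x per_config sequences, each containing five labeled topics.
        streams = []
        for n_tasks in range(1, 6):  # N in {1, 2, 3, 4, 5}
            for _ in range(per_config):
                topics = [("task", t) for t in random.sample(task_topics, n_tasks)]
                topics += [("non-task", t) for t in random.sample(non_task_topics, 5 - n_tasks)]
                random.shuffle(topics)  # mix tasks and casual chatter chronologically
                streams.append(topics)
        return streams  # 5 x 200 = 1,000 streaming topic sequences per scenario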

Dialogue Generation

Building on the ten predefined scenarios and the 1,000 streaming topic sequences per scenario, we leverage LLMs to reconstruct full dialogue transcripts. Using the streaming topic summaries as prompts, LLMs generate realistic, chronologically ordered conversations that reflect natural workplace interactions. This process results in a comprehensive dataset of 10 × 1,000 = 10,000 dialogues.

An excerpt of the prompt for generating dialogue content is as follows:

“Below is a description of an office scenario:

{scene}

In this office scenario, colleagues often discuss a variety of topics in their work group chat, some related to work and others not.

During a certain time period, a series of conversations took place in this group chat in chronological order. Each conversation can be summarized in a single sentence, as follows:

{streaming_topics}

Each summary above represents a conversation that actually took place. The time gap between two consecutive conversations may be short or span several minutes or even hours.

The conversations corresponding to adjacent summaries may be contextually independent, smoothly connected, sequential in time, or intertwined in some way.

Your task is to simulate the actual chat based on these summaries and reconstruct the conversation text. The requirements are as follows:

1. The dialogue should align with the given office environment and be as realistic as possible.

2. Each summary should correspond to a conversation of 3-10 exchanges......

3. Consecutive summaries may represent completely independent conversations or connected ones......

4. You may add casual side comments, typos, or off-topic remarks to create a more authentic chat experience......

5. Provide the response in plain text using the format ‘speaker: content’......”

Data Records

The dataset of 10,000 dialogues is available at Zenodo42 (10.5281/zenodo.15094166) under a CC BY 4.0 license. It is provided in two formats: a JSON version for machine readability and a TXT version for human readability. The dataset is organized into the following structure:

  • config_files: This folder contains configuration files for ten different scenarios. Each file includes the scenario name, detailed descriptions, and a complete list of three types of objects.

  • data_json: This folder contains dialogues for all ten scenarios. Each dialogue is stored in a separate file with the following path structure: scene_XXX/num_task_Y/dialogue_NNN.json, where XXX represents one of the ten scenarios, Y denotes the number of tasks in the dialogue (ranging from 1 to 5), and NNN is the dialogue index (0-199). Each JSON file contains the full sequence of messages in the dialogue (each message records its speaker and content) together with the streaming topics used during generation, each categorized as either “task” or “non-task”; a minimal loading sketch follows this list.

  • data_txt: This folder contains dialogues for all ten scenarios in a human-readable format. Each dialogue is stored in a separate file with the following path structure: scene_XXX/num_task_Y/dialogue_NNN.txt. The file naming conventions are the same as in the JSON version, and each TXT file presents the full dialogue script.

  • metadata.json: A JSON file that records relevant metadata regarding the dataset and its generation process.
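
The sketch below shows one way to walk the JSON release. The directory pattern follows the naming convention above, while the exact field names inside each record should be checked against the files themselves, so we deliberately inspect the keys rather than assume them.

    import json
    from pathlib import Path

    root = Path("data_json")
    for path in sorted(root.glob("scene_*/num_task_*/dialogue_*.json")):
        with path.open(encoding="utf-8") as f:
            record = json.load(f)
        # Each record holds the message sequence (speaker and content) and the
        # labeled streaming topics; print the top-level keys to confirm the schema.
        print(path, list(record))
        break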

Technical Validation

To ensure the quality of our conversational dataset and its suitability for research on proactive robotics based on natural language dialogue, we conducted a thorough evaluation. Given the demonstrated effectiveness of large language models in text generation, traditional evaluation metrics such as BLEU43, ROUGE44, and perplexity45 are of limited use here. Therefore, we devised an evaluation approach tailored to the specific application of our dataset. First, we performed a statistical analysis of the dataset. Then, we assessed the authenticity of dialogues using both human evaluations and automated evaluations by LLMs. Additionally, we verified the task frequency to ensure that the dataset aligns with the predefined task occurrence rates.

Statistical Analysis

We conducted a statistical analysis of all dialogues in the dataset. Figure 3a,b presents word cloud visualizations for the streaming topics used in dialogue generation and the entire dialogue corpus, respectively. These visualizations illustrate the distribution of vocabulary across generated scenarios, including office-related affairs, possible task requirements, and commonly mentioned objects. Fig. 3c displays the distribution of the number of dialogue turns. The dataset was generated using five streaming topics, resulting in an average of 21.4 turns per conversation, indicating that each topic typically spans approximately 4.3 turns. Fig. 3d illustrates the distribution of message lengths in terms of the number of words per utterance. The average message length of 14.3 words suggests that the dialogues are not mere casual social chats but contain purposeful and informative exchanges, aligning with our office scenario setting. Furthermore, the distribution of message lengths is consistent with intuitive expectations of workplace communication.
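
The reported statistics are straightforward to recompute once dialogues are parsed into (speaker, content) pairs; a minimal sketch, assuming that representation, is shown below.

    def dialogue_stats(dialogues):
        # dialogues: list of dialogues, each a list of (speaker, content) pairs
        turns = [len(d) for d in dialogues]
        lengths = [len(content.split()) for d in dialogues for _, content in d]
        avg_turns = sum(turns) / len(turns)      # reported value: 21.4 turns
        avg_words = sum(lengths) / len(lengths)  # reported value: 14.3 words
        return avg_turns, avg_words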

Fig. 3. Details of the generated dataset: (a) A word cloud of the streaming topics used for generating dialogues, highlighting key terms that provide an overview of the dialogues and indicate the general context or scenarios they pertain to. (b) A word cloud visualization of the generated dialogue dataset, further illustrating the distribution of vocabulary across different scenarios, including office-related activities, potential task actions, and relevant objects. (c) The distribution of the number of turns in the dialogues. (d) The distribution of message lengths within the dialogues.

Evaluation

We randomly sampled 250 dialogues (5 task frequencies × 50 samples per frequency) and evaluated them based on three criteria: contextual coherence, conversational naturalness, and task integration.

  • Contextual coherence: Each dialogue should align with the given scenario and maintain consistency with the predefined context.

  • Conversational naturalness: Workplace conversations should exhibit natural, coherent, and contextually appropriate interactions rather than rigid or mechanical exchanges.

  • Task integration: Tasks should be seamlessly integrated into the dialogue rather than being inserted in a forced or unnatural manner.

Both human and GPT-4 evaluations were based on the same three criteria, with each dialogue sample rated on a scale from 1 to 5. As shown in Fig. 4a, the two sets of evaluations are very similar: the dialogue content aligns almost perfectly with the given scene setting and reads naturally, indicating that the generated dialogues are sufficiently realistic. On the task integration criterion, human evaluators judged the incorporation of task-related content into the complete dialogue to be slightly less than perfect.
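
A hypothetical sketch of the automated judging step is shown below; the rubric wording and the single-integer output format are our assumptions, and generate stands for a chat-completion helper analogous to the one in the Method section, pointed at GPT-4 instead of GPT-4o mini.

    CRITERIA = ("contextual coherence", "conversational naturalness", "task integration")

    def rate_dialogue(dialogue_text: str) -> dict:
        # Ask the judge model for one 1-5 score per criterion.
        scores = {}
        for criterion in CRITERIA:
            prompt = (
                f"Rate the following workplace group-chat dialogue for {criterion} "
                "on a scale from 1 (poor) to 5 (excellent). "
                "Reply with a single integer only.\n\n" + dialogue_text
            )
            scores[criterion] = int(generate(prompt))
        return scores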

Fig. 4. Evaluation of the generated dataset: (a) Evaluation results for the randomly sampled dialogues, as rated by GPT-4 and human evaluators. (b) Preset values and human evaluation results under different task occurrence frequencies. Error bars represent the standard error of the mean and are shown only for human scores, as preset scores are fixed input values without variance.

Task Frequency Verification

To enhance the dataset’s alignment with real-world scenarios, we predefined different task occurrence frequencies during the data generation process to simulate natural task distributions in dialogues. However, relying solely on generation rules is insufficient to ensure data quality. Thus, we conducted rigorous verification to check whether the dataset adheres to the predefined task frequencies.

We conducted a manual annotation analysis on these 250 dialogues. For each predefined task frequency, we manually counted the number of tasks in which the robot could participate and intervene within the corresponding subset of 50 samples and compared the results with the predefined values. As shown in Fig. 4b, the counted task occurrences closely track the preset values across the different frequency settings, demonstrating that the dataset is valid and effective for research on proactive robots.

We provide two examples from the dataset in Fig. 5. The dialogues in both samples are highly relevant to their scenario, covering work-related topics, casual conversation, and specific task demands. They exhibit a realistic conversational style, incorporating habitual expressions, abbreviations, emojis, and other informal elements, demonstrating contextual coherence, conversational naturalness, and task integration. At the same time, the realized task occurrence frequency does not align perfectly with the values preset during generation: some tasks embedded in the dialogues are not directly actionable by a proactive robot, reflecting the added complexity encountered in real-world scenarios.

Fig. 5. Two examples from the dataset. (Left) The dialogue contains two task demands, highlighted in blue. (Right) The dialogue includes two task demands, but the information for the first task (highlighted in red) is incomplete and vague, making it difficult for the proactive robot to intervene directly.

Despite careful prompt design, the outputs generated by the LLMs do not always precisely align with the intended prompting goals. This divergence arises partly because the prompts may not fully reflect the designer’s intent, and partly because certain open-ended tasks do not have a fixed or standardized response. Furthermore, the emergent capabilities of LLMs can lead to unexpected outcomes.

The discrepancy between the actual task occurrence frequency in the generated data and the predefined targets helps avoid excessive rigidity, introducing a level of realistic complexity that is difficult to model through fixed rules. Although this yields a slightly lower task frequency, the impact on the dataset’s intended purpose is negligible, since the design already accommodates varying levels of task occurrence frequency.

Task Expression Diversity

To illustrate the real-world representativeness of our dataset, the dialogue generation process is grounded in ten carefully selected workplace scenarios. These scenarios were chosen to encompass a broad spectrum of professional environments across diverse industries. We intentionally selected scenarios from distinct domains to ensure wide diversity in both task types and dialogue content, aiming to maximize the dataset’s realism and practical applicability. The represented industries include biotechnology, legal consulting, game development, architecture and interior design, electric vehicle research and development, smart manufacturing, fashion design, animation production, and drone technology.

The objects involved in these scenarios range from daily-life items that are generally unrelated to specific work tasks, such as phone chargers, tissues, and coffee mugs, to ubiquitous office supplies like staplers and sticky notes. Moreover, each scenario incorporates specialized objects unique to its domain, for example, PCR tubes and sterile swabs in the Biotechnology Research Center, VR headsets in the Game Development Studio, and propeller sets in the Drone Research and Testing Center.

Furthermore, our analysis reveals that the task-related dialogues predominantly focus on representative object-handling activities commonly observed in real-world office settings, including borrowing, distributing, collecting, returning, delivering, handing out, retrieving, sharing, fetching, organizing, preparing, and providing items.

To further validate the linguistic quality and realism of our dialogue dataset, we conducted an analysis of the diversity of task-related expressions. This analysis highlights the linguistic richness in how tasks are naturally embedded within human conversations, which is crucial for developing proactive robots capable of understanding and responding to nuanced language in real-world scenarios.

We present a qualitative analysis by manually reviewing dialogue samples under various task frequency conditions. We summarize key dimensions of linguistic diversity across task-related utterances, including: degree of explicitness, sentence form, and information completeness.

As shown in Table 3, task-related utterances vary significantly in both structure and clarity, ranging from clear and actionable commands to vague or context-dependent expressions. This diversity better reflects how humans communicate needs in daily work environments and introduces necessary complexity for training and evaluating proactive robots. It emphasizes the dataset’s capability to support models in learning not just command-following, but also intent inference in realistic conversation settings.

Table 3. Examples of diverse task expressions in the dataset.

Expression type | Example from dialogues
Explicit | Hey Noah, I’ll drop off those printer paper reams at your workstation later today.
Implicit | Hey Lucas, just a quick reminder about those headphones I lent you last week for the data analysis stuff.
Declarative | Sure! They’re in the top drawer. Just don’t forget to put them back!
Interrogative | By the way, Noah, do you need that portable phone stand now?
Imperative | Just swing by and grab it whenever you can.
Complete information | I need USB flash drives from each of you with your software updates for integration testing tomorrow morning.
Partial/Underspecified | Just a few colors should be fine-maybe pink and blue?
Highly context-dependent | Just the ones we discussed in the last meeting. The more options, the better!

Usage Notes

Dialogue Dataset

The dataset is provided in both JSON and plain text formats, containing multi-party dialogue transcripts, scenario metadata, and associated object inventories. Researchers can access and parse the data using standard JSON libraries in Python or other environments.

As the dataset involves implicit task requests embedded within informal and multi-turn conversations, certain preprocessing steps, such as dialogue segmentation and contextual feature extraction, may help improve model performance in downstream applications.
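
As one example of such preprocessing, the sketch below pairs each incoming message with a fixed-size window of recent context, a simple way to frame streaming task inference; the window size and representation are our assumptions, not part of the dataset specification.

    def context_windows(messages, size=8):
        # messages: chronologically ordered (speaker, content) pairs
        for i, message in enumerate(messages):
            history = messages[max(0, i - size):i]  # the `size` most recent messages
            yield history, message  # classify `message` given its local context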

This dataset is particularly suited for research involving proactive robots in workplace settings. Relevant applications include:

  • Training and evaluating natural language understanding models for proactive systems

  • Studying implicit task inference within multi-party dialogues

  • Developing autonomous task planning and execution systems

  • Exploring human-robot collaboration, interaction within shared human environments, and mixed-initiative interaction strategies

Scenario Adaptation

The prompt engineering framework used to generate this dataset is modular and can be extended to other domains by modifying the scenario descriptions, object inventories, and topic constraints. This allows researchers to construct dialogue datasets tailored to different environments such as healthcare, education, or smart homes, supporting transfer learning, domain adaptation, and targeted benchmarking.
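
For instance, a new domain can be described with a small configuration in the spirit of the released config_files; the example below is purely illustrative, and the field names and healthcare scenario are our own assumptions.

    new_scenario = {
        "scene": "Hospital Ward: nurses, physicians, and orderlies coordinating "
                 "patient care around a shared nursing station",
        "work_items": ["IV bags", "patient charts", "sample tubes"],   # domain-specific
        "office_items": ["staplers", "pens", "clipboards"],            # generic office
        "daily_items": ["water bottles", "snacks", "phone chargers"],  # personal
        "n_task_topics": 1000,
        "n_non_task_topics": 100,
    }
    # Feeding this configuration through the same five-stage pipeline would yield
    # a healthcare-flavored dialogue set without changing any prompt logic.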

Discussion and Limitations

As the dialogues are generated using a large language model, several limitations should be considered. First, the dataset may contain hallucinated content: utterances that appear plausible but lack grounding in the scenario context. Some responses may also exhibit unnatural phrasing or lack the spontaneity typical of real human interactions. In addition, since the language model reflects patterns from its pretraining data, cultural, occupational, or demographic biases may be present in subtle ways.

These limitations can affect downstream applications, especially those deployed in real-world settings. To mitigate potential risks, we recommend conducting human evaluation or error analysis before deploying models trained on this dataset. Additionally, combining this dataset with real-world or annotated dialogue corpora can help improve model robustness and reduce the impact of synthetic bias.

Acknowledgements

This work was supported in part by the National Natural Science Fund for Distinguished Young Scholars under Grant 62025304.

Author contributions

Haifeng Xu: Conceptualization, Methodology, Writing, Data Curation; Chunwen Li: Formal analysis, Investigation, Review & Editing; Xiaohu Yuan: Conceptualization, Review & Editing; Tao Zhi: Conceptualization, Formal analysis; Huaping Liu: Conceptualization, Review & Editing, Supervision.

Code availability

The code is available on GitHub (https://github.com/ProactiveRobot/ProactiveDialog).

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Lu, V. N. et al. Service robots, customers and service employees: what can we learn from the academic literature and where are the gaps? J. Serv. Theory Pract. 30, 361–391, 10.1108/JSTP-04-2019-0088 (2020).
  • 2. Gross, H.-M. et al. Robot companion for domestic health assistance: implementation, test and case study under everyday conditions in private apartments. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 5992–5999, 10.1109/IROS.2015.7354230 (2015).
  • 3. Peleka, G. et al. RAMCIP: a service robot for MCI patients at home. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1–9, 10.1109/IROS.2018.8594214 (2018).
  • 4. Liu, H., Guo, D. & Cangelosi, A. Embodied intelligence: a synergy of morphology, action, perception and learning. ACM Comput. Surv. 57, 1–36, 10.1145/3717059 (2025).
  • 5. Xie, L., Liu, C. & Li, D. Proactivity or passivity? An investigation of the effect of service robots’ proactive behaviour on customer co-creation intention. Int. J. Hosp. Manag. 106, 103271, 10.1016/j.ijhm.2022.103271 (2022).
  • 6. Grosinger, J., Pecora, F. & Saffiotti, A. Robots that maintain equilibrium: proactivity by reasoning about user intentions and preferences. Pattern Recognit. Lett. 118, 85–93, 10.1016/j.patrec.2018.05.014 (2019).
  • 7. Peng, Z., Kwon, Y., Lu, J., Wu, Z. & Ma, X. Design and evaluation of service robot’s proactivity in decision-making support process. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–13, 10.1145/3290605.3300328 (2019).
  • 8. Baraglia, J., Cakmak, M., Nagai, Y., Rao, R. P. & Asada, M. Efficient human-robot collaboration: when should a robot take initiative? The Int. J. Robotics Res. 36, 563–579, 10.1177/0278364916688253 (2017).
  • 9. Patel, M. & Chernova, S. Proactive robot assistance via spatio-temporal object modeling. In Proceedings of the 7th Conference on Robot Learning 205, 881–891 (PMLR, 2023).
  • 10. Patel, M., Prakash, A. G. & Chernova, S. Predicting routine object usage for proactive robot assistance. In Proceedings of the 7th Conference on Robot Learning 229, 1068–1083 (PMLR, 2023).
  • 11. Buyukgoz, S., Grosinger, J., Chetouani, M. & Saffiotti, A. Two ways to make your robot proactive: reasoning about human intentions or reasoning about possible futures. Front. Robotics AI 9, 929267, 10.3389/frobt.2022.929267 (2022).
  • 12. Pandey, A. K., Ali, M. & Alami, R. Towards a task-aware proactive sociable robot based on multi-state perspective-taking. Int. J. Soc. Robotics 5, 215–236, 10.1007/s12369-013-0181-3 (2013).
  • 13. Schrempf, O. C., Hanebeck, U. D., Schmid, A. J. & Worn, H. A novel approach to proactive human-robot cooperation. In ROMAN 2005, IEEE International Workshop on Robot and Human Interactive Communication, 555–560, 10.1109/ROMAN.2005.1513838 (2005).
  • 14. Tan, S., Ge, M., Guo, D., Liu, H. & Sun, F. Knowledge-based embodied question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 11948–11960, 10.1109/TPAMI.2023.3277206 (2023).
  • 15. Tang, K., Fei-Fei, L. & Koller, D. Learning latent temporal structure for complex event detection. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 1250–1257, 10.1109/CVPR.2012.6247808 (2012).
  • 16. Pirsiavash, H. & Ramanan, D. Detecting activities of daily living in first-person camera views. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2847–2854, 10.1109/CVPR.2012.6248010 (2012).
  • 17. Min, J.-K. & Cho, S.-B. Activity recognition based on wearable sensors using selection/fusion hybrid ensemble. In 2011 IEEE International Conference on Systems, Man, and Cybernetics, 1319–1324, 10.1109/ICSMC.2011.6083808 (2011).
  • 18. Huang, C.-M. & Mutlu, B. Anticipatory robot control for efficient human-robot collaboration. In 2016 11th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 83–90, 10.1109/HRI.2016.7451737 (2016).
  • 19. Huang, C.-M., Cakmak, M. & Mutlu, B. Adaptive coordination strategies for human-robot handovers. In Robotics: Science and Systems 11, 1–10 (2015).
  • 20. Satake, S. et al. How to approach humans? Strategies for social robots to initiate interaction. In Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction, 109–116, 10.1145/1514095.1514117 (2009).
  • 21. Koppula, H. S. & Saxena, A. Anticipating human activities using object affordances for reactive robotic response. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 14–29, 10.1109/TPAMI.2015.2430335 (2015).
  • 22. Bohus, D., Saw, C. W. & Horvitz, E. Directions robot: in-the-wild experiences and lessons learned. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, 637–644 (2014).
  • 23. Mascaro, E. V., Sliwowski, D. & Lee, D. HOI4ABOT: human-object interaction anticipation for human intention reading collaborative robots. In Proceedings of the 7th Conference on Robot Learning 229, 1111–1130 (PMLR, 2023).
  • 24. Psarakis, L., Nathanael, D. & Marmaras, N. Fostering short-term human anticipatory behavior in human-robot collaboration. Int. J. Ind. Ergonomics 87, 103241, 10.1016/j.ergon.2021.103241 (2022).
  • 25. Garcia, C. A., Montalvo-Lopez, W. & Garcia, M. V. Human-robot collaboration based on cyber-physical production system and MQTT. Procedia Manufacturing 42, 315–321, 10.1016/j.promfg.2020.02.088 (2020).
  • 26. Abbate, G., Giusti, A., Schmuck, V., Celiktutan, O. & Paolillo, A. Self-supervised prediction of the intention to interact with a service robot. Robotics Auton. Syst. 171, 104568, 10.1016/j.robot.2023.104568 (2024).
  • 27. Liang, L., Bian, G., Zhao, H., Dong, Y. & Liu, H. Extracting dynamic navigation goal from natural language dialogue. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 3539–3545, 10.1109/IROS55552.2023.10342509 (2023).
  • 28. Budzianowski, P. et al. MultiWOZ: a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 5016–5026, 10.18653/v1/D18-1547 (2018).
  • 29. Carletta, J. et al. The AMI meeting corpus: a pre-announcement. In International Workshop on Machine Learning for Multimodal Interaction, 28–39, 10.1007/11677482_3 (2005).
  • 30. Janin, A. et al. The ICSI meeting corpus. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), vol. 1, I–I, 10.1109/ICASSP.2003.1198793 (2003).
  • 31. Shridhar, M. et al. ALFRED: a benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10740–10749, 10.1109/CVPR42600.2020.01075 (2020).
  • 32. Wisspeintner, T., Van Der Zant, T., Iocchi, L. & Schiffer, S. RoboCup@Home: scientific competition and benchmarking for domestic service robots. Interact. Stud. 10, 392–426, 10.1075/is.10.3.06wis (2009).
  • 33. Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Information Processing Systems 35, 24824–24837 (2022).
  • 34. Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Information Processing Systems 35, 22199–22213 (2022).
  • 35. Polu, S. et al. Formal mathematics statement curriculum learning. In The 11th International Conference on Learning Representations (ICLR), https://openreview.net/forum?id=-P7G-8dmSh4 (2023).
  • 36. Chen, M. et al. Evaluating large language models trained on code. Preprint at https://arxiv.org/abs/2107.03374 (2021).
  • 37. Park, Y. J., Jerng, S. E., Yoon, S. & Li, J. 1.5 million materials narratives generated by chatbots. Sci. Data 11, 1060, 10.1038/s41597-024-03886-w (2024).
  • 38. Neuman, Y. & Cohen, Y. A data set of synthetic utterances for computational personality analysis. Sci. Data 11, 623, 10.1038/s41597-024-03488-6 (2024).
  • 39. Gil-Martín, M. et al. A dataset of synthetic art dialogues with ChatGPT. Sci. Data 11, 825, 10.1038/s41597-024-03661-x (2024).
  • 40. Liu, Z. et al. Bilingual dialogue dataset with personality and emotion annotations for personality recognition in education. Sci. Data 12, 514, 10.1038/s41597-025-04836-w (2025).
  • 41. Neuman, Y. & Cohen, Y. A dataset of 10,000 situations for research in computational social sciences, psychology and the humanities. Sci. Data 10, 505, 10.1038/s41597-023-02406-6 (2023).
  • 42. Xu, H., Li, C., Yuan, X., Zhi, T. & Liu, H. A simulated dataset for proactive robot task inference from streaming natural language dialogues. Zenodo, 10.5281/zenodo.15094166 (2025).
  • 43. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318, 10.3115/1073083.1073135 (2002).
  • 44. Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, 74–81 (Association for Computational Linguistics, Barcelona, Spain, 2004). https://aclanthology.org/W04-1013/
  • 45. Bengio, Y., Ducharme, R., Vincent, P. & Jauvin, C. A neural probabilistic language model. Journal of Machine Learning Research 3, 1137–1155 (2003).
  • 46. Unhelkar, V. V. et al. Human-aware robotic assistant for collaborative assembly: integrating human motion prediction with planning in time. IEEE Robotics Autom. Lett. 3, 2394–2401, 10.1109/LRA.2018.2812906 (2018).
  • 47. Harman, H. & Simoens, P. Action graphs for proactive robot assistance in smart environments. J. Ambient Intell. Smart Environ. 12, 79–99, 10.3233/AIS-200556 (2020).
  • 48. Oh, N., Park, J., Kwak, J. H. & Jo, S. A robot capable of proactive assistance through handovers for sequential tasks. In 2021 18th International Conference on Ubiquitous Robots (UR), 296–301, 10.1109/UR52253.2021.9494681 (2021).
  • 49. Nemlekar, H., Dhanaraj, N., Guan, A., Gupta, S. K. & Nikolaidis, S. Transfer learning of human preferences for proactive robot assistance in assembly tasks. In Proceedings of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, 575–583, 10.1145/3568162.3576965 (2023).
