Communications Engineering. 2026 Mar 19;5:81. doi: 10.1038/s44172-026-00632-5

Adaptive robot guidance through real-time compliance estimation and dual-modal control

Ravi Tejwani 1, John Payne 1, Karl Velazquez 1, Paolo Bonato 2,3, Harry Asada 4
PMCID: PMC13133102  PMID: 41857171

Abstract

When human instructors guide learners through motor tasks, they seamlessly coordinate physical touch with verbal explanations: a dance teacher positions a student’s arms while describing the movement, a therapist supports a patient’s limb while offering encouragement. In contrast, a robot applying physical forces without verbal context can feel invasive or unsettling to humans. We present a robot guidance controller that learns to coordinate physical and verbal guidance as human instructors naturally do. Our system adaptively balances these modalities based on real-time estimation of human compliance: when learners struggle, it provides firmer physical corrections with explicit instructions; as they improve, it transitions to lighter touch with encouraging phrases. Our method comprises three components: (1) an estimator that infers physical and verbal compliance from tracking errors, (2) an optimization method that dynamically allocates guidance between force and language, and (3) a force-to-language model that generates contextually appropriate utterances. User studies (N=12) demonstrate that adaptive coordination of guidance significantly outperforms single-modality guidance and fixed-combination baselines: up to 50% reduction in tracking error, 39% improvement in movement smoothness, and 27% faster task completion. While validated in rehabilitation therapy, our approach generalizes to any human-robot collaborative learning scenario.

Subject terms: Scientific community, Social sciences


Ravi Tejwani and colleagues present a robot guidance controller that coordinates physical forces and verbal instructions based on real-time human compliance estimation. User studies demonstrate significant improvements in task performance over single-modality approaches.

Introduction

As a child learns to write the letter “A” for the first time, their mother holds the child’s hand, verbally explaining ‘draw a line up and above’ while simultaneously providing gentle nudges to guide the motion. When the child deviates from the intended path, the mother increases corrective feedback through more explicit instructions and firmer nudges. Conversely, as the child begins to follow the correct path, the mother reduces physical assistance and shifts to encouraging phrases such as ‘you are going in the right direction’, ‘keep going’, ‘you got this’. This intuitive integration of verbal and physical guidance is ubiquitous in learning scenarios, from dance instruction and physical therapy to athletic training and vocational education.

Guidance is an intentional communication provided by an instructor to facilitate learning. It consists of two primary components: physical guidance in the form of force profiles (tactile forces applied to direct movement) and verbal guidance (instructional cues through speech).

In human-robot interaction, relying solely on physical guidance can be unsettling—direct force application from a robot may feel invasive or threatening to users1,2. Verbal guidance serves as a crucial complement, providing cognitive context that helps users understand and anticipate the robot’s actions. This dual-modality approach not only improves task performance but also enhances user comfort and trust in the robotic system.

Language and force profiles are fundamentally heterogeneous, presenting significant integration challenges. Force can be applied continuously with precisely controlled magnitude and direction across X, Y, and Z axes for specific durations, creating immediate spatial corrections. Language, in contrast, comprises discrete words and phrases with semantic content that requires interpretation by the learner. While force provides quantitatively measurable feedback, language conveys abstract concepts that must be processed within the learner’s cognitive framework. This inherent heterogeneity necessitates an approach where verbal and physical guidance must be optimally computed and delivered in synchronization to the learner based on their evolving needs and compliance levels.

Across diverse learning scenarios, several critical questions emerge regarding the instructor’s decision-making process:

  • What words to speak and when to speak? (Verbal Guidance)

  • How much force to apply, in which direction, and for what duration? (Physical Guidance)

  • How to adapt these decisions based on the learner’s changing behavioral states/compliance levels? (Optimization)

Our research objective is to encode these adaptive guidance capabilities within a robotic system (Fig. 1) that can effectively guide human learners with the same coordinated physical and verbal guidance provided by human instructors.

Fig. 1. Illustration of our controller’s multimodal guidance: the robot says “gently right” and simultaneously applies a corrective force.

We propose a Robot Guidance Controller that enables a robot to optimally deliver physical and verbal guidance based on the learner’s estimated compliance states. The controller integrates three components: (1) a compliance estimator that infers physical and verbal compliance levels from observable behavior, such as position and velocity errors; (2) an optimization method that computes optimal physical force (magnitude, direction, and duration) and verbal instructions (contextually appropriate words) based on the inferred compliance states in real time; and (3) a force-to-language model that generates contextually appropriate utterances.

Our experimental validation addresses two key research questions: First, does combined verbal and physical guidance outperform single-modality approaches? Second, does adaptive allocation of guidance based on real-time compliance estimation improve performance over fixed-strategy baselines? Through user studies with 12 participants, we demonstrate significant improvements in task performance, movement quality, and completion time.

Research in robot guidance capabilities has been explored on topics such as teaching by demonstration3, programming by demonstration4,5, skill acquisition6,7, and imitation learning8–10. These methods focus primarily on how robots can acquire and reproduce skills, ignoring the adaptive guidance aspect of instruction. When robots were deployed in interactive scenarios with humans, they typically either provided purely verbal instructions without physical guidance, or offered physical assistance without adaptive verbal feedback. Furthermore, existing systems that incorporated both modalities11–14 typically used pre-programmed fixed strategies15–18. Our Robot Guidance Controller enables robots to provide coordinated and optimal physical and verbal guidance that continuously adapts to the human’s compliance states19–21.

Our proposed method is fundamentally task-agnostic—designed to generalize across diverse learning scenarios from writing and dance to rehabilitation and athletics. The controller is underpinned by universal principles of how instructors deliver and adapt guidance to learners. To demonstrate this generalizability, we derived insights from one representative instructor-learner interaction: physical therapists guiding patients during therapeutic exercises. We conducted an observational study at Spaulding Rehabilitation Hospital, analyzing interactions between therapists and patients during shoulder flexion exercises. This study revealed how expert instructors dynamically balance physical correction and verbal feedback based on the patient’s compliance levels—increasing assistive forces when encountering resistance and transitioning from instructional to encouraging language as performance improves (Fig. 2). While our empirical validation focuses on this rehabilitation context, the controller’s design principles apply broadly across learning scenarios22–24.

Fig. 2. Observational study design and key findings showing adaptive guidance patterns across compliance states. Full experimental details and analysis are presented in Supplementary Note 2.

We make the following contributions:

  1. A formalized framework for coordinating physical and verbal guidance in human-robot interaction

  2. An adaptive optimization approach that distributes guidance across physical force and verbal instructions according to estimated compliance levels

  3. A state estimation method for inferring human compliance levels from observable behaviors

  4. Experimental validation through user studies demonstrating the controller’s effectiveness over baselines

Results

Experimental setup

Rehabilitation task

The experimental validation was conducted using a shoulder flexion exercise commonly employed in physical therapy for improving range of motion. This task requires participants to move their arm from a neutral position (arm at side) to full flexion (arm extended forward and upward) and return, following a predetermined trajectory.

The shoulder flexion exercise was selected because it: (1) requires continuous physical contact between instructor and learner, (2) involves both corrective and encouraging verbal communication, (3) exhibits clear compliance variations that instructors naturally adapt to, and (4) represents a clinically relevant rehabilitation scenario where adaptive guidance is essential.

During trials, participants were seated beside a UR5 collaborative robot and instructed to place their hand on the end-effector, which was fitted with a 6-DOF force/torque sensor. The robot guided them through the shoulder flexion trajectory while providing coordinated physical forces and verbal instructions based on real-time compliance estimation.

Participants

The study included 12 participants (8 males, 4 females) with a mean age of 23 years (SD = 2.1, range 20-28). All participants were recruited from the MIT student population through online postings and provided written informed consent.

Participant characteristics were as follows:

  • Educational background: 10 engineering students (mechanical, electrical, computer science), 2 from other technical disciplines

  • Robotics experience: None had prior experience with collaborative robotics applications

  • Handedness: 9 right-handed, 3 left-handed

  • Physical condition: No participants reported upper limb mobility restrictions or neurological conditions

The study procedures were reviewed and approved by the MIT Institutional Review Board (Protocol #2212000845R001), ensuring compliance with ethical guidelines for human subject research.

Study design

We conducted two separate studies with each participant to evaluate different guidance methods provided by the robot.

Study 1: Comparison of Guidance Modalities. We had each user interact with each of the following guidance methods through multiple trials. In each trial, they were asked to exhibit a specific compliance behavioral state.

  • Verbal Guidance Only: Vision-based tracking system that monitored hand position via camera and provided spoken directional cues when users deviated from the reference trajectory.

  • Physical Guidance Only: Robot applied corrective forces using a PD (Proportional-Derivative) controller along the axis of deviation, guiding users back to the reference path without verbal feedback.

  • Combined Verbal and Physical Guidance: A baseline robot controller that simultaneously delivers physical forces and verbal instructions to guide the user. This approach does not estimate user compliance; it provides dual guidance based solely on the user’s position and velocity deviations from the reference trajectory.

Study 2: Robot Controller Comparison. This study evaluated our proposed method (Robot Guidance Controller), which optimally allocates physical and verbal guidance based on real-time user compliance estimation (see Method). We compared our method to the baseline method from Study 1.

Research questions

We investigated two main research questions:

  1. Does combined verbal and physical guidance outperform single-modality guidance (physical-only or verbal-only)?

  2. Does the Robot Guidance Controller method, which provides optimal adaptive guidance by tracking user state, outperform the baseline robot controller method?

Evaluation metrics

We evaluated our methods using the following metrics:

  • Position Error: Mean absolute error (MAE) between the user’s actual position and the reference position, calculated as $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lVert x_i - x_{\mathrm{ref},i} \rVert$, where $x_i$ is the user’s position at time $i$.

  • Velocity Error: MAE between the user’s velocity and the reference velocity, computed as $\mathrm{MAE}_v = \frac{1}{N}\sum_{i=1}^{N} \lVert \dot{x}_i - \dot{x}_{\mathrm{ref},i} \rVert$, capturing the temporal accuracy of movement execution.

  • Movement Smoothness: Quantified using the standard deviation of the velocity magnitude over time, calculated as $S = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} \left( \lVert \dot{x}_i \rVert - \mu_v \right)^2}$, where $\mu_v$ is the mean velocity magnitude. Lower values indicate smoother motion.

  • Time to Completion: Total time required to complete the task.

  • Frequency of Words Spoken: Average number of words per second spoken during task execution, indicating the intensity of verbal guidance provided.

  • Ratio of Instructional to Encouraging Phrases: Ratio of instructional phrases (e.g., “move left”, “push harder”) to encouraging phrases (e.g., “good job”, “keep going”), calculated by classifying each utterance and computing the ratio $R = N_{\mathrm{inst}}/N_{\mathrm{enc}}$.
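To make these metric definitions concrete, the following is a minimal Python sketch of how they can be computed from logged trajectories. The function name, array shapes, and the use of finite-difference velocities are our assumptions for illustration, not details of the study’s actual analysis pipeline.

```python
import numpy as np

def guidance_metrics(x, x_ref, dt):
    """Compute tracking metrics from logged trajectories.

    x, x_ref : (N, 3) arrays of actual and reference positions [m]
    dt       : sampling period [s]
    """
    # Position error: mean absolute deviation from the reference path
    pos_err = np.mean(np.linalg.norm(x - x_ref, axis=1))

    # Velocity error: finite-difference velocities, then mean absolute deviation
    v = np.diff(x, axis=0) / dt
    v_ref = np.diff(x_ref, axis=0) / dt
    vel_err = np.mean(np.linalg.norm(v - v_ref, axis=1))

    # Smoothness: standard deviation of the speed profile (lower = smoother)
    speed = np.linalg.norm(v, axis=1)
    smoothness = np.std(speed, ddof=1)

    # Completion time follows directly from the log length
    completion_time = len(x) * dt
    return pos_err, vel_err, smoothness, completion_time
```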

Observational data from expert-led demonstrations

To understand how instructors naturally provide adaptive guidance, we collaborated with Spaulding Rehabilitation Hospital and observed the Chief Physical Therapist guiding patients during shoulder-flexion exercises (Fig. 2). This is the same shoulder flexion task described under Experimental setup, chosen for the same four reasons: continuous physical contact between instructor and learner, both corrective and encouraging verbal communication, clear compliance variations that instructors naturally adapt to, and clinical relevance.

We established four distinct experimental conditions to isolate guidance modalities: (1) Reference Trajectory: The therapist performed the exercise while the patient remained passive, providing baseline data for learning reference trajectory and system dynamics parameters. (2) Verbal Guidance Only: The therapist provided only verbal guidance without physical contact as the patient attempted to follow the trajectory. (3) Physical Guidance Only: The therapist guided the patient using only physical guidance without verbal cues. (4) Combined Guidance: The therapist used both physical and verbal guidance simultaneously. We conducted 10 trials for each condition with different compliance level combinations, allowing systematic variation of compliance levels to isolate key parameters needed for our adaptive controller.

To simplify the experimental setup, compliance levels were classified as low or high. The patient was free to vary their compliance at will during each trial and afterward reviewed video recordings to annotate a timeline indicating their physical and verbal compliance levels at each instant. Data collection included motion capture systems tracking 3D positions, force sensors measuring interaction forces, and audio recordings of all verbal instructions with timing analysis.

The analysis revealed three key adaptive patterns that motivated our robot guidance controller design. First, physical guidance varied inversely with patient compliance—therapists applied stronger corrective forces when patients exhibited low compliance and reduced intervention as physical compliance improved. Median physical force decreased from 2.33 N at low compliance to 1.12 N at high compliance. Second, verbal guidance showed adaptation through both frequency and content. Speaking frequency reduced from 1.77 to 0.36 words/second as verbal compliance increased. Third, language content shifted dramatically with compliance levels. The ratio of instructional to encouraging phrases changed from predominantly instructional (ratio 3.67) when compliance was low to more encouraging (ratio 1.60) when both compliance levels were high (Fig. 2).

These patterns demonstrated that effective guidance naturally adapts to learner state, reducing both physical intervention and instructional explicitness as compliance improves. This motivated our robot guidance controller design. Detailed experimental procedures are provided in Supplementary Note 2.

Statistical analysis

Given our within-subjects experimental design where each participant experienced all conditions, we employed two-tailed paired t-tests to compare performance metrics across guidance modalities.

Our primary analysis compared each guidance condition against single-modality baselines using paired t-tests. We calculated Cohen’s d for all significant comparisons to assess effect sizes and reported 95% confidence intervals for all mean differences. We report results from equivalent non-parametric tests (Wilcoxon signed-rank). Statistical significance was set at α = 0.05 with Bonferroni correction applied for multiple comparisons within each experimental condition.
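As an illustration of this analysis pipeline, the sketch below implements one paired comparison with scipy under the stated design (two-tailed paired t-test, Cohen’s d for paired data, 95% CI of the mean difference, a Wilcoxon signed-rank check, and Bonferroni correction). The function name and the default number of comparisons per condition are our assumptions.

```python
import numpy as np
from scipy import stats

def paired_comparison(a, b, n_comparisons=4, alpha=0.05):
    """Paired comparison of one metric under two conditions.

    a, b : per-participant metric values (same participant order).
    """
    a, b = np.asarray(a), np.asarray(b)
    diff = a - b
    t, p = stats.ttest_rel(a, b)                 # two-tailed paired t-test
    d = diff.mean() / diff.std(ddof=1)           # Cohen's d for paired data
    ci = stats.t.interval(0.95, len(diff) - 1,
                          loc=diff.mean(),
                          scale=stats.sem(diff)) # 95% CI of mean difference
    w_stat, w_p = stats.wilcoxon(a, b)           # non-parametric check
    significant = p < alpha / n_comparisons      # Bonferroni correction
    return dict(t=t, p=p, cohens_d=d, ci95=ci,
                wilcoxon_p=w_p, significant=significant)
```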

Experimental results

Research question 1

To determine whether combining verbal and physical modalities yields improved task performance, we compared each single modality with a combined-modality controller baseline in which the user’s compliance levels were varied. A detailed analysis of one representative user under each modality is presented in Figs. 3–5 and quantified in Table 1.

Fig. 3. Representative trials under the “verbal guidance only” condition, contrasting low (left) versus high (right) verbal compliance.

Fig. 4. Representative trials under the “physical guidance only” condition, contrasting low versus high physical compliance.

Fig. 5. Combined physical and verbal guidance trials across four compliance combinations.

Table 1.

Effect of adding the complementary guidance modality under each compliance regime

(a) Effect of adding physical guidance under low verbal compliance (V = 0)

| Metric | Verbal Guidance Only | + Physical Guidance (P = 0) | Δ% | p-value | + Physical Guidance (P = 1) | Δ% | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Position Error (m) | 0.120 ± 0.030 | 0.058 ± 0.016 | 52 | 0.0021** | 0.033 ± 0.012 | 73 | 0.00014*** |
| Velocity Error (m/s) | 0.083 ± 0.010 | 0.069 ± 0.010 | 17 | 0.086 | 0.039 ± 0.010 | 53 | 0.00014*** |
| Smoothness (m/s) | 0.067 ± 0.009 | 0.030 ± 0.005 | 55 | 8.4 × 10⁻⁶*** | 0.018 ± 0.004 | 73 | 2.2 × 10⁻⁶*** |
| Completion Time (s) | 120.6 ± 43.1 | 93.7 ± 25.0 | 22 | 0.0426* | 69.0 ± 12.3 | 43 | 0.0280* |

(b) Effect of adding physical guidance under high verbal compliance (V = 1)

| Metric | Verbal Guidance Only | + Physical Guidance (P = 0) | Δ% | p-value | + Physical Guidance (P = 1) | Δ% | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Position Error (m) | 0.064 ± 0.008 | 0.034 ± 0.009 | 47 | 0.00062*** | 0.024 ± 0.010 | 63 | 0.000064*** |
| Velocity Error (m/s) | 0.068 ± 0.013 | 0.040 ± 0.009 | 41 | 0.0036** | 0.030 ± 0.005 | 56 | 0.00020*** |
| Smoothness (m/s) | 0.039 ± 0.007 | 0.020 ± 0.005 | 49 | 0.00030*** | 0.013 ± 0.004 | 67 | 1.1 × 10⁻⁵*** |
| Completion Time (s) | 86.9 ± 28.8 | 67.7 ± 15.9 | 22 | 0.138 | 45.4 ± 4.2 | 48 | 0.0113* |

(c) Effect of adding verbal guidance under low physical compliance (P = 0)

| Metric | Physical Guidance Only | + Verbal Guidance (V = 0) | Δ% | p-value | + Verbal Guidance (V = 1) | Δ% | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Position Error (m) | 0.048 ± 0.015 | 0.058 ± 0.016 | 21 | 0.947 | 0.034 ± 0.009 | 29 | 0.0387* |
| Velocity Error (m/s) | 0.068 ± 0.013 | 0.069 ± 0.010 | 1 | 0.619 | 0.040 ± 0.009 | 41 | 0.00018*** |
| Smoothness (m/s) | 0.026 ± 0.006 | 0.030 ± 0.005 | 15 | 0.905 | 0.020 ± 0.005 | 23 | 0.0077** |
| Completion Time (s) | 73.5 ± 13.2 | 93.7 ± 25.0 | 27 | 0.930 | 67.7 ± 15.9 | 8 | 0.325 |

(d) Effect of adding verbal guidance under high physical compliance (P = 1)

| Metric | Physical Guidance Only | + Verbal Guidance (V = 0) | Δ% | p-value | + Verbal Guidance (V = 1) | Δ% | p-value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Position Error (m) | 0.032 ± 0.013 | 0.033 ± 0.012 | 3 | 0.582 | 0.024 ± 0.010 | 25 | 0.069 |
| Velocity Error (m/s) | 0.027 ± 0.007 | 0.039 ± 0.010 | 44 | 0.998 | 0.030 ± 0.005 | 11 | 0.801 |
| Smoothness (m/s) | 0.015 ± 0.005 | 0.018 ± 0.004 | 20 | 0.848 | 0.013 ± 0.004 | 13 | 0.144 |
| Completion Time (s) | 45.4 ± 5.1 | 69.0 ± 12.3 | 52 | 0.999 | 45.4 ± 4.2 | 0 | 0.500 |

Numbers are the means plus or minus the 95% confidence interval; Δ% is the relative change from the single-modality baseline. Two-tailed paired t-tests produce the p-values with significance markers (* < 0.05, ** < 0.01, *** < 0.001). Boldface highlights the significant improvements (p < 0.05).

Low verbal compliance

Under low verbal compliance (V = 0), introducing physical guidance halved the position-tracking error even when participants were physically noncompliant (P = 0, -52%, p < 0.01; Table 1a). When they were physically compliant (P = 1) the reduction deepened to 73% (p < 0.0001) and was accompanied by parallel gains in velocity error (-53%), smoothness (-73%) and a 43% cut in completion time.

High verbal compliance

Naturally high verbal compliance (V = 1) resulted in higher task accuracy than low verbal compliance; however, adding physical guidance improved performance even further (Table 1b). Position error fell by 47% (p < 0.001) under physical non-compliance (P = 0) and by 63% (p < 0.0001) under physical compliance (P = 1). Smoothness improved by 49-67%, and completion time shortened by nearly one-half when both channels were followed. Hence, physical cues remain valuable even when verbal instruction is well attended, with their impact depending on the degree of physical compliance.

Low physical compliance

When physical compliance was low (P = 0), speech was helpful only if users actively attended to it (Table 1c). Under verbal noncompliance (V = 0), verbal guidance produced no change in any metric, whereas active listening (V = 1) reduced position and velocity errors by 29% and 41%, respectively, and improved smoothness by 23% (all p < 0.05). Task completion time decreased slightly but not significantly (-8%, p = 0.33).

High physical compliance

In contrast, when physical guidance was already adhered to (P = 1) verbal augmentation yielded no statistically significant benefit (Table 1d). None of the accuracy or kinematic indices changed (p > 0.14), and unheeded speech (V = 0) actually prolonged task time by 52% (p ≈ 1). These findings indicate that once the physical channel approaches ceiling performance, additional verbal content provides little incremental value.

Conclusion

Overall, combining both verbal and physical modalities yields improved task performance when compared to single-modality guidance. The user scenarios are summarized graphically in Fig. 6 and quantified in Table 1. We observed that introducing a second guidance modality improved task performance. The most pronounced gains—up to a 73% drop in position error, a 67% improvement in smoothness, and a 48% reduction in completion time—appeared when the newly added modality was introduced while the original modality was ignored. When compliance with both modalities was already high, the incremental benefit of adding more guidance tapers off, consistent with the Law of Diminishing Returns.

Fig. 6. Performance comparison across guidance modalities and compliance levels for four evaluation metrics. Error bars indicate 95% confidence intervals. Statistically significant differences are marked with standard notation (* p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001, ns = not significant). Results show that performance improves with compliance, and combining modalities helps most when one is ignored—allowing the other to compensate.

Research question 2

To further investigate whether adapting and providing optimal physical vs. verbal guidance based on users’ compliance levels yields further benefit beyond simply delivering both at once, we compared the fixed-gain “Baseline” controller used in Study 1 with our proposed robot guidance controller (“Our Method”). Visualizations of our results are provided in Figs. 7 and 8, and numerical values are provided in Supplementary Table 1. A detailed analysis of one representative user is presented in Fig. 9.

Fig. 7. Performance comparison between adaptive Robot Guidance Controller and fixed-weight baseline across compliance regimes. Error bars indicate 95% confidence intervals. Statistical significance is marked above each pairwise comparison (* p < 0.05, ** p < 0.01, **** p < 0.0001, ns = not significant).

Fig. 8. Robot behavior comparison: baseline method, our adaptive method, and expert therapist observations across compliance regimes. Statistical significance is marked above each pairwise comparison (* p < 0.05, ** p < 0.01, **** p < 0.0001, ns = not significant).

Fig. 9. Robot Guidance Controller adaptation across four compliance regimes showing real-time adjustment of guidance intensity and content.

Low verbal and physical compliance

When participants followed neither force nor speech, our controller reduced the position-tracking error by half (-50%, p = 0.002), improved movement smoothness (-50%, p < 0.0001, lower is better), and reduced completion time by 27% (p = 0.038). Although the normalized median robot force did not change statistically (p = 0.34), our method spoke more often (+26%, p = 0.004) and shifted its instruction-to-encouragement ratio from the baseline’s instruction-heavy 2.31 to 1.20 (p = 0.038), closer to the therapist’s 1.33, demonstrating closer linguistic alignment with expert practice.

Physical compliance only

With force cues obeyed but speech ignored, our controller delivered a 48% drop in position error (p = 0.009), improved movement smoothness (-39%, p = 0.011), and maintained task duration. Normalized robot force was further below the observational median than the baseline (0.57 vs. 0.79, p = 0.79), yet speaking frequency nearly matched the therapist rate (0.87 vs. 0.85 words/s) and the instruction-to-encouragement ratio moved decisively toward the target (0.96 vs. 1.40).

Verbal compliance only

When participants listened but resisted force, our adaptive controller still lowered position error by 29% (p = 0.039) and improved movement smoothness (-30%, p = 0.043); velocity error and time showed no reliable change. Notably, the controller increased its normalized median force to 1.14, almost perfectly matching the therapist median (1.15), whereas the baseline remained lower (0.88). Its speech, however, became slightly more verbose and instructional than the expert’s, leaving the baseline closer on those two linguistic metrics.

Full compliance

When the human fully adhered to both force cues and verbal instructions, performance saturated: none of the kinematic differences reached significance (all p > 0.21). Our controller spoke substantially less than the baseline (-23%, p < 0.0001) and produced an instruction-to-encouragement balance (0.72) much nearer the therapist’s (0.67) than the baseline did (0.43). The smaller robot force (0.48) diverged from the therapist level (0.60), suggesting force adaptation remains conservative under ceiling performance.

Conclusion

Our robot guidance controller delivers significant improvements compared with the baseline controller in position and velocity accuracy (13-50%), smoothness (8-50%), and completion time (down by 27%). We observed that our robot controller reproduces expert speech patterns more closely than the fixed baseline, particularly in speaking frequency and directive balance, while its force magnitude still trails therapist norms in two of the four compliance states. Overall, these results validate the premise that optimally weighting force and language yields better task performance and better emulates instructor teaching.

User experience analysis

To complement quantitative performance metrics, we conducted post-experiment interviews to assess participant perceptions and preferences regarding different guidance modalities. This qualitative analysis provides insight into the subjective experience of adaptive multimodal guidance.

Each participant completed a brief questionnaire after the experimental trials and participated in a semi-structured interview lasting approximately 10 minutes. The questionnaire included:

  • Naturalness ratings: Participants rated the naturalness of each guidance approach (verbal-only, physical-only, combined baseline, and adaptive dual-modal) on a 7-point Likert scale (1 = very unnatural, 7 = very natural).

  • Preference rankings: Participants ranked all four guidance approaches from most to least preferred.

  • Open-ended questions: “What did you notice about how the robot adapted its guidance?” and “Which approach felt most like working with a human instructor?”

Naturalness ratings were analyzed using a paired-samples t-test comparing the adaptive dual-modal approach (M = 5.8, SD = 0.9, range = 4-7) against the fixed combined approach (M = 4.1, SD = 1.2, range = 2-6). The difference was statistically significant (t(11) = 3.42, p < 0.01, Cohen’s d = 1.58). Preference rankings were analyzed using frequency counts, and interview responses were coded by two independent raters to identify recurring patterns.

Representative participant feedback included:

"The robot seemed to understand when I was struggling and adjusted accordingly, like a human instructor would. It didn’t just follow a script". (P7)

"I appreciated that it gave me more encouragement when I was doing well, rather than constant correction. That felt more natural". (P3)

When asked to rank single modalities, 8 of 12 participants preferred physical-only guidance over verbal-only, citing more immediate and intuitive feedback. However, 11 of 12 participants preferred the adaptive dual-modal approach over any single modality, indicating clear benefits of coordinated guidance.

"The combination was definitely better than either one alone. The physical guidance helped me understand where to go, and the verbal feedback helped me know if I was doing it right". (P5)

These qualitative findings support the quantitative performance improvements, suggesting that adaptive guidance enhances both objective task performance and subjective user experience. The consistency between performance metrics and user perceptions strengthens the case for adaptive multimodal guidance in human-robot interaction applications.

Discussion

Interpretation of results

Our experimental findings demonstrate that adaptive coordination of physical and verbal guidance significantly outperforms both single-modality approaches and fixed dual-modality baselines. These results align with observational studies of human instructors, who naturally modulate guidance intensity and modality based on learner compliance22,23. The most substantial performance gains occurred when our adaptive controller compensated for ignored guidance channels—for instance, increasing physical guidance when participants were verbally non-compliant. This compensatory behavior mirrors expert human instructors who intensify alternative guidance modalities when primary channels prove ineffective. The up to 73% reduction in position error when adding physical guidance to non-compliant verbal conditions demonstrates the effectiveness of this adaptive reallocation strategy.

Our compliance estimation approach successfully captured the dynamic nature of human responsiveness, enabling real-time adaptation that traditional fixed-gain controllers cannot achieve. The Bayesian framework’s ability to infer both physical and verbal compliance from tracking errors provides a foundation for more sophisticated adaptive behaviors in future human-robot interaction systems. Notably, the controller’s speech patterns closely matched expert therapist behavior, with instruction-to-encouragement ratios shifting from 2.31 to 1.20 as compliance improved, closely approximating the observed therapist ratio of 1.33.

The force-to-language mapping component enabled contextually appropriate verbal responses that adapted not only in content but also in timing and intensity based on task performance. This represents a significant advance over fixed verbal instruction systems that cannot respond to learner state changes.

Broader implications

The principles demonstrated in our rehabilitation-focused validation extend beyond physical therapy to numerous human-robot collaborative scenarios. Educational robotics could benefit from adaptive guidance that responds to student engagement and comprehension levels. Skills training applications, from surgical procedures to manufacturing tasks, could leverage dual-modal feedback that adapts to trainee proficiency. Assistive technologies for elderly or disabled populations could provide more natural, responsive support that adjusts to user capabilities and preferences.

Our dual-channel decomposition framework provides a general approach for coordinating heterogeneous interaction modalities in robotics. While we focused on force and language, the same optimization principles could apply to visual cues, haptic feedback, auditory signals, or other sensory channels. The compliance estimation methodology could be extended to assess user engagement, attention, or emotional state, enabling even richer adaptive behaviors.

The broader significance for human-robot interaction lies in moving beyond single-modality systems toward truly multimodal interfaces that coordinate multiple communication channels as humans naturally do. This work establishes a foundation for robots that can provide guidance as intuitively and effectively as human instructors, potentially accelerating learning and improving user acceptance in collaborative scenarios.

From a control theory perspective, our approach demonstrates how heterogeneous control inputs can be unified through common optimization objectives. The passivity guarantees ensure safety while enabling adaptive behavior, addressing a key concern in physical human-robot interaction applications.

Study limitations

Several limitations constrain the generalizability of our findings. First, validation was conducted exclusively with shoulder flexion exercises and healthy young adults. Extension to more complex motor tasks involving multi-joint coordination, fine motor skills, or cognitive-motor integration requires additional validation. Similarly, evaluation with clinical populations including patients with motor impairments, cognitive deficits, or age-related changes in motor learning would strengthen claims about rehabilitation applicability.

Second, our binary compliance modeling, while computationally tractable, oversimplifies the continuous nature of human responsiveness. Real human compliance exists on a spectrum rather than discrete high/low states. Future work should investigate whether continuous compliance estimation yields additional performance benefits and more nuanced adaptive behaviors.

Third, the force-to-language mapping relies on pre-trained embeddings from prior work25, potentially limiting the diversity and appropriateness of generated verbal instructions. The current phrase library, while based on expert demonstrations, may not capture the full range of instructional language that would be optimal for different individuals or cultural contexts.

Fourth, our 12-participant study, while sufficient for proof-of-concept validation and statistical significance, represents a relatively small sample size for claims about broad applicability. Larger-scale evaluation across diverse populations, extended interaction periods, and multiple task domains would provide stronger evidence for the approach’s generalizability.

Finally, we did not compare against other adaptive multimodal systems, as few such systems exist in the literature. Direct comparison with alternative adaptive approaches would better establish the relative merits of our compliance-based allocation strategy versus other possible adaptation mechanisms.

These constraints represent important directions for future research rather than fundamental flaws in the adaptive guidance approach. The core principle of dynamic modality allocation based on compliance estimation remains valid despite these limitations.

Method

Overview: robot guidance controller

The Robot Guidance Controller (Fig. 10) operates in two nested loops. The inner loop is the fixed-rate admittance controller of Sec. 4.2, which converts a commanded wrench U into task-space motion. The outer loop, executed once per cycle, is as follows:

Fig. 10. System architecture overview of the Robot Guidance Controller.

First, the current end-effector pose $x$ is projected onto the reference trajectory, yielding the target state $(x_{\mathrm{ref}}, \dot{x}_{\mathrm{ref}})$ described in Sec. 4.4. The resulting tracking errors $e$ and $\dot{e}$ feed two proportional-derivative blocks, producing nominal corrective wrenches for the physical and verbal channels as defined in Sec. 4.5. Force and kinematic observations update the Bayesian compliance filter, providing the physical and verbal compliance estimates $(\hat{P}, \hat{V})$ in Sec. 4.6. These probabilities parameterize the quadratic program of Sec. 4.7, which returns adaptive weights $(A_p, A_v)$ and thus the dual-channel wrenches $U_p$ and $U_v$. Finally, the verbal wrench is rolled forward over a short horizon and converted into an utterance $\sigma$ via the cross-modal force-language model of Sec. 4.8. The pair $(U_p, \sigma)$ constitutes the actionable command issued to the human-robot interface, while the passivity result of Sec. 4.11 guarantees that this loop remains energetically safe for any admissible human response.

Algorithm 1 presents the complete control flow, detailing how these components integrate to provide adaptive physical and verbal guidance. The subsequent sections detail each component of this algorithm. All mathematical symbols and variables are defined in Table 2.

Table 2.

Definition of mathematical symbols and variables

| Symbol | Description |
| --- | --- |
| $x, \dot{x}, \ddot{x} \in \mathbb{R}^m$ | End-effector pose, velocity, and acceleration in the m-dimensional operational space (Sec. 4.2) |
| $M \in \mathbb{R}^{m \times m}$ | Positive-definite virtual inertia matrix in admittance model (Sec. 4.2) |
| $B \in \mathbb{R}^{m \times m}$ | Positive-definite virtual damping matrix in admittance model (Sec. 4.2) |
| $U_p, U_v \in \mathbb{R}^m$ | Physical and verbal guidance wrenches generated by the controller (Sec. 4.3) |
| $F_h \in \mathbb{R}^m$ | Wrench applied by the human, measured at the robot end-effector (Sec. 4.3) |
| $\delta \in \mathbb{R}^m$ | Disturbance $\delta = F_h - U_v$ between the human’s realized wrench and the virtual wrench requested through language (Sec. 4.3) |
| $x_{\mathrm{ref}}, \dot{x}_{\mathrm{ref}} \in \mathbb{R}^m$ | Reference pose and tangential velocity (Sec. 4.4) |
| $e, \dot{e} \in \mathbb{R}^m$ | Position and velocity tracking errors: $e = x_{\mathrm{ref}} - x$, $\dot{e} = \dot{x}_{\mathrm{ref}} - \dot{x}$ (Sec. 4.5) |
| $K_p, K_v \in \mathbb{R}^{m \times m}$ | Proportional gain matrices for the physical and verbal PD blocks (Sec. 4.5) |
| $B_p, B_v \in \mathbb{R}^{m \times m}$ | Derivative (damping) gain matrices in the corresponding PD blocks (Sec. 4.5) |
| $A_p, A_v \in [0, 1]$ | Time-varying physical and verbal admittance weights; satisfy $A_p + A_v = 1$ (Sec. 4.5) |
| $\hat{P}, \hat{V} \in [0, 1]$ | Estimates of physical and verbal compliance (Sec. 4.6) |
| $c_p, c_v \in \mathbb{R}$ | Instantaneous costs of physical and verbal guidance (Sec. 4.7) |
| $\sigma \in L$ | Utterance, chosen from a finite set of phrases $L$ (Sec. 4.8) |

Algorithm 1

Robot Guidance Controller

Require: Reference trajectory $\gamma$, control gains $(K_p, B_p, K_v, B_v)$, robot dynamics $(M, B)$
Ensure: Physical guidance $U_p$ and verbal utterance $\sigma$

1: while task not complete do
      ⊳ Measurement Phase
2:    Measure end-effector pose $x(t)$ and velocity $\dot{x}(t)$
3:    Measure human interaction force $F_h(t)$
      ⊳ Reference Projection
4:    $s^*(t) \leftarrow \arg\min_s \lVert x(t) - \gamma(s) \rVert$
5:    $x_{\mathrm{ref}}(t) \leftarrow \gamma(s^*(t))$
6:    $\dot{x}_{\mathrm{ref}}(t) \leftarrow \gamma'(s^*(t))$
      ⊳ Tracking Error
7:    $e(t) \leftarrow x_{\mathrm{ref}}(t) - x(t)$
8:    $\dot{e}(t) \leftarrow \dot{x}_{\mathrm{ref}}(t) - \dot{x}(t)$
      ⊳ Compliance Estimation
9:    $z_t \leftarrow [e(t), \dot{e}(t)]$
10:   Update belief $\pi_t$ via Bayesian filter
11:   $\hat{P}_t \leftarrow \pi_{10,t} + \pi_{11,t}$
12:   $\hat{V}_t \leftarrow \pi_{01,t} + \pi_{11,t}$
      ⊳ Cost Computation
13:   $c_p(t) \leftarrow \tfrac{1}{2}(\hat{P}_t + \hat{V}_t)$
14:   $c_v(t) \leftarrow 1 - \hat{V}_t$
      ⊳ Weight Allocation
15:   $A_p^*(t) \leftarrow c_v(t) / (c_p(t) + c_v(t))$
16:   $A_v^*(t) \leftarrow c_p(t) / (c_p(t) + c_v(t))$
      ⊳ Control Laws
17:   $U_p(t) \leftarrow K_p e + B_p \dot{e} - A_p^* F_h$
18:   $U_v(t) \leftarrow K_v e + B_v \dot{e} - A_v^* F_h$
      ⊳ Force-to-Language
19:   Compute profile $u_v$ over horizon $T$
20:   $\sigma \leftarrow \Psi(u_v)$
      ⊳ Actuation
21:   Apply $U_p$ to robot
22:   Speak utterance $\sigma$
23:   Update dynamics: $M\ddot{x} + B\dot{x} = U_p + F_h$
24: end while
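To make the control flow concrete, here is a minimal Python sketch of one outer-loop cycle of Algorithm 1. The wrapper objects (robot, gamma, bayes_filter, force_to_language) and their interfaces are hypothetical stand-ins for the components detailed in the following sections, not the authors’ implementation.

```python
import numpy as np

def guidance_cycle(robot, gamma, gains, bayes_filter, force_to_language, horizon_T):
    """One outer-loop cycle of Algorithm 1 (sketch with hypothetical wrappers).

    gamma : object holding the discretized reference curve (points, tangents)
    gains : dict of m x m positive-definite matrices Kp, Bp, Kv, Bv
    """
    # Measurement phase
    x, x_dot = robot.pose(), robot.velocity()
    F_h = robot.human_wrench()

    # Reference projection: nearest point on gamma and its tangential velocity
    s_star = np.argmin(np.linalg.norm(gamma.points - x, axis=1))
    x_ref, x_ref_dot = gamma.points[s_star], gamma.tangents[s_star]

    # Tracking error
    e, e_dot = x_ref - x, x_ref_dot - x_dot

    # Compliance estimation: belief over the four joint (P, V) states
    pi = bayes_filter.update(np.concatenate([e, e_dot]))
    P_hat = pi[(1, 0)] + pi[(1, 1)]
    V_hat = pi[(0, 1)] + pi[(1, 1)]

    # Costs and closed-form weight allocation
    c_p, c_v = 0.5 * (P_hat + V_hat), 1.0 - V_hat
    A_p, A_v = c_v / (c_p + c_v), c_p / (c_p + c_v)

    # Dual-channel control laws
    U_p = gains['Kp'] @ e + gains['Bp'] @ e_dot - A_p * F_h
    U_v = gains['Kv'] @ e + gains['Bv'] @ e_dot - A_v * F_h

    # Force-to-language over a short horizon, then actuation
    robot.apply_wrench(U_p)
    robot.speak(force_to_language(U_v, horizon_T))
```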

Task-space dynamics model

We model the human-robot system in task space by imposing a virtual admittance on the robot end-effector (Fig. 11). Let

$$x \in \mathbb{R}^m, \quad \dot{x} = \frac{dx}{dt}, \quad \ddot{x} = \frac{d^2x}{dt^2} \tag{1}$$

denote the end-effector pose, velocity, and acceleration in an m-dimensional operational space (m = 6 for a full Cartesian wrench). The equation of motion governing the end-effector dynamics follows the general form $M\ddot{x} + B\dot{x} = Q$, where $Q \in \mathbb{R}^m$ represents the total generalized forces acting on the system. In our human-robot interaction context, these generalized forces comprise both the robot’s commanded wrench and the human’s applied force, yielding the linear admittance model:

$$\underbrace{M\ddot{x} + B\dot{x}}_{\text{Physical System}} = \underbrace{U_p + F_h}_{\text{Desired Behavior}} \tag{2}$$

where $M \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{m \times m}$ are robot parameters that are positive-definite inertia and damping matrices (Supplementary Note 2), $U_p \in \mathbb{R}^m$ is the commanded wrench given to the robot to provide physical guidance, and $F_h \in \mathbb{R}^m$ is the wrench applied by the patient. Equation (2) is integrated in real time by the low-level velocity controller15,18.

Fig. 11. Conventional variable admittance control loop for human–robot interaction.
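For illustration, a discrete-time forward-Euler integration of the admittance model (2), as a low-level velocity loop might perform it, could look as follows. The integration scheme and the 200 Hz default rate are assumptions consistent with the implementation details reported later, not a verbatim account of the authors’ controller.

```python
import numpy as np

def admittance_step(x, x_dot, U_p, F_h, M, B, dt=1.0 / 200):
    """One forward-Euler step of the admittance model M x_ddot + B x_dot = U_p + F_h."""
    x_ddot = np.linalg.solve(M, U_p + F_h - B @ x_dot)  # solve (2) for acceleration
    x_dot_next = x_dot + x_ddot * dt                    # integrate velocity
    x_next = x + x_dot_next * dt                        # semi-implicit position update
    return x_next, x_dot_next                           # commanded to the velocity loop
```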

Dual-channel decomposition

We extend (2) to include verbal guidance. Because verbal guidance is delivered as verbal utterances, we leverage the bi-directional force-to-language mapping25. We introduce a virtual wrench $U_v \in \mathbb{R}^m$ generated by the controller. $U_v$ represents a virtual force given to the patient in the form of utterances, as described in Section 4.8.

As illustrated in Fig. 12, the human’s reaction $F_h$ to a verbal instruction $U_v$ involves a cognitive interpretation and delay (referred to as the residual $\delta$), defined as

$$\delta := F_h - U_v. \tag{3}$$

Substituting $F_h = U_v + \delta$ into (2) yields

$$M\ddot{x} + B\dot{x} = U_p + U_v + \delta. \tag{4}$$

We define

$$U := U_p + U_v, \tag{5}$$

so that the overall task-space behavior is described by

$$M\ddot{x} + B\dot{x} = U + \delta. \tag{6}$$

With this formulation, the $U_v$ terms cancel exactly (since $U_v + \delta = F_h$), leaving the physical dynamics unchanged.

Fig. 12. Comparison of physical versus verbal control input manifestation in human-robot interaction.

The controller’s role is therefore to allocate, at every cycle, the corrective effort between the physical channel (Up) and the verbal channel (Uv).

The optimization governing this allocation is detailed in Section 4.7, and Section 4.11 proves that closed-loop passivity is preserved for all admissible $\delta$. The conventional variable admittance control loop for an HRI task is shown in Fig. 11.

Reference trajectory generation

To supply the feedback laws with a task-level target we extract a reference curve $\gamma$ from the therapist-only trial recorded during observational data collection. At every control step the current end-effector pose $x(t)$ is projected to the nearest point on $\gamma$; the resulting position and its tangential velocity constitute $(x_{\mathrm{ref}}(t), \dot{x}_{\mathrm{ref}}(t))$. All preprocessing and projection details are described in Supplementary Note 2.
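A minimal sketch of this nearest-point projection, assuming the reference curve is stored as a uniformly sampled array from the therapist-only demonstration, is shown below; the forward-difference tangent is our simplification.

```python
import numpy as np

def project_to_reference(x, gamma, dt_ref):
    """Project pose x onto a discretized reference curve gamma.

    gamma  : (S, m) array of poses from the therapist-only demonstration
    dt_ref : sampling period of the recorded demonstration [s]
    """
    s = int(np.argmin(np.linalg.norm(gamma - x, axis=1)))  # nearest sample index
    s_next = min(s + 1, len(gamma) - 1)
    x_ref = gamma[s]
    x_ref_dot = (gamma[s_next] - gamma[s]) / dt_ref        # tangential velocity
    return x_ref, x_ref_dot
```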

Control laws

At every control cycle the controller (i) computes nominal corrective wrenches from the tracking error and (ii) modulates those wrenches with adaptive weights to apportion guidance between the robot and the human.

Let

$$e(t) = x_{\mathrm{ref}}(t) - x(t), \tag{7}$$

$$\dot{e}(t) = \dot{x}_{\mathrm{ref}}(t) - \dot{x}(t) \tag{8}$$

denote the instantaneous tracking error, where $x_{\mathrm{ref}}(t)$ is the task-level reference pose supplied by the trajectory generator. We employ independent proportional-derivative blocks to convert this error into nominal corrective wrenches for the two actuators:

$$\mathrm{PD}_p(e) = K_p e + B_p \dot{e}, \tag{9}$$

$$\mathrm{PD}_v(e) = K_v e + B_v \dot{e}, \tag{10}$$

where $K_p, B_p, K_v, B_v \in \mathbb{R}^{m \times m}$ are constant, positive-definite gains identified offline from expert demonstrations in the observational study (Supplementary Note 2).

The human’s measured interaction wrench $F_h \in \mathbb{R}^m$ is filtered through scalar, time-varying adaptive weights $A_p(t), A_v(t) \in [0, 1]$ with the convex-combination constraint:

$$A_p(t) + A_v(t) = 1. \tag{11}$$

A smaller value denotes greater admittance (the actuator yields to the human), while a larger value denotes greater guidance authority. Incorporating these weights yields the channel control laws

$$U_p(t) = \mathrm{PD}_p(e(t)) - A_p(t) F_h(t), \tag{12}$$

$$U_v(t) = \mathrm{PD}_v(e(t)) - A_v(t) F_h(t). \tag{13}$$

Because the weights satisfy (11), their superposition $U_p(t) + U_v(t)$ equals the single-actuator wrench that would result from combining the two PD terms and admitting $F_h$ with unit gain. Hence the dual-actuator decomposition reallocates, but does not alter, the total wrench driving the admittance dynamics (2).
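The sketch below evaluates the channel control laws (12)-(13) and checks the superposition property just described; the variable names are ours, and the assertion simply verifies that the two channels sum to the single-actuator wrench.

```python
import numpy as np

def dual_channel_wrenches(e, e_dot, F_h, Kp, Bp, Kv, Bv, A_p):
    """Channel control laws (12)-(13) for a given physical weight A_p."""
    A_v = 1.0 - A_p                              # convex-combination constraint (11)
    U_p = Kp @ e + Bp @ e_dot - A_p * F_h        # physical channel
    U_v = Kv @ e + Bv @ e_dot - A_v * F_h        # verbal channel (virtual wrench)
    # Superposition check: the two channels sum to the single-actuator wrench
    assert np.allclose(U_p + U_v, (Kp + Kv) @ e + (Bp + Bv) @ e_dot - F_h)
    return U_p, U_v
```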

Compliance estimation

As direct measurement of human compliance is infeasible, we utilize a recursive Bayesian estimation framework19,20 to infer compliance states from observable task performance.

Let

$$X_t = (P_t, V_t) \in \{0, 1\}^2 \tag{14}$$

denote the joint compliance state at time $t$, where $P_t$ and $V_t$ represent the individual physical and verbal compliance levels. As per our observational study, we confine the compliance levels to a binary state where 0 represents low compliance and 1 represents high compliance (see Observational data from expert-led demonstrations). The system assumes $X_t$ evolves as a first-order Markov process conditioned on the previous state $X_{t-1}$ and control inputs $U_p, U_v$.

We define our observation vector $z_t \in \mathbb{R}^n$ as

$$z_t = \begin{bmatrix} e(t) \\ \dot{e}(t) \end{bmatrix} = \begin{bmatrix} x_{\mathrm{ref}}(t) - x(t) \\ \dot{x}_{\mathrm{ref}}(t) - \dot{x}(t) \end{bmatrix}, \tag{15}$$

where $e(t)$ and $\dot{e}(t)$ quantify positional and velocity deviations from the reference trajectory. Larger tracking errors correlate with lower compliance, while smaller errors suggest higher adherence to guidance.

We represent the belief over the four possible joint states of $X_t$ by the probability vector

$$\pi_t = [\pi_{00,t}, \pi_{01,t}, \pi_{10,t}, \pi_{11,t}], \tag{16}$$

where $\pi_{ij,t} = P(X_t = (i,j) \mid z_t)$ is the probability of the compliance levels given the observations. The estimator then operates in two stages:

Prediction step

First, the prior $\tilde{\pi}_{ij,t}$ is computed by propagating the previous belief $\pi_{t-1}$ forward using the Chapman-Kolmogorov equation:

$$\tilde{\pi}_{ij,t} = P(X_t = (i,j) \mid z_{t-1}, U_{t-1}) \tag{17}$$

$$= \sum_{X_{t-1}} P(X_t = (i,j) \mid X_{t-1}, U_{t-1}) \, \pi_{X_{t-1}, t-1}. \tag{18}$$

We parameterize the 4 × 4 transition probabilities as a softmax over linear functions of the control input, and the corresponding weights are fit from the annotated data (see Supplementary Note 2 for more details).

Update step

Next, we incorporate the new observation $z_t$ to obtain the posterior $\pi_{ij,t}$ via Bayes’ rule:

$$\pi_{ij,t} = \frac{P(z_t \mid X_t = (i,j)) \, \tilde{\pi}_{ij,t}}{\sum_{X_t} P(z_t \mid X_t) \, P(X_t \mid z_{t-1}, U_{t-1})}. \tag{19}$$

Here, we model each likelihood $P(z_t \mid X_t = (i,j))$ as a multivariate Gaussian $\mathcal{N}(z_t; \mu_{ij}, \Sigma_{ij})$ with mean $\mu_{ij}$ and covariance $\Sigma_{ij}$ estimated from the collected data (Supplementary Note 2).

Finally, we define the inferred compliance levels $\hat{P}_t, \hat{V}_t$ as the marginal probabilities of full compliance:

$$\hat{P}_t = P(P_t = 1 \mid z_t) = \pi_{10,t} + \pi_{11,t}, \tag{20}$$

$$\hat{V}_t = P(V_t = 1 \mid z_t) = \pi_{01,t} + \pi_{11,t}. \tag{21}$$

These estimates provide real-time measures of the human’s physical and verbal responsiveness that are used to inform the controller’s adaptive allocation between physical and verbal guidance.
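A compact sketch of one predict/update cycle is given below. For simplicity it assumes a fixed 4 × 4 transition matrix rather than the paper’s softmax parameterization in the control input, and Gaussian likelihood parameters fit offline.

```python
import numpy as np
from scipy.stats import multivariate_normal

STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]   # joint (P, V) compliance states

def compliance_filter_step(pi_prev, z, T, mus, Sigmas):
    """One predict/update cycle of the recursive Bayesian compliance estimator.

    pi_prev : (4,) belief over joint states at time t-1
    z       : observation vector [e; e_dot], eq. (15)
    T       : (4, 4) transition matrix, T[i, j] = P(X_t = j | X_{t-1} = i)
    mus, Sigmas : per-state Gaussian likelihood parameters (fit offline)
    """
    # Prediction: Chapman-Kolmogorov propagation, eqs. (17)-(18)
    pi_pred = T.T @ pi_prev

    # Update: Gaussian likelihoods of the new observation, then normalize, eq. (19)
    lik = np.array([multivariate_normal.pdf(z, mean=mus[k], cov=Sigmas[k])
                    for k in range(4)])
    pi = lik * pi_pred
    pi /= pi.sum()

    # Marginals of full compliance, eqs. (20)-(21); states ordered as in STATES
    P_hat = pi[2] + pi[3]
    V_hat = pi[1] + pi[3]
    return pi, P_hat, V_hat
```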

Adaptive weight allocation

To adaptively adjust the relative amounts of physical and verbal guidance, the weights Ap(t) and Av(t) introduced in (12) are updated online by solving a point-wise quadratic optimization expressed in terms of costs of providing physical and verbal guidance.

Let $\hat{P}(t), \hat{V}(t) \in [0, 1]$ denote, respectively, the physical and verbal compliance estimates produced by the Bayesian filter described in Section 4.6. We assign instantaneous, positive costs

$$c_p(t) = \phi_p(\hat{P}(t), \hat{V}(t)), \tag{22}$$

$$c_v(t) = \phi_v(\hat{P}(t), \hat{V}(t)) \tag{23}$$

with the following monotonicity properties:

  • The cost of physical guidance cp increases as the human’s physical or verbal compliance increases (the robot should push harder only when effective).

  • The cost of verbal guidance cv increases as verbal compliance decreases (talking more is wasteful if instructions are consistently ignored).

These properties are derived from our observations described in Supplementary Note 2. The specific functional choice for $\phi_*$ is left to experimental tuning; in our case we use

$$\phi_p(\hat{P}, \hat{V}) = \tfrac{1}{2}(\hat{P} + \hat{V}), \tag{24}$$

$$\phi_v(\hat{P}, \hat{V}) = 1 - \hat{V}, \tag{25}$$

$$\phi_p(\hat{P}, \hat{V}) = (\hat{P}\hat{V}), \tag{26}$$

$$\phi_v(\hat{P}, \hat{V}) = \hat{V}, \tag{27}$$

which satisfy the monotonicity properties outlined above.

At each control step we formulate an objective

$$J = c_p A_p^2 + c_v A_v^2, \tag{28}$$

representing the total cost of applying physical and verbal guidance. Intuitively, when the cost of one particular guidance is higher, we would like that control input to admit more to the human’s wrench, as we cannot afford to correct for the human’s actions. Alternatively, if the cost of a particular guidance is lower, then we can afford to admit less to the human’s wrench and provide more corrective control.

Recalling the constraint given by (11), at each step we solve:

$$\min_{A_p, A_v} J = \min_{A_p, A_v} \; c_p A_p^2 + c_v A_v^2 \tag{29}$$

$$\text{s.t.} \quad A_p + A_v = 1, \quad A_p \geq 0, \quad A_v \geq 0. \tag{30}$$

To solve this optimization, we first substitute the constraint $A_v = 1 - A_p$ into the objective:

$$J(A_p) = c_p A_p^2 + c_v (1 - A_p)^2. \tag{31}$$

Minimizing $J(A_p)$ directly is then sufficient. Taking the derivative of $J(A_p)$ with respect to $A_p$ and setting it to zero yields

$$\frac{dJ}{dA_p} = 2 c_p A_p - 2 c_v (1 - A_p) = 0. \tag{32}$$

Solving for $A_p$, we have

$$c_p A_p = c_v (1 - A_p), \tag{33}$$

$$A_p (c_p + c_v) = c_v, \tag{34}$$

$$A_p = \frac{c_v}{c_p + c_v}. \tag{35}$$

Consequently, the optimal adaptive weights $A_p^*, A_v^*$ are given by

$$A_p^*(t) = \frac{c_v(t)}{c_p(t) + c_v(t)}, \tag{36}$$

$$A_v^*(t) = \frac{c_p(t)}{c_p(t) + c_v(t)}, \tag{37}$$

which can be substituted directly into the control laws given by (12)-(13):

$$U_p(t) = \mathrm{PD}_p(e(t)) - \frac{c_v(t)}{c_p(t) + c_v(t)} F_h(t), \tag{38}$$

$$U_v(t) = \mathrm{PD}_v(e(t)) - \frac{c_p(t)}{c_p(t) + c_v(t)} F_h(t), \tag{39}$$

yielding our final adaptive formulas for the physical and verbal control inputs.
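The closed-form solution is simple enough to verify numerically. The sketch below computes the weights from the compliance estimates using the costs (24)-(25) and checks equation (36) against a direct scalar minimization; the example estimates are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def optimal_weights(P_hat, V_hat):
    """Closed-form weights (36)-(37) from the compliance estimates."""
    c_p = 0.5 * (P_hat + V_hat)   # cost of physical guidance, eq. (24)
    c_v = 1.0 - V_hat             # cost of verbal guidance, eq. (25)
    A_p = c_v / (c_p + c_v)
    return A_p, 1.0 - A_p

# Sanity check of eq. (36) against direct numerical minimization of J(A_p)
P_hat, V_hat = 0.3, 0.8           # arbitrary example estimates
c_p, c_v = 0.5 * (P_hat + V_hat), 1.0 - V_hat
res = minimize_scalar(lambda a: c_p * a**2 + c_v * (1 - a)**2,
                      bounds=(0.0, 1.0), method='bounded')
assert np.isclose(res.x, optimal_weights(P_hat, V_hat)[0], atol=1e-4)
```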

Force to language

To translate the virtual wrench $U_v$ into a verbal cue, we employ the cross-modal force-to-language model introduced in ref. 25, which embeds force profiles and natural-language phrases into a shared latent space (Fig. 13).

Fig. 13. Force-to-language examples using the cross-modality model25. Each plot shows a 3D force profile over time (components x, y, z in red/green/blue).

Following ref. 25, an utterance must encode three attributes of the intended human action: magnitude, direction, and duration. We therefore project the controller forward over a fixed horizon $T$ while holding the current compliance estimates $\hat{P}, \hat{V}$ constant. Assuming the human tracks the verbal request, we substitute $F_h = U_v$ into the control law given by (13):

$$U_v(\tau) = K_v e(\tau) + B_v \dot{e}(\tau) - A_v U_v(\tau), \tag{40}$$

which reduces to:

$$U_v(\tau) = \frac{K_v e(\tau) + B_v \dot{e}(\tau)}{1 + A_v}, \quad \tau \in [t, t+T]. \tag{41}$$

Equation (41) is evaluated at $N = T/\Delta t$ uniform samples (sampling period $\Delta t$) to yield the profile

$$u_v = [U_v(t), \ldots, U_v(t+T)], \tag{42}$$

which is fed into the model provided by ref. 25. The model embeds the profile into a joint force-language latent space and outputs the closest utterance

$$\sigma = \Psi(u_v) \in L, \tag{43}$$

where $L$ is a finite library of therapist-style phrases (e.g. “push gently to the right”). The resulting utterance is then delivered using the GNUspeech text-to-speech tool to provide verbal guidance to the patient.
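A schematic sketch of this step is shown below. Holding the tracking error fixed over the horizon is our simplification (the controller evaluates (41) along the projected trajectory), and embed_force and phrase_bank are hypothetical stand-ins for the cross-modal model and phrase library of ref. 25.

```python
import numpy as np

def verbal_rollout(e, e_dot, Kv, Bv, A_v, horizon_N):
    """Evaluate eq. (41), holding the tracking error fixed over the horizon."""
    U_v = (Kv @ e + Bv @ e_dot) / (1.0 + A_v)
    return np.tile(U_v, (horizon_N, 1))        # (N, m) force profile u_v, eq. (42)

def nearest_utterance(u_v, embed_force, phrase_bank):
    """Pick the phrase whose embedding is closest to the force profile's.

    embed_force : force-profile encoder of the cross-modal model (ref. 25)
    phrase_bank : dict mapping phrase -> precomputed language embedding
    """
    z = embed_force(u_v)                       # shared latent-space embedding
    return min(phrase_bank,
               key=lambda s: np.linalg.norm(phrase_bank[s] - z))
```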

Real-time control implementation

The robot guidance controller was implemented using Universal Robots’ Real-Time Data Exchange (RTDE) interface for direct communication with the UR5 robot controller. The control system employs a hierarchical structure with two primary loops. The low-level admittance control operates at 200 Hz via RTDE for force and position control, while the high-level compliance estimation runs at 1 Hz for Bayesian filter updates and guidance allocation optimization. This frequency separation allows the admittance controller to maintain stable force interaction with sub-5 ms latency while the compliance estimator processes accumulated tracking data over longer time windows to infer user responsiveness patterns.

The UR5 robot was controlled via the RTDE interface with direct access to the robot controller at 500 Hz. Force measurements were obtained from the 6-DOF force/torque sensor sampled at 1 kHz through dedicated acquisition hardware. End-effector position data was acquired from robot kinematics via RTDE at 200 Hz. Speech synthesis employed the Festival TTS engine with a custom phrase library for real-time verbal instruction generation.

The RTDE interface provides deterministic communication with the robot controller, enabling the admittance control loop to operate reliably at 200 Hz. Force feedback latency was maintained under 5 ms to ensure stable physical interaction. The compliance estimation and speech generation processes operated at standard priority since they do not require hard real-time guarantees. All sensor readings, control commands, and system states were logged at their respective sampling rates for offline analysis and performance validation.
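The two-rate structure can be sketched as follows. This is illustrative only: sleep-based timing is not hard real-time, and the actual implementation relies on RTDE’s deterministic cycle; the robot and estimator objects are hypothetical wrappers.

```python
import threading
import time

def run_controller(robot, estimator, stop_event):
    """Two-rate loop: 200 Hz admittance control, 1 Hz compliance estimation."""
    shared = {'A_p': 0.5}                  # weight written by the slow loop
    lock = threading.Lock()

    def fast_loop():                       # 200 Hz admittance / force control
        while not stop_event.is_set():
            with lock:
                A_p = shared['A_p']
            robot.admittance_step(A_p)     # one servo cycle over RTDE
            time.sleep(1.0 / 200)

    def slow_loop():                       # 1 Hz Bayesian update + allocation
        while not stop_event.is_set():
            P_hat, V_hat = estimator.update(robot.tracking_window())
            c_p, c_v = 0.5 * (P_hat + V_hat), 1.0 - V_hat
            with lock:
                shared['A_p'] = c_v / (c_p + c_v)   # eq. (36)
            time.sleep(1.0)

    threads = [threading.Thread(target=fast_loop),
               threading.Thread(target=slow_loop)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```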

Safety considerations

Human safety was ensured through compliance with ISO/TS 15066 and ISO 10218 standards for collaborative robotics. Robot velocities were limited to 250 mm/s during physical contact, and interaction forces were continuously monitored with automatic intervention at 50 N threshold. Emergency stop systems were accessible to both participants and experimenters, with trained supervision throughout all trials. The robot workspace was software-constrained to prevent collisions, and comprehensive risk assessment confirmed low-risk classification suitable for human subjects research under IRB approval.
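A minimal sketch of the monitoring logic, assuming only the thresholds named above (the real system triggers an automatic stop rather than returning a flag):

```python
import numpy as np

def safety_ok(F_h, x_dot, force_limit=50.0, speed_limit=0.250):
    """Check the interaction-force and TCP-speed limits named in the text."""
    force_ok = np.linalg.norm(F_h[:3]) < force_limit    # 50 N automatic-stop threshold
    speed_ok = np.linalg.norm(x_dot[:3]) < speed_limit  # 250 mm/s limit during contact
    return force_ok and speed_ok
```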

Passivity

Passivity guarantees that the closed-loop robot cannot generate net mechanical energy and therefore remains Lyapunov stable when coupled to a passive environment such as a human arm2,26. We prove that our proposed controller is output-strictly passive with respect to the port variables $(F_h, \dot{x})$.

Throughout, we recall the error definition

$$e(t) = x_{\mathrm{ref}}(t) - x(t), \quad \dot{e}(t) = \dot{x}_{\mathrm{ref}}(t) - \dot{x}(t), \tag{44}$$

where the reference state $(x_{\mathrm{ref}}, \dot{x}_{\mathrm{ref}})$ is obtained at each control cycle by projecting the current end-effector pose onto the therapist-demonstrated curve $\gamma$ (Sec. 4.4). By construction of this nearest-point projection we have, for all $t$,

$$e(t)^\top \dot{x}_{\mathrm{ref}}(t) = 0. \tag{45}$$

Define the candidate storage function

$$S(t) = \tfrac{1}{2} \dot{x}(t)^\top M \dot{x}(t) + \tfrac{1}{2} e(t)^\top K_p e(t), \tag{46}$$

which combines the kinetic energy1,27,28 of the effective inertia $M \succ 0$ with the potential energy stored in the proportional term $K_p \succ 0$.

Theorem 1

Let $M \succ 0$, $B \succ 0$, $K_p \succ 0$ and $B_p \succ 0$, and assume the admittance dynamics (2) together with the physical control law

$$U_p = K_p e + B_p \dot{e} - A_p F_h, \quad 0 \leq A_p(t) \leq 1, \; \forall t, \tag{47}$$

where $A_p + A_v = 1$ as in (11). Then for every $t \geq 0$

$$S(t) - S(0) \leq \int_0^t F_h(\tau)^\top \dot{x}(\tau) \, d\tau. \tag{48}$$

Hence the closed loop is passive with input $F_h$ and output $\dot{x}$.

The proof of Theorem 1 is provided in Supplementary Note 3.

Supplementary information

Supplemental material (9.1 MB, pdf)

Acknowledgements

We thank the clinical staff at Spaulding Rehabilitation Hospital for their assistance with data collection and the study participants for their time and effort.

Author contributions

R.T. conceived the project, developed the controller, and wrote the manuscript. J.P. and K.V. contributed to the implementation of the controller, designed the experiments, data collection and analysis, and wrote the manuscript. P.B. provided clinical expertise and access to rehabilitation facilities. H.A. supervised the project and provided guidance on robot control methodology. All authors reviewed and approved the final manuscript.

Peer review

Peer review information: Communications Engineering thanks Antonio Galiza and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Matteo De Marchi and Philip Coatsworth. A peer review file is available.

Data availability

The entire dataset, including observational data, video demonstrations, and user study results for all participants, is publicly available in our online repository at 10.17632/bgybkjtpn9.2.

Code availability

The code for the Robot Guidance Controller is available at https://github.com/robot-guidance-controller/robot-guidance-controller/. The repository includes implementation of the compliance estimator, optimization algorithm, and force-to-language model.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s44172-026-00632-5.

References

  • 1. Haddadin, S., Albu-Schaffer, A. & Hirzinger, G. Safe, stable and intuitive control for physical human-robot interaction. 3383–3388 (2009).
  • 2. Mamedov, S. et al. A passivity-based framework for safe physical human–robot interaction. Robotics 12, 116 (2023).
  • 3. Atkeson, C. G. & Schaal, S. Robot learning from demonstration. In ICML, vol. 97, 12–20 (1997).
  • 4. Lozano-Pérez, T. Robot programming. Proc. IEEE 71, 821–841 (1983).
  • 5. Lozano-Pérez, T. Teaching robots by task demonstration. In Proceedings of the IEEE International Conference on Robotics and Automation (1987).
  • 6. Asada, H. & Asari, Y. The direct teaching of tool manipulation skills via the impedance identification of human motion. In Proceedings of the IEEE International Conference on Robotics and Automation, 1269–1274 (IEEE, 1988).
  • 7. Liu, S. & Asada, H. Transferring manipulative skills to robots: representation and acquisition of tool manipulative skills using a process dynamics model (1992).
  • 8. Ross, S. & Bagnell, D. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 661–668 (JMLR Workshop and Conference Proceedings, 2010).
  • 9. Krebs, H. I., Hogan, N., Aisen, M. L. & Volpe, B. T. Rehabilitation robotics: performance-based progressive robot-assisted therapy. Autonomous Robots 15, 7–20 (2003).
  • 10. Mocan, B., Mocan, M., Fulea, M., Murar, M. & Feier, H. Robotics in physical rehabilitation: systematic review. Healthcare 12, 1720 (2024).
  • 11. Fasola, J. & Mataric, M. J. A socially assistive robot exercise coach for the elderly. J. Hum.-Robot Interact. 2, 3–32 (2013).
  • 12. Huang, C.-M. & Mutlu, B. Multivariate evaluation of interactive robot systems. Autonomous Robots 37, 335–349 (2014).
  • 13. Reinkensmeyer, D. J. & Patton, J. L. Can robots help the learning of skilled actions? Exerc. Sport Sci. Rev. 37, 43 (2009).
  • 14. Duffy, B. R. & Joue, G. The paradox of social robotics: a discussion. In AAAI Fall Symposium on Machine Ethics, 22–24 (2005).
  • 15. Ferraguti, F., Secchi, C. & Fantuzzi, C. A variable admittance control strategy for stable physical human–robot interaction. Int. J. Robot. Res. 32, 949–968 (2013).
  • 16. Wu, Q., Zhao, X., Wang, Y. & Zhang, J. Variable admittance control for safe physical human–robot interaction considering intuitive human intention. Mech. Mach. Theory 184, 105307 (2023).
  • 17. Kim, N., Shin, H., Sim, H. & Park, G. Variable admittance control based on human–robot collaboration observer using frequency analysis for sensitive and safe interaction. Sensors 21, 1899 (2021).
  • 18. Landi, C. T. et al. Variable admittance control preventing undesired oscillating behaviors in physical human-robot interaction. 3611–3616 (2017).
  • 19. Yu, X. et al. Bayesian estimation of human impedance and motion intention for human-robot collaboration. IEEE Trans. Cybern. 51, 1822–1834 (2021).
  • 20. Li, Y., Carboni, G., Gonzalez, F., Campolo, D. & Burdet, E. Trust-based variable impedance control of human–robot cooperative manipulation. Robot. Comput.-Integr. Manuf. 86, 102667 (2024).
  • 21. Wang, Y. et al. Variable admittance control based on trajectory prediction of human hand motion for physical human-robot interaction. Appl. Sci. 11, 5651 (2021).
  • 22. Miller, E. L. et al. Comprehensive overview of nursing and interdisciplinary rehabilitation care of the stroke patient: a scientific statement from the American Heart Association. Stroke 41, 2402–2448 (2010).
  • 23. Diaz, I., Gil, J. J. & Sanchez, E. Rehabilitation robots for the treatment of sensorimotor deficits: a neurophysiological perspective. J. Neuroeng. Rehabil. 15, 1–15 (2018).
  • 24. Sheng, B., Zhang, Y., Meng, W., Deng, C. & Xie, S. The present and future of robotic technology in rehabilitation. Curr. Opin. Neurol. 29, 814–821 (2016).
  • 25. Tejwani, R., Velazquez, K., Payne, J., Bonato, P. & Asada, H. Cross-modality embedding of force and language for natural human-robot communication. In Proceedings of Robotics: Science and Systems (Los Angeles, California, 2025).
  • 26. Zhang, Z., Li, T. & Figueroa, N. Constrained passive interaction control: leveraging passivity and safety for robot manipulators. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 13418–13424 (IEEE, 2024).
  • 27. Petrovic, L., Zollo, L. & Khatib, O. Energy based control for safe human-robot physical interaction. In 2016 International Symposium on Experimental Robotics, 723–733 (Springer, 2017).
  • 28. Hogan, N. Impedance control: an approach to manipulation. In 1984 American Control Conference, 304–313 (IEEE, 1984).
