Uncertainty-Aware Human-to-Robot Intention Prediction for Assisted Teleoperation

Abstract

In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for enabling timely and reliable robotic assistance during long-horizon manipulation and assembly tasks. Robot teleoperation demonstrations are costly and hardware-limited, whereas human demonstrations are easier to collect and provide rich temporal structure. To address this challenge, we propose an uncertainty-aware human-to-robot intention prediction framework that combines: (1) hierarchical transfer learning, where MS-TCN++ is pretrained on human hand demonstrations and fine-tuned on limited robot teleoperation data; (2) a conformal prediction module that provides frame-level prediction sets with statistical coverage guarantees for reliable uncertainty quantification; and (3) VLM-guided segment correction, which selectively reviews low-confidence or temporally uncertain segments using visual and temporal context.

Experiments on robot assembly demonstrations with 22 action classes show that human-to-robot fine-tuning improves the robot test-set Edit score from 70.50 to 80.70 using only 16 robot demonstrations, a gain of 10.20 points over training from scratch. Edit-safe VLM correction preserves the Edit score while raising frame accuracy from 45.21% to 46.42%, providing a safety net for uncertain segments. These results show that human demonstrations provide scalable pretraining data for robust, uncertainty-aware robot action segmentation.

Overview

Figure: Framework overview. (1) MS-TCN++ pretrained on human demonstrations is fine-tuned on robot data; (2) conformal prediction quantifies per-frame uncertainty; (3) uncertain segments are selectively queried by a VLM with edit-safe acceptance.

Method

Three components work together to produce reliable, uncertainty-calibrated action segmentation.

1. Cross-Domain Transfer

MS-TCN++ is pretrained on 51 UMI hand demonstrations encoding action transition priors, then fine-tuned on 16 ALOHA robot demonstrations. Both domains share a 22-class vocabulary and 192-dim X3D-M features, enabling complete weight transfer without architectural changes.

2. Conformal Prediction

Temperature-scaled softmax probabilities are calibrated on a held-out split to produce prediction sets with finite-sample marginal coverage guarantees. Four variants are evaluated: standard (marginal), class-conditional, one-sided shrinkage, and two-sided shrinkage.
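
The calibration step described above can be illustrated with a minimal sketch of standard (marginal) split conformal prediction. It is not the authors' implementation: the nonconformity score (one minus the softmax probability of the true class), the function names, and the toy logits are all assumptions chosen for illustration, and only the standard library is used.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one frame's class logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Finite-sample-corrected quantile of the scores 1 - p(true class).

    Assumed score function; the source does not state which score is used.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))  # 1-based rank of the quantile
    if k > n:
        return 1.0  # too few calibration frames: every class is included
    return scores[k - 1]

def prediction_set(probs, qhat):
    """All classes whose score 1 - p does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

# Toy calibration data with 3 classes (hypothetical values).
cal_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
cal_labels = [0, 1, 2, 0]
cal_probs = [softmax(z, temperature=1.0) for z in cal_logits]

qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
confident_set = prediction_set(softmax([3.0, 0.0, 0.0]), qhat)
```

With exchangeable calibration and test frames, sets built this way contain the true label with probability at least 1 - alpha on average. Frames within one video are correlated, so how well that assumption holds here is a separate question that this sketch does not address.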

3. VLM-Guided Correction

Segments flagged by low confidence or short duration are queried blind to the predicted label, preventing anchoring bias. Corrections are accepted only when they preserve or improve the video-level edit score, ensuring safe integration.
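
A rough sketch of the flag-then-accept logic follows. The VLM call itself is omitted. The thresholds (`min_conf`, `min_frames`), the segment dictionary layout, and the use of a reference label sequence to compute the edit score are all assumptions for illustration; the source does not specify them.

```python
def levenshtein(a, b):
    """Edit distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def edit_score(pred_labels, ref_labels):
    """Segmental edit score in [0, 100]; higher is better."""
    denom = max(len(pred_labels), len(ref_labels))
    if denom == 0:
        return 100.0
    return 100.0 * (1.0 - levenshtein(pred_labels, ref_labels) / denom)

def collapse(labels):
    """Merge adjacent duplicates so each segment label appears once."""
    out = []
    for lab in labels:
        if not out or out[-1] != lab:
            out.append(lab)
    return out

def flag_segments(segments, min_conf=0.5, min_frames=30):
    """Indices of segments with low confidence or short duration.

    Both thresholds are hypothetical placeholders.
    """
    return [i for i, s in enumerate(segments)
            if s["conf"] < min_conf or (s["end"] - s["start"]) < min_frames]

def apply_edit_safe(segments, proposals, ref_labels):
    """Accept each proposed relabel only if the edit score does not drop."""
    labels = [s["label"] for s in segments]
    for idx, new_label in proposals.items():
        trial = labels.copy()
        trial[idx] = new_label
        before = edit_score(collapse(labels), ref_labels)
        after = edit_score(collapse(trial), ref_labels)
        if after >= before:
            labels = trial
    return labels

# Toy example (hypothetical segments and reference sequence).
segments = [
    {"label": "pick up screw", "start": 0, "end": 461, "conf": 0.9},
    {"label": "position screw", "start": 461, "end": 896, "conf": 0.8},
    {"label": "put down screwdriver", "start": 896, "end": 1015, "conf": 0.3},
]
reference = ["pick up screw", "position screw", "pick up wheel"]
flagged = flag_segments(segments)
corrected = apply_edit_safe(segments, {2: "pick up wheel"}, reference)
```

Only flagged segments would be sent to the VLM, and a proposal that lowers the score is discarded, so in this sketch the edit score after correction can never be worse than before.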

Why human demonstrations?

The 22×22 action transition matrix has 484 entries, so estimating it reliably requires at least ~484 observed transitions. Sixteen robot demonstrations provide only ~352, too few to learn long-range assembly ordering. Pretraining on 51 human demonstrations encodes these priors, which are then transferred to the robot domain via fine-tuning.
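
The counting argument can be made concrete with a small sketch. The ~352 figure is consistent with roughly 22 transitions per demonstration (16 × 22 = 352), though that derivation is an assumption. The toy sequences below are hypothetical: each visits the 22 classes once in a fixed order, giving 21 transitions per demonstration.

```python
from collections import Counter

def transition_counts(sequences):
    """Count (previous action, next action) pairs across label sequences."""
    counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[(prev, nxt)] += 1
    return counts

NUM_CLASSES = 22
cells = NUM_CLASSES * NUM_CLASSES  # 484 entries in the transition matrix

# Hypothetical data: 16 demos, each a fixed-order pass over all 22 classes.
demos = [list(range(NUM_CLASSES)) for _ in range(16)]
counts = transition_counts(demos)

total_transitions = sum(counts.values())  # 16 * 21 = 336
observed_cells = len(counts)              # 21 distinct (prev, next) pairs
print(cells, total_transitions, observed_cells)  # prints: 484 336 21
```

Even before asking how the observations are spread over the matrix, the total number of transitions is smaller than the number of entries. In this toy case the observations also concentrate on a handful of cells and leave most of the matrix unobserved.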

Human and Robot Datasets

Source Domain — UMI

51 training + 10 validation hand demonstrations of toy car assembly captured with an egocentric GoPro camera on a handheld UMI gripper.

Target Domain — ALOHA

40 teleoperated demonstrations yielding 80 synchronized streams (480×640, dual wrist cameras). Split: 16 train / 10 val / 8 test / 6 calibration demonstrations.

Results

All models are evaluated on the ALOHA test set using frame accuracy, Edit score, and F1 at overlap thresholds of 10%, 25%, and 50%.
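
For readers unfamiliar with these metrics, the sketch below computes frame accuracy and segmental F1 at an IoU threshold, following the convention commonly used for action segmentation. It is an illustrative assumption about the evaluation protocol, not the authors' evaluation script, and the toy label sequences are hypothetical.

```python
def to_segments(frames):
    """Convert per-frame labels into (label, start, end), end exclusive."""
    segments = []
    start = 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[start]:
            segments.append((frames[start], start, i))
            start = i
    return segments

def frame_accuracy(pred, gt):
    """Percentage of frames whose predicted label matches the ground truth."""
    correct = sum(p == g for p, g in zip(pred, gt))
    return 100.0 * correct / len(gt)

def f1_at_k(pred, gt, threshold):
    """Segmental F1 (in percent) at the given IoU threshold."""
    pred_segs, gt_segs = to_segments(pred), to_segments(gt)
    matched = [False] * len(gt_segs)
    tp = fp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(gt_segs):
            if g_label != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        # A ground-truth segment can be matched by at most one prediction.
        if best_j >= 0 and best_iou >= threshold and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = matched.count(False)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)

# Toy example: the predicted boundary is one frame early.
gt = [0] * 5 + [1] * 5
pred = [0] * 4 + [1] * 6
print(frame_accuracy(pred, gt))  # prints: 90.0
print(f1_at_k(pred, gt, 0.5))    # prints: 100.0
print(f1_at_k(pred, gt, 0.9))    # prints: 0.0
```

The example shows why the metrics can diverge: a small boundary shift costs one frame of accuracy, leaves F1 at loose thresholds untouched, and fails every segment at a strict threshold.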

Edit Score: 80.70 (Human→Robot fine-tuned)
Edit Score Gain: +10.2 (over robot-only baseline)
Frame Accuracy: 46.4% (after VLM correction)
Robot Demos: 16 (only 16 demos needed)

Transfer Learning

Model | Edit ↑ | Acc % ↑ | F1@10 ↑ | F1@25 ↑ | F1@50 ↑
Hand-Only (zero-shot baseline) | 28.93 | 13.96 | - | - | -
Robot-Only (trained from scratch) | 70.50 | 40.22 | - | - | -
Human → Robot (ours) | 80.70 | 45.21 | 51.23 | 41.36 | 22.22
Human → Robot + VLM (ours, best) | 80.70 | 46.42 | 51.97 | 41.81 | 22.98
Key Takeaway (Edit Score): Robot-Only 70.50, Human → Robot 80.70, gain +10.20

Conformal Prediction

Standard CP empirical coverage: ≥94% (across the 93% and 97% target levels)
Set size reduction: ~25% (class-conditional CP vs. standard CP)
Standard CP mean prediction set size: 16–20 (out of 22 classes)
[Figure: Standard CP results]

[Figure: Regularized class-conditional CP results]

CP Method | Target Level | Robot-Only Coverage | Human→Robot Coverage | Mean Set Size
Standard CP (marginal) | 93% | 94.2% | 95.0% | 16.8–17.8
Standard CP (marginal) | 97% | 97.5% | 96.5% | ~20
Class-Conditional CP (regularized) | 93% | 86.5% | 89.3% | ~12.5
Class-Conditional CP (regularized) | 97% | 92.1% | 92.3% | ~14
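
The coverage and set-size columns can be computed from per-frame prediction sets as sketched below. The function names and toy data are hypothetical, and the per-class breakdown is included only to illustrate the quantity that class-conditional calibration targets; it is not a result from the experiments.

```python
def coverage_and_size(pred_sets, labels):
    """Empirical coverage (percent) and mean prediction set size."""
    covered = sum(y in s for s, y in zip(pred_sets, labels))
    mean_size = sum(len(s) for s in pred_sets) / len(pred_sets)
    return 100.0 * covered / len(labels), mean_size

def per_class_coverage(pred_sets, labels):
    """Coverage (percent) computed separately for each true class."""
    hits, totals = {}, {}
    for s, y in zip(pred_sets, labels):
        totals[y] = totals.get(y, 0) + 1
        hits[y] = hits.get(y, 0) + (1 if y in s else 0)
    return {c: 100.0 * hits[c] / totals[c] for c in totals}

# Toy example: four frames with hypothetical prediction sets.
pred_sets = [{0, 1}, {1}, {2, 3, 4}, {0}]
labels = [0, 1, 2, 1]

coverage, mean_size = coverage_and_size(pred_sets, labels)
by_class = per_class_coverage(pred_sets, labels)
print(coverage, mean_size)  # prints: 75.0 1.75
```

Marginal coverage averages over all frames, so a frequent, easy class can mask poor coverage on a rare one. The per-class view makes that visible.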

Segmentation Results

Segment-level predictions for teleoperated robot video

Start | End | Duration | Ground Truth Action | Predicted Action | IoU | Result
0 | 461 | 461 | pick up screw | pick up screw | 62.7% | correct
461 | 896 | 435 | position screw on first wheel | position screw on first wheel | 63.3% | correct
896 | 1015 | 119 | pick up first wheel | position screw on first wheel | 11.0% | incorrect
1015 | 2159 | 1144 | position first wheel | position first wheel | 54.1% | correct
2159 | 2757 | 598 | pick up electric screwdriver | position first wheel | 28.2% | incorrect
2757 | 3581 | 824 | position screwdriver bit | position screwdriver bit | 46.4% | correct
3581 | 3779 | 198 | screw first wheel with screwdriver | put down electric screwdriver | 59.9% | incorrect
3779 | 3968 | 189 | put down electric screwdriver | pick up screw | 54.9% | incorrect

16 correct / 22 segments  |  Human→Robot model  |  Video v220