EgoWild-X: Egocentric, dexterous in-the-wild human interactions with precise motion capture

Sept 16, 2025

REQUEST SAMPLE DATASET

FULL DATASET

Figure 1: Example segment of an egocentric video (disassembling a table) paired with its corresponding motion-captured hand pose.

OVERVIEW

EgoWild-X is a large-scale dataset of egocentric video sequences capturing dexterous, in-the-wild human interactions, paired with ground-truth hand motion recorded with Manus MetaGloves Pro. Head-mounted wide-lens GoPro cameras (270° FOV) are synchronized with glove-based joint tracking, yielding high-resolution, multimodal recordings suited to training embodied agents with fine motor control. The dataset is similar to our EgoWild dataset, with the addition of motion capture.
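As a rough illustration of what synchronized video and glove streams allow, the sketch below pairs each video frame with the nearest glove sample by timestamp. It is a minimal example under stated assumptions, not the dataset's actual loading code: the function names, the shared clock, and the 120 Hz glove rate are hypothetical, and only the 60 fps video rate comes from the dataset statistics.

```python
from bisect import bisect_left

VIDEO_FPS = 60  # frame rate listed in the dataset statistics


def frame_timestamps(num_frames, fps=VIDEO_FPS):
    """Timestamps (seconds) of each frame, assuming a constant frame rate from t=0."""
    return [i / fps for i in range(num_frames)]


def nearest_sample_indices(frame_times, glove_times):
    """For each frame time, return the index of the closest glove sample.

    glove_times must be non-empty and sorted in ascending order, and both
    streams are assumed to share one clock (an assumption, not a documented fact).
    """
    indices = []
    for t in frame_times:
        pos = bisect_left(glove_times, t)
        if pos == 0:
            indices.append(0)
        elif pos == len(glove_times):
            indices.append(len(glove_times) - 1)
        else:
            before, after = glove_times[pos - 1], glove_times[pos]
            # Pick whichever neighbouring sample is closer in time.
            indices.append(pos if after - t < t - before else pos - 1)
    return indices


if __name__ == "__main__":
    # Hypothetical glove stream sampled at 120 Hz (rate chosen for illustration only).
    glove_times = [i / 120 for i in range(13)]
    print(nearest_sample_indices(frame_timestamps(4), glove_times))
```

Nearest-neighbour matching is the simplest choice; interpolating joint values between the two surrounding glove samples would be a natural refinement if the glove stream is sampled more sparsely than the video.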

Unlike lab-constrained datasets, EgoWild-X emphasizes real-world household and assembly tasks performed in natural settings. Recordings include:

Cooking (chopping, stirring, opening jars, pouring)
Cleaning & Laundry (wiping, folding, ironing, vacuuming, picking up clutter)
Assembly & Repair (building furniture, screwing parts, tightening bolts)
Daily Routines (typing, writing, opening doors, using laptops)
Utility (unpacking, carrying bags, moving furniture)

Objects range from deformables (clothes, sponges, food) to rigid/articulated items (drawers, appliances, tools), with recordings spanning kitchens, garages, offices, and outdoor environments—complete with clutter, occlusion, and social presence.

EgoWild-X supports research in:

Learning dexterous manipulation policies from egocentric multimodal data
Understanding object affordances in realistic, task-driven contexts
Bridging vision-based hand pose estimation with ground-truth kinematics
Advancing generalist robot hands for everyday environments
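For the hand pose estimation use case, one common way to compare a vision-based prediction against glove ground truth is the mean per-joint position error (MPJPE). The sketch below is illustrative only: it assumes both poses are given as lists of 3D joint positions in the same coordinate frame and units, which is a hypothetical representation and not a description of the dataset's actual file format.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean Euclidean distance between corresponding 3D joints.

    Both arguments are sequences of (x, y, z) tuples with matching joint order
    (an assumed layout for illustration).
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("poses must be non-empty and have the same number of joints")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


if __name__ == "__main__":
    # Two-joint toy example: the first joint matches, the second is off by 5 units.
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    ground_truth = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
    print(mpjpe(predicted, ground_truth))
```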

Because its recordings are task-rich and untrimmed, EgoWild-X is well suited to embodied AI systems that must learn from the same kinds of activities humans perform daily.

Figure 2: Example segment of an egocentric video (preparing ingredients and cooking) paired with its corresponding motion-captured hand pose.

DATASET STATISTICS

Resolution: 4K

Frame rate: 60 fps

GET IN TOUCH

To request the sample dataset or ask about the full dataset, please contact us at db@zeroframe.ai or use the button below.

REQUEST SAMPLE DATASET