EgoWild-X: Egocentric, dexterous in-the-wild human interactions with precise motion capture

Sept 16, 2025

Figure 1: Example segment of an egocentric video (disassembling a table) paired with its corresponding motion-captured hand pose.

OVERVIEW

EgoWild-X is a large-scale dataset of egocentric video sequences that capture dexterous, in-the-wild human interactions, paired with ground-truth hand motion recorded with Manus MetaGloves Pro. Head-mounted wide-lens GoPro cameras (270° FOV) are synchronized with glove-based joint tracking, yielding high-resolution, multimodal recordings suited to training embodied agents with fine motor control. The dataset follows the same approach as our EgoWild dataset and adds motion capture.
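As a rough illustration of what synchronized video and glove tracking can mean in practice, the sketch below pairs each video frame with the nearest glove sample by timestamp. It is not the dataset's actual loader: the file layout and glove sampling rate are not described here, so the function names, the shared clock in seconds, and the sorted list of glove timestamps are all assumptions made for the example. Only the 60 fps frame rate comes from the dataset statistics.

```python
from bisect import bisect_left

FPS = 60  # video frame rate listed under the dataset statistics


def frame_timestamps(num_frames, fps=FPS):
    """Return the nominal capture time in seconds of each video frame.

    Hypothetical helper: assumes a constant frame rate starting at t = 0.
    """
    return [i / fps for i in range(num_frames)]


def nearest_sample_indices(frame_times, glove_times):
    """For each frame time, return the index of the closest glove sample.

    Assumes glove_times is non-empty, sorted ascending, and expressed on
    the same clock as frame_times (an assumption, not a documented fact).
    """
    if not glove_times:
        raise ValueError("glove_times must contain at least one sample")

    indices = []
    for t in frame_times:
        pos = bisect_left(glove_times, t)
        if pos == 0:
            # Frame is at or before the first glove sample.
            indices.append(0)
        elif pos == len(glove_times):
            # Frame is after the last glove sample.
            indices.append(len(glove_times) - 1)
        else:
            before = glove_times[pos - 1]
            after = glove_times[pos]
            # Pick whichever neighbour is closer; ties go to the earlier one.
            indices.append(pos if after - t < t - before else pos - 1)
    return indices
```

For example, `nearest_sample_indices(frame_timestamps(4), glove_times)` would return one glove sample index per frame, which can then be used to look up the joint angles recorded closest to that frame. Nearest-neighbour matching is the simplest choice; interpolating between the two surrounding glove samples is a common alternative when the glove and camera rates differ.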


Unlike lab-constrained datasets, EgoWild-X emphasizes real-world household and assembly tasks performed in natural settings. Recordings include:


  • Cooking (chopping, stirring, opening jars, pouring)

  • Cleaning & Laundry (wiping, folding, ironing, vacuuming, picking up clutter)

  • Assembly & Repair (building furniture, screwing parts, tightening bolts)

  • Daily Routines (typing, writing, opening doors, using laptops)

  • Utility (unpacking, carrying bags, moving furniture)


Objects range from deformables (clothes, sponges, food) to rigid and articulated items (drawers, appliances, tools). Recordings span kitchens, garages, offices, and outdoor environments, complete with clutter, occlusion, and social presence.

EgoWild-X supports research in:


  • Learning dexterous manipulation policies from egocentric multimodal data

  • Understanding object affordances in realistic, task-driven contexts

  • Bridging vision-based hand pose estimation with ground-truth kinematics

  • Advancing generalist robot hands for everyday environments


The task-rich, untrimmed nature of EgoWild-X makes it uniquely suited for embodied AI systems that must learn from the same kinds of activities humans perform daily.

Figure 2: Example segment of an egocentric video (preparing ingredients and cooking) paired with its corresponding motion-captured hand pose.

DATASET STATISTICS

Resolution: 4K

Frame rate: 60 fps

GET IN TOUCH

To request the full sample dataset, please contact us at db@zeroframe.ai.

SAN FRANCISCO, CA. ALL RIGHTS RESERVED.