Jul 21, 2025
If you'd like to receive the 4K samples directly (since YouTube compresses video uploads), please contact us at team@zeroframe.ai.
Figure 1: Example timestamped moments from a 51-minute egocentric video with short high-level captions. Please refer to the Data Annotation section below for details on the dataset's annotations.
OVERVIEW
EgoExplore is a real-world dataset of egocentric video sequences capturing open-world exploration with a strong emphasis on embodied physics, object interaction, and multi-agent dynamics. Collected using head-mounted GoPro cameras in natural, unstructured environments, the dataset offers rich, continuous trajectories ideal for training and evaluating world models, embodied agents, and multimodal perception systems.
Unlike scripted or task-bounded datasets, EgoExplore features freeform exploration through diverse physical scenes, where human behavior, object manipulation, and environmental physics unfold naturally and without supervision.
The dataset is especially rich across three dimensions:
Object Interaction – manipulation of real-world objects such as gas pumps, car trunks, gym equipment, trash bins, and rideshare bicycles
Human Action – locomotion (walking, running, biking), physical exertion (pull-ups), fine motor actions (opening, lifting, discarding), and transitions across modalities (walking → driving → biking)
Behavior of Other Agents – close-range, unpredictable interactions with other autonomous agents like a dog (peeing, jumping, sniffing) and human drivers (Uber/Lyft)
The sample dataset currently includes two sequences:
Suburban & Terrain Exploration (Figure 1, 51 minutes): A person walks his dog through a suburban neighborhood and hilly terrain, capturing complex agent-environment interactions—dog behavior, human response, object use (car trunk, trash bin), and urban navigation (driving to a gas station, handling pump failure).
Urban Multimodal Journey (Figure 2, 52 minutes): A person enters an Uber, exercises at an outdoor gym (e.g. pull-ups on public bars), explores the city on foot, and ends with a Lyft bike rental. This sequence captures transitions across transportation modes, public infrastructure use, and diverse human-object contact.
Each sequence is captured in high resolution and provides continuous physical context, with real-world constraints, cause and effect, and emergent behavior. The dataset supports research in:
Learning physics-aware world models from egocentric data
Modeling temporal and spatial continuity of embodied action
Understanding object affordances and agent-driven environmental changes
Capturing multi-agent dynamics in open, unpredictable settings
The untrimmed, immersive nature of EgoExplore makes it well-suited for advancing generalist agents and sensorimotor learning beyond closed benchmarks.
Figure 2: Example timestamped moments from a 52-minute egocentric video with short high-level captions.
DATASET STATISTICS
Quality: 4K at 60 fps
Collection in progress
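If you want to verify that a downloaded sample matches the stated capture quality, the following is a minimal sketch using OpenCV. The file name is a placeholder for wherever you have saved a sample locally; it is not an actual file name from the dataset.

```python
# Sketch: confirm a local sample matches the stated 4K / 60 fps capture quality.
# "sample_01.mp4" is a placeholder path, not an actual file name from the dataset.
import cv2

cap = cv2.VideoCapture("sample_01.mp4")
if not cap.isOpened():
    raise RuntimeError("Could not open video file")

width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()

print(f"{width}x{height} @ {fps:.0f} fps")  # expect roughly 3840x2160 @ 60 fps
```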
DATA ANNOTATION
Each video is annotated with two levels of natural language descriptions:
Summaries: Captions provided approximately every 5 minutes (or once per video) that offer a high-level overview of the events and activities in the video
Narrations: Captions that detail individual actions as they happen throughout the video
These annotations provide semantic grounding for learning and evaluating temporal understanding, behavior modeling, and video-language tasks. Refer to Figure 3 below.
Figure 3: Annotation example taken from Sample 1.
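To make the two annotation levels concrete, the snippet below sketches one way such timestamped captions could be organized in code. The field names, timestamp format, and example captions are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical layout for the two annotation levels; all field names and
# captions below are illustrative, not the dataset's actual schema.
sample_annotations = {
    "video_id": "sample_01",   # placeholder identifier
    "duration_s": 51 * 60,     # Sample 1 is roughly 51 minutes long
    "summaries": [
        # High-level captions, roughly one per 5-minute window (or one per video)
        {"start_s": 0, "end_s": 300,
         "caption": "The person leashes their dog and sets out through the neighborhood."},
    ],
    "narrations": [
        # Fine-grained captions describing individual actions as they happen
        {"timestamp_s": 42.0, "caption": "Opens the car trunk."},
        {"timestamp_s": 55.5, "caption": "Drops a bag into the trash bin."},
    ],
}

def narrations_in_window(annotations, start_s, end_s):
    """Return narration captions whose timestamps fall inside [start_s, end_s)."""
    return [n["caption"] for n in annotations["narrations"]
            if start_s <= n["timestamp_s"] < end_s]

print(narrations_in_window(sample_annotations, 0, 60))
```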
GET IN TOUCH
To request the full dataset, please contact us at team@zeroframe.ai or schedule a time with our team using the button below. You can find the sample dataset on Hugging Face using the button at the top of the page.
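If you prefer to fetch the samples programmatically rather than through the browser, the sketch below uses the huggingface_hub client. The repo id shown is a placeholder; substitute the repository linked at the top of the page.

```python
# Sketch: download the sample dataset from Hugging Face with huggingface_hub.
# The repo_id below is a placeholder; use the repository linked at the top of the page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="zeroframe/egoexplore-samples",  # placeholder, not the real repo id
    repo_type="dataset",
)
print("Sample dataset downloaded to:", local_dir)
```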