EgoExplore: Large-Scale Egocentric & Long-duration Open World Exploration Dataset

Jul 21, 2025

If you'd like to receive the 4K samples directly (since YouTube compresses video uploads), please contact us at team@zeroframe.ai.

Figure 1: Example timestamped moments from a 51-minute egocentric video with short high-level captions. Please refer to the Data Annotation section below for details on the dataset's annotations.

OVERVIEW

EgoExplore is a real-world dataset of egocentric video sequences capturing open-world exploration with a strong emphasis on embodied physics, object interaction, and multi-agent dynamics. Collected using head-mounted GoPro cameras in natural, unstructured environments, the dataset offers rich, continuous trajectories ideal for training and evaluating world models, embodied agents, and multimodal perception systems.


Unlike scripted or task-bounded datasets, EgoExplore features freeform exploration through diverse physical scenes, where human behavior, object manipulation, and environmental physics unfold naturally and without supervision.


The dataset is especially rich across three dimensions:


  • Object Interaction – manipulating real-world objects like gas pumps, car trunks, gym equipment, trash bins, and rideshare bicycles

  • Human Action – locomotion (walking, running, biking), physical exertion (pull-ups), fine motor actions (opening, lifting, discarding), and transitions across modalities (walking → driving → biking)

  • Behavior of Other Agents – close-range, unpredictable interactions with other autonomous agents like a dog (peeing, jumping, sniffing) and human drivers (Uber/Lyft)


The sample dataset currently includes two sequences:


  • Suburban & Terrain Exploration (Figure 1, 51 minutes): A person walks his dog through a suburban neighborhood and hilly terrain, capturing complex agent-environment interactions, including dog behavior, human response, object use (car trunk, trash bin), and urban navigation (driving to a gas station, handling a pump failure).

  • Urban Multimodal Journey (Figure 2, 52 minutes): A person enters an Uber, exercises at an outdoor gym (e.g., pull-ups on public bars), explores the city on foot, and ends with a Lyft bike rental. This sequence captures transitions across transportation modes, public infrastructure use, and diverse human-object contact.


Each sequence is captured at high resolution and provides continuous physical context, with real-world constraints, cause and effect, and emergent behavior. The dataset supports research in:


  • Learning physics-aware world models from egocentric data

  • Modeling temporal and spatial continuity of embodied action

  • Understanding object affordances and agent-driven environmental changes

  • Capturing multi-agent dynamics in open, unpredictable settings

The untrimmed, immersive nature of EgoExplore makes it well-suited for advancing generalist agents and sensorimotor learning beyond closed benchmarks.

Figure 2: Example timestamped moments from a 52-minute egocentric video with short high-level captions.

DATASET STATISTICS

Video quality: 4K resolution at 60 fps

Data collection is in progress.

DATA ANNOTATION

Each video is annotated with two levels of natural language descriptions:

  • Summaries: Captions provided approximately every 5 minutes (or once per video) that offer a high-level overview of the events and activities in the video

  • Narrations: Captions that detail individual actions as they happen throughout the video


These annotations provide semantic grounding for learning and evaluating temporal understanding, behavior modeling, and video-language tasks. Refer to Figure 3 below.

Figure 3: Annotation example taken from Sample 1.
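
To make the two annotation levels concrete, here is a minimal Python sketch of how they might be represented and queried. The record layout (field names such as "narrations" and "timestamp_s", and the example values) is a hypothetical illustration, since this post does not specify the on-disk annotation format.

import bisect

# Hypothetical annotation record for Sample 1. Assumes two caption tracks:
# coarse "summaries" spanning time ranges and fine-grained "narrations"
# pinned to single timestamps (in seconds), listed in chronological order.
EXAMPLE_ANNOTATION = {
    "video_id": "sample_1",  # hypothetical identifier
    "duration_s": 3060,      # 51 minutes
    "summaries": [
        {"start_s": 0.0, "end_s": 300.0,
         "text": "A person walks his dog through a suburban neighborhood."},
    ],
    "narrations": [
        {"timestamp_s": 12.4, "text": "Opens the car trunk."},
        {"timestamp_s": 95.0, "text": "The dog sniffs a trash bin."},
    ],
}

def narrations_in_window(annotation, start_s, end_s):
    """Return narration captions whose timestamps fall in [start_s, end_s]."""
    times = [n["timestamp_s"] for n in annotation["narrations"]]
    lo = bisect.bisect_left(times, start_s)
    hi = bisect.bisect_right(times, end_s)
    return [n["text"] for n in annotation["narrations"][lo:hi]]

print(narrations_in_window(EXAMPLE_ANNOTATION, 0.0, 60.0))
# -> ['Opens the car trunk.']

A retrieval helper like this is one common first step for video-language training on untrimmed footage: sample a clip window from the long video, then pair it with the narrations (and the enclosing summary) that fall inside that window.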

GET IN TOUCH

To request the full dataset, please contact us at team@zeroframe.ai or schedule time with our team using the button below. The sample dataset is available on Hugging Face via the button at the top of the page.

