EgoExplore: Large-Scale Egocentric & Long-duration Open World Exploration Dataset

Jul 21, 2025

If you'd like to receive the 4K samples directly (since YouTube compresses video uploads), please contact us at team@zeroframe.ai.

Figure 1: Example timestamped moments from a 51-minute egocentric video with short high-level captions. Please refer to the Data Annotation section below for details on the dataset's full annotations.

OVERVIEW

EgoExplore is a real-world dataset of egocentric video sequences capturing open-world exploration with a strong emphasis on embodied physics, object interaction, and multi-agent dynamics. Collected using head-mounted cameras (e.g., GoPro, DJI Osmo 3) in natural, unstructured environments, the dataset offers rich, continuous trajectories ideal for training and evaluating world models, embodied agents, and multimodal perception systems.


Unlike scripted or task-bounded datasets, EgoExplore features freeform exploration through diverse physical scenes, where human behavior, object manipulation, and environmental physics unfold naturally and without supervision.


The dataset is especially rich across three dimensions:


  • Human Action – locomotion (walking, running, biking), physical exertion (pull-ups), fine motor actions (opening, lifting, discarding), and transitions between movement modes (walking → driving → biking)

  • Object Interaction – manipulating real-world objects like gas pumps, car trunks, gym equipment, trash bins, and rideshare bicycles

  • Behavior of Other Agents – close-range, unpredictable interactions with other agents, such as a dog (peeing, jumping, sniffing) and human drivers (Uber/Lyft)


Each sequence is captured at high resolution and provides continuous physical context, with real-world constraints, cause and effect, and emergent behavior. The dataset supports research in:


  • Learning physics-aware world models from egocentric data

  • Modeling temporal and spatial continuity of embodied action

  • Understanding object affordances and agent-driven environmental changes

  • Capturing multi-agent dynamics in open, unpredictable settings


The untrimmed, immersive nature of EgoExplore makes it well-suited for advancing generalist agents and sensorimotor learning beyond closed benchmarks.

Figure 2: Example timestamped moments from a 52-minute egocentric video with short high-level captions.

DATASET STATISTICS

Resolution: 1080p to 4K

Frame rate: 60 fps

Average duration: 45 minutes

SAMPLE DATASET LOCATIONS

Countries: United States, Canada, Mexico, Portugal, Japan, Taiwan, Thailand, Indonesia

Locations: Living room, bedroom, kitchen, street, beach, swimming pool, tennis court, park, coffee shop, grocery store, car, museum, outdoor market, mountain, parking garage, metro station, metro train, arcade, forest, drive-thru, boat, elevator, waterfall, shopping mall, athletic store, apartment hallways

DATA ANNOTATION

Our dataset is richly annotated to capture both the semantic and contextual aspects of long-duration, open-world exploration videos. Each video is temporally segmented into intervals, with annotations that provide both fine-grained descriptions and structured metadata.


ANNOTATION STRUCTURE


  • Video-Level Metadata
    Each video includes:

    • fps: Frames per second.

    • frame_count: Total number of frames.

    • duration: Video duration in seconds.

    • total_intervals: Number of annotated segments.


  • Interval-Level Annotations
    Each interval is defined by:

    • start_time and end_time (in seconds).

    • caption: A natural language description of the visual content, motion, and scene context within the interval.

    • scene: Broad environmental classification (e.g., outdoor-natural, urban-street, indoor-public).

    • weather: Weather conditions observed (e.g., sunny, cloudy, rainy).

    • timeOfDay: Temporal context (e.g., day, night, dusk).

    • crowd_density: Presence and density of people (e.g., empty, scattered, crowded).
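
Putting these fields together, the sketch below shows what a single annotation file might look like in Python. Only the field names come from the structure above; the JSON-style nesting, the concrete values, and the sanity check are illustrative assumptions.

    # Hypothetical annotation record for one video. Field names follow the
    # structure above; the layout and values are illustrative assumptions.
    annotation = {
        # Video-level metadata
        "fps": 60,
        "frame_count": 162_000,   # e.g. a 45-minute video at 60 fps
        "duration": 2700.0,       # seconds; duration ≈ frame_count / fps
        "total_intervals": 90,
        # Interval-level annotations (one entry per segment)
        "intervals": [
            {
                "start_time": 0.0,   # seconds
                "end_time": 30.0,
                "caption": "The camera follows a paved path lined with trees...",
                "scene": "outdoor-natural",
                "weather": "sunny",
                "timeOfDay": "day",
                "crowd_density": "empty",
            },
            # ... remaining intervals omitted
        ],
    }

    # Sanity check implied by the video-level metadata.
    assert abs(annotation["duration"] - annotation["frame_count"] / annotation["fps"]) < 1.0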


Captions provide rich narrative descriptions of the evolving environment, including:

  • Camera motion and viewpoint changes.

  • Visual elements such as vehicles, signage, vegetation, or objects.

  • Environmental cues (e.g., shadows, foliage density, pavement condition).

  • Human presence and activities when relevant.

Together, these captions provide story-like continuity across video segments while maintaining precise temporal grounding.

EXAMPLE CAPTION (30 SEC INTERVAL)

The camera initially focuses on a smartphone held in the foreground, with a white vehicle partially visible on the left. As the camera moves away from the phone, it captures a paved area surrounded by natural elements. The scene shifts to a small parking area with three parked cars, including a white and a black vehicle. Orange traffic cones are placed near the cars, indicating a potential boundary or caution area. The camera continues along the edge of the pavement, revealing patches of grass and small rocks lining the path. Sparse vegetation and scattered leaves are visible on the ground, suggesting a natural setting. As the camera progresses, it follows the curve of the road, with shadows from nearby trees casting patterns on the pavement. Moving forward, the camera captures more of the surrounding greenery, including trees with dense foliage. The path curves gently, and the camera maintains focus on the road, with occasional glimpses of sunlight filtering through the trees. The scene is quiet and empty, emphasizing the solitude of the natural environment under the bright, sunny sky.

SEMANTIC TAGS

In addition to captions, structured fields such as scene, weather, timeOfDay, and crowd_density provide categorical labels for downstream tasks such as:

  • Video classification.

  • Context-aware retrieval.

  • Benchmarking open-world video understanding models.
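
As a minimal sketch of the context-aware retrieval use case, the snippet below filters intervals by their categorical tags, assuming the hypothetical annotation dictionary layout sketched earlier. The file name, loader, and function name are illustrative assumptions, not part of any released tooling.

    import json

    def find_intervals(annotation, **tags):
        """Return intervals whose categorical tags (scene, weather, timeOfDay,
        crowd_density) match every keyword filter."""
        return [
            interval
            for interval in annotation["intervals"]
            if all(interval.get(key) == value for key, value in tags.items())
        ]

    # Hypothetical usage: load one annotation file (file name is illustrative)
    # and retrieve quiet, daytime outdoor segments.
    with open("video_0001_annotations.json") as f:
        annotation = json.load(f)

    for interval in find_intervals(
        annotation, scene="outdoor-natural", timeOfDay="day", crowd_density="empty"
    ):
        print(interval["start_time"], interval["end_time"], interval["caption"][:60])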


IMU DATA

Alongside the visual annotations, we provide synchronized Inertial Measurement Unit (IMU) data to capture fine-grained motion dynamics during video recording.


The IMU stream is aligned with video timestamps, enabling multimodal analysis of motion, environment, and visual perception.


IMU RECORD STRUCTURE

Each IMU record includes:

  • timestamp_ms: Time in milliseconds since the start of the recording, synchronized with video frames.

  • gyro: 3-axis gyroscope readings (angular velocity in degrees/sec).

  • accl: 3-axis accelerometer readings (linear acceleration in m/s²).
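
As a small sketch of how an IMU record can be aligned with the video stream, the snippet below maps a timestamp_ms value onto a video frame index using the video-level fps. Only the field names and units come from the description above; the record values, variable names, and helper function are illustrative assumptions.

    # Hypothetical IMU record following the fields listed above.
    imu_record = {
        "timestamp_ms": 16683,          # ms since the start of the recording
        "gyro": [0.42, -1.37, 0.05],    # angular velocity, degrees/sec (x, y, z)
        "accl": [0.11, 9.79, -0.23],    # linear acceleration, m/s^2 (x, y, z)
    }

    fps = 60  # from the video-level metadata

    def frame_index(timestamp_ms, fps):
        """Map an IMU timestamp to the index of the video frame it falls within."""
        return int(timestamp_ms / 1000.0 * fps)

    print(frame_index(imu_record["timestamp_ms"], fps))  # -> 1000, i.e. ~16.7 s into the video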


The annotation structure for this dataset is inspired by the Sekai video dataset project.

GET IN TOUCH

To request the full sample dataset, please contact us at db@zeroframe.ai.
