Week 11

Published

Wednesday, May 14, 2025

Egocentric Vision

Egocentric vision is the study of visual data captured from a first-person perspective, typically using wearable cameras. This perspective is inherently dynamic, uncurated, and embodied, producing long-form video that reflects the agent’s goals, interactions, and attention.

Research Tasks

Key research tasks in egocentric vision include:

  • Localization
  • 3D scene understanding
  • Anticipation (e.g., predicting future actions or object interactions)
  • Action recognition
  • Gaze understanding and prediction
  • Social behavior understanding
  • Full body pose estimation
  • Hand and hand-object interaction analysis
  • Person identification
  • Privacy (addressing challenges and developing solutions)
  • Summarization of long-form egocentric video
  • Visual question answering (VQA) in egocentric contexts
  • Robotic applications (using egocentric perception for robot control and interaction)

Hand and Hand-Object Interaction

Understanding hand use and hand-object interactions is central to egocentric vision.

  • Understanding Human Hands in Contact at Internet Scale (Shan et al., 2020):

    • Goal: Extract hand state and contact information from internet videos.
    • Dataset: 100DOH, with 131 days of footage and 100,000 annotated hand-contact frames drawn from 27,300 videos across 12 categories. Annotations include hand and object bounding boxes, left/right hand labels, and contact states (no contact, self contact, contact with another person, portable object, non-portable object).
    • Technical Approach: Uses Faster R-CNN to predict, for each detected hand, a bounding box, left/right label, and contact state (see the sketch after this entry).
    • Results: Achieved 90% hand detection accuracy; demonstrated automated 3D hand mesh reconstruction and grasp pose prediction for uncontacted objects.
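
As a rough sketch of this kind of per-hand prediction head (not the authors' code; the feature dimension and layer sizes are assumptions), the following PyTorch snippet attaches a box regressor, a left/right classifier, and a five-way contact-state classifier to pooled RoI features:

```python
import torch
import torch.nn as nn

class HandStateHead(nn.Module):
    """Per-hand prediction head: box refinement, left/right side,
    and a 5-way contact state (none, self, other person,
    portable object, non-portable object). Illustrative sketch only."""

    def __init__(self, feat_dim: int = 1024, num_contact_states: int = 5):
        super().__init__()
        self.box_reg = nn.Linear(feat_dim, 4)             # (dx, dy, dw, dh)
        self.side_cls = nn.Linear(feat_dim, 2)            # left / right
        self.contact_cls = nn.Linear(feat_dim, num_contact_states)

    def forward(self, roi_feats: torch.Tensor):
        # roi_feats: (num_hand_rois, feat_dim) pooled features from a
        # Faster R-CNN-style detector backbone.
        return {
            "box_deltas": self.box_reg(roi_feats),
            "side_logits": self.side_cls(roi_feats),
            "contact_logits": self.contact_cls(roi_feats),
        }

# Example: 3 detected hand regions with 1024-d pooled features.
head = HandStateHead()
out = head(torch.randn(3, 1024))
print(out["contact_logits"].shape)  # torch.Size([3, 5])
```
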
  • DexYCB Dataset:

    • Focuses on object manipulation and interaction, relevant for both human and robotic contexts.
    • Provides data for:
      • 2D object and keypoint detection
      • 6D object pose estimation (3D translation and 3D rotation)
      • 3D hand pose estimation
    • Offers larger scale and higher annotation quality than alternatives such as HO-3D, with manual annotations and improved robustness from multiple camera views.
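
To make the 6D pose task concrete: a 6D pose is a 3D rotation plus a 3D translation, and a standard check is to transform the object's model points by the pose and project them with the camera intrinsics. The NumPy sketch below uses made-up intrinsics and a toy cube, not DexYCB's actual data format:

```python
import numpy as np

def project_points(model_pts, R, t, K):
    """Transform 3D model points by a 6D pose (R, t) and project
    them to 2D pixels with intrinsics K. model_pts: (N, 3)."""
    pts_cam = model_pts @ R.T + t          # object frame -> camera frame
    uvw = pts_cam @ K.T                    # pinhole projection
    return uvw[:, :2] / uvw[:, 2:3]        # divide by depth -> pixels

# Toy example with made-up values.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(30.0)                   # rotate 30 degrees about z
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.05, -0.02, 0.60])          # metres in front of the camera
cube = np.array([[x, y, z] for x in (-0.03, 0.03)
                           for y in (-0.03, 0.03)
                           for z in (-0.03, 0.03)])
print(project_points(cube, R, t, K))       # 8 projected corner pixels
```
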
  • HaMeR (Pavlakos et al., 2024):

    • State-of-the-art model for 3D hand mesh reconstruction from monocular images.
    • Based on a Vision Transformer (ViT) architecture that predicts the parameters of the MANO hand model.
    • End-to-end: takes monocular images and left/right hand information as input, outputs a 3D hand mesh.
    • Tracking is robust in practice, with failures mainly in extreme hand poses.
    • Introduces the HInt dataset with annotated 2D keypoints from multiple sources.
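
The general recipe described above (image features in, MANO parameters out) can be sketched as follows; this is an illustrative stand-in, not HaMeR's implementation, and the feature dimension, head layout, and parameter split are assumptions:

```python
import torch
import torch.nn as nn

class ManoRegressor(nn.Module):
    """Regress MANO hand-model parameters from a single image feature
    vector: 48 pose parameters (16 joints x 3 axis-angle values),
    10 shape coefficients, and a weak-perspective camera (s, tx, ty)."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.GELU(),
            nn.Linear(512, 48 + 10 + 3),
        )

    def forward(self, img_feat: torch.Tensor):
        out = self.mlp(img_feat)
        pose, shape, cam = out[:, :48], out[:, 48:58], out[:, 58:]
        # pose/shape would be fed to a MANO layer (e.g. the manopth
        # package) to obtain the 3D hand mesh; omitted here.
        return pose, shape, cam

# Example: pooled features for a batch of 2 cropped hand images
# (left hands flipped to right), e.g. from a ViT backbone.
pose, shape, cam = ManoRegressor()(torch.randn(2, 768))
print(pose.shape, shape.shape, cam.shape)  # (2, 48) (2, 10) (2, 3)
```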

Action Recognition and Anticipation

Predicting future actions and intentions from egocentric video is a major challenge.

  • Challenge: Modeling human intentions requires understanding past actions and context from long-form egocentric videos.

  • Summarize the Past to Predict the Future (Pasca et al., CVPR 2024):

    • Goal: Improve short-term object interaction anticipation using natural language descriptions of past context.
    • Task: Predict the object (noun), action (verb), and time to contact (TTC) for future interactions.
    • Previous Approaches: Ego4D Faster R-CNN + SlowFast baseline, AFF-tention (Mur-Labadia et al., ECCV 2024).
    • Method:
      • Language Context Extraction (see the sketch after this entry):
        • Object detection and image captioning are applied to past frames.
        • A VQA model with prompts (e.g., “What does the image describe?”, “What is the person in this image doing?”) extracts verbose descriptions.
        • Part-of-speech tagging, lemmatization, and majority voting aggregate the verbs and nouns; filtering removes out-of-domain verb-noun pairs.
        • CLIP identifies salient objects for context extraction without extra training.
      • TransFusion Model: Multimodal fusion model that combines language context and visual features to predict future action (verb, noun) and TTC.
    • Results: Outperforms vision-only baselines on the Ego4D Short-Term Object Interaction Anticipation dataset (v1 and v2), for both frequent and infrequent noun and verb classes.
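
The verb/noun aggregation step referenced above can be sketched with spaCy for POS tagging and lemmatization plus simple majority voting; the captions and the in-domain vocabulary below are invented for illustration:

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # POS tagging + lemmatization

# Example per-frame captions / VQA answers over the past context window.
captions = [
    "A person is cutting an onion on a wooden board.",
    "Someone cuts onions with a knife in the kitchen.",
    "The person is chopping an onion.",
]

# Hypothetical in-domain vocabulary used to filter out-of-domain pairs.
DOMAIN_VERBS = {"cut", "chop", "wash", "stir"}
DOMAIN_NOUNS = {"onion", "knife", "board", "pan"}

verbs, nouns = Counter(), Counter()
for doc in nlp.pipe(captions):
    for tok in doc:
        if tok.pos_ == "VERB" and tok.lemma_ in DOMAIN_VERBS:
            verbs[tok.lemma_] += 1
        elif tok.pos_ == "NOUN" and tok.lemma_ in DOMAIN_NOUNS:
            nouns[tok.lemma_] += 1

# Majority vote -> a compact language context like "cut onion".
context = f"{verbs.most_common(1)[0][0]} {nouns.most_common(1)[0][0]}"
print(context)  # e.g. "cut onion"
```
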
  • PALM: Predicting Actions through Language Models (Kim et al., ECCV 2024):

    • Idea: Leverage procedural knowledge in large language models (LLMs) for action anticipation.
    • Approach:
      1. Prompt Generation: Image captioning and action recognition models process egocentric frames to generate a visual context description and a sequence of past actions (verb, noun pairs).
      2. Action Anticipation: An LLM is prompted with the visual context, past actions, and a template instruction to predict a sequence of future actions (verb, noun pairs). In-context examples are included in the prompt and selected with Maximal Marginal Relevance (MMR) so that they are relevant to the query yet mutually diverse (see the sketch after this entry).
    • Comparison: Achieves a lower edit distance between predicted and ground-truth future action sequences than a SlowFast baseline, indicating better sequence prediction.
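
Maximal Marginal Relevance, used for exemplar selection above, scores each candidate by its similarity to the query minus its redundancy with already-selected examples. A minimal NumPy sketch of that criterion (embeddings and the lambda value are placeholders):

```python
import numpy as np

def mmr_select(query_emb, cand_embs, k=4, lam=0.7):
    """Select k exemplar indices by Maximal Marginal Relevance:
    argmax lam * sim(query, c) - (1 - lam) * max_{s in selected} sim(c, s)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    rel = np.array([cos(query_emb, c) for c in cand_embs])   # relevance
    selected = []
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(len(cand_embs)):
            if i in selected:
                continue
            red = max((cos(cand_embs[i], cand_embs[j]) for j in selected),
                      default=0.0)                            # redundancy
            score = lam * rel[i] - (1 - lam) * red
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Toy usage: pick 2 of 5 random "past action sequence" embeddings.
rng = np.random.default_rng(0)
print(mmr_select(rng.normal(size=16), rng.normal(size=(5, 16)), k=2))
```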

Gaze Understanding and Prediction

Estimating the camera wearer’s gaze is a fundamental egocentric vision task.

  • Egocentric Gaze Estimation (Lai et al., BMVC 2022):

    • Problem: Estimate the visual attention of the camera wearer from egocentric videos.
    • Previous Challenge: Eye trackers required calibration, increasing cost and complexity.
    • Approach: Introduces the first transformer-based model for egocentric gaze estimation, incorporating global context via a global-local correlation module. Argues that local token correlations are insufficient; global context is necessary.
    • Task: Predict a probability map indicating gaze location.
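
As a simplified stand-in for the global-local correlation idea (not the paper's module), one can correlate each local patch token with a globally pooled token and normalise the correlations into a spatial probability map:

```python
import torch
import torch.nn.functional as F

def gaze_heatmap(patch_tokens: torch.Tensor) -> torch.Tensor:
    """patch_tokens: (B, H*W, D) visual tokens from a video/image encoder.
    Returns a (B, H*W) probability map over patch locations."""
    global_tok = patch_tokens.mean(dim=1, keepdim=True)          # (B, 1, D)
    # Cosine correlation between the global token and every local token.
    corr = F.cosine_similarity(patch_tokens,
                               global_tok.expand_as(patch_tokens),
                               dim=-1)                           # (B, H*W)
    return corr.softmax(dim=-1)                                  # sums to 1

# Example: batch of 2 frames encoded into a 14x14 grid of 256-d tokens.
heat = gaze_heatmap(torch.randn(2, 14 * 14, 256))
print(heat.shape, heat.sum(dim=-1))  # torch.Size([2, 196]), ~tensor([1., 1.])
```
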
  • Leveraging Driver Field-of-View for Multimodal Ego-Trajectory Prediction (Akbiyik et al., ICLR 2025):

    • Goal: Improve ego-trajectory prediction (predicting the future path of the camera wearer) by incorporating driver field-of-view (gaze) data alongside scene videos.
    • Dataset: Gaze-assisted Ego Motion (GEM) Dataset.
    • Framework:
      • Inputs: past trajectory \(M_{1:T}\), scene videos \(S_{1:T}\), and FOV data (\(V_{F_{1:T}}\) for visual features, \(G_{F_{1:T}}\) for gaze data) over 8 seconds.
      • Video encoders \(E_S\) process scene videos.
      • A cross-modal transformer integrates visual features and FOV data.
      • Modal embeddings are stacked and processed by a multimodal encoder \(E_M\) with self-attention.
      • A time-series transformer predicts the future trajectory \(M'_{T+1:T_{pred}}\) (6 seconds) and future FOV \(F_{T+1:T_{pred}}\).
      • Note: During training, no gradient flows from the trajectory prediction back into the multimodal encoder; the encoder receives gradients only from the predicted future FOV.
      • Losses: visual loss \(\mathcal{L}_V\) for future FOV and trajectory loss \(\mathcal{L}_T\) for future trajectory.
    • Path Complexity Index (PCI): Measures the complexity (curvature) of a path relative to a simple baseline path \(\mathcal{T}_{\text{simple}}\) derived from the input trajectory, by aggregating the pointwise deviation \[ PCI(\mathcal{T}_{\text{target}} \,\|\, \mathcal{T}_{\text{input}}) = \sum_{t} \bigl\lVert \mathcal{T}_{\text{target}}(t) - \mathcal{T}_{\text{simple}}(t) \bigr\rVert \]
      [Verification needed: how \(\mathcal{T}_{\text{simple}}(t)\) is constructed and how the deviation is aggregated should be checked against the paper.]
    • Results: Incorporating gaze data improves ego-trajectory prediction performance.
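
Because the PCI definition above is flagged for verification, the following is only a guessed concretisation: it assumes the simple baseline is a straight line between the trajectory's endpoints and sums the pointwise deviation from it.

```python
import numpy as np

def path_complexity_index(traj: np.ndarray) -> float:
    """traj: (T, 2) sequence of x/y positions.
    Assumption: the 'simple' baseline is the straight line between the
    first and last point, sampled at the same T steps; PCI is the summed
    Euclidean deviation from that line (a mean would work equally well)."""
    T = len(traj)
    alphas = np.linspace(0.0, 1.0, T)[:, None]
    simple = (1.0 - alphas) * traj[0] + alphas * traj[-1]   # straight line
    return float(np.linalg.norm(traj - simple, axis=1).sum())

# A gentle arc has a higher PCI than a straight path.
t = np.linspace(0.0, 1.0, 50)
straight = np.stack([t, np.zeros_like(t)], axis=1)
arc = np.stack([t, 0.2 * np.sin(np.pi * t)], axis=1)
print(path_complexity_index(straight), path_complexity_index(arc))
```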

3D Scene Understanding

Interpreting the 3D environment from egocentric video is challenging due to camera motion and a narrow field of view.

  • Challenge: Camera motion and limited field of view make it difficult to understand spatial layout and object relationships. 3D interpretation helps contextualize actions and compensates for camera movement.

  • EPIC Fields: Marrying 3D Geometry and Video Understanding (Tschernezki et al., NeurIPS 2023):

    • Concept: Proposes the Lift, Match, and Keep (LMK) approach for dynamic 3D scene understanding from egocentric video.
    • LMK Approach:
      • Lift: Convert 2D observations from video frames to 3D world coordinates.
      • Match: Connect observations of the same object over time using appearance and estimated 3D location.
      • Keep: Maintain object tracks even when temporarily out of view.
    • Dataset: Evaluated on a dataset derived from EPIC-KITCHENS, with 100 long videos (25 hours) across 45 kitchens and 2,939 objects.
    • Performance: Achieved object tracking accuracy of 64% after 1 minute, 48% after 5 minutes, and 37% after 10 minutes, outperforming baselines.
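
The Lift and Match steps boil down to back-projecting 2D detections into world coordinates and associating them by 3D distance. A rough NumPy sketch under standard pinhole-camera assumptions (the actual LMK pipeline differs in detail):

```python
import numpy as np

def lift_to_world(u, v, depth, K, R_wc, t_wc):
    """Lift: back-project pixel (u, v) with metric depth into world
    coordinates. K: 3x3 intrinsics; (R_wc, t_wc): camera-to-world pose."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # camera-frame ray
    p_cam = depth * ray                              # point in camera frame
    return R_wc @ p_cam + t_wc                       # point in world frame

def match(p_world, tracks, thresh=0.25):
    """Match: associate a lifted point with the nearest existing object
    track if it is within `thresh` metres, else start a new track."""
    if tracks:
        dists = [np.linalg.norm(p_world - p) for p in tracks]
        i = int(np.argmin(dists))
        if dists[i] < thresh:
            return i
    tracks.append(p_world)          # Keep: tracks persist even when unseen
    return len(tracks) - 1

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
tracks = []
p = lift_to_world(350, 200, 1.2, K, np.eye(3), np.zeros(3))
print(match(p, tracks), match(p + 0.05, tracks))  # same track twice: 0 0
```
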
  • EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting (Zhang et al., 3DV 2025):

    • Explores the use of 3D Gaussian Splatting for representing and understanding dynamic scenes from egocentric video.

Robotic Applications

Egocentric vision is highly relevant for robot perception and control, especially for dexterous manipulation.

  • Kitten Carousel (Held & Hein, 1963): Classic experiment showing the importance of self-motion and visual feedback for perceptual development. Kittens with control over their movement developed normal depth perception, while passive kittens did not, highlighting the value of embodied vision.

  • Visual Encoder for Dexterous Manipulation:

    • Idea: Learn visual features for dexterous manipulation directly from egocentric videos.
    • Motivation: Robot policy learning typically involves direct environment interaction; egocentric videos provide rich human manipulation data.
    • Applications: Robotic control, interaction prediction, motion synthesis.
  • MAPLE: Encoding Dexterous Robotic Manipulation Priors Learned From Egocentric Videos (Heid et al., CVPR 2025):

    • Goal: Learn manipulation priors from large-scale egocentric video pretraining to improve dexterous robotic manipulation.
    • Approach:
      • Large-scale Egocentric Pretraining: Train a visual encoder (e.g., ViT-B/16) on egocentric videos to learn hand pose and contact point representations.
      • Label Extraction Pipeline: [Details needed; likely involves hand tracking, 3D hand pose estimation, and contact point identification.]
      • MAPLE Model: Takes prediction frames from egocentric video and outputs prior-infused features. A Transformer decoder predicts the manipulation outputs: contact points (2D coordinates) and a hand-pose token (an index into a quantized hand-pose codebook); see the sketch after this entry.
      • Dexterous Simulation Environments: Policies are trained in simulated dexterous-manipulation environments using the prior-infused features.
      • Real-World Application: Learned priors are transferred to real robots for dexterous tasks.
    • DiT Policy: A diffusion-transformer-based policy that takes the prior-infused features as input and predicts action chunks (e.g., end-effector pose in \(SE(3)\), hand joints in \(\mathbb{R}^{21}\)).
    • Evaluation: Evaluated in simulation and real-world on tasks like “Grab the plush animal,” “Place the pan,” and “Paint the canvas.” Showed improved success rates over baselines; failure cases included not grasping or releasing objects.
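
A loose sketch of the decoder head described above (contact points as 2D coordinates, hand pose as an index into a quantized codebook); the dimensions, codebook size, and query construction are assumptions rather than MAPLE's actual design:

```python
import torch
import torch.nn as nn

class ManipulationPriorHead(nn.Module):
    """Transformer-decoder head over frame features that predicts
    K 2D contact points and logits over a quantized hand-pose codebook."""

    def __init__(self, feat_dim=768, num_contacts=2, codebook_size=512):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_contacts + 1, feat_dim))
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.contact_xy = nn.Linear(feat_dim, 2)        # normalised (x, y)
        self.pose_token = nn.Linear(feat_dim, codebook_size)

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (B, N, feat_dim) patch/frame tokens from the encoder.
        B = frame_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        dec = self.decoder(tgt=q, memory=frame_feats)
        contacts = self.contact_xy(dec[:, :-1]).sigmoid()   # (B, K, 2)
        pose_logits = self.pose_token(dec[:, -1])           # (B, codebook)
        return contacts, pose_logits

contacts, pose_logits = ManipulationPriorHead()(torch.randn(2, 196, 768))
print(contacts.shape, pose_logits.shape)  # (2, 2, 2) (2, 512)
```
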
  • GEM: Egocentric World Foundation Model (Hassan et al., CVPR 2025):

    • Goal: Develop a generalizable ego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control.
    • Comparison with Cosmos (Nvidia’s World Foundation Model):
      • Data: GEM uses 4K hours vs. 20M hours for Cosmos.
      • Compute: 0.5K GPUs for 3 days vs. 10K GPUs for 90 days.
      • Scale: 2B parameters vs. 14B.
      • Inference Time: 3 minutes vs. 10 minutes.
      • Extra Capabilities: Multimodal outputs (RGB and depth), fine-grained motion control, agent/object insertion, open-source dataset/code/models.
    • Architecture: [Details needed; likely diffusion-based generative modeling for future frames and states.]
    • Summary: GEM is more efficient and versatile than larger models like Cosmos, excelling in multimodal outputs and fine-grained control.

Summary

  • Egocentric vision is a critical and rapidly evolving area in computer vision.
  • Major challenges remain in understanding intentions, dynamic scenes, and privacy.
  • Large, diverse datasets (e.g., EPIC-KITCHENS, Ego4D) are driving progress.
  • There are many opportunities for impactful research in this field.