Key Highlights
- Spatial perception improvement (7B vs. Qwen2.5-VL-7B baseline)
- Task planning improvement (7B vs. Qwen2.5-VL-7B baseline)
- VLM-PlanSim-99: our novel simulation benchmark
Main Features
Agent-Aligned Data
A data structure specially designed for embodied AI, organizing each sample into a response, planning steps, and executable action sequences to ensure efficient task execution.
Two-Stage Training
A comprehensive training pipeline combining cold-start SFT based on multimodal rejection sampling with Step-Augmented GRPO for reinforcement learning.
Superior Performance
Achieves state-of-the-art results across multiple benchmarks including spatial perception, task planning, and end-to-end simulation.
Fully Open Source
Open-sourcing all data, model weights, and evaluation methods, including the novel VLM-PlanSim-99 simulation benchmark.
Model Architecture
EmbodiedBrain 1.0 adopts a modular encoder-decoder architecture built upon the Qwen2.5-VL framework, unifying perception, reasoning, and action planning for complex embodied tasks.
The model processes multimodal inputs (images at dynamic resolution, long video sequences, and complex language instructions) and generates structured outputs: natural language responses, step-by-step plans, and executable action sequences.
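If the released checkpoint follows the standard Qwen2.5-VL loading path in Hugging Face transformers (an assumption, since EmbodiedBrain builds on that framework), inference might look like the following sketch; the model path is a placeholder, not a confirmed repository name:

```python
# Minimal inference sketch, assuming the checkpoint is Qwen2.5-VL-compatible
# and loadable via transformers plus the qwen-vl-utils helper package.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "path/to/EmbodiedBrain-7B"  # hypothetical checkpoint location

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One image plus a planning instruction, in the Qwen2.5-VL chat format.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///tmp/kitchen.jpg"},
        {"type": "text", "text": "Plan the steps needed to toast a slice of bread."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate the structured output: response, plan, and action sequence.
generated = model.generate(**inputs, max_new_tokens=512)
trimmed = generated[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```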
Training Data
We constructed a diverse training dataset covering general multimodal capabilities, spatial reasoning, task planning, and video understanding, with data quality ensured through a multimodal rejection sampling strategy.
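As a rough illustration of the rejection-sampling pattern (the paper defines the actual multimodal judge; `generate_candidates` and `judge_score` below are hypothetical stand-ins):

```python
# Generic rejection-sampling filter for SFT data: sample several candidate
# responses per prompt, score each with a quality judge, keep only those
# that clear a threshold. Illustrative sketch only.
from typing import Callable, List

def rejection_sample(
    prompt: dict,                       # multimodal prompt (images + text)
    generate_candidates: Callable[[dict, int], List[str]],
    judge_score: Callable[[dict, str], float],
    num_samples: int = 8,
    threshold: float = 0.9,
) -> List[dict]:
    """Keep only candidate responses whose judge score clears the threshold."""
    kept = []
    for response in generate_candidates(prompt, num_samples):
        score = judge_score(prompt, response)
        if score >= threshold:
            kept.append({"prompt": prompt, "response": response, "score": score})
    # If no candidate clears the bar, the prompt is dropped from the SFT pool.
    return kept
```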
Agent-Aligned Data Format Design
We designed a structured data format that closely aligns with the operational needs of embodied agents, organizing information into well-defined components: user query, model-generated response, explicit structured plan, and executable action sequences.
Example of our agent-aligned data format, featuring structured response, planning steps, and executable actions
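For illustration, a record in this spirit might look as follows; the field names and values are hypothetical, not the released schema:

```python
# Hypothetical agent-aligned training record with the four components named
# above: user query, response, structured plan, and executable actions.
record = {
    "query": "Toast a slice of bread.",
    "response": "I will find the bread, slice it, and toast one slice.",
    "plan": [
        "Locate and navigate to the bread",
        "Slice the bread with a knife",
        "Toast one slice in the toaster",
    ],
    "actions": [
        {"action": "navigate",  "target": "Bread"},
        {"action": "pickup",    "target": "Bread"},
        {"action": "slice",     "target": "Bread", "tool": "Knife"},
        {"action": "put",       "target": "BreadSliced", "receptacle": "Toaster"},
        {"action": "toggle_on", "target": "Toaster"},
    ],
}
```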
Training Strategy
Our training pipeline consists of two complementary stages. Stage 1 performs cold-start SFT with multimodal rejection sampling to establish strong foundational capabilities. Stage 2 applies Step-Augmented GRPO, an innovative reinforcement learning approach that provides step-level hints of varying lengths to stabilize training and improve reward convergence on long-horizon task planning.
Stage 1: SFT Data Composition
Charts: overall SFT data distribution; planning data distribution by action type.
Stage 2: Step-Augmented GRPO
The two-stage training methodology ensures both strong general capabilities and superior task planning performance.
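The core idea of step augmentation can be sketched as follows; the prompt wording and the mapping of hint lengths onto GRPO rollout groups are illustrative assumptions, not the exact recipe:

```python
# Sketch of step augmentation: prompts are augmented with the first k
# ground-truth plan steps as a hint, with k varied so that long-horizon
# tasks still yield informative rewards early in RL training.
from typing import List

def step_augmented_prompt(task: str, gold_steps: List[str], k: int) -> str:
    """Prompt with the first k gold steps revealed; k = 0 gives the plain task."""
    if k == 0:
        return f"Task: {task}\nPlan all steps."
    hint = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(gold_steps[:k]))
    return f"Task: {task}\nSteps already completed:\n{hint}\nPlan the remaining steps."

def augmented_prompts(task: str, gold_steps: List[str]) -> List[str]:
    # One prompt per hint length; each could seed its own GRPO rollout group,
    # so rewards within a group remain directly comparable.
    return [step_augmented_prompt(task, gold_steps, k) for k in range(len(gold_steps))]
```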
Evaluation Results
We evaluated EmbodiedBrain on 14 comprehensive multimodal benchmarks across three categories: general capabilities, spatial perception, and embodied task planning. EmbodiedBrain demonstrates strong performance across all categories, particularly excelling in spatial perception and task planning.
Performance Comparison
| Benchmark | Qwen2.5-VL-7B | RoboBrain 2.0-7B | EmbodiedBrain-7B | Qwen2.5-VL-32B | RoboBrain 2.0-32B | EmbodiedBrain-32B |
|---|---|---|---|---|---|---|
| **General Ability** | | | | | | |
| MM-IFEval | 39.56 | 30.82 | 43.61 | 46.66 | 39.75 | 46.98 |
| MMStar | 62.27 | 59.40 | 62.17 | 64.70 | 65.80 | 65.40 |
| MMMU | 51.33 | 48.67 | 52.67 | 60.00 | 60.89 | 60.44 |
| AI2D | 82.55 | 81.83 | 82.61 | 85.37 | 85.23 | 84.39 |
| OCRBench | 785 | 757 | 783 | 740 | 732 | 741 |
| **Spatial Perception** | | | | | | |
| BLINK | 58.74 | 62.94 | 88.11 | 73.43 | 68.53 | 87.41 |
| CV-Bench | 62.03 | 62.97 | 80.69 | 75.57 | 68.27 | 83.64 |
| EmbSpatial | 51.76 | 52.12 | 75.04 | 67.39 | 62.95 | 77.03 |
| ERQA | 41.00 | 42.50 | 41.75 | 44.61 | 45.11 | 43.50 |
| Average | 53.38 | 55.13 | 71.40 | 65.25 | 61.22 | 72.90 |
| **Task Planning** | | | | | | |
| EgoPlan-Bench | 41.30 | 36.73 | 49.10 | 51.11 | 46.83 | 54.66 |
| EgoPlan-Bench2 | 38.63 | 33.54 | 49.58 | 49.81 | 49.96 | 57.11 |
| EgoThink | 52.13 | 44.92 | 53.54 | 56.75 | 49.33 | 53.92 |
| Internal Planning (F1) | 30.0 | 68.3 | 85.8 | 28.3 | 75.9 | 90.5 |
| VLM-PlanSim-99 | 23.2 | 21.21 | 31.31 | 25.25 | 24.24 | 46.46 |
End-to-End Simulation Examples
Task execution examples on the EmbodiedReasoner platform, demonstrating the model's adaptive reasoning and sequential decision-making capabilities in partially observable environments.
VLM-PlanSim-99 Benchmark
We propose and open-source VLM-PlanSim-99, a high-quality task planning evaluation benchmark containing 99 carefully curated household task instances. Each task is manually annotated and validated in the AI2-THOR simulation environment to ensure executability.
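A minimal executability check of this kind can be sketched with the ai2thor Python package; the `(action, objectType)` plan format and the type-to-id resolution below are illustrative assumptions, and navigation between objects is omitted for brevity:

```python
# Executability check in AI2-THOR: step through a plan and fail fast on the
# first action the simulator rejects.
from typing import List, Optional, Tuple
from ai2thor.controller import Controller

def resolve_object_id(controller: Controller, object_type: str) -> Optional[str]:
    """Map an object type such as 'Toaster' to a concrete objectId in the scene."""
    for obj in controller.last_event.metadata["objects"]:
        if obj["objectType"] == object_type:
            return obj["objectId"]
    return None

def run_plan(plan: List[Tuple[str, str]], scene: str = "FloorPlan1") -> bool:
    """Execute (action, objectType) pairs; report the first failure, if any."""
    controller = Controller(scene=scene)
    try:
        for action, target in plan:
            event = controller.step(
                action=action, objectId=resolve_object_id(controller, target)
            )
            if not event.metadata["lastActionSuccess"]:
                print(f"{action}({target}) failed: {event.metadata['errorMessage']}")
                return False
        return True
    finally:
        controller.stop()
```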
Task Execution Examples
Toast a Slice of Bread
Execution Steps:
1. Discover and navigate to bread
2. Pick up bread
3. Discover and navigate to knife
4. Place bread on countertop
5. Pick up knife
6. Slice bread with knife
7. Place knife on countertop
8. Pick up sliced bread
9. Discover and navigate to toaster
10. Place bread slice in toaster
11. Turn on toaster
12. Turn off toaster
13. Take out bread slice
14. Place toasted bread on kitchen countertop
Wash the Bowl in Sink and Heat it in Microwave
Execution Steps:
1. Discover and navigate to bowl
2. Pick up bowl
3. Discover and navigate to sink
4. Place bowl in sink
5. Turn on water
6. Turn off water
7. Pick up bowl
8. Discover and navigate to microwave
9. Open microwave
10. Place bowl in microwave
11. Close microwave
12. Turn on microwave
13. Turn off microwave
14. Open microwave
15. Take out bowl
16. Close microwave
Cool a Tomato in Fridge and Place it in Sink
Execution Steps:
1. Discover and navigate to tomato
2. Pick up tomato
3. Discover and navigate to fridge
4. Open fridge
5. Place tomato in fridge
6. Close fridge
7. Open fridge
8. Take out tomato
9. Close fridge
10. Discover and navigate to sink
11. Place tomato in sink
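For instance, the manipulation portion of the "Toast a Slice of Bread" task above maps onto AI2-THOR action names roughly as follows; navigation steps are omitted, `run_plan` is the checker sketched earlier, and the mapping is illustrative:

```python
# Manipulation steps of "Toast a Slice of Bread" as AI2-THOR actions
# (comment numbers refer to the step list above).
toast_plan = [
    ("PickupObject",    "Bread"),        # 2. pick up bread
    ("PutObject",       "CounterTop"),   # 4. place bread on countertop
    ("PickupObject",    "Knife"),        # 5. pick up knife
    ("SliceObject",     "Bread"),        # 6. slice bread with knife
    ("PutObject",       "CounterTop"),   # 7. place knife on countertop
    ("PickupObject",    "BreadSliced"),  # 8. pick up sliced bread
    ("PutObject",       "Toaster"),      # 10. place bread slice in toaster
    ("ToggleObjectOn",  "Toaster"),      # 11. turn on toaster
    ("ToggleObjectOff", "Toaster"),      # 12. turn off toaster
    ("PickupObject",    "BreadSliced"),  # 13. take out bread slice
    ("PutObject",       "CounterTop"),   # 14. place toasted bread on countertop
]
print("executable:", run_plan(toast_plan))
```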
EgoPlan Benchmark Examples
EgoPlan Example 1
EgoPlan Example 2
Spatial Reasoning Examples
EmbodiedBrain excels in spatial reasoning tasks, accurately localizing objects, understanding relative positions, and recognizing hierarchical spatial relations.
BibTeX
```bibtex
@article{zou2025embodiedbrain,
  title={EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence},
  author={Zou, Ding and Wang, Feifan and Ge, Mengyu and Fan, Siyuan and Zhang, Zongbing and Chen, Wei and Wang, Lingfeng and Hu, Zhongyou and Yan, Wenrui and Gao, Zhengwei and Wang, Hao and Jin, Weizhao and Zhang, Yu and Zhao, Hainan and Zhang, Mingliang and Xi, Xianxian and Zhang, Yaru and Li, Wenyuan and Gao, Zhengguang and Zhu, Yurui},
  journal={arXiv preprint arXiv:2510.20578},
  year={2025}
}
```