Key Highlights
- Spatial perception improvement (7B vs. Qwen2.5-VL-7B baseline)
- Task planning improvement (7B vs. Qwen2.5-VL-7B baseline)
- VLM-PlanSim-99: our novel simulation benchmark
Main Features
Agent-Aligned Data
A data structure specially designed for embodied AI, organizing each sample into a response, planning steps, and executable action sequences to ensure efficient task execution.
Two-Stage Training
A comprehensive training pipeline combining cold-start SFT based on multimodal rejection sampling with Step-Augmented GRPO for reinforcement learning.
Superior Performance
Achieves state-of-the-art results across multiple benchmarks including spatial perception, task planning, and end-to-end simulation.
Fully Open Source
Open-sourcing all data, model weights, and evaluation methods, including the novel VLM-PlanSim-99 simulation benchmark.
Model Architecture
EmbodiedBrain 1.0 adopts a modular encoder-decoder architecture built upon the Qwen2.5-VL framework, unifying perception, reasoning, and action planning for complex embodied tasks.
The model processes multimodal inputs (images at dynamic resolution, long video sequences, and complex language instructions) and generates structured outputs: natural language responses, step-by-step plans, and executable action sequences.
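If the released checkpoint follows the standard Qwen2.5-VL loading path in Hugging Face transformers (an assumption, since EmbodiedBrain builds on that framework), inference might look like the following sketch; the model path is a placeholder, not a confirmed repository name:

```python
# Minimal inference sketch, assuming the checkpoint is Qwen2.5-VL-compatible
# and loadable via transformers plus the qwen-vl-utils helper package.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "path/to/EmbodiedBrain-7B"  # hypothetical checkpoint location

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One image plus a planning instruction, in the Qwen2.5-VL chat format.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///tmp/kitchen.jpg"},
        {"type": "text", "text": "Plan the steps needed to toast a slice of bread."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate the structured output: response, plan, and action sequence.
generated = model.generate(**inputs, max_new_tokens=512)
trimmed = generated[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```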
Training Data
We constructed a diverse training dataset covering general multimodal capabilities, spatial reasoning, task planning, and video understanding, with data quality ensured through a multimodal rejection sampling strategy.
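As a rough illustration of the rejection-sampling pattern (the paper defines the actual multimodal judge; `generate_candidates` and `judge_score` below are hypothetical stand-ins):

```python
# Generic rejection-sampling filter for SFT data: sample several candidate
# responses per prompt, score each with a quality judge, keep only those
# that clear a threshold. Illustrative sketch only.
from typing import Callable, List

def rejection_sample(
    prompt: dict,                       # multimodal prompt (images + text)
    generate_candidates: Callable[[dict, int], List[str]],
    judge_score: Callable[[dict, str], float],
    num_samples: int = 8,
    threshold: float = 0.9,
) -> List[dict]:
    """Keep only candidate responses whose judge score clears the threshold."""
    kept = []
    for response in generate_candidates(prompt, num_samples):
        score = judge_score(prompt, response)
        if score >= threshold:
            kept.append({"prompt": prompt, "response": response, "score": score})
    # If no candidate clears the bar, the prompt is dropped from the SFT pool.
    return kept
```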
Agent-Aligned Data Format Design
We designed a structured data format that closely aligns with the operational needs of embodied agents, organizing information into well-defined components: user query, model-generated response, explicit structured plan, and executable action sequences.
Example of our agent-aligned data format, featuring structured response, planning steps, and executable actions
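For illustration, a record in this spirit might look as follows; the field names and values are hypothetical, not the released schema:

```python
# Hypothetical agent-aligned training record with the four components named
# above: user query, response, structured plan, and executable actions.
record = {
    "query": "Toast a slice of bread.",
    "response": "I will find the bread, slice it, and toast one slice.",
    "plan": [
        "Locate and navigate to the bread",
        "Slice the bread with a knife",
        "Toast one slice in the toaster",
    ],
    "actions": [
        {"action": "navigate",  "target": "Bread"},
        {"action": "pickup",    "target": "Bread"},
        {"action": "slice",     "target": "Bread", "tool": "Knife"},
        {"action": "put",       "target": "BreadSliced", "receptacle": "Toaster"},
        {"action": "toggle_on", "target": "Toaster"},
    ],
}
```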
Training Strategy
Our training pipeline consists of two complementary stages. Stage 1 performs cold-start SFT with multimodal rejection sampling to establish strong foundational capabilities. Stage 2 applies Step-Augmented GRPO, an innovative reinforcement learning approach that provides step-level hints of varying lengths to stabilize training and improve reward convergence on long-horizon task planning.
Stage 1: SFT Data Composition
Charts: overall SFT data distribution; planning data distribution by action type.
Stage 2: Step-Augmented GRPO
The two-stage training methodology ensures both strong general capabilities and superior task planning performance.
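The core idea of step augmentation can be sketched as follows; the prompt wording and the mapping of hint lengths onto GRPO rollout groups are illustrative assumptions, not the exact recipe:

```python
# Sketch of step augmentation: prompts are augmented with the first k
# ground-truth plan steps as a hint, with k varied so that long-horizon
# tasks still yield informative rewards early in RL training.
from typing import List

def step_augmented_prompt(task: str, gold_steps: List[str], k: int) -> str:
    """Prompt with the first k gold steps revealed; k = 0 gives the plain task."""
    if k == 0:
        return f"Task: {task}\nPlan all steps."
    hint = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(gold_steps[:k]))
    return f"Task: {task}\nSteps already completed:\n{hint}\nPlan the remaining steps."

def augmented_prompts(task: str, gold_steps: List[str]) -> List[str]:
    # One prompt per hint length; each could seed its own GRPO rollout group,
    # so rewards within a group remain directly comparable.
    return [step_augmented_prompt(task, gold_steps, k) for k in range(len(gold_steps))]
```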
Evaluation Results
We evaluated EmbodiedBrain on 14 comprehensive multimodal benchmarks across three categories: general capabilities, spatial perception, and embodied task planning. EmbodiedBrain demonstrates strong performance across all categories, particularly excelling in spatial perception and task planning.
Performance Comparison
| Benchmark | Qwen2.5-VL-7B | RoboBrain 2.0-7B | EmbodiedBrain-7B | Qwen2.5-VL-32B | RoboBrain 2.0-32B | EmbodiedBrain-32B |
|---|---|---|---|---|---|---|
| **General Ability** | | | | | | |
| MM-IFEval | 39.56 | 30.82 | 43.61 | 46.66 | 39.75 | 46.98 |
| MMStar | 62.27 | 59.40 | 62.17 | 64.70 | 65.80 | 65.40 |
| MMMU | 51.33 | 48.67 | 52.67 | 60.00 | 60.89 | 60.44 |
| AI2D | 82.55 | 81.83 | 82.61 | 85.37 | 85.23 | 84.39 |
| OCRBench | 785 | 757 | 783 | 740 | 732 | 741 |
| **Spatial Perception** | | | | | | |
| BLINK | 58.74 | 62.94 | 88.11 | 73.43 | 68.53 | 87.41 |
| CV-Bench | 62.03 | 62.97 | 80.69 | 75.57 | 68.27 | 83.64 |
| EmbSpatial | 51.76 | 52.12 | 75.04 | 67.39 | 62.95 | 77.03 |
| ERQA | 41.00 | 42.50 | 41.75 | 44.61 | 45.11 | 43.50 |
| Average | 53.38 | 55.13 | 71.40 | 65.25 | 61.22 | 72.90 |
| **Task Planning** | | | | | | |
| EgoPlan-Bench | 41.30 | 36.73 | 49.10 | 51.11 | 46.83 | 54.66 |
| EgoPlan-Bench2 | 38.63 | 33.54 | 49.58 | 49.81 | 49.96 | 57.11 |
| EgoThink | 52.13 | 44.92 | 53.54 | 56.75 | 49.33 | 53.92 |
| Internal Planning (F1) | 30.0 | 68.3 | 85.8 | 28.3 | 75.9 | 90.5 |
| VLM-PlanSim-99 | 23.2 | 21.21 | 31.31 | 25.25 | 24.24 | 46.46 |
End-to-End Simulation Examples
Task execution examples on the EmbodiedReasoner platform, demonstrating the model's adaptive reasoning and sequential decision-making capabilities in partially observable environments.
VLM-PlanSim-99 Benchmark
We propose and open-source VLM-PlanSim-99, a high-quality task planning evaluation benchmark containing 99 carefully curated household task instances. Each task is manually annotated and validated in the AI2-THOR simulation environment to ensure executability.
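A minimal executability check of this kind can be sketched with the ai2thor Python package; the `(action, objectType)` plan format and the type-to-id resolution below are illustrative assumptions, and navigation between objects is omitted for brevity:

```python
# Executability check in AI2-THOR: step through a plan and fail fast on the
# first action the simulator rejects.
from typing import List, Optional, Tuple
from ai2thor.controller import Controller

def resolve_object_id(controller: Controller, object_type: str) -> Optional[str]:
    """Map an object type such as 'Toaster' to a concrete objectId in the scene."""
    for obj in controller.last_event.metadata["objects"]:
        if obj["objectType"] == object_type:
            return obj["objectId"]
    return None

def run_plan(plan: List[Tuple[str, str]], scene: str = "FloorPlan1") -> bool:
    """Execute (action, objectType) pairs; report the first failure, if any."""
    controller = Controller(scene=scene)
    try:
        for action, target in plan:
            event = controller.step(
                action=action, objectId=resolve_object_id(controller, target)
            )
            if not event.metadata["lastActionSuccess"]:
                print(f"{action}({target}) failed: {event.metadata['errorMessage']}")
                return False
        return True
    finally:
        controller.stop()
```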
Task Execution Examples
Toast a Slice of Bread
Execution Steps:
1. Discover and navigate to bread
2. Pick up bread
3. Discover and navigate to knife
4. Place bread on countertop
5. Pick up knife
6. Slice bread with knife
7. Place knife on countertop
8. Pick up sliced bread
9. Discover and navigate to toaster
10. Place bread slice in toaster
11. Turn on toaster
12. Turn off toaster
13. Take out bread slice
14. Place toasted bread on kitchen countertop
Wash the Bowl in Sink and Heat it in Microwave
Execution Steps:
1. Discover and navigate to bowl
2. Pick up bowl
3. Discover and navigate to sink
4. Place bowl in sink
5. Turn on water
6. Turn off water
7. Pick up bowl
8. Discover and navigate to microwave
9. Open microwave
10. Place bowl in microwave
11. Close microwave
12. Turn on microwave
13. Turn off microwave
14. Open microwave
15. Take out bowl
16. Close microwave
Cool a Tomato in Fridge and Place it in Sink
Execution Steps:
1. Discover and navigate to tomato
2. Pick up tomato
3. Discover and navigate to fridge
4. Open fridge
5. Place tomato in fridge
6. Close fridge
7. Open fridge
8. Take out tomato
9. Close fridge
10. Discover and navigate to sink
11. Place tomato in sink
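For instance, the manipulation portion of the "Toast a Slice of Bread" task above maps onto AI2-THOR action names roughly as follows; navigation steps are omitted, `run_plan` is the checker sketched earlier, and the mapping is illustrative:

```python
# Manipulation steps of "Toast a Slice of Bread" as AI2-THOR actions
# (comment numbers refer to the step list above).
toast_plan = [
    ("PickupObject",    "Bread"),        # 2. pick up bread
    ("PutObject",       "CounterTop"),   # 4. place bread on countertop
    ("PickupObject",    "Knife"),        # 5. pick up knife
    ("SliceObject",     "Bread"),        # 6. slice bread with knife
    ("PutObject",       "CounterTop"),   # 7. place knife on countertop
    ("PickupObject",    "BreadSliced"),  # 8. pick up sliced bread
    ("PutObject",       "Toaster"),      # 10. place bread slice in toaster
    ("ToggleObjectOn",  "Toaster"),      # 11. turn on toaster
    ("ToggleObjectOff", "Toaster"),      # 12. turn off toaster
    ("PickupObject",    "BreadSliced"),  # 13. take out bread slice
    ("PutObject",       "CounterTop"),   # 14. place toasted bread on countertop
]
print("executable:", run_plan(toast_plan))
```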
EgoPlan Benchmark Examples
EgoPlan Example 1
EgoPlan Example 2
Spatial Reasoning Examples
EmbodiedBrain excels in spatial reasoning tasks, accurately localizing objects, understanding relative positions, and recognizing hierarchical spatial relations.
BibTeX
```bibtex
@article{zou2025embodiedbrain,
  title={EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence},
  author={Zou, Ding and Wang, Feifan and Ge, Mengyu and Fan, Siyuan and Zhang, Zongbing and Chen, Wei and Wang, Lingfeng and Hu, Zhongyou and Yan, Wenrui and Gao, Zhengwei and Wang, Hao and Jin, Weizhao and Zhang, Yu and Zhao, Hainan and Zhang, Mingliang and Xi, Xianxian and Zhang, Yaru and Li, Wenyuan and Gao, Zhengguang and Zhu, Yurui},
  journal={arXiv preprint arXiv:2510.20578},
  year={2025}
}
```