The Hong Kong University of Science and Technology (Guangzhou); South China Normal University
Vision–Language–Action (VLA) systems follow instructions to execute multi-step tasks in multimodal environments. Recent VLA approaches typically rely on post-hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate subgoal or action is mis-specified, and no flexible correction mechanism is available, local errors propagate through subsequent steps and accumulate into cascading failures in long-horizon reasoning. To mitigate this compounding effect, we propose the Reflective Contrastive Alignment and Planning Architecture (ReCAPA), a framework that uses predictive correction to anticipate deviations and adjust representations at three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all three levels by a Sinkhorn-based module and a Score-field module. The corrective signals derived from these predictive-correction and alignment mechanisms jointly update the execution network during training, enabling it to flexibly adjust fine-grained steps so that they remain aligned with the overall intent. We further introduce two new metrics that quantify error propagation and recovery in long-horizon tasks. Experiments show that ReCAPA achieves competitive results on embodied-agent benchmarks such as VisualAgentBench, MineDojo, and AI2-THOR, outperforming strong proprietary and open-source Large Language Model (LLM) baselines.
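The Sinkhorn-based alignment mentioned above can be illustrated with a minimal sketch of entropic optimal transport between action and subgoal embeddings. This is not the paper's implementation; the embedding shapes, cosine-distance cost, uniform marginals, and hyperparameters (`eps`, `n_iters`) are illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=100):
    """Entropic-OT transport plan between uniform marginals via Sinkhorn
    iterations on the Gibbs kernel K = exp(-cost / eps)."""
    n, m = cost.shape
    K = np.exp(-cost / eps)                 # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals (assumption)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)                     # scale rows toward marginal a
        v = b / (K.T @ u)                   # scale columns toward marginal b
    return np.diag(u) @ K @ np.diag(v)      # soft alignment / transport plan

# Hypothetical example: align 5 action embeddings with 3 subgoal embeddings.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 16))
goals = rng.normal(size=(3, 16))
cost = 1.0 - (acts @ goals.T) / (
    np.linalg.norm(acts, axis=1, keepdims=True)
    * np.linalg.norm(goals, axis=1))        # cosine distance as cost
P = sinkhorn(cost)
# Column marginals match the target distribution b after the final v-update.
print(np.allclose(P.sum(axis=0), 1 / 3))
```

The transport plan `P` gives a soft many-to-one correspondence from fine-grained actions to subgoals, which is one natural way such an alignment signal could be turned into a training loss.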
Performance on AI2-THOR across models and metrics. AI2-THOR is assessed via Success Rate (SR), Transport Rate (TR), Coverage, and Balance; Coverage measures the rate of successful interactions, while Balance captures how evenly contributions are distributed across subtasks.
| Model | SR | TR | Coverage | Balance |
|---|---|---|---|---|
| **Single-LM/Agent Baselines** | | | | |
| ReAct | 0.34 | 0.72 | 0.92 | 0.67 |
| CoT | 0.14 | 0.59 | 0.87 | 0.62 |
| SmartLLM | 0.11 | 0.23 | 0.91 | 0.45 |
| CoELA | 0.25 | 0.46 | 0.76 | 0.73 |
| **Multi-Modal/LLM-Enhanced Baselines** | | | | |
| GPT-4o | 0.51 | 0.85 | 0.95 | 0.83 |
| LLaVA | 0.54 | 0.84 | 0.91 | 0.75 |
| IDEFICS-2 | 0.57 | 0.86 | 0.94 | 0.78 |
| CogVLM | 0.61 | 0.89 | 0.95 | 0.80 |
| GPT-4V | 0.66 | 0.91 | 0.97 | 0.82 |
| LLaMAR | 0.68 | 0.90 | 0.95 | 0.85 |
| ReCAPA | 0.75 | 0.93 | 0.95 | 0.93 |
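The paper does not give a formula for the Balance metric in this excerpt; one plausible realization of "evenness of contributions to subtasks" is the normalized Shannon entropy of the per-subtask contribution distribution. The function below is a hypothetical sketch under that assumption, not the paper's definition.

```python
import math

def balance(contributions):
    """Hypothetical balance score (assumption, not the paper's metric):
    normalized Shannon entropy of per-subtask contribution counts.
    Returns 1.0 for perfectly even contributions, approaching 0.0 when
    one subtask dominates."""
    total = sum(contributions)
    probs = [c / total for c in contributions if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(contributions))

print(balance([5, 5, 5]))   # perfectly even contributions -> 1.0
print(balance([12, 2, 1]))  # one subtask dominates -> well below 1.0
```

Under this reading, ReCAPA's Balance of 0.93 would indicate near-uniform effort across subtasks rather than over-reliance on a few easy interactions.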
Performance of different models on VisualAgentBench, which includes the OmniGibson and Minecraft environments. AVG. denotes the overall average score.
| Model | AVG. | OmniGibson | Minecraft |
|---|---|---|---|
| **Open-LMMs (Fine-tuning)** | | | |
| Qwen-VL | 9.90 | 1.7 | 18.1 |
| CogVLM2 | 13.55 | 6.6 | 20.5 |
| LLaVA-NeXT | 16.60 | 9.4 | 23.8 |
| GLM-4V | 14.35 | 8.8 | 19.9 |
| InternVL-2 | 22.20 | 16.0 | 28.4 |
| **Proprietary-LMMs (Prompting)** | | | |
| qwen-vl-max | 2.65 | 0.0 | 5.3 |
| Claude-3.5-Sonnet | 40.15 | 24.3 | 56.0 |
| GPT-4V (preview) | 41.95 | 36.5 | 47.4 |
| GPT-4o | 48.30 | 41.4 | 55.2 |
| Claude-4-Sonnet | 50.25 | 42.6 | 57.9 |
| GPT-4o mini | 54.15 | 46.7 | 61.6 |
| Gemini 2.5 Flash | 53.00 | 43.9 | 62.1 |
| ReCAPA | 58.65 | 50.6 | 66.7 |
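The AVG. column appears to be the unweighted mean of the two environment scores, which can be checked directly against a few rows of the table:

```python
# Spot-check that AVG. = (OmniGibson + Minecraft) / 2 for selected rows.
rows = {
    "Qwen-VL":    (1.7, 18.1, 9.90),
    "InternVL-2": (16.0, 28.4, 22.20),
    "GPT-4o":     (41.4, 55.2, 48.30),
    "ReCAPA":     (50.6, 66.7, 58.65),
}
for name, (og, mc, avg) in rows.items():
    computed = (og + mc) / 2
    print(f"{name}: reported {avg:.2f}, computed {computed:.2f}")
```

Every row reproduces the reported AVG. value, confirming the column is a simple per-environment mean.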