Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision

Authors: Lingzhuang Sun, Ruitong Liu, Yuxia Zhu, Xiaohan Xu, Jingxuan Wei, Xiangxiang Zhang, Bihui Yu, Wentao Zhang

Deep-Dive Summary:

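Self-Consistency Improves Chain-of-Thought Reasoning in Large Language Models

1. Abstract and Introduction

This paper introduces self-consistency, a decoding strategy intended to substantially improve the performance of large language models (LLMs) on complex reasoning tasks. Conventional chain-of-thought (CoT) prompting typically uses greedy decoding to produce a single reasoning path. Self-consistency instead samples multiple distinct reasoning paths and takes a majority vote over their final answers to find the most consistent one.

2. The Self-Consistency Method

The core idea of self-consistency is that a complex reasoning problem often admits several different lines of thought, and these different paths frequently lead to the same correct answer.

The method proceeds in three steps:

Prompting: guide the model with a chain-of-thought prompt.
Sampling: sample a diverse set of reasoning paths from the model's decoder instead of keeping only the highest-probability path.
Aggregation: take a majority vote over all sampled results and select the most frequent answer as the final result.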
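
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `sample_reasoning_path` is a hypothetical stand-in that returns canned completions where a real system would call a language model with a chain-of-thought prompt, and the "The answer is N" output format is an assumption made so the final answer can be parsed.

```python
import random
import re
from collections import Counter


def sample_reasoning_path(question, rng):
    """Hypothetical stand-in for one sampled chain-of-thought completion.

    A real system would send a CoT prompt plus `question` to a language
    model and sample with temperature > 0. Here we draw from canned text.
    """
    canned = [
        "16 - 3 - 4 = 9 eggs remain, sold at $2 each. The answer is 18.",
        "7 eggs are used, so 9 are sold; 9 * 2 = 18. The answer is 18.",
        "16 - 3 = 13 eggs, and 13 * 2 = 26. The answer is 26.",
    ]
    return rng.choice(canned)


def extract_answer(path):
    """Parse the final answer, assuming the 'The answer is N' format."""
    match = re.search(r"The answer is\s+(-?\d+(?:\.\d+)?)", path)
    return match.group(1) if match else None


def majority_vote(answers):
    """Return the most frequent answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def self_consistency(question, num_paths=10, seed=0):
    rng = random.Random(seed)
    # Step 2: sample several reasoning paths instead of one greedy path.
    paths = [sample_reasoning_path(question, rng) for _ in range(num_paths)]
    # Step 3: aggregate by voting on the parsed final answers only.
    answers = [a for a in map(extract_answer, paths) if a is not None]
    if not answers:
        return None, Counter()
    return majority_vote(answers), Counter(answers)


if __name__ == "__main__":
    answer, votes = self_consistency("toy egg-selling question", num_paths=10)
    print("votes:", dict(votes))
    print("final answer:", answer)
```

Note that the vote is taken over the extracted answers, not over the reasoning text, so two paths that reason differently but reach the same number count as agreeing.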

Formally, suppose we sample $m$ candidate responses $\{(r_i, a_i)\}_{i=1}^{m}$ from the model, where $r_i$ denotes the $i$-th reasoning path and $a_i$ the corresponding answer. Self-consistency selects as the final answer the value of $a$ that attains:
$$\arg\max_{a} \sum_{i=1}^{m} \mathbb{1}(a_i = a)$$
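
A small worked instance may make the voting rule concrete. The values below are illustrative only and are not taken from the source: assume $m = 5$ sampled paths whose final answers are 18, 26, 18, 18 and 26.

```latex
% Illustrative values only (not from the source): m = 5 sampled answers.
\[
(a_1, a_2, a_3, a_4, a_5) = (18,\ 26,\ 18,\ 18,\ 26)
\]
\[
\sum_{i=1}^{5} \mathbb{1}(a_i = 18) = 3,
\qquad
\sum_{i=1}^{5} \mathbb{1}(a_i = 26) = 2
\]
\[
\arg\max_{a} \sum_{i=1}^{5} \mathbb{1}(a_i = a) = 18
\]
```

The reasoning paths $r_i$ do not enter the sum at all; only the answers $a_i$ are counted.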

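3. Experimental Setup

The authors evaluated the method extensively on several reasoning benchmarks, covering:

Arithmetic reasoning: GSM8K, SVAMP, ASDiv, AQuA-RAT, and others.
Commonsense reasoning: StrategyQA.
Symbolic reasoning: Last Letter Concatenation.

The models used include LaMDA, PaLM, and the GPT-3 family.

4. Experimental Results

4.1 Arithmetic Reasoning

Self-consistency yields substantial gains on all arithmetic tasks. For example, on GSM8K with PaLM-540B, accuracy rises from 56.5% to 74.4%.

4.2 Commonsense and Symbolic Reasoning

On non-mathematical tasks such as StrategyQA and symbolic manipulation, self-consistency also generalizes well and clearly outperforms standard CoT prompting.

5. Analysis and Discussion

Effect of the Number of Sampled Paths

The study shows that reasoning performance keeps improving as the number of sampled paths $m$ increases. Sampling as few as 5 to 10 paths already yields a substantial gain, and the improvement saturates once $m$ reaches 40 or 100.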
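
A toy calculation can suggest why gains grow quickly at first and then flatten. The sketch below is not from the source and rests on strong simplifying assumptions: every path is independently correct with the same probability `p`, there are only two possible answers, and a tie is broken by a fair coin. Real sampled paths are correlated and have many possible answers, so the numbers are not predictions of any benchmark result; `p = 0.6` is an arbitrary illustrative value.

```python
from math import comb


def majority_vote_accuracy(p, m):
    """Probability that a majority vote over m paths is correct.

    Assumptions (illustrative, not from the source): paths are independent,
    each is correct with probability p, answers are binary, and an exact
    tie is resolved by a fair coin flip.
    """
    total = 0.0
    for k in range(m + 1):  # k = number of correct paths
        prob = comb(m, k) * p**k * (1 - p) ** (m - k)
        if 2 * k > m:
            total += prob
        elif 2 * k == m:
            total += 0.5 * prob
    return total


if __name__ == "__main__":
    for m in (1, 5, 11, 41, 101):
        print(f"m = {m:3d}  accuracy = {majority_vote_accuracy(0.6, m):.3f}")
```

Under these assumptions the accuracy rises with $m$ but each additional path contributes less than the one before, which matches the qualitative saturation described above.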

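Robustness Analysis

Self-consistency is highly robust to different sampling parameters (such as the temperature $T$ and the nucleus sampling threshold $p$) and to different prompt combinations, which supports its stability in practical use.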
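
For readers unfamiliar with the two sampling parameters named above, here is a minimal sketch of what they do to a next-token distribution. It uses a toy list of logits in place of a real model, and the default values `temperature=0.7` and `top_p=0.95` are illustrative assumptions, not settings reported in the source.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / T. Lower T sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, temperature=0.7, top_p=0.95, rng=random):
    """Draw one token index after temperature scaling and nucleus filtering."""
    probs = nucleus_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # toy values, not from a real model
    for t in (0.5, 1.0):
        probs = apply_temperature(toy_logits, t)
        print(f"T = {t}:", [round(q, 3) for q in probs])
    print("sampled index:", sample_token(toy_logits, rng=random.Random(0)))
```

Both parameters control how diverse the sampled reasoning paths are: a higher temperature or a larger `top_p` spreads probability over more tokens, while lower values push sampling toward greedy decoding.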

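6. Conclusion

Self-consistency is a simple yet powerful decoding scheme that exploits the natural diversity of reasoning in model generations. Through majority voting, it can effectively identify and correct computational or logical errors that arise in a single reasoning path, thereby strengthening the reasoning ability of large language models.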

Original Abstract: Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies where the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the Guided Verifier framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real-time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To facilitate this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing CoRe dataset of process-level negatives and Correct-guide Reasoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse and MMMU indicate that by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.

PDF Link: 2602.04290v1