Table of Contents
cs.CL
[1] The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders
Shikhar Shiromani,Archie Chaudhury,Sri Pranav Kunda
Main category: cs.CL
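TL;DR: This paper proposes the Hypocrisy Gap, a mechanistic metric that uses sparse autoencoders (SAEs) to quantify the divergence between a large language model's internal reasoning and its final generation, in order to detect unfaithful behavior such as sycophancy or hypocrisy. Experiments show that the method outperforms the baseline across several models.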
Details
Motivation: Large language models often behave unfaithfully: the final answer diverges sharply from the internal chain of thought (CoT) in order to please the user. An interpretable, mechanistic method for detecting this behavior is needed.
Method: The Hypocrisy Gap metric combines sparse autoencoders (SAEs) with sparse linear probes and compares, in latent space, the model's internal "true belief" with its final generated trajectory to quantify the degree of unfaithfulness.
Result: In experiments on Gemma, Llama, and Qwen models with Anthropic's Sycophancy benchmark, the method reaches an AUROC of 0.55-0.73 for detecting sycophancy and 0.55-0.74 for hypocrisy, significantly outperforming a decision-aligned log-prob baseline (0.41-0.50).
Conclusion: The Hypocrisy Gap is an effective and interpretable mechanistic metric that reliably identifies unfaithful behavior in LLMs, offering a new tool for improving model trustworthiness and alignment robustness.
Abstract: Large Language Models (LLMs) frequently exhibit unfaithful behavior, producing a final answer that differs significantly from their internal chain of thought (CoT) reasoning in order to appease the user they are conversing with. In order to better detect this behavior, we introduce the Hypocrisy Gap, a mechanistic metric utilizing Sparse Autoencoders (SAEs) to quantify the divergence between a model's internal reasoning and its final generation. By mathematically comparing an internal truth belief, derived via sparse linear probes, to the final generated trajectory in latent space, we quantify and detect a model's tendency to engage in unfaithful behavior. Experiments on Gemma, Llama, and Qwen models using Anthropic's Sycophancy benchmark show that our method achieves an AUROC of 0.55-0.73 for detecting sycophantic runs and 0.55-0.74 for hypocritical cases where the model internally "knows" the user is wrong, consistently outperforming a decision-aligned log-probability baseline (0.41-0.50 AUROC).
[2] STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models
Xuzhao Li,Xuchen Li,Jian Zhao,Shiyu Hu
Main category: cs.CL
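TL;DR: This paper proposes STEMVerse, a diagnostic framework that systematically evaluates the STEM reasoning of large language models in a two-dimensional "Discipline × Cognition" capability space and reveals their structural failure patterns.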
Details
Motivation: Current STEM evaluation paradigms treat benchmarks as isolated "silos" and report only a single aggregate score, so they cannot tell whether errors come from insufficient domain knowledge or from deficits in cognitive ability, which limits their diagnostic value.
Method: The STEMVerse framework re-aggregates more than 20,000 STEM problems into a unified "Discipline × Cognition" capability space, assigns dual-axis labels to every instance, and systematically evaluates representative LLMs across parameter scales and training paradigms.
Result: The empirical results reveal structural failure patterns in LLM STEM reasoning and support the framework's effectiveness in multi-disciplinary coverage and fine-grained cognitive stratification.
Conclusion: By integrating multi-disciplinary coverage with fine-grained cognitive stratification, STEMVerse offers a clear and actionable perspective on the scientific reasoning characteristics of LLMs.
Abstract: As Large Language Models (LLMs) achieve significant breakthroughs in complex reasoning tasks, evaluating their proficiency in science, technology, engineering, and mathematics (STEM) has become a primary method for measuring machine intelligence. However, current evaluation paradigms often treat benchmarks as isolated "silos," offering only monolithic aggregate scores that neglect the intricacies of both academic specialization and cognitive depth. This result-oriented approach fails to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, thereby limiting the diagnostic value. To address this, we propose STEMVerse, a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. This framework characterizes model performance across academic specialization and cognitive complexity to map the capability required for reasoning. We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified "Discipline $\times$ Cognition" capability space, assigning dual-axis labels to every instance. Utilizing this unified diagnostic framework, we systematically evaluate representative LLM families across varying parameter scales and training paradigms. Our empirical results reveal structural failure patterns in STEM reasoning. By integrating multi-disciplinary coverage and fine-grained cognitive stratification into a unified framework, STEMVerse provides a clear and actionable perspective for understanding the scientific reasoning characteristics of LLMs.
[3] Test-Time Detoxification without Training or Learning Anything
Baturay Saglam,Dionysis Kalogerias
Main category: cs.CL
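TL;DR: This paper proposes a test-time detoxification method that needs no retraining, gradients, or access to internal parameters. It uses zeroth-order optimization in the input embedding space to approximate the toxicity gradient and descend along it, substantially reducing the toxicity of LLM-generated text while preserving generation quality.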
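Illustrative sketch (not from the paper): the zeroth-order idea in this summary can be shown with a toy example. Here `toxicity_score` is a hypothetical stand-in for "run the model forward from these embeddings and score the completion", and the two-point random-direction estimator is one common zeroth-order scheme, which may differ from the authors' estimator. All names and step sizes are assumptions.

```python
import random

def toxicity_score(embedding):
    # Hypothetical stand-in for: run the model forward from `embedding`,
    # then score the completion with a toxicity classifier.
    # Here it is a smooth toy function that is minimized at the origin.
    return sum(x * x for x in embedding)

def zo_gradient(f, x, num_dirs=20, mu=1e-3, rng=random):
    """Two-point zeroth-order gradient estimate that uses forward evaluations only."""
    grad = [0.0] * len(x)
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in x]            # random probe direction
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)   # directional derivative along u
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_dirs
    return grad

def detoxify(f, x, steps=5, lr=0.1, rng=random):
    """Take a small number of descent steps on the estimated gradient."""
    for _ in range(steps):
        g = zo_gradient(f, x, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

rng = random.Random(0)
start = [1.0, -2.0, 0.5]
end = detoxify(toxicity_score, start, rng=rng)
print(toxicity_score(start), toxicity_score(end))
```

No gradient of `toxicity_score` is ever computed; each step costs `2 * num_dirs` forward evaluations, which is the trade-off that makes the approach usable when only forward passes are available.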
Details
Motivation: LLMs can generate toxic or inappropriate text even for benign inputs; existing detoxification methods mostly rely on retraining, gradients, or auxiliary modules, which is costly and transfers poorly across models or to black-box settings.
Method: A test-time zeroth-order optimization method: working in the input embedding space, it approximates the gradient of toxicity with respect to the embeddings using a limited number of forward passes and a toxicity scoring function, then takes a few descent steps to steer generation toward less toxic continuations.
Result: The method reduces toxicity robustly across models and prompts and achieves the best toxicity-quality trade-off in most settings; it requires no training or access to intermediate layers and applies to black-box models.
Conclusion: Word embeddings can serve as efficient control variables, and zeroth-order black-box optimization offers a new paradigm for safe, scalable LLM text generation.
Abstract: Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful content without sacrificing the model's generation quality. Many existing approaches rely on model retraining, gradients, or learned auxiliary components, which can be costly and may not transfer across model families or to truly black-box settings. We introduce a test-time procedure that approximates the gradient of completion toxicity with respect to the input embeddings and uses a small number of descent steps to steer generation toward less toxic continuations. This is achieved with zeroth-order optimization that requires only access to input embeddings, a toxicity scoring function, and forward evaluations of the model. Empirically, the approach delivers robust toxicity reductions across models and prompts and, in most settings, achieves the best overall toxicity-quality trade-off. More broadly, our work positions word embeddings as effective control variables and encourages wider use of black-box optimization to guide autoregressive language models toward scalable, safer text generation, without requiring any training or access to intermediate computations.
[4] ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching
Yunao Zheng,Xiaojie Wang,Lei Ren,Wei Chen
Main category: cs.CL
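TL;DR: ROSA-Tuning is a new, efficient long-context modeling method. A CPU-based online suffix automaton (ROSA) module retrieves relevant historical information in parallel and injects it into the model state in a trainable manner; combined with range-restricted attention, this enables end-to-end training and markedly improves the long-context ability of windowed-attention models while keeping computation efficient.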
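Illustrative sketch (not from the paper): the suffix-matching retrieval in this summary can be pictured with a naive scan. The function below is hypothetical and is not the ROSA module: it runs in O(n^2) rather than using an online suffix automaton, and it leaves out the trainable injection, discretization, and CPU-GPU pipeline. It only shows what "find the earlier position whose context matches the current suffix" means.

```python
def longest_suffix_match(tokens):
    """Find the earlier position whose preceding context shares the longest
    suffix with the full sequence, and return (match_length, token_that_followed).
    Naive O(n^2) scan; a suffix automaton can answer the same query online."""
    n = len(tokens)
    best_len, best_next = 0, None
    for end in range(1, n):  # candidate history tokens[:end], followed by tokens[end]
        k = 0
        while k < end and tokens[end - 1 - k] == tokens[n - 1 - k]:
            k += 1
        if k > 0 and k >= best_len:  # prefer the most recent of equally long matches
            best_len, best_next = k, tokens[end]
    return best_len, best_next

history = list("abcxabcyab")
# The current suffix "ab" occurred earlier and was followed by "c".
print(longest_suffix_match(history))
```

The retrieved token (or its position) is the kind of signal a model could then fuse with local attention; how that fusion is done in ROSA-Tuning is described only at a high level in this entry.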
Details
Motivation: Existing efficient attention methods reduce computational complexity but cover only a limited portion of the historical state, making it hard to combine long-context modeling ability with computational efficiency.
Method: ROSA-Tuning introduces a CPU-based RWKV Online Suffix Automaton (ROSA) retrieval module that runs in parallel with standard attention; a binary discretization strategy and a counterfactual gradient algorithm support end-to-end training; an asynchronous CPU-GPU pipeline optimizes execution efficiency; retrieved results are injected into the model state in a trainable manner and fused by weighting through range-restricted attention.
Result: Systematic evaluation on Qwen3-Base-1.7B shows that ROSA-Tuning substantially restores the long-context modeling ability of windowed-attention models, with performance on benchmarks such as LongBench close to, and in some cases matching, global attention, while GPU memory and compute overhead stay close to those of windowed-attention methods.
Conclusion: ROSA-Tuning offers a new path for efficient long-context processing that balances performance and efficiency and has practical value.
Abstract: Long-context capability and computational efficiency are among the central challenges facing today's large language models. Existing efficient attention methods reduce computational complexity, but they typically suffer from a limited coverage of the model state. This paper proposes ROSA-Tuning, a retrieval-and-recall mechanism for enhancing the long-context modeling ability of pretrained models. Beyond the standard attention mechanism, ROSA-Tuning introduces in parallel a CPU-based ROSA (RWKV Online Suffix Automaton) retrieval module, which efficiently locates historical positions in long contexts that are relevant to the current query, and injects the retrieved information into the model state in a trainable manner; subsequent weighted fusion can then be handled by range-restricted attention. To enable end-to-end training, we design a binary discretization strategy and a counterfactual gradient algorithm, and further optimize overall execution efficiency via an asynchronous CPU-GPU pipeline. Systematic evaluations on Qwen3-Base-1.7B show that ROSA-Tuning substantially restores the long-context modeling ability of windowed-attention models, achieving performance close to and in some cases matching global attention on benchmarks such as LongBench, while maintaining computational efficiency and GPU memory usage that are nearly comparable to windowed-attention methods, offering a new technical path for efficient long-context processing.
The example code can be found at https://github.com/zyaaa-ux/ROSA-Tuning.
[5] Graph-Augmented Reasoning with Large Language Models for Tobacco Pest and Disease Management
Siyu Li,Chenwei Song,Qi Zhou,Wan Zhou,Xinyi Liu
Main category: cs.CL
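TL;DR: This paper proposes a graph-augmented reasoning framework that brings structured knowledge of tobacco pests and diseases into large language models. By building a domain knowledge graph and combining a graph neural network with a LoRA-fine-tuned ChatGLM, it improves the accuracy and interpretability of diagnosis and treatment recommendations.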
Details
Motivation: To address the hallucinations, inappropriate recommendations, and weak complex reasoning that arise when LLMs lack structured domain knowledge in agricultural pest and disease management.
Method: Build a tobacco pest and disease knowledge graph on top of GraphRAG; fine-tune ChatGLM parameter-efficiently with LoRA; use a graph neural network to learn relational representations of entities such as symptoms, diseases, pesticides, and control measures; retrieve query-relevant subgraphs and inject them into the LLM input as relational evidence.
Result: The framework clearly outperforms text-only baselines on multi-hop and comparative questions, confirming that guidance from graph-augmented evidence improves reasoning accuracy and domain consistency.
Conclusion: Explicitly modeling domain entities and their relations, and incorporating graph-retrieved evidence, effectively strengthens the reliability and practicality of LLM reasoning in specialized vertical domains.
Abstract: This paper proposes a graph-augmented reasoning framework for tobacco pest and disease management that integrates structured domain knowledge into large language models. Building on GraphRAG, we construct a domain-specific knowledge graph and retrieve query-relevant subgraphs to provide relational evidence during answer generation. The framework adopts ChatGLM as the Transformer backbone with LoRA-based parameter-efficient fine-tuning, and employs a graph neural network to learn node representations that capture symptom-disease-treatment dependencies. By explicitly modeling diseases, symptoms, pesticides, and control measures as linked entities, the system supports evidence-aware retrieval beyond surface-level text similarity. Retrieved graph evidence is incorporated into the LLM input to guide generation toward domain-consistent recommendations and to mitigate hallucinated or inappropriate treatments. Experimental results show consistent improvements over text-only baselines, with the largest gains observed on multi-hop and comparative reasoning questions that require chaining multiple relations.
[6] WideSeek: Advancing Wide Research via Multi-Agent Scaling
Ziyang Huang,Haolin Ren,Xiaowei Yuan,Jiawei Wang,Zhongtao Jiang,Kun Xu,Shizhu He,Jun Zhao,Kang Liu
Main category: cs.CL
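TL;DR: This paper proposes the Wide Research paradigm, builds the WideSeekBench benchmark and the WideSeek multi-agent architecture, optimizes the system with end-to-end reinforcement learning, and shows that scaling the number of agents is effective for broad search.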
Details
Motivation: Existing search-intelligence research lacks dedicated benchmarks and optimization methods for search breadth, making it hard to support parallel information retrieval and synthesis under complex constraints.
Method: (1) A multi-phase data pipeline generates WideSeekBench, a benchmark for broad information seeking; (2) WideSeek, a hierarchical multi-agent architecture, dynamically forks sub-agents; (3) a unified reinforcement learning training framework linearizes multi-agent trajectories.
Result: Experiments show that WideSeek and multi-agent RL markedly improve broad-search capability and that increasing the number of agents is an effective path for advancing Wide Research.
Conclusion: Wide Research is an important direction in the evolution of search intelligence; WideSeekBench and WideSeek provide the benchmark and system implementation for the paradigm, with multi-agent collaboration and end-to-end optimization as the key advances.
Abstract: Search intelligence is evolving from Deep Research to Wide Research, a paradigm essential for retrieving and synthesizing comprehensive information under complex constraints in parallel. However, progress in this field is impeded by the lack of dedicated benchmarks and optimization methodologies for search breadth. To address these challenges, we take a deep dive into Wide Research from two perspectives: Data Pipeline and Agent Optimization. First, we produce WideSeekBench, a General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi-phase data pipeline to ensure diversity across the target information volume, logical constraints, and domains. Second, we introduce WideSeek, a dynamic hierarchical multi-agent architecture that can autonomously fork parallel sub-agents based on task requirements. Furthermore, we design a unified training framework that linearizes multi-agent trajectories and optimizes the system using end-to-end RL. Experimental results demonstrate the effectiveness of WideSeek and multi-agent RL, highlighting that scaling the number of agents is a promising direction for advancing the Wide Research paradigm.
[7] Monotonicity as an Architectural Bias for Robust Language Models
Patrick Cooper,Alireza Nadali,Ashutosh Trivedi,Alvaro Velasquez
Main category: cs.CL
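TL;DR: This paper proposes imposing monotonicity constraints on the feed-forward sublayers of Transformer models to improve the robustness of large language models against adversarial attacks with almost no loss in original performance.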
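Illustrative sketch (not from the paper): one common way to make a feed-forward block order-preserving is to force its effective weights to be positive and use a non-decreasing activation. This entry does not say which construction the authors use, so the softplus reparameterization below is an assumption, and the weights are arbitrary illustration values.

```python
import math

def softplus(z):
    # Maps any real number to a positive one (numerically stable form of log(1 + e^z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def relu(z):
    return max(z, 0.0)

def monotone_ffn(x, raw_w1, b1, raw_w2, b2):
    """Feed-forward block that is non-decreasing in every input coordinate.
    Effective weights are softplus(raw) > 0 and ReLU is non-decreasing,
    so the composition preserves the coordinate-wise order of inputs."""
    hidden = [relu(sum(softplus(w) * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(raw_w1, b1)]
    return [sum(softplus(w) * h for w, h in zip(row, hidden)) + b
            for row, b in zip(raw_w2, b2)]

# Arbitrary unconstrained parameters: 2 inputs, 3 hidden units, 2 outputs.
raw_w1 = [[0.3, -1.2], [-0.7, 0.5], [1.1, 0.0]]
b1 = [0.1, -0.2, 0.0]
raw_w2 = [[-0.4, 0.9, 0.2], [0.6, -1.5, 0.3]]
b2 = [0.0, 0.05]

low = monotone_ffn([0.2, 0.4], raw_w1, b1, raw_w2, b2)
high = monotone_ffn([0.5, 0.4], raw_w1, b1, raw_w2, b2)  # first input increased
print(all(h >= l for l, h in zip(low, high)))
```

Because the raw parameters stay unconstrained, such a block can still be trained with ordinary gradient methods; sign-dependent behavior such as negation would have to come from the unconstrained attention layers, as the abstract describes.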
Details
Motivation: LLMs are fragile under adversarial prompts and jailbreak attacks, reflecting a fundamental problem of modern neural language models: sensitivity to small perturbations in high-dimensional input spaces.
Method: Monotonicity constraints are applied selectively to the feed-forward sublayers of sequence-to-sequence Transformers while attention is left unconstrained, balancing expressivity and robustness.
Result: Adversarial attack success rates fall from about 69% to 19%, with only a slight drop in summarization performance.
Conclusion: Monotonicity can serve as an effective inductive bias that markedly improves robustness without sacrificing language model performance.
Abstract: Large language models (LLMs) are known to exhibit brittle behavior under adversarial prompts and jailbreak attacks, even after extensive alignment and fine-tuning. This fragility reflects a broader challenge of modern neural language models: small, carefully structured perturbations in high-dimensional input spaces can induce large and unpredictable changes in internal semantic representations and output. We investigate monotonicity as an architectural inductive bias for improving the robustness of Transformer-based language models. Monotonicity constrains semantic transformations so that strengthening information, evidence, or constraints cannot lead to regressions in the corresponding internal representations. Such order-preserving behavior has long been exploited in control and safety-critical systems to simplify reasoning and improve robustness, but has traditionally been viewed as incompatible with the expressivity required by neural language models. We show that this trade-off is not inherent. By enforcing monotonicity selectively in the feed-forward sublayers of sequence-to-sequence Transformers -- while leaving attention mechanisms unconstrained -- we obtain monotone language models that preserve the performance of their pretrained counterparts. This architectural separation allows negation, contradiction, and contextual interactions to be introduced explicitly through attention, while ensuring that subsequent semantic refinement is order-preserving. Empirically, monotonicity substantially improves robustness: adversarial attack success rates drop from approximately 69% to 19%, while standard summarization performance degrades only marginally.
[8] InfMem: Learning System-2 Memory Control for Long-Context Agent
Xinyu Wang,Mingze Li,Peng Lu,Xiao-Wen Chang,Lifeng Shang,Jinping Li,Fei Mi,Prasanna Parthasarathi,Yufei Cui
Main category: cs.CL
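TL;DR: This paper proposes InfMem, a control-centric agent that implements active, evidence-aware memory management through a PreThink-Retrieve-Write protocol, markedly improving accuracy and speeding up inference on ultra-long document question answering.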
Details
Motivation: Existing streaming agents update memory passively, which makes it hard to retain the low-salience bridging evidence that multi-hop reasoning depends on; under memory constraints they also struggle to synthesize sparse evidence scattered across long documents.
Method: The InfMem agent introduces System-2-style control with three core steps: PreThink (judging whether evidence is sufficient), Retrieve (targeted in-document retrieval), and Write (evidence-aware joint compression into memory). An SFT-to-RL training recipe jointly optimizes retrieval, writing, and stopping decisions.
Result: On ultra-long QA benchmarks from 32k to 1M tokens, InfMem improves average absolute accuracy by +10.17, +11.84, and +8.23 points on Qwen3-1.7B, Qwen3-4B, and Qwen2.5-7B respectively, while cutting inference time by 3.9× on average (up to 5.1×).
Conclusion: Memory mechanisms driven by active control, such as InfMem, suit multi-hop reasoning over ultra-long documents better than passive streaming memory, balancing performance, efficiency, and controllability.
Abstract: Reasoning over ultra-long documents requires synthesizing sparse evidence scattered across distant segments under strict memory constraints. While streaming agents enable scalable processing, their passive memory update strategy often fails to preserve low-salience bridging evidence required for multi-hop reasoning. We propose InfMem, a control-centric agent that instantiates System-2-style control via a PreThink-Retrieve-Write protocol. InfMem actively monitors evidence sufficiency, performs targeted in-document retrieval, and applies evidence-aware joint compression to update a bounded memory. To ensure reliable control, we introduce a practical SFT-to-RL training recipe that aligns retrieval, writing, and stopping decisions with end-task correctness. On ultra-long QA benchmarks from 32k to 1M tokens, InfMem consistently outperforms MemAgent across backbones. Specifically, InfMem improves average absolute accuracy by +10.17, +11.84, and +8.23 points on Qwen3-1.7B, Qwen3-4B, and Qwen2.5-7B, respectively, while reducing inference time by $3.9\times$ on average (up to $5.1\times$) via adaptive early stopping.
[9] Predicting first-episode homelessness among US Veterans using longitudinal EHR data: time-varying models and social risk factors
Rohan Pandey,Haijuan Yan,Hong Yu,Jack Tsai
Main category: cs.CL
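TL;DR: Using electronic health record (EHR) data from the Department of Veterans Affairs, this study builds longitudinal prediction models that combine clinical with social and behavioral factors, markedly improving prediction of first-episode homelessness risk among veterans; the top risk tiers are of practical value for intervention.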
Details
Motivation: Homelessness among US veterans is a serious problem, and EHR-based prospective risk prediction is needed to support proactive intervention.
Method: Using 2016 EHR data from more than 4.27 million veterans, the study builds static and time-varying representations, with clinician-informed logic modeling the persistence of clinical conditions and social risks; it compares classical machine learning, Transformer-based masked language models, and fine-tuned large language models (LLMs).
Result: Longitudinal models that include social and behavioral factors raise PR-AUC by 15-30%; in the top 1% risk tier, positive predictive values for 3-12 month prediction range from 3.93% to 13.80% across models; LLMs discriminate less well than encoder-based models but show smaller performance disparities across racial groups.
Conclusion: Longitudinal, socially informed EHR modeling concentrates homelessness risk into actionable tiers, supporting targeted, data-driven prevention for at-risk veterans.
Abstract: Homelessness among US veterans remains a critical public health challenge, yet risk prediction offers a pathway for proactive intervention. In this retrospective prognostic study, we analyzed electronic health record (EHR) data from 4,276,403 Veterans Affairs patients during a 2016 observation period to predict first-episode homelessness occurring 3-12 months later in 2017 (prevalence: 0.32-1.19%). We constructed static and time-varying EHR representations, utilizing clinician-informed logic to model the persistence of clinical conditions and social risks over time. We then compared the performance of classical machine learning, transformer-based masked language models, and fine-tuned large language models (LLMs). We demonstrate that incorporating social and behavioral factors into longitudinal models improved precision-recall area under the curve (PR-AUC) by 15-30%. In the top 1% risk tier, models yielded positive predictive values ranging from 3.93-4.72% at 3 months, 7.39-8.30% at 6 months, 9.84-11.41% at 9 months, and 11.65-13.80% at 12 months across model architectures. Large language models underperformed encoder-based models on discrimination but showed smaller performance disparities across racial groups. These results demonstrate that longitudinal, socially informed EHR modeling concentrates homelessness risk into actionable strata, enabling targeted and data-informed prevention strategies for at-risk veterans.
[10] Time-Critical Multimodal Medical Transportation: Organs, Patients, and Medical Supplies
Elaheh Sabziyan Varnousfaderani,Syed A. M. Shihab,Mohammad Taghizadeh
Main category: cs.CL
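TL;DR: This paper proposes a greedy heuristic for dispatching a multimodal vehicle fleet in medical transportation. It combines ground ambulances with unmanned aerial vehicles (UAVs) and electric vertical take-off and landing (eVTOL) aircraft and optimizes cost and timeliness while accounting for traffic, weather, and payload consolidation.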
Details
Motivation: To resolve the tension in emergency medical transport between ground traffic congestion and air transport that is costly or limited by weather and range, and to improve the timeliness and reliability of transporting organs, patients, and medical supplies.
Method: A constructive greedy heuristic supports four fleet configurations (ambulances only, ambulances + UAVs, ambulances + eVTOLs, and a fully integrated fleet); the model includes payload consolidation, ground traffic congestion, and weather constraints on aerial operations, and emphasizes rapid dispatching.
Result: Evaluation under a common scenario shows that integrated multimodal fleets, especially those with UAVs and eVTOLs, outperform single-mode fleets on operating cost, recharging/fuel cost, and total transportation time; the algorithm balances efficiency with computational tractability.
Conclusion: A multimodal medical transportation system paired with a greedy heuristic dispatching algorithm can markedly improve response speed and cost-effectiveness under practical constraints, providing a workable framework for future smart medical logistics.
Abstract: Timely transportation of organs, patients, and medical supplies is critical to modern healthcare, particularly in emergencies and transplant scenarios where even short delays can severely impact outcomes. Traditional ground-based vehicles such as ambulances are often hindered by traffic congestion, while air vehicles such as helicopters are faster but costly. Emerging air vehicles -- Unmanned Aerial Vehicles and electric vertical take-off and landing aircraft -- have lower operating costs, but remain limited by range and susceptibility to weather conditions. A multimodal transportation system that integrates both air and ground vehicles can leverage the strengths of each to enhance overall transportation efficiency. This study introduces a constructive greedy heuristic algorithm for multimodal vehicle dispatching for medical transportation. Four different fleet configurations were tested: (i) ambulances only, (ii) ambulances with Unmanned Aerial Vehicles, (iii) ambulances with electric vertical take-off and landing aircraft, and (iv) a fully integrated fleet of ambulances, Unmanned Aerial Vehicles, and electric vertical take-off and landing aircraft. The algorithm incorporates payload consolidation across compatible routes, accounts for traffic congestion in ground operations and weather conditions in aerial operations, while enabling rapid vehicle dispatching compared to computationally intensive optimization models.
Using a common set of conditions, we evaluate all four fleet types to identify the most effective configurations for fulfilling medical transportation needs while minimizing operating costs, recharging/fuel costs, and total transportation time.
[11] From Task Solving to Robust Real-World Adaptation in LLM Agents
Pouya Pezeshkpour,Estevam Hruschka
Main category: cs.CL
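TL;DR: This paper proposes a robustness evaluation framework for LLM agents in realistic deployment settings, centered on four real-world challenges: partial observability, dynamic environments, noisy signals, and dynamic internal state. Evaluating five frontier LLM agents in a grid game, it finds a large gap between nominal task performance and deployment robustness, shows that model performance depends strongly on the type of uncertainty, and points to objective inference, safe action selection, and verification mechanisms as key research directions.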
Details
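Motivation: Existing evaluations of LLM agents mostly assume an idealized "clean interface" (clear rules, reliable tools, a single objective) and so overestimate real-world readiness; actual deployments involve ambiguous rules, unreliable signals, changing environments, and implicit goals of multiple stakeholders, so the ability to adapt under uncertainty needs to be evaluated.
Method: A grid game with a simple goal but long-horizon execution serves as the benchmark, systematically introducing four deployment-relevant perturbations: partial observability, dynamic environments, noisy signals, and dynamic agent state. Five SOTA LLM agents are stress-tested in this framework, and ablations and feature analyses probe failure mechanisms and model sensitivities.
Result: All models degrade markedly under perturbation, and degradation worsens as grid size and task horizon grow; rankings are unstable, with weaker models able to beat stronger ones under particular uncertainty regimes; models spontaneously trade off completion, efficiency, and penalty avoidance, indicating implicit objective inference; ablations reveal model-specific weaknesses in verification, information acquisition, and state updating.
Conclusion: The nominal task-solving ability of LLM agents does not represent their deployment robustness; objective inference, safe action selection, and runtime verification under partial observability, noise, and non-stationarity need focused development to improve real-world adaptability.
Abstract: Large language models are increasingly deployed as specialized agents that plan, call tools, and take actions over extended horizons. Yet many existing evaluations assume a "clean interface" where dynamics are specified and stable, tools and sensors are reliable, and success is captured by a single explicit objective, often overestimating real-world readiness. In practice, agents face underspecified rules, unreliable signals, shifting environments, and implicit, multi-stakeholder goals. The challenge is therefore not just solving tasks, but adapting while solving: deciding what to trust, what is wanted, when to verify, and when to fall back or escalate. We stress-test deployment-relevant robustness under four operational circumstances: partial observability, dynamic environments, noisy signals, and dynamic agent state. We benchmark agentic LLMs in a grid-based game with a simple goal but long-horizon execution. Episodes violate clean-interface assumptions yet remain solvable, forcing agents to infer rules, pay for information, adapt to environmental and internal shifts, and act cautiously under noise. Across five state-of-the-art LLM agents, we find large gaps between nominal task-solving and deployment-like robustness. Performance generally degrades as grid size and horizon increase, but rankings are unstable: weaker models can beat stronger ones when strategy matches the uncertainty regime.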
Despite no explicit instruction, agents trade off completion, efficiency, and penalty avoidance, suggesting partial objective inference. Ablations and feature analyses reveal model-specific sensitivities and failure drivers, motivating work on verification, safe action selection, and objective inference under partial observability, noise, and non-stationarity.
[12] AmharicStoryQA: A Multicultural Story Question Answering Benchmark in Amharic
Israel Abebe Azime,Abenezer Kebede Angamo,Hana Mekonen Tamiru,Dagnachew Mekonnen Marilign,Philipp Slusallek,Seid Muhie Yimam,Dietrich Klakow
Main category: cs.CL
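TL;DR: This paper introduces the AmharicStoryQA benchmark and stresses that substantial cultural variation exists within a single language such as Amharic; existing LLMs perform unevenly at narrative understanding across regions, which calls for culturally grounded evaluation benchmarks.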
Details
Motivation: Multilingual evaluations of large models often conflate language with culture and use performance as a stand-in for understanding, overlooking the narrative diversity of regional cultures within one language, a problem that is especially acute for low-resource languages.
Method: Build AmharicStoryQA, a long-sequence story question answering benchmark based on Amharic narratives from multiple regions of Ethiopia; systematically evaluate mainstream LLMs on it and analyze regional differences and the effect of supervised fine-tuning.
Result: Existing LLMs show a significant narrative understanding gap on AmharicStoryQA; narratives from different regions produce clearly different evaluation results; supervised fine-tuning improves regions unevenly.
Conclusion: Language-level evaluation does not capture real understanding; benchmarks grounded in local narratives and cultural diversity are needed to assess and improve narrative understanding in low-resource languages more accurately.
Abstract: With the growing emphasis on multilingual and cultural evaluation benchmarks for large language models, language and culture are often treated as synonymous, and performance is commonly used as a proxy for a model's understanding of a given language. In this work, we argue that such evaluations overlook meaningful cultural variation that exists within a single language. We address this gap by focusing on narratives from different regions of Ethiopia and demonstrate that, despite shared linguistic characteristics, region-specific and domain-specific content substantially influences language evaluation outcomes. To this end, we introduce AmharicStoryQA, a long-sequence story question answering benchmark grounded in culturally diverse narratives from Amharic-speaking regions. Using this benchmark, we reveal a significant narrative understanding gap in existing LLMs, highlight pronounced regional differences in evaluation results, and show that supervised fine-tuning yields uneven improvements across regions and evaluation settings. Our findings emphasize the need for culturally grounded benchmarks that go beyond language-level evaluation to more accurately assess and improve narrative understanding in low-resource languages.
[13] When Efficient Communication Explains Convexity
Ashvin Ranjan,Shane Steinert-Threlkeld
Main category: cs.CL
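TL;DR: This paper examines how efficient communication explains variation in semantic typology across the world's languages, uses the Information Bottleneck (IB) method to analyze the correlation between optimality and convexity, and finds that the convexity of the communicative need distribution plays a key role.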
Details
Motivation: To investigate whether, and why, efficient communication succeeds in explaining linguistic variation in semantic typology.
Method: The Information Bottleneck (IB) framework models the trade-off between simplicity and informativeness in language, and a new generalization of convexity is introduced; experiments that manipulate IB model parameters identify the factors affecting the optimality-convexity correlation.
Result: Optimality in the IB sense correlates with generalized convexity, and the convexity of the communicative need distribution is the most important factor driving that correlation.
Conclusion: The explanatory power of efficient communication for semantic typology lies not only in matching the observed phenomena but also in underlying structural properties such as the communicative need distribution, deepening our understanding of how languages evolve.
Abstract: Much recent work has argued that the variation in the languages of the world can be explained from the perspective of efficient communication; in particular, languages can be seen as optimally balancing competing pressures to be simple and to be informative. Focusing on the expression of meaning -- semantic typology -- the present paper asks what factors are responsible for successful explanations in terms of efficient communication. Using the Information Bottleneck (IB) approach to formalizing this trade-off, we first demonstrate and analyze a correlation between optimality in the IB sense and a novel generalization of convexity to this setting. In a second experiment, we manipulate various modeling parameters in the IB framework to determine which factors drive the correlation between convexity and optimality. We find that the convexity of the communicative need distribution plays an especially important role. These results move beyond showing that efficient communication can explain aspects of semantic typology into explanations for why that is the case by identifying which underlying factors are responsible.
[14] R2-Router: A New Paradigm for LLM Routing with Reasoning
Jiaqi Xue,Qian Lou,Jiarong Xing,Heng Huang
Main category: cs.CL
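TL;DR: This paper proposes R2-Router, which treats the output length budget as a controllable variable and jointly selects the best large language model (LLM) and its length budget, cutting cost substantially while maintaining quality. It also builds R2-Bench, the first routing dataset covering multiple length budgets; experiments show state-of-the-art performance at 4-5x lower cost.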
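Illustrative sketch (not from the paper): the joint choice of model and length budget in this summary can be shown with a toy lookup table. Every number, model name, and the cost model (price per token times length budget) is invented for illustration; in a real router the quality values would come from a learned predictor.

```python
# Hypothetical predicted quality per (model, output-length budget in tokens)
# and hypothetical per-token prices.
PREDICTED_QUALITY = {
    ("small", 256): 0.55, ("small", 1024): 0.62,
    ("large", 256): 0.74, ("large", 1024): 0.85,
}
PRICE_PER_TOKEN = {"small": 1.0, "large": 4.0}

def route(cost_budget):
    """Pick the (model, length budget) pair with the highest predicted quality
    whose worst-case cost (price * length budget) fits the cost budget."""
    feasible = [
        (quality, model, length)
        for (model, length), quality in PREDICTED_QUALITY.items()
        if PRICE_PER_TOKEN[model] * length <= cost_budget
    ]
    if not feasible:
        return None
    quality, model, length = max(feasible)
    return model, length

def route_fixed_length(cost_budget, length=1024):
    """Baseline that only chooses a model, at one fixed output length."""
    feasible = [
        (PREDICTED_QUALITY[(model, length)], model)
        for model in PRICE_PER_TOKEN
        if PRICE_PER_TOKEN[model] * length <= cost_budget
    ]
    return max(feasible)[1] if feasible else None

print(route(1024))               # joint choice can keep the large model with a short budget
print(route_fixed_length(1024))  # fixed-length baseline falls back to the small model
```

With a cost budget of 1024, the fixed-length baseline must drop the large model, while the joint search finds that the large model at 256 tokens fits the budget and has higher predicted quality than either small-model option.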
Details
Motivation: Existing LLM routing methods assume a fixed quality and cost per LLM for a given query, ignoring that the same LLM's quality varies with output length. As a result, highly capable LLMs are wrongly excluded when their estimated cost exceeds the budget, missing their cost-effective performance at shorter outputs.
Method: R2-Router models the output length budget as an optimizable variable, jointly decides the best LLM and length budget, and enforces the budget through length-constrained instructions; the R2-Bench dataset captures LLM response behavior under different length budgets.
Result: R2-Router achieves state-of-the-art performance across experiments at one quarter to one fifth the cost of existing routers; R2-Bench is the first LLM routing benchmark that supports evaluation across multiple length budgets.
Conclusion: Routing should not be only passive selection but an active reasoning process; R2-Router marks a shift in LLM routing from reactive selection to deliberate, reasoning-based decisions.
Abstract: As LLMs proliferate with diverse capabilities and costs, LLM routing has emerged by learning to predict each LLM's quality and cost for a given query, then selecting the one with high quality and low cost. However, existing routers implicitly assume a single fixed quality and cost per LLM for each query, ignoring that the same LLM's quality varies with its output length. This causes routers to exclude powerful LLMs when their estimated cost exceeds the budget, missing the opportunity that these LLMs could still deliver high quality at reduced cost with shorter outputs. To address this, we introduce R2-Router, which treats output length budget as a controllable variable and jointly selects the best LLM and length budget, enforcing the budget via length-constrained instructions. This enables R2-Router to discover that a powerful LLM with constrained output can outperform a weaker LLM at comparable cost: efficient configurations invisible to prior methods. Together with the router framework, we construct R2-Bench, the first routing dataset capturing LLM behavior across diverse output length budgets. Experiments show that R2-Router achieves state-of-the-art performance at 4-5x lower cost compared with existing routers. This work opens a new direction: routing as reasoning, where routers evolve from reactive selectors to deliberate reasoners that explore which LLM to use and at what cost budget.
[15] CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment
Zhengbang Yang,Yisheng Zhong,Junyuan Hong,Zhuangdi Zhu
Main category: cs.CL
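TL;DR: This paper proposes CATNIP, which rescales gradients according to the model's token-level confidence to control forgetting at a fine granularity. Without retention data or contrastive response pairs, it markedly improves the trade-off between unlearning effectiveness and preservation of general capabilities in large language models.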
Details
Motivation: Safety and privacy risks from pretrained knowledge have driven the development of LLM unlearning, but existing gradient-ascent-based methods are prone to catastrophic forgetting and rely on retention data or contrastive samples; negative preference alignment methods are limited by the choice of reference model and degrade on realistic data.
Method: CATNIP (Calibrated and Tokenized Negative Preference Alignment) dynamically rescales gradient updates according to the model's token-level confidence in undesirable knowledge, giving fine-grained, data-efficient, length-robust unlearning.
Result: On the MUSE and WMDP benchmarks, CATNIP achieves stronger knowledge forgetting and a better balance with general-capability preservation than state-of-the-art methods, without retention data or contrastive response pairs.
Conclusion: CATNIP addresses catastrophic forgetting during unlearning as well as robustness to data scarcity and length variation, offering a new paradigm for safe, controllable LLM unlearning.
Abstract: Pretrained knowledge memorized in LLMs raises critical concerns over safety and privacy, which has motivated LLM Unlearning as a technique for selectively removing the influences of undesirable knowledge. Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge while relying on retention data or curated contrastive pairs, which can be either impractical or data and computationally prohibitive. Negative Preference Alignment has been explored for unlearning to tackle the limitations of GA, which, however, remains confined by its choice of reference model and shows undermined performance in realistic data settings. These limitations raise two key questions: i) Can we achieve effective unlearning that quantifies model confidence in undesirable knowledge and uses it to calibrate gradient updates more precisely, thus reducing catastrophic forgetting? ii) Can we make unlearning robust to data scarcity and length variation? We answer both questions affirmatively with CATNIP (Calibrated and Tokenized Negative Preference Alignment), a principled method that rescales unlearning effects in proportion to the model's token-level confidence, thus ensuring fine-grained control over forgetting. Extensive evaluations on MUSE and WMDP benchmarks demonstrated that our work enables effective unlearning without requiring retention data or contrastive unlearning response pairs, with stronger knowledge forgetting and preservation tradeoffs than state-of-the-art methods.
[16] Act or Clarify? Modeling Sensitivity to Uncertainty and Cost in Communication
Polina Tsvilodub,Karl Mulligan,Todd Snider,Robert D. Hawkins,Michael Franke
Main category: cs.CL
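TL;DR: This paper proposes a computational model based on expected regret that explains when people ask clarification questions (CQs) under uncertainty. The tendency to ask depends on the interaction between contextual uncertainty and the cost of acting incorrectly, and people weigh whether to seek clarification in proportion to the risk of loss.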
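Illustrative sketch (not from the paper): expected regret, read as the gap between the expected payoff with full information and the expected payoff of the best action under current beliefs, can be computed directly for a small example. The tea/coffee scenario, the payoffs, and the decision rule "ask when expected regret exceeds the cost of asking" are assumptions made for illustration.

```python
def expected_regret(belief, utility):
    """belief: {state: probability}; utility: {action: {state: payoff}}.
    Regret of acting now = expected payoff with full information
    minus expected payoff of the best action under current uncertainty."""
    informed = sum(p * max(u[s] for u in utility.values()) for s, p in belief.items())
    act_now = max(sum(p * u[s] for s, p in belief.items()) for u in utility.values())
    return informed - act_now

def should_ask(belief, utility, question_cost):
    # Assumed decision rule: ask a clarification question when it is cheaper
    # than the expected loss from acting without it.
    return expected_regret(belief, utility) > question_cost

def make_utility(wrong_cost):
    # Hypothetical scenario: payoff 1 for the right drink, -wrong_cost for the wrong one.
    return {
        "bring_tea": {"wants_tea": 1.0, "wants_coffee": -wrong_cost},
        "bring_coffee": {"wants_tea": -wrong_cost, "wants_coffee": 1.0},
    }

unsure = {"wants_tea": 0.5, "wants_coffee": 0.5}
confident = {"wants_tea": 0.9, "wants_coffee": 0.1}

for wrong_cost in (0.0, 4.0):
    u = make_utility(wrong_cost)
    print(wrong_cost, expected_regret(unsure, u), expected_regret(confident, u))
```

In this toy setup the gap in regret between the unsure and confident beliefs grows from 0.4 to 2.0 as the cost of a wrong action rises from 0 to 4, which is the uncertainty-by-cost interaction the entry describes.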
Details
Motivation: To understand the rational decision mechanism by which people trade off asking against acting directly under uncertainty, with a focus on clarification questions (CQs) as an important communicative strategy for reducing uncertainty.
Method: Build a computational model based on expected regret and test it in two experiments: one on purely linguistic responses, the other extending to choices between clarification questions and non-linguistic action.
Result: The experiments confirm that the tendency to ask for clarification rises with uncertainty and that this effect is stronger when acting incorrectly is costly; overall the pattern is a rational trade-off consistent with the expected-regret model.
Conclusion: Clarification behavior under uncertainty is not random or driven by a fixed threshold; it is adjusted dynamically and rationally, in a risk-sensitive way, according to the cost of potential errors and the level of uncertainty.
Abstract: When deciding how to act under uncertainty, agents may choose to act to reduce uncertainty or they may act despite that uncertainty. In communicative settings, an important way of reducing uncertainty is by asking clarification questions (CQs). We predict that the decision to ask a CQ depends on both contextual uncertainty and the cost of alternative actions, and that these factors interact: uncertainty should matter most when acting incorrectly is costly. We formalize this interaction in a computational model based on expected regret: how much an agent stands to lose by acting now rather than with full information. We test these predictions in two experiments, one examining purely linguistic responses to questions and another extending to choices between clarification and non-linguistic action. Taken together, our results suggest a rational tradeoff: humans tend to seek clarification proportional to the risk of substantial loss when acting under uncertainty.
[17] Which course? Discourse! Teaching Discourse and Generation in the Era of LLMs
Junyi Jessy Li,Yang Janet Liu,Kanishka Misra,Valentina Pyatkin,William Sheffield
Main category: cs.CL
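TL;DR: This paper presents a new undergraduate course, "Computational Discourse and Natural Language Generation", which aims to close the teaching gap between discourse processing and generation amid rapid change in NLP, emphasizing cross-disciplinary integration of linguistics and computer science and close integration of theory and practice.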
Details
Motivation: NLP keeps changing rapidly, yet existing undergraduate curricula give too little attention to discourse processing, particularly its connection to long-form text generation; courses are needed that span sub-disciplines and reflect recent advances.
Method: An interdisciplinary team collaboratively designed a cross-listed course for upper-level undergraduates that combines discourse-linguistic theory with computational modeling practice, uses exploratory teaching, and is evaluated with an independent course survey.
Result: The first offering of "Computational Discourse and Natural Language Generation" ran in Fall 2025, achieved substantive joint course development between Linguistics and Computer Science, and collected initial teaching feedback.
Conclusion: Discourse processing is a key bridge between NLP theory and generation applications; including it systematically in undergraduate education can strengthen students' interdisciplinary skills and research outlook. Course resources and evaluation should be expanded further.
Abstract: The field of NLP has undergone vast, continuous transformations over the past few years, sparking debates going beyond discipline boundaries. This raises important questions in education: how do we design courses that bridge sub-disciplines in this shifting landscape? This paper explores this question from the angle of discourse processing, an area with rich linguistic insights and computational models for the intentional, attentional, and coherence structure of language. Discourse is highly relevant for open-ended or long-form text generation, yet this connection is under-explored in existing undergraduate curricula. We present a new course, "Computational Discourse and Natural Language Generation". The course is collaboratively designed by a team with complementary expertise and was offered for the first time in Fall 2025 as an upper-level undergraduate course, cross-listed between Linguistics and Computer Science. Our philosophy is to deeply integrate the theoretical and empirical aspects, and create an exploratory mindset inside the classroom and in the assignments. This paper describes the course in detail and concludes with takeaways from an independent survey as well as our vision for future directions.
[18] HALT: Hallucination Assessment via Log-probs as Time series
Ahmad Shapiro,Karan Taneja,Ashok Goel
Main category: cs.CL
TL;DR: This paper proposes HALT, a lightweight hallucination detector that treats the top-20 token log-probabilities of LLM generations as a time series and combines a GRU with entropy-based features to learn model calibration bias. It also builds HUB, a unified benchmark covering ten capabilities; experiments show that HALT outperforms existing methods while being smaller and faster.
Details
Motivation: Hallucination seriously hinders the use of large language models in safety-critical domains; an efficient, general detection method that needs no access to model internals is required.
Method: Proposes the HALT detector, which models the top-20 token log-probabilities output by an LLM as a time series and uses a lightweight GRU network together with entropy and other calibration-related features; also builds HUB, a unified benchmark covering ten capabilities, for systematic evaluation.
Result: HALT is 1/30 the size of the Lettuce baseline, runs 60x faster, and performs better on HUB; it does not depend on hidden states or model weights and is compatible with black-box and proprietary LLMs.
Conclusion: Together, HALT and HUB form an efficient, general, easy-to-deploy framework for hallucination detection, improving the trustworthiness and practicality of LLMs across many task types.
Abstract: Hallucinations remain a major obstacle for large language models (LLMs), especially in safety-critical domains. We present HALT (Hallucination Assessment via Log-probs as Time series), a lightweight hallucination detector that leverages only the top-20 token log-probabilities from LLM generations as a time series. HALT uses a gated recurrent unit model combined with entropy-based features to learn model calibration bias, providing an extremely efficient alternative to large encoders. Unlike white-box approaches, HALT does not require access to hidden states or attention maps, relying only on output log-probabilities. Unlike black-box approaches, it operates on log-probs rather than surface-form text, which enables stronger domain generalization and compatibility with proprietary LLMs without requiring access to internal weights. To benchmark performance, we introduce HUB (Hallucination detection Unified Benchmark), which consolidates prior datasets into ten capabilities covering both reasoning tasks (Algorithmic, Commonsense, Mathematical, Symbolic, Code Generation) and general purpose skills (Chat, Data-to-Text, Question Answering, Summarization, World Knowledge). While being 30x smaller, HALT outperforms Lettuce, a fine-tuned modernBERT-base encoder, achieving a 60x speedup gain on HUB. HALT and HUB together establish an effective framework for hallucination detection across diverse LLM capabilities.
[19] Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness
Alireza Amiri-Margavi,Arshia Gharagozlou,Amin Gholami Davodi,Seyed Pouyan Mousavi Davoudi,Hamidreza Hasani Balyani
Main category: cs.CL
TL;DR: Through controlled experiments, this paper audits the fairness of interaction quality in large language models (LLMs) once access has been granted. Although GPT-4 and LLaMA-3.1-70B show no difference in refusal rates (that is, access is fair), they exhibit significant, systematic disparities in tone, expression of uncertainty, and linguistic framing that correlate with user identity (age, gender, nationality).
Details
Motivation: Existing fairness research focuses mostly on the access level (for example, refusals), but equal access does not imply equal interaction quality; this paper aims to expose hidden unfairness that can persist after access is granted.
Method: Uses a counterfactual prompt design that systematically varies user identity attributes (age, gender, nationality) in career-advice tasks, measures interaction quality with automated linguistic metrics (sentiment, politeness, hedging), and identifies identity-related differences with paired statistical tests.
Result: Both models have zero refusal rates (access is fair), but GPT-4 hedges significantly more toward younger male users, while LLaMA shows broader sentiment variation across identity groups.
Conclusion: Fairness problems can exist at the level of interaction quality even when access is fully equal; LLM fairness evaluation therefore needs to go beyond refusal-only audits.
Abstract: Prior work on fairness in large language models (LLMs) has primarily focused on access-level behaviors such as refusals and safety filtering. However, equitable access does not ensure equitable interaction quality once a response is provided. In this paper, we conduct a controlled fairness audit examining how LLMs differ in tone, uncertainty, and linguistic framing across demographic identities after access is granted. Using a counterfactual prompt design, we evaluate GPT-4 and LLaMA-3.1-70B on career advice tasks while varying identity attributes along age, gender, and nationality. We assess access fairness through refusal analysis and measure interaction quality using automated linguistic metrics, including sentiment, politeness, and hedging. Identity-conditioned differences are evaluated using paired statistical tests. Both models exhibit zero refusal rates across all identities, indicating uniform access. Nevertheless, we observe systematic, model-specific disparities in interaction quality: GPT-4 expresses significantly higher hedging toward younger male users, while LLaMA exhibits broader sentiment variation across identity groups. These results show that fairness disparities can persist at the interaction level even when access is equal, motivating evaluation beyond refusal-based audits.
[20] Where Norms and References Collide: Evaluating LLMs on Normative Reasoning
Mitchell Abrams,Kaveh Eskandari Miandoab,Felix Gervits,Vasanth Sarathy,Matthias Scheutz
Main category: cs.CL
TL;DR: This paper presents SNIC, a testbed for evaluating how well large language models understand and apply social norms in norm-based reference resolution (NBRR), and finds that even the strongest current LLMs perform poorly when social norms are implicit, ambiguous, or in conflict.
Details
Motivation: Embodied agents such as robots need to interact with humans in real environments, and that interaction depends on reasoning about social norms (shared expectations of what behavior is appropriate in context); it is unclear whether existing LLMs can support this kind of normative reasoning.
Method: Builds SNIC, a human-validated diagnostic testbed focused on physically grounded social norms that arise in everyday physical tasks (such as cleaning, tidying, and serving), and evaluates mainstream LLMs on norm extraction and application through a series of controlled experiments.
Result: Even state-of-the-art LLMs have clear difficulty identifying and applying social norms consistently, especially when the norms are implicit, underspecified, or conflicting.
Conclusion: Current LLMs have a critical blind spot in social-norm reasoning, which poses a major challenge for deploying language models in embodied, socially situated settings.
Abstract: Embodied agents, such as robots, will need to interact in situated environments where successful communication often depends on reasoning over social norms: shared expectations that constrain what actions are appropriate in context. A key capability in such settings is norm-based reference resolution (NBRR), where interpreting referential expressions requires inferring implicit normative expectations grounded in physical and social context. Yet it remains unclear whether Large Language Models (LLMs) can support this kind of reasoning. In this work, we introduce SNIC (Situated Norms in Context), a human-validated diagnostic testbed designed to probe how well state-of-the-art LLMs can extract and utilize normative principles relevant to NBRR. SNIC emphasizes physically grounded norms that arise in everyday tasks such as cleaning, tidying, and serving. Across a range of controlled evaluations, we find that even the strongest LLMs struggle to consistently identify and apply social norms, particularly when norms are implicit, underspecified, or in conflict. These findings reveal a blind spot in current LLMs and highlight a key challenge for deploying language-based systems in socially situated, embodied settings.
[21] CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning
Ran Li,Zeyuan Liu,Yinghao chen,Bingxiang He,Jiarui Yuan,Zixuan Fu,Weize Chen,Jinyi Hu,Zhiyuan Liu,Maosong Sun
Main category: cs.CL
TL;DR: This paper proposes CPMöbius, a collaborative Coach-Player reinforcement learning paradigm that needs no external data, to improve the mathematical reasoning of large language models; on Qwen2.5-Math-7B-Instruct it clearly outperforms existing unsupervised methods.
Details
Motivation: Progress of large language models on complex reasoning is limited by reliance on large amounts of high-quality human-curated tasks and labels, which makes supervision-heavy training hard to keep scaling.
Method: Proposes the CPMöbius paradigm, in which the Coach and Player cooperate rather than compete: the Coach generates tasks matched to the Player's ability and is rewarded according to changes in the Player's performance, while the Player is rewarded for solving these progressive tasks, forming a cooperative optimization loop.
Result: On Qwen2.5-Math-7B-Instruct, overall accuracy improves by +4.9 and out-of-distribution (OOD) accuracy by +5.4, exceeding RENT by +1.5 and R-zero by +4.2, respectively.
Conclusion: CPMöbius shows that model reasoning can be improved effectively without externally annotated data, offering a new route to reducing the data dependence of reasoning-model training.
Abstract: Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real-world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy.
[22] LatentMem: Customizing Latent Memory for Multi-Agent Systems
Muxin Fu,Guibin Zhang,Xiangyuan Xue,Yafu Li,Zefeng He,Siyuan Huang,Xiaoye Qu,Yu Cheng,Yang Yang
Main category: cs.CL
TL;DR: This paper proposes LatentMem, a framework that addresses memory homogenization and information overload in multi-agent systems through role-aware latent memory customization and lightweight experience storage, and uses the LMPO algorithm to optimize memory representations, yielding substantial performance gains.
Details
Motivation: Existing multi-agent memory designs suffer from two bottlenecks, memory homogenization (no role-aware customization) and information overload (overly fine-grained memory entries), which limit a system's capacity for continual adaptation.
Method: Proposes LatentMem, comprising a lightweight experience bank and a memory composer that synthesizes compact latent memories from retrieved experience and role context; also designs Latent Memory Policy Optimization (LMPO), which propagates task-level optimization signals back to the memory composer to increase memory utility.
Result: Experiments on multiple benchmarks and mainstream multi-agent frameworks show gains of up to 19.36% over the baseline, and LatentMem consistently outperforms existing memory architectures without modifying the underlying frameworks.
Conclusion: With a learnable, role-aware, token-efficient memory mechanism, LatentMem mitigates homogenization and overload in multi-agent memory and provides a more robust, scalable memory foundation for LLM-driven multi-agent systems.
Abstract: Large language model (LLM)-powered multi-agent systems (MAS) demonstrate remarkable collective intelligence, wherein multi-agent memory serves as a pivotal mechanism for continual adaptation. However, existing multi-agent memory designs remain constrained by two fundamental bottlenecks: (i) memory homogenization arising from the absence of role-aware customization, and (ii) information overload induced by excessively fine-grained memory entries. To address these limitations, we propose LatentMem, a learnable multi-agent memory framework designed to customize agent-specific memories in a token-efficient manner. Specifically, LatentMem comprises an experience bank that stores raw interaction trajectories in a lightweight form, and a memory composer that synthesizes compact latent memories conditioned on retrieved experience and agent-specific contexts. Further, we introduce Latent Memory Policy Optimization (LMPO), which propagates task-level optimization signals through latent memories to the composer, encouraging it to produce compact and high-utility representations. Extensive experiments across diverse benchmarks and mainstream MAS frameworks show that LatentMem achieves a performance gain of up to 19.36% over vanilla settings and consistently outperforms existing memory architectures, without requiring any modifications to the underlying frameworks.
[23] SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression
Xing Hu,Dawei Yang,Yuan Cheng,Zhixuan Chen,Zukang Xu
Main category: cs.CL
TL;DR: This paper proposes SAES-SVD, a low-rank compression framework for LLMs that jointly optimizes intra-layer reconstruction and inter-layer error compensation to mitigate the propagation and accumulation of errors, substantially improving post-compression performance.
Details
Motivation: Existing low-rank compression methods compress each layer independently and minimize only per-layer reconstruction error, ignoring the fact that errors propagate and accumulate through the network and amplify global deviation.
Method: Proposes SAES-SVD with two core components: (1) Cumulative Error-Aware Layer Compression (CEALC), which combines local reconstruction with weighted cumulative error compensation and derives a closed-form low-rank solution from second-order activation statistics; (2) Adaptive Collaborative Error Suppression (ACES), which dynamically optimizes the weighting coefficient to strengthen the low-rank structure of the compression objective and use the rank budget efficiently.
Result: Experiments across multiple LLM architectures and tasks show that SAES-SVD consistently improves post-compression performance without fine-tuning or mixed-rank strategies.
Conclusion: By modeling and suppressing error propagation, SAES-SVD offers a more robust and efficient hardware-agnostic solution for low-rank LLM compression.
Abstract: The rapid growth in the parameter scale of large language models (LLMs) has created a high demand for efficient compression techniques. As a hardware-agnostic and highly compatible technique, low-rank compression has been widely adopted. However, existing methods typically compress each layer independently by minimizing per-layer reconstruction error, overlooking a critical limitation: the reconstruction error propagates and accumulates through the network, which leads to amplified global deviations from the full-precision baseline. To address this, we propose Self-Adaptive Error Suppression SVD (SAES-SVD), an LLM compression framework that jointly optimizes intra-layer reconstruction and inter-layer error compensation. SAES-SVD is composed of two novel components: (1) Cumulative Error-Aware Layer Compression (CEALC), which formulates the compression objective as a combination of local reconstruction and weighted cumulative error compensation. Based on it, we derive a closed-form low-rank solution that relies on second-order activation statistics and explicitly aligns each layer's output with its full-precision counterpart to compensate for accumulated errors. (2) Adaptive Collaborative Error Suppression (ACES), which automatically adjusts the weighting coefficient to enhance the low-rank structure of the compression objective in CEALC. Specifically, the coefficient is optimized to maximize the ratio between the Frobenius norm of the compressed layer's output and that of the compression objective under a fixed rank, thus ensuring that the rank budget is utilized effectively.
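To make the accumulation problem named in this abstract concrete, here is a minimal NumPy sketch. It is not the authors' CEALC or ACES procedure: it only shows the baseline the abstract criticizes, where each layer is compressed independently with plain truncated SVD, and it records how far the compressed activations drift from the full-precision ones after every layer. The toy network, its tanh nonlinearity, the sizes, and the helper names `truncated_svd` and `relative_error_per_layer` are all assumptions made for illustration.

```python
import numpy as np


def truncated_svd(weight: np.ndarray, rank: int) -> np.ndarray:
    """Best rank-`rank` approximation of `weight` in Frobenius norm."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    # Scale the kept left singular vectors by their singular values, then project back.
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def relative_error_per_layer(weights, x, rank):
    """Relative output error after each layer when every layer is compressed on its own."""
    h_full, h_comp = x, x
    errors = []
    for w in weights:
        h_full = np.tanh(h_full @ w)                       # full-precision path
        h_comp = np.tanh(h_comp @ truncated_svd(w, rank))  # independently compressed path
        errors.append(np.linalg.norm(h_full - h_comp) / np.linalg.norm(h_full))
    return errors


# Hypothetical toy setup: 6 random square layers and a batch of 32 inputs.
rng = np.random.default_rng(0)
dim, depth = 64, 6
weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(depth)]
x = rng.standard_normal((32, dim))

errors = relative_error_per_layer(weights, x, rank=16)
for layer, err in enumerate(errors, start=1):
    print(f"layer {layer}: relative output error {err:.3f}")
```

Printing the per-layer error lets one inspect how the deviation from the full-precision path evolves with depth. A method of the kind the abstract describes would add a compensation term to each layer's objective instead of fitting each weight matrix in isolation.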
Extensive experiments across multiple LLM architectures and tasks show that, without fine-tuning or mixed-rank strategies, SAES-SVD consistently improves post-compression performance.
[24] ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution
Junjie Huang,Jiarui Qin,Di Yin,Weiwen Liu,Yong Yu,Xing Sun,Weinan Zhang
Main category: cs.CL
TL;DR: This paper proposes ReMiT, a reinforcement learning (RL)-guided mid-training method that uses the reasoning priors of RL-tuned models to reweight tokens dynamically, creating a bidirectional feedback loop between pre-training and post-training without an extra teacher model and improving performance on multiple benchmarks.
Details
Motivation: Current LLM training pipelines are unidirectional (pre-training → post-training); the bidirectional process in which insights from post-training feed back to strengthen the pre-trained base model has not been studied.
Method: Proposes ReMiT: identifies mid-training, the phase at the end of pre-training when the learning rate decays rapidly, as a critical turning point for capabilities, and in that phase uses the reasoning priors of RL-tuned models to dynamically reweight the tokens most important for reasoning.
Result: Average improvement of 3% on 10 pre-training benchmarks spanning math, code, and general reasoning; the gain remains above 2% throughout the subsequent post-training pipeline.
Conclusion: Confirms that a self-reinforcing flywheel can be built between pre-training and post-training, enabling continuous, self-reinforcing evolution of LLMs.
Abstract: Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process--where insights from post-training retroactively improve the pre-trained foundation--remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training, utilizing high-quality corpora under a rapidly decaying learning rate. Building upon this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks, spanning math, code, and general reasoning, and sustains these gains by over 2% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous and self-reinforcing evolution of LLMs.
[25] AERO: Autonomous Evolutionary Reasoning Optimization via Endogenous Dual-Loop Feedback
Zhitao Gao,Jie Ma,Xuhong Li,Pengyu Li,Ning Qu,Yaqiang Wu,Hui Liu,Jun Liu
Main category: cs.CL
TL;DR: This paper proposes AERO, a framework that achieves unsupervised, autonomous reasoning evolution through entropy-based positioning and counterfactual correction, substantially improving large language models on multiple benchmarks.
Details
Motivation: Existing self-evolution paradigms struggle to identify the optimal learning zone and, because of flawed internal feedback, tend to reinforce incorrect priors and collective hallucinations.
Method: Proposes AERO, comprising an entropy-based positioning mechanism grounded in ZPD theory, an Independent Counterfactual Correction verification method, and a Staggered Training Strategy that synchronizes capability growth across functional roles.
Result: Across nine benchmarks spanning multiple domains, AERO improves average performance by 4.57% on Qwen3-4B-Base and 5.10% on Qwen3-8B-Base, outperforming competitive baselines.
Conclusion: AERO is an effective unsupervised framework for autonomous reasoning evolution that overcomes the dependence on expert annotation and the external-verification bottleneck, improving the robustness and generalization of model reasoning.
Abstract: Large Language Models (LLMs) have achieved significant success in complex reasoning but remain bottlenecked by reliance on expert-annotated data and external verifiers. While existing self-evolution paradigms aim to bypass these constraints, they often fail to identify the optimal learning zone and risk reinforcing collective hallucinations and incorrect priors through flawed internal feedback. To address these challenges, we propose Autonomous Evolutionary Reasoning Optimization (AERO), an unsupervised framework that achieves autonomous reasoning evolution by internalizing self-questioning, answering, and criticism within a synergistic dual-loop system. Inspired by the Zone of Proximal Development (ZPD) theory, AERO utilizes entropy-based positioning to target the "solvability gap" and employs Independent Counterfactual Correction for robust verification. Furthermore, we introduce a Staggered Training Strategy to synchronize capability growth across functional roles and prevent curriculum collapse. Extensive evaluations across nine benchmarks spanning three domains demonstrate that AERO achieves average performance improvements of 4.57% on Qwen3-4B-Base and 5.10% on Qwen3-8B-Base, outperforming competitive baselines. Code is available at https://github.com/mira-ai-lab/AERO.
[26] Test-time Recursive Thinking: Self-Improvement without External Feedback
Yufan Zhuang,Chandan Singh,Liyuan Liu,Yelong Shen,Dinghuai Zhang,Jingbo Shang,Jianfeng Gao,Weizhu Chen
Main category: cs.CL
TL;DR: This paper proposes Test-time Recursive Thinking (TRT), a framework that lets large language models self-improve on reasoning tasks without additional training, substantially raising accuracy on several benchmarks.
Details
Motivation: Investigates whether large language models can self-improve without additional training, and addresses two challenges: generating high-quality candidate solutions and selecting the correct answer without supervision.
Method: Proposes the Test-time Recursive Thinking (TRT) framework, which improves performance at inference time by iteratively combining rollout-specific strategies, accumulated knowledge, and self-generated verification signals.
Result: Open-source models reach 100% accuracy on AIME-25/24; closed-source models improve by 10.4-14.8 percentage points on the hardest LiveCodeBench problems, without external feedback.
Conclusion: TRT shows that large language models can achieve substantial self-improvement at test time through recursive self-thinking, offering a new paradigm for training-free optimization.
Abstract: Modern Large Language Models (LLMs) have shown rapid improvements in reasoning capabilities, driven largely by reinforcement learning (RL) with verifiable rewards. Here, we ask whether these LLMs can self-improve without the need for additional training. We identify two core challenges for such systems: (i) efficiently generating diverse, high-quality candidate solutions, and (ii) reliably selecting correct answers in the absence of ground-truth supervision. To address these challenges, we propose Test-time Recursive Thinking (TRT), an iterative self-improvement framework that conditions generation on rollout-specific strategies, accumulated knowledge, and self-generated verification signals. Using TRT, open-source models reach 100% accuracy on AIME-25/24, and on LiveCodeBench's most difficult problems, closed-source models improve by 10.4-14.8 percentage points without external feedback.
[27] Task-Specificity Score: Measuring How Much Instructions Really Matter for Supervision
Pritam Kadasi,Abhishek Upperwal,Mayank Singh
Main category: cs.CL
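TL;DR: This paper proposes the Task-Specificity Score (TSS) and its refinement TSS++ to quantify how much an instruction matters for predicting the output, and shows that they help improve downstream task performance.
Details
Motivation: Instruction tuning has become the mainstream way to train large language models, but in many instruction-input-output triples the instruction is only weakly tied to the output: for a given input, the same output can remain plausible under several different instructions. It is therefore necessary to assess whether the instruction uniquely determines the output.
Method: Proposes the Task-Specificity Score (TSS), which measures instruction specificity by contrasting output prediction under the true instruction with that under several plausible alternative instructions for the same input; further proposes TSS++, which adds hard negatives and a quality correction term to mitigate easy-negative effects.
Result: Validated on three instruction datasets (Alpaca, Dolly-15k, NI-20) and three open LLMs (Gemma, Llama, Qwen): selecting highly task-specific examples with TSS/TSS++ improves downstream performance under limited token budgets and complements quality filters such as perplexity and IFD.
Conclusion: Instruction specificity is an important dimension affecting instruction-tuning outcomes; the TSS family of metrics offers a new perspective on data selection and helps build high-quality instruction-tuning datasets more efficiently under resource constraints.
Abstract: Instruction tuning is now the default way to train and adapt large language models, but many instruction-input-output pairs are only weakly specified: for a given input, the same output can remain plausible under several alternative instructions. This raises a simple question: does the instruction uniquely determine the target output? We propose the Task-Specificity Score (TSS) to quantify how much an instruction matters for predicting its output, by contrasting the true instruction against plausible alternatives for the same input. We further introduce TSS++, which uses hard alternatives and a small quality term to mitigate easy-negative effects. Across three instruction datasets (Alpaca, Dolly-15k, NI-20) and three open LLMs (Gemma, Llama, Qwen), we show that selecting task-specific examples improves downstream performance under tight token budgets and complements quality-based filters such as perplexity and IFD.
[28] The Mask of Civility: Benchmarking Chinese Mock Politeness Comprehension in Large Language Models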
Yitong Zhang,Yuhan Xiang,Mingxuan Liu
Main category: cs.CL
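TL;DR: From a pragmatic perspective, this study systematically evaluates how representative large language models differ in recognizing politeness, impoliteness, and mock politeness in Chinese; it builds a three-category dataset grounded in pragmatic theory and evaluates six models under several prompting strategies.
Details
Motivation: To address the shortcomings of current large language models in pragmatic comprehension, in particular their weak ability to recognize politeness, impoliteness, and mock politeness in Chinese.
Method: Builds a three-category dataset of authentic and simulated Chinese discourse based on Rapport Management Theory and the Model of Mock Politeness, and evaluates six representative models, including GPT-5.1 and DeepSeek, under four prompting conditions: zero-shot, few-shot, knowledge-enhanced, and hybrid.
Result: Reveals performance differences among models in recognizing the three pragmatic phenomena and supports the effectiveness of an evaluation framework guided by pragmatic theory.
Conclusion: The study is a meaningful attempt within the "Great Linguistics" paradigm, offering a new path for applying pragmatic theory in the AI era and advancing interdisciplinary work between language technology and the humanities.
Abstract: From a pragmatic perspective, this study systematically evaluates the differences in performance among representative large language models (LLMs) in recognizing politeness, impoliteness, and mock politeness phenomena in Chinese. Addressing the existing gaps in pragmatic comprehension, the research adopts the frameworks of Rapport Management Theory and the Model of Mock Politeness to construct a three-category dataset combining authentic and simulated Chinese discourse. Six representative models, including GPT-5.1 and DeepSeek, were selected as test subjects and evaluated under four prompting conditions: zero-shot, few-shot, knowledge-enhanced, and hybrid strategies. This study serves as a meaningful attempt within the paradigm of "Great Linguistics," offering a novel approach to applying pragmatic theory in the age of technological transformation. It also responds to the contemporary question of how technology and the humanities may coexist, representing an interdisciplinary endeavor that bridges linguistic technology and humanistic reflection.
[29] ChemPro: A Progressive Chemistry Benchmark for Large Language Models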
Aaditya Baranwal,Shruti Vyas
Main category: cs.CL
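TL;DR: ChemPro is a progressive benchmark of 4100 natural-language question-answer pairs in chemistry for evaluating large language models (LLMs) across general chemistry topics. Experiments show that LLMs do well on basic questions, but their accuracy drops markedly on complex, multi-concept, or long-horizon reasoning problems, exposing key limitations in scientific reasoning and understanding.
Details
Motivation: Existing benchmarks cannot comprehensively assess LLMs' progressive reasoning and deep understanding in scientific fields such as chemistry; an evaluation suite is lacking that covers multiple sub-disciplines and difficulty levels and resembles real academic assessment.
Method: Builds the ChemPro benchmark: 4100 QA pairs in 4 coherent difficulty sections, covering four branches of chemistry (biochemistry, inorganic, organic, and physical chemistry), with multiple-choice and numerical questions spread across information recall, long-horizon reasoning, multi-concept integration, and nuanced problem solving; evaluates 45+7 mainstream open-source and proprietary LLMs.
Result: LLMs perform well on basic chemistry questions, but performance drops markedly as question type and difficulty increase (for example, multi-concept, long-horizon reasoning, and numerical questions); there are clear gaps between models, and scientific reasoning is weak overall.
Conclusion: Current LLMs still face serious capability bottlenecks on general chemistry tasks, especially in complex scientific reasoning; ChemPro surfaces overlooked dimensions of difficulty and calls for more robust training and evaluation methods to improve models' scientific competence.
Abstract: We introduce ChemPro, a progressive benchmark with 4100 natural language question-answer pairs in Chemistry, across 4 coherent sections of difficulty designed to assess the proficiency of Large Language Models (LLMs) in a broad spectrum of general chemistry topics. We include Multiple Choice Questions and Numerical Questions spread across fine-grained information recall, long-horizon reasoning, multi-concept questions, problem-solving with nuanced articulation, and straightforward questions in a balanced ratio, effectively covering Bio-Chemistry, Inorganic-Chemistry, Organic-Chemistry and Physical-Chemistry. ChemPro is carefully designed analogous to a student's academic evaluation for basic to high-school chemistry. A gradual increase in the question difficulty rigorously tests the ability of LLMs to progress from solving basic problems to solving more sophisticated challenges. We evaluate 45+7 state-of-the-art LLMs, spanning both open-source and proprietary variants, and our analysis reveals that while LLMs perform well on basic chemistry questions, their accuracy declines with different types and levels of complexity. These findings highlight the critical limitations of LLMs in general scientific reasoning and understanding and point towards understudied dimensions of difficulty, emphasizing the need for more robust methodologies to improve LLMs.
[30] One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence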
Bowen Jiang,Taiwei Shi,Ryo Kamoi,Yuan Yuan,Camillo J. Taylor,Longqi Yang,Pei Zhou,Sihao Chen
Main category: cs.CL
TL;DR: This paper proposes OMAR, a framework in which a single model plays every role through multi-turn, multi-agent conversational self-play, thereby acquiring social intelligence such as empathy and persuasion without supervision.
Details
Motivation: Traditional approaches rely on static, single-turn optimization and struggle to model long-term goals and complex social norms; the question is how AI can develop social intelligence on its own through dynamic social interaction.
Method: Proposes OMAR, a reinforcement-learning-based multi-role self-play framework that uses hierarchical advantage estimation (turn-level and token-level) to keep training stable across long dialogues.
Result: Evaluated in the SOTOPIA and Werewolf environments, the models develop fine-grained, emergent social intelligence such as empathy, persuasion, and compromise seeking, and learn to collaborate even in competitive scenarios; challenges such as reward hacking remain.
Conclusion: A single model can develop rich social intelligence without supervision through multi-role self-play, offering a new paradigm for research on AI social intelligence in group conversations.
Abstract: This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework that enables AI to develop social intelligence through multi-turn, multi-agent conversational self-play. Unlike traditional paradigms that rely on static, single-turn optimizations, OMAR allows a single model to role-play all participants in a conversation simultaneously, learning to achieve long-term goals and complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we implement a hierarchical advantage estimation that calculates turn-level and token-level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking, demonstrating the effectiveness of learning collaboration even under competitive scenarios. While we identify practical challenges like reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work incentivizes further research on AI social intelligence in group conversations.
[31] Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization
Runquan Gui,Jie Wang,Zhihai Wang,Chi Ma,Jianye Hao,Feng Wu
Main category: cs.CL
TL;DR: This paper proposes CoSMo, a framework that uses consistency-guided split-merge optimization to refine the structure of reasoning chains dynamically, removing redundancy and filling logical gaps, which improves accuracy while substantially reducing computational overhead.
Details
Motivation: Large Reasoning Models (LRMs) depend on long reasoning chains, which causes high latency and computational overhead; structural redundancy should be eliminated rather than simply compressing the token count.
Method: Proposes CoSMo, comprising a dynamic split-merge algorithm (merging redundant segments, splitting at logical gaps) and structure-aligned reinforcement learning with a segment-level budget constraint.
Result: Experiments on multiple benchmarks and backbones show a 3.3-point accuracy gain and a 28.7% average reduction in segment usage compared with reasoning-efficiency baselines.
Conclusion: CoSMo balances reasoning quality and efficiency, indicating that structural optimization serves efficient reasoning better than compression alone.
Abstract: While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and computational overhead. To address these challenges, we propose CoSMo (Consistency-Guided Split-Merge Optimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume. Specifically, CoSMo utilizes a split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting logical gaps to ensure coherence. We then employ structure-aligned reinforcement learning with a novel segment-level budget to supervise the model in maintaining efficient reasoning structures throughout training. Extensive experiments across multiple benchmarks and backbones demonstrate that CoSMo achieves superior performance, improving accuracy by 3.3 points while reducing segment usage by 28.7% on average compared to reasoning efficiency baselines.
[32] FASA: Frequency-aware Sparse Attention
Yifei Wang,Yueqi Wang,Zhenrui Yue,Huimin Zeng,Yong Wang,Ismini Lourentzou,Zhengzhong Tu,Xiangxiang Chu,Julian McAuley
Main category: cs.CL
TL;DR: This paper proposes FASA, a framework that exploits functional sparsity in RoPE to find dominant components at the frequency-chunk level, dynamically predicts token importance, and performs query-aware token pruning, greatly reducing KV cache cost while staying close to full-KV performance.
Details
Motivation: Large language models face a bottleneck from the excessive memory footprint of the KV cache when handling long inputs; existing token-pruning methods (static, or dynamic with heuristics) fail to both preserve information and model the query-dependent nature of token importance.
Method: Building on the new finding of functional sparsity at the frequency-chunk (FC) level in RoPE, identifies "dominant FCs" that agree closely with the full attention head and uses them as a computationally free proxy for token importance; it then dynamically selects a critical subset of tokens and computes attention only on that subset.
Result: On LongBench-V1, reaches nearly 100% of full-KV performance while keeping only 256 tokens; on AIME24, achieves a 2.56× speedup using 18.9% of the cache; across a range of long-context tasks it consistently outperforms all token-pruning baselines and achieves near-oracle accuracy.
Conclusion: With a lightweight, query-aware token-pruning mechanism, FASA relieves the KV cache bottleneck in long-context LLM inference and strikes a strong balance between accuracy and efficiency.
Abstract: The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset.
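The two-stage idea in this abstract (score tokens with a cheap proxy, then attend only over the retained ones) can be sketched in NumPy as follows. This is a hedged illustration, not FASA itself: no RoPE is applied, and the way dominant frequency chunks are identified is not described in the abstract, so the proxy dimensions are simply passed in as a given index set. The names `pruned_attention`, `proxy_dims`, and `budget` are hypothetical.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def pruned_attention(q, keys, values, proxy_dims, budget):
    """Single-query attention restricted to `budget` tokens chosen by a cheap proxy."""
    budget = min(budget, keys.shape[0])
    # Stage 1: proxy score from a partial dot product over a subset of head dimensions.
    proxy = keys[:, proxy_dims] @ q[proxy_dims]
    # Keep the `budget` highest-scoring tokens, restored to their original order.
    keep = np.sort(np.argpartition(proxy, -budget)[-budget:])
    # Stage 2: ordinary scaled dot-product attention, but only over the kept tokens.
    scores = keys[keep] @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ values[keep], keep


# Hypothetical toy data: 32 cached tokens with head dimension 16.
rng = np.random.default_rng(0)
n_tokens, head_dim = 32, 16
q = rng.standard_normal(head_dim)
keys = rng.standard_normal((n_tokens, head_dim))
values = rng.standard_normal((n_tokens, head_dim))

out, keep = pruned_attention(q, keys, values, proxy_dims=np.arange(4), budget=8)
print("kept token indices:", keep)
```

Only the retained rows of `keys` and `values` are read in the second stage, which is where a scheme of this kind would save memory bandwidth. With all dimensions as the proxy and a budget equal to the sequence length, the function reduces to ordinary full attention.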
Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance when only keeping 256 tokens, and achieves 2.56× speedup using just 18.9% of the cache on AIME24.
[33] Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch
Hyunwoo Kim,Niloofar Mireshghallah,Michael Duan,Rui Xin,Shuyue Stella Li,Jaehun Jung,David Acuna,Qi Pang,Hanshen Xiao,G. Edward Suh,Sewoong Oh,Yulia Tsvetkov,Pang Wei Koh,Yejin Choi
Main category: cs.CL
TL;DR: This paper presents Privasis, the first million-scale, fully synthetic dataset of privacy-sensitive text, intended to advance privacy research, and uses it to build efficient text sanitization models.
Details
Motivation: Research on privacy-sensitive data has long been limited by data scarcity, while modern AI agents (such as OpenClaw and Gemini Agent) increasingly access highly sensitive personal information; safe, diverse, large-scale synthetic data is urgently needed.
Method: Designs and builds Privasis, a fully synthetic privacy-text dataset with 1.4 million records and 55.1 million annotated attributes, covering document types such as medical, legal, and financial records; builds a parallel sanitization corpus from it and develops lightweight (≤4B) sanitization models.
Result: The compact sanitization models outperform large models such as GPT-5 and Qwen-3 235B; the data, models, and code are planned for release.
Conclusion: Privasis fills the gap of high-quality, large-scale synthetic data in privacy-sensitive research and provides a foundation for text sanitization and privacy-enhancing AI agents.
Abstract: Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents--such as OpenClaw and Gemini Agent--are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale fully synthetic dataset entirely built from scratch--an expansive reservoir of texts with rich and diverse private information--designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale with quality, and far greater diversity across various document types, including medical history, legal documents, financial records, calendars, and text messages with a total of 55.1 million annotated attributes such as ethnicity, date of birth, workplace, etc. We leverage Privasis to construct a parallel corpus for text sanitization with our pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (<=4B) trained on this dataset outperform state-of-the-art large language models, such as GPT-5 and Qwen-3 235B. We plan to release data, models, and code to accelerate future research on privacy-sensitive domains and agents.
[34] ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution
Zican Dong,Peiyu Liu,Junyi Li,Zhipeng Chen,Han Peng,Shuo Wang,Wayne Xin Zhao
Main category: cs.CL
TL;DR: This paper proposes ForesightKV, a training-based KV cache eviction framework that predicts which KV pairs to evict during generation, reducing cache cost while preserving reasoning performance.
Details
Motivation: Existing KV cache eviction methods fail to capture complex KV dependencies, which degrades performance, while the linear growth of the KV cache during long-text generation incurs significant memory and compute costs.
Method: Designs the Golden Eviction algorithm, which uses future attention scores to determine the optimal KV pairs to evict at each step; distills this policy through supervised learning with a Pairwise Ranking Loss; and further formulates cache eviction as a Markov Decision Process, applying GRPO reinforcement learning to mitigate the increase in language modeling loss on low-entropy tokens.
Result: On the AIME2024 and AIME2025 benchmarks, ForesightKV consistently outperforms prior methods with only half the cache budget, and benefits synergistically from both supervised and reinforcement learning.
Conclusion: ForesightKV balances efficiency and performance in long-text reasoning, supporting the feasibility and advantage of a training-based, forward-looking KV eviction policy.
Abstract: Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generations. We first design the Golden Eviction algorithm, which identifies the optimal eviction KV pairs at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens. Experiments on AIME2024 and AIME2025 benchmarks of three reasoning models demonstrate that ForesightKV consistently outperforms prior methods under only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning approaches.
[35] Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection
Dongwon Jo,Beomseok Kang,Jiwon Song,Jae-Joon Kim
Main category: cs.CL
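TL;DR: This paper proposes Token Sparse Attention, a lightweight, dynamic token-level sparsification mechanism that compresses and decompresses Q/K/V in each attention layer so that token information can be reconsidered in later layers, improving the accuracy-latency trade-off for long-context inference.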
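The compress-then-decompress idea can be sketched for a single attention head as below. This is a minimal illustration, not the authors' implementation: the function name `token_sparse_attention`, the externally supplied importance `scores`, the top-k selection rule, the non-causal attention, and the zero fill for unselected positions are all assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_sparse_attention(q, k, v, scores, keep):
    """Hypothetical single-head sketch.

    q, k, v: arrays of shape (seq_len, head_dim).
    scores:  per-token importance of shape (seq_len,), assumed given.
    keep:    number of tokens retained for this layer and head.
    """
    seq_len, head_dim = q.shape
    # Compress: indices of the `keep` highest-scoring tokens, in sequence order.
    idx = np.sort(np.argsort(scores)[-keep:])
    q_s, k_s, v_s = q[idx], k[idx], v[idx]
    # Ordinary dense attention on the reduced token set (non-causal for brevity).
    attn = softmax(q_s @ k_s.T / np.sqrt(head_dim))
    out_s = attn @ v_s
    # Decompress: scatter results back to the full sequence length.
    # Assumption: unselected positions receive zeros here.
    out = np.zeros_like(v)
    out[idx] = out_s
    return out, idx
```

Because the output keeps the original sequence length, a later layer can select a different token subset, which is the sense in which no token is dropped permanently. Under the zero-fill assumption, a surrounding residual connection would carry unselected tokens forward unchanged.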
Details
Motivation: Existing attention acceleration methods either rely on structured sparsity or permanently evict tokens early, so they cannot adapt to the layer-wise and head-wise dynamics of token importance.
Method: Proposes Token Sparse Attention, which dynamically selects important tokens to compress and decompress per-head Q/K/V; it is compatible with dense attention implementations such as Flash Attention and can be composed with existing sparse attention kernels.
Result: Achieves up to 3.23$\times$ attention speedup at 128K context with less than 1% accuracy degradation, improving the accuracy-latency trade-off.
Conclusion: Dynamic, interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.
Abstract: The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a lightweight and dynamic token-level sparsification mechanism that compresses per-head $Q$, $K$, $V$ to a reduced token set during attention and then decompresses the output back to the original sequence, enabling token information to be reconsidered in subsequent layers. Furthermore, Token Sparse Attention exposes a new design point at the intersection of token selection and sparse attention. Our approach is fully compatible with dense attention implementations, including Flash Attention, and can be seamlessly composed with existing sparse attention kernels. Experimental results show that Token Sparse Attention consistently improves accuracy-latency trade-off, achieving up to 3.23$\times$ attention speedup at 128K context with less than 1% accuracy degradation. These results demonstrate that dynamic and interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.
[36] ATACompressor: Adaptive Task-Aware Compression for Efficient Long-Context Processing in LLMs
Xuancheng Li,Haitao Li,Yujia Zhou,Qingyao Ai,Yiqun Liu
Main category: cs.CL
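TL;DR: This paper proposes the Adaptive Task-Aware Compressor (ATACompressor), which uses selective encoding and a dynamically adjusted compression rate to mitigate the "lost in the middle" problem for long-context inputs to large language models, outperforming existing methods on several QA datasets.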
Details
Motivation: Long-context inputs to large language models often suffer from the "lost in the middle" problem, and existing context compression methods struggle to balance information preservation with compression efficiency.
Method: Proposes ATACompressor, which comprises a task-aware selective encoder and an adaptive allocation controller that perceives the length of relevant content and dynamically adjusts the compression rate.
Result: On three QA datasets (HotpotQA, MSMARCO, and SQUAD), ATACompressor outperforms existing methods in both compression efficiency and task performance; ablation studies validate each component.
Conclusion: ATACompressor provides a scalable, efficient, and task-adaptive solution for long-context processing in large language models.
Abstract: Long-context inputs in large language models (LLMs) often suffer from the "lost in the middle" problem, where critical information becomes diluted or ignored due to excessive length. Context compression methods aim to address this by reducing input size, but existing approaches struggle with balancing information preservation and compression efficiency. We propose Adaptive Task-Aware Compressor (ATACompressor), which dynamically adjusts compression based on the specific requirements of the task. ATACompressor employs a selective encoder that compresses only the task-relevant portions of long contexts, ensuring that essential information is preserved while reducing unnecessary content. Its adaptive allocation controller perceives the length of relevant content and adjusts the compression rate accordingly, optimizing resource utilization. We evaluate ATACompressor on three QA datasets: HotpotQA, MSMARCO, and SQUAD, showing that it outperforms existing methods in terms of both compression efficiency and task performance. Our approach provides a scalable solution for long-context processing in LLMs. Furthermore, we perform a range of ablation studies and analysis experiments to gain deeper insights into the key components of ATACompressor.
[37] POP: Prefill-Only Pruning for Efficient Large Model Inference
Junhui He,Zhihui Fu,Jun Wang,Qingan Li
Main category: cs.CL
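TL;DR: This paper proposes Prefill-Only Pruning (POP), a structured pruning method for inference in large language models and vision-language models. It identifies that the prefill and decode stages depend on network layers differently and omits redundant deep layers only during prefill, substantially reducing computation with almost no accuracy loss.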
Details
Motivation: Existing structured pruning methods ignore the asymmetry between the prefill and decode stages and therefore suffer large accuracy drops; a stage-aware pruning strategy is needed to balance efficiency and accuracy.
Method: Introduces a virtual gate mechanism for importance analysis, which shows that deep layers are critical for decode but largely redundant for prefill; on this basis designs Prefill-Only Pruning (POP), which skips deep layers during prefill and adds independent KV projections and a boundary handling strategy to maintain cache integrity and first-token accuracy.
Result: On Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities, POP achieves up to 1.37$\times$ speedup in prefill latency with minimal performance loss.
Conclusion: POP overcomes the accuracy-efficiency trade-off that limits existing structured pruning methods and offers a new approach to efficient LLM/VLM inference.
Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37$\times$ speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.
[38] MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research
Yifan Shi,Jialong Shi,Jiayi Wang,Ye Fan,Jianyong Sun
Main category: cs.CL
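TL;DR: This paper proposes MIRROR, a fine-tuning-free, end-to-end multi-agent framework that automatically translates natural language optimization problems into mathematical models and solver code, improving accuracy through execution-driven iterative revision and hierarchical retrieval.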
Details
Motivation: Operations research modeling depends on expert experience, is slow, and copes poorly with novel scenarios; existing LLM approaches either require costly post-training or lack reliable collaborative error correction and task-specific retrieval.
Method: Proposes MIRROR, a fine-tuning-free multi-agent framework with (1) an execution-driven iterative adaptive revision mechanism for automatic error correction and (2) a hierarchical retrieval mechanism that fetches modeling and coding exemplars from a curated exemplar library.
Result: Outperforms existing methods on standard OR benchmarks, with notable results on industrial datasets (IndustryOR, Mamo-ComplexLP).
Conclusion: By combining precise external knowledge infusion with systematic error correction, MIRROR gives non-expert users an efficient and reliable OR modeling solution and addresses the limitations of general-purpose LLMs on expert optimization tasks.
Abstract: Operations Research (OR) relies on expert-driven modeling, a slow and fragile process ill-suited to novel scenarios. While large language models (LLMs) can automatically translate natural language into optimization models, existing approaches either rely on costly post-training or employ multi-agent frameworks, yet most still lack reliable collaborative error correction and task-specific retrieval, often leading to incorrect outputs. We propose MIRROR, a fine-tuning-free, end-to-end multi-agent framework that directly translates natural language optimization problems into mathematical models and solver code. MIRROR integrates two core mechanisms: (1) execution-driven iterative adaptive revision for automatic error correction, and (2) hierarchical retrieval to fetch relevant modeling and coding exemplars from a carefully curated exemplar library. Experiments show that MIRROR outperforms existing methods on standard OR benchmarks, with notable results on complex industrial datasets such as IndustryOR and Mamo-ComplexLP. By combining precise external knowledge infusion with systematic error correction, MIRROR provides non-expert users with an efficient and reliable OR modeling solution, overcoming the fundamental limitations of general-purpose LLMs in expert optimization tasks.
[39] Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention
Rakshith Vasudev,Melisa Russak,Dan Bikel,Waseem Alshikh
Main category: cs.CL
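TL;DR: This paper studies proactive intervention by LLM critic models at deployment time. Even with high offline accuracy (AUROC 0.94), intervention can cause severe performance degradation (up to -26 pp), and the effect varies by model and task. The authors propose a small pre-deployment test that needs only 50 tasks to predict whether intervention is beneficial; its main value is preventing severe regressions before deployment.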
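One way to read such a pilot is as a count of recovered versus disrupted tasks. The sketch below assumes paired runs of the same pilot tasks with and without intervention and treats each outcome as a single boolean; the function name `pilot_net_effect` and this pairing design are illustrative assumptions, not the procedure from the paper.

```python
def pilot_net_effect(baseline, intervened):
    """Hypothetical pilot summary.

    baseline:   list of bools, task success without intervention.
    intervened: list of bools, success of the same tasks with intervention.
    """
    assert len(baseline) == len(intervened)
    n = len(baseline)
    # Recovered: failed without intervention, succeeded with it.
    recovered = sum((not b) and i for b, i in zip(baseline, intervened))
    # Disrupted: succeeded without intervention, failed with it.
    disrupted = sum(b and (not i) for b, i in zip(baseline, intervened))
    # Net change in success rate, in percentage points.
    net_pp = 100.0 * (recovered - disrupted) / n
    return {"recovered": recovered, "disrupted": disrupted, "net_pp": net_pp}

# Example: 50 pilot tasks, 40 of which succeed without intervention.
baseline = [True] * 40 + [False] * 10
intervened = [True] * 34 + [False] * 6 + [True] * 3 + [False] * 7
print(pilot_net_effect(baseline, intervened))
```

In this made-up example the agent already succeeds on most tasks, so the six disrupted successes outweigh the three recovered failures. Real agent runs are stochastic, so a pilot of this size gives only a rough estimate.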
Details
Motivation: Proactive intervention by LLM critic models is often assumed to improve reliability, but its effect in real deployments is unclear, and there is no reliable way to assess whether intervention is safe.
Method: Empirically analyzes the effect of interventions by a binary LLM critic across tasks, identifies a disruption-recovery tradeoff, and designs a small pre-deployment test (50 tasks) to predict the net effect of intervention.
Result: Intervention by a high-accuracy critic causes a 26 pp drop on one model and almost no change on another. The pre-deployment test correctly anticipates outcomes: intervention is harmful on high-success tasks (0 to -26 pp) and yields a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014).
Conclusion: The offline accuracy of an LLM critic does not guarantee safe deployment; a lightweight pre-deployment test should decide whether to intervene. The core contribution is preventing harmful interventions rather than maximizing intervention gains.
Abstract: Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while affecting another by near zero pp. This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014). The primary value of our framework is therefore identifying when not to intervene, preventing severe regressions before deployment.
[40] PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning
Yunzhi Shen,Hao Zhou,Xin Huang,Xue Han,Junlan Feng,Shujian Huang
Main category: cs.CL
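TL;DR: This paper proposes PEGRL, a two-stage reinforcement learning framework that uses post-editing as an auxiliary task to stabilize training and guide optimization, addressing the noisy learning signals and large trajectory space that RL faces in LLM-based machine translation.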
Details
Motivation: Existing RL methods for LLM-based machine translation (such as GRPO) are limited by noisy signals from Monte Carlo return estimation and by a large trajectory space that makes it hard to combine global exploration with fine-grained local optimization.
Method: Proposes the two-stage RL framework PEGRL: the first stage generates translation outputs, and the second stage uses them as post-editing inputs; return estimation in the post-editing stage is conditioned on the current translation behavior, and a task-specific weighting scheme balances the translation and post-editing objectives, giving a biased but more sample-efficient estimator.
Result: Consistently outperforms RL baselines on English-to-Finnish, English-to-Turkish, and English-Chinese (both directions) translation; on English-to-Turkish, COMET-KIWI performance is comparable to the advanced LLM system DeepSeek-V3.2.
Conclusion: By using post-editing as an auxiliary task, PEGRL mitigates training instability and insufficient optimization granularity in translation-oriented RL, improving sample efficiency and final translation quality.
Abstract: Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains; nevertheless, translation-oriented RL remains challenged by noisy learning signals arising from Monte Carlo return estimation, as well as a large trajectory space that favors global exploration over fine-grained local optimization. We introduce PEGRL, a two-stage RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on the current translation behavior, while jointly supporting both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English$\to$Finnish, English$\to$Turkish, and English$\leftrightarrow$Chinese show consistent gains over RL baselines, and for English$\to$Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems (DeepSeek-V3.2).
[41] Pursuing Best Industrial Practices for Retrieval-Augmented Generation in the Medical Domain
Wei Zhu
Main category: cs.CL
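TL;DR: This paper systematically analyzes the components of retrieval-augmented generation (RAG) systems for industrial use in the medical domain and how each can be implemented, proposes practical alternatives, and, through systematic evaluation on three types of tasks, reveals best practices and the trade-off between performance and efficiency.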
Details
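Motivation: There is no consensus on best practices for building RAG systems for industrial applications, especially in the medical domain, including which components to use, how to organize them, and how to implement each one.
Method: First analyzes each component of a RAG system in detail and proposes practical alternatives; then conducts systematic evaluations on three types of tasks.
Result: Reveals best practices for improving RAG systems and the trade-off that LLM-based RAG systems make between performance and efficiency.
Conclusion: In demanding domains such as medicine, component design and system implementation need to balance performance and efficiency according to task requirements; the component analysis and evaluation results can guide the construction of industrial-grade RAG systems.
Abstract: While retrieval augmented generation (RAG) has been swiftly adopted in industrial applications based on large language models (LLMs), there is no consensus on what are the best practices for building a RAG system in terms of what are the components, how to organize these components and how to implement each component for the industrial applications, especially in the medical domain. In this work, we first carefully analyze each component of the RAG system and propose practical alternatives for each component. Then, we conduct systematic evaluations on three types of tasks, revealing the best practices for improving the RAG system and how LLM-based RAG systems make trade-offs between performance and efficiency.
[42] Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective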
Hao Fang,Tianyi Zhang,Tianqu Zhuang,Jiawei Kong,Kuofeng Gao,Bin Chen,Leqi Liang,Shu-Tao Xia,Ke Xu
Main category: cs.CL
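TL;DR: This paper analyzes logit-based knowledge distillation attacks on large language models (LLMs) from an information-theoretic perspective and proposes purifying teacher outputs by minimizing conditional mutual information (CMI), which improves distillation resistance and substantially weakens distillation while preserving task accuracy.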
Details
Motivation: Existing defenses focus only on text-based distillation and leave the important logit-based distillation largely unexplored; distillation-relevant information leakage needs to be characterized and suppressed at a theoretical level.
Method: Defines conditional mutual information (CMI) to quantify the distillation-relevant information in teacher logits, learns a transformation matrix applied to the outputs, and trains it with a CMI-inspired anti-distillation objective.
Result: Experiments across multiple LLMs and strong distillation algorithms show that the method significantly degrades distillation performance while preserving the original task accuracy.
Conclusion: Minimizing CMI is an effective theoretical tool for limiting distillation-relevant knowledge leakage, and the proposed output purification protects model intellectual property while preserving output utility.
Abstract: Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures contextual information beneficial for model extraction, motivating us to defend distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models' intellectual property.
[43] Verified Critical Step Optimization for LLM Agents
Mukai Li,Qingcheng Zeng,Tianqing Fang,Zhenwen Liang,Linfeng Song,Qi Liu,Haitao Mi,Dong Yu
Main category: cs.CL
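TL;DR: This paper proposes Critical Step Optimization (CSO), which focuses on the decision steps that determine task success or failure. A process reward model identifies candidate critical steps, an expert model proposes high-quality alternative actions, and the policy model itself executes them for verification; only alternatives that correct the outcome are used for DPO training. This gives fine-grained, verifiable agent post-training and improves long-horizon task performance.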
Details
Motivation: Existing post-training methods have three problems: outcome-only rewards cannot precisely attribute credit to intermediate steps; estimated step-level rewards carry systematic noise; and Monte Carlo sampling for step reward estimation is computationally prohibitive.
Method: CSO starts from failed policy trajectories, uses a process reward model (PRM) to identify candidate critical steps, calls an expert model to propose high-quality alternative actions, and lets the policy model continue execution from that step until the task completes; an alternative is used as DPO training data only if the policy executes it successfully and the outcome is corrected, which ensures both supervision quality and policy reachability.
Result: On GAIA-Text-103 and XBench-DeepSearch, CSO achieves 37% and 26% relative improvement over the SFT baseline and substantially outperforms other post-training methods, while requiring supervision at only 16% of trajectory steps.
Conclusion: Selective, verification-driven learning on critical steps improves the long-horizon task ability of LLM agents while balancing supervision precision, quality, and computational efficiency.
Abstract: As large language model agents tackle increasingly complex long-horizon tasks, effective post-training becomes critical. Prior work faces fundamental challenges: outcome-only rewards fail to precisely attribute credit to intermediate steps, estimated step-level rewards introduce systematic noise, and Monte Carlo sampling approaches for step reward estimation incur prohibitive computational cost. Inspired by findings that only a small fraction of high-entropy tokens drive effective RL for reasoning, we propose Critical Step Optimization (CSO), which focuses preference learning on verified critical steps, decision points where alternate actions demonstrably flip task outcomes from failure to success. Crucially, our method starts from failed policy trajectories rather than expert demonstrations, directly targeting the policy model's weaknesses. We use a process reward model (PRM) to identify candidate critical steps, leverage expert models to propose high-quality alternatives, then continue execution from these alternatives using the policy model itself until task completion. Only alternatives that the policy successfully executes to correct outcomes are verified and used as DPO training data, ensuring both quality and policy reachability. This yields fine-grained, verifiable supervision at critical decisions while avoiding trajectory-level coarseness and step-level noise.
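The verification loop described above can be sketched as a filter that turns one failed trajectory into preference pairs. This is a minimal illustration under stated assumptions: `prm_score`, `propose`, and `rollout_succeeds` are hypothetical callables standing in for the PRM, the expert model, and a policy rollout; treating a low PRM score as the flag for a candidate critical step, the `threshold` value, and keeping one verified alternative per step are choices made for the sketch, not details from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class PreferencePair:
    state: str     # context up to the critical step
    chosen: str    # alternative action verified by a successful policy rollout
    rejected: str  # original action from the failed trajectory

def build_verified_pairs(
    states: Sequence[str],
    actions: Sequence[str],
    prm_score: Callable[[str, str], float],
    propose: Callable[[str], List[str]],
    rollout_succeeds: Callable[[str, str], bool],
    threshold: float = 0.5,
) -> List[PreferencePair]:
    pairs = []
    for state, action in zip(states, actions):
        # Assumption: a PRM score below the threshold marks a candidate critical step.
        if prm_score(state, action) >= threshold:
            continue
        for alt in propose(state):
            # Verification: the policy continues from the alternative and
            # must reach a correct final outcome.
            if rollout_succeeds(state, alt):
                pairs.append(PreferencePair(state, alt, action))
                break  # keep one verified alternative per step
    return pairs
```

Steps whose alternatives never lead to success produce no pair, so unverified expert suggestions do not reach the preference data.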
Experiments on GAIA-Text-103 and XBench-DeepSearch show that CSO achieves 37% and 26% relative improvement over the SFT baseline and substantially outperforms other post-training methods, while requiring supervision at only 16% of trajectory steps. This demonstrates the effectiveness of selective verification-based learning for agent post-training.
[44] FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding
Yingli Shen,Wen Lai,Jie Zhou,Xueren Zhang,Yudong Wang,Kangyang Luo,Shuo Wang,Ge Gao,Alexander Fraser,Maosong Sun
Main category: cs.CL
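TL;DR: This paper presents FactNet, a large-scale open-source resource that unifies 1.7 billion atomic assertions with 3.01 billion auditable evidence pointers, all derived from 316 Wikipedia editions. Its deterministic construction pipeline ensures that every evidence unit is recoverable with byte-level precision, and auditing confirms a grounding precision of 92.1%, even in long-tail languages. The paper also establishes the FactNet-Bench evaluation suite for knowledge graph completion, question answering, and fact checking.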
Details
Motivation: Existing grounding resources are limited: they offer either structured knowledge without textual context (such as knowledge bases) or grounded text with limited scale and language coverage; meanwhile, large language models suffer from factual hallucinations and lack traceable provenance.
Method: Proposes FactNet, which uses a strictly deterministic construction pipeline to extract 1.7 billion atomic assertions and 3.01 billion evidence pointers from 316 Wikipedia editions; every evidence unit is recoverable at byte level. Also builds the FactNet-Bench evaluation benchmark.
Result: Auditing confirms a grounding precision of 92.1%, even in long-tail languages; FactNet-Bench supports evaluation on knowledge graph completion, question answering, and fact checking.
Conclusion: FactNet is a foundational, reproducible resource for training and evaluating trustworthy multilingual systems and can help improve the factual accuracy and verifiability of large language models.
Abstract: While LLMs exhibit remarkable fluency, their utility is often compromised by factual hallucinations and a lack of traceable provenance. Existing resources for grounding mitigate this but typically enforce a dichotomy: they offer either structured knowledge without textual context (e.g., knowledge bases) or grounded text with limited scale and linguistic coverage. To bridge this gap, we introduce FactNet, a massive, open-source resource designed to unify 1.7 billion atomic assertions with 3.01 billion auditable evidence pointers derived exclusively from 316 Wikipedia editions. Unlike recent synthetic approaches, FactNet employs a strictly deterministic construction pipeline, ensuring that every evidence unit is recoverable with byte-level precision. Extensive auditing confirms a high grounding precision of 92.1%, even in long-tail languages. Furthermore, we establish FactNet-Bench, a comprehensive evaluation suite for Knowledge Graph Completion, Question Answering, and Fact Checking. FactNet provides the community with a foundational, reproducible resource for training and evaluating trustworthy, verifiable multilingual systems.
[45] A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces
Mingxuan Du,Benfeng Xu,Chiwei Zhu,Shaohan Wang,Pengyu Wang,Xiaorui Wang,Zhendong Mao
Main category: cs.CL
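TL;DR: This paper proposes A-RAG (Agentic RAG), a framework that lets the large language model take part in retrieval decisions. It provides three retrieval tools (keyword search, semantic search, and chunk read) for adaptive multi-granularity retrieval, and outperforms existing RAG methods on multiple open-domain QA benchmarks with comparable or fewer retrieved tokens.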
Details
Motivation: Existing RAG systems fail to exploit the reasoning and long-horizon tool-use capabilities of frontier language models; their single-shot retrieval or predefined workflow paradigms keep the model out of retrieval decisions and prevent efficient scaling as models improve.
Method: Proposes the A-RAG framework, which exposes hierarchical retrieval interfaces directly to the model and lets it call three retrieval tools (keyword search, semantic search, and chunk read), enabling dynamic, multi-granularity, adaptive retrieval.
Result: On multiple open-domain QA benchmarks, A-RAG consistently outperforms existing RAG methods with comparable or fewer retrieved tokens; the paper also systematically studies how A-RAG scales with model size and test-time compute.
Conclusion: By giving the model agency over retrieval, A-RAG makes use of its reasoning and tool-use capabilities and offers a new approach to building more capable, scalable RAG systems.
Abstract: Frontier language models have demonstrated strong reasoning and long-horizon tool-use capabilities. However, existing RAG systems fail to leverage these capabilities. They still rely on two paradigms: (1) designing an algorithm that retrieves passages in a single shot and concatenates them into the model's input, or (2) predefining a workflow and prompting the model to execute it step-by-step. Neither paradigm allows the model to participate in retrieval decisions, preventing efficient scaling with model improvements. In this paper, we introduce A-RAG, an Agentic RAG framework that exposes hierarchical retrieval interfaces directly to the model. A-RAG provides three retrieval tools: keyword search, semantic search, and chunk read, enabling the agent to adaptively search and retrieve information across multiple granularities. Experiments on multiple open-domain QA benchmarks show that A-RAG consistently outperforms existing approaches with comparable or lower retrieved tokens, demonstrating that A-RAG effectively leverages model capabilities and dynamically adapts to different RAG tasks. We further systematically study how A-RAG scales with model size and test-time compute. We will release our code and evaluation suite to facilitate future research. Code and evaluation suite are available at https://github.com/Ayanami0730/arag.
[46] Preferences for Idiomatic Language are Acquired Slowly -- and Forgotten Quickly: A Case Study on Swedish
Jenny Kunz
Main category: cs.CL
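TL;DR: This paper studies how language models develop preferences for idiomatic, as opposed to merely linguistically acceptable, Swedish during pretraining and during adaptation from English to Swedish. Idiomatic competence develops slowly but keeps improving, while instruction tuning on machine-translated data quickly erodes the model's preference for idiomatic language.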
Details
Motivation: To investigate how language models develop preferences for idiomatic Swedish, particularly during pretraining and cross-lingual adaptation (English to Swedish), and to fill gaps in data and methods for evaluating idiomatic competence.
Method: Trains Swedish models from scratch and fine-tunes English-pretrained models, probing them at multiple checkpoints with minimal pairs; builds two new datasets (conventionalized idioms vs. plausible variants, and idiomatic Swedish vs. Translationese) and adapts existing acceptability benchmarks into a minimal-pair format.
Result: Idiomatic competence develops more slowly than other linguistic abilities such as grammatical and lexical correctness; longer training yields diminishing returns on most tasks but continues to improve idiom-related performance, especially in the largest model (8B); instruction tuning on data machine-translated from English causes models to rapidly lose their preference for idiomatic language.
Conclusion: Idiomatic competence is a late-emerging linguistic ability that is easily disrupted by the training approach, and modeling it well requires dedicated data and training strategies.
Abstract: In this study, we investigate how language models develop preferences for idiomatic as compared to linguistically acceptable Swedish, both during pretraining and when adapting a model from English to Swedish. To do so, we train models on Swedish from scratch and by fine-tuning English-pretrained models, probing their preferences at various checkpoints using minimal pairs that differ in linguistic acceptability or idiomaticity. For linguistic acceptability, we adapt existing benchmarks into a minimal-pair format. To assess idiomaticity, we introduce two novel datasets: one contrasting conventionalized idioms with plausible variants, and another contrasting idiomatic Swedish with Translationese. Our findings suggest that idiomatic competence emerges more slowly than other linguistic abilities, including grammatical and lexical correctness. While longer training yields diminishing returns for most tasks, idiom-related performance continues to improve, particularly in the largest model tested (8B). However, instruction tuning on data machine-translated from English -- the common approach for languages with little or no native instruction data -- causes models to rapidly lose their preference for idiomatic language.
[47] Self-Verification Dilemma: Experience-Driven Suppression of Overused Checking in LLM Reasoning
Quanyu Long,Kai Jie Jiang,Jianda Chen,Xu Guo,Leilei Gan,Wenya Wang
Main category: cs.CL
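TL;DR: This paper finds that self-verification steps (rechecks) in Large Reasoning Models (LRMs) occur frequently but are mostly ineffective. It proposes an experience-driven test-time framework that detects and suppresses unnecessary self-verification, reducing token usage without hurting accuracy and even improving accuracy on some datasets.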
Details
Motivation: Self-verification steps (rechecks) in LRMs occur frequently but are mostly confirmatory rather than corrective, which indicates over-verification: how often verification is triggered does not match how often it is useful.
Method: Proposes an experience-driven test-time framework that detects recheck behavior as it occurs, retrieves from an offline experience pool of past verification outcomes, and estimates whether the current recheck is necessary; if it is judged unnecessary, a suppression signal makes the model skip the step and continue reasoning.
Result: Across multiple models and benchmarks, the method reduces token usage by up to 20.3% while maintaining, and in some cases improving, reasoning accuracy.
Conclusion: More self-verification is not always better; using historical experience to suppress low-value rechecks preserves both efficiency and performance.
Abstract: Large Reasoning Models (LRMs) achieve strong performance by generating long reasoning traces with reflection. Through a large-scale empirical analysis, we find that a substantial fraction of reflective steps consist of self-verification (recheck) that repeatedly confirms intermediate results. These rechecks occur frequently across models and benchmarks, yet the vast majority are confirmatory rather than corrective, rarely identifying errors and altering reasoning outcomes. This reveals a mismatch between how often self-verification is activated and how often it is actually useful. Motivated by this, we propose a novel, experience-driven test-time framework that reduces the overused verification. Our method detects the activation of recheck behavior, consults an offline experience pool of past verification outcomes, and estimates whether a recheck is likely unnecessary via efficient retrieval. When historical experience suggests a recheck is unnecessary, a suppression signal redirects the model to proceed. Across multiple models and benchmarks, our approach reduces token usage by up to 20.3% while maintaining accuracy, and on some datasets even yields accuracy improvements.
[48] Learning to Reason Faithfully through Step-Level Faithfulness Maximization
Runquan Gui,Yafu Li,Xiaoye Qu,Ziyan Liu,Yeqiu Cheng,Yu Cheng
Main category: cs.CL
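TL;DR: This paper proposes FaithRL, a framework that reduces hallucinations in multi-step reasoning by maximizing reasoning faithfulness. It uses a geometric reward design and a faithfulness-aware advantage modulation mechanism, lowering hallucination rates across several benchmarks while maintaining or improving answer correctness.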
Details
Motivation: Existing reinforcement learning methods based on sparse outcome rewards provide little supervision over intermediate reasoning steps, which leads to over-confidence, spurious reasoning, and more hallucinations.
Method: Proposes the FaithRL framework and formalizes a faithfulness-maximization objective; introduces a geometric reward design and a faithfulness-aware advantage modulation mechanism that penalizes unsupported reasoning steps while preserving valid partial derivations.
Result: Across diverse model backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining or improving answer correctness; analysis shows that it increases step-wise reasoning faithfulness and generalizes robustly.
Conclusion: FaithRL is a general reinforcement learning framework that directly optimizes reasoning faithfulness and mitigates hallucinations in multi-step LLM reasoning.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has markedly improved the performance of Large Language Models (LLMs) on tasks requiring multi-step reasoning. However, most RLVR pipelines rely on sparse outcome-based rewards, providing little supervision over intermediate steps and thus encouraging over-confidence and spurious reasoning, which in turn increases hallucinations. To address this, we propose FaithRL, a general reinforcement learning framework that directly optimizes reasoning faithfulness. We formalize a faithfulness-maximization objective and theoretically show that optimizing it mitigates over-confidence. To instantiate this objective, we introduce a geometric reward design and a faithfulness-aware advantage modulation mechanism that assigns step-level credit by penalizing unsupported steps while preserving valid partial derivations. Across diverse backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining (and often improving) answer correctness. Further analysis confirms that FaithRL increases step-wise reasoning faithfulness and generalizes robustly. Our code is available at https://github.com/aintdoin/FaithRL.
[49] Can Large Language Models Generalize Procedures Across Representations?
Fangru Lin,Valentin Hofmann,Xingchen Wan,Weixing Wang,Zifeng Ding,Anthony G. Cohn,Janet B. Pierrehumbert
Main category: cs.CL
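TL;DR: This paper studies how well large language models (LLMs) generalize across representations such as code, graphs, and natural language, and finds that training on a single representation transfers poorly to others. It proposes a two-stage data curriculum (symbolic data first, then natural language) that substantially improves cross-representation generalization, and interprets this generalization as a form of generative analogy.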
Details
Motivation: Real-world user tasks are usually described in natural language, while LLM training and evaluation rely heavily on symbolic representations such as code and graphs; this representation gap calls for studying and improving cross-representation generalization.
Method: Proposes a two-stage data curriculum: the first stage trains on symbolic data such as code and graphs, and the second stage switches to natural language data; the strategy is validated across several models (such as a 1.5B Qwen model) and tasks (such as naturalistic planning).
Result: The curriculum substantially improves performance on natural language tasks across model families; a 1.5B Qwen model trained this way closely matches zero-shot GPT-4o on naturalistic planning.
Conclusion: Cross-representation generalization can be interpreted as a form of generative analogy, which the two-stage curriculum encourages, narrowing the generalization gap between symbolic representations and natural language.
Abstract: Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage data curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained by our method can closely match zero-shot GPT-4o in naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages.
[50] SEAD: Self-Evolving Agent for Multi-Turn Service Dialogue
Yuqin Dai,Ning Gao,Wei Zhang,Jie Wang,Zichen Luo,Jinpeng Wang,Yujie Wang,Ruiyuan Wu,Chaozheng Wang
Main category: cs.CL
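TL;DR: This paper proposes SEAD, a framework that decouples user modeling into a Profile Controller and a User Role-play Model, improving the task completion rate and dialogue efficiency of large language models in service dialogues without large-scale human annotation.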
Details
Motivation: Existing methods perform poorly in service dialogues, mainly because they rely on noisy, low-quality human conversation data and because authentic goal-oriented user behavior is hard to simulate.
Method: Proposes SEAD (Self-Evolving Agent for Service Dialogue), which decouples user modeling into a Profile Controller (generating diverse user states to manage the training curriculum) and a User Role-play Model (focused on realistic role-playing), creating a self-evolving, adaptive training environment for service dialogue.
Result: Experiments show that SEAD significantly outperforms open-source foundation models and closed-source commercial models, improving task completion rate by 17.6% and dialogue efficiency by 11.1%.
Conclusion: SEAD offers a way to train service dialogue agents efficiently without large-scale human annotation, improving practicality and robustness in real service scenarios.
Abstract: Large Language Models have demonstrated remarkable capabilities in open-domain dialogues. However, current methods exhibit suboptimal performance in service dialogues, as they rely on noisy, low-quality human conversation data. This limitation arises from data scarcity and the difficulty of simulating authentic, goal-oriented user behaviors. To address these issues, we propose SEAD (Self-Evolving Agent for Service Dialogue), a framework that enables agents to learn effective strategies without large-scale human annotations. SEAD decouples user modeling into two components: a Profile Controller that generates diverse user states to manage training curriculum, and a User Role-play Model that focuses on realistic role-playing. This design ensures the environment provides adaptive training scenarios rather than acting as an unfair adversary. Experiments demonstrate that SEAD significantly outperforms Open-source Foundation Models and Closed-source Commercial Models, improving task completion rate by 17.6% and dialogue efficiency by 11.1%. Code is available at: https://github.com/Da1yuqin/SEAD.
[51] Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models
Vitalii Hirak,Jaap Jumelet,Arianna Bisazza
Main category: cs.CL
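TL;DR: This paper studies how linguistic typological features affect the performance of large multilingual translation models. It finds that target-language typology significantly affects translation quality and suggests that languages with certain typological properties may benefit from alternative decoding strategies.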
Details
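Motivation: Despite major advances in multilingual modeling, quality disparities across languages remain large. Beyond uneven training resources, typological properties are also thought to affect modeling difficulty, but existing evidence comes mostly from small monolingual or bilingual models, and large pre-trained multilingual translation models have not been analyzed systematically.
Method: Empirically analyzes two state-of-the-art large pre-trained multilingual translation models (NLLB-200 and Tower+) on a broad set of languages, controlling for confounders such as data resourcedness and writing script, to examine how target-language typology relates to translation quality and how languages of different types respond to decoding strategies.
Result: Target-language typology significantly affects translation quality for both models; languages with certain typological properties benefit more from a wider search of the output space, suggesting they could profit from decoding strategies beyond the standard left-to-right beam search. The authors also release fine-grained typological properties for 212 languages of the FLORES+ evaluation benchmark.
Conclusion: Linguistic typology is an important intrinsic factor in the performance of large multilingual translation models; future work should consider typological information in model design and decoding strategies to improve translation for low-resource and typologically distinctive languages.
Abstract: Despite major advances in multilingual modeling, large quality disparities persist across languages. Besides the obvious impact of uneven training resources, typological properties have also been proposed to determine the intrinsic difficulty of modeling a language. The existing evidence, however, is mostly based on small monolingual language models or bilingual translation models trained from scratch. We expand on this line of work by analyzing two large pre-trained multilingual translation models, NLLB-200 and Tower+, which are state-of-the-art representatives of encoder-decoder and decoder-only machine translation, respectively. Based on a broad set of languages, we find that target language typology drives translation quality of both models, even after controlling for more trivial factors, such as data resourcedness and writing script. Additionally, languages with certain typological properties benefit more from a wider search of the output space, suggesting that such languages could profit from alternative decoding strategies beyond the standard left-to-right beam search. To facilitate further research in this area, we release a set of fine-grained typological properties for 212 languages of the FLORES+ MT evaluation benchmark.
[52] HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing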
Yizhao Gao,Jianyu Wei,Qihao Zhang,Yu Cheng,Shimao Chen,Zhengju Tang,Zihan Jiang,Yifan Song,Hailin Zhang,Liang Zhao,Bo Yang,Gang Wang,Shijie Cao,Fuli Luo
Main category: cs.CL
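TL;DR: HySparse is a new hybrid sparse attention architecture that places several sparse attention layers after each full attention layer and directly reuses the preceding full layer's token selection and KV cache. This saves both computation and memory, and the architecture clearly outperforms the baselines on 7B and 80B models.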
Details
Motivation: Existing sparse attention methods have two key problems: they rely on additional proxies to predict token importance, which adds complexity and can degrade performance, and they reduce computation without reducing KV cache memory. Method: The paper proposes HySparse, an architecture that interleaves each full attention layer with several sparse attention layers. Each sparse layer takes its token selection and KV cache directly from the preceding full attention layer, which gives precise importance estimates and allows cache reuse. Result: HySparse outperforms full attention and SWA baselines on both a 7B dense model and an 80B MoE model. In the 80B MoE model only 5 of 49 layers use full attention, yet the model still achieves substantial performance gains while reducing the KV cache by nearly 10x. Conclusion: By using full attention layers as an 'oracle' that guides the sparse layers, HySparse greatly reduces computation and KV memory without sacrificing performance, offering a new paradigm for efficient large-model inference. Abstract: This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse strategically derives each sparse layer's token selection and KV caches directly from the preceding full attention layer. This architecture resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on additional proxies to predict token importance, introducing extra complexity and potentially suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without saving KV cache. HySparse enables sparse attention layers to reuse the full attention KV cache, thereby reducing both computation and memory. We evaluate HySparse on both 7B dense and 80B MoE models. Across all settings, HySparse consistently outperforms both full attention and hybrid SWA baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers employ full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.
[53] ACL: Aligned Contrastive Learning Improves BERT and Multi-exit BERT Fine-tuning
Wei Zhu
Main category: cs.CL
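TL;DR: This paper proposes the Aligned Contrastive Learning (ACL) framework to resolve the conflict between the contrastive learning objective and the cross-entropy loss in supervised learning, and validates it on the GLUE benchmark for both single-exit and multi-exit BERT models.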
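As a rough illustration of the "discard the contrastive term when the objectives conflict" idea behind ACL-Grad, the sketch below combines two gradient vectors. The conflict test used here (a negative inner product between the two gradients) is an assumption borrowed from common gradient-conflict heuristics; the summary does not state which criterion the paper actually uses, and the function name `combine_gradients` is hypothetical.

```python
def combine_gradients(grad_ce, grad_acl):
    """Return the update direction for one step.

    grad_ce  : gradient of the cross-entropy loss (list of floats)
    grad_acl : gradient of the contrastive (ACL-Embed-style) term

    ASSUMPTION: the two objectives are treated as "in conflict" when
    their gradients have a negative inner product. The source does not
    confirm this criterion.
    """
    dot = sum(g1 * g2 for g1, g2 in zip(grad_ce, grad_acl))
    if dot < 0:
        # Conflict: keep only the cross-entropy gradient.
        return list(grad_ce)
    # No conflict: use the sum of both gradients.
    return [g1 + g2 for g1, g2 in zip(grad_ce, grad_acl)]
```

In a real training loop the two gradients would come from separate backward passes over the shared parameters; the plain lists here only keep the sketch dependency-free.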
Details
Motivation: Contrastive learning is successful in self-supervised learning but less studied in supervised learning. The author finds that it conflicts with the cross-entropy objective, which limits its use in supervised settings. Method: The ACL framework comprises ACL-Embed (which treats label embeddings as augmented samples and aligns them contrastively), ACL-Grad (which discards the ACL-Embed term when the objectives conflict), and ACL-CL (cross-layer ACL, in which the teacher exit guides the optimization of shallow student exits). Result: ACL-BRT performs better than or comparably to CE and CE+SCL on the GLUE tasks. ACL, especially ACL-CL, significantly improves multi-exit BERT fine-tuning and the quality-speed trade-off. Conclusion: The ACL framework effectively mitigates the objective conflict in supervised contrastive learning and is particularly suited to multi-exit BERT models in low-latency settings. Abstract: Despite its success in self-supervised learning, contrastive learning is less studied in the supervised setting. In this work, we first use a set of pilot experiments to show that in the supervised setting, the cross-entropy loss objective (CE) and the contrastive learning objective often conflict with each other, thus hindering the applications of CL in supervised settings. To resolve this problem, we introduce a novel \underline{A}ligned \underline{C}ontrastive \underline{L}earning (ACL) framework. First, ACL-Embed regards label embeddings as extra augmented samples with different labels and employs contrastive learning to align the label embeddings with their samples' representations. Second, to facilitate the optimization of the ACL-Embed objective combined with the CE loss, we propose ACL-Grad, which will discard the ACL-Embed term if the two objectives are in conflict. To enhance the performance of intermediate exits of multi-exit BERT, we further propose cross-layer ACL (ACL-CL), which asks the teacher exit to guide the optimization of student shallow exits. Extensive experiments on the GLUE benchmark result in the following takeaways: (a) ACL-BRT outperforms or performs comparably with CE and CE+SCL on the GLUE tasks; (b) ACL, especially ACL-CL, significantly surpasses the baseline methods on the fine-tuning of multi-exit BERT, thus providing better quality-speed tradeoffs for low-latency applications.
[54] Use Graph When It Needs: Efficiently and Adaptively Integrating Retrieval-Augmented Generation with Graphs
Su Dong,Qinggang Zhang,Yilin Xiao,Shengyuan Chen,Chuang Zhou,Xiao Huang
Main category: cs.CL
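TL;DR: This paper proposes EA-GraphRAG, a framework that uses syntax-aware query complexity analysis to choose dynamically between RAG and GraphRAG, improving both accuracy and efficiency.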
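The score-driven routing summarized above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the thresholds `low` and `high`, the weighting of the two rankings by the complexity score, and the function names are all assumptions. Only reciprocal rank fusion itself (summing `1 / (k + rank)` over the input rankings) is a standard technique.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(weight / (k + rank)) over the lists that
    contain it. The per-list weights are a hypothetical stand-in for
    "complexity-aware" fusion.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (k + rank)
    # Sort by descending score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

def route(score, dense_hits, graph_hits, low=0.3, high=0.7):
    """Pick a retrieval result based on a complexity score in [0, 1].

    ASSUMPTION: `low` and `high` are illustrative thresholds.
    """
    if score < low:
        return dense_hits            # simple query: dense RAG only
    if score > high:
        return graph_hits            # complex query: graph retrieval only
    # Borderline query: fuse both rankings, leaning on the score.
    return reciprocal_rank_fusion(
        [dense_hits, graph_hits], weights=[1.0 - score, score]
    )
```

For example, `route(0.5, ["a", "b"], ["b", "c"])` fuses the two lists and ranks `"b"` first because it appears in both.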
Details
Motivation: GraphRAG performs poorly in real-world scenarios because it is applied rigidly to all queries, which causes accuracy drops and excessive latency. Method: EA-GraphRAG comprises (i) a syntactic feature constructor that extracts structural features of the query; (ii) a lightweight complexity scorer that outputs a continuous score; and (iii) a score-driven routing policy that handles low-score, high-score, and borderline queries separately, with complexity-aware reciprocal rank fusion for the borderline cases. Result: On several single-hop and multi-hop QA benchmarks, EA-GraphRAG significantly improves accuracy, reduces latency, and reaches state-of-the-art performance in mixed-query scenarios. Conclusion: Adapting the retrieval paradigm dynamically is more effective than applying GraphRAG uniformly. Syntactic structure can serve as a reliable proxy for query complexity, enabling efficient and accurate knowledge-augmented generation. Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks due to hallucinations and outdated parametric knowledge. While Retrieval-Augmented Generation (RAG) addresses this by integrating external corpora, its effectiveness is limited by fragmented information in unstructured domain documents. Graph-augmented RAG (GraphRAG) emerged to enhance contextual reasoning through structured knowledge graphs, yet paradoxically underperforms vanilla RAG in real-world scenarios, exhibiting significant accuracy drops and prohibitive latency despite gains on complex queries. We identify the rigid application of GraphRAG to all queries, regardless of complexity, as the root cause. To resolve this, we propose an efficient and adaptive GraphRAG framework called EA-GraphRAG that dynamically integrates RAG and GraphRAG paradigms through syntax-aware complexity analysis. Our approach introduces: (i) a syntactic feature constructor that parses each query and extracts a set of structural features; (ii) a lightweight complexity scorer that maps these features to a continuous complexity score; and (iii) a score-driven routing policy that selects dense RAG for low-score queries, invokes graph-based retrieval for high-score queries, and applies complexity-aware reciprocal rank fusion to handle borderline cases. Extensive experiments on a comprehensive benchmark, consisting of two single-hop and two multi-hop QA benchmarks, demonstrate that our EA-GraphRAG significantly improves accuracy, reduces latency, and achieves state-of-the-art performance in handling mixed scenarios involving both simple and complex queries.
[55] $V_0$: A Generalist Value Model for Any Policy at State Zero
Yi-Kai Zhang,Zhiyuan Yao,Hongyan Hao,Yueqing Sun,Qi Gu,Hui Su,Xunliang Cai,De-Chuan Zhan,Han-Jia Ye
Main category: cs.CL
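TL;DR: This paper proposes $V_0$, a generalist value model that estimates the performance of any large language model on unseen prompts without parameter updates. It takes the policy's capability as explicit context input (for example, a history of instruction-performance pairs) instead of relying on parameter fitting as in traditional value modeling. $V_0$ allocates the sampling budget before rollout during GRPO training, serves as a low-cost, adaptable model router at deployment, and reaches a Pareto-optimal trade-off between performance and cost.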
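To make the two ingredients mentioned above concrete, here is a small sketch of (1) the group-relative baseline that GRPO uses in place of a learned critic and (2) one possible way to spend a rollout budget once success rates are predicted in advance. The allocation rule (budget proportional to the Bernoulli variance `p * (1 - p)`) is purely an illustrative assumption; the summary does not say how $V_0$'s predictions are turned into budgets, and both function names are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each rollout relative to its group.

    The baseline is the group's mean reward; dividing by the group's
    standard deviation follows the common GRPO formulation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def allocate_rollouts(predicted_success, total_budget):
    """Split a rollout budget across prompts.

    ASSUMPTION: prompts whose predicted success rate is near 0 or 1
    carry little learning signal, so budget is made proportional to
    p * (1 - p). Rounding means the result may not sum exactly to
    total_budget.
    """
    weights = [p * (1.0 - p) for p in predicted_success]
    total = sum(weights)
    if total == 0:
        # All prompts look trivially solved or unsolvable: split evenly.
        return [total_budget // len(weights)] * len(weights)
    return [round(total_budget * w / total) for w in weights]
```

With rewards `[1, 0, 1, 0]` the advantages come out close to `[1, -1, 1, -1]`, and a prompt with predicted success 0.5 receives the whole budget when the others are predicted at 0.0 or 1.0.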
Details
Motivation: In existing Actor-Critic methods the value model must be updated in step with the policy, which is expensive. GRPO removes this coupling but relies on extensive sampling to stabilize the baseline estimate, so an efficient, generalizable, training-free value estimation mechanism is still missing. Method: $V_0$ is a generalist value model that represents policy capability as explicit context (instruction-performance history pairs) and does not rely on parameter fine-tuning. It focuses on value prediction at State Zero (the initial prompt) and supports sampling budget scheduling during training and model routing at deployment. Result: $V_0$ significantly outperforms heuristic budget allocation strategies and achieves a Pareto-optimal trade-off between performance and cost in LLM routing tasks. Conclusion: Modeling policy capability explicitly as context effectively decouples value estimation from policy updates. $V_0$ offers a new paradigm for efficient, low-cost, and generalizable LLM training and deployment. Abstract: Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet, this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose $V_0$, a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts without requiring parameter updates. We reframe value estimation by treating the policy's dynamic capability as an explicit context input; specifically, we leverage a history of instruction-performance pairs to dynamically profile the model, departing from the traditional paradigm that relies on parameter fitting to perceive capability shifts. Focusing on value estimation at State Zero (i.e., the initial prompt, hence $V_0$), our model serves as a critical resource scheduler. During GRPO training, $V_0$ predicts success rates prior to rollout, allowing for efficient sampling budget allocation; during deployment, it functions as a router, dispatching instructions to the most cost-effective and suitable model. Empirical results demonstrate that $V_0$ significantly outperforms heuristic budget allocation and achieves a Pareto-optimal trade-off between performance and cost in LLM routing tasks.
[56] CL-bench: A Benchmark for Context Learning
Shihan Dou,Ming Zhang,Zhangyue Yin,Chenhao Huang,Yujiong Shen,Junzhe Wang,Jiayi Chen,Yuchen Ni,Junjie Ye,Cheng Zhang,Huaibing Xie,Jianglu Hu,Shaolei Wang,Weichao Wang,Yanling Xiao,Yiting Liu,Zenan Xu,Zhen Guo,Pluto Zhou,Tao Gui,Zuxuan Wu,Xipeng Qiu,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Di Wang,Shunyu Yao
Main category: cs.CL
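TL;DR: This paper identifies a capability it calls context learning: a model must learn new knowledge from task-specific context and reason with it. The authors build CL-bench, a real-world benchmark with 500 complex contexts and 1,899 tasks. Evaluation shows that current large models perform very poorly on it (solving only 17.2% of tasks on average), which exposes a key bottleneck for real-world applications.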
Details
Motivation: Existing language models are good at reasoning with pre-trained knowledge but fall well short on complex, context-dependent tasks that require learning new knowledge (such as domain rules or empirical laws) from task-specific context. This 'context learning' ability comes naturally to humans and is essential in practice, yet it has long been overlooked. Method: The paper introduces the notion of context learning and builds the real-world benchmark CL-bench. It is designed by domain experts and contains 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics. The new knowledge each task requires is strictly contained in the corresponding context, so models must genuinely learn from the context rather than merely retrieve or imitate patterns. Result: Ten frontier language models evaluated on CL-bench solve only 17.2% of the tasks on average. The best-performing model, GPT-5.1, reaches only 23.7%, which indicates that current models do not yet have effective context learning ability. Conclusion: Context learning is a key bottleneck on the path to more intelligent language models and real-world deployment. CL-bench provides a systematic, high-fidelity benchmark for evaluating and advancing this capability. Abstract: Current language models (LMs) excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what is learned during pre-training to reason and resolve tasks. We term this capability context learning, a crucial ability that humans naturally possess but that has been largely overlooked. To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context, ranging from new domain-specific knowledge, rule systems, and complex procedures to laws derived from empirical data, all of which are absent from pre-training. This goes far beyond long-context tasks that primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns via instructions and demonstrations. Our evaluations of ten frontier LMs find that models solve only 17.2% of tasks on average. Even the best-performing model, GPT-5.1, solves only 23.7%, revealing that LMs have yet to achieve effective context learning, which poses a critical bottleneck for tackling real-world, complex context-dependent tasks. CL-bench represents a step towards building LMs with this fundamental capability, making them more intelligent and advancing their deployment in real-world scenarios.
[57] Efficient Algorithms for Partial Constraint Satisfaction Problems over Control-flow Graphs
Xuran Cai,Amir Goharshady
Main category: cs.CL
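TL;DR: This paper presents a general algorithm for Partial Constraint Satisfaction Problems (PCSPs) over control-flow graphs (CFGs), based on Series-Parallel-Loop (SPL) decompositions. It runs in $O(|G| \cdot |D|^6)$ time, which is linear for a fixed domain $D$, and its effectiveness and efficiency are validated on compiler optimization tasks such as register allocation, LOSPRE, and bank selection.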
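To pin down what a PCSP over a graph asks for, the following brute-force solver enumerates every assignment and returns one of minimum total violation cost. It is only meant to make the objective concrete on toy inputs: it is exponential in the number of vertices and is not the SPL-based dynamic programming algorithm the paper describes. The function name and the input encoding (a dictionary from edges to cost functions) are assumptions made for this sketch.

```python
from itertools import product

def solve_pcsp_bruteforce(vertices, domain, edge_cost):
    """Return (min_cost, assignment) for a tiny graph PCSP.

    vertices  : list of vertex names (one variable per vertex)
    domain    : list of values each variable may take
    edge_cost : dict mapping an edge (u, v) to a function
                cost(value_u, value_v) >= 0, where 0 means the
                edge's constraint is satisfied
    """
    best = None
    for values in product(domain, repeat=len(vertices)):
        assignment = dict(zip(vertices, values))
        cost = sum(
            fn(assignment[u], assignment[v])
            for (u, v), fn in edge_cost.items()
        )
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best

# Toy instance: a triangle whose edges each ask for different values,
# with a penalty of 1 per violated edge, over a two-value domain.
def differ(x, y):
    return 0 if x != y else 1

triangle = {("a", "b"): differ, ("b", "c"): differ, ("a", "c"): differ}
min_cost, assignment = solve_pcsp_bruteforce(["a", "b", "c"], [0, 1], triangle)
```

On this triangle no assignment satisfies all three constraints with only two values, so the minimum cost is 1: exactly the situation where a plain CSP has no solution but a PCSP still returns a cheapest compromise.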
Details
Motivation: Many classical compiler optimization tasks (such as register allocation, LOSPRE, and placement of bank selection instructions) can be modeled as PCSPs over control-flow graphs. The CFGs of structured programs are sparse and decomposable (for example via SPL decompositions), which calls for a unified and efficient solving framework. Method: Building on the SPL graph decomposition, the paper designs a general dynamic programming algorithm for PCSPs that exploits the recursive composition of the graph (series, parallel, loop) to merge states bottom-up, and supports relaxing constraints at a cost. Result: The algorithm runs in $O(|G| \cdot |D|^6)$ time, which is linear for a fixed domain $D$. Experiments show it is four times faster than the previous best method on optimal bank selection, and it unifies existing SPL-based approaches for register allocation and LOSPRE. Conclusion: SPL decompositions provide a strong structural paradigm for solving PCSPs. The proposed general algorithm is both theoretically efficient and fast in practice, and it can be extended to other CFG-based compiler optimization problems. Abstract: In this work, we focus on the Partial Constraint Satisfaction Problem (PCSP) over control-flow graphs (CFGs) of programs. PCSP serves as a generalization of the well-known Constraint Satisfaction Problem (CSP). In the CSP framework, we define a set of variables, a set of constraints, and a finite domain $D$ that encompasses all possible values for each variable. The objective is to assign a value to each variable in such a way that all constraints are satisfied. In the graph variant of CSP, an underlying graph is considered and we have one variable corresponding to each vertex of the graph and one or several constraints corresponding to each edge. In PCSPs, we allow for certain constraints to be violated at a specified cost, aiming to find a solution that minimizes the total cost. Numerous classical compiler optimization tasks can be framed as PCSPs over control-flow graphs. Examples include Register Allocation, Lifetime-optimal Speculative Partial Redundancy Elimination (LOSPRE), and Optimal Placement of Bank Selection Instructions. On the other hand, it is well-known that control-flow graphs of structured programs are sparse and decomposable in a variety of ways. In this work, we rely on the Series-Parallel-Loop (SPL) decompositions as introduced by~\cite{RegisterAllocation}. Our main contribution is a general algorithm for PCSPs over SPL graphs with a time complexity of \(O(|G| \cdot |D|^6)\), where \(|G|\) represents the size of the control-flow graph. Note that for any fixed domain $D$, this yields a linear-time solution. Our algorithm can be seen as a generalization and unification of previous SPL-based approaches for register allocation and LOSPRE. In addition, we provide experimental results over another classical PCSP task, namely Optimal Bank Selection, achieving runtimes four times better than the previous state of the art.
[58] Controlling Output Rankings in Generative Engines for LLM-based Search
Haibo Jin,Ruoxi Chen,Peiyan Zhang,Yifeng Luo,Huimin Zeng,Man Luo,Haohan Wang
Main category: cs.CL
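TL;DR: This paper proposes CORE, which controls how a large language model (LLM) ranks product recommendations in generative search by appending optimization content (string-based, reasoning-based, or review-based) to the retrieved content. It substantially raises the rate at which small merchants' products reach the top-1/3/5 recommendations without harming the naturalness of the text.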
Details
Motivation: LLM-driven generative search depends on the initial retrieval order, which leaves products from small businesses and independent creators under-exposed. A ranking control method that can intervene in the black-box generation process is needed. Method: CORE is an optimization method for the output rankings of generative engines. It appends three types of carefully designed optimization content (string-based, reasoning-based, and review-based) to the content returned by the search engine and thereby influences the LLM's final recommendations indirectly. Its core idea is to recast ranking control as controllable editing of the retrieved content. Result: On the ProductBench benchmark (15 categories with 200 products each), CORE achieves an average promotion success rate of 91.4% at Top-5, 86.6% at Top-3, and 80.3% at Top-1 across four LLMs (GPT-4o, Gemini-2.5, Claude-4, and Grok-3), clearly outperforming existing methods while keeping the optimized content fluent. Conclusion: CORE shows that generative search rankings can be steered efficiently and robustly by manipulating only the retrieved content on the input side, without access to the LLM's internal parameters or API, offering a new paradigm for fair and controllable AI e-commerce search. Abstract: The way customers search for and choose products is changing with the rise of large language models (LLMs). LLM-based search, or generative engines, provides direct product recommendations to users, rather than traditional online search results that require users to explore options themselves. However, these recommendations are strongly influenced by the initial retrieval order of LLMs, which disadvantages small businesses and independent creators by limiting their visibility. In this work, we propose CORE, an optimization method that \textbf{C}ontrols \textbf{O}utput \textbf{R}ankings in g\textbf{E}nerative Engines for LLM-based search. Since the LLM's interactions with the search engine are black-box, CORE targets the content returned by search engines as the primary means of influencing output rankings. Specifically, CORE optimizes retrieved content by appending strategically designed optimization content to steer the ranking of outputs. We introduce three types of optimization content: string-based, reasoning-based, and review-based, demonstrating their effectiveness in shaping output rankings. To evaluate CORE in realistic settings, we introduce ProductBench, a large-scale benchmark with 15 product categories and 200 products per category, where each product is associated with its top-10 recommendations collected from Amazon's search interface. Extensive experiments on four LLMs with search capabilities (GPT-4o, Gemini-2.5, Claude-4, and Grok-3) demonstrate that CORE achieves an average Promotion Success Rate of \textbf{91.4\% @Top-5}, \textbf{86.6\% @Top-3}, and \textbf{80.3\% @Top-1}, across 15 product categories, outperforming existing ranking manipulation methods while preserving the fluency of optimized content.
[59] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation
Changze Lv,Jie Zhou,Wentao Zhao,Jingwen Xu,Zisu Huang,Muzhao Tian,Shihan Dou,Tao Gui,Le Tian,Xiao Zhou,Xiaoqing Zheng,Xuanjing Huang,Jie Zhou
Main category: cs.CL
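TL;DR: This paper proposes a method for training human-preference-aligned, query-specific rubric generators for DeepResearch report generation, and combines it with a Multi-agent Markov-state (MaMs) workflow to strengthen long-horizon reasoning, which markedly improves evaluation accuracy and model performance.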
Details
Motivation: Evaluation of DeepResearch reports lacks verifiable reward signals, and the rubrics it relies on are either too coarse or manually built at high cost and hard to scale. Method: The authors construct a dataset of DeepResearch-style queries annotated with human preferences and train rubric generators with reinforcement learning, using a hybrid reward that combines human preference supervision with LLM-based rubric evaluation. They also introduce a Multi-agent Markov-state (MaMs) workflow to strengthen long-horizon reasoning. Result: The proposed rubric generators are more discriminative and better aligned with human preferences than existing approaches. When integrated into the MaMs framework, DeepResearch systems outperform all open-source baselines on DeepResearch Bench and perform comparably to leading closed-source models. Conclusion: This work provides a scalable, well-aligned paradigm for automated evaluation and training in DeepResearch report generation, effectively bridging the gap between human judgment and model optimization. Abstract: Nowadays, training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver more discriminative and better human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on the DeepResearch Bench and achieve performance comparable to that of leading closed-source models.
[60] BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish
Burak Aktaş,Mehmet Can Baytekin,Süha Kağan Köse,Ömer İlbilgi,Elif Özge Yılmaz,Çağrı Toraman,Bilge Kaan Görür
Main category: cs.CL
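TL;DR: This paper introduces BIRDTurk, the first Text-to-SQL benchmark for Turkish, built with a controlled translation pipeline that adapts the schema to Turkish while preserving SQL logic and execution semantics. Evaluating several methods on it shows that Turkish causes consistent performance degradation, mainly because of structural linguistic differences and the scarcity of Turkish data in LLM pretraining, while agentic multi-stage reasoning shows stronger cross-lingual robustness.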
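The entry says translation quality was checked on a sample sized for 95% confidence. One standard way to size such a sample is Cochran's formula with an optional finite-population correction, sketched below. This is not necessarily the authors' calculation: the 5% margin of error, the conservative proportion `p = 0.5`, and the example population size are assumptions, since the summary gives none of them.

```python
import math

def sample_size(margin=0.05, z=1.96, p=0.5, population=None):
    """Number of items to check for a given confidence level.

    margin     : tolerated margin of error (ASSUMED 5% by default)
    z          : normal quantile; 1.96 corresponds to 95% confidence
    p          : expected proportion; 0.5 is the most conservative choice
    population : if given, apply the finite-population correction
    """
    n0 = (z ** 2) * p * (1.0 - p) / (margin ** 2)
    if population is not None:
        n0 = n0 / (1.0 + (n0 - 1.0) / population)
    return math.ceil(n0)
```

Under these defaults the formula asks for 385 samples from an unbounded population, and 370 from a hypothetical pool of 10,000 items.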
Details
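Motivation: Text-to-SQL systems perform well in English, but their behavior in morphologically rich, low-resource languages such as Turkish is unclear, so a controlled, high-quality cross-lingual evaluation benchmark is needed. Method: BIRDTurk is built by adapting the BIRD benchmark to Turkish through a controlled translation pipeline that preserves SQL logic and database semantics. Translation quality is validated by human evaluation on a sample whose size is determined via the Central Limit Theorem. Three families of methods are evaluated on BIRDTurk: inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning. Result: Turkish causes a consistent drop in Text-to-SQL performance, which stems from structural linguistic differences and the scarcity of Turkish data in LLM pretraining. Agentic reasoning shows stronger cross-lingual robustness. Supervised fine-tuning has limited effect for standard multilingual baselines but scales effectively with modern instruction-tuned models. Conclusion: BIRDTurk provides the first controlled testbed for cross-lingual Text-to-SQL research under realistic database conditions. It highlights the influence of language properties and pretraining data distribution on Text-to-SQL performance and confirms the potential of agentic architectures for low-resource languages. Abstract: Text-to-SQL systems have achieved strong performance on English benchmarks, yet their behavior in morphologically rich, low-resource languages remains largely unexplored. We introduce BIRDTurk, the first Turkish adaptation of the BIRD benchmark, constructed through a controlled translation pipeline that adapts schema identifiers to Turkish while strictly preserving the logical structure and execution semantics of SQL queries and databases. Translation quality is validated on a sample size determined by the Central Limit Theorem to ensure 95% confidence, achieving 98.15% accuracy on human-evaluated samples. Using BIRDTurk, we evaluate inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning. Our results reveal that Turkish introduces consistent performance degradation, driven by both structural linguistic divergence and underrepresentation in LLM pretraining, while agentic reasoning demonstrates stronger cross-lingual robustness. Supervised fine-tuning remains challenging for standard multilingual baselines but scales effectively with modern instruction-tuned models. BIRDTurk provides a controlled testbed for cross-lingual Text-to-SQL evaluation under realistic database conditions. We release the training and development splits to support future research.
[61] TRE: Encouraging Exploration in the Trust Region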
Chao Huang,Yujing Lu,Quangang Li,Shenghe Wang,Yan Wang,Yueyang Zhang,Long Xia,Jiashu Zhao,Zhiyuan Sun,Daiting Shi,Tingwen Liu
Main category: cs.CL
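TL;DR: This paper proposes Trust Region Entropy (TRE), which applies entropy regularization only within the model's trust region to improve exploration in LLM reasoning and alignment tasks, overcoming the failure of conventional global entropy regularization in LLMs.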
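The contrast between global entropy and entropy restricted to a trust region can be illustrated with a few lines of code. The sketch below assumes the trust region is simply the top-`k` tokens of the current distribution, renormalized; the entry mentions top-k or threshold truncation only as examples, so the exact TRE objective may differ, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def trust_region_entropy(logits, k):
    """Entropy of the distribution restricted to its top-k tokens.

    ASSUMPTION: the trust region is the k most probable tokens,
    renormalized to sum to 1. Maximizing this quantity spreads mass
    among plausible candidates without rewarding mass in the tail.
    """
    probs = softmax(logits)
    top = sorted(probs, reverse=True)[:k]
    mass = sum(top)
    return entropy([p / mass for p in top])
```

For a uniform distribution over four tokens, the global entropy is `log(4)` while the top-2 trust-region entropy is `log(2)`: a global entropy bonus would keep rewarding mass placed on the remaining tokens, whereas the restricted version is already at its maximum.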
Details
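Motivation: Standard entropy regularization is effective in reinforcement learning but works poorly, or even does harm, in large language models (LLMs). The authors attribute this to the cumulative tail risk created by LLMs' large vocabularies and long generation horizons: maximizing global entropy misallocates probability mass to a large number of invalid tokens and breaks the coherence of reasoning. Method: Trust Region Entropy (TRE) restricts entropy regularization to a high-confidence 'trust region' of the model's current output distribution (for example, the subset left after top-k or threshold truncation), so that the tail of invalid tokens is not boosted. Result: On mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks, TRE consistently outperforms the PPO baseline, standard entropy regularization, and other exploration methods. Conclusion: Entropy regularization fails in LLMs fundamentally because it ignores the confidence structure of the token distribution. By constraining exploration to the trust region, TRE makes it safer and more effective, offering a new paradigm for LLM reinforcement learning. Abstract: Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model's trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at https://github.com/WhyChaos/TRE-Encouraging-Exploration-in-the-Trust-Region.
[62] RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish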
Süha Kağan Köse,Mehmet Can Baytekin,Burak Aktaş,Bilge Kaan Görür,Evren Ayberk Munis,Deniz Yılmaz,Muhammed Yusuf Kartal,Çağrı Toraman
Main category: cs.CL
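TL;DR: This paper builds the first RAG benchmark dataset for Turkish and systematically evaluates methods for each stage of the RAG pipeline on it. Complex methods such as HyDE are accurate but costly, whereas a lightweight combination of cross-encoder reranking and context augmentation is Pareto-optimal. The paper also shows that over-stacking generative modules hurts performance for morphologically rich languages such as Turkish.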
Details
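Motivation: Existing RAG design guidance is English-centric and offers little help in adapting to morphologically rich languages such as Turkish, so a dedicated benchmark and an exploration of RAG optimization for such languages are needed. Method: The authors build a large-scale Turkish RAG dataset from Turkish Wikipedia and CulturaX, benchmark seven stages of the RAG pipeline (query transformation, reranking, answer refinement, and others) without task-specific fine-tuning, and analyze how different module combinations affect the preservation of morphological cues. Result: HyDE reaches the highest accuracy, 85%, clearly above the baseline's 78.70%. The combination of cross-encoder reranking and context augmentation reaches 84.60% at lower cost. Over-stacking generative modules distorts morphological cues and lowers performance, while simple query clarification with strong reranking is more effective. Conclusion: RAG optimization for morphologically rich languages has to balance accuracy against cost and avoid damaging morphological structure. Lightweight, efficient setups that focus on reranking and query clarification are more practical than stacking generative modules indiscriminately. Abstract: Retrieval-Augmented Generation (RAG) enhances LLM factuality, yet design guidance remains English-centric, limiting insights for morphologically rich languages like Turkish. We address this by constructing a comprehensive Turkish RAG dataset derived from Turkish Wikipedia and CulturaX, comprising question-answer pairs and relevant passage chunks. We benchmark seven stages of the RAG pipeline, from query transformation and reranking to answer refinement, without task-specific fine-tuning. Our results show that complex methods like HyDE maximize accuracy (85%), considerably higher than the baseline (78.70%). A Pareto-optimal configuration using Cross-encoder Reranking and Context Augmentation achieves comparable performance (84.60%) at much lower cost. We further demonstrate that over-stacking generative modules can degrade performance by distorting morphological cues, whereas simple query clarification with robust reranking offers an effective solution.
[63] Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration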
Yu Zhang,Mufan Xu,Xuefeng Bai,Kehai Chen,Pengfei Zhang,Yang Xiang,Min Zhang
Main category: cs.CL
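TL;DR: This paper examines the modality-following mechanism of multimodal large language models (MLLMs) through an information-flow lens. Instruction tokens act as structural anchors: shallow attention layers transfer information non-selectively, deep attention layers resolve modality competition according to the instruction intent, and MLP layers exhibit semantic inertia. The authors identify a small set of critical attention heads and confirm their decisive role through causal interventions.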
Details
Motivation: Modality following is essential to the safety and reliability of MLLMs in real-world deployment, but its internal decision mechanism is not well understood. Method: Taking an information-flow perspective, the paper analyzes the role of instruction tokens in attention and MLP layers, identifies the sparse set of key attention heads that drive modality arbitration, and runs causal intervention experiments. Result: Instruction tokens are the structural anchors of modality arbitration. Shallow attention layers route information non-selectively, deep attention layers resolve modality competition according to the instruction intent, and MLP layers show semantic inertia. Manipulating only 5% of the critical attention heads can raise or lower the modality-following ratio by 60%. Conclusion: The study reveals the internal mechanism of modality following in MLLMs, improves model interpretability, and provides a principled framework for orchestrating multimodal information. Abstract: Modality following is the capacity of multimodal large language models (MLLMs) to selectively utilize multimodal contexts based on user instructions. It is fundamental to ensuring safety and reliability in real-world deployments. However, the underlying mechanisms governing this decision-making process remain poorly understood. In this paper, we investigate its working mechanism through an information flow lens. Our findings reveal that instruction tokens function as structural anchors for modality arbitration: shallow attention layers perform non-selective information transfer, routing multimodal cues to these anchors as a latent buffer; modality competition is resolved within deep attention layers guided by the instruction intent, while MLP layers exhibit semantic inertia, acting as an adversarial force. Furthermore, we identify a sparse set of specialized attention heads that drive this arbitration. Causal interventions demonstrate that manipulating a mere $5\%$ of these critical heads can decrease the modality-following ratio by $60\%$ through blocking, or increase it by $60\%$ through targeted amplification of failed samples. Our work provides a substantial step toward model transparency and offers a principled framework for the orchestration of multimodal information in MLLMs.
[64] Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models
Difan Deng,Andreas Bentzen Winje,Lukas Fehring,Marius Lindauer
Main category: cs.CL
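TL;DR: This paper proposes NAtS-L, a framework that adaptively chooses linear attention or softmax attention for each token within the same layer, balancing efficiency and expressivity.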
Details
Motivation: The quadratic complexity of softmax attention becomes a bottleneck in long-context scenarios, while pure linear attention is limited by the size of its hidden state and lacks expressivity. Hybrid-layer approaches are still held back by their softmax layers. Method: NAtS-L (Neural Attention Search Linear) dynamically chooses, within a single layer, between Gated DeltaNet (linear attention) and softmax attention for each token, and uses a search mechanism to optimize the combination. Result: NAtS-L realizes a token-level hybrid attention architecture that markedly improves computational efficiency while keeping strong modeling capacity, easing the complexity bottleneck in long-context settings. Conclusion: Token-level adaptive attention selection is an effective way to balance efficiency and expressivity, and NAtS-L offers a new paradigm for efficient long-sequence modeling. Abstract: The quadratic computational complexity of softmax transformers has become a bottleneck in long-context scenarios. In contrast, linear attention model families provide a promising direction towards a more efficient sequential model. These linear attention models compress past KV values into a single hidden state, thereby efficiently reducing complexity during both training and inference. However, their expressivity remains limited by the size of their hidden state. Previous work proposed interleaving softmax and linear attention layers to reduce computational complexity while preserving expressivity. Nevertheless, the efficiency of these models remains bottlenecked by their softmax attention layers. In this paper, we propose Neural Attention Search Linear (NAtS-L), a framework that applies both linear attention and softmax attention operations within the same layer on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact and can be encoded into fixed-size hidden states, or require softmax attention, i.e., tokens that contain information related to long-term retrieval and need to be preserved for future queries. By searching for optimal Gated DeltaNet and softmax attention combinations across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.
[65] Rethinking the Reranker: Boundary-Aware Evidence Selection for Robust Retrieval-Augmented Generation
Jiashuo Sun,Pengcheng Jiang,Saizhuo Wang,Jiajun Fan,Heng Wang,Siru Ouyang,Ming Zhong,Yizhu Jiao,Chengsong Huang,Xueqiang Xu,Pengrui Han,Peiran Li,Jiaxin Huang,Ge Liu,Heng Ji,Jiawei Han
Main category: cs.CL
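TL;DR: BAR-RAG introduces a boundary-aware evidence selection mechanism that uses reinforcement learning with generator feedback to optimize the reranker, so that it selects evidence in the 'just right' difficulty range, improving the robustness and performance of RAG systems under retrieval noise.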
Details
Motivation: Existing RAG systems are brittle under retrieval noise because conventional retrievers and rerankers optimize only for relevance and ignore whether the selected evidence actually suits the generator's reasoning needs: it should be neither too easy (leaking the answer) nor too hard (unanswerable). Method: BAR-RAG recasts the reranker as a boundary-aware evidence selector and defines the generator's 'Goldilocks Zone' (the just-right difficulty range). The selector is trained with reinforcement learning from generator feedback, within a two-stage pipeline: the selector first induces an evidence distribution, and the generator is then fine-tuned under that distribution to reduce the mismatch between training and inference. Result: On knowledge-intensive question answering benchmarks, BAR-RAG clearly outperforms strong baselines under noisy retrieval, with an average gain of 10.3% and substantially better robustness. Conclusion: Evidence selection should not pursue relevance alone but should model the boundary of the generator's reasoning ability. Through generator-driven, boundary-aware selection and distribution-aligned fine-tuning, BAR-RAG improves the effectiveness and stability of RAG systems under realistic noise. Abstract: Retrieval-Augmented Generation (RAG) systems remain brittle under realistic retrieval noise, even when the required evidence appears in the top-K results. A key reason is that retrievers and rerankers optimize solely for relevance, often selecting either trivial, answer-revealing passages or evidence that lacks the critical information required to answer the question, without considering whether the evidence is suitable for the generator. We propose BAR-RAG, which reframes the reranker as a boundary-aware evidence selector that targets the generator's Goldilocks Zone: evidence that is neither trivially easy nor fundamentally unanswerable for the generator, but is challenging yet sufficient for inference and thus provides the strongest learning signal. BAR-RAG trains the selector with reinforcement learning using generator feedback, and adopts a two-stage pipeline that fine-tunes the generator under the induced evidence distribution to mitigate the distribution mismatch between training and inference. Experiments on knowledge-intensive question answering benchmarks show that BAR-RAG consistently improves end-to-end performance under noisy retrieval, achieving an average gain of 10.3 percent over strong RAG and reranking baselines while substantially improving robustness. Code is publicly available at https://github.com/GasolSun36/BAR-RAG.
[66] OCRTurk: A Comprehensive OCR Benchmark for Turkish
Deniz Yılmaz,Evren Ayberk Munis,Çağrı Toraman,Süha Kağan Köse,Burak Aktaş,Mehmet Can Baytekin,Bilge Kaan Görür
Main category: cs.CL
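TL;DR: This paper introduces OCRTurk, a new benchmark dataset for Turkish document parsing that covers multiple layout elements and document types, and evaluates seven OCR models across three difficulty levels.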
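Since the entry reports Normalized Edit Distance scores, a short sketch of that metric may help. The code computes the Levenshtein distance between an OCR output and a reference string and divides by the longer length. The normalization convention is an assumption: some benchmarks report this distance (lower is better), others report one minus it as a similarity, and the summary does not say which form OCRTurk uses.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep on a match)
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred, ref):
    """Edit distance scaled to [0, 1] by the longer string's length.

    ASSUMPTION: normalizing by max(len(pred), len(ref)) is one common
    convention, not necessarily the benchmark's.
    """
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))
```

Because Python strings are sequences of Unicode code points, a dropped diacritic such as reading "çay" as "cay" counts as a single substitution, which is the kind of error a Turkish OCR benchmark needs to capture.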
Details
Motivation: Existing document parsing benchmarks mainly target high-resource languages and cover low-resource languages such as Turkish poorly, and there is no standardized benchmark that reflects real-world scenarios and document diversity. Method: The authors build the OCRTurk benchmark from 180 Turkish documents (academic articles, theses, slide decks, and non-academic articles), covering multiple layout elements and three difficulty levels, and evaluate seven OCR models with element-wise metrics. Result: PaddleOCR performs best on most element-wise metrics and on Normalized Edit Distance but is slightly weaker on figures. Models do better on non-academic documents, and slide decks are the hardest to parse. Conclusion: OCRTurk fills the gap in Turkish document parsing benchmarks, provides a practical evaluation tool for document understanding in low-resource languages, and reveals how model performance differs across document types. Abstract: Document parsing is now widely used in applications, such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage for low-resource settings, such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading most element-wise metrics except figures and attaining high Normalized Edit Distance scores in easy, medium, and hard subsets. We also observe performance variation by document type. Models perform well on non-academic documents, while slideshows become the most challenging.
[67] Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models
Yu Tian,Linh Huynh,Katerina Christhilf,Shubham Chakraborty,Micah Watanabe,Tracy Arner,Danielle McNamara
Main category: cs.CL
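TL;DR: This paper proposes ReQUESTA, a hybrid multi-agent framework for generating multiple-choice questions (MCQs) that target different cognitive demands, and shows in a large-scale reading comprehension study that it outperforms single-pass zero-shot GPT-5 generation.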
Details
Motivation: Large language models can generate multiple-choice questions automatically, but cannot reliably control their cognitive demand (e.g., inference or main-idea comprehension); generation needs to be more controllable and educationally valid.
Method: ReQUESTA is a multi-stage, multi-agent framework combining LLM agents with rule-based components. It decomposes MCQ authoring into planning, controlled generation, iterative evaluation, and post-processing, and is evaluated in a large-scale study on academic expository texts using psychometric analysis and expert ratings.
Result: ReQUESTA-generated items are significantly more challenging, more discriminative, and better aligned with overall reading comprehension; experts rate them higher on topic relevance and on the linguistic consistency and semantic plausibility of distractors, especially for inferential questions.
Conclusion: Hybrid agentic orchestration can systematically improve the reliability and controllability of LLM-generated content, and workflow design is a key lever beyond single-pass prompt engineering.
Abstract: Recent advances in large language models (LLMs) have made automated multiple-choice question (MCQ) generation increasingly feasible; however, reliably producing items that satisfy controlled cognitive demands remains a challenge. To address this gap, we introduce ReQUESTA, a hybrid, multi-agent framework for generating cognitively diverse MCQs that systematically target text-based, inferential, and main idea comprehension. ReQUESTA decomposes MCQ authoring into specialized subtasks and coordinates LLM-powered agents with rule-based components to support planning, controlled generation, iterative evaluation, and post-processing. We evaluated the framework in a large-scale reading comprehension study using academic expository texts, comparing ReQUESTA-generated MCQs with those produced by a single-pass GPT-5 zero-shot baseline. Psychometric analyses of learner responses assessed item difficulty and discrimination, while expert raters evaluated question quality across multiple dimensions, including topic relevance and distractor quality. Results showed that ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance. Expert evaluations further indicated stronger alignment with central concepts and superior distractor linguistic consistency and semantic plausibility, particularly for inferential questions.
These findings demonstrate that hybrid, agentic orchestration can systematically improve the reliability and controllability of LLM-based generation, highlighting workflow design as a key lever for structured artifact generation beyond single-pass prompting.
[68] OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering
Yifan Zhu,Xinyu Mu,Tao Feng,Zhonghong Ou,Yuning Gong,Haoran Luo
Main category: cs.CL
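TL;DR: This paper proposes OmniRAG-Agent, an agentic omnimodal QA method for budget-constrained long audio-video question answering. It combines retrieval-augmented generation, agentic planning with tool calls, and group relative policy optimization to significantly improve performance in low-resource settings.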
Details
Motivation: Existing long-horizon multimodal QA methods suffer from costly dense encoding, weak fine-grained retrieval, a lack of proactive reasoning and planning, and no end-to-end joint optimization, which limits them particularly on low-resource long audio-video QA.
Method: OmniRAG-Agent (1) builds an image-audio retrieval-augmented generation module that lets an OmniLLM efficiently fetch relevant frames and audio snippets from external banks; (2) uses an agent loop that supports multi-turn planning, cross-modal tool calls, and evidence fusion; and (3) applies group relative policy optimization (GRPO) to jointly improve tool-use accuracy and answer quality.
Result: On the OmniVideoBench, WorldSense, and Daily-Omni benchmarks, OmniRAG-Agent consistently outperforms prior methods in low-resource settings, and ablations confirm the contribution of each module.
Conclusion: OmniRAG-Agent offers an efficient, scalable, end-to-end optimizable agentic solution for low-resource long-horizon multimodal QA, moving OmniLLMs closer to practical use in complex real-world scenarios.
Abstract: Long-horizon omnimodal question answering answers questions by reasoning over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and no clear end-to-end optimization. To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Furthermore, we apply group relative policy optimization to jointly improve tool use and answer quality over time. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings and achieves strong results, with ablations validating each component.
[69] Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States
Ximing Dong,Shaowei Wang,Dayi Lin,Boyuan Chen,Ahmed E. Hassan
Main category: cs.CL
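TL;DR: This paper proposes SemanticSpec, a semantic-aware speculative decoding framework that verifies entire semantic sequences rather than individual tokens, significantly speeding up inference for large language models and large reasoning models.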
Details
Motivation: Existing speculative decoding methods operate only at the token level and ignore semantic equivalence, causing many unnecessary rejections and leaving the model's generation capacity underused.
Method: SemanticSpec introduces a semantic probability estimation mechanism that probes the model's internal hidden states to estimate the likelihood of generating a sequence with a given meaning, and performs drafting and verification at the semantic level.
Result: On four benchmarks, SemanticSpec achieves speedups of up to 2.7x on DeepSeekR1-32B and 2.1x on QwQ-32B, consistently outperforming token-level and sequence-level baselines.
Conclusion: Speculative decoding at the semantic level matches the model's generation intent more effectively, significantly improves inference efficiency, and suggests a new direction for accelerating large models.
Abstract: Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of thought. While speculative decoding accelerates inference by drafting and verifying multiple tokens in parallel, existing methods operate at the token level and ignore semantic equivalence (i.e., different token sequences expressing the same meaning), leading to inefficient rejections. We propose SemanticSpec, a semantic-aware speculative decoding framework that verifies entire semantic sequences instead of tokens. SemanticSpec introduces a semantic probability estimation mechanism that probes the model's internal hidden states to assess the likelihood of generating sequences with specific meanings. Experiments on four benchmarks show that SemanticSpec achieves up to 2.7x speedup on DeepSeekR1-32B and 2.1x on QwQ-32B, consistently outperforming token-level and sequence-level baselines in both efficiency and effectiveness.
[70] No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding
Vynska Amalia Permadi,Xingwei Tan,Nafise Sadat Moosavi,Nikos Aletras
Main category: cs.CL
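TL;DR: This paper presents ID-MoCQA, the first large-scale multi-hop question answering dataset grounded in Indonesian culture, for evaluating the cultural understanding of large language models. Built with systematically constructed multi-hop reasoning chains and multi-stage validation, it reveals substantial gaps in current models' cultural reasoning, especially fine-grained inference.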
Details
Motivation: Existing culturally focused QA benchmarks consist mostly of single-hop questions, which let models rely on surface cues rather than genuine cultural reasoning and do not jointly test context, tradition, and implicit social knowledge.
Method: A new framework systematically transforms single-hop cultural questions into multi-hop reasoning chains spanning six clue types (e.g., commonsense, temporal, geographical), yielding the bilingual (English/Indonesian) ID-MoCQA dataset; a multi-stage validation pipeline combining expert review and LLM-as-a-judge filtering ensures quality.
Result: Evaluation of several state-of-the-art models shows substantial performance gaps on cultural multi-hop reasoning, particularly on tasks requiring nuanced inference.
Conclusion: ID-MoCQA provides an essential and challenging new benchmark for evaluating and improving the cultural competency of large language models.
Abstract: Understanding culture requires reasoning across context, tradition, and implicit social knowledge, far beyond recalling isolated facts. Yet most culturally focused question answering (QA) benchmarks rely on single-hop questions, which may allow models to exploit shallow cues rather than demonstrate genuine cultural reasoning. In this work, we introduce ID-MoCQA, the first large-scale multi-hop QA dataset for assessing the cultural understanding of large language models (LLMs), grounded in Indonesian traditions and available in both English and Indonesian. We present a new framework that systematically transforms single-hop cultural questions into multi-hop reasoning chains spanning six clue types (e.g., commonsense, temporal, geographical). Our multi-stage validation pipeline, combining expert review and LLM-as-a-judge filtering, ensures high-quality question-answer pairs. Our evaluation across state-of-the-art models reveals substantial gaps in cultural reasoning, particularly in tasks requiring nuanced inference. ID-MoCQA provides a challenging and essential benchmark for advancing the cultural competency of LLMs.
[71] Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling
Yubao Zhao,Weiquan Huang,Sudong Wang,Ruochen Zhao,Chen Chen,Yao Shu,Chengwei Qin
Main category: cs.CL
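TL;DR: This paper proposes BranPO, which uses tail truncation and contrastive suffix construction to provide step-level contrastive supervision for long-horizon reinforcement learning without a value function or dense rewards, significantly improving performance on long-horizon tasks.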
Details
Motivation: Existing tree-based methods for long-horizon reinforcement learning suffer from high variance and computational inefficiency; empirical analysis shows that differences in agent performance arise mainly from decisions near the tail of the trajectory, which calls for more precise credit assignment.
Method: The value-free Branching Relative Policy Optimization (BranPO) (1) truncates trajectories near the tail and resamples alternative suffixes to build contrastive suffixes over shared prefixes; (2) introduces difficulty-aware branch sampling to adapt the branching frequency; and (3) applies redundant step masking to suppress uninformative actions.
Result: On multiple question answering benchmarks, BranPO consistently outperforms strong baselines without increasing the training budget, with significant accuracy gains on long-horizon tasks in particular.
Conclusion: Through contrastive supervision at the tail, BranPO alleviates credit ambiguity in long-horizon RL while balancing training stability and computational efficiency, offering a new paradigm for agentic RL.
Abstract: Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at https://github.com/YubaoZhao/BranPO.
[72] CUBO: Self-Contained Retrieval-Augmented Generation on Consumer Laptops 10 GB Corpora, 16 GB RAM, Single-Device Deployment
Paolo Astrino
Main category: cs.CL
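TL;DR: CUBO is a RAG platform for consumer laptops (16 GB RAM) that uses streaming ingestion, tiered hybrid retrieval, and hardware-aware orchestration to deliver GDPR-compliant local document processing and high-recall retrieval within a 15.5 GB memory ceiling.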
Details
Motivation: Organizations handling sensitive documents face a conflict between the GDPR violation risk of cloud AI and the high memory requirements (18-32 GB RAM) of local systems.
Method: CUBO integrates streaming ingestion (O(1) buffer overhead), tiered hybrid retrieval, and hardware-aware orchestration, and performs local-only processing on consumer laptops with 16 GB of shared memory.
Result: On BEIR benchmarks it attains Recall@10 of 0.48-0.97; p50 retrieval latency is 185 ms (on C1,300 laptops); memory use stays strictly within 15.5 GB; the code is open source.
Conclusion: CUBO demonstrates that a GDPR-compliant, high-performance, deployable RAG system can be built on resource-constrained devices, suited to small-to-medium professional archives.
Abstract: Organizations handling sensitive documents face a tension: cloud-based AI risks GDPR violations, while local systems typically require 18-32 GB RAM. This paper presents CUBO, a systems-oriented RAG platform for consumer laptops with 16 GB shared memory. CUBO's novelty lies in engineering integration of streaming ingestion (O(1) buffer overhead), tiered hybrid retrieval, and hardware-aware orchestration that enables competitive Recall@10 (0.48-0.97 across BEIR domains) within a hard 15.5 GB RAM ceiling. The 37,000-line codebase achieves retrieval latencies of 185 ms (p50) on C1,300 laptops while maintaining data minimization through local-only processing aligned with GDPR Art. 5(1)(c). Evaluation on BEIR benchmarks validates practical deployability for small-to-medium professional archives. The codebase is publicly available at https://github.com/PaoloAstrino/CUBO.
[73] Context Compression via Explicit Information Transmission
Jiangnan Ye,Hanqi Yan,Zhenyi Shen,Heng Chang,Ye Mao,Yulan He
Main category: cs.CL
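TL;DR: This paper proposes ComprExIT, a lightweight soft context compression framework based on explicit information transmission. It addresses progressive representation overwriting and uncoordinated allocation of compression capacity in existing LLM context compression methods, and significantly outperforms state-of-the-art methods on multiple QA benchmarks.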
Details
Motivation: Long-context inference is costly because of attention complexity and growing KV caches, which motivates context compression; existing soft compression methods that rely on the LLM's own layer-by-layer self-attention have two structural flaws: progressive representation overwriting across layers and uncoordinated allocation of compression capacity.
Method: ComprExIT formulates soft compression as explicit information transmission over frozen LLM hidden states: (i) depth-wise transmission selectively passes multi-layer information into token anchors, mitigating overwriting; (ii) width-wise transmission aggregates anchors into a small number of compressed slots via a globally optimized transmission plan, coordinating capacity allocation.
Result: ComprExIT consistently outperforms state-of-the-art context compression methods on six question-answering benchmarks while adding only about 1% extra parameters.
Conclusion: Explicit and coordinated information transmission enables more efficient and more robust soft compression of long contexts.
Abstract: Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches, motivating context compression. In this work, we study soft context compression, where a long context is condensed into a small set of continuous representations. Existing methods typically re-purpose the LLM itself as a trainable compressor, relying on layer-by-layer self-attention to iteratively aggregate information. We argue that this paradigm suffers from two structural limitations: (i) progressive representation overwriting across layers and (ii) uncoordinated allocation of compression capacity across tokens. We propose ComprExIT (Context Compression via Explicit Information Transmission), a lightweight framework that formulates soft compression into a new paradigm: explicit information transmission over frozen LLM hidden states. This decouples compression from the model's internal self-attention dynamics. ComprExIT performs (i) depth-wise transmission to selectively transmit multi-layer information into token anchors, mitigating progressive overwriting, and (ii) width-wise transmission to aggregate anchors into a small number of slots via a globally optimized transmission plan, ensuring coordinated allocation of information.
Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters, demonstrating that explicit and coordinated information transmission enables more effective and robust long-context compression.
[74] They Said Memes Were Harmless-We Found the Ones That Hurt: Decoding Jokes, Symbols, and Cultural References
Sahil Tripathi,Gautam Siddharth Kashyap,Mehwish Nasim,Jian Yang,Jiechao Gao,Usman Naseem
Main category: cs.CL
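TL;DR: This paper proposes CROSS-ALIGN+, a framework that improves the performance and interpretability of meme-based social abuse detection through external knowledge, LoRA fine-tuning, and cascaded explanation generation.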
Details
Motivation: Existing methods are limited in three respects: understanding cultural symbols, distinguishing satire from abuse, and making model reasoning interpretable.
Method: CROSS-ALIGN+ has three stages: (I) it mitigates cultural blindness by incorporating structured knowledge from ConceptNet, Wikidata, and Hatebase; (II) it refines decision boundaries with parameter-efficient LoRA adapters; (III) it generates cascaded explanations to improve interpretability.
Result: Experiments on five benchmark datasets and eight large vision-language models show up to 17% relative F1 improvement over the state of the art, with an interpretable justification for each decision.
Conclusion: CROSS-ALIGN+ systematically addresses the three challenges of cultural blindness, boundary ambiguity, and lack of interpretability in meme abuse detection, balancing performance and transparency.
Abstract: Meme-based social abuse detection is challenging because harmful intent often relies on implicit cultural symbolism and subtle cross-modal incongruence. Prior approaches, from fusion-based methods to in-context learning with Large Vision-Language Models (LVLMs), have made progress but remain limited by three factors: i) cultural blindness (missing symbolic context), ii) boundary ambiguity (satire vs. abuse confusion), and iii) lack of interpretability (opaque model reasoning). We introduce CROSS-ALIGN+, a three-stage framework that systematically addresses these limitations: (1) Stage I mitigates cultural blindness by enriching multimodal representations with structured knowledge from ConceptNet, Wikidata, and Hatebase; (2) Stage II reduces boundary ambiguity through parameter-efficient LoRA adapters that sharpen decision boundaries; and (3) Stage III enhances interpretability by generating cascaded explanations. Extensive experiments on five benchmarks and eight LVLMs demonstrate that CROSS-ALIGN+ consistently outperforms state-of-the-art methods, achieving up to 17% relative F1 improvement while providing interpretable justifications for each decision.
[75] Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff,Vincent Cohen-Addad,Lalit Jain,Jieming Mao,Song Zuo,MohammadHossein Bateni,Simina Branzei,Michael P. Brenner,Lin Chen,Ying Feng,Lance Fortnow,Gang Fu,Ziyi Guan,Zahra Hadizadeh,Mohammad T. Hajiaghayi,Mahdi JafariRaviz,Adel Javanmard,Karthik C. S.,Ken-ichi Kawarabayashi,Ravi Kumar,Silvio Lattanzi,Euiwoong Lee,Yi Li,Ioannis Panageas,Dimitris Paparas,Benjamin Przybocki,Bernardo Subercaseaux,Ola Svensson,Shayan Taherijam,Xuan Wu,Eylon Yogev,Morteza Zadimoghaddam,Samson Zhou,Vahab Mirrokni
Main category: cs.CL
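TL;DR: Through a series of case studies, this paper shows how researchers collaborated with Gemini-based models (in particular Gemini Deep Think) to obtain new mathematical results in theoretical computer science, economics, optimization, and physics, including solving open problems, refuting conjectures, and generating new proofs. It distills general strategies for human-AI collaboration and explores advanced uses such as deploying the AI as an adversarial reviewer and embedding it in a neuro-symbolic loop.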
Details
Motivation: To investigate the potential of large language models for expert-level mathematical discovery, beyond assistance with routine tasks, and to assess their practical value as collaborators in original research.
Method: Multi-domain case studies with advanced models such as Gemini Deep Think; distillation of human-AI collaboration techniques (iterative refinement, problem decomposition, cross-disciplinary knowledge transfer); exploration of non-standard interaction modes (the AI as adversarial reviewer, neuro-symbolic closed-loop verification).
Result: The approach was applied to solve open problems, refute conjectures, and generate new proofs; it yields reusable patterns for human-AI collaboration and shows AI applied to proof review and automated verification of derivations.
Conclusion: AI is not only an automation tool but can act as a genuine partner in creative research, with particular value in theoretical work that demands deep reasoning and cross-domain integration.
Abstract: Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google's Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a "neuro-symbolic" loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.
[76] Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing
Tong Zheng,Chengsong Huang,Runpeng Dai,Yun He,Rui Liu,Xin Ni,Huiwen Bao,Kaishen Wang,Hongtu Zhu,Jiaxin Huang,Furong Huang,Heng Huang
Main category: cs.CL
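TL;DR: This paper proposes the 2D probing interface and the Parallel-Probe controller, which analyze and exploit the width-depth dynamics of parallel reasoning to enable training-free online optimization, significantly reducing token consumption while maintaining accuracy.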
Details
Motivation: Existing efficiency methods for parallel reasoning rely only on local signals from individual trajectories and lack a mechanism for modeling global dynamics across branches.
Method: 2D probing exposes the width-depth dynamics of parallel reasoning and reveals three phenomena (non-monotonic scaling, heterogeneous branch lengths, and early stabilization of global consensus). Based on these, the training-free Parallel-Probe controller combines consensus-based early stopping (to regulate depth) with deviation-based branch pruning (to adjust width).
Result: Across three benchmarks and multiple models, Parallel-Probe reaches a better Pareto frontier for test-time scaling: compared with majority voting, it reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.
Conclusion: Coordinated width-depth control is key to efficient parallel reasoning, and Parallel-Probe offers a simple, general, training-free control paradigm.
Abstract: Parallel thinking has emerged as a promising paradigm for reasoning, yet it imposes significant computational burdens. Existing efficiency methods primarily rely on local, per-trajectory signals and lack principled mechanisms to exploit global dynamics across parallel branches. We introduce 2D probing, an interface that exposes the width-depth dynamics of parallel thinking by periodically eliciting intermediate answers from all branches. Our analysis reveals three key insights: non-monotonic scaling across width-depth allocations, heterogeneous reasoning branch lengths, and early stabilization of global consensus. Guided by these insights, we introduce Parallel-Probe, a training-free controller designed to optimize online parallel thinking. Parallel-Probe employs consensus-based early stopping to regulate reasoning depth and deviation-based branch pruning to dynamically adjust width. Extensive experiments across three benchmarks and multiple models demonstrate that Parallel-Probe establishes a superior Pareto frontier for test-time scaling. Compared to standard majority voting, it reduces sequential tokens by up to 35.8% and total token cost by over 25.8% while maintaining competitive accuracy.
cs.CV [Back]
[77] WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models
Runjie Zhou,Youbo Shao,Haoyu Lu,Bowei Xing,Tongtong Bai,Yujie Chen,Jie Zhao,Lin Sui,Haotian Yao,Zijia Zhao,Hao Yang,Haoning Wu,Zaida Zhou,Jinguo Zhu,Zhiqi Huang,Yiping Bao,Yangyang Liu,Y. Charles,Xinyu Zhou
Main category: cs.CV
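TL;DR: WorldVQA is a new benchmark for evaluating the atomic visual world knowledge of multimodal large language models (MLLMs). It decouples visual knowledge retrieval from reasoning to measure what the model has memorized, using a stratified taxonomy that spans common objects to long-tail rare entities.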
Details
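Motivation: Existing evaluations often conflate visual knowledge retrieval with reasoning, making it hard to measure a model's memory of visual facts accurately; a benchmark is needed that strictly isolates and quantifies atomic visual knowledge.
Method: WorldVQA uses a stratified taxonomy covering head-class to long-tail rare visual entities and focuses on atomic capabilities such as grounding and naming visual entities, decoupling knowledge memorization from higher-level reasoning.
Result: WorldVQA provides a new paradigm for evaluating the visual factuality, encyclopedic breadth, and hallucination rates of MLLMs, and initial validation suggests it can serve as a standardized test of the visual knowledge of frontier models.
Conclusion: WorldVQA fills the gap in evaluating atomic visual world knowledge and may become a key standard for measuring the visual factuality and knowledge robustness of MLLMs.
Abstract: We introduce WorldVQA, a benchmark designed to evaluate the atomic visual world knowledge of Multimodal Large Language Models (MLLMs). Unlike current evaluations, which often conflate visual knowledge retrieval with reasoning, WorldVQA decouples these capabilities to strictly measure "what the model memorizes." The benchmark assesses the atomic capability of grounding and naming visual entities across a stratified taxonomy, spanning from common head-class objects to long-tail rarities. We expect WorldVQA to serve as a rigorous test for visual factuality, thereby establishing a standard for assessing the encyclopedic breadth and hallucination rates of current and next-generation frontier models.
[78] AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process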
Xintong Zhang,Xiaowen Zhang,Jongrong Wu,Zhi Gao,Shilin Yan,Zhenxin Diao,Kunpeng Gao,Xuanyan Chen,Yuwei Wu,Yunde Jia,Qing Li
Main category: cs.CV
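TL;DR: This paper proposes AdaptMMBench, a comprehensive benchmark for evaluating adaptive multimodal reasoning in vision-language models across five domains. It uses the Matthews Correlation Coefficient (MCC) to measure the rationality of mode selection, enabling a decoupled evaluation of meta-cognitive ability.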
Details
Motivation: Existing evaluations rely on static difficulty labels and simplistic metrics, which cannot capture how task difficulty varies with model capability; they conflate adaptive mode selection with overall performance and lack fine-grained process analysis.
Method: AdaptMMBench covers five domains, uses the Matthews Correlation Coefficient (MCC) to evaluate dynamically how rational the choice among reasoning modes is, and supports multi-dimensional process evaluation covering key step coverage, tool effectiveness, and computational efficiency.
Result: Adaptive mode selection improves with model capacity but is notably decoupled from final accuracy; key step coverage correlates positively with performance, while tool effectiveness varies widely across architectures.
Conclusion: AdaptMMBench decouples and quantifies the meta-cognitive ability underlying adaptive reasoning in VLMs, reveals inconsistencies in current models' mode selection, step execution, and tool use, and offers a new evaluation paradigm and directions for improvement.
Abstract: Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to dynamically modulate between tool-augmented visual reasoning and text reasoning to enhance both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench utilizes a Matthews Correlation Coefficient (MCC) metric to evaluate the selection rationality of different reasoning modes, isolating this meta-cognition ability by dynamically identifying task difficulties based on models' capability boundaries. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it notably decouples from final accuracy. Conversely, key step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.
[79] End-to-end reconstruction of OCT optical properties and speckle-reduced structural intensity via physics-based learning
Jinglun Yu,Yaning Wang,Wenhan Guo,Yuan Gao,Yu Sun,Jin U. Kang
Main category: cs.CV
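TL;DR: This paper proposes a regularized end-to-end framework combining a physical model with deep learning for inverse scattering in OCT, jointly reconstructing tissue optical parameter maps (refractive index, scattering coefficient, anisotropy) and speckle-reduced structural intensity images.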
Details
Motivation: The OCT inverse scattering problem is highly challenging because of attenuation, speckle noise, and strong parameter coupling; a robust method is needed that delivers both quantitative multi-parameter characterization and high-quality structural visualization.
Method: A regularized end-to-end deep learning framework embeds a physics-based OCT forward model and is trained with ground truth generated by Monte Carlo simulation, jointly reconstructing optical parameter maps and speckle-reduced structural images.
Result: On a synthetic corneal OCT dataset, the method robustly recovers optical parameter maps under noise and improves resolution and structural fidelity.
Conclusion: The method enables quantitative multi-parameter tissue characterization and demonstrates the benefit of combining physics-informed modeling with deep learning for computational OCT.
Abstract: Inverse scattering in optical coherence tomography (OCT) seeks to recover both structural images and intrinsic tissue optical properties, including refractive index, scattering coefficient, and anisotropy. This inverse problem is challenging due to attenuation, speckle noise, and strong coupling among parameters. We propose a regularized end-to-end deep learning framework that jointly reconstructs optical parameter maps and speckle-reduced OCT structural intensity for layer visualization. Trained with Monte Carlo-simulated ground truth, our network incorporates a physics-based OCT forward model that generates predicted signals from the estimated parameters, providing physics-consistent supervision for parameter recovery and artifact suppression. Experiments on the synthetic corneal OCT dataset demonstrate robust optical map recovery under noise, improved resolution, and enhanced structural fidelity. This approach enables quantitative multi-parameter tissue characterization and highlights the benefit of combining physics-informed modeling with deep learning for computational OCT.
[80] SVD-ViT: Does SVD Make Vision Transformers Attend More to the Foreground?
Haruhiko Murata,Kazuhiro Hotta
Main category: cs.CV
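TL;DR: This paper proposes SVD-ViT, which uses singular value decomposition (SVD) to strengthen ViT's learning of foreground features and suppress background noise and artifacts, improving image classification performance.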
Details
Motivation: Because self-attention is global, Vision Transformers lack an explicit way to distinguish foreground from background and tend to learn irrelevant background features, which degrades classification performance.
Method: SVD-ViT consists of three components (the SPC module, SSVA, and ID-RSVD) that extract and aggregate singular vectors capturing foreground information, suppressing task-irrelevant factors such as background noise and artifacts.
Result: Experiments show improved classification accuracy, more effective learning of foreground representations, and reduced influence of background noise.
Conclusion: SVD-ViT gives ViT foreground awareness and is an effective new way to mitigate background interference and improve the quality of visual representations.
Abstract: Vision Transformers (ViT) have been established as large-scale foundation models. However, because self-attention operates globally, they lack an explicit mechanism to distinguish foreground from background. As a result, ViT may learn unnecessary background features and artifacts, leading to degraded classification performance. To address this issue, we propose SVD-ViT, which leverages singular value decomposition (SVD) to prioritize the learning of foreground features. SVD-ViT consists of three components (the SPC module, SSVA, and ID-RSVD) and suppresses task-irrelevant factors such as background noise and artifacts by extracting and aggregating singular vectors that capture object foreground information. Experimental results demonstrate that our method improves classification accuracy and effectively learns informative foreground representations while reducing the impact of background noise.
[81] LmPT: Conditional Point Transformer for Anatomical Landmark Detection on 3D Point Clouds
Matteo Bastico,Pierre Onghena,David Ryckelynck,Beatriz Marcotegui,Santiago Velasco-Forero,Laurent Corté,Caroline Robine--Decourcelle,Etienne Decencière
Main category: cs.CV
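TL;DR: This paper proposes Landmark Point Transformer (LmPT), a new method for automatically detecting anatomical landmarks on anatomical surfaces represented as point clouds, with support for cross-species generalization (e.g., human and dog).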
Details
Motivation: Manual landmarking is time-consuming and subject to inter-observer variability, and rule-based methods are tied to specific geometries or landmark sets; point clouds offer a lightweight, flexible surface representation, calling for an automatic landmark detection method that generalizes across species.
Method: LmPT takes point clouds as input and introduces a conditioning mechanism that adapts to different input types, enabling cross-species learning; it is validated on human and dog femur data.
Result: LmPT shows good generalization and effectiveness on anatomical landmark detection for human and dog femurs; the code and the newly annotated dog femur dataset will be released.
Conclusion: LmPT is a scalable, transferable framework for automatic anatomical landmark detection and a new tool for cross-species anatomical modeling and translational research.
Abstract: Accurate identification of anatomical landmarks is crucial for various medical applications. Traditional manual landmarking is time-consuming and prone to inter-observer variability, while rule-based methods are often tailored to specific geometries or limited sets of landmarks. In recent years, anatomical surfaces have been effectively represented as point clouds, which are lightweight structures composed of spatial coordinates. Following this strategy and to overcome the limitations of existing landmarking techniques, we propose Landmark Point Transformer (LmPT), a method for automatic anatomical landmark detection on point clouds that can leverage homologous bones from different species for translational research. The LmPT model incorporates a conditioning mechanism that enables adaptability to different input types to conduct cross-species learning. We focus the evaluation of our approach on femoral landmarking using both human and newly annotated dog femurs, demonstrating its generalization and effectiveness across species. The code and dog femur dataset will be publicly available at: https://github.com/Pierreoo/LandmarkPointTransformer.
[82] Self-Supervised Uncalibrated Multi-View Video Anonymization in the Operating Room
Keqi Chen,Vinkle Srivastav,Armine Vardazaryan,Cindy Rolland,Didier Mutter,Nicolas Padoy
Main category: cs.CV
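TL;DR: This paper proposes a self-supervised multi-view surgical video anonymization framework that requires neither manual annotation nor camera calibration. It retrieves missed detections via cross-view and temporal consistency to generate pseudo labels, iteratively refines the detection and pose estimation models, and achieves over 97% recall on real and simulated surgical data.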
Details
Motivation: Existing surgical video anonymization methods face two scalability bottlenecks: each new clinical site needs manual annotation for high accuracy, and multi-camera setups need recalibration whenever cameras are repositioned.
Method: The self-supervised multi-view framework comprises whole-body person detection and whole-body pose estimation. An off-the-shelf detector is first run with a low score threshold to collect candidate boxes; tracking and self-supervised uncalibrated multi-view association then retrieve consistent low-score false negatives as pseudo labels for iteratively fine-tuning the detector; finally, pose estimation is run on the detections, and the pose model is fine-tuned in a self-supervised way on its own high-score predictions.
Result: The method achieves over 97% recall on the 4D-OR dataset of simulated surgeries and on an in-house dataset of real surgeries; a real-time whole-body detector trained on the pseudo labels performs on par with supervised methods, supporting practical applicability.
Conclusion: The method greatly reduces reliance on manual annotation and camera calibration, improves the scalability and deployability of surgical video anonymization, and offers an efficient and practical approach to privacy protection in OR research.
Abstract: Privacy preservation is a prerequisite for using video data in Operating Room (OR) research. Effective anonymization relies on the exhaustive localization of every individual; even a single missed detection necessitates extensive manual correction. However, existing approaches face two critical scalability bottlenecks: (1) they usually require manual annotations of each new clinical site for high accuracy; (2) while multi-camera setups have been widely adopted to address single-view ambiguity, camera calibration is typically required whenever cameras are repositioned. To address these problems, we propose a novel self-supervised multi-view video anonymization framework consisting of whole-body person detection and whole-body pose estimation, without annotation or camera calibration. Our core strategy is to enhance the single-view detector by "retrieving" false negatives using temporal and multi-view context, and conducting self-supervised domain adaptation. We first run an off-the-shelf whole-body person detector in each view with a low-score threshold to gather candidate detections. Then, we retrieve the low-score false negatives that exhibit consistency with the high-score detections via tracking and self-supervised uncalibrated multi-view association. These recovered detections serve as pseudo labels to iteratively fine-tune the whole-body detector. Finally, we apply whole-body pose estimation on each detected person, and fine-tune the pose model using its own high-score predictions.
Experiments on the 4D-OR dataset of simulated surgeries and our dataset of real surgeries show the effectiveness of our approach achieving over 97% recall. Moreover, we train a real-time whole-body detector using our pseudo labels, achieving comparable performance and highlighting our method's practical applicability. Code is available at https://github.com/CAMMA-public/OR_anonymization.
[83] ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying
Weihang You,Qingchan Zhu,David Liu,Yi Pan,Geng Yuan,Hanqi Jiang
Main category: cs.CV
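TL;DR: ViThinker is a new vision-language model framework that autonomously generates query tokens to synthesize expert-aligned visual features on demand, enabling active perception and reasoning and significantly improving visual reasoning accuracy.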
Details
Motivation: Chain-of-Thought (CoT) reasoning in vision-language models is limited by premature visual-to-text conversion, which discards continuous information such as geometry and spatial layout; current enhancement methods passively process pre-computed inputs and lack human-like active perception.
Method: ViThinker uses (1) a two-stage curriculum that first distills frozen vision experts into the model parameters and then learns task-driven query generation via sparsity penalties; and (2) decision (query) tokens that the model generates autonomously at inference time to trigger the synthesis of expert-aligned visual features, performing generative mental simulation without external tools.
Result: Consistent improvements across multiple vision-centric benchmarks confirm that active query generation outperforms passive approaches in perceptual grounding and reasoning accuracy.
Conclusion: Active perception, i.e., querying and generating visual features on demand, is a key route to stronger reasoning in vision-language models, and ViThinker offers a new paradigm for multimodal models with a human-like perception-reasoning loop.
Abstract: Chain-of-Thought (CoT) reasoning excels in language models but struggles in vision-language models due to premature visual-to-text conversion that discards continuous information such as geometry and spatial layout. While recent methods enhance CoT through static enumeration or attention-based selection, they remain passive, i.e., processing pre-computed inputs rather than actively seeking task-relevant details. Inspired by human active perception, we introduce ViThinker, a framework that enables vision-language models to autonomously generate decision (query) tokens triggering the synthesis of expert-aligned visual features on demand. ViThinker internalizes vision-expert capabilities during training, performing generative mental simulation during inference without external tool calls. Through a two-stage curriculum (first distilling frozen experts into model parameters, then learning task-driven querying via sparsity penalties), ViThinker discovers minimal sufficient perception for each reasoning step. Evaluations across vision-centric benchmarks demonstrate consistent improvements, validating that active query generation outperforms passive approaches in both perceptual grounding and reasoning accuracy.
[84] DoubleTake: Contrastive Reasoning for Faithful Decision-Making in Medical Imaging
Daivik Patel,Shrenik Patel
Main category: cs.CV
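TL;DR: This paper proposes a contrastive, document-aware reference selection framework and a counterfactual-contrastive reasoning method to improve diagnostic accuracy for easily confused conditions in medical imaging.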
Details
Motivation: Existing medical-imaging decision methods mostly rely on nearest-neighbor retrieval, which tends to return redundant evidence and reinforce a single hypothesis, making visually similar conditions hard to tell apart. Method: Builds a contrastive, document-aware reference selection framework on ROCO embeddings and metadata that balances visual relevance, embedding diversity, and source provenance; proposes a counterfactual-contrastive inference framework that performs structured pairwise visual comparisons and applies a margin-based decision rule with faithful abstention. Result: On the MediConfusion benchmark, the method improves set-level accuracy by nearly 15% relative to prior methods, significantly reduces the confusion rate, and improves individual accuracy. Conclusion: Contrastive reference selection and counterfactual-contrastive inference effectively improve the robustness and interpretability of fine-grained discrimination in medical imaging, offering a new paradigm for clinical decision support. Abstract: Accurate decision making in medical imaging requires reasoning over subtle visual differences between confusable conditions, yet most existing approaches rely on nearest neighbor retrieval that returns redundant evidence and reinforces a single hypothesis. We introduce a contrastive, document-aware reference selection framework that constructs compact evidence sets optimized for discrimination rather than similarity by explicitly balancing visual relevance, embedding diversity, and source-level provenance using ROCO embeddings and metadata. While ROCO provides large-scale image-caption pairs, it does not specify how references should be selected for contrastive reasoning, and naive retrieval frequently yields near-duplicate figures from the same document. To address this gap, we release a reproducible reference selection protocol and curated reference bank that enable a systematic study of contrastive retrieval in medical image reasoning. Building on these contrastive evidence sets, we propose Counterfactual-Contrastive Inference, a confidence-aware reasoning framework that performs structured pairwise visual comparisons and aggregates evidence using margin-based decision rules with faithful abstention. On the MediConfusion benchmark, our approach achieves state-of-the-art performance, improving set-level accuracy by nearly 15% relative to prior methods while reducing confusion and improving individual accuracy.
[85] FaceLinkGen: Rethinking Identity Leakage in Privacy-Preserving Face Recognition with Identity Extraction
Wenqi Guo,Shan Du
Main category: cs.CV
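TL;DR: This paper presents the FaceLinkGen attack, which exposes a flaw in how existing transformation-based privacy-preserving face recognition (PPFR) systems are evaluated: over-reliance on pixel-level reconstruction metrics such as PSNR and SSIM. The attack achieves high-accuracy identity matching and face regeneration without recovering the original pixels, showing that visual obfuscation does not effectively hide identity information.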
Details
Motivation: Existing PPFR evaluations rely too heavily on pixel-level reconstruction quality (e.g., PSNR, SSIM) and overlook the fact that identity information can still be extracted without recovering the original pixels, which leads to misjudging the real privacy guarantee. Method: Proposes the FaceLinkGen attack, which performs identity linkage/matching and face regeneration directly from protected face templates without reconstructing the original pixels; it is validated on several mainstream PPFR systems across settings, including a near-zero-knowledge setting. Result: On three recent PPFR systems, FaceLinkGen reaches over 98.5% matching accuracy and over 96% regeneration success; in the near-zero-knowledge setting it still exceeds 92% matching and 94% regeneration. Conclusion: Pixel-distortion metrics such as PSNR and SSIM do not reflect the real privacy level of PPFR; visual obfuscation does not effectively protect identity information, which exposes a structural flaw in current PPFR design and evaluation. Abstract: Transformation-based privacy-preserving face recognition (PPFR) aims to verify identities while hiding facial data from attackers and malicious service providers. Existing evaluations mostly treat privacy as resistance to pixel-level reconstruction, measured by PSNR and SSIM. We show that this reconstruction-centric view fails. We present FaceLinkGen, an identity extraction attack that performs linkage/matching and face regeneration directly from protected templates without recovering original pixels. On three recent PPFR systems, FaceLinkGen reaches over 98.5% matching accuracy and above 96% regeneration success, and still exceeds 92% matching and 94% regeneration in a near zero knowledge setting. These results expose a structural gap between pixel distortion metrics, which are widely used in PPFR evaluation, and real privacy. We show that visual obfuscation leaves identity information broadly exposed to both external intruders and untrusted service providers.
[86] A Multi-scale Linear-time Encoder for Whole-Slide Image Analysis
Jagan Mohan Reddy Dwarampudi,Joshua Wong,Hien Van Nguyen,Tania Banerjee
Main category: cs.CV
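TL;DR: MARBLE is a Mamba-based multi-scale adaptive recurrent model for whole-slide image (WSI) analysis; it models cross-scale dependencies efficiently with linear-time state-space modeling and improves AUC, accuracy, and C-index on several public datasets.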
Details
Motivation: WSI analysis is challenging because of gigapixel resolutions and hierarchical magnifications; existing MIL methods are mostly single-scale, and transformer-based methods incur quadratic attention costs. Method: Proposes MARBLE, a purely Mamba-based framework that processes multiple scales in parallel with coarse-to-fine reasoning and models cross-scale dependencies with a linear-time state-space model at small parameter overhead. Result: On five public datasets, AUC improves by up to 6.9%, accuracy by up to 20.3%, and C-index by up to 2.3%. Conclusion: MARBLE offers a scalable, modular, and efficient alternative to attention-based architectures for multi-scale WSI analysis. Abstract: We introduce Multi-scale Adaptive Recurrent Biomedical Linear-time Encoder (MARBLE), the first purely Mamba-based multi-state multiple instance learning (MIL) framework for whole-slide image (WSI) analysis. MARBLE processes multiple magnification levels in parallel and integrates coarse-to-fine reasoning within a linear-time state-space model, efficiently capturing cross-scale dependencies with minimal parameter overhead. WSI analysis remains challenging due to gigapixel resolutions and hierarchical magnifications, while existing MIL methods typically operate at a single scale and transformer-based approaches suffer from quadratic attention costs. By coupling parallel multi-scale processing with linear-time sequence modeling, MARBLE provides a scalable and modular alternative to attention-based architectures. Experiments on five public datasets show improvements of up to 6.9% in AUC, 20.3% in accuracy, and 2.3% in C-index, establishing MARBLE as an efficient and generalizable framework for multi-scale WSI analysis.
[87] SRA-Seg: Synthetic to Real Alignment for Semi-Supervised Medical Image Segmentation
OFM Riaz Rahman Aranya,Kevin Desai
Main category: cs.CV
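TL;DR: This paper proposes SRA-Seg, a framework that bridges the domain gap between synthetic and real medical images through semantic feature alignment, soft edge blending, and an uncertainty-aware soft segmentation loss, approaching fully supervised performance with only 10% labeled real data.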
Details
Motivation: Synthetic medical images look realistic but lie in a different semantic feature space from real images, so existing semi-supervised methods struggle to improve segmentation with them. Method: Proposes the SRA-Seg framework: (1) a similarity-alignment (SA) loss on frozen DINOv2 embeddings that pulls synthetic features toward real ones; (2) soft edge blending that produces smooth anatomical transitions and continuous labels; (3) an EMA teacher model that generates pseudo-labels for synthetic images, combined with a soft segmentation loss that respects uncertainty in mixed regions. Result: On ACDC and FIVES, with only 10% labeled real data plus 90% unlabeled synthetic data, Dice scores reach 89.34% and 84.42% respectively, significantly outperforming existing semi-supervised methods and matching methods that use real unlabeled data. Conclusion: Semantic-level feature alignment and uncertainty modeling are key to making synthetic data useful for medical image segmentation; SRA-Seg offers a new paradigm for robust segmentation at low annotation cost. Abstract: Synthetic data, an appealing alternative to extensive expert-annotated data for medical image segmentation, consistently fails to improve segmentation performance despite its visual realism. The reason is that synthetic and real medical images exist in different semantic feature spaces, creating a domain gap that current semi-supervised learning methods cannot bridge. We propose SRA-Seg, a framework explicitly designed to align synthetic and real feature distributions for medical image segmentation. SRA-Seg introduces a similarity-alignment (SA) loss using frozen DINOv2 embeddings to pull synthetic representations toward their nearest real counterparts in semantic space. We employ soft edge blending to create smooth anatomical transitions and continuous labels, eliminating the hard boundaries from traditional copy-paste augmentation. The framework generates pseudo-labels for synthetic images via an EMA teacher model and applies soft-segmentation losses that respect uncertainty in mixed regions. Our experiments demonstrate strong results: using only 10% labeled real data and 90% synthetic unlabeled data, SRA-Seg achieves 89.34% Dice on ACDC and 84.42% on FIVES, significantly outperforming existing semi-supervised methods and matching the performance of methods using real unlabeled data.
[88] Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning
Yihong Huang,Fei Ma,Yihua Shao,Jingcai Guo,Zitong Yu,Laizhong Cui,Qi Tian
Main category: cs.CV
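TL;DR: This paper proposes Nüwa, a two-stage visual token pruning framework that preserves global spatial anchors and applies text-guided pruning, yielding significant gains on both VQA and visual grounding tasks.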
Details
Motivation: Existing visual token pruning methods work well on visual question answering (VQA) but degrade sharply on visual grounding (VG), because they lose the global spatial reference frame built from interactions among positional information. Method: Proposes Nüwa, a two-stage token pruning framework: the first stage, after the vision encoder, applies swarm-intelligence-inspired separation, alignment, and aggregation operations to retain information-rich global spatial anchors; the second stage, inside the large language model, performs text-guided pruning to keep task-relevant visual tokens. Result: Nüwa reaches SOTA performance on multiple VQA benchmarks (94% to 95%) and improves substantially on visual grounding (from 7% to 47%). Conclusion: Maintaining spatial integrity is essential for token pruning in vision-language models; by explicitly modeling spatial anchors and using text-guided pruning, Nüwa balances efficiency with multi-task performance. Abstract: Vision token pruning has proven to be an effective acceleration technique for the efficient Vision Language Model (VLM). However, existing pruning methods demonstrate excellent performance preservation in visual question answering (VQA) but suffer substantial degradation on visual grounding (VG) tasks. Our analysis of the VLM's processing pipeline reveals that strategies utilizing global semantic similarity and attention scores lose the global spatial reference frame, which is derived from the interactions of tokens' positional information. Motivated by these findings, we propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. In the first stage, after the vision encoder, we apply three operations, namely separation, alignment, and aggregation, which are inspired by swarm intelligence algorithms to retain information-rich global spatial anchors. In the second stage, within the LLM, we perform text-guided pruning to retain task-relevant visual tokens. Extensive experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
[89] TRACE: Temporal Radiology with Anatomical Change Explanation for Grounded X-ray Report Generation
OFM Riaz Rahman Aranya,Kevin Desai
Main category: cs.CV
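TL;DR: This paper proposes TRACE, the first model to jointly perform temporal comparison of chest X-rays, change classification, and spatial localization; it generates natural language descriptions, localizes the changed regions, and finds that joint temporal-spatial learning is essential for change detection.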
Details
Motivation: Existing methods cannot combine single-image report generation with visual grounding to detect temporal change, even though temporal comparison of chest X-rays is clinically essential. Method: Proposes TRACE, which takes a prior and a current chest X-ray as input and jointly performs temporal comparison, change classification (worsened/improved/stable), and spatial localization (via bounding boxes), while generating a natural language explanation. Result: TRACE achieves over 90% grounding accuracy for spatial localization; ablations show that effective change detection emerges only when temporal comparison and spatial grounding are learned jointly. Conclusion: Grounding acts as a spatial attention mechanism that is indispensable for temporal reasoning; TRACE establishes a new baseline for temporal chest X-ray analysis. Abstract: Temporal comparison of chest X-rays is fundamental to clinical radiology, enabling detection of disease progression, treatment response, and new findings. While vision-language models have advanced single-image report generation and visual grounding, no existing method combines these capabilities for temporal change detection. We introduce Temporal Radiology with Anatomical Change Explanation (TRACE), the first model that jointly performs temporal comparison, change classification, and spatial localization. Given a prior and current chest X-ray, TRACE generates natural language descriptions of interval changes (worsened, improved, stable) while grounding each finding with bounding box coordinates. TRACE demonstrates effective spatial localization with over 90% grounding accuracy, establishing a foundation for this challenging new task. Our ablation study uncovers an emergent capability: change detection arises only when temporal comparison and spatial grounding are jointly learned, as neither alone enables meaningful change detection. This finding suggests that grounding provides a spatial attention mechanism essential for temporal reasoning.
[90] Dynamic High-frequency Convolution for Infrared Small Target Detection
Ruojing Li,Chao Xiao,Qian Yin,Wei An,Nuo Chen,Xinyi Ying,Miao Li,Yingqian Wang
Main category: cs.CV
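TL;DR: This paper proposes dynamic high-frequency convolution (DHiF) for single-frame infrared small target detection; it generates a dynamic local filter bank that explicitly models and discriminates high-frequency components, improving the separation of targets from background clutter.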
Details
Motivation: Existing learning-based methods neglect explicit modeling and discriminative representation learning of high-frequency components (HFCs), which is essential for distinguishing infrared small targets from similar high-frequency clutter such as bright corners and broken clouds. Method: Proposes dynamic high-frequency convolution (DHiF), which uses properties of the Fourier transform to adjust dynamic filter parameters symmetrically within a zero-centered range so that the filters are sensitive to HFCs; combined with standard convolution, it adaptively processes different HFC regions and captures their grayscale variation characteristics, and it can be plugged into any SIRST detection network. Result: On real-scene datasets, DHiF outperforms current mainstream convolution operations across several SIRST detection networks without a notable drop in computational efficiency. Conclusion: DHiF is an effective, efficient, and general convolution module that markedly improves discriminative modeling of high-frequency components in infrared small target detection. Abstract: Infrared small targets are typically tiny and locally salient, and belong to the high-frequency components (HFCs) of images. Single-frame infrared small target (SIRST) detection is challenging, since there are many HFCs along with targets, such as bright corners, broken clouds, and other clutters. Current learning-based methods rely on the powerful capabilities of deep networks, but neglect explicit modeling and discriminative representation learning of various HFCs, which is important to distinguish targets from other HFCs. To address the aforementioned issues, we propose a dynamic high-frequency convolution (DHiF) to translate the discriminative modeling process into the generation of a dynamic local filter bank. In particular, DHiF is sensitive to HFCs, owing to the dynamic parameters of its generated filters being symmetrically adjusted within a zero-centered range according to Fourier transformation properties. Combined with standard convolution operations, DHiF can adaptively and dynamically process different HFC regions and capture their distinctive grayscale variation characteristics for discriminative representation learning. DHiF functions as a drop-in replacement for standard convolution and can be used in arbitrary SIRST detection networks without significant decrease in computational efficiency. To validate the effectiveness of our DHiF, we conducted extensive experiments across different SIRST detection networks on real-scene datasets. Compared to other state-of-the-art convolution operations, DHiF exhibits superior detection performance with promising improvement.
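The DHiF abstract ties high-frequency sensitivity to filter parameters kept in a zero-centered range. One standard fact behind that intuition is that a kernel whose weights sum to zero has no DC response, so it ignores flat background and reacts only to local variation. The sketch below illustrates just that property on a single 3x3 patch; it is not the DHiF module, and the helper names `zero_center` and `respond` are hypothetical.

```python
# Illustrative only: a zero-sum 3x3 kernel has zero DC response.
# This is NOT the DHiF operator; all names here are hypothetical.

def zero_center(kernel):
    """Subtract the mean weight so the kernel sums to zero."""
    flat = [w for row in kernel for w in row]
    mean = sum(flat) / len(flat)
    return [[w - mean for w in row] for row in kernel]

def respond(kernel, patch):
    """Correlate one 3x3 kernel with one 3x3 patch (a single output value)."""
    return sum(k * p for krow, prow in zip(kernel, patch)
               for k, p in zip(krow, prow))

raw = [[0.1, 0.2, 0.1],
       [0.2, 1.0, 0.2],
       [0.1, 0.2, 0.1]]
high_pass = zero_center(raw)

flat_background = [[5.0] * 3 for _ in range(3)]   # uniform intensity
small_target = [[5.0, 5.0, 5.0],
                [5.0, 9.0, 5.0],
                [5.0, 5.0, 5.0]]                  # bright pixel at the center

flat_response = respond(high_pass, flat_background)    # close to zero
target_response = respond(high_pass, small_target)     # clearly positive
```

A fixed zero-sum kernel cannot tell a target from a bright corner or a cloud edge, which is the gap the abstract says a dynamically generated filter bank is meant to address.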
Code is available at https://github.com/TinaLRJ/DHiF.
[91] Fisheye Stereo Vision: Depth and Range Error
Leaf Jiang,Matthew Holzel,Bernhard Kaplan,Hsiou-Yuan Liu,Sabyasachi Paul,Karen Rankin,Piotr Swierczynski
Main category: cs.CV
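TL;DR: This paper derives analytical expressions for the depth and range error of fisheye stereo vision systems at large angles, as a function of object distance.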
Details
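Motivation: To improve the accuracy of depth and range measurement in fisheye stereo vision systems when observing at large angles. Method: Builds a fisheye imaging model through theoretical modeling and mathematical derivation, and analyzes how geometric error propagates through it. Result: Obtains analytical expressions for depth error and range error as functions of object distance. Conclusion: The analytical expressions quantify the ranging error of fisheye stereo vision systems at large angles and provide a theoretical basis for system calibration and accuracy optimization. Abstract: This study derives analytical expressions for the depth and range error of fisheye stereo vision systems as a function of object distance, specifically accounting for accuracy at large angles.
[92] SceneLinker: Compositional 3D Scene Generation via Semantic Scene Graph from RGB Sequences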
Seok-Young Kim,Dooyoung Kim,Woojin Cho,Hail Song,Suji Kang,Woontack Woo
Main category: cs.CV
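TL;DR: SceneLinker is a new framework that generates compositional 3D scenes from RGB sequences via semantic scene graphs; it jointly models shape and layout with a graph network and a graph variational autoencoder (graph-VAE), producing MR content that better matches real spatial structure in complex indoor environments.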
Details
Motivation: Personalized mixed reality (MR) experiences that adapt to a user's physical space require 3D scenes, generated from RGB sequences, that accurately reflect the semantics and spatial relations of the real environment; prior methods either fail to model contextual relations between objects or focus on shape diversity, so generated scenes do not match real layouts. Method: Proposes the SceneLinker framework: (1) a graph network with cross-check feature attention for scene graph prediction; (2) a graph variational autoencoder (graph-VAE) with a joint shape and layout block for end-to-end 3D scene generation. Result: On the 3RScan/3DSSG and SG-FRONT datasets, it outperforms state-of-the-art methods in both quantitative and qualitative evaluations and remains robust in complex indoor environments and under strong scene graph constraints. Conclusion: SceneLinker efficiently turns a user's real environment into semantically consistent, accurately laid out 3D MR content, advancing scene-graph-based creation of spatial MR content. Abstract: We introduce SceneLinker, a novel framework that generates compositional 3D scenes via a semantic scene graph from RGB sequences. To adaptively experience Mixed Reality (MR) content based on each user's space, it is essential to generate a 3D scene that reflects the real-world layout by compactly capturing the semantic cues of the surroundings. Prior works struggled to fully capture the contextual relationship between objects or mainly focused on synthesizing diverse shapes, making it challenging to generate 3D scenes aligned with object arrangements. We address these challenges by designing a graph network with cross-check feature attention for scene graph prediction and constructing a graph-variational autoencoder (graph-VAE), which consists of a joint shape and layout block for 3D scene generation. Experiments on the 3RScan/3DSSG and SG-FRONT datasets demonstrate that our approach outperforms state-of-the-art methods in both quantitative and qualitative evaluations, even in complex indoor environments and under challenging scene graph constraints. Our work enables users to generate consistent 3D spaces from their physical environments via scene graphs, allowing them to create spatial MR content. Project page is https://scenelinker2026.github.io.
[93] Aligning Forest and Trees in Images and Long Captions for Visually Grounded Understanding
Byeongju Woo,Zilin Wang,Byeonghyun Pak,Sangwoo Mo,Stella X. Yu
Main category: cs.CV
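TL;DR: This paper proposes CAFT, a framework that aligns forests and trees across domains to match global and local semantics between images and long captions without pixel-level supervision, significantly improving long-text retrieval.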
Details
Motivation: Large vision-language models such as CLIP struggle with long captions because they align image and text as wholes and lack fine-grained, hierarchical semantic understanding; purely linguistic and purely visual hierarchies are hard to match with each other and lack cross-domain semantic focus. Method: Proposes CAFT (Cross-domain Alignment of Forests and Trees), which couples a fine-to-coarse visual encoder with a hierarchical text transformer and uses a hierarchical alignment loss that aligns whole images with whole captions while guiding region-sentence correspondences, so that coarse semantics are built on fine-grained evidence. Result: Trained on 30M image-text pairs, CAFT reaches SOTA on six long-text retrieval benchmarks and scales well; experiments show that it learns fine-grained, visually grounded image-text representations without explicit region-level supervision. Conclusion: Hierarchical cross-domain alignment is an effective route to fine-grained image-text understanding without pixel-level supervision, and CAFT offers a new paradigm for modeling images with long captions. Abstract: Large vision-language models such as CLIP struggle with long captions because they align images and texts as undifferentiated wholes. Fine-grained vision-language understanding requires hierarchical semantics capturing both global context and localized details across visual and textual domains. Yet linguistic hierarchies from syntax or semantics rarely match visual organization, and purely visual hierarchies tend to fragment scenes into appearance-driven parts without semantic focus. We propose CAFT (Cross-domain Alignment of Forests and Trees), a hierarchical image-text representation learning framework that aligns global and local semantics across images and long captions without pixel-level supervision. Coupling a fine-to-coarse visual encoder with a hierarchical text transformer, it uses a hierarchical alignment loss that matches whole images with whole captions while biasing region-sentence correspondences, so that coarse semantics are built from fine-grained evidence rather than from aggregation untethered to part-level grounding. Trained on 30M image-text pairs, CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that hierarchical cross-domain alignment enables fine-grained, visually grounded image-text representations to emerge without explicit region-level supervision.
[94] SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation
Zhanfeng Liao,Jiajun Zhang,Hanzhang Tu,Zhixi Wang,Yunqi Gao,Hongwen Zhang,Yebin Liu
Main category: cs.CV
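TL;DR: This paper proposes SharpTimeGS, a lifespan-aware 4D Gaussian framework that uses a learnable lifespan parameter and a lifespan-velocity-aware densification strategy to model static and dynamic regions adaptively over time under a unified representation, significantly improving long-term stability and dynamic fidelity while supporting real-time rendering at 4K and 100 FPS.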
Details
Motivation: Existing Gaussian-based methods for novel view synthesis of dynamic scenes struggle to balance long-term static regions and short-term dynamic regions in both representation and optimization. Method: Introduces a learnable lifespan parameter that models temporal visibility as a flat-top profile; uses the lifespan to modulate the motion of each Gaussian primitive; designs a lifespan-velocity-aware densification strategy to ease the optimization imbalance between static and dynamic regions. Result: Reaches SOTA performance on multiple benchmarks and supports real-time rendering at 4K resolution and 100 FPS on a single RTX 4090. Conclusion: Through its lifespan-aware mechanism, SharpTimeGS decouples motion magnitude from temporal duration and improves the stability, fidelity, and efficiency of 4D reconstruction of dynamic scenes. Abstract: Novel view synthesis of dynamic scenes is fundamental to achieving photorealistic 4D reconstruction and immersive visual experiences. Recent progress in Gaussian-based representations has significantly improved real-time rendering quality, yet existing methods still struggle to maintain a balance between long-term static and short-term dynamic regions in both representation and optimization. To address this, we present SharpTimeGS, a lifespan-aware 4D Gaussian framework that achieves temporally adaptive modeling of both static and dynamic regions under a unified representation. Specifically, we introduce a learnable lifespan parameter that reformulates temporal visibility from a Gaussian-shaped decay into a flat-top profile, allowing primitives to remain consistently active over their intended duration and avoiding redundant densification. In addition, the learned lifespan modulates each primitive's motion, reducing drift in long-lived static points while retaining unrestricted motion for short-lived dynamic ones. This effectively decouples motion magnitude from temporal duration, improving long-term stability without compromising dynamic fidelity. Moreover, we design a lifespan-velocity-aware densification strategy that mitigates optimization imbalance between static and dynamic regions by allocating more capacity to regions with pronounced motion while keeping static areas compact and stable.
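To make the contrast between a Gaussian-shaped temporal decay and a flat-top profile concrete, here is a minimal sketch. The abstract does not give the exact functional form, so the plateau-with-Gaussian-edges shape below is an assumption, and the names `gaussian_visibility`, `flat_top_visibility`, `lifespan`, and `sigma` are hypothetical.

```python
import math

def gaussian_visibility(t, center, sigma):
    """Gaussian-shaped temporal decay: peaks at `center`, falls off at once."""
    return math.exp(-0.5 * ((t - center) / sigma) ** 2)

def flat_top_visibility(t, center, lifespan, sigma):
    """Assumed flat-top form: 1.0 across a plateau of width `lifespan`,
    with a Gaussian falloff only outside the plateau."""
    excess = max(0.0, abs(t - center) - 0.5 * lifespan)
    return math.exp(-0.5 * (excess / sigma) ** 2)

# A primitive centered at t = 0.5 with an intended duration of 0.4.
for t in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    g = gaussian_visibility(t, center=0.5, sigma=0.05)
    f = flat_top_visibility(t, center=0.5, lifespan=0.4, sigma=0.05)
    print(f"t={t:.1f}  gaussian={g:.3f}  flat_top={f:.3f}")
```

Under this assumed form, a zero lifespan reduces to the plain Gaussian, while a long lifespan keeps the primitive fully visible over its whole duration, so fewer short-lived primitives would be needed to cover a static region.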
Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance while supporting real-time rendering up to 4K resolution at 100 FPS on one RTX 4090.
[95] Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation
Jiaze Li,Hao Yin,Haoran Xu,Boshen Xu,Wenhui Tan,Zewen He,Jianzhong Ju,Zhenbo Luo,Jian Luan
Main category: cs.CV
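TL;DR: This paper proposes Video-OPD, a framework in which a frontier teacher model supplies dense token-level supervision, combined with the TVDF training strategy; it significantly improves training efficiency and performance on temporal video grounding and overcomes the sparse rewards and high compute cost of existing GRPO methods.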
Details
Motivation: Existing GRPO-based reinforcement learning methods for temporal video grounding (TVG) are limited by sparse reward signals and high computational overhead, so a more efficient and stable post-training paradigm is needed. Method: Proposes the Video-OPD framework: based on on-policy distillation, it samples trajectories from the current policy to keep training and inference distributions aligned; a frontier teacher model provides dense token-level supervision through a reverse KL divergence objective; a lightweight TVDF training curriculum further focuses on trajectories that are teacher-reliable and most informative for the student. Result: Video-OPD consistently outperforms GRPO on multiple TVG benchmarks with faster convergence and lower compute cost, validating on-policy distillation for TVG. Conclusion: On-policy distillation is an efficient alternative to conventional reinforcement learning for TVG; Video-OPD and its TVDF strategy offer a new direction for post-training in video-language understanding. Abstract: Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.
[96] VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering
Rahul Atul Bhope,K. R. Jayaram,Vinod Muthusamy,Ritesh Kumar,Vatche Isahagian,Nalini Venkatasubramanian
Main category: cs.CV
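TL;DR: VOILA is a value-of-information-driven adaptive fidelity selection framework for visual question answering (VQA) that substantially reduces compute cost while keeping accuracy high.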
Details
Motivation: Most multimodal vision-language systems run at a fixed fidelity, which makes retrieving and processing high-fidelity visual inputs costly, and they lack a mechanism for adjusting fidelity to the needs of the task. Method: VOILA uses a two-stage pipeline: a gradient-boosted regressor first predicts the probability of a correct answer at each fidelity from question features alone; an isotonic calibrator then calibrates these probabilities for more reliable decisions; finally, the system selects the minimum-cost fidelity that maximizes expected utility given predicted accuracy and retrieval cost. Result: Validated on five datasets and six vision-language models of different sizes (7B to 235B), VOILA cuts cost by 50-60% while retaining 90-95% of full-resolution accuracy. Conclusion: Adaptive fidelity selection before retrieval is vital for optimizing multimodal inference under resource constraints, and VOILA provides an efficient and reliable solution. Abstract: Despite significant costs from retrieving and processing high-fidelity visual inputs, most multimodal vision-language systems operate at fixed fidelity levels. We introduce VOILA, a framework for Value-Of-Information-driven adaptive fidelity selection in Visual Question Answering (VQA) that optimizes what information to retrieve before model execution. Given a query, VOILA uses a two-stage pipeline: a gradient-boosted regressor estimates correctness likelihood at each fidelity from question features alone, then an isotonic calibrator refines these probabilities for reliable decision-making. The system selects the minimum-cost fidelity maximizing expected utility given predicted accuracy and retrieval costs. We evaluate VOILA across three deployment scenarios using five datasets (VQA-v2, GQA, TextVQA, LoCoMo, FloodNet) and six Vision-Language Models (VLMs) with 7B-235B parameters. VOILA consistently achieves 50-60% cost reductions while retaining 90-95% of full-resolution accuracy across diverse query types and model architectures, demonstrating that pre-retrieval fidelity selection is vital to optimize multimodal inference under resource constraints.
[97] Thinking inside the Convolution for Image Inpainting: Reconstructing Texture via Structure under Global and Local Side
Haipeng Liu,Yang Wang,Biao Qian,Yong Rui,Meng Wang
Main category: cs.CV
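TL;DR: This paper proposes a new image inpainting method that uses a statistical normalization and denormalization strategy so that structure and texture feature maps compensate for each other's information loss during convolutional downsampling, improving the quality of upsampling reconstruction.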
Details
Motivation: Existing CNN-based image inpainting methods inevitably lose structure and texture information during convolutional downsampling, which leads to unsatisfactory upsampling results. Method: Exploits the complementarity between structure and texture feature maps, applying a statistical normalization and denormalization strategy to guide reconstruction during the convolutional downsampling stage. Result: Significantly outperforms existing methods on images at multiple resolutions, including 256×256 and 512×512, with larger gains when all encoders are replaced. Conclusion: Structure and texture feature maps can help each other reduce the information loss of convolutional downsampling, and the proposed method effectively improves inpainting quality. Abstract: Image inpainting has made substantial progress owing to the encoder-decoder pipeline, in which Convolutional Neural Networks (CNNs) use convolutional downsampling in the encoder to inpaint the masked regions semantically from the known regions, coupled with an upsampling process in the decoder that produces the final inpainting output. Recent studies intuitively identify the high-frequency structure and low-frequency texture to be extracted by CNNs in the encoder, and subsequently used for a desirable upsampling recovery. However, existing methods overlook the information loss in both structure and texture feature maps during the convolutional downsampling process, and hence suffer from a non-ideal upsampling output. In this paper, we systematically answer whether and how the structure and texture feature maps can mutually help to alleviate the information loss during convolutional downsampling. Given the structure and texture feature maps, we adopt a statistical normalization and denormalization strategy for reconstruction guidance during the convolutional downsampling process. Extensive experimental results validate its advantages over state-of-the-art methods on images from low to high resolutions, including 256×256 and 512×512, especially when all the encoders are substituted by ours. Our code is available at https://github.com/htyjers/ConvInpaint-TSGL
[98] A Vision-Based Analysis of Congestion Pricing in New York City
Mehmet Kerem Turkcan,Jhonatan Tavori,Javad Ghaderi,Gil Zussman,Zoran Kostic,Andrew Smyth
Main category: cs.CV
TL;DR: This paper analyzes the impact of NYC's congestion pricing program using computer vision on traffic camera data from over 900 cameras in Manhattan and New York, comparing traffic patterns before and after implementation (Nov 2024–Jan 2026).
Details
Motivation: To quantitatively assess the real-world impact of New York City's newly implemented congestion pricing program on urban traffic flow. Method: A computer vision pipeline applied to footage from over 900 traffic cameras across Manhattan and New York, analyzing vehicle density changes between November 2024 (pre-implementation) and January 2025–January 2026 (post-implementation). Result: Baseline traffic patterns were established and systematic changes in vehicle density across the monitored region were identified following program implementation. Conclusion: The study demonstrates measurable traffic pattern shifts attributable to the congestion pricing program, validating the utility of large-scale automated camera analysis for policy evaluation. Abstract: We examine the impact of New York City's congestion pricing program through automated analysis of traffic camera data. Our computer vision pipeline processes footage from over 900 cameras distributed throughout Manhattan and New York, comparing traffic patterns from November 2024 through the program's implementation in January 2025 until January 2026. We establish baseline traffic patterns and identify systematic changes in vehicle density across the monitored region.
[99] MUSE: A Multi-agent Framework for Unconstrained Story Envisioning via Closed-Loop Cognitive Orchestration
Wenzhang Sun,Zhenyu Wang,Zhangchi Hu,Chunfeng Wang,Hao Li,Wei Chen
Main category: cs.CV
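TL;DR: This paper proposes MUSE, a multi-agent framework that uses a closed plan-execute-verify-revise loop to address semantic drift and identity inconsistency in long-form video story generation, and introduces MUSEBench, a reference-free evaluation protocol.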
Details
Motivation: Existing methods are prone to semantic drift and identity inconsistency in long-sequence generation and struggle to keep high-level narrative intent consistent with shot-level multimodal generation. Method: Formulates story generation as a closed-loop constraint enforcement problem and designs the MUSE multi-agent framework, which provides explicit control over identity, spatial composition, and temporal continuity and applies targeted multimodal feedback for iterative revision. Result: On open-ended story generation, MUSE substantially improves long-horizon narrative coherence, cross-modal identity consistency, and cinematic quality; the MUSEBench evaluation protocol is validated by human judgments. Conclusion: Closed-loop multi-agent generation effectively bridges the gap between narrative intent and multimodal execution, offering a new approach and a practical evaluation benchmark for long-form video generation. Abstract: Generating long-form audio-visual stories from a short user prompt remains challenging due to an intent-execution gap, where high-level narrative intent must be preserved across coherent, shot-level multimodal generation over long horizons. Existing approaches typically rely on feed-forward pipelines or prompt-only refinement, which often leads to semantic drift and identity inconsistency as sequences grow longer. We address this challenge by formulating storytelling as a closed-loop constraint enforcement problem and propose MUSE, a multi-agent framework that coordinates generation through an iterative plan-execute-verify-revise loop. MUSE translates narrative intent into explicit, machine-executable controls over identity, spatial composition, and temporal continuity, and applies targeted multimodal feedback to correct violations during generation. To evaluate open-ended storytelling without ground-truth references, we introduce MUSEBench, a reference-free evaluation protocol validated by human judgments. Experiments demonstrate that MUSE substantially improves long-horizon narrative coherence, cross-modal identity consistency, and cinematic quality compared with representative baselines.
[100] Bongards at the Boundary of Perception and Reasoning: Programs or Language?
Cassidy Langenfeld,Claas Beger,Gloria Geng,Wasu Top Piriyakulkij,Keya Hu,Yewen Pu,Kevin Ellis
Main category: cs.CV
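TL;DR: This paper proposes a neurosymbolic approach to Bongard visual reasoning problems that uses large language models to generate programmatic representations of rules and fits their parameters with Bayesian optimization.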
Details
Motivation: Humans can flexibly apply visual reasoning in entirely new situations, whereas existing vision-language models (VLMs) fall short on abstract visual reasoning tasks such as the classic Bongard problems, so new methods are needed. Method: A neurosymbolic approach: given a hypothesized solution rule for a Bongard problem, a large language model (LLM) generates a parameterized programmatic representation of the rule, and Bayesian optimization fits its parameters. Result: The method is evaluated on two tasks: classifying Bongard images given the ground-truth rule, and solving Bongard problems from scratch. Conclusion: The proposed neurosymbolic approach offers a new path for improving VLMs on abstract, generalizable visual reasoning by bridging symbolic reasoning and neural perception. Abstract: Vision-Language Models (VLMs) have made great strides in everyday visual tasks, such as captioning a natural image, or answering commonsense questions about such images. But humans possess the puzzling ability to deploy their visual reasoning abilities in radically new situations, a skill rigorously tested by the classic set of visual reasoning challenges known as the Bongard problems. We present a neurosymbolic approach to solving these problems: given a hypothesized solution rule for a Bongard problem, we leverage LLMs to generate parameterized programmatic representations for the rule and perform parameter fitting using Bayesian optimization. We evaluate our method on classifying Bongard problem images given the ground truth rule, as well as on solving the problems from scratch.
[101] HP-GAN: Harnessing pretrained networks for GAN improvement with FakeTwins and discriminator consistency
Geonhui Son,Jeong Ryong Lee,Dosik Hwang
Main category: cs.CV
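TL;DR: This paper proposes HP-GAN, which uses the FakeTwins self-supervised mechanism and a consistency constraint between CNN and ViT discriminators to exploit pretrained network priors, improving the diversity and quality of generated images and clearly beating SOTA on FID across 17 datasets.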
Details
Motivation: Existing GAN methods often use pretrained networks to compute perceptual losses or feature spaces but do not fully exploit them as neural priors; multiple discriminators also lack coordination, which hurts training robustness and generation quality. Method: Proposes HP-GAN: (1) FakeTwins, which uses pretrained networks as encoders and applies a self-supervised loss to generated images to train the generator; (2) a consistency constraint between CNN and ViT discriminators that forces their quality assessments of feature maps to agree. Result: On 17 datasets covering large, small, and limited data and a variety of image domains, HP-GAN consistently outperforms current state-of-the-art methods on FID, significantly improving image diversity and quality. Conclusion: Combining self-supervised learning with cross-architecture discriminator consistency makes more effective use of pretrained network priors and offers a new paradigm for high-quality, high-diversity image generation. Abstract: Generative Adversarial Networks (GANs) have made significant progress in enhancing the quality of image synthesis. Recent methods frequently leverage pretrained networks to calculate perceptual losses or utilize pretrained feature spaces. In this paper, we extend the capabilities of pretrained networks by incorporating innovative self-supervised learning techniques and enforcing consistency between discriminators during GAN training. Our proposed method, named HP-GAN, effectively exploits neural network priors through two primary strategies: FakeTwins and discriminator consistency. FakeTwins leverages pretrained networks as encoders to compute a self-supervised loss and applies this through the generated images to train the generator, thereby enabling the generation of more diverse and high quality images. Additionally, we introduce a consistency mechanism between discriminators that evaluate feature maps extracted from Convolutional Neural Network (CNN) and Vision Transformer (ViT) feature networks. Discriminator consistency promotes coherent learning among discriminators and enhances training robustness by aligning their assessments of image quality. Our extensive evaluation across seventeen datasets (including scenarios with large, small, and limited data, and covering a variety of image domains) demonstrates that HP-GAN consistently outperforms current state-of-the-art methods in terms of Fréchet Inception Distance (FID), achieving significant improvements in image diversity and quality. Code is available at: https://github.com/higun2/HP-GAN.
[102] IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning
Zhichao Sun,Yidong Ma,Gang Liu,Yibo Chen,Xu Tang,Yao Hu,Yongchao Xu
Main category: cs.CV
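TL;DR: This paper proposes IVC-Prune, a training-free, prompt-aware method that prunes visual tokens by retaining the implicit visual coordinate (IVC) tokens that are critical for spatial reasoning together with semantically relevant foreground tokens, substantially reducing LVLM inference cost with almost no loss in performance.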
Details
Motivation: Existing visual token pruning methods focus mainly on semantic relevance and tend to discard tokens that are critical for spatial reasoning, which degrades performance; moreover, the mechanism behind spatial reasoning in LVLMs remains unclear.
Method: A theoretical analysis of the mathematical properties of RoPE identifies specific token positions that act as implicit visual coordinates (IVC); a robust two-stage strategy (semantic seed discovery followed by contextual refinement via value-vector similarity) locates foreground tokens; at inference only the IVC and foreground tokens are kept, giving training-free, prompt-aware pruning.
Result: On 4 mainstream LVLMs and 20 benchmarks, IVC-Prune removes about 50% of visual tokens while keeping at least 99% of the original performance, with slight gains on several benchmarks.
Conclusion: Implicit visual coordinates (IVC) are a key mechanism for spatial reasoning in LVLMs; IVC-Prune is an efficient, general, training-free visual token pruning strategy that significantly lowers the inference cost of high-resolution inputs while preserving or even improving performance.
Abstract: Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. While visual token pruning has emerged as a promising solution, existing methods that primarily focus on semantic relevance often discard tokens that are crucial for spatial reasoning. We address this gap through a novel insight into how LVLMs process spatial reasoning. Specifically, we reveal that LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE), where specific token positions serve as implicit visual coordinates (IVC tokens) that are essential for spatial reasoning. Based on this insight, we propose IVC-Prune, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant foreground tokens. IVC tokens are identified by theoretically analyzing the mathematical properties of RoPE, targeting positions at which its rotation matrices approximate the identity matrix or the $90^\circ$ rotation matrix. Foreground tokens are identified through a robust two-stage process: semantic seed discovery followed by contextual refinement via value-vector similarity. Extensive evaluations across four representative LVLMs and twenty diverse benchmarks show that IVC-Prune reduces visual tokens by approximately 50% while maintaining $\geq$ 99% of the original performance and even achieving improvements on several benchmarks.
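The abstract says IVC tokens sit at positions where the RoPE rotation matrices approximate the identity or the $90^\circ$ rotation. The following is a minimal sketch of that idea only, not the authors' procedure: it assumes the standard RoPE angle `position * base**(-2i/d)`, looks at a single frequency pair, and ranks positions by Frobenius distance to the two target rotations. The function names, the `pair_index` choice, and `top_k` are all hypothetical.

```python
import numpy as np

def rope_angles(num_positions, head_dim, base=10000.0):
    """Standard RoPE angles: angle[m, i] = m * base**(-2i / head_dim)."""
    positions = np.arange(num_positions)[:, None]
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return positions * inv_freq[None, :]          # shape (P, head_dim // 2)

def rotation_distance(angles, target_angle):
    """Squared Frobenius distance between the 2x2 rotations R(angle) and
    R(target_angle); it simplifies to 4 * (1 - cos(angle - target_angle))."""
    return 4.0 * (1.0 - np.cos(angles - target_angle))

def candidate_positions(num_positions, head_dim, pair_index, top_k=5):
    """Hypothetical selection rule: rank positions by how close the rotation
    of one frequency pair is to the identity or to the 90-degree rotation."""
    angles = rope_angles(num_positions, head_dim)[:, pair_index]
    score = np.minimum(rotation_distance(angles, 0.0),
                       rotation_distance(angles, np.pi / 2))
    return np.argsort(score, kind="stable")[:top_k]

# Positions whose rotation (for frequency pair 1) is nearest to I or R(90 deg).
print(candidate_positions(num_positions=64, head_dim=8, pair_index=1))
```

Which frequency components the paper inspects, and how it thresholds "approximately", is not stated in the abstract, so the ranking rule here is only one plausible reading.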
Source code is available at https://github.com/FireRedTeam/IVC-Prune.
[103] JRDB-Pose3D: A Multi-person 3D Human Pose and Shape Estimation Dataset for Robotics
Sandika Biswas,Kian Izadpanah,Hamid Rezatofighi
Main category: cs.CV
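TL;DR: This paper introduces JRDB-Pose3D, a dataset that addresses the limited coverage of real, complex, crowded scenes (such as dynamic indoor and outdoor environments) in existing 3D human pose estimation datasets, providing dense, multimodal 3D human pose data with temporal tracking and rich semantic annotations.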
Details
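Motivation: Existing 3D human pose estimation datasets are mostly limited to single-person or laboratory settings and cannot support real-world applications such as autonomous driving and robot perception; a high-quality multi-person 3D pose dataset that reflects crowded, dynamic, interaction-rich scenes is needed.
Method: JRDB-Pose3D is collected from a mobile robotic platform and covers complex multi-person indoor and outdoor scenes. It provides SMPL-format 3D pose annotations for 5-10 people per frame (up to 35), consistent body-shape parameters, and per-individual track IDs across frames, and it inherits all JRDB annotations: 2D pose, social grouping, activities, interactions, semantic masks, and demographic attributes (age, gender, race).
Result: JRDB-Pose3D is the first large-scale dataset captured from a real mobile-robot viewpoint that supports dense multi-person 3D pose estimation and long-term tracking with rich social and environmental context, and it exposes realistic challenges such as occlusion, truncation, and out-of-frame body parts.
Conclusion: JRDB-Pose3D fills a data gap in real-world multi-person 3D pose understanding and provides a comprehensive, robust, and extensible benchmark resource for downstream perception, behavior understanding, and human-robot interaction tasks.
Abstract: Real-world scenes are inherently crowded. Hence, estimating 3D poses of all nearby humans, tracking their movements over time, and understanding their activities within social and environmental contexts are essential for many applications, such as autonomous driving, robot perception, robot navigation, and human-robot interaction. However, most existing 3D human pose estimation datasets primarily focus on single-person scenes or are collected in controlled laboratory environments, which restricts their relevance to real-world applications. To bridge this gap, we introduce JRDB-Pose3D, which captures multi-human indoor and outdoor environments from a mobile robotic platform. JRDB-Pose3D provides rich 3D human pose annotations for such complex and dynamic scenes, including SMPL-based pose annotations with consistent body-shape parameters and track IDs for each individual over time. JRDB-Pose3D contains, on average, 5-10 human poses per frame, with some scenes featuring up to 35 individuals simultaneously. The proposed dataset presents unique challenges, including frequent occlusions, truncated bodies, and out-of-frame body parts, which closely reflect real-world environments.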
Moreover, JRDB-Pose3D inherits all available annotations from the JRDB dataset, such as 2D pose, information about social grouping, activities, and interactions, full-scene semantic masks with consistent human- and object-level tracking, and detailed annotations for each individual, such as age, gender, and race, making it a holistic dataset for a wide range of downstream perception and human-centric understanding tasks.
[104] Finding Optimal Video Moment without Training: Gaussian Boundary Optimization for Weakly Supervised Video Grounding
Sunoh Kim,Kimin Yun,Daeho Um
Main category: cs.CV
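TL;DR: This paper proposes Gaussian Boundary Optimization (GBO), a training-free inference framework for weakly supervised temporal video grounding that solves an optimization problem balancing proposal coverage against segment compactness, significantly improving localization accuracy and reaching state-of-the-art results on multiple benchmarks.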
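To make the coverage-versus-compactness trade-off concrete, here is a small sketch under an assumed objective; the paper's actual formulation is not given in this listing. For a single Gaussian proposal with center `mu` and width `sigma`, suppose we pick the half-width `w` that maximizes (Gaussian mass inside `[mu - w, mu + w]`) minus `lam * 2w`. Setting the derivative to zero gives `w = sigma * sqrt(2 * ln(peak / lam))`, where `peak` is the Gaussian's maximum density, and `w = 0` when `lam >= peak`. The names `gaussian_boundary`, `objective`, and `lam` are hypothetical.

```python
import math

def gaussian_boundary(mu, sigma, lam):
    """Closed-form maximizer of the ASSUMED objective
    coverage(w) - lam * 2w for a single Gaussian proposal N(mu, sigma^2)."""
    peak = 1.0 / (sigma * math.sqrt(2.0 * math.pi))   # density at the center
    if lam >= peak:
        # The penalty outweighs the density everywhere: the segment collapses.
        return mu, mu
    half_width = sigma * math.sqrt(2.0 * math.log(peak / lam))
    return mu - half_width, mu + half_width

def objective(sigma, lam, half_width):
    """Coverage of [mu - w, mu + w] under the Gaussian minus a length penalty."""
    coverage = math.erf(half_width / (sigma * math.sqrt(2.0)))
    return coverage - lam * 2.0 * half_width

start, end = gaussian_boundary(mu=0.5, sigma=0.1, lam=1.0)
print(start, end)
```

A larger `lam` yields a tighter segment and a smaller one yields wider coverage, which is the kind of penalty regime the abstract refers to; the exact penalty used by GBO may differ.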
Details
Motivation: Existing methods based on Gaussian temporal proposals rely on heuristic mappings from Gaussian parameters to segment boundaries, which leads to suboptimal localization.
Method: Gaussian Boundary Optimization (GBO) casts segment boundary prediction as a principled optimization problem that balances proposal coverage and segment compactness, derives a closed-form solution, and analyzes the optimality conditions under different penalty regimes.
Result: GBO significantly improves localization on standard benchmarks and reaches state-of-the-art results; it is training-free, compatible with both single-Gaussian and mixture-based proposal architectures, efficient, and general.
Conclusion: GBO is a theoretically grounded and practical inference method that addresses inaccurate boundary prediction in weakly supervised temporal video grounding.
Abstract: Weakly supervised temporal video grounding aims to localize query-relevant segments in untrimmed videos using only video-sentence pairs, without requiring ground-truth segment annotations that specify exact temporal boundaries. Recent approaches tackle this task by utilizing Gaussian-based temporal proposals to represent query-relevant segments. However, their inference strategies rely on heuristic mappings from Gaussian parameters to segment boundaries, resulting in suboptimal localization performance. To address this issue, we propose Gaussian Boundary Optimization (GBO), a novel inference framework that predicts segment boundaries by solving a principled optimization problem that balances proposal coverage and segment compactness. We derive a closed-form solution for this problem and rigorously analyze the optimality conditions under varying penalty regimes. Beyond its theoretical foundations, GBO offers several practical advantages: it is training-free and compatible with both single-Gaussian and mixture-based proposal architectures. Our experiments show that GBO significantly improves localization, achieving state-of-the-art results across standard benchmarks. Extensive experiments demonstrate the efficiency and generalizability of GBO across various proposal schemes. The code is available at https://github.com/sunoh-kim/gbo.
[105] A generalizable large-scale foundation model for musculoskeletal radiographs
Shinn Kim,Soobin Lee,Kyoungseob Shin,Han-Soo Kim,Yongsung Kim,Minsu Kim,Juhong Nam,Somang Ko,Daeheon Kwon,Wook Huh,Ilkyu Han,Sunghoon Kwon
Main category: cs.CV
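TL;DR: This paper presents SKELEX, a large-scale foundation model trained with self-supervision on 1.2 million musculoskeletal radiographs. It generalizes across diseases and anatomical sites, performs strongly on 12 downstream diagnostic tasks, and supports zero-shot abnormality localization and interpretable, region-guided bone tumor prediction.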
Details
Motivation: Most existing AI models are task-specific, depend on large amounts of annotation, and generalize poorly; large, diverse public datasets for training a general-purpose musculoskeletal imaging foundation model are lacking.
Method: SKELEX is trained with self-supervised learning on 1.2 million diverse, condition-rich musculoskeletal radiographs. It is evaluated on 12 downstream tasks, and zero-shot abnormality localization and a region-guided bone tumor prediction model are built on top of it.
Result: SKELEX generally outperforms baselines in fracture detection, osteoarthritis grading, and bone tumor classification; it localizes abnormalities zero-shot by producing error maps that highlight pathologic regions; the region-guided model remains robust on independent external datasets and has been deployed as a public web application.
Conclusion: SKELEX provides a scalable, label-efficient, and generalizable AI framework that lays a foundation for clinical translation and data-efficient research in musculoskeletal imaging.
Abstract: Artificial intelligence (AI) has shown promise in detecting and characterizing musculoskeletal diseases from radiographs. However, most existing models remain task-specific, annotation-dependent, and limited in generalizability across diseases and anatomical regions. Although a generalizable foundation model trained on large-scale musculoskeletal radiographs is clinically needed, publicly available datasets remain limited in size and lack sufficient diversity to enable training across a wide range of musculoskeletal conditions and anatomical sites. Here, we present SKELEX, a large-scale foundation model for musculoskeletal radiographs, trained using self-supervised learning on 1.2 million diverse, condition-rich images. The model was evaluated on 12 downstream diagnostic tasks and generally outperformed baselines in fracture detection, osteoarthritis grading, and bone tumor classification. Furthermore, SKELEX demonstrated zero-shot abnormality localization, producing error maps that identified pathologic regions without task-specific training. Building on this capability, we developed an interpretable, region-guided model for predicting bone tumors, which maintained robust performance on independent external datasets and was deployed as a publicly accessible web application. Overall, SKELEX provides a scalable, label-efficient, and generalizable AI framework for musculoskeletal imaging, establishing a foundation for both clinical translation and data-efficient research in musculoskeletal radiology.
[106] Gromov Wasserstein Optimal Transport for Semantic Correspondences
Francis Snelgar,Stephen Gould,Ming Xu,Liang Zheng,Akshay Asthana
Main category: cs.CV
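TL;DR: This paper proposes a semantic matching method based on Gromov-Wasserstein optimal transport that takes the place of Stable Diffusion features; using DINOv2 features alone, it matches or even surpasses current state-of-the-art performance while being 5-10x more computationally efficient.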
Details
Motivation: Existing methods fuse DINOv2 and Stable Diffusion (SD) features to obtain both accuracy and spatial consistency, at a high computational cost; this work aims to obtain spatially consistent, accurate correspondences without the SD model by improving the matching algorithm instead.
Method: Standard nearest-neighbour matching on DINOv2 features is replaced with an optimal transport algorithm that includes a Gromov-Wasserstein spatial smoothness prior.
Result: The method significantly improves the DINOv2 baseline, matches or surpasses SD-based state-of-the-art methods on several benchmarks, and runs 5-10x faster.
Conclusion: Spatial consistency can be supplied by the matching algorithm rather than by large-model features, which supports a lightweight, efficient, high-performing approach to semantic matching.
Abstract: Establishing correspondences between image pairs is a long-studied problem in computer vision. With recent large-scale foundation models showing strong zero-shot performance on downstream tasks including classification and segmentation, there has been interest in using the internal feature maps of these models for the semantic correspondence task. Recent works observe that features from DINOv2 and Stable Diffusion (SD) are complementary, the former producing accurate but sparse correspondences, while the latter produces spatially consistent correspondences. As a result, current state-of-the-art methods for semantic correspondence involve combining features from both models in an ensemble. While the performance of these methods is impressive, they are computationally expensive, requiring evaluating feature maps from large-scale foundation models. In this work we take a different approach, instead replacing SD features with a superior matching algorithm which is imbued with the desirable spatial consistency property. Specifically, we replace the standard nearest neighbours matching with an optimal transport algorithm that includes a Gromov Wasserstein spatial smoothness prior. We show that we can significantly boost the performance of the DINOv2 baseline, and remain competitive with, and sometimes surpass, state-of-the-art methods using Stable Diffusion features, while being 5--10x more efficient. We make code available at https://github.com/fsnelgar/semantic_matching_gwot.
[107] Beyond Cropping and Rotation: Automated Evolution of Powerful Task-Specific Augmentations with Generative Models
Judah Goldfeder,Shreyes Kaliyur,Vaibhav Sourirajan,Patrick Minwan Puma,Philippe Martin Wyder,Yuhang Hu,Jiong Lin,Hod Lipson
Main category: cs.CV
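TL;DR: This paper proposes EvoAug, an automated augmentation-learning framework that combines generative models with an evolutionary algorithm to learn task-specific, hierarchical stochastic augmentation trees. It performs well on fine-grained classification and few-shot learning and discovers augmentation strategies consistent with domain knowledge.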
Details
Motivation: Traditional augmentations such as cropping and rotation offer limited diversity, while recent generative models (conditional diffusion, NeRF) can synthesize highly diverse, realistic data but apply strong transformations that can hurt performance when mismatched to the task. A method is needed that automatically selects and composes generative augmentations to balance robustness and accuracy.
Method: EvoAug uses conditional generative models (such as diffusion models) as base augmentation sources, defines a learnable stochastic augmentation tree that supports hierarchical, randomized composition, and searches for the best augmentation strategy for the downstream task with an efficient evolutionary algorithm.
Result: The method significantly improves performance on fine-grained classification and few-shot learning; the learned augmentation strategies agree with human domain knowledge (for example, focusing on feather texture in bird recognition) and remain effective in low-data settings.
Conclusion: Generative data augmentation can be made task-adaptive through automated learning; EvoAug shows that learning structured, stochastic augmentation trees is effective and offers a new approach to robust vision model training.
Abstract: Data augmentation has long been a cornerstone for reducing overfitting in vision models, with methods like AutoAugment automating the design of task-specific augmentations. Recent advances in generative models, such as conditional diffusion and few-shot NeRFs, offer a new paradigm for data augmentation by synthesizing data with significantly greater diversity and realism. However, unlike traditional augmentations like cropping or rotation, these methods introduce substantial changes that enhance robustness but also risk degrading performance if the augmentations are poorly matched to the task. In this work, we present EvoAug, an automated augmentation learning pipeline, which leverages these generative models alongside an efficient evolutionary algorithm to learn optimal task-specific augmentations. Our pipeline introduces a novel approach to image augmentation that learns stochastic augmentation trees that hierarchically compose augmentations, enabling more structured and adaptive transformations. We demonstrate strong performance across fine-grained classification and few-shot learning tasks. Notably, our pipeline discovers augmentations that align with domain knowledge, even in low-data settings. These results highlight the potential of learned generative augmentations, unlocking new possibilities for robust model training.
[108] Feature, Alignment, and Supervision in Category Learning: A Comparative Approach with Children and Neural Networks
Fanxiao Wani Qiu,Oscar Leong
Main category: cs.CV
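TL;DR: Using a species-fair design, this study compares children and convolutional neural networks (CNNs) on semi-supervised category learning from a few labeled examples. The two differ markedly in how supervision amount, feature type, and perceptual alignment interact, which shows that human-model comparisons should examine multi-factor interactions rather than accuracy alone.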
Details
Motivation: Understanding how humans and machines learn from sparse data is a central question in cognitive science and machine learning, and it requires cross-species comparison under fair conditions.
Method: In a species-fair design, children and CNNs complete the same semi-supervised category learning task with few labels, while supervision (1/3/6 labels), target feature (size, shape, pattern), and perceptual alignment (high/low) are systematically varied.
Result: Children generalize quickly from very few labels but show strong feature-specific biases and sensitivity to perceptual alignment; CNN performance improves with more supervision, but the gain is moderated by alignment and feature structure.
Conclusion: The differences between human and model learning show up in multi-factor interaction patterns rather than in overall accuracy, so cross-species comparisons must control for and analyze the joint effects of supervision, feature structure, and alignment.
Abstract: Understanding how humans and machines learn from sparse data is central to cognitive science and machine learning. Using a species-fair design, we compare children and convolutional neural networks (CNNs) in a few-shot semi-supervised category learning task. Both learners are exposed to novel object categories under identical conditions. Learners receive mixtures of labeled and unlabeled exemplars while we vary supervision (1/3/6 labels), target feature (size, shape, pattern), and perceptual alignment (high/low). We find that children generalize rapidly from minimal labels but show strong feature-specific biases and sensitivity to alignment. CNNs show a different interaction profile: added supervision improves performance, but both alignment and feature structure moderate the impact additional supervision has on learning. These results show that human-model comparisons must be drawn under the right conditions, emphasizing interactions among supervision, feature structure, and alignment rather than overall accuracy.
[109] Flexible Geometric Guidance for Probabilistic Human Pose Estimation with Diffusion Models
Francis Snelgar,Ming Xu,Stephen Gould,Liang Zheng,Akshay Asthana
Main category: cs.CV
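TL;DR: This paper proposes a diffusion-based framework for 3D human pose estimation that guides an unconditional diffusion model, trained only on 3D data, with the gradients of heatmaps from a 2D keypoint detector. It samples multiple plausible 3D poses consistent with an image without requiring paired 2D-3D training data.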
Details
Motivation: 3D pose estimation is ill-posed (it admits multiple solutions) because of depth ambiguity and occlusion, and existing methods depend on large amounts of paired 2D-3D data and generalize poorly.
Method: A diffusion model serves as the generative prior in a gradient-guided conditional sampling framework: gradients of a 2D keypoint detector's heatmaps guide an unconditional diffusion model pretrained only on 3D pose data, which models and samples a distribution of multi-hypothesis 3D poses given an image.
Result: The method achieves state-of-the-art results on Human3.6M among methods that need no paired data (best-of-m evaluation), generalizes well to MPI-INF-3DHP and 3DPW, and extends to new tasks such as pose generation and pose completion.
Conclusion: Diffusion models can effectively model uncertainty in 3D pose, and the proposed guidance framework removes the dependence on paired supervision while improving generalization and task flexibility.
Abstract: 3D human pose estimation from 2D images is a challenging problem due to depth ambiguity and occlusion. Because of these challenges the task is underdetermined: there exist multiple -- possibly infinite -- poses that are plausible given the image. Despite this, many prior works assume the existence of a deterministic mapping and estimate a single pose given an image. Furthermore, methods based on machine learning require a large amount of paired 2D-3D data to train and suffer from generalization issues to unseen scenarios. To address both of these issues, we propose a framework for pose estimation using diffusion models, which enables sampling from a probability distribution over plausible poses which are consistent with a 2D image. Our approach falls under the guidance framework for conditional generation, and guides samples from an unconditional diffusion model, trained only on 3D data, using the gradients of the heatmaps from a 2D keypoint detector. We evaluate our method on the Human 3.6M dataset under best-of-$m$ multiple hypothesis evaluation, showing state-of-the-art performance among methods which do not require paired 2D-3D data for training. We additionally evaluate the generalization ability using the MPI-INF-3DHP and 3DPW datasets and demonstrate competitive performance. Finally, we demonstrate the flexibility of our framework by using it for novel tasks including pose generation and pose completion, without the need to train bespoke conditional models. We make code available at https://github.com/fsnelgar/diffusion_pose.
[110] FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation
Chenxi Zhang,Ziliang Gan,Liyun Zhu,Youwei Pang,Qing Zhang,Rongjunchen Zhang
Main category: cs.CV
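TL;DR: This paper proposes FinMTM, a multi-turn multimodal benchmark for the financial domain that addresses the limited data diversity and task complexity of existing financial benchmarks.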
Details
Motivation: Existing financial benchmarks are mostly single-turn with a narrow set of question formats and cannot comprehensively evaluate vision-language models in realistic financial applications.
Method: FinMTM is a multi-turn multimodal benchmark of 11,133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals, including candlestick charts, statistical plots, and report figures, with task-specific evaluation protocols for multiple-choice questions, multi-turn dialogues, and agent tasks.
Result: An extensive evaluation of 22 vision-language models reveals their limitations in fine-grained visual perception, long-context reasoning, and complex agent workflows.
Conclusion: FinMTM offers a more comprehensive and challenging evaluation framework for financial vision-language models and helps advance their use in real financial scenarios.
Abstract: The financial domain poses substantial challenges for vision-language models (VLMs) due to specialized chart formats and knowledge-intensive reasoning requirements. However, existing financial benchmarks are largely single-turn and rely on a narrow set of question formats, limiting comprehensive evaluation in realistic application scenarios. To address this gap, we propose FinMTM, a multi-turn multimodal benchmark that expands diversity along both data and task dimensions. On the data side, we curate and annotate 11,133 bilingual (Chinese and English) financial QA pairs grounded in financial visuals, including candlestick charts, statistical plots, and report figures. On the task side, FinMTM covers single- and multiple-choice questions, multi-turn open-ended dialogues, and agent-based tasks. We further design task-specific evaluation protocols, including a set-overlap scoring rule for multiple-choice questions, a weighted combination of turn-level and session-level scores for multi-turn dialogues, and a composite metric that integrates planning quality with final outcomes for agent tasks. Extensive experimental evaluation of 22 VLMs reveals their limitations in fine-grained visual perception, long-context reasoning, and complex agent workflows.
[111] SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass
Chen Qian,Xinran Yu,Danyang Li,Guoxuan Chi,Zheng Yang,Qiang Ma,Xin Miao
Main category: cs.CV
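TL;DR: This paper proposes "bypass", a new visual token pruning paradigm that keeps unselected tokens and forwards them to later pruning stages for re-evaluation, avoiding the loss of critical information caused by early pruning. Building on it, the training-free SwiftVLM method prunes independently at several model-specific layers and markedly improves the accuracy-efficiency trade-off on fine-grained visual tasks.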
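The difference between irreversible pruning and the bypass idea can be illustrated with a toy sketch. This is not SwiftVLM's implementation: it assumes each pruning stage has an importance score for every visual token (how bypassed tokens are actually scored at later layers is not described here), and the function names are hypothetical.

```python
import numpy as np

def prune_irreversible(scores_per_stage, keep_per_stage):
    """Conventional pruning: a token dropped at one stage is gone for good."""
    alive = np.arange(len(scores_per_stage[0]))
    for scores, k in zip(scores_per_stage, keep_per_stage):
        best = np.argsort(-scores[alive], kind="stable")[:k]
        alive = np.sort(alive[best])
    return alive

def prune_with_bypass(scores_per_stage, keep_per_stage):
    """Bypass-style pruning (toy version): unselected tokens are set aside,
    not discarded, and compete again at the next pruning stage."""
    active = np.arange(len(scores_per_stage[0]))
    bypassed = np.array([], dtype=int)
    for scores, k in zip(scores_per_stage, keep_per_stage):
        candidates = np.concatenate([active, bypassed])
        ranked = candidates[np.argsort(-scores[candidates], kind="stable")]
        active, bypassed = np.sort(ranked[:k]), np.sort(ranked[k:])
    return active, bypassed

# Token 2 looks unimportant at the shallow stage but matters at the deep one.
stage_scores = [np.array([0.9, 0.8, 0.1, 0.2]), np.array([0.1, 0.2, 0.9, 0.3])]
print(prune_irreversible(stage_scores, [2, 1]))    # [1]
print(prune_with_bypass(stage_scores, [2, 1])[0])  # [2]
```

In the toy example the irreversible scheme can only choose among the two survivors of the first stage, while the bypass scheme recovers the token that becomes important later.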
Details
Motivation: Existing visual token pruning methods for vision-language models (VLMs) often make irreversible pruning decisions at shallow layers; this improves efficiency but clearly degrades performance on tasks that need fine-grained visual detail. Layer-wise analysis shows that token importance changes substantially with depth, so tokens judged unimportant early can become critical for text-conditioned reasoning in deeper layers.
Method: The bypass paradigm does not discard unselected visual tokens; it routes them around the current pruning stage and forwards them to later stages for re-evaluation. SwiftVLM builds on this as a training-free method that prunes at model-specific layers, with independent pruning decisions at each layer.
Result: Across multiple VLM architectures and benchmarks, SwiftVLM consistently outperforms existing pruning strategies, achieves a better accuracy-efficiency balance, and selects visual tokens in a way that better reflects their true importance.
Conclusion: Early irreversible pruning harms fine-grained reasoning; the bypass paradigm and SwiftVLM show that layer-wise dynamic re-evaluation is both necessary and effective, offering a new direction for efficient VLM inference.
Abstract: Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning tasks, they suffer from significant performance degradation on tasks requiring fine-grained visual details. Through layer-wise analysis, we reveal substantial discrepancies in visual token importance across layers, showing that tokens deemed unimportant at shallow layers can later become highly relevant for text-conditioned reasoning. To avoid irreversible critical information loss caused by premature pruning, we introduce a new pruning paradigm, termed bypass, which preserves unselected visual tokens and forwards them to subsequent pruning stages for re-evaluation. Building on this paradigm, we propose SwiftVLM, a simple and training-free method that performs pruning at model-specific layers with strong visual token selection capability, while enabling independent pruning decisions across layers. Experiments across multiple VLMs and benchmarks demonstrate that SwiftVLM consistently outperforms existing pruning strategies, achieving superior accuracy-efficiency trade-offs and more faithful visual token selection behavior.
[112] FSOD-VFM: Few-Shot Object Detection with Vision Foundation Models and Graph Diffusion
Chen-Bin Feng,Youyang Sha,Longfei Liu,Yongjun Yu,Chi Man Vong,Xuanlong Yu,Xi Shen
Main category: cs.CV
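TL;DR: This paper proposes FSOD-VFM, a framework that uses vision foundation models (such as SAM2 and DINOv2) for training-free few-shot object detection and introduces graph-diffusion confidence reweighting to mitigate the fragmented bounding boxes produced by the universal proposal network (UPN), substantially outperforming existing training-free methods on multiple benchmarks.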
Details
Motivation: Few-shot object detectors that rely on a universal proposal network (UPN) tend to produce over-fragmented bounding boxes, yielding many local, partial false-positive proposals that hurt detection completeness and precision; an efficient adaptation scheme that needs no additional training is also desirable.
Method: FSOD-VFM combines (1) a universal proposal network (UPN) that generates category-agnostic candidate boxes, (2) SAM2 for accurate mask extraction, (3) DINOv2 for transferable features, and (4) a graph in which candidate boxes are nodes and diffusion over the directed graph reweights confidences, raising the scores of whole-object proposals and suppressing fragments.
Result: The method substantially outperforms existing approaches on Pascal-5^i, COCO-20^i, and the cross-domain CD-FSOD datasets; in the CD-FSOD 10-shot setting it reaches 31.6 AP, more than 10 points above the previous training-free result (21.4 AP).
Conclusion: FSOD-VFM demonstrates the potential of vision foundation models for training-free few-shot detection; the graph-diffusion confidence reweighting mitigates UPN over-fragmentation and improves detection completeness and robustness.
Abstract: In this paper, we present FSOD-VFM: Few-Shot Object Detectors with Vision Foundation Models, a framework that leverages vision foundation models to tackle the challenge of few-shot object detection. FSOD-VFM integrates three key components: a universal proposal network (UPN) for category-agnostic bounding box generation, SAM2 for accurate mask extraction, and DINOv2 features for efficient adaptation to new object categories. Despite the strong generalization capabilities of foundation models, the bounding boxes generated by UPN often suffer from overfragmentation, covering only partial object regions and leading to numerous small, false-positive proposals rather than accurate, complete object detections. To address this issue, we introduce a novel graph-based confidence reweighting method. In our approach, predicted bounding boxes are modeled as nodes in a directed graph, with graph diffusion operations applied to propagate confidence scores across the network. This reweighting process refines the scores of proposals, assigning higher confidence to whole objects and lower confidence to local, fragmented parts. This strategy improves detection granularity and effectively reduces the occurrence of false-positive bounding box proposals. Through extensive experiments on Pascal-5$^i$, COCO-20$^i$, and CD-FSOD datasets, we demonstrate that our method substantially outperforms existing approaches, achieving superior performance without requiring additional training.
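The graph-based reweighting can be pictured with a small sketch. Everything below is an assumption made for illustration, since the abstract does not give the edge definition or the diffusion rule: boxes are nodes, a directed edge runs from a box to any larger box that contains at least a fraction `tau` of it, and at each step a node with outgoing edges passes a fraction `alpha` of its confidence to its containers. Scores are left unnormalized, so they can exceed 1.

```python
import numpy as np

def containment(boxes):
    """frac[i, j] = fraction of box i's area lying inside box j (boxes: x1, y1, x2, y2)."""
    x1, y1, x2, y2 = boxes.T
    area = (x2 - x1) * (y2 - y1)
    iw = np.minimum(x2[:, None], x2[None, :]) - np.maximum(x1[:, None], x1[None, :])
    ih = np.minimum(y2[:, None], y2[None, :]) - np.maximum(y1[:, None], y1[None, :])
    inter = np.clip(iw, 0, None) * np.clip(ih, 0, None)
    return inter / area[:, None], area

def diffuse_confidence(boxes, scores, tau=0.8, alpha=0.5, steps=3):
    """Hypothetical reweighting: confidence flows from fragments to the larger
    boxes that contain them, so whole objects gain and parts lose score."""
    frac, area = containment(boxes)
    adj = ((frac >= tau) & (area[None, :] > area[:, None])).astype(float)  # edge i -> j
    out_deg = adj.sum(axis=1)
    transition = adj / np.where(out_deg > 0, out_deg, 1.0)[:, None]
    s = np.asarray(scores, dtype=float).copy()
    for _ in range(steps):
        sent = alpha * s * (out_deg > 0)      # mass leaving each fragment
        s = s - sent + transition.T @ sent    # mass arriving at its containers
    return s

boxes = np.array([[0, 0, 10, 10],    # whole object
                  [0, 0, 5, 5],      # fragment inside it
                  [5, 5, 10, 10],    # another fragment
                  [20, 20, 30, 30]], dtype=float)  # unrelated box
print(diffuse_confidence(boxes, [0.5, 0.6, 0.6, 0.4]))
```

In this example the whole-object box ends with the highest score, the two fragments are suppressed, and the unrelated box is unchanged.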
Notably, on the challenging CD-FSOD dataset, which spans multiple datasets and domains, our FSOD-VFM achieves 31.6 AP in the 10-shot setting, substantially outperforming previous training-free methods that reach only 21.4 AP. Code is available at: https://intellindust-ai-lab.github.io/projects/FSOD-VFM.
[113] Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis
Tianhe Wu,Ruibin Li,Lei Zhang,Kede Ma
Main category: cs.CV
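TL;DR: This paper proposes Diversity-Preserved DMD (DP-DMD), a distillation framework that addresses mode collapse in distribution matching distillation (DMD) with a role-separation strategy (the first step preserves diversity, later steps refine quality). It needs no extra networks or supervision signals and maintains both diversity and generation quality.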
Details
Motivation: Existing DMD methods are prone to mode collapse because of their reverse-KL objective, and the usual perceptual or adversarial regularizers add computational overhead and training instability.
Method: A role-separated distillation framework: the first step uses a target-prediction (e.g., v-prediction) objective to preserve sample diversity, subsequent steps refine quality under the standard DMD loss, and DMD gradients are blocked at the first step.
Result: In extensive text-to-image experiments, DP-DMD markedly reduces mode collapse without a perceptual backbone, discriminator, auxiliary networks, or additional ground-truth images, while keeping visual quality on par with state-of-the-art methods.
Conclusion: With a minimal design, DP-DMD decouples the diversity and quality objectives and offers a stable, low-overhead approach to efficient, high-quality generation.
Abstract: Distribution matching distillation (DMD) aligns a multi-step generator with its few-step counterpart to enable high-quality generation under low inference cost. However, DMD tends to suffer from mode collapse, as its reverse-KL formulation inherently encourages mode-seeking behavior, for which existing remedies typically rely on perceptual or adversarial regularization, thereby incurring substantial computational overhead and training instability. In this work, we propose a role-separated distillation framework that explicitly disentangles the roles of distilled steps: the first step is dedicated to preserving sample diversity via a target-prediction (e.g., v-prediction) objective, while subsequent steps focus on quality refinement under the standard DMD loss, with gradients from the DMD objective blocked at the first step. We term this approach Diversity-Preserved DMD (DP-DMD), which, despite its simplicity -- no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images -- preserves sample diversity while maintaining visual quality on par with state-of-the-art methods in extensive text-to-image experiments.
[114] Fully Kolmogorov-Arnold Deep Model in Medical Image Segmentation
Xingyu Qiu,Xinghua Ma,Dong Liang,Gongning Luo,Wei Wang,Kuanquan Wang,Shuo Li
Main category: cs.CV
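TL;DR: This paper proposes two techniques, SaKAN and Grad-Free Spline, and uses them to build ALL U-KAN, the first fully KA (Kolmogorov-Arnold) deep model. The approach substantially eases training difficulty and memory cost and outperforms traditional and partially KA-based models on medical image segmentation.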
Details
Motivation: Deep KANs are hard to stack because of training difficulty and high memory consumption, which has limited exploration of their potential.
Method: Share-activation KAN (SaKAN) simplifies the parameterization and increases training-sample density to improve optimization; Grad-Free Spline removes the spline gradient computation, which contributes little to training but consumes a lot of memory; building on both, ALL U-KAN is a fully KA architecture in which KA and KAonv layers completely replace FC and Conv layers.
Result: On three medical image segmentation tasks, ALL U-KAN is more accurate than partially KA-based and traditional models; compared with directly stacking KAN layers, it uses 10 times fewer parameters and more than 20 times less GPU memory.
Conclusion: KA-based layers can fully replace traditional layers in high-performing deep models, and SaKAN and Grad-Free Spline open a path to deep KAN architectures.
Abstract: Deeply stacked KANs are practically impossible due to high training difficulties and substantial memory requirements. Consequently, existing studies can only incorporate a few KAN layers, hindering the comprehensive exploration of KANs. This study overcomes these limitations and introduces the first fully KA-based deep model, demonstrating that KA-based layers can entirely replace traditional architectures in deep learning and achieve superior learning capacity. Specifically, (1) the proposed Share-activation KAN (SaKAN) reformulates Sprecher's variant of the Kolmogorov-Arnold representation theorem, which achieves better optimization due to its simplified parameterization and denser training samples, to ease training difficulty; (2) this paper shows that spline gradients contribute negligibly to training while consuming huge GPU memory, and thus proposes the Grad-Free Spline to significantly reduce memory usage and computational overhead; (3) building on these two innovations, our ALL U-KAN is the first representative implementation of a fully KA-based deep model, where the proposed KA and KAonv layers completely replace FC and Conv layers. Extensive evaluations on three medical image segmentation tasks confirm the superiority of the full KA-based architecture compared to partial KA-based and traditional architectures, achieving higher segmentation accuracy in all cases.
Compared to directly stacking deep KAN layers, ALL U-KAN achieves a 10-fold reduction in parameter count and reduces memory consumption by more than 20 times, opening up new exploration of deep KAN architectures.
[115] Human-in-the-loop Adaptation in Group Activity Feature Learning for Team Sports Video Retrieval
Chihiro Nakatani,Hiroaki Kawashima,Norimichi Ukita
Main category: cs.CV
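TL;DR: This paper proposes a human-in-the-loop adaptation method for Group Activity Feature Learning (GAFL) that needs no group activity annotations and improves group-activity video retrieval. A group activity feature space is first built by self-supervised pretraining and then refined through an interactive, data-efficient fine-tuning process that applies contrastive learning to the user's positive and negative feedback on selected videos. Experiments on two team sports datasets validate its effectiveness.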
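The feedback-driven update can be sketched as a contrastive loss over the user's labels. The exact loss used in the paper is not given in this listing, so the following assumes a common InfoNCE-style form: each user-approved (positive) video is contrasted with the rejected (negative) ones relative to the query feature. The `temperature` value and function names are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale feature vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(query, positives, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed form): low when positives are more
    similar to the query than the negatives are."""
    q = l2_normalize(query)
    pos_logits = l2_normalize(positives) @ q / temperature   # shape (P,)
    neg_logits = l2_normalize(negatives) @ q / temperature   # shape (N,)
    neg_lse = np.logaddexp.reduce(neg_logits)                # log-sum-exp over negatives
    # -log( exp(p) / (exp(p) + sum_j exp(n_j)) ), averaged over positives
    return float(np.mean(np.logaddexp(pos_logits, neg_lse) - pos_logits))

query = np.array([1.0, 0.0])
liked = np.array([[1.0, 0.1]])      # video the user marked as similar
disliked = np.array([[0.0, 1.0]])   # video the user marked as dissimilar
print(contrastive_loss(query, liked, disliked))   # small: features agree with the labels
print(contrastive_loss(query, disliked, liked))   # large: features contradict the labels
```

Minimizing such a loss with respect to the feature extractor pulls positives toward the query and pushes negatives away, which matches the behavior the summary describes.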
Details
Motivation: Existing methods rely on supervised learning over predefined group activity classes, which limits flexibility and generalization, and they adapt poorly to a user's individual retrieval needs; an adaptive framework is needed that requires no fine-grained annotation and keeps learning from user feedback.
Method: (1) Self-supervised pretraining builds an initial group activity feature (GAF) space from the similarity of group activities; (2) human-in-the-loop fine-tuning uses a data-efficient video selection strategy to present samples for the user to label; (3) contrastive learning uses the user's positive and negative labels to adjust the feature space so that positives move closer to, and negatives farther from, the query videos.
Result: Retrieval performance improves significantly on two team sports datasets; ablation studies confirm the contribution of the self-supervised pretraining, the interactive fine-tuning, and the data selection strategy.
Conclusion: The proposed human-in-the-loop adaptive GAFL framework removes the dependence on group activity annotations and improves retrieval with only a small amount of user feedback, making it practical and scalable.
Abstract: This paper proposes human-in-the-loop adaptation for Group Activity Feature Learning (GAFL) without group activity annotations. This human-in-the-loop adaptation is employed in a group-activity video retrieval framework to improve its retrieval performance. Our method initially pre-trains the GAF space based on the similarity of group activities in a self-supervised manner, unlike prior work that classifies videos into pre-defined group activity classes in a supervised learning manner. Our interactive fine-tuning process updates the GAF space to allow a user to better retrieve videos similar to query videos given by the user. In this fine-tuning, our proposed data-efficient video selection process provides several videos, which are selected from a video database, to the user in order to manually label these videos as positive or negative. These labeled videos are used to update (i.e., fine-tune) the GAF space, so that the positive and negative videos move closer to and farther away from the query videos through contrastive learning. Our comprehensive experimental results on two team sports datasets validate that our method significantly improves the retrieval performance. Ablation studies also demonstrate that several components in our human-in-the-loop adaptation contribute to the improvement of the retrieval performance. Code: https://github.com/chihina/GAFL-FINE-CVIU.
[116] BinaryDemoire: Moiré-Aware Binarization for Image Demoiréing
Zheng Chen,Zhi Yang,Xiaoyang Liu,Weihang Zhang,Mengfan Wang,Yifan Fu,Linghe Kong,Yulun Zhang
Main category: cs.CV
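TL;DR: This paper proposes BinaryDemoire, a binarized deep network framework designed for image demoiréing. A moiré-aware binary gate (MABG) and a shuffle-grouped residual adapter (SGRA) improve performance, and the method surpasses existing binarization methods on several benchmarks.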
Details
Motivation: Existing deep demoiréing networks perform well but are computationally expensive; binarization can compress models drastically but performs poorly because it ignores the frequency-domain structure of moiré patterns.
Method: BinaryDemoire introduces (1) a moiré-aware binary gate (MABG) that extracts lightweight frequency descriptors and produces channel-wise gating coefficients to modulate binary convolution responses, and (2) a shuffle-grouped residual adapter (SGRA) that performs structured sparse shortcut alignment and exchanges information across channel groups.
Result: On four benchmark datasets, BinaryDemoire clearly outperforms existing binarized demoiréing methods.
Conclusion: A binarization design tailored to the multi-scale, multi-directional frequency characteristics of moiré degradation (MABG plus SGRA) reduces the performance loss caused by binarization and enables efficient, high-quality demoiréing.
Abstract: Image demoiréing aims to remove structured moiré artifacts in recaptured imagery, where degradations are highly frequency-dependent and vary across scales and directions. While recent deep networks achieve high-quality restoration, their full-precision designs remain costly for deployment. Binarization offers an extreme compression regime by quantizing both activations and weights to 1-bit. Yet, it has been rarely studied for demoiréing and performs poorly when naively applied. In this work, we propose BinaryDemoire, a binarized demoiréing framework that explicitly accommodates the frequency structure of moiré degradations. First, we introduce a moiré-aware binary gate (MABG) that extracts lightweight frequency descriptors together with activation statistics. It predicts channel-wise gating coefficients to condition the aggregation of binary convolution responses. Second, we design a shuffle-grouped residual adapter (SGRA) that performs structured sparse shortcut alignment. It further integrates interleaved mixing to promote information exchange across different channel partitions. Extensive experiments on four benchmarks demonstrate that the proposed BinaryDemoire surpasses current binarization methods. Code: https://github.com/zhengchen1999/BinaryDemoire.
[117] LSGQuant: Layer-Sensitivity Guided Quantization for One-Step Diffusion Real-World Video Super-Resolution
Tianxing Wu,Zheng Chen,Cirou Xu,Bowen Chai,Yong Guo,Yutong Liu,Linghe Kong,Yulun Zhang
Main category: cs.CV
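TL;DR: This paper proposes LSGQuant, a layer-sensitivity guided quantization method for one-step diffusion video super-resolution. It combines a dynamic range adaptive quantizer, a variance-oriented layer training strategy, and quantization-aware optimization to compress the model substantially with almost no loss in performance.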
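Why the dynamic range of activations matters for low-bit quantization can be seen with a plain uniform quantizer. The sketch below is a generic min-max quantizer, not LSGQuant's dynamic range adaptive quantizer, whose design is not described in this listing; it only contrasts a single range for the whole tensor with a separate range per token. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def quantize_dequantize(x, bits=4, axis=None):
    """Uniform min-max quantization followed by dequantization.
    axis=None uses one range for the whole tensor; axis=1 uses one per row (token)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)     # guard against constant inputs
    codes = np.clip(np.round((x - lo) / scale), 0, levels)
    return codes * scale + lo

# Synthetic "token activations" whose magnitudes span three orders of magnitude.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 16)) * np.array([[0.1], [1.0], [10.0], [100.0]])

per_tensor = quantize_dequantize(tokens, bits=4, axis=None)
per_token = quantize_dequantize(tokens, bits=4, axis=1)
print(np.abs(per_tensor - tokens).mean(axis=1))  # mean error per token, shared range
print(np.abs(per_token - tokens).mean(axis=1))   # mean error per token, own range
```

With a shared range, the step size is set by the largest token, so small-magnitude tokens lose most of their resolution; a per-token range bounds each token's error by half of its own step size.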
Details
Motivation: One-step diffusion models perform well and run fast in real-world video super-resolution, but their Diffusion Transformer (DiT) backbones are large and computationally expensive; existing low-bit quantization methods struggle with the high dynamic range of the input latents and with the differing behavior of individual layers.
Method: LSGQuant (1) uses a Dynamic Range Adaptive Quantizer (DRAQ) to fit video token activations; (2) estimates layer sensitivity from layer-wise statistics gathered during calibration and applies a Variance-Oriented Layer Training Strategy (VOLTS); (3) introduces Quantization-Aware Optimization (QAO) to jointly refine the quantized branch and a retained high-precision branch.
Result: Experiments show that the method nearly matches the full-precision model while significantly outperforming existing quantization techniques.
Conclusion: LSGQuant is an efficient and robust quantization scheme that addresses the compute and storage bottlenecks of deploying one-step diffusion VSR models in real-world settings.
Abstract: One-step diffusion models have demonstrated promising capability and fast inference in real-world video super-resolution (VSR). Nevertheless, the substantial model size and high computational cost of Diffusion Transformers (DiTs) limit downstream applications. While low-bit quantization is a common approach for model compression, the effectiveness of quantized models is challenged by the high dynamic range of input latents and diverse layer behaviors. To deal with these challenges, we introduce LSGQuant, a layer-sensitivity guided quantization approach for one-step diffusion-based real-world VSR. Our method incorporates a Dynamic Range Adaptive Quantizer (DRAQ) to fit video token activations. Furthermore, we estimate layer sensitivity and implement a Variance-Oriented Layer Training Strategy (VOLTS) by analyzing layer-wise statistics in calibration. We also introduce Quantization-Aware Optimization (QAO) to jointly refine the quantized branch and a retained high-precision branch. Extensive experiments demonstrate that our method performs nearly on par with the original full-precision model and significantly exceeds existing quantization techniques. Code is available at: https://github.com/zhengchen1999/LSGQuant.
[118] From Single Scan to Sequential Consistency: A New Paradigm for LIDAR Relocalization
Minghang Zhu,Zhijing Wang,Yuxin Guo,Wen Li,Sheng Ao,Cheng Wang
Main category: cs.CV
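TL;DR: This paper proposes TempLoc, a framework that improves the robustness of LiDAR relocalization by modeling sequential consistency. It comprises three modules (global coordinate estimation, prior coordinate generation, and uncertainty-guided coordinate fusion) and substantially outperforms existing methods on the NCLT and Oxford Robot-Car datasets.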
Details
Motivation: Existing regression-based LiDAR relocalization methods perform poorly in dynamic or ambiguous scenes because they either rely on single-frame inference or ignore spatio-temporal consistency across scans.
Method: The TempLoc framework has three modules: (1) a Global Coordinate Estimation module that predicts point-wise global coordinates and uncertainties for each LiDAR scan; (2) a Prior Coordinate Generation module that uses attention to estimate inter-frame point correspondences; and (3) an Uncertainty-Guided Coordinate Fusion module that fuses the two predictions end to end to output a more consistent and accurate 6-DoF pose.
Result: On the NCLT and Oxford Robot-Car benchmarks, TempLoc outperforms state-of-the-art methods by a large margin.
Conclusion: Temporal-aware correspondence modeling significantly improves the robustness and accuracy of LiDAR relocalization.
Abstract: LiDAR relocalization aims to estimate the global 6-DoF pose of a sensor in the environment. However, existing regression-based approaches are prone to errors in dynamic or ambiguous scenarios, as they either rely solely on single-frame inference or neglect the spatio-temporal consistency across scans. In this paper, we propose TempLoc, a new LiDAR relocalization framework that enhances the robustness of localization by effectively modeling sequential consistency. Specifically, a Global Coordinate Estimation module is first introduced to predict point-wise global coordinates and associated uncertainties for each LiDAR scan. A Prior Coordinate Generation module is then presented to estimate inter-frame point correspondences via an attention mechanism. Lastly, an Uncertainty-Guided Coordinate Fusion module is deployed to integrate both predictions of point correspondence in an end-to-end fashion, yielding a more temporally consistent and accurate global 6-DoF pose. Experimental results on the NCLT and Oxford Robot-Car benchmarks show that our TempLoc outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of temporal-aware correspondence modeling in LiDAR relocalization. Our code will be released soon.
[119] Hand3R: Online 4D Hand-Scene Reconstruction in the Wild
Wendi Hu,Haonan Zhou,Wenhao Hu,Gaoang Wang
Main category: cs.CV
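TL;DR: Hand3R is the first online framework for joint 4D hand-scene reconstruction from monocular video. A scene-aware visual prompting mechanism fuses a pre-trained hand expert with a 4D scene foundation model, producing accurate hand meshes and dense metric-scale scene geometry in a single forward pass.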
Details
Motivation: Most existing methods reconstruct isolated hands in local coordinate frames and ignore the physical interaction between dynamic hands and the surrounding 3D environment, which makes real interactions hard to understand.
Method: Hand3R combines a pre-trained hand expert with a 4D scene foundation model and uses a scene-aware visual prompting mechanism to inject high-fidelity hand priors into a persistent scene memory, enabling joint reconstruction in a single forward pass.
Result: Hand3R requires no offline optimization and achieves competitive performance in both local hand reconstruction and global positioning.
Conclusion: Hand3R is the first to achieve online joint 4D hand-scene reconstruction from monocular video, offering a new paradigm for understanding physical interaction in embodied AI.
Abstract: For Embodied AI, jointly reconstructing dynamic hands and the dense scene context is crucial for understanding physical interaction. However, most existing methods recover isolated hands in local coordinates, overlooking the surrounding 3D environment. To address this, we present Hand3R, the first online framework for joint 4D hand-scene reconstruction from monocular video. Hand3R synergizes a pre-trained hand expert with a 4D scene foundation model via a scene-aware visual prompting mechanism. By injecting high-fidelity hand priors into a persistent scene memory, our approach enables simultaneous reconstruction of accurate hand meshes and dense metric-scale scene geometry in a single forward pass. Experiments demonstrate that Hand3R bypasses the reliance on offline optimization and delivers competitive performance in both local hand reconstruction and global positioning.
[120] VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers
Zhiwen Li,Zhongjie Duan,Jinyan Ye,Cen Chen,Daoyuan Chen,Yaliang Li,Yingda Chen
Main category: cs.CV
TL;DR: VIRAL is a novel framework that enables in-context learning for computer vision by leveraging visual analogy and a modified Diffusion Transformer, achieving state-of-the-art performance across diverse visual tasks.
Details
Motivation: Replicating in-context learning (ICL) in computer vision is difficult due to the heterogeneity of visual tasks; existing visual context datasets are also insufficient.
Method: VIRAL formulates ICL as conditional generation via visual analogy ($x_s : x_t :: x_q : y_q$), adapts a frozen Diffusion Transformer with role-aware multi-image conditioning, and introduces a Mixture-of-Experts LoRA to reduce gradient interference. It also includes a newly curated large-scale visual context dataset covering perception, restoration, and editing.
Result: VIRAL outperforms existing methods on diverse visual tasks, including open-domain editing, demonstrating the feasibility of a unified visual ICL paradigm.
Conclusion: A unified visual in-context learning framework, VIRAL, is effective across heterogeneous visual tasks, enabled by visual analogy, architectural adaptations, and improved data curation.
Abstract: Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose VIRAL, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy ($x_s : x_t :: x_q : y_q$). We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge the gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A
[121] ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask
Zhuoran Yang,Yanyong Zhang
Main category: cs.CV
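TL;DR: This paper proposes ConsisDrive, an identity-preserving world model for autonomous driving video generation. It addresses object identity drift with instance-masked attention and an instance-masked loss, achieving state-of-the-art generation quality and improved downstream task performance on the nuScenes dataset.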
Details
Motivation: Existing world models suffer from identity drift when generating driving videos: the same object changes appearance or category across frames because instance-level temporal constraints are missing.
Method: The ConsisDrive framework has two core components: (1) Instance-Masked Attention, which introduces instance identity masks and trajectory masks into the attention blocks so that visual tokens interact only with their corresponding instance features; and (2) Instance-Masked Loss, which uses probabilistic instance masking to adaptively focus on foreground regions, suppressing background noise while preserving scene fidelity.
Result: ConsisDrive achieves state-of-the-art driving video generation quality on the nuScenes dataset and significantly improves downstream autonomous driving task performance.
Conclusion: By explicitly modeling instance-level temporal consistency, ConsisDrive effectively mitigates identity drift and offers a new paradigm for generating high-quality, high-fidelity driving simulation data.
Abstract: Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.
[122] FARTrack: Fast Autoregressive Visual Tracking with High Performance
Guijie Wang,Tong Lin,Yifan Bai,Anjia Cao,Shiyi Liang,Wangbo Zhao,Xing Wei
Main category: cs.CV
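TL;DR: This paper proposes FARTrack, a fast autoregressive tracking framework. Task-specific self-distillation and inter-frame autoregressive sparsification substantially increase inference speed while maintaining high performance, making the tracker suitable for resource-constrained devices.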
Details
Motivation: High-performance trackers are usually slow and hard to deploy on resource-constrained devices, so a balance between inference speed and tracking performance is needed.
Method: The FARTrack framework combines Task-Specific Self-Distillation (distilling task-specific tokens layer by layer) with Inter-frame Autoregressive Sparsification (sequentially condensing multiple templates to learn a temporally global optimal sparsification strategy).
Result: FARTrack reaches an AO of 70.6% on GOT-10k in real time; the fastest model runs at 343 FPS on a GPU and 121 FPS on a CPU.
Conclusion: FARTrack substantially increases inference speed without sacrificing performance, supporting the effectiveness of autoregressive modeling and lightweight design in visual tracking.
Abstract: Inference speed and tracking performance are two critical evaluation metrics in the field of visual tracking. However, high-performance trackers often suffer from slow processing speeds, making them impractical for deployment on resource-constrained devices. To alleviate this issue, we propose FARTrack, a Fast Auto-Regressive Tracking framework. Since autoregression emphasizes the temporal nature of the trajectory sequence, it can maintain high performance while achieving efficient execution across various devices. FARTrack introduces Task-Specific Self-Distillation and Inter-frame Autoregressive Sparsification, designed from the perspectives of shallow-yet-accurate distillation and redundant-to-essential token optimization, respectively. Task-Specific Self-Distillation achieves model compression by distilling task-specific tokens layer by layer, enhancing the model's inference speed while avoiding suboptimal manual assignment of teacher-student layer pairs. Meanwhile, Inter-frame Autoregressive Sparsification sequentially condenses multiple templates, avoiding additional runtime overhead while learning a temporally-global optimal sparsification strategy. FARTrack demonstrates outstanding speed and competitive performance. It delivers an AO of 70.6% on GOT-10k in real time. Beyond that, our fastest model achieves a speed of 343 FPS on the GPU and 121 FPS on the CPU.
[123] PokeFusion Attention: Enhancing Reference-Free Style-Conditioned Generation
Jingbang Tang
Main category: cs.CV
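TL;DR: This paper proposes PokeFusion Attention, a lightweight, reference-free method for stylized character generation. It fuses textual semantics with learned style embeddings inside the diffusion decoder to produce high-fidelity, geometrically consistent stylized images, and it needs no external reference image, is parameter-efficient, and is plug-and-play.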
Details
Motivation: Existing text-to-image diffusion models suffer from style drift and geometric inconsistency in stylized character generation: text-only prompts under-specify visual style, while reference-based methods depend on external images, add complexity, and limit deployment flexibility.
Method: PokeFusion Attention is a lightweight decoder-level cross-attention mechanism that fuses textual semantics with learnable style embeddings inside the diffusion decoder. It decouples text and style conditioning at the attention level and trains only the cross-attention layers and a small style projection module, keeping the pretrained backbone fully frozen.
Result: On a Pokemon-style character generation benchmark, the method improves style fidelity, semantic alignment, and character shape consistency over mainstream adapter baselines, with low parameter overhead, simple inference, and support for transfer across backbones.
Conclusion: PokeFusion Attention is an efficient, flexible, plug-and-play, reference-free style control scheme that offers a new paradigm for fine-grained stylized generation in diffusion models.
Abstract: This paper studies reference-free style-conditioned character generation in text-to-image diffusion models, where high-quality synthesis requires both stable character structure and consistent, fine-grained style expression across diverse prompts. Existing approaches primarily rely on text-only prompting, which is often under-specified for visual style and tends to produce noticeable style drift and geometric inconsistency, or introduce reference-based adapters that depend on external images at inference time, increasing architectural complexity and limiting deployment flexibility. We propose PokeFusion Attention, a lightweight decoder-level cross-attention mechanism that fuses textual semantics with learned style embeddings directly inside the diffusion decoder. By decoupling text and style conditioning at the attention level, our method enables effective reference-free stylized generation while keeping the pretrained diffusion backbone fully frozen. PokeFusion Attention trains only decoder cross-attention layers together with a compact style projection module, resulting in a parameter-efficient and plug-and-play control component that can be easily integrated into existing diffusion pipelines and transferred across different backbones. Experiments on a stylized character generation benchmark (Pokemon-style) demonstrate that our method consistently improves style fidelity, semantic alignment, and character shape consistency compared with representative adapter-based baselines, while maintaining low parameter overhead and inference-time simplicity.
[124] Spiral RoPE: Rotate Your Rotary Positional Embeddings in the 2D Plane
Haoyu Liu,Sucheng Ren,Tingyu Zhu,Peng Wang,Cihang Xie,Alan Yuille,Zeyu Zheng,Feng Wang
Main category: cs.CV
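TL;DR: This paper proposes Spiral RoPE, an improved 2D rotary positional encoding. It partitions the embedding channels into groups and rotates each group along one of a set of uniformly distributed directions, lifting the restriction of standard axial RoPE to horizontal and vertical directions. This models oblique spatial relationships in images better and improves performance across a range of vision tasks.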
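The channel grouping described in this summary can be illustrated with a small sketch. It is not the authors' implementation: the function name `spiral_rope`, the choice of directions at angles `k * pi / num_dirs`, the split-half pairing of channels, and the frequency base of 10000 are all assumptions made for illustration. It rotates one token's feature vector according to the projection of its 2D patch position onto each group's direction.

```python
import math

def spiral_rope(x, pos, num_dirs=4, base=10000.0):
    """Rotate one token's feature vector x (list of floats) by its 2D position.

    Illustrative only: channels are split into num_dirs groups, and group k
    is rotated by the projection of pos onto the direction at angle
    k * pi / num_dirs (directions spread uniformly over a half circle).
    """
    d = len(x)
    if d % (2 * num_dirs) != 0:
        raise ValueError("len(x) must be divisible by 2 * num_dirs")
    group = d // num_dirs   # channels assigned to each direction
    half = group // 2       # number of rotated channel pairs per direction
    out = [0.0] * d
    for k in range(num_dirs):
        theta = math.pi * k / num_dirs
        # Scalar position of the patch along this group's direction.
        t = pos[0] * math.cos(theta) + pos[1] * math.sin(theta)
        start = k * group
        for i in range(half):
            angle = t * base ** (-i / half)  # RoPE-style frequency schedule
            c, s = math.cos(angle), math.sin(angle)
            a, b = x[start + i], x[start + half + i]
            out[start + i] = a * c - b * s
            out[start + half + i] = a * s + b * c
    return out
```

In this sketch, `num_dirs=2` reduces to the two axis-aligned directions, and larger values add oblique ones. Because each projection is linear in the position, the dot product of two rotated vectors depends only on the offset between their positions, which is the relative-position property RoPE is known for.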
Details
Motivation: Standard axial 2D RoPE in vision transformers implicitly restricts positional encoding to the horizontal and vertical directions, so it cannot effectively model the oblique spatial relationships that are common in natural images; this is a fundamental limitation.
Method: Spiral RoPE partitions the embedding channels into several groups, each associated with one of a set of uniformly distributed directions; each group is rotated according to the projection of the patch position onto its direction, giving a multi-directional positional encoding.
Result: Across classification, segmentation, generation, and other vision tasks, Spiral RoPE consistently improves performance; qualitative analysis of attention maps shows activations that concentrate more on semantically relevant objects and follow local object boundaries more closely.
Conclusion: Multi-directional positional encoding matters for vision transformers; Spiral RoPE removes the directional limitation of standard axial 2D RoPE in a simple and effective way and suggests a new direction for positional encoding design in vision.
Abstract: Rotary Position Embedding (RoPE) is the de facto positional encoding in large language models due to its ability to encode relative positions and support length extrapolation. When adapted to vision transformers, the standard axial formulation decomposes two-dimensional spatial positions into horizontal and vertical components, implicitly restricting positional encoding to axis-aligned directions. We identify this directional constraint as a fundamental limitation of the standard axial 2D RoPE, which hinders the modeling of oblique spatial relationships that naturally exist in natural images. To overcome this limitation, we propose Spiral RoPE, a simple yet effective extension that enables multi-directional positional encoding by partitioning embedding channels into multiple groups associated with uniformly distributed directions. Each group is rotated according to the projection of the patch position onto its corresponding direction, allowing spatial relationships to be encoded beyond the horizontal and vertical axes. Across a wide range of vision tasks including classification, segmentation, and generation, Spiral RoPE consistently improves performance. Qualitative analysis of attention maps further shows that Spiral RoPE exhibits more concentrated activations on semantically relevant objects and better respects local object boundaries, highlighting the importance of multi-directional positional encoding in vision transformers.
[125] EventFlash: Towards Efficient MLLMs for Event-Based Vision
Shaoyu Liu,Jianing Li,Guanghui Zhao,Yunjian Zhang,Wen Jiang,Ming Li,Xiangyang Ji
Main category: cs.CV
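TL;DR: This paper proposes EventFlash, an efficient event-based multimodal large language model. Spatiotemporal token sparsification reduces data redundancy and accelerates inference, yielding a 12.4x throughput improvement while maintaining performance.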
Details
Motivation: Existing event-based multimodal large language models (MLLMs) use dense, image-like processing and ignore the spatiotemporal sparsity of event streams, which leads to high computational cost.
Method: The authors build EventMind, a large-scale dataset with diverse scenes; propose an adaptive temporal window aggregation module that compresses temporal tokens; and design a sparse density-guided attention module that improves spatial token efficiency.
Result: EventFlash achieves a 12.4x throughput improvement over the baseline EventFlash-Zero and supports long sequences of up to 1,000 temporal bins, far beyond EventGPT's 5-bin limit.
Conclusion: EventFlash is an efficient and scalable foundation model for event-based vision and offers a new paradigm for event-driven perception.
Abstract: Event-based multimodal large language models (MLLMs) enable robust perception in high-speed and low-light scenarios, addressing key limitations of frame-based MLLMs. However, current event-based MLLMs often rely on dense image-like processing paradigms, overlooking the spatiotemporal sparsity of event streams and resulting in high computational cost. In this paper, we propose EventFlash, a novel and efficient MLLM to explore spatiotemporal token sparsification for reducing data redundancy and accelerating inference. Technically, we build EventMind, a large-scale and scene-diverse dataset with over 500k instruction sets, providing both short and long event stream sequences to support our curriculum training strategy. We then present an adaptive temporal window aggregation module for efficient temporal sampling, which adaptively compresses temporal tokens while retaining key temporal cues. Finally, a sparse density-guided attention module is designed to improve spatial token efficiency by selecting informative regions and suppressing empty or sparse areas. Experimental results show that EventFlash achieves a $12.4\times$ throughput improvement over the baseline (EventFlash-Zero) while maintaining comparable performance. It supports long-range event stream processing with up to 1,000 bins, significantly outperforming the 5-bin limit of EventGPT. We believe EventFlash serves as an efficient foundation model for event-based vision.
[126] InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation
Zhuoran Yang,Xi Guo,Chenjing Ding,Chiyu Wang,Wei Wu,Yanyong Zhang
Main category: cs.CV
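TL;DR: InstaDrive is a new framework for generating high-quality multi-view driving videos. An Instance Flow Guider and a Spatial Geometric Aligner improve temporal consistency and spatial geometric fidelity, and experiments on the nuScenes dataset show its advantages for autonomous driving tasks.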
Details
Motivation: Existing world models struggle to maintain instance-level temporal consistency and spatial geometric fidelity when generating driving videos, which hurts downstream autonomous driving performance.
Method: The InstaDrive framework has two core modules: an Instance Flow Guider, which propagates instance features across frames to enforce temporal consistency, and a Spatial Geometric Aligner, which improves spatial reasoning, positions instances precisely, and explicitly models occlusion hierarchies. The CARLA driving simulator is also used to generate rare but safety-critical scenarios for evaluation.
Result: InstaDrive achieves state-of-the-art video generation quality and performs better on downstream autonomous driving tasks on the nuScenes dataset; it also enables rigorous evaluation of safety-critical scenarios.
Conclusion: By introducing instance-aware mechanisms, InstaDrive significantly improves the realism and practical value of generated driving videos and offers a new approach to using world models for autonomous driving simulation and training.
Abstract: Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA's autopilot to procedurally and stochastically simulate rare but safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems. Our project page is https://shanpoyang654.github.io/InstaDrive/page.html.
[127] LaVPR: Benchmarking Language and Vision for Place Recognition
Ofer Idan,Dan Badur,Yosi Keller,Yoli Shavit
Main category: cs.CV
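TL;DR: This paper introduces LaVPR, a large-scale vision-language localization benchmark with more than 650,000 natural-language descriptions. It supports multi-modal fusion for improved robustness and cross-modal retrieval for language-only localization. Experiments show that language information markedly improves small models under visually degraded conditions, and the paper establishes an effective cross-modal retrieval baseline based on LoRA and Multi-Similarity loss.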
Details
Motivation: Existing visual place recognition (VPR) methods perform poorly under extreme environmental changes and perceptual aliasing, and they cannot perform "blind" localization from language descriptions alone, which is needed in applications such as emergency response.
Method: The authors build LaVPR, a large-scale vision-language benchmark, and explore two paradigms: multi-modal fusion (to improve visual robustness) and cross-modal retrieval (language-driven localization). The cross-modal retrieval baseline uses LoRA fine-tuning with Multi-Similarity loss.
Result: Language descriptions yield consistent gains under visually degraded conditions, with especially large improvements for small backbones, which become competitive with much larger vision-only models; the proposed cross-modal retrieval baseline substantially outperforms standard contrastive learning methods.
Conclusion: LaVPR advances a new class of localization systems that are both robust in the real world and resource-efficient, and it provides data, methods, and an evaluation basis for language-guided visual localization.
Abstract: Visual Place Recognition (VPR) often fails under extreme environmental changes and perceptual aliasing. Furthermore, standard systems cannot perform "blind" localization from verbal descriptions alone, a capability needed for applications such as emergency response. To address these challenges, we introduce LaVPR, a large-scale benchmark that extends existing VPR datasets with over 650,000 rich natural-language descriptions. Using LaVPR, we investigate two paradigms: Multi-Modal Fusion for enhanced robustness and Cross-Modal Retrieval for language-based localization. Our results show that language descriptions yield consistent gains in visually degraded conditions, with the most significant impact on smaller backbones. Notably, adding language allows compact models to rival the performance of much larger vision-only architectures. For cross-modal retrieval, we establish a baseline using Low-Rank Adaptation (LoRA) and Multi-Similarity loss, which substantially outperforms standard contrastive methods across vision-language models. Ultimately, LaVPR enables a new class of localization systems that are both resilient to real-world stochasticity and practical for resource-constrained deployment. Our dataset and code are available at https://github.com/oferidan1/LaVPR.
[128] HypCBC: Domain-Invariant Hyperbolic Cross-Branch Consistency for Generalizable Medical Image Analysis
Francesco Di Salvo,Sebastian Doerrich,Jonas Alle,Christian Ledig
Main category: cs.CV
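TL;DR: This paper proposes a hyperbolic-manifold representation learning method for medical image analysis. By introducing an unsupervised, domain-invariant hyperbolic cross-branch consistency constraint, it significantly outperforms Euclidean methods on several domain generalization benchmarks.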
Details
Motivation: In medical image analysis, data is scarce and covariate shifts arise from different devices, imaging protocols, and patient populations, so models generalize poorly; existing methods mostly rely on the Euclidean manifold, which struggles to capture the complex hierarchical structure of clinical data.
Method: The authors use hyperbolic manifolds to model the complex characteristics of medical image data and propose an unsupervised, domain-invariant hyperbolic cross-branch consistency constraint; validation covers 11 in-distribution datasets and 3 ViT models.
Result: On three domain generalization benchmarks (Fitzpatrick17k, Camelyon17-WILDS, and a cross-dataset retinal imaging setup), the method improves average AUC by +2.1%, confirming generalization across modalities, data sizes, and label granularities.
Conclusion: Hyperbolic representation learning effectively improves the robust generalization of medical imaging models, particularly in clinical settings with distribution shift, and offers a new direction for deploying medical AI.
Abstract: Robust generalization beyond training distributions remains a critical challenge for deep neural networks. This is especially pronounced in medical image analysis, where data is often scarce and covariate shifts arise from different hardware devices, imaging protocols, and heterogeneous patient populations. These factors collectively hinder reliable performance and slow down clinical adoption. Despite recent progress, existing learning paradigms primarily rely on the Euclidean manifold, whose flat geometry fails to capture the complex, hierarchical structures present in clinical data. In this work, we exploit the advantages of hyperbolic manifolds to model complex data characteristics. We present the first comprehensive validation of hyperbolic representation learning for medical image analysis and demonstrate statistically significant gains across eleven in-distribution datasets and three ViT models. We further propose an unsupervised, domain-invariant hyperbolic cross-branch consistency constraint. Extensive experiments confirm that our proposed method promotes domain-invariant features and outperforms state-of-the-art Euclidean methods by an average of $+2.1\%$ AUC on three domain generalization benchmarks: Fitzpatrick17k, Camelyon17-WILDS, and a cross-dataset setup for retinal imaging. These datasets span different imaging modalities, data sizes, and label granularities, confirming generalization capabilities across substantially different conditions. The code is available at https://github.com/francescodisalvo05/hyperbolic-cross-branch-consistency .
[129] Global Geometry Is Not Enough for Vision Representations
Jiwan Chung,Seon Joo Kim
Main category: cs.CV
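TL;DR: This paper challenges the common assumption in representation learning that global embedding geometry is a reliable measure of representational competence. It finds that geometric metrics barely predict compositional binding, whereas functional sensitivity, measured by the input-output Jacobian, tracks it reliably; a theoretical account traces the gap to how training objectives are designed.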
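To make "functional sensitivity measured by the input-output Jacobian" concrete, the following is a minimal sketch of one possible scalar summary: the Frobenius norm of the Jacobian, estimated with central finite differences. The paper's exact metric is not given in this summary, so the choice of norm, the function name `jacobian_frobenius_norm`, and the toy encoder are assumptions for illustration only.

```python
import math

def jacobian_frobenius_norm(f, x, eps=1e-6):
    """Estimate the Frobenius norm of the Jacobian of f at x.

    f maps a list of floats to a list of floats. Each input coordinate is
    perturbed by +/- eps (central differences), and the squared partial
    derivatives of every output coordinate are accumulated.
    """
    total = 0.0
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += eps
        minus[i] -= eps
        for a, b in zip(f(plus), f(minus)):
            total += ((a - b) / (2.0 * eps)) ** 2
    return math.sqrt(total)

def toy_encoder(v):
    """Hypothetical stand-in for an encoder: a fixed linear map R^2 -> R^2."""
    return [2.0 * v[0] + v[1], 3.0 * v[1]]

sensitivity = jacobian_frobenius_norm(toy_encoder, [0.5, -1.0])
```

For a real vision encoder the input dimension is far too large for finite differences, so automatic differentiation would be the practical route; the sketch only shows what quantity is being measured.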
Details
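Motivation: Representation learning relies heavily on the global embedding distribution as a proxy for representational competence, but this overlooks compositional structure, that is, how elements are combined.
Method: The authors systematically test how geometric metrics (such as embedding distances and similarity statistics) correlate with compositional binding across 21 vision encoders; they introduce and compute the input-output Jacobian as a measure of functional sensitivity; and they give a theoretical analysis showing that existing losses constrain geometry strongly but constrain the local mapping only weakly.
Result: Standard geometric statistics are almost unrelated to compositional binding (near-zero correlation); functional sensitivity measures such as the Jacobian norm predict compositional binding reliably; the theoretical analysis confirms that the gap stems from a bias in the design of current training objectives.
Conclusion: Global embedding geometry reflects only part of representational competence; functional sensitivity is an indispensable complementary axis for modeling compositional structure and should be incorporated into the training and evaluation of representation learning.
Abstract: A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across 21 vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input-output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input-output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.
[130] A3-TTA: Adaptive Anchor Alignment Test-Time Adaptation for Image Segmentation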
Jianghao Wu,Xiangde Luo,Yubo Zhou,Lianming Wu,Guotai Wang,Shaoting Zhang
Main category: cs.CV
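TL;DR: This paper proposes the A3-TTA framework, which generates reliable pseudo-labels through anchor-guided supervision, regularizes them with semantic consistency and boundary-aware entropy minimization, and introduces a self-adaptive exponential moving average strategy to stabilize model updates, significantly improving cross-domain image segmentation.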
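As a point of reference for the exponential moving average (EMA) mentioned in this summary, the sketch below shows a plain EMA weight update together with one hypothetical way to make its momentum adaptive. The summary does not specify how A3-TTA adapts the momentum, so `adaptive_momentum`, its confidence input, and the bounds `m_min` and `m_max` are illustrative assumptions, not the paper's rule.

```python
def ema_update(teacher, student, momentum):
    """Standard EMA step: teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def adaptive_momentum(confidence, m_min=0.9, m_max=0.999):
    """Hypothetical schedule: a confident batch lowers the momentum, so the
    teacher follows the student faster; a noisy batch keeps it near m_max."""
    confidence = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return m_max - (m_max - m_min) * confidence

# Example: a low-confidence batch barely moves the teacher weights.
teacher = ema_update([1.0, 0.0], [0.0, 1.0], adaptive_momentum(0.1))
```

The design intent of any such schedule is the one named in the summary: damp the influence of noisy pseudo-labels so that errors do not accumulate in the slowly updated weights.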
Details
Motivation: Existing pseudo-label-based test-time adaptation (TTA) methods rely on perturbation-ensemble heuristics that lack distributional grounding, producing unstable training signals that can lead to error accumulation and catastrophic forgetting.
Method: The A3-TTA framework: (1) selects high-confidence target-domain images as anchors using a class compact density metric; (2) generates pseudo-labels with the anchors as references; (3) regularizes them with semantic consistency and boundary-aware entropy minimization; and (4) uses a self-adaptive exponential moving average to mitigate label noise and stabilize model updates.
Result: On multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA improves the average Dice score by 10.40 to 17.68 percentage points over the source model and outperforms several state-of-the-art TTA methods; it also shows strong resistance to forgetting in continual TTA.
Conclusion: With its distribution-aware anchor mechanism and robust regularization, A3-TTA improves the stability and generalization of test-time adaptation and provides a practical, reliable approach to image segmentation under domain shift without source data.
Abstract: Test-Time Adaptation (TTA) offers a practical solution for deploying image segmentation models under domain shift without accessing source data or retraining. Among existing TTA strategies, pseudo-label-based methods have shown promising performance. However, they often rely on perturbation-ensemble heuristics (e.g., dropout sampling, test-time augmentation, Gaussian noise), which lack distributional grounding and yield unstable training signals. This can trigger error accumulation and catastrophic forgetting during adaptation. To address this, we propose A3-TTA, a TTA framework that constructs reliable pseudo-labels through anchor-guided supervision. Specifically, we identify well-predicted target domain images using a class compact density metric, under the assumption that confident predictions imply distributional proximity to the source domain. These anchors serve as stable references to guide pseudo-label generation, which is further regularized via semantic consistency and boundary-aware entropy minimization. Additionally, we introduce a self-adaptive exponential moving average strategy to mitigate label noise and stabilize model update during adaptation. Evaluated on both multi-domain medical images (heart structure and prostate segmentation) and natural images, A3-TTA significantly improves average Dice scores by 10.40 to 17.68 percentage points compared to the source model, outperforming several state-of-the-art TTA methods under different segmentation model architectures. A3-TTA also excels in continual TTA, maintaining high performance across sequential target domains with strong anti-forgetting ability. The code will be made publicly available at https://github.com/HiLab-git/A3-TTA.
[131] LEVIO: Lightweight Embedded Visual Inertial Odometry for Resource-Constrained Devices
Jonas Kühne,Christian Vogt,Michele Magno,Luca Benini
Main category: cs.CV
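TL;DR: This paper presents LEVIO, a visual-inertial odometry (VIO) system optimized for ultra-low-power compute platforms. It supports real-time six-degrees-of-freedom motion tracking and balances energy efficiency against accuracy on resource-constrained hardware.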
Details
Motivation: Mainstream VIO systems are computationally expensive and hard to run in real time on resource-constrained devices such as micro-drones and smart glasses, so a lightweight, low-power, accurate alternative is needed.
Method: LEVIO combines established VIO components such as ORB feature tracking and bundle adjustment with a parallelized, low-memory architecture and a hardware-software co-optimization strategy, targeting embedded microcontrollers and low-power RISC-V SoCs.
Result: LEVIO runs in real time at 20 FPS on an ultra-low-power RISC-V SoC while consuming less than 100 mW; benchmarks on public VIO datasets show a good trade-off between efficiency and accuracy.
Conclusion: LEVIO provides an efficient, deployable, open-source, infrastructure-less VIO solution for mobile robotics and AR applications, moving VIO closer to practical use on edge devices.
Abstract: Accurate, infrastructure-less sensor systems for motion tracking are essential for mobile robotics and augmented reality (AR) applications. The most popular state-of-the-art visual-inertial odometry (VIO) systems, however, are too computationally demanding for resource-constrained hardware, such as micro-drones and smart glasses. This work presents LEVIO, a fully featured VIO pipeline optimized for ultra-low-power compute platforms, allowing six-degrees-of-freedom (DoF) real-time sensing. LEVIO incorporates established VIO components such as Oriented FAST and Rotated BRIEF (ORB) feature tracking and bundle adjustment, while emphasizing a computationally efficient architecture with parallelization and low memory usage to suit embedded microcontrollers and low-power systems-on-chip (SoCs). The paper proposes and details the algorithmic design choices and the hardware-software co-optimization approach, and presents real-time performance on resource-constrained hardware. LEVIO is validated on a parallel-processing ultra-low-power RISC-V SoC, achieving 20 FPS while consuming less than 100 mW, and benchmarked against public VIO datasets, offering a compelling balance between efficiency and accuracy. To facilitate reproducibility and adoption, the complete implementation is released as open-source.
[132] Full end-to-end diagnostic workflow automation of 3D OCT via foundation model-driven AI for retinal diseases
Jinze Zhang,Jian Zhong,Li Lin,Jiaxiong Li,Ke Ma,Naiyang Li,Meng Li,Yuan Pan,Zeyu Meng,Mengyun Zhou,Shang Huang,Shilong Yu,Zhengyu Duan,Sutong Li,Honghui Xia,Juping Liu,Dan Liang,Yantao Wei,Xiaoying Tang,Jin Yuan,Peng Xiao
Main category: cs.CV
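TL;DR: FOCUS is an end-to-end OCT-based clinical diagnostic system that uses foundation models and an adaptive aggregation method to automate 3D retinal disease diagnosis; in multi-center validation it performs stably and on par with experts.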
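The step of turning per-slice predictions into a single patient-level diagnosis can be pictured with a small sketch. The summary does not describe how FOCUS's adaptive aggregation works, so the rule below (weighting each slice by its own peak class probability) and the name `aggregate_slices` are hypothetical and serve only to illustrate the 2D-to-3D aggregation idea.

```python
def aggregate_slices(slice_probs):
    """Combine per-slice class probabilities into one patient-level prediction.

    Hypothetical rule: each slice is weighted by its own peak probability,
    so confident slices contribute more than ambiguous ones.
    Returns (predicted_class_index, patient_level_probabilities).
    """
    if not slice_probs:
        raise ValueError("need at least one slice")
    num_classes = len(slice_probs[0])
    weights = [max(p) for p in slice_probs]
    total = sum(weights)
    patient = [
        sum(w * p[c] for w, p in zip(weights, slice_probs)) / total
        for c in range(num_classes)
    ]
    label = max(range(num_classes), key=lambda c: patient[c])
    return label, patient

# Three slices of one scan, two classes (e.g. normal vs. abnormal).
label, probs = aggregate_slices([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]])
```

A plain average or a majority vote would be simpler alternatives; a confidence-weighted rule is shown because it lets ambiguous slices count for less, which is one plausible reading of "adaptive".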
Details
Motivation: Current OCT diagnosis depends on multi-stage workflows and single-slice, single-task AI models, which makes fully automated clinical use difficult.
Method: The FOCUS framework first assesses image quality with EfficientNetV2-S, then performs abnormality detection and multi-disease classification with a fine-tuned vision foundation model, and finally uses a unified adaptive aggregation method to fuse 2D slice predictions into a 3D patient-level diagnosis.
Result: Trained and tested on 3,300 patients (40,672 slices) and externally validated on 1,345 patients (18,498 slices), FOCUS reaches F1 scores of 99.01% for quality assessment, 97.46% for abnormality detection, and 94.39% for patient-level diagnosis; cross-center F1 stays within 90.22%-95.24%; in human-machine comparisons, F1 for abnormality detection and multi-disease diagnosis reaches 95.47% and 93.49%, on par with experts.
Conclusion: FOCUS fully automates the path from OCT image to diagnosis and provides a validated technical blueprint for unmanned ophthalmology and large-scale retinal disease screening.
Abstract: Optical coherence tomography (OCT) has revolutionized retinal disease diagnosis with its high-resolution and three-dimensional imaging nature, yet its full diagnostic automation in clinical practice remains constrained by multi-stage workflows and conventional single-slice single-task AI models. We present Full-process OCT-based Clinical Utility System (FOCUS), a foundation model-driven framework enabling end-to-end automation of 3D OCT retinal disease diagnosis. FOCUS sequentially performs image quality assessment with EfficientNetV2-S, followed by abnormality detection and multi-disease classification using a fine-tuned Vision Foundation Model. Crucially, FOCUS leverages a unified adaptive aggregation method to intelligently integrate 2D slice-level predictions into comprehensive 3D patient-level diagnosis. Trained and tested on 3,300 patients (40,672 slices), and externally validated on 1,345 patients (18,498 slices) across four different-tier centers and diverse OCT devices, FOCUS achieved high F1 scores for quality assessment (99.01%), abnormality detection (97.46%), and patient-level diagnosis (94.39%). Real-world validation across centers also showed stable performance (F1: 90.22%-95.24%). In human-machine comparisons, FOCUS matched expert performance in abnormality detection (F1: 95.47% vs 90.91%) and multi-disease diagnosis (F1: 93.49% vs 91.35%), while demonstrating better efficiency. FOCUS automates the image-to-diagnosis pipeline, representing a critical advance towards unmanned ophthalmology with a validated blueprint for autonomous screening to enhance population-scale retinal care accessibility and efficiency.
[133] PQTNet: Pixel-wise Quantitative Thermography Neural Network for Estimating Defect Depth in Polylactic Acid Parts by Additive Manufacturing
Lei Deng,Wenhao Huang,Chao Yang,Haoyuan Zheng,Yinbin Tian,Yue Ma
Main category: cs.CV
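TL;DR: This paper proposes a Pixel-wise Quantitative Thermography Neural Network (PQT-Net) for high-precision, non-destructive quantification of defect depth in additively manufactured PLA parts. With a novel data augmentation strategy and a custom regression head, it achieves a minimum mean absolute error of 0.0094 mm and a coefficient of determination above 99%.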
Details
Motivation: Quantifying defect depth in additively manufactured (AM) parts remains challenging for non-destructive testing (NDT), and a precise, robust quantitative method is needed.
Method: The Pixel-wise Quantitative Thermography Neural Network (PQT-Net): (1) reconstructs thermal sequence data into two-dimensional stripe images that preserve the temporal evolution of heat diffusion for each pixel; (2) uses a pre-trained EfficientNetV2-S as the backbone; and (3) adds a Residual Regression Head (RRH) with learnable parameters to refine the outputs.
Result: On PLA specimens, PQT-Net achieves a minimum MAE of 0.0094 mm and $R^2$ > 99%, outperforming other deep learning models.
Conclusion: PQT-Net is precise and robust and has the potential to become an effective tool for quantitative defect characterization in AM parts.
Abstract: Defect depth quantification in additively manufactured (AM) components remains a significant challenge for non-destructive testing (NDT). This study proposes a Pixel-wise Quantitative Thermography Neural Network (PQT-Net) to address this challenge for polylactic acid (PLA) parts. A key innovation is a novel data augmentation strategy that reconstructs thermal sequence data into two-dimensional stripe images, preserving the complete temporal evolution of heat diffusion for each pixel. The PQT-Net architecture incorporates a pre-trained EfficientNetV2-S backbone and a custom Residual Regression Head (RRH) with learnable parameters to refine outputs. Comparative experiments demonstrate the superiority of PQT-Net over other deep learning models, achieving a minimum Mean Absolute Error (MAE) of 0.0094 mm and a coefficient of determination ($R^2$) exceeding 99%. The high precision of PQT-Net underscores its potential for robust quantitative defect characterization in AM.
[134] Invisible Clean-Label Backdoor Attacks for Generative Data Augmentation
Ting Xiang,Jinhui Zhao,Changjian Chen,Zhuo Tang
Main category: cs.CV
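TL;DR: This paper proposes InvLBA, an invisible clean-label backdoor attack for generative data augmentation. By perturbing the latent feature space, it achieves a high attack success rate and high robustness without harming clean accuracy.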
Details
Motivation: Existing clean-label backdoor attacks based on pixel-level triggers (such as COMBAT) have low success rates on generated images, which calls for more effective attacks designed at the latent feature level.
Method: InvLBA implants a backdoor by injecting invisible perturbations into the latent space of the generative model; the authors also prove theoretical generalization guarantees for its clean accuracy and attack success rate.
Result: Experiments on multiple datasets show that InvLBA improves the attack success rate by 46.43% on average, barely reduces clean accuracy, and is highly robust against state-of-the-art defenses.
Conclusion: Latent-space perturbation is a more effective and stealthier route to clean-label backdoor attacks than pixel-level perturbation; InvLBA offers a new perspective on, and strong empirical evidence for, backdoor security risks in generative data augmentation.
Abstract: With the rapid advancement of image generative models, generative data augmentation has become an effective way to enrich training images, especially when only small-scale datasets are available. At the same time, in practical applications, generative data augmentation can be vulnerable to clean-label backdoor attacks, which aim to bypass human inspection. However, based on theoretical analysis and preliminary experiments, we observe that directly applying existing pixel-level clean-label backdoor attack methods (e.g., COMBAT) to generated images results in low attack success rates. This motivates us to move beyond pixel-level triggers and focus instead on the latent feature level. To this end, we propose InvLBA, an invisible clean-label backdoor attack method for generative data augmentation by latent perturbation. We theoretically prove that the generalization of the clean accuracy and attack success rates of InvLBA can be guaranteed. Experiments on multiple datasets show that our method improves the attack success rate by 46.43% on average, with almost no reduction in clean accuracy and high robustness against SOTA defense methods.
[135] MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning
Shengyuan Liu,Liuxin Bao,Qi Yang,Wanting Geng,Boyun Zheng,Chenxin Li,Wenting Chen,Houwen Peng,Yixuan Yuan
Main category: cs.CV
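TL;DR: This paper proposes MedSAM-Agent, a framework that recasts medical image segmentation as a multi-step autonomous decision-making process and improves interaction efficiency and segmentation performance through a hybrid prompting strategy and two-stage training.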
Details
Motivation: Existing medical image segmentation methods built on multi-modal large language models rely on rigid single-turn interaction and lack process-level supervision, so they cannot fully exploit the dynamic potential of interactive tools and produce redundant actions.
Method: Proposes MedSAM-Agent: (1) a hybrid prompting strategy generates expert trajectories so the model learns human-like decision heuristics and adaptive refinement; (2) a two-stage training pipeline combines multi-turn, end-to-end outcome verification with a clinical-fidelity process reward to encourage concise interaction and efficient decisions.
Result: Achieves SOTA performance across 6 medical modalities and 21 datasets, unifying autonomous medical reasoning with robust iterative optimization.
Conclusion: By modeling multi-step autonomous decisions and adding process-level supervision, MedSAM-Agent improves the generalization, efficiency, and clinical usefulness of interactive medical image segmentation.
Abstract: Medical image segmentation is evolving from task-specific models toward generalizable frameworks. Recent research leverages Multi-modal Large Language Models (MLLMs) as autonomous agents, employing reinforcement learning with verifiable reward (RLVR) to orchestrate specialized tools like the Segment Anything Model (SAM). However, these approaches often rely on single-turn, rigid interaction strategies and lack process-level supervision during training, which hinders their ability to fully exploit the dynamic potential of interactive tools and leads to redundant actions. To bridge this gap, we propose MedSAM-Agent, a framework that reformulates interactive segmentation as a multi-step autonomous decision-making process. First, we introduce a hybrid prompting strategy for expert-curated trajectory generation, enabling the model to internalize human-like decision heuristics and adaptive refinement strategies. Furthermore, we develop a two-stage training pipeline that integrates multi-turn, end-to-end outcome verification with a clinical-fidelity process reward design to promote interaction parsimony and decision efficiency. Extensive experiments across 6 medical modalities and 21 datasets demonstrate that MedSAM-Agent achieves state-of-the-art performance, effectively unifying autonomous medical reasoning with robust, iterative optimization. Code is available at https://github.com/CUHK-AIM-Group/MedSAM-Agent
[136] PWAVEP: Purifying Imperceptible Adversarial Perturbations in 3D Point Clouds via Spectral Graph Wavelets
Haoran Li,Renyang Liu,Hongjia Liu,Chen Wang,Long Yin,Jian Xu
Main category: cs.CV
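TL;DR: This paper proposes PWAVEP, a plug-and-play adversarial defense for 3D point clouds that needs no model modification or extra data. Based on spectral-domain analysis, it uses a graph wavelet transform to suppress high-frequency adversarial noise and markedly improves robustness against spatially imperceptible perturbations.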
Details
Motivation: Existing adversarial defenses for 3D point clouds usually require invasive model modifications, expensive training, or extra data, which limits practical use, while attacks keep improving in spatial imperceptibility and attack performance. A lightweight, non-invasive defense is needed.
Method: Proposes PWAVEP, which first computes a saliency score and a local sparsity score for each point in the spectral graph wavelet domain, then applies a hierarchical strategy: the most salient points (hard-to-recover adversarial outliers) are removed, and moderately salient points are spectrally filtered by attenuating their high-frequency wavelet coefficients to suppress adversarial noise.
Result: Across several benchmark datasets and attack methods, PWAVEP achieves better classification accuracy and robustness than existing approaches, advancing the SOTA in 3D point cloud purification.
Conclusion: Spectral-domain analysis offers a new way to understand and suppress adversarial perturbations in point clouds; as a non-invasive, plug-and-play purification framework, PWAVEP balances defense effectiveness, computational cost, and deployment flexibility.
Abstract: Recent progress in adversarial attacks on 3D point clouds, particularly in achieving spatial imperceptibility and high attack performance, presents significant challenges for defenders. Current defensive approaches remain cumbersome, often requiring invasive model modifications, expensive training procedures or auxiliary data access. To address these threats, in this paper, we propose a plug-and-play and non-invasive defense mechanism in the spectral domain, grounded in a theoretical and empirical analysis of the relationship between imperceptible perturbations and high-frequency spectral components. Building upon these insights, we introduce a novel purification framework, termed PWAVEP, which begins by computing a spectral graph wavelet domain saliency score and local sparsity score for each point. Guided by these values, PWAVEP adopts a hierarchical strategy: it eliminates the most salient points, which are identified as hardly recoverable adversarial outliers. Simultaneously, it applies a spectral filtering process to a broader set of moderately salient points. This process leverages a graph wavelet transform to attenuate high-frequency coefficients associated with the targeted points, thereby effectively suppressing adversarial noise. Extensive evaluations demonstrate that the proposed PWAVEP achieves superior accuracy and robustness compared to existing approaches, advancing the state-of-the-art in 3D point cloud purification. Code and datasets are available at https://github.com/a772316182/pwavep
[137] Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability
Bingchen Zhao,Qiushan Guo,Ye Wang,Yixuan Huang,Zhonghua Zhai,Yu Tian
Main category: cs.CV
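TL;DR: This paper proposes CompTok, a framework for training visual tokenizers with stronger compositionality. It combines a diffusion decoder with an information-maximization objective and uses an adversarial flow regularizer to keep token-swap generations realistic, enabling high-quality image generation and high-level semantic editing.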
Details
Motivation: Existing visual tokenizers offer limited compositional control and struggle to support high-level semantic editing (such as swapping tokens across images), so a training framework is needed that models compositional structure explicitly while keeping generations realistic.
Method: CompTok uses a token-conditioned diffusion decoder with an InfoGAN-style mutual-information objective (a recognition model is trained to predict the input tokens from the reconstructed image); it adds token-subset swapping as data augmentation and constrains swapped generations to the natural-image manifold with an adversarial flow regularizer; it also proposes two new metrics for the compositionality and learnability of the token space.
Result: CompTok reaches SOTA performance on class-conditioned image generation, supports cross-image token swapping for high-level semantic editing, beats baselines on both proposed token-space metrics, and is compatible with and improves several SOTA generators.
Conclusion: By jointly optimizing the compositionality of the token representation, the decoder's sensitivity to tokens, and the realism of the generated distribution, CompTok yields a visual tokenizer with both strong generation and controllable editing.
Abstract: We introduce CompTok, a training framework for learning visual tokenizers whose tokens are enhanced for compositionality. CompTok uses a token-conditioned diffusion decoder. By employing an InfoGAN-style objective, where we train a recognition model to predict the tokens used to condition the diffusion decoder using the decoded images, we force the decoder not to ignore any of the tokens. To promote compositional control, besides the original images, CompTok also trains on tokens formed by swapping token subsets between images, enabling more compositional control of the tokens over the decoder. As the swapped tokens between images do not have ground truth image targets, we apply a manifold constraint via an adversarial flow regularizer to keep unpaired swap generations on the natural-image distribution. The resulting tokenizer not only achieves state-of-the-art performance on image class-conditioned generation, but also demonstrates properties such as swapping tokens between images to achieve high-level semantic editing of an image. Additionally, we propose two metrics that measure the landscape of the token space and that describe not only the compositionality of the tokens, but also how easy the landscape is to learn for a generator trained on this space. We show in experiments that CompTok improves on both metrics and supports state-of-the-art generators for class-conditioned generation.
[138] Tiled Prompts: Overcoming Prompt Underspecification in Image and Video Super-Resolution
Bryan Sangwoo Kim,Jonghyun Park,Jong Chul Ye
Main category: cs.CV
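TL;DR: This paper proposes Tiled Prompts, a framework that generates a specific text prompt for each latent tile to address the prompt underspecification that a single global prompt causes in high-resolution image and video super-resolution, improving perceptual quality and text alignment while reducing hallucinations and tiling artifacts.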
Details
Motivation: Modern super-resolution pipelines rely on latent tiling to scale to high resolutions, but a single global text prompt tends to miss local details (prompt sparsity) and to give locally irrelevant guidance (prompt misguidance), effects that classifier-free guidance can amplify.
Method: Proposes Tiled Prompts, a unified framework that generates a text prompt for each latent tile and performs super-resolution under locally text-conditioned posteriors.
Result: Experiments on high-resolution real-world images and videos show consistent gains in perceptual quality and text alignment, with fewer hallucinations and tile-level artifacts.
Conclusion: Tiled Prompts mitigates prompt underspecification for text-conditioned diffusion models in high-resolution super-resolution, improving performance at low overhead.
Abstract: Text-conditioned diffusion models have advanced image and video super-resolution by using prompts as semantic priors, but modern super-resolution pipelines typically rely on latent tiling to scale to high resolutions, where a single global caption causes prompt underspecification. A coarse global prompt often misses localized details (prompt sparsity) and provides locally irrelevant guidance (prompt misguidance) that can be amplified by classifier-free guidance. We propose Tiled Prompts, a unified framework for image and video super-resolution that generates a tile-specific prompt for each latent tile and performs super-resolution under locally text-conditioned posteriors, providing high-information guidance that resolves prompt underspecification with minimal overhead. Experiments on high-resolution real-world images and videos show consistent gains in perceptual quality and text alignment, while reducing hallucinations and tile-level artifacts relative to global-prompt baselines.
[139] Z3D: Zero-Shot 3D Visual Grounding from Images
Nikita Drozdov,Andrey Lemeshko,Nikita Gavrilov,Anton Konushin,Danila Rukhovich,Maksim Kolodiazhnyi
Main category: cs.CV
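TL;DR: This paper proposes Z3D, which performs zero-shot 3D visual grounding (3DVG) from multi-view images alone, with no geometric supervision or object priors. Using high-quality 3D instance segmentation and prompt-based segmentation reasoning, it reaches zero-shot SOTA on ScanRefer and Nr3D.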
Details
Motivation: Existing zero-shot 3D visual grounding methods hit performance bottlenecks and depend on geometric supervision or object priors, which limits their generality and practicality.
Method: Proposes Z3D, a universal grounding pipeline that combines a state-of-the-art zero-shot 3D instance segmentation method, used to generate high-quality 3D bounding box proposals, with prompt-based segmentation reasoning driven by modern vision-language models (VLMs); it takes multi-view images as input and can optionally use camera poses and depth maps.
Result: On the ScanRefer and Nr3D benchmarks, Z3D achieves the best performance among zero-shot methods.
Conclusion: High-performing zero-shot 3D visual grounding is achievable from multi-view images alone; Z3D offers a new approach to open-scene 3D understanding without geometric supervision.
Abstract: 3D visual grounding (3DVG) aims to localize objects in a 3D scene based on natural language queries. In this work, we explore zero-shot 3DVG from multi-view images alone, without requiring any geometric supervision or object priors. We introduce Z3D, a universal grounding pipeline that flexibly operates on multi-view images while optionally incorporating camera poses and depth maps. We identify key bottlenecks in prior zero-shot methods causing significant performance degradation and address them with (i) a state-of-the-art zero-shot 3D instance segmentation method to generate high-quality 3D bounding box proposals and (ii) advanced reasoning via prompt-based segmentation, which utilizes the full capabilities of modern VLMs. Extensive experiments on the ScanRefer and Nr3D benchmarks demonstrate that our approach achieves state-of-the-art performance among zero-shot methods. Code is available at https://github.com/col14m/z3d
[140] Symbol-Aware Reasoning with Masked Discrete Diffusion for Handwritten Mathematical Expression Recognition
Takaya Kawakatsu,Ryo Ishiyama
Main category: cs.CV
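TL;DR: This paper proposes a discrete diffusion framework that recasts handwritten mathematical expression recognition (HMER) as iterative symbolic refinement. Multi-step remasking, symbol-aware tokenization, and Random-Masking Mutual Learning improve structural consistency and robustness, and the method outperforms strong Transformer and commercial baselines on several benchmarks.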
Details
Motivation: Autoregressive models for HMER suffer from exposure bias and syntactic inconsistency and struggle to model diverse symbols and 2D structural layouts.
Method: Proposes a discrete diffusion framework that uses multi-step remasking to refine symbols and structural relations iteratively; symbol-aware tokenization and Random-Masking Mutual Learning improve syntactic alignment and robustness to handwriting diversity.
Result: On the MathWriting benchmark the method reaches 5.56% CER and 60.42% EM, outperforming Transformer and commercial baselines; it also shows consistent gains on CROHME 2014-2023.
Conclusion: Discrete diffusion offers a new paradigm for structure-aware visual recognition, with value that extends beyond generative modeling to understanding tasks.
Abstract: Handwritten Mathematical Expression Recognition (HMER) requires reasoning over diverse symbols and 2D structural layouts, yet autoregressive models struggle with exposure bias and syntactic inconsistency. We present a discrete diffusion framework that reformulates HMER as iterative symbolic refinement instead of sequential generation. Through multi-step remasking, the proposal progressively refines both symbols and structural relations, removing causal dependencies and improving structural consistency. A symbol-aware tokenization and Random-Masking Mutual Learning further enhance syntactic alignment and robustness to handwriting diversity. On the MathWriting benchmark, the proposal achieves 5.56% CER and 60.42% EM, outperforming strong Transformer and commercial baselines. Consistent gains on CROHME 2014-2023 demonstrate that discrete diffusion provides a new paradigm for structure-aware visual recognition beyond generative modeling.
[141] Multi-Resolution Alignment for Voxel Sparsity in Camera-Based 3D Semantic Scene Completion
Zhiwen Yang,Yuxin Peng
Main category: cs.CV
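TL;DR: This paper proposes Multi-Resolution Alignment (MRA), which mitigates voxel sparsity in camera-based 3D semantic scene completion by aligning multi-resolution 3D features at the scene level and the instance level.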
Details
Motivation: Existing image-based 3D semantic scene completion methods are supervised only by voxel labels and suffer from voxel sparsity in autonomous driving scenes (most voxels are empty), which limits both optimization efficiency and performance.
Method: Proposes a Multi-resolution View Transformer module for scene-level alignment of multi-resolution 3D features; a Cubic Semantic Anisotropy module that identifies the semantic significance of each voxel; and a Critical Distribution Alignment module that uses semantic anisotropy to select critical voxels and applies a circulated loss enforcing consistency of feature distributions across resolutions.
Result: The method improves 3D semantic scene completion on several benchmark datasets, especially in sparse regions, and the code is open source.
Conclusion: By adding multi-granularity (scene-level and instance-level), multi-resolution auxiliary supervision, MRA mitigates voxel sparsity and gives a more robust and efficient solution for image-based 3D scene understanding.
Abstract: Camera-based 3D semantic scene completion (SSC) offers a cost-effective solution for assessing the geometric occupancy and semantic labels of each voxel in the surrounding 3D scene with image inputs, providing a voxel-level scene perception foundation for perception-prediction-planning autonomous driving systems. Although existing methods have made significant progress, their optimization relies solely on supervision from voxel labels and faces the challenge of voxel sparsity, as a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a Multi-Resolution Alignment (MRA) approach to mitigate voxel sparsity in camera-based 3D semantic scene completion, which exploits scene-level and instance-level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level through fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel, accounting for the semantic differences of a specific voxel against its neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors with the guidance of cubic semantic anisotropy, and applies a circulated loss for auxiliary supervision on the critical feature distribution consistency across different resolutions. The code is available at https://github.com/PKU-ICST-MIPL/MRA_TIP
[142] SLIM-Diff: Shared Latent Image-Mask Diffusion with Lp loss for Data-Scarce Epilepsy FLAIR MRI
Mario Pascual-González,Ariadna Jiménez-Partinen,R. M. Luque-Baena,Fátima Nagib-Raya,Ezequiel López-Rubio
Main category: cs.CV
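TL;DR: This paper proposes SLIM-Diff, a compact joint diffusion model that generates FLAIR MRI images of FCD lesions together with the matching lesion masks, using a shared-bottleneck U-Net and a tunable $L_p$ loss to improve stability and the consistency between anatomy and lesion geometry.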
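The tunable $L_p$ objective mentioned in this entry can be illustrated with a minimal sketch. It assumes the loss is simply the mean of $|\hat{x} - x|^p$ over elements; the function name `lp_loss` and the flat-list inputs are illustrative choices, not taken from the SLIM-Diff code.

```python
def lp_loss(pred, target, p=1.5):
    """Mean of |pred - target|**p over paired elements.

    Assumption: this is a generic L_p penalty written for illustration;
    the paper's exact reduction and weighting are not stated in the abstract.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(abs(a - b) ** p for a, b in zip(pred, target)) / len(pred)


# Residuals of 1 and 2: a smaller p penalizes the larger residual less sharply.
residual_pred, residual_target = [0.0, 0.0], [1.0, 2.0]
loss_l1 = lp_loss(residual_pred, residual_target, p=1.0)
loss_l15 = lp_loss(residual_pred, residual_target, p=1.5)
loss_l2 = lp_loss(residual_pred, residual_target, p=2.0)
```

With $p = 2$ this reduces to the usual mean squared error, and values between 1 and 2 sit between absolute and squared error in how strongly they weight large residuals.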
Details
Motivation: FCD lesions are subtle and rare in FLAIR MRI, so joint image-mask generative models tend to be unstable and to overfit.
Method: Proposes SLIM-Diff: a single shared-bottleneck U-Net processes a 2-channel (image + mask) input; $\epsilon$-prediction is compared with $x_0$-prediction; a tunable $L_p$ loss (e.g., $L_{1.5}$ and $L_2$) is studied for its effect on image fidelity and on mask morphology.
Result: $x_0$-prediction performs best for joint synthesis; the $L_{1.5}$ loss improves image quality, while the $L_2$ loss better preserves lesion mask morphology; the model is stable and has few parameters.
Conclusion: In joint modeling, the prediction target ($x_0$ vs. $\epsilon$) and the loss geometry (choice of $L_p$ norm) affect image and mask quality in separable ways; SLIM-Diff offers an efficient, stable approach to generating rare lesions from small samples.
Abstract: Focal cortical dysplasia (FCD) lesions in epilepsy FLAIR MRI are subtle and scarce, making joint image-mask generative modeling prone to instability and memorization. We propose SLIM-Diff, a compact joint diffusion model whose main contributions are (i) a single shared-bottleneck U-Net that enforces tight coupling between anatomy and lesion geometry from a 2-channel image+mask representation, and (ii) loss-geometry tuning via a tunable $L_p$ objective. As an internal baseline, we include the canonical DDPM-style objective ($\epsilon$-prediction with $L_2$ loss) and isolate the effect of prediction parameterization and $L_p$ geometry under a matched setup. Experiments show that $x_0$-prediction is consistently the strongest choice for joint synthesis, and that fractional sub-quadratic penalties ($L_{1.5}$) improve image fidelity while $L_2$ better preserves lesion mask morphology. Our code and model weights are available at https://github.com/MarioPasc/slim-diff
[143] Unifying Watermarking via Dimension-Aware Mapping
Jiale Meng,Runyi Hu,Jie Zhang,Zheming Lu,Ivor Tsang,Tianwei Zhang
Main category: cs.CV
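TL;DR: This paper proposes DiM, a multi-dimensional watermarking framework that models watermarks as payloads of different dimensionalities (1D binary messages, 2D spatial masks, 3D spatiotemporal structures) and unifies existing methods through the dimensional configuration of embedding and extraction. In the video domain it supports tamper localization, local control, and temporal recovery.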
Details
Motivation: Existing deep watermarking methods share similar architectures but behave very differently, and there is no unifying view at the functional level; a framework is needed that explains and unifies these behaviors.
Method: Proposes DiM (Dimension-aware Mapping), which models watermarks as payloads of different dimensionalities (1D/2D/3D), defines same-dimensional and cross-dimensional mappings between embedding and extraction, and instantiates several such mappings in the video domain using spatiotemporal representations.
Result: Changing only the embedding and extraction dimensions, with no change to the network architecture, yields different watermarking capabilities: spatiotemporal tamper localization, local embedding control, and recovery of temporal order under frame disruptions.
Conclusion: Watermarking behavior is determined largely by the dimensional configuration of embedding and extraction; DiM unifies watermarking methods from this dimensional view and suggests a way to design reusable, multi-purpose watermarking systems.
Abstract: Deep watermarking methods often share similar encoder-decoder architectures, yet differ substantially in their functional behaviors. We propose DiM, a new multi-dimensional watermarking framework that formulates watermarking as a dimension-aware mapping problem, thereby unifying existing watermarking methods at the functional level. Under DiM, watermark information is modeled as payloads of different dimensionalities, including one-dimensional binary messages, two-dimensional spatial masks, and three-dimensional spatiotemporal structures. We find that the dimensional configuration of embedding and extraction largely determines the resulting watermarking behavior. Same-dimensional mappings preserve payload structure and support fine-grained control, while cross-dimensional mappings enable spatial or spatiotemporal localization. We instantiate DiM in the video domain, where spatiotemporal representations enable a broader set of dimension mappings. Experiments demonstrate that varying only the embedding and extraction dimensions, without architectural changes, leads to different watermarking capabilities, including spatiotemporal tamper localization, local embedding control, and recovery of temporal order under frame disruptions.
[144] Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization
Hao Fang,Jinyu Li,Jiawei Kong,Tianqu Zhuang,Kuofeng Gao,Bin Chen,Shu-Tao Xia,Yaowei Wang
Main category: cs.CV
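TL;DR: This paper proposes C3PO, a framework that reduces hallucination in multimodal reasoning models (MLRMs) through chain-of-thought compression and contrastive preference optimization.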
Details
Motivation: Multimodal reasoning models are capable but prone to hallucination, and effective remedies are still lacking.
Method: Proposes C3PO, which combines Chain-of-Thought Compression (selectively filtering redundant reasoning tokens to make the CoT more signal-efficient with respect to visual cues) and Contrastive Preference Optimization (building training pairs from high-quality AI feedback, with a multimodal hallucination-inducing mechanism that generates negative samples for contrastive correction).
Result: Hallucination is reduced consistently across a range of MLRMs and benchmarks, and a theoretical justification of effectiveness is provided.
Conclusion: C3PO is an effective training-based mitigation method that strengthens reliance on visual input, improves the quality of reasoning traces, and suppresses hallucination through contrastive learning.
Abstract: While multimodal reasoning models (MLRMs) have exhibited impressive capabilities, they remain prone to hallucinations, and effective solutions are still underexplored. In this paper, we experimentally analyze the hallucination cause and propose C3PO, a training-based mitigation framework comprising Chain-of-Thought Compression and Contrastive Preference Optimization. Firstly, we identify that introducing reasoning mechanisms exacerbates models' reliance on language priors while overlooking visual inputs, which can produce CoTs with reduced visual cues but redundant text tokens. To this end, we propose to selectively filter redundant thinking tokens for a more compact and signal-efficient CoT representation that preserves task-relevant information while suppressing noise. In addition, we observe that the quality of the reasoning trace largely determines whether hallucination emerges in subsequent responses. To leverage this insight, we introduce a reasoning-enhanced preference tuning scheme that constructs training pairs using high-quality AI feedback. We further design a multimodal hallucination-inducing mechanism that elicits models' inherent hallucination patterns via carefully crafted inducers, yielding informative negative signals for contrastive correction. We provide theoretical justification for the effectiveness and demonstrate consistent hallucination reduction across diverse MLRMs and benchmarks.
[145] From Vicious to Virtuous Cycles: Synergistic Representation Learning for Unsupervised Video Object-Centric Learning
Hyun Seok Seong,WonJun Moon,Jae-Pil Heo
Main category: cs.CV
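TL;DR: This paper proposes Synergistic Representation Learning (SRL), which sets up a virtuous cycle between encoder and decoder to resolve the conflict, created by reconstruction-based training in unsupervised object-centric learning, between the encoder's sharp attention maps and the decoder's blurry reconstruction maps, substantially improving video object-centric learning.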
Details
Motivation: In existing reconstruction-based unsupervised object-centric models, the sharp, high-frequency attention maps of the encoder conflict fundamentally with the spatially consistent but blurry reconstruction maps of the decoder, trapping training in a vicious cycle.
Method: Proposes Synergistic Representation Learning (SRL), which uses the encoder's sharpness to deblur semantic boundaries in the decoder output and the decoder's spatial consistency to denoise the encoder's features; a warm-up phase with a slot regularization objective stabilizes training.
Result: Achieves state-of-the-art results on video object-centric learning benchmarks.
Conclusion: By bridging the representational gap between encoder and decoder, SRL breaks the vicious cycle and shows that jointly refining encoder and decoder representations is feasible and effective.
Abstract: Unsupervised object-centric learning models, particularly slot-based architectures, have shown great promise in decomposing complex scenes. However, their reliance on reconstruction-based training creates a fundamental conflict between the sharp, high-frequency attention maps of the encoder and the spatially consistent but blurry reconstruction maps of the decoder. We identify that this discrepancy gives rise to a vicious cycle: the noisy feature map from the encoder forces the decoder to average over possibilities and produce even blurrier outputs, while the gradient computed from blurry reconstruction maps lacks the high-frequency details necessary to supervise encoder features. To break this cycle, we introduce Synergistic Representation Learning (SRL), which establishes a virtuous cycle where the encoder and decoder mutually refine one another. SRL leverages the encoder's sharpness to deblur the semantic boundary within the decoder output, while exploiting the decoder's spatial consistency to denoise the encoder's features. This mutual refinement process is stabilized by a warm-up phase with a slot regularization objective that initially allocates distinct entities per slot. By bridging the representational gap between the encoder and decoder, SRL achieves state-of-the-art results on video object-centric learning benchmarks. Code is available at https://github.com/hynnsk/SRL
[146] UnHype: CLIP-Guided Hypernetworks for Dynamic LoRA Unlearning
Piotr Wójcik,Maksym Petrenko,Wojciech Gromski,Przemysław Spurek,Maciej Zieba
Main category: cs.CV
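TL;DR: This paper proposes UnHype, a framework that augments LoRA with hypernetworks for more semantics-aware and scalable concept unlearning in diffusion models; it supports single- and multi-concept erasure and is compatible with Stable Diffusion and other mainstream text-to-image models.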
Details
Motivation: Existing LoRA-based machine unlearning methods adapt poorly to concept semantics, struggle to separate closely related concepts, and do not scale well when several concepts must be erased at once, so they cannot easily combine precise unlearning with preserved generalization.
Method: Proposes UnHype, which brings a hypernetwork into LoRA training so that context-aware LoRA weights are generated dynamically from the CLIP text embedding; it supports single-concept and joint multi-concept unlearning and plugs directly into Stable Diffusion and modern flow-based text-to-image models.
Result: UnHype is shown to be effective and robust on object erasure, celebrity erasure, and explicit content removal, with better semantic control and multi-concept scalability than existing LoRA unlearning methods.
Conclusion: Dynamic, hypernetwork-generated LoRA weights improve the semantic precision, context awareness, and scalability of concept unlearning in diffusion models, supporting safer and more controllable generative AI.
Abstract: Recent advances in large-scale diffusion models have intensified concerns about their potential misuse, particularly in generating realistic yet harmful or socially disruptive content. This challenge has spurred growing interest in effective machine unlearning, the process of selectively removing specific knowledge or concepts from a model without compromising its overall generative capabilities. Among various approaches, Low-Rank Adaptation (LoRA) has emerged as an effective and efficient method for fine-tuning models toward targeted unlearning. However, LoRA-based methods often exhibit limited adaptability to concept semantics and struggle to balance removing closely related concepts with maintaining generalization across broader meanings. Moreover, these methods face scalability challenges when multiple concepts must be erased simultaneously. To address these limitations, we introduce UnHype, a framework that incorporates hypernetworks into single- and multi-concept LoRA training. The proposed architecture can be directly plugged into Stable Diffusion as well as modern flow-based text-to-image models, where it demonstrates stable training behavior and effective concept control. During inference, the hypernetwork dynamically generates adaptive LoRA weights based on the CLIP embedding, enabling more context-aware, scalable unlearning. We evaluate UnHype across several challenging tasks, including object erasure, celebrity erasure, and explicit content removal, demonstrating its effectiveness and versatility. Repository: https://github.com/gmum/UnHype
[147] Socratic-Geo: Synthetic Data Generation and Geometric Reasoning via Multi-Agent Interaction
Zhengbo Jiao,Shaobo Wang,Zifan Zhang,Wei Wang,Bing Zhao,Hu Wei,Linfeng Zhang
Main category: cs.CV
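TL;DR: This paper proposes Socratic-Geo, a framework in which three cooperating agents (Teacher, Solver, and Generator) autonomously synthesize geometric reasoning data and optimize the models jointly, reducing data dependence while substantially improving the geometric reasoning and image generation of multimodal models.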
Details
Motivation: Existing MLLMs perform poorly on geometric reasoning, mainly because high-quality image-text pairs are extremely scarce; human annotation is expensive, automated methods struggle to ensure both fidelity and training effectiveness, and existing data synthesis is not coupled to the model's learning dynamics.
Method: Proposes the Socratic-Geo multi-agent framework: the Teacher agent generates parameterized Python drawing scripts with reflective feedback (Reflect/RePI) to ensure the quality of image-text pairs; the Solver agent improves its reasoning through preference learning and feeds failure paths back to the Teacher to guide targeted augmentation; the Generator agent learns image generation from the accumulated "image-code-instruction" triplets, distilling programmatic drawing ability.
Result: Using one-quarter of the baseline data, Socratic-Solver scores 49.11 across six geometric reasoning benchmarks, 2.43 points above strong baselines; Socratic-Generator reaches 42.4% on GenExam, a new best for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).
Conclusion: A multi-agent paradigm that dynamically couples data synthesis with model learning can ease the data bottleneck in geometric reasoning and offers a path toward self-improving multimodal models.
Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced vision-language understanding. However, even state-of-the-art models struggle with geometric reasoning, revealing a critical bottleneck: the extreme scarcity of high-quality image-text pairs. Human annotation is prohibitively expensive, while automated methods fail to ensure fidelity and training effectiveness. Existing approaches either passively adapt to available images or employ inefficient random exploration with filtering, decoupling generation from learning needs. We propose Socratic-Geo, a fully autonomous framework that dynamically couples data synthesis with model learning through multi-agent interaction. The Teacher agent generates parameterized Python scripts with reflective feedback (Reflect for solvability, RePI for visual validity), ensuring image-text pair purity. The Solver agent optimizes reasoning through preference learning, with failure paths guiding the Teacher's targeted augmentation. Independently, the Generator learns image generation capabilities on accumulated "image-code-instruction" triplets, distilling programmatic drawing intelligence into visual generation. Starting from only 108 seed problems, Socratic-Solver achieves 49.11 on six benchmarks using one-quarter of baseline data, surpassing strong baselines by 2.43 points. Socratic-Generator achieves 42.4% on GenExam, establishing a new state of the art for open-source models, surpassing Seedream-4.0 (39.8%) and approaching Gemini-2.5-Flash-Image (43.1%).
[148] ConsistentRFT: Reducing Visual Hallucinations in Flow-based Reinforcement Fine-Tuning
Xiaofeng Tan,Jun Liu,Yuanting Fan,Bin-Bin Gao,Xi Jiang,Xiaochen Chen,Jinlong Peng,Chengjie Wang,Hongsong Wang,Feng Zheng
Main category: cs.CV
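TL;DR: This paper proposes ConsistentRFT, which combines Dynamic Granularity Rollout (DGR) with Consistent Policy Gradient Optimization (CPGO) to mitigate the visual hallucinations that limited exploration and trajectory imitation cause in reinforcement fine-tuning (RFT) of flow-based models, markedly reducing low- and high-level hallucinations and improving out-of-domain generalization.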
Details
Motivation: Reinforcement fine-tuning (RFT) of flow-based models tends to introduce visual hallucinations (such as over-optimized details and semantic misalignment), but the causes are unclear; a systematic analysis and an effective remedy are needed.
Method: Attributes hallucinations to two aspects, exploration and exploitation: (1) limited exploration during SDE rollouts over-emphasizes local details at the expense of global semantics; (2) trajectory imitation in policy gradient methods distorts the model's foundational vector field and its cross-step consistency. ConsistentRFT addresses these with Dynamic Granularity Rollout (DGR), which balances exploration across noise scales, and Consistent Policy Gradient Optimization (CPGO), which aligns the policy with a more stable prior to preserve consistency.
Result: ConsistentRFT reduces low-level and high-level perceptual hallucinations by 49% and 38% on average; on out-of-domain metrics it improves over FLUX1.dev by 5.1% (vs. -0.4% for the baseline).
Conclusion: Visual hallucinations stem from limited exploration and mismatched trajectory imitation in RFT; by combining DGR and CPGO, ConsistentRFT offers a more robust and generalizable RFT scheme for preference alignment of flow-based generative models.
Abstract: Reinforcement Fine-Tuning (RFT) on flow-based models is crucial for preference alignment. However, such methods often introduce visual hallucinations like over-optimized details and semantic misalignment. This work preliminarily explores why visual hallucinations arise and how to reduce them. We first investigate RFT methods from a unified perspective, and reveal the core problems stemming from two aspects, exploration and exploitation: (1) limited exploration during stochastic differential equation (SDE) rollouts, leading to an over-emphasis on local details at the expense of global semantics, and (2) the trajectory imitation process inherent in policy gradient methods, distorting the model's foundational vector field and its cross-step consistency. Building on this, we propose ConsistentRFT, a general framework to mitigate these hallucinations. Specifically, we design a Dynamic Granularity Rollout (DGR) mechanism to balance exploration between global semantics and local details by dynamically scheduling different noise sources. We then introduce a Consistent Policy Gradient Optimization (CPGO) that preserves the model's consistency by aligning the current policy with a more stable prior. Extensive experiments demonstrate that ConsistentRFT significantly mitigates visual hallucinations, achieving average reductions of 49% for low-level and 38% for high-level perceptual hallucinations. Furthermore, ConsistentRFT outperforms other RFT methods on out-of-domain metrics, showing an improvement of 5.1% over FLUX1.dev (vs. -0.4% for the baseline). Project page: https://xiaofeng-tan.github.io/projects/ConsistentRFT
[149] Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation
Yijia Xu,Zihao Wang,Jinshi Cui
Main category: cs.CV
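TL;DR: This paper proposes Hierarchical Concept-to-Appearance Guidance (CAG), a framework that uses VAE dropout training and a correspondence-aware masked attention mechanism to improve identity consistency and compositional control in multi-subject image generation.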
Details
Motivation: Existing methods for multi-subject image generation suffer from identity inconsistency and limited compositional control, mainly because they rely on the diffusion model to associate text prompts with reference images implicitly.
Method: Proposes Hierarchical Concept-to-Appearance Guidance (CAG): (1) at the conceptual level, a VAE dropout training strategy makes the model rely more robustly on semantic signals from the VLM; (2) at the appearance level, a correspondence-aware masked attention module, embedded in the Diffusion Transformer, binds each text token to its matched reference regions.
Result: Achieves SOTA performance on multi-subject image generation, with substantially better prompt following and subject consistency.
Conclusion: By providing explicit, structured supervision from high-level concepts to fine-grained appearance, CAG addresses identity consistency and compositional controllability in multi-subject generation.
Abstract: Multi-subject image generation aims to synthesize images that faithfully preserve the identities of multiple reference subjects while following textual instructions. However, existing methods often suffer from identity inconsistency and limited compositional control, as they rely on diffusion models to implicitly associate text prompts with reference images. In this work, we propose Hierarchical Concept-to-Appearance Guidance (CAG), a framework that provides explicit, structured supervision from high-level concepts to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features, encouraging the model to rely more on robust semantic signals from a Visual Language Model (VLM) and thereby promoting consistent concept-level generation in the absence of complete appearance cues. At the appearance level, we integrate the VLM-derived correspondences into a correspondence-aware masked attention module within the Diffusion Transformer (DiT). This module restricts each text token to attend only to its matched reference regions, ensuring precise attribute binding and reliable multi-subject composition. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multi-subject image generation, substantially improving prompt following and subject consistency.
[150] Contextualized Visual Personalization in Vision-Language Models
Yeongtak Oh,Sangwon Yu,Junsung Park,Han Cheol Moon,Jisoo Mok,Sungroh Yoon
Main category: cs.CV
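TL;DR: This paper proposes CoViP, a framework that tackles contextualized visual personalization in vision-language models through personalized image captioning, using reinforcement-learning-based post-training and caption-augmented generation, and introduces diagnostic evaluations that check whether a model truly uses visual context.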
Details
Motivation: Existing vision-language models cannot associate visual inputs with a user's accumulated visual-textual context, so they struggle to produce responses grounded in the user's specific experiences.
Method: Proposes CoViP, a unified framework that treats personalized image captioning as the core task, improves it with reinforcement-learning-based post-training and caption-augmented generation, and adds diagnostic evaluations that rule out textual shortcuts.
Result: Experiments show that existing open-source and proprietary VLMs have substantial limitations on this task, while CoViP improves personalized image captioning and also yields overall gains on downstream personalization tasks.
Conclusion: CoViP is an important step toward robust and generalizable contextualized visual personalization.
Abstract: Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses based on the user's specific experiences, as they lack the ability to associate visual inputs with a user's accumulated visual-textual context. We newly formalize this challenge as contextualized visual personalization, which requires the visual recognition and textual retrieval of personalized visual experiences by VLMs when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.
[151] Inlier-Centric Post-Training Quantization for Object Detection Models
Minsu Kim,Dongyeun Lee,Jaemyung Yu,Jiwan Hur,Giseop Kim,Junmo Kim
Main category: cs.CV
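TL;DR: This paper proposes InlierQ, an inlier-centric post-training quantization method that uses gradient-aware volume saliency scores and the EM algorithm to separate and suppress anomalous activations, improving the quantization accuracy of object detection models without labels.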
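As a rough illustration of fitting a distribution over saliency scores with EM, the sketch below fits a two-component 1D Gaussian mixture and returns a posterior per score. The two-Gaussian form, the initialization, and the name `fit_two_gaussians` are assumptions made for this example; the entry does not specify InlierQ's actual mixture model or which component corresponds to inliers.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_gaussians(scores, iters=50, min_var=1e-6):
    """EM for a two-component 1D Gaussian mixture (illustrative assumption).

    Returns (means, variances, weights, posteriors), where posteriors[i][k]
    is the responsibility of component k for scores[i] from the final E-step.
    """
    lo, hi = min(scores), max(scores)
    means = [lo, hi]  # start the components at the extremes
    spread = max((hi - lo) ** 2 / 4.0, min_var)
    variances = [spread, spread]
    weights = [0.5, 0.5]
    posteriors = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each score.
        posteriors = []
        for x in scores:
            joint = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint) or 1e-300  # guard against underflow to zero
            posteriors.append([j / total for j in joint])
        # M-step: re-estimate each component from its weighted scores.
        for k in range(2):
            nk = sum(r[k] for r in posteriors)
            if nk < 1e-12:
                continue
            means[k] = sum(r[k] * x for r, x in zip(posteriors, scores)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(posteriors, scores)) / nk
            variances[k] = max(var_k, min_var)
            weights[k] = nk / len(scores)
    return means, variances, weights, posteriors


# Hypothetical saliency scores forming a low group and a high group.
demo_scores = [0.10, 0.20, 0.15, 0.12, 5.0, 5.2, 4.9]
demo_means, demo_vars, demo_weights, demo_post = fit_two_gaussians(demo_scores)
```

Thresholding the posterior of one component (for instance at 0.5) then gives a label-free split of the scores into two groups.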
Details
Motivation: Object detection is computationally expensive, and quantization is a common compression technique; however, task-irrelevant morphologies such as background clutter and sensor noise trigger redundant or anomalous activations that widen activation ranges and skew their distributions, complicating bit allocation and weakening the preservation of key features.
Method: InlierQ computes gradient-aware volume saliency scores, classifies each volume as an inlier or an anomaly, and fits a posterior distribution over the scores with the EM algorithm, suppressing anomalous activations while preserving informative features; it is label-free, drop-in, and needs only 64 calibration samples.
Result: On COCO and nuScenes, InlierQ consistently reduces quantization error for camera-based 2D/3D and LiDAR-based 3D object detection.
Conclusion: The inlier-centric strategy limits the harm that task-irrelevant anomalies do to quantization, improving the robustness and accuracy of quantized detectors across modalities while staying lightweight, unsupervised, and sample-efficient.
Abstract: Object detection is pivotal in computer vision, yet its immense computational demands make deployment slow and power-hungry, motivating quantization. However, task-irrelevant morphologies such as background clutter and sensor noise induce redundant activations (or anomalies). These anomalies expand activation ranges and skew activation distributions toward task-irrelevant responses, complicating bit allocation and weakening the preservation of informative features. Without a clear criterion to distinguish anomalies, suppressing them can inadvertently discard useful information. To address this, we present InlierQ, an inlier-centric post-training quantization approach that separates anomalies from informative inliers. InlierQ computes gradient-aware volume saliency scores, classifies each volume as an inlier or anomaly, and fits a posterior distribution over these scores using the Expectation-Maximization (EM) algorithm. This design suppresses anomalies while preserving informative features. InlierQ is label-free, drop-in, and requires only 64 calibration samples. Experiments on the COCO and nuScenes benchmarks show consistent reductions in quantization error for camera-based (2D and 3D) and LiDAR-based (3D) object detection.
[152] Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance
Yingjie Zhu,Xuefeng Bai,Kehai Chen,Yang Xiang,Youcheng Pan,Xiaoqiang Zhou,Min Zhang
Main category: cs.CV
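TL;DR: This paper proposes two frameworks, DiSCo and Table-GLS, which disentangle structure and content alignment and apply global-to-local structure-guided reasoning to improve how large vision-language models (LVLMs) reason over table images, with minimal annotation and no external tools.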
Details
Motivation: Existing LVLMs struggle with table image reasoning because of complex layouts and tightly coupled structure and content, and current solutions depend on expensive supervised training, reinforcement learning, or external tools, which limits efficiency and scalability; this work aims to adapt LVLMs to table reasoning with minimal annotation and no external tools.
Method: Proposes DiSCo (Disentangled Structure-Content alignment), which explicitly separates structural abstraction from semantic grounding; building on it, Table-GLS (Global-to-Local Structure-guided reasoning) performs table understanding through structured exploration and evidence-grounded inference.
Result: Experiments on multiple benchmarks validate the approach, showing significantly improved table image understanding and reasoning in LVLMs, with good generalization to unseen table structures.
Conclusion: Together, DiSCo and Table-GLS offer an efficient, lightweight, tool-free way to adapt LVLMs to table reasoning, advancing the practical use of table image understanding.
Abstract: Reasoning over table images remains challenging for Large Vision-Language Models (LVLMs) due to complex layouts and tightly coupled structure-content information. Existing solutions often depend on expensive supervised training, reinforcement learning, or external tools, limiting efficiency and scalability. This work addresses a key question: how to adapt LVLMs to table reasoning with minimal annotation and no external tools? Specifically, we first introduce DiSCo, a Disentangled Structure-Content alignment framework that explicitly separates structural abstraction from semantic grounding during multimodal alignment, efficiently adapting LVLMs to table structures. Building on DiSCo, we further present Table-GLS, a Global-to-Local Structure-guided reasoning framework that performs table reasoning via structured exploration and evidence-grounded inference. Extensive experiments across diverse benchmarks demonstrate that our framework efficiently enhances LVLMs' table understanding and reasoning capabilities, particularly generalizing to unseen table structures.
[153] Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers
Bozhou Li,Yushuo Guan,Haolin Li,Bohan Zeng,Yiyan Ji,Yue Ding,Pengfei Wan,Kun Gai,Yuanxing Zhang,Wentao Zhang
Main category: cs.CV
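TL;DR: This paper proposes a unified normalized convex fusion framework that dynamically combines multi-layer large language model (LLM) hidden states via time-wise, depth-wise, and joint fusion to better match the generation process of Diffusion Transformers (DiTs); experiments show that depth-wise semantic routing works best, whereas purely time-wise fusion degrades generation quality because training and inference trajectories are mismatched under classifier-free guidance.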
Details
Motivation: Existing DiT text-to-image models use LLMs as text encoders, but text conditioning is static and typically draws on a single LLM layer, ignoring both the semantic hierarchy across LLM layers and the non-stationary denoising dynamics over diffusion time and network depth.
Method: Proposes a unified normalized convex fusion framework with lightweight gates that systematically fuses multi-layer LLM hidden states along the time dimension, the depth dimension, and both jointly; the study focuses on comparing depth-wise semantic routing with time-wise fusion.
Result: Depth-wise semantic routing significantly improves text-image alignment and compositional generation (e.g., +9.97 on the GenAI-Bench Counting task), while purely time-wise fusion reduces visual generation fidelity; a train-inference trajectory mismatch is identified as the main cause.
Conclusion: Depth-wise routing is a strong and effective baseline for text conditioning, and robust time-dependent conditioning requires trajectory-aware signals.
Abstract: Recent DiT-based text-to-image models increasingly adopt LLMs as text encoders, yet text conditioning remains largely static and often utilizes only a single LLM layer, despite pronounced semantic hierarchy across LLM layers and non-stationary denoising dynamics over both diffusion time and network depth. To better match the dynamic process of DiT generation and thereby enhance the diffusion model's generative capability, we introduce a unified normalized convex fusion framework equipped with lightweight gates to systematically organize multi-layer LLM hidden states via time-wise, depth-wise, and joint fusion. Experiments establish Depth-wise Semantic Routing as the superior conditioning strategy, consistently improving text-image alignment and compositional generation (e.g., +9.97 on the GenAI-Bench Counting task). Conversely, we find that purely time-wise fusion can paradoxically degrade visual generation fidelity. We attribute this to a train-inference trajectory mismatch: under classifier-free guidance, nominal timesteps fail to track the effective SNR, causing semantically mistimed feature injection during inference. Overall, our results position depth-wise routing as a strong and effective baseline and highlight the critical need for trajectory-aware signals to enable robust time-dependent conditioning.
[154] Interpretable Logical Anomaly Classification via Constraint Decomposition and Instruction Fine-Tuning
Xufei Zhang,Xinjiao Zhou,Ziling Deng,Dongdong Geng,Jianxiong Wang
Main category: cs.CV
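TL;DR: This paper introduces the Logical Anomaly Classification (LAC) task, which jointly detects logical anomalies in industrial images and identifies the specific rule that is violated; it proposes LogiCls, a vision-language framework that decomposes complex logical constraints into verifiable subqueries and is trained with chain-of-thought instruction synthesis and difficulty-aware resampling, achieving accurate and interpretable anomaly classification.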
Details
Motivation: Existing anomaly detection methods are mostly binary classifiers that cannot indicate which logical rule is violated, so they offer limited support for quality assurance.
Method: Proposes LogiCls, a vision-language framework that decomposes logical constraints into verifiable subqueries; builds a data-centric chain-of-thought instruction synthesis pipeline that couples precise grounding annotations with image-text augmentations; stabilizes training with a difficulty-aware resampling strategy.
Result: Experiments across multiple industrial scenarios show that LogiCls performs logical anomaly classification robustly, accurately, and interpretably, outputting both the violation categories and their evidence trails.
Conclusion: LogiCls unifies anomaly detection with fine-grained rule-violation identification, making logical anomaly analysis in industrial inspection more practical and interpretable.
Abstract: Logical anomalies are violations of predefined constraints on object quantity, spatial layout, and compositional relationships in industrial images. While prior work largely treats anomaly detection as a binary decision, such formulations cannot indicate which logical rule is broken and therefore offer limited value for quality assurance. We introduce Logical Anomaly Classification (LAC), a task that unifies anomaly detection and fine-grained violation classification in a single inference step. To tackle LAC, we propose LogiCls, a vision-language framework that decomposes complex logical constraints into a sequence of verifiable subqueries. We further present a data-centric instruction synthesis pipeline that generates chain-of-thought (CoT) supervision for these subqueries, coupling precise grounding annotations with diverse image-text augmentations to adapt vision language models (VLMs) to logic-sensitive reasoning. Training is stabilized by a difficulty-aware resampling strategy that emphasizes challenging subqueries and long-tail constraint types. Extensive experiments demonstrate that LogiCls delivers robust, interpretable, and accurate industrial logical anomaly classification, providing both the predicted violation categories and their evidence trails.
[155] PnP-U3D: Plug-and-Play 3D Framework Bridging Autoregression and Diffusion for Unified Understanding and Generation
Yongwei Chen,Tianyi Wei,Yushi Lan,Zhaoyang Lyu,Shangchen Zhou,Xudong Xu,Xingang Pan
Main category: cs.CV
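TL;DR: This paper proposes a unified framework that combines autoregression (AR) with diffusion for 3D understanding and generation, avoiding the quantization distortion and high training cost of a purely AR paradigm and reaching SOTA performance on multiple 3D benchmarks.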
Details
Motivation: Existing attempts to unify 3D tasks under a single autoregressive paradigm lead to degraded performance and prohibitive training cost; what is needed is effective information exchange between understanding and generation that preserves the strengths of each and uses pretrained models to reduce cost.
Method: Uses an AR paradigm (next-token prediction) for 3D understanding and a diffusion paradigm for 3D generation; a lightweight transformer bridges the feature space of the large language model and the conditional space of the 3D diffusion model, enabling cross-modal information exchange.
Result: Achieves SOTA performance on a range of 3D understanding and generation benchmarks and performs well on 3D editing tasks.
Conclusion: The joint AR+diffusion paradigm is a feasible and promising direction for general-purpose 3D intelligence, balancing performance, efficiency, and cooperation between modules.
Abstract: The rapid progress of large multimodal models has inspired efforts toward unified frameworks that couple understanding and generation. While such paradigms have shown remarkable success in 2D, extending them to 3D remains largely underexplored. Existing attempts to unify 3D tasks under a single autoregressive (AR) paradigm lead to significant performance degradation due to forced signal quantization and prohibitive training cost. Our key insight is that the essential challenge lies not in enforcing a unified autoregressive paradigm, but in enabling effective information interaction between generation and understanding while minimally compromising their inherent capabilities and leveraging pretrained models to reduce training cost. Guided by this perspective, we present the first unified framework for 3D understanding and generation that combines autoregression with diffusion. Specifically, we adopt an autoregressive next-token prediction paradigm for 3D understanding, and a continuous diffusion paradigm for 3D generation. A lightweight transformer bridges the feature space of large language models and the conditional space of 3D diffusion models, enabling effective cross-modal information exchange while preserving the priors learned by standalone models. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across diverse 3D understanding and generation benchmarks, while also excelling in 3D editing tasks. These results highlight the potential of unified AR+diffusion models as a promising direction for building more general-purpose 3D intelligence.
[156] Constrained Dynamic Gaussian Splatting
Zihan Zheng,Zhenglong Wu,Xuanxuan Wang,Houqiang Zhong,Xiaoyun Zhang,Qiang Hu,Guangtao Zhai,Wenjun Zhang
Main category: cs.CV
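TL;DR: This paper proposes Constrained Dynamic Gaussian Splatting (CDGS), which formulates dynamic scene reconstruction as a budget-constrained optimization problem and uses a differentiable budget controller with a multi-modal importance score to achieve high-quality 4D reconstruction and efficient compression under a strict limit on the number of Gaussians.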
Details
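Motivation: Dynamic Gaussian Splatting enables high-fidelity 4D reconstruction, but unconstrained densification causes memory use to explode, making deployment on edge devices difficult; heuristic pruning, in turn, cannot guarantee optimal rendering quality under a preset Gaussian budget, leaving a fundamental tension between performance and resources.
Method: Proposes the CDGS framework: 1) formulates a budget-constrained optimization problem; 2) designs a differentiable budget controller guided by a unified multi-modal importance score that fuses geometric, motion, and perceptual cues; 3) decouples the optimization of static and dynamic components and allocates Gaussian capacity adaptively; 4) uses a three-phase training strategy to hit the target count precisely; 5) adds a dual-mode hybrid compression scheme.
Result: CDGS strictly meets the Gaussian budget under hardware constraints (error < 2%), delivers state-of-the-art rendering quality, achieves over 3x compression compared to SOTA methods, and pushes the Pareto frontier of rate-distortion performance.
Conclusion: CDGS resolves the core conflict between limited resources and quality guarantees when deploying dynamic Gaussian splatting on edge devices, providing a scalable and controllable paradigm for real-time, high-fidelity 4D reconstruction.
Abstract: While Dynamic Gaussian Splatting enables high-fidelity 4D reconstruction, its deployment is severely hindered by a fundamental dilemma: unconstrained densification leads to excessive memory consumption incompatible with edge devices, whereas heuristic pruning fails to achieve optimal rendering quality under preset Gaussian budgets. In this work, we propose Constrained Dynamic Gaussian Splatting (CDGS), a novel framework that formulates dynamic scene reconstruction as a budget-constrained optimization problem to enforce a strict, user-defined Gaussian budget during training. Our key insight is to introduce a differentiable budget controller as the core optimization driver. Guided by a multi-modal unified importance score, this controller fuses geometric, motion, and perceptual cues for precise capacity regulation. To maximize the utility of this fixed budget, we further decouple the optimization of static and dynamic elements, employing an adaptive allocation mechanism that dynamically distributes capacity based on motion complexity. Furthermore, we implement a three-phase training strategy to seamlessly integrate these constraints, ensuring precise adherence to the target count. Coupled with a dual-mode hybrid compression scheme, CDGS not only strictly adheres to hardware constraints (error < 2%) but also pushes the Pareto frontier of rate-distortion performance.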
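The budget idea described above can be illustrated with a much simpler stand-in. The sketch below is not the authors' differentiable controller: it fuses three per-Gaussian cues with hypothetical weights and keeps a hard top-k subset so that the count never exceeds the budget. The names `fuse_importance` and `prune_to_budget`, the weights, and the toy scores are all assumptions made for illustration.

```python
def fuse_importance(geometric, motion, perceptual, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three per-Gaussian cues (the weights are hypothetical)."""
    wg, wm, wp = weights
    return [wg * g + wm * m + wp * p
            for g, m, p in zip(geometric, motion, perceptual)]


def prune_to_budget(scores, budget):
    """Return the sorted indices of the `budget` highest-scoring Gaussians."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example with four Gaussians and a budget of two.
scores = fuse_importance(
    geometric=[0.9, 0.1, 0.4, 0.2],
    motion=[0.0, 0.8, 0.1, 0.1],
    perceptual=[0.3, 0.3, 0.0, 0.1],
)
kept = prune_to_budget(scores, budget=2)
```

A hard top-k cut like this is not differentiable, which is presumably one reason the paper introduces a differentiable controller instead; the sketch only shows what "strict adherence to a target count" means operationally.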
Extensive experiments demonstrate that CDGS delivers optimal rendering quality under varying capacity limits, achieving over 3x compression compared to state-of-the-art methods.
[157] Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars
Youliang Zhang,Zhengguang Zhou,Zhentao Yu,Ziyao Huang,Teng Hu,Sen Liang,Guozhen Zhang,Ziqiao Peng,Shunkai Li,Yi Chen,Zixiang Zhou,Yuan Zhou,Qinglin Lu,Xiu Li
Main category: cs.CV
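TL;DR: This paper proposes InteractAvatar, a dual-stream framework for generating talking avatars that perform text-aligned grounded human-object interaction (GHOI); it decouples perception and planning from video synthesis and introduces the GroundedInter benchmark for evaluation.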
Details
Motivation: Existing methods struggle to generate full-body talking avatars that interact with surrounding objects in a text-aligned way, mainly because of insufficient environmental perception and the trade-off between controllability and generation quality.
Method: Proposes the dual-stream framework InteractAvatar: 1) a Perception and Interaction Module (PIM) that uses detection to enhance environmental perception and generates text-aligned motions; 2) an Audio-Interaction Aware Generation Module (AIM) that synthesizes vivid interactive talking videos; 3) a motion-to-video aligner that lets PIM and AIM co-generate in parallel; 4) a new benchmark, GroundedInter.
Result: Significantly outperforms existing methods on GHOI video generation, producing high-quality, highly controllable talking avatars with grounded human-object interaction.
Conclusion: Through its decoupled design and co-generation mechanism, InteractAvatar mitigates the control-quality dilemma and offers a new paradigm for talking avatar generation with grounded human-object interaction.
Abstract: Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io
[158] Cut to the Mix: Simple Data Augmentation Outperforms Elaborate Ones in Limited Organ Segmentation Datasets
Chang Liu,Fuxin Fan,Annette Schwarz,Andreas Maier
Main category: cs.CV
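TL;DR: This paper studies four inter-image data augmentation strategies (CutMix, CarveMix, ObjectAug, AnatoMix) for multi-organ segmentation and finds that methods such as CutMix significantly improve the Dice score and remain robust and effective when data are limited.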
Details
Motivation: Annotated data for multi-organ segmentation are scarce in clinical practice, which constrains the training of deep learning models; traditional data augmentation is limited to single-image transformations, while inter-image and object-level augmentation remain underexplored.
Method: Systematically evaluates four inter-image augmentation strategies (CutMix, CarveMix, ObjectAug, and AnatoMix) on two organ segmentation datasets, comparing them with an nnUNet baseline and with combinations that add traditional augmentation.
Result: CutMix, CarveMix, and AnatoMix improve the average Dice score by 4.9, 2.0, and 1.9, respectively; adding traditional augmentation improves results further; CutMix remains robust even though it produces intuitively 'wrong' images.
Conclusion: Inter-image data augmentation, CutMix in particular, is an effective and practical way to improve multi-organ segmentation with few samples, and the work provides an open-source implementation as a benchmark for future research.
Abstract: Multi-organ segmentation is a widely applied clinical routine, and automated organ segmentation tools dramatically improve radiologists' workflow. Recently, deep learning (DL) based segmentation models have shown the capacity to accomplish such a task. However, training segmentation networks requires a large amount of data with manual annotations, which is a major concern given the scarcity of clinical data. Working with limited data is still common in research on novel imaging modalities. To enhance the effectiveness of DL models trained with limited data, data augmentation (DA) is a crucial regularization technique. Traditional DA (TDA) strategies focus on basic intra-image operations, i.e. generating images with different orientations and intensity distributions. In contrast, inter-image and object-level DA operations are able to create new images from separate individuals. However, such DA strategies are not well explored for multi-organ segmentation. In this paper, we investigate four possible inter-image DA strategies, CutMix, CarveMix, ObjectAug and AnatoMix, on two organ segmentation datasets. The results show that CutMix, CarveMix and AnatoMix can improve the average Dice score by 4.9, 2.0 and 1.9, compared with the state-of-the-art nnUNet without DA strategies. These results can be further improved by adding TDA strategies. Our experiments reveal that CutMix is a simple but robust DA strategy for driving up multi-organ segmentation performance, even when CutMix produces intuitively 'wrong' images.
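To make the CutMix operation concrete for segmentation, here is a minimal sketch. It is not the authors' released implementation: it assumes 2D images stored as nested lists, samples the box uniformly (the original CutMix formulation draws the box area from a Beta distribution), and copies the label map along with the pixels. The function name `cutmix` and its return signature are assumptions made for illustration.

```python
import random


def cutmix(img_a, lab_a, img_b, lab_b, rng=None):
    """Paste one random rectangle of (img_b, lab_b) into copies of (img_a, lab_a).

    All four inputs are 2D nested lists of identical shape. Returns the mixed
    image, the mixed label map, and the pasted box as (top, left, height, width).
    """
    rng = rng or random.Random()
    h, w = len(img_a), len(img_a[0])
    # Sample the box size first, then a corner that keeps the box in bounds.
    bh = rng.randint(1, h)
    bw = rng.randint(1, w)
    top = rng.randint(0, h - bh)
    left = rng.randint(0, w - bw)
    # Copy so the source image and label map are left untouched.
    out_img = [row[:] for row in img_a]
    out_lab = [row[:] for row in lab_a]
    for y in range(top, top + bh):
        out_img[y][left:left + bw] = img_b[y][left:left + bw]
        out_lab[y][left:left + bw] = lab_b[y][left:left + bw]
    return out_img, out_lab, (top, left, bh, bw)
```

Because the rectangle ignores anatomy, the mixed image can place part of one patient's organ next to another's, which is the kind of intuitively 'wrong' sample the abstract refers to.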
Our implementation is publicly available for future benchmarks.
[159] ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images
Xinyue Li,Zhiming Xu,Zhichao Zhang,Zhaolin Cai,Sijing Wu,Xiongkuo Min,Yitong Chen,Guangtao Zhai
Main category: cs.CV
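TL;DR: This paper proposes ELIQ, a framework for assessing the quality of AI-generated images without human annotation; it automatically constructs positive and negative sample pairs, instruction-tunes a multimodal model, and uses lightweight fusion with a Quality Query Transformer to score both visual quality and prompt-image alignment.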
Details
Motivation: Generative text-to-image models evolve quickly, which makes previously collected labels obsolete; a quality assessment method is needed that does not depend on human labels and can keep up with continuously evolving models.
Method: ELIQ automatically constructs positive samples and aspect-specific negative samples, adapts a pre-trained multimodal model through instruction tuning, and predicts two-dimensional quality scores with lightweight gated fusion and a Quality Query Transformer.
Result: ELIQ clearly outperforms existing label-free methods on multiple benchmarks and generalizes to user-generated content (UGC) scenarios without modification.
Conclusion: ELIQ offers a scalable, label-free, cross-domain approach to quality assessment for rapidly iterating generative models.
Abstract: Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering previously collected labels unreliable for newer generations. To address this, we present ELIQ, a Label-free Framework for Quality Assessment of Evolving AI-generated Images. Specifically, ELIQ focuses on visual quality and prompt-image alignment and automatically constructs positive and aspect-specific negative pairs to cover both conventional distortions and AIGC-specific distortion modes, enabling transferable supervision without human annotations. Building on these pairs, ELIQ adapts a pre-trained multimodal model into a quality-aware critic via instruction tuning and predicts two-dimensional quality using lightweight gated fusion and a Quality Query Transformer. Experiments across multiple benchmarks demonstrate that ELIQ consistently outperforms existing label-free methods, generalizes from AI-generated content (AIGC) to user-generated content (UGC) scenarios without modification, and paves the way for scalable and label-free quality assessment under continuously evolving generative models. The code will be released upon publication.
[160] SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM
Ming Nie,Dan Ding,Chunwei Wang,Yuanfan Guo,Jianhua Han,Hang Xu,Li Zhang
Main category: cs.CV
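TL;DR: This paper proposes the SlowFocus mechanism, which densely samples the query-related temporal segment and applies a multi-frequency mixing attention module so that video LLMs (Vid-LLMs) keep high-quality frame-level semantics while gaining stronger video-level temporal understanding; it also builds FineAction-CGR, a new benchmark for fine-grained action understanding.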
Details
Motivation: Existing video large language models (Vid-LLMs) cannot simultaneously retain high-fidelity frame-level semantics (enough tokens per frame) and comprehensive video-level temporal information (enough sampled frames per video), which holds back fine-grained video understanding.
Method: Proposes the SlowFocus mechanism: it first locates the query-related temporal segment from the question, then densely samples that segment to extract local high-frequency features; a multi-frequency mixing attention module fuses these local high-frequency details with global low-frequency context; accompanying training strategies strengthen temporal grounding and fine-grained temporal reasoning; a new fine-grained video understanding benchmark, FineAction-CGR, is also built.
Result: On multiple public video understanding benchmarks and on the new FineAction-CGR, SlowFocus significantly outperforms existing methods, confirming its effectiveness and generalization on fine-grained video understanding tasks.
Conclusion: SlowFocus eases the trade-off between frame-level detail and video-level temporal modeling, offers a new direction for building stronger Vid-LLMs, and advances research on fine-grained video understanding.
Abstract: Large language models (LLMs) have demonstrated exceptional capabilities in text understanding, which has paved the way for their expansion into video LLMs (Vid-LLMs) to analyze video data. However, current Vid-LLMs struggle to simultaneously retain high-quality frame-level semantic information (i.e., a sufficient number of tokens per frame) and comprehensive video-level temporal information (i.e., an adequate number of sampled frames per video). This limitation hinders the advancement of Vid-LLMs towards fine-grained video understanding. To address this issue, we introduce the SlowFocus mechanism, which significantly enhances the equivalent sampling frequency without compromising the quality of frame-level visual tokens. SlowFocus begins by identifying the query-related temporal segment based on the posed question, then performs dense sampling on this segment to extract local high-frequency features. A multi-frequency mixing attention module is further leveraged to aggregate these local high-frequency details with global low-frequency contexts for enhanced temporal comprehension. Additionally, to tailor Vid-LLMs to this innovative mechanism, we introduce a set of training strategies aimed at bolstering both temporal grounding and detailed temporal reasoning capabilities. Furthermore, we establish FineAction-CGR, a benchmark specifically devised to assess the ability of Vid-LLMs to process fine-grained temporal understanding tasks.
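The two sampling rates described above can be sketched as index selection. This is an illustrative sketch only: it assumes the query-related segment has already been located (the temporal grounding step is not modelled), and the function names, the midpoint-of-bin rule, and the frame counts are hypothetical choices rather than details from the paper.

```python
def uniform_indices(start, end, n):
    """Pick n frame indices spread evenly over [start, end), one per equal bin."""
    span = end - start
    # Integer midpoint of the i-th of n equal-width bins.
    return [start + (2 * i + 1) * span // (2 * n) for i in range(n)]


def slowfocus_indices(num_frames, segment, n_global, n_local):
    """Return (global low-frequency indices, local high-frequency indices).

    `segment` is a (start, end) frame range assumed to come from a separate
    temporal grounding step.
    """
    seg_start, seg_end = segment
    global_idx = uniform_indices(0, num_frames, n_global)
    local_idx = uniform_indices(seg_start, seg_end, n_local)
    return global_idx, local_idx


# 100-frame video, query-related segment at frames 40..59.
global_idx, local_idx = slowfocus_indices(100, (40, 60), n_global=4, n_local=5)
```

Here the global stream covers the whole video at one frame in 25, while the local stream covers the segment at one frame in 4, which is what raises the effective sampling frequency where the question needs it.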
Comprehensive experiments demonstrate the superiority of our mechanism across both existing public video understanding benchmarks and our proposed FineAction-CGR.
[161] High-Resolution Underwater Camouflaged Object Detection: GBU-UCOD Dataset and Topology-Aware and Frequency-Decoupled Networks
Wenji Wu,Shuo Ye,Yiyu Liu,Jiguang He,Zhuo Wang,Zitong Yu
Main category: cs.CV
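TL;DR: This paper proposes DeepTopo-Net, a framework that combines topology-aware modeling with frequency-decoupled perception to address three problems in underwater camouflaged object detection: targets that closely resemble the background, topological fragmentation of slender creatures, and the faint features of transparent organisms. It designs a Water-Conditioned Adaptive Perceptor (WCAP) based on Riemannian metric tensors and an Abyssal-Topology Refinement Module (ATRM) based on skeletal priors, and builds GBU-UCOD, the first 2K high-resolution benchmark for marine vertical zonation, including the hadal and abyssal zones.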
Details
Motivation: Existing methods struggle with the topological fragmentation of slender deep-sea creatures and with extracting the subtle features of transparent organisms, and there is no high-quality dataset covering the abyssal and hadal zones.
Method: Proposes the DeepTopo-Net framework, comprising the Water-Conditioned Adaptive Perceptor (WCAP), which uses Riemannian metric tensors to dynamically deform convolutional sampling fields, and the Abyssal-Topology Refinement Module (ATRM), which uses skeletal priors to keep slender targets structurally connected; builds GBU-UCOD, the first 2K high-resolution underwater benchmark for vertical zonation.
Result: Achieves SOTA performance on MAS3K, RMAS, and the new GBU-UCOD dataset, with particular strength in preserving the morphological integrity of complex underwater patterns.
Conclusion: Deep networks that combine topological modeling with physics-aware perception improve the robustness and structural fidelity of underwater camouflaged object detection, and the new GBU-UCOD benchmark fills the data gap for the abyssal zone.
Abstract: Underwater Camouflaged Object Detection (UCOD) is a challenging task due to the extreme visual similarity between targets and backgrounds across varying marine depths. Existing methods often struggle with topological fragmentation of slender creatures in the deep sea and the subtle feature extraction of transparent organisms. In this paper, we propose DeepTopo-Net, a novel framework that integrates topology-aware modeling with frequency-decoupled perception. To address physical degradation, we design the Water-Conditioned Adaptive Perceptor (WCAP), which employs Riemannian metric tensors to dynamically deform convolutional sampling fields. Furthermore, the Abyssal-Topology Refinement Module (ATRM) is developed to maintain the structural connectivity of spindly targets through skeletal priors. We also introduce GBU-UCOD, the first high-resolution (2K) benchmark tailored for marine vertical zonation, filling the data gap for hadal and abyssal zones. Extensive experiments on MAS3K, RMAS, and our proposed GBU-UCOD datasets demonstrate that DeepTopo-Net achieves state-of-the-art performance, particularly in preserving the morphological integrity of complex underwater patterns. The datasets and codes will be released at https://github.com/Wuwenji18/GBU-UCOD.
[162] TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection
Alireza Salehi,Ehsan Karami,Sepehr Noey,Sahand Noey,Makoto Yamada,Reshad Hosseini,Mohammad Sabokrou
Main category: cs.CV
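TL;DR: This paper proposes a new zero-shot anomaly detection method built on the TIPS vision-language model; decoupled prompts and local evidence injection significantly improve image-level and pixel-level anomaly detection with a simple architecture and strong generalization.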
Details
Motivation: Existing CLIP-based zero-shot anomaly detection methods are limited by coarse image-text alignment, which causes spatial misalignment and weak sensitivity to fine-grained anomalies; prior work mostly relies on complex auxiliary modules and overlooks the choice of backbone.
Method: Uses TIPS, a vision-language model trained with spatially aware objectives, as the backbone; designs decoupled prompts (fixed prompts for image-level detection, learnable prompts for pixel-level localization); injects local feature evidence into the global score to bridge the distributional gap between global and local features.
Result: Across seven industrial datasets, image-level detection improves by 1.1-3.9% and pixel-level localization by 1.5-6.9%, outperforming CLIP baselines without CLIP-specific tricks.
Conclusion: Backbone choice matters; TIPS combined with decoupled prompts and local evidence fusion achieves strongly generalizing zero-shot anomaly detection with a lightweight architecture.
Abstract: Anomaly detection identifies departures from expected behavior in safety-critical settings. When target-domain normal data are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP's coarse image-text alignment limits both localization and detection due to (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior work compensates with complex auxiliary modules yet largely overlooks the choice of backbone. We revisit the backbone and use TIPS, a VLM trained with spatially aware objectives. While TIPS alleviates CLIP's issues, it exposes a distributional gap between global and local features. We address this with decoupled prompts (fixed for image-level detection and learnable for pixel-level localization) and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at github.com/AlirezaSalehy/Tipsomaly.
[163] Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation
Haichao Jiang,Tianming Liang,Wei-Shi Zheng,Jian-Fang Hu
Main category: cs.CV
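TL;DR: This paper proposes Refer-Agent, a zero-shot referring video object segmentation method based on multi-agent collaboration and an alternating reasoning-reflection mechanism; with coarse-to-fine frame selection, a dynamic focus layout, and a chain-of-reflection mechanism, it significantly improves performance without large-scale supervised fine-tuning and supports fast integration of new multimodal large models.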
Details
Motivation: Existing RVOS methods rely heavily on large-scale supervised fine-tuning, which makes them data-hungry and hard to adapt to rapidly iterating multimodal large language models (MLLMs); existing zero-shot methods, with their simple workflow designs, lag far behind fine-tuned methods.
Method: Proposes the Refer-Agent multi-agent system, which includes: 1) a coarse-to-fine frame selection strategy that balances frame diversity and textual relevance; 2) a dynamic focus layout that adaptively adjusts the visual focus; 3) a chain-of-reflection mechanism (a Questioner-Responder pair) that generates a self-reflection chain to verify intermediate results and feed back into the next round of reasoning.
Result: Significantly surpasses existing SFT-based and zero-shot methods on five challenging benchmarks; supports fast zero-shot integration of new MLLMs without additional fine-tuning.
Conclusion: Refer-Agent provides an efficient, flexible, and scalable zero-shot paradigm for RVOS that reduces the dependence on supervised fine-tuning and improves adaptability as models evolve.
Abstract: Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability against the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods, due to the straightforward workflow designs. To address these limitations, we propose Refer-Agent, a collaborative multi-agent system with alternating reasoning-reflection mechanisms. This system decomposes RVOS into a step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent's visual focus. Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generate feedback for next-round reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning costs. Code will be released.
[164] A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures
Basile Terver,Randall Balestriero,Megi Dervishi,David Fan,Quentin Garrido,Tushar Nagarajan,Koustuv Sinha,Wancong Zhang,Mike Rabbat,Yann LeCun,Amir Bar
Main category: cs.CV
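TL;DR: EB-JEPA is an open-source library for learning representations and world models with Joint-Embedding Predictive Architectures (JEPAs); it supports a progression from images to video to action-conditioned world models, with efficient single-GPU training, interpretable representations, and strong downstream performance.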
Details
Motivation: To address the shortcomings of generative modeling that predicts in pixel space, explore the JEPA paradigm of predicting in representation space, extend it to temporal video modeling and action-driven world models, and make self-supervised learning more accessible and practical.
Method: Builds EB-JEPA, a modular, self-contained open-source library covering three kinds of tasks: images (CIFAR-10), video (multi-step prediction on Moving MNIST), and action-conditioned world models (Two Rooms navigation); it adopts an energy-based self-supervised learning framework with key regularization components that prevent representation collapse.
Result: Probing the CIFAR-10 representations yields 91% accuracy; Moving MNIST achieves multi-step video prediction; the Two Rooms task reaches a 97% planning success rate; ablations show that each regularization component is critical for preventing representation collapse.
Conclusion: The JEPA paradigm transfers effectively to more complex settings (video and world models), and the EB-JEPA library offers a lightweight, efficient, reproducible baseline for self-supervised representation learning in research and teaching.
Abstract: We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.
[165] KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs
Baiyang Song,Jun Peng,Yuxin Zhang,Guangyao Chen,Feidiao Yang,Jianyuan Guo
Main category: cs.CV
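TL;DR: This paper proposes KTV, a two-stage framework (keyframe selection followed by key visual token selection) that reduces redundancy in video understanding and improves the efficiency and effectiveness of training-free video understanding.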
Details
Motivation: Existing training-free video understanding methods suffer from visual redundancy and high computational overhead, and keyframe selection based on CLIP similarity is prone to bias and can miss important frames.
Method: KTV is a two-stage framework: the first stage selects keyframes by question-agnostic clustering of frame features; the second stage selects key visual tokens in each frame, pruning by token importance and redundancy.
Result: On the Multiple-Choice VideoQA task, KTV reaches 44.8% accuracy on MLVU-Test with only 504 visual tokens (for a 60-minute, 10800-frame video), surpassing SOTA training-free methods and even outperforming training-based methods on some benchmarks.
Conclusion: KTV relieves the redundancy and efficiency bottlenecks of training-free video understanding, sharply reducing the number of visual tokens while maintaining high performance, which supports the value of structured compression strategies.
Abstract: Training-free video understanding leverages the strong image comprehension capabilities of pre-trained vision language models (VLMs) by treating a video as a sequence of static frames, thus obviating the need for costly video-specific training. However, this paradigm often suffers from severe visual redundancy and high computational overhead, especially when processing long videos. Crucially, existing keyframe selection strategies, especially those based on CLIP similarity, are prone to biases and may inadvertently overlook critical frames, resulting in suboptimal video comprehension. To address these significant challenges, we propose KTV, a novel two-stage framework for efficient and effective training-free video understanding. In the first stage, KTV performs question-agnostic keyframe selection by clustering frame-level visual features, yielding a compact, diverse, and representative subset of frames that mitigates temporal redundancy. In the second stage, KTV applies key visual token selection, pruning redundant or less informative tokens from each selected keyframe based on token importance and redundancy, which significantly reduces the number of tokens fed into the LLM. Extensive experiments on the Multiple-Choice VideoQA task demonstrate that KTV outperforms state-of-the-art training-free baselines while using significantly fewer visual tokens, e.g., only 504 visual tokens for a 60-min video with 10800 frames, achieving 44.8% accuracy on the MLVU-Test benchmark.
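The first stage, question-agnostic keyframe selection by clustering, can be sketched as follows. The abstract does not say which clustering algorithm KTV uses, so plain k-means with a deterministic initialization is an assumption here, as are the function names and the rule of keeping the frame closest to each centroid.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_keyframes(features, k, iters=20):
    """Cluster frame features with k-means and keep one frame per cluster.

    `features` is a list of equal-length feature vectors, one per frame.
    Returns the sorted indices of the frames nearest to each centroid.
    """
    n = len(features)
    # Deterministic initialization: k evenly spaced frames.
    centroids = [list(features[i * n // k]) for i in range(k)]
    assign = [0] * n
    for _ in range(iters):
        # Assignment step: each frame joins its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(f, centroids[c]))
                  for f in features]
        # Update step: move each non-empty cluster's centroid to its mean.
        for c in range(k):
            members = [features[i] for i in range(n) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    keyframes = set()
    for c in range(k):
        idx = [i for i in range(n) if assign[i] == c]
        if idx:
            keyframes.add(min(idx, key=lambda i: sq_dist(features[i], centroids[c])))
    return sorted(keyframes)
```

Because the selection never looks at the question, the same keyframes can be reused for every query about a video; the second stage (token pruning within each keyframe) is not modelled in this sketch.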
In particular, KTV also exceeds several training-based approaches on certain benchmarks.
[166] Quasi-multimodal-based pathophysiological feature learning for retinal disease diagnosis
Lu Zhang,Huizhen Yu,Zuowei Wang,Fu Gui,Yatu Guo,Wei Zhang,Mengyu Jia
Main category: cs.CV
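TL;DR: This paper proposes a unified multimodal data synthesis and fusion framework for retinal disease classification and grading; it synthesizes FFA, MSI, and saliency maps, learns modality-specific representations with parallel models, and adaptively calibrates cross-modal features, outperforming existing methods on multi-label classification and diabetic retinopathy grading.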
Details
Motivation: To address the challenges that multimodal diagnosis faces in ophthalmic practice, such as data heterogeneity, potential invasiveness, and registration complexity.
Method: Proposes a unified framework that synthesizes FFA, MSI, and saliency maps; builds parallel models to learn modality-specific representations; adaptively calibrates features within and across modalities for information pruning and flexible fusion; and interprets the system through visualizations in image and feature space.
Result: On two public datasets, multi-label classification reaches an F1-score of 0.683 and an AUC of 0.953; diabetic retinopathy grading reaches an accuracy of 0.842 and a Kappa of 0.861.
Conclusion: The method improves the accuracy and efficiency of retinal disease screening and offers a scalable data augmentation framework for other medical imaging modalities.
Abstract: Retinal diseases spanning a broad spectrum can be effectively identified and diagnosed using complementary signals from multimodal data. However, multimodal diagnosis in ophthalmic practice is typically challenged in terms of data heterogeneity, potential invasiveness, registration complexity, and so on. As such, a unified framework that integrates multimodal data synthesis and fusion is proposed for retinal disease classification and grading. Specifically, the synthesized multimodal data incorporates fundus fluorescein angiography (FFA), multispectral imaging (MSI), and saliency maps that emphasize latent lesions as well as optic disc/cup regions. Parallel models are independently trained to learn modality-specific representations that capture cross-pathophysiological signatures. These features are then adaptively calibrated within and across modalities to perform information pruning and flexible integration according to downstream tasks. The proposed learning system is thoroughly interpreted through visualizations in both image and feature spaces. Extensive experiments on two public datasets demonstrated the superiority of our approach over state-of-the-art ones in the tasks of multi-label classification (F1-score: 0.683, AUC: 0.953) and diabetic retinopathy grading (Accuracy: 0.842, Kappa: 0.861). This work not only enhances the accuracy and efficiency of retinal disease screening but also offers a scalable framework for data augmentation across various medical imaging modalities.
[167] Multi-Objective Optimization for Synthetic-to-Real Style Transfer
Estelle Chigot,Thomas Oberlin,Manon Huguenin,Dennis Wilson
Main category: cs.CV
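TL;DR: This paper proposes optimizing style transfer pipelines with a multi-objective genetic algorithm for synthetic-to-real domain adaptation in semantic segmentation; it optimizes the sequence of style transfer operations to balance structural coherence against style similarity and uses sample-level paired-image metrics to speed up the evolutionary search.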
Details
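Motivation: Semantic segmentation models depend on large amounts of pixel-level annotated real images; synthetic images come with automatically generated ground truth, but the domain gap to real images degrades performance, and existing style transfer methods lack a systematic way to optimize the sequence of transformations.
Method: Uses a multi-objective genetic algorithm to optimize the sequence of style transfer operations, with structural coherence and style similarity as the two objectives; replaces standard distributional metrics, which require many images, with fast metrics computed on individual paired images to make the evolutionary search efficient; finally validates the Pareto front with distributional metrics and segmentation performance.
Result: On synthetic-to-real adaptation from GTA5 to Cityscapes and ACDC (especially under adverse weather), the method produces diverse, objective-specific augmentation pipelines; the proposed sample-level metrics correlate strongly with final segmentation performance and greatly improve search efficiency.
Conclusion: Modeling style transfer as an operation-sequencing problem and solving it with multi-objective evolutionary algorithms is feasible and effective; sample-level paired-image metrics make search practical in a large combinatorial space; the framework offers a new way to design data augmentation strategies for domain adaptation.
Abstract: Semantic segmentation networks require large amounts of pixel-level annotated data, which are costly to obtain for real-world images. Computer graphics engines can generate synthetic images alongside their ground-truth annotations. However, models trained on such images can perform poorly on real images due to the domain gap between real and synthetic images. Style transfer methods can reduce this difference by applying a realistic style to synthetic images. Choosing effective data transformations and their sequence is difficult due to the large combinatorial search space of style transfer operators. Using multi-objective genetic algorithms, we optimize pipelines to balance structural coherence and style similarity to target domains. We study the use of paired-image metrics on individual image samples during evolution to enable rapid pipeline evaluation, as opposed to standard distributional metrics that require the generation of many images. After optimization, we evaluate the resulting Pareto front using distributional metrics and segmentation performance. We apply this approach to standard datasets in synthetic-to-real domain adaptation: from the video game GTA5 to real image datasets Cityscapes and ACDC, focusing on adverse conditions. Results demonstrate that evolutionary algorithms can propose diverse augmentation pipelines adapted to different objectives.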
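The Pareto front mentioned above is the set of pipelines that no other pipeline beats on both objectives at once. The sketch below shows only that non-dominated filtering step, assuming both objectives are scores to maximize; the pipeline names and score values are made up for illustration, and the genetic operators (selection, crossover, mutation) are not modelled.

```python
def dominates(a, b):
    """True if score tuple `a` is at least as good as `b` everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates.

    `candidates` maps a pipeline name to (structural coherence, style similarity).
    """
    return [
        name
        for name, score in candidates.items()
        if not any(dominates(other, score)
                   for other_name, other in candidates.items()
                   if other_name != name)
    ]


# Hypothetical pipelines scored on (structure, style); higher is better.
pipelines = {
    "p1": (0.9, 0.2),
    "p2": (0.5, 0.5),
    "p3": (0.4, 0.4),  # dominated by p2 on both objectives
    "p4": (0.2, 0.9),
}
front = pareto_front(pipelines)
```

Keeping the whole front rather than a single winner is what lets the search return several pipelines that trade structure against style differently.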
The contribution of this work is the formulation of style transfer as a sequencing problem suitable for evolutionary optimization and the study of efficient metrics that enable feasible search in this space. The source code is available at: https://github.com/echigot/MOOSS.[168] SPWOOD: Sparse Partial Weakly-Supervised Oriented Object Detection
Wei Zhang,Xiang Liu,Ningjing Liu,Mingxin Liu,Wei Liao,Chunyan Xu,Xue Yang
Main category: cs.CV
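TL;DR: This paper proposes SPWOOD, a sparse partial weakly-supervised oriented object detection framework that needs only a small amount of sparse weak annotations plus abundant unlabeled data. Using an SOS-Student model, a multi-level pseudo-label filtering strategy, and a sparse partitioning method, it reports significant performance gains on the DOTA and DIOR datasets at lower annotation cost.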
Details
Motivation: Remote sensing images contain densely packed objects from many categories, which makes fully supervised annotation extremely expensive, so reducing the reliance on strong annotations is a pressing need.
Method: The SPWOOD framework has three core components: (1) an SOS-Student model that learns orientation and scale information from sparse weak annotations; (2) a multi-level pseudo-label filtering strategy based on the distribution of multi-layer predictions; and (3) a sparse partitioning method that treats every category fairly.
Result: On the DOTA and DIOR datasets, the framework significantly outperforms traditional oriented object detection methods, combining high performance with low cost.
Conclusion: SPWOOD is the first sparse partial weakly-supervised oriented object detection framework, and it offers an efficient, economical solution for large-scale remote sensing image detection.
Abstract: A consistent trend throughout the research of oriented object detection has been the pursuit of maintaining comparable performance with fewer and weaker annotations. This is particularly crucial in the remote sensing domain, where the dense object distribution and a wide variety of categories contribute to prohibitively high costs. Based on the supervision level, existing oriented object detection algorithms can be broadly grouped into fully supervised, semi-supervised, and weakly supervised methods. Within the scope of this work, we further categorize them to include sparsely supervised and partially weakly-supervised methods. To address the challenges of large-scale labeling, we introduce the first Sparse Partial Weakly-Supervised Oriented Object Detection framework, designed to efficiently leverage only a small amount of sparse weakly-labeled data together with plenty of unlabeled data. Our framework incorporates three key innovations: (1) We design a Sparse-annotation-Orientation-and-Scale-aware Student (SOS-Student) model to separate unlabeled objects from the background in a sparsely-labeled setting, and learn orientation and scale information from orientation-agnostic or scale-agnostic weak annotations. (2) We construct a novel Multi-level Pseudo-label Filtering strategy that leverages the distribution of model predictions, which is informed by the model's multi-layer predictions. (3) We propose a unique sparse partitioning approach, ensuring equal treatment for each category. Extensive experiments on the DOTA and DIOR datasets show that our framework achieves a significant performance gain over traditional oriented object detection methods mentioned above, offering a highly cost-effective solution. Our code is publicly available at https://github.com/VisionXLab/SPWOOD.
[169] MM-SCALE: Grounded Multimodal Moral Reasoning via Scalar Judgment and Listwise Alignment
Eunkyu Park,Wesley Hanwen Deng,Cheyon Jin,Matheus Kunzler Maldaner,Jordan Wheeler,Jason I. Hong,Hong Shen,Adam Perer,Ken Holstein,Motahhare Eslami,Gunhee Kim
Main category: cs.CV
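TL;DR: This paper presents the MM-SCALE dataset, which uses 5-point scalar ratings and explicit modality-grounding annotations to improve how vision-language models capture the continuous, pluralistic nature of moral judgment.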
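As a rough illustration of how scalar ratings can drive listwise training, the sketch below computes a ListMLE-style (Plackett-Luce) loss from model scores and human ratings. The entry does not specify the exact objective, so this loss is a generic stand-in, and the function name and example values are assumptions.

```python
import math
from typing import Sequence


def listmle_loss(model_scores: Sequence[float], human_ratings: Sequence[float]) -> float:
    """Negative log-likelihood of the human-rated ordering under a Plackett-Luce model.

    Generic listwise loss used as a stand-in; not necessarily the paper's objective.
    """
    # Order items from most to least acceptable according to the human ratings.
    order = sorted(range(len(human_ratings)), key=lambda i: -human_ratings[i])
    s = [model_scores[i] for i in order]
    loss = 0.0
    for k in range(len(s)):
        # Stable log-sum-exp over the items that have not been placed yet.
        m = max(s[k:])
        log_norm = m + math.log(sum(math.exp(x - m) for x in s[k:]))
        loss += log_norm - s[k]
    return loss


ratings = [5, 3, 1]  # hypothetical 5-point acceptability ratings for three scenarios
agreeing = listmle_loss([2.0, 1.0, 0.0], ratings)
reversed_order = listmle_loss([0.0, 1.0, 2.0], ratings)
print(agreeing < reversed_order)
```

A model whose scores follow the human ordering gets a lower loss than one that reverses it, which is the signal a pairwise or binary label cannot express for a whole ranked set at once.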
Details
Motivation: Existing VLMs make poor moral judgments in multimodal and socially ambiguous situations, and binary or pairwise supervision cannot capture the continuous and pluralistic nature of human moral reasoning.
Method: Build MM-SCALE, a large-scale multimodal moral scale dataset containing image-scenario pairs, 5-point moral acceptability ratings, and modality-grounded reasoning labels, which supports listwise preference optimization.
Result: VLMs fine-tuned on MM-SCALE outperform models trained with binary signals in both ranking fidelity and stability of safety calibration.
Conclusion: Moving from discrete to scalar supervision provides richer alignment signals and significantly improves VLM performance on multimodal moral reasoning tasks.
Abstract: Vision-Language Models (VLMs) continue to struggle to make morally salient judgments in multimodal and socially ambiguous contexts. Prior works typically rely on binary or pairwise supervision, which often fails to capture the continuous and pluralistic nature of human moral reasoning. We present MM-SCALE (Multimodal Moral Scale), a large-scale dataset for aligning VLMs with human moral preferences through 5-point scalar ratings and explicit modality grounding. Each image-scenario pair is annotated with moral acceptability scores and grounded reasoning labels by humans using an interface we tailored for data collection, enabling listwise preference optimization over ranked scenario sets. By moving from discrete to scalar supervision, our framework provides richer alignment signals and finer calibration of multimodal moral reasoning. Experiments show that VLMs fine-tuned on MM-SCALE achieve higher ranking fidelity and more stable safety calibration than those trained with binary signals.
[170] Efficient Sequential Neural Network with Spatial-Temporal Attention and Linear LSTM for Robust Lane Detection Using Multi-Frame Images
Sandeep Patil,Yongqi Dong,Haneen Farah,Hans Hellendoorn
Main category: cs.CV
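TL;DR: This paper proposes a sequential neural network with a spatial-temporal attention mechanism to improve the accuracy, robustness, and real-time performance of lane detection in complex traffic scenes.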
Details
Motivation: Existing lane detection methods perform poorly in complex scenes such as mixed traffic, severe occlusion, and glare; vision-based methods in particular often neglect critical image regions and their spatial-temporal salience.
Method: Design a sequential neural network built on an encoder-decoder structure and common backbones, with a spatial-temporal attention mechanism that focuses on key lane-line features and models spatial-temporal correlations across consecutive frames.
Result: Experiments on three large-scale open-source datasets show that the model outperforms state-of-the-art methods in a variety of test scenarios, with fewer parameters and MACs and therefore higher computational efficiency.
Conclusion: The spatial-temporal attention mechanism effectively improves lane detection accuracy, robustness, and real-time performance, and its lightweight design suits practical autonomous driving systems.
Abstract: Lane detection is a crucial perception task for all levels of automated vehicles (AVs) and Advanced Driver Assistance Systems, particularly in mixed-traffic environments where AVs must interact with human-driven vehicles (HDVs) and challenging traffic scenarios. Current methods lack versatility in delivering accurate, robust, and real-time compatible lane detection; vision-based methods in particular often neglect critical regions of the image and their spatial-temporal (ST) salience, leading to poor performance in difficult circumstances such as serious occlusion and dazzle lighting. This study introduces a novel sequential neural network model with a spatial-temporal attention mechanism to focus on key features of lane lines and exploit salient ST correlations among continuous image frames. The proposed model, built on a standard encoder-decoder structure and common neural network backbones, is trained and evaluated on three large-scale open-source datasets. Extensive experiments demonstrate the strength and robustness of the proposed model, outperforming state-of-the-art methods in various testing scenarios. Furthermore, with the ST attention mechanism, the developed sequential neural network models exhibit fewer parameters and reduced Multiply-Accumulate Operations (MACs) compared to baseline sequential models, highlighting their computational efficiency. Relevant data, code, and models are released at https://doi.org/10.4121/4619cab6-ae4a-40d5-af77-582a77f3d821.
[171] Referring Industrial Anomaly Segmentation
Pengfei Yue,Xiaokang Jiang,Yilin Lu,Jianghang Lin,Shengchuan Zhang,Liujuan Cao
Main category: cs.CV
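TL;DR: This paper proposes Referring Industrial Anomaly Segmentation (RIAS), which uses text descriptions to guide industrial anomaly detection and segmentation, removing the need for manual thresholds and the one-model-per-class restriction. It also introduces the MVTec-Ref dataset and the DQFormer model, and reports progress on small-anomaly detection and open-set capability.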
Details
Motivation: In traditional industrial anomaly detection, unsupervised methods localize coarsely and need manually set thresholds, supervised methods overfit because data are scarce and imbalanced, and both are generally confined to the "one class, one model" paradigm.
Method: Propose the RIAS paradigm for text-guided anomaly segmentation; build the MVTec-Ref dataset (95% small anomalies, with diverse referring expressions); and design the DQFormer model, which uses dual query tokens (Anomaly/Background) and Language-Gated Multi-Level Aggregation (LMA) for efficient visual-textual fusion.
Result: The approach generates precise masks without manual thresholds, detects multiple anomaly types with a single model, markedly improves localization of small anomalies, and moves IAD toward open-set detection.
Conclusion: Through language guidance and a unified architecture, RIAS overcomes the limitations of traditional IAD methods and offers a new paradigm for flexible, robust, and scalable anomaly understanding in industrial settings.
Abstract: Industrial Anomaly Detection (IAD) is vital for manufacturing, yet traditional methods face significant challenges: unsupervised approaches yield rough localizations requiring manual thresholds, while supervised methods overfit due to scarce, imbalanced data. Both suffer from the "One Anomaly Class, One Model" limitation. To address this, we propose Referring Industrial Anomaly Segmentation (RIAS), a paradigm leveraging language to guide detection. RIAS generates precise masks from text descriptions without manual thresholds and uses universal prompts to detect diverse anomalies with a single model. We introduce the MVTec-Ref dataset to support this, designed with diverse referring expressions and focusing on anomaly patterns, notably with 95% small anomalies. We also propose the Dual Query Token with Mask Group Transformer (DQFormer) benchmark, enhanced by Language-Gated Multi-Level Aggregation (LMA) to improve multi-scale segmentation. Unlike traditional methods using redundant queries, DQFormer employs only "Anomaly" and "Background" tokens for efficient visual-textual integration. Experiments demonstrate RIAS's effectiveness in advancing IAD toward open-set capabilities. Code: https://github.com/swagger-coder/RIAS-MVTec-Ref.
[172] RegionReasoner: Region-Grounded Multi-Round Visual Reasoning
Wenfang Sun,Hao Chen,Yingjun Du,Yefeng Zheng,Cees G. M. Snoek
Main category: cs.CV
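TL;DR: This paper proposes RegionReasoner, a reinforcement-learning-based framework for multi-round visual reasoning. By introducing region citation and a global-local consistency reward, it significantly improves multi-round reasoning accuracy, spatial grounding precision, and semantic consistency on the newly built RegionDial-Bench benchmark.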
Details
Motivation: Existing large vision-language models mostly rely on single-step or text-only reasoning, struggle to refine their understanding iteratively across multiple visual contexts, and lack systematic evaluation and modeling of multi-round visual reasoning.
Method: Build RegionDial-Bench, a multi-round visual reasoning benchmark covering detection and segmentation tasks; propose the RegionReasoner framework, which uses reinforcement learning, requires every reasoning step to explicitly cite the corresponding bounding boxes, and uses a structured reward that combines grounding fidelity with global-local semantic alignment.
Result: RegionReasoner-7B significantly improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency on RegionDial-Bench.
Conclusion: RegionReasoner offers an effective paradigm for multi-round, grounded, semantically consistent visual reasoning, and RegionDial-Bench fills the gap in systematic evaluation for this direction; together they establish a strong baseline for this emerging research area.
Abstract: Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global-local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global-local semantic alignment. Experiments on detection and segmentation tasks show that RegionReasoner-7B, together with our newly introduced benchmark RegionDial-Bench, considerably improves multi-round reasoning accuracy, spatial grounding precision, and global-local consistency, establishing a strong baseline for this emerging research direction.
[173] Edge-Optimized Vision-Language Models for Underground Infrastructure Assessment
Johny J. Lopez,Md Meftahul Ferdaus,Mahdi Abdelguerfi
Main category: cs.CV
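TL;DR: This paper presents a lightweight two-stage AI pipeline for end-to-end defect summarization in autonomous inspection of underground infrastructure such as sewer pipes. It combines the authors' lightweight segmentation model RAPID-SCAN with a fine-tuned Phi-3.5 vision-language model to produce real-time, readable, domain-adapted natural-language defect summaries on edge devices.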
Details
Motivation: Automated inspection of underground infrastructure is critical to public safety and urban sustainability, but turning visual detections into human-readable summaries on resource-constrained edge devices remains difficult.
Method: A two-stage pipeline: the first stage uses the lightweight RAPID-SCAN model (0.64M parameters, F1 = 0.834) for defect segmentation; the second stage uses a fine-tuned Phi-3.5 VLM to turn the segmentation results into natural-language summaries. A dedicated dataset with manually verified descriptions is built, and post-training quantization with hardware-specific optimization improves real-time performance at the edge.
Result: The pipeline was deployed and validated on a mobile robotic platform, with significantly reduced model size and inference latency and preserved summary quality, supporting real-time defect summarization in real-world scenarios.
Conclusion: This edge-deployable integrated AI system helps bridge the gap between automated defect detection and actionable maintenance decisions, offering a feasible path toward scalable, fully autonomous infrastructure inspection.
Abstract: Autonomous inspection of underground infrastructure, such as sewer and culvert systems, is critical to public safety and urban sustainability. Although robotic platforms equipped with visual sensors can efficiently detect structural deficiencies, the automated generation of human-readable summaries from these detections remains a significant challenge, especially on resource-constrained edge devices. This paper presents a novel two-stage pipeline for end-to-end summarization of underground deficiencies, combining our lightweight RAPID-SCAN segmentation model with a fine-tuned Vision-Language Model (VLM) deployed on an edge computing platform. The first stage employs RAPID-SCAN (Resource-Aware Pipeline Inspection and Defect Segmentation using Compact Adaptive Network), achieving 0.834 F1-score with only 0.64M parameters for efficient defect segmentation. The second stage utilizes a fine-tuned Phi-3.5 VLM that generates concise, domain-specific summaries in natural language from the segmentation outputs. We introduce a curated dataset of inspection images with manually verified descriptions for VLM fine-tuning and evaluation. To enable real-time performance, we employ post-training quantization with hardware-specific optimization, achieving significant reductions in model size and inference latency without compromising summarization quality. We deploy and evaluate our complete pipeline on a mobile robotic platform, demonstrating its effectiveness in real-world inspection scenarios. Our results show the potential of edge-deployable integrated AI systems to bridge the gap between automated defect detection and actionable insights for infrastructure maintenance, paving the way for more scalable and autonomous inspection solutions.
[174] LIVE: Long-horizon Interactive Video World Modeling
Junchao Huang,Ziyang Ye,Xinting Hu,Tianyu He,Guiyu Zhang,Shaoshuai Shi,Jiang Bian,Li Jiang
Main category: cs.CV
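TL;DR: This paper proposes LIVE, which uses a cycle-consistency objective to bound error accumulation in long-horizon video generation. It needs no teacher-model distillation and significantly improves the quality and stability of long-horizon generation.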
Details
Motivation: Autoregressive video world models degrade over long horizons because errors accumulate. Existing methods rely on pre-trained teacher models and sequence-level distribution matching, which are computationally expensive and do not stop error propagation beyond the training horizon.
Method: LIVE introduces a cycle-consistency objective that pairs a forward rollout with a reverse reconstruction: it generates forward from ground-truth frames, reconstructs the initial state in reverse, and applies the diffusion loss to the reconstructed terminal state to explicitly constrain long-range error. It also uses a progressive training curriculum and gives a unified view of different approaches.
Result: LIVE reaches state-of-the-art performance on long-horizon benchmarks and stably generates high-quality videos far beyond the training rollout length.
Conclusion: A cycle-consistency constraint can effectively control error propagation in long-horizon video generation without relying on a teacher model, giving video world models a more efficient and robust training paradigm.
Abstract: Autoregressive video world models predict future visual observations conditioned on actions. While effective over short horizons, these models often struggle with long-horizon generation, as small prediction errors accumulate over time. Prior methods alleviate this by introducing pre-trained teacher models and sequence-level distribution matching, which incur additional computational cost and fail to prevent error propagation beyond the training horizon. In this work, we propose LIVE, a Long-horizon Interactive Video world modEl that enforces bounded error accumulation via a novel cycle-consistency objective, thereby eliminating the need for teacher-based distillation. Specifically, LIVE first performs a forward rollout from ground-truth frames and then applies a reverse generation process to reconstruct the initial state. The diffusion loss is subsequently computed on the reconstructed terminal state, providing an explicit constraint on long-horizon error propagation. Moreover, we provide a unified view that encompasses different approaches and introduce a progressive training curriculum to stabilize training. Experiments demonstrate that LIVE achieves state-of-the-art performance on long-horizon benchmarks, generating stable, high-quality videos far beyond training rollout lengths.
[175] See-through: Single-image Layer Decomposition for Anime Characters
Jian Lin,Chengze Li,Haoyun Qin,Kwun Wang Chan,Yanghua Jin,Hanyuan Liu,Stephen Chun Wang Choy,Xueting Liu
Main category: cs.CV
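TL;DR: This paper proposes a framework that automatically converts static anime illustrations into manipulatable 2.5D models, using image decomposition, a diffusion model, and pseudo-depth inference to reconstruct high-fidelity layers without manual segmentation or inpainting.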
Details
Motivation: Current professional workflows depend on tedious manual segmentation and on artists subjectively filling in ("hallucinating") occluded regions, which is slow and hard to keep consistent.
Method: A new framework that (1) decomposes a single image into fully inpainted, semantically distinct layers with a plausible drawing order; (2) builds a data generation engine on commercial Live2D models to ease the shortage of annotations; and (3) couples a diffusion-based Body Part Consistency Module with pixel-level pseudo-depth inference to ensure geometric consistency and to model intricate layering such as interleaving hair strands.
Result: The method produces high-fidelity 2.5D models that can be manipulated in real time and are suitable for professional animation.
Conclusion: The method substantially lowers the barrier to 2.5D modeling, automating the path from a single image to a dynamically drivable model while preserving semantic accuracy and geometric plausibility.
Abstract: We introduce a framework that automates the transformation of static anime illustrations into manipulatable 2.5D models. Current professional workflows require tedious manual segmentation and the artistic "hallucination" of occluded regions to enable motion. Our approach overcomes this by decomposing a single image into fully inpainted, semantically distinct layers with inferred drawing orders. To address the scarcity of training data, we introduce a scalable engine that bootstraps high-quality supervision from commercial Live2D models, capturing pixel-perfect semantics and hidden geometry. Our methodology couples a diffusion-based Body Part Consistency Module, which enforces global geometric coherence, with a pixel-level pseudo-depth inference mechanism. This combination resolves the intricate stratification of anime characters, e.g., interleaving hair strands, allowing for dynamic layer reconstruction. We demonstrate that our approach yields high-fidelity, manipulatable models suitable for professional, real-time animation applications.
[176] Zero-shot large vision-language model prompting for automated bone identification in paleoradiology x-ray archives
Owen Dong,Lily Gao,Manish Kota,Bennett A. Landman,Jelena Bekvalac,Gaynor Western,Katherine D. Van Schaik
Main category: cs.CV
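TL;DR: This paper proposes a zero-shot prompting strategy based on a large vision-language model (LVLM) that automatically identifies the main bone, projection view, and laterality in paleoradiology images, significantly improving content navigation and initial triage of paleoradiology datasets.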
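The two mechanical steps of the pipeline in this entry, intensity windowing before export and parsing the model's structured reply, can be sketched as follows. The window center and width, the JSON field names, and the sample reply are all assumptions for illustration; the entry does not state the actual values or schema.

```python
import json
from typing import Dict, List


def apply_window(pixels: List[List[float]], center: float, width: float) -> List[List[int]]:
    """Linearly map raw intensities inside [center - width/2, center + width/2] to 0..255."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    return [
        [round(255 * (min(max(p, lo), hi) - lo) / (hi - lo)) for p in row]
        for row in pixels
    ]


def parse_reply(text: str) -> Dict[str, str]:
    """Parse the model's JSON reply and check the (hypothetical) required fields."""
    record = json.loads(text)
    missing = {"main_bone", "projection_view", "laterality"} - set(record)
    if missing:
        raise ValueError(f"reply is missing fields: {sorted(missing)}")
    return record


# Hypothetical window settings and a made-up model reply.
windowed = apply_window([[-200.0, 0.0, 1000.0, 3000.0]], center=500.0, width=1000.0)
reply = parse_reply(
    '{"main_bone": "femur", "projection_view": "AP", "laterality": "left", "confidence": "medium"}'
)
print(windowed, reply["main_bone"])
```

Values below the window clamp to 0 and values above it clamp to 255, which is what keeps dense bone visible in an 8-bit PNG; validating the reply's fields before writing a spreadsheet row is a cheap guard against malformed model output.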
Details
Motivation: Paleoradiology images are highly heterogeneous (disarticulated bones, ad hoc positioning, missing laterality markers) and vary with age at death, age of bone, sex, and imaging equipment, which makes content retrieval and expert analysis slow.
Method: Convert raw DICOM files to bone-windowed PNG images, submit them to a state-of-the-art LVLM with a carefully designed prompt, obtain structured JSON output, and automatically collect it into a spreadsheet for validation.
Result: On a random sample of 100 images, accuracy was 92% for main bone, 80% for projection view, and 100% for laterality; ambiguous cases were flagged with low or medium confidence.
Conclusion: LVLMs can significantly accelerate coding and content navigation for large paleoradiology datasets and may improve the efficiency of future anthropology research workflows.
Abstract: Paleoradiology, the use of modern imaging technologies to study archaeological and anthropological remains, offers new windows on millennial-scale patterns of human health. Unfortunately, the radiographs collected during field campaigns are heterogeneous: bones are disarticulated, positioning is ad hoc, and laterality markers are often absent. Additionally, factors such as age at death, age of bone, sex, and imaging equipment introduce high variability. Thus, content navigation, such as identifying a subset of images with a specific projection view, can be time-consuming and difficult, making efficient triaging a bottleneck for expert analysis. We report a zero-shot prompting strategy that leverages a state-of-the-art Large Vision Language Model (LVLM) to automatically identify the main bone, projection view, and laterality in such images. Our pipeline converts raw DICOM files to bone-windowed PNGs, submits them to the LVLM with a carefully engineered prompt, and receives structured JSON outputs, which are extracted and formatted onto a spreadsheet in preparation for validation. On a random sample of 100 images reviewed by an expert board-certified paleoradiologist, the system achieved 92% main bone accuracy, 80% projection view accuracy, and 100% laterality accuracy, with low or medium confidence flags for ambiguous cases. These results suggest that LVLMs can substantially accelerate code word development for large paleoradiology datasets, allowing for efficient content navigation in future anthropology workflows.
[177] Test-Time Conditioning with Representation-Aligned Visual Features
Nicolas Sereyjol-Garros,Ellington Kirby,Victor Letzelter,Victor Besnier,Nermin Samet
Main category: cs.CV
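TL;DR: This paper proposes REPA-G, a framework that uses representations aligned with self-supervised models to provide feature-based conditioning at diffusion-model inference time, supporting multi-scale control and multi-concept composition.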
Details
Motivation: Prior work focuses on the role of representation alignment in training diffusion models, while its potential for conditioning at inference time has not been fully explored.
Method: Representation-Aligned Guidance (REPA-G) steers the denoising process by optimizing, at inference time, a similarity objective (the potential) against a conditioning representation extracted by a pre-trained feature extractor. It supports control at multiple scales, from image patches to global features, as well as multi-concept composition.
Result: The method achieves high-quality, diverse image generation on ImageNet and COCO and provides more flexible and precise inference-time control than text prompts or class labels.
Conclusion: REPA-G is a general conditioning method that works purely at inference time without fine-tuning, is theoretically grounded and empirically effective, and offers a new paradigm for controllable generation with diffusion models.
Abstract: While representation alignment with self-supervised models has been shown to improve diffusion model training, its potential for enhancing inference-time conditioning remains largely unexplored. We introduce Representation-Aligned Guidance (REPA-G), a framework that leverages these aligned representations, with rich semantic properties, to enable test-time conditioning from features in generation. By optimizing a similarity objective (the potential) at inference, we steer the denoising process toward a conditioned representation extracted from a pre-trained feature extractor. Our method provides versatile control at multiple scales, ranging from fine-grained texture matching via single patches to broad semantic guidance using global image feature tokens. We further extend this to multi-concept composition, allowing for the faithful combination of distinct concepts. REPA-G operates entirely at inference time, offering a flexible and precise alternative to often ambiguous text prompts or coarse class labels. We theoretically justify how this guidance enables sampling from the potential-induced tilted distribution. Quantitative results on ImageNet and COCO demonstrate that our approach achieves high-quality, diverse generations. Code is available at https://github.com/valeoai/REPA-G.
[178] RAWDet-7: A Multi-Scenario Benchmark for Object Detection and Description on Quantized RAW Images
Mishal Fatima,Shashank Agnihotri,Kanchana Vaishnavi Gandikota,Michael Moeller,Margret Keuper
Main category: cs.CV
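TL;DR: This paper introduces RAWDet-7, a large-scale RAW image dataset that supports research on object detection and description from raw sensor data, with an emphasis on evaluating performance under low-bit quantization.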
Details
Motivation: Most vision models are trained on RGB images processed by an ISP that is tuned for human perception and may discard sensor-level information useful for machine reasoning. RAW images preserve richer scene detail, which can help object detection and description.
Method: Build the RAWDet-7 dataset with about 25k training and 7.6k test RAW images covering a range of cameras, lighting conditions, and environments; densely annotate seven object categories following MS-COCO/LVIS conventions; derive object-level descriptions from the corresponding sRGB images; and support simulated 4-, 6-, and 8-bit quantization with benchmarks for detection, description quality, and generalization.
Result: Provides the first joint benchmark for object detection and description on low-bit RAW images, supporting research on fine-grained detail, spatial relationships, and contextual information.
Conclusion: RAWDet-7 advances machine vision research on raw sensor data and provides a key resource and evaluation framework for moving beyond conventional ISP limits and improving model robustness and sensitivity to detail.
Abstract: Most vision models are trained on RGB images processed through ISP pipelines optimized for human perception, which can discard sensor-level information useful for machine reasoning. RAW images preserve unprocessed scene data, enabling models to leverage richer cues for both object detection and object description, capturing fine-grained details, spatial relationships, and contextual information often lost in processed images. To support research in this domain, we introduce RAWDet-7, a large-scale dataset of ~25k training and 7.6k test RAW images collected across diverse cameras, lighting conditions, and environments, densely annotated for seven object categories following MS-COCO and LVIS conventions. In addition, we provide object-level descriptions derived from the corresponding high-resolution sRGB images, facilitating the study of object-level information preservation under RAW image processing and low-bit quantization. The dataset allows evaluation under simulated 4-bit, 6-bit, and 8-bit quantization, reflecting realistic sensor constraints, and provides a benchmark for studying detection performance, description quality & detail, and generalization in low-bit RAW image processing. Dataset and code will be released upon acceptance.
[179] FOVI: A biologically-inspired foveated interface for deep vision models
Nicholas M. Blauch,George A. Alvarez,Talia Konkle
Main category: cs.CV
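TL;DR: This paper proposes a foveated vision interface (FOVI) inspired by the human retina and primary visual cortex. It maps a variable-resolution sensor array onto a uniformly dense, V1-like sensor manifold and computes efficiently through k-nearest-neighbor convolution with a novel kernel mapping technique. Validated in an end-to-end CNN and a DINOv3 ViT, it achieves competitive performance on high-resolution egocentric vision at markedly lower computational cost.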
Details
Motivation: Human vision is foveated, with high resolution at the center and low resolution in the periphery, whereas mainstream computer vision systems use uniform resolution and struggle to process full-field high-resolution images efficiently. The paper draws on biological vision to build a more efficient, scalable active sensing system.
Method: The FOVI interface maps a foveated sensor array onto a uniformly dense, V1-like manifold; defines k-nearest-neighbor (kNN) receptive fields on that manifold; and introduces a novel kernel mapping that enables kNN-convolution. It is applied to two models: an end-to-end kNN-CNN and a LoRA-adapted DINOv3 ViT.
Result: The method matches non-foveated baselines across several vision tasks at significantly lower computational cost; code and pre-trained models are open source.
Conclusion: FOVI offers a biologically inspired, efficient encoding paradigm for high-resolution egocentric vision and shows that foveated structure can be integrated into modern vision models, including ViTs, to improve efficiency.
Abstract: Human vision is foveated, with variable resolution peaking at the center of a large field of view; this reflects an efficient trade-off for active sensing, allowing eye-movements to bring different parts of the world into focus with other parts of the world in context. In contrast, most computer vision systems encode the visual world at a uniform resolution, raising challenges for processing full-field high-resolution images efficiently. We propose a foveated vision interface (FOVI) based on the human retina and primary visual cortex, that reformats a variable-resolution retina-like sensor array into a uniformly dense, V1-like sensor manifold. Receptive fields are defined as k-nearest-neighborhoods (kNNs) on the sensor manifold, enabling kNN-convolution via a novel kernel mapping technique. We demonstrate two use cases: (1) an end-to-end kNN-convolutional architecture, and (2) a foveated adaptation of the foundational DINOv3 ViT model, leveraging low-rank adaptation (LoRA). These models provide competitive performance at a fraction of the computational cost of non-foveated baselines, opening pathways for efficient and scalable active sensing for high-resolution egocentric vision. Code and pre-trained models are available at https://github.com/nblauch/fovi and https://huggingface.co/fovi-pytorch.
[180] QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization
Yuhao Xu,Yantai Yang,Zhenyang Fan,Yufan Liu,Yuming Li,Bing Li,Zhipeng Zhang
Main category: cs.CV
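TL;DR: This paper proposes QVLA, an action-centric quantization framework designed for embodied control. It uses a channel-wise bit allocation strategy that directly measures action-space sensitivity, unifies quantization and pruning in a single optimization, and sharply reduces compute requirements while preserving performance.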
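The idea of channel-wise bit allocation driven by action-space error can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the "action" is a plain dot product, the bit-width menu is arbitrary, and the greedy loop stands in for the global optimization the entry mentions without describing.

```python
from typing import List

BIT_CHOICES = [8, 4, 2, 0]  # hypothetical menu; 0 bits means the channel is pruned


def quantize(weights: List[float], bits: int) -> List[float]:
    """Uniform symmetric quantization of one channel's weights."""
    if bits == 0:
        return [0.0] * len(weights)
    peak = max(abs(w) for w in weights) or 1.0
    scale = peak / (2 ** (bits - 1) - 1)
    return [round(w / scale) * scale for w in weights]


def action_error(weights: List[float], inputs: List[float], bits: int) -> float:
    """Toy sensitivity: change in this channel's contribution to a linear 'action'."""
    exact = sum(w * x for w, x in zip(weights, inputs))
    approx = sum(q * x for q, x in zip(quantize(weights, bits), inputs))
    return abs(approx - exact)


def allocate_bits(channels: List[List[float]], inputs: List[List[float]], budget: int) -> List[int]:
    """Greedily lower the bit-width of whichever channel hurts the action least."""
    level = [0] * len(channels)  # index into BIT_CHOICES for each channel
    while sum(BIT_CHOICES[l] for l in level) > budget:
        options = [
            (action_error(channels[c], inputs[c], BIT_CHOICES[l + 1])
             - action_error(channels[c], inputs[c], BIT_CHOICES[l]), c)
            for c, l in enumerate(level) if l + 1 < len(BIT_CHOICES)
        ]
        if not options:
            break  # every channel is already pruned
        _, cheapest = min(options)
        level[cheapest] += 1
    return [BIT_CHOICES[l] for l in level]


channels = [[0.9, -0.7], [0.001, 0.002]]  # one influential channel, one negligible one
inputs = [[1.0, 1.0], [1.0, 1.0]]
allocation = allocate_bits(channels, inputs, budget=8)
print(allocation)
```

With a total budget of 8 bits, the negligible channel is driven down to 0 bits (pruned) while the influential channel keeps its full width, which is the behavior a uniform bit-width cannot express.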
Details
Motivation: Existing VLA models are computationally demanding and hard to deploy on resource-constrained robotic platforms. Directly reusing uniform-bit quantization from large language models ignores how action deviations accumulate into task failures, and a systematic quantization analysis specific to VLA models is missing.
Method: The QVLA framework uses a fine-grained, channel-wise bit allocation strategy: it quantizes each channel to different bit-widths, measures the effect on the final action space to obtain a per-channel importance metric, and unifies quantization and pruning (0-bit) in one global optimization.
Result: On the LIBERO benchmark, OpenVLA-OFT quantized with QVLA needs only 29.2% of the original model's VRAM, keeps 98.9% of its performance, and runs 1.49x faster, a 22.6% performance improvement over the LLM-derived method SmoothQuant.
Conclusion: QVLA establishes a new, principled foundation for compressing VLA models in robotics and helps bring large-scale models to real hardware.
Abstract: The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Intuitively, low-bit quantization is a prevalent and preferred technique for large-scale model compression. However, we find that a systematic analysis of VLA model quantization is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-based methods, QVLA introduces a highly granular, channel-wise bit allocation strategy. Its core mechanism is to directly measure the final action-space sensitivity when quantizing each individual channel to various bit-widths. This process yields a precise, per-channel importance metric that guides a global optimization, which elegantly unifies quantization and pruning (0-bit) into a single, cohesive framework. Extensive evaluations on different baselines demonstrate the superiority of our approach. On LIBERO, the quantized version of OpenVLA-OFT with our method requires only 29.2% of the original model's VRAM while maintaining 98.9% of its original performance and achieving a 1.49x speedup. This translates to a 22.6% performance improvement over the LLM-derived method SmoothQuant. Our work establishes a new, principled foundation for compressing VLA models in robotics, paving the way for deploying powerful, large-scale models on real-world hardware. Code will be released.
[181] From Pre- to Intra-operative MRI: Predicting Brain Shift in Temporal Lobe Resection for Epilepsy Surgery
Jingjing Peng,Giorgio Fiore,Yang Liu,Ksenia Ellum,Debayan Daspupta,Keyoumars Ashkan,Andrew McEvoy,Anna Miserocchi,Sebastien Ourselin,John Duncan,Alejandro Granados
Main category: cs.CV
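TL;DR: This paper proposes NeuralShift, a U-Net model that predicts brain shift during temporal lobe resection from preoperative MRI alone, with the aim of improving the precision of neuronavigation systems.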
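The two evaluation measures named in this entry, the DICE score and the Target Registration Error (TRE), have standard definitions that can be sketched in a few lines. The masks and landmark coordinates below are toy values, and the function names are illustrative rather than taken from the authors' code.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def dice_score(pred_mask: Sequence[int], true_mask: Sequence[int]) -> float:
    """Dice overlap 2|A∩B| / (|A| + |B|) for two flattened binary masks."""
    overlap = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    return 2.0 * overlap / total if total else 1.0


def target_registration_error(pred_points: Sequence[Point], true_points: Sequence[Point]) -> float:
    """Mean Euclidean distance between predicted and reference landmarks (same units as input)."""
    distances = [math.dist(p, t) for p, t in zip(pred_points, true_points)]
    return sum(distances) / len(distances)


# Toy data: a 4-voxel mask and two landmarks in millimetres.
dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
tre = target_registration_error(
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)],
)
print(dice, tre)
```

DICE summarizes global overlap of the predicted and reference brain masks, while TRE measures local accuracy at specific anatomical landmarks, which is why the entry reports both.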
Details
Motivation: Brain shift invalidates the preoperative MRI once the skull is opened, so an updated intraoperative MRI with brain shift compensation is needed to improve neuronavigation accuracy and surgical outcomes.
Method: NeuralShift, a U-Net-based model, predicts brain shift from preoperative MRI alone; performance is evaluated with Target Registration Error (TRE) and DICE scores.
Result: The model reaches a DICE score of 0.97 for global deformation prediction and a landmark TRE as low as 1.12 mm for local displacement prediction.
Conclusion: NeuralShift can accurately predict brain shift during temporal lobe resection using only preoperative MRI, and may improve the safety and efficiency of neurosurgery and patient outcomes.
Abstract: Introduction: In neurosurgery, image-guided Neurosurgery Systems (IGNS) rely heavily on preoperative brain magnetic resonance images (MRI) to assist surgeons in locating surgical targets and determining surgical paths. However, brain shift invalidates the preoperative MRI after dural opening. Updated intraoperative brain MRI with brain shift compensation is crucial for enhancing the precision of neuronavigation systems and ensuring the optimal outcome of surgical interventions. Methodology: We propose NeuralShift, a U-Net-based model that predicts brain shift entirely from pre-operative MRI for patients undergoing temporal lobe resection. We evaluated our results using Target Registration Errors (TREs) computed on anatomical landmarks located on the resection side and along the midline, and DICE scores comparing predicted intraoperative masks with masks derived from intraoperative MRI. Results: Our experimental results show that our model can predict the global deformation of the brain (DICE of 0.97) with accurate local displacements (achieving a landmark TRE as low as 1.12 mm), compensating for large brain shifts during temporal lobe removal neurosurgery. Conclusion: Our proposed model is capable of predicting the global deformation of the brain during temporal lobe resection using only preoperative images, giving the surgical team an opportunity to increase the safety and efficiency of neurosurgery and improve outcomes for patients. Our contributions will be publicly available after acceptance at https://github.com/SurgicalDataScienceKCL/NeuralShift.
[182] 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
Zhixue Fang,Xu He,Songlin Tang,Haoxian Zhang,Qingfeng Li,Xiaoqiang Liu,Pengfei Wan,Kun Gai
Main category: cs.CV
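TL;DR: This paper proposes 3DiMo, an implicit, view-agnostic, 3D-aware motion representation. It jointly trains a motion encoder with a pretrained video generator to extract compact motion tokens and strengthens 3D motion understanding with multi-view and geometric supervision (SMPL is used only for initialization), significantly outperforming existing methods in motion fidelity and visual quality.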
Details
Motivation: Existing 2D pose control cannot support novel-view synthesis, while explicit 3D models such as SMPL introduce reconstruction errors that suppress the video generator's intrinsic 3D awareness. An implicit, view-agnostic motion representation that better matches the generator's spatial priors is therefore needed.
Method: The 3DiMo framework jointly trains a motion encoder with a pretrained video generator to distill driving frames into view-agnostic motion tokens, which are injected semantically via cross-attention; single-view, multi-view, and moving-camera videos provide multi-view consistency supervision; and auxiliary SMPL-based geometric supervision is used only for initialization and gradually annealed to zero.
Result: The method significantly surpasses existing methods in motion fidelity and visual quality, and supports faithful motion reproduction with flexible, text-driven camera control.
Conclusion: An implicit, view-agnostic motion representation works better with the video generator's 3D spatial priors and avoids the errors introduced by external 3D constraints, making it an effective way to improve controllable video generation.
Abstract: Existing methods for human motion control in video generation typically rely on either 2D poses or explicit 3D parametric models (e.g., SMPL) as control signals. However, 2D poses rigidly bind motion to the driving viewpoint, precluding novel-view synthesis. Explicit 3D models, though structurally informative, suffer from inherent inaccuracies (e.g., depth ambiguity and inaccurate dynamics) which, when used as a strong constraint, override the powerful intrinsic 3D awareness of large-scale video generators. In this work, we revisit motion control from a 3D-aware perspective, advocating for an implicit, view-agnostic motion representation that naturally aligns with the generator's spatial priors rather than depending on externally reconstructed constraints. We introduce 3DiMo, which jointly trains a motion encoder with a pretrained video generator to distill driving frames into compact, view-agnostic motion tokens, injected semantically via cross-attention. To foster 3D awareness, we train with view-rich supervision (i.e., single-view, multi-view, and moving-camera videos), forcing motion consistency across diverse viewpoints. Additionally, we use auxiliary geometric supervision that leverages SMPL only for early initialization and is annealed to zero, enabling the model to transition from external 3D guidance to learning genuine 3D spatial motion understanding from the data and the generator's priors. Experiments confirm that 3DiMo faithfully reproduces driving motions with flexible, text-driven camera control, significantly surpassing existing methods in both motion fidelity and visual quality.
[183] Progressive Checkerboards for Autoregressive Multiscale Image Generation
David Eigen
Main category: cs.CV
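TL;DR: This paper proposes a multiscale autoregressive image generation method based on a progressive checkerboard ordering. By sampling evenly at every level of a quadtree, it enables effective conditioning both between and within scales, and on ImageNet it matches state-of-the-art performance with fewer sampling steps.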
Details
Motivation: Resolve the tension in autoregressive image generation between parallel sampling and modeling sequential dependencies, improving sampling efficiency while preserving modeling capacity.
Method: Use a fixed yet flexible progressive checkerboard sampling order that samples evenly, level by level, according to a quadtree structure in a multiscale pyramid, supporting conditioning both between and within scales.
Result: On class-conditional ImageNet, the method matches current state-of-the-art autoregressive models with fewer serial sampling steps; with the total number of steps held fixed, a wide range of scale-up factors perform similarly.
Conclusion: The progressive checkerboard ordering provides an efficient, balanced approach to multiscale autoregressive modeling that combines parallelism with dependency modeling.
Abstract: A key challenge in autoregressive image generation is to efficiently sample independent locations in parallel, while still modeling mutual dependencies with serial conditioning. Some recent works have addressed this by conditioning between scales in a multiscale pyramid. Others have looked at parallelizing samples in a single image using regular partitions or randomized orders. In this work we examine a flexible, fixed ordering based on progressive checkerboards for multiscale autoregressive image generation. Our ordering draws samples in parallel from evenly spaced regions at each scale, maintaining full balance in all levels of a quadtree subdivision at each step. This enables effective conditioning both between and within scales. Intriguingly, we find evidence that in our balanced setting, a wide range of scale-up factors lead to similar results, so long as the total number of serial steps is constant. On class-conditional ImageNet, our method achieves competitive performance compared to recent state-of-the-art autoregressive systems with like model capacity, using fewer sampling steps.
[184] Fast-Slow Efficient Training for Multimodal Large Language Models via Visual Token Pruning
Dingkun Zhang,Shuhan Qi,Yulin Wu,Xinyu Xiao,Xuan Wang,Long Chen
Main category: cs.CV
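TL;DR: This paper proposes DualSpeed, a framework with a fast-slow dual-mode training strategy that reduces the number of visual tokens to improve training efficiency while keeping inference-time performance from degrading.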
Details
Motivation: Training multimodal large language models (MLLMs) is inefficient, mainly because of their large model sizes and large numbers of visual tokens; existing efficient-training methods mostly focus on shrinking the model or the set of trainable parameters, whereas this paper explores a new direction: reducing visual input through Visual Token Pruning (VTP) to improve training efficiency. Method: Proposes the DualSpeed dual-speed training framework: the fast mode uses VTP plugins to reduce visual tokens and introduces a mode isolator to isolate the model's behaviors; the slow mode trains on full visual sequences to preserve training-inference consistency and uses self-distillation to learn from the fast mode. Result: Experiments show that DualSpeed speeds up the training of LLaVA-1.5 and LLaVA-NeXT by 2.1× and 4.0×, respectively, while retaining over 99% of performance. Conclusion: DualSpeed effectively resolves the training-inference mismatch caused by applying VTP during training, accelerating training substantially without degrading performance and offering a new paradigm for efficient MLLM training. Abstract: Multimodal Large Language Models (MLLMs) suffer from severe training inefficiency, which is associated with their massive model sizes and visual token numbers. Existing efforts in efficient training focus on reducing model sizes or trainable parameters. Inspired by the success of Visual Token Pruning (VTP) in improving inference efficiency, we explore another substantial research direction for efficient training by reducing visual tokens. However, applying VTP at the training stage results in a training-inference mismatch: pruning-trained models perform poorly when inferring on non-pruned full visual token sequences. To close this gap, we propose DualSpeed, a fast-slow framework for efficient training of MLLMs. The fast-mode is the primary mode, which incorporates existing VTP methods as plugins to reduce visual tokens, along with a mode isolator to isolate the model's behaviors. The slow-mode is the auxiliary mode, where the model is trained on full visual sequences to retain training-inference consistency. To boost its training, it further leverages self-distillation to learn from the sufficiently trained fast-mode. Together, DualSpeed can achieve both training efficiency and non-degraded performance. Experiments show DualSpeed accelerates the training of LLaVA-1.5 by 2.1$\times$ and LLaVA-NeXT by 4.0$\times$, retaining over 99% performance. Code: https://github.com/dingkun-zhang/DualSpeed
[185] Continuous Control of Editing Models via Adaptive-Origin Guidance
Alon Wolf,Chen Katzir,Kfir Aberman,Or Patashnik
Main category: cs.CV
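TL;DR: This paper proposes Adaptive-Origin Guidance (AdaOr), a new method for continuous, smooth control over the strength of text-guided image and video edits in diffusion models. It replaces the conventional unconditional prediction with an identity-conditioned prediction, consistent with the input content, as an adaptive guidance origin, which avoids abrupt changes as the edit strength varies.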
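As a rough illustration of the adaptive-origin idea in this entry, the sketch below blends an identity-conditioned prediction with the unconditional prediction according to the edit strength, then applies classifier-free guidance from that blended origin. The function name `adaptive_origin_guidance`, the linear blend, and the choice to scale the guidance weight by the strength are assumptions made for illustration. The entry does not state the exact formula, so this should not be read as the authors' implementation.

```python
import numpy as np


def adaptive_origin_guidance(eps_cond, eps_uncond, eps_identity,
                             strength, guidance_scale):
    """Hypothetical sketch of guidance with an adaptive origin.

    eps_cond:     prediction conditioned on the edit instruction
    eps_uncond:   standard unconditional prediction
    eps_identity: prediction conditioned on an identity ("no change") instruction
    strength:     edit strength in [0, 1]
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    # Assumed linear interpolation between the identity and unconditional origins.
    origin = (1.0 - strength) * eps_identity + strength * eps_uncond
    # Assumed schedule: the guidance weight grows with the edit strength,
    # so strength 0 returns the identity prediction unchanged.
    effective_scale = strength * guidance_scale
    return origin + effective_scale * (eps_cond - origin)


# Toy predictions standing in for a denoiser's outputs.
rng = np.random.default_rng(0)
eps_cond, eps_uncond, eps_identity = rng.normal(size=(3, 4, 4))

no_edit = adaptive_origin_guidance(eps_cond, eps_uncond, eps_identity,
                                   strength=0.0, guidance_scale=7.5)
full_edit = adaptive_origin_guidance(eps_cond, eps_uncond, eps_identity,
                                     strength=1.0, guidance_scale=7.5)
```

Under these assumptions, strength 0 reproduces the identity prediction and strength 1 reduces to standard classifier-free guidance, so intermediate strengths move continuously between the two.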
Details
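Motivation: Existing diffusion-based editing models lack a mechanism for smooth control over the strength of text-guided edits; standard Classifier-Free Guidance (CFG) cannot produce a continuous transition from the input to the edited result in editing tasks, because the unconditional prediction dominates generation at low CFG scales and deviates semantically from the input. Method: Proposes Adaptive-Origin Guidance (AdaOr), which uses an identity instruction to produce an identity-manipulation prediction and interpolates it with the unconditional prediction according to the edit strength, forming an adaptive guidance origin that replaces the fixed unconditional prediction of standard CFG; the method plugs directly into the standard training framework and needs no extra data or edit-specific procedure. Result: On image and video editing tasks, AdaOr provides smoother and more consistent control over edit strength than existing slider-based editing methods; it supports fine-grained adjustment at inference time without relying on specialized datasets or per-edit customization. Conclusion: By reconstructing the guidance origin of CFG, AdaOr addresses the root cause of discontinuous edit-strength control in diffusion-based editing, offering a more reliable, flexible, and easily deployed control paradigm for semantic editing. Abstract: Diffusion-based editing models have emerged as a powerful tool for semantic image and video manipulation. However, existing models lack a mechanism for smoothly controlling the intensity of text-guided edits. In standard text-conditioned generation, Classifier-Free Guidance (CFG) impacts prompt adherence, suggesting it as a potential control for edit intensity in editing models. However, we show that scaling CFG in these models does not produce a smooth transition between the input and the edited result. We attribute this behavior to the unconditional prediction, which serves as the guidance origin and dominates the generation at low guidance scales, while representing an arbitrary manipulation of the input content. To enable continuous control, we introduce Adaptive-Origin Guidance (AdaOr), a method that adjusts this standard guidance origin with an identity-conditioned adaptive origin, using an identity instruction corresponding to the identity manipulation. By interpolating this identity prediction with the standard unconditional prediction according to the edit strength, we ensure a continuous transition from the input to the edited result. We evaluate our method on image and video editing tasks, demonstrating that it provides smoother and more consistent control compared to current slider-based editing approaches. 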
Our method incorporates an identity instruction into the standard training framework, enabling fine-grained control at inference time without per-edit procedure or reliance on specialized datasets.
[186] EventNeuS: 3D Mesh Reconstruction from a Single Event Camera
Shreyas Sachan,Viktor Rudnev,Mohamed Elgharib,Christian Theobalt,Vladislav Golyanik
Main category: cs.CV
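TL;DR: This paper presents EventNeuS, a self-supervised neural model that is the first to combine 3D signed distance function and density field learning with event-stream supervision to learn 3D representations from monocular color event streams; it also introduces spherical harmonics encoding to better model view-dependent effects, significantly improving the accuracy of dense 3D mesh reconstruction from event cameras.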