Deepak Gupta,Davis Bartels,Dina Demner-Fushman
Main category: cs.CL
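TL;DR: This paper proposes BioACE, a framework for automatically evaluating the quality of LLM-generated biomedical answers and their citations. It covers completeness, correctness, precision, and recall, and experiments measure how well each component correlates with human evaluation.
Details
Motivation: Evaluating LLM-generated text in the biomedical domain is hard because expert assessment is needed to verify consistency with the scientific literature and to handle complex medical terminology.
Method: BioACE evaluates answers for completeness, correctness, precision, and recall against ground-truth nuggets, and evaluates citation quality using natural language inference (NLI), pre-trained language models, and LLMs.
Result: Experiments assess how well each BioACE component correlates with human evaluation and identify the best-performing approaches for biomedical answer and citation evaluation.
Conclusion: BioACE offers an automated, reproducible evaluation option for biomedical question answering, RAG, and related tasks; the code is publicly available.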
Abstract: With the increasing use of large language models (LLMs) for generating answers to biomedical questions, it is crucial to evaluate the quality of the generated answers and the references provided to support the facts in the generated answers. Evaluation of text generated by LLMs remains a challenge for question answering, retrieval-augmented generation (RAG), summarization, and many other natural language processing tasks in the biomedical domain, because expert assessment is required to verify consistency with the scientific literature and to handle complex medical terminology. In this work, we propose BioACE, an automated framework for evaluating biomedical answers and citations against the facts stated in the answers. The proposed BioACE framework considers multiple aspects, including completeness, correctness, precision, and recall, in relation to the ground-truth nuggets for answer evaluation. We developed automated approaches to evaluate each of the aforementioned aspects and performed extensive experiments to assess and analyze their correlation with human evaluations. In addition, we considered multiple existing approaches, such as natural language inference (NLI), pre-trained language models, and LLMs, to evaluate the quality of evidence provided to support the generated answers in the form of citations into biomedical literature. Based on detailed experiments and analysis, we provide the best approaches for biomedical answer and citation evaluation as part of the BioACE evaluation package (https://github.com/deepaknlp/BioACE).
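As a rough illustration of nugget-based scoring, the sketch below computes precision and recall for a single answer. It assumes that each answer sentence has already been mapped to the set of ground-truth nugget IDs it supports; that mapping step, the function name `nugget_precision_recall`, and the data layout are hypothetical and are not taken from the BioACE package.

```python
def nugget_precision_recall(sentence_matches, gold_nuggets):
    """Score one answer against ground-truth nuggets (illustrative only).

    sentence_matches: list of sets; each set holds the nugget IDs that one
        answer sentence supports (an empty set means an unsupported sentence).
    gold_nuggets: set of all nugget IDs expected in a complete answer.
    """
    gold = set(gold_nuggets)
    # Precision: share of answer sentences that support at least one gold nugget.
    supported = sum(1 for m in sentence_matches if set(m) & gold)
    precision = supported / len(sentence_matches) if sentence_matches else 0.0
    # Recall: share of gold nuggets covered by at least one answer sentence.
    covered = set().union(*sentence_matches) & gold if sentence_matches else set()
    recall = len(covered) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: four answer sentences, four expected nuggets.
matches = [{"n1"}, set(), {"n2", "n3"}, {"n1"}]
precision, recall = nugget_precision_recall(matches, {"n1", "n2", "n3", "n4"})
print(precision, recall)
```

Here three of the four sentences are supported and three of the four nuggets are covered, so both scores come out at 0.75. How sentences are matched to nuggets (NLI, an LLM judge, or something else) is the hard part and is left outside this sketch.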
Zexin Lin,Jiachen Yu,Haoyang Zhang,Yuzhao Li,Zhonghang Li,Yujiu Yang,Junjie Wang,Xiaoqiang Ji
Main category: cs.CL
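TL;DR: CoWork-X is an active co-evolution framework for real-time collaborative tasks. It separates fast execution from slow optimization, combining an HTN-based skill library with budget-constrained, patch-style skill optimization, to deliver sustained gains across episodes while keeping latency and token overhead low.
Details
Motivation: Existing methods struggle to satisfy two constraints that highly cooperative tasks impose at once: sub-second real-time coordination and sustained multi-episode adaptation under a strict online token budget. Frequent in-episode reasoning induces latency and timing jitter, while post-episode improvements expressed as unstructured text are hard to turn into reliable, low-cost execution.
Method: CoWork-X has two core components: (1) a Skill-Agent that retrieves and executes skills via a hierarchical task network (HTN) from a structured, interpretable, and compositional skill library; and (2) a Co-Optimizer that consolidates the skill library after each episode in a patch-style manner, under explicit token-budget constraints and drift regularization, forming a closed optimization loop across episodes.
Result: On Overcooked-AI-like real-time collaboration benchmarks, CoWork-X achieves stable, cumulative performance gains while online latency and token usage steadily decrease.
Conclusion: Through fast-slow memory separation and structured skill evolution, CoWork-X balances real-time responsiveness, adaptability, and computational efficiency, offering an approach for deploying language-conditioned agents in resource-constrained collaborative settings.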
Abstract: Large language models are enabling language-conditioned agents in interactive environments, but highly cooperative tasks often impose two simultaneous constraints: sub-second real-time coordination and sustained multi-episode adaptation under a strict online token budget. Existing approaches either rely on frequent in-episode reasoning that induces latency and timing jitter, or deliver post-episode improvements through unstructured text that is difficult to compile into reliable low-cost execution. We propose CoWork-X, an active co-evolution framework that casts peer collaboration as a closed-loop optimization problem across episodes, inspired by fast-slow memory separation. CoWork-X instantiates a Skill-Agent that executes via HTN (hierarchical task network)-based skill retrieval from a structured, interpretable, and compositional skill library, and a post-episode Co-Optimizer that performs patch-style skill consolidation with explicit budget constraints and drift regularization. Experiments in challenging Overcooked-AI-like real-time collaboration benchmarks demonstrate that CoWork-X achieves stable, cumulative performance gains while steadily reducing online latency and token usage.
Sean Trott,Pamela D. Rivière
Main category: cs.CL
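TL;DR: This paper studies the "multilingual penalty", the tendency of multilingual language models to underperform monolingual ones, on lexical disambiguation and quantifies it with controlled experiments. The performance drop is linked to capacity limits at three levels (representational, attentional, and vocabulary-related), which statistically account for the performance difference attributed to multilingual status.
Details
Motivation: Multilingual LMs sometimes underperform their monolingual counterparts, possibly because of limited model capacity. The paper sets out to quantify this "multilingual penalty" for lexical disambiguation and examine its causes.
Method: Controlled experiments on English and Spanish datasets of human relatedness judgments compare monolingual and multilingual LMs from the same families on lexical disambiguation. Three candidate capacity constraints are then analyzed: embedding isotropy (representational), attention to disambiguating cues (attentional), and multi-token segmentation (vocabulary-related).
Result: Multilingual models perform consistently worse on lexical disambiguation. They show some evidence of all three constraints, and together these factors statistically account for the variance formerly attributed to multilingual status.
Conclusion: Multilingual LMs do suffer from multiple capacity constraints, and these constraints correlate with reduced disambiguation performance, which suggests that improving multilingual models means addressing these bottlenecks.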
Abstract: Multilingual language models (LMs) sometimes under-perform their monolingual counterparts, possibly due to capacity limitations. We quantify this "multilingual penalty" for lexical disambiguation (a task requiring precise semantic representations and contextualization mechanisms) using controlled datasets of human relatedness judgments for ambiguous words in both English and Spanish. Comparing monolingual and multilingual LMs from the same families, we find consistently reduced performance in multilingual LMs. We then explore three potential capacity constraints: representational (reduced embedding isotropy), attentional (reduced attention to disambiguating cues), and vocabulary-related (increased multi-token segmentation). Multilingual LMs show some evidence of all three limitations; moreover, these factors statistically account for the variance formerly attributed to a model's multilingual status. These findings suggest both that multilingual LMs do suffer from multiple capacity constraints, and that these constraints correlate with reduced disambiguation performance.
Sidi Lu,Zhenwen Liang,Dongyang Ma,Yan Wang,Haitao Mi,Dong Yu
Main category: cs.CL
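TL;DR: This paper proposes Locas, a locally-supported parametric memory that can be flexibly offloaded from or merged into model parameters, supporting efficient continual learning. Locas comes in two variants, one MLP-based and one with a GLU-FFN structure; the latter attaches easily to existing LLMs. Principled initialization that reuses model parameters, activations, and/or gradients speeds up convergence, improves generalization, and mitigates catastrophic forgetting. Experiments show that Locas stores context information with very small parameter overhead, and a comparative MMLU evaluation indicates that existing knowledge is largely preserved.
Details
Motivation: Bridge test-time training with a new type of parametric memory that can be flexibly offloaded or merged, addressing catastrophic forgetting and efficiency in continual learning.
Method: Locas (Locally-Supported parametric memory) shares the design of transformer FFN blocks and comes in two variants (a two-layer MLP and a GLU-FFN). These low-rank sideway-FFN-style memories are initialized in a principled way from model parameters, activations, and/or gradients.
Result: Validated on PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering: with as little as 0.02% additional parameters, Locas-GLU stores information from past context while keeping a much smaller context window. The MMLU evaluation shows that general capability is largely retained after memorizing a whole book.
Conclusion: Locas is an efficient, lightweight parametric memory that can be attached to existing transformer architectures, enabling continual learning and permanentizing past context into parametric knowledge while minimizing catastrophic forgetting.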
Abstract: In this paper, we aim to bridge test-time training with a new type of parametric memory that can be flexibly offloaded from or merged into model parameters. We present Locas, a Locally-Supported parametric memory that shares the design of FFN blocks in modern transformers, allowing it to be flexibly permanentized into the model parameters while supporting efficient continual learning. We discuss two major variants of Locas: one with a conventional two-layer MLP design that has a clearer theoretical guarantee; the other shares the same GLU-FFN structure with SOTA LLMs, and can be easily attached to existing models for both parameter-efficient and computation-efficient continual learning. Crucially, we show that proper initialization of such low-rank sideway-FFN-style memories (performed in a principled way by reusing model parameters, activations and/or gradients) is essential for fast convergence, improved generalization, and catastrophic forgetting prevention. We validate the proposed memory mechanism on the PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering tasks. With only 0.02% additional parameters in the lowest case, Locas-GLU is capable of storing the information from past context while maintaining a much smaller context window. In addition, we also test the model's general capability loss after memorizing the whole book with Locas, through comparative MMLU evaluation. Results show the promising ability of Locas to permanentize past context into parametric knowledge with minimized catastrophic forgetting of the model's existing internal knowledge.
Michael Browder,Kevin Duh,J. David Harris,Vince Lyzinski,Paul McNamee,Youngser Park,Carey E. Priebe,Peter Viechnicki
Main category: cs.CL
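TL;DR: This paper proposes Data Kernel Perspective Space (DKPS), a framework intended to give concrete statistical guarantees on the quality of LLM-generated synthetic data. It targets the unpredictability of synthetic data and the reliance on empirical tuning (such as the temperature setting), and shows how the guarantees relate to downstream performance in neural machine translation and Contrastive Preference Optimization (CPO).
Details
Motivation: LLM-based synthetic data generation lacks theoretical guarantees; engineers rely on trial-and-error tuning and cannot easily predict the quality of the generated data, which limits reliable improvement in data-scarce settings.
Method: Present the mathematical derivation of DKPS and how it yields performance guarantees, then apply those guarantees to analyze downstream tasks such as neural machine translation and LLMs trained with CPO.
Result: DKPS provides statistical guarantees on the quality of transformer model outputs and helps elucidate downstream performance (for example, of NMT models and CPO-trained LLMs).
Conclusion: DKPS offers a mathematically grounded perspective for analyzing synthetic data generation, linking LLM output behavior to downstream task performance; limitations of the current work and future research are also discussed.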
Abstract: Scarcity of labeled training data remains the long pole in the tent for building performant language technology and generative AI models. Transformer models, particularly LLMs, are increasingly being used to mitigate the data scarcity problem via synthetic data generation. However, because the models are black boxes, the properties of the synthetic data are difficult to predict. In practice it is common for language technology engineers to 'fiddle' with the LLM temperature setting and hope that what comes out the other end improves the downstream model. Faced with this uncertainty, here we propose Data Kernel Perspective Space (DKPS) to provide the foundation for mathematical analysis yielding concrete statistical guarantees for the quality of the outputs of transformer models. We first show the mathematical derivation of DKPS and how it provides performance guarantees. Next we show how DKPS performance guarantees can elucidate performance of a downstream task, such as neural machine translation models or LLMs trained using Contrastive Preference Optimization (CPO). Limitations of the current work and future research are also discussed.
Ahmed Ruby,Christian Hardmeier,Sara Stymne
Main category: cs.CL
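TL;DR: This paper proposes a method for building a multilingual, multimodal dataset for implicit discourse relations and a cross-lingual classification model that combines text and speech, showing that multimodality and cross-lingual transfer help low-resource languages.
Details
Motivation: Implicit discourse relation classification requires inferring meaning from context. Contextual cues are spread across modalities and vary across languages, so text alone does not always capture them, which is a particular challenge for low-resource languages.
Method: Automatically construct a multilingual, multimodal dataset for English, French, and Spanish using a method designed for distantly related and unrelated language pairs; propose joint text-audio modeling through Qwen2-Audio for cross-lingual implicit discourse relation classification.
Result: Text-based models outperform audio-based models, but integrating both modalities can improve performance; cross-lingual transfer substantially improves results for low-resource languages.
Conclusion: Multimodal fusion and cross-lingual transfer are useful routes to better implicit discourse relation classification, especially for low-resource languages.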
Abstract: Implicit discourse relation classification is a challenging task, as it requires inferring meaning from context. While contextual cues can be distributed across modalities and vary across languages, they are not always captured by text alone. To address this, we introduce an automatic method for distantly related and unrelated language pairs to construct a multilingual and multimodal dataset for implicit discourse relations in English, French, and Spanish. For classification, we propose a multimodal approach that integrates textual and acoustic information through Qwen2-Audio, allowing joint modeling of text and audio for implicit discourse relation classification across languages. We find that while text-based models outperform audio-based models, integrating both modalities can enhance performance, and cross-lingual transfer can provide substantial improvements for low-resource languages.
Yang Zhang,Mersin Konomi,Christos Xypolopoulos,Konstantinos Divriotis,Konstantinos Skianis,Giannis Nikolentzos,Giorgos Stamou,Guokan Shang,Michalis Vazirgiannis
Main category: cs.CL
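TL;DR: This paper introduces GreekMMLU, a native-sourced Greek benchmark for massive multitask language understanding, with 21,805 multiple-choice questions across 45 subject areas, all drawn from Greek academic, professional, and governmental exams. 16,857 questions are released publicly and 4,948 are held back for a private leaderboard. Evaluations of more than 80 LLMs reveal substantial performance gaps between frontier and open-weight models and between Greek-adapted and general multilingual models, and the paper analyzes the factors that influence performance in Greek.
Details
Motivation: Existing Greek benchmarks are often machine-translated from English and fail to capture Greek linguistic and cultural characteristics; reliable evaluation tools built on authentic native content are lacking.
Method: Build GreekMMLU entirely from native Greek sources: 45 subjects and 21,805 questions, organized under a newly defined subject taxonomy and annotated with educational difficulty levels from primary school to professional examinations. 16,857 questions are public and 4,948 are reserved for private evaluation. More than 80 open- and closed-source LLMs are evaluated, with an analysis of model scale, adaptation, and prompting.
Result: There are substantial performance gaps between frontier and open-weight models and between Greek-adapted and general multilingual models; the systematic analysis examines how model scale, adaptation to Greek, and prompting influence performance on GreekMMLU.
Conclusion: GreekMMLU fills a gap in native-sourced Greek evaluation, providing a basis for robust, contamination-resistant assessment and improvement of LLM capabilities in Greek, and a methodology that may be reusable for other low-resource languages.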
Abstract: Large Language Models (LLMs) are commonly trained on multilingual corpora that include Greek, yet reliable evaluation benchmarks for Greek, particularly those based on authentic, native-sourced content, remain limited. Existing datasets are often machine-translated from English, failing to capture Greek linguistic and cultural characteristics. We introduce GreekMMLU, a native-sourced benchmark for massive multitask language understanding in Greek, comprising 21,805 multiple-choice questions across 45 subject areas, organized under a newly defined subject taxonomy and annotated with educational difficulty levels spanning primary to professional examinations. All questions are sourced or authored in Greek from academic, professional, and governmental exams. We publicly release 16,857 samples and reserve 4,948 samples for a private leaderboard to enable robust and contamination-resistant evaluation. Evaluations of over 80 open- and closed-source LLMs reveal substantial performance gaps between frontier and open-weight models, as well as between Greek-adapted models and general multilingual ones. Finally, we provide a systematic analysis of factors influencing performance (including model scale, adaptation, and prompting) and derive insights for improving LLM capabilities in Greek.
Ziyuan Yang,Wenxuan Ding,Shangbin Feng,Yulia Tsvetkov
Main category: cs.CL
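TL;DR: This paper studies the safety risk posed by malicious models in multi-LLM collaboration systems, quantifies their impact on system performance, and proposes mitigating it with external supervisors.
Details
Motivation: As language models are increasingly combined in collaboration systems, keeping these decentralized systems safe becomes a key question, especially when some participating models are compromised or malicious.
Method: Engineer four categories of malicious LMs, plug them into four popular types of model collaboration systems, and evaluate the compromised systems across 10 datasets; propose external supervisors that oversee the collaboration and disable or mask out malicious models.
Result: Malicious models severely degrade multi-LLM systems, especially in the reasoning and safety domains, where performance drops by 7.12% and 7.94% on average, respectively. The proposed supervision strategies recover 95.31% of the initial performance on average.
Conclusion: Although the mitigation strategies are effective, making model collaboration systems fully resistant to malicious models remains an open research question.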
Abstract: Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plugging them into four types of popular model collaboration systems, and evaluating the compromised systems across 10 datasets. We find that malicious models have a severe impact on the multi-LLM systems, especially for reasoning and safety domains, where performance is lowered by 7.12% and 7.94% on average. We then propose mitigation strategies to alleviate the impact of malicious components, employing external supervisors that oversee model collaboration and disable or mask out malicious models to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.
Shangbin Feng,Kishan Panaganti,Yulia Tsvetkov,Wenhao Yu
Main category: cs.CL
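TL;DR: This paper proposes the "single-multi evolution loop", which distills the outputs of multi-model collaboration into single models and repeats collaboration and distillation over multiple rounds so that the models evolve collectively, improving performance while reducing cost.
Details
Motivation: Model collaboration combines the strengths of multiple LMs, but loading several models is expensive; a method is needed that keeps the benefits of collaboration at much lower inference cost.
Method: Collaborative distillation trains a single model on the outputs of the model collaboration system. The single-multi evolution loop then iterates: multiple LMs collaborate, each distills from the collaborative outputs, and the improved LMs collaborate again.
Result: With 7 collaboration strategies and 15 tasks: (1) individual models improve by 8.0% on average, gaining the benefits of collaboration at single-model cost; (2) the collaboration system itself improves by 14.9% on average over the initial system without evolution. The method outperforms existing evolutionary AI methods, is compatible with diverse settings, and helps with problems the initial models struggle with.
Conclusion: The single-multi evolution loop is an efficient and general way to transfer collaborative ability into single models while continually strengthening the multi-model system.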
Abstract: Model collaboration, in which multiple language models (LMs) work together, combines the strengths of diverse models but incurs the cost of loading multiple LMs. We improve efficiency while preserving the strengths of collaboration by distilling collaborative patterns into a single model, where the model is trained on the outputs of the model collaboration system. At inference time, only the distilled model is employed: it imitates the collaboration while only incurring the cost of a single model. Furthermore, we propose the single-multi evolution loop: multiple LMs collaborate, each distills from the collaborative outputs, and these post-distillation improved LMs collaborate again, forming a collective evolution ecosystem where models evolve and self-improve by interacting with an environment of other models. Extensive experiments with 7 collaboration strategies and 15 tasks (QA, reasoning, factuality, etc.) demonstrate that: 1) individual models improve by 8.0% on average, absorbing the strengths of collaboration while reducing the cost to a single model; 2) the collaboration also benefits from the stronger and more synergistic LMs after distillation, improving over initial systems without evolution by 14.9% on average. Analysis reveals that the single-multi evolution loop outperforms various existing evolutionary AI methods, is compatible with diverse model/collaboration/distillation settings, and helps solve problems where the initial model/system struggles.
Hsuan-Yu Chou,Wajiha Naveed,Shuyan Zhou,Xiaowei Yang
Main category: cs.CL
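TL;DR: This paper evaluates seven state-of-the-art LLMs (four proprietary, three open-weight) on harmful content detection in social media posts. Open-weight models are comparable to proprietary ones in sensitivity and specificity, indicating potential for privacy-preserving moderation on consumer-grade hardware.
Details
Motivation: As internet access expands, exposure to harmful content grows and more effective moderation is needed. The out-of-the-box (zero-shot) capability of open-weight LLMs for harmful content detection remains unclear, a gap this paper addresses.
Method: On a dataset of real Bluesky posts, compare four proprietary and three open-weight LLMs on harmful content detection using sensitivity and specificity, with moderation decisions from the Bluesky Moderation Service and annotations by two authors as reference.
Result: The sensitivity (81% to 97%) and specificity (91% to 100%) of open-weight LLMs overlap considerably with those of proprietary models (72% to 98% and 93% to 99%). Specificity exceeds sensitivity for rudeness detection, while the opposite holds for intolerance and threats. The study also reports inter-rater agreement between human moderators and the LLMs.
Conclusion: Open-weight LLMs can support privacy-preserving, low-barrier moderation and suggest new directions for moderation systems that balance community values with individual user preferences.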
Abstract: As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively utilized for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to zero-shot outperform traditional machine learning models, the out-of-the-box capability of open-weight LLMs remains an open question.
Motivated by recent developments in reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81% to 97%) and specificity (91% to 100%) of the open-weight LLMs and those (72% to 98% and 93% to 99%) of the proprietary ones. Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
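For readers less familiar with the two metrics reported above, the following sketch shows how sensitivity and specificity are computed from binary moderation decisions. The function name and the toy labels are made up for illustration; they are not the paper's data or code.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary harmful-content labels.

    labels, predictions: equal-length sequences of booleans, True = harmful.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # harmful, flagged
    fn = sum(1 for y, p in pairs if y and not p)      # harmful, missed
    tn = sum(1 for y, p in pairs if not y and not p)  # benign, left alone
    fp = sum(1 for y, p in pairs if not y and p)      # benign, flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one harmful and one benign post")
    sensitivity = tp / (tp + fn)  # share of harmful posts that were caught
    specificity = tn / (tn + fp)  # share of benign posts that were not flagged
    return sensitivity, specificity


# Toy example: 4 harmful posts (one missed), 6 benign posts (one wrongly flagged).
labels = [True] * 4 + [False] * 6
predictions = [True, True, True, False] + [False] * 5 + [True]
sens, spec = sensitivity_specificity(labels, predictions)
print(sens, spec)
```

In this toy case sensitivity is 3/4 and specificity is 5/6. A moderation system tuned to flag more aggressively would typically raise the first number at the expense of the second.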
Kenichiro Ando,Tatsuya Harada
Main category: cs.CL
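TL;DR: This study examines how well the citation preferences of large language models (LLMs) match human ones. Using an annotated dataset covering eight citation-motivation types, it finds that stronger models share the human tendency to seek citations for medical text, that models are up to 27% more likely than humans to cite text explicitly marked as needing citations (as on Wikipedia), and that models under-select numeric sentences (by 22.6%) and sentences containing personal names (by 20.1%). Direct Preference Optimization (DPO) can calibrate model behavior.
Details
Motivation: LLM services commonly add citations to enhance credibility, but how models recognize cite-worthy content, and how that can be controlled, remains unclear. The alignment between model citation behavior and human citation preferences needs systematic characterization.
Method: Build a dataset of web-derived texts categorized into eight citation-motivation types, with pairwise citation preferences evaluated exhaustively across all type combinations; quantify the differences between model and human citation tendencies; fine-tune with Direct Preference Optimization (DPO) to improve alignment.
Result: Humans most frequently seek citations for medical text, and stronger models show a similar tendency. Models over-cite text explicitly marked as needing citations on sources such as Wikipedia (by as much as 27%) and under-cite numeric sentences (by 22.6%) and sentences containing personal names (by 20.1%). DPO improves agreement between model behavior and human preferences.
Conclusion: Current LLM citation behavior deviates systematically from human preferences, with both over-citation and under-citation. The study provides a dataset and a workable method for finer-grained understanding and control of LLM citation behavior.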
Abstract: Most services built on powerful large-scale language models (LLMs) add citations to their output to enhance credibility. Recent research has paid increasing attention to the question of what reference documents to link to outputs. However, how LLMs recognize cite-worthiness and how this process should be controlled remains underexplored. In this study, we focus on what kinds of content LLMs currently tend to cite and how well that behavior aligns with human preferences. We construct a dataset to characterize the relationship between human citation preferences and LLM behavior. Web-derived texts are categorized into eight citation-motivation types, and pairwise citation preferences are exhaustively evaluated across all type combinations to capture fine-grained contrasts. Our results show that humans most frequently seek citations for medical text, and stronger models display a similar tendency. We also find that current models are as much as 27% more likely than humans to add citations to text that is explicitly marked as needing citations on sources such as Wikipedia, and this overemphasis reduces alignment accuracy. Conversely, models systematically underselect numeric sentences (by 22.6% relative to humans) and sentences containing personal names (by 20.1%), categories for which humans typically demand citations. Furthermore, experiments with Direct Preference Optimization demonstrate that model behavior can be calibrated to better match human citation preferences. We expect this study to provide a foundation for more fine-grained investigations into LLM citation preferences.
Hongye Zhao,Yi Zhao,Chengzhi Zhang
Main category: cs.CL
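TL;DR: This paper quantifies the trajectory of academia-industry co-evolution using fine-grained knowledge entities and semantic space, finding that knowledge proximity between the two rises after technological change and that academia's knowledge dominance weakens.
Details
Motivation: Existing studies measure academia-industry knowledge proximity with macro indicators (such as counts of collaborative papers) without analyzing knowledge units in the literature. This leaves fine-grained knowledge proximity poorly understood and may undermine collaboration frameworks and resource allocation efficiency.
Method: Combine entity-level analysis (knowledge entities extracted with pre-trained models, sequence overlap measured by cosine similarity, topological features via complex network analysis) with semantic-level analysis (unsupervised contrastive learning to quantify cross-institutional textual similarity), and use citation distribution patterns to examine how bidirectional knowledge flows correlate with similarity.
Result: Knowledge proximity between academia and industry rises overall, particularly following technological change; academia's knowledge dominance weakens during technological paradigm shifts; the analysis provides textual evidence of bidirectional adaptation in co-evolution.
Conclusion: Fine-grained entity and semantic analysis characterizes the dynamics of academia-industry knowledge interaction more precisely, offering a basis for improving collaboration mechanisms and resource allocation.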
Abstract: Academia and industry are characterized by reciprocal shaping and a dynamic feedback mechanism. Despite distinct institutional logics, they have adapted closely in collaborative publishing and talent mobility, demonstrating tension between institutional divergence and intensive collaboration. Existing studies on their knowledge proximity mainly rely on macro indicators such as the number of collaborative papers or patents, lacking an analysis of knowledge units in the literature. This has led to an insufficient grasp of fine-grained knowledge proximity between industry and academia, potentially undermining collaboration frameworks and resource allocation efficiency. To remedy this limitation, this study quantifies the trajectory of academia-industry co-evolution through fine-grained entities and semantic space. At the entity level, we extract fine-grained knowledge entities via pre-trained models, measure sequence overlaps using cosine similarity, and analyze topological features through complex network analysis. At the semantic level, we employ unsupervised contrastive learning to quantify convergence in semantic spaces by measuring cross-institutional textual similarities. Finally, we use citation distribution patterns to examine correlations between bidirectional knowledge flows and similarity. Analysis reveals that knowledge proximity between academia and industry rises, particularly following technological change. This provides textual evidence of bidirectional adaptation in co-evolution. Additionally, academia's knowledge dominance weakens during technological paradigm shifts. The dataset and code for this paper can be accessed at https://github.com/tinierZhao/Academic-Industrial-associations.
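To make the entity-overlap measurement concrete, here is a minimal sketch of cosine similarity between two bags of knowledge entities. It assumes the entities have already been extracted and that each side is represented as an entity-frequency vector; the representation, the function name, and the example entities are assumptions for illustration and may differ from what the paper's released code does.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two entity-frequency Counters."""
    # Only entities present on both sides contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty entity set shares nothing with anything
    return dot / (norm_a * norm_b)


# Hypothetical entity lists for one year of academic and industrial papers.
academia = Counter(["transformer", "attention", "bert", "attention"])
industry = Counter(["transformer", "attention", "gpu"])
similarity = cosine_similarity(academia, industry)
print(round(similarity, 4))
```

Computing this value per year (or per technological period) would give one possible proximity trajectory of the kind the abstract describes.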
Jinchuan Tian,Haoran Wang,Bo-Hao Su,Chien-yu Huang,Qingzheng Wang,Jiatong Shi,William Chen,Xun Gong,Siddhant Arora,Chin-Jou Li,Masao Someki,Takashi Maekaku,Yusuke Shinohara,Jin Sakuma,Chao-Han Huck Yang,Shinji Watanabe
Main category: cs.CL
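TL;DR: Bagpiper is an 8B-parameter audio foundation model. Pre-trained on 600B tokens with rich natural language captions, it learns a bidirectional mapping between audio and high-level cognitive concepts (such as transcription and audio events) and uses a caption-then-process workflow to unify audio understanding and generation.
Details
Motivation: Existing audio foundation models rely on rigid, task-specific supervision and address isolated factors of audio, whereas humans link physical audio signals to abstract cognitive concepts holistically. The paper aims for a unified audio foundation model closer to human auditory cognition.
Method: Bagpiper uses rich natural language captions (covering concepts such as transcription and audio events) as a bridge, pre-training on a 600B-token corpus to establish a robust bidirectional mapping between raw audio and the conceptual space. Fine-tuning follows a two-step caption-then-process workflow that simulates an intermediate cognitive reasoning step, without task-specific priors.
Result: Bagpiper outperforms Qwen-2.5-Omni on the MMAU and AIRBench audio understanding benchmarks and surpasses CosyVoice3 and TangoFlux in generation quality, synthesizing arbitrary compositions of speech, music, and sound effects. The authors describe it as among the first works to achieve unified understanding and generation for general audio.
Conclusion: Bagpiper suggests that cognition-driven captions can serve as a unified representation supporting multi-task audio understanding and generation.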
Abstract: Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, and it can synthesize arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works that achieve unified understanding and generation for general audio. Model, data, and code are available at Bagpiper Home Page.
Zhilin Liang,Yuxiang Wang,Zimu Zhou,Hainan Zhang,Boyi Liu,Yongxin Tong
Main category: cs.CL
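TL;DR: This paper proposes FedMosaic, the first federated retrieval-augmented generation (FedRAG) framework built on parametric adapters. Through semantic clustering and selective aggregation, it improves accuracy and sharply cuts storage and communication costs without sharing raw documents.
Details
Motivation: Existing RAG deployments assume a centralized corpus, which is infeasible in privacy-sensitive domains where knowledge stays siloed. A federated RAG scheme that works across distributed silos without sharing raw documents is needed.
Method: Adopting the parametric RAG paradigm, FedMosaic (1) clusters semantically related documents into multi-document adapters with document-specific masks, and (2) performs selective adapter aggregation, combining only relevance-aligned, non-conflicting adapters.
Result: Average accuracy is 10.9% higher than state-of-the-art methods across four task categories; storage costs fall by 78.8% to 86.3% and communication costs by 91.4%, with no raw documents transmitted.
Conclusion: FedMosaic addresses the two main FedRAG challenges, high overhead and destructive aggregation, offering a practical and efficient approach to privacy-first RAG deployment.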
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge to improve factuality and reduce hallucinations. Yet most deployments assume a centralized corpus, which is infeasible in privacy-aware domains where knowledge remains siloed. This motivates federated RAG (FedRAG), where a central LLM server collaborates with distributed silos without sharing raw documents. In-context RAG violates this requirement by transmitting verbatim documents, whereas parametric RAG encodes documents into lightweight adapters that merge with a frozen LLM at inference, avoiding raw-text exchange. We adopt the parametric approach but face two unique challenges induced by FedRAG: high storage and communication costs from per-document adapters, and destructive aggregation caused by indiscriminately merging multiple adapters. We present FedMosaic, the first federated RAG framework built on parametric adapters. FedMosaic clusters semantically related documents into multi-document adapters with document-specific masks to reduce overhead while preserving specificity, and performs selective adapter aggregation to combine only relevance-aligned, non-conflicting adapters. Experiments show that FedMosaic achieves an average 10.9% higher accuracy than state-of-the-art methods in four categories, while lowering storage costs by 78.8% to 86.3% and communication costs by 91.4%, and never sharing raw documents.
Guangwei Zhang,Jianing Zhu,Cheng Qian,Neil Gong,Rada Mihalcea,Zhaozhuo Xu,Jingrui He,Jiaqi Ma,Yun Huang,Chaowei Xiao,Bo Li,Ahmed Abbasi,Dongwon Lee,Heng Ji,Denghui Zhang
Main category: cs.CL
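TL;DR: This paper presents Copyright Detective, the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in the outputs of large language models (LLMs).
Details
Motivation: Because copyright law is complex, treating copyright compliance as a static classification task copes poorly with the varied forms infringement takes in practice. An interactive auditing framework oriented toward evidence discovery is needed.
Method: A unified, extensible interactive forensic framework integrates multiple detection paradigms (content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification) and supports systematic auditing through interactive prompting, response collection, and iterative workflows.
Result: The system enables systematic detection and visualization of verbatim memorization and paraphrase-level leakage in LLM outputs, supporting responsible deployment and transparent evaluation even with black-box access.
Conclusion: Copyright Detective offers an interactive forensic tool for assessing LLM copyright risk, supporting practical compliance auditing of generated content.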
Abstract: We present Copyright Detective, the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. The system treats copyright infringement versus compliance as an evidence discovery process rather than a static classification task due to the complex nature of copyright law. It integrates multiple detection paradigms, including content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification, within a unified and extensible framework. Through interactive prompting, response collection, and iterative workflows, our system enables systematic auditing of verbatim memorization and paraphrase-level leakage, supporting responsible deployment and transparent evaluation of LLM copyright risks even with black-box access.
Haoran Li,Sucheng Ren,Alan Yuille,Feng Wang
Main category: cs.CL
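TL;DR: This paper proposes CoPE, which soft-clips the low-frequency components of RoPE to unify two goals, OOD mitigation and semantic modeling, and yields significant gains at context lengths up to 256k.
Details
Motivation: Methods for adapting RoPE to long contexts follow two kinds of guiding principle, OOD mitigation and semantic modeling, whose objectives appear distinct. The paper aims to unify them.
Method: CoPE applies soft clipping to the low-frequency components of RoPE, avoiding the spectral leakage caused by hard clipping.
Result: Significant performance gains that scale up to 256k context length, validating the theoretical analysis and establishing a new state of the art for length generalization.
Conclusion: CoPE is a simple and effective method that mitigates OOD issues, refines semantic signals, and prevents spectral leakage, offering a new approach to long-context extension of RoPE.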
Abstract: Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out-of-distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) semantic modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping of the low-frequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state-of-the-art for length generalization. Our code, data, and models are available at https://github.com/hrlics/CoPE.
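The contrast between hard and soft clipping can be sketched in a few lines. The frequency formula below is the standard RoPE one; everything else (the raised-cosine ramp, the `cutoff` and `width` parameters, and the function names) is an assumption chosen for illustration and should not be read as CoPE's actual formulation, which is defined in the paper and its repository.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE rotation frequencies: theta_i = base^(-2i/d)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def hard_clip_weights(freqs, cutoff):
    """Hard clipping: drop every component below the cutoff frequency."""
    return [1.0 if f >= cutoff else 0.0 for f in freqs]


def soft_clip_weights(freqs, cutoff, width):
    """Soft clipping with a raised-cosine ramp in log-frequency (assumed form).

    Components at or above `cutoff` keep weight 1, components at or below
    `cutoff / width` get weight 0, and the ones in between are tapered.
    """
    lo, hi = math.log(cutoff / width), math.log(cutoff)
    weights = []
    for f in freqs:
        x = (math.log(f) - lo) / (hi - lo)
        x = min(1.0, max(0.0, x))  # clamp to the ramp's range
        weights.append(0.5 - 0.5 * math.cos(math.pi * x))
    return weights


freqs = rope_frequencies(head_dim=8)  # roughly 1, 0.1, 0.01, 0.001
hard = hard_clip_weights(freqs, cutoff=0.05)
soft = soft_clip_weights(freqs, cutoff=0.1, width=100.0)
print(hard)
print([round(w, 3) for w in soft])
```

The hard mask jumps from 1 to 0 between neighboring components, while the soft weights pass through an intermediate value. Smooth tapers of this kind are the usual signal-processing remedy for leakage caused by abrupt truncation. How such weights would enter the attention computation is left open here.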
Fanfan Liu,Youyang Yin,Peng Shi,Siqi Yang,Zhixiong Zeng,Haibo Qiu
Main category: cs.CL
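TL;DR: This paper analyzes how Reinforcement Learning with Verifiable Rewards (RLVR) algorithms affect response length and proposes LUSPO, a length-unbiased sequence policy optimization algorithm that resolves response length collapse and achieves state-of-the-art results on mathematical and multimodal reasoning.
Details
Motivation: Response length changes in markedly different patterns across RLVR algorithms during training, and a fundamental explanation has been lacking.
Method: Theoretically analyze the factors influencing response length in mainstream RLVR algorithms, then propose Length-Unbiased Sequence Policy Optimization (LUSPO), which corrects the length bias inherent in Group Sequence Policy Optimization (GSPO) so that the loss is unbiased with respect to response length.
Result: LUSPO outperforms existing methods such as GRPO and GSPO on mathematical reasoning benchmarks and in multimodal reasoning scenarios, and it resolves response length collapse.
Conclusion: The key is to remove length bias from the optimization objective; LUSPO offers a sequence policy optimization strategy for RLVR that is unbiased with respect to response length.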
Abstract: Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.
Jingru Fan,Dewen Liu,Yufan Dang,Huatao Li,Yuheng Wang,Wei Liu,Feiyu Duan,Xuanwen Ding,Shu Yao,Lin Wu,Ruijie Shi,Wai-Shing Leung,Yuan Cheng,Zhongyu Wei,Cheng Yang,Chen Qian,Zhiyuan Liu,Maosong Sun
Main category: cs.CL
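TL;DR: This paper proposes a design-science framework that introduces a collaboration gain metric Γ to separate genuine collaboration gains from resource accumulation, and builds a multi-agent system (MAS) factor library, with the aim of moving the field from empirical trial and error toward systematic scientific study.
Details
Motivation: Current MAS research relies on empirical trial and error and lacks a unified, interpretable scientific framework. The main bottleneck is ambiguous attribution: there is no structured taxonomy of factors, and no unified metric that factors out budget effects.
Method: Proposes the collaboration gain metric Γ as a scientific standard; builds a MAS factor library organized into control-level presets and information-level dynamics; and establishes a factor attribution paradigm to systematically identify the key factors that drive collaboration.
Result: Establishes the first scientific framework aimed at systematic MAS optimization, comprising a quantifiable collaboration gain metric Γ and a structured factor library, and supporting interpretable attribution analysis of collaboration mechanisms.
Conclusion: The framework moves MAS research from empiricism toward design science and lays a methodological foundation for 'Collective AI'.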
Abstract: Recent advancements in Large Language Models (LLMs) have greatly extended the capabilities of Multi-Agent Systems (MAS), demonstrating significant effectiveness across a wide range of complex and open-ended domains. However, despite this rapid progress, the field still relies heavily on empirical trial-and-error. It lacks the unified and principled scientific framework necessary for systematic optimization and improvement. This bottleneck stems from the ambiguity of attribution: first, the absence of a structured taxonomy of factors leaves researchers restricted to unguided adjustments; second, without a unified metric, genuine collaboration gain cannot be distinguished from mere resource accumulation. In this paper, we advocate for a transition to design science through an integrated framework. We propose establishing the collaboration gain metric ($Γ$) as the scientific standard to isolate intrinsic gains from increased budgets. Leveraging $Γ$, we propose a factor attribution paradigm to systematically identify collaboration-driving factors. To support this, we construct a systematic MAS factor library, structuring the design space into control-level presets and information-level dynamics. Ultimately, this framework facilitates the transition from blind experimentation to rigorous science, paving the way towards a true science of Collective AI.
Haojin Wang,Yike Wang,Shangbin Feng,Hannaneh Hajishirzi,Yulia Tsvetkov
Main category: cs.CL
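TL;DR: This paper proposes MentorCollab, in which a large model sparsely and selectively guides a small model at inference time. A lightweight verifier decides whether to adopt the large model's short lookahead segment, improving multi-step reasoning while substantially reducing compute cost.
Details
Motivation: Large reasoning models (LRMs) reason well but are costly and redundant; small language models (SLMs) are efficient but struggle with multi-step reasoning; existing collaboration methods tend to produce imitative, verbose generation and lack effective error correction.
Method: MentorCollab probes for divergences between SLM and LRM outputs at randomly sampled token positions and uses a lightweight verifier to decide whether the SLM should adopt a short lookahead segment generated by the LRM, rather than handing over generation entirely. The large model is invoked only when needed, which gives sparse guidance.
Result: Across 15 SLM-LRM pairs and 3 domains (math reasoning, commonsense reasoning, general knowledge), performance improves in 12 settings, with an average gain of 3.0% and a maximum of 8.0%; the large model generates only 18.4% of tokens on average, substantially reducing inference cost.
Conclusion: Sparse, selective inference-time guidance from a large model is enough to recover near-large-model reasoning ability without a significant increase in inference cost, offering a new approach to efficient collaborative inference.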
Abstract: Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but their inference costs are high and they often generate redundant reasoning. Small language models (SLMs) are far more efficient, yet struggle on multi-step reasoning tasks. A natural idea is to let a large model guide a small one at inference time as a mentor, yet existing collaboration methods often promote imitation, resulting in verbose reasoning without consistent error correction. We propose MentorCollab, an inference-time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation. At randomly sampled token positions, we probe for divergences between the two models and use a lightweight verifier to decide whether the SLM should follow a short lookahead segment from its mentor or continue on its own. Across 15 SLM--LRM pairs and 3 domains (math reasoning, general knowledge, and commonsense reasoning), our method improves performance in 12 settings, with average gains of 3.0% and up to 8.0%, while having only 18.4% of tokens generated by the expensive mentor model on average. We find that short segments and selective probing are sufficient for effective collaboration. Our results show that selective inference-time guidance restores large-model reasoning ability without substantial inference overhead.
Soma Sato,Ryohei Sasano
Main category: cs.CL
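TL;DR: This paper studies how language models implicitly encode character-level information, using controlled experiments to analyze how tokenization-dependent and tokenization-independent factors affect the acquisition of character knowledge.
Details
Motivation: Language models are never explicitly given character-level information yet encode it implicitly, and the underlying mechanism remains unclear.
Method: Trains language models under controlled settings (such as a specified pre-training dataset or tokenizer), compares them with models trained under standard settings, and divides the contributing factors into those that depend on tokenization and those that do not.
Result: The primary tokenization-related factors are merge rules and orthographic constraints, while semantic associations and syntactic information are the key tokenization-independent factors.
Conclusion: Implicit learning of character-level knowledge is driven jointly by tokenization-dependent and tokenization-independent factors, with semantic and syntactic information playing an important role.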
Abstract: Language models (LMs) have been reported to implicitly encode character-level information, despite this information not being explicitly provided during training. However, the mechanisms underlying this phenomenon remain largely unexplored. To reveal the mechanisms, we analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings, such as specifying the pre-training dataset or tokenizer, with those trained under standard settings. We categorize the contributing factors into those that depend on tokenization and those independent of tokenization. Our analysis reveals that merge rules and orthographic constraints constitute primary factors arising from tokenization, whereas semantic associations of substrings and syntactic information function as key factors independent of tokenization.
Jun Rao,Zixiong Yu,Xuebo Liu,Guhan Chen,Jing Li,Jiansheng Wei,Xiaojun Meng,Min Zhang
Main category: cs.CL
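TL;DR: This paper proposes PACE, which replaces conventional Best-of-N sampling with a generation-based correction strategy, achieving better preference alignment on mathematical reasoning with less compute while avoiding policy collapse.
Details
Motivation: Standard DPO-R1 methods rely on large-scale Best-of-N sampling (e.g., N≥8) to mine high-quality reasoning trajectories, but the authors find that in mathematical reasoning this strategy amplifies verifier noise, causes distribution shift, and can even lead to policy collapse.
Method: Proposes PACE (Proximal Alignment via Corrective Exploration), which uses a generative correction mechanism under a low exploration budget (2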
### [22] [Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks](https://arxiv.org/abs/2602.05374)
*Chaimae Abouzahir,Congbo Ma,Nizar Habash,Farah E. Shamout*
Main category: cs.CL
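TL;DR: This paper studies the cross-lingual performance of large language models (LLMs) on Arabic and English medical question answering. It finds a language-driven performance gap that widens with task complexity, identifies fragmented tokenization of Arabic medical text and weak correlation between model confidence and answer correctness, and argues for language-aware design and evaluation strategies.
Details
Motivation: Existing LLMs are mostly English-centric, which limits their robustness and reliability in medical applications for low-resource languages such as Arabic, and the root causes of the performance differences are not yet clear.
Method: Conducts a cross-lingual empirical analysis of Arabic and English medical question answering, combined with tokenization-structure analysis and an assessment of the reliability of model confidence and explanations.
Result: Finds a significant language performance gap that widens with task complexity; Arabic medical text suffers from structurally fragmented tokenization; model-reported confidence and explanations correlate only weakly with answer correctness.
Conclusion: Language-aware strategies are needed in the design and evaluation of medical LLMs to improve reliability and fairness in multilingual settings.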
Abstract: In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.
### [23] [IESR:Efficient MCTS-Based Modular Reasoning for Text-to-SQL with Large Language Models](https://arxiv.org/abs/2602.05385)
*Tao Liu,Jiafan Lu,Bohan Yu,Pengcheng Wu,Liu Haixin,Guoyu Xu,Li Xiangheng,Lixiao Li,Jiaming Hou,Zhao Shijun,Xinglin Lyu,Kunli Zhang,Yuxiang Jia,Hongyin Zan*
Main category: cs.CL
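TL;DR: This paper proposes IESR, a framework that uses lightweight large language models to address complex reasoning, domain knowledge, and hypothetical queries in Text-to-SQL. Through information enhancement, multi-path MCTS reasoning, and trajectory consistency verification, it reaches state-of-the-art performance on the LogicCat and Archer datasets without fine-tuning.
Details
Motivation: Existing Text-to-SQL methods fall short on complex reasoning, domain knowledge, and hypothetical queries, and are costly to deploy in enterprises.
Method: Proposes the IESR framework: (i) uses LLMs for key information understanding and schema linking, and decouples mathematical computation from SQL generation; (ii) applies multi-path reasoning based on Monte Carlo Tree Search (MCTS) with majority voting; (iii) adds a trajectory consistency verification module with a discriminator.
Result: Reaches state of the art on LogicCat (24.28 EX) and Archer (37.28 EX) using only lightweight models without fine-tuning; finds that current coder models show notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning.
Conclusion: IESR is an efficient, lightweight, fine-tuning-free approach to Text-to-SQL that exposes key capability gaps in current coder models and points to directions for future work.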
Abstract: Text-to-SQL is a key natural language processing task that maps natural language questions to SQL queries, enabling intuitive interaction with web-based databases. Although current methods perform well on benchmarks like BIRD and Spider, they struggle with complex reasoning, domain knowledge, and hypothetical queries, and remain costly in enterprise deployment. To address these issues, we propose a framework named IESR (Information Enhanced Structured Reasoning) for lightweight large language models, which (i) leverages LLMs for key information understanding and schema linking, and decouples mathematical computation from SQL generation, (ii) integrates a multi-path reasoning mechanism based on Monte Carlo Tree Search (MCTS) with majority voting, and (iii) introduces a trajectory consistency verification module with a discriminator model to ensure accuracy and consistency. Experimental results demonstrate that IESR achieves state-of-the-art performance on the complex reasoning benchmark LogicCat (24.28 EX) and the Archer dataset (37.28 EX) using only compact lightweight models without fine-tuning. Furthermore, our analysis reveals that current coder models exhibit notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning, highlighting important directions for future research. We released code at https://github.com/Ffunkytao/IESR-SLM.
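The majority-voting step in (ii) can be illustrated with a minimal Python sketch. The abstract does not say what is being voted on, so this example assumes that each reasoning path yields a candidate SQL string together with a hashable execution result, and that votes are counted over the results. The function name `majority_vote` and the sample queries are hypothetical and are not taken from the IESR code.

```python
from collections import Counter
from typing import Hashable, List, Tuple


def majority_vote(candidates: List[Tuple[str, Hashable]]) -> str:
    """Return the first SQL string whose execution result is the most common.

    Assumption: each candidate is (sql_text, execution_result), and two
    candidates agree when their execution results are equal.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    counts = Counter(result for _, result in candidates)
    winning_result, _ = counts.most_common(1)[0]
    # Return the earliest candidate that produced the winning result.
    return next(sql for sql, result in candidates if result == winning_result)


# Hypothetical outputs of three reasoning paths.
paths = [
    ("SELECT COUNT(*) FROM orders", 42),
    ("SELECT COUNT(id) FROM orders", 42),
    ("SELECT COUNT(*) FROM users", 7),
]
best = majority_vote(paths)
print(best)
```

Voting on execution results rather than on SQL text lets syntactically different but equivalent queries reinforce each other; whether IESR does this is not stated in the abstract.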
### [24] [Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances](https://arxiv.org/abs/2602.05392)
*Jiyun Chun,Eric Fosler-Lussier,Michael White,Andrew Perrault*
Main category: cs.CL
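TL;DR: This paper proposes an LLM-based evaluation framework for measuring the quality of children's utterances in adult-child dialogue along two dimensions, Expansion and Independence. It goes beyond length-dominated metrics such as MLU and shows developmental validity, predictive value, and semantic sensitivity.
Details
Motivation: Existing metrics for child utterance quality (MLU, vocd-D, Flesch-Kincaid, and others) depend heavily on length and ignore conversational context, so they miss key aspects such as reasoning depth, topic maintenance, and discourse planning.
Method: Builds an LLM-as-a-judge framework that first classifies the type of the previous adult utterance and then scores the child's response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (autonomy in advancing the discourse). Both axes correspond to core abilities in child language development.
Result: Demonstrates developmental validity (age-related patterns), predictive validity (improved age estimation accuracy), and semantic sensitivity (detecting differences across discourse relations), along with close agreement with human ratings.
Conclusion: The framework shifts evaluation from measuring length to assessing contextual contribution, supporting large-scale, context-aware assessment of child language development.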
Abstract: Evaluating the quality of children's utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child's response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child's contribution to advancing the discourse). These axes reflect fundamental dimensions in child language development, where Expansion captures elaboration, clause combining, and causal and contrastive connectives. Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child's speech contributes to and advances the conversation within its context.
### [25] [Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better](https://arxiv.org/abs/2602.05393)
*Ji Zhao,Yufei Gu,Shitong Shao,Xun Zhou,Liang Xiang,Zeke Xie*
Main category: cs.CL
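TL;DR: This paper proposes the Late-to-Early Training (LET) paradigm, which uses late-stage representations from a small pretrained model to guide the early training of a large language model, significantly accelerating convergence and improving performance.
Details
Motivation: Pretraining LLMs is computationally expensive, and the practical question of how to reuse existing small pretrained models to speed up the training of larger ones has not been fully explored.
Method: Proposes LET, which uses representations from the late layers of a pretrained model as a supervision signal to guide the target model to learn late-stage knowledge in its early training steps and early layers. It comprises two mechanisms: late-to-early-step and late-to-early-layer learning.
Result: Validated on 1.4B and 7B parameter models; a 1.4B model trained on the Pile reaches up to a 1.6× speedup with nearly 5% higher downstream accuracy, even when the pretrained model is 10× smaller than the target model.
Conclusion: LET is an efficient and robust training acceleration method that balances training efficiency and model performance, offering a new paradigm for LLM pretraining.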
Abstract: As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: *Can we leverage existing small pretrained models to accelerate the training of larger models?* In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e. late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6× speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10× fewer parameters than the target model.
### [26] [OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration](https://arxiv.org/abs/2602.05400)
*Shaobo Wang,Xuan Ouyang,Tianyi Xu,Yuzheng Hu,Jialin Liu,Guo Chen,Tianyu Zhang,Junhao Zheng,Kexin Yang,Xingzhang Ren,Dayiheng Liu,Linfeng Zhang*
Main category: cs.CL
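TL;DR: This paper proposes OPUS, a framework that defines data utility in the optimizer-induced update space and combines the Ghost technique with Boltzmann sampling, substantially improving pre-training data selection efficiency and model performance at very low additional overhead.
Details
Motivation: As high-quality public text nears exhaustion (the Data Wall), pre-training must shift from 'more tokens' to 'better tokens'; existing methods are either static and ignore training dynamics, or dynamic but unaware of the optimizer.
Method: Proposes OPUS, which defines data utility in the optimizer-induced update space and scores each sample by projecting its effective update onto a stable proxy direction; uses Ghost with CountSketch to speed up computation and Boltzmann sampling to preserve diversity.
Result: In GPT-2 Large/XL pre-training, training with 30B tokens surpasses full training on 200B tokens; in continued pre-training of Qwen3-8B-Base, 0.5B tokens outperform full training on 3B tokens; combining OPUS with static filters yields further gains.
Conclusion: OPUS is an efficient, general, and scalable dynamic data selection method that markedly improves pre-training data efficiency across models, datasets, and optimizers.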
Abstract: As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
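The role of Boltzmann sampling in keeping selected data diverse can be sketched in a few lines of Python. This is a generic illustration, not the OPUS implementation: it assumes each candidate already has a scalar utility score, draws indices with probability proportional to `exp(score / temperature)`, and samples with replacement for simplicity. The names `boltzmann_probs`, `boltzmann_sample`, and the example scores are hypothetical.

```python
import math
import random
from typing import List


def boltzmann_probs(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax of scores / temperature, computed stably by subtracting the max."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_sample(scores: List[float], k: int,
                     temperature: float = 1.0, seed: int = 0) -> List[int]:
    """Draw k candidate indices (with replacement) from the Boltzmann distribution."""
    rng = random.Random(seed)
    probs = boltzmann_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=k)


# Hypothetical utility scores for three candidate samples.
scores = [0.2, 1.5, 0.9]
probs = boltzmann_probs(scores, temperature=0.5)
picks = boltzmann_sample(scores, k=4, temperature=0.5, seed=0)
print(probs, picks)
```

Compared with always taking the top-scoring candidates, sampling leaves lower-utility items a nonzero chance of selection; a lower temperature moves the behavior toward greedy top-k selection.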
### [27] [Grammatical Error Correction Evaluation by Optimally Transporting Edit Representation](https://arxiv.org/abs/2602.05419)
*Takumi Goto,Yusuke Sakai,Taro Watanabe*
Main category: cs.CL
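TL;DR: This paper proposes UOT-ERRANT, a new automatic GEC evaluation metric based on edit operations rather than whole-sentence similarity. It matches edit vectors extracted by ERRANT using unbalanced optimal transport, significantly improves evaluation performance in the +Fluency domain, and is interpretable.
Details
Motivation: Reference-based metrics built on embedding similarity (such as BERTScore) work poorly for GEC because most source words are left unchanged, which inflates similarity; a metric that captures GEC-specific edit behavior more precisely is needed.
Method: Proposes 'edit vectors' to represent edits extracted by ERRANT and builds the UOT-ERRANT metric, which uses unbalanced optimal transport (UOT) to transport the hypothesis's edit vectors to the reference's edit vectors, modeling similarity at the level of individual edits.
Result: On the SEEDA meta-evaluation benchmark, UOT-ERRANT significantly outperforms existing metrics, especially in the +Fluency setting where edits are dense; its transport plan can be read as a soft edit alignment, supporting system analysis and ranking.
Conclusion: UOT-ERRANT is a more accurate and more interpretable approach to GEC evaluation and highlights the importance of edit-level modeling for assessing grammatical error correction.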
Abstract: Automatic evaluation in grammatical error correction (GEC) is crucial for selecting the best-performing systems. Currently, reference-based metrics are a popular choice, which basically measure the similarity between hypothesis and reference sentences. However, similarity measures based on embeddings, such as BERTScore, are often ineffective, since many words in the source sentences remain unchanged in both the hypothesis and the reference. This study focuses on edits specifically designed for GEC, i.e., ERRANT, and computes similarity measured over the edits from the source sentence. To this end, we propose edit vector, a representation for an edit, and introduce a new metric, UOT-ERRANT, which transports these edit vectors from hypothesis to reference using unbalanced optimal transport. Experiments with SEEDA meta-evaluation show that UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain where many edits occur. Moreover, our method is highly interpretable because the transport plan can be interpreted as a soft edit alignment, making UOT-ERRANT a useful metric for both system ranking and analyzing GEC systems. Our code is available from https://github.com/gotutiyan/uot-errant.
### [28] [Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models](https://arxiv.org/abs/2602.05437)
*Basel Mousi,Fahim Dalvi,Shammur Chowdhury,Firoj Alam,Nadir Durrani*
Main category: cs.CL
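TL;DR: This paper introduces the M2CQA benchmark for evaluating counterfactual hallucination in vision-language models (VLMs) in multicultural, multilingual settings (especially Arabic and its dialects), together with a new metric, CFHR, to quantify it. Existing VLMs hallucinate markedly more in Arabic, and reasoning-first prompting makes the problem worse.
Details
Motivation: Existing hallucination benchmarks lack coverage of culturally grounded vision-language inconsistencies in non-Western contexts, and in particular neglect Arabic and its dialects.
Method: Builds M2CQA, a multilingual (English, Arabic, dialects) dataset of contrastive true and counterfactual statements over images from 17 Middle East and North Africa countries; proposes the CounterFactual Hallucination Rate (CFHR), defined as the rate of accepting the counterfactual statement given that the true statement was answered correctly; evaluates mainstream VLMs under several prompting strategies.
Result: CFHR rises sharply in Arabic, especially in dialects, while true-statement accuracy remains high; 'reason first, then answer' prompting increases hallucination, whereas 'answer first, then justify' improves robustness.
Conclusion: VLMs exhibit culture- and language-sensitive counterfactual hallucination, and new benchmarks and prompting strategies are needed to improve cross-cultural reliability; M2CQA and CFHR provide tools for follow-up work.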
Abstract: Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M2CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We will make the experimental resources and dataset publicly available for the community.
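The conditional definition of CFHR can be made concrete with a small Python sketch. It assumes that each benchmark item has been reduced to two booleans: whether the model answered the true statement correctly, and whether it accepted the paired counterfactual statement. The function name `cfhr`, the input layout, and the sample data are hypothetical; the paper's exact aggregation may differ.

```python
from typing import Iterable, Tuple


def cfhr(items: Iterable[Tuple[bool, bool]]) -> float:
    """Counterfactual acceptance rate among items whose true statement was answered correctly.

    Each item is (true_statement_correct, counterfactual_accepted).
    Returns NaN when no item satisfies the condition.
    """
    conditioned = [accepted for correct, accepted in items if correct]
    if not conditioned:
        return float("nan")
    return sum(conditioned) / len(conditioned)


# Hypothetical results for four image/statement pairs.
results = [
    (True, False),   # correct on the true statement, rejects the counterfactual
    (True, True),    # correct on the true statement, but accepts the counterfactual
    (False, True),   # excluded: the true statement was answered incorrectly
    (True, False),
]
rate = cfhr(results)
print(round(rate, 3))
```

Conditioning on a correct answer to the true statement separates hallucination from plain inaccuracy: the third item does not count against the model here, because its error is already reflected in true-statement accuracy.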
### [29] [Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs](https://arxiv.org/abs/2602.05444)
*Yao Zhou,Zeen Song,Wenwen Qiang,Fengge Wu,Shuyi Zhou,Changwen Zheng,Hui Xiong*
Main category: cs.CL
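TL;DR: This paper proposes CFA², a causal-inference-based jailbreak attack on LLMs that uses the front-door criterion to strip away safety alignment mechanisms, achieving high attack success rates with an interpretable mechanism.
Details
Motivation: Safety alignment mechanisms in LLMs often exist as latent states that mask the model's true capabilities; they need to be modeled causally so that their interference can be removed.
Method: Models the safety mechanism as an unobserved confounder and designs the CFA² attack framework based on Pearl's front-door criterion; uses Sparse Autoencoders (SAEs) to strip defense-related features and reduces marginalization to a deterministic intervention.
Result: CFA² achieves the best attack success rates on multiple benchmarks while providing an interpretable analysis of the jailbreak mechanism.
Conclusion: Front-door adjustment from a causal perspective can decouple safety alignment from task capability, offering a new way to understand and evaluate the robustness of LLM alignment.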
Abstract: Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. Then, we propose the Causal Front-Door Adjustment Attack (CFA$^2$) to jailbreak LLMs, a framework that leverages Pearl's Front-Door Criterion to sever the confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA$^2$ achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
### [30] [Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale](https://arxiv.org/abs/2602.05447)
*Damon McMillan*
Main category: cs.CL
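TL;DR: Through 9,649 experiments, this paper systematically studies context engineering strategies for LLM agents handling structured data in SQL generation, covering 11 models, 4 formats (YAML, Markdown, JSON, TOON), and schemas ranging from 10 to 10,000 tables. The results: architecture choice is model-dependent; format has no significant effect on aggregate accuracy but shows model-specific sensitivities; model capability is the dominant factor; file-native agents scale to 10,000 tables via domain partitioning; and file size does not determine runtime efficiency.
Details
Motivation: Practitioners lack empirical guidance on designing effective context for LLM agents, especially when agents operate external systems through programmatic interfaces. The paper uses SQL generation as a proxy task for programmatic agent operations and systematically explores how context structure affects performance.
Method: Runs large-scale controlled experiments (9,649 in total) covering 11 mainstream LLMs, 4 context formats (YAML/Markdown/JSON/TOON), and database schemas of different sizes (10 to 10,000 tables), comparing file-based retrieval with non-file-based architectures and analyzing format sensitivity, capability tiers, and scalability.
Result: 1) File-based context retrieval improves accuracy for frontier models (+2.7%) but lowers aggregate accuracy for open source models (-7.7%); 2) format has no statistically significant effect on aggregate accuracy (p=0.484), though open source models are format-sensitive; 3) there is a 21 percentage point accuracy gap between frontier and open source models; 4) file-native agents handle 10,000 tables efficiently with domain partitioning; 5) file size does not predict token consumption or runtime efficiency.
Conclusion: Context engineering should not rely on universal best practices but should be tailored to the capability tier of the model in use; model capability dominates performance, with format and architecture effects secondary; a file-native, domain-partitioned architecture is a viable path for large-scale structured systems.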
Abstract: Large Language Model agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables.
Our findings challenge common assumptions. First, architecture choice is model-dependent: file-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open source models (aggregate -7.7%, p<0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi-squared=2.45, p=0.484), though individual models, particularly open source, exhibit format-specific sensitivities. Third, model capability is the dominant factor, with a 21 percentage point accuracy gap between frontier and open source tiers that dwarfs any format or architecture effect. Fourth, file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact formats can consume significantly more tokens at scale due to format-unfamiliar search patterns.
These findings provide practitioners with evidence-based guidance for deploying LLM agents on structured systems, demonstrating that architectural decisions should be tailored to model capability rather than assuming universal best practices.
### [31] [Reasoning under Ambiguity: Uncertainty-Aware Multilingual Emotion Classification under Partial Supervision](https://arxiv.org/abs/2602.05471)
*Md. Mithun Hossaina,Mashary N. Alrasheedy,Nirban Bhowmick,Shamim Forhad,Md. Shakil Hossain,Sudipto Chaki,Md Shafiqul Islam*
Main category: cs.CL
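TL;DR: This paper proposes 'Reasoning under Ambiguity', an uncertainty-aware framework for multilingual multi-label emotion classification. Using an entropy-based weighting mechanism and a mask-aware objective, it handles ambiguous annotations and incomplete supervision and achieves notable gains on datasets in several languages.
Details
Motivation: Existing methods assume fully observed labels and use deterministic learning objectives, which leads to bias and unreliable predictions when emotions are ambiguous and annotations are incomplete.
Method: Proposes an uncertainty-aware framework comprising a shared multilingual encoder, language-specific optimization, an entropy-based ambiguity weighting mechanism (which avoids treating missing labels as negatives), and a mask-aware objective with positive-unlabeled regularization.
Result: On English, Spanish, and Arabic emotion classification benchmarks, the method consistently outperforms strong baselines on multiple metrics while improving training stability, robustness to annotation sparsity, and interpretability.
Conclusion: Explicitly modeling annotation uncertainty is key to improving multilingual multi-label emotion classification; the framework offers a more reliable, robust, and interpretable solution for emotion recognition under partial supervision.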
Abstract: Contemporary knowledge-based systems increasingly rely on multilingual emotion identification to support intelligent decision-making, yet they face major challenges due to emotional ambiguity and incomplete supervision. Emotion recognition from text is inherently uncertain because multiple emotional states often co-occur and emotion annotations are frequently missing or heterogeneous. Most existing multi-label emotion classification methods assume fully observed labels and rely on deterministic learning objectives, which can lead to biased learning and unreliable predictions under partial supervision. This paper introduces Reasoning under Ambiguity, an uncertainty-aware framework for multilingual multi-label emotion classification that explicitly aligns learning with annotation uncertainty. The proposed approach uses a shared multilingual encoder with language-specific optimization and an entropy-based ambiguity weighting mechanism that down-weights highly ambiguous training instances rather than treating missing labels as negative evidence. A mask-aware objective with positive-unlabeled regularization is further incorporated to enable robust learning under partial supervision. Experiments on English, Spanish, and Arabic emotion classification benchmarks demonstrate consistent improvements over strong baselines across multiple evaluation metrics, along with improved training stability, robustness to annotation sparsity, and enhanced interpretability.
### [32] [LinguistAgent: A Reflective Multi-Model Platform for Automated Linguistic Annotation](https://arxiv.org/abs/2602.05493)
*Bingru Li*
Main category: cs.CL
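TL;DR: This paper presents LinguistAgent, a platform whose reflective multi-model architecture with two agents (an Annotator and a Reviewer) improves the practical annotation performance of LLMs on complex semantic tasks in the humanities and social sciences (such as metaphor identification), and supports comparative experiments across paradigms with real-time evaluation.
Details
Motivation: Data annotation is a key bottleneck in the humanities and social sciences, especially for complex semantic tasks such as metaphor identification; LLMs show theoretical promise, but their practical usefulness in research remains limited.
Method: Proposes the LinguistAgent platform, with a reflective multi-model architecture and a dual-agent (Annotator + Reviewer) workflow that simulates professional peer review; supports three paradigms: Prompt Engineering (zero/few-shot), RAG, and fine-tuning; demonstrates the system on metaphor identification with real-time token-level evaluation (Precision/Recall/F1) against human gold standards.
Result: LinguistAgent performs well on metaphor identification and supports a reproducible, comparable, and evaluable automated annotation workflow; the system is open source, with code and application released on GitHub.
Conclusion: LinguistAgent bridges the gap between the theoretical capability of LLMs and the practical needs of humanities and social science researchers, offering a scalable, explainable, user-friendly solution for low-resource, highly specialized linguistic annotation.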
Abstract: Data annotation remains a significant bottleneck in the Humanities and Social Sciences, particularly for complex semantic tasks such as metaphor identification. While Large Language Models (LLMs) show promise, a significant gap remains between the theoretical capability of LLMs and their practical utility for researchers. This paper introduces LinguistAgent, an integrated, user-friendly platform that leverages a reflective multi-model architecture to automate linguistic annotation. The system implements a dual-agent workflow, comprising an Annotator and a Reviewer, to simulate a professional peer-review process. LinguistAgent supports comparative experiments across three paradigms: Prompt Engineering (Zero/Few-shot), Retrieval-Augmented Generation, and Fine-tuning. We demonstrate LinguistAgent's efficacy using the task of metaphor identification as an example, providing real-time token-level evaluation (Precision, Recall, and $F_1$ score) against human gold standards. The application and codes are released on https://github.com/Bingru-Li/LinguistAgent.
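Token-level Precision, Recall, and $F_1$ against a gold standard can be computed as in the following Python sketch. It assumes binary per-token labels (1 for a metaphorical token, 0 otherwise) and aligned gold and predicted sequences. The function `token_prf` and the example labels are hypothetical and are not taken from the LinguistAgent code.

```python
from typing import Sequence, Tuple


def token_prf(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive class (label 1) over aligned tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    # Define each ratio as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels for a five-token sentence.
gold_labels = [0, 1, 1, 0, 1]
pred_labels = [0, 1, 0, 1, 1]
precision, recall, f1 = token_prf(gold_labels, pred_labels)
print(precision, recall, f1)
```

In this example there are two true positives, one false positive, and one false negative, so all three scores equal 2/3.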
### [33] [Transport and Merge: Cross-Architecture Merging for Large Language Models](https://arxiv.org/abs/2602.05495)
*Chenhang Cui,Binyun Yang,Fei Shen,Yuxin Chen,Jingnan Zheng,Xiang Wang,An Zhang,Tat-Seng Chua*
Main category: cs.CL
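TL;DR: This paper proposes a cross-architecture model merging framework based on optimal transport (OT) that transfers knowledge from large language models (LLMs) to heterogeneous, smaller low-resource models. It needs only a small set of inputs to fuse models directly in weight space, and is validated on low-resource languages and specialized domains.
Details
Motivation: LLMs are highly capable, but real deployments often rely on smaller low-resource models; most existing merging methods require compatible architectures and cannot transfer knowledge from a large model to a heterogeneous small one, so a cross-architecture transfer mechanism is needed.
Method: Proposes an OT-based cross-architecture merging framework that aligns activations to establish cross-neuron correspondences between heterogeneous models, producing transport plans that guide direct weight-space fusion.
Result: Extensive experiments on low-resource languages and specialized domains show consistent improvements over the target models.
Conclusion: The OT-based cross-architecture merging framework effectively transfers knowledge from large high-resource models to small low-resource models, removes the architectural-compatibility requirement of conventional merging methods, and improves performance in low-resource settings.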
Abstract: Large language models (LLMs) achieve strong capabilities by scaling model capacity and training data, yet many real-world deployments rely on smaller models trained or adapted from low-resource data. This gap motivates the need for mechanisms to transfer knowledge from large, high-resource models to smaller, low-resource targets. While model merging provides an effective transfer mechanism, most existing approaches assume architecture-compatible models and therefore cannot directly transfer knowledge from large high-resource LLMs to heterogeneous low-resource targets. In this work, we propose a cross-architecture merging framework based on optimal transport (OT) that aligns activations to infer cross-neuron correspondences between heterogeneous models. The resulting transport plans are then used to guide direct weight-space fusion, enabling effective high-resource to low-resource transfer using only a small set of inputs. Extensive experiments across low-resource languages and specialized domains demonstrate consistent improvements over target models.
### [34] [A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering](https://arxiv.org/abs/2602.05512)
*Larissa Pusch,Alexandre Courtiol,Tim Conrad*
Main category: cs.CL
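TL;DR: This paper proposes an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them in natural language, improving the accessibility, accuracy, and explainability of knowledge graphs (KGs).
Details
Motivation: LLMs suffer from hallucinations, outdated information, and poor explainability on knowledge-intensive tasks; text-based retrieval-augmented generation (RAG) struggles with multi-hop reasoning; KGs are precise and explainable but require knowledge of a graph query language, which is a high barrier.
Method: Designs an interactive framework in which the LLM generates a Cypher query and explains it in natural language, and the user corrects it through natural-language feedback; builds a 90-query benchmark on a synthetic movie KG to evaluate explanation quality and fault detection, and runs real-world experiments on the Hyena KG and the MaRDI KG.
Result: The framework improves the usability and accuracy of KG querying; the synthetic benchmark reveals differences across LLMs in query explanation and fault detection, and the two real KGs confirm practical utility.
Conclusion: Combining the natural-language abilities of LLMs with the semantic rigor of KGs through human-in-the-loop interaction can bridge the gap between usability and reliability, offering a new approach for knowledge-intensive AI systems.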
Abstract: Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor, and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.
### [35] [Multi-Task GRPO: Reliable LLM Reasoning Across Tasks](https://arxiv.org/abs/2602.05547)
*Shyam Sundhar Ramesh,Xiaotong Ji,Matthieu Zimmer,Sangwoong Yoon,Zhiyong Wang,Haitham Bou Ammar,Aurelien Lucchi,Ilija Bogunovic*
Main category: cs.CL
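TL;DR: This paper proposes Multi-Task GRPO (MT-GRPO), which dynamically adapts task weights to optimize worst-task performance and introduces a ratio-preserving sampler so that gradients reflect those weights, markedly improving worst-task accuracy and training efficiency in multi-task settings.
Details
Motivation: Existing GRPO tends to produce imbalanced performance across tasks in multi-task settings, and the proportion of prompts with zero advantage differs widely between tasks, which distorts the optimization signal.
Method: Proposes MT-GRPO: (i) a dynamic task weighting mechanism that explicitly optimizes worst-task performance; (ii) a ratio-preserving sampler that keeps policy gradients consistent with the task weights.
Result: In 3-task and 9-task settings, MT-GRPO improves worst-task accuracy by 16-28% and 6% absolute over standard GRPO and DAPO respectively; in the 3-task setting it reaches 50% worst-task accuracy with 50% fewer training steps, and average accuracy remains competitive.
Conclusion: MT-GRPO mitigates performance imbalance in multi-task RL fine-tuning, substantially improving worst-task performance and training efficiency while preserving overall performance, which makes it better suited to real-world deployment.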
Abstract: RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
### [36] [CASTLE: A Comprehensive Benchmark for Evaluating Student-Tailored Personalized Safety in Large Language Models](https://arxiv.org/abs/2602.05633)
*Rui Jia,Ruiyi Lan,Fengrui Liu,Zhongxiang Dai,Bo Jiang,Jing Shao,Jingyuan Chen,Guandong Xu,Fei Wu,Min Zhang*
Main category: cs.CL
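TL;DR: This paper proposes the concept of Student-Tailored Personalized Safety and builds the CASTLE benchmark, which covers 15 educational safety risks and 14 student attributes across 92,908 bilingual scenarios, with three evaluation metrics: Risk Sensitivity, Emotional Empathy, and Student Alignment. Experiments show that current SOTA LLMs fall well short on personalized safety.
Details
Motivation: LLMs used for personalized education give one-size-fits-all responses that ignore students' cognitive and psychological differences, and conventional safety metrics (such as factual accuracy, bias, and toxicity) cannot capture the divergent harms that the same response may cause for students with different attributes.
Method: Grounded in educational theories, the authors propose the concept of Student-Tailored Personalized Safety, build the CASTLE benchmark (15 educational safety risks, 14 student attributes, 92,908 bilingual scenarios), and design three new evaluation metrics: Risk Sensitivity, Emotional Empathy, and Student Alignment.
Result: Experiments on 18 SOTA LLMs show that every model scores below an average safety rating of 2.3 out of 5 on CASTLE, indicating serious deficiencies in personalized safety assurance.
Conclusion: Current LLMs remain far from adequate on personalized safety in education; finer-grained safety evaluation and optimization frameworks built around student heterogeneity are urgently needed.
Abstract: Large language models (LLMs) have advanced the development of personalized learning in education. However, their inherent generation mechanisms often produce homogeneous responses to identical prompts. This one-size-fits-all mechanism overlooks the substantial heterogeneity in students' cognitive and psychological characteristics, thereby posing potential safety risks to vulnerable groups. Existing safety evaluations primarily rely on context-independent metrics such as factual accuracy, bias, or toxicity, which fail to capture the divergent harms that the same response might cause across different student attributes. To address this gap, we propose the concept of Student-Tailored Personalized Safety and construct CASTLE based on educational theories. This benchmark covers 15 educational safety risks and 14 student attributes, comprising 92,908 bilingual scenarios. We further design three evaluation metrics: Risk Sensitivity, measuring the model's ability to detect risks; Emotional Empathy, evaluating the model's capacity to recognize student states; and Student Alignment, assessing the match between model responses and student attributes. Experiments on 18 SOTA LLMs demonstrate that CASTLE poses a significant challenge: all models scored below an average safety rating of 2.3 out of 5, indicating substantial deficiencies in personalized safety assurance.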
### [37] [Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew](https://arxiv.org/abs/2602.05648)
*Giuseppe Samo,Paola Merlo*
Main category: cs.CL
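TL;DR: This paper studies how well Transformer models represent complex verb paradigms in Turkish and Modern Hebrew, focusing on the effect of tokenization strategy. For Turkish, whose morphological markers are transparent, all model types perform well; for Hebrew, whose morphology is non-concatenative, only the monolingual model with morpheme-aware segmentation performs well.
Details
Motivation: To examine how tokenization strategies affect Transformer models' ability to model verb paradigms in morphologically complex languages such as Turkish and Hebrew, especially under non-concatenative morphology.
Method: Using the Blackbird Language Matrices task on natural and synthetic data, the authors evaluate monolingual and multilingual Transformer models and compare atomic, subword-level, and character-level tokenization.
Result: For Turkish, all model types perform well. For Hebrew, only the monolingual model with morpheme-aware segmentation performs well, while the character-level multilingual model fails. All models improve on more synthetic datasets.
Conclusion: Tokenization strategy is critical for modeling morphologically complex languages. Non-concatenative languages in particular need segmentation that matches the language (e.g., morpheme-aware segmentation), and generic tokenization such as character-level tokenization may fail.
Abstract: We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish -- with its transparent morphological markers -- both monolingual and multilingual models succeed, either when tokenization is atomic or when it breaks words into small subword units. For Hebrew, by contrast, monolingual and multilingual models diverge. A multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, but a monolingual model with morpheme-aware segmentation performs well. Performance improves on more synthetic datasets for all models.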
### [38] [MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations](https://arxiv.org/abs/2602.05692)
*Congbo Ma,Yichun Zhang,Yousef Al-Jazzazi,Ahamed Foisal,Laasya Sharma,Yousra Sadqi,Khaled Saleh,Jihad Mallat,Farah E. Shamout*
Main category: cs.CL
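TL;DR: This paper introduces MedErrBench, the first multilingual benchmark for error detection, localization, and correction in clinical text. It covers English, Arabic, and Chinese and was built and validated by clinical experts. Experiments reveal notable performance gaps for existing LLMs in non-English clinical settings, underscoring the need for clinically trustworthy, language-aware medical AI systems.
Details
Motivation: Errors in existing or generated clinical text can have serious adverse consequences (e.g., misdiagnosis or incorrect treatment), yet dedicated evaluation benchmarks covering multiple languages and clinical contexts are lacking.
Method: MedErrBench is built on an expanded taxonomy of ten clinical error types, using natural clinical cases annotated and reviewed in English, Arabic, and Chinese by experienced clinicians. An evaluation protocol covers three tasks (error detection, localization, and correction), and a range of general-purpose, language-specific, and medical-domain LLMs are evaluated.
Result: Models perform noticeably worse on non-English tasks (especially Arabic and Chinese) than on English, and all models are weaker on localization and correction than on detection. This highlights the need to model clinical knowledge and language ability jointly.
Conclusion: MedErrBench fills a gap in multilingual clinical NLP evaluation benchmarks, and its public release is intended to advance safer and more equitable medical AI worldwide.
Abstract: Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially when they involve a misdiagnosis or an incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general-purpose, language-specific, and medical-domain language models across all three tasks. Our results reveal notable performance gaps, particularly in non-English settings, highlighting the need for clinically grounded, language-aware systems. By making MedErrBench and our evaluation protocols publicly available, we aim to advance multilingual clinical NLP to promote safer and more equitable AI-based healthcare globally. The dataset is available in the supplementary material. An anonymized version of the dataset is available at: https://github.com/congboma/MedErrBench.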
### [39] [Consensus-Aligned Neuron Efficient Fine-Tuning Large Language Models for Multi-Domain Machine Translation](https://arxiv.org/abs/2602.05694)
*Shuting Jiang,Ran Song,Yuxin Huang,Yan Xiang,Yantuan Xian,Shengxiang Gao,Zhengtao Yu*
Main category: cs.CL
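TL;DR: This paper proposes a neuron-efficient fine-tuning framework that identifies and updates consensus-aligned neurons by maximizing the mutual information between neuron behavior and domain features, improving multi-domain machine translation.
Details
Motivation: Although LLMs translate well, multi-domain settings still suffer from difficult domain adaptation, domain shift, parameter interference, and limited generalization.
Method: A neuron-efficient fine-tuning framework selects consensus-aligned neurons by mutual information and uses them to guide LLM fine-tuning.
Result: Experiments on three LLMs across ten German-English and Chinese-English translation domains show that the method consistently outperforms strong parameter-efficient fine-tuning (PEFT) baselines on both seen and unseen domains, achieving state-of-the-art performance.
Conclusion: Selecting and fine-tuning consensus-aligned neurons mitigates parameter interference and domain overfitting, improving the generalization and robustness of multi-domain machine translation.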
Abstract: Multi-domain machine translation (MDMT) aims to build a unified model capable of translating content across diverse domains. Despite the impressive machine translation capabilities demonstrated by large language models (LLMs), domain adaptation still remains a challenge for LLMs. Existing MDMT methods such as in-context learning and parameter-efficient fine-tuning often suffer from domain shift, parameter interference and limited generalization. In this work, we propose a neuron-efficient fine-tuning framework for MDMT that identifies and updates consensus-aligned neurons within LLMs. These neurons are selected by maximizing the mutual information between neuron behavior and domain features, enabling LLMs to capture both generalizable translation patterns and domain-specific nuances. Our method then fine-tunes LLMs guided by these neurons, effectively mitigating parameter interference and domain-specific overfitting. Comprehensive experiments on three LLMs across ten German-English and Chinese-English translation domains evidence that our method consistently outperforms strong PEFT baselines on both seen and unseen domains, achieving state-of-the-art performance.
### [40] [OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale](https://arxiv.org/abs/2602.05711)
*Jingze Shi,Zhangyang Peng,Yizhang Zhu,Yifan Wu,Guang Liu,Yuyu Luo*
Main category: cs.CL
TL;DR: OmniMoE pushes MoE toward extremely fine-grained experts, using Atomic Experts, a Cartesian Product Router, and Expert-Centric Scheduling to sharply cut inference latency while keeping accuracy high.
Details
Motivation: Existing MoE architectures face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency; breaking it would improve both parameter efficiency and execution performance.
Method: OmniMoE is a system-algorithm co-designed framework that introduces vector-level Atomic Experts, a Cartesian Product Router that reduces routing complexity from $O(N)$ to $O(\sqrt{N})$, and Expert-Centric Scheduling that turns sparse lookups into dense matrix operations.
Result: Across seven benchmarks, OmniMoE (1.7B active parameters) reaches 50.9% zero-shot accuracy, outperforming DeepSeekMoE and PEER. Compared with PEER, inference latency drops from 73ms to 6.7ms (a 10.9-fold speedup).
Conclusion: Extremely fine-grained MoE can combine high accuracy with high efficiency through system-algorithm co-design, offering a new paradigm for sparsifying future large models.
Abstract: Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from $O(N)$ to $O(\sqrt{N})$; and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. Across seven benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% zero-shot accuracy, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73ms to 6.7ms (a 10.9-fold speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.
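To illustrate how decomposing the index space can bring routing cost down from $O(N)$ to roughly $O(\sqrt{N})$, here is a minimal sketch of product-style top-k selection. It assumes that the score of expert `(i, j)` is the sum of a row score and a column score; that additive form, the function name `cartesian_topk`, and the flat indexing are illustrative assumptions, and the router in the paper may differ.

```python
import heapq
from itertools import product


def cartesian_topk(row_scores, col_scores, k):
    """Top-k experts on an implicit grid of len(row_scores) * len(col_scores).

    Assumption (not from the paper): score(i, j) = row_scores[i] + col_scores[j].
    Under that assumption the global top-k always lies inside
    (top-k rows) x (top-k columns), so the full grid is never scored.
    """
    n_cols = len(col_scores)
    # Score only sqrt(N)-sized lists instead of all N experts.
    top_rows = heapq.nlargest(k, range(len(row_scores)), key=row_scores.__getitem__)
    top_cols = heapq.nlargest(k, range(n_cols), key=col_scores.__getitem__)
    # Combine at most k * k candidates and keep the best k.
    candidates = [
        (row_scores[i] + col_scores[j], i * n_cols + j)
        for i, j in product(top_rows, top_cols)
    ]
    return heapq.nlargest(k, candidates)
```

With 4 row scores and 4 column scores the grid addresses 16 experts, yet only 8 scores are computed and at most `k * k` sums are compared.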
### [41] [CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering](https://arxiv.org/abs/2602.05728)
*Hao Yang,Zhiyu Yang,Xupeng Zhang,Wei Wei,Yunjie Zhang,Lin Yang*
Main category: cs.CL
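TL;DR: CompactRAG is an efficient multi-hop retrieval-augmented generation framework that decouples offline corpus restructuring from online reasoning. It builds an atomic QA knowledge base and uses only two LLM calls at inference, achieving high accuracy with low token consumption.
Details
Motivation: Existing multi-hop RAG systems are inefficient because of repeated LLM calls, high token consumption, and unstable entity grounding across hops.
Method: Offline, an LLM converts the corpus once into a knowledge base of atomic question-answer pairs. Online, complex queries are decomposed and rewritten to preserve entity consistency, then resolved by dense retrieval and RoBERTa-based answer extraction, with only two LLM calls in total.
Result: On HotpotQA, 2WikiMultiHopQA, and MuSiQue, CompactRAG reaches accuracy comparable to iterative RAG while substantially reducing token consumption.
Conclusion: CompactRAG offers a cost-efficient and practical approach to multi-hop, knowledge-intensive question answering.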
Abstract: Retrieval-augmented generation (RAG) has become a key paradigm for knowledge-intensive question answering. However, existing multi-hop RAG systems remain inefficient, as they alternate between retrieval and reasoning at each step, resulting in repeated LLM calls, high token consumption, and unstable entity grounding across hops. We propose CompactRAG, a simple yet effective framework that decouples offline corpus restructuring from online reasoning.
In the offline stage, an LLM reads the corpus once and converts it into an atomic QA knowledge base, which represents knowledge as minimal, fine-grained question-answer pairs. In the online stage, complex queries are decomposed and carefully rewritten to preserve entity consistency, and are resolved through dense retrieval followed by RoBERTa-based answer extraction. Notably, during inference, the LLM is invoked only twice in total - once for sub-question decomposition and once for final answer synthesis - regardless of the number of reasoning hops.
Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue demonstrate that CompactRAG achieves competitive accuracy while substantially reducing token consumption compared to iterative RAG baselines, highlighting a cost-efficient and practical approach to multi-hop reasoning over large knowledge corpora. The implementation is available at GitHub.
### [42] [LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards](https://arxiv.org/abs/2602.05758)
*Bowen Ping,Zijun Chen,Yiyao Yu,Tingfeng Hui,Junchi Yan,Baobao Chang*
Main category: cs.CL
TL;DR: This paper proposes LongR, a framework that improves LLM performance on long-context reasoning through a dynamic "Think-and-Read" mechanism and a contextual density reward based on relative information gain.
Details
Motivation: Existing reinforcement learning methods that rely on sparse, outcome-only rewards yield limited gains in long-context reasoning, because such coarse signals cannot effectively guide complex reasoning.
Method: LongR is a unified framework that combines a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain that quantifies the utility of relevant documents.
Result: LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, and it is effective across RL algorithms such as DAPO and GSPO. Further analyses examine the effect of reasoning chain length and robustness to distractors.
Conclusion: Through fine-grained, context-aware rewards and a structured reasoning process, LongR substantially improves LLM performance and generalization on complex long-context reasoning tasks.
Abstract: Reinforcement Learning has emerged as a key driver for LLM reasoning. This capability is equally pivotal in long-context scenarios such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model's robustness against distractors.
### [43] [Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors](https://arxiv.org/abs/2602.05769)
*Adnan Al Ali,Jindřich Helcl,Jindřich Libovický*
Main category: cs.CL
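TL;DR: This paper revisits, in the Czech setting, whether detectors of LLM-generated text are systematically biased against texts written by non-native speakers. It finds that the perplexity of non-native texts is not lower than that of native texts, that current detectors do not rely on perplexity, and that they show no systematic bias against non-native speakers.
Details
Motivation: Prior work reported that perplexity-based LLM text detectors tend to misclassify texts by non-native speakers as AI-generated. This paper tests whether that conclusion still holds for Czech.
Method: In the Czech setting, the authors compare the perplexity of texts by native and non-native speakers, evaluate detectors from three families on both kinds of text, and examine the actual role of perplexity in contemporary detectors.
Result: (1) The perplexity of texts by non-native speakers of Czech is not lower than that of native speakers; (2) none of the three detector families shows systematic bias against non-native speakers; (3) contemporary detectors are effective without relying on perplexity.
Conclusion: The earlier conclusion that detectors are biased against non-native speakers because of perplexity does not hold in the Czech setting, and modern detectors have moved to more robust discrimination mechanisms.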
Abstract: LLM-based assistants have been widely popularised after the release of ChatGPT. Concerns have been raised about their misuse in academia, given the difficulty of distinguishing between human-written and generated text. To combat this, automated techniques have been developed and shown to be effective, to some extent. However, prior work suggests that these methods often falsely flag essays from non-native speakers as generated, due to their low perplexity extracted from an LLM, which is supposedly a key feature of the detectors. We revisit these statements two years later, specifically in the Czech language setting. We show that the perplexity of texts from non-native speakers of Czech is not lower than that of native speakers. We further examine detectors from three separate families and find no systematic bias against non-native speakers. Finally, we demonstrate that contemporary detectors operate effectively without relying on perplexity.
### [44] [Reinforcement World Model Learning for LLM-based Agents](https://arxiv.org/abs/2602.05842)
*Xiao Yu,Baolin Peng,Ruize Xu,Yelong Shen,Pengcheng He,Suman Nath,Nikhil Singh,Jiangfeng Gao,Zhou Yu*
Main category: cs.CL
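TL;DR: This paper proposes Reinforcement World Model Learning (RWML), a self-supervised method that strengthens the world-modeling ability of LLM-based agents by aligning simulated states with real environment states in a pre-trained embedding space. It improves action-consequence prediction and adaptation in dynamic environments, with significant gains on ALFWorld and τ² Bench.
Details
Motivation: In agentic tasks, LLMs struggle to anticipate the consequences of actions and to adapt to environment dynamics, so they need world-modeling capabilities.
Method: RWML trains with sim-to-real gap rewards in a self-supervised way so that the LLM learns an action-conditioned world model over textual states, aligning simulated next states with realized next states in a pre-trained embedding space. This avoids the semantic distortion and model collapse that token-level next-state prediction can cause.
Result: RWML yields significant gains over the base model on ALFWorld and τ² Bench. Combined with task-success rewards, it outperforms direct task-success reward RL by 6.9 and 5.7 points, respectively, while matching the performance of expert-data training.
Conclusion: RWML provides a robust, self-supervised training paradigm for world modeling that improves the generalization and adaptation of LLM agents in dynamic environments and is less susceptible to reward hacking than LLM-as-a-judge.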
Abstract: Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and $τ^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and $τ^2$ Bench respectively, while matching the performance of expert-data training.
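The idea of rewarding agreement between a simulated next state and the realized next state in an embedding space can be sketched as follows. The abstract does not specify the reward formula, so the use of cosine similarity is an assumption, and `toy_embed` is a hypothetical bag-of-words stand-in for the pre-trained embedding model.

```python
import math
from collections import Counter


def toy_embed(text):
    # Hypothetical stand-in for a pre-trained text encoder: word counts.
    return Counter(text.lower().split())


def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sim_to_real_reward(simulated_next_state, real_next_state, embed=toy_embed):
    """Assumed reward: similarity of predicted and observed next states.

    Higher values mean the model's internal simulation is closer to what
    the environment actually returned, regardless of exact wording order.
    """
    return cosine(embed(simulated_next_state), embed(real_next_state))
```

Because the comparison happens on embeddings rather than tokens, a paraphrased but semantically close prediction can still earn a high reward once `toy_embed` is swapped for a real sentence encoder.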
### [45] [OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions](https://arxiv.org/abs/2602.05843)
*Fangzhi Xu,Hang Yan,Qiushi Sun,Jinyang Wu,Zixian Huang,Muye Huang,Jingyang Gong,Zichen Ding,Kanzhi Cheng,Yian Wang,Xinyu Che,Zeyi Sun,Jian Zhang,Zhangyue Yin,Haoran Luo,Xuanjing Huang,Ben Kao,Jun Liu,Qika Lin*
Main category: cs.CL
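TL;DR: This paper introduces OdysseyArena, an evaluation framework for autonomous agents centered on long-horizon, active, and inductive interactions. It addresses the neglect, in existing evaluations, of agents' ability to discover latent transition laws from experience on their own.
Details
Motivation: Existing evaluations mainly adopt a deductive paradigm and overlook the key ability of agents to induce latent transition laws from experience, which underlies foresight and strategic coherence.
Method: OdysseyArena formalizes and instantiates four primitives to build interactive environments that support inductive learning. OdysseyArena-Lite (120 tasks) provides standardized benchmarking, and OdysseyArena-Challenge (extreme interaction horizons, e.g., > 200 steps) provides stress testing.
Result: Experiments on 15+ leading LLMs show that even frontier models fall short in inductive scenarios, revealing a critical bottleneck for autonomous discovery in complex environments.
Conclusion: OdysseyArena offers a new benchmark for evaluating agents' inductive ability and long-horizon exploration, pointing toward truly autonomous intelligence.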
Abstract: The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey-Arena
### [46] [RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference](https://arxiv.org/abs/2602.05853)
*Siran Liu,Guoxia Wang,Sa Wang,Jinle Zeng,HaoYang Xie,Siyu Lou,JiaBin Yang,DianHai Yu,Haifeng Wang,Chao Yang*
Main category: cs.CL
TL;DR: This paper proposes RRAttention, a dynamic sparse attention mechanism whose round-robin sampling strategy preserves query independence while enabling efficient global pattern discovery, substantially reducing computational cost and improving long-context processing.
Details
Motivation: Existing dynamic sparse attention methods face fundamental trade-offs (requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead), which makes it hard to achieve both efficiency and performance.
Method: RRAttention uses a per-head round-robin sampling strategy that rotates the query sampling positions of attention heads within each stride, combined with stride-level aggregation and adaptive Top-τ selection for dynamic sparsification.
Result: On HELMET and Video-MME, RRAttention recovers over 99% of full-attention performance while computing only half of the attention blocks, achieves a 2.4× speedup at 128K context length, and outperforms existing dynamic sparse attention methods.
Conclusion: RRAttention is at once input-adaptive, preprocessing-free, globally evaluable, query-independent, and low-overhead, offering an efficient and practical attention alternative for long-context large models.
Abstract: The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head round-robin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$τ$ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving 2.4$\times$ speedup at 128K context length and outperforming existing dynamic sparse attention methods.
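The rotation of query sampling positions across heads can be pictured with a small sketch. It assumes that head `h` samples the query at offset `h mod S` inside every stride of length `S`; the exact shift rule and the function name `sampled_query_positions` are illustrative assumptions rather than the paper's specification.

```python
def sampled_query_positions(seq_len, stride, head):
    """Query positions sampled by one attention head.

    Assumed rule: head h picks offset (h mod stride) within each stride,
    so each head scores only about seq_len / stride queries, while any
    `stride` consecutive heads together cover every position.
    """
    offset = head % stride
    return list(range(offset, seq_len, stride))
```

Under this rule no single head sees every query, but the per-head shifts ensure that the heads jointly observe the whole sequence, which is one way to read the claim of global pattern discovery at reduced cost.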
### [47] [xList-Hate: A Checklist-Based Framework for Interpretable and Generalizable Hate Speech Detection](https://arxiv.org/abs/2602.05874)
*Adrián Girón,Pablo Miralles,Javier Huertas-Tato,Sergio D'Antonio,David Camacho*
Main category: cs.CL
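TL;DR: This paper proposes xList-Hate, a diagnostic framework that decomposes hate speech detection into a checklist of explicit, concept-level questions grounded in normative criteria. An LLM answers each question and an interpretable decision tree aggregates the answers, improving cross-domain robustness, resilience to annotation noise, and fine-grained interpretability.
Details
Motivation: Existing supervised models treat hate speech detection as a single binary classification task, so they tend to overfit dataset-specific definitions and lack robustness under domain shift and annotation noise.
Method: xList-Hate (1) designs a concept-level checklist grounded in widely shared normative criteria; (2) has an LLM answer each question independently, producing binary diagnostic signals; and (3) aggregates the signals with a lightweight, fully interpretable decision tree to produce the final prediction.
Result: Across multiple benchmarks and model families, compared with zero-shot LLM classification and supervised fine-tuning, the method consistently improves cross-dataset robustness and relative performance under domain shift, is less sensitive to certain forms of annotation inconsistency and contextual ambiguity, and supports fine-grained interpretability analysis.
Conclusion: Reframing hate speech detection as a diagnostic reasoning task rather than end-to-end classification yields a more robust, explainable, and extensible approach to content moderation.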
Abstract: Hate speech detection is commonly framed as a direct binary classification problem despite being a composite concept defined through multiple interacting factors that vary across legal frameworks, platform policies, and annotation guidelines. As a result, supervised models often overfit dataset-specific definitions and exhibit limited robustness under domain shift and annotation noise.
We introduce xList-Hate, a diagnostic framework that decomposes hate speech detection into a checklist of explicit, concept-level questions grounded in widely shared normative criteria. Each question is independently answered by a large language model (LLM), producing a binary diagnostic representation that captures hateful content features without directly predicting the final label. These diagnostic signals are then aggregated by a lightweight, fully interpretable decision tree, yielding transparent and auditable predictions.
We evaluate xList-Hate across multiple hate speech benchmarks and model families, comparing it against zero-shot LLM classification and in-domain supervised fine-tuning. While supervised methods typically maximize in-domain performance, xList-Hate consistently improves cross-dataset robustness and relative performance under domain shift. In addition, qualitative analysis of disagreement cases provides evidence that the framework can be less sensitive to certain forms of annotation inconsistency and contextual ambiguity. Crucially, the approach enables fine-grained interpretability through explicit decision paths and factor-level analysis.
Our results suggest that reframing hate speech detection as a diagnostic reasoning task, rather than a monolithic classification problem, provides a robust, explainable, and extensible alternative for content moderation.
### [48] [EuroLLM-22B: Technical Report](https://arxiv.org/abs/2602.05879)
*Miguel Moura Ramos,Duarte M. Alves,Hippolyte Gisserot-Boukhlef,João Alves,Pedro Henrique Martins,Patrick Fernandes,José Pombal,Nuno M. Guerreiro,Ricardo Rei,Nicolas Boizard,Amin Farajian,Mateusz Klimaszewski,José G. C. de Souza,Barry Haddow,François Yvon,Pierre Colombo,Alexandra Birch,André F. T. Martins*
Main category: cs.CL
TL;DR: EuroLLM-22B is a new multilingual large language model trained from scratch to support all 24 official EU languages plus 11 others, aiming to address the underrepresentation of European languages in existing open LLMs; it shows competitive performance on multilingual benchmarks and its models, data, and code are publicly released.
Details
Motivation: To address the underrepresentation and underservice of European languages in existing open large language models.
Method: Training a large language model from scratch with a custom tokenizer, specific architectural design, rigorous multilingual data filtering, and comprehensive pretraining and instruction-tuning procedures.
Result: EuroLLM-22B achieves strong and competitive performance across multilingual benchmarks in reasoning, instruction following, and translation.
Conclusion: EuroLLM-22B successfully fills a critical gap for European multilingual AI capabilities and supports future research through full open release of models, datasets, and code.
Abstract: This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.
### [49] [Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models](https://arxiv.org/abs/2602.05897)
*Shuo Nie,Hexuan Deng,Chao Wang,Ruiyu Fang,Xuebo Liu,Shuangyong Song,Yu Li,Min Zhang,Xuelong Li*
Main category: cs.CL
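TL;DR: This paper proposes FaithRL, which uses step-level faithfulness rewards and an implicit truncated resampling strategy to reduce faithfulness hallucinations in the chain-of-thought reasoning of small reasoning models.
Details
Motivation: Small reasoning models (SRMs) in resource-constrained settings are prone to faithfulness hallucinations in intermediate reasoning steps, and existing online RL methods based on outcome rewards or coarse-grained CoT evaluation can wrongly reinforce reasoning paths that are unfaithful but reach a correct answer.
Method: Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL) introduces step-level faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes.
Result: Experiments on multiple SRMs and Open-Book QA benchmarks show that FaithRL consistently reduces hallucinations in both the CoT and final answers, making reasoning more faithful and reliable.
Conclusion: Through fine-grained step-level supervision and contrastive signals, FaithRL improves the faithfulness of small models' chain-of-thought reasoning, offering a new approach to trustworthy reasoning in resource-constrained settings.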
Abstract: As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.
### [50] [Codified Finite-state Machines for Role-playing](https://arxiv.org/abs/2602.05905)
*Letian Peng,Yupeng Hou,Kun Zhou,Jingbo Shang*
Main category: cs.CL
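TL;DR: This paper proposes Codified Finite-State Machines (CFSMs) and their probabilistic extension, CPFSMs, which use an LLM to extract states and transitions automatically from character profiles, improving the consistency and interpretability of latent-state modeling in LLM role-playing.
Details
Motivation: Existing prompting-based methods struggle to model the latent states that drive character interaction, while traditional hand-crafted finite-state machines adapt poorly to the open-ended semantic space of role-playing.
Method: The CFSM framework uses an LLM to codify textual character profiles into finite-state machines automatically. CPFSM extends it by modeling state transitions as probability distributions.
Result: In synthetic evaluations and real-world RP scenarios, CFSM and CPFSM outperform generally applied baselines, verifying their effectiveness in both structured tasks and open-ended stochastic state exploration.
Conclusion: CFSM/CPFSM give LLM role-playing an interpretable, adaptive way to model latent states that can also handle uncertainty.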
Abstract: Modeling latent character states is crucial for consistent and engaging role-playing (RP) with large language models (LLMs). Yet, existing prompting-based approaches mainly capture surface actions, often failing to track the latent states that drive interaction. We revisit finite-state machines (FSMs), long used in game design to model state transitions. While effective in small, well-specified state spaces, traditional hand-crafted, rule-based FSMs struggle to adapt to the open-ended semantic space of RP. To address this, we introduce Codified Finite-State Machines (CFSMs), a framework that automatically codifies textual character profiles into FSMs using LLM-based coding. CFSMs extract key states and transitions directly from the profile, producing interpretable structures that enforce character consistency. To further capture uncertainty and variability, we extend CFSMs into Codified Probabilistic Finite-State Machines (CPFSMs), where transitions are modeled as probability distributions over states. Through both synthetic evaluations and real-world RP scenarios in established artifacts, we demonstrate that CFSM and CPFSM outperform generally applied baselines, verifying effectiveness not only in structured tasks but also in open-ended stochastic state exploration.
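A probabilistic finite-state machine of the kind described, where each transition is a distribution over next states, might look like the sketch below. The class name, the `(state, event)` table layout, and the example states are hypothetical; in the paper the states and transitions are produced by LLM-based coding from a character profile, whereas here they would be written by hand.

```python
import random


class ProbabilisticFSM:
    """Minimal probabilistic FSM: (state, event) -> {next_state: probability}."""

    def __init__(self, transitions, state):
        self.transitions = transitions
        self.state = state

    def step(self, event, rng=random):
        # Look up the distribution for the current state and incoming event.
        dist = self.transitions.get((self.state, event))
        if dist is None:
            # Unknown event: the character stays in its current state.
            return self.state
        states = list(dist)
        weights = [dist[s] for s in states]
        # Sample the next state according to the transition probabilities.
        self.state = rng.choices(states, weights=weights, k=1)[0]
        return self.state
```

A deterministic CFSM is the special case in which every distribution puts probability 1.0 on a single next state; passing a seeded `random.Random` as `rng` makes sampled trajectories reproducible.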
### [51] [KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs](https://arxiv.org/abs/2602.05929)
*Jian Chen,Zhuoran Wang,Jiayu Qin,Ming Li,Meng Wang,Changyou Chen,Yin Chen,Qizhen Weng,Yirui Liu*
Main category: cs.CL
TL;DR: This paper proposes KV-CoRE, which uses SVD to quantify the data-dependent low-rank compressibility of KV-caches, and builds on the Normalized Effective Rank to establish the first large-scale benchmark of KV-cache compressibility in LLMs, revealing systematic links to model architecture, training data, and language coverage.
Details
Motivation: Existing KV-cache compression methods neglect the data-dependent nature of the cache and its variation across layers, and no systematic evaluation of compressibility across models, data, and languages exists.
Method: KV-CoRE (KV-cache Compressibility by Rank Evaluation) is an SVD-based method that measures compressibility through the optimal low-rank approximation under the Frobenius norm. It is gradient-free and incremental, and it uses the Normalized Effective Rank as its compressibility metric.
Result: Across multiple models and datasets spanning five English domains and sixteen languages, KV-cache compressibility follows systematic patterns, and the Normalized Effective Rank correlates strongly with performance degradation under compression.
Conclusion: The study establishes a principled evaluation framework and the first large-scale benchmark of KV-cache compressibility in LLMs, informing dynamic, data-aware compression strategies and data-centric model development.
Abstract: Large language models rely on KV-caches to avoid redundant computation during autoregressive decoding, but as context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of KV-caches and their variation across layers. We introduce KV-CoRE (KV-cache Compressibility by Rank Evaluation), an SVD-based method for quantifying the data-dependent low-rank compressibility of KV-caches. KV-CoRE computes the optimal low-rank approximation under the Frobenius norm and, being gradient-free and incremental, enables efficient dataset-level, layer-wise evaluation. Using this method, we analyze multiple models and datasets spanning five English domains and sixteen languages, uncovering systematic patterns that link compressibility to model architecture, training data, and language coverage. As part of this analysis, we employ the Normalized Effective Rank as a metric of compressibility and show that it correlates strongly with performance degradation under compression. Our study establishes a principled evaluation framework and the first large-scale benchmark of KV-cache compressibility in LLMs, offering insights for dynamic, data-aware compression and data-centric model development.
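As a rough illustration of the two quantities mentioned above, the sketch below works directly on singular values that are assumed to come from an SVD of a key or value matrix. The effective rank is taken as the exponential of the entropy of the normalized singular values, divided by the number of singular values; that particular normalization is an assumption and may differ from the paper's definition. The truncation error follows the standard result that the best rank-r approximation under the Frobenius norm discards the smallest singular values.

```python
import math


def normalized_effective_rank(singular_values):
    """exp(entropy of normalized singular values) / number of singular values.

    Assumed definition: values near 1.0 mean energy is spread evenly
    (hard to compress); values near 1/n mean one direction dominates.
    """
    positive = [s for s in singular_values if s > 0]
    if not positive:
        return 0.0
    total = sum(positive)
    entropy = -sum((s / total) * math.log(s / total) for s in positive)
    return math.exp(entropy) / len(singular_values)


def relative_truncation_error(singular_values, rank):
    """Relative Frobenius error of the best rank-`rank` approximation."""
    ordered = sorted(singular_values, reverse=True)
    full = math.sqrt(sum(s * s for s in ordered))
    if full == 0:
        return 0.0
    # Energy left in the discarded (smallest) singular values.
    tail = math.sqrt(sum(s * s for s in ordered[rank:]))
    return tail / full
```

Both functions need only the singular values, so they can be evaluated per layer and per dataset without gradients.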
### [52] [Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions](https://arxiv.org/abs/2602.05932)
*Léo Labat,Etienne Ollion,François Yvon*
Main category: cs.CL
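TL;DR: This paper studies whether multilingual large language models (LLMs) give different answers to value-laden multiple-choice questions (MCQs) depending on the language of the question. It finds that larger, instruction-tuned models are more consistent overall, but that the robustness of their responses varies by question, with clear language-specific behavior on some questions.
Details
Motivation: To examine the cross-lingual consistency of multilingual LLM responses to value-laden MCQs, that is, whether they answer consistently like theoretical "polyglots" or express different values depending on the language, like a multitude of monolingual models.
Method: Builds MEVS, a new corpus of human-translated survey questions aligned across 8 European languages, and tests more than 30 multilingual LLMs of different sizes, manufacturers, and alignment-fine-tuning status under systematic prompt variations (answer order, symbol type, tail character).
Result: Larger, instruction-tuned models are more consistent overall, but robustness differs markedly across questions: some MCQs elicit highly consistent answers, while others leave answers sharply divided. All consistent, instruction-tuned models show language-specific behavior on some questions.
Conclusion: Multilingual LLMs are not fully consistent expressers of values across languages. Their value responses are question-dependent and language-selective, which suggests that preference fine-tuning may have a selective effect that warrants further study.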
Abstract: Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e. behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.
### [53] [Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training](https://arxiv.org/abs/2602.05940)
*Junxiao Liu,Zhijun Wang,Yixiao Li,Zhejian Lai,Liqian Huang,Xin Huang,Xue Han,Junlan Feng,Shujian Huang*
Main category: cs.CL
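TL;DR: This paper proposes TRIT, a framework that combines translation training with multilingual reasoning training. Without external feedback or additional multilingual data, it improves both multilingual question understanding and response generation, with substantial gains in accuracy and language consistency on benchmarks such as MMATH.
Details
Motivation: Long reasoning models perform poorly in multilingual settings: they often reason in English for non-English questions, and accuracy drops sharply when they are forced to reason in the question language. The root cause is limited ability in both multilingual question understanding and multilingual reasoning.
Method: Proposes TRIT (Translation-Reasoning Integrated Training), a self-improving framework that jointly optimizes translation training and multilingual reasoning training without relying on external feedback or additional multilingual data.
Result: On MMATH, TRIT outperforms multiple baselines by an average of 7 percentage points, improving answer correctness and language consistency. Cross-lingual question alignment improves by over 10 percentage points, and translation quality for mathematical questions and general-domain text improves by up to 8.4 COMET points (FLORES-200).
Conclusion: Integrating translation training strengthens multilingual understanding and reasoning together, making TRIT an efficient, self-contained approach to improving multilingual long reasoning.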
Abstract: Long reasoning models often struggle in multilingual settings: they tend to reason in English for non-English questions; when constrained to reasoning in the question language, accuracies drop substantially. The struggle is caused by the limited abilities for both multilingual question understanding and multilingual reasoning. To address both problems, we propose TRIT (Translation-Reasoning Integrated Training), a self-improving framework that integrates the training of translation into multilingual reasoning. Without external feedback or additional multilingual data, our method jointly enhances multilingual question understanding and response generation. On MMATH, our method outperforms multiple baselines by an average of 7 percentage points, improving both answer correctness and language consistency. Further analysis reveals that integrating translation training improves cross-lingual question alignment by over 10 percentage points and enhances translation quality for both mathematical questions and general-domain text, with gains up to 8.4 COMET points on FLORES-200.
### [54] [Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space](https://arxiv.org/abs/2602.05971)
*Felipe D. Toro-Hernández,Jesuino Vieira Filho,Rodrigo M. Cabral-Carvalho*
Main category: cs.CL
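TL;DR: This paper proposes a new framework that models concept production as navigation through a semantic embedding space. It builds participant-specific semantic trajectories from cumulative embeddings, extracts several geometric and dynamical metrics, and validates their effectiveness and robustness across languages, tasks, and clinical groups.
Details
Motivation: To investigate how humans navigate a structured, dynamic semantic knowledge space to retrieve and manipulate meaning, and to address the shortcomings of traditional linguistic pre-processing, which is labor-intensive and lacks geometric interpretability.
Method: Using several transformer text embedding models, constructs participant-specific cumulative semantic trajectories and extracts geometric and dynamical metrics such as distance (to next word, to centroid), entropy, velocity, and acceleration. Cumulative and non-cumulative embeddings are compared across trajectory lengths, and the framework is evaluated on multilingual, multi-task datasets.
Result: The framework distinguishes clinical groups and concept types. Cumulative embeddings suit longer trajectories, while non-cumulative embeddings suit shorter ones. Results are highly consistent across embedding models, indicating cross-model robustness of the representations.
Conclusion: Formalizing semantic navigation as structured trajectories in embedding space bridges cognitive modeling and learned representations, and offers a low-intervention, quantifiable paradigm for clinical research, cross-linguistic analysis, and the assessment of artificial cognition.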
Abstract: Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, Property listing task in Italian, and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines. By framing semantic navigation as a structured trajectory through embedding space, we bridge cognitive modeling with learned representation, thereby establishing a pipeline for quantifying semantic representation dynamics with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition.
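The trajectory metrics named in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the cumulative embedding is taken to be the running mean of the word embeddings produced so far, velocity and acceleration are taken as first and second finite differences, and the entropy metric is omitted because the abstract does not say how it is defined. The function name `trajectory_metrics` is hypothetical.

```python
import numpy as np

def trajectory_metrics(word_embeddings, cumulative=True):
    """Geometric metrics for one participant's sequence of produced words.

    word_embeddings: array-like of shape (n_words, dim), in production order.
    cumulative: if True, each trajectory point is the running mean of all
                embeddings up to that word (an assumed form of 'cumulative').
    """
    e = np.asarray(word_embeddings, dtype=float)
    if cumulative:
        counts = np.arange(1, len(e) + 1)[:, None]
        points = np.cumsum(e, axis=0) / counts
    else:
        points = e
    velocity = np.diff(points, axis=0)        # step vectors between points
    acceleration = np.diff(velocity, axis=0)  # change in step vectors
    centroid = points.mean(axis=0)
    return {
        "distance_to_next": np.linalg.norm(velocity, axis=1),
        "distance_to_centroid": np.linalg.norm(points - centroid, axis=1),
        "speed": np.linalg.norm(velocity, axis=1),
        "acceleration_magnitude": np.linalg.norm(acceleration, axis=1),
    }
```

With unit time steps, speed coincides with distance to next. Keeping the velocity vectors rather than only their norms would preserve the directional information the abstract mentions.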
### [55] [DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs](https://arxiv.org/abs/2602.05992)
*Lizhuo Luo,Shenggui Li,Yonggang Wen,Tianwei Zhang*
Main category: cs.CL
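TL;DR: This paper proposes Dynamic Sliding Block (DSB), a training-free scheduling method that improves parallel text generation in diffusion large language models (dLLMs). It adapts the block size to semantic difficulty and, combined with a dedicated DSB Cache mechanism, substantially improves inference efficiency without sacrificing quality.
Details
Motivation: Existing fixed-block schedules ignore differences in semantic difficulty, causing premature commitments at uncertain positions and delayed generation at easy positions, which hurts both generation quality and inference efficiency.
Method: Proposes the Dynamic Sliding Block (DSB) scheduling method, which replaces fixed blocks with a sliding block of dynamic size, and designs DSB Cache, an accompanying training-free KV-cache mechanism.
Result: Across multiple models and benchmarks, DSB combined with DSB Cache consistently improves the generation quality and inference efficiency of dLLMs.
Conclusion: A block schedule that adapts dynamically to semantic difficulty is more reliable and efficient than fixed blocks, and DSB with its cache mechanism offers a practical, plug-and-play performance optimization for dLLMs.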
Abstract: Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.
### [56] [A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies](https://arxiv.org/abs/2602.06015)
*Panagiotis Kaliosis,Adithya V Ganesan,Oscar N. E. Kjell,Whitney Ringwald,Scott Feltman,Melissa A. Carr,Dimitris Samaras,Camilo Ruggero,Benjamin J. Luft,Roman Kotov,Andrew H. Schwartz*
Main category: cs.CL
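TL;DR: This study systematically evaluates how accurately 11 large language models (LLMs) estimate PTSD severity in a zero-shot setting. Contextual knowledge (such as construct definitions and narrative context), reasoning effort, model size, and ensembling all significantly affect performance, and the best results come from ensembling a supervised model with zero-shot LLMs.
Details
Motivation: Although LLMs are increasingly used for zero-shot mental health assessment, the key factors that affect their accuracy remain unclear.
Method: Using clinical natural language narratives and self-reported PTSD severity scores from 1,437 individuals, systematically evaluates 11 state-of-the-art LLMs. The variables include contextual knowledge (subscale definitions, distribution summary, interview questions) and modeling strategies (zero-shot vs. few-shot, reasoning effort, model size, structured subscale prediction vs. direct scalar prediction, output rescaling, and 9 ensemble methods).
Result: (a) LLMs are most accurate when given detailed construct definitions and narrative context; (b) more reasoning effort improves estimation accuracy; (c) the performance of open-weight models (Llama, Deepseek) plateaus beyond 70B parameters, while closed-weight models (o3-mini, gpt-5) keep improving with newer generations; (d) ensembling a supervised model with zero-shot LLMs performs best.
Conclusion: When deploying LLMs for mental health assessment, the choice of contextual knowledge and the design of the modeling strategy are critical.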
Abstract: Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge of what factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge like subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs. few-shot, amount of reasoning effort, model sizes, structured subscales vs. direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) performance of open-weight models (Llama, Deepseek) plateaus beyond 70B parameters, while closed-weight models (o3-mini, gpt-5) improve with newer generations; and (d) best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest that the choice of contextual knowledge and modeling strategies is important for deploying LLMs to accurately assess mental health.
### [57] [Multi-Token Prediction via Self-Distillation](https://arxiv.org/abs/2602.06019)
*John Kirchenbauer,Abhimanyu Hans,Brian Bartoldson,Micah Goldblum,Ashwinee Panda,Tom Goldstein*
Main category: cs.CL
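TL;DR: This paper proposes an online distillation method that needs neither an auxiliary model nor a complex inference pipeline. It converts a pretrained autoregressive language model into a fast standalone model that predicts multiple tokens at once while keeping the original implementation unchanged, achieving a 3x speedup on GSM8K with an accuracy drop of less than 5%.
Details
Motivation: Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculators and building complex inference pipelines, which makes deployment costly and inflexible.
Method: Uses a simple online distillation objective to turn a pretrained autoregressive language model into one that predicts multiple tokens directly, without changing the original model structure or deployment path and without an auxiliary verifier or specialized inference code.
Result: On GSM8K, decoding is more than 3x faster, with the accuracy drop kept within 5%.
Conclusion: The method offers a lightweight, plug-and-play way to accelerate language model inference that combines efficiency with ease of deployment.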
Abstract: Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a pretrained autoregressive language model from a slow single next token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains the exact same implementation as the pretrained initial checkpoint and is deployable without the addition of any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that can decode more than $3\times$ faster on average at $<5\%$ drop in accuracy relative to single token decoding performance.
### [58] [Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory](https://arxiv.org/abs/2602.06025)
*Haozhen Zhang,Haodong Yue,Tao Feng,Quanyu Long,Jianzhu Bao,Bowen Jin,Weizhi Zhang,Xiao Li,Jiaxuan You,Chengwei Qin,Wenya Wang*
Main category: cs.CL
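TL;DR: This paper proposes BudgetMem, a runtime memory framework for LLM agents that supports explicit, query-aware control of the performance-cost trade-off. A lightweight reinforcement-learning router routes across memory modules offered at different budget tiers (low/mid/high), substantially improving the accuracy-cost frontier.
Details
Motivation: Existing LLM agent memory systems mostly rely on offline, query-agnostic memory construction, which is inefficient and can lose critical information. Runtime memory utilization is a more natural alternative, but it often incurs high overhead and lacks explicit control over the performance-cost trade-off.
Method: Proposes the BudgetMem framework, which structures memory processing as a set of modules, each offered in three budget tiers (Low/Mid/High). A lightweight neural router trained with reinforcement learning performs budget-tier routing. Three strategies for defining the tiers are explored: implementation complexity, reasoning behavior, and module capacity.
Result: On the LoCoMo, LongMemEval, and HotpotQA benchmarks, BudgetMem surpasses strong baselines in the high-budget setting and delivers better accuracy-cost frontiers under tighter budgets. The analysis shows how the suitability of each tiering strategy differs across budget regimes.
Conclusion: BudgetMem gives LLM agents a controllable, efficient, query-aware runtime memory mechanism and demonstrates the effectiveness and flexibility of explicit budget control in practice.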
Abstract: Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.
### [59] [DFlash: Block Diffusion for Flash Speculative Decoding](https://arxiv.org/abs/2602.06036)
*Jian Chen,Yesheng Liang,Zhijian Liu*
Main category: cs.CL
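TL;DR: This paper proposes DFlash, a speculative decoding framework built on a lightweight block diffusion model. It generates draft tokens in parallel and conditions the drafter on context features from the target model, substantially speeding up large language model inference.
Details
Motivation: Autoregressive LLM inference suffers from high latency and poor GPU utilization. Existing speculative decoding still relies on sequential draft generation, and diffusion models, although parallel, underperform.
Method: Proposes the DFlash framework, which uses a lightweight block diffusion model to generate draft tokens in parallel in a single forward pass, and feeds context features extracted from the target model to the draft model as conditioning input.
Result: Achieves over 6x lossless acceleration across a range of models and tasks, with up to 2.5x higher speedup than the state-of-the-art method EAGLE-3.
Conclusion: By combining the parallelism of diffusion models with the context awareness of the target model, DFlash improves both the quality and the efficiency of speculative decoding, opening a new route to efficient LLM inference.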
Abstract: Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
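The verification step that makes speculative decoding lossless under greedy decoding can be sketched as below. This is the generic accept-longest-matching-prefix rule, not DFlash itself: the block diffusion drafter and the target-feature conditioning are not modeled, and `target_next_token` is a hypothetical stand-in for the target model's greedy prediction. A real system scores all draft positions in one parallel forward pass, whereas this sketch calls the target sequentially for clarity.

```python
from typing import Callable, List, Sequence

def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next_token: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens to append after checking a drafted block.

    Draft tokens are accepted while they match the target model's greedy
    choice. At the first mismatch the target's token is used instead, so the
    result equals what plain greedy decoding with the target would produce.
    """
    accepted: List[int] = []
    for token in draft:
        expected = target_next_token(list(prefix) + accepted)
        if token != expected:
            accepted.append(expected)  # correct the first wrong draft token
            return accepted
        accepted.append(token)
    # Every draft token matched: the target contributes one extra token.
    accepted.append(target_next_token(list(prefix) + accepted))
    return accepted
```

Each call yields at least one token and at most `len(draft) + 1`, which is why a higher acceptance rate translates directly into a larger speedup.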
# cs.CV [[Back]](#toc)
### [60] [SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy](https://arxiv.org/abs/2602.04994)
*Zhuosen Bao,Xia Du,Zheng Lin,Jizhe Zhou,Zihan Fang,Jiening Wu,Yuxin Zhang,Zhe Chen,Chi-man Pun,Wei Ni,Jun Luo*
Main category: cs.CV
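TL;DR: SIDeR is a semantic-decoupling-driven framework for unrestricted face privacy protection. It uses a diffusion model to separate identity features from visual appearance, generating highly natural, diverse anonymized faces that remain machine-recognizable, and supports password-driven restoration of the original image.
Details
Motivation: As facial recognition becomes deeply integrated into online banking, identity verification, and other networked services, effectively decoupling identity information from visual representation during image storage and transmission has become a key challenge for privacy protection.
Method: Proposes the SIDeR framework, which decomposes a facial image into a machine-recognizable identity feature vector and a visually perceptible semantic appearance component. Semantic-guided recomposition in the latent space of a diffusion model generates adversarial samples that are visually anonymous but identity-consistent. Momentum-driven unrestricted perturbation optimization and a semantic-visual balancing factor improve diversity and naturalness, and the original image can be restored after password verification.
Result: On the CelebA-HQ and FFHQ datasets, SIDeR reaches a 99% attack success rate in black-box scenarios and improves PSNR-based restoration quality by 41.28% over baseline methods.
Conclusion: SIDeR makes significant progress in jointly achieving visual anonymity, machine-level identity consistency, image naturalness, diversity, and reversible restoration, offering a new paradigm for face privacy protection in unrestricted settings.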
Abstract: With the deep integration of facial recognition into online banking, identity verification, and other networked services, achieving effective decoupling of identity information from visual representations during image storage and transmission has become a critical challenge for privacy protection. To address this issue, we propose SIDeR, a Semantic decoupling-driven framework for unrestricted face privacy protection. SIDeR decomposes a facial image into a machine-recognizable identity feature vector and a visually perceptible semantic appearance component. By leveraging semantic-guided recomposition in the latent space of a diffusion model, it generates visually anonymous adversarial faces while maintaining machine-level identity consistency. The framework incorporates momentum-driven unrestricted perturbation optimization and a semantic-visual balancing factor to synthesize multiple visually diverse, highly natural adversarial samples. Furthermore, for authorized access, the protected image can be restored to its original form when the correct password is provided. Extensive experiments on the CelebA-HQ and FFHQ datasets demonstrate that SIDeR achieves a 99% attack success rate in black-box scenarios and outperforms baseline methods by 41.28% in PSNR-based restoration quality.
### [61] [UniTrack: Differentiable Graph Representation Learning for Multi-Object Tracking](https://arxiv.org/abs/2602.05037)
*Bishoy Galoaa,Xiangyu Bai,Utsav Nandi,Sai Siddhartha Vivek Dhir Rangoju,Somaieh Amraee,Sarah Ostadabbas*
Main category: cs.CV
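TL;DR: UniTrack is a plug-and-play graph-theoretic loss function that directly optimizes detection accuracy, identity preservation, and spatiotemporal consistency in multi-object tracking (MOT) through unified differentiable learning. It requires no changes to existing tracking architectures and substantially reduces ID switches while improving IDF1 and MOTA.
Details
Motivation: Existing graph-based MOT methods require redesigning the tracking architecture and lack a general, plug-and-play, end-to-end training objective. A differentiable loss that jointly models detection, identity continuity, and motion consistency is needed.
Method: Proposes UniTrack, a unified loss function based on differentiable graph representation learning that models detection accuracy, identity preservation, and spatiotemporal consistency as a single differentiable graph optimization objective compatible with any MOT model.
Result: Validated on multiple models, including Trackformer, MOTR, FairMOT, ByteTrack, GTR, and MOTE, and on benchmarks such as SportsMOT: ID switches drop by up to 53%, IDF1 improves by 12%, and GTR gains 9.7% MOTA on SportsMOT.
Conclusion: As a general, plug-and-play graph loss, UniTrack improves the tracking performance of a wide range of MOT models, demonstrating the effectiveness and generality of unified differentiable graph optimization for multi-object tracking.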
Abstract: We present UniTrack, a plug-and-play graph-theoretic loss function designed to significantly enhance multi-object tracking (MOT) performance by directly optimizing tracking-specific objectives through unified differentiable learning. Unlike prior graph-based MOT methods that redesign tracking architectures, UniTrack provides a universal training objective that integrates detection accuracy, identity preservation, and spatiotemporal consistency into a single end-to-end trainable loss function, enabling seamless integration with existing MOT systems without architectural modifications. Through differentiable graph representation learning, UniTrack enables networks to learn holistic representations of motion continuity and identity relationships across frames. We validate UniTrack across diverse tracking models and multiple challenging benchmarks, demonstrating consistent improvements across all tested architectures and datasets including Trackformer, MOTR, FairMOT, ByteTrack, GTR, and MOTE. Extensive evaluations show up to 53% reduction in identity switches and 12% IDF1 improvements across challenging benchmarks, with GTR achieving peak performance gains of 9.7% MOTA on SportsMOT.
### [62] [VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models](https://arxiv.org/abs/2602.05049)
*Yiye Chen,Yanan Jian,Xiaoyi Dong,Shuxin Cao,Jing Wu,Patricio Vela,Benjamin E. Lundell,Dongdong Chen*
Main category: cs.CV
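TL;DR: This paper proposes a training framework that requires no architectural changes or additional data collection. It strengthens vision-action alignment through preference optimization and latent-space distillation, improving the visual conditioning and task performance of VLA models in both discrete and continuous action spaces.
Details
Motivation: Existing VLA models suffer from vision-action misalignment: action predictions depend only weakly on the current visual state, which makes outputs unreliable. The authors observe that successful rollouts show stronger visual dependence than failed ones, and therefore aim to strengthen visual conditioning explicitly.
Method: First aligns action prediction with visual input via preference optimization on a track-following surrogate task, then transfers the enhanced alignment to the instruction-following task through latent-space distillation during supervised fine-tuning.
Result: Improves visual conditioning and task performance for discrete OpenVLA, and yields consistent gains in the continuous OpenVLA-OFT setting.
Conclusion: Explicitly strengthening visual conditioning mitigates vision-action misalignment and improves the robustness and generalization of VLA models. The method is general and lightweight, and needs no architectural changes or new data.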
Abstract: Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite this success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment, where action predictions exhibit weak dependence on the current visual state, leading to unreliable action outputs. In this work, we study VLA models through the lens of visual conditioning and empirically show that successful rollouts consistently exhibit stronger visual dependence than failed ones. Motivated by this observation, we propose a training framework that explicitly strengthens visual conditioning in VLA models. Our approach first aligns action prediction with visual input via preference optimization on a track-following surrogate task, and then transfers the enhanced alignment to the instruction-following task through latent-space distillation during supervised finetuning. Without introducing architectural modifications or additional data collection, our method improves both visual conditioning and task performance for discrete OpenVLA, and further yields consistent gains when extended to the continuous OpenVLA-OFT setting. Project website: https://vista-vla.github.io/
### [63] [Food Portion Estimation: From Pixels to Calories](https://arxiv.org/abs/2602.05078)
*Gautham Vinod,Fengqing Zhu*
Main category: cs.CV
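TL;DR: This paper surveys strategies for food portion estimation in image-based dietary assessment, focusing on the challenge of inferring 3D food size from 2D images.
Details
Motivation: Image-based dietary assessment is important for the prevention and care of chronic diseases and obesity, but its core difficulty lies in accurately estimating the 3D size of food from 2D images.
Method: Surveys and analyzes a range of strategies, including auxiliary inputs such as depth maps and multi-view inputs, model-based approaches such as template matching, and deep learning methods that use monocular images or combine images with auxiliary inputs.
Result: Provides a systematic overview of the technical approaches currently used to improve food portion estimation accuracy, along with their strengths and weaknesses.
Conclusion: Although many methods have been proposed, further research is needed to improve robustness, generalization, and practical feasibility.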
Abstract: Reliance on images for dietary assessment is an important strategy to accurately and conveniently monitor an individual's health, making it a vital mechanism in the prevention and care of chronic diseases and obesity. However, image-based dietary assessment suffers from the difficulty of estimating the three-dimensional size of food from 2D image inputs. Many strategies have been devised to overcome this critical limitation, such as the use of auxiliary inputs like depth maps, multi-view inputs, or model-based approaches such as template matching. Deep learning also helps bridge the gap by using either monocular images or combinations of the image and the auxiliary inputs to precisely predict the output portion from the image input. In this paper, we explore the different strategies employed for accurate portion estimation.
### [64] [Visual concept ranking uncovers medical shortcuts used by large multimodal models](https://arxiv.org/abs/2602.05096)
*Joseph D. Janizek,Sonnet Xu,Junayd Lateef,Roxana Daneshjou*
Main category: cs.CV
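TL;DR: This paper proposes Visual Concept Ranking (VCR), a method for identifying important visual concepts in large multimodal models (LMMs). Applied to medical tasks such as skin cancer lesion classification, it reveals performance differences across demographic subgroups and the visual feature dependencies behind them.
Details
Motivation: Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings.
Method: Proposes the Visual Concept Ranking (VCR) method to identify important visual concepts in LMMs, and validates the visual feature dependency hypotheses it generates through manual interventions.
Result: Finds that LMMs show unexpected performance gaps between demographic subgroups on medical tasks such as skin lesion classification, and uses VCR to locate and validate the relevant visual feature dependencies.
Conclusion: VCR is an effective interpretability tool that helps audit and improve the fairness and robustness of multimodal models in high-risk settings such as healthcare.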
Abstract: Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings. We introduce a method for identifying important visual concepts within large multimodal models (LMMs) and use it to investigate the behaviors these models exhibit when prompted with medical tasks. We primarily focus on the task of classifying malignant skin lesions from clinical dermatology images, with supplemental experiments including both chest radiographs and natural images. After showing how LMMs display unexpected gaps in performance between different demographic subgroups when prompted with demonstrating examples, we apply our method, Visual Concept Ranking (VCR), to these models and prompts. VCR generates hypotheses related to different visual feature dependencies, which we are then able to validate with manual interventions.
### [65] [CLEAR-HPV: Interpretable Concept Discovery for HPV-Associated Morphology in Whole-Slide Histology](https://arxiv.org/abs/2602.05126)
*Weiyi Qin,Yingci Liu-Swetz,Shiwei Tan,Hao Wang*
Main category: cs.CV
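TL;DR: This paper proposes CLEAR-HPV, a framework that restructures the latent space of multiple instance learning (MIL) using attention. It automatically discovers morphologic concepts (such as keratinizing, basaloid, and stromal) without concept labels and produces interpretable concept-fraction vectors, combining predictive performance with histopathologic interpretability.
Details
Motivation: Existing attention-based MIL methods perform well at classifying HPV-related whole-slide histopathology images but offer little interpretability at the morphologic level.
Method: Proposes the CLEAR-HPV framework, which discovers unsupervised morphologic concepts in an attention-weighted latent space and generates spatial concept maps and compact concept-fraction vectors (e.g., 10 dimensions) in place of the original high-dimensional MIL embeddings (e.g., 1536 dimensions).
Result: CLEAR-HPV generalizes consistently across the TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC datasets and discovers keratinizing, basaloid, and stromal concepts. The concept-fraction vectors retain predictive power while greatly reducing dimensionality.
Conclusion: CLEAR-HPV is a general, backbone-agnostic framework that offers a new paradigm for HPV-related histopathology analysis, combining strong predictive performance with fine-grained morphologic interpretability.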
Abstract: Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV's concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.
### [66] [ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation](https://arxiv.org/abs/2602.05132)
*Jia Li,Wenjie Zhao,Shijian Deng,Bolin Lai,Yuheng Wu,Ruijia Chen,Jon E. Froehlich,Yuhang Zhao,Yapeng Tian*
Main category: cs.CV
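TL;DR: This paper proposes ARGaze, which casts online gaze estimation in egocentric video as autoregressive sequence prediction conditioned on visual features and a bounded history of gaze targets, substantially improving performance in the online setting.
Details
Motivation: Egocentric gaze estimation lacks explicit head or eye signals, so visual attention must be inferred from sparse, indirect cues such as hand-object interactions and scene saliency. Gaze also shows strong temporal continuity during goal-directed activities, so recent gaze locations serve as a strong prior for predicting the current one.
Method: Proposes ARGaze, which uses a transformer decoder to make causal, streaming autoregressive predictions at each timestep from the current visual features and a fixed-length "Gaze Context Window" of recent gaze target estimates.
Result: Achieves state-of-the-art performance under online evaluation on multiple egocentric benchmarks. Ablations confirm that autoregressive modeling with bounded history is critical for robust prediction.
Conclusion: Reformulating online gaze estimation as causal autoregressive sequence prediction is an effective and practical paradigm. ARGaze balances accuracy, real-time operation, and bounded resource use, making it suitable for AR and assistive technologies.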
Abstract: Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a transformer decoder predicts current gaze by conditioning on (i) current visual features and (ii) a fixed-length Gaze Context Window of recent gaze target estimates. This design enforces causality and enables bounded-resource streaming inference. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation, with extensive ablations validating that autoregressive modeling with bounded gaze history is critical for robust prediction. We will release our source code and pre-trained models.
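The streaming loop implied by a fixed-length Gaze Context Window might look like the sketch below. It only illustrates the causal, bounded-memory control flow. The `predict` callable is a hypothetical placeholder for the transformer decoder, and the window length of 4 is an arbitrary assumption.

```python
from collections import deque
from typing import Any, Callable, Iterable, List, Tuple

def stream_gaze(
    frame_features: Iterable[Any],
    predict: Callable[[Any, Tuple[Any, ...]], Any],
    window: int = 4,
) -> List[Any]:
    """Run a gaze predictor over a stream of per-frame features.

    At each timestep the predictor sees only the current features and the
    most recent `window` gaze estimates, so memory use stays bounded and no
    future frame is ever consulted.
    """
    context: deque = deque(maxlen=window)  # oldest estimates drop out automatically
    outputs: List[Any] = []
    for features in frame_features:
        gaze = predict(features, tuple(context))
        outputs.append(gaze)
        context.append(gaze)  # feed the model's own estimate back in
    return outputs
```

Because the context holds the model's own earlier estimates rather than ground truth, errors can propagate forward, which is one reason the choice of history length matters.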
### [67] [AirGlove: Exploring Egocentric 3D Hand Tracking and Appearance Generalization for Sensing Gloves](https://arxiv.org/abs/2602.05159)
*Wenhui Cui,Ziyi Kou,Chuan Qin,Ergys Ristani,Li Guan*
Main category: cs.CV
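TL;DR: This paper proposes AirGlove, which leverages data from existing sensing gloves to improve how well vision-based hand pose estimation models generalize to new sensing gloves, clearly outperforming existing zero-shot and fine-tuning approaches.
Details
Motivation: Sensor-based tracking is easily affected by signal and calibration quality, while mainstream vision-based hand pose estimation models are pre-trained on bare-hand data and degrade significantly on sensing gloves with very different appearances.
Method: Proposes the AirGlove framework, which builds on data from existing sensing gloves to generalize learned representations to newly designed gloves with limited data, improving the adaptability of vision-based hand pose models to unseen gloves.
Result: Experiments on multiple sensing gloves show that AirGlove generalizes effectively to new glove designs and significantly outperforms the compared methods under both zero-shot and fine-tuning setups.
Conclusion: Vision-based pose estimation on sensing gloves is limited by the appearance domain shift. AirGlove addresses this through cross-glove representation generalization, providing a more robust and flexible solution for practical deployment.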
Abstract: Sensing gloves have become important tools for teleoperation and robotic policy learning, as they are able to provide rich signals like speed, acceleration, and tactile feedback. A common approach to tracking gloved hands is to directly use the sensor signals (e.g., angular velocity, gravity orientation) to estimate 3D hand poses. However, sensor-based tracking can be restrictive in practice, as the accuracy is often impacted by sensor signal and calibration quality. Recent advances in vision-based approaches have achieved strong performance on human hands via large-scale pre-training, but their performance on gloved hands with distinct visual appearances remains underexplored. In this work, we present the first systematic evaluation of vision-based hand tracking models on gloved hands under both zero-shot and fine-tuning setups. Our analysis shows that existing bare-hand models suffer from substantial performance degradation on sensing gloves due to the large appearance gap between bare-hand and glove designs. We therefore propose AirGlove, which leverages existing gloves to generalize the learned glove representations towards new gloves with limited data. Experiments with multiple sensing gloves show that AirGlove effectively generalizes the hand pose models to new glove designs and achieves a significant performance boost over the compared schemes.
### [68] [SHaSaM: Submodular Hard Sample Mining for Fair Facial Attribute Recognition](https://arxiv.org/abs/2602.05162)
*Anay Majee,Rishabh Iyer*
Main category: cs.CV
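TL;DR: This paper proposes SHaSaM, which addresses social and demographic bias in deep neural networks through submodular hard sample mining, improving fairness without sacrificing performance.
Details
Motivation: Deep neural networks often inherit social and demographic biases from annotated training data, leading to unfair predictions in the presence of sensitive attributes such as race, age, and gender. Existing methods struggle with data imbalance between attribute groups and may inadvertently emphasize sensitive attributes, worsening unfairness.
Method: Proposes SHaSaM (Submodular Hard Sample Mining), which has two stages. SHaSaM-MINE uses a submodular subset selection strategy to mine hard positives and negatives and mitigate data imbalance. SHaSaM-LEARN designs combinatorial loss functions based on Submodular Conditional Mutual Information to maximize the decision boundary between target classes while minimizing the influence of sensitive attributes.
Result: Experiments on the CelebA and UTKFace datasets show that SHaSaM achieves state-of-the-art results: the fairness metric (Equalized Odds) improves by up to 2.7 points and accuracy by 3.5%, with faster convergence.
Conclusion: Through a unified submodular optimization framework, SHaSaM keeps the model from learning features tied to sensitive attributes, significantly improving fairness while maintaining or even improving model performance.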
Abstract: Deep neural networks often inherit social and demographic biases from annotated data during model training, leading to unfair predictions, especially in the presence of sensitive attributes like race, age, and gender. Existing methods fall prey to the inherent data imbalance between attribute groups and inadvertently emphasize sensitive attributes, worsening unfairness and performance. To surmount these challenges, we propose SHaSaM (Submodular Hard Sample Mining), a novel combinatorial approach that models fairness-driven representation learning as a submodular hard-sample mining problem. Our two-stage approach comprises SHaSaM-MINE, which introduces a submodular subset selection strategy to mine hard positives and negatives, effectively mitigating data imbalance, and SHaSaM-LEARN, which introduces a family of combinatorial loss functions based on Submodular Conditional Mutual Information to maximize the decision boundary between target classes while minimizing the influence of sensitive attributes. This unified formulation restricts the model from learning features tied to sensitive attributes, significantly enhancing fairness without sacrificing performance. Experiments on CelebA and UTKFace demonstrate that SHaSaM achieves state-of-the-art results, with up to 2.7 points improvement in model fairness (Equalized Odds) and a 3.5% gain in accuracy, in fewer epochs than existing methods.
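For readers unfamiliar with submodular subset selection, the sketch below shows the standard greedy algorithm on a facility-location objective. It is a generic illustration only: the abstract does not specify which submodular functions SHaSaM-MINE uses, hard positive and negative mining is not modeled, and the function name is hypothetical. Similarities are assumed to be non-negative.

```python
from typing import List, Sequence

def greedy_facility_location(
    similarity: Sequence[Sequence[float]], budget: int
) -> List[int]:
    """Greedily pick `budget` items that maximize the facility-location value
    f(S) = sum_i max_{j in S} similarity[i][j].

    Each round adds the candidate with the largest marginal gain. Because f is
    monotone submodular, the greedy result is within a factor (1 - 1/e) of the
    optimum.
    """
    n = len(similarity)
    selected: List[int] = []
    covered = [0.0] * n  # best similarity of each item to the selected set
    for _ in range(min(budget, n)):
        best_gain, best_item = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            gain = sum(max(similarity[i][j] - covered[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_item = gain, j
        selected.append(best_item)
        covered = [max(covered[i], similarity[i][best_item]) for i in range(n)]
    return selected
```

The diminishing-returns property is what makes such objectives attractive for imbalanced data: once a densely populated group is represented, further samples from it add little, so under-represented groups are picked up early.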
### [69] [LOBSTgER-enhance: an underwater image enhancement pipeline](https://arxiv.org/abs/2602.05163)
*Andreas Mentzelopoulos,Keith Ellenbogen*
Main category: cs.CV
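TL;DR: This paper proposes a diffusion-based image-to-image translation method that corrects color distortion, blur, and contrast loss in underwater photography, achieving good generalization and perceptual consistency with only about 2,500 high-quality underwater images.
Details
Motivation: Underwater photography suffers from low contrast, spatial blur, and wavelength-dependent color distortion, which makes post-processing complex and time-consuming and calls for an automated end-to-end correction method.
Method: Build a synthetic degradation pipeline that simulates underwater distortions and use a diffusion model to learn its inverse; a model of about 11M parameters is trained from scratch on a small, high-quality underwater image dataset provided by Keith Ellenbogen.
Result: Achieves high perceptual consistency and strong generalization on 512×768 images, supporting the effectiveness of diffusion models in the small-data regime.
Conclusion: The method shows that a lightweight diffusion model is feasible and practical for small-sample underwater image enhancement, and offers a new direction for image restoration in resource-constrained settings.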
Abstract: Underwater photography presents significant inherent challenges, including reduced contrast, spatial blur, and wavelength-dependent color distortions. These effects can obscure the vibrancy of marine life, and awareness photographers in particular often face heavy post-processing pipelines to correct for these distortions.
We develop an image-to-image pipeline that learns to reverse underwater degradations by introducing a synthetic corruption pipeline and learning to reverse its effects with diffusion-based generation. Training and evaluation are performed on a small high-quality dataset of awareness photography images by Keith Ellenbogen. The proposed methodology achieves high perceptual consistency and strong generalization in synthesizing 512x768 images using a model of ~11M parameters after training from scratch on ~2.5k images.
### [70] [ShapePuri: Shape Guided and Appearance Generalized Adversarial Purification](https://arxiv.org/abs/2602.05175)
*Zhe Li,Bernhard Kainz*
Main category: cs.CV
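TL;DR: This paper proposes Shape Guided Purification (ShapePuri), a defense framework that improves adversarial robustness by aligning model representations with stable structural invariants. It comprises a Shape Encoding Module (SEM) based on Signed Distance Functions (SDF) and a Global Appearance Debiasing (GAD) module that mitigates appearance bias through stochastic transformations. ShapePuri is the first defense to surpass 80% robust accuracy on the AutoAttack benchmark (81.64%) while keeping high clean accuracy (84.06%) and adding no inference overhead.
Details
Motivation: Existing defenses such as adversarial training and diffusion-based purification suffer from high computational cost or information loss, so an efficient, low-overhead approach that preserves discriminative information is needed.
Method: The ShapePuri framework has two core modules: (1) a Shape Encoding Module (SEM) that provides dense geometric guidance through Signed Distance Functions (SDF); (2) a Global Appearance Debiasing (GAD) module that uses stochastic transformations to reduce the model's biased reliance on appearance features.
Result: Under the AutoAttack protocol, ShapePuri reaches 84.06% clean accuracy and 81.64% robust accuracy, making it the first defense to exceed 80% robust accuracy on this benchmark, with no auxiliary modules or extra inference cost.
Conclusion: By using a structural prior (shape) to guide purification, ShapePuri substantially improves robustness while remaining efficient, offering a new paradigm for lightweight, scalable adversarial defense.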
Abstract: Deep neural networks demonstrate impressive performance in visual recognition, but they remain vulnerable to adversarial attacks that are imperceptible to humans. Although existing defense strategies such as adversarial training and purification have achieved progress, diffusion-based purification often involves high computational costs and information loss. To address these challenges, we introduce Shape Guided Purification (ShapePuri), a novel defense framework that enhances robustness by aligning model representations with stable structural invariants. ShapePuri integrates two components: a Shape Encoding Module (SEM) that provides dense geometric guidance through Signed Distance Functions (SDF), and a Global Appearance Debiasing (GAD) module that mitigates appearance bias via stochastic transformations. In our experiments, ShapePuri achieves $84.06\%$ clean accuracy and $81.64\%$ robust accuracy under the AutoAttack protocol, representing the first defense framework to surpass the $80\%$ threshold on this benchmark. Our approach provides a scalable and efficient adversarial defense that preserves prediction stability during inference without requiring auxiliary modules or additional computational cost.
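For readers unfamiliar with signed distance functions, the following minimal sketch computes an SDF from a small binary mask by brute force. It only illustrates what an SDF encodes; how ShapePuri obtains shapes and feeds the SDF into its SEM is not described in the abstract. The function name `signed_distance` and the sign convention (negative inside, positive outside) are assumptions.

```python
import numpy as np

def signed_distance(mask):
    """Brute-force signed distance field of a small 2D binary mask.

    Convention (an assumption): negative inside the shape, positive outside.
    Cost is O(N^2) in the number of pixels, so this is for illustration only.
    """
    mask = np.asarray(mask, dtype=bool)
    if mask.all() or not mask.any():
        raise ValueError("mask must contain both inside and outside pixels")
    coords = np.argwhere(np.ones_like(mask))  # every (row, col), row-major
    inside = np.argwhere(mask)
    outside = np.argwhere(~mask)

    def nearest(targets):
        # Distance from every pixel to its closest pixel in `targets`.
        diff = coords[:, None, :] - targets[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1).reshape(mask.shape)

    return np.where(mask, -nearest(outside), nearest(inside))

# A 3x3 square centred in a 5x5 image.
mask = np.zeros((5, 5), dtype=bool)
mask[1:4, 1:4] = True
sdf = signed_distance(mask)
```

Every pixel carries a distance to the boundary, so the representation is dense rather than limited to edge pixels. For real image sizes, `scipy.ndimage.distance_transform_edt` applied to the mask and its complement is a much faster way to build the same field.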
### [71] [PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction](https://arxiv.org/abs/2602.05190)
*Ju Shen,Chen Chen,Tam V. Nguyen,Vijayan K. Asari*
Main category: cs.CV
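TL;DR: PoseGaussian is a pose-guided Gaussian Splatting framework for high-quality human novel view synthesis. It embeds human pose as a structural prior and a temporal cue in the geometric and temporal modeling stages, achieving high-fidelity, real-time (100 FPS) rendering and state-of-the-art performance on several datasets.
Details
Motivation: Address the challenges that articulated motion and severe self-occlusion pose in dynamic human scenes, and improve the robustness and generalization of human novel view synthesis.
Method: The PoseGaussian framework uses human pose as a structural prior, fused with a color encoder to refine depth estimation, and as a temporal cue, processed by a dedicated pose encoder to strengthen temporal consistency across frames. The whole pipeline is differentiable and trainable end to end.
Result: State-of-the-art results on ZJU-MoCap, THuman2.0, and in-house datasets: PSNR 30.86, SSIM 0.979, LPIPS 0.028, with real-time rendering at 100 FPS.
Conclusion: By integrating pose information into both the geometric and temporal stages, PoseGaussian markedly improves the quality, robustness, and efficiency of dynamic human novel view synthesis, outperforming existing methods that use pose only as a condition or for warping.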
Abstract: We propose PoseGaussian, a pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis. Human body pose serves a dual purpose in our design: as a structural prior, it is fused with a color encoder to refine depth estimation; as a temporal cue, it is processed by a dedicated pose encoder to enhance temporal consistency across frames. These components are integrated into a fully differentiable, end-to-end trainable pipeline. Unlike prior works that use pose only as a condition or for warping, PoseGaussian embeds pose signals into both geometric and temporal stages to improve robustness and generalization. It is specifically designed to address challenges inherent in dynamic human scenes, such as articulated motion and severe self-occlusion. Notably, our framework achieves real-time rendering at 100 FPS, maintaining the efficiency of standard Gaussian Splatting pipelines. We validate our approach on ZJU-MoCap, THuman2.0, and in-house datasets, demonstrating state-of-the-art performance in perceptual quality and structural accuracy (PSNR 30.86, SSIM 0.979, LPIPS 0.028).
### [72] [GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling](https://arxiv.org/abs/2602.05202)
*Shivanshu Shekhar,Uttaran Bhattacharya,Raghavendra Addanki,Mehrab Tanjim,Somdeb Sarkhel,Tong Zhang*
Main category: cs.CV
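TL;DR: This paper proposes using the video generation model itself as the reward model (instead of relying on a VLM). The model is reformulated as an energy-based model and trained contrastively with synthetic negatives, which greatly improves its ability to judge the temporal quality of videos and reaches state-of-the-art results on several benchmarks with far less annotated data.
Details
Motivation: Existing reward models for video generation that are built on vision-language models (VLMs) struggle to capture subtle temporal dynamics; an alternative that inherently models temporal structure is needed.
Method: Reformulate a state-of-the-art video generation model (e.g., a Generative-Transformer) as an energy-based model (EBM) and train it with contrastive learning to separate high-quality from degraded videos. Three kinds of controlled latent-space perturbations (temporal slicing, feature swapping, frame shuffling) produce synthetic negatives, which prevents the model from exploiting superficial artifacts.
Result: State-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations, 6× to 65× fewer than existing VLM-based methods.
Conclusion: Video generation models can be repurposed as accurate, temporally aware reward models without additional architectural design, with a large gain in data efficiency.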
Abstract: Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: $6\times$ to $65\times$ fewer than existing VLM-based approaches.
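The contrastive idea, low energy for clean videos and high energy for perturbed ones, can be sketched in a few lines. This is a toy under stated assumptions: the abstract applies perturbations in latent space and does not give the loss, whereas the sketch shuffles a plain list of frame labels and uses a generic logistic pairwise loss. The names `shuffle_frames` and `pairwise_energy_loss` are hypothetical.

```python
import math
import random

def shuffle_frames(frames, rng):
    """Return a copy of `frames` with its temporal order permuted.

    Stands in for the "frame shuffling" negative; the paper perturbs latents,
    which this toy does not attempt to reproduce.
    """
    if len(frames) < 2:
        return list(frames)
    order = list(range(len(frames)))
    while order == sorted(order):  # retry until the order actually changes
        rng.shuffle(order)
    return [frames[i] for i in order]

def pairwise_energy_loss(energy_pos, energy_neg):
    """Logistic contrastive loss: small when the clean video has lower energy."""
    margin = energy_pos - energy_neg
    # Numerically stable softplus(margin) = log(1 + exp(margin)).
    return max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))

rng = random.Random(0)
clip = ["f0", "f1", "f2", "f3"]
negative = shuffle_frames(clip, rng)

good_ranking = pairwise_energy_loss(energy_pos=0.0, energy_neg=5.0)
bad_ranking = pairwise_energy_loss(energy_pos=5.0, energy_neg=0.0)
```

Because the negative contains exactly the same frames as the positive, a model cannot separate the pair from per-frame appearance alone and has to attend to temporal order, which is the stated purpose of the synthetic negatives.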
### [73] [Dual-Representation Image Compression at Ultra-Low Bitrates via Explicit Semantics and Implicit Textures](https://arxiv.org/abs/2602.05213)
*Chuqin Zhou,Xiaoyue Ling,Yunuo Chen,Jincheng Dai,Guo Lu,Wenjun Zhang*
Main category: cs.CV
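TL;DR: This paper proposes a training-free unified framework that combines explicit high-level semantics with implicit detail coding and adds a plug-in encoder to control the distortion-perception tradeoff, achieving state-of-the-art rate-perception performance for ultra-low-bitrate image compression.
Details
Motivation: Existing neural codecs degrade significantly at ultra-low bitrates. Generative compression methods exploit semantic priors from pretrained models but are limited by the inherent tension between semantic faithfulness and perceptual realism.
Method: A training-free unified framework that conditions a diffusion model on explicit high-level semantics while using reverse-channel coding to convey fine-grained details implicitly, plus a plug-in encoder for flexible control of the distortion-perception tradeoff.
Result: On Kodak, DIV2K, and CLIC2020, the method surpasses DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate, respectively, reaching state-of-the-art rate-perception performance.
Conclusion: Combining explicit and implicit representations improves texture realism without sacrificing semantic consistency, easing the tradeoff between perceptual quality and content fidelity in ultra-low-bitrate compression.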
Abstract: While recent neural codecs achieve strong performance at low bitrates when optimized for perceptual quality, their effectiveness deteriorates significantly under ultra-low bitrate conditions. To mitigate this, generative compression methods leveraging semantic priors from pretrained models have emerged as a promising paradigm. However, existing approaches are fundamentally constrained by a tradeoff between semantic faithfulness and perceptual realism. Methods based on explicit representations preserve content structure but often lack fine-grained textures, whereas implicit methods can synthesize visually plausible details at the cost of semantic drift. In this work, we propose a unified framework that bridges this gap by coherently integrating explicit and implicit representations in a training-free manner. Specifically, we condition a diffusion model on explicit high-level semantics while employing reverse-channel coding to implicitly convey fine-grained details. Moreover, we introduce a plug-in encoder that enables flexible control of the distortion-perception tradeoff by modulating the implicit information. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art rate-perception performance, outperforming existing methods and surpassing DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate on the Kodak, DIV2K, and CLIC2020 datasets, respectively.
### [74] [E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching](https://arxiv.org/abs/2602.05215)
*Jiahao Nie,Wenbin An,Gongjie Zhang,Yicheng Xu,Yap-Peng Tan,Alex C. Kot,Shijian Lu*
Main category: cs.CV
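TL;DR: This paper proposes E.M.Ground, a new video large language model for Temporal Video Grounding (TVG). By introducing a special token, Savitzky-Golay smoothing, and multi-grained frame feature aggregation, it improves the modeling of event semantic continuity and the accuracy of temporal localization.
Details
Motivation: Existing Vid-LLMs handle TVG by matching start and end frames with separate tokens, which makes it hard to model the semantic continuity and integrity of an event and leads to ambiguous localization.
Method: The E.M.Ground model: (i) a special token that aggregates information from all frames of the event; (ii) Savitzky-Golay filtering to smooth the token-to-frame similarity curve; (iii) multi-grained frame feature aggregation to offset compression loss and strengthen temporal understanding.
Result: Significantly outperforms state-of-the-art Vid-LLMs on multiple benchmark datasets.
Conclusion: A holistic event perception paradigm is a better fit for the TVG task, and E.M.Ground confirms that modeling semantic continuity is key to precise temporal localization.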
Abstract: Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event's semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special token that aggregates information from all frames of a query event, preserving semantic continuity for accurate event matching; (ii) Savitzky-Golay smoothing to reduce noise in token-to-frame similarities across timestamps, improving prediction accuracy; (iii) multi-grained frame feature aggregation to enhance matching reliability and temporal understanding, compensating for compression-induced information loss. Extensive experiments on benchmark datasets show that E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
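Savitzky-Golay smoothing fits a low-order polynomial to each sliding window by least squares and keeps the fitted value at the window centre. The sketch below derives the filter weights with NumPy and applies them to a made-up similarity curve. The window size, polynomial order, edge padding, and the example values are assumptions; the abstract does not report the settings E.M.Ground uses.

```python
import numpy as np

def savgol_coeffs(window, order):
    """Convolution weights for Savitzky-Golay smoothing at the window centre."""
    if window % 2 == 0 or order >= window:
        raise ValueError("window must be odd and larger than the polynomial order")
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    design = np.vander(offsets, order + 1, increasing=True)
    # Least-squares polynomial fit: the fitted value at offset 0 equals the
    # constant coefficient, which is row 0 of the pseudo-inverse.
    return np.linalg.pinv(design)[0]

def savgol_smooth(values, window=5, order=2):
    """Smooth a 1D signal, repeating edge values so the length is preserved."""
    values = np.asarray(values, dtype=float)
    weights = savgol_coeffs(window, order)
    padded = np.pad(values, window // 2, mode="edge")
    # np.convolve flips its kernel, so reverse the weights to get correlation.
    return np.convolve(padded, weights[::-1], mode="valid")

# Hypothetical token-to-frame similarities over ten timestamps.
similarity = np.array([0.1, 0.15, 0.1, 0.7, 0.5, 0.8, 0.75, 0.2, 0.1, 0.12])
smoothed = savgol_smooth(similarity)
```

Unlike a moving average, the polynomial fit reproduces locally quadratic trends exactly, so peaks are blunted less while frame-level noise is still reduced. In practice `scipy.signal.savgol_filter` provides the same filter with more boundary options.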
### [75] [Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation](https://arxiv.org/abs/2602.05217)
*Jiahao Nie,Guanqiao Fu,Wenbin An,Yap-Peng Tan,Alex C. Kot,Shijian Lu*
Main category: cs.CV
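TL;DR: This paper proposes Multi-view Progressive Adaptation (MPA), which improves cross-domain few-shot segmentation from both the data side and the strategy side through hybrid progressive augmentation and dual-chain multi-view prediction, clearly outperforming existing methods.
Details
Motivation: Existing cross-domain few-shot segmentation methods are limited by the small number and low diversity of target-domain samples. In addition, a source-trained model starts with weak few-shot capability in the target domain and faces a large domain gap, so target samples are used inefficiently and adaptation is poor.
Method: Multi-view Progressive Adaptation (MPA): (i) on the data side, Hybrid Progressive Augmentation gradually generates more diverse and more complex views; (ii) on the strategy side, Dual-chain Multi-view Prediction uses sequential and parallel learning paths to jointly enforce prediction consistency across views under extensive supervision.
Result: Outperforms the previous state of the art by a large margin on multiple benchmarks, with gains of +7.0%.
Conclusion: By jointly tuning augmentation complexity and the prediction-consistency mechanism, MPA gives the model robust and accurate few-shot adaptation in the target domain.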
Abstract: Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model's initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation (MPA), which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%).
### [76] [Boosting SAM for Cross-Domain Few-Shot Segmentation via Conditional Point Sparsification](https://arxiv.org/abs/2602.05218)
*Jiahao Nie,Yun Xing,Wenbin An,Qingsong Zhao,Jiawei Shao,Yap-Peng Tan,Alex C. Kot,Shijian Lu,Xuelong Li*
Main category: cs.CV
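TL;DR: This paper proposes Conditional Point Sparsification (CPS) to address the failure of dense point matching when SAM is applied to cross-domain few-shot segmentation (CD-FSS) under domain shift. CPS uses the ground-truth masks of reference images to adaptively sparsify the matched points, improving segmentation accuracy.
Details
Motivation: Existing SAM-based few-shot segmentation methods degrade in cross-domain settings (e.g., medical or satellite images), mainly because domain shift disrupts the point-image interactions learned by SAM, and point density is especially important in such settings.
Method: A training-free Conditional Point Sparsification (CPS) method that uses the ground-truth masks of reference images to adaptively sparsify the dense points matched between reference and target images, improving SAM's prompt interactions on cross-domain images.
Result: CPS clearly outperforms existing training-free SAM-based methods on multiple cross-domain few-shot segmentation datasets.
Conclusion: Point density matters greatly in cross-domain settings; by sparsifying points adaptively under reference-mask guidance, CPS reduces the impact of domain shift and improves SAM's generalization on CD-FSS.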
Abstract: Motivated by the success of the Segment Anything Model (SAM) in promptable segmentation, recent studies leverage SAM to develop training-free solutions for few-shot segmentation, which aims to predict object masks in the target image based on a few reference exemplars. These SAM-based methods typically rely on point matching between reference and target images and use the matched dense points as prompts for mask prediction. However, we observe that dense points perform poorly in Cross-Domain Few-Shot Segmentation (CD-FSS), where target images are from medical or satellite domains. We attribute this issue to large domain shifts that disrupt the point-image interactions learned by SAM, and find that point density plays a crucial role under such conditions. To address this challenge, we propose Conditional Point Sparsification (CPS), a training-free approach that adaptively guides SAM interactions for cross-domain images based on reference exemplars. Leveraging ground-truth masks, the reference images provide reliable guidance for adaptively sparsifying dense matched points, enabling more accurate segmentation results. Extensive experiments demonstrate that CPS outperforms existing training-free SAM-based methods across diverse CD-FSS datasets.
### [77] [PatchFlow: Leveraging a Flow-Based Model with Patch Features](https://arxiv.org/abs/2602.05238)
*Boxiang Zhang,Baijian Yang,Xiaoming Wang,Corey Vian*
Main category: cs.CV
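TL;DR: This paper proposes an unsupervised anomaly detection method that combines local neighbor-aware patch features with a normalizing flow model and adds an adapter module to bridge a generic pretrained feature extractor and industrial product images, substantially improving the accuracy of surface defect detection for die castings.
Details
Motivation: Surface defects in die castings seriously affect quality control, and traditional manual inspection is slow and costly, so automated, accurate computer-vision inspection is needed.
Method: Combine local neighbor-aware patch features with a normalizing flow model and design an adapter module that adapts pretrained features to the target domain, enabling unsupervised anomaly detection trained on normal samples only.
Result: Image-level AUROC of 99.28% on MVTec AD (a 20% error reduction) and 96.48% on VisA (a 28.2% error reduction); 95.77% detection accuracy on a proprietary die casting dataset, with no anomalous samples needed for training.
Conclusion: The method improves automated surface defect detection for die castings in industrial settings and supports the feasibility of combining pretrained feature adaptation with generative modeling for unsupervised industrial inspection.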
Abstract: Die casting plays a crucial role across various industries due to its ability to craft intricate shapes with high precision and smooth surfaces. However, surface defects remain a major issue that impedes die casting quality control. Recently, computer vision techniques have been explored to automate and improve defect detection. In this work, we combine local neighbor-aware patch features with a normalizing flow model and bridge the gap between the generic pretrained feature extractor and industrial product images by introducing an adapter module to increase the efficiency and accuracy of automated anomaly detection. Compared to state-of-the-art methods, our approach reduces the error rate by 20% on the MVTec AD dataset, achieving an image-level AUROC of 99.28%. Our approach also improves performance on the VisA dataset, achieving an image-level AUROC of 96.48%. Compared to the state-of-the-art models, this represents a 28.2% reduction in error. Additionally, experiments on a proprietary die casting dataset yield an accuracy of 95.77% for anomaly detection, without requiring any anomalous samples for training. Our method illustrates the potential of leveraging computer vision and deep learning techniques to advance inspection capabilities for the die casting industry.
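The relative error reductions quoted above can be related to absolute AUROC values with a short calculation. The sketch assumes that "error" means 100 minus the image-level AUROC in percent, which the abstract does not state, so the baselines it produces are inferred figures and not reported results. The function name `implied_baseline_auroc` is hypothetical.

```python
def implied_baseline_auroc(auroc, relative_error_reduction):
    """Back out the baseline AUROC implied by a relative error reduction.

    Assumes (not stated in the abstract) that error = 100 - AUROC, in percent.
    """
    error = 100.0 - auroc
    baseline_error = error / (1.0 - relative_error_reduction)
    return 100.0 - baseline_error

# Inputs are the figures quoted in the abstract.
mvtec_baseline = implied_baseline_auroc(99.28, 0.20)
visa_baseline = implied_baseline_auroc(96.48, 0.282)
```

Under that assumption, a 20% error reduction at 99.28% AUROC corresponds to moving from an error of roughly 0.90 points to 0.72 points, which shows how large relative reductions can coexist with small absolute AUROC gains near the ceiling of a benchmark.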
### [78] [Active Label Cleaning for Reliable Detection of Electron Dense Deposits in Transmission Electron Microscopy Images](https://arxiv.org/abs/2602.05250)
*Jieyun Tan,Shuo Liu,Guibin Zhang,Ziqi Li,Jian Geng,Lei Zhang,Lei Cao*
Main category: cs.CV
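TL;DR: This paper proposes an active label cleaning method that improves electron dense deposit (EDD) detection models trained on crowdsourced annotations, substantially raising detection accuracy while reducing expert annotation cost.
Details
Motivation: Automated detection of electron dense deposits (EDD) is limited by the scarcity of high-quality labeled data; crowdsourced annotation lowers cost but introduces label noise.
Method: An active label cleaning method that uses active learning to select the most valuable noisy samples for expert re-annotation, together with a Label Selection Module that uses discrepancies between crowdsourced labels and model predictions for sample selection and instance-level noise grading.
Result: Reaches 67.18% AP₅₀ on a private dataset, an 18.83% improvement over training directly on noisy labels; performance reaches 95.79% of that with full expert annotation while cutting annotation cost by 73.30%.
Conclusion: The method offers a practical, low-cost way to build reliable medical AI when expert resources are limited.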
Abstract: Automated detection of electron dense deposits (EDD) in glomerular disease is hindered by the scarcity of high-quality labeled data. While crowdsourcing reduces annotation cost, it introduces label noise. We propose an active label cleaning method to efficiently denoise crowdsourced datasets. Our approach uses active learning to select the most valuable noisy samples for expert re-annotation, building high-accuracy cleaning models. A Label Selection Module leverages discrepancies between crowdsourced labels and model predictions for both sample selection and instance-level noise grading. Experiments show our method achieves 67.18% AP$_{50}$ on a private dataset, an 18.83% improvement over training on noisy labels. This performance reaches 95.79% of that with full expert annotation while reducing annotation cost by 73.30%. The method provides a practical, cost-effective solution for developing reliable medical AI with limited expert resources.
### [79] [RFM-Pose: Reinforcement-Guided Flow Matching for Fast Category-Level 6D Pose Estimation](https://arxiv.org/abs/2602.05257)
*Diya He,Qingchen Liu,Cong Zhang,Jiahu Qin*
Main category: cs.CV
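TL;DR: This paper proposes RFM-Pose, a framework that combines a flow-matching generative model with reinforcement learning (PPO) to deliver both efficient sampling and hypothesis evaluation for category-level 6D object pose estimation, markedly reducing computational cost while performing well on the REAL275 benchmark.
Details
Motivation: Score-based generative models have partly resolved the rotational symmetry ambiguity in category-level pose estimation, but their high sampling cost limits efficiency.
Method: RFM-Pose uses a flow-matching generative model to generate pose candidates along an optimal transport path, casts the sampling process as a Markov decision process, and fine-tunes the sampling policy with proximal policy optimization (PPO). The flow field is treated as a learnable policy and an estimator is mapped to a value network, so pose generation and hypothesis scoring are optimized jointly.
Result: On the REAL275 benchmark, RFM-Pose achieves favorable performance while significantly reducing computational cost; it also extends naturally to object pose tracking with competitive results.
Conclusion: Designing flow matching and reinforcement learning together improves both the efficiency and the quality of category-level 6D pose generation, offering a new paradigm for generative pose estimation.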
Abstract: Object pose estimation is a fundamental problem in computer vision and plays a critical role in virtual reality and embodied intelligence, where agents must understand and interact with objects in 3D space. Recently, score-based generative models have partly resolved the rotational symmetry ambiguity problem in category-level pose estimation, but their efficiency remains limited by the high sampling cost of score-based diffusion. In this work, we propose a new framework, RFM-Pose, that accelerates category-level 6D object pose generation while actively evaluating sampled hypotheses. To improve sampling efficiency, we adopt a flow-matching generative model and generate pose candidates along an optimal transport path from a simple prior to the pose distribution. To further refine these candidates, we cast the flow-matching sampling process as a Markov decision process and apply proximal policy optimization to fine-tune the sampling policy. In particular, we interpret the flow field as a learnable policy and map an estimator to a value network, enabling joint optimization of pose generation and hypothesis scoring within a reinforcement learning framework. Experiments on the REAL275 benchmark demonstrate that RFM-Pose achieves favorable performance while significantly reducing computational cost. Moreover, similar to prior work, our approach can be readily adapted to object pose tracking and attains competitive results in this setting.
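Sampling from a flow-matching model amounts to integrating an ordinary differential equation from a prior sample at t = 0 to a data sample at t = 1. The sketch below does this with explicit Euler steps on a toy 3D vector. It makes several simplifying assumptions: the "pose" is a plain Euclidean vector with no rotation component, and the velocity field is an oracle for a known target, standing in for the trained network. The names `euler_flow_sample` and `straight_path_velocity` are hypothetical, and the PPO fine-tuning stage is not shown.

```python
def euler_flow_sample(x0, velocity, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with explicit Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def straight_path_velocity(target):
    """Oracle velocity for the straight path x_t = (1 - t) * x0 + t * target.

    A trained network would approximate this field; here the target is known,
    which is only possible in a toy setting.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity

target_pose = [0.3, -0.2, 1.5]  # hypothetical translation-only "pose"
sample = euler_flow_sample([0.0, 0.0, 0.0], straight_path_velocity(target_pose), steps=8)
```

Because the transport path is straight, a handful of Euler steps already lands on the target, whereas score-based diffusion samplers typically need many more denoising steps. This is the efficiency argument the abstract makes for replacing score-based sampling with flow matching.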
### [80] [ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network](https://arxiv.org/abs/2602.05262)
*Junzhou Li,Manqi Zhao,Yilin Gao,Zhiheng Yu,Yin Li,Dongsheng Jiang,Li Xiao*
Main category: cs.CV
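TL;DR: This paper proposes ReGLA, a family of lightweight hybrid networks that combine efficient convolutions with ReLU-based gated linear attention, keeping accuracy high while substantially reducing latency on high-resolution images.
Details
Motivation: Address the difficulty lightweight models, especially Transformer-based ones, have in balancing accuracy and latency on high-resolution images.
Method: ReGLA comprises an ELRF module (improves convolutional efficiency while preserving a large receptive field), an RGMA module (strengthens local feature representation at linear complexity), and a multi-teacher knowledge distillation strategy.
Result: ReGLA-M reaches 80.85% Top-1 accuracy on ImageNet-1K (224 px) with only 4.98 ms latency at 512 px; on COCO detection and ADE20K segmentation it surpasses iFormer by 3.1% AP and 3.6% mIoU, respectively.
Conclusion: ReGLA is an advanced lightweight solution for high-resolution vision tasks, achieving state-of-the-art accuracy, speed, and downstream generalization.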
Abstract: Balancing accuracy and latency on high-resolution images is a critical challenge for lightweight models, particularly for Transformer-based architectures that often suffer from excessive latency. To address this issue, we introduce ReGLA, a series of lightweight hybrid networks, which integrates efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module for enhancing convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module for maintaining linear complexity while enhancing local feature representation; and a multi-teacher distillation strategy to boost performance on downstream tasks. Extensive experiments validate the superiority of ReGLA; in particular, ReGLA-M achieves 80.85% Top-1 accuracy on ImageNet-1K at 224 px, with only 4.98 ms latency at 512 px. Furthermore, ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of 3.1% AP on COCO object detection and 3.6% mIoU on ADE20K semantic segmentation, establishing it as a state-of-the-art solution for high-resolution visual applications.
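The linear-complexity claim rests on a reordering of matrix products that works once softmax is replaced by a simple feature map such as ReLU. The sketch below shows generic ReLU linear attention next to its quadratic reference form. It is not the RGMA module: the gating and modulation parts are not specified in the abstract and are omitted, and the function names and the `eps` stabiliser are assumptions.

```python
import numpy as np

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention with a ReLU feature map, O(N * d^2) instead of O(N^2 * d)."""
    q, k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    kv = k.T @ v           # (d, d_v) summary of all keys and values
    k_sum = k.sum(axis=0)  # (d,) normaliser shared by every query
    return (q @ kv) / (q @ k_sum + eps)[:, None]

def relu_quadratic_attention(q, k, v, eps=1e-6):
    """Reference: the same result via the explicit N x N similarity matrix."""
    scores = np.maximum(q, 0.0) @ np.maximum(k, 0.0).T
    return (scores @ v) / (scores.sum(axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
fast = relu_linear_attention(q, k, v)
slow = relu_quadratic_attention(q, k, v)
```

Both functions return the same values, but the first never materialises the N x N matrix. Doubling the image side length quadruples the token count N, so a cost linear in N is what keeps latency manageable at 512 px.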
### [81] [Unlocking Prototype Potential: An Efficient Tuning Framework for Few-Shot Class-Incremental Learning](https://arxiv.org/abs/2602.05271)
*Shengqin Jiang,Xiaoran Feng,Yuankai Qi,Haokui Zhang,Renlong Hang,Qingshan Liu,Lina Yao,Quan Z. Sheng,Ming-Hsuan Yang*
Main category: cs.CV
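TL;DR: This paper proposes a new few-shot class-incremental learning (FSCIL) method whose core idea is to freeze the feature extractor and fine-tune the prototypes. A dual-calibration mechanism (class-specific and task-aware offsets) improves prototype discriminability, giving strong results on multiple benchmarks with very few parameters.
Details
Motivation: Traditional FSCIL methods rely on a frozen pretrained feature extractor to produce static class prototypes, which inherit the representation bias of the backbone, while existing prompt-based tuning methods struggle to improve global discriminative power under extreme data scarcity. The paper argues that the key challenge in FSCIL is optimizing decision regions within a static, high-quality feature space, not feature acquisition itself.
Method: A new paradigm that freezes the feature extractor and fine-tunes only the prototypes, implemented as an efficient prototype fine-tuning framework that turns static centroids into dynamic, learnable components. A dual-calibration method with class-specific and task-aware offsets jointly improves prototype discriminability for incremental classes.
Result: Achieves the best performance on multiple FSCIL benchmarks while requiring very few learnable parameters.
Conclusion: Freezing the feature extractor and fine-tuning the prototypes is a more effective and lighter FSCIL solution, underscoring the importance of optimizing decision boundaries in a fixed feature space.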
Abstract: Few-shot class-incremental learning (FSCIL) seeks to continuously learn new classes from very limited samples while preserving previously acquired knowledge. Traditional methods often utilize a frozen pre-trained feature extractor to generate static class prototypes, which suffer from the inherent representation bias of the backbone. While recent prompt-based tuning methods attempt to adapt the backbone via minimal parameter updates, given the constraint of extreme data scarcity, the model's capacity to assimilate novel information and substantively enhance its global discriminative power is inherently limited. In this paper, we propose a novel shift in perspective: freezing the feature extractor while fine-tuning the prototypes. We argue that the primary challenge in FSCIL is not feature acquisition, but rather the optimization of decision regions within a static, high-quality feature space. To this end, we introduce an efficient prototype fine-tuning framework that evolves static centroids into dynamic, learnable components. The framework employs a dual-calibration method consisting of class-specific and task-aware offsets. These components function synergistically to improve the discriminative capacity of prototypes for ongoing incremental classes. Extensive results demonstrate that our method attains superior performance across multiple benchmarks while requiring minimal learnable parameters.
### [82] [Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs](https://arxiv.org/abs/2602.05275)
*Qi Li,Yanzhe Zhao,Yongxin Zhou,Yameng Wang,Yandong Yang,Yuanjia Zhou,Jue Wang,Zuojian Wang,Jinxiang Liu*
Main category: cs.CV
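TL;DR: This paper proposes the Magic-MM-Embedding family of models, which use visual token compression and a multi-stage progressive training strategy to substantially improve the inference efficiency of universal multimodal embedding while keeping performance high.
Details
Motivation: Multimodal large language models (MLLMs) show great promise for universal multimodal retrieval, but the high computational cost of processing many visual tokens limits their practical use.
Method: Magic-MM-Embedding consists of (1) an efficient MLLM architecture based on visual token compression and (2) a three-stage progressive training strategy: continued pretraining, contrastive pretraining with hard negative mining, and task-aware fine-tuning guided by an MLLM-as-a-Judge.
Result: Experiments show the model outperforms existing methods by a large margin while being more efficient at inference.
Conclusion: Magic-MM-Embedding balances efficiency and performance in multimodal embedding tasks, offering a new paradigm for practical multimodal retrieval.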
Abstract: Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. However, their practical application is often hindered by the substantial computational cost incurred from processing a large number of tokens from visual inputs. In this paper, we propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding. Our approach is built on two synergistic pillars: (1) a highly efficient MLLM architecture incorporating visual token compression to drastically reduce inference latency and memory footprint, and (2) a multi-stage progressive training strategy designed to not only recover but significantly boost performance. This coarse-to-fine training paradigm begins with extensive continued pretraining to restore multimodal understanding and generation capabilities, progresses to large-scale contrastive pretraining and hard negative mining to enhance discriminative power, and culminates in a task-aware fine-tuning stage guided by an MLLM-as-a-Judge for precise data curation. Comprehensive experiments show that our model outperforms existing methods by a large margin while being more inference-efficient.
### [83] [Fast-SAM3D: 3Dfy Anything in Images but Faster](https://arxiv.org/abs/2602.05293)
*Weilun Feng,Mingqiang Wu,Zhiliang Chen,Chuanguang Yang,Haotong Qin,Yuqi Li,Xiaokun Liu,Guoxin Fan,Zhulin An,Libo Huang,Yulun Zhang,Michele Magno,Yongjun Xu*
Main category: cs.CV
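TL;DR: This paper proposes Fast-SAM3D, a training-free acceleration framework that uses three heterogeneity-aware mechanisms (modality-aware caching, spatiotemporal token carving, and spectral-aware aggregation) to speed up SAM3D inference for open-world single-view 3D reconstruction by up to 2.67× with almost no loss in reconstruction quality.
Details
Motivation: SAM3D supports scalable, open-world 3D reconstruction, but its high inference latency seriously hinders deployment. Generic acceleration strategies are brittle here because they ignore the multi-level heterogeneity of the pipeline (differences in shape and layout dynamics, sparsity of texture refinement, and spectral variance across geometries).
Method: Fast-SAM3D is a training-free framework that dynamically aligns computation with generation complexity. It includes (1) Modality-Aware Step Caching, which decouples structural evolution from layout updates; (2) Joint Spatiotemporal Token Carving, which concentrates refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation, which adapts decoding resolution.
Result: Experiments show up to 2.67× end-to-end speedup with negligible fidelity loss, establishing a new efficiency-quality Pareto frontier for single-view 3D generation.
Conclusion: Neglecting multi-level heterogeneity is the root cause of the failure of existing acceleration methods; with fine-grained, dynamic, heterogeneity-aware scheduling of computation, Fast-SAM3D offers a new paradigm for efficient 3D reconstruction in complex open scenes.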
Abstract: SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the first systematic investigation into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that these failures stem from neglecting the pipeline's inherent multi-level heterogeneity: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present Fast-SAM3D, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching to decouple structural evolution from sensitive layout updates; (2) Joint Spatiotemporal Token Carving to concentrate refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to 2.67$\times$ end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Our code is released at https://github.com/wlfeng0509/Fast-SAM3D.
### [84] [FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion](https://arxiv.org/abs/2602.05305)
*Zhuokun Chen,Jianfei Cai,Bohan Zhuang*
Main category: cs.CV
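TL;DR: This paper proposes FlashBlock, an efficient mechanism that exploits cross-step attention redundancy in block diffusion. By caching and reusing the stable block-external attention output, it substantially reduces attention computation and KV cache access, improving long-context generation efficiency while preserving generation quality.
Details
Motivation: In long-context generation, block diffusion still pays for repeated attention computation over a growing KV cache. The authors identify an overlooked cross-step attention redundancy within a block: the attention output from tokens outside the block is stable, while block-internal attention changes considerably.
Method: The FlashBlock mechanism caches and reuses the stable block-external attention output to avoid recomputation. It is orthogonal to sparse attention and can serve as a complementary residual reuse strategy.
Result: On diffusion language models and video generation, it delivers up to 1.44× higher token throughput and up to 1.6× less attention time, with almost no loss in generation quality.
Conclusion: FlashBlock eases the computational bottleneck of long-context block diffusion, offers a new approach to efficient long-sequence generation, and is compatible with existing optimization techniques.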
Abstract: Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
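One way to see why a cached block-external attention output can be reused is that softmax attention over a concatenated key set decomposes exactly into per-subset results, merged with their log-sum-exp weights. The sketch below demonstrates that identity with NumPy. Whether FlashBlock merges its cache this way is an assumption; the abstract only states that the stable external output is reused. All function and variable names are hypothetical.

```python
import numpy as np

def partial_attention(q, k, v):
    """Softmax attention over one key set, plus the log-sum-exp of its scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    peak = scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores - peak)
    total = weights.sum(axis=-1, keepdims=True)
    return (weights @ v) / total, peak + np.log(total)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results exactly, using their log-sum-exps."""
    peak = np.maximum(lse_a, lse_b)
    w_a, w_b = np.exp(lse_a - peak), np.exp(lse_b - peak)
    return (w_a * out_a + w_b * out_b) / (w_a + w_b)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))                                         # current-block queries
k_ext, v_ext = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))   # earlier context
k_int, v_int = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))     # current block

cached_out, cached_lse = partial_attention(q, k_ext, v_ext)  # candidate for reuse
block_out, block_lse = partial_attention(q, k_int, v_int)    # recomputed each step
merged = merge(cached_out, cached_lse, block_out, block_lse)
full, _ = partial_attention(q, np.vstack([k_ext, k_int]), np.vstack([v_ext, v_int]))
```

Here `merged` matches `full` because the queries are unchanged. Across diffusion steps the queries do change, so reusing the cached external part is an approximation whose quality depends on the stability the authors report; only the small block-internal part has to touch fresh keys and values each step.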
### [85] [Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning](https://arxiv.org/abs/2602.05321)
*Dongki Jung,Jaehoon Choi,Adil Qureshi,Somi Jeong,Dinesh Manocha,Suyong Yeon*
Main category: cs.CV
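TL;DR: Wid3R is a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models and is the first multi-view foundation model to perform feed-forward 3D reconstruction directly from 360-degree images.
Details
Motivation: Existing methods usually assume that input images come from pinhole cameras or have been rectified to perspective views, which makes them hard to apply to real-world fisheye or panoramic cameras and dependent on tedious calibration and undistortion.
Method: Use a ray representation based on spherical harmonics and a novel camera model token inside the network to enable distortion-aware 3D reconstruction.
Result: Wid3R shows strong zero-shot transfer on Stanford2D3D, with improvements of up to +77.33, and generalizes robustly across several wide field-of-view camera types.
Conclusion: Wid3R is the first foundation model to support feed-forward multi-view 3D reconstruction from 360-degree images, removing the reliance on the pinhole model and making wide field-of-view geometry reconstruction more practical and general.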
Abstract: We present Wid3R, a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models. Prior methods typically assume that input images are rectified or captured with pinhole cameras, since both their architectures and training datasets are tailored to perspective images only. These assumptions limit their applicability in real-world scenarios that use fisheye or panoramic cameras and often require careful calibration and undistortion. In contrast, Wid3R is a generalizable multi-view 3D estimation method that can model wide field-of-view camera types. Our approach leverages a ray representation with spherical harmonics and a novel camera model token within the network, enabling distortion-aware 3D reconstruction. Furthermore, Wid3R is the first multi-view foundation model to support feed-forward 3D reconstruction directly from 360-degree imagery. It demonstrates strong zero-shot robustness and consistently outperforms prior methods, achieving improvements of up to +77.33 on Stanford2D3D.
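A per-pixel ray is what replaces the pinhole assumption for panoramic input. The sketch below maps a pixel of an equirectangular (360-degree) image to a unit ray direction. It covers only this geometric step: the spherical-harmonics encoding and the camera model token from the abstract are not shown, and the axis convention and the name `erp_pixel_to_ray` are assumptions.

```python
import math

def erp_pixel_to_ray(u, v, width, height):
    """Unit ray direction for pixel (u, v) of an equirectangular image.

    Convention (an assumption): x right, y up, z forward; the image centre
    looks along +z, longitude grows to the right, latitude grows upward.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )

# Fractional pixel coordinates chosen so the rays hit exact angles.
centre = erp_pixel_to_ray(511.5, 255.5, 1024, 512)  # longitude 0, latitude 0
right = erp_pixel_to_ray(767.5, 255.5, 1024, 512)   # longitude 90 degrees
```

Rays cover the full sphere, including directions behind the camera that no pinhole projection can represent, which is why a ray-based input description generalises across fisheye and panoramic cameras.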
### [86] [MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors](https://arxiv.org/abs/2602.05330)
*Jingdong Zhang,Xiaohang Zhan,Lingzhi Zhang,Yizhou Wang,Zhengming Yu,Jionghao Wang,Wenping Wang,Xin Li*
Main category: cs.CV
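TL;DR: This paper proposes MTPano, a label-free multi-task panoramic foundation model. It uses perspective foundation models to generate pseudo-labels and designs the Panoramic Dual BridgeNet to decouple rotation-invariant and rotation-variant tasks, which addresses the geometric distortion and coordinate-system discrepancies of panoramic images and achieves SOTA performance in panoramic scene understanding.
Details
Motivation: Panoramic scene understanding suffers from a scarcity of high-resolution, multi-task annotations, and existing perspective foundation models are hard to adapt directly to the panoramic domain because of severe geometric distortion and inconsistent coordinate systems; in addition, the intrinsic relations between different dense prediction tasks in spherical space remain underexplored.
Method: Proposes MTPano: 1) perspective foundation models generate domain-gap-free pseudo-labels on projected perspective patches, which are re-projected onto the panorama as supervision; 2) tasks are split into rotation-invariant and rotation-variant groups, and the Panoramic Dual BridgeNet with geometry-aware modulation layers decouples the feature streams; 3) ERP token mixers and a dual-branch BridgeNet (with gradient truncation) mitigate distortion and promote beneficial cross-task information sharing; 4) auxiliary tasks such as image gradient and point map strengthen cross-task learning.
Result: MTPano achieves SOTA performance on multiple panoramic benchmarks and is competitive with task-specific panoramic specialist models.
Conclusion: With a label-free training paradigm and a geometry-aware multi-task architecture, MTPano substantially improves unified modeling of panoramic dense prediction tasks and offers a new direction for building general panoramic foundation models.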
Abstract: Comprehensive panoramic scene understanding is critical for immersive applications, yet it remains challenging due to the scarcity of high-resolution, multi-task annotations. While perspective foundation models have achieved success through data scaling, directly adapting them to the panoramic domain often fails due to severe geometric distortions and coordinate system discrepancies. Furthermore, the underlying relations between diverse dense prediction tasks in spherical spaces are underexplored. To address these challenges, we propose MTPano, a robust multi-task panoramic foundation model established by a label-free training pipeline. First, to circumvent data scarcity, we leverage powerful perspective dense priors. We project panoramic images into perspective patches to generate accurate, domain-gap-free pseudo-labels using off-the-shelf foundation models, which are then re-projected to serve as patch-wise supervision. Second, to tackle the interference between task types, we categorize tasks into rotation-invariant (e.g., depth, segmentation) and rotation-variant (e.g., surface normals) groups. We introduce the Panoramic Dual BridgeNet, which disentangles these feature streams via geometry-aware modulation layers that inject absolute position and ray direction priors. To handle the distortion from equirectangular projections (ERP), we incorporate ERP token mixers followed by a dual-branch BridgeNet for interactions with gradient truncation, facilitating beneficial cross-task information sharing while blocking conflicting gradients from incompatible task attributes. Additionally, we introduce auxiliary tasks (image gradient, point map, etc.) to fertilize the cross-task learning process. Extensive experiments demonstrate that MTPano achieves state-of-the-art performance on multiple benchmarks and delivers competitive results against task-specific panoramic specialist foundation models.
### [87] [Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation](https://arxiv.org/abs/2602.05339)
*Yongwoo Kim,Sungmin Cha,Hyunsoo Kim,Jaewon Lee,Donghyun Kim*
Main category: cs.CV
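TL;DR: This paper proposes the PAIR framework, which uses unsafe-safe concept pairs to perform consistency-preserving semantic realignment rather than simple removal, so that harmful concepts are erased precisely in text-to-image generation while structural and semantic consistency is preserved.
Details
Motivation: Existing concept erasure methods focus only on removing unsafe concepts and provide no guidance toward safe alternatives, so the erased generations are structurally and semantically inconsistent with the originals.
Method: Proposes the PAIR framework: 1) construct paired unsafe-safe multimodal data; 2) design a Paired Semantic Realignment objective that maps target concepts to semantically aligned safe anchors; 3) initialize the DoRA low-rank adaptation matrices with Fisher-information weighting, so that unsafe concepts are suppressed while safe generation is encouraged.
Result: Significantly outperforms SOTA baselines across experiments, erasing target concepts effectively while better preserving structural integrity, semantic coherence, and generation quality.
Conclusion: Concept erasure should shift from "pure removal" to "semantic realignment"; by exploiting unsafe-safe pairs, PAIR achieves fine-grained, consistency-preserving erasure and offers a new paradigm for controllable text-to-image generation.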
Abstract: With the increasing versatility of text-to-image diffusion models, the ability to selectively erase undesirable concepts (e.g., harmful content) has become indispensable. However, existing concept erasure approaches primarily focus on removing unsafe concepts without providing guidance toward corresponding safe alternatives, which often leads to failure in preserving the structural and semantic consistency between the original and erased generations. In this paper, we propose a novel framework, PAIRed Erasing (PAIR), which reframes concept erasure from simple removal to consistency-preserving semantic realignment using unsafe-safe pairs. We first generate safe counterparts from unsafe inputs while preserving structural and semantic fidelity, forming paired unsafe-safe multimodal data. Leveraging these pairs, we introduce two key components: (1) Paired Semantic Realignment, a guided objective that uses unsafe-safe pairs to explicitly map target concepts to semantically aligned safe anchors; and (2) Fisher-weighted Initialization for DoRA, which initializes parameter-efficient low-rank adaptation matrices using unsafe-safe pairs, encouraging the generation of safe alternatives while selectively suppressing unsafe concepts. Together, these components enable fine-grained erasure that removes only the targeted concepts while maintaining overall semantic consistency. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.
### [88] [Learning with Adaptive Prototype Manifolds for Out-of-Distribution Detection](https://arxiv.org/abs/2602.05349)
*Ningkang Peng,JiuTao Zhou,Yuhao Zhang,Xiaoqian Peng,Qianfeng Yu,Linjing Qian,Tingyu Lu,Yi Chen,Yanhui Gu*
Main category: cs.CV
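TL;DR: This paper proposes the APEX framework, which uses an Adaptive Prototype Manifold (APM) and a Posterior-Aware OOD Scoring (PAOS) mechanism to address two fundamental flaws of existing prototype learning methods, the Static Homogeneity Assumption and the Learning-Inference Disconnect, and significantly improves OOD detection performance.
Details
Motivation: Existing prototype-based representation learning methods have two fundamental flaws: the Static Homogeneity Assumption (fixed representational resources for every class) and the Learning-Inference Disconnect (prototype quality knowledge is discarded at inference). Both limit model capacity and performance.
Method: Proposes the APEX framework with a Two-Stage Repair process: 1) the Adaptive Prototype Manifold (APM), which uses the Minimum Description Length (MDL) principle to determine the optimal number of prototypes $K_c^*$ for each class automatically and resolve prototype collision; 2) the Posterior-Aware OOD Scoring (PAOS) mechanism, which quantifies prototype quality (cohesion and separation) to bridge the learning-inference disconnect.
Result: Experiments on benchmarks such as CIFAR-100 show that APEX achieves new SOTA performance.
Conclusion: By adapting prototype complexity dynamically and exploiting prototype quality information, APEX overcomes fundamental limitations of prototype learning methods and significantly improves OOD detection capability and robustness.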
Abstract: Out-of-distribution (OOD) detection is a critical task for the safe deployment of machine learning models in the real world. Existing prototype-based representation learning methods have demonstrated exceptional performance. However, we identify two fundamental flaws that universally constrain these methods: the Static Homogeneity Assumption (fixed representational resources for all classes) and the Learning-Inference Disconnect (discarding rich prototype quality knowledge at inference). These flaws fundamentally limit the model's capacity and performance. To address these issues, we propose APEX (Adaptive Prototype for eXtensive OOD Detection), a novel OOD detection framework designed via a Two-Stage Repair process to optimize the learned feature manifold. APEX introduces two key innovations to address these respective flaws: (1) an Adaptive Prototype Manifold (APM), which leverages the Minimum Description Length (MDL) principle to automatically determine the optimal prototype complexity $K_c^*$ for each class, thereby fundamentally resolving prototype collision; and (2) a Posterior-Aware OOD Scoring (PAOS) mechanism, which quantifies prototype quality (cohesion and separation) to bridge the learning-inference disconnect. Comprehensive experiments on benchmarks such as CIFAR-100 validate the superiority of our method, where APEX achieves new state-of-the-art performance.
### [89] [Multimodal Latent Reasoning via Hierarchical Visual Cues Injection](https://arxiv.org/abs/2602.05359)
*Yiming Zhang,Qiangyu Yan,Borui Jiang,Kai Han*
Main category: cs.CV
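TL;DR: This paper proposes the HIVE framework, which injects hierarchical visual cues into the latent space to give multimodal large language models "slow thinking" reasoning and improve their understanding of complex scenes.
Details
Motivation: Existing MLLMs rely on end-to-end generation or language-centric chains of thought (CoT), which are inefficient, verbose, and prone to hallucination; more robust multimodal reasoning needs to happen in the latent space.
Method: Proposes the HIVE framework: transformer blocks are extended recursively to form an internal loop that refines reasoning iteratively, and hierarchical visual cues, from the global scene down to fine-grained regions, are injected directly into the model's latent representations.
Result: Experiments show that test-time scaling is effective and that integrating hierarchical information significantly improves the model's understanding of complex scenes.
Conclusion: HIVE achieves latent-space multimodal "slow thinking" reasoning that does not depend on textual rationales, offering MLLMs a more reliable, efficient, and interpretable reasoning paradigm.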
Abstract: The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often remains a "fast thinking" paradigm, reliant on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (HIVE), a novel framework that instills deliberate, "slow thinking" without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it grounds this process by injecting hierarchical visual cues, from global scene context to fine-grained regional details, directly into the model's latent representations. This enables the model to perform grounded, multi-step inference entirely in the aligned latent space. Extensive evaluations demonstrate that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model's understanding of complex scenes.
### [90] [Breaking Semantic Hegemony: Decoupling Principal and Residual Subspaces for Generalized OOD Detection](https://arxiv.org/abs/2602.05360)
*Ningkang Peng,Xiaoqian Peng,Yuhao Zhang,Qianfeng Yu,Feng Xing,Peirong Ma,Xichen Yang,Yi Chen,Tingyu Lu,Yanhui Gu*
Main category: cs.CV
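TL;DR: This paper identifies a "Simplicity Paradox" in existing OOD detection methods: they are sensitive to subtle semantic differences but insensitive to structurally distinct yet semantically simple OOD samples and to high-frequency sensor noise. The authors attribute this to "Semantic Hegemony" in the deep feature space and propose D-KNN, a training-free geometric decoupling framework that separates semantic and structural components by orthogonal decomposition and calibrates across the two spaces to restore sensitivity to residual signals, significantly improving OOD detection performance.
Details
Motivation: Expose the "Simplicity Paradox" in existing SOTA OOD detection models, namely sensitivity to subtle semantic differences but insensitivity to structurally distinct samples such as sensor noise, and investigate its cause (Semantic Hegemony and the spectral concentration bias under Neural Collapse).
Method: Proposes D-KNN, a training-free, plug-and-play geometric decoupling framework: orthogonal decomposition explicitly separates semantic components from structural residuals, and a dual-space calibration mechanism restores the model's sensitivity to weak residual signals.
Result: Achieves new SOTA on CIFAR and ImageNet benchmarks; FPR95 drops from 31.3% to 2.3%; AUROC for sensor failures such as Gaussian noise rises from 79.7% to 94.9%.
Conclusion: D-KNN effectively breaks Semantic Hegemony and alleviates the Simplicity Paradox, shows that explicitly modeling structural information is essential for OOD detection, and provides a new geometric perspective for post-hoc feature-based methods.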
Abstract: While feature-based post-hoc methods have made significant strides in Out-of-Distribution (OOD) detection, we uncover a counter-intuitive Simplicity Paradox in existing state-of-the-art (SOTA) models: these models exhibit keen sensitivity in distinguishing semantically subtle OOD samples but suffer from severe Geometric Blindness when confronting structurally distinct yet semantically simple samples or high-frequency sensor noise. We attribute this phenomenon to Semantic Hegemony within the deep feature space and reveal its mathematical essence through the lens of Neural Collapse. Theoretical analysis demonstrates that the spectral concentration bias, induced by the high variance of the principal subspace, numerically masks the structural distribution shift signals that should be significant in the residual subspace. To address this issue, we propose D-KNN, a training-free, plug-and-play geometric decoupling framework. This method utilizes orthogonal decomposition to explicitly separate semantic components from structural residuals and introduces a dual-space calibration mechanism to reactivate the model's sensitivity to weak residual signals. Extensive experiments demonstrate that D-KNN effectively breaks Semantic Hegemony, establishing new SOTA performance on both CIFAR and ImageNet benchmarks. Notably, in resolving the Simplicity Paradox, it reduces the FPR95 from 31.3% to 2.3%; when addressing sensor failures such as Gaussian noise, it boosts the detection performance (AUROC) from a baseline of 79.7% to 94.9%.
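The decomposition step described above can be sketched as follows. This is an illustration under stated assumptions, not D-KNN itself: the principal basis is given rather than estimated, and the combination rule (a weighted sum with a hypothetical `alpha`) stands in for the dual-space calibration, which the abstract does not specify.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decompose(f, basis):
    """Split f into its projection onto span(basis) and the orthogonal residual.
    `basis` must contain orthonormal vectors."""
    principal = [0.0] * len(f)
    for u in basis:
        c = dot(f, u)
        principal = [p + c * x for p, x in zip(principal, u)]
    residual = [x - p for x, p in zip(f, principal)]
    return principal, residual

def knn_distance(x, bank, k=1):
    """Distance from x to its k-th nearest neighbour in bank."""
    return sorted(math.dist(x, b) for b in bank)[k - 1]

def dual_space_score(f, train_feats, basis, k=1, alpha=0.5):
    """Higher means more OOD-like. alpha is a hypothetical mixing weight."""
    p, r = decompose(f, basis)
    parts = [decompose(t, basis) for t in train_feats]
    d_principal = knn_distance(p, [pp for pp, _ in parts], k)
    d_residual = knn_distance(r, [rr for _, rr in parts], k)
    return (1.0 - alpha) * d_principal + alpha * d_residual

# Toy features: the first axis carries the high-variance "semantic" signal.
basis = [[1.0, 0.0, 0.0]]
train = [[5.0, 0.1, 0.0], [-5.0, 0.0, 0.1], [4.0, -0.1, 0.0]]
clean = [4.5, 0.0, 0.05]
noisy = [4.5, 2.0, 2.0]   # same principal component, large residual
score_clean = dual_space_score(clean, train, basis)
score_noisy = dual_space_score(noisy, train, basis)
```

Both samples sit at the same distance from the training set in the principal subspace, so only the residual term separates them. This mirrors the abstract's argument that high principal-subspace variance can mask residual shift when the two are not scored separately.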
### [91] [Imagine a City: CityGenAgent for Procedural 3D City Generation](https://arxiv.org/abs/2602.05362)
*Zishan Liu,Zecong Tang,RuoCheng Wu,Xinzhe Zheng,Jingyu Hu,Ka-Hei Hui,Haoran Xie,Bo Dai,Zhengzhe Liu*
Main category: cs.CV
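TL;DR: This paper proposes CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. Two-stage learning (supervised fine-tuning followed by reinforcement learning) improves structural correctness, spatial reasoning, and text-visual consistency, and the framework supports natural language editing.
Details
Motivation: Existing automated 3D city generation methods fall short in high-fidelity asset creation, controllability, and editability, and struggle to meet the needs of applications such as autonomous driving, virtual reality, and embodied intelligence.
Method: Proposes the CityGenAgent framework, which decomposes city generation into two interpretable components, Block Program and Building Program, and adopts two-stage learning: (1) Supervised Fine-Tuning (SFT) ensures that programs satisfy geometric and semantic constraints; (2) Reinforcement Learning (RL) uses a Spatial Alignment Reward and a Visual Consistency Reward to improve spatial reasoning and text-visual matching.
Result: Experiments show that CityGenAgent outperforms existing methods in semantic alignment, visual quality, and controllability, and supports natural language-driven city editing and manipulation.
Conclusion: CityGenAgent provides a solid foundation for scalable, controllable, high-quality 3D city generation and moves procedural content generation toward more intelligent and interactive systems.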
Abstract: The automated generation of interactive 3D cities is a critical challenge with broad applications in autonomous driving, virtual reality, and embodied intelligence. While recent advances in generative models and procedural techniques have improved the realism of city generation, existing methods often struggle with high-fidelity asset creation, controllability, and manipulation. In this work, we introduce CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. Our approach decomposes city generation into two interpretable components, Block Program and Building Program. To ensure structural correctness and semantic alignment, we adopt a two-stage learning strategy: (1) Supervised Fine-Tuning (SFT). We train BlockGen and BuildingGen to generate valid programs that adhere to schema constraints, including non-self-intersecting polygons and complete fields; (2) Reinforcement Learning (RL). We design Spatial Alignment Reward to enhance spatial reasoning ability and Visual Consistency Reward to bridge the gap between textual descriptions and the visual modality. Benefiting from the programs and the models' generalization, CityGenAgent supports natural language editing and manipulation. Comprehensive evaluations demonstrate superior semantic alignment, visual quality, and controllability compared to existing methods, establishing a robust foundation for scalable 3D city generation.
### [92] [SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback](https://arxiv.org/abs/2602.05380)
*Xiaoxuan He,Siming Fu,Wanli Li,Zhiyuan Li,Dacheng Yin,Kang Rong,Fengyun Rao,Bo Zhang*
Main category: cs.CV
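TL;DR: This paper proposes the SAIL framework, which uses iterative self-amplified learning to align diffusion models with human preferences without external reward models or large-scale preference data.
Details
Motivation: Existing diffusion model alignment methods depend on external reward models or large-scale human preference data, which is costly and impractical; this paper explores whether alignment is feasible with only minimal human feedback and no reward model, by unlocking the latent capabilities of the diffusion model itself.
Method: Proposes SAIL (Self-Amplified Iterative Learning): starting from a small seed set of human-annotated preference pairs, the model runs a closed loop in which it generates samples, self-annotates preferences, and fine-tunes on the self-augmented data; a ranked preference mixup strategy balances exploration with adherence to the initial human priors.
Result: Significantly outperforms existing methods on multiple benchmarks while using only 6% of the preference data, indicating that diffusion models have strong self-improvement capabilities.
Conclusion: With SAIL, diffusion models can align themselves efficiently with little data, replacing large-scale human annotation and external reward models.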
Abstract: Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. *This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves?* In this paper, we propose **SAIL** (Self-Amplified Iterative Learning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.
### [93] [VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs](https://arxiv.org/abs/2602.05382)
*Tina Khezresmaeilzadeh,Jike Zhong,Konstantinos Psounis*
Main category: cs.CV
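TL;DR: This paper introduces the VRIQ benchmark to evaluate the nonverbal visual reasoning ability of vision language models (VLMs). It finds that VLMs perform poorly on both abstract puzzle tasks and natural-image tasks, and that the main bottleneck is perception rather than reasoning.
Details
Motivation: Investigate whether current vision language models (VLMs) can reliably perform nonverbal visual reasoning tasks.
Method: Builds the VRIQ benchmark with two task types, abstract puzzle-style and natural-image reasoning; designs diagnostic probes that locate the source of failures in perception or reasoning, with perception subdivided into categories such as shape, count, position, and 3D/depth.
Result: Average accuracy is only about 28% on abstract tasks and 45% on natural tasks; about 56% of failures stem from perception alone, 43% from both perception and reasoning, and only 1% from reasoning alone; certain perception categories (e.g., count, 3D) have higher error rates.
Conclusion: Visual reasoning in current VLMs is unreliable, mainly because of perception limitations rather than reasoning deficits; VRIQ provides an interpretable diagnostic framework and directions for improving visual reasoning in multimodal systems.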
Abstract: Recent progress in Vision Language Models (VLMs) has raised the question of whether they can reliably perform nonverbal reasoning. To this end, we introduce VRIQ (Visual Reasoning IQ), a novel benchmark designed to assess and analyze the visual reasoning ability of VLMs. We evaluate models on two sets of tasks: abstract puzzle-style and natural-image reasoning tasks. We find that on abstract puzzles, performance remains near random with an average accuracy of around 28%, while natural tasks yield better but still weak results with 45% accuracy. We also find that tool-augmented reasoning demonstrates only modest improvements. To uncover the source of this weakness, we introduce diagnostic probes targeting perception and reasoning. Our analysis demonstrates that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and only a mere 1% from reasoning alone. This motivates us to design fine-grained diagnostic probe questions targeting specific perception categories (e.g., shape, count, position, 3D/depth), revealing that certain categories cause more failures than others. Our benchmark and analysis establish that current VLMs, even with visual reasoning tools, remain unreliable abstract reasoners, mostly due to perception limitations, and offer a principled basis for improving visual reasoning in multimodal systems.
### [94] [Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting](https://arxiv.org/abs/2602.05384)
*Hao Feng,Wei Shi,Ke Zhang,Xiang Fei,Lei Liao,Dingkang Yang,Yongkun Du,Xuecheng Wu,Jingqun Tang,Yang Liu,Hong Chen,Can Huang*
Main category: cs.CV
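TL;DR: Dolphin-v2 is a two-stage document image parsing model. It performs document type classification jointly with layout analysis and applies a different parsing strategy to each document type (holistic page-level parsing for photographed documents, element-wise parallel parsing for digital-born ones), which substantially improves robustness to distorted or photographed documents, fine-grained element recognition (21 categories plus semantic attributes), and code block recognition, with large gains on multiple benchmarks.
Details
Motivation: Document parsing models are fragmented and model selection is complex; two-stage methods rely on axis-aligned bounding boxes and handle distorted or photographed documents poorly.
Method: Proposes the two-stage Dolphin-v2 model: the first stage jointly performs document type classification (digital-born versus photographed) and layout analysis (including finer-grained element detection and reading order prediction for digital documents); the second stage uses a hybrid parsing strategy, parsing photographed documents holistically at the page level to handle geometric distortion and parsing digital documents element by element in parallel, guided by layout anchors.
Result: +14.78 points overall on OmniDocBench and a 91% error reduction on photographed documents; supports recognition of 21 fine-grained element categories, extraction of semantic attributes such as author information and metadata, and code block recognition with indentation preserved, while parallel processing keeps inference efficient.
Conclusion: Through structured design and task coordination, Dolphin-v2 addresses two core challenges in document parsing, model fragmentation and robustness to distortion, and significantly improves practicality and generalization.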
Abstract: Document parsing has garnered widespread attention as vision-language models (VLMs) advance OCR capabilities. However, the field remains fragmented across dozens of specialized models with varying strengths, forcing users to navigate complex model selection and limiting system scalability. Moreover, existing two-stage approaches depend on axis-aligned bounding boxes for layout detection, failing to handle distorted or photographed documents effectively. To this end, we present Dolphin-v2, a two-stage document image parsing model that substantially improves upon the original Dolphin. In the first stage, Dolphin-v2 jointly performs document type classification (digital-born versus photographed) alongside layout analysis. For digital-born documents, it conducts finer-grained element detection with reading order prediction. In the second stage, we employ a hybrid parsing strategy: photographed documents are parsed holistically as complete pages to handle geometric distortions, while digital-born documents undergo element-wise parallel parsing guided by the detected layout anchors, enabling efficient content extraction. Compared with the original Dolphin, Dolphin-v2 introduces several crucial enhancements: (1) robust parsing of photographed documents via holistic page-level understanding, (2) finer-grained element detection (21 categories) with semantic attribute extraction such as author information and document metadata, and (3) code block recognition with indentation preservation, which existing systems typically lack. Comprehensive evaluations are conducted on DocPTBench, OmniDocBench, and our self-constructed RealDoc-160 benchmark. The results demonstrate substantial improvements: +14.78 points overall on the challenging OmniDocBench and 91% error reduction on photographed documents, while maintaining efficient inference through parallel processing.
### [95] [Parallel Swin Transformer-Enhanced 3D MRI-to-CT Synthesis for MRI-Only Radiotherapy Planning](https://arxiv.org/abs/2602.05387)
*Zolnamar Dorjsembe,Hung-Yi Chen,Furen Xiao,Hsing-Kuo Pao*
Main category: cs.CV
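TL;DR: This paper proposes Parallel Swin Transformer-Enhanced Med2Transformer, a 3D network architecture that generates synthetic CT images from MRI to support MRI-only radiotherapy planning. It combines convolutional encoding with dual Swin Transformer branches to capture both local detail and long-range context, and improves image similarity, geometric accuracy, and dose calculation accuracy (mean target dose error of 1.69%).
Details
Motivation: MRI lacks electron density information and cannot be used directly for radiotherapy dose calculation; current workflows require MRI and CT registration, which adds uncertainty and complexity. Synthetic CT enables MRI-only planning but is challenged by the nonlinear MRI-CT mapping and anatomical variability.
Method: Proposes Parallel Swin Transformer-Enhanced Med2Transformer: a 3D architecture that fuses a convolutional encoder with two parallel Swin Transformer branches and uses multi-scale shifted window attention with hierarchical feature aggregation to jointly model local anatomical detail and long-range contextual dependencies.
Result: On public and clinical datasets, image similarity (e.g., PSNR, SSIM) and geometric accuracy (e.g., contour Dice, Hausdorff distance) are higher than for baseline methods; dosimetric evaluation shows a mean target dose error of 1.69%, which is clinically acceptable.
Conclusion: The proposed method improves MRI-to-CT synthesis quality and the reliability of radiotherapy dose prediction, offering a feasible, high-accuracy technical path for clinical MRI-only radiotherapy workflows.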
Abstract: MRI provides superior soft tissue contrast without ionizing radiation; however, the absence of electron density information limits its direct use for dose calculation. As a result, current radiotherapy workflows rely on combined MRI and CT acquisitions, increasing registration uncertainty and procedural complexity. Synthetic CT generation enables MRI-only planning but remains challenging due to nonlinear MRI-CT relationships and anatomical variability. We propose Parallel Swin Transformer-Enhanced Med2Transformer, a 3D architecture that integrates convolutional encoding with dual Swin Transformer branches to model both local anatomical detail and long-range contextual dependencies. Multi-scale shifted window attention with hierarchical feature aggregation improves anatomical fidelity. Experiments on public and clinical datasets demonstrate higher image similarity and improved geometric accuracy compared with baseline methods. Dosimetric evaluation shows clinically acceptable performance, with a mean target dose error of 1.69%. Code is available at: https://github.com/mobaidoctor/med2transformer.
### [96] [Dataset Distillation via Relative Distribution Matching and Cognitive Heritage](https://arxiv.org/abs/2602.05391)
*Qianxin Xia,Jiawei Du,Yuhan Zhang,Jielei Wang,Guoming Lu*
Main category: cs.CV
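TL;DR: This paper proposes statistical flow matching, a new method for efficient dataset distillation, aimed at classification tasks built on pre-trained self-supervised models. It optimizes synthetic images by aligning constant statistical flows between class centers of the original data, which greatly reduces compute and memory overhead, and adds a classifier inheritance strategy that further improves performance and efficiency.
Details
Motivation: Existing linear gradient matching incurs high compute and memory overhead in dataset distillation because every step loads a large number of real images and applies differentiable augmentation multiple times.
Method: Proposes the statistical flow matching framework, which loads the statistics of the original data only once and augments the synthetic images in a single pass; also designs a classifier inheritance strategy that reuses the classifier trained on the original data and adds only a lightweight linear projector.
Result: Compared with SOTA methods, GPU memory usage is 10x lower and runtime is 4x shorter, with comparable or better performance; the classifier inheritance strategy yields substantial gains with minimal storage overhead.
Conclusion: Statistical flow matching is a stable and efficient new paradigm for dataset distillation; combined with classifier inheritance, it shows strong potential in resource-constrained settings.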
Abstract: Dataset distillation seeks to synthesize a highly compact dataset that achieves performance comparable to the original dataset on downstream tasks. For classification tasks that use pre-trained self-supervised models as backbones, previous linear gradient matching optimizes synthetic images by encouraging them to mimic the gradient updates induced by real images on the linear classifier. However, this batch-level formulation requires loading thousands of real images and applying multiple rounds of differentiable augmentations to synthetic images at each distillation step, leading to substantial computational and memory overhead. In this paper, we introduce statistical flow matching, a stable and efficient supervised learning framework that optimizes synthetic images by aligning constant statistical flows from target class centers to non-target class centers in the original data. Our approach loads raw statistics only once and performs a single augmentation pass on the synthetic data, achieving performance comparable to or better than the state-of-the-art methods with 10x lower GPU memory usage and 4x shorter runtime. Furthermore, we propose a classifier inheritance strategy that reuses the classifier trained on the original dataset for inference, requiring only an extremely lightweight linear projector and marginal storage while achieving substantial performance gains.
### [97] [Explainable Pathomics Feature Visualization via Correlation-aware Conditional Feature Editing](https://arxiv.org/abs/2602.05397)
*Yuechen Yang,Junlin Guo,Ruining Deng,Junchao Zhu,Zhengyi Lu,Chongyu Qu,Yanfan Zhu,Xingyi Guo,Yu Wang,Shilin Zhao,Haichun Yang,Yuankai Huo*
Main category: cs.CV
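TL;DR: This paper proposes a Manifold-Aware Diffusion (MAD) framework for digital pathology. By regularizing feature trajectories in a disentangled latent space learned by a variational auto-encoder, it enables controllable and biologically plausible editing of cell nuclei features, overcoming the unrealistic artifacts that conventional conditional diffusion models produce when they ignore feature correlations.
Details
Motivation: Existing pathomics features (e.g., "second-order moment") are hard to interpret across clinical contexts, which limits practical use; conditional diffusion models assume feature independence, which conflicts with the strong intrinsic correlations among pathomics features and pushes edits off the biological manifold, producing unrealistic images.
Method: Proposes the Manifold-Aware Diffusion (MAD) framework: a VAE first learns a disentangled latent space that models the pathomics feature manifold; the editing trajectory of the target feature is regularized in this space so that correlated attributes adjust together and stay on the manifold; the optimized features then condition a diffusion model that generates high-fidelity images.
Result: Experiments show that MAD navigates the manifold of real cells when editing pathomics features, outperforms baseline methods on conditional feature editing, and better preserves structural coherence of cells.
Conclusion: By introducing manifold awareness, MAD improves the biological plausibility of pathomics feature editing and the quality of image synthesis, offering a new paradigm for explainable and reproducible digital pathology biomarker research.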
Abstract: Pathomics is a recent approach that offers rich quantitative features beyond what black-box deep learning can provide, supporting more reproducible and explainable biomarkers in digital pathology. However, many derived features (e.g., "second-order moment") remain difficult to interpret, especially across different clinical contexts, which limits their practical adoption. Conditional diffusion models show promise for explainability through feature editing, but they typically assume feature independence, an assumption violated by intrinsically correlated pathomics features. Consequently, editing one feature while fixing others can push the model off the biological manifold and produce unrealistic artifacts. To address this, we propose a Manifold-Aware Diffusion (MAD) framework for controllable and biologically plausible cell nuclei editing. Unlike existing approaches, our method regularizes feature trajectories within a disentangled latent space learned by a variational auto-encoder (VAE). This ensures that manipulating a target feature automatically adjusts correlated attributes to remain within the learned distribution of real cells. These optimized features then guide a conditional diffusion model to synthesize high-fidelity images. Experiments demonstrate that our approach is able to navigate the manifold of pathomics features when editing those features. The proposed method outperforms baseline methods in conditional feature editing while preserving structural coherence.
### [98] [TSBOW: Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions](https://arxiv.org/abs/2602.05414)
*Ngoc Doan-Minh Huynh,Duong Nguyen-Ngoc Tran,Long Hoang Pham,Tai Huu-Phuong Tran,Hyung-Joon Jeon,Huy-Hung Nguyen,Duong Khac Vu,Hyung-Min Jeon,Son Hong Phan,Quoc Pham-Nam Ho,Chi Dai Tran,Trinh Le Ba Khanh,Jae Wook Jeon*
Main category: cs.CV
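TL;DR: This paper presents the TSBOW dataset for improving detection of occluded vehicles in adverse weather. It contains 32 hours of real-world traffic data and more than 48,000 manually annotated frames, and supports research on intelligent transportation systems.
Details
Motivation: Global warming is intensifying extreme weather, which degrades CCTV video quality, disrupts traffic flow, and raises accident rates; existing datasets lack coverage of extreme weather.
Method: Builds TSBOW, a large-scale real-world traffic surveillance dataset covering diverse extreme weather, road types, viewpoints, and scales, with manually annotated frames plus 3.2 million semi-labeled frames, and establishes an object detection benchmark under occlusion and adverse weather.
Result: Releases TSBOW, the first high-quality surveillance dataset for multiple classes of traffic participants (including micromobility devices and pedestrians) in complex weather and occlusion scenarios, with more than 48,000 manually annotated frames and rich metadata.
Conclusion: TSBOW fills the gap in traffic surveillance data for extreme weather and provides a key resource for making vehicle detection more robust in adverse conditions and for advancing intelligent transportation systems.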
Abstract: Global warming has intensified the frequency and severity of extreme weather events, which degrade CCTV signal and video quality while disrupting traffic flow, thereby increasing traffic accident rates. Existing datasets, often limited to light haze, rain, and snow, fail to capture extreme weather conditions. To address this gap, this study introduces the Traffic Surveillance Benchmark for Occluded vehicles under various Weather conditions (TSBOW), a comprehensive dataset designed to enhance occluded vehicle detection across diverse annual weather scenarios. Comprising over 32 hours of real-world traffic data from densely populated urban areas, TSBOW includes more than 48,000 manually annotated and 3.2 million semi-labeled frames, with bounding boxes spanning eight traffic participant classes, from large vehicles to micromobility devices and pedestrians. We establish an object detection benchmark for TSBOW, highlighting challenges posed by occlusions and adverse weather. With its varied road types, scales, and viewpoints, TSBOW serves as a critical resource for advancing Intelligent Transportation Systems. Our findings underscore the potential of CCTV-based traffic monitoring and pave the way for new research and applications. The TSBOW dataset is publicly available at: https://github.com/SKKUAutoLab/TSBOW.
### [99] [VMF-GOS: Geometry-guided virtual Outlier Synthesis for Long-Tailed OOD Detection](https://arxiv.org/abs/2602.05415)
*Ningkang Peng,Qianfeng Yu,Yuhao Zhang,Yafei Liu,Xiaoqian Peng,Peirong Ma,Yi Chen,Peiheng Li,Yanhui Gu*
Main category: cs.CV
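TL;DR: This paper proposes GOS, a new OOD detection framework that needs no external data. Guided by a vMF distribution, it synthesizes virtual outliers in a low-likelihood annulus of the feature space and adds a Dual-Granularity Semantic loss to sharpen the separation between ID samples and the synthesized outliers; under long-tailed distributions it outperforms SOTA methods that rely on external data.
Details
Motivation: Under long-tailed distributions, existing OOD detection methods suffer from blurred decision boundaries because tail classes have few samples, and they rely on large-scale external datasets (such as 80 Million Tiny Images) for regularization, which creates practical deployment barriers: high data acquisition cost and privacy sensitivity.
Method: Proposes the Geometry-guided virtual Outlier Synthesis (GOS) strategy: model statistical properties with a vMF distribution on the hypersphere and generate virtual outliers by directional sampling in a low-likelihood annulus of the feature space; also designs a Dual-Granularity Semantic Loss (DGS) that uses contrastive learning to maximize the distinction between ID features and the synthesized boundary outliers.
Result: On benchmarks such as CIFAR-LT, the method outperforms SOTA methods that use real external images.
Conclusion: The proposed data-free framework (GOS with DGS) removes the dependence of long-tailed OOD detection on real external data, maintaining or improving detection performance while being more practical and privacy-friendly.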
Abstract: Out-of-Distribution (OOD) detection under long-tailed distributions is a highly challenging task because the scarcity of samples in tail classes leads to blurred decision boundaries in the feature space. Current state-of-the-art (SOTA) methods typically employ Outlier Exposure (OE) strategies, relying on large-scale real external datasets (such as 80 Million Tiny Images) to regularize the feature space. However, this dependence on external data often becomes infeasible in practical deployment due to high data acquisition costs and privacy sensitivity. To this end, we propose a novel data-free framework aimed at completely eliminating reliance on external datasets while maintaining superior detection performance. We introduce a Geometry-guided virtual Outlier Synthesis (GOS) strategy that models statistical properties using the von Mises-Fisher (vMF) distribution on a hypersphere. Specifically, we locate a low-likelihood annulus in the feature space and perform directional sampling of virtual outliers in this region. Simultaneously, we introduce a new Dual-Granularity Semantic Loss (DGS) that utilizes contrastive learning to maximize the distinction between in-distribution (ID) features and these synthesized boundary outliers. Extensive experiments on benchmarks such as CIFAR-LT demonstrate that our method outperforms SOTA approaches that utilize external real images.
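A rough geometric sketch of sampling in a low-likelihood annulus follows. Since the vMF density on the unit hypersphere depends only on the cosine between a point and the mean direction, a band of cosine values is also a band of likelihood. The cosine bounds, the way they would be chosen, and the function name `sample_annulus` are assumptions for illustration; the abstract does not describe how GOS locates the annulus.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def sample_annulus(mu, cos_lo, cos_hi, rng):
    """Draw a unit vector whose cosine to the unit mean direction mu
    lies in [cos_lo, cos_hi]."""
    t = rng.uniform(cos_lo, cos_hi)            # target cosine to mu
    g = [rng.gauss(0.0, 1.0) for _ in mu]      # random direction
    proj = dot(g, mu)
    tangent = normalize([a - proj * m for a, m in zip(g, mu)])  # orthogonal to mu
    s = math.sqrt(1.0 - t * t)
    return [t * m + s * u for m, u in zip(mu, tangent)]

rng = random.Random(0)
mu = normalize([1.0, 2.0, 2.0])                # class mean direction (toy value)
outliers = [sample_annulus(mu, 0.2, 0.5, rng) for _ in range(5)]
```

Each sample is built from a component along `mu` and a component in the tangent space, so it stays on the unit sphere and its cosine to `mu` equals the drawn value `t`. In practice the band would presumably sit just outside the region occupied by in-distribution features of the class.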
### [100] [Disco: Densely-overlapping Cell Instance Segmentation via Adjacency-aware Collaborative Coloring](https://arxiv.org/abs/2602.05420)
*Rui Sun,Yiwen Yang,Kaiyu Guo,Chen Jiang,Dongli Xu,Zhaonan Liu,Tan Pan,Limei Han,Xue Jiang,Wu Wei,Yuan Cheng*
Main category: cs.CV
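TL;DR: This paper proposes the Disco framework, which applies graph coloring to instance segmentation of densely overlapping cells, and releases the large-scale GBC-FS 2025 dataset. It finds that real cell graphs mostly contain odd cycles and are non-bipartite, so plain 2-coloring is insufficient, while more colors bring redundancy that should be avoided. The method has two stages: Explicit Marking (recursive graph decomposition plus conflict set identification) and Implicit Disambiguation (a feature dissimilarity constraint).
Details
Motivation: Existing methods based on contours or distance maps struggle with complex, dense cellular regions; graph coloring methods are promising, but their effectiveness in real scenarios with dense overlaps and complex topology has not been verified.
Method: Proposes the Disco framework: 1) Explicit Marking recursively decomposes the cell adjacency graph and identifies a conflict set, turning the topological problem into a classification task; 2) Implicit Disambiguation constrains the features of different instances to be dissimilar so that separable feature representations are learned. The paper also conducts the first systematic analysis of the chromatic properties of cell graphs, showing that real cell graphs are generally non-bipartite and rich in odd cycles such as triangles.
Result: Experiments on four datasets, including GBC-FS 2025, confirm the high odd-cycle rate and non-bipartiteness of cell adjacency graphs; Disco significantly outperforms existing SOTA methods on dense cell segmentation and is especially robust in overlapping and topologically complex regions.
Conclusion: Cell instance segmentation cannot rely on 2-coloring theory alone; it needs data-driven topological modeling combined with constrained learning, so that coloring stays correct while the redundancy and optimization difficulty of a high chromatic number are avoided. Disco offers a new approach and benchmark resources for applying the graph coloring paradigm to pathology image analysis.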
Abstract: Accurate cell instance segmentation is foundational for digital pathology analysis. Existing methods based on contour detection and distance mapping still face significant challenges in processing complex and dense cellular regions. Graph coloring-based methods provide a new paradigm for this task, yet the effectiveness of this paradigm in real-world scenarios with dense overlaps and complex topologies has not been verified. Addressing this issue, we release a large-scale dataset, GBC-FS 2025, which contains highly complex and dense sub-cellular nuclear arrangements. We conduct the first systematic analysis of the chromatic properties of cell adjacency graphs across four diverse datasets and reveal an important discovery: most real-world cell graphs are non-bipartite, with a high prevalence of odd-length cycles (predominantly triangles). This makes simple 2-coloring theory insufficient for handling complex tissues, while higher-chromaticity models would cause representational redundancy and optimization difficulties. Building on this observation of complex real-world contexts, we propose Disco (Densely-overlapping Cell Instance Segmentation via Adjacency-aware COllaborative Coloring), an adjacency-aware framework based on the "divide and conquer" principle. It uniquely combines a data-driven topological labeling strategy with a constrained deep learning system to resolve complex adjacency conflicts. First, an "Explicit Marking" strategy transforms the topological challenge into a learnable classification task by recursively decomposing the cell graph and isolating a "conflict set." Second, an "Implicit Disambiguation" mechanism resolves ambiguities in conflict regions by enforcing feature dissimilarity between different instances, enabling the model to learn separable feature representations.
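The non-bipartite finding can be checked on any cell adjacency graph with a standard breadth-first 2-coloring test, sketched below. This is textbook graph code rather than part of Disco; the toy graphs and the name `two_color` are illustrative only.

```python
from collections import deque

def two_color(adj):
    """Return a dict mapping each node to color 0 or 1 if the graph is
    bipartite, or None if an odd cycle makes 2-coloring impossible."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None   # two touching cells would share a color
    return color

# Three cells in a row can be 2-colored.
chain = {0: [1], 1: [0, 2], 2: [1]}
# Three mutually touching cells form a triangle, the odd cycle the abstract highlights.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

chain_coloring = two_color(chain)
triangle_coloring = two_color(triangle)
```

The chain receives alternating colors, while the triangle returns `None`. A single triangle of mutually adjacent cells is enough to defeat a 2-color scheme, which is why the abstract treats the prevalence of triangles as decisive.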
### [101] [NeVStereo: A NeRF-Driven NVS-Stereo Architecture for High-Fidelity 3D Tasks](https://arxiv.org/abs/2602.05423)
*Pengcheng Chen,Yue Hu,Wenhao Li,Nicole M Gunderson,Andrew Feng,Zhenglong Sun,Peter Beerel,Eric J Seibel*
Main category: cs.CV
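TL;DR: NeVStereo is a NeRF-driven multi-view stereo architecture that jointly estimates camera poses, depth, novel view synthesis (NVS), and surface reconstruction from RGB-only input, improving geometric consistency and rendering quality.
Details
Motivation: Existing methods struggle to deliver accurate pose estimation, reliable depth, high-quality novel view synthesis, and accurate 3D surface reconstruction within a single framework; end-to-end matching methods do not explicitly support NVS, while NeRF methods are sensitive to pose errors and lack joint optimization.
Method: NeVStereo combines NeRF-based NVS adapted to stereo matching, confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment (BA) for pose refinement, and an iterative scheme that refines depth and the radiance field together.
Result: Strong zero-shot performance across benchmarks from multiple scene types: up to 36% lower depth error, 10.4% better pose accuracy, 4.5% higher NVS fidelity, and state-of-the-art mesh quality (F1 91.93%, Chamfer distance 4.35 mm).
Conclusion: NeVStereo mitigates common NeRF issues such as surface stacking, artifacts, and pose-depth coupling, providing a unified, robust, high-fidelity solution for multi-view 3D reconstruction.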
Abstract: In modern dense 3D reconstruction, feed-forward systems (e.g., VGGT, pi3) focus on end-to-end matching and geometry prediction but do not explicitly provide novel view synthesis (NVS). Neural rendering-based approaches offer high-fidelity NVS and detailed geometry from posed images, yet they typically assume fixed camera poses and can be sensitive to pose errors. As a result, it remains non-trivial to obtain a single framework that can offer accurate poses, reliable depth, high-quality rendering, and accurate 3D surfaces from casually captured views. We present NeVStereo, a NeRF-driven NVS-stereo architecture that aims to jointly deliver camera poses, multi-view depth, novel view synthesis, and surface reconstruction from multi-view RGB-only inputs. NeVStereo combines NeRF-based NVS for stereo-friendly renderings, confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment for pose refinement, and an iterative refinement stage that updates both depth and the radiance field to improve geometric consistency. This design mitigates common NeRF-based issues such as surface stacking, artifacts, and pose-depth coupling. Across indoor, outdoor, tabletop, and aerial benchmarks, our experiments indicate that NeVStereo achieves consistently strong zero-shot performance, with up to 36% lower depth error, 10.4% improved pose accuracy, 4.5% higher NVS fidelity, and state-of-the-art mesh quality (F1 91.93%, Chamfer 4.35 mm) compared to existing leading methods.
### [102] [Multi-AD: Cross-Domain Unsupervised Anomaly Detection for Medical and Industrial Applications](https://arxiv.org/abs/2602.05426)
*Wahyu Rahmaniar,Kenji Suzuki*
Main category: cs.CV
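TL;DR: This paper proposes Multi-AD, a CNN model for unsupervised anomaly detection that combines SE attention, knowledge distillation, and a discriminator network to achieve robust cross-domain anomaly detection on medical and industrial images, reaching SOTA performance on multiple datasets.
Details
Motivation: Deep learning models in cross-domain applications (such as early disease diagnosis in medicine and defect detection in industry) often face a shortage of annotated data, a problem that is especially acute for anomaly detection.
Method: Multi-AD: 1) a Squeeze-and-Excitation (SE) block strengthens channel-wise feature attention; 2) knowledge distillation (KD) transfers discriminative features from the teacher model to the student model; 3) a discriminator network sharpens the separation between normal and anomalous data; 4) multi-scale feature fusion detects anomalies of different sizes; 5) the teacher-student (T-S) architecture keeps high-dimensional features consistent while adapting them to anomaly detection.
Result: Evaluated on medical (brain MRI, liver CT, retina OCT) and industrial (MVTec AD) datasets, with average image-level AUROC of 81.4% (medical) and 99.6% (industrial) and pixel-level AUROC of 97.0% (medical) and 98.4% (industrial), outperforming existing SOTA methods.
Conclusion: Multi-AD is a general, robust, high-performing framework for cross-domain unsupervised anomaly detection with good potential for practical use.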
Abstract: Traditional deep learning models often lack annotated data, especially in cross-domain applications such as anomaly detection, which is critical for early disease diagnosis in medicine and defect detection in industry. To address this challenge, we propose Multi-AD, a convolutional neural network (CNN) model for robust unsupervised anomaly detection across medical and industrial images. Our approach employs the squeeze-and-excitation (SE) block to enhance feature extraction via channel-wise attention, enabling the model to focus on the most relevant features and detect subtle anomalies. Knowledge distillation (KD) transfers informative features from the teacher to the student model, enabling effective learning of the differences between normal and anomalous data. Then, the discriminator network further enhances the model's capacity to distinguish between normal and anomalous data. At the inference stage, by integrating multi-scale features, the student model can detect anomalies of varying sizes. The teacher-student (T-S) architecture ensures consistent representation of high-dimensional features while adapting them to enhance anomaly detection. Multi-AD was evaluated on several medical datasets, including brain MRI, liver CT, and retina OCT, as well as industrial datasets, such as MVTec AD, demonstrating strong generalization across multiple domains. Experimental results demonstrated that our approach consistently outperformed state-of-the-art models, achieving the best average AUROC for both image-level (81.4% for medical and 99.6% for industrial) and pixel-level (97.0% for medical and 98.4% for industrial) tasks, making it effective for real-world applications.
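The channel-wise attention of a squeeze-and-excitation block can be sketched in a few lines of NumPy. This is a generic SE block under simplifying assumptions (no bias terms, a single feature map, hypothetical weight shapes and reduction ratio), not the Multi-AD implementation.

```python
import numpy as np


def squeeze_excite(features, w_reduce, w_expand):
    """Generic SE block on one feature map.

    features: (C, H, W) activations
    w_reduce: (C // r, C) bottleneck weights (r is an assumed reduction ratio)
    w_expand: (C, C // r) expansion weights
    Returns the rescaled features and the per-channel gates.
    """
    squeezed = features.mean(axis=(1, 2))                # squeeze: global average pool, (C,)
    hidden = np.maximum(w_reduce @ squeezed, 0.0)        # excitation, step 1: ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w_expand @ hidden)))   # excitation, step 2: sigmoid gates in (0, 1)
    return features * gates[:, None, None], gates        # scale every channel by its gate


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 4))   # 8 channels, 4x4 spatial grid
w_reduce = rng.standard_normal((2, 8))      # reduction ratio r = 4
w_expand = rng.standard_normal((8, 2))
scaled, gates = squeeze_excite(features, w_reduce, w_expand)
print(gates.round(3))
```

Channels whose gate is close to 1 pass through almost unchanged, while channels with a gate near 0 are suppressed, which is how the block lets the network emphasize the most relevant features.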
### [103] [LD-SLRO: Latent Diffusion Structured Light for 3-D Reconstruction of Highly Reflective Objects](https://arxiv.org/abs/2602.05434)
*Sanghoon Jeon,Gihyun Jung,Suhyeon Ka,Jae-Sang Hyun*
Main category: cs.CV
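TL;DR: This paper proposes LD-SLRO, a structured-light method based on a latent diffusion model that improves fringe image quality on highly reflective, low-roughness surfaces and thereby improves 3-D reconstruction accuracy.
Details
Motivation: In fringe projection 3-D reconstruction, highly reflective, low-roughness surfaces are prone to specular reflection and indirect illumination, which distort or erase the fringes and degrade reconstruction accuracy.
Method: LD-SLRO first encodes the phase-shifted fringe images to extract latent features that characterize surface reflectance; these features then condition a latent diffusion model that probabilistically suppresses reflection artifacts and recovers missing fringes; a specular reflection encoder, a time-variant channel affine layer, and attention modules further improve restoration.
Result: Experiments show that the method improves fringe image quality and 3-D reconstruction accuracy, reducing the average root-mean-squared error from 1.8176 mm to 0.9619 mm.
Conclusion: LD-SLRO mitigates the fringe distortion caused by highly reflective surfaces and offers a new approach to 3-D measurement of complex surfaces, with high flexibility and state-of-the-art performance.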
Abstract: Fringe projection profilometry-based 3-D reconstruction of objects with high reflectivity and low surface roughness remains a significant challenge. When measuring such glossy surfaces, specular reflection and indirect illumination often lead to severe distortion or loss of the projected fringe patterns. To address these issues, we propose a latent diffusion-based structured light method for reflective objects (LD-SLRO). Phase-shifted fringe images captured from highly reflective surfaces are first encoded to extract latent representations that capture surface reflectance characteristics. These latent features are then used as conditional inputs to a latent diffusion model, which probabilistically suppresses reflection-induced artifacts and recovers lost fringe information, yielding high-quality fringe images. The proposed components, including the specular reflection encoder, time-variant channel affine layer, and attention modules, further improve fringe restoration quality. In addition, LD-SLRO provides high flexibility in configuring the input and output fringe sets. Experimental results demonstrate that the proposed method improves both fringe quality and 3-D reconstruction accuracy over state-of-the-art methods, reducing the average root-mean-squared error from 1.8176 mm to 0.9619 mm.
### [104] [Stable Velocity: A Variance Perspective on Flow Matching](https://arxiv.org/abs/2602.05435)
*Donglin Yang,Yongxing Zhang,Xin Yu,Liang Hou,Xin Tao,Pengfei Wan,Xiaojuan Qi,Renjie Liao*
Main category: cs.CV
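TL;DR: This paper proposes the Stable Velocity framework. By analyzing the high variance caused by single-sample conditional velocities in flow matching, it designs a variance-reducing training objective (StableVM) and adaptive auxiliary supervision (VA-REPA), and at inference exploits simplified dynamics in the low-variance regime for finetuning-free accelerated sampling (StableVS). Experiments on several image and video generation models show more efficient training and more than 2× faster sampling without loss of quality.
Details
Motivation: Because flow matching relies on single-sample conditional velocities, its training targets have high variance, which destabilizes optimization and slows convergence; a hard high-variance regime forms near the prior distribution in particular.
Method: 1) Theoretically characterize the variance of the conditional velocity and identify high- and low-variance regimes; 2) propose StableVM, an unbiased variance-reduction training objective; 3) design VA-REPA, which adaptively strengthens auxiliary supervision in the low-variance regime; 4) derive closed-form simplifications of the dynamics in the low-variance regime and build StableVS, a finetuning-free sampling method.
Result: On ImageNet 256×256 and large models including SD3.5, Flux, Qwen-Image, and Wan2.2, training efficiency improves, sampling within the low-variance regime is more than 2× faster, and sample quality does not degrade.
Conclusion: Through variance modeling, Stable Velocity jointly improves the training stability and sampling efficiency of flow matching, offering a scalable, plug-and-play optimization paradigm for high-dimensional generative modeling.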
Abstract: While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthens auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet $256\times256$ and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than $2\times$ faster sampling within the low-variance regime without degrading sample quality. Our code is available at https://github.com/linYDTHU/StableVelocity.
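The two variance regimes can be reproduced in a toy setting. The sketch below assumes a linear path $x_t = (1-t)\,x_0 + t\,x_1$ with a standard normal prior for $x_0$ and a hypothetical two-point data distribution for $x_1$; none of these choices come from the abstract, and this is not the StableVM objective. Under these assumptions the conditional velocity is $v = x_1 - x_0 = (x_1 - x_t)/(1-t)$, and its variance given $x_t$ follows from the posterior over $x_1$.

```python
import math

DATA_POINTS = (-1.0, 1.0)  # hypothetical two-point data distribution, equal prior weight


def conditional_velocity_variance(x_t, t):
    """Variance of v = x1 - x0 given x_t, for x_t = (1 - t) * x0 + t * x1 and x0 ~ N(0, 1)."""
    sigma = 1.0 - t  # standard deviation of x_t around t * x1
    # Log-likelihood of x_t under each candidate data point (shared constants dropped).
    log_weights = [-((x_t - t * x1) ** 2) / (2.0 * sigma ** 2) for x1 in DATA_POINTS]
    peak = max(log_weights)
    weights = [math.exp(lw - peak) for lw in log_weights]
    total = sum(weights)
    posterior = [w / total for w in weights]
    # Conditional velocity implied by each candidate data point.
    velocities = [(x1 - x_t) / sigma for x1 in DATA_POINTS]
    # The posterior mean of the conditional velocity is the marginal velocity.
    marginal = sum(p * v for p, v in zip(posterior, velocities))
    return sum(p * (v - marginal) ** 2 for p, v in zip(posterior, velocities))


near_prior = conditional_velocity_variance(x_t=0.0, t=0.1)
near_data = conditional_velocity_variance(x_t=0.9, t=0.9)
print(near_prior, near_data)
```

Near the prior, $x_t$ says almost nothing about which data point the path ends at, so the two candidate velocities are equally likely and the variance is large. Near the data, the posterior collapses onto one data point, so the conditional and marginal velocities nearly coincide.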
### [105] [Synthetic Defect Geometries of Cast Metal Objects Modeled via 2d Voronoi Tessellations](https://arxiv.org/abs/2602.05440)
*Natascha Jeziorski,Petra Gospodnetić,Claudia Redenbach*
Main category: cs.CV
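TL;DR: This paper proposes a synthetic data generation method that combines parametric 3D defect modeling with physically based simulation to train defect detection for automated non-destructive testing, supporting pixel-accurate annotation and controlled generation of rare defects.
Details
Motivation: Defect detection is crucial to industrial quality control, but real defect data are scarce and costly to annotate; improving machine learning models requires large amounts of high-quality, diverse, annotated training data.
Method: Parametric 3D defect models (for defects common in casting and similar processes) are built so that they can be embedded in the original workpiece mesh to form a digital twin with defects; physically based Monte Carlo simulation (for example of visual surface inspection) then produces realistic synthetic data together with pixel-accurate annotations.
Result: Variable synthetic defect datasets of arbitrary size can be generated, with enough samples of rare defects in particular; annotation needs no manual work, and the method extends to other non-destructive testing techniques (such as X-ray and ultrasound).
Conclusion: The method provides a controllable, scalable, high-fidelity framework for generating synthetic data for automated defect detection, easing the real-data bottleneck and improving model generalization and robustness.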
Abstract: In industry, defect detection is crucial for quality control. Non-destructive testing (NDT) methods are preferred as they do not influence the functionality of the object while inspecting. Automated data evaluation for automated defect detection is a growing field of research. In particular, machine learning approaches show promising results. To provide training data in sufficient amount and quality, synthetic data can be used. Rule-based approaches enable synthetic data generation in a controllable environment. Therefore, a digital twin of the inspected object including synthetic defects is needed. We present parametric methods to model 3d mesh objects of various defect types that can then be added to the object geometry to obtain synthetic defective objects. The models are motivated by common defects in metal casting but can be transferred to other machining procedures that produce similar defect shapes. Synthetic data resembling the real inspection data can then be created by using a physically based Monte Carlo simulation of the respective testing method. Using our defect models, a variable and arbitrarily large synthetic data set can be generated with the possibility to include rarely occurring defects in sufficient quantity. Pixel-perfect annotation can be created in parallel. As an example, we will use visual surface inspection, but the procedure can be applied in combination with simulations for any other NDT method.
### [106] [DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching](https://arxiv.org/abs/2602.05449)
*Chang Zou,Changlin Li,Yang Li,Patrol Li,Jianbing Wu,Xiao He,Songtao Liu,Zhao Zhong,Kailin Huang,Linfeng Zhang*
Main category: cs.CV
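TL;DR: This paper proposes a new learnable feature caching mechanism, combined with a conservative Restricted MeanFlow distillation method, that speeds up inference of video diffusion models by up to 11.8× while preserving generation quality.
Details
Motivation: Existing acceleration methods for video diffusion models (such as training-free feature caching and step distillation) lose semantics and detail or degrade severely at high compression ratios, and combining the two works even worse when sampling steps are sparse.
Method: A lightweight learnable neural predictor replaces the traditional training-free heuristics for feature caching; a conservative Restricted MeanFlow distillation strategy makes distillation of large-scale video models more stable under heavy compression.
Result: Achieves 11.8× acceleration while preserving video generation quality; effectiveness is validated on multiple benchmarks.
Conclusion: Combining learnable feature caching with a conservative distillation strategy offers a new paradigm for efficient, high-quality video diffusion models.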
Abstract: While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably suffers semantic and detail loss under further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. With these contributions, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
### [107] [Attention Retention for Continual Learning with Vision Transformers](https://arxiv.org/abs/2602.05454)
*Yue Lu,Xiangyu Zhou,Shizhou Zhang,Yinghui Xing,Guoqiang Liang,Wencong Zhang*
Main category: cs.CV
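TL;DR: This paper proposes an attention-retention framework for continual learning that mitigates catastrophic forgetting by constraining attention drift in Vision Transformers during backpropagation.
Details
Motivation: Attention drift in Vision Transformers is identified as a primary cause of catastrophic forgetting in continual learning, and a mitigation inspired by the selective attention mechanism of the human visual system is proposed.
Method: A two-step attention-retention framework: 1) extract attention maps of the previous task with a layer-wise rollout mechanism and generate instance-adaptive binary masks; 2) when learning a new task, use the masks to zero out gradients associated with previous attention regions, and scale parameter updates proportionally for compatibility with modern optimizers.
Result: Experiments and visualizations confirm that the method mitigates catastrophic forgetting and preserves visual concepts, reaching SOTA performance and generalizing robustly across diverse continual learning scenarios.
Conclusion: Attention drift is a key factor in catastrophic forgetting, and the proposed attention-retention mechanism improves the stability and generalization of continual learning models.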
Abstract: Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, catastrophic forgetting remains a critical challenge. In this work, we identify attention drift in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into the selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.
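The masking step of the two-step process can be sketched as follows. The sketch assumes the attention map has already been reduced to one score per image patch and that the gradient is patch-aligned with shape `(patches, dim)`; the quantile threshold, the function names, and the shapes are hypothetical, and the rollout and update-scaling steps are not shown.

```python
import numpy as np


def binary_attention_mask(attention, keep_fraction=0.25):
    """Mark the most-attended patches of one instance (1 = protected region).

    The threshold is a per-instance quantile, so the mask adapts to each image.
    """
    threshold = np.quantile(attention, 1.0 - keep_fraction)
    return (attention >= threshold).astype(attention.dtype)


def mask_gradient(gradient, mask):
    """Zero the gradient rows that belong to protected patches."""
    return gradient * (1.0 - mask)[:, None]


# One attention score per patch from the previous task (toy values).
attention = np.array([0.05, 0.40, 0.10, 0.45])
mask = binary_attention_mask(attention, keep_fraction=0.5)

# Gradient of the new-task loss with respect to patch features (toy values).
gradient = np.ones((4, 3))
masked = mask_gradient(gradient, mask)
print(mask)
print(masked)
```

Patches that the previous task attended to receive no update from the new task, while the remaining patches keep their full gradient.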
### [108] [MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation](https://arxiv.org/abs/2602.05467)
*Dekang Qi,Shuang Zeng,Xinyuan Chang,Feng Xiong,Shichao Xie,Xiaolong Wu,Mu Xu*
Main category: cs.CV
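TL;DR: This paper proposes a Memory-Execute-Review framework that improves both success rate (SR) and generalization in Visual Language Navigation (VLN), combining the strengths of Supervised Fine-Tuning (SFT) and Training-Free (TF) methods.
Details
Motivation: Existing VLN methods struggle to deliver both success rate (SR) and generalization: SFT methods typically achieve higher SR but generalize worse, while TF methods generalize better but have lower SR.
Method: A three-module Memory-Execute-Review framework: a hierarchical memory module provides information support, an execute module handles routine decision-making and actions, and a review module handles abnormal situations and corrects behavior; validated on the Object Goal Navigation task.
Result: Across 4 datasets, average SR improves by 7% and 5% (absolute) under the TF and Zero-Shot (ZS) settings, respectively; under the ZS setting, SR improves by 8% on HM3D_v0.1 and 6% on HM3D_OVON; on MP3D and HM3D_OVON the method surpasses all TF and SFT methods, leading in SR by 5% and 2%, respectively.
Conclusion: The framework balances success rate and generalization and leads on multiple benchmarks, offering a new paradigm for VLN.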
Abstract: Visual Language Navigation (VLN) is one of the fundamental capabilities for embodied intelligence and a critical challenge that urgently needs to be addressed. However, existing methods are still unsatisfactory in terms of both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, but it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework. It consists of three parts: a hierarchical memory module for providing information support, an execute module for routine decision-making and actions, and a review module for handling abnormal situations and correcting behavior. We validated the effectiveness of this framework on the Object Goal Navigation task. Across 4 datasets, our average SR achieved absolute improvements of 7% and 5% compared to all baseline methods under TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and the more challenging open vocabulary dataset HM3D_OVON, the SR improved by 8% and 6%, under ZS settings. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperformed all TF methods but also surpassed all SFT methods, achieving comprehensive leadership in both SR (5% and 2%) and generalization.
### [109] [SOMA-1M: A Large-Scale SAR-Optical Multi-resolution Alignment Dataset for Multi-Task Remote Sensing](https://arxiv.org/abs/2602.05480)
*Peihao Wu,Yongxiang Yao,Yi Wan,Wenfei Zhang,Ruipeng Zhao,Jiayuan Li,Yongjun Zhang*
Main category: cs.CV
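TL;DR: This paper presents SOMA-1M, a dataset of over 1.3 million precisely pixel-aligned SAR and optical remote sensing image pairs covering multiple spatial resolutions (0.5 m to 10 m) and typical land cover types worldwide. It supports image matching, fusion, cloud removal, and cross-modal translation, and is shown to improve the performance of multimodal remote sensing algorithms.
Details
Motivation: Existing SAR-optical benchmark datasets suffer from single spatial resolution, insufficient scale, and low alignment accuracy, making them inadequate for training and generalizing multi-scale foundation models.
Method: SOMA-1M integrates imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth; a coarse-to-fine image matching framework ensures pixel-level alignment; and a comprehensive evaluation benchmark covers four vision tasks.
Result: Supervised training on SOMA-1M significantly improves performance on all tasks, with multimodal remote sensing image matching in particular reaching state-of-the-art (SOTA) levels.
Conclusion: SOMA-1M is a foundational resource for robust multimodal remote sensing algorithms and remote sensing foundation models, and will be released publicly.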
Abstract: Synthetic Aperture Radar (SAR) and optical imagery provide complementary strengths that constitute the critical foundation for transcending single-modality constraints and facilitating cross-modal collaborative processing and intelligent interpretation. However, existing benchmark datasets often suffer from limitations such as single spatial resolution, insufficient data scale, and low alignment accuracy, making them inadequate for supporting the training and generalization of multi-scale foundation models. To address these challenges, we introduce SOMA-1M (SAR-Optical Multi-resolution Alignment), a pixel-level precisely aligned dataset containing over 1.3 million pairs of georeferenced images with a specification of 512 x 512 pixels. This dataset integrates imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth, achieving global multi-scale coverage from 0.5 m to 10 m. It encompasses 12 typical land cover categories, effectively ensuring scene diversity and complexity. To address multimodal projection deformation and massive data registration, we designed a rigorous coarse-to-fine image matching framework ensuring pixel-level alignment. Based on this dataset, we established comprehensive evaluation benchmarks for four hierarchical vision tasks, including image matching, image fusion, SAR-assisted cloud removal, and cross-modal translation, involving over 30 mainstream algorithms. Experimental results demonstrate that supervised training on SOMA-1M significantly enhances performance across all tasks. Notably, multimodal remote sensing image (MRSI) matching performance achieves current state-of-the-art (SOTA) levels. SOMA-1M serves as a foundational resource for robust multimodal algorithms and remote sensing foundation models. The dataset will be released publicly at: https://github.com/PeihaoWu/SOMA-1M.
### [110] [Feature points evaluation on omnidirectional vision with a photorealistic fisheye sequence -- A report on experiments done in 2014](https://arxiv.org/abs/2602.05487)
*Julien Moreau,S. Ambellouis,Yassine Ruichek*
Main category: cs.CV
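TL;DR: This report presents previously unpublished PhD work that sought the best feature detector and descriptor for fisheye images, in support of self-calibration, visual odometry, and stereovision in urban environments. It provides the PFSeq (Photorealistic Fisheye Sequence) dataset, a detailed bibliography, and experimental results, but proposes no new algorithm, does not compare against algorithms designed for omnidirectional images, and has not been peer-reviewed.
Details
Motivation: Fisheye self-calibration poses a chicken-and-egg problem: without an accurate projection model, feature extraction cannot be optimized, yet good features are needed to estimate that model. The application is localization in urban environments with car-mounted fisheye cameras aimed at the zenith.
Method: Existing general-purpose feature detectors and descriptors are systematically evaluated on fisheye images through comprehensive experiments driven by the needs of self-calibration, and the PFSeq photorealistic fisheye image sequence dataset is built and released.
Result: Performance baselines are obtained for different feature methods on fisheye images, identifying feature combinations that are relatively better suited to this setting without establishing a single best choice; the PFSeq dataset is released together with the full experimental setup and bibliography.
Conclusion: Without an accurate projection model, some conventional feature methods remain reasonably robust, but overall performance is limited; the report underlines the need for feature methods designed for fisheye images and for a standard evaluation protocol. As a 2014 technical report it has historical reference value but should be read alongside more recent work.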
Abstract: What is this report: This is a scientific report, contributing a detailed bibliography, a dataset which we will now call PFSeq for ''Photorealistic Fisheye Sequence'' and make available at https://doi.org/10.57745/DYIVVU, and comprehensive experiments. This work should be considered a draft, and was done during my PhD thesis ''Construction of 3D models from fisheye video data-Application to the localisation in urban area'' in 2014 [Mor16]. These results have never been published. The aim was to find the best feature detector and descriptor for fisheye images, in the context of self-calibration, with cameras mounted on the top of a car and aiming at the zenith (to then perform fisheye visual odometry and stereovision in urban scenes). We face a chicken-and-egg problem, because we cannot take advantage of an accurate projection model for optimal feature detection and description, and we need good features precisely to perform the calibration (i.e. to compute the accurate projection model of the camera). What is not this report: It does not contribute a new feature algorithm. It does not compare standard feature algorithms to algorithms designed for omnidirectional images (unfortunately). It has not been peer-reviewed. Discussions have been translated and enhanced but the experiments have not been run again and the report has not been updated according to the evolution of the state of the art (read this as a 2014 report).
### [111] [VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency](https://arxiv.org/abs/2602.05508)
*Zhuang Xiong,Chen Zhang,Qingshan Xu,Wenbing Tao*
Main category: cs.CV
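TL;DR: This paper proposes VGGT-Motion, a calibration-free monocular SLAM system that uses motion-aware submap construction, anchor-driven direct Sim(3) registration, and lightweight submap-level pose graph optimization to achieve efficient, robust global consistency over kilometer-scale trajectories, improving the accuracy and efficiency of zero-shot SLAM on long sequences.
Details
Motivation: Existing calibration-free monocular SLAM methods suffer severe scale drift on long sequences; motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift; conventional geometric alignment is computationally expensive.
Method: 1) Motion-aware submap construction: optical flow guides adaptive partitioning, prunes static redundancy, and encapsulates turns for stable local geometry; 2) anchor-driven direct Sim(3) registration: context-balanced anchors enable search-free, pixel-wise dense alignment and efficient loop closure; 3) lightweight submap-level pose graph optimization: linear complexity, enforcing global consistency and scalability.
Result: In zero-shot, long-range, calibration-free monocular SLAM, VGGT-Motion markedly improves trajectory accuracy and runtime efficiency, achieving state-of-the-art performance.
Conclusion: VGGT-Motion mitigates scale drift on long sequences while balancing robustness, efficiency, and global consistency, offering a practical new paradigm for calibration-free monocular SLAM.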
Abstract: Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
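For readers unfamiliar with Sim(3) registration, the sketch below shows the standard closed-form least-squares estimate of scale, rotation, and translation from known 3D point correspondences (the Umeyama method). It only illustrates what aligning two submaps under a similarity transform means; the abstract does not detail the anchor-driven procedure, and this is not the VGGT-Motion implementation. The function name and test data are hypothetical.

```python
import numpy as np


def estimate_sim3(source, target):
    """Least-squares similarity transform with target ≈ s * R @ source + t.

    source, target: (N, 3) arrays of corresponding 3D points.
    Returns scale s, rotation R of shape (3, 3), and translation t of shape (3,).
    """
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    src, tgt = source - mu_s, target - mu_t
    covariance = tgt.T @ src / len(source)           # cross-covariance, (3, 3)
    U, singular_values, Vt = np.linalg.svd(covariance)
    sign = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:     # guard against a reflection
        sign[-1] = -1.0
    R = U @ np.diag(sign) @ Vt
    variance = (src ** 2).sum() / len(source)        # mean squared spread of the source points
    s = (singular_values * sign).sum() / variance
    t = mu_t - s * R @ mu_s
    return s, R, t


# Synthetic check: transform random points with a known Sim(3) and recover it.
rng = np.random.default_rng(1)
source = rng.standard_normal((10, 3))
angle = 0.3
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([1.0, -2.0, 0.5])
target = 2.5 * source @ R_true.T + t_true

s, R, t = estimate_sim3(source, target)
print(round(s, 6))
```

The scale term is what distinguishes Sim(3) from a rigid SE(3) alignment, and it is the quantity that drifts on long monocular sequences.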
### [112] [Mapper-GIN: Lightweight Structural Graph Abstraction for Corrupted 3D Point Cloud Classification](https://arxiv.org/abs/2602.05522)
*Jeongbin You,Donggun Kim,Sejun Park,Seungsang Oh*
Main category: cs.CV
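TL;DR: This paper proposes Mapper-GIN, a lightweight point cloud classification method based on the topological Mapper algorithm. It builds a region graph and classifies it with GIN, achieving strong robustness to noise and transformation corruptions on ModelNet40-C with only 0.5M parameters.
Details
Motivation: To examine whether structural abstraction alone, rather than a larger model or elaborate data augmentation, can improve the robustness of 3D point cloud classification.
Method: The Mapper algorithm (PCA lens, cubical cover, density-based clustering) partitions the point cloud into overlapping regions, a region graph is built from them, and a Graph Isomorphism Network (GIN) performs graph classification.
Result: On the ModelNet40-C benchmark, Mapper-GIN achieves competitive and stable accuracy under Noise and Transformation corruptions with only 0.5M parameters.
Conclusion: Region-graph structure is an efficient, interpretable source of robustness that improves 3D visual recognition without complex architectures or additional mechanisms.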
Abstract: Robust 3D point cloud classification is often pursued by scaling up backbones or relying on specialized data augmentation. We instead ask whether structural abstraction alone can improve robustness, and study a simple topology-inspired decomposition based on the Mapper algorithm. We propose Mapper-GIN, a lightweight pipeline that partitions a point cloud into overlapping regions using Mapper (PCA lens, cubical cover, followed by density-based clustering), constructs a region graph from their overlaps, and performs graph classification with a Graph Isomorphism Network. On the corruption benchmark ModelNet40-C, Mapper-GIN achieves competitive and stable accuracy under Noise and Transformation corruptions with only 0.5M parameters. In contrast to prior approaches that require heavier architectures or additional mechanisms to gain robustness, Mapper-GIN attains strong corruption robustness through simple region-level graph abstraction and GIN message passing. Overall, our results suggest that region-graph structure offers an efficient and interpretable source of robustness for 3D visual recognition.
### [113] [Generalization of Self-Supervised Vision Transformers for Protein Localization Across Microscopy Domains](https://arxiv.org/abs/2602.05527)
*Ben Isselmann,Dilara Göksu,Andreas Weinmann*
Main category: cs.CV
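TL;DR: This paper studies how well DINO-pretrained Vision Transformers (ViTs), a form of self-supervised learning (SSL), transfer across microscopy datasets, focusing on protein localization on OpenCell. The model pretrained on the Human Protein Atlas (HPA) performs best, slightly ahead of the model trained directly on OpenCell, indicating that domain-relevant SSL representations generalize to related but distinct microscopy datasets.
Details
Motivation: Task-specific microscopy datasets are usually too small to train robust deep learning models; self-supervised learning can mitigate this with large unlabeled datasets, but how well it transfers across microscopy domains with different staining protocols and channel configurations is unclear.
Method: Three DINO-pretrained ViT backbones (pretrained on ImageNet-1k, HPA, and OpenCell, respectively) produce OpenCell image embeddings, on which a supervised classification head is trained to evaluate cross-domain transfer.
Result: All pretrained models transfer well; the HPA-pretrained model achieves the highest mean macro F1-score (0.8221 ± 0.0062), slightly above the model pretrained directly on OpenCell (0.8057 ± 0.0090).
Conclusion: Large-scale, domain-relevant self-supervised pretraining improves downstream performance in microscopy image analysis when labeled data are limited, and SSL representations generalize well across domains.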
Abstract: Task-specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self-supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy-specific HPA-pretrained model achieving the best performance (mean macro $F_1$-score $= 0.8221 \pm 0.0062$), slightly outperforming a DINO model trained directly on OpenCell ($0.8057 \pm 0.0090$). These results highlight the value of large-scale pretraining and indicate that domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task-specific labeled data are limited.
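The macro $F_1$-score reported above averages the per-class $F_1$ with equal weight per class, so rare localization classes count as much as common ones. A minimal single-label sketch follows; the class names and predictions are hypothetical toy values, and the paper's exact evaluation protocol (for example, multi-label handling or how the reported mean and spread are aggregated) is not specified in the abstract.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in labels or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


y_true = ["nucleus", "nucleus", "cytoplasm", "membrane"]
y_pred = ["nucleus", "cytoplasm", "cytoplasm", "membrane"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.7778
```

Here the per-class scores are 2/3 (nucleus), 2/3 (cytoplasm), and 1 (membrane), which average to 7/9.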
### [114] [SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation](https://arxiv.org/abs/2602.05534)
*Youngwoo Shin,Jiwan Hur,Junmo Kim*
Main category: cs.CV
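TL;DR: This paper proposes SSG, a training-free inference-time guidance method that takes an information-theoretic view of the hierarchy drift that visual autoregressive (VAR) models exhibit in multi-scale generation. Together with DSE, a frequency-domain enhancement procedure for extracting the semantic residual, it improves the fidelity and diversity of generated images while keeping latency low.
Details
Motivation: At inference time, limited capacity and accumulated error cause the coarse-to-fine generation hierarchy of VAR models to drift, producing a train-inference discrepancy.
Method: Scaled Spatial Guidance (SSG) is a training-free inference-time guidance method; Discrete Spatial Enhancement (DSE) constructs and sharpens the semantic residual in the frequency domain as the target high-frequency signal; SSG applies to any VAR model based on discrete visual tokens.
Result: SSG consistently improves the fidelity and diversity of image generation across multiple VAR models while preserving low latency, revealing untapped efficiency in the coarse-to-fine generation paradigm.
Conclusion: From an information-theoretic standpoint, ensuring that each scale contributes high-frequency content not explained by earlier scales mitigates hierarchy drift in VAR models; SSG is a lightweight, general, plug-and-play inference-time strategy that improves the balance between generation quality and efficiency.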
Abstract: Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.
### [115] [A Comparative Study of 3D Person Detection: Sensor Modalities and Robustness in Diverse Indoor and Outdoor Environments](https://arxiv.org/abs/2602.05538)
*Malaz Tamim,Andrea Matic-Flierl,Karsten Roscher*
Main category: cs.CV
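TL;DR: This paper systematically evaluates 3D person detection on the JRDB dataset using camera-only, LiDAR-only, and camera-LiDAR fusion approaches. The fusion method (DAL) performs best overall but remains sensitive to sensor misalignment and certain LiDAR corruptions, while the camera-only method (BEVDepth) performs worst and is most affected by occlusion, distance, and noise.
Details
Motivation: Most existing research focuses on autonomous driving; this work extends the evaluation to diverse indoor and outdoor scenes and systematically assesses the performance and robustness of different sensor modalities for 3D person detection.
Method: Three representative models, BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion), are compared on JRDB, with analysis of their detection behavior under varying occlusion levels, distances, sensor corruptions, and calibration errors.
Result: The fusion method DAL outperforms the single-modality methods across scenarios, but is sensitive to sensor misalignment and some LiDAR corruptions; BEVDepth performs worst and is the most vulnerable to occlusion, long distances, and noise.
Conclusion: Sensor fusion improves the performance and robustness of 3D person detection, but further research is needed to reduce its sensitivity to sensor calibration errors and specific corruptions.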
Abstract: Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion. While most existing research focuses on autonomous driving, we explore detection performance and robustness in diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models - BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion) - and analyze their behavior under varying occlusion and distance levels. Our results show that the fusion-based approach consistently outperforms single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model showed the lowest performance and was most affected by occlusion, distance, and noise. Our findings highlight the importance of utilizing sensor fusion for enhanced 3D person detection, while also underscoring the need for ongoing research to address the vulnerabilities inherent in these systems.
### [116] [FastVMT: Eliminating Redundancy in Video Motion Transfer](https://arxiv.org/abs/2602.05551)
*Yue Ma,Zhikai Wang,Tianhao Ren,Mingzhe Zheng,Hongyu Liu,Jiayi Guo,Mark Fong,Yuxuan Xue,Zixiang Zhao,Konrad Schindler,Qifeng Chen,Linfeng Zhang*
Main category: cs.CV
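TL;DR: This paper proposes FastVMT, which removes motion redundancy (with local attention masking) and gradient redundancy (with gradient reuse and skipping) to accelerate video motion transfer, achieving a 3.43× speedup while preserving visual quality and temporal consistency.
Details
Motivation: Existing video motion transfer methods based on the Diffusion Transformer (DiT) are structurally inefficient: they do not exploit the facts that frame-to-frame motion is small and smooth and that gradients change slowly along the diffusion trajectory.
Method: 1) Motion redundancy: a local-neighborhood mask in the attention layers prevents unnecessary interaction between distant image regions; 2) gradient redundancy: a gradient reuse-and-skip scheme reuses gradients from previous diffusion steps and skips gradient computations where the change is small.
Result: On average, FastVMT achieves a 3.43× inference speedup across multiple benchmarks without degrading the visual fidelity or temporal consistency of the generated videos.
Conclusion: Removing motion and gradient redundancy is an effective way to make video diffusion models more efficient; the method balances speed and quality and offers a new approach to efficient video generation.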
Abstract: Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood such that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43x speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
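The local-neighborhood masking idea can be illustrated with a minimal sketch. It assumes tokens laid out on a `height x width` grid and a square window of hypothetical size `radius`; the function name `local_attention_mask` and the window shape are illustrative assumptions, not FastVMT's actual implementation.

```python
def local_attention_mask(height, width, radius):
    """Return an N x N boolean mask (N = height * width).

    mask[q][k] is True when key token k lies within `radius` grid cells
    (Chebyshev distance) of query token q, i.e. inside a square window.
    The window shape and radius are assumptions made for illustration.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [max(abs(qr - kr), abs(qc - kc)) <= radius for (kr, kc) in coords]
        for (qr, qc) in coords
    ]


# Toy example: a 4 x 4 token grid with a 3 x 3 window around each query.
mask = local_attention_mask(4, 4, 1)
kept = sum(sum(row) for row in mask)   # query-key pairs that are still computed
total = len(mask) * len(mask[0])       # pairs computed by dense attention
```

In this toy setting only `kept` of the `total` query-key pairs need attention weights, which is the kind of saving a local mask is meant to provide; the real saving depends on the window size and resolution used in practice.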
### [117] [IndustryShapes: An RGB-D Benchmark dataset for 6D object pose estimation of industrial assembly components and tools](https://arxiv.org/abs/2602.05555)
*Panagiotis Sapoutzoglou,Orestis Vaggelis,Athina Zacharia,Evangelos Sartinas,Maria Pateraki*
Main category: cs.CV
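TL;DR: IndustryShapes is a new RGB-D benchmark dataset of industrial tools and components, designed for instance-level and novel-object 6D pose estimation; it emphasizes realistic industrial assembly scenes and bridges the gap between lab research and real-world manufacturing deployment.
Details
Motivation: Existing datasets mostly focus on household objects, synthetic environments, or controlled lab scenes and lack challenging data from real industrial settings, making it hard to support the transfer of 6D pose estimation algorithms to actual manufacturing.
Method: Build an RGB-D dataset with five new types of industrial objects, covering complex scenes with single/multiple objects and single/multiple instances; it is split into a classic set (4.6k images, 6k annotated poses) and an extended set (supporting evaluation of model-free and sequence-based methods), and is the first to introduce RGB-D static onboarding sequences.
Result: A range of state-of-the-art 6D pose estimation, object detection, and segmentation methods were evaluated, showing that current methods still have substantial room for improvement on this dataset.
Conclusion: IndustryShapes provides the first benchmark close to real production lines for 6D pose estimation in industrial robotics, helping algorithms move from the lab to real industrial applications.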
Abstract: We introduce IndustryShapes, a new RGB-D benchmark dataset of industrial tools and components, designed for both instance-level and novel object 6D pose estimation approaches. The dataset provides a realistic and application-relevant testbed for benchmarking these methods in the context of industrial robotics, bridging the gap between lab-based research and deployment in real-world manufacturing scenarios. Unlike many previous datasets that focus on household or consumer products or use synthetic, clean tabletop datasets, or objects captured solely in controlled lab environments, IndustryShapes introduces five new object types with challenging properties, also captured in realistic industrial assembly settings. The dataset has diverse complexity, from simple to more challenging scenes, with single and multiple objects, including scenes with multiple instances of the same object, and it is organized in two parts: the classic set and the extended set. The classic set includes a total of 4.6k images and 6k annotated poses. The extended set introduces additional data modalities to support the evaluation of model-free and sequence-based approaches. To the best of our knowledge, IndustryShapes is the first dataset to offer RGB-D static onboarding sequences. We further evaluate the dataset on a representative set of state-of-the-art methods for instance-based and novel object 6D pose estimation, as well as object detection and segmentation, showing that there is room for improvement in this domain. The dataset page can be found at https://pose-lab.github.io/IndustryShapes.
### [118] [PIRATR: Parametric Object Inference for Robotic Applications with Transformers in 3D Point Clouds](https://arxiv.org/abs/2602.05557)
*Michael Schwingshackl,Fabio F. Oberweger,Mario Niedermeyer,Huemer Johannes,Markus Murschitz*
Main category: cs.CV
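TL;DR: PIRATR is an end-to-end 3D object detection framework for robotic applications that jointly estimates multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point clouds, achieving high-accuracy detection (mAP 0.919) on real outdoor LiDAR scenes without fine-tuning on real data.
Details
Motivation: Address the need for robots in dynamic environments to jointly estimate the geometric localization and task-relevant attributes of parametric objects (such as adjustable grippers), bridging the gap between low-level geometric reasoning and actionable world models.
Method: Extending PI3DETR, the PIRATR framework uses modular, class-specific detection heads to jointly predict 6-DoF poses and parametric attributes (such as gripper opening) directly from occluded point clouds; the model is trained entirely in a synthetic environment.
Result: Validated on an automated forklift platform for three object categories (crane grippers, loading platforms, and pallets), reaching 0.919 mAP on real outdoor LiDAR data without fine-tuning.
Conclusion: PIRATR establishes a new paradigm of pose-aware, parameterized perception that supports training in simulation and deployment in the real world, opening a path toward scalable robotic perception systems.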
Abstract: We present PIRATR, an end-to-end 3D object detection framework for robotic use cases in point clouds. Extending PI3DETR, our method streamlines parametric 3D object detection by jointly estimating multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point cloud data. This formulation enables not only geometric localization but also the estimation of task-relevant properties for parametric objects, such as a gripper's opening, where the 3D model is adjusted according to simple, predefined rules. The architecture employs modular, class-specific heads, making it straightforward to extend to novel object types without re-designing the pipeline. We validate PIRATR on an automated forklift platform, focusing on three structurally and functionally diverse categories: crane grippers, loading platforms, and pallets. Trained entirely in a synthetic environment, PIRATR generalizes effectively to real outdoor LiDAR scans, achieving a detection mAP of 0.919 without additional fine-tuning. PIRATR establishes a new paradigm of pose-aware, parameterized perception. This bridges the gap between low-level geometric reasoning and actionable world models, paving the way for scalable, simulation-trained perception systems that can be deployed in dynamic robotic environments. Code available at https://github.com/swingaxe/piratr.
### [119] [ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors](https://arxiv.org/abs/2602.05572)
*Zhenxiao Liang,Ning Zhang,Youbao Tang,Ruei-Sung Lin,Qixing Huang,Peng Chang,Jing Xiao*
Main category: cs.CV
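TL;DR: ShapeGaussian is a template-free, high-fidelity 4D human reconstruction method for monocular videos; by combining data-driven 2D vision priors with two-stage geometric modeling, it maintains high accuracy while improving robustness to pose estimation errors and occlusion.
Details
Motivation: Existing generic 4D reconstruction methods (e.g., 4DGS) lack strong vision priors and struggle with high-deformation human motion, while template-based methods (e.g., HUGS, built on SMPL) can produce high-quality results but depend heavily on accurate pose estimation and are prone to artifacts when it errs.
Method: A two-stage pipeline: first learn a coarse deformable geometry using pretrained models to obtain data-driven priors; then refine dynamic details with a neural deformation model. 2D vision priors are integrated throughout, and multiple reference frames are used to mitigate keypoint occlusion.
Result: Experiments on a variety of casual monocular videos show that ShapeGaussian outperforms template-based methods in reconstruction accuracy, visual quality, and motion robustness.
Conclusion: ShapeGaussian combines the advantages of being template-free with those of vision priors to achieve high-fidelity, robust 4D human reconstruction, offering a new paradigm for modeling complex human motion in monocular videos.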
Abstract: We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods lacking robust vision priors, such as 4DGS, struggle to capture high-deformation human motion without multi-view cues. While template-based approaches, primarily relying on SMPL, such as HUGS, can produce photorealistic results, they are highly susceptible to errors in human pose estimation, often leading to unrealistic artifacts. In contrast, ShapeGaussian effectively integrates template-free vision priors to achieve both high-fidelity and robust scene reconstructions. Our method follows a two-step pipeline: first, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Then, we refine this geometry using a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate artifacts from erroneous pose estimation in template-based methods and employ multiple reference frames to resolve the invisibility issue of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.
### [120] [Visual Implicit Geometry Transformer for Autonomous Driving](https://arxiv.org/abs/2602.05573)
*Arsenii Shirokov,Mikhail Kuznetsov,Danila Stepochkin,Egor Evdokimov,Daniil Glazkov,Nikolay Patakin,Anton Konushin,Dmitry Senushkin*
Main category: cs.CV
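TL;DR: ViGT is a visual implicit geometry transformer for autonomous driving that estimates continuous 3D occupancy fields (in BEV) from surround-view cameras in a calibration-free, self-supervised manner, supports joint training on multiple datasets, and reaches SOTA performance.
Details
Motivation: Build a scalable, architecturally simple, and well-generalizing foundational geometric model that meets autonomous driving's needs for adapting to multiple sensor configurations and for continuous BEV geometric representation.
Method: Propose the calibration-free ViGT architecture, which regresses continuous 3D occupancy fields directly from multi-view images; train in a self-supervised way on synchronized image-LiDAR pairs, avoiding manual annotation; map everything into a unified metric BEV coordinate frame to support multiple tasks.
Result: Trained on a mixture of five large-scale autonomous driving datasets (NuScenes, Waymo, etc.), ViGT achieves SOTA on pointmap estimation and performance comparable to supervised methods on Occ3D-nuScenes.
Conclusion: ViGT demonstrates the feasibility of general geometric modeling without calibration or manual annotation, offering a new paradigm for foundational geometric models in autonomous driving.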
Abstract: We introduce the Visual Implicit Geometry Transformer (ViGT), an autonomous driving geometric model that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundational models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in a bird's-eye view (BEV), addressing domain-specific requirements. ViGT naturally infers geometry from multiple camera views into a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (NuScenes, Waymo, NuPlan, ONCE, and Argoverse) and achieving state-of-the-art performance on the pointmap estimation task, with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where ViGT achieves comparable performance with supervised methods. The source code is publicly available at https://github.com/whesense/ViGT.
### [121] [A Hybrid CNN and ML Framework for Multi-modal Classification of Movement Disorders Using MRI and Brain Structural Features](https://arxiv.org/abs/2602.05574)
*Mengyu Li,Ingibjörg Kristjánsdóttir,Thilo van Eimeren,Kathrin Giehl,Lotta M. Ellingsen,the ASAP Neuroimaging Initiative*
Main category: cs.CV
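TL;DR: This paper proposes a hybrid framework combining CNNs and machine learning that uses multi-modal MRI data (T1-weighted images, segmentation masks of deep brain structures, and volumetric measurements) for differential diagnosis between atypical Parkinsonian disorder (APD) subtypes (PSP, MSA) and Parkinson's disease (PD), reaching AUCs of 0.95 (PSP vs. PD), 0.86 (MSA vs. PD), and 0.92 (PSP vs. MSA).
Details
Motivation: Early clinical presentations of APD overlap heavily with PD and are easily misdiagnosed; reliable imaging biomarkers are urgently needed for accurate early subtyping.
Method: Build a hybrid CNN-ML model whose inputs include T1-weighted MRI images, segmentation masks of 12 APD-related deep brain structures, and their volumetric measurements, fusing image features with quantitative volume features for subtype classification.
Result: On the three binary classification tasks (PSP vs. PD, MSA vs. PD, and PSP vs. MSA), the AUCs reach 0.95, 0.86, and 0.92, respectively.
Conclusion: Fusing spatial image features extracted by the CNN with volumetric structural features processed by ML can markedly improve the accuracy of differential diagnosis of APD subtypes and support early clinical intervention.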
Abstract: Atypical Parkinsonian Disorders (APD), also known as Parkinson-plus syndrome, are a group of neurodegenerative diseases that include progressive supranuclear palsy (PSP) and multiple system atrophy (MSA). In the early stages, overlapping clinical features often lead to misdiagnosis as Parkinson's disease (PD). Identifying reliable imaging biomarkers for early differential diagnosis remains a critical challenge. In this study, we propose a hybrid framework combining convolutional neural networks (CNNs) with machine learning (ML) techniques to classify APD subtypes versus PD and distinguish between the subtypes themselves: PSP vs. PD, MSA vs. PD, and PSP vs. MSA. The model leverages multi-modal input data, including T1-weighted magnetic resonance imaging (MRI), segmentation masks of 12 deep brain structures associated with APD, and their corresponding volumetric measurements. By integrating these complementary modalities, including image data, structural segmentation masks, and quantitative volume features, the hybrid approach achieved promising classification performance with area under the curve (AUC) scores of 0.95 for PSP vs. PD, 0.86 for MSA vs. PD, and 0.92 for PSP vs. MSA. These results highlight the potential of combining spatial and structural information for robust subtype differentiation. In conclusion, this study demonstrates that fusing CNN-based image features with volume-based ML inputs improves classification accuracy for APD subtypes. The proposed approach may contribute to more reliable early-stage diagnosis, facilitating timely and targeted interventions in clinical practice.
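For readers less familiar with the AUC metric reported above, here is a minimal sketch of how it can be computed for one binary task such as PSP vs. PD. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions; they are not data or code from the study.

```python
def pairwise_auc(labels, scores):
    """Area under the ROC curve via its pairwise-ranking definition.

    AUC equals the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): label 1 = PSP, label 0 = PD.
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a sense of scale for the reported 0.86 to 0.95 range.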
### [122] [LocateEdit-Bench: A Benchmark for Instruction-Based Editing Localization](https://arxiv.org/abs/2602.05577)
*Shiyu Wu,Shuyan Li,Jing Li,Jing Liu,Yequan Wang*
Main category: cs.CV
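TL;DR: This paper proposes the LocateEdit-Bench dataset for evaluating forgery localization methods against instruction-based image editing; it contains 231K edited images covering four cutting-edge editing models and three common edit types, along with multi-metric evaluation protocols.
Details
Motivation: Existing AI-generated forgery localization methods mainly target inpainting-based manipulations and fail against emerging instruction-based image editing, so a benchmark dataset suited to the new editing paradigm is urgently needed.
Method: Build the large-scale LocateEdit-Bench dataset (231K edited images) covering four advanced editing models and three edit types; propose two multi-metric evaluation protocols and systematically evaluate existing localization methods.
Result: Establishes the first forgery localization benchmark dataset and accompanying evaluation framework for instruction-based image editing, revealing the limitations of existing methods in this new setting.
Conclusion: LocateEdit-Bench provides key benchmark support for keeping pace with rapidly evolving image editing techniques and promotes the development of future forgery localization methods.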
Abstract: Recent advancements in image editing have enabled highly controllable and semantically-aware alteration of visual content, posing unprecedented challenges to manipulation localization. However, existing AI-generated forgery localization methods primarily focus on inpainting-based manipulations, making them ineffective against the latest instruction-based editing paradigms. To bridge this critical gap, we propose LocateEdit-Bench, a large-scale dataset comprising 231K edited images, designed specifically to benchmark localization methods against instruction-driven image editing. Our dataset incorporates four cutting-edge editing models and covers three common edit types. We conduct a detailed analysis of the dataset and develop two multi-metric evaluation protocols to assess existing localization methods. Our work establishes a foundation to keep pace with the evolving landscape of image editing, thereby facilitating the development of effective methods for future forgery localization. The dataset will be open-sourced upon acceptance.
### [123] [LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation](https://arxiv.org/abs/2602.05578)
*Junyang Chen,Xiangbo Lv,Zhiqiang Kou,Xingdong Sheng,Ning Xu,Yiguo Qiao*
Main category: cs.CV
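TL;DR: This paper proposes LoGoSeg, an efficient single-stage open-vocabulary semantic segmentation framework that introduces an object existence prior, a region-aware alignment module, and a dual-stream fusion mechanism to improve region-level visual-textual alignment and reduce object hallucination, without extra models or data, and performs strongly on multiple benchmarks.
Details
Motivation: Existing open-vocabulary semantic segmentation methods built on VLMs (such as CLIP) rely on image-level pretraining, which leads to imprecise spatial alignment; they also lack strong object priors and region-level constraints, making them prone to object hallucination or missed detections.
Method: Propose the LoGoSeg framework with three innovations: (i) an object existence prior based on global image-text similarity that dynamically weights relevant categories; (ii) a region-aware alignment module that establishes fine-grained region-level image-text correspondences; (iii) a dual-stream fusion mechanism that fuses local structure with global semantics. The framework is end-to-end and single-stage, requiring no external mask proposals, additional backbones, or extra datasets.
Result: The method's competitiveness and generalization are validated on six benchmarks: A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b.
Conclusion: By introducing an object prior and region-level alignment, LoGoSeg markedly improves the accuracy and robustness of open-vocabulary semantic segmentation while remaining efficient, offering a new approach to the task.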
Abstract: Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
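The object existence prior, which weights categories by global image-text similarity, might look roughly like the sketch below. Everything here is an assumption for illustration: the softmax weighting, the `temperature` value, the multiplicative reweighting, and the toy embeddings are not taken from LoGoSeg, whose exact formulation the abstract does not give.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def existence_prior(image_emb, text_embs, temperature=0.1):
    """Softmax over global image-text similarities: one weight per category."""
    sims = [cosine(image_emb, t) / temperature for t in text_embs]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def reweight(pixel_scores, prior):
    """Scale each category's per-pixel score by its image-level prior."""
    return [[s * w for s, w in zip(pixel, prior)] for pixel in pixel_scores]


# Toy embeddings: the image is closest to category 0.
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
prior = existence_prior(image_emb, text_embs)
weighted = reweight([[1.0, 1.0, 1.0]], prior)
```

Categories that are unlikely to be present in the image receive small weights, so spurious per-pixel activations for them are damped; this is one simple way an image-level prior could reduce hallucinated classes.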
### [124] [Geometric Observability Index: An Operator-Theoretic Framework for Per-Feature Sensitivity, Weak Observability, and Dynamic Effects in SE(3) Pose Estimation](https://arxiv.org/abs/2602.05582)
*Joe-Mei Feng,Sheng-Wei Yu*
Main category: cs.CV
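TL;DR: This paper proposes a unified operator-theoretic framework, the Geometric Observability Index (GOI), for analyzing the sensitivity of camera pose estimation on the Lie group SE(3) to individual image features. It extends influence function theory to matrix Lie groups and quantifies the contribution of a single measurement through the curvature operator and Lie algebraic structure, thereby unifying conditioning analysis, Fisher information geometry, dynamic scene detectability, and related classical questions.
Details
Motivation: Classical sensitivity tools (conditioning analysis, Euclidean perturbation arguments, Fisher information bounds) cannot explain how individual image features influence the pose estimate, nor why dynamic or inconsistent observations can severely distort SLAM and SfM systems.
Method: Extend influence function theory to matrix Lie groups, derive an intrinsic perturbation operator for left-trivialized M-estimators on SE(3), define the Geometric Observability Index (GOI), quantify it via the curvature operator and Lie algebraic structure of the observable subspace, and analyze its spectral decomposition.
Result: GOI reveals a direct correspondence between weak observability and high sensitivity; in the population regime it coincides with the Fisher information geometry on SE(3) and yields a single-measurement analogue of the Cramér-Rao bound; it explains classical degeneracies such as pure rotation and vanishing parallax as well as dynamic feature amplification along weak curvature directions; and it provides lightweight, training-free diagnostic signals within standard Gauss-Newton pipelines.
Conclusion: GOI provides a geometrically consistent description of measurement influence, unifies several classical analytical perspectives, and can serve as a real-time diagnostic tool deployable in existing SLAM systems without architectural changes.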
Abstract: We present a unified operator-theoretic framework for analyzing per-feature sensitivity in camera pose estimation on the Lie group SE(3). Classical sensitivity tools - conditioning analyses, Euclidean perturbation arguments, and Fisher information bounds - do not explain how individual image features influence the pose estimate, nor why dynamic or inconsistent observations can disproportionately distort modern SLAM and structure-from-motion systems. To address this gap, we extend influence function theory to matrix Lie groups and derive an intrinsic perturbation operator for left-trivialized M-estimators on SE(3).
The resulting Geometric Observability Index (GOI) quantifies the contribution of a single measurement through the curvature operator and the Lie algebraic structure of the observable subspace. GOI admits a spectral decomposition along the principal directions of the observable curvature, revealing a direct correspondence between weak observability and amplified sensitivity. In the population regime, GOI coincides with the Fisher information geometry on SE(3), yielding a single-measurement analogue of the Cramer-Rao bound.
The same spectral mechanism explains classical degeneracies such as pure rotation and vanishing parallax, as well as dynamic feature amplification along weak curvature directions. Overall, GOI provides a geometrically consistent description of measurement influence that unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability through the spectral geometry of the curvature operator. Because these quantities arise directly within Gauss-Newton pipelines, the curvature spectrum and GOI also yield lightweight, training-free diagnostic signals for identifying dynamic features and detecting weak observability configurations without modifying existing SLAM architectures.
### [125] [A Mixed Reality System for Robust Manikin Localization in Childbirth Training](https://arxiv.org/abs/2602.05588)
*Haojie Cheng,Chang Liu,Abhiram Kanneganti,Mahesh Arjandas Choolani,Arundhati Tushar Gosavi,Eng Tat Khoo*
Main category: cs.CV
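TL;DR: This paper presents a mixed reality (MR) system for obstetric training that combines virtual guidance with tactile interaction on a physical birthing manikin, preserving authentic haptic feedback while supporting independent practice without expert supervision. It extends HMD passthrough by spatially calibrating an external RGB-D camera and uses a coarse-to-fine localization pipeline to accurately overlay virtual guiding hands; experiments show stable localization on a standalone headset, and a large-scale comparative study with medical students shows it significantly outperforms VR-only training.
Details
Motivation: Medical students' opportunities to gain hands-on experience with vaginal births are increasingly limited by shortened clinical rotations, patient reluctance, and the unpredictability of labour; meanwhile, clinical instructors carry a heavy teaching load and traditional training is inefficient.
Method: Develop an MR system based on commercial head-mounted displays (HMDs): 1) extend HMD passthrough by spatially calibrating an external RGB-D camera; 2) build a coarse-to-fine localization pipeline that first aligns the maternal manikin using fiducial markers to define the delivery region and then registers a pre-scanned neonatal head model within that region; 3) accurately overlay virtual guiding hands near the manikin and combine them with real haptic feedback for procedural training.
Result: The system achieves accurate and stable manikin localization on a standalone headset without external computing resources; a large-scale user study with 83 fourth-year medical students shows that the MR group scored significantly higher than the VR group in delivery, post-delivery, and overall task performance, and that trainees consistently preferred MR.
Conclusion: The MR system helps relieve the shortage of clinical teaching resources in obstetrics and improves training autonomy and effectiveness while preserving authentic haptic feedback, offering a new paradigm for high-fidelity, scalable medical skills training.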
Abstract: Opportunities for medical students to gain practical experience in vaginal births are increasingly constrained by shortened clinical rotations, patient reluctance, and the unpredictable nature of labour. To alleviate clinicians' instructional burden and enhance trainees' learning efficiency, we introduce a mixed reality (MR) system for childbirth training that combines virtual guidance with tactile manikin interaction, thereby preserving authentic haptic feedback while enabling independent practice without continuous on-site expert supervision. The system extends the passthrough capability of commercial head-mounted displays (HMDs) by spatially calibrating an external RGB-D camera, allowing real-time visual integration of physical training objects. Building on this capability, we implement a coarse-to-fine localization pipeline that first aligns the maternal manikin with fiducial markers to define a delivery region and then registers the pre-scanned neonatal head within this area. This process enables spatially accurate overlay of virtual guiding hands near the manikin, allowing trainees to follow expert trajectories reinforced by haptic interaction. Experimental evaluations demonstrate that the system achieves accurate and stable manikin localization on a standalone headset, ensuring practical deployment without external computing resources. A large-scale user study involving 83 fourth-year medical students was subsequently conducted to compare MR-based and virtual reality (VR)-based childbirth training. Four senior obstetricians independently assessed performance using standardized criteria. Results showed that MR training achieved significantly higher scores in delivery, post-delivery, and overall task performance, and was consistently preferred by trainees over VR training.
### [126] [EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality](https://arxiv.org/abs/2602.05590)
*Haojie Cheng,Shaun Jing Heng Ong,Shaoyu Cai,Aiden Tat Yang Koh,Fuxi Ouyang,Eng Tat Khoo*
Main category: cs.CV
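TL;DR: This paper proposes EgoPoseVR, a dual-modality fusion framework that combines headset motion signals with RGB-D images for real-time, stable, and accurate egocentric full-body pose estimation in VR, and builds a large-scale synthetic dataset for training and evaluation.
Details
Motivation: Existing egocentric pose estimation methods based on head-mounted cameras suffer from temporal instability, inaccurate lower-body estimation, and a lack of real-time performance when applied to VR headsets.
Method: Propose the end-to-end EgoPoseVR framework: a spatiotemporal encoder extracts frame- and joint-level features, cross-attention fuses headset motion with RGB-D information, and a kinematic optimization module driven by HMD signals improves accuracy and stability; also build a large-scale synthetic VR dataset with over 1.8 million aligned frames.
Result: EgoPoseVR surpasses current state-of-the-art egocentric pose estimation models on multiple metrics; a real-world user study shows it is rated significantly higher than baseline methods in accuracy, stability, embodiment, and intention for future use.
Conclusion: EgoPoseVR achieves robust full-body pose tracking in VR without additional body-worn sensors or room-scale tracking systems, providing a practical solution for high-fidelity embodied VR interaction.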
Abstract: Immersive virtual reality (VR) applications demand accurate, temporally coherent full-body pose tracking. Recent head-mounted camera-based approaches show promise in egocentric pose estimation, but encounter challenges when applied to VR head-mounted displays (HMDs), including temporal instability, inaccurate lower-body estimation, and the lack of real-time performance. To address these limitations, we present EgoPoseVR, an end-to-end framework for accurate egocentric full-body pose estimation in VR that integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A spatiotemporal encoder extracts frame- and joint-level representations, which are fused via cross-attention to fully exploit complementary motion cues across modalities. A kinematic optimization module then imposes constraints from HMD signals, enhancing the accuracy and stability of pose estimation. To facilitate training and evaluation, we introduce a large-scale synthetic dataset of over 1.8 million temporally aligned HMD and RGB-D frames across diverse VR scenarios. Experimental results show that EgoPoseVR outperforms state-of-the-art egocentric pose estimation models. A user study in real-world scenes further shows that EgoPoseVR achieved significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use compared to baseline methods. These results show that EgoPoseVR enables robust full-body pose tracking, offering a practical solution for accurate VR embodiment without requiring additional body-worn sensors or room-scale tracking systems.
### [127] [CAViT -- Channel-Aware Vision Transformer for Dynamic Feature Fusion](https://arxiv.org/abs/2602.05598)
*Aon Safdar,Mohamed Saadeldin*
Main category: cs.CV
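TL;DR: This paper proposes CAViT, a dual-attention vision Transformer that replaces the static MLP with channel-wise self-attention to enable content-aware, dynamic feature interaction, improving accuracy on multiple benchmark datasets while reducing parameters and computation.
Details
Motivation: Channel mixing in existing ViTs is static, relying on fixed MLPs that do not adapt to the input content.
Method: Propose the CAViT architecture, in which each Transformer block performs spatial self-attention followed by channel-wise self-attention, enabling dynamic, content-aware token mixing.
Result: On five benchmark datasets spanning natural and medical images, CAViT improves accuracy over the standard ViT by up to 3.6% while reducing parameter count and FLOPs by more than 30%; attention maps show sharper, more semantically meaningful activation patterns.
Conclusion: Channel-wise self-attention effectively enhances the representational capacity and content adaptivity of ViTs; CAViT achieves higher performance at lower complexity, supporting the effectiveness and practicality of dynamic token mixing.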
Abstract: Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static, relying on fixed multilayer perceptrons (MLPs) that lack adaptability to input content. We introduce 'CAViT', a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified and content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing parameter count and FLOPs by over 30%. Qualitative attention maps reveal sharper and semantically meaningful activation patterns, validating the effectiveness of our attention-driven token mixing.
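To make the idea of channel-wise self-attention concrete, the following is a minimal sketch in which channels, rather than spatial tokens, attend to one another. It deliberately omits learned query/key/value projections, multiple heads, and normalization, and the function name `channel_attention` is hypothetical; it is not the CAViT implementation, only an illustration of input-dependent channel mixing.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(x):
    """Mix channels with attention weights computed from the input itself.

    x is a list of N tokens, each a list of C channel values.
    Returns (out, attn): out has the same N x C shape, attn is C x C.
    """
    n, c = len(x), len(x[0])
    # Treat each channel as a "token" described by its N spatial responses.
    cols = [[x[t][ch] for t in range(n)] for ch in range(c)]
    scale = 1.0 / math.sqrt(n)
    attn = [
        softmax([scale * sum(a * b for a, b in zip(cols[i], cols[j]))
                 for j in range(c)])
        for i in range(c)
    ]
    # New channel i is an attention-weighted combination of all channels.
    mixed = [
        [sum(attn[i][j] * cols[j][t] for j in range(c)) for t in range(n)]
        for i in range(c)
    ]
    out = [[mixed[ch][t] for ch in range(c)] for t in range(n)]
    return out, attn


# Toy input: 2 tokens with 3 channels each.
x = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
out, attn = channel_attention(x)
```

Unlike a fixed MLP, the `attn` matrix here changes with every input, which is the sense in which channel mixing becomes content-aware; its size is C x C, independent of the number of spatial tokens.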
### [128] [Multi-instance robust fitting for non-classical geometric models](https://arxiv.org/abs/2602.05602)
*Zongliang Zhang,Shuxiang Li,Xingwang Huang,Zongyue Wang*
Main category: cs.CV
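TL;DR: This paper proposes a multi-instance robust fitting method for non-classical models (such as spiral curves, procedural character models, and free-form surfaces), using a novel estimator based on the model-to-data error and a meta-heuristic optimizer to solve the global optimization problem on noisy data.
Details
Motivation: Existing robust fitting methods mainly target classical geometric models (such as lines, circles, and planes), offer limited support for non-classical models, and are mostly restricted to single-instance reconstruction; this paper addresses multi-instance robust reconstruction of non-classical models.
Method: Formulate multi-instance fitting as an optimization problem comprising an estimator and an optimizer: the estimator is based on the model-to-data error and handles outliers without a predefined error threshold; because the estimator is non-differentiable with respect to the model parameters, a meta-heuristic algorithm is used for global optimization.
Result: The method's effectiveness is demonstrated on a variety of non-classical models, and the code is open-sourced.
Conclusion: The proposed method achieves multi-instance robust fitting of non-classical models, overcoming traditional methods' reliance on classical models and single instances as well as their sensitivity to error thresholds.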
Abstract: Most existing robust fitting methods are designed for classical models, such as lines, circles, and planes. In contrast, fewer methods have been developed to robustly handle non-classical models, such as spiral curves, procedural character models, and free-form surfaces. Furthermore, existing methods primarily focus on reconstructing a single instance of a non-classical model. This paper aims to reconstruct multiple instances of non-classical models from noisy data. We formulate this multi-instance fitting task as an optimization problem, which comprises an estimator and an optimizer. Specifically, we propose a novel estimator based on the model-to-data error, capable of handling outliers without a predefined error threshold. Since the proposed estimator is non-differentiable with respect to the model parameters, we employ a meta-heuristic algorithm as the optimizer to seek the global optimum. The effectiveness of our method is demonstrated through experimental results on various non-classical models. The code is available at https://github.com/zhangzongliang/fitting.
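One plausible reading of a model-to-data error is sketched below for a single Archimedean spiral: points are sampled on the candidate model, and each is compared with its nearest data point. The spiral parameterization, the sampling density, and the function names are assumptions made for illustration; the paper's actual estimator and its multi-instance handling are not reproduced here.

```python
import math


def spiral_points(a, b, turns=2.0, n=200):
    """Sample n points on the Archimedean spiral r = a + b * theta."""
    pts = []
    for i in range(n):
        theta = 2.0 * math.pi * turns * i / (n - 1)
        r = a + b * theta
        pts.append((r * math.cos(theta), r * math.sin(theta)))
    return pts


def model_to_data_error(model_pts, data_pts):
    """Mean distance from each model sample to its nearest data point.

    Data points far from the model (outliers) are never the nearest
    neighbour of a well-placed model sample, so they leave this error
    unchanged and no inlier threshold is required.
    """
    total = 0.0
    for mx, my in model_pts:
        total += min(math.hypot(mx - dx, my - dy) for dx, dy in data_pts)
    return total / len(model_pts)


# Toy data: points on the true spiral plus three gross outliers.
outliers = [(10.0, 10.0), (-8.0, 7.0), (9.0, -9.0)]
data = spiral_points(0.5, 0.3) + outliers

good = model_to_data_error(spiral_points(0.5, 0.3), data)  # true parameters
bad = model_to_data_error(spiral_points(0.5, 0.6), data)   # wrong growth rate
```

Because the minimum over data points is not differentiable in the model parameters, an objective like this is naturally paired with a derivative-free search, which is consistent with the abstract's choice of a meta-heuristic optimizer.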
### [129] [Unified Sensor Simulation for Autonomous Driving](https://arxiv.org/abs/2602.05617)
*Nikolay Patakin,Arsenii Shirokov,Anton Konushin,Dmitry Senushkin*
Main category: cs.CV
TL;DR: XSIM is a sensor simulation framework for autonomous driving that extends 3DGUT splatting with rolling-shutter modeling, phase modeling for spherical cameras, and an extended 3D Gaussian representation to improve geometric consistency and photorealism.
Details
Motivation: Existing 3DGUT splatting struggles with spherical sensors like LiDARs due to cyclic projection and time discontinuities at azimuth boundaries, leading to incorrect particle projection; there's a need for unified, flexible sensor modeling for dynamic autonomous driving environments.
Method: XSIM introduces generalized rolling-shutter modeling, a phase modeling mechanism to handle temporal and shape discontinuities at azimuth borders for spherical cameras, and an extended 3D Gaussian representation with two opacity parameters to decouple geometry and color distributions.
Result: XSIM achieves state-of-the-art performance on Waymo Open Dataset, Argoverse 2, and PandaSet, outperforming strong recent baselines with improved geometric consistency and photorealistic appearance.
Conclusion: XSIM provides a robust, unified sensor simulation framework tailored for autonomous driving, effectively addressing key limitations of prior 3DGUT-based methods—especially for spherical sensors—and enabling high-fidelity rendering of complex sensor distortions in dynamic scenes.
Abstract: In this work, we introduce XSIM, a sensor simulation framework for autonomous driving. XSIM extends 3DGUT splatting with generalized rolling-shutter modeling tailored for autonomous driving applications. Our framework provides a unified and flexible formulation for appearance and geometric sensor modeling, enabling rendering of complex sensor distortions in dynamic environments. We identify spherical cameras, such as LiDARs, as a critical edge case for existing 3DGUT splatting due to cyclic projection and time discontinuities at azimuth boundaries leading to incorrect particle projection. To address this issue, we propose a phase modeling mechanism that explicitly accounts for temporal and shape discontinuities of Gaussians projected by the Unscented Transform at azimuth borders. In addition, we introduce an extended 3D Gaussian representation that incorporates two distinct opacity parameters to resolve mismatches between geometry and color distributions. As a result, our framework provides enhanced scene representations with improved geometric consistency and photorealistic appearance. We evaluate our framework extensively on multiple autonomous driving datasets, including Waymo Open Dataset, Argoverse 2, and PandaSet. Our framework consistently outperforms strong recent baselines and achieves state-of-the-art performance across all datasets. The source code is publicly available at https://github.com/whesense/XSIM.
### [130] [ROMAN: Reward-Orchestrated Multi-Head Attention Network for Autonomous Driving System Testing](https://arxiv.org/abs/2602.05629)
*Jianlei Chi,Yuzhen Wu,Jiaxuan Hou,Xiaodong Zhang,Ming Fan,Suhui Sun,Weijun Dai,Bo Li,Jianguo Sun,Jun Sun*
Main category: cs.CV
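TL;DR: This paper proposes ROMAN, which combines a multi-head attention network with a traffic law weighting mechanism to generate high-risk violation scenarios for more thorough and targeted testing of automated driving systems (ADS). Experiments show that it outperforms existing tools in both violation count and scenario diversity and covers every clause of the input traffic laws.
Details
Motivation: Current ADS testing struggles to generate complex, high-risk law-breaking scenarios and overlooks multi-vehicle interactions and critical situations, leading to insufficient safety evaluation.
Method: Propose ROMAN: a multi-head attention network models interactions among vehicles, traffic signals, and other factors; an LLM-based traffic law risk weighting module evaluates violation risk along two dimensions, severity and occurrence.
Result: Testing the Baidu Apollo ADS in CARLA, ROMAN achieves an average violation count 7.91% higher than ABLE and 55.96% higher than LawBreaker, with greater scenario diversity, and is the only tool to generate violation scenarios for every clause of the input traffic laws.
Conclusion: ROMAN markedly improves the generation and coverage of high-risk law-breaking scenarios in ADS testing, providing an effective means of validation for safer and more reliable autonomous driving deployment.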
Abstract: The Automated Driving System (ADS) acts as the brain of an autonomous vehicle, responsible for its safety and efficiency. Safe deployment requires thorough testing in diverse real-world scenarios and compliance with traffic laws like speed limits, signal obedience, and right-of-way rules. Violations like running red lights or speeding pose severe safety risks. However, current testing approaches face significant challenges: a limited ability to generate complex and high-risk law-breaking scenarios, and a failure to account for complex interactions involving multiple vehicles and critical situations. To address these challenges, we propose ROMAN, a novel scenario generation approach for ADS testing that combines a multi-head attention network with a traffic law weighting mechanism. ROMAN is designed to generate high-risk violation scenarios to enable more thorough and targeted ADS evaluation. The multi-head attention mechanism models interactions among vehicles, traffic signals, and other factors. The traffic law weighting mechanism implements a workflow that leverages an LLM-based risk weighting module to evaluate violations based on the two dimensions of severity and occurrence. We have evaluated ROMAN by testing the Baidu Apollo ADS within the CARLA simulation platform and conducting extensive experiments to measure its performance. Experimental results demonstrate that ROMAN surpassed state-of-the-art tools ABLE and LawBreaker by achieving 7.91% higher average violation count than ABLE and 55.96% higher than LawBreaker, while also maintaining greater scenario diversity. In addition, only ROMAN successfully generated violation scenarios for every clause of the input traffic laws, enabling it to identify more high-risk violations than existing approaches.
### [131] [UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos](https://arxiv.org/abs/2602.05638)
*Jinlin Wu,Felix Holm,Chuxi Chen,An Wang,Yaxin Hu,Xiaofan Ye,Zelin Zang,Miao Xu,Lihua Zhou,Huai Liao,Danny T. M. Chan,Ming Feng,Wai S. Poon,Hongliang Ren,Dong Yi,Nassir Navab,Gaofeng Meng,Jiebo Luo,Hongbin Liu,Zhen Lei*
Main category: cs.CV
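TL;DR: UniSurg is a foundation model for surgical video that abandons pixel-level reconstruction in favor of predicting latent motion representations. With three innovations (motion-guided prediction, spatiotemporal affinity self-distillation, and feature diversity regularization) and pretraining on the large-scale surgical video dataset UniSurg-15M, it significantly improves performance on multiple surgical video understanding tasks.
Details
Motivation: Existing surgical video analysis models focus excessively on low-level pixel details such as smoke, specular reflections, and fluid motion, and neglect semantic structures (e.g., actions, workflow, anatomical relationships), which wastes model capacity.
Method: Built on the Video Joint Embedding Predictive Architecture (V-JEPA), the method introduces three innovations: 1) motion-guided latent prediction to focus on semantic regions; 2) spatiotemporal affinity self-distillation to enforce relational consistency; 3) feature diversity regularization to prevent representation collapse in texture-sparse scenes. It also curates the large-scale surgical video dataset UniSurg-15M (3,658 hours, 50 sources, 13 anatomical regions) for pretraining.
Result: Outperforms state-of-the-art methods across 17 benchmarks, with significant gains on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation.
Conclusion: UniSurg establishes a new paradigm and standard for universal, motion-oriented surgical video understanding.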
Abstract: While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details - such as smoke, specular reflections, and fluid motion - rather than semantic structures essential for surgical understanding. We present UniSurg, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), UniSurg introduces three key technical innovations tailored to surgical videos: 1) motion-guided latent prediction to prioritize semantically meaningful regions, 2) spatiotemporal affinity self-distillation to enforce relational consistency, and 3) feature diversity regularization to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate UniSurg-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that UniSurg significantly outperforms state-of-the-art methods on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
### [132] [Enhancing Personality Recognition by Comparing the Predictive Power of Traits, Facets, and Nuances](https://arxiv.org/abs/2602.05650)
*Amir Ansari,Jana Subirana,Bruna Silva,Sergio Escalera,David Gallardo-Pujol,Cristina Palmero*
Main category: cs.CV
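TL;DR: This paper explores using the finer-grained levels of the Big-Five personality model (such as nuances) to improve personality recognition from audiovisual interaction data; results show that nuance-level models significantly outperform conventional trait-level and facet-level models.
Details
Motivation: Personality recognition models that use broad trait scores as ground truth generalize poorly, because similar trait scores can arise from diverse, context-dependent behaviors.
Method: Using the UDIVA v0.5 dataset, the authors build a transformer model with cross-modal (audiovisual) and cross-subject (dyad-aware) attention mechanisms and train it at three granularities of the Big-Five model: traits, facets, and nuances.
Result: Nuance-level models consistently outperform facet-level and trait-level models across interaction scenarios, reducing mean squared error by up to 74%.
Conclusion: Using finer-grained personality constructs (such as nuances) as supervision signals can significantly improve the performance and generalization of personality recognition models, suggesting a new direction for personality modeling.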
Abstract: Personality is a complex, hierarchical construct typically assessed through item-level questionnaires aggregated into broad trait scores. Personality recognition models aim to infer personality traits from different sources of behavioral data. However, reliance on broad trait scores as ground truth, combined with limited training data, poses challenges for generalization, as similar trait scores can manifest through diverse, context dependent behaviors. In this work, we explore the predictive impact of the more granular hierarchical levels of the Big-Five Personality Model, facets and nuances, to enhance personality recognition from audiovisual interaction data. Using the UDIVA v0.5 dataset, we trained a transformer-based model including cross-modal (audiovisual) and cross-subject (dyad-aware) attention mechanisms. Results show that nuance-level models consistently outperform facet and trait-level models, reducing mean squared error by up to 74% across interaction scenarios.
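The hierarchy described in the abstract (item-level nuances aggregated into facets, then into a broad trait score) can be illustrated with a small sketch. It assumes simple averaging at each level; the item names, facet names, and scores are hypothetical and do not come from the paper or from UDIVA.

```python
from statistics import mean


def aggregate(nuances, facet_of):
    """Average nuance (item) scores into facet scores, then into one trait score.

    nuances: dict item -> score; facet_of: dict item -> facet name.
    Plain averaging is an assumption made for illustration.
    """
    grouped = {}
    for item, score in nuances.items():
        grouped.setdefault(facet_of[item], []).append(score)
    facet_scores = {facet: mean(values) for facet, values in grouped.items()}
    trait_score = mean(facet_scores.values())
    return facet_scores, trait_score


# Hypothetical questionnaire items mapped to hypothetical facets.
facet_of = {
    "talkative": "sociability",
    "outgoing": "sociability",
    "assertive": "assertiveness",
    "takes_charge": "assertiveness",
}
person_a = {"talkative": 5.0, "outgoing": 5.0, "assertive": 1.0, "takes_charge": 1.0}
person_b = {"talkative": 3.0, "outgoing": 3.0, "assertive": 3.0, "takes_charge": 3.0}

facets_a, trait_a = aggregate(person_a, facet_of)
facets_b, trait_b = aggregate(person_b, facet_of)
```

Both hypothetical people end up with the same trait score even though their facet and nuance profiles differ, which is the ambiguity the abstract gives as the motivation for supervising at finer levels.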
### [133] [ShapeUP: Scalable Image-Conditioned 3D Editing](https://arxiv.org/abs/2602.05676)
*Inbar Gat,Dana Cohen-Bar,Guy Levy,Elad Richardson,Daniel Cohen-Or*
Main category: cs.CV
TL;DR: ShapeUP is a scalable, image-conditioned 3D editing framework that enables precise, controllable, and geometrically consistent 3D manipulation via supervised latent-to-latent translation in native 3D space, outperforming existing methods in identity preservation and edit fidelity.
Details
Motivation: Precise 3D manipulation remains challenging due to trade-offs among visual controllability, geometric consistency, and scalability; existing methods suffer from slowness, visual drift, or inflexibility from frozen priors.
Method: ShapeUP formulates 3D editing as supervised latent-to-latent translation using a pretrained 3D foundation model and trains a 3D Diffusion Transformer (DiT) on triplets of source 3D shape, edited 2D image (as prompt), and target edited 3D shape.
Result: ShapeUP achieves fine-grained visual control, implicit mask-free localization, strict structural consistency, and consistently outperforms trained and training-free baselines in identity preservation and edit fidelity.
Conclusion: ShapeUP establishes a robust and scalable paradigm for native 3D content editing by bridging high-fidelity generation with precise, image-guided manipulation in 3D latent space.
Abstract: Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
### [134] [Poster: Camera Tampering Detection for Outdoor IoT Systems](https://arxiv.org/abs/2602.05706)
*Shadi Attarha,Kanaga Shanmugi,Anna Förster*
Main category: cs.CV
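TL;DR: This paper proposes two camera tampering detection methods, one rule-based and one deep-learning-based, compares them in terms of accuracy, computational demands, and training data requirements, and provides public datasets to support further research.
Details
Motivation: Smart cameras deployed outdoors are vulnerable to deliberate vandalism and harsh environmental conditions, either of which can undermine monitoring; detecting tampering in still images is harder than in video because there is no sequence of continuous frames.
Method: Proposes two approaches for detecting tampered images, a rule-based method and a deep-learning-based method, and evaluates their performance in real-world scenarios.
Result: The deep-learning model achieves higher accuracy, while the rule-based method is better suited to resource-constrained scenarios where a prolonged calibration phase is impractical. Public datasets containing normal, blurred, and rotated images are also released.
Conclusion: Each method has its own use case: deep learning where high accuracy is required, the rule-based method in low-resource environments. The public datasets fill a resource gap in this area.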
Abstract: Recently, the use of smart cameras in outdoor settings has grown to improve surveillance and security. Nonetheless, these systems are susceptible to tampering, whether from deliberate vandalism or harsh environmental conditions, which can undermine their monitoring effectiveness. In this context, detecting camera tampering is more challenging when a camera is capturing still images rather than video as there is no sequence of continuous frames over time. In this study, we propose two approaches for detecting tampered images: a rule-based method and a deep-learning-based method. The aim is to evaluate how each method performs in terms of accuracy, computational demands, and the data required for training when applied to real-world scenarios. Our results show that the deep-learning model provides higher accuracy, while the rule-based method is more appropriate for scenarios where resources are limited and a prolonged calibration phase is impractical. We also offer publicly available datasets with normal, blurred, and rotated images to support the development and evaluation of camera tampering detection methods, addressing the need for such resources.
### [135] [Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization](https://arxiv.org/abs/2602.05718)
*Yunchuan Ma,Laiyun Qing,Guorong Li,Yuqing Liu,Yuankai Qi,Qingming Huang*
Main category: cs.CV
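TL;DR: This paper proposes a multi-task learning framework that uses three self-supervised temporal understanding tasks (Action Completion, Action Order Understanding, and Action Regularity Understanding) to strengthen a point-supervised temporal action localization model's understanding of temporal relationships among frames.
Details
Motivation: Existing PTAL methods rely only on point-supervised snippet-level classification and lack explicit modeling of the temporal relationships among the frames of an action, even though these relationships are crucial for localizing complete actions accurately.
Method: Designs a multi-task learning framework with three self-supervised temporal understanding tasks (Action Completion, Action Order Understanding, and Action Regularity Understanding) to improve the model's understanding of the temporal consistency of actions.
Result: Extensive experiments on four benchmark datasets show that the proposed method outperforms several state-of-the-art approaches.
Conclusion: This is the first work to explicitly explore temporal consistency modeling for point-supervised action localization, and it effectively improves the model's temporal understanding and localization performance.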
Abstract: Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (i.e., labeling only a single frame per action instance) to train a model to effectively locate action instances within untrimmed videos. Most existing approaches design the task head of models with only a point-supervised snippet-level classification, without explicitly modeling the temporal relationships among frames of an action. However, understanding the temporal relationships of frames is crucial because it can help a model understand how an action is defined and therefore benefits localizing the full frames of an action. To this end, in this paper, we design a multi-task learning framework that fully utilizes point supervision to boost the model's temporal understanding capability for action localization. Specifically, we design three self-supervised temporal understanding tasks: (i) Action Completion, (ii) Action Order Understanding, and (iii) Action Regularity Understanding. These tasks help a model understand the temporal consistency of actions across videos. To the best of our knowledge, this is the first attempt to explicitly explore temporal consistency for point-supervised action localization. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method compared to several state-of-the-art approaches.
### [136] [Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification](https://arxiv.org/abs/2602.05729)
*Lexiang Hu,Youze Xue,Dian Li,Gang Liu,Zhouchen Lin*
Main category: cs.CV
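TL;DR: This paper proposes AGFF-Embed, which adaptively fuses global and fine-grained perceptual embeddings and combines them with the Explicit Gradient Amplification (EGA) technique to strengthen hard-negative learning, significantly improving multimodal embedding models on both general and fine-grained understanding tasks.
Details
Motivation: Existing CLIP-based and MLLM-based embedding models capture only global semantic information, whereas complex real-world scenarios require both global and fine-grained perception, and a compatible fusion mechanism is lacking.
Method: Proposes AGFF-Embed, which prompts the MLLM to generate multiple embeddings covering different dimensions of semantic information and aggregates them adaptively and smoothly; introduces Explicit Gradient Amplification (EGA) to enhance in-batch hard negatives without fine-grained editing of the dataset.
Result: On the MMEB and MMVP-VLM benchmarks, AGFF-Embed achieves state-of-the-art performance in both general and fine-grained understanding.
Conclusion: AGFF-Embed effectively addresses the fusion of global and fine-grained perception in multimodal embeddings, and EGA further improves hard-negative learning, offering a new direction for multimodal representation learning.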
Abstract: Multimodal embeddings serve as a bridge for aligning vision and language, with the two primary implementations -- CLIP-based and MLLM-based embedding models -- both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, thus necessitating a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we adapt AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negatives enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed comprehensively achieves state-of-the-art performance in both general and fine-grained understanding compared to other multimodal embedding models.
### [137] [Depth as Prior Knowledge for Object Detection](https://arxiv.org/abs/2602.05730)
*Moussa Kassem Sbeyti,Nadja Klein*
Main category: cs.CV
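TL;DR: This paper proposes DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature. Through Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training and Depth-Aware Confidence Thresholding (DCT) during inference, it significantly improves small-object detection without modifying the detector architecture or adding sensors.
Details
Motivation: Detecting small, distant objects is difficult because of scale variation, low resolution, and background clutter, and safety-critical applications in particular need reliable detection; existing approaches that exploit depth require complex, model-specific architectural modifications.
Method: Analyzes the depth-detection relationship theoretically and empirically; proposes the DepthPrior framework, consisting of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) at training time and Depth-Aware Confidence Thresholding (DCT) at inference time.
Result: Across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet), small-object mAP_S improves by up to +9% and mAR_S by up to +7%, with inference recovery rates as high as 95:1 (true vs. false detections).
Conclusion: DepthPrior exploits depth priors in a lightweight, general, and non-intrusive way, significantly improving small-object detection without additional sensors, architectural changes, or loss of inference efficiency.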
Abstract: Detecting small and distant objects remains challenging for object detectors due to scale variation, low resolution, and background clutter. Safety-critical applications require reliable detection of these objects for safe planning. Depth information can improve detection, but existing approaches require complex, model-specific architectural modifications. We provide a theoretical analysis followed by an empirical investigation of the depth-detection relationship. Together, they explain how depth causes systematic performance degradation and why depth-informed supervision mitigates it. We introduce DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature, providing comparable benefits without modifying detector architectures. DepthPrior consists of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training, and Depth-Aware Confidence Thresholding (DCT) during inference. The only overhead is the initial cost of depth estimation. Experiments across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet) demonstrate the effectiveness of DepthPrior, achieving up to +9% mAP$_S$ and +7% mAR$_S$ for small objects, with inference recovery rates as high as 95:1 (true vs. false detections). DepthPrior offers these benefits without additional sensors, architectural changes, or performance costs. Code is available at https://github.com/mos-ks/DepthPrior.
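The abstract names the DepthPrior components but not their formulas. The sketch below illustrates the general idea of two of them under stated assumptions: a loss weight that grows linearly with normalized depth (standing in for DLW) and a confidence threshold that relaxes linearly with depth (standing in for DCT). The function names, the linear forms, and every default value are hypothetical; the actual definitions are in the paper and its repository.

```python
def depth_loss_weights(depths, alpha=1.0):
    """Assumed DLW form: weight = 1 + alpha * min-max normalized depth,
    so farther objects contribute more to the training loss."""
    lo, hi = min(depths), max(depths)
    span = hi - lo
    if span == 0:
        return [1.0] * len(depths)
    return [1.0 + alpha * (d - lo) / span for d in depths]


def weighted_loss(losses, depths, alpha=1.0):
    """Weighted mean of per-object losses using the weights above."""
    weights = depth_loss_weights(depths, alpha)
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


def keep_detection(score, depth, base_thr=0.5, min_thr=0.2, max_depth=80.0):
    """Assumed DCT form: the confidence threshold falls linearly from
    base_thr (depth 0) to min_thr (depth >= max_depth)."""
    frac = min(max(depth / max_depth, 0.0), 1.0)
    threshold = base_thr - (base_thr - min_thr) * frac
    return score >= threshold


# Illustrative depths in metres for three objects.
example_weights = depth_loss_weights([5.0, 20.0, 80.0])
```

With these assumed defaults, a detection scored 0.3 would be discarded at close range but kept at 80 m, which is one way low-confidence detections of distant objects could be recovered at inference time.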
### [138] [Neuro-Inspired Visual Pattern Recognition via Biological Reservoir Computing](https://arxiv.org/abs/2602.05737)
*Luca Ciampi,Ludovico Iannello,Fabrizio Tonelli,Gabriele Lagani,Angelo Di Garbo,Federico Cremisi,Giuseppe Amato*
Main category: cs.CV
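TL;DR: This paper proposes a biological reservoir computing (BRC) approach based on in vitro cultured cortical neurons. A high-density multi-electrode array (HD-MEA) stimulates and records neural activity, and a linear readout layer performs static visual pattern recognition, demonstrating that living neural networks can serve as effective reservoirs.
Details
Motivation: To move beyond the limits of artificial recurrent models that only approximate neural dynamics, and to explore the spontaneous and evoked activity of real biological neural circuits as a natural computational substrate, advancing neuromorphic computing and biologically inspired machine learning.
Method: An in vitro cultured rat cortical neuron network serves as the physical reservoir; an HD-MEA delivers multi-channel input stimulation and records the high-dimensional neural responses; the responses are used as features for a single-layer perceptron trained with supervision for classification.
Result: The system classifies accurately on tasks of increasing difficulty, including pointwise stimuli, oriented bars, clock-digit-like shapes, and MNIST handwritten digits. Despite biological noise, spontaneous activity, and inter-session variability, it still produces high-dimensional, separable representations.
Conclusion: In vitro cortical networks can serve as effective biological reservoirs for static visual pattern recognition, providing empirical grounding for integrating living neural hardware into neuromorphic computing and supporting the design of biologically plausible visual computing models.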
Abstract: In this paper, we present a neuro-inspired approach to reservoir computing (RC) in which a network of in vitro cultured cortical neurons serves as the physical reservoir. Rather than relying on artificial recurrent models to approximate neural dynamics, our biological reservoir computing (BRC) system leverages the spontaneous and stimulus-evoked activity of living neural circuits as its computational substrate. A high-density multi-electrode array (HD-MEA) provides simultaneous stimulation and readout across hundreds of channels: input patterns are delivered through selected electrodes, while the remaining ones capture the resulting high-dimensional neural responses, yielding a biologically grounded feature representation. A linear readout layer (single-layer perceptron) is then trained to classify these reservoir states, enabling the living neural network to perform static visual pattern-recognition tasks within a computer-vision framework. We evaluate the system across a sequence of tasks of increasing difficulty, ranging from pointwise stimuli to oriented bars, clock-digit-like shapes, and handwritten digits from the MNIST dataset. Despite the inherent variability of biological neural responses (arising from noise, spontaneous activity, and inter-session differences), the system consistently generates high-dimensional representations that support accurate classification. These results demonstrate that in vitro cortical networks can function as effective reservoirs for static visual pattern recognition, opening new avenues for integrating living neural substrates into neuromorphic computing frameworks. More broadly, this work contributes to the effort to incorporate biological principles into machine learning and supports the goals of neuro-inspired vision by illustrating how living neural systems can inform the design of efficient and biologically grounded computational models.
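The reservoir-plus-linear-readout pipeline described above can be sketched in a few lines. In the toy below, a fixed random projection followed by `tanh` and additive noise stands in for the living culture (an assumption made purely so the example runs; the paper uses real HD-MEA recordings), and only the single-layer perceptron readout is trained. All sizes, patterns, and hyperparameters are illustrative.

```python
import math
import random

rng = random.Random(0)
N_IN, N_RES = 4, 16  # illustrative numbers of stimulation and recording channels

# Fixed, untrained random projection: a stand-in for the biological reservoir.
W_RES = [[rng.uniform(-1.0, 1.0) for _ in range(N_IN)] for _ in range(N_RES)]


def reservoir_state(stimulus, noise=0.05):
    """Simulated noisy response of each recording channel to a stimulus."""
    return [
        math.tanh(sum(w * s for w, s in zip(row, stimulus)) + rng.gauss(0.0, noise))
        for row in W_RES
    ]


def predict(weights, bias, state):
    """Linear readout with a hard threshold (binary classification)."""
    activation = sum(w * x for w, x in zip(weights, state)) + bias
    return 1 if activation > 0 else 0


def train_readout(states, labels, epochs=50, lr=0.1):
    """Classic perceptron rule; only the readout is trained, never the reservoir."""
    weights, bias = [0.0] * N_RES, 0.0
    for _ in range(epochs):
        for state, label in zip(states, labels):
            error = label - predict(weights, bias, state)
            if error:
                weights = [w + lr * error * x for w, x in zip(weights, state)]
                bias += lr * error
    return weights, bias


# Two hypothetical stimulation patterns, presented 20 times each.
patterns = {0: [1, 0, 1, 0], 1: [0, 1, 0, 1]}
labels = [i % 2 for i in range(40)]
states = [reservoir_state(patterns[y]) for y in labels]
weights, bias = train_readout(states, labels)
accuracy = sum(predict(weights, bias, s) == y for s, y in zip(states, labels)) / len(labels)
```

The design point this illustrates is that the reservoir itself is never optimized: all learning happens in the readout, which is why a biological network with fixed (and noisy) dynamics can be used as the feature extractor.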
### [139] [FMPose3D: monocular 3D pose estimation via flow matching](https://arxiv.org/abs/2602.05755)
*Ti Wang,Xiaohang Yu,Mackenzie Weygandt Mathis*
Main category: cs.CV
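TL;DR: This paper proposes FMPose3D, an efficient monocular 3D pose estimation framework based on Flow Matching. It uses an ODE to map a Gaussian prior to the conditional 3D pose distribution and generates multiple hypotheses in only a few integration steps; a Reprojection-based Posterior Expectation Aggregation (RPEA) module improves final accuracy. It achieves state-of-the-art results on both human and animal 3D pose datasets.
Details
Motivation: Monocular 3D pose estimation suffers from depth ambiguity and occlusions; conventional diffusion models perform well but are slow at inference, so a more efficient probabilistic generative method is needed.
Method: Proposes the FMPose3D framework, which formulates 3D pose estimation as a conditional distribution transport problem and uses flow matching to learn an ODE velocity field that continuously transports samples from a standard Gaussian prior to the 3D pose distribution conditioned on 2D inputs; diverse hypotheses are generated by sampling different noise seeds, and the RPEA module aggregates them into a posterior expectation based on reprojection consistency.
Result: Surpasses existing methods on Human3.6M and MPI-INF-3DHP and also reaches state-of-the-art performance on the animal datasets Animal3D and CtrlAni3D; the code is open-sourced.
Conclusion: Flow matching provides an efficient and scalable probabilistic modeling paradigm for monocular 3D pose estimation; FMPose3D balances generative diversity and accuracy and generalizes well across species.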
Abstract: Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have recently demonstrated strong performance, but their iterative denoising process typically requires many timesteps for each prediction, making inference computationally expensive. In contrast, we leverage Flow Matching (FM) to learn a velocity field defined by an Ordinary Differential Equation (ODE), enabling efficient generation of 3D pose samples with only a few integration steps. We propose a novel generative pose estimation framework, FMPose3D, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned only on 2D inputs. Although ODE trajectories are deterministic, FMPose3D naturally generates various pose hypotheses by sampling different noise seeds. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose3D surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both 3D pose domains. The code is available at https://github.com/AdaptiveMotorControlLab/FMPose3D.
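The sampling and aggregation steps described above can be sketched for a single joint. The code assumes a toy velocity field in place of the learned network: it pulls the x and y coordinates toward the 2D observation (orthographic projection is assumed) and leaves depth untouched, so different noise seeds give different depth hypotheses. The aggregation uses a softmax over negative reprojection error as a simple stand-in for RPEA; neither the field nor the weighting formula is taken from the paper.

```python
import math
import random


def velocity(pose, t, joint_2d):
    """Toy stand-in for the learned velocity field (assumption): straight-line
    transport of x, y toward the 2D observation; depth z is left unchanged."""
    x, y, _ = pose
    u, v = joint_2d
    return ((u - x) / (1.0 - t), (v - y) / (1.0 - t), 0.0)


def sample_pose(joint_2d, seed, steps=4):
    """Draw one 3D hypothesis by Euler-integrating the ODE from Gaussian noise."""
    rng = random.Random(seed)
    pose = tuple(rng.gauss(0.0, 1.0) for _ in range(3))
    dt = 1.0 / steps
    for k in range(steps):
        vel = velocity(pose, k * dt, joint_2d)  # t stays strictly below 1
        pose = tuple(p + dt * dv for p, dv in zip(pose, vel))
    return pose


def reprojection_error(pose, joint_2d):
    """Distance between the orthographic projection of the pose and the 2D input."""
    return math.hypot(pose[0] - joint_2d[0], pose[1] - joint_2d[1])


def aggregate(hypotheses, joint_2d, temperature=0.1):
    """Weighted mean of hypotheses; weights favor low reprojection error."""
    scores = [math.exp(-reprojection_error(h, joint_2d) / temperature) for h in hypotheses]
    total = sum(scores)
    weights = [s / total for s in scores]
    pose = tuple(sum(w * h[i] for w, h in zip(weights, hypotheses)) for i in range(3))
    return pose, weights


joint_2d = (0.3, -0.2)  # hypothetical 2D detection
hypotheses = [sample_pose(joint_2d, seed) for seed in range(8)]
final_pose, weights = aggregate(hypotheses, joint_2d)
```

Each trajectory is deterministic once the seed is fixed, matching the abstract's observation that diversity comes only from the initial noise; four Euler steps suffice here because the toy field is a straight line.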
### [140] [ReText: Text Boosts Generalization in Image-Based Person Re-identification](https://arxiv.org/abs/2602.05785)
*Timur Mamedov,Karina Kvanchiani,Anton Konushin,Vadim Konushin*
Main category: cs.CV
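TL;DR: ReText is a novel image-based person re-identification (Re-ID) method that combines multi-camera Re-ID data with single-camera data accompanied by textual descriptions and jointly optimizes three tasks (Re-ID, image-text matching, and text-guided image reconstruction), significantly improving cross-domain generalization.
Details
Motivation: Existing methods mitigate the domain gap with complex architectures but with limited effect; stylistically diverse single-camera data is easy to collect but lacks cross-view variation and is therefore semantically limited, so text is introduced to enrich its semantic cues.
Method: ReText uses a multi-task joint training framework: Re-ID training on multi-camera data, together with image-text matching and text-guided image reconstruction on single-camera data, fusing visual and language information.
Result: ReText significantly outperforms state-of-the-art methods on multiple cross-domain Re-ID benchmarks and shows stronger generalization.
Conclusion: This is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based Re-ID, showing that text guidance can enrich the semantics of single-camera data and improve model generalization.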
Abstract: Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While multiple existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved by stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.
### [141] [Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation](https://arxiv.org/abs/2602.05789)
*Hengyi Wang,Ruiqiang Zhang,Chang Liu,Guanjie Wang,Zehua Ma,Han Fang,Weiming Zhang*
Main category: cs.CV
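TL;DR: This paper proposes Allocentric Perceiver, a training-free strategy that uses off-the-shelf geometric expert models to recover 3D spatial states from images and constructs a target-centric reference frame aligned with the instruction's semantics, improving vision-language models on allocentric spatial reasoning tasks.
Details
Motivation: Existing vision-language models are brittle on allocentric spatial queries that require explicit perspective shifts and struggle to reason effectively in a target-centric coordinate frame.
Method: Allocentric Perceiver requires no training. It uses off-the-shelf geometric experts to recover metric 3D states from one or more images and instantiates an instruction-driven allocentric reference frame; it then deterministically transforms the reconstructed geometry into that frame and prompts the backbone VLM with structured, geometry-grounded representations.
Result: On multiple spatial reasoning benchmarks, Allocentric Perceiver yields consistent and substantial gains (about 10%) on allocentric tasks while maintaining strong egocentric performance, surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
Conclusion: Converting implicit mental rotation into explicit geometric computation effectively strengthens the allocentric spatial understanding of VLMs without additional training, making the approach general and practical.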
Abstract: With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
### [142] [Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning](https://arxiv.org/abs/2602.05809)
*Enwei Tong,Yuanchao Bai,Yao Zhu,Junjun Jiang,Xianming Liu*
Main category: cs.CV
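TL;DR: This paper proposes Focus-Scan-Refine (FSR), a training-free visual token pruning framework inspired by how humans answer visual questions, which significantly reduces inference latency and memory overhead while maintaining or even improving model performance.
Details
Motivation: Existing vision-language models generate large numbers of visual tokens, causing high latency and memory usage, and training-free pruning methods struggle to balance local evidence and global context under aggressive compression.
Method: FSR has three stages: 1) Focus, which combines visual importance with instruction relevance to concentrate on key evidence; 2) Scan, which, conditioned on the focused set, selects complementary tokens that are most different from it to scan the global context; 3) Refine, which aggregates nearby informative tokens into the scan anchors through similarity-based assignment and score-weighted merging, without increasing the total token count.
Result: Across multiple VLM backbones and vision-language benchmarks, FSR consistently outperforms existing state-of-the-art pruning methods, achieving a better accuracy-efficiency trade-off.
Conclusion: FSR is a plug-and-play, training-free visual token pruning framework that mimics human visual reasoning to relieve the computational bottleneck of VLMs, with good generality and practicality.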
Abstract: Vision-language models (VLMs) often generate massive visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source codes can be found at https://github.com/ILOT-code/FSR
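The three stages described above can be sketched on plain Python lists. This is a minimal illustration, assuming the focus score is the product of visual importance and instruction relevance, that "most different" means lowest maximum cosine similarity to the focused tokens, and that merging is a score-weighted mean. These concrete choices, and all names below, are assumptions; the authors' implementation is in the linked repository.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def focus_scan_refine(tokens, importance, relevance, k_focus, k_scan):
    """Prune tokens down to k_focus + k_scan vectors (both budgets must be >= 1)."""
    score = [i * r for i, r in zip(importance, relevance)]
    order = sorted(range(len(tokens)), key=lambda i: score[i], reverse=True)

    # Focus: keep the tokens with the highest combined score.
    focus, rest = order[:k_focus], order[k_focus:]

    # Scan: among the rest, pick the tokens least similar to the focused set.
    def similarity_to_focus(i):
        return max(cosine(tokens[i], tokens[j]) for j in focus)

    scan = sorted(rest, key=similarity_to_focus)[:k_scan]

    # Refine: assign every leftover token to its most similar scan anchor,
    # then merge each group by a score-weighted mean (token count unchanged).
    groups = {anchor: [anchor] for anchor in scan}
    for i in rest:
        if i not in groups:
            best = max(scan, key=lambda a: cosine(tokens[i], tokens[a]))
            groups[best].append(i)
    merged = []
    for anchor in scan:
        members = groups[anchor]
        total = sum(score[m] for m in members)
        if total == 0:
            merged.append(list(tokens[anchor]))
            continue
        dim = len(tokens[anchor])
        merged.append([sum(score[m] * tokens[m][d] for m in members) / total for d in range(dim)])
    return [list(tokens[i]) for i in focus] + merged


# Six toy 2-D tokens with hypothetical importance and relevance scores.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [0.7, 0.7]]
importance = [0.9, 0.8, 0.2, 0.3, 0.1, 0.4]
relevance = [1.0, 0.9, 0.5, 0.5, 0.5, 0.5]
pruned = focus_scan_refine(tokens, importance, relevance, k_focus=2, k_scan=2)
```

In this toy run the two focused tokens pass through unchanged, while the four remaining tokens are condensed into two anchors, so the budget of four output tokens is respected even though information from all six inputs contributes.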
### [143] [NVS-HO: A Benchmark for Novel View Synthesis of Handheld Objects](https://arxiv.org/abs/2602.05822)
*Musawar Ali,Manuel Carranza-García,Nicola Fioraio,Samuele Salti,Luigi Di Stefano*
Main category: cs.CV
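TL;DR: This paper presents NVS-HO, the first benchmark for novel view synthesis (NVS) of handheld objects in real-world environments using only RGB inputs. It contains handheld sequences (the object is manipulated by hand) and board sequences (which provide accurate camera poses); the former are used to learn the object's full appearance and the latter for evaluation. Experiments show that existing methods degrade significantly in unconstrained handheld scenarios, highlighting the need for more robust methods.
Details
Motivation: Existing novel view synthesis methods lack a systematic evaluation benchmark for real handheld scenarios, making it hard to expose their actual performance bottlenecks under unconstrained, dynamic manipulation.
Method: Builds the NVS-HO two-sequence RGB dataset (handheld sequences for modeling; ChArUco board sequences providing ground-truth poses and evaluation images); uses a classical SfM pipeline and the pre-trained VGGT as pose estimators and trains NVS models based on NeRF and Gaussian Splatting.
Result: Current mainstream NVS methods (based on NeRF and Gaussian Splatting) show clearly degraded reconstruction quality on handheld sequences, exposing their sensitivity to pose errors and motion blur; the board sequences confirm the reliability of the evaluation.
Conclusion: NVS-HO fills the gap in evaluating NVS of real handheld objects, reveals the limitations of existing RGB-only methods, and provides a key benchmark for developing more robust pose estimation and representation learning methods.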
Abstract: We propose NVS-HO, the first benchmark designed for novel view synthesis of handheld objects in real-world environments using only RGB inputs. Each object is recorded in two complementary RGB sequences: (1) a handheld sequence, where the object is manipulated in front of a static camera, and (2) a board sequence, where the object is fixed on a ChArUco board to provide accurate camera poses via marker detection. The goal of NVS-HO is to learn an NVS model that captures the full appearance of an object from (1), whereas (2) provides the ground-truth images used for evaluation. To establish baselines, we consider both a classical SfM pipeline and a state-of-the-art pre-trained feed-forward neural network (VGGT) as pose estimators, and train NVS models based on NeRF and Gaussian Splatting. Our experiments reveal significant performance gaps in current methods under unconstrained handheld conditions, highlighting the need for more robust approaches. NVS-HO thus offers a challenging real-world benchmark to drive progress in RGB-based novel view synthesis of handheld objects.
### [144] [Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation](https://arxiv.org/abs/2602.05827)
*Hai Zhang,Siqi Liang,Li Chen,Yuxian Li,Yukuan Xu,Yichao Zhong,Fu Zhang,Hongyang Li*
Main category: cs.CV
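TL;DR: This paper proposes SparseVideoNav, the first work to bring video generation models to Beyond-the-View Navigation (BVN). By generating a sparse future video, it achieves sub-second trajectory inference; in real-world zero-shot experiments its success rate is 2.5x that of existing LLM baselines, and it is the first to achieve BVN in night scenes.
Details
Motivation: Real-world navigation should support autonomous navigation from simple, high-level intents rather than relying on long, detailed language instructions; existing LLM methods struggle with BVN tasks, which require long-horizon reasoning, because of their short-horizon supervision.
Method: Exploiting the fact that video generation models inherently benefit from long-horizon supervision, the paper proposes SparseVideoNav: instead of generating a full long video, it generates sparse future frames spanning a 20-second horizon to enable efficient trajectory inference.
Result: On real-world zero-shot BVN tasks, SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines; it is the first to achieve BVN in challenging night scenes; inference is 27x faster than the unoptimized counterpart, reaching sub-second latency.
Conclusion: Video generation models can effectively support long-horizon, high-level-intent-driven navigation; SparseVideoNav offers a new paradigm for beyond-the-view, low-latency, strongly generalizing real-world navigation.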
Abstract: Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple and high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense and step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet, the prohibitive latency for generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
### [145] [Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning](https://arxiv.org/abs/2602.05829)
*Yudi Shi,Shangzhe Di,Qirui Chen,Qinian Wang,Jiayin Cai,Xiaolong Jiang,Yao Hu,Weidi Xie*
Main category: cs.CV
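TL;DR: Weaver is an end-to-end trainable multimodal reasoning agentic system that improves video reasoning by dynamically invoking diverse tools and combining them with reinforcement learning, performing especially well on long-video tasks.
Details
Motivation: Existing video reasoning methods based on text-centric Chain-of-Thought suffer from representational mismatch and limited perceptual acuity.
Method: Proposes the Weaver system: 1) the policy model can dynamically invoke diverse tools to progressively acquire visual cues; 2) a reinforcement learning algorithm that needs no trajectory supervision lets the system freely explore strategies for using and combining tools.
Result: Improves performance on several complex video reasoning benchmarks, particularly those involving long videos.
Conclusion: By building multimodal reasoning trajectories through tool augmentation and reinforcement learning, Weaver effectively overcomes the perceptual and representational bottlenecks of current video reasoning models.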
Abstract: Video reasoning constitutes a comprehensive assessment of a model's capabilities, as it demands robust perceptual and interpretive skills, thereby serving as a means to explore the boundaries of model performance. While recent research has leveraged text-centric Chain-of-Thought reasoning to augment these capabilities, such approaches frequently suffer from representational mismatch and are restricted by limited perceptual acuity. To address these limitations, we propose Weaver, a novel, end-to-end trainable multimodal reasoning agentic system. Weaver empowers its policy model to dynamically invoke diverse tools throughout the reasoning process, enabling progressive acquisition of crucial visual cues and construction of authentic multimodal reasoning trajectories. Furthermore, we integrate a reinforcement learning algorithm to allow the system to freely explore strategies for employing and combining these tools with trajectory-free data. Extensive experiments demonstrate that our system, Weaver, enhances performance on several complex video reasoning benchmarks, particularly those involving long videos.
### [146] [UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents](https://arxiv.org/abs/2602.05832)
*Han Xiao,Guozhi Wang,Hao Wang,Shilong Liu,Yuxiang Chai,Yue Pan,Yufeng Zhou,Xiaoxin Chen,Yafei Wen,Hongsheng Li*
Main category: cs.CV
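TL;DR: This paper proposes UI-Mem, a framework that uses a hierarchical experience memory (workflows, subtask skills, and failure patterns) and Stratified Group Sampling to improve credit assignment and cross-task experience transfer for GUI agents in online reinforcement learning, and introduces a self-evolving loop that continuously updates the memory, significantly improving performance and generalization.
Details
Motivation: Online reinforcement learning for GUI agents suffers from inefficient credit assignment in long-horizon tasks and repeated errors across tasks, mainly because an effective experience transfer mechanism is missing.
Method: Proposes the UI-Mem framework, comprising: 1) a structured hierarchical experience memory (parameterized templates storing workflows, subtask skills, and failure patterns); 2) a Stratified Group Sampling strategy that injects multiple levels of memory guidance within each rollout group to maintain diversity; 3) a self-evolving loop that automatically abstracts new strategies and errors to update the memory.
Result: On online GUI benchmarks, UI-Mem significantly outperforms traditional RL baselines and static reuse strategies and generalizes strongly to unseen applications.
Conclusion: Through structured memory modeling and dynamic guidance, UI-Mem effectively addresses credit assignment and experience transfer in online RL for GUIs, offering a new paradigm for building reusable, evolvable GUI agents.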
Abstract: Online Reinforcement Learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. However, its effectiveness is severely hindered by inefficient credit assignment in long-horizon tasks and repetitive errors across tasks due to the lack of experience transfer. To address these challenges, we propose UI-Mem, a novel framework that enhances GUI online RL with a Hierarchical Experience Memory. Unlike traditional replay buffers, our memory accumulates structured knowledge, including high-level workflows, subtask skills, and failure patterns. These experiences are stored as parameterized templates that enable cross-task and cross-application transfer. To effectively integrate memory guidance into online RL, we introduce Stratified Group Sampling, which injects varying levels of guidance across trajectories within each rollout group to maintain outcome diversity, driving the unguided policy toward internalizing guided behaviors. Furthermore, a Self-Evolving Loop continuously abstracts novel strategies and errors to keep the memory aligned with the agent's evolving policy. Experiments on online GUI benchmarks demonstrate that UI-Mem significantly outperforms traditional RL baselines and static reuse strategies, with strong generalization to unseen applications. Project page: https://ui-mem.github.io
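The abstract does not say how guidance is distributed inside a rollout group, so the following is only a minimal Python sketch under stated assumptions. The level names in `GUIDANCE_LEVELS`, the round-robin assignment, and the GRPO-style group-normalized advantage are all illustrative choices, not details confirmed by the paper. The sketch shows one reason outcome diversity within a group matters: if every rollout in a group gets the same reward, a group-normalized advantage carries no learning signal.

```python
import statistics

# Hypothetical guidance levels, ordered from no help to full workflow hints.
GUIDANCE_LEVELS = ("none", "failure_hints", "subtask_skills", "full_workflow")


def assign_guidance_levels(group_size, levels=GUIDANCE_LEVELS):
    """Spread guidance levels as evenly as possible over one rollout group."""
    return [levels[i % len(levels)] for i in range(group_size)]


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantage (assumed): (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


levels = assign_guidance_levels(8)          # each level appears twice
mixed = group_advantages([1, 1, 0, 0])      # diverse outcomes: nonzero advantages
flat = group_advantages([1, 1, 1, 1])       # identical outcomes: all zeros
```

In this toy setup, mixing unguided and guided rollouts in one group makes it more likely that rewards differ within the group, which keeps the advantages informative.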
### [147] [Self-Supervised Learning with a Multi-Task Latent Space Objective](https://arxiv.org/abs/2602.05845)
*Pierre-François De Plaen,Abhishek Jha,Luc Van Gool,Tinne Tuytelaars,Marc Proesmans*
Main category: cs.CV
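TL;DR: This paper addresses the instability of the multi-crop strategy in self-supervised learning by assigning a separate predictor to each view type and adding masked (cutout) views, yielding a stable, general multi-task asymmetric Siamese framework that consistently improves ResNet and ViT performance on ImageNet.
Details
Motivation: The multi-crop strategy improves SSL performance but causes training instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3; the root cause is that all views share a single predictor.
Method: Assigns a separate predictor to each view type (global, local, and cutout-masked) and treats each spatial transformation as a distinct alignment task, building a unified multi-task asymmetric Siamese SSL framework.
Result: The method markedly improves training stability and consistently improves representation learning for ResNet and ViT backbones on ImageNet.
Conclusion: View-specific predictors are an effective remedy for multi-crop instability in SSL; adding cutout views broadens view diversity and yields a more robust, general multi-task self-supervised learning paradigm.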
Abstract: Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which incorporates small local crops to global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the shared predictor used across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, where part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views into a single framework. The approach is stable, generally applicable across backbones, and consistently improves the performance of ResNet and ViT models on ImageNet.
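The core idea, one predictor per view type instead of a single shared predictor, can be illustrated with a small Python sketch. This is a toy under stated assumptions: `scale_predictor` stands in for a learned predictor head, embeddings are plain lists, and the negative-cosine alignment loss follows the SimSiam/BYOL convention rather than anything the abstract specifies.

```python
import math


def scale_predictor(scale):
    """Toy stand-in for a learned predictor head: scales every coordinate."""
    return lambda embedding: [scale * x for x in embedding]


def negative_cosine(p, z):
    """Alignment loss between a prediction p and a (stop-gradient) target z."""
    dot = sum(a * b for a, b in zip(p, z))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in z))
    return -dot / norm


def multi_task_loss(online_views, target, predictors):
    """Average alignment loss; each view type goes through its own predictor."""
    losses = [
        negative_cosine(predictors[view_type](embedding), target)
        for view_type, embedding in online_views
    ]
    return sum(losses) / len(losses)


# One predictor per view type (global, local, cutout), never shared.
predictors = {
    "global": scale_predictor(1.0),
    "local": scale_predictor(2.0),
    "cutout": scale_predictor(0.5),
}
views = [("global", [1.0, 0.0]), ("local", [1.0, 0.0]), ("cutout", [0.0, 1.0])]
loss = multi_task_loss(views, target=[1.0, 0.0], predictors=predictors)
```

Keying the predictors by view type is what turns the setup into a multi-task objective: each alignment task gets its own head while the encoder is shared.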
### [148] [Pathwise Test-Time Correction for Autoregressive Long Video Generation](https://arxiv.org/abs/2602.05871)
*Xunzhi Xiang,Zixuan Duan,Guiyu Zhang,Haiyu Zhang,Zhe Gao,Junta Wu,Shaofeng Zhang,Tengfei Wang,Qi Fan,Chunchao Guo*
Main category: cs.CV
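TL;DR: This paper proposes Test-Time Correction (TTC), which uses the initial frame as a stable reference anchor to calibrate intermediate stochastic states during sampling, addressing error accumulation in distilled autoregressive diffusion models for long video generation; it is training-free and extends generation length while preserving quality.
Details
Motivation: Distilled autoregressive diffusion models work well for real-time short video synthesis but accumulate severe errors over long sequences; existing Test-Time Optimization (TTO) methods fail to mitigate long-sequence drift because of unstable reward landscapes and the hypersensitivity of distilled parameters.
Method: Proposes the training-free Test-Time Correction (TTC), which uses the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory.
Result: TTC integrates seamlessly with various distilled models, extends generation length with negligible overhead, and matches the quality of resource-intensive training-based methods on 30-second benchmarks.
Conclusion: TTC is an efficient, general, training-free correction method for long video generation that mitigates error accumulation and drift and makes distilled diffusion models more practical.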
Abstract: Distilled autoregressive diffusion models facilitate real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. While existing Test-Time Optimization (TTO) methods prove effective for images or short clips, we identify that they fail to mitigate drift in extended sequences due to unstable reward landscapes and the hypersensitivity of distilled parameters. To overcome these limitations, we introduce Test-Time Correction (TTC), a training-free alternative. Specifically, TTC utilizes the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. Extensive experiments demonstrate that our method seamlessly integrates with various distilled models, extending generation lengths with negligible overhead while matching the quality of resource-intensive training-based methods on 30-second benchmarks.
### [149] [Contour Refinement using Discrete Diffusion in Low Data Regime](https://arxiv.org/abs/2602.05880)
*Fei Yu Guan,Ian Keefe,Sophie Wilkinson,Daniel D. B. Perrakis,Steven Waslander*
Main category: cs.CV
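TL;DR: This paper proposes a lightweight discrete diffusion contour refinement pipeline for robust boundary detection of irregular, translucent objects when labeled data is scarce, suited to low-data settings such as medical imaging.
Details
Motivation: Boundary detection of irregular and translucent objects matters in medical imaging, environmental monitoring, and manufacturing, but is often limited by scarce labeled data and limited in situ compute; existing segmentation research focuses on mask alignment, while boundary detection, especially in the low-data regime, remains understudied.
Method: Proposes a lightweight discrete diffusion contour refinement pipeline built on a CNN with self-attention layers that conditions on a segmentation mask and iteratively denoises a sparse contour representation; adaptations include a simplified diffusion process, a customized architecture, and minimal post-processing, targeting training sets of fewer than 500 images.
Result: Outperforms several SOTA baselines on the KVASIR medical imaging dataset, is competitive on HAM10K and the authors' custom wildfire smoke dataset, Smoke, and improves inference frame rate by 3.5x.
Conclusion: The method achieves efficient and accurate boundary detection under low-data, low-compute constraints, offering a practical option for resource-limited applications.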
Abstract: Boundary detection of irregular and translucent objects is an important problem with applications in medical imaging, environmental monitoring and manufacturing, where many of these applications are plagued with scarce labeled data and low in situ computational resources. While recent image segmentation studies focus on segmentation mask alignment with ground-truth, the task of boundary detection remains understudied, especially in the low data regime. In this work, we present a lightweight discrete diffusion contour refinement pipeline for robust boundary detection in the low data regime. We use a Convolutional Neural Network (CNN) architecture with self-attention layers as the core of our pipeline, and condition on a segmentation mask, iteratively denoising a sparse contour representation. We introduce multiple novel adaptations for improved low-data efficacy and inference efficiency, including using a simplified diffusion process, a customized model architecture, and minimal post-processing to produce a dense, isolated contour given a dataset of fewer than 500 training images. Our method outperforms several SOTA baselines on the medical imaging dataset KVASIR, is competitive on HAM10K and our custom wildfire dataset, Smoke, while improving inference framerate by 3.5x.
### [150] [EoCD: Encoder only Remote Sensing Change Detection](https://arxiv.org/abs/2602.05882)
*Mubashir Noman,Mustansar Fiaz,Hiyam Debary,Abdul Hannan,Shah Nawaz,Fahad Shahbaz Khan,Salman Khan*
Main category: cs.CV
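TL;DR: This paper proposes EoCD, an encoder-only change detection method that fuses the temporal images early and replaces the decoder with a parameter-free multiscale feature fusion module, significantly reducing model complexity while balancing detection performance and prediction speed.
Details
Motivation: Existing change detection methods rely on Siamese encoders and sophisticated decoders, which raises computational cost and model complexity; early-fusion methods drop the Siamese structure but underperform late-fusion methods and still need sophisticated decoders.
Method: Proposes Encoder-only Change Detection (EoCD), which fuses the temporal images early and replaces the conventional decoder with a parameter-free multiscale feature fusion module.
Result: EoCD is shown to be effective on four challenging change detection datasets, balances detection performance and prediction speed across encoder architectures, and shows that performance depends predominantly on the encoder, making the decoder an additional rather than essential component.
Conclusion: EoCD is a simple and efficient change detection framework that balances performance and efficiency and suggests that an elaborate decoder is not essential for this task.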
Abstract: Being a cornerstone of temporal analysis, change detection has been playing a pivotal role in modern earth observation. Existing change detection methods rely on a Siamese encoder to individually extract temporal features, followed by temporal fusion. Subsequently, these methods design sophisticated decoders to improve the change detection performance without taking into consideration the complexity of the model. These issues increase the overall computational cost as well as the network's complexity, which is undesirable. Alternatively, a few methods utilize the early fusion scheme to combine the temporal images. These methods avoid the extra overhead of a Siamese encoder; however, they also rely on sophisticated decoders for better performance. In addition, these methods demonstrate inferior performance compared to late-fusion-based methods. To bridge these gaps, we introduce encoder-only change detection (EoCD), a simple and effective method for the change detection task. The proposed method performs early fusion of the temporal data and replaces the decoder with a parameter-free multiscale feature fusion module, thereby significantly reducing the overall complexity of the model. EoCD demonstrates the optimal balance between change detection performance and prediction speed across a variety of encoder architectures. Additionally, EoCD demonstrates that the performance of the model is predominantly dependent on the encoder network, making the decoder an additional component. Extensive experimentation on four challenging change detection datasets reveals the effectiveness of the proposed method.
### [151] [Neural Implicit 3D Cardiac Shape Reconstruction from Sparse CT Angiography Slices Mimicking 2D Transthoracic Echocardiography Views](https://arxiv.org/abs/2602.05884)
*Gino E. Jansen,Carolina Brás,R. Nils Planken,Mark J. Schuuring,Berto J. Bouma,Ivana Išgum*
Main category: cs.CV
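TL;DR: This paper proposes a neural implicit function method that reconstructs complete 3D cardiac structures from segmentations of sparse CTA slices mimicking standard transthoracic echocardiography (TTE) views, yielding markedly more accurate left ventricle and left atrium volume estimates than the clinically used Simpson's biplane rule.
Details
Motivation: Accurate 3D quantification of cardiac chambers from 2D TTE requires reconstructing high-quality 3D cardiac shapes from sparse 2D slices, and existing clinical methods such as Simpson's rule have large errors.
Method: A multi-layer perceptron (the neural implicit function) learns shape priors from 3D CTA segmentations; at test time, the latent code and the rigid transforms that map the TTE-mimicking sparse CTA planes into 3D space are jointly optimized to reconstruct multi-class cardiac structures (the chambers and the left-ventricle myocardium).
Result: On a held-out CTA test set, the average Dice coefficient across all structures is 0.86 ± 0.04; volume errors for the left ventricle and left atrium are 4.88 ± 4.26 mL and 6.40 ± 7.37 mL, markedly lower than with Simpson's biplane rule (8.14 ± 6.04 mL and 37.76 ± 22.96 mL).
Conclusion: The method offers a viable and more accurate route to 3D cardiac chamber quantification in 2D TTE.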
Abstract: Accurate 3D representations of cardiac structures allow quantitative analysis of anatomy and function. In this work, we propose a method for reconstructing complete 3D cardiac shapes from segmentations of sparse planes in CT angiography (CTA) for application in 2D transthoracic echocardiography (TTE). Our method uses a neural implicit function to reconstruct the 3D shape of the cardiac chambers and left-ventricle myocardium from sparse CTA planes. To investigate the feasibility of achieving 3D reconstruction from 2D TTE, we select planes that mimic the standard apical 2D TTE views. During training, a multi-layer perceptron learns shape priors from 3D segmentations of the target structures in CTA. At test time, the network reconstructs 3D cardiac shapes from segmentations of TTE-mimicking CTA planes by jointly optimizing the latent code and the rigid transforms that map the observed planes into 3D space. For each heart, we simulate four realistic apical views, and we compare reconstructed multi-class volumes with the reference CTA volumes. On a held-out set of CTA segmentations, our approach achieves an average Dice coefficient of 0.86 $\pm$ 0.04 across all structures. Our method also achieves markedly lower volume errors than the clinical standard, Simpson's biplane rule: 4.88 $\pm$ 4.26 mL vs. 8.14 $\pm$ 6.04 mL, respectively, for the left ventricle; and 6.40 $\pm$ 7.37 mL vs. 37.76 $\pm$ 22.96 mL, respectively, for the left atrium. This suggests that our approach offers a viable route to more accurate 3D chamber quantification in 2D transthoracic echocardiography.
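The two quantities compared in the abstract, the Dice coefficient and the Simpson's biplane volume, are both simple formulas, sketched below in Python for orientation. This is a generic illustration, not the authors' evaluation code: voxels are represented as a set of index tuples, and the biplane rule is written in its usual method-of-disks form, V = (π/4) · Σ aᵢ·bᵢ · (L/n), where aᵢ and bᵢ are the chamber diameters of disk i in two orthogonal apical views and L is the long-axis length. Lengths in centimeters give volumes in milliliters.

```python
import math


def dice(pred_voxels, ref_voxels):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two voxel index sets."""
    pred, ref = set(pred_voxels), set(ref_voxels)
    if not pred and not ref:
        return 1.0  # two empty masks are treated as a perfect match
    return 2 * len(pred & ref) / (len(pred) + len(ref))


def simpson_biplane_volume(diameters_a, diameters_b, long_axis_length):
    """Method-of-disks volume from paired diameters in two orthogonal views."""
    if len(diameters_a) != len(diameters_b):
        raise ValueError("both views need the same number of disks")
    disk_height = long_axis_length / len(diameters_a)
    area_sum = sum(a * b for a, b in zip(diameters_a, diameters_b))
    return math.pi / 4 * area_sum * disk_height


overlap = dice({(0, 0, 0), (0, 0, 1), (0, 1, 0)}, {(0, 0, 1), (0, 1, 0), (1, 1, 1)})
# Sanity check on a cylinder: diameter 4 cm, length 8 cm, 20 disks.
cylinder_ml = simpson_biplane_volume([4.0] * 20, [4.0] * 20, 8.0)
```

The cylinder case is a useful sanity check because the biplane rule is exact there; real chambers deviate from the stacked-elliptical-disk assumption, which is one source of the errors the abstract reports for the clinical standard.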
### [152] [CLIP-Map: Structured Matrix Mapping for Parameter-Efficient CLIP Compression](https://arxiv.org/abs/2602.05909)
*Kangjie Zhang,Wenxuan Huang,Xin Zhou,Boxiang Zhou,Dejia Song,Yuan Xie,Baochang Zhang,Lizhuang Ma,Nemo Chen,Xu Tang,Yao Hu,Shaohui Lin*
Main category: cs.CV
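TL;DR: This paper proposes CLIP-Map, a mapping-based CLIP compression framework that uses learnable matrices with Kronecker factorization for full mapping and Diagonal Inheritance Initialization to ease optimization; it outperforms selection-based compression methods, particularly at high compression ratios.
Details
Motivation: CLIP's high compute and memory cost hinders deployment in resource-limited settings; existing compression methods based on weight selection severely damage feature representation under extreme compression.
Method: Proposes CLIP-Map, which combines Full-Mapping with Kronecker Factorization, mapping and combining the original weights through learnable matrices, with Diagonal Inheritance Initialization, which reduces distribution shift and makes mapping learning more efficient.
Result: Outperforms selection-based compression methods across compression ratios, with especially large gains at high compression ratios.
Conclusion: A mapping-based weight compression paradigm preserves the representational ability of the original CLIP better than a selection-based one, and CLIP-Map offers a new approach to efficient, lightweight CLIP models.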
Abstract: Contrastive Language-Image Pre-training (CLIP) has been widely applied in various computer vision tasks, e.g., text-to-image generation, image-text retrieval, and image captioning. However, CLIP suffers from high memory and computation cost, which prohibits its use in resource-limited application scenarios. Existing CLIP compression methods typically reduce the size of pre-trained CLIP weights by selecting a subset of them as weight inheritance for further retraining via mask optimization or important weight measurement. However, such selection-based weight inheritance often compromises feature representation ability, especially under extreme compression. In this paper, we propose a novel mapping-based CLIP compression framework, CLIP-Map. It leverages learnable matrices to map and combine pretrained weights by Full-Mapping with Kronecker Factorization, aiming to preserve as much information from the original weights as possible. To mitigate the optimization challenges introduced by the learnable mapping, we propose Diagonal Inheritance Initialization to reduce the distribution shifting problem for efficient and effective mapping learning. Extensive experimental results demonstrate that the proposed CLIP-Map outperforms selection-based frameworks across various compression ratios, with particularly significant gains observed under high compression settings.
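The abstract does not detail how the Kronecker factorization is wired into CLIP-Map, so the Python sketch below only illustrates the underlying identity and why it keeps a learnable mapping cheap: the Kronecker product of an m×n factor and a p×q factor is an (m·p)×(n·q) matrix, so a large mapping can be parameterized by two small factors. The factor shapes used here (16×24 and 32×32) are hypothetical and chosen only to make the arithmetic concrete.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]


def param_counts(shape_a, shape_b):
    """Return (factorized parameter count, dense parameter count)."""
    (m, n), (p, q) = shape_a, shape_b
    return m * n + p * q, (m * p) * (n * q)


small = kron([[1, 2]], [[1], [1]])               # [[1, 2], [1, 2]]
factored, dense = param_counts((16, 24), (32, 32))
# A 512x768 mapping: 1,408 factor parameters instead of 393,216 dense ones.
```

The saving grows with the size of the mapping, which is presumably why a factorized form is attractive when the mapping itself has to be learned.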
### [153] [Multi-Scale Global-Instance Prompt Tuning for Continual Test-time Adaptation in Medical Image Segmentation](https://arxiv.org/abs/2602.05937)
*Lingrui Li,Yanfeng Zhou,Nan Pu,Xin Chen,Zhun Zhong*
Main category: cs.CV
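TL;DR: This paper proposes Multi-scale Global-Instance Prompt Tuning (MGIPT) to address error accumulation, catastrophic forgetting, and privacy leakage in continual test-time adaptation (CTTA) for medical image semantic segmentation. An Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP) jointly model global and instance-level knowledge, improving robust adaptation under cross-center distribution shift.
Details
Motivation: Existing CTTA methods are prone to error accumulation and catastrophic forgetting during long-term adaptation; prompt-tuning-based methods improve on this but still lack multi-scale prompt diversity, model instance-specific knowledge inadequately, and carry a risk of privacy leakage.
Method: Proposes the MGIPT framework with two core modules: 1) the Adaptive-scale Instance Prompt (AIP), which dynamically learns lightweight, instance-specific prompts and mitigates error accumulation through an adaptive optimal-scale selection mechanism; 2) the Multi-scale Global-level Prompt (MGP), which captures domain-level knowledge at multiple scales to resist forgetting. The two are combined through a weighted ensemble for dual-level adaptation.
Result: On medical image segmentation benchmarks, MGIPT outperforms current SOTA methods and adapts robustly across continually changing target domains.
Conclusion: By combining multi-scale, global, and instance-level prompts, MGIPT mitigates the key challenges of CTTA and offers a more reliable, privacy-friendly continual adaptation approach for cross-center deployment of medical imaging models.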
Abstract: Distribution shift is a common challenge in medical images obtained from different clinical centers, significantly hindering the deployment of pre-trained semantic segmentation models in real-world applications across multiple domains. Continual Test-Time Adaptation (CTTA) has emerged as a promising approach to address cross-domain shifts during continually evolving target domains. Most existing CTTA methods rely on incrementally updating model parameters, which inevitably suffer from error accumulation and catastrophic forgetting, especially in long-term adaptation. Recent prompt-tuning-based works have shown potential to mitigate the two issues above by updating only visual prompts. While these approaches have demonstrated promising performance, several limitations remain: 1) lacking multi-scale prompt diversity, 2) inadequate incorporation of instance-specific knowledge, and 3) risk of privacy leakage. To overcome these limitations, we propose Multi-scale Global-Instance Prompt Tuning (MGIPT) to enhance the scale diversity of prompts and capture both global- and instance-level knowledge for robust CTTA. Specifically, MGIPT consists of an Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP). AIP dynamically learns lightweight and instance-specific prompts to mitigate error accumulation with an adaptive optimal-scale selection mechanism. MGP captures domain-level knowledge across different scales to ensure robust adaptation with anti-forgetting capabilities. These complementary components are combined through a weighted ensemble approach, enabling effective dual-level adaptation that integrates both global and local information. Extensive experiments on medical image segmentation benchmarks demonstrate that our MGIPT outperforms state-of-the-art methods, achieving robust adaptation across continually changing target domains.
### [154] [Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching](https://arxiv.org/abs/2602.05951)
*Junwan Kim,Jiho Park,Seonghu Jeon,Seungryong Kim*
Main category: cs.CV
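TL;DR: This paper proposes a learnable, condition-dependent source distribution for conditional flow matching, uses variance regularization and source-target directional alignment to avoid collapse and instability, and reports consistent gains on multiple text-to-image benchmarks, including up to 3x faster convergence in FID.
Details
Motivation: Most flow matching methods keep the fixed standard Gaussian source inherited from diffusion models and overlook the design of the source distribution itself, rarely treating it as an optimization target even in strongly conditioned tasks such as text-to-image generation.
Method: Proposes a learnable, condition-dependent source distribution, with variance regularization and directional alignment between source and target to stabilize training; also analyzes how the choice of target representation space affects flow matching with structured sources.
Result: Consistent and robust improvements across multiple text-to-image benchmarks, with up to 3x faster convergence in FID.
Conclusion: Principled design of the source distribution is both feasible and beneficial for modern conditional flow matching systems, and is a promising direction for improving generation quality and training efficiency.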
Abstract: Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under a flow matching objective that better exploits rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to a 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.
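To make the moving parts concrete, here is a minimal Python sketch of flow matching with a condition-dependent Gaussian source. Only the linear interpolation path and its velocity target are standard flow matching; everything else is an assumption made for illustration. In particular, `mu` and `log_sigma` are passed in as plain lists (in practice they would be predicted from the condition by a network), and `variance_penalty` and `alignment_penalty` are hypothetical forms of the two regularizers the abstract names, not the paper's actual losses.

```python
import math
import random


def sample_source(mu, log_sigma, rng):
    """Reparameterized draw x0 = mu + sigma * eps from a diagonal Gaussian."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]


def interpolate(x0, x1, t):
    """Linear flow matching path x_t = (1 - t) * x0 + t * x1."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field along the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def variance_penalty(log_sigma):
    """Hypothetical regularizer: zero at sigma = 1, grows if sigma collapses."""
    return sum(s * s for s in log_sigma) / len(log_sigma)


def alignment_penalty(mu, x1):
    """Hypothetical directional term: 1 - cosine(source mean, target)."""
    dot = sum(a * b for a, b in zip(mu, x1))
    norm = math.sqrt(sum(a * a for a in mu)) * math.sqrt(sum(b * b for b in x1))
    return 1.0 - dot / norm


rng = random.Random(0)
x1 = [2.0, 0.0]                                  # toy target sample
x0 = sample_source([1.0, 0.0], [0.0, 0.0], rng)  # condition-dependent source draw
x_half = interpolate(x0, x1, 0.5)
velocity = target_velocity(x0, x1)
```

A standard Gaussian source corresponds to `mu = 0` and `log_sigma = 0` for every condition; letting both depend on the condition is what introduces the collapse risk that the variance term is meant to counter.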
### [155] [LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation](https://arxiv.org/abs/2602.05966)
*Mirlan Karimov,Teodora Spasojevic,Markus Braun,Julian Wiederer,Vasileios Belagiannis,Marc Pollefeys*
Main category: cs.CV
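TL;DR: This paper proposes Localized Semantic Alignment (LSA), a framework that improves the temporal consistency of pre-trained video generation models by aligning semantic features of ground-truth and generated videos in regions around dynamic objects, with no external control signals needed at inference.
Details
Motivation: Existing controllable video generation methods rely on control signals at inference time to keep dynamic objects temporally consistent, which limits their use as scalable, general-purpose data engines.
Method: Proposes LSA, which adds a localized semantic feature consistency loss when fine-tuning a pre-trained video generation model: an off-the-shelf feature extraction model compares semantic features of ground-truth and generated videos in dynamic-object regions, and this loss is optimized jointly with the standard diffusion loss.
Result: After a single epoch of fine-tuning, LSA outperforms baselines on common video generation metrics; experiments on nuScenes and KITTI show improved temporal consistency without added inference overhead or external control signals.
Conclusion: LSA is a simple and effective method that improves the temporal consistency of video generation for autonomous driving scenes and points toward general video data engines that need no control signals.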
Abstract: Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the output of an off-the-shelf feature extraction model between the ground-truth and generated video clips localized around dynamic objects, inducing a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines in common video generation evaluation metrics. To further test the temporal consistency in generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without requiring external control signals during inference or adding computational overhead.
### [156] [RISE-Video: Can Video Generators Decode Implicit World Rules?](https://arxiv.org/abs/2602.05986)
*Mingxin Liu,Shuran Ma,Shibei Meng,Xiangyu Zhao,Zicheng Zhang,Shaofeng Zhang,Zhihang Zhong,Peixian Chen,Haoyu Cao,Xing Sun,Haodong Duan,Xue Yang*
Main category: cs.CV
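TL;DR: This paper presents RISE-Video, a reasoning-oriented benchmark for Text-Image-to-Video (TI2V) generation that emphasizes understanding and reasoning over implicit world rules rather than visual quality alone; it includes 467 human-annotated samples, a four-metric evaluation protocol, and an automated evaluation pipeline based on large multimodal models. Experiments show that current TI2V models have pervasive reasoning deficiencies in complex scenarios with implicit constraints.
Details
Motivation: Generative video models achieve high visual fidelity but lack the ability to model and reason over implicit world rules (such as commonsense, physical laws, and spatiotemporal dynamics), so a dedicated reasoning-oriented benchmark is needed.
Method: Builds the RISE-Video benchmark with 467 human-annotated samples across eight categories; proposes a four-metric evaluation protocol (Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality); and designs an automated evaluation pipeline based on Large Multimodal Models (LMMs).
Result: Experiments on 11 SOTA TI2V models show that all of them are weak at complex reasoning under implicit constraints, with notable shortfalls in physical rationality and reasoning alignment.
Conclusion: RISE-Video offers a new paradigm for evaluating the cognitive abilities of generative video models, exposes the tendency of current models to favor appearance over reasoning, and points the way toward next-generation video generators capable of world simulation.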
Abstract: While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: *Reasoning Alignment*, *Temporal Consistency*, *Physical Rationality*, and *Visual Quality*. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
### [157] [VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation](https://arxiv.org/abs/2602.05998)
*Jie Deng,Kaichun Yao,Libo Zhang*
Main category: cs.CV
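TL;DR: This paper proposes VisRefiner, a framework that teaches models to learn from the visual differences between rendered outputs and reference designs, improving both the accuracy and the self-refinement ability of screenshot-to-code generation.
Details
Motivation: Existing models generate code directly from screenshots without ever observing how their generated code renders, whereas human developers iteratively render, compare, and revise; the authors want models to learn from the same kind of visual feedback.
Method: Proposes the VisRefiner training framework: 1) difference-aligned supervision that associates visual discrepancies with the corresponding code edits; 2) a reinforcement learning stage for self-refinement, in which the model updates its code based on the visual differences between the rendered output and the target design.
Result: Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity and gives models strong self-refinement ability.
Conclusion: Learning from visual differences is effective for advancing screenshot-to-code generation, and VisRefiner offers a new paradigm for this direction.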
Abstract: Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
### [158] [GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?](https://arxiv.org/abs/2602.06013)
*Ruihang Li,Leigang Qu,Jingxu Zhang,Dongnan Gui,Mengde Xu,Xiaosong Zhang,Han Hu,Wenjie Wang,Jiaqi Wang*
Main category: cs.CV
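TL;DR: This paper proposes GenArena, a framework that replaces absolute pointwise scoring with pairwise comparison, markedly improving the stability and human alignment of visual generation evaluation and allowing open-source judge models to outperform top-tier proprietary ones.
Details
Motivation: Visual generation models have advanced faster than traditional evaluation methods can follow, and the prevailing absolute pointwise scoring with vision-language model judges suffers from stochastic inconsistency and poor alignment with human perception.
Method: Proposes GenArena, a unified evaluation framework that uses pairwise comparison instead of absolute pointwise scoring, and systematically evaluates it across a wide range of visual generation tasks and metrics.
Result: GenArena improves evaluation accuracy by over 20% and reaches a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, far above the 0.36 of pointwise methods; adopting the pairwise protocol alone lets off-the-shelf open-source models outperform top-tier proprietary models as judges.
Conclusion: Pairwise comparison is a more reliable and more human-aligned way to evaluate visual generation, and GenArena gives the community a rigorous, automated evaluation standard.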
Abstract: The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
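As a rough illustration of the quantities involved, the Python sketch below turns pairwise judgments into a ranking signal and compares it with a reference leaderboard using Spearman correlation. It is a generic sketch, not GenArena's pipeline: aggregating by simple win rate is an assumption (the abstract does not name the aggregation method), and the model names, judgments, and reference scores are made up.

```python
import math


def win_rates(comparisons):
    """Fraction of pairwise comparisons each model wins; input is (winner, loser) pairs."""
    wins, games = {}, {}
    for winner, loser in comparisons:
        wins[winner] = wins.get(winner, 0) + 1
        for model in (winner, loser):
            games[model] = games.get(model, 0) + 1
    return {model: wins.get(model, 0) / games[model] for model in games}


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var)


# Made-up judgments and reference scores, for illustration only.
rates = win_rates([("A", "B"), ("A", "C"), ("B", "C")])
models = ["A", "B", "C"]
reference = {"A": 1200, "B": 1100, "C": 1000}
rho = spearman([rates[m] for m in models], [reference[m] for m in models])
```

Because Spearman correlation depends only on rank order, it measures whether a judge orders models the way the reference leaderboard does, regardless of the scale of the scores.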
### [159] [MambaVF: State Space Model for Efficient Video Fusion](https://arxiv.org/abs/2602.06017)
*Zixiang Zhao,Yukun Cui,Lilun Deng,Haowen Bai,Haotong Qin,Tao Feng,Konrad Schindler*
Main category: cs.CV
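TL;DR: This paper proposes MambaVF, an efficient video fusion framework based on state space models (SSMs) that models long-range temporal dependencies without optical flow estimation, substantially reducing computation and memory costs while reaching SOTA performance on several video fusion tasks.
Details
Motivation: Existing video fusion methods rely heavily on optical flow estimation and feature warping, which leads to high computational overhead and limited scalability.
Method: Reformulates video fusion as a sequential state update process and uses a lightweight SSM-based fusion module with a spatio-temporal bidirectional scanning mechanism in place of conventional flow-guided alignment.
Result: Reaches SOTA on multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks, with up to 92.25% fewer parameters, up to 88.79% fewer FLOPs, and a 2.1x speedup compared to existing methods.
Conclusion: MambaVF shows that SSMs can effectively and efficiently replace explicit motion modeling in video fusion, offering a new paradigm for lightweight, scalable video fusion.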
Abstract: Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF enjoys high efficiency, reducing parameters by up to 92.25% and computational FLOPs by up to 88.79%, with a 2.1x speedup compared to existing methods. Project page: https://mambavf.github.io
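The "sequential state update" view can be illustrated with the simplest possible state space recurrence, h_t = a·h_{t-1} + b·x_t with output y_t = c·h_t, scanned in both directions. The Python sketch below is a deliberately reduced illustration: it uses a scalar state and fixed parameters `a`, `b`, `c`, whereas Mamba-style models use input-dependent (selective) parameters and vector states, and summing the forward and backward passes is an assumed way of combining the two scans.

```python
def ssm_scan(xs, a, b, c):
    """One pass of h_t = a * h_{t-1} + b * x_t, y_t = c * h_t; cost is linear in len(xs)."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs, a, b, c):
    """Sum of a forward scan and a backward scan over the same sequence."""
    forward = ssm_scan(xs, a, b, c)
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


impulse = [1.0, 0.0, 0.0]
forward_only = ssm_scan(impulse, a=0.5, b=1.0, c=1.0)        # [1.0, 0.5, 0.25]
both_ways = bidirectional_scan(impulse, a=0.5, b=1.0, c=1.0)  # [2.0, 0.5, 0.25]
```

Each step touches the state once, so the cost grows linearly with sequence length, and the backward pass lets every position also aggregate information from later frames without any explicit motion estimate.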
### [160] [Context Forcing: Consistent Autoregressive Video Generation with Long Context](https://arxiv.org/abs/2602.06028)
*Shuo Chen,Cong Wei,Sun Sun,Ping Nie,Kai Zhou,Ge Zhang,Ming-Hsuan Yang,Wenhu Chen*
Main category: cs.CV
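TL;DR: This paper proposes Context Forcing, a framework that trains a long-context student with a long-context teacher to resolve the student-teacher mismatch of conventional streaming tuning, and introduces a Slow-Fast Memory architecture that makes very long generation (e.g., 2 minutes) efficient, markedly improving long-term consistency.
Details
Motivation: Existing real-time long video generation methods supervise a long-context student with a short-context teacher, which cannot model global temporal dependencies; the resulting student-teacher mismatch caps the model's effective context length.
Method: Proposes Context Forcing, which trains the student with a long-context teacher that is aware of the full generation history; for computational efficiency, a Slow-Fast Memory context management system reduces visual redundancy and supports extreme durations (e.g., 2 minutes).
Result: Experiments show effective context lengths exceeding 20 seconds, 2 to 10 times longer than SOTA methods such as LongLive and Infinite-RoPE, with better consistency on various long video evaluation metrics.
Conclusion: By eliminating the student-teacher context mismatch and optimizing memory management, Context Forcing produces longer and more consistent videos and offers a new paradigm for real-time long video generation.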
Abstract: Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical **student-teacher mismatch**: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose **Context Forcing**, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a **Slow-Fast Memory** architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds, 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
### [161] [Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation](https://arxiv.org/abs/2602.06032)
*David Shavin,Sagie Benaim*
Main category: cs.CV
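TL;DR: This paper proposes Splat and Distill, a framework that lifts the features of a 2D vision foundation model (VFM) into a 3D Gaussian representation in a feed-forward manner and splats them onto novel viewpoints to produce supervision, distilling 3D geometric knowledge into a student model and markedly improving its 3D awareness and semantic representation quality.
Details
Motivation: Existing 2D vision foundation models lack 3D awareness, which limits their performance on downstream tasks that require geometric understanding.
Method: Introduces a feed-forward 3D reconstruction pipeline that lifts the teacher's 2D features into an explicit 3D Gaussian representation and then splats them onto novel viewpoints to generate supervisory feature maps for training the student; this avoids the feature-averaging artifacts of the slow per-scene optimization used in prior work.
Result: Significantly outperforms prior methods on monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation, improving both 3D awareness and the semantic richness of 2D features.
Conclusion: Splat and Distill offers an efficient, scalable way to inject 3D geometric priors into 2D vision foundation models, with a training dynamic in which the teacher's consistency improves alongside the student's.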
Abstract: Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at https://davidshavin4.github.io/Splat-and-Distill/
### [162] [V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval](https://arxiv.org/abs/2602.06034)
*Dongyang Chen,Chaoyang Wang,Dezhao SU,Xi Xiao,Zeyu Zhang,Jing Xiong,Qing Li,Yuzhang Shang,Shichao Ka*
Main category: cs.CV
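TL;DR: This paper proposes V-Retrver, a framework that recasts multimodal retrieval as an agentic reasoning process grounded in visual inspection. It dynamically calls external visual tools to gather evidence and is trained with a curriculum learning strategy, markedly improving retrieval accuracy and reasoning reliability.
Details
Motivation: Existing methods are overly language-driven, rely on static visual encodings, and cannot actively verify fine-grained visual evidence, which leads to speculative reasoning in visually ambiguous cases.
Method: V-Retrver models retrieval as multimodal interleaved reasoning that alternates between hypothesis generation and targeted visual verification. The evidence-gathering retrieval agent is trained with a curriculum learning strategy: supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective.
Result: Across multiple multimodal retrieval benchmarks, retrieval accuracy improves by 23.0% on average, and perception-driven reasoning becomes more reliable and generalizes better.
Conclusion: A visual-evidence-driven agentic reasoning paradigm can effectively overcome language-centric limitations and improve the accuracy and interpretability of multimodal retrieval.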
Abstract: Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (a 23.0% improvement on average), perception-driven reasoning reliability, and generalization.
### [163] [InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions](https://arxiv.org/abs/2602.06035)
*Sirui Xu,Samuel Schulter,Morteza Ziyadi,Xialin He,Xiaohan Fei,Yu-Xiong Wang,Liangyan Gui*
Main category: cs.CV
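TL;DR: This paper proposes InterPrior, a framework that learns a unified generative controller through large-scale imitation pretraining and reinforcement learning fine-tuning, so that humanoids can generalize whole-body coordinated object manipulation (loco-manipulation) across diverse scenarios.
Details
Motivation: Humans usually do not plan whole-body movements explicitly; coordinated motion emerges naturally from high-level intentions (such as affordances) and underlying physical and motor priors. Giving humanoids similar generalization and compositional abilities requires scaling such priors.
Method: InterPrior (1) uses large-scale imitation learning to distill a full-reference expert policy into a goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent; (2) applies data augmentation with physical perturbations, then fine-tunes with reinforcement learning to improve robustness to unseen goals and initial states; (3) consolidates the latent skills into a valid manifold, forming a motion prior that generalizes.
Result: InterPrior generalizes to unseen objects and tasks, supports user-interactive control, and shows potential for deployment on real robots.
Conclusion: By combining imitation learning with reinforcement learning and adding physics-aware data augmentation, InterPrior builds a scalable, generalizable, physically consistent whole-body motion prior that markedly improves the adaptability and robustness of humanoids in complex human-object interaction tasks.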
Abstract: Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.
### [164] [Thinking with Geometry: Active Geometry Integration for Spatial Reasoning](https://arxiv.org/abs/2602.06037)
*Haoyuan Li,Qihang Cao,Tao Tang,Kun Xiang,Zihan Guo,Jianhua Han,Hang Xu,Xiaodan Liang*
Main category: cs.CV
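TL;DR: This paper proposes GeoThinker, a framework that uses Spatial-Grounded Fusion and an Importance Gating mechanism to let multimodal large language models actively and selectively integrate geometric information, markedly improving spatial reasoning.
Details
Motivation: Existing methods fuse geometric information passively, which causes semantic-geometry misalignment and redundant signals, and they lack the ability to actively perceive spatial structure.
Method: GeoThinker applies Spatial-Grounded Fusion with frame-strict cross-attention at selected VLM layers, so that semantic visual priors selectively query and integrate task-relevant geometric evidence. Importance Gating is added to strengthen attention to key spatial structures.
Result: GeoThinker reaches a peak score of 72.6 on VSI-Bench, setting a new state of the art in spatial intelligence, and shows strong generalization and improved spatial perception on complex downstream tasks such as embodied referring and autonomous driving.
Conclusion: The ability to actively integrate spatial structures is key to next-generation spatial intelligence, and GeoThinker validates the effectiveness and generality of this paradigm.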
Abstract: Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
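
Illustrative sketch (not taken from the paper): the abstract describes frame-strict cross-attention calibrated by Importance Gating but gives no formulas. The minimal pure-Python example below assumes that "frame-strict" means a semantic token attends only to geometry tokens from its own frame, and that gating adds the log of a positive per-token importance score to the attention logits. The function name, argument layout, and gating form are all hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def frame_strict_cross_attention(queries, q_frames, keys, values, k_frames, gates):
    """Each query (semantic token) attends only to geometry tokens of its own frame.

    gates: positive importance score per geometry token (assumed form of gating);
    log(gate) is added to the attention logit before the softmax.
    """
    dim_v = len(values[0])
    outputs = []
    for q, frame in zip(queries, q_frames):
        # Frame-strict mask: keep only geometry tokens from the same frame.
        idx = [j for j, kf in enumerate(k_frames) if kf == frame]
        if not idx:
            outputs.append([0.0] * dim_v)
            continue
        scale = math.sqrt(len(q))
        logits = [dot(q, keys[j]) / scale + math.log(gates[j]) for j in idx]
        weights = softmax(logits)
        outputs.append([sum(w * values[j][d] for w, j in zip(weights, idx))
                        for d in range(dim_v)])
    return outputs

# Two geometry tokens in frame 0 with equal similarity but different importance.
out = frame_strict_cross_attention(
    queries=[[1.0, 0.0]], q_frames=[0],
    keys=[[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    k_frames=[0, 0, 1],
    gates=[3.0, 1.0, 1.0],
)
print(out)
```

In this toy call the frame-1 geometry token is never attended to, and the gate ratio of 3:1 shifts attention toward the first frame-0 token.
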
### [165] [SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs](https://arxiv.org/abs/2602.06040)
*Jintao Tong,Shilin Yan,Hongwei Xue,Xiaojun Tang,Kunyu Shi,Guannan Zhang,Ruixuan Li,Yixiong Zou*
Main category: cs.CV
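TL;DR: SwimBird is a multimodal large language model with switchable reasoning modes. It adaptively chooses among text-only, vision-only, and interleaved vision-text reasoning based on the input, preserving textual logic while substantially improving performance on vision-dense tasks.
Details
Motivation: Most existing MLLMs reason only with a textual chain of thought (CoT) and struggle on vision-intensive tasks. Methods that inject a fixed number of visual hidden states improve visual ability but degrade text-based logical reasoning. The core problem is a rigid reasoning pattern that cannot adapt to different queries.
Method: SwimBird adopts a hybrid autoregressive formulation that unifies text token prediction with visual embedding prediction, and constructs SwimBird-SFT-92K, a supervised fine-tuning dataset covering all three reasoning modes, to enable input-conditioned dynamic switching between modes.
Result: SwimBird achieves state-of-the-art results on multiple benchmarks covering textual reasoning and complex visual understanding, with greater robustness and overall gains compared with fixed-pattern methods.
Conclusion: Dynamic, query-adaptive switching between multimodal reasoning modes is a key path to improving the overall capability of MLLMs, and SwimBird validates the effectiveness and feasibility of this paradigm.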
Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
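
Illustrative sketch (not taken from the paper): the abstract states that the hybrid autoregressive formulation unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, without giving the objective. The pure-Python example below assumes cross-entropy for text steps, mean squared error for visual steps, and a simple weighted sum. The step layout and the `embed_weight` parameter are hypothetical.

```python
import math

def cross_entropy(logits, target):
    # Negative log-likelihood of `target` under softmax(logits), computed stably.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def mse(pred, target):
    # Mean squared error between a predicted and a target embedding.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def hybrid_loss(steps, embed_weight=1.0):
    """Average text loss plus weighted average visual-embedding loss.

    steps: list of dicts, each either
      {"kind": "text", "logits": [...], "target": int} or
      {"kind": "visual", "pred": [...], "target": [...]}.
    """
    text_losses, visual_losses = [], []
    for step in steps:
        if step["kind"] == "text":
            text_losses.append(cross_entropy(step["logits"], step["target"]))
        else:
            visual_losses.append(mse(step["pred"], step["target"]))
    text = sum(text_losses) / len(text_losses) if text_losses else 0.0
    visual = sum(visual_losses) / len(visual_losses) if visual_losses else 0.0
    return text + embed_weight * visual

# An interleaved sequence: one textual step followed by one visual step.
sequence = [
    {"kind": "text", "logits": [0.0, 0.0], "target": 0},
    {"kind": "visual", "pred": [1.0, 1.0], "target": [0.0, 0.0]},
]
print(hybrid_loss(sequence))
```

A text-only or vision-only sequence simply contributes one of the two terms, which is one way a single objective could cover all three reasoning modes.
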
### [166] [Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning](https://arxiv.org/abs/2602.06041)
*Xuejun Zhang,Aditi Tiwari,Zhenhailong Wang,Heng Ji*
Main category: cs.CV
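TL;DR: This paper proposes CAMCUE, a framework that explicitly uses camera pose as a geometric anchor for cross-view fusion of multi-view images and for novel-view reasoning, markedly improving multi-image spatial reasoning and sharply reducing inference time.
Details
Motivation: Current multimodal large language models still struggle with multi-image spatial reasoning, especially on tasks that require building a coherent 3D scene understanding from multiple views and reasoning from a new, language-specified viewpoint (perspective taking).
Method: CAMCUE (1) injects each image's camera pose into its visual tokens; (2) maps natural-language viewpoint descriptions to a target camera pose; (3) synthesizes a pose-conditioned imagined target view to support answering. The authors also build CAMCUE-DATA, with 27,668 training and 508 test instances containing multi-view images, poses, and diverse viewpoint descriptions and questions.
Result: CAMCUE improves overall accuracy by 9.06%. When predicting the target pose from natural-language viewpoint descriptions, it reaches over 90% rotation accuracy (within 20°) and high translation accuracy (error ≤ 0.5). Inference time drops from 256.6 s to 1.45 s per example.
Conclusion: Explicit pose modeling and direct language-to-pose grounding effectively improve the performance and efficiency of multi-image spatial reasoning, offering a viable path toward real-time interactive applications.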
Abstract: Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
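
Illustrative sketch (not taken from the paper): the abstract reports rotation accuracy within 20° and translation accuracy within a 0.5 error threshold but does not define the error measures. The pure-Python example below assumes the common choices of geodesic angle between rotation matrices and Euclidean distance between translation vectors. The function names and the pose dictionary layout are hypothetical, and the unit of the 0.5 threshold is not stated in the source.

```python
import math

def rotation_error_deg(R_pred, R_gt):
    # Geodesic angle between two 3x3 rotations: acos((trace(R_pred^T R_gt) - 1) / 2).
    # trace(A^T B) equals the sum of elementwise products of A and B.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for float safety
    return math.degrees(math.acos(cos_theta))

def translation_error(t_pred, t_gt):
    # Euclidean distance between two translation vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_pred, t_gt)))

def rot_z(deg):
    # Rotation about the z-axis by `deg` degrees, used here only to build test poses.
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def pose_accuracy(preds, gts, rot_thresh_deg=20.0, trans_thresh=0.5):
    # Fraction of examples whose rotation / translation error is within threshold.
    n = len(preds)
    rot_hits = sum(rotation_error_deg(p["R"], g["R"]) <= rot_thresh_deg
                   for p, g in zip(preds, gts))
    trans_hits = sum(translation_error(p["t"], g["t"]) <= trans_thresh
                     for p, g in zip(preds, gts))
    return rot_hits / n, trans_hits / n

preds = [{"R": rot_z(10), "t": [0.0, 0.0, 0.0]}, {"R": rot_z(45), "t": [1.0, 0.0, 0.0]}]
gts = [{"R": rot_z(0), "t": [0.3, 0.0, 0.0]}, {"R": rot_z(0), "t": [1.0, 0.0, 0.2]}]
print(pose_accuracy(preds, gts))
```

In this toy set the first prediction is 10° off and the second 45° off, so only one of the two passes the 20° rotation threshold, while both translations fall within 0.5.
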