cs.CL

[1] Capacity Constraints and the Multilingual Penalty for Lexical Disambiguation

Sean Trott,Pamela D. Rivière

Main category: cs.CL

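TL;DR: This paper studies the "multilingual penalty", the tendency of multilingual language models to underperform monolingual models on lexical disambiguation. It quantifies the effect through controlled experiments and examines three capacity constraints: representational, attentional, and vocabulary-related.

Details Motivation: Multilingual language models sometimes underperform their monolingual counterparts, possibly because of limited model capacity; this paper aims to quantify and explain this "multilingual penalty". Method: Evaluate lexical disambiguation on controlled datasets of human relatedness judgments in English and Spanish; compare monolingual and multilingual LMs from the same families; analyze three potential capacity constraints: embedding isotropy, attention, and vocabulary tokenization. Result: Multilingual models perform consistently worse on lexical disambiguation; they show reduced embedding isotropy, reduced attention to disambiguating cues, and increased multi-token segmentation; these factors statistically account for the performance difference formerly attributed to multilingual status. Conclusion: Multilingual LMs do suffer from multiple capacity constraints, and these constraints are associated with reduced disambiguation performance. Abstract: Multilingual language models (LMs) sometimes under-perform their monolingual counterparts, possibly due to capacity limitations. We quantify this "multilingual penalty" for lexical disambiguation, a task requiring precise semantic representations and contextualization mechanisms, using controlled datasets of human relatedness judgments for ambiguous words in both English and Spanish. Comparing monolingual and multilingual LMs from the same families, we find consistently reduced performance in multilingual LMs. We then explore three potential capacity constraints: representational (reduced embedding isotropy), attentional (reduced attention to disambiguating cues), and vocabulary-related (increased multi-token segmentation). Multilingual LMs show some evidence of all three limitations; moreover, these factors statistically account for the variance formerly attributed to a model's multilingual status. These findings suggest both that multilingual LMs do suffer from multiple capacity constraints, and that these constraints correlate with reduced disambiguation performance.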
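As a rough illustration of the representational constraint in [1], embedding isotropy is often approximated by the mean pairwise cosine similarity of a set of embeddings: values near or below 0 suggest directions that are spread out, while values near 1 suggest a narrow cone. The sketch below is a minimal pure-Python version under that assumption. The function names and toy vectors are hypothetical, and the paper may well use a different isotropy measure.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors.

    Used here as a simple (an)isotropy proxy: higher values mean the
    embeddings occupy a narrower region of the space.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings" (hypothetical data, for illustration only).
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]   # evenly spread
narrow = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0], [1.0, -0.1]]    # narrow cone

print("spread:", mean_pairwise_cosine(spread))
print("narrow:", mean_pairwise_cosine(narrow))
```

On real models the vectors would be contextual embeddings sampled from a given layer; comparing the score between a monolingual and a multilingual model of the same family mirrors the comparison described in the abstract.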

[2] Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories

Sidi Lu,Zhenwen Liang,Dongyang Ma,Yan Wang,Haitao Mi,Dong Yu

Main category: cs.CL

TL;DR: This paper proposes Locas, a locally-supported parametric memory that can be flexibly offloaded from or merged into model parameters, supports efficient continual learning, and is validated on language modeling and long-context question answering.

Details Motivation: Bridge test-time training with a new type of parametric memory that can be flexibly offloaded or merged, addressing catastrophic forgetting and efficiency in continual learning. Method: Propose Locas (Locally-Supported parametric memory), which shares the design of transformer FFN blocks and comes in two variants: a conventional two-layer MLP (with a clearer theoretical guarantee) and a GLU-FFN structure (compatible with SOTA LLMs); emphasize principled initialization of the low-rank sideway FFN from model parameters, activations, or gradients. Result: With as little as 0.02% additional parameters, Locas-GLU stores information from past context on PG-19 and LoCoMo while using a much smaller context window; MMLU evaluation shows that it permanentizes context knowledge with minimal catastrophic forgetting of existing knowledge. Conclusion: Locas is an efficient, compatible, and theoretically grounded parametric memory mechanism that offers a new approach to test-time continual learning, mitigating catastrophic forgetting while preserving generalization. Abstract: In this paper, we aim to bridge test-time training with a new type of parametric memory that can be flexibly offloaded from or merged into model parameters. We present Locas, a Locally-Supported parametric memory that shares the design of FFN blocks in modern transformers, allowing it to be flexibly permanentized into the model parameters while supporting efficient continual learning. We discuss two major variants of Locas: one with a conventional two-layer MLP design that has a clearer theoretical guarantee; the other shares the same GLU-FFN structure with SOTA LLMs, and can be easily attached to existing models for both parameter-efficient and computation-efficient continual learning. Crucially, we show that proper initialization of such low-rank sideway-FFN-style memories (performed in a principled way by reusing model parameters, activations and/or gradients) is essential for fast convergence, improved generalization, and catastrophic forgetting prevention. We validate the proposed memory mechanism on the PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering tasks. With only 0.02% additional parameters in the lowest case, Locas-GLU is capable of storing the information from past context while maintaining a much smaller context window. In addition, we also test the model's general capability loss after memorizing the whole book with Locas, through comparative MMLU evaluation. Results show the promising ability of Locas to permanentize past context into parametric knowledge with minimized catastrophic forgetting of the model's existing internal knowledge.

[3] Multilingual Extraction and Recognition of Implicit Discourse Relations in Speech and Text

Ahmed Ruby,Christian Hardmeier,Sara Stymne

Main category: cs.CL

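TL;DR: This paper proposes a multilingual, multimodal approach to implicit discourse relation classification: it builds a text-audio dataset covering English, French, and Spanish, jointly models text and speech with Qwen2-Audio, and shows that multimodal fusion and cross-lingual transfer help low-resource languages.

Details Motivation: Implicit discourse relation classification depends on inferring meaning from context, and text alone does not always capture contextual cues that are spread across modalities and vary across languages. Method: Build a multilingual, multimodal dataset (English, French, Spanish) with an automatic method designed for distantly related and unrelated language pairs, and classify implicit discourse relations across languages by fusing textual and acoustic information with Qwen2-Audio. Result: Text-based models outperform audio-based models, but multimodal fusion can improve performance; cross-lingual transfer substantially improves results for low-resource languages. Conclusion: Joint multimodal and cross-lingual modeling is an important route to better implicit discourse relation classification, especially for low-resource languages. Abstract: Implicit discourse relation classification is a challenging task, as it requires inferring meaning from context. While contextual cues can be distributed across modalities and vary across languages, they are not always captured by text alone. To address this, we introduce an automatic method for distantly related and unrelated language pairs to construct a multilingual and multimodal dataset for implicit discourse relations in English, French, and Spanish. For classification, we propose a multimodal approach that integrates textual and acoustic information through Qwen2-Audio, allowing joint modeling of text and audio for implicit discourse relation classification across languages. We find that while text-based models outperform audio-based models, integrating both modalities can enhance performance, and cross-lingual transfer can provide substantial improvements for low-resource languages.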

[4] GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek

Yang Zhang,Mersin Konomi,Christos Xypolopoulos,Konstantinos Divriotis,Konstantinos Skianis,Giannis Nikolentzos,Giorgos Stamou,Guokan Shang,Michalis Vazirgiannis

Main category: cs.CL

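TL;DR: This paper introduces GreekMMLU, a native-sourced benchmark for multitask language understanding in Greek, with 21,805 multiple-choice questions across 45 subject areas, all drawn from Greek academic, professional, and governmental exams; 16,857 questions are publicly released and 4,948 are reserved for a private leaderboard. Evaluations of more than 80 LLMs reveal substantial performance gaps between frontier and open-weight models and between Greek-adapted and general multilingual models, and the paper systematically analyzes the key factors affecting performance.

Details Motivation: Existing Greek benchmarks are mostly translated from English, so they do not reflect Greek linguistic and cultural characteristics, and reliable evaluation tools based on authentic native content are lacking. Method: Build GreekMMLU, a native-sourced multitask benchmark covering 45 subjects and 21,805 questions, organized under a newly defined subject taxonomy and annotated with educational difficulty levels; source all questions from Greek exams; release 16,857 questions publicly and reserve 4,948 for private evaluation; systematically evaluate more than 80 LLMs and analyze the factors that influence performance. Result: Substantial performance gaps appear between frontier and open-weight models and between Greek-adapted and general multilingual models; model scale, adaptation, and prompting all influence performance in Greek. Conclusion: GreekMMLU fills the gap left by the lack of a high-quality, native-sourced Greek benchmark and provides a reliable means of evaluating and improving LLM capabilities in Greek. Abstract: Large Language Models (LLMs) are commonly trained on multilingual corpora that include Greek, yet reliable evaluation benchmarks for Greek, particularly those based on authentic, native-sourced content, remain limited. Existing datasets are often machine-translated from English, failing to capture Greek linguistic and cultural characteristics. We introduce GreekMMLU, a native-sourced benchmark for massive multitask language understanding in Greek, comprising 21,805 multiple-choice questions across 45 subject areas, organized under a newly defined subject taxonomy and annotated with educational difficulty levels spanning primary to professional examinations. All questions are sourced or authored in Greek from academic, professional, and governmental exams. We publicly release 16,857 samples and reserve 4,948 samples for a private leaderboard to enable robust and contamination-resistant evaluation. Evaluations of over 80 open- and closed-source LLMs reveal substantial performance gaps between frontier and open-weight models, as well as between Greek-adapted models and general multilingual ones. Finally, we provide a systematic analysis of factors influencing performance, including model scale, adaptation, and prompting, and derive insights for improving LLM capabilities in Greek.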

[5] Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

Ziyuan Yang,Wenxuan Ding,Shangbin Feng,Yulia Tsvetkov

Main category: cs.CL

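TL;DR: This paper studies the safety risks posed by malicious models in multi-LM collaboration systems, quantifies their impact on system performance, and proposes mitigating that impact with external supervisors.

Details Motivation: As language models are increasingly deployed in multi-model collaboration systems, handling malicious or compromised models among them becomes a key safety problem. Method: Engineer four categories of malicious LMs, plug them into four popular types of model collaboration systems, and evaluate the impact across 10 datasets; then propose a mitigation strategy based on external supervisors that identify and mask out malicious models. Result: Malicious models substantially degrade multi-model systems, lowering performance in the reasoning and safety domains by 7.12% and 7.94% on average; the mitigation strategies recover 95.31% of the initial performance on average. Conclusion: Malicious models pose a serious threat to multi-LM collaboration systems; external supervision mitigates much of the damage, but making collaboration fully resistant to malicious models remains an open problem. Abstract: Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plugging them into four types of popular model collaboration systems, and evaluating the compromised systems across 10 datasets. We find that malicious models have a severe impact on the multi-LLM systems, especially for reasoning and safety domains where performance is lowered by 7.12% and 7.94% on average. We then propose mitigation strategies to alleviate the impact of malicious components, employing external supervisors that oversee model collaboration and disable or mask out those components to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.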

[6] The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems

Shangbin Feng,Kishan Panaganti,Yulia Tsvetkov,Wenhao Yu

Main category: cs.CL

TL;DR: This paper proposes the single-multi evolution loop, a method that distills the outputs of multi-model collaboration into single models and iteratively improves the collaboration system, raising both single-model performance and overall collaboration performance.

Details Motivation: Reduce the inefficiency and high deployment cost of multi-LM collaboration systems while keeping the performance benefits of collaboration. Method: 1) Distill the outputs of a multi-LM collaboration system into a single model; 2) propose the single-multi evolution loop: models collaborate, each distills from the collaborative outputs, the distilled models collaborate again, and the cycle repeats; 3) validate with experiments on 7 collaboration strategies and 15 tasks. Result: 1) Individual models improve by 8.0% on average, combining the strengths of collaboration with single-model efficiency; 2) collaboration systems improve by 14.9% on average after evolution; 3) the method outperforms existing evolutionary AI methods, is compatible with diverse settings, and helps solve problems the initial models struggle with. Conclusion: The single-multi evolution loop forms an ecosystem in which models improve one another and keep evolving, yielding an efficient, scalable, self-improving approach to model collaboration. Abstract: Model collaboration, in which multiple language models (LMs) work together, combines the strengths of diverse models at the cost of loading multiple LMs. We improve efficiency while preserving the strengths of collaboration by distilling collaborative patterns into a single model, where the model is trained on the outputs of the model collaboration system. At inference time, only the distilled model is employed: it imitates the collaboration while only incurring the cost of a single model. Furthermore, we propose the single-multi evolution loop: multiple LMs collaborate, each distills from the collaborative outputs, and these post-distillation improved LMs collaborate again, forming a collective evolution ecosystem where models evolve and self-improve by interacting with an environment of other models. Extensive experiments with 7 collaboration strategies and 15 tasks (QA, reasoning, factuality, etc.) demonstrate that: 1) individual models improve by 8.0% on average, absorbing the strengths of collaboration while reducing the cost to a single model; 2) the collaboration also benefits from the stronger and more synergistic LMs after distillation, improving over initial systems without evolution by 14.9% on average. Analysis reveals that the single-multi evolution loop outperforms various existing evolutionary AI methods, is compatible with diverse model/collaboration/distillation settings, and helps solve problems with which the initial model/system struggles.

[7] Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky

Hsuan-Yu Chou,Wajiha Naveed,Shuyan Zhou,Xiaowei Yang

Main category: cs.CL

TL;DR: This paper evaluates seven state-of-the-art language models (four proprietary, three open-weight) on social media content moderation and finds that open-weight models are comparable to proprietary ones in sensitivity and specificity, suggesting they can support privacy-preserving moderation on consumer-grade hardware.

Details Motivation: As internet access expands, exposure to harmful content grows and effective moderation is needed; the zero-shot ability of open-weight LLMs to detect harmful content remains unclear, and this paper evaluates it in practice. Method: On real Bluesky posts, compare the moderation performance of four proprietary and three open-weight LLMs, using Bluesky Moderation Service decisions and annotations by two authors as references, and compute sensitivity, specificity, and human-LLM agreement. Result: Open-weight LLMs overlap considerably with proprietary LLMs in sensitivity (81%–97% vs. 72%–98%) and specificity (91%–100% vs. 93%–99%); sensitivity and specificity trade off differently across violation types (rudeness, intolerance, threats); there is considerable agreement between human raters and LLMs. Conclusion: Open-weight LLMs can support privacy-first, on-device content moderation that balances community norms with individual preferences, suggesting new directions for moderation system design. Abstract: As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively utilized for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to outperform traditional machine learning models zero-shot, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments in reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81%–97%) and specificity (91%–100%) of the open-weight LLMs and those (72%–98% and 93%–99%) of the proprietary ones. Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
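The two headline metrics in [7] have standard definitions: sensitivity is the share of truly harmful posts that the model flags, TP / (TP + FN), and specificity is the share of benign posts that it leaves alone, TN / (TN + FP). The sketch below computes both from binary labels; the function name and the toy labels are hypothetical and are not taken from the paper's data.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    y_true / y_pred: sequences of 0/1, where 1 means "harmful".
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Guard against empty classes rather than dividing by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: 4 harmful posts and 6 benign posts (made-up labels).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Reporting the two numbers separately, as the abstract does, matters for moderation because missed harmful posts and wrongly flagged benign posts carry different costs.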

[8] Aligning Large Language Model Behavior with Human Citation Preferences

Kenichiro Ando,Tatsuya Harada

Main category: cs.CL

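TL;DR: This study examines what large language models (LLMs) tend to cite when generating text and how well that aligns with human preferences. It builds an annotated dataset covering eight citation-motivation types, finds that current models over-cite (e.g., content marked as needing citation on Wikipedia) and under-cite (e.g., numeric sentences and sentences with personal names), and improves alignment with Direct Preference Optimization (DPO).

Details Motivation: Prior work focuses on which reference documents LLMs should cite, but how LLMs recognize cite-worthiness and how that behavior can be controlled is poorly understood; this paper characterizes current LLM citation behavior and evaluates its consistency with human citation preferences. Method: Build a dataset of web-derived texts categorized into eight citation-motivation types (e.g., medical, numeric, personal names); collect pairwise human preference annotations exhaustively across all type combinations; quantify how LLM citation tendencies deviate from human preferences; fine-tune with Direct Preference Optimization (DPO) to improve alignment. Result: Humans most often seek citations for medical text, and stronger models show a similar tendency; models are up to 27% more likely than humans to cite text explicitly marked as needing citation on sources such as Wikipedia, and they select numeric sentences 22.6% less often and sentences with personal names 20.1% less often than humans; DPO can calibrate citation behavior. Conclusion: Current LLM citation behavior deviates systematically from human preferences, with both over-citation and under-citation; preference alignment methods such as DPO can reduce the deviation; the work lays a foundation for fine-grained study of LLM citation preferences and controllable citation generation. Abstract: Most services built on powerful large-scale language models (LLMs) add citations to their output to enhance credibility. Recent research has paid increasing attention to the question of what reference documents to link to outputs. However, how LLMs recognize cite-worthiness and how this process should be controlled remains underexplored. In this study, we focus on what kinds of content LLMs currently tend to cite and how well that behavior aligns with human preferences. We construct a dataset to characterize the relationship between human citation preferences and LLM behavior. Web-derived texts are categorized into eight citation-motivation types, and pairwise citation preferences are exhaustively evaluated across all type combinations to capture fine-grained contrasts. Our results show that humans most frequently seek citations for medical text, and stronger models display a similar tendency. We also find that current models are as much as 27% more likely than humans to add citations to text that is explicitly marked as needing citations on sources such as Wikipedia, and this overemphasis reduces alignment accuracy. Conversely, models systematically underselect numeric sentences (−22.6% relative to humans) and sentences containing personal names (−20.1%), categories for which humans typically demand citations. Furthermore, experiments with Direct Preference Optimization demonstrate that model behavior can be calibrated to better match human citation preferences. We expect this study to provide a foundation for more fine-grained investigations into LLM citation preferences.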

[9] Quantifying the Knowledge Proximity Between Academic and Industry Research: An Entity and Semantic Perspective

Hongye Zhao,Yi Zhao,Chengzhi Zhang

Main category: cs.CL

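TL;DR: This study quantifies the co-evolution trajectory of academia and industry through fine-grained knowledge entities and semantic space, showing that their knowledge proximity increases with technological change and that academia's knowledge dominance weakens during paradigm shifts.

Details Motivation: Existing studies measure academia-industry knowledge proximity with macro indicators (such as counts of co-authored papers) and lack fine-grained analysis of knowledge units in the literature, which limits understanding of knowledge proximity and may undermine collaboration frameworks and resource allocation efficiency. Method: Combine an entity level (extract knowledge entities with pre-trained models, measure sequence overlap with cosine similarity, analyze topological features with complex network analysis) with a semantic level (quantify cross-institutional textual similarity with unsupervised contrastive learning), and use citation distribution patterns to test the relationship between bidirectional knowledge flows and similarity. Result: Knowledge proximity between academia and industry rises overall, especially after technological change; academia's knowledge dominance weakens during technological paradigm shifts; the analysis provides textual evidence of bidirectional adaptation in co-evolution. Conclusion: Fine-grained knowledge analysis captures the dynamics of academia-industry co-evolution more accurately and offers a new basis for improving collaboration mechanisms and resource allocation. Abstract: Academia and industry are characterized by reciprocal shaping and a dynamic feedback mechanism. Despite distinct institutional logics, they have adapted closely in collaborative publishing and talent mobility, demonstrating tension between institutional divergence and intensive collaboration. Existing studies on their knowledge proximity mainly rely on macro indicators such as the number of collaborative papers or patents, lacking an analysis of knowledge units in the literature. This has led to an insufficient grasp of fine-grained knowledge proximity between industry and academia, potentially undermining collaboration frameworks and resource allocation efficiency. To remedy this limitation, this study quantifies the trajectory of academia-industry co-evolution through fine-grained entities and semantic space. At the entity level, we extract fine-grained knowledge entities via pre-trained models, measure sequence overlaps using cosine similarity, and analyze topological features through complex network analysis. At the semantic level, we employ unsupervised contrastive learning to quantify convergence in semantic spaces by measuring cross-institutional textual similarities. Finally, we use citation distribution patterns to examine correlations between bidirectional knowledge flows and similarity. Analysis reveals that knowledge proximity between academia and industry rises, particularly following technological change. This provides textual evidence of bidirectional adaptation in co-evolution. Additionally, academia's knowledge dominance weakens during technological paradigm shifts. The dataset and code for this paper can be accessed at https://github.com/tinierZhao/Academic-Industrial-associations.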

[10] Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

Jinchuan Tian,Haoran Wang,Bo-Hao Su,Chien-yu Huang,Qingzheng Wang,Jiatong Shi,William Chen,Xun Gong,Siddhant Arora,Chin-Jou Li,Masao Someki,Takashi Maekaku,Yusuke Shinohara,Jin Sakuma,Chao-Han Huck Yang,Shinji Watanabe

Main category: cs.CL

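TL;DR: Bagpiper is an 8B-parameter audio foundation model that, through large-scale audio-text alignment pre-training, learns a bidirectional mapping between audio and high-level cognitive concepts (such as transcription and audio event descriptions), enabling unified audio understanding and generation.

Details Motivation: Existing audio foundation models rely on rigid, task-specific supervision and address only isolated factors of audio, whereas humans process audio holistically, connecting physical signals with abstract cognitive concepts; the paper aims at a general audio foundation model that is closer to human cognition. Method: Propose Bagpiper, which uses rich natural-language descriptions (captions) as a high-level cognitive representation of audio and pre-trains bidirectional audio-text alignment on a 600B-token corpus; during fine-tuning, a caption-then-process workflow introduces an intermediate cognitive reasoning step and requires no task-specific priors. Result: Bagpiper outperforms Qwen-2.5-Omni on the MMAU and AIRBench audio understanding benchmarks, surpasses CosyVoice3 and TangoFlux in generation quality, and can synthesize arbitrary compositions of speech, music, and sound effects. Conclusion: Bagpiper is among the first foundation models to unify understanding and generation for general audio, supporting the use of cognition-oriented captions as a bridge for building audio foundation models. Abstract: Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, and it is capable of synthesizing arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works that achieve unified understanding and generation for general audio. Model, data, and code are available at Bagpiper Home Page.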

[11] FedMosaic: Federated Retrieval-Augmented Generation via Parametric Adapters

Zhilin Liang,Yuxiang Wang,Zimu Zhou,Hainan Zhang,Boyi Liu,Yongxin Tong

Main category: cs.CL

TL;DR: This paper proposes FedMosaic, a federated retrieval-augmented generation (FedRAG) framework built on parametric adapters; through semantic clustering and selective aggregation, it improves accuracy and sharply reduces storage and communication costs while preserving data privacy.

Details Motivation: Existing RAG assumes a centralized corpus, which does not work in privacy-sensitive settings where knowledge stays siloed; a federated RAG scheme is needed that is distributed and does not share raw documents. Method: Propose FedMosaic: 1) cluster semantically related documents into shared multi-document adapters, with document-specific masks to balance generality and specificity; 2) apply selective adapter aggregation that merges only relevant, non-conflicting adapters. Result: Average accuracy is 10.9% higher than SOTA methods across four task categories, storage cost falls by 78.8%–86.3%, and communication cost falls by 91.4%, with no raw documents transmitted. Conclusion: FedMosaic delivers efficient, low-overhead, privacy-preserving federated RAG and demonstrates the feasibility and advantages of parametric adapters for FedRAG. Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge to improve factuality and reduce hallucinations. Yet most deployments assume a centralized corpus, which is infeasible in privacy-aware domains where knowledge remains siloed. This motivates federated RAG (FedRAG), where a central LLM server collaborates with distributed silos without sharing raw documents. In-context RAG violates this requirement by transmitting verbatim documents, whereas parametric RAG encodes documents into lightweight adapters that merge with a frozen LLM at inference, avoiding raw-text exchange. We adopt the parametric approach but face two unique challenges induced by FedRAG: high storage and communication costs from per-document adapters, and destructive aggregation caused by indiscriminately merging multiple adapters. We present FedMosaic, the first federated RAG framework built on parametric adapters. FedMosaic clusters semantically related documents into multi-document adapters with document-specific masks to reduce overhead while preserving specificity, and performs selective adapter aggregation to combine only relevance-aligned, non-conflicting adapters. Experiments show that FedMosaic achieves an average 10.9% higher accuracy than state-of-the-art methods in four categories, while lowering storage costs by 78.8% to 86.3% and communication costs by 91.4%, and never sharing raw documents.

Guangwei Zhang,Jianing Zhu,Cheng Qian,Neil Gong,Rada Mihalcea,Zhaozhuo Xu,Jingrui He,Jiaqi Ma,Yun Huang,Chaowei Xiao,Bo Li,Ahmed Abbasi,Dongwon Lee,Heng Ji,Denghui Zhang

Main category: cs.CL

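TL;DR: Copyright Detective is the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in large language model (LLM) outputs; it frames copyright compliance assessment as an evidence discovery process, integrates multiple detection paradigms, and supports systematic auditing in black-box settings.

Details Motivation: Existing approaches reduce copyright issues to a static classification task, which cannot handle the complexity of copyright law; transparent, auditable copyright risk assessment is needed for black-box models. Method: Propose the Copyright Detective system, which integrates content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification, and supports systematic auditing through interactive prompting, response collection, and iterative workflows. Result: The system provides interpretable, interactive detection and visualization of verbatim memorization and paraphrase-level leakage in LLM outputs, supporting responsible deployment and transparent evaluation. Conclusion: Recasting copyright risk analysis as an evidence-driven forensic process is both feasible and necessary, and the framework offers a new approach to copyright compliance for black-box LLMs. Abstract: We present Copyright Detective, the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. Because copyright law is complex, the system treats the question of copyright infringement versus compliance as an evidence discovery process rather than a static classification task. It integrates multiple detection paradigms, including content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification, within a unified and extensible framework. Through interactive prompting, response collection, and iterative workflows, our system enables systematic auditing of verbatim memorization and paraphrase-level leakage, supporting responsible deployment and transparent evaluation of LLM copyright risks even with black-box access.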

[13] CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs

Haoran Li,Sucheng Ren,Alan Yuille,Feng Wang

Main category: cs.CL

TL;DR: This paper proposes CoPE, which soft-clips the low-frequency components of RoPE to unify OOD mitigation and semantic modeling, yielding significant gains at context lengths up to 256k.

Details Motivation: Existing methods for adapting RoPE to long contexts fall into two categories, OOD mitigation and semantic modeling, whose goals appear distinct and lack a unified view. Method: Propose CoPE, which soft-clips the low-frequency components of RoPE to mitigate OOD outliers, refine semantic signals, and avoid the spectral leakage caused by hard clipping. Result: In experiments at context lengths up to 256k, CoPE yields significant performance gains and sets a new state of the art for length generalization. Conclusion: With a minimalist intervention, CoPE unifies the two main objectives of RoPE adaptation; it is theoretically motivated and effective in practice. Abstract: Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out-of-distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) semantic modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping of the low-frequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state-of-the-art for length generalization. Our code, data, and models are available at https://github.com/hrlics/CoPE.

[14] Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Fanfan Liu,Youyang Yin,Peng Shi,Siqi Yang,Zhixiong Zeng,Haibo Qiu

Main category: cs.CL

TL;DR: This paper analyzes why response length evolves differently across Reinforcement Learning with Verifiable Rewards (RLVR) algorithms when they are used to improve the reasoning of LLMs and VLMs, proposes Length-Unbiased Sequence Policy Optimization (LUSPO) to resolve response length collapse, and reports state-of-the-art results on mathematical and multimodal reasoning.

Details Motivation: Response length evolves very differently across RLVR algorithms during training, and a theoretical explanation of the root cause is missing. Method: Analyze the components of mainstream RLVR algorithms theoretically to identify the factors that drive response length; based on the findings, propose LUSPO, which corrects the length bias in GSPO so that its loss is unbiased with respect to response length. Result: LUSPO consistently achieves better performance on mathematical reasoning benchmarks and in multimodal reasoning scenarios, and the empirical results position it as a state-of-the-art optimization strategy relative to GRPO and GSPO. Conclusion: Response length variation is driven by mechanisms internal to the algorithms; by removing length bias, LUSPO improves reasoning ability and offers a more robust optimization approach for RLVR training. Abstract: Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.

[15] Towards a Science of Collective AI: LLM-based Multi-Agent Systems Need a Transition from Blind Trial-and-Error to Rigorous Science

Jingru Fan,Dewen Liu,Yufan Dang,Huatao Li,Yuheng Wang,Wei Liu,Feiyu Duan,Xuanwen Ding,Shu Yao,Lin Wu,Ruijie Shi,Wai-Shing Leung,Yuan Cheng,Zhongyu Wei,Cheng Yang,Chen Qian,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

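TL;DR: This paper proposes a scientific design framework for multi-agent systems (MAS): it introduces a collaboration gain metric Γ to separate genuine collaboration gains from mere resource accumulation, and builds a factor library and attribution paradigm to move MAS research from empirical trial-and-error toward systematic science.

Details Motivation: Although LLM-driven multi-agent systems (MAS) have advanced, the field relies heavily on empirical trial-and-error and lacks a unified scientific framework; the core bottleneck is ambiguous attribution: there is no structured taxonomy of factors, and no unified metric that separates out budget effects. Method: Propose the collaboration gain metric Γ as a scientific standard for isolating intrinsic collaboration gains from gains due to increased resources; build a factor attribution paradigm on Γ; construct a systematic MAS factor library that divides the design space into control-level presets and information-level dynamics. Result: An evaluation and attribution framework for MAS centered on the collaboration gain Γ, accompanied by a structured factor library and an attribution analysis method, providing theoretical tools and a practical path for systematic MAS optimization. Conclusion: The framework marks a shift in MAS research from empirical engineering to design science and provides methodological support for a rigorous discipline of Collective AI. Abstract: Recent advancements in Large Language Models (LLMs) have greatly extended the capabilities of Multi-Agent Systems (MAS), demonstrating significant effectiveness across a wide range of complex and open-ended domains. However, despite this rapid progress, the field still relies heavily on empirical trial-and-error. It lacks the unified and principled scientific framework necessary for systematic optimization and improvement. This bottleneck stems from the ambiguity of attribution: first, the absence of a structured taxonomy of factors leaves researchers restricted to unguided adjustments; second, the lack of a unified metric makes it impossible to distinguish genuine collaboration gain from mere resource accumulation. In this paper, we advocate for a transition to design science through an integrated framework. We advocate establishing the collaboration gain metric ($\Gamma$) as the scientific standard to isolate intrinsic gains from increased budgets. Leveraging $\Gamma$, we propose a factor attribution paradigm to systematically identify collaboration-driving factors. To support this, we construct a systematic MAS factor library, structuring the design space into control-level presets and information-level dynamics. Ultimately, this framework facilitates the transition from blind experimentation to rigorous science, paving the way towards a true science of Collective AI.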

[16] MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning

Haojin Wang,Yike Wang,Shangbin Feng,Hannaneh Hajishirzi,Yulia Tsvetkov

Main category: cs.CL

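TL;DR: This paper proposes MentorCollab, in which a large model sparsely and selectively guides a small model at inference time; a lightweight verifier decides whether to adopt the large model's short lookahead generation, improving the small model's multi-step reasoning while greatly reducing calls to the large model.

Details Motivation: Large models reason well but are costly and often redundant; small models are efficient but weak at multi-step reasoning; existing collaboration methods tend to promote imitation, producing verbose reasoning without consistent error correction. Method: MentorCollab probes for divergences between the large and small model at randomly sampled token positions and uses a lightweight verifier to decide whether the small model should adopt a short lookahead segment from the large model (rather than handing over generation), giving sparse, selective guidance. Result: Across 15 SLM-LRM pairs and 3 domains (math reasoning, general knowledge, commonsense reasoning), performance improves in 12 settings, with average gains of 3.0% and up to 8.0%; the large model generates only 18.4% of tokens on average. Conclusion: Sparse, selective inference-time guidance is enough to recover large-model reasoning ability without substantial inference overhead, offering a new approach to efficient collaborative reasoning. Abstract: Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but their inference costs are high and they often generate redundant reasoning. Small language models (SLMs) are far more efficient, yet struggle on multi-step reasoning tasks. A natural idea is to let a large model guide a small one at inference time as a mentor, yet existing collaboration methods often promote imitation, resulting in verbose reasoning without consistent error correction. We propose MentorCollab, an inference-time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation. At randomly sampled token positions, we probe for divergences between the two models and use a lightweight verifier to decide whether the SLM should follow a short lookahead segment from its mentor or continue on its own. Across 15 SLM-LRM pairs and 3 domains (math reasoning, general knowledge, and commonsense reasoning), our method improves performance in 12 settings, with average gains of 3.0% and up to 8.0%, while having only 18.4% of tokens generated by the expensive mentor model on average. We find that short segments and selective probing are sufficient for effective collaboration. Our results show that selective inference-time guidance restores large-model reasoning ability without substantial inference overhead.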
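The control flow described in [16] (probe at random positions, check for divergence, let a verifier decide whether to adopt a short mentor segment) can be sketched with toy stand-ins. Everything below is an assumption made for illustration: the callables `slm_next`, `lrm_lookahead`, and `verifier`, the parameters `probe_prob` and `lookahead`, and the counting task are hypothetical, and the abstract does not specify the probing rate, segment length, or verifier design.

```python
import random


def mentor_collab_generate(slm_next, lrm_lookahead, verifier, prompt,
                           max_tokens=8, probe_prob=0.3, lookahead=3, seed=0):
    """Toy sketch of sparse, selective mentor guidance (not the paper's code).

    slm_next(tokens) -> next token proposed by the small model
    lrm_lookahead(tokens, k) -> list of up to k tokens from the mentor
    verifier(tokens, student_token, segment) -> True to adopt the segment
    Returns (tokens, share of generated tokens that came from the mentor).
    """
    rng = random.Random(seed)
    tokens = list(prompt)
    generated = 0
    mentor_tokens = 0
    while generated < max_tokens:
        student_token = slm_next(tokens)
        # Probe only at randomly sampled positions to keep mentor calls sparse.
        if rng.random() < probe_prob:
            segment = lrm_lookahead(tokens, lookahead)
            diverges = bool(segment) and segment[0] != student_token
            if diverges and verifier(tokens, student_token, segment):
                segment = segment[: max_tokens - generated]
                tokens.extend(segment)
                generated += len(segment)
                mentor_tokens += len(segment)
                continue
        tokens.append(student_token)
        generated += 1
    share = mentor_tokens / generated if generated else 0.0
    return tokens, share


# Hypothetical task: continue the sequence 0, 1, 2, ...
def toy_slm(tokens):
    # The "small model" slips (emits -1) at every fourth position.
    n = len(tokens)
    return -1 if n % 4 == 3 else n


def toy_lrm(tokens, k):
    # The "mentor" always continues the count correctly.
    n = len(tokens)
    return [n + j for j in range(k)]


def toy_verifier(tokens, student_token, segment):
    # Stand-in verifier that always trusts the mentor on divergence.
    return True


tokens_full, share_full = mentor_collab_generate(
    toy_slm, toy_lrm, toy_verifier, [0], probe_prob=1.0)
tokens_none, share_none = mentor_collab_generate(
    toy_slm, toy_lrm, toy_verifier, [0], probe_prob=0.0)
print(tokens_full, share_full)
print(tokens_none, share_none)
```

With probing at every position the toy mentor repairs both slips; with no probing the slips remain. Intermediate values of `probe_prob` trade mentor cost against the chance of catching a divergence, which is the trade-off the abstract reports in terms of the mentor's token share.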

[17] How Do Language Models Acquire Character-Level Information?

Soma Sato,Ryohei Sasano

Main category: cs.CL

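TL;DR: This paper investigates how language models implicitly encode character-level information, using controlled experiments to analyze how tokenization-dependent and tokenization-independent factors contribute to the acquisition of character knowledge.

Details Motivation: Language models implicitly encode character-level information even though it is not explicitly provided during training, but the underlying mechanism is unclear. Method: Train language models under controlled settings (e.g., a specified pre-training dataset or tokenizer), compare them with models trained under standard settings, and divide the contributing factors into tokenization-dependent and tokenization-independent ones. Result: The main tokenization-related factors are merge rules and orthographic constraints; the main tokenization-independent factors are semantic associations of substrings and syntactic information. Conclusion: Language models acquire character-level knowledge implicitly through both kinds of factors: the former arise from the tokenization process itself, the latter from the semantic and syntactic structure of language. Abstract: Language models (LMs) have been reported to implicitly encode character-level information, despite such information not being explicitly provided during training. However, the mechanisms underlying this phenomenon remain largely unexplored. To reveal the mechanisms, we analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings, such as specifying the pre-training dataset or tokenizer, with those trained under standard settings. We categorize the contributing factors into those dependent on tokenization and those independent of it. Our analysis reveals that merge rules and orthographic constraints constitute primary factors arising from tokenization, whereas semantic associations of substrings and syntactic information function as key factors independent of tokenization.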

[18] PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning

Jun Rao,Zixiong Yu,Xuebo Liu,Guhan Chen,Jing Li,Jiansheng Wei,Xiaojun Meng,Min Zhang

Main category: cs.CL

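TL;DR: This paper proposes PACE, which replaces brute-force mining with a generation-based corrective strategy, achieving better alignment than DPO-R1 on mathematical reasoning with less compute and improved robustness.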

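Details Motivation: The standard DPO-R1 approach relies on extensive Best-of-N sampling (N≥8) to mine high-quality reasoning trajectories, but the authors find that in mathematical reasoning it suffers from diminishing returns and even policy collapse, amplifies verifier noise, and induces harmful distribution shift. Method: Propose PACE (Proximal Alignment via Corrective Exploration), which replaces brute-force sampling with a generation-based corrective strategy on a very small budget (2 […]

### [19] [Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks](https://arxiv.org/abs/2602.05374)

*Chaimae Abouzahir,Congbo Ma,Nizar Habash,Farah E. Shamout*

Main category: cs.CL

TL;DR: This study empirically analyzes the cross-lingual performance of large language models (LLMs) on Arabic and English medical question answering. It finds a language-driven performance gap that widens as task complexity increases, identifies fragmented tokenization of Arabic medical text and weak correlation between model confidence and answer correctness, and argues for language-aware design and evaluation strategies.
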
Details Motivation: Most existing LLMs are English-centric, which limits their robustness and reliability for linguistically diverse communities, especially for low-resource languages; why performance drops on medical tasks in languages such as Arabic is still unclear. Method: Conduct a cross-lingual empirical analysis of Arabic and English medical question answering, combining tokenization analysis and reliability analysis to examine the performance gap, structural issues in the text, and the relationship between model confidence and correctness. Result: A significant language-driven performance gap that widens with task complexity; structural tokenization fragmentation in Arabic medical text; limited correlation between model-reported confidence and explanations and actual correctness. Conclusion: Language-aware strategies are needed in the design and evaluation of medical LLMs to improve robustness and reliability in multilingual settings. Abstract: In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.

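### [20] [IESR: Efficient MCTS-Based Modular Reasoning for Text-to-SQL with Large Language Models](https://arxiv.org/abs/2602.05385)

*Tao Liu,Jiafan Lu,Bohan Yu,Pengcheng Wu,Liu Haixin,Guoyu Xu,Li Xiangheng,Lixiao Li,Jiaming Hou,Zhao Shijun,Xinglin Lyu,Kunli Zhang,Yuxiang Jia,Hongyin Zan*

Main category: cs.CL

TL;DR: This paper proposes IESR, a framework that uses lightweight large language models to address complex reasoning, domain knowledge, and hypothetical queries in Text-to-SQL. Through information enhancement, multi-path MCTS reasoning, and trajectory consistency verification, it reaches state-of-the-art performance on the LogicCat and Archer datasets without fine-tuning.
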
Details Motivation: Existing Text-to-SQL methods perform poorly on complex reasoning, domain knowledge, and hypothetical queries, and are costly to deploy in enterprises. Method: Propose the IESR framework: (i) use LLMs for key information understanding and schema linking, and decouple mathematical computation from SQL generation; (ii) multi-path reasoning based on Monte Carlo Tree Search (MCTS) with majority voting; (iii) a trajectory consistency verification module with a discriminator. Result: State-of-the-art results on LogicCat (24.28 EX) and Archer (37.28 EX) using only lightweight models without fine-tuning; current coder models show notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning. Conclusion: IESR gives lightweight models an efficient and accurate Text-to-SQL solution and exposes key capability gaps in existing coder models, pointing to directions for future research. Abstract: Text-to-SQL is a key natural language processing task that maps natural language questions to SQL queries, enabling intuitive interaction with web-based databases. Although current methods perform well on benchmarks like BIRD and Spider, they struggle with complex reasoning, domain knowledge, and hypothetical queries, and remain costly in enterprise deployment. To address these issues, we propose a framework named IESR (Information Enhanced Structured Reasoning) for lightweight large language models. It (i) leverages LLMs for key information understanding and schema linking and decouples mathematical computation from SQL generation, (ii) integrates a multi-path reasoning mechanism based on Monte Carlo Tree Search (MCTS) with majority voting, and (iii) introduces a trajectory consistency verification module with a discriminator model to ensure accuracy and consistency. Experimental results demonstrate that IESR achieves state-of-the-art performance on the complex reasoning benchmark LogicCat (24.28 EX) and the Archer dataset (37.28 EX) using only compact lightweight models without fine-tuning. Furthermore, our analysis reveals that current coder models exhibit notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning, highlighting important directions for future research. We released code at https://github.com/Ffunkytao/IESR-SLM.

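### [21] [Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances](https://arxiv.org/abs/2602.05392)

*Jiyun Chun,Eric Fosler-Lussier,Michael White,Andrew Perrault*

Main category: cs.CL

TL;DR: This paper proposes an LLM-based evaluation framework for measuring the quality of children's utterances in adult-child dialogue along two dimensions, Expansion and Independence. It goes beyond traditional length-dominated metrics and shows developmental validity, predictive value, and semantic sensitivity.
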
Details Motivation: Existing metrics (such as MLU, vocd-D, and Flesch-Kincaid) depend too heavily on length and ignore conversational context, so they miss key aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. Method: Build an LLM-as-a-judge framework that first classifies the type of the previous adult utterance and then scores the child's response along two axes, Expansion (contextual elaboration and inferential depth) and Independence (autonomy in advancing the discourse); validate it through developmental analysis (age patterns), a prediction task (age estimation), and a semantic check (differences across discourse relations). Result: The proposed metrics show clear age-related patterns, better age prediction than baselines, and sensitivity to discourse relations, and they align with human ratings, supporting large-scale automated evaluation. Conclusion: The framework shifts child utterance assessment from measuring length to evaluating contextual contribution, which fits the nature of language development more closely and offers a new paradigm for child language research and for evaluating interactive AI. Abstract: Evaluating the quality of children's utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child's response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child's contribution to advancing the discourse). These axes reflect fundamental dimensions in child language development, where Expansion captures elaboration, clause combining, and causal and contrastive connectives. Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child's speech contributes to and advances the conversation within its context.

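### [22] [Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better](https://arxiv.org/abs/2602.05393)

*Ji Zhao,Yufei Gu,Shitong Shao,Xun Zhou,Liang Xiang,Zeke Xie*

Main category: cs.CL

TL;DR: This paper proposes the Late-to-Early Training (LET) paradigm, which uses late-layer representations from a small pretrained model to guide the early training of a larger language model, significantly accelerating convergence and improving performance.
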
Details Motivation: Pretraining large language models is computationally expensive, while the many existing small pretrained models have not been used effectively to accelerate the training of larger ones. Method: Propose the LET paradigm, which uses late-layer representations of an already pretrained model as a supervision signal so that the early layers of the target model learn late-stage knowledge early in training; it comprises two mechanisms, late-to-early-step and late-to-early-layer learning. Result: Validated on 1.4B and 7B models; training a 1.4B model on the Pile achieves up to 1.6× speedup and nearly 5% higher downstream task accuracy, even when the teacher has only one tenth of the student's parameters. Conclusion: LET is a new paradigm for efficiently reusing the knowledge of existing small models to accelerate large-model training, with clear gains in both training speed and model performance. Abstract: As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: *Can we leverage existing small pretrained models to accelerate the training of larger models?* In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e. late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6× speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10× fewer parameters than the target model.

### [23] [OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration](https://arxiv.org/abs/2602.05400)

*Shaobo Wang,Xuan Ouyang,Tianyi Xu,Yuzheng Hu,Jialin Liu,Guo Chen,Tianyu Zhang,Junhao Zheng,Kexin Yang,Xingzhang Ren,Dayiheng Liu,Linfeng Zhang*

Main category: cs.CL

TL;DR: OPUS is a dynamic data selection framework that defines data utility in the optimizer-induced update space and combines the Ghost technique with Boltzmann sampling, substantially improving pre-training data efficiency at very low additional cost.

Details Motivation: As high-quality public text approaches exhaustion (the Data Wall), pre-training is shifting from more tokens to better tokens; existing methods either rely on static heuristic filters that ignore training dynamics or use optimizer-agnostic gradient criteria, and so do not model the behavior of modern optimizers. Method: Propose the OPUS framework, which defines data utility in the optimizer-induced update space and scores candidates by projecting their effective updates onto a target direction derived from a stable, in-distribution proxy; Ghost with CountSketch accelerates the computation, and Boltzmann sampling preserves diversity. Result: In GPT-2 Large/XL pre-training, 30B tokens are enough to surpass full 200B-token training; the method remains effective when combined with industrial-level static filters; in continued pre-training of Qwen3-8B-Base, 0.5B tokens outperform full training with 3B tokens, at only 4.7% additional compute. Conclusion: OPUS achieves efficient, optimizer-aware dynamic data selection and improves data efficiency across models, datasets, optimizers, and scales, offering a new approach to breaking through the Data Wall. Abstract: As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.

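
The role Boltzmann sampling plays for diversity in the OPUS abstract can be illustrated with a generic sketch. This is not the OPUS implementation: the utility scores are taken as given, and the `temperature` parameter, the without-replacement scheme, and the function name are assumptions for illustration.

```python
import math
import random
from typing import List, Sequence


def boltzmann_sample(
    scores: Sequence[float],
    k: int,
    temperature: float = 1.0,  # assumed knob: lower values approach top-k selection
    seed: int = 0,
) -> List[int]:
    """Draw k distinct indices with probability proportional to exp(score / T)."""
    rng = random.Random(seed)
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp((s - top) / temperature) for s in scores]
    remaining = list(range(len(scores)))
    chosen: List[int] = []
    for _ in range(min(k, len(scores))):
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen
```

Compared with a hard top-k cut on utility, sampling in this way still favors high-scoring candidates but leaves lower-scoring ones a nonzero chance of selection, which is one common way to keep a selected batch diverse.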
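### [24] [Grammatical Error Correction Evaluation by Optimally Transporting Edit Representation](https://arxiv.org/abs/2602.05419)

*Takumi Goto,Yusuke Sakai,Taro Watanabe*

Main category: cs.CL

TL;DR: This paper proposes UOT-ERRANT, a new evaluation metric for grammatical error correction (GEC) based on edit vectors and unbalanced optimal transport (UOT). By modeling the alignment between hypothesis and reference at the level of ERRANT edits, it improves automatic evaluation, especially in the +Fluency setting, and is interpretable.
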
Details Motivation: Existing GEC metrics based on embedding similarity (such as BERTScore) are often ineffective because many words in the source sentence remain unchanged; a finer-grained evaluation focused on GEC-specific edit operations is needed. Method: Represent the grammatical edits extracted by ERRANT as edit vectors and build the UOT-ERRANT metric, which uses unbalanced optimal transport to transport the edit vectors of the hypothesis to those of the reference, giving a soft edit-level alignment. Result: On the SEEDA meta-evaluation benchmark, UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain; its transport plan provides an interpretable soft edit alignment that supports system ranking and error analysis. Conclusion: UOT-ERRANT is a more precise and more interpretable GEC metric that moves evaluation from sentence-level similarity toward edit-level alignment. Abstract: Automatic evaluation in grammatical error correction (GEC) is crucial for selecting the best-performing systems. Currently, reference-based metrics are a popular choice, which basically measure the similarity between hypothesis and reference sentences. However, similarity measures based on embeddings, such as BERTScore, are often ineffective, since many words in the source sentences remain unchanged in both the hypothesis and the reference. This study focuses on edits specifically designed for GEC, i.e., ERRANT, and computes similarity measured over the edits from the source sentence. To this end, we propose the edit vector, a representation for an edit, and introduce a new metric, UOT-ERRANT, which transports these edit vectors from hypothesis to reference using unbalanced optimal transport. Experiments with SEEDA meta-evaluation show that UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain where many edits occur. Moreover, our method is highly interpretable because the transport plan can be interpreted as a soft edit alignment, making UOT-ERRANT a useful metric for both system ranking and analyzing GEC systems. Our code is available from https://github.com/gotutiyan/uot-errant.

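### [25] [Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models](https://arxiv.org/abs/2602.05437)

*Basel Mousi,Fahim Dalvi,Shammur Chowdhury,Firoj Alam,Nadir Durrani*

Main category: cs.CL

TL;DR: This paper introduces the M2CQA benchmark for evaluating counterfactual hallucination in vision-language models across cultures and languages (English, Arabic, and Arabic dialects), together with the CFHR metric for quantifying it. Experiments show that counterfactual hallucination rises markedly in Arabic, especially in dialects, and that reasoning-first prompting makes it worse.
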
Details Motivation: Existing hallucination benchmarks do not test whether vision-language models wrongly accept culturally plausible but visually incorrect interpretations, particularly in non-Western contexts. Method: Build M2CQA, a multimodal benchmark of images from 17 Middle Eastern and North African countries paired with contrastive true and counterfactual statements in English, Arabic, and Arabic dialects; propose the CounterFactual Hallucination Rate (CFHR) as a new metric; evaluate mainstream VLMs under several prompting strategies. Result: CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high; reasoning-first prompting increases hallucination, whereas answering before justifying improves robustness. Conclusion: Current VLMs face a serious risk of counterfactual hallucination in cross-cultural, multilingual settings, which calls for finer-grained, culturally sensitive evaluation and better prompting strategies. Abstract: Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M2CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We will make the experimental resources and dataset publicly available for the community.

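
The abstract defines CFHR as counterfactual acceptance conditioned on correctly answering the true statement. A minimal reading of that definition is sketched below; the per-item record layout and the function name are assumptions, and the paper's exact computation may differ.

```python
from typing import Iterable, Tuple


def cfhr(results: Iterable[Tuple[bool, bool]]) -> float:
    """Counterfactual hallucination rate over (true_correct, accepted_counterfactual) pairs.

    Only items whose true statement was answered correctly enter the
    denominator; the rate is the share of those items on which the model
    also accepted the counterfactual statement.
    """
    conditioned = [accepted for true_ok, accepted in results if true_ok]
    if not conditioned:
        return float("nan")  # undefined when no true statement was answered correctly
    return sum(conditioned) / len(conditioned)
```

Conditioning on the true statement is what separates this rate from plain accuracy: a model can answer most true statements correctly and still score a high CFHR if it also accepts their counterfactual twins.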
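### [26] [Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs](https://arxiv.org/abs/2602.05444)

*Yao Zhou,Zeen Song,Wenwen Qiang,Fengge Wu,Shuyi Zhou,Changwen Zheng,Hui Xiong*

Main category: cs.CL

TL;DR: This paper proposes the CFA² attack framework, which uses the front-door criterion from causal inference to strip away the safety alignment mechanism of large language models, yielding efficient and interpretable jailbreak attacks.
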
Details Motivation: Safety alignment mechanisms in LLMs often exist as latent states that obscure the model's true capabilities; modeling the safety mechanism as an unobserved confounder from a causal perspective can expose and bypass its influence. Method: Model the safety mechanism as an unobserved confounder in a causal graph and design the CFA² attack framework on Pearl's front-door criterion; use Sparse Autoencoders (SAEs) to strip defense-related features, and reduce the expensive marginalization to a low-complexity deterministic intervention. Result: CFA² achieves state-of-the-art attack success rates on multiple benchmarks while offering a mechanistic interpretation of the jailbreaking process. Conclusion: Causal modeling with front-door adjustment can decouple safety alignment from core capabilities, providing a new route to understanding and evaluating the robustness of LLM alignment. Abstract: Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. Then, we propose the Causal Front-Door Adjustment Attack (CFA²) to jailbreak LLMs, which is a framework that leverages Pearl's Front-Door Criterion to sever the confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA² achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.

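### [27] [Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale](https://arxiv.org/abs/2602.05447)

*Damon McMillan*

Main category: cs.CL

TL;DR: Through 9,649 experiments, this paper systematically studies context engineering strategies for LLM agents handling structured data in SQL generation. It finds that model capability is the dominant factor in performance and that architecture and format choices should be tailored to the model tier (frontier vs. open source) rather than following a universal recipe.
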
Details Motivation: LLM agents increasingly call external system interfaces, but there is little empirical guidance on how to organize the context they consume; this paper fills that gap using SQL generation as a representative case of agents operating on structured data. Method: Run large-scale controlled experiments (11 models; 4 context formats: YAML, Markdown, JSON, TOON; schemas of 10 to 10,000 tables) to evaluate how context architecture (such as file-based retrieval) and format affect generation accuracy and runtime efficiency, with statistical tests (p-values, chi-squared). Result: (1) File-based context retrieval improves accuracy for frontier models (+2.7%) but lowers it in aggregate for open source models (-7.7%); (2) format has no significant aggregate effect (p=0.484), although open source models are format-sensitive; (3) there is a 21 percentage point accuracy gap between frontier and open source models; (4) file-native agents scale to 10,000 tables while maintaining high navigation accuracy; (5) file size does not determine runtime efficiency, and compact formats can consume more tokens because models are unfamiliar with them. Conclusion: Context engineering decisions must fit the capability tier of the specific model, and no universally optimal solution exists; experience with frontier models should not be transferred directly to open source models, and practice should be evidence-based and tier-specific. Abstract: Large Language Model agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Our findings challenge common assumptions. First, architecture choice is model-dependent: file-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open source models (aggregate -7.7%, p<0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi-squared=2.45, p=0.484), though individual models, particularly open source, exhibit format-specific sensitivities. Third, model capability is the dominant factor, with a 21 percentage point accuracy gap between frontier and open source tiers that dwarfs any format or architecture effect. Fourth, file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact formats can consume significantly more tokens at scale due to format-unfamiliar search patterns. These findings provide practitioners with evidence-based guidance for deploying LLM agents on structured systems, demonstrating that architectural decisions should be tailored to model capability rather than assuming universal best practices.

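### [28] [Reasoning under Ambiguity: Uncertainty-Aware Multilingual Emotion Classification under Partial Supervision](https://arxiv.org/abs/2602.05471)

*Md. Mithun Hossaina,Mashary N. Alrasheedy,Nirban Bhowmick,Shamim Forhad,Md. Shakil Hossain,Sudipto Chaki,Md Shafiqul Islam*

Main category: cs.CL

TL;DR: This paper proposes Reasoning under Ambiguity, an uncertainty-aware framework for multilingual multi-label emotion classification. Using an entropy-based weighting mechanism and a mask-aware objective, it handles annotation ambiguity and incomplete supervision and achieves consistent gains on datasets in several languages.
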
Details Motivation: Existing methods assume fully observed labels and deterministic learning objectives, which leads to bias and unreliable predictions when emotions are ambiguous and annotations are incomplete (missing or heterogeneous). Method: Propose an uncertainty-aware framework with (1) a shared multilingual encoder with language-specific optimization; (2) an entropy-based ambiguity weighting mechanism that down-weights highly ambiguous instances; (3) a mask-aware objective combined with positive-unlabeled (PU) regularization to support learning under partial supervision. Result: On English, Spanish, and Arabic emotion classification benchmarks, the method consistently outperforms strong baselines across multiple metrics while improving training stability, robustness to annotation sparsity, and interpretability. Conclusion: Explicitly modeling annotation uncertainty is key to improving multilingual multi-label emotion classification, and the framework offers a practical solution for ambiguous emotion recognition under partial supervision. Abstract: Contemporary knowledge-based systems increasingly rely on multilingual emotion identification to support intelligent decision-making, yet they face major challenges due to emotional ambiguity and incomplete supervision. Emotion recognition from text is inherently uncertain because multiple emotional states often co-occur and emotion annotations are frequently missing or heterogeneous. Most existing multi-label emotion classification methods assume fully observed labels and rely on deterministic learning objectives, which can lead to biased learning and unreliable predictions under partial supervision. This paper introduces Reasoning under Ambiguity, an uncertainty-aware framework for multilingual multi-label emotion classification that explicitly aligns learning with annotation uncertainty. The proposed approach uses a shared multilingual encoder with language-specific optimization and an entropy-based ambiguity weighting mechanism that down-weights highly ambiguous training instances rather than treating missing labels as negative evidence. A mask-aware objective with positive-unlabeled regularization is further incorporated to enable robust learning under partial supervision. Experiments on English, Spanish, and Arabic emotion classification benchmarks demonstrate consistent improvements over strong baselines across multiple evaluation metrics, along with improved training stability, robustness to annotation sparsity, and enhanced interpretability.

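
The abstract does not give formulas for the entropy-based weighting or the mask-aware objective, so the following is only one plausible sketch under stated assumptions: labels are soft annotation scores in [0, 1], `None` marks a missing annotation, the instance weight is one minus the mean binary entropy of the observed labels, and the positive-unlabeled regularizer is left out entirely.

```python
import math
from typing import List, Optional, Tuple


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) label; 0 for fully certain labels."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def instance_loss(
    probs: List[float],
    soft_labels: List[Optional[float]],  # None = annotation missing (assumed encoding)
    eps: float = 1e-12,
) -> Tuple[float, float]:
    """Return (weighted masked BCE, ambiguity weight) for one training instance."""
    # Mask: missing labels are skipped instead of being treated as negatives.
    observed = [
        (min(max(p, eps), 1 - eps), y)
        for p, y in zip(probs, soft_labels)
        if y is not None
    ]
    if not observed:
        return 0.0, 0.0
    bce = sum(-(y * math.log(p) + (1 - y) * math.log(1 - p)) for p, y in observed)
    bce /= len(observed)
    # Assumed weighting: highly ambiguous (high-entropy) instances count less.
    ambiguity = sum(binary_entropy(y) for _, y in observed) / len(observed)
    weight = 1.0 - ambiguity
    return weight * bce, weight
```

An instance annotated `[1.0, None]` keeps full weight and ignores the prediction for the missing label, while an instance annotated `[0.5, 0.5]` is maximally ambiguous and receives weight zero under this particular choice.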
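### [29] [LinguistAgent: A Reflective Multi-Model Platform for Automated Linguistic Annotation](https://arxiv.org/abs/2602.05493)

*Bingru Li*

Main category: cs.CL

TL;DR: This paper presents LinguistAgent, an integrated platform that automates linguistic annotation with a reflective multi-model architecture. A dual-agent workflow (Annotator and Reviewer) simulates professional peer review, and the platform is demonstrated on metaphor identification.
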
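Details Motivation: Data annotation remains a bottleneck in the humanities and social sciences, especially for complex semantic tasks such as metaphor identification; although large language models (LLMs) show promise, a significant gap remains between their theoretical capability and their practical utility for researchers. Method: Propose the LinguistAgent platform, which uses a reflective multi-model architecture and a dual-agent (Annotator + Reviewer) workflow; it supports comparative experiments across three paradigms, prompt engineering (zero/few-shot), retrieval-augmented generation (RAG), and fine-tuning, and provides real-time token-level evaluation (precision, recall, F1). Result: LinguistAgent is demonstrated on metaphor identification with real-time quantitative evaluation against human gold standards; the platform and code are open source. Conclusion: LinguistAgent aims to make complex semantic annotation more efficient and reliable for humanities and social science researchers, narrowing the gap between LLM capability and practical use and offering a workable model for domain-specific AI tools. Abstract: Data annotation remains a significant bottleneck in the Humanities and Social Sciences, particularly for complex semantic tasks such as metaphor identification. While Large Language Models (LLMs) show promise, a significant gap remains between the theoretical capability of LLMs and their practical utility for researchers. This paper introduces LinguistAgent, an integrated, user-friendly platform that leverages a reflective multi-model architecture to automate linguistic annotation. The system implements a dual-agent workflow, comprising an Annotator and a Reviewer, to simulate a professional peer-review process. LinguistAgent supports comparative experiments across three paradigms: Prompt Engineering (Zero/Few-shot), Retrieval-Augmented Generation, and Fine-tuning. We demonstrate LinguistAgent's efficacy using the task of metaphor identification as an example, providing real-time token-level evaluation (Precision, Recall, and F1 score) against human gold standards. The application and code are released on https://github.com/Bingru-Li/LinguistAgent.
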
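
Token-level precision, recall, and F1 against a gold standard, as mentioned in the abstract, follow the standard definitions. The sketch below assumes binary per-token labels (1 = metaphorical, 0 = literal) and a hypothetical function name; it is not LinguistAgent's own code.

```python
from typing import List, Tuple


def token_prf(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for binary token labels of equal length."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    # Undefined ratios (no predictions or no gold positives) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With gold labels `[1, 0, 1, 0]` and predictions `[1, 1, 0, 0]` there is one true positive, one false positive, and one false negative, so all three scores equal 0.5.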
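### [30] [Transport and Merge: Cross-Architecture Merging for Large Language Models](https://arxiv.org/abs/2602.05495)

*Chenhang Cui,Binyun Yang,Fei Shen,Yuxin Chen,Jingnan Zheng,Xiang Wang,An Zhang,Tat-Seng Chua*

Main category: cs.CL

TL;DR: This paper proposes a cross-architecture model merging framework based on optimal transport (OT) that transfers knowledge from large language models (LLMs) to heterogeneous small low-resource models. It needs only a small set of inputs to perform direct weight-space fusion and is validated on multilingual and specialized-domain tasks.
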
Details Motivation: Large language models are capable, but real deployments often rely on small low-resource models; most existing merging methods require compatible architectures and so cannot transfer knowledge from a large model to a heterogeneous small one. Method: Propose an OT-based cross-architecture merging framework that aligns activations to infer cross-neuron correspondences between heterogeneous models and uses the resulting transport plans to guide direct weight-space fusion. Result: On low-resource languages and specialized domains, the method consistently improves the target small models across experiments. Conclusion: OT-based cross-architecture merging is an efficient, lightweight, and general knowledge-transfer mechanism that can narrow the capability gap between high-resource large models and low-resource small models. Abstract: Large language models (LLMs) achieve strong capabilities by scaling model capacity and training data, yet many real-world deployments rely on smaller models trained or adapted from low-resource data. This gap motivates the need for mechanisms to transfer knowledge from large, high-resource models to smaller, low-resource targets. While model merging provides an effective transfer mechanism, most existing approaches assume architecture-compatible models and therefore cannot directly transfer knowledge from large high-resource LLMs to heterogeneous low-resource targets. In this work, we propose a cross-architecture merging framework based on optimal transport (OT) that aligns activations to infer cross-neuron correspondences between heterogeneous models. The resulting transport plans are then used to guide direct weight-space fusion, enabling effective high-resource to low-resource transfer using only a small set of inputs. Extensive experiments across low-resource languages and specialized domains demonstrate consistent improvements over target models.

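### [31] [A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering](https://arxiv.org/abs/2602.05512)

*Larissa Pusch,Alexandre Courtiol,Tim Conrad*

Main category: cs.CL

TL;DR: This paper proposes an interactive framework in which large language models (LLMs) generate and explain Cypher graph queries and users iteratively refine them in natural language, improving the accessibility, accuracy, and explainability of knowledge graph (KG) querying.
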
Details Motivation: Address hallucination, outdated information, and limited explainability of LLMs on knowledge-intensive tasks; compensate for the weakness of text-based RAG on multi-hop reasoning; lower the query-language expertise needed to use KGs. Method: Design an LLM-driven interactive framework that generates Cypher queries, explains them in natural language, and lets users correct them through feedback; build a 90-query benchmark on a synthetic movie KG, complemented by small-scale experiments on two real KGs, Hyena and MaRDI. Result: The framework improves the accessibility and accuracy of KG querying; the synthetic benchmark quantifies differences across LLMs in query explanation quality and fault detection and shows that model performance is domain-dependent. Conclusion: An interactive natural-language interface combining LLMs and KGs is a viable way to improve reliability and usability on knowledge-intensive tasks, balancing semantic rigor with user-friendliness. Abstract: Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.

### [32] [Multi-Task GRPO: Reliable LLM Reasoning Across Tasks](https://arxiv.org/abs/2602.05547)

*Shyam Sundhar Ramesh,Xiaotong Ji,Matthieu Zimmer,Sangwoong Yoon,Zhiyong Wang,Haitham Bou Ammar,Aurelien Lucchi,Ilija Bogunovic*

Main category: cs.CL

TL;DR: This paper proposes Multi-Task GRPO (MT-GRPO), which dynamically adapts task weights and introduces a ratio-preserving sampler to improve the worst-task performance and training efficiency of large language models in multi-task settings.

Details Motivation: RL-based GRPO works well on single tasks, but in multi-task deployment it suffers from imbalanced performance across tasks, and large differences in the share of zero-gradient prompts distort the optimization signal. Method: Propose MT-GRPO with (i) a dynamic task weighting mechanism that explicitly optimizes worst-task performance and (ii) a ratio-preserving sampler that ensures the policy gradients reflect the adapted task weights. Result: In 3-task and 9-task settings, MT-GRPO improves worst-task accuracy by 16-28% and 6% absolute over standard GRPO and DAPO, respectively, and reaches 50% worst-task accuracy with 50% fewer training steps in the 3-task setting, while keeping average accuracy competitive. Conclusion: MT-GRPO improves the robustness and training efficiency of LLMs in multi-task settings, providing a more reliable optimization framework for real deployments. Abstract: RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.

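### [33] [CASTLE: A Comprehensive Benchmark for Evaluating Student-Tailored Personalized Safety in Large Language Models](https://arxiv.org/abs/2602.05633)

*Rui Jia,Ruiyi Lan,Fengrui Liu,Zhongxiang Dai,Bo Jiang,Jing Shao,Jingyuan Chen,Guandong Xu,Fei Wu,Min Zhang*

Main category: cs.CL

TL;DR: This paper introduces the concept of Student-Tailored Personalized Safety and builds the CASTLE benchmark, covering 15 educational safety risks and 14 student attributes across 92,908 bilingual scenarios, with three evaluation metrics. Experiments show that current SOTA LLMs fall well short on personalized safety.
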
Details Motivation: The generation mechanisms of current large language models produce homogeneous responses to identical prompts, overlooking the heterogeneity of students' cognitive and psychological characteristics, and traditional safety metrics (such as factual accuracy, bias, and toxicity) cannot capture the different harms the same response may cause for students with different attributes. Method: Propose the concept of Student-Tailored Personalized Safety grounded in educational theory and build the CASTLE benchmark, covering 15 educational safety risks and 14 student attributes in 92,908 bilingual scenarios; design three metrics, Risk Sensitivity, Emotional Empathy, and Student Alignment. Result: Experiments on 18 SOTA LLMs show that every model's average safety rating on CASTLE is below 2.3 out of 5, revealing serious deficiencies in personalized safety assurance. Conclusion: Current LLMs are seriously lacking in student-level personalized safety for educational settings, and finer-grained, attribute-aware safety evaluation and modeling methods are needed. Abstract: Large language models (LLMs) have advanced the development of personalized learning in education. However, their inherent generation mechanisms often produce homogeneous responses to identical prompts. This one-size-fits-all mechanism overlooks the substantial heterogeneity in students' cognitive and psychological characteristics, thereby posing potential safety risks to vulnerable groups. Existing safety evaluations primarily rely on context-independent metrics such as factual accuracy, bias, or toxicity, which fail to capture the divergent harms that the same response might cause across different student attributes. To address this gap, we propose the concept of Student-Tailored Personalized Safety and construct CASTLE based on educational theories. This benchmark covers 15 educational safety risks and 14 student attributes, comprising 92,908 bilingual scenarios. We further design three evaluation metrics: Risk Sensitivity, measuring the model's ability to detect risks; Emotional Empathy, evaluating the model's capacity to recognize student states; and Student Alignment, assessing the match between model responses and student attributes. Experiments on 18 SOTA LLMs demonstrate that CASTLE poses a significant challenge: all models scored below an average safety rating of 2.3 out of 5, indicating substantial deficiencies in personalized safety assurance.

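### [34] [Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew](https://arxiv.org/abs/2602.05648)

*Giuseppe Samo,Paola Merlo*

Main category: cs.CL

TL;DR: This paper studies how Transformer models represent complex verb paradigms in Turkish and Modern Hebrew, focusing on the effect of tokenization strategy. For Turkish, whose morphological markers are transparent, all model types perform well; for Hebrew, with its non-concatenative morphology, only the morpheme-aware monolingual model performs well. Performance improves for all models on more synthetic data.
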
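Details Motivation: Investigate how different tokenization strategies (atomic, subword, character-level, morpheme-aware) affect the ability of Transformer models to represent complex verb paradigms in highly inflected languages (Turkish, Hebrew). Method: Use the Blackbird Language Matrices task to evaluate monolingual and multilingual Transformer models on natural data, compare performance under the different tokenization strategies, and add complementary experiments on synthetic data. Result: For Turkish, all combinations of model and tokenization strategy perform well; for Hebrew, only the morpheme-aware monolingual model performs well, and the character-level multilingual model fails; all models improve on synthetic data. Conclusion: Tokenization strategy is crucial for Transformers modeling non-concatenative morphology; morpheme-aware tokenization outperforms generic subword or character-level tokenization, and language-specific modeling (monolingual plus morpheme-aware) has the advantage in morphologically complex languages. Abstract: We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish -- with its transparent morphological markers -- both monolingual and multilingual models succeed, either when tokenization is atomic or when it breaks words into small subword units. For Hebrew, instead, monolingual and multilingual models diverge. A multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, but a monolingual model with morpheme-aware segmentation performs well. Performance improves on more synthetic datasets, in all models.
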
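### [35] [MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations](https://arxiv.org/abs/2602.05692)

*Congbo Ma,Yichun Zhang,Yousef Al-Jazzazi,Ahamed Foisal,Laasya Sharma,Yousra Sadqi,Khaled Saleh,Jihad Mallat,Farah E. Shamout*

Main category: cs.CL

TL;DR: This paper presents MedErrBench, the first multilingual benchmark for error detection, localization, and correction in clinical text, covering English, Arabic, and Chinese and annotated and validated by clinical experts. Experiments reveal notable performance gaps for existing large models in non-English clinical settings and underline the need for clinically trustworthy, language-aware systems.
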
Details Motivation: Errors in existing or generated clinical text can have serious consequences, yet dedicated error-evaluation benchmarks covering multiple languages and clinical settings are lacking. Method: Builds MedErrBench, a multilingual (English/Arabic/Chinese) clinical error benchmark based on an expanded taxonomy of ten common error types, annotated and reviewed by clinical experts, and systematically evaluates general-purpose, language-specific, and medical-domain models on three tasks: error detection, localization, and correction. Result: Models detect and correct errors markedly worse in the non-English languages (Arabic and Chinese) than in English, and medical-domain models do not uniformly outperform general-purpose models. Conclusion: NLP systems grounded in clinical practice and aware of language are urgently needed; releasing MedErrBench is intended to advance multilingual clinical NLP and support safer, more equitable AI healthcare applications. Abstract: Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially when they involve a misdiagnosis or an incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general-purpose, language-specific, and medical-domain language models across all three tasks. Our results reveal notable performance gaps, particularly in non-English settings, highlighting the need for clinically grounded, language-aware systems. By making MedErrBench and our evaluation protocols publicly available, we aim to advance multilingual clinical NLP to promote safer and more equitable AI-based healthcare globally. The dataset is available in the supplementary material. An anonymized version of the dataset is available at: https://github.com/congboma/MedErrBench.
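### [36] [Consensus-Aligned Neuron Efficient Fine-Tuning Large Language Models for Multi-Domain Machine Translation](https://arxiv.org/abs/2602.05694) *Shuting Jiang,Ran Song,Yuxin Huang,Yan Xiang,Yantuan Xian,Shengxiang Gao,Zhengtao Yu* Main category: cs.CL TL;DR: This paper proposes a neuron-efficient fine-tuning framework for multi-domain machine translation (MDMT) that identifies and updates consensus-aligned neurons by maximizing the mutual information between neuron behavior and domain features, mitigating parameter interference and domain overfitting and achieving SOTA performance across several models and language pairs.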
Details Motivation: Large language models (LLMs) still struggle with domain adaptation in multi-domain machine translation; methods such as in-context learning and parameter-efficient fine-tuning (PEFT) suffer from domain shift, parameter interference, and limited generalization. Method: A neuron-efficient fine-tuning framework: first identify consensus neurons aligned with domain features using a mutual-information criterion, then fine-tune only these key neurons, balancing general translation patterns with domain specificity. Result: Experiments on three LLMs and ten German-English and Chinese-English translation domains show that the method consistently outperforms strong PEFT baselines on both seen and unseen domains, reaching state-of-the-art performance. Conclusion: Selecting and fine-tuning consensus-aligned neurons is a more robust and efficient multi-domain adaptation strategy and offers a new approach to domain transfer for LLMs. Abstract: Multi-domain machine translation (MDMT) aims to build a unified model capable of translating content across diverse domains. Despite the impressive machine translation capabilities demonstrated by large language models (LLMs), domain adaptation remains a challenge for LLMs. Existing MDMT methods such as in-context learning and parameter-efficient fine-tuning often suffer from domain shift, parameter interference and limited generalization. In this work, we propose a neuron-efficient fine-tuning framework for MDMT that identifies and updates consensus-aligned neurons within LLMs. These neurons are selected by maximizing the mutual information between neuron behavior and domain features, enabling LLMs to capture both generalizable translation patterns and domain-specific nuances. Our method then fine-tunes LLMs guided by these neurons, effectively mitigating parameter interference and domain-specific overfitting. Comprehensive experiments on three LLMs across ten German-English and Chinese-English translation domains show that our method consistently outperforms strong PEFT baselines on both seen and unseen domains, achieving state-of-the-art performance.
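### [37] [OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale](https://arxiv.org/abs/2602.05711) *Jingze Shi,Zhangyang Peng,Yizhang Zhu,Yifan Wu,Guang Liu,Yuyu Luo* Main category: cs.CL TL;DR: OmniMoE pushes toward extremely fine-grained experts; with atomic experts, a Cartesian Product Router, and Expert-Centric Scheduling, it substantially reduces inference latency while maintaining high accuracy.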
Details Motivation: Existing MoE architectures face an inherent tension between finer expert granularity and hardware execution efficiency; breaking this trade-off is needed to improve both parameter efficiency and execution performance. Method: Proposes OmniMoE, a system-algorithm co-designed framework: introduces vector-level Atomic Experts; designs a Cartesian Product Router that reduces routing complexity from O(N) to O(√N); and adopts Expert-Centric Scheduling to turn sparse lookups into dense matrix operations. Result: On seven benchmarks, OmniMoE (1.7B active parameters) reaches 50.9% zero-shot accuracy, outperforming DeepSeekMoE and PEER; relative to PEER, inference latency drops from 73ms to 6.7ms (a 10.9-fold speedup). Conclusion: Extremely fine-grained MoE can combine high accuracy with high efficiency through system-algorithm co-design, countering the conventional view that fine granularity and speed are incompatible. Abstract: Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(sqrt(N)); and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. Across seven benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% zero-shot accuracy, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73ms to 6.7ms (a 10.9-fold speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.
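The routing claim in the OmniMoE entry (O(N) down to O(sqrt(N))) can be illustrated with a small sketch. It assumes an additive, product-key style factorization: N = n × n experts sit on a grid, and expert (i, j) is scored as `row_scores[i] + col_scores[j]`. That scoring rule, and the names `cartesian_topk` and `brute_force_topk`, are assumptions made for illustration; the abstract does not say how the Cartesian Product Router is actually built.

```python
import heapq
import random


def cartesian_topk(row_scores, col_scores, k):
    """Return the k best (score, expert_index) pairs on an n x n expert grid.

    Assumed scoring: expert (i, j) gets row_scores[i] + col_scores[j].
    With additive scores and no ties, the global top-k lies inside
    (top-k rows) x (top-k columns), so only 2n router scores and k*k
    candidates are examined instead of all n*n experts.
    """
    n = len(col_scores)
    top_rows = heapq.nlargest(k, range(len(row_scores)), key=row_scores.__getitem__)
    top_cols = heapq.nlargest(k, range(n), key=col_scores.__getitem__)
    candidates = [
        (row_scores[i] + col_scores[j], i * n + j)
        for i in top_rows
        for j in top_cols
    ]
    return heapq.nlargest(k, candidates)


def brute_force_topk(row_scores, col_scores, k):
    """Reference implementation that scores every one of the n*n experts."""
    n = len(col_scores)
    all_scores = [
        (row_scores[i] + col_scores[j], i * n + j)
        for i in range(len(row_scores))
        for j in range(n)
    ]
    return heapq.nlargest(k, all_scores)
```

With n = 16 (N = 256 experts) and k = 4, the factored version evaluates 32 router scores and 16 candidates, while the reference evaluates all 256; both return the same experts when scores are distinct.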
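### [38] [CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering](https://arxiv.org/abs/2602.05728) *Hao Yang,Zhiyu Yang,Xupeng Zhang,Wei Wei,Yunjie Zhang,Lin Yang* Main category: cs.CL TL;DR: CompactRAG is a retrieval-augmented generation framework that decouples offline corpus restructuring from online reasoning; by building an atomic QA knowledge base and using only two LLM calls, it substantially reduces token consumption in multi-hop question answering while maintaining high accuracy.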
Details Motivation: Existing multi-hop RAG systems are inefficient: repeated LLM calls, high token consumption, and unstable entity grounding across hops. Method: Offline, an LLM converts the corpus into an atomic QA knowledge base; online, complex queries are decomposed and rewritten to preserve entity consistency, then resolved through dense retrieval and RoBERTa-based answer extraction; the whole process uses only two LLM calls (decomposition and synthesis). Result: On HotpotQA, 2WikiMultiHopQA, and MuSiQue, accuracy is comparable to iterative RAG while token consumption drops substantially. Conclusion: CompactRAG offers a lower-cost, more practical approach to multi-hop knowledge reasoning that balances efficiency and effectiveness. Abstract: Retrieval-augmented generation (RAG) has become a key paradigm for knowledge-intensive question answering. However, existing multi-hop RAG systems remain inefficient, as they alternate between retrieval and reasoning at each step, resulting in repeated LLM calls, high token consumption, and unstable entity grounding across hops. We propose CompactRAG, a simple yet effective framework that decouples offline corpus restructuring from online reasoning. In the offline stage, an LLM reads the corpus once and converts it into an atomic QA knowledge base, which represents knowledge as minimal, fine-grained question-answer pairs. In the online stage, complex queries are decomposed and carefully rewritten to preserve entity consistency, and are resolved through dense retrieval followed by RoBERTa-based answer extraction. Notably, during inference, the LLM is invoked only twice in total - once for sub-question decomposition and once for final answer synthesis - regardless of the number of reasoning hops. Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue demonstrate that CompactRAG achieves competitive accuracy while substantially reducing token consumption compared to iterative RAG baselines, highlighting a cost-efficient and practical approach to multi-hop reasoning over large knowledge corpora. The implementation is available at GitHub.
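### [39] [LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards](https://arxiv.org/abs/2602.05758) *Bowen Ping,Zijun Chen,Yiyao Yu,Tingfeng Hui,Junchi Yan,Baobao Chang* Main category: cs.CL TL;DR: This paper proposes the LongR framework, which improves LLM performance on long-context reasoning tasks through a dynamic "Think-and-Read" mechanism and a contextual density reward based on relative information gain.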
Details Motivation: Existing reinforcement learning methods rely only on sparse outcome rewards, which give too little fine-grained guidance for long-context reasoning and therefore yield limited gains. Method: Proposes LongR, which combines a dynamic "Think-and-Read" mechanism (interleaving reasoning with document consultation) with a contextual density reward based on relative information gain that quantifies the utility of relevant documents. Result: LongR gains 9% on LongBench v2 and improves consistently on RULER and InfiniteBench; it is effective across RL algorithms (e.g., DAPO, GSPO); further analyses examine the effect of reasoning chain length and robustness to distractors. Conclusion: With finer-grained reward design and coordinated reasoning and retrieval, LongR markedly improves LLM performance and robustness on long-context reasoning tasks. Abstract: Reinforcement Learning has emerged as a key driver for LLM reasoning. This capability is equally pivotal in long-context scenarios--such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model's robustness against distractors.
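### [40] [Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors](https://arxiv.org/abs/2602.05769) *Adnan Al Ali,Jindřich Helcl,Jindřich Libovický* Main category: cs.CL TL;DR: This paper revisits whether detectors of LLM-generated text are systematically biased against texts by non-native speakers of Czech. It finds that the earlier claim that non-native texts are misclassified because of low perplexity does not hold for Czech, and that current detectors no longer rely mainly on perplexity features.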
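Details Motivation: Prior work reported that LLM detectors often misclassify non-native speakers' texts as AI-generated because of their low perplexity; this paper tests whether that conclusion still holds in the Czech setting. Method: Compares the perplexity of texts by native and non-native authors on Czech corpora; evaluates how detectors from three families classify both kinds of text; analyzes whether contemporary detectors still depend on perplexity as a key feature. Result: Texts by non-native speakers of Czech do not have lower perplexity than texts by native speakers; none of the three detector families shows systematic bias against non-native speakers; current detectors work effectively without relying on perplexity. Conclusion: The earlier conclusion that perplexity drives detector bias does not hold for Czech, and modern detectors have moved to more robust features that are less sensitive to language background. Abstract: LLM-based assistants have been widely popularised after the release of ChatGPT. Concerns have been raised about their misuse in academia, given the difficulty of distinguishing between human-written and generated text. To combat this, automated techniques have been developed and shown to be effective, to some extent. However, prior work suggests that these methods often falsely flag essays from non-native speakers as generated, due to their low perplexity extracted from an LLM, which is supposedly a key feature of the detectors. We revisit these statements two years later, specifically in the Czech language setting. We show that the perplexity of texts from non-native speakers of Czech is not lower than that of native speakers. We further examine detectors from three separate families and find no systematic bias against non-native speakers. Finally, we demonstrate that contemporary detectors operate effectively without relying on perplexity.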
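Since the argument in this entry turns on perplexity, a minimal sketch of the standard definition may help: perplexity is the exponential of the mean negative log-probability a language model assigns to each token. The function name `perplexity` and the idea of passing in precomputed natural-log token probabilities are illustrative assumptions; the abstract does not say which scoring model the authors used.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a text given per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum(log p_i)). Lower values mean the scoring
    model found the text more predictable.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

A text whose every token gets probability 0.25 has perplexity 4, matching the intuition of a model choosing uniformly among four options at each step.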
### [41] [Reinforcement World Model Learning for LLM-based Agents](https://arxiv.org/abs/2602.05842) *Xiao Yu,Baolin Peng,Ruize Xu,Yelong Shen,Pengcheng He,Suman Nath,Nikhil Singh,Jiangfeng Gao,Zhou Yu* Main category: cs.CL TL;DR: This paper proposes Reinforcement World Model Learning (RWML), a self-supervised method for strengthening the world-modeling ability of LLM-based agents; it learns action-conditioned world models on textual states and uses sim-to-real gap rewards to align simulated and real states, yielding significant gains on ALFWorld and τ² Bench.
Details Motivation: In agentic settings, LLMs struggle to anticipate the consequences of actions and adapt to environment dynamics, so world-modeling capability is needed. Method: Proposes RWML: in a pre-trained embedding space, the sim-to-real gap reward serves as the signal for aligning the simulated next state generated by the model with the next state actually observed in the environment, avoiding the semantic distortion and model collapse associated with conventional next-token prediction. Result: Significantly outperforms the base model on ALFWorld and τ² Bench; combined with task-success rewards, it beats direct task-success reward RL by 6.9 and 5.7 points respectively and matches the performance of expert-data training. Conclusion: RWML provides a robust, self-supervised training paradigm for world modeling that addresses LLMs' reasoning weaknesses in dynamic environments and is empirically less susceptible to reward hacking than LLM-as-a-judge. Abstract: Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and $\tau^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and $\tau^2$ Bench respectively, while matching the performance of expert-data training.
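The idea of rewarding agreement in embedding space rather than exact wording can be sketched as follows. This is only an illustration under stated assumptions: `toy_embed` is a bag-of-words stand-in for the pre-trained embedding model mentioned in the abstract, and using cosine similarity as the reward (`gap_reward`) is a guess at one reasonable form, not the paper's definition.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a pre-trained text embedder: word counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)  # Counter yields 0 for missing words
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def gap_reward(simulated_next_state, real_next_state):
    """Assumed reward: similarity between predicted and observed next state."""
    return cosine(toy_embed(simulated_next_state), toy_embed(real_next_state))
```

A prediction that reorders the words of the true observation still scores 1.0 here, whereas an exact-match token objective would penalize it; a prediction describing an unrelated state scores 0.0.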
### [42] [OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions](https://arxiv.org/abs/2602.05843) *Fangzhi Xu,Hang Yan,Qiushi Sun,Jinyang Wu,Zixian Huang,Muye Huang,Jingyang Gong,Zichen Ding,Kanzhi Cheng,Yian Wang,Xinyu Che,Zeyi Sun,Jian Zhang,Zhangyue Yin,Haoran Luo,Xuanjing Huang,Ben Kao,Jun Liu,Qika Lin* Main category: cs.CL TL;DR: This paper presents OdysseyArena, an evaluation framework for autonomous agents centered on long-horizon, active, and inductive interactions, intended to address the neglect in existing evaluations of an agent's ability to discover latent transition laws from experience.
Details Motivation: Existing evaluations mainly follow a deductive paradigm that relies on explicit rules and static goals, neglecting the key ability of agents to induce latent state-transition laws from experience, an ability that underpins foresighted decision-making and strategic coherence. Method: Proposes the OdysseyArena framework, formalizing and instantiating four primitives to build interactive environments that support inductive learning; further provides OdysseyArena-Lite (120 tasks) for standardized benchmarking and OdysseyArena-Challenge (extreme interaction horizons, > 200 steps) for stress testing. Result: Experiments on 15+ leading LLMs show that even frontier models remain deficient in inductive scenarios, exposing a key bottleneck in autonomous discovery. Conclusion: OdysseyArena offers a new paradigm for evaluating agents' long-horizon inductive ability; its open-source code and data are intended to push autonomous agents toward genuine environment modeling and strategic foresight. Abstract: The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey-Arena
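### [43] [RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference](https://arxiv.org/abs/2602.05853) *Siran Liu,Guoxia Wang,Sa Wang,Jinle Zeng,HaoYang Xie,Siyu Lou,JiaBin Yang,DianHai Yu,Haifeng Wang,Chao Yang* Main category: cs.CL TL;DR: This paper proposes RRAttention, a dynamic sparse attention mechanism whose round-robin sampling strategy preserves query independence while enabling efficient global pattern discovery, substantially reducing computational complexity and improving long-context processing.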
Details Motivation: The quadratic complexity of attention is a key bottleneck for LLMs processing long contexts; existing dynamic sparse attention methods face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. Method: Proposes RRAttention, which uses a per-head round-robin sampling strategy that rotates the query sampling positions of the attention heads within each stride, combined with stride-level aggregation and adaptive Top-τ selection for input-adaptive sparsification. Result: On the HELMET and Video-MME benchmarks, RRAttention recovers over 99% of full-attention performance while computing only half of the attention blocks, achieves a 2.4× speedup at 128K context length, and outperforms existing dynamic sparse attention methods. Conclusion: RRAttention is presented as the first method to combine no preprocessing, global evaluation, query independence, and low computational overhead, offering an efficient and practical approach to long-context modeling. Abstract: The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head round-robin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$\tau$ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving 2.4× speedup at 128K context length and outperforming existing dynamic sparse attention methods.
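One plausible reading of "rotating query sampling positions across attention heads within each stride" is sketched below. It assumes each head samples a single query per stride of length S, shifted by the head index, and that sampled queries are scored against stride-aggregated keys. Both assumptions, and the names `sampled_query_positions` and `coarse_score_entries`, are illustrative; the abstract does not spell out the actual scheme.

```python
def sampled_query_positions(seq_len, stride, head):
    """Query positions one head samples: one per stride, shifted by head index."""
    offset = head % stride  # assumed round-robin shift
    return list(range(offset, seq_len, stride))


def coarse_score_entries(seq_len, stride):
    """Entries in the coarse score map under the assumptions above.

    L/S sampled queries meet L/S stride-level key groups, giving
    (L/S)^2 entries, consistent with the O(L^2 / S^2) figure quoted
    in the abstract.
    """
    blocks = seq_len // stride
    return blocks * blocks
```

For L = 16 and S = 4, head 1 samples positions 1, 5, 9, 13, and heads 0 to 3 together touch every position, so no query position is systematically ignored even though each head computes only 16 of the 256 possible scores.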
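### [44] [xList-Hate: A Checklist-Based Framework for Interpretable and Generalizable Hate Speech Detection](https://arxiv.org/abs/2602.05874) *Adrián Girón,Pablo Miralles,Javier Huertas-Tato,Sergio D'Antonio,David Camacho* Main category: cs.CL TL;DR: This paper proposes the xList-Hate framework, which decomposes hate speech detection into a multi-step diagnostic problem based on normative criteria; a large language model answers each item, and an interpretable decision tree aggregates the answers, improving cross-domain robustness and interpretability.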
Details Motivation: Hate speech detection is usually treated as end-to-end binary classification, which tends to overfit dataset-specific definitions and is not robust to domain shift and annotation noise. Method: Builds the xList-Hate diagnostic framework: the hate speech decision is broken into several concept-level, normatively grounded binary questions; an LLM answers each independently to produce diagnostic signals; a lightweight, fully interpretable decision tree aggregates the signals and outputs the prediction. Result: Across multiple benchmarks and model families, compared with zero-shot LLM classification and supervised fine-tuning, the method consistently improves cross-dataset robustness and relative performance under domain shift, and is more robust to some annotation inconsistency and contextual ambiguity; it supports fine-grained interpretability analysis. Conclusion: Reframing hate speech detection as a diagnostic reasoning task yields a more robust, interpretable, and extensible approach to content moderation. Abstract: Hate speech detection is commonly framed as a direct binary classification problem despite being a composite concept defined through multiple interacting factors that vary across legal frameworks, platform policies, and annotation guidelines. As a result, supervised models often overfit dataset-specific definitions and exhibit limited robustness under domain shift and annotation noise. We introduce xList-Hate, a diagnostic framework that decomposes hate speech detection into a checklist of explicit, concept-level questions grounded in widely shared normative criteria. Each question is independently answered by a large language model (LLM), producing a binary diagnostic representation that captures hateful content features without directly predicting the final label. These diagnostic signals are then aggregated by a lightweight, fully interpretable decision tree, yielding transparent and auditable predictions. We evaluate it across multiple hate speech benchmarks and model families, comparing it against zero-shot LLM classification and in-domain supervised fine-tuning. While supervised methods typically maximize in-domain performance, xList-Hate consistently improves cross-dataset robustness and relative performance under domain shift. In addition, qualitative analysis of disagreement cases provides evidence that the framework can be less sensitive to certain forms of annotation inconsistency and contextual ambiguity. Crucially, the approach enables fine-grained interpretability through explicit decision paths and factor-level analysis.
Our results suggest that reframing hate speech detection as a diagnostic reasoning task, rather than a monolithic classification problem, provides a robust, explainable, and extensible alternative for content moderation.
### [45] [EuroLLM-22B: Technical Report](https://arxiv.org/abs/2602.05879) *Miguel Moura Ramos,Duarte M. Alves,Hippolyte Gisserot-Boukhlef,João Alves,Pedro Henrique Martins,Patrick Fernandes,José Pombal,Nuno M. Guerreiro,Ricardo Rei,Nicolas Boizard,Amin Farajian,Mateusz Klimaszewski,José G. C. de Souza,Barry Haddow,François Yvon,Pierre Colombo,Alexandra Birch,André F. T. Martins* Main category: cs.CL TL;DR: EuroLLM-22B is a new multilingual large language model trained from scratch to better serve European languages, covering all 24 EU official languages plus 11 others; it shows competitive performance on multilingual benchmarks and its models, data, and code are publicly released.
Details Motivation: European languages are underrepresented and underserved in existing open large language models, prompting the need for a model specifically designed to support European linguistic diversity. Method: Trained EuroLLM-22B from scratch with custom tokenizer design, architectural specifications, rigorous multilingual data filtering, and comprehensive training procedures focused on 35 languages. Result: EuroLLM-22B achieves competitive performance on multilingual benchmarks across reasoning, instruction following, and translation tasks. Conclusion: EuroLLM-22B successfully addresses the gap in multilingual LLM support for European languages and contributes open resources—including models, datasets, and code—to advance future research. Abstract: This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.
### [46] [Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models](https://arxiv.org/abs/2602.05897) *Shuo Nie,Hexuan Deng,Chao Wang,Ruiyu Fang,Xuebo Liu,Shuangyong Song,Yu Li,Min Zhang,Xuelong Li* Main category: cs.CL TL;DR: This paper proposes FaithRL, which uses step-level faithfulness rewards and an implicit truncated resampling strategy to reduce faithfulness hallucinations in the chain-of-thought of small reasoning models.
Details Motivation: Small reasoning models (SRMs) in resource-constrained settings are prone to faithfulness hallucinations in intermediate reasoning steps, and existing online reinforcement learning methods based on outcome rewards or coarse-grained CoT evaluation may wrongly reinforce unfaithful reasoning. Method: Proposes Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), comprising explicit step-level faithfulness rewards (from a process reward model) and an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Result: Experiments on multiple SRMs and open-book QA benchmarks show that FaithRL consistently reduces hallucinations in both the CoT and the final answers, improving the faithfulness and reliability of reasoning. Conclusion: With fine-grained step-level supervision and a contrastive learning mechanism, FaithRL markedly improves the faithfulness of small models' chain-of-thought, offering a new approach to trustworthy reasoning under resource constraints. Abstract: As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.
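### [47] [Codified Finite-state Machines for Role-playing](https://arxiv.org/abs/2602.05905) *Letian Peng,Yupeng Hou,Kun Zhou,Jingbo Shang* Main category: cs.CL TL;DR: This paper proposes Codified Finite-State Machines (CFSMs) and their probabilistic extension CPFSMs, which use an LLM to automatically extract states and transitions from character profiles, improving the consistency and interpretability of latent-state modeling in role-playing.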
Details Motivation: Existing prompting-based methods struggle to model the latent states that drive character interaction, while traditional hand-crafted finite-state machines adapt poorly to the open-ended semantics of role-playing. Method: Proposes the CFSM framework, which uses an LLM to automatically codify textual character profiles into finite-state machines; extends it to CPFSM, which models state transitions as probability distributions. Result: In synthetic evaluations and real role-playing tasks, CFSM and CPFSM both outperform commonly applied baselines, confirming effectiveness in structured tasks and in open-ended, stochastic state exploration. Conclusion: CFSM/CPFSM offer an interpretable, adaptive, and robust approach to latent-state modeling for LLM role-playing. Abstract: Modeling latent character states is crucial for consistent and engaging role-playing (RP) with large language models (LLMs). Yet, existing prompting-based approaches mainly capture surface actions, often failing to track the latent states that drive interaction. We revisit finite-state machines (FSMs), long used in game design to model state transitions. While effective in small, well-specified state spaces, traditional hand-crafted, rule-based FSMs struggle to adapt to the open-ended semantic space of RP. To address this, we introduce Codified Finite-State Machines (CFSMs), a framework that automatically codifies textual character profiles into FSMs using LLM-based coding. CFSMs extract key states and transitions directly from the profile, producing interpretable structures that enforce character consistency. To further capture uncertainty and variability, we extend CFSMs into Codified Probabilistic Finite-State Machines (CPFSMs), where transitions are modeled as probability distributions over states. Through both synthetic evaluations and real-world RP scenarios in established artifacts, we demonstrate that CFSM and CPFSM outperform generally applied baselines, verifying effectiveness not only in structured tasks but also in open-ended stochastic state exploration.
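The difference between a deterministic FSM and its probabilistic extension can be shown with a toy character. Everything in the sketch is hypothetical: the states, the events, the probabilities, and the hand-written tables, which in the paper's setting would be produced by LLM-based coding from a character profile rather than typed in.

```python
import random

# Hypothetical deterministic transitions: (state, event) -> next state.
TRANSITIONS = {
    ("calm", "insulted"): "angry",
    ("angry", "apology"): "calm",
    ("calm", "praised"): "cheerful",
}

# Hypothetical probabilistic transitions: (state, event) -> distribution.
PROB_TRANSITIONS = {
    ("calm", "insulted"): {"angry": 0.7, "calm": 0.3},
    ("angry", "apology"): {"calm": 0.6, "angry": 0.4},
}


def step(state, event):
    """Deterministic update; unknown (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def prob_step(state, event, rng):
    """Sample the next state from the distribution for (state, event)."""
    dist = PROB_TRANSITIONS.get((state, event), {state: 1.0})
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=1)[0]
```

Calling `step("calm", "insulted")` always yields `"angry"`, while `prob_step("calm", "insulted", random.Random(0))` returns `"angry"` or `"calm"` according to the listed weights, which is the kind of variability the probabilistic variant is meant to capture.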
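### [48] [KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs](https://arxiv.org/abs/2602.05929) *Jian Chen,Zhuoran Wang,Jiayu Qin,Ming Li,Meng Wang,Changyou Chen,Yin Chen,Qizhen Weng,Yirui Liu* Main category: cs.CL TL;DR: This paper proposes KV-CoRE, which uses SVD to quantify the data-dependent low-rank compressibility of KV-caches, reveals systematic links to model architecture, training data, and language coverage, and establishes the first large-scale benchmark of KV-cache compressibility in LLMs.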
Details Motivation: As context grows, the KV-cache can saturate GPU memory bandwidth, and existing compression methods overlook its data dependence and variation across layers. Method: Proposes the SVD-based KV-CoRE method, which computes the optimal low-rank approximation under the Frobenius norm and supports gradient-free, incremental, dataset-level, and layer-wise evaluation; uses the Normalized Effective Rank as the compressibility metric. Result: Analysis of multiple models and datasets spanning five English domains and sixteen languages finds systematic links between compressibility and model architecture, training data, and language coverage; the Normalized Effective Rank correlates strongly with performance degradation under compression. Conclusion: Establishes a principled evaluation framework and the first large-scale benchmark of KV-cache compressibility in LLMs, informing dynamic, data-aware compression and data-centric model development. Abstract: Large language models rely on KV-caches to avoid redundant computation during autoregressive decoding, but as context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of KV-caches and their variation across layers. We introduce KV-CoRE (KV-cache Compressibility by Rank Evaluation), an SVD-based method for quantifying the data-dependent low-rank compressibility of KV-caches. KV-CoRE computes the optimal low-rank approximation under the Frobenius norm and, being gradient-free and incremental, enables efficient dataset-level, layer-wise evaluation. Using this method, we analyze multiple models and datasets spanning five English domains and sixteen languages, uncovering systematic patterns that link compressibility to model architecture, training data, and language coverage. As part of this analysis, we employ the Normalized Effective Rank as a metric of compressibility and show that it correlates strongly with performance degradation under compression. Our study establishes a principled evaluation framework and the first large-scale benchmark of KV-cache compressibility in LLMs, offering insights for dynamic, data-aware compression and data-centric model development.
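The two ingredients named in this entry, the optimal low-rank approximation under the Frobenius norm and an effective-rank metric, can be sketched with NumPy. The truncated SVD is the standard Eckart-Young construction. The effective rank below is the exponential of the entropy of the normalized singular values, and dividing by the smaller matrix dimension is an assumed normalization; the paper's exact definition of Normalized Effective Rank may differ.

```python
import numpy as np


def best_rank_r(matrix, r):
    """Best rank-r approximation in Frobenius norm via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :r] * s[:r]) @ vt[:r]


def normalized_effective_rank(matrix):
    """exp(entropy of normalized singular values) / min(rows, cols).

    Values near 1 suggest energy is spread over all directions (hard to
    compress); values near 1/min(rows, cols) suggest one dominant
    direction. The normalization is an assumption made for this sketch.
    """
    s = np.linalg.svd(matrix, compute_uv=False)
    total = s.sum()
    if total == 0:
        return 0.0
    p = s / total
    p = p[p > 0]  # drop exact zeros so log is defined
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy)) / min(matrix.shape)
```

An identity matrix scores 1.0 (every direction matters equally), while a rank-1 matrix of shape 3 × 2 scores about 0.5, the lowest value this normalization allows for two columns.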
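### [49] [Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions](https://arxiv.org/abs/2602.05932) *Léo Labat,Etienne Ollion,François Yvon* Main category: cs.CL TL;DR: This paper studies whether multilingual large language models (LLMs) give different answers to value-laden multiple-choice questions (MCQs) depending on the language of the question. Although larger instruction-tuned models are more consistent overall, their answers still show marked language dependence, especially on certain questions; the authors build a new human-translated, 8-language Multilingual European Value Survey (MEVS) dataset and run systematic experiments.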
Details Motivation: To determine whether multilingual LLMs show language-induced variation on value-laden MCQs, that is, whether they behave like theoretical polyglots (consistent across languages) or like a multitude of monolingual models (language-dependent). Method: Builds MEVS, a new human-translated, strictly aligned value survey dataset in 8 European languages; runs systematic MCQ tests on more than 30 multilingual LLMs of various sizes, manufacturers, and alignment fine-tuning status, with controlled prompt variations (answer order, symbol type, tail character). Result: Larger, instruction-tuned models are more consistent overall, but robustness varies greatly across questions: some elicit total agreement, others leave answers split; all consistent, instruction-tuned models show language-specific behavior on certain questions. Conclusion: Multilingual LLMs are not fully consistent across languages on value-laden MCQs; their language dependence is selective and may be related to preference fine-tuning, which warrants further study of how alignment methods affect the expression of values across languages. Abstract: Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e., do they behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.
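### [50] [Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training](https://arxiv.org/abs/2602.05940) *Junxiao Liu,Zhijun Wang,Yixiao Li,Zhejian Lai,Liqian Huang,Xin Huang,Xue Han,Junlan Feng,Shujian Huang* Main category: cs.CL TL;DR: This paper proposes the TRIT framework, which combines translation training with multilingual reasoning training to improve question understanding and response generation on multilingual long-reasoning tasks; without additional data or external feedback, it markedly improves accuracy and language consistency on benchmarks such as MMATH.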
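Details Motivation: Long reasoning models perform poorly in multilingual settings: they often default to reasoning in English for non-English questions, and accuracy drops sharply when they are forced to reason in the question language; the root cause is weakness in both multilingual question understanding and multilingual reasoning. Method: Proposes TRIT (Translation-Reasoning Integrated Training), a self-improving framework that integrates translation training into multilingual reasoning training, jointly strengthening multilingual question understanding and response generation without external feedback or additional multilingual data. Result: On MMATH, it outperforms multiple baselines by an average of 7 percentage points, improving both answer correctness and language consistency; cross-lingual question alignment improves by over 10 percentage points; translation quality improves for both mathematical questions and general-domain text, with gains of up to 8.4 COMET points on FLORES-200. Conclusion: Integrating translation training improves multilingual understanding and reasoning together; TRIT is a lightweight, efficient method for multilingual long reasoning that needs no additional resources. Abstract: Long reasoning models often struggle in multilingual settings: they tend to reason in English for non-English questions; when constrained to reason in the question language, accuracy drops substantially. The struggle is caused by limited abilities in both multilingual question understanding and multilingual reasoning. To address both problems, we propose TRIT (Translation-Reasoning Integrated Training), a self-improving framework that integrates the training of translation into multilingual reasoning. Without external feedback or additional multilingual data, our method jointly enhances multilingual question understanding and response generation. On MMATH, our method outperforms multiple baselines by an average of 7 percentage points, improving both answer correctness and language consistency. Further analysis reveals that integrating translation training improves cross-lingual question alignment by over 10 percentage points and enhances translation quality for both mathematical questions and general-domain text, with gains up to 8.4 COMET points on FLORES-200.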
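### [51] [Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space](https://arxiv.org/abs/2602.05971) *Felipe D. Toro-Hernández,Jesuino Vieira Filho,Rodrigo M. Cabral-Carvalho* Main category: cs.CL TL;DR: This paper proposes a new framework that models semantic production as navigation through embedding space, using cumulative embeddings to build participant-specific semantic trajectories and extracting geometric and dynamical metrics to quantify the dynamics of semantic representation; the method is validated on multilingual, multi-task clinical and cognitive data and is robust to the choice of embedding model.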
Details Motivation: Understanding how humans navigate semantic space to retrieve and manipulate meaning requires a computational framework that captures the dynamic structure of semantic representations, particularly one that meets the need in clinical and cross-linguistic research for low manual intervention and high scalability. Method: Using several transformer text embedding models, the authors build participant-specific cumulative semantic trajectories, extract geometric and dynamical metrics (distance to next word, distance to centroid, entropy, velocity, acceleration), and compare cumulative and non-cumulative embeddings across trajectory lengths. Result: On four datasets spanning several languages (neurodegenerative disease, swear-word verbal fluency, and property listing in Italian and German), the framework distinguishes clinical groups and concept types; cumulative embeddings suit longer trajectories, and results are highly consistent across embedding models. Conclusion: Formalizing semantic navigation as a structured trajectory in embedding space bridges cognitive modeling and learned representations, and provides a quantifiable, low-intervention, general-purpose computational pipeline for clinical assessment, cross-linguistic analysis, and the assessment of artificial cognition. Abstract: Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, Property listing task in Italian, and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines.
By framing semantic navigation as a structured trajectory through embedding space, we bridge cognitive modeling with learned representation, thereby establishing a pipeline for quantifying semantic representation dynamics with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition.
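
The trajectory metrics named in the abstract above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not define "cumulative embedding", so the running mean used in `cumulative_embeddings` is an assumption, as are unit time steps between produced words and the function names themselves. Entropy is left out because its definition is not given in the abstract.

```python
import math


def cumulative_embeddings(vectors):
    """Running mean of the word embeddings produced so far.

    Assumption: 'cumulative' is read here as a running mean; the
    abstract does not specify the aggregation.
    """
    total = [0.0] * len(vectors[0])
    points = []
    for n, vec in enumerate(vectors, start=1):
        total = [t + v for t, v in zip(total, vec)]
        points.append([t / n for t in total])
    return points


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def diff(a, b):
    return [x - y for x, y in zip(a, b)]


def trajectory_metrics(points):
    """Geometric and dynamical metrics of one participant's trajectory."""
    dim = len(points[0])
    centroid = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    # Velocity vectors between consecutive points (unit time steps assumed).
    velocity = [diff(points[i + 1], points[i]) for i in range(len(points) - 1)]
    # Acceleration vectors: change in velocity between consecutive steps.
    acceleration = [diff(velocity[i + 1], velocity[i]) for i in range(len(velocity) - 1)]
    return {
        "distance_to_next": [norm(v) for v in velocity],
        "distance_to_centroid": [norm(diff(p, centroid)) for p in points],
        "speed": [norm(v) for v in velocity],
        "acceleration": [norm(a) for a in acceleration],
    }


# Toy 2-D "embeddings" for three produced words (hypothetical values).
words = [[0.0, 0.0], [2.0, 0.0], [2.0, 6.0]]
trajectory = cumulative_embeddings(words)
metrics = trajectory_metrics(trajectory)
```

With unit time steps, speed coincides with distance to next; the two would differ only if inter-word timing were available.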
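### [52] [DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs](https://arxiv.org/abs/2602.05992) *Lizhuo Luo,Shenggui Li,Yonggang Wen,Tianwei Zhang* Main category: cs.CL TL;DR: This paper proposes Dynamic Sliding Block (DSB), a training-free scheduling method that improves parallel text generation in diffusion large language models (dLLMs) by adapting the block size to semantic difficulty; combined with a dedicated DSB Cache mechanism, it improves inference efficiency without sacrificing quality.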
Details Motivation: Existing fixed-block scheduling ignores differences in semantic difficulty, which forces premature commitments at uncertain positions and delays generation at easy positions, hurting both generation quality and inference efficiency. Method: The authors propose Dynamic Sliding Block (DSB) scheduling, which replaces the fixed block with a sliding block of variable size, together with DSB Cache, a matching training-free KV-cache mechanism. Result: Across multiple models and benchmarks, DSB combined with DSB Cache improves both the generation quality and the inference efficiency of dLLMs. Conclusion: Block scheduling that adapts to semantic difficulty is key to better dLLM inference, and DSB with its cache mechanism offers an effective approach to training-free efficient generation. Abstract: Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.
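### [53] [A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies](https://arxiv.org/abs/2602.06015) *Panagiotis Kaliosis,Adithya V Ganesan,Oscar N. E. Kjell,Whitney Ringwald,Scott Feltman,Melissa A. Carr,Dimitris Samaras,Camilo Ruggero,Benjamin J. Luft,Roman Kotov,Andrew H. Schwartz* Main category: cs.CL TL;DR: This study systematically evaluates how accurately 11 large language models (LLMs) estimate PTSD severity in a zero-shot setting, finding that contextual knowledge (such as construct definitions and narrative context), reasoning effort, model size, and ensembling strategy significantly affect performance; open-weight models plateau beyond 70B parameters, whereas closed-weight models keep improving with newer generations; the best results come from ensembling a supervised model with zero-shot LLMs.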
Details Motivation: Although large language models are increasingly used for zero-shot mental health assessment, the key factors that affect their accuracy remain unclear. Method: Using a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals, the authors systematically examine 11 state-of-the-art LLMs, varying (i) contextual knowledge (subscale definitions, distribution summary, interview questions) and (ii) modeling strategies (zero-shot vs. few-shot, amount of reasoning effort, model size, structured subscale prediction vs. direct scalar prediction, output rescaling, and nine ensemble methods). Result: (a) LLMs are most accurate when given detailed construct definitions and narrative context; (b) more reasoning effort improves estimation accuracy; (c) open-weight models (Llama, Deepseek) plateau beyond 70B parameters, while closed-weight models (o3-mini, gpt-5) keep improving with newer generations; (d) ensembling a supervised model with zero-shot LLMs works best. Conclusion: The choice of contextual knowledge and modeling strategy is critical for deploying LLMs accurately in mental health assessment. Abstract: Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge on what factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge like subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs. few-shot, amount of reasoning effort, model sizes, structured subscales vs. direct scalar prediction, output rescaling and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) the performance of open-weight models (Llama, Deepseek) plateaus beyond 70B parameters, while closed-weight models (o3-mini, gpt-5) improve with newer generations; and (d) the best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest that the choice of contextual knowledge and modeling strategies is important for deploying LLMs to accurately assess mental health.
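### [54] [Multi-Token Prediction via Self-Distillation](https://arxiv.org/abs/2602.06019) *John Kirchenbauer,Abhimanyu Hans,Brian Bartoldson,Micah Goldblum,Ashwinee Panda,Tom Goldstein* Main category: cs.CL TL;DR: This paper proposes an online distillation method that needs no extra model or complex inference pipeline and converts a pretrained autoregressive language model directly into a multi-token prediction model, giving roughly a 3x speedup with an accuracy drop of less than 5%.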
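Details Motivation: Existing techniques for accelerating language model inference, such as speculative decoding, require training an auxiliary speculator model and building a complex inference pipeline, which adds deployment difficulty and overhead. Method: A simple online distillation objective converts a pretrained autoregressive language model into a standalone multi-token prediction model, keeping the original model structure and implementation unchanged and requiring no extra verifier or specialized inference code. Result: On GSM8K, the method makes decoding more than 3x faster on average while accuracy drops by less than 5%. Conclusion: The method is a lightweight, efficient, and easy-to-deploy way to accelerate language model inference that avoids the dependence of earlier methods on auxiliary models and complex pipelines. Abstract: Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a pretrained autoregressive language model from a slow single next token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains the exact same implementation as the pretrained initial checkpoint and is deployable without the addition of any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that can decode more than $3\times$ faster on average at $<5\%$ drop in accuracy relative to single token decoding performance.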
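### [55] [Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory](https://arxiv.org/abs/2602.06025) *Haozhen Zhang,Haodong Yue,Tao Feng,Quanyu Long,Jianzhu Bao,Bowen Jin,Weizhi Zhang,Xiao Li,Jiaxuan You,Chengwei Qin,Wenya Wang* Main category: cs.CL TL;DR: This paper proposes BudgetMem, a runtime agent memory framework in which a lightweight router routes across memory modules offered in different budget tiers (low/mid/high) to balance task performance against memory construction cost in an explicit, query-aware way; its effectiveness is validated on several benchmarks.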
Details Motivation: Existing LLM agent memory systems mostly build memory offline and independently of the query, which is inefficient and can lose critical information; using memory at runtime is more natural but often incurs high overhead and lacks explicit control over the performance-cost trade-off. Method: The BudgetMem framework structures memory processing as a set of memory modules, each available in three budget tiers (Low/Mid/High); a lightweight neural router, trained with reinforcement learning, performs budget-tier routing across modules; three strategies for realizing the tiers are studied: implementation complexity, reasoning behavior, and model capacity. Result: On LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines in the performance-first (high-budget) setting and offers better accuracy-cost frontiers under tighter budgets; the analysis shows that different tiering strategies suit different budget regimes. Conclusion: BudgetMem gives LLM agents a controllable, query-aware runtime memory mechanism that supports several ways of realizing budget tiers in one framework and achieves a more flexible and efficient trade-off between performance and cost. Abstract: Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.
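### [56] [DFlash: Block Diffusion for Flash Speculative Decoding](https://arxiv.org/abs/2602.06036) *Jian Chen,Yesheng Liang,Zhijian Liu* Main category: cs.CL TL;DR: This paper proposes DFlash, a speculative decoding framework built on a lightweight block diffusion model that generates draft tokens in parallel and conditions them on context features from the target model, yielding high-quality drafts with high acceptance rates and more than 6x lossless acceleration across a range of models and tasks.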
Details Motivation: Autoregressive large language models have high inference latency and low GPU utilization; existing speculative decoding still relies on sequential draft generation, and diffusion LLMs support parallel generation but underperform. Method: The DFlash framework uses a lightweight block diffusion model to generate drafts in parallel in a single forward pass, and conditions the draft model on context features extracted from the target model. Result: DFlash achieves more than 6x lossless acceleration across multiple models and tasks, up to 2.5x higher speedup than the current state-of-the-art method EAGLE-3. Conclusion: By combining the parallelism of diffusion models with the target model's context information, DFlash substantially improves speculative decoding efficiency while preserving output quality, offering a new approach to efficient LLM inference. Abstract: Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
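
The verification step that makes speculative decoding lossless can be sketched in a few lines. This is the generic greedy-verification rule, not DFlash's implementation: the function name `verify_draft` and the toy token IDs are hypothetical, and the sketch assumes greedy decoding, with the target model's per-position predictions already computed in one parallel pass.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of a block of drafted tokens.

    target_tokens[i] is the token the target model would pick greedily
    at position i, given the context plus draft_tokens[:i]. It has one
    more entry than draft_tokens: the prediction after the full draft.
    Returns the tokens actually emitted this step.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have len(draft_tokens) + 1 entries")

    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First mismatch: discard the rest of the draft and emit the
            # target model's own token instead.
            emitted.append(wanted)
            return emitted
        emitted.append(drafted)

    # Every drafted token was accepted, so the target's extra prediction
    # comes for free as a bonus token.
    emitted.append(target_tokens[-1])
    return emitted


# Toy token IDs (hypothetical): the third drafted token is rejected.
partial = verify_draft([5, 6, 7, 8], [5, 6, 9, 1, 2])
# All drafted tokens accepted, plus one bonus token.
full = verify_draft([5, 6], [5, 6, 7])
```

Because every emitted token is one the target model would have chosen itself, the output matches plain greedy decoding; the speedup depends on how many drafted tokens are accepted per step, which is why the abstract emphasizes acceptance rates.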
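# cs.CV [[Back]](#toc)

### [57] [Food Portion Estimation: From Pixels to Calories](https://arxiv.org/abs/2602.05078) *Gautham Vinod,Fengqing Zhu* Main category: cs.CV TL;DR: This paper surveys strategies for food portion estimation in image-based dietary assessment, focusing on the challenge of inferring 3D food size from 2D images.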
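Details Motivation: Image-based dietary assessment is important for preventing and managing chronic diseases and obesity, but its core difficulty is accurately estimating the 3D size of food from 2D images. Method: A survey-style analysis covering auxiliary approaches such as depth maps, multi-view inputs, and template matching, as well as monocular and multimodal deep learning methods. Result: The paper systematically reviews the main current strategies for food portion estimation and their strengths and weaknesses, giving a reference framework for future work. Conclusion: Combining auxiliary information with deep learning is an effective route to more accurate portion estimation, but robustness and generalization still need improvement. Abstract: Reliance on images for dietary assessment is an important strategy to accurately and conveniently monitor an individual's health, making it a vital mechanism in the prevention and care of chronic diseases and obesity. However, image-based dietary assessment struggles with estimating the three-dimensional size of food from 2D image inputs. Many strategies have been devised to overcome this critical limitation such as the use of auxiliary inputs like depth maps, multi-view inputs, or model-based approaches such as template matching. Deep learning also helps bridge the gap by either using monocular images or combinations of the image and the auxiliary inputs to precisely predict the output portion from the image input. In this paper, we explore the different strategies employed for accurate portion estimation.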
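### [58] [ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation](https://arxiv.org/abs/2602.05132) *Jia Li,Wenjie Zhao,Shijian Deng,Bolin Lai,Yuheng Wu,RUijia Chen,Jon E. Froehlich,Yuhang Zhao,Yapeng Tian* Main category: cs.CV TL;DR: This paper proposes ARGaze, an online egocentric gaze estimation method based on autoregressive decoding that predicts gaze from current visual features and a bounded-length window of recent gaze context, reaching state-of-the-art performance on multiple benchmarks.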
Details Motivation: Online egocentric gaze estimation lacks explicit head or eye signals, so attention must be inferred from indirect cues such as hand-object interactions and scene saliency; at the same time, gaze shows strong temporal continuity during goal-directed activities, which can serve as an effective prior. Method: ARGaze casts gaze estimation as sequence prediction: at each timestep a transformer decoder performs causal, streaming inference conditioned on the current visual features and a fixed-length window of recent gaze target estimates (the Gaze Context Window). Result: The method achieves state-of-the-art performance under online evaluation on multiple egocentric gaze benchmarks; ablations confirm that autoregressive modeling with bounded gaze history is key to robust prediction. Conclusion: By adding vision-conditioned autoregressive temporal modeling, ARGaze exploits the temporal continuity of gaze and improves accuracy and practicality in resource-bounded online settings. Abstract: Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a transformer decoder predicts current gaze by conditioning on (i) current visual features and (ii) a fixed-length Gaze Context Window of recent gaze target estimates. This design enforces causality and enables bounded-resource streaming inference. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation, with extensive ablations validating that autoregressive modeling with bounded gaze history is critical for robust prediction. We will release our source code and pre-trained models.
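### [59] [SHaSaM: Submodular Hard Sample Mining for Fair Facial Attribute Recognition](https://arxiv.org/abs/2602.05162) *Anay Majee,Rishabh Iyer* Main category: cs.CV TL;DR: This paper proposes SHaSaM (Submodular Hard Sample Mining), which addresses fairness in deep neural networks through a submodular optimization framework, mitigating data imbalance and reducing the influence of sensitive attributes, and improves both fairness and accuracy on the CelebA and UTKFace datasets.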
Details Motivation: Deep neural networks often make unfair predictions because of social and demographic biases inherent in training data, especially when sensitive attributes such as race, age, and gender are present; existing methods struggle with data imbalance between attribute groups and tend to over-attend to sensitive attributes, which worsens unfairness. Method: A two-stage submodular combinatorial approach: SHaSaM-MINE uses a submodular subset selection strategy to mine hard positives and negatives and mitigate data imbalance; SHaSaM-LEARN designs a new family of combinatorial loss functions based on Submodular Conditional Mutual Information that maximizes the decision boundary between target classes while minimizing the influence of sensitive attributes. Result: On CelebA and UTKFace, SHaSaM improves the fairness metric (Equalized Odds) by up to 2.7 points and accuracy by up to 3.5% over existing methods, and converges faster. Conclusion: Through a unified submodular formulation, SHaSaM restricts the model from learning features tied to sensitive attributes and improves fairness without sacrificing performance, offering a new direction for fair representation learning. Abstract: Deep neural networks often inherit social and demographic biases from annotated data during model training, leading to unfair predictions, especially in the presence of sensitive attributes like race, age, gender etc. Existing methods fall prey to the inherent data imbalance between attribute groups and inadvertently emphasize sensitive attributes, worsening unfairness and performance. To surmount these challenges, we propose SHaSaM (Submodular Hard Sample Mining), a novel combinatorial approach that models fairness-driven representation learning as a submodular hard-sample mining problem. Our two-stage approach comprises SHaSaM-MINE, which introduces a submodular subset selection strategy to mine hard positives and negatives - effectively mitigating data imbalance, and SHaSaM-LEARN, which introduces a family of combinatorial loss functions based on Submodular Conditional Mutual Information to maximize the decision boundary between target classes while minimizing the influence of sensitive attributes. This unified formulation restricts the model from learning features tied to sensitive attributes, significantly enhancing fairness without sacrificing performance. Experiments on CelebA and UTKFace demonstrate that SHaSaM achieves state-of-the-art results, with up to 2.7 points improvement in model fairness (Equalized Odds) and a 3.5% gain in Accuracy, in fewer epochs than existing methods.
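### [60] [LOBSTgER-enhance: an underwater image enhancement pipeline](https://arxiv.org/abs/2602.05163) *Andreas Mentzelopoulos,Keith Ellenbogen* Main category: cs.CV TL;DR: This paper proposes a diffusion-based image-to-image method for restoring the color distortion, blur, and reduced contrast of underwater photography, and generalizes well after training on only about 2,500 high-quality underwater images.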
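Details Motivation: Underwater photography suffers from inherent challenges such as low contrast, spatial blur, and wavelength-dependent color distortion, which misrepresent the colors of marine life and force photographers into heavy post-processing workflows. Method: The authors build a synthetic corruption pipeline that simulates underwater degradation and use diffusion-based generation to learn the reverse restoration; the model is trained from scratch on a small high-quality dataset of underwater images provided by Keith Ellenbogen. Result: With about 11 million parameters and training on about 2,500 images, the model reliably produces 512x768 images with high perceptual consistency and strong generalization. Conclusion: The method shows that a lightweight diffusion model can be effective for underwater image restoration with little data, suggesting a path for underwater visual enhancement in resource-constrained settings. Abstract: Underwater photography presents significant inherent challenges including reduced contrast, spatial blur, and wavelength-dependent color distortions. These effects can obscure the vibrancy of marine life, and awareness photographers in particular often face heavy post-processing pipelines to correct for these distortions. We develop an image-to-image pipeline that learns to reverse underwater degradations by introducing a synthetic corruption pipeline and learning to reverse its effects with diffusion-based generation. Training and evaluation are performed on a small high-quality dataset of awareness photography images by Keith Ellenbogen. The proposed methodology achieves high perceptual consistency and strong generalization in synthesizing 512x768 images using a model of ~11M parameters after training from scratch on ~2.5k images.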
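### [61] [ShapePuri: Shape Guided and Appearance Generalized Adversarial Purification](https://arxiv.org/abs/2602.05175) *Zhe Li,Bernhard Kainz* Main category: cs.CV TL;DR: This paper proposes ShapePuri, a shape-guided purification framework that combines geometric guidance from Signed Distance Functions (SDF) with stochastic transformations that mitigate appearance bias, improving robustness to adversarial attacks and exceeding 80% robust accuracy under AutoAttack for the first time.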
Details Motivation: Existing defenses such as adversarial training and diffusion-based purification suffer from high computational cost and information loss, so a more efficient and stable way to improve robustness is needed. Method: Shape Guided Purification (ShapePuri) has two modules: (1) a Shape Encoding Module (SEM) that provides dense geometric guidance through Signed Distance Functions (SDF); (2) a Global Appearance Debiasing (GAD) module that mitigates appearance bias through stochastic transformations. Result: Under the AutoAttack protocol the method reaches 84.06% clean accuracy and 81.64% robust accuracy, making it the first defense framework to exceed 80% robust accuracy on this benchmark, without extra computational cost or auxiliary modules. Conclusion: By aligning model representations with stable structural invariants, ShapePuri combines high robustness, efficiency, and prediction stability, offering a new approach to adversarial defense. Abstract: Deep neural networks demonstrate impressive performance in visual recognition, but they remain vulnerable to adversarial attacks that are imperceptible to humans. Although existing defense strategies such as adversarial training and purification have achieved progress, diffusion-based purification often involves high computational costs and information loss. To address these challenges, we introduce Shape Guided Purification (ShapePuri), a novel defense framework that enhances robustness by aligning model representations with stable structural invariants. ShapePuri integrates two components: a Shape Encoding Module (SEM) that provides dense geometric guidance through Signed Distance Functions (SDF), and a Global Appearance Debiasing (GAD) module that mitigates appearance bias via stochastic transformations. In our experiments, ShapePuri achieves $84.06\%$ clean accuracy and $81.64\%$ robust accuracy under the AutoAttack protocol, representing the first defense framework to surpass the $80\%$ threshold on this benchmark. Our approach provides a scalable and efficient adversarial defense that preserves prediction stability during inference without requiring auxiliary modules or additional computational cost.
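### [62] [PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction](https://arxiv.org/abs/2602.05190) *Ju Shen,Chen Chen,Tam V. Nguyen,Vijayan K. Asari* Main category: cs.CV TL;DR: PoseGaussian is a pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis that injects pose information into both geometric and temporal modeling, improving robustness, generalization, and real-time rendering (100 FPS) in dynamic human scenes.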
Details Motivation: To address the novel view synthesis challenges that articulated motion and severe self-occlusion create in dynamic human scenes, and to overcome the weaknesses of existing methods in geometric reconstruction and temporal consistency. Method: The PoseGaussian framework uses human pose as a structural prior, fused with a color encoder to refine depth estimation, and as a temporal cue, processed by a dedicated pose encoder to strengthen temporal consistency across frames; the whole pipeline is differentiable and trainable end to end. Result: The method reaches state-of-the-art results on ZJU-MoCap, THuman2.0, and in-house datasets (PSNR 30.86, SSIM 0.979, LPIPS 0.028) and renders in real time at 100 FPS. Conclusion: By coupling pose signals tightly into both the geometric and temporal stages, PoseGaussian improves the quality, robustness, and efficiency of human novel view synthesis and offers a new approach to modeling dynamic scenes. Abstract: We propose PoseGaussian, a pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis. Human body pose serves a dual purpose in our design: as a structural prior, it is fused with a color encoder to refine depth estimation; as a temporal cue, it is processed by a dedicated pose encoder to enhance temporal consistency across frames. These components are integrated into a fully differentiable, end-to-end trainable pipeline. Unlike prior works that use pose only as a condition or for warping, PoseGaussian embeds pose signals into both geometric and temporal stages to improve robustness and generalization. It is specifically designed to address challenges inherent in dynamic human scenes, such as articulated motion and severe self-occlusion. Notably, our framework achieves real-time rendering at 100 FPS, maintaining the efficiency of standard Gaussian Splatting pipelines. We validate our approach on ZJU-MoCap, THuman2.0, and in-house datasets, demonstrating state-of-the-art performance in perceptual quality and structural accuracy (PSNR 30.86, SSIM 0.979, LPIPS 0.028).
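### [63] [GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling](https://arxiv.org/abs/2602.05202) *Shivanshu Shekhar,Uttaran Bhattacharya,Raghavendra Addanki,Mehrab Tanjim,Somdeb Sarkhel,Tong Zhang* Main category: cs.CV TL;DR: This paper proposes using the video generative model itself as the reward model (rather than relying on a VLM): the model is reformulated as an energy-based model and trained contrastively against synthetic negative samples, which improves its ability to judge the temporal quality of videos and reaches state-of-the-art results on several benchmarks with far less annotated data.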
Details Motivation: Existing reward modeling for video generation based on Vision-Language Models (VLMs) struggles to capture subtle temporal dynamics; an alternative that natively models temporal structure is needed. Method: A state-of-the-art video generative model (a generative transformer) is reformulated as an energy-based model (EBM) and trained with contrastive learning to separate high-quality from degraded videos; three kinds of controlled latent-space perturbation (temporal slicing, feature swapping, frame shuffling) generate synthetic negatives so that the model cannot rely on superficial artifacts. Result: The method reaches state-of-the-art performance on GenAI-Bench and MonteBench with only 30K human annotations, 6 to 65 times fewer than existing VLM-based approaches. Conclusion: Video generative models can be reused effectively as accurate, temporally aware reward models without additional architecture design, and with much better data efficiency. Abstract: Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: $6\times$ to $65\times$ fewer than existing VLM-based approaches.
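### [64] [Dual-Representation Image Compression at Ultra-Low Bitrates via Explicit Semantics and Implicit Textures](https://arxiv.org/abs/2602.05213) *Chuqin Zhou,Xiaoyue Ling,Yunuo Chen,Jincheng Dai,Guo Lu,Wenjun Zhang* Main category: cs.CV TL;DR: This paper proposes a training-free unified framework that combines explicit high-level semantics with implicit detail representations (a diffusion model plus reverse-channel coding) and adds a plug-in encoder to control the distortion-perception trade-off, improving perceptual quality in ultra-low-bitrate image compression beyond existing methods.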
Details Motivation: Existing neural codecs degrade sharply at ultra-low bitrates; generative compression methods exploit semantic priors from pretrained models, but explicit representations lack texture detail and implicit representations tend to cause semantic drift, so the two face a fundamental trade-off between semantic faithfulness and perceptual realism. Method: A training-free unified framework conditions a diffusion model on explicit high-level semantics and uses reverse-channel coding to convey fine-grained details implicitly; a plug-in encoder modulates the implicit information to control the distortion-perception trade-off. Result: On the Kodak, DIV2K, and CLIC2020 datasets, the method improves on DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate, respectively, reaching state-of-the-art rate-perception performance. Conclusion: Combining explicit and implicit representations strengthens texture synthesis without sacrificing semantic consistency, and the proposed framework overcomes the performance bottleneck of perceptual compression at ultra-low bitrates. Abstract: While recent neural codecs achieve strong performance at low bitrates when optimized for perceptual quality, their effectiveness deteriorates significantly under ultra-low bitrate conditions. To mitigate this, generative compression methods leveraging semantic priors from pretrained models have emerged as a promising paradigm. However, existing approaches are fundamentally constrained by a tradeoff between semantic faithfulness and perceptual realism. Methods based on explicit representations preserve content structure but often lack fine-grained textures, whereas implicit methods can synthesize visually plausible details at the cost of semantic drift. In this work, we propose a unified framework that bridges this gap by coherently integrating explicit and implicit representations in a training-free manner. Specifically, we condition a diffusion model on explicit high-level semantics while employing reverse-channel coding to implicitly convey fine-grained details. Moreover, we introduce a plug-in encoder that enables flexible control of the distortion-perception tradeoff by modulating the implicit information. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art rate-perception performance, outperforming existing methods and surpassing DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate on the Kodak, DIV2K, and CLIC2020 datasets, respectively.
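### [65] [E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching](https://arxiv.org/abs/2602.05215) *Jiahao Nie,Wenbin An,Gongjie Zhang,Yicheng Xu,Yap-Peng Tan,Alex C. Kot,Shijian Lu* Main category: cs.CV TL;DR: This paper proposes E.M.Ground, a new video large language model for Temporal Video Grounding (TVG) that introduces a special token, Savitzky-Golay smoothing, and multi-grained frame feature aggregation to better model the semantic continuity of events and improve temporal localization accuracy.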
Details Motivation: Existing Vid-LLMs rely on matching start and end frames independently for TVG, which makes it hard to model the semantic continuity and integrity of events and leads to ambiguous localization. Method: The E.M.Ground model (i) introduces a special token that aggregates frame information over the whole query event; (ii) applies Savitzky-Golay filtering to smooth the token-to-frame similarity curve; (iii) uses multi-grained frame feature aggregation to compensate for compression-induced information loss. Result: The model outperforms current state-of-the-art Vid-LLMs by significant margins on multiple benchmark datasets. Conclusion: Holistic event perception, rather than isolated frame matching, is better suited to TVG, and E.M.Ground demonstrates its effectiveness and robustness. Abstract: Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event's semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special token that aggregates information from all frames of a query event, preserving semantic continuity for accurate event matching; (ii) Savitzky-Golay smoothing to reduce noise in token-to-frame similarities across timestamps, improving prediction accuracy; (iii) multi-grained frame feature aggregation to enhance matching reliability and temporal understanding, compensating for compression-induced information loss. Extensive experiments on benchmark datasets show that E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
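
The Savitzky-Golay step mentioned in the abstract above can be illustrated with a small, dependency-free sketch. It is not the E.M.Ground implementation: the abstract gives no window size, polynomial order, or decoding rule, so the 5-point quadratic filter, the untouched edge frames, and the threshold-based `best_segment` helper are all assumptions chosen for illustration.

```python
# Standard 5-point quadratic/cubic Savitzky-Golay smoothing coefficients;
# the weighted sum is divided by 35.
SG_COEFFS = (-3, 12, 17, 12, -3)


def savgol_smooth(values):
    """Smooth a 1-D sequence with a 5-point Savitzky-Golay filter.

    Assumption: the two samples at each edge are left unchanged.
    """
    out = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(SG_COEFFS, window)) / 35
    return out


def best_segment(similarities, threshold):
    """Longest run of frames whose smoothed similarity is >= threshold.

    Hypothetical decoding rule, used only to show why a smoother curve
    gives a more coherent segment. Returns (start, end) inclusive, or
    None if no frame passes the threshold.
    """
    smoothed = savgol_smooth(similarities)
    best = None
    start = None
    # A sentinel below any threshold closes a run that reaches the end.
    for i, score in enumerate(smoothed + [float("-inf")]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


# A quadratic curve passes through the filter unchanged in the interior.
quadratic = [float(x * x) for x in range(7)]
smoothed_quadratic = savgol_smooth(quadratic)

# An isolated spike in per-frame similarity is spread over its neighbors.
segment = best_segment([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0], threshold=0.3)
```

In practice a library routine such as SciPy's `savgol_filter` would replace the hand-written filter; the explicit coefficients are kept here only to make the computation visible.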
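### [66] [Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation](https://arxiv.org/abs/2602.05217) *Jiahao Nie,Guanqiao Fu,Wenbin An,Yap-Peng Tan,Alex C. Kot,Shijian Lu* Main category: cs.CV TL;DR: This paper proposes Multi-view Progressive Adaptation (MPA), which improves cross-domain few-shot segmentation from both the data side and the strategy side through hybrid progressive augmentation and dual-chain multi-view prediction, clearly outperforming existing methods.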
Details Motivation: Existing cross-domain few-shot segmentation methods are limited by the small number and low diversity of target-domain samples; in addition, a source-trained model has weak few-shot capability in the target domain and faces a large domain gap, so target samples are used inefficiently and adaptation is poor. Method: Multi-view Progressive Adaptation (MPA): (i) on the data side, Hybrid Progressive Augmentation generates more diverse and complex views through cumulative strong augmentations; (ii) on the strategy side, Dual-chain Multi-view Prediction combines sequential and parallel learning paths and jointly enforces prediction consistency across views under extensive supervision. Result: MPA outperforms existing state-of-the-art methods by a large margin (+7.0%). Conclusion: By jointly optimizing the data augmentation strategy and the prediction mechanism, MPA narrows the domain gap and improves the transfer and adaptation of few-shot capability to target domains. Abstract: Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model's initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation, which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%).
### [67] [Boosting SAM for Cross-Domain Few-Shot Segmentation via Conditional Point Sparsification](https://arxiv.org/abs/2602.05218) *Jiahao Nie,Yun Xing,Wenbin An,Qingsong Zhao,Jiawei Shao,Yap-Peng Tan,Alex C. Kot,Shijian Lu,Xuelong Li* Main category: cs.CV TL;DR: This paper proposes Conditional Point Sparsification (CPS), which adaptively sparsifies dense matched points to improve SAM on cross-domain few-shot segmentation (CD-FSS), with the largest benefit in settings with large domain shift such as medical and satellite images.
Details Motivation: Existing SAM-based few-shot segmentation methods break down in cross-domain settings (such as medical and satellite images) because domain shift disrupts point-image interactions, so dense point matching performs poorly. Method: The training-free Conditional Point Sparsification (CPS) uses the ground-truth masks of the reference images to adaptively sparsify the dense points matched between reference and target images, making SAM's interactions more robust across domains. Result: CPS outperforms existing training-free SAM-based methods on multiple CD-FSS datasets, improving cross-domain few-shot segmentation accuracy. Conclusion: Point density is critical under cross-domain conditions; by sparsifying points adaptively under reference-mask guidance, CPS reduces the effect of domain shift on SAM point prompts and offers a new direction for training-free cross-domain segmentation. Abstract: Motivated by the success of the Segment Anything Model (SAM) in promptable segmentation, recent studies leverage SAM to develop training-free solutions for few-shot segmentation, which aims to predict object masks in the target image based on a few reference exemplars. These SAM-based methods typically rely on point matching between reference and target images and use the matched dense points as prompts for mask prediction. However, we observe that dense points perform poorly in Cross-Domain Few-Shot Segmentation (CD-FSS), where target images are from medical or satellite domains. We attribute this issue to large domain shifts that disrupt the point-image interactions learned by SAM, and find that point density plays a crucial role under such conditions. To address this challenge, we propose Conditional Point Sparsification (CPS), a training-free approach that adaptively guides SAM interactions for cross-domain images based on reference exemplars. Leveraging ground-truth masks, the reference images provide reliable guidance for adaptively sparsifying dense matched points, enabling more accurate segmentation results. Extensive experiments demonstrate that CPS outperforms existing training-free SAM-based methods across diverse CD-FSS datasets.
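### [68] [PatchFlow: Leveraging a Flow-Based Model with Patch Features](https://arxiv.org/abs/2602.05238) *Boxiang Zhang,Baijian Yang,Xiaoming Wang,Corey Vian* Main category: cs.CV TL;DR: This paper proposes an unsupervised anomaly detection method that combines local neighbor-aware patch features, a normalizing flow model, and an adapter module, substantially improving the accuracy of surface defect detection on castings.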
Details Motivation: Address the quality-control problem caused by surface defects in the die casting industry and improve the efficiency and accuracy of automated defect detection. Method: Combines local neighbor-aware patch features with a normalizing flow model and introduces an adapter module that bridges the domain gap between a generic pretrained feature extractor and industrial product images, enabling unsupervised anomaly detection without anomalous samples. Result: Image-level AUROC of 99.28% on MVTec AD (20% lower error rate); 96.48% on VisA (28.2% lower error rate); 95.77% detection accuracy on a proprietary die casting dataset, with no anomalous samples needed for training. Conclusion: The method improves automated surface defect detection for die castings and demonstrates the potential of computer vision and deep learning for industrial quality inspection. Abstract: Die casting plays a crucial role across various industries due to its ability to craft intricate shapes with high precision and smooth surfaces. However, surface defects remain a major issue that impedes die casting quality control. Recently, computer vision techniques have been explored to automate and improve defect detection. In this work, we combine local neighbor-aware patch features with a normalizing flow model and bridge the gap between the generic pretrained feature extractor and industrial product images by introducing an adapter module to increase the efficiency and accuracy of automated anomaly detection. Compared to state-of-the-art methods, our approach reduces the error rate by 20% on the MVTec AD dataset, achieving an image-level AUROC of 99.28%. Our approach also improves performance on the VisA dataset, achieving an image-level AUROC of 96.48%. Compared to the state-of-the-art models, this represents a 28.2% reduction in error. Additionally, experiments on a proprietary die casting dataset yield an accuracy of 95.77% for anomaly detection, without requiring any anomalous samples for training. Our method illustrates the potential of leveraging computer vision and deep learning techniques to advance inspection capabilities for the die casting industry.
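The relationship between the reported AUROC values and the quoted error-rate reductions can be checked with a little arithmetic. The sketch below assumes that "error rate" means 100 minus the image-level AUROC and that the reduction is measured relative to the best prior method; the abstract states neither explicitly, and `implied_baseline_auroc` is a hypothetical helper name.

```python
def implied_baseline_auroc(auroc: float, relative_error_reduction: float) -> float:
    """Back out the baseline AUROC (in %) implied by a relative error reduction.

    Assumes error = 100 - AUROC, which is an interpretation of the abstract,
    not something it states.
    """
    error = 100.0 - auroc
    baseline_error = error / (1.0 - relative_error_reduction)
    return 100.0 - baseline_error


mvtec_baseline = implied_baseline_auroc(99.28, 0.20)
visa_baseline = implied_baseline_auroc(96.48, 0.282)
print(f"MVTec AD implied baseline AUROC: {mvtec_baseline:.2f}%")
print(f"VisA implied baseline AUROC: {visa_baseline:.2f}%")
```

Under these assumptions the implied baselines are roughly 99.10% on MVTec AD and 95.10% on VisA, which shows how a small absolute AUROC gain near the ceiling corresponds to a large relative error reduction.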
### [69] [Active Label Cleaning for Reliable Detection of Electron Dense Deposits in Transmission Electron Microscopy Images](https://arxiv.org/abs/2602.05250) *Jieyun Tan,Shuo Liu,Guibin Zhang,Ziqi Li,Jian Geng,Lei Zhang,Lei Cao* Main category: cs.CV TL;DR: This paper proposes an active label cleaning method that improves electron dense deposit (EDD) detection models trained on crowdsourced annotations, retaining high accuracy while greatly reducing expert annotation cost.
Details Motivation: Automated detection of electron dense deposits (EDD) is limited by the scarcity of high-quality labeled data; crowdsourced annotation lowers cost but introduces label noise. Method: Proposes an active label cleaning method that uses active learning to select the most valuable noisy samples for expert re-annotation, together with a Label Selection Module that uses discrepancies between crowdsourced labels and model predictions for sample selection and instance-level noise grading. Result: Achieves 67.18% AP₅₀ on a private dataset, an 18.83% improvement over training directly on noisy labels; this reaches 95.79% of the performance of a model trained with full expert annotation while reducing annotation cost by 73.30%. Conclusion: The method offers a practical, low-cost solution for developing reliable medical AI when expert resources are limited. Abstract: Automated detection of electron dense deposits (EDD) in glomerular disease is hindered by the scarcity of high-quality labeled data. While crowdsourcing reduces annotation cost, it introduces label noise. We propose an active label cleaning method to efficiently denoise crowdsourced datasets. Our approach uses active learning to select the most valuable noisy samples for expert re-annotation, building high-accuracy cleaning models. A Label Selection Module leverages discrepancies between crowdsourced labels and model predictions for both sample selection and instance-level noise grading. Experiments show our method achieves 67.18% AP₅₀ on a private dataset, an 18.83% improvement over training on noisy labels. This performance reaches 95.79% of that with full expert annotation while reducing annotation cost by 73.30%. The method provides a practical, cost-effective solution for developing reliable medical AI with limited expert resources.
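### [70] [RFM-Pose: Reinforcement-Guided Flow Matching for Fast Category-Level 6D Pose Estimation](https://arxiv.org/abs/2602.05257) *Diya He,Qingchen Liu,Cong Zhang,Jiahu Qin* Main category: cs.CV TL;DR: This paper proposes RFM-Pose, a framework that combines a flow-matching generative model with reinforcement learning (PPO) to improve the sampling efficiency and accuracy of category-level 6D object pose estimation, significantly reducing computational cost on REAL275 while maintaining strong performance.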
Details Motivation: Existing score-based generative models alleviate the rotational symmetry ambiguity in category-level pose estimation, but their sampling is costly and inefficient. Method: Proposes RFM-Pose: 1) a flow-matching generative model generates poses along an optimal transport path from a simple prior; 2) the sampling process is cast as a Markov decision process and the sampling policy is fine-tuned with proximal policy optimization (PPO); 3) the flow field is treated as a learnable policy and an estimator is mapped to a value network, enabling joint optimization of pose generation and hypothesis scoring. Result: On the REAL275 benchmark, RFM-Pose maintains favorable performance while significantly reducing computational cost; it also extends naturally to object pose tracking with competitive results. Conclusion: The joint design of flow matching and reinforcement learning improves both the efficiency and the quality of category-level 6D pose generation, offering a new paradigm for generative pose estimation. Abstract: Object pose estimation is a fundamental problem in computer vision and plays a critical role in virtual reality and embodied intelligence, where agents must understand and interact with objects in 3D space. Recently, score-based generative models have to some extent solved the rotational symmetry ambiguity problem in category-level pose estimation, but their efficiency remains limited by the high sampling cost of score-based diffusion. In this work, we propose a new framework, RFM-Pose, that accelerates category-level 6D object pose generation while actively evaluating sampled hypotheses. To improve sampling efficiency, we adopt a flow-matching generative model and generate pose candidates along an optimal transport path from a simple prior to the pose distribution. To further refine these candidates, we cast the flow-matching sampling process as a Markov decision process and apply proximal policy optimization to fine-tune the sampling policy. In particular, we interpret the flow field as a learnable policy and map an estimator to a value network, enabling joint optimization of pose generation and hypothesis scoring within a reinforcement learning framework. Experiments on the REAL275 benchmark demonstrate that RFM-Pose achieves favorable performance while significantly reducing computational cost. Moreover, similar to prior work, our approach can be readily adapted to object pose tracking and attains competitive results in this setting.
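The flow-matching idea mentioned above can be sketched on a toy problem. The code below is a minimal illustration, assuming the common straight-line path x_t = (1 - t)·x0 + t·x1 with regression target x1 - x0, and it handles only a 3-D translation vector rather than a full 6D pose with rotation. The `oracle_velocity` function is a hypothetical stand-in for a trained network, and nothing here reproduces the PPO fine-tuning stage.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path between prior sample x0 and target x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field on the straight-line path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a learned network: an oracle that knows the target
# translation and returns the velocity pointing at it along a straight path.
target = [0.2, -0.1, 0.5]


def oracle_velocity(x, t):
    # On the straight-line path, x1 - x0 equals (x1 - x_t) / (1 - t).
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


random.seed(0)
prior_sample = [random.gauss(0.0, 1.0) for _ in range(3)]
pose = euler_sample(prior_sample, oracle_velocity, num_steps=4)
```

Because the path is straight, a handful of Euler steps is enough to reach the target in this toy setting, which is the intuition behind the lower sampling cost compared with many-step score-based diffusion.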
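### [71] [ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network](https://arxiv.org/abs/2602.05262) *Junzhou Li,Manqi Zhao,Yilin Gao,Zhiheng Yu,Yin Li,Dongsheng Jiang,Li Xiao* Main category: cs.CV TL;DR: This paper proposes ReGLA, a series of lightweight hybrid networks that combine efficient convolutions with ReLU-based gated linear attention, achieving a good balance between accuracy and latency on high-resolution image tasks.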
Details Motivation: Lightweight models, especially Transformer-based architectures, struggle to balance accuracy and latency on high-resolution images. Method: Designs three key components: the Efficient Large Receptive Field (ELRF) module, the ReLU Gated Modulated Attention (RGMA) module, and a multi-teacher knowledge distillation strategy. Result: ReGLA-M reaches 80.85% Top-1 accuracy on ImageNet-1K (224 px) with only 4.98 ms latency at 512 px; it outperforms iFormer by 3.1% AP on COCO detection and 3.6% mIoU on ADE20K segmentation. Conclusion: ReGLA is a state-of-the-art lightweight solution for high-resolution vision tasks. Abstract: Balancing accuracy and latency on high-resolution images is a critical challenge for lightweight models, particularly for Transformer-based architectures that often suffer from excessive latency. To address this issue, we introduce ReGLA, a series of lightweight hybrid networks, which integrates efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module for enhancing convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module for maintaining linear complexity while enhancing local feature representation; and a multi-teacher distillation strategy to boost performance on downstream tasks. Extensive experiments validate the superiority of ReGLA; in particular, ReGLA-M achieves 80.85% Top-1 accuracy on ImageNet-1K at 224 px, with only 4.98 ms latency at 512 px. Furthermore, ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of 3.1% AP on COCO object detection and 3.6% mIoU on ADE20K semantic segmentation, establishing it as a state-of-the-art solution for high-resolution visual applications.
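### [72] [Unlocking Prototype Potential: An Efficient Tuning Framework for Few-Shot Class-Incremental Learning](https://arxiv.org/abs/2602.05271) *Shengqin Jiang,Xiaoran Feng,Yuankai Qi,Haokui Zhang,Renlong Hang,Qingshan Liu,Lina Yao,Quan Z. Sheng,Ming-Hsuan Yang* Main category: cs.CV TL;DR: This paper proposes a new few-shot class-incremental learning method that fine-tunes prototypes rather than the backbone, optimizing decision regions within a static, high-quality feature space and using a dual-calibration method to make prototypes more discriminative.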
Details Motivation: Traditional FSCIL methods use a frozen pre-trained feature extractor to generate static class prototypes, which inherit the backbone's representation bias; prompt-based tuning methods, in turn, struggle to substantively improve global discriminative power under extreme data scarcity. Method: Freezes the feature extractor and fine-tunes the prototypes; introduces a dual-calibration method (class-specific offsets and task-aware offsets) that turns static centroids into dynamic, learnable components. Result: Achieves superior performance on multiple benchmarks while requiring very few learnable parameters. Conclusion: The core challenge in FSCIL is optimizing decision regions within a static, high-quality feature space rather than acquiring features; fine-tuning prototypes is more efficient and more effective than fine-tuning the backbone. Abstract: Few-shot class-incremental learning (FSCIL) seeks to continuously learn new classes from very limited samples while preserving previously acquired knowledge. Traditional methods often utilize a frozen pre-trained feature extractor to generate static class prototypes, which suffer from the inherent representation bias of the backbone. While recent prompt-based tuning methods attempt to adapt the backbone via minimal parameter updates, given the constraint of extreme data scarcity, the model's capacity to assimilate novel information and substantively enhance its global discriminative power is inherently limited. In this paper, we propose a novel shift in perspective: freezing the feature extractor while fine-tuning the prototypes. We argue that the primary challenge in FSCIL is not feature acquisition, but rather the optimization of decision regions within a static, high-quality feature space. To this end, we introduce an efficient prototype fine-tuning framework that evolves static centroids into dynamic, learnable components. The framework employs a dual-calibration method consisting of class-specific and task-aware offsets. These components function synergistically to improve the discriminative capacity of prototypes for ongoing incremental classes. Extensive results demonstrate that our method attains superior performance across multiple benchmarks while requiring minimal learnable parameters.
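### [73] [Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs](https://arxiv.org/abs/2602.05275) *Qi Li,Yanzhe Zhao,Yongxin Zhou,Yameng Wang,Yandong Yang,Yuanjia Zhou,Jue Wang,Zuojian Wang,Jinxiang Liu* Main category: cs.CV TL;DR: This paper proposes the Magic-MM-Embedding model series, which uses visual token compression and a multi-stage progressive training strategy to significantly improve the inference efficiency of universal multimodal embedding while maintaining high performance.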
Details Motivation: Multimodal large language models (MLLMs) show great promise for universal multimodal retrieval, but their practical use is limited by the high computational cost of processing large numbers of visual input tokens. Method: Proposes Magic-MM-Embedding: (1) an efficient MLLM architecture based on visual token compression that reduces latency and memory footprint; (2) a multi-stage progressive training strategy comprising continued pretraining, contrastive pretraining with hard negative mining, and task-aware fine-tuning guided by an MLLM-as-a-Judge. Result: Experiments show that the model outperforms existing methods by a large margin on multiple metrics while being more inference-efficient. Conclusion: Magic-MM-Embedding achieves both performance and efficiency on multimodal embedding tasks, offering a new path toward practical deployment. Abstract: Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the substantial computational cost incurred from processing a large number of tokens from visual inputs. In this paper, we propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding. Our approach is built on two synergistic pillars: (1) a highly efficient MLLM architecture incorporating visual token compression to drastically reduce inference latency and memory footprint, and (2) a multi-stage progressive training strategy designed to not only recover but significantly boost performance. This coarse-to-fine training paradigm begins with extensive continued pretraining to restore multimodal understanding and generation capabilities, progresses to large-scale contrastive pretraining and hard negative mining to enhance discriminative power, and culminates in a task-aware fine-tuning stage guided by an MLLM-as-a-Judge for precise data curation. Comprehensive experiments show that our model outperforms existing methods by a large margin while being more inference-efficient.
### [74] [Fast-SAM3D: 3Dfy Anything in Images but Faster](https://arxiv.org/abs/2602.05293) *Weilun Feng,Mingqiang Wu,Zhiliang Chen,Chuanguang Yang,Haotong Qin,Yuqi Li,Xiaokun Liu,Guoxin Fan,Zhulin An,Libo Huang,Yulun Zhang,Michele Magno,Yongjun Xu* Main category: cs.CV TL;DR: This paper proposes Fast-SAM3D, a training-free framework that dynamically matches computation to generation complexity through three heterogeneity-aware mechanisms, achieving up to 2.67× speedup while preserving reconstruction quality.
Details Motivation: SAM3D shows promise for open-world 3D reconstruction, but its high inference latency hinders practical deployment; generic acceleration methods are brittle because they ignore its multi-level heterogeneity (the kinematic distinctiveness between shape and layout, the sparsity of texture refinement, and the spectral variance across geometries). Method: Proposes the Fast-SAM3D framework with three training-free, heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching, which decouples structural evolution from layout updates; (2) Joint Spatiotemporal Token Carving, which concentrates refinement on high-entropy regions; (3) Spectral-Aware Token Aggregation, which adapts decoding resolution. Result: Experiments show that Fast-SAM3D delivers up to 2.67× end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Conclusion: A dynamic compute-alignment strategy designed around the heterogeneity of SAM3D inference significantly improves efficiency without sacrificing quality, offering a new paradigm for efficient open-world 3D reconstruction. Abstract: SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the first systematic investigation into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that these failures stem from neglecting the pipeline's inherent multi-level heterogeneity: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present Fast-SAM3D, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching to decouple structural evolution from sensitive layout updates; (2) Joint Spatiotemporal Token Carving to concentrate refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to 2.67× end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Our code is released at https://github.com/wlfeng0509/Fast-SAM3D.
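### [75] [FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion](https://arxiv.org/abs/2602.05305) *Zhuokun Chen,Jianfei Cai,Bohan Zhuang* Main category: cs.CV TL;DR: This paper proposes FlashBlock, a caching mechanism that exploits the stability of block-external attention outputs to reduce attention computation and KV cache access in long-context diffusion models, significantly improving inference efficiency with almost no loss in generation quality.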
Details Motivation: In long-context settings, existing block diffusion methods still incur repeated attention computation over a growing KV cache; the authors identify an overlooked property, namely that attention to tokens outside the current block is largely redundant across diffusion steps within a block. Method: Proposes FlashBlock, which caches and reuses the stable block-external attention output to avoid recomputation; the method is orthogonal to sparse attention and can be combined with it as a residual reuse strategy. Result: On diffusion language models and video generation, achieves up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with almost no loss in generation quality. Conclusion: FlashBlock is an efficient, plug-and-play attention optimization that relieves the computational bottleneck in long-form generation and is compatible with existing acceleration techniques. Abstract: Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
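One way to see why a block-external attention output can be cached separately from block-internal attention is the standard partial-softmax merge: attention over two disjoint key sets can be recombined exactly if each partial result is stored with its log-sum-exp normalizer. The sketch below illustrates that identity in plain Python. It is an assumption that this resembles what FlashBlock does, since the abstract does not describe how the cached and fresh parts are combined, and `partial_attention` and `merge` are hypothetical names.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over a key/value subset.

    Returns the normalized output and the log-sum-exp of the scores, which
    is all that is needed to merge this result with another subset later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results into attention over the union."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


q = [0.3, -0.2]
ext_k, ext_v = [[1.0, 0.0], [0.5, 0.5]], [[1.0, 2.0], [3.0, 4.0]]  # outside the block
int_k, int_v = [[0.0, 1.0]], [[5.0, 6.0]]                          # inside the block

cached_out, cached_lse = partial_attention(q, ext_k, ext_v)  # computed once, then reused
inner_out, inner_lse = partial_attention(q, int_k, int_v)    # recomputed at every step
merged = merge(cached_out, cached_lse, inner_out, inner_lse)
full, _ = partial_attention(q, ext_k + int_k, ext_v + int_v)
```

The merge is exact only while the query is unchanged. Reusing the cached part across diffusion steps, where the query does change, is an approximation that rests on the stability of block-external attention outputs reported in the abstract.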
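### [76] [Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning](https://arxiv.org/abs/2602.05321) *Dongki Jung,Jaehoon Choi,Adil Qureshi,Somi Jeong,Dinesh Manocha,Suyong Yeon* Main category: cs.CV TL;DR: Wid3R is a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models and is the first to enable multi-view zero-shot 3D reconstruction directly from 360-degree images.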
Details Motivation: Existing methods typically assume that input images are captured with pinhole cameras or have been rectified, so they transfer poorly to real-world scenarios with fisheye or panoramic cameras and depend on careful calibration and undistortion. Method: Uses a ray representation based on spherical harmonics and a novel camera model token to achieve distortion-aware 3D reconstruction; designs a generalizable multi-view 3D estimation architecture that supports wide field-of-view camera types. Result: Improves over prior methods by up to +77.33 on Stanford2D3D, shows strong zero-shot robustness, and is the first to support feed-forward 3D reconstruction from 360-degree images. Conclusion: Wid3R removes the traditional dependence on pinhole cameras and offers a new paradigm for real-time 3D reconstruction with wide field-of-view cameras that requires neither calibration nor undistortion. Abstract: We present Wid3R, a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models. Prior methods typically assume that input images are rectified or captured with pinhole cameras, since both their architectures and training datasets are tailored to perspective images only. These assumptions limit their applicability in real-world scenarios that use fisheye or panoramic cameras and often require careful calibration and undistortion. In contrast, Wid3R is a generalizable multi-view 3D estimation method that can model wide field-of-view camera types. Our approach leverages a ray representation with spherical harmonics and a novel camera model token within the network, enabling distortion-aware 3D reconstruction. Furthermore, Wid3R is the first multi-view foundation model to support feed-forward 3D reconstruction directly from 360 imagery. It demonstrates strong zero-shot robustness and consistently outperforms prior methods, achieving improvements of up to +77.33 on Stanford2D3D.
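To make the notion of a per-pixel ray for 360 imagery concrete, the sketch below maps an equirectangular pixel to a unit ray direction. It is only an illustration of the geometry: the axis convention (y up, z forward, longitude increasing to the right) and the function name `erp_pixel_to_ray` are assumptions, and the spherical-harmonics encoding and camera model token described in the abstract are not reproduced here.

```python
import math


def erp_pixel_to_ray(u, v, width, height):
    """Unit ray direction for pixel (u, v) of an equirectangular image.

    Pixel centers sit at (u + 0.5, v + 0.5). Longitude spans [-pi, pi) across
    the width and latitude spans [pi/2, -pi/2] from top to bottom.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


width, height = 8, 4
# Fractional coordinates pick the exact image center, which looks straight ahead.
center_ray = erp_pixel_to_ray(width / 2 - 0.5, height / 2 - 0.5, width, height)
# A pixel in the top row points mostly upward.
top_ray = erp_pixel_to_ray(0, 0, width, height)
```

Unlike a pinhole model, every pixel here has a well-defined ray even behind the camera, which is why ray-based inputs suit wide field-of-view images.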
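### [77] [MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors](https://arxiv.org/abs/2602.05330) *Jingdong Zhang,Xiaohang Zhan,Lingzhi Zhang,Yizhou Wang,Zhengming Yu,Jionghao Wang,Wenping Wang,Xin Li* Main category: cs.CV TL;DR: This paper proposes MTPano, a label-free multi-task panoramic foundation model that generates pseudo-labels from perspective dense priors and uses the Panoramic Dual BridgeNet to disentangle rotation-invariant and rotation-variant task feature streams, addressing geometric distortion and multi-task interference in panoramic images and achieving state-of-the-art panoramic scene understanding.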
Details Motivation: Panoramic scene understanding suffers from a scarcity of high-resolution, multi-task annotations; directly transferring perspective foundation models to the panoramic domain works poorly because of severe geometric distortion and coordinate system discrepancies; and the relations between different dense prediction tasks in spherical space are underexplored. Method: 1) Uses perspective foundation models on projected perspective patches to generate domain-gap-free pseudo-labels, which are re-projected as panoramic supervision; 2) groups tasks into rotation-invariant (e.g., depth, segmentation) and rotation-variant (e.g., surface normals) and designs the Panoramic Dual BridgeNet, which disentangles the feature streams via geometry-aware modulation layers that inject absolute position and ray direction priors; 3) introduces ERP token mixers and a dual-branch BridgeNet (with gradient truncation) to mitigate distortion and promote beneficial cross-task information sharing; 4) adds auxiliary tasks such as image gradient and point map to strengthen cross-task learning. Result: MTPano achieves state-of-the-art performance on multiple panoramic benchmarks and is competitive with task-specific panoramic foundation models. Conclusion: Through a label-free training paradigm and a geometry-aware multi-task disentangling architecture, MTPano markedly improves unified modeling of panoramic dense prediction tasks and offers a new direction for panoramic foundation models. Abstract: Comprehensive panoramic scene understanding is critical for immersive applications, yet it remains challenging due to the scarcity of high-resolution, multi-task annotations. While perspective foundation models have achieved success through data scaling, directly adapting them to the panoramic domain often fails due to severe geometric distortions and coordinate system discrepancies. Furthermore, the underlying relations between diverse dense prediction tasks in spherical spaces are underexplored. To address these challenges, we propose MTPano, a robust multi-task panoramic foundation model established by a label-free training pipeline. First, to circumvent data scarcity, we leverage powerful perspective dense priors. We project panoramic images into perspective patches to generate accurate, domain-gap-free pseudo-labels using off-the-shelf foundation models, which are then re-projected to serve as patch-wise supervision. Second, to tackle the interference between task types, we categorize tasks into rotation-invariant (e.g., depth, segmentation) and rotation-variant (e.g., surface normals) groups. We introduce the Panoramic Dual BridgeNet, which disentangles these feature streams via geometry-aware modulation layers that inject absolute position and ray direction priors. To handle the distortion from equirectangular projections (ERP), we incorporate ERP token mixers followed by a dual-branch BridgeNet for interactions with gradient truncation, facilitating beneficial cross-task information sharing while blocking conflicting gradients from incompatible task attributes. Additionally, we introduce auxiliary tasks (image gradient, point map, etc.) to enrich the cross-task learning process. Extensive experiments demonstrate that MTPano achieves state-of-the-art performance on multiple benchmarks and delivers competitive results against task-specific panoramic specialist foundation models.
### [78] [Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation](https://arxiv.org/abs/2602.05339) *Yongwoo Kim,Sungmin Cha,Hyunsoo Kim,Jaewon Lee,Donghyun Kim* Main category: cs.CV TL;DR: This paper proposes the PAIR framework, which uses unsafe-safe concept pairs to perform consistency-preserving semantic realignment, improving concept erasure in text-to-image diffusion models.
Details Motivation: Existing concept erasure methods focus only on removing unsafe concepts and give no guidance toward safe alternatives, making structural and semantic consistency hard to maintain. Method: Proposes the PAIR framework, comprising a Paired Semantic Realignment objective (which uses unsafe-safe pairs to explicitly map concepts) and Fisher-weighted Initialization for DoRA (which uses the paired data to initialize the low-rank adaptation matrices). Result: Significantly outperforms state-of-the-art methods on multiple metrics, erasing target concepts effectively while better preserving structural integrity, semantic coherence, and generation quality. Conclusion: Reframing concept erasure from simple removal to consistency-preserving semantic realignment is a viable new paradigm for improving the safety and controllability of text-to-image models. Abstract: With the increasing versatility of text-to-image diffusion models, the ability to selectively erase undesirable concepts (e.g., harmful content) has become indispensable. However, existing concept erasure approaches primarily focus on removing unsafe concepts without providing guidance toward corresponding safe alternatives, which often leads to failure in preserving the structural and semantic consistency between the original and erased generations. In this paper, we propose a novel framework, PAIRed Erasing (PAIR), which reframes concept erasure from simple removal to consistency-preserving semantic realignment using unsafe-safe pairs. We first generate safe counterparts from unsafe inputs while preserving structural and semantic fidelity, forming paired unsafe-safe multimodal data. Leveraging these pairs, we introduce two key components: (1) Paired Semantic Realignment, a guided objective that uses unsafe-safe pairs to explicitly map target concepts to semantically aligned safe anchors; and (2) Fisher-weighted Initialization for DoRA, which initializes parameter-efficient low-rank adaptation matrices using unsafe-safe pairs, encouraging the generation of safe alternatives while selectively suppressing unsafe concepts. Together, these components enable fine-grained erasure that removes only the targeted concepts while maintaining overall semantic consistency. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.
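### [79] [Learning with Adaptive Prototype Manifolds for Out-of-Distribution Detection](https://arxiv.org/abs/2602.05349) *Ningkang Peng,JiuTao Zhou,Yuhao Zhang,Xiaoqian Peng,Qianfeng Yu,Linjing Qian,Tingyu Lu,Yi Chen,Yanhui Gu* Main category: cs.CV TL;DR: This paper proposes the APEX framework, which uses an Adaptive Prototype Manifold (APM) and a Posterior-Aware OOD Scoring (PAOS) mechanism to address the static homogeneity assumption and the learning-inference disconnect in existing prototype learning methods, significantly improving OOD detection performance.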
Details Motivation: Existing prototype-based representation learning methods have two fundamental flaws: the Static Homogeneity Assumption (fixed representational resources for every class) and the Learning-Inference Disconnect (knowledge of prototype quality is ignored at inference), which limit model capacity and performance. Method: Proposes the APEX framework with a two-stage repair: (1) an Adaptive Prototype Manifold (APM), which uses the Minimum Description Length (MDL) principle to automatically determine the optimal number of prototypes $K_c^*$ for each class and resolve prototype collision; (2) a Posterior-Aware OOD Scoring (PAOS) mechanism, which quantifies prototype cohesion and separation to bridge the learning-inference disconnect. Result: Experiments on benchmarks such as CIFAR-100 show that APEX achieves new state-of-the-art performance. Conclusion: Through adaptive prototype design and posterior-aware scoring, APEX overcomes inherent flaws of prototype learning and significantly improves OOD detection capability and robustness. Abstract: Out-of-distribution (OOD) detection is a critical task for the safe deployment of machine learning models in the real world. Existing prototype-based representation learning methods have demonstrated exceptional performance. However, we identify two fundamental flaws that universally constrain these methods: the Static Homogeneity Assumption (fixed representational resources for all classes) and the Learning-Inference Disconnect (discarding rich prototype quality knowledge at inference). These flaws fundamentally limit the model's capacity and performance. To address these issues, we propose APEX (Adaptive Prototype for eXtensive OOD Detection), a novel OOD detection framework designed via a Two-Stage Repair process to optimize the learned feature manifold. APEX introduces two key innovations to address these respective flaws: (1) an Adaptive Prototype Manifold (APM), which leverages the Minimum Description Length (MDL) principle to automatically determine the optimal prototype complexity $K_c^*$ for each class, thereby fundamentally resolving prototype collision; and (2) a Posterior-Aware OOD Scoring (PAOS) mechanism, which quantifies prototype quality (cohesion and separation) to bridge the learning-inference disconnect. Comprehensive experiments on benchmarks such as CIFAR-100 validate the superiority of our method, where APEX achieves new state-of-the-art performance.
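### [80] [Multimodal Latent Reasoning via Hierarchical Visual Cues Injection](https://arxiv.org/abs/2602.05359) *Yiming Zhang,Qiangyu Yan,Borui Jiang,Kai Han* Main category: cs.CV TL;DR: This paper proposes the HIVE framework, which injects hierarchical visual cues into the latent space to enable multimodal slow-thinking reasoning, avoiding reliance on textual chains of thought and improving understanding of complex scenes.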
Details Motivation: Reasoning in existing MLLMs relies on end-to-end generation or language-centric chains of thought (CoT), which are inefficient, verbose, and prone to hallucination; more robust reasoning requires fusing multimodal signals in a latent space. Method: Proposes the HIVE framework, which recursively extends transformer blocks to build an internal iterative reasoning loop and injects hierarchical visual cues, from global scene context to fine-grained regional details, directly into the model's latent representations, so that multi-step reasoning takes place entirely in the aligned latent space. Result: Experiments show that test-time scaling is effective and that integrating hierarchical information significantly improves the model's understanding of complex scenes. Conclusion: Slow-thinking reasoning guided by hierarchical visual cues in the latent space is an effective route to more robust and accurate MLLMs, removing the dependence on superficial textual reasoning. Abstract: The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often remains a "fast thinking" paradigm, reliant on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (HIVE), a novel framework that instills deliberate, "slow thinking" without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it injectively grounds this process with hierarchical visual cues from global scene context to fine-grained regional details directly into the model's latent representations. This enables the model to perform grounded, multi-step inference entirely in the aligned latent space. Extensive evaluations demonstrate that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model's understanding of complex scenes.
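### [81] [Breaking Semantic Hegemony: Decoupling Principal and Residual Subspaces for Generalized OOD Detection](https://arxiv.org/abs/2602.05360) *Ningkang Peng,Xiaoqian Peng,Yuhao Zhang,Qianfeng Yu,Feng Xing,Peirong Ma,Xichen Yang,Yi Chen,Tingyu Lu,Yanhui Gu* Main category: cs.CV TL;DR: This paper identifies a "Simplicity Paradox" in existing OOD detection methods: they are sensitive to subtle semantic differences but insensitive to out-of-distribution samples that are structurally distinct yet semantically simple, and to high-frequency sensor noise. It proposes D-KNN, a training-free, plug-and-play framework that decouples semantic and structural information via orthogonal decomposition and adds a dual-space calibration mechanism, significantly improving OOD detection performance.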
Details Motivation: Existing feature-based OOD detection methods have made progress but perform poorly on samples that differ structurally yet are semantically simple, and on high-frequency sensor noise. This "Simplicity Paradox" stems from Semantic Hegemony in the deep feature space and the spectral concentration bias induced by Neural Collapse. Method: Proposes the D-KNN framework, which uses orthogonal decomposition to separate features into semantic principal components and structural residual components and designs a dual-space calibration mechanism to increase sensitivity to weak residual signals; it requires no additional training and is plug-and-play. Result: Sets a new state of the art on CIFAR and ImageNet benchmarks; FPR95 drops from 31.3% to 2.3%; AUROC on sensor failures such as Gaussian noise rises from 79.7% to 94.9%. Conclusion: Semantic Hegemony suppresses the expression of structural distribution shift signals; D-KNN breaks it through geometric decoupling, confirming that explicitly modeling structural information is key to OOD detection. Abstract: While feature-based post-hoc methods have made significant strides in Out-of-Distribution (OOD) detection, we uncover a counter-intuitive Simplicity Paradox in existing state-of-the-art (SOTA) models: these models exhibit keen sensitivity in distinguishing semantically subtle OOD samples but suffer from severe Geometric Blindness when confronting structurally distinct yet semantically simple samples or high-frequency sensor noise. We attribute this phenomenon to Semantic Hegemony within the deep feature space and reveal its mathematical essence through the lens of Neural Collapse. Theoretical analysis demonstrates that the spectral concentration bias, induced by the high variance of the principal subspace, numerically masks the structural distribution shift signals that should be significant in the residual subspace. To address this issue, we propose D-KNN, a training-free, plug-and-play geometric decoupling framework. This method utilizes orthogonal decomposition to explicitly separate semantic components from structural residuals and introduces a dual-space calibration mechanism to reactivate the model's sensitivity to weak residual signals. Extensive experiments demonstrate that D-KNN effectively breaks Semantic Hegemony, establishing new SOTA performance on both CIFAR and ImageNet benchmarks. Notably, in resolving the Simplicity Paradox, it reduces the FPR95 from 31.3% to 2.3%; when addressing sensor failures such as Gaussian noise, it boosts the detection performance (AUROC) from a baseline of 79.7% to 94.9%.
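The orthogonal decomposition described above can be sketched on toy features. The code below is a simplified illustration, assuming a single principal direction found by power iteration and using the residual norm as a crude structural score. The helper names are hypothetical, and D-KNN's actual subspace dimensions, k-nearest-neighbor scoring, and dual-space calibration are not reproduced.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_principal_direction(features, iters=100):
    """Mean and leading eigenvector of the feature covariance (power iteration)."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    centered = [[f[d] - mean[d] for d in range(dim)] for f in features]
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Multiply by the covariance without forming it: C v = X^T (X v) / n.
        proj = [dot(row, v) for row in centered]
        w = [sum(p * row[d] for p, row in zip(proj, centered)) / n for d in range(dim)]
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return mean, v


def decompose(feature, mean, direction):
    """Split a feature into its principal-subspace part and the orthogonal residual."""
    centered = [x - m for x, m in zip(feature, mean)]
    coeff = dot(centered, direction)
    principal = [coeff * d for d in direction]
    residual = [c - p for c, p in zip(centered, principal)]
    return principal, residual


# Toy in-distribution features: almost all variance lies along the first axis.
train = [[-2.0, 0.1, 0.0], [-1.0, -0.1, 0.0], [0.0, 0.0, 0.0], [1.0, 0.1, 0.0], [2.0, -0.1, 0.0]]
mean, direction = top_principal_direction(train)

in_dist = [1.5, 0.0, 0.0]
structural_outlier = [0.0, 0.0, 3.0]  # invisible along the principal direction

p_in, r_in = decompose(in_dist, mean, direction)
p_out, r_out = decompose(structural_outlier, mean, direction)
residual_norm_in = math.sqrt(dot(r_in, r_in))
residual_norm_out = math.sqrt(dot(r_out, r_out))
```

In this toy case the outlier has a zero coefficient along the principal direction, so a score based on the principal subspace alone would miss it, while its residual norm is large. That is the intuition behind scoring the two subspaces separately.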
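### [82] [Imagine a City: CityGenAgent for Procedural 3D City Generation](https://arxiv.org/abs/2602.05362) *Zishan Liu,Zecong Tang,RuoCheng Wu,Xinzhe Zheng,Jingyu Hu,Ka-Hei Hui,Haoran Xie,Bo Dai,Zhengzhe Liu* Main category: cs.CV TL;DR: This paper proposes CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities, which uses two-stage learning (supervised fine-tuning and reinforcement learning) to improve structural correctness, spatial reasoning, and text-image consistency, and supports natural language editing.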
Details Motivation: Existing automated 3D city generation methods fall short in high-fidelity asset creation, controllability, and editability, and cannot meet the needs of applications such as autonomous driving, virtual reality, and embodied intelligence. Method: Proposes the CityGenAgent framework, which decomposes city generation into two interpretable components, Block Program and Building Program, and adopts a two-stage learning strategy: (1) Supervised Fine-Tuning (SFT) ensures that generated programs satisfy structural constraints (e.g., non-self-intersecting polygons, complete fields); (2) Reinforcement Learning (RL) uses a Spatial Alignment Reward and a Visual Consistency Reward to strengthen spatial reasoning and text-image matching. Result: Experiments show that CityGenAgent outperforms existing methods in semantic alignment, visual quality, and controllability, supports natural language editing and manipulation, and generalizes well. Conclusion: CityGenAgent provides a robust foundation for scalable, controllable, high-quality 3D city generation and advances the integration of procedural modeling and generative AI for city-scale scenes. Abstract: The automated generation of interactive 3D cities is a critical challenge with broad applications in autonomous driving, virtual reality, and embodied intelligence. While recent advances in generative models and procedural techniques have improved the realism of city generation, existing methods often struggle with high-fidelity asset creation, controllability, and manipulation. In this work, we introduce CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. Our approach decomposes city generation into two interpretable components, Block Program and Building Program. To ensure structural correctness and semantic alignment, we adopt a two-stage learning strategy: (1) Supervised Fine-Tuning (SFT). We train BlockGen and BuildingGen to generate valid programs that adhere to schema constraints, including non-self-intersecting polygons and complete fields; (2) Reinforcement Learning (RL). We design Spatial Alignment Reward to enhance spatial reasoning ability and Visual Consistency Reward to bridge the gap between textual descriptions and the visual modality. Benefiting from the programs and the models' generalization, CityGenAgent supports natural language editing and manipulation. Comprehensive evaluations demonstrate superior semantic alignment, visual quality, and controllability compared to existing methods, establishing a robust foundation for scalable 3D city generation.
### [83] [SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback](https://arxiv.org/abs/2602.05380) *Xiaoxuan He,Siming Fu,Wanli Li,Zhiyuan Li,Dacheng Yin,Kang Rong,Fengyun Rao,Bo Zhang* Main category: cs.CV TL;DR: This paper proposes the SAIL framework, in which a diffusion model improves itself iteratively and achieves effective alignment with only a very small amount of human preference data and no external reward model.
Details Motivation: Existing diffusion model alignment methods depend on large amounts of human preference data or auxiliary reward models, which is costly and impractical; this paper asks whether effective alignment is possible with minimal human feedback and no external reward model by drawing on the latent capabilities of the diffusion model itself. Method: Proposes the SAIL (Self-Amplified Iterative Learning) framework: starting from a small seed set of human-annotated preference pairs, it iterates in a closed loop over sample generation, self-annotation of preferences based on the model's evolving understanding, and model updates on the self-augmented dataset; a ranked preference mixup strategy balances exploration against adherence to the initial human priors. Result: Experiments show that SAIL consistently outperforms state-of-the-art methods on multiple benchmarks while using only 6% of the preference data required by existing approaches. Conclusion: Diffusion models have notable self-improvement capabilities that, when properly harnessed, can replace large-scale human annotation and external reward models, enabling efficient, low-cost alignment with human preferences. Abstract: Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves? In this paper, we propose SAIL (Self-Amplified Iterative Learning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.
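### [84] [VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs](https://arxiv.org/abs/2602.05382) *Tina Khezresmaeilzadeh,Jike Zhong,Konstantinos Psounis* Main category: cs.CV TL;DR: This paper introduces the VRIQ benchmark for evaluating the nonverbal visual reasoning ability of vision-language models (VLMs) and finds that they perform poorly on both abstract puzzles and natural-image tasks, with perception rather than reasoning as the main bottleneck.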
Details Motivation: To investigate whether current vision-language models (VLMs) can reliably perform nonverbal visual reasoning. Method: Build the VRIQ benchmark with two task types, abstract puzzle-style and natural-image reasoning; design diagnostic probes that attribute failures to perception or reasoning, and evaluate specific perception categories (shape, count, position, 3D/depth) at fine granularity. Result: Average accuracy is only about 28% on abstract tasks and 45% on natural tasks; about 56% of failures stem from perception alone, 43% from both perception and reasoning, and only 1% from reasoning alone; certain perception categories cause more failures than others. Conclusion: Current VLMs are unreliable visual reasoners, mainly because of limited perception rather than flawed reasoning; VRIQ offers an interpretable, diagnostic basis for improving visual reasoning in multimodal systems. Abstract: Recent progress in Vision Language Models (VLMs) has raised the question of whether they can reliably perform nonverbal reasoning. To this end, we introduce VRIQ (Visual Reasoning IQ), a novel benchmark designed to assess and analyze the visual reasoning ability of VLMs. We evaluate models on two sets of tasks: abstract puzzle-style and natural-image reasoning tasks. We find that on abstract puzzles, performance remains near random with an average accuracy of around 28%, while natural tasks yield better but still weak results with 45% accuracy. We also find that tool-augmented reasoning demonstrates only modest improvements. To uncover the source of this weakness, we introduce diagnostic probes targeting perception and reasoning. Our analysis demonstrates that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and only a mere 1% from reasoning alone. This motivates us to design fine-grained diagnostic probe questions targeting specific perception categories (e.g., shape, count, position, 3D/depth), revealing that certain categories cause more failures than others. Our benchmark and analysis establish that current VLMs, even with visual reasoning tools, remain unreliable abstract reasoners, mostly due to perception limitations, and offer a principled basis for improving visual reasoning in multimodal systems.
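### [85] [Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting](https://arxiv.org/abs/2602.05384) *Hao Feng,Wei Shi,Ke Zhang,Xiang Fei,Lei Liao,Dingkang Yang,Yongkun Du,Xuecheng Wu,Jingqun Tang,Yang Liu,Hong Chen,Can Huang* Main category: cs.CV TL;DR: Dolphin-v2 is a two-stage document image parsing model that jointly performs document type classification and layout analysis, then applies a hybrid parsing strategy by document type (holistic page parsing vs. element-wise parallel parsing). It substantially improves robustness to distorted or photographed documents, fine-grained element detection (21 categories), and code block recognition, with large gains on several benchmarks.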
Details Motivation: Existing document parsing systems are highly fragmented; two-stage methods that rely on axis-aligned bounding boxes struggle with distorted or photographed documents, and they lack type-adaptive parsing and fine-grained semantic attribute extraction. Method: Dolphin-v2 uses a two-stage architecture. The first stage jointly performs document type classification (digital-born vs. photographed) and layout analysis, including reading order prediction. The second stage uses a hybrid parsing strategy: photographed documents are parsed holistically at page level to cope with geometric distortion, while digital-born documents are parsed element by element in parallel, guided by layout anchors. The model also adds code block recognition with indentation preservation, detection of 21 fine-grained element categories, and semantic attribute extraction. Result: +14.78 points overall on OmniDocBench and a 91% error reduction on photographed documents, with evaluation also on DocPTBench and RealDoc-160; inference remains efficient through parallel processing. Conclusion: Through structured design and task coordination, Dolphin-v2 unifies diverse document parsing needs and substantially improves robustness, granularity, and practicality for general document understanding. Abstract: Document parsing has garnered widespread attention as vision-language models (VLMs) advance OCR capabilities. However, the field remains fragmented across dozens of specialized models with varying strengths, forcing users to navigate complex model selection and limiting system scalability. Moreover, existing two-stage approaches depend on axis-aligned bounding boxes for layout detection, failing to handle distorted or photographed documents effectively. To this end, we present Dolphin-v2, a two-stage document image parsing model that substantially improves upon the original Dolphin. In the first stage, Dolphin-v2 jointly performs document type classification (digital-born versus photographed) alongside layout analysis. For digital-born documents, it conducts finer-grained element detection with reading order prediction. In the second stage, we employ a hybrid parsing strategy: photographed documents are parsed holistically as complete pages to handle geometric distortions, while digital-born documents undergo element-wise parallel parsing guided by the detected layout anchors, enabling efficient content extraction. Compared with the original Dolphin, Dolphin-v2 introduces several crucial enhancements: (1) robust parsing of photographed documents via holistic page-level understanding, (2) finer-grained element detection (21 categories) with semantic attribute extraction such as author information and document metadata, and (3) code block recognition with indentation preservation, which existing systems typically lack. 
Comprehensive evaluations are conducted on DocPTBench, OmniDocBench, and our self-constructed RealDoc-160 benchmark. The results demonstrate substantial improvements: +14.78 points overall on the challenging OmniDocBench and 91% error reduction on photographed documents, while maintaining efficient inference through parallel processing.
### [86] [Parallel Swin Transformer-Enhanced 3D MRI-to-CT Synthesis for MRI-Only Radiotherapy Planning](https://arxiv.org/abs/2602.05387) *Zolnamar Dorjsembe,Hung-Yi Chen,Furen Xiao,Hsing-Kuo Pao* Main category: cs.CV TL;DR: This paper proposes Parallel Swin Transformer-Enhanced Med2Transformer, a 3D network that synthesizes CT images to support MRI-only radiotherapy planning, improving image similarity, geometric accuracy, and dose calculation accuracy.
Details Motivation: MRI lacks electron density information and cannot be used directly for radiotherapy dose calculation, so current workflows combine MRI and CT scans, which adds registration uncertainty and procedural complexity; synthetic CT enables MRI-only planning but is challenged by the nonlinear MRI-CT relationship and anatomical variability. Method: A 3D network (Parallel Swin Transformer-Enhanced Med2Transformer) that combines convolutional encoding with dual Swin Transformer branches, using multi-scale shifted window attention and hierarchical feature aggregation to model both local detail and long-range contextual dependencies. Result: On public and clinical datasets the method achieves higher image similarity and better geometric accuracy than baseline methods; dosimetric evaluation shows a mean target dose error of 1.69%, which the authors report as clinically acceptable. Conclusion: The method improves the quality of MRI-to-CT synthesis and the reliability of dose calculation, supporting MRI-only radiotherapy workflows. Abstract: MRI provides superior soft tissue contrast without ionizing radiation; however, the absence of electron density information limits its direct use for dose calculation. As a result, current radiotherapy workflows rely on combined MRI and CT acquisitions, increasing registration uncertainty and procedural complexity. Synthetic CT generation enables MRI-only planning but remains challenging due to nonlinear MRI-CT relationships and anatomical variability. We propose Parallel Swin Transformer-Enhanced Med2Transformer, a 3D architecture that integrates convolutional encoding with dual Swin Transformer branches to model both local anatomical detail and long-range contextual dependencies. Multi-scale shifted window attention with hierarchical feature aggregation improves anatomical fidelity. Experiments on public and clinical datasets demonstrate higher image similarity and improved geometric accuracy compared with baseline methods. Dosimetric evaluation shows clinically acceptable performance, with a mean target dose error of 1.69%. Code is available at: https://github.com/mobaidoctor/med2transformer.
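### [87] [Dataset Distillation via Relative Distribution Matching and Cognitive Heritage](https://arxiv.org/abs/2602.05391) *Qianxin Xia,Jiawei Du,Yuhan Zhang,Jielei Wang,Guoming Lu* Main category: cs.CV TL;DR: This paper proposes statistical flow matching for dataset distillation, which optimizes synthetic images by aligning constant statistical flows between class centers of the original data. It sharply reduces compute and memory cost, and a classifier inheritance strategy further improves efficiency and performance.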
Details Motivation: Existing dataset distillation methods based on linear gradient matching incur high compute and memory cost, especially with pre-trained self-supervised backbones, because each step loads a large batch of real images and applies differentiable augmentation several times. Method: A statistical flow matching framework that loads statistics of the original data (such as class centers) only once and performs a single augmentation pass on the synthetic images; a classifier inheritance strategy reuses the classifier trained on the original dataset and adds only a lightweight linear projector. Result: Compared with state-of-the-art methods, GPU memory drops by 10x and runtime by 4x with comparable or better performance; classifier inheritance needs only marginal storage while yielding substantial performance gains. Conclusion: Statistical flow matching is a stable and efficient approach to dataset distillation; combined with classifier inheritance it keeps performance high while greatly improving resource efficiency. Abstract: Dataset distillation seeks to synthesize a highly compact dataset that achieves performance comparable to the original dataset on downstream tasks. For classification tasks that use pre-trained self-supervised models as backbones, previous linear gradient matching optimizes synthetic images by encouraging them to mimic the gradient updates induced by real images on the linear classifier. However, this batch-level formulation requires loading thousands of real images and applying multiple rounds of differentiable augmentations to synthetic images at each distillation step, leading to substantial computational and memory overhead. In this paper, we introduce statistical flow matching, a stable and efficient supervised learning framework that optimizes synthetic images by aligning constant statistical flows from target class centers to non-target class centers in the original data. Our approach loads raw statistics only once and performs a single augmentation pass on the synthetic data, achieving performance comparable to or better than the state-of-the-art methods with 10x lower GPU memory usage and 4x shorter runtime. Furthermore, we propose a classifier inheritance strategy that reuses the classifier trained on the original dataset for inference, requiring only an extremely lightweight linear projector and marginal storage while achieving substantial performance gains.
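### [88] [Explainable Pathomics Feature Visualization via Correlation-aware Conditional Feature Editing](https://arxiv.org/abs/2602.05397) *Yuechen Yang,Junlin Guo,Ruining Deng,Junchao Zhu,Zhengyi Lu,Chongyu Qu,Yanfan Zhu,Xingyi Guo,Yu Wang,Shilin Zhao,Haichun Yang,Yuankai Huo* Main category: cs.CV TL;DR: This paper proposes a Manifold-Aware Diffusion (MAD) model for digital pathology that regularizes feature trajectories in a disentangled latent space learned by a VAE, enabling controllable and biologically plausible editing of cell nuclei images and avoiding the distortions that arise when conditional diffusion models ignore feature correlations.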
Details Motivation: Many pathomics features (e.g., "second-order moment") are hard to interpret, and conditional diffusion models usually assume feature independence, whereas pathomics features are strongly correlated; editing one feature directly can leave the biological manifold and produce unrealistic images. Method: The Manifold-Aware Diffusion (MAD) framework first learns a disentangled latent space with a variational auto-encoder (VAE), then regularizes the editing trajectory of the target feature in that space so that correlated attributes adjust together and stay on the manifold, and finally feeds the optimized features to a conditional diffusion model to synthesize high-fidelity images. Result: Experiments show that MAD can navigate and edit within the manifold of pathomics features, outperforming baseline methods on conditional feature editing while preserving structural coherence. Conclusion: By making the editing process manifold-aware, MAD improves the controllability, biological plausibility, and interpretability of pathology image generation. Abstract: Pathomics is a recent approach that offers rich quantitative features beyond what black-box deep learning can provide, supporting more reproducible and explainable biomarkers in digital pathology. However, many derived features (e.g., "second-order moment") remain difficult to interpret, especially across different clinical contexts, which limits their practical adoption. Conditional diffusion models show promise for explainability through feature editing, but they typically assume feature independence, an assumption violated by intrinsically correlated pathomics features. Consequently, editing one feature while fixing others can push the model off the biological manifold and produce unrealistic artifacts. To address this, we propose a Manifold-Aware Diffusion (MAD) framework for controllable and biologically plausible cell nuclei editing. Unlike existing approaches, our method regularizes feature trajectories within a disentangled latent space learned by a variational auto-encoder (VAE). This ensures that manipulating a target feature automatically adjusts correlated attributes to remain within the learned distribution of real cells. These optimized features then guide a conditional diffusion model to synthesize high-fidelity images. Experiments demonstrate that our approach is able to navigate the manifold of pathomics features when editing those features. The proposed method outperforms baseline methods in conditional feature editing while preserving structural coherence.
### [89] [TSBOW: Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions](https://arxiv.org/abs/2602.05414) *Ngoc Doan-Minh Huynh,Duong Nguyen-Ngoc Tran,Long Hoang Pham,Tai Huu-Phuong Tran,Hyung-Joon Jeon,Huy-Hung Nguyen,Duong Khac Vu,Hyung-Min Jeon,Son Hong Phan,Quoc Pham-Nam Ho,Chi Dai Tran,Trinh Le Ba Khanh,Jae Wook Jeon* Main category: cs.CV TL;DR: This paper presents TSBOW, a dataset for improving occluded vehicle detection under extreme weather conditions. It contains a large number of real-world urban traffic video frames and comes with an object detection benchmark.
Details Motivation: Global warming has increased the frequency and severity of extreme weather events, which degrade CCTV video quality, disrupt traffic flow, and raise accident rates; existing datasets do not cover extreme weather, so a more comprehensive dataset is needed. Method: Build TSBOW, a large-scale real-world traffic surveillance dataset with over 32 hours of video from densely populated urban areas, more than 48,000 manually annotated and 3.2 million semi-labeled frames, and eight traffic participant classes; establish an object detection benchmark that targets occlusion and adverse weather. Result: TSBOW covers varied road types, scales, and viewpoints, making it a useful resource for Intelligent Transportation Systems; the findings underline the potential of CCTV-based traffic monitoring. Conclusion: TSBOW fills a gap in datasets for occluded vehicle detection under extreme weather, supports further algorithm development and applications, and is publicly available. Abstract: Global warming has intensified the frequency and severity of extreme weather events, which degrade CCTV signal and video quality while disrupting traffic flow, thereby increasing traffic accident rates. Existing datasets, often limited to light haze, rain, and snow, fail to capture extreme weather conditions. To address this gap, this study introduces the Traffic Surveillance Benchmark for Occluded vehicles under various Weather conditions (TSBOW), a comprehensive dataset designed to enhance occluded vehicle detection across diverse annual weather scenarios. Comprising over 32 hours of real-world traffic data from densely populated urban areas, TSBOW includes more than 48,000 manually annotated and 3.2 million semi-labeled frames, with bounding boxes spanning eight traffic participant classes, from large vehicles to micromobility devices and pedestrians. We establish an object detection benchmark for TSBOW, highlighting challenges posed by occlusions and adverse weather. With its varied road types, scales, and viewpoints, TSBOW serves as a critical resource for advancing Intelligent Transportation Systems. Our findings underscore the potential of CCTV-based traffic monitoring and pave the way for new research and applications. The TSBOW dataset is publicly available at: https://github.com/SKKUAutoLab/TSBOW.
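### [90] [VMF-GOS: Geometry-guided virtual Outlier Synthesis for Long-Tailed OOD Detection](https://arxiv.org/abs/2602.05415) *Ningkang Peng,Qianfeng Yu,Yuhao Zhang,Yafei Liu,Xiaoqian Peng,Peirong Ma,Yi Chen,Peiheng Li,Yanhui Gu* Main category: cs.CV TL;DR: This paper proposes Geometry-guided virtual Outlier Synthesis (GOS), which needs no external data, and combines it with a Dual-Granularity Semantic Loss (DGS) to achieve strong out-of-distribution detection under long-tailed distributions.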
Details Motivation: OOD detection methods that rely on external outlier data (such as 80 Million Tiny Images) are limited in deployment by data acquisition cost and privacy concerns, and under long-tailed distributions the scarcity of tail-class samples blurs decision boundaries. Method: A data-free framework: 1) model feature statistics on a hypersphere with the von Mises-Fisher (vMF) distribution, locate a low-likelihood annulus, and sample virtual outliers directionally in that region (GOS); 2) a Dual-Granularity Semantic Loss (DGS) that uses contrastive learning to maximize the distinction between in-distribution (ID) features and the synthesized boundary outliers. Result: On benchmarks such as CIFAR-LT, the method outperforms SOTA approaches that use real external images. Conclusion: The GOS+DGS framework removes the dependence on external data and gives more robust, more practical OOD detection in long-tailed settings. Abstract: Out-of-Distribution (OOD) detection under long-tailed distributions is a highly challenging task because the scarcity of samples in tail classes leads to blurred decision boundaries in the feature space. Current state-of-the-art (SOTA) methods typically employ Outlier Exposure (OE) strategies, relying on large-scale real external datasets (such as 80 Million Tiny Images) to regularize the feature space. However, this dependence on external data often becomes infeasible in practical deployment due to high data acquisition costs and privacy sensitivity. To this end, we propose a novel data-free framework aimed at completely eliminating reliance on external datasets while maintaining superior detection performance. We introduce a Geometry-guided virtual Outlier Synthesis (GOS) strategy that models statistical properties using the von Mises-Fisher (vMF) distribution on a hypersphere. Specifically, we locate a low-likelihood annulus in the feature space and perform directional sampling of virtual outliers in this region. Simultaneously, we introduce a new Dual-Granularity Semantic Loss (DGS) that utilizes contrastive learning to maximize the distinction between in-distribution (ID) features and these synthesized boundary outliers. Extensive experiments on benchmarks such as CIFAR-LT demonstrate that our method outperforms SOTA approaches that utilize external real images.
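The annulus idea in the abstract above can be illustrated with a minimal sketch. Under a vMF distribution the likelihood of a unit vector depends only on its cosine similarity to the mean direction, so a low-likelihood annulus corresponds to a band of cosine values. The function names, the cosine band, and the uniform choice of the cosine within the band are assumptions made for illustration; this is not the paper's GOS procedure.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mean_direction(features):
    """Mean direction of a set of feature vectors (the vMF location parameter)."""
    dim = len(features[0])
    summed = [sum(f[i] for f in features) for i in range(dim)]
    return normalize(summed)

def sample_on_annulus(mu, cos_low, cos_high, rng):
    """Draw a unit vector whose cosine similarity to mu lies in [cos_low, cos_high].

    Assumption: the cosine is drawn uniformly within the band, which is a
    simplification rather than the exact vMF marginal.
    """
    cos_t = rng.uniform(cos_low, cos_high)
    # Random unit direction orthogonal to mu (Gram-Schmidt on Gaussian noise).
    noise = [rng.gauss(0.0, 1.0) for _ in mu]
    dot = sum(n * m for n, m in zip(noise, mu))
    tangent = normalize([n - dot * m for n, m in zip(noise, mu)])
    sin_t = math.sqrt(1.0 - cos_t * cos_t)
    return [cos_t * m + sin_t * t for m, t in zip(mu, tangent)]

# Toy in-distribution features clustered around the first axis (hypothetical data).
features = [[1.0, 0.1, 0.0], [0.9, -0.1, 0.05], [1.0, 0.0, -0.05]]
mu = mean_direction(features)
rng = random.Random(0)
virtual_outliers = [sample_on_annulus(mu, 0.2, 0.5, rng) for _ in range(100)]
```

Each sampled vector stays on the unit hypersphere and sits at a controlled angular distance from the class mean, which is the geometric property the virtual outliers are meant to have.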
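### [91] [Disco: Densely-overlapping Cell Instance Segmentation via Adjacency-aware Collaborative Coloring](https://arxiv.org/abs/2602.05420) *Rui Sun,Yiwen Yang,Kaiyu Guo,Chen Jiang,Dongli Xu,Zhaonan Liu,Tan Pan,Limei Han,Xue Jiang,Wu Wei,Yuan Cheng* Main category: cs.CV TL;DR: This paper proposes Disco, which addresses instance segmentation of densely overlapping cells through adjacency-aware collaborative coloring, and releases the large-scale GBC-FS 2025 dataset. It gives the first systematic analysis of the chromatic properties of cell adjacency graphs, finding that most real-world cell graphs are non-bipartite and need modeling beyond 2-coloring.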
Details Motivation: Methods based on contour detection and distance mapping struggle with complex, dense cellular regions; graph coloring-based methods are promising, but their effectiveness in real scenes with dense overlaps and complex topologies has not been verified. Method: The Disco framework has two parts: 1) an "Explicit Marking" strategy that recursively decomposes the cell graph and isolates a conflict set, turning the topological problem into a learnable classification task; 2) an "Implicit Disambiguation" mechanism that enforces feature dissimilarity between different instances so the model learns separable representations. The authors also release the GBC-FS 2025 dataset and systematically analyze the chromatic properties of cell adjacency graphs across four datasets. Result: Most real-world cell adjacency graphs contain many odd-length cycles (predominantly triangles) and are non-bipartite, so 2-coloring is insufficient, while higher-chromaticity models cause representational redundancy and optimization difficulties; the summary reports that Disco improves segmentation accuracy in complex dense scenes. Conclusion: Cell instance segmentation needs coloring models that fit real graph structure; Disco combines adjacency-aware collaborative coloring with joint topological and learned components for dense pathology image analysis. Abstract: Accurate cell instance segmentation is foundational for digital pathology analysis. Existing methods based on contour detection and distance mapping still face significant challenges in processing complex and dense cellular regions. Graph coloring-based methods provide a new paradigm for this task, yet the effectiveness of this paradigm in real-world scenarios with dense overlaps and complex topologies has not been verified. Addressing this issue, we release a large-scale dataset GBC-FS 2025, which contains highly complex and dense sub-cellular nuclear arrangements. We conduct the first systematic analysis of the chromatic properties of cell adjacency graphs across four diverse datasets and reveal an important discovery: most real-world cell graphs are non-bipartite, with a high prevalence of odd-length cycles (predominantly triangles). This makes simple 2-coloring theory insufficient for handling complex tissues, while higher-chromaticity models would cause representational redundancy and optimization difficulties. Building on this observation of complex real-world contexts, we propose Disco (Densely-overlapping Cell Instance Segmentation via Adjacency-aware COllaborative Coloring), an adjacency-aware framework based on the "divide and conquer" principle. It uniquely combines a data-driven topological labeling strategy with a constrained deep learning system to resolve complex adjacency conflicts. First, the "Explicit Marking" strategy transforms the topological challenge into a learnable classification task by recursively decomposing the cell graph and isolating a "conflict set." 
Second, the "Implicit Disambiguation" mechanism resolves ambiguities in conflict regions by enforcing feature dissimilarity between different instances, enabling the model to learn separable feature representations.
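The observation that triangles make 2-coloring insufficient can be checked with a small sketch. The code below attempts a breadth-first 2-coloring of an adjacency graph and records edges whose endpoints end up with the same color. The function name and the toy graphs are hypothetical, and the returned edge list is only a loose analogy to a conflict set; it is not the paper's Explicit Marking procedure.

```python
from collections import deque

def two_color(adjacency):
    """Try to 2-color an undirected graph given as {node: [neighbors]}.

    Returns (colors, conflicts), where conflicts lists each edge whose two
    endpoints received the same color. An empty list means the graph is bipartite.
    """
    colors = {}
    conflicts = []
    for start in adjacency:
        if start in colors:
            continue
        colors[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adjacency[node]:
                if nb not in colors:
                    colors[nb] = 1 - colors[node]
                    queue.append(nb)
                elif colors[nb] == colors[node] and node < nb:
                    # Same-colored neighbors; node < nb records the edge once.
                    conflicts.append((node, nb))
    return colors, conflicts

# Three mutually touching cells (a triangle) plus one extra neighbor.
triangle_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
# A simple chain of three cells.
chain_graph = {0: [1], 1: [0, 2], 2: [1]}

_, triangle_conflicts = two_color(triangle_graph)
_, chain_conflicts = two_color(chain_graph)
```

The chain is colored without conflicts, while the triangle always leaves one edge with equal colors, which is why odd-length cycles rule out a pure 2-coloring.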
### [92] [NeVStereo: A NeRF-Driven NVS-Stereo Architecture for High-Fidelity 3D Tasks](https://arxiv.org/abs/2602.05423) *Pengcheng Chen,Yue Hu,Wenhao Li,Nicole M Gunderson,Andrew Feng,Zhenglong Sun,Peter Beerel,Eric J Seibel* Main category: cs.CV TL;DR: NeVStereo is a NeRF-driven NVS-stereo architecture that jointly estimates camera poses, depth maps, novel view synthesis, and 3D surface reconstruction from multi-view RGB images, improving geometric consistency and rendering quality.
Details Motivation: Existing methods struggle to deliver accurate pose estimation, reliable depth, high-quality novel view synthesis, and accurate 3D surface reconstruction in a single framework; feed-forward systems do not explicitly output NVS, and neural rendering methods are sensitive to pose errors. Method: NeVStereo combines NeRF-based novel view synthesis (producing stereo-friendly renderings), confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment for pose refinement, and an iterative refinement stage that updates both depth and the radiance field. Result: Strong zero-shot performance across indoor, outdoor, tabletop, and aerial benchmarks: up to 36% lower depth error, 10.4% better pose accuracy, 4.5% higher NVS fidelity, and state-of-the-art mesh quality (F1 91.93%, Chamfer 4.35 mm). Conclusion: NeVStereo unifies several dense 3D reconstruction tasks and mitigates common NeRF issues such as surface stacking, artifacts, and pose-depth coupling. Abstract: In modern dense 3D reconstruction, feed-forward systems (e.g., VGGT, pi3) focus on end-to-end matching and geometry prediction but do not explicitly output the novel view synthesis (NVS). Neural rendering-based approaches offer high-fidelity NVS and detailed geometry from posed images, yet they typically assume fixed camera poses and can be sensitive to pose errors. As a result, it remains non-trivial to obtain a single framework that can offer accurate poses, reliable depth, high-quality rendering, and accurate 3D surfaces from casually captured views. We present NeVStereo, a NeRF-driven NVS-stereo architecture that aims to jointly deliver camera poses, multi-view depth, novel view synthesis, and surface reconstruction from multi-view RGB-only inputs. NeVStereo combines NeRF-based NVS for stereo-friendly renderings, confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment for pose refinement, and an iterative refinement stage that updates both depth and the radiance field to improve geometric consistency. This design mitigates common NeRF-based issues such as surface stacking, artifacts, and pose-depth coupling. Across indoor, outdoor, tabletop, and aerial benchmarks, our experiments indicate that NeVStereo achieves consistently strong zero-shot performance, with up to 36% lower depth error, 10.4% improved pose accuracy, 4.5% higher NVS fidelity, and state-of-the-art mesh quality (F1 91.93%, Chamfer 4.35 mm) compared to existing leading methods.
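### [93] [Multi-AD: Cross-Domain Unsupervised Anomaly Detection for Medical and Industrial Applications](https://arxiv.org/abs/2602.05426) *Wahyu Rahmaniar,Kenji Suzuki* Main category: cs.CV TL;DR: This paper proposes Multi-AD, an unsupervised anomaly detection CNN that combines SE attention, knowledge distillation, and a discriminator network to achieve robust cross-domain anomaly detection on medical and industrial images, with SOTA results on several datasets.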
Details Motivation: Deep learning models in cross-domain settings (such as early disease diagnosis in medicine and defect detection in industry) often face a shortage of annotated data, and the problem is especially acute for anomaly detection. Method: The Multi-AD model 1) uses squeeze-and-excitation (SE) blocks for channel-wise feature attention; 2) applies knowledge distillation (KD) in a teacher-student architecture to transfer information about the differences between normal and anomalous data; 3) adds a discriminator network to strengthen discrimination; 4) fuses multi-scale features to detect anomalies of different sizes. Result: Evaluated on several medical datasets (brain MRI, liver CT, retina OCT) and industrial datasets (MVTec AD), the model reaches image-level AUROC of 81.4% (medical) and 99.6% (industrial) and pixel-level AUROC of 97.0% (medical) and 98.4% (industrial), outperforming existing methods. Conclusion: By combining attention, knowledge distillation, and multi-scale modeling, Multi-AD improves the robustness and generalization of unsupervised cross-domain anomaly detection and suits real-world use. Abstract: Traditional deep learning models often lack annotated data, especially in cross-domain applications such as anomaly detection, which is critical for early disease diagnosis in medicine and defect detection in industry. To address this challenge, we propose Multi-AD, a convolutional neural network (CNN) model for robust unsupervised anomaly detection across medical and industrial images. Our approach employs the squeeze-and-excitation (SE) block to enhance feature extraction via channel-wise attention, enabling the model to focus on the most relevant features and detect subtle anomalies. Knowledge distillation (KD) transfers informative features from the teacher to the student model, enabling effective learning of the differences between normal and anomalous data. Then, the discriminator network further enhances the model's capacity to distinguish between normal and anomalous data. At the inference stage, by integrating multi-scale features, the student model can detect anomalies of varying sizes. The teacher-student (T-S) architecture ensures consistent representation of high-dimensional features while adapting them to enhance anomaly detection. Multi-AD was evaluated on several medical datasets, including brain MRI, liver CT, and retina OCT, as well as industrial datasets, such as MVTec AD, demonstrating strong generalization across multiple domains. 
Experimental results demonstrated that our approach consistently outperformed state-of-the-art models, achieving the best average AUROC for both image-level (81.4% for medical and 99.6% for industrial) and pixel-level (97.0% for medical and 98.4% for industrial) tasks, making it effective for real-world applications.
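The channel-wise attention mentioned in the Multi-AD abstract follows the general squeeze-and-excitation pattern: global average pooling per channel, a small bottleneck of two linear layers (ReLU, then sigmoid), and rescaling of each channel by its gate. The sketch below shows that pattern in plain Python. The weights, shapes, and the omission of bias terms are assumptions chosen for illustration; it does not reproduce Multi-AD's actual layers.

```python
import math

def se_block(feature_map, w1, w2):
    """Squeeze-and-excitation over a feature map stored as C channels of H x W lists.

    w1 has shape (C // r) x C and w2 has shape C x (C // r); biases are omitted
    in this sketch. Returns (rescaled feature map, per-channel gates).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation: bottleneck linear layer with ReLU ...
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # ... followed by a linear layer with sigmoid, giving gates in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch]
              for ch, g in zip(feature_map, gates)]
    return scaled, gates

# Two 2x2 channels with hypothetical values and hand-picked weights.
feature_map = [[[1.0, 1.0], [1.0, 1.0]],
               [[3.0, 3.0], [3.0, 3.0]]]
w1 = [[1.0, 0.0]]        # 1 x 2 bottleneck
w2 = [[0.0], [1.0]]      # 2 x 1 expansion
scaled, gates = se_block(feature_map, w1, w2)
```

With these weights the first channel gets a neutral gate of 0.5 and the second a larger one, so the block reweights channels without changing spatial layout.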
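### [94] [LD-SLRO: Latent Diffusion Structured Light for 3-D Reconstruction of Highly Reflective Objects](https://arxiv.org/abs/2602.05434) *Sanghoon Jeon,Gihyun Jung,Suhyeon Ka,Jae-Sang Hyun* Main category: cs.CV TL;DR: This paper proposes LD-SLRO, a structured light method based on a latent diffusion model that improves fringe image quality on highly reflective, low-roughness surfaces and thereby raises 3-D reconstruction accuracy.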
Details Motivation: In fringe projection 3-D reconstruction, objects with high reflectivity and low surface roughness suffer from specular reflection and indirect illumination, which distort or erase the projected fringes. Method: LD-SLRO first encodes the phase-shifted fringe images to extract latent features that capture surface reflectance characteristics, then feeds these features as conditions to a latent diffusion model that probabilistically suppresses reflection artifacts and restores missing fringes; a specular reflection encoder, a time-variant channel affine layer, and attention modules further improve restoration. Result: Experiments show improved fringe image quality and 3-D reconstruction accuracy, with the average root-mean-squared error falling from 1.8176 mm to 0.9619 mm. Conclusion: LD-SLRO addresses fringe degradation in 3-D measurement of highly reflective surfaces and outperforms state-of-the-art methods in both fringe quality and 3-D reconstruction accuracy. Abstract: Fringe projection profilometry-based 3-D reconstruction of objects with high reflectivity and low surface roughness remains a significant challenge. When measuring such glossy surfaces, specular reflection and indirect illumination often lead to severe distortion or loss of the projected fringe patterns. To address these issues, we propose a latent diffusion-based structured light for reflective objects (LD-SLRO). Phase-shifted fringe images captured from highly reflective surfaces are first encoded to extract latent representations that capture surface reflectance characteristics. These latent features are then used as conditional inputs to a latent diffusion model, which probabilistically suppresses reflection-induced artifacts and recovers lost fringe information, yielding high-quality fringe images. The proposed components, including the specular reflection encoder, time-variant channel affine layer, and attention modules, further improve fringe restoration quality. In addition, LD-SLRO provides high flexibility in configuring the input and output fringe sets. Experimental results demonstrate that the proposed method improves both fringe quality and 3-D reconstruction accuracy over state-of-the-art methods, reducing the average root-mean-squared error from 1.8176 mm to 0.9619 mm.
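As a quick reading aid, the reported error reduction can be expressed as a relative change. The two RMSE values come from the abstract; the percentage is derived here and is not stated by the authors.

```latex
\[
\frac{\mathrm{RMSE}_{\text{before}} - \mathrm{RMSE}_{\text{after}}}{\mathrm{RMSE}_{\text{before}}}
= \frac{1.8176\ \text{mm} - 0.9619\ \text{mm}}{1.8176\ \text{mm}}
= \frac{0.8557}{1.8176}
\approx 0.471
\]
```

That is, the average RMSE falls by roughly 47%.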
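### [95] [Stable Velocity: A Variance Perspective on Flow Matching](https://arxiv.org/abs/2602.05435) *Donglin Yang,Yongxing Zhang,Xin Yu,Liang Hou,Xin Tao,Pengfei Wan,Xiaojuan Qi,Renjie Liao* Main category: cs.CV TL;DR: This paper proposes the Stable Velocity framework. It analyzes the high variance caused by single-sample conditional velocities in flow matching, introduces a variance-reduced training objective (StableVM) and adaptive auxiliary supervision (VA-REPA), and at inference exploits the dynamics of the low-variance regime for finetuning-free accelerated sampling (StableVS), improving training efficiency and sampling speed.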
Details Motivation: Flow matching relies on single-sample conditional velocities, which gives high-variance training targets that destabilize optimization and slow convergence, especially near the prior; the variance needs to be characterized and reduced. Method: 1) Characterize the variance in flow matching and identify high- and low-variance regimes; 2) StableVM, an unbiased variance-reduction training objective; 3) VA-REPA, which strengthens auxiliary supervision in the low-variance regime; 4) StableVS, a finetuning-free accelerated sampler that uses closed-form simplifications of the dynamics in the low-variance regime. Result: On ImageNet 256×256 and large pretrained text-to-image and text-to-video models (SD3.5, Flux, Qwen-Image, Wan2.2), training efficiency improves and sampling is more than 2× faster within the low-variance regime without degrading sample quality. Conclusion: By explicitly modeling and exploiting the variance structure of flow matching, Stable Velocity improves both stability and efficiency in training and sampling. Abstract: While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthens auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet $256\times256$ and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than $2\times$ faster sampling within the low-variance regime without degrading sample quality. Our code is available at https://github.com/linYDTHU/StableVelocity.
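The two variance regimes can be seen in a one-dimensional toy case. The sketch below assumes a linear interpolation path x_t = (1 - t) x0 + t x1 with a standard normal prior x0 and data x1 uniform on two points, and computes the exact variance of the conditional velocity u = x1 - x0 given x_t. The path, the toy data, and the function name are assumptions for illustration; this is not the paper's StableVM objective.

```python
import math

def conditional_velocity_variance(x_t, t, data_points=(-1.0, 1.0)):
    """Var[u | x_t] for x_t = (1 - t) * x0 + t * x1 with 0 <= t < 1.

    Assumes x0 ~ N(0, 1) and x1 uniform on data_points (toy setting).
    """
    sigma = 1.0 - t
    # Posterior over x1 given x_t, using x_t | x1 ~ N(t * x1, sigma^2).
    log_w = [-((x_t - t * x1) ** 2) / (2.0 * sigma ** 2) for x1 in data_points]
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Conditional velocity for each candidate endpoint:
    # u = x1 - x0 = (x1 - x_t) / (1 - t).
    velocities = [(x1 - x_t) / sigma for x1 in data_points]
    mean = sum(p * u for p, u in zip(probs, velocities))
    return sum(p * (u - mean) ** 2 for p, u in zip(probs, velocities))

near_prior = conditional_velocity_variance(0.5, t=0.1)
near_data = conditional_velocity_variance(0.5, t=0.9)
```

Near the prior (small t) the point x_t says little about which data point it will reach, so the conditional velocity varies widely; near the data (t close to 1) the endpoint is almost determined and the variance collapses, matching the regimes described in the abstract.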
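### [96] [Synthetic Defect Geometries of Cast Metal Objects Modeled via 2d Voronoi Tessellations](https://arxiv.org/abs/2602.05440) *Natascha Jeziorski,Petra Gospodnetić,Claudia Redenbach* Main category: cs.CV TL;DR: This paper proposes a synthetic data generation method that combines parametric 3D defect modeling with physically based simulation, improving the quality and diversity of training data for automated defect detection in non-destructive testing.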
Details Motivation: Defect detection is crucial for industrial quality control, and machine learning approaches need training data in sufficient amount and quality, including rarely occurring defects; controllable, scalable synthetic data with pixel-perfect annotation can supply it. Method: Build parametric 3D mesh models of defect types motivated by metal casting and add them to the geometry of a digital twin of the inspected object; generate synthetic data resembling real inspection data with a physically based Monte Carlo simulation of the testing method (visual surface inspection in the example), with annotations created in parallel. Result: Variable, arbitrarily large synthetic datasets can be generated, including rare defects in sufficient quantity and with pixel-perfect annotation; the procedure can be combined with simulations of other NDT methods. Conclusion: The framework of parametric defect modeling plus simulation-driven data generation addresses data scarcity and annotation cost in industrial defect detection and provides a data basis for automated NDT systems. Abstract: In industry, defect detection is crucial for quality control. Non-destructive testing (NDT) methods are preferred as they do not influence the functionality of the object while inspecting. Automated data evaluation for automated defect detection is a growing field of research. In particular, machine learning approaches show promising results. To provide training data in sufficient amount and quality, synthetic data can be used. Rule-based approaches enable synthetic data generation in a controllable environment. Therefore, a digital twin of the inspected object including synthetic defects is needed. We present parametric methods to model 3d mesh objects of various defect types that can then be added to the object geometry to obtain synthetic defective objects. The models are motivated by common defects in metal casting but can be transferred to other machining procedures that produce similar defect shapes. Synthetic data resembling the real inspection data can then be created by using a physically based Monte Carlo simulation of the respective testing method. Using our defect models, a variable and arbitrarily large synthetic data set can be generated with the possibility to include rarely occurring defects in sufficient quantity. Pixel-perfect annotation can be created in parallel. As an example, we will use visual surface inspection, but the procedure can be applied in combination with simulations for any other NDT method.
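### [97] [DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching](https://arxiv.org/abs/2602.05449) *Chang Zou,Changlin Li,Yang Li,Patrol Li,Jianbing Wu,Xiao He,Songtao Liu,Zhao Zhong,Kailin Huang,Linfeng Zhang* Main category: cs.CV TL;DR: This paper proposes a distillation-compatible learnable feature caching mechanism that combines a lightweight learnable neural predictor with a conservative Restricted MeanFlow approach, achieving 11.8× acceleration of video diffusion models while preserving generation quality.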
Details Motivation: Existing acceleration methods for video diffusion models (training-free feature caching and step distillation) lose semantics and detail or degrade sharply as compression increases, and applying feature caching on top of step-distilled models is worse still because the sampling steps are sparser. Method: A learnable feature caching mechanism that replaces traditional training-free heuristics with a lightweight neural predictor, plus a conservative Restricted MeanFlow approach that stabilizes highly compressed distillation on large-scale video models. Result: 11.8× inference acceleration while preserving generation quality, supported by extensive experiments; the code is in the supplementary materials and is to be released publicly. Conclusion: Learnable feature caching together with Restricted MeanFlow raises the achievable acceleration for video diffusion models while keeping quality stable. Abstract: While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
### [98] [Attention Retention for Continual Learning with Vision Transformers](https://arxiv.org/abs/2602.05454) *Yue Lu,Xiangyu Zhou,Shizhou Zhang,Yinghui Xing,Guoqiang Liang,Wencong Zhang* Main category: cs.CV TL;DR: This paper proposes an attention-retaining continual learning framework that mitigates catastrophic forgetting by constraining attention drift in Vision Transformers during backpropagation.
Details Motivation: The paper identifies attention drift in Vision Transformers as a primary cause of catastrophic forgetting in continual learning and, inspired by selective attention in the human visual system, proposes a corresponding remedy. Method: A two-step attention retention method: 1) extract attention maps of the previous task with a layer-wise rollout mechanism and generate instance-adaptive binary masks; 2) when learning a new task, use these masks to zero out gradients associated with previous attention regions, and scale parameter updates proportionally for compatibility with modern optimizers. Result: Experiments and visualizations show that the method mitigates catastrophic forgetting and preserves visual concepts, reaching SOTA performance and generalizing well across diverse continual learning scenarios. Conclusion: Attention drift is a key factor in catastrophic forgetting in Vision Transformers, and the attention-retaining framework mitigates it. Abstract: Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, catastrophic forgetting remains a critical challenge. In this work, we identify attention drift in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into the selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.
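The two-step process in the abstract can be sketched on a flat list of per-patch values. The sketch below thresholds an attention map at a multiple of its own mean to obtain a binary mask, then zeroes gradients where the mask is set. The mean-based threshold, the `ratio` parameter, and the one-gradient-per-patch layout are assumptions for illustration; the paper's rollout, mask generation, and update scaling are not reproduced here.

```python
def binary_mask(attention, ratio=1.0):
    """Mark patches whose attention exceeds ratio * this instance's mean attention.

    Assumption: thresholding at the per-instance mean stands in for the paper's
    instance-adaptive mask generation.
    """
    threshold = ratio * sum(attention) / len(attention)
    return [1 if a > threshold else 0 for a in attention]

def mask_gradients(grads, mask):
    """Zero out gradients at positions attended by the previous task (mask == 1)."""
    return [0.0 if m else g for g, m in zip(grads, mask)]

# Hypothetical attention over four patches from the previous task.
previous_attention = [0.5, 0.1, 0.3, 0.1]
mask = binary_mask(previous_attention)

# Hypothetical gradients computed while learning the new task.
new_task_grads = [0.2, -0.4, 0.6, 0.8]
masked = mask_gradients(new_task_grads, mask)
```

Only the positions that were weakly attended before keep their gradients, so updates for the new task avoid the regions that carried the earlier visual concepts.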
### [99] [MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation](https://arxiv.org/abs/2602.05467) *Dekang Qi,Shuang Zeng,Xinyuan Chang,Feng Xiong,Shichao Xie,Xiaolong Wu,Mu Xu* Main category: cs.CV TL;DR: This paper proposes a Memory-Execute-Review framework that improves success rate (SR) and generalization in Visual Language Navigation (VLN), substantially outperforming existing Supervised Fine-Tuning (SFT) and Training-Free (TF) methods on several datasets.
Details Motivation: Existing VLN methods struggle to achieve both a high success rate (SR) and strong generalization: supervised fine-tuning methods reach high SR but generalize poorly, while training-free methods generalize well but have low SR. Method: A three-module Memory-Execute-Review framework: a hierarchical memory module provides information support, an execute module handles routine decisions and actions, and a review module handles abnormal situations and corrects behavior; the framework is validated on the Object Goal Navigation task. Result: Across 4 datasets, average SR improves by 5% and 7% under the Zero-Shot (ZS) and Training-Free (TF) settings, respectively; under the ZS setting, SR improves by 8% on HM3D_v0.1 and 6% on HM3D_OVON; on MP3D and HM3D_OVON the method surpasses all TF and SFT methods, leading in SR by 5% and 2%, respectively. Conclusion: The Memory-Execute-Review framework balances success rate and generalization in VLN and leads on both. Abstract: Visual Language Navigation (VLN) is one of the fundamental capabilities for embodied intelligence and a critical challenge that urgently needs to be addressed. However, existing methods are still unsatisfactory in terms of both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, but it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework. It consists of three parts: a hierarchical memory module for providing information support, an execute module for routine decision-making and actions, and a review module for handling abnormal situations and correcting behavior. We validated the effectiveness of this framework on the Object Goal Navigation task. Across 4 datasets, our average SR achieved absolute improvements of 7% and 5% compared to all baseline methods under TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and the more challenging open vocabulary dataset HM3D_OVON, the SR improved by 8% and 6% under ZS settings. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperformed all TF methods but also surpassed all SFT methods, achieving comprehensive leadership in both SR (5% and 2%) and generalization.
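### [100] [SOMA-1M: A Large-Scale SAR-Optical Multi-resolution Alignment Dataset for Multi-Task Remote Sensing](https://arxiv.org/abs/2602.05480) *Peihao Wu,Yongxiang Yao,Yi Wan,Wenfei Zhang,Ruipeng Zhao,Jiayuan Li,Yongjun Zhang* Main category: cs.CV TL;DR: This paper introduces SOMA-1M, a dataset of over 1.3 million precisely pixel-aligned, multi-resolution SAR-optical remote sensing image pairs with global, multi-scale (0.5 m to 10 m) coverage and 12 land cover categories. It supports image matching, fusion, cloud removal, and cross-modal translation, and is shown to substantially improve multimodal remote sensing algorithms.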
Details Motivation: Existing SAR-optical benchmark datasets suffer from a single spatial resolution, insufficient scale, and low alignment accuracy, which makes them inadequate for training multi-scale foundation models and supporting their generalization. Method: SOMA-1M is built by integrating imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth; a coarse-to-fine image matching framework provides pixel-level alignment; and a comprehensive benchmark covering four vision tasks is established. Result: Supervised training on SOMA-1M significantly improves performance on all tasks, with multimodal remote sensing image matching in particular reaching state-of-the-art (SOTA) levels. Conclusion: SOMA-1M provides a key foundational resource for robust multimodal remote sensing algorithms and remote sensing foundation models, and will be released publicly. Abstract: Synthetic Aperture Radar (SAR) and optical imagery provide complementary strengths that constitute the critical foundation for transcending single-modality constraints and facilitating cross-modal collaborative processing and intelligent interpretation. However, existing benchmark datasets often suffer from limitations such as single spatial resolution, insufficient data scale, and low alignment accuracy, making them inadequate for supporting the training and generalization of multi-scale foundation models. To address these challenges, we introduce SOMA-1M (SAR-Optical Multi-resolution Alignment), a pixel-level precisely aligned dataset containing over 1.3 million pairs of georeferenced images at a size of 512 x 512 pixels. This dataset integrates imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth, achieving global multi-scale coverage from 0.5 m to 10 m. It encompasses 12 typical land cover categories, effectively ensuring scene diversity and complexity. To address multimodal projection deformation and massive data registration, we designed a rigorous coarse-to-fine image matching framework ensuring pixel-level alignment. Based on this dataset, we established comprehensive evaluation benchmarks for four hierarchical vision tasks, including image matching, image fusion, SAR-assisted cloud removal, and cross-modal translation, involving over 30 mainstream algorithms. Experimental results demonstrate that supervised training on SOMA-1M significantly enhances performance across all tasks. Notably, multimodal remote sensing image (MRSI) matching performance achieves current state-of-the-art (SOTA) levels. SOMA-1M serves as a foundational resource for robust multimodal algorithms and remote sensing foundation models. The dataset will be released publicly at: https://github.com/PeihaoWu/SOMA-1M.
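### [101] [Feature points evaluation on omnidirectional vision with a photorealistic fisheye sequence -- A report on experiments done in 2014](https://arxiv.org/abs/2602.05487) *Julien Moreau,S. Ambellouis,Yassine Ruichek* Main category: cs.CV TL;DR: This report presents previously unpublished PhD work that searches for the best feature detection and description algorithms for fisheye image self-calibration. It introduces the PFSeq (Photorealistic Fisheye Sequence) dataset and runs systematic experiments, but proposes no new algorithm and does not compare against methods designed specifically for omnidirectional images.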
Details Motivation: To address the chicken-and-egg problem in fisheye self-calibration: without an accurate projection model it is hard to optimize feature detection, yet good features are a prerequisite for estimating that model. The target application is visual odometry and stereovision in urban environments with zenith-facing fisheye cameras mounted on a car. Method: A comprehensive experimental evaluation of several classical feature detectors and descriptors on fisheye images, using the authors' PFSeq photorealistic fisheye sequence dataset, with experiments reflecting the 2014 state of the art. Result: A performance comparison of feature methods on fisheye images that identifies existing feature combinations relatively better suited to the task, without arriving at a universally optimal solution; the PFSeq dataset is released (DOI: 10.57745/DYIVVU). Conclusion: Without prior knowledge of an accurate projection model, some combinations of traditional feature methods (such as SIFT+FLANN) are more robust for fisheye self-calibration; the report stresses the need for feature methods designed specifically for fisheye images, but only completes a benchmark evaluation and proposes no new algorithm. Abstract: What is this report: This is a scientific report, contributing a detailed bibliography, a dataset which we now call PFSeq for "Photorealistic Fisheye Sequence" and make available at https://doi.org/10.57745/DYIVVU, and comprehensive experiments. This work should be considered as a draft, and was done during my PhD thesis "Construction of 3D models from fisheye video data-Application to the localisation in urban area" in 2014 [Mor16]. These results have never been published. The aim was to find the best feature detector and descriptor for fisheye images, in the context of self-calibration, with cameras mounted on the top of a car and aiming at the zenith (in order to then perform fisheye visual odometry and stereovision in urban scenes). We face a chicken-and-egg problem, because we cannot take advantage of an accurate projection model for optimal feature detection and description, yet we need good features precisely to perform the calibration (i.e. to compute the accurate projection model of the camera). What is not this report: It does not contribute a new feature algorithm. It does not compare standard feature algorithms to algorithms designed for omnidirectional images (unfortunately). It has not been peer-reviewed. Discussions have been translated and enhanced but the experiments have not been run again and the report has not been updated according to the evolution of the state of the art (read this as a 2014 report).
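### [102] [VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency](https://arxiv.org/abs/2602.05508) *Zhuang Xiong,Chen Zhang,Qingshan Xu,Wenbing Tao* Main category: cs.CV TL;DR: This paper proposes VGGT-Motion, a calibration-free monocular SLAM system that substantially reduces scale drift on long sequences through motion-aware submap construction, anchor-driven direct Sim(3) registration, and lightweight submap-level pose graph optimization, achieving efficient and robust global consistency over kilometer-scale trajectories.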
Details Motivation: Existing calibration-free monocular SLAM methods suffer severe scale drift on long sequences; motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift; conventional geometric alignment is computationally expensive. Method: 1) Motion-aware submap construction: optical flow guides adaptive partitioning, prunes static redundancy, and encapsulates turns to stabilize local geometry; 2) anchor-driven direct Sim(3) registration: context-balanced anchors enable search-free, pixel-wise dense alignment and efficient loop closure; 3) lightweight submap-level pose graph optimization with linear complexity to enforce global consistency. Result: State-of-the-art performance in zero-shot, long-range, calibration-free monocular SLAM, with markedly improved trajectory accuracy and runtime efficiency. Conclusion: VGGT-Motion resolves the tension between long-sequence scale drift and computational efficiency in calibration-free monocular SLAM, providing a robust and scalable solution for kilometer-scale scenes. Abstract: Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
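To make the idea of motion-aware partitioning concrete, here is a rough sketch under strong simplifying assumptions. It takes one scalar optical-flow magnitude per frame, drops frames below a static threshold, and closes a submap once the accumulated motion reaches a budget. The function name, both parameters, and the cut rule are hypothetical; the abstract does not specify how partitions are chosen, and the handling of turns is not modeled here.

```python
def partition_by_motion(flow_mags, static_thresh, budget):
    """Group frame indices into submaps by accumulated optical-flow magnitude.

    flow_mags:     one non-negative flow magnitude per frame (assumed input).
    static_thresh: frames with less motion than this are pruned as redundant.
    budget:        a submap is closed once its accumulated motion reaches this.
    """
    submaps, current, accumulated = [], [], 0.0
    for idx, mag in enumerate(flow_mags):
        if mag < static_thresh:
            continue  # near-static frame: adds no new geometry, skip it
        current.append(idx)
        accumulated += mag
        if accumulated >= budget:
            submaps.append(current)
            current, accumulated = [], 0.0
    if current:
        submaps.append(current)  # keep the trailing, partially filled submap
    return submaps
```

With this rule, a long stop at a traffic light contributes no frames to any submap, which is one simple way to avoid the zero-motion drift that fixed-length partitioning can cause.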
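### [103] [Mapper-GIN: Lightweight Structural Graph Abstraction for Corrupted 3D Point Cloud Classification](https://arxiv.org/abs/2602.05522) *Jeongbin You,Donggun Kim,Sejun Park,Seungsang Oh* Main category: cs.CV TL;DR: This paper proposes Mapper-GIN, a lightweight point cloud classification method based on the topological Mapper algorithm. It builds a region graph and classifies it with a GIN, achieving strong robustness to noise and transformation corruptions on ModelNet40-C with only 0.5M parameters.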
Details Motivation: To explore whether structural abstraction alone, rather than larger models or elaborate data augmentation, can improve the robustness of 3D point cloud classification. Method: The Mapper algorithm (PCA lens, cubical cover, density-based clustering) partitions the point cloud into overlapping regions, from which a region graph is built and classified with a Graph Isomorphism Network (GIN). Result: On the ModelNet40-C benchmark, Mapper-GIN achieves competitive and stable accuracy under Noise and Transformation corruptions with only 0.5M parameters, in contrast to existing methods that rely on heavier architectures or additional mechanisms. Conclusion: Region-graph structure is an efficient and interpretable source of robustness, offering a new perspective for 3D visual recognition. Abstract: Robust 3D point cloud classification is often pursued by scaling up backbones or relying on specialized data augmentation. We instead ask whether structural abstraction alone can improve robustness, and study a simple topology-inspired decomposition based on the Mapper algorithm. We propose Mapper-GIN, a lightweight pipeline that partitions a point cloud into overlapping regions using Mapper (PCA lens, cubical cover, followed by density-based clustering), constructs a region graph from their overlaps, and performs graph classification with a Graph Isomorphism Network. On the corruption benchmark ModelNet40-C, Mapper-GIN achieves competitive and stable accuracy under Noise and Transformation corruptions with only 0.5M parameters. In contrast to prior approaches that require heavier architectures or additional mechanisms to gain robustness, Mapper-GIN attains strong corruption robustness through simple region-level graph abstraction and GIN message passing. Overall, our results suggest that region-graph structure offers an efficient and interpretable source of robustness for 3D visual recognition.
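The region-graph construction can be sketched with a simplified, one-dimensional Mapper. This is an illustration only, and it departs from the pipeline in the abstract in three labeled ways: the lens is an arbitrary user-supplied function rather than a PCA projection, the cover is a set of overlapping intervals rather than a cubical cover, and clustering uses distance-threshold single linkage rather than a density-based method. All names and parameters are hypothetical.

```python
import math


def single_linkage(points, indices, eps):
    """Cluster `indices` by linking any two points at distance <= eps."""
    remaining, clusters = set(indices), []
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in remaining
                    if math.dist(points[i], points[j]) <= eps}
            remaining -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(cluster)
    return clusters


def mapper_graph(points, lens, n_intervals, overlap, eps):
    """Build a region graph: nodes are clusters, edges mark shared points."""
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_intervals
    pad = 0.5 * overlap * step  # widen each interval so neighbours overlap
    nodes = []
    for k in range(n_intervals):
        a, b = lo + k * step - pad, lo + (k + 1) * step + pad
        members = [i for i, v in enumerate(values) if a <= v <= b]
        nodes.extend(single_linkage(points, members, eps))
    # Two regions are connected when they contain at least one common point.
    edges = {(u, v) for u in range(len(nodes))
             for v in range(u + 1, len(nodes)) if nodes[u] & nodes[v]}
    return nodes, edges
```

The returned `nodes` (sets of point indices) and `edges` form the kind of small graph that a graph classifier such as a GIN would then consume in place of the raw point cloud.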
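### [104] [Generalization of Self-Supervised Vision Transformers for Protein Localization Across Microscopy Domains](https://arxiv.org/abs/2602.05527) *Ben Isselmann,Dilara Göksu,Andreas Weinmann* Main category: cs.CV TL;DR: This paper studies how well self-supervised (SSL) pretrained Vision Transformers (DINO) transfer across microscopy image domains. A model pretrained on Human Protein Atlas (HPA) data performs best on the OpenCell protein localization task, ahead of a model trained directly on OpenCell, indicating that domain-relevant SSL representations generalize to related but distinct microscopy datasets.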
Details Motivation: Task-specific microscopy datasets are usually small, which makes it hard to train robust deep learning models; self-supervised pretraining is promising, but how well it transfers across microscopy domains with different staining protocols and channel configurations remains unclear. Method: Using the DINO framework, ViT backbones are pretrained separately on ImageNet-1k, HPA, and OpenCell; image embeddings are extracted from each, and a supervised classification head is trained and evaluated on OpenCell labels. Result: All pretrained models transfer well; the HPA-pretrained model achieves the highest mean macro F1-score on OpenCell (0.8221 ± 0.0062), slightly ahead of the model pretrained directly on OpenCell (0.8057 ± 0.0090). Conclusion: Large-scale, domain-relevant self-supervised pretraining improves downstream performance on microscopy images and generalizes well even when labeled target-domain data are limited. Abstract: Task-specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self-supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy-specific HPA-pretrained model achieving the best performance (mean macro $F_1$-score = $0.8221 \pm 0.0062$), slightly outperforming a DINO model trained directly on OpenCell ($0.8057 \pm 0.0090$). These results highlight the value of large-scale pretraining and indicate that domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task-specific labeled data are limited.
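For readers unfamiliar with the reported metric, the sketch below shows how a macro $F_1$-score and a "mean ± deviation" summary over repeated runs could be computed. It treats the task as single-label classification and uses the sample standard deviation; both are assumptions, since the abstract does not state the label setup or which deviation is reported. The function names are hypothetical.

```python
import statistics


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_std(run_scores):
    """Mean and sample standard deviation across repeated runs."""
    return statistics.mean(run_scores), statistics.stdev(run_scores)
```

Because every class contributes equally to the average, macro $F_1$ penalizes a model that ignores rare localization classes even if its overall accuracy is high.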
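### [105] [SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation](https://arxiv.org/abs/2602.05534) *Youngwoo Shin,Jiwan Hur,Junmo Kim* Main category: cs.CV TL;DR: This paper proposes Scaled Spatial Guidance (SSG), a training-free, inference-only guidance method. Taking an information-theoretic view, it analyzes and mitigates the hierarchy drift that arises in multi-scale generation with visual autoregressive (VAR) models. Combined with the frequency-domain technique DSE, SSG makes each scale contribute the high-frequency semantic residual not covered by earlier scales, improving fidelity, diversity, and global coherence while preserving low latency.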
Details Motivation: At inference time, limited capacity and accumulated error cause the multi-scale hierarchy of visual autoregressive (VAR) models to drift, undermining their intended coarse-to-fine behavior and creating a train-inference discrepancy. Method: A training-free, inference-time guidance method, SSG, that steers each scale toward generating the high-frequency semantic residual not explained by earlier scales; to extract this residual, Discrete Spatial Enhancement (DSE) sharpens the coarse-scale prior and isolates the residual in the frequency domain; SSG applies to any VAR model based on discrete visual tokens. Result: Across several VAR models, SSG consistently improves the fidelity and diversity of generated images while keeping latency low, revealing untapped efficiency in the coarse-to-fine generation paradigm. Conclusion: From an information-theoretic standpoint, ensuring that each scale contributes complementary high-frequency content mitigates hierarchy drift in VAR models; SSG is a lightweight, general, plug-and-play inference guidance mechanism that improves VAR generation quality and stability. Abstract: Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), a training-free, inference-time guidance method that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.
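### [106] [A Comparative Study of 3D Person Detection: Sensor Modalities and Robustness in Diverse Indoor and Outdoor Environments](https://arxiv.org/abs/2602.05538) *Malaz Tamim,Andrea Matic-Flierl,Karsten Roscher* Main category: cs.CV TL;DR: This paper systematically evaluates the performance and robustness of camera-only, LiDAR-only, and camera-LiDAR fusion approaches to 3D person detection in indoor and outdoor scenes. The fusion method (DAL) performs best overall but remains sensitive to sensor misalignment and some LiDAR corruptions, while the camera-only method (BEVDepth) performs worst and is most affected by occlusion, distance, and noise.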
Details Motivation: Most existing research focuses on autonomous driving; this work extends the study to broader indoor and outdoor applications (such as robotics, industrial monitoring, and surveillance) and systematically evaluates how different sensing modalities perform and hold up for 3D person detection in diverse real-world scenes. Method: On the JRDB dataset, three representative models are compared across indoor and outdoor scenes: BEVDepth (camera-only), PointPillars (LiDAR-only), and DAL (camera-LiDAR fusion), with analysis under varying occlusion levels, distances, sensor corruptions, and calibration errors. Result: The fusion method DAL outperforms the single-modality methods across scenarios, but remains vulnerable to sensor misalignment and certain LiDAR corruptions; BEVDepth performs worst and is most sensitive to occlusion, long range, and noise; PointPillars falls in between with relatively good robustness. Conclusion: Sensor fusion improves both performance and robustness of 3D person detection, but practical deployment is still limited by calibration accuracy and resilience to corruption, which calls for further research on system reliability. Abstract: Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion. While most existing research focuses on autonomous driving, we explore detection performance and robustness in diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models - BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion) - and analyze their behavior under varying occlusion and distance levels. Our results show that the fusion-based approach consistently outperforms single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model showed the lowest performance and was most affected by occlusion, distance, and noise. Our findings highlight the importance of utilizing sensor fusion for enhanced 3D person detection, while also underscoring the need for ongoing research to address the vulnerabilities inherent in these systems.
### [107] [FastVMT: Eliminating Redundancy in Video Motion Transfer](https://arxiv.org/abs/2602.05551) *Yue Ma,Zhikai Wang,Tianhao Ren,Mingzhe Zheng,Hongyu Liu,Jiayi Guo,Mark Fong,Yuxuan Xue,Zixiang Zhao,Konrad Schindler,Qifeng Chen,Linfeng Zhang* Main category: cs.CV TL;DR: This paper proposes FastVMT, which accelerates Diffusion Transformer (DiT) computation in video motion transfer by removing motion redundancy and gradient redundancy, achieving an average 3.43x speedup without sacrificing visual quality or temporal consistency.
Details Motivation: Existing DiT-based video motion transfer methods are structurally inefficient: they do not exploit the facts that frame-to-frame motion is small and smooth and that gradients change slowly along the diffusion trajectory. Method: 1) Mitigate motion redundancy with local attention masks; 2) exploit gradient redundancy with a scheme that reuses gradients and skips unnecessary gradient computations. Result: FastVMT achieves an average 3.43x inference speedup while preserving the visual fidelity and temporal consistency of the generated videos. Conclusion: Removing motion and gradient redundancy substantially improves the efficiency of DiT for video motion transfer without sacrificing generation quality. Abstract: Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood such that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43x speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
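The local-neighborhood masking idea can be sketched as a boolean attention mask over a grid of spatial tokens. This is a generic illustration and not FastVMT's code: it assumes tokens are laid out in row-major order on a single frame's `height` x `width` grid and uses a square (Chebyshev-distance) window, whereas the abstract does not specify the neighborhood shape or how frames are paired. The function name and `radius` parameter are hypothetical.

```python
def local_attention_mask(height, width, radius):
    """mask[q][k] is True when key token k lies within `radius` grid cells
    (Chebyshev distance) of query token q; tokens are in row-major order."""
    coords = [(y, x) for y in range(height) for x in range(width)]
    return [[max(abs(qy - ky), abs(qx - kx)) <= radius
             for (ky, kx) in coords]
            for (qy, qx) in coords]
```

In an attention layer, entries where the mask is False would be excluded before the softmax, so interaction weights with distant regions are never computed. For a window much smaller than the grid, this cuts the number of query-key pairs from quadratic in the token count to roughly linear.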
### [108] [IndustryShapes: An RGB-D Benchmark dataset for 6D object pose estimation of industrial assembly components and tools](https://arxiv.org/abs/2602.05555) *Panagiotis Sapoutzoglou,Orestis Vaggelis,Athina Zacharia,Evangelos Sartinas,Maria Pateraki* Main category: cs.CV TL;DR: IndustryShapes is a new RGB-D benchmark dataset for industrial settings, focused on instance-level and novel-object 6D pose estimation of tools and components, bridging the gap between lab research and real manufacturing deployment.
Details Motivation: Existing datasets mostly cover household objects, synthetic environments, or controlled lab scenes and lack challenging data from real industrial assembly settings, which limits the practical deployment of pose estimation methods for industrial robots. Method: An RGB-D dataset is built with a classic set (4.6k images, 6k annotated poses) and an extended set (supporting model-free and sequence-based methods), covering 5 challenging industrial object types captured in real industrial assembly scenes; several SOTA pose estimation algorithms are benchmarked on it. Result: IndustryShapes is, to the authors' knowledge, the first industrial pose dataset to offer RGB-D static onboarding sequences; experiments show that current SOTA methods still leave clear room for improvement on it. Conclusion: IndustryShapes provides an evaluation benchmark for industrial 6D pose estimation that is closer to real applications, helping move algorithms from the lab to the production line. Abstract: We introduce IndustryShapes, a new RGB-D benchmark dataset of industrial tools and components, designed for both instance-level and novel object 6D pose estimation approaches. The dataset provides a realistic and application-relevant testbed for benchmarking these methods in the context of industrial robotics, bridging the gap between lab-based research and deployment in real-world manufacturing scenarios. Unlike many previous datasets that focus on household or consumer products or use synthetic, clean tabletop datasets, or objects captured solely in controlled lab environments, IndustryShapes introduces five new object types with challenging properties, also captured in realistic industrial assembly settings. The dataset has diverse complexity, from simple to more challenging scenes, with single and multiple objects, including scenes with multiple instances of the same object, and it is organized in two parts: the classic set and the extended set. The classic set includes a total of 4.6k images and 6k annotated poses. The extended set introduces additional data modalities to support the evaluation of model-free and sequence-based approaches. To the best of our knowledge, IndustryShapes is the first dataset to offer RGB-D static onboarding sequences. We further evaluate the dataset on a representative set of state-of-the-art methods for instance-based and novel object 6D pose estimation, as well as object detection and segmentation, showing that there is room for improvement in this domain. The dataset page can be found at https://pose-lab.github.io/IndustryShapes.
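### [109] [PIRATR: Parametric Object Inference for Robotic Applications with Transformers in 3D Point Clouds](https://arxiv.org/abs/2602.05557) *Michael Schwingshackl,Fabio F. Oberweger,Mario Niedermeyer,Huemer Johannes,Markus Murschitz* Main category: cs.CV TL;DR: PIRATR is an end-to-end 3D object detection framework for robotic applications that jointly estimates multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point clouds. Trained on synthetic data, it generalizes directly to real outdoor LiDAR scenes with an mAP of 0.919.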
Details Motivation: Robots in dynamic environments need to jointly estimate the geometric location and the task-relevant attributes of parametric objects (such as adjustable grippers), bridging the gap between low-level geometric reasoning and high-level actionable world models. Method: Extending PI3DETR, the method uses a modular architecture with class-specific detection heads that jointly predicts 6-DoF poses and parametric attributes (such as a gripper's opening) directly from point clouds and can be extended easily to new object classes. Result: Validated on an automated forklift platform for three object classes (crane grippers, loading platforms, and pallets), the model is trained purely on synthetic data and reaches 0.919 mAP on real outdoor LiDAR data without fine-tuning. Conclusion: PIRATR establishes a paradigm of pose-aware, parameterized perception, enabling scalable perception systems that are trained in simulation and deployed in the real world. Abstract: We present PIRATR, an end-to-end 3D object detection framework for robotic use cases in point clouds. Extending PI3DETR, our method streamlines parametric 3D object detection by jointly estimating multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point cloud data. This formulation enables not only geometric localization but also the estimation of task-relevant properties for parametric objects, such as a gripper's opening, where the 3D model is adjusted according to simple, predefined rules. The architecture employs modular, class-specific heads, making it straightforward to extend to novel object types without re-designing the pipeline. We validate PIRATR on an automated forklift platform, focusing on three structurally and functionally diverse categories: crane grippers, loading platforms, and pallets. Trained entirely in a synthetic environment, PIRATR generalizes effectively to real outdoor LiDAR scans, achieving a detection mAP of 0.919 without additional fine-tuning. PIRATR establishes a new paradigm of pose-aware, parameterized perception. This bridges the gap between low-level geometric reasoning and actionable world models, paving the way for scalable, simulation-trained perception systems that can be deployed in dynamic robotic environments. Code available at https://github.com/swingaxe/piratr.
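### [110] [ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors](https://arxiv.org/abs/2602.05572) *Zhenxiao Liang,Ning Zhang,Youbao Tang,Ruei-Sung Lin,Qixing Huang,Peng Chang,Jing Xiao* Main category: cs.CV TL;DR: ShapeGaussian is a template-free, high-fidelity 4D human reconstruction method for monocular videos. By combining data-driven 2D vision priors with neural deformation modeling, it achieves robust, accurate dynamic human reconstruction without multi-view input.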
Details Motivation: Generic 4D reconstruction methods (such as 4DGS) lack strong vision priors and struggle with high-deformation human motion in monocular videos, while SMPL template-based methods (such as HUGS) can produce photorealistic results but depend heavily on pose estimation accuracy and are prone to distortion artifacts. Method: A two-stage pipeline: first, pretrained models are used to learn a coarse, data-driven deformable geometry; then a neural deformation model refines the dynamic details; 2D vision priors are used throughout, and multiple reference frames mitigate keypoint occlusion. Result: Experiments on a variety of casual monocular videos show that ShapeGaussian surpasses mainstream template-based methods in reconstruction accuracy, visual quality, and robustness to motion. Conclusion: ShapeGaussian combines template-free operation with high fidelity, offering a more robust and more general approach to monocular 4D human reconstruction. Abstract: We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods lacking robust vision priors, such as 4DGS, struggle to capture high-deformation human motion without multi-view cues. While template-based approaches, primarily relying on SMPL, such as HUGS, can produce photorealistic results, they are highly susceptible to errors in human pose estimation, often leading to unrealistic artifacts. In contrast, ShapeGaussian effectively integrates template-free vision priors to achieve both high-fidelity and robust scene reconstructions. Our method follows a two-step pipeline: first, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Then, we refine this geometry using a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate artifacts from erroneous pose estimation in template-based methods and employ multiple reference frames to resolve the invisibility issue of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.
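### [111] [Visual Implicit Geometry Transformer for Autonomous Driving](https://arxiv.org/abs/2602.05573) *Arsenii Shirokov,Mikhail Kuznetsov,Danila Stepochkin,Egor Evdokimov,Daniil Glazkov,Nikolay Patakin,Anton Konushin,Dmitry Senushkin* Main category: cs.CV TL;DR: ViGT is a Visual Implicit Geometry Transformer for autonomous driving that estimates a continuous 3D occupancy field (in BEV) from surround-view cameras without camera calibration. Trained in a self-supervised way on image-LiDAR pairs from multiple datasets, it achieves state-of-the-art pointmap estimation and performs on par with supervised methods on Occ3D-nuScenes.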
Details Motivation: Existing geometric foundation models struggle to combine scalability, architectural simplicity, and generalization across sensor configurations in autonomous driving; in addition, most occupancy models rely on costly manual annotation and lack a unified metric-space representation. Method: A calibration-free ViGT architecture regresses a continuous BEV 3D occupancy field directly from multi-view images; training is self-supervised, using synchronized image-LiDAR data to build pseudo ground truth; the model is trained jointly on 5 large-scale autonomous driving datasets (NuScenes, Waymo, NuPlan, ONCE, and Argoverse). Result: State-of-the-art pointmap estimation with the best average rank; performance comparable to supervised methods on the Occ3D-nuScenes benchmark; strong generalization across datasets and sensor configurations. Conclusion: ViGT offers a scalable, lightweight, and general geometric foundation model paradigm for autonomous driving, advancing annotation-free, multi-sensor-compatible 3D scene understanding. Abstract: We introduce the Visual Implicit Geometry Transformer (ViGT), an autonomous driving geometric model that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundational models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in a bird's-eye-view (BEV), addressing domain-specific requirements. ViGT naturally infers geometry from multiple camera views into a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (NuScenes, Waymo, NuPlan, ONCE, and Argoverse) and achieving state-of-the-art performance on the pointmap estimation task, with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where ViGT achieves comparable performance with supervised methods. The source code is publicly available at https://github.com/whesense/ViGT.
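### [112] [A Hybrid CNN and ML Framework for Multi-modal Classification of Movement Disorders Using MRI and Brain Structural Features](https://arxiv.org/abs/2602.05574) *Mengyu Li,Ingibjörg Kristjánsdóttir,Thilo van Eimeren,Kathrin Giehl,Lotta M. Ellingsen,the ASAP Neuroimaging Initiative* Main category: cs.CV TL;DR: This paper proposes a hybrid CNN and machine learning framework that uses multi-modal MRI data (T1-weighted images, segmentation masks of deep brain structures, and volumetric measurements) for early differential diagnosis of Atypical Parkinsonian Disorder (APD) subtypes (PSP, MSA) and Parkinson's disease (PD), reaching AUCs of 0.95, 0.86, and 0.92.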
Details Motivation: Early clinical features of APD overlap heavily with those of PD and often lead to misdiagnosis; reliable imaging biomarkers are needed for early and accurate subtyping. Method: A hybrid CNN-ML model: the CNN processes T1-weighted MRI images and structural segmentation masks, and the ML module incorporates volumetric features of 12 deep brain structures; the input is multi-modal (images, masks, volumes). Result: In the three binary classification tasks PSP vs. PD, MSA vs. PD, and PSP vs. MSA, the AUC reaches 0.95, 0.86, and 0.92, respectively. Conclusion: Fusing CNN-extracted spatial features with ML-processed quantitative volume features improves APD subtype differentiation and may support earlier, more accurate clinical diagnosis and intervention. Abstract: Atypical Parkinsonian Disorders (APD), also known as Parkinson-plus syndrome, are a group of neurodegenerative diseases that include progressive supranuclear palsy (PSP) and multiple system atrophy (MSA). In the early stages, overlapping clinical features often lead to misdiagnosis as Parkinson's disease (PD). Identifying reliable imaging biomarkers for early differential diagnosis remains a critical challenge. In this study, we propose a hybrid framework combining convolutional neural networks (CNNs) with machine learning (ML) techniques to classify APD subtypes versus PD and distinguish between the subtypes themselves: PSP vs. PD, MSA vs. PD, and PSP vs. MSA. The model leverages multi-modal input data, including T1-weighted magnetic resonance imaging (MRI), segmentation masks of 12 deep brain structures associated with APD, and their corresponding volumetric measurements. By integrating these complementary modalities, including image data, structural segmentation masks, and quantitative volume features, the hybrid approach achieved promising classification performance with area under the curve (AUC) scores of 0.95 for PSP vs. PD, 0.86 for MSA vs. PD, and 0.92 for PSP vs. MSA. These results highlight the potential of combining spatial and structural information for robust subtype differentiation. In conclusion, this study demonstrates that fusing CNN-based image features with volume-based ML inputs improves classification accuracy for APD subtypes. The proposed approach may contribute to more reliable early-stage diagnosis, facilitating timely and targeted interventions in clinical practice.
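As a reminder of what the reported AUC values measure, the following sketch computes the area under the ROC curve for one binary task through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name is hypothetical, and the abstract does not say how the authors computed their scores; this is only the standard definition.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (e.g. PSP), 0 for the negative (e.g. PD).
    scores: classifier outputs, where higher means "more likely positive".
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Read this way, an AUC of 0.95 for PSP vs. PD means that in about 95% of PSP/PD patient pairs the model scores the PSP patient higher.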
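### [113] [LocateEdit-Bench: A Benchmark for Instruction-Based Editing Localization](https://arxiv.org/abs/2602.05577) *Shiyu Wu,Shuyan Li,Jing Li,Jing Liu,Yequan Wang* Main category: cs.CV TL;DR: This paper presents the LocateEdit-Bench dataset for evaluating forgery localization methods on instruction-driven image editing, filling a gap left by existing methods under this new editing paradigm.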
Details Motivation: Existing AI-generated forgery localization methods mainly target inpainting-based edits and cope poorly with emerging instruction-driven image editing, so a benchmark dataset suited to the new paradigm is needed. Method: A large-scale dataset, LocateEdit-Bench (231K edited images), is built from 4 cutting-edge editing models and 3 common edit types, together with two multi-metric evaluation protocols. Result: The first benchmark dataset and evaluation scheme for forgery localization under instruction-driven image editing, with a systematic analysis of how current localization methods perform in this setting. Conclusion: LocateEdit-Bench provides a key benchmark for keeping pace with fast-evolving image editing techniques and supports the development of future forgery localization methods. Abstract: Recent advancements in image editing have enabled highly controllable and semantically-aware alteration of visual content, posing unprecedented challenges to manipulation localization. However, existing AI-generated forgery localization methods primarily focus on inpainting-based manipulations, making them ineffective against the latest instruction-based editing paradigms. To bridge this critical gap, we propose LocateEdit-Bench, a large-scale dataset comprising 231K edited images, designed specifically to benchmark localization methods against instruction-driven image editing. Our dataset incorporates four cutting-edge editing models and covers three common edit types. We conduct a detailed analysis of the dataset and develop two multi-metric evaluation protocols to assess existing localization methods. Our work establishes a foundation to keep pace with the evolving landscape of image editing, thereby facilitating the development of effective methods for future forgery localization. Dataset will be open-sourced upon acceptance.
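### [114] [LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation](https://arxiv.org/abs/2602.05578) *Junyang Chen,Xiangbo Lv,Zhiqiang Kou,Xingdong Sheng,Ning Xu,Yiguo Qiao* Main category: cs.CV TL;DR: LoGoSeg is an efficient single-stage framework for open-vocabulary semantic segmentation that improves the accuracy and generalization of cross-category pixel-level segmentation by introducing an object existence prior, a region-aware alignment module, and a dual-stream fusion mechanism.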
Details Motivation: Existing methods based on VLMs (such as CLIP) rely on image-level pretraining, which gives imprecise spatial alignment, and they lack strong object priors and region-level constraints, leading to object hallucination or missed detections. Method: The LoGoSeg framework comprises: (i) an object existence prior based on global image-text similarity that dynamically weights relevant categories; (ii) a region-aware alignment module that establishes fine-grained region-level visual-textual correspondences; (iii) a dual-stream fusion mechanism that combines local structure with global semantics. It needs no external mask proposals, additional backbones, or extra datasets. Result: Competitive performance and strong generalization on six benchmarks (A-847, PC-459, and others). Conclusion: LoGoSeg mitigates spatial misalignment and object hallucination in open-vocabulary semantic segmentation, improving segmentation quality while remaining efficient. Abstract: Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
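One plausible reading of the object existence prior in item (i) is sketched below: global image-text similarities are turned into per-category weights with a softmax, and those weights rescale each pixel's category scores so that categories unlikely to be present are suppressed. This is an assumption-laden illustration, not LoGoSeg's formulation. The softmax, the `temperature` value, the multiplicative reweighting, and both function names are hypothetical choices.

```python
import math


def existence_prior(global_sims, temperature=0.1):
    """Softmax over one global image-text similarity per category."""
    peak = max(global_sims)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in global_sims]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_pixel_scores(pixel_scores, prior):
    """Scale each pixel's per-category scores by the category prior.

    pixel_scores: list of rows, one row of category scores per pixel.
    """
    return [[score * weight for score, weight in zip(row, prior)]
            for row in pixel_scores]
```

A category whose text embedding barely matches the whole image receives a small weight everywhere, which is one way a global signal can reduce hallucinated objects in the pixel-level prediction.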
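### [115] [Geometric Observability Index: An Operator-Theoretic Framework for Per-Feature Sensitivity, Weak Observability, and Dynamic Effects in SE(3) Pose Estimation](https://arxiv.org/abs/2602.05582) *Joe-Mei Feng,Sheng-Wei Yu* Main category: cs.CV TL;DR: This paper proposes an operator-theoretic framework on the Lie group SE(3) for analyzing the sensitivity of camera pose estimation to individual image features, and defines the Geometric Observability Index (GOI) to quantify the influence of a single measurement on the pose estimate.
Details Motivation: Classical sensitivity tools (conditioning analyses, Euclidean perturbation arguments, and Fisher information bounds) cannot explain how an individual image feature influences the pose estimate, nor why dynamic or inconsistent observations severely disturb SLAM and structure-from-motion systems. Method: The paper extends influence function theory to matrix Lie groups, derives an intrinsic perturbation operator for left-trivialized M-estimators on SE(3), and defines the GOI from the curvature operator and the Lie algebraic structure of the observable subspace. Result: Through the spectral decomposition of the curvature operator, GOI reveals a direct link between weak observability and high sensitivity; in the population regime, GOI agrees with the Fisher information geometry on SE(3) and yields a single-measurement analogue of the Cramér-Rao bound; the same spectral mechanism explains classical degeneracies (pure rotation, vanishing parallax) and dynamic feature amplification. Conclusion: GOI gives a geometrically consistent description of measurement influence that unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability; within standard Gauss-Newton pipelines it can serve as a lightweight, training-free diagnostic signal for identifying dynamic features and weakly observable configurations without modifying existing SLAM architectures. Abstract: We present a unified operator-theoretic framework for analyzing per-feature sensitivity in camera pose estimation on the Lie group SE(3). Classical sensitivity tools - conditioning analyses, Euclidean perturbation arguments, and Fisher information bounds - do not explain how individual image features influence the pose estimate, nor why dynamic or inconsistent observations can disproportionately distort modern SLAM and structure-from-motion systems. To address this gap, we extend influence function theory to matrix Lie groups and derive an intrinsic perturbation operator for left-trivialized M-estimators on SE(3). The resulting Geometric Observability Index (GOI) quantifies the contribution of a single measurement through the curvature operator and the Lie algebraic structure of the observable subspace. GOI admits a spectral decomposition along the principal directions of the observable curvature, revealing a direct correspondence between weak observability and amplified sensitivity. In the population regime, GOI coincides with the Fisher information geometry on SE(3), yielding a single-measurement analogue of the Cramér-Rao bound. The same spectral mechanism explains classical degeneracies such as pure rotation and vanishing parallax, as well as dynamic feature amplification along weak curvature directions. 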
Overall, GOI provides a geometrically consistent description of measurement influence that unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability through the spectral geometry of the curvature operator. Because these quantities arise directly within Gauss-Newton pipelines, the curvature spectrum and GOI also yield lightweight, training-free diagnostic signals for identifying dynamic features and detecting weak observability configurations without modifying existing SLAM architectures.
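### [116] [A Mixed Reality System for Robust Manikin Localization in Childbirth Training](https://arxiv.org/abs/2602.05588) *Haojie Cheng,Chang Liu,Abhiram Kanneganti,Mahesh Arjandas Choolani,Arundhati Tushar Gosavi,Eng Tat Khoo* Main category: cs.CV TL;DR: This paper presents a mixed reality (MR) system for obstetric training that combines virtual guidance with tactile interaction on a physical manikin, preserving realistic touch while supporting independent practice without expert supervision; experiments show that it outperforms a pure virtual reality (VR) approach in procedural accuracy, task completion, and user preference.
Details Motivation: Medical students have increasingly limited opportunities to practice vaginal delivery because of shortened clinical rotations, patient reluctance, and the unpredictability of labour; clinical teachers also carry a heavy teaching load, so training efficiency needs to improve. Method: An MR system built on a commercial head-mounted display (HMD): an external RGB-D camera provides spatial calibration and passthrough visual integration; a coarse-to-fine localization pipeline first aligns the maternal manikin using fiducial markers to define the delivery region, then registers the pre-scanned neonatal head model within that region; virtual guiding hands are then overlaid accurately near the manikin, and training combines them with real haptic feedback. Result: The system achieves accurate and stable manikin localization on a standalone headset without external computing resources; a controlled study with 83 fourth-year medical students shows that the MR group scored significantly higher than the VR group in delivery, post-delivery, and overall task performance, and that MR was preferred by users. Conclusion: The proposed MR system can ease pressure on teaching staff and improve the quality and autonomy of obstetric training, and is a viable alternative to traditional or VR-only training. Abstract: Opportunities for medical students to gain practical experience in vaginal births are increasingly constrained by shortened clinical rotations, patient reluctance, and the unpredictable nature of labour. To alleviate clinicians' instructional burden and enhance trainees' learning efficiency, we introduce a mixed reality (MR) system for childbirth training that combines virtual guidance with tactile manikin interaction, thereby preserving authentic haptic feedback while enabling independent practice without continuous on-site expert supervision. The system extends the passthrough capability of commercial head-mounted displays (HMDs) by spatially calibrating an external RGB-D camera, allowing real-time visual integration of physical training objects. Building on this capability, we implement a coarse-to-fine localization pipeline that first aligns the maternal manikin with fiducial markers to define a delivery region and then registers the pre-scanned neonatal head within this area. This process enables spatially accurate overlay of virtual guiding hands near the manikin, allowing trainees to follow expert trajectories reinforced by haptic interaction. Experimental evaluations demonstrate that the system achieves accurate and stable manikin localization on a standalone headset, ensuring practical deployment without external computing resources. A large-scale user study involving 83 fourth-year medical students was subsequently conducted to compare MR-based and virtual reality (VR)-based childbirth training. 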
Four senior obstetricians independently assessed performance using standardized criteria. Results showed that MR training achieved significantly higher scores in delivery, post-delivery, and overall task performance, and was consistently preferred by trainees over VR training.
### [117] [EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality](https://arxiv.org/abs/2602.05590) *Haojie Cheng,Shaun Jing Heng Ong,Shaoyu Cai,Aiden Tat Yang Koh,Fuxi Ouyang,Eng Tat Khoo* Main category: cs.CV TL;DR: EgoPoseVR is an end-to-end framework that fuses headset motion signals with egocentric RGB-D observations to estimate full-body pose in VR accurately and with temporal stability, without additional sensors.
Details Motivation: Existing head-mounted-camera approaches suffer from temporal instability, inaccurate lower-body estimation, and poor real-time performance in VR. Method: A dual-modality fusion pipeline: a spatiotemporal encoder extracts frame-level and joint-level representations, which are fused via cross-attention; a kinematic optimization module driven by HMD signals improves accuracy and stability; a synthetic dataset of 1.8 million frames is built for training and evaluation. Result: The method outperforms current state-of-the-art methods on multiple metrics; a user study shows significantly better ratings than baselines for accuracy, stability, embodiment, and intention for future use. Conclusion: EgoPoseVR achieves robust full-body pose tracking without body-worn sensors, offering a practical solution for VR embodiment. Abstract: Immersive virtual reality (VR) applications demand accurate, temporally coherent full-body pose tracking. Recent head-mounted camera-based approaches show promise in egocentric pose estimation, but encounter challenges when applied to VR head-mounted displays (HMDs), including temporal instability, inaccurate lower-body estimation, and the lack of real-time performance. To address these limitations, we present EgoPoseVR, an end-to-end framework for accurate egocentric full-body pose estimation in VR that integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A spatiotemporal encoder extracts frame- and joint-level representations, which are fused via cross-attention to fully exploit complementary motion cues across modalities. A kinematic optimization module then imposes constraints from HMD signals, enhancing the accuracy and stability of pose estimation. To facilitate training and evaluation, we introduce a large-scale synthetic dataset of over 1.8 million temporally aligned HMD and RGB-D frames across diverse VR scenarios. Experimental results show that EgoPoseVR outperforms state-of-the-art egocentric pose estimation models. A user study in real-world scenes further shows that EgoPoseVR achieved significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use compared to baseline methods. These results show that EgoPoseVR enables robust full-body pose tracking, offering a practical solution for accurate VR embodiment without requiring additional body-worn sensors or room-scale tracking systems.
### [118] [CAViT -- Channel-Aware Vision Transformer for Dynamic Feature Fusion](https://arxiv.org/abs/2602.05598) *Aon Safdar,Mohamed Saadeldin* Main category: cs.CV TL;DR: This paper proposes CAViT, a dual-attention Vision Transformer that replaces the static MLP in each Transformer block with channel-wise self-attention, enabling content-aware dynamic feature interaction; it significantly improves accuracy on multiple datasets while reducing parameter count and computation.
Details Motivation: Channel mixing in existing Vision Transformers (ViTs) is static: it relies on fixed MLPs and does not adapt to the input content. Method: CAViT applies spatial self-attention followed by channel self-attention in each Transformer block, dynamically recalibrating features based on global image context. Result: On five benchmark datasets from natural and medical domains, CAViT improves accuracy by up to 3.6% over the standard ViT baseline while reducing parameters and FLOPs by over 30%; attention maps show sharper, more semantically meaningful activation patterns. Conclusion: Through a unified, content-aware token mixing strategy, CAViT improves representational capacity without added depth or complexity, supporting the effectiveness of dynamic channel attention. Abstract: Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static, relying on fixed multilayer perceptrons (MLPs) that lack adaptability to input content. We introduce 'CAViT', a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified and content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing parameter count and FLOPs by over 30%. Qualitative attention maps reveal sharper and semantically meaningful activation patterns, validating the effectiveness of our attention-driven token mixing.
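The channel-wise self-attention step described in the CAViT abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption: the function name `channel_self_attention` is hypothetical, learned query/key/value projections are omitted, and the scaling by the square root of the token count is an illustrative choice. The idea shown is only that channels, rather than spatial tokens, attend to one another, so the mixing weights depend on the input.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def channel_self_attention(x):
    """Mix channels with input-dependent weights (illustrative, no learned projections).

    x is a list of N tokens, each a list of C channel values.
    Returns (output tokens of shape N x C, C x C attention matrix).
    """
    n, c = len(x), len(x[0])
    # View each channel as a length-N vector across all tokens.
    cols = [[x[t][j] for t in range(n)] for j in range(c)]
    scale = 1.0 / math.sqrt(n)
    # C x C affinities between channels, normalized row-wise.
    attn = [
        softmax([scale * sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(c)])
        for i in range(c)
    ]
    # Each output channel is an attention-weighted mix of the input channels.
    out_cols = [
        [sum(attn[i][j] * cols[j][t] for j in range(c)) for t in range(n)]
        for i in range(c)
    ]
    out = [[out_cols[i][t] for i in range(c)] for t in range(n)]
    return out, attn

tokens = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0]]  # 2 tokens, 3 channels
mixed, attention = channel_self_attention(tokens)
```

Unlike a fixed MLP, the C x C mixing matrix here is recomputed for every input, which is the contrast the abstract draws between static and content-aware channel mixing.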
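### [119] [Multi-instance robust fitting for non-classical geometric models](https://arxiv.org/abs/2602.05602) *Zongliang Zhang,Shuxiang Li,Xingwang Huang,Zongyue Wang* Main category: cs.CV TL;DR: This paper proposes a multi-instance robust fitting method for non-classical models (such as spiral curves, procedural character models, and free-form surfaces), solving the global optimization problem on noisy data with a novel estimator based on the model-to-data error and a meta-heuristic optimizer.
Details Motivation: Existing robust fitting methods mainly target classical geometric models (lines, circles, planes), offer little support for non-classical models, and are mostly limited to single-instance reconstruction; this paper addresses robust multi-instance reconstruction of non-classical models. Method: Multi-instance fitting is formulated as an optimization problem with an estimator and an optimizer: a model-to-data error estimator that handles outliers without a predefined error threshold, and, because this estimator is non-differentiable with respect to the model parameters, a meta-heuristic algorithm for global optimization. Result: The method is validated on a variety of non-classical models, and the code is open-sourced. Conclusion: The method achieves robust multi-instance fitting of non-classical models, overcoming the restriction of traditional methods to classical models and single instances, and is suited to noisy real-world data. Abstract: Most existing robust fitting methods are designed for classical models, such as lines, circles, and planes. In contrast, fewer methods have been developed to robustly handle non-classical models, such as spiral curves, procedural character models, and free-form surfaces. Furthermore, existing methods primarily focus on reconstructing a single instance of a non-classical model. This paper aims to reconstruct multiple instances of non-classical models from noisy data. We formulate this multi-instance fitting task as an optimization problem, which comprises an estimator and an optimizer. Specifically, we propose a novel estimator based on the model-to-data error, capable of handling outliers without a predefined error threshold. Since the proposed estimator is non-differentiable with respect to the model parameters, we employ a meta-heuristic algorithm as the optimizer to seek the global optimum. The effectiveness of our method is demonstrated through experimental results on various non-classical models. The code is available at https://github.com/zhangzongliang/fitting.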
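To make the estimator/optimizer split concrete, here is a toy sketch under stated assumptions. It is not the paper's estimator, whose definition the abstract does not give: it fits a single Archimedean spiral `r = a * t` with one parameter, measures a plain model-to-data error (mean distance from sampled model points to their nearest data point), and uses seeded random search as the simplest stand-in for a meta-heuristic optimizer. All function names and constants are hypothetical.

```python
import math
import random

def spiral_points(a, n=50, t_max=4 * math.pi):
    """Sample n points on the Archimedean spiral r = a * t (toy non-classical model)."""
    ts = [t_max * i / (n - 1) for i in range(n)]
    return [(a * t * math.cos(t), a * t * math.sin(t)) for t in ts]

def model_to_data_error(a, data):
    """Mean distance from each sampled model point to its nearest data point.

    The error is measured from the model side, so data points far from the
    model never enter the sum and no inlier threshold is needed here.
    """
    total = 0.0
    for mx, my in spiral_points(a):
        total += min(math.hypot(mx - dx, my - dy) for dx, dy in data)
    return total / 50

def random_search(data, lo=0.1, hi=1.0, n_candidates=200, seed=0):
    """Derivative-free search over the spiral parameter (stand-in meta-heuristic)."""
    rng = random.Random(seed)
    best_a, best_err = None, float("inf")
    for _ in range(n_candidates):
        a = rng.uniform(lo, hi)
        err = model_to_data_error(a, data)
        if err < best_err:
            best_a, best_err = a, err
    return best_a, best_err

# Noise-free inliers from a = 0.5 plus three gross outliers.
data = spiral_points(0.5) + [(9.0, 9.0), (-8.0, 7.5), (10.0, -9.0)]
best_a, best_err = random_search(data)
```

The search uses only error evaluations, never gradients, which is why a derivative-free optimizer fits a non-differentiable estimator. Handling multiple instances, as the paper does, would need a richer parameterization than this single-parameter example.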
### [120] [Unified Sensor Simulation for Autonomous Driving](https://arxiv.org/abs/2602.05617) *Nikolay Patakin,Arsenii Shirokov,Anton Konushin,Dmitry Senushkin* Main category: cs.CV TL;DR: XSIM is a sensor simulation framework for autonomous driving that extends 3DGUT splatting with rolling-shutter modeling, a phase modeling mechanism, and a dual-opacity Gaussian representation to improve geometric consistency and appearance realism in dynamic scenes, reaching state-of-the-art performance on several autonomous driving datasets.
Details Motivation: Existing 3DGUT splatting exhibits time and projection discontinuities at azimuth boundaries when handling spherical sensors such as LiDAR, which causes incorrect projection of Gaussian particles; it also lacks a unified way to model complex sensor distortions, especially in dynamic environments. Method: The XSIM framework: 1) generalized rolling-shutter modeling; 2) a phase modeling mechanism for spherical cameras that explicitly handles temporal and shape discontinuities at azimuth boundaries; 3) an extended 3D Gaussian representation with two independent opacity parameters that decouple geometry and color distributions. Result: XSIM outperforms strong recent baselines on Waymo Open Dataset, Argoverse 2, and PandaSet, reaching state-of-the-art performance and markedly improving geometric consistency and rendering realism. Conclusion: XSIM provides a unified, flexible, high-fidelity modeling framework for autonomous-driving sensor simulation, resolves a key difficulty in modeling spherical sensors, and is open-source and reproducible. Abstract: In this work, we introduce **XSIM**, a sensor simulation framework for autonomous driving. XSIM extends 3DGUT splatting with a generalized rolling-shutter modeling tailored for autonomous driving applications. Our framework provides a unified and flexible formulation for appearance and geometric sensor modeling, enabling rendering of complex sensor distortions in dynamic environments. We identify spherical cameras, such as LiDARs, as a critical edge case for existing 3DGUT splatting due to cyclic projection and time discontinuities at azimuth boundaries leading to incorrect particle projection. To address this issue, we propose a phase modeling mechanism that explicitly accounts for temporal and shape discontinuities of Gaussians projected by the Unscented Transform at azimuth borders. In addition, we introduce an extended 3D Gaussian representation that incorporates two distinct opacity parameters to resolve mismatches between geometry and color distributions. As a result, our framework provides enhanced scene representations with improved geometric consistency and photorealistic appearance. We evaluate our framework extensively on multiple autonomous driving datasets, including Waymo Open Dataset, Argoverse 2, and PandaSet. Our framework consistently outperforms strong recent baselines and achieves state-of-the-art performance across all datasets. The source code is publicly available at https://github.com/whesense/XSIM.
### [121] [ROMAN: Reward-Orchestrated Multi-Head Attention Network for Autonomous Driving System Testing](https://arxiv.org/abs/2602.05629) *Jianlei Chi,Yuzhen Wu,Jiaxuan Hou,Xiaodong Zhang,Ming Fan,Suhui Sun,Weijun Dai,Bo Li,Jianguo Sun,Jun Sun* Main category: cs.CV TL;DR: This paper proposes ROMAN, which combines a multi-head attention network with a traffic-law weighting mechanism to generate high-risk violation scenarios for more thorough and targeted testing of automated driving systems (ADS). Experiments show that it exceeds existing tools in both violation count and scenario diversity, and covers every clause of the input traffic laws.
Details Motivation: Current ADS testing struggles to generate complex, high-risk law-breaking scenarios and overlooks multi-vehicle interactions and critical situations, so safety evaluation is incomplete. Method: ROMAN uses a multi-head attention network to model interactions among vehicles, traffic signals, and other elements, and adds a large language model (LLM)-based risk weighting module that quantifies traffic-law violation risk along two dimensions, severity and occurrence. Result: Testing Baidu Apollo in CARLA, ROMAN's average violation count is 7.91% higher than ABLE's and 55.96% higher than LawBreaker's, with greater scenario diversity; it is the only tool that generates violation scenarios for every clause of the input traffic laws. Conclusion: ROMAN markedly improves the generation of high-risk law-breaking scenarios and the coverage of traffic laws in ADS testing, providing an effective means of validation for safer and more reliable autonomous driving deployment. Abstract: The Automated Driving System (ADS) acts as the brain of autonomous vehicles, responsible for their safety and efficiency. Safe deployment requires thorough testing in diverse real-world scenarios and compliance with traffic laws like speed limits, signal obedience, and right-of-way rules. Violations like running red lights or speeding pose severe safety risks. However, current testing approaches face significant challenges: a limited ability to generate complex and high-risk law-breaking scenarios, and a failure to account for complex interactions involving multiple vehicles and critical situations. To address these challenges, we propose ROMAN, a novel scenario generation approach for ADS testing that combines a multi-head attention network with a traffic law weighting mechanism. ROMAN is designed to generate high-risk violation scenarios to enable more thorough and targeted ADS evaluation. The multi-head attention mechanism models interactions among vehicles, traffic signals, and other factors. The traffic law weighting mechanism implements a workflow that leverages an LLM-based risk weighting module to evaluate violations based on the two dimensions of severity and occurrence. We have evaluated ROMAN by testing the Baidu Apollo ADS within the CARLA simulation platform and conducting extensive experiments to measure its performance. Experimental results demonstrate that ROMAN surpassed state-of-the-art tools ABLE and LawBreaker by achieving 7.91% higher average violation count than ABLE and 55.96% higher than LawBreaker, while also maintaining greater scenario diversity. 
In addition, only ROMAN successfully generated violation scenarios for every clause of the input traffic laws, enabling it to identify more high-risk violations than existing approaches.
### [122] [UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos](https://arxiv.org/abs/2602.05638) *Jinlin Wu,Felix Holm,Chuxi Chen,An Wang,Yaxin Hu,Xiaofan Ye,Zelin Zang,Miao Xu,Lihua Zhou,Huai Liao,Danny T. M. Chan,Ming Feng,Wai S. Poon,Hongliang Ren,Dong Yi,Nassir Navab,Gaofeng Meng,Jiebo Luo,Hongbin Liu,Zhen Lei* Main category: cs.CV TL;DR: UniSurg is a new foundation model for surgical video that abandons pixel-level reconstruction in favor of predicting latent motion representations, improves semantic understanding through three technical innovations, and significantly outperforms existing methods on many surgical video analysis tasks.
Details Motivation: Existing surgical video analysis methods rely too heavily on pixel-level reconstruction objectives, spending model capacity on low-level visual details (smoke, specular reflections, fluid motion) and neglecting the semantic structures essential to surgical understanding. Method: UniSurg is a video-native model built on the V-JEPA architecture, with three innovations: 1) motion-guided latent prediction to focus on semantically meaningful regions; 2) spatiotemporal affinity self-distillation to preserve relational consistency; 3) feature diversity regularization to prevent representation collapse in texture-sparse scenes. A large-scale surgical video dataset, UniSurg-15M, is built for pretraining. Result: UniSurg leads across 17 benchmarks, including surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. Conclusion: UniSurg establishes a new motion-oriented paradigm and standard for universal surgical video understanding. Abstract: While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details - such as smoke, specular reflections, and fluid motion - rather than semantic structures essential for surgical understanding. We present UniSurg, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), UniSurg introduces three key technical innovations tailored to surgical videos: 1) motion-guided latent prediction to prioritize semantically meaningful regions, 2) spatiotemporal affinity self-distillation to enforce relational consistency, and 3) feature diversity regularization to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate UniSurg-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that UniSurg significantly outperforms state-of-the-art methods on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
### [123] [Enhancing Personality Recognition by Comparing the Predictive Power of Traits, Facets, and Nuances](https://arxiv.org/abs/2602.05650) *Amir Ansari,Jana Subirana,Bruna Silva,Sergio Escalera,David Gallardo-Pujol,Cristina Palmero* Main category: cs.CV TL;DR: This paper investigates using finer-grained levels of the Big-Five personality model (such as nuances) to improve personality recognition from audiovisual interaction data; nuance-level models significantly outperform facet-level and trait-level models.
Details Motivation: Relying on broad trait scores as ground truth, together with limited training data, leads to poor generalization, because similar trait scores can arise from diverse, context-dependent behaviors. Method: Using the UDIVA v0.5 dataset, a Transformer-based model with cross-modal (audiovisual) and cross-subject (dyad-aware) attention is trained, and predictive performance is compared at the trait, facet, and nuance levels. Result: Nuance-level models consistently outperform facet-level and trait-level models across all interaction scenarios, reducing mean squared error by up to 74%. Conclusion: Using the finer-grained nuances of the Big-Five model as supervision significantly improves the performance and generalization of personality recognition models on audiovisual interaction data. Abstract: Personality is a complex, hierarchical construct typically assessed through item-level questionnaires aggregated into broad trait scores. Personality recognition models aim to infer personality traits from different sources of behavioral data. However, reliance on broad trait scores as ground truth, combined with limited training data, poses challenges for generalization, as similar trait scores can manifest through diverse, context-dependent behaviors. In this work, we explore the predictive impact of the more granular hierarchical levels of the Big-Five Personality Model, facets and nuances, to enhance personality recognition from audiovisual interaction data. Using the UDIVA v0.5 dataset, we trained a transformer-based model including cross-modal (audiovisual) and cross-subject (dyad-aware) attention mechanisms. Results show that nuance-level models consistently outperform facet and trait-level models, reducing mean squared error by up to 74% across interaction scenarios.
### [124] [ShapeUP: Scalable Image-Conditioned 3D Editing](https://arxiv.org/abs/2602.05676) *Inbar Gat,Dana Cohen-Bar,Guy Levy,Elad Richardson,Daniel Cohen-Or* Main category: cs.CV TL;DR: ShapeUP is a scalable, image-conditioned 3D editing framework that performs high-fidelity, structurally consistent 3D editing in a native 3D representation through supervised latent-to-latent translation.
Details Motivation: Existing 3D editing methods struggle to balance visual controllability, geometric consistency, and scalability: optimization-based methods are slow, multi-view 2D propagation drifts, and training-free latent editing is bound by frozen priors. Method: ShapeUP builds on a pretrained 3D foundation model and uses a 3D Diffusion Transformer (DiT) to learn a mapping from a source 3D shape plus an edited 2D image to the edited 3D shape; the training data are (source 3D, edited 2D image, target 3D) triplets; an image-as-prompt paradigm gives mask-free implicit localization and fine-grained control. Result: ShapeUP outperforms existing trained and training-free baselines in identity preservation and edit fidelity, while offering structural consistency and scalability. Conclusion: ShapeUP provides a robust, scalable paradigm for native 3D content creation, bridging a key gap between 3D generation and controllable editing. Abstract: Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
### [125] [Poster: Camera Tampering Detection for Outdoor IoT Systems](https://arxiv.org/abs/2602.05706) *Shadi Attarha,Kanaga Shanmugi,Anna Förster* Main category: cs.CV TL;DR: This paper proposes two camera tampering detection methods, one rule-based and one based on deep learning, compares them in terms of accuracy, computational demands, and training data requirements, and releases public datasets.
Details Motivation: Smart cameras outdoors are vulnerable to deliberate vandalism and harsh environmental conditions, which degrade monitoring; tampering detection on still images is harder than on video because there is no sequence of consecutive frames. Method: Two tampering detection methods, one rule-based and one deep-learning-based, are proposed and evaluated in real-world scenarios. Result: The deep-learning model is more accurate; the rule-based method better suits resource-constrained settings where a prolonged calibration phase is impractical; public datasets of normal, blurred, and rotated images are released. Conclusion: Each method has its strengths: deep learning for high-accuracy needs, the rule-based method for low-resource environments; the public datasets support further research in this area. Abstract: Recently, the use of smart cameras in outdoor settings has grown to improve surveillance and security. Nonetheless, these systems are susceptible to tampering, whether from deliberate vandalism or harsh environmental conditions, which can undermine their monitoring effectiveness. In this context, detecting camera tampering is more challenging when a camera is capturing still images rather than video as there is no sequence of continuous frames over time. In this study, we propose two approaches for detecting tampered images: a rule-based method and a deep-learning-based method. The aim is to evaluate how each method performs in terms of accuracy, computational demands, and the data required for training when applied to real-world scenarios. Our results show that the deep-learning model provides higher accuracy, while the rule-based method is more appropriate for scenarios where resources are limited and a prolonged calibration phase is impractical. We also offer publicly available datasets with normal, blurred, and rotated images to support the development and evaluation of camera tampering detection methods, addressing the need for such resources.
### [126] [Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization](https://arxiv.org/abs/2602.05718) *Yunchuan Ma,Laiyun Qing,Guorong Li,Yuqing Liu,Yuankai Qi,Qingming Huang* Main category: cs.CV TL;DR: This paper proposes a multi-task learning framework that uses three self-supervised temporal understanding tasks (action completion, action order understanding, and action regularity understanding) to strengthen a point-supervised temporal action localization model's understanding of temporal relationships among frames.
Details Motivation: Existing PTAL methods perform only snippet-level classification and do not explicitly model the temporal relationships among the frames of an action, although these relationships are crucial for localizing complete actions. Method: A multi-task learning framework with three self-supervised temporal understanding tasks (action completion, action order understanding, action regularity understanding) makes full use of point supervision to improve temporal understanding. Result: Extensive experiments on four benchmark datasets show that the method outperforms several state-of-the-art approaches. Conclusion: Explicitly modeling the temporal consistency of actions significantly improves point-supervised temporal action localization; this is the first work to explore this direction. Abstract: Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (i.e., labeling only a single frame per action instance) to train a model to effectively locate action instances within untrimmed videos. Most existing approaches design the task head of models with only a point-supervised snippet-level classification, without explicitly modeling the temporal relationships among the frames of an action. However, understanding the temporal relationships of frames is crucial because it can help a model understand how an action is defined and therefore benefits localizing the full frames of an action. To this end, in this paper, we design a multi-task learning framework that fully utilizes point supervision to boost the model's temporal understanding capability for action localization. Specifically, we design three self-supervised temporal understanding tasks: (i) Action Completion, (ii) Action Order Understanding, and (iii) Action Regularity Understanding. These tasks help a model understand the temporal consistency of actions across videos. To the best of our knowledge, this is the first attempt to explicitly explore temporal consistency for point supervision action localization. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method compared to several state-of-the-art approaches.
### [127] [Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification](https://arxiv.org/abs/2602.05729) *Lexiang Hu,Youze Xue,Dian Li,Gang Liu,Zhouchen Lin* Main category: cs.CV TL;DR: This paper proposes AGFF-Embed, which adaptively fuses global and fine-grained perceptual embeddings and uses Explicit Gradient Amplification (EGA) to enhance hard negatives, significantly improving multimodal embedding models on both general and fine-grained understanding tasks.
Details Motivation: Existing CLIP-based and MLLM-based multimodal embedding models capture only global semantics and struggle with complex scenarios that mix global and fine-grained elements, so a compatible fusion mechanism is needed. Method: AGFF-Embed prompts the MLLM to generate embeddings along multiple semantic dimensions and aggregates them adaptively and smoothly; EGA provides in-batch hard negative enhancement without fine-grained editing of the dataset. Result: On the MMEB and MMVP-VLM benchmarks, AGFF-Embed outperforms existing methods, reaching state of the art in both general and fine-grained understanding. Conclusion: AGFF-Embed unifies global and fine-grained perceptual modeling and, combined with EGA, improves discrimination of hard samples, offering a new paradigm for multimodal embedding design. Abstract: Multimodal embeddings serve as a bridge for aligning vision and language, with the two primary implementations -- CLIP-based and MLLM-based embedding models -- both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, thus necessitating a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we adapt AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negatives enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed comprehensively achieves state-of-the-art performance in both general and fine-grained understanding compared to other multimodal embedding models.
### [128] [Depth as Prior Knowledge for Object Detection](https://arxiv.org/abs/2602.05730) *Moussa Kassem Sbeyti,Nadja Klein* Main category: cs.CV TL;DR: This paper proposes DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature: depth-based loss weighting (DLW) and loss stratification (DLS) during training, and depth-aware confidence thresholding (DCT) during inference, significantly improve small-object detection without modifying the detector architecture or adding sensors.
Details Motivation: Small, distant objects are hard to detect because of scale variation, low resolution, and background clutter, and safety-critical applications need reliable detection; existing depth-based methods require complex, model-specific architectural modifications. Method: A theoretical analysis and an empirical study of the depth-detection relationship are followed by the DepthPrior framework, which comprises Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training and Depth-Aware Confidence Thresholding (DCT) during inference. Result: Across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet), small-object mAP_S improves by up to +9% and mAR_S by up to +7%, with inference recovery rates as high as 95:1 (true vs. false detections). Conclusion: DepthPrior exploits depth priors in a lightweight, general way and significantly improves small-object detection without additional sensors, architectural changes, or inference overhead. Abstract: Detecting small and distant objects remains challenging for object detectors due to scale variation, low resolution, and background clutter. Safety-critical applications require reliable detection of these objects for safe planning. Depth information can improve detection, but existing approaches require complex, model-specific architectural modifications. We provide a theoretical analysis followed by an empirical investigation of the depth-detection relationship. Together, they explain how depth causes systematic performance degradation and why depth-informed supervision mitigates it. We introduce DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature, providing comparable benefits without modifying detector architectures. DepthPrior consists of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training, and Depth-Aware Confidence Thresholding (DCT) during inference. The only overhead is the initial cost of depth estimation. Experiments across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet) demonstrate the effectiveness of DepthPrior, achieving up to +9% mAP$_S$ and +7% mAR$_S$ for small objects, with inference recovery rates as high as 95:1 (true vs. false detections). DepthPrior offers these benefits without additional sensors, architectural changes, or performance costs. Code is available at https://github.com/mos-ks/DepthPrior.
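As a rough illustration of using depth as a prior rather than as a fused feature, the sketch below shows one way loss weighting and confidence thresholding could depend on depth. The abstract does not state DepthPrior's formulas, so the linear weighting, the linear threshold schedule, the function names, and all constants (`alpha`, `near_thr`, `far_thr`, `max_depth`) are assumptions made for illustration only.

```python
def depth_loss_weights(depths, alpha=1.0):
    """Per-object loss weights that grow with depth (assumed linear form).

    Weights are normalized to mean 1 so the overall loss scale is unchanged.
    """
    d_min, d_max = min(depths), max(depths)
    span = (d_max - d_min) or 1.0  # avoid division by zero for equal depths
    raw = [1.0 + alpha * (d - d_min) / span for d in depths]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

def depth_aware_threshold(depth, near_thr=0.5, far_thr=0.25, max_depth=80.0):
    """Confidence threshold that relaxes linearly for distant objects (assumed schedule)."""
    t = min(max(depth / max_depth, 0.0), 1.0)
    return near_thr + t * (far_thr - near_thr)

def filter_detections(detections):
    """Keep (score, depth) pairs whose score clears the depth-dependent threshold."""
    return [(s, d) for s, d in detections if s >= depth_aware_threshold(d)]

weights = depth_loss_weights([5.0, 20.0, 60.0])        # far objects weigh more in training
kept = filter_detections([(0.3, 5.0), (0.3, 70.0)])    # same score, different depths
```

With these hypothetical settings, a low-confidence detection survives only when it is far away, where small objects tend to receive lower scores. Neither function touches the detector architecture, which matches the abstract's framing of depth as supervision and post-processing.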
### [129] [Neuro-Inspired Visual Pattern Recognition via Biological Reservoir Computing](https://arxiv.org/abs/2602.05737) *Luca Ciampi,Ludovico Iannello,Fabrizio Tonelli,Gabriele Lagani,Angelo Di Garbo,Federico Cremisi,Giuseppe Amato* Main category: cs.CV TL;DR: This paper proposes a biological reservoir computing (BRC) approach based on in vitro cultured cortical neurons: a high-density multi-electrode array (HD-MEA) stimulates and records neural activity, and a linear readout layer performs static visual pattern recognition, demonstrating that living neural networks can serve as effective reservoirs.
Details Motivation: The work aims to overcome the limits of artificial recurrent models that only approximate neural dynamics, and to explore the spontaneous and evoked activity of real biological neural circuits as a computational substrate, advancing neuromorphic computing and biologically inspired machine learning. Method: A network of in vitro cultured cortical neurons serves as the physical reservoir; the HD-MEA delivers input stimuli and records multi-channel neural responses, yielding high-dimensional biological feature representations, which a single-layer perceptron is trained to classify. Result: The system classifies accurately across tasks of increasing difficulty (pointwise stimuli, oriented bars, clock-digit-like shapes, and MNIST handwritten digits), producing stable, high-dimensional, separable representations despite biological noise and inter-session variability. Conclusion: In vitro cortical networks can act as effective biological reservoirs for static visual pattern recognition, giving empirical grounding for integrating living neural hardware into neuromorphic computing and supporting more biologically plausible models of visual computation. Abstract: In this paper, we present a neuro-inspired approach to reservoir computing (RC) in which a network of in vitro cultured cortical neurons serves as the physical reservoir. Rather than relying on artificial recurrent models to approximate neural dynamics, our biological reservoir computing (BRC) system leverages the spontaneous and stimulus-evoked activity of living neural circuits as its computational substrate. A high-density multi-electrode array (HD-MEA) provides simultaneous stimulation and readout across hundreds of channels: input patterns are delivered through selected electrodes, while the remaining ones capture the resulting high-dimensional neural responses, yielding a biologically grounded feature representation. A linear readout layer (single-layer perceptron) is then trained to classify these reservoir states, enabling the living neural network to perform static visual pattern-recognition tasks within a computer-vision framework. We evaluate the system across a sequence of tasks of increasing difficulty, ranging from pointwise stimuli to oriented bars, clock-digit-like shapes, and handwritten digits from the MNIST dataset. Despite the inherent variability of biological neural responses (arising from noise, spontaneous activity, and inter-session differences), the system consistently generates high-dimensional representations that support accurate classification. These results demonstrate that in vitro cortical networks can function as effective reservoirs for static visual pattern recognition, opening new avenues for integrating living neural substrates into neuromorphic computing frameworks. 
More broadly, this work contributes to the effort to incorporate biological principles into machine learning and supports the goals of neuro-inspired vision by illustrating how living neural systems can inform the design of efficient and biologically grounded computational models.
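The reservoir-plus-linear-readout pipeline from this abstract can be sketched in software, with the caveat that the reservoir here is only a stand-in: a fixed random projection with `tanh` and additive noise takes the place of the living culture and HD-MEA recordings. The function names, the noise level, and the toy 2x2 "oriented bar" stimuli are all hypothetical. What the sketch keeps from the source is the division of labor: the reservoir is never trained, and only a single-layer perceptron readout is fit on its states.

```python
import math
import random

def make_reservoir(n_inputs, n_units, seed=0):
    """Fixed random input weights; stands in for the untrained biological network."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)] for _ in range(n_units)]

def reservoir_state(weights, stimulus, rng, noise=0.05):
    """High-dimensional response to one stimulus, with trial-to-trial noise."""
    return [
        math.tanh(sum(w * s for w, s in zip(row, stimulus)) + rng.gauss(0.0, noise))
        for row in weights
    ]

def predict(w, b, state):
    """Linear readout followed by a hard threshold."""
    return 1 if sum(wi * xi for wi, xi in zip(w, state)) + b > 0 else 0

def train_readout(states, labels, epochs=50, lr=0.1):
    """Single-layer perceptron trained on reservoir states only."""
    w, b = [0.0] * len(states[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(states, labels):
            err = y - predict(w, b, x)
            if err:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b

# Toy stimuli on a 2x2 grid: horizontal bars (label 0) vs vertical bars (label 1).
stimuli = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
labels = [0, 0, 1, 1]
rng = random.Random(1)
reservoir = make_reservoir(n_inputs=4, n_units=32)
states = [reservoir_state(reservoir, s, rng) for s in stimuli]
w, b = train_readout(states, labels)
```

Because the readout is linear, any classification ability must come from how the reservoir spreads inputs across its units, which is the property the paper tests in the biological substrate.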
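### [130] [FMPose3D: monocular 3D pose estimation via flow matching](https://arxiv.org/abs/2602.05755) *Ti Wang,Xiaohang Yu,Mackenzie Weygandt Mathis* Main category: cs.CV TL;DR: This paper proposes FMPose3D, an efficient monocular 3D pose estimation framework based on flow matching. It casts pose estimation as a conditional distribution transport problem, generates diverse 3D pose hypotheses with only a few integration steps, and obtains an accurate single prediction through a Reprojection-based Posterior Expectation Aggregation (RPEA) module, reaching state-of-the-art results on both human and animal 3D pose benchmarks.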
Details Motivation: Monocular 3D pose estimation suffers from depth ambiguity and occlusion; conventional diffusion models perform well but are slow at inference, so a more efficient probabilistic generative method is needed. Method: Proposes the FMPose3D framework: flow matching learns an ODE velocity field that continuously transports samples from a standard Gaussian prior to the distribution of 3D poses conditioned on 2D inputs; multiple hypotheses are generated by sampling different noise seeds; an RPEA module aggregates them into a posterior expectation based on reprojection consistency. Result: Surpasses existing methods on Human3.6M and MPI-INF-3DHP and reaches state-of-the-art performance on the animal datasets Animal3D and CtrlAni3D; the code is open source. Conclusion: Flow matching offers an efficient, scalable, and well-generalizing probabilistic modeling paradigm for monocular 3D pose estimation, and FMPose3D demonstrates its effectiveness on both human and animal pose estimation. Abstract: Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have recently demonstrated strong performance, but their iterative denoising process typically requires many timesteps for each prediction, making inference computationally expensive. In contrast, we leverage Flow Matching (FM) to learn a velocity field defined by an Ordinary Differential Equation (ODE), enabling efficient generation of 3D pose samples with only a few integration steps. We propose a novel generative pose estimation framework, FMPose3D, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned only on 2D inputs. Although ODE trajectories are deterministic, FMPose3D naturally generates various pose hypotheses by sampling different noise seeds. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose3D surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both 3D pose domains. The code is available at https://github.com/AdaptiveMotorControlLab/FMPose3D.
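The idea of aggregating several 3D hypotheses by how well they reproject onto the 2D input can be illustrated with a toy sketch. This is not the authors' RPEA implementation: the orthographic projection (dropping z), the softmax weighting over negative reprojection error, and the `temperature` parameter are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

def reprojection_error(pose3d, pose2d):
    """Mean 2D joint distance after an ASSUMED orthographic projection (drop z)."""
    total = 0.0
    for (x, y, _z), (u, v) in zip(pose3d, pose2d):
        total += math.hypot(x - u, y - v)
    return total / len(pose2d)

def aggregate_hypotheses(hypotheses, pose2d, temperature=0.1):
    """Weighted mean of 3D hypotheses; weights are softmax(-error / temperature).

    `temperature` is a hypothetical knob, not a parameter from the paper.
    """
    errors = [reprojection_error(h, pose2d) for h in hypotheses]
    min_err = min(errors)  # subtract the minimum for numerical stability
    raw = [math.exp(-(e - min_err) / temperature) for e in errors]
    norm = sum(raw)
    weights = [r / norm for r in raw]
    fused = []
    for j in range(len(pose2d)):
        fused.append(tuple(
            sum(w * h[j][axis] for w, h in zip(weights, hypotheses))
            for axis in range(3)
        ))
    return fused, weights
```

Hypotheses that agree with the observed 2D joints dominate the weighted mean, which is one simple way to approximate an expectation over plausible poses.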
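### [131] [ReText: Text Boosts Generalization in Image-Based Person Re-identification](https://arxiv.org/abs/2602.05785) *Timur Mamedov,Karina Kvanchiani,Anton Konushin,Vadim Konushin* Main category: cs.CV TL;DR: This paper proposes ReText, which improves the generalization of cross-domain person re-identification through multi-task joint training (Re-ID, image-text matching, and text-guided image reconstruction) on multi-camera Re-ID data combined with single-camera data that carries textual descriptions.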
Details Motivation: Existing methods rely on complex architectures to close the domain gap, whereas recent findings show that stylistically diverse single-camera data helps generalization; such data, however, lacks cross-view variation and carries limited semantic information. Method: ReText jointly optimizes three tasks on a mixture of multi-camera Re-ID data and single-camera data with textual descriptions: (1) Re-ID on multi-camera data; (2) image-text matching; (3) text-guided reconstruction of single-camera images. Result: ReText significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks and shows strong generalization. Conclusion: ReText is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID, and it confirms that text augmentation enriches the semantics of single-camera data. Abstract: Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While multiple existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved with stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.
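### [132] [Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation](https://arxiv.org/abs/2602.05789) *Hengyi Wang,Ruiqiang Zhang,Chang Liu,Guanjie Wang,Zehua Ma,Han Fang,Weiming Zhang* Main category: cs.CV TL;DR: This paper proposes Allocentric Perceiver, a training-free strategy that uses off-the-shelf geometric experts to recover metric 3D states from images and builds a query-conditioned allocentric reference frame aligned with the instruction's semantic intent, improving vision-language models on allocentric spatial reasoning tasks.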
Details Motivation: Existing vision-language models (VLMs) are brittle on allocentric spatial queries that require explicit perspective shifts, because they reason from the observer-centric view rather than in a target-centric coordinate frame. Method: Proposes Allocentric Perceiver: 1) recover metric 3D states from one or more images, without training, using off-the-shelf geometric experts; 2) build a query-conditioned allocentric reference frame from the instruction's semantics; 3) deterministically transform the reconstructed geometry into the target frame and prompt the backbone VLM with structured, geometry-grounded representations. Result: Evaluated on multiple backbone models and spatial reasoning benchmarks, Allocentric Perceiver yields consistent gains of about 10% on allocentric tasks while maintaining strong egocentric performance, and it surpasses spatial-perception-finetuned models as well as state-of-the-art open-source and proprietary models. Conclusion: Turning implicit mental rotation into explicit geometric computation markedly strengthens the spatial grounding of VLMs without additional training, making the approach general and practical. Abstract: With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
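### [133] [Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning](https://arxiv.org/abs/2602.05809) *Enwei Tong,Yuanchao Bai,Yao Zhu,Junjun Jiang,Xianming Liu* Main category: cs.CV TL;DR: This paper proposes Focus-Scan-Refine (FSR), a training-free visual token pruning framework inspired by how humans answer visual questions, which substantially reduces computational cost while maintaining or even improving VLM performance.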
Details Motivation: Existing training-free visual token pruning methods struggle to preserve both key local evidence and global context at high compression ratios, which degrades performance. Method: FSR has three steps: 1) Focus: combine visual importance with instruction relevance to focus on key regions; 2) Scan: conditioned on the focused tokens, select the complementary context tokens that differ most from them; 3) Refine: aggregate nearby information into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the total number of tokens. Result: Across multiple VLM backbones and vision-language benchmarks, FSR consistently outperforms existing state-of-the-art pruning methods and achieves a better accuracy-efficiency trade-off. Conclusion: FSR is a plug-and-play, training-free, efficient visual token pruning framework that eases VLM inference latency and memory pressure while preserving semantic integrity and task performance. Abstract: Vision-language models (VLMs) often generate massive visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source code is available at https://github.com/ILOT-code/FSR
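The three stages described above can be sketched on toy data. This is only an illustration under stated assumptions, not the released FSR code: the linear score mix with `alpha`, the use of cosine similarity, the "lowest maximum similarity to the focus set" rule for scanning, and the function names are all hypothetical choices. Scores are assumed non-negative, with `n_focus >= 1` and `n_scan >= 1`.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def focus_scan_refine(tokens, visual_score, text_score, n_focus, n_scan, alpha=0.5):
    """Return n_focus + n_scan token vectors (toy version of the three stages)."""
    n = len(tokens)
    assert n_focus >= 1 and n_scan >= 1 and n_focus + n_scan <= n
    # Assumed combination rule: a linear mix of the two scores.
    score = [alpha * v + (1 - alpha) * t for v, t in zip(visual_score, text_score)]

    # Focus: keep the tokens with the highest combined score.
    order = sorted(range(n), key=lambda i: score[i], reverse=True)
    focus, rest = order[:n_focus], order[n_focus:]

    # Scan: among the rest, pick tokens least similar to every focused token.
    def max_sim_to_focus(i):
        return max(cosine(tokens[i], tokens[f]) for f in focus)
    anchors = sorted(rest, key=max_sim_to_focus)[:n_scan]
    leftovers = [i for i in rest if i not in anchors]

    # Refine: assign each leftover token to its most similar anchor, then
    # replace each anchor by the score-weighted mean of its group.
    groups = {a: [a] for a in anchors}
    for i in leftovers:
        best = max(anchors, key=lambda a: cosine(tokens[i], tokens[a]))
        groups[best].append(i)
    merged = []
    for a in anchors:
        members = groups[a]
        total = sum(score[i] for i in members)
        if total <= 0:  # fall back to a plain mean if all scores are zero
            weights = [1.0 / len(members)] * len(members)
        else:
            weights = [score[i] / total for i in members]
        merged.append([
            sum(w * tokens[i][d] for w, i in zip(weights, members))
            for d in range(len(tokens[a]))
        ])
    return [tokens[f] for f in focus] + merged
```

The output always has exactly `n_focus + n_scan` vectors, so the merging step adds information to the anchors without growing the token budget.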
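### [134] [NVS-HO: A Benchmark for Novel View Synthesis of Handheld Objects](https://arxiv.org/abs/2602.05822) *Musawar Ali,Manuel Carranza-García,Nicola Fioraio,Samuele Salti,Luigi Di Stefano* Main category: cs.CV TL;DR: This paper presents NVS-HO, the first benchmark for novel view synthesis (NVS) of handheld objects in real-world environments using only RGB inputs. It contains two complementary kinds of data, handheld sequences and calibration-board sequences, establishes baselines with NeRF and Gaussian Splatting, and reveals the performance bottlenecks of current methods in unconstrained handheld scenarios.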
Details Motivation: Existing novel view synthesis methods lack a unified, challenging RGB-only benchmark for real-world handheld-object scenarios, making it hard to assess model robustness and generalization. Method: Builds the NVS-HO benchmark: for each object, it records a handheld RGB sequence (used to model appearance) and a sequence with the object fixed on a ChArUco board (providing accurate camera poses and ground-truth images); SfM and VGGT serve as pose estimators, and NVS models are trained with NeRF and Gaussian Splatting. Result: Experiments show that mainstream NVS methods degrade significantly under handheld conditions, leaving a clear performance gap; NVS-HO provides a realistic and challenging new benchmark for the task. Conclusion: NVS-HO fills the gap of an RGB-only benchmark for novel view synthesis of real handheld objects, highlights the need for more robust pose estimation and representation learning, and is expected to drive progress in this direction. Abstract: We propose NVS-HO, the first benchmark designed for novel view synthesis of handheld objects in real-world environments using only RGB inputs. Each object is recorded in two complementary RGB sequences: (1) a handheld sequence, where the object is manipulated in front of a static camera, and (2) a board sequence, where the object is fixed on a ChArUco board to provide accurate camera poses via marker detection. The goal of NVS-HO is to learn an NVS model that captures the full appearance of an object from (1), whereas (2) provides the ground-truth images used for evaluation. To establish baselines, we consider both a classical SfM pipeline and a state-of-the-art pre-trained feed-forward neural network (VGGT) as pose estimators, and train NVS models based on NeRF and Gaussian Splatting. Our experiments reveal significant performance gaps in current methods under unconstrained handheld conditions, highlighting the need for more robust approaches. NVS-HO thus offers a challenging real-world benchmark to drive progress in RGB-based novel view synthesis of handheld objects.
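### [135] [Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation](https://arxiv.org/abs/2602.05827) *Hai Zhang,Siqi Liang,Li Chen,Yuxian Li,Yukuan Xu,Yichao Zhong,Fu Zhang,Hongyang Li* Main category: cs.CV TL;DR: This paper proposes SparseVideoNav, which introduces video generation models into Beyond-the-View Navigation (BVN) for the first time. By generating a sparse future video it achieves sub-second trajectory inference, reaches 2.5x the success rate of baselines in real-world zero-shot experiments, and realizes BVN in night scenes for the first time.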
Details Motivation: Existing vision-language navigation depends on long, detailed instructions, which runs counter to real-world autonomous navigation that should need only simple, high-level intents; the Beyond-the-View Navigation (BVN) problem of locating distant, unseen targets without dense guidance needs to be solved. Method: Observing that video generation models are naturally suited to long-horizon supervision, the authors introduce them into BVN for the first time; to overcome video generation latency, they propose SparseVideoNav, which generates a sparse future video covering a 20-second horizon to enable fast trajectory inference. Result: In real-world zero-shot experiments, SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and navigates successfully in challenging night scenes for the first time; inference is 27x faster than the unoptimized counterpart, reaching sub-second latency. Conclusion: With their long-horizon modeling ability, video generation models offer a new paradigm for BVN; SparseVideoNav demonstrates that the idea is effective and practical, opening a path toward lightweight, robust autonomous navigation deployable in the real world. Abstract: Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple and high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense and step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet, the prohibitive latency for generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart.
Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
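### [136] [Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning](https://arxiv.org/abs/2602.05829) *Yudi Shi,Shangzhe Di,Qirui Chen,Qinian Wang,Jiayin Cai,Xiaolong Jiang,Yao Hu,Weidi Xie* Main category: cs.CV TL;DR: This paper proposes Weaver, an end-to-end trainable multimodal reasoning agentic system that dynamically invokes diverse tools and uses reinforcement learning to improve video reasoning, performing especially well on long-video tasks.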
Details Motivation: Existing video reasoning methods based on text-centric Chain-of-Thought suffer from representational mismatch and limited perceptual ability. Method: Proposes the Weaver system: 1) a policy model dynamically invokes a variety of visual tools to progressively acquire key visual cues; 2) reinforcement learning without trajectory supervision lets the system freely explore strategies for using and combining tools. Result: Significantly improves performance on several complex video reasoning benchmarks, especially long-video tasks. Conclusion: Through an agentic, tool-augmented, reinforcement-learning-driven multimodal reasoning paradigm, Weaver effectively addresses current bottlenecks in video understanding and reasoning. Abstract: Video reasoning constitutes a comprehensive assessment of a model's capabilities, as it demands robust perceptual and interpretive skills, thereby serving as a means to explore the boundaries of model performance. While recent research has leveraged text-centric Chain-of-Thought reasoning to augment these capabilities, such approaches frequently suffer from representational mismatch and are restricted by limited perceptual acuity. To address these limitations, we propose Weaver, a novel, end-to-end trainable multimodal reasoning agentic system. Weaver empowers its policy model to dynamically invoke diverse tools throughout the reasoning process, enabling progressive acquisition of crucial visual cues and construction of authentic multimodal reasoning trajectories. Furthermore, we integrate a reinforcement learning algorithm to allow the system to freely explore strategies for employing and combining these tools with trajectory-free data. Extensive experiments demonstrate that our system, Weaver, enhances performance on several complex video reasoning benchmarks, particularly those involving long videos.
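### [137] [UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents](https://arxiv.org/abs/2602.05832) *Han Xiao,Guozhi Wang,Hao Wang,Shilong Liu,Yuxiang Chai,Yue Pan,Yufeng Zhou,Xiaoxin Chen,Yafei Wen,Hongsheng Li* Main category: cs.CV TL;DR: This paper proposes UI-Mem, a framework that uses a hierarchical experience memory (workflows, subtask skills, and failure patterns) and a stratified group sampling strategy to improve credit assignment and cross-task experience transfer for GUI agents in online reinforcement learning; a self-evolving loop keeps the memory updated, yielding clear gains in performance and generalization.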
Details Motivation: Online reinforcement learning for GUI agents suffers from inefficient credit assignment in long-horizon tasks and repeated errors across tasks, mainly because an effective experience transfer mechanism is missing. Method: Proposes the UI-Mem framework, comprising: 1) a structured hierarchical experience memory (parameterized templates storing workflows, skills, and failure patterns); 2) Stratified Group Sampling (injecting different levels of memory guidance within each rollout group to maintain diversity); 3) a Self-Evolving Loop (automatically abstracting new strategies and errors to update the memory). Result: On online GUI benchmarks, UI-Mem significantly outperforms traditional RL baselines and static reuse strategies and generalizes strongly to unseen applications. Conclusion: Through structured memory modeling and dynamic guidance, UI-Mem alleviates the credit assignment and experience reuse bottlenecks of online RL for GUI agents, offering a new paradigm for GUI agents that can evolve and transfer. Abstract: Online Reinforcement Learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. However, its effectiveness is severely hindered by inefficient credit assignment in long-horizon tasks and repetitive errors across tasks due to the lack of experience transfer. To address these challenges, we propose UI-Mem, a novel framework that enhances GUI online RL with a Hierarchical Experience Memory. Unlike traditional replay buffers, our memory accumulates structured knowledge, including high-level workflows, subtask skills, and failure patterns. These experiences are stored as parameterized templates that enable cross-task and cross-application transfer. To effectively integrate memory guidance into online RL, we introduce Stratified Group Sampling, which injects varying levels of guidance across trajectories within each rollout group to maintain outcome diversity, driving the unguided policy toward internalizing guided behaviors. Furthermore, a Self-Evolving Loop continuously abstracts novel strategies and errors to keep the memory aligned with the agent's evolving policy. Experiments on online GUI benchmarks demonstrate that UI-Mem significantly outperforms traditional RL baselines and static reuse strategies, with strong generalization to unseen applications. Project page: https://ui-mem.github.io
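### [138] [Self-Supervised Learning with a Multi-Task Latent Space Objective](https://arxiv.org/abs/2602.05845) *Pierre-François De Plaen,Abhishek Jha,Luc Van Gool,Tinne Tuytelaars,Marc Proesmans* Main category: cs.CV TL;DR: This paper addresses the instability of the multi-crop strategy in Siamese-network-based self-supervised learning (SSL): it assigns a separate predictor to each view type and adds cutout-masked views, forming a multi-task asymmetric Siamese SSL framework that combines global, local, and masked views and improves ResNet and ViT performance on ImageNet with stable training.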
Details Motivation: The multi-crop strategy strengthens SSL frameworks but causes training instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3; the authors aim to locate and fix the cause. Method: The analysis identifies the shared predictor as the root cause of the instability, so a separate predictor is assigned to each view type (e.g., global, local); each spatial transformation is further treated as a distinct alignment task, and cutout views (with part of the image masked) are added to build a unified multi-task asymmetric Siamese SSL framework. Result: The method significantly improves performance, consistently improving ResNet and ViT representations on ImageNet, with stable training that applies across backbones. Conclusion: Dedicated predictors per view type combined with cutout views effectively mitigate the instability of multi-crop SSL, giving a simple, general, and efficient improvement to self-supervised learning. Abstract: Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which adds small local crops to global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the shared predictor used across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, where part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views into a single framework. The approach is stable, generally applicable across backbones, and consistently improves the performance of ResNet and ViT models on ImageNet.
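### [139] [Pathwise Test-Time Correction for Autoregressive Long Video Generation](https://arxiv.org/abs/2602.05871) *Xunzhi Xiang,Zixuan Duan,Guiyu Zhang,Haiyu Zhang,Zhe Gao,Junta Wu,Shaofeng Zhang,Tengfei Wang,Qi Fan,Chunchao Guo* Main category: cs.CV TL;DR: This paper proposes Test-Time Correction (TTC), which uses the initial frame as a stable reference anchor to calibrate intermediate stochastic states during sampling, addressing error accumulation in distilled autoregressive diffusion models for long video generation; without any training, it substantially extends generation length while preserving quality.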
Details Motivation: Distilled autoregressive diffusion models work well for real-time short video synthesis but accumulate severe errors in long-sequence generation; existing Test-Time Optimization (TTO) methods fail to mitigate long-sequence drift because of unstable reward landscapes and the hypersensitivity of distilled parameters. Method: Proposes training-free Test-Time Correction (TTC), which uses the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. Result: TTC integrates seamlessly with various distilled models, extends generation length with negligible overhead, and matches the quality of resource-intensive training-based methods on 30-second benchmarks. Conclusion: TTC is an efficient, general, training-free correction scheme for long video generation that overcomes the drift of distilled models on long sequences. Abstract: Distilled autoregressive diffusion models facilitate real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. While existing Test-Time Optimization (TTO) methods prove effective for images or short clips, we identify that they fail to mitigate drift in extended sequences due to unstable reward landscapes and the hypersensitivity of distilled parameters. To overcome these limitations, we introduce Test-Time Correction (TTC), a training-free alternative. Specifically, TTC utilizes the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. Extensive experiments demonstrate that our method seamlessly integrates with various distilled models, extending generation lengths with negligible overhead while matching the quality of resource-intensive training-based methods on 30-second benchmarks.
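### [140] [Contour Refinement using Discrete Diffusion in Low Data Regime](https://arxiv.org/abs/2602.05880) *Fei Yu Guan,Ian Keefe,Sophie Wilkinson,Daniel D. B. Perrakis,Steven Waslander* Main category: cs.CV TL;DR: This paper proposes a lightweight discrete diffusion contour refinement method for detecting the boundaries of irregular, translucent objects with few training samples. Combining a CNN with self-attention, it extracts boundaries efficiently and robustly from fewer than 500 annotated images, achieves state-of-the-art or competitive results on several medical and remote-sensing datasets, and improves inference speed by 3.5x.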
Details Motivation: Boundary detection for irregular and translucent objects is critical in medical imaging, environmental monitoring, and manufacturing, but it faces scarce labeled data and limited on-site compute; existing segmentation research focuses on mask alignment, while boundary detection, especially in the low-data regime, remains underexplored. Method: Proposes a lightweight discrete diffusion contour refinement pipeline: a CNN with self-attention, conditioned on an initial segmentation mask, iteratively denoises a sparse contour representation; a simplified diffusion process, a customized network architecture, and minimal post-processing improve low-data performance and inference efficiency. Result: Outperforms several SOTA baselines on the KVASIR medical dataset, is competitive on HAM10K and the custom wildfire smoke dataset Smoke, and improves inference frame rate by 3.5x, using fewer than 500 training images. Conclusion: The method addresses accurate boundary detection under low-data, low-compute constraints and demonstrates the feasibility and benefits of discrete diffusion modeling and lightweight design for contour refinement. Abstract: Boundary detection of irregular and translucent objects is an important problem with applications in medical imaging, environmental monitoring and manufacturing, where many of these applications are plagued with scarce labeled data and limited in situ computational resources. While recent image segmentation studies focus on segmentation mask alignment with ground-truth, the task of boundary detection remains understudied, especially in the low data regime. In this work, we present a lightweight discrete diffusion contour refinement pipeline for robust boundary detection in the low data regime. We use a Convolutional Neural Network (CNN) architecture with self-attention layers as the core of our pipeline, and condition on a segmentation mask, iteratively denoising a sparse contour representation. We introduce multiple novel adaptations for improved low-data efficacy and inference efficiency, including using a simplified diffusion process, a customized model architecture, and minimal post processing to produce a dense, isolated contour given a dataset of size <500 training images. Our method outperforms several SOTA baselines on the medical imaging dataset KVASIR, is competitive on HAM10K and our custom wildfire dataset, Smoke, while improving inference framerate by 3.5x.
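### [141] [EoCD: Encoder only Remote Sensing Change Detection](https://arxiv.org/abs/2602.05882) *Mubashir Noman,Mustansar Fiaz,Hiyam Debary,Abdul Hannan,Shah Nawaz,Fahad Shahbaz Khan,Salman Khan* Main category: cs.CV TL;DR: This paper proposes EoCD, a lightweight and efficient change detection method that fuses the temporal images early and drops the conventional decoder in favor of a parameter-free multi-scale feature fusion module, substantially reducing model complexity and computational cost while maintaining high performance.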
Details Motivation: Existing change detection methods rely on Siamese encoders and sophisticated decoders, which makes models complex and computationally expensive; early-fusion methods avoid the Siamese structure but perform worse and still depend on sophisticated decoders. Method: Proposes Encoder-only Change Detection (EoCD), which fuses the temporal images early and replaces the conventional decoder with a parameter-free multi-scale feature fusion module. Result: EoCD is validated on four challenging change detection datasets, achieves the best balance between detection accuracy and inference speed, and shows that performance depends mainly on the encoder, so the decoder is not essential. Conclusion: A decoder is not required for change detection; an encoder with a lightweight fusion module suffices for high-performance, low-complexity change detection. Abstract: Being a cornerstone of temporal analysis, change detection has been playing a pivotal role in modern earth observation. Existing change detection methods rely on the Siamese encoder to individually extract temporal features followed by temporal fusion. Subsequently, these methods design sophisticated decoders to improve the change detection performance without taking into consideration the complexity of the model. These issues increase the overall computational cost as well as the network's complexity, which is undesirable. Alternatively, a few methods utilize the early fusion scheme to combine the temporal images. These methods avoid the extra overhead of the Siamese encoder; however, they also rely on sophisticated decoders for better performance. In addition, these methods demonstrate inferior performance compared to late-fusion-based methods. To bridge these gaps, we introduce encoder only change detection (EoCD), a simple and effective method for the change detection task. The proposed method performs early fusion of the temporal data and replaces the decoder with a parameter-free multiscale feature fusion module, thereby significantly reducing the overall complexity of the model. EoCD demonstrates the optimal balance between change detection performance and prediction speed across a variety of encoder architectures. Additionally, EoCD demonstrates that the performance of the model is predominantly dependent on the encoder network, making the decoder an additional component. Extensive experimentation on four challenging change detection datasets reveals the effectiveness of the proposed method.
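### [142] [Neural Implicit 3D Cardiac Shape Reconstruction from Sparse CT Angiography Slices Mimicking 2D Transthoracic Echocardiography Views](https://arxiv.org/abs/2602.05884) *Gino E. Jansen,Carolina Brás,R. Nils Planken,Mark J. Schuuring,Berto J. Bouma,Ivana Išgum* Main category: cs.CV TL;DR: This paper proposes a neural implicit function method that reconstructs complete 3D cardiac structures from segmentations of sparse CTA planes, in support of quantitative analysis in 2D transthoracic echocardiography (TTE). With planes mimicking standard TTE views it achieves accurate 3D reconstruction, and its left-ventricle and left-atrium volume measurements are markedly better than the clinically used Simpson's biplane rule.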
Details Motivation: To improve the accuracy of quantitative analysis of 3D cardiac chamber structure and function in 2D transthoracic echocardiography (TTE), overcoming its inherent 2D limitation and the errors of existing clinical methods such as Simpson's biplane rule. Method: A neural implicit function (multi-layer perceptron) learns shape priors from 3D CTA segmentations; training uses complete 3D CTA segmentations; at test time only segmentations of sparse CTA planes mimicking standard TTE views are given, and the latent code and the rigid transforms that map the observed planes into 3D space are jointly optimized to complete the reconstruction. Result: On a held-out CTA test set, the average Dice coefficient across all structures is 0.86 ± 0.04; left-ventricle and left-atrium volume errors are markedly lower than with Simpson's biplane rule (4.88 ± 4.26 mL vs. 8.14 ± 6.04 mL, and 6.40 ± 7.37 mL vs. 37.76 ± 22.96 mL, respectively). Conclusion: The method offers a viable and more accurate route to 3D cardiac chamber quantification for 2D TTE and may improve clinical diagnostic accuracy. Abstract: Accurate 3D representations of cardiac structures allow quantitative analysis of anatomy and function. In this work, we propose a method for reconstructing complete 3D cardiac shapes from segmentations of sparse planes in CT angiography (CTA) for application in 2D transthoracic echocardiography (TTE). Our method uses a neural implicit function to reconstruct the 3D shape of the cardiac chambers and left-ventricle myocardium from sparse CTA planes. To investigate the feasibility of achieving 3D reconstruction from 2D TTE, we select planes that mimic the standard apical 2D TTE views. During training, a multi-layer perceptron learns shape priors from 3D segmentations of the target structures in CTA. At test time, the network reconstructs 3D cardiac shapes from segmentations of TTE-mimicking CTA planes by jointly optimizing the latent code and the rigid transforms that map the observed planes into 3D space. For each heart, we simulate four realistic apical views, and we compare reconstructed multi-class volumes with the reference CTA volumes. On a held-out set of CTA segmentations, our approach achieves an average Dice coefficient of 0.86 $\pm$ 0.04 across all structures. Our method also achieves markedly lower volume errors than the clinical standard, Simpson's biplane rule: 4.88 $\pm$ 4.26 mL vs. 8.14 $\pm$ 6.04 mL, respectively, for the left ventricle; and 6.40 $\pm$ 7.37 mL vs. 37.76 $\pm$ 22.96 mL, respectively, for the left atrium.
This suggests that our approach offers a viable route to more accurate 3D chamber quantification in 2D transthoracic echocardiography.
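For readers unfamiliar with the two evaluation quantities quoted above, the following minimal sketch shows how a Dice coefficient and a chamber volume can be computed from binary masks. It is a generic illustration, not the authors' evaluation code; the flat-list mask representation and the `voxel_volume_mm3` parameter are assumptions.

```python
def dice(pred, ref):
    """Dice coefficient 2|A and B| / (|A| + |B|) for flat 0/1 mask sequences."""
    inter = sum(1 for p, r in zip(pred, ref) if p and r)
    size = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    return 2.0 * inter / size if size else 1.0  # two empty masks count as a match

def volume_ml(mask, voxel_volume_mm3):
    """Volume in millilitres: foreground voxel count times voxel volume (1 mL = 1000 mm^3)."""
    return sum(1 for m in mask if m) * voxel_volume_mm3 / 1000.0

def abs_volume_error_ml(pred, ref, voxel_volume_mm3):
    """Absolute volume difference between a reconstruction and its reference."""
    return abs(volume_ml(pred, voxel_volume_mm3) - volume_ml(ref, voxel_volume_mm3))
```

A reported figure such as "4.88 ± 4.26 mL" would then correspond to the mean and standard deviation of such absolute errors over the test hearts.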
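### [143] [CLIP-Map: Structured Matrix Mapping for Parameter-Efficient CLIP Compression](https://arxiv.org/abs/2602.05909) *Kangjie Zhang,Wenxuan Huang,Xin Zhou,Boxiang Zhou,Dejia Song,Yuan Xie,Baochang Zhang,Lizhuang Ma,Nemo Chen,Xu Tang,Yao Hu,Shaohui Lin* Main category: cs.CV TL;DR: This paper proposes CLIP-Map, a mapping-based CLIP compression framework that performs full mapping with learnable matrices and Kronecker factorization, uses Diagonal Inheritance Initialization to ease optimization, and clearly outperforms conventional weight-selection-based compression at high compression ratios.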
Details Motivation: CLIP models have high compute and memory costs, making them hard to deploy in resource-limited scenarios; existing weight-selection-based compression methods severely damage feature representation ability under extreme compression. Method: Proposes the CLIP-Map framework: Full-Mapping with Kronecker Factorization combines the original pretrained weights through learnable matrices; Diagonal Inheritance Initialization mitigates the distribution shift problem in mapping learning. Result: CLIP-Map outperforms selection-based compression methods across compression ratios, with especially large gains at high compression. Conclusion: Mapping-based weight inheritance preserves the information in the original CLIP more effectively than selection-based inheritance and is a better paradigm for model lightweighting. Abstract: Contrastive Language-Image Pre-training (CLIP) has been widely applied in various computer vision tasks, e.g., text-to-image generation, image-text retrieval and image captioning. However, CLIP suffers from high memory and computation cost, which prohibits its use in resource-limited application scenarios. Existing CLIP compression methods typically reduce the size of pre-trained CLIP weights by selecting a subset of them as weight inheritance for further retraining via mask optimization or important weight measurement. However, this select-based weight inheritance often compromises feature representation ability, especially under extreme compression. In this paper, we propose a novel mapping-based CLIP compression framework, CLIP-Map. It leverages learnable matrices to map and combine pretrained weights by Full-Mapping with Kronecker Factorization, aiming to preserve as much information from the original weights as possible. To mitigate the optimization challenges introduced by the learnable mapping, we propose Diagonal Inheritance Initialization to reduce the distribution shift problem for efficient and effective mapping learning. Extensive experimental results demonstrate that the proposed CLIP-Map outperforms select-based frameworks across various compression ratios, with particularly significant gains observed under high compression settings.
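### [144] [Multi-Scale Global-Instance Prompt Tuning for Continual Test-time Adaptation in Medical Image Segmentation](https://arxiv.org/abs/2602.05937) *Lingrui Li,Yanfeng Zhou,Nan Pu,Xin Chen,Zhun Zhong* Main category: cs.CV TL;DR: This paper proposes Multi-scale Global-Instance Prompt Tuning (MGIPT) to address error accumulation, catastrophic forgetting, and privacy leakage in continual test-time adaptation (CTTA) for medical image semantic segmentation. An Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP) jointly model instance-level and global domain-level knowledge, improving robust adaptation under cross-center distribution shift.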
Details Motivation: Medical images from different clinical centers exhibit distribution shift, which hinders deployment of pre-trained models; existing CTTA methods update parameters incrementally and are prone to error accumulation and catastrophic forgetting; prompt-tuning-based methods improve on this but still lack multi-scale prompt diversity, model instance-specific knowledge inadequately, and risk privacy leakage. Method: Proposes MGIPT with two core modules: 1) an Adaptive-scale Instance Prompt (AIP) that dynamically learns lightweight, instance-specific prompts and mitigates error accumulation through adaptive optimal-scale selection; 2) a Multi-scale Global-level Prompt (MGP) that captures domain-level knowledge across scales to resist forgetting; the two are combined through a weighted ensemble for joint global and instance-level adaptation. Result: Experiments on multiple medical image segmentation benchmarks show that MGIPT clearly outperforms state-of-the-art methods and adapts more robustly to continually changing target domains. Conclusion: By combining multi-scale, global, and instance-level prompts, MGIPT mitigates error accumulation, catastrophic forgetting, and privacy issues in CTTA, offering a new paradigm for robust cross-center medical image segmentation. Abstract: Distribution shift is a common challenge in medical images obtained from different clinical centers, significantly hindering the deployment of pre-trained semantic segmentation models in real-world applications across multiple domains. Continual Test-Time Adaptation (CTTA) has emerged as a promising approach to address cross-domain shifts during continually evolving target domains. Most existing CTTA methods rely on incrementally updating model parameters, which inevitably suffer from error accumulation and catastrophic forgetting, especially in long-term adaptation. Recent prompt-tuning-based works have shown potential to mitigate the two issues above by updating only visual prompts. While these approaches have demonstrated promising performance, several limitations remain: 1) lacking multi-scale prompt diversity, 2) inadequate incorporation of instance-specific knowledge, and 3) risk of privacy leakage. To overcome these limitations, we propose Multi-scale Global-Instance Prompt Tuning (MGIPT), to enhance scale diversity of prompts and capture both global- and instance-level knowledge for robust CTTA. Specifically, MGIPT consists of an Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP). AIP dynamically learns lightweight and instance-specific prompts to mitigate error accumulation with an adaptive optimal-scale selection mechanism. MGP captures domain-level knowledge across different scales to ensure robust adaptation with anti-forgetting capabilities.
These complementary components are combined through a weighted ensemble approach, enabling effective dual-level adaptation that integrates both global and local information. Extensive experiments on medical image segmentation benchmarks demonstrate that our MGIPT outperforms state-of-the-art methods, achieving robust adaptation across continually changing target domains.
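### [145] [Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching](https://arxiv.org/abs/2602.05951) *Junwan Kim,Jiho Park,Seonghu Jeon,Seungryong Kim* Main category: cs.CV TL;DR: This paper proposes a way to design the source distribution for conditional flow matching: it learns a condition-dependent source distribution and introduces variance regularization and directional alignment to avoid collapse and instability, markedly improving convergence speed and generation quality in text-to-image generation.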
Details Motivation: Most existing flow matching methods keep the standard Gaussian source distribution inherited from diffusion models and do not treat the source distribution itself as an optimization target, leaving untapped potential in strongly conditioned tasks such as text-to-image generation. Method: Learns a condition-dependent source distribution; adds variance regularization and a directional alignment constraint between source and target distributions to stabilize training; analyzes how the choice of target representation space affects structured source distributions. Result: Achieves consistent improvements on multiple text-to-image benchmarks, with up to 3x faster convergence in FID, confirming the effectiveness and practicality of source distribution design. Conclusion: Source distribution design is a key, optimizable dimension of conditional flow matching, and a principled design can markedly improve model performance and training stability. Abstract: Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under the flow matching objective that better exploits rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.
### [146] [LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation](https://arxiv.org/abs/2602.05966) *Mirlan Karimov,Teodora Spasojevic,Markus Braun,Julian Wiederer,Vasileios Belagiannis,Marc Pollefeys* Main category: cs.CV TL;DR: This paper proposes the Localized Semantic Alignment (LSA) framework, which adds a localized semantic feature consistency loss when fine-tuning pre-trained video generation models, improving the temporal consistency of generated dynamic objects without external control signals at inference time.
Details Motivation: Existing controllable video generation methods rely on control signals at inference time to ensure temporal consistency, which limits their use as scalable, general-purpose data engines. Method: LSA uses an off-the-shelf feature extraction model to extract and compare semantic features in the dynamic-object regions of ground-truth and generated clips, builds a localized semantic consistency loss from them, and fine-tunes the pre-trained model jointly with the standard diffusion loss. Result: A single epoch of fine-tuning outperforms the baselines on common video generation metrics; experiments on nuScenes and KITTI show improved temporal consistency with no added inference overhead or external control signals; two object detection metrics, mAP and mIoU, are adapted to further assess temporal consistency. Conclusion: LSA is a simple, effective fine-tuning strategy with no additional inference cost that markedly improves the temporal consistency and generalization of video generation in autonomous driving scenes. Abstract: Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the output of an off-the-shelf feature extraction model between the ground-truth and generated video clips localized around dynamic objects, which induces a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines in common video generation evaluation metrics. To further test the temporal consistency in generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on the nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without the need for external control signals during inference or any additional computational overhead.
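The loss combination described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: feature maps are plain 2D lists of floats, object regions are axis-aligned boxes `(top, left, bottom, right)` with exclusive end indices, the consistency term is a mean squared difference, and `weight` is a hypothetical balancing coefficient. The abstract specifies none of these details.

```python
def crop(feature_map, box):
    """Return the part of a 2D feature map inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [row[left:right] for row in feature_map[top:bottom]]


def localized_feature_loss(real_feats, gen_feats, boxes):
    """Mean squared feature difference, computed only inside dynamic-object boxes."""
    total, count = 0.0, 0
    for box in boxes:
        for real_row, gen_row in zip(crop(real_feats, box), crop(gen_feats, box)):
            for r, g in zip(real_row, gen_row):
                total += (r - g) ** 2
                count += 1
    return total / count if count else 0.0


def total_loss(diffusion_loss, semantic_loss, weight=0.1):
    """Standard diffusion loss plus the weighted localized consistency term."""
    return diffusion_loss + weight * semantic_loss
```

Because the mismatch is measured only inside the boxes, differences in static background regions do not contribute to the extra term. That is the sense in which the alignment is "localized".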
### [147] [RISE-Video: Can Video Generators Decode Implicit World Rules?](https://arxiv.org/abs/2602.05986) *Mingxin Liu,Shuran Ma,Shibei Meng,Xiangyu Zhao,Zicheng Zhang,Shaofeng Zhang,Zhihang Zhong,Peixian Chen,Haoyu Cao,Xing Sun,Haodong Duan,Xue Yang* Main category: cs.CV TL;DR: This paper presents RISE-Video, a reasoning-oriented benchmark for text-image-to-video generation that emphasizes understanding of and reasoning over implicit world rules rather than visual fidelity alone. It contains 467 human-annotated samples and four evaluation metrics, and uses large multimodal models for scalable automated evaluation. Experiments reveal pervasive reasoning deficiencies in 11 state-of-the-art models under complex implicit constraints.
Details Motivation: Generative video models achieve high visual quality but lack the ability to model and reason over implicit world rules (such as commonsense, physical laws, and spatio-temporal dynamics), so a dedicated reasoning-oriented benchmark is needed. Method: Build the RISE-Video benchmark with 467 human-annotated samples across 8 categories; design a four-metric evaluation protocol (Reasoning Alignment, Temporal Consistency, Physical Rationality, Visual Quality); develop an automated evaluation pipeline based on Large Multimodal Models (LMMs). Result: Large-scale experiments on 11 state-of-the-art TI2V models show that all of them fall markedly short on complex reasoning tasks involving implicit constraints, particularly in physical rationality and reasoning alignment. Conclusion: RISE-Video offers a new paradigm for evaluating the cognitive abilities of video generation models, exposes the tendency of current models to favor appearance over reasoning, and points toward next-generation generative models capable of world simulation. Abstract: While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: *Reasoning Alignment*, *Temporal Consistency*, *Physical Rationality*, and *Visual Quality*. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
### [148] [VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation](https://arxiv.org/abs/2602.05998) *Jie Deng,Kaichun Yao,Libo Zhang* Main category: cs.CV TL;DR: This paper proposes the VisRefiner framework, which improves screenshot-to-code performance and self-refinement ability by having the model learn from visual differences between rendered results and reference designs.
Details Motivation: Existing models generate code directly from screenshots without observing the visual outcome of the generated code, whereas human developers iteratively render, compare, and adjust code according to visual differences; the authors therefore want models to learn from visual differences as well. Method: The VisRefiner training framework (1) constructs difference-aligned supervision that associates visual differences with the corresponding code edits, and (2) introduces a reinforcement learning self-refinement stage based on comparing the rendered output with the target design. Result: Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity and gives models strong self-refinement ability. Conclusion: Learning from visual differences effectively advances screenshot-to-code generation. Abstract: Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
### [149] [GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?](https://arxiv.org/abs/2602.06013) *Ruihang Li,Leigang Qu,Jingxu Zhang,Dongnan Gui,Mengde Xu,Xiaosong Zhang,Han Hu,Wenjie Wang,Jiaqi Wang* Main category: cs.CV TL;DR: This paper proposes the GenArena framework, which replaces traditional absolute scoring with a pairwise comparison paradigm, markedly improving the stability and human alignment of visual generation evaluation and enabling open-source models to outperform top-tier proprietary models in evaluation.
Details Motivation: Visual generation models are advancing quickly and traditional evaluation methods no longer suffice, so Vision-Language Models must be adopted as surrogate judges; the prevailing absolute pointwise scoring standard suffers from stochastic inconsistency and poor alignment with human perception. Method: Propose GenArena, a unified evaluation framework that replaces absolute pointwise scoring with pairwise comparison, and systematically validate it across a range of visual generation tasks. Result: GenArena improves evaluation accuracy by over 20% and reaches a Spearman correlation of 0.86 with the LMArena leaderboard, far above the 0.36 of pointwise methods; adopting this protocol alone lets off-the-shelf open-source models outperform top-tier proprietary models in evaluation. Conclusion: Pairwise comparison is a more stable and more human-aligned standard for visual generation evaluation, and GenArena gives the community a rigorous, automated benchmark. Abstract: The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
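As a rough illustration of how pairwise judgments can be turned into a ranking and then compared against a reference leaderboard with Spearman correlation, here is a small pure-Python sketch. It is not GenArena's pipeline: the win-rate aggregation is a simple stand-in (the abstract does not say how comparisons are aggregated), the Spearman formula used assumes no tied ranks, and all model names and scores in the usage note are made up.

```python
def win_rates(models, comparisons):
    """Aggregate (winner, loser) pairs from a pairwise judge into per-model win rates."""
    wins = {m: 0 for m in models}
    games = {m: 0 for m in models}
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {m: wins[m] / games[m] if games[m] else 0.0 for m in models}


def ranks(scores):
    """Map each model to its rank (1 = highest score); assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {m: i + 1 for i, m in enumerate(ordered)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); no ties."""
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

For example, with hypothetical models `"A"`, `"B"`, `"C"` and judged pairs `[("A", "B"), ("A", "C"), ("B", "C")]`, `win_rates` orders them A, B, C. Comparing that against a reference leaderboard with the same order gives a correlation of 1.0.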
### [150] [MambaVF: State Space Model for Efficient Video Fusion](https://arxiv.org/abs/2602.06017) *Zixiang Zhao,Yukun Cui,Lilun Deng,Haowen Bai,Haotong Qin,Tao Feng,Konrad Schindler* Main category: cs.CV TL;DR: This paper proposes MambaVF, an efficient video fusion framework based on state space models (SSMs) that models long-range temporal dependencies without optical flow estimation, substantially reducing computation and parameter count while reaching state-of-the-art performance on several video fusion tasks.
Details Motivation: Existing video fusion methods rely heavily on optical flow estimation and feature warping, leading to high computational cost and poor scalability. Method: Reformulate video fusion as a sequential state update process and use a lightweight SSM fusion module whose spatio-temporal bidirectional scanning mechanism replaces conventional flow-guided alignment. Result: State-of-the-art results on multi-exposure, multi-focus, infrared-visible, and medical video fusion benchmarks, with up to 92.25% fewer parameters, up to 88.79% fewer FLOPs, and a 2.1x speedup. Conclusion: MambaVF is an efficient, scalable, high-performing approach to video fusion that demonstrates the potential of SSMs for this task. Abstract: Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF enjoys high efficiency, reducing parameters by up to 92.25% and computational FLOPs by up to 88.79%, with a 2.1x speedup compared to existing methods. Project page: https://mambavf.github.io
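The phrase "sequential state update with linear complexity" can be illustrated with a deliberately simplified scalar recurrence. This is a toy sketch, not MambaVF's module: it uses fixed coefficients `a`, `b`, `c` (hypothetical values), one scalar per frame, and averages a forward and a backward pass. The real model scans spatio-temporal tokens, and its parameterization is not given in the abstract.

```python
def scan(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t in one pass over xs."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs, a=0.5, b=1.0, c=1.0):
    """Average a forward scan and a backward scan so each step sees both directions."""
    forward = scan(xs, a, b, c)
    backward = scan(xs[::-1], a, b, c)[::-1]
    return [(f + g) / 2.0 for f, g in zip(forward, backward)]
```

Each pass touches every element once, so the cost grows linearly with sequence length, and no explicit motion estimate between frames is needed to propagate information along the sequence.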
### [151] [Context Forcing: Consistent Autoregressive Video Generation with Long Context](https://arxiv.org/abs/2602.06028) *Shuo Chen,Cong Wei,Sun Sun,Ping Nie,Kai Zhou,Ge Zhang,Ming-Hsuan Yang,Wenhu Chen* Main category: cs.CV TL;DR: This paper proposes the Context Forcing framework, which trains a long-context student under the guidance of a long-context teacher to resolve the student-teacher mismatch in existing streaming tuning, and introduces a Slow-Fast Memory architecture for efficient generation of very long videos (e.g., 2 minutes), markedly improving long-term consistency.
Details Motivation: Existing real-time long video generation methods supervise a long-context student with a short-context teacher, creating a structural mismatch in temporal dependency modeling that limits the student's ability to model long-range dependencies. Method: Propose Context Forcing, which uses a long-context teacher aware of the full generation history, and design a Slow-Fast Memory context management system that compresses the linearly growing context into a low-redundancy slow-fast memory structure. Result: Effective context lengths exceeding 20 seconds (2 to 10 times longer than state-of-the-art methods such as LongLive and Infinite-RoPE), surpassing existing baselines on multiple long video evaluation metrics. Conclusion: By eliminating the student-teacher mismatch and improving context management, Context Forcing provides a more robust and consistent training paradigm for long video generation. Abstract: Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical **student-teacher mismatch**: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose **Context Forcing**, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a **Slow-Fast Memory** architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds, 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
### [152] [Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation](https://arxiv.org/abs/2602.06032) *David Shavin,Sagie Benaim* Main category: cs.CV TL;DR: This paper proposes the Splat and Distill framework, which lifts features from 2D vision foundation models into a 3D Gaussian representation in a feed-forward manner and reprojects them to novel viewpoints to distill geometry-aware knowledge, markedly improving the model's 3D awareness and semantic richness.
Details Motivation: Existing 2D vision foundation models lack 3D awareness, limiting their performance on tasks that require geometric understanding. Method: Introduce a feed-forward 3D reconstruction pipeline that lifts the teacher model's 2D features into an explicit 3D Gaussian representation, then "splats" (reprojects) them onto novel viewpoints to produce supervision signals that are distilled into the student model; this replaces conventional per-scene optimization and avoids feature-averaging artifacts. Result: Significantly outperforms prior methods on downstream tasks including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation, improving both 3D awareness and the semantic richness of 2D features. Conclusion: Splat and Distill instills strong 3D awareness into 2D vision foundation models while remaining efficient and generalizable, moving vision models toward geometry-aware intelligence. Abstract: Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at https://davidshavin4.github.io/Splat-and-Distill/
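### [153] [V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval](https://arxiv.org/abs/2602.06034) *Dongyang Chen,Chaoyang Wang,Dezhao SU,Xi Xiao,Zeyu Zhang,Jing Xiong,Qing Li,Yuzhang Shang,Shichao Ka* Main category: cs.CV TL;DR: This paper proposes the V-Retrver framework, which reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. The model actively acquires visual evidence by calling external visual tools, interleaving hypothesis generation with targeted visual verification, and is trained as an evidence-gathering retrieval agent with a curriculum learning strategy, yielding marked gains in retrieval accuracy and reasoning reliability on multiple benchmarks.
Details Motivation: Existing methods are largely language-driven, rely on static visual encodings, and cannot actively verify fine-grained visual evidence, which leads to speculative reasoning in visually ambiguous cases. Method: Propose V-Retrver, an evidence-driven retrieval framework that models multimodal retrieval as agentic reasoning grounded in visual inspection; the MLLM can selectively call external visual tools to acquire visual evidence, alternating between hypothesis generation and visual verification; training uses a curriculum that combines supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Result: Experiments on multiple multimodal retrieval benchmarks show that V-Retrver improves retrieval accuracy by 23.0% on average while also improving the reliability and generalization of perception-driven reasoning. Conclusion: By introducing active visual verification and evidence-driven agentic reasoning, V-Retrver overcomes the limitations of language-dominated retrieval methods in fine-grained visual understanding, markedly improving multimodal retrieval performance and trustworthiness. Abstract: Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (with 23.0% improvements on average), perception-driven reasoning reliability, and generalization.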
### [154] [InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions](https://arxiv.org/abs/2602.06035) *Sirui Xu,Samuel Schulter,Morteza Ziyadi,Xialin He,Xiaohan Fei,Yu-Xiong Wang,Liangyan Gui* Main category: cs.CV TL;DR: This paper proposes the InterPrior framework, which learns a unified generative controller through large-scale imitation pretraining and reinforcement learning fine-tuning, enabling humanoids to generalize and compose diverse whole-body loco-manipulation skills.
Details Motivation: Humans do not usually plan object interactions in terms of explicit whole-body movements; instead they rely on high-level intentions (such as affordances) together with underlying physical and motor priors to coordinate balance, contact, and manipulation naturally. Scaling such priors is needed for humanoids to achieve generalizable whole-body loco-manipulation across diverse scenarios. Method: Propose the InterPrior framework: first distill, via large-scale imitation learning, a goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent; then apply data augmentation with physical perturbations and reinforcement learning fine-tuning to consolidate the latent skills into a valid manifold. Result: InterPrior generalizes to unseen goals, initial states, and new object interaction tasks; it supports user-interactive control and shows potential for deployment on real robots. Conclusion: By combining imitation learning and reinforcement learning, InterPrior builds a scalable, generalizable, and physically consistent whole-body motion prior that markedly improves the adaptability and robustness of humanoids in complex human-object interaction tasks. Abstract: Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning fine-tuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.
### [155] [Thinking with Geometry: Active Geometry Integration for Spatial Reasoning](https://arxiv.org/abs/2602.06037) *Haoyuan Li,Qihang Cao,Tao Tang,Kun Xiang,Zihan Guo,Jianhua Han,Hang Xu,Xiaodan Liang* Main category: cs.CV TL;DR: This paper proposes the GeoThinker framework, which uses active perception rather than passive fusion so that multimodal large language models can selectively retrieve and integrate geometric information according to their reasoning needs, markedly improving spatial reasoning.
Details Motivation: Existing MLLMs introduce 3D geometric priors for spatial reasoning, but mostly fuse features in a passive, global, indiscriminate manner, which tends to cause semantic-geometry misalignment and redundant signals. Method: Propose the Spatial-Grounded Fusion mechanism, which at selected VLM layers uses frame-strict cross-attention so that semantic visual priors selectively query and fuse task-relevant geometric information, complemented by an Importance Gating mechanism that steers attention toward task-relevant spatial structures. Result: Reaches a state-of-the-art score of 72.6 on VSI-Bench and shows strong generalization and improved spatial perception on complex downstream tasks such as embodied referring and autonomous driving. Conclusion: The ability to actively integrate spatial structures is key to next-generation spatial intelligence. Abstract: Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
### [156] [SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs](https://arxiv.org/abs/2602.06040) *Jintao Tong,Shilin Yan,Hongwei Xue,Xiaojun Tang,Kunyu Shi,Guannan Zhang,Ruixuan Li,Yixiong Zou* Main category: cs.CV TL;DR: SwimBird is a multimodal large language model with switchable reasoning modes that dynamically selects among text-only, vision-only, and interleaved vision-text reasoning according to the input, preserving textual logic while markedly improving performance on vision-dense tasks.
Details Motivation: Existing MLLMs reason mainly with textual chain-of-thought (CoT), which struggles on vision-intensive tasks; methods that inject a fixed number of continuous hidden states as "visual thoughts" improve visual performance but degrade text-based logical reasoning; the underlying problem is a rigid reasoning pattern that cannot adaptively select the most suitable modality. Method: Propose SwimBird, which supports three input-conditioned reasoning modes (text-only, vision-only, interleaved vision-text), adopts a hybrid autoregressive formulation (unifying next-token prediction and next-embedding prediction), and constructs SwimBird-SFT-92K, a 92K-sample supervised fine-tuning dataset covering all three modes. Result: On multiple benchmarks covering textual reasoning and challenging visual understanding, SwimBird achieves state-of-the-art performance and robust gains over fixed-pattern methods. Conclusion: Dynamic, query-adaptive switching of reasoning modes is a key route to more general multimodal reasoning in MLLMs, and SwimBird validates this design for balancing textual logic and visual understanding. Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as "visual thoughts" into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
### [157] [Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning](https://arxiv.org/abs/2602.06041) *Xuejun Zhang,Aditi Tiwari,Zhenhailong Wang,Heng Ji* Main category: cs.CV TL;DR: This paper proposes the CAMCUE framework, which uses explicit camera pose as a geometric anchor for cross-view fusion of multi-view images and novel-view reasoning, markedly improving multi-image spatial reasoning and sharply reducing inference time.
Details Motivation: Current multimodal large language models still struggle with multi-image spatial reasoning (especially perspective taking), which requires building a coherent 3D scene understanding from multiple views and reasoning from a new viewpoint. Method: Propose CAMCUE, which injects per-view camera pose into visual tokens, grounds natural-language descriptions of the target viewpoint to a target camera pose, and synthesizes a pose-conditioned imagined target view to support question answering; also build the CAMCUE-DATA dataset with 27,668 training and 508 test instances. Result: Overall accuracy improves by 9.06%; target poses predicted from natural-language viewpoint descriptions reach over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold; inference time drops from 256.6 s to 1.45 s per example. Conclusion: Through explicit pose modeling, CAMCUE achieves efficient and accurate multi-view spatial reasoning, offering a practical option for interactive applications in real-world scenes. Abstract: Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
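Threshold-based pose accuracy of the kind reported above ("within 20°", "within 0.5") is commonly computed from a geodesic rotation angle and a Euclidean translation distance. The sketch below assumes those two definitions, which the abstract does not confirm, and assumes poses are given as 3x3 rotation matrices (nested lists) plus 3-vectors. The helper `rot_z` and the default thresholds are there only for illustration.

```python
import math


def rotation_error_deg(r_pred, r_true):
    """Geodesic angle between two rotation matrices: acos((trace(Rp^T Rt) - 1) / 2)."""
    # trace(Rp^T Rt) equals the sum of element-wise products of the two matrices.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding
    return math.degrees(math.acos(cos_angle))


def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and reference camera positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_pred, t_true)))


def pose_within_thresholds(r_pred, t_pred, r_true, t_true, max_deg=20.0, max_dist=0.5):
    """Return (rotation_ok, translation_ok) under the assumed error definitions."""
    return (
        rotation_error_deg(r_pred, r_true) <= max_deg,
        translation_error(t_pred, t_true) <= max_dist,
    )


def rot_z(deg):
    """Helper for examples: rotation by `deg` degrees about the z-axis."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

Accuracy at a threshold is then the fraction of test examples for which the corresponding flag is true. The abstract states the thresholds but not whether rotation and translation are scored jointly or separately, so the two flags are kept separate here.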