cs.CL

[1] What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning

Jeremias Ferrao,Ezgi Basar,Khondoker Ittehadul Islam,Mahrokh Hassani

Main category: cs.CL

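TL;DR: This study examines attribution patterns in Chain-of-Thought (CoT) reasoning in multilingual LLMs. It finds that attribution scores concentrate excessively on the final reasoning step and that CoT prompting helps most for high-resource Latin-script languages, while showing limitations in multilingual robustness and interpretive transparency.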

Details

Motivation: Although CoT prompting improves task performance, it remains unclear whether the generated reasoning chains are faithful and interpretable, and systematic evaluation is especially lacking in multilingual settings.

Method: Two attribution methods, ContextCite (step level) and Inseq (token level), are applied to the Qwen2.5 1.5B-Instruct model on the MGSM benchmark, together with controlled perturbation experiments using negation and distractor sentences.

Result: (1) Attribution scores overemphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting improves accuracy mainly for high-resource Latin-script languages; (3) introducing distractors or negation lowers both accuracy and attribution coherence.

Conclusion: CoT prompting shows attribution bias and robustness problems in multilingual settings. The faithfulness of its explanations and its cross-lingual effectiveness are limited, and further work is needed to achieve reliable, transparent reasoning.

Abstract: This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods--ContextCite for step-level attribution and Inseq for token-level attribution--to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.

[2] Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language

Seungbeen Lee,Jinhong Jeong,Donghyun Kim,Yejin Son,Youngjae Yu

Main category: cs.CL

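TL;DR: The paper proposes Motion2Mind, a framework for evaluating machines' Theory-of-Mind (ToM) ability to interpret nonverbal cues (NVCs), and builds a finely annotated video dataset covering 222 types of nonverbal cues and 397 mind states. Experiments show that current AI systems perform poorly at detecting and interpreting NVCs, with a clear performance gap and a tendency to over-interpret.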

Details

Motivation: Existing Theory of Mind (ToM) benchmarks focus mainly on false-belief tasks and reasoning with asymmetric information, overlooking mental states other than belief and the rich forms of nonverbal communication. A more comprehensive framework is needed to measure how well machines understand nonverbal cues.

Method: Using an expert-curated body-language reference as a knowledge base, the authors build the Motion2Mind video dataset with fine-grained nonverbal cue annotations and manually verified psychological interpretations. Models' ToM ability is evaluated through two tasks, Detection and Explanation.

Result: Current AI systems show a substantial performance gap in detecting nonverbal cues and, in the Explanation task, over-interpret more than human annotators do.

Conclusion: Motion2Mind exposes the limitations of current AI in understanding human nonverbal communication and underlines the need for research on multimodal, fine-grained social cognition to improve machine social intelligence.

Abstract: Our ability to interpret others' mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection but also patterns of over-interpretation in Explanation compared to human annotators.

[3] TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

Sarik Ghazarian,Abhinav Gullapalli,Swair Shah,Anurag Beniwal,Nanyun Peng,Narayanan Sadagopan,Zhou Yu

Main category: cs.CL

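TL;DR: The paper proposes TOD-ProcBench, a benchmark for evaluating how well large language models follow complex natural-language instructions in multi-turn task-oriented dialogues. It comprises three tasks that measure instruction understanding, violation identification, and conditional generation.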

Details

Motivation: Existing task-oriented dialogue benchmarks oversimplify the complex natural-language operating guidelines found in practice and therefore cannot effectively evaluate how well models follow complex instructions. A benchmark closer to real-world conditions is needed.

Method: Complex instruction documents with multi-level condition-action statements are built from the high-quality ABCD dataset, and three tasks are designed: relevant-statement retrieval with next-action prediction, identification of instruction-violating responses, and conditional generation of instruction-following responses. The effects of multilingual settings and instruction formats on performance are also studied.

Result: TOD-ProcBench provides fine-grained constraints and procedural instructions for evaluating how well different LLMs understand and follow complex instructions in multi-turn dialogues. The experiments cover multiple models under different settings.

Conclusion: The benchmark highlights the shortcomings of current LLMs in handling complex, structured natural-language instructions and offers an evaluation tool for improving model controllability and reliability in real task-oriented dialogue.

Abstract: In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs' instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs' abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs' complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses. Task 3 investigates LLMs' abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.

[4] Liars' Bench: Evaluating Lie Detectors for Language Models

Kieron Kretschmar,Walter Laurito,Sharan Maiya,Samuel Marks

Main category: cs.CL

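TL;DR: The paper introduces LIARS' BENCH, a testbed of 72,863 lies and honest responses for evaluating lie-detection techniques for large language models, and reveals the limitations of existing methods across diverse lying scenarios.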

Details

Motivation: Existing lie-detection techniques are usually validated in narrow settings that do not cover the diverse lies large language models can produce, so a more comprehensive testbed is needed.

Method: The authors build LIARS' BENCH, a testbed of lies and honest responses generated by four open-weight models across seven datasets. It covers different types of lies and is analyzed along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Three black- and white-box lie-detection techniques are then evaluated on it.

Result: Existing lie-detection techniques fail systematically on certain types of lies, especially when it is not possible to tell from the transcript alone whether the model lied.

Conclusion: LIARS' BENCH reveals the limitations of current lie-detection techniques and provides a practical testbed for future research.

Abstract: Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generate statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it's not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.

[5] Learning Tractable Distributions Of Language Model Continuations

Gwen Yidou-Weng,Ian Li,Anji Liu,Oliver Broadrick,Guy Van den Broeck,Benjie Wang

Main category: cs.CL

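TL;DR: The paper proposes Learning to Look Ahead (LTLA), a hybrid approach that pairs a language model with a fixed tractable surrogate model to handle sequence-level constraints in controlled generation that depend on future tokens. It improves conditional likelihood and constraint satisfaction while keeping inference overhead low.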

Details

Motivation: Existing surrogate-based methods (such as HMMs) are only weakly context aware when handling sequence-level constraints, which reduces query quality, and they struggle to incorporate neural context effectively.

Method: LTLA uses the base language model for prefix encoding and pairs it with a fixed tractable surrogate model that computes exact continuation probabilities. A single batched HMM update handles all next-token candidates at once, and the LM's hidden states are used only to condition the surrogate's latent state prior. The surrogate decoder stays fixed so that computation can be reused across prefixes.

Result: LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models (where a standalone HMM cannot encode visual context), and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.

Conclusion: By combining neural context with a fixed surrogate model, LTLA improves the quality and applicability of controlled language generation while remaining efficient, addressing the weak context awareness and repeated computation of earlier methods.

Abstract: Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
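The reuse argument in the LTLA abstract can be illustrated with a toy HMM. The sketch below is not the authors' implementation: the constraint (a `target` token must appear within the next `horizon` tokens), the matrices `A` and `B`, and all function names are assumptions made for illustration. It shows that the look-ahead table depends only on the fixed HMM, so a new prefix changes only the `belief` vector over latent states, and every next-token candidate is scored in one pass.

```python
def miss_table(A, B, target, horizon):
    """miss[k][z]: probability that k tokens emitted starting in state z all avoid `target`."""
    n = len(A)
    miss = [[1.0] * n]
    for _ in range(horizon):
        prev = miss[-1]
        miss.append([
            (1.0 - B[z][target]) * sum(A[z][z2] * prev[z2] for z2 in range(n))
            for z in range(n)
        ])
    return miss


def lookahead_scores(belief, A, B, target, horizon, miss):
    """For each candidate x: P(next token = x and `target` appears within `horizon` tokens)."""
    n, vocab = len(A), len(B[0])
    # Prefix-independent part: chance the constraint is met by tokens after the next one.
    later_ok = [
        1.0 - sum(A[z][z2] * miss[horizon - 1][z2] for z2 in range(n))
        for z in range(n)
    ]
    scores = []
    for x in range(vocab):
        if x == target:
            # Emitting the target now satisfies the constraint immediately.
            scores.append(sum(belief[z] * B[z][x] for z in range(n)))
        else:
            scores.append(sum(belief[z] * B[z][x] * later_ok[z] for z in range(n)))
    return scores


# Hypothetical 2-state HMM over a 3-token vocabulary.
A = [[0.9, 0.1], [0.2, 0.8]]            # state transition probabilities
B = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]  # per-state emission probabilities
target, horizon = 2, 3
miss = miss_table(A, B, target, horizon)  # computed once, reused for every prefix

belief = [0.5, 0.5]  # in LTLA this prior would be conditioned on the LM's hidden state
scores = lookahead_scores(belief, A, B, target, horizon, miss)
```

Summing `scores` over the vocabulary gives the total probability that the constraint is satisfied from this prefix, which is a convenient sanity check on the table.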

[6] Early science acceleration experiments with GPT-5

Sébastien Bubeck,Christian Coester,Ronen Eldan,Timothy Gowers,Yin Tat Lee,Alexandru Lupsasca,Mehtaab Sawhney,Robert Scherrer,Mark Sellke,Brian K. Spears,Derya Unutmaz,Kevin Weil,Steven Yin,Nikita Zhivotovskiy

Main category: cs.CL

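TL;DR: The paper presents case studies in which GPT-5 advanced research in several scientific fields (mathematics, physics, astronomy, computer science, and others), emphasizing how AI accelerates research while noting its limitations and the importance of collaboration with human experts. It includes four verified new mathematical results, indicating the potential of frontier AI on open problems.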

Details

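Motivation: Many scientists are not yet fully aware of what frontier AI can do. The paper uses concrete cases to show the practical value and collaborative potential of AI models such as GPT-5 in scientific research.

Method: The authors used GPT-5 in real research projects across disciplines, documented how the model produced new ideas or solutions, and analyzed its effectiveness, its limitations, and the patterns of human-AI collaboration.

Result: GPT-5 proposed concrete, workable research steps in several fields and, in mathematics, helped produce four new results verified by the human authors. The paper also identifies where the model fell short and where human input was still key.

Conclusion: Although the contributions are currently modest in scope, the implications for research efficiency are far-reaching. As AI capabilities progress rapidly, human-AI collaboration is likely to become an important paradigm for scientific research.

Abstract: AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.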

[7] ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models

Qing Zhang,Bing Xu,Xudong Zhang,Yifan Shi,Yang Li,Chen Zhang,Yik Chung Wu,Ngai Wong,Yijie Chen,Hong Dai,Xiansen Chen,Mian Zhang

Main category: cs.CL

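TL;DR: The paper proposes ELPO, an ensemble-learning-based prompt optimization framework that combines multiple search methods with shared generation strategies, improving the accuracy and robustness of prompt optimization and outperforming state-of-the-art methods on multiple tasks.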

Details

Motivation: Existing automatic prompt optimization methods mostly rely on a single model or algorithm, which limits their performance on complex tasks. A stronger and more flexible optimization framework is needed.

Method: Inspired by ensemble learning, ELPO introduces a voting mechanism and combines shared prompt-generation strategies with several search methods to optimize prompts jointly. It also designs more efficient algorithms for generation and search.

Result: Experiments show that ELPO outperforms current state-of-the-art prompt optimization methods on multiple tasks, for example improving the F1 score by 7.6 on the ArSarcasm dataset.

Conclusion: By combining multiple optimization strategies, ELPO improves prompt optimization performance and offers a stronger, more robust solution for automatic prompt engineering.

Abstract: The remarkable performance of Large Language Models (LLMs) relies heavily on crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which has developed rapidly in recent years. Existing APO methods, such as those based on evolutionary algorithms or trial-and-error approaches, achieve efficient and accurate prompt optimization to some extent. However, these studies focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO employs a voting mechanism and introduces shared generation strategies along with different search methods for finding superior prompts. Moreover, ELPO presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on the ArSarcasm dataset.

[8] TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating

Dabiao Ma,Ziming Dai,Zhimin Xin,Shu Wang,Ye Wang,Haojun Fei

Main category: cs.CL

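TL;DR: The paper proposes Token-Selective PEFT (TS-PEFT), a new parameter-efficient fine-tuning paradigm that applies fine-tuning modifications only to a subset of position indices in order to improve the downstream performance of large models.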

Details

Motivation: Traditional PEFT methods modify all position indices, which may waste resources or even hurt performance, so a finer-grained fine-tuning strategy is worth exploring.

Method: A selection function S decides at which position indices the PEFT modification is applied, giving token-level selective fine-tuning of large models.

Result: Experiments show that applying PEFT indiscriminately to all positions is superfluous and may even be counterproductive, whereas TS-PEFT can improve downstream task performance.

Conclusion: TS-PEFT offers a more efficient and more targeted fine-tuning paradigm and opens a new direction for parameter-efficient fine-tuning of large models.

Abstract: In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.
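To make the idea of a selection function S concrete, here is a minimal sketch of position-wise gating. The abstract does not specify the form of S, so the rule used here (keep the PEFT update where its norm exceeds a fraction `tau` of the base activation's norm) is purely an assumption, as are the function names. The paper's title mentions a learnable threshold; in this sketch `tau` is a fixed constant.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def token_selective_update(base_states, peft_deltas, tau):
    """Add the PEFT delta only at positions selected by a threshold gate.

    Hypothetical gate S: keep position i when ||delta_i|| > tau * ||base_i||.
    Returns the gated outputs and the boolean selection mask.
    """
    outputs, mask = [], []
    for h, d in zip(base_states, peft_deltas):
        keep = l2_norm(d) > tau * l2_norm(h)
        mask.append(keep)
        # Unselected positions pass through the frozen model's output unchanged.
        outputs.append([hi + di for hi, di in zip(h, d)] if keep else list(h))
    return outputs, mask


# Three token positions with 2-dimensional hidden states (toy values).
base_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
peft_deltas = [[0.5, 0.0], [0.0, 0.1], [0.3, 0.4]]
outputs, mask = token_selective_update(base_states, peft_deltas, tau=0.2)
```

With these toy values only the first position receives the update; the other two keep the frozen model's output because their deltas are small relative to the base activations.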

[9] SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning

Sebastian Haan

Main category: cs.CL

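TL;DR: SemanticCite is an AI-based citation verification system that improves citation accuracy through full-text analysis and fine-grained classification (Supported, Partially Supported, Unsupported, Uncertain), and provides an open-source framework and a multidisciplinary dataset to strengthen research integrity.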

Details

Motivation: Academic literature faces growing problems with semantic citation errors, AI-generated hallucinated references, and traditional citation formats that lack contextual support. The goal is to improve the accuracy and trustworthiness of scientific communication.

Method: The system combines multiple retrieval methods with lightweight fine-tuned language models, classifies citations into four classes, and supplies context through relevant text snippets and reasoning. The authors also build an annotated dataset of more than 1,000 citations across disciplines.

Result: Fine-tuned lightweight language models perform comparably to large commercial systems at significantly lower computational cost. The system identifies different types of citation errors and provides interpretable supporting evidence.

Conclusion: SemanticCite enables scalable citation verification, helps streamline peer review and control the quality of AI-generated content, and provides an open-source foundation for maintaining citation accuracy at scale.

Abstract: Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software. SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.

[10] SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs

Xingtao Zhao,Hao Peng,Dingli Su,Xianghua Zeng,Chunyang Liu,Jinzhi Liao,Philip S. Yu

Main category: cs.CL

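TL;DR: The paper proposes SeSE, a new uncertainty quantification framework based on semantic structural information, for detecting hallucinations in large language models. It builds a sparsified directed semantic graph and computes the structural entropy of the optimal semantic encoding tree, significantly outperforming existing methods.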

Details

Motivation: Existing uncertainty quantification methods rely mainly on probability distributions or distance measures and ignore structural information in the semantic space, which makes hallucination detection less precise.

Method: The paper proposes Semantic Structural Entropy (SeSE). It first builds an adaptively sparsified directed semantic graph to capture directional semantic dependencies, then defines the structural entropy of the optimal semantic encoding tree, obtained through hierarchical abstraction, as the uncertainty measure. The approach is extended to estimate the uncertainty of individual claims in fine-grained long-form generation.

Result: Experiments on 29 model-dataset combinations show that SeSE significantly outperforms advanced baselines, including supervised methods and KLE, in hallucination detection and uncertainty quantification.

Conclusion: SeSE gives large language models a more precise uncertainty estimate from a structural information perspective, improving their ability to abstain or avoid hallucination in safety-critical scenarios.

Abstract: Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation -- where existing methods often rely on heuristic sample-and-count techniques -- we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.
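For readers unfamiliar with structural entropy, the sketch below computes its simplest form: the one-dimensional structural entropy of an undirected weighted graph, which corresponds to a flat encoding tree. This is not SeSE itself, which according to the abstract uses a directed, sparsified graph and an optimal hierarchical encoding tree. The graph construction (nodes as sampled responses, weights as relatedness scores) and the function name are assumptions for illustration only.

```python
import math


def structural_entropy_1d(edges):
    """One-dimensional structural entropy of an undirected weighted graph.

    edges: iterable of (u, v, weight). Each node contributes
    -(d/vol) * log2(d/vol), where d is its weighted degree and vol is
    the sum of all weighted degrees.
    """
    degree = {}
    for u, v, w in edges:
        degree[u] = degree.get(u, 0.0) + w
        degree[v] = degree.get(v, 0.0) + w
    volume = sum(degree.values())
    return -sum((d / volume) * math.log2(d / volume) for d in degree.values() if d > 0)


# Hypothetical graphs over four sampled responses r0..r3.
cycle = [("r0", "r1", 1.0), ("r1", "r2", 1.0), ("r2", "r3", 1.0), ("r3", "r0", 1.0)]
star = [("r0", "r1", 1.0), ("r0", "r2", 1.0), ("r0", "r3", 1.0)]

h_cycle = structural_entropy_1d(cycle)
h_star = structural_entropy_1d(star)
```

In the cycle every node has the same degree, so the entropy reaches its maximum for four nodes; the star concentrates degree on one hub and scores lower.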

[11] SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

Wei Xia,Zhi-Hong Deng

Main category: cs.CL

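TL;DR: The paper proposes SDA, a training-free, model-agnostic inference-time alignment framework that dynamically adjusts the output probability distribution to improve large language models on helpfulness, harmlessness, and honesty.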

Details

Motivation: Aligning the behavior of large language models with human intent effectively and efficiently at inference time, without costly retraining or extensive supervision, is a key challenge.

Method: The paper proposes SDA (Steering-Driven Distribution Alignment), which dynamically redistributes model output probabilities at inference time based on user-defined alignment instructions. The framework is training-free and model-agnostic, supports personalized preference control, and can be combined with training-based methods.

Result: Evaluated on 8 open-source LLMs of different scales and origins, SDA improves performance on all three dimensions, with average gains of 64.4% in helpfulness, 11.5% in harmlessness, and 30% in honesty.

Conclusion: SDA is a lightweight, efficient, and general training-free alignment method that improves multi-dimensional alignment across diverse models and scenarios, with good practicality and extensibility.

Abstract: With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions: helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.

[12] Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement

Jiashu Yao,Heyan Huang,Shuang Zeng,Chuwei Luo,WangJie You,Jie Tang,Qingsong Liu,Yuhang Guo,Yangyang Kang

Main category: cs.CL

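TL;DR: The paper proposes a self-rewriting framework that uses reinforcement learning to improve the internal reasoning quality of large reasoning models, addressing the inefficient reasoning that results from rewarding final correctness alone.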

Details

Motivation: A single reward based only on outcome correctness cannot effectively supervise a model's internal reasoning process, which leads to flaws such as over-thinking, under-thinking, redundant thinking, and disordered thinking.

Method: In the self-rewriting framework the model rewrites its own reasoning text and learns from the rewritten version to improve its reasoning process. A selective rewriting strategy rewrites only "simple" samples, those the model answers correctly in a consistent way, and rewriting and vanilla generation are combined in the same batch to keep the algorithm scalable.

Result: The method achieves a better accuracy-length tradeoff, with accuracy improved by 0.6 while reasoning length is reduced by 46%. Under the LLM-as-a-judge metric, the internal reasoning quality score rises by 7.2, substantially mitigating the reasoning flaws.

Conclusion: The self-rewriting framework improves the internal reasoning quality of large reasoning models without sacrificing scalability and outperforms existing strong baselines.

Abstract: Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over the internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce a self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only "simple" samples, defined by the model's consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
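The selective rewriting rule can be sketched in a few lines. The code below assumes binary correctness rewards and a standard group-standardized advantage; the data layout and function names are hypothetical, and the exact criterion for "consistent correctness" in the paper may differ. One plausible reading of why this selection preserves GRPO's reward signal, offered here as an interpretation rather than the authors' stated argument, is that a group whose responses are all correct has zero within-group advantage, so rewriting those samples displaces no nonzero signal.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_simple_prompts(rollouts):
    """Return ids of prompts whose sampled responses were all correct (reward 1.0)."""
    return [
        prompt_id
        for prompt_id, rewards in rollouts.items()
        if rewards and all(r == 1.0 for r in rewards)
    ]


# Hypothetical correctness rewards for four sampled responses per prompt.
rollouts = {
    "q1": [1.0, 1.0, 1.0, 1.0],  # consistently correct: candidate for rewriting
    "q2": [1.0, 0.0, 1.0, 0.0],  # mixed: carries a useful learning signal
    "q3": [0.0, 0.0, 0.0, 0.0],  # consistently wrong: left untouched
}
simple = select_simple_prompts(rollouts)
adv_q1 = group_advantages(rollouts["q1"])
adv_q2 = group_advantages(rollouts["q2"])
```

Only `q1` is selected, and its advantages are all zero, while the mixed group `q2` keeps nonzero advantages and is trained on as usual.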

[13] NLP Datasets for Idiom and Figurative Language Tasks

Blake Matheny,Phuong Minh Nguyen,Minh Le Nguyen,Stephanie Reynolds

Main category: cs.CL

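TL;DR: The paper presents large-scale datasets for recognizing idioms and figurative language, aimed at improving the ability of pre-trained language models to handle non-literal expressions.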

Details

Motivation: Informal language is widespread on social media, yet large language models still struggle to understand idioms and figurative expressions. Better and larger datasets are needed to narrow this gap.

Method: Several recent idiom datasets are merged into a combined idiom list, which is used to retrieve context sequences from a large corpus. This yields one large-scale dataset of potential idiomatic expressions and two human-annotated datasets of definite idiomatic expressions. The data are post-processed for compatibility with multiple models and used to train and evaluate slot labeling and sequence tagging.

Result: The resulting datasets can be used to evaluate pre-trained language models on idiom recognition, and they were shown to be usable for model training.

Conclusion: The datasets are a resource for developing stronger models of idiomatic and figurative language and support progress in NLP for non-literal language.

Abstract: Idiomatic and figurative language form a large portion of colloquial speech and writing. With social media, this informal language has become more easily observable to people and trainers of large language models (LLMs) alike. While the advantage of large corpora seems like the solution to all machine learning and Natural Language Processing (NLP) problems, idioms and figurative language continue to elude LLMs. Finetuning approaches are proving to be optimal, but better and larger datasets can help narrow this gap even further. The datasets presented in this paper provide one answer, while offering a diverse set of categories on which to build new models and develop new approaches. A selection of recent idiom and figurative language datasets was used to acquire a combined idiom list, which was used to retrieve context sequences from a large corpus. One large-scale dataset of potential idiomatic and figurative language expressions and two additional human-annotated datasets of definite idiomatic and figurative language expressions were created to evaluate the baseline ability of pre-trained language models in handling figurative meaning through idiom recognition (detection) tasks. The resulting datasets were post-processed for model agnostic training compatibility, utilized in training, and evaluated on slot labeling and sequence tagging.

[14] Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies

Jonathan Kamp,Lisa Beinborn,Antske Fokkens

Main category: cs.CL

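TL;DR: The paper studies the role of natural language explanations ("rationales") in model reasoning. It finds that the widely used sufficiency metric measures how informative a rationale is but does not reflect its effect on classification performance, and that adding rationales has inconsistent effects on cross-domain classification, which suggests a need for more systematic metrics.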

Details

Motivation: The paper asks whether human-provided rationales actually help models predict for the right reasons, and exposes the limits of the sufficiency metric for assessing how useful a rationale is.

Method: Sufficiency is related to two modelling paradigms: identifying which tokens belong to the rationale through token classification, and improving model performance by incorporating rationale information in the input through attention regularisation.

Result: Highly informative rationales are not likely to help classify an instance correctly. Sufficiency instead captures the classification impact of the non-rationalised context. Incorporating rationales can boost cross-domain classification, but results are inconsistent across tasks and model types. Sufficiency and token classification ability appear to be unrelated.

Conclusion: The role of rationales is complex and current metrics do not capture their effects fully. More systematic ways of capturing this kind of explanatory information merit further investigation.

Abstract: Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability to improve model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.
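A small worked example may help with the sufficiency metric. One common definition, from the ERASER benchmark, is the change in the predicted probability of a class when the model sees only the rationale tokens instead of the full input. The paper may use a different or normalized variant, so treat the formula below as an assumption. The classifier `toy_positive_prob` is a hypothetical stand-in, not a model from the paper.

```python
import math


def toy_positive_prob(tokens):
    """Hypothetical sentiment model: sigmoid of summed per-word weights."""
    weights = {"great": 2.0, "boring": -2.0, "not": -1.0}
    score = sum(weights.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def sufficiency(predict_prob, tokens, rationale_mask):
    """p(class | full input) - p(class | rationale tokens only).

    Values near zero mean the rationale alone reproduces the prediction.
    """
    rationale_only = [tok for tok, keep in zip(tokens, rationale_mask) if keep]
    return predict_prob(tokens) - predict_prob(rationale_only)


# Case 1: the context outside the rationale is neutral for this toy model.
s_neutral = sufficiency(
    toy_positive_prob,
    ["the", "film", "was", "great"],
    [False, False, False, True],
)

# Case 2: the word outside the rationale ("not") shifts the prediction.
s_context = sufficiency(toy_positive_prob, ["not", "great"], [False, True])
```

In the second case the score is nonzero only because of the token outside the rationale, which illustrates the abstract's point that sufficiency reflects the impact of the non-rationalised context.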

[15] AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

Ren Ma,Jiantao Qiu,Chao Xu,Pei Chu,Kaiwen Liu,Pengli Ren,Yuan Qu,Jiahui Peng,Linfeng Hou,Mengjie Liu,Lindong Lu,Wenchang Ning,Jia Yu,Rui Min,Jin Shi,Haojiong Chen,Peng Zhang,Wenjian Zhang,Qian Jiang,Zengjie Hu,Guoqiang Yang,Zhenxiang Li,Fukai Shang,Zhongying Tu,Wentao Zhang,Dahua Lin,Conghui He

Main category: cs.CL

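TL;DR: The paper proposes MinerU-HTML, a new HTML-to-text extraction method that reformulates content extraction as a sequence labeling problem and uses a language model to better preserve structured web elements such as formulas, code, and tables. The authors use it to build AICC, a high-quality multilingual corpus. Under identical filtering, language models pretrained on AICC outperform those trained on a corpus built with a traditional extractor across multiple benchmarks, showing that extraction quality matters for model performance.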

Details

Motivation: Existing web text extractors such as Trafilatura rely on heuristic rules and struggle to preserve structured content (formulas, code, tables), causing information loss that limits data quality for large language models. The authors argue that improving extraction quality can matter as much for downstream performance as filtering and deduplication, yet this step is often overlooked.

Method: The paper proposes MinerU-HTML, an extraction pipeline based on a 0.6B-parameter language model. It treats HTML extraction as a sequence labeling task that uses semantic understanding to identify semantic elements, and then applies a dedicated formatting pipeline that converts the identified elements to Markdown while explicitly preserving structure. The authors also build MainWebBench, a benchmark of 7,887 annotated web pages, for evaluation.

Result: On MainWebBench, MinerU-HTML achieves 81.8% ROUGE-N F1, well above Trafilatura's 63.6%, and preserves 90.9% of code blocks and 94.0% of formulas. Under identical filtering, models pretrained on AICC, the 7.3-trillion-token corpus built with MinerU-HTML, reach 50.8% average accuracy across 13 benchmarks, 1.08 percentage points higher than the Trafilatura-extracted TfCC, and also surpass RefinedWeb and FineWeb.

Conclusion: HTML content extraction is a key and often underestimated step in building high-quality web corpora. The model-based, semantics-driven approach of MinerU-HTML markedly improves preservation of structured content and shows that better extraction directly improves language model performance. The authors release MainWebBench, MinerU-HTML, and AICC to encourage further research on data processing.

Abstract: While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
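The headline extraction metric, ROUGE-N F1, compares the n-grams of an extracted text with those of a reference. The sketch below is a minimal version assuming whitespace tokenization and `n=2`; the abstract does not state the value of n or the tokenizer used for MainWebBench, so both are assumptions, and the function names are illustrative.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=2):
    """Harmonic mean of n-gram precision and recall between two texts."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


extracted = "the cat sat on the mat"
reference = "the cat sat on a mat"
score = rouge_n_f1(extracted, reference, n=2)
perfect = rouge_n_f1(reference, reference, n=2)
```

Here three of the five bigrams match on each side, so precision and recall are both 0.6 and the F1 score is 0.6.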

[16] Classification of worldwide news articles by perceived quality, 2018-2024

Connor McElroy,Thiago E. A. de Oliveira,Chris Brogly

Main category: cs.CL

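TL;DR: The study evaluates how well machine learning and deep learning models distinguish news articles by perceived quality, using a large dataset of over 1.4 million articles. ModernBERT-large performs best.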

Details

Motivation: To explore whether machine learning and deep learning models can effectively distinguish news articles of higher and lower perceived quality.

Method: Three traditional machine learning classifiers and three deep learning models are evaluated on a dataset of 1,412,272 English news articles. Each article has 194 linguistic features and is labelled high or low quality according to expert ratings of 579 news websites.

Result: Traditional models such as Random Forest perform well (0.7355 accuracy, 0.8131 ROC AUC). ModernBERT-large performs best (0.8744 accuracy, 0.9593 ROC AUC, 0.8739 F1), ahead of the other BERT variants.

Conclusion: Both traditional machine learning and deep learning models can effectively distinguish the perceived quality of worldwide news articles, with ModernBERT-large performing best.

Abstract: This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. Three machine learning classifiers and three deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.
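The website-level labelling step can be sketched as a median split followed by label inheritance. The site names, rating scale, and tie-handling rule (ratings equal to the median fall in the "low" class) are assumptions for illustration; the abstract states only that ratings were split at the median and that articles carry website-level labels.

```python
from statistics import median


def median_split(site_ratings):
    """Label each site 'high' or 'low' relative to the median rating."""
    cutoff = median(site_ratings.values())
    return {
        site: ("high" if rating > cutoff else "low")
        for site, rating in site_ratings.items()
    }


def label_articles(articles, site_labels):
    """Give every article the label of the site it was published on."""
    return [(text, site_labels[site]) for site, text in articles]


# Hypothetical expert ratings for four source websites.
site_ratings = {"a.example": 0.2, "b.example": 0.4, "c.example": 0.6, "d.example": 0.9}
site_labels = median_split(site_ratings)

articles = [("a.example", "story one"), ("d.example", "story two")]
labelled = label_articles(articles, site_labels)
```

Because labels come from the source site rather than from the article itself, every article from a given site shares one label, which is what "website-level labelled" implies.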

[17] ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports

Sherine George,Nithish Saji

Main category: cs.CL

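TL;DR: ESGBench is a benchmark dataset and evaluation framework for assessing explainable ESG question answering systems over corporate sustainability reports. It comprises domain-grounded questions across multiple ESG themes with human-curated answers and supporting evidence, and aims to advance research on transparent and accountable ESG AI systems.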

Details

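Motivation: Evaluating the explainability and reliability of AI systems in the ESG (environmental, social, and governance) domain requires a high-quality benchmark built specifically on corporate sustainability reports. Method: The authors construct ESGBench, a benchmark dataset of domain-grounded questions across multiple themes with human-annotated answers and supporting evidence, and analyze the performance of current state-of-the-art large language models on it. Result: The evaluation reveals key challenges for existing LLMs in factual consistency, traceability, and domain alignment. Conclusion: ESGBench provides an effective evaluation platform for explainable ESG question answering and helps advance research on transparent and accountable ESG AI. Abstract: We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.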

[18] Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

Andrew Gomes

Main category: cs.CL

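TL;DR: Studies how transformer-based language models process idiomatic expressions, identifying attention heads termed "Idiom Heads" and enhanced attention between idiom tokens caused by earlier processing ("augmented reception"), and interprets these as mechanisms by which transformers balance computational efficiency and robustness.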

Details

Motivation: To understand how transformer models handle non-compositional language such as idioms, and to explore the internal circuits involved. Method: Circuits are discovered with a modified path patching algorithm, followed by analysis of attention-head activation patterns across idioms and of enhanced attention between idiom tokens. Result: The study finds "Idiom Heads" that activate frequently across different idioms and an "augmented reception" phenomenon, showing that idiom processing exhibits distinct computational patterns. Conclusion: Transformers handle non-compositional language through specific attention mechanisms that help balance computational efficiency and robustness, offering leads for understanding the processing of more complex grammatical constructions. Abstract: We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate "Idiom Heads," attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term "augmented reception." We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.

[19] Arctic-Extract Technical Report

Mateusz Chiliński,Julita Ołtusek,Wojciech Jaśkowski

Main category: cs.CL

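TL;DR: Arctic-Extract is a state-of-the-art model for extracting structured data from scanned or digital-born business documents; it can be deployed on resource-constrained hardware and supports long-document processing.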

Details

Motivation: To enable efficient, accurate structured data extraction on resource-constrained devices and meet the needs of real-world business document processing. Method: A lightweight model design and optimized training protocols are used to achieve high-performance document understanding and structured information extraction. Result: The model weighs only 6.6 GiB, can process up to 125 A4 pages on an A10 GPU, and performs strongly across the evaluation tasks. Conclusion: Arctic-Extract combines high performance with low resource consumption and is suited to structured data extraction from long documents in practical settings. Abstract: Arctic-Extract is a state-of-the-art model designed for extracting structural data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighing only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making it suitable for long document processing. This paper highlights Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.

[20] TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

Özay Ezerceli,Mahmoud El Hussieni,Selva Taş,Reyhan Bayraktar,Fatma Betül Terzioğlu,Yusuf Çelebi,Yağız Asker

Main category: cs.CL

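TL;DR: This paper introduces TurkColBERT, the first comprehensive benchmark for Turkish information retrieval, systematically comparing dense encoders with late-interaction models. A two-stage fine-tuning pipeline adapts English and multilingual models to Turkish and converts them into efficient late-interaction architectures, yielding notable gains across several domains. Compact models remain highly efficient while outperforming much larger dense models, and combined with MUVERA indexing they reach sub-millisecond query times. All models, configs, and code are released.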

Details

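Motivation: Neural information retrieval research concentrates on high-resource languages and pays little attention to morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders dominate Turkish retrieval, while late-interaction models, which offer fine-grained matching, have not been systematically evaluated; a dedicated benchmark is needed to explore architectures better suited to Turkish. Method: A two-stage adaptation pipeline first fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style late-interaction retrievers using PyLate trained on MS MARCO-TR. Ten models are evaluated on five Turkish BEIR datasets, and indexing algorithms (MUVERA and PLAID) are compared for efficiency and effectiveness. Result: colbert-hash-nano-tr, with only 1.0M parameters, is 600× smaller than the 600M-parameter turkish-e5-large while retaining over 71% of its average mAP; late-interaction models 3-5× smaller than dense encoders significantly outperform them, with ColmmBERT-base-TR gaining up to +13.8% mAP on domain-specific tasks; MUVERA+Rerank is 3.33× faster than PLAID with a +1.7% relative mAP gain, and ColmmBERT-base-TR reaches 0.54 ms query times under MUVERA. Conclusion: Late-interaction models combine efficiency with strong performance for Turkish information retrieval and clearly outperform conventional dense models, making them well suited to resource-constrained settings; paired with efficient indexing such as MUVERA they enable low-latency retrieval. The authors release all resources, and note as limitations the reliance on moderately sized and translated datasets, so larger-scale validation under real-world conditions is still needed. Abstract: Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600× smaller than the 600M turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-interaction models that are 3-5× smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33× faster than PLAID and offers +1.7% relative mAP gain. 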
This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets (≤50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
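The "late-interaction" scoring that the abstract contrasts with dense bi-encoders can be illustrated with the standard ColBERT-style MaxSim rule: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. The sketch below is a toy, pure-Python version under stated assumptions: the 2-dimensional vectors are made up, the helper names `dot`, `maxsim_score`, and `rank_documents` are hypothetical, and it is not the PyLate API or the authors' implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: sum over query tokens of the
    maximum similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs, docs):
    """Return document names sorted by descending MaxSim score."""
    return sorted(docs, key=lambda name: maxsim_score(query_vecs, docs[name]), reverse=True)


# Toy token embeddings (illustrative values only).
query = [[1.0, 0.0], [0.0, 1.0]]
documents = {
    "A": [[1.0, 0.0], [0.0, 1.0]],  # covers both query tokens
    "B": [[1.0, 0.0]],              # covers only the first query token
}
ranking = rank_documents(query, documents)
```

A dense bi-encoder would instead compress each text into a single vector and take one dot product; keeping one vector per token is what allows the fine-grained matching mentioned in the abstract, at the cost of a larger index.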

[21] Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

Éloïse Benito-Rodriguez,Einar Urdshals,Jasmina Nasufi,Nicky Pochinkov

Main category: cs.CL

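TL;DR: This paper takes a first step towards a predictive framework in which the genre of an input text is predicted from a large language model's activations. Using Mistral-7B and two datasets, it reports F1-scores of up to 98% and 71%, showing that shallow learning models can infer text genre from an LLM.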

Details

Motivation: Understanding large language models (LLMs) is essential for their safe and beneficial deployment, but LLM internals are hard to interpret and not all outputs can be evaluated by humans, so a predictive way of analyzing their behavior is needed. Method: Activations are extracted from Mistral-7B on two datasets, and scikit-learn classifiers are trained to predict the genre of the input text. Result: Genre prediction reaches F1-scores of up to 98% and 71% on the two datasets, and results consistently outperform the control task. Conclusion: The study demonstrates that text genre can be inferred from LLM activations with shallow learning models, opening a new interpretability route for understanding LLMs. Abstract: Understanding Large Language Models (LLMs) is key to ensuring their safe and beneficial deployment. This task is complicated by the difficulty of interpreting LLM structures and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.

[22] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

Zachary Ellis,Jared Joselowitz,Yash Deo,Yajie He,Anna Kalygina,Aisling Higham,Mana Rahimzadeh,Yan Jia,Ibrahim Habli,Ernest Lim

Main category: cs.CL

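TL;DR: This paper challenges the reliance on Word Error Rate (WER) for evaluating clinical speech recognition and proposes an automated LLM-based evaluation framework that more accurately measures the clinical risk posed by transcription errors.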

Details

Motivation: Existing ASR metrics such as WER do not reflect the actual impact of transcription errors in clinical dialogue, so a new evaluation method that measures clinical safety is needed. Method: Expert clinicians label the clinical impact of ASR errors to build a gold-standard dataset, and an LLM-as-a-Judge is optimized with GEPA to improve its agreement with those expert judgments. Result: Existing metrics correlate poorly with clinical impact; the optimized Gemini-2.5-Pro judge reaches 90% accuracy and a Cohen's κ of 0.816, comparable to human experts. Conclusion: The proposed LLM-based judging framework offers a scalable, validated way to automate safety assessment of clinical ASR systems. Abstract: As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's κ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
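For readers unfamiliar with the agreement statistic quoted above, Cohen's κ compares the observed agreement between two raters with the agreement expected by chance from their label frequencies: κ = (p_o − p_e) / (1 − p_e). The following is a minimal sketch of the unweighted form; the label sequences are invented for illustration (they are not the paper's data), the function name `cohens_kappa` is hypothetical, and whether the authors used a weighted variant is not stated in the abstract.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented example using the three risk labels named in the abstract.
clinician = ["No", "No", "Minimal", "Significant"]
judge = ["No", "Minimal", "Minimal", "Significant"]
kappa = cohens_kappa(clinician, judge)
```

Unlike raw accuracy, κ discounts agreement that would arise by chance when one label dominates, which is why the abstract reports both figures.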

[23] Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation

Kexin Zhao,Ken Forbus

Main category: cs.CL

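TL;DR: Proposes an LLM-based word sense disambiguation method that needs no hand-annotated training data: candidate meanings are converted into natural language queries so that a language model can perform the disambiguation.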

Details

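Motivation: Existing word sense disambiguation methods depend on hand-annotated data and coarse-grained semantic representations, which makes them hard to extend to fine-grained, richer knowledge representations (such as those built on OpenCyc) and limits automation. Method: The multiple candidate meanings produced by a symbolic natural language understanding system are converted into distinguishable natural language alternatives; an LLM is then used as an oracle to select the most appropriate interpretation in context, and the result is propagated back to the symbolic system. Result: The method is evaluated against human-annotated gold answers, supporting its effectiveness at disambiguating word senses without hand-annotated training data. Conclusion: The method offers a new annotation-free route to word sense disambiguation that combines the strengths of symbolic systems and language models and supports automatic disambiguation of richer semantic representations. Abstract: Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.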

[24] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems

Elias Lumer,Alex Cardenas,Matt Melich,Myles Mason,Sara Dieter,Vamse Kumar Subbiah,Pradeep Honaganahalli Basavaraju,Roberto Hernandez

Main category: cs.CL

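TL;DR: This paper compares two retrieval approaches for multimodal RAG systems and finds that direct multimodal embedding retrieval outperforms text-based retrieval over LLM-generated summaries, substantially improving retrieval and answer accuracy on a financial-document question answering task.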

Details

Motivation: Existing multimodal RAG systems based on LLM summarization convert images to text during preprocessing, which loses visual information and hurts downstream performance. Method: Text-based chunk retrieval (images first summarized into text) is compared with direct multimodal embedding retrieval (images embedded natively) on a newly built financial earnings call benchmark, evaluating 6 LLMs and 2 multimodal embedding models. Result: Direct multimodal embedding retrieval improves mAP@5 by 13% absolute (32% relative) and nDCG@5 by 11% absolute (20% relative), and yields more accurate and factually consistent answers. Conclusion: Direct multimodal embedding retrieval preserves visual context and avoids the information loss introduced by LLM summarization, clearly outperforming the text-conversion approach. Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems, including text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate these approaches across 6 LLMs and two multimodal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5). These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. 
We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
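The abstract quotes each gain both as an absolute difference and as a relative improvement. The two are linked by relative = absolute / baseline, so a rough baseline can be back-calculated from the rounded figures. The sketch below shows that arithmetic; the function names are hypothetical, and the implied baselines it produces are back-of-the-envelope estimates from rounded percentages, not values reported in the abstract.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline` (e.g. 0.32 for 32%)."""
    return (improved - baseline) / baseline


def implied_baseline(absolute_gain, relative_gain_value):
    """Baseline score implied by a paired absolute and relative gain."""
    return absolute_gain / relative_gain_value


# Rounded figures quoted in the abstract (as fractions).
map_baseline = implied_baseline(0.13, 0.32)   # roughly 0.41 mAP@5 (estimate)
ndcg_baseline = implied_baseline(0.11, 0.20)  # roughly 0.55 nDCG@5 (estimate)
```

Under these assumptions the text-based baseline would sit near 0.41 mAP@5 and 0.55 nDCG@5, which is consistent with a 13-point gain being proportionally larger for mAP@5 than an 11-point gain is for nDCG@5.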

[25] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

Ali Taghibakhshi,Sharath Turuvekere Sreenivas,Saurav Muralidharan,Ruisi Cai,Marcin Chochowski,Ameya Sunil Mahabaleshwarkar,Yoshi Suhara,Oluwatobi Olabiyi,Daniel Korzekwa,Mostofa Patwary,Mohammad Shoeybi,Jan Kautz,Bryan Catanzaro,Ashwath Aithal,Nima Tajbakhsh,Pavlo Molchanov

Main category: cs.CL

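TL;DR: Proposes Nemotron Elastic, a framework that nests multiple submodels of different sizes inside a single large model and allows zero-shot extraction, substantially reducing training cost while maintaining strong performance.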

Details

Motivation: Training a family of LLMs at multiple scales is prohibitively expensive, and existing compression methods still incur substantial training cost per model, making it hard to serve many deployment scenarios efficiently. Method: An end-to-end trained router and a two-stage training curriculum are combined with group-aware SSM elastification, heterogeneous MLP elastification, normalized MSE-based layer importance, and knowledge distillation to embed multiple weight-sharing submodels within a single parent model. Result: Applied to Nemotron Nano V2 12B, the method simultaneously produces 9B and 6B models using only 110B training tokens, a cost reduction of over 360x compared with training from scratch and about 7x compared with SoTA compression techniques, with accuracy on par with or better than the SoTA. Conclusion: Nemotron Elastic enables efficient, scalable compression and deployment of reasoning-oriented models, supporting zero-shot extraction of submodels for multiple budgets at much lower cost while preserving accuracy. Abstract: Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par with or better than the SoTA in accuracy. 
Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.

cs.CV [Back]

[26] UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

Wei Zhang,Yeying Jin,Xin Li,Yan Zhang,Xiaofeng Cong,Cong Wang,Fengcai Qiao,Zhichao Lian

Main category: cs.CV

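TL;DR: Proposes UniFit, a universal virtual try-on framework driven by a multimodal large language model. A semantic alignment module and a progressive training strategy address the text-image semantic gap and data scarcity, supporting a range of complex tasks and achieving state-of-the-art performance.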

Details

Motivation: Existing virtual try-on methods struggle with diverse and complex tasks because of the semantic gap between text instructions and reference images and the lack of data for complex scenarios, which makes a universal framework difficult to build. Method: UniFit introduces an MLLM-Guided Semantic Alignment Module (MGSA) to fuse multimodal inputs and narrow the semantic gap, together with a two-stage progressive training strategy and a self-synthesis pipeline for learning complex tasks from limited data. Result: Experiments show that UniFit supports complex try-on tasks such as multi-garment and model-to-model try-on and achieves state-of-the-art performance. Conclusion: By combining a multimodal large model with semantic alignment, UniFit improves the generality and generation quality of virtual try-on systems and offers a practical route to a unified VTON system. Abstract: Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) a semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance. The source code and pretrained models are available at https://github.com/zwplus/UniFit.

[27] EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3

Chengxi Zeng,Yuxuan Jiang,Aaron Zhang

Main category: cs.CV

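TL;DR: This paper presents EfficientSAM3, which uses Progressive Hierarchical Distillation (PHD) to transfer SAM3's capabilities to lightweight models, enabling efficient on-device concept segmentation and tracking.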

Details

Motivation: SAM3 is powerful, but its unified architecture is computationally expensive and hard to deploy on-device, so more efficient models are needed. Method: Progressive Hierarchical Distillation (PHD) proceeds in three stages: (1) encoder distillation to align image features; (2) temporal memory distillation, which compresses spatiotemporal features with a Perceiver-based module; and (3) end-to-end fine-tuning to preserve concept-level performance. Lightweight backbones such as RepViT, TinyViT, and EfficientViT are used. Result: The models achieve strong performance-efficiency trade-offs on popular VOS datasets in comparison with related work, enabling efficient on-device concept segmentation and tracking. Conclusion: Through PHD, EfficientSAM3 inherits SAM3's capabilities while greatly reducing compute requirements, making it suitable for on-device deployment in practice. Abstract: The Segment Anything Model 3 (SAM3) advances visual understanding with Promptable Concept Segmentation (PCS) across images and videos, but its unified architecture (shared vision backbone, DETR-style detector, dense-memory tracker) remains prohibitive for on-device use. We present EfficientSAM3, a family of efficient models built on Progressive Hierarchical Distillation (PHD) that transfers capability from SAM3 to lightweight students in three stages: (1) Encoder Distillation aligns image features via prompt-in-the-loop training on SA-1B; (2) Temporal Memory Distillation replaces dense memory with a compact Perceiver-based module trained on SA-V to compress and retrieve spatiotemporal features efficiently; and (3) End-to-End Fine-Tuning refines the full pipeline on the official SAM3 PCS data to preserve concept-level performance. PHD yields a spectrum of student variants using RepViT, TinyViT, and EfficientViT backbones, enabling on-device concept segmentation and tracking while maintaining high fidelity to teacher behavior. We benchmark on popular VOS datasets and compare with a variety of related work, achieving strong performance-efficiency trade-offs.

[28] WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion

Sajjad Pakdamansavoji,Yintao Ma,Amir Rasouli,Tongtong Cao

Main category: cs.CV

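TL;DR: This paper proposes a new approach to model-based 6D object pose estimation under occlusion, combining dynamic sampling, multi-hypothesis inference, iterative refinement, and data augmentation to improve both accuracy and speed.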

Details

Motivation: Existing 6D pose estimation methods degrade under occlusion because early errors propagate through multi-stage pipelines, and they generalize poorly to unseen objects; both robustness and evaluation protocols need improvement. Method: Four extensions are proposed: dynamic non-uniform dense sampling, a multi-hypothesis inference mechanism, iterative refinement, and occlusion-focused training augmentations, along with a visibility-weighted evaluation metric. Result: Accuracy improves by more than 5% on ICBIN and more than 2% on BOP benchmarks, with approximately 3 times faster inference. Conclusion: The approach mitigates the effects of occlusion, strengthens robustness and generalization, and reduces bias in evaluation, providing an efficient and reliable solution for general 6D pose estimation. Abstract: Accurate 6D object pose estimation is vital for robotics, augmented reality, and scene understanding. For seen objects, high accuracy is often attainable via per-object fine-tuning, but generalizing to unseen objects remains a challenge. To address this problem, prior work assumes access to CAD models at test time and typically follows a multi-stage pipeline to estimate poses: detect and segment the object, propose an initial pose, and then refine it. Under occlusion, however, the early stages of such pipelines are prone to errors, which can propagate through the sequential processing and consequently degrade performance. To remedy this shortcoming, we propose four novel extensions to model-based 6D pose estimation methods: (i) a dynamic non-uniform dense sampling strategy that focuses computation on visible regions, reducing occlusion-induced errors; (ii) a multi-hypothesis inference mechanism that retains several confidence-ranked pose candidates, mitigating brittle single-path failures; (iii) iterative refinement to progressively improve pose accuracy; and (iv) a series of occlusion-focused training augmentations that strengthen robustness and generalization. Furthermore, we propose a new visibility-weighted metric for evaluation under occlusion to minimize the bias in the existing protocols. Via extensive empirical evaluations, we show that our proposed approach achieves more than 5% improvement in accuracy on ICBIN and more than 2% on BOP dataset benchmarks, while achieving approximately 3 times faster inference.

[29] Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation

Lukas Arzoumanidis,Julius Knechtel,Jan-Henrik Haunert,Youness Dehbi

Main category: cs.CV

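TL;DR: Proposes an automatic deep generative approach and an alternative manual stochastic degradation technique for generating realistic, diverse synthetic historical map data to address the scarcity of annotated historical maps, and validates them through domain-adaptive semantic segmentation with a Self-Constructing Graph Convolutional Network.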

Details

Motivation: Deep learning analysis of historical maps is limited by the lack of annotated data, especially for specific homogeneous cartographic domains, and synthetic data often lacks realism and diversity. Method: The cartographic style of an original historical map corpus is transferred onto vector data to generate large numbers of synthetic maps; an automatic deep generative approach and an alternative manual stochastic degradation technique emulate the visual uncertainty and noise of historical map scans. Data quality is assessed via domain-adaptive semantic segmentation with a Self-Constructing Graph Convolutional Network. Result: The generated training datasets were used for land-cover interpretation on a homogeneous historical map corpus, allowing the effectiveness and applicability of the bootstrapping methods to be assessed quantitatively. Conclusion: The method generates realistic and diverse training data for historical maps, offering a practical data bootstrapping solution for historical map analysis tasks that lack annotations. Abstract: The automated analysis of historical documents, particularly maps, has drastically benefited from advances in deep learning and its success across various computer vision applications. However, most deep learning-based methods heavily rely on large amounts of annotated training data, which are typically unavailable for historical maps, especially for those belonging to specific, homogeneous cartographic domains, also known as corpora. Creating high-quality training data suitable for machine learning often takes a significant amount of time and involves extensive manual effort. While synthetic training data can alleviate the scarcity of real-world samples, it often lacks the affinity (realism) and diversity (variation) necessary for effective learning. By transferring the cartographic style of an original historical map corpus onto vector data, we bootstrap an effectively unlimited number of synthetic historical maps suitable for tasks such as land-cover interpretation of a homogeneous historical map corpus. We propose an automatic deep generative approach and an alternative manual stochastic degradation technique to emulate the visual uncertainty and noise, also known as data-dependent uncertainty, commonly observed in historical map scans. To quantitatively evaluate the effectiveness and applicability of our approach, the generated training datasets were employed for domain-adaptive semantic segmentation on a homogeneous map corpus using a Self-Constructing Graph Convolutional Network, enabling a comprehensive assessment of the impact of our data bootstrapping methods.

[30] Box6D: Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes

Yintao Ma,Sajjad Pakdamansavoji,Amir Rasouli,Tongtong Cao

Main category: cs.CV

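TL;DR: This paper proposes Box6D, a category-level 6D pose estimation method designed for storage boxes in warehouse environments. From RGB-D input, it infers box dimensions with a fast binary search and estimates pose using a category template; a depth-based plausibility filter and early stopping keep accuracy high while substantially reducing computational cost.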

Details

Motivation: Existing 6D pose estimation methods struggle to balance accuracy, flexibility, and practicality, and perform poorly under the occlusion and clutter typical of warehouses; a dedicated solution that balances efficiency and accuracy is needed. Method: From a single RGB-D frame, Box6D infers object dimensions via a fast binary search, estimates pose with a category-level CAD template, and applies a depth-based plausibility filter and early-stopping strategy to reject implausible hypotheses and speed up inference. Result: On real-world storage scenarios and public benchmarks, Box6D matches or exceeds existing methods in 6D pose precision while reducing inference time by approximately 76%. Conclusion: Box6D greatly increases inference speed while maintaining high accuracy, making it practical for 6D pose estimation of box-like objects in industrial settings. Abstract: Accurate and efficient 6D pose estimation of novel objects under clutter and occlusion is critical for robotic manipulation across warehouse automation, bin picking, logistics, and e-commerce fulfillment. There are three main approaches in this domain: model-based methods assume an exact CAD model at inference but require high-resolution meshes and transfer poorly to new environments; model-free methods that rely on a few reference images or videos are more flexible but often fail under challenging conditions; category-level approaches aim to balance flexibility and accuracy, but many are overly general and ignore environment and object priors, limiting their practicality in industrial settings. To this end, we propose Box6D, a category-level 6D pose estimation method tailored for storage boxes in the warehouse context. From a single RGB-D observation, Box6D infers the dimensions of the boxes via a fast binary search and estimates poses using a category CAD template rather than instance-specific models. Using a depth-based plausibility filter and early-stopping strategy, Box6D then rejects implausible hypotheses, lowering computational cost. We conduct evaluations on real-world storage scenarios and public benchmarks, and show that our approach delivers competitive or superior 6D pose precision while reducing inference time by approximately 76%.

[31] RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

Meilong Xu,Di Fu,Jiaxing Zhang,Gong Yu,Jiayu Zheng,Xiaoling Hu,Dongdi Zhao,Feiyang Li,Chao Chen,Yong Cao

Main category: cs.CV

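TL;DR: Proposes a two-stage self-improvement paradigm that requires no new annotations: self-generated textual rationales bridge the semantic gap vision language models face in domain-specific video classification, significantly outperforming direct supervised fine-tuning.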

Details

Motivation: Vision language models perform poorly on domain-specific video classification with limited data because of the semantic distance between complex spatio-temporal content and abstract labels (the "rationale gap"), which makes domain knowledge hard to learn. Method: A two-stage self-improvement approach: first, the VLM generates a detailed textual rationale for each video and is fine-tuned on these self-generated rationales to strengthen its grasp of domain logic; second, conventional supervised fine-tuning on the task labels is performed. No additional human annotation is required. Result: Experiments on diverse datasets show that the method significantly outperforms direct supervised fine-tuning and adapts more effectively to domain-specific video analysis. Conclusion: Self-generated rationales are an effective, annotation-efficient paradigm for closing the rationale gap of vision language models in low-data video classification and improving their domain adaptation. Abstract: Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical "rationale gap," where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model's pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.

[32] Boosting Medical Visual Understanding From Multi-Granular Language Learning

Zihan Li,Yiqing Wang,Sina Farsiu,Paul Kinahan

Main category: cs.CV

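TL;DR: Proposes MGLL, a new contrastive learning framework that strengthens multi-label and cross-granularity image-text alignment, particularly for complex domains such as medical imaging.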

Details

Motivation: CLIP's focus on single-label, single-granularity alignment limits its use in complex domains, such as medical imaging, that require multi-label and multi-granularity annotation. Method: MGLL uses structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment; a smooth KL divergence ensures cross-granularity consistency. Result: Evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods on downstream tasks. Conclusion: MGLL improves vision-language model performance in multi-label and cross-granularity settings while remaining computationally efficient as a plug-and-play module. Abstract: Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.

[33] Automated Interpretable 2D Video Extraction from 3D Echocardiography

Milos Vukadinovic,Hirotaka Ieki,Yuki Sahasi,David Ouyang,Bryan He

Main category: cs.CV

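TL;DR: Proposes an automated method for selecting standard 2D views from 3D cardiac ultrasound volumes, combining deep learning with anatomical priors to reconstruct standard echocardiography views. Blinded evaluation by three cardiologists found 96% accuracy, and the extracted videos support downstream AI models for abnormality detection and clinical-grade measurements.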

Details

Motivation: Conventional cardiac ultrasound relies on 2D videos that cannot fully capture the heart's complex 3D structure; 3D echocardiography provides more complete information, but clinical interpretation is still habitually 2D. A method is needed that keeps the familiar 2D format while exploiting the advantages of 3D scanning. Method: A deep learning view classifier is combined with heuristics based on anatomical landmarks and heuristics provided by cardiologists to automatically select and reconstruct standard 2D echocardiography views from 3D ultrasound volumes. Result: The approach achieves 96% accuracy on 1,600 videos from two hospitals; the resulting 2D videos work with existing AI models (EchoPrime and PanEcho) for detecting cardiac abnormalities and with EchoNet-Measurement for clinical-grade measurements of cardiac anatomy, while preserving spatial calibration and diagnostic features. Conclusion: The method bridges 3D ultrasound acquisition and 2D clinical interpretation, gaining the speed and usability of 3D scanning while remaining compatible with existing clinical workflows. Abstract: Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as their ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos at https://github.com/echonet/3d-echo.

[34] Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

Raphael Ruschel,Hardikkumar Prajapati,Awsafur Rahman,B. S. Manjunath

Main category: cs.CV

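TL;DR: Presents Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG), which combines visual prompting with spatio-temporal and semantic understanding to generate a temporally consistent scene graph from a single user click or bounding box.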

Details

Motivation: Existing VSGG systems are closed feed-forward pipelines that cannot incorporate human guidance, while promptable segmentation models such as SAM2 lack semantic and relational reasoning. A framework that combines human interaction with semantic reasoning is therefore needed. Method: Click2Graph comprises a Dynamic Interaction Discovery Module (which generates subject-conditioned object prompts) and a Semantic Classification Head (which performs joint entity and predicate reasoning); from a single user prompt it carries out object segmentation, tracking, interaction discovery, and triplet prediction. Result: Experiments on the OpenPVSG benchmark show that Click2Graph establishes a strong foundation for user-guided PVSG, effectively combining human prompting, panoptic grounding, and relational inference. Conclusion: Click2Graph is the first interactive PVSG framework; it demonstrates the potential of combining human prompting with structured video understanding and advances controllable, interpretable video scene understanding. Abstract: State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.

[35] InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer

Muyao Yuan,Yuanhong Zhang,Weizhan Zhang,Lan Ma,Yuan Gao,Jiangyong Ying,Yudeng Xin

Main category: cs.CV

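TL;DR: Proposes InfoCLIP, which uses an information-theoretic approach to improve the fine-tuning of CLIP for open-vocabulary semantic segmentation while preserving the pretrained vision-language alignment.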

Details

Motivation: Existing methods that fine-tune CLIP for segmentation on a limited set of categories tend to overfit and degrade the modality alignment, so a method that stabilizes this alignment is needed. Method: Two new objectives grounded in mutual information are proposed: compressing the pixel-text alignment of pretrained CLIP to reduce noise, and maximizing the mutual information between the alignment knowledge of the pretrained and fine-tuned models to transfer local semantic relations suited to segmentation. Result: Experiments on multiple benchmarks show that InfoCLIP effectively improves CLIP fine-tuning for open-vocabulary semantic segmentation, demonstrating adaptability and superiority. Conclusion: Through information-theoretic knowledge transfer, InfoCLIP preserves the pretrained modality alignment while improving segmentation performance, and it is particularly suited to asymmetric transfer settings. Abstract: Recently, the strong generalization ability of CLIP has facilitated open-vocabulary semantic segmentation, which labels pixels using arbitrary text. However, existing methods that fine-tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision-language alignment. To stabilize modality alignment during fine-tuning, we propose InfoCLIP, which leverages an information-theoretic perspective to transfer alignment knowledge from pretrained CLIP to the segmentation task. Specifically, this transfer is guided by two novel objectives grounded in mutual information. First, we compress the pixel-text modality alignment from pretrained CLIP to reduce noise arising from its coarse-grained local semantic representations learned under image-text supervision. Second, we maximize the mutual information between the alignment knowledge of pretrained CLIP and the fine-tuned model to transfer compact local semantic relations suited for the segmentation task. Extensive evaluations across various benchmarks validate the effectiveness of InfoCLIP in enhancing CLIP fine-tuning for open-vocabulary semantic segmentation, demonstrating its adaptability and superiority in asymmetric transfer.

[36] Externally Validated Multi-Task Learning via Consistency Regularization Using Differentiable BI-RADS Features for Breast Ultrasound Tumor Segmentation

Jingru Zhang,Saed Moradi,Ashirbani Saha

Main category: cs.CV

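TL;DR: Proposes a multi-task learning method based on consistency regularization that uses differentiable BI-RADS-inspired morphological features to mitigate task interference in breast ultrasound tumor segmentation, significantly improving cross-dataset generalization.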

Details

Motivation: In breast ultrasound tumor segmentation, multi-task learning can lose performance to task interference, which hurts generalization. Method: A novel consistency regularization approach built from differentiable BI-RADS-inspired morphological features mitigates destructive interference between the segmentation and classification tasks. Result: With training on the BrEaST dataset and evaluation on three external datasets (UDIAT, BUSI, BUS-UCLM), the segmentation Dice coefficient improved significantly (0.81 vs 0.59, 0.66 vs 0.56, and 0.69 vs 0.49, respectively; p<0.001), and the method reached state-of-the-art performance on UDIAT. Conclusion: The method effectively mitigates task interference in multi-task learning and significantly improves the generalization and performance of breast ultrasound tumor segmentation models. Abstract: Multi-task learning can suffer from destructive task interference, where jointly trained models underperform single-task baselines and generalize poorly. To improve generalization performance in breast ultrasound-based tumor segmentation via multi-task learning, we propose a novel consistency regularization approach that mitigates destructive interference between segmentation and classification. The consistency regularization approach is composed of differentiable BI-RADS-inspired morphological features. We validated this approach by training all models on the BrEaST dataset (Poland) and evaluating them on three external datasets: UDIAT (Spain), BUSI (Egypt), and BUS-UCLM (Spain). Our comprehensive analysis demonstrates statistically significant (p<0.001) improvements in generalization for the segmentation task of the proposed multi-task approach vs. the baseline one: UDIAT, BUSI, BUS-UCLM (Dice coefficient=0.81 vs 0.59, 0.66 vs 0.56, 0.69 vs 0.49, resp.). The proposed approach also achieves state-of-the-art segmentation performance under rigorous external validation on the UDIAT dataset.
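
The Dice coefficients reported above compare a predicted mask with a reference mask. As a minimal sketch of how that score is computed (the function name `dice_coefficient` and the smoothing term `eps` are illustrative assumptions, not taken from the paper), the same formula works on hard 0/1 masks and on soft probabilities:

```python
def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for flattened masks.

    `pred` may hold hard labels (0/1) or soft probabilities in [0, 1];
    `eps` is an assumed smoothing term that avoids division by zero
    when both masks are empty.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return (2.0 * intersection + eps) / (total + eps)


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(round(dice_coefficient(predicted, reference), 4))
```

With one overlapping pixel out of two predicted and one reference pixel, the score is 2 * 1 / (2 + 1), roughly 0.667.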

[37] UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition

Xinyu Nan,Lingtao Mao,Huangyu Dai,Zexin Zheng,Xinyu Sun,Zihan Liang,Ben Chen,Yuqing Ding,Chenyi Lei,Wenwu Ou,Han Li

Main category: cs.CV

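TL;DR: Proposes a detection-guided generative framework that extracts ROI-level features and uses a BART-based generator to predict category and attribute tokens in hierarchical order, enabling more precise fine-grained semantic understanding.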

Details

Motivation: Existing methods rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. Method: A detection-guided generative framework extracts refined ROI-level features for each detected object and uses a BART-based generator to output hierarchical categories and property-value pairs in coarse-to-fine order, with support for property-conditioned attribute recognition. Result: Experiments on large-scale proprietary e-commerce datasets and open-source datasets show that the method significantly outperforms existing similarity-based pipelines and multi-stage classification systems, improving fine-grained recognition and the coherence of unified inference. Conclusion: The proposed method addresses category discrimination and attribute diversity in fine-grained semantic understanding, yielding stronger and more coherent unified visual-semantic modeling. Abstract: Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.

Dawei Li,Zijian Gu,Peng Wang,Chuhan Song,Zhen Tan,Mohan Zhang,Tianlong Chen,Yu Tian,Song Wang

Main category: cs.CV

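TL;DR: Proposes FADS, a fairness-aware in-context learning method that uses clustering-based sampling to build demographically balanced and semantically relevant demonstrations, reducing gender, race, and ethnicity bias in medical image reasoning while maintaining high accuracy.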

Details

Motivation: Existing debiasing methods usually depend on large labeled datasets or fine-tuning, which makes them hard to apply to foundation-scale multimodal models, and fairness across demographic groups remains a problem in medical image reasoning. Method: FADS (Fairness-Aware Demonstration Selection) uses a clustering-based sampling strategy that accounts for both demographic balance and semantic relevance when building in-context demonstrations, with no fine-tuning required. Result: Experiments on multiple medical imaging benchmarks confirm the effectiveness of FADS, which markedly reduces prediction disparities related to gender, race, and ethnicity while maintaining good overall accuracy. Conclusion: FADS offers a tuning-free route to fair, efficient, and scalable medical image reasoning and highlights the potential of fairness-aware in-context learning for medical AI. Abstract: Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
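
The idea of selecting demonstrations that are both demographically balanced and semantically relevant can be illustrated with a deliberately simplified sketch. This is not the FADS algorithm: the abstract describes clustering-based sampling, whereas the code below swaps in plain per-group nearest-neighbour selection, and the names `select_balanced`, `pool`, and `per_group` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_balanced(query, pool, per_group):
    """Pick the `per_group` most query-similar examples from every group.

    `pool` is a list of (embedding, group_label, example_id) tuples.
    Taking the same number from each demographic group keeps the
    demonstration set balanced; ranking by similarity keeps it relevant.
    """
    by_group = defaultdict(list)
    for embedding, group, example_id in pool:
        by_group[group].append((cosine(query, embedding), example_id))

    chosen = []
    for group in sorted(by_group):  # deterministic group order
        ranked = sorted(by_group[group], key=lambda s: s[0], reverse=True)
        chosen.extend(example_id for _, example_id in ranked[:per_group])
    return chosen


if __name__ == "__main__":
    pool = [
        ([1.0, 0.0], "A", "a1"),
        ([0.0, 1.0], "A", "a2"),
        ([1.0, 0.1], "B", "b1"),
        ([0.0, 1.0], "B", "b2"),
    ]
    print(select_balanced([1.0, 0.0], pool, per_group=1))
```

A similarity-only selector could return examples from a single group when that group dominates the neighbourhood of the query; fixing the per-group quota is what prevents that imbalance in this sketch.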

[39] Exploiting Inter-Sample Information for Long-tailed Out-of-Distribution Detection

Nimeshika Udayangani,Hadi M. Dolatabadi,Sarah Erfani,Christopher Leckie

Main category: cs.CV

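TL;DR: Proposes a graph neural network approach to OOD detection under long-tailed distributions: a graph is built from the feature space of a pretrained model and refined with Gaussianization and graph convolutions, substantially improving OOD detection and tail-class recognition accuracy.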

Details

Motivation: With long-tailed in-distribution data, existing OOD detection methods suffer from high false positive rates and low tail-class recognition accuracy. Method: The feature space of a pretrained model initializes the graph structure, Gaussianization corrects distribution deviations in the activation layers, and a graph convolutional network (GCN) refines the graph representation to improve detection of OOD samples in long-tailed data. Result: On the three benchmarks CIFAR10-LT, CIFAR100-LT, and ImageNet-LT, the method clearly outperforms state-of-the-art approaches in both FPR and tail-class ID classification accuracy. Conclusion: Through graph-structured modeling and distribution correction, the proposed method addresses the challenges of OOD detection in long-tailed settings, improving performance on both head and tail classes. Abstract: Detecting out-of-distribution (OOD) data is essential for safe deployment of deep neural networks (DNNs). This problem becomes particularly challenging in the presence of long-tailed in-distribution (ID) datasets, often leading to high false positive rates (FPR) and low tail-class ID classification accuracy. In this paper, we demonstrate that exploiting inter-sample relationships using a graph-based representation can significantly improve OOD detection in long-tailed recognition of vision datasets. To this end, we use the feature space of a pre-trained model to initialize our graph structure. We account for the differences between the activation layer distribution of the pre-training vs. training data, and actively introduce Gaussianization to alleviate any deviations from a standard normal distribution in the activation layers of the pre-trained model. We then refine this initial graph representation using graph convolutional networks (GCNs) to arrive at a feature space suitable for long-tailed OOD detection. This leads us to address the inferior performance observed in ID tail-classes within existing OOD detection methods. Experiments over three benchmarks CIFAR10-LT, CIFAR100-LT, and ImageNet-LT demonstrate that our method outperforms the state-of-the-art approaches by a large margin in terms of FPR and tail-class ID classification accuracy.
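
The abstract does not say which Gaussianization procedure is used. One common choice, shown here purely as an assumption, is the rank-based inverse normal transform: each activation is replaced by the standard-normal quantile of its rank, so the output follows a standard normal shape regardless of the input distribution. The function name `gaussianize` is illustrative.

```python
from statistics import NormalDist


def gaussianize(values):
    """Rank-based inverse normal transform (an assumed variant).

    Each value is mapped to the standard-normal quantile of its
    mid-rank (rank + 0.5) / n, which preserves ordering while forcing
    the marginal distribution toward N(0, 1).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    transformed = [0.0] * n
    for rank, index in enumerate(order):
        transformed[index] = standard_normal.inv_cdf((rank + 0.5) / n)
    return transformed


if __name__ == "__main__":
    skewed_activations = [10.0, 1.0, 100.0]
    print(gaussianize(skewed_activations))
```

The heavily skewed input is mapped to values symmetric around zero: the median maps to 0 and the extremes map to equal and opposite quantiles.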

[40] Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion

Dingkun Zhou,Patrick P. K. Chan,Hengxu Wu,Shikang Zheng,Ruiqi Huang,Yuanjie Zhao

Main category: cs.CV

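TL;DR: Proposes a sequence-level optimization framework that generates natural, printable adversarial textures able to evade human detection throughout entire walking videos, with strong concealment and robustness in both digital and physical settings.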

Details

Motivation: Existing wearable attacks struggle to stay concealed over long videos because of motion, pose changes, and garment deformation; they lack cross-frame consistency, which limits practical use. Method: Product images are mapped to UV space and parameterized as a compact palette and control points, with ICC locking to keep colors printable; a physically based human-garment simulation pipeline generates video sequences with multiple viewpoints, dynamic illumination, and cloth deformation; an expectation-over-transformation objective with temporal weighting optimizes the control points to minimize detection confidence over the whole sequence. Result: Experiments show strong and stable concealment in both digital and physical settings, high robustness to viewpoint changes, and good cross-model transferability; physically printed garments reliably suppress detection in indoor and outdoor recordings. Conclusion: The proposed sequence-level optimization markedly improves the practicality and persistence of wearable adversarial textures in real surveillance scenarios, confirming real-world feasibility. Abstract: Deep neural networks used for human detection are highly vulnerable to adversarial manipulation, creating safety and privacy risks in real surveillance environments. Wearable attacks offer a realistic threat model, yet existing approaches usually optimize textures frame by frame and therefore fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation. In this work, a sequence-level optimization framework is introduced to generate natural, printable adversarial textures for shirts, trousers, and hats that remain effective throughout entire walking videos in both digital and physical settings. Product images are first mapped to UV space and converted into a compact palette and control-point parameterization, with ICC locking to keep all colors printable. A physically based human-garment pipeline is then employed to simulate motion, multi-angle camera viewpoints, cloth dynamics, and illumination variation. An expectation-over-transformation objective with temporal weighting is used to optimize the control points so that detection confidence is minimized across whole sequences. Extensive experiments demonstrate strong and stable concealment, high robustness to viewpoint changes, and superior cross-model transferability. Physical garments produced with sublimation printing achieve reliable suppression under indoor and outdoor recordings, confirming real-world feasibility.
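
An expectation-over-transformation objective with temporal weighting can be read as a weighted average: for each frame, average the detector confidence over sampled transformations, then combine frames with per-frame weights. The sketch below shows only that aggregation; the detector, the transformations, and the weighting scheme are not specified in the abstract, so `confidences` and `frame_weights` are hypothetical inputs.

```python
def sequence_loss(confidences, frame_weights):
    """Temporally weighted expectation of detection confidence.

    confidences[t][k] is the (assumed) detector confidence for frame t
    under the k-th sampled transformation; frame_weights[t] is the
    temporal weight of frame t. Lower values mean better concealment
    across the whole sequence.
    """
    if len(confidences) != len(frame_weights):
        raise ValueError("one weight is required per frame")
    total_weight = sum(frame_weights)
    loss = 0.0
    for frame_confidences, weight in zip(confidences, frame_weights):
        expectation = sum(frame_confidences) / len(frame_confidences)
        loss += weight * expectation
    return loss / total_weight


if __name__ == "__main__":
    confidences = [[0.2, 0.4], [0.8, 0.6]]  # two frames, two transformations each
    frame_weights = [1.0, 3.0]              # the second frame counts three times as much
    print(sequence_loss(confidences, frame_weights))
```

Optimizing one loss over the whole sequence, rather than frame by frame, is what lets the weights emphasize frames where the texture is most exposed to the detector.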

[41] Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution

Xiao He,Zhijun Tu,Kun Cheng,Mingrui Zhu,Jie Hu,Nannan Wang,Xinbo Gao

Main category: cs.CV

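TL;DR: Proposes Mixture-of-Ranks (MoR), a sparse Mixture-of-Experts (MoE) architecture for one-step real-world image super-resolution (Real-ISR) that treats each LoRA rank as an independent expert and combines degradation estimation with dynamic load balancing, improving adaptability to complex degradations and computational efficiency.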

Details

Motivation: Existing Real-ISR methods fine-tune diffusion models with LoRA and are dense models, which makes it hard for them to adapt to the heterogeneous characteristics of real-world degraded samples or to share knowledge across inputs under the same computational budget. A more flexible, scalable architecture is needed to improve performance and resource utilization. Method: The Mixture-of-Ranks (MoR) architecture treats each LoRA rank as an independent expert through a fine-grained expert partitioning strategy; a degradation estimation module based on CLIP embeddings and predefined text pairs dynamically guides expert activation; zero-expert slots and a degradation-aware load-balancing loss adjust the number of active experts according to degradation severity. Result: Experiments show that the MoR framework achieves state-of-the-art performance on multiple Real-ISR benchmarks, improving adaptability to samples with different degradation levels and optimizing the allocation of computation. Conclusion: Bringing sparse MoE into Real-ISR is feasible and efficient; with fine-grained expert design and degradation-aware routing, MoR enables flexible knowledge recombination and dynamic allocation of computation, offering a new direction for efficient super-resolution modeling. Abstract: The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through a Low-Rank Adaptation (LoRA) module to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework's effectiveness and state-of-the-art performance.
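
Treating each LoRA rank as an expert rests on the fact that a low-rank update decomposes into a sum of rank-one terms, so any subset of ranks can be switched on independently. The sketch below illustrates that decomposition with shared ranks that are always active and top-k routing over the rest. It is an assumption-laden toy: the gate scores are supplied by hand, and zero-expert slots and the load-balancing loss are not modelled. The names `mixture_of_ranks`, `gate_scores`, and `shared` are hypothetical.

```python
def mixture_of_ranks(x, down_rows, up_cols, gate_scores, top_k, shared=()):
    """Apply only the selected rank-one components of a LoRA update.

    down_rows[r] is row r of the down-projection (length d_in) and
    up_cols[r] is column r of the up-projection (length d_out), so the
    full update is sum_r up_cols[r] * dot(down_rows[r], x).
    Ranks listed in `shared` are always active; the remaining ranks
    compete and the `top_k` with the highest gate score are used.
    """
    routed = [r for r in range(len(down_rows)) if r not in shared]
    routed.sort(key=lambda r: gate_scores[r], reverse=True)
    active = list(shared) + routed[:top_k]

    output = [0.0] * len(up_cols[0])
    for r in active:
        coefficient = sum(a * xi for a, xi in zip(down_rows[r], x))
        for j, b in enumerate(up_cols[r]):
            output[j] += coefficient * b
    return output, active


if __name__ == "__main__":
    x = [1.0, 2.0]
    down_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    up_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    gate_scores = [0.1, 0.9, 0.5]  # hand-written stand-in for a learned router
    print(mixture_of_ranks(x, down_rows, up_cols, gate_scores, top_k=1, shared=(0,)))
```

With `top_k` equal to the number of routed ranks, the result equals the ordinary dense LoRA update, which makes the sparse variant easy to sanity-check.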

[42] Towards a Safer and Sustainable Manufacturing Process: Material classification in Laser Cutting Using Deep Learning

Mohamed Abdallah Salem,Hamdy Ahmed Ashur,Ahmed Elshinnawy

Main category: cs.CV

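TL;DR: Proposes a deep learning method that classifies materials from laser speckle patterns, for real-time monitoring and control of the laser cutting process, with high accuracy and robustness.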

Details

Motivation: The classification performance of conventional speckle sensing degrades when the laser color changes, so a more stable material recognition technique is needed to keep laser cutting safe and efficient. Method: A convolutional neural network (CNN) is trained on laser speckle patterns of material surfaces to recognize different material types, and its generalization under different laser colors is verified. Result: The model reaches 98.30% accuracy on the training set and 96.88% on the validation set, and an F1-score of 0.9643 on 3,000 new images covering 30 materials, showing strong classification performance and robustness. Conclusion: The method copes with the interference caused by changes in laser color and provides a reliable solution for material-aware laser cutting based on speckle sensing. Abstract: Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers' health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real-time. This paper proposes a material classification technique using a speckle pattern of the material's surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30% on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.

[43] CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis

Zijian Wu,Mingfeng Jiang,Zidian Lin,Ying Song,Hanjie Ma,Qun Wu,Dongping Zhang,Guiyang Pu

Main category: cs.CV

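TL;DR: Presents CuriGS, a curriculum-learning framework for sparse-view reconstruction with 3D Gaussian Splatting. It introduces pseudo-views (student views) at multiple perturbation levels and unlocks them progressively during training, markedly improving rendering quality and geometric consistency when supervision is scarce.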

Details

Motivation: Under sparse views, 3D Gaussian Splatting (3DGS) suffers from insufficient supervision and overfitting caused by limited viewpoint coverage, which prevents high-quality reconstruction. Method: CuriGS generates pseudo-views (students) at different perturbation levels around ground-truth poses (teachers), follows a curriculum that gradually unlocks higher perturbation levels, regularizes student views with depth-correlation and co-regularization, and uses a multi-signal metric to evaluate students and promote the best ones into the training set. Result: Experiments show that CuriGS outperforms state-of-the-art methods in rendering fidelity and geometric consistency across a variety of synthetic and real sparse-view scenes. Conclusion: Through curriculum-guided view augmentation, CuriGS alleviates overfitting and scarce supervision under sparse views, offering a reliable solution for applying 3DGS with limited input data. Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as an efficient, high-fidelity representation for real-time scene reconstruction and rendering. However, extending 3DGS to sparse-view settings remains challenging because of supervision scarcity and overfitting caused by limited viewpoint coverage. In this paper, we present CuriGS, a curriculum-guided framework for sparse-view 3D reconstruction using 3DGS. CuriGS addresses the core challenge of sparse-view synthesis by introducing student views: pseudo-views sampled around ground-truth poses (teacher). For each teacher, we generate multiple groups of student views with different perturbation levels. During training, we follow a curriculum schedule that gradually unlocks higher perturbation levels, randomly sampling candidate students from the active level to assist training. Each sampled student is regularized via depth-correlation and co-regularization, and evaluated using a multi-signal metric that combines SSIM, LPIPS, and an image-quality measure. For every teacher and perturbation level, we periodically retain the best-performing students and promote those that satisfy a predefined quality threshold to the training set, resulting in a stable augmentation of sparse training views. Experimental results show that CuriGS outperforms state-of-the-art baselines in both rendering fidelity and geometric consistency across various synthetic and real sparse-view scenes. Project page: https://zijian1026.github.io/CuriGS/
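
The curriculum described above has two mechanical parts: a schedule that decides which perturbation level is active at a given training step, and a promotion rule that keeps the best students and admits only those above a quality threshold. A minimal sketch follows; the fixed-interval unlocking, the score convention (higher is better), and the names `active_level` and `promote` are assumptions for illustration, not details from the paper.

```python
def active_level(step, unlock_every, num_levels):
    """Return the perturbation level in use at `step`.

    Assumes a new level unlocks every `unlock_every` steps and that the
    schedule saturates at the highest level (num_levels - 1).
    """
    return min(step // unlock_every, num_levels - 1)


def promote(students, threshold, keep_best):
    """Keep the `keep_best` highest-scoring students, then promote only
    those whose score reaches `threshold`.

    `students` is a list of (score, view_id) pairs; the score stands in
    for the multi-signal quality metric.
    """
    best = sorted(students, key=lambda s: s[0], reverse=True)[:keep_best]
    return [view_id for score, view_id in best if score >= threshold]


if __name__ == "__main__":
    for step in (0, 250, 1000):
        print(step, active_level(step, unlock_every=100, num_levels=3))
    candidates = [(0.9, "view_a"), (0.5, "view_b"), (0.8, "view_c")]
    print(promote(candidates, threshold=0.7, keep_best=2))
```

Separating retention (`keep_best`) from admission (`threshold`) mirrors the abstract's wording: the best students are retained periodically, but only those passing the quality bar join the training set.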

[44] Crossmodal learning for Crop Canopy Trait Estimation

Timilehin T. Ayanlade,Anirudha Powadi,Talukder Z. Jubery,Baskar Ganapathysubramanian,Soumik Sarkar

Main category: cs.CV

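TL;DR: Proposes a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation; the resulting representations outperform real satellite imagery on tasks such as yield and nitrogen prediction.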

Details

Motivation: Satellite imagery is limited by spatial resolution and falls short of the needs of modern micro-plot farm management, while UAV imagery offers the required detail; the strengths of the two should be combined. Method: Using a dataset of approximately co-registered satellite-UAV image pairs collected from 84 hybrid maize varieties at five locations, a model is trained to learn fine-grained spectral-spatial correspondences between the modalities. Result: UAV-like representations generated from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield prediction and nitrogen prediction. Conclusion: Cross-modal correspondence learning can bridge the gap between satellite and UAV sensing in agricultural monitoring and improve crop monitoring performance. Abstract: Recent advances in plant phenotyping have driven widespread adoption of multi-sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAV) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are bound by limited spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. Using a dataset of approximately co-registered satellite-UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine-grained spectral-spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.

[45] LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets

Qing Wang,Chong-Wah Ngo,Ee-Peng Lim,Qianru Sun

Main category: cs.CV

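TL;DR: Proposes a large language model (LLM)-based framework that maps images and generated text (such as food titles and ingredients) into a shared embedding space to address domain shift, long-tailed distributions, and fine-grained classification in food recognition.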

Details

Motivation: Food recognition faces domain shift between training data and the images users actually take, long-tailed distributions in real datasets, and subtle visual differences between dishes of different categories. Method: LLMs parse food images to generate titles and ingredients; the generated text and the images are projected into a shared embedding space to maximize pair similarity; the aligned multimodal features are then used for recognition. Result: On two food datasets, the method outperforms existing approaches designed for long-tailed distributions, domain adaptation, and fine-grained classification. Conclusion: The proposed simple yet effective multimodal framework markedly improves food recognition in complex real-world settings. Abstract: Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in the free-living environment. In addition to this domain-shift problem, real-world food datasets tend to be long-tailed, and some dishes of different categories exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images to generate food titles and ingredients. Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform the existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.

[46] AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers

Boxun Xu,Yu Wang,Zihu Wang,Peng Li

Main category: cs.CV

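TL;DR: Proposes AMS-KV, an adaptive KV caching policy for next-scale prediction in visual autoregressive models, which substantially reduces memory use and compute latency and improves generation efficiency and scalability.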

Details

Motivation: In visual autoregressive models based on next-scale prediction, the KV cache grows sharply as the number of scales increases, severely limiting scalability, and existing methods do not manage this multi-scale cache effectively. Method: Based on an analysis of KV similarity across scales, AMS-KV prioritizes storing KVs from condensed (coarsest) and local scales, and it identifies cache-demanding layers through inter-scale similarity analysis to optimize cache utilization per layer. Result: Compared with the baseline, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%; where the baseline VAR-d30 runs out of memory at a batch size of 128, AMS-KV scales stably to a batch size of 256 with improved throughput. Conclusion: AMS-KV addresses KV cache growth in multi-scale image generation, improving efficiency and scalability while maintaining generation quality, and offers a practical path to deploying VAR models. Abstract: Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales, severely limiting scalability. Our systematic investigation reveals that: (1) attending to tokens from local scales significantly contributes to generation quality; (2) allocating a small amount of memory for the coarsest scales, termed condensed scales, stabilizes multi-scale image generation; and (3) strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on these observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
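
The caching policy above keeps KVs from two kinds of scales: the coarsest ("condensed") scales and the most recent ("local") scales. A minimal sketch of that selection rule follows; the window sizes, the token counts per scale, and the function names are hypothetical, and the layer-wise adaptation described in the abstract is not modelled.

```python
def scales_to_cache(current_scale, num_condensed, num_local):
    """Indices of earlier scales whose KVs are kept when generating
    `current_scale` (scales are numbered from 0, coarsest first).

    Keeps the first `num_condensed` scales and the last `num_local`
    scales before the current one; everything in between is dropped.
    """
    condensed = set(range(min(num_condensed, current_scale)))
    local = set(range(max(0, current_scale - num_local), current_scale))
    return sorted(condensed | local)


def cache_fraction(tokens_per_scale, current_scale, num_condensed, num_local):
    """Fraction of previously generated tokens that remain cached."""
    kept = scales_to_cache(current_scale, num_condensed, num_local)
    total = sum(tokens_per_scale[:current_scale])
    return sum(tokens_per_scale[s] for s in kept) / total if total else 0.0


if __name__ == "__main__":
    print(scales_to_cache(10, num_condensed=2, num_local=2))
    # Assumed token counts: scale s holds (s + 1) ** 2 tokens.
    tokens = [(s + 1) ** 2 for s in range(10)]
    print(cache_fraction(tokens, 6, num_condensed=1, num_local=1))
```

Because token counts grow with scale, the savings depend mostly on how many of the fine scales fall outside the local window.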

[47] LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving

Pei Liu,Songtao Wang,Lang Zhang,Xingyue Peng,Yuandong Lyu,Jiaxin Deng,Songxin Lu,Weiliang Ma,Xueyang Zhang,Yifei Zhan,XianPeng Lang,Jun Ma

Main category: cs.CV

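TL;DR: Proposes LiSTAR, a generative world model that operates on the sensor's native geometry for high-fidelity, controllable 4D LiDAR synthesis. Combining an HCS representation with START attention, it clearly outperforms existing methods on reconstruction, prediction, and conditional generation.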

Details

Motivation: High-fidelity, controllable 4D LiDAR synthesis is essential for autonomous driving simulation, but it is hampered by the sensor's spherical geometry, temporal sparsity, and the complexity of dynamic scenes. Method: LiSTAR uses a Hybrid-Cylindrical-Spherical (HCS) representation to reduce quantization error; a Spatio-Temporal Attention with Ray-Centric Transformer (START) models feature evolution along individual sensor rays; a 4D point cloud-aligned voxel layout and a discrete Masked Generative START (MaskSTART) framework enable controllable generation. Result: State-of-the-art performance on 4D LiDAR reconstruction, prediction, and conditional generation: generation MMD reduced by 76%, reconstruction IoU improved by 32%, and prediction L1 Med lowered by 50%. Conclusion: By modeling the native geometry and introducing a new representation learning framework, LiSTAR achieves efficient, high-resolution, layout-guided 4D LiDAR synthesis, providing a powerful new tool for autonomous driving simulation. Abstract: Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor's unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly on the sensor's native geometry. LiSTAR introduces a Hybrid-Cylindrical-Spherical (HCS) representation to preserve data fidelity by mitigating quantization artifacts common in Cartesian grids. To capture complex dynamics from sparse temporal data, it utilizes a Spatio-Temporal Attention with Ray-Centric Transformer (START) that explicitly models feature evolution along individual sensor rays for robust temporal coherence. Furthermore, for controllable synthesis, we propose a novel 4D point cloud-aligned voxel layout for conditioning and a corresponding discrete Masked Generative START (MaskSTART) framework, which learns a compact, tokenized representation of the scene, enabling efficient, high-resolution, and layout-guided compositional generation. Comprehensive experiments validate LiSTAR's state-of-the-art performance across 4D LiDAR reconstruction, prediction, and conditional generation, with substantial quantitative gains: reducing generation MMD by a massive 76%, improving reconstruction IoU by 32%, and lowering prediction L1 Med by 50%. This level of performance provides a powerful new foundation for creating realistic and controllable autonomous systems simulations. Project link: https://ocean-luna.github.io/LiSTAR.gitub.io.

[48] VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

Zishan Xu,Yifu Guo,Yuquan Lu,Fengyu Yang,Junxin Li

Main category: cs.CV

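TL;DR: Proposes VideoSeg-R1, the first framework to bring reinforcement learning into video reasoning segmentation; its decoupled architecture combines referring image segmentation with video mask propagation and achieves state-of-the-art performance on multiple benchmarks.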

Details

Motivation: Traditional methods rely on supervised fine-tuning, which generalizes poorly, lacks explicit reasoning, and struggles with out-of-distribution scenarios. Method: A three-stage decoupled architecture: hierarchical text-guided frame sampling, a reasoning model that produces spatial cues and explicit reasoning chains, and segmentation propagation based on SAM2 and XMem; a task difficulty-aware mechanism adaptively controls reasoning length. Result: Evaluations on multiple benchmarks confirm superior performance on complex video reasoning and segmentation, balancing efficiency and accuracy. Conclusion: Through reinforcement learning and explicit reasoning, VideoSeg-R1 improves the generalization and interpretability of video segmentation in out-of-distribution scenarios and opens a new direction for future research. Abstract: Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) a hierarchical text-guided frame sampler to emulate human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.

[49] SpectralTrain: A Universal Framework for Hyperspectral Image Classification

Meihua Zhou,Liping Yu,Jiawei Cai,Wai Kin Fung,Ruiguo Hu,Jiarui Zhao,Wenzhuo Liu,Nan Wan

Main category: cs.CV

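TL;DR: Proposes SpectralTrain, a universal training framework that combines curriculum learning with PCA-based spectral downsampling to make hyperspectral image classification training markedly more efficient; it is compatible with many models and delivers 2-7x speedups on several datasets.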

Details

Motivation: Deep learning for hyperspectral image classification involves large-scale data and high computational cost, which limits deployment in real remote sensing tasks, so more efficient training methods are needed. Method: SpectralTrain combines principal component analysis (PCA)-based spectral downsampling with curriculum learning (CL), gradually increasing spectral complexity to cut computation while preserving essential information, independent of specific architectures, optimizers, or loss functions. Result: Experiments on Indian Pines, Salinas-A, and the newly introduced CloudPatch-7 show 2-7x training speedups with small-to-moderate accuracy changes depending on the backbone, along with good generalization across scales and scenes, notably in cloud classification. Conclusion: As an architecture-agnostic training strategy, SpectralTrain complements model design and supports practical deployment of hyperspectral image classification, with particular potential in climate-related remote sensing. Abstract: Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral-spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets (Indian Pines, Salinas-A, and the newly introduced CloudPatch-7) demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent training speedups of 2-7x, with small-to-moderate accuracy deltas depending on the backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.
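
"Gradually introducing spectral complexity" suggests a schedule in which early epochs train on few principal components and later epochs on more. The abstract does not give the schedule, so the linear ramp below is an assumption for illustration only, and the name `components_for_epoch` and its parameters are hypothetical. The PCA projection itself is left out; only the curriculum over the number of components is shown.

```python
def components_for_epoch(epoch, total_epochs, min_components, max_components):
    """Number of PCA components to keep at a given epoch.

    Assumes a linear ramp from `min_components` at epoch 0 to
    `max_components` at the final epoch (epochs are numbered from 0).
    """
    if total_epochs <= 1:
        return max_components
    fraction = epoch / (total_epochs - 1)
    return min_components + round(fraction * (max_components - min_components))


if __name__ == "__main__":
    schedule = [components_for_epoch(e, 5, 4, 20) for e in range(5)]
    print(schedule)
```

Early epochs see low-dimensional inputs and are therefore cheaper, which is where a speedup of this kind would come from; the final epochs use the full component budget.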

[50] Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments

Renxiang Xiao,Wei Liu,Yuanfan Zhang,Yushuai Chen,Jinming Chen,Zilu Wang,Liang Hu

Main category: cs.CV

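TL;DR: Rad-GS is a 4D radar-camera SLAM system built on a 3D Gaussian representation for kilometer-scale outdoor environments; it combines radar point clouds with image information to achieve accurate localization and reconstruction.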

Details

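Motivation: Conventional visual or LiDAR SLAM is easily disturbed by dynamic objects and lighting changes in large outdoor environments, whereas 4D mmWave radar is robust to environmental conditions but has low spatial resolution and is hard to use directly for high-quality scene reconstruction. A new method that combines the strengths of radar and cameras is therefore needed. Method: Rad-GS uses 3D Gaussians as a differentiable spatial representation; it combines Doppler information from raw radar point clouds with geometrically enhanced point clouds to mask dynamic objects and remove rendering artifacts; it uses unsynchronized image frames to refine the Gaussian representation globally; and it adopts a global octree structure with a Gaussian primitive management strategy to suppress noise and reduce memory consumption. Result: Experiments show that Rad-GS matches traditional 3D Gaussian methods based on camera or LiDAR input in localization accuracy and novel view synthesis quality, and kilometer-scale real-world scenes confirm its large-scale reconstruction ability. Conclusion: Rad-GS achieves robust large-scale outdoor mapping with 4D mmWave radar, demonstrating the feasibility and potential of radar-camera fusion for large-scale scene reconstruction. Abstract: We present Rad-GS, a 4D radar-camera SLAM system designed for kilometer-scale outdoor environments, utilizing 3D Gaussians as a differentiable spatial representation. Rad-GS combines the advantages of raw radar point clouds with Doppler information and geometrically enhanced point clouds to guide dynamic object masking in synchronized images, thereby alleviating rendering artifacts and improving localization accuracy. Additionally, unsynchronized image frames are leveraged to globally refine the 3D Gaussian representation, enhancing texture consistency and novel view synthesis fidelity. Furthermore, the global octree structure coupled with a targeted Gaussian primitive management strategy further suppresses noise and significantly reduces memory consumption in large-scale environments. Extensive experiments and ablation studies demonstrate that Rad-GS achieves performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs, highlighting the feasibility of robust outdoor mapping using 4D mmWave radar. Real-world reconstruction at kilometer scale validates the potential of Rad-GS for large-scale scene reconstruction.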

[51] T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

Shao-Jun Xia,Huixin Zhang,Zhengzhong Tu

Main category: cs.CV

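TL;DR: This paper proposes T2T-VICL, a fully collaborative pipeline for studying the potential of vision-language models (VLMs) for cross-task visual in-context learning (cross-task VICL). It designs a mechanism to generate and select text prompts, builds the first cross-task VICL dataset, and combines perceptual score-based reasoning with traditional evaluation metrics. The method achieves top-tier or second-tier results across multiple cross-task scenarios, extending the boundaries of cross-task VICL in VLMs.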

Details

Motivation: To explore whether vision-language models can still perform in-context learning when the visual prompt and the target image come from different visual tasks, and to advance cross-task visual in-context learning. Method: Designs a mechanism to generate and select text prompts that describe the differences between distinct low-level vision tasks, builds the first cross-task VICL dataset, and proposes a new inference framework that combines perceptual score-based reasoning with traditional evaluation metrics. Result: Achieves top-tier performance in nine cross-task scenarios and second-tier performance in ten more, improving cross-task VICL results. Conclusion: T2T-VICL unlocks the potential of vision-language models for cross-task in-context learning and demonstrates that VICL is feasible and beneficial under cross-task conditions. Abstract: In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In this paper, we propose a fully collaborative pipeline, T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.

[52] Clustered Error Correction with Grouped 4D Gaussian Splatting

Taeho Kang,Jaeyeon Park,Kyungjin Lee,Youngki Lee

Main category: cs.CV

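TL;DR: Proposes a new 4D Gaussian Splatting method (CEM-4DGS) that improves the spatio-temporal consistency and rendering quality of dynamic scene reconstruction through elliptical error clustering, error-correcting splat addition, and a grouping strategy.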

Details

Motivation: Existing 4D Gaussian Splatting methods suffer from ambiguous pixel correspondences and inadequate densification in dynamic regions when reconstructing dynamic scenes, which hurts reconstruction accuracy and consistency. Method: (1) Classifies rendering errors into missing-color and occlusion types, clusters them with elliptical error clustering, and corrects them and initializes new splats via backprojection or foreground splitting; (2) introduces a grouped 4D Gaussian Splatting strategy that improves the consistency of the mapping between splats and dynamic objects. Result: Achieves state-of-the-art perceptual rendering quality on the Neural 3D Video and Technicolor datasets, with a 0.39 dB PSNR gain on the Technicolor Light Field dataset; visualizations show better alignment between splats and dynamic objects and more effective error identification and correction. Conclusion: The method improves the quality and temporal consistency of 4D reconstruction of dynamic scenes and offers a more robust solution for applying 4D Gaussian Splatting. Abstract: Existing 4D Gaussian Splatting (4DGS) methods struggle to accurately reconstruct dynamic scenes, often failing to resolve ambiguous pixel correspondences and inadequate densification in dynamic regions. We address these issues by introducing a novel method composed of two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition that pinpoints dynamic areas to improve and initialize fitting splats, and (2) Grouped 4D Gaussian Splatting that improves consistency of mapping between splats and represented dynamic objects. Specifically, we classify rendering errors into missing-color and occlusion types, then apply targeted corrections via backprojection or foreground splitting guided by cross-view color consistency. Evaluations on Neural 3D Video and Technicolor datasets demonstrate that our approach significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, improving PSNR by 0.39 dB on the Technicolor Light Field dataset. Our visualization shows improved alignment between splats and dynamic objects, and the error correction method's capability to identify errors and properly initialize new splats. Our implementation details and source code are available at https://github.com/tho-kn/cem-4dgs.

[53] Decoupling Complexity from Scale in Latent Diffusion Model

Tianxiong Zhong,Xingye Tian,Xuebo Wang,Boyuan Jiang,Xin Tao,Pengfei Wan

Main category: cs.CV

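TL;DR: Proposes DCS-LDM, a new visual generation model that decouples information complexity from scale; by building a hierarchical, scale-independent latent space, it supports flexible generation at arbitrary resolutions and frame rates.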

Details

Motivation: Existing latent diffusion models couple scale with content complexity, which leads to unreasonable latent capacity requirements; in practice, content complexity should be modeled independently of scale (such as resolution and frame rate). Method: Designs a hierarchical, scale-independent latent space that uses multi-level latent tokens to model structural and detail information separately, decodes a fixed latent representation to arbitrary resolutions and frame rates, and supports progressive coarse-to-fine generation. Result: Experiments show that DCS-LDM performs comparably to current state-of-the-art methods while supporting flexible generation across scales and visual qualities, giving a flexible tradeoff between computation and quality. Conclusion: DCS-LDM decouples complexity from scale in visual generation, providing a more efficient latent representation and flexible generation, and offers a new paradigm for general visual generation. Abstract: Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame-rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale serving only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.

[54] VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation

Chenyang Wu,Jiayi Fu,Chun-Le Guo,Shuhao Han,Chongyi Li

Main category: cs.CV

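TL;DR: Proposes VTinker, a new video frame interpolation method with two core components, guided flow upsampling (GFU) and Texture Mapping, which addresses the blur, mosaic, and ghosting problems of motion estimation for high-resolution frames and markedly improves interpolation quality.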

Details

Motivation: Existing flow-based video frame interpolation methods estimate motion at low resolution and obtain high-resolution flows by simple upsampling, which tends to blur or mosaic the edges and fails to capture fine-grained motion, causing misalignment and leaving ghosting and discontinuities in the interpolated result. Method: Proposes the VTinker pipeline. Guided flow upsampling (GFU) first uses the input frames as guidance to improve flow upsampling and sharpen edges; Texture Mapping then generates an intermediate proxy frame and uses it to select clear texture blocks from the input frames for reconstruction, avoiding pixel-level artifacts. Result: Experiments on multiple datasets show that VTinker reaches state-of-the-art performance in both quantitative metrics and visual quality. Conclusion: With GFU and Texture Mapping, VTinker improves flow accuracy and detail recovery in high-resolution video frame interpolation, reduces ghosting and discontinuities, and achieves SOTA performance. Abstract: Due to large pixel movement and high computational cost, estimating the motion of high-resolution frames is challenging. Thus, most flow-based Video Frame Interpolation (VFI) methods first predict bidirectional flows at low resolution and then use high-magnification upsampling (e.g., bilinear) to obtain the high-resolution ones. However, this kind of upsampling strategy may cause blur or mosaic at the flows' edges. Additionally, the motion of fine pixels at high resolution cannot be adequately captured in motion estimation at low resolution, which leads to the misalignment of task-oriented flows. With such inaccurate flows, input frames are warped and combined pixel-by-pixel, resulting in ghosting and discontinuities in the interpolated frame. In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping. After motion estimation at low resolution, GFU introduces input frames as guidance to alleviate the blurred details in bilinearly upsampled flows, which makes the flows' edges clearer. Subsequently, to avoid pixel-level ghosting and discontinuities, Texture Mapping generates an initial interpolated frame, referred to as the intermediate proxy. The proxy serves as a cue for selecting clear texture blocks from the input frames, which are then mapped onto the proxy to facilitate producing the final interpolated frame via a reconstruction module. Extensive experiments demonstrate that VTinker achieves state-of-the-art performance in VFI. Code is available at: https://github.com/Wucy0519/VTinker.

Caixin Kang,Yifei Huang,Liangyang Ouyang,Mingfang Zhang,Ruicong Liu,Yoichi Sato

Main category: cs.CV

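TL;DR: This paper introduces the Multimodal Interactive Deception Assessment (MIDA) task and a new dataset, reveals the shortcomings of existing MLLMs in understanding social cues and judging deception, and proposes the SoCoT and DSEM methods to improve models' social reasoning.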

Details

Motivation: Existing multimodal large language models lack the human ability to "read the room" and see through deception, and struggle to reason effectively in complex social interactions; quantitative evaluation and methods for improvement are needed. Method: Proposes the MIDA task and a multimodal dataset with ground-truth labels, and builds a benchmark covering 12 state-of-the-art MLLMs; designs a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module to strengthen social cognition modeling. Result: Experiments show that existing models (such as GPT-4o) perform poorly on the task, with a significant performance gap; the proposed SoCoT and DSEM framework improves performance on this challenging task. Conclusion: Current MLLMs have fundamental deficiencies in social cognition and multimodal context understanding; explicit mental-state modeling and social reasoning mechanisms are needed to build more perceptive and trustworthy AI systems. Abstract: Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to "read the room" and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.

[56] How Noise Benefits AI-generated Image Detection

Jiazhen Yan,Ziqiang Li,Fan Wang,Kai Zeng,Zhangjie Fu

Main category: cs.CV

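TL;DR: This paper proposes PiN-CLIP, a new method that introduces positive-incentive noise in the feature space to improve the out-of-distribution generalization of AI-generated image detection, achieving leading performance on an open-world dataset covering 42 generative models.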

Details

Motivation: Existing AI-generated image detectors generalize poorly out of distribution, mainly because they rely on spurious shortcut features during training, which makes them less robust. Method: Proposes PiN-CLIP, which jointly trains a noise generator and a detection network under a variational positive-incentive principle; positive-incentive noise is built in the feature space via cross-attention fusion of visual and categorical semantic features and injected into the feature space to fine-tune the visual encoder, suppressing shortcut features and amplifying stable forensic cues. Result: In experiments on an open-world dataset covering 42 generative models, the method exceeds existing approaches by 5.4 in average accuracy, setting a new state of the art. Conclusion: PiN-CLIP reduces the detector's reliance on shortcut features, improves the robustness and generalization of its feature representations, and offers a new, controllable optimization direction for generated-image detection. Abstract: The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training, and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.

[57] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

Boshen Xu,Zihan Xiao,Jiaze Li,Jianzhong Ju,Zhenbo Luo,Jian Luan,Qin Jin

Main category: cs.CV

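TL;DR: TimeViper is a hybrid Mamba-Transformer vision-language model for long video understanding; its TransV module transfers and compresses information from vision tokens into instruction tokens, allowing it to process hour-long videos of more than 10,000 frames.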

Details

Motivation: Long video understanding requires an efficient architecture and the ability to handle long temporal contexts; existing models suffer from inefficiency and redundancy on long sequences. Method: Adopts a hybrid Mamba-Transformer backbone and proposes the TransV module, which transfers and compresses vision-token information into instruction tokens, improving efficiency while maintaining multimodal understanding. Result: TimeViper can process hour-long videos exceeding 10,000 frames, is competitive with existing SOTA models on multiple benchmarks, and reveals a vision-to-text information aggregation phenomenon. Conclusion: TransV mitigates vision token redundancy, and the hybrid architecture provides an efficient and interpretable solution for long video understanding, advancing the development and interpretability study of Mamba-Transformer models. Abstract: We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.

Li Yu,Yingbo Zhao,Shiyu Wu,Siyue Yu,Moncef Gabbouj,Qingshan Liu

Main category: cs.CV

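TL;DR: Proposes a blind video quality enhancement method based on a pretrained Degradation Representation Learning (DRL) module and a hierarchical termination mechanism, improving both artifact removal performance and inference efficiency for compressed video.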

Details

Motivation: Existing non-blind video quality enhancement methods depend on known quantization parameters (QPs), which are often unknown in real transmission or transcoding, limiting their use. Existing blind methods capture only global degradation information and lack spatial detail, and most methods use one uniform architecture for all QPs, ignoring how computational demands differ across compression levels. Method: Designs a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide artifact removal, and introduces a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages according to the compression level, allocating computation adaptively. Result: At QP = 22, the PSNR improvement is 110% higher than that of a state-of-the-art blind method (from 0.31 dB to 0.65 dB), and the hierarchical termination mechanism halves the average inference time at QP = 22 relative to QP = 42. Conclusion: Through high-dimensional multiscale degradation representations and a dynamic inference mechanism, the method improves both performance and efficiency in blind video quality enhancement, making it more practical and adaptable. Abstract: Existing studies on Quality Enhancement for Compressed Video (QECV) predominantly rely on known Quantization Parameters (QPs), employing distinct enhancement models per QP setting, termed non-blind methods. However, in real-world scenarios involving transcoding or transmission, QPs may be partially or entirely unknown, limiting the applicability of such approaches and motivating the development of blind QECV techniques. Current blind methods generate degradation vectors via classification models with cross-entropy loss, using them as channel attention to guide artifact removal. However, these vectors capture only global degradation information and lack spatial details, hindering adaptation to varying artifact patterns at different spatial positions. To address these limitations, we propose a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide the artifact removal. Additionally, both blind and non-blind methods typically employ uniform architectures across QPs, thereby overlooking the varying computational demands inherent to different compression levels. We thus introduce a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages based on the compression level. Experimental results demonstrate that the proposed approach significantly enhances performance, achieving a PSNR improvement of 110% (from 0.31 dB to 0.65 dB) over a competing state-of-the-art blind method at QP = 22. Furthermore, the proposed hierarchical termination mechanism reduces the average inference time at QP = 22 by half compared to QP = 42.
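
The "110%" figure follows from the two dB values quoted in the abstract if it is read as a relative increase. A quick check, assuming that reading (the function name is only for illustration):

```python
def relative_gain_percent(baseline, improved):
    """Relative increase of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# dB values as quoted in the abstract for QP = 22.
gain = relative_gain_percent(baseline=0.31, improved=0.65)
print(f"{gain:.1f}%")  # about 109.7%, which rounds to the reported 110%
```

Note that the percentage describes the ratio of the two gains; the absolute difference between the methods is 0.34 dB.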

[59] SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction

Guolin Huang,Wenting Chen,Jiaqi Yang,Xinheng Lyu,Xiaoling Luo,Sen Yang,Xiaohan Xing,Linlin Shen

Main category: cs.CV

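TL;DR: This paper proposes SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction, addressing the shortcomings of existing methods in transparency, multimodal integration, and use of historical experience.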

Details

Motivation: Existing cancer survival analysis methods lack clinical interpretability and cannot effectively integrate multimodal data, explore regions of interest, or learn from historical cases. Method: SurvAgent has two stages. The first builds a WSI-Gene CoT-enhanced case bank through low-magnification screening, cross-modal similarity-aware patch mining, and confidence-aware patch mining, combined with gene-stratified analysis, producing structured reports with reasoning paths. The second performs dichotomy-based multi-expert agent inference, retrieving similar cases via RAG and fusing multimodal reports with expert predictions through progressive interval refinement. Result: Experiments on five TCGA cohorts show that SurvAgent outperforms conventional methods, proprietary MLLMs, and medical agents, improving the accuracy and interpretability of survival prediction. Conclusion: SurvAgent establishes a new paradigm for explainable AI-driven survival prediction in precision oncology, with good prospects for clinical use. Abstract: Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superiority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.

[60] Real-Time 3D Object Detection with Inference-Aligned Learning

Chenyu Zhao,Xianwei Zheng,Zimin Xia,Linwei Yue,Nan Xue

Main category: cs.CV

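TL;DR: Proposes SR3D, a new 3D object detection framework for indoor point clouds that narrows the gap between training and inference through spatial prioritization and rank awareness, markedly improving detection accuracy while keeping real-time speed.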

Details

Motivation: Existing 3D object detection methods are inconsistent between training and inference and lack spatial reliability and ranking awareness during training, which hurts their performance in practice. Method: Proposes the SR3D framework with two key components: a spatial-prioritized optimal transport assignment that dynamically emphasizes well-located, spatially reliable samples, and a rank-aware adaptive self-distillation scheme that injects ranking perception through self-distillation. Result: Experiments on ScanNet V2 and SUN RGB-D show that SR3D significantly outperforms prior methods in accuracy while maintaining real-time speed. Conclusion: SR3D bridges the gap between training and inference and improves 3D object detection, making it suitable for applications that need real-time dynamic scene understanding, such as augmented reality, robotics, and navigation. Abstract: Real-time 3D object detection from point clouds is essential for dynamic scene understanding in applications such as augmented reality, robotics and navigation. We introduce a novel Spatial-prioritized and Rank-aware 3D object detection (SR3D) framework for indoor point clouds, to bridge the gap between how detectors are trained and how they are evaluated. This gap stems from the lack of spatial reliability and ranking awareness during training, which conflicts with the ranking-based prediction selection used at inference. Such a training-inference gap hampers the model's ability to learn representations aligned with inference-time behavior. To address the limitation, SR3D consists of two components tailored to the spatial nature of point clouds during training: a novel spatial-prioritized optimal transport assignment that dynamically emphasizes well-located and spatially reliable samples, and a rank-aware adaptive self-distillation scheme that adaptively injects ranking perception via a self-distillation paradigm. Extensive experiments on ScanNet V2 and SUN RGB-D show that SR3D effectively bridges the training-inference gap and significantly outperforms prior methods in accuracy while maintaining real-time speed.

[61] Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Ziyu Guo,Renrui Zhang,Hongyu Li,Manyuan Zhang,Xinyan Chen,Sifan Wang,Yan Feng,Peng Pei,Pheng-Ann Heng

Main category: cs.CV

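TL;DR: This paper proposes the Thinking-while-Generating (TwiG) framework, the first to interleave textual reasoning with visual generation during the generation process, making the generated content more semantically rich and context-aware.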

Details

Motivation: Existing visual generation methods use textual reasoning only before or after generation and lack a mechanism for on-the-fly multimodal interaction during generation, which makes real-time adjustment and global coordination difficult. Method: Proposes the TwiG framework, which interleaves textual reasoning steps with visual generation to guide upcoming regions and reflect on content already generated, and explores three strategies: zero-shot prompting, supervised fine-tuning on the TwiG-50K dataset, and reinforcement learning with a customized TwiG-GRPO strategy. Result: Textual reasoning and visual content co-evolve during generation, improving the semantic consistency and detail quality of the outputs and demonstrating the potential of interleaved reasoning. Conclusion: TwiG introduces a new dynamic reasoning paradigm for visual generation and shows the effectiveness of interleaved textual reasoning, which may encourage research on generative models with stronger reasoning abilities. Abstract: Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.

[62] A Spatial Semantics and Continuity Perception Attention for Remote Sensing Water Body Change Detection

Quanqing Ma,Jiaen Chen,Peng Wang,Yao Zheng,Qingzhan Zhao,Yuchen Zheng

Main category: cs.CV

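TL;DR: Proposes HSRW-CD, a new high-resolution remote sensing dataset for water body change detection, and designs the SSCP attention module to fully exploit the spatial semantics and structural information in deep features, markedly improving the accuracy and generalization of water body change detection.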

Details

Motivation: High-spatial-resolution datasets for water body change detection are scarce, and deep learning methods do not fully exploit the spatial semantics and structural information in the deep features of change detection networks, which limits accurate application in urban and rural regions. Method: Builds HSRW-CD, a new dataset with a spatial resolution higher than 3 meters, and proposes the SSCP attention module, composed of Multi-Semantic spatial Attention (MSA), Structural Relation-aware Global Attention (SRGA), and Channel-wise Self-Attention (CSA), as a plug-and-play component for existing water body change detection models. Result: Extensive experiments on the HSRW-CD and Water-CD datasets confirm the effectiveness and generalization of the SSCP module, which improves water body change detection performance. Conclusion: The SSCP module strengthens the semantic and structural representation of water body features and, together with the new high-resolution HSRW-CD dataset, advances remote sensing water body change detection; the code and dataset are to be released publicly. Abstract: Remote sensing Water Body Change Detection (WBCD) aims to detect water body surface changes from bi-temporal images of the same geographic area. Currently, the scarcity of high spatial resolution datasets for WBCD restricts its application in urban and rural regions, which require more accurate positioning. Meanwhile, previous deep learning-based methods fail to comprehensively exploit the spatial semantic and structural information in deep features in the change detection networks. To resolve these concerns, we first propose a new dataset, HSRW-CD, with a spatial resolution higher than 3 meters for WBCD. Specifically, it contains a large number of image pairs, widely covering various water body types. In addition, a Spatial Semantics and Continuity Perception (SSCP) attention module is designed to fully leverage both the spatial semantics and structure of deep features in the WBCD networks, significantly improving the discrimination capability for water bodies. The proposed SSCP has three components: the Multi-Semantic spatial Attention (MSA), the Structural Relation-aware Global Attention (SRGA), and the Channel-wise Self-Attention (CSA). The MSA enhances the spatial semantics of water body features and provides precise spatial semantic priors for the CSA. Then, the SRGA further extracts spatial structure to learn the spatial continuity of the water body. Finally, the CSA utilizes the spatial semantic and structural priors from the MSA and SRGA to compute the similarity across channels. Specifically designed as a plug-and-play module for water body deep features, the proposed SSCP allows integration into existing WBCD models. Numerous experiments conducted on the proposed HSRW-CD and Water-CD datasets validate the effectiveness and generalization of the SSCP. The code of this work and the HSRW-CD dataset will be available at https://github.com/QingMa1/SSCP.

[63] LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM

Sibaek Lee,Seongbo Ha,Kyeongsu Kang,Joonyeol Choi,Seungjun Tak,Hyeonwoo Yu

Main category: cs.CV

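TL;DR: LEGO-SLAM is the first 3DGS-based SLAM framework to achieve real-time, open-vocabulary mapping; a scene-adaptive encoder-decoder compresses high-dimensional language embeddings into 16-dimensional features, enabling semantic understanding, real-time rendering, Gaussian pruning, and language-guided loop detection.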

Details

Motivation: Existing 3DGS SLAM systems lack open-vocabulary semantic understanding, and integrating language features brings high memory overhead, slow rendering, and poor model adaptability. Method: Proposes LEGO-SLAM, which uses a scene-adaptive encoder-decoder to compress high-dimensional language features to 16 dimensions, introduces a language-guided Gaussian pruning strategy to remove redundancy, and reuses the compact features for language-based loop detection without an additional model. Result: Runs in real time at 15 FPS while maintaining rendering quality and localization accuracy, reduces the number of Gaussians by more than 60%, and performs semantic loop detection without a separate detection module. Conclusion: LEGO-SLAM delivers efficient, adaptive, and lightweight open-vocabulary SLAM, advancing the ability of embodied agents to interact semantically in unknown environments. Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map's Gaussian count by over 60% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.
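
To get a feel for why a 16-dimensional feature matters, the following back-of-the-envelope sketch compares per-map feature memory before and after compression and pruning. Only the 16-dimensional width and the "over 60%" pruning figure come from the abstract; the map size, the 512-dimensional input width, and float32 storage are hypothetical assumptions chosen for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: features stored as 32-bit floats


def feature_bytes(num_gaussians, feature_dim):
    """Memory needed to store one feature vector per Gaussian."""
    return num_gaussians * feature_dim * BYTES_PER_FLOAT32


def gaussians_after_pruning(num_gaussians, pruned_percent):
    """Gaussians remaining after removing `pruned_percent` percent of them."""
    return num_gaussians * (100 - pruned_percent) // 100


num_gaussians = 1_000_000  # hypothetical map size
raw_dim = 512              # hypothetical language-embedding width
compact_dim = 16           # compact feature width stated in the abstract

raw = feature_bytes(num_gaussians, raw_dim)
compact = feature_bytes(num_gaussians, compact_dim)
pruned = feature_bytes(gaussians_after_pruning(num_gaussians, 60), compact_dim)

print("compression factor from dimension alone:", raw // compact)
print("feature memory in MB (raw, compact, compact + pruned):",
      raw / 1e6, compact / 1e6, pruned / 1e6)
```

Under these assumed numbers the dimension reduction dominates the saving, and pruning reduces the remaining footprint further.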

[64] Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval

Chunxu Liu,Jiyuan Yang,Ruopeng Gao,Yuhan Zhu,Feng Zhu,Rui Zhao,Limin Wang

Main category: cs.CV

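TL;DR: This paper proposes Reasoning Guided Embeddings (RGE), which incorporates the reasoning capability of multimodal large language models into the embedding process to improve multimodal representation quality.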

Details

Motivation: Existing methods for extracting multimodal embeddings ignore the generative reasoning capability of MLLMs and so do not use its potential to enhance representations. Method: Proposes RGE, which first has the model generate a structured rationale conditioned on the instruction, then extracts the representation after the reasoning has unfolded, and trains with contrastive learning. Result: On the MMEB benchmark, RGE improves multimodal retrieval performance by 4.9% over the non-reasoning baseline. Conclusion: Explicitly introducing a reasoning process improves the quality of multimodal embeddings, confirming that the reasoning capability of MLLMs is useful for representation learning. Abstract: Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.
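
The abstract states only that RGE is coupled with contrastive training. As a generic illustration of what such an objective looks like, here is a minimal InfoNCE-style loss over paired query and target embeddings. This is a common contrastive formulation and an assumption on our part; the loss, temperature, and batch construction actually used by RGE may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss; targets[i] is the positive for queries[i],
    every other target in the batch serves as a negative."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, target) / temperature for target in targets]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings: aligned pairs give a low loss, mismatched pairs a high one.
queries = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(queries, [[1.0, 0.0], [0.0, 1.0]]))  # well aligned
print(info_nce(queries, [[0.0, 1.0], [1.0, 0.0]]))  # positives swapped
```

In a reasoning-guided setup, the query embedding would be read out after the model has generated its rationale, while the loss itself stays unchanged.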

[65] Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

Jian Ma,Qirong Peng,Xujie Zhu,Peixing Xie,Chen Chen,Haonan Lu

Main category: cs.CV

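TL;DR: Proposes PPCL, a flexible structured pruning framework that reduces the computational cost of Diffusion Transformers for image generation, cutting parameters by 50% with less than 3% degradation in key metrics.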

Details

Motivation: Diffusion Transformers perform very well but have large parameter counts and high computational cost, which makes them hard to deploy in resource-constrained settings, so efficient model compression is needed. Method: Identifies redundant layer intervals through linear probing combined with first-order differential trend analysis of similarity metrics, and designs a plug-and-play teacher-student alternating distillation scheme that unifies depth-wise and width-wise pruning. Result: Experiments on multiple Multi-Modal Diffusion Transformer models show that PPCL reduces parameters by 50% with less than 3% degradation in key metrics while maintaining high-quality image generation. Conclusion: PPCL is an efficient, flexible pruning method for DiTs that suits resource-constrained scenarios and has practical deployment value. Abstract: Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code and checkpoints for PPCL can be found at the following link: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.
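
One way to picture "first-order differential trend analysis of similarity metrics" is to look for contiguous runs of layers across which a per-layer similarity score barely changes. The sketch below is a loose interpretation under that assumption: the flatness criterion, the tolerance, the function name, and the toy scores are all hypothetical, and the linear probing step that would produce the scores is not shown.

```python
def redundant_intervals(similarity, flat_tol=0.01, min_len=2):
    """Return inclusive (start, end) layer-index intervals over which the
    per-layer similarity score changes by at most `flat_tol` per step."""
    # diffs[i] is the first-order difference between layer i and layer i + 1.
    diffs = [similarity[i + 1] - similarity[i] for i in range(len(similarity) - 1)]
    intervals = []
    start = None
    for i, diff in enumerate(diffs):
        if abs(diff) <= flat_tol:
            if start is None:
                start = i  # a flat run begins at layer i
        else:
            # The run of flat steps ends at layer i (inclusive).
            if start is not None and i - start + 1 >= min_len:
                intervals.append((start, i))
            start = None
    if start is not None and len(similarity) - start >= min_len:
        intervals.append((start, len(similarity) - 1))
    return intervals


# Toy per-layer scores (hypothetical): layers 2..5 form a plateau.
scores = [0.2, 0.5, 0.8, 0.805, 0.81, 0.808, 0.6, 0.3]
print(redundant_intervals(scores))
```

Intervals found this way would only be candidates; in the described pipeline the distillation stage is what compensates for removing them.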

[66] Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

Yibin Huang,Wang Xu,Wanyue Zhang,Helu Zhi,Jingjing Huang,Yangbin Xu,Yangang Sun,Conghui Zhu,Tiejun Zhao

Main category: cs.CV

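TL;DR: This paper proposes Video2Layout, a new framework that reconstructs metric-grounded spatial layouts from video using continuous object boundary coordinates, improving the fine-grained spatial reasoning of multimodal large language models.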

Details

Motivation: Existing grid-based cognitive map methods rely on discretized raster representations, which limits fine-grained spatial reasoning. Overcoming this limitation requires a more precise spatial representation. Method: The method has two stages. First, in the supervised fine-tuning stage, a high-quality dataset is built with the AI2THOR simulator so that the model learns the mapping from visual inputs to precise boundary coordinates. Then, a reinforcement fine-tuning stage further improves the model's real-world generalization. The paper also introduces QVS-Bench, a benchmark for systematically evaluating the relationship between cognitive map accuracy and the number of input images. Result: On QVS-Bench and mainstream spatial reasoning benchmarks, the proposed V2LO-7B model achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of the method. Conclusion: Reconstructing spatial layouts with continuous object boundary coordinates effectively improves the quantitative spatial computation and spatial understanding accuracy of multimodal large language models, outperforming traditional discretized grid representations. Abstract: Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model's ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model's real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms.
Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
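
As a rough illustration of why continuous boundary coordinates make distances and sizes directly computable, the sketch below measures object size and the gap between two objects from axis-aligned boxes. The `Box` format, the metre units, and the helper names are assumptions made for this example; they are not taken from the Video2Layout code.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned object footprint in metres (hypothetical layout format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def size(b: Box) -> tuple:
    """Width and depth of the object, read straight off its boundary coordinates."""
    return (b.x_max - b.x_min, b.y_max - b.y_min)


def gap(a: Box, b: Box) -> float:
    """Shortest distance between two boxes; 0.0 if they touch or overlap."""
    dx = max(a.x_min - b.x_max, b.x_min - a.x_max, 0.0)
    dy = max(a.y_min - b.y_max, b.y_min - a.y_max, 0.0)
    return math.hypot(dx, dy)


# Illustrative objects, not data from the paper.
sofa = Box(0.0, 0.0, 2.0, 1.0)
table = Box(5.0, 5.0, 6.0, 6.0)
print(size(sofa), gap(sofa, table))
```

A grid map would round these values to the cell size, which is the discretization loss the abstract refers to.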

[67] Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion

Lirui Zhang,Zhengkai Zhao,Zhi Zuo,Pan Gao,Jie Qin

Main category: cs.CV

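TL;DR: Proposes Simba, a framework that recasts point-wise transformation regression for point clouds as a distribution learning problem, combining symmetry priors with diffusion models to improve detail preservation and structural integrity in point cloud completion.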

Details

Motivation: Existing regression-based point cloud completion methods overfit easily and are sensitive to noise, making it hard to preserve local detail while keeping global structure consistent. Method: Point-wise transformation regression is recast as a distribution learning task, using a diffusion model combined with symmetry priors, and a hierarchical Mamba architecture is designed for high-fidelity upsampling. Result: The method reaches SOTA performance on the PCN, ShapeNet, and KITTI datasets, with markedly better robustness and generalization. Conclusion: Simba effectively addresses the overfitting and noise sensitivity of regression methods, preserving fine-grained detail while ensuring overall structural integrity. Abstract: Point cloud completion is a fundamental task in 3D vision. A persistent challenge in this field is simultaneously preserving fine-grained details present in the input while ensuring the global structural integrity of the completed shape. While recent works leveraging local symmetry transformations via direct regression have significantly improved the preservation of geometric structure details, these methods suffer from two major limitations: (1) These regression-based methods are prone to overfitting, tending to memorize instance-specific transformations instead of learning a generalizable geometric prior. (2) Their reliance on point-wise transformation regression leads to high sensitivity to input noise, severely degrading their robustness and generalization. To address these challenges, we introduce Simba, a novel framework that reformulates point-wise transformation regression as a distribution learning problem. Our approach integrates symmetry priors with the powerful generative capabilities of diffusion models, avoiding instance-specific memorization while capturing robust geometric structures. Additionally, we introduce a hierarchical Mamba-based architecture to achieve high-fidelity upsampling. Extensive experiments across the PCN, ShapeNet, and KITTI benchmarks validate our method's state-of-the-art (SOTA) performance.

[68] Layer-wise Noise Guided Selective Wavelet Reconstruction for Robust Medical Image Segmentation

Yuting Lu,Ziliang Wang,Weixin Xu,Wei Zhang,Yongqiang Zhao,Yang Yu,Xiaohong Zhang

Main category: cs.CV

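TL;DR: Proposes LNG-SWR, a new method that improves the robustness of medical image segmentation models under distribution shift and adversarial attack through layer-wise noise-guided, frequency-adaptive reconstruction, without sacrificing clean-data performance; it is low-overhead, plug-and-play, and backbone-agnostic.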

Details

Motivation: Existing adversarial training methods face a trade-off between clean performance and robustness and have high training cost, which makes wide deployment in medical image segmentation difficult. A more efficient, low-cost, and scalable way to improve robustness is needed. Method: Small zero-mean noise is injected at multiple network layers to learn a frequency-bias prior, which guides selective wavelet reconstruction: noise-sensitive bands are suppressed, directional structures and shape cues are enhanced, and boundary responses are stabilized while spectral consistency is maintained. The method can be combined with adversarial training or used on its own. Result: On CT and ultrasound datasets, LNG-SWR significantly reduces the performance drop under PGD-L∞/L2 and SSAH attacks while improving Dice/IoU on clean samples; combined with adversarial training it yields additional gains without sacrificing the original performance. Conclusion: LNG-SWR offers a simple, effective, engineering-friendly, and scalable path to more robust medical image segmentation, applicable to both adversarial and standard training. Abstract: Clinical deployment requires segmentation models to stay stable under distribution shifts and perturbations. The mainstream solution is adversarial training (AT) to improve robustness; however, AT often brings a clean--robustness trade-off and high training/tuning cost, which limits scalability and maintainability in medical imaging. We propose \emph{Layer-wise Noise-Guided Selective Wavelet Reconstruction (LNG-SWR)}. During training, we inject small, zero-mean noise at multiple layers to learn a frequency-bias prior that steers representations away from noise-sensitive directions. We then apply prior-guided selective wavelet reconstruction on the input/feature branch to achieve frequency adaptation: suppress noise-sensitive bands, enhance directional structures and shape cues, and stabilize boundary responses while maintaining spectral consistency. The framework is backbone-agnostic and adds little additional inference overhead. It can serve as a plug-in enhancement to AT and also improves robustness without AT. On CT and ultrasound datasets, under a unified protocol with PGD-$L_{\infty}/L_{2}$ and SSAH, LNG-SWR delivers consistent gains on clean Dice/IoU and significantly reduces the performance drop under strong attacks; combining LNG-SWR with AT yields additive gains. When combined with adversarial training, robustness improves further without sacrificing clean accuracy, indicating an engineering-friendly and scalable path to robust segmentation.
These results indicate that LNG-SWR provides a simple, effective, and engineering-friendly path to robust medical image segmentation in both adversarial and standard training regimes.

[69] An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

Zhi Luo,Zenghui Yuan,Wenqi Wei,Daizong Liu,Pan Zhou

Main category: cs.CV

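TL;DR: This paper proposes a new verbose-text induction attack (VTIA) that uses a two-stage framework to inject imperceptible adversarial perturbations into the images fed to vision-language models (VLMs), maximizing output token length and improving the attack's effectiveness, efficiency, and generalization.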

Details

Motivation: The number of tokens a VLM consumes during generation has become a key evaluation metric, and existing methods cannot lengthen outputs in a stable, controllable way, so a more effective mechanism that explicitly optimizes output length is needed. Method: A two-stage framework is used. First, an adversarial prompt search based on reinforcement learning finds prompts that induce the LLM to produce verbose outputs. Then, vision-aligned perturbation optimization makes the visual embeddings of the perturbed image similar to the embeddings of the adversarial prompt, which triggers verbose text generation. Result: Experiments on four mainstream VLMs show that the method substantially outperforms prior techniques at lengthening output tokens, with higher effectiveness, efficiency, and cross-model generalization. Conclusion: VTIA can effectively and controllably induce VLMs to generate highly redundant text, exposing a potential security risk in the deployment efficiency of current models and pointing to directions for defending against such attacks. Abstract: With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric. Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and controllability. To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token of the perturbed images. Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation.
Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.

[70] EvoVLA: Self-Evolving Vision-Language-Action Model

Zeting Liu,Zida Yang,Zeyu Zhang,Hao Tang

Main category: cs.CV

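TL;DR: EvoVLA is a self-supervised vision-language-action framework that addresses stage hallucination in long-horizon robotic manipulation through a stage-aligned reward, pose-based object exploration, and long-horizon memory, significantly improving task success and sample efficiency in both simulation and real-world settings.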

Details

Motivation: Existing VLA models suffer from stage hallucination in long-horizon, multi-step tasks: they exploit coarse evaluation signals to skip actual steps, which inflates reported task completion. Method: The EvoVLA framework has three components: 1) Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; 2) Pose-Based Object Exploration (POE), which drives curiosity with the relative pose between gripper and object; 3) Long-Horizon Memory, which stabilizes intrinsic reward shaping through selective context retention and gated fusion. Result: On the Discoverse-L benchmark, EvoVLA improves average task success over OpenVLA-OFT by 10.2 percentage points, reaching 69.2%; sample efficiency improves 1.5x and the stage hallucination rate drops from 38.5% to 14.8%; on real robots the average success rate over four tasks reaches 54.6%, 11 percentage points above the baseline. Conclusion: EvoVLA effectively mitigates stage hallucination in long-horizon manipulation, transfers well from simulation to the real world, generalizes well, and advances zero-shot robotic manipulation. Abstract: Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
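
The abstract says only that SAR uses triplet contrastive learning with hard negatives, so the following is a generic sketch of a margin-based triplet loss rather than EvoVLA's actual objective. The Euclidean distance, the margin value, and the toy embeddings are all assumptions made for this example.

```python
import math


def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard margin-based triplet loss (assumed form, not from the paper).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)


# Toy 2-D embeddings: an easy negative far from the anchor, a hard one close to it.
anchor, positive = [0.0, 0.0], [0.0, 0.0]
easy_negative, hard_negative = [3.0, 4.0], [0.0, 0.25]
print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

Only the hard negative produces a non-zero loss here, which is the usual argument for mining or generating hard negatives: easy ones contribute no training signal.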

[71] Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective

Jiahao Li,Yang Lu,Yachao Zhang,Yong Xie,Fangyong Wang,Yuan Xie,Yanyun Qu

Main category: cs.CV

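TL;DR: This paper proposes ReFocusing CLIP (RF-CLIP), a training-free method that analyzes CLIP's attention dispersion in dense prediction and emulates the human attention-refocusing mechanism to improve pixel-level vision-language alignment in open-vocabulary semantic segmentation, achieving state-of-the-art performance on eight benchmarks.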

Details

Motivation: Existing CLIP-based open-vocabulary semantic segmentation methods rarely examine the performance bottleneck in dense prediction from an interpretability perspective; in particular, the attention mechanism shows a phenomenon similar to human distraction, leaving target regions under-attended. Method: A systematic analysis of CLIP's internal mechanisms finds that irrelevant tokens caused by dimension-specific over-activation attract attention. RF-CLIP filters these distraction tokens and redistributes attention resources, strengthening alignment on target regions and giving finer-grained multimodal alignment. Result: RF-CLIP reaches SOTA performance on eight public benchmarks while keeping inference efficient, validating the attention-refocusing strategy. Conclusion: By emulating human attention refocusing, RF-CLIP effectively mitigates CLIP's distraction problem in dense prediction, improves open-vocabulary semantic segmentation, and offers a new approach to training-free optimization of CLIP. Abstract: Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP's vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from the perspective of interpretability mechanisms. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose ReFocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.

[72] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Yi Yang,Xueqi Li,Yiyang Chen,Jin Song,Yihan Wang,Zipeng Xiao,Jiadi Su,You Qiaoben,Pengfei Liu,Zhijie Deng

Main category: cs.CV

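TL;DR: This paper proposes Mantis, a framework that decouples visual foresight prediction from the backbone and uses meta queries with a diffusion Transformer head to improve vision-language-action models, reducing training overhead while strengthening instruction comprehension and reasoning.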

Details

Motivation: Existing VLA models that directly predict high-dimensional visual states incur high computational cost and information bottlenecks, and insufficient language supervision leaves them with weak comprehension and reasoning. Method: The Disentangled Visual Foresight (DVF) framework decouples visual prediction from the backbone, introducing meta queries and a diffusion Transformer (DiT) head with a residual connection for next-state prediction, so that latent actions are captured automatically and explicit action learning is strengthened. Result: After fine-tuning, the model reaches a 96.7% success rate on the LIBERO benchmark, surpassing strong baselines; empirically it outperforms existing models such as π₀.₅ in instruction following, generalization, and reasoning. Conclusion: Through its decoupled design, Mantis balances the tension between model capacity allocation and compression of supervisory signals, improving the efficiency, convergence speed, and multi-task capability of VLA models; code and weights are open source. Abstract: Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervision. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on the LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed.
Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.

[73] Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification

Nianchang Huang,Yi Xu,Ruida Xi,Ruida Xi,Qiang Zhang

Main category: cs.CV

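TL;DR: This paper proposes DSLGA, a two-stage model for unsupervised domain adaptation visible-infrared person re-identification (UDA-VI-ReID) that uses domain-shared learning and gradual alignment to mitigate cross-domain and cross-modality discrepancies, significantly outperforming existing methods in multiple experiments.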

Details

Motivation: Because public datasets differ from real-world data, existing VI-ReID methods perform poorly in practice, so a UDA-VI-ReID method is needed that transfers knowledge from public data to real data without annotating new samples. Method: The proposed DSLGA model has two stages. The first stage uses a Domain-Shared Learning Strategy (DSLS) to reduce the ineffective pre-training caused by inter-domain modality discrepancies. The second stage uses a Gradual Alignment Strategy (GAS) that aligns visible and infrared data across modalities in a cluster-to-holistic manner. A new testing method, CMDA-XD, is also constructed. Result: Extensive experiments show that the method significantly outperforms existing domain adaptation VI-ReID methods under various settings and even exceeds some supervised methods. Conclusion: With its two-stage strategy, DSLGA successfully handles the cross-domain and cross-modality discrepancies in UDA-VI-ReID, providing a feasible, high-performing solution for VI-ReID in real-world scenarios. Abstract: Recently, Visible-Infrared person Re-Identification (VI-ReID) has achieved remarkable performance on public datasets. However, due to the discrepancies between public datasets and real-world data, most existing VI-ReID algorithms struggle in real-life applications. To address this, we take the initiative to investigate Unsupervised Domain Adaptation Visible-Infrared person Re-Identification (UDA-VI-ReID), aiming to transfer the knowledge learned from the public data to real-world data without compromising accuracy or requiring the annotation of new samples. Specifically, we first analyze two basic challenges in UDA-VI-ReID, i.e., inter-domain modality discrepancies and intra-domain modality discrepancies. Then, we design a novel two-stage model, i.e., Domain-Shared Learning and Gradual Alignment (DSLGA), to handle these discrepancies. In the first pre-training stage, DSLGA introduces a Domain-Shared Learning Strategy (DSLS) to mitigate ineffective pre-training caused by inter-domain modality discrepancies via exploiting shared information between the source and target domains. In the second fine-tuning stage, DSLGA designs a Gradual Alignment Strategy (GAS) to handle the cross-modality alignment challenges between visible and infrared data caused by the large intra-domain modality discrepancies in a cluster-to-holistic manner. Finally, a new UDA-VI-ReID testing method, i.e., CMDA-XD, is constructed for training and testing different UDA-VI-ReID models.
Extensive experiments demonstrate that our method significantly outperforms existing domain adaptation methods for VI-ReID and even some supervised methods under various settings.

[74] PrIntMesh: Precise Intersection Surfaces for 3D Organ Mesh Reconstruction

Deniz Sayin Mercadier,Hieu Le,Yihong Chen,Jiancheng Yang,Udaranga Wickramasinghe,Pascal Fua

Main category: cs.CV

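TL;DR: PrIntMesh is a template-based, topology-preserving framework that jointly reconstructs an organ as a unified system, preserving anatomical plausibility and surface smoothness better than conventional methods.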

Details

Motivation: Existing deep-learning methods usually process an organ's substructures independently, producing anatomically implausible reconstructions; a method is needed that preserves the spatial relationships and topological consistency between substructures. Method: PrIntMesh starts from a connected template as the initial shape and jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and producing smooth, artifact-free surfaces. Result: The method is validated on heart, hippocampus, and lung data, showing high geometric accuracy and correct topology, and it remains robust when training data are limited or noisy. Conclusion: PrIntMesh outperforms voxel- and surface-based methods, reconstructs shared interfaces better, maintains structural consistency, and is data-efficient, making it suitable for clinical use. Abstract: Human organs are composed of interconnected substructures whose geometry and spatial relationships constrain one another. Yet, most deep-learning approaches treat these parts independently, producing anatomically implausible reconstructions. We introduce PrIntMesh, a template-based, topology-preserving framework that reconstructs organs as unified systems. Starting from a connected template, PrIntMesh jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and enforcing smooth, artifact-free surfaces. We demonstrate its effectiveness on the heart, hippocampus, and lungs, achieving high geometric accuracy, correct topology, and robust performance even with limited or noisy training data. Compared to voxel- and surface-based methods, PrIntMesh better reconstructs shared interfaces, maintains structural consistency, and provides a data-efficient solution suitable for clinical use.

[75] When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yuping Yan,Yuhan Xie,Yinxin Zhang,Lingjuan Lyu,Yaochu Jin

Main category: cs.CV

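TL;DR: This paper proposes VLA-Fool, the first systematic study of the multimodal adversarial robustness of embodied vision-language-action models under white-box and black-box settings. It introduces textual, visual, and cross-modal misalignment attacks together with a semantically guided prompting framework, and experiments show that multimodal perturbations significantly affect model behavior.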

Details

Motivation: Existing work mainly studies single-modality adversarial attacks and overlooks the cross-modal misalignment that affects embodied reasoning and decision-making; the robustness of VLA models under realistic multimodal and black-box conditions is also unexplored. Method: VLA-Fool unifies three kinds of multimodal adversarial attack: gradient-based and prompt-based textual perturbations, visual perturbations from patches and noise, and cross-modal misalignment attacks that break the semantic correspondence between perception and instruction. It also builds a VLA-aware semantic space for automated, semantically guided prompt generation. Result: Experiments on the LIBERO benchmark with a fine-tuned OpenVLA model show that even minor multimodal perturbations cause significant behavioral deviations, confirming the fragility of current embodied multimodal alignment. Conclusion: VLA-Fool exposes the vulnerability of current VLA models to multimodal adversarial attacks, underlines the importance of cross-modal alignment, and points to directions for making embodied intelligent systems more robust. Abstract: Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.

[76] Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles

Melih Baydar,Emre Akbas

Main category: cs.CV

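TL;DR: This paper proposes ICCE, an unsupervised image classification method that uses multi-head clustering, adaptive nearest neighbor selection, and cluster ensembling to produce diverse clusterings on a frozen backbone and merge them into a consensus clustering for training a classifier. It achieves the best performance on multiple benchmarks and is the first to exceed 70% accuracy on ImageNet.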

Details

Motivation: Existing methods mostly skip representation learning and cluster directly, do not adequately model the diversity and consistency of the clustering process, and still lag well behind supervised methods. Method: A multi-head clustering structure trains several clustering heads on a frozen foundation model to produce diverse clusterings; adaptive nearest neighbor selection and a cluster ensembling strategy merge these clusterings into a unified consensus clustering, which then provides pseudo-labels for training an image classifier. Result: The method reaches state-of-the-art performance on 10 benchmarks including CIFAR10, CIFAR100, and ImageNet, with accuracies of 99.3%, 89%, and 70.4% respectively, and is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet. Conclusion: ICCE effectively combines multi-head clustering with ensembling, significantly improves unsupervised image classification, narrows the gap with supervised learning, and advances the field. Abstract: Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models has recently shifted focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, "Image Clustering through Cluster Ensembles" (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, achieving 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
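
The abstract does not say which cluster ensembling technique ICCE uses. As one possible illustration of turning several conflicting clusterings into a single consensus, the sketch below uses a co-association matrix with a majority threshold, which is a textbook approach swapped in here and not necessarily the paper's. The function names, the threshold, and the toy head outputs are assumptions.

```python
def co_association(labelings):
    """Fraction of clustering heads that place samples i and j in the same cluster."""
    n, m = len(labelings[0]), len(labelings)
    return [[sum(l[i] == l[j] for l in labelings) / m for j in range(n)]
            for i in range(n)]


def consensus(labelings, threshold=0.5):
    """Link samples co-clustered by more than `threshold` of the heads and
    return the connected components as consensus pseudo-labels."""
    co = co_association(labelings)
    n = len(co)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:  # depth-first search over the thresholded graph
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Three hypothetical heads that disagree on five samples (cluster IDs are arbitrary).
heads = [
    [0, 0, 1, 1, 2],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
pseudo_labels = consensus(heads)
print(pseudo_labels)
```

Working on pairwise agreement sidesteps the fact that cluster IDs from different heads are not comparable; the resulting labels could then serve as pseudo-labels for a classifier.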

Boyue Xu,Ruichao Hou,Tongwei Ren,Dongming Zhou,Gangshan Wu,Jinde Cao

Main category: cs.CV

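TL;DR: This paper proposes SwiTrack, a new state-switching framework for cross-modal object tracking (CMOT) that processes RGB and NIR frames through three specialized branches and adds a consistency trajectory prediction module and dynamic template reconstruction to improve tracking robustness and accuracy.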

Details

Motivation: Existing methods for cross-modal object tracking struggle to extract modality-specific features fully and are prone to target drift when inputs are unreliable, so a more effective framework is needed to improve cross-modal representation and tracking stability. Method: SwiTrack uses three specialized streams: RGB frames are processed by the visual encoder; NIR frames are refined by the visual encoder with a gated adapter to calibrate shared latent space features; for invalid modalities, a consistency trajectory prediction module uses spatio-temporal cues to estimate target motion. Dynamic template reconstruction and a similarity alignment loss further strengthen feature consistency. Result: Experiments on the latest benchmarks show gains of 7.2% in precision rate and 4.3% in success rate, with real-time tracking at 65 frames per second. Conclusion: With its novel state-switching mechanism and combination of strategies, SwiTrack achieves state-of-the-art performance in cross-modal object tracking, effectively mitigates target drift, and offers good real-time performance and application potential. Abstract: Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities, with only one modality available in each frame, mostly focusing on RGB-Near Infrared (RGB-NIR) tracking. Existing methods typically connect parallel RGB and NIR branches to a shared backbone, which limits the comprehensive extraction of distinctive modality-specific features and fails to address the issue of object drift, especially in the presence of unreliable inputs. In this paper, we propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams. Specifically, RGB frames are processed by the visual encoder, while NIR frames undergo refinement via a NIR gated adapter coupled with the visual encoder to progressively calibrate shared latent space features, thereby yielding more robust cross-modal representations. For invalid modalities, a consistency trajectory prediction module leverages spatio-temporal cues to estimate target movement, ensuring robust tracking and mitigating drift. Additionally, we incorporate dynamic template reconstruction to iteratively update template features and employ a similarity alignment loss to reinforce feature consistency. Experimental results on the latest benchmarks demonstrate that our tracker achieves state-of-the-art performance, improving precision rate and success rate by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 frames per second. Code and models are available at https://github.com/xuboyue1999/SwiTrack.git.

[78] Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs

Sinan Mutlu,Georgios F. Angelis,Savas Ozkan,Paul Wisbey,Anastasios Drosou,Mete Ozay

Main category: cs.CV

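TL;DR: Proposes a method based on a multi-layer perceptron (MLP) with residual connections and a new Memory-Block component that generates high-quality full-body motion from sparse sensor inputs, significantly improving full-body tracking in AR/VR.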

Details

Motivation: Existing AR/VR systems mainly track the head and hands through headsets and controllers and cannot provide complete 3D full-body tracking, so full-body motion must be recovered from sparse sensor signals. Method: An MLP backbone is extended with residual connections and a Memory-Block; the Memory-Block represents missing sensor data with trainable code vectors and combines them with past temporal signals to improve temporal consistency; a multi-task learning framework strengthens the feature representation. Result: In experiments the method clearly outperforms state-of-the-art approaches, substantially reduces prediction error, and reaches 72 FPS on mobile HMDs, combining high accuracy with real-time operation. Conclusion: The method strikes a good balance between accuracy and runtime efficiency, is suitable for deployment in practical AR/VR systems, and advances full-body motion reconstruction from sparse sensing. Abstract: Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track head and hands via Head Mounted Devices (HMDs) and controllers, making the 3D full-body reconstruction incomplete. One potential approach is to generate the full-body motions from sparse inputs collected from limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN-component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve the temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP-backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs, ultimately improving the accuracy-running time trade-off.

[79] TetraSDF: Precise Mesh Extraction with Multi-resolution Tetrahedral Grid

Seonghun Oh,Youngjung Uh,Jin-Hwa Kim

Main category: cs.CV

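TL;DR: TetraSDF is a precise analytic meshing framework for neural signed distance functions (SDFs) built from a ReLU MLP and a multi-resolution tetrahedral positional encoder, producing high-quality meshes that exactly match the zero-level set.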

Details

Motivation: Existing sampling-based methods suffer from discretization error, while continuous piecewise affine (CPWA) analytic methods apply only to plain ReLU MLPs and cannot handle more complex encoders. Method: TetraSDF uses the barycentric interpolation of the tetrahedral encoder to preserve global CPWA structure, tracks ReLU linear regions within the encoder-induced polyhedral complex, and designs a fixed analytic input preconditioner to reduce directional bias and stabilize training. Result: Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that stay faithful to the learned isosurface, with practical runtime and memory efficiency. Conclusion: TetraSDF achieves precise analytic mesh extraction for complex neural SDFs, combining high accuracy, geometric fidelity, and computational efficiency, and broadens the use of analytic methods in neural implicit representations. Abstract: Extracting meshes that exactly match the zero-level set of neural signed distance functions (SDFs) remains challenging. Sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic approaches apply only to plain ReLU MLPs. We present TetraSDF, a precise analytic meshing framework for SDFs represented by a ReLU MLP composed with a multi-resolution tetrahedral positional encoder. The encoder's barycentric interpolation preserves global CPWA structure, enabling us to track ReLU linear regions within an encoder-induced polyhedral complex. A fixed analytic input preconditioner derived from the encoder's metric further reduces directional bias and stabilizes training. Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that remain faithful to the learned isosurfaces, all with practical runtime and memory efficiency.

[80] Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM

Gergely Dinya,Péter Halász,András Lőrincz,Kristóf Karacs,Anna Gelencsér-Horváth

Main category: cs.CV

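TL;DR: Proposes a fast spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT), suited to applications such as assistive navigation that need near real-time performance.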

Details

Motivation: To overcome VGGT's high memory demands and enable continuous 3D scene updates, improving applicability in practical settings such as assistive navigation. Method: The image stream is processed with a sliding window and the resulting submaps are aligned; the VGGT tracking head aggregates 2D semantic instance masks into 3D objects; timestamps and instance-level identities are stored to ensure temporal consistency and detect changes in the environment. Result: The method is validated on well-known benchmarks and custom datasets, achieving close to real-time performance and good scene understanding. Conclusion: The framework can effectively support real-world applications and shows good practicality and scalability, particularly in assistive navigation scenarios. Abstract: We present a fast, spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT). The proposed pipeline is designed to enable efficient, close to real-time performance, supporting applications including assistive navigation. To achieve continuous updates of the 3D scene representation, we process the image flow with a sliding window, aligning submaps, thereby overcoming VGGT's high memory demands. We exploit the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects. To allow for temporal consistency and richer contextual reasoning, the system stores timestamps and instance-level identities, thereby enabling the detection of changes in the environment. We evaluate the approach on well-known benchmarks and custom datasets specifically designed for assistive navigation scenarios. The results demonstrate the applicability of the framework to real-world scenarios.
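
To make the sliding-window idea concrete, here is a minimal sketch of splitting a frame sequence into fixed-size windows so that only one window's worth of frames is processed at a time. The window size, the use of shared frames between consecutive windows as a basis for submap alignment, and the function name are assumptions; the abstract does not give these details.

```python
def sliding_windows(frames, size, overlap):
    """Yield consecutive windows of `size` frames that share `overlap` frames.

    Shared frames give neighbouring submaps common observations, which is one
    plausible basis for aligning them (an assumption, not the paper's stated method).
    """
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    for start in range(0, max(len(frames) - overlap, 1), step):
        yield frames[start:start + size]


# Ten frame indices stand in for images.
windows = list(sliding_windows(list(range(10)), size=4, overlap=1))
print(windows)
```

Peak memory then depends on the window size and not on the length of the whole sequence, which is the property the abstract relies on.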

[81] Explainable AI for Diabetic Retinopathy Detection Using Deep Learning with Attention Mechanisms and Fuzzy Logic-Based Interpretability

Abishek Karthik,Pandiyaraju V,Sreya Mynampati

Main category: cs.CV

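TL;DR: This paper proposes a hybrid deep learning framework combining CNN, ViT, and GNN for high-accuracy weed detection under complex field conditions, using GAN-based augmentation and self-supervised pre-training, and reaching 99.33% accuracy on multiple benchmark datasets.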

Details

Motivation: In precision agriculture, accurate identification of weed species enables selective herbicide application and supports sustainable farming, but field conditions vary and annotated data are limited, so traditional methods lack robustness and generalization. Method: A hybrid framework fuses CNN, Vision Transformer, and Graph Neural Network components, uses GAN-based data augmentation to balance class distributions, and adds self-supervised contrastive pre-training to improve feature learning from few samples. Result: The model reaches 99.33% accuracy, precision, recall, and F1-score on multiple benchmark datasets, captures local, global, and relational features, and offers high interpretability and adaptability. Conclusion: The framework supports efficient real-time deployment on edge devices, helps reduce herbicide overuse, and provides a scalable solution for automated weed detection and sustainable precision agriculture. Abstract: The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework recipe for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method was imposed to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experiments yield superior results with 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment on edge devices for automated weed detection, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.

[82] Optimizing 3D Gaussian Splattering for Mobile GPUs

Md Musfiqur Rahman Sanim,Zhihao Shu,Bahram Afsharmanesh,AmirAli Mirian,Jiexiong Guan,Wei Niu,Bin Ren,Gagan Agrawal

Main category: cs.CV

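TL;DR: This paper proposes Texture3dgs, a 3D Gaussian Splatting (3DGS) method optimized for mobile GPUs that achieves efficient 3D scene reconstruction on mobile devices through a new sorting algorithm designed for the 2D texture cache, along with other optimizations.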

Details

Motivation: Given advantages such as data privacy, offline operation, and responsiveness, it is desirable to deploy 3D Gaussian Splatting on mobile devices, but this requires overcoming inefficient use of the 2D texture cache on mobile GPUs. Method: A new sorting algorithm highly optimized for 2D memory is proposed, variable layout and other parts of the 3DGS pipeline are improved for speed, and the design is analyzed and optimized using a cost model of the texture cache. Result: End-to-end experiments show that Texture3dgs achieves up to 4.1x speedup in the sorting stage and up to 1.7x in overall reconstruction, while reducing memory usage by up to 1.6x. Conclusion: Texture3dgs effectively improves the efficiency of 3D scene reconstruction on mobile devices and shows its potential on resource-constrained platforms. Abstract: Image-based 3D scene reconstruction, which transforms multi-view images into a structured 3D representation of the surrounding environment, is a common task across many modern applications. 3D Gaussian Splatting (3DGS) is a new paradigm to address this problem and offers considerable efficiency as compared to the previous methods. Motivated by this, and considering various benefits of mobile device deployment (data privacy, operating without internet connectivity, and potentially faster responses), this paper develops Texture3dgs, an optimized mapping of 3DGS for a mobile GPU. A critical challenge in this area turns out to be optimizing for the two-dimensional (2D) texture cache, which needs to be exploited for faster executions on mobile GPUs. As a sorting method dominates the computations in 3DGS on mobile platforms, the core of Texture3dgs is a novel sorting algorithm where the processing, data movement, and placement are highly optimized for 2D memory. The properties of this algorithm are analyzed in view of a cost model for the texture cache. In addition, we accelerate other steps of the 3DGS algorithm through improved variable layout design and other optimizations. End-to-end evaluation shows that Texture3dgs delivers up to 4.1$\times$ and 1.7$\times$ speedup for the sorting and overall 3D scene reconstruction, respectively -- while also reducing memory usage by up to 1.6$\times$ -- demonstrating the effectiveness of our design for efficient mobile 3D scene reconstruction.

[83] Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

Minseok Seo,Mark Hamilton,Changick Kim

Main category: cs.CV

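TL;DR: Proposes Upsample Anything, a lightweight test-time optimization framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training.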

Details

Motivation: Existing feature upsampling methods depend on dataset-specific retraining or heavy implicit optimization, which limits scalability and generalization; meanwhile, features from Vision Foundation Models are typically heavily downsampled and hard to use directly in pixel-level tasks. Method: A simple per-image optimization learns an anisotropic Gaussian kernel that combines spatial and range cues, bridging Gaussian Splatting and Joint Bilateral Upsampling to give a universal, edge-aware upsampling operator. Result: State-of-the-art performance on semantic segmentation, depth estimation, and depth and probability map upsampling, with a processing time of only about 0.419 s per 224x224 image. Conclusion: Upsample Anything is an efficient, general, training-free upsampling method that transfers across architectures and modalities and markedly improves the usability of Vision Foundation Models in pixel-level tasks. Abstract: We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only approximately 0.419 s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
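
The abstract relates the method to Joint Bilateral Upsampling, where each high-resolution output is a weighted average of low-resolution samples and the weight combines a spatial term with a range term taken from a high-resolution guide. The following is a minimal 1D sketch of that classical operator only, not the authors' learned anisotropic kernel; the function name, the fixed isotropic `sigma_s` and `sigma_r`, and the assumption that low-resolution sample `j` sits at guide index `j * scale` are all illustrative choices.

```python
import math


def joint_bilateral_upsample_1d(low_res, guide, scale, sigma_s=1.0, sigma_r=0.1):
    """Upsample a 1D low-resolution signal using a high-resolution guide.

    Each output sample is a weighted average of the low-resolution samples.
    The weight is a spatial Gaussian (distance in low-res coordinates) times
    a range Gaussian (difference between guide values).
    """
    # Guide values at the low-resolution sample positions (assumed layout).
    guide_low = [guide[j * scale] for j in range(len(low_res))]
    output = []
    for i, g_hi in enumerate(guide):
        pos = i / scale  # position of output sample i in low-res coordinates
        num, den = 0.0, 0.0
        for j, value in enumerate(low_res):
            w_spatial = math.exp(-((pos - j) ** 2) / (2 * sigma_s ** 2))
            w_range = math.exp(-((g_hi - guide_low[j]) ** 2) / (2 * sigma_r ** 2))
            w = w_spatial * w_range
            num += w * value
            den += w
        # Guard against every weight underflowing to zero.
        output.append(num / den if den > 0 else 0.0)
    return output


# A step edge in the guide keeps the upsampled features from bleeding across it.
features = [0.0, 0.0, 1.0, 1.0]
guide = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
upsampled = joint_bilateral_upsample_1d(features, guide, scale=2)
```

The range term is what makes the operator edge-aware: samples on the other side of the guide edge receive a negligible weight, so the output stays sharp where the guide is sharp.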

[84] Sparse Autoencoders are Topic Models

Leander Girrbach,Zeynep Akata

Main category: cs.CV

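TL;DR: This paper offers a new perspective that treats sparse autoencoders (SAEs) as topic models. Latent Dirichlet Allocation (LDA) is extended to embedding spaces and the SAE objective is interpreted as maximum a posteriori estimation under that model, which implies that SAE features are thematic components rather than steerable directions. Building on this, the authors propose SAE-TM, a framework for large-scale thematic analysis across modalities that yields more coherent topics on text and image datasets and is applied to tracing topic changes over time in Japanese woodblock prints (ukiyo-e).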

Details

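Motivation: Although sparse autoencoders (SAEs) are widely used to analyze embeddings, their role and practical value remain debated. This paper aims to provide a unified theoretical framework for understanding what SAEs are and to improve their interpretability and usefulness for topic modeling. Method: Latent Dirichlet Allocation (LDA) is extended to embedding spaces, and the SAE objective is derived as the maximum a posteriori (MAP) estimator under this probabilistic model, establishing a theoretical link between SAEs and topic models. On this basis, the SAE-TM framework first trains an SAE to learn reusable topic atoms, then interprets them as word distributions on downstream data, and finally merges them into any number of topics without retraining. Result: SAE-TM achieves better topic coherence than strong baselines on text and image datasets while maintaining diversity. The experiments reveal thematic structure in image datasets and trace how topics in Japanese woodblock prints change over time. Conclusion: SAEs can be understood as topic models, and SAE-TM provides a practical, interpretable tool for large-scale thematic analysis across modalities, strengthening both the theoretical grounding and the applied value of SAEs. Abstract: Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.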

[85] BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks

Samuel Stevens

Main category: cs.CV

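TL;DR: BioBench is a new ecology vision benchmark that addresses the failure of ImageNet-1K linear-probe transfer accuracy to predict performance on scientific imagery. It unifies 9 publicly released, application-driven tasks spanning 4 taxonomic kingdoms and 6 acquisition modalities, totaling 3.1M images. Through a single Python API, users can download data, fit lightweight classifiers to frozen backbones, and report class-balanced macro-F1. The benchmark provides new signal for computer vision in ecology and a template for building reliable AI-for-science benchmarks.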

Details

Motivation: ImageNet-1K linear-probe transfer accuracy no longer reliably predicts how modern vision models perform on ecology tasks, so a new benchmark is needed to evaluate them more accurately. Method: BioBench unifies 9 publicly released tasks covering 4 taxonomic kingdoms and 6 image acquisition modalities, and uses a single Python API for data download, classifier fitting, and evaluation, with class-balanced macro-F1 as the main metric. Result: Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of the variance on ecology tasks and mis-ranks 30% of the models above 75% accuracy. BioBench better reflects model performance on real ecology tasks. Conclusion: BioBench provides a more reliable and comprehensive benchmark for evaluating vision models on ecology tasks and helps advance the use of AI in scientific research. Abstract: ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
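
The benchmark reports class-balanced macro-F1, which averages per-class F1 scores so that rare classes count as much as common ones. The sketch below shows the general metric in plain Python; it is not the BioBench API, and the function name `macro_f1` is a hypothetical helper introduced here for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority class on imbalanced labels.
labels = [0, 0, 0, 1]
predictions = [0, 0, 0, 0]
score = macro_f1(labels, predictions)
```

In this example plain accuracy is 0.75, while macro-F1 is 3/7 (about 0.43) because the minority class contributes an F1 of zero, which is why a class-balanced metric suits long-tailed ecological data.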

[86] NaTex: Seamless Texture Generation as Latent Color Diffusion

Zeqiang Lai,Yunfei Zhao,Zibo Zhao,Xin Yang,Xin Huang,Jingwei Huang,Xiangyu Yue,Chunchao Guo

Main category: cs.CV

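TL;DR: Proposes NaTex, a native texture generation framework that predicts texture color directly in 3D space, avoiding the limitations of conventional multi-view diffusion models in occlusion handling, texture alignment, and cross-view consistency.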

Details

Motivation: Existing texture generation methods based on 2D multi-view diffusion models require inpainting of occluded regions, struggle to align textures with mesh boundaries, and produce content and color that are inconsistent across views, which limits texture quality and applicability. Method: Texture is treated as a dense color point cloud, and a latent color diffusion model is proposed, comprising a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT); a native geometry control mechanism achieves precise alignment via positional embeddings and geometry latents; the whole architecture is trained from scratch on 3D data. Result: NaTex significantly outperforms previous methods in texture coherence and alignment, and generalizes well, either training-free or with simple tuning, to downstream tasks such as material generation, texture refinement, and part segmentation and texturing. Conclusion: With its native 3D texture modeling paradigm, NaTex addresses key shortcomings of the traditional MVD pipeline and offers a new direction for high-quality, highly consistent 3D texture generation. Abstract: We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE-DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.

[87] WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement

Ching-Heng Cheng,Jen-Wei Lee,Chia-Ming Lee,Chih-Chung Hsu

Main category: cs.CV

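TL;DR: Proposes WWE-UIE, a compact and efficient underwater image enhancement network that combines three interpretable priors, achieving real-time inference while maintaining high-quality restoration.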

Details

Motivation: Existing hybrid methods are computationally expensive and hard to use in real-time scenarios, so a more efficient, lightweight underwater image enhancement model is needed. Method: Introduces adaptive white balance, a wavelet-based enhancement block (WEB), and a gradient-aware module (SGFB), combining multi-band decomposition with a learnable edge-preserving mechanism. Result: Competitive restoration quality on benchmark datasets with substantially fewer parameters and FLOPs, supporting real-time inference on resource-limited platforms. Conclusion: By fusing interpretable priors, WWE-UIE achieves efficient, high-quality underwater image enhancement suited to practical use. Abstract: Underwater Image Enhancement (UIE) aims to restore visibility and correct color distortions caused by wavelength-dependent absorption and scattering. Recent hybrid approaches, which couple domain priors with modern deep neural architectures, have achieved strong performance but incur high computational cost, limiting their practicality in real-time scenarios. In this work, we propose WWE-UIE, a compact and efficient enhancement network that integrates three interpretable priors. First, adaptive white balance alleviates the strong wavelength-dependent color attenuation, particularly the dominance of blue-green tones. Second, a wavelet-based enhancement block (WEB) performs multi-band decomposition, enabling the network to capture both global structures and fine textures, which are critical for underwater restoration. Third, a gradient-aware module (SGFB) leverages Sobel operators with learnable gating to explicitly preserve edge structures degraded by scattering. Extensive experiments on benchmark datasets demonstrate that WWE-UIE achieves competitive restoration quality with substantially fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms. Ablation studies and visualizations further validate the contribution of each component. The source code is available at https://github.com/chingheng0808/WWE-UIE.
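
The wavelet-based block relies on multi-band decomposition, which splits a signal into a coarse band (global structure) and a detail band (fine texture). The abstract does not say which wavelet is used, so the sketch below assumes a one-level 1D Haar transform purely for illustration; the function names are hypothetical and the input length is assumed to be even.

```python
import math


def haar_decompose(signal):
    """One-level orthonormal Haar transform of an even-length 1D signal."""
    s = 1 / math.sqrt(2)
    # Low-pass band: scaled pairwise sums capture the coarse structure.
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    # High-pass band: scaled pairwise differences capture the fine detail.
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_reconstruct(approx, detail):
    """Invert haar_decompose; the transform is lossless."""
    s = 1 / math.sqrt(2)
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) * s)
        signal.append((a - d) * s)
    return signal


row = [4.0, 2.0, 5.0, 5.0]
low_band, high_band = haar_decompose(row)
restored = haar_reconstruct(low_band, high_band)
```

Flat regions produce zero detail coefficients (the second pair above), so a network operating on the two bands can treat smooth structure and texture separately before recombining them.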

[88] ChangeDINO: DINOv3-Driven Building Change Detection in Optical Remote Sensing Imagery

Ching-Heng Cheng,Chih-Chung Hsu

Main category: cs.CV

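TL;DR: This paper proposes ChangeDINO, an end-to-end multiscale Siamese framework for building change detection in optical remote sensing imagery. By fusing a lightweight backbone with frozen DINOv3 features, a spatial-spectral differential transformer decoder, and a learnable morphology module, it achieves more robust change detection with limited annotations.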

Details

Motivation: Existing deep learning methods rely mainly on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. Method: The ChangeDINO framework (1) combines a lightweight backbone with a frozen DINOv3 for multi-scale feature extraction; (2) uses a spatial-spectral differential transformer decoder that takes absolute differences as change priors; and (3) adds a learnable morphology module to refine boundaries. Result: On four public datasets, ChangeDINO outperforms recent methods in both IoU and F1, and ablation studies confirm the effectiveness of each component. Conclusion: By effectively fusing semantic information with change priors, ChangeDINO achieves accurate and robust building change detection on small datasets and has good application prospects. Abstract: Remote sensing change detection (RSCD) aims to identify surface changes from co-registered bi-temporal images. However, many deep learning-based RSCD methods rely solely on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. This article introduces ChangeDINO, an end-to-end multiscale Siamese framework for optical building change detection. The model fuses a lightweight backbone stream with features transferred from a frozen DINOv3, yielding semantic- and context-rich pyramids even on small datasets. A spatial-spectral differential transformer decoder then exploits multi-scale absolute differences as change priors to highlight true building changes and suppress irrelevant responses. Finally, a learnable morphology module refines the upsampled logits to recover clean boundaries. Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent state-of-the-art methods in IoU and F1, and ablation studies confirm the effectiveness of each component. The source code is available at https://github.com/chingheng0808/ChangeDINO.

[89] Arbitrary-Resolution and Arbitrary-Scale Face Super-Resolution with Implicit Representation Networks

Yi Ting Tsai,Yu Wei Chen,Hong-Han Shuai,Ching-Chun Huang

Main category: cs.CV

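TL;DR: This paper proposes ARASFSR, an arbitrary-resolution and arbitrary-scale face super-resolution method that uses implicit representation networks to reconstruct high-quality face images across different input sizes and up-sampling factors.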

Details

Motivation: Existing FSR methods are limited to fixed up-sampling scales and are sensitive to input size variations, making them hard to adapt to the diverse resolution requirements of real applications. Method: ARASFSR predicts the RGB value of each target pixel from 2D deep features, local relative coordinates, and the up-sampling scale ratio; a local frequency estimation module captures high-frequency texture information, and a global coordinate modulation module incorporates prior knowledge of facial structure to achieve resolution adaptation. Result: Experiments show that ARASFSR outperforms existing state-of-the-art methods across various input sizes and up-sampling scales, with stronger robustness and visual quality. Conclusion: ARASFSR effectively supports arbitrary-resolution inputs and arbitrary up-sampling factors, markedly improving the flexibility and practicality of face super-resolution. Abstract: Face super-resolution (FSR) is a critical technique for enhancing low-resolution facial images and has significant implications for face-related tasks. However, existing FSR methods are limited by fixed up-sampling scales and sensitivity to input size variations. To address these limitations, this paper introduces an Arbitrary-Resolution and Arbitrary-Scale FSR method with implicit representation networks (ARASFSR), featuring three novel designs. First, ARASFSR employs 2D deep features, local relative coordinates, and up-sampling scale ratios to predict RGB values for each target pixel, allowing super-resolution at any up-sampling scale. Second, a local frequency estimation module captures high-frequency facial texture information to reduce the spectral bias effect. Lastly, a global coordinate modulation module guides FSR to leverage prior facial structure knowledge and achieve resolution adaptation effectively. Quantitative and qualitative evaluations demonstrate the robustness of ARASFSR over existing state-of-the-art methods while super-resolving facial images across various input sizes and up-sampling scales.
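
The first design feeds an implicit network with local relative coordinates and the up-sampling scale ratio for every target pixel. The sketch below shows, in 1D, one way such query inputs could be built; it assumes the common convention of placing pixel centers on a normalized [-1, 1] axis and looking up the nearest low-resolution cell, which the abstract does not spell out. The function names are hypothetical.

```python
def pixel_centers(n):
    """Centers of n equal cells that tile the interval [-1, 1]."""
    return [-1 + (2 * i + 1) / n for i in range(n)]


def query_inputs(lr_size, hr_size):
    """For each target pixel, return (nearest LR index, relative coord, scale).

    The relative coordinate is the offset of the target pixel center from the
    center of its nearest low-resolution cell, in normalized units.
    """
    lr = pixel_centers(lr_size)
    scale = hr_size / lr_size  # any positive ratio, not only integers
    queries = []
    for x in pixel_centers(hr_size):
        j = min(range(lr_size), key=lambda k: abs(x - lr[k]))
        queries.append((j, x - lr[j], scale))
    return queries


# 2 low-resolution cells queried at 4 target positions (scale 2.0).
inputs_x2 = query_inputs(2, 4)
# The same construction works for a non-integer ratio such as 3 -> 4.
inputs_frac = query_inputs(3, 4)
```

Because the target grid is defined by continuous coordinates rather than a fixed stride, the same network inputs can be generated for any output size, which is what makes arbitrary-scale prediction possible.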

[90] Aerial View River Landform Video segmentation: A Weakly Supervised Context-aware Temporal Consistency Distillation Approach

Chi-Han Chen,Chieh-Ming Chen,Wen-Huang Cheng,Ching-Chun Huang

Main category: cs.CV

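TL;DR: Proposes a weakly supervised learning method based on a teacher-student architecture, combined with key frame selection and updating algorithms, that markedly improves terrain classification performance and temporal consistency in UAV remote sensing while using only 30% of the labeled data.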

Details

Motivation: To address the complex data annotation, temporal inconsistency, and scarcity of labeled data faced by terrain classification in UAV remote sensing. Method: A teacher-student architecture with key frame selection and updating mechanisms performs weakly supervised learning and temporal consistency knowledge distillation. Result: Using only 30% of the labeled data, the method improves both mIoU and temporal consistency and achieves stable localization of terrain objects. Conclusion: The proposed method overcomes the performance bottlenecks that traditional methods face in aerial tasks due to insufficient data and temporal incoherence, providing an efficient and practical solution for terrain classification in UAV remote sensing. Abstract: The study of terrain and landform classification through UAV remote sensing diverges significantly from ground vehicle patrol tasks. Besides grappling with the complexity of data annotation and ensuring temporal consistency, it also confronts the scarcity of relevant data and the limitations imposed by the effective range of many technologies. This research substantiates that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. It is demonstrated that fully labeled data is not the optimal choice, as selecting only key data lacks the enhancement in TC, leading to failures. Hence, a teacher-student architecture, coupled with key frame selection and key frame updating algorithms, is proposed. This framework successfully performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. The experimental results reveal that our method, utilizing merely 30% of labeled data, concurrently elevates mIoU and temporal consistency, ensuring stable localization of terrain objects. Result demo: https://gitlab.com/prophet.ai.inc/drone-based-riverbed-inspection

[91] CRISTAL: Real-time Camera Registration in Static LiDAR Scans using Neural Rendering

Joni Vanherck,Steven Moonen,Brent Zoomers,Kobe Werner,Jeroen Put,Lode Jorissen,Nick Michiels

Main category: cs.CV

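TL;DR: This paper proposes a real-time camera localization method based on a highly accurate colored LiDAR point cloud. A neural rendering technique narrows the domain gap between synthetic and real images, giving drift-free camera tracking with correct metric scale, and the method outperforms existing SLAM approaches on the ScanNet++ dataset.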

Details

Motivation: Existing visual localization methods often suffer from drift and scale ambiguity and depend on fiducials or loop closure, so they struggle to meet the need for accurate, globally consistent localization in robotics and Extended Reality (XR). Method: Synthetic views are rendered from a pre-captured, highly accurate colored LiDAR point cloud to establish 2D-3D correspondences between live frames and the point cloud; a neural rendering technique reduces the domain gap between synthetic and real images and improves feature matching; two real-time variants are presented: Online Render and Match, and Prebuild and Localize. Result: The method achieves better localization accuracy than existing SLAM systems on the ScanNet++ dataset, with drift-free real-time camera tracking at a consistent metric scale. Conclusion: The proposed method delivers accurate, real-time camera localization that stays globally consistent without loop closure or relocalization, making it suitable for robotics and XR applications. Abstract: Accurate camera localization is crucial for robotics and Extended Reality (XR), enabling reliable navigation and alignment of virtual and real content. Existing visual methods often suffer from drift and scale ambiguity, and depend on fiducials or loop closure. This work introduces a real-time method for localizing a camera within a pre-captured, highly accurate colored LiDAR point cloud. By rendering synthetic views from this cloud, 2D-3D correspondences are established between live frames and the point cloud. A neural rendering technique narrows the domain gap between synthetic and real images, reducing occlusion and background artifacts to improve feature matching. The result is drift-free camera tracking with correct metric scale in the global LiDAR coordinate system. Two real-time variants are presented: Online Render and Match, and Prebuild and Localize. We demonstrate improved results on the ScanNet++ dataset and outperform existing SLAM pipelines.

[92] Multi-Order Matching Network for Alignment-Free Depth Super-Resolution

Zhengxue Wang,Zhiqiang Yan,Yuan Wu,Guangwei Gao,Xiang Li,Jian Yang

Main category: cs.CV

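TL;DR: Proposes the alignment-free Multi-Order Matching Network (MOMNet) to handle the misalignment between RGB and depth maps caused by hardware limitations and calibration drift in real scenes, achieving state-of-the-art performance and strong robustness in depth super-resolution.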

Details

Motivation: Existing methods rely on strictly aligned RGB-D data, but perfect alignment is hard to achieve in practice because of hardware differences and environmental changes, which degrades performance. Method: MOMNet includes a multi-order matching mechanism (zero-order, first-order, and second-order) that identifies RGB information consistent with depth across multi-order feature spaces, and a multi-order aggregation module composed of multiple structure detectors that uses multi-order priors to guide the selective transfer of RGB features to depth. Result: Across multiple experiments, MOMNet shows outstanding robustness in misaligned real-world scenes and reaches state-of-the-art performance in depth super-resolution. Conclusion: MOMNet overcomes the challenge of misalignment between RGB-D sensors and provides a depth super-resolution solution that does not require strict alignment, with good prospects for practical use. Abstract: Recent guided depth super-resolution methods are premised on the assumption of strict spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.

[93] DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration

Meng-Cheng Shih,Tsai-Ling Huang,Yu-Heng Shih,Hong-Han Shuai,Hsuan-Tung Liu,Yi-Ren Yeh,Ching-Chun Huang

Main category: cs.CV

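TL;DR: This paper proposes DetailSemNet, a new model for offline signature verification (OSV) that emphasizes fine-grained differences and improves verification accuracy through local structure matching.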

Details

Motivation: Existing methods mostly rely on holistic features to compare signatures and struggle to capture subtle forgery differences; in addition, transformer-based backbones may blur local details and hurt OSV performance. Method: DetailSemNet introduces a Detail Semantics Integrator that uses feature disentanglement and re-entanglement to enhance local details and expand discriminative semantics, enabling fine-grained matching of local structures. Result: On leading offline signature verification benchmarks, DetailSemNet clearly surpasses existing methods and reaches state-of-the-art performance, and it generalizes strongly in cross-dataset tests. Conclusion: Emphasizing local structure matching improves both the accuracy and the interpretability of OSV and strengthens the model's potential for real-world use. Abstract: Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model's interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.

[94] CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement

Pan Yang,Cheng Deng,Jing Yang,Han Zhao,Yun Liu,Yuling Chen,Xiaoli Ruan,Yanping Chen

Main category: cs.CV

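TL;DR: Proposes CAMS, which uses gated cross-attention and multi-space disentanglement to better separate attributes from objects in compositional zero-shot learning, achieving state-of-the-art performance on multiple benchmarks.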

Details

Motivation: Existing CLIP-based methods rely on a global image representation and cannot fully disentangle attribute and object semantics, which limits generalization to unseen compositions. Method: A gated cross-attention mechanism extracts fine-grained semantic features from CLIP's high-level encoding blocks while suppressing background interference; multi-space disentanglement then separates attribute and object semantics in multidimensional spaces. Result: On the MIT-States, UT-Zappos, and C-GQA benchmarks, CAMS reaches state-of-the-art performance in both closed-world and open-world settings. Conclusion: By strengthening semantic feature extraction and disentangling in multidimensional spaces, CAMS effectively improves generalization in compositional zero-shot learning. Abstract: Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and does not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.

[95] End-to-End Motion Capture from Rigid Body Markers with Geodesic Loss

Hai Lan,Zongyan Li,Jianmin Hu,Jialing Yang,Houde Dai

Main category: cs.CV

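TL;DR: Proposes a new optical motion capture method based on Rigid Body Markers (RBMs), combining a deep learning model with a geodesic loss to achieve efficient, accurate, real-time motion capture.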

Details

Motivation: Traditional optical MoCap based on dense markers is accurate, but preparation is time-consuming and marker identification is ambiguous, which limits scalability. A simpler, more robust marker scheme is therefore needed. Method: The Rigid Body Marker (RBM) is introduced as the basic unit of MoCap, providing unambiguous 6-DoF data, and an end-to-end deep learning regression model directly estimates SMPL parameters, trained with a manifold-aware geodesic loss. Result: The model achieves state-of-the-art body pose estimation accuracy on data synthesized from AMASS, requires over an order of magnitude less computation than optimization-based methods, and is shown to be practical on real data captured with a Vicon system. Conclusion: Combining sparse 6-DoF RBMs with a manifold-aware geodesic loss provides a practical, high-fidelity solution for real-time motion capture in graphics, virtual reality, and biomechanics. Abstract: Marker-based optical motion capture (MoCap), while long regarded as the gold standard for accuracy, faces practical challenges, such as time-consuming preparation and marker identification ambiguity, due to its reliance on dense marker configurations, which fundamentally limit its scalability. To address this, we introduce a novel fundamental unit for MoCap, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and drastically simplifies setup. Leveraging this new data modality, we develop a deep-learning-based regression model that directly estimates SMPL parameters under a geodesic loss. This end-to-end approach matches the performance of optimization-based methods while requiring over an order of magnitude less computation. Trained on synthesized data from the AMASS dataset, our end-to-end model achieves state-of-the-art accuracy in body pose estimation. Real-world data captured using a Vicon optical tracking system further demonstrates the practical viability of our approach. Overall, the results show that combining sparse 6-DoF RBM with a manifold-aware geodesic loss yields a practical and high-fidelity solution for real-time MoCap in graphics, virtual reality, and biomechanics.
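
A geodesic loss measures rotation error as an angle on the rotation manifold rather than as a Euclidean difference between matrix entries. The abstract does not give the exact formulation, so the sketch below assumes the standard geodesic distance on SO(3), the angle of the relative rotation, computed from the trace of `R1^T R2`. The helper names are hypothetical.

```python
import math


def geodesic_distance(r1, r2):
    """Angle in radians of the relative rotation between two 3x3 rotations."""
    # trace(r1^T r2) equals the sum of element-wise products of r1 and r2.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_theta = (trace - 1.0) / 2.0
    # Clamp to guard against round-off pushing the value outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)


def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Predicted and target joint rotations that differ by 0.7 rad about z.
error = geodesic_distance(rot_z(0.3), rot_z(1.0))
```

A training loss in this style would typically average the distance over all predicted joint rotations; how the authors aggregate it over SMPL parameters is not stated in the abstract.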

[96] CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

Samer Abualhanud,Christian Grannemann,Max Mehltretter

Main category: cs.CV

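TL;DR: Proposes a geometry-guided self-supervised method for calibrated multi-camera rigs that projects 3D points onto a shared cylinder and applies a spatial attention mechanism, improving the cross-view consistency and overall accuracy of depth estimation.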

Details

Motivation: Existing self-supervised surround-view depth estimation methods produce depth predictions that are inconsistent between overlapping images, which degrades 3D perception quality. Method: Using the camera intrinsics and relative orientation, the 3D points predicted from each image are projected onto a shared unit cylinder to produce 2D position maps; a non-learned spatial attention mechanism then aggregates features across images according to their distances on the cylinder to refine the depth maps. Result: On the DDAD and nuScenes datasets, the method improves both the consistency of depth estimates across images and the overall depth accuracy. Conclusion: The proposed geometry-guided cross-view consistency modeling improves self-supervised depth estimation for multi-camera rigs and suits low-cost, accurate 360° 3D perception. Abstract: Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image and the so-derived 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the consistency of depth estimates across images and the overall depth accuracy compared to state-of-the-art methods.
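
The projection and distance-based attention described in the abstract can be sketched as follows. Several details are assumptions made for illustration and are not stated in the source: the cylinder axis is taken to be the rig's y axis through the origin, points are projected along the ray from the origin, and the attention weight is a normalized Gaussian of the on-cylinder distance with a hypothetical bandwidth `sigma`.

```python
import math


def project_to_unit_cylinder(point):
    """Map a 3D point (x, y, z) to (azimuth, height) on a unit cylinder.

    Points on the cylinder axis (x == z == 0) have no defined projection.
    """
    x, y, z = point
    radial = math.hypot(x, z)    # distance from the assumed vertical axis
    azimuth = math.atan2(x, z)   # angle around the axis, in [-pi, pi]
    height = y / radial          # height where the ray crosses radius 1
    return azimuth, height


def cylinder_distance(a, b):
    """Distance between two (azimuth, height) positions on the unit cylinder."""
    d_az = abs(a[0] - b[0])
    d_az = min(d_az, 2 * math.pi - d_az)  # wrap around the seam
    return math.hypot(d_az, a[1] - b[1])


def attention_weights(query, keys, sigma=0.1):
    """Non-learned weights: normalized Gaussian of the on-cylinder distance."""
    raw = [math.exp(-cylinder_distance(query, k) ** 2 / (2 * sigma ** 2)) for k in keys]
    total = sum(raw)
    return [w / total for w in raw]


# Points seen by two different cameras that land close together on the cylinder.
q = project_to_unit_cylinder((0.0, 0.0, 5.0))
keys = [project_to_unit_cylinder(p) for p in [(0.1, 0.0, 5.0), (5.0, 0.0, 0.0)]]
weights = attention_weights(q, keys)
```

Because every camera's points share one cylinder, pixels from different images become neighbors whenever they observe nearby directions, which is what lets features be aggregated across views without learning the correspondence.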

[97] Graph Neural Networks for Surgical Scene Segmentation

Yihan Li,Nikhil Churamani,Maria Robu,Imanol Luengo,Danail Stoyanov

Main category: cs.CV

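TL;DR: This paper proposes graph-based segmentation methods that combine Vision Transformers with Graph Neural Networks to improve the identification of hepatocystic anatomy in laparoscopic surgical scenes, with particularly strong results on thin, rare, and safety-critical structures.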

Details

Motivation: Deep learning models struggle with occlusions, long-range dependencies, and the fine-scale geometry of rare structures, while accurate identification of hepatocystic anatomy is critical to surgical safety. Method: Two segmentation models that integrate ViT with GNNs are proposed: one uses a static k-nearest-neighbour graph with GCNII for stable long-range information propagation; the other uses a dynamic Differentiable Graph Generator (DGG) with a GAT to support adaptive topology learning. Result: On the Endoscapes-Seg50 and CholecSeg8k datasets, the models improve mIoU by up to 7-8% and mDice by 6% over existing methods, and produce more anatomically coherent predictions on rare and critical structures. Conclusion: By combining the global context of ViT with graph-based relational reasoning, the proposed methods markedly improve the performance and reliability of surgical scene segmentation and support safer laparoscopic and robot-assisted surgery. Abstract: Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analyses. Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks. Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. They produce anatomically coherent predictions, particularly on thin, rare and safety-critical structures. Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation. By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through a precise identification of critical anatomical features.
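
The first model builds a static k-NN graph over encoder features before applying graph convolutions. The sketch below shows only the graph construction step on toy feature vectors, assuming Euclidean distance between node features; the abstract does not state the distance measure, the node definition, or the value of k, and the function name is hypothetical.

```python
import math


def knn_graph(features, k):
    """Return directed edges (i, j) linking each node to its k nearest nodes."""
    edges = []
    for i, fi in enumerate(features):
        # Distance from node i to every other node, paired with its index.
        dists = [(math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i]
        dists.sort()
        edges.extend((i, j) for _, j in dists[:k])
    return edges


# Four toy patch features forming two well-separated pairs.
patch_features = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
edges = knn_graph(patch_features, k=1)
```

A graph layer such as GCNII would then propagate information along these edges; because the graph is static, it is computed once per image rather than updated during message passing.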

[98] Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation

Jin Wang,Bingfeng Zhang,Jian Pang,Mengyu Liu,Honglong Chen,Weifeng Liu

Main category: cs.CV

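TL;DR: This paper proposes LDAG, a few-shot segmentation method based on Language-Driven Attribute Generalization. It uses large language models to generate multiple attribute descriptions of the target class and a multi-modal alignment mechanism for cross-modal interaction, improving segmentation of unseen classes.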

Details

Motivation: Existing FSS methods extract meta information from support images, but intra-class visual variation makes that guidance inaccurate; the paper argues that the key is to provide meta guidance that is unbiased for both trained and untrained classes, rather than to rely on support images. Method: The LDAG framework includes a Multi-attribute Enhancement (MaE) module, which uses LLMs to generate detailed attribute descriptions of the target class and builds visual-text prior guidance, and a Multi-modal Attribute Alignment (MaA) module, which enables cross-modal interaction between text and visual features. Result: Experiments show that the method clearly outperforms existing approaches and sets a new state of the art. Conclusion: By introducing language-driven, unbiased meta guidance, LDAG overcomes the limitations caused by support-image bias in conventional FSS and offers a new direction for few-shot segmentation. Abstract: Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples by a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential; the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture to utilize inherent target property language descriptions to build a robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, because text-vision modal shift makes it hard for attribute text to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) module to achieve cross-modal interaction between attribute texts and visual features. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance. The code will be released.

[99] StreetView-Waste: A Multi-Task Dataset for Urban Waste Management

Diogo J. Paulo,João Martins,Hugo Proença,João C. Neves

Main category: cs.CV

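TL;DR: This paper presents StreetView-Waste, a new dataset supporting waste container detection, tracking, and overflow segmentation for urban waste management. It provides baselines built on existing state-of-the-art models along with two improvement strategies that markedly raise tracking accuracy and segmentation performance.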

Details

Motivation: Existing litter detection datasets lack annotations for tracking specific waste containers and are mostly captured in static, decontextualized environments, which limits their use in real-world logistics. A dataset closer to real urban environments is therefore needed to advance this research. Method: Presents the StreetView-Waste dataset of urban street scenes containing litter and waste containers, supporting three tasks: container detection, container tracking, and overflow segmentation. Baselines built on state-of-the-art models are provided, together with a heuristic-based tracking refinement and a model-agnostic framework that uses geometric priors to improve segmentation. Result: Fine-tuned detectors perform well on container detection, but baseline trackers struggle to estimate container counts; the proposed heuristics reduce the mean absolute counting error by 79.6%. For segmentation, the geometry-aware strategy improves mAP@0.5 by 27% on lightweight models. Conclusion: StreetView-Waste provides a challenging benchmark for research on real-world perception systems in urban waste management and encourages further progress in the field. Abstract: Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.
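As a rough illustration of the counting metric mentioned in the abstract, the sketch below computes a mean absolute counting error over video sequences and the relative reduction between two trackers. The function names and the sample counts are hypothetical, and it assumes the reported 79.6% is a relative reduction; the abstract does not give the exact definition.

```python
def mean_absolute_counting_error(predicted_counts, true_counts):
    """Average |predicted - true| container count over sequences."""
    if len(predicted_counts) != len(true_counts) or not true_counts:
        raise ValueError("count lists must be non-empty and of equal length")
    total = sum(abs(p - t) for p, t in zip(predicted_counts, true_counts))
    return total / len(true_counts)


def relative_reduction(baseline_error, improved_error):
    """Percentage by which the error drops relative to the baseline."""
    return 100.0 * (baseline_error - improved_error) / baseline_error


# Illustrative numbers only, not taken from the paper.
baseline = mean_absolute_counting_error([12, 7, 9], [10, 7, 5])
refined = mean_absolute_counting_error([10, 7, 6], [10, 7, 5])
print(baseline, refined, relative_reduction(baseline, refined))
```

A tracker that fragments one container into several identities inflates the predicted count, which is the kind of error such a metric is sensitive to.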

[100] VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

Ziyan Liu,Yeqiu Chen,Hongyi Cai,Tao Lin,Shuo Yang,Zheng Liu,Bo Zhao

Main category: cs.CV

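TL;DR: Proposes VLA-Pruner, a dual-level token pruning method for vision-language-action (VLA) models that accounts for both semantic understanding and action execution, improving inference efficiency and performance.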

Details

Motivation: Existing token pruning methods based on semantic salience overlook the dual mechanism of high-level semantic understanding and low-level action execution in VLA models, so critical action information is lost and model performance suffers. Method: Proposes VLA-Pruner, which uses a dual-level importance criterion: semantic-level relevance from vision-language prefill attention, and action-level importance from action decode attention estimated via temporal smoothing. An adaptive dual-level token selection strategy is built on this criterion. Result: VLA-Pruner is validated across multiple VLA architectures and robotic tasks, achieving state-of-the-art pruning performance and substantially reducing computational cost while maintaining or even improving task performance. Conclusion: By fusing semantic-level and action-level attention, VLA-Pruner retains the critical visual tokens and offers an efficient, general, plug-and-play way to accelerate VLA models. Abstract: Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
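The dual-level criterion can be pictured with a small sketch. It assumes that temporal smoothing is an exponential moving average over per-token action attention and that tokens are picked by alternating between the two rankings; both choices, and all names below, are illustrative assumptions, not the selection strategy from the paper.

```python
def ema_update(previous, current, momentum=0.8):
    """Exponential moving average of per-token attention scores (assumed smoother)."""
    return [momentum * p + (1.0 - momentum) * c for p, c in zip(previous, current)]


def ranking(scores):
    """Token indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def select_tokens(semantic_scores, action_scores, budget):
    """Keep `budget` tokens by alternating picks from the two rankings.

    This interleaving is a hypothetical stand-in for the adaptive
    dual-level selection described in the abstract.
    """
    kept = []
    for sem_idx, act_idx in zip(ranking(semantic_scores), ranking(action_scores)):
        for idx in (sem_idx, act_idx):
            if idx not in kept and len(kept) < budget:
                kept.append(idx)
    return sorted(kept)


# Semantic scores come from prefill attention at the current step; action
# scores are smoothed from earlier decode steps, since the current action
# has not been decoded yet when pruning happens.
semantic = [0.9, 0.1, 0.2, 0.8, 0.0]
action = ema_update([0.0, 0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0, 0.0], momentum=0.5)
print(select_tokens(semantic, action, budget=3))
```

Token 2 scores low semantically but survives because the smoothed action attention ranks it first, which is the failure mode of semantic-only pruning that the abstract describes.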

[101] LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs

Doriand Petit,Steve Bourgeois,Vincent Gay-Bellile,Florian Chabot,Loïc Barthe

Main category: cs.CV

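TL;DR: Proposes LLaVA^3, a new method that improves the 3D scene understanding of vision-language models using only multi-view 2D images and no fine-tuning.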

Details

Motivation: 3D training data is scarce while 2D vision-language data is abundant, so using 2D data to improve a model's understanding of 3D scenes is an open challenge. Method: Inspired by Cubist painting, the method performs an intermediate 3D reconstruction from multi-view 2D images and generates an omnidirectional visual representation of each object, which describes the 3D scene to the VLM without fine-tuning. Result: On 3D visual question answering and 3D language grounding, the method outperforms previous 2D-based VLM approaches. Conclusion: LLaVA^3 improves the 3D scene understanding of existing VLMs without additional 3D annotations or model fine-tuning. Abstract: Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLMs). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLMs using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.

[102] FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry

Clemens Pollak,Kersten Diers,Santiago Estrada,David Kügler,Martin Reuter

Main category: cs.CV

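TL;DR: Presents FastSurfer-CC, an efficient, fully automated framework for corpus callosum morphometry that handles segmentation, standardization, and morphometric analysis, and that detects significant differences in Huntington's disease patients which existing methods miss.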

Details

Motivation: Few publicly available tools provide a comprehensive, automated corpus callosum analysis pipeline, which limits its use in research on aging and neurological disease and in clinical trials. Method: Develops FastSurfer-CC, an automated framework that identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head position, and generates thickness profiles, subdivisions, and eight shape metrics for statistical analysis. Result: FastSurfer-CC outperforms existing specialized tools on each subtask and detects statistically significant differences between Huntington's disease patients and healthy controls that the current state of the art does not. Conclusion: FastSurfer-CC is an efficient, fully automated corpus callosum analysis tool with higher sensitivity, suited to biomarker analysis in neurodegenerative disease research and clinical trials. Abstract: The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington's disease patients and healthy controls that are not detected by the current state-of-the-art.

[103] Flow and Depth Assisted Video Prediction with Latent Transformer

Eliyas Suleyman,Paul Henderson,Eksan Firkat,Nicolas Pugeault

Main category: cs.CV

TL;DR: Studies how adding point flow and depth maps improves video prediction models under occlusion and background motion.

Details

Motivation: Occlusion remains an inherent challenge in video prediction. Existing models perform well in standard scenarios but fall short when handling occlusion and background motion. Method: Uses a standard multi-object latent transformer architecture, modified to incorporate depth and point-flow information for future frame prediction. Result: The model that combines point flow and depth performs better in occluded scenes and predicts background motion more accurately. Evaluation uses synthetic and real-world datasets, with appearance-based metrics and Wasserstein distances on object masks. Conclusion: Providing explicit motion and geometric structure information helps video prediction models under occlusion and complex background motion. Abstract: Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.

[104] Physics-Informed Machine Learning for Efficient Sim-to-Real Data Augmentation in Micro-Object Pose Estimation

Zongcai Tan,Lan Wei,Dandan Zhang

Main category: cs.CV

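TL;DR: Proposes a generative adversarial network framework that combines wave-optics physical rendering with depth alignment to efficiently generate high-fidelity microscope images, enabling microrobot pose estimation without large amounts of real data.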

Details

Motivation: Existing methods depend on large, well-annotated microscope image datasets that are hard to obtain, and conventional simulation struggles to reproduce complex optical phenomena such as diffraction and depth-of-field effects, which limits sim-to-real transfer. Method: Integrates wave-optics-based physical rendering and a depth alignment mechanism into a generative adversarial network (GAN) to build a digital twin system that generates high-fidelity synthetic microscope images for microrobot pose estimation. Result: Compared with purely AI-driven methods, the structural similarity index (SSIM) improves by 35.6%, with a rendering time of only 0.022 s per frame. A pose estimator trained on the synthetic data reaches 93.9% (pitch) and 91.9% (roll) accuracy, only 5.0%/5.4% below one trained on real data. Conclusion: The physics-informed generative framework bridges the gap between simulation and reality, supports real-time, high-accuracy microrobot pose estimation, generalizes to unseen poses, and suits data augmentation for novel microrobot configurations. Abstract: Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging. This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame). The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.

[105] Acquisition Time-Informed Breast Tumor Segmentation from Dynamic Contrast-Enhanced MRI

Rui Wang,Yuexi Du,John Lewin,R. Todd Constable,Nicha C. Dvornek

Main category: cs.CV

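TL;DR: Proposes a tumor segmentation method that uses image acquisition time to modulate model features through FiLM layers, improving tumor segmentation performance and model generalization on breast DCE-MRI.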

Details

Motivation: Differences in acquisition protocols and between individuals cause large variation in tissue appearance in breast DCE-MRI, even within the same phase, which makes automated tumor segmentation challenging. Method: Uses an architecture based on feature-wise linear modulation (FiLM) that incorporates acquisition time as prior information to modulate the feature representations at different time points, and trains and evaluates models with different backbones on a multisite dataset. Result: Models that incorporate acquisition time outperform the baselines on both in-domain and out-of-domain datasets, improving segmentation performance and generalization. Conclusion: Modulating features with acquisition time via FiLM makes DCE-MRI tumor segmentation models more robust and adaptable, particularly for multisite, multi-protocol clinical data. Abstract: Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays an important role in breast cancer screening, tumor assessment, and treatment planning and monitoring. The dynamic changes in contrast in different tissues help to highlight the tumor in post-contrast images. However, varying acquisition protocols and individual factors result in large variation in the appearance of tissues, even for images acquired in the same phase (e.g., first post-contrast phase), making automated tumor segmentation challenging. Here, we propose a tumor segmentation method that leverages knowledge of the image acquisition time to modulate model features according to the specific acquisition sequence. We incorporate the acquisition times using feature-wise linear modulation (FiLM) layers, a lightweight method for incorporating temporal information that also allows for capitalizing on the full, variable number of images acquired per imaging study. We trained baseline and different configurations for the time-modulated models with varying backbone architectures on a large public multisite breast DCE-MRI dataset. Evaluation on in-domain images and a public out-of-domain dataset showed that incorporating knowledge of phase acquisition time improved tumor segmentation performance and model generalization.
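For readers unfamiliar with FiLM, the sketch below shows the core operation: each feature channel is scaled and shifted by parameters predicted from a conditioning input, here a scalar acquisition time. The linear conditioning function and its weights are hypothetical placeholders; the abstract does not specify how the paper maps acquisition time to the modulation parameters.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma_c * x + beta_c for each channel c."""
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


def time_to_film_params(t_seconds, w_gamma, b_gamma, w_beta, b_beta):
    """Hypothetical linear conditioning network mapping a time to (gamma, beta)."""
    gamma = [w * t_seconds + b for w, b in zip(w_gamma, b_gamma)]
    beta = [w * t_seconds + b for w, b in zip(w_beta, b_beta)]
    return gamma, beta


# Two channels, two values per channel; weights are made up for illustration.
features = [[1.0, 2.0], [3.0, 4.0]]
gamma, beta = time_to_film_params(2.0, [0.5, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0])
print(film(features, gamma, beta))
```

Because the modulation is applied per image, the same layer can process however many post-contrast images a study contains, each with its own acquisition time.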

[106] YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras

Fan Yang,Sosuke Yamao,Ikuo Kusajima,Atsunori Moteki,Shoichi Masui,Shan Jiang

Main category: cs.CV

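TL;DR: Proposes a new method for jointly mapping an indoor scene and registering ceiling-mounted cameras (CMCs): a mobile agent carries an RGB-D camera while synchronized CMCs record it, enabling efficient and accurate scene layout reconstruction and camera localization.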

Details

Motivation: Manual and automatic registration of ceiling-mounted cameras (CMCs) in indoor scenes is inefficient and inaccurate, and accurate alignment is especially hard under visual ambiguity. Method: A mobile agent wearing a head-mounted RGB-D camera traverses the scene, producing an egocentric trajectory and the scene layout. The CMCs capture the agent at the same time, yielding pseudo-scale trajectories and relative poses. All trajectories are associated through their timestamps, and a factor graph jointly optimizes ego-camera poses, scene layout, and CMC poses. Result: Experiments show that the method accomplishes both scene mapping and CMC registration within a unified framework and that the two tasks improve each other. The authors also release the first benchmark dataset for collaborative mapping and CMC registration. Conclusion: The method achieves efficient, accurate indoor scene modeling and CMC registration, providing a reliable tool for position-aware applications. Abstract: Using ceiling-mounted cameras (CMCs) for indoor visual capturing opens up a wide range of applications. However, registering CMCs to the target scene layout presents a challenging task. While manual registration with specialized tools is inefficient and costly, automatic registration with visual localization may yield poor results when visual ambiguity exists. To alleviate these issues, we propose a novel solution for jointly mapping an indoor scene and registering CMCs to the scene layout. Our approach involves equipping a mobile agent with a head-mounted RGB-D camera to traverse the entire scene once and synchronize CMCs to capture this mobile agent. The egocentric videos generate world-coordinate agent trajectories and the scene layout, while the videos of CMCs provide pseudo-scale agent trajectories and CMC relative poses. By correlating all the trajectories with their corresponding timestamps, the CMC relative poses can be aligned to the world-coordinate scene layout. Based on this initialization, a factor graph is customized to enable the joint optimization of ego-camera poses, scene layout, and CMC poses. We also develop a new dataset, setting the first benchmark for collaborative scene mapping and CMC registration (https://sites.google.com/view/yowo/home). Experimental results indicate that our method not only effectively accomplishes two tasks within a unified framework, but also jointly enhances their performance. We thus provide a reliable tool to facilitate downstream position-aware applications.

Rahul Kumar,Vipul Baghel,Sudhanshu Singh,Bikash Kumar Badatya,Shivam Yadav,Babji Srinivasan,Ravi Hegde

Main category: cs.CV

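TL;DR: Presents a high-quality video dataset designed for punch detection and classification in boxing. It contains 6,915 annotated punch clips from 20 YouTube sparring videos, covers six punch types and 18 athletes, and supports research on real-time action recognition in unconstrained environments.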

Details

Motivation: Because combat sports actions are dynamic and recording environments vary, computer vision analysis lacks sufficiently large and robust datasets, so a precisely annotated, diverse boxing punch dataset is needed to advance the research. Method: High-quality punch clips are extracted from public YouTube sparring videos, then manually segmented and labeled to build a structured dataset of six punch types with consistent temporal boundaries and classes. Result: The dataset contains 6,915 punch clips from 20 videos of 18 athletes, covering varied motion styles, camera angles, and physiques, with high annotation quality and diversity. Conclusion: The dataset offers a solid benchmark for real-time vision-based action recognition and supports progress in movement analysis, automated coaching, and performance assessment in boxing and related domains. Abstract: Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.

[108] Contrastive vision-language learning with paraphrasing and negation

Kwun Ho Ngan,Saman Sadeghi Afgeh,Joe Townsend,Artur d'Avila Garcez

Main category: cs.CV

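TL;DR: Proposes SemCLIP, which modifies CLIP's contrastive loss and trains on LLM-generated triples of original, paraphrased, and negated captions. This improves semantic robustness to negation and paraphrasing and substantially improves the model's ability to distinguish negated text on benchmarks such as CC-Neg.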

Details

Motivation: Existing vision-language models behave inconsistently on negated or paraphrased text: negation changes meaning drastically with little lexical change, while paraphrasing changes the wording substantially with the same meaning. Models need to become more robust to such semantic transformations. Method: Proposes SemCLIP, with a new contrastive loss that accounts for both paraphrasing and negation, trained on LLM-generated triples of original, paraphrased, and negated captions so that paraphrased captions move toward the original image embedding and negated captions move away from it. Result: On the CC-Neg benchmark, image-retrieval accuracy rises from 68.1% to 78.1%. On Sugarcrepe++ the results are mixed, but generally better than those of models trained with negated captions. On zero-shot classification, SemCLIP outperforms CLIP on all tested tasks. Conclusion: SemCLIP improves the robustness of vision-language models to semantic transformations such as negation and paraphrasing while preserving the original performance, which makes it promising for practical use. Abstract: Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP's performance while increasing considerably the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.

[109] Enhancing Multi-Camera Gymnast Tracking Through Domain Knowledge Integration

Fan Yang,Shigeyuki Odashima,Shoichi Masui,Ikuo Kusajima,Sosuke Yamao,Shan Jiang

Main category: cs.CV

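TL;DR: Proposes a multi-camera cascaded data association method that incorporates gymnastics domain knowledge. When detections are insufficient, it uses ray-plane intersection to generate coplanar 3D trajectory candidates, making gymnast tracking more robust. The method has been deployed in the judging system at international gymnastics championships.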

Details

Motivation: With a limited number of cameras and with variations in lighting, background, uniforms, and occlusion, conventional multi-camera triangulation struggles to estimate a gymnast's 3D trajectory accurately, especially when detection fails in some views. Method: Introduces gymnastics domain knowledge by assuming that the gymnast's 3D center lies within a predefined vertical plane, and uses ray-plane intersection to generate coplanar 3D trajectory candidates. A cascaded data association paradigm uses triangulation when cross-view detections are sufficient and falls back on ray-plane intersection when they are not. Result: Experiments show that the method outperforms existing methods in challenging scenarios and markedly reduces tracking failures. It has been applied in the judging system at recent Gymnastics World Championships and has earned significant recognition from the International Gymnastics Federation. Conclusion: Cascaded data association combined with domain knowledge makes multi-camera tracking more robust in complex real-world scenes and provides reliable technical support for automated gymnastics scoring systems. Abstract: We present a robust multi-camera gymnast tracking method, which has been applied at international gymnastics championships for gymnastics judging. Despite considerable progress in multi-camera tracking algorithms, tracking gymnasts presents unique challenges: (i) due to space restrictions, only a limited number of cameras can be installed in the gymnastics stadium; and (ii) due to variations in lighting, background, uniforms, and occlusions, multi-camera gymnast detection may fail in certain views and only provide valid detections from two opposing views. These factors complicate the accurate determination of a gymnast's 3D trajectory using conventional multi-camera triangulation. To alleviate this issue, we incorporate gymnastics domain knowledge into our tracking solution. Given that a gymnast's 3D center typically lies within a predefined vertical plane during much of their performance, we can apply a ray-plane intersection to generate coplanar 3D trajectory candidates for opposing-view detections. More specifically, we propose a novel cascaded data association (DA) paradigm that employs triangulation to generate 3D trajectory candidates when cross-view detections are sufficient, and resorts to the ray-plane intersection when they are insufficient. Consequently, coplanar candidates are used to compensate for uncertain trajectories, thereby minimizing tracking failures. The robustness of our method is validated through extensive experimentation, demonstrating its superiority over existing methods in challenging scenarios. Furthermore, our gymnastics judging system, equipped with this tracking method, has been successfully applied to recent Gymnastics World Championships, earning significant recognition from the International Gymnastics Federation.
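The ray-plane intersection at the heart of the fallback step is standard geometry and can be sketched in a few lines. The camera position, ray direction, and plane below are made-up values, and the function name is hypothetical; the abstract does not describe how the paper parameterizes its rays or its vertical plane.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ray_plane_intersection(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Return the 3D point where a camera ray meets a plane, or None.

    The ray is origin + t * direction with t >= 0. The plane passes through
    plane_point and has normal plane_normal.
    """
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None  # ray is parallel to the plane
    offset = [p - o for p, o in zip(plane_point, origin)]
    t = dot(offset, plane_normal) / denom
    if t < 0:
        return None  # plane lies behind the camera
    return tuple(o + t * d for o, d in zip(origin, direction))


# Hypothetical setup: vertical plane x = 3, camera at height 2 looking diagonally.
candidate = ray_plane_intersection(
    origin=(0.0, 0.0, 2.0),
    direction=(1.0, 1.0, 0.0),
    plane_point=(3.0, 0.0, 0.0),
    plane_normal=(1.0, 0.0, 0.0),
)
print(candidate)
```

A single view suffices to produce a 3D candidate this way, which is why the plane prior helps when two opposing views make triangulation ill-conditioned.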

[110] Investigating Optical Flow Computation: From Local Methods to a Multiresolution Horn-Schunck Implementation with Bilinear Interpolation

Haytham Ziani

Main category: cs.CV

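TL;DR: Analyzes local and global optical flow methods with a focus on the Horn-Schunck algorithm, and implements a multiresolution version to improve accuracy and convergence.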

Details

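Motivation: To make inter-frame motion estimation more accurate and robust under varying image conditions, the paper studies strategies that combine local and global methods. Method: Analyzes methods such as Lucas-Kanade and Horn-Schunck from both theoretical and practical angles, and implements a multiresolution Horn-Schunck algorithm based on bilinear interpolation and prolongation. Result: The multiresolution strategy improves the accuracy and convergence speed of the Horn-Schunck algorithm and yields better motion estimation under varying image conditions. Conclusion: Global methods combined with multiresolution techniques improve optical flow estimation, particularly for complex, changing visual scenes. Abstract: This paper presents an applied analysis of local and global methods, with a focus on the Horn-Schunck algorithm for optical flow computation. We explore the theoretical and practical aspects of local approaches, such as the Lucas-Kanade method, and global techniques such as Horn-Schunck. Additionally, we implement a multiresolution version of the Horn-Schunck algorithm, using bilinear interpolation and prolongation to improve accuracy and convergence. The study investigates the effectiveness of these combined strategies in estimating motion between frames, particularly under varying image conditions.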
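The building blocks named in the abstract can be sketched as follows: the classic Horn-Schunck update at one pixel, bilinear sampling, and a prolongation step that carries a coarse flow field to the next finer level. This is a minimal sketch under assumptions: border handling by clamping and scaling the flow by the resolution ratio are common pyramid conventions, not details confirmed by the paper, and the function names are hypothetical.

```python
def horn_schunck_update(u_avg, v_avg, ix, iy, it, alpha):
    """One Horn-Schunck iteration at a pixel.

    u_avg, v_avg: neighborhood averages of the current flow
    ix, iy, it:   spatial and temporal image derivatives
    alpha:        smoothness weight
    """
    common = (ix * u_avg + iy * v_avg + it) / (alpha ** 2 + ix ** 2 + iy ** 2)
    return u_avg - ix * common, v_avg - iy * common


def bilinear_sample(grid, x, y):
    """Sample a 2D grid (list of rows) at real-valued (x, y), clamped to the border."""
    h, w = len(grid), len(grid[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * grid[y0][x0] + fx * grid[y0][x1]
    bottom = (1 - fx) * grid[y1][x0] + fx * grid[y1][x1]
    return (1 - fy) * top + fy * bottom


def prolong(flow, scale=2):
    """Upsample one flow component and rescale its magnitude (assumed convention)."""
    h, w = len(flow), len(flow[0])
    return [[scale * bilinear_sample(flow, x / scale, y / scale)
             for x in range(w * scale)]
            for y in range(h * scale)]


print(horn_schunck_update(0.0, 0.0, ix=1.0, iy=0.0, it=-1.0, alpha=1.0))
print(bilinear_sample([[0.0, 10.0], [20.0, 30.0]], 0.5, 0.5))
```

In a coarse-to-fine scheme, the prolonged flow initializes the iterations at the finer level, so large displacements are captured at low resolution where they span only a few pixels.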

[111] Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution

Jaime Álvarez Urueña,David Camacho,Javier Huertas Tato

Main category: cs.CV

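TL;DR: Proposes a two-stage synthetic image detection framework that extracts discriminative embeddings through contrastive learning and pairs them with a few-shot k-nearest neighbors classifier, enabling efficient detection and attribution of new generative models without frequent retraining.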

Details

Motivation: Generative AI is advancing quickly, and traditional detectors that rely on periodic retraining cannot keep up with newly released generative models, so detection schemes with good generalization are needed. Method: In the first stage, a vision deep learning model is trained with supervised contrastive learning to extract image embeddings, with some generator architectures withheld from training to test cross-generator generalization. In the second stage, a few-shot k-nearest neighbors classifier operates on the embedding space. Result: In a few-shot setting with only 150 images per class, average detection accuracy reaches 91.3%, a 5.2 percentage point improvement over existing methods. On open-set classification, AUC and OSCR improve by 14.70% and 4.27% respectively. Conclusion: The framework markedly improves generalized detection and attribution for unseen generators, offering a scalable, low-maintenance digital forensics solution for a rapidly evolving generative AI landscape. Abstract: The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously evaluate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches. For the source attribution task, the proposed approach obtains improvements of 14.70% and 4.27% in AUC and OSCR respectively in an open-set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
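The second stage can be pictured as a plain k-NN vote over embeddings from the frozen first-stage encoder. The sketch below assumes cosine similarity and majority voting with k = 3; the distance measure, the value of k, the labels, and the toy 2D embeddings are illustrative assumptions, since the abstract does not state them.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def knn_predict(query, support, k=3):
    """Majority label among the k support embeddings most similar to the query.

    support: list of (embedding, label) pairs, e.g. a few images per class
    from a newly released generator plus real images.
    """
    ranked = sorted(support, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2D embeddings standing in for the encoder output.
support_set = [
    ([1.0, 0.0], "real"),
    ([0.9, 0.1], "real"),
    ([0.0, 1.0], "generator_x"),
    ([0.1, 0.9], "generator_x"),
]
print(knn_predict([1.0, 0.05], support_set))
```

Adding a new generator then only requires appending a few labeled embeddings to the support set, with no gradient updates to the encoder.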

Pierrick Bournez,Luca Savant Aira,Thibaud Ehret,Gabriele Facciolo

Main category: cs.CV

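TL;DR: EOGS++ is a new 3D Gaussian splatting method for satellite imagery that operates directly on raw high-resolution panchromatic data and integrates optical flow and bundle adjustment into training, improving reconstruction quality and geometric accuracy.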

Details

Motivation: Existing Earth observation methods depend on preprocessing and external optimization tools; the goal is to remove that dependence while improving reconstruction efficiency and camera pose estimation accuracy. Method: Proposes EOGS++, which operates directly on raw high-resolution panchromatic data, uses optical flow to embed bundle adjustment in training, and adds early stopping and TSDF post-processing. Result: EOGS++ achieves state-of-the-art performance on the IARPA 2016 and DFC2019 datasets, reducing the mean MAE on buildings from 1.33 to 1.19 and outperforming EOGS and other NeRF-based methods. Conclusion: EOGS++ improves 3D reconstruction quality and geometric accuracy for Earth observation while remaining computationally efficient, making it a strong alternative to NeRF. Abstract: Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering competitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and efficiency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model reduces the mean MAE on buildings from 1.33 to 1.19 compared to the original EOGS models.

[113] Progressive Supernet Training for Efficient Visual Autoregressive Modeling

Xiaoyue Chen,Yuling Shi,Kaiyuan Li,Huandong Wang,Yong Li,Xiaodong Gu,Xinlei Chen,Mingbao Lin

Main category: cs.CV

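TL;DR: Proposes VARiant, in which subnets share weights with the full network and model depth is adjusted dynamically. It substantially reduces memory consumption and speeds up inference while preserving generation quality, which suits deployment across varied scenarios.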

Details

Motivation: VAR models incur high memory overhead from cumulative KV caching during multi-scale generation, and different generation stages differ in their sensitivity to network depth, so a more flexible and efficient inference architecture is needed. Method: Building on the observed scale-depth asymmetric dependency, equidistant sampling is used to construct subnets of several depths from the 30-layer backbone. Early scales use the full network and later scales use a subnet, with shared weights, and a progressive training strategy mitigates optimization conflicts. Result: On ImageNet, VARiant-d16 and VARiant-d8 come close to the original model's quality (FID 2.05/2.12 vs. 1.95) while using 40-65% less memory. VARiant-d2 achieves a 3.5x speedup and 80% memory reduction (FID 2.97). Depth can be switched at runtime at no cost. Conclusion: With flexible depth adjustment in a single model, VARiant offers a better trade-off between generation quality and efficiency and makes VAR models easier to deploy across diverse applications. Abstract: Visual Auto-Regressive (VAR) models significantly reduce inference steps through the "next-scale" prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales utilize a subnet. The subnet and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between the subnet and the entire network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant's single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.
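One plausible reading of "equidistant sampling" is to pick evenly spaced layer indices from the 30-layer backbone, and to route each scale either through all layers or through that subset. The sketch below is an assumption-laden illustration: keeping the first and last layers, the rounding rule, and the `switch_scale` parameter are hypothetical choices that the abstract does not confirm.

```python
def equidistant_layers(total_layers, subnet_depth):
    """Evenly spaced layer indices, always including the first and last layer."""
    if subnet_depth < 2 or subnet_depth > total_layers:
        raise ValueError("subnet_depth must be between 2 and total_layers")
    step = (total_layers - 1) / (subnet_depth - 1)
    return [int(i * step + 0.5) for i in range(subnet_depth)]


def layers_for_scale(scale_index, switch_scale, total_layers, subnet_depth):
    """Full network for early scales, shared-weight subnet for later ones.

    switch_scale is a hypothetical threshold: scales below it use every layer.
    """
    if scale_index < switch_scale:
        return list(range(total_layers))
    return equidistant_layers(total_layers, subnet_depth)


print(equidistant_layers(30, 2))
print(len(layers_for_scale(0, 4, 30, 8)), len(layers_for_scale(6, 4, 30, 8)))
```

Because the subnet reuses the backbone's own layers, switching depth at runtime amounts to changing which indices are executed, with no second set of weights to load.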

[114] Lite Any Stereo: Efficient Zero-Shot Stereo Matching

Junpeng Jing,Weixun Luo,Ye Mao,Krystian Mikolajczyk

Main category: cs.CV

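TL;DR: Presents Lite Any Stereo, an efficient stereo depth estimation framework with strong zero-shot generalization that reaches leading performance on several real-world benchmarks at very low computational cost.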

Details

Motivation: Efficient stereo matching models are usually considered incapable of zero-shot generalization, and high-accuracy models tend to be computationally expensive. The paper aims to build a model that is both efficient and generalizes well. Method: Designs a compact yet expressive backbone and a hybrid cost aggregation module, and uses a three-stage large-scale training strategy to narrow the sim-to-real gap. Result: The model ranks first on four widely used real-world benchmarks, with accuracy comparable to or better than existing accurate non-prior-based methods at less than 1% of their computational cost. Conclusion: An ultra-light model can achieve strong zero-shot generalization, setting a new standard for efficient stereo matching. Abstract: Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.

[115] NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening

Misaal Khan,Mayank Vatsa,Kuldeep Singh,Richa Singh

Main category: cs.CV

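TL;DR: NutriScreener is a child malnutrition screening tool based on a multi-pose graph attention network. It combines visual embeddings with knowledge retrieval to deliver efficient, accurate malnutrition detection and anthropometric prediction in low-resource settings.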

Details

Motivation: Existing child malnutrition screening methods are laborious and hard to scale, which hinders early intervention, especially in resource-poor regions. Method: Proposes NutriScreener, which combines CLIP visual embeddings, class-boosted knowledge retrieval, and a context-aware multi-pose graph attention network, using retrieval augmentation for malnutrition detection and anthropometric prediction. Result: Trained and tested on data from 2,141 children, the model reaches 0.79 recall and 0.82 AUC with significantly lower RMSE. Cross-dataset experiments show up to 25% recall gain and up to 3.5 cm RMSE reduction. Doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency. Conclusion: NutriScreener generalizes well, is robust and scalable, and is suited to deployment in low-resource settings, offering a reliable solution for early screening of child malnutrition. Abstract: Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children's images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.

[116] POMA-3D: The Point Map Way to 3D Scene Understanding

Ye Mao,Weixun Luo,Ranran Huang,Junpeng Jing,Krystian Mikolajczyk

Main category: cs.CV

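TL;DR: Presents POMA-3D, the first self-supervised 3D representation model learned from point maps. A view-to-scene alignment strategy and the POMA-JEPA architecture enforce geometric consistency across views, and pretraining on the large-scale ScenePoint dataset yields strong results on a range of 3D understanding tasks using only 3D coordinates.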

Details

Motivation: 3D representation learning suffers from a lack of pretrained priors and from limited data; the paper addresses both while using the rich priors of 2D foundation models to improve 3D scene understanding. Method: Uses point maps as the input representation, designs a view-to-scene alignment strategy to transfer 2D priors, and introduces POMA-JEPA, a joint embedding-predictive architecture that enforces geometric consistency across views. A large-scale ScenePoint dataset containing 6.5K room-level RGB-D scenes is built for pretraining. Result: POMA-3D performs strongly on 3D question answering, embodied navigation, scene retrieval, and embodied localization, relying only on geometric inputs (3D coordinates). Conclusion: POMA-3D explores a new point-map-based route to 3D scene understanding that combines 2D priors with 3D geometric structure and offers a scalable solution for self-supervised 3D representation learning. Abstract: In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/

[117] Erase to Retain: Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks

Nirjhor Datta,Md. Golam Rabiul Alam

Main category: cs.CV

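TL;DR: Proposes Erase to Retain, a controllable unlearning framework for selective knowledge removal in medical image segmentation. It builds on LoRA subspace updates and teacher-student distillation to achieve targeted forgetting without full retraining while preserving overall model performance.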

Details

Motivation: Privacy compliance, ethical deployment, and continual dataset revision in medical image analysis call for selectively removing specific knowledge (such as sensitive lesion information) from segmentation networks while keeping effective segmentation of other structures. Method: Uses a teacher-student distillation framework with subspace updates constrained by Low-Rank Adaptation (LoRA): in the strong unlearning phase, LoRA modules are adversarially optimized so that the student contradicts the teacher's predictions on a designated forget subset; in the gentle restoration phase, only the head is fine-tuned to recover generalization on retained data. Result: On ISIC segmentation, forget-set IoU drops from 0.875 to 0.509 while retain and validation performance stays stable (IoU 0.647 to 0.677); the cross-domain CHASE dataset shows the same pattern of forgetting with preserved utility; on ISIC classification, forget-subset accuracy falls from 87.0% to 64.1% while retain accuracy rises from 83.9% to 90.6%. Conclusion: LoRA-based subspace unlearning offers a practical, controllable, and reversible route to knowledge removal in medical image analysis, removing sensitive information while preserving performance where it matters. Abstract: The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher's confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement. For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent.
These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.
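
As a reminder of what a LoRA-constrained update looks like, the sketch below forms the effective weight W + (alpha / r) * B A from a frozen weight W and two low-rank factors. This is the standard LoRA parameterization; which decoder layers the paper adapts and how its adversarial objective is defined are not given in the abstract, and the helper names here are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A); W itself stays frozen.

    w: out x in, lora_b: out x rank, lora_a: rank x in.
    """
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example: 2 x 2 frozen weight with a rank-1 update.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[3.0, 4.0]]
w_eff = lora_effective_weight(w, lora_a, lora_b, alpha=1.0, rank=1)
```

Because only the two small factors are trained, removing them restores the original weights exactly, which is one plausible reading of why the abstract describes the unlearning as reversible.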

[118] Generative AI for Enhanced Wildfire Detection: Bridging the Synthetic-Real Domain Gap

Satyam Gaba

Main category: cs.CV

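TL;DR: This paper uses generative AI to synthesize an annotated smoke dataset, addressing the scarcity of real smoke data. It combines unsupervised domain adaptation with generative techniques (GANs, style transfer, and image matting) to make the synthetic data more realistic, with the aim of more accurate and scalable wildfire smoke detection.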

Details

Motivation: The lack of large annotated smoke datasets limits the use of deep neural networks for wildfire smoke detection, so an effective way to ease the data shortage is needed. Method: Builds a synthetic smoke dataset with generative AI, then applies unsupervised domain adaptation together with style transfer, GANs, and image matting to narrow the domain gap between synthetic and real data. Result: The abstract reports no quantitative results; the combined methods are intended to make synthetic data more realistic and to narrow the gap between synthetic and real-world data for smoke plume segmentation. Conclusion: Combining generative AI with domain adaptation is presented as a way to overcome data scarcity in smoke detection, toward more accurate and scalable early wildfire detection. Abstract: The early detection of wildfires is a critical environmental challenge, with timely identification of smoke plumes being key to mitigating large-scale damage. While deep neural networks have proven highly effective for localization tasks, the scarcity of large, annotated datasets for smoke detection limits their potential. In response, we leverage generative AI techniques to address this data limitation by synthesizing a comprehensive, annotated smoke dataset. We then explore unsupervised domain adaptation methods for smoke plume segmentation, analyzing their effectiveness in closing the gap between synthetic and real-world data. To further refine performance, we integrate advanced generative approaches such as style transfer, Generative Adversarial Networks (GANs), and image matting. These methods aim to enhance the realism of synthetic data and bridge the domain disparity, paving the way for more accurate and scalable wildfire detection models.

[119] SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

Haofeng Liu,Ziyue Wang,Sudhanshu Mishra,Mingqi Gao,Guanyi Qin,Chang Han Low,Alex Y. W. Kong,Yueming Jin

Main category: cs.CV

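TL;DR: This paper introduces SA-SV, the largest benchmark to date for surgical video segmentation, and builds on it to propose SAM2S, which improves SAM2 for interactive video object segmentation with strong long-term tracking and zero-shot generalization.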

Details

Motivation: Existing interactive video object segmentation models such as SAM2 suffer from a domain gap and limited long-term tracking in surgical scenes, so optimization and evaluation tailored to surgical video are needed. Method: Builds SA-SV, a large dataset spanning eight procedure types, and proposes SAM2S, which adds a trainable diverse memory mechanism (DiveMem), temporal semantic learning, and ambiguity-resilient learning to improve segmentation and tracking. Result: Fine-tuning on SA-SV improves SAM2 by 12.99 average J&F; SAM2S reaches 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while keeping 68 FPS real-time inference and strong zero-shot generalization. Conclusion: SAM2S substantially improves interactive segmentation and long-term tracking in surgical video, and SA-SV provides an important resource for future work on computer-assisted surgery. Abstract: Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}\&\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}\&\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
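
The three reported margins can be checked against one another. The two baseline scores below are derived from the stated differences, not quoted from the paper:

```latex
\begin{align*}
\text{vanilla SAM2} &= 80.42 - 17.10 = 63.32,\\
\text{fine-tuned SAM2} &= 80.42 - 4.11 = 76.31,\\
\text{fine-tuning gain} &= 76.31 - 63.32 = 12.99.
\end{align*}
```

The derived gain matches the 12.99-point improvement stated for fine-tuning on SA-SV, so the figures are mutually consistent.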

[120] Improving Long-Tailed Object Detection with Balanced Group Softmax and Metric Learning

Satyam Gaba

Main category: cs.CV

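TL;DR: This paper tackles 2D object detection under long-tailed class distributions. It proposes enhancements to the Balanced Group Softmax framework and combines them with metric learning and k-NN classification to improve rare-class performance, reaching a new best of 24.5% mAP on LVISv1.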

Details

Motivation: Long-tailed class distributions in real-world data bias detectors toward frequent classes and weaken detection of rare categories, so effective ways to mitigate class imbalance are needed. Method: Builds on a two-stage Faster R-CNN architecture and enhances the Balanced Group Softmax (BAGS) framework; adds metric learning so that features are well separated across classes and compact within each class, and uses k-Nearest Neighbors for classification at inference. Result: Achieves 24.5% mAP on LVISv1, above the previous benchmark of 24.0%. Conclusion: The enhanced BAGS framework, combined with metric learning and k-NN inference, mitigates the class imbalance caused by long-tailed distributions and improves detection of rare categories. Abstract: Object detection has been widely explored for class-balanced datasets such as COCO. However, real-world scenarios introduce the challenge of long-tailed distributions, where numerous categories contain only a few instances. This inherent class imbalance biases detection models towards the more frequent classes, degrading performance on rare categories. In this paper, we tackle the problem of long-tailed 2D object detection using the LVISv1 dataset, which consists of 1,203 categories and 164,000 images. We employ a two-stage Faster R-CNN architecture and propose enhancements to the Balanced Group Softmax (BAGS) framework to mitigate class imbalance. Our approach achieves a new state-of-the-art performance with a mean Average Precision (mAP) of 24.5%, surpassing the previous benchmark of 24.0%. Additionally, we hypothesize that tail class features may form smaller, denser clusters within the feature space of head classes, making classification challenging for regression-based classifiers. To address this issue, we explore metric learning to produce feature embeddings that are both well-separated across classes and tightly clustered within each class. For inference, we utilize a k-Nearest Neighbors (k-NN) approach to improve classification performance, particularly for rare classes. Our results demonstrate the effectiveness of these methods in advancing long-tailed object detection.
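
The k-NN inference step can be illustrated with a minimal sketch: classify a query embedding by majority vote among its nearest stored embeddings. The Euclidean metric, the value of `k`, and the toy 2D embeddings are assumptions for illustration; the abstract does not specify the paper's distance measure or feature dimensionality.

```python
import math
from collections import Counter


def knn_classify(query, embeddings, labels, k=3):
    """Return the majority label among the k stored embeddings nearest to query."""
    # Sort stored embeddings by Euclidean distance to the query.
    ranked = sorted(
        (math.dist(query, emb), label) for emb, label in zip(embeddings, labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy embedding bank: a small, tight "rare" cluster and a larger "head" cluster.
embeddings = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["rare", "rare", "head", "head", "head"]
prediction = knn_classify((0.2, 0.3), embeddings, labels, k=3)
```

With well-separated, tightly clustered embeddings, a query near the small cluster is still assigned the rare label even though head-class samples outnumber it in the bank.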

[121] Adaptive Guided Upsampling for Low-light Image Enhancement

Angela Vivian Dcosta,Chunbo Song,Rafael Radkowski

Main category: cs.CV

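TL;DR: Proposes Adaptive Guided Upsampling (AGU), an efficient method for improving low-light image quality that optimizes several characteristics at once, such as noise reduction and sharpness, and produces high-quality images in real time.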

Details

Motivation: With existing guided image methods, low-light images carry too little usable signal because of high noise and low brightness, so image quality is not improved effectively. Method: A guided image method with multi-parameter optimization that uses machine learning to learn the association between low-light and bright image characteristics from a few image pairs, enabling adaptive upsampling. Result: AGU renders high-quality images in real time, and experiments show it outperforms state-of-the-art methods in the low-light use case. Conclusion: By learning the mapping between low-light and bright image characteristics, AGU addresses the poor performance of conventional guided methods in low light and improves the quality of upsampled images. Abstract: We introduce Adaptive Guided Upsampling (AGU), an efficient method for upscaling low-light images capable of optimizing multiple image quality characteristics at the same time, such as reducing noise and increasing sharpness. It is based on a guided image method, which transfers image characteristics from a guidance image to the target image. Using state-of-the-art guided methods, low-light images lack sufficient characteristics for this purpose due to their high noise level and low brightness, rendering suboptimal/not significantly improved images in the process. We solve this problem with multi-parameter optimization, learning the association between multiple low-light and bright image characteristics. Our proposed machine learning method learns these characteristics from a few sample image pairs. AGU can render high-quality images in real time using low-quality, low-resolution input; our experiments demonstrate that it is superior to state-of-the-art methods in the addressed low-light use case.

[122] SAM 3D: 3Dfy Anything in Images

SAM 3D Team,Xingyu Chen,Fu-Jen Chu,Pierre Gleize,Kevin J Liang,Alexander Sax,Hao Tang,Weiyao Wang,Michelle Guo,Thibaut Hardin,Xiang Li,Aohan Lin,Jiawei Liu,Ziqi Ma,Anushka Sagar,Bowen Song,Xiaodong Wang,Jianing Yang,Bowen Zhang,Piotr Dollár,Georgia Gkioxari,Matt Feiszli,Jitendra Malik

Main category: cs.CV

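TL;DR: SAM 3D is a generative model that predicts 3D object geometry, texture, and layout from a single image. It builds large-scale visually grounded 3D data with a human- and model-in-the-loop annotation pipeline, combines synthetic pretraining with real-world alignment, and clearly outperforms existing methods on natural scenes.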

Details

Motivation: Existing 3D object reconstruction methods handle occlusion and scene clutter in real-world images poorly, and the lack of large-scale visually grounded training data limits generalization. Method: Proposes SAM 3D, which uses a human- and model-in-the-loop annotation pipeline to collect high-quality 3D shape, texture, and pose data, and a multi-stage training framework that combines synthetic pretraining with real-world alignment. Result: In human preference tests on real-world objects and scenes, it achieves at least a 5:1 win rate over recent work; the authors plan to release code, model weights, an online demo, and a new benchmark. Conclusion: With large-scale visually grounded data and a modern training framework, SAM 3D breaks through the data bottleneck in 3D reconstruction and delivers better visually grounded reconstruction in complex real-world scenes. Abstract: We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

[123] TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming

Zeyuan Yin,Xiaoming Liu

Main category: cs.CV

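TL;DR: This paper proposes TRIM, a post-training method that accelerates inference for 3D Gaussian diffusion models through trajectory reduction and instance mask denoising, improving efficiency without sacrificing generation quality.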

Details

Motivation: Existing 3D Gaussian diffusion models denoise slowly because of the very large number of Gaussian primitives, which makes generation slow and limits scalability, so inference efficiency needs to improve. Method: TRIM combines two strategies: (1) a lightweight selector model evaluates latent Gaussian primitives to enable early trajectory reduction toward high-quality candidates; (2) instance mask denoising prunes learnable Gaussian primitives by filtering out redundant background regions, reducing computation at each denoising step. Result: Experiments show that TRIM improves both the efficiency and the quality of 3D generation, and supports inference-time scaling. Conclusion: TRIM offers an efficient, scalable way to accelerate inference for 3D Gaussian diffusion models, with practical potential. Abstract: Recent advances in 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose TRIM (Trajectory Reduction and Instance Mask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies, to accelerate inference without compromising output quality while supporting the inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at https://github.com/zeyuanyin/TRIM.

[124] Late-decoupled 3D Hierarchical Semantic Segmentation with Semantic Prototype Discrimination based Bi-branch Supervision

Shuyu Cao,Chongshou Li,Jie Xu,Tianrui Li,Na Zhao

Main category: cs.CV

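TL;DR: Proposes a new framework for 3D hierarchical semantic segmentation that uses a late-decoupled architecture and a bi-branch supervision mechanism to address cross-hierarchy optimization conflicts and class imbalance, achieving state-of-the-art performance.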

Details

Motivation: Existing 3DHS methods overlook multi-hierarchy conflicts in cross-hierarchy optimization and class imbalance across the hierarchies of 3D scenes, both of which hurt model performance. Method: Designs a framework with a primary 3DHS branch and an auxiliary discrimination branch; adopts a late-decoupled architecture with coarse-to-fine hierarchical guidance and consistency constraints; introduces a semantic-prototype-based bi-branch supervision mechanism to strengthen segmentation under class imbalance. Result: Experiments on multiple datasets and backbones show state-of-the-art 3DHS performance, and the core components can serve as plug-and-play modules that improve existing methods. Conclusion: The proposed framework mitigates multi-hierarchy conflicts and class imbalance, improves 3D hierarchical semantic segmentation, and generalizes and extends well. Abstract: 3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite the progress, previous 3DHS methods have overlooked the following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) the class imbalance issue is inevitable across multiple hierarchies of 3D scenes, which makes the model performance become dominated by major classes. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework which employs multiple decoders with the coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS-oriented semantic prototype based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to enhance the class-imbalance segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and its core components can also be used as a plug-and-play enhancement to improve previous methods.

[125] Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation

Md. Samiul Alim,Sharjil Khan,Amrijit Biswas,Fuad Rahman,Shafin Rahman,Nabeel Mohammed

Main category: cs.CV

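TL;DR: Proposes a teacher-guided pruning framework that integrates knowledge distillation, enabling efficient one-shot global pruning that keeps performance high at high sparsity while reducing computational overhead.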

Details

Motivation: Unstructured pruning usually needs repeated train-prune-retrain cycles, which are computationally expensive, and lacks an effective way to identify the parameters most critical for knowledge transfer. Method: Incorporates gradient signals from the teacher model into importance score calculation, tightly coupling knowledge distillation with importance estimation; proposes a one-shot global pruning strategy, followed by sparsity-aware retraining to recover accuracy. Result: Validated on CIFAR-10, CIFAR-100, and TinyImageNet; at high sparsity it outperforms state-of-the-art baselines such as EPG and EPSD, and it is more efficient than iterative schemes such as COLT. Conclusion: The framework balances pruning efficiency and model performance, suits deployment in resource-constrained environments, and offers a new direction for efficient model compression. Abstract: Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
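
One-shot global pruning can be sketched as follows: score every weight, pool the scores across layers, and keep only those above a single global threshold. The saliency |w * g| used here is an assumed first-order score, with `g` standing in for a gradient of a loss that includes a distillation term; the paper's actual scoring rule is not given in the abstract, and the function names are hypothetical.

```python
def importance_scores(weights, grads):
    """Assumed first-order saliency |w * g| per weight.

    grads would come from a loss with a distillation term, so that the
    teacher influences which weights rank as important.
    """
    return [abs(w * g) for w, g in zip(weights, grads)]


def global_prune_masks(importance_by_layer, sparsity):
    """Return 0/1 keep-masks per layer using one threshold shared by all layers."""
    all_scores = sorted(s for scores in importance_by_layer.values() for s in scores)
    n_prune = int(len(all_scores) * sparsity)
    if n_prune == 0:
        return {name: [1] * len(s) for name, s in importance_by_layer.items()}
    threshold = all_scores[n_prune - 1]
    # Ties at the threshold are pruned too, so sparsity may slightly exceed the target.
    return {
        name: [1 if s > threshold else 0 for s in scores]
        for name, scores in importance_by_layer.items()
    }


importance = {
    "conv1": importance_scores([0.9, -0.2, 0.5], [1.0, 0.5, 1.0]),
    "fc": importance_scores([0.1, 0.7, -0.6], [0.5, 1.0, 0.5]),
}
masks = global_prune_masks(importance, sparsity=0.5)
```

Keeping the masks fixed during retraining is what prevents pruned connections from being reactivated, as the abstract describes for the sparsity-aware retraining step.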

[126] Solving Spatial Supersensing Without Spatial Supersensing

Vishaal Udandarao,Shyamgopal Karthik,Surabhi S. Nath,Andreas Hochlehnert,Matthias Bethge,Ameya Prabhu

Main category: cs.CV

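TL;DR: This paper critically analyzes the spatial supersensing benchmarks (VSR and VSC) for video world models introduced by Cambrian-S. It shows that a simple baseline, NoSense, nearly solves VSR without any spatial cognition, and that Cambrian-S's inference methods may exploit shortcut heuristics in the benchmarks rather than achieve robust spatial supersensing.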

Details

Motivation: Current benchmarks for spatial supersensing (such as VSR and VSC) may not reflect a model's real spatial understanding, so their validity needs testing, along with whether existing methods rely on dataset shortcuts rather than genuine cognitive modeling. Method: Introduces NoSense, a simple baseline that discards almost all temporal structure and uses only a bag-of-words SigLIP model, and tests it on VSR; designs the VSC-Repeat experiment, which concatenates each video with itself several times to test whether a model relies on the shortcut that rooms are never revisited. Result: NoSense reaches 95% accuracy on VSR, showing the task can be solved without spatial cognition; under VSC-Repeat, the mean relative accuracy of Cambrian-S drops from 42% to 0%, indicating heavy reliance on the no-revisit shortcut. Conclusion: The current VSI-Super benchmarks do not reliably measure spatial supersensing, and the predictive sensing inference used by Cambrian-S improves performance mainly by exploiting benchmark shortcuts rather than by integrating spatial cognition. Abstract: Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, the Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited.
Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
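
The logic of the VSC-Repeat check can be shown with a toy sketch: concatenating a video with itself leaves the set of unique objects unchanged, while a counter that assumes rooms are never revisited inflates its total. The segment representation and the shortcut counter are hypothetical illustrations, not the Cambrian-S algorithm or the authors' released code.

```python
def vsc_repeat(segments, extra_copies):
    """Concatenate a video with itself; segments is a list of sets of object IDs."""
    return segments * (1 + extra_copies)


def unique_object_count(segments):
    """Ground-truth style count: each distinct object counts once."""
    return len(set().union(*segments))


def count_assuming_no_revisits(segments):
    """Hypothetical shortcut counter: treats every segment as a never-seen room."""
    return sum(len(objects) for objects in segments)


video = [{"chair", "table"}, {"lamp"}]
repeated = vsc_repeat(video, extra_copies=2)

true_before, true_after = unique_object_count(video), unique_object_count(repeated)
naive_before, naive_after = (
    count_assuming_no_revisits(video),
    count_assuming_no_revisits(repeated),
)
```

On the original video both counters agree, which is why the shortcut goes unnoticed on the unperturbed benchmark; only the repeated video separates them.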

[127] PartUV: Part-Based UV Unwrapping of 3D Meshes

Zhaoning Wang,Xinyue Wei,Ruoxi Shi,Xiaoshuai Zhang,Hao Su,Minghua Liu

Main category: cs.CV

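TL;DR: Proposes PartUV, a part-based UV unwrapping method that combines semantic part decomposition with geometric heuristics to produce fewer, part-aligned, low-distortion charts on complex AI-generated meshes.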

Details

Motivation: Existing UV unwrapping methods perform poorly on noisy, irregular AI-generated meshes, often producing heavy fragmentation and suboptimal boundaries. Method: Builds on PartField, a learning-based part decomposition method, in a top-down recursive framework that combines high-level semantic information with new geometric heuristics; integrates parameterization and packing algorithms and handles non-manifold and degenerate meshes. Result: Outperforms existing tools and neural methods on multiple datasets in chart count and seam length, keeps distortion comparable, achieves high success rates, and enables new applications such as part-specific multi-tile packing. Conclusion: PartUV has an advantage on complex, noisy meshes, producing fewer and better-structured charts that benefit downstream tasks. Abstract: UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart's distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at https://www.zhaoningwang.com/PartUV.

[128] TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing

Eddie Pokming Sheung,Qihao Liu,Wufei Ma,Prakhar Kaushik,Jianwen Xie,Alan Yuille

Main category: cs.CV

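TL;DR: Proposes TriDiff-4D, a diffusion-based 4D generation framework that uses triplane re-posing to produce high-quality, temporally coherent text-to-4D avatars, with large gains in generation speed and motion accuracy.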

Details

Motivation: Existing 4D generation methods suffer from temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational cost, and limited control over dynamics, falling short of what high-quality 3D animation requires. Method: Uses diffusion-based triplane re-posing with an auto-regressive strategy: it first generates a canonical 3D avatar and a motion sequence, then animates the avatar with a second diffusion model; structure and motion priors are learned from large-scale 3D and motion datasets, enabling skeleton-driven 4D generation. Result: Experiments show that TriDiff-4D outperforms existing methods in generation quality, temporal consistency, motion accuracy, and visual fidelity, cuts generation time from hours to seconds, and can generate complex motions with high-fidelity appearance. Conclusion: TriDiff-4D addresses key challenges in current 4D generation, delivering efficient, controllable, high-quality text-to-4D avatar generation with broad application potential. Abstract: With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.

[129] SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation

Zhenyuan Qin,Xincheng Shuai,Henghui Ding

Main category: cs.CV

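TL;DR: Proposes SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. It uses a branched network and a new CNOCS map representation, is trained on the newly built ObjectPose9D dataset, and improves performance with a two-stage reinforcement learning strategy and Disentangled Object Sampling.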

Details

Motivation: Existing methods offer limited controllability and degraded generation quality for multi-object 9D pose control, and fall short of comprehensive control. Method: Introduces SceneDesigner, which adds a branched network to a pre-trained base model and uses CNOCS maps to encode 9D pose information from the camera view; builds the ObjectPose9D dataset; applies a two-stage training strategy with reinforcement learning to address data imbalance; proposes Disentangled Object Sampling to mitigate generation problems in complex scenes; supports user-specific personalization weights. Result: Experiments show that SceneDesigner clearly outperforms existing methods in controllability and generation quality, and achieves accurate multi-object 9D pose control. Conclusion: SceneDesigner provides an efficient and stable solution for multi-object 9D pose control and advances controllable image generation. Abstract: Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects.
Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.

[130] V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

Yang Luo,Xuanlei Zhao,Baijiong Lin,Lingting Zhu,Liyao Tang,Yuqi Liu,Ying-Cong Chen,Shengju Qian,Xin Wang,Yang You

Main category: cs.CV

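TL;DR: This paper introduces V-ReasonBench, a benchmark for evaluating the reasoning abilities of generative video models along four dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics.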

Details

Motivation: As generative video models such as Veo-3 show surprising zero-shot reasoning abilities, a systematic and reliable way to evaluate video reasoning is needed. Method: Builds a diverse, answer-verifiable, scalable, and unambiguous benchmark from synthetic and real-world image sequences covering four reasoning types; evaluates six state-of-the-art video models, compares them with strong image models, and analyzes hallucination behaviors and the effect of video duration on Chain-of-Frames reasoning. Result: The experiments reveal clear differences between models across reasoning dimensions, identify common hallucination problems, and show that video duration affects reasoning performance. Conclusion: V-ReasonBench provides a unified, reproducible framework for measuring video reasoning and supports the development of video models with more reliable, human-aligned reasoning. Abstract: Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.

[131] Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

Junhao Cheng,Liang Hou,Xin Tao,Jing Liao

Main category: cs.CV

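TL;DR: This paper introduces Video-Next-Event Prediction (VNEP), a new task that answers with generated video rather than text. The authors propose VANS, which pairs a vision-language model with a video diffusion model and jointly optimizes them through the Joint-GRPO reinforcement learning framework, achieving leading results in multimodal understanding and video generation consistency.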

Details

Motivation: Video can directly show physical-world dynamics that are hard to convey in text alone (for example, how to tie a tie), so the authors extend video as a new answer modality for next-event prediction, aiming at better procedural learning and creative exploration. Method: Proposes VANS, which uses reinforcement learning with the newly designed Joint-GRPO framework to jointly optimize a Vision-Language Model (VLM) and a Video Diffusion Model (VDM); also builds the VANS-Data-100K dataset of 100K samples for training and evaluation. Result: On procedural and predictive benchmarks, VANS achieves state-of-the-art performance in both video event prediction and visualization. Conclusion: VANS realizes the shift from "telling" to "showing", demonstrating the potential of video as an answer modality for next-event prediction and opening a new direction for multimodal reasoning and generation. Abstract: While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task.
Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.
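
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage it is built on: each sampled output's reward is normalized by the mean and standard deviation of its group. This is the generic GRPO normalization only; how Joint-GRPO shares or stages the reward between the VLM and the VDM is not detailed in the abstract, and the use of population standard deviation and `eps` here are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its sampling group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled from the same input.
shared_rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(shared_rewards)
```

Outputs that score above their group's average receive a positive advantage and are reinforced, while below-average outputs are discouraged, with no separate value network required.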

[132] Learning to Think Fast and Slow for Visual Language Models

Chenyu Lin,Cheng Chi,Jinlin Wu,Sharon Li,Kaiyang Zhou

Main category: cs.CV

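TL;DR: This paper proposes DualMindVLM, a simple reinforcement learning approach that lets visual language models switch automatically between fast and slow thinking modes according to task difficulty, keeping performance high while greatly improving inference efficiency.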

Details

Motivation: Existing visual language models usually pursue long, detailed reasoning chains, which drives up computational cost; they cannot allocate cognitive resources according to problem complexity, which limits efficiency in practice. Method: The approach has two stages: the first labels data as fast-thinking or slow-thinking based on the length of the pre-trained model's output; the second trains the model with GRPO together with the thinking-mode labels to develop dual-mode reasoning. Result: DualMindVLM significantly outperforms the base model on multiple visual reasoning tasks and performs on par with state-of-the-art models, while being highly token-efficient. Conclusion: By introducing a human-like two-system thinking mechanism, the method balances inference speed and accuracy in visual language models and offers a new approach to efficient reasoning. Abstract: When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
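
The first-stage labeling can be pictured as a simple rule on output length. The sketch below is a minimal illustration under assumed details: the `threshold` value, the use of token counts, and the example questions are all hypothetical, since the abstract states only that labels are based on model output length.

```python
def label_thinking_mode(output_token_count, threshold=100):
    """Label a sample 'slow' if the base model's answer was long, else 'fast'.

    threshold is a hypothetical cutoff, not a value from the paper.
    """
    return "slow" if output_token_count > threshold else "fast"


# Hypothetical (question, base-model output length in tokens) pairs.
samples = {
    "What color is the car?": 12,
    "How many degrees is angle ABC?": 340,
}
mode_labels = {q: label_thinking_mode(n) for q, n in samples.items()}
```

These labels then serve as the thinking-mode targets that the second, GRPO-based stage trains against.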

[133] EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Omkat Thawakar,Shravan Venkatraman,Ritesh Thawkar,Abdelrahman Shaker,Hisham Cholakkal,Rao Muhammad Anwer,Salman Khan,Fahad Khan

Main category: cs.CV

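TL;DR: Proposes EvoLMM, a self-evolving framework that improves the reasoning ability of large multimodal models without supervision. It uses a dual-agent mechanism (a Proposer that generates questions and a Solver that answers them) for self-rewarded learning, and yields consistent gains on several multimodal math-reasoning benchmarks.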

Details

Motivation: Training of existing large multimodal models depends on human-annotated data or external reward models, which limits autonomy and scalability, so a fully unsupervised self-evolving method is needed to strengthen model reasoning. Method: A dual-agent system is built on a single backbone model: the Proposer generates image-grounded questions and the Solver answers them through internal consistency. The two evolve together through a continuous self-rewarding mechanism, enabling unsupervised learning. Result: Using only raw training images, EvoLMM based on Qwen2.5-VL improves performance by up to about 3% on multimodal math-reasoning benchmarks including ChartQA, MathVista, and MathVision. Conclusion: EvoLMM offers a simple and effective fully unsupervised self-improvement framework and a solid baseline for future research on self-improving large multimodal models. Abstract: Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
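The idea of a Solver rewarded through internal consistency, as described in the abstract above, can be sketched as a continuous agreement score over several sampled answers. The abstract gives no formula, so the function `consistency_reward` and the "fraction of samples that agree" rule are assumptions chosen for illustration; they are not EvoLMM's actual reward.

```python
from collections import Counter


def consistency_reward(answers):
    """Return the majority answer and a continuous reward for each sample.

    Assumption: each sampled answer is rewarded with the fraction of all
    samples that gave the same answer, so the reward lies in (0, 1] and
    needs no ground-truth label.
    """
    counts = Counter(answers)
    total = len(answers)
    majority_answer, _ = counts.most_common(1)[0]
    rewards = [counts[a] / total for a in answers]
    return majority_answer, rewards


# Four hypothetical Solver samples for one Proposer question.
samples = ["42", "42", "41", "42"]
majority, rewards = consistency_reward(samples)
print(majority, rewards)
```

A graded score like this gives a learning signal even when the samples only partly agree, which is one plausible reading of "continuous" in the abstract; the Proposer's reward is not modeled here.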

[134] NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses

Jing Wen,Alexander G. Schwing,Shenlong Wang

Main category: cs.CV

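TL;DR: Proposes NoPo-Avatar, a method that reconstructs animatable 3D human avatars from a single image or a sparse set of images without human pose input. It avoids the performance drop caused by noisy pose estimates and performs better in real-world settings.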

Details

Motivation: Existing methods rely on accurate human poses and camera poses as test-time input, but pose estimates are often noisy in practice, which degrades reconstruction quality. A reconstruction method that does not depend on pose input is therefore needed to improve robustness and applicability. Method: NoPo-Avatar reconstructs the 3D avatar from the input images alone, removing the test-time dependence on human poses and recovering an animatable 3D avatar directly from images through end-to-end learning. Result: Experiments on the THuman2.0, XHuman, and HuGe100K datasets show that NoPo-Avatar outperforms existing methods in practical settings without ground-truth poses and performs comparably in lab settings with ground-truth poses. Conclusion: By eliminating the dependence on pose input, NoPo-Avatar achieves more robust and more widely applicable 3D avatar reconstruction, with a clear advantage in practical use. Abstract: We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate "ground-truth" camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).

[135] Dataset Distillation for Pre-Trained Self-Supervised Vision Models

George Cazenavette,Antonio Torralba,Vincent Sitzmann

Main category: cs.CV

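TL;DR: This paper proposes Linear Gradient Matching, a method for distilling datasets on top of pre-trained vision models to optimize the training of linear probes. The synthetic data generalize well across pre-trained models and can be used for fine-grained classification and model interpretability analysis.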

Details

Motivation: Existing dataset distillation methods mainly target models trained from scratch, whereas current state-of-the-art vision approaches mostly build on large pre-trained models, so a new distillation method is needed for linear probing on pre-trained models. Method: Linear Gradient Matching optimizes synthetic images so that, after passing through a pre-trained feature extractor, they induce gradients in a linear classifier similar to those produced by real data. Result: The synthetic data outperform all real-image baselines and generalize across pre-trained models (for example, data distilled with DINO can train a CLIP probe); they are also highly effective for fine-grained classification and model interpretability tasks. Conclusion: Linear Gradient Matching provides an efficient data distillation scheme for linear probing on pre-trained models. The synthetic data perform strongly, generalize across models, and serve as a useful tool for model comparison and robustness analysis. Abstract: The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
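The gradient-matching objective described in the abstract above can be sketched for a linear softmax classifier on precomputed features. This is a minimal pure-Python illustration under stated assumptions: the features stand in for the output of a pre-trained extractor, cosine distance is assumed as the similarity measure (the abstract only says the gradients should be "similar"), and all function names are hypothetical. The actual method optimizes the synthetic images themselves, which is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [v / s for v in exps]


def linear_probe_grad(W, feats, labels):
    """Gradient of the mean cross-entropy loss w.r.t. W, flattened row by row.

    W is a list of per-class weight rows; feats are feature vectors that a
    frozen feature extractor is assumed to have produced.
    """
    n_cls, n_dim = len(W), len(W[0])
    grad = [0.0] * (n_cls * n_dim)
    for f, y in zip(feats, labels):
        probs = softmax([sum(w * x for w, x in zip(row, f)) for row in W])
        for c in range(n_cls):
            err = probs[c] - (1.0 if c == y else 0.0)
            for d in range(n_dim):
                grad[c * n_dim + d] += err * f[d] / len(feats)
    return grad


def cosine_distance(a, b):
    """1 - cosine similarity; an assumed choice of gradient distance."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


def gradient_matching_loss(W, real_feats, real_labels, syn_feats, syn_labels):
    """Distance between classifier gradients induced by real and synthetic data."""
    g_real = linear_probe_grad(W, real_feats, real_labels)
    g_syn = linear_probe_grad(W, syn_feats, syn_labels)
    return cosine_distance(g_real, g_syn)


# Toy two-class, two-dimensional example with made-up values.
W = [[0.1, -0.2], [0.0, 0.3]]
real_feats, real_labels = [[1.0, 0.0], [0.0, 1.0]], [0, 1]
loss_same = gradient_matching_loss(W, real_feats, real_labels, real_feats, real_labels)
loss_diff = gradient_matching_loss(W, real_feats, real_labels, [[0.0, 1.0], [1.0, 0.0]], [0, 1])
print(loss_same, loss_diff)
```

In a full pipeline the loss would be differentiated through the feature extractor back to the synthetic pixels with an autodiff framework; this sketch only evaluates the objective for fixed features.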

Table of Contents

cs.CL

[1] What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning

[2] Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language

[3] TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

[4] Liars' Bench: Evaluating Lie Detectors for Language Models

[5] Learning Tractable Distributions Of Language Model Continuations

[6] Early science acceleration experiments with GPT-5

[7] ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models

[8] TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating

[9] SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning

[10] SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs

[11] SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

[12] Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement

[13] NLP Datasets for Idiom and Figurative Language Tasks

[14] Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies

[15] AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

[16] Classification of worldwide news articles by perceived quality, 2018-2024

[17] ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports

[18] Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

[19] Arctic-Extract Technical Report

[20] TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

[21] Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

[22] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

[23] Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation

[24] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems

[25] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

cs.CV

[26] UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

[27] EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3

[28] WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion

[29] Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation

[30] Box6D : Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes

[31] RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

[32] Boosting Medical Visual Understanding From Multi-Granular Language Learning

[33] Automated Interpretable 2D Video Extraction from 3D Echocardiography

[34] Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

[35] InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer

[36] Externally Validated Multi-Task Learning via Consistency Regularization Using Differentiable BI-RADS Features for Breast Ultrasound Tumor Segmentation

[37] UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition

[38] Fairness in Multi-modal Medical Diagnosis with Demonstration Selection

[39] Exploiting Inter-Sample Information for Long-tailed Out-of-Distribution Detection

[40] Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion

[41] Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution

[42] Towards a Safer and Sustainable Manufacturing Process: Material classification in Laser Cutting Using Deep Learning

[43] CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis

[44] Crossmodal learning for Crop Canopy Trait Estimation

[45] LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets

[46] AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers

[47] LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving

[48] VideoSeg-R1:Reasoning Video Object Segmentation via Reinforcement Learning

[49] SpectralTrain: A Universal Framework for Hyperspectral Image Classification

[50] Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments

[51] T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

[52] Clustered Error Correction with Grouped 4D Gaussian Splatting

[53] Decoupling Complexity from Scale in Latent Diffusion Model

[54] VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation

[55] Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions

[56] How Noise Benefits AI-generated Image Detection

[57] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

[58] Degradation-Aware Hierarchical Termination for Blind Quality Enhancement of Compressed Video

[59] SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction

[60] Real-Time 3D Object Detection with Inference-Aligned Learning

[61] Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

[62] A Spatial Semantics and Continuity Perception Attention for Remote Sensing Water Body Change Detection

[63] LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM

[64] Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval

[65] Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

[66] Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

[67] Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion

[68] Layer-wise Noise Guided Selective Wavelet Reconstruction for Robust Medical Image Segmentation

[69] An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

[70] EvoVLA: Self-Evolving Vision-Language-Action Model

[71] Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective

[72] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

[73] Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification

[74] PrIntMesh: Precise Intersection Surfaces for 3D Organ Mesh Reconstruction

[75] When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

[76] Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles

[77] SwiTrack: Tri-State Switch for Cross-Modal Object Tracking