cs.CL

[1] What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning

Jeremias Ferrao,Ezgi Basar,Khondoker Ittehadul Islam,Mahrokh Hassani

Main category: cs.CL

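TL;DR: This study examines attribution patterns in Chain-of-Thought (CoT) reasoning in multilingual LLMs, applying ContextCite and Inseq to the Qwen2.5 1.5B-Instruct model on the MGSM benchmark. It finds that attribution concentrates excessively on the final reasoning step, that structured prompting helps mainly high-resource Latin-script languages, and that perturbations reduce both accuracy and attribution coherence.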

Details

Motivation: Although CoT prompting improves task performance, the faithfulness and interpretability of the generated reasoning chains in multilingual settings remain unclear, so their attribution properties need systematic evaluation.

Method: Two attribution methods, ContextCite (step level) and Inseq (token level), are applied to the Qwen2.5 1.5B-Instruct model on the MGSM benchmark.

Result: (1) Attribution scores overemphasize the final reasoning step, especially in incorrect generations; (2) structured CoT prompting improves accuracy mainly for high-resource Latin-script languages; (3) negation and distractor-sentence perturbations reduce model accuracy and attribution coherence.

Conclusion: CoT prompting is limited in multilingual robustness and interpretive transparency, and needs further work before cross-lingual reasoning can be considered reliable.

Abstract: This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods--ContextCite for step-level attribution and Inseq for token-level attribution--to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.

[2] Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language

Seungbeen Lee,Jinhong Jeong,Donghyun Kim,Yejin Son,Youngjae Yu

Main category: cs.CL

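TL;DR: This paper presents Motion2Mind, a framework for evaluating machines' Theory of Mind (ToM) ability to infer mental states from nonverbal cues (NVCs). It builds a dataset of finely annotated nonverbal cues paired with mental states, covering 222 types of nonverbal cues and 397 mind states. Experiments show that current AI systems perform poorly on the task, with a performance gap in detection and over-interpretation in the explanation stage.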

Details

Motivation: Existing Theory of Mind (ToM) benchmarks focus mostly on false-belief tasks and reasoning with asymmetric information, overlooking mental states other than belief and the rich forms of nonverbal communication. A more comprehensive framework is needed to evaluate how well machines understand human nonverbal cues.

Method: Using an expert-curated body-language reference as a knowledge base, the authors build the Motion2Mind video dataset, which pairs fine-grained nonverbal cue annotations with manually verified psychological interpretations, and design a two-stage evaluation covering detection and explanation.

Result: Current AI systems lag well behind human annotators on both detecting and explaining nonverbal cues, and show a clear tendency to over-interpret in the explanation stage.

Conclusion: Motion2Mind exposes the limits of current AI in understanding human nonverbal communication and underlines the need for finer-grained models that accurately capture complex social-cognitive signals.

Abstract: Our ability to interpret others' mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection but also patterns of over-interpretation in Explanation compared to human annotators.

[3] TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

Sarik Ghazarian,Abhinav Gullapalli,Swair Shah,Anurag Beniwal,Nanyun Peng,Narayanan Sadagopan,Zhou Yu

Main category: cs.CL

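TL;DR: This paper proposes TOD-ProcBench, a benchmark for evaluating how well large language models follow complex natural-language instructions in multi-turn task-oriented dialogues. Its three tasks measure instruction understanding, violation identification, and conditional generation.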

Details

Motivation: Existing task-oriented dialogue benchmarks reduce instructions to intent-slot schemas or API calls, which does not reflect the complex natural-language procedures and fine-grained constraints of real deployments. A new benchmark is needed to evaluate systematically how well models follow complex instructions.

Method: Dialogue data with complex process instructions is built from the high-quality ABCD dataset, with instructions modeled as multi-level condition-action statements. Three tasks are designed: relevant-statement retrieval with next-action prediction, identification of instruction-violating responses, and instruction-conditioned response generation. The effects of multilingual settings and instruction text formats on performance are also studied.

Result: TOD-ProcBench provides a systematic evaluation of several LLMs on complex instruction following and reveals their limitations in understanding fine-grained constraints and executing complex procedures, particularly in identifying instruction violations and in conditional generation.

Conclusion: TOD-ProcBench fills a gap in evaluating complex instruction following in task-oriented dialogue and offers an evaluation tool and direction for improving LLMs' ability to follow natural-language procedures in real-world settings.

Abstract: In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs' instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs' abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs' complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses. Task 3 investigates LLMs' abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.

[4] Liars' Bench: Evaluating Lie Detectors for Language Models

Kieron Kretschmar,Walter Laurito,Sharan Maiya,Samuel Marks

Main category: cs.CL

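TL;DR: This paper introduces LIARS' BENCH, a lie-detection benchmark of 72,863 examples for evaluating techniques that detect lying by large language models across varied settings, and shows where existing methods fall short in complex lying scenarios.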

Details

Motivation: Existing lie-detection techniques for large language models (LLMs) are usually validated in narrow settings that do not cover the diverse lies LLMs can produce, so a broader and more varied testbed is needed to evaluate and improve them.

Method: The authors build LIARS' BENCH, a large testbed of 72,863 lies and honest responses generated by four open-weight models across seven datasets. The settings cover different types of lies, organized along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Three black-box and white-box lie-detection techniques are evaluated on the benchmark.

Result: Existing lie-detection techniques fail systematically on certain types of lies, and perform worst in settings where the transcript alone is not enough to determine whether the model lied.

Conclusion: LIARS' BENCH exposes the limitations of current lie-detection techniques and provides a practical testbed to guide the development of more robust LLM lie detection.

Abstract: Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generating statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it's not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.

[5] Learning Tractable Distributions Of Language Model Continuations

Gwen Yidou-Weng,Ian Li,Anji Liu,Oliver Broadrick,Guy Van den Broeck,Benjie Wang

Main category: cs.CL

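TL;DR: The paper proposes Learning to Look Ahead (LTLA), a hybrid approach that pairs a language model with a fixed tractable surrogate model to handle sequence-level constraints in controlled text generation, improving constraint satisfaction and generation quality while keeping inference efficient.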

Details

Motivation: Existing controlled-generation methods built on surrogate models such as HMMs are only weakly context aware and struggle to model how future tokens affect a constraint, which degrades generation quality.

Method: LTLA uses the base language model to encode the prefix and pairs it with a fixed tractable surrogate model that computes exact continuation probabilities. A single batched HMM update handles all next-token candidates at once, and the LM's hidden states condition only the surrogate's latent state prior, while the surrogate decoder stays fixed so that computation can be reused across prefixes.

Result: LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models (which a standalone HMM cannot do), and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.

Conclusion: By combining neural context with a fixed surrogate model, LTLA improves the accuracy and applicability of controlled language generation without sacrificing efficiency, and is particularly suited to multimodal or complex-constraint settings.

Abstract: Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
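
The HMM-guided decoding idea described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the constraint ("a `target` token appears within the next `horizon` tokens"), the function names, and the toy parameters are all assumptions made for illustration. In LTLA the latent state prior would be predicted from the LM's hidden state; here `state_prior` is simply passed in.

```python
def constraint_lookahead(trans, emit, target, horizon):
    """sat[h][s]: probability that `target` is emitted within the next h tokens,
    given that the HMM is in latent state s when the first of them is emitted."""
    num_states = len(trans)
    sat = [[0.0] * num_states]  # h = 0: no tokens left, constraint cannot be met
    for _ in range(horizon):
        prev = sat[-1]
        row = []
        for s in range(num_states):
            later = sum(trans[s][j] * prev[j] for j in range(num_states))
            row.append(emit[s][target] + (1.0 - emit[s][target]) * later)
        sat.append(row)
    return sat


def guided_next_token(lm_probs, state_prior, trans, emit, target, horizon):
    """Reweight the LM's next-token distribution by the HMM's estimate of
    P(constraint satisfied | candidate token). Requires horizon >= 1."""
    num_states = len(trans)
    sat = constraint_lookahead(trans, emit, target, horizon)
    # Chance of meeting the constraint in the remaining horizon - 1 tokens,
    # per current latent state. Computed once and shared by every candidate.
    later = [
        sum(trans[s][j] * sat[horizon - 1][j] for j in range(num_states))
        for s in range(num_states)
    ]
    scores = []
    for v in range(len(lm_probs)):
        marginal = sum(state_prior[s] * emit[s][v] for s in range(num_states))
        if v == target:
            joint = marginal  # emitting the target satisfies the constraint now
        else:
            joint = sum(
                state_prior[s] * emit[s][v] * later[s] for s in range(num_states)
            )
        p_constraint = joint / marginal if marginal > 0 else 0.0
        scores.append(lm_probs[v] * p_constraint)
    total = sum(scores)
    return [x / total for x in scores]


# Toy example: 2 latent states, 3-token vocabulary, token 2 must appear soon.
TRANS = [[0.9, 0.1], [0.2, 0.8]]
EMIT = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
LM_PROBS = [0.5, 0.3, 0.2]
guided = guided_next_token(LM_PROBS, [0.5, 0.5], TRANS, EMIT, target=2, horizon=3)
```

Because the lookahead table depends only on the fixed HMM parameters and not on the prefix, it could be cached across prefixes, which is the kind of reuse the abstract attributes to keeping the surrogate decoder fixed.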

[6] Early science acceleration experiments with GPT-5

Sébastien Bubeck,Christian Coester,Ronen Eldan,Timothy Gowers,Yin Tat Lee,Alexandru Lupsasca,Mehtaab Sawhney,Robert Scherrer,Mark Sellke,Brian K. Spears,Derya Unutmaz,Kevin Weil,Steven Yin,Nikita Zhivotovskiy

Main category: cs.CL

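TL;DR: This paper presents case studies of GPT-5 used in ongoing research across several scientific fields, showing how it accelerated the work and where AI-assisted research still has limits.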

Details

Motivation: Many scientists are not yet fully aware of what frontier AI can do; the paper aims to show, through concrete cases, the potential of models such as GPT-5 in real research.

Method: The authors carry out short case studies using GPT-5 in mathematics, physics, astronomy, computer science, biology, and materials science, document the human-AI interactions, and analyze the role AI played in advancing the research.

Result: GPT-5 produced new, concrete research steps in several fields and helped speed up the work. In mathematics in particular, it contributed to four new results verified by the human authors, showing that AI can help settle previously unsolved problems.

Conclusion: The contributions are modest in scope but significant in implication: given how quickly frontier AI is progressing, its potential as a research collaborator is large and could substantially raise research productivity.

Abstract: AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.

[7] ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models

Qing Zhang,Bing Xu,Xudong Zhang,Yifan Shi,Yang Li,Chen Zhang,Yik Chung Wu,Ngai Wong,Yijie Chen,Hong Dai,Xiansen Chen,Mian Zhang

Main category: cs.CL

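TL;DR: The paper proposes ELPO, an ensemble-learning-based prompt optimization framework that combines multiple search methods with shared generation strategies to improve the accuracy and robustness of prompt optimization, outperforming existing methods on several tasks.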

Details

Motivation: Existing automatic prompt optimization methods mostly rely on a single model or algorithm, which limits their performance on complex tasks, so a stronger and more flexible optimization framework is needed.

Method: Inspired by ensemble learning, ELPO introduces a voting mechanism, combines shared generation strategies with several search methods, and designs more efficient algorithms for prompt generation and search.

Result: ELPO outperforms state-of-the-art prompt optimization methods across multiple tasks, for example improving the F1 score by 7.6 on the ArSarcasm dataset.

Conclusion: By integrating several optimization strategies, ELPO improves the effectiveness and stability of prompt optimization and offers a new paradigm for automatic prompt optimization.

Abstract: The remarkable performance of Large Language Models (LLMs) relies heavily on crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for the practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which has developed rapidly in recent years. Existing APO methods, such as those based on evolutionary algorithms or trial-and-error approaches, achieve efficient and accurate prompt optimization to some extent. However, these studies focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO employs a voting mechanism and introduces shared generation strategies along with different search methods for searching superior prompts. Moreover, ELPO presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on the ArSarcasm dataset.
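
The abstract does not say how ELPO's voting is carried out, so the following is only a generic sketch of ensemble voting over prompts, not the paper's algorithm. It assumes each candidate prompt has already produced one label per example; `majority_vote` and the sample labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_prompt):
    """Combine label predictions from several prompts by per-example majority.

    predictions_per_prompt: one list of labels per prompt, all the same length.
    Ties go to the label that was seen first among the tied ones.
    """
    voted = []
    for votes in zip(*predictions_per_prompt):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Three prompts, two examples each (hypothetical labels).
ensemble = majority_vote([
    ["sarcastic", "literal"],
    ["sarcastic", "sarcastic"],
    ["literal", "sarcastic"],
])
```

The appeal of voting in this setting is that individual prompts can fail on different examples, so a combined decision can be more robust than the single best prompt.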

[8] TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating

Dabiao Ma,Ziming Dai,Zhimin Xin,Shu Wang,Ye Wang,Haojun Fei

Main category: cs.CL

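TL;DR: This paper proposes Token-Selective PEFT (TS-PEFT), a parameter-efficient fine-tuning paradigm that applies PEFT modifications to only a subset of position indices in order to improve large-model performance on downstream tasks.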

Details

Motivation: Conventional PEFT methods modify all position indices, which may waste resources and even hurt performance, so finer-grained fine-tuning strategies are worth exploring.

Method: A selection function S dynamically decides at which token position indices the PEFT modification is applied, giving precise control over where the update takes effect.

Result: Experiments show that applying PEFT indiscriminately to all positions is redundant and can be harmful, while TS-PEFT improves downstream task performance.

Conclusion: TS-PEFT offers a more efficient and more targeted approach to fine-tuning and points to a new direction for parameter-efficient optimization of large models.

Abstract: In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.
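
A minimal sketch of token-selective gating follows. The abstract does not define the selection function S, so the rule used here (apply the PEFT update at a position only when the update's L2 norm exceeds a threshold `tau`) is an assumption suggested by the "threshold gating" in the title; in the paper the threshold is described as learnable, whereas here it is a fixed number.

```python
import math


def token_selective_update(base, delta, tau):
    """Apply a PEFT update only at selected token positions.

    base:  per-position hidden vectors from the frozen model.
    delta: per-position updates from the PEFT module (e.g. a LoRA branch).
    tau:   gate threshold (assumed rule: keep the update if ||delta_i|| > tau).
    """
    out = []
    for h, d in zip(base, delta):
        gate_open = math.sqrt(sum(x * x for x in d)) > tau
        out.append([hi + di for hi, di in zip(h, d)] if gate_open else list(h))
    return out


# Two positions: the first update is small and is skipped, the second is applied.
mixed = token_selective_update(
    base=[[1.0, 1.0], [1.0, 1.0]],
    delta=[[0.1, 0.0], [3.0, 4.0]],
    tau=1.0,
)
```

Conventional PEFT corresponds to a gate that is always open, which is the "all position indices" baseline the abstract questions.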

[9] SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning

Sebastian Haan

Main category: cs.CL

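TL;DR: SemanticCite is an AI-powered citation verification system that uses full-text analysis and a fine-grained four-class scheme (Supported, Partially Supported, Unsupported, Uncertain) to improve the accuracy and transparency of scholarly citations. It relies on lightweight fine-tuned models for efficient, scalable verification and releases its dataset and framework as open source to support research integrity.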

Details

Motivation: Scholarly communication depends on accurate citations, but semantic citation errors, AI-hallucinated references, and traditional citation formats that lack context undermine research credibility and the traceability of evidence.

Method: The system combines multiple retrieval methods with fine-tuned lightweight language models that classify each citation into one of four classes (Supported, Partially Supported, Unsupported, Uncertain) and explains its decision by extracting relevant text snippets. A multidisciplinary dataset of more than 1,000 annotated citations is also constructed.

Result: The fine-tuned lightweight models perform comparably to large commercial systems at lower computational cost. The system identifies different types of citation errors, provides transparent, evidence-based explanations, and supports large-scale use.

Conclusion: Through scalable citation verification, streamlined peer review, and quality control for AI-generated content, SemanticCite offers an open-source foundation for maintaining citation accuracy at scale and strengthening research integrity.

Abstract: Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software. SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.

[10] SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs

Xingtao Zhao,Hao Peng,Dingli Su,Xianghua Zeng,Chunyang Liu,Jinzhi Liao,Philip S. Yu

Main category: cs.CL

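TL;DR: This paper proposes Semantic Structural Entropy (SeSE), a framework that quantifies the inherent semantic uncertainty of large language models from a structural-information perspective, enabling more precise hallucination detection.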

Details

Motivation: Existing uncertainty quantification methods rely mainly on semantic probability distributions or pairwise distances and ignore latent semantic structural information, which leaves them insufficiently reliable in critical scenarios.

Method: The authors propose an adaptively sparsified algorithm for constructing a directed semantic graph and, through hierarchical abstraction, define SeSE as the structural entropy of the optimal semantic encoding tree. They further extend SeSE to quantify the uncertainty of individual claims in fine-grained long-form generation.

Result: Experiments on 29 model-dataset combinations show that SeSE significantly outperforms advanced baselines, including KLE, with particularly strong results in hallucination detection and long-form generation.

Conclusion: By exploiting latent structural information in semantic space, SeSE gives more precise and more explainable uncertainty estimates and improves the reliability of LLMs in safety-critical applications.

Abstract: Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation -- where existing methods often rely on heuristic sample-and-count techniques -- we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.

[11] SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

Wei Xia,Zhi-Hong Deng

Main category: cs.CL

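TL;DR: This paper proposes SDA, a training-free, model-agnostic alignment framework that dynamically adjusts the output probability distribution at inference time to improve the alignment of large language models along three dimensions: helpfulness, harmlessness, and honesty.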

Details

Motivation: LLMs need to stay aligned with human intent across diverse tasks and user needs, and in particular to be aligned effectively at inference time without costly retraining.

Method: The SDA (Steering-Driven Distribution Alignment) framework dynamically redistributes the model's output probabilities according to user-defined alignment instructions. It is training-free, requires no fine-tuning, and can be combined with training-based alignment methods.

Result: SDA is validated on 8 open-source LLMs of different scales and origins. In the 3H evaluation (helpfulness, harmlessness, honesty) it yields average gains of 64.4% in helpfulness, 30% in honesty, and 11.5% in harmlessness, and generalizes well.

Conclusion: SDA is a lightweight, efficient, and general inference-time alignment method that improves the behavioral alignment of a range of open-source LLMs without retraining, and supports personalized preference control and flexible deployment.

Abstract: With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.

[12] Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement

Jiashu Yao,Heyan Huang,Shuang Zeng,Chuwei Luo,WangJie You,Jie Tang,Qingsong Liu,Yuhang Guo,Yangyang Kang

Main category: cs.CL

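TL;DR: This paper proposes a self-rewriting framework that uses reinforcement learning to improve the internal reasoning quality of large reasoning models, addressing the inefficient reasoning that results from rewarding final correctness alone.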

Details

Motivation: Conventional reinforcement learning relies only on a reward for the final outcome and cannot supervise the model's internal reasoning, which leads to over-thinking, under-thinking, redundant thinking, and disordered thinking, and harms reasoning efficiency and quality.

Method: In the self-rewriting framework, the model rewrites its own reasoning and learns from the rewritten version to improve its reasoning process. A selective rewriting strategy rewrites only "simple" samples, and rewriting and ordinary generation are combined in a single batch to keep the algorithm scalable.

Result: The method improves accuracy by +0.6 while shortening reasoning by 46%, and raises the internal reasoning quality score by +7.2 under an LLM-as-a-judge metric, outperforming strong baselines.

Conclusion: Self-rewriting improves both the reasoning efficiency and the internal reasoning quality of large reasoning models, and is an efficient, scalable way to optimize reasoning.

Abstract: Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over the internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce a self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only "simple" samples, defined by the model's consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
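
One plausible reading of why rewriting only "simple" samples preserves GRPO's reward signals can be sketched as follows; this interpretation is an assumption, not something the abstract states. GRPO normalizes rewards within a group of rollouts for the same prompt, so a group in which every rollout is correct yields zero advantage for all of them and contributes no outcome-based learning signal to lose. The function names and the binary 0/1 reward are illustrative.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_simple(rewards):
    """A prompt counts as 'simple' here when every rollout is correct (reward 1.0)."""
    return all(r == 1.0 for r in rewards)


# Hypothetical rollout rewards for two prompts.
easy_group = [1.0, 1.0, 1.0]   # consistently correct: candidate for rewriting
mixed_group = [1.0, 0.0, 1.0]  # informative outcome signal: left untouched

easy_adv = grpo_advantages(easy_group)
mixed_adv = grpo_advantages(mixed_group)
```

Under this reading, rewriting is applied exactly where the outcome reward has nothing left to teach, and groups with mixed outcomes keep their original advantages.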

[13] NLP Datasets for Idiom and Figurative Language Tasks

Blake Matheny,Phuong Minh Nguyen,Minh Le Nguyen,Stephanie Reynolds

Main category: cs.CL

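TL;DR: This paper introduces several datasets built to improve large language models' understanding of idioms and figurative language: one large-scale dataset of potential idiomatic expressions and two human-annotated datasets of definite idiomatic expressions. Pre-trained models are evaluated on them through slot labeling and sequence tagging tasks.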

Details

Motivation: Informal idiomatic and figurative language is abundant on social media, yet current large language models still struggle to understand it, so better and larger datasets are needed to narrow the gap.

Method: Existing idiom datasets are merged into a combined idiom list, which is used to retrieve context sequences from a large corpus. This yields one large-scale automatically collected dataset and two high-quality human-annotated datasets, which are post-processed for compatibility with different models and used for slot labeling and sequence tagging.

Result: Three datasets usable for idiom recognition (detection) are built. They are used to assess how well pre-trained language models handle figurative meaning and to establish baseline performance on the specific tasks.

Conclusion: The datasets offer diverse, practical resources for developing and evaluating new models and methods for idiom and figurative language understanding, and should help advance the field.

Abstract: Idiomatic and figurative language form a large portion of colloquial speech and writing. With social media, this informal language has become more easily observable to people and trainers of large language models (LLMs) alike. While the advantage of large corpora seems like the solution to all machine learning and Natural Language Processing (NLP) problems, idioms and figurative language continue to elude LLMs. Finetuning approaches are proving to be optimal, but better and larger datasets can help narrow this gap even further. The datasets presented in this paper provide one answer, while offering a diverse set of categories on which to build new models and develop new approaches. A selection of recent idiom and figurative language datasets was used to acquire a combined idiom list, which was used to retrieve context sequences from a large corpus. One large-scale dataset of potential idiomatic and figurative language expressions and two additional human-annotated datasets of definite idiomatic and figurative language expressions were created to evaluate the baseline ability of pre-trained language models in handling figurative meaning through idiom recognition (detection) tasks. The resulting datasets were post-processed for model-agnostic training compatibility, utilized in training, and evaluated on slot labeling and sequence tagging.

[14] Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies

Jonathan Kamp,Lisa Beinborn,Antske Fokkens

Main category: cs.CL

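TL;DR: This paper studies the role of human natural-language explanations (rationales) in evaluating models. It finds that the common sufficiency metric reflects how informative a rationale is but does not translate into better classification accuracy and is unrelated to token classification. Incorporating rationale information can help cross-domain classification, but the effect varies by task and model, which indicates that more systematic metrics for rationales are needed.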

Details

Motivation: Rationales are used to assess whether models predict for the right reasons rather than relying on dataset shortcuts, but existing metrics such as sufficiency give little insight into how rationale information actually affects model performance.

Method: Sufficiency is analyzed in relation to two modelling paradigms: token classification, which tests whether models can identify the tokens that belong to the rationale, and attention regularisation, which incorporates rationale information in the input to improve model performance.

Result: Highly informative rationales do not necessarily improve classification accuracy; sufficiency mainly captures the classification impact of the non-rationale context; incorporating rationales can improve cross-domain performance, but inconsistently; and sufficiency shows no clear relationship with token classification.

Conclusion: The role of rationales is complex, and current metrics do not fully capture their effect. Metrics that capture this information systematically merit further study.

Abstract: Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.
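
For readers unfamiliar with the metric, sufficiency is commonly computed as the change in the model's confidence for a label when the input is reduced to the rationale tokens alone. The sketch below uses that common formulation; the paper may use a normalized or otherwise different variant. The keyword-counting `toy_prob_positive` classifier is a hypothetical stand-in for a real model.

```python
def toy_prob_positive(tokens):
    """Hypothetical classifier: P(positive) from simple keyword counts."""
    score = 0.5 + 0.2 * tokens.count("good") - 0.2 * tokens.count("bad")
    return max(0.0, min(1.0, score))


def sufficiency(predict, tokens, rationale_mask):
    """predict(full input) - predict(rationale tokens only).

    Values near zero (or negative) mean the rationale alone supports the
    prediction about as well as (or better than) the full input.
    """
    rationale_only = [t for t, keep in zip(tokens, rationale_mask) if keep]
    return predict(tokens) - predict(rationale_only)


# The rationale covers the two "good" tokens; "bad" is non-rationale context.
score = sufficiency(toy_prob_positive, ["good", "good", "bad"], [1, 1, 0])
```

In this toy case the score is negative because removing the non-rationale token "bad" raises the model's confidence, which matches the abstract's observation that sufficiency captures the classification impact of the non-rationalised context.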

[15] AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

Ren Ma,Jiantao Qiu,Chao Xu,Pei Chu,Kaiwen Liu,Pengli Ren,Yuan Qu,Jiahui Peng,Linfeng Hou,Mengjie Liu,Lindong Lu,Wenchang Ning,Jia Yu,Rui Min,Jin Shi,Haojiong Chen,Peng Zhang,Wenjian Zhang,Qian Jiang,Zengjie Hu,Guoqiang Yang,Zhenxiang Li,Fukai Shang,Zhongying Tu,Wentao Zhang,Dahua Lin,Conghui He

Main category: cs.CL

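TL;DR: This paper proposes MinerU-HTML, a language-model-based method for HTML-to-text extraction that preserves structured web content (formulas, code, tables) better than traditional heuristic extractors, and builds the AICC corpus to show that higher-quality extraction measurably improves large language model performance.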

Details

Motivation: Existing web text extractors such as Trafilatura rely on heuristic rules and struggle to preserve structured elements, which causes information loss. Current data preprocessing focuses on filtering and deduplication and overlooks the effect of extraction quality. The authors argue that better extraction matters for downstream performance.

Method: HTML content extraction is recast as a sequence labeling problem solved by a 0.6B-parameter language model that uses semantic understanding. A two-stage formatting pipeline first categorizes semantic elements and then converts them to Markdown, preserving the original document structure more accurately.

Result: On MainWebBench, a benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 versus 63.6% for Trafilatura, and preserves 90.9% of code blocks and 94.0% of formulas. Models trained on the AICC corpus (7.3 trillion tokens) built with MinerU-HTML reach 50.8% average accuracy across 13 benchmarks under identical filtering, 1.08 percentage points above the Trafilatura-extracted TfCC, and also surpass RefinedWeb and FineWeb.

Conclusion: HTML extraction quality has a substantial effect on LLM training and should not be neglected. MinerU-HTML's semantics-driven approach clearly improves extraction quality and shows that a scalable model-based extractor outperforms traditional heuristics.

Abstract: While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
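
The extraction scores above are reported as ROUGE-N F1. As a rough illustration of what that measures, here is a minimal n-gram overlap F1 between a reference text and an extracted text. The whitespace tokenization and the default `n=2` are assumptions; the benchmark's exact tokenization and choice of N are not given in the abstract.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=2):
    """F1 over clipped n-gram overlap between reference and candidate text."""
    ref = ngram_counts(reference.split(), n)
    cand = ngram_counts(candidate.split(), n)
    overlap = sum((ref & cand).values())  # per-n-gram minimum of the two counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# The extractor dropped the last word of a four-word reference.
f1 = rouge_n_f1("a b c d", "a b c", n=2)
```

An extractor that drops content loses recall, and one that keeps boilerplate loses precision, so the F1 penalizes both failure modes.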

[16] Classification of worldwide news articles by perceived quality, 2018-2024

Connor McElroy,Thiago E. A. de Oliveira,Chris Brogly

Main category: cs.CL

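TL;DR: This study evaluates how well machine learning and deep learning models distinguish perceived news quality, using a dataset of more than 1.4 million articles. ModernBERT-large performs best.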

Details

Motivation: To determine whether machine learning and deep learning models can effectively separate news of perceived lower quality from news of perceived higher quality. Method: Three traditional machine learning classifiers and three deep learning models are evaluated, with 194 linguistic features per article, on a new dataset of 1,412,272 English news articles. Result: Random Forest reaches 0.7355 accuracy, and ModernBERT-large achieves the best performance (0.8744 accuracy, 0.9593 ROC-AUC, 0.8739 F1). Conclusion: Both traditional machine learning and deep learning models can effectively differentiate the perceived quality of worldwide news articles, with ModernBERT-large performing best. Abstract: This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. Three machine learning classifiers and three deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.
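
The website-level labelling described in the abstract can be sketched as follows. This is only an illustration of a median split, not the authors' code: the function name `median_split`, the example domains, and the choice to place ratings equal to the median in the low class are all assumptions.

```python
from statistics import median


def median_split(site_ratings):
    """Map each source website to 'low' or 'high' by splitting ratings at the median.

    Ties at the median go to 'low' here; the abstract does not say how ties were handled.
    """
    cutoff = median(site_ratings.values())
    return {
        site: ("high" if rating > cutoff else "low")
        for site, rating in site_ratings.items()
    }


# Hypothetical ratings for four made-up source websites.
labels = median_split({"a.example": 1.0, "b.example": 2.0, "c.example": 3.0, "d.example": 4.0})
```

Every article would then inherit the label of the website it came from, which is what "website-level labelled article" suggests.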

[17] ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports

Sherine George,Nithish Saji

Main category: cs.CL

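TL;DR: ESGBench is a benchmark dataset and evaluation framework for explainable ESG question answering over corporate sustainability reports. It contains domain-grounded questions across multiple ESG themes with human-curated answers and supporting evidence, and aims to advance research on transparent and accountable ESG AI systems.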

Details

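Motivation: Existing ESG question answering systems lack a standardized evaluation benchmark, which makes it hard to measure factual consistency, traceability, and domain alignment; a dedicated benchmark is needed to support transparent and trustworthy ESG AI research. Method: The authors build ESGBench, a benchmark of domain-grounded questions drawn from corporate sustainability reports with human-annotated answers and supporting evidence, and analyze the performance of current state-of-the-art LLMs on it. Result: Experiments show that current LLMs face significant challenges in factual consistency, traceability of information, and alignment with domain expertise. Conclusion: ESGBench provides an effective evaluation platform for explainable ESG question answering and helps advance more transparent and reliable AI systems in the ESG domain. Abstract: We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.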

[18] Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

Andrew Gomes

Main category: cs.CL

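TL;DR: This paper studies how transformer-based language models process idiomatic expressions. It uses a modified path patching algorithm to discover the circuits involved and identifies "Idiom Heads" and "augmented reception", shedding light on how transformers handle non-compositional language.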

Details

Motivation: To understand how transformer models process non-compositional language such as idioms, and to explore their internal computational patterns as a step toward understanding more complex grammatical constructions. Method: Circuits are discovered with a modified path patching algorithm, and the analysis covers attention-head activation patterns across different idioms as well as enhanced attention between idiom tokens. Result: The study finds frequently activated "Idiom Heads" and enhanced attention between idiom tokens caused by earlier processing ("augmented reception"), and interprets these as mechanisms by which transformers balance computational efficiency and robustness. Conclusion: Transformers handle idioms and other non-compositional phenomena through specific attention mechanisms (Idiom Heads and augmented reception), which suggests new directions for understanding how more complex grammatical structures are processed. Abstract: We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate "Idiom Heads," attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term "augmented reception." We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.

[19] Arctic-Extract Technical Report

Mateusz Chiliński,Julita Ołtusek,Wojciech Jaśkowski

Main category: cs.CL

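TL;DR: Arctic-Extract is a state-of-the-art model for extracting structured data from scanned or digital-born business documents. It can be deployed on resource-constrained hardware and handles long documents efficiently.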

Details

Motivation: To enable efficient structured data extraction on resource-constrained devices while keeping performance high on long business documents. Method: A lightweight design occupying only 6.6 GiB, combined with training protocols tuned to improve document understanding. Result: The model processes up to 125 A4 pages on an A10 GPU with 24 GB of memory and shows strong performance. Conclusion: Arctic-Extract achieves efficient and accurate structured information extraction from long documents at low resource cost, making it suitable for practical industrial deployment. Abstract: Arctic-Extract is a state-of-the-art model designed for extracting structural data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighing only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making it suitable for long document processing. This paper highlights Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.

[20] TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

Özay Ezerceli,Mahmoud El Hussieni,Selva Taş,Reyhan Bayraktar,Fatma Betül Terzioğlu,Yusuf Çelebi,Yağız Asker

Main category: cs.CL

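TL;DR: This paper introduces TurkColBERT, the first comprehensive benchmark for Turkish information retrieval, and systematically compares dense encoders with late-interaction models. A two-stage adaptation pipeline applies English and multilingual models to Turkish text, and 10 models are evaluated on datasets from several domains. Late-interaction models significantly outperform dense models with fewer parameters while also supporting low-latency retrieval.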

Details

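Motivation: Neural information retrieval performs well in high-resource languages but is underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish retrieval, while late-interaction models have not been systematically evaluated, so a dedicated benchmark is needed to find retrieval models better suited to Turkish. Method: A two-stage adaptation pipeline first fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style late-interaction retrievers with the PyLate framework trained on MS MARCO-TR. Ten models are evaluated on five Turkish BEIR datasets, and indexing algorithms (MUVERA versus PLAID) are compared for efficiency and effectiveness. Result: The 1.0M-parameter colbert-hash-nano-tr is 600 times smaller than the 600M-parameter turkish-e5-large yet preserves over 71% of its average mAP. Late-interaction models that are 3 to 5 times smaller significantly outperform dense models, with ColmmBERT-base-TR gaining up to +13.8% mAP on domain-specific tasks. MUVERA+Rerank is 3.33 times faster than PLAID with a +1.7% relative mAP gain, reaching 0.54 ms query latency. Conclusion: Late-interaction models offer better parameter efficiency and performance for Turkish information retrieval, and combined with efficient indexing they enable production-grade low-latency retrieval. The work provides a practical path and open-source resources for IR in lower-resource, morphologically rich languages, although larger-scale data is needed to confirm broad applicability. Abstract: Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600$\times$ smaller than the 600M turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-interaction models that are 3--5$\times$ smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33$\times$ faster than PLAID and offers +1.7% relative mAP gain. 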
This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets ($\leq$50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
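
The token-level matching that distinguishes late-interaction models from dense bi-encoders is commonly the ColBERT-style MaxSim operator: each query token takes its best-matching document token, and those maxima are summed. The sketch below shows only that scoring rule on toy vectors. It is not the PyLate API, and the names `dot` and `maxsim_score` are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: sum over query tokens of the max similarity to any doc token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-D token embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

A dense bi-encoder would instead compress each text into a single vector before comparison, which is where the fine-grained matching is lost.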

[21] Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

Éloïse Benito-Rodriguez,Einar Urdshals,Jasmina Nasufi,Nicky Pochinkov

Main category: cs.CL

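TL;DR: This paper takes a first step toward a predictive framework that infers the genre of an input text from the activations of a large language model (LLM), reaching F1 scores of up to 98% and 71% with Mistral-7B on two datasets.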

Details

Motivation: Understanding large language models (LLMs) is essential for their safe and beneficial deployment, but the task is complicated by the difficulty of interpreting LLM structures and by the impossibility of having every output evaluated by humans. Method: Using Mistral-7B and two datasets, scikit-learn classifiers are trained to extract the text genre from the LLM's activations. Result: Genre prediction reaches F1 scores of up to 98% and 71% on the two datasets, and results consistently outperform the control task. Conclusion: Text genre can be inferred from an LLM with shallow learning models, offering a new and feasible route to understanding LLMs. Abstract: Understanding Large Language Models (LLMs) is key to ensuring their safe and beneficial deployment. This task is complicated by the difficulty of interpretability of LLM structures, and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.

[22] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

Zachary Ellis,Jared Joselowitz,Yash Deo,Yajie He,Anna Kalygina,Aisling Higham,Mana Rahimzadeh,Yan Jia,Ibrahim Habli,Ernest Lim

Main category: cs.CL

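TL;DR: This paper challenges the use of Word Error Rate (WER) as the evaluation standard for automatic speech recognition (ASR) in clinical dialogue and proposes an LLM-based evaluation method that more accurately measures the clinical risk posed by transcription errors.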

Details

Motivation: Existing ASR metrics such as WER do not reflect the real impact of transcription errors in clinical settings and correlate poorly with clinical risk, so an evaluation approach that better captures safety and practical usefulness is needed. Method: Expert clinicians compare ground-truth utterances with their ASR-generated counterparts and label the clinical impact of the discrepancies; a benchmark dataset is built on these labels, and an LLM-as-a-Judge is optimized with GEPA to replicate expert judgment. Result: Existing metrics, including WER, correlate poorly with the clinical risk labels; the optimized Gemini-2.5-Pro judge reaches 90% accuracy and a Cohen's κ of 0.816, close to human expert performance. Conclusion: The paper offers a validated, automated framework that moves ASR evaluation from textual accuracy to a scalable assessment of safety in clinical dialogue, supporting reliable deployment of ASR in medical settings. Abstract: As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $κ$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
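
Cohen's κ, the agreement statistic reported for the judge, corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes the unweighted form for two label sequences. The toy labels are invented, and whether the paper used a weighted variant is not stated in the abstract.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    n = len(labels_a)
    # Observed agreement: fraction of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the raters' marginal label frequencies, summed over labels.
    p_e = sum(counts_a[label] * counts_b[label] for label in set(labels_a) | set(labels_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented example using the three impact levels named in the abstract.
clinician = ["No", "No", "Minimal", "Significant"]
judge = ["No", "No", "Minimal", "Minimal"]
kappa = cohens_kappa(clinician, judge)
```

In this toy case the raters agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.375, which gives κ = 0.6.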

[23] Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation

Kexin Zhao,Ken Forbus

Main category: cs.CL

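TL;DR: Proposes a word sense disambiguation method that needs no hand-annotated training data. It uses statistical language models as oracles: candidate meanings are converted into natural language alternatives and an LLM is queried to select the appropriate interpretation.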

Details

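Motivation: Existing word sense disambiguation methods mainly target coarse-grained representations and depend on hand-annotated data, which makes it hard to automatically disambiguate richer semantic representations and limits their use in sophisticated inference. Method: The multiple candidate meanings produced by a symbolic natural language understanding system are converted into distinguishable natural language alternatives, an LLM selects the most suitable interpretation given the context, and the result is fed back to the original system. Result: Compared against a human-annotated gold standard, the method proves effective and supports word sense disambiguation without hand annotation. Conclusion: The method makes effective use of LLMs for word sense disambiguation, removes the dependence on hand-annotated data, and applies to automatic disambiguation of richer semantic representations. Abstract: Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.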

[24] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems

Elias Lumer,Alex Cardenas,Matt Melich,Myles Mason,Sara Dieter,Vamse Kumar Subbiah,Pradeep Honaganahalli Basavaraju,Roberto Hernandez

Main category: cs.CL

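TL;DR: This paper compares two retrieval methods for multimodal RAG systems and finds that retrieval with direct multimodal embeddings outperforms text retrieval over LLM-generated image summaries, significantly improving retrieval and question answering on a financial earnings call dataset.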

Details

Motivation: Existing multimodal RAG systems based on LLM summarization convert images to text during preprocessing, which loses visual context and hurts downstream performance. Method: Two retrieval approaches are compared: text-based chunk retrieval (images are first summarized into text) and direct multimodal embedding retrieval (images are embedded natively). Six LLMs and two multimodal embedding models are evaluated on a newly built financial question answering benchmark. Result: Direct multimodal embedding retrieval improves mAP@5 by 13% absolute (32% relative) and nDCG@5 by 11% absolute (20% relative), and the generated answers are more accurate and more factually consistent. Conclusion: Direct multimodal embeddings better preserve visual context and avoid the information loss introduced by LLM summarization, making them the stronger retrieval option for multimodal RAG. Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems, including text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate these approaches across 6 LLMs and two multimodal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5). These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. 
We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
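
The abstract reports each gain both as an absolute and as a relative improvement, and the two together pin down the approximate underlying scores: if the absolute gain is a and the relative gain is r, the baseline is about a / r. The helper below (the name `implied_scores` is illustrative) does that arithmetic. Because the reported percentages are rounded, the results are rough estimates rather than figures taken from the paper.

```python
def implied_scores(absolute_gain, relative_gain):
    """Return (baseline, improved) scores implied by an absolute and a relative gain.

    relative_gain = absolute_gain / baseline, so baseline = absolute_gain / relative_gain.
    """
    baseline = absolute_gain / relative_gain
    return baseline, baseline + absolute_gain


# Gains as reported in the abstract, expressed as fractions.
map_baseline, map_improved = implied_scores(0.13, 0.32)    # mAP@5
ndcg_baseline, ndcg_improved = implied_scores(0.11, 0.20)  # nDCG@5
```

Under these assumptions the text-summary baseline would sit near 0.41 mAP@5 and 0.55 nDCG@5, rising to roughly 0.54 and 0.66 with direct multimodal embeddings.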

[25] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

Ali Taghibakhshi,Sharath Turuvekere Sreenivas,Saurav Muralidharan,Ruisi Cai,Marcin Chochowski,Ameya Sunil Mahabaleshwarkar,Yoshi Suhara,Oluwatobi Olabiyi,Daniel Korzekwa,Mostofa Patwary,Mohammad Shoeybi,Jan Kautz,Bryan Catanzaro,Ashwath Aithal,Nima Tajbakhsh,Pavlo Molchanov

Main category: cs.CL

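TL;DR: Proposes Nemotron Elastic, a framework that nests multiple submodels of different sizes inside a single large model and allows them to be extracted zero-shot, sharply reducing training cost while maintaining strong performance.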

Details

Motivation: Training a family of large models at multiple scales is very expensive, and existing compression methods still require substantial training, so they cannot efficiently serve diverse deployment needs. Method: An end-to-end trained router and a two-stage training curriculum are combined with a hybrid Mamba-Attention architecture; group-aware SSM elastification, heterogeneous MLP elastification, normalized-MSE-based layer importance, and knowledge distillation together enable multiple nested submodels. Result: Applied to Nemotron Nano V2 12B, the method produces a 9B and a 6B model simultaneously with only 110B training tokens, at a training cost more than 360x lower than training from scratch and about 7x lower than SoTA compression techniques, with comparable or better accuracy. Conclusion: Nemotron Elastic enables efficient, scalable construction of reasoning model families, supports deployment across multiple budgets, and offers clear advantages in cost and deployment flexibility. Abstract: Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par or better than the SoTA in accuracy. 
Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model that has constant deployment memory against the number of models in the family.

cs.CV

[26] UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

Wei Zhang,Yeying Jin,Xin Li,Yan Zhang,Xiaofeng Cong,Cong Wang,Fengcai Qiao,Zhichao Lian

Main category: cs.CV

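TL;DR: Proposes UniFit, a universal virtual try-on framework built on a multimodal large language model. A semantic alignment module and a progressive training strategy address the text-image semantic gap and the scarcity of data for complex scenarios; the framework supports a variety of try-on tasks and achieves SOTA performance.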

Details

Motivation: Existing virtual try-on methods face a semantic gap between text instructions and images and a lack of data for complex scenarios when handling diverse and complex tasks, which makes a unified, flexible framework hard to build. Method: UniFit introduces an MLLM-Guided Semantic Alignment Module (MGSA) that fuses image and text information with a multimodal large language model and narrows the modality gap through learnable queries and a semantic alignment loss; a two-stage progressive training strategy with a self-synthesis pipeline improves learning of complex tasks from limited data. Result: Experiments show that UniFit effectively supports complex virtual try-on tasks such as multi-garment and model-to-model try-on, outperforms existing methods on multiple metrics, and achieves state-of-the-art performance. Conclusion: Through MLLM-driven semantic alignment and progressive learning, UniFit substantially improves the flexibility and generation quality of universal virtual try-on and offers an effective approach to building a unified VTON system. Abstract: Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance. The source code and pretrained models are available at https://github.com/zwplus/UniFit.

[27] EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3

Chengxi Zeng,Yuxuan Jiang,Aaron Zhang

Main category: cs.CV

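TL;DR: This paper presents EfficientSAM3, a family of efficient models built on Progressive Hierarchical Distillation (PHD) that enables on-device concept segmentation and tracking while retaining SAM3's strong performance.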

Details

Motivation: SAM3 excels at promptable concept segmentation in images and videos, but its unified architecture is computationally heavy and hard to deploy on device, so more efficient models are needed. Method: Progressive Hierarchical Distillation (PHD) proceeds in three stages: (1) encoder distillation, which aligns image features through prompt-in-the-loop training; (2) temporal memory distillation, which replaces dense memory with a Perceiver-based module that compresses and retrieves spatiotemporal features efficiently; and (3) end-to-end fine-tuning on SAM3 PCS data to optimize overall performance. Student models use lightweight backbones such as RepViT, TinyViT, and EfficientViT. Result: Evaluated on multiple VOS datasets, EfficientSAM3 achieves a favorable trade-off between efficiency and performance, compares well with related methods, and is suitable for on-device deployment. Conclusion: Through PHD, EfficientSAM3 inherits SAM3's capabilities while greatly reducing compute requirements, advancing promptable concept segmentation in real-time, low-resource settings. Abstract: The Segment Anything Model 3 (SAM3) advances visual understanding with Promptable Concept Segmentation (PCS) across images and videos, but its unified architecture (shared vision backbone, DETR-style detector, dense-memory tracker) remains prohibitive for on-device use. We present EfficientSAM3, a family of efficient models built on Progressive Hierarchical Distillation (PHD) that transfers capability from SAM3 to lightweight students in three stages: (1) Encoder Distillation aligns image features via prompt-in-the-loop training on SA-1B; (2) Temporal Memory Distillation replaces dense memory with a compact Perceiver-based module trained on SA-V to compress and retrieve spatiotemporal features efficiently; and (3) End-to-End Fine-Tuning refines the full pipeline on the official SAM3 PCS data to preserve concept-level performance. PHD yields a spectrum of student variants using RepViT, TinyViT, and EfficientViT backbones, enabling on-device concept segmentation and tracking while maintaining high fidelity to teacher behavior. We benchmark on popular VOS datasets and compare with a variety of related work, achieving strong performance-efficiency trade-offs.

[28] WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion

Sajjad Pakdamansavoji,Yintao Ma,Amir Rasouli,Tongtong Cao

Main category: cs.CV

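TL;DR: Proposes a new approach to model-based 6D object pose estimation under occlusion that markedly improves accuracy and speed through dynamic sampling, multi-hypothesis inference, iterative refinement, and data augmentation.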

Details

Motivation: Under occlusion, existing 6D pose estimation methods degrade because early-stage errors propagate through their multi-stage pipelines; they also generalize poorly to unseen objects and depend on CAD models at test time. Method: Four extensions are proposed: a dynamic non-uniform dense sampling strategy, a multi-hypothesis inference mechanism, iterative refinement, and occlusion-focused data augmentation, together with a new visibility-weighted evaluation metric. Result: Accuracy improves by more than 5% on ICBIN and more than 2% on the BOP datasets, with inference about 3 times faster. Conclusion: The approach mitigates the effects of occlusion, improves robustness and generalization, and outperforms existing methods in both accuracy and efficiency. Abstract: Accurate 6D object pose estimation is vital for robotics, augmented reality, and scene understanding. For seen objects, high accuracy is often attainable via per-object fine-tuning, but generalizing to unseen objects remains a challenge. To address this problem, prior work assumes access to CAD models at test time and typically follows a multi-stage pipeline to estimate poses: detect and segment the object, propose an initial pose, and then refine it. Under occlusion, however, the early stages of such pipelines are prone to errors, which can propagate through the sequential processing and consequently degrade performance. To remedy this shortcoming, we propose four novel extensions to model-based 6D pose estimation methods: (i) a dynamic non-uniform dense sampling strategy that focuses computation on visible regions, reducing occlusion-induced errors; (ii) a multi-hypothesis inference mechanism that retains several confidence-ranked pose candidates, mitigating brittle single-path failures; (iii) iterative refinement to progressively improve pose accuracy; and (iv) a series of occlusion-focused training augmentations that strengthen robustness and generalization. Furthermore, we propose a new visibility-weighted metric for evaluation under occlusion to minimize the bias in the existing protocols. Via extensive empirical evaluations, we show that our proposed approach achieves more than 5% improvement in accuracy on ICBIN and more than 2% on BOP dataset benchmarks, while achieving approximately 3 times faster inference.

[29] Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation

Lukas Arzoumanidis,Julius Knechtel,Jan-Henrik Haunert,Youness Dehbi

Main category: cs.CV

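TL;DR: Proposes an automatic generative technique and a manual degradation technique for producing realistic and diverse synthetic historical map data, addressing the shortage of training data for deep learning in historical map analysis.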

Details

Motivation: Deep learning methods for historical maps are usually limited by the scarcity of annotated data; data for specific cartographic domains is especially hard and time-consuming to obtain, so an effective way to generate high-quality training data is needed. Method: The cartographic style of original historical maps is transferred onto vector data, and an automatic deep generative model and a manual stochastic degradation technique are used to emulate the visual uncertainty and noise of historical maps, producing large amounts of realistic synthetic map data for domain-adaptive semantic segmentation. Result: The generated synthetic data significantly improved the semantic segmentation performance of a Self-Constructing Graph Convolutional Network on a homogeneous historical map corpus, confirming the effectiveness and applicability of the proposed data bootstrapping methods. Conclusion: The method effectively alleviates the scarcity of training data in historical map analysis and provides a scalable, practical data generation approach for deep-learning-based understanding of historical maps. Abstract: The automated analysis of historical documents, particularly maps, has drastically benefited from advances in deep learning and its success across various computer vision applications. However, most deep learning-based methods heavily rely on large amounts of annotated training data, which are typically unavailable for historical maps, especially for those belonging to specific, homogeneous cartographic domains, also known as corpora. Creating high-quality training data suitable for machine learning often takes a significant amount of time and involves extensive manual effort. While synthetic training data can alleviate the scarcity of real-world samples, it often lacks the affinity (realism) and diversity (variation) necessary for effective learning. By transferring the cartographic style of an original historical map corpus onto vector data, we bootstrap an effectively unlimited number of synthetic historical maps suitable for tasks such as land-cover interpretation of a homogeneous historical map corpus. We propose an automatic deep generative approach and an alternative manual stochastic degradation technique to emulate the visual uncertainty and noise, also known as data-dependent uncertainty, commonly observed in historical map scans. To quantitatively evaluate the effectiveness and applicability of our approach, the generated training datasets were employed for domain-adaptive semantic segmentation on a homogeneous map corpus using a Self-Constructing Graph Convolutional Network, enabling a comprehensive assessment of the impact of our data bootstrapping methods.

[30] Box6D: Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes

Yintao Ma,Sajjad Pakdamansavoji,Amir Rasouli,Tongtong Cao

Main category: cs.CV

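TL;DR: This paper proposes Box6D, a category-level 6D pose estimation method designed for storage boxes in warehouse environments. It infers box dimensions from RGB-D data with a binary search and estimates pose with a category CAD template, substantially reducing computational cost while preserving accuracy.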

Details

Motivation: Accurate and efficient 6D pose estimation of novel objects under clutter and occlusion is critical for applications such as warehouse automation, and existing methods fall short in flexibility, accuracy, or practicality. Method: From a single RGB-D image, Box6D infers box dimensions with a fast binary search, estimates the 6D pose using a category-level CAD template, and applies a depth-based plausibility filter with an early-stopping strategy to reduce computation. Result: Experiments on real-world warehouse scenes and public benchmarks show that Box6D reaches leading 6D pose accuracy while cutting inference time by about 76%. Conclusion: Box6D greatly improves inference efficiency while keeping accuracy high, making it suitable for fast 6D pose estimation of storage boxes in industrial warehouse settings. Abstract: Accurate and efficient 6D pose estimation of novel objects under clutter and occlusion is critical for robotic manipulation across warehouse automation, bin picking, logistics, and e-commerce fulfillment. There are three main approaches in this domain: model-based methods assume an exact CAD model at inference but require high-resolution meshes and transfer poorly to new environments; model-free methods that rely on a few reference images or videos are more flexible but often fail under challenging conditions; and category-level approaches aim to balance flexibility and accuracy, but many are overly general and ignore environment and object priors, limiting their practicality in industrial settings. To this end, we propose Box6D, a category-level 6D pose estimation method tailored for storage boxes in the warehouse context. From a single RGB-D observation, Box6D infers the dimensions of the boxes via a fast binary search and estimates poses using a category CAD template rather than instance-specific models. Using a depth-based plausibility filter and early-stopping strategy, Box6D then rejects implausible hypotheses, lowering computational cost. We conduct evaluations on real-world storage scenarios and public benchmarks, and show that our approach delivers competitive or superior 6D pose precision while reducing inference time by approximately 76%.

[31] RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

Meilong Xu,Di Fu,Jiaxing Zhang,Gong Yu,Jiayu Zheng,Xiaoling Hu,Dongdi Zhao,Feiyang Li,Chao Chen,Yong Cao

Main category: cs.CV

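TL;DR: Proposes a two-stage self-improvement paradigm that needs no new annotations: self-generated textual rationales bridge the semantic gap that vision language models face in domain-specific video classification, and the approach significantly outperforms direct supervised fine-tuning.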

Details

Motivation: Existing vision language models perform poorly on data-scarce, domain-specific video classification because of the semantic distance between complex spatio-temporal content and abstract labels (the "rationale gap"), which makes domain knowledge hard to learn. Method: Training has two stages: in the first, the VLM is prompted to generate a detailed textual rationale for each video and is fine-tuned on these self-generated rationales to strengthen its grasp of domain logic; in the second, conventional supervised fine-tuning is performed on the task labels. Result: Experiments on multiple diverse datasets show that the method significantly outperforms direct supervised fine-tuning and adapts more effectively to domain-specific video analysis. Conclusion: Self-generated rationales are an efficient paradigm that requires no extra annotation and effectively narrows the rationale gap of vision language models in low-data video classification, improving performance. Abstract: Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical rationale gap, where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model's pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.

[32] Boosting Medical Visual Understanding From Multi-Granular Language Learning

Zihan Li,Yiqing Wang,Sina Farsiu,Paul Kinahan

Main category: cs.CV

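TL;DR: Proposes Multi-Granular Language Learning (MGLL), a framework that strengthens multi-label and cross-granularity alignment in vision-language models and is particularly suited to complex domains such as medical imaging.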

Details

Motivation: Existing CLIP models focus on single-label, single-granularity image-text alignment, which limits them in complex domains such as medical imaging that require multi-label, multi-granularity understanding. Method: The MGLL framework uses structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints; a smooth KL divergence enforces cross-granularity consistency, and the framework can be integrated into existing models as a plug-and-play module. Result: Evaluated on multiple datasets, MGLL outperforms current state-of-the-art methods on downstream tasks. Conclusion: MGLL effectively improves the alignment ability and performance of vision-language models in multi-label, multi-granularity settings and shows good application potential, especially in medical image analysis. Abstract: Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.

[33] Automated Interpretable 2D Video Extraction from 3D Echocardiography

Milos Vukadinovic,Hirotaka Ieki,Yuki Sahasi,David Ouyang,Bryan He

Main category: cs.CV

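TL;DR: Proposes an automated method, based on deep learning and anatomical landmarks, for selecting standard 2D views from 3D cardiac ultrasound volumes, preserving diagnostic features and supporting clinical use.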

Details

Motivation: Conventional cardiac ultrasound relies on 2D videos, which makes it hard to assess complex cardiac structures fully; 3D ultrasound is promising, but clinical interpretation is still built around 2D views, so an automated way to turn 3D data into standard 2D views is needed. Method: A deep learning view classifier is combined with heuristics based on anatomical landmarks and cardiologists' expertise to reconstruct standard 2D views from 3D ultrasound volumes. Result: In a blinded evaluation by three cardiologists on 1,600 videos from two hospitals, accuracy reached 96%; the generated 2D videos can be used for abnormality detection (EchoPrime and PanEcho models) and clinical-grade measurements (EchoNet-Measurement), and they preserve spatial calibration. Conclusion: The method automatically converts 3D ultrasound into standard 2D views, preserves diagnostic information, remains compatible with existing clinical workflows, and improves the usability and efficiency of 3D ultrasound. Abstract: Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as their ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos at https://github.com/echonet/3d-echo.

[34] Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

Raphael Ruschel,Hardikkumar Prajapati,Awsafur Rahman,B. S. Manjunath

Main category: cs.CV

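TL;DR: This paper presents Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG). It combines visual prompting with spatio-temporal and semantic understanding to generate a temporally consistent scene graph from a user cue such as a click or bounding box.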

Details

Motivation: Existing VSGG systems are closed, feed-forward pipelines that cannot incorporate human guidance, while promptable segmentation models such as SAM2 lack semantic and relational reasoning. A unified framework that combines user interaction with semantic reasoning is therefore needed. Method: Click2Graph comprises a Dynamic Interaction Discovery Module, which generates subject-conditioned object prompts, and a Semantic Classification Head, which reasons jointly over entities and predicates. From a single user prompt it performs object segmentation, tracking, interaction discovery, and triplet prediction. Result: Experiments on the OpenPVSG benchmark show that Click2Graph establishes a strong foundation for user-guided PVSG, effectively combining human prompts, panoptic grounding, and relational inference. Conclusion: Click2Graph is the first interactive PVSG framework. It demonstrates the potential of combining human prompting with structured video understanding and advances controllable, interpretable video scene understanding. Abstract: State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts triplets to form a temporally consistent scene graph. Our framework introduces two key components: a Dynamic Interaction Discovery Module that generates subject-conditioned object prompts, and a Semantic Classification Head that performs joint entity and predicate reasoning. Experiments on the OpenPVSG benchmark demonstrate that Click2Graph establishes a strong foundation for user-guided PVSG, showing how human prompting can be combined with panoptic grounding and relational inference to enable controllable and interpretable video scene understanding.

[35] InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer

Muyao Yuan,Yuanhong Zhang,Weizhan Zhang,Lan Ma,Yuan Gao,Jiangyong Ying,Yudeng Xin

Main category: cs.CV

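TL;DR: This paper proposes InfoCLIP, an information-theoretic method for fine-tuning CLIP for open-vocabulary semantic segmentation. Mutual-information-driven objectives preserve modality alignment and improve fine-grained segmentation performance.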

Details

Motivation: Existing methods that fine-tune CLIP for segmentation on a limited set of categories tend to overfit and degrade the pretrained vision-language alignment, so a fine-tuning strategy that stabilizes modality alignment is needed. Method: InfoCLIP introduces two objectives grounded in mutual information. The first compresses the pixel-text modality alignment of pretrained CLIP to reduce noise. The second maximizes the mutual information between the alignment knowledge of the pretrained and fine-tuned models, transferring compact local semantic relations suited to segmentation. Result: Experiments on multiple benchmarks show that InfoCLIP improves CLIP fine-tuning for open-vocabulary semantic segmentation and demonstrate its adaptability and superiority in asymmetric transfer. Conclusion: Through information-theoretically guided knowledge transfer, InfoCLIP preserves CLIP's modality alignment and significantly improves its performance on open-vocabulary segmentation. Abstract: Recently, the strong generalization ability of CLIP has facilitated open-vocabulary semantic segmentation, which labels pixels using arbitrary text. However, existing methods that fine-tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision-language alignment. To stabilize modality alignment during fine-tuning, we propose InfoCLIP, which leverages an information-theoretic perspective to transfer alignment knowledge from pretrained CLIP to the segmentation task. Specifically, this transfer is guided by two novel objectives grounded in mutual information. First, we compress the pixel-text modality alignment from pretrained CLIP to reduce noise arising from its coarse-grained local semantic representations learned under image-text supervision. Second, we maximize the mutual information between the alignment knowledge of pretrained CLIP and the fine-tuned model to transfer compact local semantic relations suited for the segmentation task. Extensive evaluations across various benchmarks validate the effectiveness of InfoCLIP in enhancing CLIP fine-tuning for open-vocabulary semantic segmentation, demonstrating its adaptability and superiority in asymmetric transfer.

[36] Externally Validated Multi-Task Learning via Consistency Regularization Using Differentiable BI-RADS Features for Breast Ultrasound Tumor Segmentation

Jingru Zhang,Saed Moradi,Ashirbani Saha

Main category: cs.CV

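TL;DR: Proposes a multi-task learning method based on consistency regularization that uses differentiable BI-RADS-inspired morphological features to mitigate task interference in breast ultrasound tumor segmentation, significantly improving segmentation generalization.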

Details

Motivation: Multi-task learning can suffer from destructive task interference, which pushes performance below single-task baselines and limits generalization, notably in breast ultrasound tumor segmentation. Method: A novel consistency regularization approach built on differentiable BI-RADS-inspired morphological features mitigates interference between the segmentation and classification tasks. Result: All models were trained on the BrEaST dataset and evaluated on three external datasets (UDIAT, BUSI, and BUS-UCLM). Segmentation Dice coefficients for the proposed method versus the baseline were 0.81 vs 0.59, 0.66 vs 0.56, and 0.69 vs 0.49, respectively, all significantly better than the baseline (p<0.001), and the method reaches state-of-the-art performance on UDIAT. Conclusion: The method effectively mitigates destructive interference in multi-task learning and significantly improves the generalization and performance of breast ultrasound tumor segmentation. Abstract: Multi-task learning can suffer from destructive task interference, where jointly trained models underperform single-task baselines and limit generalization. To improve generalization performance in breast ultrasound-based tumor segmentation via multi-task learning, we propose a novel consistency regularization approach that mitigates destructive interference between segmentation and classification. The consistency regularization approach is composed of differentiable BI-RADS-inspired morphological features. We validated this approach by training all models on the BrEaST dataset (Poland) and evaluating them on three external datasets: UDIAT (Spain), BUSI (Egypt), and BUS-UCLM (Spain). Our comprehensive analysis demonstrates statistically significant (p<0.001) improvements in generalization for the segmentation task with the proposed multi-task approach vs. the baseline: UDIAT, BUSI, BUS-UCLM (Dice coefficient=0.81 vs 0.59, 0.66 vs 0.56, 0.69 vs 0.49, resp.). The proposed approach also achieves state-of-the-art segmentation performance under rigorous external validation on the UDIAT dataset.
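
For readers unfamiliar with the metric reported above, the following is a minimal sketch of the Dice coefficient on binary masks. It is the plain evaluation metric, not the paper's differentiable regularizer. The function name and the flat 0/1 list representation of a mask are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks.

    Both masks are flat sequences of 0/1 values with one entry per pixel.
    """
    if len(pred) != len(target):
        raise ValueError("masks must cover the same number of pixels")
    overlap = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * overlap / total


predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
score = dice_coefficient(predicted, reference)  # 2 * 1 / (2 + 1)
```

A score of 1.0 means the masks coincide and 0.0 means they do not overlap, so a gain such as 0.59 to 0.81 on UDIAT reflects substantially more overlap with the reference masks.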

[37] UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition

Xinyu Nan,Lingtao Mao,Huangyu Dai,Zexin Zheng,Xinyu Sun,Zihan Liang,Ben Chen,Yuqing Ding,Chenyi Lei,Wenwu Ou,Han Li

Main category: cs.CV

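TL;DR: Proposes a detection-guided generative framework that uses ROI-level features and a BART-based generator to recognize fine-grained hierarchical categories and attributes, significantly outperforming existing methods.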

Details

Motivation: Existing methods rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. Method: A detection-guided generative framework extracts ROI-level features for each detected object and uses a BART-based generator to emit hierarchical category and attribute tokens in coarse-to-fine order, with support for property-conditioned attribute recognition. Result: Experiments on large-scale proprietary e-commerce datasets and open-source datasets show that the approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems in fine-grained recognition and in the coherence of unified inference. Conclusion: The method addresses category discrimination and attribute diversity in fine-grained semantic understanding, offering a stronger unified generative solution for visual semantic understanding. Abstract: Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.

Dawei Li,Zijian Gu,Peng Wang,Chuhan Song,Zhen Tan,Mohan Zhang,Tianlong Chen,Yu Tian,Song Wang

Main category: cs.CV

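TL;DR: Proposes FADS, a fairness-aware in-context learning method that uses clustering-based sampling to build demographically balanced and semantically relevant demonstrations, reducing gender, race, and ethnicity bias in medical image reasoning while maintaining high accuracy.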

Details

Motivation: Existing debiasing methods depend on large labeled datasets or fine-tuning, which makes them hard to apply to foundation-scale multimodal models, and conventional demonstration selection strategies cannot guarantee fairness because the selected exemplars are demographically imbalanced. Method: Fairness-Aware Demonstration Selection (FADS) uses clustering-based sampling to build in-context demonstrations that are demographically balanced and semantically relevant, improving fairness through in-context learning without fine-tuning. Result: On multiple medical imaging benchmarks, FADS significantly reduces gender-, race-, and ethnicity-related prediction disparities while maintaining strong accuracy. Conclusion: FADS offers large multimodal models an efficient, scalable, tuning-free route to fair medical image reasoning and shows the potential of fairness-aware in-context learning in medical AI. Abstract: Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
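
The abstract does not spell out the sampling procedure, so the following is only a rough sketch of what demographically balanced, cluster-restricted demonstration selection could look like. It is not the authors' FADS algorithm. The function name, the dictionary keys (`id`, `cluster`, `group`, `similarity`), and the round-robin rule are all assumptions, and the clustering step is taken as already done.

```python
from collections import defaultdict


def select_balanced_demos(candidates, query_cluster, k):
    """Pick up to k demonstration ids from the query's cluster, balanced by group.

    candidates: dicts with hypothetical keys 'id', 'cluster', 'group', 'similarity'.
    """
    by_group = defaultdict(list)
    for cand in candidates:
        if cand["cluster"] == query_cluster:  # semantic relevance via cluster match
            by_group[cand["group"]].append(cand)
    for members in by_group.values():
        members.sort(key=lambda c: c["similarity"], reverse=True)

    selected = []
    groups = sorted(by_group)
    # Round-robin over demographic groups, taking the most similar candidate first.
    while len(selected) < k and any(by_group[g] for g in groups):
        for g in groups:
            if by_group[g] and len(selected) < k:
                selected.append(by_group[g].pop(0)["id"])
    return selected


pool = [
    {"id": "a", "cluster": 0, "group": "F", "similarity": 0.90},
    {"id": "b", "cluster": 0, "group": "F", "similarity": 0.80},
    {"id": "c", "cluster": 0, "group": "M", "similarity": 0.70},
    {"id": "d", "cluster": 1, "group": "M", "similarity": 0.95},
    {"id": "e", "cluster": 0, "group": "F", "similarity": 0.60},
]
demos = select_balanced_demos(pool, query_cluster=0, k=2)
```

A purely similarity-ranked selection over this pool would return "a" and "b", both from the same group. The round-robin step is what forces one exemplar per group, which is the imbalance the abstract attributes to conventional selection strategies.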

[39] Exploiting Inter-Sample Information for Long-tailed Out-of-Distribution Detection

Nimeshika Udayangani,Hadi M. Dolatabadi,Sarah Erfani,Christopher Leckie

Main category: cs.CV

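TL;DR: This paper proposes a graph neural network-based method that exploits inter-sample relationships and Gaussianizes feature distributions to improve OOD detection under long-tailed distributions, significantly outperforming existing methods on multiple benchmarks.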

Details

Motivation: With long-tailed training data, existing OOD detection methods tend to produce high false positive rates and lower tail-class classification accuracy, so a more effective approach is needed. Method: The graph structure is initialized from the feature space of a pre-trained model, Gaussianization corrects distribution deviations in the activation layers, and graph convolutional networks (GCNs) refine the graph representation for long-tailed OOD detection. Result: On the CIFAR10-LT, CIFAR100-LT, and ImageNet-LT benchmarks, the method significantly outperforms state-of-the-art approaches in both FPR and tail-class ID classification accuracy. Conclusion: Modeling inter-sample relationships with a graph and combining it with Gaussianization as preprocessing improves OOD detection in long-tailed settings and also improves recognition of tail classes. Abstract: Detecting out-of-distribution (OOD) data is essential for safe deployment of deep neural networks (DNNs). This problem becomes particularly challenging in the presence of long-tailed in-distribution (ID) datasets, often leading to high false positive rates (FPR) and low tail-class ID classification accuracy. In this paper, we demonstrate that exploiting inter-sample relationships using a graph-based representation can significantly improve OOD detection in long-tailed recognition of vision datasets. To this end, we use the feature space of a pre-trained model to initialize our graph structure. We account for the differences between the activation layer distribution of the pre-training vs. training data, and actively introduce Gaussianization to alleviate any deviations from a standard normal distribution in the activation layers of the pre-trained model. We then refine this initial graph representation using graph convolutional networks (GCNs) to arrive at a feature space suitable for long-tailed OOD detection. This leads us to address the inferior performance observed in ID tail-classes within existing OOD detection methods. Experiments over three benchmarks CIFAR10-LT, CIFAR100-LT, and ImageNet-LT demonstrate that our method outperforms the state-of-the-art approaches by a large margin in terms of FPR and tail-class ID classification accuracy.
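
The abstract does not say which Gaussianization transform is used. One common choice is a rank-based inverse normal transform applied per feature dimension, sketched below under that assumption. The function name is hypothetical, and ties are broken by input order for simplicity.

```python
from statistics import NormalDist


def gaussianize(values):
    """Map one feature dimension to approximately standard-normal values.

    Each value is replaced by the standard-normal quantile of its rank.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the quantile strictly inside (0, 1).
        out[idx] = standard_normal.inv_cdf((rank + 0.5) / n)
    return out


activations = [10.0, 1.0, 100.0]  # heavily skewed toy activations
normalized = gaussianize(activations)
```

The transform keeps the ordering of the activations but discards their skew, which is the kind of deviation from a standard normal distribution the abstract says it wants to alleviate.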

[40] Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion

Dingkun Zhou,Patrick P. K. Chan,Hengxu Wu,Shikang Zheng,Ruiqi Huang,Yuanjie Zhao

Main category: cs.CV

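TL;DR: Proposes a sequence-level optimization framework that generates natural, printable adversarial textures that evade human detection throughout entire walking videos, with strong concealment and robustness in both digital and physical settings.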

Details

Motivation: Existing wearable attacks struggle to stay concealed across long videos because they ignore motion, pose changes, and garment deformation, which causes the adversarial texture to fail. Method: Product images are mapped to UV space and parameterized by control points and a color palette. A physically based human-garment pipeline simulates dynamic variation, and an expectation-over-transformation objective with temporal weighting drives sequence-level optimization while keeping colors printable. Result: The method achieves stable, strong detection suppression in both digital and physical settings, with high robustness to viewpoint changes and good cross-model transferability. Printed garments confirm effectiveness in indoor and outdoor scenes. Conclusion: The method makes wearable adversarial attacks markedly more practical and persistent in real video surveillance scenarios and shows potential for real-world deployment. Abstract: Deep neural networks used for human detection are highly vulnerable to adversarial manipulation, creating safety and privacy risks in real surveillance environments. Wearable attacks offer a realistic threat model, yet existing approaches usually optimize textures frame by frame and therefore fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation. In this work, a sequence-level optimization framework is introduced to generate natural, printable adversarial textures for shirts, trousers, and hats that remain effective throughout entire walking videos in both digital and physical settings. Product images are first mapped to UV space and converted into a compact palette and control-point parameterization, with ICC locking to keep all colors printable. A physically based human-garment pipeline is then employed to simulate motion, multi-angle camera viewpoints, cloth dynamics, and illumination variation. An expectation-over-transformation objective with temporal weighting is used to optimize the control points so that detection confidence is minimized across whole sequences. Extensive experiments demonstrate strong and stable concealment, high robustness to viewpoint changes, and superior cross-model transferability. Physical garments produced with sublimation printing achieve reliable suppression under indoor and outdoor recordings, confirming real-world feasibility.
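
As a rough illustration of an expectation-over-transformation objective with temporal weighting, the sketch below averages detector confidence over sampled transformations within each frame and then takes a weighted mean across frames. The weighting scheme and the function name are assumptions, since the abstract gives neither. A real pipeline would compute the confidences with a differentiable renderer and detector.

```python
def sequence_eot_loss(confidences, frame_weights):
    """Temporally weighted expectation of detection confidence over a sequence.

    confidences[t][k]: detector confidence on frame t under sampled transformation k.
    frame_weights[t]: non-negative weight of frame t (hypothetical scheme).
    """
    if len(confidences) != len(frame_weights):
        raise ValueError("need one weight per frame")
    weighted_sum = 0.0
    for frame_conf, weight in zip(confidences, frame_weights):
        # Monte Carlo estimate of the expectation over transformations.
        expectation = sum(frame_conf) / len(frame_conf)
        weighted_sum += weight * expectation
    return weighted_sum / sum(frame_weights)


# Two frames, two sampled transformations each; the second frame counts three times as much.
loss = sequence_eot_loss([[0.2, 0.4], [0.8, 0.6]], [1.0, 3.0])
```

Minimizing a quantity of this shape pushes confidence down across the whole sequence rather than frame by frame, which is the contrast the abstract draws with earlier wearable attacks.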

[41] Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution

Xiao He,Zhijun Tu,Kun Cheng,Mingrui Zhu,Jie Hu,Nannan Wang,Xinbo Gao

Main category: cs.CV

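TL;DR: This paper proposes Mixture-of-Ranks (MoR), a sparsely-gated Mixture-of-Experts (MoE) architecture for one-step real-world image super-resolution (Real-ISR). It treats each rank in LoRA as an independent expert and uses a degradation estimation module to activate experts dynamically, significantly improving reconstruction quality while keeping computation efficient.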

Details

Motivation: Existing Real-ISR methods rely on dense model fine-tuning, which cannot adaptively capture the heterogeneous characteristics of complex real-world degraded samples and lacks an effective knowledge-sharing mechanism under a fixed computational budget. Method: The MoR architecture treats each LoRA rank as an independent expert through a fine-grained expert partitioning strategy. A degradation estimation module based on CLIP embeddings and text pairs dynamically guides expert activation. Zero-expert slots and a degradation-aware load-balancing loss adjust the number of active experts according to degradation severity. Result: Experiments show state-of-the-art performance on multiple benchmarks, improving both real-world super-resolution quality and the use of computational resources. Conclusion: Through sparsification and dynamic routing, MoR brings the advantages of MoE to Real-ISR, enabling adaptive modeling of complex degraded samples and efficient inference. Abstract: The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through a Low-Rank Adaptation (LoRA) module to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation. To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework's effectiveness and state-of-the-art performance.
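
The idea of treating each LoRA rank as an expert can be made concrete by noting that a LoRA update ΔW = BA is a sum of r rank-one terms, so a gate per rank selects which terms contribute. The sketch below illustrates that decomposition only. It is not the paper's router: the binary gates, the top-k rule, and both function names are assumptions, and the degradation-aware routing and zero-expert slots are omitted.

```python
def top_k_gates(scores, k, shared=()):
    """Binary gates: shared ranks are always on, plus the k best-scoring routed ranks."""
    routed = [r for r in range(len(scores)) if r not in shared]
    chosen = sorted(routed, key=lambda r: scores[r], reverse=True)[:k]
    active = set(chosen) | set(shared)
    return [1.0 if r in active else 0.0 for r in range(len(scores))]


def mixture_of_ranks_delta(B, A, gates):
    """Return sum_r gates[r] * outer(B[:, r], A[r, :]) as a d_out x d_in matrix.

    B is d_out x r and A is r x d_in, both given as nested lists.
    """
    d_out, d_in = len(B), len(A[0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for r, gate in enumerate(gates):
        if gate == 0.0:
            continue  # an inactive rank contributes nothing and costs nothing
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += gate * B[i][r] * A[r][j]
    return delta


B = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
gates = top_k_gates(scores=[0.1, 0.9, 0.5], k=1, shared=(0,))
delta_first_rank_only = mixture_of_ranks_delta(B, A, [1.0, 0.0])
delta_all_ranks = mixture_of_ranks_delta(B, A, [1.0, 1.0])
```

With every gate set to 1 the result is the ordinary dense LoRA update, so the gated form is a strict generalization and the saving comes entirely from skipping inactive ranks.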

[42] Towards a Safer and Sustainable Manufacturing Process: Material classification in Laser Cutting Using Deep Learning

Mohamed Abdallah Salem,Hamdy Ahmed Ashur,Ahmed Elshinnawy

Main category: cs.CV

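TL;DR: Proposes a deep learning method that classifies materials from laser speckle patterns to monitor and control laser cutting in real time, achieving high-accuracy material identification even when the laser color changes.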

Details

Motivation: Dust and smoke produced during laser cutting threaten the environment and workers' health, so a method that monitors the process and identifies material types in real time is needed for safe, efficient cutting. Method: A convolutional neural network (CNN) is trained on a dataset of laser speckle patterns to recognize material types, and its classification performance is verified under different laser colors. Result: The model reaches 98.30% accuracy on the training set and 96.88% on the validation set, and an F1-score of 0.9643 on 3,000 new images covering 30 materials. Conclusion: The method provides a robust and accurate solution for material-aware laser cutting based on speckle sensing. Abstract: Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers' health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real-time. This paper proposes a material classification technique using a speckle pattern of the material's surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30% on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.

[43] CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis

Zijian Wu,Mingfeng Jiang,Zidian Lin,Ying Song,Hanjie Ma,Qun Wu,Dongping Zhang,Guiyang Pu

Main category: cs.CV

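TL;DR: This paper proposes CuriGS, a curriculum-learning-based 3D Gaussian Splatting framework for sparse-view 3D reconstruction that augments training by gradually introducing perturbed pseudo-views (student views), improving rendering quality and geometric consistency.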

Details

Motivation: With sparse views, 3D Gaussian Splatting (3DGS) overfits because supervision is scarce and viewpoint coverage is limited, which prevents high-quality reconstruction. This paper addresses that challenge. Method: CuriGS generates pseudo-views (student views) at several perturbation levels around each ground-truth view (teacher), follows a curriculum that gradually unlocks higher perturbation levels, and constrains the student views with depth-correlation and co-regularization. A multi-signal metric scores the students, and the best ones are retained and added to the training set. Result: Experiments show that CuriGS outperforms state-of-the-art methods in rendering fidelity and geometric consistency across a range of synthetic and real sparse-view scenes. Conclusion: Curriculum-guided view augmentation effectively mitigates overfitting under sparse views and significantly improves 3DGS reconstruction with few views. Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as an efficient, high-fidelity representation for real-time scene reconstruction and rendering. However, extending 3DGS to sparse-view settings remains challenging because of supervision scarcity and overfitting caused by limited viewpoint coverage. In this paper, we present CuriGS, a curriculum-guided framework for sparse-view 3D reconstruction using 3DGS. CuriGS addresses the core challenge of sparse-view synthesis by introducing student views: pseudo-views sampled around ground-truth poses (teacher). For each teacher, we generate multiple groups of student views with different perturbation levels. During training, we follow a curriculum schedule that gradually unlocks higher perturbation levels, randomly sampling candidate students from the active level to assist training. Each sampled student is regularized via depth-correlation and co-regularization, and evaluated using a multi-signal metric that combines SSIM, LPIPS, and an image-quality measure. For every teacher and perturbation level, we periodically retain the best-performing students and promote those that satisfy a predefined quality threshold to the training set, resulting in a stable augmentation of sparse training views. Experimental results show that CuriGS outperforms state-of-the-art baselines in both rendering fidelity and geometric consistency across various synthetic and real sparse-view scenes. Project page: https://zijian1026.github.io/CuriGS/
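
The curriculum and promotion steps described in the abstract can be sketched as two small functions. This is a schematic reading, not the released CuriGS code: the fixed unlock interval, the tuple layout, and the single scalar quality score (standing in for the multi-signal metric) are assumptions.

```python
def active_level(iteration, unlock_every, num_levels):
    """Perturbation level currently unlocked; a new level opens every unlock_every steps."""
    return min(iteration // unlock_every, num_levels - 1)


def promote_students(scored_students, threshold):
    """Keep the best student per (teacher, level) and promote it if it clears the threshold.

    scored_students: (teacher_id, level, student_id, quality) tuples, where quality
    stands in for the combined SSIM / LPIPS / image-quality score.
    """
    best = {}
    for teacher, level, student, quality in scored_students:
        key = (teacher, level)
        if key not in best or quality > best[key][1]:
            best[key] = (student, quality)
    return sorted(student for student, quality in best.values() if quality >= threshold)


candidates = [
    ("teacher0", 0, "s1", 0.70),
    ("teacher0", 0, "s2", 0.90),
    ("teacher1", 0, "s3", 0.40),
]
promoted = promote_students(candidates, threshold=0.60)
```

Only one student per teacher and level can be promoted in a round, and a weak best candidate such as "s3" is held back, which matches the abstract's description of a stable rather than aggressive augmentation of the training views.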

[44] Crossmodal learning for Crop Canopy Trait Estimation

Timilehin T. Ayanlade,Anirudha Powadi,Talukder Z. Jubery,Baskar Ganapathysubramanian,Soumik Sarkar

Main category: cs.CV

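TL;DR: Proposes a cross-modal learning strategy that enriches satellite imagery with the high-resolution detail of unmanned aerial vehicle (UAV) imagery for crop canopy trait estimation. The resulting representations outperform real satellite imagery on tasks such as yield and nitrogen prediction.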

Details

Motivation: The limited spatial resolution of satellite imagery falls short of the needs of modern fine-grained farm management, while UAVs offer high resolution at higher cost, so a method that combines the strengths of both is needed. Method: A cross-modal learning strategy is trained on approximately co-registered satellite-UAV image pairs from replicated plots of 84 hybrid maize varieties across five locations, learning fine-grained spectral-spatial correspondences between the modalities. Result: UAV-like representations generated from satellite inputs consistently outperform the original satellite imagery on multiple downstream tasks, including yield and nitrogen prediction. Conclusion: Cross-modal correspondence learning can bridge the gap between satellite and UAV sensing in agricultural monitoring and has broad application potential. Abstract: Recent advances in plant phenotyping have driven widespread adoption of multi-sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAV) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are bound to the limitation of spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. Using a dataset of approximately co-registered satellite-UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine-grained spectral-spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.

[45] LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets

Qing Wang,Chong-Wah Ngo,Ee-Peng Lim,Qianru Sun

Main category: cs.CV

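TL;DR: Proposes a food recognition framework assisted by large language models (LLMs) that aligns images with generated text (such as food titles and ingredients) in a shared embedding space, addressing domain shift, long-tailed distributions, and fine-grained classification.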

Details

Motivation: Food recognition faces a domain gap between training data and images taken by real users, long-tailed data distributions, and subtle visual differences between categories. Conventional methods struggle to handle all of these at once. Method: LLMs parse food images to generate food titles and ingredients. The generated text and the food images are projected into a shared embedding space for cross-modal alignment, and the features of both modalities are fused for recognition. Result: On two food datasets the approach outperforms existing methods tailored to long-tailed distributions, domain adaptation, and fine-grained classification. Conclusion: The LLM-based multimodal alignment framework improves food recognition in complex real-world scenarios and shows strong robustness and generalization. Abstract: Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in the free-living environment. In addition to this domain-shift problem, the real-world food datasets tend to be long-tailed distributed and some dishes of different categories exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images to generate food titles and ingredients. Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform the existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.

[46] AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers

Boxun Xu,Yu Wang,Zihu Wang,Peng Li

Main category: cs.CV

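TL;DR: This paper proposes AMS-KV, an adaptive KV caching policy for next-scale prediction in visual autoregressive models, which substantially reduces memory usage and computation latency and improves generation efficiency and scalability.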

Details

Motivation: In visual autoregressive (VAR) models based on next-scale prediction, conventional KV caching suffers from memory that grows sharply with the number of scales, which severely limits scalability and calls for an efficient caching policy. Method: A systematic analysis finds that local scales and condensed coarse scales are critical to generation quality and that layers differ in cross-scale KV similarity. Building on this, AMS-KV prioritizes the KVs of local and condensed scales and uses inter-scale similarity to identify cache-demanding layers, optimizing cache utilization. Result: Compared with the baseline, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%, and it runs stably at a batch size of 256 (the baseline VAR-d30 runs out of memory at 128) with improved throughput. Conclusion: Through scale-aware KV cache management, AMS-KV greatly improves the memory efficiency and scalability of VAR models while preserving image generation quality, offering a practical path to large-scale image generation. Abstract: Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales, severely limiting scalability. Our systematic investigation reveals that: (1) attending to tokens from local scales significantly contributes to generation quality; (2) allocating a small amount of memory for the coarsest scales, termed condensed scales, stabilizes multi-scale image generation; and (3) strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on these observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%. Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
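
To make the "condensed plus local scales" idea concrete, the sketch below computes which earlier scales a cache would retain when generating a given scale, and how many tokens that leaves. It is a simplified, layer-agnostic illustration, not the AMS-KV policy itself. The function names, the fixed counts of condensed and local scales, and the token-map side lengths are assumptions chosen for the example.

```python
def scales_to_cache(current_scale, num_condensed, num_local):
    """Indices of earlier scales whose KVs are kept while generating current_scale."""
    condensed = set(range(min(num_condensed, current_scale)))  # coarsest scales
    local = set(range(max(0, current_scale - num_local), current_scale))  # most recent scales
    return sorted(condensed | local)


def cached_tokens(side_lengths, kept_scales):
    """Token count held in the cache; a scale with side s contributes s * s tokens."""
    return sum(side_lengths[s] ** 2 for s in kept_scales)


sides = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]  # illustrative token-map side lengths
kept = scales_to_cache(current_scale=9, num_condensed=2, num_local=2)
full = cached_tokens(sides, range(9))  # cache every earlier scale
reduced = cached_tokens(sides, kept)
```

In this toy setting the retained cache is 274 of 424 tokens. The larger reductions reported in the abstract additionally depend on treating cache-efficient and cache-demanding layers differently, which this sketch leaves out.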

[47] LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving

Pei Liu,Songtao Wang,Lang Zhang,Xingyue Peng,Yuandong Lyu,Jiaxin Deng,Songxin Lu,Weiliang Ma,Xueyang Zhang,Yifei Zhan,XianPeng Lang,Jun Ma

Main category: cs.CV

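TL;DR: This paper presents LiSTAR, a generative world model that operates on the sensor's native geometry to synthesize high-fidelity, controllable 4D LiDAR data. It uses a Hybrid-Cylindrical-Spherical (HCS) representation to reduce quantization error and a Spatio-Temporal Attention with Ray-Centric Transformer (START) to model complex dynamics in temporally sparse point clouds. It also proposes a 4D point cloud-aligned voxel layout and a discrete Masked Generative START (MaskSTART) framework for layout-guided compositional generation. Experiments show state-of-the-art performance on 4D LiDAR reconstruction, prediction, and conditional generation.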

Details

Motivation: Synthesizing high-fidelity, controllable 4D LiDAR data is difficult because of the sensor's spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. Existing methods are limited in geometric fidelity and temporal consistency and cannot meet the demand of autonomous driving simulation for high-quality, controllable data, so a method is needed that models directly in the native geometry and captures dynamics effectively. Method: LiSTAR has three parts: (1) a Hybrid-Cylindrical-Spherical (HCS) representation that preserves the sensor geometry and reduces the quantization error of Cartesian grids; (2) Spatio-Temporal Attention with Ray-Centric Transformer (START), which explicitly models feature evolution along individual sensor rays for temporal coherence; and (3) a 4D point cloud-aligned voxel layout with a discrete MaskSTART framework that learns a compact, tokenized scene representation for efficient, high-resolution, layout-guided generation. Result: LiSTAR reduces generation MMD by 76%, improves reconstruction IoU by 32%, and lowers prediction L1 Med by 50%, producing high-resolution, temporally consistent 4D point clouds and supporting layout-based controllable synthesis. Conclusion: By modeling in the sensor's native geometry with the HCS representation and START attention, LiSTAR substantially improves the quality and temporal consistency of generated 4D LiDAR data, and MaskSTART enables efficient controllable generation, giving autonomous driving simulation a practical tool. Abstract: Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor's unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly on the sensor's native geometry. LiSTAR introduces a Hybrid-Cylindrical-Spherical (HCS) representation to preserve data fidelity by mitigating quantization artifacts common in Cartesian grids. To capture complex dynamics from sparse temporal data, it utilizes a Spatio-Temporal Attention with Ray-Centric Transformer (START) that explicitly models feature evolution along individual sensor rays for robust temporal coherence. Furthermore, for controllable synthesis, we propose a novel 4D point cloud-aligned voxel layout for conditioning and a corresponding discrete Masked Generative START (MaskSTART) framework, which learns a compact, tokenized representation of the scene, enabling efficient, high-resolution, and layout-guided compositional generation. Comprehensive experiments validate LiSTAR's state-of-the-art performance across 4D LiDAR reconstruction, prediction, and conditional generation, with substantial quantitative gains: reducing generation MMD by 76%, improving reconstruction IoU by 32%, and lowering prediction L1 Med by 50%. This level of performance provides a powerful new foundation for creating realistic and controllable autonomous systems simulations. Project link: https://ocean-luna.github.io/LiSTAR.gitub.io.
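
The abstract does not define how the HCS representation combines its two coordinate systems, so the sketch below shows only the standard cylindrical and spherical conversions that the name refers to. The function names and the axis convention (z up, azimuth measured from the x axis) are assumptions.

```python
import math


def to_spherical(x, y, z):
    """Cartesian point to (range, azimuth, elevation), with angles in radians."""
    rng = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)
    elevation = math.asin(z / rng) if rng > 0 else 0.0
    return rng, azimuth, elevation


def to_cylindrical(x, y, z):
    """Cartesian point to (radial distance in the x-y plane, azimuth, height)."""
    return math.hypot(x, y), math.atan2(y, x), z


point_ahead = to_spherical(1.0, 0.0, 0.0)
point_left_and_up = to_cylindrical(0.0, 1.0, 2.0)
```

A spinning LiDAR samples at fixed azimuth and elevation steps, so binning points in these coordinates follows the sensor's rays, whereas a Cartesian voxel grid cuts across them. That mismatch is the quantization artifact the abstract mentions.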

[48] VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

Zishan Xu,Yifu Guo,Yuquan Lu,Fengyu Yang,Junxin Li

Main category: cs.CV

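TL;DR: This paper proposes VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that combines referring image segmentation with video mask propagation, and a task difficulty-aware mechanism adaptively controls reasoning length. It achieves state-of-the-art performance on multiple benchmarks.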

Details

Motivation: Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization, lacks an explicit reasoning process, and copes poorly with out-of-distribution scenarios. Method: The VideoSeg-R1 framework has three stages: a hierarchical text-guided frame sampler, a reasoning model that produces spatial cues and explicit reasoning chains, and a segmentation-propagation stage based on SAM2 and XMem. A task difficulty-aware mechanism dynamically adjusts reasoning length. Result: Extensive evaluation on multiple benchmarks shows that VideoSeg-R1 reaches state-of-the-art performance on complex video reasoning and segmentation tasks. Conclusion: By introducing reinforcement learning and explicit reasoning, VideoSeg-R1 significantly improves the performance and generalization of video reasoning segmentation and opens a new direction for future research. Abstract: Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) a hierarchical text-guided frame sampler to emulate human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.

[49] SpectralTrain: A Universal Framework for Hyperspectral Image Classification

Meihua Zhou,Liping Yu,Jiawei Cai,Wai Kin Fung,Ruiguo Hu,Jiarui Zhao,Wenzhuo Liu,Nan Wan

Main category: cs.CV

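TL;DR: SpectralTrain is a universal, architecture-agnostic training framework that combines curriculum learning with PCA-based spectral downsampling to make hyperspectral image classification training markedly more efficient, achieving 2-7x speedups with small accuracy loss and strong generalization across multiple datasets.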

Details

Motivation: Hyperspectral image classification involves large-scale data and high computational cost, which limits the deployment of deep learning models in practical remote sensing tasks, so an efficient and general training method is needed. Method: SpectralTrain combines curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling, gradually introducing spectral complexity while preserving essential information. This improves learning efficiency and applies to any model architecture. Result: Experiments on three benchmark datasets (Indian Pines, Salinas-A, and the newly introduced CloudPatch-7) show 2-7x reductions in training time with only a small drop in accuracy, and good generalization across spatial scales, spectral characteristics, and application domains. Conclusion: By optimizing the training strategy rather than the network architecture, SpectralTrain improves the training efficiency of hyperspectral image classification and offers a workable approach for climate-related remote sensing tasks such as cloud classification, underlining that training strategy optimization complements model design. Abstract: Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral-spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets (Indian Pines, Salinas-A, and the newly introduced CloudPatch-7) demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent 2-7x training speedups, with small-to-moderate accuracy deltas depending on the backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.

[50] Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments

Renxiang Xiao,Wei Liu,Yuanfan Zhang,Yushuai Chen,Jinming Chen,Zilu Wang,Liang Hu

Main category: cs.CV

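TL;DR: Proposes Rad-GS, a 4D radar-camera SLAM system that uses 3D Gaussians as a differentiable spatial representation and targets kilometer-scale outdoor environments.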

Details

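Motivation: Traditional camera- or LiDAR-based 3D Gaussian methods suffer from rendering artifacts and limited localization accuracy in large-scale outdoor environments and struggle to handle dynamic objects; radar-only systems lack texture information, which limits reconstruction quality. A SLAM system is therefore needed that fuses the strengths of radar and camera and handles large-scale scenes efficiently. Method: Uses 3D Gaussians as a differentiable spatial representation; combines raw radar point clouds (with Doppler information) and geometrically enhanced point clouds to guide dynamic object masking in synchronized images; uses unsynchronized image frames to globally refine the 3D Gaussian representation; introduces a global octree structure and a targeted Gaussian primitive management strategy to suppress noise and reduce memory consumption. Result: In large-scale experiments and ablation studies, Rad-GS achieves performance comparable to traditional camera- or LiDAR-based 3D Gaussian methods, completes kilometer-scale real-world scene reconstruction, and significantly reduces memory usage while improving localization accuracy and novel view synthesis quality. Conclusion: Rad-GS demonstrates the feasibility of robust large-scale outdoor mapping with 4D mmWave radar and points to a new direction for low-power, high-accuracy large-scale environment perception. Abstract: We present Rad-GS, a 4D radar-camera SLAM system designed for kilometer-scale outdoor environments, using 3D Gaussians as a differentiable spatial representation. Rad-GS combines the advantages of raw radar point clouds with Doppler information and geometrically enhanced point clouds to guide dynamic object masking in synchronized images, thereby alleviating rendering artifacts and improving localization accuracy. Additionally, unsynchronized image frames are leveraged to globally refine the 3D Gaussian representation, enhancing texture consistency and novel view synthesis fidelity. Furthermore, a global octree structure coupled with a targeted Gaussian primitive management strategy suppresses noise and significantly reduces memory consumption in large-scale environments. Extensive experiments and ablation studies demonstrate that Rad-GS achieves performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs, highlighting the feasibility of robust outdoor mapping using 4D mmWave radar. Real-world reconstruction at kilometer scale validates the potential of Rad-GS for large-scale scene reconstruction.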

[51] T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

Shao-Jun Xia,Huixin Zhang,Zhengzhong Tu

Main category: cs.CV

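TL;DR: Proposes T2T-VICL, a cross-task visual in-context learning framework for vision-language models (VLMs). It generates and selects text prompts that describe the differences between low-level vision tasks, builds the first cross-task VICL dataset, and combines perceptual score-based reasoning with traditional evaluation metrics, achieving strong results across 19 cross-task scenarios.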

Details

Motivation: Investigates whether vision-language models can still perform visual in-context learning (VICL) when the visual prompt and the target image come from different visual tasks, with the aim of extending VICL's generalization across tasks. Method: Designs a mechanism that generates and selects text prompts to implicitly describe the differences between distinct low-level vision tasks, builds the first cross-task VICL dataset, and proposes a new inference framework that combines perceptual score-based reasoning with traditional evaluation metrics. Result: The method achieves top-tier results in nine cross-task scenarios and second-tier results in ten more, substantially improving VLM performance on cross-task VICL. Conclusion: T2T-VICL extends the capability boundary of VLMs in cross-task visual in-context learning and confirms the key role of text prompts in cross-task transfer. Abstract: In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on small demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In this paper, we propose a fully collaborative pipeline, i.e., T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.

[52] Clustered Error Correction with Grouped 4D Gaussian Splatting

Taeho Kang,Jaeyeon Park,Kyungjin Lee,Youngki Lee

Main category: cs.CV

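TL;DR: Proposes CEM-4DGS, an improved 4D Gaussian Splatting method that uses elliptical error clustering with error-correcting splat addition, together with grouped 4D Gaussian Splatting, to improve the spatio-temporal consistency and rendering quality of dynamic scene reconstruction.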

Details

Motivation: Existing 4D Gaussian Splatting methods suffer from ambiguous pixel correspondences and insufficient splat density in dynamic regions, which hurts reconstruction accuracy and temporal consistency. Method: (1) Elliptical Error Clustering and Error Correcting Splat Addition: classifies rendering errors into missing-color and occlusion types and initializes splats in dynamic regions via backprojection or foreground splitting guided by cross-view color consistency; (2) Grouped 4D Gaussian Splatting: improves the consistency of the mapping between splats and dynamic objects. Result: Evaluated on the Neural 3D Video and Technicolor datasets, the method improves PSNR by 0.39 dB over existing methods and markedly improves temporal consistency and perceptual rendering quality; visualizations show better alignment between splats and dynamic objects and show that the method identifies errors and generates new splats effectively. Conclusion: The method addresses key weaknesses of 4D Gaussian Splatting in dynamic scenes, achieves state-of-the-art rendering, and shows good application potential. Abstract: Existing 4D Gaussian Splatting (4DGS) methods struggle to accurately reconstruct dynamic scenes, often failing to resolve ambiguous pixel correspondences and inadequate densification in dynamic regions. We address these issues by introducing a novel method composed of two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition that pinpoints dynamic areas to improve and initialize fitting splats, and (2) Grouped 4D Gaussian Splatting that improves consistency of mapping between splats and represented dynamic objects. Specifically, we classify rendering errors into missing-color and occlusion types, then apply targeted corrections via backprojection or foreground splitting guided by cross-view color consistency. Evaluations on the Neural 3D Video and Technicolor datasets demonstrate that our approach significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, with a PSNR gain of 0.39 dB on the Technicolor Light Field dataset. Our visualization shows improved alignment between splats and dynamic objects, and the error correction method's capability to identify errors and properly initialize new splats. Our implementation details and source code are available at https://github.com/tho-kn/cem-4dgs.
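
For readers gauging what a 0.39 dB PSNR gain means, the sketch below uses the standard PSNR definition, 10 * log10(MAX^2 / MSE), to convert a dB gain into the implied mean-squared-error ratio. The helper names are illustrative, and pixel values are assumed to lie in [0, 1]; neither comes from the paper.

```python
import numpy as np

def psnr(reference, rendered, max_value=1.0):
    """Standard PSNR in dB between two images with values in [0, max_value]."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """MSE(new) / MSE(old) implied by a PSNR gain of gain_db decibels."""
    return 10.0 ** (-gain_db / 10.0)

ratio = mse_ratio_for_gain(0.39)  # roughly 0.914
```

Under this definition, a 0.39 dB gain corresponds to roughly 8.6% lower mean squared error at a fixed peak value.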

[53] Decoupling Complexity from Scale in Latent Diffusion Model

Tianxiong Zhong,Xingye Tian,Xuebo Wang,Boyuan Jiang,Xin Tao,Pengfei Wan

Main category: cs.CV

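TL;DR: Proposes DCS-LDM, a visual generation model that decouples information complexity from scale by building a hierarchical, scale-independent latent space, supporting flexible generation at arbitrary resolutions and frame rates.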

Details

Motivation: Existing latent diffusion models couple scale with content complexity, which inflates latent capacity requirements unnecessarily; in practice content complexity is the deciding factor and scale should act only as an upper bound. Method: Designs a hierarchical, scale-independent latent space that models sample complexity with multi-level tokens and decodes to arbitrary resolutions and frame rates from a fixed latent representation, decomposing structural and detail information across levels and enabling coarse-to-fine generation. Result: Experiments show that DCS-LDM performs on par with state-of-the-art methods while supporting flexible generation across scales and visual qualities and a computation-quality trade-off. Conclusion: DCS-LDM decouples complexity from scale in visual generation, providing more efficient and flexible generation with broad application potential. Abstract: Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame-rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale serving only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.

[54] VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation

Chenyang Wu,Jiayi Fu,Chun-Le Guo,Shuhao Han,Chongyi Li

Main category: cs.CV

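TL;DR: Proposes VTinker, a video frame interpolation method built on guided flow upsampling and texture mapping that significantly improves high-resolution frame interpolation.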

Details

Motivation: Existing methods estimate optical flow at low resolution and apply simple upsampling, which blurs edges, loses detail, and misaligns pixels, causing ghosting and discontinuities. Method: Proposes VTinker, comprising guided flow upsampling (GFU), which uses the input frames as guidance to refine the upsampled flow, and Texture Mapping, which generates an intermediate proxy frame and selects clear texture blocks for reconstruction. Result: Across multiple experiments, VTinker achieves state-of-the-art performance on video frame interpolation. Conclusion: VTinker addresses flow upsampling blur and pixel-level artifacts in high-resolution frame interpolation, significantly improving interpolation quality. Abstract: Due to large pixel movement and high computational cost, estimating the motion of high-resolution frames is challenging. Thus, most flow-based Video Frame Interpolation (VFI) methods first predict bidirectional flows at low resolution and then use high-magnification upsampling (e.g., bilinear) to obtain the high-resolution ones. However, this kind of upsampling strategy may cause blur or mosaic at the flows' edges. Additionally, the motion of fine pixels at high resolution cannot be adequately captured in motion estimation at low resolution, which leads to the misalignment of task-oriented flows. With such inaccurate flows, input frames are warped and combined pixel-by-pixel, resulting in ghosting and discontinuities in the interpolated frame. In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping. After motion estimation at low resolution, GFU introduces input frames as guidance to alleviate the blurred details in bilinearly upsampled flows, which makes the flows' edges clearer. Subsequently, to avoid pixel-level ghosting and discontinuities, Texture Mapping generates an initial interpolated frame, referred to as the intermediate proxy. The proxy serves as a cue for selecting clear texture blocks from the input frames, which are then mapped onto the proxy to facilitate producing the final interpolated frame via a reconstruction module. Extensive experiments demonstrate that VTinker achieves state-of-the-art performance in VFI. Code is available at: https://github.com/Wucy0519/VTinker.

Caixin Kang,Yifei Huang,Liangyang Ouyang,Mingfang Zhang,Ruicong Liu,Yoichi Sato

Main category: cs.CV

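TL;DR: Introduces the Multimodal Interactive Deception Assessment (MIDA) task and a new dataset, shows that current multimodal large language models fall short at understanding social cues and judging deception, and proposes the SoCoT reasoning framework and DSEM module to improve social cognition.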

Details

Motivation: Existing MLLMs lack the human ability to 'read the room' and see through deception; their performance in complex social interactions needs to be quantified to drive more intelligent and trustworthy AI. Method: Builds the MIDA dataset with synchronized video and text and ground-truth labels, evaluates 12 advanced MLLMs, and designs a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module to strengthen social reasoning. Result: Existing models (e.g., GPT-4o) struggle to reliably distinguish true from false statements, commonly showing weak cross-modal grounding and a lack of theory of mind; adding SoCoT and DSEM improves model performance. Conclusion: Current MLLMs have clear shortcomings in social cognition; structured reasoning combined with dynamic belief modeling is needed for genuinely human-like social reasoning, and the proposed framework offers a viable path. Abstract: Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to 'read the room' and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields a performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.

[56] How Noise Benefits AI-generated Image Detection

Jiazhen Yan,Ziqiang Li,Fan Wang,Kai Zeng,Zhangjie Fu

Main category: cs.CV

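TL;DR: Proposes PiN-CLIP, which introduces positive-incentive noise in feature space to improve the generalization of AI-generated image detection, achieving state-of-the-art performance on an open-world dataset covering 42 generative models.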

Details

Motivation: Existing AI-generated image detectors generalize poorly out of distribution, mainly because they exploit spurious shortcut features during training. This work aims to address that weakness and improve robustness and generalization. Method: Proposes PiN-CLIP, which jointly trains a noise generator and a detection network under a variational positive-incentive principle, constructs positive-incentive noise in feature space via cross-attention fusion of visual and categorical semantic features, and injects it during optimization of the visual encoder. Result: On an open-world dataset covering 42 different generative models, the method improves average accuracy by 5.4 over existing approaches, reaching state-of-the-art performance. Conclusion: PiN-CLIP suppresses shortcut reliance and strengthens stable forensic cues, learning more robust and generalizable artifact representations and substantially improving cross-domain generalization in generated-image detection. Abstract: The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.

[57] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

Boshen Xu,Zihan Xiao,Jiaze Li,Jianzhong Ju,Zhenbo Luo,Jian Luan,Qin Jin

Main category: cs.CV

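TL;DR: TimeViper is a hybrid Mamba-Transformer vision-language model for long video understanding. Its TransV module transfers and compresses information from vision tokens into instruction tokens, allowing it to process hour-long videos exceeding 10,000 frames.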

Details

Motivation: Long video understanding requires both an efficient model architecture and effective handling of long temporal context; existing methods struggle with efficiency and information redundancy. Method: Uses a hybrid Mamba-Transformer backbone and proposes the TransV module, which dynamically transfers and compresses vision-token information into text tokens, improving long-range modeling efficiency and reducing redundancy. Result: TimeViper is competitive with state-of-the-art models on multiple benchmarks while processing long videos of more than 10,000 frames; the work also reveals a vision-to-text information aggregation phenomenon and analyzes attention behavior in the hybrid layers. Conclusion: The work advances hybrid Mamba-Transformer architectures for long video understanding and offers new perspectives on model interpretability and compression. Abstract: We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.

Li Yu,Yingbo Zhao,Shiyu Wu,Siyue Yu,Moncef Gabbouj,Qingshan Liu

Main category: cs.CV

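TL;DR: Proposes a blind video quality enhancement method built on a pretrained Degradation Representation Learning (DRL) module and a hierarchical termination mechanism, improving both restoration performance and inference efficiency for compressed video.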

Details

Motivation: Existing non-blind video quality enhancement methods depend on known quantization parameters (QPs), which limits practical use; current blind methods capture only global degradation information without spatial detail, and most methods use a uniform architecture across compression levels, ignoring differences in computational demand. Method: Designs a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide artifact removal, and introduces a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages according to the compression level, enabling adaptive computation. Result: At QP = 22, the PSNR improvement is 110% higher than that of a state-of-the-art blind method (0.65 dB versus 0.31 dB), and the average inference time is half that at QP = 42. Conclusion: The method balances performance and computational efficiency in blind video quality enhancement, clearly outperforms existing blind methods, and has strong practical potential. Abstract: Existing studies on Quality Enhancement for Compressed Video (QECV) predominantly rely on known Quantization Parameters (QPs), employing distinct enhancement models per QP setting, termed non-blind methods. However, in real-world scenarios involving transcoding or transmission, QPs may be partially or entirely unknown, limiting the applicability of such approaches and motivating the development of blind QECV techniques. Current blind methods generate degradation vectors via classification models with cross-entropy loss, using them as channel attention to guide artifact removal. However, these vectors capture only global degradation information and lack spatial details, hindering adaptation to varying artifact patterns at different spatial positions. To address these limitations, we propose a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide the artifact removal. Additionally, both blind and non-blind methods typically employ uniform architectures across QPs, overlooking the varying computational demands inherent to different compression levels. We thus introduce a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages based on the compression level. Experimental results demonstrate that the proposed approach significantly enhances performance, achieving a PSNR improvement of 110% (from 0.31 dB to 0.65 dB) over a competing state-of-the-art blind method at QP = 22. Furthermore, the proposed hierarchical termination mechanism reduces the average inference time at QP = 22 by half compared to QP = 42.
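
The "110%" figure in the abstract can be checked with simple arithmetic. The sketch below assumes that 0.31 dB and 0.65 dB are the PSNR gains of the competing blind method and the proposed method respectively (one reading of the abstract), and computes the relative increase between them; the function name is illustrative.

```python
def relative_gain_percent(baseline, improved):
    """Relative increase of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0

# Assumed reading: both values are PSNR gains in dB at QP = 22.
gain = relative_gain_percent(0.31, 0.65)
rounded = round(gain)
```

The exact value is (0.65 - 0.31) / 0.31, about 109.7%, which the abstract rounds to 110%.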

[59] SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction

Guolin Huang,Wenting Chen,Jiaqi Yang,Xinheng Lyu,Xiaoling Luo,Sen Yang,Xiaohan Xing,Linlin Shen

Main category: cs.CV

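TL;DR: Proposes SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction, addressing the shortcomings of existing methods in transparency, multimodal integration, and use of historical experience.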

Details

Motivation: Existing survival analysis methods lack clinical interpretability and cannot effectively integrate multimodal data, explore regions of interest, or draw on experience from historical cases. Method: SurvAgent has two stages. Stage one builds a WSI-gene CoT-enhanced case bank through hierarchical analysis (low-magnification screening, cross-modal similarity-aware patch mining, and confidence-aware patch mining) combined with gene-stratified analysis to produce structured reports. Stage two performs dichotomy-based multi-expert agent inference, retrieving similar cases via RAG and fusing multimodal reports with expert predictions through progressive interval refinement. Result: Experiments on five TCGA cohorts show that SurvAgent outperforms conventional methods, proprietary MLLMs, and medical agents. Conclusion: SurvAgent establishes a new paradigm for explainable AI-driven survival prediction in precision oncology. Abstract: Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superiority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.

[60] Real-Time 3D Object Detection with Inference-Aligned Learning

Chenyu Zhao,Xianwei Zheng,Zimin Xia,Linwei Yue,Nan Xue

Main category: cs.CV

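TL;DR: Proposes SR3D, a real-time 3D object detection framework for indoor point clouds that narrows the training-inference gap through spatial-prioritized optimization and rank-aware self-distillation, significantly outperforming existing methods on ScanNet V2 and SUN RGB-D.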

Details

Motivation: Existing 3D object detection methods are inconsistent between training and inference: training lacks spatial reliability and ranking awareness, so the model cannot learn representations aligned with inference-time behavior. Method: Proposes the SR3D framework with two key components: a spatial-prioritized optimal transport assignment that dynamically emphasizes well-located, spatially reliable samples, and a rank-aware adaptive self-distillation scheme that injects ranking awareness through self-distillation. Result: Extensive experiments on ScanNet V2 and SUN RGB-D show that SR3D significantly improves detection accuracy while maintaining real-time speed. Conclusion: SR3D bridges the training-inference gap and improves 3D object detection, suiting applications that need dynamic scene understanding such as augmented reality, robotics, and navigation. Abstract: Real-time 3D object detection from point clouds is essential for dynamic scene understanding in applications such as augmented reality, robotics and navigation. We introduce a novel Spatial-prioritized and Rank-aware 3D object detection (SR3D) framework for indoor point clouds, to bridge the gap between how detectors are trained and how they are evaluated. This gap stems from the lack of spatial reliability and ranking awareness during training, which conflicts with the ranking-based prediction selection used at inference. Such a training-inference gap hampers the model's ability to learn representations aligned with inference-time behavior. To address the limitation, SR3D consists of two components tailored to the spatial nature of point clouds during training: a novel spatial-prioritized optimal transport assignment that dynamically emphasizes well-located and spatially reliable samples, and a rank-aware adaptive self-distillation scheme that adaptively injects ranking perception via a self-distillation paradigm. Extensive experiments on ScanNet V2 and SUN RGB-D show that SR3D effectively bridges the training-inference gap and significantly outperforms prior methods in accuracy while maintaining real-time speed.

[61] Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

Ziyu Guo,Renrui Zhang,Hongyu Li,Manyuan Zhang,Xinyan Chen,Sifan Wang,Yan Feng,Peng Pei,Pheng-Ann Heng

Main category: cs.CV

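TL;DR: Proposes Thinking-while-Generating (TwiG), the first framework to interleave textual reasoning throughout the visual generation process, improving the semantic richness and context awareness of generated content.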

Details

Motivation: Existing visual generation methods lack on-the-fly multimodal interaction during generation; they apply textual reasoning only before or after generation and cannot dynamically guide or reflect on the process. Method: Introduces the TwiG framework, which interleaves textual reasoning throughout generation, and explores it via zero-shot prompting, supervised fine-tuning on the TwiG-50K dataset, and reinforcement learning with the TwiG-GRPO strategy. Result: Textual reasoning and visual content co-evolve during generation, improving the semantic consistency and detail quality of the outputs. Conclusion: TwiG offers a new paradigm for real-time reasoning interaction in visual generation, demonstrates the potential of interleaved reasoning, and is expected to encourage further research in this direction. Abstract: Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., think, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.

[62] A Spatial Semantics and Continuity Perception Attention for Remote Sensing Water Body Change Detection

Quanqing Ma,Jiaen Chen,Peng Wang,Yao Zheng,Qingzhan Zhao,Yuchen Zheng

Main category: cs.CV

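TL;DR: Proposes HSRW-CD, a new high-resolution remote sensing dataset for water body change detection, and designs the SSCP attention module to better exploit spatial semantic and structural information in deep features, significantly improving water body change detection.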

Details

Motivation: Existing water body change detection methods are limited by the lack of high-resolution datasets, and deep learning methods do not fully exploit the spatial semantic and structural information in features, leaving precise detection in urban and rural areas inadequate. Method: Builds HSRW-CD, a water body change detection dataset with spatial resolution higher than 3 meters; proposes the SSCP attention module, comprising Multi-Semantic spatial Attention (MSA), Structural Relation-aware Global Attention (SRGA), and Channel-wise Self-Attention (CSA), to strengthen semantic and continuity modeling of water body features; the module is plug-and-play and can be integrated into existing models. Result: Experiments on the HSRW-CD and Water-CD datasets show that the SSCP module significantly improves detection accuracy and generalization, outperforming existing methods. Conclusion: The SSCP module strengthens the modeling of spatial semantic and structural information in water body change detection and, together with the new high-resolution HSRW-CD dataset, advances high-accuracy remote sensing water body change detection. Abstract: Remote sensing Water Body Change Detection (WBCD) aims to detect water body surface changes from bi-temporal images of the same geographic area. Recently, the scarcity of high spatial resolution datasets for WBCD restricts its application in urban and rural regions, which require more accurate positioning. Meanwhile, previous deep learning-based methods fail to comprehensively exploit the spatial semantic and structural information in deep features in the change detection networks. To resolve these concerns, we first propose a new dataset, HSRW-CD, with a spatial resolution higher than 3 meters for WBCD. Specifically, it contains a large number of image pairs, widely covering various water body types. Besides, a Spatial Semantics and Continuity Perception (SSCP) attention module is designed to fully leverage both the spatial semantics and structure of deep features in the WBCD networks, significantly improving the discrimination capability for water bodies. The proposed SSCP has three components: the Multi-Semantic spatial Attention (MSA), the Structural Relation-aware Global Attention (SRGA), and the Channel-wise Self-Attention (CSA). The MSA enhances the spatial semantics of water body features and provides precise spatial semantic priors for the CSA. Then, the SRGA further extracts spatial structure to learn the spatial continuity of the water body. Finally, the CSA utilizes the spatial semantic and structural priors from the MSA and SRGA to compute the similarity across channels. Specifically designed as a plug-and-play module for water body deep features, the proposed SSCP allows integration into existing WBCD models. Numerous experiments conducted on the proposed HSRW-CD and Water-CD datasets validate the effectiveness and generalization of the SSCP. The code of this work and the HSRW-CD dataset will be available at https://github.com/QingMa1/SSCP.

[63] LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM

Sibaek Lee,Seongbo Ha,Kyeongsu Kang,Joonyeol Choi,Seungjun Tak,Hyeonwoo Yu

Main category: cs.CV

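TL;DR: LEGO-SLAM is the first framework to achieve real-time, open-vocabulary semantic mapping within a 3D Gaussian Splatting SLAM system. A scene-adaptive encoder-decoder compresses language features for efficient storage and rendering, and the compact features support semantic pruning and language-based loop detection.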

Details

Motivation: Existing 3DGS SLAM systems lack open-vocabulary semantic understanding, and integrating language features incurs high memory overhead, slow rendering, and poor model adaptability. Method: Proposes LEGO-SLAM, which uses a scene-adaptive encoder-decoder to compress high-dimensional language embeddings into compact 16-dimensional features, combined with language-guided Gaussian pruning and loop detection based on the same features. Result: Reduces the Gaussian count by over 60% while maintaining rendering quality, runs in real time at 15 FPS, and shows competitive mapping quality and tracking accuracy across experiments. Conclusion: LEGO-SLAM delivers efficient, adaptive open-vocabulary semantic SLAM, bringing strong semantic understanding to 3DGS systems while remaining real-time and lightweight. Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map's Gaussian count by over 60% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.
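
To give a feel for why a 16-dimensional feature space and a 60% cut in Gaussian count matter for memory, here is a back-of-the-envelope sketch. Only the 16 dimensions and the "over 60%" reduction come from the abstract; the original embedding size of 512, the float32 storage, and the one-million-Gaussian map are hypothetical values chosen for illustration.

```python
def feature_memory_bytes(num_gaussians, feature_dim, bytes_per_value=4):
    """Memory needed to store one feature vector per Gaussian (float32 by default)."""
    return num_gaussians * feature_dim * bytes_per_value

NUM_GAUSSIANS = 1_000_000  # hypothetical map size
RAW_DIM = 512              # hypothetical language-embedding dimension
COMPACT_DIM = 16           # compact dimension stated in the abstract

full = feature_memory_bytes(NUM_GAUSSIANS, RAW_DIM)
compact = feature_memory_bytes(NUM_GAUSSIANS, COMPACT_DIM)
# A 60% reduction in Gaussian count leaves 40% of the Gaussians.
pruned = feature_memory_bytes(NUM_GAUSSIANS * 40 // 100, COMPACT_DIM)
```

Under these assumptions the compact features take 1/32 of the raw feature memory, and pruning lowers that again to 40% of the compact figure.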

[64] Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval

Chunxu Liu,Jiyuan Yang,Ruopeng Gao,Yuhan Zhu,Feng Zhu,Rui Zhao,Limin Wang

Main category: cs.CV

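TL;DR: Proposes Reasoning Guided Embeddings (RGE), a multimodal embedding method that explicitly brings the generative reasoning of multimodal large language models (MLLMs) into the embedding process and combines it with contrastive training, improving retrieval performance on the MMEB benchmark by 4.9% over a non-reasoning baseline.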

Details

Motivation: Existing multimodal embedding methods treat embedding extraction as a direct encoding step and overlook the reasoning capability of MLLMs that could improve representation quality; how to explicitly integrate reasoning into the embedding process therefore needs to be explored. Method: Proposes Reasoning Guided Embeddings (RGE), which uses the generative capability of MLLMs to produce structured rationales, extracts representations after the reasoning has unfolded, and trains with contrastive learning. Result: Experiments on the MMEB benchmark show a 4.9% improvement in multimodal retrieval over the non-reasoning baseline. Conclusion: Explicitly introducing a reasoning process improves multimodal embedding quality, confirming the value of MLLM reasoning for representation learning. Abstract: Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.
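
The abstract says the extracted representations are trained contrastively but does not name the loss. As a hedged illustration, the sketch below implements InfoNCE, a common choice for embedding retrieval, with in-batch negatives; the loss choice, the temperature of 0.07, and the function name are assumptions rather than details from the paper.

```python
import numpy as np

def info_nce_loss(queries, targets, temperature=0.07):
    """InfoNCE over a batch: row i of `queries` should match row i of `targets`.

    All other rows in the batch act as negatives. Inputs are (N, D) arrays.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = q @ t.T / temperature                       # cosine similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # positives sit on the diagonal

aligned = info_nce_loss(np.eye(4), np.eye(4))
shuffled = info_nce_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

In an RGE-style setup, `queries` would be the embeddings taken after the rationale has been generated, and `targets` the embeddings of the matching candidates; correctly paired batches give a loss near zero, while mismatched pairs give a large loss.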

[65] Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

Jian Ma,Qirong Peng,Xujie Zhu,Peixing Xie,Chen Chen,Haonan Lu

Main category: cs.CV

TL;DR: Proposes PPCL, a flexible structured pruning framework for compressing Diffusion Transformers that cuts parameters by 50% with less than 3% performance degradation.

Details

Motivation: DiT models perform well in image generation, but their large parameter counts and high computational cost make them hard to deploy in resource-constrained environments, so effective compression methods are needed. Method: Identifies redundant layer intervals through linear probing combined with first-order differential trend analysis of similarity metrics, and designs a plug-and-play teacher-student alternating distillation framework that unifies depth-wise and width-wise pruning. Result: Experiments on multiple Multi-Modal Diffusion Transformer models show that PPCL reduces parameters by 50% with less than 3% degradation in key metrics while preserving high-quality image generation. Conclusion: PPCL is an efficient, flexible DiT compression scheme suited to resource-constrained settings; it does not require retraining for each configuration and offers good practicality and extensibility. Abstract: Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with the first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code and checkpoints for PPCL can be found at the following link: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.

[66] Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

Yibin Huang,Wang Xu,Wanyue Zhang,Helu Zhi,Jingjing Huang,Yangbin Xu,Yangang Sun,Conghui Zhu,Tiejun Zhao

Main category: cs.CV

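TL;DR: Proposes Video2Layout, a framework that reconstructs metric spatial layouts from video using continuous object boundary coordinates, improving the fine-grained spatial reasoning of multimodal large language models.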

Details

Motivation: 现有基于网格的认知图依赖离散化表示，限制了模型在细粒度空间推理上的表现，需更精确的空间表征方法。 Method: 提出Video2Layout，使用连续物体边界坐标表示空间布局，并通过监督微调与强化微调两阶段训练；构建AI2THOR数据集并引入QVS-Bench基准评估输入图像数量对空间推理的影响。 Result: V2LO-7B在QVS-Bench和主流空间推理基准上平均比基于网格图的模型提升4.92%。 Conclusion: 连续坐标表示的空间布局优于传统离散网格图，显著提升多模态大模型的空间理解与定量推理能力。 Abstract: Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model's ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model's real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. 
Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
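
To make the contrast between continuous boundary coordinates and a discretized grid concrete, here is a minimal Python sketch. It is not the Video2Layout code: the `Box` layout format, the metre units, and the 1 m grid cell are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned object footprint in metres (hypothetical layout format)."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def size(self):
        # Width and depth follow directly from the boundary coordinates.
        return (self.x_max - self.x_min, self.y_max - self.y_min)


def gap(a, b):
    """Shortest distance between two axis-aligned boxes (0 if they overlap)."""
    dx = max(a.x_min - b.x_max, b.x_min - a.x_max, 0.0)
    dy = max(a.y_min - b.y_max, b.y_min - a.y_max, 0.0)
    return math.hypot(dx, dy)


def grid_cell(box, cell=1.0):
    """Grid-map view: only the cell containing the box centre is kept."""
    cx = (box.x_min + box.x_max) / 2
    cy = (box.y_min + box.y_max) / 2
    return (int(cx // cell), int(cy // cell))


sofa = Box("sofa", 0.0, 0.0, 2.0, 1.0)
table = Box("table", 2.5, 0.0, 3.5, 1.0)

print(gap(sofa, table))                    # metric gap between the two objects
print(sofa.size())                         # metric size of the sofa
print(grid_cell(sofa), grid_cell(table))   # coarse cells lose both quantities
```

With boundary coordinates the gap and the object size are simple arithmetic; with grid cells only the cell indices survive, so neither quantity can be recovered exactly.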

[67] Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion

Lirui Zhang,Zhengkai Zhao,Zhi Zuo,Pan Gao,Jie Qin

Main category: cs.CV

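TL;DR: Proposes Simba, a point cloud completion framework that recasts point-wise transformation regression as a distribution learning problem and combines symmetry priors with diffusion models, addressing the overfitting and noise sensitivity of existing methods; a hierarchical Mamba-based architecture provides high-fidelity upsampling, and the method reaches SOTA performance on multiple benchmarks.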

Details

Motivation: In point cloud completion, preserving the fine-grained details of the input while ensuring overall structural integrity is a long-standing challenge. Existing methods that regress local symmetry transformations improve geometric detail preservation but are prone to overfitting and sensitive to input noise. Method: Proposes Simba, which turns point-wise transformation regression into a distribution learning task and combines symmetry priors with diffusion models to improve generalization and robustness; a hierarchical Mamba architecture performs high-quality upsampling. Result: Experiments on the PCN, ShapeNet, and KITTI datasets show that the method outperforms existing approaches both quantitatively and qualitatively, achieving state-of-the-art performance. Conclusion: Through distribution learning and generative modeling, Simba overcomes the limitations of traditional regression methods, achieving better detail preservation, structural integrity, and noise robustness in point cloud completion, with strong generalization. Abstract: Point cloud completion is a fundamental task in 3D vision. A persistent challenge in this field is simultaneously preserving fine-grained details present in the input while ensuring the global structural integrity of the completed shape. While recent works leveraging local symmetry transformations via direct regression have significantly improved the preservation of geometric structure details, these methods suffer from two major limitations: (1) they are prone to overfitting, tending to memorize instance-specific transformations instead of learning a generalizable geometric prior; (2) their reliance on point-wise transformation regression leads to high sensitivity to input noise, severely degrading their robustness and generalization. To address these challenges, we introduce Simba, a novel framework that reformulates point-wise transformation regression as a distribution learning problem. Our approach integrates symmetry priors with the powerful generative capabilities of diffusion models, avoiding instance-specific memorization while capturing robust geometric structures. Additionally, we introduce a hierarchical Mamba-based architecture to achieve high-fidelity upsampling. Extensive experiments across the PCN, ShapeNet, and KITTI benchmarks validate our method's state-of-the-art (SOTA) performance.

[68] Layer-wise Noise Guided Selective Wavelet Reconstruction for Robust Medical Image Segmentation

Yuting Lu,Ziliang Wang,Weixin Xu,Wei Zhang,Yongqiang Zhao,Yang Yu,Xiaohong Zhang

Main category: cs.CV

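TL;DR: Proposes LNG-SWR, which improves the robustness of medical image segmentation models through layer-wise noise-guided selective wavelet reconstruction in the frequency domain; it has low overhead, scales well, and improves performance both with and without adversarial training.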

Details

Motivation: In clinical deployment, medical image segmentation models face stability problems under distribution shift and perturbation; existing adversarial training methods trade clean accuracy against robustness and incur high training cost, which limits their scalability and maintainability. Method: Small zero-mean noise is injected at multiple network layers to learn a frequency-bias prior that steers representations away from noise-sensitive directions; guided by this prior, selective wavelet reconstruction is applied on the input or feature branch to suppress vulnerable frequency bands, enhance directional structures and shape cues, and stabilize boundary responses while maintaining spectral consistency. Result: On CT and ultrasound datasets, LNG-SWR performs well on standard metrics (Dice/IoU) and significantly reduces the performance drop under strong attacks; it yields consistent gains with or without adversarial training, the gains are additive when it is combined with adversarial training, and clean accuracy is not sacrificed. Conclusion: LNG-SWR offers a simple, efficient, and engineering-friendly way to improve the robustness of medical image segmentation; it applies to both adversarial and standard training and is scalable and practical. Abstract: Clinical deployment requires segmentation models to stay stable under distribution shifts and perturbations. The mainstream solution is adversarial training (AT) to improve robustness; however, AT often brings a clean--robustness trade-off and high training/tuning cost, which limits scalability and maintainability in medical imaging. We propose Layer-wise Noise-Guided Selective Wavelet Reconstruction (LNG-SWR). During training, we inject small, zero-mean noise at multiple layers to learn a frequency-bias prior that steers representations away from noise-sensitive directions. We then apply prior-guided selective wavelet reconstruction on the input/feature branch to achieve frequency adaptation: suppress noise-sensitive bands, enhance directional structures and shape cues, and stabilize boundary responses while maintaining spectral consistency. The framework is backbone-agnostic and adds low additional inference overhead. It can serve as a plug-in enhancement to AT and also improves robustness without AT. On CT and ultrasound datasets, under a unified protocol with PGD-$L_{\infty}/L_{2}$ and SSAH, LNG-SWR delivers consistent gains on clean Dice/IoU and significantly reduces the performance drop under strong attacks; combining LNG-SWR with AT yields additive gains. When combined with adversarial training, robustness improves further without sacrificing clean accuracy, indicating an engineering-friendly and scalable path to robust segmentation.
These results indicate that LNG-SWR provides a simple, effective, and engineering-friendly path to robust medical image segmentation in both adversarial and standard training regimes.
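
The idea of suppressing selected wavelet bands before reconstruction can be illustrated with a one-level Haar transform on a 1D signal. This is a minimal sketch, not the LNG-SWR implementation: the Haar basis, the 1D setting, and the hand-set `detail_gain` are assumptions, whereas the paper derives its band selection from a learned, noise-guided prior.

```python
import math


def haar_forward(x):
    """One-level orthonormal Haar transform of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Inverse of haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def selective_reconstruct(x, detail_gain):
    """Rescale the high-frequency band, then reconstruct.

    detail_gain is a hypothetical hand-set scalar; 1.0 keeps the band,
    0.0 removes it entirely.
    """
    approx, detail = haar_forward(x)
    return haar_inverse(approx, [detail_gain * d for d in detail])


signal = [1.0, 3.0, 2.0, 2.0]
print(selective_reconstruct(signal, 1.0))  # band kept: signal is recovered
print(selective_reconstruct(signal, 0.0))  # band removed: pairwise averages
```

A gain of 1.0 reproduces the input up to floating-point error, which is the sense in which reconstruction is lossless when no band is suppressed.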

[69] An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

Zhi Luo,Zenghui Yuan,Wenqi Wei,Daizong Liu,Pan Zhou

Main category: cs.CV

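TL;DR: Proposes a new verbose-text induction attack (VTIA) that uses a two-stage framework to inject imperceptible adversarial perturbations into images so as to maximize the number of output tokens of vision-language models (VLMs), improving attack effectiveness, efficiency, and generalization.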

Details

Motivation: The number of tokens a VLM consumes during generation has become a key evaluation metric, yet existing methods cannot lengthen outputs in a stable and controllable way, so a more effective attack is needed to assess the deployment efficiency and security of these models. Method: A two-stage framework is used. The first stage performs adversarial prompt search with reinforcement learning to find malicious prompts that induce the LLM to produce verbose outputs; the second stage performs vision-aligned perturbation optimization so that the visual embeddings of the perturbed image resemble the embeddings of the adversarial prompt, thereby triggering verbose text generation. Result: Experiments on four mainstream VLMs show that the method significantly outperforms existing methods at increasing the number of output tokens, with higher effectiveness, efficiency, and cross-model generalization. Conclusion: VTIA effectively induces VLMs to generate long, highly redundant text with low information density, exposing a security weakness in the generation control of current models and offering guidance for improving VLM robustness and deployment efficiency. Abstract: With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric. Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and controllability. To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token of the perturbed images. Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation.
Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.

[70] EvoVLA: Self-Evolving Vision-Language-Action Model

Zeting Liu,Zida Yang,Zeyu Zhang,Hao Tang

Main category: cs.CV

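TL;DR: EvoVLA is a self-supervised vision-language-action framework that uses a stage-aligned reward, pose-based object exploration, and a long-horizon memory mechanism to mitigate stage hallucination in long-horizon manipulation, significantly improving performance on both simulated and real-robot tasks.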

Details

Motivation: Existing vision-language-action (VLA) models suffer from stage hallucination in long-horizon manipulation: they exploit coarse evaluation signals to skip task steps, which inflates reported task completion and prevents genuine multi-step execution. Method: Proposes the EvoVLA framework with three core components: (1) Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; (2) Pose-Based Object Exploration (POE), which grounds curiosity in the relative pose between gripper and object; and (3) Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic reward shaping during long-horizon policy learning. Result: On the Discoverse-L benchmark, EvoVLA improves average task success by 10.2 percentage points over the strongest baseline, OpenVLA-OFT, reaching 69.2%; sample efficiency improves by 1.5x and the stage hallucination rate drops from 38.5% to 14.8%. On real robots the average success rate reaches 54.6%, 11 percentage points above the baseline, showing strong sim-to-real transfer and generalization. Conclusion: Through the coordinated design of its components, EvoVLA effectively addresses stage hallucination in long-horizon manipulation for VLA models, significantly improving task success, sample efficiency, and real-world deployment, and advancing zero-shot generalization and cross-environment transfer. Abstract: Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization.
Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
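
The triplet contrastive objective named in the Stage-Aligned Reward can be sketched generically in Python. This is the standard margin-based triplet loss, not EvoVLA's code: the squared Euclidean distance, the margin of 0.2, and the toy 2D embeddings are assumptions, and the abstract does not say what exactly is embedded.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull the positive closer than the negative
    by at least `margin` (an assumed hyperparameter)."""
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )


# Toy 2D embeddings (hypothetical values).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
easy_negative = [3.0, 0.0]   # far from the anchor: contributes no loss
hard_negative = [1.0, 0.0]   # as close as the positive: still penalised

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy negative produces zero loss and therefore no learning signal, which is why hard negatives matter for this kind of objective.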

[71] Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective

Jiahao Li,Yang Lu,Yachao Zhang,Yong Xie,Fangyong Wang,Yuan Xie,Yanyun Qu

Main category: cs.CV

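TL;DR: Proposes ReFocusing CLIP (RF-CLIP), a training-free method that analyzes CLIP's attention dispersion in dense prediction and emulates human "distraction-refocusing" behavior to improve pixel-level alignment in open-vocabulary semantic segmentation, achieving state-of-the-art performance on eight benchmarks.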

Details

Motivation: Existing CLIP-based open-vocabulary semantic segmentation methods rarely examine, from an explainability perspective, what limits CLIP in dense prediction, in particular the limits of pixel-level multimodal alignment; a closer analysis of CLIP's internal mechanisms is needed to push past its performance boundary. Method: A systematic analysis of CLIP's internal attention shows that dimension-specific over-activation causes attention to be diverted to irrelevant tokens; RF-CLIP filters out these distraction tokens and refocuses on target regions, strengthening pixel-level vision-language alignment without additional training. Result: RF-CLIP reaches state-of-the-art results on eight open-vocabulary semantic segmentation benchmarks while maintaining high inference efficiency. Conclusion: By emulating human attention refocusing, RF-CLIP improves CLIP's performance in dense prediction, shows the importance of attention filtering for multimodal alignment, and suggests a new direction for training-free optimization of vision-language models. Abstract: Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP's vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from the perspective of interpretability mechanisms. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose ReFocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.

[72] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

Yi Yang,Xueqi Li,Yiyang Chen,Jin Song,Yihan Wang,Zipeng Xiao,Jiadi Su,You Qiaoben,Pengfei Liu,Zhijie Deng

Main category: cs.CV

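TL;DR: Mantis is a new vision-language-action (VLA) framework built around Disentangled Visual Foresight (DVF), which uses meta queries and a diffusion Transformer head to relieve the backbone and improve action learning and language understanding; it reaches a 96.7% success rate on the LIBERO benchmark, outperforming existing methods.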

Details

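Motivation: Existing VLA models spend too much model capacity on predicting high-dimensional visual states, which makes training expensive, and they comprehend and reason poorly because language supervision is neglected; compressing the visual signal instead creates an information bottleneck. A more efficient framework that retains language supervision is needed. Method: Mantis adopts a Disentangled Visual Foresight (DVF) design that separates visual foresight prediction from the backbone by introducing meta queries and a diffusion Transformer (DiT) head. The current visual state is fed to the DiT through a residual connection, and a simple next-state prediction objective lets the meta queries automatically capture latent actions, which strengthens explicit action learning while language supervision preserves comprehension and reasoning. Result: After fine-tuning, Mantis reaches a 96.7% success rate on the LIBERO benchmark, surpassing strong baselines and converging faster; in real-world settings it outperforms $π_{0.5}$, particularly in instruction following, generalization to unseen instructions, and reasoning. Conclusion: Through disentangled visual foresight, Mantis balances visual prediction efficiency against language supervision, improving action learning, generalization, and reasoning in VLA models while combining high performance with fast convergence, and it contributes to open-source robot models. Abstract: Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervision. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on the LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed.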
Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.

[73] Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification

Nianchang Huang,Yi Xu,Ruida Xi,Ruida Xi,Qiang Zhang

Main category: cs.CV

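TL;DR: Studies unsupervised domain adaptation for visible-infrared person re-identification (UDA-VI-ReID), designs a two-stage model, DSLGA, to handle cross-domain and cross-modality discrepancies, and validates its performance with a newly constructed testing method, CMDA-XD.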

Details

Motivation: Because public datasets differ from real-world data, existing VI-ReID algorithms perform poorly in practice, so an unsupervised domain adaptation method is needed that transfers knowledge from the source domain to the target domain without annotating new samples. Method: Proposes the two-stage model DSLGA. The first stage uses a Domain-Shared Learning Strategy (DSLS) to mitigate inter-domain modality discrepancies; the second stage uses a Gradual Alignment Strategy (GAS) that handles intra-domain modality discrepancies through cluster-to-holistic alignment. A new testing method, CMDA-XD, is also constructed. Result: Extensive experiments show that the method significantly outperforms existing domain adaptation methods under various settings and even surpasses some supervised methods. Conclusion: DSLGA effectively addresses inter-domain and intra-domain modality discrepancies in UDA-VI-ReID and improves generalization to real-world scenarios, giving it strong practical value. Abstract: Recently, Visible-Infrared person Re-Identification (VI-ReID) has achieved remarkable performance on public datasets. However, due to the discrepancies between public datasets and real-world data, most existing VI-ReID algorithms struggle in real-life applications. To address this, we take the initiative to investigate Unsupervised Domain Adaptation Visible-Infrared person Re-Identification (UDA-VI-ReID), aiming to transfer the knowledge learned from the public data to real-world data without compromising accuracy or requiring the annotation of new samples. Specifically, we first analyze two basic challenges in UDA-VI-ReID, i.e., inter-domain modality discrepancies and intra-domain modality discrepancies. Then, we design a novel two-stage model, i.e., Domain-Shared Learning and Gradual Alignment (DSLGA), to handle these discrepancies. In the first pre-training stage, DSLGA introduces a Domain-Shared Learning Strategy (DSLS) to mitigate ineffective pre-training caused by inter-domain modality discrepancies via exploiting shared information between the source and target domains. In the second fine-tuning stage, DSLGA designs a Gradual Alignment Strategy (GAS) to handle the cross-modality alignment challenges between visible and infrared data caused by the large intra-domain modality discrepancies through cluster-to-holistic alignment. Finally, a new UDA-VI-ReID testing method, i.e., CMDA-XD, is constructed for training and testing different UDA-VI-ReID models.
Extensive experiments demonstrate that our method significantly outperforms existing domain adaptation methods for VI-ReID and even some supervised methods under various settings.

[74] PrIntMesh: Precise Intersection Surfaces for 3D Organ Mesh Reconstruction

Deniz Sayin Mercadier,Hieu Le,Yihong Chen,Jiancheng Yang,Udaranga Wickramasinghe,Pascal Fua

Main category: cs.CV

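TL;DR: PrIntMesh is a template-based, topology-preserving framework that reconstructs organ substructures as a whole, achieving high-accuracy geometric reconstruction and structural consistency.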

Details

Motivation: Existing deep-learning methods usually treat the substructures of an organ independently, which produces anatomically implausible reconstructions. Method: Proposes PrIntMesh, which starts from a connected template and jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and producing smooth, artifact-free surfaces. Result: Applied to the heart, hippocampus, and lungs, it achieves high geometric accuracy and correct topology and remains robust when training data is limited or noisy. Conclusion: PrIntMesh outperforms voxel-based and surface-based methods, reconstructs shared interfaces better, maintains structural consistency, and is data-efficient, making it suitable for clinical use. Abstract: Human organs are composed of interconnected substructures whose geometry and spatial relationships constrain one another. Yet, most deep-learning approaches treat these parts independently, producing anatomically implausible reconstructions. We introduce PrIntMesh, a template-based, topology-preserving framework that reconstructs organs as unified systems. Starting from a connected template, PrIntMesh jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and enforcing smooth, artifact-free surfaces. We demonstrate its effectiveness on the heart, hippocampus, and lungs, achieving high geometric accuracy, correct topology, and robust performance even with limited or noisy training data. Compared to voxel- and surface-based methods, PrIntMesh better reconstructs shared interfaces, maintains structural consistency, and provides a data-efficient solution suitable for clinical use.

[75] When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

Yuping Yan,Yuhan Xie,Yinxin Zhang,Lingjuan Lyu,Yaochu Jin

Main category: cs.CV

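TL;DR: Introduces VLA-Fool, the first systematic study of the multimodal adversarial robustness of embodied vision-language-action models under white-box and black-box settings; it covers textual, visual, and cross-modal misalignment attacks and proposes a semantically guided prompting framework, and experiments show that existing VLA models are severely vulnerable to multimodal perturbations.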

Details

Motivation: Although vision-language-action models have advanced embodied intelligence, their adversarial robustness under realistic multimodal and black-box conditions remains underexplored, especially the effect of cross-modal misalignment on decision-making. Method: Proposes VLA-Fool, which unifies three levels of multimodal adversarial attack: textual perturbations through gradient-based and prompt-based manipulation, visual perturbations via patch and noise distortions, and cross-modal misalignment attacks that break the semantic correspondence between perception and instruction; it also designs the first automatically crafted, semantically guided VLA-aware prompting framework. Result: Experiments on the LIBERO benchmark with a fine-tuned OpenVLA model show that even minor multimodal perturbations cause significant behavioral deviations, exposing the fragility of multimodal alignment in current VLA models. Conclusion: VLA-Fool reveals serious weaknesses of current embodied VLA models under multimodal adversarial attack, underscores the importance of improving their adversarial robustness and cross-modal alignment, and points to research directions for safer and more reliable embodied systems. Abstract: Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.

[76] Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles

Melih Baydar,Emre Akbas

Main category: cs.CV

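TL;DR: Proposes ICCE, an unsupervised image classification method that uses multi-head clustering, adaptive nearest neighbor selection, and cluster ensembling to generate diverse clusterings on a frozen backbone and fuse them into a consensus clustering for training a classifier; it achieves the best results on multiple benchmarks and is the first to exceed 70% accuracy on ImageNet.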

Details

Motivation: Existing unsupervised image clustering methods often skip representation learning and cluster directly, but they struggle to produce high-quality clusterings. This work aims to combine the strengths of modern foundation models to improve unsupervised image classification, especially on large-scale datasets such as ImageNet. Method: Proposes ICCE, which first trains multiple clustering heads on a frozen backbone to generate diverse clusterings; then applies adaptive nearest neighbor selection and cluster ensembling to fuse them into a unified consensus clustering; and finally trains an image classifier with the consensus clustering as pseudo-labels. Result: ICCE achieves state-of-the-art performance on ten image classification benchmarks, including CIFAR10 (99.3%), CIFAR100 (89%), and ImageNet (70.4%), and is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet. Conclusion: Through cluster ensembling, ICCE improves unsupervised image classification and narrows the gap with supervised methods, showing the potential for high-performance image classification without labeled data. Abstract: Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models has recently shifted focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, "Image Clustering through Cluster Ensembles" (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, achieving 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet datasets, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
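
One common way to consolidate several conflicting clusterings is a co-association (evidence accumulation) matrix. The Python sketch below uses it purely as a stand-in: the abstract does not say which ensembling technique ICCE uses, and the 0.5 agreement threshold and the toy head outputs are assumptions.

```python
def co_association(labelings):
    """Fraction of clusterings in which each pair of samples shares a cluster."""
    n = len(labelings[0])
    m = len(labelings)
    return [
        [sum(lab[i] == lab[j] for lab in labelings) / m for j in range(n)]
        for i in range(n)
    ]


def consensus(labelings, threshold=0.5):
    """Link samples that co-occur in more than `threshold` of the clusterings
    and return the connected components as consensus labels."""
    co = co_association(labelings)
    n = len(co)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and co[u][v] > threshold:
                    labels[v] = next_label
                    stack.append(v)
        next_label += 1
    return labels


# Three hypothetical clustering heads over four images. The second head uses
# swapped cluster ids, and the third disagrees about image 2.
heads = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
print(consensus(heads))
```

Because the matrix only records whether two samples share a cluster, it is unaffected by each head numbering its clusters differently. The resulting labels would then serve as pseudo-labels for a classifier.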

Boyue Xu,Ruichao Hou,Tongwei Ren,Dongming Zhou,Gangshan Wu,Jinde Cao

Main category: cs.CV

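TL;DR: Proposes SwiTrack, a novel state-switching framework for cross-modal object tracking (CMOT) that handles RGB, NIR, and invalid modalities with three dedicated branches, improving the robustness of cross-modal representations and achieving state-of-the-art performance on the latest benchmarks.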

Details

Motivation: Existing cross-modal tracking methods fail to fully extract modality-specific features and are prone to target drift when inputs are unreliable, so a more robust framework is needed to improve tracking consistency and accuracy. Method: Proposes the SwiTrack framework with three branches: RGB frames are processed by the visual encoder, NIR frames are calibrated by an encoder with a gated adapter, and invalid modalities are handled by predicting the trajectory from spatio-temporal cues; dynamic template reconstruction and a similarity alignment loss are added to strengthen feature consistency. Result: On the latest benchmarks, SwiTrack improves precision rate and success rate by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 FPS. Conclusion: Through its state-switching mechanism and multi-branch design, SwiTrack addresses insufficient feature extraction and target drift in cross-modal tracking, significantly improving performance while supporting real-time use. Abstract: Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities, with only one modality available in each frame, mostly focusing on RGB-Near Infrared (RGB-NIR) tracking. Existing methods typically connect parallel RGB and NIR branches to a shared backbone, which limits the comprehensive extraction of distinctive modality-specific features and fails to address the issue of object drift, especially in the presence of unreliable inputs. In this paper, we propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams. Specifically, RGB frames are processed by the visual encoder, while NIR frames undergo refinement via a NIR gated adapter coupled with the visual encoder to progressively calibrate shared latent space features, thereby yielding more robust cross-modal representations. For invalid modalities, a consistency trajectory prediction module leverages spatio-temporal cues to estimate target movement, ensuring robust tracking and mitigating drift. Additionally, we incorporate dynamic template reconstruction to iteratively update template features and employ a similarity alignment loss to reinforce feature consistency. Experimental results on the latest benchmarks demonstrate that our tracker achieves state-of-the-art performance, boosting precision rate and success rate by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 frames per second. Code and models are available at https://github.com/xuboyue1999/SwiTrack.git.

[78] Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs

Sinan Mutlu,Georgios F. Angelis,Savas Ozkan,Paul Wisbey,Anastasios Drosou,Mete Ozay

Main category: cs.CV

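TL;DR: Proposes a full-body motion tracking method based on a multi-layer perceptron (MLP) with residual connections and a novel Memory-Block component, which stands in for missing sensor data and uses past signals to improve temporal consistency, achieving high frame rates and high accuracy on mobile HMDs.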

Details

Motivation: Existing AR/VR systems mainly track head and hand motion through headsets and controllers, which makes complete and smooth full-body tracking difficult; high-quality full-body motion has to be generated from sparse sensor inputs. Method: Uses an MLP backbone enhanced with residual connections and a Memory-Block module that represents missing sensor data with trainable code vectors and combines them with sparse signals from previous time steps; the problem is formulated as multi-task learning to improve robustness and temporal consistency. Result: Experiments show substantially lower prediction error than state-of-the-art baselines and 72 FPS on mobile HMDs, improving the trade-off between accuracy and running time. Conclusion: The method greatly improves full-body reconstruction accuracy while keeping inference fast, making it suitable for real-time full-body tracking in mobile AR/VR. Abstract: Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track head and hands via Head Mounted Devices (HMDs) and controllers, making the 3D full-body reconstruction incomplete. One potential approach is to generate the full-body motions from sparse inputs collected from limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN-component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve the temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP-backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs, which ultimately improves the accuracy-running time tradeoff.

[79] TetraSDF: Precise Mesh Extraction with Multi-resolution Tetrahedral Grid

Seonghun Oh,Youngjung Uh,Jin-Hwa Kim

Main category: cs.CV

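TL;DR: TetraSDF is a precise analytic meshing framework for neural signed distance functions (SDFs) represented by a ReLU MLP composed with a multi-resolution tetrahedral positional encoder; it produces high-fidelity, self-consistent meshes while remaining computationally efficient.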

Details

Motivation: Existing sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic methods apply only to plain ReLU MLPs and cannot handle more complex SDF representations. Method: Proposes TetraSDF, which relies on the barycentric interpolation of the tetrahedral positional encoder to preserve global CPWA structure, tracks ReLU linear regions within the polyhedral complex induced by the encoder, and adds a fixed analytic input preconditioner to reduce directional bias and stabilize training. Result: Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that stay faithful to the learned isosurfaces, with practical runtime and memory use. Conclusion: TetraSDF extends CPWA analytic meshing to neural SDFs with multi-resolution tetrahedral encodings, achieving high-accuracy mesh extraction free of discretization error together with efficiency and stability. Abstract: Extracting meshes that exactly match the zero-level set of neural signed distance functions (SDFs) remains challenging. Sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic approaches apply only to plain ReLU MLPs. We present TetraSDF, a precise analytic meshing framework for SDFs represented by a ReLU MLP composed with a multi-resolution tetrahedral positional encoder. The encoder's barycentric interpolation preserves global CPWA structure, enabling us to track ReLU linear regions within an encoder-induced polyhedral complex. A fixed analytic input preconditioner derived from the encoder's metric further reduces directional bias and stabilizes training. Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that remain faithful to the learned isosurfaces, all with practical runtime and memory efficiency.
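
The claim that barycentric interpolation keeps the encoding piecewise affine can be checked on a single tetrahedron. The Python sketch below is only an illustration, not the TetraSDF encoder: the unit tetrahedron, the scalar vertex values, and the helper names are assumptions.

```python
def det3(a, b, c):
    """Determinant of the 3x3 matrix whose columns are a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - b[0] * (a[1] * c[2] - a[2] * c[1])
            + c[0] * (a[1] * b[2] - a[2] * b[1]))


def barycentric(p, verts):
    """Barycentric weights of point p in the tetrahedron `verts` (4 vertices),
    obtained by solving a 3x3 linear system with Cramer's rule."""
    v0, v1, v2, v3 = verts
    e1 = [v1[k] - v0[k] for k in range(3)]
    e2 = [v2[k] - v0[k] for k in range(3)]
    e3 = [v3[k] - v0[k] for k in range(3)]
    r = [p[k] - v0[k] for k in range(3)]
    d = det3(e1, e2, e3)
    w1 = det3(r, e2, e3) / d
    w2 = det3(e1, r, e3) / d
    w3 = det3(e1, e2, r) / d
    return [1.0 - w1 - w2 - w3, w1, w2, w3]


def interpolate(p, verts, values):
    """Affine interpolation of per-vertex values at p."""
    return sum(w * f for w, f in zip(barycentric(p, verts), values))


# Unit tetrahedron with one hypothetical scalar feature per vertex.
tet = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
feat = [0.0, 1.0, 2.0, 3.0]

a = (0.1, 0.2, 0.3)
b = (0.3, 0.2, 0.1)
mid = tuple((x + y) / 2 for x, y in zip(a, b))

print(interpolate((0.25, 0.25, 0.25), tet, feat))
# Affinity check: the value at the midpoint equals the mean of the endpoint values.
print(interpolate(mid, tet, feat),
      (interpolate(a, tet, feat) + interpolate(b, tet, feat)) / 2)
```

Within one tetrahedron the interpolated feature is an affine function of position, so composing it with a ReLU MLP stays piecewise affine, which is the property the analytic extractor relies on.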

[80] Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM

Gergely Dinya,Péter Halász,András Lőrincz,Kristóf Karacs,Anna Gelencsér-Horváth

Main category: cs.CV

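TL;DR: Proposes a fast spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT), suited to real-time applications such as assistive navigation.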

Details

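Motivation: To achieve efficient, close to real-time scene understanding for applications such as assistive navigation, the high memory consumption of VGGT and the need for continuous 3D scene updates must be overcome. Method: The image stream is processed with a sliding window and the resulting submaps are aligned to reduce memory requirements; the VGGT tracking head aggregates 2D semantic instance masks into 3D objects, and stored timestamps and instance-level identities provide temporal consistency and enable detection of changes in the environment. Result: Evaluation on well-known benchmarks and custom datasets shows that the method achieves continuous 3D scene updates with good spatio-temporal consistency and contextual reasoning, and is applicable to real-world scenarios. Conclusion: The proposed VGGT-based framework strikes a good balance between efficiency and performance and can support practical applications, such as assistive navigation, that require real-time perception and understanding. Abstract: We present a fast, spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT). The proposed pipeline is designed to enable efficient, close to real-time performance, supporting applications including assistive navigation. To achieve continuous updates of the 3D scene representation, we process the image flow with a sliding window, aligning submaps, thereby overcoming VGGT's high memory demands. We exploit the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects. To allow for temporal consistency and richer contextual reasoning, the system stores timestamps and instance-level identities, thereby enabling the detection of changes in the environment. We evaluate the approach on well-known benchmarks and custom datasets specifically designed for assistive navigation scenarios. The results demonstrate the applicability of the framework to real-world scenarios.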
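
The sliding-window processing described above can be sketched as a simple frame scheduler in Python. The window size, the overlap, and the assumption that consecutive windows share frames (so that adjacent submaps can be aligned on them) are illustrative choices, not details taken from the paper.

```python
def sliding_windows(n_frames, window, overlap):
    """Split frame indices 0..n_frames-1 into windows of at most `window`
    frames, where consecutive windows share `overlap` frames."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if n_frames <= 0:
        return []
    step = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, n_frames)
        windows.append(list(range(start, end)))
        if end == n_frames:
            break
        start += step
    return windows


# Ten frames, four per window, one shared frame between neighbours.
for frames in sliding_windows(10, window=4, overlap=1):
    print(frames)
```

Only one window of frames has to be held by the model at a time, which is how a sliding window bounds memory; the shared frames give neighbouring submaps common observations to align on.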

[81] Explainable AI for Diabetic Retinopathy Detection Using Deep Learning with Attention Mechanisms and Fuzzy Logic-Based Interpretability

Abishek Karthik,Pandiyaraju V,Sreya Mynampati

Main category: cs.CV

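TL;DR: Proposes a hybrid deep learning framework combining CNNs, ViTs, and GNNs for high-accuracy weed detection under complex field conditions, with GAN-based augmentation and self-supervised pre-training, reaching 99.33% accuracy on multi-benchmark datasets.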

Details

Motivation: Precision agriculture needs efficient and sustainable weed identification so that herbicides can be applied selectively, reducing environmental impact and improving crop management. Method: Proposes a hybrid framework that fuses Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs), uses GAN-based data augmentation to balance class distributions, and applies self-supervised contrastive pre-training to improve generalization when annotated data is limited. Result: Achieves 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets; the model provides local, global, and relational feature representations, offers high interpretability and adaptability, and supports real-time deployment on edge devices. Conclusion: The framework is an efficient, scalable, and sustainable solution for automated weed detection that helps reduce herbicide overuse and advances precision agriculture. Abstract: The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method was applied to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experiments yield 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment on edge devices for automated weed detection, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.

[82] Optimizing 3D Gaussian Splattering for Mobile GPUs

Md Musfiqur Rahman Sanim,Zhihao Shu,Bahram Afsharmanesh,AmirAli Mirian,Jiexiong Guan,Wei Niu,Bin Ren,Gagan Agrawal

Main category: cs.CV

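TL;DR: Proposes Texture3dgs, a 3D Gaussian Splatting (3DGS) method optimized for mobile GPUs that improves the sorting algorithm and memory layout to significantly speed up 3D scene reconstruction on mobile devices.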

Details

Motivation: Efficient, privacy-preserving 3D scene reconstruction that works without network connectivity on mobile devices requires adapting 3DGS to mobile GPUs. Method: Designs a novel sorting algorithm optimized for the 2D texture cache and optimizes variable layout and other 3DGS steps to suit the memory characteristics of mobile GPUs. Result: End-to-end evaluation shows up to 4.1x faster sorting, up to 1.7x faster overall reconstruction, and up to 1.6x lower memory usage. Conclusion: Texture3dgs improves the efficiency of 3D scene reconstruction on mobile devices and shows strong potential for mobile applications. Abstract: Image-based 3D scene reconstruction, which transforms multi-view images into a structured 3D representation of the surrounding environment, is a common task across many modern applications. 3D Gaussian Splatting (3DGS) is a new paradigm to address this problem and offers considerable efficiency as compared to the previous methods. Motivated by this, and considering various benefits of mobile device deployment (data privacy, operating without internet connectivity, and potentially faster responses), this paper develops Texture3dgs, an optimized mapping of 3DGS for a mobile GPU. A critical challenge in this area turns out to be optimizing for the two-dimensional (2D) texture cache, which needs to be exploited for faster executions on mobile GPUs. As a sorting method dominates the computations in 3DGS on mobile platforms, the core of Texture3dgs is a novel sorting algorithm where the processing, data movement, and placement are highly optimized for 2D memory. The properties of this algorithm are analyzed in view of a cost model for the texture cache. In addition, we accelerate other steps of the 3DGS algorithm through improved variable layout design and other optimizations. End-to-end evaluation shows that Texture3dgs delivers up to 4.1$\times$ and 1.7$\times$ speedup for the sorting and overall 3D scene reconstruction, respectively -- while also reducing memory usage by up to 1.6$\times$ -- demonstrating the effectiveness of our design for efficient mobile 3D scene reconstruction.

[83] Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

Minseok Seo,Mark Hamilton,Changick Kim

Main category: cs.CV

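TL;DR: Proposes Upsample Anything, a lightweight test-time optimization framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training.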

Details

Motivation: Existing feature upsampling methods depend on dataset-specific retraining or heavy implicit optimization, which limits scalability and generalization, while Vision Foundation Model features are typically heavily downsampled and hard to use directly for pixel-level tasks. Method: A simple per-image optimization learns an anisotropic Gaussian kernel that combines spatial and range cues, bridging Gaussian Splatting and Joint Bilateral Upsampling to provide general, edge-aware upsampling. Result: The method achieves state-of-the-art performance on semantic segmentation, depth estimation, and depth and probability map upsampling, taking only about 0.419 s per 224x224 image. Conclusion: Upsample Anything is an efficient, general, training-free upsampling method that transfers seamlessly across architectures and modalities and markedly improves the usability of Vision Foundation Models for pixel-level tasks. Abstract: We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\approx 0.419\,\text{s}$ per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
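
For readers unfamiliar with Joint Bilateral Upsampling, the following is a minimal NumPy sketch of the classical, fixed-kernel version that the abstract builds on. It is not the paper's method: Upsample Anything learns an anisotropic kernel per image, whereas this sketch uses an isotropic Gaussian with hand-picked, hypothetical parameters (`radius`, `sigma_s`, `sigma_r`), and the function name is illustrative only.

```python
import numpy as np

def joint_bilateral_upsample(feat_lr, guide_hr, radius=1, sigma_s=1.0, sigma_r=0.1):
    """Upsample feat_lr (h, w, c) to the size of guide_hr (H, W, 3).

    Each output pixel is a weighted average of nearby low-res features.
    Weights combine a spatial Gaussian (distance in low-res coordinates)
    and a range Gaussian (colour difference in the high-res guide).
    """
    h, w, c = feat_lr.shape
    H, W, _ = guide_hr.shape
    sy, sx = H / h, W / w
    # Guide colours sampled at the centre of each low-res cell.
    ys = np.minimum(((np.arange(h) + 0.5) * sy).astype(int), H - 1)
    xs = np.minimum(((np.arange(w) + 0.5) * sx).astype(int), W - 1)
    guide_lr = guide_hr[ys][:, xs]
    out = np.zeros((H, W, c))
    for Y in range(H):
        for X in range(W):
            # Position of the high-res pixel in low-res coordinates.
            cy, cx = (Y + 0.5) / sy - 0.5, (X + 0.5) / sx - 0.5
            y0, x0 = int(round(cy)), int(round(cx))
            acc, wsum = np.zeros(c), 0.0
            for j in range(y0 - radius, y0 + radius + 1):
                for i in range(x0 - radius, x0 + radius + 1):
                    if 0 <= j < h and 0 <= i < w:
                        ds = (j - cy) ** 2 + (i - cx) ** 2
                        dr = np.sum((guide_hr[Y, X] - guide_lr[j, i]) ** 2)
                        wgt = np.exp(-ds / (2 * sigma_s**2) - dr / (2 * sigma_r**2))
                        acc += wgt * feat_lr[j, i]
                        wsum += wgt
            out[Y, X] = acc / wsum
    return out
```

The range term is what makes the operator edge-aware: low-res neighbours whose guide colour differs strongly from the target pixel receive almost no weight, so features do not bleed across image edges.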

[84] Sparse Autoencoders are Topic Models

Leander Girrbach,Zeynep Akata

Main category: cs.CV

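TL;DR: This paper proposes a new perspective that treats sparse autoencoders (SAEs) as topic models: it extends Latent Dirichlet Allocation (LDA) to embedding spaces, derives the SAE objective from that model, and introduces the SAE-TM framework for large-scale thematic analysis across modalities.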

Details

Motivation: The role and practical value of sparse autoencoders in embedding analysis are debated; the authors aim to clarify how SAEs should be interpreted and to improve their effectiveness for topic modeling. Method: LDA is extended to embedding spaces and the SAE objective is derived as a maximum a posteriori estimator; the proposed SAE-TM framework learns reusable topic atoms, interprets them as word distributions, and merges them into any number of topics without retraining. Result: On text and image datasets, SAE-TM produces more coherent topics than strong baselines while maintaining diversity, and it is applied to analyze thematic structure in image datasets and topic changes over time in Japanese ukiyo-e woodblock prints. Conclusion: SAEs can serve as effective tools for large-scale thematic analysis across modalities, and SAE-TM offers a flexible and efficient approach to topic modeling. Abstract: Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.

[85] BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks

Samuel Stevens

Main category: cs.CV

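TL;DR: BioBench is a new ecology vision benchmark that addresses the failure of ImageNet accuracy to predict performance on scientific imagery; it covers 9 application-driven tasks, 4 taxonomic kingdoms, and 6 image modalities, providing a more reliable evaluation standard for AI-for-science.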

Details

Motivation: ImageNet-1K linear-probe transfer accuracy no longer reliably predicts how modern vision models perform on ecological imagery and is not representative of scientific images. Method: The authors build BioBench, an open ecology vision benchmark with 3.1 million images that unifies 9 public tasks, 4 taxonomic kingdoms, and 6 acquisition modalities; evaluation freezes the backbone, trains lightweight classifiers, and reports metrics such as class-balanced macro-F1. Result: Across 46 modern vision models, ImageNet top-1 accuracy explains only 34% of the variance on ecology tasks and mis-ranks 30% of models above 75% accuracy; ViT-L models can be evaluated within 6 hours on an A6000 GPU. Conclusion: BioBench provides new evaluation signal for computer vision in ecology and a template for building reliable AI benchmarks in other scientific domains. Abstract: ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
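
As a rough illustration of why a class-balanced metric matters for long-tailed ecology data, here is a minimal sketch of macro-F1 in NumPy. This is a generic textbook definition, not BioBench's API; the function name `macro_f1` and the convention of scoring an absent class as 0 are assumptions made for this example.

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no true or predicted samples scores 0.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# A classifier that always predicts the majority class:
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
accuracy = float(np.mean(np.array(y_true) == np.array(y_pred)))
score = macro_f1(y_true, y_pred, num_classes=2)
```

In this toy case accuracy is 0.8 while macro-F1 is 4/9 (about 0.44), because the missed minority class contributes an F1 of 0 to the average.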

[86] NaTex: Seamless Texture Generation as Latent Color Diffusion

Zeqiang Lai,Yunfei Zhao,Zibo Zhao,Xin Yang,Xin Huang,Jingwei Huang,Xiangyu Yue,Chunchao Guo

Main category: cs.CV

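TL;DR: NaTex is a native texture generation framework that predicts texture color directly in 3D space, avoiding the limitations of conventional multi-view diffusion models in occlusion handling, cross-view consistency, and mesh-texture alignment.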

Details

Motivation: Conventional texture generation based on 2D multi-view diffusion models requires inpainting of occluded regions, struggles to align texture with geometry, and suffers from cross-view inconsistency, all of which limit generation quality and practical results. Method: Texture is treated as a dense color point cloud, and a latent color diffusion model is designed that comprises a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT); native geometry control achieves precise alignment through positional embeddings and geometry latents. Result: NaTex significantly outperforms existing methods in texture coherence and alignment accuracy and generalizes well to material generation, texture refinement, and part segmentation and texturing. Conclusion: With a new 3D-native texture generation paradigm, NaTex addresses key shortcomings of conventional MVD methods and offers a new direction for high-quality, highly consistent texture generation. Abstract: We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE-DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment.
Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.

[87] WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement

Ching-Heng Cheng,Jen-Wei Lee,Chia-Ming Lee,Chih-Chung Hsu

Main category: cs.CV

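TL;DR: Proposes WWE-UIE, a compact and efficient underwater image enhancement network that combines three interpretable priors and achieves competitive performance with fewer parameters and less computation.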

Details

Motivation: Existing hybrid methods perform well but are computationally expensive, which makes real-time use difficult; a more efficient method is needed. Method: A lightweight network is designed around interpretable priors: adaptive white balance, a wavelet-based enhancement block (WEB), and a gradient-aware module (SGFB). Result: The network performs well on benchmark datasets with substantially fewer parameters and FLOPs and supports real-time inference on resource-limited platforms. Conclusion: WWE-UIE greatly improves efficiency while maintaining high-quality restoration, making it suitable for practical use. Abstract: Underwater Image Enhancement (UIE) aims to restore visibility and correct color distortions caused by wavelength-dependent absorption and scattering. Recent hybrid approaches, which couple domain priors with modern deep neural architectures, have achieved strong performance but incur high computational cost, limiting their practicality in real-time scenarios. In this work, we propose WWE-UIE, a compact and efficient enhancement network that integrates three interpretable priors. First, adaptive white balance alleviates the strong wavelength-dependent color attenuation, particularly the dominance of blue-green tones. Second, a wavelet-based enhancement block (WEB) performs multi-band decomposition, enabling the network to capture both global structures and fine textures, which are critical for underwater restoration. Third, a gradient-aware module (SGFB) leverages Sobel operators with learnable gating to explicitly preserve edge structures degraded by scattering. Extensive experiments on benchmark datasets demonstrate that WWE-UIE achieves competitive restoration quality with substantially fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms. Ablation studies and visualizations further validate the contribution of each component. The source code is available at https://github.com/chingheng0808/WWE-UIE.
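
The gradient-aware module is described as building on Sobel operators. As a minimal sketch of that ingredient only, the snippet below computes a Sobel gradient magnitude for a single-channel image in NumPy. The learnable gating from the paper is not reproduced, and the helper names (`filter3x3`, `sobel_magnitude`) and the edge-replicating padding are assumptions for illustration.

```python
import numpy as np

SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Cross-correlate a 2D image with a 3x3 kernel, replicating border pixels."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def sobel_magnitude(img):
    """Per-pixel gradient magnitude from horizontal and vertical Sobel responses."""
    gx = filter3x3(img, SOBEL_X)
    gy = filter3x3(img, SOBEL_Y)
    return np.sqrt(gx**2 + gy**2)

# A vertical step edge between columns 3 and 4 of an 8x8 image.
step = np.zeros((8, 8))
step[:, 4:] = 1.0
edges = sobel_magnitude(step)
```

The response is non-zero only in the two columns adjacent to the step, which is the kind of explicit edge signal a network can use to preserve structures blurred by scattering.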

[88] ChangeDINO: DINOv3-Driven Building Change Detection in Optical Remote Sensing Imagery

Ching-Heng Cheng,Chih-Chung Hsu

Main category: cs.CV

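TL;DR: This paper proposes ChangeDINO, an end-to-end multiscale Siamese framework for building change detection in optical remote sensing imagery. It fuses a lightweight backbone with frozen DINOv3 features and combines a spatial-spectral differential transformer decoder with a learnable morphology module, outperforming existing methods on several public datasets.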

Details

Motivation: Existing deep learning methods for remote sensing change detection rely mainly on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. Method: ChangeDINO uses a multiscale Siamese structure that fuses a lightweight backbone stream with features transferred from a frozen DINOv3 to build semantic- and context-rich feature pyramids; a spatial-spectral differential transformer decoder uses multi-scale absolute differences as change priors; and a learnable morphology module refines the upsampled logits to recover clean boundaries. Result: Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent state-of-the-art methods in IoU and F1, and ablation studies confirm the effectiveness of each component. Conclusion: By effectively fusing multiscale semantic features with change priors, ChangeDINO shows strong building change detection ability with limited annotations and improves robustness and accuracy under challenging conditions. Abstract: Remote sensing change detection (RSCD) aims to identify surface changes from co-registered bi-temporal images. However, many deep learning-based RSCD methods rely solely on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. This article introduces ChangeDINO, an end-to-end multiscale Siamese framework for optical building change detection. The model fuses a lightweight backbone stream with features transferred from a frozen DINOv3, yielding semantic- and context-rich pyramids even on small datasets. A spatial-spectral differential transformer decoder then exploits multi-scale absolute differences as change priors to highlight true building changes and suppress irrelevant responses. Finally, a learnable morphology module refines the upsampled logits to recover clean boundaries. Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent state-of-the-art methods in IoU and F1, and ablation studies confirm the effectiveness of each component. The source code is available at https://github.com/chingheng0808/ChangeDINO.

[89] Arbitrary-Resolution and Arbitrary-Scale Face Super-Resolution with Implicit Representation Networks

Yi Ting Tsai,Yu Wei Chen,Hong-Han Shuai,Ching-Chun Huang

Main category: cs.CV

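TL;DR: This paper proposes ARASFSR, an arbitrary-resolution and arbitrary-scale face super-resolution method that uses implicit representation networks to overcome the fixed up-sampling scales and sensitivity to input size variations of existing methods.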

Details

Motivation: Existing FSR methods are limited to fixed up-sampling scales and are sensitive to input size variations, so they adapt poorly to the diverse resolution requirements of practical applications. Method: RGB values of target pixels are predicted from 2D deep features, local relative coordinates, and up-sampling scale ratios; a local frequency estimation module captures high-frequency texture information; and a global coordinate modulation module leverages prior facial structure knowledge and enables resolution adaptation. Result: Across various input sizes and up-sampling scales, ARASFSR is more robust than existing state-of-the-art methods in both quantitative and qualitative evaluations. Conclusion: ARASFSR effectively supports arbitrary-resolution inputs and arbitrary up-sampling factors, offering greater flexibility and practicality for face super-resolution. Abstract: Face super-resolution (FSR) is a critical technique for enhancing low-resolution facial images and has significant implications for face-related tasks. However, existing FSR methods are limited by fixed up-sampling scales and sensitivity to input size variations. To address these limitations, this paper introduces an Arbitrary-Resolution and Arbitrary-Scale FSR method with implicit representation networks (ARASFSR), featuring three novel designs. First, ARASFSR employs 2D deep features, local relative coordinates, and up-sampling scale ratios to predict RGB values for each target pixel, allowing super-resolution at any up-sampling scale. Second, a local frequency estimation module captures high-frequency facial texture information to reduce the spectral bias effect. Lastly, a global coordinate modulation module guides FSR to leverage prior facial structure knowledge and achieve resolution adaptation effectively. Quantitative and qualitative evaluations demonstrate the robustness of ARASFSR over existing state-of-the-art methods while super-resolving facial images across various input sizes and up-sampling scales.

[90] Aerial View River Landform Video segmentation: A Weakly Supervised Context-aware Temporal Consistency Distillation Approach

Chi-Han Chen,Chieh-Ming Chen,Wen-Huang Cheng,Ching-Chun Huang

Main category: cs.CV

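TL;DR: Proposes a weakly supervised learning method based on a teacher-student architecture with key frame selection and updating algorithms for terrain classification in UAV remote sensing; it significantly improves both mIoU and temporal consistency while using only 30% of the labeled data.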

Details

Motivation: To address the complexity of data annotation, the scarcity of labeled data, and the poor results of conventional temporal consistency training in terrain classification from UAV remote sensing. Method: A teacher-student architecture with key frame selection and key frame updating algorithms performs weakly supervised learning and temporal consistency knowledge distillation. Result: Using only 30% of the labeled data, the method improves both mIoU and temporal consistency and achieves stable localization of terrain objects. Conclusion: The proposed method overcomes the limitations of conventional approaches in aerial tasks and provides an efficient, stable weakly supervised solution for terrain classification in UAV remote sensing. Abstract: The study of terrain and landform classification through UAV remote sensing diverges significantly from ground vehicle patrol tasks. Besides grappling with the complexity of data annotation and ensuring temporal consistency, it also confronts the scarcity of relevant data and the limitations imposed by the effective range of many technologies. This research substantiates that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. It is demonstrated that fully labeled data is not the optimal choice, as selecting only key data lacks the enhancement in TC, leading to failures. Hence, a teacher-student architecture, coupled with key frame selection and key frame updating algorithms, is proposed. This framework successfully performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. The experimental results reveal that our method, utilizing merely 30% of labeled data, concurrently elevates mIoU and temporal consistency, ensuring stable localization of terrain objects. Result demo: https://gitlab.com/prophet.ai.inc/drone-based-riverbed-inspection

[91] CRISTAL: Real-time Camera Registration in Static LiDAR Scans using Neural Rendering

Joni Vanherck,Steven Moonen,Brent Zoomers,Kobe Werner,Jeroen Put,Lode Jorissen,Nick Michiels

Main category: cs.CV

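TL;DR: Proposes a real-time camera localization method based on a highly accurate colored LiDAR point cloud; neural rendering narrows the domain gap between synthetic and real images, enabling drift-free camera tracking with correct metric scale.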

Details

Motivation: Existing visual methods often suffer from drift and scale ambiguity and depend on fiducial markers or loop closure, so they struggle to meet the high-accuracy localization needs of robotics and Extended Reality. Method: Synthetic views are rendered from a pre-captured colored LiDAR point cloud to establish 2D-3D correspondences between live frames and the point cloud, and a neural rendering technique narrows the domain gap between synthetic and real images to improve feature matching accuracy. Result: The method outperforms existing SLAM methods on the ScanNet++ dataset and achieves real-time, drift-free camera localization with correct global scale. Conclusion: The method effectively addresses drift and scale problems in visual localization and enables accurate, real-time camera localization in robotics and XR applications. Abstract: Accurate camera localization is crucial for robotics and Extended Reality (XR), enabling reliable navigation and alignment of virtual and real content. Existing visual methods often suffer from drift and scale ambiguity, and depend on fiducials or loop closure. This work introduces a real-time method for localizing a camera within a pre-captured, highly accurate colored LiDAR point cloud. By rendering synthetic views from this cloud, 2D-3D correspondences are established between live frames and the point cloud. A neural rendering technique narrows the domain gap between synthetic and real images, reducing occlusion and background artifacts to improve feature matching. The result is drift-free camera tracking with correct metric scale in the global LiDAR coordinate system. Two real-time variants are presented: Online Render and Match, and Prebuild and Localize. We demonstrate improved results on the ScanNet++ dataset and outperform existing SLAM pipelines.

[92] Multi-Order Matching Network for Alignment-Free Depth Super-Resolution

Zhengxue Wang,Zhiqiang Yan,Yuan Wu,Guangwei Gao,Xiang Li,Jian Yang

Main category: cs.CV

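TL;DR: Proposes the alignment-free Multi-Order Matching Network (MOMNet) to handle the RGB-depth misalignment caused by hardware limitations and calibration drift in real-world scenes; it achieves state-of-the-art performance and strong robustness on depth super-resolution.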

Details

Motivation: Existing guided depth super-resolution methods assume strict spatial alignment between RGB and depth, which is hard to satisfy in practice because of hardware differences and calibration drift, leading to degraded performance. An alignment-free method that is robust to misalignment is therefore needed. Method: MOMNet includes a multi-order matching mechanism (jointly performing zero-order, first-order, and second-order matching) that identifies RGB information consistent with depth across multi-order feature spaces, and a multi-order aggregation module composed of multiple structure detectors that uses multi-order priors as prompts for selective feature transfer from RGB to depth. Result: In experiments, MOMNet performs strongly on misaligned real-world scenes, clearly outperforms existing methods, reaches the state of the art in depth super-resolution, and shows good robustness. Conclusion: MOMNet effectively addresses the challenge of misalignment between RGB-D sensors and achieves high-quality depth super-resolution through multi-order matching and aggregation, providing a more reliable solution for practical applications. Abstract: Recent guided depth super-resolution methods are premised on the assumption of strict spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.

[93] DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration

Meng-Cheng Shih,Tsai-Ling Huang,Yu-Heng Shih,Hong-Han Shuai,Hsuan-Tung Liu,Yi-Ren Yeh,Ching-Chun Huang

Main category: cs.CV

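TL;DR: This paper proposes DetailSemNet, a new model for offline signature verification that emphasizes the importance of fine-grained differences and improves verification accuracy through local structure matching.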

Details

Motivation: Conventional methods rely on holistic features for comparison and struggle to capture subtle differences, which hurts verification accuracy. Method: A Detail Semantics Integrator module combines feature disentanglement and re-entanglement to enhance local details and discriminative semantics, enabling more effective local structure matching. Result: The model achieves state-of-the-art performance on several benchmark datasets and shows strong cross-dataset generalization and interpretability. Conclusion: By focusing on local structure, DetailSemNet markedly improves the accuracy and practicality of offline signature verification and has broad application potential. Abstract: Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model's interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.

[94] CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement

Pan Yang,Cheng Deng,Jing Yang,Han Zhao,Yun Liu,Yuling Chen,Xiaoli Ruan,Yanping Chen

Main category: cs.CV

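TL;DR: This paper proposes CAMS, a new compositional zero-shot learning method that uses gated cross-attention and multi-space disentanglement to extract semantic information from visual features and disentangle attributes from objects, significantly improving generalization to unseen compositions.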

Details

Motivation: Existing CLIP-based compositional zero-shot learning methods rely on a global image representation, which cannot fully disentangle attribute and object semantics and limits recognition of unseen compositions. Method: The CAMS framework includes a Gated Cross-Attention module that extracts fine-grained semantic features from CLIP's high-level encoding blocks while suppressing background interference, and a Multi-Space Disentanglement mechanism that separates attribute and object semantics in multidimensional spaces. Result: On three benchmarks (MIT-States, UT-Zappos, and C-GQA), CAMS achieves the best performance in both closed-world and open-world settings. Conclusion: Through fine-grained semantic extraction and multi-space disentanglement, CAMS improves the generalization of compositional zero-shot learning and suggests a direction for improving CLIP-based methods. Abstract: Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and does not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.

[95] End-to-End Motion Capture from Rigid Body Markers with Geodesic Loss

Hai Lan,Zongyan Li,Jianmin Hu,Jialing Yang,Houde Dai

Main category: cs.CV

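TL;DR: Proposes a new optical motion capture method based on Rigid Body Markers (RBMs) that combines a deep learning model with a geodesic loss to achieve accurate, real-time human pose estimation at low computational cost.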

Details

Motivation: Conventional motion capture with dense markers is accurate but requires tedious preparation and suffers from marker confusion, which limits scalability; a simpler and more efficient approach is needed. Method: The Rigid Body Marker (RBM) is introduced as a new fundamental unit that provides unambiguous 6-DoF data, and an end-to-end deep learning regression model directly estimates SMPL parameters under a geodesic loss. Result: Trained on data synthesized from AMASS, the model achieves state-of-the-art accuracy in body pose estimation with over an order of magnitude less computation than optimization-based methods, and real data captured with a Vicon system demonstrates its practical viability. Conclusion: Combining sparse 6-DoF RBMs with a manifold-aware geodesic loss provides a practical, high-fidelity solution for real-time motion capture in graphics, virtual reality, and biomechanics. Abstract: Marker-based optical motion capture (MoCap), while long regarded as the gold standard for accuracy, faces practical challenges, such as time-consuming preparation and marker identification ambiguity, due to its reliance on dense marker configurations, which fundamentally limit its scalability. To address this, we introduce a novel fundamental unit for MoCap, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and drastically simplifies setup. Leveraging this new data modality, we develop a deep-learning-based regression model that directly estimates SMPL parameters under a geodesic loss. This end-to-end approach matches the performance of optimization-based methods while requiring over an order of magnitude less computation. Trained on synthesized data from the AMASS dataset, our end-to-end model achieves state-of-the-art accuracy in body pose estimation. Real-world data captured using a Vicon optical tracking system further demonstrates the practical viability of our approach. Overall, the results show that combining sparse 6-DoF RBM with a manifold-aware geodesic loss yields a practical and high-fidelity solution for real-time MoCap in graphics, virtual reality, and biomechanics.
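
The abstract does not spell out the exact form of its geodesic loss. A common choice, shown here as a hedged sketch, is the geodesic distance on SO(3), $d(R_1, R_2) = \arccos\big((\mathrm{tr}(R_1^\top R_2) - 1)/2\big)$, averaged over joints. The rotation-matrix parameterization and the function names below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def geodesic_distance(R1, R2):
    """Rotation angle (radians) of R1^T R2 for arrays of shape (..., 3, 3)."""
    R = np.swapaxes(R1, -1, -2) @ R2
    cos = (np.trace(R, axis1=-2, axis2=-1) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return np.arccos(np.clip(cos, -1.0, 1.0))

def geodesic_loss(R_pred, R_true):
    """Mean geodesic distance over a batch of rotations."""
    return float(np.mean(geodesic_distance(R_pred, R_true)))

def rot_z(theta):
    """Rotation by theta radians about the z-axis (helper for the example)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

pred = np.stack([rot_z(0.0), rot_z(0.5)])
true = np.stack([rot_z(0.2), rot_z(0.1)])
loss = geodesic_loss(pred, true)  # mean of 0.2 and 0.4 radians
```

Unlike an element-wise difference between rotation matrices, this distance measures the actual angle needed to rotate one orientation into the other, which is what "manifold-aware" refers to.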

[96] CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

Samer Abualhanud,Christian Grannemann,Max Mehltretter

Main category: cs.CV

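TL;DR: Proposes a geometry-guided self-supervised method for calibrated multi-camera rigs that projects 3D points onto a shared cylinder to obtain dense depth estimates that are consistent across views.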

Details

Motivation: Existing self-supervised surround-view depth estimation methods produce inconsistent depth between overlapping images, which degrades 3D perception quality. Method: Using the cameras' intrinsic and extrinsic parameters, the 3D points predicted from each image are projected onto a shared unit cylinder to produce 2D position maps; spatial attention then aggregates features across images according to distances on the cylinder to refine the depth maps. Result: Evaluated on the DDAD and nuScenes datasets, the method improves both cross-image depth consistency and overall depth accuracy. Conclusion: The method effectively improves cross-view consistency and overall performance of self-supervised depth estimation for multi-camera systems. Abstract: Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image and the so-derived 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the consistency of depth estimates across images and the overall depth compared to state-of-the-art methods.
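
The cylinder projection can be sketched as follows, under assumptions the abstract does not confirm: points are already expressed in a common rig frame, the cylinder axis is the y-axis, and distance on the cylinder is measured as Euclidean distance in (azimuth, height) with azimuth wrap-around. Function names are hypothetical.

```python
import numpy as np

def project_to_unit_cylinder(points):
    """Map 3D points (..., 3) to (azimuth, height) on a unit cylinder around the y-axis.

    Each point is projected radially from the axis origin: points on the same
    ray from the origin land on the same cylinder position.
    """
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    rho = np.sqrt(x**2 + z**2)              # horizontal distance from the axis
    theta = np.arctan2(x, z)                # azimuth in (-pi, pi]
    height = y / np.maximum(rho, 1e-8)      # height where the ray meets radius 1
    return np.stack([theta, height], axis=-1)

def cylinder_distance(a, b):
    """Distance between cylinder positions, with the azimuth wrapped at 2*pi."""
    dtheta = np.abs(a[..., 0] - b[..., 0])
    dtheta = np.minimum(dtheta, 2 * np.pi - dtheta)
    dh = a[..., 1] - b[..., 1]
    return np.sqrt(dtheta**2 + dh**2)

p = np.array([1.0, 2.0, 3.0])
same_ray = project_to_unit_cylinder(np.stack([p, 2.0 * p]))
```

Because pixels from different cameras that observe the same 3D point map to (nearly) the same cylinder position, distances in this 2D map give a camera-independent notion of neighborhood, which is what the cross-image attention described in the abstract relies on.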

[97] Graph Neural Networks for Surgical Scene Segmentation

Yihan Li,Nikhil Churamani,Maria Robu,Imanol Luengo,Danail Stoyanov

Main category: cs.CV

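TL;DR: Proposes graph-based segmentation methods that combine Vision Transformers and Graph Neural Networks to improve the segmentation accuracy and anatomical consistency of hepatocystic anatomy in surgical scenes, performing especially well on rare and critical structures.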

Details

Motivation: Accurate identification of hepatocystic anatomy is critical to preventing complications during laparoscopic cholecystectomy, yet existing deep learning models struggle with occlusions, long-range dependencies, and fine-scale geometry. Method: Two segmentation models combine ViT feature encoders with GNNs: (1) a static k-nearest-neighbour graph with GCNII for stable long-range information propagation; (2) a dynamic Differentiable Graph Generator (DGG) with GAT for adaptive topology learning. Result: On the Endoscapes-Seg50 and CholecSeg8k datasets, the models improve mIoU by up to 7-8% and mDice by up to 6% over state-of-the-art baselines and produce anatomically coherent predictions on thin, rare, and safety-critical structures. Conclusion: By combining ViT-based global context with graph-based relational reasoning, the proposed methods improve the performance, interpretability, and reliability of surgical scene segmentation and support safer laparoscopic and robot-assisted surgery. Abstract: Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analyses. Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks. Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. They produce anatomically coherent predictions, particularly on thin, rare and safety-critical structures. Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation.
By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through a precise identification of critical anatomical features.
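
To make the static k-NN graph concrete, the sketch below builds directed k-nearest-neighbour edges over a set of node feature vectors, as one might do for ViT patch embeddings before applying a GNN. The abstract does not say which space or metric the authors use, so Euclidean distance in feature space, the function name `knn_edges`, and the `(2, n*k)` edge layout are all assumptions.

```python
import numpy as np

def knn_edges(features, k):
    """Return a (2, n*k) array of directed edges (source, neighbour).

    features: (n, d) array of node features, e.g. one row per image patch.
    Each node is connected to its k nearest other nodes in Euclidean distance.
    """
    n = features.shape[0]
    sq = np.sum(features**2, axis=1)
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(d2, np.inf)            # exclude self-loops
    neighbours = np.argsort(d2, axis=1)[:, :k]
    sources = np.repeat(np.arange(n), k)
    return np.stack([sources, neighbours.reshape(-1)], axis=0)

# Four nodes forming two tight pairs on a line.
feats = np.array([[0.0], [1.0], [10.0], [11.0]])
edges = knn_edges(feats, k=1)
```

A static graph like this is computed once from the features, in contrast to the dynamic variant in the abstract, where the graph topology itself is learned.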

[98] Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation

Jin Wang,Bingfeng Zhang,Jian Pang,Mengyu Liu,Honglong Chen,Weifeng Liu

Main category: cs.CV

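TL;DR: This paper proposes a Language-Driven Attribute Generalization (LDAG) architecture to address the inaccurate guidance that support images provide in few-shot segmentation because of intra-class variation; it uses language descriptions of the target class to produce unbiased, robust meta guidance and achieves new state-of-the-art performance.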

Details

Motivation: Existing few-shot segmentation methods extract meta information from support images, but intra-class visual variation makes this guidance inaccurate for untrained classes, so a more stable, unbiased meta-guidance mechanism is needed. Method: The LDAG framework comprises a Multi-attribute Enhancement (MaE) module and a Multi-modal Attribute Alignment (MaA) module: MaE uses large language models to generate multi-faceted attribute descriptions of the target class and builds refined visual-text prior guidance, while MaA enables cross-modal interaction between text and visual features to reduce the modality gap. Result: Experiments show that the method clearly outperforms existing approaches on few-shot segmentation and sets a new state of the art. Conclusion: Using language descriptions as the source of unbiased meta guidance avoids the bias introduced by relying on support images and confirms the effectiveness of language-driven attribute generalization for few-shot segmentation. Abstract: Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples by a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential; the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture to utilize inherent target property language descriptions to build a robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, because the text-vision modal shift makes it hard for attribute text to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) module to achieve cross-modal interaction between attribute texts and visual features. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance. The code will be released.

[99] StreetView-Waste: A Multi-Task Dataset for Urban Waste Management

Diogo J. Paulo,João Martins,Hugo Proença,João C. Neves

Main category: cs.CV

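TL;DR: This paper presents StreetView-Waste, a dataset for urban waste management that supports three tasks: waste container detection, tracking, and overflow segmentation. It provides baselines built on existing state-of-the-art models plus two improvement strategies, a heuristic-based tracking refinement and a segmentation refinement that uses geometric priors, which significantly improve counting accuracy and segmentation performance.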

Details

Motivation: Existing litter detection datasets lack support for waste container tracking and overflow monitoring in real scenes and are mostly captured in static environments, which makes them hard to apply to real urban logistics. A more realistic dataset with multi-task annotations is needed to advance research on smart urban waste management. Method: The StreetView-Waste dataset provides annotations of litter and containers in real urban street scenes; baseline models are established for the three evaluation tasks; a heuristic method improves container tracking, and a model-agnostic framework with geometric priors improves overflow segmentation. Result: Fine-tuned detectors perform reasonably well on container detection; the proposed heuristics reduce the mean absolute counting error by 79.6%; and the geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models. Conclusion: StreetView-Waste provides a challenging, realistic benchmark for urban waste management, demonstrates the value of multimodal information and domain priors in real-world perception systems, and is expected to advance smart sanitation systems. Abstract: Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%.
Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.

[100] VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

Ziyan Liu,Yeqiu Chen,Hongyi Cai,Tao Lin,Shuo Yang,Zheng Liu,Bo Zhao

Main category: cs.CV

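TL;DR: Proposes VLA-Pruner, a dual-level token pruning method for Vision-Language-Action (VLA) models that serves both semantic understanding and action execution, using semantic-level and action-level importance criteria to enable efficient real-time inference.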

Details

Motivation: Existing token pruning methods based on semantic salience overlook the dual-system nature of VLA models, which combine semantic understanding with action execution; as a result, critical action information is lost and model performance degrades. Method: VLA-Pruner adopts a dual-level importance criterion: semantic-level relevance from vision-language prefill attention, and action-level importance from action decode attention estimated via temporal smoothing. An adaptive dual-level token selection strategy is built on this criterion. Result: VLA-Pruner is validated on multiple VLA architectures and robotic tasks, achieving state-of-the-art compression and acceleration while maintaining high performance. Conclusion: By fusing semantic and action importance estimates, VLA-Pruner balances computational efficiency and model performance and is suited to accelerating VLA models in practical deployment. Abstract: Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
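
The dual-level criterion in the VLA-Pruner abstract above can be illustrated with a small sketch. It is not the authors' implementation: the exponential moving average used for temporal smoothing, the fixed half-and-half budget split, and the names `ema_update`, `top_k_indices`, and `select_tokens` are all assumptions made for illustration, and the paper's adaptive selection rule is not reproduced here.

```python
def ema_update(previous, current, alpha=0.5):
    """Smooth per-token action-attention scores over time.

    Assumption: an exponential moving average stands in for the
    'temporal smoothing' mentioned in the abstract.
    """
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(previous, current)]


def top_k_indices(scores, k):
    """Return the indices of the k largest scores (ties go to the lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def select_tokens(semantic_scores, action_scores, budget):
    """Keep at most `budget` visual tokens.

    Hypothetical rule: half of the budget goes to the tokens with the
    highest semantic score, the rest to the highest action score, and
    the two sets are merged.
    """
    k_semantic = budget // 2
    k_action = budget - k_semantic
    keep = top_k_indices(semantic_scores, k_semantic) | top_k_indices(action_scores, k_action)
    return sorted(keep)


if __name__ == "__main__":
    semantic = [0.9, 0.1, 0.2, 0.8, 0.05, 0.3]      # e.g. prefill attention per token
    prev_action = [0.0, 0.6, 0.1, 0.0, 0.7, 0.1]    # smoothed estimate from earlier steps
    new_action = [0.0, 0.8, 0.1, 0.0, 0.5, 0.1]     # decode attention at the last step
    action = ema_update(prev_action, new_action)
    print(select_tokens(semantic, action, budget=4))
```

In this toy input, tokens that matter only for the action estimate survive pruning even though their semantic score is low, which is the failure mode of semantic-only pruning that the abstract describes.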

[101] LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs

Doriand Petit,Steve Bourgeois,Vincent Gay-Bellile,Florian Chabot,Loïc Barthe

Main category: cs.CV

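TL;DR: Proposes LLaVA^3, a new method that improves the 3D scene understanding of vision-language models using only multi-view 2D images and no fine-tuning.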

Details

Motivation: Building multimodal language models that understand 3D scenes is challenging because 3D training data is scarce, whereas data for 2D vision-language models is abundant. Method: Inspired by Cubist painting, the method performs an intermediate 3D reconstruction from multi-view 2D images and generates an omnidirectional visual representation of each object, allowing the VLM to understand the 3D scene without fine-tuning. Result: It outperforms previous 2D-based baselines on 3D visual question answering and 3D language grounding. Conclusion: By using multi-view 2D images to produce 3D-aware representations, LLaVA^3 substantially improves the 3D understanding of VLMs without fine-tuning. Abstract: Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLM). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLM using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.

[102] FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry

Clemens Pollak,Kersten Diers,Santiago Estrada,David Kügler,Martin Reuter

Main category: cs.CV

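TL;DR: FastSurfer-CC is an efficient, fully automated framework for corpus callosum morphometry. It automatically identifies mid-sagittal slices, performs segmentation, standardizes head positioning, and extracts shape metrics, and in a Huntington's disease study it shows greater sensitivity than existing tools.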

Details

Motivation: Existing corpus callosum segmentation tools lack a comprehensive, automated analysis pipeline, which limits their use in research on aging and neurological diseases and in clinical trials; an efficient and complete automated solution is therefore needed. Method: The FastSurfer-CC framework automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, uses the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. Result: FastSurfer-CC outperforms existing specialized tools on the individual subtasks and detects significant differences between Huntington's disease patients and healthy controls that the current state-of-the-art method does not detect. Conclusion: FastSurfer-CC provides a fast, fully automated, and highly sensitive corpus callosum analysis pipeline that is valuable for neurodegenerative disease research and clinical biomarker applications. Abstract: The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington's disease patients and healthy controls that are not detected by the current state-of-the-art.

[103] Flow and Depth Assisted Video Prediction with Latent Transformer

Eliyas Suleyman,Paul Henderson,Eksan Firkat,Nicolas Pugeault

Main category: cs.CV

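TL;DR: This paper studies how introducing point flow and depth maps improves video prediction models under occlusion and background motion, and validates the approach on synthetic and real-world datasets.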

Details

Motivation: Occlusion remains an inherent challenge in video prediction, and existing models handle occlusion and background motion poorly, so motion and geometric structure information is needed to improve prediction accuracy. Method: A standard multi-object latent transformer architecture is modified to fuse depth and point-flow information and is evaluated in a controlled setting, using appearance-based metrics and Wasserstein distances on object masks to measure prediction quality. Result: The model that combines point flow and depth performs better in occluded scenes and predicts background motion more accurately than models without these modalities. Conclusion: Introducing explicit motion and geometric structure information improves video prediction under occlusion and complex motion and offers a useful direction for future research. Abstract: Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.

[104] Physics-Informed Machine Learning for Efficient Sim-to-Real Data Augmentation in Micro-Object Pose Estimation

Zongcai Tan,Lan Wei,Dandan Zhang

Main category: cs.CV

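TL;DR: Proposes a generative adversarial network framework that combines wave-optics physical rendering with depth alignment to efficiently generate high-fidelity microscope images, enabling microrobot pose estimation without large amounts of real data.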

Details

Motivation: Existing methods depend on large, well-annotated microscope image datasets that are costly to acquire, and conventional simulation struggles to reproduce complex optical microscopy phenomena such as diffraction and depth-dependent imaging, which limits sim-to-real transfer. Method: Wave-optics-based physical rendering and a depth alignment mechanism are embedded in a generative adversarial network (GAN) to build a physics-informed deep generative model that synthesizes realistic microscope images for training pose estimators. Result: Compared with purely AI-driven methods, the structural similarity index (SSIM) improves by 35.6% at a rendering time of only 0.022 s per frame; a CNN pose estimator trained on the synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, only 5.0%/5.4% below an estimator trained on real data, and generalizes to unseen poses and novel microrobot configurations. Conclusion: The framework substantially reduces dependence on real annotated data, achieves efficient and high-fidelity sim-to-real transfer, and provides a reliable pose estimation solution for autonomous operation of optical microrobots. Abstract: Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging. This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame). The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.

[105] Acquisition Time-Informed Breast Tumor Segmentation from Dynamic Contrast-Enhanced MRI

Rui Wang,Yuexi Du,John Lewin,R. Todd Constable,Nicha C. Dvornek

Main category: cs.CV

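TL;DR: Proposes a tumor segmentation method that uses image acquisition times to modulate model features through FiLM layers, improving tumor segmentation performance and model generalization on DCE-MRI.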

Details

Motivation: Differences in acquisition protocols and between individuals cause large variation in tissue appearance in DCE-MRI images, which makes automated tumor segmentation challenging. Method: Feature-wise linear modulation (FiLM) layers conditioned on acquisition time inject temporal information into segmentation models with different backbones, and accommodate a variable number of dynamic images per study. Result: Several model configurations were trained and tested on a large public multisite dataset; incorporating acquisition time improves tumor segmentation performance and strengthens generalization on both in-domain and out-of-domain data. Conclusion: FiLM modulation with acquisition time information improves the accuracy and robustness of breast DCE-MRI tumor segmentation, with particular benefit for multisite, multi-protocol data. Abstract: Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays an important role in breast cancer screening, tumor assessment, and treatment planning and monitoring. The dynamic changes in contrast in different tissues help to highlight the tumor in post-contrast images. However, varying acquisition protocols and individual factors result in large variation in the appearance of tissues, even for images acquired in the same phase (e.g., first post-contrast phase), making automated tumor segmentation challenging. Here, we propose a tumor segmentation method that leverages knowledge of the image acquisition time to modulate model features according to the specific acquisition sequence. We incorporate the acquisition times using feature-wise linear modulation (FiLM) layers, a lightweight method for incorporating temporal information that also allows for capitalizing on the full, variable number of images acquired per imaging study. We trained a baseline and different configurations of the time-modulated models with varying backbone architectures on a large public multisite breast DCE-MRI dataset. Evaluation on in-domain images and a public out-of-domain dataset showed that incorporating knowledge of phase acquisition time improved tumor segmentation performance and model generalization.
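
To make the FiLM mechanism in the abstract above concrete, here is a minimal sketch in plain Python. FiLM scales and shifts each feature channel with parameters predicted from a conditioning input, here a scalar acquisition time. The linear mapping from time to `gamma` and `beta`, and the names `film_params` and `film`, are assumptions for illustration; the paper's conditioning network and where the layers sit in the backbone are not specified here.

```python
def film_params(acq_time, w_gamma, b_gamma, w_beta, b_beta):
    """Predict one (gamma, beta) pair per channel from a scalar acquisition time.

    Assumption: a per-channel linear map; real models typically use a small MLP.
    """
    gamma = [w * acq_time + b for w, b in zip(w_gamma, b_gamma)]
    beta = [w * acq_time + b for w, b in zip(w_beta, b_beta)]
    return gamma, beta


def film(features, gamma, beta):
    """Apply feature-wise linear modulation: out[c][i] = gamma[c] * x[c][i] + beta[c].

    `features` is a list of channels, each a flat list of activations.
    """
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


if __name__ == "__main__":
    feats = [[1.0, 2.0], [3.0, 4.0]]          # two channels, two activations each
    gamma, beta = film_params(
        acq_time=2.0,                          # hypothetical time, arbitrary units
        w_gamma=[0.5, 0.0], b_gamma=[0.0, 1.0],
        w_beta=[0.0, 1.0], b_beta=[0.0, 0.0],
    )
    print(film(feats, gamma, beta))
```

Because the modulation is applied per image, a study with any number of post-contrast phases can be processed by conditioning each phase on its own acquisition time.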

[106] YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras

Fan Yang,Sosuke Yamao,Ikuo Kusajima,Atsunori Moteki,Shoichi Masui,Shan Jiang

Main category: cs.CV

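TL;DR: Proposes a new method that jointly maps an indoor scene and registers ceiling-mounted cameras (CMCs). A mobile agent carrying an RGB-D camera traverses the scene, and synchronized CMC videos are used for trajectory alignment and joint optimization, improving the performance of both tasks.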

Details

Motivation: Manual registration of ceiling-mounted cameras (CMCs) in indoor scenes is inefficient, and automatic visual localization performs poorly when the scene is visually ambiguous. Method: A mobile agent with a head-mounted RGB-D camera traverses the scene, producing egocentric video used to build the scene layout and trajectory; synchronized CMC videos provide pseudo-scale trajectories and relative poses; all trajectories are correlated by timestamp to align the CMC poses to the world coordinate frame, and a factor graph jointly optimizes the ego-camera poses, scene layout, and CMC poses. Result: Experiments show that the method accomplishes both scene mapping and CMC registration within a unified framework and that the two tasks improve each other; the authors also release the first benchmark dataset for collaborative scene mapping and CMC registration. Conclusion: The method provides a reliable solution for automatic registration of indoor ceiling-mounted cameras and scene mapping, and helps enable downstream position-aware applications. Abstract: Using ceiling-mounted cameras (CMCs) for indoor visual capturing opens up a wide range of applications. However, registering CMCs to the target scene layout presents a challenging task. While manual registration with specialized tools is inefficient and costly, automatic registration with visual localization may yield poor results when visual ambiguity exists. To alleviate these issues, we propose a novel solution for jointly mapping an indoor scene and registering CMCs to the scene layout. Our approach involves equipping a mobile agent with a head-mounted RGB-D camera to traverse the entire scene once and synchronize CMCs to capture this mobile agent. The egocentric videos generate world-coordinate agent trajectories and the scene layout, while the videos of CMCs provide pseudo-scale agent trajectories and CMC relative poses. By correlating all the trajectories with their corresponding timestamps, the CMC relative poses can be aligned to the world-coordinate scene layout. Based on this initialization, a factor graph is customized to enable the joint optimization of ego-camera poses, scene layout, and CMC poses. We also develop a new dataset, setting the first benchmark for collaborative scene mapping and CMC registration (https://sites.google.com/view/yowo/home). Experimental results indicate that our method not only effectively accomplishes two tasks within a unified framework, but also jointly enhances their performance. We thus provide a reliable tool to facilitate downstream position-aware applications.

Rahul Kumar,Vipul Baghel,Sudhanshu Singh,Bikash Kumar Badatya,Shivam Yadav,Babji Srinivasan,Ravi Hegde

Main category: cs.CV

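TL;DR: This paper presents a high-quality, finely annotated video dataset for punch detection and classification in boxing, containing 6,915 punch clips across six punch types, drawn from 20 YouTube sparring videos featuring 18 athletes.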

Details

Motivation: Because actions are highly dynamic and recording environments vary, computer vision analysis of combat sports is held back by a lack of high-quality datasets; an accurate and diverse benchmark dataset is needed to advance the field. Method: Punch clips are extracted from 20 publicly available YouTube sparring videos and manually segmented with precise temporal boundaries and class labels, producing a structured dataset of six punch types with diversity across athletes, motion styles, and camera angles. Result: The resulting dataset contains 6,915 high-quality punch clips, each with precise temporal annotation and a class label, covering multiple punch types, recording conditions, and athlete physiques, and is suited to action recognition in low-resource and unconstrained environments. Conclusion: The dataset supports real-time vision-based punch recognition and is expected to advance automated movement analysis, intelligent coaching systems, and performance assessment in boxing. Abstract: Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.

[108] Contrastive vision-language learning with paraphrasing and negation

Kwun Ho Ngan,Saman Sadeghi Afgeh,Joe Townsend,Artur d'Avila Garcez

Main category: cs.CV

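TL;DR: This paper proposes SemCLIP, a new method that improves the robustness of vision-language models to semantic transformations (especially negation and paraphrasing) by modifying the contrastive loss and training on LLM-generated triples of original, paraphrased, and negated captions. Experiments show that SemCLIP preserves the original performance while markedly improving discrimination of negated text.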

Details

Motivation: Existing vision-language models such as CLIP behave inconsistently on negated or paraphrased text: negation changes meaning drastically with little lexical change, while paraphrasing changes the wording substantially but keeps the meaning. Both strain the model's semantic alignment, so a training method that handles both transformations is needed. Method: SemCLIP introduces a new contrastive loss that explicitly models the semantic relationship among original, paraphrased, and negated captions, and trains on LLM-generated triples (original, paraphrase, negation) so that paraphrases are pulled toward the original in embedding space and negations are pushed away. Result: On the CC-Neg benchmark, image-retrieval accuracy rises from 68.1% to 78.1%; on Sugarcrepe++ the results are mixed but generally better than models trained only with negated captions; on downstream zero-shot classification, SemCLIP outperforms CLIP on all tested tasks, indicating stronger semantic robustness. Conclusion: By combining the semantic structure of paraphrasing and negation, SemCLIP improves the robustness of vision-language models to semantic transformations, with a marked gain on negated expressions, and points toward more reliable cross-modal alignment. Abstract: Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP's performance while increasing considerably the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.

[109] Enhancing Multi-Camera Gymnast Tracking Through Domain Knowledge Integration

Fan Yang,Shigeyuki Odashima,Shoichi Masui,Ikuo Kusajima,Sosuke Yamao,Shan Jiang

Main category: cs.CV

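TL;DR: Proposes a robust multi-camera tracking method that incorporates gymnastics domain knowledge. Through a cascaded data association paradigm, it uses ray-plane intersection to generate coplanar 3D trajectory candidates when detections are insufficient, improving the accuracy and stability of gymnast tracking.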

Details

Motivation: Venue restrictions and variations in lighting, background, uniforms, and occlusion make it difficult for conventional multi-camera triangulation to determine a gymnast's 3D trajectory accurately. Method: Gymnastics domain knowledge is introduced by assuming the athlete's 3D center lies within a predefined vertical plane, so that ray-plane intersection can generate coplanar 3D trajectory candidates; a cascaded data association paradigm uses triangulation when cross-view detections are sufficient and ray-plane intersection when they are not. Result: Experiments show the method outperforms existing methods in challenging scenarios; it has been deployed in the judging system at recent Gymnastics World Championships and earned significant recognition from the International Gymnastics Federation. Conclusion: A cascaded data association strategy informed by domain knowledge substantially improves the robustness of multi-camera gymnast tracking under difficult conditions and has practical value. Abstract: We present a robust multi-camera gymnast tracking method, which has been applied at international gymnastics championships for gymnastics judging. Despite considerable progress in multi-camera tracking algorithms, tracking gymnasts presents unique challenges: (i) due to space restrictions, only a limited number of cameras can be installed in the gymnastics stadium; and (ii) due to variations in lighting, background, uniforms, and occlusions, multi-camera gymnast detection may fail in certain views and only provide valid detections from two opposing views. These factors complicate the accurate determination of a gymnast's 3D trajectory using conventional multi-camera triangulation. To alleviate this issue, we incorporate gymnastics domain knowledge into our tracking solution. Given that a gymnast's 3D center typically lies within a predefined vertical plane during much of their performance, we can apply a ray-plane intersection to generate coplanar 3D trajectory candidates for opposing-view detections. More specifically, we propose a novel cascaded data association (DA) paradigm that employs triangulation to generate 3D trajectory candidates when cross-view detections are sufficient, and resorts to the ray-plane intersection when they are insufficient. Consequently, coplanar candidates are used to compensate for uncertain trajectories, thereby minimizing tracking failures. The robustness of our method is validated through extensive experimentation, demonstrating its superiority over existing methods in challenging scenarios. Furthermore, our gymnastics judging system, equipped with this tracking method, has been successfully applied to recent Gymnastics World Championships, earning significant recognition from the International Gymnastics Federation.
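
The ray-plane intersection mentioned in the gymnast-tracking abstract above is standard geometry and can be sketched as follows. The sketch assumes the camera ray (origin and direction in world coordinates) has already been obtained by back-projecting a 2D detection, and that the vertical plane is given by a point and a normal; the function name `ray_plane_intersection` is hypothetical and the paper's candidate generation and data association steps are not reproduced.

```python
def dot(a, b):
    """Dot product of two 3D vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))


def ray_plane_intersection(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Intersect the ray origin + t * direction (t >= 0) with a plane.

    Returns the 3D intersection point, or None when the ray is parallel
    to the plane or the intersection lies behind the camera.
    """
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None                      # ray runs parallel to the plane
    offset = tuple(p - o for p, o in zip(plane_point, origin))
    t = dot(offset, plane_normal) / denom
    if t < 0:
        return None                      # plane is behind the ray origin
    return tuple(o + t * d for o, d in zip(origin, direction))


if __name__ == "__main__":
    # Hypothetical setup: camera at the origin looking along +x,
    # athlete plane x = 5 (normal along x).
    print(ray_plane_intersection((0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                                 (5.0, 0.0, 0.0), (1.0, 0.0, 0.0)))
```

A single view is enough to produce a 3D candidate this way, which is why the plane constraint helps when too few views are available for triangulation.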

[110] Investigating Optical Flow Computation: From Local Methods to a Multiresolution Horn-Schunck Implementation with Bilinear Interpolation

Haytham Ziani

Main category: cs.CV

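TL;DR: This paper analyzes local and global optical flow methods, focusing on the Horn-Schunck algorithm, and implements a multiresolution version of it to improve accuracy and convergence.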

Details

Motivation: Existing optical flow methods need improvement to estimate inter-frame motion more accurately under varying image conditions. Method: A multiresolution Horn-Schunck algorithm is implemented using bilinear interpolation and prolongation, and is compared with local methods such as Lucas-Kanade. Result: The multiresolution strategy improves the accuracy and convergence of Horn-Schunck and yields better motion estimates under complex image conditions. Conclusion: Global methods combined with a multiresolution strategy are advantageous for optical flow computation, especially for scenes with complex motion and substantial noise. Abstract: This paper presents an applied analysis of local and global methods, with a focus on the Horn-Schunck algorithm for optical flow computation. We explore the theoretical and practical aspects of local approaches, such as the Lucas-Kanade method, and global techniques such as Horn-Schunck. Additionally, we implement a multiresolution version of the Horn-Schunck algorithm, using bilinear interpolation and prolongation to improve accuracy and convergence. The study investigates the effectiveness of these combined strategies in estimating motion between frames, particularly under varying image conditions.
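
Two building blocks named in the abstract above can be sketched in a few lines: the classical Horn-Schunck per-pixel update and bilinear interpolation, which is the usual way to sample a coarse flow field at finer coordinates. This is a textbook-style sketch, not the paper's code; the function names and the clamping at image borders are assumptions, and the full pyramid loop is omitted.

```python
import math


def horn_schunck_update(u_avg, v_avg, ix, iy, it, alpha):
    """One Horn-Schunck update for a single pixel.

    u_avg, v_avg: neighbourhood averages of the current flow estimate.
    ix, iy, it:   spatial and temporal image derivatives at the pixel.
    alpha:        smoothness weight (larger gives smoother flow).
    """
    common = (ix * u_avg + iy * v_avg + it) / (alpha ** 2 + ix ** 2 + iy ** 2)
    return u_avg - ix * common, v_avg - iy * common


def bilinear_sample(image, x, y):
    """Sample a 2D grid (list of rows) at real-valued (x, y).

    Coordinates are clamped to the image border (an assumption made here).
    """
    height, width = len(image), len(image[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1.0 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1.0 - fx) + image[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy


if __name__ == "__main__":
    print(horn_schunck_update(0.0, 0.0, ix=1.0, iy=0.0, it=-1.0, alpha=1.0))
    print(bilinear_sample([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5))
```

In a multiresolution scheme, the flow estimated at a coarse level is typically resampled to the next finer grid (for example with `bilinear_sample`) and scaled by the resolution ratio before the update iterations resume at that level.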

[111] Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution

Jaime Álvarez Urueña,David Camacho,Javier Huertas Tato

Main category: cs.CV

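TL;DR: This paper proposes a two-stage framework for synthetic image detection that extracts features with supervised contrastive learning and classifies them with a few-shot k-nearest neighbors classifier, enabling efficient detection and attribution of images from new generative models without frequent retraining.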

Details

Motivation: Generative AI is advancing quickly, and detection methods that rely on periodic retraining cannot keep up with the rapid release of new generative models; a detector with good generalization is needed. Method: In the first stage, a deep vision model is trained with supervised contrastive learning to extract discriminative embeddings; in the second stage, a k-nearest neighbors classifier operates in the embedding space using a small number of samples to perform open-set classification and attribution for unseen generators. Result: With only 150 images per class, detection accuracy reaches 91.3%, a 5.2 percentage point improvement over existing methods; for source attribution, AUC and OSCR improve by 14.70% and 4.27% respectively. Conclusion: The framework generalizes well and detects accurately with few samples, offering a scalable and robust digital forensics solution for a fast-evolving generative AI landscape. Abstract: The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches. For the source attribution task, the proposed approach obtains improvements of 14.70% and 4.27% in AUC and OSCR respectively on an open set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
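
The second stage described in the abstract above, a k-NN classifier over a learned embedding space, can be sketched without any library. The embeddings below are toy 2D vectors standing in for the output of the contrastively trained encoder, and the use of cosine similarity with majority voting, as well as the names `cosine_similarity` and `knn_predict`, are assumptions; the paper's distance metric, value of k, and open-set rejection rule are not given here.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; returns 0.0 for a zero vector."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def knn_predict(query, support, k=3):
    """Label a query embedding by majority vote among its k nearest supports.

    `support` is a list of (embedding, label) pairs: the few-shot samples.
    """
    ranked = sorted(support, key=lambda item: -cosine_similarity(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical few-shot support set: label names are illustrative only.
    support = [
        ([1.0, 0.0], "real"),
        ([0.9, 0.1], "real"),
        ([0.0, 1.0], "generator_a"),
        ([0.1, 0.9], "generator_a"),
    ]
    print(knn_predict([0.8, 0.2], support, k=3))
```

Adding a new generator in this setup only requires appending its few-shot embeddings to `support`; the encoder itself is left untouched, which is the property the abstract emphasizes.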

Pierrick Bournez,Luca Savant Aira,Thibaud Ehret,Gabriele Facciolo

Main category: cs.CV

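TL;DR: This paper proposes EOGS++, a 3D Gaussian Splatting method for satellite imagery that operates directly on raw high-resolution panchromatic data and integrates optical flow and bundle adjustment into training, improving reconstruction quality and geometric accuracy.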

Details

Motivation: Existing Earth observation reconstruction methods depend on external preprocessing and optimization tools; the goal is to remove that dependence while improving reconstruction efficiency and camera pose estimation accuracy. Method: Building on the EOGS framework, the method operates directly on raw high-resolution panchromatic images, introduces an optical-flow-based bundle adjustment mechanism, and adds early stopping and TSDF post-processing. Result: It achieves state-of-the-art performance on the IARPA 2016 and DFC2019 datasets, reducing the mean MAE on buildings from 1.33 to 1.19, with sharper reconstructions and more accurate geometry. Conclusion: EOGS++ improves the quality and efficiency of 3D reconstruction from satellite imagery, combining high accuracy with computational advantages, and outperforms the original EOGS and NeRF-based methods. Abstract: Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering competitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and efficiency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model demonstrates an improvement from 1.33 to 1.19 mean MAE errors on buildings compared to the original EOGS models.

[113] Progressive Supernet Training for Efficient Visual Autoregressive Modeling

Xiaoyue Chen,Yuling Shi,Kaiyuan Li,Huandong Wang,Yong Li,Xiaodong Gu,Xinlei Chen,Mingbao Lin

Main category: cs.CV

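TL;DR: Proposes VARiant, which uses multiple weight-sharing subnets and a progressive training strategy to substantially reduce memory consumption and speed up inference while preserving generation quality. A single model supports runtime depth switching and suits a range of deployment scenarios.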

Details

Motivation: VAR models incur large memory overhead during multi-scale generation because of cumulative KV caching, which limits practical deployment; memory use must be reduced and inference made more efficient. Method: Exploiting a scale-depth asymmetric dependency, subnets of 16 down to 2 layers are built by equidistant sampling from the 30-layer backbone; early scales use the full network and later scales use a subnet, with shared weights and a progressive training strategy to resolve optimization conflicts. Result: On ImageNet, VARiant-d16/d8 reduce memory by 40-65% with only a slight FID degradation (2.05/2.12); VARiant-d2 achieves a 3.5x speedup and 80% memory reduction (FID 2.97); zero-cost runtime depth switching is supported. Conclusion: By flexibly adjusting network depth, VARiant balances generation quality and efficiency and offers an efficient, scalable deployment option for visual generation across diverse applications. Abstract: Visual Auto-Regressive (VAR) models significantly reduce inference steps through the "next-scale" prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales utilize a subnet. The subnet and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between the subnet and the full network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant's single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.
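
One plausible reading of "equidistant sampling" in the VARiant abstract above is to pick evenly spaced layer indices from the 30-layer network so that a subnet reuses a subset of the shared weights. The sketch below follows that reading; the exact indices the authors select, the rounding rule, and the scale at which inference switches from the full network to the subnet are not stated in the abstract, so `equidistant_layers`, `layers_for_scale`, and `switch_scale` are hypothetical.

```python
def equidistant_layers(total_layers, subnet_depth):
    """Pick `subnet_depth` evenly spaced layer indices from `total_layers`.

    Assumption: the first and last layers are always kept and the
    remaining indices are rounded to the nearest integer.
    """
    if not 1 <= subnet_depth <= total_layers:
        raise ValueError("subnet_depth must be between 1 and total_layers")
    if subnet_depth == 1:
        return [0]
    step = (total_layers - 1) / (subnet_depth - 1)
    return [round(i * step) for i in range(subnet_depth)]


def layers_for_scale(scale_index, switch_scale, total_layers, subnet_depth):
    """Full network for early scales, shared-weight subnet for later scales.

    `switch_scale` is a hypothetical hyperparameter marking the first
    scale that is handled by the subnet.
    """
    if scale_index < switch_scale:
        return list(range(total_layers))
    return equidistant_layers(total_layers, subnet_depth)


if __name__ == "__main__":
    print(equidistant_layers(30, 16))
    print(equidistant_layers(30, 2))
    print(len(layers_for_scale(0, switch_scale=5, total_layers=30, subnet_depth=8)))
```

Because the subnet is only an index list into the same stack of layers, changing depth at run time amounts to choosing a different list, which is consistent with the zero-cost depth switching the abstract describes.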

[114] Lite Any Stereo: Efficient Zero-Shot Stereo Matching

Junpeng Jing,Weixun Luo,Ye Mao,Krystian Mikolajczyk

Main category: cs.CV

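TL;DR: This paper proposes Lite Any Stereo, an efficient stereo matching framework that achieves strong zero-shot generalization while staying lightweight, matching or exceeding the accuracy of existing state-of-the-art methods at less than 1% of the computational cost.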

Details

Motivation: Efficient stereo matching models are commonly assumed to lack zero-shot generalization because of their limited capacity; this paper challenges that assumption and explores how a lightweight model can balance high accuracy with strong generalization. Method: A compact yet expressive backbone and a hybrid cost aggregation module are designed, and a three-stage large-scale training strategy is used to narrow the sim-to-real gap. Result: The model ranks first on four widely used real-world benchmarks, with accuracy comparable to or better than existing non-prior-based accurate methods at less than 1% of the computational cost. Conclusion: Ultra-light models can also achieve strong zero-shot generalization, setting a new standard for efficient stereo matching. Abstract: Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.

[115] NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening

Misaal Khan,Mayank Vatsa,Kuldeep Singh,Richa Singh

Main category: cs.CV

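TL;DR: This paper proposes NutriScreener, a multi-pose graph attention network that combines CLIP visual embeddings with knowledge retrieval to detect malnutrition and predict anthropometric measurements from children's images, enabling efficient and accurate early screening in low-resource settings.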

Details

Motivation: Existing child malnutrition screening methods are laborious and hard to scale, which limits early intervention, especially in resource-poor regions; an automated, scalable solution is urgently needed. Method: NutriScreener is a retrieval-augmented multi-pose graph attention network that fuses CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness; it is trained and tested on AnthroVision, ARAN, and an in-house CampusPose dataset, and evaluated on cross-continent populations. Result: In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency; the model achieves 0.79 recall and 0.82 AUC with significantly lower anthropometric RMSE; cross-dataset experiments show up to a 25% recall gain and up to a 3.5 cm RMSE reduction. Conclusion: NutriScreener is a scalable and accurate tool for reliable early detection of malnutrition in unconstrained pediatric settings and is suitable for practical deployment in low-resource regions. Abstract: Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children's images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.

[116] POMA-3D: The Point Map Way to 3D Scene Understanding

Ye Mao,Weixun Luo,Ranran Huang,Junpeng Jing,Krystian Mikolajczyk

Main category: cs.CV

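TL;DR: This paper proposes POMA-3D, the first self-supervised 3D representation model learned from point maps. By combining priors from 2D foundation models with multi-view geometric consistency, it achieves general 3D scene understanding from 3D coordinates alone.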

Details

Motivation: 3D representation learning suffers from scarce pretrained priors and limited data, and existing methods struggle to exploit the strong priors of 2D vision models; a new representation is needed that is compatible with the input format of 2D models while preserving global 3D geometry. Method: POMA-3D takes point maps as input, which encode explicit 3D coordinates on a structured 2D grid; a view-to-scene alignment strategy transfers 2D priors, and POMA-JEPA, a joint embedding-predictive architecture, enforces geometric consistency across views; a large-scale point map dataset, ScenePoint, is built for pretraining. Result: Experiments show that POMA-3D serves as a strong backbone for 3D understanding and performs well on 3D question answering, embodied navigation, scene retrieval, and localization, using only geometric inputs (3D coordinates). Conclusion: POMA-3D explores a point-map route to 3D scene understanding, addresses the scarcity of pretrained priors and limited data in 3D representation learning, and offers a practical way to combine 2D priors with 3D geometry. Abstract: In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/

[117] Erase to Retain: Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks

Nirjhor Datta,Md. Golam Rabiul Alam

Main category: cs.CV

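TL;DR: This paper proposes Erase to Retain, a controllable unlearning framework for selective knowledge removal in medical image segmentation. It achieves targeted forgetting through Low-Rank Adaptation (LoRA) subspace updates, without full retraining.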

Details

Motivation: Privacy compliance, ethical deployment, and continual dataset revision all require the ability to selectively remove specific knowledge from medical segmentation networks. Method: The method uses a teacher-student distillation paradigm with LoRA-constrained updates in low-rank decoder subspaces. LoRA modules are adversarially optimized so that the student contradicts the teacher's confident predictions on a designated forget subset; head-only supervised refinement then restores generalization on retained data. Result: On ISIC segmentation, forget-set IoU drops from 0.875 to 0.509 while retain and validation performance stays stable (0.647 to 0.677 IoU). The cross-domain CHASE dataset shows the same pattern of forgetting with preserved utility. On ISIC classification, forget-subset accuracy falls from 87.0% to 64.1% while retain accuracy rises from 83.9% to 90.6%. Conclusion: LoRA-based subspace unlearning offers a practical, controllable, and reversible route to knowledge removal in medical image analysis, forgetting sensitive information while preserving performance where it matters most. Abstract: The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher's confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement. For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent.
These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.
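The low-rank update at the heart of LoRA can be sketched in a few lines. This is the generic LoRA formulation (frozen weight plus a scaled product of two small matrices), not the authors' code; the matrices, the `scale` parameter, and the helper names are illustrative assumptions. It also shows one plausible reading of why adapter-based unlearning can be reversible: disabling the adapter recovers the frozen weights.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w_frozen, lora_b, lora_a, scale=1.0):
    """Return W + scale * (B @ A); only B and A would be trained."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]


# Hypothetical 3 x 3 frozen decoder weight and a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [[0.5], [0.0], [-0.5]]   # 3 x 1
a = [[1.0, 2.0, 0.0]]        # 1 x 3

w_unlearned = lora_effective_weight(w, b, a)            # adapter active
w_restored = lora_effective_weight(w, b, a, scale=0.0)  # adapter disabled
```

Because the update has rank 1 here, it can only move the weights within a small subspace, which is the sense in which the unlearning step is constrained. The head-only restoration phase described in the abstract is not modeled.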

[118] Generative AI for Enhanced Wildfire Detection: Bridging the Synthetic-Real Domain Gap

Satyam Gaba

Main category: cs.CV

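TL;DR: This paper uses generative AI to synthesize an annotated smoke dataset, addressing the scarcity of real smoke data, and combines unsupervised domain adaptation with generative techniques (GANs, style transfer, and image matting) to improve smoke segmentation on real-world scenes.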

Details

Motivation: The lack of large annotated smoke datasets limits the use of deep neural networks for early wildfire detection, so effective ways to ease this data shortage are needed. Method: Generative AI is used to create a synthetic smoke dataset; unsupervised domain adaptation is applied to smoke plume segmentation; and style transfer, Generative Adversarial Networks (GANs), and image matting are introduced to make the synthetic data more realistic and easier to adapt across domains. Result: The summary reports that the approach narrows the gap between synthetic and real data and improves the accuracy and generalization of smoke segmentation models; the abstract gives no quantitative figures. Conclusion: Combining generative AI with domain adaptation can overcome data scarcity in smoke detection and points toward scalable, accurate early wildfire warning systems. Abstract: The early detection of wildfires is a critical environmental challenge, with timely identification of smoke plumes being key to mitigating large-scale damage. While deep neural networks have proven highly effective for localization tasks, the scarcity of large, annotated datasets for smoke detection limits their potential. In response, we leverage generative AI techniques to address this data limitation by synthesizing a comprehensive, annotated smoke dataset. We then explore unsupervised domain adaptation methods for smoke plume segmentation, analyzing their effectiveness in closing the gap between synthetic and real-world data. To further refine performance, we integrate advanced generative approaches such as style transfer, Generative Adversarial Networks (GANs), and image matting. These methods aim to enhance the realism of synthetic data and bridge the domain disparity, paving the way for more accurate and scalable wildfire detection models.

[119] SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

Haofeng Liu,Ziyue Wang,Sudhanshu Mishra,Mingqi Gao,Guanyi Qin,Chang Han Low,Alex Y. W. Kong,Yueming Jin

Main category: cs.CV

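TL;DR: This paper introduces SA-SV, the largest benchmark to date for surgical video segmentation, and builds on it the SAM2S model. With an improved memory mechanism, temporal semantic learning, and ambiguity-resilient learning, SAM2S substantially improves interactive video object segmentation in surgical scenes while running in real time.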

Details

Motivation: Existing interactive video object segmentation models such as SAM2 face a domain gap and limited long-term tracking in surgical scenes, and no large, high-quality annotated dataset exists to support their development and evaluation there. Method: The authors build SA-SV, a large-scale surgical iVOS benchmark spanning eight procedure types with 61k frames and 1.6k masklets. On top of it they propose SAM2S, which adds DiveMem, a trainable diverse memory mechanism for long-term tracking; temporal semantic learning for instrument understanding; and ambiguity-resilient learning to mitigate inconsistent annotations across multi-source datasets. Result: Fine-tuning on SA-SV improves SAM2 by 12.99 average J&F. SAM2S reaches 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Conclusion: Through targeted design, SAM2S markedly improves a foundation model's segmentation and long-term tracking in complex surgical video; together with SA-SV it provides data and techniques for future surgical assistance systems. Abstract: Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average J&F over vanilla SAM2. SAM2S further advances performance to 80.42 average J&F, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization.
Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
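The three reported J&F numbers are mutually consistent, which a short calculation confirms. The two baseline scores below are derived from the reported margins, not quoted from the abstract, and the variable names are illustrative.

```python
# Figures reported in the abstract (average J&F).
sam2s_score = 80.42
margin_over_vanilla = 17.10
margin_over_finetuned = 4.11

# Derived (not reported) baseline scores implied by those margins.
vanilla_sam2 = sam2s_score - margin_over_vanilla        # about 63.32
finetuned_sam2 = sam2s_score - margin_over_finetuned    # about 76.31

# Gain from fine-tuning alone; the abstract reports 12.99.
finetune_gain = finetuned_sam2 - vanilla_sam2
```

The implied fine-tuning gain matches the reported 12.99, so the additional 4.11 points can be attributed to the SAM2S components on top of fine-tuning on SA-SV.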

[120] Improving Long-Tailed Object Detection with Balanced Group Softmax and Metric Learning

Satyam Gaba

Main category: cs.CV

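TL;DR: This paper proposes an improved Balanced Group Softmax framework and combines metric learning with a k-NN classification strategy to improve long-tailed 2D object detection, reporting a new best of 24.5% mAP on the LVISv1 dataset.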

Details

Motivation: Real-world class distributions are often long-tailed, which biases detectors toward frequent classes at the expense of rare ones; this work targets that class imbalance. Method: Building on a two-stage Faster R-CNN detector, the author enhances the Balanced Group Softmax (BAGS) framework, introduces metric learning so that features are compact within a class and separated across classes, and uses k-Nearest Neighbors (k-NN) for classification at inference. Result: The approach reaches 24.5% mAP on LVISv1, surpassing the previous benchmark of 24.0%, with particular gains on tail (rare) classes. Conclusion: The proposed methods mitigate class imbalance in long-tailed object detection; well-separated feature embeddings and k-NN classification improve recognition of rare categories. Abstract: Object detection has been widely explored for class-balanced datasets such as COCO. However, real-world scenarios introduce the challenge of long-tailed distributions, where numerous categories contain only a few instances. This inherent class imbalance biases detection models towards the more frequent classes, degrading performance on rare categories. In this paper, we tackle the problem of long-tailed 2D object detection using the LVISv1 dataset, which consists of 1,203 categories and 164,000 images. We employ a two-stage Faster R-CNN architecture and propose enhancements to the Balanced Group Softmax (BAGS) framework to mitigate class imbalance. Our approach achieves a new state-of-the-art performance with a mean Average Precision (mAP) of 24.5%, surpassing the previous benchmark of 24.0%. Additionally, we hypothesize that tail class features may form smaller, denser clusters within the feature space of head classes, making classification challenging for regression-based classifiers. To address this issue, we explore metric learning to produce feature embeddings that are both well-separated across classes and tightly clustered within each class. For inference, we utilize a k-Nearest Neighbors (k-NN) approach to improve classification performance, particularly for rare classes. Our results demonstrate the effectiveness of these methods in advancing long-tailed object detection.
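The k-NN inference step described above can be illustrated with a small, self-contained sketch. It assumes Euclidean distance and simple majority voting over a labelled gallery of embeddings; the paper's actual distance, value of `k`, and embedding dimension are not stated in the abstract, and the data here is made up.

```python
import math
from collections import Counter


def knn_predict(query, gallery, k=3):
    """Classify `query` by majority vote among its k nearest gallery embeddings.

    gallery: list of (embedding, label) pairs; distance is Euclidean (an assumption).
    """
    nearest = sorted(gallery, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D embeddings: a frequent "head" class and a rare "tail" class.
gallery = [
    ((0.0, 0.0), "head"), ((0.1, 0.0), "head"), ((0.0, 0.1), "head"),
    ((1.0, 1.0), "tail"), ((1.1, 1.0), "tail"),
]
prediction = knn_predict((0.9, 1.0), gallery, k=3)
```

With tightly clustered, well-separated embeddings, the two tail examples outvote the single head example among the three nearest neighbours, even though the head class has more samples overall.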

[121] Adaptive Guided Upsampling for Low-light Image Enhancement

Angela Vivian Dcosta,Chunbo Song,Rafael Radkowski

Main category: cs.CV

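TL;DR: The paper proposes Adaptive Guided Upsampling (AGU), an efficient method for upscaling low-light images that optimizes several image quality characteristics at once, such as reducing noise and increasing sharpness.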

Details

Motivation: With existing guided image methods, low-light images lack sufficient characteristics because of their high noise and low brightness, so image quality is not significantly improved. Method: Building on a guided image method, AGU uses multi-parameter optimization to learn the association between low-light and bright image characteristics, trained with machine learning on a few sample image pairs. Result: Experiments indicate that AGU renders high-quality images in real time and outperforms state-of-the-art methods in the low-light use case. Conclusion: AGU is an efficient, real-time upsampling method for low-light images that markedly improves image quality and suits practical applications. Abstract: We introduce Adaptive Guided Upsampling (AGU), an efficient method for upscaling low-light images capable of optimizing multiple image quality characteristics at the same time, such as reducing noise and increasing sharpness. It is based on a guided image method, which transfers image characteristics from a guidance image to the target image. Using state-of-the-art guided methods, low-light images lack sufficient characteristics for this purpose due to their high noise level and low brightness, rendering suboptimal/not significantly improved images in the process. We solve this problem with multi-parameter optimization, learning the association between multiple low-light and bright image characteristics. Our proposed machine learning method learns these characteristics from a few sample image pairs. AGU can render high-quality images in real time using low-quality, low-resolution input; our experiments demonstrate that it is superior to state-of-the-art methods in the addressed low-light use case.

[122] SAM 3D: 3Dfy Anything in Images

SAM 3D Team,Xingyu Chen,Fu-Jen Chu,Pierre Gleize,Kevin J Liang,Alexander Sax,Hao Tang,Weiyao Wang,Michelle Guo,Thibaut Hardin,Xiang Li,Aohan Lin,Jiawei Liu,Ziqi Ma,Anushka Sagar,Bowen Song,Xiaodong Wang,Jianing Yang,Bowen Zhang,Piotr Dollár,Georgia Gkioxari,Matt Feiszli,Jitendra Malik

Main category: cs.CV

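TL;DR: SAM 3D is a visually grounded generative model that predicts 3D object geometry, texture, and layout from a single image and is designed for complex real-world scenes.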

Details

Motivation: 3D reconstruction in real scenes is hard because of occlusion and clutter; the work uses contextual visual cues to improve reconstruction quality. Method: A human- and model-in-the-loop annotation pipeline builds a large-scale, visually grounded 3D dataset, which is used in a multi-stage training framework that combines synthetic pretraining with real-world alignment. Result: In human preference tests on real-world objects and scenes, SAM 3D achieves at least a 5:1 win rate over recent work; the authors state they will release code, model weights, an online demo, and a new benchmark. Conclusion: With large-scale visually grounded data and a modern training strategy, SAM 3D breaks through the data bottleneck of 3D reconstruction and substantially improves results on natural images. Abstract: We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

[123] TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming

Zeyuan Yin,Xiaoming Liu

Main category: cs.CV

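TL;DR: The paper proposes TRIM, which improves the inference efficiency and generation quality of 3D Gaussian diffusion models through trajectory reduction and instance mask denoising.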

Details

Motivation: Existing 3D Gaussian diffusion models are slow to denoise because of their massive number of Gaussian primitives, leading to slow generation and poor scalability. Method: A lightweight selector model evaluates latent Gaussian primitives to enable early trajectory reduction, and an instance mask denoising strategy filters out redundant background regions to cut the computation of each denoising step. Result: Experiments show that TRIM significantly improves both the efficiency and the quality of 3D generation and supports flexible inference-time scaling. Conclusion: TRIM is an efficient post-training acceleration method that improves the inference speed and scalability of 3D diffusion models without sacrificing output quality. Abstract: Recent advances in 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose TRIM (Trajectory Reduction and Instance Mask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies, to accelerate inference without compromising output quality while supporting the inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at https://github.com/zeyuanyin/TRIM.

[124] Late-decoupled 3D Hierarchical Semantic Segmentation with Semantic Prototype Discrimination based Bi-branch Supervision

Shuyu Cao,Chongshou Li,Jie Xu,Tianrui Li,Na Zhao

Main category: cs.CV

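TL;DR: The paper proposes a new 3D hierarchical semantic segmentation framework that tackles multi-hierarchy conflicts and class imbalance with a late-decoupled architecture and a bi-branch supervision mechanism, achieving state-of-the-art performance.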

Details

Motivation: Existing 3DHS methods overlook multi-hierarchy conflicts in cross-hierarchy optimization and the class imbalance that is inevitable in 3D scenes. Method: The framework pairs a primary 3DHS branch with an auxiliary discrimination branch, using a late-decoupled architecture and a semantic-prototype-based bi-branch supervision mechanism. Result: Experiments on multiple datasets and backbones show state-of-the-art 3DHS performance, and the core components can serve as plug-and-play modules that improve previous methods. Conclusion: The proposed method alleviates multi-hierarchy conflicts and class imbalance and markedly improves 3D hierarchical semantic segmentation. Abstract: 3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite the progress, previous 3DHS methods have overlooked the following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) the class imbalance issue is inevitable across multiple hierarchies of 3D scenes, which lets major classes dominate model performance. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework which employs multiple decoders with the coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS-oriented semantic prototype based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to enhance the class-imbalance segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and its core components can also be used as a plug-and-play enhancement to improve previous methods.

[125] Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation

Md. Samiul Alim,Sharjil Khan,Amrijit Biswas,Fuad Rahman,Shafin Rahman,Nabeel Mohammed

Main category: cs.CV

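TL;DR: The paper proposes a teacher-guided pruning framework that integrates knowledge distillation to perform efficient one-shot global pruning while maintaining strong performance at high sparsity.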

Details

Motivation: Unstructured pruning usually requires repeated train-prune-retrain cycles with heavy computational cost, so a more efficient pruning method is needed. Method: Gradient signals informed by a teacher model are incorporated into importance score calculation, tightly coupling knowledge distillation with importance estimation. This enables one-shot global pruning, followed by sparsity-aware retraining to recover accuracy. Result: The method is validated on CIFAR-10, CIFAR-100, and TinyImageNet; at high sparsity it outperforms state-of-the-art baselines such as EPG and EPSD and is more efficient than iterative schemes such as COLT. Conclusion: The framework preserves performance while substantially reducing computation, making it suitable for deployment in resource-constrained environments. Abstract: Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
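One-shot global pruning can be sketched as follows: score every weight once, pick a single threshold across all layers, and mask everything at or below it. The importance score used here, |weight x gradient|, is an assumed first-order stand-in; the abstract says only that gradients are informed by the teacher, so the gradients below are taken as given (for example, from a combined task and distillation loss) and all names and values are hypothetical.

```python
def global_one_shot_masks(weights, grads, sparsity):
    """Return per-layer 0/1 masks that prune the globally least important weights.

    weights, grads: dict mapping layer name -> list of floats.
    sparsity: fraction of all weights to prune in a single pass.
    Importance is |w * g| (an assumption, not the paper's exact score).
    """
    scores = {name: [abs(w * g) for w, g in zip(weights[name], grads[name])]
              for name in weights}
    flat = sorted(s for layer in scores.values() for s in layer)
    n_prune = int(len(flat) * sparsity)
    if n_prune == 0:
        return {name: [1] * len(layer) for name, layer in scores.items()}
    threshold = flat[n_prune - 1]  # one threshold shared by every layer
    # Note: exact ties at the threshold would prune slightly more than requested.
    return {name: [0 if s <= threshold else 1 for s in layer]
            for name, layer in scores.items()}


weights = {"conv": [0.5, -0.1, 0.9], "fc": [0.05, -0.7, 0.2]}
grads = {"conv": [1.0, 1.0, 0.1], "fc": [3.0, 0.5, 0.1]}
masks = global_one_shot_masks(weights, grads, sparsity=0.5)
```

Because the threshold is global, layers are pruned unevenly: here `conv` loses two weights and `fc` only one. Sparsity-aware retraining would then update only the weights whose mask is 1.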

[126] Solving Spatial Supersensing Without Spatial Supersensing

Vishaal Udandarao,Shyamgopal Karthik,Surabhi S. Nath,Andreas Hochlehnert,Matthias Bethge,Ameya Prabhu

Main category: cs.CV

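TL;DR: This paper critically analyzes the spatial supersensing benchmarks that Cambrian-S introduced for video world models. It finds that the VSR and VSC benchmarks can be solved or exploited by simple methods and shortcut heuristics, which suggests the current benchmarks do not reliably measure spatial supersensing.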

Details

Motivation: The work examines whether the spatial supersensing proposed by Cambrian-S is genuinely effective and whether its benchmarks accurately reflect a model's spatial cognition and world-modeling ability. Method: The authors introduce NoSense, a simple baseline for the VSR task, and VSC-Repeat, a sanity check that concatenates each video with itself so that the number of unique objects is unchanged; a drop in performance under this perturbation indicates reliance on dataset shortcuts and not on genuine spatial reasoning. Result: NoSense reaches 95% accuracy on VSR, showing that the task can be solved without spatio-temporal modeling. VSC-Repeat drops the mean relative accuracy of Cambrian-S from 42% to 0%, exposing that its inference algorithm relies on the shortcut that rooms are never revisited. Conclusion: Current VSI-Super benchmarks do not reliably measure spatial supersensing, and the gains reported by Cambrian-S stem largely from exploiting benchmark shortcuts, not from robust spatial supersensing. Abstract: Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, the Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited.
Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
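The logic behind the VSC-Repeat check can be shown with a toy example. This is not the Cambrian-S inference algorithm; it is a deliberately simplified counter that bakes in the "rooms are never revisited" assumption, compared with a counter that tracks object identity. The segment and object names are made up.

```python
def count_with_no_revisit_shortcut(segments):
    """Toy counter that treats every segment as a brand-new room.

    It adds up per-segment object counts, which is only correct
    if no scene is ever shown twice.
    """
    return sum(len(set(segment)) for segment in segments)


def count_unique_objects(segments):
    """Counter that integrates across segments by tracking object identity."""
    seen = set()
    for segment in segments:
        seen.update(segment)
    return len(seen)


# A hypothetical video as a list of segments, each listing visible object ids.
video = [["chair_1", "chair_2"], ["lamp_1"]]
repeated = video * 3  # the same footage shown three times in a row

original_counts = (count_with_no_revisit_shortcut(video),
                   count_unique_objects(video))
repeated_counts = (count_with_no_revisit_shortcut(repeated),
                   count_unique_objects(repeated))
```

On the original video both counters agree, so the benchmark cannot tell them apart. After repetition the shortcut counter triples its answer while the identity-tracking counter is unchanged, which is why the perturbation is diagnostic.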

[127] PartUV: Part-Based UV Unwrapping of 3D Meshes

Zhaoning Wang,Xinyue Wei,Ruoxi Shi,Xiaoshuai Zhang,Hao Su,Minghua Liu

Main category: cs.CV

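TL;DR: This paper proposes PartUV, a part-based UV unwrapping method that combines semantic part decomposition with geometric heuristics. It reduces chart fragmentation on difficult inputs such as AI-generated meshes, significantly lowering chart count and seam length while keeping distortion low.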

Details

Motivation: Existing UV unwrapping methods perform poorly on noisy, bumpy AI-generated meshes, often producing fragmented charts and suboptimal boundaries that hinder downstream use; a more robust and better-structured method is needed. Method: PartUV builds on PartField, a learning-based part decomposition method, and partitions charts in a top-down recursive framework that combines high-level semantic part information with new geometric heuristics. It integrates parameterization and packing algorithms, handles non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Result: Across four datasets (man-made, CAD, AI-generated, and Common Shapes), PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, shows high success rates on challenging meshes, and enables new applications such as part-specific multi-tile packing. Conclusion: By fusing semantic part information with geometric optimization, PartUV produces high-quality UV unwrapping with fewer charts, shorter seams, and low distortion on complex and noisy meshes. Abstract: UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart's distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at https://www.zhaoningwang.com/PartUV.

[128] TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing

Eddie Pokming Sheung,Qihao Liu,Wufei Ma,Prakhar Kaushik,Jianwen Xie,Alan Yuille

Main category: cs.CV

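TL;DR: The paper proposes TriDiff-4D, a diffusion-based 4D generation framework that uses triplane re-posing to produce high-quality, temporally consistent 4D avatars from text.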

Details

Motivation: Existing 4D generation methods suffer from temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational cost, and limited control over dynamics, all of which restrict their wider use. Method: The model combines diffusion with an auto-regressive strategy. It first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to that sequence. Structure and motion priors are learned from large-scale 3D and motion datasets, enabling skeleton-driven 4D generation. Result: TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds while improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry. Conclusion: TriDiff-4D addresses key challenges in current 4D generation and delivers efficient, controllable, high-quality text-to-4D avatar generation. Abstract: With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation. Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.

[129] SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation

Zhenyuan Qin,Xincheng Shuai,Henghui Ding

Main category: cs.CV

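TL;DR: This paper proposes SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. By introducing the CNOCS map representation and a branched network, it addresses the limited controllability and degraded quality of existing methods.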

Details

Motivation: Existing controllable image generation methods struggle to control the 9D poses (location, size, and orientation) of multiple objects simultaneously, and find it hard to balance control precision against generation quality. Method: SceneDesigner adds a branched network to a pre-trained base model and uses a new CNOCS map to encode 9D pose information from the camera view. The authors construct the ObjectPose9D dataset and design a two-stage training strategy with reinforcement learning to address data imbalance. At inference, Disentangled Object Sampling improves generation in complex multi-object scenes, and personalization weights support customized pose control. Result: Experiments show that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Conclusion: SceneDesigner achieves high-quality, highly controllable multi-object 9D pose image generation, advancing controllable generation through a new representation, dataset, and training strategy. Abstract: Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects.
Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.

[130] V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

Yang Luo,Xuanlei Zhao,Baijiong Lin,Lingting Zhu,Liyao Tang,Yuqi Liu,Ying-Cong Chen,Shengju Qian,Xin Wang,Yang You

Main category: cs.CV

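TL;DR: This paper introduces V-ReasonBench, a benchmark that evaluates the reasoning ability of generative video models along four dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. Built from synthetic and real-world image sequences, it offers verifiable, scalable, and unambiguous tasks for systematically comparing advanced video models, contrasting them with image models, and analyzing common hallucination behaviors.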

Details

Motivation: As generative video models such as Veo-3 show zero-shot reasoning abilities, a systematic and reliable way to evaluate video reasoning is needed; existing approaches lack unity, reproducibility, and clarity. Method: V-ReasonBench covers four reasoning dimensions and uses synthetic and real-world image sequences to build diverse, answer-verifiable tasks. Six state-of-the-art video models are evaluated and compared with strong image models, with further analysis of hallucination behaviors and of how video duration affects Chain-of-Frames reasoning. Result: The evaluation reveals clear dimension-wise differences among models, with strong variation in structured, spatial, pattern-based, and physical reasoning; the abstract does not report specific scores. Conclusion: V-ReasonBench provides a unified, reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning. Abstract: Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.

[131] Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO

Junhao Cheng,Liang Hou,Xin Tao,Jing Liao

Main category: cs.CV

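TL;DR: This paper introduces the Video-Next-Event Prediction (VNEP) task, which answers next-event prediction questions with generated video. It proposes VANS, a model that uses reinforcement learning to jointly optimize a vision-language model and a video diffusion model, and reports state-of-the-art performance with the help of a newly built dataset.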

Details

Motivation: Existing next-event prediction mostly outputs text, yet video conveys dynamic physical-world processes (such as tying a tie) far more intuitively. The authors therefore use video as a new answer modality to improve procedural learning and creative exploration. Method: VANS uses a reinforcement learning framework in which the proposed Joint-GRPO coordinates a Vision-Language Model (VLM) and a Video Diffusion Model (VDM), training them jointly under a shared reward signal. The VANS-Data-100K dataset is built for the VNEP task. Result: On procedural and predictive benchmarks, VANS achieves state-of-the-art performance in both video event prediction and visualization. Conclusion: VANS realizes the shift from telling to showing and demonstrates the potential of video as an output modality for next-event prediction, opening a new direction for multimodal reasoning and generation. Abstract: While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task.
Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.
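GRPO, as it is commonly described, scores each sampled output relative to the other samples drawn for the same prompt. The sketch below shows that group-relative normalization only; how Joint-GRPO composes the shared reward for the VLM and VDM is not given in the abstract and is not modeled. The reward values and the use of population standard deviation are illustrative choices.

```python
import statistics


def group_relative_advantages(rewards):
    """Normalize rewards within one group of samples for the same prompt.

    advantage_i = (reward_i - group mean) / group standard deviation.
    If all rewards are equal, there is no learning signal, so return zeros.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumed choice
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Hypothetical rewards for four sampled outputs of the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Outputs that beat the group average get a positive advantage and are reinforced; those below it are discouraged. No separate value network is needed, because the group itself serves as the baseline.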

[132] Learning to Think Fast and Slow for Visual Language Models

Chenyu Lin,Cheng Chi,Jinlin Wu,Sharon Li,Kaiyang Zhou

Main category: cs.CV

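TL;DR: This paper proposes DualMindVLM, a simple reinforcement learning approach that lets visual language models switch automatically between fast and slow thinking modes according to task difficulty, improving both reasoning efficiency and performance.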

Details

Motivation: Existing visual language models generally pursue long, detailed reasoning chains, which incurs excessive computational cost and leaves no fast path for simple questions. Method: The approach has two stages. In the first, data is labelled as requiring fast or slow thinking based on model output length. In the second, the model is trained with GRPO together with the thinking-mode labels to develop dual-mode reasoning. Result: DualMindVLM significantly outperforms the base model and performs on par with state-of-the-art visual reasoning models while maintaining very high token efficiency. Conclusion: By introducing a human-like two-system thinking mechanism, the method balances reasoning quality against computational cost and offers a new approach to efficient visual-language reasoning. Abstract: When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
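The first stage, labelling by output length, might look like the sketch below. The abstract states only that labels are derived from model output length; averaging over several sampled responses, the fixed `threshold`, and the token counts are all assumptions made for illustration.

```python
def label_thinking_mode(response_token_counts, threshold):
    """Label one question as 'fast' or 'slow' from the lengths of sampled answers.

    response_token_counts: token counts of several responses from the
    pre-trained model to the same question (hypothetical input format).
    threshold: mean length above which the question is labelled 'slow'
    (a made-up hyperparameter).
    """
    mean_length = sum(response_token_counts) / len(response_token_counts)
    return "slow" if mean_length > threshold else "fast"


# Hypothetical token counts for two questions.
simple_question_label = label_thinking_mode([12, 18, 15], threshold=100)
hard_question_label = label_thinking_mode([240, 310, 275], threshold=100)
```

The labels produced this way need no human annotation, which fits the abstract's description of the approach as simple; the second stage would then use them alongside GRPO training.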

[133] EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Omkat Thawakar,Shravan Venkatraman,Ritesh Thawkar,Abdelrahman Shaker,Hisham Cholakkal,Rao Muhammad Anwer,Salman Khan,Fahad Khan

Main category: cs.CV

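TL;DR: Proposes EvoLMM, a self-evolving framework in which two cooperative agents (a Proposer and a Solver) improve the reasoning ability of large multimodal models without supervision, achieving gains of up to about 3% on several multimodal math-reasoning benchmarks using only raw image data.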

Details

Motivation: Existing training pipelines for large multimodal models depend on human annotation or external reward models, which limits autonomy and scalability, so a fully unsupervised self-evolving method is needed to improve reasoning ability. Method: A dual-agent framework is built on a single backbone model: the Proposer generates diverse, image-grounded questions, and the Solver answers them through internal consistency. Learning proceeds through a continuous self-rewarding process, with no ground-truth labels or human judgments. Result: With Qwen2.5-VL as the base model, EvoLMM yields gains of up to about 3% on multimodal math-reasoning benchmarks including ChartQA, MathVista, and MathVision, using only raw training images. Conclusion: EvoLMM is a simple yet effective fully unsupervised self-improvement method and provides a solid baseline for future research on self-evolving large multimodal models. Abstract: Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully-unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
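
To make the idea of a continuous, label-free reward from internal consistency more concrete, here is a minimal sketch. It assumes the Solver samples several answers to the same question and that each answer is rewarded by the fraction of samples that agree with it. This agreement-fraction rule and the names `consistency_rewards` and `majority_answer` are illustrative assumptions; the abstract does not specify the reward formula EvoLMM uses.

```python
from collections import Counter


def consistency_rewards(answers):
    """Return one reward in (0, 1] per sampled answer.

    Each answer is rewarded by the share of samples that gave the same
    answer, so no ground-truth label is needed (assumed rule, see lead-in).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]


def majority_answer(answers):
    """Return the most frequent answer and its agreement fraction."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


if __name__ == "__main__":
    samples = ["4", "4", "5", "4"]
    print(consistency_rewards(samples))
    print(majority_answer(samples))
```

Unlike a binary correct/incorrect signal, this score varies smoothly with how strongly the samples agree, which is one way a reward can be continuous without any external verifier.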

[134] NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses

Jing Wen,Alexander G. Schwing,Shenlong Wang

Main category: cs.CV

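TL;DR: Proposes NoPo-Avatar, a method that recovers an animatable 3D human avatar from a single image or a sparse set of images without any pose input, and that performs better in practical settings where ground-truth poses are unavailable.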

Details

Motivation: Existing methods rely on human poses and camera poses as test-time input, but reconstruction quality drops significantly when the pose estimates are noisy, so a robust method that does not depend on pose input is needed. Method: NoPo-Avatar reconstructs avatars from images alone, removing the dependence on test-time human poses, and recovers an animatable 3D human avatar from a single image or a sparse set of images. Result: Experiments on THuman2.0, XHuman, and HuGe100K show that NoPo-Avatar outperforms existing baselines in practical settings without ground-truth poses and delivers comparable results in lab settings with ground-truth poses. Conclusion: By eliminating the dependence on pose input, NoPo-Avatar is more robust and more widely applicable in real-world scenarios. Abstract: We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate "ground-truth" camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).

[135] Dataset Distillation for Pre-Trained Self-Supervised Vision Models

George Cazenavette,Antonio Torralba,Vincent Sitzmann

Main category: cs.CV

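TL;DR: This paper proposes Linear Gradient Matching, a dataset distillation method for training linear probes on top of pre-trained vision models; the synthetic data performs well across multiple tasks and generalizes across models.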

Details

Motivation: Existing dataset distillation methods mainly target training models from scratch, whereas state-of-the-art vision approaches increasingly build on large pre-trained models, so a distillation method suited to pre-trained models is needed. Method: Linear Gradient Matching optimizes synthetic images so that, after passing through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by real data. Result: The synthetic data outperforms all real-image baselines and generalizes across pre-trained models; for example, a linear CLIP probe trained on data distilled with a DINO backbone is competitive. The distilled data is also highly effective for fine-grained classification and model interpretability tasks. Conclusion: The method supports linear-probe training on pre-trained models, its synthetic data generalizes across models, and it provides a useful tool for model analysis and fine-grained recognition. Abstract: The task of dataset distillation aims to find a small set of synthetic images such that training a model on them reproduces the performance of the same model trained on a much larger dataset of real samples. Existing distillation methods focus on synthesizing datasets that enable training randomly initialized models. In contrast, state-of-the-art vision approaches are increasingly building on large, pre-trained self-supervised models rather than training from scratch. In this paper, we investigate the problem of distilling datasets that enable us to optimally train linear probes on top of such large, pre-trained vision models. We introduce a method of dataset distillation for this task called Linear Gradient Matching that optimizes the synthetic images such that, when passed through a pre-trained feature extractor, they induce gradients in the linear classifier similar to those produced by the real data. Our method yields synthetic data that outperform all real-image baselines and, remarkably, generalize across pre-trained vision models, enabling us, for instance, to train a linear CLIP probe that performs competitively using a dataset distilled via a DINO backbone. Further, we show that our distilled datasets are exceptionally effective for fine-grained classification and provide a valuable tool for model interpretability, predicting, among other things, how similar two models' embedding spaces are under the platonic representation hypothesis or whether a model is sensitive to spurious correlations in adversarial datasets.
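
The matching objective described above can be illustrated with a small sketch that works directly on feature vectors (the outputs of the frozen feature extractor). It computes the softmax cross-entropy gradient of a linear classifier for a real batch and for a synthetic batch, then compares the two gradients. The use of cosine distance as the similarity measure, and the names `linear_probe_gradient` and `gradient_matching_loss`, are assumptions for illustration; the abstract only says the gradients should be similar. The step that backpropagates this loss into the synthetic images is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_probe_gradient(W, features, labels):
    """Mean cross-entropy gradient w.r.t. W (num_classes x dim), flattened."""
    num_classes, dim = len(W), len(W[0])
    grad = [[0.0] * dim for _ in range(num_classes)]
    n = len(features)
    for f, y in zip(features, labels):
        logits = [sum(w * x for w, x in zip(row, f)) for row in W]
        probs = softmax(logits)
        for c in range(num_classes):
            # d(loss)/d(logit_c) = p_c - 1[c == y]
            err = probs[c] - (1.0 if c == y else 0.0)
            for j in range(dim):
                grad[c][j] += err * f[j] / n
    return [g for row in grad for g in row]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def gradient_matching_loss(W, real_feats, real_labels, syn_feats, syn_labels):
    """Distance between classifier gradients from real and synthetic data."""
    g_real = linear_probe_gradient(W, real_feats, real_labels)
    g_syn = linear_probe_gradient(W, syn_feats, syn_labels)
    return cosine_distance(g_real, g_syn)


if __name__ == "__main__":
    W = [[0.1, -0.2], [0.3, 0.05]]           # hypothetical 2-class probe
    real = [[1.0, 0.0], [0.0, 1.0]]          # features of real images
    synthetic = [[0.0, 1.0], [1.0, 0.0]]     # features of synthetic images
    print(gradient_matching_loss(W, real, [0, 1], synthetic, [0, 1]))
```

In a full pipeline the synthetic features would come from passing learnable images through the frozen backbone, and an automatic-differentiation framework would minimize this loss with respect to the image pixels.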

Table of Contents

cs.CL [Back]

[1] What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning

[2] Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language

[3] TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

[4] Liars' Bench: Evaluating Lie Detectors for Language Models

[5] Learning Tractable Distributions Of Language Model Continuations

[6] Early science acceleration experiments with GPT-5

[7] ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models

[8] TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating

[9] SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning

[10] SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs

[11] SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

[12] Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement

[13] NLP Datasets for Idiom and Figurative Language Tasks

[14] Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies

[15] AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

[16] Classification of worldwide news articles by perceived quality, 2018-2024

[17] ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports

[18] Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

[19] Arctic-Extract Technical Report

[20] TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval

[21] Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

[22] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

[23] Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation

[24] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems

[25] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

cs.CV [Back]

[26] UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

[27] EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3

[28] WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion

[29] Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation

[30] Box6D: Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes

[31] RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification

[32] Boosting Medical Visual Understanding From Multi-Granular Language Learning

[33] Automated Interpretable 2D Video Extraction from 3D Echocardiography

[34] Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click

[35] InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer

[36] Externally Validated Multi-Task Learning via Consistency Regularization Using Differentiable BI-RADS Features for Breast Ultrasound Tumor Segmentation

[37] UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition

[38] Fairness in Multi-modal Medical Diagnosis with Demonstration Selection

[39] Exploiting Inter-Sample Information for Long-tailed Out-of-Distribution Detection

[40] Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion

[41] Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution

[42] Towards a Safer and Sustainable Manufacturing Process: Material classification in Laser Cutting Using Deep Learning

[43] CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis

[44] Crossmodal learning for Crop Canopy Trait Estimation

[45] LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets

[46] AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers

[47] LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving

[48] VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

[49] SpectralTrain: A Universal Framework for Hyperspectral Image Classification

[50] Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments

[51] T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs

[52] Clustered Error Correction with Grouped 4D Gaussian Splatting

[53] Decoupling Complexity from Scale in Latent Diffusion Model

[54] VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation

[55] Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions

[56] How Noise Benefits AI-generated Image Detection

[57] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

[58] Degradation-Aware Hierarchical Termination for Blind Quality Enhancement of Compressed Video

[59] SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction

[60] Real-Time 3D Object Detection with Inference-Aligned Learning

[61] Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation

[62] A Spatial Semantics and Continuity Perception Attention for Remote Sensing Water Body Change Detection

[63] LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM

[64] Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval

[65] Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers

[66] Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

[67] Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion

[68] Layer-wise Noise Guided Selective Wavelet Reconstruction for Robust Medical Image Segmentation

[69] An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

[70] EvoVLA: Self-Evolving Vision-Language-Action Model

[71] Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective

[72] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

[73] Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification

[74] PrIntMesh: Precise Intersection Surfaces for 3D Organ Mesh Reconstruction

[75] When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

[76] Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles

[77] SwiTrack: Tri-State Switch for Cross-Modal Object Tracking