Table of Contents
cs.CL
[1] What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning
Jeremias Ferrao,Ezgi Basar,Khondoker Ittehadul Islam,Mahrokh Hassani
Main category: cs.CL
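TL;DR: This study examines attribution patterns in Chain-of-Thought (CoT) reasoning in multilingual LLMs. Attribution scores concentrate excessively on the final reasoning step, especially in incorrect generations; structured CoT prompting improves accuracy mainly for high-resource Latin-script languages; and negation and distractor-sentence perturbations reduce both accuracy and attribution coherence.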
Details
Motivation: Although CoT prompting improves task performance, the faithfulness and interpretability of the generated reasoning chains in multilingual settings remain in doubt, so their attribution properties need systematic evaluation.
Method: Two complementary attribution methods, ContextCite (step-level) and Inseq (token-level), are applied to the Qwen2.5 1.5B-Instruct model on the MGSM benchmark.
Result: (1) Attribution scores overemphasize the final reasoning step, especially in incorrect generations; (2) structured CoT prompting significantly improves accuracy for high-resource Latin-script languages; (3) negation and distractor-sentence perturbations reduce model accuracy and attribution coherence.
Conclusion: CoT prompting has limitations in multilingual robustness and interpretive transparency, and needs further improvement to be trustworthy and interpretable across languages.
Abstract: This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods, ContextCite for step-level attribution and Inseq for token-level attribution, to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experimental results highlight key findings such as: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.
[2] Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language
Seungbeen Lee,Jinhong Jeong,Donghyun Kim,Yejin Son,Youngjae Yu
Main category: cs.CL
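TL;DR: Proposes Motion2Mind, a framework for evaluating how well machines infer mental states from nonverbal cues, built on a video dataset with fine-grained annotations. Experiments show that current AI systems perform poorly at detecting and explaining nonverbal cues and tend to over-interpret them.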
Details
Motivation: Existing Theory of Mind (ToM) benchmarks focus mainly on false-belief tasks and reasoning with asymmetric information, overlooking mental states other than belief and the rich forms of nonverbal communication; a framework is needed that comprehensively evaluates machine understanding of nonverbal cues.
Method: The Motion2Mind dataset is built from an expert-curated body-language reference and contains videos with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations; it is used to evaluate AI models on nonverbal cue detection and mental-state inference.
Result: Motion2Mind covers 222 types of nonverbal cues and 397 mind states; current AI systems show a clear performance gap on detection and over-interpret on explanation.
Conclusion: Current AI systems still lag far behind humans in understanding and interpreting human nonverbal cues; Motion2Mind provides a benchmark and resource for improving machine ToM.
Abstract: Our ability to interpret others' mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting not only a substantial performance gap in Detection but also patterns of over-interpretation in Explanation compared to human annotators.
[3] TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues
Sarik Ghazarian,Abhinav Gullapalli,Swair Shah,Anurag Beniwal,Nanyun Peng,Narayanan Sadagopan,Zhou Yu
Main category: cs.CL
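TL;DR: Proposes TOD-ProcBench, a benchmark for evaluating how well LLMs follow complex natural-language instructions in multi-turn task-oriented dialogues. It comprises three tasks covering instruction understanding, violation identification, and conditional generation, and studies the effect of multilingual settings and instruction formats on performance.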
Details
Motivation: Existing task-oriented dialogue benchmarks oversimplify instructions into slots and API calls, which fails to reflect the complex natural-language process instructions of real scenarios and leaves LLMs' complex instruction-following ability without systematic evaluation.
Method: TOD-ProcBench is built from the high-quality ABCD dataset, with complex process instructions formalized as multi-level condition-action statements. Three tasks are designed: relevant-statement retrieval with next-action prediction, detection of instruction-violating responses, and instruction-conditioned response generation. The impact of multilingual settings and instruction formats is also evaluated.
Result: Existing LLMs show limited ability to follow complex process instructions, particularly when handling fine-grained constraints and identifying violations; multilingual settings and instruction format significantly affect compliance performance.
Conclusion: TOD-ProcBench exposes the limitations of current LLMs in understanding and following complex instructions, and offers an evaluation tool and a direction for improving the reliability of task-oriented dialogue systems.
Abstract: In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs' instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs' abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs' complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses.
Task 3 investigates LLMs' abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.
[4] Liars' Bench: Evaluating Lie Detectors for Language Models
Kieron Kretschmar,Walter Laurito,Sharan Maiya,Samuel Marks
Main category: cs.CL
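TL;DR: Introduces LIARS' BENCH, a testbed of 72,863 examples for evaluating techniques that detect lying by LLMs across diverse settings, and shows that existing methods systematically fail on certain types of lies.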
Details
Motivation: Existing LLM lie-detection techniques are typically validated in narrow settings that do not capture the diverse lies models can produce, so a broader and more representative testbed is needed to evaluate and improve them.
Method: LIARS' BENCH contains lies and honest responses generated by four open-weight models across seven datasets, with settings that vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Three black- and white-box lie-detection techniques are evaluated.
Result: Existing techniques perform poorly on certain types of lies, especially where the transcript alone does not reveal whether the model lied, and systematically miss them.
Conclusion: LIARS' BENCH reveals the limitations of current lie-detection techniques and provides a practical, diverse benchmark for future research.
Abstract: Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generate statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it's not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.
[5] Learning Tractable Distributions Of Language Model Continuations
Gwen Yidou-Weng,Ian Li,Anji Liu,Oliver Broadrick,Guy Van den Broeck,Benjie Wang
Main category: cs.CL
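TL;DR: Proposes Learning to Look Ahead (LTLA), a hybrid approach that pairs a language model with a fixed tractable surrogate model to handle controlled text generation under sequence-level constraints, improving constraint satisfaction and generation quality while keeping inference efficient.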
Details
Motivation: Existing surrogate-based methods (e.g., HMMs) are weakly context aware when handling sequence-level constraints, which degrades quality and makes it hard to exploit information about future tokens for precise control.
Method: LTLA uses the base language model for prefix encoding together with a fixed tractable surrogate that computes exact continuation probabilities. A single batched HMM update handles all next-token candidates at once, and the LM's hidden states condition only the surrogate's latent state prior, while the surrogate decoder stays fixed so that computation is reused across prefixes.
Result: LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models (where a standalone HMM cannot encode visual context), and improves constraint satisfaction on controlled-generation tasks while preserving fluency, with minimal inference overhead.
Conclusion: By combining neural context with a fixed surrogate, LTLA improves the accuracy and applicability of controlled language generation while keeping inference efficient, especially for complex constraints that need strong context awareness.
Abstract: Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes.
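The reuse argument above can be illustrated with a minimal sketch. It assumes a toy two-state HMM and a simple "avoid these tokens" constraint, neither of which comes from the paper, and the names `avoid_prob_table` and `constraint_prob` are hypothetical. The look-ahead table depends only on the fixed surrogate, so it is computed once; each new prefix only supplies a different latent state prior (hand-written here, whereas in LTLA it would be derived from the LM's hidden representations).

```python
from typing import List, Set

# Toy fixed surrogate (assumed values): 2 latent states, 2 tokens.
TRANS = [[0.9, 0.1],
         [0.2, 0.8]]   # TRANS[z][z2] = P(next state z2 | current state z)
EMIT = [[0.5, 0.5],
        [1.0, 0.0]]    # EMIT[z][tok] = P(token tok | state z)


def avoid_prob_table(trans: List[List[float]], emit: List[List[float]],
                     banned: Set[int], horizon: int) -> List[float]:
    """beta[z] = P(the next `horizon` tokens all avoid `banned` | current state z)."""
    n = len(trans)
    # Probability mass each state puts on allowed tokens.
    allowed = [sum(p for tok, p in enumerate(emit[z]) if tok not in banned)
               for z in range(n)]
    beta = [1.0] * n
    for _ in range(horizon):
        # Backward recursion: transition, emit an allowed token, then continue.
        beta = [sum(trans[z][z2] * allowed[z2] * beta[z2] for z2 in range(n))
                for z in range(n)]
    return beta


def constraint_prob(prior: List[float], beta: List[float]) -> float:
    """Mix the reusable table with a prefix-specific latent state prior."""
    return sum(p * b for p, b in zip(prior, beta))


beta = avoid_prob_table(TRANS, EMIT, banned={1}, horizon=1)  # computed once
for prior in ([0.5, 0.5], [0.1, 0.9]):                       # one prior per prefix
    print(prior, constraint_prob(prior, beta))
```

Because only `prior` changes between prefixes, the cost per prefix is a single dot product, which is the kind of reuse the abstract contrasts with predicting fresh surrogate parameters for every prefix.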
Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
[6] Early science acceleration experiments with GPT-5
Sébastien Bubeck,Christian Coester,Ronen Eldan,Timothy Gowers,Yin Tat Lee,Alexandru Lupsasca,Mehtaab Sawhney,Robert Scherrer,Mark Sellke,Brian K. Spears,Derya Unutmaz,Kevin Weil,Steven Yin,Nikita Zhivotovskiy
Main category: cs.CL
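TL;DR: Presents concrete cases in which GPT-5 advanced research in mathematics, physics, astronomy, and other sciences, highlighting both the potential and the limits of human-AI collaboration.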
Details
Motivation: Many scientists are not yet fully aware of what frontier AI can do; the paper uses real cases to show the practical value of AI models such as GPT-5 in research.
Method: The authors document their interactions with GPT-5 in real research projects across several disciplines, analyze the new steps it produced, how much it accelerated the work, and where it fell short, and verify its new results in mathematics.
Result: GPT-5 produced concrete new progress in several research projects, including four new mathematical results verified by the human authors; the cases also show where AI saved expert time and where human input was still needed.
Conclusion: Although the contributions are currently modest in scope, their implications for science are already visible; as AI advances rapidly, human-AI collaboration will become an important mode of scientific discovery.
Abstract: AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.
[7] ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models
Qing Zhang,Bing Xu,Xudong Zhang,Yifan Shi,Yang Li,Chen Zhang,Yik Chung Wu,Ngai Wong,Yijie Chen,Hong Dai,Xiansen Chen,Mian Zhang
Main category: cs.CL
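TL;DR: Proposes ELPO, an ensemble-learning-based prompt optimization framework that combines multiple search methods with shared generation strategies to improve the accuracy and robustness of prompt optimization, outperforming existing methods on several tasks.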
Details
Motivation: Existing automatic prompt optimization methods mostly rely on a single model or algorithm, which limits performance on complex tasks, so a stronger and more flexible optimization framework is needed.
Method: Inspired by ensemble learning, ELPO uses a voting mechanism, combines shared generation strategies with different search methods, and designs more efficient algorithms for prompt generation and search.
Result: ELPO outperforms state-of-the-art prompt optimization methods on several tasks, for example improving the F1 score by 7.6 on the ArSarcasm dataset.
Conclusion: By integrating multiple strategies, ELPO improves prompt optimization performance and offers a new solution for automatic prompt optimization on complex tasks.
Abstract: The remarkable performance of Large Language Models (LLMs) relies heavily on carefully crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which has developed rapidly in recent years. Existing APO methods, such as those based on evolutionary algorithms or trial-and-error approaches, achieve efficient and accurate prompt optimization to some extent. However, these studies focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO employs a voting mechanism and introduces shared generation strategies along with different search methods for finding superior prompts. Moreover, ELPO presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving the F1 score by 7.6 on the ArSarcasm dataset.
[8] TS-PEFT: Token-Selective Parameter-Efficient Fine-Tuning with Learnable Threshold Gating
Dabiao Ma,Ziming Dai,Zhimin Xin,Shu Wang,Ye Wang,Haojun Fei
Main category: cs.CL
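TL;DR: Proposes Token-Selective PEFT (TS-PEFT), a new parameter-efficient fine-tuning paradigm that applies fine-tuning modifications selectively to a subset of position indices to improve the downstream performance of large models.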
Details
Motivation: Traditional PEFT methods modify all position indices, which may waste resources or even hurt performance; the paper asks whether every position needs modification and seeks a more efficient fine-tuning strategy.
Method: The TS-PEFT framework introduces a selection function S that applies PEFT modifications to a subset of position indices rather than all of them.
Result: Applying PEFT to all positions is not only superfluous but may be harmful, whereas TS-PEFT improves downstream performance more effectively.
Conclusion: Improvements to PEFT should attend to where modifications are applied; TS-PEFT offers a new direction and an extensible framework for efficient fine-tuning of large models.
Abstract: In the field of large models (LMs) for natural language processing (NLP) and computer vision (CV), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a resource-efficient method that modifies a limited number of parameters while keeping the pretrained weights fixed. This paper investigates the traditional PEFT approach, which applies modifications to all position indices, and questions its necessity. We introduce a new paradigm called Token-Selective PEFT (TS-PEFT), in which a function S selectively applies PEFT modifications to a subset of position indices, potentially enhancing performance on downstream tasks. Our experimental results reveal that the indiscriminate application of PEFT to all indices is not only superfluous, but may also be counterproductive. This study offers a fresh perspective on PEFT, advocating for a more targeted approach to modifications and providing a framework for future research to optimize the fine-tuning process for large models.
[9] SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning
Sebastian Haan
Main category: cs.CL
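TL;DR: Presents SemanticCite, an AI-powered system that verifies citation accuracy through full-text analysis and provides contextual information with a fine-grained classification (Supported, Partially Supported, Unsupported, Uncertain). Fine-tuned lightweight models make large-scale citation verification efficient, and the work releases a multidisciplinary dataset of over 1,000 citations together with an open-source framework.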
Details
Motivation: Scientific communication depends on accurate citations, but semantic errors, AI-hallucinated references, and traditional citations that cannot point to the specific supporting evidence undermine research credibility and peer-review efficiency.
Method: Multiple retrieval methods are combined with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain); fine-tuned lightweight language models analyze citations against the full source text and extract relevant snippets for explainability.
Result: Lightweight models perform close to large commercial systems at lower computational cost; the dataset contains over 1,000 annotated citations across disciplines, with semantic annotations, functional classifications, and bibliometric metadata.
Conclusion: SemanticCite strengthens research integrity through scalable citation verification, supports automated peer review and quality control for AI-generated content, and is released as open source to support citation accuracy at scale.
Abstract: Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software.
SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.
[10] SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs
Xingtao Zhao,Hao Peng,Dingli Su,Xianghua Zeng,Chunyang Liu,Jinzhi Liao,Philip S. Yu
Main category: cs.CL
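TL;DR: Proposes Semantic Structural Entropy (SeSE), a framework that quantifies the inherent semantic uncertainty of LLMs from a structural-information perspective for more precise hallucination detection. It builds a sparsified directed semantic graph, defines structural entropy through hierarchical abstraction to model uncertainty in semantic space, and extends to fine-grained long-form generation. SeSE significantly outperforms advanced existing methods across 29 model-dataset combinations.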
Details
Motivation: Existing uncertainty quantification (UQ) methods rely mainly on semantic probability distributions or pairwise distances and ignore latent semantic structural information, which makes LLM hallucination detection less precise; a UQ framework that exploits semantic structure is needed.
Method: SeSE first builds a directed semantic graph with an adaptive sparsification algorithm that captures directional semantic dependencies and prunes interfering connections. Through hierarchical abstraction, SeSE is then defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty in semantic space. SeSE is further extended to model random semantic interactions among individual claims for fine-grained uncertainty estimation.
Result: Across 29 model-dataset combinations, SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recent KLE, in hallucination detection and uncertainty quantification, with finer and more interpretable detection in long-form generation.
Conclusion: By introducing semantic structural information and structural entropy theory, SeSE offers a principled, interpretable, and efficient UQ method that improves LLM reliability in safety-critical scenarios, particularly for detecting and avoiding hallucinations.
Abstract: Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations.
In addition, to enhance fine-grained UQ in long-form generation (where existing methods often rely on heuristic sample-and-count techniques), we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.
[11] SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning
Wei Xia,Zhi-Hong Deng
Main category: cs.CL
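TL;DR: Proposes SDA, a training-free, model-agnostic alignment framework that dynamically redistributes output probabilities to improve LLMs' helpfulness, harmlessness, and honesty.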
Details
Motivation: Keep LLMs aligned with human intent at inference time while avoiding costly retraining and extensive supervision.
Method: The Steering-Driven Distribution Alignment (SDA) framework dynamically adjusts the model's output probability distribution during inference according to user-defined alignment instructions, giving training-free and model-agnostic alignment.
Result: SDA is validated on 8 open-source LLMs of different scales and origins, with significant gains on all three dimensions: on average 64.4% in helpfulness, 11.5% in harmlessness, and 30% in honesty.
Conclusion: SDA is a lightweight, efficient, and scalable alignment method that supports personalized preference alignment and can be combined with training-based methods, giving it good generality and practical value.
Abstract: With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions: helpfulness, harmlessness, and honesty (3H).
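The abstract does not give SDA's actual redistribution rule, so the following is only a generic sketch of inference-time steering under stated assumptions: the model is scored twice, once on the plain prompt and once with the alignment instruction prepended, and the next-token logits are pushed further in the direction the instruction moves them. The function `steer`, the `strength` parameter, and the toy logits are all hypothetical and not taken from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a small vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer(base_logits: List[float], instructed_logits: List[float],
          strength: float = 1.0) -> List[float]:
    """Amplify the shift that the alignment instruction induces in the logits.

    strength = 0 reproduces the instructed distribution; larger values move
    probability mass further toward tokens the instruction favours.
    """
    steered = [i + strength * (i - b)
               for b, i in zip(base_logits, instructed_logits)]
    return softmax(steered)


# Toy next-token logits over a 3-token vocabulary (assumed values).
BASE = [2.0, 1.0, 0.0]        # plain prompt prefers token 0
INSTRUCTED = [1.0, 2.0, 0.0]  # with the instruction, token 1 is preferred

print(steer(BASE, INSTRUCTED, strength=0.0))
print(steer(BASE, INSTRUCTED, strength=1.0))
```

No weights are updated here; only the output distribution changes, which is what makes this family of methods training-free and applicable to any model that exposes its logits.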
Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.
[12] Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement
Jiashu Yao,Heyan Huang,Shuang Zeng,Chuwei Luo,WangJie You,Jie Tang,Qingsong Liu,Yuhang Guo,Yangyang Kang
Main category: cs.CL
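TL;DR: Proposes a self-rewriting framework in which a large reasoning model rewrites its own reasoning and learns from the rewrites to improve internal reasoning quality. The method preserves the reinforcement learning reward signal while substantially shortening reasoning and improving accuracy.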
Details
Motivation: Existing one-sided rewards based only on final-answer correctness cannot effectively supervise the model's internal reasoning, which leads to over-thinking, under-thinking, redundant or disordered reasoning and limits reasoning quality.
Method: A selective rewriting approach rewrites only the "simple" samples that the model consistently answers correctly, and combines rewriting and vanilla generation within the same training batch to keep the algorithm scalable. The reasoning process is optimized through a self-rewarding mechanism.
Result: Self-rewriting improves accuracy by +0.6, shortens reasoning by 46%, and raises internal reasoning quality by +7.2 on the LLM-as-a-judge metric, outperforming existing baselines.
Conclusion: The self-rewriting framework improves the internal reasoning quality of large reasoning models, yielding more efficient, accurate, and concise reasoning without explicitly asking for shorter reasoning.
Abstract: Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over the internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce a self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only "simple" samples, defined by the model's consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within one single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines.
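One reading of why selective rewriting can preserve the GRPO reward signal is sketched below. It assumes that "consistent correctness" means every sampled rollout for a prompt is correct, and it uses the usual group-normalized advantage (reward minus group mean, divided by group standard deviation); the helper names and toy rewards are hypothetical. Under these assumptions, a group whose rollouts are all correct has zero advantage everywhere, so it contributes no outcome-reward gradient and can be repurposed for rewriting without losing signal.

```python
import statistics
from typing import Dict, List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def select_simple(rollout_rewards: Dict[str, List[float]]) -> List[str]:
    """Prompts whose rollouts are all correct (reward 1.0): rewriting candidates."""
    return [pid for pid, rewards in rollout_rewards.items()
            if rewards and all(r == 1.0 for r in rewards)]


# Toy outcome rewards for 4 rollouts per prompt (assumed values).
ROLLOUT_REWARDS = {
    "q1": [1.0, 1.0, 1.0, 1.0],  # always correct: "simple"
    "q2": [1.0, 0.0, 1.0, 1.0],  # mixed: keeps its ordinary GRPO signal
    "q3": [0.0, 0.0, 0.0, 0.0],  # always wrong: not rewritten
}

for pid in select_simple(ROLLOUT_REWARDS):
    print(pid, group_advantages(ROLLOUT_REWARDS[pid]))
```

Mixed groups such as `q2` are left untouched, so their non-zero advantages still drive the usual correctness objective.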
In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
[13] NLP Datasets for Idiom and Figurative Language Tasks
Blake Matheny,Phuong Minh Nguyen,Minh Le Nguyen,Stephanie Reynolds
Main category: cs.CL
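TL;DR: Presents large-scale and human-annotated datasets for evaluating how pre-trained language models handle figurative meaning in idiom recognition. The datasets are built by extracting context sequences from a large corpus, post-processed for model training compatibility, and evaluated on slot labeling and sequence tagging.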
Details
Motivation: Although large corpora help with many NLP problems, LLMs still struggle with idioms and figurative language, so better datasets are needed to narrow the gap.
Method: Existing idiom datasets are combined into a comprehensive idiom list, which is used to retrieve context sequences containing these idioms from a large corpus. This yields one large-scale dataset of potential idiomatic expressions and two human-annotated datasets of definite idiomatic expressions, post-processed for model-agnostic training and evaluated on slot labeling and sequence tagging.
Result: Several datasets usable for idiom recognition were created; experiments show that they can assess the baseline ability of pre-trained language models to detect idioms and figurative expressions.
Conclusion: The diverse datasets provide a basis for developing new models and methods and help improve LLMs' understanding of non-literal language.
Abstract: Idiomatic and figurative language form a large portion of colloquial speech and writing. With social media, this informal language has become more easily observable to people and trainers of large language models (LLMs) alike. While the advantage of large corpora seems like the solution to all machine learning and Natural Language Processing (NLP) problems, idioms and figurative language continue to elude LLMs. Finetuning approaches are proving to be optimal, but better and larger datasets can help narrow this gap even further. The datasets presented in this paper provide one answer, while offering a diverse set of categories on which to build new models and develop new approaches. A selection of recent idiom and figurative language datasets was used to acquire a combined idiom list, which was used to retrieve context sequences from a large corpus. One large-scale dataset of potential idiomatic and figurative language expressions and two additional human-annotated datasets of definite idiomatic and figurative language expressions were created to evaluate the baseline ability of pre-trained language models in handling figurative meaning through idiom recognition (detection) tasks. The resulting datasets were post-processed for model agnostic training compatibility, utilized in training, and evaluated on slot labeling and sequence tagging.
[14] Learning from Sufficient Rationales: Analysing the Relationship Between Explanation Faithfulness and Token-level Regularisation Strategies
Jonathan Kamp,Lisa Beinborn,Antske Fokkens
Main category: cs.CL
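TL;DR: Examines the role of natural language explanations (rationales) in assessing model reasoning and points out the limits of the common sufficiency metric for measuring how rationale information affects model performance. Relating sufficiency to two modelling paradigms shows that highly informative rationales do not necessarily improve classification accuracy and that sufficiency is tied to interference from context, which calls for more systematic evaluation metrics.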
Details
Motivation: To understand whether models predict for the right reasons, and whether existing rationale metrics such as sufficiency really reflect the effect of rationales on model performance.
Method: Sufficiency is related to two modelling paradigms: token classification, which identifies the tokens that belong to the rationale, and attention regularisation, which incorporates rationale information in the input to observe changes in performance.
Result: Highly informative rationales do not necessarily help correct classification; sufficiency mainly captures the interference of non-rationale context with classification; incorporating rationales can improve cross-domain classification, but the effect varies by task and model; sufficiency and token-classification performance show no clear relationship.
Conclusion: The role of rationales is complex, and current metrics such as sufficiency do not fully measure their effect; new evaluation methods that capture this information systematically are needed.
Abstract: Human explanations of natural language, rationales, form a tool to assess whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability of improving model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are not likely to help classify the instance correctly. Sufficiency conversely captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent per task and model type. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.
[15] AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
Ren Ma,Jiantao Qiu,Chao Xu,Pei Chu,Kaiwen Liu,Pengli Ren,Yuan Qu,Jiahui Peng,Linfeng Hou,Mengjie Liu,Lindong Lu,Wenchang Ning,Jia Yu,Rui Min,Jin Shi,Haojiong Chen,Peng Zhang,Wenjian Zhang,Qian Jiang,Zengjie Hu,Guoqiang Yang,Zhenxiang Li,Fukai Shang,Zhongying Tu,Wentao Zhang,Dahua Lin,Conghui He
Main category: cs.CL
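TL;DR: Proposes MinerU-HTML, a new HTML-to-text extraction pipeline that uses semantic understanding to better preserve structured web content such as formulas, code, and tables, and builds the high-quality AICC corpus; experiments show it outperforms existing methods on downstream tasks.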
Details
Motivation: Existing web text extractors such as Trafilatura rely on heuristic rules, struggle to preserve document structure, and often corrupt structured elements such as formulas, code, and tables, which hurts LLM training. The authors argue that extraction quality can matter as much as data filtering for model performance, so a better extraction approach is needed.
Method: HTML content extraction is reformulated as a sequence labeling problem solved by a 0.6B-parameter language model with semantic understanding, followed by a two-stage formatting pipeline that first categorizes semantic elements and then converts them to Markdown. The approach is more scalable than text-density heuristics.
Result: On MainWebBench, a benchmark of 7,887 annotated web pages, MinerU-HTML reaches 81.8% ROUGE-N F1 versus 63.6% for Trafilatura, with 90.9% preservation for code blocks and 94.0% for formulas. Models trained on AICC, the 7.3-trillion-token multilingual corpus built with MinerU-HTML, score 1.08 percentage points higher than models trained on Trafilatura-extracted TfCC under identical filtering (50.8% average), and also surpass RefinedWeb and FineWeb.
Conclusion: HTML extraction is a key and often underestimated step in building high-quality web corpora. MinerU-HTML's semantics-driven, model-based approach substantially improves extraction quality and demonstrates its impact on LLM pretraining; the dataset, tool, and corpus are publicly released.
Abstract: While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, code, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots.
In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08 pp and providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
[16] Classification of worldwide news articles by perceived quality, 2018-2024
Connor McElroy,Thiago E. A. de Oliveira,Chris Brogly
Main category: cs.CL
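TL;DR: Evaluates how well machine learning and deep learning models distinguish perceived news quality, using a dataset of more than 1.4 million articles; ModernBERT-large performs best.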
Details
Motivation: To explore whether machine learning and deep learning models can effectively distinguish news articles perceived as high quality from those perceived as low quality. Method: Three traditional machine learning classifiers and three deep learning models are used to classify 1,412,272 English news articles from the Common Crawl (2018-2024), based on 194 linguistic features. Result: Random Forest reaches 0.7355 accuracy, and ModernBERT-large achieves the best performance (0.8744 accuracy, 0.9593 ROC-AUC, 0.8739 F1). Conclusion: Both traditional machine learning and deep learning models can effectively distinguish the perceived quality of worldwide news articles, with ModernBERT-large performing best. Abstract: This study explored whether supervised machine learning and deep learning models can effectively distinguish perceived lower-quality news articles from perceived higher-quality news articles. Three machine learning classifiers and three deep learning models were assessed using a newly created dataset of 1,412,272 English news articles from the Common Crawl over 2018-2024. Expert consensus ratings on 579 source websites were split at the median, creating perceived low and high-quality classes of about 706,000 articles each, with 194 linguistic features per website-level labelled article. Traditional machine learning classifiers such as the Random Forest demonstrated capable performance (0.7355 accuracy, 0.8131 ROC-AUC). For deep learning, ModernBERT-large (256 context length) achieved the best performance (0.8744 accuracy; 0.9593 ROC-AUC; 0.8739 F1), followed by DistilBERT-base (512 context length) at 0.8685 accuracy and 0.9554 ROC-AUC. DistilBERT-base (256 context length) reached 0.8478 accuracy and 0.9407 ROC-AUC, while ModernBERT-base (256 context length) attained 0.8569 accuracy and 0.9470 ROC-AUC. These results suggest that the perceived quality of worldwide news articles can be effectively differentiated by traditional CPU-based machine learning classifiers and deep learning classifiers.
[17] ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports
Sherine George,Nithish Saji
Main category: cs.CL
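TL;DR: ESGBench is a benchmark dataset and evaluation framework for assessing explainable ESG question answering systems based on corporate sustainability reports. It contains domain-grounded questions across multiple ESG themes, human-curated answers, and supporting evidence, and aims to advance research on transparent and accountable ESG AI systems.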
Details
Motivation: Existing ESG question answering systems lack standardized evaluation, making it hard to measure model performance on factual consistency, traceability, and domain alignment; a dedicated benchmark is needed to advance the field. Method: A dataset of multi-theme ESG questions with human-annotated answers and supporting evidence is constructed, along with an evaluation framework for fine-grained analysis of LLM reasoning on ESGBench. Result: An evaluation of current state-of-the-art LLMs reveals significant challenges in factual consistency, traceability, and domain alignment. Conclusion: ESGBench provides an effective evaluation platform for explainable ESG question answering and helps advance research on transparent, trustworthy ESG AI. Abstract: We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.
[18] Anatomy of an Idiom: Tracing Non-Compositionality in Language Models
Andrew Gomes
Main category: cs.CL
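TL;DR: This paper studies how Transformer-based language models process idiomatic expressions and proposes new techniques for circuit discovery and analysis. Using a modified path patching algorithm, it finds distinct computational patterns for idiom processing, including frequently activated "Idiom Heads" and enhanced attention between idiom tokens due to earlier processing (termed "augmented reception"). The findings shed light on how Transformers balance computational efficiency and robustness, and suggest directions for understanding the processing of non-compositional language and more complex grammatical constructions.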
Details
Motivation: To understand how Transformer models process non-compositional language such as idioms, and to probe their internal mechanisms for clues about how they handle complex grammatical constructions. Method: Circuits are discovered with a modified path patching algorithm, and the "Idiom Heads" attention heads and the enhanced attention between idiom tokens (augmented reception) are identified and analyzed. Result: Idiom processing shows distinct computational patterns, including "Idiom Heads" that frequently activate across different idioms and enhanced attention between idiom tokens caused by earlier processing. These mechanisms help Transformers balance efficiency and robustness. Conclusion: Transformers handle non-compositional language such as idioms through specific attention mechanisms (Idiom Heads and augmented reception), which offers a new perspective on how they process complex linguistic structure. Abstract: We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate "Idiom Heads," attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term "augmented reception." We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.
[19] Arctic-Extract Technical Report
Mateusz Chiliński,Julita Ołtusek,Wojciech Jaśkowski
Main category: cs.CL
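TL;DR: Arctic-Extract is a state-of-the-art model for extracting structured data from scanned or digital-born business documents, and it can be deployed on resource-constrained hardware.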
Details
Motivation: To extract structured data from long documents efficiently on resource-constrained devices. Method: An optimized training protocol and a lightweight model architecture support efficient inference and large-scale document processing. Result: The model occupies only 6.6 GiB, can process up to 125 A4 pages on an A10 GPU, and performs strongly on document understanding tasks. Conclusion: Arctic-Extract keeps resource consumption low while maintaining high performance, making it suitable for structured data extraction from long documents in practice. Abstract: Arctic-Extract is a state-of-the-art model designed for extracting structured data (question answering, entities and tables) from scanned or digital-born business documents. Despite its SoTA capabilities, the model is deployable on resource-constrained hardware, weighing only 6.6 GiB, making it suitable for deployment on devices with limited resources, such as A10 GPUs with 24 GB of memory. Arctic-Extract can process up to 125 A4 pages on those GPUs, making it suitable for long document processing. This paper highlights Arctic-Extract's training protocols and evaluation results, demonstrating its strong performance in document understanding.
[20] TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
Özay Ezerceli,Mahmoud El Hussieni,Selva Taş,Reyhan Bayraktar,Fatma Betül Terzioğlu,Yusuf Çelebi,Yağız Asker
Main category: cs.CL
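TL;DR: This paper introduces TurkColBERT, the first comprehensive benchmark for Turkish information retrieval, which systematically evaluates dense encoders against late-interaction models. A two-stage adaptation pipeline applies English and multilingual encoders to Turkish text, and the models are tested on datasets from several domains. Small late-interaction models beat large dense models in both parameter efficiency and retrieval performance, and the MUVERA indexing algorithm delivers low-latency, high-quality retrieval. All models and code are open-sourced.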
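To make the dense versus late-interaction distinction in this entry concrete, here is a minimal sketch of ColBERT-style MaxSim scoring next to a mean-pooled single-vector baseline. It is not the TurkColBERT or PyLate implementation: the function names and the toy two-dimensional token embeddings are assumptions for illustration, and real systems use learned, normalized embeddings.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction: each query token keeps its best-matching document
    token, and the per-token maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def mean_pool(tokens: List[Vector]) -> List[float]:
    """Collapse token vectors into one vector by averaging each dimension."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def dense_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Single-vector baseline: dot product of the mean-pooled vectors."""
    return dot(mean_pool(query_tokens), mean_pool(doc_tokens))


# Toy embeddings (hypothetical): the query has two distinct "concepts".
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # covers both concepts
doc_b = [[1.0, 0.0], [1.0, 0.0]]   # covers only the first

late_a, late_b = maxsim_score(query, doc_a), maxsim_score(query, doc_b)
dense_a, dense_b = dense_score(query, doc_a), dense_score(query, doc_b)
```

In this toy case the pooled vectors score both documents identically, while MaxSim ranks `doc_a` higher because token-level matches are preserved. That is the fine-grained matching the abstract attributes to late-interaction models.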
Details
Motivation: Neural information retrieval performs well in high-resource languages but is understudied for morphologically rich, lower-resource languages such as Turkish. Turkish retrieval currently relies mainly on dense bi-encoders, while late-interaction models have not been adequately evaluated, so a systematic benchmark is needed to compare model families on Turkish. Method: The TurkColBERT benchmark uses a two-stage adaptation pipeline: English and multilingual encoders are first fine-tuned on Turkish NLI/STS tasks, then converted into ColBERT-style late-interaction retrievers using the PyLate framework trained on MS MARCO-TR. Ten models are evaluated on five Turkish BEIR datasets covering scientific, financial, and argumentative domains. The efficiency and effectiveness of indexing algorithms such as MUVERA and PLAID are also compared. Result: The 1.0M-parameter colbert-hash-nano-tr is 600x smaller than the 600M-parameter turkish-e5-large yet preserves over 71% of its average mAP. Late-interaction models that are 3-5x smaller significantly outperform dense models, and ColmmBERT-base-TR improves mAP by up to +13.8% on domain-specific tasks. MUVERA+Rerank is 3.33x faster than PLAID with a +1.7% relative mAP gain, reaching 0.54 ms query latency. Conclusion: Late-interaction models offer higher parameter efficiency and stronger performance for Turkish retrieval and are particularly suited to resource-constrained settings. Combined with efficient indexing such as MUVERA, they enable production-grade low-latency retrieval. Further validation on larger, real-world data is still needed. Abstract: Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models, which retain token-level representations for fine-grained matching, have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600x smaller than the 600M turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-interaction models that are 3-5x smaller than dense encoders significantly outperform them; ColmmBERT-base-TR yields up to +13.8% mAP on domain-specific tasks. For production-readiness, we compare indexing algorithms: MUVERA+Rerank is 3.33x faster than PLAID and offers +1.7% relative mAP gain. 
This enables low-latency retrieval, with ColmmBERT-base-TR achieving 0.54 ms query times under MUVERA. We release all checkpoints, configs, and evaluation scripts. Limitations include reliance on moderately sized datasets (≤50K documents) and translated benchmarks, which may not fully reflect real-world Turkish retrieval conditions; larger-scale MUVERA evaluations remain necessary.
[21] Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks
Éloïse Benito-Rodriguez,Einar Urdshals,Jasmina Nasufi,Nicky Pochinkov
Main category: cs.CL
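TL;DR: This paper presents a first step toward a predictive framework that infers the genre of an input text from a large language model's (LLM's) activations. Using Mistral-7B and two datasets, it achieves F1-scores of up to 98% and 71%, demonstrating that text genre can be extracted from LLMs.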
Details
Motivation: Understanding large language models (LLMs) is essential for deploying them safely and beneficially, but the task is complicated by the opacity of LLM internals and the impossibility of having humans evaluate every output. Method: Using Mistral-7B and two datasets, text genre is extracted from the LLM's activations with shallow learning models (scikit-learn classifiers) and compared against a control task. Result: Genre prediction reaches F1-scores of up to 98% and 71% on the two datasets and consistently outperforms the control task, supporting the feasibility of the approach. Conclusion: The genre of an input text can be inferred from LLM activations with shallow learning models, which opens a new interpretability route for understanding LLMs. Abstract: Understanding Large Language Models (LLMs) is key to ensuring their safe and beneficial deployment. This task is complicated by the difficulty of interpreting LLM structures and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework, where the genre of a text used to prompt an LLM is predicted based on its activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.
[22] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
Zachary Ellis,Jared Joselowitz,Yash Deo,Yajie He,Anna Kalygina,Aisling Higham,Mana Rahimzadeh,Yan Jia,Ibrahim Habli,Ernest Lim
Main category: cs.CL
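TL;DR: This paper challenges the use of word error rate (WER) as the evaluation standard for automatic speech recognition (ASR) in clinical dialogue and proposes an LLM-based judge that assesses the clinical safety impact of transcription errors with accuracy comparable to human experts.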
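For readers unfamiliar with the metric under discussion, the sketch below computes word error rate as the word-level edit distance divided by the reference length. It is a generic textbook implementation, not the paper's evaluation code. The example utterances are invented for illustration, and the remark that one of them is clinically more serious is this note's own illustration rather than a clinician label from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example utterances: one substituted word in each hypothesis.
reference = "the patient denies pain"
benign = word_error_rate(reference, "a patient denies pain")
reversed_meaning = word_error_rate(reference, "the patient reports pain")
```

Both hypotheses score 0.25, even though only the second changes what the patient said. This is the kind of gap between textual fidelity and clinical impact that the entry sets out to measure.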
Details
Motivation: Existing ASR metrics such as WER do not reflect the real impact of transcription errors in clinical settings and cannot assess clinical safety, so a more reliable and scalable evaluation method is needed. Method: Expert clinicians compare ground-truth utterances with their ASR-generated counterparts from doctor-patient dialogue datasets and label the clinical impact of any discrepancies (No, Minimal, or Significant Impact), yielding a gold-standard benchmark; an LLM-as-a-Judge is then optimized with GEPA to mimic expert judgment, and its agreement with the human labels is evaluated. Result: WER and other common metrics correlate poorly with the clinical impact labels; the optimized LLM judge (Gemini-2.5-Pro) reaches 90% accuracy and a Cohen's κ of 0.816, showing human-comparable discrimination. Conclusion: The proposed LLM-based automated evaluation framework addresses the shortcomings of traditional metrics for clinical ASR and offers a scalable, validated paradigm for assessing ASR safety in clinical dialogue. Abstract: As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's κ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
[23] Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation
Kexin Zhao,Ken Forbus
Main category: cs.CL
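TL;DR: Proposes a method that uses statistical language models as disambiguation "oracles" without hand-annotated training data: candidate meanings are converted into distinguishable natural language alternatives, which are used to query an LLM to select the appropriate interpretation.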
Details
Motivation: Existing word sense disambiguation methods mainly target coarse-grained representations and depend on hand-annotated training data, which makes it hard to automatically disambiguate richer, more complex semantic representations. Method: The multiple candidate meanings generated by a symbolic natural language understanding system are converted into distinguishable natural language alternatives; an LLM selects the most appropriate interpretation given the context, and the result is fed back to the symbolic system. Result: Evaluation against human-annotated gold answers supports the method's effectiveness: it disambiguates complex semantic representations without hand annotation. Conclusion: The method combines statistical language models with a symbolic system and offers a workable route to fine-grained word sense disambiguation without hand annotation. Abstract: Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.
[24] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems
Elias Lumer,Alex Cardenas,Matt Melich,Myles Mason,Sara Dieter,Vamse Kumar Subbiah,Pradeep Honaganahalli Basavaraju,Roberto Hernandez
Main category: cs.CL
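TL;DR: This paper compares text-based retrieval with direct multimodal embedding retrieval in multimodal RAG systems and finds that storing images natively in the vector space significantly improves retrieval and question answering accuracy.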
Details
Motivation: Existing multimodal RAG systems based on LLM summarization convert images to text during preprocessing, which loses visual context and hurts downstream performance. Method: Two retrieval methods are compared, text-based chunk retrieval (images are first summarized into text) and direct multimodal embedding retrieval (images are embedded natively), using 6 LLMs and 2 multimodal embedding models on a newly created financial earnings call benchmark. Result: Direct multimodal embedding retrieval improves mAP@5 by 13% absolute (32% relative) and nDCG@5 by 11% absolute (20% relative), and produces more accurate and factually consistent answers. Conclusion: Direct multimodal embeddings preserve visual context and outperform text-based approaches that depend on LLM summaries, making them the better choice for multimodal RAG. Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems: text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate both approaches across 6 LLMs and two multimodal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5). These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. 
We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
[25] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
Ali Taghibakhshi,Sharath Turuvekere Sreenivas,Saurav Muralidharan,Ruisi Cai,Marcin Chochowski,Ameya Sunil Mahabaleshwarkar,Yoshi Suhara,Oluwatobi Olabiyi,Daniel Korzekwa,Mostofa Patwary,Mohammad Shoeybi,Jan Kautz,Bryan Catanzaro,Ashwath Aithal,Nima Tajbakhsh,Pavlo Molchanov
Main category: cs.CL
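TL;DR: This paper presents Nemotron Elastic, a framework for building reasoning-oriented LLMs that embeds multiple submodels, extractable zero-shot, within a single parent model, sharply reducing training cost while maintaining high performance.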
Details
Motivation: Training large models for multiple scales and deployment targets is extremely expensive, and conventional compression methods still incur heavy training costs, so a more efficient way to build model families is needed. Method: The Nemotron Elastic framework, which supports hybrid Mamba-Attention architectures, uses an end-to-end trained router and a two-stage training curriculum to produce nested submodels that share weights; it also introduces group-aware SSM elastification, heterogeneous MLP elastification, normalized MSE-based layer importance, and knowledge distillation. Result: Applied to Nemotron Nano V2 12B, the framework simultaneously produces 9B and 6B submodels using only 110B training tokens, a more than 360x cost reduction compared with training from scratch and about 7x compared with existing compression techniques; each submodel matches or exceeds the state of the art, and deployment memory stays constant. Conclusion: Nemotron Elastic enables efficient, multi-budget construction of reasoning models with zero-shot submodel extraction, and improves on existing methods in cost, performance, and deployment flexibility. Abstract: Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. 
Each of the nested models performs on par with or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows having a many-in-one reasoning model whose deployment memory is constant with respect to the number of models in the family.
cs.CV [Back]
[26] UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment
Wei Zhang,Yeying Jin,Xin Li,Yan Zhang,Xiaofeng Cong,Cong Wang,Fengcai Qiao,Zhichao Lian
Main category: cs.CV
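TL;DR: Proposes UniFit, a universal virtual try-on framework driven by a multimodal large language model. A semantic alignment module and a progressive training strategy address the text-image semantic gap and data scarcity in complex scenarios; the framework supports a variety of tasks and achieves SOTA performance.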
Details
Motivation: Existing text-guided multi-task virtual try-on methods suffer from a semantic gap between text instructions and reference images and from data scarcity in complex scenarios, which makes it hard to build a flexible, universal VTON framework. Method: The UniFit framework includes an MLLM-Guided Semantic Alignment Module (MGSA) that integrates multimodal inputs using a multimodal large language model and learnable queries, and narrows the gap between modalities with a semantic alignment loss; a two-stage progressive training strategy with a self-synthesis pipeline lets the model learn complex tasks from limited data. Result: Experiments show that UniFit supports diverse try-on tasks such as multi-garment and model-to-model try-on, and achieves state-of-the-art performance. Conclusion: By introducing an MLLM and progressive learning, UniFit addresses semantic alignment and data scarcity in multi-task VTON and offers an effective solution for universal virtual try-on. Abstract: Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance. The source code and pretrained models are available at https://github.com/zwplus/UniFit.
[27] EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3
Chengxi Zeng,Yuxuan Jiang,Aaron Zhang
Main category: cs.CV
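TL;DR: This paper presents EfficientSAM3, a family of efficient models built on Progressive Hierarchical Distillation (PHD) that enables on-device concept segmentation and tracking while preserving SAM3's performance.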
Details
Motivation: SAM3 performs well on promptable concept segmentation for images and videos, but its unified architecture is computationally heavy and hard to deploy on-device, so more efficient models are needed. Method: A three-stage Progressive Hierarchical Distillation (PHD): stage one distills the encoder via prompt-in-the-loop training; stage two distills temporal memory with a Perceiver-based module; stage three fine-tunes end to end on SAM3 PCS data. Result: The models achieve a good performance-efficiency trade-off on several VOS datasets and support multiple lightweight backbones (RepViT, TinyViT, EfficientViT), enabling efficient on-device concept segmentation and tracking. Conclusion: Through PHD, EfficientSAM3 inherits SAM3's capabilities and keeps concept-level performance high while substantially reducing compute cost, making it suitable for practical deployment. Abstract: The Segment Anything Model 3 (SAM3) advances visual understanding with Promptable Concept Segmentation (PCS) across images and videos, but its unified architecture (shared vision backbone, DETR-style detector, dense-memory tracker) remains prohibitive for on-device use. We present EfficientSAM3, a family of efficient models built on Progressive Hierarchical Distillation (PHD) that transfers capability from SAM3 to lightweight students in three stages: (1) Encoder Distillation aligns image features via prompt-in-the-loop training on SA-1B; (2) Temporal Memory Distillation replaces dense memory with a compact Perceiver-based module trained on SA-V to compress and retrieve spatiotemporal features efficiently; and (3) End-to-End Fine-Tuning refines the full pipeline on the official SAM3 PCS data to preserve concept-level performance. PHD yields a spectrum of student variants using RepViT, TinyViT, and EfficientViT backbones, enabling on-device concept segmentation and tracking while maintaining high fidelity to teacher behavior. We benchmark on popular VOS datasets and compare with a variety of related work, achieving strong performance-efficiency trade-offs.
[28] WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion
Sajjad Pakdamansavoji,Yintao Ma,Amir Rasouli,Tongtong Cao
Main category: cs.CV
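TL;DR: Proposes a new approach to model-based 6D object pose estimation under occlusion that markedly improves accuracy and speed through dynamic sampling, multi-hypothesis inference, iterative refinement, and data augmentation.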
Details
Motivation: Existing 6D pose estimation methods degrade under occlusion because early errors propagate through their multi-stage pipelines, and they generalize poorly to unseen objects; robustness and accuracy need to improve. Method: Four extensions are proposed: a dynamic non-uniform dense sampling strategy, a multi-hypothesis inference mechanism, an iterative refinement process, and occlusion-focused data augmentation. A visibility-weighted evaluation metric is also introduced. Result: Accuracy improves by more than 5% on ICBIN and more than 2% on BOP, with inference approximately 3 times faster. Conclusion: The approach mitigates the effect of occlusion and improves accuracy, robustness, and generalization without test-time fine-tuning. Abstract: Accurate 6D object pose estimation is vital for robotics, augmented reality, and scene understanding. For seen objects, high accuracy is often attainable via per-object fine-tuning, but generalizing to unseen objects remains a challenge. To address this problem, prior work assumes access to CAD models at test time and typically follows a multi-stage pipeline to estimate poses: detect and segment the object, propose an initial pose, and then refine it. Under occlusion, however, the early stages of such pipelines are prone to errors, which can propagate through the sequential processing and consequently degrade performance. To remedy this shortcoming, we propose four novel extensions to model-based 6D pose estimation methods: (i) a dynamic non-uniform dense sampling strategy that focuses computation on visible regions, reducing occlusion-induced errors; (ii) a multi-hypothesis inference mechanism that retains several confidence-ranked pose candidates, mitigating brittle single-path failures; (iii) iterative refinement to progressively improve pose accuracy; and (iv) a series of occlusion-focused training augmentations that strengthen robustness and generalization. Furthermore, we propose a new visibility-weighted metric for evaluation under occlusion to minimize the bias in the existing protocols. Via extensive empirical evaluations, we show that our proposed approach achieves more than 5% improvement in accuracy on ICBIN and more than 2% on BOP dataset benchmarks, while achieving approximately 3 times faster inference.
[29] Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation
Lukas Arzoumanidis,Julius Knechtel,Jan-Henrik Haunert,Youness Dehbi
Main category: cs.CV
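TL;DR: Proposes an automatic generative approach and a manual degradation technique for producing an effectively unlimited amount of synthetic data in the style of historical maps, addressing the lack of annotated data for deep learning in historical map analysis.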
Details
Motivation: Deep learning analysis of historical maps is limited by the scarcity of annotated data, especially for specific cartographic domains. Method: The cartographic style of original historical maps is transferred onto vector data, and an automatic deep generative model is combined with a manual stochastic degradation technique to emulate the visual uncertainty and noise found in historical maps. Result: The synthetic data is used for domain-adaptive semantic segmentation with a Self-Constructing Graph Convolutional Network, and experiments on a homogeneous historical map corpus support the method's effectiveness. Conclusion: The method alleviates the shortage of training data in historical map analysis and improves model applicability in real-world settings. Abstract: The automated analysis of historical documents, particularly maps, has drastically benefited from advances in deep learning and its success across various computer vision applications. However, most deep learning-based methods heavily rely on large amounts of annotated training data, which are typically unavailable for historical maps, especially for those belonging to specific, homogeneous cartographic domains, also known as corpora. Creating high-quality training data suitable for machine learning often takes a significant amount of time and involves extensive manual effort. While synthetic training data can alleviate the scarcity of real-world samples, it often lacks the affinity (realism) and diversity (variation) necessary for effective learning. By transferring the cartographic style of an original historical map corpus onto vector data, we bootstrap an effectively unlimited number of synthetic historical maps suitable for tasks such as land-cover interpretation of a homogeneous historical map corpus. We propose an automatic deep generative approach and an alternative manual stochastic degradation technique to emulate the visual uncertainty and noise, also known as data-dependent uncertainty, commonly observed in historical map scans. To quantitatively evaluate the effectiveness and applicability of our approach, the generated training datasets were employed for domain-adaptive semantic segmentation on a homogeneous map corpus using a Self-Constructing Graph Convolutional Network, enabling a comprehensive assessment of the impact of our data bootstrapping methods.
[30] Box6D: Zero-shot Category-level 6D Pose Estimation of Warehouse Boxes
Yintao Ma,Sajjad Pakdamansavoji,Amir Rasouli,Tongtong Cao
Main category: cs.CV
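TL;DR: This paper proposes Box6D, a category-level 6D pose estimation method designed for storage boxes in warehouse environments. From RGB-D images, it infers box dimensions via binary search and estimates pose with a category CAD template, substantially lowering computational cost while maintaining accuracy.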
Details
Motivation: Existing 6D pose estimation methods trade accuracy against efficiency in new environments, occlusion, and clutter: model-based methods depend on high-precision CAD models and generalize poorly; model-free methods perform poorly under complex conditions; category-level methods are often so general that they ignore environment and object priors, limiting their practicality. A solution that balances accuracy, flexibility, and efficiency is needed. Method: From a single RGB-D frame, Box6D first infers the box dimensions via a fast binary search, then estimates the 6D pose using a generic category-level CAD template rather than instance-specific models. A depth-based plausibility filter and an early-stopping strategy reject implausible hypotheses and reduce computation. Result: Experiments on real-world storage scenarios and public benchmarks show that Box6D matches or exceeds existing methods in 6D pose precision while reducing inference time by approximately 76%. Conclusion: Box6D is an efficient and accurate category-level 6D pose estimation method that is especially suited to storage boxes in warehouses and has good potential for industrial deployment. Abstract: Accurate and efficient 6D pose estimation of novel objects under clutter and occlusion is critical for robotic manipulation across warehouse automation, bin picking, logistics, and e-commerce fulfillment. There are three main approaches in this domain: model-based methods assume an exact CAD model at inference but require high-resolution meshes and transfer poorly to new environments; model-free methods that rely on a few reference images or videos are more flexible but often fail under challenging conditions; and category-level approaches aim to balance flexibility and accuracy, but many are overly general and ignore environment and object priors, limiting their practicality in industrial settings. To this end, we propose Box6D, a category-level 6D pose estimation method tailored for storage boxes in the warehouse context. From a single RGB-D observation, Box6D infers the dimensions of the boxes via a fast binary search and estimates poses using a category CAD template rather than instance-specific models. Using a depth-based plausibility filter and early-stopping strategy, Box6D then rejects implausible hypotheses, lowering computational cost. We conduct evaluations on real-world storage scenarios and public benchmarks, and show that our approach delivers competitive or superior 6D pose precision while reducing inference time by approximately 76%.
[31] RB-FT: Rationale-Bootstrapped Fine-Tuning for Video Classification
Meilong Xu,Di Fu,Jiaxing Zhang,Gong Yu,Jiayu Zheng,Xiaoling Hu,Dongdi Zhao,Feiyang Li,Chao Chen,Yong Cao
Main category: cs.CV
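TL;DR: Proposes a two-stage self-improvement paradigm that needs no new annotations: self-generated textual rationales bridge the semantic gap that vision language models face in domain-specific video classification, significantly outperforming direct supervised fine-tuning.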
Details
Motivation: Existing vision language models perform poorly on data-scarce, domain-specific video classification because of the semantic distance between complex spatio-temporal content and abstract labels (the "rationale gap"), which hinders effective transfer. Method: A two-stage approach: in stage one, the VLM generates a detailed textual rationale for each video and is fine-tuned on these self-generated rationales to strengthen its grasp of domain logic; in stage two, conventional supervised fine-tuning (SFT) is performed on top. Result: Experiments on several datasets show that the method significantly outperforms direct SFT, with stronger classification performance in both few-shot and zero-shot settings. Conclusion: Self-generated rationales are an efficient paradigm that requires no additional annotation and improves VLM adaptation to domain-specific video understanding. Abstract: Vision Language Models (VLMs) are becoming increasingly integral to multimedia understanding; however, they often struggle with domain-specific video classification tasks, particularly in cases with limited data. This stems from a critical rationale gap, where sparse domain data is insufficient to bridge the semantic distance between complex spatio-temporal content and abstract classification labels. We propose a two-stage self-improvement paradigm to bridge this gap without new annotations. First, we prompt the VLMs to generate detailed textual rationales for each video, compelling them to articulate the domain-specific logic. The VLM is then fine-tuned on these self-generated rationales, utilizing this intermediate supervision to align its representations with the nuances of the target domain. Second, conventional supervised fine-tuning (SFT) is performed on the task labels, achieving markedly higher effectiveness as a result of the model's pre-acquired domain reasoning. Extensive experiments on diverse datasets demonstrate that our method significantly outperforms direct SFT, validating self-generated rationale as an effective, annotation-efficient paradigm for adapting VLMs to domain-specific video analysis.
[32] Boosting Medical Visual Understanding From Multi-Granular Language Learning
Zihan Li,Yiqing Wang,Sina Farsiu,Paul Kinahan
Main category: cs.CV
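TL;DR: Proposes MGLL, a new multi-granular language learning framework that improves multi-label and cross-granularity alignment for medical images through structured supervision and soft-label constraints, outperforming existing methods on multiple datasets.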
Details
Motivation: Existing CLIP models are limited to single-label, single-granularity alignment in complex domains such as medical imaging and cannot handle multiple labels or differing annotation granularities. Method: The Multi-Granular Language Learning (MGLL) framework uses structured multi-label supervision, integrates textual descriptions across granularities, introduces soft-label supervision with point-wise constraints, and employs a smooth KL divergence to ensure cross-granularity consistency. Result: After pretraining on the constructed large-scale multi-granular datasets, MGLL outperforms other state-of-the-art methods on multiple downstream tasks. Conclusion: MGLL improves vision-language models in complex multi-label, multi-granularity settings, generalizes well, is computationally efficient, and can be applied to existing models as a plug-and-play module. Abstract: Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
[33] Automated Interpretable 2D Video Extraction from 3D Echocardiography
Milos Vukadinovic,Hirotaka Ieki,Yuki Sahasi,David Ouyang,Bryan He
Main category: cs.CV
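TL;DR: Proposes an automated deep-learning method that selects standard 2D views from 3D cardiac ultrasound volumes, preserving diagnostic features and staying compatible with routine clinical interpretation, with 96% accuracy.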
Details
Motivation: Conventional cardiac ultrasound relies on 2D videos, which cannot fully show the heart's complex 3D structure, while 3D echocardiography is not yet well integrated into clinical workflows built around 2D views. A method is needed that combines the advantages of 3D scanning with the 2D format physicians are used to.
Method: Combines a deep learning view classifier with heuristics based on anatomical landmarks and cardiologists' experience to reconstruct standard 2D echocardiography views automatically from 3D ultrasound volumes.
Result: In blinded evaluation by three cardiologists on 1,600 videos from two hospitals, the method reached 96% accuracy. The generated 2D videos can be used to detect cardiac abnormalities (validated with EchoPrime and PanEcho) and to produce clinical-grade cardiac measurements (validated with EchoNet-Measurement), while preserving spatial calibration and diagnostic features.
Conclusion: The method automatically generates standard 2D views from 3D ultrasound volumes, combining the efficiency of 3D scanning with the interpretability of 2D images, which may help broaden clinical use of 3D echocardiography.
Abstract: Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as their ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos at https://github.com/echonet/3d-echo.
[34] Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click
Raphael Ruschel,Hardikkumar Prajapati,Awsafur Rahman,B. S. Manjunath
Main category: cs.CV
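TL;DR: Introduces Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG). From a single user click or bounding-box prompt it performs object segmentation, tracking, interaction discovery, and semantic relation reasoning, unifying visual prompting with spatio-temporal semantic understanding.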
Details
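Motivation: Existing VSGG systems are closed, feed-forward pipelines that cannot incorporate human guidance, while promptable segmentation models such as SAM2 lack semantic and relational reasoning. A unified framework that combines human interaction with semantic reasoning is therefore needed.
Method: Proposes Click2Graph, which comprises a dynamic interaction discovery module (generating subject-conditioned object prompts) and a semantic classification head (jointly reasoning over entities and predicates). Starting from a single user prompt, it segments and tracks the subject over time, discovers interacting objects, and predicts triplets.
Result: Experiments on the OpenPVSG benchmark show that Click2Graph effectively supports user-guided PVSG, yielding controllable and interpretable video scene understanding.
Conclusion: Click2Graph is the first to combine human prompts with panoptic grounding and relation inference, offering a new paradigm for interactive, interpretable video scene understanding.
Abstract: State-of-the-art Video Scene Graph Generation (VSGG) systems provide structured visual understanding but operate as closed, feed-forward pipelines with no ability to incorporate human guidance. In contrast, promptable segmentation models such as SAM2 enable precise user interaction but lack semantic or relational reasoning. We introduce Click2Graph, the first interactive framework for Panoptic Video Scene Graph Generation (PVSG) that unifies visual prompting with spatial, temporal, and semantic understanding. From a single user cue, such as a click or bounding box, Click2Graph segments and tracks the subject across time, autonomously discovers interacting objects, and predicts
[35] InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer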
Muyao Yuan,Yuanhong Zhang,Weizhan Zhang,Lan Ma,Yuan Gao,Jiangyong Ying,Yudeng Xin
Main category: cs.CV
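TL;DR: Proposes InfoCLIP, a method that takes an information-theoretic view to improve CLIP fine-tuning for open-vocabulary semantic segmentation while preserving the pretrained vision-language alignment.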
Details
Motivation: Existing methods that fine-tune CLIP for segmentation on limited categories tend to overfit and damage the pretrained modality alignment, so a method that stabilizes the alignment is needed.
Method: Proposes InfoCLIP, which introduces two new objectives based on mutual information. The first compresses the pixel-text modality alignment from pretrained CLIP to reduce noise. The second maximizes the mutual information between the alignment knowledge of the pretrained and fine-tuned models to transfer local semantic relations suited to segmentation.
Result: Extensive experiments on multiple benchmarks show that InfoCLIP improves CLIP fine-tuning for open-vocabulary semantic segmentation, demonstrating good adaptability and superiority.
Conclusion: Through mutual-information-driven knowledge transfer, InfoCLIP transfers alignment knowledge effectively and improves segmentation without damaging the original modality alignment.
Abstract: Recently, the strong generalization ability of CLIP has facilitated open-vocabulary semantic segmentation, which labels pixels using arbitrary text. However, existing methods that fine-tune CLIP for segmentation on limited seen categories often lead to overfitting and degrade the pretrained vision-language alignment. To stabilize modality alignment during fine-tuning, we propose InfoCLIP, which leverages an information-theoretic perspective to transfer alignment knowledge from pretrained CLIP to the segmentation task. Specifically, this transfer is guided by two novel objectives grounded in mutual information. First, we compress the pixel-text modality alignment from pretrained CLIP to reduce noise arising from its coarse-grained local semantic representations learned under image-text supervision. Second, we maximize the mutual information between the alignment knowledge of pretrained CLIP and the fine-tuned model to transfer compact local semantic relations suited for the segmentation task. Extensive evaluations across various benchmarks validate the effectiveness of InfoCLIP in enhancing CLIP fine-tuning for open-vocabulary semantic segmentation, demonstrating its adaptability and superiority in asymmetric transfer.
[36] Externally Validated Multi-Task Learning via Consistency Regularization Using Differentiable BI-RADS Features for Breast Ultrasound Tumor Segmentation
Jingru Zhang,Saed Moradi,Ashirbani Saha
Main category: cs.CV
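TL;DR: Proposes a multi-task learning method based on consistency regularization that uses differentiable BI-RADS-inspired morphological features to mitigate task interference in breast ultrasound tumor segmentation, significantly improving segmentation performance and generalization.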
Details
Motivation: Multi-task learning suffers from destructive task interference, which can push model performance below single-task baselines and hurt generalization.
Method: Proposes a novel consistency regularization approach built on differentiable BI-RADS-inspired morphological features to reduce interference between the segmentation and classification tasks.
Result: Trained on the BrEaST dataset and validated on three external datasets, the method significantly improves segmentation on UDIAT, BUSI, and BUS-UCLM (Dice coefficient 0.81 vs 0.59, 0.66 vs 0.56, and 0.69 vs 0.49, respectively; p<0.001) and reaches state-of-the-art performance on UDIAT.
Conclusion: The consistency regularization approach mitigates interference in multi-task learning and significantly improves the generalization and robustness of breast ultrasound tumor segmentation.
Abstract: Multi-task learning can suffer from destructive task interference, where jointly trained models underperform single-task baselines and limit generalization. To improve generalization performance in breast ultrasound-based tumor segmentation via multi-task learning, we propose a novel consistency regularization approach that mitigates destructive interference between segmentation and classification. The consistency regularization approach is composed of differentiable BI-RADS-inspired morphological features. We validated this approach by training all models on the BrEaST dataset (Poland) and evaluating them on three external datasets: UDIAT (Spain), BUSI (Egypt), and BUS-UCLM (Spain). Our comprehensive analysis demonstrates statistically significant (p<0.001) improvements in generalization on the segmentation task for the proposed multi-task approach vs. the baseline on UDIAT, BUSI, and BUS-UCLM (Dice coefficient = 0.81 vs 0.59, 0.66 vs 0.56, and 0.69 vs 0.49, respectively). The proposed approach also achieves state-of-the-art segmentation performance under rigorous external validation on the UDIAT dataset.
[37] UniDGF: A Unified Detection-to-Generation Framework for Hierarchical Object Visual Recognition
Xinyu Nan,Lingtao Mao,Huangyu Dai,Zexin Zheng,Xinyu Sun,Zihan Liang,Ben Chen,Yuqing Ding,Chenyi Lei,Wenwu Ou,Han Li
Main category: cs.CV
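TL;DR: Proposes a detection-guided generative framework for fine-grained visual semantic understanding that effectively predicts hierarchical categories and attributes.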
Details
Motivation: Existing methods rely on global similarity and struggle to capture fine-grained category differences and category-specific attribute diversity, especially in large-scale e-commerce scenarios.
Method: Extracts ROI-level features from detection results and uses a BART generative model to produce category hierarchies and attribute-value pairs in coarse-to-fine order, with support for property-conditioned attribute recognition.
Result: The method significantly outperforms existing similarity-matching and multi-stage classification systems on both large-scale proprietary e-commerce datasets and open-source datasets.
Conclusion: The method delivers stronger fine-grained recognition and more coherent unified inference, and suits visual semantic understanding in complex scenarios.
Abstract: Achieving visual semantic understanding requires a unified framework that simultaneously handles object detection, category prediction, and attribute recognition. However, current advanced approaches rely on global similarity and struggle to capture fine-grained category distinctions and category-specific attribute diversity, especially in large-scale e-commerce scenarios. To overcome these challenges, we introduce a detection-guided generative framework that predicts hierarchical category and attribute tokens. For each detected object, we extract refined ROI-level features and employ a BART-based generator to produce semantic tokens in a coarse-to-fine sequence covering category hierarchies and property-value pairs, with support for property-conditioned attribute recognition. Experiments on both large-scale proprietary e-commerce datasets and open-source datasets demonstrate that our approach significantly outperforms existing similarity-based pipelines and multi-stage classification systems, achieving stronger fine-grained recognition and more coherent unified inference.
[38] Fairness in Multi-modal Medical Diagnosis with Demonstration Selection
Dawei Li,Zijian Gu,Peng Wang,Chuhan Song,Zhen Tan,Mohan Zhang,Tianlong Chen,Yu Tian,Song Wang
Main category: cs.CV
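TL;DR: Proposes FADS, a fairness-aware in-context learning method that builds demographically balanced, semantically relevant demonstrations via clustering-based sampling, reducing gender, race, and ethnicity bias in medical image reasoning while maintaining high accuracy.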
Details
Motivation: Existing debiasing methods usually depend on large labeled datasets or fine-tuning, which makes them hard to apply to large multimodal models, and demographic fairness remains a problem in medical image reasoning.
Method: Proposes FADS, which uses a clustering-based sampling strategy to select demographically balanced and semantically relevant demonstrations for in-context learning, improving fairness across groups.
Result: FADS is validated on multiple medical imaging benchmarks, where it significantly reduces performance disparities related to gender, race, and ethnicity while keeping reasoning accuracy high.
Conclusion: FADS gives large multimodal language models a tuning-free, efficient, and scalable way to improve fairness, advancing fair medical image reasoning.
Abstract: Multimodal large language models (MLLMs) have shown strong potential for medical image reasoning, yet fairness across demographic groups remains a major concern. Existing debiasing methods often rely on large labeled datasets or fine-tuning, which are impractical for foundation-scale models. We explore In-Context Learning (ICL) as a lightweight, tuning-free alternative for improving fairness. Through systematic analysis, we find that conventional demonstration selection (DS) strategies fail to ensure fairness due to demographic imbalance in selected exemplars. To address this, we propose Fairness-Aware Demonstration Selection (FADS), which builds demographically balanced and semantically relevant demonstrations via clustering-based sampling. Experiments on multiple medical imaging benchmarks show that FADS consistently reduces gender-, race-, and ethnicity-related disparities while maintaining strong accuracy, offering an efficient and scalable path toward fair medical image reasoning. These results highlight the potential of fairness-aware in-context learning as a scalable and data-efficient solution for equitable medical image reasoning.
[39] Exploiting Inter-Sample Information for Long-tailed Out-of-Distribution Detection
Nimeshika Udayangani,Hadi M. Dolatabadi,Sarah Erfani,Christopher Leckie
Main category: cs.CV
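TL;DR: Proposes a graph-based method that uses the feature space of a pre-trained model, refined with Gaussianization and graph convolutional networks, to significantly improve OOD detection under long-tailed distributions.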
Details
Motivation: With long-tailed in-distribution data, existing OOD detection methods suffer from high false positive rates and low classification accuracy on tail classes.
Method: Builds an initial graph structure from the feature space of a pre-trained model, introduces Gaussianization to correct distribution deviations in the activation layers, and refines the graph representation with graph convolutional networks (GCNs) for long-tailed OOD detection.
Result: The method significantly outperforms existing approaches on three benchmarks (CIFAR10-LT, CIFAR100-LT, and ImageNet-LT), substantially lowering FPR and raising tail-class ID classification accuracy.
Conclusion: The graph-based method eases the performance bottleneck of OOD detection under long-tailed distributions, combining robust anomaly detection with tail-class recognition.
Abstract: Detecting out-of-distribution (OOD) data is essential for safe deployment of deep neural networks (DNNs). This problem becomes particularly challenging in the presence of long-tailed in-distribution (ID) datasets, often leading to high false positive rates (FPR) and low tail-class ID classification accuracy. In this paper, we demonstrate that exploiting inter-sample relationships using a graph-based representation can significantly improve OOD detection in long-tailed recognition of vision datasets. To this end, we use the feature space of a pre-trained model to initialize our graph structure. We account for the differences between the activation layer distribution of the pre-training vs. training data, and actively introduce Gaussianization to alleviate any deviations from a standard normal distribution in the activation layers of the pre-trained model. We then refine this initial graph representation using graph convolutional networks (GCNs) to arrive at a feature space suitable for long-tailed OOD detection. This leads us to address the inferior performance observed in ID tail-classes within existing OOD detection methods. Experiments over three benchmarks, CIFAR10-LT, CIFAR100-LT, and ImageNet-LT, demonstrate that our method outperforms the state-of-the-art approaches by a large margin in terms of FPR and tail-class ID classification accuracy.
[40] Physically Realistic Sequence-Level Adversarial Clothing for Robust Human-Detection Evasion
Dingkun Zhou,Patrick P. K. Chan,Hengxu Wu,Shikang Zheng,Ruiqi Huang,Yuanjie Zhao
Main category: cs.CV
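TL;DR: Proposes a sequence-level optimization framework for generating printable adversarial textures that remain effective throughout video sequences and disrupt human detection in both digital and physical settings.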
Details
Motivation: Existing wearable adversarial attacks usually optimize textures frame by frame and struggle to stay concealed across long video sequences with motion, pose changes, and garment deformation, so a more robust, temporally consistent attack is needed.
Method: Product images are mapped to UV space and parameterized as a compact palette plus control points, with ICC color locking to keep colors printable. A physically based human-garment simulation pipeline generates video sequences with multiple viewing angles, dynamic illumination, and cloth deformation. The control points are then optimized at the sequence level with a temporally weighted expectation-over-transformation loss that minimizes detector confidence over the whole video.
Result: Experiments show strong, stable concealment in both digital and physical settings, high robustness to viewpoint changes, and good cross-model transferability. Physical garments made with sublimation printing reliably suppress detection in indoor and outdoor recordings.
Conclusion: The proposed sequence-level optimization markedly improves the effectiveness and practicality of wearable adversarial textures in real dynamic scenes, confirming feasibility in real surveillance environments.
Abstract: Deep neural networks used for human detection are highly vulnerable to adversarial manipulation, creating safety and privacy risks in real surveillance environments. Wearable attacks offer a realistic threat model, yet existing approaches usually optimize textures frame by frame and therefore fail to maintain concealment across long video sequences with motion, pose changes, and garment deformation. In this work, a sequence-level optimization framework is introduced to generate natural, printable adversarial textures for shirts, trousers, and hats that remain effective throughout entire walking videos in both digital and physical settings. Product images are first mapped to UV space and converted into a compact palette and control-point parameterization, with ICC locking to keep all colors printable. A physically based human-garment pipeline is then employed to simulate motion, multi-angle camera viewpoints, cloth dynamics, and illumination variation. An expectation-over-transformation objective with temporal weighting is used to optimize the control points so that detection confidence is minimized across whole sequences. Extensive experiments demonstrate strong and stable concealment, high robustness to viewpoint changes, and superior cross-model transferability. Physical garments produced with sublimation printing achieve reliable suppression under indoor and outdoor recordings, confirming real-world feasibility.
[41] Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution
Xiao He,Zhijun Tu,Kun Cheng,Mingrui Zhu,Jie Hu,Nannan Wang,Xinbo Gao
Main category: cs.CV
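TL;DR: Proposes Mixture-of-Ranks (MoR), a new image super-resolution architecture based on sparsely-gated Mixture-of-Experts (MoE). It treats each rank in LoRA as an independent expert and adds fine-grained expert partitioning and degradation-aware load balancing, markedly improving adaptability to complex real-world degraded images and computational efficiency.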
Details
Motivation: Existing real-world image super-resolution (Real-ISR) models are mostly dense and rely on LoRA fine-tuning of diffusion models. They struggle to adapt to the heterogeneous characteristics of complex degraded samples and cannot share knowledge between inputs under a fixed computational budget, so a more flexible and efficient architecture is needed.
Method: Proposes the Mixture-of-Ranks (MoR) architecture, which treats each LoRA rank as an independent expert under a fine-grained expert partitioning strategy. Fixed-position ranks are set aside as shared experts to preserve common-sense features. A degradation estimation module based on CLIP embeddings and positive-negative text pairs dynamically guides expert activation. Zero-expert slots and a degradation-aware load-balancing loss adjust the number of active experts to the degradation severity.
Result: Experiments show state-of-the-art performance on several real-world super-resolution benchmarks, with better adaptation to images of varying degradation and improved allocation of computation and reuse of knowledge.
Conclusion: Bringing sparse MoE into Real-ISR through the MoR architecture validates fine-grained expert design and degradation-aware routing, and suggests a new direction for efficient, adaptive image restoration models.
Abstract: The demonstrated success of sparsely-gated Mixture-of-Experts (MoE) architectures, exemplified by models such as DeepSeek and Grok, has motivated researchers to investigate their adaptation to diverse domains. In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models through Low-Rank Adaptation (LoRA) modules to reconstruct high-resolution (HR) images. However, these dense Real-ISR models are limited in their ability to adaptively capture the heterogeneous characteristics of complex real-world degraded samples or enable knowledge sharing between inputs under equivalent computational budgets. To address this, we investigate the integration of sparse MoE into Real-ISR and propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert. This design enables flexible knowledge recombination while isolating fixed-position ranks as shared experts to preserve common-sense features and minimize routing redundancy. Furthermore, we develop a degradation estimation module leveraging CLIP embeddings and predefined positive-negative text pairs to compute relative degradation scores, dynamically guiding expert activation.
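The rank-as-expert idea described above can be pictured with a small sketch. A LoRA update of rank r decomposes into r rank-one terms, so each term can be switched on or off independently. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the hard top-k selection, and the plain list of router scores are all hypothetical, and the degradation estimator that would produce those scores is not modeled.

```python
def lora_rank_outputs(x, A, B):
    """Split a LoRA update B @ A @ x into per-rank contributions.

    A has shape (r, d_in) and B has shape (d_out, r), both as nested lists.
    Rank i contributes the vector B[:, i] * (A[i] . x).
    """
    contribs = []
    for i, a_row in enumerate(A):
        coeff = sum(a * xi for a, xi in zip(a_row, x))      # scalar A[i] . x
        contribs.append([b_row[i] * coeff for b_row in B])  # vector B[:, i] * coeff
    return contribs


def mixture_of_ranks(x, A, B, router_scores, n_shared, top_k):
    """Sum the shared ranks plus the top_k routed ranks (hypothetical routing rule)."""
    contribs = lora_rank_outputs(x, A, B)
    routed = range(n_shared, len(A))                 # ranks eligible for routing
    chosen = sorted(routed, key=lambda i: router_scores[i], reverse=True)[:top_k]
    active = sorted(list(range(n_shared)) + chosen)  # shared ranks are always active
    out = [0.0] * len(B)
    for i in active:
        for j, value in enumerate(contribs[i]):
            out[j] += value
    return out, active


# Toy example: rank 3, one shared rank, one routed rank selected per input.
x = [1.0, 2.0]
A = [[1, 0], [0, 1], [1, 1]]          # (r=3, d_in=2)
B = [[1, 0, 0], [0, 1, 1]]            # (d_out=2, r=3)
out, active = mixture_of_ranks(x, A, B, router_scores=[0.0, 0.2, 0.9],
                               n_shared=1, top_k=1)
print(out, active)
```

With every rank active the result equals the ordinary dense LoRA update, which is why the per-rank view adds routing flexibility without changing the underlying parameterization.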
To better accommodate varying sample complexities, we incorporate zero-expert slots and propose a degradation-aware load-balancing loss, which dynamically adjusts the number of active experts based on degradation severity, ensuring optimal computational resource allocation. Comprehensive experiments validate our framework's effectiveness and state-of-the-art performance.
[42] Towards a Safer and Sustainable Manufacturing Process: Material classification in Laser Cutting Using Deep Learning
Mohamed Abdallah Salem,Hamdy Ahmed Ashur,Ahmed Elshinnawy
Main category: cs.CV
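TL;DR: Proposes a deep-learning method that classifies materials from laser speckle patterns for real-time monitoring and control of laser cutting, keeping classification accuracy high even when the laser color changes.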
Details
Motivation: Conventional speckle sensing loses material classification performance when the laser color changes, which affects the safety and efficiency of laser cutting, so classification robustness needs to improve.
Method: Trains a convolutional neural network (CNN) on laser speckle patterns from material surfaces to recognize different material types, and verifies classification under different laser colors.
Result: The model reaches 98.30% accuracy on the training set and 96.88% on the validation set, and an F1-score of 0.9643 on 3,000 new images of 30 materials.
Conclusion: The method is accurate and robust under different laser conditions and suits material-aware laser cutting systems.
Abstract: Laser cutting is a widely adopted technology in material processing across various industries, but it generates a significant amount of dust, smoke, and aerosols during operation, posing a risk to both the environment and workers' health. Speckle sensing has emerged as a promising method to monitor the cutting process and identify material types in real time. This paper proposes a material classification technique using a speckle pattern of the material's surface based on deep learning to monitor and control the laser cutting process. The proposed method involves training a convolutional neural network (CNN) on a dataset of laser speckle patterns to recognize distinct material types for safe and efficient cutting. Previous methods for material classification using speckle sensing may face issues when the color of the laser used to produce the speckle pattern is changed. Experiments conducted in this study demonstrate that the proposed method achieves high accuracy in material classification, even when the laser color is changed. The model achieved an accuracy of 98.30% on the training set and 96.88% on the validation set. Furthermore, the model was evaluated on a set of 3000 new images for 30 different materials, achieving an F1-score of 0.9643. The proposed method provides a robust and accurate solution for material-aware laser cutting using speckle sensing.
[43] CuriGS: Curriculum-Guided Gaussian Splatting for Sparse View Synthesis
Zijian Wu,Mingfeng Jiang,Zidian Lin,Ying Song,Hanjie Ma,Qun Wu,Dongping Zhang,Guiyang Pu
Main category: cs.CV
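TL;DR: Proposes CuriGS, a curriculum-learning framework for sparse-view reconstruction with 3D Gaussian Splatting. It introduces pseudo-views (student views) at multiple perturbation levels and optimizes training progressively, improving rendering quality and geometric consistency under sparse views.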
Details
Motivation: Under sparse views, insufficient supervision and limited viewpoint coverage cause overfitting, so 3D Gaussian Splatting struggles to reconstruct scenes well. A method is needed that uses data more efficiently and reduces overfitting.
Method: Proposes the CuriGS framework, which generates pseudo-views (students) at different perturbation levels around ground-truth poses (teachers). A curriculum schedule progressively unlocks students at higher perturbation levels for training. Students are constrained with depth-correlation and co-regularization, scored with a metric that combines SSIM, LPIPS, and an image-quality measure, and the best ones are added to the training set to augment the sparse data.
Result: Experiments show that CuriGS outperforms state-of-the-art methods across a range of synthetic and real sparse-view scenes in both rendering fidelity and geometric consistency.
Conclusion: Through curriculum-guided pseudo-view generation and selection, CuriGS addresses 3DGS overfitting under sparse views and markedly improves 3D reconstruction quality with sparse data.
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as an efficient, high-fidelity representation for real-time scene reconstruction and rendering. However, extending 3DGS to sparse-view settings remains challenging because of supervision scarcity and overfitting caused by limited viewpoint coverage. In this paper, we present CuriGS, a curriculum-guided framework for sparse-view 3D reconstruction using 3DGS. CuriGS addresses the core challenge of sparse-view synthesis by introducing student views: pseudo-views sampled around ground-truth poses (teacher). For each teacher, we generate multiple groups of student views with different perturbation levels. During training, we follow a curriculum schedule that gradually unlocks higher perturbation levels, randomly sampling candidate students from the active level to assist training. Each sampled student is regularized via depth-correlation and co-regularization, and evaluated using a multi-signal metric that combines SSIM, LPIPS, and an image-quality measure. For every teacher and perturbation level, we periodically retain the best-performing students and promote those that satisfy a predefined quality threshold to the training set, resulting in a stable augmentation of sparse training views. Experimental results show that CuriGS outperforms state-of-the-art baselines in both rendering fidelity and geometric consistency across various synthetic and real sparse-view scenes. Project page: https://zijian1026.github.io/CuriGS/
[44] Crossmodal learning for Crop Canopy Trait Estimation
Timilehin T. Ayanlade,Anirudha Powadi,Talukder Z. Jubery,Baskar Ganapathysubramanian,Soumik Sarkar
Main category: cs.CV
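TL;DR: Proposes a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level detail for crop canopy trait estimation, outperforming real satellite imagery on tasks such as yield and nitrogen prediction.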
Details
Motivation: Satellite imagery is limited by spatial resolution and falls short of what modern micro-plot farm management requires, while UAV imagery is precise but costly and operationally complex. The strengths of both need to be combined.
Method: Uses a dataset of approximately co-registered satellite-UAV image pairs to train a model that learns fine-grained spectral-spatial correspondences between the modalities, producing UAV-like representations from satellite inputs.
Result: The generated UAV-like representations consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction.
Conclusion: Cross-modal correspondence learning can bridge the sensing gap between satellites and UAVs in agricultural monitoring and improve crop monitoring performance.
Abstract: Recent advances in plant phenotyping have driven widespread adoption of multi-sensor platforms for collecting crop canopy reflectance data. This includes the collection of heterogeneous data across multiple platforms, with Unmanned Aerial Vehicles (UAV) seeing significant usage due to their high performance in crop monitoring, forecasting, and prediction tasks. Similarly, satellite missions have been shown to be effective for agriculturally relevant tasks. In contrast to UAVs, such missions are bound to the limitation of spatial resolution, which hinders their effectiveness for modern farming systems focused on micro-plot management. In this work, we propose a cross-modal learning strategy that enriches high-resolution satellite imagery with UAV-level visual detail for crop canopy trait estimation. Using a dataset of approximately co-registered satellite-UAV image pairs collected from replicated plots of 84 hybrid maize varieties across five distinct locations in the U.S. Corn Belt, we train a model that learns fine-grained spectral-spatial correspondences between sensing modalities. Results show that the generated UAV-like representations from satellite inputs consistently outperform real satellite imagery on multiple downstream tasks, including yield and nitrogen prediction, demonstrating the potential of cross-modal correspondence learning to bridge the gap between satellite and UAV sensing in agricultural monitoring.
[45] LLMs-based Augmentation for Domain Adaptation in Long-tailed Food Datasets
Qing Wang,Chong-Wah Ngo,Ee-Peng Lim,Qianru Sun
Main category: cs.CV
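TL;DR: Proposes a framework based on large language models (LLMs) that maps images and generated text (such as food titles and ingredients) into a shared embedding space to tackle domain shift, long-tailed distributions, and fine-grained classification in food recognition.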
Details
Motivation: Food recognition faces domain shift between training data and images taken by real users, long-tailed data distributions, and subtle visual differences between dishes, which conventional methods handle poorly.
Method: Uses LLMs to parse food images and generate food titles and ingredients, projects the generated text and the images from different domains into a shared embedding space to maximize pair similarity, and fuses the aligned features of both modalities for recognition.
Result: On two food datasets, the method outperforms existing approaches designed for long-tailed distributions, domain adaptation, and fine-grained classification.
Conclusion: The proposed LLM-driven multimodal alignment framework addresses several challenges in food recognition at once, with good recognition performance and application potential.
Abstract: Training a model for food recognition is challenging because the training samples, which are typically crawled from the Internet, are visually different from the pictures captured by users in the free-living environment. In addition to this domain-shift problem, real-world food datasets tend to be long-tailed, and some dishes of different categories exhibit subtle variations that are difficult to distinguish visually. In this paper, we present a framework empowered with large language models (LLMs) to address these challenges in food recognition. We first leverage LLMs to parse food images to generate food titles and ingredients. Then, we project the generated texts and food images from different domains to a shared embedding space to maximize the pair similarities. Finally, we take the aligned features of both modalities for recognition. With this simple framework, we show that our proposed approach can outperform the existing approaches tailored for long-tailed data distribution, domain adaptation, and fine-grained classification, respectively, on two food datasets.
[46] AMS-KV: Adaptive KV Caching in Multi-Scale Visual Autoregressive Transformers
Boxun Xu,Yu Wang,Zihu Wang,Peng Li
Main category: cs.CV
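TL;DR: Proposes AMS-KV, an adaptive KV caching policy for next-scale prediction in visual autoregressive models that substantially reduces memory usage and compute latency, improving generation efficiency and scalability.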
Details
Motivation: In visual autoregressive models based on next-scale prediction, the KV cache grows sharply as the number of scales increases, severely limiting scalability, and existing methods lack effective cache management for this setting.
Method: A systematic analysis finds that local scales and condensed coarse scales are critical for generation quality, and that network layers differ in their cross-scale similarity. Building on this, AMS-KV prioritizes retaining the KV cache of these key scales and uses cross-scale similarity to identify cache-demanding layers and optimize cache use.
Result: Compared with the baseline, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%, and runs stably at a batch size of 256, whereas the baseline runs out of memory at 128.
Conclusion: AMS-KV addresses excessive KV cache growth in VAR models, greatly improving scalability and inference efficiency while maintaining generation quality, and offers a workable cache management scheme for large-scale image generation.
Abstract: Visual autoregressive modeling (VAR) via next-scale prediction has emerged as a scalable image generation paradigm. While Key and Value (KV) caching in large language models (LLMs) has been extensively studied, next-scale prediction presents unique challenges, and KV caching design for next-scale based VAR transformers remains largely unexplored. A major bottleneck is the excessive KV memory growth with the increasing number of scales, severely limiting scalability. Our systematic investigation reveals that: (1) attending to tokens from local scales significantly contributes to generation quality; (2) allocating a small amount of memory for the coarsest scales, termed condensed scales, stabilizes multi-scale image generation; and (3) strong KV similarity across finer scales is predominantly observed in cache-efficient layers, whereas cache-demanding layers exhibit weaker inter-scale similarity. Based on these observations, we introduce AMS-KV, a scale-adaptive KV caching policy for next-scale prediction in VAR models. AMS-KV prioritizes storing KVs from condensed and local scales, preserving the most relevant tokens to maintain generation quality. It further optimizes KV cache utilization and computational efficiency by identifying cache-demanding layers through inter-scale similarity analysis. Compared to the vanilla next-scale prediction-based VAR models, AMS-KV reduces KV cache usage by up to 84.83% and self-attention latency by 60.48%.
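To make the "condensed plus local scales" retention rule concrete, the sketch below counts how many cached tokens survive under such a policy. It is an illustration under assumptions rather than AMS-KV itself: the function names, the choice of two condensed and two local scales, and the token-map side lengths are hypothetical, and the per-layer adaptivity described in the abstract is left out.

```python
def kept_scales(current, n_condensed, n_local):
    """Indices of earlier scales whose KV entries are retained while generating `current`.

    Condensed scales are the coarsest ones (lowest indices); local scales are the
    most recently generated ones (highest indices below `current`).
    """
    previous = list(range(current))                 # scales already generated
    condensed = set(previous[:n_condensed])
    local = set(previous[-n_local:]) if n_local > 0 else set()
    return sorted(condensed | local)


def cached_tokens(side_lengths, scales):
    """Number of cached tokens if each scale s is a side x side token map."""
    return sum(side_lengths[s] ** 2 for s in scales)


# Illustrative multi-scale schedule (side length of the token map at each scale).
side_lengths = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]
current = len(side_lengths) - 1                     # generating the final scale

kept = kept_scales(current, n_condensed=2, n_local=2)
full = cached_tokens(side_lengths, range(current))  # cache everything
reduced = cached_tokens(side_lengths, kept)         # cache only kept scales
print(kept, reduced, full, round(1 - reduced / full, 3))
```

The savings in this toy setting depend entirely on the assumed schedule and on how many scales are kept, so the numbers it prints are not comparable to the figures reported in the abstract.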
Moreover, when the baseline VAR-d30 model encounters out-of-memory failures at a batch size of 128, AMS-KV enables stable scaling to a batch size of 256 with improved throughput.
[47] LiSTAR: Ray-Centric World Models for 4D LiDAR Sequences in Autonomous Driving
Pei Liu,Songtao Wang,Lang Zhang,Xingyue Peng,Yuandong Lyu,Jiaxin Deng,Songxin Lu,Weiliang Ma,Xueyang Zhang,Yifei Zhan,XianPeng Lang,Jun Ma
Main category: cs.CV
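TL;DR: Proposes LiSTAR, a new generative world model built on the sensor's native geometry for synthesizing high-fidelity, controllable 4D LiDAR data. It uses a Hybrid-Cylindrical-Spherical (HCS) representation to reduce quantization error and a Spatio-Temporal Attention with Ray-Centric Transformer (START) to strengthen temporal coherence. It also introduces a 4D point cloud-aligned voxel layout and the MaskSTART framework for layout-guided compositional generation. Experiments show it significantly outperforms existing methods on reconstruction, prediction, and conditional generation.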
Details
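Motivation: Synthesizing high-quality, controllable 4D LiDAR data is difficult because of the sensor's unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. Existing methods are limited in preserving data fidelity and temporal coherence, and fall short of the scalability and realism that autonomous driving simulation requires.
Method: Proposes LiSTAR, which 1) adopts a Hybrid-Cylindrical-Spherical (HCS) representation that keeps the sensor's native geometry while reducing the quantization artifacts of Cartesian grids; 2) designs a Spatio-Temporal Attention with Ray-Centric Transformer (START) that models feature evolution along individual sensor rays to better capture dynamics from sparse temporal data; and 3) introduces a 4D point cloud-aligned voxel layout and a discrete Masked Generative START (MaskSTART) framework that learns a compact, tokenized scene representation for efficient, high-resolution, layout-guided controllable generation.
Result: LiSTAR reaches state-of-the-art performance on 4D LiDAR reconstruction, prediction, and conditional generation: generation MMD falls by 76%, reconstruction IoU rises by 32%, and prediction L1 Med falls by 50%. The model generates high-resolution, temporally coherent, layout-controllable 4D point cloud sequences.
Conclusion: Through native-geometry modeling, ray-level dynamics modeling, and structured conditional generation, LiSTAR addresses the fidelity, temporal consistency, and controllability problems of 4D LiDAR synthesis and provides a practical new tool for autonomous driving simulation.
Abstract: Synthesizing high-fidelity and controllable 4D LiDAR data is crucial for creating scalable simulation environments for autonomous driving. This task is inherently challenging due to the sensor's unique spherical geometry, the temporal sparsity of point clouds, and the complexity of dynamic scenes. To address these challenges, we present LiSTAR, a novel generative world model that operates directly on the sensor's native geometry. LiSTAR introduces a Hybrid-Cylindrical-Spherical (HCS) representation to preserve data fidelity by mitigating quantization artifacts common in Cartesian grids. To capture complex dynamics from sparse temporal data, it utilizes a Spatio-Temporal Attention with Ray-Centric Transformer (START) that explicitly models feature evolution along individual sensor rays for robust temporal coherence. Furthermore, for controllable synthesis, we propose a novel 4D point cloud-aligned voxel layout for conditioning and a corresponding discrete Masked Generative START (MaskSTART) framework, which learns a compact, tokenized representation of the scene, enabling efficient, high-resolution, and layout-guided compositional generation.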
Comprehensive experiments validate LiSTAR's state-of-the-art performance across 4D LiDAR reconstruction, prediction, and conditional generation, with substantial quantitative gains: reducing generation MMD by 76%, improving reconstruction IoU by 32%, and lowering prediction L1 Med by 50%. This level of performance provides a powerful new foundation for creating realistic and controllable autonomous systems simulations. Project link: https://ocean-luna.github.io/LiSTAR.gitub.io.
[48] VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning
Zishan Xu,Yifu Guo,Yuquan Lu,Fengyu Yang,Junxin Li
Main category: cs.CV
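TL;DR: Proposes VideoSeg-R1, the first framework to bring reinforcement learning into video reasoning segmentation. It uses a decoupled architecture that combines referring image segmentation with video mask propagation and achieves state-of-the-art performance in complex scenarios.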
Details
Motivation: Traditional video segmentation methods rely on supervised fine-tuning, generalize poorly, lack explicit reasoning, and struggle with out-of-distribution scenarios.
Method: Proposes a three-stage decoupled architecture: hierarchical text-guided frame sampling, a reasoning model that produces spatial cues and explicit reasoning chains, and a segmentation-propagation stage based on SAM2 and XMem. A task difficulty-aware mechanism dynamically controls reasoning length.
Result: The framework achieves state-of-the-art performance on complex video reasoning and segmentation tasks across multiple benchmarks.
Conclusion: By introducing reinforcement learning and explicit reasoning, VideoSeg-R1 markedly improves the generalization and interpretability of video segmentation in complex and out-of-distribution scenarios.
Abstract: Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) a hierarchical text-guided frame sampler to emulate human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.
[49] SpectralTrain: A Universal Framework for Hyperspectral Image Classification
Meihua Zhou,Liping Yu,Jiawei Cai,Wai Kin Fung,Ruiguo Hu,Jiarui Zhao,Wenzhuo Liu,Nan Wan
Main category: cs.CV
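TL;DR: SpectralTrain is a universal, architecture-agnostic training framework that combines curriculum learning with PCA-based spectral downsampling. It substantially lowers the computational cost of hyperspectral image classification while keeping accuracy at a good level, and applies to many models and use cases.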
Details
Motivation: Hyperspectral image classification usually involves large-scale data and computationally expensive training, which limits the deployment of deep learning models in practical remote sensing tasks. An efficient, general training method is needed to improve learning efficiency.
Method: Proposes the SpectralTrain framework, which combines curriculum learning (CL) with spectral downsampling based on principal component analysis (PCA). Spectral complexity is introduced gradually during training while essential information is preserved, making the learning of spectral-spatial patterns more efficient. The method does not depend on a specific network architecture, optimizer, or loss function and is broadly compatible.
Result: Experiments on three benchmark datasets (Indian Pines, Salinas-A, and the newly introduced CloudPatch-7) show training speedups of 2-7x with a slight accuracy drop that stays within an acceptable range. The framework generalizes across spatial scales, spectral characteristics, and application domains, and its use in cloud classification points to potential in climate-related remote sensing.
Conclusion: By optimizing the training strategy instead of the network structure, SpectralTrain improves training efficiency for hyperspectral image classification and serves as a useful complement to architectural design, with broad applicability.
Abstract: Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral-spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets (Indian Pines, Salinas-A, and the newly introduced CloudPatch-7) demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent reductions in training time, with 2-7x speedups and small-to-moderate accuracy deltas depending on the backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models.
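The phrase "gradually introducing spectral complexity" can be read as a schedule over the number of principal components kept at each training stage. The sketch below shows one such schedule. It is an assumption-laden illustration: the linear ramp, the function name, and the component counts are hypothetical and are not taken from the paper, and the PCA projection itself is not implemented here.

```python
def components_for_epoch(epoch, total_epochs, min_components, max_components):
    """Linearly ramp the number of retained principal components over training.

    Early epochs see a heavily downsampled spectrum (few components, cheap to
    train on); later epochs see progressively more spectral detail.
    """
    if total_epochs <= 1:
        return max_components
    frac = epoch / (total_epochs - 1)               # 0.0 at the start, 1.0 at the end
    return min_components + round(frac * (max_components - min_components))


# Hypothetical 5-epoch curriculum going from 4 to 20 components.
schedule = [components_for_epoch(e, 5, 4, 20) for e in range(5)]
print(schedule)
```

In a full pipeline, each epoch would project the hyperspectral cube onto the scheduled number of components before feeding it to whichever backbone is being trained, which is consistent with the architecture-agnostic framing in the abstract.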
Code is available at https://github.com/mh-zhou/SpectralTrain.
[50] Rad-GS: Radar-Vision Integration for 3D Gaussian Splatting SLAM in Outdoor Environments
Renxiang Xiao,Wei Liu,Yuanfan Zhang,Yushuai Chen,Jinming Chen,Zilu Wang,Liang Hu
Main category: cs.CV
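TL;DR: Rad-GS is a 4D radar-camera SLAM system built on a 3D Gaussian representation for kilometer-scale outdoor environments. It combines radar point clouds with image information to achieve accurate localization and high-quality reconstruction.
Details
Motivation: Traditional SLAM systems in large-scale outdoor environments suffer from dynamic-object interference, missing texture, and high memory consumption, while vision-only or LiDAR methods are sensitive to adverse weather. This work explores whether robust outdoor mapping is feasible using only 4D mmWave radar and a camera.
Method: Rad-GS uses the Doppler information of raw radar point clouds together with geometrically enhanced point clouds to mask dynamic objects, which guides rendering optimization in synchronized images. Unsynchronized image frames globally refine the 3D Gaussian representation, and an octree structure with a Gaussian primitive management strategy suppresses noise and lowers memory overhead.
Result: Experiments show that Rad-GS is comparable in localization accuracy and reconstruction quality to traditional 3D Gaussian methods based on camera or LiDAR input. Kilometer-scale real-world scenes validate its large-scale reconstruction capability, with significantly reduced memory consumption.
Conclusion: Rad-GS demonstrates the potential of 4D mmWave radar for large-scale outdoor SLAM and achieves robust scene mapping, offering a feasible approach for future low-power, all-weather navigation.
Abstract: We present Rad-GS, a 4D radar-camera SLAM system designed for kilometer-scale outdoor environments, utilizing 3D Gaussians as a differentiable spatial representation. Rad-GS combines the advantages of raw radar point clouds with Doppler information and geometrically enhanced point clouds to guide dynamic object masking in synchronized images, thereby alleviating rendering artifacts and improving localization accuracy. Additionally, unsynchronized image frames are leveraged to globally refine the 3D Gaussian representation, enhancing texture consistency and novel view synthesis fidelity. Furthermore, a global octree structure coupled with a targeted Gaussian primitive management strategy suppresses noise and significantly reduces memory consumption in large-scale environments. Extensive experiments and ablation studies demonstrate that Rad-GS achieves performance comparable to traditional 3D Gaussian methods based on camera or LiDAR inputs, highlighting the feasibility of robust outdoor mapping using 4D mmWave radar. Real-world reconstruction at kilometer scale validates the potential of Rad-GS for large-scale scene reconstruction.
[51] T2T-VICL: Unlocking the Boundaries of Cross-Task Visual In-Context Learning via Implicit Text-Driven VLMs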
Shao-Jun Xia,Huixin Zhang,Zhengzhong Tu
Main category: cs.CV
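TL;DR: This paper proposes T2T-VICL, a fully collaborative pipeline for studying the potential of vision-language models (VLMs) in cross-task visual in-context learning (VICL). It designs a mechanism to generate and select text prompts, builds the first cross-task VICL dataset, and combines perceptual score-based reasoning with traditional evaluation metrics, achieving strong performance across multiple cross-task scenarios.
Details
Motivation: To explore whether VLMs can still perform VICL when the visual prompt and the target image come from different visual tasks, thereby extending the boundaries of VICL.
Method: The authors design a mechanism that generates and selects text prompts implicitly describing the differences between two distinct low-level vision tasks, construct the first cross-task VICL dataset, and propose a new inference framework that combines perceptual score-based reasoning with traditional evaluation metrics.
Result: The approach achieves top-tier results in nine cross-task scenarios and second-tier performance in ten additional scenarios.
Conclusion: T2T-VICL significantly improves VLM capability in cross-task VICL and confirms its potential for transfer across diverse visual tasks.
Abstract: In large language models (LLMs), in-context learning (ICL) refers to performing new tasks by conditioning on a small number of demonstrations provided in the input context. Recent advances in visual in-context learning (VICL) demonstrate promising capabilities for solving downstream tasks with unified vision-language models (VLMs). When the visual prompt and the target images originate from different visual tasks, can VLMs still enable VICL? In this paper, we propose a fully collaborative pipeline, i.e., T2T-VICL, for VLMs to investigate the potential of cross-task VICL. Fundamentally, we design a mechanism to generate and select text prompts that best implicitly describe the differences between two distinct low-level vision tasks, and construct the first cross-task VICL dataset. Building upon this, we propose a novel inference framework that combines perceptual score-based reasoning with traditional evaluation metrics to perform cross-task VICL. Our approach achieves top-tier results across nine cross-task scenarios and second-tier performance in ten additional scenarios, unlocking the boundaries of cross-task VICL within VLMs.
[52] Clustered Error Correction with Grouped 4D Gaussian Splatting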
Taeho Kang,Jaeyeon Park,Kyungjin Lee,Youngki Lee
Main category: cs.CV
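TL;DR: Proposes a new 4D Gaussian Splatting method that uses elliptical error clustering and a grouping strategy to improve temporal consistency and rendering quality in dynamic scene reconstruction.
Details
Motivation: Existing 4D Gaussian Splatting methods suffer from ambiguous pixel correspondences and insufficient densification in dynamic regions, which degrades reconstruction accuracy.
Method: The method introduces elliptical error clustering and error-correcting splat addition to identify dynamic regions and initialize splats there. Grouped 4D Gaussian Splatting improves the consistency of the mapping between splats and dynamic objects. Missing-color and occlusion errors are corrected via backprojection or foreground splitting guided by cross-view color consistency.
Result: Evaluated on the Neural 3D Video and Technicolor datasets, the method significantly improves temporal consistency and raises PSNR by 0.39 dB on the Technicolor Light Field dataset. Visualizations show better alignment between splats and dynamic objects and confirm that the error correction is effective.
Conclusion: The proposed method improves 4D reconstruction quality for dynamic scenes and achieves state-of-the-art perceptual rendering quality.
Abstract: Existing 4D Gaussian Splatting (4DGS) methods struggle to accurately reconstruct dynamic scenes, often failing to resolve ambiguous pixel correspondences and inadequate densification in dynamic regions. We address these issues by introducing a novel method composed of two key components: (1) Elliptical Error Clustering and Error Correcting Splat Addition, which pinpoints dynamic areas to improve and initializes fitting splats, and (2) Grouped 4D Gaussian Splatting, which improves the consistency of the mapping between splats and the dynamic objects they represent. Specifically, we classify rendering errors into missing-color and occlusion types, then apply targeted corrections via backprojection or foreground splitting guided by cross-view color consistency. Evaluations on the Neural 3D Video and Technicolor datasets demonstrate that our approach significantly improves temporal consistency and achieves state-of-the-art perceptual rendering quality, improving PSNR by 0.39 dB on the Technicolor Light Field dataset. Our visualization shows improved alignment between splats and dynamic objects, and the error correction method's capability to identify errors and properly initialize new splats. Our implementation details and source code are available at https://github.com/tho-kn/cem-4dgs.
[53] Decoupling Complexity from Scale in Latent Diffusion Model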
Tianxiong Zhong,Xingye Tian,Xuebo Wang,Boyuan Jiang,Xin Tao,Pengfei Wan
Main category: cs.CV
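TL;DR: Proposes DCS-LDM, a new paradigm for visual generation that decouples information complexity from scale. By building a hierarchical, scale-independent latent space, it supports flexible generation at arbitrary resolutions and frame rates.
Details
Motivation: Existing diffusion models couple scale with content complexity, which uses latent capacity inefficiently. The capacity actually required depends mainly on content complexity, with scale acting only as an upper bound.
Method: DCS-LDM builds a hierarchical, scale-independent latent space that models sample complexity with multi-level tokens and supports decoding to arbitrary resolutions and frame rates from a fixed latent representation. Structural and detailed information are decomposed across levels, which enables coarse-to-fine generation.
Result: DCS-LDM performs comparably to state-of-the-art methods while offering flexible generation across different scales and visual qualities, and it supports a flexible computation-quality tradeoff.
Conclusion: DCS-LDM effectively decouples complexity from scale in visual generation, providing a more efficient latent representation and flexible generation across resolutions and frame rates.
Abstract: Existing latent diffusion models typically couple scale with content complexity, using more latent tokens to represent higher-resolution images or higher-frame-rate videos. However, the latent capacity required to represent visual data primarily depends on content complexity, with scale serving only as an upper bound. Motivated by this observation, we propose DCS-LDM, a novel paradigm for visual generation that decouples information complexity from scale. DCS-LDM constructs a hierarchical, scale-independent latent space that models sample complexity through multi-level tokens and supports decoding to arbitrary resolutions and frame rates within a fixed latent representation. This latent space enables DCS-LDM to achieve a flexible computation-quality tradeoff. Furthermore, by decomposing structural and detailed information across levels, DCS-LDM supports a progressive coarse-to-fine generation paradigm. Experimental results show that DCS-LDM delivers performance comparable to state-of-the-art methods while offering flexible generation across diverse scales and visual qualities.
[54] VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation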
Chenyang Wu,Jiayi Fu,Chun-Le Guo,Shuhao Han,Chongyi Li
Main category: cs.CV
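TL;DR: This paper proposes VTinker, a new video frame interpolation method that uses guided flow upsampling and texture mapping to address blur, mosaic, and ghosting in high-resolution inter-frame motion estimation, significantly improving interpolation quality.
Details
Motivation: Large pixel displacements and high computational cost make motion estimation for high-resolution video frames difficult. Existing methods predict optical flow at low resolution and apply high-magnification upsampling, which tends to blur or mosaic flow edges and cannot adequately capture fine-grained pixel motion. The resulting misalignment of task-oriented flows causes ghosting and discontinuities in interpolated frames.
Method: The VTinker framework has two core components: guided flow upsampling (GFU) and Texture Mapping. GFU uses the input frames as guidance to refine bilinearly upsampled flows and sharpen their edges. Texture Mapping first generates an intermediate proxy frame, then selects clear texture blocks from the input frames and maps them onto the proxy, and finally produces the interpolated frame through a reconstruction module.
Result: Experiments show that VTinker achieves state-of-the-art video frame interpolation performance on multiple datasets, effectively reducing ghosting and discontinuities and improving visual quality.
Conclusion: By introducing guidance information and a texture mapping mechanism, VTinker significantly improves flow accuracy and image quality in high-resolution video frame interpolation, providing an effective solution for VFI.
Abstract: Due to large pixel movement and high computational cost, estimating the motion of high-resolution frames is challenging. Thus, most flow-based Video Frame Interpolation (VFI) methods first predict bidirectional flows at low resolution and then use high-magnification upsampling (e.g., bilinear) to obtain the high-resolution ones. However, this kind of upsampling strategy may cause blur or mosaic at the flows' edges. Additionally, the motion of fine pixels at high resolution cannot be adequately captured in motion estimation at low resolution, which leads to the misalignment of task-oriented flows. With such inaccurate flows, input frames are warped and combined pixel-by-pixel, resulting in ghosting and discontinuities in the interpolated frame. In this study, we propose a novel VFI pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping. After motion estimation at low resolution, GFU introduces input frames as guidance to alleviate the blurred details in bilinearly upsampled flows, which makes the flows' edges clearer. Subsequently, to avoid pixel-level ghosting and discontinuities, Texture Mapping generates an initial interpolated frame, referred to as the intermediate proxy. The proxy serves as a cue for selecting clear texture blocks from the input frames, which are then mapped onto the proxy to facilitate producing the final interpolated frame via a reconstruction module. Extensive experiments demonstrate that VTinker achieves state-of-the-art performance in VFI.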
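For context, the baseline that the abstract criticizes (plain bilinear upsampling of a low-resolution flow field) can be sketched as follows. This is not GFU and not code from the VTinker repository. The function name and the half-pixel sampling convention are assumptions for illustration. The point it shows is that bilinear weights blend flow vectors across motion boundaries, which is where the edge blur comes from.

```python
import numpy as np

def upsample_flow_bilinear(flow: np.ndarray, scale: int) -> np.ndarray:
    """Bilinearly upsample an (H, W, 2) flow field by an integer factor.

    Displacements are multiplied by `scale` so that they stay valid in
    high-resolution pixel units.
    """
    h, w, _ = flow.shape
    out_h, out_w = h * scale, w * scale
    # Map output pixel centers back to input coordinates (half-pixel convention).
    ys = np.clip((np.arange(out_h) + 0.5) / scale - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) / scale - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :, None]  # horizontal interpolation weights
    top = flow[y0][:, x0] * (1 - wx) + flow[y0][:, x1] * wx
    bottom = flow[y1][:, x0] * (1 - wx) + flow[y1][:, x1] * wx
    return (top * (1 - wy) + bottom * wy) * scale

if __name__ == "__main__":
    # Two regions moving in opposite directions: the upsampled flow contains
    # intermediate values along the boundary that match neither motion.
    low = np.zeros((4, 4, 2))
    low[:, :2, 0] = -1.0
    low[:, 2:, 0] = 1.0
    high = upsample_flow_bilinear(low, 4)
    print(high.shape)
    print(np.unique(np.round(high[0, :, 0], 2)))
```

A guided scheme such as the one the abstract describes would use the high-resolution input frames to decide which side of a boundary each pixel belongs to, instead of averaging across it.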
Code is available at: https://github.com/Wucy0519/VTinker.
[55] Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
Caixin Kang,Yifei Huang,Liangyang Ouyang,Mingfang Zhang,Ruicong Liu,Yoichi Sato
Main category: cs.CV
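TL;DR: This paper introduces the Multimodal Interactive Deception Assessment (MIDA) task and builds a multimodal dataset with ground-truth labels to evaluate how existing MLLMs perform at social reasoning. Experiments show that current models struggle to combine multimodal social cues for deception detection and lack the ability to model others' beliefs and intentions. To address this, the authors design the SoCoT reasoning pipeline and the DSEM module, which improve performance on the task and move MLLMs toward human-like social reasoning.
Details
Motivation: Although existing multimodal large language models have strong reasoning capabilities, they fall clearly short on abilities central to human intelligence, such as understanding complex social interactions and seeing through deception; they cannot "read the room". Quantifying this gap requires a new task and dataset that systematically evaluate models' social cognition.
Method: The authors propose the MIDA task and a multimodal dataset of synchronized video and text with ground-truth labels for every statement, build a benchmark covering 12 mainstream MLLMs, and design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module to improve social reasoning.
Result: Existing MLLMs (such as GPT-4o) perform poorly on deception detection, failing to fuse multimodal social cues effectively or to model others' mental states. The proposed SoCoT and DSEM framework improves performance, which supports its effectiveness.
Conclusion: Current MLLMs have fundamental limitations in social cognition and need new mechanisms that model mental states and dynamic social information. SoCoT and DSEM offer a feasible path toward more perceptive and trustworthy socially aware AI.
Abstract: Despite their advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence: the ability to "read the room" and assess deception in complex social interactions. To rigorously quantify this failure, we introduce a new task, Multimodal Interactive Deception Assessment (MIDA), and present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating 12 state-of-the-art open- and closed-source MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to effectively ground language in multimodal social cues and lack the ability to model what others know, believe, or intend, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems. To take a step forward, we design a Social Chain-of-Thought (SoCoT) reasoning pipeline and a Dynamic Social Epistemic Memory (DSEM) module. Our framework yields performance improvement on this challenging task, demonstrating a promising new path toward building MLLMs capable of genuine human-like social reasoning.
[56] How Noise Benefits AI-generated Image Detection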
Jiazhen Yan,Ziqiang Li,Fan Wang,Kai Zeng,Zhangjie Fu
Main category: cs.CV
TL;DR: Proposes Positive-Incentive Noise for CLIP (PiN-CLIP), which injects controllable noise in feature space to suppress a detector's reliance on shortcut features and improve generalization in AI-generated image detection. It reaches state-of-the-art performance on an open-world dataset covering 42 generative models.
Details
Motivation: Existing AI-generated image detectors generalize poorly to out-of-distribution data, mainly because models rely on spurious shortcut features during training.
Method: PiN-CLIP couples a noise generator with a detection network under a variational positive-incentive principle. Positive-incentive noise is constructed in feature space through cross-attention fusion of visual and categorical semantic features. The noise is injected to fine-tune the visual encoder, suppressing shortcut-sensitive directions and amplifying stable forensic cues.
Result: In experiments on an open-world dataset covering 42 generative models, average accuracy improves by 5.4 over existing methods, reaching state-of-the-art performance on generated-image detection.
Conclusion: PiN-CLIP improves detector generalization to unseen generative models. Its feature-space noise injection makes feature representations more robust and offers a new approach to generated-image detection.
Abstract: The rapid advancement of generative models has made real and synthetic images increasingly indistinguishable. Although extensive efforts have been devoted to detecting AI-generated images, out-of-distribution generalization remains a persistent challenge. We trace this weakness to spurious shortcuts exploited during training, and we also observe that small feature-space perturbations can mitigate shortcut dominance. To address this problem in a more controllable manner, we propose the Positive-Incentive Noise for CLIP (PiN-CLIP), which jointly trains a noise generator and a detection network under a variational positive-incentive principle. Specifically, we construct positive-incentive noise in the feature space via cross-attention fusion of visual and categorical semantic features. During optimization, the noise is injected into the feature space to fine-tune the visual encoder, suppressing shortcut-sensitive directions while amplifying stable forensic cues, thereby enabling the extraction of more robust and generalized artifact representations. Comparative experiments are conducted on an open-world dataset comprising synthetic images generated by 42 distinct generative models. Our method achieves new state-of-the-art performance, with notable improvements of 5.4 in average accuracy over existing approaches.
[57] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Boshen Xu,Zihan Xiao,Jiaze Li,Jianzhong Ju,Zhenbo Luo,Jian Luan,Qin Jin
Main category: cs.CV
TL;DR: TimeViper is a hybrid vision-language model that combines Mamba and Transformer architectures for long video understanding. It introduces the TransV module to compress vision tokens and improve efficiency.
Details
Motivation: Processing long videos requires both an efficient model architecture and an effective mechanism for long temporal context. Existing methods suffer from vision token redundancy during information aggregation.
Method: TimeViper uses a hybrid Mamba-Transformer backbone and proposes the TransV module, which transfers and compresses vision token information into instruction tokens.
Result: TimeViper can process hour-long videos exceeding 10,000 frames, performs well on multiple benchmarks, and reveals a vision-to-text information aggregation phenomenon.
Conclusion: TimeViper is an initial but important step toward developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
Abstract: We introduce TimeViper, a hybrid vision-language model designed to tackle the challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
[58] Degradation-Aware Hierarchical Termination for Blind Quality Enhancement of Compressed Video
Li Yu,Yingbo Zhao,Shiyu Wu,Siyue Yu,Moncef Gabbouj,Qingshan Liu
Main category: cs.CV
TL;DR: Proposes a blind video quality enhancement method built on a pretrained Degradation Representation Learning (DRL) module and a hierarchical termination mechanism. It extracts multiscale degradation features and dynamically adjusts the amount of network computation, outperforming existing methods in both quality and efficiency.
Details
Motivation: Existing blind video quality enhancement methods use only global degradation information and lack spatial detail. Most methods also apply a uniform architecture across compression levels, ignoring that computational needs differ with the degree of compression.
Method: A pretrained Degradation Representation Learning (DRL) module decouples and extracts high-dimensional, multiscale degradation representations from video content. A hierarchical termination mechanism dynamically adjusts the number of artifact reduction stages according to the compression level.
Result: At QP = 22, PSNR improves by 110% over a state-of-the-art blind method (from 0.31 dB to 0.65 dB), and the average inference time at QP = 22 is half that at QP = 42.
Conclusion: With finer-grained degradation representations and an adaptive computation structure, the proposed method improves both quality and efficiency in blind video quality enhancement.
Abstract: Existing studies on Quality Enhancement for Compressed Video (QECV) predominantly rely on known Quantization Parameters (QPs), employing distinct enhancement models per QP setting, termed non-blind methods. However, in real-world scenarios involving transcoding or transmission, QPs may be partially or entirely unknown, limiting the applicability of such approaches and motivating the development of blind QECV techniques. Current blind methods generate degradation vectors via classification models with cross-entropy loss, using them as channel attention to guide artifact removal. However, these vectors capture only global degradation information and lack spatial details, hindering adaptation to varying artifact patterns at different spatial positions. To address these limitations, we propose a pretrained Degradation Representation Learning (DRL) module that decouples and extracts high-dimensional, multiscale degradation representations from video content to guide the artifact removal. Additionally, both blind and non-blind methods typically employ uniform architectures across QPs, hence overlooking the varying computational demands inherent to different compression levels. We thus introduce a hierarchical termination mechanism that dynamically adjusts the number of artifact reduction stages based on the compression level. Experimental results demonstrate that the proposed approach significantly enhances performance, achieving a PSNR improvement of 110% (from 0.31 dB to 0.65 dB) over a competing state-of-the-art blind method at QP = 22.
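The 110% figure is consistent with the two quoted values if it is read as the relative increase from 0.31 dB to 0.65 dB. That reading is an interpretation, since the abstract does not spell out the calculation:

```latex
\frac{0.65\,\text{dB} - 0.31\,\text{dB}}{0.31\,\text{dB}} = \frac{0.34}{0.31} \approx 1.097 \approx 110\%
```

In absolute terms the difference between the two values is 0.34 dB.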
Furthermore, the proposed hierarchical termination mechanism reduces the average inference time at QP = 22 by half compared to QP = 42.
[59] SurvAgent: Hierarchical CoT-Enhanced Case Banking and Dichotomy-Based Multi-Agent System for Multimodal Survival Prediction
Guolin Huang,Wenting Chen,Jiaqi Yang,Xinheng Lyu,Xiaoling Luo,Sen Yang,Xiaohan Xing,Linlin Shen
Main category: cs.CV
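TL;DR: This paper proposes SurvAgent, a new multi-agent system for explainable multimodal cancer survival prediction. It combines pathology images with gene data and uses chain-of-thought reasoning and case-based experiential learning to improve both predictive performance and transparency.
Details
Motivation: Existing survival analysis methods lack the explainability that clinical use requires. They also struggle to integrate multimodal data, to explore regions of interest effectively, and to learn from historical cases.
Method: SurvAgent has two stages. The first builds a WSI-gene chain-of-thought-enhanced case bank: pathology images are processed with low-magnification screening, cross-modal similarity-aware patch mining, and confidence-aware patch mining, while six functional gene categories are analyzed in a stratified manner. The second performs dichotomy-based multi-expert agent inference, retrieving similar cases via RAG and fusing multimodal reports with expert predictions through progressive interval refinement.
Result: Experiments on five TCGA cohorts show that SurvAgent outperforms conventional methods, proprietary MLLMs, and medical agents in predictive performance, with stronger explainability and generalization.
Conclusion: SurvAgent establishes a new paradigm for explainable AI-driven survival prediction in precision oncology, integrating multimodal data and supporting reasoning grounded in historical experience.
Abstract: Survival analysis is critical for cancer prognosis and treatment planning, yet existing methods lack the transparency essential for clinical adoption. While recent pathology agents have demonstrated explainability in diagnostic tasks, they face three limitations for survival prediction: inability to integrate multimodal data, ineffective region-of-interest exploration, and failure to leverage experiential learning from historical cases. We introduce SurvAgent, the first hierarchical chain-of-thought (CoT)-enhanced multi-agent system for multimodal survival prediction. SurvAgent consists of two stages: (1) WSI-Gene CoT-Enhanced Case Bank Construction employs hierarchical analysis through Low-Magnification Screening, Cross-Modal Similarity-Aware Patch Mining, and Confidence-Aware Patch Mining for pathology images, while Gene-Stratified analysis processes six functional gene categories. Both generate structured reports with CoT reasoning, storing complete analytical processes for experiential learning. (2) Dichotomy-Based Multi-Expert Agent Inference retrieves similar cases via RAG and integrates multimodal reports with expert predictions through progressive interval refinement. Extensive experiments on five TCGA cohorts demonstrate SurvAgent's superiority over conventional methods, proprietary MLLMs, and medical agents, establishing a new paradigm for explainable AI-driven survival prediction in precision oncology.
[60] Real-Time 3D Object Detection with Inference-Aligned Learning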
Chenyu Zhao,Xianwei Zheng,Zimin Xia,Linwei Yue,Nan Xue
Main category: cs.CV
TL;DR: Proposes SR3D, a real-time 3D object detection framework for indoor point clouds. Spatial-prioritized assignment and rank-aware self-distillation narrow the gap between training and inference, and the framework outperforms prior methods in both accuracy and speed.
Details
Motivation: Existing 3D object detection methods lack spatial reliability and ranking awareness during training, which makes training inconsistent with inference and hurts model performance.
Method: SR3D has two key components: a spatial-prioritized optimal transport assignment that dynamically emphasizes well-located, spatially reliable samples, and a rank-aware adaptive self-distillation scheme that introduces ranking awareness through self-distillation.
Result: Experiments on ScanNet V2 and SUN RGB-D show that SR3D significantly outperforms existing methods, substantially improving detection accuracy while remaining real-time.
Conclusion: SR3D bridges the gap between training and inference and improves 3D object detection in practice, making it suitable for augmented reality, robotics, and navigation.
Abstract: Real-time 3D object detection from point clouds is essential for dynamic scene understanding in applications such as augmented reality, robotics, and navigation. We introduce a novel Spatial-prioritized and Rank-aware 3D object detection (SR3D) framework for indoor point clouds, to bridge the gap between how detectors are trained and how they are evaluated. This gap stems from the lack of spatial reliability and ranking awareness during training, which conflicts with the ranking-based prediction selection used at inference. Such a training-inference gap hampers the model's ability to learn representations aligned with inference-time behavior. To address this limitation, SR3D consists of two components tailored to the spatial nature of point clouds during training: a novel spatial-prioritized optimal transport assignment that dynamically emphasizes well-located and spatially reliable samples, and a rank-aware adaptive self-distillation scheme that adaptively injects ranking perception via a self-distillation paradigm. Extensive experiments on ScanNet V2 and SUN RGB-D show that SR3D effectively bridges the training-inference gap and significantly outperforms prior methods in accuracy while maintaining real-time speed.
[61] Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Ziyu Guo,Renrui Zhang,Hongyu Li,Manyuan Zhang,Xinyan Chen,Sifan Wang,Yan Feng,Peng Pei,Pheng-Ann Heng
Main category: cs.CV
TL;DR: This paper proposes the Thinking-while-Generating (TwiG) framework, the first to interleave textual reasoning with image generation during the visual generation process, which makes the generated content semantically richer and more context-aware.
Details
Motivation: Existing visual generation methods lack real-time multimodal interaction during generation. They typically apply textual reasoning only before or after generation, so the process cannot be adjusted dynamically.
Method: The TwiG framework interleaves textual reasoning as the image is progressively generated, both to guide upcoming regions and to reflect on regions already generated. Three strategies are explored: zero-shot prompting, supervised fine-tuning (SFT) on the TwiG-50K dataset, and reinforcement learning with TwiG-GRPO.
Result: Textual reasoning and visual content co-evolve during generation, which improves the contextual relevance and semantic quality of the outputs.
Conclusion: TwiG offers a new paradigm for integrating reasoning into visual generation and may advance deeper integration of embodied reasoning with generative models.
Abstract: Recent advances in visual generation have increasingly explored the integration of reasoning capabilities. They incorporate textual reasoning, i.e., thinking, either before (as pre-planning) or after (as post-refinement) the generation process, yet they lack on-the-fly multimodal interaction during the generation itself. In this preliminary study, we introduce Thinking-while-Generating (TwiG), the first interleaved framework that enables co-evolving textual reasoning throughout the visual generation process. As visual content is progressively generated, textual reasoning is interleaved to both guide upcoming local regions and reflect on previously synthesized ones. This dynamic interplay produces more context-aware and semantically rich visual outputs. To unveil the potential of this framework, we investigate three candidate strategies: zero-shot prompting, supervised fine-tuning (SFT) on our curated TwiG-50K dataset, and reinforcement learning (RL) via a customized TwiG-GRPO strategy, each offering unique insights into the dynamics of interleaved reasoning. We hope this work inspires further research into interleaving textual reasoning for enhanced visual generation. Code will be released at: https://github.com/ZiyuGuo99/Thinking-while-Generating.
[62] A Spatial Semantics and Continuity Perception Attention for Remote Sensing Water Body Change Detection
Quanqing Ma,Jiaen Chen,Peng Wang,Yao Zheng,Qingzhan Zhao,Yuchen Zheng
Main category: cs.CV
TL;DR: This paper presents HSRW-CD, a high-resolution remote sensing dataset for water body change detection, and designs the SSCP attention module to improve detection performance. The module uses multi-semantic spatial attention, structural relation-aware global attention, and channel-wise self-attention to exploit the spatial semantics and structural information in deep features.
Details
Motivation: Existing water body change detection methods are limited by the lack of high-resolution datasets, and deep learning models do not fully exploit the spatial semantic and structural information in deep features. As a result, they localize water body changes poorly in urban and rural regions.
Method: The authors propose the high-resolution HSRW-CD dataset and design the SSCP attention module, which can be embedded in existing models. Its three components, MSA, SRGA, and CSA, respectively enhance spatial semantics, extract structural continuity, and compute similarity across channels.
Result: Experiments on the HSRW-CD and Water-CD datasets show that the SSCP module significantly improves the accuracy and generalization of water body change detection.
Conclusion: The SSCP module strengthens the discriminative power of water body features, and the new dataset provides a benchmark for high-resolution water body change detection, supporting applications in this area.
Abstract: Remote sensing Water Body Change Detection (WBCD) aims to detect water body surface changes from bi-temporal images of the same geographic area. Currently, the scarcity of high spatial resolution datasets for WBCD restricts its application in urban and rural regions, which require more accurate positioning. Meanwhile, previous deep learning-based methods fail to comprehensively exploit the spatial semantic and structural information in the deep features of change detection networks. To resolve these concerns, we first propose a new dataset, HSRW-CD, with a spatial resolution higher than 3 meters for WBCD. Specifically, it contains a large number of image pairs, widely covering various water body types. In addition, a Spatial Semantics and Continuity Perception (SSCP) attention module is designed to fully leverage both the spatial semantics and the structure of deep features in WBCD networks, significantly improving the discrimination capability for water bodies. The proposed SSCP has three components: the Multi-Semantic spatial Attention (MSA), the Structural Relation-aware Global Attention (SRGA), and the Channel-wise Self-Attention (CSA). The MSA enhances the spatial semantics of water body features and provides precise spatial semantic priors for the CSA. Then, the SRGA further extracts spatial structure to learn the spatial continuity of the water body. Finally, the CSA utilizes the spatial semantic and structural priors from the MSA and SRGA to compute the similarity across channels. Specifically designed as a plug-and-play module for water body deep features, the proposed SSCP allows integration into existing WBCD models.
Numerous experiments conducted on the proposed HSRW-CD and Water-CD datasets validate the effectiveness and generalization of the SSCP. The code of this work and the HSRW-CD dataset will be available at https://github.com/QingMa1/SSCP.
[63] LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM
Sibaek Lee,Seongbo Ha,Kyeongsu Kang,Joonyeol Choi,Seungjun Tak,Hyeonwoo Yu
Main category: cs.CV
TL;DR: LEGO-SLAM is the first 3D Gaussian Splatting (3DGS) SLAM framework to achieve real-time, open-vocabulary mapping. A scene-adaptive encoder-decoder compresses language features to 16 dimensions, enabling efficient semantic mapping along with language-guided redundancy pruning and loop detection.
Details
Motivation: Existing 3DGS SLAM systems lack open-vocabulary semantic understanding, and integrating language features brings high memory overhead, slow rendering, and poor model adaptability.
Method: LEGO-SLAM uses a scene-adaptive encoder-decoder to compress high-dimensional language embeddings into compact 16-dimensional features, introduces a language-guided Gaussian pruning strategy to reduce redundancy, and reuses the semantic features for language-driven loop detection.
Result: The map's Gaussian count drops by more than 60% while rendering quality is maintained. The system runs in real time at 15 FPS and shows competitive mapping quality and tracking accuracy across experiments.
Conclusion: LEGO-SLAM is the first to bring efficient, real-time open-vocabulary semantic perception to 3DGS SLAM, combining low memory overhead, fast rendering, and strong adaptability, and it offers a practical basis for semantic robot interaction.
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge, as storing high-dimensional features demands excessive memory and rendering overhead, while existing methods with static models lack adaptability for novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the memory per Gaussian and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map's Gaussian count by over 60% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy, all while providing open-vocabulary capabilities at 15 FPS.
[64] Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval
Chunxu Liu,Jiyuan Yang,Ruopeng Gao,Yuhan Zhu,Feng Zhu,Rui Zhao,Limin Wang
Main category: cs.CV
TL;DR: Proposes Reasoning Guided Embeddings (RGE), a multimodal embedding method that combines the generative reasoning ability of multimodal large models with contrastive training to improve representation quality. Retrieval performance on the MMEB benchmark improves by 4.9%.
Details
Motivation: Existing methods ignore the reasoning ability of Multimodal Large Language Models (MLLMs) when extracting multimodal embeddings and treat them only as encoders, which limits representation quality.
Method: RGE guides the model to generate a structured rationale conditioned on the instruction, extracts the embedding after the reasoning has unfolded, and trains with contrastive learning.
Result: Experiments on the MMEB benchmark show a 4.9% improvement in multimodal retrieval over the non-reasoning baseline.
Conclusion: Explicitly introducing a reasoning process improves multimodal embedding quality, which confirms the value of using MLLMs' generative reasoning for representation learning.
Abstract: Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs) can serve as strong embedding extractors, existing approaches treat embedding extraction as a direct encoding step, overlooking the fact that MLLMs possess the generative capability for reasoning that could be leveraged to enhance representation quality. In this work, we explore how to explicitly incorporate reasoning into the embedding process. To this end, we propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of MLLMs and couples it with contrastive training. Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. This simple design enhances the context-conditional inference signals within the embedding, leading to improved multimodal representation quality. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline, confirming that explicit reasoning can effectively enhance embedding quality.
[65] Pluggable Pruning with Contiguous Layer Distillation for Diffusion Transformers
Jian Ma,Qirong Peng,Xujie Zhu,Peixing Xie,Chen Chen,Haonan Lu
Main category: cs.CV
TL;DR: This paper proposes PPCL, a flexible structured pruning framework for Diffusion Transformers (DiTs). Combined with contiguous layer distillation, it achieves 50% parameter compression without significantly degrading generation quality, which suits resource-constrained settings.
Details
Motivation: DiTs perform very well in image generation, but their large parameter counts and high computational cost make them hard to deploy in resource-constrained environments. An effective compression method is needed.
Method: Redundant layer intervals are identified through linear probing and first-order differential trend analysis. A plug-and-play teacher-student alternating distillation scheme then unifies depth-wise and width-wise pruning.
Result: Experiments on multiple multi-modal diffusion transformers show that PPCL removes 50% of the parameters with less than 3% degradation in key metrics while maintaining high-quality image generation.
Conclusion: PPCL is an efficient, flexible compression scheme for DiTs that supports multiple pruning ratios without retraining, making it suitable for rapid deployment in practice.
Abstract: Diffusion Transformers (DiTs) have shown exceptional performance in image generation, yet their large parameter counts incur high computational costs, impeding deployment in resource-constrained settings. To address this, we propose Pluggable Pruning with Contiguous Layer Distillation (PPCL), a flexible structured pruning framework specifically designed for DiT architectures. First, we identify redundant layer intervals through a linear probing mechanism combined with first-order differential trend analysis of similarity metrics. Subsequently, we propose a plug-and-play teacher-student alternating distillation scheme tailored to integrate depth-wise and width-wise pruning within a single training phase. This distillation framework enables flexible knowledge transfer across diverse pruning ratios, eliminating the need for per-configuration retraining. Extensive experiments on multiple Multi-Modal Diffusion Transformer architecture models demonstrate that PPCL achieves a 50% reduction in parameter count compared to the full model, with less than 3% degradation in key objective metrics. Notably, our method maintains high-quality image generation capabilities while achieving higher compression ratios, rendering it well-suited for resource-constrained environments. The open-source code and checkpoints for PPCL can be found at the following link: https://github.com/OPPO-Mente-Lab/Qwen-Image-Pruning.
[66] Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning
Yibin Huang,Wang Xu,Wanyue Zhang,Helu Zhi,Jingjing Huang,Yangbin Xu,Yangang Sun,Conghui Zhu,Tiejun Zhao
Main category: cs.CV
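TL;DR: This paper proposes Video2Layout, a new framework that reconstructs metric-grounded spatial layouts from video using continuous object boundary coordinates, improving the fine-grained spatial reasoning of multimodal large language models.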
Details
Motivation: Existing grid-based cognitive map methods rely on discretized raster representations, which limits the model's fine-grained spatial reasoning. Method: The method has two stages: a supervised fine-tuning stage uses a high-quality dataset built with the AI2THOR simulator to learn the mapping from visual inputs to precise boundary coordinates, and a reinforcement fine-tuning stage then improves real-world generalization. The QVS-Bench benchmark is also introduced to systematically evaluate how cognitive map accuracy relates to the number of input images. Result: On QVS-Bench and mainstream spatial reasoning benchmarks, V2LO-7B achieves an average improvement of 4.92% over the model trained on grid maps. Conclusion: With its continuous object boundary representation, Video2Layout outperforms traditional discrete grid methods, showing stronger quantitative spatial computation and higher reasoning accuracy on spatial intelligence tasks. Abstract: Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model's ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model's real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
[67] Simba: Towards High-Fidelity and Geometrically-Consistent Point Cloud Completion via Transformation Diffusion
Lirui Zhang,Zhengkai Zhao,Zhi Zuo,Pan Gao,Jie Qin
Main category: cs.CV
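TL;DR: This paper proposes Simba, a new framework that reformulates point-wise transformation regression as a distribution learning problem. It combines symmetry priors with the generative power of diffusion models to resolve the trade-off between detail preservation and structural integrity in point cloud completion, uses a hierarchical Mamba architecture for high-fidelity upsampling, and achieves state-of-the-art performance on multiple benchmarks.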
Details
Motivation: Existing regression-based point cloud completion methods overfit easily and are sensitive to noise; they struggle to preserve input details while ensuring global structural integrity, and lack generalization and robustness. Method: Proposes the Simba framework, which turns point-wise transformation regression into a distribution learning task, uses a diffusion model to capture the distribution of geometric structure for greater robustness, and introduces a hierarchical Mamba architecture for high-quality upsampling. Result: Experiments on the PCN, ShapeNet, and KITTI datasets show SOTA performance in both qualitative and quantitative evaluation, with particularly strong detail preservation and noise robustness. Conclusion: Through distribution learning and diffusion models, Simba overcomes the overfitting and noise sensitivity of traditional regression methods, and the Mamba architecture improves generation quality, yielding a more robust and generalizable solution for point cloud completion. Abstract: Point cloud completion is a fundamental task in 3D vision. A persistent challenge in this field is simultaneously preserving fine-grained details present in the input while ensuring the global structural integrity of the completed shape. While recent works leveraging local symmetry transformations via direct regression have significantly improved the preservation of geometric structure details, these methods suffer from two major limitations: (1) they are prone to overfitting, tending to memorize instance-specific transformations instead of learning a generalizable geometric prior; (2) their reliance on point-wise transformation regression leads to high sensitivity to input noise, severely degrading their robustness and generalization. To address these challenges, we introduce Simba, a novel framework that reformulates point-wise transformation regression as a distribution learning problem. Our approach integrates symmetry priors with the powerful generative capabilities of diffusion models, avoiding instance-specific memorization while capturing robust geometric structures. Additionally, we introduce a hierarchical Mamba-based architecture to achieve high-fidelity upsampling. Extensive experiments across the PCN, ShapeNet, and KITTI benchmarks validate our method's state-of-the-art (SOTA) performance.
[68] Layer-wise Noise Guided Selective Wavelet Reconstruction for Robust Medical Image Segmentation
Yuting Lu,Ziliang Wang,Weixin Xu,Wei Zhang,Yongqiang Zhao,Yang Yu,Xiaohong Zhang
Main category: cs.CV
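TL;DR: Proposes LNG-SWR, a layer-wise noise-guided selective wavelet reconstruction method that improves the robustness of medical image segmentation models under distribution shifts and perturbations, with low training cost and high compatibility.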
Details
Motivation: Medical image segmentation models must cope with distribution shifts and perturbations in clinical deployment, but conventional adversarial training trades clean accuracy against robustness and is expensive to train. Method: Injects small, zero-mean noise at multiple network layers to learn a frequency-bias prior, which then guides selective wavelet reconstruction on the input or feature branch to suppress noise-sensitive bands, enhance directional structures and shape cues, and stabilize boundary responses. Result: On CT and ultrasound datasets, LNG-SWR significantly improves Dice/IoU on both clean samples and under strong attacks, and yields additional gains when combined with adversarial training without sacrificing clean accuracy. Conclusion: LNG-SWR offers a simple, effective, and engineering-friendly way to improve the robustness of medical image segmentation in both adversarial and standard training regimes. Abstract: Clinical deployment requires segmentation models to stay stable under distribution shifts and perturbations. The mainstream solution is adversarial training (AT) to improve robustness; however, AT often brings a clean-robustness trade-off and high training/tuning cost, which limits scalability and maintainability in medical imaging. We propose Layer-wise Noise-Guided Selective Wavelet Reconstruction (LNG-SWR). During training, we inject small, zero-mean noise at multiple layers to learn a frequency-bias prior that steers representations away from noise-sensitive directions. We then apply prior-guided selective wavelet reconstruction on the input/feature branch to achieve frequency adaptation: suppress noise-sensitive bands, enhance directional structures and shape cues, and stabilize boundary responses while maintaining spectral consistency. The framework is backbone-agnostic and adds low additional inference overhead. It can serve as a plug-in enhancement to AT and also improves robustness without AT. On CT and ultrasound datasets, under a unified protocol with PGD-$L_{\infty}/L_{2}$ and SSAH, LNG-SWR delivers consistent gains on clean Dice/IoU and significantly reduces the performance drop under strong attacks; combining LNG-SWR with AT yields additive gains. When combined with adversarial training, robustness improves further without sacrificing clean accuracy, indicating an engineering-friendly and scalable path to robust segmentation. These results indicate that LNG-SWR provides a simple, effective, and engineering-friendly path to robust medical image segmentation in both adversarial and standard training regimes.
[69] An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs
Zhi Luo,Zenghui Yuan,Wenqi Wei,Daizong Liu,Pan Zhou
Main category: cs.CV
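TL;DR: This paper proposes a new verbose-text induction attack (VTIA) that uses a two-stage framework to inject imperceptible adversarial perturbations into the inputs of vision-language models so as to maximize output token length, improving the attack's effectiveness, efficiency, and generalization.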
Details
Motivation: The number of tokens a vision-language model consumes during generation has become a key evaluation metric, yet existing methods cannot lengthen outputs in a stable and controllable way, so a more effective mechanism for directly maximizing output length is needed. Method: A two-stage framework: first, adversarial prompt search uses reinforcement learning to find malicious prompts that induce verbose outputs; then, vision-aligned perturbation optimization makes the perturbed image's visual embeddings similar to those of the adversarial prompt, triggering verbose text generation. Result: Experiments on four mainstream vision-language models show that the method significantly outperforms existing methods in effectiveness, efficiency, and generalization. Conclusion: VTIA effectively induces vision-language models to generate long, highly redundant text with low information density, exposing a potential security risk to deployment efficiency and offering a new perspective for evaluating robustness. Abstract: With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric. Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong output, and fail to directly maximize the output token length as an explicit optimization objective, lacking stability and controllability. To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) to inject imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings for optimizing and maximizing the output token length of the perturbed images. Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and those of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.
[70] EvoVLA: Self-Evolving Vision-Language-Action Model
Zeting Liu,Zida Yang,Zeyu Zhang,Hao Tang
Main category: cs.CV
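TL;DR: EvoVLA is a self-supervised vision-language-action framework that uses stage-aligned rewards, pose-based object exploration, and a long-horizon memory mechanism to mitigate stage hallucination in long-horizon robotic manipulation, significantly improving task success and sample efficiency in both simulation and the real world.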
Details
Motivation: Existing Vision-Language-Action (VLA) models suffer from stage hallucination on long-horizon manipulation tasks: they exploit coarse evaluation signals to skip task steps, reporting high progress while actually completing little, which limits real-world use. Method: Proposes the EvoVLA framework with three core components: 1) Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; 2) Pose-Based Object Exploration (POE), which drives curiosity with the relative gripper-object pose instead of raw pixels; 3) Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic reward shaping during extended rollouts. Result: On the Discoverse-L benchmark, EvoVLA improves average task success by 10.2 percentage points over the strongest baseline, OpenVLA-OFT, reaching 69.2%; sample efficiency improves 1.5 times and the stage hallucination rate drops from 38.5% to 14.8%. On physical robots it reaches an average success rate of 54.6% across four tasks, 11 percentage points above the baseline, demonstrating effective sim-to-real transfer. Conclusion: Through the interplay of these mechanisms, EvoVLA mitigates stage hallucination in long-horizon VLA tasks and significantly improves task success, sample efficiency, and real-world generalization, offering a viable path toward zero-shot transfer and practical robot deployment. Abstract: Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
[71] Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective
Jiahao Li,Yang Lu,Yachao Zhang,Yong Xie,Fangyong Wang,Yuan Xie,Yanyun Qu
Main category: cs.CV
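TL;DR: This paper proposes RF-CLIP, a training-free method for open-vocabulary semantic segmentation that emulates human attention refocusing: it filters the over-activated dimensions in CLIP that cause distraction, improving pixel-level vision-language alignment and reaching SOTA on eight benchmarks.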
Details
Motivation: Existing methods have not fully explored CLIP's performance boundaries on dense prediction tasks, and in particular have not analyzed its attention distraction problem from an interpretability perspective. Method: Systematically analyzes CLIP's internal mechanisms and finds a "distraction" phenomenon caused by dimension-specific over-activation; proposes ReFocusing CLIP (RF-CLIP), which filters the distracting tokens and refocuses attention on target regions to strengthen pixel-level multimodal alignment. Result: Achieves state-of-the-art performance on eight open-vocabulary semantic segmentation benchmarks while maintaining high inference efficiency. Conclusion: By emulating human attention refocusing, RF-CLIP improves CLIP's performance on dense prediction tasks and points to a new direction for adapting vision-language models to pixel-level understanding. Abstract: Open-vocabulary semantic segmentation (OVSS) employs pixel-level vision-language alignment to associate category-related prompts with corresponding pixels. A key challenge is enhancing the multimodal dense prediction capability, specifically this pixel-level multimodal alignment. Although existing methods achieve promising results by leveraging CLIP's vision-language alignment, they rarely investigate the performance boundaries of CLIP for dense prediction from an interpretability mechanisms perspective. In this work, we systematically investigate CLIP's internal mechanisms and identify a critical phenomenon: analogous to human distraction, CLIP diverts significant attention resources from target regions to irrelevant tokens. Our analysis reveals that these tokens arise from dimension-specific over-activation; filtering them enhances CLIP's dense prediction performance. Consequently, we propose ReFocusing CLIP (RF-CLIP), a training-free approach that emulates human distraction-refocusing behavior to redirect attention from distraction tokens back to target regions, thereby refining CLIP's multimodal alignment granularity. Our method achieves SOTA performance on eight benchmarks while maintaining high inference efficiency.
[72] Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Yi Yang,Xueqi Li,Yiyang Chen,Jin Song,Yihan Wang,Zipeng Xiao,Jiadi Su,You Qiaoben,Pengfei Liu,Zhijie Deng
Main category: cs.CV
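TL;DR: This paper proposes the Mantis framework, which decouples visual foresight prediction from the backbone and combines meta queries with a diffusion Transformer head to improve the performance of vision-language-action models.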
Details
Motivation: Existing VLA models that directly predict high-dimensional visual states face high computational cost and information bottlenecks, and their comprehension and reasoning suffer from insufficient language supervision. Method: Introduces a Disentangled Visual Foresight (DVF) module that uses meta queries and a diffusion Transformer head, feeds the current visual state in through a residual connection, and uses a simple next-state prediction objective to capture latent actions. Result: After fine-tuning, the model reaches a 96.7% success rate on the LIBERO benchmark, surpassing strong baselines, and outperforms existing models such as $\pi_{0.5}$ in instruction following, generalization, and reasoning. Conclusion: Mantis alleviates the capacity allocation and information bottleneck problems of VLA models while strengthening comprehension and reasoning under language supervision, with fast convergence and good real-world performance. Abstract: Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervision. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on the LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $\pi_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.
[73] Domain-Shared Learning and Gradual Alignment for Unsupervised Domain Adaptation Visible-Infrared Person Re-Identification
Nianchang Huang,Yi Xu,Ruida Xi,Ruida Xi,Qiang Zhang
Main category: cs.CV
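TL;DR: This paper proposes DSLGA, a new two-stage model that addresses cross-domain and cross-modality discrepancies in unsupervised domain adaptation visible-infrared person re-identification (UDA-VI-ReID), significantly outperforming existing methods across multiple experimental settings.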
Details
Motivation: Because public datasets differ from real-world data, existing VI-ReID algorithms perform poorly in practice, which calls for UDA-VI-ReID methods that transfer knowledge from public data to real-world data without annotating new samples. Method: Designs a two-stage model, DSLGA: the first stage uses a Domain-Shared Learning Strategy (DSLS) to mitigate the ineffective pre-training caused by inter-domain modality discrepancies; the second stage uses a Gradual Alignment Strategy (GAS) to handle the cross-modality alignment challenges caused by intra-domain modality discrepancies in a cluster-to-holistic manner. A new testing method, CMDA-XD, is also constructed. Result: Extensive experiments show that the method significantly outperforms existing domain adaptation methods for VI-ReID under various settings, and even surpasses some supervised methods. Conclusion: DSLGA effectively addresses inter-domain and intra-domain modality discrepancies in UDA-VI-ReID, showing strong transfer ability and practical potential. Abstract: Recently, Visible-Infrared person Re-Identification (VI-ReID) has achieved remarkable performance on public datasets. However, due to the discrepancies between public datasets and real-world data, most existing VI-ReID algorithms struggle in real-life applications. To address this, we take the initiative to investigate Unsupervised Domain Adaptation Visible-Infrared person Re-Identification (UDA-VI-ReID), aiming to transfer the knowledge learned from the public data to real-world data without compromising accuracy or requiring the annotation of new samples. Specifically, we first analyze two basic challenges in UDA-VI-ReID, i.e., inter-domain modality discrepancies and intra-domain modality discrepancies. Then, we design a novel two-stage model, i.e., Domain-Shared Learning and Gradual Alignment (DSLGA), to handle these discrepancies. In the first pre-training stage, DSLGA introduces a Domain-Shared Learning Strategy (DSLS) to mitigate ineffective pre-training caused by inter-domain modality discrepancies via exploiting shared information between the source and target domains. In the second fine-tuning stage, DSLGA designs a Gradual Alignment Strategy (GAS) to handle the cross-modality alignment challenges between visible and infrared data caused by the large intra-domain modality discrepancies in a cluster-to-holistic manner. Finally, a new UDA-VI-ReID testing method, CMDA-XD, is constructed for training and testing different UDA-VI-ReID models. Extensive experiments demonstrate that our method significantly outperforms existing domain adaptation methods for VI-ReID and even some supervised methods under various settings.
[74] PrIntMesh: Precise Intersection Surfaces for 3D Organ Mesh Reconstruction
Deniz Sayin Mercadier,Hieu Le,Yihong Chen,Jiancheng Yang,Udaranga Wickramasinghe,Pascal Fua
Main category: cs.CV
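TL;DR: PrIntMesh is a template-based, topology-preserving framework that jointly reconstructs an organ as a unified system, outperforming traditional methods in geometric accuracy, topological correctness, and data efficiency.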
Details
Motivation: Existing deep-learning methods usually process an organ's substructures independently, producing anatomically implausible reconstructions that ignore the geometric relationships and spatial constraints between substructures. Method: Proposes PrIntMesh, which starts from a connected template and jointly deforms all substructures to match patient-specific anatomy while explicitly preserving internal boundaries and producing smooth, artifact-free surfaces. Result: Applied to the heart, hippocampus, and lungs, the method shows high geometric accuracy and correct topology, and remains robust with limited or noisy training data; compared with voxel- and surface-based methods, it better reconstructs shared interfaces and maintains structural consistency. Conclusion: PrIntMesh provides a data-efficient, clinically applicable organ reconstruction approach that yields anatomically plausible and topologically correct 3D reconstructions. Abstract: Human organs are composed of interconnected substructures whose geometry and spatial relationships constrain one another. Yet, most deep-learning approaches treat these parts independently, producing anatomically implausible reconstructions. We introduce PrIntMesh, a template-based, topology-preserving framework that reconstructs organs as unified systems. Starting from a connected template, PrIntMesh jointly deforms all substructures to match patient-specific anatomy, while explicitly preserving internal boundaries and enforcing smooth, artifact-free surfaces. We demonstrate its effectiveness on the heart, hippocampus, and lungs, achieving high geometric accuracy, correct topology, and robust performance even with limited or noisy training data. Compared to voxel- and surface-based methods, PrIntMesh better reconstructs shared interfaces, maintains structural consistency, and provides a data-efficient solution suitable for clinical use.
[75] When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models
Yuping Yan,Yuhan Xie,Yinxin Zhang,Lingjuan Lyu,Yaochu Jin
Main category: cs.CV
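TL;DR: This paper presents VLA-Fool, the first framework for studying the multimodal adversarial robustness of embodied vision-language-action (VLA) models. It covers textual, visual, and cross-modal misalignment attacks and proposes a semantically guided prompt generation method; experiments show that existing VLA models are extremely fragile under multimodal perturbations.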
Details
Motivation: Although VLA models perform well in embodied intelligence, their adversarial robustness under realistic multimodal, black-box conditions has not been explored in depth, especially the effect of cross-modal misalignment on decision-making. Method: Proposes VLA-Fool, which unifies three kinds of multimodal adversarial attack: gradient-based and prompt-based textual perturbations, visual patch and noise perturbations, and cross-modal misalignment attacks that break the semantic correspondence between perception and instruction. It also designs a VLA-aware semantic prompt generation framework. Result: Experiments on the LIBERO benchmark with a fine-tuned OpenVLA model show that even minor multimodal perturbations cause significant behavioral deviations, revealing the fragility of multimodal alignment in current VLA models. Conclusion: VLA models have serious robustness problems under multimodal adversarial attacks, particularly cross-modal semantic misalignment, and future work needs stronger defenses against such vulnerabilities. Abstract: Vision-Language-Action models (VLAs) have recently demonstrated remarkable progress in embodied environments, enabling robots to perceive, reason, and act through unified multimodal understanding. Despite their impressive capabilities, the adversarial robustness of these systems remains largely unexplored, especially under realistic multimodal and black-box conditions. Existing studies mainly focus on single-modality perturbations and overlook the cross-modal misalignment that fundamentally affects embodied reasoning and decision-making. In this paper, we introduce VLA-Fool, a comprehensive study of multimodal adversarial robustness in embodied VLA models under both white-box and black-box settings. VLA-Fool unifies three levels of multimodal adversarial attacks: (1) textual perturbations through gradient-based and prompt-based manipulations, (2) visual perturbations via patch and noise distortions, and (3) cross-modal misalignment attacks that intentionally disrupt the semantic correspondence between perception and instruction. We further incorporate a VLA-aware semantic space into linguistic prompts, developing the first automatically crafted and semantically guided prompting framework. Experiments on the LIBERO benchmark using a fine-tuned OpenVLA model reveal that even minor multimodal perturbations can cause significant behavioral deviations, demonstrating the fragility of embodied multimodal alignment.
[76] Unsupervised Image Classification with Adaptive Nearest Neighbor Selection and Cluster Ensembles
Melih Baydar,Emre Akbas
Main category: cs.CV
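TL;DR: This paper proposes ICCE, an unsupervised image classification method that uses multi-head clustering, adaptive nearest neighbor selection, and cluster ensembling to produce diverse clusterings on a frozen backbone and merge them into a consensus clustering for training a classifier. It achieves the best performance on multiple benchmarks and is the first to exceed 70% accuracy on ImageNet.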
Details
Motivation: Traditional methods combine representation learning with clustering, while recent work skips representation learning and focuses only on clustering. This paper aims to exploit the strong representations provided by foundation models and improve unsupervised image classification by improving the clustering process. Method: Uses a multi-head clustering framework that trains multiple clustering heads on a frozen backbone to produce diverse clusterings; introduces adaptive nearest neighbor selection and a cluster ensembling strategy to merge the clusterings into a unified consensus clustering, which then provides pseudo-labels for training a classifier. Result: ICCE achieves state-of-the-art performance on ten image classification benchmarks, with 99.3% on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet, making it the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet. Conclusion: Through an effective cluster ensembling strategy, ICCE significantly improves unsupervised image classification and narrows the gap with supervised methods, showing the potential of working without fine-tuning the backbone for representation learning. Abstract: Unsupervised image classification, or image clustering, aims to group unlabeled images into semantically meaningful categories. Early methods integrated representation learning and clustering within an iterative framework. However, the rise of foundational models has recently shifted the focus solely to clustering, bypassing the representation learning step. In this work, we build upon a recent multi-head clustering approach by introducing adaptive nearest neighbor selection and cluster ensembling strategies to improve clustering performance. Our method, "Image Clustering through Cluster Ensembles" (ICCE), begins with a clustering stage, where we train multiple clustering heads on a frozen backbone, producing diverse image clusterings. We then employ a cluster ensembling technique to consolidate these potentially conflicting results into a unified consensus clustering. Finally, we train an image classifier using the consensus clustering result as pseudo-labels. ICCE achieves state-of-the-art performance on ten image classification benchmarks, achieving 99.3% accuracy on CIFAR10, 89% on CIFAR100, and 70.4% on ImageNet, narrowing the performance gap with supervised methods. To the best of our knowledge, ICCE is the first fully unsupervised image classification method to exceed 70% accuracy on ImageNet.
[77] SwiTrack: Tri-State Switch for Cross-Modal Object Tracking
Boyue Xu,Ruichao Hou,Tongwei Ren,Dongming Zhou,Gangshan Wu,Jinde Cao
Main category: cs.CV
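TL;DR: This paper proposes SwiTrack, a novel state-switching framework that addresses insufficient modality-specific feature extraction and object drift in cross-modal object tracking (CMOT). It handles RGB, NIR, and invalid modalities with three dedicated branches and achieves state-of-the-art gains on the latest benchmarks.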
Details
Motivation: Existing cross-modal object tracking methods struggle to fully extract modality-specific features and are prone to object drift when inputs are unreliable, which limits tracking robustness. Method: Proposes the SwiTrack framework: RGB frames are processed by the visual encoder; NIR frames are calibrated by an encoder with a gated adapter; for invalid modalities, a consistency trajectory prediction module estimates target motion. Dynamic template reconstruction and a similarity alignment loss further reinforce feature consistency. Result: On the latest benchmarks, the method improves precision rate and success rate by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 frames per second. Conclusion: With its purpose-built three-stream architecture and feature refinement mechanisms, SwiTrack significantly improves robustness and accuracy in cross-modal tracking, effectively mitigates object drift, and shows potential for practical use. Abstract: Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities, with only one modality available in each frame, mostly focusing on RGB-Near Infrared (RGB-NIR) tracking. Existing methods typically connect parallel RGB and NIR branches to a shared backbone, which limits the comprehensive extraction of distinctive modality-specific features and fails to address the issue of object drift, especially in the presence of unreliable inputs. In this paper, we propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams. Specifically, RGB frames are processed by the visual encoder, while NIR frames undergo refinement via a NIR gated adapter coupled with the visual encoder to progressively calibrate shared latent space features, thereby yielding more robust cross-modal representations. For invalid modalities, a consistency trajectory prediction module leverages spatio-temporal cues to estimate target movement, ensuring robust tracking and mitigating drift. Additionally, we incorporate dynamic template reconstruction to iteratively update template features and employ a similarity alignment loss to reinforce feature consistency. Experimental results on the latest benchmarks demonstrate that our tracker achieves state-of-the-art performance, improving precision rate and success rate by 7.2% and 4.3%, respectively, while maintaining real-time tracking at 65 frames per second. Code and models are available at https://github.com/xuboyue1999/SwiTrack.git.
[78] Mem-MLP: Real-Time 3D Human Motion Generation from Sparse Inputs
Sinan Mutlu,Georgios F. Angelis,Savas Ozkan,Paul Wisbey,Anastasios Drosou,Mete Ozay
Main category: cs.CV
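TL;DR: This paper proposes a neural network method based on a multi-layer perceptron (MLP) with residual connections and a novel Memory-Block component, which generates accurate, temporally consistent full-body motion from sparse sensor inputs and outperforms existing methods in both accuracy and real-time performance for AR/VR.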
Details
Motivation: Existing AR/VR systems usually track only the head and hands, so full-body 3D reconstruction is incomplete and lacks realism and immersion. A method is therefore needed that accurately recovers the full body pose from sparse sensor signals. Method: Proposes a method with an MLP backbone, adding residual connections to increase expressive power and a new component called Memory-Block, which represents missing sensor data with trainable code vectors and combines them with past sparse signals to improve temporal consistency; a multi-task learning formulation lets the model learn more robust feature representations. Result: Experiments show that the method substantially reduces prediction error and outperforms current state-of-the-art baselines; it reaches 72 FPS on mobile HMDs, giving good real-time performance. Conclusion: The proposed method strikes a good balance between accuracy and runtime efficiency, suits resource-constrained mobile AR/VR devices, and advances high-quality full-body tracking from sparse inputs. Abstract: Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track head and hands via Head Mounted Devices (HMDs) and controllers, making the 3D full-body reconstruction incomplete. One potential approach is to generate the full-body motions from sparse inputs collected from limited sensors using a Neural Network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone that is enhanced with residual connections and a novel NN-component called Memory-Block. In particular, Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve the temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing our MLP-backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines by substantially reducing prediction errors. Moreover, it achieves 72 FPS on mobile HMDs, ultimately improving the trade-off between accuracy and running time.
[79] TetraSDF: Precise Mesh Extraction with Multi-resolution Tetrahedral Grid
Seonghun Oh,Youngjung Uh,Jin-Hwa Kim
Main category: cs.CV
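TL;DR: Proposes TetraSDF, a precise analytic meshing framework for neural signed distance functions (SDFs) that uses a multi-resolution tetrahedral positional encoder to achieve accurate, self-consistent, and efficient 3D reconstruction while preserving continuous piecewise affine structure.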
Details
Motivation: Existing sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic methods apply only to plain ReLU MLPs, making it hard to extract the zero-level set of a neural SDF exactly. Method: Designs a ReLU MLP architecture with a multi-resolution tetrahedral positional encoder whose barycentric interpolation preserves global CPWA structure, and tracks ReLU linear regions within the polyhedral complex induced by the encoder; a fixed analytic input preconditioner is also introduced to reduce directional bias and stabilize training. Result: Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that remain faithful to the learned isosurfaces, with practical runtime and memory efficiency. Conclusion: By preserving CPWA structure and adding a preconditioning mechanism, TetraSDF enables precise and efficient analytic meshing of neural SDFs, advancing high-quality implicit surface reconstruction. Abstract: Extracting meshes that exactly match the zero-level set of neural signed distance functions (SDFs) remains challenging. Sampling-based methods introduce discretization error, while continuous piecewise affine (CPWA) analytic approaches apply only to plain ReLU MLPs. We present TetraSDF, a precise analytic meshing framework for SDFs represented by a ReLU MLP composed with a multi-resolution tetrahedral positional encoder. The encoder's barycentric interpolation preserves global CPWA structure, enabling us to track ReLU linear regions within an encoder-induced polyhedral complex. A fixed analytic input preconditioner derived from the encoder's metric further reduces directional bias and stabilizes training. Across multiple benchmarks, TetraSDF matches or surpasses existing grid-based encoders in SDF reconstruction accuracy, and its analytic extractor produces highly self-consistent meshes that remain faithful to the learned isosurfaces, all with practical runtime and memory efficiency.
[80] Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM
Gergely Dinya,Péter Halász,András Lőrincz,Kristóf Karacs,Anna Gelencsér-Horváth
Main category: cs.CV
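TL;DR: Proposes a fast spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT), suited to real-time applications such as assistive navigation.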
Details
Motivation: Efficient, close to real-time scene understanding for applications such as assistive navigation requires overcoming VGGT's high memory consumption and the challenge of continuously updating the 3D scene. Method: Processes the image stream with a sliding window and aligns submaps to reduce memory demands; uses the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects, and stores timestamps and instance-level identities to provide temporal consistency and detect changes in the environment. Result: The approach is validated on well-known benchmarks and a custom assistive navigation dataset; the results indicate that the framework is applicable to real-world scenarios, with good real-time performance and scene understanding. Conclusion: The proposed VGGT-based framework balances efficiency and accuracy and can support real-world applications that need continuous updates and contextual reasoning. Abstract: We present a fast, spatio-temporal scene understanding framework based on Vision Gated Generative Transformers (VGGT). The proposed pipeline is designed to enable efficient, close to real-time performance, supporting applications including assistive navigation. To achieve continuous updates of the 3D scene representation, we process the image flow with a sliding window, aligning submaps, thereby overcoming VGGT's high memory demands. We exploit the VGGT tracking head to aggregate 2D semantic instance masks into 3D objects. To allow for temporal consistency and richer contextual reasoning, the system stores timestamps and instance-level identities, thereby enabling the detection of changes in the environment. We evaluate the approach on well-known benchmarks and custom datasets specifically designed for assistive navigation scenarios. The results demonstrate the applicability of the framework to real-world scenarios.
[81] Explainable AI for Diabetic Retinopathy Detection Using Deep Learning with Attention Mechanisms and Fuzzy Logic-Based Interpretability
Abishek Karthik,Pandiyaraju V,Sreya Mynampati
Main category: cs.CV
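TL;DR: Proposes a hybrid deep learning framework combining CNNs, ViTs, and GNNs for high-accuracy weed detection under varying field conditions, using GAN-based augmentation and self-supervised pre-training to reach 99.33% accuracy on multiple benchmark datasets.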
Details
Motivation: Precision agriculture needs efficient, accurate weed identification to enable selective spraying, reduce herbicide use, and improve sustainability. Method: Fuses Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs), uses a GAN for data augmentation, and adds self-supervised contrastive pre-training to improve feature learning from few labeled samples. Result: Achieves 99.33% accuracy, precision, recall, and F1-score on multiple benchmark datasets, with good generalization, interpretability, and suitability for edge deployment. Conclusion: The hybrid framework integrates local, global, and relational features and supports real-time, efficient edge deployment, offering a scalable solution for sustainable precision agriculture. Abstract: The task of weed detection is an essential element of precision agriculture since accurate species identification allows a farmer to selectively apply herbicides and fits into sustainable agriculture crop management. This paper proposes a hybrid deep learning framework for weed detection that utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs) to build robustness to multiple field conditions. A Generative Adversarial Network (GAN)-based augmentation method is applied to balance class distributions and better generalize the model. Further, a self-supervised contrastive pre-training method helps to learn more features from limited annotated data. Experiments yield 99.33% accuracy, precision, recall, and F1-score on multi-benchmark datasets. The proposed model architecture enables local, global, and relational feature representations and offers high interpretability and adaptability. Practically, the framework allows real-time, efficient deployment on edge devices for automated weed detection, reducing over-reliance on herbicides and providing scalable, sustainable precision-farming options.
[82] Optimizing 3D Gaussian Splattering for Mobile GPUs
Md Musfiqur Rahman Sanim,Zhihao Shu,Bahram Afsharmanesh,AmirAli Mirian,Jiexiong Guan,Wei Niu,Bin Ren,Gagan Agrawal
Main category: cs.CV
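TL;DR: This paper presents Texture3dgs, a 3D Gaussian Splatting reconstruction method optimized for mobile GPUs, which markedly improves the efficiency of on-device 3D scene reconstruction through an improved sorting algorithm and memory layout.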
Details
Motivation: Efficient, privacy-preserving 3D scene reconstruction on mobile devices without network connectivity requires adapting 3D Gaussian Splatting (3DGS) to resource-constrained mobile GPUs. Method: Designs a new sorting algorithm optimized for the 2D texture cache and combines it with variable-layout optimization and acceleration of other computation steps to map 3DGS efficiently onto a mobile GPU. Result: End-to-end experiments show that Texture3dgs speeds up the sorting stage by up to 4.1x and overall reconstruction by up to 1.7x, while reducing memory usage by up to 1.6x. Conclusion: Texture3dgs addresses the performance bottlenecks of 3DGS on mobile GPUs and offers a viable route to efficient on-device 3D reconstruction. Abstract: Image-based 3D scene reconstruction, which transforms multi-view images into a structured 3D representation of the surrounding environment, is a common task across many modern applications. 3D Gaussian Splatting (3DGS) is a new paradigm to address this problem and offers considerable efficiency as compared to the previous methods. Motivated by this, and considering various benefits of mobile device deployment (data privacy, operating without internet connectivity, and potentially faster responses), this paper develops Texture3dgs, an optimized mapping of 3DGS for a mobile GPU. A critical challenge in this area turns out to be optimizing for the two-dimensional (2D) texture cache, which needs to be exploited for faster executions on mobile GPUs. As a sorting method dominates the computations in 3DGS on mobile platforms, the core of Texture3dgs is a novel sorting algorithm where the processing, data movement, and placement are highly optimized for 2D memory. The properties of this algorithm are analyzed in view of a cost model for the texture cache. In addition, we accelerate other steps of the 3DGS algorithm through improved variable layout design and other optimizations. End-to-end evaluation shows that Texture3dgs delivers up to 4.1$\times$ and 1.7$\times$ speedup for the sorting and overall 3D scene reconstruction, respectively -- while also reducing memory usage by up to 1.6$\times$ -- demonstrating the effectiveness of our design for efficient mobile 3D scene reconstruction.
[83] Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling
Minseok Seo,Mark Hamilton,Changick Kim
Main category: cs.CV
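TL;DR: Proposes Upsample Anything, a lightweight test-time optimization framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training.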
Details
Motivation: Existing feature upsampling methods depend on dataset-specific retraining or heavy implicit optimization, which limits scalability and generalization. Method: A per-image optimization learns an anisotropic Gaussian kernel that combines spatial and range cues, bridging Gaussian Splatting and Joint Bilateral Upsampling. Result: Reaches state-of-the-art performance on semantic segmentation, depth estimation, and depth and probability map upsampling, taking only about 0.419 s per 224x224 image. Conclusion: The method provides a general, edge-aware upsampling operator that transfers across architectures and modalities and supports a range of pixel-level vision tasks. Abstract: We present Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\approx 0.419$ s per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
[84] Sparse Autoencoders are Topic Models
Leander Girrbach,Zeynep Akata
Main category: cs.CV
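TL;DR: This paper offers a new view of sparse autoencoders (SAEs) as topic models: it extends Latent Dirichlet Allocation (LDA) to embedding spaces, derives the SAE objective from that model, and proposes the SAE-TM framework for large-scale topic analysis across modalities.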
Details
Motivation: The role and practical value of sparse autoencoders in embedding analysis are debated; a clearer theoretical account and more effective ways to apply them are needed. Method: Extends LDA to embedding spaces, derives the SAE objective as a maximum a posteriori estimator, and proposes SAE-TM, which learns reusable topic atoms, interprets them as word distributions, and merges them into any number of topics. Result: On text and image datasets SAE-TM produces more coherent topics than strong baselines while maintaining diversity, and it is applied to analyze thematic structure in image data and to trace topic changes over time in artworks. Conclusion: SAEs can serve as effective tools for large-scale thematic analysis across modalities, and SAE-TM offers a way to build topics flexibly without retraining. Abstract: Sparse autoencoders (SAEs) are used to analyze embeddings, but their role and practical value are debated. We propose a new perspective on SAEs by demonstrating that they can be naturally understood as topic models. We extend Latent Dirichlet Allocation to embedding spaces and derive the SAE objective as a maximum a posteriori estimator under this model. This view implies SAE features are thematic components rather than steerable directions. Based on this, we introduce SAE-TM, a topic modeling framework that: (1) trains an SAE to learn reusable topic atoms, (2) interprets them as word distributions on downstream data, and (3) merges them into any number of topics without retraining. SAE-TM yields more coherent topics than strong baselines on text and image datasets while maintaining diversity. Finally, we analyze thematic structure in image datasets and trace topic changes over time in Japanese woodblock prints. Our work positions SAEs as effective tools for large-scale thematic analysis across modalities. Code and data will be released upon publication.
[85] BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks
Samuel Stevens
Main category: cs.CV
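TL;DR: BioBench is a new ecology vision benchmark that addresses the failure of ImageNet-1K linear-probe transfer accuracy to predict performance on scientific imagery.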
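The abstract for this entry states that BioBench reports class-balanced macro-F1. As a rough illustration of why that metric differs from plain accuracy on imbalanced ecology data, here is a minimal standard-library sketch. The function name `macro_f1` and the choice to average over the classes present in the ground truth are assumptions made for this example; this is not BioBench's own API.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true.

    Hypothetical helper for illustration; not taken from the BioBench code.
    """
    classes = sorted(set(y_true))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts the majority class "a":
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

Here accuracy is 0.75 while macro-F1 is 3/7, because the rare class "b" contributes an F1 of zero with the same weight as the common class.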
Details
Motivation: ImageNet-1K accuracy does not reliably predict how modern vision models perform on ecology tasks, which biases model evaluation. Method: Builds BioBench, a unified benchmark covering 9 application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities, and evaluates models by fitting lightweight classifiers on frozen backbones. Result: Across 46 modern vision models, ImageNet top-1 accuracy explains only 34% of the variance on ecology tasks and mis-ranks 30% of the models above 75% accuracy; a ViT-L model can be evaluated in 6 hours on an A6000 GPU. Conclusion: BioBench provides a new evaluation signal for computer vision in ecology and a template for building reliable AI-for-science benchmarks. Abstract: ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
[86] NaTex: Seamless Texture Generation as Latent Color Diffusion
Zeqiang Lai,Yunfei Zhao,Zibo Zhao,Xin Yang,Xin Huang,Jingwei Huang,Xiangyu Yue,Chunchao Guo
Main category: cs.CV
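TL;DR: Proposes NaTex, a native texture generation framework that predicts texture color directly in 3D space, avoiding the limitations of conventional multi-view diffusion models.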
Details
Motivation: Conventional methods built on 2D multi-view diffusion models have inherent weaknesses in handling occluded regions, mesh-texture alignment, and cross-view consistency. Method: Treats texture as a dense color point cloud and proposes latent color diffusion, comprising a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT), with native geometry control for precise alignment. Result: NaTex significantly outperforms previous methods in texture coherence and alignment accuracy and generalizes well to downstream tasks such as material generation, texture refinement, and part segmentation and texturing. Conclusion: With a new 3D-native texture generation paradigm, NaTex addresses the bottlenecks of multi-view diffusion models and opens a new direction for high-quality texture modeling. Abstract: We present NaTex, a native texture generation framework that predicts texture color directly in 3D space. In contrast to previous approaches that rely on baking 2D multi-view images synthesized by geometry-conditioned Multi-View Diffusion models (MVDs), NaTex avoids several inherent limitations of the MVD pipeline. These include difficulties in handling occluded regions that require inpainting, achieving precise mesh-texture alignment along boundaries, and maintaining cross-view consistency and coherence in both content and color intensity. NaTex features a novel paradigm that addresses the aforementioned issues by viewing texture as a dense color point cloud. Driven by this idea, we propose latent color diffusion, which comprises a geometry-aware color point cloud VAE and a multi-control diffusion transformer (DiT), entirely trained from scratch using 3D data, for texture reconstruction and generation. To enable precise alignment, we introduce native geometry control that conditions the DiT on direct 3D spatial information via positional embeddings and geometry latents. We co-design the VAE-DiT architecture, where the geometry latents are extracted via a dedicated geometry branch tightly coupled with the color VAE, providing fine-grained surface guidance that maintains strong correspondence with the texture. With these designs, NaTex demonstrates strong performance, significantly outperforming previous methods in texture coherence and alignment. Moreover, NaTex also exhibits strong generalization capabilities, either training-free or with simple tuning, for various downstream applications, e.g., material generation, texture refinement, and part segmentation and texturing.
[87] WWE-UIE: A Wavelet & White Balance Efficient Network for Underwater Image Enhancement
Ching-Heng Cheng,Jen-Wei Lee,Chia-Ming Lee,Chih-Chung Hsu
Main category: cs.CV
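TL;DR: This paper proposes WWE-UIE, a compact and efficient underwater image enhancement network that combines three interpretable priors, achieving competitive restoration quality with fewer parameters and less computation and supporting real-time inference on resource-constrained devices.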
Details
Motivation: Existing hybrid methods perform well but are computationally expensive and hard to use in real-time settings, so an efficient, lightweight underwater image enhancement method is needed. Method: Proposes the WWE-UIE network, which fuses three interpretable priors: adaptive white balance to alleviate wavelength-dependent color attenuation; a wavelet-based enhancement block (WEB) that performs multi-band decomposition to capture global structures and fine textures; and a gradient-aware module (SGFB) that uses Sobel operators with learnable gating to explicitly preserve edge structures. Result: Experiments on several benchmark datasets show that WWE-UIE reaches competitive restoration quality with substantially fewer parameters and FLOPs and supports real-time inference; ablation studies and visualizations confirm the contribution of each component. Conclusion: By combining interpretable physical priors with a lightweight network design, WWE-UIE greatly improves efficiency while keeping performance high, making it suitable for resource-constrained practical applications. Abstract: Underwater Image Enhancement (UIE) aims to restore visibility and correct color distortions caused by wavelength-dependent absorption and scattering. Recent hybrid approaches, which couple domain priors with modern deep neural architectures, have achieved strong performance but incur high computational cost, limiting their practicality in real-time scenarios. In this work, we propose WWE-UIE, a compact and efficient enhancement network that integrates three interpretable priors. First, adaptive white balance alleviates the strong wavelength-dependent color attenuation, particularly the dominance of blue-green tones. Second, a wavelet-based enhancement block (WEB) performs multi-band decomposition, enabling the network to capture both global structures and fine textures, which are critical for underwater restoration. Third, a gradient-aware module (SGFB) leverages Sobel operators with learnable gating to explicitly preserve edge structures degraded by scattering. Extensive experiments on benchmark datasets demonstrate that WWE-UIE achieves competitive restoration quality with substantially fewer parameters and FLOPs, enabling real-time inference on resource-limited platforms. Ablation studies and visualizations further validate the contribution of each component. The source code is available at https://github.com/chingheng0808/WWE-UIE.
[88] ChangeDINO: DINOv3-Driven Building Change Detection in Optical Remote Sensing Imagery
Ching-Heng Cheng,Chih-Chung Hsu
Main category: cs.CV
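TL;DR: This paper proposes ChangeDINO, an end-to-end multiscale Siamese framework for building change detection in optical remote sensing imagery. It fuses a lightweight backbone with frozen DINOv3 features, adds a spatial-spectral differential transformer decoder and a learnable morphology module, and outperforms existing methods on several public datasets.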
Details
Motivation: Existing deep learning methods rely mainly on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. Method: Proposes the ChangeDINO framework: (1) a multiscale Siamese structure that fuses a lightweight backbone stream with features transferred from a frozen DINOv3; (2) a spatial-spectral differential transformer decoder that uses multi-scale absolute differences as change priors; (3) a learnable morphology module that refines the upsampled logits to recover clean boundaries. Result: Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent methods in IoU and F1, and ablation studies confirm the effectiveness of each component. Conclusion: By fusing semantically rich features with change priors, ChangeDINO is more robust and accurate when labels are scarce and offers an efficient solution for remote sensing change detection. Abstract: Remote sensing change detection (RSCD) aims to identify surface changes from co-registered bi-temporal images. However, many deep learning-based RSCD methods rely solely on change-map annotations and underuse the semantic information in non-changing regions, which limits robustness under illumination variation, off-nadir views, and scarce labels. This article introduces ChangeDINO, an end-to-end multiscale Siamese framework for optical building change detection. The model fuses a lightweight backbone stream with features transferred from a frozen DINOv3, yielding semantic- and context-rich pyramids even on small datasets. A spatial-spectral differential transformer decoder then exploits multi-scale absolute differences as change priors to highlight true building changes and suppress irrelevant responses. Finally, a learnable morphology module refines the upsampled logits to recover clean boundaries. Experiments on four public benchmarks show that ChangeDINO consistently outperforms recent state-of-the-art methods in IoU and F1, and ablation studies confirm the effectiveness of each component. The source code is available at https://github.com/chingheng0808/ChangeDINO.
[89] Arbitrary-Resolution and Arbitrary-Scale Face Super-Resolution with Implicit Representation Networks
Yi Ting Tsai,Yu Wei Chen,Hong-Han Shuai,Ching-Chun Huang
Main category: cs.CV
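TL;DR: This paper proposes ARASFSR, an arbitrary-resolution and arbitrary-scale face super-resolution method that uses implicit representation networks to overcome the fixed up-sampling scales and input-size sensitivity of existing methods.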
Details
Motivation: Existing FSR methods are limited to fixed up-sampling scales and are sensitive to input size variations, making it hard to meet the varied resolution requirements of real applications. Method: ARASFSR predicts the RGB value of each target pixel from 2D deep features, local relative coordinates, and the up-sampling scale ratio; a local frequency estimation module captures high-frequency texture information to reduce spectral bias; and a global coordinate modulation module uses prior facial structure knowledge to achieve resolution adaptation. Result: Across various input sizes and up-sampling scales, ARASFSR is more robust than existing state-of-the-art methods in both quantitative and qualitative evaluations. Conclusion: ARASFSR supports arbitrary input resolutions and arbitrary up-sampling factors, giving it greater flexibility and practicality for face super-resolution. Abstract: Face super-resolution (FSR) is a critical technique for enhancing low-resolution facial images and has significant implications for face-related tasks. However, existing FSR methods are limited by fixed up-sampling scales and sensitivity to input size variations. To address these limitations, this paper introduces an Arbitrary-Resolution and Arbitrary-Scale FSR method with implicit representation networks (ARASFSR), featuring three novel designs. First, ARASFSR employs 2D deep features, local relative coordinates, and up-sampling scale ratios to predict RGB values for each target pixel, allowing super-resolution at any up-sampling scale. Second, a local frequency estimation module captures high-frequency facial texture information to reduce the spectral bias effect. Lastly, a global coordinate modulation module guides FSR to leverage prior facial structure knowledge and achieve resolution adaptation effectively. Quantitative and qualitative evaluations demonstrate the robustness of ARASFSR over existing state-of-the-art methods while super-resolving facial images across various input sizes and up-sampling scales.
[90] Aerial View River Landform Video segmentation: A Weakly Supervised Context-aware Temporal Consistency Distillation Approach
Chi-Han Chen,Chieh-Ming Chen,Wen-Huang Cheng,Ching-Chun Huang
Main category: cs.CV
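TL;DR: Proposes a weakly supervised learning method based on a teacher-student architecture, combined with key frame selection and updating algorithms, for UAV remote-sensing terrain classification; it improves both mIoU and temporal consistency while using only 30% of the labeled data.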
Details
Motivation: UAV remote-sensing terrain classification suffers from complex data annotation, poor temporal consistency, and scarce labeled data, and conventional methods for aerial positioning tasks depend on fully labeled data yet still fall short on temporal consistency. Method: A teacher-student architecture with key frame selection and key frame updating algorithms performs weakly supervised learning and temporal-consistency knowledge distillation, reducing the dependence on fully labeled data and improving model stability. Result: Experiments show that, using only 30% of the labeled data, the method improves both mIoU and temporal consistency, clearly outperforming conventional training on fully labeled data and giving stable localization of terrain objects. Conclusion: The proposed framework addresses low annotation efficiency and temporal inconsistency in UAV remote-sensing terrain classification and offers an efficient, feasible weakly supervised solution for aerial remote sensing under limited resources. Abstract: The study of terrain and landform classification through UAV remote sensing diverges significantly from ground vehicle patrol tasks. Besides grappling with the complexity of data annotation and ensuring temporal consistency, it also confronts the scarcity of relevant data and the limitations imposed by the effective range of many technologies. This research substantiates that, in aerial positioning tasks, both the mean Intersection over Union (mIoU) and temporal consistency (TC) metrics are of paramount importance. It is demonstrated that fully labeled data is not the optimal choice, as selecting only key data lacks the enhancement in TC, leading to failures. Hence, a teacher-student architecture, coupled with key frame selection and key frame updating algorithms, is proposed. This framework successfully performs weakly supervised learning and TC knowledge distillation, overcoming the deficiencies of traditional TC training in aerial tasks. The experimental results reveal that our method, utilizing merely 30% of labeled data, concurrently elevates mIoU and temporal consistency, ensuring stable localization of terrain objects. Result demo: https://gitlab.com/prophet.ai.inc/drone-based-riverbed-inspection
[91] CRISTAL: Real-time Camera Registration in Static LiDAR Scans using Neural Rendering
Joni Vanherck,Steven Moonen,Brent Zoomers,Kobe Werner,Jeroen Put,Lode Jorissen,Nick Michiels
Main category: cs.CV
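TL;DR: This paper proposes a real-time camera localization method based on a highly accurate colored LiDAR point cloud. Neural rendering narrows the domain gap between synthetic and real images, giving drift-free camera tracking with correct metric scale, and the method outperforms existing SLAM methods on the ScanNet++ dataset.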
Details
Motivation: Existing visual localization methods often suffer from drift and scale ambiguity and depend on fiducials or loop closure, so they struggle to deliver the accurate, stable localization that robotics and Extended Reality (XR) require. Method: Synthetic views are rendered from a pre-captured, highly accurate colored LiDAR point cloud to establish 2D-3D correspondences between live frames and the point cloud; a neural rendering technique reduces the domain gap between synthetic and real images to improve feature matching. Two real-time variants are proposed: Online Render and Match, and Prebuild and Localize. Result: The method achieves drift-free camera tracking with correct scale in the global LiDAR coordinate system and outperforms existing SLAM systems on the ScanNet++ dataset. Conclusion: The method resolves the drift and scale problems of visual localization and achieves accurate real-time localization without loop closure, making it suitable for robotics and XR applications. Abstract: Accurate camera localization is crucial for robotics and Extended Reality (XR), enabling reliable navigation and alignment of virtual and real content. Existing visual methods often suffer from drift, scale ambiguity, and depend on fiducials or loop closure. This work introduces a real-time method for localizing a camera within a pre-captured, highly accurate colored LiDAR point cloud. By rendering synthetic views from this cloud, 2D-3D correspondences are established between live frames and the point cloud. A neural rendering technique narrows the domain gap between synthetic and real images, reducing occlusion and background artifacts to improve feature matching. The result is drift-free camera tracking with correct metric scale in the global LiDAR coordinate system. Two real-time variants are presented: Online Render and Match, and Prebuild and Localize. We demonstrate improved results on the ScanNet++ dataset and outperform existing SLAM pipelines.
[92] Multi-Order Matching Network for Alignment-Free Depth Super-Resolution
Zhengxue Wang,Zhiqiang Yan,Yuan Wu,Guangwei Gao,Xiang Li,Jian Yang
Main category: cs.CV
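TL;DR: Proposes the alignment-free Multi-Order Matching Network (MOMNet) to handle the misalignment between RGB and depth maps that hardware limitations and calibration drift cause in real scenes, achieving state-of-the-art depth super-resolution performance.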
Details
Motivation: Existing methods rely on strictly aligned RGB-D data, and their performance degrades in real scenes where sensors are physically separate and calibration drifts, so an alignment-robust method is needed. Method: Proposes MOMNet, which includes a multi-order matching mechanism (zero-order, first-order, and second-order) that matches RGB and depth information across multi-order feature spaces, and a multi-order aggregation module built from structure detectors that uses multi-order priors for selective feature transfer. Result: Experiments show that MOMNet reaches state-of-the-art depth super-resolution performance across a range of real misaligned scenes and is highly robust. Conclusion: MOMNet overcomes the challenges posed by RGB-D sensor misalignment and provides a robust, efficient depth super-resolution solution suited to practical deployment. Abstract: Recent guided depth super-resolution methods are premised on the assumption of strict spatial alignment between depth and RGB, achieving high-quality depth reconstruction. However, in real-world scenarios, the acquisition of strictly aligned RGB-D is hindered by inherent hardware limitations (e.g., physically separate RGB-D sensors) and unavoidable calibration drift induced by mechanical vibrations or temperature variations. Consequently, existing approaches often suffer inevitable performance degradation when applied to misaligned real-world scenes. In this paper, we propose the Multi-Order Matching Network (MOMNet), a novel alignment-free framework that adaptively retrieves and selects the most relevant information from misaligned RGB. Specifically, our method begins with a multi-order matching mechanism, which jointly performs zero-order, first-order, and second-order matching to comprehensively identify RGB information consistent with depth across multi-order feature spaces. To effectively integrate the retrieved RGB and depth, we further introduce a multi-order aggregation composed of multiple structure detectors. This strategy uses multi-order priors as prompts to facilitate the selective feature transfer from RGB to depth. Extensive experiments demonstrate that MOMNet achieves state-of-the-art performance and exhibits outstanding robustness.
[93] DetailSemNet: Elevating Signature Verification through Detail-Semantic Integration
Meng-Cheng Shih,Tsai-Ling Huang,Yu-Heng Shih,Hong-Han Shuai,Hsuan-Tung Liu,Yi-Ren Yeh,Ching-Chun Huang
Main category: cs.CV
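TL;DR: This paper proposes DetailSemNet, a new model for offline signature verification that emphasizes fine-grained differences and improves verification accuracy through local structure matching.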
Details
Motivation: Existing methods mostly compare holistic features and overlook fine-grained differences, which limits verification performance. In addition, transformer-based backbones may naturally obscure local details and hurt performance. Method: Proposes DetailSemNet with a Detail Semantics Integrator module that uses feature disentanglement and re-entanglement to enhance local details and expand discriminative semantics, enabling more effective local structure matching. Result: On several mainstream offline signature verification benchmarks, DetailSemNet outperforms recent methods by clear margins and generalizes well in cross-dataset tests. Conclusion: By focusing on local structure matching, DetailSemNet improves verification accuracy and interpretability and generalizes well, giving it strong potential for practical use. Abstract: Offline signature verification (OSV) is a frequently utilized technology in forensics. This paper proposes a new model, DetailSemNet, for OSV. Unlike previous methods that rely on holistic features for pair comparisons, our approach underscores the significance of fine-grained differences for robust OSV. We propose to match local structures between two signature images, significantly boosting verification accuracy. Furthermore, we observe that without specific architectural modifications, transformer-based backbones might naturally obscure local details, adversely impacting OSV performance. To address this, we introduce a Detail Semantics Integrator, leveraging feature disentanglement and re-entanglement. This integrator is specifically designed to enhance intricate details while simultaneously expanding discriminative semantics, thereby augmenting the efficacy of local structural matching. We evaluate our method against leading benchmarks in offline signature verification. Our model consistently outperforms recent methods, achieving state-of-the-art results with clear margins. The emphasis on local structure matching not only improves performance but also enhances the model's interpretability, supporting our findings. Additionally, our model demonstrates remarkable generalization capabilities in cross-dataset testing scenarios. The combination of generalizability and interpretability significantly bolsters the potential of DetailSemNet for real-world applications.
[94] CAMS: Towards Compositional Zero-Shot Learning via Gated Cross-Attention and Multi-Space Disentanglement
Pan Yang,Cheng Deng,Jing Yang,Han Zhao,Yun Liu,Yuling Chen,Xiaoli Ruan,Yanping Chen
Main category: cs.CV
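TL;DR: This paper proposes CAMS, a new compositional zero-shot learning method that uses gated cross-attention and multi-space disentanglement to extract semantic information from visual features and disentangle attributes from objects, markedly improving generalization to unseen attribute-object compositions.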
Details
Motivation: Existing CLIP-based CZSL methods rely on the global image representation, which cannot fully disentangle attributes from objects and limits recognition of unseen compositions. Method: Proposes the CAMS framework, comprising a Gated Cross-Attention module (which extracts fine-grained semantic features from CLIP's high-level encoding blocks while suppressing background interference) and a Multi-Space Disentanglement module (which separates attribute and object semantics in multidimensional spaces). Result: On the MIT-States, UT-Zappos, and C-GQA benchmarks, CAMS reaches state-of-the-art performance in both closed-world and open-world settings. Conclusion: By extracting and disentangling semantic features in multidimensional spaces, CAMS improves generalization in compositional zero-shot learning and supports the value of fine-grained feature modeling and deep disentanglement. Abstract: Compositional zero-shot learning (CZSL) aims to learn the concepts of attributes and objects in seen compositions and to recognize their unseen compositions. Most Contrastive Language-Image Pre-training (CLIP)-based CZSL methods focus on disentangling attributes and objects by leveraging the global semantic representation obtained from the image encoder. However, this representation has limited representational capacity and does not allow for complete disentanglement of the two. To this end, we propose CAMS, which aims to extract semantic features from visual features and perform semantic disentanglement in multidimensional spaces, thereby improving generalization over unseen attribute-object compositions. Specifically, CAMS designs a Gated Cross-Attention that captures fine-grained semantic features from the high-level image encoding blocks of CLIP through a set of latent units, while adaptively suppressing background and other irrelevant information. Subsequently, it conducts Multi-Space Disentanglement to achieve disentanglement of attribute and object semantics. Experiments on three popular benchmarks (MIT-States, UT-Zappos, and C-GQA) demonstrate that CAMS achieves state-of-the-art performance in both closed-world and open-world settings. The code is available at https://github.com/ybyangjing/CAMS.
[95] End-to-End Motion Capture from Rigid Body Markers with Geodesic Loss
Hai Lan,Zongyan Li,Jianmin Hu,Jialing Yang,Houde Dai
Main category: cs.CV
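TL;DR: Proposes a new optical motion capture method based on Rigid Body Markers (RBMs) that combines deep learning with a geodesic loss for efficient, high-accuracy, real-time motion capture.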
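The geodesic loss named in this entry is commonly built on the rotation angle between a predicted and a target rotation. The sketch below shows that standard angle computation with the standard library only. The function names are hypothetical, and the paper's exact loss (its reduction over joints, weighting, and numerical handling) is not given in the abstract, so treat this as an assumption-laden illustration, not the authors' implementation.

```python
import math


def geodesic_distance(R1, R2):
    """Angle in radians of the relative rotation R1^T R2 (3x3 nested lists)."""
    # trace(R1^T R2) equals the sum of elementwise products of R1 and R2.
    trace = sum(R1[i][j] * R2[i][j] for i in range(3) for j in range(3))
    cos_angle = (trace - 1.0) / 2.0
    # Clamp so rounding error cannot push the value outside acos's domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle)


def rot_z(theta):
    """Rotation by theta radians about the z-axis (helper for the example)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Two rotations about the same axis differ by the difference of their angles.
print(geodesic_distance(rot_z(0.3), rot_z(1.0)))
```

Unlike an elementwise error on rotation matrices, this quantity measures distance along the rotation manifold, which is what "manifold-aware" refers to in the abstract. In a training setting the gradient of `acos` grows without bound as the angle approaches zero, so practical implementations usually clamp slightly inside the interval or use a smooth surrogate.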
Details
Motivation: Conventional motion capture with dense markers suffers from time-consuming preparation and marker identification ambiguity, which limits its scalability. Method: Introduces the Rigid Body Marker (RBM) as the basic unit, providing unambiguous 6-DoF data, and builds a deep-learning-based end-to-end regression model that directly estimates SMPL parameters under a geodesic loss. Result: The model, trained on data synthesized from AMASS, reaches state-of-the-art accuracy in pose estimation; data from a real Vicon system confirms the method's practicality, and computation drops by more than an order of magnitude. Conclusion: Combining sparse 6-DoF RBMs with a manifold-aware geodesic loss provides a practical, high-fidelity real-time motion capture solution for graphics, virtual reality, and biomechanics. Abstract: Marker-based optical motion capture (MoCap), while long regarded as the gold standard for accuracy, faces practical challenges, such as time-consuming preparation and marker identification ambiguity, due to its reliance on dense marker configurations, which fundamentally limit its scalability. To address this, we introduce a novel fundamental unit for MoCap, the Rigid Body Marker (RBM), which provides unambiguous 6-DoF data and drastically simplifies setup. Leveraging this new data modality, we develop a deep-learning-based regression model that directly estimates SMPL parameters under a geodesic loss. This end-to-end approach matches the performance of optimization-based methods while requiring over an order of magnitude less computation. Trained on synthesized data from the AMASS dataset, our end-to-end model achieves state-of-the-art accuracy in body pose estimation. Real-world data captured using a Vicon optical tracking system further demonstrates the practical viability of our approach. Overall, the results show that combining sparse 6-DoF RBM with a manifold-aware geodesic loss yields a practical and high-fidelity solution for real-time MoCap in graphics, virtual reality, and biomechanics.
[96] CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation
Samer Abualhanud,Christian Grannemann,Max Mehltretter
Main category: cs.CV
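TL;DR: Proposes a geometry-guided self-supervised method for calibrated multi-camera rigs that projects 3D points onto a shared cylinder to improve the consistency and accuracy of cross-view depth estimation.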
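To make the shared-cylinder idea concrete, the sketch below maps a 3D point to an (azimuth, height) position on a unit cylinder and measures distances there with azimuth wrap-around. The axis convention (cylinder axis along y, central projection from the rig origin) and both function names are assumptions for illustration; the abstract does not specify the paper's parameterization.

```python
import math


def project_to_unit_cylinder(point):
    """Map (x, y, z) in the rig frame to (azimuth, height) on a unit cylinder.

    Assumes the cylinder axis is the y-axis and the point is not on it.
    """
    x, y, z = point
    radius = math.hypot(x, z)
    if radius == 0.0:
        raise ValueError("point lies on the cylinder axis")
    azimuth = math.atan2(x, z)  # angle around the axis, in (-pi, pi]
    height = y / radius         # where the ray from the origin meets radius 1
    return azimuth, height


def cylinder_distance(a, b):
    """Distance between two cylinder positions, wrapping the azimuth."""
    d_az = abs(a[0] - b[0])
    d_az = min(d_az, 2.0 * math.pi - d_az)
    return math.hypot(d_az, a[1] - b[1])


# The same 3D point gets the same cylinder position regardless of which
# camera observed it, which is what links pixels across overlapping images.
p = project_to_unit_cylinder((2.0, 2.0, 0.0))
q = project_to_unit_cylinder((4.0, 4.0, 0.0))  # same ray, farther away
print(p, cylinder_distance(p, q))
```

A distance of this kind could then drive a fixed, non-learned weighting of features between nearby positions, in the spirit of the explicit spatial attention the abstract describes.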
Details
Motivation: Existing self-supervised surround-view depth estimation methods predict inconsistent depth between overlapping images, which degrades 3D perception. Method: Using the camera intrinsics and extrinsics, the 3D points predicted from each image are projected onto a shared unit cylinder to build a 2D position map per image; a non-learned spatial attention then aggregates features across images according to distances on the cylinder to refine the depth maps. Result: On the DDAD and nuScenes datasets, the method improves cross-image depth consistency and overall depth accuracy over existing methods. Conclusion: The method improves the cross-view consistency and metric accuracy of self-supervised depth estimation for multi-camera rigs and suits low-cost, wide-coverage 360° 3D perception. Abstract: Self-supervised surround-view depth estimation enables dense, low-cost 3D perception with a 360° field of view from multiple minimally overlapping images. Yet, most existing methods suffer from depth estimates that are inconsistent between overlapping images. Addressing this limitation, we propose a novel geometry-guided method for calibrated, time-synchronized multi-camera rigs that predicts dense, metric, and cross-view-consistent depth. Given the intrinsic and relative orientation parameters, a first depth map is predicted per image and the so-derived 3D points from all images are projected onto a shared unit cylinder, establishing neighborhood relations across different images. This produces a 2D position map for every image, where each pixel is assigned its projected position on the cylinder. Based on these position maps, we apply an explicit, non-learned spatial attention that aggregates features among pixels across images according to their distances on the cylinder, to predict a final depth map per image. Evaluated on the DDAD and nuScenes datasets, our approach improves the consistency of depth estimates across images and the overall depth compared to state-of-the-art methods.
[97] Graph Neural Networks for Surgical Scene Segmentation
Yihan Li,Nikhil Churamani,Maria Robu,Imanol Luengo,Danail Stoyanov
Main category: cs.CV
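TL;DR: Proposes graph-neural-network-based segmentation that combines a Vision Transformer with GNNs to improve recognition of hepatocystic anatomy in laparoscopic surgery scenes, with particularly strong results on thin and safety-critical structures.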
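The first model in this entry builds a static k-nearest-neighbour graph before graph convolution. A minimal sketch of that construction step follows, assuming each node is a feature vector (for instance one ViT patch embedding) and that neighbours are chosen by Euclidean distance. The function name and the distance choice are assumptions; the abstract does not state which metric or node definition the authors use.

```python
import math


def knn_graph(features, k):
    """Directed edges (i, j) linking each node i to its k nearest neighbours."""
    edges = []
    for i, fi in enumerate(features):
        # Rank every other node by Euclidean distance to node i.
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        edges.extend((i, j) for _, j in dists[:k])
    return edges


# Four toy 2D "patch features" forming two tight pairs.
nodes = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
print(knn_graph(nodes, k=1))
```

The resulting edge list is fixed once per image, in contrast to the second model's Differentiable Graph Generator, which the abstract describes as learning the topology adaptively. This brute-force version costs O(n^2 log n) and is meant only to show the structure of the step.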
Details
Motivation: Conventional deep learning models struggle with occlusions, long-range dependencies, and the fine geometry of rare structures, which affects surgical safety, so a more robust segmentation method is needed. Method: Designs two models that combine a ViT encoder with a GNN: one based on a static k-nearest-neighbour graph and GCNII for stable long-range information propagation, the other based on a dynamic Differentiable Graph Generator (DGG) and GAT for adaptive topology learning. Result: On the Endoscapes-Seg50 and CholecSeg8k datasets, mIoU improves by up to 7-8% and mDice by 6%, with more anatomically coherent predictions on rare, thin, and safety-critical structures. Conclusion: By combining the global context of ViT with the relational reasoning of GNNs, the proposed graph-based segmentation methods improve the performance and reliability of surgical scene segmentation and support safer laparoscopic and robot-assisted surgery. Abstract: Purpose: Accurate identification of hepatocystic anatomy is critical to preventing surgical complications during laparoscopic cholecystectomy. Deep learning models often struggle with occlusions, long-range dependencies, and capturing the fine-scale geometry of rare structures. This work addresses these challenges by introducing graph-based segmentation approaches that enhance spatial and semantic understanding in surgical scene analyses. Methods: We propose two segmentation models integrating Vision Transformer (ViT) feature encoders with Graph Neural Networks (GNNs) to explicitly model spatial relationships between anatomical regions. (1) A static k Nearest Neighbours (k-NN) graph with a Graph Convolutional Network with Initial Residual and Identity Mapping (GCNII) enables stable long-range information propagation. (2) A dynamic Differentiable Graph Generator (DGG) with a Graph Attention Network (GAT) supports adaptive topology learning. Both models are evaluated on the Endoscapes-Seg50 and CholecSeg8k benchmarks. Results: The proposed approaches achieve up to 7-8% improvement in Mean Intersection over Union (mIoU) and 6% improvement in Mean Dice (mDice) scores over state-of-the-art baselines. They produce anatomically coherent predictions, particularly on thin, rare and safety-critical structures. Conclusion: The proposed graph-based segmentation methods enhance both performance and anatomical consistency in surgical scene segmentation. By combining ViT-based global context with graph-based relational reasoning, the models improve interpretability and reliability, paving the way for safer laparoscopic and robot-assisted surgery through a precise identification of critical anatomical features.
[98] Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation
Jin Wang,Bingfeng Zhang,Jian Pang,Mengyu Liu,Honglong Chen,Weifeng Liu
Main category: cs.CV
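TL;DR: This paper proposes a Language-Driven Attribute Generalization (LDAG) architecture to address the inaccurate guidance that support images give in few-shot segmentation because of intra-class variation. A large language model generates multi-attribute descriptions of the target class, and multi-modal matching and cross-modal alignment improve segmentation, reaching state-of-the-art results.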
Details
Motivation: Existing few-shot segmentation methods extract meta information from support images, but intra-class visual variation makes that guidance inaccurate and biased for untrained classes. The paper explores a new paradigm that provides unbiased meta guidance from language descriptions instead of support images. Method: Proposes the LDAG framework with two core modules: (1) a Multi-attribute Enhancement (MaE) module, which uses a large language model to generate detailed attribute descriptions of the target class and builds refined visual-text prior guidance; (2) a Multi-modal Attribute Alignment (MaA) module, which mitigates the text-vision modality gap and enables cross-modal interaction between attribute texts and visual features. Result: Experiments on standard few-shot segmentation datasets show that the method clearly outperforms existing methods and sets a new state of the art. Conclusion: Using language-driven attribute descriptions as unbiased meta guidance overcomes the limitations of support-image-based methods and offers a new direction for few-shot segmentation. Abstract: Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples by a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential; the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture to utilize inherent target property language descriptions to build a robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, because text-vision modal shift makes it hard for attribute text to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) to achieve cross-modal interaction between attribute texts and visual features. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves new state-of-the-art performance. The code will be released.
[99] StreetView-Waste: A Multi-Task Dataset for Urban Waste Management
Diogo J. Paulo,João Martins,Hugo Proença,João C. Neves
Main category: cs.CV
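TL;DR: This paper presents StreetView-Waste, a comprehensive dataset of urban waste scenes that supports three tasks: waste container detection, container tracking, and overflow segmentation. It provides baselines built on state-of-the-art models and two improvement strategies, a heuristic-based tracking refinement and a segmentation enhancement that uses geometric priors. Experiments show that these strategies substantially improve container counting accuracy and the segmentation performance of lightweight models.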
Details
Motivation: Existing litter detection datasets do not support tracking waste containers or monitoring overflow in real urban environments, and most are captured in static scenes, which limits their use in real-world logistics. A dataset and methods closer to practical use are needed to advance smart urban waste management.
Method: The authors introduce the StreetView-Waste dataset, with annotations of waste containers and overflowing litter from real street-level views; design baseline models for three evaluation tasks (detection, tracking, segmentation); introduce a heuristic-based tracking refinement to reduce container counting error; and design a model-agnostic geometric-prior framework to improve litter segmentation accuracy.
Result: Fine-tuned detectors perform well on container detection, but baseline trackers show large counting errors. The proposed heuristics reduce the mean absolute counting error by 79.6%. For segmentation, the geometry-aware strategy improves mAP@0.5 by 27% on lightweight models.
Conclusion: StreetView-Waste provides a more challenging and realistic benchmark for urban waste management. The proposed strategies improve performance on the key tasks and support the value of combining multimodal information with domain knowledge in practical perception systems.
Abstract: Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.
[100] VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference
Ziyan Liu,Yeqiu Chen,Hongyi Cai,Tao Lin,Shuo Yang,Zheng Liu,Bo Zhao
Main category: cs.CV
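TL;DR: Proposes VLA-Pruner, a dual-level token pruning method for Vision-Language-Action (VLA) models that accounts for both semantic understanding and action execution, improving inference efficiency while preserving performance.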
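As a rough illustration of the dual-level selection summarized in this entry, the sketch below smooths per-token action attention over time with an exponential moving average and then fills a token budget from two rankings. It is a minimal sketch under stated assumptions, not VLA-Pruner's actual algorithm: the function names, the EMA form of the temporal smoothing, and the fixed `semantic_share` split are all hypothetical.

```python
def ema_update(prev, current, alpha=0.5):
    """Temporal smoothing (assumed here to be an EMA) of per-token action attention."""
    if prev is None:
        return list(current)
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(prev, current)]


def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def select_tokens(semantic_scores, action_scores, budget, semantic_share=0.5):
    """Keep `budget` visual tokens: some by semantic score, the rest by action score.

    `semantic_share` is a hypothetical knob; the paper describes an adaptive strategy.
    """
    n_semantic = int(budget * semantic_share)
    keep = set(top_k(semantic_scores, n_semantic))
    # Fill the remaining slots with the best action-level tokens not already kept.
    for i in top_k(action_scores, len(action_scores)):
        if len(keep) >= budget:
            break
        keep.add(i)
    return sorted(keep)


# Toy example: 5 visual tokens, two consecutive control steps.
prefill_attention = [0.9, 0.1, 0.8, 0.2, 0.05]
smoothed = ema_update(None, [0.0, 0.1, 0.0, 0.9, 0.7])
smoothed = ema_update(smoothed, [0.0, 0.3, 0.0, 0.7, 0.9])
kept = select_tokens(prefill_attention, smoothed, budget=4)
```

In this toy case tokens 0 and 2 are kept for semantic relevance and tokens 3 and 4 for action relevance, which a purely semantic ranking would have dropped.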
Details
Motivation: Existing token pruning methods based on semantic salience overlook the dual needs of VLA models, high-level semantic understanding and low-level action execution, so they discard information that is critical for actions and hurt real performance.
Method: The authors design a dual-level importance criterion: vision-language prefill attention measures semantic relevance, and action decode attention, estimated through temporal smoothing, measures action importance. Based on this criterion, an adaptive dual-level token selection strategy keeps the most informative visual tokens within the compute budget.
Result: Across several VLA architectures and different robotic tasks, VLA-Pruner speeds up inference while maintaining or even improving task performance, clearly outperforming existing pruning methods.
Conclusion: By combining semantic and action importance, VLA-Pruner balances efficiency and performance and serves as an efficient, plug-and-play token pruning scheme for VLA models.
Abstract: Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA's intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner proposes a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.
[101] LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs
Doriand Petit,Steve Bourgeois,Vincent Gay-Bellile,Florian Chabot,Loïc Barthe
Main category: cs.CV
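TL;DR: Proposes LLaVA^3, a new method that improves the 3D scene understanding of vision-language models using only multi-view 2D images and no fine-tuning.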
Details
Motivation: 3D training data is limited while 2D vision-language data is abundant, which makes it hard to build multi-modal language models that understand 3D scenes.
Method: Inspired by Cubist painting, the method performs an intermediate 3D reconstruction from multi-view 2D images and generates an omnidirectional visual representation of each object, which is used to strengthen the VLM's understanding of the 3D scene without fine-tuning.
Result: On 3D visual question answering and 3D language grounding, the method outperforms previous 2D-based VLM approaches.
Conclusion: LLaVA^3 uses omnidirectional visual representations to improve the 3D scene understanding of existing VLMs without extra 3D annotations or model fine-tuning.
Abstract: Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLMs). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLMs using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.
[102] FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry
Clemens Pollak,Kersten Diers,Santiago Estrada,David Kügler,Martin Reuter
Main category: cs.CV
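TL;DR: Proposes FastSurfer-CC, an efficient, fully automated framework for corpus callosum morphometry that performs segmentation, standardization, and feature extraction, and that detects significant differences in Huntington's disease patients that existing methods miss.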
Details
Motivation: Publicly available tools for comprehensive, automated analysis of the corpus callosum are lacking, which limits its use in research on aging and neurological diseases and in clinical trials.
Method: The FastSurfer-CC framework automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head position, and produces thickness profiles, subdivisions, and eight shape metrics.
Result: FastSurfer-CC outperforms existing specialized tools on the individual sub-tasks and detects statistically significant differences between Huntington's disease patients and healthy controls that the current state of the art does not find.
Conclusion: FastSurfer-CC is an efficient, fully automated corpus callosum analysis tool with higher sensitivity, suited to research on neurodegenerative diseases and to clinical trials.
Abstract: The corpus callosum, the largest commissural structure in the human brain, is a central focus in research on aging and neurological diseases. It is also a critical target for interventions such as deep brain stimulation and serves as an important biomarker in clinical trials, including those investigating remyelination therapies. Despite extensive research on corpus callosum segmentation, few publicly available tools provide a comprehensive and automated analysis pipeline. To address this gap, we present FastSurfer-CC, an efficient and fully automated framework for corpus callosum morphometry. FastSurfer-CC automatically identifies mid-sagittal slices, segments the corpus callosum and fornix, localizes the anterior and posterior commissures to standardize head positioning, generates thickness profiles and subdivisions, and extracts eight shape metrics for statistical analysis. We demonstrate that FastSurfer-CC outperforms existing specialized tools across the individual tasks. Moreover, our method reveals statistically significant differences between Huntington's disease patients and healthy controls that are not detected by the current state-of-the-art.
[103] Flow and Depth Assisted Video Prediction with Latent Transformer
Eliyas Suleyman,Paul Henderson,Eksan Firkat,Nicolas Pugeault
Main category: cs.CV
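TL;DR: This paper studies how point flow and depth maps can improve video prediction under occlusion and background motion. It proposes a model that combines these modalities and validates it on synthetic and real-world datasets.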
Details
Motivation: Occlusion remains an inherent challenge in video prediction, and existing models are limited in complex scenes, so motion and structure information is needed to improve prediction accuracy.
Method: Building on a multi-object latent transformer architecture, the authors add point flow and depth maps as extra inputs and systematically study their effect on occluded video prediction.
Result: The model that uses point flow and depth outperforms appearance-only models in occluded scenes and in predicting background motion, and it scores better on metrics such as the Wasserstein distance on object masks.
Conclusion: Explicit motion and geometric structure information improves video prediction under occlusion and dynamic backgrounds and offers a workable direction for future research.
Abstract: Video prediction is a fundamental task for various downstream applications, including robotics and world modeling. Although general video prediction models have achieved remarkable performance in standard scenarios, occlusion is still an inherent challenge in video prediction. We hypothesize that providing explicit information about motion (via point-flow) and geometric structure (via depth-maps) will enable video prediction models to perform better in situations with occlusion and background motion. To investigate this, we present the first systematic study dedicated to occluded video prediction. We use a standard multi-object latent transformer architecture to predict future frames, but modify this to incorporate information from depth and point-flow. We evaluate this model in a controlled setting on both synthetic and real-world datasets with not only appearance-based metrics but also Wasserstein distances on object masks, which can effectively measure the motion distribution of the prediction. We find that when the prediction model is assisted with point flow and depth, it performs better in occluded scenarios and predicts more accurate background motion compared to models without the help of these modalities.
[104] Physics-Informed Machine Learning for Efficient Sim-to-Real Data Augmentation in Micro-Object Pose Estimation
Zongcai Tan,Lan Wei,Dandan Zhang
Main category: cs.CV
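TL;DR: Proposes a generative adversarial network framework that combines wave-optics physical rendering with depth alignment to efficiently generate high-fidelity microscope images, enabling microrobot pose estimation without large amounts of real data.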
Details
Motivation: Existing methods depend on large, well-annotated microscope image datasets that are hard to obtain, and conventional simulation struggles to reproduce complex optical phenomena such as diffraction and depth-dependent imaging, which limits sim-to-real transfer.
Method: Wave-optics-based physical rendering and a depth alignment mechanism are embedded in a generative adversarial network (GAN) to produce realistic microscope images. The synthetic data is used to train a CNN pose estimator and supports generalization to new poses and data augmentation.
Result: Compared with purely AI-driven methods, the structural similarity index (SSIM) improves by 35.6%, with a generation time of only 0.022 s per frame. Pitch/roll estimation reaches 93.9%/91.9% accuracy, only 5.0%/5.4% below training on real data alone, and the model generalizes to unseen poses.
Conclusion: The physics-informed generative framework synthesizes high-quality microscope images efficiently, greatly reduces the dependence on real data, and offers a workable digital-twin solution for precise microrobot pose estimation.
Abstract: Precise pose estimation of optical microrobots is essential for enabling high-precision object tracking and autonomous biological studies. However, current methods rely heavily on large, high-quality microscope image datasets, which are difficult and costly to acquire due to the complexity of microrobot fabrication and the labour-intensive labelling. Digital twin systems offer a promising path for sim-to-real data augmentation, yet existing techniques struggle to replicate complex optical microscopy phenomena, such as diffraction artifacts and depth-dependent imaging. This work proposes a novel physics-informed deep generative learning framework that, for the first time, integrates wave optics-based physical rendering and depth alignment into a generative adversarial network (GAN), to synthesise high-fidelity microscope images for microrobot pose estimation efficiently. Our method improves the structural similarity index (SSIM) by 35.6% compared to purely AI-driven methods, while maintaining real-time rendering speeds (0.022 s/frame). The pose estimator (CNN backbone) trained on our synthetic data achieves 93.9%/91.9% (pitch/roll) accuracy, just 5.0%/5.4% (pitch/roll) below that of an estimator trained exclusively on real data. Furthermore, our framework generalises to unseen poses, enabling data augmentation and robust pose estimation for novel microrobot configurations without additional training data.
[105] Acquisition Time-Informed Breast Tumor Segmentation from Dynamic Contrast-Enhanced MRI
Rui Wang,Yuexi Du,John Lewin,R. Todd Constable,Nicha C. Dvornek
Main category: cs.CV
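TL;DR: Proposes a tumor segmentation method that uses image acquisition time to modulate model features through FiLM layers, improving tumor segmentation performance and model generalization on DCE-MRI.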
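Feature-wise linear modulation (FiLM) scales and shifts each feature channel with parameters predicted from a conditioning signal, here the acquisition time. The sketch below shows only that core operation. The single linear layer that maps time to `gamma` and `beta`, and all weight values, are hypothetical; the entry does not specify how the authors' conditioning network is built.

```python
def film_params(t_seconds, w_gamma, b_gamma, w_beta, b_beta):
    """Map a scalar acquisition time to per-channel (gamma, beta).

    A single linear layer is assumed purely for illustration.
    """
    gamma = [w * t_seconds + b for w, b in zip(w_gamma, b_gamma)]
    beta = [w * t_seconds + b for w, b in zip(w_beta, b_beta)]
    return gamma, beta


def film(features, gamma, beta):
    """Apply FiLM: every value in channel c becomes gamma[c] * x + beta[c]."""
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Two channels with two values each, conditioned on a time of 2.0 (arbitrary units).
features = [[1.0, 2.0], [3.0, 4.0]]
gamma, beta = film_params(
    2.0,
    w_gamma=[0.5, 0.0], b_gamma=[1.0, 1.0],
    w_beta=[0.0, 1.0], b_beta=[0.0, -1.0],
)
modulated = film(features, gamma, beta)
```

Because the same small mapping is applied to each image, one per acquisition time, this kind of conditioning works with any number of post-contrast images per study.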
Details
Motivation: Differences in acquisition protocols and between individuals cause large variation in tissue appearance in DCE-MRI images, which makes automated tumor segmentation difficult.
Method: Feature-wise linear modulation (FiLM) layers inject acquisition time information into the model. Models with different backbone networks are trained and evaluated on a large multisite breast DCE-MRI dataset.
Result: Experiments on in-domain data and a public out-of-domain dataset show that adding phase acquisition time information improves tumor segmentation performance and model generalization.
Conclusion: FiLM modulation with acquisition time information improves the accuracy and robustness of breast DCE-MRI tumor segmentation.
Abstract: Dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) plays an important role in breast cancer screening, tumor assessment, and treatment planning and monitoring. The dynamic changes in contrast in different tissues help to highlight the tumor in post-contrast images. However, varying acquisition protocols and individual factors result in large variation in the appearance of tissues, even for images acquired in the same phase (e.g., first post-contrast phase), making automated tumor segmentation challenging. Here, we propose a tumor segmentation method that leverages knowledge of the image acquisition time to modulate model features according to the specific acquisition sequence. We incorporate the acquisition times using feature-wise linear modulation (FiLM) layers, a lightweight method for incorporating temporal information that also allows for capitalizing on the full, variable number of images acquired per imaging study. We trained baseline models and different configurations of the time-modulated models with varying backbone architectures on a large public multisite breast DCE-MRI dataset. Evaluation on in-domain images and a public out-of-domain dataset showed that incorporating knowledge of phase acquisition time improved tumor segmentation performance and model generalization.
[106] YOWO: You Only Walk Once to Jointly Map An Indoor Scene and Register Ceiling-mounted Cameras
Fan Yang,Sosuke Yamao,Ikuo Kusajima,Atsunori Moteki,Shoichi Masui,Shan Jiang
Main category: cs.CV
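TL;DR: Proposes a new method that jointly maps an indoor scene and registers ceiling-mounted cameras (CMCs). A mobile agent carrying an RGB-D camera traverses the scene, and synchronized CMC videos are used for trajectory association and joint optimization, improving both tasks.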
Details
Motivation: Manual registration of ceiling-mounted cameras (CMCs) is inefficient, and automatic visual localization performs poorly under visual ambiguity. The goal is efficient and accurate CMC registration together with scene mapping.
Method: A mobile agent with a head-mounted RGB-D camera traverses the scene, producing egocentric video that is used to build the scene layout and world-coordinate trajectories. Synchronized CMC videos provide pseudo-scale trajectories and relative poses. All trajectories are aligned by timestamp, and a factor graph jointly optimizes the ego-camera poses, the scene layout, and the CMC poses.
Result: Experiments on a newly built dataset show that the method accomplishes scene mapping and CMC registration within one unified framework, that joint optimization improves both, and that it provides a reliable tool for position-aware applications.
Conclusion: The method achieves efficient collaborative scene mapping and CMC registration, addresses the efficiency and accuracy shortcomings of existing methods, and has practical value.
Abstract: Using ceiling-mounted cameras (CMCs) for indoor visual capturing opens up a wide range of applications. However, registering CMCs to the target scene layout presents a challenging task. While manual registration with specialized tools is inefficient and costly, automatic registration with visual localization may yield poor results when visual ambiguity exists. To alleviate these issues, we propose a novel solution for jointly mapping an indoor scene and registering CMCs to the scene layout. Our approach involves equipping a mobile agent with a head-mounted RGB-D camera to traverse the entire scene once and synchronizing CMCs to capture this mobile agent. The egocentric videos generate world-coordinate agent trajectories and the scene layout, while the videos of CMCs provide pseudo-scale agent trajectories and CMC relative poses. By correlating all the trajectories with their corresponding timestamps, the CMC relative poses can be aligned to the world-coordinate scene layout. Based on this initialization, a factor graph is customized to enable the joint optimization of ego-camera poses, scene layout, and CMC poses. We also develop a new dataset, setting the first benchmark for collaborative scene mapping and CMC registration (https://sites.google.com/view/yowo/home). Experimental results indicate that our method not only effectively accomplishes two tasks within a unified framework, but also jointly enhances their performance. We thus provide a reliable tool to facilitate downstream position-aware applications.
[107] BoxingVI: A Multi-Modal Benchmark for Boxing Action Recognition and Localization
Rahul Kumar,Vipul Baghel,Sudhanshu Singh,Bikash Kumar Badatya,Shivam Yadav,Babji Srinivasan,Ravi Hegde
Main category: cs.CV
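TL;DR: This paper presents a high-quality, finely annotated video dataset for punch detection and classification in boxing. It contains 6,915 punch clips from 20 YouTube sparring videos, covering 6 punch types and 18 athletes.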
Details
Motivation: Because actions are highly dynamic and recording environments vary, computer vision analysis of combat sports is held back by a lack of datasets. A high-quality, precisely annotated dataset is needed to advance the research.
Method: The authors manually clip and annotate 6,915 punch clips from 20 public YouTube sparring videos. Each clip is labeled with precise temporal boundaries and one of 6 punch types, and the collection is diverse in athletes, motion styles, and camera angles.
Result: The result is a large-scale, high-quality, class-balanced, and environmentally diverse boxing punch video dataset, suitable for research on real-time vision-based action recognition in low-resource and unconstrained environments.
Conclusion: The dataset supports movement analysis, automated coaching, and performance assessment in boxing and related domains, and should help advance vision-based action recognition.
Abstract: Accurate analysis of combat sports using computer vision has gained traction in recent years, yet the development of robust datasets remains a major bottleneck due to the dynamic, unstructured nature of actions and variations in recording environments. In this work, we present a comprehensive, well-annotated video dataset tailored for punch detection and classification in boxing. The dataset comprises 6,915 high-quality punch clips categorized into six distinct punch types, extracted from 20 publicly available YouTube sparring sessions and involving 18 different athletes. Each clip is manually segmented and labeled to ensure precise temporal boundaries and class consistency, capturing a wide range of motion styles, camera angles, and athlete physiques. This dataset is specifically curated to support research in real-time vision-based action recognition, especially in low-resource and unconstrained environments. By providing a rich benchmark with diverse punch examples, this contribution aims to accelerate progress in movement analysis, automated coaching, and performance assessment within boxing and related domains.
[108] Contrastive vision-language learning with paraphrasing and negation
Kwun Ho Ngan,Saman Sadeghi Afgeh,Joe Townsend,Artur d'Avila Garcez
Main category: cs.CV
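TL;DR: This paper proposes SemCLIP, a method that makes vision-language models more robust to semantic transformations, especially negation and paraphrasing, by modifying the contrastive loss and training on LLM-generated triples of original, paraphrased, and negated text. Experiments show that SemCLIP preserves the original performance while markedly improving the ability to separate negated text.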
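To make the pull/push idea in this entry concrete, the sketch below scores one (image, original, paraphrase, negation) triple: the paraphrase is pulled toward the image embedding, and the negated caption is pushed below the original by a margin. This is an illustrative hinge-style objective under stated assumptions, not the SemCLIP loss itself, whose exact form is not given here; `triple_loss` and `margin` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triple_loss(image, original, paraphrase, negation, margin=0.2):
    """Illustrative loss for one caption triple (not the paper's formulation)."""
    s_orig = cosine(image, original)
    s_para = cosine(image, paraphrase)
    s_neg = cosine(image, negation)
    pull = 1.0 - s_para                            # paraphrase should match the image
    push = max(0.0, margin + s_neg - s_orig)       # negation should score lower than original
    return pull + push


image = [1.0, 0.0]
good = triple_loss(image, [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])  # negation far from image
bad = triple_loss(image, [1.0, 0.0], [1.0, 0.0], [1.0, 0.0])   # negation not separated
```

When the negated caption embeds identically to the original, the loss equals the margin, so the gradient would push the negated caption away; once it is separated by at least the margin, the push term vanishes.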
Details
Motivation: Existing vision-language models such as CLIP behave inconsistently on negated or paraphrased text: negation changes meaning drastically with small lexical changes, while paraphrasing changes wording substantially but keeps the meaning. Both challenge the model's semantic alignment, so robustness to such transformations needs to improve.
Method: The paper proposes a new contrastive loss (SemCLIP) that accounts for both paraphrasing and negation. An LLM generates training triples of original, paraphrased, and negated captions, which are used to train CLIP-like models so that paraphrased captions embed close to the original image and negated captions embed farther away.
Result: On the CC-Neg benchmark, image-retrieval accuracy rises from 68.1% to 78.1%. Results on Sugarcrepe++ are mixed but generally better than models trained only with negated captions. On downstream zero-shot classification, SemCLIP outperforms CLIP on all tested tasks.
Conclusion: SemCLIP improves the robustness of vision-language models to semantic transformations such as negation and paraphrasing, preserves the original performance while sharpening semantic discrimination, and generalizes well to downstream tasks.
Abstract: Contrastive vision-language models continue to be the dominant approach for image and text retrieval. Contrastive Language-Image Pre-training (CLIP) trains two neural networks in a contrastive manner to align their image and text embeddings in a shared latent space. Recent results evaluating CLIP on negated or paraphrased text have shown mixed performance because negation changes meaning radically with minimal lexical changes, while paraphrasing can create very different textual expressions with the same intended meaning. This poses a significant challenge for improving the evaluation results and alignment of vision-language models. To address this challenge, this paper evaluates the combination of paraphrasing and negation, proposes a new CLIP contrastive loss function accounting for both paraphrasing and negation, and applies LLM-generated training triples consisting of original, paraphrased and negated textual captions to CLIP-like training models. The approach, called SemCLIP, is shown to move paraphrased captions towards the original image embeddings while pushing negated captions further away in embedding space. Empirically, SemCLIP is shown to be capable of preserving CLIP's performance while increasing considerably the distances to negated captions. On the CC-Neg benchmark using an original over negation image-retrieval accuracy metric, SemCLIP improves accuracy from 68.1% to 78.1%. Although results are mixed when compared with CLIP on the Sugarcrepe++ benchmark, SemCLIP's performance is generally better than the models trained with negated captions. This robustness to negation extends to downstream zero-shot classification tasks where SemCLIP pre-trained on Sugarcrepe++ performs better than CLIP on all tested downstream tasks. These results indicate that SemCLIP can achieve significant robustness to semantic transformations.
[109] Enhancing Multi-Camera Gymnast Tracking Through Domain Knowledge Integration
Fan Yang,Shigeyuki Odashima,Shoichi Masui,Ikuo Kusajima,Sosuke Yamao,Shan Jiang
Main category: cs.CV
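TL;DR: Proposes a multi-camera cascaded data association method that uses gymnastics domain knowledge: when detections are insufficient, ray-plane intersection generates coplanar 3D trajectory candidates, making gymnast tracking more robust.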
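The geometric fallback in this entry is a standard ray-plane intersection: a detection in one camera defines a ray from the camera centre, and the gymnast's 3D centre is taken where that ray meets the predefined vertical plane. A minimal sketch follows; the coordinate convention and the example camera and plane values are assumptions for illustration only.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def ray_plane_intersection(origin, direction, plane_point, plane_normal, eps=1e-9):
    """Return the point where the ray origin + t * direction (t >= 0) hits the plane.

    Returns None when the ray is parallel to the plane or points away from it.
    """
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None  # ray parallel to the plane
    offset = tuple(p - o for p, o in zip(plane_point, origin))
    t = dot(offset, plane_normal) / denom
    if t < 0:
        return None  # plane lies behind the camera
    return tuple(o + t * d for o, d in zip(origin, direction))


# Hypothetical setup: vertical plane x = 5 (z is up), camera at the origin.
hit = ray_plane_intersection((0.0, 0.0, 0.0), (1.0, 0.0, 0.2), (5.0, 0.0, 0.0), (1.0, 0.0, 0.0))
parallel = ray_plane_intersection((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

A single view is enough to produce a 3D candidate this way, which is why the approach helps when only two opposing views give valid detections and triangulation is poorly conditioned.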
Details
Motivation: At international gymnastics competitions, the limited number of cameras, lighting changes, complex backgrounds, varied uniforms, and occlusions make it hard for conventional multi-camera triangulation to estimate an athlete's 3D trajectory accurately, so a more robust tracking method is needed.
Method: The method introduces gymnastics domain knowledge by assuming that the athlete's 3D center lies within a predefined vertical plane. When cross-view detections are sufficient, triangulation generates 3D trajectory candidates; when they are insufficient, ray-plane intersection generates coplanar candidates. A cascaded data association strategy combines the two.
Result: Experiments show that the method outperforms existing methods in challenging scenarios and markedly reduces tracking failures. It has been deployed in the judging system at recent Gymnastics World Championships.
Conclusion: By combining domain knowledge with cascaded data association, the method achieves robust 3D gymnast tracking with a limited number of cameras, has practical value, and has been recognized by the International Gymnastics Federation.
Abstract: We present a robust multi-camera gymnast tracking method, which has been applied at international gymnastics championships for gymnastics judging. Despite considerable progress in multi-camera tracking algorithms, tracking gymnasts presents unique challenges: (i) due to space restrictions, only a limited number of cameras can be installed in the gymnastics stadium; and (ii) due to variations in lighting, background, uniforms, and occlusions, multi-camera gymnast detection may fail in certain views and only provide valid detections from two opposing views. These factors complicate the accurate determination of a gymnast's 3D trajectory using conventional multi-camera triangulation. To alleviate this issue, we incorporate gymnastics domain knowledge into our tracking solution. Given that a gymnast's 3D center typically lies within a predefined vertical plane during much of their performance, we can apply a ray-plane intersection to generate coplanar 3D trajectory candidates for opposing-view detections. More specifically, we propose a novel cascaded data association (DA) paradigm that employs triangulation to generate 3D trajectory candidates when cross-view detections are sufficient, and resorts to the ray-plane intersection when they are insufficient. Consequently, coplanar candidates are used to compensate for uncertain trajectories, thereby minimizing tracking failures. The robustness of our method is validated through extensive experimentation, demonstrating its superiority over existing methods in challenging scenarios. Furthermore, our gymnastics judging system, equipped with this tracking method, has been successfully applied to recent Gymnastics World Championships, earning significant recognition from the International Gymnastics Federation.
[110] Investigating Optical Flow Computation: From Local Methods to a Multiresolution Horn-Schunck Implementation with Bilinear Interpolation
Haytham Ziani
Main category: cs.CV
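TL;DR: This paper analyzes local and global optical flow methods, focuses on the Horn-Schunck algorithm, and implements a multiresolution version of it to improve accuracy and convergence.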
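For reference, the classic Horn-Schunck iteration updates the flow (u, v) from its local averages and the image derivatives: u = ū − Ix(Ix·ū + Iy·v̄ + It)/(α² + Ix² + Iy²), and likewise for v with Iy. The sketch below runs that update at a single resolution on precomputed derivative grids. It is a simplified illustration, not the paper's implementation: it uses a plain 4-neighbour average rather than the original weighted kernel, and it omits the multiresolution pyramid and bilinear prolongation.

```python
def horn_schunck(ix, iy, it, alpha=1.0, iterations=100):
    """Jacobi-style Horn-Schunck iterations on precomputed derivatives Ix, Iy, It."""
    h, w = len(ix), len(ix[0])
    u = [[0.0] * w for _ in range(h)]
    v = [[0.0] * w for _ in range(h)]

    def local_mean(f, r, c):
        # 4-neighbour average with replicated borders (a simplification).
        up, down = max(r - 1, 0), min(r + 1, h - 1)
        left, right = max(c - 1, 0), min(c + 1, w - 1)
        return (f[up][c] + f[down][c] + f[r][left] + f[r][right]) / 4.0

    for _ in range(iterations):
        new_u = [[0.0] * w for _ in range(h)]
        new_v = [[0.0] * w for _ in range(h)]
        for r in range(h):
            for c in range(w):
                u_bar, v_bar = local_mean(u, r, c), local_mean(v, r, c)
                # Brightness-constancy residual evaluated at the averaged flow.
                residual = ix[r][c] * u_bar + iy[r][c] * v_bar + it[r][c]
                denom = alpha ** 2 + ix[r][c] ** 2 + iy[r][c] ** 2
                new_u[r][c] = u_bar - ix[r][c] * residual / denom
                new_v[r][c] = v_bar - iy[r][c] * residual / denom
        u, v = new_u, new_v
    return u, v


# Toy pattern translating one pixel to the right per frame: Ix = 1, Iy = 0, It = -1.
size = 3
ix = [[1.0] * size for _ in range(size)]
iy = [[0.0] * size for _ in range(size)]
it = [[-1.0] * size for _ in range(size)]
u, v = horn_schunck(ix, iy, it)
```

A coarse-to-fine scheme would run this solver on a downsampled pair first, upsample the resulting flow to the next level, and use it as the starting estimate there, which is what helps with large displacements.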
Details
Motivation: Estimating inter-frame motion more accurately under different image conditions requires improving the robustness and accuracy of existing optical flow algorithms.
Method: Combining theory and practice, the paper compares local methods (such as Lucas-Kanade) with global methods (such as Horn-Schunck) and implements a multiresolution Horn-Schunck algorithm based on bilinear interpolation and prolongation.
Result: The multiresolution strategy improves the accuracy and convergence of Horn-Schunck and gives better motion estimates under different image conditions.
Conclusion: A global method combined with multiresolution techniques is more robust and accurate for optical flow computation and is suitable for motion analysis in complex scenes.
Abstract: This paper presents an applied analysis of local and global methods, with a focus on the Horn-Schunck algorithm for optical flow computation. We explore the theoretical and practical aspects of local approaches, such as the Lucas-Kanade method, and global techniques such as Horn-Schunck. Additionally, we implement a multiresolution version of the Horn-Schunck algorithm, using bilinear interpolation and prolongation to improve accuracy and convergence. The study investigates the effectiveness of these combined strategies in estimating motion between frames, particularly under varying image conditions.
[111] Supervised Contrastive Learning for Few-Shot AI-Generated Image Detection and Attribution
Jaime Álvarez Urueña,David Camacho,Javier Huertas Tato
Main category: cs.CV
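TL;DR: This paper proposes a two-stage synthetic image detection framework that extracts discriminative embeddings with supervised contrastive learning and pairs them with a few-shot k-nearest neighbors classifier, enabling efficient detection and attribution of new generative models without frequent retraining.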
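The second stage of this entry, a k-nearest neighbors classifier over a frozen embedding space, can be sketched in a few lines. The example below is a minimal illustration that assumes Euclidean distance and majority voting; the embeddings, labels, and choice of `k` are made up, and in the described framework the vectors would come from the contrastively trained encoder.

```python
import math
from collections import Counter


def knn_predict(query, support, k=3):
    """Classify `query` by majority vote among its k nearest support embeddings.

    `support` is a list of (embedding, label) pairs, the few-shot reference set.
    """
    neighbours = sorted(support, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D embeddings: a cluster of real images and one unseen generator.
support = [
    ([0.0, 0.0], "real"),
    ([0.0, 1.0], "real"),
    ([5.0, 5.0], "generator_a"),
    ([5.0, 6.0], "generator_a"),
    ([6.0, 5.0], "generator_a"),
]
label_near_generator = knn_predict([5.2, 5.1], support)
label_near_real = knn_predict([0.1, 0.2], support)
```

Adapting to a new generator then only means appending a handful of labelled embeddings to `support`; the encoder itself is not retrained.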
Details
Motivation: With the rapid progress of generative AI, synthetic images are increasingly hard to tell apart from real ones, and new models appear quickly. Detection methods that rely on periodic retraining cannot keep up, so a detector that generalizes well is needed.
Method: The first stage is a vision deep learning model trained with supervised contrastive learning, with some generator architectures withheld from training to assess cross-generator generalization. The second stage applies a k-nearest neighbors classifier in the embedding space in a few-shot setting, using only a small number of samples from unseen generators.
Result: With only 150 images per class, the framework reaches an average detection accuracy of 91.3%, 5.2 percentage points above existing methods. For source attribution in an open-set classification setting, AUC and OSCR improve by 14.70% and 4.27%, respectively.
Conclusion: The method improves the generalization and scalability of synthetic image detection and attribution, offering a practical forensic solution for fast-evolving generative AI that does not require exhaustive retraining.
Abstract: The rapid advancement of generative artificial intelligence has enabled the creation of synthetic images that are increasingly indistinguishable from authentic content, posing significant challenges for digital media integrity. This problem is compounded by the accelerated release cycle of novel generative models, which renders traditional detection approaches (reliant on periodic retraining) computationally infeasible and operationally impractical. This work proposes a novel two-stage detection framework designed to address the generalization challenge inherent in synthetic image detection. The first stage employs a vision deep learning model trained via supervised contrastive learning to extract discriminative embeddings from input imagery. Critically, this model was trained on a strategically partitioned subset of available generators, with specific architectures withheld from training to rigorously ablate cross-generator generalization capabilities. The second stage utilizes a k-nearest neighbors (k-NN) classifier operating on the learned embedding space, trained in a few-shot learning paradigm incorporating limited samples from previously unseen test generators. With merely 150 images per class in the few-shot learning regime, which are easily obtainable from current generation models, the proposed framework achieves an average detection accuracy of 91.3%, representing a 5.2 percentage point improvement over existing approaches. For the source attribution task, the proposed approach obtains improvements of 14.70% and 4.27% in AUC and OSCR respectively on an open set classification context, marking a significant advancement toward robust, scalable forensic attribution systems capable of adapting to the evolving generative AI landscape without requiring exhaustive retraining protocols.
[112] EOGS++: Earth Observation Gaussian Splatting with Internal Camera Refinement and Direct Panchromatic Rendering
Pierrick Bournez,Luca Savant Aira,Thibaud Ehret,Gabriele Facciolo
Main category: cs.CV
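TL;DR: This paper presents EOGS++, a 3D Gaussian Splatting reconstruction method for satellite imagery that works directly on raw high-resolution panchromatic data and integrates optical flow and bundle adjustment into training, improving reconstruction quality and geometric accuracy.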
Details
Motivation: Existing Earth observation reconstruction methods depend on preprocessing and external optimization tools. The goal is to remove that dependence and improve reconstruction efficiency and accuracy.
Method: Built on 3D Gaussian Splatting, EOGS++ processes raw satellite imagery directly, uses optical-flow-assisted bundle adjustment to refine camera poses, and adds early stopping and TSDF post-processing to improve geometric accuracy.
Result: EOGS++ achieves state-of-the-art performance on the IARPA 2016 and DFC2019 datasets, reducing the mean MAE on buildings from 1.33 to 1.19 and outperforming EOGS and other NeRF-based methods.
Conclusion: EOGS++ substantially improves reconstruction quality while keeping high computational efficiency, making it a strong Earth observation reconstruction framework for high-resolution satellite imagery.
Abstract: Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering competitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques, we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and efficiency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model reduces the mean MAE on buildings from 1.33 to 1.19 compared to the original EOGS models.
[113] Progressive Supernet Training for Efficient Visual Autoregressive Modeling
Xiaoyue Chen,Yuling Shi,Kaiyuan Li,Huandong Wang,Yong Li,Xiaodong Gu,Xinlei Chen,Mingbao Lin
Main category: cs.CV
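TL;DR: Proposes VARiant, in which subnets share weights with the full network and are trained progressively, substantially reducing memory use and inference cost while preserving generation quality and allowing depth to be adjusted at runtime.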
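The equidistant sampling in this entry amounts to picking evenly spaced layer indices from the 30-layer network and switching to that subset at later scales. The sketch below shows one way to do this. Whether the first and last layers are always included, and the hypothetical `switch_scale` threshold, are assumptions; the entry does not state the exact rule.

```python
def equidistant_layers(total_layers, subnet_depth):
    """Pick `subnet_depth` layer indices spread evenly over range(total_layers).

    Endpoints (first and last layer) are always included, by assumption.
    """
    if subnet_depth == 1:
        return [0]
    step = (total_layers - 1) / (subnet_depth - 1)
    return [round(i * step) for i in range(subnet_depth)]


def layers_for_scale(scale_index, switch_scale, total_layers, subnet_depth):
    """Early scales use every layer; later scales use the shared-weight subnet.

    `switch_scale` is a hypothetical parameter marking where the switch happens.
    """
    if scale_index < switch_scale:
        return list(range(total_layers))
    return equidistant_layers(total_layers, subnet_depth)


d2 = equidistant_layers(30, 2)
d8 = equidistant_layers(30, 8)
d16 = equidistant_layers(30, 16)
late_scale = layers_for_scale(scale_index=7, switch_scale=4, total_layers=30, subnet_depth=8)
```

Because the subnet is just an index list into the same layer stack, changing depth at runtime only changes which layers are executed; no second set of weights is loaded.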
Details
Motivation: VAR models incur high memory overhead in multi-scale generation because of the cumulative KV cache, and different generation stages differ in their sensitivity to network depth, so efficiency and quality must be balanced.
Method: Exploiting the scale-depth asymmetric dependency, the authors use equidistant sampling to build subnets of 2 to 16 layers from the 30-layer backbone. Early scales use the full network and later scales use a subnet, with shared weights, and a progressive training strategy resolves the resulting optimization conflicts.
Result: On ImageNet, VARiant-d16/d8 come close to VAR-d30 in quality (FID 2.05/2.12 vs 1.95) with 40-65% less memory. VARiant-d2 gives a 3.5x speedup and an 80% memory reduction (FID 2.97). A single model supports depth switching at runtime.
Conclusion: Through flexible depth allocation and weight sharing, VARiant balances generation quality and efficiency without sacrificing ease of deployment and suits a wide range of application scenarios.
Abstract: Visual Auto-Regressive (VAR) models significantly reduce inference steps through the "next-scale" prediction paradigm. However, progressive multi-scale generation incurs substantial memory overhead due to cumulative KV caching, limiting practical deployment. We observe a scale-depth asymmetric dependency in VAR: early scales exhibit extreme sensitivity to network depth, while later scales remain robust to depth reduction. Inspired by this, we propose VARiant: by equidistant sampling, we select multiple subnets ranging from 16 to 2 layers from the original 30-layer VAR-d30 network. Early scales are processed by the full network, while later scales utilize a subnet. The subnets and the full network share weights, enabling flexible depth adjustment within a single model. However, weight sharing between a subnet and the entire network can lead to optimization conflicts. To address this, we propose a progressive training strategy that breaks through the Pareto frontier of generation quality for both subnets and the full network under fixed-ratio training, achieving joint optimality. Experiments on ImageNet demonstrate that, compared to the pretrained VAR-d30 (FID 1.95), VARiant-d16 and VARiant-d8 achieve nearly equivalent quality (FID 2.05/2.12) while reducing memory consumption by 40-65%. VARiant-d2 achieves 3.5 times speedup and 80% memory reduction at moderate quality cost (FID 2.97). In terms of deployment, VARiant's single-model architecture supports zero-cost runtime depth switching and provides flexible deployment options from high quality to extreme efficiency, catering to diverse application scenarios.
[114] Lite Any Stereo: Efficient Zero-Shot Stereo Matching
Junpeng Jing,Weixun Luo,Ye Mao,Krystian Mikolajczyk
Main category: cs.CV
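TL;DR: This paper presents Lite Any Stereo, an efficient stereo depth estimation framework with strong zero-shot generalization that achieves leading performance on several real-world benchmarks while keeping a very small model size.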
Details
Motivation: Efficient stereo matching models are generally thought to lack zero-shot generalization because of their limited capacity. This paper aims to break the trade-off between efficiency and generalization.
Method: The authors design a compact yet expressive backbone and a hybrid cost aggregation module, and propose a three-stage large-scale training strategy to narrow the sim-to-real gap.
Result: The model ranks first on four widely used real-world benchmarks, with accuracy comparable to or better than existing accurate non-prior-based methods at less than 1% of the computational cost.
Conclusion: The work shows that an ultra-light model can generalize strongly and sets a new standard for efficient stereo matching.
Abstract: Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.
[115] NutriScreener: Retrieval-Augmented Multi-Pose Graph Attention Network for Malnourishment Screening
Misaal Khan,Mayank Vatsa,Kuldeep Singh,Richa Singh
Main category: cs.CV
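TL;DR: This paper presents NutriScreener, a multi-pose graph attention network that combines CLIP visual embeddings with knowledge retrieval to detect malnutrition and predict anthropometric measurements from children's images, with high accuracy and efficiency in low-resource settings.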
Details
Motivation: Existing child malnutrition screening methods are laborious and hard to scale, which hinders early intervention; a scalable, robust automated solution is needed.
Method: A retrieval-augmented multi-pose graph attention network that fuses CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness. It is trained and tested on AnthroVision and additionally evaluated on ARAN and the in-house CampusPose dataset.
Result: In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency. The model reaches 0.79 recall and 0.82 AUC with significantly lower anthropometric RMSEs; cross-dataset tests show up to 25% recall gain and up to 3.5 cm RMSE reduction.
Conclusion: NutriScreener is a scalable and accurate approach to early child malnutrition detection that can be deployed in low-resource settings and generalizes well.
Abstract: Child malnutrition remains a global crisis, yet existing screening methods are laborious and poorly scalable, hindering early intervention. In this work, we present NutriScreener, a retrieval-augmented, multi-pose graph attention network that combines CLIP-based visual embeddings, class-boosted knowledge retrieval, and context awareness to enable robust malnutrition detection and anthropometric prediction from children's images, simultaneously addressing generalizability and class imbalance. In a clinical study, doctors rated it 4.3/5 for accuracy and 4.6/5 for efficiency, confirming its deployment readiness in low-resource settings. Trained and tested on 2,141 children from AnthroVision and additionally evaluated on diverse cross-continent populations, including ARAN and an in-house collected CampusPose dataset, it achieves 0.79 recall, 0.82 AUC, and significantly lower anthropometric RMSEs, demonstrating reliable measurement in unconstrained pediatric settings. Cross-dataset results show up to 25% recall gain and up to 3.5 cm RMSE reduction using demographically matched knowledge bases. NutriScreener offers a scalable and accurate solution for early malnutrition detection in low-resource environments.
[116] POMA-3D: The Point Map Way to 3D Scene Understanding
Ye Mao,Weixun Luo,Ranran Huang,Junpeng Jing,Krystian Mikolajczyk
Main category: cs.CV
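TL;DR: This paper proposes POMA-3D, the first self-supervised 3D representation model learned from point maps. It uses a view-to-scene alignment strategy to transfer 2D priors and the POMA-JEPA joint embedding-predictive architecture to enforce geometric consistency across views, is pretrained on the large-scale ScenePoint dataset, and performs well on a range of 3D understanding tasks using only 3D coordinates.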
Details
Motivation: Addresses the scarcity of pretrained priors and data in 3D representation learning, while using the rich priors of 2D foundation models to improve 3D scene understanding.
Method: Proposes POMA-3D, which takes point maps as input, uses a view-to-scene alignment strategy to transfer 2D priors, and introduces the POMA-JEPA architecture to enforce geometric consistency across views. The ScenePoint dataset, built from 6.5K room-level RGB-D scenes and 1M 2D image scenes, supports large-scale pretraining.
Result: POMA-3D performs well on 3D question answering, embodied navigation, scene retrieval, and embodied localization using only geometric inputs (3D coordinates), and serves as a strong backbone for both specialist and generalist 3D understanding.
Conclusion: POMA-3D explores a point-map route to 3D scene understanding that combines 2D priors with 3D geometry and eases the scarcity of data and pretrained models in 3D representation learning.
Abstract: In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
[117] Erase to Retain: Low Rank Adaptation Guided Selective Unlearning in Medical Segmentation Networks
Nirjhor Datta,Md. Golam Rabiul Alam
Main category: cs.CV
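TL;DR: Proposes Erase to Retain, a controllable unlearning framework for selective knowledge removal in medical image segmentation. It relies on LoRA subspace updates and teacher-student distillation to achieve targeted forgetting while preserving performance on retained data.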
Details
Motivation: In medical image analysis, privacy compliance, ethical deployment, and continual dataset revision call for selectively removing specific knowledge (such as sensitive lesion information) from segmentation networks, whereas existing methods usually depend on full retraining, which is costly and hard to control.
Method: A teacher-student distillation framework with Low-Rank Adaptation (LoRA) constrained subspace updates. In the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher's confident predictions on the forget subset; in the gentle restoration phase, head-only supervised refinement recovers generalization on retained data.
Result: On ISIC segmentation, forget-set IoU drops from 0.875 to 0.509 while retain and validation performance stays at 0.647 to 0.677 IoU. The cross-domain CHASE dataset shows the same pattern of forgetting with preserved utility. On ISIC classification, forget-set accuracy falls from 87.0% to 64.1% while retain accuracy rises from 83.9% to 90.6%.
Conclusion: LoRA-based subspace unlearning offers a practical, controllable, and reversible route to knowledge removal in medical image analysis, removing sensitive information while preserving key performance.
Abstract: The ability to selectively remove knowledge from medical segmentation networks is increasingly important for privacy compliance, ethical deployment, and continual dataset revision. We introduce Erase to Retain, a controllable unlearning framework for medical image segmentation that achieves targeted forgetting without full retraining. Our method uses a teacher-student distillation paradigm with Low-Rank Adaptation (LoRA) constrained subspace updates, enabling the student network to erase lesion-specific or class-specific representations in low-rank decoder spaces while preserving global anatomical understanding. During the strong unlearning phase, LoRA modules are adversarially optimized to contradict the teacher's confident predictions on a designated forget subset, enforcing semantic removal. This is followed by a gentle restoration phase that recovers generalization on retained data through head-only supervised refinement. For ISIC segmentation, the student reduces forget-set IoU from 0.875 to 0.509 while maintaining competitive performance on the retain and validation splits (0.647 to 0.677 IoU). On the cross-domain CHASE dataset, Erase to Retain consistently lowers forget-set IoU while preserving utility on retain and validation sets. For ISIC classification, our method decreases accuracy on the forget subset from 87.0 percent to 64.1 percent while improving retain accuracy from 83.9 percent to 90.6 percent.
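The abstract does not state the adapter configuration, so the rank, scaling factor, and matrix values below are illustrative assumptions. As a minimal sketch of the general LoRA idea the method builds on (a frozen weight plus a trainable low-rank product), and not the authors' implementation:

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix,
                          alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A without modifying the frozen W.

    W is (out, in), B is (out, rank), A is (rank, in). The alpha / rank
    scaling follows the common LoRA convention; the values used here are
    hypothetical.
    """
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x3 frozen weight with a rank-1 adapter (all numbers are made up).
w = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
b = [[0.5],
     [-1.0]]
a = [[1.0, 2.0, 0.0]]

adapted = lora_effective_weight(w, b, a, alpha=2.0, rank=1)
print(adapted)
```

Because the base weight is never overwritten, discarding the adapter matrices gives back the original layer, which is one plausible reading of why such subspace updates can be undone.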
These results demonstrate that LoRA-based subspace unlearning provides a practical pathway toward responsible, controllable, and reversible unlearning in medical image analysis, enabling models to forget sensitive samples or structures while preserving performance where it matters most.
[118] Generative AI for Enhanced Wildfire Detection: Bridging the Synthetic-Real Domain Gap
Satyam Gaba
Main category: cs.CV
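TL;DR: This work uses generative AI to synthesize an annotated smoke dataset, addressing the scarcity of real labeled data for wildfire smoke detection. It combines unsupervised domain adaptation with generative models (such as GANs and style transfer) to make the synthetic data more realistic and narrow the synthetic-to-real domain gap, with the goal of more accurate wildfire smoke segmentation.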
Details
Motivation: The lack of large annotated smoke datasets limits the use of deep neural networks for wildfire smoke detection, so effective ways to ease this data shortage are needed.
Method: Builds a synthetic smoke dataset with generative AI and combines unsupervised domain adaptation, style transfer, Generative Adversarial Networks (GANs), and image matting to improve generalization to real scenes.
Result: The methods target a smaller domain gap between synthetic and real data and better smoke segmentation, supporting the use of generated data for early fire detection.
Conclusion: Combining generative AI with domain adaptation makes it possible to build efficient, scalable wildfire smoke detection systems when annotated data is limited, offering a feasible path to real-world use.
Abstract: The early detection of wildfires is a critical environmental challenge, with timely identification of smoke plumes being key to mitigating large-scale damage. While deep neural networks have proven highly effective for localization tasks, the scarcity of large, annotated datasets for smoke detection limits their potential. In response, we leverage generative AI techniques to address this data limitation by synthesizing a comprehensive, annotated smoke dataset. We then explore unsupervised domain adaptation methods for smoke plume segmentation, analyzing their effectiveness in closing the gap between synthetic and real-world data. To further refine performance, we integrate advanced generative approaches such as style transfer, Generative Adversarial Networks (GANs), and image matting. These methods aim to enhance the realism of synthetic data and bridge the domain disparity, paving the way for more accurate and scalable wildfire detection models.
[119] SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking
Haofeng Liu,Ziyue Wang,Sudhanshu Mishra,Mingqi Gao,Guanyi Qin,Chang Han Low,Alex Y. W. Kong,Yueming Jin
Main category: cs.CV
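TL;DR: This paper introduces SA-SV, currently the largest benchmark for surgical video segmentation, and builds on it the SAM2S model. Through an improved memory mechanism, temporal semantic learning, and ambiguity-resilient learning, SAM2S substantially improves interactive video object segmentation in surgical scenes, with high accuracy at real-time speed.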
Details
Motivation: Existing interactive video object segmentation models (such as SAM2) suffer from a domain gap and limited long-term tracking in surgical scenes, and there is no high-quality, large-scale annotated dataset to support their development and evaluation in surgical settings.
Method: Builds SA-SV, a large dataset spanning eight procedure types with 61k frames and 1.6k instance-level spatio-temporal annotations (masklets). On top of it, proposes SAM2S, which introduces the DiveMem mechanism for robust long-term tracking, temporal semantic learning for better understanding of surgical instruments, and ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets.
Result: Fine-tuning on SA-SV improves SAM2 by 12.99 points (average J&F). SAM2S reaches an average J&F of 80.42, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while keeping 68 FPS real-time inference and strong zero-shot generalization.
Conclusion: With three techniques designed for surgical video, SAM2S clearly outperforms existing methods in long-term tracking, zero-shot generalization, and real-time performance, and provides a new foundation model and data resource for surgical video analysis.
Abstract: Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing SAM2 for Surgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}\&\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}\&\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization.
Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
[120] Improving Long-Tailed Object Detection with Balanced Group Softmax and Metric Learning
Satyam Gaba
Main category: cs.CV
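TL;DR: This paper proposes an improved Balanced Group Softmax framework combined with metric learning and k-NN classification, improving 2D object detection under long-tailed distributions and reaching a state-of-the-art 24.5% mAP on the LVISv1 dataset.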
Details
Motivation: Class distributions in real-world scenes are often long-tailed, which biases detectors toward frequent classes and away from rare ones; this work aims to mitigate that imbalance.
Method: Starting from a two-stage Faster R-CNN detector, improves the Balanced Group Softmax (BAGS) framework and introduces metric learning so that features are tightly clustered within each class and well separated across classes; at inference, k-Nearest Neighbors is used to refine classification.
Result: Reaches 24.5% mAP on LVISv1, surpassing the previous benchmark of 24.0%; the k-NN step is aimed particularly at rare classes.
Conclusion: The proposed methods mitigate the class imbalance caused by long-tailed distributions, using well-separated feature embeddings and k-NN classification to advance long-tailed detection.
Abstract: Object detection has been widely explored for class-balanced datasets such as COCO. However, real-world scenarios introduce the challenge of long-tailed distributions, where numerous categories contain only a few instances. This inherent class imbalance biases detection models towards the more frequent classes, degrading performance on rare categories. In this paper, we tackle the problem of long-tailed 2D object detection using the LVISv1 dataset, which consists of 1,203 categories and 164,000 images. We employ a two-stage Faster R-CNN architecture and propose enhancements to the Balanced Group Softmax (BAGS) framework to mitigate class imbalance. Our approach achieves a new state-of-the-art performance with a mean Average Precision (mAP) of 24.5%, surpassing the previous benchmark of 24.0%. Additionally, we hypothesize that tail class features may form smaller, denser clusters within the feature space of head classes, making classification challenging for regression-based classifiers. To address this issue, we explore metric learning to produce feature embeddings that are both well-separated across classes and tightly clustered within each class. For inference, we utilize a k-Nearest Neighbors (k-NN) approach to improve classification performance, particularly for rare classes. Our results demonstrate the effectiveness of these methods in advancing long-tailed object detection.
[121] Adaptive Guided Upsampling for Low-light Image Enhancement
Angela Vivian Dcosta,Chunbo Song,Rafael Radkowski
Main category: cs.CV
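TL;DR: Proposes Adaptive Guided Upsampling (AGU), an efficient method for improving low-light image quality that optimizes several characteristics at once, such as noise reduction and sharpness, and outperforms existing methods.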
Details
Motivation: With existing guided image methods, low-light images carry too few usable characteristics because of their high noise and low brightness, so image quality is not improved effectively.
Method: Builds on a guided image method and adds multi-parameter optimization, using machine learning to learn the association between low-light and bright image characteristics from a few image pairs, which enables adaptive upsampling.
Result: AGU renders high-quality images in real time from low-quality, low-resolution input, and experiments show it outperforms state-of-the-art methods in the low-light use case.
Conclusion: By learning multi-characteristic associations between low-light and bright images, AGU overcomes the poor performance of conventional guided methods in low light and delivers efficient, high-quality enhancement.
Abstract: We introduce Adaptive Guided Upsampling (AGU), an efficient method for upscaling low-light images capable of optimizing multiple image quality characteristics at the same time, such as reducing noise and increasing sharpness. It is based on a guided image method, which transfers image characteristics from a guidance image to the target image. Using state-of-the-art guided methods, low-light images lack sufficient characteristics for this purpose due to their high noise level and low brightness, rendering suboptimal/not significantly improved images in the process. We solve this problem with multi-parameter optimization, learning the association between multiple low-light and bright image characteristics. Our proposed machine learning method learns these characteristics from a few sample image pairs. AGU can render high-quality images in real time using low-quality, low-resolution input; our experiments demonstrate that it is superior to state-of-the-art methods in the addressed low-light use case.
[122] SAM 3D: 3Dfy Anything in Images
SAM 3D Team,Xingyu Chen,Fu-Jen Chu,Pierre Gleize,Kevin J Liang,Alexander Sax,Hao Tang,Weiyao Wang,Michelle Guo,Thibaut Hardin,Xiang Li,Aohan Lin,Jiawei Liu,Ziqi Ma,Anushka Sagar,Bowen Song,Xiaodong Wang,Jianing Yang,Bowen Zhang,Piotr Dollár,Georgia Gkioxari,Matt Feiszli,Jitendra Malik
Main category: cs.CV
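TL;DR: SAM 3D is a visually grounded reconstruction model that generates 3D object geometry, texture, and layout from a single image. Built on a human- and model-in-the-loop annotation pipeline and a multi-stage training framework, it performs strongly on natural scenes.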
Details
Motivation: Existing 3D reconstruction methods handle occlusion and clutter in real-world images poorly and lack large-scale visually grounded 3D data.
Method: Proposes SAM 3D, which uses a human- and model-in-the-loop annotation pipeline to build visually grounded 3D data at scale, together with a multi-stage training strategy that combines synthetic pretraining with real-world alignment.
Result: In human preference tests on real-world objects and scenes, it achieves at least a 5:1 win rate over recent work. The authors plan to release code, model weights, an online demo, and a new benchmark.
Conclusion: SAM 3D breaks through the 3D "data barrier" and delivers high-quality, visually grounded 3D object reconstruction in complex real-world settings.
Abstract: We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
[123] TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming
Zeyuan Yin,Xiaoming Liu
Main category: cs.CV
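TL;DR: This paper proposes TRIM, a post-training method that uses trajectory reduction and instance mask denoising to accelerate inference in 3D Gaussian diffusion models while preserving generation quality and supporting inference-time scaling.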
Details
Motivation: Existing 3D Gaussian diffusion models are slow and computationally expensive to denoise because of their very large number of Gaussian primitives, which limits generation efficiency and scalability; an efficient way to speed up inference is needed.
Method: Proposes TRIM, which has two core strategies: (1) a lightweight selector model evaluates latent Gaussian primitives so that trajectories can be reduced early by keeping high-quality candidates; (2) instance mask denoising filters out redundant background regions at each denoising step, reducing the number of learnable Gaussian primitives.
Result: Experiments show that TRIM significantly improves both the efficiency and quality of 3D generation, accelerating inference without compromising output quality while supporting inference-time scaling.
Conclusion: TRIM provides an efficient inference acceleration scheme for 3D Gaussian diffusion models that combines performance gains with practical scalability.
Abstract: Recent advances in 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose TRIM (Trajectory Reduction and Instance Mask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies, to accelerate inference without compromising output quality while supporting the inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at https://github.com/zeyuanyin/TRIM.
[124] Late-decoupled 3D Hierarchical Semantic Segmentation with Semantic Prototype Discrimination based Bi-branch Supervision
Shuyu Cao,Chongshou Li,Jie Xu,Tianrui Li,Na Zhao
Main category: cs.CV
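TL;DR: Proposes a new 3D hierarchical semantic segmentation framework that uses a late-decoupled architecture and a bi-branch supervision mechanism to address multi-hierarchy conflicts and class imbalance, achieving state-of-the-art performance.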
Details
Motivation: Existing 3DHS methods overlook multi-hierarchy conflicts in cross-hierarchy optimization and class imbalance across hierarchies, which hurts fine-grained scene understanding.
Method: Designs a framework with a primary 3DHS branch and an auxiliary discrimination branch. It adopts a late-decoupled architecture with coarse-to-fine hierarchical guidance and consistency, and introduces a semantic-prototype-based bi-branch supervision mechanism that strengthens the discriminability of point cloud features and segmentation under class imbalance.
Result: Experiments on multiple datasets and backbones show state-of-the-art 3DHS performance, and the core components can serve as plug-and-play modules to improve earlier methods.
Conclusion: The late-decoupled architecture and bi-branch supervision mechanism alleviate multi-hierarchy conflicts and class imbalance and clearly improve overall 3D hierarchical semantic segmentation performance.
Abstract: 3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite the progress, previous 3DHS methods have overlooked the following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) the class imbalance issue is inevitable across multiple hierarchies of 3D scenes, which makes the model performance become dominated by major classes. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework which employs multiple decoders with the coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also constrain the class imbalance problem in each individual hierarchy. Moreover, we introduce a 3DHS-oriented semantic prototype based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to enhance the class-imbalance segmentation. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and its core components can also be used as a plug-and-play enhancement to improve previous methods.
[125] Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation
Md. Samiul Alim,Sharjil Khan,Amrijit Biswas,Fuad Rahman,Shafin Rahman,Nabeel Mohammed
Main category: cs.CV
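TL;DR: Proposes a teacher-guided pruning framework that integrates knowledge distillation, enabling efficient one-shot global pruning that maintains good performance at high sparsity.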
Details
Motivation: Unstructured pruning usually needs repeated train-prune cycles with heavy computational overhead, so a more efficient pruning method is needed.
Method: Incorporates gradient signals from the teacher model into importance score calculation, tightly coupling knowledge distillation with importance scoring to enable one-shot global pruning, followed by sparsity-aware retraining to recover accuracy.
Result: Validated on CIFAR-10, CIFAR-100, and TinyImageNet. At high sparsity the method outperforms state-of-the-art baselines such as EPG and EPSD, and it is more computationally efficient than iterative schemes like COLT.
Conclusion: The framework substantially improves pruning efficiency while preserving model performance, making it well suited to deployment in resource-constrained environments.
Abstract: Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.
[126] Solving Spatial Supersensing Without Spatial Supersensing
Vishaal Udandarao,Shyamgopal Karthik,Surabhi S. Nath,Andreas Hochlehnert,Matthias Bethge,Ameya Prabhu
Main category: cs.CV
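TL;DR: This paper critically analyzes the spatial supersensing approach to video world models proposed by Cambrian-S. It finds that the two benchmarks, VSR and VSC, can be solved by simple methods or shortcut heuristics, indicating that the current benchmarks do not reliably measure genuine spatial supersensing.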
Details
Motivation: To assess whether the spatial supersensing approach of Cambrian-S truly integrates information across time and space, rather than relying on statistical shortcuts in the datasets.
Method: Introduces the NoSense baseline to probe the VSR benchmark and designs the VSC-Repeat experiment to test robustness to repeated scenes on the VSC benchmark, analyzing whether the Cambrian-S inference strategy depends on latent dataset biases.
Result: NoSense reaches 95% accuracy on VSR. Under VSC-Repeat, the mean relative accuracy of Cambrian-S collapses from 42% to 0%, revealing that its inference relies heavily on the shortcut assumption that rooms are never revisited.
Conclusion: The current VSI-Super benchmarks do not reliably measure spatial supersensing, and the gains of Cambrian-S come mainly from exploiting benchmark shortcuts rather than from genuine spatial cognition.
Abstract: Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S across both these fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows benchmarks like VSR can be nearly solved without spatial cognition, world modeling or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: We concatenate each video with itself 1-5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, the Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark that rooms are never revisited.
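The VSC-Repeat perturbation described above can be sketched in a few lines. This is a toy illustration, not the authors' released code: the per-frame object-ID sets, the helper names, and the reading of "k times" as appending k extra copies are all assumptions made for the example.

```python
from typing import List, Sequence, Set


def repeat_video(frames: Sequence[Set[str]], times: int) -> List[Set[str]]:
    """Concatenate a video with itself by appending `times` extra copies.

    Interpreting "k times" as k appended copies is an assumption.
    """
    if times < 0:
        raise ValueError("times must be non-negative")
    return list(frames) * (times + 1)


def unique_object_count(frames: Sequence[Set[str]]) -> int:
    """Ground-truth count: distinct object IDs seen anywhere in the video."""
    seen: Set[str] = set()
    for objects_in_frame in frames:
        seen |= objects_in_frame
    return len(seen)


# Hypothetical toy video: each frame lists the object IDs visible in it.
video = [{"chair_1"}, {"chair_1", "chair_2"}, {"lamp_1"}]

for k in range(1, 6):
    repeated = repeat_video(video, k)
    # The video gets longer, but the set of unique objects is unchanged.
    print(k, len(repeated), unique_object_count(repeated))
```

The ground-truth answer is invariant under this perturbation, so any drop in a model's counting accuracy on the repeated videos points to the model treating revisited views as new objects.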
Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than from robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
[127] PartUV: Part-Based UV Unwrapping of 3D Meshes
Zhaoning Wang,Xinyue Wei,Ruoxi Shi,Xiaoshuai Zhang,Hao Su,Minghua Liu
Main category: cs.CV
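TL;DR: This paper proposes PartUV, a part-based UV unwrapping method that produces fewer charts aligned with semantic parts while keeping distortion low, and is particularly suited to complex AI-generated meshes.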
Details
Motivation: Existing UV unwrapping methods perform poorly on AI-generated meshes, which are noisy and bumpy, and often produce fragmented charts and suboptimal boundaries.
Method: PartUV combines the learning-based part decomposition method PartField with new geometric heuristics in a top-down recursive framework. It minimizes the total number of charts while keeping each chart's distortion below a user-specified threshold, and integrates parallelized parameterization and packing algorithms.
Result: Evaluated on four diverse datasets, PartUV outperforms existing tools and neural methods in chart count and seam length, achieves comparable distortion, shows high success rates on challenging meshes, and enables new applications such as part-specific multi-tile packing.
Conclusion: PartUV is an efficient and robust UV unwrapping pipeline that handles the challenges of AI-generated meshes, balancing semantic structure with geometric quality and benefiting downstream tasks such as texture mapping.
Abstract: UV unwrapping flattens 3D surfaces to 2D with minimal distortion, often requiring the complex surface to be decomposed into multiple charts. Although extensively studied, existing UV unwrapping methods frequently struggle with AI-generated meshes, which are typically noisy, bumpy, and poorly conditioned. These methods often produce highly fragmented charts and suboptimal boundaries, introducing artifacts and hindering downstream tasks. We introduce PartUV, a part-based UV unwrapping pipeline that generates significantly fewer, part-aligned charts while maintaining low distortion. Built on top of a recent learning-based part decomposition method PartField, PartUV combines high-level semantic part decomposition with novel geometric heuristics in a top-down recursive framework. It ensures each chart's distortion remains below a user-specified threshold while minimizing the total number of charts. The pipeline integrates and extends parameterization and packing algorithms, incorporates dedicated handling of non-manifold and degenerate meshes, and is extensively parallelized for efficiency. Evaluated across four diverse datasets, including man-made, CAD, AI-generated, and Common Shapes, PartUV outperforms existing tools and recent neural methods in chart count and seam length, achieves comparable distortion, exhibits high success rates on challenging meshes, and enables new applications like part-specific multi-tiles packing. Our project page is at https://www.zhaoningwang.com/PartUV.
[128] TriDiff-4D: Fast 4D Generation through Diffusion-based Triplane Re-posing
Eddie Pokming Sheung,Qihao Liu,Wufei Ma,Prakhar Kaushik,Jianwen Xie,Alan Yuille
Main category: cs.CV
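TL;DR: This paper proposes TriDiff-4D, a new 4D generation framework that uses diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars from text descriptions, with much faster generation and more accurate motion.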
Details
Motivation: Existing 4D generation methods suffer from temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational cost, and limited control over dynamics, which restricts their wider use.
Method: Uses an auto-regressive strategy with diffusion models to generate 4D sequences of arbitrary length. It first generates a canonical 3D avatar and a corresponding motion sequence from text, then uses a second diffusion model to animate the avatar according to the motion sequence, explicitly learning structure and motion priors from large-scale 3D and motion datasets.
Result: Experiments show that TriDiff-4D generates complex motions with high-fidelity appearance and accurate 3D geometry, cuts generation time from hours to seconds by eliminating the optimization process, and significantly outperforms existing methods.
Conclusion: TriDiff-4D achieves efficient, controllable, high-fidelity text-to-4D avatar generation, with strong temporal consistency, motion accuracy, computational efficiency, and visual fidelity.
Abstract: With the increasing demand for 3D animation, generating high-fidelity, controllable 4D avatars from textual descriptions remains a significant challenge. Despite notable efforts in 4D generative modeling, existing methods exhibit fundamental limitations that impede their broader applicability, including temporal and geometric inconsistencies, perceptual artifacts, motion irregularities, high computational costs, and limited control over dynamics. To address these challenges, we propose TriDiff-4D, a novel 4D generative pipeline that employs diffusion-based triplane re-posing to produce high-quality, temporally coherent 4D avatars. Our model adopts an auto-regressive strategy to generate 4D sequences of arbitrary length, synthesizing each 3D frame with a single diffusion process. By explicitly learning 3D structure and motion priors from large-scale 3D and motion datasets, TriDiff-4D enables skeleton-driven 4D generation that excels in temporal consistency, motion accuracy, computational efficiency, and visual fidelity. Specifically, TriDiff-4D first generates a canonical 3D avatar and a corresponding motion sequence from a text prompt, then uses a second diffusion model to animate the avatar according to the motion sequence, supporting arbitrarily long 4D generation.
Experimental results demonstrate that TriDiff-4D significantly outperforms existing methods, reducing generation time from hours to seconds by eliminating the optimization process, while substantially improving the generation of complex motions with high-fidelity appearance and accurate 3D geometry.
[129] SceneDesigner: Controllable Multi-Object Image Generation with 9-DoF Pose Manipulation
Zhenyuan Qin,Xincheng Shuai,Henghui Ding
Main category: cs.CV
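TL;DR: This paper proposes SceneDesigner, a controllable image generation method for multi-object 9-DoF pose manipulation. It achieves precise control through a branched network and a new CNOCS map representation, and introduces the ObjectPose9D dataset and a two-stage training strategy with reinforcement learning to address data imbalance.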
Details
Motivation: Existing methods offer limited controllability and degraded quality when controlling the 9D poses (location, size, and orientation) of multiple objects simultaneously, and fall short of precise multi-object manipulation in complex scenes.
Method: Proposes SceneDesigner, which attaches a branched network to a pre-trained base model and introduces the CNOCS map to encode 9D pose information from the camera view. It builds ObjectPose9D, a new dataset with 9D pose annotations; uses a two-stage training strategy with reinforcement learning to mitigate imbalance on low-frequency poses; applies Disentangled Object Sampling at inference to reduce insufficient object generation and concept confusion; and supports personalization weights for user-customized control.
Result: Experiments show that SceneDesigner significantly outperforms existing methods in both controllability and generation quality, and the CNOCS representation leads to more efficient and stable training.
Conclusion: SceneDesigner provides an effective approach to multi-object 9D pose control, improving the precision and flexibility of controllable image generation.
Abstract: Controllable image generation has attracted increasing attention in recent years, enabling users to manipulate visual content such as identity and style. However, achieving simultaneous control over the 9D poses (location, size, and orientation) of multiple objects remains an open challenge. Despite recent progress, existing methods often suffer from limited controllability and degraded quality, falling short of comprehensive multi-object 9D pose control. To address these limitations, we propose SceneDesigner, a method for accurate and flexible multi-object 9-DoF pose manipulation. SceneDesigner incorporates a branched network to the pre-trained base model and leverages a new representation, CNOCS map, which encodes 9D pose information from the camera view. This representation exhibits strong geometric interpretation properties, leading to more efficient and stable training. To support training, we construct a new dataset, ObjectPose9D, which aggregates images from diverse sources along with 9D pose annotations. To further address data imbalance issues, particularly performance degradation on low-frequency poses, we introduce a two-stage training strategy with reinforcement learning, where the second stage fine-tunes the model using a reward-based objective on rebalanced data. At inference time, we propose Disentangled Object Sampling, a technique that mitigates insufficient object generation and concept confusion in complex multi-object scenes. Moreover, by integrating user-specific personalization weights, SceneDesigner enables customized pose control for reference subjects.
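To make the "9D pose (location, size, and orientation)" concrete, the sketch below stores one pose as three 3-vectors and expands it into the eight corners of an oriented 3D box. It is only an illustration of the nine degrees of freedom and is not the CNOCS map; the `Pose9D` class, the yaw-pitch-roll (Z-Y-X) angle convention, and the helper names are assumptions introduced here.

```python
import math
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Pose9D:
    location: Vec3     # box centre (x, y, z)
    size: Vec3         # full extents along the box's own axes
    orientation: Vec3  # (yaw, pitch, roll) in radians; Z-Y-X order is assumed


def rotation_matrix(yaw: float, pitch: float, roll: float) -> List[List[float]]:
    """R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def box_corners(pose: Pose9D) -> List[Vec3]:
    """Return the 8 corners of the oriented box described by the pose."""
    r = rotation_matrix(*pose.orientation)
    corners = []
    for signs in product((-0.5, 0.5), repeat=3):
        local = [s * extent for s, extent in zip(signs, pose.size)]
        world = tuple(
            pose.location[i] + sum(r[i][j] * local[j] for j in range(3))
            for i in range(3)
        )
        corners.append(world)
    return corners


# Hypothetical object: centred at (1, 2, 3), extents 2 x 4 x 6, no rotation.
axis_aligned = Pose9D((1.0, 2.0, 3.0), (2.0, 4.0, 6.0), (0.0, 0.0, 0.0))
corners = box_corners(axis_aligned)
print(corners[0], corners[-1])
```

Changing any one of the nine numbers moves, resizes, or rotates the box independently, which is the kind of per-object control the abstract describes.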
Extensive qualitative and quantitative experiments demonstrate that SceneDesigner significantly outperforms existing approaches in both controllability and quality. Code is publicly available at https://github.com/FudanCVL/SceneDesigner.
[130] V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
Yang Luo,Xuanlei Zhao,Baijiong Lin,Lingting Zhu,Liyao Tang,Yuqi Liu,Ying-Cong Chen,Shengju Qian,Xin Wang,Yang You
Main category: cs.CV
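TL;DR: This paper introduces V-ReasonBench, a benchmark for evaluating the reasoning ability of video generation models along four dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. It combines synthetic and real-world image sequences and provides verifiable, reproducible, unambiguous tasks. An evaluation of six state-of-the-art video models reveals clear differences across the reasoning dimensions, and the paper also analyzes hallucination behaviors and the effect of video duration on reasoning.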
Details
Motivation: As generative video models such as Veo-3 begin to show zero-shot reasoning abilities, a systematic and reliable benchmark is needed to measure their reasoning and to encourage models whose reasoning is better aligned with human logic.
Method: Builds V-ReasonBench from synthetic and real-world image sequences, covering four reasoning dimensions with tasks that are answer-verifiable, scalable, and unambiguous. Six state-of-the-art video models are evaluated and compared with strong image models, and the study analyzes hallucination behaviors and how video duration affects Chain-of-Frames reasoning.
Result: The evaluation shows clear dimension-wise differences among the six video models, with strong variation in structured, spatial, pattern-based, and physical reasoning. The study also reports a comparison with strong image models, an analysis of common hallucination behaviors, and an examination of the effect of video duration.
Conclusion: V-ReasonBench offers a unified, reproducible framework for measuring video reasoning and aims to support the development of video generation models with more reliable, human-aligned reasoning.
Abstract: Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
[131] Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Junhao Cheng,Liang Hou,Xin Tao,Jing Liao
Main category: cs.CV
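TL;DR: This paper proposes Video-Next-Event Prediction (VNEP), which uses video generation as the answer modality, and introduces VANS, a model that combines a vision-language model with a video diffusion model and uses reinforcement learning to achieve multimodal understanding and consistent video generation. Supported by the newly constructed VANS-Data-100K dataset, it achieves state-of-the-art performance on procedural and predictive benchmarks.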
Details
Motivation: Because video can directly show physical-world information that is hard to convey through text alone, the authors aim to extend video as a new answer modality for next-event prediction, addressing the limits of conventional text output in intuitiveness and practicality. Method: The paper proposes VANS, a model trained in a reinforcement learning framework in which the newly designed Joint-GRPO jointly optimizes a Vision-Language Model (VLM) and a Video Diffusion Model (VDM) so that they work together; it also builds VANS-Data-100K, a dataset of 100K samples, for training and evaluation. Result: On procedural and predictive benchmarks, VANS achieves state-of-the-art event prediction accuracy and video generation quality, significantly outperforming existing methods. Conclusion: VANS realizes the shift from "telling" to "showing", confirming the potential of video as an answer modality for next-event prediction and opening a new path for procedural learning and creative exploration. Abstract: While language models have become impactful in many real-world applications, video generation remains largely confined to entertainment. Motivated by video's inherent capacity to demonstrate physical-world information that is difficult to convey through language alone (e.g., imagine teaching someone to tie a tie using only text), we identify an underutilized opportunity to extend video as a new answer modality for Next-Event Prediction (NEP), formalized as Video-Next-Event Prediction (VNEP). While the established NEP task takes a video with a procedural or predictive question as input to predict the next event in text, VNEP requires dynamic video responses. This shift from telling to showing unlocks more intuitive and customized answers for procedural learning and creative exploration. However, this task remains challenging for existing models, as it demands an understanding of multimodal input, instruction-conditioned reasoning, and the generation of video with visual and semantic consistency. To address this, we introduce VANS, a model that leverages reinforcement learning to align a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) for VNEP. The core of VANS is our proposed Joint-GRPO that orchestrates the VLM and VDM to function as a unit. Driven by a shared reward on their respective output, it optimizes the VLM to produce captions that are both accurate and friendly to visualize, while guiding the VDM to generate videos that are faithful to these captions and the input visual context. To enable this learning, we craft VANS-Data-100K, a dedicated dataset for the VNEP task.
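The shared-reward idea described in the abstract can be illustrated with the group-relative advantage used in GRPO-style training, where each sampled output is scored against the mean and spread of its own group. The sketch below is only an illustration under stated assumptions: the function names, the weighted-sum reward, and the example scores are hypothetical and are not taken from the VANS paper or its code release.

```python
from statistics import mean, pstdev


def shared_reward(caption_score, video_score, w_caption=0.5, w_video=0.5):
    """Hypothetical shared reward: a weighted sum of a caption score and a video score.

    The weighting scheme is an assumption made for illustration only.
    """
    return w_caption * caption_score + w_video * video_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Uses the population standard deviation; this choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (caption_score, video_score) pairs for four sampled outputs.
samples = [(0.9, 0.8), (0.4, 0.7), (0.6, 0.2), (0.1, 0.3)]
rewards = [shared_reward(c, v) for c, v in samples]
advantages = group_relative_advantages(rewards)
```

Samples whose shared reward is above the group mean receive a positive advantage and the rest a negative one, so a single scalar signal can in principle steer both the captioning model and the video model.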
Experiments on procedural and predictive benchmarks demonstrate that VANS achieves state-of-the-art performance in both video event prediction and visualization. Code is released at https://github.com/KlingTeam/VANS.
[132] Learning to Think Fast and Slow for Visual Language Models
Chenyu Lin,Cheng Chi,Jinlin Wu,Sharon Li,Kaiyang Zhou
Main category: cs.CV
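TL;DR: This paper proposes DualMindVLM, a simple reinforcement learning approach that lets visual language models switch automatically between fast and slow thinking modes according to task difficulty, balancing reasoning efficiency and performance.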
Details
Motivation: Existing visual language models generally pursue long reasoning chains regardless of the problem, which drives up computational cost; they lack the ability to adapt to task difficulty and so cannot allocate cognitive resources efficiently. Method: The approach has two stages: the first labels data as fast-thinking or slow-thinking according to the model's output length; the second trains the model with GRPO together with the thinking-mode labels to develop dual-mode reasoning. Result: DualMindVLM significantly outperforms the base model and performs on par with state-of-the-art models on visual reasoning tasks while maintaining very high token efficiency. Conclusion: By introducing a human-like two-system thinking mechanism, visual language models can balance reasoning quality against computational cost, offering a new direction for efficient reasoning. Abstract: When confronted with complex problems, we tend to think slowly; conversely, for simple questions, we think quickly. Such a two-system thinking mechanism allows us to efficiently allocate cognitive resources, enabling quick decision-making for straightforward issues while reserving deeper analytical thinking for more intricate challenges. However, existing reasoning-oriented visual language models (VLMs), whether trained with explicit chain-of-thought annotations or rule-based RL rewards, mainly pursue lengthy, detailed reasoning chains, which often lead to excessive computational costs. In this work, we propose a simple RL approach, which enables VLMs to automatically switch between fast and slow thinking modes depending on task difficulty. The approach consists of two stages: in the first stage, we label data as either requiring fast thinking or slow thinking based on the model output length, which is inspired by the observation that pre-trained VLMs typically produce answers of varying lengths for different types of questions; in the second stage, we train the model using GRPO along with the thinking mode labels to develop dual-mode thinking. Despite its simplicity, our model, named DualMindVLM, significantly outperforms the base model and achieves performance on par with state-of-the-art visual reasoning models, while maintaining exceptionally high token efficiency.
[133] EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Omkar Thawakar,Shravan Venkatraman,Ritesh Thawkar,Abdelrahman Shaker,Hisham Cholakkal,Rao Muhammad Anwer,Salman Khan,Fahad Khan
Main category: cs.CV
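TL;DR: The paper proposes EvoLMM, a self-evolving framework that improves the reasoning ability of large multimodal models without supervision. It uses a two-agent mechanism (a Proposer and a Solver) to learn from self-generated rewards and achieves consistent gains on several multimodal math-reasoning tasks.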
Details
Motivation: Training of existing large multimodal models depends on human annotation or external reward models, which limits autonomy and scalability, so a fully unsupervised self-evolving method is needed to improve model reasoning. Method: The framework instantiates two agents from a single backbone model: a Proposer that generates diverse, image-grounded questions and a Solver that answers them through internal consistency. The model evolves through a continuous self-rewarding process, without human annotation or external reward signals. Result: With Qwen2.5-VL as the base model and only raw image data, the method yields gains of up to about 3% on multimodal math-reasoning benchmarks including ChartQA, MathVista, and MathVision. Conclusion: EvoLMM offers a simple yet effective fully unsupervised self-improvement framework and a workable baseline for future research on self-evolving large multimodal models. Abstract: Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains up to $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
[134] NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses
Jing Wen,Alexander G. Schwing,Shenlong Wang
Main category: cs.CV
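TL;DR: The paper proposes NoPo-Avatar, a method that reconstructs animatable 3D human avatars from a single image or a sparse set of images without any pose input, and that performs better in practical settings where ground-truth pose annotations are unavailable.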
Details
Motivation: Existing methods rely on human poses and camera poses as test-time input, and reconstruction quality drops significantly when the pose estimates are noisy. A robust reconstruction method that does not depend on pose input is therefore needed. Method: NoPo-Avatar removes the test-time dependence on pose input entirely and reconstructs from images alone, avoiding the degradation caused by noisy pose estimates. Result: Experiments on the THuman2.0, XHuman, and HuGe100K datasets show that NoPo-Avatar outperforms existing baselines in practical settings without ground-truth poses and performs comparably in lab settings with ground-truth poses. Conclusion: Removing the test-time dependence on poses improves robustness and practicality; NoPo-Avatar performs well across a range of scenarios and is widely applicable. Abstract: We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate "ground-truth" camera poses and human poses as input to guide reconstruction at test time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).
[135] Dataset Distillation for Pre-Trained Self-Supervised Vision Models
George Cazenavette,Antonio Torralba,Vincent Sitzmann
Main category: cs.CV
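TL;DR: This paper proposes Linear Gradient Matching, a method that distills datasets for pre-trained vision models so as to optimize the training of linear probes; the synthetic data performs well across multiple tasks and generalizes across models.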