Table of Contents
cs.CL
[1] Human-Level and Beyond: Benchmarking Large Language Models Against Clinical Pharmacists in Prescription Review
Yan Yang,Mouxiao Bian,Peiling Li,Bingjian Wen,Ruiyao Chen,Kangkun Mao,Xiaojun Ye,Tianbin Li,Pengcheng Chen,Bing Han,Jie Xu,Kaifeng Qiu,Junyan Wu
Main category: cs.CL
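TL;DR: RxBench is a comprehensive benchmark for evaluating large language models on prescription review. It covers common error types, evaluates 18 mainstream LLMs, finds that top models can match or exceed human pharmacists, and yields a high-performing specialized model through fine-tuning.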
Details
Motivation: Systematic, fine-grained evaluation of large language models in clinical prescription review requires a standardized benchmark with comprehensive coverage.
Method: RxBench is built from authoritative pharmacy references and comprises single-choice, multiple-choice, and short-answer questions, 2,259 items in total, all reviewed by clinical pharmacists. Eighteen frontier LLMs are evaluated, and a mid-tier model receives targeted fine-tuning to improve its performance on a specific task.
Result: Gemini-2.5-pro-preview-05-06, Grok-4-0709, and DeepSeek-R1-0528 perform best and form the first tier. Leading LLMs match or exceed human pharmacists on certain tasks, and the fine-tuned specialized model rivals top general-purpose models on short-answer questions.
Conclusion: RxBench provides a standardized, error-type-oriented evaluation framework, reveals the capabilities and limitations of current LLMs in clinical prescription review, and lays a foundation for building more reliable, specialized clinical decision tools.
Abstract: The rapid advancement of large language models (LLMs) has accelerated their integration into clinical decision support, particularly in prescription review. To enable systematic and fine-grained evaluation, we developed RxBench, a comprehensive benchmark that covers common prescription review categories and consolidates 14 frequent types of prescription errors drawn from authoritative pharmacy references. RxBench consists of 1,150 single-choice, 230 multiple-choice, and 879 short-answer items, all reviewed by experienced clinical pharmacists. We benchmarked 18 state-of-the-art LLMs and identified clear stratification of performance across tasks. Notably, Gemini-2.5-pro-preview-05-06, Grok-4-0709, and DeepSeek-R1-0528 consistently formed the first tier, outperforming other models in both accuracy and robustness. Comparisons with licensed pharmacists indicated that leading LLMs can match or exceed human performance in certain tasks. Furthermore, building on insights from our benchmark evaluation, we performed targeted fine-tuning on a mid-tier model, resulting in a specialized model that rivals leading general-purpose LLMs in performance on short-answer question tasks. The main contribution of RxBench lies in establishing a standardized, error-type-oriented framework that not only reveals the capabilities and limitations of frontier LLMs in prescription review but also provides a foundational resource for building more reliable and specialized clinical tools.
[2] Deep Research: A Systematic Survey
Zhengliang Shi,Yiqun Chen,Haitao Li,Weiwei Sun,Shiyu Ni,Yougang Lyu,Run-Ze Fan,Bowen Jin,Yixuan Weng,Minjun Zhu,Qiujie Xie,Xinyu Guo,Qu Yang,Jiayi Wu,Jujia Zhao,Xiaqiang Tang,Xinbei Ma,Cunxiang Wang,Jiaxin Mao,Qingyao Ai,Jen-Tse Huang,Wenxuan Wang,Yue Zhang,Yiming Yang,Zhaopeng Tu,Zhaochun Ren
Main category: cs.CL
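TL;DR: This paper presents a comprehensive, systematic survey of Deep Research (DR) systems. It proposes a three-stage roadmap, identifies four core components of DR (query planning, information acquisition, memory management, and answer generation), summarizes optimization techniques, and consolidates evaluation criteria and open challenges.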
Details
Motivation: As large language models advance, single-shot prompting and standard retrieval-augmented generation can no longer handle complex tasks that require critical thinking and multi-source verification. The emerging deep research paradigm therefore needs a systematic overview to support its development.
Method: Through a literature survey, the paper proposes a three-stage roadmap, builds a framework of four key components with fine-grained sub-taxonomies, and summarizes optimization methods and evaluation approaches.
Result: The survey formalizes the structure of deep research, distinguishes it from related paradigms, establishes a component-based framework with sub-taxonomies, and consolidates existing optimization techniques and evaluation criteria.
Conclusion: Deep research is an important direction for improving the ability of LLMs to solve complex, open-ended tasks, and the systematic framework and analysis presented here give clear guidance for future work.
Abstract: Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source, and verifiable outputs, which are beyond single-shot prompting or standard retrieval-augmented generation. Recently, numerous studies have explored Deep Research (DR), which aims to combine the reasoning capabilities of LLMs with external tools, such as search engines, thereby empowering LLMs to act as research agents capable of completing complex, open-ended tasks. This survey presents a comprehensive and systematic overview of deep research systems, including a clear roadmap, foundational components, practical implementation techniques, important challenges, and future directions. Specifically, our main contributions are as follows: (i) we formalize a three-stage roadmap and distinguish deep research from related paradigms; (ii) we introduce four key components: query planning, information acquisition, memory management, and answer generation, each paired with fine-grained sub-taxonomies; (iii) we summarize optimization techniques, including prompting, supervised fine-tuning, and agentic reinforcement learning; and (iv) we consolidate evaluation criteria and open challenges, aiming to guide and facilitate future development. As the field of deep research continues to evolve rapidly, we are committed to continuously updating this survey to reflect the latest progress in this area.
[3] Mirror, Mirror on the Wall -- Which is the Best Model of Them All?
Dina Sayed,Heiko Schuldt
Main category: cs.CL
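TL;DR: This paper examines the quantitative evaluation dimension of choosing a large language model for a specific use case. By analyzing current leaderboards and benchmarks, with the medical domain as a case study, it shows the evolution, current state, and practical significance of model evaluation, and proposes a systematic Model Selection Methodology (MSM) to guide model choice.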
Details
Motivation: Because large companies keep releasing new foundation models, choosing the model best suited to a given task or domain is increasingly complex. A systematic approach is needed that considers both qualitative and quantitative dimensions, with particular attention to quantitative performance evaluation.
Method: The paper analyzes existing leaderboards and standardized benchmarks, using the medical domain as a case study to show how the quantitative evaluation dimension has developed and where it stands today.
Result: The study maps the current evaluation landscape for LLMs in the medical domain and its practical value, and proposes a systematic methodology for model selection.
Conclusion: The proposed Model Selection Methodology (MSM) is a systematic approach that guides users in choosing the best LLM for a specific use case, particularly when the choice relies on quantitative performance metrics.
Abstract: Large Language Models (LLMs) have become one of the most transformative tools across many applications, as they have significantly boosted productivity and achieved impressive results in various domains such as finance, healthcare, education, telecommunications, and law, among others. Typically, state-of-the-art (SOTA) foundation models are developed by large corporations based on large data collections and substantial computational and financial resources required to pretrain such models from scratch. These foundation models then serve as the basis for further development and domain adaptation for specific use cases or tasks. However, given the dynamic and fast-paced nature of launching new foundation models, the process of selecting the most suitable model for a particular use case, application, or domain becomes increasingly complex. We argue that there are two main dimensions that need to be taken into consideration when selecting a model for further training: a qualitative dimension (which model is best suited for a task based on information, for instance, taken from model cards) and a quantitative dimension (which is the best performing model). The quantitative performance of models is assessed through leaderboards, which rank models based on standardized benchmarks and provide a consistent framework for comparing different LLMs. In this work, we address the analysis of the quantitative dimension by exploring the current leaderboards and benchmarks. To illustrate this analysis, we focus on the medical domain as a case study, demonstrating the evolution, current landscape, and practical significance of this quantitative evaluation dimension. Finally, we propose a Model Selection Methodology (MSM), a systematic approach designed to guide the navigation, prioritization, and selection of the model that best aligns with a given use case.
[4] Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models
Kecheng Chen,Ziru Liu,Xijia Tao,Hui Liu,Xinyu Fu,Suiyun Zhang,Dandan Tu,Lingpeng Kong,Rui Liu,Haoliang Li
Main category: cs.CL
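TL;DR: The paper proposes Coherent Contextual Decoding (CCD), a new inference framework that rectifies the generation trajectory using historical context and dynamically adjusts the decoding budget at each step, significantly improving both the generation quality and the speed of diffusion language models.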
Details
Motivation: Existing inference methods for diffusion language models rely on local, immediate-step metrics and do not account for the consistency of the overall sequence, which leads to unstable generation trajectories and poor quality.
Method: CCD introduces a trajectory rectification mechanism that uses historical context to improve sequence coherence, and models historical consistency theoretically through conditional mutual information. It also designs an adaptive sampling strategy that adjusts the unmasking budget of each step according to the consistency metric.
Result: Across diverse benchmarks on Dream and LLaDA, CCD achieves up to 3.48x speedup together with a 3.91% performance improvement.
Conclusion: By strengthening contextual consistency and allocating resources dynamically, CCD improves both the efficiency and the quality of diffusion language model generation and offers a new approach to efficient inference.
Abstract: Diffusion Language Models (DLMs) have recently achieved significant success due to their any-order generation capabilities. However, existing inference methods typically rely on local, immediate-step metrics such as confidence or entropy which inherently lack a more reliable perspective. This limitation frequently leads to inconsistent sampling trajectories and suboptimal generation quality. To address this, we propose Coherent Contextual Decoding (CCD), a novel inference framework built upon two core innovations. First, CCD employs a trajectory rectification mechanism that leverages historical context to enhance sequence coherence, enabling the early rejection of suboptimal paths. We demonstrate that this mechanism is theoretically equivalent to modeling the consistency of historical steps via the conditional mutual information between context and token predictions. Building on this theoretical insight, we further address the inefficiency of conventional uniform decoding budgets. Instead of rigid allocations based on diffusion steps, we introduce an adaptive sampling strategy that dynamically adjusts the unmasking budget for each step according to our consistency metric. Consequently, our method significantly improves the quality of generation trajectories while accelerating the sampling process. Empirically, our method achieves a simultaneous enhancement in both inference speed and performance across diverse benchmarks on Dream and LLaDA, delivering up to 3.48x speedup alongside 3.91% performance improvement.
[5] Reversing Large Language Models for Efficient Training and Fine-Tuning
Eshed Gal,Moshe Eliasof,Javier Turek,Uri Ascher,Eran Treister,Eldad Haber
Main category: cs.CL
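TL;DR: The paper proposes memory-efficient, reversible LLM architectures based on symmetric and symplectic differential equations. Time-reversible dynamics reduce training memory consumption, existing non-reversible LLMs can be converted efficiently into reversible architectures, and performance is comparable or better across several models and datasets.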
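As a rough illustration of what "time-reversible dynamics" can mean, the following minimal sketch uses a leapfrog-style two-state update, a standard symmetric scheme. It is not the architecture from the paper: the `layer` function, the scalar weights, and the step size are all hypothetical stand-ins chosen only to show that earlier hidden states can be recomputed from later ones instead of being stored.

```python
import math

def layer(h, weight):
    # Hypothetical stand-in for a network block: elementwise tanh(weight * x).
    return [math.tanh(weight * x) for x in h]

def forward(h_prev, h_curr, weights, step=0.5):
    """Run h_next = 2*h_curr - h_prev + step^2 * layer(h_curr) once per weight.

    Returns only the last two states; no intermediate state is kept.
    """
    for w in weights:
        f = layer(h_curr, w)
        h_next = [2 * c - p + step ** 2 * fx for c, p, fx in zip(h_curr, h_prev, f)]
        h_prev, h_curr = h_curr, h_next
    return h_prev, h_curr

def backward(h_prev, h_curr, weights, step=0.5):
    """Invert forward(): recover the first two states from the last two."""
    for w in reversed(weights):
        f = layer(h_prev, w)
        h_before = [2 * p - c + step ** 2 * fx for p, c, fx in zip(h_prev, h_curr, f)]
        h_prev, h_curr = h_before, h_prev
    return h_prev, h_curr

x0, x1 = [0.1, -0.2], [0.3, 0.4]
weights = [0.5, -1.0, 2.0]
last_two = forward(x0, x1, weights)
recovered = backward(*last_two, weights)
```

In exact arithmetic `recovered` equals `(x0, x1)`; in floating point it matches up to rounding error, which is why the reconstruction is compared with a tolerance rather than for equality.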
Details
Motivation: Training large language models (LLMs) is expensive and time-consuming, so models are usually fine-tuned from pretrained weights. Standard architectures, however, must store all intermediate activations, which causes high memory overhead and limits batch size and training efficiency.
Method: Inspired by symmetric and symplectic differential equations, the authors design reversible LLM architectures that use time-reversible dynamics to recover hidden states during backpropagation, avoiding the need to store intermediate activations. They also propose an efficient fine-tuning method that converts existing non-reversible LLMs into reversible architectures.
Result: Compared with baseline models, memory usage drops significantly, which allows larger batches and higher throughput. Performance is comparable or better across several LLMs and benchmarks.
Conclusion: The method offers a scalable and efficient way to reduce the memory and computational costs of both training LLMs from scratch and fine-tuning them, and has practical value.
Abstract: Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, oftentimes, LLMs are fine-tuned to address a specific task, given the pretrained weights of a pre-trained LLM considered a foundation model. In this work, we introduce memory-efficient, reversible architectures for LLMs, inspired by symmetric and symplectic differential equations, and investigate their theoretical properties. Different from standard, baseline architectures that store all intermediate activations, the proposed models use time-reversible dynamics to retrieve hidden states during backpropagation, relieving the need to store activations. This property allows for a drastic reduction in memory consumption, allowing for the processing of larger batch sizes for the same available memory, thereby offering improved throughput. In addition, we propose an efficient method for converting existing, non-reversible LLMs into reversible architectures through fine-tuning, rendering our approach practical for exploiting existing pre-trained models. Our results show comparable or improved performance on several datasets and benchmarks, on several LLMs, building a scalable and efficient path towards reducing the memory and computational costs associated with both training from scratch and fine-tuning of LLMs.
[6] Dialect Identification Using Resource-Efficient Fine-Tuning Approaches
Zirui Lin,Haris Gulzar,Monnika Roslianna Busto,Akiko Masaki,Takeharu Eda,Kazuhiro Nakadai
Main category: cs.CL
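TL;DR: This paper studies Memory-Efficient Fine-Tuning (MEFT) methods for fine-tuning speech models, reducing computation and memory overhead while preserving dialect identification performance.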
Details
Motivation: Conventional fine-tuning for dialect identification is expensive in computation and memory, and existing parameter-efficient methods bring only limited gains in memory efficiency and training speed.
Method: MEFT methods, originally proposed for language processing, are applied to a pre-trained speech model (Whisper). Experiments on the KeSpeech dataset evaluate the GPU memory usage and training speed of different MEFT methods.
Result: On the task of identifying six Mandarin subdialects, GPU memory usage falls by up to 73.25% and training runs 2.1 times faster, with accuracy comparable to full fine-tuning and PEFT methods.
Conclusion: MEFT methods substantially improve training efficiency and memory utilization for dialect identification while maintaining good model performance, making them suitable for resource-constrained settings.
Abstract: Dialect Identification (DI) is a task to recognize different dialects within the same language from a speech signal. DI can help to improve the downstream speech related tasks even when speakers have a strong dialect. However, fine-tuning a speech model for tasks like DI is expensive in terms of computation cost and memory requirement. Recent studies have explored fine-tuning pre-trained speech models for tasks like DI using Parameter-Efficient Fine-Tuning (PEFT) methods, which offer parameter efficiency but limited improvement in memory efficiency and training speed. To address these challenges, we explore Memory-Efficient Fine-Tuning (MEFT) methods, originally proposed for language processing, and apply them to the general-purpose pre-trained speech model. We then comprehensively analyze the GPU memory usage and fine-tuning speed based on various MEFT methods. As a case study, we fine-tune the Whisper model to identify six Mandarin subdialects from the KeSpeech dataset, reducing GPU memory usage by up to 73.25% and accelerating training speed by a factor of 2.1, while maintaining accuracy comparable to vanilla fine-tuning and PEFT methods.
[7] Feature Selection Empowered BERT for Detection of Hate Speech with Vocabulary Augmentation
Pritish N. Desai,Tanay Kewalramani,Srimanta Mandal
Main category: cs.CL
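TL;DR: The paper proposes a BERT fine-tuning strategy based on TF-IDF sample selection and domain-specific vocabulary augmentation for efficient, high-performing hate speech classification.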
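To make the idea of TF-IDF-based sample selection concrete, here is a minimal sketch in plain Python. The source does not say how examples are scored, so the scoring rule below (sum of a document's smoothed TF-IDF weights) is an assumption, as are the function names `tfidf_scores` and `select_top_fraction` and the whitespace tokenization; only the 75 percent retention rate comes from the abstract.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Score each document by the sum of its terms' TF-IDF weights (assumed rule)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        scores.append(sum(
            (count / total) * math.log((1 + n_docs) / (1 + df[term]))
            for term, count in tf.items()
        ))
    return scores

def select_top_fraction(docs, fraction=0.75):
    """Keep the highest-scoring fraction of documents, in their original order."""
    scores = tfidf_scores(docs)
    keep = max(1, math.ceil(fraction * len(docs)))
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in sorted(ranked[:keep])]

corpus = ["the the the the", "rare slang insult", "the cat sat", "the dog ran"]
selected = select_top_fraction(corpus, fraction=0.75)
```

With this toy corpus the document made only of the most common term scores lowest and is the one dropped. A real pipeline would likely use a proper tokenizer and a library vectorizer; the sketch only shows the ranking-and-truncation step.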
Details
Motivation: Abusive speech on social media keeps evolving, with new slang and obfuscated terms appearing frequently. Traditional detection systems struggle to keep up, and large-scale training is costly.
Method: A TF-IDF-based sample selection mechanism retains the most informative 75% of examples, and the BERT tokenizer is extended with domain-specific slang and lexical variants found in abusive contexts.
Result: On a widely used hate speech dataset, the method maintains competitive performance with less training data and improves computational efficiency.
Conclusion: The method substantially reduces training overhead while preserving detection quality, and is scalable and adaptable to the moderation of evolving abusive content.
Abstract: Abusive speech on social media poses a persistent and evolving challenge, driven by the continuous emergence of novel slang and obfuscated terms designed to circumvent detection systems. In this work, we present a data-efficient strategy for fine-tuning BERT on hate speech classification by significantly reducing training set size without compromising performance. Our approach employs a TF-IDF-based sample selection mechanism to retain only the most informative 75 percent of examples, thereby minimizing training overhead. To address the limitations of BERT's native vocabulary in capturing evolving hate speech terminology, we augment the tokenizer with domain-specific slang and lexical variants commonly found in abusive contexts. Experimental results on a widely used hate speech dataset demonstrate that our method achieves competitive performance while improving computational efficiency, highlighting its potential for scalable and adaptive abusive content moderation.
[8] Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models
Ziyan Wang,Enmao Diao,Qi Le,Pu Wang,Guanchu Wang,Minwoo Lee,Shu-ping Yeh,Li Yang
Main category: cs.CL
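TL;DR: The paper proposes RESP, a self-reflective structured pruning framework that uses self-generated calibration and progressive regeneration to markedly improve pruning of reasoning LLMs while preserving reasoning coherence.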
Details
Motivation: Existing pruning methods are brittle on reasoning LLMs (RLMs): even light sparsification severely degrades accuracy and reasoning coherence, mainly because the calibration data and the pruning objective do not match the model's decode-time reasoning behavior.
Method: RESP calibrates on the model's own self-generated reasoning traces, estimates importance with decode-only gradients, and uses progressive regeneration to keep calibration faithful at high sparsity, producing structured pruning aligned with the model's reasoning dynamics.
Result: On Qwen3-8B, RESP preserves near-dense accuracy at 20-30% sparsity. At 40% sparsity it reaches 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
Conclusion: RESP addresses the failure of existing pruning methods on reasoning LLMs and substantially improves how well pruned models retain their reasoning ability, which suits efficient deployment in resource-constrained settings.
Abstract: Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
[9] A Knowledge-Based Language Model: Deducing Grammatical Knowledge in a Multi-Agent Language Acquisition Simulation
David Ph. Shakouri,Crit Cremers,Niels O. Schiller
Main category: cs.CL
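TL;DR: This paper introduces MODOMA, a multi-agent computational framework for unsupervised language acquisition experiments, in which grammatical categories are acquired through interaction between an adult agent and a child agent, and validates its ability to simulate human language acquisition.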
Details
Motivation: The goal is a fully parametrized computational environment with interpretable results for simulating and studying language acquisition, in particular the autonomous acquisition of grammatical knowledge.
Method: The multi-agent system combines statistical and rule-based procedures. The adult agent generates data, the child agent learns from training and test data, and the outcome is a knowledge-based language model.
Result: The experiments show that the child agent successfully acquires functional and content categories, and that the machine-generated data exhibit patterns similar to those of human language.
Conclusion: The MODOMA framework supports the modelling of language acquisition, provides a new tool for computational linguistics experiments, and confirms that an interactive multi-agent approach to acquiring grammatical knowledge is feasible.
Abstract: This paper presents an initial study performed by the MODOMA system. The MODOMA is a computational multi-agent laboratory environment for unsupervised language acquisition experiments such that acquisition is based on the interaction between two language models, an adult and a child agent. Although this framework employs statistical as well as rule-based procedures, the result of language acquisition is a knowledge-based language model, which can be used to generate and parse new utterances of the target language. This system is fully parametrized and researchers can control all aspects of the experiments while the results of language acquisition, that is, the acquired grammatical knowledge, are explicitly represented and can be consulted. Thus, this system introduces novel possibilities for conducting computational language acquisition experiments. The experiments presented by this paper demonstrate that functional and content categories can be acquired and represented by the daughter agent based on training and test data containing different amounts of exemplars generated by the adult agent. Interestingly, similar patterns, which are well-established for human-generated data, are also found for these machine-generated data. As the procedures resulted in the successful acquisition of discrete grammatical categories by the child agent, these experiments substantiate the validity of the MODOMA approach to modelling language acquisition.
[10] Swivuriso: The South African Next Voices Multilingual Speech Dataset
Vukosi Marivatee,Kayode Olaleye,Sitwala Mundia,Andinda Bakainga,Unarine Netshifhefhe,Mahmooda Milanzie,Tsholofelo Hope Mogale,Thapelo Sindane,Zainab Abdulrasaq,Kesego Mokgosi,Chijioke Okorie,Nia Zion Van Wyk,Graham Morrissey,Dale Dunbar,Francois Smit,Tsosheletso Chidi,Rooweither Mabuya,Andiswa Bukula,Respect Mlambo,Tebogo Macucwa,Idris Abdulmumin,Seani Rananga
Main category: cs.CL
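TL;DR: Swivuriso is a 3000-hour multilingual speech dataset covering seven South African languages, built to advance and benchmark automatic speech recognition (ASR) technology.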
Details
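Motivation: Existing ASR datasets have significant gaps for South African languages and lack high-quality speech data in key domains such as agriculture and healthcare.
Method: Following explicit design principles, ethical considerations, and data collection procedures, the authors build a multilingual speech dataset spanning agriculture, healthcare, and general-domain topics, then use it to train and fine-tune ASR models and report baseline results.
Result: The Swivuriso dataset is released together with baseline ASR training results, which indicate its potential to improve speech recognition for South African languages.
Conclusion: Swivuriso fills a gap in ASR data for South African languages and provides an important resource for future research on and applications of speech technology for low-resource languages.
Abstract: This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare to other ASR datasets for the languages concerned.
[11] Lightweight Latent Reasoning for Narrative Tasks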
Alexander Gurung,Nikolay Malkin,Mirella Lapata
Main category: cs.CL
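TL;DR: LiteReason is a lightweight latent reasoning method. It trains a lightweight Reasoning Projector module to produce continuous latent tokens that let the model "skip" reasoning steps, substantially reducing reasoning length and computational cost while maintaining performance.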
Details
Motivation: Large language models generate long chains of thought when handling complex tasks, which is computationally expensive, especially for narrative-related tasks that involve retrieving and processing large amounts of text. A more efficient reasoning method is needed to improve the trade-off between performance and computational cost.
Method: LiteReason adds a lightweight Reasoning Projector module that can be interleaved with standard token sampling and combined easily with reinforcement learning. During reinforcement learning, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed.
Result: On plot hole detection and book chapter generation, LiteReason outperforms existing latent reasoning baselines and comes close to non-latent RL training, while reducing final reasoning length by 77-92%.
Conclusion: By steering RL training toward a more efficient part of the performance-computation trade-off curve, LiteReason offers an effective way to cut reasoning cost for tasks that require long reasoning.
Abstract: Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces" that act as latent variables in the generation of an output given a query. A model's ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model 'skip' reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.
[12] DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models
Olivia Kim
Main category: cs.CL
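TL;DR: The paper proposes DETAIL, a framework for evaluating the reasoning performance of large language models at different levels of prompt specificity, and finds that more specific prompts improve accuracy, especially for smaller models and procedural tasks.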
Details
Motivation: Prompt design is critical to LLM reasoning, but the effect of prompt specificity is not well understood and calls for systematic study.
Method: Multi-level prompts are generated with GPT-4, specificity is quantified via perplexity, and correctness is assessed with GPT-based semantic equivalence.
Result: Experiments on 30 novel reasoning tasks show that higher prompt specificity improves model accuracy, with the largest effect for smaller models and procedural tasks.
Conclusion: Prompt specificity significantly affects reasoning performance, which argues for adaptive prompting strategies. The paper provides tools and data to support further research.
Abstract: Prompt design plays a critical role in the reasoning performance of large language models (LLMs), yet the impact of prompt specificity - how detailed or vague a prompt is - remains understudied. This paper introduces DETAIL, a framework for evaluating LLM performance across varying levels of prompt specificity. We generate multi-level prompts using GPT-4, quantify specificity via perplexity, and assess correctness using GPT-based semantic equivalence. Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini reveal that specificity improves accuracy, especially for smaller models and procedural tasks. Our results highlight the need for adaptive prompting strategies and provide tools and data to support further research.
[13] CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering
Liangji Kong,Aditya Joshi,Sarvnaz Karimi
Main category: cs.CL
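TL;DR: CAIRNS is a framework that requires neither fine-tuning nor reinforcement learning. By improving readability and citation reliability, it helps agricultural experts obtain credible answers about climate adaptation strategies from complex web data.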
Details
Motivation: Responding to the impact of climate change on agriculture requires extracting useful information on climate adaptation strategies from large volumes of unstructured and structured data, but existing methods fall short in readability, citation reliability, and evaluation consistency.
Method: The CAIRNS framework uses a structured ScholarGuide prompt to improve the readability and citation reliability of answers, and a consistency-weighted hybrid evaluator based on inter-model agreement for robust evaluation.
Result: On an expert-curated dataset, CAIRNS outperforms the baselines on most metrics. An ablation study confirms the contribution of each component, and the LLM-based evaluation correlates well with human judgment.
Conclusion: CAIRNS delivers readable, verifiable, and domain-grounded question answering, giving agricultural climate adaptation decisions an efficient and credible tool.
Abstract: Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or structured (heterogeneous climate data via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts -- farmer advisors -- to obtain credible preliminary answers from complex evidence sources from the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answers, we show that CAIRNS outperforms the baselines on most of the metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.
[14] HealthContradict: Evaluating Biomedical Knowledge Conflicts in Language Models
Boya Zhang,Alban Bornet,Rui Yang,Nan Liu,Douglas Teodoro
Main category: cs.CL
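TL;DR: This study builds the expert-verified HealthContradict dataset to evaluate how language models reason over long, conflicting biomedical contexts. It finds that fine-tuned biomedical language models draw on parametric knowledge and can also exploit correct context while resisting incorrect context.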
Details
Motivation: The study asks how language models use contextual information to answer health questions, and how their responses change when the contexts conflict.
Method: The authors build HealthContradict, a dataset of 920 instances, each consisting of a health question, a factual answer supported by scientific evidence, and two documents with contradictory stances. Several prompt settings (correct, incorrect, or contradictory context) are used to measure the effect on model outputs.
Result: Fine-tuned biomedical language models rely on parametric knowledge from pretraining and can also exploit correct context while resisting incorrect context. The dataset distinguishes models' contextual reasoning capabilities more clearly than existing benchmarks.
Conclusion: HealthContradict is an effective way to evaluate how language models reason over complex, contradictory medical text, and it highlights a key strength of current models in combining external evidence with internal knowledge.
Abstract: How do language models use contextual information to answer health questions? How are their responses impacted by conflicting contexts? We assess the ability of language models to reason over long, conflicting biomedical contexts using HealthContradict, an expert-verified dataset comprising 920 unique instances, each consisting of a health-related question, a factual answer supported by scientific evidence, and two documents presenting contradictory stances. We consider several prompt settings, including correct, incorrect or contradictory context, and measure their impact on model outputs. Compared to existing medical question-answering evaluation benchmarks, HealthContradict provides greater distinctions of language models' contextual reasoning capabilities. Our experiments show that the strength of fine-tuned biomedical language models lies not only in their parametric knowledge from pretraining, but also in their ability to exploit correct context while resisting incorrect context.
[15] When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers
Jack Lu,Ryan Teehan,Jinran Jin,Mengye Ren
Main category: cs.CL
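TL;DR: This paper systematically studies the interaction between solvers and verifiers in large language models. It finds that cross-family verification is more effective, that post-training strengthens cross-family improvement, and that mathematical and logical tasks have the highest verifiability.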
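The verifier-based rejection sampling mentioned in the abstract can be sketched as a simple loop: draw candidates from a solver and return the first one the verifier accepts. The sketch below is illustrative only. The names `rejection_sample`, `make_toy_solver`, and `toy_verifier` are hypothetical, the toy solver and the perfectly accurate toy verifier stand in for real LLM calls, and the fallback of returning the last candidate when nothing is accepted is an assumed design choice.

```python
from itertools import cycle

def rejection_sample(solver, verifier, problem, max_attempts=8):
    """Return the first candidate the verifier accepts.

    If no candidate is accepted within max_attempts, fall back to the
    last candidate drawn (an assumed policy, not taken from the paper).
    """
    candidate = None
    for _ in range(max_attempts):
        candidate = solver(problem)
        if verifier(problem, candidate):
            return candidate
    return candidate

def make_toy_solver(offsets):
    """Toy solver for addition problems that errs by a cycling offset."""
    errors = cycle(offsets)
    def solver(problem):
        a, b = problem
        return a + b + next(errors)
    return solver

def toy_verifier(problem, answer):
    """Toy verifier that checks the sum exactly (real verifiers are noisy)."""
    a, b = problem
    return answer == a + b

accepted = rejection_sample(make_toy_solver([1, -1, 0]), toy_verifier, (2, 3))
fallback = rejection_sample(make_toy_solver([1, -1, 0]), toy_verifier, (2, 3), max_attempts=2)
```

Here the solver proposes 6, then 4, then 5, so `accepted` is the third candidate, while the run capped at two attempts falls back to the last rejected candidate. With a real, imperfect verifier, false positives would let wrong answers through, which is one reason the false positive rate matters in this setting.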
Details
Motivation: Prior work has focused mostly on self-verification and lacks a systematic analysis of verification across model families. The effect of post-training on verification ability is also unclear.
Method: Experiments cover 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, mathematics, commonsense, and other domains. The study compares self-verification, same-family verification, and cross-family verification, and introduces the "verifier gain" metric to predict the improvement from verifier-based sampling.
Result: Cross-family verification works best. Post-training weakens self-improvement but strengthens cross-family improvement. Mathematical and logical tasks are the easiest to verify, and verifier gain and false positive rate vary systematically with model size and post-training.
Conclusion: The effectiveness of a verification strategy depends strongly on model family, training stage, and task type. Cross-family verification is an effective route to better performance, and tasks differ inherently in how verifiable they are.
Abstract: Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.
[16] Memory-Augmented Knowledge Fusion with Safety-Aware Decoding for Domain-Adaptive Question Answering
Lei Fu,Xiang Chen,Kaige Gao,Xinyue Huang,Kejian Tong
Main category: cs.CL
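TL;DR: This paper proposes the KARMA framework, which improves the accuracy and safety of question answering systems in service scenarios by fusing structured and unstructured knowledge sources, dynamically regulating knowledge integration, and applying safety-aware controllable decoding.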
Details
Motivation: Existing large language models struggle with factual consistency and context alignment in sensitive domains such as healthcare policies and government welfare, and so fall short of the accuracy and safety that service-oriented QA systems demand.
Method: KARMA uses a dual-encoder architecture to fuse heterogeneous knowledge, a gated memory unit to dynamically regulate the integration of external knowledge, and a safety-aware controllable decoder that combines safety classification with guided generation to suppress unsafe outputs.
Result: Experiments on a proprietary QA dataset show that KARMA outperforms strong baselines in both answer quality and safety.
Conclusion: KARMA offers a complete solution for trustworthy, adaptive QA systems in service scenarios and balances performance against safety requirements.
Abstract: Domain-specific question answering (QA) systems for services face unique challenges in integrating heterogeneous knowledge sources while ensuring both accuracy and safety. Existing large language models often struggle with factual consistency and context alignment in sensitive domains such as healthcare policies and government welfare. In this work, we introduce Knowledge-Aware Reasoning and Memory-Augmented Adaptation (KARMA), a novel framework designed to enhance QA performance in care scenarios. KARMA incorporates a dual-encoder architecture to fuse structured and unstructured knowledge sources, a gated memory unit to dynamically regulate external knowledge integration, and a safety-aware controllable decoder that mitigates unsafe outputs using safety classification and guided generation techniques. Extensive experiments on a proprietary QA dataset demonstrate that KARMA outperforms strong baselines in both answer quality and safety. This study offers a comprehensive solution for building trustworthy and adaptive QA systems in service contexts.
[17] TaleFrame: An Interactive Story Generation System with Fine-Grained Control and Large Language Models
Yunchao Wang,Guodao Sun,Zihang Fu,Zhehao Liu,Kaixing Du,Haidong Gao,Ronghua Liang
Main category: cs.CL
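TL;DR: TaleFrame is a creative story generation system that combines large language models with human-computer interaction, using structured information to give fine-grained control over the story generation process.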
Details
Motivation: Existing story generation systems have trouble turning user intent into satisfactory story output because they lack fine-grained control and clear input specifications.
Method: The story structure is decomposed into four basic units: entities, events, relationships, and story outline. A preference dataset is built from the Tinystories dataset and used to fine-tune a local Llama model, the JSON2Story approach converts structured data into coherent stories, and an intuitive interface supports interactions such as drag-and-drop.
Result: The authors build a preference dataset of 9,851 JSON-formatted entries and achieve effective conversion from structured input to natural-language stories. Users control story details and progression through the interactive interface, and generated stories can be evaluated and refined along seven dimensions.
Conclusion: The experiments show that TaleFrame improves users' control over and satisfaction with generated stories, supporting the value of structured input and human-machine collaboration in creative writing.
Abstract: With the advancement of natural language generation (NLG) technologies, creative story generation systems have gained increasing attention. However, current systems often fail to accurately translate user intent into satisfactory story outputs due to a lack of fine-grained control and unclear input specifications, limiting their applicability. To address this, we propose TaleFrame, a system that combines large language models (LLMs) with human-computer interaction (HCI) to generate stories through structured information, enabling precise control over the generation process. The innovation of TaleFrame lies in decomposing the story structure into four basic units: entities, events, relationships, and story outline. We leverage the Tinystories dataset, parsing and constructing a preference dataset consisting of 9,851 JSON-formatted entries, which is then used to fine-tune a local Llama model. By employing this JSON2Story approach, structured data is transformed into coherent stories. TaleFrame also offers an intuitive interface that supports users in creating and editing entities and events and generates stories through the structured framework. Users can control these units through simple interactions (e.g., drag-and-drop, attach, and connect), thus influencing the details and progression of the story. The generated stories can be evaluated across seven dimensions (e.g., creativity, structural integrity), with the system providing suggestions for refinement based on these evaluations. Users can iteratively adjust the story until a satisfactory result is achieved. Finally, we conduct quantitative evaluation and user studies that demonstrate the usefulness of TaleFrame. Dataset available at https://huggingface.co/datasets/guodaosun/tale-frame.
[18] A Concise Review of Hallucinations in LLMs and their Mitigation
Parth Pulkundwar,Vivek Dhanawade,Rohit Yadav,Minal Sonkar,Medha Asurlekar,Sarita Rathod
Main category: cs.CL
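TL;DR: This paper surveys hallucinations in language models, covering their types, causes, and mitigation methods, with the aim of giving a concise overview of this key challenge in natural language processing.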
Details
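Motivation: To understand the kinds and origins of hallucinations in current language models and to explore ways of reducing them, improving the reliability and safety of natural language processing.
Method: A synthesis of the existing literature that summarizes the different types of hallucination, their causes, and existing mitigation strategies.
Result: A comprehensive yet concise overview of hallucination in language models, covering the main types, causes, and countermeasures.
Conclusion: Hallucination is a major issue affecting the trustworthiness of language models; systematic understanding and appropriate mitigation techniques can reduce its negative impact to some extent.
Abstract: Traditional language models face a challenge from hallucinations. Their very presence casts a large, dangerous shadow over the promising realm of natural language processing. It becomes crucial to understand the various kinds of hallucinations that occur nowadays, their origins, and ways of reducing them. This document provides a concise and straightforward summary of that. It serves as a one-stop resource for a general understanding of hallucinations and how to mitigate them.
[19] What Signals Really Matter for Misinformation Tasks? Evaluating Fake-News Detection and Virality Prediction under Real-World Constraints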
Francesco Paolo Savatteri,Chahan Vidal-Gorène,Florian Cafiero
Main category: cs.CL
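TL;DR: This paper studies two practical online-misinformation tasks, fake-news detection and virality prediction, comparing text embeddings (such as RoBERTa) against lightweight numeric features and sequence models. Textual content is a strong signal for fake-news detection, whereas virality prediction is harder and sensitive to label construction.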
Details
Motivation: To assess how practical different methods are for fake-news detection and virality prediction in operational settings that require rapid reaction, and to examine the limitations of existing datasets and evaluation designs.
Method: Using the EVONS and FakeNewsNet datasets, the study compares RoBERTa and Mistral text embeddings, lightweight numeric features (such as timing and follower counts), and sequence models including GRU, gating architectures, and Transformer encoders; it applies t-SNE and PCA for dimensionality-reduction analysis and evaluates how different label splits affect the results.
Result: Textual content performs well for fake-news detection, and numeric features remain viable when compute is constrained; virality prediction is harder and sensitive to label definition (e.g., a median-based split at 50 likes) and to time censoring; non-linear structure (t-SNE) captures virality better than linear structure (PCA); swapping in Mistral embeddings changes results only modestly.
Conclusion: Fake-news detection can rely on strong text models, while virality prediction needs more careful label design and realistic handling of time windows; the study stresses the importance of reproducible evaluation and metric selection and calls for easing API limits to support practical use.
Abstract: We present an evaluation-driven study of two practical tasks regarding online misinformation: (i) fake-news detection and (ii) virality prediction in the context of operational settings, with the necessity for rapid reaction. Using the EVONS and FakeNewsNet datasets, we compare textual embeddings (RoBERTa; with a control using Mistral) against lightweight numeric features (timing, follower counts, verification, likes) and sequence models (GRU, gating architectures, Transformer encoders). We show that textual content alone is a strong discriminator for fake-news detection, while numeric-only pipelines remain viable when language models are unavailable or compute is constrained. Virality prediction is markedly harder than fake-news detection and is highly sensitive to label construction; in our setup, a median-based "viral" split (<50 likes) is pragmatic but underestimates real-world virality, and time-censoring for engagement features is desirable yet difficult under current API limits. Dimensionality-reduction analyses suggest non-linear structure is more informative for virality than for fake-news detection (t-SNE > PCA on numeric features). Swapping RoBERTa for Mistral embeddings yields only modest deltas, leaving conclusions unchanged. We discuss implications for evaluation design and report reproducibility constraints that realistically affect the field. We release splits and code where possible and provide guidance for metric selection.
[20] ADORE: Autonomous Domain-Oriented Relevance Engine for E-commerce
Zheng Fang,Donghao Xie,Ming Pang,Chunyuan Yuan,Xue Jiang,Changping Peng,Zhangang Lin,Zheng Luo
Main category: cs.CL
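TL;DR: This paper proposes the ADORE framework, which combines rule-aware relevance discrimination, error-type-aware data synthesis, and key-attribute-enhanced knowledge distillation to address the semantic gap and data scarcity in e-commerce search relevance modeling.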
Details
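Motivation: Existing e-commerce search relevance models are limited by the semantic gap of traditional term-matching methods and by neural models' lack of in-domain hard samples, which makes it difficult to capture user intent fully and to improve robustness.
Method: The ADORE framework has three modules: (1) a rule-aware relevance discrimination module, in which a Chain-of-Thought LLM generates intent-aligned data that is then aligned with user behavior through KTO; (2) an error-type-aware data synthesis module that automatically generates adversarial examples to strengthen robustness; and (3) a key-attribute-enhanced knowledge distillation module that injects domain attribute hierarchies into a lightweight student model.
Result: Large-scale offline experiments and online A/B tests verify ADORE's effectiveness; it significantly outperforms baseline models on relevance ranking while automating annotation, adversarial generation, and distillation, improving inference efficiency and deployability.
Conclusion: ADORE offers a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial settings and addresses the problems of data scarcity and semantic understanding.
Abstract: Relevance modeling in e-commerce search remains challenged by semantic gaps in term-matching methods (e.g., BM25) and neural models' reliance on the scarcity of domain-specific hard samples. We propose ADORE, a self-sustaining framework that synergizes three innovations: (1) A Rule-aware Relevance Discrimination module, where a Chain-of-Thought LLM generates intent-aligned training data, refined via Kahneman-Tversky Optimization (KTO) to align with user behavior; (2) An Error-type-aware Data Synthesis module that auto-generates adversarial examples to harden robustness; and (3) A Key-attribute-enhanced Knowledge Distillation module that injects domain-specific attribute hierarchies into a deployable student model. ADORE automates annotation, adversarial generation, and distillation, overcoming data scarcity while enhancing reasoning. Large-scale experiments and online A/B testing verify the effectiveness of ADORE. The framework establishes a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.
[21] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models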
DeepSeek-AI,Aixin Liu,Aoxue Mei,Bangcai Lin,Bing Xue,Bingxuan Wang,Bingzheng Xu,Bochao Wu,Bowei Zhang,Chaofan Lin,Chen Dong,Chengda Lu,Chenggang Zhao,Chengqi Deng,Chenhao Xu,Chong Ruan,Damai Dai,Daya Guo,Dejian Yang,Deli Chen,Erhang Li,Fangqi Zhou,Fangyun Lin,Fucong Dai,Guangbo Hao,Guanting Chen,Guowei Li,H. Zhang,Hanwei Xu,Hao Li,Haofen Liang,Haoran Wei,Haowei Zhang,Haowen Luo,Haozhe Ji,Honghui Ding,Hongxuan Tang,Huanqi Cao,Huazuo Gao,Hui Qu,Hui Zeng,Jialiang Huang,Jiashi Li,Jiaxin Xu,Jiewen Hu,Jingchang Chen,Jingting Xiang,Jingyang Yuan,Jingyuan Cheng,Jinhua Zhu,Jun Ran,Junguang Jiang,Junjie Qiu,Junlong Li,Junxiao Song,Kai Dong,Kaige Gao,Kang Guan,Kexin Huang,Kexing Zhou,Kezhao Huang,Kuai Yu,Lean Wang,Lecong Zhang,Lei Wang,Liang Zhao,Liangsheng Yin,Lihua Guo,Lingxiao Luo,Linwang Ma,Litong Wang,Liyue Zhang,M. S. Di,M. Y Xu,Mingchuan Zhang,Minghua Zhang,Minghui Tang,Mingxu Zhou,Panpan Huang,Peixin Cong,Peiyi Wang,Qiancheng Wang,Qihao Zhu,Qingyang Li,Qinyu Chen,Qiushi Du,Ruiling Xu,Ruiqi Ge,Ruisong Zhang,Ruizhe Pan,Runji Wang,Runqiu Yin,Runxin Xu,Ruomeng Shen,Ruoyu Zhang,S. H. Liu,Shanghao Lu,Shangyan Zhou,Shanhuang Chen,Shaofei Cai,Shaoyuan Chen,Shengding Hu,Shengyu Liu,Shiqiang Hu,Shirong Ma,Shiyu Wang,Shuiping Yu,Shunfeng Zhou,Shuting Pan,Songyang Zhou,Tao Ni,Tao Yun,Tian Pei,Tian Ye,Tianyuan Yue,Wangding Zeng,Wen Liu,Wenfeng Liang,Wenjie Pang,Wenjing Luo,Wenjun Gao,Wentao Zhang,Xi Gao,Xiangwen Wang,Xiao Bi,Xiaodong Liu,Xiaohan Wang,Xiaokang Chen,Xiaokang Zhang,Xiaotao Nie,Xin Cheng,Xin Liu,Xin Xie,Xingchao Liu,Xingkai Yu,Xingyou Li,Xinyu Yang,Xinyuan Li,Xu Chen,Xuecheng Su,Xuehai Pan,Xuheng Lin,Xuwei Fu,Y. Q. 
Wang,Yang Zhang,Yanhong Xu,Yanru Ma,Yao Li,Yao Li,Yao Zhao,Yaofeng Sun,Yaohui Wang,Yi Qian,Yi Yu,Yichao Zhang,Yifan Ding,Yifan Shi,Yiliang Xiong,Ying He,Ying Zhou,Yinmin Zhong,Yishi Piao,Yisong Wang,Yixiao Chen,Yixuan Tan,Yixuan Wei,Yiyang Ma,Yiyuan Liu,Yonglun Yang,Yongqiang Guo,Yongtong Wu,Yu Wu,Yuan Cheng,Yuan Ou,Yuanfan Xu,Yuduan Wang,Yue Gong,Yuhan Wu,Yuheng Zou,Yukun Li,Yunfan Xiong,Yuxiang Luo,Yuxiang You,Yuxuan Liu,Yuyang Zhou,Z. F. Wu,Z. Z. Ren,Zehua Zhao,Zehui Ren,Zhangli Sha,Zhe Fu,Zhean Xu,Zhenda Xie,Zhengyan Zhang,Zhewen Hao,Zhibin Gou,Zhicheng Ma,Zhigang Yan,Zhihong Shao,Zhixian Huang,Zhiyu Wu,Zhuoshu Li,Zhuping Zhang,Zian Xu,Zihao Wang,Zihui Gu,Zijia Zhu,Zilin Li,Zipeng Zhang,Ziwei Xie,Ziyi Gao,Zizheng Pan,Zongqing Yao,Bei Feng,Hui Li,J. L. Cai,Jiaqi Ni,Lei Xu,Meng Li,Ning Tian,R. J. Chen,R. L. Jin,S. S. Li,Shuang Zhou,Tianyu Sun,X. Q. Li,Xiangyue Jin,Xiaojin Shen,Xiaosha Chen,Xinnan Song,Xinyi Zhou,Y. X. Zhu,Yanping Huang,Yaohui Li,Yi Zheng,Yuchen Zhu,Yunxian Ma,Zhen Huang,Zhipeng Xu,Zhongyu Zhang,Dongjie Ji,Jian Liang,Jianzhong Guo,Jin Chen,Leyi Xia,Miaojun Wang,Mingming Li,Peng Zhang,Ruyi Chen,Shangmian Sun,Shaoqing Wu,Shengfeng Ye,T. Wang,W. L. Xiao,Wei An,Xianzu Wang,Xiaowen Sun,Xiaoxiang Wang,Ying Tang,Yukun Zha,Zekai Zhang,Zhe Ju,Zhen Zhang,Zihua Qu
Main category: cs.CL
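TL;DR: DeepSeek-V3.2 is an efficient, high-performing model that excels at reasoning and agent tasks; its key innovations are a sparse attention mechanism, a scalable reinforcement learning framework, and a large-scale agentic task synthesis pipeline.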
Details
Motivation: To improve the balance between performance and computational efficiency for large models in long-context processing, complex reasoning, and tool-use scenarios.
Method: The work introduces DeepSeek Sparse Attention (DSA) and a scalable reinforcement learning framework, and builds a large-scale agentic task synthesis pipeline to generate training data.
Result: DeepSeek-V3.2 performs comparably to GPT-5; its high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5, reaches reasoning proficiency on par with Gemini-3.0-Pro, and achieves gold-medal performance in the 2025 IMO and IOI.
Conclusion: DeepSeek-V3.2 unifies computational efficiency with strong reasoning and agent capabilities, showing leading potential on highly complex tasks.
Abstract: We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.
[22] From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks
Changpeng Yang,Jinyang Wu,Yuchen Liu,Shuai Zhang,Yang Li,Qiliang Liang,Hongzhen Wang,Shuai Nie,Jiaming Xu,Runyu Shi,Ying Huang,Guoquan Zhang
Main category: cs.CL
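TL;DR: This paper proposes CAPO, an adaptive curriculum optimization mechanism that separates positive and negative advantage signals and trains large language models in stages: positive signals first build a foundation, then negative signals are introduced to sharpen discrimination. It yields stable, significant gains on mathematical and multimodal GUI reasoning tasks.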
Details
Motivation: Existing reinforcement learning methods mix positive and negative advantage signals when training large language models, which can give ambiguous guidance early in training and limit gains. A more effective way of using these signals is needed.
Method: CAPO (Curriculum Advantage Policy Optimization) applies an adaptive curriculum based on advantage signals: early training uses only positive-advantage samples for imitation learning to build a robust foundation, and later stages gradually introduce negative-advantage samples to strengthen discrimination. It is compatible with several optimization methods, including GRPO, PPO, RLOO, and Reinforce++.
Result: The method achieves stable and significant improvements on mathematical reasoning tasks and generalizes to multimodal graphical user interface (GUI) reasoning, supporting its generality and robustness.
Conclusion: By using advantage signals in stages, CAPO provides clear training guidance and improves generalization on complex tasks, making it a general and robust optimization framework.
Abstract: Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose CAPO (Curriculum Advantage Policy Optimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.
[23] Spoken Conversational Agents with Large Language Models
Chao-Han Huck Yang,Andreas Stolcke,Larry Heck
Main category: cs.CL
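TL;DR: This tutorial traces the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded voice-native large language model (LLM) systems. It covers audio adaptation, cross-modal alignment, multimodal training, datasets, and evaluation metrics, compares system design choices, and offers practical recipes and future research directions.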
Details
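Motivation: As spoken dialogue systems evolve toward voice-native large language models, the technical paths and design trade-offs need to be laid out systematically to bring industry and academia together and advance standardization.
Method: The tutorial reviews methods for adapting text LLMs to speech, including end-to-end architectures, cascaded systems, post-ASR correction, and streaming; analyzes cross-modal alignment and joint training strategies; and summarizes existing datasets, evaluation metrics, and robustness across accents.
Result: It provides reproducible baselines, system design guidance, and links between industrial assistants and open-domain and task-oriented agents, and identifies open problems in privacy, safety, and evaluation.
Conclusion: Spoken dialogue systems are moving toward integrated, multimodal end-to-end architectures; further progress is needed in robustness, privacy protection, and standardized evaluation.
Abstract: Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.
[24] Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization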
Jing Ma
Main category: cs.CL
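TL;DR: When generating summaries, large language models show a semantic preference for the first input document, a significant primacy effect that may undermine their reliability in multi-document summarization and agentic AI systems.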
Details
Motivation: To examine whether large language models treat all inputs equally when processing multiple long documents, and in particular whether order bias appears when generating neutral summaries on sensitive topics such as abortion news.
Method: The study builds 40 pro-neutral-con article triplets, permutes each into six input orders, and prompts Gemini 2.5 Flash to generate a neutral overview. ROUGE-L, BERTScore, and SummaC measure the lexical overlap, semantic similarity, and factual consistency between each summary and its source articles; one-way ANOVA and pairwise comparisons test for position effects.
Result: BERTScore shows a significant primacy effect: summaries are semantically closer to the first input article. Pairwise comparisons show that Position 1 differs significantly from the other two positions, while Positions 2 and 3 do not differ significantly from each other.
Conclusion: Large language models give more weight to the first document in multi-document summarization, which can bias outputs and poses risks for applications that rely on LLM-generated overviews and for agentic AI systems.
Abstract: Large language models (LLMs) are now used in settings such as Google's AI Overviews, where they summarize multiple long documents. However, it remains unclear whether they weight all inputs equally. Focusing on abortion-related news, we construct 40 pro-neutral-con article triplets, permute each triplet into six input orders, and prompt Gemini 2.5 Flash to generate a neutral overview. We evaluate each summary against its source articles using ROUGE-L (lexical overlap), BERTScore (semantic similarity), and SummaC (factual consistency). One-way ANOVA reveals a significant primacy effect for BERTScore across all stances, indicating that summaries are more semantically aligned with the first-seen article. Pairwise comparisons further show that Position 1 differs significantly from Positions 2 and 3, while the latter two do not differ from each other, confirming a selective preference for the first document. The findings present risks for applications that rely on LLM-generated overviews and for agentic AI systems, where the steps involving LLMs can disproportionately influence downstream actions.
[25] An Empirical Survey of Model Merging Algorithms for Social Bias Mitigation
Daiki Shirafuji,Tatsuhiko Saito,Yasutomo Kimura
Main category: cs.CL
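TL;DR: This paper studies how well seven model merging algorithms mitigate social bias in large language models. The methods reduce bias but can harm downstream task performance; SLERP at moderate interpolation weights is the most balanced.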
Details
Motivation: Large language models inherit and amplify the social biases in their training data, threatening fairness and social trust, so effective mitigation methods are needed.
Method: An empirical evaluation of seven model merging algorithms (Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap) using three bias datasets (BBQ, BOLD, and HONEST), with the impact on downstream task performance measured on the SuperGLUE benchmark.
Result: There is a trade-off between bias mitigation and downstream performance: stronger bias reduction lowers accuracy, especially on reading comprehension, commonsense, and causal reasoning tasks. Linear, SLERP, and Nearswap do well at reducing bias while maintaining performance, and SLERP at moderate interpolation weights is the most balanced.
Conclusion: Model merging algorithms show potential for mitigating social bias in large language models, but excessive debiasing or inappropriate merging methods can degrade important linguistic abilities, so methods and parameters must be chosen carefully.
Abstract: Large language models (LLMs) are known to inherit and even amplify societal biases present in their pre-training corpora, threatening fairness and social trust. To address this issue, recent work has explored "editing" LLM parameters to mitigate social bias with model merging approaches; however, there is no empirical comparison. In this work, we empirically survey seven algorithms: Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap, applying 13 open weight models in the GPT, LLaMA, and Qwen families. We perform a comprehensive evaluation using three bias datasets (BBQ, BOLD, and HONEST) and measure the impact of these techniques on LLM performance in downstream tasks of the SuperGLUE benchmark. We find a trade-off between bias reduction and downstream performance: methods achieving greater bias mitigation degrade accuracy, particularly on tasks requiring reading comprehension and commonsense and causal reasoning. Among the merging algorithms, Linear, SLERP, and Nearswap consistently reduce bias while maintaining overall performance, with SLERP at moderate interpolation weights emerging as the most balanced choice. These results highlight the potential of model merging algorithms for bias mitigation, while indicating that excessive debiasing or inappropriate merging methods may lead to the degradation of important linguistic abilities.
[26] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer
Lavish Bansal,Naman Mishra
Main category: cs.CL
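TL;DR: CREST is an efficient cross-lingual safety classification model that supports 100 languages. Trained on only 13 high-resource languages, it generalizes to low-resource languages through cluster-based cross-lingual transfer and matches or outperforms larger models on several safety benchmarks.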
Details
Motivation: Existing LLM content safety mechanisms mainly target high-resource languages and overlook the many users worldwide who speak low-resource languages; a scalable, universal, language-agnostic safety system is lacking.
Method: CREST uses a parameter-efficient multilingual safety classification architecture and a cluster-based strategy to transfer from 13 high-resource languages to 100 languages, addressing data scarcity in low-resource languages.
Result: Evaluation on six safety benchmarks shows that CREST outperforms state-of-the-art guardrails of comparable scale and is competitive with larger models (2.5B parameters and above), supporting its cross-lingual generalization.
Conclusion: Language-specific safety mechanisms have limitations; universal, language-agnostic safety systems should be developed, and CREST offers an efficient and practical approach to large-scale multilingual content safety.
Abstract: Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving a significant portion of the world's population underrepresented who communicate in low-resource languages. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.
[27] Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs
Julian Ma,Jun Wang,Zafeirios Fountas
Main category: cs.CL
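TL;DR: Drawing on the psychophysics paradigm, this study introduces the BayesBench benchmark and a Bayesian Consistency Score to assess whether large language models show human-like Bayesian reasoning in multimodal perception tasks. High accuracy does not imply a robust strategy, revealing a dissociation between model capability and computational strategy.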
Details
Motivation: To explore whether large language models, without explicit training, integrate multimodal information in a near-optimal Bayesian way as humans do, and to uncover their implicit computational strategies.
Method: The study builds BayesBench, a psychophysics-inspired benchmark of four magnitude estimation tasks (length, location, distance, and duration) over text and image modalities, conducts a systematic behavioural study of nine LLMs, and introduces a Bayesian Consistency Score to detect behavioural shifts under different noise, context, and prompt conditions.
Result: Although some models are highly accurate (for example, GPT-5 Mini reaches perfect accuracy on text tasks), they integrate visual cues inefficiently. Model behaviour often follows Bayesian trends, but accuracy does not guarantee a robust strategy, revealing a dissociation between capability and strategy.
Conclusion: Large language models show human-like principles for handling uncertainty, but accuracy-centric evaluations may hide brittle uncertainty handling. The study argues for attention to behavioural strategy as well as performance and provides new evaluation tools for future multimodal architecture design.
Abstract: Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (https://bayes-bench.github.io) as evaluation tools and to inform future multimodal architecture designs.
[28] SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys
Jiahao Zhao,Shuaixing Zhang,Nan Xu,Lei Wang
Main category: cs.CL
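TL;DR: SurveyEval is a comprehensive benchmark that evaluates automatically generated surveys along three dimensions (overall quality, outline coherence, and reference accuracy), with the aim of improving the evaluation and optimization of automatic survey systems.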
Details
Motivation: Existing LLM-based automatic survey systems lack effective evaluation methods, making the quality of their output hard to measure; a comprehensive, scalable benchmark is needed to drive improvement.
Method: The SurveyEval benchmark evaluates automatically generated surveys across 7 subjects along three dimensions: overall quality, outline coherence, and reference accuracy. It extends the LLM-as-a-Judge framework with human references to strengthen agreement with human evaluation.
Result: Experiments show that general long-text or paper-writing systems produce lower-quality surveys, while purpose-built survey-generation systems deliver substantially higher-quality results.
Conclusion: SurveyEval can serve as a scalable testbed for understanding and improving automatic survey systems across subjects and evaluation criteria.
Abstract: LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.
[29] PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
Robert Belanec,Ivan Srba,Maria Bielikova
Main category: cs.CL
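TL;DR: PEFT-Factory is a unified framework for parameter-efficient fine-tuning of large language models that integrates 19 PEFT methods and 27 datasets, improving replicability and benchmarking.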
Details
Motivation: Existing PEFT methods are hard to replicate, deploy, and compare, and a unified evaluation environment is lacking.
Method: PEFT-Factory is a modular framework that supports both off-the-shelf and custom PEFT methods and integrates a range of PEFT techniques, multi-task datasets, and standard and PEFT-specific evaluation metrics.
Result: It provides a stable, controlled environment with 19 PEFT methods, 27 datasets, and multiple evaluation metrics; it originates from LLaMA-Factory and is open source.
Conclusion: PEFT-Factory improves the replicability, comparability, and usability of PEFT methods, supporting research on and application of efficient fine-tuning for large models.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/kinit-sk/PEFT-Factory
[30] Towards Unification of Hallucination Detection and Fact Verification for Large Language Models
Weihang Su,Jianming Long,Changyue Wang,Shiyu Lin,Jingyan Xu,Ziyi Ye,Qingyao Ai,Yiqun Liu
Main category: cs.CL
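TL;DR: This paper proposes UniFact, a unified evaluation framework for directly comparing hallucination detection (HD) and fact verification (FV) in large language models. It shows that the two paradigms are complementary and that hybrid methods perform best.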
Details
Motivation: The HD and FV research paradigms have developed in isolation because of differing assumptions, datasets, and evaluation protocols, which hinders overall progress; a unified framework is needed to bridge the divide.
Method: The UniFact framework dynamically generates model outputs and corresponding factuality labels, enabling direct instance-level comparison between FV and HD, and is used for large-scale experiments across multiple LLM families and detection methods.
Result: The experiments find that (1) no single paradigm is consistently superior; (2) HD and FV capture different facets of factual errors; and (3) methods that integrate both perform best, reaching state-of-the-art results. The paper also analyzes why FV and HD diverged and provides empirical support for unifying them.
Conclusion: A new, integrated research agenda that combines HD and FV should be pursued to achieve more comprehensive and reliable factuality evaluation of large language models.
Abstract: Large Language Models (LLMs) frequently exhibit hallucinations, generating content that appears fluent and coherent but is factually incorrect. Such errors undermine trust and hinder their adoption in real-world applications. To address this challenge, two distinct research paradigms have emerged: model-centric Hallucination Detection (HD) and text-centric Fact Verification (FV). Despite sharing the same goal, these paradigms have evolved in isolation, using distinct assumptions, datasets, and evaluation protocols. This separation has created a research schism that hinders their collective progress. In this work, we take a decisive step toward bridging this divide. We introduce UniFact, a unified evaluation framework that enables direct, instance-level comparison between FV and HD by dynamically generating model outputs and corresponding factuality labels. Through large-scale experiments across multiple LLM families and detection methods, we reveal three key findings: (1) No paradigm is universally superior; (2) HD and FV capture complementary facets of factual errors; and (3) hybrid approaches that integrate both methods consistently achieve state-of-the-art performance. Beyond benchmarking, we provide the first in-depth analysis of why FV and HD diverged, as well as empirical evidence supporting the need for their unification. The comprehensive experimental results call for a new, integrated research agenda toward unifying Hallucination Detection and Fact Verification in LLMs. We have open-sourced all the code, data, and baseline implementation at: https://github.com/oneal2000/UniFact/
[31] Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
Juexi Shao,Siyou Li,Yujian Gan,Chris Madge,Vanja Karan,Massimo Poesio
Main category: cs.CL
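TL;DR: The paper proposes a three-tier data synthesis method to address domain distribution shift in dialogue-based generalized referring expression comprehension, and fine-tuning on the synthesized data substantially improves model performance.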
Details
Motivation: Existing systems perform poorly under distribution shift between training and evaluation domains, and annotated dialogue grounding data is scarce, which limits generalization.
Method: A three-tier data synthesis method that balances realism and controllability generates scalable supervision for dialogue-conditioned grounding, which is then used to fine-tune models.
Result: On standard evaluation metrics, models fine-tuned on the synthesized data achieve consistent, substantial improvements over prior approaches.
Conclusion: The data synthesis method mitigates the domain shift problem and provides a scalable solution for dialogue-based referring expression comprehension.
Abstract: Dialogue-Based Generalized Referring Expressions Comprehension (GREC) requires models to ground the expression and unlimited targets in complex visual scenes while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.
[32] TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages
Mike Nkongolo,Hilton Vorster,Josh Warren,Trevor Naick,Deandre Vanmali,Masana Mashapha,Luke Brand,Alyssa Fernandes,Janco Calitz,Sibusiso Makhoba
Main category: cs.CL
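TL;DR: This study proposes TriLex, a three-stage retrieval-augmented framework for expanding sentiment lexicons in low-resource African languages, and evaluates AfroXLMR and AfriBERTa on sentiment analysis in several South African languages; the results indicate that the framework is effective and scalable.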
Details
Motivation: Low-resource African languages are underrepresented in sentiment analysis, which limits lexical coverage and degrades the performance of multilingual NLP systems.
Method: The TriLex framework combines corpus-based extraction, cross-lingual mapping, and lexical refinement driven by retrieval-augmented generation (RAG) to expand sentiment lexicons systematically.
Result: AfroXLMR achieves F1-scores above 80% on isiXhosa and isiZulu and shows strong cross-lingual stability; AfriBERTa, although not pre-trained on the target languages, still reaches F1-scores of about 64%; both outperform traditional machine learning baselines, and ensemble analyses further improve precision and robustness.
Conclusion: TriLex is an effective, scalable framework for sentiment lexicon expansion and sentiment modeling in low-resource South African languages.
Abstract: Low-resource African languages remain underrepresented in sentiment analysis, limiting both lexical coverage and the performance of multilingual Natural Language Processing (NLP) systems. This study proposes TriLex, a three-stage retrieval augmented framework that unifies corpus-based extraction, cross lingual mapping, and retrieval augmented generation (RAG) driven lexical refinement to systematically expand sentiment lexicons for low-resource languages. Using the enriched lexicon, the performance of two prominent African pretrained language models (AfroXLMR and AfriBERTa) is evaluated across multiple case studies. Results demonstrate that AfroXLMR delivers superior performance, achieving F1-scores above 80% for isiXhosa and isiZulu and exhibiting strong cross-lingual stability. Although AfriBERTa lacks pre-training on these target languages, it still achieves reliable F1-scores around 64%, validating its utility in computationally constrained settings. Both models outperform traditional machine learning baselines, and ensemble analyses further enhance precision and robustness. The findings establish TriLex as a scalable and effective framework for multilingual sentiment lexicon expansion and sentiment modeling in low-resource South African languages.
[33] SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment
Yixuan Tang,Yi Yang
Main category: cs.CL
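TL;DR: The paper proposes SR-GRPO, an alignment method that needs no external supervision and uses the stable rank of the model's internal representations as a quality signal, substantially improving large-model performance on STEM and mathematical reasoning tasks without annotations.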
Details
Motivation: Existing alignment methods based on human preferences depend on external supervision and suffer from scarce annotations, reward hacking, and self-evaluation bias; a stable quality measure that requires no external intervention is needed.
Method: Stable rank is proposed as an intrinsic quality signal that measures the effective dimensionality of hidden states. It is used in Best-of-N sampling and in the reinforcement learning algorithm SR-GRPO, replacing a conventional reward model for policy optimization.
Result: Stable rank reaches 84.04% accuracy on RewardBench; SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM tasks and 19% on mathematical reasoning, outperforming existing self-evaluation and reward-model methods.
Conclusion: The geometry of a model's internal representations can provide a reliable intrinsic quality signal, and stable rank offers a feasible path toward large-model alignment without external supervision.
Abstract: Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
[34] A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models
Kunning Li,Jianbin Guo,Zhaoyang Shang,Yiqing Liu,Hongmin Du,Lingling Liu,Yuping Zhao,Lifeng Dong
Main category: cs.CL
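TL;DR: This paper proposes TCM-BEST4SDT, a comprehensive clinical case-based benchmark for "Syndrome Differentiation and Treatment" in Traditional Chinese Medicine, comprising four tasks and three evaluation mechanisms, and validates it through experiments on 15 mainstream large language models.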
Details
Motivation: Because TCM Syndrome Differentiation and Treatment is individualized, holistic, and diverse, and existing evaluations are mostly limited to knowledge question-answering or the accuracy of syndrome diagnosis without assessing treatment decision-making, a comprehensive and specialized benchmark is needed.
Method: Led by TCM experts, the work builds TCM-BEST4SDT, a comprehensive clinical case-based benchmark covering four tasks: TCM Basic Knowledge, Medical Ethics, Content Safety, and Syndrome Differentiation and Treatment. It uses three mechanisms (selected-response evaluation, judge model evaluation, and evaluation by a specially trained reward model) and a rigorous data annotation pipeline to measure prescription-syndrome congruence.
Result: Experiments on 15 mainstream large language models support the effectiveness of TCM-BEST4SDT, which assesses models' performance in TCM clinical applications more comprehensively, particularly their treatment decision-making.
Conclusion: TCM-BEST4SDT provides a reliable tool for evaluating the clinical application capabilities of large language models in TCM and supports the development of intelligent TCM research.
Abstract: The emergence of Large Language Models (LLMs) within the Traditional Chinese Medicine (TCM) domain presents an urgent need to assess their clinical application capabilities. However, such evaluations are challenged by the individualized, holistic, and diverse nature of TCM's "Syndrome Differentiation and Treatment" (SDT). Existing benchmarks are confined to knowledge-based question-answering or the accuracy of syndrome differentiation, often neglecting assessment of treatment decision-making. Here, we propose a comprehensive, clinical case-based benchmark spearheaded by TCM experts, and a specialized reward model employed to quantify prescription-syndrome congruence. Data annotation follows a rigorous pipeline. This benchmark, designated TCM-BEST4SDT, encompasses four tasks, including TCM Basic Knowledge, Medical Ethics, LLM Content Safety, and SDT. The evaluation framework integrates three mechanisms, namely selected-response evaluation, judge model evaluation, and reward model evaluation. The effectiveness of TCM-BEST4SDT was corroborated through experiments on 15 mainstream LLMs, spanning both general and TCM domains. To foster the development of intelligent TCM research, TCM-BEST4SDT is now publicly available.
[35] BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion
Sai Koneru,Fabian Retkowski,Christian Huber,Lukas Hilgert,Seymanur Akti,Enes Yavuz Ugan,Alexander Waibel,Jan Niehues
Main category: cs.CL
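TL;DR: BOOM is a multimodal, multilingual lecture companion that jointly translates lecture audio and slides, producing synchronized text, localized slides, and synthesized speech to improve cross-lingual learning.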
Details
Motivation: With the globalization of education and the growth of online learning, localizing educational content has become a key challenge. Conventional translation methods struggle to preserve the multimodal nature of lectures (audio plus visual slides), which hurts learning, so a system is needed that handles multiple input modalities while keeping the content complete. Method: Proposes BOOM, an end-to-end system that jointly translates lecture audio and slides into three synchronized outputs: translated text, localized slides with preserved visual elements, and synthesized speech. Slide-aware transcripts are used to improve downstream tasks such as summarization and question answering. Result: Experiments show that BOOM produces multimodal translation outputs effectively and that slide-aware transcripts bring cascading gains for downstream tasks such as summarization and question answering. The code is open source. Conclusion: Through joint multimodal translation, BOOM lets students access full lecture content in their native language while preserving the original information, improving the accessibility and completeness of multilingual learning. Abstract: The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at https://github.com/saikoneru/image-translator and integrate it in Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.
[36] promptolution: A Unified, Modular Framework for Prompt Optimization
Tom Zehle,Timo Heiß,Moritz Schlager,Matthias Aßenmacher,Matthias Feurer
Main category: cs.CL
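TL;DR: This paper introduces promptolution, a unified, modular open-source framework that addresses the problem of existing prompt optimization implementations being isolated and hard to maintain.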
Details
Motivation: Existing prompt optimization implementations usually depend on isolated, unmaintained research codebases, which limits practical use. Method: Develops a unified framework that integrates multiple discrete prompt optimizers, is decoupled from the underlying LLM implementation, and supports extension. Result: Provides an extensible, modular system that makes it easier for researchers and practitioners to use and improve prompt optimization techniques. Conclusion: promptolution lowers the barrier to using prompt optimization and promotes its adoption in practical settings. Abstract: Prompt optimization has become crucial for enhancing the performance of large language models (LLMs) across a broad range of tasks. Although many research papers show its effectiveness, practical adoption is hindered as existing implementations are often tied to unmaintained and isolated research codebases. To address this, we introduce promptolution, a unified and modular open-source framework that provides all components required for prompt optimization within a single extensible system for both practitioners and researchers. It integrates multiple contemporary discrete prompt optimizers while remaining agnostic to the underlying LLM implementation.
[37] Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages
Lechen Zhang,Yusheng Zhou,Tolga Ergen,Lajanugen Logeswaran,Moontae Lee,David Jurgens
Main category: cs.CL
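TL;DR: This paper studies the role of system prompts in multilingual settings, proposes a unified four-dimensional evaluation framework, and finds through large-scale experiments that certain prompt components correlate with robust multilingual behavior. The authors also develop a multilingual prompt optimization framework that automatically discovers prompts improving performance by 5-10%, and analyze how effective prompts induce more structured and consistent reasoning patterns.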
Details
Motivation: Real-world deployments need a single prompt that works reliably across many languages, but prior work focuses mainly on English and lacks effective methods for evaluating and optimizing cross-lingual system prompts. Method: Proposes a four-dimensional evaluation framework, analyzes the effect of different prompt components through large-scale experiments on five languages, three LLMs, and three benchmarks, and develops a multilingual prompt optimization framework. Result: Prompt components such as CoT, emotion, and scenario correlate with robust multilingual behavior; the optimization framework automatically discovers prompts that improve all metrics by 5-10%; an analysis of more than 10 million reasoning units shows that effective prompts induce more structured, consistent reasoning and reduce unnecessary language-switching. Conclusion: System prompt optimization is a scalable path to accurate and robust multilingual LLM behavior. Abstract: System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
[38] Bangla Hate Speech Classification with Fine-tuned Transformer Models
Yalda Keivan Jafari,Krishno Dey
Main category: cs.CL
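TL;DR: This study addresses hate speech detection in Bangla, a low-resource language, comparing several traditional and Transformer-based models and finding that BanglaBERT, which is pre-trained on the target language, performs best.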
Details
Motivation: Hate speech recognition in low-resource languages remains difficult because of insufficient datasets, orthographic heterogeneity, and linguistic variety. Bangla has many speakers but few computational resources, so effective automated detection is needed. Method: The study reproduces and extends the official baselines (e.g., Majority, Random, SVM), adds Logistic Regression, Random Forest, and Decision Tree as further baselines, and compares Transformer models including DistilBERT, BanglaBERT, m-BERT, and XLM-RoBERTa. Result: All Transformer-based models except DistilBERT outperform the traditional baselines; BanglaBERT is best on both subtasks and beats the larger m-BERT and XLM-RoBERTa. Conclusion: Language-specific pre-training (here, for Bangla) is crucial to model performance, underscoring the potential of and need for dedicated pre-trained models for low-resource languages. Abstract: Hate speech recognition in low-resource languages remains a difficult problem due to insufficient datasets, orthographic heterogeneity, and linguistic variety. Bangla is spoken by more than 230 million people in Bangladesh and India (West Bengal). Despite the growing need for automated moderation on social media platforms, Bangla is significantly under-represented in computational resources. In this work, we study Subtask 1A and Subtask 1B of the BLP 2025 Shared Task on hate speech detection. We reproduce the official baselines (e.g., Majority, Random, Support Vector Machine) and also evaluate Logistic Regression, Random Forest, and Decision Tree as baseline methods. We also utilized transformer-based models such as DistilBERT, BanglaBERT, m-BERT, and XLM-RoBERTa for hate speech classification. All the transformer-based models outperformed baseline methods for the subtasks, except for DistilBERT. Among the transformer-based models, BanglaBERT produces the best performance for both subtasks. Despite being smaller in size, BanglaBERT outperforms both m-BERT and XLM-RoBERTa, which suggests language-specific pre-training is very important. Our results highlight the potential and need for pre-trained language models for the low-resource Bangla language.
[39] Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning
Haonan Wang,Chao Du,Kenji Kawaguchi,Tianyu Pang
Main category: cs.CL
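TL;DR: ThinkMerge is a training-free decoding strategy that averages the next-token logits of multiple parallel reasoning traces at synchronization points, extending parallel test-time scaling to open-ended reasoning tasks such as code generation and web-based deep research.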
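The logit-averaging step named in this TL;DR can be illustrated with a toy sketch. It assumes each of K parallel traces exposes a next-token logit vector at a synchronization point. The function names (`average_logits`, `merged_next_token`) and the greedy argmax are illustrative choices, not the authors' implementation, and the vLLM/SGLang integration is not shown.

```python
from typing import List, Sequence


def average_logits(trace_logits: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of next-token logits from K parallel traces."""
    if not trace_logits:
        raise ValueError("need at least one trace")
    vocab_size = len(trace_logits[0])
    if any(len(logits) != vocab_size for logits in trace_logits):
        raise ValueError("all traces must share one vocabulary size")
    k = len(trace_logits)
    return [sum(logits[i] for logits in trace_logits) / k for i in range(vocab_size)]


def merged_next_token(trace_logits: Sequence[Sequence[float]]) -> int:
    """Pick one shared token (greedy argmax, an assumption) from the averaged logits."""
    avg = average_logits(trace_logits)
    return max(range(len(avg)), key=avg.__getitem__)


# Three hypothetical traces over a 4-token vocabulary.
logits = [
    [2.0, 1.0, 0.0, -1.0],
    [0.5, 3.0, 0.0, -1.0],
    [0.0, 2.5, 1.0, -1.0],
]
token = merged_next_token(logits)
```

In this toy case the first trace alone would pick token 0, but the averaged logits favor token 1, so all traces continue from the same shared token. Any sampler (Top-p, Top-k) could replace the argmax.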
Details
Motivation: Majority voting works for close-ended question answering but is hard to apply to open-ended reasoning tasks (such as code generation and web-based deep research), because a "majority" over complete solutions cannot be defined. A new method is needed to improve performance on such tasks. Method: Proposes ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output; it is compatible with vLLM/SGLang and with standard decoding techniques such as Top-p/Top-k. Result: ThinkMerge matches or surpasses majority voting on AIME and GPQA, clearly improves pass@1 on open-ended coding tasks such as LiveCodeBench (+8.28% for DeepCoder-14B, +7.58% for Qwen3-8B), and improves deep-research agents such as WebSailor on GAIA, BrowseComp-en/zh, and XbenchDeepSearch. Conclusion: ThinkMerge shows that parallel test-time scaling can improve open-ended reasoning without voting over complete outputs, and that it is broadly applicable and practical. Abstract: Majority voting has proven effective for close-ended question answering by aggregating parallel reasoning traces. However, it is not directly applicable to open-ended reasoning, such as code generation and web-based deep research, where a "majority" over complete solutions is ill-defined. We introduce ThinkMerge, a training-free, plug-and-play decoding strategy that runs K parallel reasoning traces and averages their next-token logits at synchronization points to produce a single coherent output. ThinkMerge integrates seamlessly with vLLM/SGLang and remains compatible with standard decoding techniques such as Top-p/Top-k. Empirically, it matches or surpasses majority voting on AIME and GPQA, while delivering consistent gains on open-ended coding tasks: on LiveCodeBench (hard), pass@1 improves by +8.28% for DeepCoder-14B-Preview and +7.58% for Qwen3-8B. Beyond code, we further show that ThinkMerge improves web-based deep-research agents (e.g., WebSailor-7B/32B) across GAIA, BrowseComp-en/zh, and XbenchDeepSearch. These results demonstrate that parallel test-time scaling can benefit open-ended reasoning without relying on voting over complete outputs.
[40] Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules
Amr Mohamed,Yang Zhang,Michalis Vazirgiannis,Guokan Shang
Main category: cs.CL
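TL;DR: SchED is a training-free, model-agnostic early-exit algorithm that accelerates decoding for diffusion large language models (dLLMs), delivering substantial speedups while retaining nearly full performance.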
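The abstract for this entry describes the early-exit rule as aggregating full-span logit margins and halting once a smooth, progress-dependent confidence threshold is met. A minimal sketch of that idea follows. The mean aggregation, the cosine-shaped threshold, and the `high`/`low` values are all assumptions chosen for illustration; the paper's actual schedule and aggregation may differ.

```python
import math
from typing import Sequence


def logit_margin(logits: Sequence[float]) -> float:
    """Gap between the largest and second-largest logit at one position."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def aggregate_margin(span_logits: Sequence[Sequence[float]]) -> float:
    """Mean margin over every position in the span (assumed aggregation)."""
    return sum(logit_margin(pos) for pos in span_logits) / len(span_logits)


def threshold(progress: float, high: float = 8.0, low: float = 2.0) -> float:
    """Cosine decay from `high` at progress 0 to `low` at progress 1 (assumed shape)."""
    progress = min(max(progress, 0.0), 1.0)
    return low + 0.5 * (high - low) * (1.0 + math.cos(math.pi * progress))


def should_exit(span_logits: Sequence[Sequence[float]], step: int, max_steps: int) -> bool:
    """Halt once the aggregated margin clears the progress-dependent threshold."""
    return aggregate_margin(span_logits) >= threshold(step / max_steps)


# Hypothetical two-position span: margins are 4.0 and 3.5, so the mean is 3.75.
span = [[5.0, 1.0, 0.0], [4.0, 0.5, 0.0]]
early = should_exit(span, step=1, max_steps=10)   # strict threshold early on
late = should_exit(span, step=8, max_steps=10)    # relaxed threshold later
```

The same confidence level is rejected early in decoding and accepted late, which is the sense in which the threshold is progress-aware.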
Details
Motivation: Diffusion large language models are promising but limited by slow iterative sampling, so efficient decoding is needed to make them practical. Method: Proposes SchED, which aggregates full-span logit margins and terminates decoding early once a smooth, progress-dependent confidence threshold is reached; it requires no training and applies to multiple models. Result: Validated on two dLLM families, Dream and LLaDA. Instruction-tuned models achieve 3.8-4.0x speedups while retaining 99.8-100% of performance; base models reach up to 2.34x speedups under aggressive settings while retaining 99.1-100%; SchED also outperforms existing methods under the QPS metric. Conclusion: SchED turns stabilization of predictive confidence into computational savings, substantially improving dLLM decoding efficiency, and it remains robust on long-form generation, which supports practical use of dLLMs. Abstract: Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluated SchED on two dLLM families (Dream and LLaDA), in base and instruction-tuned variants across ten benchmarks spanning downstream tasks including multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves 3.8-4.0x speedups while retaining 99.8-100% of the baseline score on average. On base models, SchED yields consistent speedup gains with 99.1-100% performance retention, with up to 2.34x under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, γ=4), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model's token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.
[41] AutoNeural: Co-Designing Vision-Language Models for NPU Inference
Wei Chen,Liangmin Wu,Yunhai Hu,Zhiyuan Li,Zhiyuan Cheng,Yicheng Qian,Lingyue Zhu,Zhipeng Hu,Luoyi Liang,Qiang Tang,Zhen Liu,Han Yang
Main category: cs.CL
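TL;DR: This paper proposes AutoNeural, a vision-language model architecture designed for NPUs. It replaces the ViT encoder with a MobileNetV5-style convolutional backbone and uses a hybrid language backbone combining SSM and Transformer layers, enabling efficient integer quantization and linear-time inference, markedly reducing latency, and improving multimodal performance on edge devices.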
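A toy example can show why bounded activations matter for the integer quantization mentioned in this TL;DR. The sketch below applies symmetric per-tensor INT8 quantization, a generic scheme assumed purely for illustration (the entry does not describe the paper's actual quantizer), and compares the rounding error on the same values with and without a single large outlier.

```python
from typing import List, Sequence


def quantize_dequantize(values: Sequence[float], bits: int = 8) -> List[float]:
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    scale = max_abs / qmax                # width of one integer step
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original: Sequence[float], restored: Sequence[float]) -> float:
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


bounded = [i / 100.0 for i in range(-100, 101)]   # activations within [-1, 1]
with_outlier = bounded + [50.0]                    # same values plus one outlier

err_bounded = mean_abs_error(bounded, quantize_dequantize(bounded))
# Error on the original values when the outlier dictates the scale.
restored = quantize_dequantize(with_outlier)[: len(bounded)]
err_outlier = mean_abs_error(bounded, restored)
```

Because the outlier stretches the step size from 1/127 to 50/127, the values in [-1, 1] collapse onto only a few integer levels and `err_outlier` is far larger than `err_bounded`.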
Details
Motivation: Vision-language models optimized for GPUs perform poorly on NPUs, mainly because of the quantization brittleness of ViTs and the I/O bottleneck of autoregressive attention, which leave the NPU's high arithmetic throughput underused. Model structures therefore need to be redesigned around NPU characteristics. Method: Proposes the AutoNeural architecture. The vision encoder replaces ViT with MobileNetV5-style depthwise separable convolutions to keep INT4/8/16 quantization stable; the language backbone fuses State-Space Model principles with Transformer layers and uses gated convolutions to reach linear-time complexity, avoiding the memory overhead of Key-Value caching. Result: Compared with conventional baselines, vision-encoder quantization error drops by up to 7x, end-to-end latency falls by 14x, decoding is 3x faster, the supported context is 4x longer, and real-time performance for automotive cockpit applications is demonstrated on the Qualcomm SA8295P chip. Conclusion: Redesigning model topology around NPU hardware constraints is a prerequisite for robust multimodal edge intelligence; AutoNeural's integer-friendly co-design addresses the efficiency and stability problems of existing VLMs on NPUs. Abstract: While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing quantization error of the vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. AutoNeural also delivers 3x decoding speed and a 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications.
Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
[42] Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic
Muyu Pan,Dheeraj Kodakandla,Mahfuza Farooque
Main category: cs.CL
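TL;DR: Proposes a new framework that combines classical NLP techniques, a self-defined grammar, and a fine-tuned language model to automatically convert natural language sentences into Conjunctive Normal Form (CNF), reducing LLM hallucinations in logical translation tasks.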
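The final step named in this TL;DR, turning a logical expression into CNF, is a standard transformation that can be sketched in a few lines. The nested-tuple encoding and the function names below are hypothetical and are not the paper's grammar or code; only the textbook conversion (eliminate implications, push negations inward, distribute OR over AND) is shown.

```python
from typing import FrozenSet, Set, Tuple, Union

# An atom is a string; a compound formula is ("not", f), ("and", f, g),
# ("or", f, g), or ("implies", f, g). This encoding is an assumption.
Formula = Union[str, Tuple]


def eliminate_implies(f: Formula) -> Formula:
    """Rewrite (a -> b) as (not a) or b, recursively."""
    if isinstance(f, str):
        return f
    op, *args = f
    args = [eliminate_implies(a) for a in args]
    if op == "implies":
        return ("or", ("not", args[0]), args[1])
    return (op, *args)


def to_nnf(f: Formula) -> Formula:
    """Push negations inward (De Morgan); expects implications already removed."""
    if isinstance(f, str):
        return f
    op, *args = f
    if op == "not":
        g = args[0]
        if isinstance(g, str):
            return f
        gop, *gargs = g
        if gop == "not":
            return to_nnf(gargs[0])          # double negation
        dual = "or" if gop == "and" else "and"
        return (dual, to_nnf(("not", gargs[0])), to_nnf(("not", gargs[1])))
    return (op, to_nnf(args[0]), to_nnf(args[1]))


def clauses(f: Formula) -> Set[FrozenSet[str]]:
    """Distribute OR over AND; a clause is a set of literals such as 'p' or '~p'."""
    if isinstance(f, str):
        return {frozenset([f])}
    op, *args = f
    if op == "not":
        return {frozenset(["~" + args[0]])}
    left, right = clauses(args[0]), clauses(args[1])
    if op == "and":
        return left | right
    return {a | b for a in left for b in right}  # op == "or"


def to_cnf(f: Formula) -> Set[FrozenSet[str]]:
    """Full pipeline: the result is a conjunction (set) of disjunctive clauses."""
    return clauses(to_nnf(eliminate_implies(f)))


# (rain -> wet) and not (dry or sunny)
formula = ("and", ("implies", "rain", "wet"), ("not", ("or", "dry", "sunny")))
cnf = to_cnf(formula)
```

The resulting clause set can be handed to a satisfiability solver. Symbolic computation libraries offer equivalent routines, and which library the authors use is not stated in this entry.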
Details
Motivation: LLMs hallucinate when automatically translating natural language into formal logic, which undermines the accuracy of logical reasoning; translation precision must improve to support automated reasoning and verification of software systems. Method: Combines classical NLP techniques with a self-defined grammar, symbolic computation libraries, and a language model fine-tuned under different grammar settings to convert natural language input into logical expressions and then into Conjunctive Normal Form (CNF). Result: Experiments show that the fine-tuned model can intentionally correct the same types of hallucinations produced by the original model, improving the reliability and accuracy of logical translation. Conclusion: The framework reduces LLM hallucinations in logical translation and produces more reliable CNF, which benefits automated reasoning and software verification. Abstract: Recent advances in natural language processing (NLP), particularly large language models (LLMs), have motivated the automatic translation of natural language statements into formal logic without human intervention. This enables automated reasoning and facilitates debugging, finding loop invariants, and adhering to specifications in software systems. However, hallucinations (incorrect outputs generated by LLMs) are challenging, particularly for logical translation tasks requiring precision. This work introduces a novel framework that inputs English sentences, converts them into logical expressions, and then translates them into Conjunctive Normal Form (CNF) for satisfiability solving. It employs classical NLP techniques with self-defined grammar, symbolic computation libraries, and a fine-tuned language model to reduce hallucinations. In the early experiments, we observed that the fine-tuned model, trained on different grammar settings, could intentionally correct the same types of hallucinations made by the original model. Thus, it provides reliable CNF generation.
[43] The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models
Saeid Jamshidi,Kawser Wazed Nafi,Arghavan Moradi Dakhel,Negar Shahabi,Foutse Khomh
Main category: cs.CL
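TL;DR: This paper proposes the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for continuously evaluating and interpreting the moral stability of large language models (LLMs).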
Details
Motivation: Existing alignment frameworks rely on static datasets and post-hoc evaluation, so they reveal little about how LLMs' ethical reasoning evolves across contexts or time scales and offer no dynamic view of moral consistency. Method: MoCoP combines three layers (lexical integrity analysis, semantic risk estimation, and reasoning-based judgment modeling) in a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision. Result: Experiments on GPT-4-Turbo and DeepSeek show that MoCoP captures longitudinal ethical behavior, finding a strong negative correlation between the ethical and toxicity dimensions (r_ET = -0.81, p < 0.001) and almost no association with response latency (r_EL ≈ 0). Conclusion: Moral coherence and linguistic safety are stable, interpretable characteristics of model behavior; MoCoP reframes ethical evaluation as a dynamic, model-agnostic form of moral introspection and provides a reproducible foundation for scalable continuous auditing and for research on computational morality in autonomous AI systems. Abstract: The rapid advancement and adaptability of Large Language Models (LLMs) highlight the need for moral consistency, the capacity to maintain ethically coherent reasoning across varied contexts. Existing alignment frameworks, structured approaches designed to align model behavior with human ethical and social norms, often rely on static datasets and post-hoc evaluations, offering limited insight into how ethical reasoning may evolve across different contexts or temporal scales. This study presents the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for continuously evaluating and interpreting the moral stability of LLMs. MoCoP combines three supporting layers: (i) lexical integrity analysis, (ii) semantic risk estimation, and (iii) reasoning-based judgment modeling within a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision. Our empirical results on GPT-4-Turbo and DeepSeek suggest that MoCoP effectively captures longitudinal ethical behavior, revealing a strong inverse relationship between ethical and toxicity dimensions (correlation r_ET = -0.81, p < 0.001) and a near-zero association with response latency (correlation r_EL ≈ 0). These findings demonstrate that moral coherence and linguistic safety tend to emerge as stable and interpretable characteristics of model behavior rather than short-term fluctuations.
Furthermore, by reframing ethical evaluation as a dynamic, model-agnostic form of moral introspection, MoCoP offers a reproducible foundation for scalable, continuous auditing and advances the study of computational morality in autonomous AI systems.
cs.CV [Back]
[44] Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scale
Mirela G. Tulbure,Julio Caineta,Mark Broich,Mollie D. Gaines,Philippe Rufin,Leon-Friedrich Thomas,Hamed Alemohammad,Jan Hemmerling,Patrick Hostert
Main category: cs.CV
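TL;DR: This study evaluates a TerraMind-based geospatial foundation model (GFM) for flood extent mapping, fine-tuning and testing it on the multimodal FloodsNet dataset covering 85 flood events worldwide. The results indicate that combining optical and SAR data and fine-tuning the GFM improves near-real-time flood mapping.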
Details
Motivation: Existing flood mapping methods depend on labeled data and generalize poorly. Emerging geospatial foundation models such as TerraMind show promise, but their performance across diverse global flood events is unclear and needs systematic evaluation. Method: Fine-tunes TerraMind on FloodsNet, which contains Sentinel-1 SAR and Sentinel-2 optical imagery, tests four configurations (base/large, frozen/unfrozen backbone), and compares against the TerraMind Sen1Floods11 example and a U-Net. Result: The base-unfrozen configuration gives the best balance of precision, recall, and accuracy at lower computational cost; the large-unfrozen model has the highest recall among the GFM configurations; FloodsNet-trained models beat the Sen1Floods11 example on recall with similar overall accuracy; U-Net has the highest recall overall but slightly lower precision and accuracy. Conclusion: Fusing multimodal remote sensing data and fine-tuning a geospatial foundation model improves the generalization and practicality of global flood mapping; the study provides empirical support for, and an analysis of the limitations of, GFMs in climate adaptation and disaster resilience. Abstract: Floods are among the most damaging weather-related hazards, and in 2024, the warmest year on record, extreme flood events affected communities across five continents. Earth observation (EO) satellites provide critical, frequent coverage for mapping inundation, yet operational accuracy depends heavily on labeled datasets and model generalization. Recent Geospatial Foundation Models (GFMs), such as ESA-IBM's TerraMind, offer improved generalizability through large-scale self-supervised pretraining, but their performance on diverse global flood events remains poorly understood. We fine-tune TerraMind for flood extent mapping using FloodsNet, a harmonized multimodal dataset containing co-located Sentinel-1 (Synthetic Aperture Radar, SAR data) and Sentinel-2 (optical) imagery for 85 flood events worldwide. We tested four configurations (base vs. large models; frozen vs. unfrozen backbones) and compared against the TerraMind Sen1Floods11 example and a U-Net trained on both FloodsNet and Sen1Floods11. The base-unfrozen configuration provided the best balance of accuracy, precision, and recall at substantially lower computational cost than the large model. The large unfrozen model achieved the highest recall. Models trained on FloodsNet outperformed the Sen1Floods11-trained example in recall with similar overall accuracy. U-Net achieved higher recall than all GFM configurations, though with slightly lower accuracy and precision.
Our results demonstrate that integrating multimodal optical and SAR data and fine-tuning a GFM can enhance near-real-time flood mapping. This study provides one of the first global-scale evaluations of a GFM for flood segmentation, highlighting both its potential and current limitations for climate adaptation and disaster resilience.
[45] Context-Enriched Contrastive Loss: Enhancing Presentation of Inherent Sample Connections in Contrastive Learning Framework
Haojin Deng,Yimin Yang
Main category: cs.CV
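TL;DR: This paper proposes a context-enriched contrastive loss that introduces two convergence targets to improve contrastive learning and mitigate information distortion. It outperforms 16 existing methods on several large-scale image classification benchmarks, with a particularly clear advantage on tasks involving systematic bias.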
Details
Motivation: Existing contrastive learning methods are prone to information distortion caused by data augmentation, and models may over-rely on samples with identical labels while neglecting positive pairs that come from the same image, which harms generalization and training fairness. Method: Proposes a context-enriched contrastive loss with two convergence targets: the first component strengthens discrimination between classes, and the second pulls together augmented samples from the same source image and pushes all other samples away. Result: The method is validated on eight large benchmark datasets including CIFAR10, CIFAR100, and ImageNet, improving on 16 state-of-the-art methods in both generalization and convergence speed, and improving on the original contrastive loss by 22.9% on the BiasedMNIST downstream task. Conclusion: The method mitigates information distortion in contrastive learning and improves learning efficiency and fairness, giving it practical value and broad applicability. Abstract: Contrastive learning has gained popularity and pushes state-of-the-art performance across numerous large-scale benchmarks. In contrastive learning, the contrastive loss function plays a pivotal role in discerning similarities between samples through techniques such as rotation or cropping. However, this learning mechanism can also introduce information distortion from the augmented samples. This is because the trained model may develop a significant overreliance on information from samples with identical labels, while concurrently neglecting positive pairs that originate from the same initial image, especially in expansive datasets. This paper proposes a context-enriched contrastive loss function that concurrently improves learning effectiveness and addresses the information distortion by encompassing two convergence targets. The first component, which is notably sensitive to label contrast, differentiates between features of identical and distinct classes, which boosts the contrastive training efficiency. Meanwhile, the second component draws closer the augmented samples from the same source image and distances all other samples. We evaluate the proposed approach on image classification tasks using eight widely accepted large-scale recognition benchmark datasets: CIFAR10, CIFAR100, Caltech-101, Caltech-256, ImageNet, BiasedMNIST, UTKFace, and CelebA. The experimental results demonstrate that the proposed method achieves improvements over 16 state-of-the-art contrastive learning methods in terms of both generalization performance and learning convergence speed.
Interestingly, our technique stands out in addressing systematic distortion tasks. It demonstrates a 22.9% improvement compared to original contrastive loss functions in the downstream BiasedMNIST dataset, highlighting its promise for more efficient and equitable downstream training.
[46] FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges
Kevin David Hayes,Micah Goldblum,Vikash Sehwag,Gowthami Somepalli,Ashwinee Panda,Tom Goldstein
Main category: cs.CV
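TL;DR: This paper proposes a structured methodology for jointly evaluating text-to-image (T2I) models and vision language models (VLMs). By testing whether VLMs can identify 27 specific failure modes in T2I-generated images, it reveals systematic deficiencies of existing models in attribute fidelity and object representation.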
Details
Motivation: T2I models can generate high-quality images but often fail to capture specific attributes in the prompt, such as object counts and colors, while existing evaluation frameworks and VLM benchmarks do not adequately handle complex scenes, so a finer-grained evaluation scheme is needed. Method: Builds a dataset of images generated by 5 T2I models with annotations from 3 VLMs, uses Llama3 to analyze the VLM outputs, assesses the VLMs' ability to identify 27 predefined failure modes, and performs a hierarchical evaluation based on challenging prompts. Result: Current T2I models show systematic errors in attribute fidelity and object representation, existing metrics struggle to capture these subtle errors, and VLMs differ markedly in how well they identify failure modes. Conclusion: Existing evaluation metrics do not reflect the fine-grained problems of generative models; targeted benchmarks are needed to improve their reliability and interpretability. Abstract: Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.
[47] Mapping of Lesion Images to Somatic Mutations
Rahul Mehta
Main category: cs.CV
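TL;DR: Proposes LLOST, a deep latent variable model that jointly models lesion point clouds and somatic mutation data to predict cancer patients' mutation profiles from medical images.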
Details
Motivation: The aim is to support early cancer diagnosis by predicting somatic mutations from medical images, helping with the choice of treatment. Method: Introduces a point cloud representation of lesions and designs the LLOST model, which couples dual variational autoencoders through a shared latent space; all three latent spaces are modeled with conditional normalizing flow priors. Result: On the TCIA and Pan-Cancer datasets, the model predicts both the counts and the occurrence of mutations and reveals shared patterns between the imaging and mutation domains that reflect cancer type. Conclusion: The model bridges medical imaging and genetic information; future work could extend it to other genetic domains and improve its performance. Abstract: Medical imaging is a critical initial tool used by clinicians to determine a patient's cancer diagnosis, allowing for faster intervention and more reliable patient prognosis. At subsequent stages of patient diagnosis, genetic information is extracted to help select specific patient treatment options. As the efficacy of cancer treatment often relies on early diagnosis and treatment, we build a deep latent variable model to determine patients' somatic mutation profiles based on their corresponding medical images. We first introduce a point cloud representation of lesion images to allow for invariance to the imaging modality. We then propose LLOST, a model with dual variational autoencoders coupled together by a separate shared latent space that unifies features from the lesion point clouds and counts of distinct somatic mutations. Our model therefore consists of three latent spaces, each of which is learned with a conditional normalizing flow prior to account for the diverse distributions of each domain. We conduct qualitative and quantitative experiments on de-identified medical images from The Cancer Imaging Archive and the corresponding somatic mutations from the Pan-Cancer dataset of The Cancer Genome Atlas. We show the model's predictive performance on the counts of specific mutations as well as its ability to accurately predict the occurrence of mutations. In particular, we find shared patterns between the imaging and somatic mutation domains that reflect cancer type. We conclude with a remark on how to improve the model and possible future avenues of research to include other genetic domains.
[48] SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting
Pranav Asthana,Alex Hanson,Allen Tu,Tom Goldstein,Matthias Zwicker,Amitabh Varshney
Main category: cs.CV
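TL;DR: Proposes SplatSuRe, which uses the relationship between camera pose and scene geometry to apply super-resolution selectively in undersampled regions, improving the detail and consistency of 3D Gaussian Splatting renders.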
Details
Motivation: Existing super-resolution methods are applied uniformly across multi-view images, which leads to blurry renders and fails to recover detail in regions that lack high-frequency information. Method: Uses the relationship between camera pose and scene geometry to identify undersampled regions that lack high-frequency supervision, and fuses super-resolved content only in those regions to avoid multi-view inconsistency. Result: Outperforms baselines on the Tanks & Temples, Deep Blending, and Mip-NeRF 360 datasets, with clear gains in fidelity and perceptual quality, especially in localized foreground regions. Conclusion: By choosing where to apply super-resolution, SplatSuRe achieves sharper, more consistent multi-view rendering and offers an efficient approach to high-resolution novel view synthesis. Abstract: 3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views, and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building from this insight, we propose SplatSuRe, a method that selectively applies SR content only in undersampled regions lacking high-frequency supervision, yielding sharper and more consistent results. Across Tanks & Temples, Deep Blending and Mip-NeRF 360, our approach surpasses baselines in both fidelity and perceptual quality. Notably, our gains are most significant in localized foreground regions where higher detail is desired.
[49] RobustSurg: Tackling domain generalisation for out-of-distribution surgical scene segmentation
Mansoor Ali,Maksim Richards,Gilberto Ochoa-Ruiz,Sharib Ali
Main category: cs.CV
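TL;DR: This paper proposes RobustSurg, a method for improving the generalization of surgical scene segmentation across centres and imaging modalities. It strengthens feature representations through instance normalization, feature covariance mapping, and a restitution module, and releases a new multi-centre dataset to support research.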
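Instance normalization, the first component named in this TL;DR, removes the per-image, per-channel mean and scale, which is one common way to suppress appearance ("style") statistics. The sketch below is a plain-Python illustration of that standard operation on a toy feature map. It is not the RobustSurg implementation, and the feature covariance mapping and restitution module are not shown.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # [channel][row][col] for a single image


def instance_norm(features: FeatureMap, eps: float = 1e-5) -> FeatureMap:
    """Normalize each channel of one image to zero mean and unit variance."""
    normalized = []
    for channel in features:
        flat = [v for row in channel for v in row]
        mean = sum(flat) / len(flat)
        var = sum((v - mean) ** 2 for v in flat) / len(flat)
        inv_std = 1.0 / math.sqrt(var + eps)
        normalized.append([[(v - mean) * inv_std for v in row] for row in channel])
    return normalized


# The same 2x2 pattern under two hypothetical "styles":
# a dim image, and a brighter, higher-contrast version of it.
dim = [[[0.1, 0.2], [0.3, 0.4]]]
bright = [[[10.0 + 5.0 * v for v in row] for row in dim[0]]]

out_dim = instance_norm(dim)
out_bright = instance_norm(bright)
```

After normalization the two outputs are nearly identical, because the brightness offset and contrast gain are exactly the statistics that instance normalization removes.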
Details
Motivation: Existing deep learning methods perform well on single-centre, single-modality surgical scene segmentation but generalize poorly to unseen distributions (such as other medical centres) and unseen imaging modalities, and domain adaptation methods developed for natural scenes are hard to apply directly to surgical scenes. Method: Drawing on the idea of separating style from content, the method uses instance normalization and feature covariance mapping to reduce the influence of appearance changes, adds a restitution module to the ResNet backbone to retain key task-relevant features, and trains on CholecSeg8K to test performance on the HeiCholSeg and EndoUDA datasets. Result: On the unseen-centre HeiCholSeg dataset, mIoU improves by nearly 23% over the DeepLabv3+ baseline and by 10-32% over current state-of-the-art methods; on the EndoUDA polyp dataset, it improves by 22% over the baseline and by about 11% over a recent SOTA method. Conclusion: By decoupling style and content features while retaining key semantic information, RobustSurg markedly improves the robustness and generalization of surgical scene segmentation across centres and modalities, and the newly released dataset is an important resource for future research. Abstract: While recent advances in deep learning for surgical scene segmentation have demonstrated promising results on single-centre and single-imaging modality data, these methods usually do not generalise to unseen distributions (i.e., from other centres) and unseen modalities. Generalisation to out-of-distribution data and domain gaps due to modality changes has been widely researched, but mostly for natural scene data. However, these methods cannot be directly applied to surgical scenes due to limited visual cues and often extremely diverse scenarios compared to natural scene data. Inspired by these works in natural scenes to push generalisability on OOD data, we hypothesise that exploiting the style and content information in surgical scenes could minimise appearance variation, making the representation less sensitive to sudden changes such as blood or imaging artefacts. This can be achieved by performing instance normalisation and feature covariance mapping techniques for robust and generalisable feature representations. Further, to eliminate the risk of removing salient feature representation associated with the objects of interest, we introduce a restitution module within the feature learning ResNet backbone that can enable the retention of useful task-relevant features. To tackle the lack of multiclass and multicentre data for surgical scene segmentation, we also provide a newly curated dataset that can be vital for addressing generalisability in this domain.
Our proposed RobustSurg obtained a nearly 23% improvement over the DeepLabv3+ baseline and a 10-32% improvement over SOTA methods in terms of mean IoU on the unseen-centre HeiCholSeg dataset when trained on CholecSeg8K. Similarly, RobustSurg obtained a nearly 22% improvement over the baseline and a nearly 11% improvement over a recent SOTA method on the target set of the EndoUDA polyp dataset.
[50] Multifractal Recalibration of Neural Networks for Medical Imaging Segmentation
Miguel L. Martins,Miguel T. Coimbra,Francesco Renna
Main category: cs.CV
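TL;DR: This paper proposes two new inductive priors, Monofractal and Multifractal Recalibration, for channel attention in convolutional networks. Multifractal analysis is introduced into a U-Net framework to improve medical image segmentation, and the method is validated on three public datasets.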
Details
Motivation: Existing end-to-end multifractal methods rely on heavy pooling or strong feature-space decimation, which limits their use for tasks such as semantic segmentation; this paper aims to overcome those limitations. Method: Proposes two inductive priors, Monofractal and Multifractal Recalibration, which use the relationship between the probability mass of the exponents and the multifractal spectrum to build statistical descriptions of encoder embeddings. These are integrated into convolutional networks as channel-attention functions and evaluated in a U-Net framework. Result: Experiments on three medical imaging datasets (ISIC18, Kvasir-SEG, and BUSI) show substantial gains over other channel-attention mechanisms that use higher-order statistics. The analysis also finds that, because of skip connections, excitation responses in U-Net do not become more specialized with encoder depth. Conclusion: Multifractal recalibration is an effective channel-attention mechanism that captures pathological regularities and shows promise for medical image segmentation; its effectiveness may relate to global statistics of instance variability. Abstract: Multifractal analysis has revealed regularities in many self-seeding phenomena, yet its use in modern deep learning remains limited. Existing end-to-end multifractal methods rely on heavy pooling or strong feature-space decimation, which constrain tasks such as semantic segmentation. Motivated by these limitations, we introduce two inductive priors: Monofractal and Multifractal Recalibration. These methods leverage relationships between the probability mass of the exponents and the multifractal spectrum to form statistical descriptions of encoder embeddings, implemented as channel-attention functions in convolutional networks. Using a U-Net-based framework, we show that multifractal recalibration yields substantial gains over a baseline equipped with other channel-attention mechanisms that also use higher-order statistics. Given the proven ability of multifractal analysis to capture pathological regularities, we validate our approach on three public medical-imaging datasets: ISIC18 (dermoscopy), Kvasir-SEG (endoscopy), and BUSI (ultrasound). Our empirical analysis also provides insights into the behavior of these attention layers. We find that excitation responses do not become increasingly specialized with encoder depth in U-Net architectures due to skip connections, and that their effectiveness may relate to global statistics of instance variability.
[51] Towards Unified Video Quality Assessment
Chen Feng,Tianhao Peng,Fan Zhang,David Bull
Main category: cs.CV
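TL;DR: This paper proposes Unified-VQA, a unified video quality assessment framework that uses a diagnostic Mixture-of-Experts (MoE) model to assess quality across multiple video formats and distortion types and to provide interpretable multi-dimensional quality feedback. Without fine-tuning, it outperforms over 18 benchmark methods across 17 databases.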
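As a rough illustration of what a ranking-style objective guided by a proxy metric can look like, the sketch below implements a generic pairwise hinge ranking loss in plain Python. It is not the Unified-VQA loss: the function name, the `margin` parameter, and the toy scores are assumptions made for this example only.

```python
def pairwise_ranking_loss(pred, proxy, margin=0.0):
    """Mean hinge loss over all pairs whose proxy scores differ.

    pred  : predicted quality scores (hypothetical expert outputs)
    proxy : proxy-metric scores that define the desired ordering
    """
    total, count = 0.0, 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            if proxy[i] == proxy[j]:
                continue  # ties carry no ordering information
            sign = 1.0 if proxy[i] > proxy[j] else -1.0
            # Penalize the pair when the predicted order disagrees with the proxy order.
            total += max(0.0, margin - sign * (pred[i] - pred[j]))
            count += 1
    return total / count if count else 0.0


proxy_scores = [0.9, 0.5, 0.1]
loss_correct_order = pairwise_ranking_loss([3.0, 2.0, 1.0], proxy_scores)
loss_reversed_order = pairwise_ranking_loss([1.0, 2.0, 3.0], proxy_scores)
```

Only the ordering of the proxy scores matters here, not their scale, which is one reason ranking losses are convenient when different experts are guided by different proxy metrics.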
Details
Motivation: Existing video quality assessment methods are mostly monolithic models that lack interpretability and are usually designed for a specific format, so they generalize poorly across distortion types and video formats, which limits their practical usefulness. Method: Generic VQA is recast as a Diagnostic Mixture-of-Experts (MoE) problem with multiple "perceptual experts" dedicated to distinct perceptual domains. A multi-proxy expert training strategy optimizes each expert with a ranking loss guided by the most suitable proxy metric for its domain, and a diagnostic multi-task head outputs a global quality score and an interpretable multi-dimensional artifact vector. Result: Without retraining or fine-tuning, Unified-VQA consistently outperforms over 18 existing methods on 17 databases covering HD, UHD, HDR, and HFR formats, in both generic VQA and diagnostic artifact detection. Conclusion: Unified-VQA makes video quality assessment more practical, actionable, and interpretable, and is an important step towards generic, diagnostic VQA. Abstract: Recent works in video quality assessment (VQA) typically employ monolithic models that predict a single quality score for each test video. These approaches cannot provide diagnostic, interpretable feedback, offering little insight into why the video quality is degraded. Most of them are also specialized, format-specific metrics rather than truly "generic" solutions, as they are designed to learn a compromised representation from disparate perceptual domains. To address these limitations, this paper proposes Unified-VQA, a framework that provides a single, unified quality model applicable to various distortion types within multiple video formats by recasting generic VQA as a Diagnostic Mixture-of-Experts (MoE) problem. Unified-VQA employs multiple "perceptual experts" dedicated to distinct perceptual domains. A novel multi-proxy expert training strategy is designed to optimize each expert using a ranking-inspired loss, guided by the most suitable proxy metric for its domain. We also integrate a diagnostic multi-task head into this framework to generate a global quality score and an interpretable multi-dimensional artifact vector, which is optimized using a weakly-supervised learning strategy, leveraging the known properties of the large-scale training database generated for this work.
With static model parameters (without retraining or fine-tuning), Unified-VQA demonstrates consistent and superior performance compared to over 18 benchmark methods for both generic VQA and diagnostic artifact detection tasks across 17 databases containing diverse streaming artifacts in HD, UHD, HDR and HFR formats. This work represents an important step towards practical, actionable, and interpretable video quality assessment.
[52] See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Le Thien Phuc Nguyen,Zhuoran Yu,Samuel Low Yu Hang,Subin An,Jeongik Lee,Yohan Ban,SeungEun Chung,Thanh-Huy Nguyen,JuWan Maeng,Soochahn Lee,Yong Jae Lee
Main category: cs.CV
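TL;DR: AV-SpeakerBench is a video benchmark for speaker-centric audiovisual reasoning. Its 3,212 multiple-choice questions evaluate how well multimodal large language models perform fine-grained reasoning about who speaks, what is said, and when it occurs.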
Details
Motivation: Existing video benchmarks rarely assess fine-grained understanding of human speech; many tasks can be solved from vision alone or evaluate speech content only coarsely, so they reveal little about a model's true cross-modal alignment ability. Method: AV-SpeakerBench treats the speaker as the core reasoning unit, embeds audiovisual dependencies into the question semantics, and uses expert annotation to ensure temporal precision and cross-modal validity. Result: The Gemini family performs best, led by Gemini 2.5 Pro. Among open-source models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but still trails the closed-source models because of weaker audiovisual fusion. Conclusion: AV-SpeakerBench provides a rigorous evaluation basis for fine-grained audiovisual reasoning in multimodal systems and should drive progress in speaker alignment and cross-modal understanding. Abstract: Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.
[53] Exploring the Potentials of Spiking Neural Networks for Image Deraining
Shuang Chen,Tomas Krajnik,Farshad Arvin,Amir Atapour-Abarghouei
Main category: cs.CV
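TL;DR: This paper proposes an image deraining method based on Spiking Neural Networks (SNNs). It introduces the Visual LIF (VLIF) neuron and a Spiking Decomposition and Enhancement Module for efficient, low-energy multi-scale feature learning, significantly outperforming existing SNN-based methods on five benchmark datasets while using only 13% of their energy.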
Details
Motivation: SNNs are underexplored in low-level vision tasks; conventional spiking neurons lack spatial contextual understanding and suffer from frequency-domain saturation, which limits their performance in tasks such as image deraining. Method: The Visual LIF (VLIF) neuron is proposed to strengthen spatial contextual understanding, and is combined with the Spiking Decomposition and Enhancement Module and the lightweight Spiking Multi-scale Unit for hierarchical multi-scale representation learning. Result: The method significantly outperforms existing SNN baselines on five mainstream deraining datasets while consuming only about 13% of their energy. Conclusion: The method retains the biological plausibility and energy efficiency of SNNs while markedly improving their performance on low-level vision tasks such as deraining, laying a foundation for deploying SNNs in high-performance, low-power vision systems. Abstract: Biologically plausible and energy-efficient frameworks such as Spiking Neural Networks (SNNs) have not been sufficiently explored in low-level vision tasks. Taking image deraining as an example, this study addresses the representation of the inherent high-pass characteristics of spiking neurons and proposes the Visual LIF (VLIF) neuron, which overcomes the lack of spatial contextual understanding in traditional spiking neurons. To tackle the limitation of frequency-domain saturation inherent in conventional spiking neurons, we leverage the proposed VLIF to introduce the Spiking Decomposition and Enhancement Module and the lightweight Spiking Multi-scale Unit for hierarchical multi-scale representation learning. Extensive experiments across five benchmark deraining datasets demonstrate that our approach significantly outperforms state-of-the-art SNN-based deraining methods, achieving this superior performance with only 13% of their energy consumption. These findings establish a solid foundation for deploying SNNs in high-performance, energy-efficient low-level vision tasks.
[54] Spatiotemporal Pyramid Flow Matching for Climate Emulation
Jeremy Andrew Irvin,Jiaqi Han,Zikui Wang,Abdulaziz Alharbi,Yufei Zhao,Nomin-Erdene Bayarsaikhan,Daniele Visioni,Andrew Y. Ng,Duncan Watson-Parris
Main category: cs.CV
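TL;DR: This paper proposes Spatiotemporal Pyramid Flows (SPF), a new generative model for efficient, multi-scale climate emulation. Together with ClimateSuite, a newly curated large-scale climate dataset, it enables accurate and fast probabilistic climate prediction across timescales and future scenarios.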
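For readers unfamiliar with flow matching, the following is a minimal sketch of the standard straight-path training target (linear interpolation between a noise sample and a data sample, with a constant target velocity). It does not reproduce SPF's spatiotemporal pyramid or its conditioning on forcings; the function names and the toy `model` callable are assumptions for illustration.

```python
def flow_matching_pair(x0, x1, t):
    """Return the point x_t on the straight path from x0 to x1 and its target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]  # d x_t / d t along the straight path
    return x_t, velocity


def flow_matching_loss(model, x0, x1, t):
    """Mean squared error between the model's predicted velocity and the target velocity."""
    x_t, velocity = flow_matching_pair(x0, x1, t)
    predicted = model(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(predicted, velocity)) / len(velocity)


noise = [0.0, 0.0]   # hypothetical noise sample
data = [2.0, 4.0]    # hypothetical data sample
midpoint, target = flow_matching_pair(noise, data, 0.5)

# A toy "model" that already outputs the correct velocity has zero loss.
perfect_model = lambda x_t, t: [2.0, 4.0]
loss = flow_matching_loss(perfect_model, noise, data, 0.5)
```

In practice `t` is sampled at random for every training pair and `model` is a neural network; sampling then integrates the learned velocity field from noise to data.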
Details
Motivation: Existing generative models rely on weather-scale autoregression for climate emulation, which is slow and has not shown stable rollouts under nonstationary forcings, so it falls short of the needs of long-horizon climate prediction. Method: SPF uses a hierarchical spatiotemporal pyramid that splits the generative process across multiple spatial and temporal scales, progressively increasing spatial resolution and coupling each stage with a specific timescale and with physical forcings (e.g., greenhouse gases), which allows direct sampling at any temporal level and parallel generation. Result: On ClimateBench, SPF outperforms existing flow matching baselines and pre-trained models at yearly and monthly timescales while sampling faster. Scaled up on ClimateSuite, a dataset of over 33,000 simulation-years, SPF generalizes well to held-out scenarios across climate models. Conclusion: SPF and ClimateSuite provide a new foundation for efficient, accurate, probabilistic climate emulation across timescales and realistic future scenarios, with the potential to advance climate modeling. Abstract: Generative models have the potential to transform the way we emulate Earth's changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models.
Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code are publicly available at https://github.com/stanfordmlgroup/spf.
[55] Progressive Image Restoration via Text-Conditioned Video Generation
Peng Kang,Xijun Wang,Yu Yuan
Main category: cs.CV
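TL;DR: This paper explores repurposing the text-to-video model CogVideo for image restoration by fine-tuning it to generate restoration trajectories from degraded to clean images. The authors build synthetic datasets for super-resolution, deblurring, and low-light enhancement, and compare a uniform prompt with scene-specific prompts. Experiments show that the method generalizes zero-shot to real-world scenes without additional training, maintaining temporal coherence and improving perceptual metrics.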
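One way to check that a restoration trajectory actually improves from frame to frame is to score every frame against the clean reference. The sketch below does this with PSNR on flattened pixel lists; the 8-bit peak value of 255 and the toy pixel values are assumptions for illustration, not data from the paper.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [100.0, 100.0, 100.0, 100.0]
# Hypothetical trajectory: each frame is closer to the clean image than the previous one.
trajectory = [
    [140.0, 60.0, 140.0, 60.0],
    [120.0, 80.0, 120.0, 80.0],
    [110.0, 90.0, 110.0, 90.0],
]
scores = [psnr(clean, frame) for frame in trajectory]
is_monotonic = all(earlier < later for earlier, later in zip(scores, scores[1:]))
```

PSNR and SSIM increase as quality improves, whereas LPIPS is a distance and decreases, so a monotonicity check for LPIPS would use the opposite comparison.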
Details
Motivation: Although existing text-to-video models excel at temporal generation, their use in image restoration remains underexplored. This paper investigates how their temporal modeling ability can be applied to progressive visual restoration. Method: CogVideo is fine-tuned on synthetic datasets of degraded-to-clean frame transitions so that it generates restoration trajectories. Two text prompting strategies are used, a uniform prompt and scene-specific prompts generated with LLaVA and ChatGPT, and the model learns to associate temporal progression with improving image quality. Result: The model's PSNR, SSIM, and LPIPS improve progressively across frames, spatial detail and illumination consistency are effectively restored, and the model shows zero-shot robustness on the real-world ReLoBlur dataset without retraining. Conclusion: With appropriate fine-tuning, CogVideo can be applied to multiple image restoration tasks, performing well and remaining interpretable in both synthetic and real-world settings, which reveals new potential for large video generation models in classical inverse problems. Abstract: Recent text-to-video models have demonstrated strong temporal generation capabilities, yet their potential for image restoration remains underexplored. In this work, we repurpose CogVideo for progressive visual restoration tasks by fine-tuning it to generate restoration trajectories rather than natural video motion. Specifically, we construct synthetic datasets for super-resolution, deblurring, and low-light enhancement, where each sample depicts a gradual transition from degraded to clean frames. Two prompting strategies are compared: a uniform text prompt shared across all samples, and a scene-specific prompting scheme generated via the LLaVA multi-modal LLM and refined with ChatGPT. Our fine-tuned model learns to associate temporal progression with restoration quality, producing sequences that improve perceptual metrics such as PSNR, SSIM, and LPIPS across frames. Extensive experiments show that CogVideo effectively restores spatial detail and illumination consistency while maintaining temporal coherence. Moreover, the model generalizes to real-world scenarios on the ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability through temporal restoration.
[56] Enhancing Cross Domain SAR Oil Spill Segmentation via Morphological Region Perturbation and Synthetic Label-to-SAR Generation
Andre Juarez,Luis Salsavilca,Frida Coaquira,Celso Gonzales
Main category: cs.CV
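TL;DR: MORP-Synth is a two-stage synthetic augmentation framework that uses morphological region perturbation and a conditional generative model to improve the transfer of SAR oil spill segmentation models from the Mediterranean to the Peruvian coast, substantially reducing the generalization gap caused by differing sea conditions.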
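The results for this entry are reported as mIoU and per-class IoU. As a reminder of how those figures are computed, here is a small sketch over flattened label masks. The three-class layout (0 = background, 1 = oil, 2 = look-alike) and the toy masks are assumptions for illustration, not the paper's label scheme or data.

```python
def per_class_iou(pred, target, num_classes):
    """Intersection over union for each class; None when a class is absent from both masks."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        ious.append(intersection / union if union else None)
    return ious


def mean_iou(pred, target, num_classes):
    """Average IoU over the classes that appear in at least one of the two masks."""
    valid = [v for v in per_class_iou(pred, target, num_classes) if v is not None]
    return sum(valid) / len(valid)


# Hypothetical flattened masks: 0 = background, 1 = oil, 2 = look-alike.
predicted = [0, 0, 1, 1, 2, 2]
reference = [0, 1, 1, 1, 2, 0]
ious = per_class_iou(predicted, reference, num_classes=3)
miou = mean_iou(predicted, reference, num_classes=3)
```

Because mIoU weights every class equally, gains on small minority classes such as oil and look-alike regions can move it noticeably even when most pixels are background.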
Details
Motivation: Existing deep learning models for SAR oil spill segmentation generalize poorly across sea regions, especially along the Peruvian coast where labeled data are scarce, because of differences in sea state, backscatter statistics, and slick morphology. Method: MORP-Synth has two stages. Stage A applies curvature-guided Morphological Region Perturbation (MORP) in label space to generate geometrically realistic oil and look-alike regions. Stage B uses a conditional generative INADE model to render SAR-like textures for the edited masks. Result: A Peruvian SAR dataset of 2,112 patches of 512×512 pixels was compiled and seven segmentation architectures were evaluated. Models pretrained only on Mediterranean data drop from 67.8% to 51.8% mIoU on the Peruvian data; MORP-Synth improves performance by up to +6 mIoU, with IoU gains of 10.8 for the oil class and 14.6 for the look-alike class. Conclusion: MORP-Synth mitigates domain shift in cross-region SAR oil spill segmentation, improving generalization in low-resource regions by synthesizing realistic geometry and texture. Abstract: Deep learning models for SAR oil spill segmentation often fail to generalize across regions due to differences in sea-state, backscatter statistics, and slick morphology, a limitation that is particularly severe along the Peruvian coast where labeled Sentinel-1 data remain scarce. To address this problem, we propose MORP-Synth, a two-stage synthetic augmentation framework designed to improve transfer from Mediterranean to Peruvian conditions. Stage A applies Morphological Region Perturbation, a curvature-guided label-space method that generates realistic geometric variations of oil and look-alike regions. Stage B renders SAR-like textures from the edited masks using a conditional generative INADE model. We compile a Peruvian dataset of 2112 labeled 512×512 patches from 40 Sentinel-1 scenes (2014-2024), harmonized with the Mediterranean CleanSeaNet benchmark, and evaluate seven segmentation architectures. Models pretrained on Mediterranean data degrade from 67.8% to 51.8% mIoU on the Peruvian domain; MORP-Synth improves performance up to +6 mIoU and boosts minority-class IoU (+10.8 oil, +14.6 look-alike).
[57] Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision
Chenshuang Zhang,Kang Zhang,Joon Son Chung,In So Kweon,Junmo Kim,Chengzhi Mao
Main category: cs.CV
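TL;DR: This paper proposes a self-supervised tracking method built on pre-trained video diffusion models. It exploits the fact that their denoising process isolates motion information early, and significantly outperforms existing self-supervised methods at tracking visually similar objects.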
Details
Motivation: Existing self-supervised trackers struggle to distinguish similar-looking objects when visual cues are ambiguous, which limits their generalization, while supervised methods depend on large amounts of labeled data and scale poorly. Method: The approach exploits the motion representations that pre-trained video diffusion models isolate in the high-noise stages of denoising to build a self-supervised tracker that needs no task-specific training and focuses on distinguishing visually similar objects. Result: On established benchmarks and newly designed tests for tracking visually similar objects, the method improves on recent self-supervised approaches by up to 6 points; visualizations show that it tracks even identical objects through viewpoint changes and deformations. Conclusion: The motion representations implicitly learned by pre-trained video diffusion models support robust self-supervised tracking, particularly for distinguishing visually similar or identical objects, and offer a new direction for tracking. Abstract: Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.
[58] TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction
Fengyi Zhang,Tianjun Zhang,Kasra Khosoussi,Zheng Zhang,Zi Huang,Yadan Luo
Main category: cs.CV
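TL;DR: This paper proposes TALO, a higher-DOF, long-term alignment framework based on Thin Plate Spline that improves the spatiotemporal consistency of 3D vision foundation models' predictions over time. It is plug-and-play, significantly reduces trajectory error, and is more robust to noisy geometry.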
Details
Motivation: In online settings such as driving, existing methods for temporally consistent alignment suffer from invalid assumptions, limited local alignment scope, and sensitivity to noisy geometry, making it hard to keep 3D predictions coherent over time. Method: TALO is an alignment framework based on Thin Plate Spline that uses globally propagated control points to correct spatially varying inconsistencies, and adopts a point-agnostic submap registration design for robustness to noisy geometry. It supports diverse 3D foundation models and camera configurations. Result: Experiments show that the method consistently lowers trajectory error and produces more coherent geometry across multiple datasets, backbone models, and camera setups, demonstrating robustness and generality. Conclusion: TALO is a general, plug-and-play temporal alignment framework that addresses the temporal inconsistency of 3D foundation models in dynamic scenes and improves stability and accuracy in practical use. Abstract: 3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving for a global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Code is publicly available at https://github.com/Xian-Bei/TALO.
[59] A multi-weight self-matching visual explanation for cnns on sar images
Siyuan Sun,Yongping Zhang,Hongcheng Zeng,Yamin Wang,Wei Yang,Wanting Yang,Jie Chen
Main category: cs.CV
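TL;DR: This paper proposes a visual explanation method called multi-weight self-matching class activation mapping (MS-CAM) to improve the interpretability of convolutional neural networks on synthetic aperture radar (SAR) image tasks. Experiments show that it highlights the network's regions of interest more accurately, captures target details, and can be used for weakly-supervised object localization.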
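To make the distinction between channel-wise and element-wise weights concrete, the sketch below computes two familiar gradient-based activation maps: one that weights each channel by its spatially averaged gradient (the Grad-CAM recipe), and one that weights each location by its own gradient. This is not the MS-CAM formulation, whose exact combination rule is not given in this entry; the function names and the toy activations and gradients are assumptions.

```python
def channel_weighted_cam(activations, gradients):
    """Grad-CAM-style map: one weight per channel, the mean of that channel's gradients."""
    weights = [sum(g) / len(g) for g in gradients]
    positions = len(activations[0])
    return [
        max(0.0, sum(w * a[i] for w, a in zip(weights, activations)))  # ReLU
        for i in range(positions)
    ]


def element_weighted_cam(activations, gradients):
    """Element-wise map: every spatial position is weighted by its own gradient."""
    positions = len(activations[0])
    return [
        max(0.0, sum(g[i] * a[i] for g, a in zip(gradients, activations)))  # ReLU
        for i in range(positions)
    ]


# Two channels, two flattened spatial positions each (hypothetical values).
feature_maps = [[1.0, 2.0], [3.0, 4.0]]
class_gradients = [[1.0, 1.0], [-1.0, 1.0]]
cam_channel = channel_weighted_cam(feature_maps, class_gradients)
cam_element = element_weighted_cam(feature_maps, class_gradients)
```

In the toy example the second channel's gradients average to zero, so the channel-wise map ignores it entirely, while the element-wise map still reflects its position-specific gradients.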
Details
Motivation: CNNs are widely used in SAR tasks, but their complex and opaque internal mechanisms limit their use in high-reliability scenarios, so better model interpretability is needed. Method: MS-CAM matches SAR images with the feature maps and gradients extracted by the CNN, and combines channel-wise and element-wise weights to visualize the model's decision basis in SAR images. Result: Experiments on a self-constructed SAR target classification dataset show that MS-CAM localizes the network's regions of interest more accurately and captures richer target detail. Its feasibility for weakly-supervised object localization is also validated, and key factors affecting localization accuracy, such as pixel thresholds, are analyzed. Conclusion: MS-CAM improves the interpretability of CNNs in SAR image analysis, has potential for high-reliability SAR tasks, and provides a reference for future work on weakly-supervised localization. Abstract: In recent years, convolutional neural networks (CNNs) have achieved significant success in various synthetic aperture radar (SAR) tasks. However, the complexity and opacity of their internal mechanisms hinder the fulfillment of high-reliability requirements, thereby limiting their application in SAR. Improving the interpretability of CNNs is thus of great importance for their development and deployment in SAR. In this paper, a visual explanation method termed multi-weight self-matching class activation mapping (MS-CAM) is proposed. MS-CAM matches SAR images with the feature maps and corresponding gradients extracted by the CNN, and combines both channel-wise and element-wise weights to visualize the decision basis learned by the model in SAR images. Extensive experiments conducted on a self-constructed SAR target classification dataset demonstrate that MS-CAM more accurately highlights the network's regions of interest and captures detailed target feature information, thereby enhancing network interpretability. Furthermore, the feasibility of applying MS-CAM to weakly-supervised object localization is validated. Key factors affecting localization accuracy, such as pixel thresholds, are analyzed in depth to inform future work.
[60] Understanding and Harnessing Sparsity in Unified Multimodal Models
Shwai He,Chaorui Deng,Ang Li,Shen Yan
Main category: cs.CV
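TL;DR: This paper studies the inference efficiency of the components of unified multimodal models and finds that the understanding component is highly compressible while the generation components are sensitive to compression. It proposes a Mixture-of-Experts (MoE) adaptation with sparse activation that restores generation performance while activating only about half of the parameters.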
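Sparse activation in a Mixture-of-Experts layer is commonly realized with top-k gating: only the k highest-scoring experts run for a given input, and their outputs are mixed with renormalized weights. The sketch below shows that generic mechanism with scalar "experts". It is not the paper's MoE Adaptation of BAGEL; the function names, the gate logits, and the toy experts are assumptions.

```python
import math


def top_k_gate(logits, k):
    """Pick the k largest gate logits and softmax-normalize over just those experts."""
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)
    exps = {i: math.exp(logits[i] - peak) for i in chosen}  # subtract peak for stability
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sparse_moe(x, experts, gate_logits, k):
    """Evaluate only the selected experts and return their gate-weighted sum."""
    weights = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Four hypothetical experts; with k=2 only half of them run for this input.
experts = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: -x, lambda x: x ** 2]
gate_logits = [0.0, 0.0, -5.0, -5.0]
active = top_k_gate(gate_logits, k=2)
output = sparse_moe(3.0, experts, gate_logits, k=2)
```

In a trained model the gate logits come from a small router network applied to the input, so different samples activate different experts.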
Details
Motivation: Unified multimodal models that support multiple capabilities suffer from inference inefficiency: individual tasks may not need the full model, yet there is no systematic understanding of how efficiency differs across components. Method: Training-free pruning is used as a probe to analyze how depth pruning and width reduction affect each component, and a Mixture-of-Experts (MoE) adaptation with sparse activation is proposed. Result: The understanding component is compressible across tasks, whereas the generation components are sensitive to compression; with MoE adaptation, the BAGEL model matches full-model performance while activating only about half of its parameters. Conclusion: By analyzing the redundancy of components inside unified multimodal models, the paper proposes a sparsely activated MoE adaptation that improves inference efficiency and offers a new direction for efficient multimodal modeling. Abstract: Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters.
The code is released at https://github.com/Shwai-He/SparseUnifiedModel.
[61] WSCF-MVCC: Weakly-supervised Calibration-free Multi-view Crowd Counting
Bin Li,Daijie Chen,Qi Zhang
Main category: cs.CV
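TL;DR: This paper proposes a weakly-supervised, calibration-free multi-view crowd counting method (WSCF-MVCC) that uses crowd counts directly as supervision and adds a self-supervised ranking loss and semantic information to improve performance, outperforming existing methods on multiple datasets.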
Details
Motivation: Existing calibration-free multi-view crowd counting methods still need expensive image-level annotations to train the single-view counting module, which limits practical use. Method: WSCF-MVCC is supervised with total crowd counts rather than density maps, uses a self-supervised ranking loss with multi-scale priors to strengthen perception, and leverages semantic information to improve view matching. Result: Under weakly supervised settings, it outperforms state-of-the-art methods on three widely used multi-view crowd counting datasets. Conclusion: The method reduces dependence on annotations and camera calibration, making it better suited to practical deployment. Abstract: Multi-view crowd counting can effectively mitigate occlusion issues that commonly arise in single-image crowd counting. Existing deep-learning multi-view crowd counting methods project different camera view images onto a common space to obtain ground-plane density maps, requiring abundant and costly crowd annotations and camera calibrations. Hence, calibration-free methods are proposed that do not require camera calibrations and scene-level crowd annotations. However, existing calibration-free methods still require expensive image-level crowd annotations for training the single-view counting module. Thus, in this paper, we propose a weakly-supervised calibration-free multi-view crowd counting method (WSCF-MVCC), directly using crowd count as supervision for the single-view counting module rather than density maps constructed from crowd annotations. In addition, a self-supervised ranking loss that leverages multi-scale priors is utilized to enhance the model's perceptual ability without additional annotation costs. Furthermore, the proposed model leverages semantic information to achieve more accurate view matching and, consequently, a more precise scene-level crowd count estimation. The proposed method outperforms the state-of-the-art methods on three widely used multi-view counting datasets under weakly supervised settings, indicating that it is more suitable for practical deployment compared with calibrated methods. Code is released at https://github.com/zqyq/Weakly-MVCC.
[62] VACoT: Rethinking Visual Data Augmentation with VLMs
Zhengzhuo Xu,Chong Sun,SiNan Du,Chen Li,Jing Lyu,Chun Yuan
Main category: cs.CV
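TL;DR: This paper proposes Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations at inference time, significantly improving the robustness of vision-language models on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios.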
Details
Motivation: Vision-language models (VLMs) rely mainly on large-scale real or synthetic data and lack effective visual data augmentation, so they can perform poorly on basic perception tasks, and they are expensive to train. An efficient, dynamic augmentation method is needed that improves robustness without adding training burden. Method: VACoT introduces post-hoc image augmentations (such as denoising) during inference, integrates a structured collection of general visual augmentations, and is optimized with efficient agentic reinforcement learning. A conditional reward scheme encourages necessary augmentation and penalizes verbose responses, yielding concise and effective perceptual reasoning. Result: Extensive experiments on 13 perception benchmarks demonstrate VACoT's superiority, and a new adversarial OCR dataset, AdvOCR, shows that post-hoc visual augmentation generalizes well in adversarial scenarios. Conclusion: Through dynamic visual augmentation at inference time, VACoT improves the perceptual robustness of VLMs on challenging and out-of-distribution inputs while reducing training complexity and computational overhead, pointing to a new direction for augmentation in VLMs. Abstract: While visual data augmentation remains a cornerstone for training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-scale real data acquisition or synthetic diversity. Consequently, they may struggle with basic perception tasks that conventional models handle reliably. Given the substantial cost of pre-training and fine-tuning VLMs, continued training on augmented data yields limited and diminishing returns. In this paper, we present Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations during model inference. By incorporating post-hoc transformations such as denoising, VACoT substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Distinct from prior approaches limited to local cropping, VACoT integrates a structured collection of general visual augmentations, broadening the query image views while reducing training complexity and computational overhead with efficient agentic reinforcement learning. We propose a conditional reward scheme that encourages necessary augmentation while penalizing verbose responses, ensuring concise and effective reasoning in perception tasks.
We demonstrate the superiority of VACoT with extensive experiments on 13 perception benchmarks and further introduce AdvOCR to highlight the generalization benefits of post-hoc visual augmentations in adversarial scenarios.
[63] Tackling Tuberculosis: A Comparative Dive into Machine Learning for Tuberculosis Detection
Daanish Hindustani,Sanober Hindustani,Preston Nguyen
Main category: cs.CV
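TL;DR: This study compares a pretrained ResNet-50 and a SqueezeNet model for diagnosing tuberculosis (TB) from chest X-ray images, and finds that SqueezeNet performs better in accuracy, precision, and F1 score.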
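The F1 scores reported for this entry can be sanity-checked from the stated precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic using the rounded percentages given in the details that follow; the function name is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Rounded figures as reported in this entry.
squeezenet_f1 = f1_score(0.98, 0.80)  # about 0.881; the entry reports 87%
resnet50_f1 = f1_score(0.88, 0.52)    # about 0.654; the entry reports 65%
```

The recomputed values sit within about one percentage point of the reported F1 scores; the small gap for SqueezeNet is presumably due to rounding or truncation of the published figures, which cannot be confirmed from the abstract alone.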
Details
Motivation: Because traditional TB diagnostic methods are inefficient in resource-limited regions, the study explores deep-learning alternatives to improve diagnostic speed and accessibility. Method: ResNet-50 and SqueezeNet models are trained on a Kaggle dataset of 4,200 chest X-ray images; preprocessing includes data splitting, augmentation, and resizing, and models are evaluated with accuracy, precision, recall, and the confusion matrix. Result: SqueezeNet achieves 89% accuracy, 98% precision, 80% recall, and an 87% F1 score; ResNet-50 achieves 73% accuracy, 88% precision, 52% recall, and a 65% F1 score. Conclusion: SqueezeNet outperforms ResNet-50 on TB detection, showing the potential of machine learning models to support early TB screening in under-resourced regions; further work is needed on model speed, size, and accuracy. Abstract: This study explores the application of machine learning models, specifically a pretrained ResNet-50 model and a general SqueezeNet model, in diagnosing tuberculosis (TB) using chest X-ray images. TB, a persistent infectious disease affecting humanity for millennia, poses challenges in diagnosis, especially in resource-limited settings. Traditional methods, such as sputum smear microscopy and culture, are inefficient, prompting the exploration of advanced technologies like deep learning and computer vision. The study utilized a dataset from Kaggle, consisting of 4,200 chest X-rays, to develop and compare the performance of the two machine learning models. Preprocessing involved data splitting, augmentation, and resizing to enhance training efficiency. Evaluation metrics, including accuracy, precision, recall, and confusion matrix, were employed to assess model performance. Results show that SqueezeNet achieved a loss of 32%, accuracy of 89%, precision of 98%, recall of 80%, and an F1 score of 87%. In contrast, the ResNet-50 model exhibited a loss of 54%, accuracy of 73%, precision of 88%, recall of 52%, and an F1 score of 65%. This study emphasizes the potential of machine learning in TB detection and possible implications for early identification and treatment initiation. The possibility of integrating such models into mobile devices expands their utility in areas lacking TB detection resources.
However, despite promising results, the need for continued development of faster, smaller, and more accurate TB detection models remains crucial in contributing to the global efforts in combating TB.
[64] Multi-Domain Enhanced Map-Free Trajectory Prediction with Selective Attention
Wenyi Xiong,Jian Chen
Main category: cs.CV
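TL;DR: This paper proposes a map-free trajectory prediction algorithm that operates across the temporal, spatial, and frequency domains, using a Mixture-of-Experts mechanism and a selective attention module to improve prediction accuracy and computational efficiency in complex interactive scenarios.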
Details
Motivation: Existing methods struggle to extract useful scene information efficiently in complex interactive scenarios, reducing computational efficiency and prediction accuracy, especially when the data are highly redundant. Method: A Mixture of Experts (MoE) mechanism adaptively selects critical frequency components, combined with multi-scale temporal feature extraction; a selective attention module filters out redundant temporal and spatial information; and a multimodal decoder is trained under patch-level and point-level loss supervision. Result: Experiments on the nuScenes dataset show that the method outperforms existing methods in complex interactive scenarios, with notable gains in prediction accuracy and computational efficiency. Conclusion: By jointly modeling the temporal, spatial, and frequency domains, the proposed map-free trajectory prediction algorithm handles redundant information in complex interactive scenarios effectively and offers practical value and performance advantages. Abstract: Trajectory prediction is crucial for the reliability and safety of autonomous driving systems, yet it remains a challenging task in complex interactive scenarios. Existing methods often struggle to efficiently extract valuable scene information from redundant data, thereby reducing computational efficiency and prediction accuracy, especially when dealing with intricate agent interactions. To address these challenges, we propose a novel map-free trajectory prediction algorithm that achieves trajectory prediction across the temporal, spatial, and frequency domains. Specifically, in temporal information processing, we utilize a Mixture of Experts (MoE) mechanism to adaptively select critical frequency components. Concurrently, we extract these components and integrate multi-scale temporal features. Subsequently, a selective attention module is proposed to filter out redundant information in both temporal sequences and spatial interactions. Finally, we design a multimodal decoder. Under the supervision of patch-level and point-level losses, we obtain reasonable trajectory results. Experiments on the nuScenes dataset demonstrate the superiority of our algorithm, validating its effectiveness in handling complex interactive scenarios.
[65] SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains
Qingmei Li,Yang Zhang,Peifeng Zhang,Haohuan Fu,Juepeng Zheng
Main category: cs.CV
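TL;DR: This paper proposes SAGE, a style-adaptive generalization framework that improves the domain generalization of frozen models for semantic segmentation under privacy constraints. It generates dynamic visual prompts that implicitly align feature distributions across styles without modifying model parameters.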
Details
Motivation: Privacy and security constraints prevent conventional methods from accessing model parameters for fine-tuning, so an input-level strategy that does not modify model weights is needed to counter the performance degradation caused by domain shift. Method: SAGE uses style transfer to build a diverse style representation of the source domain, learns style characteristics that cover a wide range of visual features, and adaptively fuses these style cues according to each input's visual context to form a dynamic prompt that harmonizes image appearance. Result: Experiments on five benchmark datasets show that, under privacy constraints, SAGE matches or outperforms state-of-the-art methods and outperforms full fine-tuning baselines in all settings. Conclusion: Through input-level style-adaptive prompting, SAGE bridges the gap between frozen-model invariance and the diversity of unseen domains, offering a new approach to domain generalization under privacy constraints. Abstract: Domain generalization for semantic segmentation aims to mitigate the degradation in model performance caused by domain shifts. However, in many real-world scenarios, we are unable to access the model parameters and architectural details due to privacy concerns and security constraints. Traditional fine-tuning or adaptation is hindered, leading to the demand for input-level strategies that can enhance generalization without modifying model weights. To this end, we propose a Style-Adaptive GEneralization framework (SAGE), which improves the generalization of frozen models under privacy constraints. SAGE learns to synthesize visual prompts that implicitly align feature distributions across styles instead of directly fine-tuning the backbone. Specifically, we first utilize style transfer to construct a diverse style representation of the source domain, thereby learning a set of style characteristics that can cover a wide range of visual features. Then, the model adaptively fuses these style cues according to the visual context of each input, forming a dynamic prompt that harmonizes the image appearance without touching the interior of the model. Through this closed-loop design, SAGE effectively bridges the gap between frozen model invariance and the diversity of unseen domains.
Extensive experiments on five benchmark datasets demonstrate that SAGE achieves competitive or superior performance compared to state-of-the-art methods under privacy constraints and outperforms full fine-tuning baselines in all settings.
[66] On-the-fly Feedback SfM: Online Explore-and-Exploit UAV Photogrammetry with Incremental Mesh Quality-Aware Indicator and Predictive Path Planning
Liyuan Lou,Wanyun Li,Wentian Gan,Yifei Yu,Tengfei Wang,Xin Wang,Zongqian Zhan
Main category: cs.CV
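TL;DR: This paper proposes On-the-fly Feedback SfM, a real-time UAV photogrammetry framework that combines online incremental expansion of the sparse point cloud, mesh quality assessment, and predictive path planning to reconstruct and give feedback during flight, improving data-acquisition efficiency in time-critical applications such as disaster response.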
Details
Motivation: Existing real-time UAV photogrammetry methods neither evaluate reconstruction quality on the fly nor provide feedback to guide image acquisition, so they fall short of the needs of time-critical geospatial applications. Method: An explore-and-exploit framework is built from three modules: (1) online incremental coarse-mesh generation; (2) online mesh quality assessment with actionable indicators; and (3) predictive path planning to refine the flight trajectory dynamically. Result: Experiments show that the method performs in-situ 3D reconstruction and evaluation in near real time and markedly reduces coverage gaps and re-flight costs. Conclusion: The method supports a shift from the traditional passive working mode to an intelligent, adaptive exploration workflow, raising the level of automation and intelligence in UAV photogrammetry. Abstract: Compared with conventional offline UAV photogrammetry, real-time UAV photogrammetry is essential for time-critical geospatial applications such as disaster response and active digital-twin maintenance. However, most existing methods focus on processing captured images or sequential frames in real time, without explicitly evaluating the quality of the on-the-go 3D reconstruction or providing guided feedback to enhance image acquisition in the target area. This work presents On-the-fly Feedback SfM, an explore-and-exploit framework for real-time UAV photogrammetry, enabling iterative exploration of unseen regions and exploitation of already observed and reconstructed areas in near real time. Built upon SfM on-the-fly, the proposed method integrates three modules: (1) online incremental coarse-mesh generation for a dynamically expanding sparse 3D point cloud; (2) online mesh quality assessment with actionable indicators; and (3) predictive path planning for on-the-fly trajectory refinement. Comprehensive experiments demonstrate that our method achieves in-situ reconstruction and evaluation in near real time while providing actionable feedback that markedly reduces coverage gaps and re-flight costs. Via the integration of data collection, processing, 3D reconstruction and assessment, and online feedback, our On-the-fly Feedback SfM could be an alternative for the transition from the traditional passive working mode to a more intelligent and adaptive exploration workflow. Code is now available at https://github.com/IRIS-LAB-whu/OntheflySfMFeedback.
[67] From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking
Yuqing Shao,Yuchen Yang,Rui Yu,Weilong Li,Xu Guo,Huaicheng Yan,Wei Wang,Xiao Sun
Main category: cs.CV
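TL;DR: This paper proposes FDTA, an end-to-end multi-object tracking framework that strengthens the discriminative power of object embeddings through spatial, temporal, and identity adapters, markedly improving cross-frame instance association accuracy.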
Details
Motivation: Existing end-to-end MOT methods detect well but associate poorly, mainly because the object embeddings produced by the shared DETR architecture lack instance-level discrimination and cannot meet the requirement of spatio-temporal continuity across frames. Method: Propose the FDTA framework with three components: a Spatial Adapter (SA) that introduces depth-aware cues to strengthen spatial continuity; a Temporal Adapter (TA) that aggregates historical information to model temporal dependencies; and an Identity Adapter (IA) that uses quality-aware contrastive learning to improve instance-level separability. Result: Achieves state-of-the-art performance on several challenging MOT benchmarks, including DanceTrack, SportsMOT, and BFT, validating the proposed embedding enhancement strategy. Conclusion: Through explicit feature refinement, FDTA improves the discriminativeness of object embeddings from three complementary perspectives (spatial, temporal, and identity), addressing the imbalance between detection and association in current end-to-end MOT methods. Abstract: End-to-end multi-object tracking (MOT) methods have recently achieved remarkable progress by unifying detection and association within a single framework. Despite their strong detection performance, these methods suffer from relatively low association accuracy. Through detailed analysis, we observe that object embeddings produced by the shared DETR architecture display excessively high inter-object similarity, as it emphasizes only category-level discrimination within single frames. In contrast, tracking requires instance-level distinction across frames with spatial and temporal continuity, for which current end-to-end approaches insufficiently optimize object embeddings. To address this, we introduce FDTA (From Detection to Association), an explicit feature refinement framework that enhances object discriminativeness across three complementary perspectives. Specifically, we introduce a Spatial Adapter (SA) to integrate depth-aware cues for spatial continuity, a Temporal Adapter (TA) to aggregate historical information for temporal dependencies, and an Identity Adapter (IA) to leverage quality-aware contrastive learning for instance-level separability. Extensive experiments demonstrate that FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks, including DanceTrack, SportsMOT, and BFT, highlighting the effectiveness of our proposed discriminative embedding enhancement strategy. The code is available at https://github.com/Spongebobbbbbbbb/FDTA.
[68] Reproducing and Extending RaDelft 4D Radar with Camera-Assisted Labels
Kejia Hu,Mohammed Alsakabi,John M. Dolan,Ozan K. Tonguz
Main category: cs.CV
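TL;DR: Proposes a camera-guided labeling method for 4D radar point clouds that generates accurate radar labels without human annotation, and establishes a reproducible framework to advance radar semantic segmentation research.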
Details
Motivation: Existing 4D radar datasets lack public annotations and code, which limits research on, and reproduction of, radar semantic segmentation. Method: Project radar point clouds into camera-based semantic segmentation results and apply spatial clustering to obtain camera-guided radar labels. Result: Reproduces the RaDelft results, generates more accurate radar labels, and quantifies how different fog levels affect labeling performance. Conclusion: The method offers a reproducible and efficient solution for labeling 4D radar data and supports further progress in the field. Abstract: Recent advances in 4D radar highlight its potential for robust environment perception under adverse conditions, yet progress in radar semantic segmentation remains constrained by the scarcity of open source datasets and labels. The RaDelft dataset, although seminal, provides only LiDAR annotations and no public code to generate radar labels, limiting reproducibility and downstream research. In this work, we reproduce the numerical results of the RaDelft group and demonstrate that a camera-guided radar labeling pipeline can generate accurate labels for radar point clouds without relying on human annotations. By projecting radar point clouds into camera-based semantic segmentation and applying spatial clustering, we create labels that significantly enhance the accuracy of radar labels. These results establish a reproducible framework that allows the research community to train and evaluate the labeled 4D radar data. In addition, we study and quantify how different fog levels affect the radar labeling performance.
[69] Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
Yifan Zhang,Liang Hu,Haofeng Sun,Peiyu Wang,Yichen Wei,Shukang Yin,Jiangbo Pei,Wei Shen,Peng Xia,Yi Peng,Tianyidan Xie,Eric Li,Yang Liu,Xuchen Song,Yahui Zhou
Main category: cs.CV
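TL;DR: Skywork-R1V4 is a 30B-parameter multimodal agentic model that unifies multimodal planning, image manipulation, and deep search through supervised fine-tuning alone, without reinforcement learning, and reaches state-of-the-art results on several benchmarks.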
Details
Motivation: Existing approaches usually separate image manipulation from web search, depend on costly reinforcement learning, and lack planning grounded in real tool-execution traces. Method: Propose Skywork-R1V4, which integrates multimodal planning, active image manipulation, deep multimodal search, and interleaved reasoning; the model is trained by supervised fine-tuning alone on fewer than 30,000 high-quality, planning-execution-consistent trajectories, with the data validated through stepwise consistency filtering. Result: Scores 66.1 on MMSearch and 67.2 on FVQA and surpasses Gemini 2.5 Flash on all 11 metrics; shows emergent long-horizon reasoning, orchestrating more than 10 tool calls to solve complex tasks. Conclusion: Sophisticated multimodal agentic capabilities can be achieved through carefully designed supervised learning alone, without reinforcement learning. Abstract: Despite recent progress in multimodal agentic systems, existing approaches often treat image manipulation and web search as disjoint capabilities, rely heavily on costly reinforcement learning, and lack planning grounded in real tool-execution traces. To address these limitations, we present Skywork-R1V4, a 30B (A3B) parameter multimodal agentic model that unifies multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, most critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30,000 high-quality, planning-execution-consistent trajectories and validated through stepwise consistency filtering, Skywork-R1V4 achieves state-of-the-art results across perception and multimodal search benchmarks: it scores 66.1 on MMSearch and 67.2 on FVQA, surpassing Gemini 2.5 Flash on all 11 metrics. Skywork-R1V4 exhibits emergent long-horizon reasoning at inference time, successfully orchestrating more than 10 tool calls to solve complex, multi-step tasks. Our results demonstrate that sophisticated agentic multimodal intelligence can be achieved through carefully curated supervised learning alone, without any reliance on reinforcement learning.
[70] Nav-$R^2$ Dual-Relation Reasoning for Generalizable Open-Vocabulary Object-Goal Navigation
Wentao Xiang,Haokang Zhang,Tianhang Yang,Zedong Chu,Ruihang Chu,Shichao Xie,Yujian Yuan,Jian Sun,Zhining Gu,Junjie Wang,Xiaolong Wu,Mu Xu,Yujiu Yang
Main category: cs.CV
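TL;DR: This paper proposes Nav-$R^2$, a framework that improves open-vocabulary object-goal navigation through structured chain-of-thought reasoning and a similarity-aware memory, markedly raising the success rate of locating unseen objects.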
Details
Motivation: Existing methods for open-vocabulary object-goal navigation suffer from opaque decision-making and low success rates in locating unseen objects. Method: Propose the Nav-$R^2$ framework, which combines structured Chain-of-Thought reasoning with a Similarity-Aware Memory (SA-Mem) to model target-environment relations and environment-action planning, and construct the Nav$R^2$-CoT dataset to train the model to perceive the environment and plan actions. Result: Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects, avoids overfitting to seen categories, supports real-time inference at 2Hz, and adds no extra parameters. Conclusion: Through explicit relation modeling and an efficient memory mechanism, Nav-$R^2$ improves both the ability to locate unseen objects and the transparency of navigation decisions. Abstract: Object-goal navigation in open-vocabulary settings requires agents to locate novel objects in unseen environments, yet existing approaches suffer from opaque decision-making processes and low success rates in locating unseen objects. To address these challenges, we propose Nav-$R^2$, a framework that explicitly models two critical types of relationships, target-environment modeling and environment-action planning, through structured Chain-of-Thought (CoT) reasoning coupled with a Similarity-Aware Memory. We construct a Nav$R^2$-CoT dataset that teaches the model to perceive the environment, focus on target-related objects in the surrounding context and finally make future action plans. Our SA-Mem preserves the most target-relevant and current observation-relevant features from both temporal and semantic perspectives by compressing video frames and fusing historical observations, while introducing no additional parameters. Compared to previous methods, Nav-$R^2$ achieves state-of-the-art performance in localizing unseen objects through a streamlined and efficient pipeline, avoiding overfitting to seen object categories while maintaining real-time inference at 2Hz. Resources will be made publicly available at https://github.com/AMAP-EAI/Nav-R2.
[71] WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate
Anoop Cherian,River Doyle,Eyal Ben-Dov,Suhas Lohit,Kuan-Chuan Peng
Main category: cs.CV
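TL;DR: This paper proposes WISE, a multi-agent debate framework for vision-and-language reasoning. It divides agents into Solvers, which generate solutions, and Reflectors, which assess correctness and give feedback, and aggregates multi-round debate results with a modified Dawid-Skene algorithm, markedly improving accuracy on a range of multimodal tasks.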
Details
Motivation: Existing multi-agent debate methods have mainly been applied to language-only tasks and remain underexplored for multimodal (for example, vision-and-language) reasoning, so a general framework is needed that combines the strengths of single-modal and multimodal experts. Method: Propose the WISE framework with two roles, Solver and Reflector, and a weighted iterative mechanism; introduce a modified Dawid-Skene algorithm that aggregates the results of the two-stage debate and supports collaboration among heterogeneous experts. Result: Experiments on SMART-840, VisualPuzzles, EvoChart-QA, and the newly built SMART-840++ dataset show that WISE improves accuracy by 2-7% over state-of-the-art MAD setups and aggregation methods. Conclusion: WISE is an effective and extensible multi-agent debate framework that exploits the strengths of heterogeneous models in multimodal reasoning and markedly improves reasoning performance. Abstract: Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, which generate solutions, and Reflectors, which verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.
[72] MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture
Dmitriy Parashchuk,Alexey Kapshitskiy,Yuriy Karyakin
Main category: cs.CV
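TL;DR: This paper proposes MitUNet, a hybrid neural network for semantic segmentation of walls, a step in high-precision 3D reconstruction of indoor spaces from 2D floor plans. The model combines a Mix-Transformer encoder with a U-Net decoder that uses scSE attention blocks and is optimized with the Tversky loss, markedly improving boundary accuracy and sensitivity to thin structures; it outperforms existing methods on both a public and a proprietary dataset.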
Details
Motivation: Existing methods optimized for standard metrics struggle to detect thin walls and to produce masks with regular boundaries, so they lack the geometric precision required for subsequent vectorization. Method: Propose MitUNet: a hierarchical Mix-Transformer encoder captures global context, a U-Net decoder with scSE attention blocks recovers boundaries precisely, and the hyperparameters of a Tversky loss are tuned to balance precision and recall, suppressing false positives along boundaries while staying sensitive to thin structures. Result: Experiments on the public CubiCasa5k dataset and a proprietary regional dataset show that MitUNet produces structurally correct masks with high boundary accuracy and outperforms standard single-task models. Conclusion: MitUNet provides a robust wall segmentation solution for data preparation in automated 3D reconstruction pipelines. Abstract: Automatic 3D reconstruction of indoor spaces from 2D floor plans requires high-precision semantic segmentation of structural elements, particularly walls. However, existing methods optimized for standard metrics often struggle to detect thin structural components and yield masks with irregular boundaries, lacking the geometric precision required for subsequent vectorization. To address this issue, we introduce MitUNet, a hybrid neural network architecture specifically designed for wall segmentation tasks in the context of 3D modeling. In MitUNet, we utilize a hierarchical Mix-Transformer encoder to capture global context and a U-Net decoder enhanced with scSE attention blocks for precise boundary recovery. Furthermore, we propose an optimization strategy based on the Tversky loss function to effectively balance precision and recall. By fine-tuning the hyperparameters of the loss function, we prioritize the suppression of false positive noise along wall boundaries while maintaining high sensitivity to thin structures. Our experiments on the public CubiCasa5k dataset and a proprietary regional dataset demonstrate that the proposed approach ensures the generation of structurally correct masks with high boundary accuracy, outperforming standard single-task models. MitUNet provides a robust tool for data preparation in automated 3D reconstruction pipelines.
[73] Generalizing Vision-Language Models with Dedicated Prompt Guidance
Xinyao Li,Yinjie Min,Hongbo Chen,Zhekai Du,Fengling Li,Jingjing Li
Main category: cs.CV
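TL;DR: This paper proposes GuiDG, a two-step domain-expert-guided domain generalization framework. It trains several parameter-efficient expert models on partitioned source domains and integrates them adaptively through a cross-modal attention module to guide fine-tuning of the vision encoder, improving the generalization of vision-language models to unseen domains.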
Details
Motivation: Existing fine-tuning methods for vision-language models usually train a single universal model on the whole dataset, which can hurt generalization to unseen domains and makes it hard to balance domain specificity against domain generalization (DG). Method: First use prompt tuning to obtain an expert model for each source domain, then introduce a cross-modal attention module that guides fine-tuning of the vision encoder through adaptive expert integration. ImageNet-DG is also constructed to evaluate few-shot domain generalization. Result: Experiments on standard DG benchmarks and ImageNet-DG show that GuiDG outperforms current state-of-the-art fine-tuning methods while remaining efficient. Conclusion: Training several parameter-efficient experts on the source domains and integrating them adaptively improves the domain generalization of vision-language models over fine-tuning a single universal model, consistent with the theoretical analysis. Abstract: Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.
[74] GUI Exploration Lab: Enhancing Screen Navigation in Agents via Multi-Turn Reinforcement Learning
Haolong Yan,Yeqing Shen,Xin Huang,Jia Wang,Kaijun Tan,Zhixuan Liang,Hongxin Li,Zheng Ge,Osamu Yoshie,Si Li,Xiangyu Zhang,Daxin Jiang
Main category: cs.CV
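TL;DR: This paper presents GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that supports flexible definition and composition of screens, icons, and navigation graphs and exposes full environment information for agent training and evaluation. Experiments show that supervised fine-tuning helps the agent memorize fundamental knowledge, single-turn reinforcement learning improves generalization to unseen scenarios, and multi-turn reinforcement learning further refines exploration strategies, markedly improving screen navigation performance.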
Details
Motivation: Real-world GUI environments are complex and proprietary, so comprehensive environment information is hard to obtain, which limits systematic study and benchmarking of agents on complex interface navigation. An open, controllable simulation environment is therefore needed to advance GUI agent research. Method: Propose the GUI Exploration Lab simulation environment engine, which supports flexible construction of GUI structures, and adopt a three-stage training recipe (supervised fine-tuning, single-turn reinforcement learning, and multi-turn reinforcement learning) that progressively improves the agent's navigation ability. Result: The approach is validated on static and interactive benchmarks: supervised fine-tuning lays the knowledge foundation, single-turn RL improves generalization, and multi-turn RL further refines exploration strategies, markedly improving overall navigation performance. Conclusion: Reinforcement learning has clear advantages for GUI navigation, and combining it with staged training improves agent generality and navigation ability, offering practical guidance for building more capable GUI agents. Abstract: With the rapid development of Large Vision Language Models, the focus of Graphical User Interface (GUI) agent tasks shifts from single-screen tasks to complex screen navigation challenges. However, real-world GUI environments, such as PC software and mobile Apps, are often complex and proprietary, making it difficult to obtain the comprehensive environment information needed for agent training and evaluation. This limitation hinders systematic investigation and benchmarking of agent navigation capabilities. To address this limitation, we introduce GUI Exploration Lab, a simulation environment engine for GUI agent navigation research that enables flexible definition and composition of screens, icons, and navigation graphs, while providing full access to environment information for comprehensive agent training and evaluation. Through extensive experiments, we find that supervised fine-tuning enables effective memorization of fundamental knowledge, serving as a crucial foundation for subsequent training. Building on this, single-turn reinforcement learning further enhances generalization to unseen scenarios. Finally, multi-turn reinforcement learning encourages the development of exploration strategies through interactive trial and error, leading to further improvements in screen navigation performance. We validate our methods on both static and interactive benchmarks, demonstrating that our findings generalize effectively to real-world scenarios.
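To make the idea of a navigation graph over screens and icons concrete, here is a minimal sketch. It is not the GUI Exploration Lab API: the `NAV_GRAPH` layout, the screen and icon names, and the `shortest_icon_path` helper are all hypothetical, chosen only to show how full access to the graph lets an evaluator compute a reference click sequence.

```python
from collections import deque

# Hypothetical navigation graph: screen -> {icon label: destination screen}.
NAV_GRAPH = {
    "home": {"settings_icon": "settings", "mail_icon": "inbox"},
    "settings": {"wifi_icon": "wifi", "back": "home"},
    "inbox": {"compose_icon": "compose", "back": "home"},
    "wifi": {"back": "settings"},
    "compose": {"back": "inbox"},
}


def shortest_icon_path(graph, start, goal):
    """Breadth-first search; returns the icons to click, or None if unreachable."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        screen, clicks = queue.popleft()
        if screen == goal:
            return clicks
        for icon, next_screen in graph.get(screen, {}).items():
            if next_screen not in visited:
                visited.add(next_screen)
                queue.append((next_screen, clicks + [icon]))
    return None
```

With the graph known in full, a reference path such as `shortest_icon_path(NAV_GRAPH, "home", "wifi")` can serve as a ground-truth trajectory against which an agent's clicks are scored.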
These findings demonstrate the advantages of reinforcement learning approaches in GUI navigation and offer practical guidance for building more capable and generalizable GUI agents.
[75] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Woongyeong Yeo,Kangsan Kim,Jaehong Yoon,Sung Ju Hwang
Main category: cs.CV
TL;DR: This paper proposes WorldMM, a multimodal memory agent that addresses limited context capacity and loss of visual detail in long video understanding.
Details
Motivation: When handling long-duration videos, existing methods are limited by their reliance on text and by retrieval at fixed temporal scales, so they struggle to capture visual evidence and multi-granularity events in complex scenes. Method: WorldMM builds three complementary memories: episodic memory (factual events at multiple temporal scales), semantic memory (continuously updated high-level conceptual knowledge), and visual memory (preserved scene details). At inference time an adaptive retrieval agent iteratively selects the most relevant memory source and retrieves at multiple temporal granularities. Result: WorldMM significantly outperforms existing baselines on five long video question-answering benchmarks, with an average gain of 8.4% over the previous state of the art. Conclusion: WorldMM improves long video reasoning through a multimodal, multi-granularity memory mechanism. Abstract: Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.
[76] LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework
Daeyoung Kim
Main category: cs.CV
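TL;DR: Proposes LightHCG, a lightweight convolutional VAE model based on causal representation learning for glaucoma detection, which improves classification performance with far fewer parameters and supports intervention analysis.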
Details
Motivation: Existing AI-driven glaucoma detection methods still fall short in reliability, parameter count, spurious correlations, and applicability to clinical intervention analysis. Method: Propose LightHCG, which combines HSIC-based latent space disentanglement with graph-autoencoder-based unsupervised causal representation learning to extract causal features of the optic nerve region from retinal images. Result: Matches or exceeds mainstream models on glaucoma classification with 93-99% fewer parameters, and shows potential for intervention analysis. Conclusion: Through causal representation learning, LightHCG delivers efficient, interpretable glaucoma detection suited to clinical simulation, pointing to a new direction for reliable AI in medical diagnosis. Abstract: As a representative optic degenerative condition, glaucoma has been a threat to millions due to its irreversibility and severe impact on the human visual field. Mainly characterized by dimmed and blurred vision, or peripheral vision loss, glaucoma is well known to occur due to damage to the optic nerve from increased intraocular pressure (IOP) or neovascularization within the retina. Traditionally, most glaucoma-related work and clinical diagnosis has focused on detecting this damage to the optic nerve by using patient data from perimetry tests, optic papilla inspections and tonometer-based IOP measurements. Recently, with advancements in computer vision AI models, such as VGG16 or Vision Transformers (ViT), AI-automated glaucoma detection and optic cup segmentation based on retinal fundus images or OCT have shown strong performance in aiding conventional diagnosis. However, current AI-driven glaucoma detection approaches still have significant room for improvement in terms of reliability, excessive parameter usage, possibility of spurious correlation within detection, and limitations in applications to intervention analysis or clinical simulations. Thus, this research introduces a novel causal representation driven glaucoma detection model: LightHCG, an extremely lightweight Convolutional VAE-based latent glaucoma representation model that can consider the true causality among glaucoma-related physical factors within the optic nerve region.
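The Method summary names HSIC (the Hilbert-Schmidt Independence Criterion) as the disentanglement signal. As a rough illustration of what such a penalty measures, the sketch below computes the standard biased empirical HSIC estimate, trace(KHLH)/(n-1)^2, between two scalar variables. The RBF kernel, the fixed bandwidth `sigma`, and the function names are assumptions made for this example; the digest does not say which kernel or estimator LightHCG uses.

```python
import math


def rbf_gram(values, sigma=1.0):
    """Gram matrix of an RBF kernel over a list of scalars (sigma is an assumed bandwidth)."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values] for a in values]


def double_center(gram):
    """Return H K H, where H = I - (1/n) * ones is the centering matrix."""
    n = len(gram)
    row_mean = [sum(row) / n for row in gram]
    col_mean = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [gram[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    k_centered = double_center(rbf_gram(xs, sigma))
    l_gram = rbf_gram(ys, sigma)
    # trace(HKH @ L) equals the elementwise sum because both matrices are symmetric.
    trace = sum(k_centered[i][j] * l_gram[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2
```

A value near zero suggests the two variables are close to independent, so minimizing pairwise HSIC between latent dimensions is one common way to encourage a disentangled latent space.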
Using HSIC-based latent space disentanglement and Graph Autoencoder based unsupervised causal representation learning, LightHCG not only exhibits higher performance in classifying glaucoma with 93-99% fewer weights, but also enhances the possibility of AI-driven intervention analysis, compared to existing advanced vision models such as InceptionV3, MobileNetV2 or VGG16.
[77] Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources
Phuc Pham,Nhu Pham,Ngoc Quoc Ly
Main category: cs.CV
TL;DR: 提出一种结合动量自蒸馏与梯度累积的高效视觉-语言模型训练方法,显著提升医疗领域小样本学习性能,同时降低计算资源需求。
Details
Motivation: In the medical domain, annotated data are scarce and compute is limited, while conventional contrastive learning needs large batches, which drives up cost and hinders wide adoption. A training method is therefore needed that raises model performance while reducing resource consumption. Method: Use momentum self-distillation to strengthen multimodal learning, and combine the momentum mechanism with gradient accumulation to enlarge the effective batch size without increasing GPU memory use, enabling efficient training on a single GPU. Result: Reaches state-of-the-art level in zero-shot classification, exceeds 90% AUC-ROC in few-shot adaptation, and improves retrieval by 2-3%, while training on a single GPU in reasonable time. Conclusion: By combining the momentum mechanism with knowledge distillation, the method improves multimodal performance in medical settings while lowering compute requirements, advancing efficient multimodal learning. Abstract: In healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. Contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. Moreover, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in few-shot adaptation, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods.
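The two generic ingredients named above, a momentum (exponential moving average) teacher and gradient accumulation, can be sketched on a toy one-parameter model. This is only an illustration under stated assumptions: the model `y_hat = w * x`, the MSE loss, and every function name are hypothetical and do not reproduce the paper's implementation.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the toy model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average the gradients of equal-sized micro-batches.

    For a per-sample loss such as MSE this equals the full-batch gradient,
    while only one micro-batch has to be held in memory at a time.
    """
    chunks = [data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)]
    return sum(grad_mse(w, chunk) for chunk in chunks) / len(chunks)


def ema_update(teacher_w, student_w, momentum=0.99):
    """Momentum update: the teacher slowly tracks the student."""
    return momentum * teacher_w + (1 - momentum) * student_w
```

One caveat worth keeping in mind: with an in-batch contrastive loss the negatives come from the current micro-batch, so plain accumulation is not equivalent to a genuinely larger batch. The abstract says the momentum mechanism is combined with accumulation to enlarge the effective batch size, but it does not detail how.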
The implementation of our method is available at https://github.com/phphuc612/MSD.
[78] Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation
Junghwan Park,Woojin Cho,Junhyuk Heo,Darongsae Kwon,Kookjin Lee
Main category: cs.CV
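TL;DR: Proposes the BOLT framework, which extracts a task-informed orthogonal spectral basis and adapts within that subspace, enabling efficient transfer learning to new tasks.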
Details
Motivation: Transferring large pre-trained models to unseen tasks under tight data and compute budgets remains challenging; existing meta-learning methods are costly and unstable, and the reuse of multiple task-specific fine-tuned models for transfer is underexplored. Method: BOLT has two phases: an offline phase extracts dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases; an online phase freezes these bases and trains only a small set of diagonal coefficients per layer, yielding a low-rank update. Result: Experiments show that, with very few trainable parameters, BOLT performs more robustly than common PEFT methods and a meta-learned initialization. Conclusion: Constraining adaptation to a task-informed orthogonal subspace is an effective alternative for unseen-task transfer. Abstract: Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients, along with a lightweight rescaling step while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.
[79] Temporal Dynamics Enhancer for Directly Trained Spiking Object Detectors
Fan Luo,Zeyu Gao,Xinhao Luo,Kai Zhao,Yanfeng Lu
Main category: cs.CV
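TL;DR: Proposes the Temporal Dynamics Enhancer (TDE), which strengthens the temporal modeling of spiking neural networks through a Spiking Encoder and an Attention Gating Module and lowers energy use with Spike-Driven Attention, achieving higher accuracy and energy efficiency in object detection.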
Details
Motivation: Existing SNNs feed inputs with little diversity across time steps, which limits temporal expressiveness and makes complex tasks such as object detection difficult. Method: Design TDE, comprising a Spiking Encoder (SE) that generates diverse input stimuli and an Attention Gating Module (AGM) that guides the SE based on temporal dependencies, and propose a multiplication-free Spike-Driven Attention (SDA) to reduce energy consumption. Result: TDE reaches 57.7% mAP50-95 on PASCAL VOC and 47.6% on EvDET200K; SDA uses only 24.0% of the energy of conventional attention modules. Conclusion: TDE markedly improves the temporal modeling ability and energy efficiency of SNNs and can be integrated into existing SNN detectors, supporting their use in complex tasks. Abstract: Spiking Neural Networks (SNNs), with their brain-inspired spatiotemporal dynamics and spike-driven computation, have emerged as promising energy-efficient alternatives to Artificial Neural Networks (ANNs). However, existing SNNs typically replicate inputs directly or aggregate them into frames at fixed intervals. Such strategies lead to neurons receiving nearly identical stimuli across time steps, severely limiting the model's expressive power, particularly in complex tasks like object detection. In this work, we propose the Temporal Dynamics Enhancer (TDE) to strengthen SNNs' capacity for temporal information modeling. TDE consists of two modules: a Spiking Encoder (SE) that generates diverse input stimuli across time steps, and an Attention Gating Module (AGM) that guides the SE generation based on inter-temporal dependencies. Moreover, to eliminate the high-energy multiplication operations introduced by the AGM, we propose a Spike-Driven Attention (SDA) to reduce attention-related energy consumption. Extensive experiments demonstrate that TDE can be seamlessly integrated into existing SNN-based detectors and consistently outperforms state-of-the-art methods, achieving mAP50-95 scores of 57.7% on the static PASCAL VOC dataset and 47.6% on the neuromorphic EvDET200K dataset. In terms of energy consumption, the SDA consumes only 0.240 times the energy of conventional attention modules.
[80] nuScenes Revisited: Progress and Challenges in Autonomous Driving
Whye Kit Fong,Venice Erin Liong,Kok Seang Tan,Holger Caesar
Main category: cs.CV
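TL;DR: This paper revisits the widely used autonomous driving dataset nuScenes, reveals technical details of its creation, discusses its influence on later datasets and community standards, and surveys the main methodological advances built on it.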
Details
Motivation: Because deep learning for autonomous driving depends on large amounts of labeled data, the authors aim to review systematically how nuScenes was built, what influence it has had, and its role in promoting multi-modal fusion and standardization. Method: Analyze the design philosophy, sensor configuration, and collection process of nuScenes and its extensions (nuImages and Panoptic nuScenes), and trace its influence on other datasets and research tasks. Result: Reveals previously unpublished details of how nuScenes was built, shows its wide use in perception, localization, prediction, and planning, and summarizes the standards it established in academia and its lasting influence. Conclusion: As a cornerstone dataset for autonomous driving, nuScenes has advanced multi-modal fusion and standardization and has shaped later dataset design and algorithm research. Abstract: Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS) have been revolutionized by Deep Learning. As a data-driven approach, Deep Learning relies on vast amounts of driving data, typically labeled in great detail. As a result, datasets, alongside hardware and algorithms, are foundational building blocks for the development of AVs. In this work we revisit one of the most widely used autonomous driving datasets: the nuScenes dataset. nuScenes exemplifies key trends in AV development, being the first dataset to include radar data, to feature diverse urban driving scenes from two continents, and to be collected using a fully autonomous vehicle operating on public roads, while also promoting multi-modal sensor fusion, standardized benchmarks, and a broad range of tasks including perception, localization & mapping, prediction and planning. We provide an unprecedented look into the creation of nuScenes, as well as its extensions nuImages and Panoptic nuScenes, summarizing many technical details that have hitherto not been revealed in academic publications. Furthermore, we trace how the influence of nuScenes impacted a large number of other datasets that were released later and how it defined numerous standards that are used by the community to this day. Finally, we present an overview of both official and unofficial tasks using the nuScenes dataset and review major methodological developments, thereby offering a comprehensive survey of the autonomous driving literature, with a particular focus on nuScenes.
[81] HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild
Valentin Bieri,Marie-Julie Rakotosaona,Keisuke Tateno,Francis Engelmann,Leonidas Guibas
Main category: cs.CV
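TL;DR: This paper introduces HouseLayout3D, a real-world benchmark for full-building-scale layout estimation, and MultiFloor3D, a training-free baseline that outperforms existing methods on multi-floor and architecturally complex spaces.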
Details
Motivation: Existing 3D layout estimation models are trained mainly on synthetic data of single floors or simple rooms, cannot handle multi-floor buildings, and lose the spatial context of cross-floor structures such as staircases when scenes are split by floor. Method: Introduce the HouseLayout3D real-world benchmark and design MultiFloor3D, a training-free baseline that uses recent scene understanding techniques for multi-floor layout estimation. Result: MultiFloor3D outperforms existing 3D layout estimation models on both the new benchmark and prior datasets. Conclusion: Research on full-building-scale layout estimation is needed, and a training-free method already performs strongly, pointing to a direction for future work. Abstract: Current 3D layout estimation models are primarily trained on synthetic datasets containing simple single-room or single-floor environments. As a consequence, they cannot natively handle large multi-floor buildings and require scenes to be split into individual floors before processing, which removes global spatial context that is essential for reasoning about structures such as staircases that connect multiple levels. In this work, we introduce HouseLayout3D, a real-world benchmark designed to support progress toward full-building-scale layout estimation, including multiple floors and architecturally intricate spaces. We also present MultiFloor3D, a simple training-free baseline that leverages recent scene understanding methods and already outperforms existing 3D layout estimation models on both our benchmark and prior datasets, highlighting the need for further research in this direction. Data and code are available at: https://houselayout3d.github.io.
[82] ClusterStyle: Modeling Intra-Style Diversity with Prototypical Clustering for Stylized Motion Generation
Kerui Chen,Jianrong Zhang,Ming Li,Zhonglong Zheng,Hehe Fan
Main category: cs.CV
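TL;DR: This paper proposes ClusterStyle, a clustering-based framework that addresses the challenge of capturing diversity within a single style in stylized motion generation.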
Details
Motivation: Existing models struggle to capture motion diversity within one style category, where a single style should correspond to many motion variations. Method: Use a set of prototypes to model the diverse style patterns among motions of the same style category, and build two structured style embedding spaces, global and local, optimized through alignment with non-learnable prototype anchors; also add a Stylistic Modulation Adapter (SMA) to a pretrained text-to-motion generation model to integrate the style features. Result: Extensive experiments show that the method outperforms existing state-of-the-art models on stylized motion generation and motion style transfer. Conclusion: ClusterStyle models intra-style diversity effectively, improving the quality and diversity of stylized motion generation. Abstract: Existing stylized motion generation models have shown their remarkable ability to understand specific style information from the style motion, and insert it into the content motion. However, capturing intra-style diversity, where a single style should correspond to diverse motion variations, remains a significant challenge. In this paper, we propose a clustering-based framework, ClusterStyle, to address this limitation. Instead of learning an unstructured embedding from each style motion, we leverage a set of prototypes to effectively model diverse style patterns across motions belonging to the same style category. We consider two types of style diversity: global-level diversity among style motions of the same category, and local-level diversity within the temporal dynamics of motion sequences. These components jointly shape two structured style embedding spaces, i.e., global and local, optimized via alignment with non-learnable prototype anchors. Furthermore, we augment the pretrained text-to-motion generation model with the Stylistic Modulation Adapter (SMA) to integrate the style features. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art models in stylized motion generation and motion style transfer.
[83] See, Think, Learn: A Self-Taught Multimodal Reasoner
Sourabh Sharma,Sonam Gupta,Sadbhawna
Main category: cs.CV
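TL;DR: Proposes See-Think-Learn (STL), a self-training framework that jointly improves the perception and reasoning abilities of vision-language models through a structured reasoning template and negative rationale generation.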
Details
Motivation: Existing methods depend on high-quality chain-of-thought data that is costly to obtain, and they overlook the joint improvement of perception and reasoning. Method: Designs a structured reasoning template that extracts visual attributes before reasoning; generates positive and negative rationales through self-training to jointly optimize perception and reasoning. Result: In experiments across multiple domains, STL outperforms baselines trained only on answers or on self-generated reasoning, and the rationales it generates are of higher quality. Conclusion: STL is a low-cost, efficient way to improve the multimodal reasoning ability of vision-language models. Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model's ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs.[84] Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation
Jianzong Wu,Hao Lian,Dachao Hao,Ye Tian,Qingyu Shi,Biaolong Chen,Hao Jiang
Main category: cs.CV
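TL;DR: This paper proposes AVFullDiT, a parameter-efficient audio-video joint denoising model that builds on text-to-audio and text-to-video modules, and shows that joint training improves not only audio-video synchrony but also video generation quality, especially in scenes with large motion and object contact; the gain is attributed to the audio signal acting as a "privileged signal" that regularizes visual dynamics.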
Details
Motivation: Investigates whether audio-video joint denoising training can outperform video-only generation even when only video quality matters, to reveal the potential benefit of cross-modal training for video generation. Method: Designs a parameter-efficient AVFullDiT architecture that reuses pre-trained text-to-video and text-to-audio modules for joint denoising training, and trains a video-only counterpart under identical settings for comparison. Result: Experiments provide the first systematic evidence that audio-video joint training consistently improves video generation quality, particularly in complex dynamic scenes with large motion and object contact; analysis suggests that audio prediction acts as a privileged signal that helps the model learn causal relationships between visual events and sounds. Conclusion: Cross-modal joint training helps audio-video synchrony and also regularizes video dynamics through the audio signal, making it an effective path toward stronger, more physically plausible world models. Abstract: Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.[85] Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration
Zhongyi Cai,Yi Du,Chen Wang,Yu Kong
Main category: cs.CV
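TL;DR: This paper introduces SEER-Bench, a new benchmark for evaluating sequential embodied tasks, and proposes 3DSPMR, the first method to explicitly incorporate geometric information into multimodal large language models, to strengthen spatial understanding and reasoning in sequential exploration tasks.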
Details
Motivation: Real-world embodied agents often face sequences of sub-tasks, some of which may be infeasible (such as searching for a non-existent object), and must reuse spatial knowledge from earlier exploration to support later reasoning; existing research has overlooked this challenge. Method: Builds the SEER-Bench benchmark, covering two task types, EQA and EMN; proposes 3DSPMR, which uses relational, visual, and geometric cues to strengthen the spatial reasoning of multimodal large language models. Result: Experiments show that 3DSPMR substantially outperforms existing methods on both sequential EQA and EMN tasks. Conclusion: By introducing geometric information and combining it with MLLMs, 3DSPMR improves the spatial memory and reasoning performance of embodied agents in sequential tasks and offers a new approach for practical applications. Abstract: Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene to achieve a specific goal. However, when deployed in real life, agents often face sequential tasks, where each new sub-task follows the completion of the previous one, and certain sub-tasks may be infeasible, such as searching for a non-existent object. Compared with the single-task setting, the core challenge lies in reusing spatial knowledge accumulated from previous explorations to support subsequent reasoning and exploration. In this work, we investigate this underexplored yet practically significant embodied AI challenge. To evaluate this challenge, we introduce SEER-Bench, a new Sequential Embodied Exploration and Reasoning Benchmark encompassing two classic embodied tasks: Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN). Building on SEER-Bench, we propose 3DSPMR, a 3D SPatial Memory Reasoning approach that exploits relational, visual, and geometric cues from explored regions to augment Multi-Modal Large Language Models (MLLMs) for reasoning and exploration in sequential embodied tasks. To the best of our knowledge, this is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding and reasoning. Extensive experiments verify that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks.[86] TGDD: Trajectory Guided Dataset Distillation with Balanced Distribution
Fengli Ran,Xiao Pu,Bo Liu,Xiuli Bi,Bin Xiao
Main category: cs.CV
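TL;DR: Proposes Trajectory Guided Dataset Distillation (TGDD), which dynamically aligns feature distributions over the course of model training to improve the quality of synthetic data and downstream task performance.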
Details
Motivation: Existing distribution-matching-based dataset distillation methods overlook how feature representations evolve during training, which limits the expressiveness of synthetic data. Method: Reformulates distribution matching as a dynamic alignment process along the model's training trajectory, aligning the feature distributions of synthetic and original data at each training stage, and introduces a distribution constraint regularization to reduce class overlap. Result: Experiments on ten datasets show that TGDD achieves state-of-the-art performance, with a 5.0% accuracy gain on high-resolution benchmarks. Conclusion: TGDD improves the semantic diversity and representativeness of synthetic data and achieves a good balance between performance and efficiency without additional optimization overhead. Abstract: Dataset distillation compresses large datasets into compact synthetic ones to reduce storage and computational costs. Among various approaches, distribution matching (DM)-based methods have attracted attention for their high efficiency. However, they often overlook the evolution of feature representations during training, which limits the expressiveness of synthetic data and weakens downstream performance. To address this issue, we propose Trajectory Guided Dataset Distillation (TGDD), which reformulates distribution matching as a dynamic alignment process along the model's training trajectory. At each training stage, TGDD captures evolving semantics by aligning the feature distribution between the synthetic and original dataset. Meanwhile, it introduces a distribution constraint regularization to reduce class overlap. This design helps synthetic data preserve both semantic diversity and representativeness, improving performance in downstream tasks. Without additional optimization overhead, TGDD achieves a favorable balance between performance and efficiency. Experiments on ten datasets demonstrate that TGDD achieves state-of-the-art performance, notably a 5.0% accuracy gain on high-resolution benchmarks.[87] WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling
Yuta Oshima,Yusuke Iwasawa,Masahiro Suzuki,Yutaka Matsuo,Hiroki Furuta
Main category: cs.CV
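TL;DR: This paper proposes WorldPack, a video world model with efficient compressed memory that uses trajectory packing and memory retrieval to significantly improve spatial consistency, fidelity, and quality in long-term generation under a shorter context length, and shows on LoopNav, a Minecraft benchmark, that it outperforms existing state-of-the-art models.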
Details
Motivation: Existing video world models are too computationally expensive on long contexts to achieve long-term, spatiotemporally consistent world modeling, so a more efficient way of using context is needed. Method: Proposes WorldPack, which combines trajectory packing for context efficiency with memory retrieval to maintain spatiotemporal consistency during generation, supporting long-term generation tasks that require spatial reasoning. Result: On LoopNav, a Minecraft benchmark designed to evaluate long-term consistency, WorldPack notably outperforms strong existing baselines. Conclusion: With its efficient compressed memory, WorldPack achieves high-quality, spatially consistent long-term visual prediction while reducing context length, offering a new option for practical video world models. Abstract: Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our performance is evaluated with LoopNav, a benchmark on Minecraft, specialized for the evaluation of long-term consistency, and we verify that WorldPack notably outperforms strong state-of-the-art models.[88] G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline
Vishwesh Nath,Javier G. Tejero,Ruilong Li,Filippo Filicori,Mahdi Azizian,Sean D. Huver
Main category: cs.CV
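TL;DR: G-SHARP is a GSplat-based real-time surgical scene reconstruction framework designed for fast, accurate 3D modeling of deformable tissue in minimally invasive surgery; it supports commercial deployment and real-time visualization on edge hardware.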
Details
Motivation: Existing Gaussian splatting implementations depend on non-commercial derivatives, which limits deployment in real surgical environments; a commercially compatible, real-time, accurate reconstruction solution is needed. Method: Proposes G-SHARP, the first surgical reconstruction pipeline built on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, with principled deformation modeling and occlusion handling, and develops a Holoscan SDK application for deployment on NVIDIA IGX Orin and Thor edge hardware. Result: Achieves state-of-the-art reconstruction quality on the EndoNeRF pulling benchmark with a strong speed-accuracy trade-off suitable for real-time intra-operative use. Conclusion: G-SHARP offers an efficient, deployable solution for real-time surgical reconstruction of deformable tissue and advances the use of Gaussian splatting in clinical practice. Abstract: We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of deformable tissue. While recent Gaussian splatting approaches have advanced real-time endoscopic reconstruction, existing implementations often depend on non-commercial derivatives, limiting deployability. G-SHARP overcomes these constraints by being the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling, robust occlusion handling, and high-fidelity reconstructions on the EndoNeRF pulling benchmark. Our results demonstrate state-of-the-art reconstruction quality with strong speed-accuracy trade-offs suitable for intra-operative use. Finally, we provide a Holoscan SDK application that deploys G-SHARP on NVIDIA IGX Orin and Thor edge hardware, enabling real-time surgical visualization in practical operating-room settings.[89] UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making
Qianhan Feng,Zhongzhen Huang,Yakun Zhu,Xiaofan Zhang,Qi Dou
Main category: cs.CV
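TL;DR: Proposes UCAgents, a hierarchical multi-agent framework based on unidirectional convergence and evidence auditing that improves the diagnostic reliability and efficiency of medical vision-language models.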
Details
Motivation: Vision-language models in medical diagnosis suffer from reasoning that detaches from image evidence; multi-agent debate mitigates single-model bias, but open-ended discussion adds textual noise and computational cost and still fails to anchor reasoning to visual evidence. Method: Designs the UCAgents framework, which mimics clinical consultation workflows, applies a unidirectional convergence mechanism and structured evidence auditing, restricts agent interactions to targeted evidence verification, and introduces a one-round inquiry discussion to uncover risks of visual-textual misalignment. Result: Outperforms existing methods on four medical VQA benchmarks (71.3% on PathVQA, +6.0% over the previous state of the art) with 87.7% lower token cost, balancing visual evidence discovery against textual interference. Conclusion: By constraining the dual-noise bottleneck (visual ambiguity and textual noise), UCAgents achieves high diagnostic reliability at low computational cost and is suited to real clinical deployment. Abstract: Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi-agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single-model bias, but open-ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision-making. We propose UCAgents, a hierarchical multi-agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state-of-the-art) with 87.7% lower token cost. The evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both diagnostic reliability and computational efficiency critical for real-world clinical deployment.
Code is available at https://github.com/fqhank/UCAgents.[90] Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding
Yerim Jeon,Miso Lee,WonJun Moon,Jae-Pil Heo
Main category: cs.CV
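TL;DR: Proposes 3D-SLIM, an adaptive attention masking strategy based on spatial structure that improves multimodal reasoning in 3D scene-language understanding.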
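As a rough illustration of the masking idea in this entry, the sketch below builds a boolean attention mask in which object tokens attend to spatial neighbors instead of earlier tokens, and can also read the instruction tokens. The abstract for this entry does not give the construction, so everything concrete here is an assumption: the function name `build_slim_style_mask`, the token layout (object tokens first, instruction tokens last), and the k-nearest-neighbor rule, which merely stands in for the paper's "spatial density" criterion.

```python
import math


def build_slim_style_mask(object_centers, num_instruction_tokens, k=1):
    """Return mask where mask[i][j] is True when token i may attend to token j.

    Assumed token layout: object tokens first, instruction tokens last.
    """
    n_obj = len(object_centers)
    n = n_obj + num_instruction_tokens
    mask = [[False] * n for _ in range(n)]

    for i in range(n_obj):
        # Hypothetical geometry rule: attend to self and the k nearest objects,
        # regardless of where they sit in the token sequence.
        others = sorted(
            (j for j in range(n_obj) if j != i),
            key=lambda j: math.dist(object_centers[i], object_centers[j]),
        )
        for j in [i] + others[:k]:
            mask[i][j] = True
        # Instruction-aware part: object tokens can read every instruction
        # token, which a causal mask would forbid in this layout.
        for j in range(n_obj, n):
            mask[i][j] = True

    for i in range(n_obj, n):
        # Instruction tokens stay causal: all objects plus earlier
        # instruction tokens (and themselves).
        for j in range(i + 1):
            mask[i][j] = True
    return mask


# Two pairs of nearby objects, followed by two instruction tokens.
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]
mask = build_slim_style_mask(centers, num_instruction_tokens=2, k=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this toy layout, object 0 attends to object 1 (its nearest neighbor) and to both instruction tokens, but not to the distant objects 2 and 3, whatever their position in the sequence.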
Details
Motivation: Existing methods use the causal attention mask of standard language-model decoders, which introduces sequential bias among order-agnostic 3D objects and restricts object-instruction attention, hindering task-specific 3D reasoning. Method: Proposes 3D-SLIM, which comprises a Geometry-adaptive Mask (constraining attention by spatial density) and an Instruction-aware Mask (giving object tokens direct access to the instruction context), so that attention follows spatial relationships rather than token order. Result: Substantially improves performance across multiple benchmarks and LLM baselines, validating the method on 3D multimodal tasks. Conclusion: Decoder design is critical for 3D multimodal reasoning; 3D-SLIM is simple and efficient, requiring no architectural modifications or extra parameters. Abstract: Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user's task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.[91] YingVideo-MV: Music-Driven Multi-Stage Video Generation
Jiahui Chen,Weida Wang,Runhua Shi,Huan Yang,Chaofan Ding,Zihao Chen
Main category: cs.CV
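TL;DR: Proposes YingVideo-MV, the first cascaded framework for music-driven long-video generation, supporting camera motion control and high-quality music performance video synthesis.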
Details
Motivation: Existing audio-driven video generation methods lack explicit camera motion control for music performance videos and struggle to stay consistent over long sequences. Method: Builds a large-scale Music-in-the-Wild dataset; introduces the MV-Director module for shot planning; designs a temporal-aware diffusion Transformer architecture; proposes a camera adapter module that embeds camera poses, and a time-aware dynamic window strategy to improve continuity between clips. Result: Performs strongly in benchmark tests, generating coherent, expressive music videos with precise music-motion-camera synchronization. Conclusion: YingVideo-MV offers an effective solution for music-driven long-video generation and advances applications such as virtual performances and digital humans. Abstract: While diffusion models for audio-driven avatar video generation have achieved notable progress in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support the achievement of diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continuity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjusts denoising ranges based on audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available on our project page: https://giantailab.github.io/YingVideo-MV/ .[92] Attention-guided reference point shifting for Gaussian-mixture-based partial point set registration
Mizuki Kikkawa,Tatsuya Yatagawa,Yutaka Ohtake,Hiromasa Suzuki
Main category: cs.CV
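TL;DR: This paper studies the invariance of feature vectors to translation and rotation in partial-to-partial point set registration, identifies theoretical and practical flaws in methods based on deep learning and Gaussian mixture models (GMMs) such as DeepGMR, and proposes an attention-based reference point shifting (ARPS) layer that extracts transformation-invariant features and significantly improves existing methods.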
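To make the role of a reference point concrete, the sketch below shows one simplified reading of "reference point shifting": per-point attention scores are turned into softmax weights, the weighted mean of the points serves as the reference point, and every point is expressed relative to it. This is not the ARPS layer itself. The scores here are fixed, hypothetical inputs (a real layer would predict them), the function names are invented for illustration, and the example only demonstrates translation invariance, not rotation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shift_to_reference(points, scores):
    """Return (reference_point, points expressed relative to it).

    `scores` are hypothetical per-point attention logits.
    """
    weights = softmax(scores)
    dim = len(points[0])
    ref = [sum(w * p[d] for w, p in zip(weights, points)) for d in range(dim)]
    shifted = [[p[d] - ref[d] for d in range(dim)] for p in points]
    return ref, shifted


points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
scores = [0.1, 1.5, -0.3]  # hypothetical logits, not from the paper
offset = [5.0, -3.0, 2.0]
moved = [[p[d] + offset[d] for d in range(3)] for p in points]

_, shifted_a = shift_to_reference(points, scores)
_, shifted_b = shift_to_reference(moved, scores)

# Largest coordinate difference between the two shifted point sets.
max_diff = max(
    abs(a - b)
    for pa, pb in zip(shifted_a, shifted_b)
    for a, b in zip(pa, pb)
)
print(max_diff)
```

Because the weights sum to one, translating the whole point set moves the reference point by the same offset, so the shifted coordinates agree up to floating-point rounding as long as the scores themselves do not change under the translation.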
Details
Motivation: To expose the root cause of the performance drop that deep-learning- and GMM-based point cloud registration methods suffer in partial-to-partial registration, namely their lack of transformation invariance, and to improve these methods. Method: Proposes a new attention module, the attention-based reference point shifting (ARPS) layer, which finds a common reference point between two partial point sets rather than their overlap region, yielding features invariant to translation and rotation of the input point sets; the layer is applied to models such as DeepGMR and UGMMReg. Result: The ARPS layer significantly improves the registration performance of DeepGMR and UGMMReg and outperforms prior methods that use attention or Transformers to extract the overlap region, showing greater robustness and accuracy in partial-to-partial point set registration. Conclusion: Transformation-invariant feature representations are essential for partial-to-partial point cloud registration; ARPS provides an effective, interpretable solution and suggests new design directions for registration methods based on deep learning and GMMs. Abstract: This study investigates the impact of the invariance of feature vectors for partial-to-partial point set registration under translation and rotation of input point sets, particularly in the realm of techniques based on deep learning and Gaussian mixture models (GMMs). We reveal both theoretical and practical problems associated with such deep-learning-based registration methods using GMMs, with a particular focus on the limitations of DeepGMR, a pioneering study in this line, when applied to partial-to-partial point set registration. Our primary goal is to uncover the causes of these problems and propose a comprehensible solution. To address this, we introduce an attention-based reference point shifting (ARPS) layer, which robustly identifies a common reference point of two partial point sets, thereby acquiring transformation-invariant features. The ARPS layer employs a well-studied attention module to find a common reference point rather than the overlap region. Owing to this, it significantly enhances the performance of DeepGMR and its recent variant, UGMMReg. Furthermore, these extension models outperform even prior deep learning methods using attention blocks and Transformer to extract the overlap region or common reference points. We believe these findings provide deeper insights into registration methods using deep learning and GMMs.[93] ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
Yifan Li,Yingda Yin,Lingting Zhu,Weikai Chen,Shengju Qian,Xin Wang,Yanwei Fu
Main category: cs.CV
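TL;DR: This paper proposes ReVSeg, a reasoning-centric video object segmentation method that explicitly decomposes the task within a vision-language model into three steps (semantic interpretation, temporal evidence selection, and spatial grounding) and optimizes the multi-step reasoning chain with reinforcement learning, yielding interpretable, state-of-the-art video segmentation.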
Details
Motivation: Existing video object segmentation methods usually collapse complex dynamic, causal, and temporal reasoning into implicit embeddings, which hides the reasoning process and makes complex queries hard to handle; this paper aims to improve interpretability and reasoning ability by decomposing the reasoning steps explicitly. Method: Proposes the ReVSeg framework, which decomposes reasoning into three explicit stages (semantic interpretation, temporal evidence selection, and spatial grounding) executed in the native interface of a pretrained vision-language model, and uses reinforcement learning to optimize the whole multi-step chain so that the model refines its decisions from outcome signals. Result: ReVSeg achieves state-of-the-art performance on standard video object segmentation benchmarks while producing interpretable reasoning trajectories. Conclusion: Through explicit decomposition and reinforcement-learning optimization of multi-step reasoning, ReVSeg improves both the performance and the interpretability of video object segmentation, offering a new paradigm for complex video understanding tasks. Abstract: Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performance on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. The project page is available at https://clementine24.github.io/ReVSeg/ .[94] A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation
Wenjing Yu,Shuo Jiang,Yifei Chen,Shuo Chang,Yuanhan Wang,Beining Wu,Jie Dong,Mingxuan Liu,Shenghao Zhu,Feiwei Qin,Changmiao Wang,Qiyuan Tian
Main category: cs.CV
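TL;DR: MedSeg-TTA is a comprehensive benchmark that systematically evaluates 20 test-time adaptation methods across seven medical imaging modalities, reveals how the performance of different paradigms varies with conditions, and provides standardized resources and a public leaderboard for clinically reliable TTA research.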
Details
Motivation: Evaluations of test-time adaptation (TTA) for medical image segmentation suffer from narrow modality coverage, limited task diversity, and inconsistent methodology, and lack systematic cross-modality comparison. Method: Builds the MedSeg-TTA benchmark with unified data preprocessing, backbone configuration, and test-time protocols, covering seven modalities including MRI, CT, and ultrasound, and comprehensively evaluates four TTA paradigms: input-level, feature-level, output-level, and prior estimation. Result: No single paradigm performs best under all conditions: input-level methods are more stable under mild appearance shifts; feature-level and output-level methods have an advantage on boundary-related metrics; prior-based methods show strong modality dependence; and several methods degrade significantly under large inter-center or inter-device shifts. Conclusion: MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard for test-time adaptation, stresses that TTA methods should be chosen to fit the clinical scenario, and advances research on robust, clinically reliable TTA. Abstract: Test-time adaptation is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in terms of modality coverage, task diversity, and methodological consistency. We present MedSeg-TTA, a comprehensive benchmark that examines twenty representative adaptation methods across seven imaging modalities, including MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray, under fully unified data preprocessing, backbone configuration, and test-time protocols. The benchmark encompasses four significant adaptation paradigms: Input-level Transformation, Feature-level Alignment, Output-level Regularization, and Prior Estimation, enabling the first systematic cross-modality comparison of their reliability and applicability. The results show that no single paradigm performs best in all conditions. Input-level methods are more stable under mild appearance shifts. Feature-level and Output-level methods offer greater advantages in boundary-related metrics, whereas prior-based methods exhibit strong modality dependence. Several methods degrade significantly under large inter-center and inter-device shifts, which highlights the importance of principled method selection for clinical deployment. MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard, establishing a rigorous foundation for future research on robust, clinically reliable test-time adaptation.
All source codes and open-source datasets are available at https://github.com/wenjing-gg/MedSeg-TTA.[95] Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities
Yuan Xiong,Ziqi Miao,Lijun Li,Chen Qian,Jie Li,Jing Shao
Main category: cs.CV
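TL;DR: Proposes CIA, a new image-centric attack method that uses a multi-agent system to embed harmful queries in seemingly benign visual contexts, markedly raising the jailbreak success rate against multimodal large language models.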
Details
Motivation: Existing attack methods focus on text-image interplay and underestimate how much complex contextual information the image modality can carry, so an attack that makes fuller use of visual content is needed to assess MLLM safety. Method: Proposes Contextual Image Attack (CIA), a multi-agent system that combines four visualization strategies, contextual element enhancement, and automatic toxicity obfuscation to covertly embed harmful queries in benign images. Result: On the MMSafetyBench-tiny dataset, CIA reaches toxicity scores of 4.73 and 4.83 against GPT-4o and Qwen2.5-VL-72B, respectively, with attack success rates of 86.31% and 91.07%, significantly outperforming prior methods. Conclusion: The visual modality is itself a potent vector for jailbreaking advanced multimodal large language models; CIA demonstrates this potential and indicates that future MLLM safety alignment must pay more attention to risks in the visual channel. Abstract: While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embed harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack's efficacy, the system incorporates contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31% and 91.07%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.[96] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
Yumeng Li,Guang Yang,Hao Liu,Bowen Wang,Colin Zhang
Main category: cs.CV
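TL;DR: This paper presents dots.ocr, the first vision-language model to jointly learn three core tasks (layout detection, text recognition, and relational understanding) in a unified end-to-end framework, and achieves strong multilingual document parsing performance using large-scale multilingual synthetic data.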
Details
Motivation: Existing document layout parsing methods rely on fragmented multi-stage pipelines that suffer from error propagation and fail to exploit joint training, limiting accurate understanding of complex multilingual documents. Method: Proposes the dots.ocr model, a unified vision-language architecture that jointly learns the three core tasks end to end, and builds a scalable data engine to generate a large multilingual training corpus. Result: Achieves state-of-the-art performance on OmniDocBench; on the newly proposed XDocParse benchmark covering 126 languages, it outperforms the next-best competitor by +7.4 points. Conclusion: The unified joint-training paradigm outperforms traditional multi-stage approaches; dots.ocr shows strong multilingual, multi-domain document parsing ability and provides an effective model and a new benchmark for global document intelligence research. Abstract: Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process, which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce dots.ocr, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, dots.ocr establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.[97] GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding
Jiaqi Liu,Ronghao Fu,Haoran Liu,Lang Sun,Bo Yang
Main category: cs.CV
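TL;DR: This paper proposes GeoDiT, a new diffusion model that replaces conventional autoregressive generation with a parallel refinement mechanism better suited to the inherent structure of geospatial data, achieving state-of-the-art performance on tasks such as image captioning, visual grounding, and multi-object detection.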
Details
Motivation: Autoregressive models are structurally mismatched with the inherently parallel nature of geospatial understanding tasks and struggle to produce structured, coherent outputs. Method: Reframes geospatial generation as a parallel refinement process and proposes GeoDiT, the first diffusion-based vision-language model for the geospatial domain, enabling holistic coarse-to-fine synthesis. Result: Achieves state-of-the-art results on multiple benchmarks requiring structured, object-centric outputs, and significantly outperforms autoregressive models in image captioning, visual grounding, and multi-object detection. Conclusion: Aligning the generative process with the intrinsic structure of the data is key to better performance in complex geospatial analysis. Abstract: Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.[98] Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling
Aditya Chaudhary,Prachet Dev Singh,Ankit Jha
Main category: cs.CV
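TL;DR: Proposes ViT-SR, a Vision Transformer method with a two-stage training strategy that performs single image super-resolution through self-supervised pretraining followed by fine-tuning, improving performance.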
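The residual formulation and the PSNR metric reported for this entry can be illustrated with a small sketch. It assumes images are flat lists of 8-bit pixel intensities and that the bicubic upsampling has already been done elsewhere; the function names and the toy pixel values are hypothetical and are not taken from the paper. PSNR is computed with its standard definition, 10 * log10(MAX^2 / MSE).

```python
import math


def add_residual(upsampled, residual, max_value=255.0):
    """Add a predicted high-frequency residual to an upsampled image, then clip."""
    return [min(max(u + r, 0.0), max_value) for u, r in zip(upsampled, residual)]


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel example (made-up values).
reference = [100.0, 120.0, 140.0, 160.0]  # ground-truth high-resolution pixels
bicubic = [98.0, 121.0, 137.0, 162.0]     # stand-in for a bicubic upsampling
residual = [1.0, -1.0, 2.0, -1.0]         # stand-in for the model's prediction
restored = add_residual(bicubic, residual)

print(psnr(reference, bicubic), psnr(reference, restored))
```

In this toy case the residual moves every pixel closer to the reference, so the restored image scores a higher PSNR than the bicubic baseline; whether a trained model achieves that is exactly what the reported metrics measure.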
Details
Motivation: Single image super-resolution (SISR) remains challenging in computer vision, and existing methods struggle to extract visual features that generalize well. Method: Uses a two-stage training strategy: first, self-supervised pretraining on a colorization task to learn rich visual representations; then fine-tuning for 4x super-resolution, where the model predicts a high-frequency residual image that is added to the bicubic interpolation result, simplifying residual learning. Result: Reaches an SSIM of 0.712 and a PSNR of 22.90 dB on the DIV2K dataset, supporting the effectiveness of the method. Conclusion: Two-stage self-supervised pretraining can improve ViT performance in image super-resolution and shows potential for complex image restoration tasks. Abstract: In computer vision, Single Image Super-Resolution (SISR) is still a difficult problem. We present ViT-SR, a new technique to improve the performance of a Vision Transformer (ViT) employing a two-stage training strategy. In our method, the model learns rich, generalizable visual representations from the data itself through a self-supervised pretraining phase on a colorization task. The pre-trained model is then fine-tuned for 4x super-resolution. By predicting a high-frequency residual image that is added to an initial bicubic interpolation, this design simplifies residual learning. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an SSIM of 0.712 and a PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.[99] SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts
Jiaqi Liu,Ronghao Fu,Lang Sun,Haoran Liu,Xiao Yang,Weipeng Zhang,Xu Na,Zhuoran Duan,Bo Yang
Main category: cs.CV
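TL;DR: This paper proposes SkyMoE, a Mixture-of-Experts vision-language model for multimodal, multi-task remote sensing (RS) interpretation that uses a task- and granularity-aware routing mechanism and a disentangled augmentation strategy to achieve superior performance across multiple granularity levels.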
Details
Motivation: General-purpose vision-language models perform poorly on remote sensing tasks, and existing geospatial models struggle to distinguish task types and interpretation granularities, which limits their balance between local detail and global context. Method: Proposes SkyMoE, which uses an adaptive router to generate task- and granularity-aware routing instructions and introduces a context-disentangled augmentation strategy that builds contrastive pairs of local and global features to encourage expert specialization; also constructs the MGRS-Bench benchmark for evaluation. Result: Experiments on 21 public datasets show that SkyMoE achieves state-of-the-art results across remote sensing tasks, with strong adaptability, scalability, and multi-granularity understanding. Conclusion: Through expert specialization and a disentangled training strategy, SkyMoE improves multi-task, multi-granularity interpretation in remote sensing and offers a new paradigm for geospatial vision-language modeling. Abstract: The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.[100] On the Problem of Consistent Anomalies in Zero-Shot Anomaly Detection
Tai Le-Gia
Main category: cs.CV
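TL;DR: This dissertation studies the core challenges of zero-shot anomaly classification and segmentation (AC/AS) and proposes solutions grounded in theory and algorithm design, including the CoDeGraph framework, an extension to 3D medical imaging, and integration with vision-language models.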
Details
Motivation: Zero-shot AC/AS is increasingly important in industrial inspection and medical imaging, but existing methods are systematically biased when anomalies are highly consistent, so both theoretical guidance and effective algorithms are needed.
Method: By analyzing the behavior of patch representations from pre-trained Vision Transformers, the dissertation proposes CoDeGraph, a graph-based framework that suppresses consistent anomalies through multi-stage graph construction, community detection, and structured refinement. It also introduces a training-free volumetric tokenization strategy for 3D MRI data and uses pseudo-masks to supervise vision-language models.
Result: The work identifies the similarity scaling and neighbor-burnout phenomena, shows that CoDeGraph filters consistent anomalies effectively, achieves zero-shot 3D anomaly detection and segmentation without any 3D training samples, and produces pseudo-masks that improve text-driven models.
Conclusion: The dissertation provides theoretical understanding and practical tools for zero-shot AC/AS and extends the reach of unsupervised anomaly detection in both 2D and 3D settings.
Abstract: Zero-shot anomaly classification and segmentation (AC/AS) aim to detect anomalous samples and regions without any training data, a capability increasingly crucial in industrial inspection and medical imaging. This dissertation investigates the core challenges of zero-shot AC/AS and presents principled solutions rooted in theory and algorithmic design. We first formalize the problem of consistent anomalies, a failure mode in which recurring similar anomalies systematically bias distance-based methods. By analyzing the statistical and geometric behavior of patch representations from pre-trained Vision Transformers, we identify two key phenomena - similarity scaling and neighbor-burnout - that describe how relationships among normal patches change with and without consistent anomalies in settings characterized by highly similar objects. We then introduce CoDeGraph, a graph-based framework for filtering consistent anomalies built on the similarity scaling and neighbor-burnout phenomena. Through multi-stage graph construction, community detection, and structured refinement, CoDeGraph effectively suppresses the influence of consistent anomalies. Next, we extend this framework to 3D medical imaging by proposing a training-free, computationally efficient volumetric tokenization strategy for MRI data. This enables a genuinely zero-shot 3D anomaly detection pipeline and shows that volumetric anomaly segmentation is achievable without any 3D training samples. Finally, we bridge batch-based and text-based zero-shot methods by demonstrating that CoDeGraph-derived pseudo-masks can supervise prompt-driven vision-language models. Together, this dissertation provides theoretical understanding and practical solutions for the zero-shot AC/AS problem.
[101] WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
Jian Yang,Dacheng Yin,Xiaoxuan He,Yong Li,Fengyun Rao,Jing Lyu,Wei Zhai,Yang Cao,Zheng-Jun Zha
Main category: cs.CV
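TL;DR: This paper proposes Noisy Query Tokens, which learn a distributed representation space between a vision-language model and a diffusion model to address task generalization collapse in multimodal large models, and adds a VAE branch to recover image details.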
Details
Motivation: Existing methods use a fixed number of learnable query tokens. They are computationally efficient, but their generalization degrades on new tasks that differ substantially from the pre-training tasks.
Method: The paper proposes Noisy Query Tokens, which learn a distributed representation space between the vision-language model and the diffusion model via end-to-end optimization, and introduces a VAE branch with linear projection to recover fine-grained image details.
Result: Experiments show that the method mitigates generalization collapse and supports stable continual learning across diverse tasks.
Conclusion: The proposed method significantly improves the adaptability and continual-learning performance of multimodal large models on diverse tasks.
Abstract: Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.
[102] AVGGT: Rethinking Global Attention for Accelerating VGGT
Xianbing Sun,Zhikai Zhu,Zhengyu Lou,Bo Yang,Jinyang Tang,Liqing Zhang,He Wang,Jianfu Zhang
Main category: cs.CV
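TL;DR: This paper proposes a training-free two-step acceleration scheme that restructures global attention, achieving an 8-10x inference speedup while matching or slightly improving accuracy.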
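As a rough illustration of the second step named in this entry (subsampling K/V over patch tokens with diagonal preservation and a mean-fill component), the sketch below shows one plausible reading in NumPy. It is not the authors' implementation: the function name `subsampled_attention`, the uniform `stride` sampling, and the single unweighted mean token are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def subsampled_attention(q, k, v, stride):
    """Single-head attention over a subsampled key/value set (hypothetical sketch).

    q, k, v: arrays of shape (n, d).
    Every `stride`-th key/value is kept. Each query additionally attends to
    its own key (the "diagonal"), and all dropped keys/values are summarized
    by one mean token (the "mean fill"). These choices are assumptions.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    keep = np.arange(0, n, stride)
    dropped = np.setdiff1d(np.arange(n), keep)

    k_sub, v_sub = k[keep], v[keep]
    if dropped.size:
        # Mean-fill token standing in for everything that was dropped.
        k_sub = np.vstack([k_sub, k[dropped].mean(axis=0, keepdims=True)])
        v_sub = np.vstack([v_sub, v[dropped].mean(axis=0, keepdims=True)])

    logits = (q @ k_sub.T) * scale                      # (n, m)
    diag = (q * k).sum(axis=1, keepdims=True) * scale   # (n, 1) self logits
    diag[keep] = -np.inf  # kept tokens already see their own key in `logits`
    weights = softmax(np.concatenate([logits, diag], axis=1))

    # Weighted sum over the subsampled values plus the per-query diagonal value.
    return weights[:, :-1] @ v_sub + weights[:, -1:] * v
```

With `stride=1` nothing is dropped and the function reduces to ordinary full attention, which is a convenient sanity check; larger strides shrink the key/value set from `n` to roughly `n / stride + 2` entries per query.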
Details
Motivation: Existing multi-view 3D models built on global self-attention are computationally expensive, and there has been no systematic analysis of the role global attention plays in multi-view reasoning.
Method: The paper analyzes the global attention modules in VGGT and $π^3$ and finds a clear division of roles across depth. Based on this, it converts early global layers into frame attention and sparsifies attention by subsampling K/V over patch tokens, with diagonal preservation and a mean-fill component.
Result: On standard pose and point-map benchmarks, inference is 8-10x faster with accuracy on par with or better than the original models, and the method stays robust in extremely dense multi-view settings.
Conclusion: By analyzing the roles of global attention, the proposed method delivers efficient, training-free acceleration of multi-view 3D models and clearly outperforms existing sparse-attention baselines.
Abstract: Since DUSt3R, models such as VGGT and $π^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $π^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $π^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.
[103] OmniPerson: Unified Identity-Preserving Pedestrian Generation
Changxiao Ma,Chao Yuan,Xincheng Shi,Yuzhuo Ma,Yongfei Zhang,Longkun Zhou,Yujia Zhang,Shangze Li,Yifan Xu
Main category: cs.CV
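TL;DR: This paper proposes OmniPerson, the first unified identity-preserving pedestrian generation framework for visible/infrared image and video person re-identification, supporting multiple modalities, multiple reference images, text control, and identity-consistent generation, and releases the large-scale PersonSyn dataset.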
Details
Motivation: Because of data privacy and annotation costs, person re-identification lacks large-scale, high-quality training data, and existing generation methods fall short on identity consistency and controllability.
Method: The paper proposes the OmniPerson generation framework, which includes a Multi-Refer Fuser module for identity-consistent control from multiple reference images and supports RGB/IR image and video generation, cross-modal transfer, and super-resolution. It also builds PersonSyn, a large-scale controllable pedestrian generation dataset, together with its automated curation pipeline.
Result: OmniPerson reaches state-of-the-art generation quality and identity consistency, and augmenting training sets with its generated data consistently improves a range of ReID models.
Conclusion: OmniPerson achieves high-fidelity, strongly identity-consistent, controllable pedestrian generation and provides an effective data augmentation solution for ReID. The code, models, and dataset will be open-sourced.
Abstract: Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, we introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We propose OmniPerson, a unified generation model offering holistic and fine-grained control over all key pedestrian attributes. It supports RGB/IR modality image/video generation with any number of reference images, two kinds of person poses, and text, and also includes RGB-to-IR transfer and image super-resolution abilities. 2) We design the Multi-Refer Fuser for robust identity preservation with any number of reference images as input, enabling OmniPerson to distill a unified identity from a set of multi-view reference images and ensuring high-fidelity pedestrian generation. 3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, and present its automated curation pipeline, which transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.
[104] From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature
Kun Yuan,Min Woo Sun,Zhen Chen,Alejandro Lozano,Xiangteng He,Shi Li,Nassir Navab,Xiaoxiao Sun,Nicolas Padoy,Serena Yeung-Levy
Main category: cs.CV
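TL;DR: Proposes Panel2Patch, a new data pipeline that mines multi-granular image-text supervision from biomedical literature to improve the local semantic alignment of vision-language models.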
Details
Motivation: Current biomedical vision-language pretraining typically compresses complex multi-panel figures into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians rely on.
Method: Panel2Patch parses the layout, panels, and visual markers of scientific figures, constructs hierarchically aligned image-text pairs at the figure, panel, and patch levels, and uses a granularity-aware pretraining strategy to unify learning objectives across levels.
Result: Using only a small set of literature figures, Panel2Patch extracts more effective supervision than prior pipelines and achieves better performance with less pretraining data.
Conclusion: By preserving local semantic structure, Panel2Patch offers a finer-grained and more data-efficient pretraining paradigm for biomedical vision-language models.
Abstract: There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of the literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.
[105] Co-speech Gesture Video Generation via Motion-Based Graph Retrieval
Yafei Song,Peng Zhang,Bang Zhang
Main category: cs.CV
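TL;DR: Proposes a new framework that combines a diffusion model with a motion graph to generate natural, speech-synchronized gesture videos. It learns the joint distribution of audio and gestures and uses a motion-based retrieval and stitching strategy, significantly improving generation quality.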
Details
Motivation: Existing methods handle the many-to-many mapping between audio and gestures with one-to-one matching, which makes it hard to generate natural, synchronized co-speech gesture videos.
Method: A diffusion model first generates gesture motions and is trained with both low-level and high-level audio features. A motion-graph retrieval algorithm then selects a path by assessing global and local motion similarity, and the retrieved segments are stitched into a coherent video.
Result: Experiments show that the method significantly outperforms prior approaches in synchronization accuracy and gesture naturalness.
Conclusion: The proposed framework addresses the many-to-many audio-gesture mapping problem and produces more natural, better-synchronized co-speech gesture videos.
Abstract: Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.
[106] Content-Aware Texturing for Gaussian Splatting
Panagiotis Papantonakis,Georgios Kopanas,Fredo Durand,George Drettakis
Main category: cs.CV
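TL;DR: Proposes a texture-mapping-based Gaussian Splatting optimization method that adaptively adjusts texture resolution to represent detailed appearance efficiently, reducing the parameter count while improving image quality.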
Details
Motivation: Standard Gaussian Splatting needs a large number of small primitives to represent fine appearance detail, which is inefficient. Inspired by texture mapping, the authors aim to separate geometry from appearance to improve efficiency.
Method: The method attaches a per-primitive texture map to 2D Gaussian primitives, ties texel size to the image sampling frequency, adjusts texture resolution dynamically during optimization, and controls the number of primitives based on texture resolution.
Result: Compared with existing methods, it achieves better image quality with fewer parameters, demonstrating advantages in both parameter efficiency and rendering quality.
Conclusion: The method combines texture mapping with Gaussian Splatting optimization, decouples the representation of geometry and appearance, and improves reconstruction efficiency and rendering quality.
Abstract: Gaussian Splatting has become the method of choice for 3D reconstruction and real-time rendering of captured real scenes. However, fine appearance details need to be represented as a large number of small Gaussian primitives, which can be wasteful when geometry and appearance exhibit different frequency characteristics. Inspired by the long tradition of texture mapping, we propose to use texture to represent detailed appearance where possible. Our main focus is to incorporate per-primitive texture maps that adapt to the scene in a principled manner during Gaussian Splatting optimization. We do this by proposing a new appearance representation for 2D Gaussian primitives with textures where the size of a texel is bounded by the image sampling frequency and adapted to the content of the input images. We achieve this by adaptively upscaling or downscaling the texture resolution during optimization. In addition, our approach enables control of the number of primitives during optimization based on texture resolution. We show that our approach performs favorably in image quality and total number of parameters used compared to alternative solutions for textured Gaussian primitives. Project page: https://repo-sam.inria.fr/nerphys/gs-texturing/
[107] RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence
Xuming He,Zehao Fan,Hengjia Li,Fan Zhuo,Hankun Xu,Senlin Cheng,Di Weng,Haifeng Liu,Can Ye,Boxi Wu
Main category: cs.CV
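TL;DR: RULER-Bench is a new benchmark for evaluating the cognitive-rule reasoning ability of video generation models. It covers 40 tasks in six categories with 622 high-quality instances, and its fine-grained metrics show that the current state-of-the-art model scores only 48.87% on rule coherence, leaving substantial room for improvement.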
Details
Motivation: Existing evaluations of video generation models focus mainly on visual perception and understanding and lack a systematic assessment of reasoning ability, in particular a fine-grained analysis based on cognitive rules, so a dedicated benchmark is needed.
Method: RULER-Bench is built on the text-to-video and image-to-video paradigms. It defines 40 representative tasks covering six rule categories and a dataset of 622 annotated instances. Each generated video is evaluated with a checklist covering four metrics and scored automatically by GPT-o3, reaching 85% agreement with human judgements.
Result: The current state-of-the-art video generation model reaches only 48.87% on the rule coherence metric, indicating clear weaknesses in logical and rule-based reasoning, and performance varies considerably across models.
Conclusion: RULER-Bench provides a systematic framework for evaluating the reasoning ability of video generation models, exposes the limits of current models in following cognitive rules, and is expected to encourage video generation models with stronger reasoning and advance vision foundation intelligence.
Abstract: Recent advances in video generation have enabled the synthesis of videos with strong temporal consistency and impressive visual quality, marking a crucial step toward vision foundation models. To evaluate these video generation models, existing benchmarks primarily focus on factors related to visual perception and understanding, like visual aesthetics, instruction adherence, and temporal coherence. However, the rule-based reasoning capabilities of video generation models remain largely unexplored. Although recent studies have carried out preliminary explorations into whether video models can serve as zero-shot learners, they still lack a fine-grained decomposition of reasoning capabilities and a comprehensive evaluation protocol. To address this gap, we introduce RULER-Bench, a benchmark designed to evaluate the reasoning ability of video generation models from the perspective of cognitive rules. Built upon two fundamental paradigms, text-to-video and image-to-video, RULER-Bench covers 40 representative tasks spanning six rule categories with 622 high-quality annotated instances. For the evaluation of each generated video, we construct a checklist covering four metrics and leverage GPT-o3 to assign scores to each question, achieving 85% alignment with human judgements. Extensive experiments show that the state-of-the-art model achieves only 48.87% on the rule coherence metric, highlighting significant room for improvement in the reasoning capability of next-level video models. We expect that the insight obtained from RULER-Bench will facilitate further development of reasoning-aware video generation, advancing video generation models toward vision foundation intelligence.
[108] PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding
Zheng Huang,Xukai Liu,Tianyu Hu,Kai Zhang,Ye Liu
Main category: cs.CV
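TL;DR: This paper introduces PPTBench, a comprehensive benchmark for evaluating multimodal large models on PowerPoint-related tasks, and reveals the shortcomings of current models in understanding and generating visual layouts.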
Details
Motivation: Existing benchmarks focus on narrow subtasks and overlook layout-centric challenges, even though layout understanding is central to real-world slide creation and editing, so a more comprehensive benchmark is needed.
Method: PPTBench is built from 958 PPTX files and contains 4,439 samples across four task categories: Detection, Understanding, Modification, and Generation. Experiments and ablations evaluate how well mainstream MLLMs handle semantics and layout.
Result: Current MLLMs can interpret the semantic content of slides but perform poorly at organizing spatial layout. They struggle to combine visual cues with JSON layout structures and fail to use visual information in API planning, and case studies show systematic errors such as misalignment and overlap.
Conclusion: PPTBench exposes key challenges for multimodal models in visual-structural reasoning and coherent slide generation, and points to directions for future work on stronger layout-aware models.
Abstract: PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Leveraging a diverse source of 958 PPTX files, PPTBench evaluates models across four categories with 4,439 samples, including Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability. Case studies visually expose systematic layout errors such as misalignment and element overlap. These findings provide a new perspective on evaluating VLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.
[109] Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening
Yongchuan Cui,Peng Liu,Yi Zeng
Main category: cs.CV
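TL;DR: This paper proposes a new strategy of pretraining on large-scale simulated data to improve the generalization of remote sensing image fusion models to unseen datasets.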
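The simulated-data idea in this entry can be pictured with a small NumPy sketch that turns one high-resolution multi-band image into a training pair by blurring, downsampling, and adding noise. This is only an assumed, simplified pipeline: the box blur, the band-average panchromatic image, and the names `box_blur` and `simulate_pair` are illustrative stand-ins, not the degradation operators or augmentations used in the paper.

```python
import numpy as np

def box_blur(img, k=5):
    """Blur an (H, W, C) image with a k x k box filter (k odd), edge-padded."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    h, w = img.shape[:2]
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def simulate_pair(hr, scale=4, noise_std=0.01, seed=0):
    """Build a (low-res multispectral, panchromatic) pair from one image.

    hr: (H, W, C) array treated as the high-resolution target.
    Returns lr_ms of shape (H // scale, W // scale, C) when H and W are
    multiples of `scale`, and pan of shape (H, W).
    """
    rng = np.random.default_rng(seed)
    blurred = box_blur(hr.astype(float))
    lr_ms = blurred[::scale, ::scale]                         # downsampling
    lr_ms = lr_ms + rng.normal(0.0, noise_std, lr_ms.shape)   # sensor noise
    pan = hr.astype(float).mean(axis=2)  # assumption: PAN as band average
    return lr_ms, pan
```

A fusion model would then be trained to recover `hr` from `(lr_ms, pan)`; varying `scale`, `noise_std`, and the blur across samples is one way to obtain the kind of diversity the entry describes.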
Details
Motivation: Existing deep learning methods generalize poorly to unseen datasets, mainly because real training data is scarce and there is a domain gap between different satellite sensors.
Method: The authors build diverse simulated datasets by applying degradation operations (such as blur, noise, and downsampling) and augmentations (such as band generation, channel shuffling, high-pass filtering, and color jittering) to natural images from ImageNet and remote sensing images from SkyScript, then pretrain fusion models on this data to learn robust spatial-spectral priors.
Result: The method is validated on six datasets (WorldView-2/3/4, IKONOS, QuickBird, and GaoFen-2) under zero-shot and one-shot settings with both full tuning and freeze tuning. Across CNN, Transformer, and Mamba architectures, it significantly improves generalization across sensors and imaging conditions.
Conclusion: The pretraining strategy significantly strengthens the cross-domain generalization of remote sensing image fusion models, offers a new way to bring foundation models to remote sensing tasks, and sets a new benchmark for generalization in image fusion.
Abstract: Existing deep learning methods for remote sensing image fusion often suffer from poor generalization when applied to unseen datasets due to the limited availability of real training data and the domain gap between different satellite sensors. To address this challenge, we explore the potential of foundation models by proposing a novel pretraining strategy that leverages large-scale simulated datasets to learn robust spatial-spectral priors. Specifically, our approach first constructs diverse simulated datasets by applying various degradation operations (blur, noise, downsampling) and augmentations (bands generation, channel shuffling, high-pass filtering, color jittering, etc.) to natural images from ImageNet and remote sensing images from SkyScript. We then pretrain fusion models on these simulated data to learn generalizable spatial-spectral representations. The pretrained models are subsequently evaluated on six datasets (WorldView-2/3/4, IKONOS, QuickBird, GaoFen-2) using zero-shot and one-shot paradigms, with both full- and freeze-tuning approaches for fine-tuning. Extensive experiments on different network architectures including convolutional neural networks, Transformer, and Mamba demonstrate that our pretraining strategy significantly improves generalization performance across different satellite sensors and imaging conditions for various fusion models. The pretrained models achieve superior results in zero-shot scenarios and show remarkable adaptation capability with minimal real data in one-shot settings. Our work provides a practical solution for cross-domain pansharpening, establishes a new benchmark for generalization in remote sensing image fusion tasks, and paves the way for leveraging foundation models through advanced training strategies.
[110] PoreTrack3D: A Benchmark for Dynamic 3D Gaussian Splatting in Pore-Scale Facial Trajectory Tracking
Dong Li,Jiahao Xiong,Yingda Huang,Le Chang
Main category: cs.CV
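TL;DR: PoreTrack3D is the first dynamic 3D Gaussian Splatting benchmark for pore-scale, non-rigid 3D facial trajectory tracking. It contains more than 440,000 facial trajectories and supports fine-grained facial expression analysis.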
Details
Motivation: Existing datasets do not capture long-term trajectories of subtle skin-surface motion, such as pore-level keypoints, which limits progress in fine-grained facial expression analysis.
Method: The authors build PoreTrack3D, a large-scale benchmark with more than 440,000 facial trajectories, including more than 52,000 trajectories longer than 10 frames and 68 manually reviewed trajectories spanning all 150 frames, and systematically evaluate recent dynamic 3D Gaussian Splatting methods on it.
Result: The benchmark is the first to capture trajectories of both traditional facial landmarks and pore-scale keypoints, establishes the first performance baseline in this area, and yields a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction.
Conclusion: PoreTrack3D provides high-quality data for studying subtle facial motion and advances dynamic 3D reconstruction and facial behavior analysis.
Abstract: We introduce PoreTrack3D, the first benchmark for dynamic 3D Gaussian splatting in pore-scale, non-rigid 3D facial trajectory tracking. It contains over 440,000 facial trajectories in total, among which more than 52,000 are longer than 10 frames, including 68 manually reviewed trajectories that span the entire 150 frames. To the best of our knowledge, PoreTrack3D is the first benchmark dataset to capture the trajectories of both traditional facial landmarks and pore-scale keypoints, advancing the study of fine-grained facial expressions through the analysis of subtle skin-surface motion. We systematically evaluate state-of-the-art dynamic 3D Gaussian splatting methods on PoreTrack3D, establishing the first performance baseline in this domain. Overall, the pipeline developed for this benchmark dataset's creation establishes a new framework for high-fidelity facial motion capture and dynamic 3D reconstruction. Our dataset is publicly available at: https://github.com/JHXion9/PoreTrack3D
[111] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation
Junwon Lee,Juhan Nam,Jiyoung Lee
Main category: cs.CV
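TL;DR: This paper introduces the task of text-conditioned selective video-to-audio generation, which produces only the user-specified sound from a multi-object video, and proposes SelVA, a model that uses the text prompt to select the target sound source and modulates the video encoder to extract prompt-relevant features, yielding high-quality audio with precise semantic and temporal alignment.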
Details
Motivation: Existing methods usually generate mixed sound, which does not meet the need in multimedia production to edit individual audio tracks, and entangled visual features make it hard to specify a sound source accurately. A method is needed that selects a specific source from text and generates the corresponding audio.
Method: The proposed SelVA model uses the text prompt as a selector of the target sound source, adds supplementary tokens that strengthen cross-attention and suppress text-irrelevant activations, and applies a self-augmentation scheme to compensate for the lack of single-source audio supervision.
Result: On the VGG-MONOAUDIO benchmark, SelVA outperforms existing methods in audio quality, semantic alignment, and temporal synchronization, and ablations confirm the contribution of each component.
Conclusion: SelVA achieves text-guided selective audio generation with good semantic and temporal alignment and offers a new approach to audio handling in multimedia production.
Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of the target source and modulates the video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at https://jnwnlee.github.io/selva-demo/.
[112] Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation
Agathoklis Georgiou
Main category: cs.CV
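TL;DR: Proposes a hybrid architecture that combines vision-language models with OCR, mapping the vision Transformer patch grid to OCR bounding-box coordinates to enable fine-grained, precise document region retrieval for RAG scenarios.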
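The patch-grid-to-bounding-box mapping mentioned in this entry can be sketched as follows. The code assumes a page divided into a uniform grid of patches with one similarity score per patch, and scores an OCR box by the area-weighted mean of the patches it overlaps. The function name `region_relevance` and this particular intersection-weighted aggregation are illustrative assumptions, not the paper's formal definition or the Snappy implementation.

```python
def region_relevance(patch_scores, box, page_w, page_h):
    """Propagate patch-level scores to one OCR region (hypothetical sketch).

    patch_scores: 2D list [rows][cols] of per-patch similarity scores.
    box: (x0, y0, x1, y1) OCR bounding box in page pixel coordinates.
    Returns the mean patch score weighted by patch/box intersection area,
    or 0.0 if the box does not overlap the page.
    """
    rows, cols = len(patch_scores), len(patch_scores[0])
    patch_w, patch_h = page_w / cols, page_h / rows
    x0, y0, x1, y1 = box

    weighted_sum, total_area = 0.0, 0.0
    for r in range(rows):
        for c in range(cols):
            # Overlap of the box with patch (r, c) along each axis.
            ix = max(0.0, min(x1, (c + 1) * patch_w) - max(x0, c * patch_w))
            iy = max(0.0, min(y1, (r + 1) * patch_h) - max(y0, r * patch_h))
            inter = ix * iy
            weighted_sum += inter * patch_scores[r][c]
            total_area += inter
    return weighted_sum / total_area if total_area > 0 else 0.0
```

For example, on a 100 x 100 page with a 2 x 2 grid where only the top-left patch has score 1, a box covering exactly that patch scores 1.0, while a box straddling the two top patches equally scores 0.5. Ranking OCR regions by this value is one way to use patch scores as a spatial relevance filter at inference time without additional training.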
Details
Motivation: Existing VLMs such as ColPali retrieve page-level images effectively but cannot localize specific text regions, while OCR systems provide structured text and coordinates but cannot judge semantic relevance. Neither meets the need for precise context in RAG.
Method: The patch-level similarity scores produced by ColPali are used as a spatial relevance filter over OCR-extracted text regions. The paper formalizes the coordinate mapping between the vision transformer patch grid and OCR bounding boxes, introduces intersection metrics for relevance propagation, and runs at inference time without additional training.
Result: The approach enables training-free, fine-grained document region retrieval that pinpoints query-relevant text regions. The theoretical analysis gives bounds on retrieval precision, and the open-source implementation Snappy has been released; empirical evaluation is ongoing.
Conclusion: The hybrid method combines the semantic understanding of VLMs with the spatial structure of OCR to provide a precise, interpretable vision-language retrieval scheme for RAG and similar applications.
Abstract: Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali's patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on retrieval precision. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation demonstrating practical applicability, with empirical evaluation ongoing.
[113] PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes
Derui Shan,Qian Qiao,Hao Lu,Tao Du,Peng Lu
Main category: cs.CV
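TL;DR: Proposes PolarGuide-GSDR, the first method to embed polarization priors into 3D Gaussian Splatting optimization, achieving high-quality reflection separation and full-scene reconstruction without environment maps or strong material assumptions while keeping real-time rendering and high fidelity.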
Details
Motivation: Existing polarization-aware NeRF methods train slowly, render inefficiently, and depend on material or viewpoint assumptions. 3DGS renders in real time but struggles to reconstruct reflections accurately, and adding a deferred reflection module makes it depend on environment maps. A reflective-scene modeling method is needed that is efficient and accurate without environment maps or strong assumptions.
Method: PolarGuide-GSDR builds a bidirectional coupling between polarization and 3DGS: the geometric priors of 3DGS resolve polarization ambiguity, and the refined polarization information then guides the normals and spherical harmonic representation of 3DGS, achieving reflection separation and full-scene reconstruction.
Result: On public and self-collected datasets, the method achieves state-of-the-art specular reconstruction, normal estimation, and novel view synthesis while maintaining real-time rendering.
Conclusion: PolarGuide-GSDR is the first framework to embed polarization priors directly into 3DGS optimization. Without environment maps or strong material assumptions, it achieves high-fidelity reflection separation and real-time full-scene reconstruction, with superior interpretability and performance.
Abstract: Polarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but face challenges in slow training, inefficient rendering, and strong dependencies on material/viewpoint assumptions. In contrast, 3D Gaussian Splatting (3DGS) enables real-time rendering yet struggles with accurate reflection reconstruction because of reflection-geometry entanglement, and adding a deferred reflection module introduces environment map dependence. We address these limitations by proposing PolarGuide-GSDR, a polarization-forward-guided paradigm establishing a bidirectional coupling mechanism between polarization and 3DGS: first, 3DGS's geometric priors are leveraged to resolve polarization ambiguity, and then the refined polarization cues are used to guide 3DGS's normal and spherical harmonic representation. This process achieves high-fidelity reflection separation and full-scene reconstruction without requiring environment maps or restrictive material assumptions. We demonstrate on public and self-collected datasets that PolarGuide-GSDR achieves state-of-the-art performance in specular reconstruction, normal estimation, and novel view synthesis, all while maintaining real-time rendering capabilities. To our knowledge, this is the first framework embedding polarization priors directly into 3DGS optimization, yielding superior interpretability and real-time performance for modeling complex reflective scenes.
[114] UAUTrack: Towards Unified Multimodal Anti-UAV Visual Tracking
Qionglin Ren,Dawei Zhang,Chunxu Tian,Dan Zhang
Main category: cs.CV
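TL;DR: Proposes UAUTrack, a unified single-target tracking framework for multimodal Anti-UAV tracking that uses a single-stream, single-stage, end-to-end architecture and a text prior prompt strategy to achieve state-of-the-art performance on multiple datasets.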
Details
Motivation: Existing Anti-UAV tracking methods lack a unified framework for cross-modal collaboration, fuse modalities poorly, and struggle to balance accuracy and speed.
Method: UAUTrack adopts a single-stream, single-stage, end-to-end architecture that integrates multimodal data such as RGB and TIR, and introduces a text prior prompt strategy that directs the model to focus on UAV targets.
Result: The method reaches state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets and maintains a favourable accuracy-speed trade-off on Anti-UAV410.
Conclusion: UAUTrack provides an efficient, unified multimodal solution for Anti-UAV tracking and significantly improves tracking performance and practicality across scenarios.
Abstract: Research in Anti-UAV (Unmanned Aerial Vehicle) tracking has explored various modalities, including RGB, TIR, and RGB-T fusion. However, a unified framework for cross-modal collaboration is still lacking. Existing approaches have primarily focused on independent models for individual tasks, often overlooking the potential for cross-modal information sharing. Furthermore, Anti-UAV tracking techniques are still in their infancy, with current solutions struggling to achieve effective multimodal data fusion. To address these challenges, we propose UAUTrack, a unified single-target tracking framework built upon a single-stream, single-stage, end-to-end architecture that effectively integrates multiple modalities. UAUTrack introduces a key component: a text prior prompt strategy that directs the model to focus on UAVs across various scenarios. Experimental results show that UAUTrack achieves state-of-the-art performance on the Anti-UAV and DUT Anti-UAV datasets, and maintains a favourable trade-off between accuracy and speed on the Anti-UAV410 dataset, demonstrating both high accuracy and practical efficiency across diverse Anti-UAV scenarios.
[115] PGP-DiffSR: Phase-Guided Progressive Pruning for Efficient Diffusion-based Image Super-Resolution
Zhongbao Yang,Jiangxin Dong,Yazhou Yao,Jinhui Tang,Jinshan Pan
Main category: cs.CV
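TL;DR: This paper proposes PGP-DiffSR, a lightweight diffusion model for efficient image super-resolution that uses progressive pruning and a phase-exchange adapter module to reduce compute and memory cost while maintaining good restoration performance.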
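For readers unfamiliar with what "phase information" means here, the sketch below shows the basic Fourier operation such a module could build on: keep the amplitude spectrum of one signal and take the phase spectrum from another. The paper's phase-exchange adapter is a learned module inside a diffusion backbone, so this NumPy function (the name `exchange_phase` is hypothetical) is only an assumed illustration of the underlying amplitude/phase swap, not the adapter itself.

```python
import numpy as np

def exchange_phase(feature, guide):
    """Combine the Fourier amplitude of `feature` with the phase of `guide`.

    feature, guide: real 2D arrays of the same shape.
    Returns a real 2D array whose amplitude spectrum matches `feature`
    and whose phase spectrum matches `guide`.
    """
    f_spec = np.fft.fft2(feature)
    g_spec = np.fft.fft2(guide)
    mixed = np.abs(f_spec) * np.exp(1j * np.angle(g_spec))
    # For real inputs the mixed spectrum is conjugate-symmetric, so the
    # imaginary part of the inverse transform is numerical noise only.
    return np.fft.ifft2(mixed).real
```

Passing the same array as both arguments returns the array unchanged, which makes a quick sanity check. In image restoration, phase tends to carry structural layout such as edges, which is a common reason for borrowing it from a more reliable input.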
Details
Motivation: Existing diffusion-based image super-resolution models rely on large-scale backbones, which makes training and inference too costly in compute and memory.
Method: The paper proposes a progressive pruning approach, guided by the phase information of the input, that removes redundant blocks from the diffusion model, and designs a phase-exchange adapter module that uses the input's phase information to guide the pruned model toward better restoration.
Result: Experiments show that the method achieves competitive restoration quality while significantly reducing computational load and memory consumption.
Conclusion: By removing model redundancy and introducing phase guidance, PGP-DiffSR offers a lightweight, high-performing diffusion model for efficient image super-resolution.
Abstract: Although diffusion-based models have achieved impressive results in image super-resolution, they often rely on large-scale backbones such as Stable Diffusion XL (SDXL) and Diffusion Transformers (DiT), which lead to excessive computational and memory costs during training and inference. To address this issue, we develop a lightweight diffusion method, PGP-DiffSR, by removing redundant information from diffusion models under the guidance of the phase information of inputs for efficient image super-resolution. We first identify the intra-block redundancy within the diffusion backbone and propose a progressive pruning approach that removes redundant blocks while preserving restoration capability. We note that the phase information of the restored images produced by the pruned diffusion model is not well estimated. To solve this problem, we propose a phase-exchange adapter module that explores the phase information of the inputs to guide the pruned diffusion model for better restoration performance. We formulate the progressive pruning approach and the phase-exchange adapter module into a unified model. Extensive experiments demonstrate that our method achieves competitive restoration quality while significantly reducing computational load and memory consumption. The code is available at https://github.com/yzb1997/PGP-DiffSR.
[116] Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance
Huankun Sheng,Ming Li,Yixiang Wei,Yeying Fan,Yu-Hui Wen,Tieliang Gong,Yong-Jin Liu
Main category: cs.CV
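TL;DR: This paper proposes Foreground-Aware Slot Attention (FASA), a framework that explicitly separates foreground from background to improve unsupervised scene decomposition and object discovery.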
Details
Motivation: Existing slot-attention-based methods usually do not distinguish foreground from background on real-world scenes, which causes background interference and poor instance discovery. Method: FASA is a two-stage framework. The first stage performs a coarse foreground-background separation through a dual-slot competition mechanism with clustering-based initialization. The second stage introduces masked slot attention and guides foreground slot learning with pseudo-masks generated from a patch affinity graph built on self-supervised features, which reduces over-segmentation. Result: Experiments on synthetic and real-world datasets show that FASA outperforms existing state-of-the-art methods on object discovery and scene decomposition. Conclusion: Explicit foreground modeling and pseudo-mask guidance improve the robustness and accuracy of unsupervised object representation learning. Abstract: Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.
[117] ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data
Yuxing Liu,Yong Liu
Main category: cs.CV
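TL;DR: This paper proposes ClimaDrive, a semantics-guided image-to-image framework for generating weather-diverse, semantically coherent, and physically plausible anomalous driving scenes, and builds the large-scale benchmark ClimaOoD; experiments show that it significantly improves existing methods on open-world anomaly segmentation.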
Details
Motivation: Real anomaly data are scarce and lack diversity, which limits how well existing anomaly segmentation models generalize in the open world, while synthetic data often lack contextual coherence and physical realism, creating a domain gap. Method: Proposes ClimaDrive, which combines structure-guided multi-weather image generation with prompt-driven anomaly inpainting to generate semantically coherent and diverse anomalous driving scenes in a unified way, and uses it to build ClimaOoD, a large-scale benchmark covering six representative driving scenarios under multiple weather conditions. Result: In experiments on four state-of-the-art methods, training with ClimaOoD improves AUROC, AP, and FPR95; for example, FPR95 for RbA on Fishyscapes LAF drops from 3.97 to 3.52. Conclusion: ClimaOoD provides high-quality, diverse training data that strengthens anomaly detection and segmentation in open environments and advances the robustness of anomaly perception in autonomous driving. Abstract: Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes, a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions. Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF.
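For readers unfamiliar with the FPR95 metric cited above: it is conventionally the false-positive rate measured at the score threshold where 95% of true anomalies are detected, so lower is better. The following is a minimal pure-Python sketch of that convention, not the paper's evaluation code; the function name `fpr_at_tpr` and the flat list inputs are illustrative assumptions (in anomaly segmentation the samples would be pixels).

```python
import math

def fpr_at_tpr(scores, labels, target_tpr=0.95):
    """False-positive rate at the strictest threshold that still reaches target_tpr.

    scores: anomaly scores, higher means more anomalous (assumed convention).
    labels: 1 for anomaly (positive), 0 for normal (negative).
    """
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Number of positives that must score at or above the threshold.
    k = math.ceil(target_tpr * len(pos))
    threshold = pos[k - 1]  # score of the k-th highest-scoring positive
    false_positives = sum(1 for s in neg if s >= threshold)
    return false_positives / len(neg)

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(fpr_at_tpr(scores, labels))
```

The values 3.97 and 3.52 quoted in the abstract are presumably this rate expressed as a percentage.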
These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.
[118] ALDI-ray: Adapting the ALDI Framework for Security X-ray Object Detection
Omid Reza Heidari,Yang Wang,Xinxin Zuo
Main category: cs.CV
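TL;DR: ALDI++ is a domain adaptation framework combining self-distillation, feature alignment, and enhanced training strategies; applied to object detection in security X-ray images, it mitigates domain shift and outperforms existing state-of-the-art methods.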
Details
Motivation: Security X-ray imaging exhibits significant domain shift caused by differences in scanning devices and environments, which degrades model performance and calls for effective domain adaptation to improve cross-domain detection. Method: Applies the ALDI++ framework, which combines self-distillation, feature alignment, and enhanced training strategies, and runs experiments with a Vision Transformer for Detection (ViTDet) backbone. Result: On the EDS dataset, ALDI++ surpasses existing SOTA methods, achieving higher mAP across multiple adaptation scenarios and consistent per-category gains in detection accuracy. Conclusion: ALDI++ offers an efficient solution for domain-adaptive object detection in security X-ray images, setting a new performance benchmark with strong cross-domain generalization and stability. Abstract: Domain adaptation in object detection is critical for real-world applications where distribution shifts degrade model performance. Security X-ray imaging presents a unique challenge due to variations in scanning devices and environmental conditions, leading to significant domain discrepancies. To address this, we apply ALDI++, a domain adaptation framework that integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift effectively in this area. We conduct extensive experiments on the EDS dataset, demonstrating that ALDI++ surpasses the state-of-the-art (SOTA) domain adaptation methods across multiple adaptation scenarios. In particular, ALDI++ with a Vision Transformer for Detection (ViTDet) backbone achieves the highest mean average precision (mAP), confirming the effectiveness of transformer-based architectures for cross-domain object detection. Additionally, our category-wise analysis highlights consistent improvements in detection accuracy, reinforcing the robustness of the model across diverse object classes. Our findings establish ALDI++ as an efficient solution for domain-adaptive object detection, setting a new benchmark for performance stability and cross-domain generalization in security X-ray imagery.
[119] GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization
Zixuan Song,Jing Zhang,Di Wang,Zidie Zhou,Wenbin Liu,Haonan Guo,En Wang,Bo Du
Main category: cs.CV
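TL;DR: Proposes GeoBridge, a cross-view, multimodal foundation model for geo-localization built on a semantic-anchor mechanism, and constructs the large-scale dataset GeoLoc, enabling more robust and flexible geo-localization.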
Details
Motivation: Traditional satellite-centric cross-view geo-localization is not robust when high-resolution or up-to-date satellite imagery is unavailable, and it underuses the complementary information across views and modalities. Method: Proposes GeoBridge, which uses a semantic-anchor mechanism to bridge multi-view features through textual descriptions and supports bidirectional matching and language-to-image retrieval, and constructs GeoLoc, a multimodal, multi-view dataset of over 50,000 samples for training and evaluation. Result: Experiments show that GeoBridge pre-trained on GeoLoc markedly improves geo-localization accuracy and strengthens cross-domain generalization and cross-modal knowledge transfer. Conclusion: By combining the semantic-anchor mechanism with a multimodal aligned dataset, GeoBridge achieves more robust and flexible cross-view geo-localization and advances localization beyond the satellite-centric paradigm. Abstract: Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models were released at https://github.com/MiliLab/GeoBridge.
[120] VLM-Pruner: Buffering for Spatial Sparsity in an Efficient VLM Centrifugal Token Pruning Paradigm
Zhenkai Wu,Xiaowen Ma,Zhenliang Ni,Dengming Zhang,Han Shu,Xin Jiang,Xinghao Chen
Main category: cs.CV
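TL;DR: Proposes VLM-Pruner, a training-free token pruning method for vision-language models that balances redundancy and spatial sparsity to compress visual tokens efficiently and speed up inference.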
Details
Motivation: Existing pruning methods ignore either the redundancy or the spatial relationships among tokens, so the retained tokens can be too sparse to cover the regions of target objects. Method: Proposes a centrifugal pruning paradigm and a Buffering for Spatial Sparsity (BSS) criterion, uses a parallel greedy strategy for efficient token selection, and fuses salient information from discarded tokens into the retained ones. Result: Outperforms strong baselines across five VLMs at an 88.9% pruning rate while delivering an end-to-end inference speedup. Conclusion: VLM-Pruner balances redundancy and spatial distribution, improving pruning efficiency and image understanding performance, and is suited to resource-constrained settings such as mobile devices. Abstract: Vision-language models (VLMs) excel at image understanding tasks, but the large number of visual tokens imposes significant computational costs, hindering deployment on mobile devices. Many pruning methods rely solely on token importance and thus overlook inter-token redundancy, retaining numerous duplicated tokens and wasting capacity. Although some redundancy-aware approaches have been proposed, they often ignore the spatial relationships among visual tokens. This can lead to overly sparse selections of retained tokens that fail to adequately cover the regions of target objects. To address these limitations, we propose VLM-Pruner, a training-free token pruning algorithm that explicitly balances redundancy and spatial sparsity. We introduce a centrifugal token pruning paradigm that enables near-to-far selection while prioritizing the preservation of fine-grained object details. Moreover, we design a Buffering for Spatial Sparsity (BSS) criterion that defers the selection of spatially distant tokens. We further adopt a parallel greedy strategy to conduct token selection efficiently. To mitigate information loss from pruning, we selectively fuse salient information from the discarded tokens into the retained ones. Comprehensive comparisons demonstrate that VLM-Pruner consistently outperforms strong baselines across five VLMs with an 88.9% pruning rate, while delivering an end-to-end inference speedup.
[121] Tissue-mask supported inter-subject whole-body image registration in the UK Biobank -- A method benchmarking study
Yasemin Utkueri,Elin Lundström,Håkan Ahlström,Johan Öfverstedt,Joel Kullberg
Main category: cs.CV
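TL;DR: Proposes a sex-stratified whole-body MR image registration method supported by subcutaneous adipose tissue and muscle masks, which improves the spatial standardization accuracy of UK Biobank data and the quality of correlation analyses for medical research.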
Details
Motivation: Voxel-level spatial standardization of the large-scale UK Biobank whole-body MR images, and region-wise correlation analysis between non-imaging data and image-derived parameters (such as tissue volume or fat content), require more accurate inter-subject image registration. Method: Augments an intensity-based graph-cut registration algorithm with subcutaneous adipose tissue and muscle masks generated by VIBESegmentator, and registers whole-body MR images stratified by sex. Result: Evaluated on 4000 subjects, the method reaches mean Dice scores of 0.77 (males) and 0.75 (females), 6 percentage points above intensity-only registration, and outperforms uniGradICON (by 9/8 pp) and MIRTK (by 12/13 pp); label error frequency decreases, and age-correlation maps are less noisy and better aligned anatomically. Conclusion: Registration supported by tissue masks improves the registration accuracy of UK Biobank whole-body MR images and makes voxel-level medical research analyses more reliable. Abstract: The UK Biobank is a large-scale study collecting whole-body MR imaging and non-imaging health data. Robust and accurate inter-subject image registration of these whole-body MR images would enable their body-wide spatial standardization, and region-/voxel-wise correlation analysis of non-imaging data with image-derived parameters (e.g., tissue volume or fat content). We propose a sex-stratified inter-subject whole-body MR image registration approach that uses subcutaneous adipose tissue- and muscle-masks from the state-of-the-art VIBESegmentator method to augment intensity-based graph-cut registration. The proposed method was evaluated on a subset of 4000 subjects by comparing it to an intensity-only method as well as two previously published registration methods, uniGradICON and MIRTK. The evaluation comprised overlap measures applied to the 71 VIBESegmentator masks: 1) Dice scores, and 2) voxel-wise label error frequency. Additionally, voxel-wise correlation between age and each of fat content and tissue volume was studied to exemplify the usefulness for medical research. The proposed method exhibited a mean Dice score of 0.77 / 0.75 across the cohort and the 71 masks for males/females, respectively. When compared to the intensity-only registration, the mean values were 6 percentage points (pp) higher for both sexes, and the label error frequency was decreased in most tissue regions. These differences were 9pp / 8pp against uniGradICON and 12pp / 13pp against MIRTK. Using the proposed method, the age-correlation maps were less noisy and showed higher anatomical alignment.
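The Dice scores reported above measure overlap between a registered (warped) segmentation mask and the reference mask: Dice = 2|A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (identical masks). A minimal sketch follows, assuming binary masks flattened to equal-length sequences; `dice_score` is an illustrative name, and the convention of returning 1.0 for two empty masks is an assumption rather than something stated in the abstract.

```python
def dice_score(mask_a, mask_b):
    """Dice overlap 2|A∩B| / (|A| + |B|) for two equally sized binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_sum = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum

warped = [1, 1, 1, 0, 0]     # toy mask after registration
reference = [0, 1, 1, 1, 0]  # toy reference mask
print(dice_score(warped, reference))
```

A mean Dice score such as 0.77 would then be an average of this quantity over subjects and over the 71 masks.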
In conclusion, the image registration method using two tissue masks improves whole-body registration of UK Biobank images.
[122] GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding
Peirong Zhang,Yidan Zhang,Luxiao Xu,Jinliang Lin,Zonghao Guo,Fengxiang Wang,Xue Yang,Kaiwen Wei,Lei Wang
Main category: cs.CV
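TL;DR: This paper proposes GeoViS, a geospatially rewarded visual search framework that performs fine-grained visual grounding in remote sensing images through tree-structured, step-by-step reasoning, improving the handling of extremely small targets and complex spatial relations.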
Details
Motivation: Targets in remote sensing images are extremely small and queries often involve complex geospatial relations, so existing single-step grounding methods struggle to balance cross-modal alignment with global scene understanding. Method: Reformulates remote sensing visual grounding as a progressive search-and-reasoning process over a tree-structured sequence of visual cues, combining multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. Result: Experiments on five remote sensing grounding benchmarks show that GeoViS outperforms existing methods on key metrics, with precise geospatial understanding, strong cross-domain generalization, and good interpretability. Conclusion: By introducing structured search and a reward mechanism, GeoViS addresses small-target localization and complex spatial relation modeling in remote sensing images and suggests a new direction for multimodal models in geospatial understanding. Abstract: Recent advances in multimodal large language models (MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.
[123] DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions
Yifan Zhou,Takehiko Ohkawa,Guwenxiao Zhou,Kanoko Goto,Takumi Hirose,Yusuke Sekikawa,Nakamasa Inoue
Main category: cs.CV
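TL;DR: Proposes Deformable Mamba (DF-Mamba), a framework based on the state space model Mamba for visual feature extraction in 3D hand pose estimation; deformable state scanning and selective state modeling capture global context, yielding state-of-the-art performance across diverse scenarios.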
Details
Motivation: Most existing 3D hand pose estimation methods rely on CNN architectures such as ResNet, whose convolutional inductive bias makes it hard to model the relationship between local features and global context under severe occlusion, limiting performance. Method: Proposes DF-Mamba, which combines Mamba's selective state modeling with a newly designed deformable state scanning mechanism that aggregates local features after convolution, to better capture global context cues across joints, hands, and the scene. Result: In extensive experiments on five datasets covering single-hand and two-hand scenarios, hand-object interaction, and RGB and depth inputs, DF-Mamba outperforms recent backbones such as VMamba and Spatial-Mamba and achieves state-of-the-art accuracy at an inference speed comparable to ResNet-50. Conclusion: By introducing deformable scanning and selective state modeling, DF-Mamba strengthens feature representation under occlusion in 3D hand pose estimation and demonstrates the efficiency and potential of state space models for this task. Abstract: Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and such CNN's inductive bias may not be optimal for 3D HPE due to its limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five divergent datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation.
DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.
[124] Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone
Tristan Amadei,Enric Meinhardt-Llopis,Benedicte Bascle,Corentin Abgrall,Gabriele Facciolo
Main category: cs.CV
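TL;DR: Proposes a training paradigm that needs no paired UAV-satellite image data: by simulating the visual domain shift between satellite and UAV views, it enables image-based localization in GNSS-denied environments.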
Details
Motivation: Existing methods that match UAV images to satellite images depend on large amounts of paired training data, which are costly and hard to obtain, limiting their applicability; a method that does not need UAV imagery for training is therefore required. Method: Adopts a self-supervised training paradigm based on satellite images only, designs a dedicated augmentation strategy that simulates the visual domain shift from satellite view to UAV view, and proposes the CAEVL model to exploit this paradigm. Result: Validated on ViLD, a newly released dataset of real-world UAV images, the method performs competitively with existing approaches trained on paired data and generalizes well. Conclusion: The proposed training paradigm and the CAEVL model achieve efficient image-based localization without paired UAV-satellite data and are practical and broadly applicable. Abstract: Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.
[125] Reasoning-Aware Multimodal Fusion for Hateful Video Detection
Shuonan Yang,Tailin Chen,Jiangbei Yue,Guangliang Cheng,Jianbo Jiao,Zeyu Fu
Main category: cs.CV
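TL;DR: Proposes RAMF, a new multimodal fusion framework for detecting hateful content in online videos, which strengthens multimodal semantic interaction through local-global context fusion and semantic cross attention and introduces adversarial reasoning to better understand subtle hateful intent.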
Details
Motivation: Existing methods struggle to fuse multimodal semantic relationships effectively and cannot understand nuanced hateful content, a weakness that grows as video content becomes more context-dependent. Method: Designs Local-Global Context Fusion (LGCF) to capture local salient cues and global temporal structure, proposes Semantic Cross Attention (SCA) for fine-grained multimodal interaction, and introduces three-stage adversarial reasoning (objective description, hate-assumed inference, and non-hate-assumed inference) to enrich contextual understanding. Result: On two real-world hateful video datasets, the method improves Macro-F1 and hate-class recall by 3% and 7%, respectively, over the previous best methods, with stronger generalization. Conclusion: The proposed RAMF framework fuses multimodal information more effectively and understands nuanced hateful content in complex contexts, improving hateful video detection. Abstract: Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process where a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences, providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.
[126] AttMetNet: Attention-Enhanced Deep Neural Network for Methane Plume Detection in Sentinel-2 Satellite Imagery
Rakib Ahsan,MD Sadik Hossain Shanto,Md Sultanul Arifin,Tanzima Hashem
Main category: cs.CV
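TL;DR: This paper proposes AttMetNet, a novel attention-enhanced deep learning framework for methane plume detection in Sentinel-2 satellite imagery. It combines the Normalized Difference Methane Index (NDMI) with an attention-enhanced U-Net and uses focal loss to handle class imbalance, achieving a lower false positive rate and better detection performance on real data.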
Details
Motivation: Accurate detection of methane emissions is essential for mitigating global warming, but traditional methods produce many false positives over complex land-cover backgrounds, and existing deep learning models lack mechanisms that focus on methane-specific features. Method: Proposes AttMetNet, which fuses NDMI with an attention-enhanced U-Net, using attention to amplify methane absorption features and suppress background noise, and adopts focal loss to mitigate the severe imbalance between positive and negative samples. Result: Experiments on a real methane plume dataset show that AttMetNet achieves a lower false positive rate, a better precision-recall balance, and higher intersection over union (IoU) than existing methods. Conclusion: AttMetNet is a deep learning model tailored to methane plume detection in satellite imagery; feature fusion and attention improve detection accuracy and robustness, making it suitable for practical use. Abstract: Methane is a powerful greenhouse gas that contributes significantly to global warming. Accurate detection of methane emissions is the key to taking timely action and minimizing their impact on climate change. We present AttMetNet, a novel attention-enhanced deep learning framework for methane plume detection with Sentinel-2 satellite imagery. The major challenge in developing a methane detection model is to accurately identify methane plumes from Sentinel-2's B11 and B12 bands while suppressing false positives caused by background variability and diverse land cover types. Traditional detection methods typically depend on the differences or ratios between these bands when comparing the scenes with and without plumes. However, these methods often require verification by a domain expert because they generate numerous false positives. Recent deep learning methods make some improvements using CNN-based architectures, but lack mechanisms to prioritize methane-specific features. AttMetNet introduces a methane-aware architecture that fuses the Normalized Difference Methane Index (NDMI) with an attention-enhanced U-Net. By jointly exploiting NDMI's plume-sensitive cues and attention-driven feature selection, AttMetNet selectively amplifies methane absorption features while suppressing background noise. This integration establishes a first-of-its-kind architecture tailored for robust methane plume detection in real satellite imagery. Additionally, we employ focal loss to address the severe class imbalance arising from both limited positive plume samples and sparse plume pixels within imagery.
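To illustrate why focal loss helps with sparse plume pixels: it scales the usual cross-entropy by (1 - p_t)^gamma, so confidently classified background pixels contribute little and the rare, hard plume pixels dominate the gradient. The sketch below is the standard binary focal loss for a single pixel, FL = -alpha_t (1 - p_t)^gamma log(p_t), and is not taken from the AttMetNet code. The abstract does not state its hyperparameters; gamma=2.0 and alpha=0.25 are the commonly used defaults from the original focal loss paper and are assumptions here, as is the function name.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one pixel.

    p: predicted probability that the pixel is plume (class 1).
    y: ground-truth label, 1 for plume and 0 for background.
    gamma, alpha: assumed defaults, not values reported for AttMetNet.
    """
    p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy background pixel is down-weighted far more than a hard plume pixel.
print(binary_focal_loss(0.05, 0))  # easy negative: tiny loss
print(binary_focal_loss(0.05, 1))  # hard positive: large loss
```

Setting gamma=0 and alpha=0.5 recovers half the ordinary binary cross-entropy, which is a quick sanity check on any implementation.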
Furthermore, AttMetNet is trained on a real methane plume dataset, making it more robust in practical scenarios. Extensive experiments show that AttMetNet surpasses recent methods in methane plume detection with a lower false positive rate, a better precision-recall balance, and higher IoU.
[127] Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset
Qifan Liang,Junlin Li,Zhen Han,Xihao Wang,Zhongyuan Wang,Bin Mei
Main category: cs.CV
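TL;DR: Proposes STANet, a smoke-type-aware laparoscopic video desmoking network that distinguishes Diffusion Smoke from Ambient Smoke and jointly performs segmentation and reconstruction, improving desmoking quality and generalization to downstream surgical tasks.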
Details
Motivation: Existing desmoking methods ignore the different types of surgical smoke and their spatio-temporal characteristics, which leads to poor desmoking and impairs visual guidance in laparoscopic surgery. Method: Proposes STANet, consisting of a smoke mask segmentation sub-network (combining attention-weighted mask aggregation with a coarse-to-fine disentanglement module) and a smokeless video reconstruction sub-network that uses the two types of smoke masks to guide feature desmoking. Result: Validated on the authors' large-scale synthetic dataset, STANet outperforms existing state-of-the-art methods in desmoking quality and generalizes better across multiple downstream surgical tasks. Conclusion: Smoke-type awareness improves laparoscopic video desmoking, and STANet offers a new approach to surgical vision restoration. Abstract: Electrocautery or lasers will inevitably generate surgical smoke, which hinders the visual guidance of laparoscopic videos for surgical procedures. The surgical smoke can be classified into different types based on its motion patterns, leading to distinctive spatio-temporal characteristics across smoky laparoscopic videos. However, existing desmoking methods fail to account for such smoke-type-specific distinctions. Therefore, we propose the first Smoke-Type-Aware Laparoscopic Video Desmoking Network (STANet) by introducing two smoke types: Diffusion Smoke and Ambient Smoke. Specifically, a smoke mask segmentation sub-network is designed to jointly conduct smoke mask and smoke type predictions based on the attention-weighted mask aggregation, while a smokeless video reconstruction sub-network is proposed to perform specialized desmoking on smoky features guided by the two types of smoke masks. To address the entanglement challenges of two smoke types, we further embed a coarse-to-fine disentanglement module into the mask segmentation sub-network, which yields more accurate disentangled masks through the smoke-type-aware cross attention between non-entangled and entangled regions. In addition, we also construct the first large-scale synthetic video desmoking dataset with smoke type annotations. Extensive experiments demonstrate that our method not only outperforms state-of-the-art approaches in quality evaluations, but also exhibits superior generalization across multiple downstream surgical tasks.
[128] LumiX: Structured and Coherent Text-to-Intrinsic Generation
Xu Han,Biao Zhang,Xiangjun Tang,Xianzhi Li,Peter Wonka
Main category: cs.CV
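TL;DR: LumiX is a structured diffusion framework for text-to-intrinsic generation that jointly generates multiple intrinsic maps (such as albedo, irradiance, and normal) and uses Query-Broadcast Attention and Tensor LoRA for structural consistency and efficient training.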
Details
Motivation: Existing text-to-image generation methods cannot guarantee physically meaningful internal consistency in their outputs and lack a unified model of a scene's intrinsic properties, so a framework is needed that generates structurally coherent and physically plausible sets of intrinsic maps from text. Method: Proposes LumiX, which introduces Query-Broadcast Attention, sharing queries across maps in the self-attention blocks to keep the intrinsic maps structurally consistent, and Tensor LoRA, a tensorized low-rank adaptation that models cross-map relations for parameter-efficient joint training. Result: LumiX achieves 23% higher alignment than existing methods and a preference score of 0.19 (versus -0.41), and also supports image-conditioned intrinsic decomposition, showing strong consistency and generalization. Conclusion: LumiX achieves consistent text-to-multi-intrinsic generation; its structured design and parameter-efficient adaptation offer a new approach to physically plausible content generation and support both generation and decomposition in one framework. Abstract: We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.
[129] TrackNetV5: Residual-Driven Spatio-Temporal Refinement and Motion Direction Decoupling for Fast Object Tracking
Tang Haonan,Chen Yanjun,Jiang Lezhi
Main category: cs.CV
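TL;DR: Proposes TrackNetV5, which uses Motion Direction Decoupling and Residual-Driven Spatio-Temporal Refinement to improve tracking of small moving objects, handling occlusion and directional ambiguity better than earlier versions.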
Details
Motivation: The existing TrackNet series is limited in handling occlusion and directional ambiguity; in particular, V4 uses absolute differences, which discard motion polarity and make direction hard to determine. Method: Introduces the Motion Direction Decoupling (MDD) module to preserve motion polarity and designs a Transformer-based Residual-Driven Spatio-Temporal Refinement (R-STR) head that uses spatio-temporal context to estimate a corrective residual and recover occluded targets. Result: Reaches an F1-score of 98.59% and an accuracy of 97.33% on the TrackNetV2 dataset, leading prior versions with only a 3.7% increase in FLOPs and retaining real-time inference. Conclusion: By explicitly modeling motion direction and applying residual-driven refinement, TrackNetV5 addresses occlusion and directional ambiguity for small objects and delivers efficient, accurate tracking of moving targets. Abstract: The TrackNet series has established a strong baseline for fast-moving small object tracking in sports. However, existing iterations face significant limitations: V1-V3 struggle with occlusions due to a reliance on purely visual cues, while TrackNetV4, despite introducing motion inputs, suffers from directional ambiguity as its absolute difference method discards motion polarity. To overcome these bottlenecks, we propose TrackNetV5, a robust architecture integrating two novel mechanisms. First, to recover lost directional priors, we introduce the Motion Direction Decoupling (MDD) module. Unlike V4, MDD decomposes temporal dynamics into signed polarity fields, explicitly encoding both movement occurrence and trajectory direction. Second, we propose the Residual-Driven Spatio-Temporal Refinement (R-STR) head. Operating on a coarse-to-fine paradigm, this Transformer-based module leverages factorized spatio-temporal contexts to estimate a corrective residual, effectively recovering occluded targets. Extensive experiments on the TrackNetV2 dataset demonstrate that TrackNetV5 achieves a new state-of-the-art F1-score of 0.9859 and an accuracy of 0.9733, significantly outperforming previous versions. Notably, this performance leap is achieved with a marginal 3.7% increase in FLOPs compared to V4, maintaining real-time inference capabilities while delivering superior tracking precision.
[130] UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits
Keming Ye,Zhipeng Huang,Canmiao Fu,Qingyang Liu,Jiani Cai,Zheqi Lv,Chen Li,Jing Lyu,Zhou Zhao,Shengyu Zhang
Main category: cs.CV
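TL;DR: This paper proposes a lightweight data pipeline that produces UnicEdit-10M, a large-scale, high-quality image editing dataset, and builds the comprehensive benchmark UnicBench to evaluate models on basic and complex editing tasks, exposing the limitations of existing models and pointing to directions for future research.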
Details
Motivation: The performance gap between open-source and closed-source models on image editing keeps widening because large-scale, high-quality training data are scarce and no comprehensive benchmark diagnoses model weaknesses across diverse editing behaviors. Method: Designs an end-to-end data pipeline that uses Qwen-Verify, a 7B dual-task expert model, for efficient failure detection and instruction recaptioning, and builds UnicEdit-10M, a dataset of 10 million samples; also proposes the UnicBench benchmark with new metrics (such as Non-edit Consistency and Reasoning Accuracy) for fine-grained evaluation. Result: UnicEdit-10M and UnicBench were built successfully; experiments show that mainstream models have clear weaknesses in spatial reasoning and knowledge-driven tasks, and that the new metrics diagnose these defects effectively. Conclusion: With an efficient data construction method and a fine-grained evaluation suite, this work offers a practical path to narrowing the gap between open-source and closed-source models and gives clear optimization directions for image editing models. Abstract: With the rapid advances of powerful multimodal models such as GPT-4o, Nano Banana, and Seedream 4.0 in image editing, the performance gap between closed-source and open-source models is widening, primarily due to the scarcity of large-scale, high-quality training data and comprehensive benchmarks capable of diagnosing model weaknesses across diverse editing behaviors. Existing data construction methods face a scale-quality trade-off: human annotations are high-quality but not scalable, while automated pipelines suffer from error propagation and noise. To address this, we introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks. We also propose UnicBench, a general benchmark that extends beyond basic edits to explicitly assess spatial and knowledge-driven reasoning. To enable fine-grained diagnosis, we introduce novel metrics, including Non-edit Consistency and Reasoning Accuracy. Our analysis of mainstream models on UnicBench reveals their limitations and provides clear directions for future research.
[131] HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval
Zhiwei Chen,Yupeng Hu,Zixu Li,Zhiheng Fu,Haokun Wen,Weili Guan
Main category: cs.CV
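TL;DR: This paper proposes HUD, a new framework for composed video retrieval that is the first to exploit the difference in information density between the video and text modalities to improve query understanding; three key components address referring ambiguity and limited attention to fine-grained semantics, reaching SOTA performance on both CVR and CIR tasks.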
Details
Motivation: Existing CVR methods overlook the difference in information density between the video and text modalities, which leads to ambiguity about the subject being modified and limited focus on detailed semantics, hurting retrieval performance. Method: Proposes the Hierarchical Uncertainty-aware Disambiguation network (HUD), with three modules (Holistic Pronoun Disambiguation, Atomistic Uncertainty Modeling, and Holistic-to-Atomistic Alignment) that use cross-modal interaction for coarse-grained semantic matching and fine-grained semantic alignment. Result: HUD achieves state-of-the-art performance on multiple CVR and CIR benchmark datasets, confirming its effectiveness for understanding complex multi-modal queries. Conclusion: By explicitly modeling the information density gap between video and text, HUD improves object disambiguation and fine-grained semantic learning in multi-modal queries and offers a new approach for CVR and CIR. Abstract: Composed Video Retrieval (CVR) is a challenging video retrieval task that utilizes multi-modal queries, consisting of a reference video and modification text, to retrieve the desired target video. The core of this task lies in understanding the multi-modal composed query and achieving accurate composed feature learning. Within multi-modal queries, the video modality typically carries richer semantic content compared to the textual modality. However, previous works have largely overlooked the disparity in information density between these two modalities. This limitation can lead to two critical issues: 1) modification subject referring ambiguity and 2) limited detailed semantic focus, both of which degrade the performance of CVR models. To address the aforementioned issues, we propose a novel CVR framework, namely the Hierarchical Uncertainty-aware Disambiguation network (HUD). HUD is the first framework that leverages the disparity in information density between video and text to enhance multi-modal query understanding. It comprises three key components: (a) Holistic Pronoun Disambiguation, (b) Atomistic Uncertainty Modeling, and (c) Holistic-to-Atomistic Alignment. By exploiting overlapping semantics through holistic cross-modal interaction and fine-grained semantic alignment via atomistic-level cross-modal interaction, HUD enables effective object disambiguation and enhances the focus on detailed semantics, thereby achieving precise composed feature learning.
Moreover, our proposed HUD is also applicable to the Composed Image Retrieval (CIR) task and achieves state-of-the-art performance across three benchmark datasets for both CVR and CIR tasks. The codes are available on https://zivchen-ty.github.io/HUD.github.io/.[132] IC-World: In-Context Generation for Shared World Modeling
Fan Wu,Jiacheng Wei,Ruibo Li,Yi Xu,Junyou Li,Deheng Ye,Guosheng Lin
Main category: cs.CV
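TL;DR: This paper proposes IC-World, a shared world modeling framework built on large video models. Using in-context generation and reinforcement learning optimization, it generates multi-view videos in parallel and significantly outperforms existing methods in geometry and motion consistency.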
Details
Motivation: Existing video world models mostly focus on single-view generation and do not model scene and object consistency across views. This paper addresses shared world modeling: generating multiple consistent videos from input images taken at different camera poses. Method: Proposes the IC-World framework, which uses the in-context generation capability of large video models to generate for all input images in parallel; fine-tunes with reinforcement learning via Group Relative Policy Optimization; and designs two new reward models to strengthen scene-level geometry consistency and object-level motion consistency. Result: Experiments show that IC-World significantly outperforms state-of-the-art methods on geometry and motion consistency metrics, in what is the first systematic exploration of shared world modeling with video models. Conclusion: IC-World is the first work to apply large video models to shared world modeling. Combining in-context generation with reinforcement learning, it achieves consistent multi-view video generation and suggests a new direction for dynamic scene modeling. Abstract: Video-based world models have recently garnered increasing attention for their ability to synthesize diverse and dynamic visual environments. In this paper, we focus on shared world modeling, where a model generates multiple videos from a set of input images, each representing the same underlying world from a different camera pose. We propose IC-World, a novel generation framework that enables parallel generation for all input images by activating the inherent in-context generation capability of large video models. We further finetune IC-World via reinforcement learning, Group Relative Policy Optimization, together with two proposed novel reward models to enforce scene-level geometry consistency and object-level motion consistency among the set of generated videos. Extensive experiments demonstrate that IC-World substantially outperforms state-of-the-art methods in both geometry and motion consistency. To the best of our knowledge, this is the first work to systematically explore the shared world modeling problem with video-based world models.
[133] PhyCustom: Towards Realistic Physical Customization in Text-to-Image Generation
Fan Wu,Cheng Chen,Zhoujie Fu,Jiacheng Wei,Yi Xu,Deheng Ye,Guosheng Lin
Main category: cs.CV
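TL;DR: This paper proposes PhyCustom, a fine-tuning framework for diffusion models that introduces two new regularization losses (an isometric loss and a decouple loss) to enable text-to-image customization of physical concepts, significantly improving the accuracy of physical properties in generated results.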
Details
Motivation: Existing text-to-image customization methods handle concrete concepts such as style and shape well, but struggle to accurately reflect the physical properties tied to physical concepts (such as gravity and elasticity), mainly because physical knowledge is not explicitly introduced during training. Method: Proposes the PhyCustom framework with two novel regularization losses: an isometric loss that activates the diffusion model to learn physical concepts, and a decouple loss that prevents mixed learning of independent concepts, thereby improving physical customization. Result: Experiments on a diverse dataset show, quantitatively and qualitatively, that PhyCustom outperforms existing state-of-the-art and popular methods in physical customization. Conclusion: PhyCustom strengthens diffusion models' understanding and control of physical concepts and advances text-to-image generation in complex physical scenarios. Abstract: Recent diffusion-based text-to-image customization methods have achieved significant success in understanding concrete concepts to control generation processes, such as styles and shapes. However, few efforts dive into the realistic yet challenging customization of physical concepts. The core limitation of current methods arises from the absence of explicitly introduced physical knowledge during training. Even when physics-related words appear in the input text prompts, our experiments consistently demonstrate that these methods fail to accurately reflect the corresponding physical properties in the generated results. In this paper, we propose PhyCustom, a fine-tuning framework comprising two novel regularization losses to activate diffusion models to perform physical customization. Specifically, the proposed isometric loss aims at activating diffusion models to learn physical concepts, while the decouple loss helps to eliminate the mixture learning of independent concepts. Experiments are conducted on a diverse dataset and our benchmark results demonstrate that PhyCustom outperforms previous state-of-the-art and popular methods in terms of physical customization quantitatively and qualitatively.
[134] Defense That Attacks: How Robust Models Become Better Attackers
Mohamed Awad,Mahmoud Akrm,Walid Gomaa
Main category: cs.CV
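TL;DR: Adversarial training improves model robustness, but it unexpectedly increases the transferability of adversarial examples, creating a new ecosystem risk.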
Details
Motivation: To investigate whether adversarial training unintentionally increases the transferability of adversarial examples and thereby affects the overall security ecosystem. Method: Trains a model zoo of 36 CNN and ViT models and runs comprehensive transferability experiments. Result: Adversarially trained models produce perturbations that transfer better than those from standard models, revealing a paradox between robustness and transferability. Conclusion: Robustness evaluations should consider both a model's resistance to transferred attacks and its propensity to produce transferable adversarial examples. Abstract: Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, which introduces a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.
[135] Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?
Manuel Benavent-Lledo,Konstantinos Bacharidis,Victoria Manousaki,Konstantinos Papoutsakis,Antonis Argyros,Jose Garcia-Rodriguez
Main category: cs.CV
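TL;DR: This paper proposes AAG, a single-frame action anticipation method that combines RGB and depth features with prior action information, achieving competitive performance without relying on temporal video aggregation.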
Details
Motivation: To explore whether a model, like a human, can predict upcoming actions from a single observed frame given sufficient context, reducing the reliance on temporal video aggregation. Method: Proposes AAG, which fuses single-frame RGB features with depth cues to strengthen spatial reasoning, and incorporates prior action information, supplied by textual summaries or a single-frame action recognizer, as long-term context. Result: On three instructional activity datasets (IKEA-ASM, Meccano, and Assembly101), AAG performs strongly at single-frame multimodal action anticipation, on par with temporally aggregated baselines and current state-of-the-art methods. Conclusion: By combining multimodal information, single-frame action anticipation can effectively replace some temporal modeling paradigms, especially in context-rich scenes. Abstract: Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.
[136] Are Detectors Fair to Indian IP-AIGC? A Cross-Generator Study
Vishal Dubey,Pallavi Tyagi
Main category: cs.CV
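TL;DR: This study is the first systematic evaluation of IP-AIGC (identity-preserving AI-generated content) detection for Indian and South Asian populations. It finds that existing detectors fall well short in cross-generator generalization and in performance on this specific group: fine-tuning improves in-domain performance but reduces generalization to new IP-AIGC data, indicating overfitting to generator-specific features. The study builds test sets focused on Indian subjects, reveals the fragility of current AIGC detection under identity-preserving edits, and stresses the need for adaptation methods that preserve identity representations and for more representative benchmark datasets.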
Details
Motivation: The robustness and fairness of current AIGC detectors under identity-preserving edits (such as changed attire or background) remain unclear, and under-represented populations such as Indian and South Asian people have not been studied systematically. Existing models may rely too heavily on generator-specific artifacts, leading to poor cross-generator generalization, so their suitability for real-world use needs to be evaluated and improved. Method: Builds Indian-focused training subsets from FairFD and HAV-DF, and uses commercial generators (Gemini, ChatGPT) to create two new identity-preserving AIGC test sets (HIDF-img-ip-genai and HIDF-vid-ip-genai). Evaluates two state-of-the-art detectors (AIDE and Effort) in pretrained (PT) and fine-tuned (FT) modes, using AUC, AP, EER, and accuracy, and analyzes the performance gap between in-domain and cross-generator settings. Result: Fine-tuning markedly improves detector performance on the original test sets (e.g., Effort AUC rises from 0.739 to 0.944) but clearly degrades it on the newly built IP-AIGC test sets (e.g., AIDE AUC drops from 0.923 to 0.563), showing that the models overfit to generator cues in the training data. On non-IP-AIGC images the pretrained models still detect well, so the degradation is specific to identity-preserving edits rather than a general distribution shift. Conclusion: Current AIGC detectors generalize poorly to identity-preserving generated content of Indian subjects, and fine-tuning can induce dependence on generator-specific features that undermines real-world deployment. The study establishes IP-AIGC-Indian as a challenging and practically relevant new task, and calls for adaptation methods that preserve identity representations and for benchmarks better matched to Indian populations. Abstract: Modern image editors can produce identity-preserving AIGC (IP-AIGC), where the same person appears with new attire, background, or lighting. The robustness and fairness of current detectors in this regime remain unclear, especially for under-represented populations. We present what we believe is the first systematic study of IP-AIGC detection for Indian and South-Asian faces, quantifying cross-generator generalization and intra-population performance. We assemble Indian-focused training splits from FairFD and HAV-DF, and construct two held-out IP-AIGC test sets (HIDF-img-ip-genai and HIDF-vid-ip-genai) using commercial web-UI generators (Gemini and ChatGPT) with identity-preserving prompts. We evaluate two state-of-the-art detectors (AIDE and Effort) under pretrained (PT) and fine-tuned (FT) regimes and report AUC, AP, EER, and accuracy. Fine-tuning yields strong in-domain gains (for example, Effort AUC 0.739 to 0.944 on HAV-DF-test; AIDE EER 0.484 to 0.259), but consistently degrades performance on held-out IP-AIGC for Indian cohorts (for example, AIDE AUC 0.923 to 0.563 on HIDF-img-ip-genai; Effort 0.740 to 0.533), which indicates overfitting to training-generator cues. On non-IP HIDF images, PT performance remains high, which suggests a specific brittleness to identity-preserving edits rather than a generic distribution shift. Our study establishes IP-AIGC-Indian as a challenging and practically relevant scenario and motivates representation-preserving adaptation and India-aware benchmark curation to close generalization gaps in AIGC detection.
[137] RFOP: Rethinking Fusion and Orthogonal Projection for Face-Voice Association
Abdul Hannan,Furqan Malik,Hina Jabbar,Syed Suleman Sadiq,Mubashir Noman
Main category: cs.CV
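TL;DR: This paper studies face-voice association in multilingual settings and proposes an effective fusion and orthogonal projection method. It performs well on the English-German data of the FAME 2026 challenge, ranking third with an EER of 33.1.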
Details
Motivation: To explore cross-modal face-voice association in multilingual scenarios and improve association performance across language environments. Method: Uses a fusion strategy and orthogonal projection, focusing on the relevant semantic information in the two modalities to strengthen association. Result: Achieves an EER of 33.1 on the English-German data split, ranking third in the FAME 2026 challenge. Conclusion: The proposed method is effective for multilingual face-voice association, confirming that fusion and orthogonal projection benefit cross-modal matching. Abstract: The Face-voice Association in Multilingual Environments (FAME) 2026 challenge aims to investigate the face-voice association task in multilingual scenarios. The challenge introduces English-German face-voice pairs to be utilized in the evaluation phase. To this end, we revisit fusion and orthogonal projection for face-voice association by effectively focusing on the relevant semantic information within the two modalities. Our method performs favorably on the English-German data split and ranked 3rd in the FAME 2026 challenge, achieving an EER of 33.1.
[138] MICCAI STSR 2025 Challenge: Semi-Supervised Teeth and Pulp Segmentation and CBCT-IOS Registration
Yaqi Wang,Zhi Li,Chengyu Wu,Jun Liu,Yifan Zhang,Jialuo Chen,Jiaxue Ni,Qian Luo,Jin Liu,Can Han,Changkai Ji,Zhi Qin Tan,Ajo Babu George,Liangyu Chen,Qianni Zhang,Dahong Qian,Shuai Wang,Huiyu Zhou
Main category: cs.CV
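TL;DR: The STSR 2025 Challenge advances semi-supervised learning in dental imaging for teeth and pulp canal segmentation and for registration of CBCT with intraoral scans, combining deep learning with public data to promote reproducible research.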
Details
Motivation: Because annotated data is scarce, automated pulp canal segmentation and cross-modal registration in dental imaging remain difficult, and effective semi-supervised methods are needed to improve performance. Method: Provides labeled and unlabeled CBCT and intraoral scan data and organizes two tasks: semi-supervised segmentation and semi-supervised rigid registration. Participating methods use deep learning approaches including nnU-Net, Mamba-like models, and PointNetLK combined with differentiable SVD and geometric augmentation. Result: The segmentation task reached a Dice score of 0.967 and an Instance Affinity of 0.738; registration methods achieved accurate alignment through hybrid neural-classical refinement. Conclusion: Semi-supervised learning can effectively address the shortage of annotated data in dental imaging, and open data and code help advance the field and support reproducible research. Abstract: Cone-Beam Computed Tomography (CBCT) and Intraoral Scanning (IOS) are essential for digital dentistry, but annotated data scarcity limits automated solutions for pulp canal segmentation and cross-modal registration. To benchmark semi-supervised learning (SSL) in this domain, we organized the STSR 2025 Challenge at MICCAI 2025, featuring two tasks: (1) semi-supervised segmentation of teeth and pulp canals in CBCT, and (2) semi-supervised rigid registration of CBCT and IOS. We provided 60 labeled and 640 unlabeled IOS samples, plus 30 labeled and 250 unlabeled CBCT scans with varying resolutions and fields of view. The challenge attracted strong community participation, with top teams submitting open-source deep learning-based SSL solutions. For segmentation, leading methods used nnU-Net and Mamba-like State Space Models with pseudo-labeling and consistency regularization, achieving a Dice score of 0.967 and Instance Affinity of 0.738 on the hidden test set. For registration, effective approaches combined PointNetLK with differentiable SVD and geometric augmentation to handle modality gaps; hybrid neural-classical refinement enabled accurate alignment despite limited labels. All data and code are publicly available at https://github.com/ricoleehduu/STS-Challenge-2025 to ensure reproducibility.
[139] Taming Camera-Controlled Video Generation with Verifiable Geometry Reward
Zhaoqing Wang,Xiaobo Xia,Zhuolin Bie,Jinlin Liu,Dongdong Yu,Jia-Wang Bian,Changhu Wang
Main category: cs.CV
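TL;DR: This paper proposes an online reinforcement learning (RL) post-training framework for video diffusion models that uses a verifiable geometry reward to improve the accuracy and geometric consistency of camera-controlled video generation.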
Details
Motivation: Although video diffusion models have advanced camera-controlled video generation, existing methods rely mainly on supervised fine-tuning (SFT), and online RL post-training remains largely unexplored. This paper aims to fill that gap and improve camera-control accuracy and geometric consistency in generated videos. Method: Proposes an online RL post-training framework with a verifiable geometry reward based on 3D camera trajectories: the camera trajectories of the generated and reference videos are split into segments, per-segment relative poses are computed and compared, and the resulting dense segment-level feedback serves as the reward signal. A dataset with large-amplitude camera motion and diverse dynamic scenes is also built to support training. Result: Experiments show the method clearly outperforms SFT baselines in camera-control accuracy, geometric consistency, and visual quality, confirming the value of online RL for post-training video generators. Conclusion: The proposed online RL post-training framework, combined with the geometry reward, effectively optimizes pretrained video generators and performs strongly on camera-controlled video generation, offering a new direction for follow-up research. Abstract: Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
[140] MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm
Wei Chen,Chaoqun Du,Feng Gu,Wei He,Qizhen Li,Zide Liu,Xuhao Pan,Chang Ren,Xudong Rao,Chenfeng Wang,Tao Wei,Chengjun Yu,Pengfei Yu,Yufei Zheng,Chunpeng Zhou,Pan Zhou,Xuhan Zhu
Main category: cs.CV
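TL;DR: MindGPT-4ov proposes a general post-training paradigm for multimodal large language models, covering data generation, model training, and efficient deployment. It reaches SOTA performance on multiple benchmarks at reduced cost, and the model weights, datasets, and code will be open-sourced.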
Details
Motivation: To improve the foundational capabilities and generalization of multimodal large language models, to balance data quality, knowledge injection, and capability retention during post-training, and to ease the transition from academic research to industrial application. Method: Proposes three key techniques: an information density-based data generation scheme with a tree-structured label system; collaborative curriculum supervised fine-tuning; and a hybrid reinforcement learning paradigm. These are combined with infrastructure improvements such as 5D parallel training, operator optimization, and inference quantization. Result: Outperforms state-of-the-art models on benchmarks including MMBench, MMStar, MathVision, and MathVista, and shows superior user experience and efficient domain adaptation on vertical-domain tasks. Conclusion: MindGPT-4ov provides a general post-training framework applicable to a wide range of multimodal large language models, improving performance, generalization, and deployment efficiency and advancing the development and adoption of MLLMs. Abstract: We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities and generalization ability of MLLMs. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be open-sourced soon to support the community's development of MLLMs.
[141] Polar Perspectives: Evaluating 2-D LiDAR Projections for Robust Place Recognition with Visual Foundation Models
Pierpaolo Serio,Giulio Pisaneschi,Andrea Dan Ryals,Vincenzo Infantino,Lorenzo Gentilini,Valentina Donzella,Lorenzo Pollini
Main category: cs.CV
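TL;DR: This paper systematically studies how different LiDAR-to-image projections affect metric place recognition with state-of-the-art vision foundation models. It introduces a modular retrieval pipeline and shows that carefully designed projections can effectively substitute for end-to-end 3D learning.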
Details
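Motivation: To explore how different LiDAR-to-image projections affect place recognition performance with vision foundation models, with the goal of improving robustness and practicality. Method: Proposes a modular retrieval pipeline that controls for backbone, aggregation method, and evaluation protocol to isolate the effect of the 2D projection; experiments use consistent geometric and structural channels across multiple datasets and deployment scenarios. Result: Identifies the projection characteristics that most affect discriminative power, robustness to environmental variation, and suitability for real-time autonomy; the findings are validated on multiple datasets and within an operational place recognition policy. Conclusion: Carefully designed LiDAR-to-image projections can significantly improve place recognition performance and serve as an effective alternative to end-to-end 3D learning. Abstract: This work presents a systematic investigation into how alternative LiDAR-to-image projections affect metric place recognition when coupled with a state-of-the-art vision foundation model. We introduce a modular retrieval pipeline that controls for backbone, aggregation, and evaluation protocol, thereby isolating the influence of the 2-D projection itself. Using consistent geometric and structural channels across multiple datasets and deployment scenarios, we identify the projection characteristics that most strongly determine discriminative power, robustness to environmental variation, and suitability for real-time autonomy. Experiments with different datasets, including integration into an operational place recognition policy, validate the practical relevance of these findings and demonstrate that carefully designed projections can serve as an effective surrogate for end-to-end 3-D learning in LiDAR place recognition.
[142] Glance: Accelerating Diffusion Models with 1 Sample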
Zhuobai Dong,Rui Zhao,Songjie Wu,Junchao Yi,Linjie Li,Zhengyuan Yang,Lijuan Wang,Alex Jinpeng Wang
Main category: cs.CV
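TL;DR: This paper proposes a phase-aware acceleration strategy based on LoRA adapters. By assigning dedicated Slow-LoRA and Fast-LoRA adapters to different denoising phases of a diffusion model, it achieves up to 5x acceleration with very few samples and little training time while maintaining generation quality and generalization.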
Details
Motivation: Diffusion models excel at image generation, but their high computational cost and multi-step inference limit deployment. Existing few-step distillation methods often incur heavy retraining costs and degraded generalization, so a more efficient and general acceleration method is needed. Method: Proposes a phase-aware acceleration strategy that divides denoising into a semantically critical early phase and a redundant later phase, handled respectively by two LoRA adapters specialized for slow and fast denoising (Slow-LoRA and Fast-LoRA). Only lightweight adapter modules are added to the base model, avoiding the training of a full student model. Result: The method achieves up to 5x speedup on multiple benchmarks with visual quality comparable to the base model; the LoRA adapters are trained with only 1 sample on a single V100 within one hour and generalize strongly to unseen prompts. Conclusion: Phase-aware lightweight LoRA experts can accelerate diffusion models efficiently with almost no large-scale retraining, balancing speed, quality, and generalization and offering a practical path to deployment. Abstract: Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Previous efforts on fewer-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly, not evenly, applying smaller speedups to early semantic stages and larger ones to later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in slow and fast denoising phases. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. Through extensive experiments, our method achieves up to 5x acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only 1 sample on a single V100 within one hour, yet the resulting models generalize strongly on unseen prompts.
[143] MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
Fan Yang,Kaihao Zhang
Main category: cs.CV
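TL;DR: Proposes Multi-resolution Retrieval-Detection (MRD), a training-free framework that improves the ability of multimodal large language models to understand high-resolution images.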
Details
Motivation: Existing crop-based methods for high-resolution images split objects across different crops, which biases semantic similarity computation and harms target localization and information extraction. Method: Proposes a multi-resolution semantic fusion method that fuses semantic similarity maps from different resolutions, and introduces an open-vocabulary object detection (OVD) model that localizes targets globally via a sliding window. Result: The MRD framework is validated on multiple high-resolution image understanding benchmarks and significantly improves the performance of different MLLMs. Conclusion: Through multi-resolution fusion and global detection, MRD addresses the semantic fragmentation caused by crop-based processing and improves the accuracy and completeness of high-resolution image understanding. Abstract: Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
[144] DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation
Ying Yang,Zhengyao Lv,Tianlin Pan,Haofan Wang,Binxin Yang,Hubery Yin,Chen Li,Chenyang Si
Main category: cs.CV
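TL;DR: This paper studies the challenge of sample diversity in autoregressive generative models and proposes DiverseAR, a new method that uses adaptive logits distribution scaling and an energy-based generation path search algorithm to improve image diversity while preserving visual quality.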
Details
Motivation: To address insufficient sample diversity in bitwise autoregressive generative models and identify the key factors that limit diversity. Method: Proposes DiverseAR, comprising an adaptive logits distribution scaling mechanism and an energy-based generation path search algorithm, to increase diversity while preserving generation quality. Result: Experiments show that DiverseAR substantially improves sample diversity in bitwise autoregressive image generation without sacrificing visual fidelity. Conclusion: DiverseAR offers an effective, principled solution for bitwise autoregressive models that balances the trade-off between diversity and generation quality. Abstract: In this paper, we investigate the underexplored challenge of sample diversity in autoregressive (AR) generative models with bitwise visual tokenizers. We first analyze the factors that limit diversity in bitwise AR models and identify two key issues: (1) the binary classification nature of bitwise modeling, which restricts the prediction space, and (2) the overly sharp logits distribution, which causes sampling collapse and reduces diversity. Building on these insights, we propose DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling, resulting in smoother predictions and greater diversity. To mitigate potential fidelity loss caused by distribution smoothing, we further develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens, thereby preserving high visual quality. Extensive experiments demonstrate that DiverseAR substantially improves sample diversity in bitwise autoregressive image generation.
[145] EGGS: Exchangeable 2D/3D Gaussian Splatting for Geometry-Appearance Balanced Novel View Synthesis
Yancheng Zhang,Guangyu Sun,Chen Chen
Main category: cs.CV
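TL;DR: Proposes Exchangeable Gaussian Splatting (EGGS), a hybrid representation that combines 2D and 3D Gaussians to balance appearance fidelity and geometric accuracy in novel view synthesis, using three key techniques to achieve superior rendering quality, geometric accuracy, and efficiency.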
Details
Motivation: 3D Gaussian Splatting (3DGS) enables real-time, high-fidelity rendering but suffers from multi-view inconsistency, while 2D Gaussian Splatting (2DGS) enforces multi-view consistency at the expense of texture detail. A new method is needed that balances appearance and geometric accuracy. Method: Proposes EGGS, which combines 2D and 3D Gaussian representations and introduces Hybrid Gaussian Rasterization for unified rendering, an Adaptive Type Exchange mechanism for dynamic switching between 2D and 3D Gaussians, and a Frequency-Decoupled Optimization strategy that exploits the strengths of both types, with a CUDA-accelerated implementation for efficient training and inference. Result: Experiments show that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, addressing both multi-view inconsistency and loss of texture detail. Conclusion: By fusing 2D and 3D Gaussian representations, EGGS achieves a better balance between appearance and geometry in novel view synthesis, providing a practical and efficient solution for high-quality NVS. Abstract: Novel view synthesis (NVS) is crucial in computer vision and graphics, with wide applications in AR, VR, and autonomous driving. While 3D Gaussian Splatting (3DGS) enables real-time rendering with high appearance fidelity, it suffers from multi-view inconsistencies, limiting geometric accuracy. In contrast, 2D Gaussian Splatting (2DGS) enforces multi-view consistency but compromises texture details. To address these limitations, we propose Exchangeable Gaussian Splatting (EGGS), a hybrid representation that integrates 2D and 3D Gaussians to balance appearance and geometry. To achieve this, we introduce Hybrid Gaussian Rasterization for unified rendering, Adaptive Type Exchange for dynamic adaptation between 2D and 3D Gaussians, and Frequency-Decoupled Optimization that effectively exploits the strengths of each type of Gaussian representation. Our CUDA-accelerated implementation ensures efficient training and inference. Extensive experiments demonstrate that EGGS outperforms existing methods in rendering quality, geometric accuracy, and efficiency, providing a practical solution for high-quality NVS.
[146] LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization
Zhihan Xiao,Lin Liu,Yixin Gao,Xiaopeng Zhang,Haoxuan Che,Songping Mai,Qi Tian
Main category: cs.CV
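TL;DR: This paper proposes LoVoRA, a mask-free video object editing framework that uses a learnable object-aware localization mechanism to add and remove video objects with high quality and spatio-temporal consistency.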
Details
Motivation: Existing video editing methods usually rely on masks or reference images for guidance, which limits their generalization and scalability; this paper aims for end-to-end text-guided video editing that needs no additional inputs. Method: Proposes the LoVoRA framework, which builds its dataset using image-to-video translation, optical flow-based mask propagation, and video inpainting, and introduces a learnable object-aware localization mechanism and a Diffusion Mask Predictor that provide dense spatio-temporal supervision, enabling end-to-end editing without external control signals. Result: Experiments show that LoVoRA produces temporally consistent, visually realistic edits for both object addition and removal, and outperforms existing methods in automatic metrics and human evaluation. Conclusion: LoVoRA achieves high-quality text-guided video editing without mask inputs, offering a viable approach to video editing in open-world scenarios. Abstract: Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using an object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.
[147] Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench
Lanxiang Hu,Abhilash Shankarampeta,Yixin Huang,Zilin Dai,Haoyang Yu,Yujie Zhao,Haoqiang Kang,Daniel Zhao,Tajana Rosing,Hao Zhang
Main category: cs.CV
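TL;DR: This paper introduces VideoScience-Bench, the first benchmark for evaluating the reasoning ability of video generation models on physics and chemistry concepts. It contains 200 scientific scenario prompts covering 14 topics and 103 concepts, and evaluates seven state-of-the-art models along five dimensions.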
Details
Motivation: Existing video generation benchmarks are based mainly on physical commonsense and cannot adequately assess a model's understanding of and reasoning about scientific principles, so a new benchmark that tests undergraduate-level scientific understanding is needed. Method: Builds a dataset of 200 composite scientific scenario prompts covering 14 topics and 103 concepts in physics and chemistry; defines five evaluation dimensions (Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity); and evaluates using a VLM-as-a-Judge approach together with expert annotation. Result: Evaluation of seven SOTA video models in T2V and I2V settings shows that current models have limited performance on scientific reasoning tasks; VLM-as-a-Judge results correlate strongly with human assessments. Conclusion: VideoScience-Bench is the first benchmark for evaluating the scientific reasoning of video generation models. It measures not only generative ability but also scientific understanding as a "reasoner", pushing future models toward real-world scientific cognition. Abstract: The next frontier for video generation lies in developing models capable of zero-shot reasoning, where understanding real-world scientific laws is crucial for accurate physical outcome modeling under diverse conditions. However, existing video benchmarks are physical commonsense-based, offering limited insight into video models' scientific reasoning capability. We introduce VideoScience-Bench, a benchmark designed to evaluate undergraduate-level scientific understanding in video models. Each prompt encodes a composite scientific scenario that requires understanding and reasoning across multiple scientific concepts to generate the correct phenomenon. The benchmark comprises 200 carefully curated prompts spanning 14 topics and 103 concepts in physics and chemistry. We conduct expert-annotated evaluations across seven state-of-the-art video models in T2V and I2V settings along five dimensions: Prompt Consistency, Phenomenon Congruency, Correct Dynamism, Immutability, and Spatio-Temporal Continuity. Using a VLM-as-a-Judge to assess video generations, we observe strong correlation with human assessments. To the best of our knowledge, VideoScience-Bench is the first benchmark to evaluate video models not only as generators but also as reasoners, requiring their generations to demonstrate scientific understanding consistent with expected physical and chemical phenomena. Our data and evaluation code are available at: https://github.com/hao-ai-lab/VideoScience.
[148] Layout Anything: One Transformer for Universal Room Layout Estimation
Md Sohag Mia,Muhammad Abdullah Adnan
Main category: cs.CV
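TL;DR: Layout Anything is a Transformer-based indoor layout estimation framework that adapts the OneFormer architecture to geometric structure prediction, delivering high accuracy and efficient inference.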
Details
Motivation: Existing methods depend on complex post-processing pipelines and struggle to deliver both geometric consistency and inference speed, so an end-to-end, geometry-aware general framework for indoor layout estimation is needed. Method: Adopts OneFormer's task-conditioned queries and contrastive learning, and adds a layout degeneration strategy and differentiable geometric losses that augment training while preserving Manhattan-world constraints and improve boundary and planar consistency. Result: Achieves state-of-the-art performance on LSUN, Hedau, and Matterport3D-Layout, with pixel errors of 5.43%, 7.04%, and 4.03% respectively, at an inference time of 114 ms. Conclusion: The framework delivers efficient, accurate indoor layout estimation without complex post-processing and is suited to augmented reality and large-scale 3D scene reconstruction. Abstract: We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts OneFormer's universal segmentation architecture to geometric structure prediction. Our approach integrates OneFormer's task-conditioned queries and contrastive learning with two key modules: (1) a layout degeneration strategy that augments training data while preserving Manhattan-world constraints through topology-aware transformations, and (2) differentiable geometric losses that directly enforce planar consistency and sharp boundary predictions during training. By unifying these components in an end-to-end framework, the model eliminates complex post-processing pipelines while achieving high-speed inference at 114 ms. Extensive experiments demonstrate state-of-the-art performance across standard benchmarks, with pixel error (PE) of 5.43% and corner error (CE) of 4.02% on LSUN, PE of 7.04% (CE 5.17%) on Hedau, and PE of 4.03% (CE 3.15%) on Matterport3D-Layout. The framework's combination of geometric awareness and computational efficiency makes it particularly suitable for augmented reality applications and large-scale 3D scene reconstruction tasks.
[149] A Lightweight Real-Time Low-Light Enhancement Network for Embedded Automotive Vision Systems
Yuhan Chen,Yicui Shi,Guofa Li,Guangrui Bai,Jinyuan Shao,Xiangfei Huang,Wenbo Chu,Keqiang Li
Main category: cs.CV
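TL;DR: Proposes UltraFast-LieNET, a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement that combines an extremely small parameter count with strong performance.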
Details
Motivation: Existing low-light enhancement algorithms are computationally expensive and cannot meet the real-time, resource-constrained requirements of in-vehicle cameras. Method: Designs a Dynamic Shifted Convolution (DSConv) and a Multi-scale Shifted Residual Block (MSRB), combined with a residual structure and a multi-level gradient-aware loss function, for efficient feature extraction and stable training. Result: Reaches a PSNR of 26.51 dB on the LOLI-Street dataset, exceeding existing methods by 4.6 dB while using only 180 parameters; experiments on four benchmark datasets confirm its balance of real-time performance and enhancement quality under limited resources. Conclusion: UltraFast-LieNET achieves high-performance low-light image enhancement with a very small parameter count, making it suitable for real-time settings such as in-vehicle use. Abstract: In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are often too computationally intensive for vehicular applications, we propose UltraFast-LieNET, a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement. We introduce a Dynamic Shifted Convolution (DSConv) kernel with only 12 learnable parameters for efficient feature extraction. By integrating DSConv with varying shift distances, a Multi-scale Shifted Residual Block (MSRB) is constructed to significantly expand the receptive field. To mitigate lightweight network instability, a residual structure and a novel multi-level gradient-aware loss function are incorporated. UltraFast-LieNET allows flexible parameter configuration, with a minimum size of only 36 parameters. Results on the LOLI-Street dataset show a PSNR of 26.51 dB, outperforming state-of-the-art methods by 4.6 dB while utilizing only 180 parameters. Experiments across four benchmark datasets validate its superior balance of real-time performance and enhancement quality under limited resources. Code is available at https://github.com/YuhanChen2024/UltraFast-LiNET
[150] BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection
Guowen Zhang,Chenhang He,Liyi Chen,Lei Zhang
Main category: cs.CV
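TL;DR: This paper proposes BEVDilation, a novel LiDAR-centric framework for fusing LiDAR and camera information in the bird's-eye-view (BEV) representation. By treating image features as implicit guidance rather than simply concatenating them, it alleviates the spatial misalignment caused by depth estimation errors and uses image priors to mitigate point cloud sparsity and missing semantics, outperforming existing methods on the nuScenes benchmark.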
Details
Motivation: Because LiDAR and cameras differ fundamentally in geometric accuracy, earlier indiscriminate fusion methods tend to degrade 3D object detection performance, so a more robust fusion strategy is needed. Method: Proposes the BEVDilation framework, which fuses in a LiDAR-centric manner and uses image BEV features as implicit guidance; designs a Sparse Voxel Dilation Block that densifies foreground voxels using image priors, and introduces a Semantic-Guided BEV Dilation Block that enhances LiDAR feature diffusion and captures long-range context. Result: Outperforms existing state-of-the-art methods on the nuScenes benchmark while remaining computationally efficient and showing greater robustness to depth noise. Conclusion: With its LiDAR-centric, image-guided fusion strategy, BEVDilation addresses spatial misalignment, point cloud sparsity, and missing semantics in multi-modal BEV fusion, achieving high-performance and robust 3D object detection. Abstract: Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.
[151] InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration
Zhongyu Yang,Yingfang Yuan,Xuanming Jiang,Baoyi An,Wei Pang
Main category: cs.CV
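TL;DR: Proposes InEx, a training-free multi-agent framework that autonomously mitigates hallucination in large language models through introspective reasoning and cross-modal collaboration.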
Details
Motivation: Existing methods rely on human intervention or fail to make full use of an agent's ability to reduce hallucination on its own; inspired by the cognitive process of human decision-making, the work aims to build more reliable multimodal large models. Method: Designs the InEx framework, which performs internal introspective reasoning guided by entropy-based uncertainty estimation, then cross-checks and iteratively refines the response through external multi-agent collaboration (an editing agent and self-reflection agents). Result: Experiments show that InEx outperforms existing methods on general and hallucination benchmarks, with gains of 4%-27%, and exhibits strong robustness. Conclusion: InEx effectively and autonomously mitigates hallucination in MLLMs and offers a new approach to building reliable multimodal large models. Abstract: Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent's ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent's reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.
[152] U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences
Xiang Xu,Ao Liang,Youquan Liu,Linfeng Li,Lingdong Kong,Ziwei Liu,Qingshan Liu
Main category: cs.CV
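TL;DR: This paper proposes U4D, a 4D LiDAR world modeling framework that accounts for spatial uncertainty, improving geometric fidelity and temporal consistency through a "hard-to-easy" generation strategy and a spatio-temporal mixture block.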
Details
Motivation: Existing generative models ignore how uncertainty varies across regions of a scene when processing LiDAR sequences, which produces artifacts in complex regions and harms realism and temporal stability. Method: Uses a pretrained segmentation model to estimate spatial uncertainty maps and localize high-entropy regions; adopts a two-stage generation strategy that first reconstructs high-uncertainty regions in fine detail and then completes the remaining areas under structural priors; introduces a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal features during diffusion to strengthen temporal coherence. Result: Experiments show that U4D surpasses existing methods in generated geometric detail and temporal continuity, markedly improving the reliability of 4D world modeling. Conclusion: By explicitly modeling spatial uncertainty and combining it with a hard-to-easy generation mechanism, U4D improves the quality of dynamic 3D environment modeling and is applicable to perception and simulation in autonomous driving and embodied AI. Abstract: Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.
[153] GraphFusion3D: Dynamic Graph Attention Convolution with Adaptive Cross-Modal Transformer for 3D Object Detection
Md Sohag Mia,Md Nahid Hasan,Tawhid Ahmed,Muhammad Abdullah Adnan
Main category: cs.CV
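TL;DR: This paper proposes GraphFusion3D, a unified framework for improving 3D object detection that combines multi-modal fusion with advanced feature learning. An adaptive cross-modal Transformer and a graph reasoning module integrate image features and model local geometric and global semantic relationships, yielding substantial gains on the SUN RGB-D and ScanNetV2 datasets.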
Details
Motivation: Point cloud data are sparse, structurally incomplete, and semantically limited, which makes it hard to capture contextual relationships between distant objects; existing methods fall short in feature fusion and context modeling. Method: Proposes the GraphFusion3D framework, comprising an Adaptive Cross-Modal Transformer (ACMT) that adaptively integrates image features into point cloud representations and a Graph Reasoning Module (GRM) that uses multi-scale graph attention to model spatial proximity and feature similarity between proposals, together with a cascade decoder for multi-stage detection refinement. Result: Reaches 70.6% AP$_{25}$ and 51.2% AP$_{50}$ on SUN RGB-D, and 75.1% AP$_{25}$ and 60.8% AP$_{50}$ on ScanNetV2, clearly outperforming existing methods. Conclusion: Through effective multi-modal fusion and context modeling, GraphFusion3D improves 3D object detection accuracy and is particularly strong at jointly exploiting geometric and semantic information. Abstract: Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relationships between distant objects presents additional difficulties. To address these challenges, we propose GraphFusion3D, a unified framework combining multi-modal fusion with advanced feature learning. Our approach introduces the Adaptive Cross-Modal Transformer (ACMT), which adaptively integrates image features into point representations to enrich both geometric and semantic information. For proposal refinement, we introduce the Graph Reasoning Module (GRM), a novel mechanism that models neighborhood relationships to simultaneously capture local geometric structures and global semantic context. The module employs multi-scale graph attention to dynamically weight both spatial proximity and feature similarity between proposals. We further employ a cascade decoder that progressively refines detections through multi-stage predictions. Extensive experiments on SUN RGB-D (70.6% AP$_{25}$ and 51.2% AP$_{50}$) and ScanNetV2 (75.1% AP$_{25}$ and 60.8% AP$_{50}$) demonstrate a substantial performance improvement over existing approaches.
[154] TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond
Yifei Zeng,Yajie Bao,Jiachen Qian,Shuang Wu,Youtian Lin,Hao Zhu,Buyu Li,Feihu Zhang,Xun Cao,Yao Yao
Main category: cs.CV
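TL;DR: Proposes TEXTRIX, a native 3D attribute generation framework based on a Diffusion Transformer that delivers high-quality texture synthesis and precise 3D part segmentation.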
Details
Motivation: Existing 3D texture generation methods rely on multi-view fusion and suffer from inter-view inconsistency and incomplete coverage, which degrade generation quality. Method: Builds a latent 3D attribute grid and uses a Diffusion Transformer with sparse attention to color 3D models directly in volumetric space, then extends the same approach to semantic attribute prediction for 3D segmentation. Result: Achieves state-of-the-art results on both texture generation and 3D part segmentation, producing seamless, high-fidelity textures and segmentations with precise boundaries. Conclusion: Through native 3D modeling, TEXTRIX overcomes the shortcomings of multi-view fusion, and its unified framework supports both high-quality texture generation and downstream semantic analysis tasks. Abstract: Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.
[155] DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling
Kairun Wen,Yuzhi Huang,Runyu Chen,Hui Zheng,Yunlong Lin,Panwang Pan,Chenxin Li,Wenyan Cong,Jian Zhang,Junbin Lu,Chenguo Lin,Dilin Wang,Zhicheng Yan,Hongyu Xu,Justin Theiss,Yue Huang,Xinghao Ding,Rakesh Ranjan,Zhiwen Fan
Main category: cs.CV
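TL;DR: This paper proposes DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. By integrating large vision, geometric, and multimodal models, it converts internet videos into a 4D dataset with metric-scale geometry, real-world motion, instance masks, and descriptive captions.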
Details
Motivation: Existing datasets are limited by simulators or traditional SfM methods and lack real-world metric scale and rich semantic descriptions, which restricts the ability of foundation models to understand real dynamics in monocular video. Method: Combines window-based Bundle Adjustment with global optimization, and uses large vision, geometric, and multimodal models to recover static geometry, dynamic motion, instance masks, and text descriptions, building a 4D multimodal representation. Result: Builds a large-scale dataset of 100K+ videos, 800K+ annotated masks, and 10M+ frames; on three benchmark tasks (video depth estimation, camera pose estimation, and camera intrinsics estimation) it shows better physical-scale measurement and global accuracy. Conclusion: DynamicVerse improves the ability of foundation models to perceive real-world dynamics and gives embodied agents a more realistic and richer framework for 4D environment understanding. Abstract: Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consists of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.
[156] DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images
Xiaoxue Chen,Ziyi Xiong,Yuantao Chen,Gen Li,Nan Wang,Hongcheng Luo,Long Chen,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Hongyang Li,Ya-Qin Zhang,Hao Zhao
Main category: cs.CV
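TL;DR: Proposes Driving Gaussian Grounded Transformer (DGGT), a single-pass feedforward framework that needs no camera pose input, for fast and scalable 4D reconstruction and re-simulation of autonomous driving scenes.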
Details
Motivation: Existing dynamic scene reconstruction methods depend on known camera poses, per-scene optimization, or short frame sequences, which limits flexibility and scalability and fails to meet autonomous driving's need for efficient training and evaluation. Method: Treats camera pose as a model output rather than an input, jointly predicting per-frame 3D Gaussian maps and camera parameters directly from sparse, unposed images; a lightweight dynamic head disentangles dynamics, a lifespan head preserves temporal consistency, and diffusion-based rendering refinement reduces motion artifacts. Result: Achieves state-of-the-art performance and speed on large-scale driving datasets including Waymo, nuScenes, and Argoverse2, supports an arbitrary number of input views, transfers well zero-shot across datasets, and scales well as the number of frames grows. Conclusion: DGGT is an efficient, unified 4D reconstruction framework that requires no pose priors and substantially improves the practicality and scalability of dynamic driving scene reconstruction. Abstract: Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.
[157] SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting
Svenja Strobel,Matthias Innmann,Bernhard Egger,Marc Stamminger,Linus Franke
Main category: cs.CV
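TL;DR: This paper proposes SurfFill, a Gaussian surfel-based LiDAR point cloud completion method that combines the strengths of LiDAR and camera-based capture to fill in missing regions in 3D reconstruction, performing especially well on thin structures and edges.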
Details
Motivation: LiDAR is highly accurate in flat regions but tends to lose detail on small structures and light-absorbing materials, whereas photogrammetry captures feature-rich regions but cannot match LiDAR's accuracy in featureless ones. The strengths of both therefore need to be combined to improve overall reconstruction quality. Method: Analyzes beam divergence in LiDAR capture and proposes an ambiguity heuristic based on changes in point cloud density to identify points near likely missed areas; grows additional points from them through constrained Gaussian surfel reconstruction, then extracts and samples the Gaussian primitives to produce the completed point cloud. A divide-and-conquer scheme supports building-scale point cloud completion. Result: On LiDAR point cloud completion for synthetic and real-world scenes, the method outperforms previous reconstruction methods, particularly in recovering small structures and edges. Conclusion: By fusing LiDAR and visual information and using a Gaussian surfel model to focus optimization and densification on ambiguous regions, SurfFill achieves high-quality point cloud completion and offers an effective solution for large-scale 3D reconstruction. Abstract: LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, the capturing is susceptible to miss small geometric structures and may fail with dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details as they often represent feature-rich regions. However, the accuracy of LiDAR for featureless regions is rarely reached. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR capturings and attribute LiDAR beam divergence as a main factor for artifacts, manifesting mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans by evaluating the change in density in the point cloud. This allows us to identify points close to missed areas, which we can then use to grow additional points from to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction [Huang et al. 2024] to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.
[158] In-Context Sync-LoRA for Portrait Video Editing
Sagi Polaczek,Or Patashnik,Ali Mahdavi-Amiri,Daniel Cohen-Or
Main category: cs.CV
TL;DR: 本文提出Sync-LoRA,一种用于肖像视频编辑的方法,通过基于同步过滤生成的配对视频训练上下文LoRA,在保持帧级同步和身份一致性的前提下实现高质量的外观、表情或物体添加等编辑。
Details
Motivation: Portrait video editing requires precise control over many kinds of modification while keeping the subject's original temporal behavior synchronized. Existing methods struggle to deliver both edit quality and temporal consistency, so a method that achieves high-fidelity editing together with accurate motion preservation is needed. Method: Uses an image-to-video diffusion model in which the edit is defined by modifying the first frame and then propagated to the whole sequence; automatically generates paired videos with identical motion trajectories but different appearance, filters them by synchronization, and trains an in-context LoRA to combine motion cues from the source video with the visual changes in the first frame. Result: Sync-LoRA shows high visual fidelity and strong temporal coherence on unseen identities and diverse edits (such as appearance modification, object addition, and background replacement), handles variations in pose and expression robustly, and strikes a good balance between edit fidelity and motion preservation. Conclusion: Through carefully constructed synchronized training data and in-context learning, Sync-LoRA resolves the tension between synchronization and edit quality in portrait video editing, providing a practical and well-generalizing solution for high-quality video editing. Abstract: Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject's original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. We present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. To enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.
[159] Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks
Matthew Dutson,Nathan Labiosa,Yin Li,Mohit Gupta
Main category: cs.CV
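TL;DR: This paper proposes a general approach that uses stability adapters and a resource-efficient training process to improve the temporal consistency of frame-based video models and their robustness to corruptions such as noise.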
Details
Motivation: Frame-based networks applied to video tend to show temporal inconsistency such as flickering outputs, and the problem worsens when the inputs contain time-varying noise. Method: Designs a class of stability adapters that can be inserted into any architecture and trains them efficiently on top of a frozen base network, optimizing an accuracy-stability-robustness loss. Result: Validated on several tasks, including denoising, HDR enhancement, monocular depth estimation, and semantic segmentation; the method markedly improves temporal stability and robustness to compression artifacts, noise, and adverse weather while preserving or improving prediction quality. Conclusion: The proposed approach strengthens the temporal stability and input robustness of frame-based video models, applies broadly, trains efficiently, and has good practical prospects. Abstract: When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.
[160] AutoBrep: Autoregressive B-Rep Generation with Unified Topology and Geometry
Xiang Xu,Pradeep Kumar Jayaraman,Joseph G. Lambourne,Yilin Liu,Durvesh Malpure,Pete Meltzer
Main category: cs.CV
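TL;DR: This paper proposes AutoBrep, a novel Transformer model that autoregressively generates high-quality, valid boundary representation (B-Rep) CAD models. It encodes geometry and topology with a unified discrete tokenization scheme and supports autocompletion and user-controllable design generation.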
Details
Motivation: Directly generating B-Rep models end-to-end with precise geometry and watertight topology remains challenging, and existing methods fall short in quality, validity, or scalability. Method: Proposes a unified tokenization scheme in which the geometric primitives of a B-Rep (surfaces and curves) are encoded as latent geometry tokens and topological relationships are defined as special reference tokens; the sequence order follows a breadth-first traversal of the B-Rep face adjacency graph, and a Transformer generates the sequence autoregressively. Result: Experiments show that AutoBrep outperforms baselines in generation quality, watertightness, fidelity on complex models, and inference speed, and natively supports B-Rep autocompletion. Conclusion: With its unified serialized token representation and autoregressive modeling, AutoBrep achieves high-quality, scalable, and user-controllable B-Rep generation, advancing AI-driven CAD design. Abstract: The boundary representation (B-Rep) is the standard data structure used in Computer-Aided Design (CAD) for defining solid models. Despite recent progress, directly generating B-Reps end-to-end with precise geometry and watertight topology remains a challenge. This paper presents AutoBrep, a novel Transformer model that autoregressively generates B-Reps with high quality and validity. AutoBrep employs a unified tokenization scheme that encodes both geometric and topological characteristics of a B-Rep model as a sequence of discrete tokens. Geometric primitives (i.e., surfaces and curves) are encoded as latent geometry tokens, and their structural relationships are defined as special topological reference tokens. Sequence order in AutoBrep naturally follows a breadth first traversal of the B-Rep face adjacency graph. At inference time, neighboring faces and edges along with their topological structure are progressively generated. Extensive experiments demonstrate the advantages of our unified representation when coupled with next-token prediction for B-Rep generation. AutoBrep outperforms baselines with better quality and watertightness. It is also highly scalable to complex solids with good fidelity and inference speed. We further show that autocompleting B-Reps is natively supported through our unified tokenization, enabling user-controllable CAD generation with minimal changes. Code is available at https://github.com/AutodeskAILab/AutoBrep.
[161] Unrolled Networks are Conditional Probability Flows in MRI Reconstruction
Kehan Qi,Saumya Gupta,Qingqiao Hu,Weimin Lyu,Chao Chen
Main category: cs.CV
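TL;DR: This paper proposes FLAT, an MRI reconstruction method based on ordinary differential equations (ODEs). By viewing unrolled networks as discrete implementations of conditional probability flow ODEs, it improves reconstruction stability and convergence, achieving high-quality reconstruction on multiple datasets with up to 3x fewer iterations than diffusion models and greater stability than conventional unrolled networks.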
Details
Motivation: Slow MRI acquisition limits clinical use. Existing deep learning methods such as unrolled networks evolve unstably across intermediate steps, while diffusion models are stable but computationally expensive, so a new method that is both stable and efficient is needed. Method: Proves theoretically that unrolled networks are discrete forms of conditional probability flow ODEs and, on that basis, proposes Flow-Aligned Training (FLAT), which derives the unrolled network parameters from the ODE discretization and aligns intermediate reconstructions with the ideal ODE trajectory to improve training stability and convergence speed. Result: Experiments on three MRI datasets show that FLAT achieves high-quality reconstruction with up to 3x fewer iterations than diffusion-based generative models while being significantly more stable than conventional unrolled networks. Conclusion: By connecting unrolled networks to a continuous ODE system, FLAT provides a stable and efficient training framework for accelerated MRI reconstruction and highlights the potential of combining physics-based modeling with deep learning. Abstract: Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent methods accelerate MRI by under-sampling $k$-space and reconstructing the resulting images using deep learning. Unrolled networks have been widely used for the reconstruction task due to their efficiency, but suffer from unstable evolving caused by freely-learnable parameters in intermediate steps. In contrast, diffusion models based on stochastic differential equations offer theoretical stability in both medical and natural image tasks but are computationally expensive. In this work, we introduce flow ODEs to MRI reconstruction by theoretically proving that unrolled networks are discrete implementations of conditional probability flow ODEs. This connection provides explicit formulations for parameters and clarifies how intermediate states should evolve. Building on this insight, we propose Flow-Aligned Training (FLAT), which derives unrolled parameters from the ODE discretization and aligns intermediate reconstructions with the ideal ODE trajectory to improve stability and convergence. Experiments on three MRI datasets show that FLAT achieves high-quality reconstructions with up to $3\times$ fewer iterations than diffusion-based generative models and significantly greater stability than unrolled networks.
[162] MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation
Youxin Pang,Jiajun Liu,Lingfeng Tan,Yong Zhang,Feng Gao,Xiang Deng,Zhuoliang Kang,Xiaoming Wei,Yebin Liu
Main category: cs.CV
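TL;DR: This paper proposes MAViD, a novel multimodal audio-visual dialogue framework that uses a Conductor-Creator architecture to coordinate understanding and generation, and combines autoregressive and diffusion models to generate high-quality, long-duration audio-visual content.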
Details
Motivation: Existing methods focus mainly on non-interactive systems, produce constrained and unnatural speech, and struggle to achieve effective multimodal fusion and long-duration, consistent audio-visual generation. Method: Proposes a Conductor-Creator architecture in which the Conductor handles understanding and instruction decomposition and the Creator generates responses from those instructions; combines an AR model (audio) with a diffusion model (video) to produce long, consistent content, and designs a new fusion module to strengthen cross-modal and temporal consistency. Result: Experiments show that the framework generates vivid, contextually coherent long-duration audio-visual dialogue and accurately interprets users' multimodal queries. Conclusion: Through a divide-and-conquer strategy and a hybrid generative model, MAViD achieves more natural and consistent interaction in multimodal dialogue understanding and generation. Abstract: We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human speech. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary components. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these instructions. Furthermore, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video generation. Additionally, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content generation. Extensive experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.
[163] ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation
Mengchen Zhang,Qi Chen,Tong Wu,Zihan Liu,Dahua Lin
Main category: cs.CV
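TL;DR: This paper introduces the task of end-to-end binaural spatial audio generation, producing spatially immersive binaural audio directly from silent video, and releases the BiAudio dataset of approximately 97K video-binaural audio pairs. The authors also propose ViSAudio, a framework built on conditional flow matching with a dual-branch audio generation architecture and a spacetime module, which outperforms existing methods on both objective metrics and subjective evaluations.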
Details
Motivation: Existing video-to-audio generation methods focus mainly on mono output and lack spatial immersion, while existing binaural approaches are mostly two-stage pipelines that generate mono audio first and spatialize it afterward, which tends to cause error accumulation and spatio-temporal inconsistency. Method: Introduces the end-to-end binaural audio generation task, builds the large-scale BiAudio dataset, and designs the ViSAudio framework, which uses conditional flow matching with a dual-branch generation architecture and a conditional spacetime module to model the latent flows of binaural audio, achieving precise spatio-temporal alignment between audio and video. Result: Experiments show that ViSAudio outperforms existing state-of-the-art methods on objective metrics (such as L1 loss and FAD) and in subjective evaluations, generating high-quality, spatially immersive binaural audio that adapts to viewpoint changes, sound-source motion, and different acoustic environments. Conclusion: ViSAudio achieves high-quality end-to-end generation of binaural audio from silent video, advancing spatial audio generation and confirming the advantages of end-to-end modeling for spatio-temporal consistency and spatial fidelity. Abstract: Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
[164] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation
Zeqi Xiao,Yiwei Zhao,Lingxiao Li,Yushi Lan,Yu Ning,Rahul Garg,Roshni Cooper,Mohammad H. Taghavi,Xingang Pan
Main category: cs.CV
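TL;DR: This paper proposes the Video4Spatial framework, showing that video diffusion models conditioned only on video data can perform complex spatial tasks such as scene navigation and object grounding, which indicates emerging visuospatial intelligence.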
Details
Motivation: To explore whether video generative models can exhibit human-like visuospatial intelligence when given only visual data. Method: Designs and implements the Video4Spatial framework, which uses video-only inputs (without auxiliary modalities such as depth or poses) and a video diffusion model to carry out scene navigation and object grounding, emphasizing semantic localization, instruction following, and planning. Result: The model navigates and grounds target objects while maintaining 3D geometric consistency, generalizes to long contexts and out-of-domain environments, and shows strong spatial understanding. Conclusion: Video generative models can perform complex spatial reasoning tasks from video context alone, advancing general visuospatial reasoning. Abstract: We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.
[165] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Qinghe Wang,Xiaoyu Shi,Baolu Li,Weikang Bian,Quande Liu,Huchuan Lu,Xintao Wang,Pengfei Wan,Kun Gai,Xu Jia
Main category: cs.CV
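TL;DR: This paper proposes the MultiShotMaster framework, which introduces two novel RoPE variants to enable highly controllable multi-shot video generation, supporting flexible shot arrangement, spatiotemporal consistency, and reference injection.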
Details
Motivation: Existing video generation techniques perform well on single-shot generation but struggle to produce multi-shot narrative videos with a coherent narrative and flexible control.
Method: Extends a pretrained single-shot model with Multi-Shot Narrative RoPE (with explicit phase shifts) and Spatiotemporal Position-Aware RoPE, and builds an automated data annotation pipeline to obtain multi-shot videos, captions, cross-shot grounding signals, and reference images.
Result: The framework achieves text-driven inter-shot consistency, customized subject and motion control, and background-driven scene customization, supports flexible configuration of shot count and duration, and shows superior performance and strong controllability in experiments.
Conclusion: MultiShotMaster effectively addresses narrative coherence and control in multi-shot video generation, offering a new approach to complex video content creation.
Abstract: Current video generation techniques excel at single-shot clips but struggle to produce narrative multi-shot videos, which require flexible shot arrangement, coherent narrative, and controllability beyond text prompts. To tackle these challenges, we propose MultiShotMaster, a framework for highly controllable multi-shot video generation. We extend a pretrained single-shot model by integrating two novel variants of RoPE. First, we introduce Multi-Shot Narrative RoPE, which applies explicit phase shift at shot transitions, enabling flexible shot arrangement while preserving the temporal narrative order. Second, we design Spatiotemporal Position-Aware RoPE to incorporate reference tokens and grounding signals, enabling spatiotemporal-grounded reference injection. In addition, to overcome data scarcity, we establish an automated data annotation pipeline to extract multi-shot videos, captions, cross-shot grounding signals and reference images. Our framework leverages the intrinsic architectural properties to support multi-shot video generation, featuring text-driven inter-shot consistency, customized subject with motion control, and background-driven customized scene. Both shot count and duration are flexibly configurable. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework.
[166] PPTArena: A Benchmark for Agentic PowerPoint Editing
Michael Ofengenden,Yunze Man,Ziqi Pang,Yu-Xiong Wang
Main category: cs.CV
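TL;DR: PPTArena is a benchmark for PowerPoint editing that measures reliable modification of real slides under natural-language instructions. The paper also proposes PPTPilot, a structure-aware editing agent that substantially outperforms existing methods on compound, layout-sensitive, and cross-slide editing tasks.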
Details
Motivation: Existing slide generation or image rendering approaches struggle to achieve precise and consistent PPT editing and cannot evaluate multi-element, long-horizon editing tasks in realistic settings, so a benchmark focused on reliable, measurable in-place PPT editing is needed.
Method: Builds the PPTArena benchmark with 100 decks, 2125 slides, and over 800 editing tasks, evaluated with a dual VLM-as-judge pipeline; proposes the PPTPilot agent, which combines semantic planning, programmatic tools, and XML-level operations and uses an iterative plan-edit-check loop to ensure edit accuracy.
Result: PPTPilot outperforms frontier proprietary agents and VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly strong gains in visual fidelity and deck-wide consistency, though agents overall still fall short on long-horizon, document-scale tasks.
Conclusion: PPTArena provides a more realistic and challenging evaluation standard for PPT editing, and PPTPilot demonstrates the advantages of structure awareness and precise control in slide editing, but reliable large-scale automated PPT editing still faces major challenges.
Abstract: We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.
[167] OneThinker: All-in-one Reasoning Model for Image and Video
Kaituo Feng,Manyuan Zhang,Hongyu Li,Kaixuan Fan,Shuang Chen,Yilei Jiang,Dian Zheng,Peiwen Sun,Yiyuan Zhang,Haoze Sun,Yan Feng,Peng Pei,Xunliang Cai,Xiangyu Yue
Main category: cs.CV
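TL;DR: OneThinker is a unified multimodal reasoning model that integrates image and video understanding, covers 10 visual tasks, performs strongly on 31 benchmarks, and supports knowledge transfer and zero-shot generalization.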
Details
Motivation: Existing methods train separate models for different tasks and treat image and video reasoning separately, which limits the scalability and practicality of a multimodal reasoning generalist and prevents knowledge sharing across tasks and modalities.
Method: Proposes the OneThinker model, builds the OneThinker-600k dataset of 600k samples, and uses commercial models to generate chain-of-thought annotations for SFT cold start; further proposes EMA-GRPO, which balances reward heterogeneity in multi-task reinforcement learning by tracking moving averages of each task's reward standard deviation.
Result: Performs strongly on 31 diverse visual benchmarks covering 10 tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation, showing effective knowledge transfer between tasks and preliminary zero-shot generalization.
Conclusion: OneThinker advances the development of a unified multimodal reasoning generalist, unifying image and video multi-task modeling and achieving stable multi-task optimization through EMA-GRPO, which supports practical use and knowledge sharing.
Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.
[168] CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models
Minkyung Kwon,Jinhyeok Choi,Jiho Park,Seonghu Jeon,Jinhyuk Jang,Junyoung Seo,Minseop Kwak,Jin-Hwa Kim,Seungryong Kim
Main category: cs.CV
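TL;DR: This paper studies the attention mechanism in multi-view diffusion models and finds that it learns geometric correspondence during training but performs poorly under large viewpoint changes. The authors therefore propose CAMEO, which directly supervises attention maps to enhance geometric consistency, significantly improving training efficiency and generation quality while supervising only a single attention layer. CAMEO is model-agnostic, applies to a range of multi-view diffusion models, and halves the number of iterations needed for convergence.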
Details
Motivation: Multi-view diffusion models perform well in novel view synthesis, but the underlying mechanism that keeps views consistent remains unclear, and their learning of geometric correspondence is still incomplete under large viewpoint changes.
Method: Analyzes the geometric correspondence in attention maps and proposes CAMEO, which directly supervises attention maps with geometric correspondence signals; supervising a single attention layer is enough to guide the model toward precise correspondences.
Result: CAMEO significantly improves training efficiency and generation quality, converges twice as fast, achieves better performance at the same iteration counts, and applies across different models.
Conclusion: Explicitly supervising geometric correspondence in attention maps effectively strengthens view consistency and structure preservation in multi-view diffusion models, and CAMEO offers a simple, efficient, and general training paradigm for improving such models.
Abstract: Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model.
[169] MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues
Zichen Liu,Yue Yu,Hao Ouyang,Qiuyu Wang,Shuailei Ma,Ka Leong Cheng,Wen Wang,Qingyan Bai,Yuxuan Zhang,Yanhong Zeng,Yixuan Li,Xing Zhu,Yujun Shen,Qifeng Chen
Main category: cs.CV
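TL;DR: MagicQuill V2 proposes a layered approach to generative image editing that decomposes user intent into content, spatial, structure, and color layers, enabling fine-grained control over diffusion models.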