cs.CL
[1] What Kind of Reasoning (if any) is an LLM actually doing? On the Stochastic Nature and Abductive Appearance of Large Language Models
Luciano Floridi,Jessica Morley,Claudio Novelli,David Watson
Main category: cs.CL
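TL;DR: This paper examines the reasoning mechanism of large language models (LLMs) based on token completion. It argues that their output, although seemingly abductive, merely imitates patterns of human text in the training data and lacks genuine understanding and the ability to verify.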
Details
Motivation: To expose the fundamental limitations of current LLMs on reasoning tasks and to clarify the generative mechanism behind their apparent reasoning ability.
Method: Analyzes the stochastic generative nature of LLMs and their resemblance to human abductive reasoning, using concrete examples to show how they simulate rather than actually perform reasoning.
Result: LLMs can produce seemingly plausible explanations and commonsense inferences, but these outputs lack semantic grounding, truth verification, and understanding; they are essentially pattern replication.
Conclusion: LLMs can serve as tools for supporting thought and generating ideas, but their outputs must be critically evaluated. The article also responds to five possible objections and discusses the limitations of the analysis.
Abstract: This article looks at how reasoning works in current Large Language Models (LLMs) that function using the token-completion method. It examines their stochastic nature and their similarity to human abductive reasoning. The argument is that these LLMs create text based on learned patterns rather than performing actual abductive reasoning. When their output seems abductive, this is largely because they are trained on human-generated texts that include reasoning structures. Examples are used to show how LLMs can produce plausible ideas, mimic commonsense reasoning, and give explanatory answers without being grounded in truth, semantics, verification, or understanding, and without performing any real abductive reasoning. This dual nature, where the models have a stochastic base but appear abductive in use, has important consequences for how LLMs are evaluated and applied. They can assist with generating ideas and supporting human thinking, but their outputs must be critically assessed because they cannot identify truth or verify their explanations. The article concludes by addressing five objections to these points, noting some limitations in the analysis, and offering an overall evaluation.
[2] Generate-Then-Validate: A Novel Question Generation Approach Using Small Language Models
Yumou Wei,John Stamper,Paulo F. Carvalho
Main category: cs.CL
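TL;DR: Proposes a new "generate-then-validate" pipeline for automatic question generation based on small language models (SLMs), combining text generation with probabilistic reasoning to improve question quality; evaluations by human experts and by an LLM both indicate its effectiveness.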
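A minimal Python sketch of the generate-then-validate idea summarized in this TL;DR may help make it concrete. It is not the authors' pipeline: the `generate` and `score` callables, the toy stand-ins, and the threshold value are all hypothetical placeholders for an SLM generator and a probability-style validator.

```python
from typing import Callable, List, Tuple


def generate_then_validate(
    generate: Callable[[str, int], List[str]],   # hypothetical: SLM candidate generator
    score: Callable[[str, str], float],          # hypothetical: validator returning a score in [0, 1]
    source_text: str,
    n_candidates: int = 20,
    threshold: float = 0.5,                      # assumed cut-off, not taken from the paper
) -> List[Tuple[str, float]]:
    """Over-generate candidate questions, then keep only those the validator accepts."""
    candidates = generate(source_text, n_candidates)
    # Drop exact duplicates while preserving generation order.
    unique = list(dict.fromkeys(candidates))
    scored = [(question, score(source_text, question)) for question in unique]
    kept = [(question, s) for question, s in scored if s >= threshold]
    # Highest-scoring questions first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins so the sketch runs without any model.
def toy_generate(text: str, n: int) -> List[str]:
    return [f"What is {word}?" for word in text.split()[:n]]


def toy_score(text: str, question: str) -> float:
    # Pretend that longer key terms make better questions.
    word = question[len("What is "):-1]
    return min(len(word) / 10.0, 1.0)


result = generate_then_validate(
    toy_generate, toy_score, "photosynthesis converts light energy",
    n_candidates=4, threshold=0.6,
)
```

The design point is the separation of concerns: generation is allowed to be expansive and noisy because a separate scoring step decides what survives.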
Details
Motivation: To explore the potential of SLMs for automatic question generation in learning analytics, as a lightweight, efficient complement to the currently dominant LLM-based approaches.
Method: Adopts a generate-then-validate strategy: an SLM first generates a broad set of candidate questions, which are then refined and validated by a filtering mechanism based on novel probabilistic reasoning.
Result: In two evaluation studies (seven human experts and one LLM), most judges found that the generated questions had clear answers and aligned well with the learning objectives.
Conclusion: A well-designed pipeline can bring out the strengths of SLMs, allowing them to generate high-quality questions in automatic question generation tasks.
Abstract: We explore the use of small language models (SLMs) for automatic question generation as a complement to the prevalent use of their large counterparts in learning analytics research. We present a novel question generation pipeline that leverages both the text generation and the probabilistic reasoning abilities of SLMs to generate high-quality questions. Adopting a "generate-then-validate" strategy, our pipeline first performs expansive generation to create an abundance of candidate questions and then refines them through selective validation based on novel probabilistic reasoning. We conducted two evaluation studies, one with seven human experts and the other with a large language model (LLM), to assess the quality of the generated questions. Most judges (humans or LLMs) agreed that the generated questions had clear answers and generally aligned well with the intended learning objectives. Our findings suggest that an SLM can effectively generate high-quality questions when guided by a well-designed pipeline that leverages its strengths.
[3] Workflow is All You Need: Escaping the "Statistical Smoothing Trap" via High-Entropy Information Foraging and Adversarial Pacing
Zhongjie Jiang
Main category: cs.CL
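TL;DR: This paper proposes the DeepNews framework to address the "impossible trinity" of long-form generation with LLMs: the difficulty of simultaneously achieving low hallucination, strong logical coherence, and personalized expression. By simulating the cognitive process of professional financial journalists and combining dual-granularity retrieval grounded in information foraging theory, strategic planning based on narrative schemas, and adversarial constraint prompting, the framework markedly improves the truthfulness and acceptability of generated content.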
Details
Motivation: Current LLMs face an "impossible trinity" in long-form generation for vertical domains: they struggle to achieve low hallucination, deep logical coherence, and personalized expression at the same time. The root cause is that existing generative paradigms fall into the "Statistical Smoothing Trap", overlooking the high-entropy information acquisition and structured cognitive processes of expert writing.
Method: Proposes the DeepNews framework with three core modules: (1) a dual-granularity retrieval mechanism grounded in information foraging theory that enforces a 10:1 saturated information input ratio; (2) strategic planning based on narrative schemas and Atomic Blocks to build a logical skeleton; (3) adversarial constraint prompting using techniques such as Rhythm Break and Logic Fog to disrupt the probabilistic smoothness of model-generated text.
Result: Experiments reveal a "Knowledge Cliff": content truthfulness drops sharply when the retrieved context falls below 15,000 characters, while high-redundancy input exceeding 30,000 characters stabilizes the Hallucination-Free Rate (HFR) above 85%. In a blind test with a top-tier Chinese technology media outlet, the DeepNews system built on a previous-generation model (DeepSeek-V3-0324) achieved a 25% submission acceptance rate, significantly outperforming the 0% of zero-shot generation by a state-of-the-art model (GPT-5).
Conclusion: By explicitly modeling the implicit cognitive processes of expert writers, the DeepNews framework overcomes the "impossible trinity" of LLMs in professional long-form generation and offers a new paradigm for high-quality content generation in vertical domains.
Abstract: Central to long-form text generation in vertical domains is the "impossible trinity" confronting current large language models (LLMs): the simultaneous achievement of low hallucination, deep logical coherence, and personalized expression. This study establishes that this bottleneck arises from existing generative paradigms succumbing to the Statistical Smoothing Trap, a phenomenon that overlooks the high-entropy information acquisition and structured cognitive processes integral to expert-level writing. To address this limitation, we propose the DeepNews Framework, an agentic workflow that explicitly models the implicit cognitive processes of seasoned financial journalists. The framework integrates three core modules: first, a dual-granularity retrieval mechanism grounded in information foraging theory, which enforces a 10:1 saturated information input ratio to mitigate hallucinatory outputs; second, schema-guided strategic planning, a process leveraging domain expert knowledge bases (narrative schemas) and Atomic Blocks to forge a robust logical skeleton; third, adversarial constraint prompting, a technique deploying tactics including Rhythm Break and Logic Fog to disrupt the probabilistic smoothness inherent in model-generated text.
Experiments delineate a salient Knowledge Cliff in deep financial reporting: content truthfulness collapses when retrieved context falls below 15,000 characters, while a high-redundancy input exceeding 30,000 characters stabilizes the Hallucination-Free Rate (HFR) above 85%. In an ecological validity blind test conducted with a top-tier Chinese technology media outlet, the DeepNews system, built on a previous-generation model (DeepSeek-V3-0324), achieved a 25% submission acceptance rate, significantly outperforming the 0% acceptance rate of zero-shot generation by a state-of-the-art (SOTA) model (GPT-5).
[4] PARAN: Persona-Augmented Review ANswering system on Food Delivery Review Dataset
Moonsoo Park,Jeongseok Yun,Bohyung Kim
Main category: cs.CL
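TL;DR: Proposes a two-stage prompting framework that infers explicit and implicit user personas from short reviews to generate personalized replies, improving the relevance and personalization of automated responses.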
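The two-stage prompting flow in this TL;DR can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's prompts: `llm` is a hypothetical callable taking a prompt and a temperature, the prompt wording is invented, and running stage one at temperature 0.0 is an assumption made here for determinism.

```python
from typing import Callable

# Hypothetical interface: llm(prompt, temperature) -> generated text.
LLM = Callable[[str, float], str]


def persona_augmented_reply(llm: LLM, review: str, temperature: float = 0.7) -> str:
    """Stage 1 infers a persona from the review; stage 2 conditions the reply on it."""
    persona_prompt = (
        "Read the review and list the customer's explicit preferences and "
        "implicit stylistic or demographic cues.\n"
        f"Review: {review}"
    )
    # Assumption: persona inference is run deterministically.
    persona = llm(persona_prompt, 0.0)

    reply_prompt = (
        "Write a reply to the review, tailored to this persona.\n"
        f"Persona: {persona}\n"
        f"Review: {review}"
    )
    # The decoding temperature is only varied for the final reply.
    return llm(reply_prompt, temperature)
```

Because the persona is plain text passed between two prompts, this style of augmentation needs no model fine-tuning, which matches the claim in this entry.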
Details
Motivation: In settings with limited user information, such as food delivery platforms, LLMs often produce generic replies for lack of contextual data, which hurts engagement; reply personalization therefore needs to improve.
Method: Uses a two-stage prompting framework: first infer the user's explicit and implicit attributes (such as preferences and style) from the short review, then incorporate these attributes into the generation prompt, adjusting decoding temperature to control diversity.
Result: Validated on a real-world dataset from a Korean food delivery app; the results show improvements in precision, diversity, and semantic consistency.
Conclusion: Prompt-based persona augmentation can improve the personalization and relevance of automated replies without fine-tuning the model, and suits real-world settings with sparse user data.
Abstract: Personalized review response generation presents a significant challenge in domains where user information is limited, such as food delivery platforms. While large language models (LLMs) offer powerful text generation capabilities, they often produce generic responses when lacking contextual user data, reducing engagement and effectiveness. In this work, we propose a two-stage prompting framework that infers both explicit (e.g., user-stated preferences) and implicit (e.g., demographic or stylistic cues) personas directly from short review texts. These inferred persona attributes are then incorporated into the response generation prompt to produce user-tailored replies. To encourage diverse yet faithful generations, we adjust decoding temperature during inference. We evaluate our method using a real-world dataset collected from a Korean food delivery app, and assess its impact on precision, diversity, and semantic consistency. Our findings highlight the effectiveness of persona-augmented prompting in enhancing the relevance and personalization of automated responses without requiring model fine-tuning.
[5] Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning
Lama Alssum,Hani Itani,Hasan Abed Al Kader Hammoud,Philip Torr,Adel Bibi,Bernard Ghanem
Main category: cs.CL
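TL;DR: This study treats the safety degradation that occurs when fine-tuning LLMs as a continual learning problem and systematically evaluates how well several continual learning methods prevent safety forgetting while preserving task performance, finding that DER performs best.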
Details
Motivation: As LLMs become widely available, safety degradation caused by user-customized fine-tuning is an increasingly serious problem, particularly safety vulnerabilities arising from catastrophic forgetting. Effective methods are needed to preserve model safety in open fine-tuning services.
Method: Frames safety preservation in the fine-tuning-as-a-service setting as a continual learning problem and systematically evaluates regularization-based, memory-based, and model-merging continual learning methods under both benign and poisoned user data.
Result: Continual learning methods substantially lower attack success rates. DER outperforms the other methods and existing baselines, performing well across three tasks (GSM8K, SST2, Code) and three models (LLaMA2-7B, Mistral-7B, Gemma-2B) while maintaining task utility.
Conclusion: Continual learning is an effective and practical way to mitigate safety degradation in LLM fine-tuning, with DER offering the best overall performance.
Abstract: The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.
[6] AutoMedic: An Automated Evaluation Framework for Clinical Conversational Agents with Medical Dataset Grounding
Gyutaek Oh,Sangjoon Park,Byung-Hoon Kim
Main category: cs.CL
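TL;DR: This paper presents AutoMedic, a multi-agent simulation framework for automated evaluation of LLMs as clinical conversational agents. It converts static QA datasets into virtual patient profiles and performs multi-faceted evaluation of multi-turn clinical dialogues using the CARE metric.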
Details
Motivation: Existing medical QA benchmarks focus mainly on static tasks, making it hard to assess LLMs in dynamic, interactive multi-turn clinical conversations, and multi-faceted evaluation beyond accuracy is lacking. A standardizable, automated evaluation framework is therefore needed.
Method: Proposes the AutoMedic framework, which builds virtual patient profiles from off-the-shelf static medical QA datasets to drive realistic, clinically grounded multi-turn dialogues between LLM agents, and automatically evaluates clinical conversational agents along multiple dimensions with the CARE metric (accuracy, efficiency/strategy, empathy, and robustness).
Result: Experiments show that AutoMedic generates clinically realistic multi-turn dialogues and quantifies the performance of different LLM agents with the CARE metric; the evaluation results are endorsed by human experts, supporting the framework's validity and reliability.
Conclusion: AutoMedic offers an effective, automated, multi-faceted framework for evaluating clinical conversational LLMs and helps guide the development and optimization of LLMs for medical dialogue applications.
Abstract: Evaluating large language models (LLMs) has recently emerged as a critical issue for safe and trustworthy application of LLMs in the medical domain. Although a variety of static medical question-answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as the effectiveness of LLMs in generating responses in dynamic, interactive clinical multi-turn conversation situations and the identification of multi-faceted evaluation strategies beyond simple accuracy. However, formally evaluating a dynamic, interactive clinical situation is hindered by its vast combinatorial space of possible patient states and interaction trajectories, making it difficult to standardize and quantitatively measure such scenarios. Here, we introduce AutoMedic, a multi-agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic transforms off-the-shelf static QA datasets into virtual patient profiles, enabling realistic and clinically grounded multi-turn clinical dialogues between LLM agents. The performance of various clinical conversational agents is then assessed based on our CARE metric, which provides a multi-faceted evaluation standard of clinical conversational accuracy, efficiency/strategy, empathy, and robustness.
Our findings, validated by human experts, demonstrate the validity of AutoMedic as an automated evaluation framework for clinical conversational agents, offering practical guidelines for the effective development of LLMs in conversational medical applications.
[7] Multilingual VLM Training: Adapting an English-Trained VLM to French
Jules Lahmi,Alexis Roger
Main category: cs.CL
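TL;DR: This paper explores methods for adapting an English-trained vision-language model (VLM) to other languages, comparing a translation pipeline, LoRA fine-tuning, and a two-stage fine-tuning strategy, and identifies dataset translation quality as the main bottleneck for multilingual VLM performance.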
Details
Motivation: Because current VLMs are largely limited to English, this paper explores ways to extend them to more languages and improve accessibility for non-English speakers.
Method: Compares three methods: a translation-based pipeline, LoRA fine-tuning, and a two-stage fine-tuning strategy that separates vision adaptation from language adaptation. Performance is assessed with translated multimodal benchmarks and manual evaluation by native experts.
Result: Dataset translation quality severely limits model performance and is the main bottleneck for multilingual VLMs; the adaptation methods differ in performance and computational cost.
Conclusion: Future work should focus on building high-quality native-language multimodal datasets and improving translation strategies to make multilingual VLMs more effective.
Abstract: Artificial intelligence has made great progress in recent years, particularly in the development of Vision-Language Models (VLMs) that understand both visual and textual data. However, these advancements remain largely limited to English, reducing their accessibility for non-English speakers. It is essential to extend these capabilities to a broader range of languages. This paper explores the challenges of adapting an English-trained VLM to different languages. To this end, we explore and compare different methods in terms of performance and computational cost. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we use a combination of standard multimodal benchmarks translated into the target language and manual assessments by native experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.
[8] Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
Zhaodong Wang,Zhenting Qi,Sherman Wong,Nathan Hu,Samuel Lin,Jun Ge,Erwin Gao,Yining Yang,Ben Maurer,Wenlin Chen,David Recordon,Yilun Du,Minlan Yu,Ying Zhang
Main category: cs.CL
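TL;DR: This paper presents the Confucius Code Agent (CCA), an open-source AI software engineer that can operate at industrial scale. Built on the Confucius SDK, it offers long-context reasoning, cross-session continual learning, and modular tool use, and reaches 54.3% Resolve@1 on SWE-Bench-Pro, substantially outperforming prior methods.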
Details
Motivation: Existing open-source coding agents fall short on industrial-scale tasks, while proprietary agents perform well but lack extensibility and controllability; an AI software engineering solution that is both high-performing and open is needed.
Method: Proposes the Confucius SDK, comprising a unified orchestrator with hierarchical working memory, a persistent note-taking system, and a modular extension module, with a meta-agent that automates a build-test-improve loop over agent configurations; CCA is built on top of it.
Result: CCA achieves 54.3% Resolve@1 on SWE-Bench-Pro, a state-of-the-art result that substantially surpasses prior coding agents.
Conclusion: Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridging the gap between research prototypes and production systems and supporting industrial-scale agent development and deployment.
Abstract: Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents.
Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.
[9] Sliding Window Attention Adaptation
Yijiong Yu,Jiale Liu,Qingyun Wu,Huazheng Wang,Ji Pei
Main category: cs.CL
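TL;DR: This paper studies how to adapt LLMs pretrained with full attention to sliding window attention (SWA) and proposes a set of practical recipes (SWAA) whose combined techniques substantially recover long-context performance.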
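To make the mechanism concrete, the following Python sketch builds a causal sliding-window attention mask that also keeps the first few "sink" tokens visible, one of the ingredients this entry lists. It is a generic illustration, not code from the authors' repository; the function name `swa_mask` and its parameters are chosen here for the example.

```python
from typing import List


def swa_mask(seq_len: int, window: int, n_sink: int = 0) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    A key is visible if it is not in the future (causal) and is either
    within the last `window` positions or among the first `n_sink` tokens.
    """
    return [
        [(j <= i) and ((i - j) < window or j < n_sink) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Example: 6 tokens, each attending to itself and one predecessor,
# plus token 0 kept as an always-visible sink.
mask = swa_mask(seq_len=6, window=2, n_sink=1)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Each row has at most `window + n_sink` visible keys, so the number of attended positions grows linearly with sequence length instead of quadratically as in full causal attention.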
Details
Motivation: Self-attention in Transformers is expensive for long-context inference. SWA reduces the complexity, but applying it directly to models pretrained with full attention causes severe performance degradation due to a training-inference mismatch. Methods that adapt such models to SWA without re-pretraining are therefore valuable.
Method: Proposes Sliding Window Attention Adaptation (SWAA), which combines five methods: applying SWA only during prefill, preserving "sink" tokens, interleaving FA/SWA layers, chain-of-thought (CoT) prompting, and fine-tuning, with systematic experimental analysis.
Result: No single method adapts the model effectively, but specific combinations substantially recover the original long-context performance; different SWAA configurations show clear trade-offs between efficiency and performance.
Conclusion: LLMs pretrained with full attention can be adapted to SWA without re-pretraining. The key is the synergy of multiple adaptation techniques, and the paper provides recommended configurations for different scenarios.
Abstract: The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference-time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This makes us wonder: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible while non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
[10] Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers
Youmin Ko,Sungjong Seo,Hyunjoon Kim
Main category: cs.CL
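TL;DR: This paper proposes CoopRAG, a new retrieval-augmented generation (RAG) framework in which the retriever cooperates with the LLM and the retriever's own layers cooperate with each other, improving accuracy and retrieval quality on single-hop and multi-hop question answering.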
Details
Motivation: Existing RAG methods still suffer from retrieval errors and hallucinations on simple and multi-hop QA, so a more effective cooperative mechanism is needed.
Method: Unrolls a question into sub-questions and a reasoning chain with masked positions; retrieves documents using the query augmented with the sub-questions and reasoning chain; reranks the documents by contrasting different layers of the retriever; finally, the LLM fills the masks to reconstruct the reasoning chain.
Result: CoopRAG outperforms state-of-the-art methods on three multi-hop QA datasets and one simple QA dataset, improving both retrieval and QA performance.
Conclusion: Through two-way cooperation between the retriever and the LLM and cooperation among the retriever's layers, CoopRAG reduces incorrect retrievals and hallucinations and is an efficient, general RAG framework.
Abstract: Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available at https://github.com/meaningful96/CoopRAG
[11] T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground
Dmitrii Stoianov,Danil Taranets,Olga Tsymboi,Ramil Latypov,Almaz Dautov,Vladislav Kruglikov,Nikita Surkov,German Abramov,Pavel Gein,Dmitry Abulkhanov,Mikhail Gashkov,Viktor Zelenkovskiy,Artem Batalov,Aleksandr Medvedev,Anatolii Potapov
Main category: cs.CL
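TL;DR: T-pro 2.0 is an open-weight Russian LLM supporting hybrid reasoning and efficient inference. The release includes model weights, an instruction corpus, a reasoning benchmark, and decoding components to support reproducible and extensible research.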
Details
Motivation: To advance open research on Russian language understanding and reasoning by providing an efficient, reproducible, and extensible model and toolchain.
Method: Uses a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline, and supports both direct answering and reasoning-trace generation.
Result: Achieves low-latency inference; releases the model weights, the T-Wix 500k instruction dataset, the T-Math reasoning benchmark, and the EAGLE weights; and launches a public demo showing reasoning and non-reasoning modes.
Conclusion: T-pro 2.0 is an open, efficient Russian LLM system suited to building and evaluating practical Russian-language AI applications.
Abstract: We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.
[12] Semantic Reconstruction of Adversarial Plagiarism: A Context-Aware Framework for Detecting and Restoring "Tortured Phrases" in Scientific Literature
Agniva Maiti,Prajwal Panth,Suresh Chandra Satapathy
Main category: cs.CL
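TL;DR: This paper proposes SRAP, a framework for detecting and restoring plagiarized content in scientific literature that has been altered by adversarial paraphrasing tools. It combines a domain-specific language model with semantic retrieval to identify "tortured phrases" and match them to source documents.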
Details
Motivation: Existing plagiarism detection methods perform poorly on "tortured phrases" produced by automated synonym substitution: they miss novel obfuscations and cannot trace sources. A robust method that both detects and restores the original terminology is needed.
Method: A two-stage architecture: stage one performs token-level statistical anomaly detection using SciBERT-based pseudo-perplexity; stage two performs source-based semantic reconstruction via FAISS vector retrieval and SBERT sentence alignment.
Result: Zero-shot baselines fail completely (0.00% restoration accuracy), whereas SRAP reaches 23.67% restoration accuracy; static decision boundaries prove more stable and reliable in jargon-heavy text.
Conclusion: SRAP detects adversarial plagiarism in scientific text and traces it to its source through semantic reconstruction, offering a new technical safeguard for academic integrity.
Abstract: The integrity and reliability of scientific literature faces a serious threat from adversarial text generation techniques, specifically the use of automated paraphrasing tools to mask plagiarism. These tools generate "tortured phrases", statistically improbable synonyms (e.g. "counterfeit consciousness" for "artificial intelligence"), that preserve the local grammar while obscuring the original source. Most existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates for novel obfuscations and cannot determine the source of the plagiarized content. In this paper, we propose Semantic Reconstruction of Adversarial Plagiarism (SRAP), a framework designed not only to detect these anomalies but to mathematically recover the original terminology. We use a two-stage architecture: (1) statistical anomaly detection with a domain-specific masked language model (SciBERT) using token-level pseudo-perplexity, and (2) source-based semantic reconstruction using dense vector retrieval (FAISS) and sentence-level alignment (SBERT). Experiments on a parallel corpus of adversarial scientific text show that while zero-shot baselines fail completely (0.00 percent restoration accuracy), our retrieval-augmented approach achieves 23.67 percent restoration accuracy, significantly outperforming baseline methods. We also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, since dynamic thresholding fails under high variance.
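The first stage described above, token-level pseudo-perplexity with a static threshold, can be sketched in Python as follows. This is a toy illustration rather than the SRAP implementation: `masked_prob` is a hypothetical stand-in for a masked language model such as SciBERT, the probabilities in `TOY_PROBS` are invented, and the threshold of 3.0 is arbitrary.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: probability the model assigns to tokens[i]
# when position i is masked and the rest of the sentence is visible.
MaskedProb = Callable[[Sequence[str], int], float]


def token_surprisals(tokens: Sequence[str], masked_prob: MaskedProb) -> List[float]:
    """Negative log-probability of each token when it alone is masked."""
    return [-math.log(masked_prob(tokens, i)) for i in range(len(tokens))]


def pseudo_perplexity(tokens: Sequence[str], masked_prob: MaskedProb) -> float:
    """Exponential of the mean per-token surprisal."""
    surprisals = token_surprisals(tokens, masked_prob)
    return math.exp(sum(surprisals) / len(surprisals))


def flag_anomalies(tokens: Sequence[str], masked_prob: MaskedProb, threshold: float) -> List[str]:
    """Static decision boundary: flag tokens whose surprisal exceeds a fixed threshold."""
    surprisals = token_surprisals(tokens, masked_prob)
    return [tok for tok, s in zip(tokens, surprisals) if s > threshold]


# Invented probabilities standing in for a real masked language model.
TOY_PROBS = {"counterfeit": 0.01, "consciousness": 0.01}


def toy_masked_prob(tokens: Sequence[str], i: int) -> float:
    return TOY_PROBS.get(tokens[i], 0.5)


flagged = flag_anomalies(
    ["counterfeit", "consciousness", "is", "useful"], toy_masked_prob, threshold=3.0
)
```

A fixed threshold is used here because the abstract reports that dynamic thresholding fails under high variance; in a real system the flagged spans would then be passed to the retrieval stage.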
SRAP enables forensic analysis by linking obfuscated expressions back to their most probable source documents.
[13] Enhancing Next-Generation Language Models with Knowledge Graphs: Extending Claude, Mistral IA, and GPT-4 via KG-BERT
Nour El Houda Ben Chaabene,Hamza Hammami
Main category: cs.CL
TL;DR: Proposes integrating knowledge graphs via KG-BERT to give LLMs structured knowledge, improving factual accuracy and reasoning on knowledge-intensive tasks.
Details
Motivation: LLMs excel at natural language processing but lack structured knowledge, leading to factual inconsistencies in generated content.
Method: Uses KG-BERT to combine knowledge graphs with LLMs, strengthening knowledge grounding and reasoning.
Result: Experiments show significant gains on knowledge-intensive tasks such as question answering and entity linking.
Conclusion: Integrating knowledge graphs improves the factual reliability of LLMs and supports the development of more context-aware next-generation models.
Abstract: Large language models (LLMs) like Claude, Mistral IA, and GPT-4 excel in NLP but lack structured knowledge, leading to factual inconsistencies. We address this by integrating Knowledge Graphs (KGs) via KG-BERT to enhance grounding and reasoning. Experiments show significant gains in knowledge-intensive tasks such as question answering and entity linking. This approach improves factual reliability and enables more context-aware next-generation LLMs.
[14] Decoding Student Minds: Leveraging Conversational Agents for Psychological and Learning Analysis
Nour El Houda Ben Chaabene,Hamza Hammami,Laid Kahloul
Main category: cs.CL
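TL;DR: Proposes a psychologically aware conversational agent that combines LLMs with multimodal data to identify students' cognitive and affective states in real time, improving learning performance and emotional well-being.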
Details
Motivation: Existing educational chatbots are usually limited to either tutoring or emotional support and lack an integrated understanding of students' changing psychological states, making personalized, adaptive teaching interventions difficult.
Method: Combines LLMs, a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional LSTM with attention, fusing textual semantics, prosodic speech features, and temporal behavioral patterns to classify students' cognitive and affective states in real time.
Result: A pilot study with university students indicates that, compared with baseline methods, the system improves motivation, reduces stress, and yields moderate academic gains.
Conclusion: Integrating semantic reasoning, multimodal fusion, and temporal modeling supports adaptive, student-centered educational interventions, with potential to promote learning outcomes and mental well-being in intelligent education.
Abstract: This paper presents a psychologically-aware conversational agent designed to enhance both learning performance and emotional well-being in educational settings. The system combines Large Language Models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional Long Short-Term Memory (LSTM) with attention to classify students' cognitive and affective states in real time. Unlike prior chatbots limited to either tutoring or affective support, our approach leverages multimodal data (including textual semantics, prosodic speech features, and temporal behavioral trends) to infer engagement, stress, and conceptual understanding. A pilot study with university students demonstrated improved motivation, reduced stress, and moderate academic gains compared to baseline methods. These results underline the promise of integrating semantic reasoning, multimodal fusion, and temporal modeling to support adaptive, student-centered educational interventions.
[15] Grammaticality Judgments in Humans and Language Models: Revisiting Generative Grammar with LLMs
Lars G. B. Johnsen
Main category: cs.CL
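TL;DR: This paper asks whether LLMs trained only on surface forms reproduce classic linguistic phenomena tied to syntactic structure, such as subject-auxiliary inversion and parasitic gap licensing. It finds that LLMs reliably distinguish grammatical from ungrammatical sentences, indicating sensitivity to syntactic structure rather than reliance on linear order alone.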
Details
Motivation: In traditional generative grammar, contrasts in grammaticality are taken as evidence for an internal hierarchical grammar. This paper tests whether LLMs trained only on surface forms show similar structural sensitivity, challenging the view that syntactic structure must be explicitly encoded.
Method: Focuses on two classic constructions, subject-auxiliary inversion and parasitic gap licensing; prompts models including GPT-4 and LLaMA-3 for acceptability ratings and compares their judgments of grammatical and ungrammatical variants.
Result: LLMs reliably distinguish grammatical from ungrammatical variants in both constructions, showing that they recognize syntactic boundaries and capture abstract dependencies.
Conclusion: Although syntactic structure is not explicitly encoded, functional sensitivity to it emerges in LLMs from predictive training on surface forms, suggesting that structural generalization can arise without cognitive knowledge.
Abstract: What counts as evidence for syntactic structure? In traditional generative grammar, systematic contrasts in grammaticality such as subject-auxiliary inversion and the licensing of parasitic gaps are taken as evidence for an internal, hierarchical grammar. In this paper, we test whether large language models (LLMs), trained only on surface forms, reproduce these contrasts in ways that imply an underlying structural representation. We focus on two classic constructions: subject-auxiliary inversion (testing recognition of the subject boundary) and parasitic gap licensing (testing abstract dependency structure). We evaluate models including GPT-4 and LLaMA-3 using prompts eliciting acceptability ratings. Results show that LLMs reliably distinguish between grammatical and ungrammatical variants in both constructions, which supports the view that they are sensitive to structure and not just linear order. Structural generalizations, distinct from cognitive knowledge, emerge from predictive training on surface forms, suggesting functional sensitivity to syntax without explicit encoding.
[16] XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs
Iñaki Lacunza,José Javier Saiz,Alexander Shvets,Aitor Gonzalez-Agirre,Marta Villegas
Main category: cs.CL
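TL;DR: Proposes the XDoGE algorithm to optimize the multilingual data distribution, improving mid- and low-resource language performance through reweighting and continual pre-training, and releases a new IberianLLM-7B-Instruct model.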
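Once language weights are fixed, rescaling the data is simple arithmetic, and the short Python sketch below shows how such weights translate into repetition of small languages and under-sampling of large ones. It does not implement XDoGE itself (the weights are taken as given), and the function name, languages, token counts, and budget are invented for illustration.

```python
from typing import Dict


def repetition_factors(
    weights: Dict[str, float],          # target share of training tokens per language
    available_tokens: Dict[str, int],   # tokens actually available per language
    budget: int,                        # total training tokens to draw
) -> Dict[str, float]:
    """Return how many passes over each language's data the weights imply.

    A factor above 1.0 means the language is repeated (multiple epochs);
    a factor below 1.0 means it is under-sampled.
    """
    total_weight = sum(weights.values())
    return {
        lang: (weights[lang] / total_weight) * budget / available_tokens[lang]
        for lang in weights
    }


# Invented numbers: a large and a small corpus given equal weight.
factors = repetition_factors(
    weights={"en": 0.5, "eu": 0.5},
    available_tokens={"en": 1000, "eu": 100},
    budget=400,
)
```

In this toy setting the large corpus is sampled at a fraction of one pass while the small corpus is seen twice, which is the kind of repetition versus under-sampling trade-off this entry reports studying.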
Details
Motivation: Existing LLMs rely too heavily on high-resource languages such as English, which hurts performance in mid- and low-resource languages; the data distribution for multilingual training needs to be optimized.
Method: Extends the DoGE algorithm to XDoGE, training a small proxy model to determine optimal language weights; uses these weights to rescale the data and train Salamandra-2b and a new model either from scratch or through continual pre-training (CPT).
Result: Examines, using the IberoBench framework, how data repetition and under-sampling affect model performance, and trains IberianLLM-7B-Instruct, a model focused on Iberian languages.
Conclusion: XDoGE optimizes the allocation of language weights in multilingual training, markedly improving mid- and low-resource language performance while preserving high-resource language performance.
Abstract: Current large language models (LLMs) are trained on massive amounts of text data, primarily from a few dominant languages. Studies suggest that this over-reliance on high-resource languages, such as English, hampers LLM performance in mid- and low-resource languages. To mitigate this problem, we propose to (i) optimize the language distribution by training a small proxy model within a domain-reweighting DoGE algorithm that we extend to XDoGE for a multilingual setup, and (ii) rescale the data and train a full-size model with the established language weights either from scratch or within a continual pre-training phase (CPT). We target six languages possessing a variety of geographic and intra- and inter-language-family relations, namely, English and Spanish (high-resource), Portuguese and Catalan (mid-resource), Galician and Basque (low-resource). We experiment with Salamandra-2b, which is a promising model for these languages. We investigate the effects of substantial data repetition on minor languages and under-sampling on dominant languages using the IberoBench framework for quantitative evaluation. Finally, we release a new promising IberianLLM-7B-Instruct model centering on Iberian languages and English that we pretrained from scratch and further improved using CPT with the XDoGE weights.
[17] Causal Reasoning Favors Encoders: On The Limits of Decoder-Only Models
Amartya Roy,Elamparithy M,Kripabandhu Ghosh,Ponnurangam Kumaraguru,Adrian de Wynter
Main category: cs.CL
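TL;DR: This paper studies how models of different architectures perform on causal reasoning. It finds that in-context learning (ICL) alone is insufficient for reliable causal reasoning and that decoder-only models are especially sensitive to distribution shift. Fine-tuned encoder and encoder-decoder models are more robust across tests and outperform decoder-only models particularly in non-natural-language settings; the authors recommend them for cost-sensitive, short-horizon applications.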
Details
Motivation: To investigate the effectiveness of ICL for causal reasoning and to compare how different architectures (encoder, encoder-decoder, and decoder-only) perform on multi-hop causal reasoning tasks requiring conjunctive control.
Method: Evaluates models of several architectures on causal reasoning tasks under zero-shot and few-shot ICL and under fine-tuning, in both natural-language and non-natural-language settings.
Result: With ICL alone, models often over-focus on irrelevant input features, making reasoning unreliable. Decoder-only models are sensitive to distribution shift, while fine-tuned encoder and encoder-decoder models generalize better, especially on non-natural-language tasks; decoder-only models match or surpass them only at large scale.
Conclusion: For low-cost, short-horizon causal reasoning that must be robust, encoder or encoder-decoder architectures with targeted fine-tuning are recommended over ICL with large decoder-only models.
Abstract: In-context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remain unclear. Causal reasoning demands multi-hop composition and strict conjunctive control, and reliance on spurious lexical relations in the input could provide misleading results. We hypothesize that, due to their ability to project the input into a latent space, encoder and encoder-decoder architectures are better suited for said multi-hop conjunctive reasoning than decoder-only models. To test this, we compare fine-tuned versions of all the aforementioned architectures with zero- and few-shot ICL in both natural-language and non-natural-language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often overfocusing on irrelevant input features. In particular, decoder-only models are noticeably brittle to distributional shifts, while fine-tuned encoder and encoder-decoder models can generalize more robustly across our tests, including the non-natural-language split. Both architectures are only matched or surpassed by decoder-only architectures at large scales. We conclude by noting that for cost-effective, short-horizon robust causal reasoning, encoder or encoder-decoder architectures with targeted fine-tuning are preferable.
[18] RoleRMBench & RoleRM: Towards Reward Modeling for Profile-Based Role Play in Dialogue Systems
Hang Ding,Qiming Feng,Dongqi Liu,Qi Zhao,Tao Yao,Shuo Wang,Dongsheng Chen,Jian Li,Zhenye Gan,Jiangning Zhang,Chengjie Wang,Yabiao Wang
Main category: cs.CL
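TL;DR: This paper introduces RoleRMBench, the first systematic benchmark for reward modeling in role-play dialogue, and RoleRM, a reward model trained with Continuous Implicit Preferences (CIP) that substantially outperforms existing models in narrative coherence and stylistic fidelity.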
Details
Motivation: Existing reward models perform poorly on subjective, open-ended tasks such as role play, struggle to capture nuanced human judgments grounded in persona and context, and lack an evaluation benchmark dedicated to such settings.
Method: The authors build RoleRMBench, a role-play reward modeling benchmark covering seven fine-grained capabilities, and propose RoleRM, which uses the Continuous Implicit Preferences (CIP) framework to learn from continuous, consistent pairwise supervision under multiple structuring strategies.
Result: Experiments reveal a significant gap between general-purpose reward models and human judgment; RoleRM surpasses strong open- and closed-source reward models by over 24% on average, with especially large gains in the narrative and stylistic dimensions.
Conclusion: Continuous preference representation and annotation consistency are crucial for subjective alignment, and RoleRM offers an effective reward modeling approach for human-centered dialogue systems.
Abstract: Reward modeling has become a cornerstone of aligning large language models (LLMs) with human preferences. Yet, when extended to subjective and open-ended domains such as role play, existing reward models exhibit severe degradation, struggling to capture nuanced and persona-grounded human judgments. To address this gap, we introduce RoleRMBench, the first systematic benchmark for reward modeling in role-playing dialogue, covering seven fine-grained capabilities from narrative management to role consistency and engagement. Evaluation on RoleRMBench reveals large and consistent gaps between general-purpose reward models and human judgment, particularly in narrative and stylistic dimensions. We further propose RoleRM, a reward model trained with Continuous Implicit Preferences (CIP), which reformulates subjective evaluation as continuous consistent pairwise supervision under multiple structuring strategies. Comprehensive experiments show that RoleRM surpasses strong open- and closed-source reward models by over 24% on average, demonstrating substantial gains in narrative coherence and stylistic fidelity. Our findings highlight the importance of continuous preference representation and annotation consistency, establishing a foundation for subjective alignment in human-centered dialogue systems.
[19] AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence
Bo Yang,Lanfei Feng,Yunkui Chen,Yu Zhang,Jianyu Zhang,Xiao Xu,Nueraili Aierken,Shijian Li
Main category: cs.CL
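TL;DR: This paper presents AgriGPT-Omni, a unified agricultural multimodal framework integrating speech, vision, and text, together with a large-scale multilingual agricultural speech dataset and AgriBench-Omni-2K, the first tri-modal agricultural benchmark. A three-stage training paradigm enables unified reasoning across modalities and languages, and the model significantly outperforms general-purpose models on multilingual multimodal tasks.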
Details
Motivation: Existing agricultural applications are constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks, which makes it hard to support sustainable AI development in low-resource regions.
Method: The AgriGPT-Omni framework uses a scalable data synthesis pipeline to build a large-scale multilingual agricultural speech dataset; trains an agricultural omni-model through a three-stage paradigm (textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning); and introduces AgriBench-Omni-2K, the first tri-modal agricultural benchmark spanning speech, vision, and text.
Result: Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual multimodal reasoning and real-world speech understanding; the authors state that all models, data, benchmarks, and code will be released.
Conclusion: AgriGPT-Omni provides the first unified multimodal, multilingual omni-framework and evaluation suite for agriculture, supporting reproducible research and inclusive agricultural intelligence, particularly for AI applications in low-resource regions.
Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks. To address these challenges, we present AgriGPT-Omni, an agricultural omni-framework that integrates speech, vision, and text in a unified framework. First, we construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data, resulting in the largest agricultural speech dataset to date, including 492K synthetic and 1.4K real speech samples across six languages. Second, based on this, we train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning, enabling unified reasoning across languages and modalities. Third, we propose AgriBench-Omni-2K, the first tri-modal benchmark for agriculture, covering diverse speech-vision-text tasks and multilingual slices, with standardized protocols and reproducible tools. Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding. All models, data, benchmarks, and code will be released to promote reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions.
[20] From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low-Resource Languages
Smiljana Antonijevic Ubois
Main category: cs.CL
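TL;DR: Taking Serbian as a case study, this work examines the obstacles that low-resource languages face in language technology development in the AI age and proposes Data Care, a framework grounded in the CARE principles, for building more inclusive and culturally sensitive language technologies.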
Details
Motivation: To address the cultural and linguistic biases toward low-resource languages in large language model training, in particular the reductive treatment of language that stems from the historical destruction of textual heritage and from contemporary engineering-first approaches.
Method: Semi-structured interviews with ten scholars and practitioners (including linguists, digital humanists, and AI developers) are used to analyze the structural, historical, and sociotechnical factors shaping language technology development for low-resource languages.
Result: The study identifies key challenges in Serbian language technology development, including superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity, and proposes the Data Care framework in response.
Conclusion: Data Care should become an integral component of corpus design, annotation, and governance; it can serve as a replicable model for correcting the power imbalances and cultural blind spots of traditional LLM development and for fostering sustainable, culturally grounded language technologies.
Abstract: Large language models are commonly trained on dominant languages like English, and their representation of low-resource languages typically reflects cultural and linguistic biases present in the source language materials. Using the Serbian language as a case, this study examines the structural, historical, and sociotechnical factors shaping language technology development for low-resource languages in the AI age. Drawing on semi-structured interviews with ten scholars and practitioners, including linguists, digital humanists, and AI developers, it traces challenges rooted in historical destruction of Serbian textual heritage, intensified by contemporary issues that drive reductive, engineering-first approaches prioritizing functionality over linguistic nuance. These include superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity. To address these challenges, the study proposes Data Care, a framework grounded in CARE principles (Collective Benefit, Authority to Control, Responsibility, and Ethics), that reframes bias mitigation from a post hoc technical fix to an integral component of corpus design, annotation, and governance, and positions Data Care as a replicable model for building inclusive, sustainable, and culturally grounded language technologies in contexts where traditional LLM development reproduces existing power imbalances and cultural blind spots.
[21] Textual Data Bias Detection and Mitigation - An Extensible Pipeline with Experimental Evaluation
Rebekka Görge,Sujan Sai Gannamaneni,Tabea Naeven,Hammam Abdelwahab,Héctor Allende-Cid,Armin B. Cremers,Lennard Helmer,Michael Mock,Anna Schmitz,Songkai Xue,Elif Yildirir,Maximilian Poretschkin,Stefan Wrobel
Main category: cs.CL
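TL;DR: Proposes a comprehensive pipeline for detecting and mitigating representation bias and explicit stereotypes in text data. Its four components are validated on sensitive attributes such as gender, religion, and age; data debiasing effectively improves data quality, but its effect on model bias is inconsistent, exposing limitations of current evaluation methods.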
Details
Motivation: Regulations such as the European AI Act require identifying and mitigating biases against protected groups in training data in order to prevent unfair model outputs, yet practical guidance and systematic methods are lacking.
Method: A data bias detection and mitigation pipeline with four components: (1) LLM-generated word lists, created according to quality criteria, to identify group labels; (2) the Demographic Representation Score to quantify representation bias; (3) sociolinguistically informed filtering to detect and remove stereotypes; and (4) Grammar- and Context-Aware Counterfactual Data Augmentation to compensate for representation bias.
Result: A two-fold evaluation on gender, religion, and age shows that the components effectively reduce representation bias and explicit stereotypes in the data; however, LLMs (0.6B-8B parameters) fine-tuned on the debiased data do not consistently improve on bias benchmarks.
Conclusion: Although the method reduces bias in the training data, its effect on model-level bias is limited, which points to critical gaps in current bias evaluation methods and to the need for targeted data manipulation to address manifested model bias.
Abstract: Textual data used to train large language models (LLMs) exhibits multifaceted bias manifestations encompassing harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two data bias types, namely representation bias and (explicit) stereotypes for a configurable sensitive attribute. First, we leverage LLM-generated word lists created based on quality criteria to detect relevant group labels. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using the examples of gender, religion and age. First, the effectiveness of each individual component on data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset. Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters), fine-tuned on the debiased text dataset.
This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.
[22] Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
Songyang Gao,Yuzhe Gu,Zijian Wu,Lingkai Kong,Wenwei Zhang,Zhongrui Cai,Fan Zheng,Tianyou Ma,Junhao Shen,Haiteng Zhao,Duanyang Zhang,Huilun Zhang,Kuikun Liu,Chengqi Lyu,Yanhui Duan,Chiyu Chen,Ningsheng Ma,Jianfei Gao,Han Lyu,Dahua Lin,Kai Chen
Main category: cs.CL
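TL;DR: This paper proposes a new Outcome-based Process Verifier (OPV) that verifies the reasoning process by working from summarized outcomes of long chains of thought. Combined with iterative active learning and rejection fine-tuning, it achieves efficient and accurate verification at reduced annotation cost and reaches state-of-the-art performance on multiple tasks.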
Details
Motivation: Existing outcome-based verifiers cannot inspect unreliable intermediate steps in long reasoning chains, while process-based verifiers are limited by the scarcity of high-quality annotations and struggle to detect errors in complex reasoning reliably. A more efficient and accurate verification method is therefore needed.
Method: The Outcome-based Process Verifier (OPV) verifies the reasoning process indirectly, via summarized outcomes of long chains of thought; an iterative active learning framework selects the most uncertain cases for expert annotation, and Rejection Fine-Tuning (RFT) and Reinforcement Learning with Verifiable Rewards (RLVR) progressively improve OPV.
Result: OPV reaches an F1 score of 83.1 on the authors' held-out benchmark, surpassing much larger open-source models such as Qwen3-Max-Preview (76.3); it effectively identifies false positives in synthetic data, in close agreement with expert judgment; and when paired with policy models it yields clear gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B on AIME2025 from 55.2% to 73.3%.
Conclusion: OPV enables efficient, accurate verification of long reasoning chains at reduced annotation cost, is broadly applicable, and offers an effective route to building reliable LLM reasoning systems.
Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability.
It achieves new state-of-the-art results on our held-out benchmark, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic data, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
[23] TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage
Elroy Galbraith,Chadwick Sutherland,Donahue Morgan
Main category: cs.CL
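TL;DR: This paper presents TRIDENT, a three-layer dispatcher-support architecture intended to improve recognition and triage of Caribbean-accented emergency calls. By combining accent-tuned speech recognition, local entity extraction, and bio-acoustic distress detection, it makes use of low-confidence transcriptions and vocal stress signals, with the aim of ensuring that Caribbean populations receive equitable emergency care.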
Details
Motivation: Existing emergency speech recognition systems degrade on non-standard English varieties such as Caribbean accents, creating a service gap that undermines equitable access to emergency triage for Caribbean populations.
Method: The three-layer TRIDENT architecture comprises (1) automatic speech recognition (ASR) tuned for Caribbean accents; (2) local entity extraction based on large language models; and (3) bio-acoustic distress detection. The system integrates three signals (transcription confidence, structured clinical entities, and vocal stress indicators), treats low ASR confidence as a queue prioritization signal, and draws on psycholinguistic research on stress-induced code-switching.
Result: The architecture is designed to keep supporting dispatchers even when ASR fails, in particular by using the combination of low ASR confidence and high vocal stress to flag callers in crisis; semantic analysis captures life-threatening emergencies that are reported without evident vocal stress. Offline deployment is supported for disaster scenarios.
Conclusion: TRIDENT offers a framework for accent-resilient emergency AI that helps ensure Caribbean speakers have equal access to national triage protocols and advances fairness in speech AI for emergency response.
Abstract: Emergency speech recognition systems exhibit systematic performance degradation on non-standard English varieties, creating a critical gap in services for Caribbean populations. We present TRIDENT (Transcription and Routing Intelligence for Dispatcher-Empowered National Triage), a three-layer dispatcher-support architecture designed to structure emergency call inputs for human application of established triage protocols (the ESI for routine operations and START for mass casualty events), even when automatic speech recognition fails. The system combines Caribbean-accent-tuned ASR, local entity extraction via large language models, and bio-acoustic distress detection to provide dispatchers with three complementary signals: transcription confidence, structured clinical entities, and vocal stress indicators. Our key insight is that low ASR confidence, rather than representing system failure, serves as a valuable queue prioritization signal -- particularly when combined with elevated vocal distress markers indicating a caller in crisis whose speech may have shifted toward basilectal registers. A complementary insight drives the entity extraction layer: trained responders and composed bystanders may report life-threatening emergencies without elevated vocal stress, requiring semantic analysis to capture clinical indicators that paralinguistic features miss.
We describe the architectural design, theoretical grounding in psycholinguistic research on stress-induced code-switching, and deployment considerations for offline operation during disaster scenarios. This work establishes a framework for accent-resilient emergency AI that ensures Caribbean voices receive equitable access to established national triage protocols. Empirical validation on Caribbean emergency calls remains future work.
[24] OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
Zijian Wu,Lingkai Kong,Wenwei Zhang,Songyang Gao,Yuzhe Gu,Zhongrui Cai,Tianyou Ma,Yuhong Liu,Zhi Wang,Runyuan Ma,Guangyu Wang,Wei Li,Conghui He,Dahua Lin,Kai Chen
Main category: cs.CL
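TL;DR: This paper proposes the Outcome-based Process Verifier (OPV), which verifies the reasoning process effectively by summarizing the intermediate results of long chains of thought. Combined with iterative active learning and rejection fine-tuning, it achieves efficient and accurate verification at lower annotation cost and shows superior performance across multiple tasks.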
Details
Motivation: Existing outcome-based verifiers cannot inspect unreliable intermediate steps in long reasoning chains, while process-based verifiers are limited by the scarcity of high-quality annotations and struggle to detect errors in complex reasoning reliably. A more efficient and scalable verification method is therefore needed.
Method: The Outcome-based Process Verifier (OPV) performs process verification on summarized intermediate results of long chains of thought; an iterative active learning framework selects the cases on which the current model is most uncertain for expert annotation, and Rejection Fine-Tuning (RFT) and Reinforcement Learning with Verifiable Rewards (RLVR) progressively improve OPV.
Result: OPV reaches an F1 score of 83.1 on the authors' OPV-Bench, surpassing larger models such as Qwen3-Max-Preview; it effectively identifies erroneous samples in synthetic data, in close agreement with expert assessment; and when paired with policy models it yields clear gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B on AIME2025 from 55.2% to 73.3%.
Conclusion: OPV delivers accurate, efficient verification of the reasoning process, scales well, and is broadly applicable, supporting progress in LLM reasoning at lower annotation cost.
Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3.
Furthermore, OPV effectively detects false positives within synthetic data, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
[25] Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation
Kevin Glocker,Kätriin Kukk,Romina Oji,Marcel Bollmann,Marco Kuhlmann,Jenny Kunz
Main category: cs.CL
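TL;DR: This study examines model upscaling as a strategy for efficiently adapting to new target languages. It finds that larger models are more data-efficient and better at preserving base capabilities, and it explores merging methods for building modular multilingual systems.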
Details
Motivation: To address the weak performance of medium- and lower-resource languages in massively multilingual models, especially the gap to language-specific models at small model scales.
Method: Systematic scaling ablations with approximately FLOP-matched models compare upscaling an English base model against standard continued pretraining for target-language adaptation, and explore merging the upscaled language-specific models into multilingual systems.
Result: Once exposed to sufficient target-language data, larger upscaled models match or surpass smaller models continually pretrained on much more data; scaling helps preserve English capabilities and reduces catastrophic forgetting; merges of upscaled models outperform merges of smaller ones but still trail joint multilingual training, and performance differs considerably across merging methods.
Conclusion: Upscaling is a data-efficient strategy that improves language adaptation and mitigates forgetting, and it offers a viable path toward modular multilingual systems; merging methods specialized for language-level integration could bring further gains.
Abstract: Achieving high-performing language models which include medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform compared to language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model's capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.
[26] Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting
Manurag Khullar,Utkarsh Desai,Poorva Malviya,Aman Dalmia,Zheyuan Ryan Shi
Main category: cs.CL
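TL;DR: This study examines how romanized text affects the reliability of large language models (LLMs) in maternal and newborn health triage in India. Performance drops markedly on romanized input: although the models can understand the meaning, their outputs remain unstable.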
Details
Motivation: In clinical applications in India, users often write local languages in romanized text rather than native scripts, yet existing research rarely evaluates this variation on real-world data, particularly its effect on high-stakes medical decisions.
Method: Leading LLMs are benchmarked on a real-world dataset of user-generated queries spanning five Indian languages and Nepali, comparing performance on native-script and romanized text and analyzing the gap between semantic understanding and classification output.
Result: Romanized text lowers F1 scores by 5-12 points, which at the partner organization could lead to nearly 2 million excess triage errors; although models often infer the semantic intent of romanized queries correctly, their final classifications remain brittle under orthographic noise.
Conclusion: The study reveals a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably, which underscores the importance of handling script variation in multilingual, low-resource settings.
Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. In many such settings, speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely evaluates this orthographic variation using real-world data. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with F1 scores trailing those of native scripts by 5-12 points. At our partner maternal health organization in India, this gap could cause nearly 2 million excess errors in triage. Crucially, this performance gap by scripts is not due to a failure in clinical reasoning. We demonstrate that LLMs often correctly infer the semantic intent of romanized queries. Nevertheless, their final classification outputs remain brittle in the presence of orthographic noise in romanized inputs. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.
[27] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
Aileen Cheng,Alon Jacovi,Amir Globerson,Ben Golan,Charles Kwong,Chris Alberti,Connie Tao,Eyal Ben-David,Gaurav Singh Tomar,Lukas Haas,Yonatan Bitton,Adam Bloniarz,Aijun Bai,Andrew Wang,Anfal Siddiqui,Arturo Bajuelos Castillo,Aviel Atias,Chang Liu,Corey Fry,Daniel Balle,Deepanway Ghosal,Doron Kukliansky,Dror Marcus,Elena Gribovskaya,Eran Ofek,Honglei Zhuang,Itay Laish,Jan Ackermann,Lily Wang,Meg Risdal,Megan Barnes,Michael Fink,Mohamed Amin,Moran Ambar,Natan Potikha,Nikita Gupta,Nitzan Katz,Noam Velan,Ofir Roval,Ori Ram,Polina Zablotskaia,Prathamesh Bang,Priyanka Agrawal,Rakesh Ghiya,Sanjay Ganapathy,Simon Baumgartner,Sofia Erell,Sushant Prakash,Thibault Sellam,Vikram Rao,Xuanhui Wang,Yaroslav Akulov,Yulong Yang,Zhen Yang,Zhixin Lai,Zhongru Wu,Anca Dragan,Avinatan Hassidim,Fernando Pereira,Slav Petrov,Srinivasan Venkatachary,Tulsee Doshi,Yossi Matias,Sasha Goldshtein,Dipanjan Das
Main category: cs.CL
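TL;DR: The FACTS Leaderboard is a benchmark suite that comprehensively evaluates how well language models generate factually accurate text across diverse scenarios. It comprises four sub-leaderboards scored by automated judge models, giving a holistic measure of model factuality.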
Details
Motivation: To evaluate the factual accuracy of language models comprehensively across diverse scenarios, addressing the narrowness of existing evaluations.
Method: A benchmark suite with four sub-leaderboards (multimodal, parametric knowledge, search-augmented, and document grounding), in which automated judge models score model responses.
Result: A publicly accessible, actively maintained factuality evaluation platform, with public and private splits to protect its integrity.
Conclusion: The FACTS Leaderboard provides a more comprehensive, balanced, and robust framework for assessing the factuality of language models.
Abstract: We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
[28] LabelFusion: Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification
Michael Schlee,Christoph Weisser,Timo Kivimäki,Melchizedek Mashiku,Benjamin Saefken
Main category: cs.CL
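TL;DR: LabelFusion is a fusion ensemble for text classification that combines a traditional transformer model with large language models (LLMs), learning to fuse their signals to improve accuracy and cost-effectiveness on multi-class and multi-label tasks.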
Details
Motivation: To combine the efficiency of traditional transformer classifiers with the reasoning ability of large language models, addressing the trade-off between accuracy, latency, and cost in text classification.
Method: Per-class scores are obtained from the LLM through structured prompt engineering, concatenated with the traditional model's embeddings, and fed into a compact multi-layer perceptron (FusionMLP) that is trained end-to-end to fuse the two signals.
Result: 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification, indicating robust performance across domains.
Conclusion: LabelFusion effectively combines the strengths of traditional models and LLMs, achieving high accuracy while allowing practical trade-offs among accuracy, latency, and cost.
Abstract: LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone's embeddings with the LLM-derived per-class scores -- obtained through structured prompt-engineering strategies -- and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains -- achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification -- while enabling practical trade-offs between accuracy, latency, and cost.
[29] Quantifying Emotional Tone in Tolkien's The Hobbit: Dialogue Sentiment Analysis with RegEx, NRC-VAD, and Python
Lilin Qiu
Main category: cs.CL
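TL;DR: This study uses computational text analysis to examine the emotional tone of dialogue in The Hobbit. It finds a generally positive, calm tone with a gradually increasing sense of dominance, reflecting the story's emotional rhythm of alternating tension and comfort.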
Details
Motivation: To explore the emotional structure of dialogue in The Hobbit and its effect on narrative rhythm, combining digital methods with literary interpretation to reveal patterns of emotional modulation in Tolkien's writing.
Method: Dialogue is extracted with regular expressions, preprocessed, and scored with the NRC-VAD lexicon to quantify emotional dimensions; the results are visualized with emotional trajectory graphs and word clouds.
Result: The dialogue maintains high valence and low arousal overall, dominance rises gradually as the plot advances, and emotion fluctuates cyclically, reflecting a balance between tension on the one hand and humor and camaraderie on the other.
Conclusion: Digital methods can uncover subtle emotional structures in literature, revealing the steady emotional rhythm and emotional modulation in the storytelling of The Hobbit.
Abstract: This study analyzes the emotional tone of dialogue in J. R. R. Tolkien's The Hobbit (1937) using computational text analysis. Dialogue was extracted with regular expressions, then preprocessed, and scored using the NRC-VAD lexicon to quantify emotional dimensions. The results show that the dialogue maintains a generally positive (high valence) and calm (low arousal) tone, with a gradually increasing sense of agency (dominance) as the story progresses. These patterns reflect the novel's emotional rhythm: moments of danger and excitement are regularly balanced by humor, camaraderie, and relief. Visualizations -- including emotional trajectory graphs and word clouds -- highlight how Tolkien's language cycles between tension and comfort. By combining computational tools with literary interpretation, this study demonstrates how digital methods can uncover subtle emotional structures in literature, revealing the steady rhythm and emotional modulation that shape the storytelling in The Hobbit.
[30] Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity
Hauke Licht
Main category: cs.CL
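TL;DR: This paper evaluates multimodal large language models (mLLMs) for video-based analysis of emotional arousal in politics. Under ideal conditions they are reliable and show no clear bias, but they perform poorly on real-world parliamentary debates, underscoring the need for continued, rigorous evaluation of generative AI in political analysis.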
Details
Motivation: Empirical evidence on the effectiveness of multimodal AI in emotion analysis is lacking, particularly in political communication.
Method: Current multimodal large language models are evaluated on video-based recognition of emotional arousal using two complementary datasets of human-labeled video recordings.
Result: Under ideal conditions, mLLMs' arousal ratings are highly reliable and show little to no indication of demographic bias; in real-world parliamentary debate recordings they perform poorly, with potential negative consequences for downstream statistical inference.
Conclusion: The promise of generative AI for political emotion analysis is not yet realized under real-world conditions, and its suitability should be assessed continually with a rigorous, replicable framework.
Abstract: Emotions are central to politics and analyzing their role in political communication has a long tradition. As research increasingly leverages audio-visual materials to analyze the display of emotions, the emergence of multimodal generative AI promises great advances. However, we lack evidence about the effectiveness of multimodal AI in emotion analysis. This paper addresses this gap by evaluating current multimodal large language models (mLLMs) in video-based analysis of emotional arousal in two complementary data sets of human-labeled video recordings. I find that under ideal circumstances, mLLMs' emotional arousal ratings are highly reliable and show little to no indication of demographic bias. However, in recordings of speakers in real-world parliamentary debates, mLLMs' arousal ratings fail to deliver on this promise with potential negative consequences for downstream statistical inferences. This study therefore underscores the need for continued, thorough evaluation of emerging generative AI methods in political analysis and contributes a suitable replicable framework.
cs.CV [Back]
[31] Neuromorphic Eye Tracking for Low-Latency Pupil Detection
Paul Hueber,Luca Peres,Florian Pitters,Alejandro Gloriani,Oliver Rhodes
Main category: cs.CV
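TL;DR: This paper presents an efficient event-based eye-tracking model built on neuromorphic sensors and spiking neural networks (SNNs). By introducing LIF layers and depth-wise separable convolutions, it keeps accuracy at 3.7-4.1px mean error while greatly reducing model size and compute, making it suitable for low-power, low-latency wearable devices.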
Details
Motivation: Conventional frame-based eye tracking suffers from motion blur, high compute cost, and limited temporal resolution, and struggles to meet the low-latency, low-power needs of wearable systems such as AR/VR. Neuromorphic sensors and SNNs are promising, but existing approaches are either too specialized or underperform.
Method: The recurrent and attention modules of top-performing event-based eye-tracking models are replaced with lightweight LIF layers, and depth-wise separable convolutions reduce model complexity, yielding an efficient SNN architecture.
Result: The models reach 3.7-4.1px mean error, close to the 3.24px of the Retina system; model size shrinks by 20x and theoretical compute by 850x relative to the closest ANN variant; projected power is 3.9-4.9 mW with 3 ms latency at 1 kHz.
Conclusion: High-performing event-based eye-tracking models can be redesigned as SNNs with substantial efficiency gains while retaining accuracy suitable for real-time wearable deployment.
Abstract: Eye tracking for wearable systems demands low latency and milliwatt-level power, but conventional frame-based pipelines struggle with motion blur, high compute cost, and limited temporal resolution. Such capabilities are vital for enabling seamless and responsive interaction in emerging technologies like augmented reality (AR) and virtual reality (VR), where understanding user gaze is key to immersion and interface design. Neuromorphic sensors and spiking neural networks (SNNs) offer a promising alternative, yet existing SNN approaches are either too specialized or fall short of the performance of modern ANN architectures. This paper presents a neuromorphic version of top-performing event-based eye-tracking models, replacing their recurrent and attention modules with lightweight LIF layers and exploiting depth-wise separable convolutions to reduce model complexity. Our models obtain 3.7-4.1px mean error, approaching the accuracy of the application-specific neuromorphic system, Retina (3.24px), while reducing model size by 20x and theoretical compute by 850x, compared to the closest ANN variant of the proposed model. These efficient variants are projected to operate at an estimated 3.9-4.9 mW with 3 ms latency at 1 kHz. The present results indicate that high-performing event-based eye-tracking architectures can be redesigned as SNNs with substantial efficiency gains, while retaining accuracy suitable for real-time wearable deployment.
[32] ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects
Woojin Lee,Hyugjae Chang,Jaeho Moon,Jaehyup Lee,Munchurl Kim
Main category: cs.CV
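TL;DR: This paper proposes ABBSPO, a weakly supervised oriented object detection framework that uses adaptive bounding box scaling and symmetry-prior-based angle prediction to substantially improve detection accuracy when training on horizontal bounding box annotations.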
Details
Motivation: Existing oriented detection methods supervised by horizontal boxes fall short in scale estimation and angle learning, which limits their performance.
Method: Adaptive Bounding Box Scaling (ABBS) scales GT HBoxes to better match the size of each predicted RBox, and a Symmetric Prior Angle (SPA) loss exploits the symmetry of aerial objects for self-supervised learning, preventing training collapse.
Result: Experiments show that ABBSPO achieves state-of-the-art performance on multiple datasets, outperforming existing weakly supervised methods.
Conclusion: ABBSPO addresses the scale mismatch and unstable angle learning of HBox-supervised OBB detection, offering an efficient and accurate approach to weakly supervised oriented detection.
Abstract: Weakly supervised oriented object detection (WS-OOD) has gained attention as a cost-effective alternative to fully supervised methods, providing both efficiency and high accuracy. Among weakly supervised approaches, horizontal bounding box (HBox)-supervised OOD stands out for its ability to directly leverage existing HBox annotations while achieving the highest accuracy under weak supervision settings. This paper introduces adaptive bounding box scaling and symmetry-prior-based orientation prediction, called ABBSPO, a framework for WS-OOD. Our ABBSPO addresses limitations of previous HBox-supervised OOD methods, which compare ground truth (GT) HBoxes directly with the minimum circumscribed rectangles of predicted RBoxes, often leading to inaccurate scale estimation. To overcome this, we propose: (i) Adaptive Bounding Box Scaling (ABBS), which appropriately scales GT HBoxes to optimize for the size of each predicted RBox, ensuring more accurate scale prediction; and (ii) a Symmetric Prior Angle (SPA) loss that exploits inherent symmetry of aerial objects for self-supervised learning, resolving issues in previous methods where learning collapses when predictions for all three augmented views (original, rotated, and flipped) are consistently incorrect. Extensive experimental results demonstrate that ABBSPO achieves state-of-the-art performance, outperforming existing methods.
[33] Diffusion Is Your Friend in Show, Suggest and Tell
Jia Cheng Hu,Roberto Cavicchioli,Alessandro Capotondi
Main category: cs.CV
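TL;DR: Proposes Show, Suggest and Tell (SST), a new paradigm that combines diffusion models with autoregressive generation and achieves a CIDEr-D score of 125.1 on COCO, outperforming existing methods.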
Details
Motivation: 扩散模型在离散域生成任务中仍难以超越自回归模型,本文旨在探索将两者结合的新路径,以发挥各自优势。 Method: 采用扩散模型为自回归生成提供建议(suggestion),而非直接替代自回归模型,利用扩散模型的双向建模与优化能力提升生成质量。 Result: SST在COCO数据集上达到125.1 CIDEr-D,超过当前最优自回归和扩散模型分别1.5和2.5点,并通过实验验证建议模块对生成质量有正向影响。 Conclusion: 结合扩散模型作为建议机制可有效提升自回归生成性能,开辟了一个有潜力但尚未充分探索的研究方向。 Abstract: Diffusion Denoising models demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing them. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: https://github.com/jchenghu/show\_suggest\_tell.[34] MetaVoxel: Joint Diffusion Modeling of Imaging and Clinical Metadata
Yihao Liu,Chenyu Gao,Lianrui Zuo,Michael E. Kim,Brian D. Boyd,Lisa L. Barnes,Walter A. Kukull,Lori L. Beason-Held,Susan M. Resnick,Timothy J. Hohman,Warren D. Taylor,Bennett A. Landman
Main category: cs.CV
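TL;DR: MetaVoxel is a diffusion-based generative joint modeling framework that models the joint distribution of medical images and clinical metadata in a unified way, supporting flexible zero-shot inference without retraining.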
Details
Motivation: Conventional medical AI models are usually limited to modeling conditional distributions for specific inputs and a specific predictive direction, which makes it hard to support multiple tasks flexibly. A method that models the joint distribution of multimodal data in a unified way is therefore needed.
Method: Proposes MetaVoxel, which uses a joint diffusion model to model the joint distribution of medical images (such as T1-weighted MRI) and clinical metadata, covering all variables with a single diffusion process so that multiple tasks are handled in a unified way.
Result: Validated on more than 10,000 MRI scans from nine datasets, a single MetaVoxel model performs image generation, age estimation, and sex prediction with performance comparable to task-specific models, and demonstrates flexible inference.
Conclusion: Joint multimodal diffusion modeling offers a unified and flexible new paradigm for medical AI, with broader potential for clinical application.
Abstract: Modern deep learning methods have achieved impressive results across tasks ranging from disease classification and continuous biomarker estimation to realistic medical image generation. Most of these approaches are trained to model conditional distributions defined by a specific predictive direction with a specific set of input variables. We introduce MetaVoxel, a generative joint diffusion modeling framework that models the joint distribution over imaging data and clinical metadata by learning a single diffusion process spanning all variables. By capturing the joint distribution, MetaVoxel unifies tasks that traditionally require separate conditional models and supports flexible zero-shot inference using arbitrary subsets of inputs without task-specific retraining. Using more than 10,000 T1-weighted MRI scans paired with clinical metadata from nine datasets, we show that a single MetaVoxel model can perform image generation, age estimation, and sex prediction, achieving performance comparable to established task-specific baselines. Additional experiments highlight its capabilities for flexible inference. Together, these findings demonstrate that joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability.
[35] Independent Density Estimation
Jiahao Liu
Main category: cs.CV
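TL;DR: This paper proposes Independent Density Estimation (IDE), a new method aimed at improving the compositional generalization of vision-language models to unseen compositions.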
Details
Motivation: Existing large-scale vision-language models still struggle to achieve human-like compositional generalization.
Method: Proposes Independent Density Estimation (IDE), which learns the correspondence between individual words in a sentence and image features and combines per-word predictions with an entropy-based inference method. Two IDE-based models are built, using fully disentangled and partially disentangled visual representations respectively.
Result: Experiments on multiple datasets show that the proposed models generalize to unseen compositions better than current models.
Conclusion: IDE effectively improves the compositional generalization of vision-language models and offers a new approach toward more human-like language-vision understanding.
Abstract: Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Nevertheless, these models still encounter difficulties in achieving human-like compositional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connection between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy-based compositional inference method to combine predictions of each word in the sentence. Our models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.
[36] TraceFlow: Dynamic 3D Reconstruction of Specular Scenes Driven by Ray Tracing
Jiachen Tao,Junyi Wu,Haoxuan Wang,Zongxin Yang,Dawen Cai,Yan Yan
Main category: cs.CV
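TL;DR: TraceFlow is a new framework for high-fidelity rendering of dynamic specular scenes. By addressing two key challenges, precise reflection direction estimation and physically accurate reflection modeling, it produces sharper and more realistic dynamic reflections.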
Details
Motivation: 现有方法在动态场景中难以实现精确的反射方向估计和物理准确的反射建模,导致渲染结果模糊或不真实,因此需要一种能够同时处理动态几何和材质变化并支持高质量镜面反射合成的方法。 Method: 提出了一种残差材质增强的2D高斯点阵表示方法,结合动态环境高斯模型和混合渲染管线,将渲染分解为漫反射和镜面反射分量,并采用从粗到细的训练策略以提高优化稳定性。 Result: 在多个动态场景基准上的实验表明,TraceFlow在定量和定性指标上均优于先前方法,能生成更清晰、更逼真的镜面反射。 Conclusion: TraceFlow通过联合建模动态几何与材质属性,并结合光栅化与光线追踪的混合渲染策略,有效提升了动态镜面场景的渲染质量,为复杂动态环境中的高保真渲染提供了新思路。 Abstract: We present TraceFlow, a novel framework for high-fidelity rendering of dynamic specular scenes by addressing two key challenges: precise reflection direction estimation and physically accurate reflection modeling. To achieve this, we propose a Residual Material-Augmented 2D Gaussian Splatting representation that models dynamic geometry and material properties, allowing accurate reflection ray computation. Furthermore, we introduce a Dynamic Environment Gaussian and a hybrid rendering pipeline that decomposes rendering into diffuse and specular components, enabling physically grounded specular synthesis via rasterization and ray tracing. Finally, we devise a coarse-to-fine training strategy to improve optimization stability and promote physically meaningful decomposition. Extensive experiments on dynamic scene benchmarks demonstrate that TraceFlow outperforms prior methods both quantitatively and qualitatively, producing sharper and more realistic specular reflections in complex dynamic environments.[37] Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information
Neelima Prasad,Jarek Reynolds,Neel Karsanbhai,Tanusree Sharma,Lotus Zhang,Abigale Stangl,Yang Wang,Leah Findlater,Danna Gurari
Main category: cs.CV
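TL;DR: Proposes a new task, hierarchical instance tracking, which tracks instances of predefined categories of objects and their parts while preserving their hierarchical relationships. Also releases the first benchmark dataset supporting the task, with 2,765 unique entities across 552 videos, covering 40 categories.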
Details
Motivation: 现有的实例跟踪方法通常忽略对象与部分之间的层次结构关系,无法满足对复杂场景中细粒度结构理解的需求。因此需要一种能够同时跟踪对象及其组成部分并维持其层级关联的新任务。 Method: 提出了分层实例跟踪任务,并构建了首个基准数据集,包含552个视频、2,765个唯一实体和40个对象/部分类别;设计并评估了四种模型的七种变体以适应该新任务。 Result: 实验评估了七种针对该任务定制的模型变体,结果表明该数据集具有挑战性,为后续研究提供了基准。 Conclusion: 分层实例跟踪是一个有意义的新方向,所提出的数据集为探索对象与部分间的层次化视觉理解提供了基础支持。 Abstract: We propose a novel task, hierarchical instance tracking, which entails tracking all instances of predefined categories of objects and parts, while maintaining their hierarchical relationships. We introduce the first benchmark dataset supporting this task, consisting of 2,765 unique entities that are tracked in 552 videos and belong to 40 categories (across objects and parts). Evaluation of seven variants of four models tailored to our novel task reveals the new dataset is challenging. Our dataset is available at https://vizwiz.org/tasks-and-datasets/hierarchical-instance-tracking/[38] Topological Conditioning for Mammography Models via a Stable Wavelet-Persistence Vectorization
Charles Fanning,Mehmet Emin Aktas
Main category: cs.CV
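TL;DR: Proposes a topological data analysis method based on wavelet persistent homology that generates stable multi-scale spatial maps to improve the external generalization of cancer detection models on mammograms, significantly raising patient-level AUC in cross-dataset testing.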
Details
Motivation: In breast cancer screening, existing models perform inconsistently across scanners, modalities, and populations and suffer from false negatives and false positives, so better external generalization is urgently needed.
Method: Uses a wavelet-based vectorization of persistent homology: topological data analysis extracts image structure that persists across intensity thresholds and converts it into stable spatial multi-scale maps, which are integrated into two-stage detection pipelines such as ConvNeXt Tiny through input-level channel concatenation.
Result: Trained on CBIS-DDSM and evaluated separately on INbreast (Portugal) and CMMD (China). On INbreast, patient-level AUC rises from 0.55 to 0.75, with the effect especially pronounced under a limited training budget.
Conclusion: By introducing topology-aware multi-scale features, the method effectively improves the robustness and generalization of deep learning models for cross-institution, cross-device breast cancer screening and shows potential for clinical use.
Abstract: Breast cancer is the most commonly diagnosed cancer in women and a leading cause of cancer death worldwide. Screening mammography reduces mortality, yet interpretation still suffers from substantial false negatives and false positives, and model accuracy often degrades when deployed across scanners, modalities, and patient populations. We propose a simple conditioning signal aimed at improving external performance based on a wavelet-based vectorization of persistent homology. Using topological data analysis, we summarize image structure that persists across intensity thresholds and convert this information into spatial, multi-scale maps that are provably stable to small intensity perturbations. These maps are integrated into a two-stage detection pipeline through input-level channel concatenation. The model is trained and validated on the CBIS-DDSM digitized film mammography cohort from the United States and evaluated on two independent full-field digital mammography cohorts from Portugal (INbreast) and China (CMMD), with performance reported at the patient level. On INbreast, augmenting ConvNeXt Tiny with wavelet-persistence channels increases patient-level AUC from 0.55 to 0.75 under a limited training budget.
[39] Feature Coding for Scalable Machine Vision
Md Eimran Hossain Eimon,Juan Merlos,Ashan Perera,Hari Kalva,Velibor Adzic,Borko Furht
Main category: cs.CV
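TL;DR: This paper presents the Feature Coding Test Model (FCTM) for compressing intermediate features of deep neural networks, reducing bitrate by 85.14% on average while preserving accuracy and supporting efficient deployment of collaborative edge-cloud inference.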
Details
Motivation: Deploying deep neural networks on edge devices faces compute, bandwidth, and privacy challenges, and traditional approaches struggle to balance performance against resource use. An efficient intermediate-feature compression scheme is needed to support collaborative edge-cloud inference.
Method: Based on MPEG's Feature Coding for Machines (FCM) standard, designs and implements the Feature Coding Test Model (FCTM), which uses a codec pipeline optimized for intermediate features, and evaluates its compression performance on several vision tasks.
Result: FCTM achieves an average bitrate reduction of 85.14% across multiple vision tasks while preserving inference accuracy, confirming its effectiveness in bandwidth-limited and privacy-sensitive scenarios.
Conclusion: The FCM standard and its FCTM provide a scalable path to efficient, interoperable deployment of intelligent features and help broaden the use of DNNs in edge computing.
Abstract: Deep neural networks (DNNs) drive modern machine vision but are challenging to deploy on edge devices due to high compute demands. Traditional approaches, running the full model on-device or offloading to the cloud, face trade-offs in latency, bandwidth, and privacy. Splitting the inference workload between the edge and the cloud offers a balanced solution, but transmitting intermediate features to enable such splitting introduces new bandwidth challenges. To address this, the Moving Picture Experts Group (MPEG) initiated the Feature Coding for Machines (FCM) standard, establishing a bitstream syntax and codec pipeline tailored for compressing intermediate features. This paper presents the design and performance of the Feature Coding Test Model (FCTM), showing significant bitrate reductions, averaging 85.14%, across multiple vision tasks while preserving accuracy. FCM offers a scalable path for efficient and interoperable deployment of intelligent features in bandwidth-limited and privacy-sensitive consumer applications.
[40] Latent Chain-of-Thought World Modeling for End-to-End Driving
Shuhan Tan,Kashyap Chitta,Yuxiao Chen,Ran Tian,Yurong You,Yan Wang,Wenjie Luo,Yulong Cao,Philipp Krahenbuhl,Marco Pavone,Boris Ivanovic
Main category: cs.CV
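TL;DR: This paper proposes Latent-CoT-Drive (LCDrive), a new vision-language-action model that reasons in a latent language within an action-aligned latent space rather than with natural-language chain-of-thought (CoT), improving autonomous driving performance and safety.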
Details
Motivation: 现有VLA模型多采用自然语言进行推理,但文本可能并非最高效的推理表示方式;因此本文探索更紧凑、动作对齐的潜在空间中的推理机制。 Method: LCDrive将推理与决策统一在潜在空间中:使用动作提议令牌(与输出动作共享词汇)和基于学习到的潜在世界模型的未来状态令牌交替表示推理过程,并通过真实未来轨迹监督冷启动,再结合闭环强化学习进行后训练。 Result: 在大规模端到端驾驶基准上,LCDrive相比非推理和文本推理基线实现了更快的推理速度、更好的轨迹质量,以及更强的交互式强化学习提升效果。 Conclusion: 使用潜在语言替代自然语言进行链式思维推理,在自动驾驶任务中更高效且有效,验证了隐式、动作对齐的推理表征的潜力。 Abstract: Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.[41] Emerging Standards for Machine-to-Machine Video Coding
Md Eimran Hossain Eimon,Velibor Adzic,Hari Kalva,Borko Furht
Main category: cs.CV
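TL;DR: This paper examines emerging coding standards for machine-to-machine visual communication, covering Video Coding for Machines (VCM) and Feature Coding for Machines (FCM), which aim to reduce bandwidth, protect privacy, and support compute offload. Experiments show that FCM significantly reduces bitrate while keeping accuracy close to edge inference. H.265 and H.266 perform similarly on most tasks, while H.264 performs worse; for the tracking task, however, existing H.264 hardware can still effectively support machine communication.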
Details
Motivation: 传统视觉系统依赖为人类感知优化的视频编解码器传输像素数据,导致带宽高、扩展性差且存在隐私泄露风险。需要为机器消费设计更高效、安全的编码方案。 Method: 采用MPEG提出的面向机器的视频编码框架,包括VCM和FCM两种方法,其中FCM压缩神经网络中间特征,并结合H.26X系列编解码器(如H.264、H.265、H.266)进行比特率与任务性能评估。 Result: FCM能在保持接近边缘推理精度的同时大幅降低比特率;H.265与H.266在多数任务中性能接近(BD-Rate仅差1.39%),而H.264平均增加32.28% BD-Rate;但在跟踪任务中,HEVC(H.265)略优于VVC(H.266),且AVC(H.264)影响较小(+8.79%)。 Conclusion: FCM是一种高效的机器对机器视觉通信方案,兼顾精度、带宽与隐私;现有H.26X硬件(尤其是H.265/HEVC)已能良好支持机器任务,无需强制升级至H.266,尤其在特定任务如跟踪中表现稳健。 Abstract: Machines are increasingly becoming the primary consumers of visual data, yet most deployments of machine-to-machine systems still rely on remote inference where pixel-based video is streamed using codecs optimized for human perception. Consequently, this paradigm is bandwidth intensive, scales poorly, and exposes raw images to third parties. Recent efforts in the Moving Picture Experts Group (MPEG) redesigned the pipeline for machine-to-machine communication: Video Coding for Machines (VCM) is designed to apply task-aware coding tools in the pixel domain, and Feature Coding for Machines (FCM) is designed to compress intermediate neural features to reduce bitrate, preserve privacy, and support compute offload. Experiments show that FCM is capable of maintaining accuracy close to edge inference while significantly reducing bitrate. Additional analysis of H.26X codecs used as inner codecs in FCM reveals that H.265/High Efficiency Video Coding (HEVC) and H.266/Versatile Video Coding (VVC) achieve almost identical machine task performance, with an average BD-Rate increase of 1.39% when VVC is replaced with HEVC. In contrast, H.264/Advanced Video Coding (AVC) yields an average BD-Rate increase of 32.28% compared to VVC. 
However, for the tracking task, the impact of codec choice is minimal, with HEVC outperforming VVC and achieving BD Rate of -1.81% and 8.79% for AVC, indicating that existing hardware for already deployed codecs can support machine-to-machine communication without degrading performance.[42] Multi-dimensional Preference Alignment by Conditioning Reward Itself
Jiho Jang,Jinyoung Kim,Kyungjune Baek,Nojun Kwak
Main category: cs.CV
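TL;DR: Proposes Multi Reward Conditional DPO (MCDPO), a new DPO method conditioned on multiple rewards, to resolve the reward conflicts that scalar reward aggregation causes when standard DPO is used to align diffusion models. It optimizes each dimension independently and supports dynamic control at inference time.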
Details
Motivation: 标准DPO依赖Bradley-Terry模型将多个评价维度(如美学质量和语义对齐)聚合为单一标量奖励,导致不同维度间的奖励冲突,使模型可能遗忘某些优良特征。 Method: 提出Multi Reward Conditional DPO(MCDPO),引入解耦的Bradley-Terry目标,通过将偏好结果向量作为条件输入,使模型能在单一网络中独立学习每个奖励维度的优化方向,并采用维度奖励dropout确保各维度优化均衡。 Result: 在Stable Diffusion 1.5和SDXL上的实验表明,MCDPO在多个基准上表现优于现有方法,并支持通过无分类器引导在推理时动态增强特定奖励维度。 Conclusion: MCDPO有效缓解了多维奖励冲突问题,提升了扩散模型在人类反馈下的对齐性能,同时提供了灵活的多轴控制能力。 Abstract: Reinforcement Learning from Human Feedback has emerged as a standard for aligning diffusion models. However, we identify a fundamental limitation in the standard DPO formulation because it relies on the Bradley-Terry model to aggregate diverse evaluation axes like aesthetic quality and semantic alignment into a single scalar reward. This aggregation creates a reward conflict where the model is forced to unlearn desirable features of a specific dimension if they appear in a globally non-preferred sample. To address this issue, we propose Multi Reward Conditional DPO (MCDPO). This method resolves reward conflicts by introducing a disentangled Bradley-Terry objective. MCDPO explicitly injects a preference outcome vector as a condition during training, which allows the model to learn the correct optimization direction for each reward axis independently within a single network. We further introduce dimensional reward dropout to ensure balanced optimization across dimensions. Extensive experiments on Stable Diffusion 1.5 and SDXL demonstrate that MCDPO achieves superior performance on benchmarks. Notably, our conditional framework enables dynamic and multiple-axis control at inference time using Classifier Free Guidance to amplify specific reward dimensions without additional training or external reward models.[43] Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective
Tian Liu,Anwesha Basu,James Caverlee,Shu Kong
Main category: cs.CV
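TL;DR: This paper proposes SWIFT, which uses temperature tuning and classifier initialization to improve how vision-language models exploit unlabeled data in semi-supervised few-shot learning, significantly improving auto-annotation performance.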
Details
Motivation: 现有的半监督少样本学习研究忽视了开源视觉-语言模型及其预训练数据的潜力,而这些资源在实际应用场景如自动标注中应被充分利用。 Method: 提出阶段式微调与温度调整(SWIFT)方法,结合分类器初始化和温度缩放,提高伪标签置信度,从而增强未标记数据的利用和监督信号强度。 Result: 在五个SSFSL基准上,SWIFT比现有FSL和SSL方法高出约5个准确率点,并接近使用真实标签微调的监督学习性能。 Conclusion: SWIFT有效解决了VLM在SSFSL中因softmax分布平坦导致的弱监督问题,为利用开源资源实现高效自动标注提供了可行方案。 Abstract: Semi-supervised few-shot learning (SSFSL) formulates real-world applications like ''auto-annotation'', as it aims to learn a model over a few labeled and abundant unlabeled examples to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. In contrast, the related area few-shot learning (FSL) has already exploited them to boost performance. Arguably, to achieve auto-annotation in the real world, SSFSL should leverage such open-source resources. To this end, we start by applying established SSL methods to finetune a VLM. Counterintuitively, they significantly underperform FSL baselines. Our in-depth analysis reveals the root cause: VLMs produce rather ''flat'' distributions of softmax probabilities. This results in zero utilization of unlabeled data and weak supervision signals. We address this issue with embarrassingly simple techniques: classifier initialization and temperature tuning. They jointly increase the confidence scores of pseudo-labels, improving the utilization rate of unlabeled data, and strengthening supervision signals. Building on this, we propose: Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to effectively finetune a VLM on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Extensive experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by $\sim$5 accuracy points. 
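The flat-softmax problem and the effect of temperature tuning described in this abstract can be illustrated with a minimal sketch. It is not the SWIFT implementation: the function names, the example scores, the temperature values, and the 0.9 confidence threshold are all assumptions chosen for illustration. The sketch shows why a confidence-thresholded pseudo-labeling rule discards every unlabeled example when probabilities are flat, and how a lower temperature lets the same example pass the threshold.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; lower temperature gives a sharper distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(logits, temperature, threshold):
    """Return (label, confidence); label is None when confidence is below threshold."""
    probs = softmax(logits, temperature)
    confidence = max(probs)
    label = probs.index(confidence)
    return (label, confidence) if confidence >= threshold else (None, confidence)


# Hypothetical similarity-like scores for three classes: close together,
# as one might see from cosine similarities between image and text embeddings.
scores = [0.30, 0.25, 0.22]

flat = pseudo_label(scores, temperature=1.0, threshold=0.9)    # rejected: no pseudo-label
sharp = pseudo_label(scores, temperature=0.01, threshold=0.9)  # accepted as class 0
```

With temperature 1.0 the three probabilities are nearly equal, so the example is never used; with temperature 0.01 the same ranking yields a confident pseudo-label. The ranking of classes is unchanged by the temperature, only the confidence is.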
SWIFT even rivals supervised learning, which finetunes VLMs with the unlabeled data being labeled with ground truth![44] RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection
Zhuo Wang,Xiliang Liu,Ligang Sun
Main category: cs.CV
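TL;DR: RobustSora is a benchmark for evaluating robustness to digital watermarks in AI-generated video detection. Using a dataset of authentic and generated videos, it reveals that existing detection models partially depend on watermarks.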
Details
Motivation: 现有AI生成视频检测基准忽略了生成模型嵌入的数字水印可能被检测器利用的问题,影响检测结果的可靠性,因此需要评估水印对检测性能的影响。 Method: 构建包含6500个视频的四类数据集(A-C、A-S、G-W、G-DeW),设计两项评估任务:Task-I测试去水印AI视频的检测性能,Task-II评估在伪造水印真实视频上的误报率,并在十种检测模型上进行实验。 Result: 实验显示不同模型在水印操作下性能差异为2-8个百分点,基于Transformer的模型表现出中等依赖性(6-8pp),MLLMs则呈现多样化模式(2-8pp),表明检测器存在部分水印依赖。 Conclusion: 当前AIGC视频检测模型部分依赖水印信号,需发展水印感知的训练策略;RobustSora为推动更鲁棒的检测研究提供了重要工具。 Abstract: The proliferation of AI-generated video technologies poses challenges to information integrity. While recent benchmarks advance AIGC video detection, they overlook a critical factor: many state-of-the-art generative models embed digital watermarks in outputs, and detectors may partially rely on these patterns. To evaluate this influence, we present RobustSora, the benchmark designed to assess watermark robustness in AIGC video detection. We systematically construct a dataset of 6,500 videos comprising four types: Authentic-Clean (A-C), Authentic-Spoofed with fake watermarks (A-S), Generated-Watermarked (G-W), and Generated-DeWatermarked (G-DeW). Our benchmark introduces two evaluation tasks: Task-I tests performance on watermark-removed AI videos, while Task-II assesses false alarm rates on authentic videos with fake watermarks. Experiments with ten models spanning specialized AIGC detectors, transformer architectures, and MLLM approaches reveal performance variations of 2-8pp under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp). These findings indicate partial watermark dependency and highlight the need for watermark-aware training strategies. RobustSora provides essential tools to advance robust AIGC detection research.[45] THE-Pose: Topological Prior with Hybrid Graph Fusion for Estimating Category-Level 6D Object Pose
Eunho Lee,Chaehyeon Song,Seunghoon Jeong,Ayoung Kim
Main category: cs.CV
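TL;DR: THE-Pose is a new category-level 6D pose estimation framework that introduces a topological prior and hybrid graph fusion to combine 2D image context with 3D geometric structure, significantly improving performance in complex and occluded scenes.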
Details
Motivation: Existing 3D graph convolution methods attend only to local geometry and depth information, struggle with large intra-class variation, visual ambiguity, and complex objects, and make no use of global context or topological structure.
Method: Proposes THE-Pose, which extracts invariant and consistent topological features from the image domain and adaptively combines them with point-cloud features through a Hybrid Graph Fusion (HGF) module, seamlessly fusing 2D and 3D information.
Result: Experiments on REAL275 show a 35.8% improvement over the 3D-GC baseline HS-Pose and a 7.2% gain over the previous best method, with more stable behavior on occluded and complex objects.
Conclusion: By introducing a topological prior and a hybrid graph fusion mechanism, THE-Pose effectively integrates global context with local structure and significantly improves the robustness and accuracy of category-level 6D pose estimation.
Abstract: Category-level object pose estimation requires both global context and local structure to ensure robustness against intra-class variations. However, 3D graph convolution (3D-GC) methods only focus on local geometry and depth information, making them vulnerable to complex objects and visual ambiguities. To address this, we present THE-Pose, a novel category-level 6D pose estimation framework that leverages a topological prior via surface embedding and hybrid graph fusion. Specifically, we extract consistent and invariant topological features from the image domain, effectively overcoming the limitations inherent in existing 3D-GC based methods. Our Hybrid Graph Fusion (HGF) module adaptively integrates the topological features with point-cloud features, seamlessly bridging 2D image context and 3D geometric structure. These fused features ensure stability for unseen or complicated objects, even under significant occlusions. Extensive experiments on the REAL275 dataset show that THE-Pose achieves a 35.8% improvement over the 3D-GC baseline (HS-Pose) and surpasses the previous state-of-the-art by 7.2% across all key metrics. The code is available at https://github.com/EHxxx/THE-Pose
[46] GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule
Rui Wang,Yimu Sun,Jingxing Guo,Huisi Wu,Jing Qin
Main category: cs.CV
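TL;DR: This paper proposes GDKVM, a new architecture for echocardiography video segmentation that introduces Linear Key-Value Association (LKVA), the Gated Delta Rule (GDR), and Key-Pixel Feature Fusion (KPFF) modules to improve segmentation accuracy and robustness while maintaining real-time performance.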
Details
Motivation: Noise and artifacts in ultrasound images, together with the deformation and motion of the heart, make it hard for segmentation algorithms to balance modeling long-range spatiotemporal dependencies against computational efficiency. Existing methods remain limited in fine-grained feature representation and efficient inference.
Method: Proposes GDKVM, which uses LKVA to model inter-frame correlations, GDR to store intermediate memory states efficiently, and KPFF to fuse local and global features at multiple scales, improving robustness to blurred boundaries and noise.
Result: On two mainstream datasets, CAMUS and EchoNet-Dynamic, GDKVM outperforms existing state-of-the-art methods in segmentation accuracy and robustness while running in real time.
Conclusion: GDKVM effectively addresses the trade-off among accuracy, robustness, and efficiency in echocardiography video segmentation and shows good potential for clinical use.
Abstract: Accurate segmentation of cardiac chambers in echocardiography sequences is crucial for the quantitative analysis of cardiac function, aiding in clinical diagnosis and treatment. The imaging noise, artifacts, and the deformation and motion of the heart pose challenges to segmentation algorithms. While existing methods based on convolutional neural networks, Transformers, and space-time memory networks have improved segmentation accuracy, they often struggle with the trade-off between capturing long-range spatiotemporal dependencies and maintaining computational efficiency with fine-grained feature representation. In this paper, we introduce GDKVM, a novel architecture for echocardiography video segmentation. The model employs Linear Key-Value Association (LKVA) to effectively model inter-frame correlations, and introduces a Gated Delta Rule (GDR) to efficiently store intermediate memory states. A Key-Pixel Feature Fusion (KPFF) module is designed to integrate local and global features at multiple scales, enhancing robustness against boundary blurring and noise interference. We validated GDKVM on two mainstream echocardiography video datasets (CAMUS and EchoNet-Dynamic) and compared it with various state-of-the-art methods. Experimental results show that GDKVM outperforms existing approaches in terms of segmentation accuracy and robustness, while ensuring real-time performance. Code is available at https://github.com/wangrui2025/GDKVM.
[47] VLM-NCD:Novel Class Discovery with Vision-Based Large Language Models
Yuetong Su,Baoguo Wei,Xinyu Wang,Xu Li,Lixin Li
Main category: cs.CV
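TL;DR: Proposes LLM-NCD, a multimodal framework that fuses visual-textual semantics with prototype-guided clustering to significantly improve novel class discovery (NCD) on unknown classes, with up to 25.3% improvement over existing methods on CIFAR-100 and distinctive robustness to long-tailed data.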
Details
Motivation: Existing NCD methods based on visual features suffer from insufficient feature discriminability and long-tailed data distributions, which limits their ability to discover unknown classes.
Method: Proposes LLM-NCD, which jointly optimizes image and text features of known classes, models cluster centres and semantic prototypes, and uses a dual-phase discovery mechanism that dynamically separates known from unknown samples via semantic affinity thresholds and adaptive clustering.
Result: On CIFAR-100, accuracy on unknown classes improves by up to 25.3% over existing methods, with unprecedented robustness in long-tailed settings.
Conclusion: By introducing text semantics and prototype-guided multimodal clustering, LLM-NCD overcomes the bottleneck of traditional NCD methods and provides a stronger, more robust solution for novel class discovery.
Abstract: Novel Class Discovery aims to utilise prior knowledge of known classes to classify and discover unknown classes from unlabelled data. Existing NCD methods for images primarily rely on visual features, which suffer from limitations such as insufficient feature discriminability and the long-tail distribution of data. We propose LLM-NCD, a multimodal framework that breaks this bottleneck by fusing visual-textual semantics and prototype-guided clustering. Our key innovation lies in modelling cluster centres and semantic prototypes of known classes by jointly optimising known class image and text features, and a dual-phase discovery mechanism that dynamically separates known or novel samples via semantic affinity thresholds and adaptive clustering. Experiments on the CIFAR-100 dataset show that compared to the current methods, this method achieves up to 25.3% improvement in accuracy for unknown classes. Notably, our method shows unique resilience to long-tail distributions, a first in NCD literature.
[48] Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction
Chen Ziwen,Hao Tan,Peng Wang,Zexiang Xu,Li Fuxin
Main category: cs.CV
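TL;DR: Long-LRM++ combines a semi-explicit scene representation with a lightweight decoder to achieve real-time novel view synthesis from a single forward pass while maintaining high rendering quality. It supports up to 64 input views and outperforms existing methods on multiple datasets.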
Details
Motivation: 现有高斯点阵(GS)方法在大规模输入下存在误差敏感和细节模糊问题,而隐式表示虽能提升渲染质量但计算成本高、难以实现实时渲染。本文旨在探索是否可以避免深度序列“解压缩”过程,在保留隐式表示优势的同时实现高效实时渲染。 Method: 提出 Long-LRM++,采用半显式场景表示,将场景信息以紧凑形式编码,并设计轻量级解码器进行快速渲染;模型支持32至64个高分辨率输入视图的端到端训练与推理,实现在A100 GPU上14 FPS的实时渲染速度。 Result: 在DL3DV数据集上达到与LaCT相当的渲染质量,同时实现14 FPS的实时渲染;在ScanNetv2上的新视角深度预测优于基于高斯的方法;支持64输入视图下的稳定训练和强泛化能力。 Conclusion: Long-LRM++ 成功结合了显式与隐式表示的优点,通过半显式表示和轻量级解码器实现了高质量、实时的新视角合成,推动了大规模输入下通用场景重建的发展。 Abstract: Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. 
Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.[49] Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation
Hongsin Lee,Hye Won Chung
Main category: cs.CV
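TL;DR: Proposes Sample-wise Adaptive Adversarial Distillation (SAAD), which accounts for adversarial transferability to improve the robustness of small student models and outperforms existing methods on multiple datasets.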
Details
Motivation: Existing adversarial distillation methods do not always yield more robust students when stronger teachers are used (the "robust saturation" phenomenon), and conventional explanations such as the capacity gap are insufficient.
Method: Proposes Sample-wise Adaptive Adversarial Distillation (SAAD), which reweights samples according to each sample's adversarial transferability (how effective student-crafted adversarial examples are against the teacher), at no additional computational cost.
Result: Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD consistently outperforms prior methods in robustness under AutoAttack.
Conclusion: Adversarial transferability is a key factor in transferring robustness during adversarial distillation, and SAAD effectively improves student robustness through adaptive reweighting.
Abstract: Adversarial distillation in the standard min-max adversarial training framework aims to transfer adversarial robustness from a large, robust teacher network to a compact student. However, existing work often neglects to incorporate state-of-the-art robust teachers. Through extensive analysis, we find that stronger teachers do not necessarily yield more robust students, a phenomenon known as robust saturation. While typically attributed to capacity gaps, we show that such explanations are incomplete. Instead, we identify adversarial transferability (the fraction of student-crafted adversarial examples that remain effective against the teacher) as a key factor in successful robustness transfer. Based on this insight, we propose Sample-wise Adaptive Adversarial Distillation (SAAD), which reweights training examples by their measured transferability without incurring additional computational cost. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD consistently improves AutoAttack robustness over prior methods. Our code is available at https://github.com/HongsinLee/saad.
[50] MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
Yixin Wan,Lei Ke,Wenhao Yu,Kai-Wei Chang,Dong Yu
Main category: cs.CV
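TL;DR: This paper introduces MotionEdit, a high-quality image dataset focused on motion editing, along with the MotionEdit-Bench evaluation benchmark, and proposes the MotionNFT training framework to improve models' motion fidelity.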
Details
Motivation: 现有图像编辑数据集主要关注静态外观修改,缺乏高质量、连续且真实的运动变化数据,难以支持对主体动作和交互进行精确编辑的研究与应用。 Method: 构建了MotionEdit数据集,从连续视频中提取并验证高保真运动变换图像对;设计MotionEdit-Bench基准,采用生成、判别和偏好多种指标评估模型;提出MotionNFT框架,通过计算运动对齐奖励来指导扩散模型微调,实现更准确的动作编辑。 Result: 实验表明当前最先进的扩散模型在运动编辑任务上表现仍较差;MotionNFT在FLUX.1 Kontext和Qwen-Image-Edit上显著提升了运动保真度和编辑质量,同时保持通用编辑能力。 Conclusion: MotionEdit为运动为中心的图像编辑提供了新的研究基础,MotionNFT有效解决了现有模型在运动对齐方面的不足,推动了该任务的发展。 Abstract: We introduce MotionEdit, a novel dataset for motion-centric image editing-the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. 
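The idea of a motion alignment reward, scoring how well the motion flow of an edit matches the ground-truth motion, can be sketched as follows. This is only an illustration under stated assumptions, not MotionNFT's actual reward: the function name, the representation of a flow field as a flat list of per-pixel `(dx, dy)` displacements, the use of mean end-point error, and the exponential mapping with a `scale` parameter are all hypothetical choices.

```python
import math


def motion_alignment_reward(edited_flow, target_flow, scale=1.0):
    """Map the mean end-point error between two flow fields to a reward in (0, 1].

    Each flow is a list of (dx, dy) displacements, one per pixel, in the same order.
    A reward of 1.0 means the edited motion matches the target motion exactly.
    """
    if not target_flow or len(edited_flow) != len(target_flow):
        raise ValueError("flows must be non-empty and have the same length")
    total_error = 0.0
    for (dx_e, dy_e), (dx_t, dy_t) in zip(edited_flow, target_flow):
        total_error += math.hypot(dx_e - dx_t, dy_e - dy_t)
    mean_error = total_error / len(target_flow)
    return math.exp(-mean_error / scale)


# Hypothetical two-pixel flows: the target moves one pixel right and one pixel down.
target = [(1.0, 0.0), (0.0, 2.0)]
perfect_edit = [(1.0, 0.0), (0.0, 2.0)]  # reproduces the target motion
static_edit = [(0.0, 0.0), (0.0, 0.0)]   # edit that left the subject unmoved

reward_perfect = motion_alignment_reward(perfect_edit, target)
reward_static = motion_alignment_reward(static_edit, target)
```

Under these assumptions an edit that reproduces the target motion scores 1.0, while an edit that leaves the subject static is penalized in proportion to how much motion it failed to produce. In practice the flows would come from an optical flow estimator applied to the input image paired with the edited image and with the ground-truth image.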
Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.
[51] ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions
Xiaoxue Wu,Xinyuan Chen,Yaohui Wang,Yu Qiao
Main category: cs.CV
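TL;DR: This paper proposes ShotDirector, a framework that achieves film-like, controllable shot transitions through parameter-level camera control and hierarchical editing-pattern-aware prompting.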
Details
Motivation: Existing methods focus mainly on low-level visual consistency across shots and neglect how transition design and cinematographic language shape narrative expression. Method: Introduces a camera control module with 6-DoF poses and intrinsic settings, and uses a shot-aware mask mechanism to inject hierarchical prompts that encode professional editing patterns. Result: Experiments on the newly constructed ShotWeaver40K dataset show that the method generates multi-shot videos with film-like editing patterns. Conclusion: ShotDirector enables controllable multi-shot video generation that moves from low-level visual coherence to high-level narrative expression. Abstract: Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.
[52] Physically Aware 360° View Generation from a Single Image using Disentangled Scene Embeddings
Karthikeya KV,Narendra Bandaru
Main category: cs.CV
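TL;DR: Disentangled360 is a 3D-aware technique that combines direction-disentangled volume rendering with single-image 360° view synthesis for medical imaging and natural scene reconstruction; it reports superior SSIM and LPIPS scores and supports interactive applications.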
Details
Motivation: Existing methods either oversimplify anisotropic light behavior or fail to generalize across scenes, which limits realism and applicability. Method: Proposes a dual-branch conditioning framework that separates isotropic and anisotropic contributions inside a Gaussian Splatting backbone; one branch targets CT-intensity-driven scattering in volumetric data and the other targets real-world RGB scenes. A hybrid pose-agnostic anchoring method resolves scale ambiguity and preserves structural realism. Result: Reports better SSIM and LPIPS on the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets, and runtime evaluation indicates suitability for interactive applications. Conclusion: Disentangled360 delivers high-quality 360° view synthesis without scene-specific fine-tuning or expensive photon simulation, with applications in mixed-reality medical supervision, robotic perception, and immersive content creation. Abstract: We introduce Disentangled360, an innovative 3D-aware technology that integrates the advantages of direction-disentangled volume rendering with single-image 360° unique view synthesis for applications in medical imaging and natural scene reconstruction. In contrast to current techniques that either oversimplify anisotropic light behavior or lack generalizability across various contexts, our framework distinctly differentiates between isotropic and anisotropic contributions inside a Gaussian Splatting backbone. We implement a dual-branch conditioning framework, one optimized for CT-intensity-driven scattering in volumetric data and the other for real-world RGB scenes through normalized camera embeddings. To address scale ambiguity and maintain structural realism, we present a hybrid pose-agnostic anchoring method that adaptively samples scene depth and material transitions, functioning as stable pivots during scene distillation. Our design integrates preoperative radiography simulation and consumer-grade 360° rendering into a singular inference pipeline, facilitating rapid, photorealistic view synthesis with inherent directionality. Evaluations on the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets indicate superior SSIM and LPIPS performance, while runtime assessments confirm its viability for interactive applications. Disentangled360 facilitates mixed-reality medical supervision, robotic perception, and immersive content creation, eliminating the necessity for scene-specific finetuning or expensive photon simulations.
[53] Efficient-VLN: A Training-Efficient Vision-Language Navigation Model
Duo Zheng,Shijia Huang,Yanyang Li,Liwei Wang
Main category: cs.CV
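TL;DR: Proposes Efficient-VLN, a training-efficient vision-language navigation model that uses a progressive memory, a learnable recursive memory, and a dynamic mixed policy to cut training overhead substantially while reaching state-of-the-art performance on multiple benchmarks.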
Details
Motivation: Multimodal large language models for vision-language navigation incur excessive training overhead, mainly from the quadratic cost of processing long-horizon observation histories and the exploration-efficiency trade-off in DAgger. Method: Proposes two efficient memory mechanisms, a progressive memory and a learnable recursive memory, and introduces a dynamic mixed policy to balance exploration against efficiency. Result: Achieves success rates of 64.2% on R2R-CE and 67.0% on RxR-CE while consuming only 282 H800 GPU hours, substantially less than existing methods. Conclusion: Efficient-VLN alleviates the training overhead of multimodal large models in vision-language navigation while maintaining or improving performance, offering a more efficient option for practical use. Abstract: Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that utilizes the key-value cache of learnable tokens as the memory state. Moreover, we introduce a dynamic mixed policy to balance the exploration-efficiency trade-off. Extensive experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR). Critically, our model consumes merely 282 H800 GPU hours, demonstrating a dramatic reduction in training overhead compared to state-of-the-art methods.
[54] DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation
Anh M. Vu,Khang P. Le,Trang T. K. Vo,Ha Thach,Huy Hung Nguyen,David Yang,Han H. Huynh,Quynh Nguyen,Tuan M. Pham,Tuan-Anh Le,Minh H. N. Le,Thanh-Huy Nguyen,Akash Awasthi,Chandra Mohan,Zhu Han,Hien Van Nguyen
Main category: cs.CV
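TL;DR: Proposes a prototype-driven weakly supervised semantic segmentation framework that combines vision-language alignment with a multi-scale pyramid module to improve region discovery and localization accuracy in histopathology images.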
Details
Motivation: To reduce annotation cost, weakly supervised semantic segmentation (WSSS) relies on image-level labels, but it is limited by inter-class homogeneity, intra-class heterogeneity, and CAM-induced region shrinkage. This work aims to improve region discovery under weak supervision through vision-language alignment. Method: Uses CoOp-style learnable prompt tuning to generate text prototypes and combines them with learnable image prototypes into a dual-modal prototype bank; a multi-scale pyramid module mitigates oversmoothing in ViT representations and improves spatial precision. Result: Surpasses existing state-of-the-art methods on the BCSS-WSSS benchmark; ablations confirm the benefits of text description diversity, context length, and the complementarity of text and image prototypes. Conclusion: Jointly leveraging textual semantics and visual prototype learning improves WSSS performance in digital pathology. Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.
[55] ConStruct: Structural Distillation of Foundation Models for Prototype-Based Weakly Supervised Histopathology Segmentation
Khang Le,Ha Thach,Anh M. Vu,Trang T. K. Vo,Han H. Huynh,David Yang,Minh H. N. Le,Thanh-Huy Nguyen,Akash Awasthi,Chandra Mohan,Zhu Han,Hien Van Nguyen
Main category: cs.CV
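TL;DR: Proposes a prototype learning framework that combines the strengths of CONCH and SegFormer for weakly supervised semantic segmentation of histopathology images without pixel-level annotations, producing high-quality pseudo-masks through text-guided initialization and structural distillation.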
Details
Motivation: Existing WSSS methods for histopathology images often localize only the most discriminative regions, fail to capture the full spatial extent of tissue structures, and do not preserve fine-grained morphology. Method: Designs a prototype learning framework that combines morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment; text-guided prototype initialization produces more complete pseudo-masks, and structural distillation transfers SegFormer's spatial knowledge into prototype learning. Result: Outperforms existing WSSS methods on the BCSS-WSSS dataset, producing high-quality pseudo-masks with more complete localization and stronger semantic consistency while remaining computationally efficient. Conclusion: The method combines the strengths of vision-language models and modern segmentation backbones, balancing semantic discriminability and spatial coherence under weak supervision and advancing pathology image analysis without dense annotation. Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision and without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo-masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types.
Experiments on BCSS-WSSS datasets demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.
[56] Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset
Hyunsoo Lee,Daeum Jeon,Hyeokjae Oh
Main category: cs.CV
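TL;DR: Proposes Point2Pose, a generative method for 3D human pose estimation conditioned on point clouds and pose history, and releases MVPose3D, a large-scale multimodal dataset.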
Details
Motivation: Addresses the challenges that 3D human pose estimation faces from complex body geometry, self-occluding joints, and the lack of large-scale real-world motion datasets. Method: Designs the Point2Pose framework, with a spatio-temporal point cloud encoder and a pose feature encoder that extract joint-wise features, followed by an attention-based generative regressor for pose prediction. Result: Outperforms baseline models across multiple datasets. Conclusion: Point2Pose models the conditional distribution of human poses and, together with the new MVPose3D dataset, offers a new approach to 3D human pose estimation. Abstract: We propose a novel generative approach for 3D human pose estimation. 3D human pose estimation poses several key challenges due to the complex geometry of the human body, self-occluding joints, and the requirement for large-scale real-world motion datasets. To address these challenges, we introduce Point2Pose, a framework that effectively models the distribution of human poses conditioned on sequential point cloud and pose history. Specifically, we employ a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor. Additionally, we present a large-scale indoor dataset, MVPose3D, which contains multiple modalities, including IMU data of non-trivial human motions, dense multi-view point clouds, and RGB images. Experimental results show that the proposed method outperforms the baseline models, demonstrating its superior performance across various datasets.
[57] EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs
Chao Gong,Depeng Wang,Zhipeng Wei,Ya Guo,Huijia Zhu,Jingjing Chen
Main category: cs.CV
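TL;DR: EchoingPixels is an adaptive token reduction framework for audio-visual large language models that uses a Cross-Modal Semantic Sieve (CS2) and a Synchronization-Augmented RoPE (Sync-RoPE) to retain multimodal information efficiently and dynamically.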
Details
Motivation: Existing token reduction methods are mostly unimodal: they cannot exploit audio-visual cross-modal synergy, and fixed per-modality budgets do not adapt to changing information density, leaving efficiency and performance out of balance. Method: Proposes the EchoingPixels framework, built around the Cross-Modal Semantic Sieve (CS2), which co-attends over a joint pool of audio and video tokens to enable cross-modal interaction and adaptive token reduction; Sync-RoPE preserves critical temporal relationships after sparsification. Result: EchoingPixels matches strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction. Conclusion: EchoingPixels relieves the computational bottleneck of AV-LLMs through joint cross-modal token reduction and temporally synchronized modeling, achieving efficient, high-performing audio-visual understanding. Abstract: Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and reduces tokens from an entire combined pool of audio-visual tokens rather than using fixed budgets per modality. This single-pool approach allows it to adaptively allocate the token budget across both modalities and dynamically identify salient tokens in concert. To ensure this aggressive reduction preserves the vital temporal modeling capability, we co-design a Synchronization-Augmented RoPE (Sync-RoPE) to maintain critical temporal relationships for the sparsely selected tokens.
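The difference between a single joint pool and fixed per-modality budgets can be illustrated with a small sketch. It assumes that each token already has a saliency score (in CS2 these would come from co-attention, which is not modeled here); the function name `select_joint_tokens` and the score format are hypothetical.

```python
def select_joint_tokens(audio_scores, video_scores, budget):
    """Keep the `budget` highest-scoring tokens from one combined audio-visual pool.

    Returns (modality, index) pairs, ordered by modality and original position
    so that the temporal order of the kept tokens is preserved.
    """
    pool = [("audio", i, s) for i, s in enumerate(audio_scores)]
    pool += [("video", i, s) for i, s in enumerate(video_scores)]
    if not 0 < budget <= len(pool):
        raise ValueError("budget must be between 1 and the pool size")
    # One ranking over both modalities: no per-modality quota is enforced.
    kept = sorted(pool, key=lambda tok: tok[2], reverse=True)[:budget]
    kept.sort(key=lambda tok: (tok[0], tok[1]))
    return [(modality, index) for modality, index, _ in kept]
```

With a shared budget, a clip whose audio is uninformative can spend the whole budget on video tokens, and the reverse, which a fixed split per modality cannot do.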
Extensive experiments demonstrate that EchoingPixels achieves performance comparable to strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction.
[58] StainNet: A Special Staining Self-Supervised Vision Transformer for Computational Pathology
Jiawen Li,Jiali Hu,Xitong Ling,Yongqiang Lv,Yuxuan Chen,Yizhi Wang,Tian Guan,Yifei Liu,Yonghong He
Main category: cs.CV
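TL;DR: StainNet is a specialized foundation model built on the vision transformer architecture and trained with self-distillation self-supervised learning on more than 1.4 million special-stain images, aiming to improve the analysis of special-stain images in pathology.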
Details
Motivation: Existing pathology foundation models are pre-trained mainly on H&E-stained images and adapt poorly to the special stains common in clinical practice (such as immunohistochemistry), which limits their use. Method: Proposes StainNet, which uses a vision transformer architecture and self-distillation self-supervised learning, trained on more than 1.4 million patches extracted from 20,231 publicly available special-stain whole-slide images in the HISTAI database. Result: StainNet is validated on an in-house liver malignancy classification task and two public ROI datasets, performs well in few-shot learning and image retrieval, and compares favorably with recent, larger pathology foundation models. Conclusion: As the first foundation model designed specifically for special stains, StainNet markedly improves the representation of special-stain images and shows good potential for clinical use; the model has been open-sourced. Abstract: Foundation models trained with self-supervised learning (SSL) on large-scale histological images have significantly accelerated the development of computational pathology. These models can serve as backbones for region-of-interest (ROI) image analysis or patch-level feature extractors in whole-slide images (WSIs) based on multiple instance learning (MIL). Existing pathology foundation models (PFMs) are typically pre-trained on Hematoxylin-Eosin (H&E) stained pathology images. However, images with special stains, such as immunohistochemistry, are also frequently used in clinical practice. PFMs pre-trained mainly on H&E-stained images may be limited in clinical applications involving special stains. To address this issue, we propose StainNet, a specialized foundation model for special stains based on the vision transformer (ViT) architecture. StainNet adopts a self-distillation SSL approach and is trained on over 1.4 million patch images cropped from 20,231 publicly available special staining WSIs in the HISTAI database. To evaluate StainNet, we conduct experiments on an in-house slide-level liver malignancy classification task and two public ROI-level datasets to demonstrate its strong ability. We also perform few-ratio learning and retrieval evaluations, and compare StainNet with recent, larger PFMs to further highlight its strengths. We have released the StainNet model weights at: https://huggingface.co/JWonderLand/StainNet.
[59] Simple Yet Effective Selective Imputation for Incomplete Multi-view Clustering
Cai Xu,Jinlong Liu,Yilin Zhang,Ziyu Guan,Wei Zhao
Main category: cs.CV
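TL;DR: Proposes ISMVC, an informativeness-based selective imputation method for multi-view clustering that decides whether to impute each missing position from its intra-view similarity and cross-view consistency, learns clustering-friendly latent representations with a variational autoencoder and a mixture-of-Gaussians prior, and performs distribution-level imputation with uncertainty modeling to improve incomplete multi-view clustering.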
Details
Motivation: Among existing incomplete multi-view clustering methods, imputation-based approaches tend to introduce noise and bias, while imputation-free approaches lack cross-view complementarity under severe missingness and cannot use the data effectively. A method is needed that avoids blind imputation yet makes full use of the observed information. Method: Proposes ISMVC, which scores the imputation-relevant informativeness of each missing position from intra-view similarity and cross-view consistency and imputes selectively only when support is sufficient; a variational autoencoder with a mixture-of-Gaussians prior learns latent representations, enabling distribution-level imputation and uncertainty modeling for more robust fusion. Result: Validated on multiple benchmark datasets, outperforming existing imputation-based and imputation-free methods, particularly under the more realistic and challenging unbalanced missing scenario. Conclusion: Through a data-driven selective imputation strategy, ISMVC stays lightweight and model-agnostic while balancing information recovery against noise suppression, improving the performance and robustness of incomplete multi-view clustering. Abstract: Incomplete multi-view data, where different views suffer from missing and unbalanced observations, pose significant challenges for clustering. Existing imputation-based methods attempt to estimate missing views to restore data associations, but indiscriminate imputation often introduces noise and bias, especially when the available information is insufficient. Imputation-free methods avoid this risk by relying solely on observed data, but struggle under severe incompleteness due to the lack of cross-view complementarity. To address this issue, we propose Informativeness-based Selective imputation Multi-View Clustering (ISMVC). Our method evaluates the imputation-relevant informativeness of each missing position based on intra-view similarity and cross-view consistency, and selectively imputes only when sufficient support is available. Furthermore, we integrate this selection with a variational autoencoder equipped with a mixture-of-Gaussians prior to learn clustering-friendly latent representations. By performing distribution-level imputation, ISMVC not only stabilizes the aggregation of posterior distributions but also explicitly models imputation uncertainty, enabling robust fusion and preventing overconfident reconstructions. Compared with existing cautious imputation strategies that depend on training dynamics or model feedback, our method is lightweight, data-driven, and model-agnostic. It can be readily integrated into existing IMC models as a plug-in module.
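The selective imputation decision described above can be sketched in a few lines. The abstract does not say how intra-view similarity and cross-view consistency are combined or what threshold is used, so the product rule, the default threshold of 0.5, and the name `select_positions_to_impute` are all assumptions made for illustration.

```python
def select_positions_to_impute(missing, threshold=0.5):
    """Return the ids of missing positions that have enough support to impute.

    missing: list of (position_id, intra_view_similarity, cross_view_consistency)
             with both scores assumed to lie in [0, 1].
    """
    selected = []
    for position_id, intra_sim, cross_cons in missing:
        # Assumed combination rule: both signals must be reasonably high.
        informativeness = intra_sim * cross_cons
        if informativeness >= threshold:
            selected.append(position_id)
    return selected
```

Positions that fall below the threshold are simply left missing, which is how a rule of this kind avoids the noise that indiscriminate imputation would add.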
Extensive experiments on multiple benchmark datasets under a more realistic and challenging unbalanced missing scenario demonstrate that our method outperforms both imputation-based and imputation-free approaches.
[60] Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Yiwen Tang,Zoey Guo,Kaixin Zhu,Ray Zhang,Qizhi Chen,Dongzhi Jiang,Junli Liu,Bohan Zeng,Haoming Song,Delin Qu,Tianyi Bai,Dan Xu,Wentao Zhang,Bin Zhao
Main category: cs.CV
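TL;DR: This paper presents the first systematic study of reinforcement learning (RL) for text-to-3D autoregressive generation, proposing new reward designs, the RL algorithm variant Hi-GRPO, and the MME-3DR benchmark, and releasing AR3D-R1, the first RL-enhanced text-to-3D generation model.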
Details
Motivation: Because 3D objects have higher spatial complexity and require consistent geometry and fine textures, applying reinforcement learning to 3D generation remains largely unexplored, and a systematic study is needed to address reward design and algorithm sensitivity. Method: Studies four aspects: (1) reward design, evaluating multi-modal models as reward signals; (2) RL algorithms, examining GRPO variants and token-level optimization; (3) a new benchmark, MME-3DR; (4) a hierarchical RL method, Hi-GRPO, that optimizes from global shape to local texture. Result: Develops AR3D-R1, the first RL-enhanced text-to-3D generation model, which performs well in shape generation and texture refinement; confirms the importance of alignment with human preference and shows that general multi-modal models provide effective reward signals. Conclusion: Reinforcement learning can advance reasoning in text-to-3D generation; sound reward design and hierarchical optimization are key, and the study offers a foundational framework and practical guidance for RL in 3D generation. Abstract: Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has recently been successfully extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, adept from coarse shape generation through texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation.
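For readers unfamiliar with GRPO, the group-relative advantage that GRPO-style methods are generally built on can be sketched as below. This is not Hi-GRPO, whose hierarchical reward ensembles are not specified in the abstract; the use of the population standard deviation, the `eps` constant, and the function name are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale.

    rewards: scalar rewards for several generations of the same prompt.
    Each advantage says how much better or worse a sample is than its group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumed: population standard deviation
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the baseline is the group mean rather than a learned value function, the quality of the reward model directly shapes the update, which is consistent with the abstract's emphasis on reward design.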
Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
[61] A Conditional Generative Framework for Synthetic Data Augmentation in Segmenting Thin and Elongated Structures in Biological Images
Yi Liu,Yichi Zhang
Main category: cs.CV
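TL;DR: Proposes a Pix2Pix-based conditional generative framework and a filament-aware structural loss for generating realistic filamentous structures in microscopy images, easing the shortage of annotated data.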
Details
Motivation: High-quality pixel-level annotated datasets of filamentous structures are scarce because the dense distribution and geometry of filaments make manual annotation laborious and time-consuming. Method: Builds a conditional generative model on the Pix2Pix architecture to generate realistic microscopy images from binary masks, and designs a filament-aware structural loss to improve the structural similarity of generated images. Result: Experiments show the method is effective and outperforms existing models trained without synthetic data. Conclusion: The generative framework eases the annotation shortage in filament segmentation and improves model performance. Abstract: Thin and elongated filamentous structures, such as microtubules and actin filaments, often play important roles in biological systems. Segmenting these filaments in biological images is a fundamental step for quantitative analysis. Recent advances in deep learning have significantly improved the performance of filament segmentation. However, acquiring high-quality pixel-level annotated datasets for filamentous structures is a major challenge, as the dense distribution and geometric properties of filaments make manual annotation extremely laborious and time-consuming. To address the data shortage problem, we propose a conditional generative framework based on the Pix2Pix architecture to generate realistic filaments in microscopy images from binary masks. We also propose a filament-aware structural loss to improve the structural similarity when generating synthetic images. Our experiments demonstrate the effectiveness of our approach, which outperforms existing models trained without synthetic data.
[62] Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution
Yi-Cheng Liao,Shyang-En Weng,Yu-Syuan Xu,Chi-Wei Hsiao,Wei-Chen Chiu,Ching-Chun Huang
Main category: cs.CV
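TL;DR: Proposes HD-CLIP, a hierarchical degradation-aware CLIP module for real-world image super-resolution that uses semantic and degradation embeddings to guide better generative restoration.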
Details
Motivation: Existing methods rely on CLIP text encoders and assume that degradation severity is known, so they struggle to capture numerical degradation information and generalize poorly. Method: Proposes HD-CLIP, which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding, and introduces classifier-free projection guidance (CFPG) to integrate it into diffusion models. Result: As a plug-and-play module, HD-CLIP improves the detail fidelity and perceptual realism of several super-resolution frameworks on real-world data without training. Conclusion: HD-CLIP disentangles semantic and degradation information, strengthens generalization to unknown degradation levels, and markedly improves real-world image super-resolution. Abstract: Real-World Image Super-Resolution (Real-ISR) aims to recover high-quality images from low-quality inputs degraded by unknown and complex real-world factors. Real-world scenarios involve diverse and coupled degradations, making it necessary to provide diffusion models with richer and more informative guidance. However, existing methods often assume known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, limiting their generalization ability. To address this, we propose HD-CLIP (Hierarchical Degradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen levels. Furthermore, we integrate it into diffusion models via classifier-free guidance (CFG) and propose classifier-free projection guidance (CFPG). HD-CLIP leverages semantic cues to guide generative restoration while using degradation cues to suppress undesired hallucinations and artifacts. As a plug-and-play module, HD-CLIP can be seamlessly integrated into various super-resolution frameworks without training, significantly improving detail fidelity and perceptual realism across diverse real-world datasets.
[63] CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
Shresth Grover,Priyank Pathak,Akash Kumar,Vibhav Vineet,Yogesh S Rawat
Main category: cs.CV
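TL;DR: Proposes the CoSPlan benchmark to evaluate vision-language models on visual sequential planning that contains erroneous steps, and a training-free method, SGI, that improves model reasoning.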
Details
Motivation: Large vision-language models excel at complex reasoning but remain underexplored in error-prone visual sequential planning, and in particular lack the ability to detect and correct non-optimal actions. Method: Builds the CoSPlan benchmark, covering four domains, to evaluate error detection and step completion; proposes SGI, which uses scene graph incremental updates to introduce intermediate reasoning steps and improve VLMs' sequential reasoning. Result: Existing VLMs (e.g., Intern-VLM and Qwen2) perform poorly on CoSPlan; SGI yields an average performance gain of 5.2% and generalizes to traditional planning tasks such as Plan-Bench and VQA. Conclusion: By introducing intermediate-state reasoning, SGI strengthens VLMs in corrective sequential planning and suggests a new direction for training-free reasoning improvements. Abstract: Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal actions) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.
[64] Topology-Agnostic Animal Motion Generation from Text Prompt
Keyi Chen,Mingze Sun,Zhenyu Liu,Zhangquan Chen,Ruqi Huang
Main category: cs.CV
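TL;DR: This paper introduces OmniZoo, a large-scale animal motion dataset, and a general autoregressive framework that generates text-driven motion for arbitrary skeletal topologies, addressing the poor generalization of existing methods across skeleton structures.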
Details
Motivation: Most existing motion generation methods depend on fixed skeletal templates and do not generalize to different or perturbed skeletal topologies; large-scale heterogeneous animal motion data and a unified generative framework are also lacking. Method: Builds the OmniZoo dataset, spanning 140 species and 32,979 sequences, and proposes a generalized autoregressive generation framework whose core is a topology-aware skeleton embedding module that encodes the geometric and structural properties of any skeleton into a shared token space for fusion with textual semantics. Result: Given a text prompt and a target skeleton, the method generates temporally coherent, physically plausible, and semantically aligned motions, and supports cross-species motion style transfer. Conclusion: The method removes the dependence of motion generation on fixed skeletal structures, enabling text-driven motion generation for arbitrary skeletal topologies, with broad potential in animation, robotics, and virtual environments. Abstract: Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods: the combined lack of large-scale heterogeneous animal motion data and unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.
[65] Hybrid Transformer-Mamba Architecture for Weakly Supervised Volumetric Medical Segmentation
Yiheng Lyu,Lian Xu,Mohammed Bennamoun,Farid Boussaid,Coen Arrow,Girish Dwivedi
Main category: cs.CV
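TL;DR: Proposes TranSamba, a hybrid Transformer-Mamba architecture for weakly supervised 3D medical image segmentation that uses Cross-Plane Mamba blocks to model 3D context efficiently and reaches state-of-the-art performance on multiple datasets.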
Details
Motivation: Existing weakly supervised semantic segmentation methods mostly rely on 2D encoders, ignoring the inherently volumetric nature of medical images and failing to capture deep inter-voxel context. Method: Designs TranSamba with a Vision Transformer backbone and Cross-Plane Mamba blocks, which exploit the linear complexity of state space models for efficient information exchange between neighboring slices and enhance within-slice self-attention, yielding better attention maps for object localization. Result: In extensive experiments on three medical imaging datasets, TranSamba consistently outperforms existing methods across modalities and pathologies, setting a new state of the art with favorable time and memory efficiency. Conclusion: By combining the strengths of Transformers and Mamba, TranSamba models 3D context effectively and offers an efficient, scalable solution for weakly supervised volumetric segmentation. Abstract: Weakly supervised semantic segmentation offers a label-efficient solution to train segmentation models for volumetric medical imaging. However, existing approaches often rely on 2D encoders that neglect the inherent volumetric nature of the data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context for weakly supervised volumetric medical segmentation. TranSamba augments a standard Vision Transformer backbone with Cross-Plane Mamba blocks, which leverage the linear complexity of state space models for efficient information exchange across neighboring slices. The information exchange enhances the pairwise self-attention within slices computed by the Transformer blocks, directly contributing to the attention maps for object localization. TranSamba achieves effective volumetric modeling with time complexity that scales linearly with the input volume depth and maintains constant memory usage for batch processing. Extensive experiments on three datasets demonstrate that TranSamba establishes new state-of-the-art performance, consistently outperforming existing methods across diverse modalities and pathologies. Our source code and trained models are openly accessible at: https://github.com/YihengLyu/TranSamba.
[66] mmCounter: Static People Counting in Dense Indoor Scenarios Using mmWave Radar
Tarik Reza Toha,Shao-Jung Lu,Shahriar Nirjon
Main category: cs.CV
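TL;DR: mmCounter is a new method that uses mmWave radar to extract ultra-low-frequency signals (such as breathing and small body movements) to count people in dense, static groups, achieving relatively high counting accuracy in dense indoor environments.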
Details
Motivation: Existing mmWave radars are limited by spatial resolution and reliance on movement when detecting stationary groups, making it hard to count densely packed static individuals. Current breathing-rate estimation algorithms also usually assume the number of people is known, so they cannot be applied directly when it is not. Method: Proposes mmCounter, which extracts ultra-low-frequency signals below 1 Hz (mainly from breathing and slight torso movement) and applies a novel multi-stage signal processing pipeline to isolate human-related low-frequency sources with their spatial information, then maps them to individuals for counting. Result: Experiments across environments show an average F1 score of 87% and a mean absolute error of 0.6 in familiar environments, and an F1 score of 60% and a mean absolute error of 1.1 in unseen environments. It can count up to seven people in a three-square-meter space with only one meter of front-to-back spacing and no side-by-side spacing. Conclusion: mmCounter addresses the challenge of counting dense static crowds with mmWave radar by exploiting breathing and micro-motion signals with a novel signal processing pipeline, achieving relatively high accuracy across environments and showing potential for practical deployment. Abstract: mmWave radars struggle to detect or count individuals in dense, static (non-moving) groups due to limitations in spatial resolution and reliance on movement for detection. We present mmCounter, which accurately counts static people in dense indoor spaces (up to three people per square meter). mmCounter achieves this by extracting ultra-low frequency (< 1 Hz) signals, primarily from breathing and micro-scale body movements such as slight torso shifts, and applying novel signal processing techniques to differentiate these subtle signals from background noise and nearby static objects. Our problem differs significantly from existing studies on breathing rate estimation, which assume the number of people is known a priori. In contrast, mmCounter utilizes a novel multi-stage signal processing pipeline to extract relevant low-frequency sources along with their spatial information and map these sources to individual people, enabling accurate counting. Extensive evaluations in various environments demonstrate that mmCounter delivers an 87% average F1 score and 0.6 mean absolute error in familiar environments, and a 60% average F1 score and 1.1 mean absolute error in previously untested environments. It can count up to seven individuals in a three-square-meter space with no side-by-side spacing and only a one-meter front-to-back distance.
[67] Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task
Sunqi Fan,Jiashuo Cui,Meng-Hao Guo,Shuojin Yang
Main category: cs.CV
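TL;DR: Proposes a lightweight approach based on a Spatiotemporal Reasoning Framework (STAR) and a Video Toolkit to improve the spatiotemporal reasoning of multimodal large language models on complex video question answering, with significant gains on VideoMME and LongVideoBench.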
Details
Motivation: Existing MLLMs struggle to model intra-frame spatial relationships and the causal dynamics of temporal evolution at the same time in video question answering, and fall short particularly in complex reasoning scenarios. Method: Designs an extensible Video Toolkit and proposes the STAR framework, which strategically schedules temporal and spatial tools to progressively localize the key area in the video and strengthen the MLLM's spatiotemporal reasoning. Result: Gains of 8.2% on VideoMME and 4.6% on LongVideoBench, achieved by enhancing GPT-4o with lightweight tools only. Conclusion: The proposed Video Toolkit and STAR framework help build autonomous, intelligent video analysis assistants and advance multimodal models' understanding of dynamic scenes. Abstract: The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle with simultaneously modeling spatial relationships within video frames and understanding the causal dynamics of temporal evolution on complex and reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit, to enhance their spatiotemporal reasoning capabilities and ensure the harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework make an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.[68] Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
Woojun Jung,Jaehoon Go,Mingyu Jeon,Sunjae Yoon,Junyeong Kim
Main category: cs.CV
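TL;DR: Proposes Visual Funnel, a training-free, two-step method that addresses the "Contextual Blindness" that multimodal large language models exhibit on fine-grained visual tasks because their input lacks structural diversity.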
Details
Motivation: Existing multimodal large language models have strong reasoning abilities, but when capturing fine-grained visual details they often lose global context and exhibit "Contextual Blindness", which limits their use in precision-demanding tasks. Method: Visual Funnel works in two steps: it first performs Contextual Anchoring to identify the region of interest; it then dynamically determines crop sizes from attention entropy and refines crop centers to build an Entropy-Scaled Portfolio of multi-level image crops that preserves hierarchical context from focal detail to broader surroundings. Result: Experiments show that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines; adding more unstructured crops yields limited or even detrimental effects, confirming that hierarchical structure is key to mitigating Contextual Blindness. Conclusion: The structural diversity of the input matters more than the quantity of information; by building a hierarchically structured crop portfolio, Visual Funnel resolves Contextual Blindness and improves MLLM performance on fine-grained visual understanding tasks. Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to a structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. 
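To make the idea of entropy-driven crop sizing concrete, here is a minimal sketch. It assumes a flattened attention map, a linear mapping from normalized entropy to crop side length, and nested crops at fixed scale factors; none of these specifics come from the abstract, and the names `attention_entropy`, `entropy_scaled_side`, and `crop_portfolio` are hypothetical.

```python
import math

def attention_entropy(weights):
    """Shannon entropy of an attention map, normalized to [0, 1]."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(weights))

def entropy_scaled_side(weights, min_side=64, max_side=256):
    """Assumed rule: diffuse attention (high entropy) gets a larger base crop."""
    return min_side + attention_entropy(weights) * (max_side - min_side)

def crop_portfolio(center, base_side, image_size, scales=(1.0, 2.0, 4.0)):
    """Nested boxes (x0, y0, x1, y1) around one center, clipped to the image."""
    cx, cy = center
    width, height = image_size
    boxes = []
    for scale in scales:
        half = base_side * scale / 2
        boxes.append((max(0, cx - half), max(0, cy - half),
                      min(width, cx + half), min(height, cy + half)))
    return boxes

peaked = [0.97, 0.01, 0.01, 0.01]   # attention focused on one patch
flat = [0.25, 0.25, 0.25, 0.25]     # attention spread evenly

side = entropy_scaled_side(peaked)
boxes = crop_portfolio(center=(320, 240), base_side=side, image_size=(640, 480))
```

The nested boxes share a center, so each larger crop contains the smaller one; that containment is one simple way to express a focal-detail-to-surroundings hierarchy.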
Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.[69] Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos
Mingyu Jeon,Jisoo Yang,Sungjin Han,Jinkwon Hwang,Sunjae Yoon,Jonghee Kim,Junyeoung Kim
Main category: cs.CV
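TL;DR: Proposes P2S, a training-free, zero-shot framework for long-video moment retrieval that tackles inefficient search and costly refinement through adaptive span generation and query decomposition, and significantly outperforms supervised methods on hour-long videos.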
Details
Motivation: Existing zero-shot long-video moment retrieval methods suffer from candidate explosion in the search phase, and their refine phase relies on costly vision-language model (VLM) verification, leading to heavy computation and poor generalization and making hour-long videos hard to handle efficiently. Method: Proposes the P2S framework with two core components: (1) an Adaptive Span Generator that dynamically produces high-quality candidate spans and avoids candidate explosion in the search phase; (2) a Query Decomposition strategy that splits complex queries into sub-queries to reduce semantic discrepancy, enabling efficient refinement without costly VLM verification. Result: P2S significantly outperforms state-of-the-art supervised methods on benchmarks such as MAD, for example by 3.7% on R5@0.1, and is the first framework to achieve zero-shot temporal grounding in hour-long videos. Conclusion: P2S addresses the dual challenge of search efficiency and refinement cost in zero-shot long-video moment retrieval, achieves high performance without training, and makes long-video understanding more practical. Abstract: Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a 'Search-then-Refine' approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a 'search' phase candidate explosion, and (2) the 'refine' phase, which is vulnerable to semantic discrepancy, requires high-cost VLMs for verification, incurring significant computational overhead. We propose Point-to-Span (P2S), a novel training-free framework to overcome this challenge of inefficient 'search' and costly 'refine' phases. P2S overcomes these challenges with two key innovations: an 'Adaptive Span Generator' to prevent the search phase candidate explosion, and 'Query Decomposition' to refine candidates without relying on high-cost VLM verification. 
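The abstract does not spell out how the Adaptive Span Generator works, so the following is only a hypothetical illustration of the general "point to span" idea: start from the clip most similar to the query and grow a span while neighbouring clips stay above a fraction of the peak score. The function name `point_to_span`, the `ratio` parameter, and the example scores are all assumptions.

```python
def point_to_span(scores, ratio=0.75):
    """Grow a (start, end) clip span around the best-scoring clip.

    Hypothetical rule: a neighbouring clip joins the span while its
    query similarity is at least `ratio` times the peak similarity.
    """
    peak = max(range(len(scores)), key=scores.__getitem__)
    threshold = ratio * scores[peak]
    start = end = peak
    while start > 0 and scores[start - 1] >= threshold:
        start -= 1
    while end < len(scores) - 1 and scores[end + 1] >= threshold:
        end += 1
    return start, end

# Per-clip query similarity for a short video (made-up values).
clip_scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1]
span = point_to_span(clip_scores)
print(span)  # (2, 4)
```

Because the threshold is relative to the peak, the span width adapts to how sharply the similarity falls off, which yields one candidate per peak rather than a fixed grid of windows.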
To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7% on R5@0.1 on MAD).[70] Breaking the Vicious Cycle: Coherent 3D Gaussian Splatting from Sparse and Motion-Blurred Views
Zhankuo Xu,Chaoran Feng,Yingtao Li,Jianbin Zhao,Jiashu Yang,Wangbo Yu,Li Yuan,Yonghong Tian
Main category: cs.CV
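TL;DR: Proposes CoherentGS, a new framework for high-fidelity 3D reconstruction from sparse and blurry images; a dual-prior strategy combining a deblurring network and a diffusion model significantly improves the robustness and performance of 3D Gaussian Splatting in real-world scenes.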
Details
Motivation: 3D Gaussian Splatting (3DGS) performs well in novel view synthesis, but its assumption of dense, high-quality imagery often fails with real-world inputs that are sparse and motion-blurred, causing reconstruction to fail. Method: Proposes a dual-prior strategy that combines a pre-trained, specialized deblurring network, which restores detail and provides photometric guidance, with a diffusion model that supplies geometric priors to fill in unobserved regions; a consistency-guided camera exploration module and a depth regularization loss are introduced to strengthen geometric consistency. Result: With as few as 3, 6, and 9 input views, CoherentGS significantly outperforms existing methods on both synthetic and real-world scenes, setting a new state of the art for this task. Conclusion: CoherentGS breaks the vicious cycle between sparse views and motion blur and offers a reliable solution for 3DGS-based reconstruction under complex real-world conditions. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art method for novel view synthesis. However, its performance heavily relies on dense, high-quality input imagery, an assumption that is often violated in real-world applications, where data is typically sparse and motion-blurred. These two issues create a vicious cycle: sparse views ignore the multi-view constraints necessary to resolve motion blur, while motion blur erases high-frequency details crucial for aligning the limited views. Thus, reconstruction often fails catastrophically, with fragmented views and a low-frequency bias. To break this cycle, we introduce CoherentGS, a novel framework for high-fidelity 3D reconstruction from sparse and blurry images. Our key insight is to address these compound degradations using a dual-prior strategy. Specifically, we combine two pre-trained generative models: a specialized deblurring network for restoring sharp details and providing photometric guidance, and a diffusion model that offers geometric priors to fill in unobserved regions of the scene. This dual-prior strategy is supported by several key techniques, including a consistency-guided camera exploration module that adaptively guides the generative process, and a depth regularization loss that ensures geometric plausibility. We evaluate CoherentGS through both quantitative and qualitative experiments on synthetic and real-world scenes, using as few as 3, 6, and 9 input views. Our results demonstrate that CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for this challenging task. 
The code and video demos are available at https://potatobigroom.github.io/CoherentGS/.[71] RaLiFlow: Scene Flow Estimation with 4D Radar and LiDAR Point Clouds
Jingyun Fu,Zhiyu Xiang,Na Zhao
Main category: cs.CV
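TL;DR: Proposes RaLiFlow, the first joint learning framework that combines 4D millimeter-wave radar and LiDAR for scene flow estimation, and builds a corresponding Radar-LiDAR scene flow dataset; effective fusion is achieved through a dynamic-aware bidirectional cross-modal fusion module and carefully designed loss functions.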
Details
Motivation: Existing methods have not explored fusing 4D millimeter-wave radar with LiDAR for scene flow estimation; radar is cheap, robust in adverse weather, and able to measure velocity, but it is noisy, low-resolution, and sparse, and no matching dataset exists. Method: Builds a Radar-LiDAR scene flow dataset from real-world autonomous driving data; proposes a preprocessing strategy for radar denoising and scene flow label generation; designs a Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module that injects radar dynamic cues into local cross-attention to propagate context across modalities; and uses a set of purpose-built loss functions to mitigate the effect of unreliable radar data and strengthen instance-level consistency. Result: Experiments on the repurposed scene flow dataset show that the method significantly outperforms existing LiDAR-based and radar-based single-modal methods, with the largest advantage in dynamic foreground areas. Conclusion: RaLiFlow is the first to achieve effective fusion of 4D radar and LiDAR for scene flow estimation, confirms the complementary potential of the two modalities, and offers a more robust option for future autonomous driving perception systems. Abstract: Recent multimodal fusion methods, integrating images with LiDAR point clouds, have shown promise in scene flow estimation. However, the fusion of 4D millimeter wave radar and LiDAR remains unexplored. Unlike LiDAR, radar is cheaper, more robust in various weather conditions and can detect point-wise velocity, making it a valuable complement to LiDAR. However, radar inputs pose challenges due to noise, low resolution, and sparsity. Moreover, there is currently no dataset that combines LiDAR and radar data specifically for scene flow estimation. To address this gap, we construct a Radar-LiDAR scene flow dataset based on a public real-world automotive dataset. We propose an effective preprocessing strategy for radar denoising and scene flow label generation, deriving more reliable flow ground truth for radar points out of the object boundaries. Additionally, we introduce RaLiFlow, the first joint scene flow learning framework for 4D radar and LiDAR, which achieves effective radar-LiDAR fusion through a novel Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module and a carefully designed set of loss functions. The DBCF module integrates dynamic cues from radar into the local cross-attention mechanism, enabling the propagation of contextual information across modalities. Meanwhile, the proposed loss functions mitigate the adverse effects of unreliable radar data during training and enhance the instance-level consistency in scene flow predictions from both modalities, particularly for dynamic foreground areas. 
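One way to see why point-wise radar velocity complements a scene flow estimate: a radar measures only the velocity component along its line of sight, so a predicted 3D flow vector can be projected onto that direction and compared with the measurement. The sketch below shows this projection under simplifying assumptions (sensor at the origin, ego-motion already compensated); it is not RaLiFlow's loss, and the function names are hypothetical.

```python
import math

def radial_velocity_from_flow(point, flow, dt):
    """Line-of-sight speed (m/s) implied by a flow vector (m) over dt seconds.

    Assumes the radar sits at the origin and ego-motion is already removed.
    Positive values mean the point moves away from the sensor.
    """
    rng = math.sqrt(sum(c * c for c in point))
    if rng == 0:
        raise ValueError("point coincides with the sensor origin")
    return sum(p * f for p, f in zip(point, flow)) / (rng * dt)

def radial_residual(point, flow, dt, measured_radial):
    """Absolute disagreement between predicted flow and a radar measurement."""
    return abs(radial_velocity_from_flow(point, flow, dt) - measured_radial)

# A point 10 m ahead that moves 1 m further away within 0.1 s.
receding = radial_velocity_from_flow((10.0, 0.0, 0.0), (1.0, 0.0, 0.0), 0.1)
# The same point moving sideways is invisible to the radial measurement.
tangential = radial_velocity_from_flow((10.0, 0.0, 0.0), (0.0, 1.0, 0.0), 0.1)
```

The tangential case shows the limit of radar alone: motion perpendicular to the line of sight produces no radial velocity, which is one reason fusing it with a second modality is attractive.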
Extensive experiments on the repurposed scene flow dataset demonstrate that our method outperforms existing LiDAR-based and radar-based single-modal methods by a significant margin.[72] Self-Supervised Contrastive Embedding Adaptation for Endoscopic Image Matching
Alberto Rota,Elena De Momi
Main category: cs.CV
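TL;DR: Proposes a novel self-supervised deep learning framework for feature matching in endoscopic image pairs; it generates ground-truth correspondences via novel-view synthesis and adapts a DINOv2 backbone with contrastive learning, improving matching precision and 3D reconstruction performance in surgical scenes.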
Details
Motivation: Weak perspective cues, non-Lambertian reflections, and complex deformable tissue make it hard for conventional computer vision methods to establish precise pixel-level correspondences in endoscopic images, and existing deep learning models are not adapted to fine-grained matching in surgical imagery. Method: Proposes a self-supervised learning framework based on novel-view synthesis that generates ground-truth inlier correspondences and mines triplets for contrastive learning; a Transformer layer is added on top of the DINOv2 backbone and optimized so that the embeddings support direct matching via cosine similarity thresholding. Result: Experiments on the SCARED dataset show higher matching precision and lower epipolar error than existing state-of-the-art methods. Conclusion: The proposed self-supervised feature matching framework improves correspondence estimation in endoscopic images and supports high-level vision applications in surgical scenes such as accurate 3D reconstruction, camera tracking, and scene understanding. Abstract: Accurate spatial understanding is essential for image-guided surgery, augmented reality integration and context awareness. In minimally invasive procedures, where visual input is the sole intraoperative modality, establishing precise pixel-level correspondences between endoscopic frames is critical for 3D reconstruction, camera tracking, and scene interpretation. However, the surgical domain presents distinct challenges: weak perspective cues, non-Lambertian tissue reflections, and complex, deformable anatomy degrade the performance of conventional computer vision techniques. While Deep Learning models have shown strong performance in natural scenes, their features are not inherently suited for fine-grained matching in surgical images and require targeted adaptation to meet the demands of this domain. This research presents a novel Deep Learning pipeline for establishing feature correspondences in endoscopic image pairs, alongside a self-supervised optimization framework for model training. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences, subsequently utilized for mining triplets within a contrastive learning paradigm. Through this self-supervised approach, we augment the DINOv2 backbone with an additional Transformer layer, specifically optimized to produce embeddings that facilitate direct matching through cosine similarity thresholding. Experimental evaluation demonstrates that our pipeline surpasses state-of-the-art methodologies on the SCARED datasets, with improved matching precision and lower epipolar error compared to related work. 
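Matching by cosine similarity thresholding can be sketched in a few lines. The example below is a generic nearest-neighbour matcher over toy 2D descriptors, not the paper's pipeline; the 0.8 threshold, the descriptor values, and the function names are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def match_by_cosine(desc_a, desc_b, threshold=0.8):
    """For each descriptor in A, keep its best match in B if similar enough."""
    matches = []
    for i, a in enumerate(desc_a):
        sims = [cosine(a, b) for b in desc_b]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:  # reject ambiguous or weak correspondences
            matches.append((i, j, sims[j]))
    return matches

# Toy embeddings for patches in two frames (made-up values).
frame_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_b = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]

matches = match_by_cosine(frame_a, frame_b)
pairs = [(i, j) for i, j, _ in matches]
print(pairs)  # [(0, 0), (1, 1)]
```

The third descriptor in `frame_a` sits halfway between two candidates, so its best similarity (about 0.78) falls below the threshold and it is left unmatched; this is the behaviour a threshold is meant to provide.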
The proposed framework constitutes a valuable contribution toward enabling more accurate high-level computer vision applications in surgical endoscopy.[73] Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies
Cong Pang,Hongtao Yu,Zixuan Chen,Lewei Lu,Xin Lou
Main category: cs.CV
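TL;DR: Proposes FROW, a new benchmark for evaluating the fine-grained recognition ability of large vision-language models (LVLMs), builds a dataset of mosaic data and open-world data with GPT-4o, and combines optimization strategies for data construction and the training process to significantly improve fine-grained recognition performance.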
Details
Motivation: Existing LVLM benchmarks focus mainly on reasoning tasks and neglect fine-grained recognition, which is crucial in practice, so there is no systematic way to evaluate it. Method: Proposes the FROW benchmark, uses GPT-4o to build a new dataset containing mosaic data (combinations of multiple short answers) and open-world data (generated from real-world questions and answers), and optimizes LVLM performance from two angles: data construction and the training process. Result: Experiments show that mosaic data improves category recognition accuracy by 1%, open-world data raises FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%, and adding fine-grained data to the pre-training phase improves category recognition accuracy by up to 10%. Conclusion: FROW provides an effective framework for evaluating the fine-grained recognition ability of LVLMs, and the proposed optimization strategies significantly improve model performance and support use in practical scenarios. Abstract: Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On this basis, we propose a novel optimization strategy from two perspectives: data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1% and open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.[74] Adaptive Dual-Weighted Gravitational Point Cloud Denoising Method
Ge Zhang,Chunyang Wang,Bo Xiao,Xuelian Liu,Bin Liu
Main category: cs.CV
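TL;DR: Proposes an adaptive dual-weight gravitational point cloud denoising method that combines octree spatial partitioning with a gravitational scoring function to achieve efficient, real-time denoising while maintaining high accuracy and edge preservation.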
Details
Motivation: Existing point cloud denoising methods struggle to balance denoising accuracy, edge preservation, and computational efficiency, and are especially limited in multi-noise environments. Method: An octree partitions the global point cloud to enable parallel acceleration; within each leaf node, adaptive voxel occupancy statistics and kNN density estimation quickly remove isolated, low-density noise; a gravitational scoring function that fuses density weights with adaptive distance weights then finely separates noise points from object points. Result: Experiments on the Stanford 3D, CADC, and in-house FMCW LiDAR datasets show that the method outperforms existing methods on F1, PSNR, and Chamfer Distance while shortening single-frame processing time. Conclusion: The method balances high accuracy, strong edge preservation, and real-time performance under a range of noise conditions, and is suited to practical applications such as autonomous driving and 3D reconstruction. Abstract: High-quality point cloud data is a critical foundation for tasks such as autonomous driving and 3D reconstruction. However, LiDAR-based point cloud acquisition is often affected by various disturbances, resulting in a large number of noise points that degrade the accuracy of subsequent point cloud object detection and recognition. Moreover, existing point cloud denoising methods typically sacrifice computational efficiency in pursuit of higher denoising accuracy, or, conversely, improve processing speed at the expense of preserving object boundaries and fine structural details, making it difficult to simultaneously achieve high denoising accuracy, strong edge preservation, and real-time performance. To address these limitations, this paper proposes an adaptive dual-weight gravitational-based point cloud denoising method. First, an octree is employed to perform spatial partitioning of the global point cloud, enabling parallel acceleration. Then, within each leaf node, adaptive voxel-based occupancy statistics and k-nearest neighbor (kNN) density estimation are applied to rapidly remove clearly isolated and low-density noise points, thereby reducing the effective candidate set. Finally, a gravitational scoring function that combines density weights with adaptive distance weights is constructed to finely distinguish noise points from object points. 
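The abstract does not give the form of the gravitational scoring function, so the sketch below is only a guess at the flavour of such a score: each point is "pulled" by its k nearest neighbours, with a neighbour's density acting as its mass and an inverse-square distance term as the distance weight. The density estimate, the inverse-square weight, and all names are assumptions; the octree and voxel stages are omitted and a brute-force kNN is used instead.

```python
import math

def knn(points, i, k):
    """Brute-force k nearest neighbours of point i as (distance, index) pairs."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return dists[:k]

def gravitational_scores(points, k=3, eps=1e-9):
    """Assumed score: sum over neighbours of density_j / distance^2."""
    neighbours = [knn(points, i, k) for i in range(len(points))]
    # Density weight: inverse of the mean distance to the k nearest neighbours.
    density = [1.0 / (sum(d for d, _ in nb) / len(nb) + eps) for nb in neighbours]
    return [sum(density[j] / (d * d + eps) for d, j in nb) for nb in neighbours]

cloud = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (0.0, 0.0, 0.1), (0.1, 0.1, 0.0),
    (5.0, 5.0, 5.0),  # isolated point, intended as noise
]
scores = gravitational_scores(cloud)
noise_index = scores.index(min(scores))
print(noise_index)  # 5
```

Points inside a dense surface patch receive a strong pull from nearby, dense neighbours, while an isolated point is far from everything and scores low; thresholding the score would then separate the two groups.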
Experiments conducted on the Stanford 3D Scanning Repository, the Canadian Adverse Driving Conditions (CADC) dataset, and in-house FMCW LiDAR point clouds acquired in our laboratory demonstrate that, compared with existing methods, the proposed approach achieves consistent improvements in F1, PSNR, and Chamfer Distance (CD) across various noise conditions while reducing the single-frame processing time, thereby validating its high accuracy, robustness, and real-time performance in multi-noise scenarios.[75] MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos
Qiyue Sun,Tailin Chen,Yinghui Zhang,Yuchen Zhang,Jiangbei Yue,Jianbo Jiao,Zeyu Fu
Main category: cs.CV
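TL;DR: Proposes MultiHateLoc, the first framework for weakly-supervised temporal localisation of multimodal hate content; using modality-aware temporal encoding, dynamic cross-modal fusion, and contrastive alignment, it achieves fine-grained frame-level localisation from video-level labels alone and significantly improves multimodal hate content detection.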
Details
Motivation: Existing research concentrates on video-level classification and does not pinpoint when hateful content occurs; with only weak video-level labels in particular, conventional methods struggle to capture cross-modal and temporal dynamics. Method: Proposes the MultiHateLoc framework: (1) modality-aware temporal encoders that model heterogeneous sequential patterns, including a text preprocessing module; (2) a dynamic cross-modal fusion mechanism and a cross-modal contrastive alignment strategy; (3) a modality-aware multiple-instance learning objective for identifying key segments under weak supervision. Result: Experiments on the HateMM and MultiHateClip datasets show that MultiHateLoc achieves state-of-the-art performance on the localisation task and produces fine-grained, interpretable frame-level predictions. Conclusion: MultiHateLoc addresses the temporal localisation of multimodal hate content under weak supervision and offers an efficient, interpretable way to automatically identify harmful video segments in practice. Abstract: The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion to adaptively emphasise the most informative modality at each moment and a cross-modal contrastive alignment strategy to enhance multimodal feature consistency; (3) a modality-aware MIL objective to identify discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.[76] Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction
Wenfei Guan,Jilin Mei,Tong Shen,Xumin Wu,Shuo Wang,Cheng Min,Yu Hu
Main category: cs.CV
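TL;DR: Proposes WildRoad, a new dataset for off-road road extraction, and MaGRoad, a path-centric framework, addressing the poor performance of existing models in non-urban environments caused by scarce data and structural weaknesses.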
Details
Motivation: Existing deep learning models have advanced urban road extraction but perform poorly in off-road environments, mainly because large-scale vectorized datasets are lacking and prevailing methods have structural weaknesses. Method: Releases WildRoad, a global off-road road network dataset, and introduces MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that infers connectivity by aggregating multi-scale visual evidence along candidate paths. Result: Experiments show that MaGRoad achieves state-of-the-art performance on the challenging WildRoad benchmark, generalizes well to urban datasets, and runs inference roughly 2.5x faster. Conclusion: The dataset and the path-centric paradigm provide a stronger foundation for mapping roads in the wild. Abstract: Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly. Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.[77] TransLocNet: Cross-Modal Attention for Aerial-Ground Vehicle Localization with Contrastive Learning
Phu Pham,Damon Conover,Aniket Bera
Main category: cs.CV
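TL;DR: TransLocNet is a cross-modal attention framework that addresses the large viewpoint and modality gaps in aerial-ground localization, achieving high-accuracy localization by fusing LiDAR geometry with aerial semantic context.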
Details
Motivation: The large viewpoint and modality gaps between ground-level LiDAR and overhead imagery make precise aerial-ground localization difficult for conventional methods. Method: Proposes TransLocNet, which projects LiDAR scans into a bird's-eye-view representation and aligns them with aerial features through bidirectional attention; a contrastive learning module builds a shared embedding space, and a likelihood map decoder outputs probability distributions over position and orientation. Result: Experiments on CARLA and KITTI show that TransLocNet reduces localization error by up to 63% compared with state-of-the-art methods and reaches sub-meter, sub-degree accuracy. Conclusion: TransLocNet improves cross-modal alignment, shows strong robustness and generalization in both synthetic and real-world settings, and is suited to high-accuracy aerial-ground localization. Abstract: Aerial-ground localization is difficult due to large viewpoint and modality gaps between ground-level LiDAR and overhead imagery. We propose TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context. LiDAR scans are projected into a bird's-eye-view representation and aligned with aerial features through bidirectional attention, followed by a likelihood map decoder that outputs spatial probability distributions over position and orientation. A contrastive learning module enforces a shared embedding space to improve cross-modal alignment. Experiments on CARLA and KITTI show that TransLocNet outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy. These results demonstrate that TransLocNet provides robust and generalizable aerial-ground localization in both synthetic and real-world settings.[78] Neural Collapse in Test-Time Adaptation
Xiao Chen,Zhongjing Du,Jiazhen Huang,Xu Jiang,Li Lu,Jingyan Jiang,Zhi Wang
Main category: cs.CV
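TL;DR: Proposes NCTTA, a new test-time adaptation method based on the Sample-wise Alignment Collapse phenomenon (NC3+); hybrid targets mitigate unreliable pseudo-labels and significantly improve robustness on out-of-distribution data.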
Details
Motivation: Existing TTA methods lack a theoretical understanding of the root cause of performance degradation under domain shift, and pseudo-labels become unreliable under large distribution shifts, which limits adaptation. Method: Introduces sample-wise neural collapse (NC3+) to analyze the alignment between features and classifier weights, and proposes NCTTA, which aligns features with the classifier using hybrid targets that blend geometric proximity with predictive confidence. Result: NCTTA significantly outperforms existing methods on multiple benchmarks, for example exceeding Tent by 14.52% on ImageNet-C. Conclusion: Sample-wise alignment collapse is a key factor in TTA performance, and NCTTA strengthens robustness to distribution shift by mitigating unreliable pseudo-labels. Abstract: Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights for TTA. In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample's feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. Building on NC3+, we identify that the performance degradation stems from sample-wise misalignment in adaptation which exacerbates under larger distribution shifts. This indicates the necessity of realigning the feature embeddings with their corresponding classifier weights. However, the misalignment makes pseudo-labels unreliable under domain shifts. To address this challenge, we propose NCTTA, a novel feature-classifier alignment method with hybrid targets to mitigate the impact of unreliable pseudo-labels, which blends geometric proximity with predictive confidence. Extensive experiments demonstrate the effectiveness of NCTTA in enhancing robustness to domain shifts. For example, NCTTA outperforms Tent by 14.52% on ImageNet-C.[79] An M-Health Algorithmic Approach to Identify and Assess Physiotherapy Exercises in Real Time
Stylianos Kandylakis,Christos Orfanopoulos,Georgios Siolas,Panayiotis Tsanakas
Main category: cs.CV
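TL;DR: Proposes an efficient algorithmic framework for real-time identification, classification, and assessment of physiotherapy exercises on mobile devices, using pose estimation and dynamic programming to match movement sequences and detect errors.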
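The dynamic-programming sequence matching mentioned in the summary can be illustrated with the standard Levenshtein distance over pose labels. This entry describes a modified variant whose details are not given here, so the sketch uses the textbook algorithm; the pose names, the `collapse_repeats` helper, and the example sequences are assumptions.

```python
def collapse_repeats(labels):
    """Merge consecutive identical frame labels into one pose per run."""
    collapsed = []
    for label in labels:
        if not collapsed or collapsed[-1] != label:
            collapsed.append(label)
    return collapsed

def levenshtein(observed, reference):
    """Standard edit distance between two pose sequences (unit costs)."""
    rows, cols = len(observed) + 1, len(reference) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i            # delete every observed pose
    for j in range(cols):
        dist[0][j] = j            # insert every reference pose
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if observed[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # extra pose performed
                             dist[i][j - 1] + 1,          # prescribed pose skipped
                             dist[i - 1][j - 1] + cost)   # match or wrong pose
    return dist[-1][-1]

# Prescribed squat and a shallow attempt that never reaches the bottom pose.
reference = ["stand", "half_squat", "squat", "half_squat", "stand"]
frames = ["stand", "stand", "half_squat", "half_squat", "stand"]

observed = collapse_repeats(frames)
errors = levenshtein(observed, reference)
print(observed, errors)  # ['stand', 'half_squat', 'stand'] 2
```

Tracing back through the `dist` table (not shown) would reveal which reference poses were skipped, which is how a scheme of this kind can localize inaccuracies instead of only counting them.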
Details
Motivation: Supporting remote physiotherapy supervision and mobile health applications requires real-time, accurate, low-latency movement recognition and assessment on mobile devices. Method: Decomposes a movement into a sequence of static poses, extracts keypoints with a pose-estimation neural network, converts them into trigonometric angle-based features, and classifies each frame with lightweight models; a modified Levenshtein distance algorithm with dynamic programming then matches sequences to recognize complete movements and localize deviations. Result: The system runs in real time on the client side, and experiments confirm the effectiveness of the method for movement recognition and error detection. Conclusion: The framework is suited to remote rehabilitation monitoring in mobile settings and offers good scalability and practicality. Abstract: This work presents an efficient algorithmic framework for real-time identification, classification, and evaluation of human physiotherapy exercises using mobile devices. The proposed method interprets a kinetic movement as a sequence of static poses, which are estimated from camera input using a pose-estimation neural network. Extracted body keypoints are transformed into trigonometric angle-based features and classified with lightweight supervised models to generate frame-level pose predictions and accuracy scores. To recognize full exercise movements and detect deviations from prescribed patterns, we employ a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and localization of inaccuracies. The system operates entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the effectiveness of the methodology and highlights its applicability to remote physiotherapy supervision and m-health applications.[80] Error-Propagation-Free Learned Video Compression With Dual-Domain Progressive Temporal Alignment
Han Li,Shaohui Li,Wenrui Dai,Chenglin Li,Xinlong Pan,Haipeng Wang,Junni Zou,Hongkai Xiong
Main category: cs.CV
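TL;DR: Proposes a novel unified-transform framework for learned video compression that combines dual-domain progressive temporal alignment with a quality-conditioned mixture-of-experts (QCMoE) module; it addresses error propagation and inaccurate alignment in motion estimation and compensation, and achieves error-propagation-free continuous bit-rate adaptation while maintaining high rate-distortion performance.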
Details
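Motivation: Existing learned video compression frameworks trade temporal alignment accuracy against error propagation: separate-transform frameworks perform well but propagate errors noticeably, while unified-transform frameworks avoid propagation at the cost of motion compensation accuracy. The paper aims to design a framework that offers both accurate temporal modeling and freedom from error propagation. Method: Proposes a dual-domain progressive temporal alignment mechanism that first performs coarse alignment in the pixel domain for simple motion, then uses multiple reference frames and a Flow-Guided Deformable Transformer (FGDT) in the latent domain for long-term motion refinement of complex motion; a QCMoE module dynamically adjusts per-pixel quantization steps according to target quality and content, giving continuous and consistent rate control. Result: Experiments show that the method achieves rate-distortion performance competitive with state-of-the-art methods while eliminating error propagation, supporting quality-consistent video streaming without accumulated error. Conclusion: Through dual-domain alignment and the QCMoE module, the proposed unified-transform framework balances temporal modeling accuracy against error propagation control and offers a high-performance, scalable solution for learned video compression. Abstract: Existing frameworks for learned video compression suffer from a dilemma between inaccurate temporal alignment and error propagation for motion estimation and compensation (ME/MC). The separate-transform framework employs distinct transforms for intra-frame and inter-frame compression to yield impressive rate-distortion (R-D) performance but causes evident error propagation, while the unified-transform framework eliminates error propagation via shared transforms but is inferior in ME/MC in shared latent domains. To address this limitation, in this paper, we propose a novel unified-transform framework with dual-domain progressive temporal alignment and quality-conditioned mixture-of-experts (QCMoE) to enable quality-consistent and error-propagation-free streaming for learned video compression. Specifically, we propose dual-domain progressive temporal alignment for ME/MC that leverages coarse pixel-domain alignment and refined latent-domain alignment to significantly enhance temporal context modeling in a coarse-to-fine fashion. The coarse pixel-domain alignment efficiently handles simple motion patterns with optical flow estimated from a single reference frame, while the refined latent-domain alignment develops a Flow-Guided Deformable Transformer (FGDT) over latents from multiple reference frames to achieve long-term motion refinement (LTMR) for complex motion patterns. 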
Furthermore, we design a QCMoE module for continuous bit-rate adaptation that dynamically assigns different experts to adjust quantization steps per pixel based on target quality and content rather than relying on a single quantization step. QCMoE allows continuous and consistent rate control with appealing R-D performance. Experimental results show that the proposed method achieves competitive R-D performance compared with state-of-the-art methods, while successfully eliminating error propagation.[81] Robust Shape from Focus via Multiscale Directional Dilated Laplacian and Recurrent Network
Khurram Ashfaq,Muhammad Tariq Mahmood
Main category: cs.CV
TL;DR: 提出了一种混合框架的Shape-from-Focus方法,通过手工设计的多尺度DDL核计算聚焦体,并结合轻量级多尺度GRU网络和学习上采样模块,实现高效、高精度的深度估计。
Details
Motivation: 现有基于深度学习的SFF方法依赖重型编码器提取聚焦体,且后续深度估计易引入伪影和噪声,需更高效鲁棒的方法。 Method: 使用方向扩张拉普拉斯(DDL)核提取多尺度聚焦体,输入轻量多尺度GRU网络迭代优化低分辨率深度估计,并通过可学习凸上采样恢复高分辨率深度图。 Result: 在合成和真实数据集上均优于现有SOTA方法,具有更高精度、更强泛化能力及更好细节保持。 Conclusion: 所提混合方法在保持计算效率的同时显著提升深度估计质量,为SFF任务提供了有效解决方案。 Abstract: Shape-from-Focus (SFF) is a passive depth estimation technique that infers scene depth by analyzing focus variations in a focal stack. Most recent deep learning-based SFF methods typically operate in two stages: first, they extract focus volumes (a per pixel representation of focus likelihood across the focal stack) using heavy feature encoders; then, they estimate depth via a simple one-step aggregation technique that often introduces artifacts and amplifies noise in the depth map. To address these issues, we propose a hybrid framework. Our method computes multi-scale focus volumes traditionally using handcrafted Directional Dilated Laplacian (DDL) kernels, which capture long-range and directional focus variations to form robust focus volumes. These focus volumes are then fed into a lightweight, multi-scale GRU-based depth extraction module that iteratively refines an initial depth estimate at a lower resolution for computational efficiency. Finally, a learned convex upsampling module within our recurrent network reconstructs high-resolution depth maps while preserving fine scene details and sharp boundaries. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach outperforms state-of-the-art deep learning and traditional methods, achieving superior accuracy and generalization across diverse focal conditions.[82] 3D Blood Pulsation Maps
Maurice Rohr,Tobias Reinhardt,Tizian Dege,Justus Thies,Christoph Hoog Antink
Main category: cs.CV
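TL;DR: Pulse3DFace is the first dataset for estimating 3D blood pulsation maps; it supports the development and validation of remote pulse estimation algorithms and encourages research into new multi-view methods for mitigating illumination effects.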
Details
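Motivation: Existing remote photoplethysmography imaging (rPPG) methods perform poorly under complex illumination, and there is no standardized 3D dataset for modeling dynamic facial blood pulsation, which limits algorithm training and evaluation. Method: Records 30 Hz RGB videos of 15 subjects from 23 viewpoints, together with blood pulse reference measurements and facial 3D scans generated with monocular structure-from-motion (SfM) techniques; processing yields 3D pulsation maps compatible with the texture space of the FLAME 3D head model, containing signal-to-noise ratio, local pulse amplitude, phase, and other information. Result: Builds the Pulse3DFace dataset of raw videos, 3D scans, and high-quality 3D pulsation maps, and validates it in terms of illumination conditions, map consistency, and the capture of physiological features, showing that it reflects physiological signals in the facial and neck skin regions. Conclusion: Pulse3DFace provides a new resource for modeling dynamic facial blood flow, can support synthetic data generation, multi-view rPPG algorithm development, and illumination robustness research, and fills the gap in 3D blood pulsation datasets. Abstract: We present Pulse3DFace, the first dataset of its kind for estimating 3D blood pulsation maps. These maps can be used to develop models of dynamic facial blood pulsation, enabling the creation of synthetic video data to improve and validate remote pulse estimation methods via photoplethysmography imaging. Additionally, the dataset facilitates research into novel multi-view-based approaches for mitigating illumination effects in blood pulsation analysis. Pulse3DFace consists of raw videos from 15 subjects recorded at 30 Hz with an RGB camera from 23 viewpoints, blood pulse reference measurements, and facial 3D scans generated using monocular structure-from-motion techniques. It also includes processed 3D pulsation maps compatible with the texture space of the 3D head model FLAME. These maps provide signal-to-noise ratio, local pulse amplitude, phase information, and supplementary data. We offer a comprehensive evaluation of the dataset's illumination conditions, map consistency, and its ability to capture physiologically meaningful features in the facial and neck skin regions.[83] Take a Peek: Efficient Encoder Adaptation for Few-Shot Semantic Segmentation via LoRA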
Pasquale De Marinis,Gennaro Vessio,Giovanna Castellano
Main category: cs.CV
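TL;DR: This paper proposes Take a Peek (TaP), which fine-tunes the encoder with Low-Rank Adaptation (LoRA) to improve few-shot semantic segmentation (FSS) and cross-domain FSS. The method has low computational overhead, integrates seamlessly into existing frameworks, and significantly improves segmentation on multiple benchmarks, especially in complex multi-class scenarios.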
Details
Motivation: The encoders of existing FSS methods generalize poorly when extracting features for unseen classes, which becomes a performance bottleneck. This work aims to strengthen the encoder's adaptation to novel classes while avoiding catastrophic forgetting. Method: Proposes TaP, which uses LoRA to quickly fine-tune the encoder on the support set, achieving model adaptation at low computational cost. The method does not depend on a specific architecture, is model-agnostic, and applies broadly to different FSS systems. Result: Experiments on COCO 20^i, Pascal 5^i, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray show that TaP consistently improves performance across models and shot settings, with particularly large gains in multi-class scenarios; it remains effective at low rank, confirming its computational efficiency and robustness. Conclusion: TaP effectively addresses the encoder's limited generalization to novel classes in FSS, improving robustness, efficiency, and generality, and points toward more practical few-shot segmentation systems. Abstract: Few-shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce Take a Peek (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross-domain FSS (CD-FSS). TaP leverages Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model-agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks, including COCO 20^i, Pascal 5^i, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray, demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi-class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low-rank adaptations, ensuring computational efficiency. By addressing a critical limitation in FSS, namely the encoder's generalization to novel classes, TaP paves the way toward more robust, efficient, and generalizable segmentation systems. 
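As a rough illustration of the low-rank adaptation idea that TaP builds on, the sketch below applies a LoRA-style update, W + (alpha / r) * B * A, to a single frozen weight matrix in plain Python. The function names, the scaling factor, and the toy matrices are illustrative assumptions and are not taken from the TaP code; only the small factors A and B would be trained on the support set, while W stays fixed.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha=1.0):
    """Apply a linear layer whose frozen weight W gets a low-rank update.

    Shapes: x is (batch, in), w_frozen is (out, in),
    lora_a is (r, in), lora_b is (out, r).
    Computes x @ (W + (alpha / r) * B @ A)^T without modifying W.
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # low-rank update, shape (out, in)
    w_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(w_frozen, delta)]
    w_eff_t = [list(col) for col in zip(*w_eff)]  # transpose to (in, out)
    return matmul(x, w_eff_t)


def trainable_params(out_dim, in_dim, rank):
    """Return (full fine-tuning count, LoRA count) for one weight matrix."""
    return out_dim * in_dim, rank * (in_dim + out_dim)


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]      # frozen identity weight (toy example)
    a = [[1.0, 1.0]]                  # rank-1 factor A
    b_zero = [[0.0], [0.0]]           # B initialised to zero: no change yet
    print(lora_forward([[2.0, 3.0]], w, a, b_zero))
    print(trainable_params(768, 768, 4))
```

Initialising B to zero means the adapted layer starts out identical to the frozen one, which is one common reason low-rank updates are considered gentle on previously learned features. For a hypothetical 768 x 768 projection at rank 4, the trainable parameter count drops from 589,824 to 6,144.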
The code is available at https://github.com/pasqualedem/TakeAPeek.[84] Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
Yuchen Feng,Zhenyu Zhang,Naibin Gu,Yilong Chen,Peng Fu,Zheng Lin,Shuohuan Wang,Yu Sun,Hua Wu,Weiping Wang,Haifeng Wang
Main category: cs.CV
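TL;DR: This paper proposes Blink, a dynamic visual token resolution framework that emulates the human "blink-like" attention process to enhance the visual perception of multimodal large language models.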
Details
Motivation: 受人类在复杂场景中通过动态扫描和聚焦显著区域进行高效感知的启发,研究MLLM是否具有类似行为,并探索提升其视觉感知的方法。 Method: 提出Blink框架,包含显著性引导扫描和动态令牌分辨率两个模块;基于注意力图估计每层视觉令牌的显著性,并通过即插即用的TokenSR模块扩展重要令牌,在后续层中丢弃失去关注的扩展令牌。 Result: 实验表明,Blink能有效增强多模态大语言模型的视觉感知与理解能力,实现更高效的细粒度聚焦与广泛探索的平衡。 Conclusion: Blink通过模拟人类视觉注意机制,在单次前向传播中实现了自适应且高效的视觉感知增强,为提升MLLM的视觉理解提供了新思路。 Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.[85] Grounding Everything in Tokens for Multimodal Large Language Models
Xiangxuan Ren,Zhongdao Wang,Liping Hou,Pin Tang,Guoqing Wang,Chao Ma
Main category: cs.CV
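TL;DR: Proposes GETok, which introduces learnable grid and offset tokens to improve the ability of multimodal large language models to localize objects in 2D space without changing the autoregressive architecture.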
Details
Motivation: Because existing multimodal large language models use an autoregressive Transformer architecture and image tokenization, they are limited in precise object localization and struggle with 2D spatial reasoning. Method: Designs GETok, which uses grid tokens to partition the image plane into spatial anchors and offset tokens to iteratively refine localization predictions, embedding spatial relationships directly in tokens. Result: Experiments show that GETok outperforms existing state-of-the-art methods on various referring tasks, in both supervised fine-tuning and reinforcement learning settings. Conclusion: GETok effectively improves the native 2D spatial reasoning of MLLMs without modifying the autoregressive architecture. Abstract: Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization of input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.[86] Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks
Meher Md Saad
Main category: cs.CV
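TL;DR: Proposes a skeleton-based few-shot prototypical network framework that combines ST-GCN with a Multi-Scale Temporal Aggregation module and uses metric learning to improve isolated sign language recognition under data scarcity and long-tail distributions; it clearly outperforms standard classification on WLASL and shows good zero-shot transfer.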
Details
Motivation: Because of data scarcity and the long-tail distribution of sign vocabulary, standard classification performs poorly in isolated sign language recognition and generalizes badly to rare classes; a method that uses limited data effectively and improves recognition of rare classes is needed. Method: Proposes a few-shot prototypical network with a skeleton-based encoder, trained episodically to learn a semantic metric space; ST-GCN extracts spatiotemporal features, a Multi-Scale Temporal Aggregation (MSTA) module captures both rapid and fluid motion dynamics, and classification is based on distance to class prototypes. Result: Achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the WLASL dataset, over 13% above a classification baseline with the same backbone; achieves nearly 30% zero-shot accuracy on the unseen SignASL dataset. Conclusion: The metric learning paradigm clearly outperforms standard classification in data-scarce settings, generalizes better, and offers a scalable solution for recognizing large sign vocabularies. Abstract: Isolated Sign Language Recognition (ISLR) is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. However, robust ISLR is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary, where gathering sufficient examples for thousands of unique signs is prohibitively expensive. Standard classification approaches struggle under these conditions, often overfitting to frequent classes while failing to generalize to rare ones. To address this bottleneck, we propose a Few-Shot Prototypical Network framework adapted for a skeleton-based encoder. Unlike traditional classifiers that learn fixed decision boundaries, our approach utilizes episodic training to learn a semantic metric space where signs are classified based on their proximity to dynamic class prototypes. We integrate a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics. Experimental results on the WLASL dataset demonstrate the superiority of this metric learning paradigm: our model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. Crucially, this outperforms a standard classification baseline sharing the identical backbone architecture by over 13%, showing that the prototypical training strategy remains effective in data-scarce settings where standard classification fails. 
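The prototype-based classification step described above can be sketched in a few lines, assuming the encoder has already turned each clip into an embedding vector. The function names, the squared Euclidean distance, and the toy two-dimensional embeddings are illustrative assumptions; the ST-GCN encoder and MSTA module are not modelled here.

```python
import math


def prototypes(support):
    """Average the support embeddings of each class into one prototype.

    support: dict mapping a class label to a list of embedding vectors.
    """
    protos = {}
    for label, vecs in support.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return protos


def classify(query, protos):
    """Return the nearest-prototype label and a softmax over negative distances."""
    d2 = {label: sum((q - p) ** 2 for q, p in zip(query, proto))
          for label, proto in protos.items()}
    d_min = min(d2.values())  # subtract the minimum for numerical stability
    exps = {label: math.exp(-(d - d_min)) for label, d in d2.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return min(d2, key=d2.get), probs


if __name__ == "__main__":
    # Hypothetical 2-way, 2-shot episode with made-up embeddings.
    support = {
        "hello": [[0.0, 0.0], [2.0, 0.0]],
        "thanks": [[10.0, 10.0], [12.0, 10.0]],
    }
    label, probs = classify([1.0, 1.0], prototypes(support))
    print(label, probs)
```

Because the prototypes are recomputed from whatever support examples are supplied, a new sign can be added at inference time by providing a few embeddings for it, with no change to a fixed output layer.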
Furthermore, the model exhibits strong zero-shot generalization, achieving nearly 30% accuracy on the unseen SignASL dataset without fine-tuning, offering a scalable pathway for recognizing extensive sign vocabularies with limited data.[87] Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
Haojie Zheng,Shuchen Weng,Jingqi Liu,Siqi Yang,Boxin Shi,Xinlong Wang
Main category: cs.CV
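TL;DR: This paper proposes AVI-Edit, a framework for audio-sync video instance editing that achieves high-quality audio-visual synchronized editing through fine-grained spatial and temporal control.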
Details
Motivation: 现有视频编辑方法忽视了音视频同步,并缺乏对实例级别编辑所需的细粒度时空可控性。 Method: 提出了一个粒度感知的掩码优化器以精确定位实例区域,并设计了一个自反馈音频代理来提供高质量的时间控制音频引导;同时构建了一个大规模、以实例为中心且具有全面标注的数据集。 Result: 实验表明,AVI-Edit在视觉质量、条件遵循和音视频同步方面优于现有的最先进方法。 Conclusion: AVI-Edit有效实现了音频同步的视频实例编辑,在生成逼真且协调的视听内容方面表现出优越性能。 Abstract: Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Project page: https://hjzheng.net/projects/AVI-Edit/.[88] Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration
Wenlong Jiao,Heyang Lee,Ping Wang,Pengfei Zhu,Qinghua Hu,Dongwei Ren
Main category: cs.CV
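TL;DR: This paper proposes SymUNet, a simple yet efficient all-in-one image restoration framework based on a symmetric U-Net that captures degradation information through well-designed feature extraction and cross-scale propagation without complex structures. It further introduces a semantic-enhanced variant, SE-SymUNet, which uses frozen CLIP features to improve performance, achieving better results on multiple benchmarks at lower computational cost.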
Details
Motivation: 现有全合一图像恢复方法依赖复杂的模型架构和退化提示策略,缺乏对基础网络结构有效性的充分探索。本文旨在验证简洁架构是否足以应对多种退化,从而为该领域建立更高效的基础模型。 Method: 提出对称U-Net架构SymUNet,通过对称设计和跨尺度特征对齐保留退化信号,并采用简单的跳跃连接融合;进一步设计SE-SymUNet,通过交叉注意力机制注入冻结的CLIP语义特征以增强退化先验。 Result: SymUNet和SE-SymUNet在多个图像恢复基准上均优于现有方法,同时显著降低计算开销,验证了对称结构与语义增强的有效性。 Conclusion: 精心设计的对称U-Net结构本身足以有效处理多种图像退化,无需依赖复杂机制,为全合一图像恢复提供了更简单、更强健的基础框架。 Abstract: All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across encoder-decoder and enabling streamlined cross-scale propagation, our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines SymUNet and SE-SymUNet establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at https://github.com/WenlongJiao/SymUNet.[89] Salient Object Detection in Complex Weather Conditions via Noise Indicators
Quan Chen,Xiaokai Yang,Tingyu Wang,Rongfeng Lu,Xichun Sheng,Yaoqi Sun,Chenggang Yan
Main category: cs.CV
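TL;DR: This paper proposes a salient object detection (SOD) framework for diverse weather conditions, introducing a Noise Indicator Fusion Module (NIFM) to improve segmentation accuracy in complex weather.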
Details
Motivation: 现有SOD方法多假设低噪声视觉环境,忽视了真实场景中天气引起的噪声对分割精度的影响,难以适应复杂天气条件。 Method: 提出一种包含特定编码器和可替换解码器的SOD框架,引入one-hot向量作为表示不同天气类型的噪声指示符,并设计噪声指示符融合模块(NIFM),将其嵌入编码器阶段之间,通过自适应特征调制融入天气感知先验。 Result: 在WXSOD数据集上进行了大量实验,涵盖不同训练数据规模(100%、50%、30%)以及多种编码器与解码器组合,结果表明所提框架(尤其是增强NIFM的编码器)相比基础编码器在复杂天气下提升了分割精度。 Conclusion: 该框架通过引入天气感知的先验信息,在不同天气条件下均能有效提升SOD性能,且具有良好的兼容性,适用于主流SOD解码器。 Abstract: Salient object detection (SOD), a foundational task in computer vision, has advanced from single-modal to multi-modal paradigms to enhance generalization. However, most existing SOD methods assume low-noise visual conditions, overlooking the degradation of segmentation accuracy caused by weather-induced noise in real-world scenarios. In this paper, we propose a SOD framework tailored for diverse weather conditions, encompassing a specific encoder and a replaceable decoder. To enable handling of varying weather noises, we introduce a one-hot vector as a noise indicator to represent different weather types and design a Noise Indicator Fusion Module (NIFM). The NIFM takes both semantic features and the noise indicator as dual inputs and is inserted between consecutive stages of the encoder to embed weather-aware priors via adaptive feature modulation. Critically, the proposed specific encoder retains compatibility with mainstream SOD decoders. Extensive experiments are conducted on the WXSOD dataset under varying training data scales (100%, 50%, 30% of the full training set), three encoder and seven decoder configurations. Results show that the proposed SOD framework (particularly the NIFM-enhanced specific encoder) improves segmentation accuracy under complex weather conditions compared to a vanilla encoder.[90] Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval
J. Xiao,Y. Guo,X. Zi,K. Thiyagarajan,C. Moreira,M. Prasad
Main category: cs.CV
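TL;DR: Proposes TRSLLaVA, a training-free text-to-text remote sensing image retrieval method, and builds the RSRT dataset with rich text annotations; experiments show that in the zero-shot setting it significantly outperforms the CLIP baseline and is competitive with supervised models.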
Details
Motivation: Existing semantic retrieval methods for remote sensing images depend on costly domain-specific training, and there is no benchmark for evaluating how useful VLM-generated text is for zero-shot retrieval; a training-free paradigm based on high-quality text descriptions is therefore needed to bridge the semantic gap. Method: Builds the Remote Sensing Rich Text (RSRT) dataset with multiple structured captions per image and recasts cross-modal retrieval as a pure text matching task: VLM-generated captions form the database, natural language descriptions serve as queries in a unified text embedding space, and no model training or fine-tuning is required. Result: On the RSITMD and RSICD benchmarks the method performs strongly in the zero-shot setting; for example, on RSITMD it reaches a mean Recall of 42.62%, far above the 23.86% of the standard zero-shot CLIP baseline, and surpasses several top supervised models. Conclusion: High-quality structured text combined with a training-free text-to-text matching paradigm provides an efficient, low-cost solution for remote sensing image retrieval. Abstract: Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. 
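To make the text-to-text formulation concrete, the sketch below ranks a small caption database against a text query by cosine similarity and scores the result with Recall@K. It is only a toy under stated assumptions: a bag-of-words count vector stands in for the sentence embedding model, the captions and queries are invented, and the function names are hypothetical rather than part of TRSLLaVA.

```python
import math
from collections import Counter


def embed(text):
    """Toy text embedding: lowercase bag-of-words counts (an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, captions):
    """Return caption indices sorted from most to least similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), i) for i, c in enumerate(captions)]
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]


def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target image appears in the top-k results."""
    hits = sum(1 for r, t in zip(rankings, targets) if t in r[:k])
    return hits / len(targets)


if __name__ == "__main__":
    # Each caption stands for one image in the database (made-up examples).
    captions = [
        "an airport with several planes on the runway",
        "a dense forest next to a river",
        "a harbor with many boats docked",
    ]
    queries = ["planes parked on an airport runway", "boats in a harbor"]
    targets = [0, 2]
    rankings = [rank(q, captions) for q in queries]
    print(rankings, recall_at_k(rankings, targets, 1))
```

Since both queries and database entries are text, the only model-dependent pieces are the captioner that produces the database and the text encoder, which is what allows the pipeline to run without any retrieval-specific training.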
This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.[91] Track and Caption Any Motion: Query-Free Motion Discovery and Description in Videos
Bishoy Galoaa,Sarah Ostadabbas
Main category: cs.CV
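TL;DR: Proposes TCAM, a motion-centric framework that automatically discovers and describes motion patterns in videos without user queries, aligning language descriptions with motion trajectories through a motion-field attention mechanism.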
Details
Motivation: 在遮挡、伪装或快速移动等复杂条件下,视频理解更依赖于运动动态而非静态外观,现有方法通常需要用户查询,缺乏自主发现多运动模式的能力。 Method: 提出TCAM框架,利用运动场注意力机制,将运动模式与对比视觉-语言表示对齐,通过全局视频-文本对齐和细粒度空间对应联合训练,使用多头交叉注意力实现无需查询的多运动表达发现。 Result: 在MeViS基准上,TCAM实现了58.4%的视频到文本检索准确率,空间定位JF为64.9,每段视频平均发现4.8个相关表达,精度达84.7%。 Conclusion: TCAM通过运动与语言的对齐,实现了强大的跨任务泛化能力,能够在无查询条件下有效发现并描述视频中的多种运动模式。 Abstract: We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos in challenging conditions like occlusion, camouflage, or rapid movement often depends more on motion dynamics than static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval, 64.9 JF for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.[92] Robust Multi-Disease Retinal Classification via Xception-Based Transfer Learning and W-Net Vessel Segmentation
Mohammad Sadegh Gholizadeh,Amir Arsalan Rezapour
Main category: cs.CV
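TL;DR: This paper proposes a deep learning framework that combines deep feature extraction with interpretable image processing modules for automated diagnosis of eye diseases, using retinal vessel segmentation as an auxiliary task for classification to improve interpretability and clinical applicability.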
Details
Motivation: 由于标准卷积神经网络缺乏可解释性,限制了其在临床中的应用,因此需要构建更具透明性和可信度的自动化眼科诊断系统。 Method: 采用深度学习架构进行特征提取,并引入高保真视网膜血管分割作为辅助任务,结合可解释的图像处理模块来指导疾病分类过程。 Result: 模型能够基于临床相关的形态学特征生成更可解释的预测结果,减少了假阳性率,提升了在真实临床环境中的部署潜力。 Conclusion: 通过融合可解释模块与深度学习,该方法在保持高准确率的同时增强了模型透明度,有助于推动AI在眼科筛查中的临床转化。 Abstract: In recent years, the incidence of vision-threatening eye diseases has risen dramatically, necessitating scalable and accurate screening solutions. This paper presents a comprehensive study on deep learning architectures for the automated diagnosis of ocular conditions. To mitigate the "black-box" limitations of standard convolutional neural networks (CNNs), we implement a pipeline that combines deep feature extraction with interpretable image processing modules. Specifically, we focus on high-fidelity retinal vessel segmentation as an auxiliary task to guide the classification process. By grounding the model's predictions in clinically relevant morphological features, we aim to bridge the gap between algorithmic output and expert medical validation, thereby reducing false positives and improving deployment viability in clinical settings.[93] Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces
Bishoy Galoaa,Xiangyu Bai,Sarah Ostadabbas
Main category: cs.CV
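TL;DR: This paper proposes Lang2Motion, a framework that generates language-guided point trajectories by aligning motion manifolds with joint embedding spaces. It uses motion trajectories extracted from real videos, supports trajectory generation for arbitrary objects, and significantly outperforms existing methods in text-to-trajectory retrieval and motion accuracy.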
Details
Motivation: 现有的工作主要集中在人类动作或视频生成上,缺乏对任意对象语言引导轨迹生成的研究。因此,需要一个能够从自然语言描述中生成精确轨迹的新框架。 Method: 提出基于Transformer的自编码器模型,利用CLIP的冻结编码器对文本描述和轨迹可视化进行双重监督学习,实现文本与轨迹的联合嵌入空间对齐。 Result: 在文本到轨迹检索任务中达到34.2%的Recall@1,比视频方法提高12.5点;运动准确性提升33-52%(ADE为12.4 vs 18.3-25.3);在人类动作识别上取得88.3% Top-1准确率。 Conclusion: Lang2Motion能有效对齐语言与轨迹空间,支持跨物体和跨域的运动生成,并具备风格迁移、语义插值和潜在空间编辑能力,展现出强大的泛化性和应用潜力。 Abstract: We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.[94] DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM
Qintong Zhang,Junyuan Zhang,Zhifei Ren,Linke Ouyang,Zichen Wen,Junbo Niu,Yuan Qu,Bin Wang,Ka-Ho Chow,Conghui He,Wentao Zhang
Main category: cs.CV
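TL;DR: This paper proposes DOCR-Inspector, a fine-grained, vision language model (VLM)-based method for assessing document parsing quality. Using detection of 28 error types and the Chain-of-Checklist reasoning paradigm, it comprehensively evaluates document parsing results in real-world scenarios, and the DOCRcaseBench benchmark is built for validation.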
Details
Motivation: 现有文档解析评估依赖标准基准,但这些基准存在数据集偏差,模型排名与真实表现相关性低,且整体评分掩盖了具体的错误模式,难以可靠评估现实场景中的解析质量。 Method: 提出DOCR-Inspector,将文档解析评估形式化为细粒度错误检测任务,利用VLM-as-a-Judge分析图像与解析输出,识别并分类28种预定义错误类型;构建DOCRcase-200K训练数据,引入Chain-of-Checklist推理范式支持层次化质量评估。 Result: 在新构建的含882个真实案例的DOCRcaseBench基准上,DOCR-Inspector-7B优于Gemini 2.5 Pro等商用模型及主流开源模型,其评估结果能有效指导解析结果优化。 Conclusion: DOCR-Inspector实现了对文档解析质量的可靠、细粒度评估,不仅可作为实用评测工具,还能推动大规模文档解析系统的持续改进。 Abstract: Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis. Leveraging VLM-as-a-Judge, DOCR-Inspector analyzes a document image and its parsed output, identifies all errors, assigns them to one of 28 predefined types, and produces a comprehensive quality assessment. To enable this capability, we construct DOCRcase-200K for training and propose the Chain-of-Checklist reasoning paradigm to enable the hierarchical structure of parsing quality assessment. For empirical validation, we introduce DOCRcaseBench, a set of 882 real-world document parsing cases with manual annotations. 
On this benchmark, DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro, as well as leading open-source models. Further experiments demonstrate that its quality assessments provide valuable guidance for parsing results refinement, making DOCR-Inspector both a practical evaluator and a driver for advancing document parsing systems at scale. Model and code are released at: https://github.com/ZZZZZQT/DOCR-Inspector.[95] K-Track: Kalman-Enhanced Tracking for Accelerating Deep Point Trackers on Edge Devices
Bishoy Galoaa,Pau Closas,Sarah Ostadabbas
Main category: cs.CV
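TL;DR: K-Track is a general-purpose, tracker-agnostic acceleration framework that combines sparse deep learning keyframe updates with lightweight Kalman filtering for efficient video point tracking, enabling a 5-10x speedup on edge devices while retaining over 85% of the original accuracy.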
Details
Motivation: 现有的深度学习点跟踪器虽然精度高,但依赖逐帧GPU推理,难以在计算、功耗和连接受限的边缘设备上部署。 Method: 提出K-Track框架,采用稀疏关键帧上的深度学习更新,并在中间帧使用基于贝叶斯不确定性传播的卡尔曼滤波进行预测,以减少推理开销并保持时间一致性。 Result: 在多个最先进点跟踪器上验证,实现了5-10倍的速度提升,在NVIDIA Jetson Nano和RTX Titan等边缘平台上达到实时性能,同时保留超过85%的原始精度。 Conclusion: K-Track在不显著牺牲精度的前提下大幅降低计算需求,为在资源受限的实际场景中部署高质量点跟踪提供了可行方案,缩小了现代算法与可部署视觉系统之间的差距。 Abstract: Point tracking in video sequences is a foundational capability for real-world computer vision applications, including robotics, autonomous systems, augmented reality, and video analysis. While recent deep learning-based trackers achieve state-of-the-art accuracy on challenging benchmarks, their reliance on per-frame GPU inference poses a major barrier to deployment on resource-constrained edge devices, where compute, power, and connectivity are limited. We introduce K-Track (Kalman-enhanced Tracking), a general-purpose, tracker-agnostic acceleration framework designed to bridge this deployment gap. K-Track reduces inference cost by combining sparse deep learning keyframe updates with lightweight Kalman filtering for intermediate frame prediction, using principled Bayesian uncertainty propagation to maintain temporal coherence. This hybrid strategy enables 5-10X speedup while retaining over 85% of the original trackers' accuracy. We evaluate K-Track across multiple state-of-the-art point trackers and demonstrate real-time performance on edge platforms such as the NVIDIA Jetson Nano and RTX Titan. By preserving accuracy while dramatically lowering computational requirements, K-Track provides a practical path toward deploying high-quality point tracking in real-world, resource-limited settings, closing the gap between modern tracking algorithms and deployable vision systems.[96] TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
Jian-Yu Jiang-Lin,Kang-Yang Huang,Ling Zou,Ling Lo,Sheng-Ping Yang,Yu-Wen Tseng,Kun-Hsiang Lin,Chia-Ling Chen,Yu-Ting Ta,Yan-Tsung Wang,Po-Ching Chen,Hongxia Xie,Hong-Han Shuai,Wen-Huang Cheng
Main category: cs.CV
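TL;DR: TriDF is a comprehensive benchmark for interpretable DeepFake detection that covers 16 DeepFake types across image, video, and audio modalities and evaluates perception, detection, and hallucination, revealing the interdependence between detection accuracy and explanation reliability.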
Details
Motivation: 随着生成模型的发展,伪造个体内容的风险日益增加,亟需能够准确识别并提供可靠解释的DeepFake检测系统。 Method: 提出TriDF基准,包含高质量的先进合成模型生成的伪造数据,从感知(细粒度伪影识别)、检测(分类性能)和幻觉(解释可靠性)三个维度评估多模态大模型的表现。 Result: 实验表明,准确的感知对可靠检测至关重要,但模型幻觉会严重干扰决策,三者之间存在紧密关联。 Conclusion: TriDF为理解检测准确性、证据识别和解释可靠性之间的交互提供了统一框架,有助于构建应对合成媒体威胁的可信系统。 Abstract: Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.[97] NaviHydra: Controllable Navigation-guided End-to-end Autonomous Driving with Hydra-distillation
Hanfeng Wu,Marlon Steiner,Michael Schmidt,Alvaro Marcos-Ramiro,Christoph Stiller
Main category: cs.CV
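TL;DR: This paper proposes NaviHydra, a navigation-guided end-to-end autonomous driving model that is distilled from a rule-based simulator to respond precisely to high-level navigation commands; it introduces a BEV-based trajectory gathering method and a new navigation compliance metric, and reaches state-of-the-art performance on the NAVSIM benchmark.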
Details
Motivation: 传统基于规则的系统在动态环境中适应性差,而现有端到端方法难以严格遵循显式导航指令,因此需要一个兼具可控性与强建模能力的模型。 Method: 提出NaviHydra模型,采用从规则基线蒸馏的方式构建端到端框架;使用鸟瞰图(BEV)进行轨迹特征提取,并将高阶导航指令作为控制信号输入模型;设计新的导航合规性度量以评估轨迹对指令的遵循程度。 Result: 在NAVSIM基准上显著优于基线模型,实现了最先进的性能;通过设计的导航指令响应测试验证了模型良好的可控性和安全性。 Conclusion: NaviHydra通过知识蒸馏结合导航引导机制,在保证对高阶指令强遵从的同时提升了复杂场景下的轨迹生成质量,推动了可控端到端自动驾驶系统的发展。 Abstract: The complexity of autonomous driving scenarios requires robust models that can interpret high-level navigation commands and generate safe trajectories. While traditional rule-based systems can react to these commands, they often struggle in dynamic environments, and end-to-end methods face challenges in complying with explicit navigation commands. To address this, we present NaviHydra, a controllable navigation-guided end-to-end model distilled from an existing rule-based simulator. Our framework accepts high-level navigation commands as control signals, generating trajectories that align with specified intentions. We utilize a Bird's Eye View (BEV) based trajectory gathering method to enhance the trajectory feature extraction. Additionally, we introduce a novel navigation compliance metric to evaluate adherence to intended route, improving controllability and navigation safety. To comprehensively assess our model's controllability, we design a test that evaluates its response to various navigation commands. Our method significantly outperforms baseline models, achieving state-of-the-art results in the NAVSIM benchmark, demonstrating its effectiveness in advancing autonomous driving.[98] XDen-1K: A Density Field Dataset of Real-World Objects
Jingxuan Zhang,Tianqi Yu,Yatu Zhang,Jinze Wu,Kaixin Yao,Jingyang Liu,Yuyao Zhang,Jiayuan Gu,Jingyi Yu
Main category: cs.CV
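TL;DR: This paper presents XDen-1K, the first large-scale multi-modal real-world dataset focused on volumetric density estimation of objects, and proposes an optimization framework that recovers high-fidelity density fields from sparse X-ray views, validating its usefulness for center-of-mass estimation and robotic manipulation.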
Details
Motivation: 现有模型主要关注物体表面几何与外观,忽略了内部物理属性(如体密度),而这些属性对机器人操作和物理仿真至关重要;缺乏真实世界的大规模数据是主要瓶颈。 Method: 构建包含1000个真实物体的XDen-1K数据集,提供高分辨率3D模型与双平面X射线扫描,并提出一种新优化框架,从稀疏X射线视图中恢复体密度场;将X射线图像作为条件信号引入分割网络进行体积分割。 Result: 实验表明,利用该数据集可显著提升质心估计精度和机器人操作成功率。 Conclusion: XDen-1K为物理感知的视觉推理和具身AI提供了基础资源和挑战性基准,有望推动相关领域发展。 Abstract: A deep understanding of the physical world is a central goal for embodied AI and realistic simulation. While current models excel at capturing an object's surface geometry and appearance, they largely neglect its internal physical properties. This omission is critical, as properties like volumetric density are fundamental for predicting an object's center of mass, stability, and interaction dynamics in applications ranging from robotic manipulation to physical simulation. The primary bottleneck has been the absence of large-scale, real-world data. To bridge this gap, we introduce XDen-1K, the first large-scale, multi-modal dataset designed for real-world physical property estimation, with a particular focus on volumetric density. The core of this dataset consists of 1,000 real-world objects across 148 categories, for which we provide comprehensive multi-modal data, including a high-resolution 3D geometric model with part-level annotations and a corresponding set of real-world biplanar X-ray scans. Building upon this data, we introduce a novel optimization framework that recovers a high-fidelity volumetric density field of each object from its sparse X-ray views. To demonstrate its practical value, we add X-ray images as a conditioning signal to an existing segmentation network and perform volumetric segmentation. Furthermore, we conduct experiments on downstream robotics tasks. The results show that leveraging the dataset can effectively improve the accuracy of center-of-mass estimation and the success rate of robotic manipulation. 
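To show why a volumetric density field matters for the center-of-mass task mentioned above, the sketch below evaluates the discrete form of c = (sum of rho_i * x_i * dV) / (sum of rho_i * dV) over a voxel grid. The grid layout `density[z][y][x]`, the voxel-center convention, and the function name are assumptions made for illustration; they do not describe the XDen-1K data format.

```python
def center_of_mass(density, voxel_size=1.0):
    """Return (total_mass, (cx, cy, cz)) for a voxelized density field.

    density[z][y][x] holds the mass density of one voxel; voxel centers are
    assumed to sit at (index + 0.5) * voxel_size along each axis.
    """
    volume = voxel_size ** 3
    total = 0.0
    mx = my = mz = 0.0
    for z, plane in enumerate(density):
        for y, row in enumerate(plane):
            for x, rho in enumerate(row):
                mass = rho * volume
                total += mass
                mx += mass * (x + 0.5) * voxel_size
                my += mass * (y + 0.5) * voxel_size
                mz += mass * (z + 0.5) * voxel_size
    if total == 0.0:
        raise ValueError("density field has zero total mass")
    return total, (mx / total, my / total, mz / total)


if __name__ == "__main__":
    # Two voxels side by side along x; the right one is three times denser.
    grid = [[[1.0, 3.0]]]
    print(center_of_mass(grid))
```

With a uniform density the result would coincide with the geometric centroid at x = 1.0; here the denser voxel pulls it to x = 1.25, which is the kind of offset that surface geometry alone cannot reveal.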
We believe XDen-1K will serve as a foundational resource and a challenging new benchmark, catalyzing future research in physically grounded visual inference and embodied AI.[99] Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching
Javier Villena Toro,Mehdi Tarkian
Main category: cs.CV
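TL;DR: This paper proposes Geo6DPose, a lightweight, fully local, training-free zero-shot 6D pose estimation method that combines foundation model features with a geometric filtering strategy, achieving sub-second inference on a single commodity GPU with performance comparable to much larger models.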
Details
Motivation: Existing zero-shot 6D pose estimation relies on large-scale models and cloud-based inference, which brings high latency, high energy consumption, and deployment risks, and is hard to reconcile with robots' practical need for on-device inference under limited resources. Method: Combines foundation-model visual features with a geometric filtering strategy: DINO descriptors are used to compute similarity maps between templates and scene patches, 2D-3D correspondences are established by projection, poses are recovered with correspondence-driven RANSAC, and candidates are ranked with a weighted geometric alignment metric that accounts for reprojection consistency and spatial support. Result: Achieves sub-second inference at 1.08 FPS on a single commodity GPU with an average recall of 53.7 AR, on par with larger zero-shot methods; it needs no training, fine-tuning, or network connection and is compatible with evolving foundation backbones. Conclusion: Geo6DPose delivers efficient, reliable, fully local 6D pose estimation without sacrificing performance, advancing practical perception systems for real-world robotic deployment. Abstract: Recent progress in zero-shot 6D object pose estimation has been driven largely by large-scale models and cloud-based inference. However, these approaches often introduce high latency, elevated energy consumption, and deployment risks related to connectivity, cost, and data governance, factors that conflict with the practical constraints of real-world robotics, where compute is limited and on-device inference is frequently required. We introduce Geo6DPose, a lightweight, fully local, and training-free pipeline for zero-shot 6D pose estimation that trades model scale for geometric reliability. Our method combines foundation model visual features with a geometric filtering strategy: similarity maps are computed between onboarded template DINO descriptors and scene patches, and mutual correspondences are established by projecting scene patch centers to 3D and template descriptors to the object model coordinate system. Final poses are recovered via correspondence-driven RANSAC and ranked using a weighted geometric alignment metric that jointly accounts for reprojection consistency and spatial support, improving robustness to noise, clutter, and partial visibility. Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). It requires no training, fine-tuning, or network access, and remains compatible with evolving foundation backbones, advancing practical, fully local 6D perception for robotic deployment.[100] Optimal transport unlocks end-to-end learning for single-molecule localization
Romain Seailles,Jean-Baptiste Masson,Jean Ponce,Julien Mairal
Main category: cs.CV
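TL;DR: This paper proposes an end-to-end deep learning method for single-molecule localization microscopy (SMLM), based on an optimal-transport loss and an iterative neural network. It needs no non-maximum suppression (NMS) and outperforms existing techniques at moderate and high emitter densities.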
Details
Motivation: Existing SMLM methods rely on non-maximum suppression (NMS) layers, which are non-differentiable and may discard true signals; this limits performance under dense emissions and hampers live-cell imaging efficiency. Method: Reformulates the SMLM training objective as a set-matching problem, introduces an optimal-transport loss that removes the dependence on NMS, and designs an iterative neural network that incorporates knowledge of the microscope's optical system. Result: Experiments on synthetic benchmarks and real biological data show that the proposed method outperforms the state of the art at moderate and high emitter densities. Conclusion: The method enables differentiable, end-to-end SMLM reconstruction and improves localization accuracy and model performance, particularly for high-density fluorescence emission. Abstract: Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores -- fluorescent molecules stained onto the observed specimen -- over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at https://github.com/RSLLES/SHOT.[101] Sharp Monocular View Synthesis in Less Than a Second
Lars Mescheder,Wei Dong,Shiwei Li,Xuyang Bai,Marcel Santos,Peiyun Hu,Bruno Lecouat,Mingmin Zhen,Amaël Delaunoy,Tian Fang,Yanghai Tsin,Stephan R. Richter,Vladlen Koltun
Main category: cs.CV
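TL;DR: SHARP is a method for photorealistic view synthesis from a single image. A neural network quickly regresses a 3D Gaussian representation, enabling real-time rendering and metric camera movements, with performance well ahead of existing methods.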
Details
Motivation: To achieve high-quality, real-time view synthesis with metric scale from a single image, addressing the shortcomings of prior methods in speed, accuracy, and generalization. Method: Proposes SHARP, which uses a neural network to regress a 3D Gaussian scene representation of the input image in a single forward pass and supports real-time rendering and camera motion in metric space. Result: SHARP generalizes zero-shot across multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% and cutting synthesis time by three orders of magnitude. Conclusion: SHARP delivers efficient, high-quality, metrically accurate single-image view synthesis, advancing the field. Abstract: We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at https://github.com/apple/ml-sharp[102] CheXmask-U: Quantifying uncertainty in landmark-based anatomical segmentation for X-ray images
Matias Cosarinsky,Nicolas Gaggion,Rodrigo Echeveste,Enzo Ferrante
Main category: cs.CV
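TL;DR: This study proposes an uncertainty estimation method for anatomical landmark-based segmentation of chest X-rays. It uses the variational latent space of a hybrid neural network architecture to derive two complementary measures, latent uncertainty and predictive uncertainty, and shows experimentally that they help identify unreliable predictions and detect out-of-distribution inputs.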
Details
Motivation: Uncertainty estimation is essential for the safe clinical deployment of medical image segmentation systems, but prior work has focused mostly on pixel-level uncertainty; landmark-based segmentation has topological advantages, yet its uncertainty remains underexplored. Method: Uses a hybrid neural network architecture that combines a convolutional encoder with a graph-based generative decoder; from its variational latent space, latent uncertainty is extracted from the learned distribution parameters, and predictive uncertainty is obtained by generating outputs from multiple stochastic latent samples. Result: In controlled corruption experiments, both uncertainty measures rise with perturbation severity; they identify unreliable predictions, as validated against manual annotations, and support out-of-distribution detection; the authors release CheXmask-U, a large-scale dataset of 657,566 chest X-ray landmark segmentations, together with an interactive demo and source code. Conclusion: The study shows that uncertainty estimation can improve the robustness and safety of landmark-based anatomical segmentation in chest X-ray applications and provides resources and directions for future research. Abstract: Uncertainty estimation is essential for the safe clinical deployment of medical image segmentation systems, enabling the identification of unreliable predictions and supporting human oversight. While prior work has largely focused on pixel-level uncertainty, landmark-based segmentation offers inherent topological guarantees yet remains underexplored from an uncertainty perspective. In this work, we study uncertainty estimation for anatomical landmark-based segmentation on chest X-rays. Inspired by hybrid neural network architectures that combine standard image convolutional encoders with graph-based generative decoders, and leveraging their variational latent space, we derive two complementary measures: (i) latent uncertainty, captured directly from the learned distribution parameters, and (ii) predictive uncertainty, obtained by generating multiple stochastic output predictions from latent samples. Through controlled corruption experiments we show that both uncertainty measures increase with perturbation severity, reflecting both global and local degradation. We demonstrate that these uncertainty signals can identify unreliable predictions by comparing with manual ground-truth, and support out-of-distribution detection on the CheXmask dataset. More importantly, we release CheXmask-U (huggingface.co/datasets/mcosarinsky/CheXmask-U), a large scale dataset of 657,566 chest X-ray landmark segmentations with per-node uncertainty estimates, enabling researchers to account for spatial variations in segmentation quality when using these anatomical masks. 
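The two uncertainty measures described above can be illustrated with a toy example. This is only a sketch under stated assumptions: `toy_decoder` is a hypothetical stand-in for the graph-based decoder, averaging the latent standard deviations is one possible reading of "captured directly from the learned distribution parameters", and none of the names come from the CheXmask-U code.

```python
import random
import statistics


def toy_decoder(z):
    """Hypothetical decoder: maps a 2D latent vector to three (x, y) landmarks."""
    return [(z[0] + 0.1 * i, z[1] - 0.1 * i) for i in range(3)]


def latent_uncertainty(sigma):
    """Assumed summary of latent uncertainty: mean latent standard deviation."""
    return statistics.fmean(sigma)


def predictive_uncertainty(mu, sigma, decoder, n_samples=50, seed=0):
    """Per-node spread of landmarks decoded from stochastic latent samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        # Draw z ~ N(mu, diag(sigma^2)) and decode it to landmarks.
        z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        samples.append(decoder(z))
    per_node = []
    for node in range(len(samples[0])):
        xs = [s[node][0] for s in samples]
        ys = [s[node][1] for s in samples]
        per_node.append((statistics.pstdev(xs), statistics.pstdev(ys)))
    return per_node


narrow = predictive_uncertainty([0.0, 0.0], [0.1, 0.1], toy_decoder)
wide = predictive_uncertainty([0.0, 0.0], [1.0, 1.0], toy_decoder)
```

With the same seed, a wider latent distribution yields a larger per-node spread, which is the behaviour the corruption experiments rely on: less certain encodings produce less stable landmark predictions.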
Our findings establish uncertainty estimation as a promising direction to enhance robustness and safe deployment of landmark-based anatomical segmentation methods in chest X-ray. A fully working interactive demo of the method is available at huggingface.co/spaces/matiasky/CheXmask-U and the source code at github.com/mcosarinsky/CheXmask-U.[103] SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
Peizheng Li,Zhenghao Zhang,David Holtz,Hang Yu,Yutong Yang,Yuzhi Lai,Rui Song,Andreas Geiger,Andreas Zell
Main category: cs.CV
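TL;DR: This paper proposes SpaceDrive, a spatial-aware vision language model (VLM) framework for autonomous driving. By treating 3D spatial information as explicit positional encodings, it improves the understanding of fine-grained spatial relationships and the accuracy of trajectory prediction.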
Details
Motivation: Existing VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for autonomous driving systems that interact with the physical world. Method: Proposes the SpaceDrive framework, which applies a universal positional encoder to 3D coordinates from multi-view depth estimation, historical ego-states, and text prompts; these 3D positional encodings are superimposed on the 2D visual tokens and also serve as both inputs and outputs of the VLM, enabling joint reasoning over semantic and spatial information. Result: Experiments show that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and a Driving Score of 78.02 on the Bench2Drive closed-loop benchmark, the second-best result among existing VLM-based methods. Conclusion: By introducing explicit 3D positional encodings, SpaceDrive strengthens the spatial understanding and trajectory planning ability of VLMs in autonomous driving, advancing end-to-end autonomous driving. Abstract: End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating digit-by-digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods.[104] Video Depth Propagation
Luigi Piccinelli,Thiemo Wandel,Christos Sakaridis,Wim Abbeloos,Luc Van Gool
Main category: cs.CV
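TL;DR: Proposes VeloDepth, an efficient online video depth estimation method that uses spatiotemporal priors and feature propagation to deliver fast, temporally consistent depth predictions.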
Details
Motivation: Existing video depth estimation methods fall short in either temporal consistency or computational efficiency and struggle to combine accuracy with real-time operation, which limits practical use. Method: Designs a new propagation module that propagates and refines depth features using flow-based warping together with learned residual corrections, and enforces temporal consistency through its structural design. Result: Achieves state-of-the-art temporal consistency and competitive accuracy on multiple benchmarks, with significantly faster inference. Conclusion: VeloDepth offers an efficient, accurate, and practical solution for real-time video depth estimation, suitable for a range of visual perception tasks. Abstract: Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth[105] IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation
Yuan-Ming Li,Qize Yang,Nan Lei,Shenghao Fu,Ling-An Zeng,Jian-Fang Hu,Xihan Wei,Wei-Shi Zheng
Main category: cs.CV
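TL;DR: This paper proposes IRMoGen, a new motion generation paradigm that interleaves generation, assessment, and refinement steps in a text-motion dialogue, enabling bidirectional knowledge flow between understanding and generation. To realize it, the authors build IRG-MotionLLM, the first model that seamlessly interleaves these three tasks, and design a three-stage training scheme and an automated data engine to support its development. Experiments show that the approach significantly improves text-motion alignment and generation performance and outperforms the baseline model on standard benchmarks.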
Details
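Motivation: Existing motion-aware large language models usually handle understanding and generation separately, which limits the mutual benefit between the two. This paper introduces assessment and refinement tasks as a bridge and builds a bidirectional feedback mechanism to improve the quality and semantic alignment of generated motion. Method: Proposes the IRMoGen paradigm, which tightly couples motion generation with assessment and refinement through iterative text-motion dialogue; builds the IRG-MotionLLM model with a three-stage training scheme that progressively enhances its capabilities; develops an automated data engine that synthesizes interleaved reasoning annotations from existing text-motion datasets. Result: Experiments show that (i) assessment and refinement tasks significantly improve text-motion alignment; (ii) interleaving the three steps yields consistent gains across training stages; (iii) IRG-MotionLLM clearly outperforms the baseline model on standard text-to-motion generation benchmarks, and cross-evaluator testing further confirms its effectiveness. Conclusion: Introducing assessment and refinement and interleaving generation with understanding effectively improves the quality and semantic consistency of motion generation. IRG-MotionLLM demonstrates the potential of closed-loop interactive reasoning in multimodal language models and offers a new direction for unifying motion understanding and generation. Abstract: Recent advances in motion-aware large language models have shown remarkable promise for unifying motion understanding and generation tasks. However, these models typically treat understanding and generation separately, limiting the mutual benefits that could arise from interactive feedback between tasks. In this work, we reveal that motion assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. Leveraging this insight, we propose Interleaved Reasoning for Motion Generation (IRMoGen), a novel paradigm that tightly couples motion generation with assessment and refinement through iterative text-motion dialogue. To realize this, we introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance. IRG-MotionLLM is developed progressively with a novel three-stage training scheme, initializing and subsequently enhancing native IRMoGen capabilities. To facilitate this development, we construct an automated data engine to synthesize interleaved reasoning annotations from existing text-motion datasets. 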
Extensive experiments demonstrate that: (i) Assessment and refinement tasks significantly improve text-motion alignment; (ii) Interleaving motion generation, assessment, and refinement steps yields consistent performance gains across training stages; and (iii) IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks. Cross-evaluator testing further validates its effectiveness. Code & Data: https://github.com/HumanMLLM/IRG-MotionLLM/tree/main.[106] LDP: Parameter-Efficient Fine-Tuning of Multimodal LLM for Medical Report Generation
Tianyu Zhou,Junyi Tang,Zehui Li,Dahong Qian,Suncheng Xiang
Main category: cs.CV
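TL;DR: Proposes LDP, a framework that uses a multimodal large language model to generate professional polyp diagnosis reports. It relies on the MMEndo dataset, parameter-efficient fine-tuning (LoRA), and Direct Preference Optimization (DPO) to improve clinical consistency and computational efficiency.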
Details
Motivation: Traditional automated colonoscopic polyp reporting suffers from inconsistencies and hallucinations, mainly because high-quality multimodal medical data is scarce, and so falls short of clinical needs. Method: Builds MMEndo, an expert-annotated multimodal endoscopic dataset; fine-tunes the Qwen2-VL-7B model parameter-efficiently with LoRA and aligns it with clinical standards through DPO. Result: LDP outperforms existing baselines on both automated metrics and clinical expert evaluation, with a Physician Score of 7.2/10; it reduces training computational cost by 833x compared to full fine-tuning, and its generalization is validated on the IU-XRay dataset. Conclusion: LDP offers a scalable, clinically viable solution for primary healthcare that improves report quality and efficiency and has practical application potential. Abstract: Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework leveraging multimodal large language models (MLLMs) for professional polyp diagnosis report generation. Specifically, we curate MMEndo, a multimodal endoscopic dataset comprising expert-annotated colonoscopy image-text pairs. We fine-tune the Qwen2-VL-7B backbone using Parameter-Efficient Fine-Tuning (LoRA) and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that our LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluations (achieving a Physician Score of 7.2/10), significantly reducing training computational costs by 833x compared to full fine-tuning. The proposed solution offers a scalable, clinically viable path for primary healthcare, with additional validation on the IU-XRay dataset confirming its robustness.[107] Blood Pressure Prediction for Coronary Artery Disease Diagnosis using Coronary Computed Tomography Angiography
Rene Lisasi,Michele Esposito,Chen Zhao
Main category: cs.CV
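TL;DR: Proposes an end-to-end automated pipeline, combined with a diffusion-based regression model, that predicts coronary blood pressure distributions directly from coronary CT angiography. It avoids the high computational cost of conventional CFD and provides efficient, scalable support for non-invasive diagnosis of coronary artery disease.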
Details
Motivation: Conventional CFD simulation of coronary blood flow is valuable but computationally expensive and time-consuming; it is hard to generate labeled data at scale, which limits AI model training and clinical use. Method: Develops an end-to-end pipeline that automatically extracts coronary geometry and streamlines simulation data generation, and introduces a diffusion-based regression model that predicts blood pressure distributions directly from CCTA features, with no CFD computation at inference. Result: On a dataset of simulated hemodynamics, the model reaches an R² of 64.42%, an RMSE of 0.0974, and a normalized RMSE of 0.154, outperforming several baseline methods. Conclusion: The framework offers a scalable, accessible way to predict blood pressure rapidly and non-invasively in support of CAD diagnosis, and may help bring physiology-based CAD assessment into wider clinical use. Abstract: Computational fluid dynamics (CFD) based simulation of coronary blood flow provides valuable hemodynamic markers, such as pressure gradients, for diagnosing coronary artery disease (CAD). However, CFD is computationally expensive, time-consuming, and difficult to integrate into large-scale clinical workflows. These limitations restrict the availability of labeled hemodynamic data for training AI models and hinder broad adoption of non-invasive, physiology-based CAD assessment. To address these challenges, we develop an end-to-end pipeline that automates coronary geometry extraction from coronary computed tomography angiography (CCTA), streamlines simulation data generation, and enables efficient learning of coronary blood pressure distributions. The pipeline reduces the manual burden associated with traditional CFD workflows while producing consistent training data. We further introduce a diffusion-based regression model designed to predict coronary blood pressure directly from CCTA-derived features, bypassing the need for slow CFD computation during inference. Evaluated on a dataset of simulated coronary hemodynamics, the proposed model achieves state-of-the-art performance, with an R² of 64.42%, a root mean squared error of 0.0974, and a normalized RMSE of 0.154, outperforming several baseline approaches. This work provides a scalable and accessible framework for rapid, non-invasive blood pressure prediction to support CAD diagnosis.[108] What matters for Representation Alignment: Global Information or Spatial Structure?
Jaskirat Singh,Xingjian Leng,Zongze Wu,Liang Zheng,Richard Zhang,Eli Shechtman,Saining Xie
Main category: cs.CV
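TL;DR: This paper studies which aspect of the target representation matters more for generation when training generative models: global semantic information or spatial structure. Large-scale experiments show that spatial structure is more influential than global semantic performance. Building on this, the authors propose iREPA, which strengthens the transfer of spatial information with a convolution layer and spatial normalization and significantly speeds up convergence.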
Details
Motivation: To determine whether the key factor for generation performance in representation alignment (REPA) is the global semantic information of the target representation or its spatial structure, challenging the common view that stronger semantic performance is better. Method: Performs a large-scale empirical analysis across 27 different vision encoders and proposes iREPA, which replaces the MLP projection layer with a convolution layer and adds a spatial normalization layer for the external representation to strengthen the transfer of spatial information. Result: Experiments show that the spatial structure of the target representation, rather than its global semantic performance, drives generation quality; iREPA significantly speeds up convergence across encoders, model sizes, and training variants. Conclusion: Spatial structure is the key factor for generation performance; iREPA improves representation alignment with simple modifications and motivates rethinking the role of representation alignment in training generative models. Abstract: Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of spatial information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in <4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT, etc.). Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. 
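The abstract characterises spatial structure as the pairwise cosine similarity between patch tokens. A minimal sketch of that quantity follows, assuming patch tokens are given as plain lists of floats; the function name and the toy tokens are illustrative and are not taken from the iREPA code.

```python
import math


def pairwise_cosine(tokens):
    """Cosine similarity between every pair of patch tokens.

    `tokens` is a list of equal-length feature vectors, one per image patch.
    Returns an N x N nested list; entry [i][j] compares patch i with patch j.
    """
    norms = [math.sqrt(sum(v * v for v in t)) for t in tokens]
    if any(n == 0 for n in norms):
        raise ValueError("zero-norm token has no defined cosine similarity")
    n = len(tokens)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(tokens[i], tokens[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


# Three toy patch tokens: the first and third point the same way,
# the second is orthogonal to both.
patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
structure = pairwise_cosine(patches)
```

The resulting matrix depends only on how patches relate to one another, not on what any single token encodes globally, which is why it serves as a proxy for spatial structure independent of classification accuracy.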
The code and project page are available at https://end2end-diffusion.github.io/irepa[109] Graph Laplacian Transformer with Progressive Sampling for Prostate Cancer Grading
Masum Shah Junayed,John Derek Van Vessem,Qian Wan,Gahie Nam,Sheida Nabavi
Main category: cs.CV
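TL;DR: Proposes a Transformer with graph Laplacian attention (GLAT) combined with an Iterative Refinement Module (IRM) for grading prostate cancer from whole-slide images. It dynamically selects key regions and enforces spatial consistency, significantly improving performance.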
Details
Motivation: Existing methods for whole-slide images mostly use random or static patch selection, which tends to include redundant or uninformative regions and degrades cancer grading performance; a method is needed that adaptively selects diagnostically relevant regions while preserving spatial consistency. Method: Proposes GLAT, which treats patches as graph nodes, models tissue connectivity with graph Laplacian constraints, and refines feature representations with a learnable filtering mechanism; designs the IRM, which extracts local features with a pretrained ResNet50, scores importance with a foundation model, and iteratively refines patch selection; introduces a convex aggregation mechanism that dynamically adjusts patch weights to produce a robust slide-level representation. Result: Extensive experiments on five public datasets and one private dataset show that the method outperforms state-of-the-art approaches in performance, spatial consistency, and computational efficiency. Conclusion: GLAT combined with the IRM improves the grading accuracy of prostate cancer whole-slide images; dynamic patch selection and graph-structured modeling strengthen the model's sensitivity to key tissue regions, giving it good application potential. Abstract: Prostate cancer grading from whole-slide images (WSIs) remains a challenging task due to the large-scale nature of WSIs, the presence of heterogeneous tissue structures, and the difficulty of selecting diagnostically relevant regions. Existing approaches often rely on random or static patch selection, leading to the inclusion of redundant or non-informative regions that degrade performance. To address this, we propose a Graph Laplacian Attention-Based Transformer (GLAT) integrated with an Iterative Refinement Module (IRM) to enhance both feature learning and spatial consistency. The IRM iteratively refines patch selection by leveraging a pretrained ResNet50 for local feature extraction and a foundation model in no-gradient mode for importance scoring, ensuring only the most relevant tissue regions are preserved. The GLAT models tissue-level connectivity by constructing a graph where patches serve as nodes, ensuring spatial consistency through graph Laplacian constraints and refining feature representations via a learnable filtering mechanism that enhances discriminative histological structures. Additionally, a convex aggregation mechanism dynamically adjusts patch importance to generate a robust WSI-level representation. Extensive experiments on five public and one private dataset demonstrate that our model outperforms state-of-the-art methods, achieving higher performance and spatial consistency while maintaining computational efficiency.[110] Self-Ensemble Post Learning for Noisy Domain Generalization
Wang Lu,Jindong Wang
Main category: cs.CV
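TL;DR: This paper proposes SEPL, a self-ensemble post-learning method that improves the robustness of domain generalization under noisy labels through feature probing training and prediction ensemble inference.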
Details
Motivation: Existing domain generalization methods degrade under label noise, because noise worsens the enlargement of spurious features; the question is how to keep existing methods effective in noisy settings. Method: Proposes SEPL, which consists of feature probing training and prediction ensemble inference; it trains multiple probing classifiers on the model's intermediate-layer features, uses semi-supervised algorithms to cope with noisy labels, and combines the outputs of the classification heads with a crowdsourcing-style inference approach. Result: Experiments show that SEPL improves the robustness of existing methods and performs well across a range of noise scenarios, with high flexibility and potential for real-world use. Conclusion: By exploiting the diversity and discriminative power of intermediate features, SEPL provides an effective solution for domain generalization in noisy environments. Abstract: While computer vision and machine learning have made great progress, their robustness is still challenged by two key issues: data distribution shift and label noise. When domain generalization (DG) encounters noise, noisy labels further exacerbate the emergence of spurious features in deep layers, i.e., spurious feature enlargement, leading to a degradation in the performance of existing algorithms. This paper, starting from domain generalization, explores how to make existing methods work when they encounter noise. We find that the latent features inside the model have certain discriminative capabilities, and different latent features focus on different parts of the image. Based on these observations, we propose the Self-Ensemble Post Learning approach (SEPL) to diversify the features that can be leveraged. Specifically, SEPL consists of two parts: feature probing training and prediction ensemble inference. It leverages intermediate feature representations within the model architecture, training multiple probing classifiers to fully exploit the capabilities of pre-trained models, while the final predictions are obtained through the integration of outputs from these diverse classification heads. Considering the presence of noisy labels, we employ semi-supervised algorithms to train probing classifiers. Given that different probing classifiers focus on different areas, we integrate their predictions using a crowdsourcing inference approach. 
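The prediction-ensemble step can be illustrated with the simplest possible aggregation rule. The abstract does not specify which crowdsourcing inference algorithm SEPL uses, so the sketch below substitutes plain majority voting as a stand-in; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def ensemble_predict(head_predictions):
    """Combine per-head labels by majority vote (a stand-in, not SEPL's rule).

    `head_predictions` is a list with one entry per probing classifier;
    each entry is that classifier's list of predicted labels, one per sample.
    Ties go to the label that appears first among the heads.
    """
    n_samples = len(head_predictions[0])
    if any(len(head) != n_samples for head in head_predictions):
        raise ValueError("every head must predict the same number of samples")
    combined = []
    for i in range(n_samples):
        votes = Counter(head[i] for head in head_predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined


# Three probing heads attached to different intermediate layers (toy labels).
heads = [
    ["cat", "dog", "dog"],
    ["cat", "cat", "dog"],
    ["dog", "cat", "dog"],
]
final = ensemble_predict(heads)
```

Crowdsourcing-style aggregators typically go further and weight each annotator by an estimated reliability; majority voting is the unweighted special case and is shown here only to make the data flow concrete.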
Extensive experimental evaluations demonstrate that the proposed method not only enhances the robustness of existing methods but also exhibits significant potential for real-world applications with high flexibility.[111] PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
Jianqi Chen,Biao Zhang,Xiangjun Tang,Peter Wonka
Main category: cs.CV
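TL;DR: This paper proposes PoseGAM, a geometry-aware multi-view framework for 6D pose estimation of unseen objects without explicit matching, and builds a large-scale synthetic dataset to improve generalization, achieving leading performance on multiple benchmarks.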
Details
Motivation: Existing 6D object pose estimation methods rely on explicit feature correspondences when handling unseen objects, which limits their generalization; a new method is needed that avoids explicit matching and still makes effective use of geometric information. Method: Proposes PoseGAM, built on multi-view foundation model architectures, which integrates object geometry through explicit point-based geometry and features learned by geometry representation networks, and predicts object pose directly from a query image and multiple template images without explicit matching. Result: Extensive evaluation on multiple benchmarks shows an average AR improvement of 5.1% over prior methods, with gains of up to 17.6% on individual datasets, indicating strong generalization to unseen objects. Conclusion: By combining explicit and implicit geometric information, PoseGAM achieves efficient 6D pose estimation without explicit matching and shows strong performance and generalization on unseen objects. Abstract: 6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/ .[112] SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation
Kehong Gong,Zhengyu Wen,Mingxi Xu,Weixia He,Qi Wang,Ning Zhang,Zhengyu Li,Chenbin Li,Dongze Lian,Wei Zhao,Xiaoyu He,Mingyuan Zhang
Main category: cs.CV
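TL;DR: This paper proposes SWiT-4D, a parameter-free, low-supervision 4D mesh generation method based on a sliding-window Transformer, which efficiently converts monocular videos into high-quality, temporally consistent 3D mesh sequences.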
Details
Motivation: Existing 4D content generation methods are held back by the lack of large-scale real 4D datasets, which makes it hard to train generalizable video-to-4D models from scratch; and despite progress in image-to-3D generation, using its priors effectively while reducing reliance on 4D supervision remains challenging. Method: Proposes SWiT-4D, a sliding-window Transformer architecture that integrates seamlessly with any DiT-based image-to-3D generator, adding spatial-temporal modeling across frames while preserving the original single-image forward process; an optimization-based trajectory module designed for static-camera videos recovers global translation. Result: Fine-tuned on only a single video shorter than 10 seconds, SWiT-4D achieves high-fidelity geometry and stable temporal consistency, and outperforms existing methods on several in-domain and out-of-domain benchmarks (C4D, Objaverse, in-the-wild videos), particularly in temporal smoothness. Conclusion: SWiT-4D achieves data-efficient 4D generation with no additional parameters, shows potential for practical deployment under very limited 4D supervision, and advances monocular video-to-4D content generation. Abstract: Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. 
Project page: https://animotionlab.github.io/SWIT4D/[113] MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence
Jingli Lin,Runsen Xu,Shaohao Zhu,Sihan Yang,Peizhou Cao,Yunlong Ran,Miao Hu,Chenming Zhu,Yiman Xie,Yilin Long,Wenbo Hu,Dahua Lin,Tai Wang,Jiangmiao Pang
Main category: cs.CV
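TL;DR: This paper presents MMSI-Video-Bench, a fully human-annotated benchmark for evaluating the video-based spatial intelligence of multimodal large language models. It covers four levels (perception, planning, prediction, and cross-video reasoning) and reveals a significant gap between current models and humans in areas such as geometric reasoning and motion grounding.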
Details
Motivation: There is no comprehensive benchmark that assesses how well multimodal large language models understand space over continuous visual input, which holds back their use in physical environments. Method: Builds a high-quality benchmark of 1,106 questions grounded in 1,278 video clips, organized by a four-level framework (perception, planning, prediction, cross-video reasoning), designed by 3DV experts with explanatory rationales, and supporting three domain-oriented sub-benchmarks. Result: Experiments on 25 mainstream MLLMs show that many models perform near chance and the best model still lags humans by nearly 60%; fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence; common frame-sampling strategies, 3D spatial cues, and chain-of-thought prompting bring no meaningful gains. Conclusion: MMSI-Video-Bench provides a challenging testbed for video-based spatial intelligence, exposes the limitations of current models, and is expected to advance the field. Abstract: Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human--AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. 
We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.[114] From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
Zongzhao Li,Xiangzhe Kong,Jiahui Su,Zongyang Ma,Mingze Li,Songyou Li,Yuelin Zhang,Yu Rong,Tingyang Xu,Deli Zhao,Wenbing Huang
Main category: cs.CV
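TL;DR: This paper introduces the concept of Microscopic Spatial Intelligence (MiSI) and builds the MiSI-Bench benchmark to evaluate the microscopic spatial reasoning of vision-language models. Current models fall far below human level on scientific tasks, suggesting that incorporating domain knowledge is essential for scientific AGI.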
Details
Motivation: Microscopic spatial intelligence is essential to scientific discovery, but the ability of existing vision-language models to understand the spatial relationships of invisible microscopic entities has not been evaluated systematically. Method: Proposes the MiSI-Bench benchmark framework, with over 163,000 question-answer pairs and 587,000 images derived from about 4,000 molecular structures, covering nine tasks that range from elementary spatial transformations to complex relational identifications. Result: Experiments show that current state-of-the-art vision-language models perform far below human level on the benchmark; a fine-tuned 7B model surpasses humans on spatial transformation tasks but performs poorly on scientific tasks such as hydrogen bond recognition. Conclusion: Vision-language models show some potential for microscopic spatial reasoning, but explicit scientific domain knowledge must be integrated to make progress toward scientific AGI. Abstract: This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.[115] MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Kehong Gong,Zhengyu Wen,Weixia He,Mingxi Xu,Qi Wang,Ning Zhang,Zhengyu Li,Dongze Lian,Wei Zhao,Xiaoyu He,Mingyuan Zhang
Main category: cs.CV
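TL;DR: Proposes MoCapAnything, a category-agnostic motion capture framework that takes a monocular video and an arbitrary 3D asset and generates a skeletal animation that drives that asset, supporting high-quality motion retargeting across species and skeletons.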
Details
Motivation: Most existing motion capture methods are designed for a specific species or template and generalize poorly to arbitrary 3D assets, which limits their use in diverse content creation.
Method: Proposes MoCapAnything, which combines three learnable modules (a Reference Prompt Encoder, a Video Feature Extractor, and a Unified Motion Decoder) with a constraint-aware inverse kinematics (IK) stage. It first extracts per-joint queries from the 3D asset, then reconstructs a coarse 4D deforming mesh from the video to align the visual and joint spaces, and finally fuses these cues to produce temporally coherent 3D joint trajectories, from which a lightweight IK stage recovers asset-specific rotations.
Result: Delivers high-quality skeletal animations on both in-domain benchmarks and in-the-wild videos and supports cross-species retargeting. The authors also curate the Truebones Zoo dataset with 1038 motion clips, each providing a standardized skeleton-mesh-render triad.
Conclusion: MoCapAnything achieves category-agnostic motion capture and advances scalable, prompt-driven 3D motion capture for arbitrary rigged 3D assets.
Abstract: Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
[116] PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction
Brandon Smock,Valerie Faucon-Morin,Max Sokolov,Libin Liang,Tayyibah Khanam,Maury Courtland
Main category: cs.CV
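TL;DR: Presents PubTables-v2, a large-scale dataset supporting challenging tasks such as multi-page table structure recognition, and uses it to develop the Page-Object Table Transformer (POTATR), advancing page-level table extraction.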
Details
Motivation: Because annotated data is lacking, progress in table extraction has been difficult to demonstrate, especially on complex tasks such as multi-page table structure recognition.
Method: Builds a new large-scale dataset, PubTables-v2, that supports several currently challenging table extraction tasks; uses it to evaluate domain-specialized vision-language models (VLMs); and proposes POTATR, an image-to-graph extension of the Table Transformer for comprehensive page-level table extraction.
Result: PubTables-v2 is the first large-scale benchmark to support multi-page table structure recognition. Experiments demonstrate its usefulness for evaluating VLMs, and it is used to develop POTATR for page-level table extraction.
Conclusion: PubTables-v2 is an important resource for table extraction research that advances multi-page and context-aware table understanding, and POTATR shows the potential of end-to-end table extraction in full-page context.
Abstract: Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.
[117] DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
Peiying Zhang,Nanxuan Zhao,Matthew Fisher,Yiran Xu,Jing Liao,Difan Liu
Main category: cs.CV
TL;DR: Proposes DuetSVG, a unified multimodal model that improves SVG generation quality by jointly generating image tokens and SVG tokens.
Details
Motivation: Existing approaches based on vision-language models generate only text during decoding and lack visual signals, so the SVGs they produce for complex semantics fall short in visual appeal and geometric coherence.
Method: DuetSVG jointly generates image tokens and SVG tokens end to end, and at inference applies a new test-time scaling strategy that uses the model's own visual predictions as guidance to improve SVG decoding quality.
Result: Across a range of applications, the SVGs generated by DuetSVG surpass those of existing methods in visual fidelity, semantic alignment, and syntactic cleanliness.
Conclusion: By introducing joint multimodal generation and test-time visual guidance, DuetSVG substantially improves the quality and robustness of SVG generation.
Abstract: Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
[118] FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
Yulu Gan,Ligeng Zhu,Dandan Shan,Baifeng Shi,Hongxu Yin,Boris Ivanovic,Song Han,Trevor Darrell,Jitendra Malik,Marco Pavone,Boyi Li
Main category: cs.CV
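TL;DR: Presents FoundationMotion, a fully automated data curation pipeline for building large-scale, fine-grained motion datasets. It combines object trajectory detection in videos with LLMs to generate high-quality motion captions and question-answer pairs, substantially improving model performance on motion understanding tasks.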
Details
Motivation: Existing motion datasets depend on costly manual annotation and are limited in scale, so they cannot meet current models' need for high-quality motion data, and models perform poorly on motion understanding tasks as a result.
Method: Proposes the FoundationMotion pipeline, which first detects and tracks objects in videos to extract trajectories, then uses those trajectories and video frames with Large Language Models (LLMs) to automatically generate fine-grained motion captions and diverse question-answer pairs. The resulting datasets are used to fine-tune open-source video-language models.
Result: Models such as NVILA-Video-15B and Qwen2.5-7B fine-tuned on data from this pipeline outperform strong baselines such as Gemini-2.5 Flash and Qwen2.5-VL-72B across multiple motion understanding benchmarks, without compromising performance on other tasks.
Conclusion: FoundationMotion provides a scalable, low-cost data production approach for motion understanding that improves models' motion and spatial reasoning capabilities and advances physical reasoning research.
Abstract: Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
[119] BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
Shengao Wang,Wenqi Wang,Zecheng Wang,Max Whitton,Michael Wakeham,Arjun Chandra,Joey Huang,Pengyue Zhu,Helen Chen,David Li,Jeffrey Li,Shawn Li,Andrew Zagula,Amy Zhao,Andrew Zhu,Sayaka Nakamura,Yuki Yamamoto,Jerry Jun Yokono,Aaron Mueller,Bryan A. Plummer,Kate Saenko,Venkatesh Saligrama,Boqing Gong
Main category: cs.CV
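TL;DR: BabyVLM-V2 is a vision-language modeling framework inspired by infant development. Using a longitudinal multimodal pretraining set and the newly proposed DevCV Toolbox evaluation suite, it enables sample-efficient, developmentally plausible pretraining of vision foundation models on cognitively aligned tasks.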
Details
Motivation: Use the early developmental trajectories of infants and young children to give vision foundation models more efficient and natural pretraining goals, advancing interpretable, sample-efficient learning paradigms.
Method: Builds an infant-centric, longitudinal, multimodal (audiovisual plus text) pretraining set containing video-utterance, image-utterance, and multi-turn conversational data; designs a compact model pretrained from scratch; and proposes DevCV Toolbox, which adapts the vision-related measures of the NIH Baby Toolbox into ten multimodal evaluation tasks aligned with children's capabilities.
Result: The pretrained compact model performs strongly on DevCV Toolbox and outperforms GPT-4o on some tasks, supporting the effectiveness of developmental pretraining.
Conclusion: BabyVLM-V2 provides a principled, unified framework for developmental pretraining that may accelerate research on vision foundation models grounded in human cognitive development.
Abstract: Young children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with young children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
[120] Any4D: Unified Feed-Forward Metric 4D Reconstruction
Jay Karhade,Nikhil Keetha,Yuchen Zhang,Tanisha Gupta,Akash Sharma,Sebastian Scherer,Deva Ramanan
Main category: cs.CV
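TL;DR: Proposes Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction that supports multiple sensor inputs and improves both accuracy and efficiency.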
Details
Motivation: Existing methods are mostly limited to 2-view dense scene flow or sparse 3D point tracking and struggle to incorporate multimodal sensor data, which restricts the accuracy and applicability of 4D reconstruction.
Method: Uses a modular 4D scene representation that factors per-view 4D predictions into egocentric factors expressed in camera coordinates (such as depth maps and intrinsics) and allocentric factors expressed in world coordinates (such as extrinsics and scene flow), modeled jointly by a multi-view transformer.
Result: Achieves superior performance across diverse setups, with 2-3x lower error and 15x faster computation.
Conclusion: Through a flexible modular design that supports multimodal inputs, Any4D achieves efficient and accurate dense 4D reconstruction and opens new possibilities for downstream applications.
Abstract: We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
[121] GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting
Madhav Agarwal,Mingtian Zhang,Laura Sevilla-Lara,Steven McDonagh
Main category: cs.CV
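TL;DR: Proposes an audio-driven method for high-fidelity, real-time talking head generation that combines 3D Morphable Models (3DMM) with Gaussian Splatting and uses a transformer to achieve temporal consistency.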
Details
Motivation: Existing methods struggle to balance visual fidelity and temporal stability: diffusion models are slow, and Gaussian Splatting is prone to jitter, which limits practical use.
Method: Combines Gaussian Splatting with a 3D Morphable Model (3DMM) of the face and uses a transformer to predict the model parameters directly from audio, driving person-specific avatars with temporal consistency.
Result: From monocular video and independent audio inputs, the method generates talking heads in real time, with competitive quantitative and qualitative results.
Conclusion: The method improves temporal stability while maintaining high visual quality, making it suitable for interactive avatar applications in real-world settings.
Abstract: Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods either achieve high visual fidelity but are slow, or are fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle with one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos where we report competitive quantitative and qualitative performance.
[122] OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis
Xiang Fan,Sharath Girish,Vivek Ramanujan,Chaoyang Wang,Ashkan Mirzaei,Petr Sushko,Aliaksandr Siarohin,Sergey Tulyakov,Ranjay Krishna
Main category: cs.CV
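TL;DR: OmniView is a unified diffusion framework that generalizes across many 4D consistency tasks, including novel view synthesis and text- or image-to-video generation with camera control. By representing space, time, and view conditions separately, it allows flexible input combinations and improves markedly over existing methods on several benchmarks.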
Details
Motivation: Existing camera-controlled diffusion methods are each limited to a specific subset of 4D consistency tasks and are trained on disjoint slices of data, so they lack generality. A framework that handles these tasks in a unified way is needed.
Method: Proposes OmniView, which models space, time, and view conditions independently, supports many input combinations, and is trained within a single framework to strengthen generalization.
Result: Performs strongly on multiview NVS, dynamic NVS, static camera control, and text-to-video benchmarks: image quality improves by up to 33% on LLFF, 60% on Neural 3D Video, and 20% on RE-10K, and camera trajectory error in text-conditioned video generation is reduced by 4x.
Conclusion: OmniView is a generalist 4D video generation model with strong generalization, demonstrating that a single model can handle diverse 4D tasks.
Abstract: Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/
[123] Mull-Tokens: Modality-Agnostic Latent Thinking
Arijit Ray,Ahmed Abdelkader,Chengzhi Mao,Bryan A. Plummer,Kate Saenko,Ranjay Krishna,Leonidas Guibas,Wen-Sheng Chu
Main category: cs.CV
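TL;DR: Mull-Tokens are modality-agnostic latent tokens that improve multimodal reasoning by letting the model think abstractly in both text and image modalities, outperforming existing baselines on several spatial reasoning tasks.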
Details
Motivation: Existing multimodal models that reason with images depend on specialist tools, costly image generation, or handcrafted data. They lack scalability and robustness and cannot effectively combine vision and language for free-form reasoning.
Method: Proposes Mull-Tokens, latent tokens pre-trained to hold intermediate information from either text or images. They are first trained with supervision from interleaved text-image traces, then fine-tuned without any supervision using only the final answers.
Result: On four spatial reasoning benchmarks (for example, puzzle solving and perspective taking), Mull-Tokens improve by 3% on average and by up to 16% on a reasoning-heavy puzzle-solving split.
Conclusion: Mull-Tokens offer a simple and effective approach to abstract multimodal reasoning, letting intermediate reasoning steps move freely between text and image without complex architectures or extra tools, and strengthening the model's general reasoning ability.
Abstract: Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.
[124] VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
Delong Chen,Mustafa Shukor,Theo Moutakanni,Willy Chung,Jade Yu,Tejaswi Kasarla,Allen Bolourchi,Yann LeCun,Pascale Fung
Main category: cs.CV
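TL;DR: VL-JEPA is a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). It predicts continuous embeddings of the target text instead of autoregressively generating tokens, improving efficiency and performance and outperforming existing models on a range of tasks with fewer parameters.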
Details
Motivation: Classical vision-language models (VLMs) rely on autoregressive token generation, which is computationally expensive and attends to surface-level linguistic variation. VL-JEPA aims to learn in an abstract representation space, focusing on task-relevant semantics and reducing redundant computation.
Method: Adopts a Joint Embedding Predictive Architecture (JEPA) that uses a shared encoder to map images and text into a unified embedding space and predicts target text embeddings in that space. A lightweight text decoder is invoked only when text output is needed, which supports selective decoding and multiple tasks without architecture changes.
Result: With the same vision encoder and training data, VL-JEPA outperforms standard token-space VLM training with 50% fewer trainable parameters, and selective decoding reduces decoding operations by 2.85x. It surpasses CLIP, SigLIP2, and Perception Encoder on eight video classification and eight video retrieval datasets, and is comparable to InstructBLIP and QwenVL on four VQA datasets with only 1.6B parameters.
Conclusion: By predicting in a continuous embedding space, VL-JEPA achieves efficient, compact, and versatile vision-language modeling that supports both generative and discriminative tasks, with a better balance between performance and efficiency.
Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.
[125] AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation
Sharath Girish,Viacheslav Ivanov,Tsai-Shien Chen,Hao Chen,Aliaksandr Siarohin,Sergey Tulyakov
Main category: cs.CV
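TL;DR: AlcheMinT is a new subject-driven video generation framework that introduces explicit timestamp conditioning and a novel positional encoding mechanism to precisely control when multiple subjects appear and disappear, while keeping parameter overhead low and video quality high.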
Details
Motivation: Existing subject-driven video generation methods lack fine-grained temporal control over when subjects appear and disappear, which limits controllability in applications such as compositional video synthesis and storyboarding.
Method: Proposes the AlcheMinT framework, which uses explicit timestamp conditioning and a new positional encoding mechanism to model temporal intervals, and adds subject-descriptive text tokens to strengthen the binding between visual identity and text. It integrates with the pretrained model through token-wise concatenation and needs no additional cross-attention modules.
Result: Establishes a new benchmark covering multi-subject identity preservation, video fidelity, and temporal adherence. Experiments show that AlcheMinT matches state-of-the-art visual quality while, for the first time, enabling precise temporal control over multi-subject generation.
Conclusion: AlcheMinT achieves efficient and precise temporally controlled multi-subject video generation, bringing greater controllability and practicality to personalized video synthesis.
Abstract: Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multi-subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at https://snap-research.github.io/Video-AlcheMinT
[126] MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
Henghui Ding,Chang Liu,Shuting He,Kaining Ying,Xudong Jiang,Chen Change Loy,Yu-Gang Jiang
Main category: cs.CV
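TL;DR: Presents MeViS, a large-scale multi-modal dataset for segmenting and tracking target objects in videos based on language descriptions of their motion. It highlights the role of motion in video and language understanding, benchmarks existing methods on the task, and proposes an improved method, LMPM++, that sets a new state of the art.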
Details
Motivation: Existing referring video segmentation datasets mostly focus on salient objects and rely on descriptions of static attributes, so they underemphasize the role of motion. A dataset that emphasizes motion expressions and motion reasoning is needed to advance pixel-level video understanding.
Method: Introduces the MeViS dataset, containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. Benchmarks 15 existing methods across four tasks and further proposes LMPM++ to improve motion expression-guided video understanding.
Result: Existing methods perform poorly on motion expression-guided video understanding, exposing their limitations in using motion information. LMPM++ achieves new state-of-the-art results on the RVOS, AVOS, and RMOT tasks.
Conclusion: MeViS provides an effective platform for motion expression-based video understanding, highlights the importance of motion cues, and advances the development of related algorithms.
Abstract: This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/
[127] Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving
Jiawei Yang,Ziyu Chen,Yurong You,Yan Wang,Yiming Li,Yuxiao Chen,Boyi Li,Boris Ivanovic,Marco Pavone,Yue Wang
Main category: cs.CV
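TL;DR: Proposes Flex, an efficient scene encoder that jointly encodes multi-camera data with learnable scene tokens and does not rely on explicit 3D priors such as BEV, raising inference throughput while substantially improving driving performance.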
Details
Motivation: Address the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving, and challenge the prevailing assumption that 3D structural priors are necessary.
Method: Designs a geometry-agnostic scene encoding approach in which a small set of learnable scene tokens learns a compact scene representation directly from the image tokens of all cameras and timesteps, without explicit 3D inductive biases such as BEV or occupancy grids.
Result: Evaluated on a large-scale proprietary dataset of 20,000 driving hours, Flex achieves 2.2x greater inference throughput than state-of-the-art methods while improving driving performance by a large margin. The compact scene tokens also develop an emergent capability for scene decomposition.
Conclusion: A data-driven joint encoding strategy is more efficient, effective, and scalable than approaches that depend on 3D priors, offering a new direction for future autonomous driving systems.
Abstract: We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on the explicit 3D inductive biases, such as Bird's-Eye-View (BEV), occupancy or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, our Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient and effective path for future autonomous driving systems.
[128] ClusIR: Towards Cluster-Guided All-in-One Image Restoration
Shengkai Hu,Jiaqi Ma,Jun Wan,Wenwen Min,Yongcheng Jing,Lefei Zhang,Dacheng Tao
Main category: cs.CV
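TL;DR: Proposes the ClusIR framework, which explicitly models degradation semantics through a cluster-guided mechanism and propagates cluster-aware cues across the spatial and frequency domains to adaptively restore images under many kinds of degradation.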
Details
Motivation: Existing all-in-one image restoration methods struggle to model degradation types explicitly and adapt poorly to complex or mixed degradations.
Method: Designs two modules, PCGRM and DAFMM. PCGRM uses probabilistic clustering to separate degradation recognition from expert activation, giving discriminative degradation perception and stable routing. DAFMM uses the cluster-guided priors for adaptive frequency decomposition and modulation, improving structure and texture restoration.
Result: Experiments on multiple benchmarks show that ClusIR reaches competitive performance under several degradation scenarios.
Conclusion: Through cluster-guided synergy, ClusIR effectively combines semantic cues with frequency-domain modulation to achieve high-fidelity unified image restoration.
Abstract: All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios.
[129] E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
Qitao Zhao,Hao Tan,Qianqian Wang,Sai Bi,Kai Zhang,Kalyan Sunkavalli,Shubham Tulsiani,Hanwen Jiang
Main category: cs.CV
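TL;DR: E-RayZer is a self-supervised 3D vision pre-training model that uses explicit geometry and a fine-grained learning curriculum to learn truly 3D-aware representations directly from multi-view images, surpassing existing methods on a range of downstream tasks.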
Details
Motivation: Existing self-supervised methods for learning 3D-aware representations mostly rely on latent-space view synthesis, lack genuine understanding of 3D geometry, and are hard to scale. The paper aims to develop a self-supervised model that reconstructs directly in 3D space and generalizes well.
Method: Proposes E-RayZer, which models explicit 3D geometry and adds a novel fine-grained learning curriculum that orders training samples from easy to hard and harmonizes heterogeneous data sources in an unsupervised manner, for stable convergence and scalability.
Result: E-RayZer significantly outperforms RayZer on tasks such as pose estimation and matches or surpasses fully supervised models such as VGGT. Its learned representations also outperform leading pre-training models such as DINOv3, CroCo v2, and VideoMAE V2 when transferred to 3D downstream tasks.
Conclusion: E-RayZer offers a new paradigm for 3D-aware visual pre-training, demonstrating the effectiveness of combining explicit 3D modeling with self-supervised learning and advancing 3D representation learning from multi-view images.
Abstract: Self-supervised pre-training has revolutionized foundation models for languages, individual 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation, and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.
[130] Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration
Sicheng Mo,Thao Nguyen,Richard Zhang,Nick Kolkin,Siddharth Srinivasan Iyer,Eli Shechtman,Krishna Kumar Singh,Yong Jae Lee,Bolei Zhou,Yuheng Li
Main category: cs.CV
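TL;DR: Proposes Group Diffusion, which shares the attention mechanism across images at inference so that samples are generated collaboratively, improving generation quality and achieving up to 32.2% FID improvement on ImageNet-256x256.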
Details
Motivation: Explore an untapped signal in diffusion model inference and break with the established practice of generating images independently by introducing cross-sample collaborative generation.
Method: Proposes Group Diffusion, which builds on standard diffusion transformers and unlocks the attention mechanism so that it is shared across multiple images, enabling joint denoising while learning both intra-image and inter-image correspondence.
Result: Larger group sizes yield stronger cross-sample attention and better generation quality. The proposed qualitative measure correlates closely with FID, and FID improves by up to 32.2% on ImageNet-256x256.
Conclusion: Cross-sample inference is an effective and previously unexplored mechanism for generative modeling, and Group Diffusion opens a new direction for improving diffusion models.
Abstract: In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra and inter-image correspondence. We observe a clear scaling effect - larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.
[131] Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Tsai-Shien Chen,Aliaksandr Siarohin,Guocheng Gordon Qian,Kuan-Chieh Jackson Wang,Egor Nemchinov,Moayed Haji-Ali,Riza Alp Guler,Willi Menapace,Ivan Skorokhodov,Anil Kag,Jun-Yan Zhu,Sergey Tulyakov
Main category: cs.CV
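TL;DR: Presents Omni-Attribute, the first open-vocabulary image attribute encoder, aimed at high-fidelity, attribute-specific visual concept personalization. Using semantically linked image pairs and a dual-objective training paradigm, it reaches state-of-the-art performance on multiple benchmarks.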
Details
Motivation: Existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it hard to isolate a single attribute, often causing information leakage and incoherent synthesis.
Method: Proposes Omni-Attribute, curating semantically linked image pairs annotated with positive and negative attributes and adopting a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement.
Result: The method performs strongly on open-vocabulary attribute retrieval, personalization, and compositional generation, reaching state-of-the-art results on multiple benchmarks.
Conclusion: Omni-Attribute effectively disentangles image attributes and enables precise personalized editing, offering a new approach to open-vocabulary visual editing.
Abstract: Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
[132] Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision
Wentao Zhou,Xuweiyi Chen,Vignesh Rajagopal,Jeffrey Chen,Rohan Chandra,Zezhou Cheng
Main category: cs.CV
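TL;DR: This paper presents StereoWalker, a method that augments robot navigation foundation models (NFMs) with stereo vision and explicit mid-level vision (such as depth estimation and pixel tracking), addressing the depth-scale ambiguity and low data efficiency of monocular vision in dynamic environments.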
Details
Motivation: The depth-scale ambiguity of monocular vision and the need for large amounts of pixel-to-action supervision limit existing navigation foundation models in dynamic, unstructured environments, calling for a more efficient and robust approach.
Method: Propose StereoWalker, which uses stereo input to resolve depth-scale ambiguity and incorporates the geometric and motion structure provided by modern mid-level vision models; also curate a large stereo navigation dataset with automatic action annotation for training and evaluation.
Result: Experiments show that StereoWalker matches the current state of the art using only 1.5% of the training data and surpasses it with the full data, and that stereo input outperforms monocular input.
Conclusion: Relying on monocular vision while ignoring mid-level vision priors is inefficient; adding stereo vision and explicit mid-level vision substantially improves data efficiency and navigation performance.
Abstract: The success of foundation models in language and vision motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
[133] SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
Yukai Shi,Weiyu Li,Zihao Wang,Hongyang Li,Xingyu Chen,Ping Tan,Lei Zhang
Main category: cs.CV
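TL;DR: Proposes SceneMaker, a decoupled 3D scene generation framework that separates the de-occlusion model and uses a unified pose estimation model to improve 3D scene generation quality under severe occlusion and open-set settings.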
Details
Motivation: Lacking sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to produce both high-quality geometry and accurate poses under severe occlusion and open-set conditions.
Method: Decouple the de-occlusion model from 3D object generation and strengthen it with image datasets and collected de-occlusion datasets so that it handles diverse open-set occlusion patterns; propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention; construct an open-set 3D scene dataset to improve generalization.
Result: Extensive experiments show that the decoupled framework outperforms existing methods on both indoor and open-set scenes.
Conclusion: Through its decoupled design and strengthened de-occlusion and pose estimation models, SceneMaker substantially improves 3D scene generation under complex occlusion and open-set conditions.
Abstract: We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.
[134] WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
Ao Liang,Lingdong Kong,Tianyi Yan,Hongsi Liu,Wesley Yang,Ziqi Huang,Wei Yin,Jialong Zuo,Yixuan Hu,Dekai Zhu,Dongyue Lu,Youquan Liu,Guangfeng Jiang,Linfeng Li,Xiangtai Li,Long Zhuo,Lai Xing Ng,Benoit R. Cottereau,Changxin Gao,Liang Pan,Wei Tsang Ooi,Ziwei Liu
Main category: cs.CV
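TL;DR: This paper presents WorldLens, a comprehensive benchmark for evaluating generative world models across five aspects (generation, reconstruction, action-following, downstream tasks, and human preference); together with the large-scale human-annotated dataset WorldLens-26K and the evaluation agent WorldLens-Agent, it forms a unified ecosystem for measuring the visual realism, geometric consistency, physical plausibility, and functional reliability of generated worlds.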
Details
Motivation: Existing generative world models look visually realistic but often fall short on physical laws and behavioral consistency, and the lack of a unified evaluation standard makes it hard to assess the quality of generated worlds comprehensively.
Method: Propose the WorldLens benchmark with five evaluation dimensions: generation, reconstruction, action-following, downstream task performance, and human preference; build the human-scored WorldLens-26K dataset and train WorldLens-Agent for scalable, explainable automatic evaluation.
Result: Experiments show that existing models perform unevenly across dimensions: visually strong models often violate physics, while geometrically stable ones lack behavioral fidelity. WorldLens effectively distinguishes stronger from weaker models and achieves more consistent scoring through human and agent evaluation.
Conclusion: WorldLens offers a standardized, multi-dimensional evaluation framework that pushes generative world models to be credible in behavior and physics, not just visually realistic, and provides a unified evaluation ecosystem for future research.
Abstract: Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.
[135] StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
Tjark Behrens,Anton Obukhov,Bingxin Ke,Fabio Tosi,Matteo Poggi,Konrad Schindler
Main category: cs.CV
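TL;DR: StereoSpace is a diffusion-based monocular-to-stereo synthesis framework that models geometry through viewpoint conditioning, without explicit depth or warping, and achieves high-quality stereo generation without access to ground-truth geometry.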