cs.CL
[1] What Kind of Reasoning (if any) is an LLM actually doing? On the Stochastic Nature and Abductive Appearance of Large Language Models
Luciano Floridi,Jessica Morley,Claudio Novelli,David Watson
Main category: cs.CL
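TL;DR: This article examines what role reasoning plays in current token-completion large language models (LLMs). It argues that their outputs only appear abductive: they stem from patterns learned from human text, not from genuine reasoning. The models are stochastic at base and lack semantic understanding and any ability to verify truth, so their outputs must be assessed with care.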
Details
Motivation: To expose the mechanism behind the seemingly sound reasoning of LLMs and to clarify that they do not perform genuine abductive reasoning but imitate the structure of human reasoning, so that their capabilities and limits are understood correctly.
Method: Analyzes the generation mechanism of LLMs and the origin of their training data and, using concrete examples, compares their outputs with human abductive reasoning to show where the apparent reasoning ability comes from.
Result: LLMs produce plausible explanations and commonsense inferences because they have learned from human text that contains reasoning structures. They have no semantic understanding, truth judgment, or verification ability, so their reasoning is surface imitation, not substantive reasoning.
Conclusion: LLMs have a dual character: a stochastic base with an abductive appearance. They can serve as aids to thinking and idea generation, but their outputs cannot be treated as true or reliable reasoning and must be critically reviewed by humans. The article also responds to five possible objections and notes the limitations of its analysis.
Abstract: This article looks at how reasoning works in current Large Language Models (LLMs) that function using the token-completion method. It examines their stochastic nature and their similarity to human abductive reasoning. The argument is that these LLMs create text based on learned patterns rather than performing actual abductive reasoning. When their output seems abductive, this is largely because they are trained on human-generated texts that include reasoning structures. Examples are used to show how LLMs can produce plausible ideas, mimic commonsense reasoning, and give explanatory answers without being grounded in truth, semantics, verification, or understanding, and without performing any real abductive reasoning. This dual nature, where the models have a stochastic base but appear abductive in use, has important consequences for how LLMs are evaluated and applied. They can assist with generating ideas and supporting human thinking, but their outputs must be critically assessed because they cannot identify truth or verify their explanations. The article concludes by addressing five objections to these points, noting some limitations in the analysis, and offering an overall evaluation.
[2] Generate-Then-Validate: A Novel Question Generation Approach Using Small Language Models
Yumou Wei,John Stamper,Paulo F. Carvalho
Main category: cs.CL
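TL;DR: Proposes a new pipeline for generating high-quality questions with small language models (SLMs), combining generation with a validation strategy based on probabilistic reasoning. Evaluations show that the generated questions are clear and match the learning objectives.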
Details
Motivation: To explore the potential of small language models (SLMs) for automatic question generation, as a complement to the large models that currently dominate learning analytics research.
Method: Adopts a "generate-then-validate" strategy: first generate candidate questions at scale, then use the SLM's probabilistic reasoning ability to selectively validate and refine them.
Result: In two evaluations, one by human experts and one by a large language model, most judges found that the generated questions had clear answers and aligned closely with the intended learning objectives.
Conclusion: A well-designed pipeline can bring out the strengths of small language models and let them generate high-quality questions, offering a low-cost, efficient alternative for educational applications.
Abstract: We explore the use of small language models (SLMs) for automatic question generation as a complement to the prevalent use of their large counterparts in learning analytics research. We present a novel question generation pipeline that leverages both the text generation and the probabilistic reasoning abilities of SLMs to generate high-quality questions. Adopting a "generate-then-validate" strategy, our pipeline first performs expansive generation to create an abundance of candidate questions and then refines them through selective validation based on novel probabilistic reasoning. We conducted two evaluation studies, one with seven human experts and the other with a large language model (LLM), to assess the quality of the generated questions. Most judges (humans or LLMs) agreed that the generated questions had clear answers and generally aligned well with the intended learning objectives. Our findings suggest that an SLM can effectively generate high-quality questions when guided by a well-designed pipeline that leverages its strengths.
[3] Workflow is All You Need: Escaping the "Statistical Smoothing Trap" via High-Entropy Information Foraging and Adversarial Pacing
Zhongjie Jiang
Main category: cs.CL
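TL;DR: This study proposes the DeepNews framework, which models the cognitive process of seasoned financial journalists to address the "impossible trinity" that large models face in long-form generation (low hallucination, logical coherence, and personalized expression are hard to achieve together). In a test with a real media outlet it clearly outperformed a current state-of-the-art model.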
Details
Motivation: Existing large language models struggle to achieve low hallucination, deep logical coherence, and personalized expression at the same time in long-form generation for vertical domains. The root cause is the Statistical Smoothing Trap, which overlooks the high-entropy information acquisition and structured cognitive processes of expert writing.
Method: Proposes the DeepNews framework with three modules: a dual-granularity retrieval mechanism based on information foraging theory (10:1 saturated information input), strategic planning based on narrative schemas and Atomic Blocks, and adversarial constraint prompting through tactics such as Rhythm Break and Logic Fog.
Result: Experiments reveal a "Knowledge Cliff": content truthfulness drops sharply when the retrieved context falls below 15,000 characters, while above 30,000 characters the Hallucination-Free Rate (HFR) stabilizes above 85%. In a blind test with a top-tier Chinese technology media outlet, the DeepNews system built on a previous-generation model (DeepSeek-V3-0324) reached a 25% submission acceptance rate, against 0% for zero-shot generation by GPT-5.
Conclusion: By explicitly modeling the cognitive workflow of expert writing, the DeepNews framework escapes the Statistical Smoothing Trap of conventional generation paradigms and offers an effective route to high-quality long-form generation in vertical domains.
Abstract: Central to long-form text generation in vertical domains is the "impossible trinity" confronting current large language models (LLMs): the simultaneous achievement of low hallucination, deep logical coherence, and personalized expression. This study establishes that this bottleneck arises from existing generative paradigms succumbing to the Statistical Smoothing Trap, a phenomenon that overlooks the high-entropy information acquisition and structured cognitive processes integral to expert-level writing. To address this limitation, we propose the DeepNews Framework, an agentic workflow that explicitly models the implicit cognitive processes of seasoned financial journalists. The framework integrates three core modules: first, a dual-granularity retrieval mechanism grounded in information foraging theory, which enforces a 10:1 saturated information input ratio to mitigate hallucinatory outputs; second, schema-guided strategic planning, a process leveraging domain expert knowledge bases (narrative schemas) and Atomic Blocks to forge a robust logical skeleton; third, adversarial constraint prompting, a technique deploying tactics including Rhythm Break and Logic Fog to disrupt the probabilistic smoothness inherent in model-generated text. Experiments delineate a salient Knowledge Cliff in deep financial reporting: content truthfulness collapses when retrieved context falls below 15,000 characters, while a high-redundancy input exceeding 30,000 characters stabilizes the Hallucination-Free Rate (HFR) above 85%. In an ecological validity blind test conducted with a top-tier Chinese technology media outlet, the DeepNews system, built on a previous-generation model (DeepSeek-V3-0324), achieved a 25% submission acceptance rate, significantly outperforming the 0% acceptance rate of zero-shot generation by a state-of-the-art (SOTA) model (GPT-5).
[4] PARAN: Persona-Augmented Review ANswering system on Food Delivery Review Dataset
Moonsoo Park,Jeongseok Yun,Bohyung Kim
Main category: cs.CL
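TL;DR: Proposes a two-stage prompting framework that infers explicit and implicit user personas from short reviews to generate personalized replies, improving the relevance and diversity of automated responses.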
Details
Motivation: In settings with limited user information, such as food delivery platforms, large language models tend to produce generic, impersonal replies, which hurts engagement.
Method: Uses a two-stage prompting framework: first infer the user's explicit and implicit personas from the review text, then incorporate these attributes into the generation prompt, adjusting the decoding temperature to increase diversity.
Result: The method was validated on a real-world dataset from a Korean food delivery app and improved the precision, diversity, and semantic consistency of the replies.
Conclusion: Persona-augmented prompting raises the personalization and practical value of automated replies without fine-tuning the model.
Abstract: Personalized review response generation presents a significant challenge in domains where user information is limited, such as food delivery platforms. While large language models (LLMs) offer powerful text generation capabilities, they often produce generic responses when lacking contextual user data, reducing engagement and effectiveness. In this work, we propose a two-stage prompting framework that infers both explicit (e.g., user-stated preferences) and implicit (e.g., demographic or stylistic cues) personas directly from short review texts. These inferred persona attributes are then incorporated into the response generation prompt to produce user-tailored replies. To encourage diverse yet faithful generations, we adjust decoding temperature during inference. We evaluate our method using a real-world dataset collected from a Korean food delivery app, and assess its impact on precision, diversity, and semantic consistency. Our findings highlight the effectiveness of persona-augmented prompting in enhancing the relevance and personalization of automated responses without requiring model fine-tuning.
[5] Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning
Lama Alssum,Hani Itani,Hasan Abed Al Kader Hammoud,Philip Torr,Adel Bibi,Bernard Ghanem
Main category: cs.CL
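TL;DR: This paper studies the safety degradation that occurs when fine-tuning large language models (LLMs), attributes it to catastrophic forgetting, and addresses it from a continual learning (CL) perspective. Experiments show that CL methods such as DER lower attack success rates while preserving task performance across several models and tasks.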
Details
Motivation: As large language models spread, more users customize them through fine-tuning, but fine-tuning can break the original safety mechanisms (safety degradation). Keeping a model safe while adapting it to new tasks is therefore a key problem.
Method: Frames safety preservation during fine-tuning as a continual learning (CL) problem, applies regularization-based, memory-based, and model merging CL methods, evaluates them systematically under benign and poisoned user data, and compares them with standard fine-tuning and existing safety-preserving baselines.
Result: CL methods consistently achieve lower attack success rates than standard fine-tuning, with DER performing best. The findings generalize across three downstream tasks (GSM8K, SST2, Code) and three models (LLaMA2-7B, Mistral-7B, Gemma-2B) while task performance is maintained.
Conclusion: Continual learning is an effective and practical way to mitigate safety degradation during LLM fine-tuning. CL methods, DER in particular, preserve safety without sacrificing task performance.
Abstract: The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.
[6] AutoMedic: An Automated Evaluation Framework for Clinical Conversational Agents with Medical Dataset Grounding
Gyutaek Oh,Sangjoon Park,Byung-Hoon Kim
Main category: cs.CL
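TL;DR: This paper presents AutoMedic, a multi-agent simulation framework for automated evaluation of large language models (LLMs) as clinical conversational agents. It converts static question-answering datasets into virtual patient profiles and assesses performance on multiple dimensions with the CARE metric.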
Details
Motivation: Existing medical question-answering benchmarks fall short in dynamic, interactive, multi-turn clinical conversations, so a method is needed that evaluates LLMs comprehensively in complex clinical interactions.
Method: Develops the AutoMedic framework, which converts off-the-shelf static QA datasets into virtual patient profiles and simulates multi-turn dialogues between LLM agents. Evaluation uses the CARE metric, which covers accuracy, efficiency/strategy, empathy, and robustness.
Result: AutoMedic generates realistic, clinically grounded multi-turn dialogues, and human experts validated its evaluation results.
Conclusion: AutoMedic offers an effective automated evaluation scheme for clinical conversational agents and helps guide the development and refinement of LLMs for medical dialogue applications.
Abstract: Evaluating large language models (LLMs) has recently emerged as a critical issue for safe and trustworthy application of LLMs in the medical domain. Although a variety of static medical question-answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as the effectiveness of LLMs in generating responses in dynamic, interactive clinical multi-turn conversation situations and the identification of multi-faceted evaluation strategies beyond simple accuracy. However, formally evaluating a dynamic, interactive clinical situation is hindered by its vast combinatorial space of possible patient states and interaction trajectories, making it difficult to standardize and quantitatively measure such scenarios. Here, we introduce AutoMedic, a multi-agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic transforms off-the-shelf static QA datasets into virtual patient profiles, enabling realistic and clinically grounded multi-turn clinical dialogues between LLM agents. The performance of various clinical conversational agents is then assessed based on our CARE metric, which provides a multi-faceted evaluation standard of clinical conversational accuracy, efficiency/strategy, empathy, and robustness. Our findings, validated by human experts, demonstrate the validity of AutoMedic as an automated evaluation framework for clinical conversational agents, offering practical guidelines for the effective development of LLMs in conversational medical applications.
[7] Multilingual VLM Training: Adapting an English-Trained VLM to French
Jules Lahmi,Alexis Roger
Main category: cs.CL
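TL;DR: This paper explores ways to adapt an English-trained vision-language model (VLM) to other languages. It compares a translation pipeline, LoRA fine-tuning, and a two-stage fine-tuning strategy, and identifies dataset translation quality as the main bottleneck for multilingual VLM performance.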
Details
Motivation: Existing vision-language models are largely limited to English, which restricts their use by non-English speakers, so they need to be extended to more languages.
Method: Compares three methods: a translation-based pipeline, LoRA fine-tuning, and a two-stage fine-tuning strategy that separates vision adaptation from language adaptation. Evaluation uses translated multimodal benchmarks and assessments by native-speaker experts.
Result: Dataset translation quality severely constrains model performance; current data quality and translation strategies limit the effectiveness of both training and evaluation.
Conclusion: Future work should focus on building high-quality native multilingual datasets and on better translation methods to improve multilingual VLM performance.
Abstract: Artificial intelligence has made great progress in recent years, particularly in the development of Vision-Language Models (VLMs) that understand both visual and textual data. However, these advancements remain largely limited to English, reducing their accessibility for non-English speakers. It is essential to extend these capabilities to a broader range of languages. This paper explores the challenges of adapting an English-trained VLM to different languages. To this end, we will explore and compare different methods for their performance and computational cost. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we use a combination of standard multimodal benchmarks translated into the target language and manual assessments by native experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.
[8] Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
Zhaodong Wang,Zhenting Qi,Sherman Wong,Nathan Hu,Samuel Lin,Jun Ge,Erwin Gao,Yining Yang,Ben Maurer,Wenlin Chen,David Recordon,Yilun Du,Minlan Yu,Ying Zhang
Main category: cs.CL
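TL;DR: Presents the Confucius Code Agent (CCA) and the Confucius SDK for industrial-scale AI software engineering, with long-context reasoning, cross-session continual learning, and modular tool use. CCA reaches a Resolve@1 of 54.3% on SWE-Bench-Pro.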
Details
Motivation: Existing open-source coding agents fall short on industrial-scale tasks, while closed-source agents lack extensibility and controllability, so a transparent and high-performing solution is needed.
Method: Builds CCA on the Confucius SDK, which introduces hierarchical working memory, a persistent note-taking system, and a modular extension module, and uses a meta-agent to automate the build-test-improve loop.
Result: CCA achieves a Resolve@1 of 54.3% on SWE-Bench-Pro, substantially better than prior methods.
Conclusion: The Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for industrial-scale AI agents, bridging the gap between research prototypes and production systems.
Abstract: Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.
[9] Sliding Window Attention Adaptation
Yijiong Yu,Jiale Liu,Qingyun Wu,Huazheng Wang,Ji Pei
Main category: cs.CL
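TL;DR: This paper proposes Sliding Window Attention Adaptation (SWAA), a method that lets large language models pretrained with full attention adapt to sliding window attention at inference time, reducing the compute cost of long-context processing while preserving performance.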
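To make the attention pattern in this entry concrete, here is a minimal sketch of a causal sliding-window mask with optional always-visible leading positions ("sink" tokens). It is illustrative only and is not taken from the paper's code: the function names, the `window` and `num_sink` parameters, and the toy sizes are all assumptions.

```python
def sliding_window_mask(seq_len, window, num_sink=0):
    """Build a seq_len x seq_len boolean mask (hypothetical helper).

    mask[q][k] is True when query position q may attend to key position k.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q              # never attend to future positions
            in_window = q - k < window   # only the most recent `window` keys
            is_sink = k < num_sink       # leading positions stay visible
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


# A window as long as the sequence reproduces full causal attention.
full = sliding_window_mask(8, window=8)
# A short window plus one sink position allows far fewer pairs.
swa = sliding_window_mask(8, window=3, num_sink=1)
```

With a fixed window the number of allowed pairs grows linearly with sequence length, whereas full causal attention grows quadratically, which is the cost difference the entry describes.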
Details
Motivation: Self-attention in Transformer models scales quadratically with input length, which makes long-context inference expensive. Sliding window attention reduces this to linear complexity, but applying it directly to a model pretrained with full attention causes a marked performance drop, so the training-inference mismatch has to be addressed.
Method: Proposes the SWAA framework, which combines five methods and explores how they work together: applying SWA only during prefill, preserving "sink" tokens, interleaving FA/SWA layers, chain-of-thought (CoT) prompting, and fine-tuning.
Result: No single method is enough to recover performance, but specific combinations effectively restore the original long-context performance. The experiments also expose the performance-efficiency trade-offs of different configurations.
Conclusion: Models pretrained with full attention can be adapted to sliding window attention without repeating pretraining. The key is to combine several strategies, and SWAA offers flexible, efficient recipes for practical deployment.
Abstract: The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This makes us wonder: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible while non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
[10] Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers
Youmin Ko,Sungjong Seo,Hyunjoon Kim
Main category: cs.CL
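TL;DR: Proposes the CoopRAG framework, in which the retriever cooperates with the large language model and the retriever's own layers cooperate with each other, improving retrieval and answer accuracy on single-hop and multi-hop question answering.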
Details
Motivation: Existing retrieval-augmented generation methods still suffer from retrieval errors and hallucinations in single-hop and multi-hop question answering, so retrieval accuracy and reasoning coherence need to improve.
Method: Decomposes the question into sub-questions and a masked reasoning chain, retrieves documents using the question augmented with both, reranks the documents by contrasting the outputs of the retriever's layers, and has the large language model fill in the masks to reconstruct the full reasoning chain.
Result: On three multi-hop QA datasets and one single-hop QA dataset, CoopRAG outperforms current state-of-the-art methods in both retrieval and QA performance.
Conclusion: Through two-way cooperation between the retriever and the large language model, together with layer-level cooperation inside the retriever, CoopRAG improves the accuracy and robustness of question answering systems.
Abstract: Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available at https://github.com/meaningful96/CoopRAG
[11] T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground
Dmitrii Stoianov,Danil Taranets,Olga Tsymboi,Ramil Latypov,Almaz Dautov,Vladislav Kruglikov,Nikita Surkov,German Abramov,Pavel Gein,Dmitry Abulkhanov,Mikhail Gashkov,Viktor Zelenkovskiy,Artem Batalov,Aleksandr Medvedev,Anatolii Potapov
Main category: cs.CL
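TL;DR: T-pro 2.0 is an open Russian large language model for hybrid reasoning and efficient inference. It can answer directly or generate reasoning traces, and it ships with model weights, an instruction dataset, a reasoning benchmark, and decoding components to support research on Russian-language reasoning.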
Details
Motivation: To advance reproducible, extensible research on LLM reasoning in Russian and to fill the gap in open Russian LLMs with respect to efficient inference and openness of the full system.
Method: Uses a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline, supports both direct answering and reasoning-trace generation, and releases the complete set of resources for open research.
Result: Achieves low-latency inference and releases open resources, including the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights, along with a public demo that shows reasoning quality and speedups.
Conclusion: As an open, efficient Russian LLM system, T-pro 2.0 provides a solid basis for building and evaluating practical Russian LLM applications.
Abstract: We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.
[12] Semantic Reconstruction of Adversarial Plagiarism: A Context-Aware Framework for Detecting and Restoring "Tortured Phrases" in Scientific Literature
Agniva Maiti,Prajwal Panth,Suresh Chandra Satapathy
Main category: cs.CL
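TL;DR: This paper proposes SRAP, a framework for detecting and restoring plagiarized content in scientific literature that adversarial paraphrasing tools have disguised. It uses a domain-specific language model and semantic retrieval to identify "tortured phrases" and reconstruct the original terminology.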
Details
Motivation: Existing plagiarism detection methods perform poorly against new automated paraphrasing techniques. They struggle to identify plagiarism disguised by synonym substitution and similar tricks, and they cannot restore the altered technical terms or locate the original source.
Method: Uses a two-stage architecture. Stage one performs statistical anomaly detection with token-level pseudo-perplexity from SciBERT. Stage two performs source-based semantic reconstruction with FAISS vector retrieval and SBERT sentence alignment.
Result: On a parallel corpus of adversarial scientific text, zero-shot baselines fail completely (0.00% restoration accuracy) while SRAP reaches 23.67% restoration accuracy. The experiments also show that static decision boundaries give more stable detection in jargon-heavy, high-variance text.
Conclusion: SRAP detects adversarial plagiarism in scientific literature, partially restores the original terminology, and traces it to likely sources, providing an interpretable forensic tool for academic integrity.
Abstract: The integrity and reliability of scientific literature are facing a serious threat from adversarial text generation techniques, specifically from the use of automated paraphrasing tools to mask plagiarism. These tools generate "tortured phrases", statistically improbable synonyms (e.g. "counterfeit consciousness" for "artificial intelligence"), that preserve the local grammar while obscuring the original source. Most existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates for novel obfuscations and cannot determine the source of the plagiarized content. In this paper, we propose Semantic Reconstruction of Adversarial Plagiarism (SRAP), a framework designed not only to detect these anomalies but to mathematically recover the original terminology. We use a two-stage architecture: (1) statistical anomaly detection with a domain-specific masked language model (SciBERT) using token-level pseudo-perplexity, and (2) source-based semantic reconstruction using dense vector retrieval (FAISS) and sentence-level alignment (SBERT). Experiments on a parallel corpus of adversarial scientific text show that while zero-shot baselines fail completely (0.00 percent restoration accuracy), our retrieval-augmented approach achieves 23.67 percent restoration accuracy, significantly outperforming baseline methods. We also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, since dynamic thresholding fails under high variance. SRAP enables forensic analysis by linking obfuscated expressions back to their most probable source documents.
[13] Enhancing Next-Generation Language Models with Knowledge Graphs: Extending Claude, Mistral IA, and GPT-4 via KG-BERT
Nour El Houda Ben Chaabene,Hamza Hammami
Main category: cs.CL
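TL;DR: This paper proposes combining knowledge graphs (KGs) with large language models (LLMs), using KG-BERT to improve performance on knowledge-intensive tasks and to strengthen factual reliability and context awareness.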
Details
Motivation: Large language models perform well in natural language processing but lack structured knowledge, which leads to factual inconsistencies in what they generate.
Method: Integrates knowledge graphs into large language models via KG-BERT to strengthen knowledge representation and reasoning.
Result: Experiments show significant gains on knowledge-intensive tasks such as question answering and entity linking.
Conclusion: Combining knowledge graphs with LLMs improves factual accuracy and contextual understanding and offers a feasible path toward more reliable next-generation LLMs.
Abstract: Large language models (LLMs) like Claude, Mistral IA, and GPT-4 excel in NLP but lack structured knowledge, leading to factual inconsistencies. We address this by integrating Knowledge Graphs (KGs) via KG-BERT to enhance grounding and reasoning. Experiments show significant gains in knowledge-intensive tasks such as question answering and entity linking. This approach improves factual reliability and enables more context-aware next-generation LLMs.
[14] Decoding Student Minds: Leveraging Conversational Agents for Psychological and Learning Analysis
Nour El Houda Ben Chaabene,Hamza Hammami,Laid Kahloul
Main category: cs.CL
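TL;DR: This paper presents a psychologically aware conversational agent that combines large language models, a knowledge graph-enhanced BERT, and a bidirectional LSTM with attention, using multimodal data to identify students' cognitive and affective states in real time. Experiments show that the system improves motivation, reduces stress, and yields moderate academic gains.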
Details
Motivation: Conventional educational chatbots usually focus on either tutoring or emotional support. They lack an integrated view of students' cognitive and affective states, which makes personalized, comprehensive educational intervention difficult.
Method: Combines large language models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional LSTM with attention, and uses multimodal data (textual semantics, prosodic speech features, and temporal behavioral patterns) to classify students' cognitive and affective states in real time.
Result: A pilot study with university students shows that, compared with baseline methods, the system improves motivation, reduces stress, and produces moderate gains in academic performance.
Conclusion: Combining semantic reasoning, multimodal fusion, and temporal modeling is a promising basis for adaptive, student-centered educational interventions.
Abstract: This paper presents a psychologically-aware conversational agent designed to enhance both learning performance and emotional well-being in educational settings. The system combines Large Language Models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional Long Short-Term Memory (LSTM) with attention to classify students' cognitive and affective states in real time. Unlike prior chatbots limited to either tutoring or affective support, our approach leverages multimodal data, including textual semantics, prosodic speech features, and temporal behavioral trends, to infer engagement, stress, and conceptual understanding. A pilot study with university students demonstrated improved motivation, reduced stress, and moderate academic gains compared to baseline methods. These results underline the promise of integrating semantic reasoning, multimodal fusion, and temporal modeling to support adaptive, student-centered educational interventions.
[15] Grammaticality Judgments in Humans and Language Models: Revisiting Generative Grammar with LLMs
Lars G. B. Johnsen
Main category: cs.CL
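TL;DR: This paper asks whether large language models (LLMs), trained only on surface forms, reproduce classic contrasts tied to syntactic structure, such as subject-auxiliary inversion and parasitic gap licensing. The results show that LLMs distinguish grammatical from ungrammatical sentences, indicating sensitivity to syntactic structure without explicit encoding.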
Details
Motivation: To examine whether large language models are sensitive to syntactic structure or rely only on linear sequence patterns, using traditional syntactic evidence (systematic grammaticality contrasts) to test whether their behavior implies human-like structural representations.
Method: Prompts LLMs, including GPT-4 and LLaMA-3, for acceptability ratings on subject-auxiliary inversion and parasitic gap constructions, and checks whether their judgments match the predictions of syntactic structure.
Result: The LLMs reliably distinguish grammatical from ungrammatical variants of both constructions. They are sensitive to syntactic boundaries (such as the subject boundary) and capture abstract dependencies, so their behavior goes beyond simple learning of linear order.
Conclusion: Although trained only on surface forms, large language models show functional sensitivity to syntactic structure, which suggests that structural generalization can emerge from predictive training without explicit grammatical knowledge.
Abstract: What counts as evidence for syntactic structure? In traditional generative grammar, systematic contrasts in grammaticality such as subject-auxiliary inversion and the licensing of parasitic gaps are taken as evidence for an internal, hierarchical grammar. In this paper, we test whether large language models (LLMs), trained only on surface forms, reproduce these contrasts in ways that imply an underlying structural representation. We focus on two classic constructions: subject-auxiliary inversion (testing recognition of the subject boundary) and parasitic gap licensing (testing abstract dependency structure). We evaluate models including GPT-4 and LLaMA-3 using prompts eliciting acceptability ratings. Results show that LLMs reliably distinguish between grammatical and ungrammatical variants in both constructions, which supports the view that they are sensitive to structure and not just linear order. Structural generalizations, distinct from cognitive knowledge, emerge from predictive training on surface forms, suggesting functional sensitivity to syntax without explicit encoding.
[16] XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs
Iñaki Lacunza,José Javier Saiz,Alexander Shvets,Aitor Gonzalez-Agirre,Marta Villegas
Main category: cs.CL
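TL;DR: This paper proposes XDoGE, a multilingual extension of the DoGE algorithm, to improve how large language models train on mid- and low-resource languages, and releases IberianLLM-7B-Instruct, a new model centered on Iberian languages.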
Details
Motivation: Existing large models rely too heavily on high-resource languages such as English, which degrades performance in mid- and low-resource languages. This paper aims to mitigate the problem by optimizing the language distribution of the training data.
Method: Proposes the XDoGE algorithm, which uses a small proxy model for domain reweighting to determine language weights, then applies those weights to a full-size model trained from scratch or through continual pre-training (CPT).
Result: Experiments on six languages at different resource levels show that the method improves performance in mid- and low-resource languages. The IberianLLM-7B-Instruct model trained with this method is released.
Conclusion: Optimizing the language distribution of the data with XDoGE can markedly improve multilingual performance, especially in low-resource languages.
Abstract: Current large language models (LLMs) are trained on massive amounts of text data, primarily from a few dominant languages. Studies suggest that this over-reliance on high-resource languages, such as English, hampers LLM performance in mid- and low-resource languages. To mitigate this problem, we propose to (i) optimize the language distribution by training a small proxy model within a domain-reweighting DoGE algorithm that we extend to XDoGE for a multilingual setup, and (ii) rescale the data and train a full-size model with the established language weights either from scratch or within a continual pre-training phase (CPT). We target six languages possessing a variety of geographic and intra- and inter-language-family relations, namely, English and Spanish (high-resource), Portuguese and Catalan (mid-resource), Galician and Basque (low-resource). We experiment with Salamandra-2b, which is a promising model for these languages. We investigate the effects of substantial data repetition on minor languages and under-sampling on dominant languages using the IberoBench framework for quantitative evaluation. Finally, we release a new promising IberianLLM-7B-Instruct model centering on Iberian languages and English that we pretrained from scratch and further improved using CPT with the XDoGE weights.
[17] Causal Reasoning Favors Encoders: On The Limits of Decoder-Only Models
Amartya Roy,Elamparithy M,Kripabandhu Ghosh,Ponnurangam Kumaraguru,Adrian de Wynter
Main category: cs.CL
TL;DR: In-context learning (ICL) in large language models struggles with reliable causal reasoning, especially for decoder-only architectures, which are brittle to distributional shifts. Encoder and encoder-decoder models, particularly when fine-tuned, show more robust generalization in both natural and non-natural language settings, making them more suitable for cost-effective, short-horizon causal reasoning.
Details
Motivation: Causal reasoning requires multi-hop composition and strict conjunctive control, but ICL in LLMs may rely on spurious lexical cues, leading to unreliable results. It is unclear how different model architectures perform in this context.
Method: Compares fine-tuned and zero/few-shot ICL performance across encoder, encoder-decoder, and decoder-only models in both natural and non-natural language scenarios, evaluating their ability to handle causal reasoning under distributional shifts.
Result: ICL alone is insufficient for reliable causal reasoning, with decoder-only models being particularly sensitive to irrelevant features and distributional shifts. Fine-tuned encoder and encoder-decoder models generalize better, even in non-natural language settings, outperforming decoder-only models except at very large scales.
Conclusion: For robust, cost-effective causal reasoning over short horizons, fine-tuned encoder or encoder-decoder architectures are preferable to decoder-only models.
Abstract: In-context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remain unclear. Causal reasoning demands multi-hop composition and strict conjunctive control, and reliance on spurious lexical relations of the input could provide misleading results. We hypothesize that, due to their ability to project the input into a latent space, encoder and encoder-decoder architectures are better suited for said multi-hop conjunctive reasoning than decoder-only models. To test this, we compare fine-tuned versions of all the aforementioned architectures with zero- and few-shot ICL in both natural language and non-natural language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often overfocusing on irrelevant input features. In particular, decoder-only models are noticeably brittle to distributional shifts, while fine-tuned encoder and encoder-decoder models can generalize more robustly across our tests, including the non-natural language split. Both architectures are only matched or surpassed by decoder-only architectures at large scales. We conclude by noting that for cost-effective, short-horizon robust causal reasoning, encoder or encoder-decoder architectures with targeted fine-tuning are preferable.
[18] RoleRMBench & RoleRM: Towards Reward Modeling for Profile-Based Role Play in Dialogue Systems
Hang Ding,Qiming Feng,Dongqi Liu,Qi Zhao,Tao Yao,Shuo Wang,Dongsheng Chen,Jian Li,Zhenye Gan,Jiangning Zhang,Chengjie Wang,Yabiao Wang
Main category: cs.CL
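TL;DR: This paper introduces RoleRMBench, the first systematic benchmark for reward modeling in role-play dialogue, and RoleRM, a reward model trained with Continuous Implicit Preferences (CIP) that substantially outperforms existing models on narrative coherence and stylistic fidelity.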
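For readers unfamiliar with pairwise reward-model training, the following is a minimal sketch of the standard Bradley-Terry style pairwise loss that such models commonly build on. It is not RoleRM's CIP objective, whose exact form is not given in this entry; the function name and the example reward values are assumptions made for illustration.

```python
import math

def pairwise_preference_loss(r_chosen, r_rejected):
    """Standard pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Computed as softplus(-d) in a numerically stable way.
    This is the generic formulation, NOT the paper's CIP loss.
    """
    d = r_chosen - r_rejected
    if d > 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))

# Made-up scalar rewards for two candidate role-play replies.
loss_agree = pairwise_preference_loss(2.0, 0.0)     # model ranks the preferred reply higher
loss_disagree = pairwise_preference_loss(0.0, 2.0)  # model ranks it lower
```

The loss is small when the reward model already ranks the preferred reply higher and grows as the ranking is violated.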
Details
Motivation: Existing reward models struggle to capture nuanced, persona-grounded human judgments in subjective, open-ended domains such as role play, so this setting needs dedicated reward models and evaluation benchmarks. Method: Introduces the RoleRMBench benchmark, covering seven fine-grained capabilities ranging from narrative management to role consistency, and the RoleRM model, trained with Continuous Implicit Preferences (CIP), which turns subjective evaluation into continuous pairwise supervision under multiple structuring strategies. Result: Experiments show that RoleRM surpasses strong open- and closed-source reward models on RoleRMBench by over 24% on average, with the largest gains on the narrative and stylistic dimensions; they also confirm the importance of continuous preference representation and annotation consistency. Conclusion: By building a dedicated benchmark and a new training paradigm, RoleRM improves reward-model alignment on subjective dialogue tasks and offers a workable path toward subjective alignment in human-centered dialogue systems. Abstract: Reward modeling has become a cornerstone of aligning large language models (LLMs) with human preferences. Yet, when extended to subjective and open-ended domains such as role play, existing reward models exhibit severe degradation, struggling to capture nuanced and persona-grounded human judgments. To address this gap, we introduce RoleRMBench, the first systematic benchmark for reward modeling in role-playing dialogue, covering seven fine-grained capabilities from narrative management to role consistency and engagement. Evaluation on RoleRMBench reveals large and consistent gaps between general-purpose reward models and human judgment, particularly in narrative and stylistic dimensions. We further propose RoleRM, a reward model trained with Continuous Implicit Preferences (CIP), which reformulates subjective evaluation as continuous consistent pairwise supervision under multiple structuring strategies. Comprehensive experiments show that RoleRM surpasses strong open- and closed-source reward models by over 24% on average, demonstrating substantial gains in narrative coherence and stylistic fidelity. Our findings highlight the importance of continuous preference representation and annotation consistency, establishing a foundation for subjective alignment in human-centered dialogue systems.
[19] AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence
Bo Yang,Lanfei Feng,Yunkui Chen,Yu Zhang,Jianyu Zhang,Xiao Xu,Nueraili Aierken,Shijian Li
Main category: cs.CL
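TL;DR: This paper presents AgriGPT-Omni, a unified agricultural framework that integrates speech, vision, and text, together with a large-scale multilingual agricultural speech dataset and AgriBench-Omni-2K, the first tri-modal agricultural benchmark. A three-stage training paradigm enables unified reasoning across modalities and languages, and the model significantly outperforms general-purpose models on multilingual, multimodal tasks.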
Details
Motivation: Agricultural applications are constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks, which makes it hard to support sustainable AI development in low-resource regions. Method: Proposes the AgriGPT-Omni framework; builds a large-scale multilingual agricultural speech dataset with a scalable data synthesis pipeline; trains the first agricultural omni-model through a three-stage paradigm (textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning); and constructs AgriBench-Omni-2K, the first tri-modal (speech-vision-text) agricultural benchmark. Result: Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning and on real-world speech understanding; the authors state that all models, data, benchmarks, and code will be released. Conclusion: AgriGPT-Omni provides the first unified multimodal, multilingual omni-model framework and evaluation suite for agriculture, supporting reproducible research, inclusive agricultural intelligence, and sustainable AI development in low-resource regions. Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks. To address these challenges, we present AgriGPT-Omni, an agricultural omni-framework that integrates speech, vision, and text in a unified framework. First, we construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data, resulting in the largest agricultural speech dataset to date, including 492K synthetic and 1.4K real speech samples across six languages. Second, based on this, we train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning, enabling unified reasoning across languages and modalities. Third, we propose AgriBench-Omni-2K, the first tri-modal benchmark for agriculture, covering diverse speech-vision-text tasks and multilingual slices, with standardized protocols and reproducible tools. Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding. All models, data, benchmarks, and code will be released to promote reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions.
[20] From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low-Resource Languages
Smiljana Antonijevic Ubois
Main category: cs.CL
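TL;DR: Using Serbian as a case study, this work examines the historical, structural, and sociotechnical challenges facing low-resource languages in the AI era and proposes Data Care, a framework grounded in the CARE principles, for building inclusive and culturally grounded language technologies.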
Details
Motivation: Address the cultural and linguistic bias that low-resource languages inherit from LLM training's reliance on dominant languages such as English, in particular the destruction of textual heritage and the lack of cultural specificity that Serbian faces. Method: Semi-structured interviews with ten linguists, digital humanists, and AI developers are used to analyze how historical and contemporary factors jointly shape language technology for low-resource languages, leading to the proposed Data Care framework. Result: The study identifies historical destruction of texts, superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity; it finds that current engineering-first approaches overlook linguistic nuance and deepen cultural marginalization. Conclusion: Data Care reframes bias mitigation from a post hoc technical fix into a core principle of corpus design, annotation, and governance, and can serve as a replicable model for sustainable language technologies in contexts of unequal power and technological blind spots. Abstract: Large language models are commonly trained on dominant languages like English, and their representation of low-resource languages typically reflects cultural and linguistic biases present in the source language materials. Using the Serbian language as a case, this study examines the structural, historical, and sociotechnical factors shaping language technology development for low-resource languages in the AI age. Drawing on semi-structured interviews with ten scholars and practitioners, including linguists, digital humanists, and AI developers, it traces challenges rooted in historical destruction of Serbian textual heritage, intensified by contemporary issues that drive reductive, engineering-first approaches prioritizing functionality over linguistic nuance. These include superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity. To address these challenges, the study proposes Data Care, a framework grounded in CARE principles (Collective Benefit, Authority to Control, Responsibility, and Ethics), that reframes bias mitigation from a post hoc technical fix to an integral component of corpus design, annotation, and governance, and positions Data Care as a replicable model for building inclusive, sustainable, and culturally grounded language technologies in contexts where traditional LLM development reproduces existing power imbalances and cultural blind spots.
[21] Textual Data Bias Detection and Mitigation - An Extensible Pipeline with Experimental Evaluation
Rebekka Görge,Sujan Sai Gannamaneni,Tabea Naeven,Hammam Abdelwahab,Héctor Allende-Cid,Armin B. Cremers,Lennard Helmer,Michael Mock,Anna Schmitz,Songkai Xue,Elif Yildirir,Maximilian Poretschkin,Stefan Wrobel
Main category: cs.CL
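TL;DR: Proposes a comprehensive pipeline for detecting and mitigating representation bias and explicit stereotypes in text data. Its four components are evaluated on the sensitive attributes gender, religion, and age; data debiasing improves the data itself, but its effect on model bias is inconsistent, exposing limitations of current evaluation methods.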
Details
Motivation: Text data used to train LLMs carries multiple forms of bias, such as harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating bias against protected groups, but practical guidance is lacking, so a systematic approach to detecting and mitigating data bias is needed. Method: A four-component pipeline: (1) LLM-generated word lists, created according to quality criteria, detect relevant group labels; (2) the Demographic Representation Score quantifies representation bias; (3) sociolinguistically informed filtering detects and mitigates stereotypes; (4) Grammar- and Context-Aware Counterfactual Data Augmentation compensates for representation bias. A two-fold evaluation covers gender, religion, and age. Result: Human validation and baseline comparison show that the pipeline reduces representation bias and explicit stereotypes in the dataset; however, models of various sizes (0.6B-8B parameters) fine-tuned on the debiased data do not consistently improve on bias benchmarks, so data debiasing does not necessarily translate into model debiasing. Conclusion: The pipeline makes training data fairer, but existing bias evaluations struggle to capture residual model bias, which points to the need for more targeted data manipulation and more effective model-bias evaluation. Abstract: Textual data used to train large language models (LLMs) exhibits multifaceted bias manifestations encompassing harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two data bias types, namely representation bias and (explicit) stereotypes for a configurable sensitive attribute. First, we leverage LLM-generated word lists created based on quality criteria to detect relevant group labels. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate for representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using the examples of gender, religion and age. First, the effectiveness of each individual component on data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset. Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters), fine-tuned on the debiased text dataset. This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.
[22] Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
Songyang Gao,Yuzhe Gu,Zijian Wu,Lingkai Kong,Wenwei Zhang,Zhongrui Cai,Fan Zheng,Tianyou Ma,Junhao Shen,Haiteng Zhao,Duanyang Zhang,Huilun Zhang,Kuikun Liu,Chengqi Lyu,Yanhui Duan,Chiyu Chen,Ningsheng Ma,Jianfei Gao,Han Lyu,Dahua Lin,Kai Chen
Main category: cs.CL
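TL;DR: Proposes the Outcome-based Process Verifier (OPV), which verifies the reasoning process behind summarized outcomes of long chains of thought. Combined with iterative active learning and Rejection Fine-Tuning, it delivers efficient, accurate verification and significantly improves model performance.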
Details
Motivation: Existing outcome-based verifiers cannot inspect unreliable intermediate steps in long reasoning chains, while process-based verifiers are limited by the scarcity of high-quality annotations and struggle to detect complex reasoning errors reliably. Method: Proposes OPV, which performs process verification on summarized intermediate results of long chains of thought; an iterative active learning framework selects the most uncertain samples for expert annotation, and Rejection Fine-Tuning (RFT) and Reinforcement Learning with Verifiable Rewards (RLVR) progressively strengthen OPV. Result: OPV reaches an F1 score of 83.1 on the authors' held-out test set, ahead of much larger models such as Qwen3-Max-Preview; it effectively detects errors in synthetic data, in close agreement with expert judgment; and it yields clear gains when paired with policy models, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B on AIME2025 from 55.2% to 73.3%. Conclusion: OPV provides efficient, accurate verification of reasoning processes, addresses the weak supervision that conventional verifiers offer on long reasoning chains, and shows broad applicability and practical potential. Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
[23] TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage
Elroy Galbraith,Chadwick Sutherland,Donahue Morgan
Main category: cs.CL
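TL;DR: This paper presents TRIDENT, a three-layer emergency-call support system designed to improve recognition and triage of Caribbean-accented speech. It combines accent-tuned speech recognition, local entity extraction, and bio-acoustic distress detection to support dispatchers in triaging callers.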
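To make the interplay of TRIDENT's three signals concrete, here is a purely illustrative sketch of how low ASR confidence, vocal distress, and extracted clinical entities might be combined to order calls for human review. The scoring rule, the `entity_floor` value, the class and function names, and all example values are assumptions; the entry does not specify a formula, and in the described design triage decisions stay with the dispatcher.

```python
from dataclasses import dataclass

@dataclass
class CallSignals:
    call_id: str
    asr_confidence: float         # 0.0 (unintelligible) .. 1.0 (clean transcript)
    vocal_distress: float         # 0.0 (calm) .. 1.0 (highly distressed)
    clinical_entities: tuple = () # entities found by the extraction layer

def review_priority(sig, entity_floor=0.5):
    """Hypothetical ordering score for dispatcher review, not a triage decision."""
    # Low transcription confidence matters most when the caller also sounds distressed.
    score = (1.0 - sig.asr_confidence) * sig.vocal_distress
    # Calm callers can still report critical events, so extracted
    # clinical entities guarantee a minimum priority.
    if sig.clinical_entities:
        score = max(score, entity_floor)
    return score

calls = [
    CallSignals("A", asr_confidence=0.90, vocal_distress=0.2),
    CallSignals("B", asr_confidence=0.30, vocal_distress=0.9),
    CallSignals("C", asr_confidence=0.95, vocal_distress=0.1,
                clinical_entities=("chest pain",)),
]
order = [c.call_id for c in sorted(calls, key=review_priority, reverse=True)]
```

In this toy queue the distressed, poorly transcribed call B comes first, and the calm call C is still raised above A because a clinical entity was extracted.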
Details
Motivation: Existing emergency speech recognition systems degrade on non-standard English varieties such as Caribbean-accented speech, which creates unequal service; this work aims to close that gap. Method: The three-layer TRIDENT architecture comprises (1) ASR tuned for Caribbean accents, (2) local entity extraction using large language models, and (3) bio-acoustic distress detection. Low ASR confidence combined with high vocal stress is treated as a crisis cue, and semantic analysis captures clinically critical information. Result: The system gives dispatchers three complementary signals: transcription confidence, structured clinical entities, and vocal stress indicators. Low ASR confidence together with elevated stress markers can serve as a queue-prioritization signal, while the semantic layer catches critical cases that show no obvious vocal stress. Conclusion: The work establishes a framework for accent-resilient emergency AI that aims to give Caribbean callers equitable access to national triage protocols. Abstract: Emergency speech recognition systems exhibit systematic performance degradation on non-standard English varieties, creating a critical gap in services for Caribbean populations. We present TRIDENT (Transcription and Routing Intelligence for Dispatcher-Empowered National Triage), a three-layer dispatcher-support architecture designed to structure emergency call inputs for human application of established triage protocols (the ESI for routine operations and START for mass casualty events), even when automatic speech recognition fails. The system combines Caribbean-accent-tuned ASR, local entity extraction via large language models, and bio-acoustic distress detection to provide dispatchers with three complementary signals: transcription confidence, structured clinical entities, and vocal stress indicators. Our key insight is that low ASR confidence, rather than representing system failure, serves as a valuable queue prioritization signal -- particularly when combined with elevated vocal distress markers indicating a caller in crisis whose speech may have shifted toward basilectal registers. A complementary insight drives the entity extraction layer: trained responders and composed bystanders may report life-threatening emergencies without elevated vocal stress, requiring semantic analysis to capture clinical indicators that paralinguistic features miss. We describe the architectural design, theoretical grounding in psycholinguistic research on stress-induced code-switching, and deployment considerations for offline operation during disaster scenarios. This work establishes a framework for accent-resilient emergency AI that ensures Caribbean voices receive equitable access to established national triage protocols. Empirical validation on Caribbean emergency calls remains future work.
[24] OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification
Zijian Wu,Lingkai Kong,Wenwei Zhang,Songyang Gao,Yuzhe Gu,Zhongrui Cai,Tianyou Ma,Yuhong Liu,Zhi Wang,Runyuan Ma,Guangyu Wang,Wei Li,Conghui He,Dahua Lin,Kai Chen
Main category: cs.CL
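TL;DR: This paper proposes the Outcome-based Process Verifier (OPV), which verifies reasoning efficiently and accurately by working on summarized outcomes of long chains of thought. An iterative active learning framework with expert annotation improves the verifier at low annotation cost, and OPV achieves state-of-the-art results on multiple tasks.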
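The selection step of OPV's active learning loop (annotating the cases the current verifier is least sure about) can be sketched as plain uncertainty sampling. This is a minimal sketch under assumptions: the entry does not say how uncertainty is measured, so distance of the predicted probability from 0.5 is used here, and the function names, case identifiers, and probabilities are invented.

```python
def uncertainty(p_correct):
    """1.0 when the verifier is undecided (p = 0.5), 0.0 when it is certain."""
    return 1.0 - 2.0 * abs(p_correct - 0.5)

def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` most uncertain cases.

    `predictions` is a list of (case_id, p_correct) pairs, where p_correct is
    the verifier's estimated probability that the summarized reasoning is valid.
    """
    ranked = sorted(predictions, key=lambda item: uncertainty(item[1]), reverse=True)
    return [case_id for case_id, _ in ranked[:budget]]

# Made-up verifier outputs for four summarized chains of thought.
preds = [("cot-1", 0.97), ("cot-2", 0.52), ("cot-3", 0.10), ("cot-4", 0.45)]
picked = select_for_annotation(preds, budget=2)
```

In the described loop, the selected cases would go to expert annotators, and the resulting labels would feed the next round of RFT and RLVR training.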
Details
Motivation: Existing outcome-based verifiers cannot inspect unreliable intermediate steps in long reasoning chains, while process-based verifiers are limited by the scarcity of high-quality annotations and struggle to detect complex reasoning errors reliably. A more efficient, scalable, and accurate verification method is therefore needed. Method: Proposes the Outcome-based Process Verifier (OPV), which performs process verification on summarized intermediate results of long chains of thought; an iterative active learning framework selects the cases the current model is most uncertain about for expert annotation, and the next OPV is trained with Rejection Fine-Tuning (RFT) and Reinforcement Learning with Verifiable Rewards (RLVR). Result: OPV reaches an F1 score of 83.1 on OPV-Bench, ahead of much larger models such as Qwen3-Max-Preview; it effectively detects false positives in synthetic data, in close agreement with expert judgment; and it yields clear gains when paired with policy models, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B on AIME2025 from 55.2% to 73.3%. Conclusion: OPV provides efficient, accurate, and scalable reasoning verification, reduces reliance on costly human annotation through iterative active learning, and shows strong potential in practical use and in collaboration with policy models. Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
[25] Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation
Kevin Glocker,Kätriin Kukk,Romina Oji,Marcel Bollmann,Marco Kuhlmann,Jenny Kunz
Main category: cs.CL
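TL;DR: This study examines how effective upscaling a model is for adapting it to new target languages. Larger models are more data-efficient and better at preserving their original language ability, and the study also explores their potential for building multilingual systems.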
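For context on what merging language-specific models means mechanically, the sketch below shows the simplest possible variant, a weighted linear average of parameters. It is an illustrative assumption rather than the paper's method: the entry reports that merging methods differ widely in quality and does not name the ones tested, and the parameter names and values here are invented.

```python
def merge_linear(state_dicts, weights=None):
    """Weighted average of same-shaped parameter dictionaries.

    Each state dict maps a parameter name to a flat list of floats.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    keys = state_dicts[0].keys()
    if any(sd.keys() != keys for sd in state_dicts):
        raise ValueError("all models must share the same parameter names")
    merged = {}
    for name in keys:
        # zip(*...) walks the models in parallel, one parameter entry at a time.
        columns = zip(*(sd[name] for sd in state_dicts))
        merged[name] = [sum(w * v for w, v in zip(weights, col)) for col in columns]
    return merged

# Two toy "language-specific" models with made-up parameters.
lang_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
lang_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = merge_linear([lang_a, lang_b])
```

Real merging operates on full tensors and often uses more elaborate schemes than a plain average; the structure of the operation is the same.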
Details
Motivation: Medium- and lower-resource languages underperform in multilingual models, with an especially large gap to language-specific models at small model scales. Method: Systematic scaling ablations with approximately FLOP-matched models compare upscaling an English base model against standard continued pretraining for target-language adaptation. Result: Once given enough target-language data, larger upscaled models match or surpass smaller models continually pretrained on much more data; upscaling helps preserve English ability and reduces catastrophic forgetting; merging upscaled language-specific models into a multilingual system is less effective than joint training but better than merging smaller models, and performance varies widely across merging methods. Conclusion: Upscaling is a data-efficient adaptation strategy that improves performance on medium- and lower-resource languages and mitigates forgetting, while offering a workable path to modular multilingual systems; merging methods specialized for language-level integration are a direction for future improvement. Abstract: Achieving high-performing language models which include medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform compared to language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model's capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.
[26] Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting
Manurag Khullar,Utkarsh Desai,Poorva Malviya,Aman Dalmia,Zheyuan Ryan Shi
Main category: cs.CL
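TL;DR: This study examines how romanized text affects the reliability of large language models (LLMs) in maternal and newborn health triage in India. Performance drops markedly on romanized input: the models often understand the meaning, but their outputs remain unstable.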
Details
Motivation: In multilingual settings such as India, users often write in romanized text, yet existing research rarely evaluates LLMs on this real-world usage, especially in high-stakes clinical applications. Method: Leading LLMs are benchmarked on a real-world dataset of user queries spanning five Indian languages and Nepali, comparing triage performance on native-script and romanized text and analyzing the gap between what the models understand and what they output. Result: Romanized text lowers F1 scores by 5-12 points, which at the partner organization could cause nearly 2 million excess triage errors; the models usually infer the semantic intent of romanized queries correctly, so the problem lies mainly in output stability under orthographic noise. Conclusion: LLMs have a critical safety blind spot on romanized input: even when a model appears to understand the input, its decisions may be unreliable, which highlights the need to address non-standard written forms explicitly in medical AI systems. Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. In many such settings, speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely evaluates this orthographic variation using real-world data. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with F1 scores trailing those of native scripts by 5-12 points. At our partner maternal health organization in India, this gap could cause nearly 2 million excess errors in triage. Crucially, this performance gap by scripts is not due to a failure in clinical reasoning. We demonstrate that LLMs often correctly infer the semantic intent of romanized queries. Nevertheless, their final classification outputs remain brittle in the presence of orthographic noise in romanized inputs. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.
[27] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality
Aileen Cheng,Alon Jacovi,Amir Globerson,Ben Golan,Charles Kwong,Chris Alberti,Connie Tao,Eyal Ben-David,Gaurav Singh Tomar,Lukas Haas,Yonatan Bitton,Adam Bloniarz,Aijun Bai,Andrew Wang,Anfal Siddiqui,Arturo Bajuelos Castillo,Aviel Atias,Chang Liu,Corey Fry,Daniel Balle,Deepanway Ghosal,Doron Kukliansky,Dror Marcus,Elena Gribovskaya,Eran Ofek,Honglei Zhuang,Itay Laish,Jan Ackermann,Lily Wang,Meg Risdal,Megan Barnes,Michael Fink,Mohamed Amin,Moran Ambar,Natan Potikha,Nikita Gupta,Nitzan Katz,Noam Velan,Ofir Roval,Ori Ram,Polina Zablotskaia,Prathamesh Bang,Priyanka Agrawal,Rakesh Ghiya,Sanjay Ganapathy,Simon Baumgartner,Sofia Erell,Sushant Prakash,Thibault Sellam,Vikram Rao,Xuanhui Wang,Yaroslav Akulov,Yulong Yang,Zhen Yang,Zhixin Lai,Zhongru Wu,Anca Dragan,Avinatan Hassidim,Fernando Pereira,Slav Petrov,Srinivasan Venkatachary,Tulsee Doshi,Yossi Matias,Sasha Goldshtein,Dipanjan Das
Main category: cs.CL
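TL;DR: The FACTS Leaderboard is a benchmark suite that comprehensively evaluates how well language models generate factually accurate text across diverse scenarios. It comprises four sub-leaderboards, each scored by automated judge models.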
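The FACTS suite score is described as an average of the four sub-leaderboard scores. A minimal sketch of that aggregation follows, assuming an unweighted arithmetic mean; the dictionary keys and the scores are invented for illustration and are not taken from the leaderboard.

```python
from statistics import mean

COMPONENTS = ("multimodal", "parametric", "search", "grounding_v2")

def suite_score(sub_scores):
    """Unweighted mean of the four component scores (assumed aggregation)."""
    missing = [name for name in COMPONENTS if name not in sub_scores]
    if missing:
        raise ValueError(f"missing component scores: {missing}")
    return mean(sub_scores[name] for name in COMPONENTS)

# Made-up scores for a hypothetical model.
example = {"multimodal": 0.50, "parametric": 0.60, "search": 0.70, "grounding_v2": 0.80}
overall = suite_score(example)
```

Because every component has equal weight in this sketch, a model cannot offset a weak sub-leaderboard result with a single strong one by more than a quarter of the difference.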
Details
Motivation: To evaluate the factual accuracy of language models comprehensively across different scenarios, addressing the lack of systematic, broad-coverage evaluation. Method: Four sub-leaderboards are built (FACTS Multimodal, FACTS Parametric, FACTS Search, and FACTS Grounding (v2)); automated judge models score model responses, and the four scores are averaged into the overall score. Result: A maintained, comprehensive factuality evaluation suite with public and private splits that gives a fuller picture of the factual consistency and reliability of language models. Conclusion: The FACTS Leaderboard offers a standardized, multi-dimensional platform for evaluating language model factuality and supports the development of more trustworthy models. Abstract: We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
[28] LabelFusion: Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification
Michael Schlee,Christoph Weisser,Timo Kivimäki,Melchizedek Mashiku,Benjamin Saefken
Main category: cs.CL
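TL;DR: LabelFusion is a fusion ensemble that combines a traditional transformer classifier with large language models (LLMs) for text classification. By learning to fuse the two outputs, it balances accuracy against cost on multi-class and multi-label tasks.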
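The fusion step that LabelFusion describes (concatenating a classifier's embedding with per-class LLM scores and passing the result through a small MLP) can be sketched in plain Python. This is a toy sketch under assumptions, not the package's implementation or API: `TinyFusionMLP` is a hypothetical, untrained stand-in for FusionMLP, and the embedding and scores are made-up numbers.

```python
import math
import random

def fuse_features(embedding, llm_scores):
    """Concatenate an encoder embedding with per-class LLM scores."""
    return list(embedding) + list(llm_scores)

def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

class TinyFusionMLP:
    """Hypothetical one-hidden-layer MLP with random (untrained) weights."""
    def __init__(self, in_dim, hidden_dim, n_classes, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(n_classes)]
        self.b2 = [0.0] * n_classes

    def predict_proba(self, fused):
        hidden = [max(0.0, v) for v in linear(fused, self.w1, self.b1)]  # ReLU
        return softmax(linear(hidden, self.w2, self.b2))

embedding = [0.2, -0.5, 0.1, 0.7]   # made-up 4-d encoder embedding
llm_scores = [0.1, 0.8, 0.1]        # made-up LLM scores for 3 classes
fused = fuse_features(embedding, llm_scores)
model = TinyFusionMLP(in_dim=len(fused), hidden_dim=8, n_classes=3)
probs = model.predict_proba(fused)
```

In a real pipeline the MLP weights would be learned from labeled data, which is what lets the fusion layer decide how much to trust each source per class.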
Details
Motivation: Combine the stability of traditional transformer models with the reasoning ability of LLMs, avoid the high cost and latency of using LLMs alone, and improve text classification performance. Method: Embeddings from a traditional model (e.g., RoBERTa) are concatenated with per-class LLM scores obtained through structured prompt engineering and fed into a compact multi-layer perceptron (FusionMLP) for the final prediction; the whole fusion pipeline is trained end-to-end. Result: LabelFusion reaches 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification, indicating robust performance across domains. Conclusion: Through learned fusion, LabelFusion combines the strengths of traditional models and LLMs, delivering high classification accuracy at comparatively low inference cost and making it suitable for practical use. Abstract: LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone's embeddings with the LLM-derived per-class scores -- obtained through structured prompt-engineering strategies -- and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains -- achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification -- while enabling practical trade-offs between accuracy, latency, and cost.
[29] Quantifying Emotional Tone in Tolkien's The Hobbit: Dialogue Sentiment Analysis with RegEx, NRC-VAD, and Python
Lilin Qiu
Main category: cs.CL
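TL;DR: This study uses computational text analysis to examine the emotional tone of dialogue in The Hobbit. The dialogue is generally positive and calm, with gradually increasing dominance, reflecting the story's emotional rhythm of alternating tension and comfort.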
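The two core steps of this study (extracting dialogue with a regular expression, then scoring it with a valence-arousal-dominance lexicon) can be sketched as follows. This is a simplified sketch under assumptions: the toy lexicon and its values are invented and are not NRC-VAD entries, the regex only handles straight double quotes, and the sample sentence is made up rather than quoted from the novel.

```python
import re

# Toy lexicon with invented (valence, arousal, dominance) values in [0, 1].
TOY_VAD = {
    "good": (0.9, 0.4, 0.6),
    "morning": (0.7, 0.3, 0.5),
    "danger": (0.1, 0.9, 0.3),
    "dragon": (0.3, 0.8, 0.7),
}

# Matches text between straight double quotes; printed editions may use
# curly or single quotes, which would need a different pattern.
DIALOGUE_RE = re.compile(r'"([^"]+)"')

def extract_dialogue(text):
    return DIALOGUE_RE.findall(text)

def score_vad(utterance, lexicon=TOY_VAD):
    """Mean (V, A, D) over lexicon words in the utterance, or None if no word matches."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if not hits:
        return None
    return tuple(sum(h[i] for h in hits) / len(hits) for i in range(3))

sample = 'He said, "Good morning!" and later whispered, "There is danger near the dragon."'
utterances = extract_dialogue(sample)
scores = [score_vad(u) for u in utterances]
```

Plotting such per-utterance scores in narrative order would give the kind of emotional trajectory the study describes.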
Details
Motivation: Explore the emotional structure of dialogue in The Hobbit and its role in narrative rhythm, using digital methods to reveal subtle emotional shifts in a literary work. Method: Dialogue is extracted with regular expressions, preprocessed, and scored on emotional dimensions with the NRC-VAD lexicon; emotional trajectory graphs and word clouds visualize the changes. Result: The dialogue maintains high valence and low arousal overall, dominance rises gradually as the plot advances, and emotion fluctuates cyclically as tense and relaxed scenes alternate. Conclusion: Combining computational tools with literary interpretation can reveal deeper emotional structures in literature, highlighting the steady, modulated emotional rhythm of the storytelling in The Hobbit. Abstract: This study analyzes the emotional tone of dialogue in J. R. R. Tolkien's The Hobbit (1937) using computational text analysis. Dialogue was extracted with regular expressions, then preprocessed, and scored using the NRC-VAD lexicon to quantify emotional dimensions. The results show that the dialogue maintains a generally positive (high valence) and calm (low arousal) tone, with a gradually increasing sense of agency (dominance) as the story progresses. These patterns reflect the novel's emotional rhythm: moments of danger and excitement are regularly balanced by humor, camaraderie, and relief. Visualizations -- including emotional trajectory graphs and word clouds -- highlight how Tolkien's language cycles between tension and comfort. By combining computational tools with literary interpretation, this study demonstrates how digital methods can uncover subtle emotional structures in literature, revealing the steady rhythm and emotional modulation that shape the storytelling in The Hobbit.
[30] Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity
Hauke Licht
Main category: cs.CL
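TL;DR: This paper evaluates multimodal large language models (mLLMs) for video-based analysis of emotional arousal in politics. Under ideal conditions they are reliable and show little bias, but they perform poorly on real-world parliamentary debates, which underscores the need for continued, rigorous evaluation of generative AI in political analysis.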
Details
Motivation: Empirical evidence on the effectiveness of multimodal AI for emotion analysis is lacking, particularly for emotion recognition in political communication. Method: Two complementary datasets of human-labeled video recordings are used to evaluate current multimodal large language models (mLLMs) on video-based analysis of emotional arousal. Result: Under ideal conditions, mLLM arousal ratings are highly reliable and show almost no demographic bias (e.g., by gender or race); in real parliamentary debate videos, however, the ratings perform poorly, which may affect downstream statistical inference. Conclusion: mLLMs show promise in controlled settings but remain limited in complex real-world scenarios, so the use of generative AI in political analysis needs continued, rigorous evaluation within a replicable framework. Abstract: Emotions are central to politics and analyzing their role in political communication has a long tradition. As research increasingly leverages audio-visual materials to analyze the display of emotions, the emergence of multimodal generative AI promises great advances. However, we lack evidence about the effectiveness of multimodal AI in emotion analysis. This paper addresses this gap by evaluating current multimodal large language models (mLLMs) in video-based analysis of emotional arousal in two complementary data sets of human-labeled video recordings. I find that under ideal circumstances, mLLMs' emotional arousal ratings are highly reliable and show little to no indication of demographic bias. However, in recordings of speakers in real-world parliamentary debates, mLLMs' arousal ratings fail to deliver on this promise with potential negative consequences for downstream statistical inferences. This study therefore underscores the need for continued, thorough evaluation of emerging generative AI methods in political analysis and contributes a suitable replicable framework.
cs.CV [Back]
[31] Neuromorphic Eye Tracking for Low-Latency Pupil Detection
Paul Hueber,Luca Peres,Florian Pitters,Alejandro Gloriani,Oliver Rhodes
Main category: cs.CV
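TL;DR: This paper proposes an efficient event-based eye-tracking model built on spiking neural networks (SNNs). Using LIF layers and depth-wise separable convolutions, it keeps accuracy high while sharply reducing compute and power, making it suitable for low-latency, low-power wearable systems.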
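As background on the LIF (leaky integrate-and-fire) layers this model relies on, the sketch below simulates a single discrete-time LIF neuron. It is a textbook-style illustration under assumptions: the decay factor, threshold, hard reset to zero, and input sequence are chosen for the example and are not the paper's settings.

```python
def lif_run(inputs, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over a sequence of input currents.

    At each step the membrane potential decays by `beta`, integrates the input,
    and emits a spike (1) with a hard reset to 0 when it reaches `threshold`.
    """
    v = 0.0
    spikes = []
    for current in inputs:
        v = beta * v + current
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes

# Three sub-threshold inputs accumulate into one spike; a single strong input spikes at once.
spikes = lif_run([0.5, 0.5, 0.5, 0.0, 0.0, 1.2])
```

Because the neuron only communicates through sparse binary spikes, layers built from such units can avoid most of the dense multiply-accumulate work of recurrent or attention modules, which is the efficiency argument behind SNN designs.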
Details
Motivation: Conventional frame-based eye tracking suffers from motion blur, high compute cost, and limited temporal resolution, and struggles to meet the low-latency, low-power needs of wearable systems such as AR/VR. Method: Using neuromorphic sensors and spiking neural networks (SNNs), the recurrent and attention modules of top-performing event-based eye-tracking models are replaced with lightweight LIF layers, and depth-wise separable convolutions reduce model complexity. Result: The models reach 3.7-4.1 px mean error, close to the Retina system's 3.24 px, with 20x smaller model size and 850x lower theoretical compute, at an estimated 3.9-4.9 mW and 3 ms latency at 1 kHz. Conclusion: High-performing event-based eye-tracking models can be redesigned as SNNs with substantial efficiency gains while retaining accuracy suitable for real-time wearable devices. Abstract: Eye tracking for wearable systems demands low latency and milliwatt-level power, but conventional frame-based pipelines struggle with motion blur, high compute cost, and limited temporal resolution. Such capabilities are vital for enabling seamless and responsive interaction in emerging technologies like augmented reality (AR) and virtual reality (VR), where understanding user gaze is key to immersion and interface design. Neuromorphic sensors and spiking neural networks (SNNs) offer a promising alternative, yet existing SNN approaches are either too specialized or fall short of the performance of modern ANN architectures. This paper presents a neuromorphic version of top-performing event-based eye-tracking models, replacing their recurrent and attention modules with lightweight LIF layers and exploiting depth-wise separable convolutions to reduce model complexity. Our models obtain 3.7-4.1px mean error, approaching the accuracy of the application-specific neuromorphic system, Retina (3.24px), while reducing model size by 20x and theoretical compute by 850x, compared to the closest ANN variant of the proposed model. These efficient variants are projected to operate at an estimated 3.9-4.9 mW with 3 ms latency at 1 kHz. The present results indicate that high-performing event-based eye-tracking architectures can be redesigned as SNNs with substantial efficiency gains, while retaining accuracy suitable for real-time wearable deployment.
[32] ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects
Woojin Lee,Hyugjae Chang,Jaeho Moon,Jaehyup Lee,Munchurl Kim
Main category: cs.CV
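TL;DR: This paper proposes ABBSPO, a framework for weakly supervised oriented object detection (WS-OOD). With Adaptive Bounding Box Scaling (ABBS) and a Symmetric Prior Angle (SPA) loss, it substantially improves the accuracy and stability of rotated-box detection under HBox annotations.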
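HBox-supervised methods compare a ground-truth horizontal box with the axis-aligned rectangle that circumscribes a predicted rotated box (RBox). The geometry of that circumscribed rectangle can be sketched as below; this illustrates only the baseline comparison target, not the ABBS scaling rule, which is not specified in this entry. The function name and the `(cx, cy, w, h, theta)` parameterization with `theta` in radians are assumptions.

```python
import math

def rbox_to_circumscribed_hbox(cx, cy, w, h, theta):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing a rotated box.

    For a w-by-h rectangle rotated by theta, the enclosing extents are
    w*|cos| + h*|sin| horizontally and w*|sin| + h*|cos| vertically.
    """
    cos_t, sin_t = abs(math.cos(theta)), abs(math.sin(theta))
    hbox_w = w * cos_t + h * sin_t
    hbox_h = w * sin_t + h * cos_t
    return (cx - hbox_w / 2, cy - hbox_h / 2, cx + hbox_w / 2, cy + hbox_h / 2)

unrotated = rbox_to_circumscribed_hbox(0.0, 0.0, 4.0, 2.0, 0.0)
diagonal = rbox_to_circumscribed_hbox(0.0, 0.0, 2.0, 2.0, math.pi / 4)
```

The circumscribed box grows with rotation even when the object's true size is fixed, which is why comparing it directly with the ground-truth HBox can bias scale estimates, the issue the abstract attributes to earlier methods.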
Details
Motivation: Existing HBox-supervised methods compare GT HBoxes directly with the minimum circumscribed rectangles of predicted RBoxes, which leads to inaccurate scale estimation and training collapse; this work addresses those limitations. Method: Proposes the ABBS strategy, which dynamically rescales GT HBoxes to match each predicted RBox, and the SPA loss, which exploits the symmetry of aerial objects for self-supervised learning and prevents training failure under multi-view augmentation. Result: Experiments show that ABBSPO achieves state-of-the-art performance on multiple datasets, clearly outperforming existing weakly supervised methods. Conclusion: ABBSPO resolves the scale mismatch and training instability of rotated-box detection under HBox supervision and offers an efficient, accurate solution for weakly supervised oriented bounding box detection. Abstract: Weakly supervised oriented object detection (WS-OOD) has gained attention as a cost-effective alternative to fully supervised methods, providing both efficiency and high accuracy. Among weakly supervised approaches, horizontal bounding box (HBox)-supervised OOD stands out for its ability to directly leverage existing HBox annotations while achieving the highest accuracy under weak supervision settings. This paper introduces adaptive bounding box scaling and symmetry-prior-based orientation prediction, called ABBSPO, a framework for WS-OOD. Our ABBSPO addresses limitations of previous HBox-supervised OOD methods, which compare ground truth (GT) HBoxes directly with the minimum circumscribed rectangles of predicted RBoxes, often leading to inaccurate scale estimation. To overcome this, we propose: (i) Adaptive Bounding Box Scaling (ABBS), which appropriately scales GT HBoxes to optimize for the size of each predicted RBox, ensuring more accurate scale prediction; and (ii) a Symmetric Prior Angle (SPA) loss that exploits inherent symmetry of aerial objects for self-supervised learning, resolving issues in previous methods where learning collapses when predictions for all three augmented views (original, rotated, and flipped) are consistently incorrect. Extensive experimental results demonstrate that ABBSPO achieves state-of-the-art performance, outperforming existing methods.
[33] Diffusion Is Your Friend in Show, Suggest and Tell
Jia Cheng Hu,Roberto Cavicchioli,Alessandro Capotondi
Main category: cs.CV
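TL;DR: Proposes a new paradigm, Show, Suggest and Tell (SST), which combines the bidirectional refinement ability of diffusion models with the strong linguistic structure of autoregressive generation, reaching a state-of-the-art 125.1 CIDEr-D on COCO and outperforming existing autoregressive and diffusion models.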
Details
Motivation: Diffusion denoising models excel at generative tasks, but in the discrete domain they still fail to surpass autoregressive models and at best only match them, so new methods that combine the strengths of both are needed.
Method: Uses a diffusion model to provide suggestions to autoregressive generation rather than replacing it, combining the bidirectional and refining capabilities of diffusion with the linguistic structure of autoregressive models to build the SST framework.
Result: SST reaches 125.1 CIDEr-D on COCO without reinforcement learning, exceeding the best autoregressive and diffusion models by 1.5 and 2.5 points respectively; experiments confirm that the suggestion module has a positive effect on generation quality.
Conclusion: The work demonstrates a promising research direction, using diffusion models to assist rather than replace autoregressive generation, which effectively improves generation quality and leaves broad room for exploration.
Abstract: Diffusion Denoising models have demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing it. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points, respectively. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: https://github.com/jchenghu/show_suggest_tell.
[34] MetaVoxel: Joint Diffusion Modeling of Imaging and Clinical Metadata
Yihao Liu,Chenyu Gao,Lianrui Zuo,Michael E. Kim,Brian D. Boyd,Lisa L. Barnes,Walter A. Kukull,Lori L. Beason-Held,Susan M. Resnick,Timothy J. Hohman,Warren D. Taylor,Bennett A. Landman
Main category: cs.CV
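TL;DR: MetaVoxel is a joint diffusion modeling framework that models the joint distribution of medical images and clinical metadata, unifying multiple medical AI tasks and supporting flexible zero-shot inference without retraining.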
Details
Motivation: Conventional methods usually model only the conditional distribution between specific inputs and outputs, which limits flexibility and generality and makes it hard to support multi-task inference with arbitrary input combinations.
Method: Proposes MetaVoxel, a generative joint diffusion model that learns a single diffusion process covering imaging data and clinical metadata in order to model the joint distribution of all variables.
Result: Validated on more than 10,000 T1-weighted MRI scans, a single MetaVoxel model performs image generation, age estimation, and sex prediction with performance comparable to task-specific models, and shows flexible inference capabilities.
Conclusion: Joint multimodal diffusion modeling is a promising direction for unifying medical AI models and broadens their applicability in clinical settings.
Abstract: Modern deep learning methods have achieved impressive results across tasks ranging from disease classification and continuous biomarker estimation to realistic medical image generation. Most of these approaches are trained to model conditional distributions defined by a specific predictive direction with a specific set of input variables. We introduce MetaVoxel, a generative joint diffusion modeling framework that models the joint distribution over imaging data and clinical metadata by learning a single diffusion process spanning all variables. By capturing the joint distribution, MetaVoxel unifies tasks that traditionally require separate conditional models and supports flexible zero-shot inference using arbitrary subsets of inputs without task-specific retraining. Using more than 10,000 T1-weighted MRI scans paired with clinical metadata from nine datasets, we show that a single MetaVoxel model can perform image generation, age estimation, and sex prediction, achieving performance comparable to established task-specific baselines. Additional experiments highlight its capabilities for flexible inference. Together, these findings demonstrate that joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability.
[35] Independent Density Estimation
Jiahao Liu
Main category: cs.CV
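TL;DR: Proposes a new method, Independent Density Estimation (IDE), to improve the compositional generalization of vision-language models on unseen compositions.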
Details
Motivation: Existing large-scale vision-language models still struggle to achieve human-like compositional generalization.
Method: Proposes Independent Density Estimation (IDE), which learns the association between individual words in a sentence and image features. Two models are built on this idea, one using fully disentangled visual representations and one using a Variational Auto-Encoder to extract partially disentangled features from raw images, along with an entropy-based compositional inference method that combines the per-word predictions.
Result: Evaluation on multiple datasets shows that the proposed models generalize to unseen compositions better than current models.
Conclusion: IDE effectively improves the compositional generalization of vision-language models and offers a new approach toward more human-like language-vision understanding.
Abstract: Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Nevertheless, these models still encounter difficulties in achieving human-like compositional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connection between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy-based compositional inference method to combine predictions of each word in the sentence. Our models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.
[36] TraceFlow: Dynamic 3D Reconstruction of Specular Scenes Driven by Ray Tracing
Jiachen Tao,Junyi Wu,Haoxuan Wang,Zongxin Yang,Dawen Cai,Yan Yan
Main category: cs.CV
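TL;DR: This paper proposes TraceFlow, a new framework for high-fidelity rendering of dynamic specular scenes that tackles two key challenges, precise reflection direction estimation and physically accurate reflection modeling, to produce sharper and more realistic dynamic specular reflections.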
Details
Motivation: Rendering dynamic specular scenes is hampered by inaccurate reflection direction estimation and insufficient physical modeling, and existing methods struggle to produce high-quality specular reflections in complex dynamic environments.
Method: Proposes a Residual Material-Augmented 2D Gaussian Splatting representation, combined with a Dynamic Environment Gaussian and a hybrid rendering pipeline that decomposes rendering into diffuse and specular components, and adopts a coarse-to-fine training strategy to improve optimization stability.
Result: Experiments on several dynamic scene benchmarks show that TraceFlow outperforms prior methods both quantitatively and qualitatively, producing sharper and more realistic specular reflections.
Conclusion: By jointly modeling geometry, material, and environment lighting, TraceFlow effectively improves the rendering quality of dynamic specular scenes and offers a new solution for high-fidelity dynamic rendering.
Abstract: We present TraceFlow, a novel framework for high-fidelity rendering of dynamic specular scenes by addressing two key challenges: precise reflection direction estimation and physically accurate reflection modeling. To achieve this, we propose a Residual Material-Augmented 2D Gaussian Splatting representation that models dynamic geometry and material properties, allowing accurate reflection ray computation. Furthermore, we introduce a Dynamic Environment Gaussian and a hybrid rendering pipeline that decomposes rendering into diffuse and specular components, enabling physically grounded specular synthesis via rasterization and ray tracing. Finally, we devise a coarse-to-fine training strategy to improve optimization stability and promote physically meaningful decomposition. Extensive experiments on dynamic scene benchmarks demonstrate that TraceFlow outperforms prior methods both quantitatively and qualitatively, producing sharper and more realistic specular reflections in complex dynamic environments.
[37] Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information
Neelima Prasad,Jarek Reynolds,Neel Karsanbhai,Tanusree Sharma,Lotus Zhang,Abigale Stangl,Yang Wang,Leah Findlater,Danna Gurari
Main category: cs.CV
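TL;DR: Proposes a new task, hierarchical instance tracking, which tracks all instances of predefined categories of objects and parts while preserving their hierarchical relationships. It also releases the first benchmark dataset supporting the task, with 2,765 unique entities across 552 videos spanning 40 categories.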
Details
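Motivation: Existing instance tracking methods usually ignore the hierarchical relationships between objects and their parts and cannot meet the need for fine-grained structural understanding in complex scenes. A new task and dataset are therefore needed that track objects and their parts together while maintaining their hierarchy.
Method: Proposes the hierarchical instance tracking task and builds the first benchmark dataset supporting it, with 552 videos, 2,765 unique entities, and 40 object and part categories. Seven variants of four models are designed for experimental evaluation.
Result: Evaluation of the seven model variants on the new dataset shows that the dataset is challenging and that existing methods struggle with hierarchical instance tracking.
Conclusion: Hierarchical instance tracking is a promising new task, and the proposed dataset provides a foundation for future research on understanding object-part hierarchies in visual scenes.
Abstract: We propose a novel task, hierarchical instance tracking, which entails tracking all instances of predefined categories of objects and parts, while maintaining their hierarchical relationships. We introduce the first benchmark dataset supporting this task, consisting of 2,765 unique entities that are tracked in 552 videos and belong to 40 categories (across objects and parts). Evaluation of seven variants of four models tailored to our novel task reveals the new dataset is challenging. Our dataset is available at https://vizwiz.org/tasks-and-datasets/hierarchical-instance-tracking/
[38] Topological Conditioning for Mammography Models via a Stable Wavelet-Persistence Vectorization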
Charles Fanning,Mehmet Emin Aktas
Main category: cs.CV
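TL;DR: Proposes a topological data analysis method based on wavelet persistent homology to improve the generalization of breast cancer screening models across devices and populations.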
Details
Motivation: Existing breast cancer screening models lose performance when deployed across scanners, modalities, and patient populations, and suffer from substantial false negative and false positive rates.
Method: Uses a wavelet-based vectorization of persistent homology to build multi-scale spatial maps that are robust to intensity perturbations, and integrates them into detection models such as ConvNeXt Tiny through input-level channel concatenation.
Result: On the INbreast dataset, adding wavelet persistence channels raises patient-level AUC from 0.55 to 0.75, a marked improvement in external performance.
Conclusion: The method effectively improves the generalization of mammography models on heterogeneous data and is especially promising when training data is limited.
Abstract: Breast cancer is the most commonly diagnosed cancer in women and a leading cause of cancer death worldwide. Screening mammography reduces mortality, yet interpretation still suffers from substantial false negatives and false positives, and model accuracy often degrades when deployed across scanners, modalities, and patient populations. We propose a simple conditioning signal, based on a wavelet-based vectorization of persistent homology, aimed at improving external performance. Using topological data analysis, we summarize image structure that persists across intensity thresholds and convert this information into spatial, multi-scale maps that are provably stable to small intensity perturbations. These maps are integrated into a two-stage detection pipeline through input-level channel concatenation. The model is trained and validated on the CBIS-DDSM digitized film mammography cohort from the United States and evaluated on two independent full-field digital mammography cohorts from Portugal (INbreast) and China (CMMD), with performance reported at the patient level. On INbreast, augmenting ConvNeXt Tiny with wavelet persistence channels increases patient-level AUC from 0.55 to 0.75 under a limited training budget.
[39] Feature Coding for Scalable Machine Vision
Md Eimran Hossain Eimon,Juan Merlos,Ashan Perera,Hari Kalva,Velibor Adzic,Borko Furht
Main category: cs.CV
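TL;DR: This paper presents FCTM, a codec for compressing intermediate deep neural network features built on MPEG's FCM standard, which substantially reduces transmission bandwidth (85.14% on average) while preserving task accuracy, offering a scalable solution for efficient visual inference in edge computing.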
Details
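Motivation: Deploying deep neural networks on edge devices raises compute, bandwidth, and privacy challenges. Running the full model locally or offloading entirely to the cloud both involve trade-offs, so an efficient scheme for transmitting intermediate features is needed to support collaborative edge-cloud inference.
Method: Building on MPEG's Feature Coding for Machines (FCM) standard, the paper designs the Feature Coding Test Model (FCTM), which defines a bitstream syntax and codec pipeline dedicated to compressing intermediate features, and evaluates its compression efficiency and accuracy preservation across several vision tasks.
Result: FCTM achieves an average bitrate reduction of 85.14% across multiple vision tasks while preserving the inference accuracy of the original models, confirming its effectiveness in bandwidth-limited and privacy-sensitive scenarios.
Conclusion: The FCM standard and its FCTM provide an efficient, interoperable feature compression scheme for edge intelligence that significantly reduces communication overhead without sacrificing accuracy, supporting wider use of intelligent features in resource-constrained environments.
Abstract: Deep neural networks (DNNs) drive modern machine vision but are challenging to deploy on edge devices due to high compute demands. Traditional approaches, running the full model on-device or offloading to the cloud, face trade-offs in latency, bandwidth, and privacy. Splitting the inference workload between the edge and the cloud offers a balanced solution, but transmitting intermediate features to enable such splitting introduces new bandwidth challenges. To address this, the Moving Picture Experts Group (MPEG) initiated the Feature Coding for Machines (FCM) standard, establishing a bitstream syntax and codec pipeline tailored for compressing intermediate features. This paper presents the design and performance of the Feature Coding Test Model (FCTM), showing significant bitrate reductions, averaging 85.14%, across multiple vision tasks while preserving accuracy. FCM offers a scalable path for efficient and interoperable deployment of intelligent features in bandwidth-limited and privacy-sensitive consumer applications.
[40] Latent Chain-of-Thought World Modeling for End-to-End Driving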
Shuhan Tan,Kashyap Chitta,Yuxiao Chen,Ran Tian,Yurong You,Yan Wang,Wenjie Luo,Yulong Cao,Philipp Krahenbuhl,Marco Pavone,Boris Ivanovic
Main category: cs.CV
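TL;DR: This paper proposes Latent-CoT-Drive (LCDrive), a new vision-language-action model that reasons in a latent space aligned with driving actions instead of using conventional natural-language chain-of-thought (CoT), improving both reasoning efficiency and decision quality in autonomous driving.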
Details
Motivation: Existing VLA models mostly reason in natural language, but text can be an inefficient and imprecise representation; this paper explores more efficient latent representations to improve driving performance and safety.
Method: Proposes LCDrive, which unifies chain-of-thought reasoning and decision making in an action-aligned latent space. The model reasons by interleaving action-proposal tokens with world model tokens grounded in a learned latent world model. It is first trained with supervision from ground-truth future rollouts and then post-trained with closed-loop reinforcement learning.
Result: On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger gains from interactive reinforcement learning than non-reasoning and text-reasoning baselines.
Conclusion: Chain-of-thought reasoning in a latent language is more efficient than reasoning in natural language, and LCDrive offers a new paradigm for unifying reasoning and decision making in autonomous driving.
Abstract: Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.
[41] Emerging Standards for Machine-to-Machine Video Coding
Md Eimran Hossain Eimon,Velibor Adzic,Hari Kalva,Borko Furht
Main category: cs.CV
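TL;DR: This paper examines Video Coding for Machines (VCM) and Feature Coding for Machines (FCM) for machine-to-machine communication, a paradigm that reduces bandwidth, preserves privacy, and supports compute offload. Experiments show that FCM maintains accuracy close to edge inference while significantly reducing bitrate. The study also compares H.264, H.265, and H.266 as FCM inner codecs: H.265 and H.266 perform almost identically, while H.264 performs worse. For the tracking task, however, existing codec hardware can still support machine communication effectively.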
Details
Motivation: Conventional machine vision systems transmit raw pixel data using video codecs optimized for human perception, which consumes a lot of bandwidth, scales poorly, and risks exposing private content. An efficient, secure coding scheme designed for machine understanding is needed.
Method: Adopts MPEG's Video Coding for Machines (VCM) and Feature Coding for Machines (FCM) frameworks, where VCM applies task-aware coding tools in the pixel domain and FCM compresses intermediate neural network features. Experiments evaluate how different H.26X codecs (AVC, HEVC, VVC) used as FCM inner codecs affect machine vision tasks such as object detection and tracking, using BD-Rate and task accuracy as metrics.
Result: FCM maintains task accuracy close to direct edge inference while significantly reducing transmission bitrate. Replacing VVC with HEVC increases average BD-Rate by only 1.39%, nearly equivalent performance, whereas AVC increases it by 32.28% on average. For the tracking task, HEVC slightly outperforms VVC (BD-Rate of -1.81%) and AVC adds only 8.79%, indicating that existing codec hardware remains suitable.
Conclusion: FCM is an efficient scheme for machine-to-machine visual data transmission that substantially lowers bandwidth requirements and strengthens privacy protection while preserving task performance. H.265/HEVC is a practical candidate codec today: it approaches H.266/VVC on most tasks and is already widely deployed, making it suitable for future machine-centric vision systems.
Abstract: Machines are increasingly becoming the primary consumers of visual data, yet most deployments of machine-to-machine systems still rely on remote inference where pixel-based video is streamed using codecs optimized for human perception. Consequently, this paradigm is bandwidth intensive, scales poorly, and exposes raw images to third parties. Recent efforts in the Moving Picture Experts Group (MPEG) redesigned the pipeline for machine-to-machine communication: Video Coding for Machines (VCM) is designed to apply task-aware coding tools in the pixel domain, and Feature Coding for Machines (FCM) is designed to compress intermediate neural features to reduce bitrate, preserve privacy, and support compute offload. Experiments show that FCM is capable of maintaining accuracy close to edge inference while significantly reducing bitrate. Additional analysis of H.26X codecs used as inner codecs in FCM reveals that H.265/High Efficiency Video Coding (HEVC) and H.266/Versatile Video Coding (VVC) achieve almost identical machine task performance, with an average BD-Rate increase of 1.39% when VVC is replaced with HEVC. In contrast, H.264/Advanced Video Coding (AVC) yields an average BD-Rate increase of 32.28% compared to VVC. However, for the tracking task, the impact of codec choice is minimal: HEVC outperforms VVC with a BD-Rate of -1.81%, and AVC incurs a BD-Rate increase of only 8.79%, indicating that existing hardware for already deployed codecs can support machine-to-machine communication without degrading performance.
[42] Multi-dimensional Preference Alignment by Conditioning Reward Itself
Jiho Jang,Jinyoung Kim,Kyungjune Baek,Nojun Kwak
Main category: cs.CV
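TL;DR: This paper proposes MCDPO, a new reinforcement-learning-from-human-feedback method that resolves conflicts among multi-dimensional rewards in diffusion models.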
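To make the reward conflict in this entry concrete, here is a minimal sketch that contrasts a single aggregated Bradley-Terry preference with a per-axis preference outcome vector. The reward values, the sum aggregation, and the helper names `bt_prob` and `preference_vector` are illustrative assumptions; the sketch does not reproduce the MCDPO objective.

```python
import math

def bt_prob(r_a, r_b):
    """Bradley-Terry probability that the sample with reward r_a is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

def preference_vector(a, b):
    """Per-axis outcome: 1 where sample a beats sample b on that axis, else 0."""
    return {axis: 1 if a[axis] > b[axis] else 0 for axis in a}

# Hypothetical per-axis rewards for two generated images.
a = {"aesthetic": 0.9, "alignment": 0.2}
b = {"aesthetic": 0.5, "alignment": 0.8}

# Aggregating to one scalar (here: a plain sum, an assumption) prefers b overall,
# so a scalar objective pushes away from a, including its better aesthetics.
scalar_a, scalar_b = sum(a.values()), sum(b.values())
print("P(b preferred, aggregated):", round(bt_prob(scalar_b, scalar_a), 3))

# Keeping the axes separate preserves the fact that a wins on aesthetics.
print("per-axis outcome for a vs b:", preference_vector(a, b))
```

In this toy case the aggregated comparison favors `b` while the per-axis vector still records that `a` is better on the aesthetic axis, which is the information a conditioning vector of preference outcomes could expose to the model.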
Details
Motivation: Standard DPO uses the Bradley-Terry model to aggregate multiple evaluation dimensions into a single scalar reward, which causes reward conflicts between dimensions.
Method: Proposes Multi Reward Conditional DPO (MCDPO), which introduces a disentangled Bradley-Terry objective, trains with a preference outcome vector as a condition, and applies dimensional reward dropout to balance optimization.
Result: Experiments on Stable Diffusion 1.5 and SDXL show that MCDPO performs better on benchmarks and supports dynamic control of multiple reward dimensions at inference time.
Conclusion: MCDPO effectively resolves reward conflicts, enables independent optimization of each dimension, and can amplify specific reward dimensions without additional training.
Abstract: Reinforcement Learning from Human Feedback has emerged as a standard for aligning diffusion models. However, we identify a fundamental limitation in the standard DPO formulation because it relies on the Bradley-Terry model to aggregate diverse evaluation axes like aesthetic quality and semantic alignment into a single scalar reward. This aggregation creates a reward conflict where the model is forced to unlearn desirable features of a specific dimension if they appear in a globally non-preferred sample. To address this issue, we propose Multi Reward Conditional DPO (MCDPO). This method resolves reward conflicts by introducing a disentangled Bradley-Terry objective. MCDPO explicitly injects a preference outcome vector as a condition during training, which allows the model to learn the correct optimization direction for each reward axis independently within a single network. We further introduce dimensional reward dropout to ensure balanced optimization across dimensions. Extensive experiments on Stable Diffusion 1.5 and SDXL demonstrate that MCDPO achieves superior performance on benchmarks. Notably, our conditional framework enables dynamic and multiple-axis control at inference time using Classifier Free Guidance to amplify specific reward dimensions without additional training or external reward models.
[43] Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective
Tian Liu,Anwesha Basu,James Caverlee,Shu Kong
Main category: cs.CV
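TL;DR: This paper proposes SWIFT, a new method for semi-supervised few-shot learning (SSFSL) that uses open-source vision-language models (VLMs) together with simple classifier initialization and temperature tuning to substantially improve the utilization of unlabeled data, outperforming existing methods by about 5 accuracy points on multiple benchmarks and even rivaling supervised learning trained with ground-truth labels.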
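The effect of temperature tuning on pseudo-label confidence can be sketched with a plain softmax. The similarity scores, the temperature of 0.01, and the 0.95 confidence threshold below are illustrative assumptions (confidence thresholding is a common choice in SSL methods, not a detail confirmed by this entry), and the helper `softmax` is written here only for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature divisor."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical image-text similarity scores for four classes.
sims = [0.30, 0.25, 0.22, 0.20]
THRESHOLD = 0.95  # assumed pseudo-label confidence cutoff

flat = softmax(sims, temperature=1.0)    # nearly uniform probabilities
sharp = softmax(sims, temperature=0.01)  # same ranking, much more peaked

for name, probs in (("T=1.0", flat), ("T=0.01", sharp)):
    keep = max(probs) >= THRESHOLD
    print(f"{name}: max prob = {max(probs):.3f}, pseudo-label kept = {keep}")
```

With the flat distribution no sample would pass the cutoff, so unlabeled data would go unused; lowering the temperature keeps the predicted class the same but raises its confidence above the cutoff.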
Details
Motivation: Existing SSFSL research overlooks powerful open-source VLMs and their pretraining data, even though these resources have already been used successfully in few-shot learning. To enable real-world auto-annotation, SSFSL needs to make full use of these open resources.
Method: The authors find that directly applying conventional SSL methods to finetune a VLM works poorly because the VLM's output probability distribution is too flat. They propose SWIFT, which combines classifier initialization and temperature tuning to raise pseudo-label confidence, improve the utilization of unlabeled data, and strengthen supervision signals, with finetuning carried out in stages.
Result: Experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by about 5 accuracy points and approaches supervised learning with ground-truth labels.
Conclusion: SWIFT effectively addresses the weak supervision caused by flat VLM output distributions in SSFSL and offers a simple, efficient way to advance SSFSL with open-source VLMs.
Abstract: Semi-supervised few-shot learning (SSFSL) formulates real-world applications like "auto-annotation", as it aims to learn a model over a few labeled and abundant unlabeled examples to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. In contrast, the related area of few-shot learning (FSL) has already exploited them to boost performance. Arguably, to achieve auto-annotation in the real world, SSFSL should leverage such open-source resources. To this end, we start by applying established SSL methods to finetune a VLM. Counterintuitively, they significantly underperform FSL baselines. Our in-depth analysis reveals the root cause: VLMs produce rather "flat" distributions of softmax probabilities. This results in zero utilization of unlabeled data and weak supervision signals. We address this issue with embarrassingly simple techniques: classifier initialization and temperature tuning. They jointly increase the confidence scores of pseudo-labels, improving the utilization rate of unlabeled data, and strengthening supervision signals. Building on this, we propose Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to effectively finetune a VLM on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Extensive experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by about 5 accuracy points. SWIFT even rivals supervised learning, which finetunes VLMs with the unlabeled data being labeled with ground truth!
[44] RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection
Zhuo Wang,Xiliang Liu,Ligang Sun
Main category: cs.CV
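TL;DR: RobustSora is a new benchmark for evaluating robustness to digital watermarks in AI-generated video detection. By building a diverse dataset of authentic and generated videos, it reveals that existing detectors partially depend on watermarks and highlights the need for watermark-aware training strategies to improve robustness.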
Details
Motivation: Existing AIGC video detection benchmarks overlook the fact that digital watermarks embedded by generative models may affect detector performance, which makes evaluations less robust; a benchmark dedicated to measuring the influence of watermarks is therefore needed.
Method: Builds a dataset of 6,500 videos covering four types: Authentic-Clean, Authentic-Spoofed with fake watermarks, Generated-Watermarked, and Generated-DeWatermarked. Two tasks are defined: Task-I tests detection of watermark-removed AI videos, and Task-II assesses false alarm rates on authentic videos with fake watermarks. Ten mainstream detection models are evaluated.
Result: Experiments show performance variations of 2-8 percentage points under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp).
Conclusion: Current AIGC video detectors partially depend on watermark signals, which harms their robustness; RobustSora provides tools and an evaluation standard for developing more reliable detection.
Abstract: The proliferation of AI-generated video technologies poses challenges to information integrity. While recent benchmarks advance AIGC video detection, they overlook a critical factor: many state-of-the-art generative models embed digital watermarks in outputs, and detectors may partially rely on these patterns. To evaluate this influence, we present RobustSora, a benchmark designed to assess watermark robustness in AIGC video detection. We systematically construct a dataset of 6,500 videos comprising four types: Authentic-Clean (A-C), Authentic-Spoofed with fake watermarks (A-S), Generated-Watermarked (G-W), and Generated-DeWatermarked (G-DeW). Our benchmark introduces two evaluation tasks: Task-I tests performance on watermark-removed AI videos, while Task-II assesses false alarm rates on authentic videos with fake watermarks. Experiments with ten models spanning specialized AIGC detectors, transformer architectures, and MLLM approaches reveal performance variations of 2-8pp under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp). These findings indicate partial watermark dependency and highlight the need for watermark-aware training strategies. RobustSora provides essential tools to advance robust AIGC detection research.
[45] THE-Pose: Topological Prior with Hybrid Graph Fusion for Estimating Category-Level 6D Object Pose
Eunho Lee,Chaehyeon Song,Seunghoon Jeong,Ayoung Kim
Main category: cs.CV
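TL;DR: THE-Pose is a new category-level 6D pose estimation framework that introduces a topological prior and hybrid graph fusion to combine 2D image context with 3D geometric structure, performing well in complex and occluded scenes.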
Details
Motivation: Existing 3D graph convolution methods focus only on local geometry and depth information, so they cope poorly with intra-class variation and visual ambiguity, and their performance is limited on complex or occluded objects.
Method: Proposes the THE-Pose framework, which extracts invariant and consistent topological features from the image domain and uses a Hybrid Graph Fusion (HGF) module to adaptively fuse point-cloud features with topological features, bridging 2D and 3D information.
Result: Experiments on the REAL275 dataset show a 35.8% improvement over the 3D-GC baseline HS-Pose and a 7.2% gain over the previous best method.
Conclusion: By introducing a topological prior and multimodal feature fusion, THE-Pose significantly improves the robustness and accuracy of category-level pose estimation, especially for unseen or complex objects.
Abstract: Category-level object pose estimation requires both global context and local structure to ensure robustness against intra-class variations. However, 3D graph convolution (3D-GC) methods only focus on local geometry and depth information, making them vulnerable to complex objects and visual ambiguities. To address this, we present THE-Pose, a novel category-level 6D pose estimation framework that leverages a topological prior via surface embedding and hybrid graph fusion. Specifically, we extract consistent and invariant topological features from the image domain, effectively overcoming the limitations inherent in existing 3D-GC based methods. Our Hybrid Graph Fusion (HGF) module adaptively integrates the topological features with point-cloud features, seamlessly bridging 2D image context and 3D geometric structure. These fused features ensure stability for unseen or complicated objects, even under significant occlusions. Extensive experiments on the REAL275 dataset show that THE-Pose achieves a 35.8% improvement over the 3D-GC baseline (HS-Pose) and surpasses the previous state-of-the-art by 7.2% across all key metrics. The code is available at https://github.com/EHxxx/THE-Pose
[46] GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule
Rui Wang,Yimu Sun,Jingxing Guo,Huisi Wu,Jing Qin
Main category: cs.CV
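TL;DR: This paper proposes GDKVM, a new architecture for echocardiography video segmentation that introduces Linear Key-Value Association (LKVA), a Gated Delta Rule (GDR), and a Key-Pixel Feature Fusion (KPFF) module to improve segmentation accuracy and robustness while maintaining real-time performance.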
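For readers unfamiliar with delta-rule memories, the sketch below shows one update step of a gated delta rule on a small key-value memory matrix, assuming the generic form used in the linear-attention literature, S_t = alpha * S_{t-1} (I - beta * k k^T) + beta * v k^T. Whether GDKVM uses exactly this formulation is an assumption; the function names and toy vectors are hypothetical.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update of memory S (rows: value dims, cols: key dims).

    alpha in [0, 1] decays the old memory; beta in [0, 1] is the write strength.
    """
    # What the memory currently returns for key k.
    Sk = [sum(row[j] * k[j] for j in range(len(k))) for row in S]
    # Erase the old association for k, decay the rest, then write v under k.
    return [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]

def read(S, q):
    """Retrieve the value associated with query q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]

# Toy example with a unit-norm key: a second write under the same key
# replaces the stored value instead of accumulating on top of it.
S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
S = gated_delta_step(S, key, [3.0, 4.0], alpha=1.0, beta=1.0)
first = read(S, key)
S = gated_delta_step(S, key, [5.0, 6.0], alpha=1.0, beta=1.0)
second = read(S, key)
print(first, second)
```

The memory stays a fixed-size matrix regardless of how many frames are written, which is why rules of this kind are attractive when long video sequences must be processed efficiently.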
Details
Motivation: Existing methods face a trade-off between capturing long-range spatiotemporal dependencies and maintaining computational efficiency, and they are sensitive to noise, artifacts, and cardiac deformation, which makes accurate chamber segmentation difficult.
Method: Proposes the GDKVM model, which uses LKVA to model inter-frame correlations, GDR to store intermediate memory states efficiently, and KPFF to fuse local and global features at multiple scales for better noise resistance.
Result: Validated on two mainstream datasets, CAMUS and EchoNet-Dynamic, GDKVM outperforms existing state-of-the-art methods in segmentation accuracy and robustness while remaining real-time.
Conclusion: GDKVM effectively balances spatiotemporal modeling and computational efficiency in echocardiography video, significantly improves cardiac chamber segmentation, and has potential for clinical use.
Abstract: Accurate segmentation of cardiac chambers in echocardiography sequences is crucial for the quantitative analysis of cardiac function, aiding in clinical diagnosis and treatment. The imaging noise, artifacts, and the deformation and motion of the heart pose challenges to segmentation algorithms. While existing methods based on convolutional neural networks, Transformers, and space-time memory networks have improved segmentation accuracy, they often struggle with the trade-off between capturing long-range spatiotemporal dependencies and maintaining computational efficiency with fine-grained feature representation. In this paper, we introduce GDKVM, a novel architecture for echocardiography video segmentation. The model employs Linear Key-Value Association (LKVA) to effectively model inter-frame correlations, and introduces a Gated Delta Rule (GDR) to efficiently store intermediate memory states. A Key-Pixel Feature Fusion (KPFF) module is designed to integrate local and global features at multiple scales, enhancing robustness against boundary blurring and noise interference. We validated GDKVM on two mainstream echocardiography video datasets (CAMUS and EchoNet-Dynamic) and compared it with various state-of-the-art methods. Experimental results show that GDKVM outperforms existing approaches in terms of segmentation accuracy and robustness, while ensuring real-time performance. Code is available at https://github.com/wangrui2025/GDKVM.
[47] VLM-NCD: Novel Class Discovery with Vision-Based Large Language Models
Yuetong Su,Baoguo Wei,Xinyu Wang,Xu Li,Lixin Li
Main category: cs.CV
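TL;DR: This paper proposes LLM-NCD, a multimodal framework that fuses visual-textual semantics with prototype-guided clustering to overcome the limits of existing image-based novel class discovery (NCD) methods in feature discriminability and long-tail distributions, improving accuracy on unknown classes by up to 25.3% on CIFAR-100 and showing, for the first time, distinctive robustness to long-tail distributions.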
Details
Motivation: Existing NCD methods based on visual features are bottlenecked by insufficient feature discriminability and long-tail data distributions, which makes it hard to discover new classes effectively.
Method: Proposes the LLM-NCD framework, which jointly optimizes image and text features of known classes to model cluster centres and semantic prototypes, and designs a dual-phase discovery mechanism that dynamically separates known and novel samples using semantic affinity thresholds and adaptive clustering.
Result: On CIFAR-100, accuracy on unknown classes improves by up to 25.3% over existing methods, with marked robustness on long-tail data.
Conclusion: By fusing multimodal semantic information with a new clustering mechanism, LLM-NCD significantly improves novel class discovery, with a particular advantage on the long-tail data common in real-world settings, and offers a new direction for NCD.
Abstract: Novel Class Discovery aims to utilise prior knowledge of known classes to classify and discover unknown classes from unlabelled data. Existing NCD methods for images primarily rely on visual features, which suffer from limitations such as insufficient feature discriminability and the long-tail distribution of data. We propose LLM-NCD, a multimodal framework that breaks this bottleneck by fusing visual-textual semantics and prototype-guided clustering. Our key innovation lies in modelling cluster centres and semantic prototypes of known classes by jointly optimising known class image and text features, and a dual-phase discovery mechanism that dynamically separates known and novel samples via semantic affinity thresholds and adaptive clustering. Experiments on the CIFAR-100 dataset show that, compared to current methods, this method achieves up to 25.3% improvement in accuracy for unknown classes. Notably, our method shows unique resilience to long-tail distributions, a first in NCD literature.
[48] Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction
Chen Ziwen,Hao Tan,Peng Wang,Zexiang Xu,Li Fuxin
Main category: cs.CV
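TL;DR: Long-LRM++ combines a semi-explicit scene representation with a lightweight decoder, achieving real-time rendering while keeping high rendering quality, overcoming the speed bottleneck of existing implicit methods, and generalizing better to more input views and to depth prediction.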
Details
Motivation: Existing Gaussian splatting (GS) methods are error-sensitive when directly predicting large numbers of Gaussian parameters, which blurs fine details, while implicit representation methods render at high quality but require costly per-frame decompression that prevents real-time rendering. A method is needed that keeps the benefits of implicit representations while rendering efficiently.
Method: Long-LRM++ uses a semi-explicit scene representation that encodes scene information compactly and pairs it with a lightweight decoder for fast decoding, avoiding the high compute cost of the full transformer or TTT backbone used by prior implicit methods, while supporting up to 64 input views.
Result: Long-LRM++ matches the rendering quality of LaCT on DL3DV while rendering in real time at 14 FPS on an A100 GPU; its novel-view depth prediction on ScanNetv2 is better than direct depth rendering from Gaussians, and it scales well to more input views.
Conclusion: By combining a semi-explicit representation with a lightweight decoder, Long-LRM++ balances rendering quality and efficiency, offering a practical real-time solution for large-scale, high-quality novel view synthesis.
Abstract: Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at 950×540 resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the 950×540 resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.
[49] Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation
Hongsin Lee,Hye Won Chung
Main category: cs.CV
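TL;DR: This paper proposes SAAD, a new adversarial distillation method that improves the transfer of adversarial robustness through sample-wise adaptive reweighting, and finds that adversarial transferability is a key factor in robustness transfer.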
Details
Motivation: Existing adversarial distillation methods neglect state-of-the-art robust teachers, and stronger teachers do not always yield more robust students, a phenomenon called robust saturation that conventional explanations such as capacity gaps do not fully account for.
Method: Proposes Sample-wise Adaptive Adversarial Distillation (SAAD), which reweights training examples by their adversarial transferability (how effective student-crafted adversarial examples are against the teacher) without additional computational cost.
Result: Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD achieves better AutoAttack robustness than prior methods.
Conclusion: Adversarial transferability is a key factor in transferring adversarial robustness, and SAAD's adaptive weighting effectively improves student robustness.
Abstract: Adversarial distillation in the standard min-max adversarial training framework aims to transfer adversarial robustness from a large, robust teacher network to a compact student. However, existing work often neglects to incorporate state-of-the-art robust teachers. Through extensive analysis, we find that stronger teachers do not necessarily yield more robust students, a phenomenon known as robust saturation. While typically attributed to capacity gaps, we show that such explanations are incomplete. Instead, we identify adversarial transferability (the fraction of student-crafted adversarial examples that remain effective against the teacher) as a key factor in successful robustness transfer. Based on this insight, we propose Sample-wise Adaptive Adversarial Distillation (SAAD), which reweights training examples by their measured transferability without incurring additional computational cost. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD consistently improves AutoAttack robustness over prior methods. Our code is available at https://github.com/HongsinLee/saad.
[50] MotionEdit: Benchmarking and Learning Motion-Centric Image Editing
Yixin Wan,Lei Ke,Wenhao Yu,Kai-Wei Chang,Dong Yu
Main category: cs.CV
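TL;DR: This paper introduces MotionEdit, a high-quality image editing dataset focused on motion edits, and the accompanying benchmark MotionEdit-Bench for evaluating models on action transformations. Existing diffusion models perform poorly on this task, so the authors propose MotionNFT, which uses motion alignment rewards for post-training and significantly improves the fidelity and quality of motion edits.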
Details
Motivation: Existing image editing datasets focus mainly on static appearance changes and lack high-quality, continuous, physically plausible motion editing samples, which limits applications such as action editing and animation generation. A motion-centric, high-fidelity dataset and evaluation benchmark is therefore needed. Method: Builds the MotionEdit dataset by extracting and verifying high-fidelity motion-transformation image pairs from continuous videos; designs the MotionEdit-Bench benchmark, which evaluates models with generative, discriminative, and preference-based metrics; proposes the MotionNFT framework, which uses motion alignment rewards to guide fine-tuning and improve motion editing accuracy. Result: Experiments show that existing SOTA diffusion models perform poorly on motion editing; MotionNFT significantly improves editing quality and motion fidelity on both FLUX.1 Kontext and Qwen-Image-Edit while preserving general editing ability. Conclusion: MotionEdit provides a new standard and challenge for motion-centric image editing, and MotionNFT effectively improves model performance on this task, advancing image editing toward dynamic content generation. Abstract: We introduce MotionEdit, a novel dataset for motion-centric image editing: the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. 
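The motion alignment reward described above can be illustrated with a small sketch. The abstract does not specify how flow agreement is scored, so the function below is an assumption: it represents each flow field as a list of hypothetical `(dx, dy)` vectors and uses mean cosine similarity, rescaled to [0, 1], as the reward. It is not the authors' MotionNFT implementation.

```python
import math


def motion_alignment_reward(pred_flow, gt_flow, eps=1e-8):
    """Score how well a predicted flow field matches a ground-truth one.

    Both arguments are equal-length lists of (dx, dy) vectors. The score is
    the mean cosine similarity between paired vectors, mapped from [-1, 1]
    to [0, 1]. This scoring rule is an illustrative assumption.
    """
    if len(pred_flow) != len(gt_flow) or not gt_flow:
        raise ValueError("flow fields must be non-empty and of equal length")
    total = 0.0
    for (px, py), (gx, gy) in zip(pred_flow, gt_flow):
        dot = px * gx + py * gy
        norm = math.hypot(px, py) * math.hypot(gx, gy)
        total += dot / (norm + eps)  # eps guards against zero-length vectors
    mean_cos = total / len(gt_flow)
    return 0.5 * (mean_cos + 1.0)
```

Under this assumed rule, an edit whose motion matches the ground truth scores close to 1, an edit moving in the opposite direction scores close to 0, and motion orthogonal to the ground truth scores 0.5.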
Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.
[51] ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions
Xiaoxue Wu,Xinyuan Chen,Yaohui Wang,Yu Qiao
Main category: cs.CV
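TL;DR: This paper proposes ShotDirector, an efficient framework for film-like, controllable shot transitions in multi-shot video generation. Through parameter-level camera control and hierarchical editing-pattern-aware prompting, it improves the coherence of narrative expression.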
Details
Motivation: Existing methods focus mainly on low-level visual consistency across shots and overlook how shot transition design and cinematographic language affect narrative coherence, resulting in a lack of intentional editing patterns. Method: Introduces a camera control module that incorporates 6-DoF poses and intrinsic settings, and uses a shot-aware mask mechanism with hierarchical prompts based on professional editing patterns to achieve fine-grained content control. Result: Builds the ShotWeaver40K dataset and a set of evaluation metrics; experiments show that the framework achieves film-like shot transitions in controllable multi-shot video generation. Conclusion: By combining parameter-level conditions with high-level semantic guidance, ShotDirector significantly improves narrative quality and the expression of directorial intent in multi-shot video generation. Abstract: Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.
[52] Physically Aware 360$^\circ$ View Generation from a Single Image using Disentangled Scene Embeddings
Karthikeya KV,Narendra Bandaru
Main category: cs.CV
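TL;DR: Disentangled360 is a new 3D-aware technique that combines direction-disentangled volume rendering with single-image 360° view synthesis. It targets medical imaging and natural scene reconstruction, offering high fidelity without scene-specific fine-tuning.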
Details
Motivation: Existing methods either oversimplify anisotropic light behavior or lack cross-scene generalization, making it hard to deliver high-quality view synthesis for both medical imaging and real-world scenes. Method: Proposes the Disentangled360 framework, which separates isotropic and anisotropic contributions within a Gaussian Splatting backbone; uses a dual-branch conditioning mechanism to handle CT-intensity-driven scattering and RGB scenes separately; introduces a hybrid pose-agnostic anchoring method to resolve scale ambiguity and preserve structural realism. Result: Achieves better SSIM and LPIPS on the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets, with runtime efficiency sufficient for interactive applications. Conclusion: Disentangled360 delivers high-quality 360° view synthesis without scene-specific fine-tuning or expensive photon simulation, and is applicable to mixed-reality medical supervision, robotic perception, and immersive content creation. Abstract: We introduce Disentangled360, an innovative 3D-aware technology that integrates the advantages of direction-disentangled volume rendering with single-image 360° unique view synthesis for applications in medical imaging and natural scene reconstruction. In contrast to current techniques that either oversimplify anisotropic light behavior or lack generalizability across various contexts, our framework distinctly differentiates between isotropic and anisotropic contributions inside a Gaussian Splatting backbone. We implement a dual-branch conditioning framework, one optimized for CT-intensity-driven scattering in volumetric data and the other for real-world RGB scenes through normalized camera embeddings. To address scale ambiguity and maintain structural realism, we present a hybrid pose-agnostic anchoring method that adaptively samples scene depth and material transitions, functioning as stable pivots during scene distillation. Our design integrates preoperative radiography simulation and consumer-grade 360° rendering into a singular inference pipeline, facilitating rapid, photorealistic view synthesis with inherent directionality. Evaluations on the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets indicate superior SSIM and LPIPS performance, while runtime assessments confirm its viability for interactive applications. Disentangled360 facilitates mixed-reality medical supervision, robotic perception, and immersive content creation, eliminating the necessity for scene-specific finetuning or expensive photon simulations.
[53] Efficient-VLN: A Training-Efficient Vision-Language Navigation Model
Duo Zheng,Shijia Huang,Yanyang Li,Liwei Wang
Main category: cs.CV
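TL;DR: Proposes Efficient-VLN, an efficient vision-language navigation model. It eases long-sequence processing with a progressive memory and a learnable recursive memory, and introduces a dynamic mixed policy to balance exploration and efficiency, reaching SOTA performance at a much lower training cost.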
Details
Motivation: Existing MLLMs incur excessive training overhead in VLN because of long historical observations and the exploration-efficiency trade-off in DAgger, which limits practical use. Method: Designs two efficient memory mechanisms: a progressive memory (dynamically allocating more tokens to recent observations) and a learnable recursive memory (using the KV cache of learnable tokens as the memory state); introduces a dynamic mixed policy to balance exploration against training and inference efficiency. Result: Achieves SOTA performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR) while consuming only 282 H800 GPU hours, a significant reduction in training overhead. Conclusion: Efficient-VLN effectively eases the training burden of multimodal large models in VLN, strikes a good balance between performance and efficiency, and is more practical. Abstract: Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that utilizes the key-value cache of learnable tokens as the memory state. Moreover, we introduce a dynamic mixed policy to balance the exploration-efficiency trade-off. Extensive experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR). Critically, our model consumes merely 282 H800 GPU hours, demonstrating a dramatic reduction in training overhead compared to state-of-the-art methods.
[54] DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation
Anh M. Vu,Khang P. Le,Trang T. K. Vo,Ha Thach,Huy Hung Nguyen,David Yang,Han H. Huynh,Quynh Nguyen,Tuan M. Pham,Tuan-Anh Le,Minh H. N. Le,Thanh-Huy Nguyen,Akash Awasthi,Chandra Mohan,Zhu Han,Hien Van Nguyen
Main category: cs.CV
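TL;DR: Proposes a prototype-driven weakly supervised semantic segmentation framework based on vision-language alignment, which fuses text and image prototypes to improve region discovery in pathology images.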
Details
Motivation: Existing WSSS methods are limited by inter-class homogeneity, intra-class heterogeneity, and the region shrinkage caused by CAM; annotation cost needs to be reduced while localization accuracy improves. Method: Uses CoOp-style learnable prompt tuning to generate text prototypes, combines them with learnable image prototypes to build a dual-modal prototype bank, and introduces a multi-scale pyramid module to mitigate oversmoothing in ViT representations. Result: Surpasses existing state-of-the-art methods on the BCSS-WSSS benchmark, and confirms the benefits of text description diversity, context length, and the complementarity of the two prototype types. Conclusion: Jointly leveraging textual semantics and visual prototype learning effectively improves WSSS performance in digital pathology. Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.
[55] ConStruct: Structural Distillation of Foundation Models for Prototype-Based Weakly Supervised Histopathology Segmentation
Khang Le,Ha Thach,Anh M. Vu,Trang T. K. Vo,Han H. Huynh,David Yang,Minh H. N. Le,Thanh-Huy Nguyen,Akash Awasthi,Chandra Mohan,Zhu Han,Hien Van Nguyen
Main category: cs.CV
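TL;DR: Proposes a prototype learning framework that combines the strengths of CONCH and SegFormer for weakly supervised semantic segmentation of pathology images. Text-guided initialization and structural distillation produce high-quality pseudo-masks, improving localization completeness and semantic consistency.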
Details
Motivation: Existing weakly supervised segmentation methods mostly rely on classification backbones, which tend to miss the full spatial extent of tissue structures; effective ways to combine the strengths of vision-language models and modern segmentation networks are also lacking. Method: Designs a prototype learning framework that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment; uses text-guided prototype initialization to generate more complete pseudo-masks and structural distillation to preserve fine-grained morphological features. Result: Outperforms existing WSSS methods on the BCSS-WSSS dataset, generating high-quality pseudo-masks and significantly improving localization completeness and semantic consistency while remaining computationally efficient. Conclusion: The proposed framework effectively combines the strengths of vision-language models and segmentation backbones, achieving more accurate and spatially coherent weakly supervised pathology image segmentation without pixel-level annotations. Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision and without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo-masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. 
Experiments on BCSS-WSSS datasets demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.
[56] Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset
Hyunsoo Lee,Daeum Jeon,Hyeokjae Oh
Main category: cs.CV
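TL;DR: Proposes Point2Pose, a new 3D human pose estimation method based on point clouds and pose history that combines spatio-temporal encoders with an attention-based generative regressor, and releases MVPose3D, a large-scale multimodal dataset.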
Details
Motivation: 3D human pose estimation is challenged by the complex geometry of the human body, self-occluding joints, and the lack of large-scale real-world data; existing methods struggle to fully model temporal and spatial relationships. Method: Designs the Point2Pose framework, which uses a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, predicts 3D poses with an attention-based generative regressor, and is trained and validated on the MVPose3D multimodal dataset. Result: Outperforms baseline models on multiple datasets, with stronger robustness in complex motion and occlusion scenarios. Conclusion: Point2Pose effectively models the distribution of poses conditioned on point cloud sequences and pose history, improving 3D human pose estimation accuracy, and the released dataset is a valuable resource for future research. Abstract: We propose a novel generative approach for 3D human pose estimation. 3D human pose estimation poses several key challenges due to the complex geometry of the human body, self-occluding joints, and the requirement for large-scale real-world motion datasets. To address these challenges, we introduce Point2Pose, a framework that effectively models the distribution of human poses conditioned on sequential point cloud and pose history. Specifically, we employ a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor. Additionally, we present a large-scale indoor dataset MVPose3D, which contains multiple modalities, including IMU data of non-trivial human motions, dense multi-view point clouds, and RGB images. Experimental results show that the proposed method outperforms the baseline models, demonstrating its superior performance across various datasets.
[57] EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs
Chao Gong,Depeng Wang,Zhipeng Wei,Ya Guo,Huijia Zhu,Jingjing Chen
Main category: cs.CV
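TL;DR: Proposes the EchoingPixels framework, which uses a Cross-Modal Semantic Sieve (CS2) and Synchronization-Augmented RoPE (Sync-RoPE) to compress audio and video tokens jointly and dynamically in audio-visual LLMs. It reaches comparable performance with only 5-20% of the tokens and delivers a 2-3x speedup and memory reduction.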
Details
Motivation: Audio-visual large language models (AV-LLMs) suffer heavy computational overhead from the large number of audio and video tokens. Existing unimodal compression methods cannot exploit cross-modal synergies, fixed per-modality token budgets do not fit the differing and dynamic information densities of audio and video, and an effective token reduction scheme for the joint stream is missing. Method: Proposes the EchoingPixels framework, centered on the Cross-Modal Semantic Sieve (CS2), which merges audio and video tokens into a single pool for joint attention, enabling cross-modal interaction and adaptive allocation of the token budget; also designs Synchronization-Augmented RoPE (Sync-RoPE) to preserve temporal relationships among the sparsely selected tokens and retain temporal modeling capability. Result: Experiments show that EchoingPixels matches strong baselines while using only 5-20% of the original tokens, with a 2-3x inference speedup and memory reduction. Conclusion: Through a joint audio-visual token pool and adaptive compression, EchoingPixels effectively addresses the high computational cost of AV-LLMs, balances efficiency and performance, and offers a new approach to efficient inference in multimodal large models. Abstract: Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and reduces tokens from an entire combined pool of audio-visual tokens rather than using fixed budgets per modality. This single-pool approach allows it to adaptively allocate the token budget across both modalities and dynamically identify salient tokens in concert. To ensure this aggressive reduction preserves the vital temporal modeling capability, we co-design a Synchronization-Augmented RoPE (Sync-RoPE) to maintain critical temporal relationships for the sparsely selected tokens. 
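The difference between a single token pool and fixed per-modality budgets can be shown with a minimal sketch. It assumes each token already carries a scalar saliency score; in the paper these would come from CS2's joint attention, whereas here they are plain hypothetical inputs. The function name and the `keep_ratio` parameter are illustrative, not taken from the paper.

```python
def select_tokens_single_pool(audio_scores, video_scores, keep_ratio):
    """Keep the top-scoring tokens from one combined audio-visual pool.

    audio_scores / video_scores: per-token saliency scores (hypothetical
    inputs). keep_ratio: fraction of all tokens to keep. Returns the kept
    token indices per modality, sorted to preserve temporal order.
    """
    pool = [("audio", i, s) for i, s in enumerate(audio_scores)]
    pool += [("video", i, s) for i, s in enumerate(video_scores)]
    k = max(1, round(keep_ratio * len(pool)))
    # Rank all tokens together, so the budget is not split per modality.
    ranked = sorted(pool, key=lambda item: item[2], reverse=True)[:k]
    kept = {"audio": [], "video": []}
    for modality, index, _ in ranked:
        kept[modality].append(index)
    for indices in kept.values():
        indices.sort()
    return kept
```

With `audio_scores=[0.9, 0.8, 0.7]`, `video_scores=[0.1, 0.2, 0.3]`, and `keep_ratio=0.5`, all three kept tokens come from the audio stream, which a fixed half-and-half budget could not do.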
Extensive experiments demonstrate that EchoingPixels achieves performance comparable to strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction.
[58] StainNet: A Special Staining Self-Supervised Vision Transformer for Computational Pathology
Jiawen Li,Jiali Hu,Xitong Ling,Yongqiang Lv,Yuxuan Chen,Yizhi Wang,Tian Guan,Yifei Liu,Yonghong He
Main category: cs.CV
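TL;DR: StainNet is a specialized foundation model based on the vision transformer architecture. Trained with self-distillation self-supervised learning on special-stain pathology images, it performs strongly on liver malignancy classification and ROI-level datasets.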
Details
Motivation: Existing pathology foundation models are pre-trained mainly on H&E-stained images and adapt poorly to the special-stain images common in clinical practice, which limits their use. Method: Proposes StainNet, built on the ViT architecture and trained with a self-distillation self-supervised approach on more than 1.4 million special-stain image patches from the HISTAI database. Result: StainNet is validated on an in-house slide-level liver malignancy classification task and two public ROI-level datasets, and it performs strongly in few-shot learning and retrieval, outperforming recent, larger pathology foundation models. Conclusion: StainNet is the first foundation model designed specifically for special-stain pathology images; it significantly improves applicability and performance in special-stain settings, and the model weights have been released to the community. Abstract: Foundation models trained with self-supervised learning (SSL) on large-scale histological images have significantly accelerated the development of computational pathology. These models can serve as backbones for region-of-interest (ROI) image analysis or patch-level feature extractors in whole-slide images (WSIs) based on multiple instance learning (MIL). Existing pathology foundation models (PFMs) are typically pre-trained on Hematoxylin-Eosin (H&E) stained pathology images. However, images with special stains, such as immunohistochemistry, are also frequently used in clinical practice. PFMs pre-trained mainly on H&E-stained images may be limited in clinical applications involving special stains. To address this issue, we propose StainNet, a specialized foundation model for special stains based on the vision transformer (ViT) architecture. StainNet adopts a self-distillation SSL approach and is trained on over 1.4 million patch images cropped from 20,231 publicly available special staining WSIs in the HISTAI database. To evaluate StainNet, we conduct experiments on an in-house slide-level liver malignancy classification task and two public ROI-level datasets to demonstrate its strong ability. We also perform few-ratio learning and retrieval evaluations, and compare StainNet with recent, larger PFMs to further highlight its strengths. We have released the StainNet model weights at: https://huggingface.co/JWonderLand/StainNet.
[59] Simple Yet Effective Selective Imputation for Incomplete Multi-view Clustering
Cai Xu,Jinlong Liu,Yilin Zhang,Ziyu Guan,Wei Zhao
Main category: cs.CV
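TL;DR: This paper proposes ISMVC, an informativeness-based selective imputation method for multi-view clustering. It evaluates the imputation-relevant informativeness of each missing position, imputes only when there is sufficient support, and combines this with a variational autoencoder that learns clustering-friendly latent representations, improving clustering of incomplete multi-view data under unbalanced missingness.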
Details
Motivation: Existing imputation methods tend to introduce noise and bias when handling incomplete multi-view data, while imputation-free methods lack cross-view complementarity under severe missingness and struggle to cluster effectively. Method: Proposes ISMVC, which evaluates the informativeness of each missing position from intra-view similarity and cross-view consistency and imputes selectively; it is combined with a variational autoencoder with a mixture-of-Gaussians prior to perform distribution-level imputation and model imputation uncertainty, improving fusion robustness. Result: The method is validated on multiple benchmark datasets and, especially in the more realistic and challenging unbalanced missing scenario, outperforms existing imputation-based and imputation-free methods. Conclusion: Through a data-driven, lightweight, and model-agnostic selective imputation strategy, ISMVC balances imputation noise against information loss and provides a more robust solution for incomplete multi-view clustering. Abstract: Incomplete multi-view data, where different views suffer from missing and unbalanced observations, pose significant challenges for clustering. Existing imputation-based methods attempt to estimate missing views to restore data associations, but indiscriminate imputation often introduces noise and bias, especially when the available information is insufficient. Imputation-free methods avoid this risk by relying solely on observed data, but struggle under severe incompleteness due to the lack of cross-view complementarity. To address this issue, we propose Informativeness-based Selective imputation Multi-View Clustering (ISMVC). Our method evaluates the imputation-relevant informativeness of each missing position based on intra-view similarity and cross-view consistency, and selectively imputes only when sufficient support is available. Furthermore, we integrate this selection with a variational autoencoder equipped with a mixture-of-Gaussians prior to learn clustering-friendly latent representations. By performing distribution-level imputation, ISMVC not only stabilizes the aggregation of posterior distributions but also explicitly models imputation uncertainty, enabling robust fusion and preventing overconfident reconstructions. Compared with existing cautious imputation strategies that depend on training dynamics or model feedback, our method is lightweight, data-driven, and model-agnostic. It can be readily integrated into existing IMC models as a plug-in module. 
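The "impute only when sufficient support is available" rule can be sketched as follows. The abstract does not give the informativeness formula, so this is a simplified stand-in: support is the mean cosine similarity, measured in the other observed views, between the incomplete sample and its `k` best neighbors that do have the target view. The data layout, `k`, and the threshold `tau` (assumed positive) are all hypothetical, and the sketch imputes feature vectors rather than the distributions used in the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def selective_impute(views, sample, target_view, k=2, tau=0.8):
    """Impute views[target_view][sample] only if neighbor support >= tau.

    views: dict mapping view name -> list of feature vectors, with None
    marking a missing observation. Returns (imputed_vector_or_None, score).
    """
    candidates = []
    for j, target_feat in enumerate(views[target_view]):
        if j == sample or target_feat is None:
            continue
        # Similarity to neighbor j, measured in views both samples observe.
        sims = [cosine(feats[sample], feats[j])
                for name, feats in views.items()
                if name != target_view
                and feats[sample] is not None and feats[j] is not None]
        if sims:
            candidates.append((sum(sims) / len(sims), j))
    top = sorted(candidates, reverse=True)[:k]
    if not top:
        return None, 0.0
    score = sum(s for s, _ in top) / len(top)
    if score < tau or score <= 0:
        return None, score  # insufficient support: leave the entry missing
    total = sum(s for s, _ in top)
    dim = len(views[target_view][top[0][1]])
    imputed = [sum(s * views[target_view][j][d] for s, j in top) / total
               for d in range(dim)]
    return imputed, score
```

A sample whose neighbors are all dissimilar gets `None` back, which corresponds to the imputation-free fallback for that position.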
Extensive experiments on multiple benchmark datasets under a more realistic and challenging unbalanced missing scenario demonstrate that our method outperforms both imputation-based and imputation-free approaches.
[60] Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation
Yiwen Tang,Zoey Guo,Kaixin Zhu,Ray Zhang,Qizhi Chen,Dongzhi Jiang,Junli Liu,Bohan Zeng,Haoming Song,Delin Qu,Tianyi Bai,Dan Xu,Wentao Zhang,Bin Zhao
Main category: cs.CV
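TL;DR: This paper presents the first systematic study of reinforcement learning (RL) for text-to-3D autoregressive generation. It introduces a new benchmark, MME-3DR, a hierarchical RL algorithm, Hi-GRPO, and AR3D-R1, the first RL-enhanced text-to-3D generation model.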
Details
Motivation: Because 3D objects have higher spatial complexity and 3D generation is sensitive to reward design and RL algorithms, applying reinforcement learning to 3D generation remains largely unexplored, so its feasibility and optimization paths need systematic study. Method: Investigates four aspects: (1) reward design, evaluating how well multi-modal models work as reward signals; (2) RL algorithms, studying GRPO variants and the effect of data and training scale; (3) a new benchmark, MME-3DR, for evaluating the reasoning abilities of 3D generation models; (4) a hierarchical RL method, Hi-GRPO, which optimizes progressively from global shape to local texture. Result: Confirms the effectiveness of multi-modal reward models, finds that token-level optimization outperforms sequence-level optimization, proposes the more efficient Hi-GRPO algorithm, and releases AR3D-R1, which covers the full generation pipeline from coarse shape to texture refinement and improves 3D generation quality along several dimensions. Conclusion: Reinforcement learning can effectively advance text-to-3D generation; sound reward design and hierarchical optimization are key, and the study lays groundwork for reasoning and optimization in future 3D generation. Abstract: Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has recently been successfully extended to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, covering the full pipeline from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. 
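For readers unfamiliar with GRPO, its core step is normalizing rewards within a group of samples drawn for the same prompt. The sketch below shows that step and one hypothetical way to apply it per stage, with separate rewards for global shape and local texture. The per-stage split, the function names, and the use of the population standard deviation are assumptions for illustration; the abstract does not describe Hi-GRPO at this level of detail.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The
    population standard deviation is an assumption; implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def hierarchical_advantages(shape_rewards, texture_rewards):
    """Hypothetical two-stage variant: one advantage set per stage."""
    if len(shape_rewards) != len(texture_rewards):
        raise ValueError("both stages must score the same group of samples")
    return {
        "global_shape": group_relative_advantages(shape_rewards),
        "local_texture": group_relative_advantages(texture_rewards),
    }
```

Because advantages are relative to the group, a sample is reinforced only when it beats its siblings, and a group with identical rewards contributes zero advantage.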
Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.
[61] A Conditional Generative Framework for Synthetic Data Augmentation in Segmenting Thin and Elongated Structures in Biological Images
Yi Liu,Yichi Zhang
Main category: cs.CV
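TL;DR: Proposes a Pix2Pix-based conditional generative framework that generates realistic filament structures in microscopy images from binary masks, and introduces a filament-aware structural loss to improve the structural similarity of generated images, easing the shortage of high-quality annotated data for filament segmentation.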
Details
Motivation: Because filament structures are densely distributed and geometrically complex, manually producing high-quality pixel-level annotations is extremely time-consuming and laborious, leaving deep learning models short of training data. Method: Builds a conditional generative model on the Pix2Pix architecture that converts binary masks into realistic microscopy images, and designs a filament-aware structural loss that increases the structural similarity between generated and real images. Result: Experiments show that the method generates structurally realistic filament images, and that models trained with the synthetic data outperform existing methods trained without it. Conclusion: The method eases the lack of annotated data for segmenting filamentous structures, improves generated image quality through the structure-aware loss, and offers a practical data augmentation approach for biological image analysis. Abstract: Thin and elongated filamentous structures, such as microtubules and actin filaments, often play important roles in biological systems. Segmenting these filaments in biological images is a fundamental step for quantitative analysis. Recent advances in deep learning have significantly improved the performance of filament segmentation. However, acquiring high-quality pixel-level annotated datasets for filamentous structures remains a major challenge, as the dense distribution and geometric properties of filaments make manual annotation extremely laborious and time-consuming. To address the data shortage problem, we propose a conditional generative framework based on the Pix2Pix architecture to generate realistic filaments in microscopy images from binary masks. We also propose a filament-aware structural loss to improve the structure similarity when generating synthetic images. Our experiments demonstrate the effectiveness of our approach, which outperforms existing models trained without synthetic data.
[62] Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution
Yi-Cheng Liao,Shyang-En Weng,Yu-Syuan Xu,Chi-Wei Hsiao,Wei-Chen Chiu,Ching-Chun Huang
Main category: cs.CV
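TL;DR: This paper proposes HD-CLIP, a plug-in module for real-world image super-resolution that uses hierarchical degradation modeling and classifier-free projection guidance to improve the restoration ability of diffusion models under unknown, complex degradations.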
Details
Motivation: Existing methods rely on CLIP text encoders and assume the degradation severity is known, so they cannot capture numerical severity and generalize poorly. Method: Proposes HD-CLIP, which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding, and integrates it into diffusion models with classifier-free projection guidance (CFPG) as a training-free plug-and-play enhancement. Result: HD-CLIP significantly improves detail fidelity and perceptual realism on multiple real-world datasets and supports interpolation to unseen degradation levels. Conclusion: As a general guidance module that needs no fine-tuning, HD-CLIP effectively strengthens the performance and robustness of diffusion models for real-world image super-resolution. Abstract: Real-World Image Super-Resolution (Real-ISR) aims to recover high-quality images from low-quality inputs degraded by unknown and complex real-world factors. Real-world scenarios involve diverse and coupled degradations, making it necessary to provide diffusion models with richer and more informative guidance. However, existing methods often assume known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, limiting their generalization ability. To address this, we propose HD-CLIP (Hierarchical Degradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen levels. Furthermore, we integrate it into diffusion models via classifier-free guidance (CFG) and propose classifier-free projection guidance (CFPG). HD-CLIP leverages semantic cues to guide generative restoration while using degradation cues to suppress undesired hallucinations and artifacts. As a plug-and-play module, HD-CLIP can be seamlessly integrated into various super-resolution frameworks without training, significantly improving detail fidelity and perceptual realism across diverse real-world datasets.
[63] CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
Shresth Grover,Priyank Pathak,Akash Kumar,Vibhav Vineet,Yogesh S Rawat
Main category: cs.CV
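TL;DR: This paper introduces the CoSPlan benchmark for evaluating large-scale vision-language models on error-prone visual sequential planning tasks, and proposes SGI, a training-free method that improves model reasoning.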
Details
Motivation: Large-scale vision-language models perform well at complex reasoning but remain underexplored in visual sequential planning, especially in scenarios containing non-optimal steps. Real applications often include erroneous steps that models must detect and correct, so a more realistic benchmark and better reasoning mechanisms are needed. Method: Proposes the CoSPlan benchmark, covering four domains including maze navigation and block rearrangement, to evaluate error detection and step completion; also proposes SGI, which adds incremental scene graph updates to strengthen reasoning about intermediate states of an action sequence without any training. Result: Existing VLMs (such as Intern-VLM and Qwen2) perform poorly on CoSPlan and struggle to use contextual cues to reach the goal; SGI yields an average performance gain of 5.2% and generalizes to traditional planning tasks such as Plan-Bench and VQA. Conclusion: Corrective sequential planning remains a significant challenge for vision-language models; by incrementally updating scene graphs, SGI provides an effective intermediate reasoning mechanism that improves reliability and generality on complex visual tasks. Abstract: Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal action) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.
[64] Topology-Agnostic Animal Motion Generation from Text Prompt
Keyi Chen,Mingze Sun,Zhenyu Liu,Zhangquan Chen,Ruqi Huang
Main category: cs.CV
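TL;DR: This paper proposes a new motion generation framework that produces coherent, plausible animal motion from a text description and an arbitrary skeleton, addressing the poor generalization of existing methods across skeletal topologies.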
Details
Motivation: Most existing motion generation methods rely on fixed skeletal templates and generalize poorly to different or perturbed skeletal topologies; large-scale heterogeneous animal motion data and a unified generative framework are both lacking. Method: Builds OmniZoo, a large-scale animal motion dataset with 140 species and 32,979 sequences, and proposes a generalized autoregressive generation framework whose core is a topology-aware skeleton embedding module that encodes the geometric and structural properties of any skeleton into a shared token space for fusion with textual semantics. Result: Given a text prompt and a target skeleton, the method generates temporally coherent, physically plausible, and semantically consistent motion, and supports cross-species motion style transfer. Conclusion: With the OmniZoo dataset and the proposed framework, the paper significantly improves the generalization of motion generation across skeletal topologies and enables text-driven motion generation and style transfer for arbitrary animal skeletons. Abstract: Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods: the combined lack of large-scale heterogeneous animal motion data and unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.
[65] Hybrid Transformer-Mamba Architecture for Weakly Supervised Volumetric Medical Segmentation
Yiheng Lyu,Lian Xu,Mohammed Bennamoun,Farid Boussaid,Coen Arrow,Girish Dwivedi
Main category: cs.CV
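TL;DR: Proposes TranSamba, a hybrid Transformer-Mamba architecture for weakly supervised 3D medical image segmentation. Cross-slice modeling captures volumetric context with linear time complexity and constant memory use, significantly improving performance.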
Details
Motivation: Existing weakly supervised semantic segmentation methods mostly rely on 2D encoders and ignore the inherently three-dimensional structure of medical images, so contextual information is underused. Method: Designs TranSamba, a hybrid Transformer-Mamba architecture that adds Cross-Plane Mamba blocks to a Vision Transformer, using the linear complexity of state space models for efficient information exchange between neighboring slices and enhancing within-slice self-attention to improve object localization. Result: Extensive experiments on three datasets show that TranSamba outperforms existing methods across modalities and pathologies and sets a new SOTA, with linear time complexity and constant memory use in batch processing. Conclusion: TranSamba effectively models 3D context, significantly improves weakly supervised volumetric medical image segmentation, and is efficient and scalable; code and models are open-sourced. Abstract: Weakly supervised semantic segmentation offers a label-efficient solution to train segmentation models for volumetric medical imaging. However, existing approaches often rely on 2D encoders that neglect the inherent volumetric nature of the data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context for weakly supervised volumetric medical segmentation. TranSamba augments a standard Vision Transformer backbone with Cross-Plane Mamba blocks, which leverage the linear complexity of state space models for efficient information exchange across neighboring slices. The information exchange enhances the pairwise self-attention within slices computed by the Transformer blocks, directly contributing to the attention maps for object localization. TranSamba achieves effective volumetric modeling with time complexity that scales linearly with the input volume depth and maintains constant memory usage for batch processing. Extensive experiments on three datasets demonstrate that TranSamba establishes new state-of-the-art performance, consistently outperforming existing methods across diverse modalities and pathologies. Our source code and trained models are openly accessible at: https://github.com/YihengLyu/TranSamba.
[66] mmCounter: Static People Counting in Dense Indoor Scenarios Using mmWave Radar
Tarik Reza Toha,Shao-Jung Lu,Shahriar Nirjon
Main category: cs.CV
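TL;DR: mmCounter is a new method that uses mmWave radar to extract ultra-low-frequency signals (such as breathing and small body movements) to count people accurately in dense, static groups, enabling efficient people counting in high-density indoor environments.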
Details
Motivation: Existing mmWave radars are limited in detecting stationary people and struggle to count accurately in high-density static scenes; most studies also assume the number of people is known, which does not hold in real, complex environments.
Method: A multi-stage signal processing pipeline extracts human-related ultra-low-frequency signals (< 1 Hz), uses spatial information to distinguish individuals, and suppresses background noise to count static groups accurately.
Result: Tests in multiple environments show an average F1 score of 87% and a mean absolute error of 0.6 in familiar environments, and an F1 score of 60% and an error of 1.1 in unseen environments. The system can count up to seven people within three square meters with no side-by-side spacing and only one meter of front-to-back distance.
Conclusion: mmCounter removes traditional mmWave radar's reliance on motion and is the first to achieve accurate people counting in high-density static scenes, with strong practicality and environmental adaptability.
Abstract: mmWave radars struggle to detect or count individuals in dense, static (non-moving) groups due to limitations in spatial resolution and reliance on movement for detection. We present mmCounter, which accurately counts static people in dense indoor spaces (up to three people per square meter). mmCounter achieves this by extracting ultra-low frequency (< 1 Hz) signals, primarily from breathing and micro-scale body movements such as slight torso shifts, and applying novel signal processing techniques to differentiate these subtle signals from background noise and nearby static objects. Our problem differs significantly from existing studies on breathing rate estimation, which assume the number of people is known a priori. In contrast, mmCounter utilizes a novel multi-stage signal processing pipeline to extract relevant low-frequency sources along with their spatial information and map these sources to individual people, enabling accurate counting. Extensive evaluations in various environments demonstrate that mmCounter delivers an 87% average F1 score and 0.6 mean absolute error in familiar environments, and a 60% average F1 score and 1.1 mean absolute error in previously untested environments. It can count up to seven individuals in a three square meter space, such that there is no side-by-side spacing and only a one-meter front-to-back distance.
[67] Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task
Sunqi Fan,Jiashuo Cui,Meng-Hao Guo,Shuojin Yang
Main category: cs.CV
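TL;DR: This paper proposes a Video Toolkit and the STAR framework to strengthen the spatiotemporal reasoning of multimodal large language models (MLLMs) in video question answering; lightweight tool scheduling significantly improves performance.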
Details
Motivation: On complex video question answering tasks, existing MLLMs struggle to model intra-frame spatial relationships and temporal causal dynamics at the same time, and they lack an effective mechanism for joint spatiotemporal reasoning.
Method: The authors design a comprehensive, extensible Video Toolkit and propose the STAR framework to coordinate the order of temporal and spatial tool calls, progressively localizing the key area in the video.
Result: Gains of 8.2% on VideoMME and 4.6% on LongVideoBench, significantly better than the baseline model.
Conclusion: The proposed Video Toolkit and STAR framework effectively strengthen the spatiotemporal reasoning of MLLMs and advance the development of autonomous, intelligent video analysis assistants.
Abstract: The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle with simultaneously modeling spatial relationships within video frames and understanding the causal dynamics of temporal evolution on complex, reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities and ensure the harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework make an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.
[68] Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models
Woojun Jung,Jaehoon Go,Mingyu Jeon,Sunjae Yoon,Junyeong Kim
Main category: cs.CV
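TL;DR: Proposes Visual Funnel, a training-free two-step method that addresses "Contextual Blindness", a failure that multimodal large models show in fine-grained visual perception when their input lacks structural diversity.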
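As a rough illustration of the entropy-based crop sizing that this entry describes, the sketch below computes the Shannon entropy of an attention map and turns it into a crop side length. The abstract does not give the actual mapping, so the linear interpolation and the names `attention_entropy`, `crop_side_from_entropy`, `min_side`, and `max_side` are assumptions made for this example, not the authors' formulation.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of non-negative attention weights.

    Assumes the weights have a positive sum; they are normalized here.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def crop_side_from_entropy(weights, min_side, max_side):
    """Map attention entropy to a crop side length (hypothetical rule).

    Focused attention (low entropy) gives a tight crop; diffuse attention
    (high entropy) gives a wide crop. The linear mapping is an assumption.
    """
    n = len(weights)
    if n < 2:
        return float(min_side)
    ratio = attention_entropy(weights) / math.log(n)  # normalized to [0, 1]
    return min_side + ratio * (max_side - min_side)
```

Under this rule a one-hot attention map yields `min_side` and a uniform map yields `max_side`; a real system would presumably also refine the crop center, which this sketch leaves out.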
Details
Motivation: When tasks require fine visual detail, existing MLLMs often fail because they ignore the structural relationship between global context and local detail, a failure termed "Contextual Blindness".
Method: Contextual Anchoring locates the region of interest; an Entropy-Scaled Portfolio containing multiple levels of context is then built dynamically from attention entropy, forming a hierarchical input structure that runs from the focal point to its surroundings.
Result: Experiments show that Visual Funnel significantly outperforms single-crop and unstructured multi-crop baselines. Adding more unordered crops has limited or even harmful effect, which confirms the importance of structured input.
Conclusion: Structural diversity in the visual input matters more than the quantity of information; by preserving hierarchical context, Visual Funnel effectively mitigates Contextual Blindness.
Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to a structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.
[69] Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos
Mingyu Jeon,Jisoo Yang,Sungjin Han,Jinkwon Hwang,Sunjae Yoon,Jonghee Kim,Junyeoung Kim
Main category: cs.CV
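TL;DR: Proposes P2S, a training-free zero-shot framework for moment retrieval in long videos. Adaptive span generation and query decomposition overcome inefficient search and costly refinement, and on hour-long videos the framework significantly outperforms existing supervised methods.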
Details
Motivation: Existing long-video moment retrieval methods are limited by computational infeasibility and candidate explosion. Supervised methods generalize poorly, while zero-shot methods produce too many candidates in the search phase and rely on high-cost VLMs for verification, which makes them inefficient.
Method: The P2S framework has two innovations: (1) an Adaptive Span Generator that avoids candidate explosion in the search phase; (2) a Query Decomposition strategy that refines candidates without relying on high-cost vision-language models (VLMs).
Result: P2S is the first framework to achieve zero-shot temporal grounding in hour-long videos and surpasses supervised state-of-the-art methods on multiple datasets, for example by 3.7% on R5@0.1 on the MAD dataset.
Conclusion: P2S effectively addresses search efficiency and refinement cost in zero-shot long-video moment retrieval, delivering high-performance, training-free temporal grounding with good scalability and application prospects.
Abstract: Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a 'Search-then-Refine' approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a 'search' phase candidate explosion, and (2) the 'refine' phase, which is vulnerable to semantic discrepancy, requires high-cost VLMs for verification, incurring significant computational overhead. We propose Point-to-Span (P2S), a novel training-free framework to overcome this challenge of inefficient 'search' and costly 'refine' phases. P2S overcomes these challenges with two key innovations: an 'Adaptive Span Generator' to prevent the search phase candidate explosion, and 'Query Decomposition' to refine candidates without relying on high-cost VLM verification.
To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7% on R5@0.1 on MAD).
[70] Breaking the Vicious Cycle: Coherent 3D Gaussian Splatting from Sparse and Motion-Blurred Views
Zhankuo Xu,Chaoran Feng,Yingtao Li,Jianbin Zhao,Jiashu Yang,Wangbo Yu,Li Yuan,Yonghong Tian
Main category: cs.CV
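TL;DR: This paper proposes CoherentGS, a new framework for high-fidelity 3D reconstruction from sparse, motion-blurred images. A dual-prior strategy that combines a deblurring network with a diffusion model addresses the failure of conventional 3D Gaussian Splatting on sparse and blurry data.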
Details
Motivation: 3D Gaussian Splatting (3DGS) performs poorly on real scenes that are both sparsely sampled and motion-blurred, because sparse views and blur together cause reconstruction to collapse. Existing methods struggle to handle both degradations at once, so a new method is needed to break this vicious cycle.
Method: A dual-prior strategy: a pre-trained deblurring network restores high-frequency detail and provides photometric guidance, while a diffusion model provides geometric priors to fill in unobserved regions. A consistency-guided camera exploration module and a depth regularization loss improve geometric plausibility.
Result: In experiments on synthetic and real scenes using only 3, 6, and 9 input views, CoherentGS significantly outperforms existing methods on both qualitative and quantitative measures.
Conclusion: CoherentGS effectively addresses 3D reconstruction under sparse and blurry conditions; by fusing dual priors from generative models, it achieves robust, high-fidelity novel view synthesis and advances this line of work.
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art method for novel view synthesis. However, its performance heavily relies on dense, high-quality input imagery, an assumption that is often violated in real-world applications, where data is typically sparse and motion-blurred. These two issues create a vicious cycle: sparse views ignore the multi-view constraints necessary to resolve motion blur, while motion blur erases high-frequency details crucial for aligning the limited views. Thus, reconstruction often fails catastrophically, with fragmented views and a low-frequency bias. To break this cycle, we introduce CoherentGS, a novel framework for high-fidelity 3D reconstruction from sparse and blurry images. Our key insight is to address these compound degradations using a dual-prior strategy. Specifically, we combine two pre-trained generative models: a specialized deblurring network for restoring sharp details and providing photometric guidance, and a diffusion model that offers geometric priors to fill in unobserved regions of the scene. This dual-prior strategy is supported by several key techniques, including a consistency-guided camera exploration module that adaptively guides the generative process, and a depth regularization loss that ensures geometric plausibility. We evaluate CoherentGS through both quantitative and qualitative experiments on synthetic and real-world scenes, using as few as 3, 6, and 9 input views. Our results demonstrate that CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for this challenging task.
The code and video demos are available at https://potatobigroom.github.io/CoherentGS/.
[71] RaLiFlow: Scene Flow Estimation with 4D Radar and LiDAR Point Clouds
Jingyun Fu,Zhiyu Xiang,Na Zhao
Main category: cs.CV
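TL;DR: This paper proposes RaLiFlow, the first framework for joint scene flow estimation with 4D millimeter-wave radar and LiDAR, and builds a corresponding dataset. A dynamic-aware bidirectional cross-modal fusion module and carefully designed loss functions achieve effective fusion.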
Details
Motivation: Existing methods have not explored fusing radar and LiDAR for scene flow estimation. Radar is cheap, robust to weather, and can measure velocity, but it is noisy, low-resolution, and sparse, and no suitable dataset exists.
Method: The authors build a radar-LiDAR scene flow dataset from a public real-world autonomous driving dataset, propose a preprocessing strategy for radar denoising and scene flow label generation, and design the RaLiFlow framework, which includes a Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module and a set of dedicated loss functions.
Result: Experiments show that the method significantly outperforms existing LiDAR-based and radar-based single-modal methods and improves instance-level consistency, particularly in dynamic foreground regions.
Conclusion: RaLiFlow offers an effective solution for radar-LiDAR fusion and advances low-cost, robust multimodal scene flow estimation.
Abstract: Recent multimodal fusion methods, integrating images with LiDAR point clouds, have shown promise in scene flow estimation. However, the fusion of 4D millimeter wave radar and LiDAR remains unexplored. Unlike LiDAR, radar is cheaper, more robust in various weather conditions and can detect point-wise velocity, making it a valuable complement to LiDAR. However, radar inputs pose challenges due to noise, low resolution, and sparsity. Moreover, there is currently no dataset that combines LiDAR and radar data specifically for scene flow estimation. To address this gap, we construct a Radar-LiDAR scene flow dataset based on a public real-world automotive dataset. We propose an effective preprocessing strategy for radar denoising and scene flow label generation, deriving more reliable flow ground truth for radar points out of the object boundaries. Additionally, we introduce RaLiFlow, the first joint scene flow learning framework for 4D radar and LiDAR, which achieves effective radar-LiDAR fusion through a novel Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module and a carefully designed set of loss functions. The DBCF module integrates dynamic cues from radar into the local cross-attention mechanism, enabling the propagation of contextual information across modalities. Meanwhile, the proposed loss functions mitigate the adverse effects of unreliable radar data during training and enhance the instance-level consistency in scene flow predictions from both modalities, particularly for dynamic foreground areas.
Extensive experiments on the repurposed scene flow dataset demonstrate that our method outperforms existing LiDAR-based and radar-based single-modal methods by a significant margin.
[72] Self-Supervised Contrastive Embedding Adaptation for Endoscopic Image Matching
Alberto Rota,Elena De Momi
Main category: cs.CV
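TL;DR: This paper proposes a new self-supervised deep learning framework for feature matching in endoscopic image pairs. It uses novel-view synthesis to generate ground-truth correspondences and contrastive learning to adapt a DINOv2 model, achieving higher matching precision and lower epipolar error than existing methods on the SCARED dataset.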
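The matching step this entry relies on, direct matching of embeddings by cosine similarity thresholding, can be sketched in a few lines. The version below is a minimal pure-Python illustration, not the authors' pipeline: the function name `match_by_cosine`, the default threshold of 0.8, and the mutual-nearest-neighbour check are assumptions added for the example, and descriptors are assumed to be non-zero vectors.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def match_by_cosine(desc_a, desc_b, threshold=0.8):
    """Match descriptors of image A to image B by cosine similarity.

    A pair (i, j) is kept when j is the best match for i, i is the best
    match for j (mutual check, an assumption here), and the similarity
    reaches the threshold. Returns a list of (i, j, similarity).
    """
    a = [_normalize(v) for v in desc_a]
    b = [_normalize(v) for v in desc_b]
    # Cosine similarity matrix: rows index A, columns index B.
    sim = [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]
    matches = []
    for i, row in enumerate(sim):
        j = max(range(len(b)), key=lambda k: row[k])
        i_back = max(range(len(a)), key=lambda k: sim[k][j])
        if i_back == i and row[j] >= threshold:
            matches.append((i, j, row[j]))
    return matches
```

Raising the threshold trades recall for precision: fewer pairs survive, but those that do are closer in the embedding space.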
Details
Motivation: Surgical scenes have weak perspective cues, non-Lambertian reflections, and complex deformable anatomy, so conventional computer vision and existing deep learning methods perform poorly at pixel-level matching of endoscopic images; targeted adaptation to this domain is needed.
Method: A self-supervised framework based on novel-view synthesis generates ground-truth inlier correspondences, from which triplets are built for contrastive learning. A Transformer layer is added on top of the DINOv2 backbone to optimize embeddings suited to matching by cosine similarity thresholding.
Result: Experiments on the SCARED dataset show higher matching precision and lower epipolar error than existing techniques, confirming that the method establishes reliable feature correspondences in surgical endoscopic images.
Conclusion: The proposed self-supervised deep learning pipeline improves feature matching in endoscopic images and supports higher-level vision applications in surgery such as 3D reconstruction, camera tracking, and scene understanding.
Abstract: Accurate spatial understanding is essential for image-guided surgery, augmented reality integration and context awareness. In minimally invasive procedures, where visual input is the sole intraoperative modality, establishing precise pixel-level correspondences between endoscopic frames is critical for 3D reconstruction, camera tracking, and scene interpretation. However, the surgical domain presents distinct challenges: weak perspective cues, non-Lambertian tissue reflections, and complex, deformable anatomy degrade the performance of conventional computer vision techniques. While Deep Learning models have shown strong performance in natural scenes, their features are not inherently suited for fine-grained matching in surgical images and require targeted adaptation to meet the demands of this domain. This research presents a novel Deep Learning pipeline for establishing feature correspondences in endoscopic image pairs, alongside a self-supervised optimization framework for model training. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences, subsequently utilized for mining triplets within a contrastive learning paradigm. Through this self-supervised approach, we augment the DINOv2 backbone with an additional Transformer layer, specifically optimized to produce embeddings that facilitate direct matching through cosine similarity thresholding.
Experimental evaluation demonstrates that our pipeline surpasses state-of-the-art methodologies on the SCARED dataset, with improved matching precision and lower epipolar error than related work. The proposed framework constitutes a valuable contribution toward enabling more accurate high-level computer vision applications in surgical endoscopy.
[73] Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies
Cong Pang,Hongtao Yu,Zixuan Chen,Lewei Lu,Xin Lou
Main category: cs.CV
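TL;DR: This paper introduces FROW, an open-world fine-grained recognition benchmark for evaluating large vision-language models (LVLMs), and proposes a new strategy that improves LVLM performance through data construction and the training process. Experiments show that the proposed mosaic and open-world data significantly improve performance on fine-grained recognition tasks.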
Details
Motivation: Existing LVLM benchmarks focus mainly on reasoning tasks and overlook fine-grained recognition, which is essential for practical applications, so a new benchmark and optimization method are needed to close this gap.
Method: The authors introduce the FROW benchmark and optimize LVLMs from two angles: data construction (mosaic data and open-world data generated with GPT-4o) and the training process. Fine-grained data is added to the pre-training phase to improve performance.
Result: Mosaic data raises category recognition accuracy by 1%; open-world data raises FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%; adding fine-grained data during pre-training raises category recognition accuracy by up to 10%.
Conclusion: FROW provides an effective framework for evaluating the fine-grained recognition ability of LVLMs, and the proposed optimization strategy significantly improves model performance, supporting use in practical scenarios.
Abstract: Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives: data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1% and open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.
[74] Adaptive Dual-Weighted Gravitational Point Cloud Denoising Method
Ge Zhang,Chunyang Wang,Bo Xiao,Xuelian Liu,Bin Liu
Main category: cs.CV
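TL;DR: This paper proposes a point cloud denoising method based on an adaptive dual-weight gravitational model. It combines octree spatial partitioning with parallel acceleration, voxel-based statistics, and k-nearest-neighbor density estimation to balance efficient denoising, edge preservation, and real-time processing.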
Details
Motivation: Existing point cloud denoising methods struggle to balance denoising accuracy, preservation of edge detail, and computational efficiency, and are especially limited under complex noise; an algorithm that achieves high accuracy, robustness, and real-time performance together is needed.
Method: An octree first partitions the global point cloud spatially to enable parallel acceleration. Within each leaf node, adaptive voxel occupancy statistics and k-nearest-neighbor density estimation quickly remove isolated, low-density noise points. Finally, a gravitational scoring function that combines density weights with adaptive distance weights finely separates noise points from object points.
Result: Experiments on the Stanford 3D Scanning Repository, the CADC dataset, and FMCW LiDAR data collected in the authors' laboratory show that the method outperforms existing methods under different noise conditions, improving F1, PSNR, and Chamfer Distance (CD) while reducing single-frame processing time.
Conclusion: Across multiple noise scenarios the method combines high-accuracy denoising, good boundary preservation, and real-time performance, with strong robustness and practical value.
Abstract: High-quality point cloud data is a critical foundation for tasks such as autonomous driving and 3D reconstruction. However, LiDAR-based point cloud acquisition is often affected by various disturbances, resulting in a large number of noise points that degrade the accuracy of subsequent point cloud object detection and recognition. Moreover, existing point cloud denoising methods typically sacrifice computational efficiency in pursuit of higher denoising accuracy, or, conversely, improve processing speed at the expense of preserving object boundaries and fine structural details, making it difficult to simultaneously achieve high denoising accuracy, strong edge preservation, and real-time performance. To address these limitations, this paper proposes an adaptive dual-weight gravitational-based point cloud denoising method. First, an octree is employed to perform spatial partitioning of the global point cloud, enabling parallel acceleration. Then, within each leaf node, adaptive voxel-based occupancy statistics and k-nearest neighbor (kNN) density estimation are applied to rapidly remove clearly isolated and low-density noise points, thereby reducing the effective candidate set. Finally, a gravitational scoring function that combines density weights with adaptive distance weights is constructed to finely distinguish noise points from object points.
Experiments conducted on the Stanford 3D Scanning Repository, the Canadian Adverse Driving Conditions (CADC) dataset, and in-house FMCW LiDAR point clouds acquired in our laboratory demonstrate that, compared with existing methods, the proposed approach achieves consistent improvements in F1, PSNR, and Chamfer Distance (CD) across various noise conditions while reducing the single-frame processing time, thereby validating its high accuracy, robustness, and real-time performance in multi-noise scenarios.
[75] MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos
Qiyue Sun,Tailin Chen,Yinghui Zhang,Yuchen Zhang,Jiangbei Yue,Jianbo Jiao,Zeyu Fu
Main category: cs.CV
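TL;DR: This paper proposes MultiHateLoc, the first framework for weakly supervised temporal localisation of multimodal hate content. Using modality-aware temporal encoding, dynamic cross-modal fusion, and a modality-aware MIL objective, it achieves fine-grained frame-level localisation from video-level labels alone and significantly outperforms existing methods.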
Details
Motivation: Existing research concentrates on video-level classification and does not pinpoint when hateful content occurs; under weak supervision in particular, cross-modal and temporal dynamics are hard to capture.
Method: The MultiHateLoc framework comprises (1) modality-aware temporal encoders that model heterogeneous sequential patterns; (2) dynamic cross-modal fusion and a contrastive alignment strategy that strengthen multimodal consistency; (3) a modality-aware MIL objective that localises key segments under video-level labels.
Result: On the HateMM and MultiHateClip datasets, MultiHateLoc achieves state-of-the-art frame-level localisation performance in the weakly supervised setting.
Conclusion: MultiHateLoc effectively addresses temporal localisation of multimodal hate content under weak supervision, with good interpretability and application potential.
Abstract: The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion to adaptively emphasise the most informative modality at each moment and a cross-modal contrastive alignment strategy to enhance multimodal feature consistency; (3) a modality-aware MIL objective to identify discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.
[76] Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction
Wenfei Guan,Jilin Mei,Tong Shen,Xumin Wu,Shuo Wang,Cheng Min,Yu Hu
Main category: cs.CV
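TL;DR: This paper proposes MaGRoad, a new path-centric framework, and WildRoad, a large-scale off-road dataset, to address the shortcomings of existing road extraction methods in non-urban environments; the approach reaches state-of-the-art performance and speeds up inference.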
Details
Motivation: Deep learning models have advanced road extraction in urban environments but still struggle in non-urban (off-road) settings, mainly because large-scale vectorized datasets are missing and the prevailing methods have structural weaknesses.
Method: The authors first release WildRoad, a global off-road road network dataset, and develop an interactive annotation tool dedicated to road-network labeling. They then propose MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that infers connectivity by aggregating multi-scale visual evidence along candidate paths.
Result: Experiments show that MaGRoad achieves state-of-the-art performance on the challenging WildRoad benchmark and generalizes well to urban datasets. The streamlined pipeline also makes inference roughly 2.5x faster.
Conclusion: The WildRoad dataset and the path-centric MaGRoad paradigm provide a stronger foundation for off-road mapping and markedly improve road extraction quality and practicality in complex environments.
Abstract: Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly. Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.
[77] TransLocNet: Cross-Modal Attention for Aerial-Ground Vehicle Localization with Contrastive Learning
Phu Pham,Damon Conover,Aniket Bera
Main category: cs.CV
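TL;DR: Proposes TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context and uses bidirectional attention and contrastive learning to achieve high-accuracy aerial-ground localization.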
Details
Motivation: To address the localization difficulty caused by viewpoint and modality differences between aerial imagery and ground-level LiDAR.
Method: LiDAR scans are projected into a bird's-eye-view representation and aligned with aerial features through bidirectional attention; a contrastive learning module builds a shared embedding space; finally, a likelihood map decoder outputs probability distributions over position and orientation.
Result: Experiments on the CARLA and KITTI datasets show localization error reduced by up to 63% relative to existing methods, reaching sub-meter, sub-degree accuracy.
Conclusion: TransLocNet delivers robust, generalizable aerial-ground localization in both synthetic and real-world scenes.
Abstract: Aerial-ground localization is difficult due to large viewpoint and modality gaps between ground-level LiDAR and overhead imagery. We propose TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context. LiDAR scans are projected into a bird's-eye-view representation and aligned with aerial features through bidirectional attention, followed by a likelihood map decoder that outputs spatial probability distributions over position and orientation. A contrastive learning module enforces a shared embedding space to improve cross-modal alignment. Experiments on CARLA and KITTI show that TransLocNet outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy. These results demonstrate that TransLocNet provides robust and generalizable aerial-ground localization in both synthetic and real-world settings.
[78] Neural Collapse in Test-Time Adaptation
Xiao Chen,Zhongjing Du,Jiazhen Huang,Xu Jiang,Li Lu,Jingyan Jiang,Zhi Wang
Main category: cs.CV
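TL;DR: This paper proposes NCTTA, a new test-time adaptation method based on the Sample-wise Alignment Collapse phenomenon (NC3+). Hybrid targets mitigate unreliable pseudo-labels and significantly improve robustness on out-of-distribution data.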
Details
Motivation: Existing TTA methods lack a theoretical understanding of why performance degrades under domain shift, and under large distribution shifts pseudo-labels become unreliable, leading to poor adaptation.
Method: The authors introduce sample-wise neural collapse (NC3+) to analyze the root cause of performance degradation and propose NCTTA, which aligns features with classifier weights using hybrid targets that combine geometric proximity with predictive confidence.
Result: NCTTA significantly outperforms existing methods on several benchmarks, for example exceeding Tent by 14.52% on ImageNet-C.
Conclusion: Sample-wise alignment between features and classifier weights is key to improving TTA; NCTTA effectively mitigates the adaptation bias caused by unreliable pseudo-labels and strengthens robustness to distribution shift.
Abstract: Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights for TTA. In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample's feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. Building on NC3+, we identify that the performance degradation stems from sample-wise misalignment in adaptation which exacerbates under larger distribution shifts. This indicates the necessity of realigning the feature embeddings with their corresponding classifier weights. However, the misalignment makes pseudo-labels unreliable under domain shifts. To address this challenge, we propose NCTTA, a novel feature-classifier alignment method with hybrid targets to mitigate the impact of unreliable pseudo-labels, which blends geometric proximity with predictive confidence. Extensive experiments demonstrate the effectiveness of NCTTA in enhancing robustness to domain shifts. For example, NCTTA outperforms Tent by 14.52% on ImageNet-C.
[79] An M-Health Algorithmic Approach to Identify and Assess Physiotherapy Exercises in Real Time
Stylianos Kandylakis,Christos Orfanopoulos,Georgios Siolas,Panayiotis Tsanakas
Main category: cs.CV
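TL;DR: Proposes an efficient algorithmic framework for real-time recognition, classification, and evaluation of human physiotherapy exercises on mobile devices, using pose estimation and dynamic programming for movement sequence matching and error localization.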
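This entry describes sequence matching with a dynamic-programming scheme based on a modified Levenshtein distance. The abstract does not say what the modification is, so the sketch below uses the standard, unmodified Levenshtein recurrence with a backtrace that reports where an observed pose sequence deviates from the prescribed one. The function name `align_poses` and the pose labels in the usage note are hypothetical.

```python
def align_poses(observed, reference):
    """Standard Levenshtein alignment of two pose-label sequences.

    Returns (distance, ops), where ops lists the deviations as tuples
    (kind, reference_index, observed_label, reference_label) and kind is
    "substitute", "extra" (pose not in the prescription), or "missing".
    """
    n, m = len(observed), len(reference)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if observed[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # extra observed pose
                             dist[i][j - 1] + 1,         # missing reference pose
                             dist[i - 1][j - 1] + cost)  # match or substitution
    # Walk back from the end to recover which edits were used.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        cost = 1 if i > 0 and j > 0 and observed[i - 1] != reference[j - 1] else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost:
                ops.append(("substitute", j - 1, observed[i - 1], reference[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("extra", j, observed[i - 1], None))
            i -= 1
        else:
            ops.append(("missing", j - 1, None, reference[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops
```

For example, aligning the observed sequence `["stand", "squat", "stand"]` against the reference `["stand", "squat", "rise", "stand"]` gives a distance of 1 and a single `"missing"` operation at reference index 2, which is the kind of localized feedback the entry refers to.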
Details
Motivation: To enable real-time, accurate monitoring of physiotherapy exercises on mobile devices, in support of remote rehabilitation and mobile health applications.
Method: A movement is treated as a sequence of static poses. A pose-estimation neural network extracts keypoints from camera input; these are converted into trigonometric angle features and classified per frame with lightweight models. A dynamic-programming method based on a modified Levenshtein distance algorithm then recognizes complete movements and detects deviations.
Result: The system runs in real time on the client side, and experiments confirm the effectiveness of the method for movement recognition and error detection.
Conclusion: The framework is suitable for remote physiotherapy supervision and has good scalability and practical potential.
Abstract: This work presents an efficient algorithmic framework for real-time identification, classification, and evaluation of human physiotherapy exercises using mobile devices. The proposed method interprets a kinetic movement as a sequence of static poses, which are estimated from camera input using a pose-estimation neural network. Extracted body keypoints are transformed into trigonometric angle-based features and classified with lightweight supervised models to generate frame-level pose predictions and accuracy scores. To recognize full exercise movements and detect deviations from prescribed patterns, we employ a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and localization of inaccuracies. The system operates entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the effectiveness of the methodology and highlights its applicability to remote physiotherapy supervision and m-health applications.
[80] Error-Propagation-Free Learned Video Compression With Dual-Domain Progressive Temporal Alignment
Han Li,Shaohui Li,Wenrui Dai,Chenglin Li,Xinlong Pan,Haipeng Wang,Junni Zou,Hongkai Xiong
Main category: cs.CV
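TL;DR: This paper proposes a new unified-transform framework that combines dual-domain progressive temporal alignment with a quality-conditioned mixture-of-experts (QCMoE) to achieve error-propagation-free, quality-consistent video compression.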
Details
Motivation: Existing learned video compression frameworks face a dilemma between inaccurate temporal alignment in motion estimation and compensation and error propagation. Separate-transform frameworks perform well but show clear error propagation, while unified-transform frameworks avoid error propagation but have weaker ME/MC in the shared latent domain.
Method: The paper proposes dual-domain progressive temporal alignment, combining coarse pixel-domain alignment with refined latent-domain alignment; designs a Flow-Guided Deformable Transformer (FGDT) for long-term motion refinement of complex motion across multiple reference frames; and introduces a QCMoE module that dynamically adjusts per-pixel quantization steps according to target quality and content, enabling continuous bit-rate adaptation.
Result: Experiments show that the method maintains competitive rate-distortion performance while eliminating error propagation and achieving quality-consistent continuous rate adjustment.
Conclusion: The proposed framework resolves the trade-off between error propagation and ME/MC accuracy, delivering robust long-term temporal modeling and flexible rate control while preserving high-quality reconstruction.
Abstract: Existing frameworks for learned video compression suffer from a dilemma between inaccurate temporal alignment and error propagation for motion estimation and compensation (ME/MC). The separate-transform framework employs distinct transforms for intra-frame and inter-frame compression to yield impressive rate-distortion (R-D) performance but causes evident error propagation, while the unified-transform framework eliminates error propagation via shared transforms but is inferior in ME/MC in shared latent domains. To address this limitation, in this paper, we propose a novel unified-transform framework with dual-domain progressive temporal alignment and quality-conditioned mixture-of-expert (QCMoE) to enable quality-consistent and error-propagation-free streaming for learned video compression. Specifically, we propose dual-domain progressive temporal alignment for ME/MC that leverages coarse pixel-domain alignment and refined latent-domain alignment to significantly enhance temporal context modeling in a coarse-to-fine fashion. The coarse pixel-domain alignment efficiently handles simple motion patterns with optical flow estimated from a single reference frame, while the refined latent-domain alignment develops a Flow-Guided Deformable Transformer (FGDT) over latents from multiple reference frames to achieve long-term motion refinement (LTMR) for complex motion patterns.
Furthermore, we design a QCMoE module for continuous bit-rate adaptation that dynamically assigns different experts to adjust quantization steps per pixel based on target quality and content rather than relying on a single quantization step. QCMoE allows continuous and consistent rate control with appealing R-D performance. Experimental results show that the proposed method achieves competitive R-D performance compared with state-of-the-art methods, while successfully eliminating error propagation.
[81] Robust Shape from Focus via Multiscale Directional Dilated Laplacian and Recurrent Network
Khurram Ashfaq,Muhammad Tariq Mahmood
Main category: cs.CV
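TL;DR: Proposes a hybrid Shape-from-Focus framework that computes focus volumes with multi-scale directional dilated Laplacian kernels and combines a lightweight GRU network with a learned upsampling module for accurate, high-resolution depth estimation.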
Details
Motivation: Existing deep learning SFF methods rely on heavy encoders to extract focus volumes and follow them with a simple depth estimation step that tends to introduce noise and artifacts; a more efficient and robust depth estimation framework is needed.
Method: Handcrafted multi-scale Directional Dilated Laplacian (DDL) kernels compute the focus volumes, capturing long-range and directional focus variations; a lightweight multi-scale GRU module iteratively refines a low-resolution depth estimate; a learnable convex upsampling module recovers the high-resolution depth map.
Result: On both synthetic and real datasets the method outperforms state-of-the-art traditional and deep learning methods, with higher depth accuracy, better detail preservation, and stronger generalization.
Conclusion: The proposed hybrid framework significantly improves SFF depth estimation quality while remaining computationally efficient, supporting the value of combining handcrafted feature extraction with a lightweight deep network.
Abstract: Shape-from-Focus (SFF) is a passive depth estimation technique that infers scene depth by analyzing focus variations in a focal stack. Most recent deep learning-based SFF methods typically operate in two stages: first, they extract focus volumes (a per pixel representation of focus likelihood across the focal stack) using heavy feature encoders; then, they estimate depth via a simple one-step aggregation technique that often introduces artifacts and amplifies noise in the depth map. To address these issues, we propose a hybrid framework. Our method computes multi-scale focus volumes traditionally using handcrafted Directional Dilated Laplacian (DDL) kernels, which capture long-range and directional focus variations to form robust focus volumes. These focus volumes are then fed into a lightweight, multi-scale GRU-based depth extraction module that iteratively refines an initial depth estimate at a lower resolution for computational efficiency. Finally, a learned convex upsampling module within our recurrent network reconstructs high-resolution depth maps while preserving fine scene details and sharp boundaries. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach outperforms state-of-the-art deep learning and traditional methods, achieving superior accuracy and generalization across diverse focal conditions.
[82] 3D Blood Pulsation Maps
Maurice Rohr,Tobias Reinhardt,Tizian Dege,Justus Thies,Christoph Hoog Antink
Main category: cs.CV
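TL;DR: Pulse3DFace is the first multi-view dataset for estimating 3D blood pulsation maps. It contains 23-viewpoint videos of 15 subjects, physiological reference signals, and 3D facial scans, and supports modeling for remote pulse estimation, synthetic data generation, and the study of illumination effects.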
Details
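Motivation: Existing remote photoplethysmography (rPPG) imaging methods lack real, dynamic 3D blood pulsation data for validating and improving models and are easily disturbed by illumination changes; a dataset with multiple views, physiological references, and 3D structure is needed to advance this research.
Method: RGB facial videos of 15 subjects are recorded at 30 Hz from 23 viewpoints together with synchronized blood pulse reference signals, and 3D facial scans are generated with monocular structure-from-motion (SfM); from the videos and physiological signals, 3D pulsation maps aligned with the texture space of the FLAME 3D head model are produced, containing signal-to-noise ratio, pulse amplitude, phase, and related information.
Result: The Pulse3DFace dataset is released, comprising raw multi-view videos, physiological reference signals, 3D facial models, and high-quality 3D pulsation maps; its diversity of illumination conditions, map consistency, and ability to capture physiological features of the face and neck are validated.
Conclusion: Pulse3DFace is an important resource for 3D dynamic blood pulsation modeling, synthetic data generation, and multi-view rPPG research, and can help improve the accuracy and robustness of remote pulse estimation, particularly under complex illumination.
Abstract: We present Pulse3DFace, the first dataset of its kind for estimating 3D blood pulsation maps. These maps can be used to develop models of dynamic facial blood pulsation, enabling the creation of synthetic video data to improve and validate remote pulse estimation methods via photoplethysmography imaging. Additionally, the dataset facilitates research into novel multi-view-based approaches for mitigating illumination effects in blood pulsation analysis. Pulse3DFace consists of raw videos from 15 subjects recorded at 30 Hz with an RGB camera from 23 viewpoints, blood pulse reference measurements, and facial 3D scans generated using monocular structure-from-motion techniques. It also includes processed 3D pulsation maps compatible with the texture space of the 3D head model FLAME. These maps provide signal-to-noise ratio, local pulse amplitude, phase information, and supplementary data. We offer a comprehensive evaluation of the dataset's illumination conditions, map consistency, and its ability to capture physiologically meaningful features in the facial and neck skin regions.
[83] Take a Peek: Efficient Encoder Adaptation for Few-Shot Semantic Segmentation via LoRA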
Pasquale De Marinis,Gennaro Vessio,Giovanna Castellano
Main category: cs.CV
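TL;DR: Proposes Take a Peek (TaP), a simple yet effective method that fine-tunes the encoder with Low-Rank Adaptation (LoRA) to improve its adaptation to novel classes in few-shot semantic segmentation (FSS) and cross-domain FSS (CD-FSS). TaP is model-agnostic, computationally efficient, mitigates catastrophic forgetting, and significantly improves performance on multiple benchmarks.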
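As a reminder of what a low-rank adapter does, here is a minimal sketch of the generic LoRA update W + (alpha / r) · B·A on a single weight matrix. It is not the TaP code; the function names, the initialization scale, and the toy dimensions are assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def init_lora(d_out, d_in, rank, seed=0):
    """A is small random, B is zero, so the adapter starts as a no-op."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A; W itself stays frozen."""
    rank = len(a)
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def trainable_parameters(d_out, d_in, rank):
    """Only A (rank x d_in) and B (d_out x rank) are trained."""
    return rank * (d_in + d_out)
```

For a 768 x 768 layer, `trainable_parameters(768, 768, 4)` gives 6,144 trainable values against 589,824 in the full matrix, which is why adapting on a small support set stays cheap and leaves the frozen weights intact.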
Details
Motivation: In existing few-shot semantic segmentation methods, the encoder's limited ability to extract features for unseen classes is a performance bottleneck, yet prior work has mostly focused on improving decoders and has neglected encoder adaptability.
Method: Take a Peek (TaP) uses Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set, enabling fast adaptation at very low computational cost while mitigating catastrophic forgetting; the method is model-agnostic and integrates seamlessly into existing FSS pipelines.
Result: Experiments on COCO 20^i, Pascal 5^i, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray show that TaP consistently improves segmentation performance across models and shot settings, with particularly large gains in complex multi-class scenarios; a rank sensitivity analysis shows that performance remains strong even at low rank, which keeps the method computationally efficient.
Conclusion: By addressing the encoder's insufficient generalization to novel classes, a key problem in FSS, TaP advances more robust, efficient, and generalizable segmentation systems.
Abstract: Few-shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce Take a Peek (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross-domain FSS (CD-FSS). TaP leverages Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model-agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks, including COCO 20^i, Pascal 5^i, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray, demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi-class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low-rank adaptations, ensuring computational efficiency. By addressing a critical limitation in FSS, namely the encoder's generalization to novel classes, TaP paves the way toward more robust, efficient, and generalizable segmentation systems.
The code is available at https://github.com/pasqualedem/TakeAPeek.
[84] Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding
Yuchen Feng,Zhenyu Zhang,Naibin Gu,Yilong Chen,Peng Fu,Zheng Lin,Shuohuan Wang,Yu Sun,Hua Wu,Weiping Wang,Haifeng Wang
Main category: cs.CV
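TL;DR: Proposes Blink, a framework that emulates the human "blink-like" visual attention mechanism to dynamically adjust visual token resolution in multimodal large language models, improving visual perception and understanding.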
Details
Motivation: Existing MLLMs have limited visual perception, whereas humans perceive complex scenes efficiently by dynamically scanning and focusing on salient regions; the work explores whether a similar mechanism can be brought to MLLMs.
Method: The Blink framework comprises saliency-guided scanning and dynamic token resolution; it estimates the saliency of visual tokens in each layer from the attention map, extends important tokens through a plug-and-play TokenSR module, and drops the extended tokens in subsequent layers once they lose focus.
Result: Experiments show that Blink effectively enhances the visual perception and multimodal understanding of MLLMs, enabling more efficient fine-grained recognition.
Conclusion: By imitating human visual attention, Blink achieves adaptive, efficient visual token processing within a single forward pass, offering a new direction for improving visual perception in MLLMs.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.
[85] Grounding Everything in Tokens for Multimodal Large Language Models
Xiangxuan Ren,Zhongdao Wang,Liping Hou,Pin Tang,Guoqing Wang,Chao Ma
Main category: cs.CV
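TL;DR: Proposes GETok, a spatial representation method that introduces learnable grid and offset tokens to strengthen object grounding in 2D image space for multimodal large language models, improving spatial reasoning without modifying the autoregressive architecture.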
Details
Motivation: Because existing multimodal large language models rely on tokenizing input images, they struggle to localize objects precisely in 2D image space; a method that achieves better spatial grounding is needed.
Method: GETok uses grid tokens to partition the image plane into structured spatial anchors and offset tokens to enable precise, iterative refinement of localization predictions, embedding spatial relationships directly into tokens.
Result: Experiments show that GETok outperforms state-of-the-art methods on a variety of referring tasks under both supervised fine-tuning and reinforcement learning settings.
Conclusion: By integrating spatial information into token representations, GETok significantly improves object grounding and reasoning in 2D space for MLLMs without changing the autoregressive architecture.
Abstract: Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization on input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
[86] Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks
Meher Md Saad
Main category: cs.CV
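TL;DR: Proposes a few-shot prototypical network framework with a skeleton-based encoder that combines ST-GCN with a multi-scale temporal aggregation module. Metric learning improves isolated sign language recognition under data scarcity and long-tail distributions; on WLASL the approach clearly outperforms conventional classification and shows good zero-shot generalization.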
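The prototype-based classification step mentioned above can be sketched generically as follows. This is the standard prototypical-network decision rule (class mean, nearest prototype), not the paper's model: the embeddings would come from the ST-GCN encoder in practice, and the squared Euclidean distance and all names here are illustrative assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_prototypes(support):
    """support: {label: [embedding, ...]} -> {label: prototype embedding}."""
    return {label: mean_vector(embs) for label, embs in support.items()}


def classify(query, prototypes):
    """Assign the query embedding to the label of its nearest prototype."""
    return min(prototypes,
               key=lambda label: squared_distance(query, prototypes[label]))
```

Because prototypes are recomputed from whatever support examples are supplied, a new sign can be added by providing a few embeddings for it, without retraining a fixed classification head.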
Details
Motivation: Because of data scarcity and the long-tail distribution of sign vocabulary, conventional classifiers cannot adequately learn rare classes in isolated sign language recognition and generalize poorly; a method that recognizes signs from few examples is needed.
Method: A few-shot prototypical network framework with a skeleton-based encoder that combines a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module; episodic training learns a semantic metric space in which signs are classified by their distance to class prototypes.
Result: The model reaches 43.75% Top-1 and 77.10% Top-5 accuracy on WLASL, over 13% above a standard classification model with the same backbone, and nearly 30% zero-shot accuracy on the unseen SignASL dataset.
Conclusion: The metric learning paradigm clearly outperforms conventional classification in data-scarce settings, generalizes well, and offers a scalable approach to large-vocabulary sign language recognition.
Abstract: Isolated Sign Language Recognition (ISLR) is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. However, robust ISLR is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary, where gathering sufficient examples for thousands of unique signs is prohibitively expensive. Standard classification approaches struggle under these conditions, often overfitting to frequent classes while failing to generalize to rare ones. To address this bottleneck, we propose a Few-Shot Prototypical Network framework adapted for a skeleton-based encoder. Unlike traditional classifiers that learn fixed decision boundaries, our approach utilizes episodic training to learn a semantic metric space where signs are classified based on their proximity to dynamic class prototypes. We integrate a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics. Experimental results on the WLASL dataset demonstrate the superiority of this metric learning paradigm: our model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. Crucially, this outperforms a standard classification baseline sharing the identical backbone architecture by over 13%, indicating that the prototypical training strategy remains effective in data-scarce settings where standard classification fails.
Furthermore, the model exhibits strong zero-shot generalization, achieving nearly 30% accuracy on the unseen SignASL dataset without fine-tuning, offering a scalable pathway for recognizing extensive sign vocabularies with limited data.
[87] Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner
Haojie Zheng,Shuchen Weng,Jingqi Liu,Siqi Yang,Boxin Shi,Xinlong Wang
Main category: cs.CV
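TL;DR: Proposes AVI-Edit, a framework for audio-synchronized video instance editing that achieves high-quality, audio-visually synchronized edits through fine-grained spatial and temporal control.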
Details
Motivation: Existing video editing methods overlook audio-visual synchronization and lack the spatio-temporal controllability needed for precise instance-level edits.
Method: A granularity-aware mask refiner and a self-feedback audio agent localize instance regions precisely and provide temporal control; a large-scale instance-centric dataset is also constructed.
Result: Experiments show that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization.
Conclusion: AVI-Edit achieves high-precision audio-synchronized video instance editing and offers a new solution for future video editing systems.
Abstract: Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Project page: https://hjzheng.net/projects/AVI-Edit/.
[88] Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration
Wenlong Jiao,Heyang Lee,Ping Wang,Pengfei Zhu,Qinghua Hu,Dongwei Ren
Main category: cs.CV
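TL;DR: Proposes SymUNet, a simple and efficient all-in-one image restoration framework built on a symmetric U-Net. With carefully designed feature extraction and cross-scale propagation it reaches state-of-the-art performance without complex structures; a semantic-enhanced variant, SE-SymUNet, uses frozen CLIP features to strengthen degradation priors.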
Details
Motivation: Existing all-in-one image restoration methods depend on complex architectures and elaborate degradation prompt strategies; this work looks for a simpler, more efficient, and stronger baseline framework.
Method: A symmetric U-Net architecture keeps encoder and decoder feature scales aligned and passes information across scales through simple additive fusion in skip connections; semantic enhancement (SE-SymUNet) is then added through cross-attention over frozen CLIP features.
Result: SymUNet surpasses existing methods on multiple benchmark datasets while reducing computational cost; SE-SymUNet improves performance further, and experiments confirm the advantage of both methods in image restoration.
Conclusion: A symmetric U-Net is sufficient to exploit the degradation information carried by features and reaches state-of-the-art performance without complex design, providing a simpler and stronger foundation for all-in-one image restoration.
Abstract: All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across encoder-decoder and enabling streamlined cross-scale propagation, our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic-enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines SymUNet and SE-SymUNet establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at https://github.com/WenlongJiao/SymUNet.
[89] Salient Object Detection in Complex Weather Conditions via Noise Indicators
Quan Chen,Xiaokai Yang,Tingyu Wang,Rongfeng Lu,Xichun Sheng,Yaoqi Sun,Chenggang Yan
Main category: cs.CV
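TL;DR: Proposes a salient object detection (SOD) framework for diverse weather conditions that embeds weather-type information into the encoder through a Noise Indicator Fusion Module (NIFM), improving segmentation accuracy in complex weather.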
Details
Motivation: Most existing SOD methods assume low-noise visual conditions and ignore how weather noise in real scenes degrades segmentation accuracy, so they cope poorly with complex weather.
Method: An SOD framework with a specific encoder and a replaceable decoder; a one-hot vector serves as the weather noise indicator, and the NIFM combines it with semantic features, embedding weather-aware priors in the encoder through adaptive feature modulation.
Result: Extensive experiments on the WXSOD dataset, covering different training data scales (100%, 50%, 30%) and multiple encoder-decoder combinations, show that the framework (in particular the NIFM-enhanced encoder) significantly improves segmentation accuracy in complex weather compared with a vanilla encoder.
Conclusion: By explicitly modeling the type of weather noise, the framework improves the robustness and generalization of SOD models under variable real-world weather while remaining compatible with mainstream SOD decoders.
Abstract: Salient object detection (SOD), a foundational task in computer vision, has advanced from single-modal to multi-modal paradigms to enhance generalization. However, most existing SOD methods assume low-noise visual conditions, overlooking the degradation of segmentation accuracy caused by weather-induced noise in real-world scenarios. In this paper, we propose a SOD framework tailored for diverse weather conditions, encompassing a specific encoder and a replaceable decoder. To enable handling of varying weather noises, we introduce a one-hot vector as a noise indicator to represent different weather types and design a Noise Indicator Fusion Module (NIFM). The NIFM takes both semantic features and the noise indicator as dual inputs and is inserted between consecutive stages of the encoder to embed weather-aware priors via adaptive feature modulation. Critically, the proposed specific encoder retains compatibility with mainstream SOD decoders. Extensive experiments are conducted on the WXSOD dataset under varying training data scales (100%, 50%, 30% of the full training set) and across three encoder and seven decoder configurations. Results show that the proposed SOD framework (particularly the NIFM-enhanced specific encoder) improves segmentation accuracy under complex weather conditions compared to a vanilla encoder.
[90] Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval
J. Xiao,Y. Guo,X. Zi,K. Thiyagarajan,C. Moreira,M. Prasad
Main category: cs.CV
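TL;DR: Proposes TRSLLaVA, a training-free method for semantic retrieval of remote sensing images that recasts cross-modal retrieval as text-to-text matching, retrieving efficiently in a unified text embedding space with high-quality structured text descriptions, and releases the new RSRT dataset as a benchmark.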
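Text-to-text retrieval in a shared embedding space reduces to ranking stored caption vectors by similarity to a query vector. The sketch below shows that ranking step and a Recall@K computation on toy vectors. It assumes the embeddings already exist (the text encoder used in the paper is not named in this digest), and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_captions(query_vec, caption_vecs):
    """Indices of database captions, most similar to the query first."""
    return sorted(range(len(caption_vecs)),
                  key=lambda i: cosine(query_vec, caption_vecs[i]),
                  reverse=True)


def recall_at_k(query_vecs, caption_vecs, targets, k):
    """Fraction of queries whose target caption index is in the top k."""
    hits = sum(1 for q, t in zip(query_vecs, targets)
               if t in rank_captions(q, caption_vecs)[:k])
    return hits / len(query_vecs)
```

No parameters are trained anywhere in this loop, which is the sense in which such a pipeline is training-free: quality depends entirely on the captions and on the text embedding model.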
Details
Motivation: Existing remote sensing image retrieval methods depend on costly domain-specific training, and there is no benchmark for evaluating how useful VLM-generated text is for zero-shot retrieval.
Method: The Remote Sensing Rich Text (RSRT) dataset is built with structured multi-sentence descriptions; TRSLLaVA recasts retrieval as a pure text matching task and retrieves with VLM-generated text descriptions without any model training.
Result: On the RSITMD and RSICD benchmarks, the zero-shot method approaches or even exceeds several supervised state-of-the-art models; on RSITMD, for example, it reaches a mean Recall of 42.62%, well above the 23.86% of the CLIP baseline.
Conclusion: High-quality structured text descriptions combined with a training-free text matching paradigm offer an efficient, low-cost route to remote sensing image retrieval.
Abstract: Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
[91] Track and Caption Any Motion: Query-Free Motion Discovery and Description in Videos
Bishoy Galoaa,Sarah Ostadabbas
Main category: cs.CV
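TL;DR: Proposes TCAM, a motion-centric video understanding framework that autonomously discovers and describes multiple motion patterns in a video without user queries, spatially aligning language descriptions with trajectories through a motion-field attention mechanism.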
Details
Motivation: Conventional video understanding methods depend on static appearance and perform poorly in difficult scenes with occlusion, camouflage, or rapid motion; a framework centered on motion dynamics is needed.
Method: TCAM aligns motion patterns with contrastive vision-language representations and uses multi-head cross-attention; trained jointly on global video-text alignment and fine-grained spatial correspondence, it discovers and describes multiple motions without queries and uses motion-field attention to ground each description spatially in its trajectory.
Result: On the MeViS benchmark, TCAM reaches 58.4% video-to-text retrieval accuracy and 64.9 JF for spatial grounding, and discovers an average of 4.8 relevant expressions per video with 84.7% precision, showing strong cross-task generalization.
Conclusion: TCAM confirms that motion patterns combined with vision-language alignment are a powerful semantic signal and provides an effective solution for query-free video understanding.
Abstract: We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos in challenging conditions like occlusion, camouflage, or rapid movement often depends more on motion dynamics than static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval, 64.9 JF for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.
[92] Robust Multi-Disease Retinal Classification via Xception-Based Transfer Learning and W-Net Vessel Segmentation
Mohammad Sadegh Gholizadeh,Amir Arsalan Rezapour
Main category: cs.CV
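TL;DR: Proposes a deep learning framework that combines deep feature extraction with interpretable image processing modules for automated diagnosis of eye diseases, using retinal vessel segmentation to assist classification and improve interpretability and clinical applicability.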
Details
Motivation: The incidence of vision-threatening eye diseases has risen in recent years, calling for scalable and accurate screening; in addition, the "black-box" nature of standard CNN models undermines trust and clinical adoption in medical diagnosis.
Method: A deep learning architecture combines deep feature extraction with interpretable image processing modules and uses high-fidelity retinal vessel segmentation as an auxiliary task that guides disease classification, so that predictions rest on clinically relevant morphological features.
Result: The approach is reported to reduce false positives and improve interpretability, making algorithmic output easier for medical experts to accept.
Conclusion: The framework improves the accuracy of automated eye disease diagnosis while increasing model transparency, which supports deployment of AI models in clinical settings.
Abstract: In recent years, the incidence of vision-threatening eye diseases has risen dramatically, necessitating scalable and accurate screening solutions. This paper presents a comprehensive study on deep learning architectures for the automated diagnosis of ocular conditions. To mitigate the "black-box" limitations of standard convolutional neural networks (CNNs), we implement a pipeline that combines deep feature extraction with interpretable image processing modules. Specifically, we focus on high-fidelity retinal vessel segmentation as an auxiliary task to guide the classification process. By grounding the model's predictions in clinically relevant morphological features, we aim to bridge the gap between algorithmic output and expert medical validation, thereby reducing false positives and improving deployment viability in clinical settings.
[93] Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces
Bishoy Galoaa,Xiangyu Bai,Sarah Ostadabbas
Main category: cs.CV
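TL;DR: Lang2Motion is a new framework that generates language-guided point trajectories by aligning motion manifolds with joint embedding spaces. It clearly outperforms existing methods in text-to-trajectory retrieval and motion accuracy, and supports effective transfer across motion domains and several editing functions.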
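Motion accuracy for point trajectories is commonly reported as Average Displacement Error (ADE), the mean point-wise distance between a predicted and a reference trajectory. A minimal sketch of that metric follows; the function name is hypothetical, and the units and any normalization used in the paper are not stated in this digest.

```python
import math


def average_displacement_error(predicted, reference):
    """Mean Euclidean distance between corresponding trajectory points."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(math.dist(p, r) for p, r in zip(predicted, reference))
    return total / len(predicted)
```

Lower is better: a value of 0 means the predicted trajectory passes through every reference point exactly.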
Details
Motivation: Existing work concentrates on human motion or video synthesis and lacks a way to generate explicit trajectories for arbitrary objects; effectively combining natural language descriptions with real-world motion data also remains a challenge.
Method: A transformer-based auto-encoder learns trajectory representations under dual supervision, from textual motion descriptions and from rendered trajectory visualizations, both mapped through CLIP's frozen encoders, which aligns language with motion.
Result: The model reaches 34.2% Recall@1 on text-to-trajectory retrieval, 12.5 points above video-based methods; improves motion accuracy by 33-52% (ADE reduced from 18.3-25.3 to 12.4); and achieves 88.3% Top-1 accuracy on human action recognition despite being trained only on diverse object motions.
Conclusion: Lang2Motion connects language with point trajectory generation for arbitrary real-world objects, shows strong cross-domain transfer and rich editing capabilities, and offers a new direction for language-guided motion generation.
Abstract: We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.
[94] DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM
Qintong Zhang,Junyuan Zhang,Zhifei Ren,Linke Ouyang,Zichen Wen,Junbo Niu,Yuan Qu,Bin Wang,Ka-Ho Chow,Conghui He,Wentao Zhang
Main category: cs.CV
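TL;DR: Proposes DOCR-Inspector, a VLM-based method for fine-grained assessment of document parsing quality that uses detection of 28 error types and the Chain-of-Checklist reasoning paradigm to evaluate real-world parsing results comprehensively and guide their refinement.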
Details
Motivation: Existing benchmarks carry dataset biases, so model rankings are unreliable in real-world scenarios, and conventional metrics do not reveal specific error patterns, which makes reliable assessment of document parsing quality difficult.
Method: DOCR-Inspector formalizes document parsing assessment as a fine-grained error detection task; within a VLM-as-a-Judge framework, and using the newly built DOCRcase-200K training data and the proposed Chain-of-Checklist reasoning paradigm, it identifies and classifies 28 error types and produces a comprehensive quality assessment.
Result: On DOCRcaseBench, which contains 882 real-world cases, DOCR-Inspector-7B outperforms commercial models such as Gemini 2.5 Pro as well as leading open-source models, and its assessments effectively guide refinement of parsing results.
Conclusion: DOCR-Inspector delivers reliable, fine-grained assessment of document parsing quality; it is both a practical evaluation tool and a driver for continued improvement of large-scale document parsing systems.
Abstract: Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis. Leveraging VLM-as-a-Judge, DOCR-Inspector analyzes a document image and its parsed output, identifies all errors, assigns them to one of 28 predefined types, and produces a comprehensive quality assessment. To enable this capability, we construct DOCRcase-200K for training and propose the Chain-of-Checklist reasoning paradigm to capture the hierarchical structure of parsing quality assessment. For empirical validation, we introduce DOCRcaseBench, a set of 882 real-world document parsing cases with manual annotations. On this benchmark, DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro, as well as leading open-source models.
Further experiments demonstrate that its quality assessments provide valuable guidance for parsing results refinement, making DOCR-Inspector both a practical evaluator and a driver for advancing document parsing systems at scale. Model and code are released at: https://github.com/ZZZZZQT/DOCR-Inspector.
[95] K-Track: Kalman-Enhanced Tracking for Accelerating Deep Point Trackers on Edge Devices
Bishoy Galoaa,Pau Closas,Sarah Ostadabbas
Main category: cs.CV
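TL;DR: K-Track is a general-purpose, tracker-agnostic acceleration framework that combines sparse deep learning keyframe updates with lightweight Kalman filtering for efficient point tracking in video, substantially reducing computational cost while maintaining high accuracy.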
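The keyframe-plus-Kalman idea summarized above can be sketched with a textbook constant-velocity Kalman filter: predict on every frame, and correct only on keyframes where the expensive tracker supplies a measurement. This is an illustration under stated assumptions, not the K-Track implementation. The motion model, the noise variances, the fixed keyframe interval, and all names are hypothetical, and one filter instance handles a single coordinate.

```python
class ConstantVelocityKalman:
    """1D constant-velocity Kalman filter; use one instance per coordinate."""

    def __init__(self, position, process_var=1e-3, measurement_var=1e-2):
        self.p, self.v = position, 0.0           # state: position, velocity
        # Symmetric 2x2 covariance stored as [[ppp, ppv], [ppv, pvv]].
        self.ppp, self.ppv, self.pvv = 1.0, 0.0, 1.0
        self.q, self.r = process_var, measurement_var

    def predict(self, dt=1.0):
        """Advance one step: x <- F x, P <- F P F^T + q*I, F = [[1, dt], [0, 1]]."""
        self.p += self.v * dt
        ppp = self.ppp + 2 * dt * self.ppv + dt * dt * self.pvv + self.q
        ppv = self.ppv + dt * self.pvv
        pvv = self.pvv + self.q
        self.ppp, self.ppv, self.pvv = ppp, ppv, pvv
        return self.p

    def update(self, measured_position):
        """Correct with a position measurement, H = [1, 0]."""
        s = self.ppp + self.r                     # innovation variance
        k_p, k_v = self.ppp / s, self.ppv / s     # Kalman gain
        residual = measured_position - self.p
        self.p += k_p * residual
        self.v += k_v * residual
        # P <- (I - K H) P
        ppp = (1 - k_p) * self.ppp
        ppv = (1 - k_p) * self.ppv
        pvv = self.pvv - k_v * self.ppv
        self.ppp, self.ppv, self.pvv = ppp, ppv, pvv


def track(keyframe_positions, num_frames, keyframe_interval):
    """keyframe_positions: {frame_index: position} from the deep tracker."""
    kf = ConstantVelocityKalman(keyframe_positions[0])
    estimates = [kf.p]
    for t in range(1, num_frames):
        kf.predict()                              # cheap, runs on every frame
        if t % keyframe_interval == 0:
            kf.update(keyframe_positions[t])      # expensive call, keyframes only
        estimates.append(kf.p)
    return estimates
```

With a keyframe interval of 5, the deep tracker is invoked on one frame in five and the filter fills in the rest; how much accuracy survives depends on how well the motion model matches the real motion between keyframes.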
Details
Motivation: Existing deep learning trackers depend on GPU inference for every frame, which makes them hard to deploy on resource-constrained edge devices and limits their use in practice.
Method: The K-Track framework applies deep learning updates on sparse keyframes, predicts intermediate frames with a Kalman filter, and uses Bayesian uncertainty propagation to maintain temporal coherence.
Result: K-Track achieves a 5-10x speedup while retaining over 85% of the original trackers' accuracy and runs in real time on edge devices such as the NVIDIA Jetson Nano.
Conclusion: K-Track greatly reduces computational requirements while preserving high accuracy, offering a practical way to deploy high-quality point tracking in resource-constrained environments and narrowing the gap between advanced algorithms and deployable vision systems.
Abstract: Point tracking in video sequences is a foundational capability for real-world computer vision applications, including robotics, autonomous systems, augmented reality, and video analysis. While recent deep learning-based trackers achieve state-of-the-art accuracy on challenging benchmarks, their reliance on per-frame GPU inference poses a major barrier to deployment on resource-constrained edge devices, where compute, power, and connectivity are limited. We introduce K-Track (Kalman-enhanced Tracking), a general-purpose, tracker-agnostic acceleration framework designed to bridge this deployment gap. K-Track reduces inference cost by combining sparse deep learning keyframe updates with lightweight Kalman filtering for intermediate frame prediction, using principled Bayesian uncertainty propagation to maintain temporal coherence. This hybrid strategy enables a 5-10x speedup while retaining over 85% of the original trackers' accuracy. We evaluate K-Track across multiple state-of-the-art point trackers and demonstrate real-time performance on edge platforms such as the NVIDIA Jetson Nano and RTX Titan. By preserving accuracy while dramatically lowering computational requirements, K-Track provides a practical path toward deploying high-quality point tracking in real-world, resource-limited settings, closing the gap between modern tracking algorithms and deployable vision systems.
[96] TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection
Jian-Yu Jiang-Lin,Kang-Yang Huang,Ling Zou,Ling Lo,Sheng-Ping Yang,Yu-Wen Tseng,Kun-Hsiang Lin,Chia-Ling Chen,Yu-Ting Ta,Yan-Tsung Wang,Po-Ching Chen,Hongxia Xie,Hong-Han Shuai,Wen-Huang Cheng
Main category: cs.CV
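TL;DR: Proposes TriDF, a comprehensive benchmark for interpretable DeepFake detection that covers 16 DeepFake types across image, video, and audio modalities and evaluates perception, detection, and hallucination, revealing the interdependence between detection accuracy and explanation reliability.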
Details
Motivation: As generative models advance, the risk posed by fabricated content grows, and DeepFake detection systems that both identify forgeries accurately and explain their decisions reliably are urgently needed.
Method: The TriDF benchmark contains high-quality forgery samples and uses human-annotated, fine-grained manipulation evidence to evaluate models on perception, detection, and hallucination.
Result: Experiments show that accurate perception is essential for reliable detection, but hallucination by the evaluated models can severely disrupt decision-making.
Conclusion: TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, and supports the construction of trustworthy systems against synthetic media threats.
Abstract: Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.
[97] NaviHydra: Controllable Navigation-guided End-to-end Autonomous Driving with Hydra-distillation
Hanfeng Wu,Marlon Steiner,Michael Schmidt,Alvaro Marcos-Ramiro,Christoph Stiller
Main category: cs.CV
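TL;DR: Proposes NaviHydra, a BEV-based controllable end-to-end autonomous driving model that is distilled from a rule-based system and introduces a navigation compliance metric, achieving state-of-the-art performance on the NAVSIM benchmark.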
Details
Motivation: Existing end-to-end models have difficulty following high-level navigation commands, and traditional rule-based systems adapt poorly to dynamic environments; model controllability and navigation consistency need to improve.
Method: A Bird's Eye View (BEV) based trajectory extraction framework learns from a rule-based system through knowledge distillation, introduces a navigation compliance metric for evaluation, and generates trajectories with navigation commands as the control signal.
Result: The model significantly outperforms baseline models on the NAVSIM benchmark and shows stronger adherence to navigation commands and better trajectory controllability.
Conclusion: NaviHydra combines the reliability of rule-based systems with the flexibility of end-to-end models, improving the safety and interpretability of autonomous driving in complex scenarios.
Abstract: The complexity of autonomous driving scenarios requires robust models that can interpret high-level navigation commands and generate safe trajectories. While traditional rule-based systems can react to these commands, they often struggle in dynamic environments, and end-to-end methods face challenges in complying with explicit navigation commands. To address this, we present NaviHydra, a controllable navigation-guided end-to-end model distilled from an existing rule-based simulator. Our framework accepts high-level navigation commands as control signals, generating trajectories that align with specified intentions. We utilize a Bird's Eye View (BEV) based trajectory gathering method to enhance the trajectory feature extraction. Additionally, we introduce a novel navigation compliance metric to evaluate adherence to the intended route, improving controllability and navigation safety. To comprehensively assess our model's controllability, we design a test that evaluates its response to various navigation commands. Our method significantly outperforms baseline models, achieving state-of-the-art results in the NAVSIM benchmark, demonstrating its effectiveness in advancing autonomous driving.
[98] XDen-1K: A Density Field Dataset of Real-World Objects
Jingxuan Zhang,Tianqi Yu,Yatu Zhang,Jinze Wu,Kaixin Yao,Jingyang Liu,Yuyao Zhang,Jiayuan Gu,Jingyi Yu
Main category: cs.CV
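TL;DR: Proposes XDen-1K, the first large-scale multi-modal real-world dataset focused on volumetric density estimation of objects, together with an optimization framework that recovers high-fidelity volumetric density fields from sparse X-ray views, and validates its usefulness for center-of-mass estimation and robotic manipulation.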
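To see why a volumetric density field matters for center-of-mass estimation, consider the density-weighted mean of voxel centers. The sketch below assumes a regular voxel grid indexed as density[z][y][x]; that layout, the function name, and the unit voxel size are illustrative choices, not the dataset's actual format.

```python
def center_of_mass(density, voxel_size=1.0):
    """Density-weighted mean of voxel centers, returned as (x, y, z)."""
    total = mx = my = mz = 0.0
    for z, plane in enumerate(density):
        for y, row in enumerate(plane):
            for x, rho in enumerate(row):
                total += rho
                mx += rho * (x + 0.5)   # voxel centers sit at half-integer offsets
                my += rho * (y + 0.5)
                mz += rho * (z + 0.5)
    if total == 0.0:
        raise ValueError("density field has zero total mass")
    return (mx / total * voxel_size,
            my / total * voxel_size,
            mz / total * voxel_size)
```

For two voxels side by side along x, `center_of_mass([[[1.0, 1.0]]])` gives the geometric center at x = 1.0, while `center_of_mass([[[1.0, 3.0]]])` shifts it to x = 1.25. A surface-only model cannot distinguish the two cases.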
Details
Motivation: Existing models focus on an object's surface geometry and appearance and ignore internal physical properties such as volumetric density, which are essential for robotic manipulation and physical simulation; the lack of large-scale real-world data is the main bottleneck.
Method: The XDen-1K dataset covers 1,000 real-world objects with high-resolution 3D models, part-level annotations, and biplanar X-ray scans; a new optimization framework recovers volumetric density fields from sparse X-ray images, and X-ray images are added as a conditioning signal to a segmentation network for volumetric segmentation.
Result: Experiments show that using the dataset improves the accuracy of center-of-mass estimation and the success rate of robotic manipulation.
Conclusion: XDen-1K provides a foundational resource and a challenging benchmark for physically grounded visual inference and embodied AI and is expected to advance research in these areas.
Abstract: A deep understanding of the physical world is a central goal for embodied AI and realistic simulation. While current models excel at capturing an object's surface geometry and appearance, they largely neglect its internal physical properties. This omission is critical, as properties like volumetric density are fundamental for predicting an object's center of mass, stability, and interaction dynamics in applications ranging from robotic manipulation to physical simulation. The primary bottleneck has been the absence of large-scale, real-world data. To bridge this gap, we introduce XDen-1K, the first large-scale, multi-modal dataset designed for real-world physical property estimation, with a particular focus on volumetric density. The core of this dataset consists of 1,000 real-world objects across 148 categories, for which we provide comprehensive multi-modal data, including a high-resolution 3D geometric model with part-level annotations and a corresponding set of real-world biplanar X-ray scans. Building upon this data, we introduce a novel optimization framework that recovers a high-fidelity volumetric density field of each object from its sparse X-ray views. To demonstrate its practical value, we add X-ray images as a conditioning signal to an existing segmentation network and perform volumetric segmentation. Furthermore, we conduct experiments on downstream robotics tasks. The results show that leveraging the dataset can effectively improve the accuracy of center-of-mass estimation and the success rate of robotic manipulation.
We believe XDen-1K will serve as a foundational resource and a challenging new benchmark, catalyzing future research in physically grounded visual inference and embodied AI.
[99] Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching
Javier Villena Toro,Mehdi Tarkian
Main category: cs.CV
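TL;DR: This paper proposes Geo6DPose, a lightweight, fully local, training-free method for zero-shot 6D pose estimation that combines foundation-model features with a geometric filtering strategy, achieving sub-second inference on a single consumer GPU with performance comparable to larger models.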
Details
Motivation: Existing zero-shot 6D pose estimation relies on large-scale models and cloud inference, which brings high latency, high energy consumption, and deployment risks, and so fails to meet the practical need for on-device inference on compute-limited robots.
Method: Combine foundation-model visual features with a geometric filtering strategy: build correspondences from similarity maps between template DINO descriptors and scene patches, projecting scene patch centers to 3D and template descriptors to the object coordinate system; recover poses with correspondence-driven RANSAC and rank them with a weighted geometric alignment metric that combines reprojection consistency and spatial support.
Result: Achieves sub-second inference at 1.08 FPS on a single consumer GPU with an average recall of 53.7 AR, on par with larger zero-shot baselines; requires no training, fine-tuning, or network connection.
Conclusion: Geo6DPose delivers efficient, robust, fully local 6D pose estimation, stays compatible with evolving foundation-model backbones, and advances practical 6D perception for real robotic deployment.
Abstract: Recent progress in zero-shot 6D object pose estimation has been driven largely by large-scale models and cloud-based inference. However, these approaches often introduce high latency, elevated energy consumption, and deployment risks related to connectivity, cost, and data governance, factors that conflict with the practical constraints of real-world robotics, where compute is limited and on-device inference is frequently required. We introduce Geo6DPose, a lightweight, fully local, and training-free pipeline for zero-shot 6D pose estimation that trades model scale for geometric reliability. Our method combines foundation model visual features with a geometric filtering strategy: similarity maps are computed between onboarded template DINO descriptors and scene patches, and mutual correspondences are established by projecting scene patch centers to 3D and template descriptors to the object model coordinate system. Final poses are recovered via correspondence-driven RANSAC and ranked using a weighted geometric alignment metric that jointly accounts for reprojection consistency and spatial support, improving robustness to noise, clutter, and partial visibility. Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). It requires no training, fine-tuning, or network access, and remains compatible with evolving foundation backbones, advancing practical, fully local 6D perception for robotic deployment.
[100] Optimal transport unlocks end-to-end learning for single-molecule localization
Romain Seailles,Jean-Baptiste Masson,Jean Ponce,Julien Mairal
Main category: cs.CV
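TL;DR: Proposes an end-to-end deep learning method for single-molecule localization microscopy (SMLM) based on an optimal-transport loss and an iterative neural network; it needs no non-maximum suppression (NMS) and outperforms existing techniques at moderate and high emitter densities.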
Details
Motivation: Existing SMLM methods depend on non-maximum suppression (NMS) layers, which are non-differentiable and may discard true signal, limiting performance under dense emission and preventing end-to-end training.
Method: Reformulate the SMLM training objective as a set-matching problem, design an optimal-transport loss that removes the need for NMS, and build an iterative neural network that incorporates knowledge of the microscope's optical system.
Result: The new loss and architecture are validated on synthetic and real biological data and surpass the state of the art at moderate and high emitter densities.
Conclusion: The method enables differentiable, end-to-end SMLM reconstruction, performs well in high-density imaging, and helps advance live-cell super-resolution imaging.
Abstract: Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores (fluorescent molecules stained onto the observed specimen) over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at https://github.com/RSLLES/SHOT.
[101] Sharp Monocular View Synthesis in Less Than a Second
Lars Mescheder,Wei Dong,Shiwei Li,Xuyang Bai,Marcel Santos,Peiyun Hu,Bruno Lecouat,Mingmin Zhen,Amaël Delaunoy,Tian Fang,Yanghai Tsin,Stephan R. Richter,Vladlen Koltun
Main category: cs.CV
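TL;DR: SHARP is a new method for photorealistic view synthesis from a single image; it quickly regresses a 3D Gaussian representation to enable high-quality real-time rendering and reaches state-of-the-art results on several metrics.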
Details
Motivation: Existing single-image view synthesis methods are limited in rendering quality, speed, and scale consistency, so a more efficient and metrically accurate solution is needed.
Method: SHARP regresses a 3D Gaussian scene representation of the input image in under one second via a single feedforward pass through a neural network; the representation supports real-time rendering and camera motion with absolute scale.
Result: SHARP generalizes zero-shot across datasets, reducing LPIPS by 25-34% and DISTS by 21-43% while cutting synthesis time by three orders of magnitude.
Conclusion: SHARP balances quality, speed, and metric accuracy, clearly outperforms prior methods, and advances single-image view synthesis.
Abstract: We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at https://github.com/apple/ml-sharp
[102] CheXmask-U: Quantifying uncertainty in landmark-based anatomical segmentation for X-ray images
Matias Cosarinsky,Nicolas Gaggion,Rodrigo Echeveste,Enzo Ferrante
Main category: cs.CV
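TL;DR: This study proposes an uncertainty estimation method for landmark-based anatomical segmentation of chest X-rays. Using an architecture that combines a convolutional encoder with a graph-based generative decoder, it derives two complementary measures from the variational latent space: latent uncertainty and predictive uncertainty. Experiments show that these measures track the degree of input degradation and can be used to identify unreliable predictions and detect out-of-distribution inputs. The authors also release CheXmask-U, a large-scale dataset of 657,566 segmentations with per-node uncertainty estimates, to support research on spatial variation in segmentation quality.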
Details
Motivation: Accurately assessing the uncertainty of medical image segmentation systems is essential for clinical deployment, but prior work focuses mostly on pixel-level uncertainty, while uncertainty in landmark-based segmentation, which offers topological guarantees, remains underexplored. A reliable uncertainty estimation method for landmark-based segmentation is therefore needed.
Method: Use a hybrid neural network architecture that combines a standard image convolutional encoder with a graph-based generative decoder and exploit its variational latent space: (i) latent uncertainty is read directly from the learned distribution parameters; (ii) predictive uncertainty is computed by sampling the latent variable to generate multiple stochastic output predictions.
Result: Controlled corruption experiments confirm that both measures rise with perturbation severity, reflecting global and local degradation; on the CheXmask dataset the method identifies unreliable predictions and supports out-of-distribution detection; the large-scale uncertainty-annotated dataset CheXmask-U is released.
Conclusion: Uncertainty estimation is an important direction for improving the robustness and safe deployment of landmark-based anatomical segmentation in chest X-rays, and the proposed method offers a workable approach to risk control in clinical use.
Abstract: Uncertainty estimation is essential for the safe clinical deployment of medical image segmentation systems, enabling the identification of unreliable predictions and supporting human oversight. While prior work has largely focused on pixel-level uncertainty, landmark-based segmentation offers inherent topological guarantees yet remains underexplored from an uncertainty perspective. In this work, we study uncertainty estimation for anatomical landmark-based segmentation on chest X-rays. Inspired by hybrid neural network architectures that combine standard image convolutional encoders with graph-based generative decoders, and leveraging their variational latent space, we derive two complementary measures: (i) latent uncertainty, captured directly from the learned distribution parameters, and (ii) predictive uncertainty, obtained by generating multiple stochastic output predictions from latent samples. Through controlled corruption experiments we show that both uncertainty measures increase with perturbation severity, reflecting both global and local degradation. We demonstrate that these uncertainty signals can identify unreliable predictions by comparing with manual ground-truth, and support out-of-distribution detection on the CheXmask dataset. More importantly, we release CheXmask-U (huggingface.co/datasets/mcosarinsky/CheXmask-U), a large scale dataset of 657,566 chest X-ray landmark segmentations with per-node uncertainty estimates, enabling researchers to account for spatial variations in segmentation quality when using these anatomical masks.
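The two measures named in the CheXmask-U abstract can be sketched as follows, assuming a diagonal Gaussian latent with mean `mu` and log-variance `log_var`. Everything here is hypothetical and illustrative: `toy_decode` stands in for the graph-based decoder, averaging the latent variances is only one possible reading of "captured directly from the learned distribution parameters", and the per-node spread is one possible summary of the sampled predictions.

```python
import math
import random
import statistics


def latent_uncertainty(log_var):
    # Assumed summary: mean variance across latent dimensions.
    return sum(math.exp(v) for v in log_var) / len(log_var)


def predictive_uncertainty(mu, log_var, decode, n_samples=50, seed=0):
    """Per-landmark spread of predictions decoded from latent samples."""
    rng = random.Random(seed)
    sigma = [math.exp(0.5 * v) for v in log_var]
    samples = []
    for _ in range(n_samples):
        # Draw z = mu + sigma * eps with eps ~ N(0, 1).
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        samples.append(decode(z))  # list of (x, y) landmark positions
    spread = []
    for k in range(len(samples[0])):
        xs = [s[k][0] for s in samples]
        ys = [s[k][1] for s in samples]
        # Combine the x and y variances into one scalar per node.
        spread.append(math.sqrt(statistics.pvariance(xs) + statistics.pvariance(ys)))
    return spread


def toy_decode(z):
    # Hypothetical decoder: two landmarks that depend linearly on the latent.
    return [(z[0], z[1]), (2.0 * z[0], z[0] + z[1])]


print(latent_uncertainty([0.0, 0.0]))
print(predictive_uncertainty([0.0, 0.0], [0.0, 0.0], toy_decode))
```

Because the result is one value per landmark rather than one per image, a sketch like this also shows how per-node estimates can expose spatial variation in segmentation quality.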
Our findings establish uncertainty estimation as a promising direction to enhance robustness and safe deployment of landmark-based anatomical segmentation methods in chest X-ray. A fully working interactive demo of the method is available at huggingface.co/spaces/matiasky/CheXmask-U and the source code at github.com/mcosarinsky/CheXmask-U.
[103] SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
Peizheng Li,Zhenghao Zhang,David Holtz,Hang Yu,Yutong Yang,Yuzhi Lai,Rui Song,Andreas Geiger,Andreas Zell
Main category: cs.CV
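TL;DR: This paper proposes SpaceDrive, a new autonomous driving framework built on vision language models (VLMs) that introduces explicit 3D positional encodings (PEs) to strengthen understanding of fine-grained spatial relationships, improving planning accuracy and semantic indexing.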
Details
Motivation: Existing VLMs fall short in understanding fine-grained 3D spatial relationships, which is critical for systems that interact with the physical world, such as autonomous driving. A method that fuses spatial information more effectively is needed.
Method: Propose the SpaceDrive framework, which applies a universal positional encoder to all 3D coordinates from multi-view depth estimation, historical ego-states, and text prompts; the 3D PEs are superimposed on 2D visual tokens and also serve as a task-agnostic coordinate representation that replaces digit-wise numerical tokens, enabling joint semantic and spatial reasoning and direct regression of trajectory coordinates.
Result: SpaceDrive achieves state-of-the-art open-loop performance on nuScenes and a Driving Score of 78.02 on the Bench2Drive closed-loop benchmark, the second highest among existing VLM-based methods.
Conclusion: By explicitly modeling 3D positional information, SpaceDrive markedly improves VLM spatial understanding and trajectory planning in autonomous driving, validating joint semantic-spatial representation.
Abstract: End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating digit-by-digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods.
[104] Video Depth Propagation
Luigi Piccinelli,Thiemo Wandel,Christos Sakaridis,Wim Abbeloos,Luc Van Gool
Main category: cs.CV
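TL;DR: VeloDepth is an efficient online video depth estimation method that uses spatiotemporal priors and deep feature propagation to achieve high temporal consistency, accuracy, and real-time performance.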
Details
Motivation: Existing video depth estimation methods trade off temporal consistency, accuracy, and computational efficiency, which limits their use in practice.
Method: Propose VeloDepth, which includes a new Propagation Module that refines and propagates depth features using flow-based warping and learned residual corrections, and that structurally enforces temporal consistency.
Result: State-of-the-art zero-shot temporal consistency on multiple benchmarks, competitive accuracy, and significantly faster inference than existing methods.
Conclusion: VeloDepth offers a practical, efficient, and accurate solution for real-time depth estimation, suitable for a range of visual perception tasks.
Abstract: Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth
[105] IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation
Yuan-Ming Li,Qize Yang,Nan Lei,Shenghao Fu,Ling-An Zeng,Jian-Fang Hu,Xihan Wei,Wei-Shi Zheng
Main category: cs.CV
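TL;DR: This paper proposes IRMoGen, a new motion generation paradigm that interleaves motion generation, assessment, and refinement in an iterative text-motion dialogue, enabling bidirectional knowledge flow between understanding and generation. The authors present IRG-MotionLLM, the first model with this capability, together with a three-stage training strategy and an automated data engine. Experiments show that the method markedly improves text-motion alignment and generation performance and outperforms the baseline model on benchmarks.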
Details
Motivation: Existing motion-aware large language models usually treat understanding and generation as separate tasks with no interactive feedback between them, which limits performance. This work introduces assessment and refinement steps to bridge understanding and generation and let each strengthen the other.
Method: Propose the IRMoGen paradigm, which alternates generation, assessment, and refinement through text-motion dialogue; build the IRG-MotionLLM model with a three-stage training scheme; develop an automated data engine that synthesizes interleaved reasoning annotations from existing datasets.
Result: Experiments show that (i) assessment and refinement tasks significantly improve text-motion alignment; (ii) interleaving generation, assessment, and refinement steps yields consistent gains across training stages; (iii) IRG-MotionLLM clearly outperforms the baseline on standard text-to-motion benchmarks, with cross-evaluator testing confirming its effectiveness.
Conclusion: By introducing assessment and refinement, IRMoGen tightly couples understanding and generation; IRG-MotionLLM improves motion generation quality and shows the potential of interleaved reasoning in multimodal tasks.
Abstract: Recent advances in motion-aware large language models have shown remarkable promise for unifying motion understanding and generation tasks. However, these models typically treat understanding and generation separately, limiting the mutual benefits that could arise from interactive feedback between tasks. In this work, we reveal that motion assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. Leveraging this insight, we propose Interleaved Reasoning for Motion Generation (IRMoGen), a novel paradigm that tightly couples motion generation with assessment and refinement through iterative text-motion dialogue. To realize this, we introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance. IRG-MotionLLM is developed progressively with a novel three-stage training scheme, initializing and subsequently enhancing native IRMoGen capabilities. To facilitate this development, we construct an automated data engine to synthesize interleaved reasoning annotations from existing text-motion datasets. Extensive experiments demonstrate that: (i) Assessment and refinement tasks significantly improve text-motion alignment; (ii) Interleaving motion generation, assessment, and refinement steps yields consistent performance gains across training stages; and (iii) IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks.
Cross-evaluator testing further validates its effectiveness. Code & Data: https://github.com/HumanMLLM/IRG-MotionLLM/tree/main.
[106] LDP: Parameter-Efficient Fine-Tuning of Multimodal LLM for Medical Report Generation
Tianyu Zhou,Junyi Tang,Zehui Li,Dahong Qian,Suncheng Xiang
Main category: cs.CV
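TL;DR: Proposes LDP, a framework that uses a multimodal large language model to generate professional colon polyp diagnosis reports; combined with the MMEndo dataset and parameter-efficient fine-tuning, it markedly improves report quality while lowering computational cost.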
Details
Motivation: Because high-quality multimodal medical data is scarce, traditional automated colonoscopy reporting suffers from inconsistencies and hallucinations; a reliable automated diagnosis approach that meets clinical standards is needed.
Method: Build MMEndo, an expert-annotated multimodal endoscopic dataset; fine-tune the Qwen2-VL-7B model parameter-efficiently with LoRA; align it with clinical standards through Direct Preference Optimization (DPO).
Result: LDP outperforms existing baselines on both automated metrics and clinical expert evaluation, with a Physician Score of 7.2/10; training compute cost is 833x lower than full fine-tuning; robustness is confirmed on the IU-XRay dataset.
Conclusion: LDP offers primary healthcare a scalable, clinically viable way to generate automated polyp diagnosis reports that is both efficient and professional.
Abstract: Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework leveraging multimodal large language models (MLLMs) for professional polyp diagnosis report generation. Specifically, we curate MMEndo, a multimodal endoscopic dataset comprising expert-annotated colonoscopy image-text pairs. We fine-tune the Qwen2-VL-7B backbone using Parameter-Efficient Fine-Tuning (LoRA) and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that our LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluations (achieving a Physician Score of 7.2/10), significantly reducing training computational costs by 833x compared to full fine-tuning. The proposed solution offers a scalable, clinically viable path for primary healthcare, with additional validation on the IU-XRay dataset confirming its robustness.
[107] Blood Pressure Prediction for Coronary Artery Disease Diagnosis using Coronary Computed Tomography Angiography
Rene Lisasi,Michele Esposito,Chen Zhao
Main category: cs.CV
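TL;DR: Proposes a diffusion-based regression method that predicts coronary blood pressure distributions directly from coronary computed tomography angiography (CCTA) features; paired with an automated pipeline for generating training data, it greatly improves computational efficiency and supports fast, non-invasive diagnosis of coronary artery disease.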
Details
Motivation: Computational fluid dynamics (CFD) simulation of coronary blood flow yields valuable hemodynamic markers, but it is computationally expensive and slow, which makes large-scale clinical use difficult and limits both the training data available to AI models and the adoption of physiology-based coronary artery disease assessment.
Method: Develop an end-to-end automated pipeline that extracts coronary geometry from CCTA images and generates flow simulation data, and introduce a diffusion-based regression model that predicts coronary blood pressure distributions directly from CCTA-derived features, avoiding costly CFD computation at inference time.
Result: On a dataset of simulated coronary hemodynamics the model reaches state-of-the-art performance, with an R² of 64.42%, a root mean squared error (RMSE) of 0.0974, and a normalized RMSE of 0.154, outperforming several baselines.
Conclusion: The work provides a scalable, accessible framework for fast, non-invasive coronary blood pressure prediction, helping to bring physiology-based non-invasive diagnosis of coronary artery disease into wider clinical use.
Abstract: Computational fluid dynamics (CFD) based simulation of coronary blood flow provides valuable hemodynamic markers, such as pressure gradients, for diagnosing coronary artery disease (CAD). However, CFD is computationally expensive, time-consuming, and difficult to integrate into large-scale clinical workflows. These limitations restrict the availability of labeled hemodynamic data for training AI models and hinder broad adoption of non-invasive, physiology-based CAD assessment. To address these challenges, we develop an end-to-end pipeline that automates coronary geometry extraction from coronary computed tomography angiography (CCTA), streamlines simulation data generation, and enables efficient learning of coronary blood pressure distributions. The pipeline reduces the manual burden associated with traditional CFD workflows while producing consistent training data. We further introduce a diffusion-based regression model designed to predict coronary blood pressure directly from CCTA-derived features, bypassing the need for slow CFD computation during inference. Evaluated on a dataset of simulated coronary hemodynamics, the proposed model achieves state-of-the-art performance, with an R² of 64.42%, a root mean squared error of 0.0974, and a normalized RMSE of 0.154, outperforming several baseline approaches. This work provides a scalable and accessible framework for rapid, non-invasive blood pressure prediction to support CAD diagnosis.
[108] What matters for Representation Alignment: Global Information or Spatial Structure?
Jaskirat Singh,Xingjian Leng,Zongze Wu,Liang Zheng,Richard Zhang,Eli Shechtman,Saining Xie
Main category: cs.CV
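TL;DR: This paper asks which aspect of the target representation matters more for generation when training generative models: global semantic information or spatial structure. A large-scale empirical analysis of 27 vision encoders finds that spatial structure, not global semantic performance, drives generation quality. Building on this, the authors propose iREPA, which strengthens the transfer of spatial information with a convolution layer and a spatial normalization layer and markedly speeds up REPA convergence with fewer than 4 lines of code.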
Details
Motivation: Determine whether the key factor behind generation performance in representation alignment (REPA) is the target representation's global semantic information or its spatial structure, challenging the common belief that stronger semantic performance means better generation.
Method: Introduce iREPA, which replaces the MLP projection layer in REPA with a convolution layer and adds a spatial normalization layer for the external representation, strengthening the transfer of spatial information.
Result: Experiments show that the spatial structure of the target representation, rather than its global semantic performance, is what matters; iREPA consistently speeds up convergence across encoders, model sizes, and training variants.
Conclusion: Spatial structure is the core factor behind generation performance; the design principles of representation alignment should be revisited, with priority given to preserving and transferring spatial information.
Abstract: Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of spatial information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in fewer than 4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT, etc.). Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models.
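The abstract characterizes spatial structure as the pairwise cosine similarity between patch tokens. A minimal sketch of that quantity is given below for a small list of token vectors; it illustrates the definition only and is not the iREPA implementation. The function name `pairwise_cosine` and the list-of-lists token format are assumptions.

```python
import math


def pairwise_cosine(tokens):
    """Return the N x N matrix of cosine similarities between patch tokens.

    tokens is a hypothetical list of N equal-length feature vectors,
    one per image patch.
    """
    norms = [math.sqrt(sum(v * v for v in t)) for t in tokens]
    n = len(tokens)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(tokens[i], tokens[j]))
            # Guard against division by zero for all-zero tokens.
            sim[i][j] = dot / max(norms[i] * norms[j], 1e-12)
    return sim


patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
for row in pairwise_cosine(patches):
    print(row)
```

The matrix depends only on the directions of the tokens relative to one another, so two encoders with very different classification accuracy can still share a similar spatial structure in this sense.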
The code and project page are available at https://end2end-diffusion.github.io/irepa
[109] Graph Laplacian Transformer with Progressive Sampling for Prostate Cancer Grading
Masum Shah Junayed,John Derek Van Vessem,Qian Wan,Gahie Nam,Sheida Nabavi
Main category: cs.CV
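TL;DR: Proposes GLAT-IRM, a Transformer model that combines a graph Laplacian attention mechanism with an iterative refinement module for grading prostate cancer from whole-slide images; it dynamically selects key regions and strengthens spatial consistency, markedly improving performance.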
Details
Motivation: Existing methods for whole-slide images mostly use random or static patch selection, which admits redundant or uninformative regions and hurts diagnostic accuracy and model performance.
Method: Propose the GLAT-IRM framework: a pretrained ResNet50 extracts local features and a foundation model in no-gradient mode scores patches so that patch selection is refined iteratively; a graph with patches as nodes is built, graph Laplacian constraints preserve spatial consistency, and a learnable filtering mechanism enhances discriminative histological structures; a convex aggregation mechanism dynamically adjusts patch weights to produce a robust slide-level representation.
Result: Extensive experiments on five public datasets and one private dataset show that the method outperforms state-of-the-art approaches in performance, spatial consistency, and computational efficiency.
Conclusion: GLAT-IRM improves the accuracy and robustness of prostate cancer WSI grading; iterative refinement and graph attention focus the model on key regions and model spatial structure, giving it potential as a clinical decision-support tool.
Abstract: Prostate cancer grading from whole-slide images (WSIs) remains a challenging task due to the large-scale nature of WSIs, the presence of heterogeneous tissue structures, and the difficulty of selecting diagnostically relevant regions. Existing approaches often rely on random or static patch selection, leading to the inclusion of redundant or non-informative regions that degrade performance. To address this, we propose a Graph Laplacian Attention-Based Transformer (GLAT) integrated with an Iterative Refinement Module (IRM) to enhance both feature learning and spatial consistency. The IRM iteratively refines patch selection by leveraging a pretrained ResNet50 for local feature extraction and a foundation model in no-gradient mode for importance scoring, ensuring only the most relevant tissue regions are preserved. The GLAT models tissue-level connectivity by constructing a graph where patches serve as nodes, ensuring spatial consistency through graph Laplacian constraints and refining feature representations via a learnable filtering mechanism that enhances discriminative histological structures. Additionally, a convex aggregation mechanism dynamically adjusts patch importance to generate a robust WSI-level representation. Extensive experiments on five public and one private dataset demonstrate that our model outperforms state-of-the-art methods, achieving higher performance and spatial consistency while maintaining computational efficiency.
[110] Self-Ensemble Post Learning for Noisy Domain Generalization
Wang Lu,Jindong Wang
Main category: cs.CV
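TL;DR: This paper proposes SEPL, a self-ensemble post-learning method that uses feature probing training and prediction ensemble inference to make domain generalization models more robust to noisy labels.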
Details
Motivation: Existing domain generalization methods degrade under label noise because noise worsens the amplification of spurious features in deep layers.
Method: Use the model's intermediate-layer features to train multiple probing classifiers, train them in a semi-supervised manner to cope with noisy labels, and fuse the predictions of the classification heads with a crowdsourcing-style ensemble method.
Result: Experiments show that SEPL improves the robustness of existing methods and performs well across a range of noise settings.
Conclusion: By exploiting the diversity and discriminative power of intermediate features, SEPL offers an efficient and flexible solution for domain generalization under noise.
Abstract: While computer vision and machine learning have made great progress, their robustness is still challenged by two key issues: data distribution shift and label noise. When domain generalization (DG) encounters noise, noisy labels further exacerbate the emergence of spurious features in deep layers, i.e., spurious feature enlargement, leading to a degradation in the performance of existing algorithms. This paper, starting from domain generalization, explores how to make existing methods work again when they encounter label noise. We find that the latent features inside the model have certain discriminative capabilities, and different latent features focus on different parts of the image. Based on these observations, we propose the Self-Ensemble Post Learning approach (SEPL) to diversify the features that can be leveraged. Specifically, SEPL consists of two parts: feature probing training and prediction ensemble inference. It leverages intermediate feature representations within the model architecture, training multiple probing classifiers to fully exploit the capabilities of pre-trained models, while the final predictions are obtained through the integration of outputs from these diverse classification heads. Considering the presence of noisy labels, we employ semi-supervised algorithms to train probing classifiers. Given that different probing classifiers focus on different areas, we integrate their predictions using a crowdsourcing inference approach.
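The prediction-ensemble step can be pictured with the simplest crowdsourcing-style aggregator, a per-sample majority vote over the probing classifiers. The abstract does not say which crowdsourcing inference method SEPL uses, so majority voting is an assumption made here for illustration, as are the function name `majority_vote` and the input layout.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse label predictions from several probing classifiers.

    predictions is a hypothetical list with one entry per classifier;
    each entry is a list of predicted labels, one per sample.
    """
    n_samples = len(predictions[0])
    fused = []
    for i in range(n_samples):
        votes = Counter(p[i] for p in predictions)
        # most_common(1) returns the label with the highest vote count;
        # ties go to the label seen first among the classifiers.
        fused.append(votes.most_common(1)[0][0])
    return fused


# Three probing heads, three samples.
heads = [[0, 1, 2], [0, 1, 1], [1, 1, 2]]
print(majority_vote(heads))
```

A vote of this kind lets heads that attend to different image regions outvote a single head misled by a spurious feature, which is the intuition behind integrating diverse classification heads.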
Extensive experimental evaluations demonstrate that the proposed method not only enhances the robustness of existing methods but also exhibits significant potential for real-world applications with high flexibility.
[111] PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning
Jianqi Chen,Biao Zhang,Xiangjun Tang,Peter Wonka
Main category: cs.CV
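TL;DR: This paper proposes PoseGAM, a geometry-aware multi-view framework for 6D pose estimation of unseen objects without explicit matching, and builds a large-scale synthetic dataset to improve generalization, achieving leading performance on multiple benchmarks.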
Details
Motivation: 6D pose estimation for unseen objects remains challenging; existing methods depend on explicit feature matching between the query image and templates or the object model, which limits generalization.
Method: Propose PoseGAM, built on multi-view foundation model architectures, which incorporates object geometry through two mechanisms, explicit point-based geometry and features learned by geometry representation networks, and predicts pose directly from a query image and multiple template images without explicit matching.
Result: Leading performance on multiple benchmarks, with an average AR improvement of 5.1% and gains of up to 17.6% on individual datasets, showing strong generalization to unseen objects.
Conclusion: By fusing explicit and implicit geometric information, PoseGAM achieves accurate, well-generalizing 6D pose estimation without explicit matching, offering an effective solution for unseen-object pose estimation.
Abstract: 6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/
[112] SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation
Kehong Gong,Zhengyu Wen,Mingxi Xu,Weixia He,Qi Wang,Ning Zhang,Zhengyu Li,Chenbin Li,Dongze Lian,Wei Zhao,Xiaoyu He,Mingyuan Zhang
Main category: cs.CV
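TL;DR: This paper proposes SWiT-4D, a parameter-free video-to-4D mesh generation method based on a sliding-window Transformer that needs no large-scale 4D supervision; it integrates seamlessly into DiT-based image-to-3D generators and produces high-quality, temporally consistent 4D reconstructions from videos of arbitrary length.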
Details
Motivation: The lack of large-scale real 4D mesh datasets makes it hard to train general video-to-4D models from scratch; powerful image-to-3D prior models exist, but using them effectively while reducing dependence on 4D supervision remains a challenge.
Method: Propose SWiT-4D, which uses a sliding-window Transformer to add spatial-temporal modeling across video frames while leaving the original single-image forward process unchanged; an optimization-based trajectory module recovers global translation, and fine-tuning on a short video is enough to reconstruct long videos in 4D.
Result: On the in-domain zoo-test set and out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos), SWiT-4D outperforms existing methods in temporal smoothness and geometric quality, reaching high-fidelity results after fine-tuning on a video shorter than 10 seconds.
Conclusion: SWiT-4D makes efficient use of image-to-3D prior models for 4D generation and keeps strong performance under very limited 4D supervision, showing high data efficiency and practical deployment potential.
Abstract: Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (under 10 s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness.
Project page: https://animotionlab.github.io/SWIT4D/
[113] MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence
Jingli Lin,Runsen Xu,Shaohao Zhu,Sihan Yang,Peizhou Cao,Yunlong Ran,Miao Hu,Chenming Zhu,Yiman Xie,Yilin Long,Wenbo Hu,Dahua Lin,Tai Wang,Jiangmiao Pang
Main category: cs.CV
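TL;DR: This paper presents MMSI-Video-Bench, a fully human-annotated benchmark for evaluating the video-based spatial intelligence of multimodal large language models (MLLMs). It covers four levels (perception, planning, prediction, and cross-video reasoning) with 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Expert design and review keep the questions precise and unambiguous, and the benchmark supports three domain-oriented sub-benchmarks. An evaluation of 25 mainstream MLLMs reveals a large human-AI gap: many models perform near chance and the best model still trails humans by nearly 60%. Fine-grained error analysis shows systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. The study also finds that common frame-sampling strategies, 3D spatial cues, and chain-of-thought prompting all fail to improve performance meaningfully.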
Details
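Motivation: There is no comprehensive benchmark for evaluating how well MLLMs understand space from continuous visual input, so progress toward general-purpose assistants in real physical environments is hard to measure. A high-quality benchmark covering multiple levels of spatial intelligence is needed.
Method: Propose MMSI-Video-Bench, built on a four-level framework (perception, planning, prediction, cross-video reasoning) with 1,106 questions grounded in 1,278 video clips, all designed by 3DV experts and accompanied by explanatory rationales. The data comes from 25 public datasets and in-house videos, and three sub-benchmarks support domain-specific evaluation. 25 open-source and proprietary MLLMs are evaluated, with fine-grained error analysis of their failure modes.
Result: Most MLLMs perform near chance and the best model trails humans by nearly 60%; spatially fine-tuned models generalize poorly to the benchmark; common frame-sampling strategies, 3D spatial cues, and chain-of-thought prompting bring no meaningful gains; error analysis reveals systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence.
Conclusion: MMSI-Video-Bench provides a rigorous, comprehensive testbed for video-based spatial intelligence, exposes major limitations of current MLLMs in spatial understanding, and points to the need for stronger geometric, dynamic, and cross-video reasoning in agents meant for physical environments.
Abstract: Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human-AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence.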
We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.
[114] From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
Zongzhao Li,Xiangzhe Kong,Jiahui Su,Zongyang Ma,Mingze Li,Songyou Li,Yuelin Zhang,Yu Rong,Tingyang Xu,Deli Zhao,Wenbing Huang
Main category: cs.CV
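TL;DR: This paper introduces the concept of Microscopic Spatial Intelligence (MiSI) and builds the MiSI-Bench benchmark to evaluate how well vision-language models understand microscopic spatial relationships. Current models still lag humans significantly, though a fine-tuned small model shows promise on some tasks.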
Details
Motivation: Microscopic spatial intelligence is essential to scientific discovery, but the ability of existing vision-language models to understand spatial relationships in microscopic molecular structures has not been systematically evaluated, so a dedicated benchmark is needed.
Method: Proposes the MiSI-Bench benchmark framework, with over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine tasks that range from elementary spatial transformations to complex relational identification.
Result: Current state-of-the-art VLMs perform far below human level on the benchmark. A fine-tuned 7B model, however, surpasses humans on spatial transformation tasks while performing poorly on scientifically grounded tasks such as hydrogen bond recognition.
Conclusion: Although a small fine-tuned model can surpass humans on specific spatial tasks, progress toward scientific AGI still requires integrating explicit domain knowledge.
Abstract: This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.
[115] MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
Kehong Gong,Zhengyu Wen,Weixia He,Mingxi Xu,Qi Wang,Ning Zhang,Zhengyu Li,Dongze Lian,Wei Zhao,Xiaoyu He,Mingyuan Zhang
Main category: cs.CV
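TL;DR: This paper presents MoCapAnything, a framework for Category-Agnostic Motion Capture (CAMoCap) that takes a monocular video and an arbitrary rigged 3D asset and produces a rotation-based animation driving that asset, supporting cross-species and cross-skeleton motion retargeting.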
Details
Motivation: Most existing motion capture pipelines are designed for a specific species or template, lack generality, and are hard to extend to diverse 3D assets. The paper aims to remove this limitation and achieve truly general motion capture.
Method: Proposes MoCapAnything, with three learnable modules and a lightweight inverse kinematics (IK) stage: a Reference Prompt Encoder extracts per-joint queries from the asset, a Video Feature Extractor computes visual descriptors and reconstructs a coarse 4D deforming mesh, a Unified Motion Decoder fuses these cues into temporally coherent trajectories, and constraint-aware IK finally recovers asset-specific rotations.
Result: Experiments on in-domain benchmarks and in-the-wild videos show that MoCapAnything produces high-quality skeletal animations and meaningful retargeting across species and heterogeneous rigs. The authors also curate the Truebones Zoo dataset with 1038 motion clips for evaluation.
Conclusion: MoCapAnything achieves category-agnostic motion capture, supports end-to-end animation generation prompted by a 3D asset, and advances scalable, prompt-driven 3D motion capture.
Abstract: Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/
[116] PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction
Brandon Smock,Valerie Faucon-Morin,Max Sokolov,Libin Liang,Tayyibah Khanam,Maury Courtland
Main category: cs.CV
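TL;DR: This paper introduces PubTables-v2, a new large-scale dataset that supports several challenging table extraction tasks and provides the first large-scale benchmark for multi-page table structure recognition. Using the dataset, the authors evaluate domain-specialized vision-language models and propose POTATR, which extends the Table Transformer to comprehensive page-level table extraction.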
Details
Motivation: Because annotated data is scarce, progress in table extraction within complex document contexts has been hard to demonstrate, especially for multi-page table structure recognition. A large-scale, diverse dataset is needed to move the field forward.
Method: Builds the new large-scale dataset PubTables-v2, which supports multiple table extraction tasks; uses it to evaluate domain-specialized vision-language models (VLMs); and develops POTATR, an image-to-graph extension of the Table Transformer for end-to-end page-level table extraction.
Result: PubTables-v2 is the first large-scale benchmark supporting multi-page table structure recognition. Experiments measure the performance of current VLMs on the relevant tasks, and POTATR demonstrates effectiveness for page-level table extraction.
Conclusion: PubTables-v2 fills the gap in high-quality annotated data for table extraction and advances multi-page, context-aware table recognition, while POTATR offers a new direction for page-level table understanding.
Abstract: Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.
[117] DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
Peiying Zhang,Nanxuan Zhao,Matthew Fisher,Yiran Xu,Jing Liao,Difan Liu
Main category: cs.CV
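TL;DR: DuetSVG is a unified multimodal model that jointly generates image tokens and the corresponding SVG tokens end to end. A test-time scaling strategy uses the model's own visual predictions to guide SVG decoding, yielding SVGs that are visually faithful, semantically aligned, and syntactically clean.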
Details
Motivation: Existing vision-language model (VLM)-based SVG generation methods produce only text and have no visual signal during decoding, so they struggle with complex semantics and often fail to produce visually appealing or geometrically coherent SVGs.
Method: Proposes DuetSVG, a unified multimodal model trained on both image and SVG datasets that jointly generates image tokens and SVG tokens end to end. At inference, a new test-time scaling strategy uses the model's own visual predictions as guidance to improve SVG decoding quality.
Result: Extensive experiments show that the method outperforms existing methods across a range of applications, producing visually faithful, semantically aligned, and syntactically clean SVGs.
Conclusion: By combining visual and symbolic generation, DuetSVG improves the quality and consistency of SVG generation and offers a new approach for future multimodal content generation.
Abstract: Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
[118] FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
Yulu Gan,Ligeng Zhu,Dandan Shan,Baifeng Shi,Hongxu Yin,Boris Ivanovic,Song Han,Trevor Darrell,Jitendra Malik,Marco Pavone,Boyi Li
Main category: cs.CV
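TL;DR: This paper presents FoundationMotion, a fully automated data curation pipeline for building large-scale, fine-grained motion datasets. It combines object trajectory detection in videos with large language models to generate high-quality motion captions and question-answer pairs, substantially improving model performance on motion understanding tasks.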
Details
Motivation: Existing motion datasets rely on costly manual annotation, are hard to scale, and fall short in size and granularity, which limits current models on motion understanding tasks.
Method: Proposes the FoundationMotion pipeline: objects are first detected and tracked in videos to extract trajectories; the trajectories and video frames are then given to large language models to generate fine-grained motion captions and diverse question-answer pairs for training.
Result: Fine-tuning open-source models such as NVILA-Video-15B and Qwen2.5-7B on datasets produced by the pipeline substantially improves their performance on several motion understanding benchmarks, even surpassing strong closed-source or large open-source models such as Gemini-2.5 Flash and Qwen2.5-VL-72B.
Conclusion: FoundationMotion offers a scalable, automated way to build high-quality motion understanding datasets and effectively strengthens models' motion and spatial reasoning.
Abstract: Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.
[119] BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models
Shengao Wang,Wenqi Wang,Zecheng Wang,Max Whitton,Michael Wakeham,Arjun Chandra,Joey Huang,Pengyue Zhu,Helen Chen,David Li,Jeffrey Li,Shawn Li,Andrew Zagula,Amy Zhao,Andrew Zhu,Sayaka Nakamura,Yuki Yamamoto,Jerry Jun Yokono,Aaron Mueller,Bryan A. Plummer,Kate Saenko,Venkatesh Saligrama,Boqing Gong
Main category: cs.CV
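TL;DR: This paper presents BabyVLM-V2, a vision-language modeling framework grounded in infant developmental trajectories. With a longitudinal multimodal pretraining set and the newly proposed DevCV Toolbox for evaluation, it achieves sample-efficient, competitive performance in developmentally grounded pretraining of vision foundation models, even outperforming GPT-4o on some tasks.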
Details
Motivation: Inspired by early childhood developmental trajectories, the work aims to build a sample-efficient pretraining approach for vision foundation models that better matches human cognitive development.
Method: Proposes the BabyVLM-V2 framework, comprising a longitudinal multimodal pretraining set built from real infant experiences (video-utterance, image-utterance, and multi-turn conversational data), a compact model pretrained from scratch, and DevCV Toolbox, a cognitive evaluation benchmark of ten multimodal tasks aligned with young children's capabilities.
Result: Experiments show that a compact model pretrained from scratch performs competitively on DevCV Toolbox and outperforms GPT-4o on some tasks, supporting the effectiveness of developmentally oriented pretraining.
Conclusion: BabyVLM-V2 provides a principled, unified framework that advances research on pretraining vision foundation models in line with human development.
Abstract: Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.
[120] Any4D: Unified Feed-Forward Metric 4D Reconstruction
Jay Karhade,Nikhil Keetha,Yuchen Zhang,Tanisha Gupta,Akash Sharma,Sebastian Scherer,Deva Ramanan
Main category: cs.CV
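TL;DR: Any4D is a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. It can process multiple input modalities and substantially outperforms existing methods in accuracy and compute efficiency.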
Details
Motivation: Existing 4D reconstruction methods are typically limited to 2-view scene flow or sparse 3D point tracking, and mostly rely on monocular RGB video without effectively integrating multimodal sensor data.
Method: Proposes Any4D, a multi-view transformer framework with a modular 4D scene representation: egocentric factors (depth maps and camera intrinsics) are represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) in global world coordinates.
Result: Achieves 2-3x lower error and 15x faster computation across diverse setups, and supports input modalities such as RGB-D, IMU, and Radar Doppler.
Conclusion: Through a flexible modular design, Any4D achieves efficient and accurate 4D reconstruction and opens up a wide range of downstream applications.
Abstract: We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups, both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
[121] GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting
Madhav Agarwal,Mingtian Zhang,Laura Sevilla-Lara,Steven McDonagh
Main category: cs.CV
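TL;DR: Proposes a Gaussian Splatting method guided by 3D Morphable Models that predicts model parameters directly from audio to generate real-time, stable talking head videos.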
Details
Motivation: Existing speech-driven talking head methods struggle to balance realism, real-time performance, and temporal stability, which limits practical use.
Method: Combines Gaussian Splatting with 3D Morphable Models and uses a transformer to predict model parameters directly from audio, enabling stable driving of person-specific avatars.
Result: From monocular video and independent audio inputs, the method generates talking head videos in real time and reports competitive quantitative and qualitative performance.
Conclusion: The method markedly improves temporal stability while maintaining high visual fidelity, making it suitable for interactive avatar applications in real-world settings.
Abstract: Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods either achieve high visual fidelity but are slow, or are fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle with one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos where we report competitive quantitative and qualitative performance.
[122] OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis
Xiang Fan,Sharath Girish,Vivek Ramanujan,Chaoyang Wang,Ashkan Mirzaei,Petr Sushko,Aliaksandr Siarohin,Sergey Tulyakov,Ranjay Krishna
Main category: cs.CV
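TL;DR: OmniView is a unified framework that generalizes across a wide range of 4D consistency tasks and supports flexible combinations of space, time, and view conditions. It matches or outperforms task-specific models on several benchmarks.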
Details
Motivation: Existing methods target only specific subsets of 4D consistency tasks and are trained on disjoint slices of data, so they lack generality.
Method: OmniView represents space, time, and view conditions separately, allowing flexible combinations of inputs and handling multiple 4D tasks within one unified framework.
Result: Improves image quality scores by up to 33% on multiview NVS, 60% on dynamic NVS, and 20% on static camera control, and reduces camera trajectory errors by 4x in text-conditioned video generation.
Conclusion: OmniView demonstrates the feasibility of a generalist 4D video model with strong generalization.
Abstract: Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% in multiview NVS LLFF dataset, 60% in dynamic NVS Neural 3D Video benchmark, 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/
[123] Mull-Tokens: Modality-Agnostic Latent Thinking
Arijit Ray,Ahmed Abdelkader,Chengzhi Mao,Bryan A. Plummer,Kate Saenko,Ranjay Krishna,Leonidas Guibas,Wen-Sheng Chu
Main category: cs.CV
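TL;DR: Mull-Tokens are modality-agnostic latent tokens that carry intermediate information freely between text and image modalities during multimodal reasoning, improving performance on spatial reasoning tasks.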
Details
Motivation: Existing multimodal models that reason with images rely on specialist tools, costly image generation, or handcrafted reasoning data. They are brittle, do not scale, and struggle to support abstract cross-modal reasoning.
Method: Proposes Mull-Tokens, first trained with supervision from interleaved text-image traces and then fine-tuned without supervision using only the final answers, so that the model can "think" freely across text and image modalities.
Result: On four challenging spatial reasoning benchmarks, Mull-Tokens improve on the strongest text-only or interleaved image-text baseline by +3% on average and by up to +16% on a reasoning-heavy puzzle-solving split.
Conclusion: Mull-Tokens offer a simple and effective approach to abstract multimodal reasoning that improves performance on complex reasoning tasks without extra tools or hand-annotated reasoning traces.
Abstract: Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative, Mull-Tokens: modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offer a simple solution to abstractly think in multiple modalities.
[124] VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
Delong Chen,Mustafa Shukor,Theo Moutakanni,Willy Chung,Jade Yu,Tejaswi Kasarla,Allen Bolourchi,Yann LeCun,Pascale Fung
Main category: cs.CV
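TL;DR: VL-JEPA is a vision-language model built on the Joint Embedding Predictive Architecture (JEPA). It predicts continuous text embeddings instead of autoregressively generating tokens, improving efficiency and performance and outperforming existing models on a range of tasks.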
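The abstract for this entry notes that an embedding-predicting model supports open-vocabulary classification without architectural changes. As a rough illustration of that idea only, the sketch below picks the candidate label whose embedding has the highest cosine similarity to a predicted embedding. It is not taken from VL-JEPA: the `classify` helper, the label names, and the 3-D vectors are all hypothetical, and a real system would obtain these embeddings from trained encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(predicted, label_embeddings):
    """Return the label whose embedding is closest (by cosine) to the prediction."""
    return max(label_embeddings, key=lambda name: cosine(predicted, label_embeddings[name]))

# Made-up label embeddings (assumption: produced by some text encoder).
labels = {"cooking": [0.9, 0.1, 0.0], "running": [0.0, 0.8, 0.6]}

# Made-up predicted embedding for one video clip.
print(classify([0.1, 0.7, 0.7], labels))  # running
```

Because the label set is just a dictionary of embeddings, new labels can be added at inference time without retraining, which is the sense of "open-vocabulary" assumed here.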
Details
Motivation: Classical vision-language models (VLMs) rely on autoregressive token generation, which is computationally expensive and spends capacity on surface-level linguistic variation. The paper aims to focus on task-relevant semantics by predicting in an abstract representation space, improving efficiency and generalization.
Method: Uses a Joint Embedding Predictive Architecture (JEPA) to predict continuous embeddings of the target text in a shared embedding space; compares against standard VLM training with the same vision encoder and training data; and adds a lightweight text decoder invoked only when needed, with selective decoding to reduce the number of decoding operations.
Result: VL-JEPA outperforms standard VLM training with 50% fewer trainable parameters. Selective decoding reduces decoding operations by 2.85x. On average it surpasses CLIP, SigLIP2, and Perception Encoder across eight video classification and eight video retrieval datasets, and with only 1.6B parameters it performs comparably to InstructBLIP and QwenVL on four VQA datasets.
Conclusion: By predicting in an abstract embedding space, VL-JEPA separates semantic understanding from language generation, improving efficiency and multi-task adaptability and pointing to a new direction for efficient vision-language modeling.
Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.
[125] AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation
Sharath Girish,Viacheslav Ivanov,Tsai-Shien Chen,Hao Chen,Aliaksandr Siarohin,Sergey Tulyakov
Main category: cs.CV
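TL;DR: AlcheMinT is a unified framework for subject-driven video generation that, for the first time, enables precise control over when multiple subjects appear and disappear in a video.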
Details
Motivation: Existing subject-driven video generation methods lack fine-grained temporal control over subject appearance and disappearance, which limits their use in compositional video synthesis, storyboarding, and controllable animation.
Method: Proposes the AlcheMinT framework, which introduces explicit timestamp conditioning and a new positional encoding that encodes temporal intervals associated with subject identities. Subject-descriptive text tokens are fused through token-wise concatenation to strengthen the binding between visual identity and captions, with no additional cross-attention modules.
Result: Establishes a benchmark covering multi-subject identity preservation, video fidelity, and temporal adherence. Experiments show that AlcheMinT achieves precise temporal control while preserving visual quality.
Conclusion: AlcheMinT delivers high-quality multi-subject video generation with precise temporal control at negligible parameter overhead, advancing personalized video generation.
Abstract: Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multiple subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at https://snap-research.github.io/Video-AlcheMinT
[126] MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation
Henghui Ding,Chang Liu,Shuting He,Kaining Ying,Xudong Jiang,Chen Change Loy,Yu-Gang Jiang
Main category: cs.CV
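TL;DR: This paper presents MeViS, a large-scale multi-modal dataset for segmenting and tracking target objects in videos based on language descriptions of their motion. It emphasizes the role of motion in video and language understanding, benchmarks existing methods on motion-expression-guided tasks, and proposes LMPM++, an improved method that sets a new state of the art.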
Details
Motivation: Existing referring video segmentation datasets mostly focus on salient objects and language rich in static attributes, which underplays the role of motion and limits research on motion reasoning and pixel-level video understanding.
Method: Builds the MeViS dataset with 33,072 human-annotated motion expressions in text and audio, covering 8,171 objects in 2,006 videos of complex scenes; benchmarks 15 existing methods on 4 tasks; and proposes LMPM++ to make better use of motion cues for video understanding.
Result: The benchmark shows clear weaknesses of existing methods in motion-expression-guided video understanding. The proposed LMPM++ achieves state-of-the-art results on RVOS, AVOS, and RMOT.
Conclusion: MeViS provides an important platform for motion-expression-guided video understanding, advances pixel-level video analysis using motion information, and exposes the limits of existing methods in motion reasoning.
Abstract: This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/
[127] Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving
Jiawei Yang,Ziyu Chen,Yurong You,Yan Wang,Yiming Li,Yuxiao Chen,Boyi Li,Boris Ivanovic,Marco Pavone,Yue Wang
Main category: cs.CV
TL;DR: 提出了一种名为Flex的高效场景编码器,用于端到端自动驾驶中多摄像头数据的处理,无需依赖3D先验知识即可实现更高效的推理和更好的驾驶性能。
Details
Motivation: To address the computational bottleneck of processing high-volume multi-camera data, and to challenge the prevailing assumption that 3D priors (such as BEV or occupancy grids) are necessary for effective scene modeling.
Method: Designs Flex, a geometry-agnostic scene encoder that uses a small set of learnable scene tokens to jointly encode image tokens from multiple cameras and timesteps, learning a compact scene representation from data without any explicit 3D structural prior.
Result: Evaluated on 20,000 hours of real driving data, Flex achieves 2.2x greater inference throughput than state-of-the-art methods while substantially improving driving performance. The learned compact scene tokens also develop an emergent ability for scene decomposition without explicit supervision.
Conclusion: Shows that a data-driven joint encoding strategy without 3D priors can yield more efficient and scalable autonomous driving systems, pointing to a new direction for future model design.
Abstract: We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on explicit 3D inductive biases, such as Bird's-Eye-View (BEV), occupancy or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, our Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient and effective path for future autonomous driving systems.
[128] ClusIR: Towards Cluster-Guided All-in-One Image Restoration
Shengkai Hu,Jiaqi Ma,Jun Wan,Wenwen Min,Yongcheng Jing,Lefei Zhang,Dacheng Tao
Main category: cs.CV
TL;DR: 提出ClusIR框架,通过聚类引导机制显式建模退化语义,并在空间和频域传播聚类感知线索,实现对多种退化类型的自适应图像恢复。
Details
Motivation: Existing all-in-one image restoration methods struggle to model degradation types explicitly and adapt poorly to complex or mixed degradations.
Method: Designs the ClusIR framework with a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). PCGRM disentangles degradation recognition from expert activation, and DAFMM uses cluster-guided priors for adaptive frequency decomposition and modulation.
Result: Experiments on multiple benchmarks show that ClusIR reaches competitive performance across a variety of degradation scenarios.
Conclusion: Through cluster-guided synergy, ClusIR combines semantic cues with frequency-domain modulation and improves image restoration under multiple degradations.
Abstract: All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios.
[129] E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training
Qitao Zhao,Hao Tan,Qianqian Wang,Sai Bi,Kai Zhang,Kalyan Sunkavalli,Shubham Tulsiani,Hanwen Jiang
Main category: cs.CV
TL;DR: 本文提出了E-RayZer,一种从无标签多视图图像中自监督学习真正3D感知表示的大规模3D视觉模型,通过显式3D几何重建和细粒度学习课程,在3D下游任务中显著优于现有方法。
Details
Motivation: Self-supervised pre-training has succeeded for language, 2D images, and video, but learning 3D-aware representations from multi-view images remains underexplored, and existing approaches lack geometric consistency and direct 3D modeling.
Method: Proposes E-RayZer, which performs self-supervised 3D reconstruction directly in 3D space with explicit geometry, and introduces a new fine-grained learning curriculum that orders samples from easy to hard and harmonizes heterogeneous data sources without supervision.
Result: E-RayZer significantly outperforms RayZer on tasks such as pose estimation, matches or sometimes surpasses the fully supervised model VGGT, and outperforms leading visual pre-training models such as DINOv3 and CroCo v2 when transferred to 3D downstream tasks.
Conclusion: E-RayZer establishes a new paradigm for 3D-aware visual pre-training, demonstrates the effectiveness of combining explicit 3D reconstruction with self-supervised learning, and offers a new direction for large-scale 3D vision models.
Abstract: Self-supervised pre-training has revolutionized foundation models for languages, individual 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation, and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.
[130] Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration
Sicheng Mo,Thao Nguyen,Richard Zhang,Nick Kolkin,Siddharth Srinivasan Iyer,Eli Shechtman,Krishna Kumar Singh,Yong Jae Lee,Bolei Zhou,Yuheng Li
Main category: cs.CV
TL;DR: 提出Group Diffusion,通过跨图像共享注意力机制实现协同生成,在推理阶段联合去噪,显著提升生成质量。
Details
Motivation: To explore an untapped signal in diffusion model inference and move beyond the convention of generating samples independently.
Method: Introduces Group Diffusion, which opens the attention mechanism to operate across images so that multiple images share information during denoising and learn both intra-image and inter-image correspondence.
Result: Achieves up to 32.2% FID improvement on ImageNet-256x256. Larger group sizes yield better generation quality, and the strength of cross-sample attention correlates closely with FID.
Conclusion: Cross-sample inference is an effective, previously unexplored mechanism for generative modeling and opens a new direction for diffusion model inference.
Abstract: In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra and inter-image correspondence. We observe a clear scaling effect: larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.
[131] Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
Tsai-Shien Chen,Aliaksandr Siarohin,Guocheng Gordon Qian,Kuan-Chieh Jackson Wang,Egor Nemchinov,Moayed Haji-Ali,Riza Alp Guler,Willi Menapace,Ivan Skorokhodov,Anil Kag,Jun-Yan Zhu,Sergey Tulyakov
Main category: cs.CV
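TL;DR: This paper proposes Omni-Attribute, the first open-vocabulary image attribute encoder, for high-fidelity, attribute-specific visual concept personalization; by constructing semantically linked image pairs and using a dual-objective training strategy, it achieves state-of-the-art performance on multiple benchmarks.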
Details
Motivation: Existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors, make it hard to isolate a single attribute, and tend to cause information leakage and incoherent synthesis. Method: Designs the data and the model jointly: builds semantically linked image pairs annotated with positive and negative attributes, and adopts a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. Result: The method performs well on open-vocabulary attribute retrieval, personalization, and compositional generation, reaching state-of-the-art performance on multiple benchmarks. Conclusion: Omni-Attribute effectively disentangles visual attributes, enabling precise attribute control and transfer, and provides a finer-grained, more flexible solution for visual concept personalization. Abstract: Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
[132] Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision
Wentao Zhou,Xuweiyi Chen,Vignesh Rajagopal,Jeffrey Chen,Rohan Chandra,Zezhou Cheng
Main category: cs.CV
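TL;DR: This paper proposes StereoWalker, which augments robot navigation foundation models with stereo vision and explicit mid-level vision (such as depth estimation and pixel tracking), significantly improving navigation in dynamic, unstructured environments and reaching state-of-the-art performance with only 1.5% of the training data.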
Details
Motivation: The depth-scale ambiguity of monocular vision limits a robot's spatial reasoning in dynamic, complex environments, and existing end-to-end navigation models depend on large amounts of supervised data, which is inefficient. Method: Proposes StereoWalker, which uses stereo input to resolve the depth-scale ambiguity and incorporates mid-level vision priors (such as depth estimation and dense pixel tracking); builds a large stereo navigation dataset with automatic action annotation for training. Result: Experiments show that StereoWalker reaches state-of-the-art performance with only 1.5% of the training data, surpasses existing methods with the full data, and that stereo input outperforms monocular input. Conclusion: Explicitly adding mid-level vision and stereo input improves both the data efficiency and the performance of navigation foundation models, indicating that relying entirely on implicit visual learning is not the best path. Abstract: The success of foundation models in language and vision motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
[133] SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
Yukai Shi,Weiyu Li,Zihao Wang,Hongyang Li,Xingyu Chen,Ping Tan,Lei Zhang
Main category: cs.CV
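TL;DR: This paper proposes SceneMaker, a decoupled 3D scene generation framework that separates the de-occlusion model from 3D object generation and introduces a unified pose estimation model, improving geometric quality and pose accuracy under severe occlusion and in open-set settings.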
Details
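Motivation: Lacking sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to produce both high-quality geometry and accurate poses under severe occlusion and in open-set scenes. Method: 1. Decouples the de-occlusion model from 3D object generation and strengthens its handling of diverse open-set occlusion patterns using image datasets and collected de-occlusion datasets; 2. Proposes a unified pose estimation model that integrates global and local mechanisms to improve both self-attention and cross-attention; 3. Constructs an open-set 3D scene dataset to improve the generalization of the pose estimation model. Result: Experiments show that the decoupled framework outperforms existing methods on both indoor and open-set scenes, significantly improving generation quality and pose estimation accuracy. Conclusion: Through its decoupled design and unified pose estimation, SceneMaker achieves better 3D scene generation under complex occlusion and open-set conditions, with good generalization and application potential. Abstract: We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.
[134] WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World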
Ao Liang,Lingdong Kong,Tianyi Yan,Hongsi Liu,Wesley Yang,Ziqi Huang,Wei Yin,Jialong Zuo,Yixuan Hu,Dekai Zhu,Dongyue Lu,Youquan Liu,Guangfeng Jiang,Linfeng Li,Xiangtai Li,Long Zhuo,Lai Xing Ng,Benoit R. Cottereau,Changxin Gao,Liang Pan,Wei Tsang Ooi,Ziwei Liu
Main category: cs.CV
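TL;DR: WorldLens is a unified benchmark for evaluating generative world models on visual realism, geometric consistency, physical plausibility, and functional reliability; it is paired with WorldLens-26K, a large-scale human-annotated dataset, and WorldLens-Agent, a scalable evaluation agent, to support comprehensive evaluation of world models in embodied AI.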
Details
Motivation: Existing generative world models can produce visually realistic driving environments but fall short on obeying physical laws, behavioral consistency, and functional reliability, and the field lacks a unified evaluation standard. Method: Proposes the WorldLens benchmark, covering five dimensions: generation, reconstruction, action-following, downstream task, and human preference; builds the WorldLens-26K dataset of 26,000 human-annotated videos and trains WorldLens-Agent for automated, explainable scoring. Result: Current models show a trade-off between texture quality and physical plausibility, and no model performs well on every dimension; WorldLens-Agent effectively aligns objective metrics with human judgment. Conclusion: WorldLens provides a standardized ecosystem that pushes future world models to be reliable in behavior and function, not just realistic in appearance. Abstract: Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.
[135] StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
Tjark Behrens,Anton Obukhov,Bingxin Ke,Fabio Tosi,Matteo Poggi,Konrad Schindler
Main category: cs.CV
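TL;DR: StereoSpace is a diffusion-based monocular-to-stereo synthesis framework that models geometry purely through viewpoint conditioning, with no explicit depth or warping, achieving end-to-end stereo generation without ground-truth geometry.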