cs.CL

[1] What Kind of Reasoning (if any) is an LLM actually doing? On the Stochastic Nature and Abductive Appearance of Large Language Models

Luciano Floridi,Jessica Morley,Claudio Novelli,David Watson

Main category: cs.CL

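TL;DR: This article examines the role that current token-completion large language models (LLMs) play in reasoning. It argues that their outputs, although seemingly abductive, stem from patterns learned from human text rather than from genuine reasoning. The models are stochastic at base and lack semantic understanding and truth verification, so their results must be assessed with care.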

Details Motivation: To expose the mechanism behind the human-like abductive behavior of LLMs and to clarify that they do not actually reason but imitate structures in their training data, so that their capabilities and limits are understood correctly. Method: Analyzes the generation mechanism and training basis of LLMs and uses concrete examples to compare their outputs with human abductive reasoning, arguing that the surface plausibility rests on a process that is neither semantic nor verified. Result: LLMs can generate plausible-looking explanations and commonsense reasoning, but these outputs lack genuine semantic grounding and logical verification; the apparent "reasoning" results from statistical pattern matching rather than from understanding or inference. Conclusion: LLMs can help humans generate ideas and extend their thinking, but because they cannot identify truth or verify their own outputs, their results must be critically reviewed. The article also responds to five possible objections, notes limitations of the analysis, and offers recommendations for application. Abstract: This article looks at how reasoning works in current Large Language Models (LLMs) that function using the token-completion method. It examines their stochastic nature and their similarity to human abductive reasoning. The argument is that these LLMs create text based on learned patterns rather than performing actual abductive reasoning. When their output seems abductive, this is largely because they are trained on human-generated texts that include reasoning structures. Examples are used to show how LLMs can produce plausible ideas, mimic commonsense reasoning, and give explanatory answers without being grounded in truth, semantics, verification, or understanding, and without performing any real abductive reasoning. This dual nature, where the models have a stochastic base but appear abductive in use, has important consequences for how LLMs are evaluated and applied. They can assist with generating ideas and supporting human thinking, but their outputs must be critically assessed because they cannot identify truth or verify their explanations. The article concludes by addressing five objections to these points, noting some limitations in the analysis, and offering an overall evaluation.

[2] Generate-Then-Validate: A Novel Question Generation Approach Using Small Language Models

Yumou Wei,John Stamper,Paulo F. Carvalho

Main category: cs.CL

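TL;DR: Proposes a new "generate-then-validate" question generation pipeline based on small language models (SLMs) that combines text generation with probabilistic reasoning to improve question quality. Evaluations by human experts and by an LLM both indicate that the generated questions align well with learning objectives and are of high quality.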

Details Motivation: To explore the potential of small language models (SLMs) for automatic question generation in learning analytics, as an efficient complement to the prevailing large-model approaches. Method: Adopts a "generate-then-validate" strategy: the pipeline first generates a large pool of candidate questions expansively, then selectively validates and filters them with a novel probabilistic reasoning mechanism, using both the generation and the reasoning abilities of the SLM. Result: In two evaluation studies, one with seven human experts and one with a large language model, most judges found that the generated questions had clear answers and aligned well with the learning objectives. Conclusion: A well-designed pipeline lets an SLM generate high-quality questions effectively, suggesting that SLMs can complement large models in educational applications. Abstract: We explore the use of small language models (SLMs) for automatic question generation as a complement to the prevalent use of their large counterparts in learning analytics research. We present a novel question generation pipeline that leverages both the text generation and the probabilistic reasoning abilities of SLMs to generate high-quality questions. Adopting a "generate-then-validate" strategy, our pipeline first performs expansive generation to create an abundance of candidate questions and then refines them through selective validation based on novel probabilistic reasoning. We conducted two evaluation studies, one with seven human experts and the other with a large language model (LLM), to assess the quality of the generated questions. Most judges (humans or LLMs) agreed that the generated questions had clear answers and generally aligned well with the intended learning objectives. Our findings suggest that an SLM can effectively generate high-quality questions when guided by a well-designed pipeline that leverages its strengths.
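
The generate-then-validate control flow can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not specify the generator, the probabilistic validation score, or any threshold, so `generate`, `score`, `threshold`, and the toy stand-ins below are all hypothetical.

```python
from typing import Callable, List, Tuple


def generate_then_validate(
    generate: Callable[[str, int], List[str]],  # hypothetical: SLM call producing candidates
    score: Callable[[str, str], float],         # hypothetical: probabilistic validation score
    objective: str,
    n_candidates: int = 20,
    threshold: float = 0.5,                     # hypothetical cut-off, not from the paper
) -> List[Tuple[str, float]]:
    """Over-generate candidates, then keep only those that pass validation."""
    candidates = generate(objective, n_candidates)            # expansive generation
    scored = [(q, score(q, objective)) for q in candidates]   # selective validation
    kept = [(q, s) for q, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins so the sketch runs without any model.
def toy_generate(objective: str, n: int) -> List[str]:
    return [f"Question {i} about {objective}?" for i in range(n)]


def toy_score(question: str, objective: str) -> float:
    index = int(question.split()[1])
    return 1.0 / (1 + index)  # earlier candidates score higher in this toy


selected = generate_then_validate(
    toy_generate, toy_score, "photosynthesis", n_candidates=5, threshold=0.3
)
```

The design point the sketch tries to capture is that generation is deliberately cheap and broad, while quality control is deferred to a separate scoring step.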

[3] Workflow is All You Need: Escaping the "Statistical Smoothing Trap" via High-Entropy Information Foraging and Adversarial Pacing

Zhongjie Jiang

Main category: cs.CL

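TL;DR: Proposes the DeepNews framework, which simulates the cognitive process of seasoned financial journalists to address the "impossible trinity" of long-form generation with large models (low hallucination, logical coherence, personalized expression). It uses dual-granularity retrieval, schema-guided planning, and adversarial constraint prompting, and markedly raises the submission acceptance rate in a test with a real media outlet.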

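Details Motivation: Current large language models struggle to achieve low hallucination, deep logical coherence, and personalized expression at the same time in long-form generation for vertical domains, mainly because conventional generation paradigms fall into the Statistical Smoothing Trap and overlook the high-entropy information acquisition and structured cognitive processes of expert writing. Method: Proposes the DeepNews framework: 1) a dual-granularity retrieval mechanism based on information foraging theory that maintains a 10:1 saturated information input ratio; 2) strategic planning that combines domain knowledge bases (narrative schemas) with Atomic Blocks; 3) adversarial constraint prompting with tactics such as Rhythm Break and Logic Fog to break the smoothness of model-generated text. Result: Experiments reveal a "Knowledge Cliff": content truthfulness drops sharply when the retrieved context falls below 15,000 characters, while above 30,000 characters the Hallucination-Free Rate (HFR) stabilizes above 85%. In a blind test with a top-tier Chinese technology media outlet, the DeepNews system built on a previous-generation model (DeepSeek-V3-0324) reached a 25% submission acceptance rate, far above the 0% of zero-shot generation by the SOTA model GPT-5. Conclusion: Explicitly modeling the cognitive workflow of expert writing can break through the "impossible trinity" of professional long-form generation and offers a verifiable new paradigm for high-quality content generation in vertical domains. Abstract: Central to long-form text generation in vertical domains is the "impossible trinity" confronting current large language models (LLMs): the simultaneous achievement of low hallucination, deep logical coherence, and personalized expression. This study establishes that this bottleneck arises from existing generative paradigms succumbing to the Statistical Smoothing Trap, a phenomenon that overlooks the high-entropy information acquisition and structured cognitive processes integral to expert-level writing. To address this limitation, we propose the DeepNews Framework, an agentic workflow that explicitly models the implicit cognitive processes of seasoned financial journalists. The framework integrates three core modules: first, a dual-granularity retrieval mechanism grounded in information foraging theory, which enforces a 10:1 saturated information input ratio to mitigate hallucinatory outputs; second, schema-guided strategic planning, a process leveraging domain expert knowledge bases (narrative schemas) and Atomic Blocks to forge a robust logical skeleton; third, adversarial constraint prompting, a technique deploying tactics including Rhythm Break and Logic Fog to disrupt the probabilistic smoothness inherent in model-generated text.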
Experiments delineate a salient Knowledge Cliff in deep financial reporting: content truthfulness collapses when retrieved context falls below 15,000 characters, while a high-redundancy input exceeding 30,000 characters stabilizes the Hallucination-Free Rate (HFR) above 85%. In an ecological validity blind test conducted with a top-tier Chinese technology media outlet, the DeepNews system--built on a previous-generation model (DeepSeek-V3-0324)--achieved a 25% submission acceptance rate, significantly outperforming the 0% acceptance rate of zero-shot generation by a state-of-the-art (SOTA) model (GPT-5).

[4] PARAN: Persona-Augmented Review ANswering system on Food Delivery Review Dataset

Moonsoo Park,Jeongseok Yun,Bohyung Kim

Main category: cs.CL

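TL;DR: Proposes a two-stage prompting framework that infers explicit and implicit user personas from short reviews to improve personalized response generation, raising relevance and engagement.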

Details Motivation: In settings with limited user information, such as food delivery platforms, large language models often produce generic, unpersonalized replies, which hurts the interaction. Method: Designs a two-stage prompting framework that first infers the user's explicit and implicit persona attributes from the review text, then incorporates those attributes into the generation prompt, and adjusts the decoding temperature to increase diversity. Result: The method is evaluated on a real-world dataset from a Korean food delivery app, with results indicating gains in precision, diversity, and semantic consistency. Conclusion: Persona-augmented prompting can markedly improve the personalization and relevance of automated replies without fine-tuning the model. Abstract: Personalized review response generation presents a significant challenge in domains where user information is limited, such as food delivery platforms. While large language models (LLMs) offer powerful text generation capabilities, they often produce generic responses when lacking contextual user data, reducing engagement and effectiveness. In this work, we propose a two-stage prompting framework that infers both explicit (e.g., user-stated preferences) and implicit (e.g., demographic or stylistic cues) personas directly from short review texts. These inferred persona attributes are then incorporated into the response generation prompt to produce user-tailored replies. To encourage diverse yet faithful generations, we adjust decoding temperature during inference. We evaluate our method using a real-world dataset collected from a Korean food delivery app, and assess its impact on precision, diversity, and semantic consistency. Our findings highlight the effectiveness of persona-augmented prompting in enhancing the relevance and personalization of automated responses without requiring model fine-tuning.

[5] Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

Lama Alssum,Hani Itani,Hasan Abed Al Kader Hammoud,Philip Torr,Adel Bibi,Bernard Ghanem

Main category: cs.CL

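TL;DR: This study treats the safety degradation that occurs when fine-tuning large language models as a continual learning problem. It adapts several continual learning methods, systematically evaluates how well they preserve safety, and finds that DER lowers the attack success rate while maintaining task performance.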

Details Motivation: As large language models become widely available, the safety degradation caused by catastrophic forgetting during task adaptation is increasingly prominent, and effective methods are needed to retain the original safety alignment during fine-tuning. Method: Frames safety preservation as a continual learning (CL) problem and evaluates regularization-based, memory-based, and model merging CL methods under two scenarios: benign user data and poisoned user data. Result: CL methods achieve significantly lower attack success rates than standard fine-tuning, with DER performing best; the results are consistent across three downstream tasks and three model families. Conclusion: Continual learning is a practical way to preserve LLM safety in fine-tuning services, and DER in particular offers the best balance between safety and task performance. Abstract: The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.
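
For readers unfamiliar with DER, the sketch below assumes it refers to Dark Experience Replay from the continual learning literature, a memory-based method that adds a penalty on the drift between the model's current logits and logits stored in a replay buffer. The abstract does not say how the method is adapted to safety data, so the buffer contents, the weight `alpha`, and all function names here are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    return -math.log(softmax(logits)[label])


def der_loss(
    task_logits: List[float],
    task_label: int,
    replay_logits_now: List[float],     # current model output on a buffered example
    replay_logits_stored: List[float],  # logits recorded earlier for the same example
    alpha: float = 0.5,                 # hypothetical replay weight
) -> float:
    """Task loss plus a squared-error penalty on logit drift for replayed data."""
    task_term = cross_entropy(task_logits, task_label)
    drift = sum((a - b) ** 2 for a, b in zip(replay_logits_now, replay_logits_stored))
    return task_term + alpha * drift


no_drift = der_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 2.0])
with_drift = der_loss([0.0, 0.0], 0, [1.0, 3.0], [1.0, 2.0])
```

In a safety setting, one plausible reading is that the buffer would hold examples on which the aligned model behaved safely, so the penalty discourages fine-tuning from moving away from that behavior.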

[6] AutoMedic: An Automated Evaluation Framework for Clinical Conversational Agents with Medical Dataset Grounding

Gyutaek Oh,Sangjoon Park,Byung-Hoon Kim

Main category: cs.CL

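TL;DR: This paper proposes AutoMedic, a multi-agent simulation framework for automated evaluation of large language models (LLMs) as clinical conversational agents. It converts static QA datasets into virtual patient profiles and performs multi-faceted evaluation based on the CARE metric.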

Details Motivation: Existing medical QA benchmarks focus mainly on static scenarios, so they cannot fully assess how LLMs perform in dynamic, interactive, multi-turn clinical conversations, and they lack multi-faceted evaluation beyond accuracy. Method: Builds AutoMedic, a multi-agent simulation framework that turns off-the-shelf static medical QA datasets into virtual patient profiles, drives realistic and clinically grounded multi-turn dialogues between LLM agents, and quantifies agent performance with the CARE metric (accuracy, efficiency/strategy, empathy, and robustness). Result: AutoMedic generates dialogue trajectories close to real clinical interactions, and its assessments are validated by human experts, supporting the framework's validity for automated evaluation of clinical conversational agents. Conclusion: AutoMedic offers a reliable, scalable automated platform for evaluating LLMs in medicine, supports the development of safer and more trustworthy clinical dialogue systems, and proposes a multi-faceted evaluation paradigm for complex interactive scenarios. Abstract: Evaluating large language models (LLMs) has recently emerged as a critical issue for safe and trustworthy application of LLMs in the medical domain. Although a variety of static medical question-answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as the effectiveness of LLMs in generating responses in dynamic, interactive clinical multi-turn conversation situations and the identification of multi-faceted evaluation strategies beyond simple accuracy. However, formally evaluating a dynamic, interactive clinical situation is hindered by its vast combinatorial space of possible patient states and interaction trajectories, making it difficult to standardize and quantitatively measure such scenarios. Here, we introduce AutoMedic, a multi-agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic transforms off-the-shelf static QA datasets into virtual patient profiles, enabling realistic and clinically grounded multi-turn clinical dialogues between LLM agents. The performance of various clinical conversational agents is then assessed based on our CARE metric, which provides a multi-faceted evaluation standard of clinical conversational accuracy, efficiency/strategy, empathy, and robustness. Our findings, validated by human experts, demonstrate the validity of AutoMedic as an automated evaluation framework for clinical conversational agents, offering practical guidelines for the effective development of LLMs in conversational medical applications.

[7] Multilingual VLM Training: Adapting an English-Trained VLM to French

Jules Lahmi,Alexis Roger

Main category: cs.CL

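TL;DR: This paper explores how to adapt an English-trained vision-language model (VLM) to other languages. It compares a translation-based pipeline, LoRA fine-tuning, and a two-stage fine-tuning strategy, and identifies dataset translation quality as the main bottleneck for multilingual VLM performance.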

Details Motivation: Current vision-language models are largely limited to English, which restricts their use by non-English speakers, so effective ways of extending them to more languages are needed. Method: Compares three approaches: a translation-based pipeline, LoRA fine-tuning, and a two-stage fine-tuning strategy that separates vision adaptation from language adaptation. Evaluation uses translated multimodal benchmarks and manual assessment by native-speaker experts. Result: Dataset translation quality severely constrains both model performance and evaluation, and the effectiveness of the different fine-tuning methods is limited by the quality of the translated data. Conclusion: Future work should focus on building high-quality native-language multimodal datasets and on better translation strategies to improve multilingual vision-language models. Abstract: Artificial intelligence has made great progress in recent years, particularly in the development of Vision-Language Models (VLMs) that understand both visual and textual data. However, these advancements remain largely limited to English, reducing their accessibility for non-English speakers. It is essential to extend these capabilities to a broader range of languages. This paper explores the challenges of adapting an English-trained VLM to different languages. To this end, we will explore and compare different methods for their performance and computational cost. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we use a combination of standard multimodal benchmarks translated into the target language and manual assessments by native experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.
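
As background on the LoRA option, the sketch below shows the standard low-rank update, in which a frozen weight matrix W is used as W + (alpha / r) * B A and only the small factors A and B are trained. It is a generic illustration with hypothetical toy matrices, not the configuration used in the paper, which the abstract does not give.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A, with B of shape (d_out, r) and A of shape (r, d_in)."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy values)
a = [[0.0, 2.0]]              # rank-1 factor A, shape (1, 2)
b_zero = [[0.0], [0.0]]       # B starts at zero, so the update starts at zero
b_trained = [[1.0], [0.0]]    # hypothetical values after some training

at_init = lora_effective_weight(w, a, b_zero, alpha=2.0, rank=1)
after_training = lora_effective_weight(w, a, b_trained, alpha=2.0, rank=1)
```

Because B is initialized to zero, the adapted model starts out identical to the English-trained one, which is part of what makes the method a low-cost candidate for language adaptation.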

[8] Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Zhaodong Wang,Zhenting Qi,Sherman Wong,Nathan Hu,Samuel Lin,Jun Ge,Erwin Gao,Yining Yang,Ben Maurer,Wenlin Chen,David Recordon,Yilun Du,Minlan Yu,Ying Zhang

Main category: cs.CL

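TL;DR: This paper presents the Confucius Code Agent (CCA), an open-source AI software engineer that can operate at industrial scale. Built on the Confucius SDK, it supports long-context reasoning, cross-session continual learning, and modular tool use, and reaches a state-of-the-art 54.3% on SWE-Bench-Pro.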

Details Motivation: Existing open-source coding agents fall short on industrial-scale tasks, while proprietary systems perform well but offer limited extensibility and controllability, so a solution that combines high performance with openness is needed. Method: Proposes the Confucius SDK, designed around Agent Experience (AX), User Experience (UX), and Developer Experience (DX). It introduces hierarchical working memory, a persistent note-taking system, and a modular extension mechanism, and uses a meta-agent to automate a build-test-improve loop over agent configurations. Result: CCA achieves a Resolve@1 of 54.3% on SWE-Bench-Pro, substantially improving over prior coding agents. Conclusion: The Confucius SDK and CCA provide a transparent, extensible, and reproducible industrial-grade foundation for AI agents, bridging the gap between research prototypes and production systems. Abstract: Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents.
Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.

[9] Sliding Window Attention Adaptation

Yijiong Yu,Jiale Liu,Qingyun Wu,Huazheng Wang,Ji Pei

Main category: cs.CL

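TL;DR: Proposes Sliding Window Attention Adaptation (SWAA), a method that lets large language models pretrained with full attention adapt effectively to sliding window attention at inference time, reducing the cost of long-context inference.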

Details Motivation: Full attention is computationally expensive for long-context inference, and applying sliding window attention directly to a pretrained model causes severe performance degradation, so a method that adapts effectively without re-pretraining is needed. Method: SWAA combines five strategies: applying sliding window attention only during prefill, preserving "sink" tokens, interleaving full attention and sliding window attention layers, chain-of-thought prompting, and fine-tuning. Result: Experiments show that no single method suffices, but specific combinations effectively recover the original long-context performance and give a good efficiency-performance trade-off. Conclusion: SWAA offers a feasible and practical way for full-attention pretrained models to adapt to sliding window attention without re-pretraining. Abstract: The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference-time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This makes us wonder: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible while non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
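
To make recipe items (1) and (2) concrete, the sketch below builds a causal attention mask in which each query sees only the last `window` keys plus a few always-visible "sink" tokens at the start of the sequence. It is a generic illustration and not taken from the authors' repository; the function name and parameters are hypothetical.

```python
from typing import List


def swa_mask(seq_len: int, window: int, n_sink: int = 0) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q             # never attend to future tokens
            in_window = q - k < window  # key is among the most recent `window` tokens
            is_sink = k < n_sink        # the first n_sink tokens stay visible
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


mask = swa_mask(seq_len=6, window=2, n_sink=1)
last_row = mask[5]
```

Each row allows at most `window + n_sink` keys, which is why the attention cost grows linearly rather than quadratically with sequence length.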

[10] Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers

Youmin Ko,Sungjong Seo,Hyunjoon Kim

Main category: cs.CL

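TL;DR: Proposes CoopRAG, a framework in which the retriever and the large language model work cooperatively to improve retrieval and answer accuracy on simple and multi-hop question answering tasks.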

Details Motivation: Existing RAG methods for question answering still suffer from incorrect retrievals and hallucinations, so retrieval accuracy and reasoning coherence need to improve. Method: Unrolls the question into sub-questions and a masked reasoning chain, augments retrieval with both, reranks documents by contrasting layers of the retriever, and has the LLM fill the masked positions to reconstruct the reasoning chain. Result: On several multi-hop QA datasets and a simple QA dataset, CoopRAG outperforms state-of-the-art methods in both retrieval and QA performance. Conclusion: Through cooperation between the retriever and the LLM, and between the retriever's internal layers, CoopRAG improves the accuracy and robustness of QA systems. Abstract: Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available at https://github.com/meaningful96/CoopRAG

[11] T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

Dmitrii Stoianov,Danil Taranets,Olga Tsymboi,Ramil Latypov,Almaz Dautov,Vladislav Kruglikov,Nikita Surkov,German Abramov,Pavel Gein,Dmitry Abulkhanov,Mikhail Gashkov,Viktor Zelenkovskiy,Artem Batalov,Aleksandr Medvedev,Anatolii Potapov

Main category: cs.CL

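TL;DR: T-pro 2.0 is an open-weight Russian large language model that supports hybrid reasoning and efficient inference. It uses a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency, and releases the model weights, an instruction corpus, a math reasoning benchmark, and the EAGLE weights to support reproducible and extensible research.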

Details Motivation: To advance reproducible and extensible research on Russian large models and to fill the gap in efficient inference and open models for Russian. Method: Uses a Cyrillic-dense tokenizer to improve Russian processing and an adapted EAGLE speculative-decoding pipeline to speed up inference, supporting both direct answering and reasoning-trace generation. Result: Achieves lower inference latency and releases a full set of resources, including the model weights, the T-Wix 500k instruction dataset, the T-Math reasoning benchmark, and the EAGLE weights, along with a public web demo that shows the reasoning modes and the speedups. Conclusion: As an open and efficient Russian LLM system, T-pro 2.0 provides a solid basis for building and evaluating practical Russian-language AI applications. Abstract: We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.

[12] Semantic Reconstruction of Adversarial Plagiarism: A Context-Aware Framework for Detecting and Restoring "Tortured Phrases" in Scientific Literature

Agniva Maiti,Prajwal Panth,Suresh Chandra Satapathy

Main category: cs.CL

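TL;DR: This paper proposes SRAP, a framework for detecting and restoring plagiarized content in scientific literature that has been hidden with adversarial paraphrasing tools. It combines a domain-specific language model with semantic retrieval to identify "tortured phrases" and match them to source documents.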

Details Motivation: Existing plagiarism detection methods perform poorly on "tortured phrases" produced by automated paraphrasing tools: they miss novel obfuscations and cannot trace the source, so a robust method that both detects and restores the original terminology is needed. Method: A two-stage architecture: the first stage performs statistical anomaly detection using token-level pseudo-perplexity from SciBERT; the second stage uses FAISS for dense vector retrieval together with SBERT sentence-level alignment to reconstruct the original meaning. Result: Zero-shot baselines fail completely (0.00% restoration accuracy), while SRAP reaches 23.67% restoration accuracy, significantly outperforming the baselines. Static decision boundaries are also found to be more stable than dynamic thresholding in jargon-heavy text. Conclusion: SRAP can detect and partially restore tortured expressions in adversarial plagiarism and supports source tracing, offering a new technical path for protecting the integrity of scientific literature. Abstract: The integrity and reliability of scientific literature is facing a serious threat by adversarial text generation techniques, specifically from the use of automated paraphrasing tools to mask plagiarism. These tools generate "tortured phrases", statistically improbable synonyms (e.g. "counterfeit consciousness" for "artificial intelligence"), that preserve the local grammar while obscuring the original source. Most existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates for novel obfuscations and cannot determine the source of the plagiarized content. In this paper, we propose Semantic Reconstruction of Adversarial Plagiarism (SRAP), a framework designed not only to detect these anomalies but to mathematically recover the original terminology. We use a two-stage architecture: (1) statistical anomaly detection with a domain-specific masked language model (SciBERT) using token-level pseudo-perplexity, and (2) source-based semantic reconstruction using dense vector retrieval (FAISS) and sentence-level alignment (SBERT). Experiments on a parallel corpus of adversarial scientific text show that while zero-shot baselines fail completely (0.00 percent restoration accuracy), our retrieval-augmented approach achieves 23.67 percent restoration accuracy, significantly outperforming baseline methods. We also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, since dynamic thresholding fails under high variance.
SRAP enables forensic analysis by linking obfuscated expressions back to their most probable source documents.
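
For reference, the usual definition of pseudo-perplexity for a masked language model is written out below. The abstract does not give the paper's exact formulation, so the per-token score $s_t$ and the static threshold $\tau$ are an assumed reading of "token-level pseudo-perplexity" with a "static decision boundary", not the authors' stated equations.

```latex
% Per-token surprisal: mask token w_t and score it given the rest of the sentence W.
s_t = -\log P_{\mathrm{MLM}}\left(w_t \mid W_{\setminus t}\right)

% Sentence-level pseudo-perplexity over N tokens.
\mathrm{PPPL}(W) = \exp\left(\frac{1}{N}\sum_{t=1}^{N} s_t\right)

% Assumed static decision rule: flag token t as a candidate tortured phrase when
s_t > \tau
```

Under this reading, a phrase such as "counterfeit consciousness" would be flagged because a science-domain model assigns its tokens low probability in context.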

[13] Enhancing Next-Generation Language Models with Knowledge Graphs: Extending Claude, Mistral IA, and GPT-4 via KG-BERT

Nour El Houda Ben Chaabene,Hamza Hammami

Main category: cs.CL

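TL;DR: This paper proposes combining knowledge graphs (KGs) with large language models such as GPT-4, using KG-BERT to strengthen reasoning and factual consistency, and reports significant gains on knowledge-intensive tasks such as question answering and entity linking.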

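Details Motivation: Large language models (LLMs) excel at natural language processing but lack structured knowledge and are prone to factual errors. The study aims to improve their knowledge grounding by introducing knowledge graphs. Method: Integrates knowledge graphs into large language models via KG-BERT, using the structured information in the KG to support reasoning and generation on knowledge-intensive tasks. Result: Experiments show significant gains on tasks such as question answering and entity linking, with improved factual accuracy and context awareness. Conclusion: Combining knowledge graphs with large language models is an effective way to improve factual reliability and reasoning, and a viable direction for more trustworthy next-generation LLMs. Abstract: Large language models (LLMs) like Claude, Mistral IA, and GPT-4 excel in NLP but lack structured knowledge, leading to factual inconsistencies. We address this by integrating Knowledge Graphs (KGs) via KG-BERT to enhance grounding and reasoning. Experiments show significant gains in knowledge-intensive tasks such as question answering and entity linking. This approach improves factual reliability and enables more context-aware next-generation LLMs.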

[14] Decoding Student Minds: Leveraging Conversational Agents for Psychological and Learning Analysis

Nour El Houda Ben Chaabene,Hamza Hammami,Laid Kahloul

Main category: cs.CL

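TL;DR: This paper proposes a psychologically aware conversational system that combines large language models with multimodal data to identify students' cognitive and affective states in real time, with the aim of improving learning performance and emotional well-being.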

Details Motivation: Traditional educational chatbots usually focus on either tutoring or affective support and lack an integrated understanding of students' psychological state, which limits personalized intervention. Method: Combines large language models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional LSTM with attention, fusing textual semantics, prosodic speech features, and temporal behavioral patterns to classify students' cognitive and affective states in real time. Result: A pilot study with university students showed improved motivation, reduced stress, and moderate academic gains compared to baseline methods. Conclusion: Conversational systems that integrate semantic reasoning, multimodal information, and temporal modeling are promising for more adaptive, student-centered educational support. Abstract: This paper presents a psychologically-aware conversational agent designed to enhance both learning performance and emotional well-being in educational settings. The system combines Large Language Models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional Long Short-Term Memory (LSTM) with attention to classify students' cognitive and affective states in real time. Unlike prior chatbots limited to either tutoring or affective support, our approach leverages multimodal data, including textual semantics, prosodic speech features, and temporal behavioral trends, to infer engagement, stress, and conceptual understanding. A pilot study with university students demonstrated improved motivation, reduced stress, and moderate academic gains compared to baseline methods. These results underline the promise of integrating semantic reasoning, multimodal fusion, and temporal modeling to support adaptive, student-centered educational interventions.

[15] Grammaticality Judgments in Humans and Language Models: Revisiting Generative Grammar with LLMs

Lars G. B. Johnsen

Main category: cs.CL

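TL;DR: This paper asks whether large language models (LLMs), trained only on surface forms, nevertheless show sensitivity to syntactic structure. It finds that LLMs reliably distinguish grammatical from ungrammatical sentences, suggesting an implicit ability to recognize hierarchical structure.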

Details Motivation: To examine whether LLMs, without explicit syntactic knowledge, reproduce the systematic grammaticality contrasts that traditional generative grammar treats as evidence for syntactic structure. Method: Prompts models including GPT-4 and LLaMA-3 for acceptability ratings on two classic constructions, subject-auxiliary inversion and parasitic gap licensing, and analyzes the pattern of judgments. Result: LLMs reliably distinguish grammatical from ungrammatical variants in both constructions, showing recognition of the subject boundary and sensitivity to abstract dependency structure. Conclusion: Although trained only to predict surface forms, LLMs develop a functional sensitivity to syntactic structure, indicating that structural generalization can emerge without explicit encoding. Abstract: What counts as evidence for syntactic structure? In traditional generative grammar, systematic contrasts in grammaticality such as subject-auxiliary inversion and the licensing of parasitic gaps are taken as evidence for an internal, hierarchical grammar. In this paper, we test whether large language models (LLMs), trained only on surface forms, reproduce these contrasts in ways that imply an underlying structural representation. We focus on two classic constructions: subject-auxiliary inversion (testing recognition of the subject boundary) and parasitic gap licensing (testing abstract dependency structure). We evaluate models including GPT-4 and LLaMA-3 using prompts eliciting acceptability ratings. Results show that LLMs reliably distinguish between grammatical and ungrammatical variants in both constructions, and as such support that they are sensitive to structure and not just linear order. Structural generalizations, distinct from cognitive knowledge, emerge from predictive training on surface forms, suggesting functional sensitivity to syntax without explicit encoding.

[16] XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs

Iñaki Lacunza,José Javier Saiz,Alexander Shvets,Aitor Gonzalez-Agirre,Marta Villegas

Main category: cs.CL

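TL;DR: Proposes XDoGE, a multilingual extension of the DoGE algorithm that optimizes the language distribution to improve performance on mid- and low-resource languages, and releases IberianLLM-7B-Instruct, a model trained with this method.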

Details Motivation: Existing large language models rely too heavily on high-resource languages such as English, which limits their performance on mid- and low-resource languages and hurts multilingual fairness and usefulness. Method: Trains a small proxy model and extends the DoGE algorithm to XDoGE for the multilingual setting in order to optimize language weights; the data proportions are then rescaled with these weights to train a full-size model, either from scratch or through continual pre-training (CPT). Result: Experiments cover six languages (English, Spanish, Portuguese, Catalan, Galician, Basque) and use the IberoBench framework to study the effects of repeating data for low-resource languages and under-sampling high-resource languages. Conclusion: XDoGE can improve performance on mid- and low-resource languages; the released IberianLLM-7B-Instruct model is reported as promising on the target languages, underlining the importance of optimizing the language distribution. Abstract: Current large language models (LLMs) are trained on massive amounts of text data, primarily from a few dominant languages. Studies suggest that this over-reliance on high-resource languages, such as English, hampers LLM performance in mid- and low-resource languages. To mitigate this problem, we propose to (i) optimize the language distribution by training a small proxy model within a domain-reweighing DoGE algorithm that we extend to XDoGE for a multilingual setup, and (ii) rescale the data and train a full-size model with the established language weights either from scratch or within a continual pre-training phase (CPT). We target six languages possessing a variety of geographic and intra- and inter-language-family relations, namely, English and Spanish (high-resource), Portuguese and Catalan (mid-resource), Galician and Basque (low-resource). We experiment with Salamandra-2b, which is a promising model for these languages. We investigate the effects of substantial data repetition on minor languages and under-sampling on dominant languages using the IberoBench framework for quantitative evaluation. Finally, we release a new promising IberianLLM-7B-Instruct model centering on Iberian languages and English that we pretrained from scratch and further improved using CPT with the XDoGE weights.
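
Step (ii), rescaling the data with the learned language weights, amounts to simple arithmetic, sketched below. This is not the XDoGE algorithm itself, which learns the weights with a proxy model; the weights, token counts, budget, and function name are made-up illustrative values.

```python
from typing import Dict


def plan_sampling(
    weights: Dict[str, float],           # language weights, e.g. from a reweighting step
    available_tokens: Dict[str, float],  # tokens actually available per language
    budget: float,                       # total training tokens to draw
) -> Dict[str, Dict[str, float]]:
    """Turn language weights into per-language token targets and repetition factors."""
    total_weight = sum(weights.values())
    plan = {}
    for lang, weight in weights.items():
        target = budget * weight / total_weight
        plan[lang] = {
            "target_tokens": target,
            # epochs > 1 means the data is repeated; epochs < 1 means under-sampling
            "epochs": target / available_tokens[lang],
        }
    return plan


plan = plan_sampling(
    weights={"en": 0.5, "eu": 0.5},
    available_tokens={"en": 1000.0, "eu": 100.0},
    budget=400.0,
)
```

In this toy setup the high-resource language is under-sampled (0.2 epochs) while the low-resource language is repeated (2 epochs), which is the kind of trade-off the paper reports studying.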

[17] Causal Reasoning Favors Encoders: On The Limits of Decoder-Only Models

Amartya Roy,Elamparithy M,Kripabandhu Ghosh,Ponnurangam Kumaraguru,Adrian de Wynter

Main category: cs.CL

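TL;DR: This paper studies how different language model architectures perform on causal reasoning. It finds that in-context learning (ICL) alone is not enough for reliable causal reasoning, that decoder-only models are notably sensitive to distribution shift, and that fine-tuned encoder and encoder-decoder architectures generalize more robustly across tests.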

Details Motivation: To examine how different architectures perform on causal reasoning, in particular multi-hop reasoning and resistance to misleading lexical cues. Method: Compares fine-tuned encoder, encoder-decoder, and decoder-only architectures with zero-shot and few-shot in-context learning, in both natural language and non-natural language scenarios. Result: ICL alone is insufficient for reliable causal reasoning. Decoder-only models are sensitive to distribution shift, while fine-tuned encoder and encoder-decoder models are more robust, including on the non-natural language split, and are matched or surpassed by decoder-only models only at large scale. Conclusion: For cost-effective, short-horizon robust causal reasoning, encoder or encoder-decoder architectures with targeted fine-tuning are preferable. Abstract: In-context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remain unclear. Causal reasoning demands multihop composition and strict conjunctive control, and reliance on spurious lexical relations of the input could provide misleading results. We hypothesize that, due to their ability to project the input into a latent space, encoder and encoder decoder architectures are better suited for said multihop conjunctive reasoning versus decoder only models. To do this, we compare fine-tuned versions of all the aforementioned architectures with zero and few shot ICL in both natural language and non natural language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often overfocusing on irrelevant input features. In particular, decoder only models are noticeably brittle to distributional shifts, while finetuned encoder and encoder decoder models can generalize more robustly across our tests, including the non natural language split. Both architectures are only matched or surpassed by decoder only architectures at large scales. We conclude by noting that for cost effective, short horizon robust causal reasoning, encoder or encoder decoder architectures with targeted finetuning are preferable.

[18] RoleRMBench & RoleRM: Towards Reward Modeling for Profile-Based Role Play in Dialogue Systems

Hang Ding,Qiming Feng,Dongqi Liu,Qi Zhao,Tao Yao,Shuo Wang,Dongsheng Chen,Jian Li,Zhenye Gan,Jiangning Zhang,Chengjie Wang,Yabiao Wang

Main category: cs.CL

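TL;DR: This paper introduces RoleRMBench, the first systematic benchmark for reward modeling in role-playing dialogue, and RoleRM, a reward model trained with Continuous Implicit Preferences (CIP) that substantially outperforms existing models on narrative coherence and stylistic fidelity.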

Details Motivation: Existing reward models perform poorly in subjective, open-ended domains such as role play and struggle to capture nuanced, persona-grounded human judgments, so reward models better suited to subjective alignment are needed. Method: Proposes the RoleRMBench benchmark, covering seven fine-grained capabilities, and the RoleRM model, trained with Continuous Implicit Preferences (CIP), which provides continuous, consistent pairwise supervision under multiple structuring strategies. Result: Experiments on RoleRMBench show a significant gap between general-purpose reward models and human judgment; RoleRM surpasses strong baselines by over 24% on average, with the clearest gains on the narrative and stylistic dimensions. Conclusion: Continuous preference representation and annotation consistency are essential for subjective alignment, and RoleRM lays a foundation for subjective preference modeling in human-centered dialogue systems. Abstract: Reward modeling has become a cornerstone of aligning large language models (LLMs) with human preferences. Yet, when extended to subjective and open-ended domains such as role play, existing reward models exhibit severe degradation, struggling to capture nuanced and persona-grounded human judgments. To address this gap, we introduce RoleRMBench, the first systematic benchmark for reward modeling in role-playing dialogue, covering seven fine-grained capabilities from narrative management to role consistency and engagement. Evaluation on RoleRMBench reveals large and consistent gaps between general-purpose reward models and human judgment, particularly in narrative and stylistic dimensions. We further propose RoleRM, a reward model trained with Continuous Implicit Preferences (CIP), which reformulates subjective evaluation as continuous consistent pairwise supervision under multiple structuring strategies. Comprehensive experiments show that RoleRM surpasses strong open- and closed-source reward models by over 24% on average, demonstrating substantial gains in narrative coherence and stylistic fidelity. Our findings highlight the importance of continuous preference representation and annotation consistency, establishing a foundation for subjective alignment in human-centered dialogue systems.

[19] AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence

Bo Yang,Lanfei Feng,Yunkui Chen,Yu Zhang,Jianyu Zhang,Xiao Xu,Nueraili Aierken,Shijian Li

Main category: cs.CL

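TL;DR: This paper presents AgriGPT-Omni, a unified agricultural multimodal framework integrating speech, vision, and text. By building a large-scale multilingual agricultural speech dataset, an omni-model trained in three stages, and AgriBench-Omni-2K, the first tri-modal agricultural benchmark, it substantially improves multilingual multimodal reasoning and real-world speech understanding. All artifacts are to be released openly to promote sustainable AI development.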

Details Motivation: Existing agricultural applications are constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks, which hinders inclusive intelligence for low-resource regions. Method: Proposes the AgriGPT-Omni framework: (1) a scalable data synthesis and collection pipeline that produces large-scale multilingual agricultural speech data; (2) a three-stage training paradigm (textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning) for the omni-model; (3) AgriBench-Omni-2K, the first tri-modal agricultural benchmark, with standardized protocols and reproducible tools. Result: Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual multimodal reasoning and real-world speech understanding. Conclusion: AgriGPT-Omni offers the first unified speech-vision-text omni-modal solution for agriculture, supporting reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions. Abstract: Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks. To address these challenges, we present AgriGPT-Omni, an agricultural omni-framework that integrates speech, vision, and text in a unified framework. First, we construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data, resulting in the largest agricultural speech dataset to date, including 492K synthetic and 1.4K real speech samples across six languages. Second, based on this, we train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning, enabling unified reasoning across languages and modalities. Third, we propose AgriBench-Omni-2K, the first tri-modal benchmark for agriculture, covering diverse speech-vision-text tasks and multilingual slices, with standardized protocols and reproducible tools. Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding. All models, data, benchmarks, and code will be released to promote reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions.

[20] From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low-Resource Languages

Smiljana Antonijevic Ubois

Main category: cs.CL

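TL;DR: Using Serbian as a case study, this work examines the structural, historical, and sociotechnical challenges facing language technology development for low-resource languages in the AI age, and proposes Data Care, a framework grounded in the CARE principles, for building inclusive, sustainable, and culturally grounded language technologies.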

Details Motivation: To address the cultural and linguistic bias that low-resource languages inherit from large language models' reliance on dominant languages such as English, and in particular the underrepresentation of Serbian caused by the historical destruction of texts and by contemporary engineering-first approaches. Method: Semi-structured interviews with ten scholars and practitioners (linguists, digital humanists, and AI developers) are used to analyze the factors shaping Serbian language technology development and to derive the Data Care framework. Result: The study identifies the destruction of historical texts, superficial transliteration, reliance on English-trained models, data bias, and a lack of culturally specific datasets as key problems, and finds that current approaches prioritize functionality over linguistic nuance. Conclusion: Data Care moves bias mitigation from a post hoc technical fix to an integral part of corpus design, annotation, and governance, and can serve as a replicable model for correcting the power imbalances and cultural blind spots of conventional LLM development. Abstract: Large language models are commonly trained on dominant languages like English, and their representation of low-resource languages typically reflects cultural and linguistic biases present in the source language materials. Using the Serbian language as a case, this study examines the structural, historical, and sociotechnical factors shaping language technology development for low-resource languages in the AI age. Drawing on semi-structured interviews with ten scholars and practitioners, including linguists, digital humanists, and AI developers, it traces challenges rooted in historical destruction of Serbian textual heritage, intensified by contemporary issues that drive reductive, engineering-first approaches prioritizing functionality over linguistic nuance. These include superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity. To address these challenges, the study proposes Data Care, a framework grounded in CARE principles (Collective Benefit, Authority to Control, Responsibility, and Ethics), that reframes bias mitigation from a post hoc technical fix to an integral component of corpus design, annotation, and governance, and positions Data Care as a replicable model for building inclusive, sustainable, and culturally grounded language technologies in contexts where traditional LLM development reproduces existing power imbalances and cultural blind spots.

[21] Textual Data Bias Detection and Mitigation - An Extensible Pipeline with Experimental Evaluation

Rebekka Görge,Sujan Sai Gannamaneni,Tabea Naeven,Hammam Abdelwahab,Héctor Allende-Cid,Armin B. Cremers,Lennard Helmer,Michael Mock,Anna Schmitz,Songkai Xue,Elif Yildirir,Maximilian Poretschkin,Stefan Wrobel

Main category: cs.CL

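TL;DR: Proposes a comprehensive data debiasing pipeline targeting representation bias and explicit stereotypes in text data. Using LLM-generated word lists, quantified representation bias, sociolinguistically informed filtering, and counterfactual data augmentation, it reduces bias at the data level, but the effect on model-level bias is inconsistent, exposing limitations of current evaluation methods.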

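Details Motivation: To meet regulatory requirements such as the European AI Act for identifying and mitigating bias in training data, and to address the lack of actionable guidance in existing debiasing work, particularly for representation bias and stereotypes affecting protected groups. Method: A four-stage debiasing pipeline: (1) LLM-generated word lists for the sensitive attribute, created according to quality criteria, to detect group labels; (2) the Demographic Representation Score to quantify representation bias; (3) sociolinguistically informed filtering to detect and mitigate stereotypes; (4) Grammar- and Context-Aware Counterfactual Data Augmentation to compensate for representation bias. A two-fold evaluation on gender, religion, and age uses human validation and baseline comparison to assess data debiasing, and bias benchmarks to assess model debiasing. Result: At the data level, the method reduces representation bias and explicit stereotypes, as confirmed by human validation and baseline comparison; at the model level, LLMs (0.6B-8B parameters) fine-tuned on the debiased data behave inconsistently on bias benchmarks and show no consistent improvement. Conclusion: Although the method is effective for data debiasing, debiased data does not necessarily translate into fairer model outputs. This points to critical gaps in current bias evaluation and to the need for more targeted data interventions and more effective model evaluation. Abstract: Textual data used to train large language models (LLMs) exhibits multifaceted bias manifestations encompassing harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two data bias types, namely representation bias and (explicit) stereotypes for a configurable sensitive attribute. First, we leverage LLM-generated word lists created based on quality criteria to detect relevant group labels. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate for representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using the examples of gender, religion and age. First, the effectiveness of each individual component on data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset.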
Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters), fine-tuned on the debiased text dataset. This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.

[22] Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

Songyang Gao,Yuzhe Gu,Zijian Wu,Lingkai Kong,Wenwei Zhang,Zhongrui Cai,Fan Zheng,Tianyou Ma,Junhao Shen,Haiteng Zhao,Duanyang Zhang,Huilun Zhang,Kuikun Liu,Chengqi Lyu,Yanhui Duan,Chiyu Chen,Ningsheng Ma,Jianfei Gao,Han Lyu,Dahua Lin,Kai Chen

Main category: cs.CL

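TL;DR: This paper proposes the Outcome-based Process Verifier (OPV), which verifies the reasoning behind summarized outcomes of long chains of thought for efficient and accurate verification. Combined with an iterative active learning framework and expert annotation, it progressively improves verification capability while substantially lowering annotation cost.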

Details Motivation: Existing outcome-based verifiers cannot inspect unreliable intermediate steps in long reasoning chains, while process-based verifiers are limited by the scarcity of high-quality annotations and struggle to reliably detect errors in complex reasoning. A more efficient, accurate, and scalable verification method is therefore needed. Method: Proposes OPV, which verifies the reasoning process from summarized outcomes of long chains of thought; an iterative active learning framework selects the cases the current model is most uncertain about for expert annotation, and the next-generation OPV is trained with Rejection Fine-Tuning (RFT) and Reinforcement Learning with Verifiable Rewards (RLVR). Result: Reaches an F1 score of 83.1 on the authors' held-out OPV-Bench, outperforming larger models such as Qwen3-Max-Preview; effectively identifies false positives in synthetic data, in close agreement with expert judgment; and yields gains when paired with policy models, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B on AIME2025 from 55.2% to 73.3%. Conclusion: OPV delivers efficient, accurate, and scalable reasoning verification, improves continuously through iterative active learning with limited annotation, and shows strong results and broad applicability across tasks and models. Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability.
It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic data, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.

[23] TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage

Elroy Galbraith,Chadwick Sutherland,Donahue Morgan

Main category: cs.CL

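TL;DR: This paper presents TRIDENT, a three-layer emergency call support system that combines Caribbean-accent-tuned speech recognition, local entity extraction, and bio-acoustic distress detection to improve emergency response for speakers of non-standard English varieties.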

Details Motivation: Existing emergency speech recognition systems degrade on non-standard varieties such as Caribbean English, producing unequal service; a solution that accommodates linguistic diversity while supporting effective triage is needed. Method: A three-layer architecture: accent-tuned ASR, LLM-based entity extraction, and bio-acoustic distress detection. Grounded in psycholinguistic theory, it uses low ASR confidence together with vocal stress indicators as call prioritization signals. Result: The system gives dispatchers three complementary signals (transcription confidence, structured clinical entities, and vocal stress indicators) and is designed to operate offline in disaster scenarios. Conclusion: TRIDENT provides an accent-resilient emergency AI framework intended to give Caribbean populations equitable access to national triage protocols, though empirical validation is still pending. Abstract: Emergency speech recognition systems exhibit systematic performance degradation on non-standard English varieties, creating a critical gap in services for Caribbean populations. We present TRIDENT (Transcription and Routing Intelligence for Dispatcher-Empowered National Triage), a three-layer dispatcher-support architecture designed to structure emergency call inputs for human application of established triage protocols (the ESI for routine operations and START for mass casualty events), even when automatic speech recognition fails. The system combines Caribbean-accent-tuned ASR, local entity extraction via large language models, and bio-acoustic distress detection to provide dispatchers with three complementary signals: transcription confidence, structured clinical entities, and vocal stress indicators. Our key insight is that low ASR confidence, rather than representing system failure, serves as a valuable queue prioritization signal -- particularly when combined with elevated vocal distress markers indicating a caller in crisis whose speech may have shifted toward basilectal registers. A complementary insight drives the entity extraction layer: trained responders and composed bystanders may report life-threatening emergencies without elevated vocal stress, requiring semantic analysis to capture clinical indicators that paralinguistic features miss. We describe the architectural design, theoretical grounding in psycholinguistic research on stress-induced code-switching, and deployment considerations for offline operation during disaster scenarios.
This work establishes a framework for accent-resilient emergency AI that ensures Caribbean voices receive equitable access to established national triage protocols. Empirical validation on Caribbean emergency calls remains future work.
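As a rough illustration of how the three signals described above could feed a dispatcher queue, the following Python sketch combines them into a single ordering. Everything in it is an assumption made for illustration: the `CallSignals` fields, the weights, and the 0.8 floor for calls with extracted clinical entities are hypothetical and do not come from the TRIDENT paper, which leaves triage decisions to human dispatchers.

```python
from dataclasses import dataclass


@dataclass
class CallSignals:
    call_id: str
    asr_confidence: float     # 0.0 (unreliable transcript) to 1.0 (reliable)
    vocal_distress: float     # 0.0 (calm) to 1.0 (highly distressed)
    clinical_entities: tuple  # entities extracted from the transcript, if any


def priority_score(call: CallSignals) -> float:
    """Hypothetical scoring rule; the weights are illustrative only."""
    low_confidence = 1.0 - call.asr_confidence
    # Low ASR confidence and vocal distress each raise priority, and their
    # combination raises it further (the interaction term).
    score = (0.3 * low_confidence
             + 0.3 * call.vocal_distress
             + 0.4 * low_confidence * call.vocal_distress)
    # Calm callers can still report life-threatening emergencies, so any
    # extracted clinical entity lifts the call to a high floor.
    if call.clinical_entities:
        score = max(score, 0.8)
    return score


calls = [
    CallSignals("A", asr_confidence=0.90, vocal_distress=0.1, clinical_entities=()),
    CallSignals("B", asr_confidence=0.30, vocal_distress=0.9, clinical_entities=()),
    CallSignals("C", asr_confidence=0.95, vocal_distress=0.1,
                clinical_entities=("chest pain",)),
]

# Highest priority first; a dispatcher would still review every call.
queue = sorted(calls, key=priority_score, reverse=True)
queue_ids = [c.call_id for c in queue]
print(queue_ids)
```

The interaction term reflects the abstract's point that low confidence matters most when it coincides with distress, while the entity floor reflects the complementary point about composed callers.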

[24] OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

Zijian Wu,Lingkai Kong,Wenwei Zhang,Songyang Gao,Yuzhe Gu,Zhongrui Cai,Tianyou Ma,Yuhong Liu,Zhi Wang,Runyuan Ma,Guangyu Wang,Wei Li,Conghui He,Dahua Lin,Kai Chen

Main category: cs.CL

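TL;DR: Proposes the Outcome-based Process Verifier (OPV), which uses an iterative active learning framework and rejection fine-tuning to verify long reasoning chains efficiently and accurately, substantially improving model performance while lowering annotation cost.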

Details Motivation: Existing outcome-based verifiers cannot inspect unreliable intermediate steps in long reasoning chains, while process-based verifiers are limited by the lack of high-quality annotations and struggle to reliably detect errors in complex reasoning. Method: Proposes the Outcome-based Process Verifier (OPV), which verifies the reasoning process from summarized outcomes of long chains of thought, and progressively improves it through an iterative active learning framework with expert annotation, using Rejection Fine-Tuning (RFT) and Reinforcement Learning with Verifiable Rewards (RLVR). Result: OPV reaches an F1 score of 83.1 on OPV-Bench, surpassing larger open-source models; it effectively identifies false positives in synthetic data, in close agreement with expert assessment; and it yields gains when paired with policy models, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B on AIME2025 from 55.2% to 73.3%. Conclusion: OPV achieves efficient and accurate verification of reasoning processes, scales well, and offers a practical route to large-scale, low-cost training of advanced language models. Abstract: Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then subsequently used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3.
Furthermore, OPV effectively detects false positives within synthetic data, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
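The selection step of the active learning loop ("the most uncertain cases of the current best OPV are annotated") might be sketched as follows. The abstract does not say how uncertainty is measured, so this sketch assumes the verifier outputs a probability that a solution is correct and ranks cases by binary entropy; `select_for_annotation` and the example case IDs are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; peaks at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_for_annotation(verifier_probs: dict, budget: int) -> list:
    """Return the `budget` case IDs whose verifier predictions are most uncertain."""
    ranked = sorted(
        verifier_probs,
        key=lambda case_id: binary_entropy(verifier_probs[case_id]),
        reverse=True,
    )
    return ranked[:budget]


# Assumed verifier outputs: probability that each summarized solution is correct.
probs = {"a": 0.98, "b": 0.52, "c": 0.10, "d": 0.45}
to_annotate = select_for_annotation(probs, budget=2)
print(to_annotate)
```

In a full loop, the selected cases would go to expert annotators and the resulting labels would train the next verifier; those stages are omitted here.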

[25] Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation

Kevin Glocker,Kätriin Kukk,Romina Oji,Marcel Bollmann,Marco Kuhlmann,Jenny Kunz

Main category: cs.CL

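TL;DR: This study investigates scaling up model size as an efficient way to adapt to new target languages. It finds that, given sufficient target-language data, larger models can match or surpass smaller ones while reducing catastrophic forgetting, and it explores the possibility of building modular multilingual systems.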

Details Motivation: To address the underperformance of medium- and lower-resource languages in massively multilingual models, which lag behind language-specific models, especially at smaller scales. Method: Comprehensive scaling ablations with approximately FLOP-matched models test whether upscaling an English base model adapts to new languages more effectively than standard continued pretraining. Result: Once exposed to sufficient target-language data, larger upscaled models match or surpass smaller models continually pretrained on more data; scaling helps preserve the base model's English capabilities and reduces catastrophic forgetting; merging upscaled language-specific models works better than merging smaller ones, though performance differs widely across merging methods. Conclusion: Scaling is an effective strategy for improving data efficiency and preserving existing language capabilities. Merging methods still leave room for improvement, but merging upscaled models is a potential route to flexible multilingual systems. Abstract: Achieving high-performing language models which include medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform compared to language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model's capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.

[26] Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting

Manurag Khullar,Utkarsh Desai,Poorva Malviya,Aman Dalmia,Zheyuan Ryan Shi

Main category: cs.CL

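TL;DR: This study examines how romanized text affects the reliability of large language models (LLMs) in maternal and newborn health triage in India. Performance drops markedly on romanized input: the models often understand the meaning, yet their outputs remain unstable.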

Details Motivation: In clinical applications in India, users often write local languages in romanized text rather than native scripts, but existing research rarely evaluates this orthographic variation on real-world data, which risks systematic errors in high-stakes medical decisions. Method: Using a real-world dataset of user-generated queries spanning five Indian languages and Nepali, the study compares leading LLMs on native-script and romanized text and analyzes the gap between their semantic understanding and their classification outputs. Result: F1 scores on romanized text trail those on native scripts by 5-12 points; at the partner maternal health organization, this gap could cause nearly 2 million excess triage errors; the models often infer the semantic intent of romanized queries correctly, but their final classifications remain vulnerable to orthographic noise. Conclusion: LLMs have a critical safety blind spot when handling romanized medical queries: even when they appear to understand the input, they may fail to act on it reliably, which calls for targeted work on real-world writing variation. Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. In many such settings, speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely evaluates this orthographic variation using real-world data. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with F1 scores trailing those of native scripts by 5-12 points. At our partner maternal health organization in India, this gap could cause nearly 2 million excess errors in triage. Crucially, this performance gap by scripts is not due to a failure in clinical reasoning. We demonstrate that LLMs often correctly infer the semantic intent of romanized queries. Nevertheless, their final classification outputs remain brittle in the presence of orthographic noise in romanized inputs. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.

[27] The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

Aileen Cheng,Alon Jacovi,Amir Globerson,Ben Golan,Charles Kwong,Chris Alberti,Connie Tao,Eyal Ben-David,Gaurav Singh Tomar,Lukas Haas,Yonatan Bitton,Adam Bloniarz,Aijun Bai,Andrew Wang,Anfal Siddiqui,Arturo Bajuelos Castillo,Aviel Atias,Chang Liu,Corey Fry,Daniel Balle,Deepanway Ghosal,Doron Kukliansky,Dror Marcus,Elena Gribovskaya,Eran Ofek,Honglei Zhuang,Itay Laish,Jan Ackermann,Lily Wang,Meg Risdal,Megan Barnes,Michael Fink,Mohamed Amin,Moran Ambar,Natan Potikha,Nikita Gupta,Nitzan Katz,Noam Velan,Ofir Roval,Ori Ram,Polina Zablotskaia,Prathamesh Bang,Priyanka Agrawal,Rakesh Ghiya,Sanjay Ganapathy,Simon Baumgartner,Sofia Erell,Sushant Prakash,Thibault Sellam,Vikram Rao,Xuanhui Wang,Yaroslav Akulov,Yulong Yang,Zhen Yang,Zhixin Lai,Zhongru Wu,Anca Dragan,Avinatan Hassidim,Fernando Pereira,Slav Petrov,Srinivasan Venkatachary,Tulsee Doshi,Yossi Matias,Sasha Goldshtein,Dipanjan Das

Main category: cs.CL

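TL;DR: The FACTS Leaderboard is a benchmark suite that comprehensively evaluates how well language models generate factually accurate text across diverse scenarios. It comprises four sub-leaderboards scored by automated judge models to give an overall measure of factuality.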

Details Motivation: To evaluate the factual accuracy of language models comprehensively across different scenarios and to address the limited coverage of existing benchmarks. Method: Builds an evaluation suite of four sub-leaderboards (multimodal, parametric knowledge, search-augmented, and document grounding) and uses automated judge models to score model outputs. Result: Introduces the FACTS Leaderboard suite, covering factuality evaluation in four distinct information scenarios, with public and private splits to support external participation and ongoing maintenance. Conclusion: The FACTS Leaderboard offers a more comprehensive, balanced, and sustainable benchmark framework for evaluating language model factuality. Abstract: We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
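The aggregation described in the abstract ("the final suite score is an average of the four components") can be written down in a few lines. This sketch assumes an unweighted arithmetic mean and uses made-up placeholder scores; the dictionary keys are hypothetical names, not identifiers from the leaderboard itself.

```python
from statistics import mean

# Hypothetical keys for the four sub-leaderboards named in the abstract.
SUB_LEADERBOARDS = ("multimodal", "parametric", "search", "grounding_v2")


def suite_score(scores: dict) -> float:
    """Unweighted mean of the four sub-leaderboard scores (an assumption)."""
    missing = [name for name in SUB_LEADERBOARDS if name not in scores]
    if missing:
        raise ValueError(f"missing sub-leaderboard scores: {missing}")
    return mean(scores[name] for name in SUB_LEADERBOARDS)


# Placeholder numbers for illustration only; not real leaderboard results.
example = {"multimodal": 40.0, "parametric": 60.0, "search": 50.0, "grounding_v2": 70.0}
overall = suite_score(example)
print(overall)
```

Requiring all four components before averaging avoids silently reporting a suite score for a model that was only evaluated on some sub-leaderboards.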

[28] LabelFusion: Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification

Michael Schlee,Christoph Weisser,Timo Kivimäki,Melchizedek Mashiku,Benjamin Saefken

Main category: cs.CL

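TL;DR: LabelFusion is a text classification ensemble that fuses a traditional transformer classifier with large language models (LLMs). By learning to combine the strengths of both, it delivers accurate and cost-aware predictions on multi-class and multi-label tasks.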

Details Motivation: To combine the efficiency of traditional transformer models with the reasoning ability of large language models and to manage the trade-off between accuracy, latency, and cost in text classification. Method: Per-class scores are obtained from the LLM through structured prompt engineering, concatenated with the traditional model's embeddings, and fed into a compact multi-layer perceptron (FusionMLP) for the final prediction; the full fusion pipeline is trained end-to-end. Result: Reaches 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification, showing robust cross-domain performance. Conclusion: LabelFusion effectively combines the strengths of traditional models and large language models, providing a high-performing, configurable, and cost-aware text classification solution. Abstract: LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone's embeddings with the LLM-derived per-class scores -- obtained through structured prompt-engineering strategies -- and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains -- achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification -- while enabling practical trade-offs between accuracy, latency, and cost.
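The fusion step (concatenate backbone embeddings with LLM per-class scores, then pass the joint vector through a small MLP) can be illustrated with a dependency-free forward pass. This is not the LabelFusion package's API: `fusion_forward`, `init_params`, the layer sizes, and the random untrained weights are all hypothetical, and the real FusionMLP is trained rather than randomly initialized.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(in_dim, hidden_dim, n_classes, seed=0):
    """Random, untrained weights; a real fusion head would be learned."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"w1": matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
            "w2": matrix(n_classes, hidden_dim), "b2": [0.0] * n_classes}


def fusion_forward(embedding, llm_scores, params):
    # Joint representation: backbone embedding followed by LLM per-class scores.
    joint = list(embedding) + list(llm_scores)
    hidden = relu(dense(joint, params["w1"], params["b1"]))
    return softmax(dense(hidden, params["w2"], params["b2"]))


embedding = [0.2, -0.5, 0.1, 0.7]  # stand-in for a transformer sentence embedding
llm_scores = [0.1, 0.8, 0.1]       # stand-in for LLM-derived per-class scores
params = init_params(len(embedding) + len(llm_scores), hidden_dim=8,
                     n_classes=len(llm_scores))
probs = fusion_forward(embedding, llm_scores, params)
print(probs)
```

Because the LLM scores enter as ordinary input features, the head can learn how much to trust them relative to the embedding, which is the idea behind learned fusion as opposed to fixed voting.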

[29] Quantifying Emotional Tone in Tolkien's The Hobbit: Dialogue Sentiment Analysis with RegEx, NRC-VAD, and Python

Lilin Qiu

Main category: cs.CL

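TL;DR: This study uses computational text analysis to examine the emotional tone of dialogue in The Hobbit. The dialogue is generally positive and calm, with a gradually rising sense of dominance, reflecting an emotional rhythm that alternates between tension and comfort.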

Details Motivation: To explore the implicit emotional structure of literary works, using digital methods to reveal emotional patterns that are hard to detect through traditional reading. Method: Dialogue is extracted with regular expressions, the emotional dimensions (valence, arousal, dominance) are quantified with the NRC-VAD lexicon, and emotional trajectories are analyzed with visualizations. Result: The dialogue is high in valence and low in arousal overall, dominance rises gradually as the plot advances, and the emotional tone cycles between tension and relief. Conclusion: Digital methods can reveal subtle emotional structures in literature and show the steady rhythm of emotional modulation in The Hobbit's narrative. Abstract: This study analyzes the emotional tone of dialogue in J. R. R. Tolkien's The Hobbit (1937) using computational text analysis. Dialogue was extracted with regular expressions, then preprocessed, and scored using the NRC-VAD lexicon to quantify emotional dimensions. The results show that the dialogue maintains a generally positive (high valence) and calm (low arousal) tone, with a gradually increasing sense of agency (dominance) as the story progresses. These patterns reflect the novel's emotional rhythm: moments of danger and excitement are regularly balanced by humor, camaraderie, and relief. Visualizations -- including emotional trajectory graphs and word clouds -- highlight how Tolkien's language cycles between tension and comfort. By combining computational tools with literary interpretation, this study demonstrates how digital methods can uncover subtle emotional structures in literature, revealing the steady rhythm and emotional modulation that shape the storytelling in The Hobbit.
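The pipeline in the abstract (regex dialogue extraction followed by lexicon scoring) could look roughly like the sketch below. It makes several assumptions: dialogue is delimited by double quotes (straight or curly), scores are averaged per utterance, and `TOY_VAD` holds made-up values standing in for the real NRC-VAD lexicon, which must be obtained separately. The sample sentence is invented and is not a quotation from the novel.

```python
import re
from statistics import mean

# Matches text between straight or curly double quotes (an assumed convention;
# some editions mark dialogue with single quotes instead).
DIALOGUE_RE = re.compile(r'[“"]([^”"]+)[”"]')

# Made-up (valence, arousal, dominance) values; NOT the real NRC-VAD entries.
TOY_VAD = {
    "good": (0.9, 0.4, 0.6),
    "morning": (0.7, 0.3, 0.5),
    "dragon": (0.3, 0.8, 0.8),
}


def extract_dialogue(text: str) -> list:
    """Return every quoted span in the text."""
    return DIALOGUE_RE.findall(text)


def score_utterance(utterance: str, lexicon: dict):
    """Average V, A, D over lexicon words in the utterance; None if no hits."""
    words = re.findall(r"[a-z']+", utterance.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    if not hits:
        return None
    return tuple(mean(dimension) for dimension in zip(*hits))


sample = 'He said, "Good morning!" and left. "Beware the dragon," she replied.'
utterances = extract_dialogue(sample)
scores = [score_utterance(u, TOY_VAD) for u in utterances]
print(utterances)
print(scores)
```

Plotting the per-utterance scores in story order would give the kind of emotional trajectory the abstract describes; that plotting step is left out here.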

[30] Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity

Hauke Licht

Main category: cs.CL

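TL;DR: This paper evaluates multimodal large language models (mLLMs) for video-based analysis of emotional arousal in political communication. They are reliable and show little bias under ideal conditions but perform poorly on real-world parliamentary debates, underscoring the need for continued, rigorous evaluation of generative AI methods.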

Details Motivation: There is little empirical evidence on the effectiveness of multimodal AI in emotion analysis, especially for recognizing emotional arousal in political communication. Method: Two datasets of human-labeled video recordings are used to evaluate current multimodal large language models on video-based emotional arousal analysis and to test their reliability and potential demographic bias. Result: Under ideal conditions, mLLMs' arousal ratings are highly reliable and show almost no gender or racial bias; on real-world parliamentary debate videos, performance drops markedly, which may affect downstream statistical inference. Conclusion: Although mLLMs show promise in controlled settings, they remain limited in applied political analysis, and a replicable evaluation framework is needed to keep testing their validity. Abstract: Emotions are central to politics and analyzing their role in political communication has a long tradition. As research increasingly leverages audio-visual materials to analyze the display of emotions, the emergence of multimodal generative AI promises great advances. However, we lack evidence about the effectiveness of multimodal AI in emotion analysis. This paper addresses this gap by evaluating current multimodal large language models (mLLMs) in video-based analysis of emotional arousal in two complementary data sets of human-labeled video recordings. I find that under ideal circumstances, mLLMs' emotional arousal ratings are highly reliable and show little to no indication of demographic bias. However, in recordings of speakers in real-world parliamentary debates, mLLMs' arousal ratings fail to deliver on this promise with potential negative consequences for downstream statistical inferences. This study therefore underscores the need for continued, thorough evaluation of emerging generative AI methods in political analysis and contributes a suitable replicable framework.

cs.CV

[31] Neuromorphic Eye Tracking for Low-Latency Pupil Detection

Paul Hueber,Luca Peres,Florian Pitters,Alejandro Gloriani,Oliver Rhodes

Main category: cs.CV

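TL;DR: This paper presents an efficient event-based eye-tracking model built on neuromorphic sensors and spiking neural networks (SNNs). Using lightweight LIF layers and depth-wise separable convolutions, it keeps a mean error of 3.7-4.1px while sharply reducing model size and compute, making it suitable for low-power, low-latency wearable systems.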

Details Motivation: Conventional frame-based eye tracking suffers from motion blur, high compute cost, and limited temporal resolution, and struggles to meet the low-latency, low-power demands of wearable systems such as AR/VR. Method: The recurrent and attention modules of top-performing event-based eye-tracking models are replaced with lightweight LIF spiking neurons, and depth-wise separable convolutions reduce model complexity, yielding an efficient SNN architecture. Result: The models reach 3.7-4.1px mean error, close to the 3.24px of the Retina system, with a 20x reduction in model size and an 850x reduction in theoretical compute; projected power is 3.9-4.9 mW with 3 ms latency at 1 kHz. Conclusion: High-performing event-based eye-tracking models can be converted to SNNs with large efficiency gains while retaining adequate accuracy, making them suitable for real-time wearable deployment. Abstract: Eye tracking for wearable systems demands low latency and milliwatt-level power, but conventional frame-based pipelines struggle with motion blur, high compute cost, and limited temporal resolution. Such capabilities are vital for enabling seamless and responsive interaction in emerging technologies like augmented reality (AR) and virtual reality (VR), where understanding user gaze is key to immersion and interface design. Neuromorphic sensors and spiking neural networks (SNNs) offer a promising alternative, yet existing SNN approaches are either too specialized or fall short of the performance of modern ANN architectures. This paper presents a neuromorphic version of top-performing event-based eye-tracking models, replacing their recurrent and attention modules with lightweight LIF layers and exploiting depth-wise separable convolutions to reduce model complexity. Our models obtain 3.7-4.1px mean error, approaching the accuracy of the application-specific neuromorphic system, Retina (3.24px), while reducing model size by 20x and theoretical compute by 850x, compared to the closest ANN variant of the proposed model. These efficient variants are projected to operate at an estimated 3.9-4.9 mW with 3 ms latency at 1 kHz. The present results indicate that high-performing event-based eye-tracking architectures can be redesigned as SNNs with substantial efficiency gains, while retaining accuracy suitable for real-time wearable deployment.
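For readers unfamiliar with the LIF (leaky integrate-and-fire) layers mentioned above, the following sketch simulates a single neuron in discrete time. It uses a common textbook form (exponential leak, hard threshold, reset to zero); the decay factor, threshold, and reset rule are assumptions, since the abstract does not specify the formulation used in the paper.

```python
def lif_run(inputs, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron over a sequence of inputs.

    beta is the per-step membrane decay and threshold the firing level;
    both values are illustrative assumptions.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leak the previous potential, then integrate the new input.
        membrane = beta * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes


spike_train = lif_run([0.6, 0.6, 0.0, 0.6, 0.6])
print(spike_train)
```

The neuron's only state is one scalar per unit and its output is binary, which is the source of the efficiency that makes such layers attractive as replacements for recurrent and attention modules.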

[32] ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects

Woojin Lee,Hyugjae Chang,Jaeho Moon,Jaehyup Lee,Munchurl Kim

Main category: cs.CV

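TL;DR: This paper proposes ABBSPO, a new weakly supervised oriented object detection framework that uses adaptive bounding box scaling and symmetric-prior-based angle prediction to substantially improve detection under horizontal bounding box supervision.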

Details Motivation: Existing horizontal-bounding-box-supervised methods fall short in scale estimation and angle learning, which limits detection accuracy; this paper aims to address both problems. Method: Proposes Adaptive Bounding Box Scaling (ABBS) to improve scale matching for predicted rotated boxes, and a Symmetric Prior Angle (SPA) loss that exploits the symmetry of aerial objects for self-supervised learning and prevents training collapse. Result: Experiments show that ABBSPO achieves state-of-the-art performance on multiple datasets, clearly outperforming existing weakly supervised methods. Conclusion: ABBSPO resolves the scale mismatch and unstable angle learning that affect OBB detection under HBox supervision, offering an efficient and accurate approach to weakly supervised oriented detection. Abstract: Weakly supervised oriented object detection (WS-OOD) has gained attention as a cost-effective alternative to fully supervised methods, providing both efficiency and high accuracy. Among weakly supervised approaches, horizontal bounding box (HBox)-supervised OOD stands out for its ability to directly leverage existing HBox annotations while achieving the highest accuracy under weak supervision settings. This paper introduces adaptive bounding box scaling and symmetry-prior-based orientation prediction, called ABBSPO, a framework for WS-OOD. Our ABBSPO addresses limitations of previous HBox-supervised OOD methods, which compare ground truth (GT) HBoxes directly with the minimum circumscribed rectangles of predicted RBoxes, often leading to inaccurate scale estimation. To overcome this, we propose: (i) Adaptive Bounding Box Scaling (ABBS), which appropriately scales GT HBoxes to optimize for the size of each predicted RBox, ensuring more accurate scale prediction; and (ii) a Symmetric Prior Angle (SPA) loss that exploits inherent symmetry of aerial objects for self-supervised learning, resolving issues in previous methods where learning collapses when predictions for all three augmented views (original, rotated, and flipped) are consistently incorrect. Extensive experimental results demonstrate that ABBSPO achieves state-of-the-art performance, outperforming existing methods.
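The baseline comparison the abstract criticizes relies on the minimum circumscribed horizontal rectangle of a predicted rotated box. The sketch below computes that rectangle with the standard geometric formula; it illustrates only this baseline quantity, not ABBS itself, whose scaling rule is not given in the abstract. The `(cx, cy, w, h, theta)` parameterization with theta in radians is an assumed convention.

```python
import math


def circumscribed_hbox(cx, cy, w, h, theta):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing a rotated box.

    The rotated box is centered at (cx, cy) with width w, height h, and
    rotation theta in radians.
    """
    cos_t = abs(math.cos(theta))
    sin_t = abs(math.sin(theta))
    box_w = w * cos_t + h * sin_t  # horizontal extent of the rotated box
    box_h = w * sin_t + h * cos_t  # vertical extent of the rotated box
    return (cx - box_w / 2, cy - box_h / 2, cx + box_w / 2, cy + box_h / 2)


upright = circumscribed_hbox(0.0, 0.0, 4.0, 2.0, 0.0)
quarter_turn = circumscribed_hbox(0.0, 0.0, 4.0, 2.0, math.pi / 2)
diagonal = circumscribed_hbox(0.0, 0.0, 4.0, 2.0, math.pi / 4)
print(upright, quarter_turn, diagonal)
```

At 45 degrees the enclosing rectangle of the 4 x 2 box is about 4.24 x 4.24, so its area is more than twice that of the rotated box, which gives a sense of why matching it directly to a ground-truth HBox can distort scale estimates.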

[33] Diffusion Is Your Friend in Show, Suggest and Tell

Jia Cheng Hu,Roberto Cavicchioli,Alessandro Capotondi

Main category: cs.CV

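TL;DR: This paper proposes a new paradigm, Show, Suggest and Tell (SST), which combines the bidirectional refinement capability of diffusion models with the linguistic structure of autoregressive models, achieving state-of-the-art image captioning on COCO with 125.1 CIDEr-D and no reinforcement learning.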

Details Motivation: Although diffusion denoising models excel at generative computer vision tasks, they have not yet surpassed conventional autoregressive methods in the discrete domain. This paper explores how to integrate the strengths of diffusion models into autoregressive generation, rather than replace it, in order to improve text generation quality. Method: Diffusion models provide suggestions to the autoregressive generation process instead of producing the output directly, combining the bidirectional context awareness and refinement ability of diffusion with the strong linguistic structure inherent to autoregressive models. Result: SST scores 125.1 CIDEr-D on COCO, outperforming state-of-the-art autoregressive and diffusion methods by 1.5 and 2.5 points respectively, without reinforcement learning. Experiments also confirm that the suggestion module has a positive effect on generation quality. Conclusion: Letting diffusion models assist rather than replace autoregressive generation opens a promising research direction, and experiments show that the approach effectively improves generated text quality. Abstract: Diffusion Denoising models demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing them. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: https://github.com/jchenghu/show\_suggest\_tell.

[34] MetaVoxel: Joint Diffusion Modeling of Imaging and Clinical Metadata

Yihao Liu,Chenyu Gao,Lianrui Zuo,Michael E. Kim,Brian D. Boyd,Lisa L. Barnes,Walter A. Kukull,Lori L. Beason-Held,Susan M. Resnick,Timothy J. Hohman,Warren D. Taylor,Bennett A. Landman

Main category: cs.CV

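TL;DR: MetaVoxel is a diffusion-based generative joint modeling framework that models the joint distribution of medical images and clinical metadata in a unified way, supporting flexible zero-shot multi-task inference.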

Details Motivation: Conventional medical AI models usually learn a conditional distribution for one specific predictive direction and generalize poorly across tasks. The authors aim to build a general model that handles multiple tasks in a unified way. Method: Proposes MetaVoxel, which models the joint distribution of medical images (such as T1-weighted MRI) and clinical metadata with a single diffusion process, enabling flexible inference without task-specific training. Result: Validated on more than 10,000 MRI scans from nine datasets, a single MetaVoxel model performs image generation, age estimation, and sex prediction with performance comparable to task-specific models, and shows flexible inference capability. Conclusion: Joint multimodal diffusion modeling is a promising direction for unifying medical AI models and can help broaden their clinical applicability. Abstract: Modern deep learning methods have achieved impressive results across tasks ranging from disease classification and continuous biomarker estimation to realistic medical image generation. Most of these approaches are trained to model conditional distributions defined by a specific predictive direction with a specific set of input variables. We introduce MetaVoxel, a generative joint diffusion modeling framework that models the joint distribution over imaging data and clinical metadata by learning a single diffusion process spanning all variables. By capturing the joint distribution, MetaVoxel unifies tasks that traditionally require separate conditional models and supports flexible zero-shot inference using arbitrary subsets of inputs without task-specific retraining. Using more than 10,000 T1-weighted MRI scans paired with clinical metadata from nine datasets, we show that a single MetaVoxel model can perform image generation, age estimation, and sex prediction, achieving performance comparable to established task-specific baselines. Additional experiments highlight its capabilities for flexible inference. Together, these findings demonstrate that joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability.

[35] Independent Density Estimation

Jiahao Liu

Main category: cs.CV

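TL;DR: This paper proposes Independent Density Estimation (IDE), a new method that aims to improve the compositional generalization of vision-language models to unseen compositions, achieving better performance through disentangled visual representations and entropy-based inference.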

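Details Motivation: Existing large-scale vision-language models still struggle with human-like compositional generalization and have difficulty capturing the fine-grained correspondence between words in a sentence and image features. Method: Proposes Independent Density Estimation (IDE) and builds two models: one uses fully disentangled visual representations, and the other uses a variational autoencoder to extract partially disentangled features from raw images; an entropy-based compositional inference method combines the per-word predictions. Result: Evaluation on multiple datasets shows that the proposed models generalize better to unseen compositions than current mainstream models. Conclusion: By explicitly modeling the independent association between words and image features, IDE effectively improves the compositional generalization of vision-language models and offers a new route toward more human-like language-vision understanding. Abstract: Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Nevertheless, these models still encounter difficulties in achieving human-like compositional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connection between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy-based compositional inference method to combine predictions of each word in the sentence. Our models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.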

[36] TraceFlow: Dynamic 3D Reconstruction of Specular Scenes Driven by Ray Tracing

Jiachen Tao,Junyi Wu,Haoxuan Wang,Zongxin Yang,Dawen Cai,Yan Yan

Main category: cs.CV

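TL;DR: This paper proposes TraceFlow, a new framework for high-fidelity rendering of dynamic specular scenes that produces sharper, more realistic results by improving reflection direction estimation and physically accurate reflection modeling.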

Details Motivation: Addresses two key challenges in rendering dynamic specular scenes: precise reflection direction estimation and physically accurate reflection modeling. Method: Proposes a Residual Material-Augmented 2D Gaussian Splatting representation, combined with a Dynamic Environment Gaussian and a hybrid rendering pipeline that decomposes rendering into diffuse and specular components, and adopts a coarse-to-fine training strategy to improve optimization stability. Result: Experiments on multiple dynamic scene benchmarks show that TraceFlow outperforms prior methods both quantitatively and qualitatively, producing sharper and more realistic specular reflections. Conclusion: TraceFlow effectively improves the rendering quality of specular reflections in complex dynamic environments, achieving high-fidelity, physically plausible dynamic scene rendering. Abstract: We present TraceFlow, a novel framework for high-fidelity rendering of dynamic specular scenes by addressing two key challenges: precise reflection direction estimation and physically accurate reflection modeling. To achieve this, we propose a Residual Material-Augmented 2D Gaussian Splatting representation that models dynamic geometry and material properties, allowing accurate reflection ray computation. Furthermore, we introduce a Dynamic Environment Gaussian and a hybrid rendering pipeline that decomposes rendering into diffuse and specular components, enabling physically grounded specular synthesis via rasterization and ray tracing. Finally, we devise a coarse-to-fine training strategy to improve optimization stability and promote physically meaningful decomposition. Extensive experiments on dynamic scene benchmarks demonstrate that TraceFlow outperforms prior methods both quantitatively and qualitatively, producing sharper and more realistic specular reflections in complex dynamic environments.
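
The reflection ray computation mentioned in the abstract above rests on the standard mirror-reflection formula r = d - 2(d·n)n. The sketch below shows only that textbook formula; the function name is hypothetical, and how TraceFlow estimates the normal `n` from its Gaussian representation is not described in the abstract and is not modeled here.

```python
def reflect(direction, normal):
    """Reflect an incoming ray direction about a unit-length surface normal.

    direction: incoming ray direction as a 3-tuple (pointing toward the surface).
    normal: surface normal as a 3-tuple, assumed to be normalized by the caller.
    Returns the reflected direction r = d - 2 (d . n) n as a 3-tuple.
    """
    dot = sum(d * n for d, n in zip(direction, normal))
    return tuple(d - 2.0 * dot * n for d, n in zip(direction, normal))


if __name__ == "__main__":
    # A ray heading down and to the right bounces off a floor with normal +y.
    print(reflect((1.0, -1.0, 0.0), (0.0, 1.0, 0.0)))  # (1.0, 1.0, 0.0)
```

Because the reflected direction depends linearly on the dot product with the normal, small errors in the estimated normal translate directly into errors in where the reflected ray samples the environment, which is why the abstract treats reflection direction estimation as a key challenge.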

[37] Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information

Neelima Prasad,Jarek Reynolds,Neel Karsanbhai,Tanusree Sharma,Lotus Zhang,Abigale Stangl,Yang Wang,Leah Findlater,Danna Gurari

Main category: cs.CV

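TL;DR: Proposes a new task, hierarchical instance tracking, which tracks all instances of predefined categories of objects and their parts while preserving their hierarchical relationships. It also releases the first benchmark dataset supporting the task, containing 2,765 unique entities across 552 videos and 40 categories. Seven variants of four models are evaluated, and the results show that the dataset is challenging.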

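Details Motivation: Existing instance tracking methods usually ignore the hierarchical relationships between objects and their parts and cannot serve applications that require fine-grained scene understanding. A new task is therefore needed that tracks objects and their parts together while maintaining their hierarchy. Method: Proposes the hierarchical instance tracking task and builds the first benchmark dataset supporting it, with 552 videos, 2,765 unique entities, and 40 object and part categories. Seven variants of four mainstream models are evaluated to verify the dataset's difficulty and the task's feasibility. Result: The dataset proves highly challenging: existing models perform poorly on it, with clear shortcomings in handling the hierarchical relationships between objects and parts. The results indicate that the task has research value and room for progress. Conclusion: Hierarchical instance tracking is a meaningful new task that can advance research on fine-grained dynamic modeling of objects and their parts in complex scenes. The proposed dataset provides a foundation for future methods. Abstract: We propose a novel task, hierarchical instance tracking, which entails tracking all instances of predefined categories of objects and parts, while maintaining their hierarchical relationships. We introduce the first benchmark dataset supporting this task, consisting of 2,765 unique entities that are tracked in 552 videos and belong to 40 categories (across objects and parts). Evaluation of seven variants of four models tailored to our novel task reveals the new dataset is challenging. Our dataset is available at https://vizwiz.org/tasks-and-datasets/hierarchical-instance-tracking/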

[38] Topological Conditioning for Mammography Models via a Stable Wavelet-Persistence Vectorization

Charles Fanning,Mehmet Emin Aktas

Main category: cs.CV

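TL;DR: Proposes a topological data analysis method based on wavelet persistent homology to improve the generalization of breast cancer screening models across devices and populations.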

Details Motivation: Screening mammography suffers from false positives and false negatives, and existing models degrade across scanners, modalities, and populations. Method: Uses a wavelet-based vectorization of persistent homology to produce multi-scale spatial maps that are stable to intensity perturbations, and integrates them into a ConvNeXt Tiny model via input-level channel concatenation. Result: On the INbreast dataset, patient-level AUC rises from 0.55 to 0.75, with marked improvement especially under a limited training budget. Conclusion: The method effectively improves model robustness and performance on out-of-distribution data and has strong potential for clinical deployment. Abstract: Breast cancer is the most commonly diagnosed cancer in women and a leading cause of cancer death worldwide. Screening mammography reduces mortality, yet interpretation still suffers from substantial false negatives and false positives, and model accuracy often degrades when deployed across scanners, modalities, and patient populations. We propose a simple conditioning signal aimed at improving external performance based on a wavelet-based vectorization of persistent homology. Using topological data analysis, we summarize image structure that persists across intensity thresholds and convert this information into spatial, multi-scale maps that are provably stable to small intensity perturbations. These maps are integrated into a two-stage detection pipeline through input-level channel concatenation. The model is trained and validated on the CBIS-DDSM digitized film mammography cohort from the United States and evaluated on two independent full-field digital mammography cohorts from Portugal (INbreast) and China (CMMD), with performance reported at the patient level. On INbreast, augmenting ConvNeXt Tiny with wavelet persistence channels increases patient-level AUC from 0.55 to 0.75 under a limited training budget.

[39] Feature Coding for Scalable Machine Vision

Md Eimran Hossain Eimon,Juan Merlos,Ashan Perera,Hari Kalva,Velibor Adzic,Borko Furht

Main category: cs.CV

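TL;DR: This paper presents the Feature Coding Test Model (FCTM) for compressing intermediate features of deep neural networks, reducing bitrate by 85.14% on average across multiple vision tasks while preserving accuracy and supporting efficient deployment of edge-cloud collaborative inference.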

Details Motivation: Deploying deep neural networks on edge devices faces compute, bandwidth, and privacy challenges that conventional approaches struggle to balance; efficient compression of intermediate features is needed to support edge-cloud collaborative inference. Method: Based on MPEG's Feature Coding for Machines (FCM) standard, designs and implements the Feature Coding Test Model (FCTM), with a bitstream syntax and codec pipeline tailored to intermediate feature compression. Result: FCTM achieves an average bitrate reduction of 85.14% across multiple vision tasks, substantially cutting transmission bandwidth while preserving inference accuracy. Conclusion: FCM offers an efficient, scalable, and interoperable way to deploy intelligent features in bandwidth-limited and privacy-sensitive applications, advancing edge intelligence. Abstract: Deep neural networks (DNNs) drive modern machine vision but are challenging to deploy on edge devices due to high compute demands. Traditional approaches, whether running the full model on-device or offloading to the cloud, face trade-offs in latency, bandwidth, and privacy. Splitting the inference workload between the edge and the cloud offers a balanced solution, but transmitting intermediate features to enable such splitting introduces new bandwidth challenges. To address this, the Moving Picture Experts Group (MPEG) initiated the Feature Coding for Machines (FCM) standard, establishing a bitstream syntax and codec pipeline tailored for compressing intermediate features. This paper presents the design and performance of the Feature Coding Test Model (FCTM), showing significant bitrate reductions, averaging 85.14%, across multiple vision tasks while preserving accuracy. FCM offers a scalable path for efficient and interoperable deployment of intelligent features in bandwidth-limited and privacy-sensitive consumer applications.

[40] Latent Chain-of-Thought World Modeling for End-to-End Driving

Shuhan Tan,Kashyap Chitta,Yuxiao Chen,Ran Tian,Yurong You,Yan Wang,Wenjie Luo,Yulong Cao,Philipp Krahenbuhl,Marco Pavone,Boris Ivanovic

Main category: cs.CV

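TL;DR: This paper proposes Latent-CoT-Drive (LCDrive), a new driving model that performs chain-of-thought (CoT) reasoning in a latent space rather than in natural language, enabling more efficient, higher-quality decision making in autonomous driving.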

Details Motivation: Existing VLA models rely on natural language for reasoning, but text may be an inefficient representation, limiting gains in performance and safety in complex driving scenarios. Method: Unifies CoT reasoning and decision making in an action-aligned latent space, representing the reasoning process by interleaving action-proposal tokens with tokens from a learned latent world model; the model is cold-started with supervision from ground-truth future rollouts and then post-trained with closed-loop reinforcement learning. Result: On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger gains from interactive reinforcement learning than non-reasoning and text-reasoning baselines. Conclusion: Performing CoT reasoning in a latent language rather than natural language improves the performance and safety of autonomous driving models more efficiently and points to a new direction for unified reasoning and decision design. Abstract: Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.

[41] Emerging Standards for Machine-to-Machine Video Coding

Md Eimran Hossain Eimon,Velibor Adzic,Hari Kalva,Borko Furht

Main category: cs.CV

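TL;DR: This paper introduces a new video coding paradigm for machine-to-machine communication, presenting Video Coding for Machines (VCM) and Feature Coding for Machines (FCM). Experiments show that FCM maintains accuracy close to edge inference while substantially reducing bitrate, and that existing H.26X codecs (such as HEVC and VVC) perform similarly on most tasks, indicating that existing hardware can effectively support machine vision communication.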

Details Motivation: Current machine vision systems mostly rely on remote transmission of pixel-level video optimized for human perception, leading to high bandwidth use, poor scalability, and privacy exposure; efficient, privacy-preserving coding designed for machine consumption is needed. Method: Uses task-aware pixel-domain coding (VCM) and neural feature compression (FCM), with H.264, H.265/HEVC, and H.266/VVC as inner codecs in FCM, and evaluates bitrate and performance across machine vision tasks. Result: FCM substantially reduces bitrate while maintaining accuracy close to edge inference. H.265/HEVC and H.266/VVC perform almost identically on most tasks (BD-Rate differs by only 1.39%), whereas H.264 increases BD-Rate by 32.28% on average relative to VVC. On the tracking task, however, HEVC slightly outperforms VVC and the impact of AVC is small (BD-Rate of 8.79%). Conclusion: Existing mainstream video coding hardware (such as HEVC) can already support machine-oriented visual communication effectively, and in most scenarios no upgrade to VVC is needed, which helps the deployment of low-bandwidth, privacy-preserving, efficient machine vision systems. Abstract: Machines are increasingly becoming the primary consumers of visual data, yet most deployments of machine-to-machine systems still rely on remote inference where pixel-based video is streamed using codecs optimized for human perception. Consequently, this paradigm is bandwidth intensive, scales poorly, and exposes raw images to third parties. Recent efforts in the Moving Picture Experts Group (MPEG) redesigned the pipeline for machine-to-machine communication: Video Coding for Machines (VCM) is designed to apply task-aware coding tools in the pixel domain, and Feature Coding for Machines (FCM) is designed to compress intermediate neural features to reduce bitrate, preserve privacy, and support compute offload. Experiments show that FCM is capable of maintaining accuracy close to edge inference while significantly reducing bitrate. Additional analysis of H.26X codecs used as inner codecs in FCM reveals that H.265/High Efficiency Video Coding (HEVC) and H.266/Versatile Video Coding (VVC) achieve almost identical machine task performance, with an average BD-Rate increase of 1.39% when VVC is replaced with HEVC. In contrast, H.264/Advanced Video Coding (AVC) yields an average BD-Rate increase of 32.28% compared to VVC. However, for the tracking task, the impact of codec choice is minimal: HEVC outperforms VVC with a BD-Rate of -1.81%, and AVC incurs a BD-Rate of 8.79%. This indicates that existing hardware for already deployed codecs can support machine-to-machine communication without degrading performance.

[42] Multi-dimensional Preference Alignment by Conditioning Reward Itself

Jiho Jang,Jinyoung Kim,Kyungjune Baek,Nojun Kwak

Main category: cs.CV

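TL;DR: Proposes Multi Reward Conditional DPO (MCDPO) to resolve the reward conflict that arises in standard DPO when the Bradley-Terry model aggregates multi-dimensional evaluations; a disentangled objective and a conditioning preference vector allow each dimension to be optimized independently and support dynamic control at inference time.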

Details Motivation: Standard DPO aggregates multi-dimensional human feedback (such as aesthetics and semantics) into a single scalar reward, causing the model to lose desirable features that appear in globally non-preferred samples and creating reward conflict. Method: Proposes MCDPO, which introduces a disentangled Bradley-Terry objective, feeds a multi-dimensional preference outcome vector to the model as a condition, and uses dimensional reward dropout to balance optimization across dimensions. Result: Experiments on Stable Diffusion 1.5 and SDXL show that MCDPO performs better on multiple benchmarks and supports amplifying specific reward dimensions at inference time via classifier-free guidance. Conclusion: MCDPO effectively mitigates reward conflict in multi-dimensional human feedback and enables finer-grained alignment control, with good scalability and flexibility. Abstract: Reinforcement Learning from Human Feedback has emerged as a standard for aligning diffusion models. However, we identify a fundamental limitation in the standard DPO formulation because it relies on the Bradley-Terry model to aggregate diverse evaluation axes like aesthetic quality and semantic alignment into a single scalar reward. This aggregation creates a reward conflict where the model is forced to unlearn desirable features of a specific dimension if they appear in a globally non-preferred sample. To address this issue, we propose Multi Reward Conditional DPO (MCDPO). This method resolves reward conflicts by introducing a disentangled Bradley-Terry objective. MCDPO explicitly injects a preference outcome vector as a condition during training, which allows the model to learn the correct optimization direction for each reward axis independently within a single network. We further introduce dimensional reward dropout to ensure balanced optimization across dimensions. Extensive experiments on Stable Diffusion 1.5 and SDXL demonstrate that MCDPO achieves superior performance on benchmarks. Notably, our conditional framework enables dynamic and multiple-axis control at inference time using Classifier Free Guidance to amplify specific reward dimensions without additional training or external reward models.
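
The reward conflict described in the abstract above can be made concrete with a toy example. The sketch below uses the standard Bradley-Terry preference probability and a hypothetical per-axis "preference outcome vector". The +1/-1/0 encoding, the function names, and the reward values are all assumptions for illustration; the abstract does not state how MCDPO encodes the vector or defines its loss.

```python
import math


def bradley_terry_prob(reward_a, reward_b):
    """Probability that sample A is preferred over B under Bradley-Terry."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def preference_outcome_vector(rewards_a, rewards_b):
    """Per-axis outcome: +1 where A scores higher, -1 where B does, 0 on ties.

    This encoding is an illustrative assumption, not the paper's definition.
    """
    return [(a > b) - (a < b) for a, b in zip(rewards_a, rewards_b)]


if __name__ == "__main__":
    # Hypothetical rewards on two axes: (aesthetic quality, semantic alignment).
    sample_a = [0.9, 0.2]
    sample_b = [0.5, 0.8]
    # Aggregated into one scalar, B is preferred overall ...
    print(bradley_terry_prob(sum(sample_a), sum(sample_b)) < 0.5)  # True
    # ... yet A is better on the aesthetic axis, which the scalar hides.
    print(preference_outcome_vector(sample_a, sample_b))  # [1, -1]
```

In this toy case a scalar-reward objective would push the model away from sample A as a whole, including its stronger aesthetics, whereas a per-axis vector keeps that information available as a conditioning signal.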

[43] Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective

Tian Liu,Anwesha Basu,James Caverlee,Shu Kong

Main category: cs.CV

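TL;DR: This paper proposes SWIFT, a new method for semi-supervised few-shot learning (SSFSL) that improves performance by leveraging open-source vision-language models (VLMs) and their pretraining data. The study finds that existing SSL methods perform poorly when finetuning VLMs because the softmax probability distributions output by VLMs are too flat, so unlabeled data cannot be used effectively. To address this, the authors propose two simple remedies, classifier initialization and temperature tuning, which raise pseudo-label confidence and strengthen supervision signals. Building on this, the SWIFT framework finetunes effectively on limited labeled data, abundant unlabeled data, and noisy data, significantly outperforming existing methods on five benchmarks and approaching fully supervised performance.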

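Details Motivation: Real-world applications such as auto-annotation need effective semi-supervised few-shot learning methods. Although powerful open-source vision-language models and their pretraining data are available, current SSFSL research does not make full use of them, whereas few-shot learning has already exploited them to boost performance. This paper therefore explores how to bring these open resources into SSFSL for more effective learning. Method: First analyzes why existing SSL methods perform poorly when finetuning VLMs and finds that the softmax probability distributions produced by VLMs are too flat, leading to low-quality pseudo-labels and low utilization of unlabeled data. To fix this, proposes two simple but effective techniques, classifier initialization and temperature tuning, to raise pseudo-label confidence. On this basis, designs SWIFT, stage-wise finetuning with temperature tuning, which makes effective use of a small amount of labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Result: Experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by about 5 accuracy points on average. Notably, SWIFT even rivals fully supervised learning that labels the unlabeled data with ground truth, showing strong practicality and potential. Ablation studies further confirm the effectiveness of classifier initialization and temperature tuning. Conclusion: This paper identifies the root cause of the failure of current SSL methods to finetune VLMs in SSFSL: VLM output distributions are too flat. Simple remedies, namely classifier initialization and temperature tuning, markedly improve the utilization of unlabeled data and the strength of supervision signals. The proposed SWIFT framework integrates open-source VLMs and their pretraining data and achieves leading performance on multiple benchmarks, moving SSFSL a substantial step toward real-world use. Abstract: Semi-supervised few-shot learning (SSFSL) formulates real-world applications like "auto-annotation", as it aims to learn a model over a few labeled and abundant unlabeled examples to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. In contrast, the related area of few-shot learning (FSL) has already exploited them to boost performance. Arguably, to achieve auto-annotation in the real world, SSFSL should leverage such open-source resources. To this end, we start by applying established SSL methods to finetune a VLM. Counterintuitively, they significantly underperform FSL baselines. Our in-depth analysis reveals the root cause: VLMs produce rather "flat" distributions of softmax probabilities. This results in zero utilization of unlabeled data and weak supervision signals. We address this issue with embarrassingly simple techniques: classifier initialization and temperature tuning. They jointly increase the confidence scores of pseudo-labels, improving the utilization rate of unlabeled data, and strengthening supervision signals.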
Building on this, we propose: Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to effectively finetune a VLM on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Extensive experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by $\sim$5 accuracy points. SWIFT even rivals supervised learning, which finetunes VLMs with the unlabeled data being labeled with ground truth!
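
To illustrate why a flat softmax distribution leads to zero utilization of unlabeled data, and how temperature tuning can change that, here is a minimal sketch. It assumes a confidence-threshold pseudo-labeling rule of the kind common in SSL methods; the logit values, the 0.95 threshold, the temperature values, and the function names are illustrative assumptions and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(logits, temperature, threshold):
    """Return (class_index, confidence) if confident enough, else None."""
    probs = softmax(logits, temperature)
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), confidence
    return None


if __name__ == "__main__":
    # Hypothetical image-text similarity scores: close together, so softmax is flat.
    logits = [0.30, 0.25, 0.20]
    print(softmax(logits, temperature=1.0))                      # nearly uniform
    print(pseudo_label(logits, temperature=1.0, threshold=0.95))  # None: sample unused
    # A lower temperature sharpens the distribution, so the sample passes the threshold.
    print(pseudo_label(logits, temperature=0.01, threshold=0.95))
```

With the flat distribution no unlabeled sample clears the threshold, so none contributes to training; sharpening the distribution is one way the confidence of pseudo-labels, and hence the utilization rate, can rise.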

[44] RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection

Zhuo Wang,Xiliang Liu,Ligang Sun

Main category: cs.CV

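TL;DR: RobustSora is a new benchmark for evaluating the effect of digital watermarks on AI-generated video detection. It finds that the performance of existing detectors drops by 2-8 percentage points once watermarks are removed, indicating partial reliance on watermark patterns.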

Details Motivation: Existing AIGC video detection benchmarks overlook the possibility that detectors exploit digital watermarks embedded by generative models; the influence of watermarks on detection results needs to be evaluated. Method: Builds a dataset of 6,500 videos covering four types (authentic clean, authentic with spoofed watermarks, generated watermarked, and generated de-watermarked) and designs two tasks: Task-I tests detection of de-watermarked AI videos, and Task-II evaluates the false alarm rate on authentic videos with fake watermarks. Result: Ten detection models show performance changes of 2-8pp under watermark manipulation; transformer-based models show moderate dependency (6-8pp), and MLLMs show diverse patterns (2-8pp). Conclusion: Current AIGC video detectors partially depend on digital watermarks, and watermark-aware training strategies are needed to improve robustness; RobustSora provides an important tool for this line of research. Abstract: The proliferation of AI-generated video technologies poses challenges to information integrity. While recent benchmarks advance AIGC video detection, they overlook a critical factor: many state-of-the-art generative models embed digital watermarks in outputs, and detectors may partially rely on these patterns. To evaluate this influence, we present RobustSora, a benchmark designed to assess watermark robustness in AIGC video detection. We systematically construct a dataset of 6,500 videos comprising four types: Authentic-Clean (A-C), Authentic-Spoofed with fake watermarks (A-S), Generated-Watermarked (G-W), and Generated-DeWatermarked (G-DeW). Our benchmark introduces two evaluation tasks: Task-I tests performance on watermark-removed AI videos, while Task-II assesses false alarm rates on authentic videos with fake watermarks. Experiments with ten models spanning specialized AIGC detectors, transformer architectures, and MLLM approaches reveal performance variations of 2-8pp under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp). These findings indicate partial watermark dependency and highlight the need for watermark-aware training strategies. RobustSora provides essential tools to advance robust AIGC detection research.

[45] THE-Pose: Topological Prior with Hybrid Graph Fusion for Estimating Category-Level 6D Object Pose

Eunho Lee,Chaehyeon Song,Seunghoon Jeong,Ayoung Kim

Main category: cs.CV

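TL;DR: THE-Pose is a new category-level 6D pose estimation framework that combines 2D image context with 3D geometric structure through a topological prior and hybrid graph fusion, showing stronger robustness in complex and occluded scenes.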

Details Motivation: Existing 3D graph convolution methods focus only on local geometry and depth information, struggle with intra-class variation and visual ambiguity, and perform poorly on complex or occluded objects. Method: Proposes THE-Pose, which extracts invariant topological features from the image domain and adaptively fuses them with point-cloud features through a Hybrid Graph Fusion (HGF) module, seamlessly connecting 2D context and 3D structure. Result: Experiments on REAL275 show a 35.8% improvement over the 3D-GC baseline HS-Pose and a 7.2% gain over the previous best method. Conclusion: By introducing a topological prior and hybrid graph fusion, THE-Pose substantially improves the performance and robustness of category-level pose estimation, especially for unseen or complex objects. Abstract: Category-level object pose estimation requires both global context and local structure to ensure robustness against intra-class variations. However, 3D graph convolution (3D-GC) methods only focus on local geometry and depth information, making them vulnerable to complex objects and visual ambiguities. To address this, we present THE-Pose, a novel category-level 6D pose estimation framework that leverages a topological prior via surface embedding and hybrid graph fusion. Specifically, we extract consistent and invariant topological features from the image domain, effectively overcoming the limitations inherent in existing 3D-GC based methods. Our Hybrid Graph Fusion (HGF) module adaptively integrates the topological features with point-cloud features, seamlessly bridging 2D image context and 3D geometric structure. These fused features ensure stability for unseen or complicated objects, even under significant occlusions. Extensive experiments on the REAL275 dataset show that THE-Pose achieves a 35.8% improvement over the 3D-GC baseline (HS-Pose) and surpasses the previous state-of-the-art by 7.2% across all key metrics. The code is available at https://github.com/EHxxx/THE-Pose

[46] GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule

Rui Wang,Yimu Sun,Jingxing Guo,Huisi Wu,Jing Qin

Main category: cs.CV

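TL;DR: This paper proposes GDKVM, a new architecture for echocardiography video segmentation that improves segmentation accuracy and robustness while maintaining real-time performance by introducing Linear Key-Value Association (LKVA), a Gated Delta Rule (GDR), and a Key-Pixel Feature Fusion (KPFF) module.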

Details Motivation: Accurate segmentation of cardiac chambers is essential for quantitative analysis of cardiac function, but imaging noise, artifacts, and the deformation and motion of the heart challenge existing segmentation algorithms, particularly in the trade-off between modeling long-range spatiotemporal dependencies and computational efficiency. Method: Proposes GDKVM, which uses LKVA to model inter-frame correlations, GDR to store intermediate memory states efficiently, and a KPFF module to fuse local and global features at multiple scales. Result: Validated on two mainstream datasets, CAMUS and EchoNet-Dynamic, GDKVM outperforms existing methods in segmentation accuracy and robustness while maintaining real-time performance. Conclusion: GDKVM achieves higher accuracy, robustness, and efficiency in echocardiography video segmentation and has potential for clinical use. Abstract: Accurate segmentation of cardiac chambers in echocardiography sequences is crucial for the quantitative analysis of cardiac function, aiding in clinical diagnosis and treatment. The imaging noise, artifacts, and the deformation and motion of the heart pose challenges to segmentation algorithms. While existing methods based on convolutional neural networks, Transformers, and space-time memory networks have improved segmentation accuracy, they often struggle with the trade-off between capturing long-range spatiotemporal dependencies and maintaining computational efficiency with fine-grained feature representation. In this paper, we introduce GDKVM, a novel architecture for echocardiography video segmentation. The model employs Linear Key-Value Association (LKVA) to effectively model inter-frame correlations, and introduces a Gated Delta Rule (GDR) to efficiently store intermediate memory states. A Key-Pixel Feature Fusion (KPFF) module is designed to integrate local and global features at multiple scales, enhancing robustness against boundary blurring and noise interference. We validated GDKVM on two mainstream echocardiography video datasets (CAMUS and EchoNet-Dynamic) and compared it with various state-of-the-art methods. Experimental results show that GDKVM outperforms existing approaches in terms of segmentation accuracy and robustness, while ensuring real-time performance. Code is available at https://github.com/wangrui2025/GDKVM.
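
For readers unfamiliar with the gated delta rule named in the abstract above, the sketch below shows one common form of it from the linear-attention literature: a key-value memory matrix S is updated as S <- alpha * S (I - beta k k^T) + beta v k^T. Whether GDKVM uses exactly this form is an assumption, since the abstract does not give the update equation; the function names and the scalar gates `alpha` and `beta` are likewise illustrative.

```python
def gated_delta_step(state, key, value, alpha, beta):
    """Apply one gated delta rule update to a key-value memory.

    state: memory matrix as a list of rows, shape (len(value), len(key)).
    key: key vector, assumed to have unit length.
    value: value vector to associate with the key.
    alpha: forget gate in [0, 1] that decays the old memory (assumed scalar).
    beta: write strength in [0, 1] (assumed scalar).
    """
    new_state = []
    for row, v_i in zip(state, value):
        # What the memory currently returns for this key, in this output component.
        recalled = sum(s * k for s, k in zip(row, key))
        new_row = [
            # Erase the old association for this key, decay, then write the new value.
            alpha * (s - beta * recalled * k) + beta * v_i * k
            for s, k in zip(row, key)
        ]
        new_state.append(new_row)
    return new_state


def read_memory(state, query):
    """Read from the memory with a query vector (matrix-vector product)."""
    return [sum(s * q for s, q in zip(row, query)) for row in state]


if __name__ == "__main__":
    memory = [[0.0, 0.0], [0.0, 0.0]]
    key = [1.0, 0.0]
    memory = gated_delta_step(memory, key, [3.0, 4.0], alpha=1.0, beta=1.0)
    print(read_memory(memory, key))  # [3.0, 4.0]
    # Writing a new value under the same key replaces the old one instead of adding to it.
    memory = gated_delta_step(memory, key, [5.0, 6.0], alpha=1.0, beta=1.0)
    print(read_memory(memory, key))  # [5.0, 6.0]
```

The memory has a fixed size regardless of how many frames have been written, which is consistent with the abstract's emphasis on storing intermediate memory states efficiently.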

[47] VLM-NCD:Novel Class Discovery with Vision-Based Large Language Models

Yuetong Su,Baoguo Wei,Xinyu Wang,Xu Li,Lixin Li

Main category: cs.CV

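TL;DR: Proposes LLM-NCD, a multimodal framework that addresses insufficient feature discriminability and long-tail distributions in novel class discovery by fusing visual-textual semantics with prototype-guided clustering, improving unknown-class accuracy on CIFAR-100 by up to 25.3% over existing methods.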

Details Motivation: Existing novel class discovery (NCD) methods based on visual features suffer from insufficient feature discriminability and long-tailed data distributions, which limits their ability to discover unknown classes. Method: Proposes LLM-NCD, which models cluster centres and semantic prototypes by jointly optimising image and text features of known classes, and designs a dual-phase discovery mechanism that dynamically separates known from novel samples using semantic affinity thresholds and adaptive clustering. Result: On CIFAR-100, unknown-class classification accuracy improves by up to 25.3% over existing methods, with distinctive robustness to long-tail distributions. Conclusion: By fusing visual-textual semantics with prototype-guided clustering, LLM-NCD effectively improves novel class discovery, particularly on long-tailed data, and offers a new solution for NCD. Abstract: Novel Class Discovery aims to utilise prior knowledge of known classes to classify and discover unknown classes from unlabelled data. Existing NCD methods for images primarily rely on visual features, which suffer from limitations such as insufficient feature discriminability and the long-tail distribution of data. We propose LLM-NCD, a multimodal framework that breaks this bottleneck by fusing visual-textual semantics and prototype-guided clustering. Our key innovation lies in modelling cluster centres and semantic prototypes of known classes by jointly optimising known class image and text features, and a dual-phase discovery mechanism that dynamically separates known and novel samples via semantic affinity thresholds and adaptive clustering. Experiments on the CIFAR-100 dataset show that compared to the current methods, this method achieves up to 25.3% improvement in accuracy for unknown classes. Notably, our method shows unique resilience to long-tail distributions, a first in NCD literature.

[48] Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

Chen Ziwen,Hao Tan,Peng Wang,Zexiang Xu,Li Fuxin

Main category: cs.CV

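TL;DR: Long-LRM++ combines a semi-explicit scene representation with a lightweight decoder to achieve real-time novel view synthesis from a single forward pass while maintaining high rendering quality, supporting up to 64 input views and outperforming existing methods on multiple datasets.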

Details Motivation: Existing explicit Gaussian splatting methods are sensitive to parameter prediction errors, which blurs fine details, while implicit representations offer high quality but depend on compute-intensive per-frame decoding that prevents real-time rendering. This raises the questions of whether a deep sequential decoding process is necessary and how to combine high quality with real-time speed. Method: Proposes Long-LRM++, which adopts a semi-explicit scene representation (compressing part of the scene information into model weights) with a lightweight decoder, avoiding repeated heavy transformer decoding for every frame, generating millions of Gaussian parameters in a single forward pass, and supporting long input sequences (up to 64 views). Result: On DL3DV, rendering quality matches LaCT while achieving real-time 14 FPS rendering on an A100 GPU; the model supports reconstruction at 950×540 resolution with 64 input views; novel-view depth prediction on ScanNetv2 outperforms Gaussian-based methods. Conclusion: Long-LRM++ balances the quality of implicit representations with the efficiency of explicit methods, showing that high-quality real-time rendering does not require complex per-frame decoding and offering an effective new paradigm for generalizable neural rendering. Abstract: Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods.
Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.

[49] Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation

Hongsin Lee,Hye Won Chung

Main category: cs.CV

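TL;DR: This paper proposes SAAD, a new adversarial distillation method that improves the transfer of adversarial robustness through sample-wise adaptive reweighting.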

Details Motivation: Existing work neglects state-of-the-art robust teacher models, and stronger teachers do not always yield more robust students, a phenomenon known as robust saturation. Method: Proposes a sample-wise adaptive reweighting mechanism based on adversarial transferability (how effective student-crafted adversarial examples are against the teacher), improving the distillation process without additional computational cost. Result: Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD achieves better robustness under AutoAttack than prior methods. Conclusion: Adversarial transferability is a key factor in robustness transfer, and SAAD effectively improves the ability of small, capacity-limited models to learn robustness from strong teachers. Abstract: Adversarial distillation in the standard min-max adversarial training framework aims to transfer adversarial robustness from a large, robust teacher network to a compact student. However, existing work often neglects to incorporate state-of-the-art robust teachers. Through extensive analysis, we find that stronger teachers do not necessarily yield more robust students, a phenomenon known as robust saturation. While typically attributed to capacity gaps, we show that such explanations are incomplete. Instead, we identify adversarial transferability (the fraction of student-crafted adversarial examples that remain effective against the teacher) as a key factor in successful robustness transfer. Based on this insight, we propose Sample-wise Adaptive Adversarial Distillation (SAAD), which reweights training examples by their measured transferability without incurring additional computational cost. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD consistently improves AutoAttack robustness over prior methods. Our code is available at https://github.com/HongsinLee/saad.
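
The abstract above defines adversarial transferability as the fraction of student-crafted adversarial examples that remain effective against the teacher. A minimal sketch of that measurement follows, assuming "effective" means the teacher misclassifies the example. The function name and the toy predictions are hypothetical, and the snippet does not show how SAAD converts this quantity into per-sample weights, which the abstract does not specify.

```python
def adversarial_transferability(teacher_predictions, true_labels):
    """Fraction of student-crafted adversarial examples that fool the teacher.

    teacher_predictions: teacher's predicted class for each adversarial example
        that was crafted against the student.
    true_labels: ground-truth class for each example.
    """
    if len(teacher_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("at least one example is required")
    # An example transfers if the teacher's prediction differs from the label.
    fooled = sum(1 for p, y in zip(teacher_predictions, true_labels) if p != y)
    return fooled / len(true_labels)


if __name__ == "__main__":
    # Toy batch: the teacher is fooled on one of four student-crafted examples.
    print(adversarial_transferability([0, 1, 2, 2], [0, 1, 1, 2]))  # 0.25
```

Since the teacher's outputs on the student's adversarial examples are already computed in a typical adversarial distillation step, a statistic of this kind can plausibly be obtained without extra forward passes, in line with the abstract's claim of no additional computational cost.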

[50] MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

Yixin Wan,Lei Ke,Wenhao Yu,Kai-Wei Chang,Dong Yu

Main category: cs.CV

TL;DR: This paper presents MotionEdit, a new dataset and benchmark for motion-centric image editing, together with MotionNFT, a method that uses motion alignment rewards to improve the accuracy and fidelity of editing models on motion transformations.

Details Motivation: Existing image editing datasets focus mainly on static appearance changes and lack high-quality, dense motion editing data, so they cannot support realistic and plausible dynamic content generation. A dataset and evaluation benchmark dedicated to high-fidelity motion transformations is therefore needed. Method: The MotionEdit dataset contains high-quality image pairs of motion transformations extracted and verified from continuous videos. MotionEdit-Bench serves as the evaluation benchmark, using generative, discriminative, and preference-based metrics. The MotionNFT framework adds a motion alignment reward during fine-tuning, based on how well the optical flow matches, to guide the model toward more accurate motion changes. Result: Experiments show that existing diffusion models perform poorly on this task. MotionNFT substantially improves motion editing quality and motion fidelity on both FLUX.1 Kontext and Qwen-Image-Edit while preserving general editing ability. Conclusion: MotionEdit offers a new research direction and a reliable benchmark for motion-centric image editing, and MotionNFT addresses the motion-consistency shortcomings of current models, advancing dynamic content generation. Abstract: We introduce MotionEdit, a novel dataset for motion-centric image editing, the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.
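
A motion alignment reward of the kind described above could be sketched as follows. The abstract does not state the reward formula, so this is an assumption: the example scores agreement between two flow fields by mean endpoint error and maps it to (0, 1] with an exponential. The function names and the `scale` parameter are hypothetical, and flows are represented as plain lists of `(dx, dy)` vectors.

```python
import math

def mean_endpoint_error(flow_pred, flow_gt):
    """Average Euclidean distance between corresponding flow vectors."""
    errors = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(flow_pred, flow_gt)
    ]
    return sum(errors) / len(errors)

def motion_alignment_reward(flow_pred, flow_gt, scale=1.0):
    """Hypothetical reward: 1.0 for a perfect match, decaying as the flows diverge."""
    return math.exp(-mean_endpoint_error(flow_pred, flow_gt) / scale)
```

Here `flow_pred` would be the flow from the input image to the model-edited image and `flow_gt` the flow from the input to the ground-truth target. A larger `scale` makes the reward more tolerant of small flow deviations.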

[51] ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions

Xiaoxue Wu,Xinyuan Chen,Yaohui Wang,Yu Qiao

Main category: cs.CV

TL;DR: This paper proposes ShotDirector, a framework that combines parameter-level camera control with hierarchical editing-pattern-aware prompting to achieve controllable, film-like shot transitions in multi-shot video generation.

Details Motivation: Existing methods focus mainly on low-level visual consistency across shots and overlook how transition design and cinematographic language shape narrative expression, so their outputs lack professional editing patterns. Method: A camera control module with 6-DoF poses and intrinsic settings is combined with a shot-aware mask mechanism that enables hierarchical prompting, fusing parameter-level conditions with high-level semantic guidance. The authors also construct the ShotWeaver40K dataset and design corresponding evaluation metrics. Result: Experiments show that the framework generates coherent, film-like shot transitions and achieves finer-grained content control in multi-shot video generation. Conclusion: By integrating professional editing patterns with precise camera control, ShotDirector substantially improves the narrative expression and directorial design of multi-shot video generation. Abstract: Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.

[52] Physically Aware 360$^\circ$ View Generation from a Single Image using Disentangled Scene Embeddings

Karthikeya KV,Narendra Bandaru

Main category: cs.CV

TL;DR: Disentangled360 is a 3D-aware technique that combines direction-disentangled volume rendering with single-image 360° view synthesis. It targets medical imaging and natural scene reconstruction, offering high fidelity without scene-specific finetuning.

Details Motivation: Existing methods either oversimplify anisotropic light behavior or generalize poorly across scenes, which limits their use in complex settings such as medical imaging. Method: A dual-branch conditioning framework separates isotropic and anisotropic contributions on top of a Gaussian Splatting backbone. One branch is optimized for CT intensity scattering and the other for real-world RGB scenes, and a hybrid pose-agnostic anchoring method adaptively samples depth and material transitions. Result: The method achieves superior SSIM and LPIPS results on the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets, with runtime suitable for interactive applications. Conclusion: Disentangled360 delivers high-quality, general-purpose 360° view synthesis without scene-specific finetuning or expensive photon simulation, and is applicable to mixed-reality medicine, robotic perception, and immersive content creation. Abstract: We introduce Disentangled360, an innovative 3D-aware technology that integrates the advantages of direction-disentangled volume rendering with single-image 360° unique view synthesis for applications in medical imaging and natural scene reconstruction. In contrast to current techniques that either oversimplify anisotropic light behavior or lack generalizability across various contexts, our framework distinctly differentiates between isotropic and anisotropic contributions inside a Gaussian Splatting backbone. We implement a dual-branch conditioning framework, with one branch optimized for CT-intensity-driven scattering in volumetric data and the other for real-world RGB scenes through normalized camera embeddings. To address scale ambiguity and maintain structural realism, we present a hybrid pose-agnostic anchoring method that adaptively samples scene depth and material transitions, functioning as stable pivots during scene distillation. Our design integrates preoperative radiography simulation and consumer-grade 360° rendering into a single inference pipeline, facilitating rapid, photorealistic view synthesis with inherent directionality. Evaluations on the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets indicate superior SSIM and LPIPS performance, while runtime assessments confirm its viability for interactive applications. Disentangled360 facilitates mixed-reality medical supervision, robotic perception, and immersive content creation, eliminating the necessity for scene-specific finetuning or expensive photon simulations.

[53] Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

Duo Zheng,Shijia Huang,Yanyang Li,Liwei Wang

Main category: cs.CV

TL;DR: This paper proposes Efficient-VLN, a training-efficient vision-language navigation model. It uses a progressive memory and a learnable recursive memory to reduce the cost of long sequences, and a dynamic mixed policy to balance the exploration-efficiency trade-off, reaching state-of-the-art performance on R2R-CE and RxR-CE with much lower training overhead.

Details Motivation: Existing MLLMs for vision-language navigation incur high training overhead, mainly from the quadratic computational burden of long observation histories and from the exploration-efficiency trade-off in DAgger. Method: Two efficient memory mechanisms are designed: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that uses the key-value cache of learnable tokens as the memory state. A dynamic mixed policy balances exploration against training and inference efficiency. Result: The model achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR) while using only 282 H800 GPU hours for training, far fewer than existing state-of-the-art methods. Conclusion: Efficient-VLN alleviates the training overhead of multimodal large models in vision-language navigation while maintaining strong performance, offering a more efficient solution for practical use. Abstract: Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that utilizes the key-value cache of learnable tokens as the memory state. Moreover, we introduce a dynamic mixed policy to balance the exploration-efficiency trade-off. Extensive experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR). Critically, our model consumes merely 282 H800 GPU hours, demonstrating a dramatic reduction in training overhead compared to state-of-the-art methods.
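
The progressive memory idea, giving recent observations more tokens than older ones, can be illustrated with a small budget schedule. The abstract does not specify the allocation rule, so the halving-per-step schedule below, the function name, and the `max_tokens` and `min_tokens` parameters are assumptions chosen only to show the effect on total sequence length.

```python
def progressive_token_budget(num_frames, max_tokens=64, min_tokens=1):
    """Return per-frame token budgets, oldest frame first.

    Hypothetical schedule: the newest frame keeps max_tokens, and the
    budget halves for each step further into the past, floored at min_tokens.
    """
    budgets = []
    for age in range(num_frames - 1, -1, -1):  # age 0 is the newest frame
        budgets.append(max(min_tokens, max_tokens >> age))
    return budgets
```

Under these assumed settings, an 8-frame history uses 128 tokens in total instead of the 512 a uniform 64-token budget would need, which is the kind of saving that matters when attention cost grows quadratically with sequence length.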

[54] DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation

Anh M. Vu,Khang P. Le,Trang T. K. Vo,Ha Thach,Huy Hung Nguyen,David Yang,Han H. Huynh,Quynh Nguyen,Tuan M. Pham,Tuan-Anh Le,Minh H. N. Le,Thanh-Huy Nguyen,Akash Awasthi,Chandra Mohan,Zhu Han,Hien Van Nguyen

Main category: cs.CV

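TL;DR: This paper proposes a prototype-driven framework based on vision-language alignment for weakly supervised semantic segmentation in histopathology. By combining text and image prototypes with a multi-scale module, it mitigates inter-class similarity, intra-class variation, and region shrinkage, improving localization accuracy and segmentation performance.

Details Motivation: Weakly supervised semantic segmentation in histopathology is held back by high annotation cost, inter-class homogeneity, intra-class heterogeneity, and CAM-induced region shrinkage, so better region discovery is needed. Method: CoOp-style learnable prompt tuning generates text prototypes, which are combined with learnable image prototypes to form a dual-modal prototype bank. A multi-scale pyramid module is added to strengthen spatial detail in ViT representations and improve localization quality. Result: The method surpasses existing state-of-the-art methods on the BCSS-WSSS benchmark, and experiments confirm the benefits of text description diversity, context length, and the complementarity of text and image prototypes. Conclusion: Jointly leveraging textual semantics and visual prototype learning improves weakly supervised semantic segmentation in histopathology and offers a new way to reduce annotation dependence. Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.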

[55] ConStruct: Structural Distillation of Foundation Models for Prototype-Based Weakly Supervised Histopathology Segmentation

Khang Le,Ha Thach,Anh M. Vu,Trang T. K. Vo,Han H. Huynh,David Yang,Minh H. N. Le,Thanh-Huy Nguyen,Akash Awasthi,Chandra Mohan,Zhu Han,Hien Van Nguyen

Main category: cs.CV

TL;DR: This paper proposes a prototype learning framework that combines the strengths of CONCH and SegFormer for weakly supervised semantic segmentation of pathology images. Text-guided prototype initialization and a structural distillation mechanism produce high-quality pseudo-masks, improving localization completeness and semantic consistency.

Details Motivation: Existing weakly supervised semantic segmentation methods for pathology images tend to localize only discriminative regions, fail to capture the full spatial extent of tissue structures, and do not preserve fine-grained morphology. Method: The prototype learning framework fuses morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment. Text-guided prototype initialization produces more complete pseudo-masks, and structural distillation transfers SegFormer's spatial knowledge into prototype learning. Result: The method outperforms existing WSSS methods on the BCSS-WSSS datasets, producing high-quality pseudo-masks with more complete localization and stronger semantic consistency while remaining computationally efficient. Conclusion: The framework effectively combines the strengths of vision-language models and modern segmentation backbones, substantially improving weakly supervised pathology segmentation without pixel-level annotations. Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision and without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo-masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on BCSS-WSSS datasets demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.

[56] Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset

Hyunsoo Lee,Daeum Jeon,Hyeokjae Oh

Main category: cs.CV

TL;DR: This paper proposes Point2Pose, a new 3D human pose estimation method based on point clouds and pose history that combines spatio-temporal encoding with an attention-based generative model, and releases MVPose3D, a large-scale multimodal dataset.

Details Motivation: 3D human pose estimation is challenging because of the complex geometry of the human body, self-occluding joints, and the lack of large-scale real-world motion data. Method: The Point2Pose framework uses a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor for pose prediction. Result: Experiments show that the method outperforms baseline models on multiple datasets. Conclusion: Point2Pose effectively models the conditional distribution of human poses and, together with the newly introduced MVPose3D dataset, advances 3D human pose estimation. Abstract: We propose a novel generative approach for 3D human pose estimation. 3D human pose estimation poses several key challenges due to the complex geometry of the human body, self-occluding joints, and the requirement for large-scale real-world motion datasets. To address these challenges, we introduce Point2Pose, a framework that effectively models the distribution of human poses conditioned on sequential point cloud and pose history. Specifically, we employ a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor. Additionally, we present a large-scale indoor dataset MVPose3D, which contains multiple modalities, including IMU data of non-trivial human motions, dense multi-view point clouds, and RGB images. Experimental results show that the proposed method outperforms the baseline models, demonstrating its superior performance across various datasets.

[57] EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs

Chao Gong,Depeng Wang,Zhipeng Wei,Ya Guo,Huijia Zhu,Jingjing Chen

Main category: cs.CV

TL;DR: This paper proposes EchoingPixels, a framework that addresses the computational overhead caused by the large number of audio and video tokens in audio-visual large language models. Its core is the Cross-Modal Semantic Sieve (CS2), which reduces tokens adaptively by processing a joint audio-visual token pool, combined with Sync-RoPE to preserve temporal modeling, cutting computation substantially while maintaining performance.

Details Motivation: Existing token reduction methods are mostly unimodal or assign static budgets, so they cannot exploit cross-modal synergy between audio and video and struggle with the differing, dynamically changing information densities of the two modalities. A dynamic, jointly optimized reduction mechanism is needed. Method: EchoingPixels has two key components: (1) the Cross-Modal Semantic Sieve (CS2), which performs joint audio-visual attention early and dynamically selects important tokens from a unified audio-visual pool, and (2) Synchronization-Augmented RoPE (Sync-RoPE), which preserves temporal relationships among the sparsely selected tokens so that temporal modeling is not degraded. Result: Experiments show that EchoingPixels matches strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction. Conclusion: Through joint cross-modal reduction and temporally aware positional encoding, EchoingPixels provides efficient token reduction for audio-visual large models and a new direction for lightweight multimodal models. Abstract: Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and reduces tokens from an entire combined pool of audio-visual tokens rather than using fixed budgets per modality. This single-pool approach allows it to adaptively allocate the token budget across both modalities and dynamically identify salient tokens in concert. To ensure this aggressive reduction preserves the vital temporal modeling capability, we co-design a Synchronization-Augmented RoPE (Sync-RoPE) to maintain critical temporal relationships for the sparsely selected tokens. Extensive experiments demonstrate that EchoingPixels achieves performance comparable to strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction.
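
The difference between a single shared token pool and fixed per-modality budgets can be sketched with a plain top-k selection. This is not the CS2 module itself: the saliency scores are taken as given inputs (in the paper they would come from joint attention), and the function name and `keep_ratio` parameter are hypothetical.

```python
import math

def single_pool_select(audio_scores, video_scores, keep_ratio):
    """Keep the top-scoring tokens from one combined audio-visual pool.

    Returns sorted (modality, index) pairs. The split between modalities
    is not fixed in advance; it follows from the scores.
    """
    pool = [("audio", i, s) for i, s in enumerate(audio_scores)]
    pool += [("video", i, s) for i, s in enumerate(video_scores)]
    k = max(1, math.ceil(keep_ratio * len(pool)))
    top = sorted(pool, key=lambda item: item[2], reverse=True)[:k]
    return sorted((modality, index) for modality, index, _ in top)
```

With a fixed 50% budget per modality, a clip whose audio is mostly silence would still keep half of its audio tokens. The shared pool instead lets the budget shift toward whichever modality carries more salient tokens at that moment.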

[58] StainNet: A Special Staining Self-Supervised Vision Transformer for Computational Pathology

Jiawen Li,Jiali Hu,Xitong Ling,Yongqiang Lv,Yuxuan Chen,Yizhi Wang,Tian Guan,Yifei Liu,Yonghong He

Main category: cs.CV

TL;DR: StainNet is a pathology foundation model built on the vision transformer architecture. It is pre-trained with self-distillation self-supervised learning specifically on special-stain pathology images and shows strong performance on liver malignancy classification and retrieval tasks.

Details Motivation: Existing pathology foundation models are pre-trained mainly on H&E-stained images and adapt poorly to the special-stain images common in clinical practice, which limits their applicability. A foundation model dedicated to special stains is therefore needed. Method: StainNet is based on the ViT architecture and uses a self-distillation self-supervised learning strategy, pre-trained on more than 1.4 million special-stain image patches from the HISTAI database. Result: Experiments show that StainNet performs well on an in-house slide-level liver malignancy classification task and two public ROI-level datasets, has strong few-shot learning and image retrieval ability, and outperforms recent, larger pathology foundation models. Conclusion: StainNet is the first foundation model designed specifically for special-stain pathology images. It substantially improves analysis of such images, has broad clinical potential, and its weights are publicly released. Abstract: Foundation models trained with self-supervised learning (SSL) on large-scale histological images have significantly accelerated the development of computational pathology. These models can serve as backbones for region-of-interest (ROI) image analysis or patch-level feature extractors in whole-slide images (WSIs) based on multiple instance learning (MIL). Existing pathology foundation models (PFMs) are typically pre-trained on Hematoxylin-Eosin (H&E) stained pathology images. However, images with special stains, such as immunohistochemistry, are also frequently used in clinical practice. PFMs pre-trained mainly on H&E-stained images may be limited in clinical applications involving special stains. To address this issue, we propose StainNet, a specialized foundation model for special stains based on the vision transformer (ViT) architecture. StainNet adopts a self-distillation SSL approach and is trained on over 1.4 million patch images cropped from 20,231 publicly available special staining WSIs in the HISTAI database. To evaluate StainNet, we conduct experiments on an in-house slide-level liver malignancy classification task and two public ROI-level datasets to demonstrate its strong ability. We also perform few-ratio learning and retrieval evaluations, and compare StainNet with recent, larger PFMs to further highlight its strengths. We have released the StainNet model weights at: https://huggingface.co/JWonderLand/StainNet.

[59] Simple Yet Effective Selective Imputation for Incomplete Multi-view Clustering

Cai Xu,Jinlong Liu,Yilin Zhang,Ziyu Guan,Wei Zhao

Main category: cs.CV

TL;DR: This paper proposes ISMVC, an informativeness-based selective imputation method for multi-view clustering. It scores the imputation-relevant informativeness of each missing position, imputes only where there is sufficient support, and pairs this with a variational autoencoder that learns clustering-friendly latent representations, improving clustering of incomplete multi-view data under unbalanced missingness.

Details Motivation: Existing imputation-based methods tend to introduce noise and bias on incomplete multi-view data, while imputation-free methods lack cross-view complementarity under severe missingness and cluster poorly. Method: ISMVC evaluates the imputation-relevant informativeness of each missing position from intra-view similarity and cross-view consistency and imputes selectively. It is combined with a variational autoencoder with a mixture-of-Gaussians prior, which performs distribution-level imputation and models uncertainty for robust fusion and to prevent overconfident reconstructions. Result: The method is validated on multiple benchmark datasets and outperforms existing imputation-based and imputation-free methods in a more realistic and challenging unbalanced missing scenario. Conclusion: With a data-driven, lightweight, and model-agnostic selective imputation strategy, ISMVC achieves more robust multi-view clustering and can be integrated into existing models as a plug-in module. Abstract: Incomplete multi-view data, where different views suffer from missing and unbalanced observations, pose significant challenges for clustering. Existing imputation-based methods attempt to estimate missing views to restore data associations, but indiscriminate imputation often introduces noise and bias, especially when the available information is insufficient. Imputation-free methods avoid this risk by relying solely on observed data, but struggle under severe incompleteness due to the lack of cross-view complementarity. To address this issue, we propose Informativeness-based Selective imputation Multi-View Clustering (ISMVC). Our method evaluates the imputation-relevant informativeness of each missing position based on intra-view similarity and cross-view consistency, and selectively imputes only when sufficient support is available. Furthermore, we integrate this selection with a variational autoencoder equipped with a mixture-of-Gaussians prior to learn clustering-friendly latent representations. By performing distribution-level imputation, ISMVC not only stabilizes the aggregation of posterior distributions but also explicitly models imputation uncertainty, enabling robust fusion and preventing overconfident reconstructions. Compared with existing cautious imputation strategies that depend on training dynamics or model feedback, our method is lightweight, data-driven, and model-agnostic. It can be readily integrated into existing IMC models as a plug-in module. Extensive experiments on multiple benchmark datasets under a more realistic and challenging unbalanced missing scenario demonstrate that our method outperforms both imputation-based and imputation-free approaches.

[60] Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Yiwen Tang,Zoey Guo,Kaixin Zhu,Ray Zhang,Qizhi Chen,Dongzhi Jiang,Junli Liu,Bohan Zeng,Haoming Song,Delin Qu,Tianyi Bai,Dan Xu,Wentao Zhang,Bin Zhao

Main category: cs.CV

TL;DR: This paper presents the first systematic study of reinforcement learning (RL) for text-to-3D autoregressive generation. It proposes the Hi-GRPO method and the MME-3DR benchmark, and releases AR3D-R1, the first RL-enhanced text-to-3D model.

Details Motivation: 3D objects have higher spatial complexity and require both globally consistent geometry and fine-grained local textures, so existing 2D RL methods are hard to apply directly to 3D generation, and there is no benchmark for implicit reasoning ability. A systematic study of reward design and algorithm choice for RL in 3D generation is therefore needed. Method: The study covers four dimensions: (1) reward design, evaluating how effective multi-modal models are as reward signals; (2) RL algorithms, studying GRPO variants and the scaling of data and training iterations; (3) a new benchmark, MME-3DR, for measuring the reasoning ability of 3D generation models; and (4) Hi-GRPO, a hierarchical RL framework that optimizes from global to local. Result: General multi-modal models are shown to provide robust feedback on 3D attributes, token-level optimization is found to be more effective, and Hi-GRPO substantially improves generation quality. The resulting AR3D-R1 model performs strongly throughout the process from shape generation to texture refinement. The new MME-3DR benchmark is released together with code. Conclusion: Sound reward design and a hierarchical RL framework can effectively advance text-to-3D generation, providing a viable path and technical basis for future 3D generation systems with reasoning ability. Abstract: Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has recently been extended successfully to enhance 2D image generation. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation highly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide a robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, which excels from coarse shape generation to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at https://github.com/Ivan-Tang-3D/3DGen-R1.

[61] A Conditional Generative Framework for Synthetic Data Augmentation in Segmenting Thin and Elongated Structures in Biological Images

Yi Liu,Yichi Zhang

Main category: cs.CV

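TL;DR: This paper proposes a Pix2Pix-based conditional generative framework that produces realistic filament images for microscopy from binary masks, with a filament-aware structural loss to improve generation quality, easing the shortage of high-quality annotated data for filament segmentation.

Details Motivation: Filaments are densely distributed and have distinctive morphology, so manually annotating them in microscopy images is extremely laborious. The resulting scarcity of high-quality pixel-level annotations limits the use of deep learning for filament segmentation. Method: A conditional generative model based on the Pix2Pix architecture generates realistic filament microscopy images from binary masks, and a filament-aware structural loss increases the structural similarity between generated and real images. Result: Experiments show that the method generates highly realistic filament images and that models trained with them outperform existing models trained without synthetic data. Conclusion: The method eases the shortage of annotated data in filament segmentation and offers a practical data augmentation solution for biological image analysis. Abstract: Thin and elongated filamentous structures, such as microtubules and actin filaments, often play important roles in biological systems. Segmenting these filaments in biological images is a fundamental step for quantitative analysis. Recent advances in deep learning have significantly improved the performance of filament segmentation. However, acquiring high-quality pixel-level annotated datasets for filamentous structures remains a major challenge, as the dense distribution and geometric properties of filaments make manual annotation extremely laborious and time-consuming. To address the data shortage problem, we propose a conditional generative framework based on the Pix2Pix architecture to generate realistic filaments in microscopy images from binary masks. We also propose a filament-aware structural loss to improve the structural similarity when generating synthetic images. Our experiments demonstrate the effectiveness of our approach, which outperforms existing models trained without synthetic data.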

[62] Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution

Yi-Cheng Liao,Shyang-En Weng,Yu-Syuan Xu,Chi-Wei Hsiao,Wei-Chen Chiu,Ching-Chun Huang

Main category: cs.CV

TL;DR: This paper proposes HD-CLIP, a hierarchical degradation-aware CLIP model for real-world image super-resolution that works as a plug-and-play module to improve the restoration quality of diffusion models.

Details Motivation: Existing methods rely on CLIP text encoders and assume the degradation severity is known, so they cannot capture numerical degradation information and generalize poorly. Method: HD-CLIP decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding, and is integrated into diffusion models through a classifier-free projection guidance (CFPG) mechanism. Result: HD-CLIP improves detail fidelity and perceptual realism on multiple real-world datasets, supports interpolation to unseen degradation levels, and can be integrated without training. Conclusion: As a plug-and-play module, HD-CLIP improves the performance and robustness of diffusion models for real-world image super-resolution. Abstract: Real-World Image Super-Resolution (Real-ISR) aims to recover high-quality images from low-quality inputs degraded by unknown and complex real-world factors. Real-world scenarios involve diverse and coupled degradations, making it necessary to provide diffusion models with richer and more informative guidance. However, existing methods often assume known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, limiting their generalization ability. To address this, we propose HD-CLIP (Hierarchical Degradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen levels. Furthermore, we integrate it into diffusion models via classifier-free guidance (CFG) and propose classifier-free projection guidance (CFPG). HD-CLIP leverages semantic cues to guide generative restoration while using degradation cues to suppress undesired hallucinations and artifacts. As a plug-and-play module, HD-CLIP can be seamlessly integrated into various super-resolution frameworks without training, significantly improving detail fidelity and perceptual realism across diverse real-world datasets.
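
For reference, the standard classifier-free guidance (CFG) update that the abstract builds on is usually written as below. The symbols ($\epsilon_\theta$ for the noise predictor, $c$ for the condition, $\varnothing$ for the null condition, $w$ for the guidance scale) are the conventional ones rather than notation taken from the paper, and the proposed CFPG variant is not specified in the abstract, so it is not shown.

```latex
\tilde{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
```

Setting $w = 1$ recovers the purely conditional prediction, while $w > 1$ pushes the sample further toward the condition. In HD-CLIP, the condition would presumably carry the semantic and degradation embeddings.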

[63] CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

Shresth Grover,Priyank Pathak,Akash Kumar,Vibhav Vineet,Yogesh S Rawat

Main category: cs.CV

TL;DR: This paper introduces CoSPlan, a new benchmark for evaluating large-scale vision-language models (VLMs) on error-prone visual sequential planning tasks, and proposes Scene Graph Incremental updates (SGI), a training-free method that improves model performance.

Details Motivation: Although large-scale vision-language models show strong reasoning ability, visual sequential planning, especially with non-optimal steps, remains underexplored, and models struggle to detect and correct erroneous steps. Method: The CoSPlan benchmark spans four domains and evaluates the error detection and step completion abilities of VLMs. The SGI method inserts scene-graph-based incremental reasoning steps between the initial and goal states, improving performance without training. Result: Existing VLMs (e.g., Intern-VLM and Qwen2) perform poorly on CoSPlan. SGI yields an average performance gain of 5.2% and generalizes to traditional planning tasks such as Plan-Bench and VQA. Conclusion: Corrective sequential planning remains a significant challenge for vision-language models, and SGI improves their sequential reasoning and robustness by introducing intermediate reasoning steps. Abstract: Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal actions) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g., Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.

[64] Topology-Agnostic Animal Motion Generation from Text Prompt

Keyi Chen,Mingze Sun,Zhenyu Liu,Zhangquan Chen,Ruqi Huang

Main category: cs.CV

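TL;DR: This paper proposes a general autoregressive framework that generates text-driven motion for arbitrary skeletal topologies, and releases OmniZoo, a large-scale animal motion dataset.

Details Motivation: Existing motion generation methods depend on fixed skeletal templates and generalize poorly to different or perturbed skeletal topologies, and the field lacks both large-scale heterogeneous animal motion data and a unified generative framework. Method: The authors build OmniZoo, a large-scale animal motion dataset with 140 species and 32,979 sequences with multimodal annotations. They propose a generalized autoregressive motion generation framework whose core is a topology-aware skeleton embedding module that encodes the geometric and structural properties of any skeleton into a shared token space for fusion with textual semantics. Result: Given a text prompt and a target skeleton, the method generates temporally coherent, physically plausible, and semantically aligned motions, and supports cross-species motion style transfer. Conclusion: The framework removes the traditional dependence on a fixed skeletal structure, enabling text-driven motion generation for arbitrary skeletal topologies and advancing the diversity and generality of motion generation. Abstract: Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods: the combined lack of large-scale heterogeneous animal motion data and unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.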

[65] Hybrid Transformer-Mamba Architecture for Weakly Supervised Volumetric Medical Segmentation

Yiheng Lyu,Lian Xu,Mohammed Bennamoun,Farid Boussaid,Coen Arrow,Girish Dwivedi

Main category: cs.CV

TL;DR: This paper proposes TranSamba, a hybrid Transformer-Mamba architecture for weakly supervised 3D medical image segmentation. Its Cross-Plane Mamba blocks capture 3D context effectively, achieving state-of-the-art performance on multiple datasets.

Details Motivation: Existing weakly supervised semantic segmentation methods mostly rely on 2D encoders and ignore the volumetric nature of medical images, leaving contextual information underused. Method: TranSamba uses a Vision Transformer backbone and adds Cross-Plane Mamba blocks, which exploit the linear complexity of state space models for efficient information exchange across slices and enhance within-slice self-attention, improving object localization. Result: Experiments on three medical imaging datasets show that TranSamba outperforms existing methods across modalities and pathologies, with time complexity linear in volume depth and constant memory usage for batch processing. Conclusion: TranSamba models 3D context effectively and substantially improves weakly supervised medical image segmentation, advancing label-efficient, high-accuracy segmentation. Abstract: Weakly supervised semantic segmentation offers a label-efficient solution to train segmentation models for volumetric medical imaging. However, existing approaches often rely on 2D encoders that neglect the inherent volumetric nature of the data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context for weakly supervised volumetric medical segmentation. TranSamba augments a standard Vision Transformer backbone with Cross-Plane Mamba blocks, which leverage the linear complexity of state space models for efficient information exchange across neighboring slices. The information exchange enhances the pairwise self-attention within slices computed by the Transformer blocks, directly contributing to the attention maps for object localization. TranSamba achieves effective volumetric modeling with time complexity that scales linearly with the input volume depth and maintains constant memory usage for batch processing. Extensive experiments on three datasets demonstrate that TranSamba establishes new state-of-the-art performance, consistently outperforming existing methods across diverse modalities and pathologies. Our source code and trained models are openly accessible at: https://github.com/YihengLyu/TranSamba.

[66] mmCounter: Static People Counting in Dense Indoor Scenarios Using mmWave Radar

Tarik Reza Toha,Shao-Jung Lu,Shahriar Nirjon

Main category: cs.CV

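TL;DR: mmCounter is a new method that uses mmWave radar to extract ultra-low-frequency signals (such as breathing and small body movements) to count people accurately in dense static groups, relying on a multi-stage signal processing pipeline to count stationary people with high accuracy.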

Details Motivation: Existing mmWave radar systems are limited by spatial resolution and by their reliance on motion when detecting dense static groups, so they struggle to count people accurately; most prior work also assumes the number of people is known, which does not fit real-world counting where it is unknown. Method: Proposes a novel multi-stage signal processing pipeline that extracts human-related ultra-low-frequency signals (< 1 Hz), uses spatial information to separate individuals, and filters out background noise and interference from static objects to count stationary people accurately. Result: Across multiple environments, mmCounter achieves an 87% average F1 score and a mean absolute error of 0.6 in familiar environments, and 60% and 1.1 in unseen environments; it counts up to seven people accurately in a three-square-meter space. Conclusion: mmCounter removes the conventional dependence of mmWave radar on motion for detection and, for the first time, counts people effectively in high-density static settings, showing good potential for practical use. Abstract: mmWave radars struggle to detect or count individuals in dense, static (non-moving) groups due to limitations in spatial resolution and reliance on movement for detection. We present mmCounter, which accurately counts static people in dense indoor spaces (up to three people per square meter). mmCounter achieves this by extracting ultra-low frequency (< 1 Hz) signals, primarily from breathing and micro-scale body movements such as slight torso shifts, and applying novel signal processing techniques to differentiate these subtle signals from background noise and nearby static objects. Our problem differs significantly from existing studies on breathing rate estimation, which assume the number of people is known a priori. In contrast, mmCounter utilizes a novel multi-stage signal processing pipeline to extract relevant low-frequency sources along with their spatial information and map these sources to individual people, enabling accurate counting. Extensive evaluations in various environments demonstrate that mmCounter delivers an 87% average F1 score and 0.6 mean absolute error in familiar environments, and a 60% average F1 score and 1.1 mean absolute error in previously untested environments. It can count up to seven individuals in a three square meter space, such that there is no side-by-side spacing and only a one-meter front-to-back distance.
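To make the sub-1 Hz idea concrete, here is a minimal sketch that measures how much of a slow-time signal's energy lies below 1 Hz. It is not mmCounter's pipeline, which the abstract does not specify; the function name, the 10 Hz sampling rate, and the two synthetic traces are assumptions chosen for illustration.

```python
import cmath
import math


def low_freq_energy_ratio(signal, fs, cutoff_hz=1.0):
    """Fraction of non-DC spectral energy that lies below cutoff_hz."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # drop the DC (static clutter) term
    low, total = 0.0, 0.0
    for k in range(1, n // 2 + 1):  # positive-frequency bins only
        # Naive DFT bin; O(n^2) overall, fine for a short illustrative trace.
        x_k = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                  for t, c in enumerate(centered))
        power = abs(x_k) ** 2
        total += power
        if k * fs / n < cutoff_hz:
            low += power
    return low / total if total > 0 else 0.0


fs = 10.0  # hypothetical slow-time sampling rate in Hz
t = [i / fs for i in range(200)]
breathing = [math.sin(2 * math.pi * 0.25 * x) for x in t]  # about 15 breaths/min
fidgeting = [math.sin(2 * math.pi * 3.0 * x) for x in t]   # faster body motion
```

A breathing-like trace concentrates almost all of its energy below the cutoff, while faster motion does not. A real system would likely use an FFT and a proper band-pass filter rather than this naive DFT.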

[67] Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

Sunqi Fan,Jiashuo Cui,Meng-Hao Guo,Shuojin Yang

Main category: cs.CV

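TL;DR: Proposes a Video Toolkit and a Spatiotemporal Reasoning Framework (STAR) that strengthen the spatiotemporal reasoning of multimodal large language models (MLLMs) on video question answering (VideoQA); augmenting GPT-4o with lightweight tools improves its performance on VideoMME and LongVideoBench.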

Details Motivation: On complex, reasoning-intensive VideoQA tasks, existing MLLMs struggle to model spatial relationships within video frames and the causal dynamics of temporal evolution at the same time. Method: Designs a comprehensive and extensible Video Toolkit and proposes the STAR framework, which strategically schedules temporal and spatial tools to progressively localize the key area in the video, strengthening the MLLM's spatiotemporal reasoning. Result: Augmenting GPT-4o with lightweight tools yields an 8.2% gain on VideoMME and 4.6% on LongVideoBench. Conclusion: The proposed Video Toolkit and STAR framework are an important step toward autonomous, intelligent video analysis assistants. Abstract: The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle with simultaneously modeling spatial relationships within video frames and understanding the causal dynamics of temporal evolution in complex, reasoning-intensive VideoQA tasks. In this work, we equip MLLMs with a comprehensive and extensible Video Toolkit to enhance their spatiotemporal reasoning capabilities and ensure harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework make an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at https://github.com/fansunqi/VideoTool.

[68] Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

Woojun Jung,Jaehoon Go,Mingyu Jeon,Sunjae Yoon,Junyeong Kim

Main category: cs.CV

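TL;DR: Proposes Visual Funnel, a training-free, two-step method that addresses "Contextual Blindness" in the fine-grained visual perception of multimodal large language models by using attention entropy to dynamically build a hierarchically structured portfolio of crops that preserves context from local to global.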

Details Motivation: Multimodal Large Language Models (MLLMs) reason well but are often limited on tasks that need fine-grained visual understanding because they miss image details. Cropping salient regions improves detail perception but severs the link between local detail and global context, causing "Contextual Blindness"; this work targets that structural gap. Method: Proposes Visual Funnel, which has two steps: Contextual Anchoring first locates the region of interest in a single forward pass; the method then determines crop sizes dynamically from attention entropy and refines crop centers to build an Entropy-Scaled Portfolio of multi-scale crops that preserves hierarchical context from focal detail to the surrounding background. No additional training is required. Result: Experiments show that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines; they also show that blindly adding more unstructured crops brings limited benefit or even harm, confirming that hierarchical structure is key to mitigating Contextual Blindness. Conclusion: Contextual Blindness stems from insufficient structural diversity in the input rather than insufficient information; by building structured multi-scale input, Visual Funnel improves MLLMs' understanding of fine-grained visual content without training, demonstrating the importance of input structure design. Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to a structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines.
Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.
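The entropy-to-crop-size step can be pictured with a small sketch. The abstract does not give the actual mapping, so the linear rule below, the `min_frac` floor, and the `scales` tuple are all hypothetical; only the idea that attention entropy drives crop size comes from the text. Crop-center refinement is omitted.

```python
import math


def normalized_entropy(attention):
    """Shannon entropy of a flattened attention map, scaled to [0, 1]."""
    if len(attention) < 2:
        return 0.0
    total = sum(attention)
    probs = [a / total for a in attention if a > 0]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(attention))


def crop_portfolio(attention, image_side, min_frac=0.2, scales=(1.0, 1.5, 2.0)):
    """Hypothetical rule: diffuse attention (high entropy) gives a larger focal crop."""
    h = normalized_entropy(attention)
    focal = image_side * (min_frac + (1.0 - min_frac) * h)
    # Nested crops from focal detail outward, clipped to the image size.
    return [min(image_side, round(focal * s)) for s in scales]
```

Sharply peaked attention yields a tight focal crop with progressively wider companions, while uniform attention falls back to the full image at every scale.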

[69] Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos

Mingyu Jeon,Jisoo Yang,Sungjin Han,Jinkwon Hwang,Sunjae Yoon,Jonghee Kim,Junyeoung Kim

Main category: cs.CV

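TL;DR: Proposes P2S, a training-free framework for zero-shot long video moment retrieval; adaptive span generation and query decomposition address candidate explosion in the search phase and high computational cost in the refine phase, and the framework significantly outperforms existing supervised methods.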

Details Motivation: Existing long video moment retrieval methods are limited in scalability, generalization, and computational efficiency; in the zero-shot setting in particular, candidate explosion in the search phase and reliance on high-cost vision-language models (VLMs) in the refine phase make them inefficient. Method: Proposes the P2S framework with two key components: an Adaptive Span Generator that avoids producing too many candidate segments in the search phase, and a Query Decomposition strategy that refines candidates without relying on high-cost VLMs. The whole framework is training-free and enables efficient zero-shot inference. Result: P2S significantly outperforms existing supervised methods on datasets such as MAD (for example, +3.7% on R5@0.1) and is the first zero-shot framework capable of temporal grounding in hour-long videos. Conclusion: P2S overcomes the inefficient search and costly refinement of zero-shot long video moment retrieval and offers a new direction for efficient, scalable video understanding. Abstract: Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a 'Search-then-Refine' approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a 'search' phase candidate explosion, and (2) the 'refine' phase, which is vulnerable to semantic discrepancy, requires high-cost VLMs for verification, incurring significant computational overhead. We propose Point-to-Span (P2S), a novel training-free framework to overcome this challenge of inefficient 'search' and costly 'refine' phases. P2S overcomes these challenges with two key innovations: an 'Adaptive Span Generator' to prevent the search phase candidate explosion, and 'Query Decomposition' to refine candidates without relying on high-cost VLM verification.
To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7% on R5@0.1 on MAD).

[70] Breaking the Vicious Cycle: Coherent 3D Gaussian Splatting from Sparse and Motion-Blurred Views

Zhankuo Xu,Chaoran Feng,Yingtao Li,Jianbin Zhao,Jiashu Yang,Wangbo Yu,Li Yuan,Yonghong Tian

Main category: cs.CV

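TL;DR: Proposes CoherentGS, a new framework for high-fidelity 3D reconstruction from sparse, motion-blurred images; a dual-prior strategy that combines a deblurring network with a diffusion model substantially improves 3D Gaussian Splatting under poor input conditions.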

Details Motivation: 3D Gaussian Splatting (3DGS) depends on dense, high-quality images; its performance degrades severely on realistic sparse and blurry inputs, where sparse views and motion blur aggravate each other and cause reconstruction to fail. Method: Proposes a dual-prior strategy: a pre-trained deblurring network restores detail and provides photometric guidance, while a diffusion model supplies geometric priors to fill in unobserved regions; a consistency-guided camera exploration module and a depth regularization loss improve geometric plausibility. Result: On synthetic and real-world scenes with only 3, 6, or 9 input views, CoherentGS significantly outperforms existing methods in both quantitative and qualitative experiments, setting a new state of the art for this task. Conclusion: CoherentGS breaks the vicious cycle between sparse views and motion blur, achieves high-quality 3D reconstruction from low-quality inputs, and advances the use of 3DGS in real-world scenes. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art method for novel view synthesis. However, its performance heavily relies on dense, high-quality input imagery, an assumption that is often violated in real-world applications, where data is typically sparse and motion-blurred. These two issues create a vicious cycle: sparse views ignore the multi-view constraints necessary to resolve motion blur, while motion blur erases high-frequency details crucial for aligning the limited views. Thus, reconstruction often fails catastrophically, with fragmented views and a low-frequency bias. To break this cycle, we introduce CoherentGS, a novel framework for high-fidelity 3D reconstruction from sparse and blurry images. Our key insight is to address these compound degradations using a dual-prior strategy. Specifically, we combine two pre-trained generative models: a specialized deblurring network for restoring sharp details and providing photometric guidance, and a diffusion model that offers geometric priors to fill in unobserved regions of the scene. This dual-prior strategy is supported by several key techniques, including a consistency-guided camera exploration module that adaptively guides the generative process, and a depth regularization loss that ensures geometric plausibility. We evaluate CoherentGS through both quantitative and qualitative experiments on synthetic and real-world scenes, using as few as 3, 6, and 9 input views. Our results demonstrate that CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for this challenging task.
The code and video demos are available at https://potatobigroom.github.io/CoherentGS/.

[71] RaLiFlow: Scene Flow Estimation with 4D Radar and LiDAR Point Clouds

Jingyun Fu,Zhiyu Xiang,Na Zhao

Main category: cs.CV

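TL;DR: Proposes RaLiFlow, the first joint learning framework for scene flow estimation that combines 4D millimeter-wave radar and LiDAR, and builds a matching dataset; a novel dynamic-aware bidirectional cross-modal fusion module and tailored loss functions address the noise and sparsity of radar data, and the method significantly outperforms existing single-modal methods in scene flow estimation.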

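Details Motivation: Existing multimodal fusion methods have not explored fusing 4D millimeter-wave radar with LiDAR for scene flow estimation. Radar is cheap, robust to environmental conditions, and can measure point-wise velocity, making it a strong complement to LiDAR, but it is noisy, low-resolution, and sparse, and no dedicated dataset exists. Method: Builds a radar-LiDAR scene flow dataset from a public real-world autonomous driving dataset; proposes an effective preprocessing strategy for radar denoising and scene flow label generation; and designs the RaLiFlow framework, which includes a Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module that injects radar's dynamic cues into a local cross-attention mechanism, together with a carefully designed set of loss functions that reduce the influence of unreliable radar data and improve instance-level consistency. Result: Extensive experiments on the repurposed scene flow dataset show that the method significantly outperforms existing LiDAR-based and radar-based single-modal methods, with particularly strong results in dynamic foreground areas. Conclusion: RaLiFlow is the first framework for joint radar-LiDAR scene flow estimation; through effective cross-modal fusion and training mechanisms it exploits radar's velocity information and environmental robustness, advancing low-cost, robust multimodal perception systems. Abstract: Recent multimodal fusion methods, integrating images with LiDAR point clouds, have shown promise in scene flow estimation. However, the fusion of 4D millimeter wave radar and LiDAR remains unexplored. Unlike LiDAR, radar is cheaper, more robust in various weather conditions and can detect point-wise velocity, making it a valuable complement to LiDAR. However, radar inputs pose challenges due to noise, low resolution, and sparsity. Moreover, there is currently no dataset that combines LiDAR and radar data specifically for scene flow estimation. To address this gap, we construct a Radar-LiDAR scene flow dataset based on a public real-world automotive dataset. We propose an effective preprocessing strategy for radar denoising and scene flow label generation, deriving more reliable flow ground truth for radar points out of the object boundaries. Additionally, we introduce RaLiFlow, the first joint scene flow learning framework for 4D radar and LiDAR, which achieves effective radar-LiDAR fusion through a novel Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module and a carefully designed set of loss functions. The DBCF module integrates dynamic cues from radar into the local cross-attention mechanism, enabling the propagation of contextual information across modalities.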
Meanwhile, the proposed loss functions mitigate the adverse effects of unreliable radar data during training and enhance the instance-level consistency in scene flow predictions from both modalities, particularly for dynamic foreground areas. Extensive experiments on the repurposed scene flow dataset demonstrate that our method outperforms existing LiDAR-based and radar-based single-modal methods by a significant margin.

[72] Self-Supervised Contrastive Embedding Adaptation for Endoscopic Image Matching

Alberto Rota,Elena De Momi

Main category: cs.CV

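TL;DR: Proposes a novel self-supervised deep learning framework for feature matching in endoscopic image pairs; it uses novel-view synthesis to generate ground-truth correspondences and contrastive learning to adapt a DINOv2 backbone, substantially improving matching precision and geometric consistency in surgical scenes.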

Details Motivation: Conventional computer vision methods perform poorly in surgical environments because of weak perspective cues, non-Lambertian reflections, and complex deformable tissue, and existing deep learning models are not adapted to fine-grained matching in surgical images, so a targeted self-supervised method is needed to improve correspondence accuracy. Method: Proposes a self-supervised learning framework based on novel-view synthesis, which uses synthesized views to generate pixel-level ground-truth inlier correspondences, optimizes a DINOv2 model through triplet mining and contrastive learning, and adds a Transformer layer to strengthen the feature embeddings, enabling direct matching by cosine similarity thresholding. Result: Experiments on the SCARED dataset show higher matching precision and lower epipolar error than existing state-of-the-art methods. Conclusion: The proposed self-supervised feature matching framework improves pixel-level correspondence in endoscopic images and provides a reliable basis for 3D reconstruction, camera tracking, and high-level vision applications in surgical scenes. Abstract: Accurate spatial understanding is essential for image-guided surgery, augmented reality integration and context awareness. In minimally invasive procedures, where visual input is the sole intraoperative modality, establishing precise pixel-level correspondences between endoscopic frames is critical for 3D reconstruction, camera tracking, and scene interpretation. However, the surgical domain presents distinct challenges: weak perspective cues, non-Lambertian tissue reflections, and complex, deformable anatomy degrade the performance of conventional computer vision techniques. While Deep Learning models have shown strong performance in natural scenes, their features are not inherently suited for fine-grained matching in surgical images and require targeted adaptation to meet the demands of this domain. This research presents a novel Deep Learning pipeline for establishing feature correspondences in endoscopic image pairs, alongside a self-supervised optimization framework for model training. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences, subsequently utilized for mining triplets within a contrastive learning paradigm. Through this self-supervised approach, we augment the DINOv2 backbone with an additional Transformer layer, specifically optimized to produce embeddings that facilitate direct matching through cosine similarity thresholding.
Experimental evaluation demonstrates that our pipeline surpasses state-of-the-art methodologies on the SCARED dataset, with improved matching precision and lower epipolar error than related work. The proposed framework constitutes a valuable contribution toward enabling more accurate high-level computer vision applications in surgical endoscopy.
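Matching by cosine similarity thresholding, as named in the abstract, can be sketched in a few lines. This is a generic illustration rather than the authors' code: the function names, the 0.9 threshold, and the nearest-neighbour selection rule are assumptions, and real descriptors would come from the adapted DINOv2 embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def match_by_threshold(desc_a, desc_b, threshold=0.9):
    """Return (i, j, similarity) for each descriptor in A whose best match
    in B clears the threshold; weaker candidates are discarded."""
    matches = []
    for i, u in enumerate(desc_a):
        sims = [cosine(u, v) for v in desc_b]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            matches.append((i, j, sims[j]))
    return matches
```

Raising the threshold trades recall for precision, which is why embeddings trained so that true correspondences score well above distractors make this simple rule workable.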

[73] Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies

Cong Pang,Hongtao Yu,Zixuan Chen,Lewei Lu,Xin Lou

Main category: cs.CV

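TL;DR: Proposes FROW, a new benchmark for evaluating the fine-grained recognition ability of large vision language models (LVLMs), and improves model performance through optimization strategies covering data construction and the training process.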

Details Motivation: Existing benchmarks focus mainly on reasoning tasks and neglect fine-grained recognition, which is crucial for practical applications, so a new evaluation framework is needed. Method: Proposes the FROW benchmark, combines mosaic data with open-world data generated using GPT-4o, and introduces fine-grained data in the pre-training phase to improve LVLM performance. Result: Experiments show that mosaic data improves category recognition accuracy by 1%, open-world data raises FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%, and adding fine-grained data during pre-training improves category recognition accuracy by up to 10%. Conclusion: FROW provides an effective evaluation framework for fine-grained recognition in LVLMs, and the proposed optimization strategies substantially improve recognition in real-world scenarios. Abstract: Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives: data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1% and open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at https://github.com/pc-inno/FROW.

[74] Adaptive Dual-Weighted Gravitational Point Cloud Denoising Method

Ge Zhang,Chunyang Wang,Bo Xiao,Xuelian Liu,Bin Liu

Main category: cs.CV

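TL;DR: Proposes an adaptive dual-weighted gravitational point cloud denoising method that combines octree spatial partitioning, adaptive voxel statistics, and kNN density estimation to denoise point clouds efficiently, accurately, and in real time across multiple noise scenarios.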

Details Motivation: Existing point cloud denoising methods struggle to balance denoising accuracy, edge preservation, and computational efficiency, and they fall short in complex noise environments, so a method is needed that runs efficiently while preserving structural detail. Method: Uses an octree to partition the global point cloud for parallel acceleration; within each leaf node, applies adaptive voxel occupancy statistics and k-nearest-neighbor density estimation to quickly remove isolated, low-density noise points; then builds a gravitational scoring function that combines density weights with adaptive distance weights to separate noise points from object points precisely. Result: Experiments on the Stanford 3D Scanning Repository, the CADC dataset, and in-house FMCW LiDAR data show that the method outperforms existing methods on F1, PSNR, and Chamfer Distance while reducing single-frame processing time. Conclusion: The method balances high accuracy, strong edge preservation, and real-time performance under a range of noise conditions, and suits applications with high point cloud quality requirements such as autonomous driving and 3D reconstruction. Abstract: High-quality point cloud data is a critical foundation for tasks such as autonomous driving and 3D reconstruction. However, LiDAR-based point cloud acquisition is often affected by various disturbances, resulting in a large number of noise points that degrade the accuracy of subsequent point cloud object detection and recognition. Moreover, existing point cloud denoising methods typically sacrifice computational efficiency in pursuit of higher denoising accuracy, or, conversely, improve processing speed at the expense of preserving object boundaries and fine structural details, making it difficult to simultaneously achieve high denoising accuracy, strong edge preservation, and real-time performance. To address these limitations, this paper proposes an adaptive dual-weight gravitational-based point cloud denoising method. First, an octree is employed to perform spatial partitioning of the global point cloud, enabling parallel acceleration. Then, within each leaf node, adaptive voxel-based occupancy statistics and k-nearest neighbor (kNN) density estimation are applied to rapidly remove clearly isolated and low-density noise points, thereby reducing the effective candidate set. Finally, a gravitational scoring function that combines density weights with adaptive distance weights is constructed to finely distinguish noise points from object points.
Experiments conducted on the Stanford 3D Scanning Repository, the Canadian Adverse Driving Conditions (CADC) dataset, and in-house FMCW LiDAR point clouds acquired in our laboratory demonstrate that, compared with existing methods, the proposed approach achieves consistent improvements in F1, PSNR, and Chamfer Distance (CD) across various noise conditions while reducing the single-frame processing time, thereby validating its high accuracy, robustness, and real-time performance in multi-noise scenarios.
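The final scoring stage can be illustrated with a toy version. The abstract does not give the scoring formula, so the Newton-style form below (product of kNN-based density weights over squared distance) is an assumption, as are the function names and parameters; the octree partitioning and voxel pre-filtering stages are omitted, and neighbours are found by brute force.

```python
import math


def knn(points, i, k):
    """Brute-force k nearest neighbours of points[i] as (distance, index) pairs."""
    dists = sorted((math.dist(points[i], q), j)
                   for j, q in enumerate(points) if j != i)
    return dists[:k]


def gravity_scores(points, k=3, eps=1e-9):
    """Hypothetical gravitational score: high for points embedded in dense structure."""
    neigh = [knn(points, i, k) for i in range(len(points))]
    # Density proxy: inverse of the mean distance to the k nearest neighbours.
    density = [1.0 / (sum(d for d, _ in nb) / len(nb) + eps) for nb in neigh]
    scores = []
    for i, nb in enumerate(neigh):
        # Attraction between "masses" (densities), falling off with squared distance.
        scores.append(sum(density[i] * density[j] / (d * d + eps) for d, j in nb))
    return scores
```

Points with low scores are weakly "attracted" to their surroundings and are candidates for removal; any cut-off threshold would need tuning per sensor and scene.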

[75] MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos

Qiyue Sun,Tailin Chen,Yinghui Zhang,Yuchen Zhang,Jiangbei Yue,Jianbo Jiao,Zeyu Fu

Main category: cs.CV

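TL;DR: Proposes MultiHateLoc, the first framework for weakly supervised multimodal hate content localisation, which achieves fine-grained frame-level localisation from video-level labels and reaches state-of-the-art performance on the HateMM and MultiHateClip datasets.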

Details Motivation: Existing research concentrates on video-level classification and leaves temporal localisation of hateful content largely unstudied; under weak supervision with only video-level labels, cross-modal and temporal dynamics are especially hard to capture. Method: Proposes the MultiHateLoc framework, which comprises modality-aware temporal encoders, dynamic cross-modal fusion with a contrastive alignment strategy, and a modality-aware MIL objective, enabling frame-level localisation under weak supervision. Result: On the HateMM and MultiHateClip datasets, MultiHateLoc significantly outperforms existing methods on the localisation task and achieves state-of-the-art performance. Conclusion: MultiHateLoc addresses temporal localisation of multimodal hate content under weak supervision and offers good interpretability and application potential. Abstract: The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion to adaptively emphasise the most informative modality at each moment and a cross-modal contrastive alignment strategy to enhance multimodal feature consistency; (3) a modality-aware MIL objective to identify discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.

[76] Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

Wenfei Guan,Jilin Mei,Tong Shen,Xumin Wu,Shuo Wang,Cheng Min,Yu Hu

Main category: cs.CV

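TL;DR: Proposes MaGRoad, a new path-centric framework, and WildRoad, a large-scale off-road road dataset, to address the shortcomings of existing models for road extraction in non-urban environments.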

Details Motivation: Existing deep learning models have advanced road extraction in urban environments but perform poorly off-road, mainly because large-scale datasets are lacking and prevailing methods have structural weaknesses. Method: First releases the WildRoad dataset, built with a dedicated interactive tool for road-network annotation; second, proposes the MaGRoad framework, which infers connectivity by aggregating multi-scale visual evidence along candidate paths. Result: Experiments show that MaGRoad achieves state-of-the-art performance on the WildRoad benchmark and generalizes well to urban datasets, with roughly 2.5x faster inference. Conclusion: Together, the new dataset and the path-centric paradigm provide a stronger foundation for mapping roads in the wild. Abstract: Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity robustly. Extensive experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.

[77] TransLocNet: Cross-Modal Attention for Aerial-Ground Vehicle Localization with Contrastive Learning

Phu Pham,Damon Conover,Aniket Bera

Main category: cs.CV

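TL;DR: Proposes TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantics and uses bidirectional attention and contrastive learning to achieve high-accuracy aerial-ground localization.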

Details Motivation: Addresses the localization difficulty caused by viewpoint and modality differences between aerial imagery and ground-level LiDAR. Method: Projects LiDAR scans into a bird's-eye-view representation, aligns them with aerial image features through bidirectional attention, and uses a likelihood map decoder to output probability distributions over position and orientation; a contrastive learning module builds a shared embedding space to strengthen cross-modal alignment. Result: Experiments on the CARLA and KITTI datasets show localization error reduced by up to 63% relative to existing methods, reaching sub-meter, sub-degree accuracy. Conclusion: TransLocNet achieves robust and generalizable aerial-ground localization in both synthetic and real-world settings. Abstract: Aerial-ground localization is difficult due to large viewpoint and modality gaps between ground-level LiDAR and overhead imagery. We propose TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context. LiDAR scans are projected into a bird's-eye-view representation and aligned with aerial features through bidirectional attention, followed by a likelihood map decoder that outputs spatial probability distributions over position and orientation. A contrastive learning module enforces a shared embedding space to improve cross-modal alignment. Experiments on CARLA and KITTI show that TransLocNet outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy. These results demonstrate that TransLocNet provides robust and generalizable aerial-ground localization in both synthetic and real-world settings.
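A contrastive objective that pulls matching LiDAR and aerial embeddings into a shared space is often written as an InfoNCE loss. The abstract does not say which loss TransLocNet uses, so the sketch below is an assumption, with a hypothetical function name and temperature; row i of each input is taken to be a matching pair.

```python
import math


def info_nce(lidar_emb, aerial_emb, temperature=0.1):
    """Mean InfoNCE loss; row i of each list is assumed to be a matching pair."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    a = [unit(v) for v in lidar_emb]
    b = [unit(v) for v in aerial_emb]
    loss = 0.0
    for i, u in enumerate(a):
        # Cosine similarity of LiDAR embedding i to every aerial embedding.
        logits = [sum(x * y for x, y in zip(u, v)) / temperature for v in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]  # negative log-probability of the true pair
    return loss / len(a)
```

The loss is small when each LiDAR embedding is closest to its own aerial counterpart and grows when pairs are confused, which is the pressure that shapes a shared embedding space.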

[78] Neural Collapse in Test-Time Adaptation

Xiao Chen,Zhongjing Du,Jiazhen Huang,Xu Jiang,Li Lu,Jingyan Jiang,Zhi Wang

Main category: cs.CV

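TL;DR: Proposes NCTTA, a new test-time adaptation method based on the observation that sample-wise features align with classifier weights (NC3+); hybrid targets mitigate unreliable pseudo-labels, and the method substantially improves robustness on out-of-distribution data.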

Details Motivation: Existing TTA methods lack a theoretical understanding of the root cause of performance degradation under domain shift and depend on unreliable pseudo-labels, which limits them under large distribution shifts. Method: Introduces a sample-wise neural collapse phenomenon (NC3+) and finds that performance degradation stems from sample-wise misalignment between features and classifier weights; based on this, proposes NCTTA, which aligns features with the classifier using hybrid targets that combine geometric proximity and predictive confidence. Result: Substantially outperforms existing methods on benchmarks such as ImageNet-C, for example beating Tent by 14.52%. Conclusion: Sample-wise alignment between features and the classifier is key to improving TTA performance; NCTTA's hybrid targets handle unreliable pseudo-labels effectively and strengthen robustness to domain shift. Abstract: Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights for TTA. In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample's feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. Building on NC3+, we identify that the performance degradation stems from sample-wise misalignment in adaptation which exacerbates under larger distribution shifts. This indicates the necessity of realigning the feature embeddings with their corresponding classifier weights. However, the misalignment makes pseudo-labels unreliable under domain shifts. To address this challenge, we propose NCTTA, a novel feature-classifier alignment method with hybrid targets to mitigate the impact of unreliable pseudo-labels, which blends geometric proximity with predictive confidence. Extensive experiments demonstrate the effectiveness of NCTTA in enhancing robustness to domain shifts. For example, NCTTA outperforms Tent by 14.52% on ImageNet-C.
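One way to picture a hybrid target that blends geometric proximity with predictive confidence is sketched below. The abstract does not specify the blending rule, so the convex combination, the `alpha` weight, and the temperature `tau` are hypothetical; only the two ingredients come from the text.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def hybrid_target(feature, class_weights, alpha=0.5, tau=0.1):
    """Blend a geometric soft label with the model's own prediction (assumed form)."""
    # Geometric proximity: which classifier weight vector the feature points toward.
    geometric = softmax([cosine(feature, w) / tau for w in class_weights])
    # Predictive confidence: softmax over linear-classifier logits (bias omitted).
    logits = [sum(f * x for f, x in zip(feature, w)) for w in class_weights]
    confidence = softmax(logits)
    return [alpha * g + (1 - alpha) * c for g, c in zip(geometric, confidence)]
```

When the two views disagree, the blended target is softer than either alone, which is one plausible way to limit the damage from an unreliable pseudo-label.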

[79] An M-Health Algorithmic Approach to Identify and Assess Physiotherapy Exercises in Real Time

Stylianos Kandylakis,Christos Orfanopoulos,Georgios Siolas,Panayiotis Tsanakas

Main category: cs.CV

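TL;DR: Proposes an efficient algorithmic framework for real-time recognition, classification, and assessment of human physiotherapy exercises on mobile devices, using a pose-estimation neural network and lightweight models for real-time client-side processing.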

Details Motivation: Supporting remote physiotherapy supervision and mobile health applications requires real-time, accurate, and scalable human movement analysis on mobile devices. Method: Decomposes a movement into a sequence of static poses, extracts keypoints with a pose-estimation network, converts them into trigonometric angle-based features, and classifies each frame with lightweight supervised models; a dynamic-programming method based on a modified Levenshtein distance matches movement sequences and localizes errors. Result: The system runs in real time on mobile devices, recognizes and assesses physiotherapy movements accurately, and detects movements that deviate from the prescribed pattern. Conclusion: The framework delivers efficient, real-time movement analysis on the client side and is suitable for remote rehabilitation and m-health scenarios, with good scalability and practicality. Abstract: This work presents an efficient algorithmic framework for real-time identification, classification, and evaluation of human physiotherapy exercises using mobile devices. The proposed method interprets a kinetic movement as a sequence of static poses, which are estimated from camera input using a pose-estimation neural network. Extracted body keypoints are transformed into trigonometric angle-based features and classified with lightweight supervised models to generate frame-level pose predictions and accuracy scores. To recognize full exercise movements and detect deviations from prescribed patterns, we employ a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and localization of inaccuracies. The system operates entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the effectiveness of the methodology and highlights its applicability to remote physiotherapy supervision and m-health applications.
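Two building blocks named in the abstract, angle features from keypoints and edit-distance sequence matching, can be sketched as follows. The paper's modification to Levenshtein distance is not described, so this shows the classic algorithm only; the function names and the pose labels in the usage note are made up for illustration.

```python
import math


def joint_angle(a, b, c):
    """Angle at keypoint b (degrees) formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))  # clamp rounding error


def levenshtein(observed, reference):
    """Classic edit distance between two pose-label sequences (row-by-row DP)."""
    prev = list(range(len(reference) + 1))
    for i, o in enumerate(observed, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            cost = 0 if o == r else 1
            curr.append(min(prev[j] + 1,          # drop an observed pose
                            curr[j - 1] + 1,      # a reference pose was skipped
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]
```

For instance, comparing an observed sequence `["stand", "squat", "stand"]` against a prescribed `["stand", "squat", "lunge", "stand"]` gives a distance of 1, reflecting the single missing pose. Locating where the deviation occurred would additionally require backtracking through the DP table.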

[80] Error-Propagation-Free Learned Video Compression With Dual-Domain Progressive Temporal Alignment

Han Li,Shaohui Li,Wenrui Dai,Chenglin Li,Xinlong Pan,Haipeng Wang,Junni Zou,Hongkai Xiong

Main category: cs.CV

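TL;DR: Proposes a new unified-transform framework for learned video compression that combines dual-domain progressive temporal alignment with a quality-conditioned mixture-of-experts (QCMoE) module; it resolves the tension between error propagation and inaccurate motion estimation/compensation, achieving continuous bit-rate adaptation without error propagation while maintaining high-quality reconstruction.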

Details Motivation: Existing learned video compression frameworks trade off temporal alignment accuracy against error propagation: the separate-transform framework performs well but propagates errors visibly, while the unified-transform framework avoids error propagation but models temporal context weakly. Method: Proposes dual-domain progressive temporal alignment, consisting of coarse pixel-domain alignment and refined latent-domain alignment, and designs a Flow-Guided Deformable Transformer (FGDT) for long-term motion refinement of complex multi-frame motion; introduces a QCMoE module that dynamically adjusts per-pixel quantization steps based on target quality and content, enabling continuous bit-rate adaptation. Result: Experiments show that the method eliminates error propagation while achieving rate-distortion performance competitive with state-of-the-art methods. Conclusion: Through dual-domain alignment and the QCMoE module, the framework combines accurate motion compensation, freedom from error propagation, and continuous rate control, offering an effective solution for high-quality, consistent learned video streaming. Abstract: Existing frameworks for learned video compression suffer from a dilemma between inaccurate temporal alignment and error propagation for motion estimation and compensation (ME/MC). The separate-transform framework employs distinct transforms for intra-frame and inter-frame compression to yield impressive rate-distortion (R-D) performance but causes evident error propagation, while the unified-transform framework eliminates error propagation via shared transforms but is inferior in ME/MC in shared latent domains. To address this limitation, in this paper, we propose a novel unified-transform framework with dual-domain progressive temporal alignment and quality-conditioned mixture-of-experts (QCMoE) to enable quality-consistent and error-propagation-free streaming for learned video compression. Specifically, we propose dual-domain progressive temporal alignment for ME/MC that leverages coarse pixel-domain alignment and refined latent-domain alignment to significantly enhance temporal context modeling in a coarse-to-fine fashion. The coarse pixel-domain alignment efficiently handles simple motion patterns with optical flow estimated from a single reference frame, while the refined latent-domain alignment develops a Flow-Guided Deformable Transformer (FGDT) over latents from multiple reference frames to achieve long-term motion refinement (LTMR) for complex motion patterns.
Furthermore, we design a QCMoE module for continuous bit-rate adaptation that dynamically assigns different experts to adjust quantization steps per pixel based on target quality and content rather than relying on a single quantization step. QCMoE allows continuous and consistent rate control with appealing R-D performance. Experimental results show that the proposed method achieves R-D performance competitive with the state of the art, while successfully eliminating error propagation.

[81] Robust Shape from Focus via Multiscale Directional Dilated Laplacian and Recurrent Network

Khurram Ashfaq,Muhammad Tariq Mahmood

Main category: cs.CV

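TL;DR: Proposes a hybrid Shape-from-Focus framework that achieves high-accuracy depth estimation using multi-scale Directional Dilated Laplacian kernels and a lightweight GRU network.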

Details Motivation: Existing deep learning SFF methods rely on heavy encoders and simple aggregation, which introduces artifacts and amplifies noise in the depth map. Method: Computes focus volumes with handcrafted multi-scale DDL kernels, refines a low-resolution depth estimate iteratively with a lightweight multi-scale GRU module, and recovers a high-resolution depth map with learned convex upsampling. Result: Outperforms state-of-the-art traditional and deep learning methods on both synthetic and real-world datasets, with higher accuracy and better generalization across focal conditions. Conclusion: The method balances computational efficiency and depth estimation quality, improving SFF performance in complex scenes. Abstract: Shape-from-Focus (SFF) is a passive depth estimation technique that infers scene depth by analyzing focus variations in a focal stack. Most recent deep learning-based SFF methods typically operate in two stages: first, they extract focus volumes (a per pixel representation of focus likelihood across the focal stack) using heavy feature encoders; then, they estimate depth via a simple one-step aggregation technique that often introduces artifacts and amplifies noise in the depth map. To address these issues, we propose a hybrid framework. Our method computes multi-scale focus volumes traditionally using handcrafted Directional Dilated Laplacian (DDL) kernels, which capture long-range and directional focus variations to form robust focus volumes. These focus volumes are then fed into a lightweight, multi-scale GRU-based depth extraction module that iteratively refines an initial depth estimate at a lower resolution for computational efficiency. Finally, a learned convex upsampling module within our recurrent network reconstructs high-resolution depth maps while preserving fine scene details and sharp boundaries. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach outperforms state-of-the-art deep learning and traditional methods, achieving superior accuracy and generalization across diverse focal conditions.
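The classical half of such a pipeline, a Laplacian-style focus measure followed by a per-pixel argmax over the focal stack, can be sketched as below. The paper's exact DDL kernels are not given in the abstract; this stand-in sums absolute second differences along rows and columns at several dilation steps, and it replaces the learned GRU refinement and convex upsampling with a plain argmax. All names are illustrative.

```python
def dilated_laplacian_focus(img, y, x, d):
    """Sum of absolute second differences along rows and columns at step d."""
    h, w = len(img), len(img[0])
    if not (d <= y < h - d and d <= x < w - d):
        return 0.0  # skip borders where the dilated stencil does not fit
    c = img[y][x]
    horiz = abs(2 * c - img[y][x - d] - img[y][x + d])
    vert = abs(2 * c - img[y - d][x] - img[y + d][x])
    return float(horiz + vert)


def depth_from_focus(stack, dilations=(1, 2)):
    """Per-pixel index of the sharpest frame in a focal stack (list of 2D images)."""
    h, w = len(stack[0]), len(stack[0][0])
    depth = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Focus response of every frame at this pixel, summed over scales.
            focus = [sum(dilated_laplacian_focus(img, y, x, d) for d in dilations)
                     for img in stack]
            depth[y][x] = max(range(len(stack)), key=focus.__getitem__)
    return depth
```

A one-step argmax like this is exactly the kind of aggregation the abstract says tends to amplify noise, which is the motivation for replacing it with iterative refinement.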

[82] 3D Blood Pulsation Maps

Maurice Rohr,Tobias Reinhardt,Tizian Dege,Justus Thies,Christoph Hoog Antink

Main category: cs.CV

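TL;DR: Pulse3DFace is the first multi-view dataset for estimating 3D blood pulsation maps, supporting the development of remote pulse estimation models and research into illumination effects.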

Details Motivation: There is a lack of high-quality 3D pulsation datasets for modeling dynamic facial blood pulsation and validating remote photoplethysmography imaging (PPGI) methods. Method: RGB videos (30 Hz) of 15 subjects are recorded from 23 viewpoints together with reference pulse signals; facial 3D scans are generated with monocular structure-from-motion, and 3D pulsation maps compatible with the FLAME model are produced. Result: The dataset provides raw videos, 3D scans, reference pulse signals, and high-quality 3D pulsation maps carrying signal-to-noise ratio, pulse amplitude, and phase information; its consistency under different illumination and its physiological validity are verified. Conclusion: Pulse3DFace is a resource for multi-view remote pulse estimation, illumination robustness analysis, and synthetic data generation, and supports further development of PPGI methods. Abstract: We present Pulse3DFace, the first dataset of its kind for estimating 3D blood pulsation maps. These maps can be used to develop models of dynamic facial blood pulsation, enabling the creation of synthetic video data to improve and validate remote pulse estimation methods via photoplethysmography imaging. Additionally, the dataset facilitates research into novel multi-view-based approaches for mitigating illumination effects in blood pulsation analysis. Pulse3DFace consists of raw videos from 15 subjects recorded at 30 Hz with an RGB camera from 23 viewpoints, blood pulse reference measurements, and facial 3D scans generated using monocular structure-from-motion techniques. It also includes processed 3D pulsation maps compatible with the texture space of the 3D head model FLAME. These maps provide signal-to-noise ratio, local pulse amplitude, phase information, and supplementary data. We offer a comprehensive evaluation of the dataset's illumination conditions, map consistency, and its ability to capture physiologically meaningful features in the facial and neck skin regions.

[83] Take a Peek: Efficient Encoder Adaptation for Few-Shot Semantic Segmentation via LoRA

Pasquale De Marinis,Gennaro Vessio,Giovanna Castellano

Main category: cs.CV

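TL;DR: This paper proposes Take a Peek (TaP), a simple yet effective method that fine-tunes the encoder with Low-Rank Adaptation (LoRA) to improve its adaptability in few-shot semantic segmentation (FSS) and cross-domain FSS. The method is model-agnostic, adds little computational overhead, mitigates catastrophic forgetting, and improves segmentation performance on multiple benchmarks, particularly in complex multi-class scenarios.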

Details Motivation: The encoders of existing FSS methods have limited ability to extract features for unseen classes, which is a performance bottleneck; most research focuses on improving decoders and overlooks encoder adaptability. Method: Take a Peek (TaP) uses Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set, enabling fast adaptation to novel classes while avoiding catastrophic forgetting. The method is model-agnostic and integrates into existing FSS pipelines. Result: Experiments on COCO $20^i$, Pascal $5^i$, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray show that TaP consistently improves segmentation performance across models and shot settings, with notable gains in complex multi-class scenarios; low ranks already yield strong performance, which keeps computation efficient. Conclusion: By addressing the encoder's limited generalization to novel classes, a key problem in FSS, TaP supports more robust, efficient, and generalizable segmentation systems. Abstract: Few-shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce Take a Peek (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross-domain FSS (CD-FSS). TaP leverages Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model-agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks--including COCO $20^i$, Pascal $5^i$, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray--demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi-class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low-rank adaptations, ensuring computational efficiency. By addressing a critical limitation in FSS--the encoder's generalization to novel classes--TaP paves the way toward more robust, efficient, and generalizable segmentation systems. The code is available at https://github.com/pasqualedem/TakeAPeek.
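As a reminder of what a LoRA update looks like, the following sketch applies the standard low-rank formulation to a single linear layer: the frozen weight W is augmented by a trainable product B·A scaled by alpha/r. This is generic LoRA, not code from the TaP repository; the helper names and the toy shapes are assumptions made for illustration.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, A, B, alpha=1.0):
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B trained.
    Shapes: x (n, d_in), W (d_out, d_in), A (r, d_in), B (d_out, r)."""
    r = len(A)
    base = matmul(x, transpose(W))
    low_rank = matmul(matmul(x, transpose(A)), transpose(B))
    scale = alpha / r
    return [[b + scale * l for b, l in zip(b_row, l_row)]
            for b_row, l_row in zip(base, low_rank)]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters added by one rank-r adapter."""
    return r * (d_in + d_out)


x = [[2.0, 3.0, 4.0]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen encoder weight (toy values)
A = [[1.0, 0.0, 0.0]]                    # rank r = 1
B_init = [[0.0], [0.0]]                  # zero init: adapter starts as a no-op
B_tuned = [[1.0], [0.0]]                 # hypothetical values after support-set tuning

y_init = lora_forward(x, W, A, B_init)
y_tuned = lora_forward(x, W, A, B_tuned)
```

With B initialized to zero, the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pretrained behavior. For a 768×768 layer, a rank-4 adapter trains 6,144 parameters instead of 589,824, which is consistent with the abstract's point that low ranks keep adaptation cheap.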

Yuchen Feng,Zhenyu Zhang,Naibin Gu,Yilong Chen,Peng Fu,Zheng Lin,Shuohuan Wang,Yu Sun,Hua Wu,Weiping Wang,Haifeng Wang

Main category: cs.CV

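TL;DR: This paper proposes Blink, a dynamic visual token resolution framework that emulates the human "blink-like" perception process to enhance the visual perception of multimodal large language models within a single forward pass.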

Details Motivation: Humans perceive complex scenes efficiently by dynamically scanning and focusing on salient regions; the work investigates whether MLLMs behave similarly and aims to improve their limited visual perception. Method: Blink comprises two modules, saliency-guided scanning and dynamic token resolution; it estimates the saliency of visual tokens in each layer from the attention map, extends important tokens through a plug-and-play token super-resolution (TokenSR) module, and drops extended tokens in later layers once they lose focus. Result: Experiments show that Blink enhances visual perception and multimodal understanding, attending to salient regions adaptively and efficiently. Conclusion: By emulating human visual attention, Blink improves the visual perception of MLLMs without excessive computational overhead and suggests a direction for more efficient multimodal models. Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.

[85] Grounding Everything in Tokens for Multimodal Large Language Models

Xiangxuan Ren,Zhongdao Wang,Liping Hou,Pin Tang,Guoqing Wang,Chao Ma

Main category: cs.CV

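TL;DR: This paper proposes GETok, a spatial representation method that introduces learnable grid and offset tokens to improve object grounding in 2D image space for multimodal large language models, without modifying the autoregressive architecture.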

Details Motivation: Because existing multimodal large language models depend on tokenizing images, they struggle to ground objects precisely in 2D image space, so a representation that better supports spatial grounding is needed. Method: GETok uses grid tokens to partition the image plane into structured spatial anchors and offset tokens to iteratively refine localization predictions, embedding spatial relationships directly into tokens. Result: Experiments show that GETok outperforms state-of-the-art methods on a range of referring tasks, in both supervised fine-tuning and reinforcement learning settings. Conclusion: By improving the token representation, GETok strengthens native 2D spatial reasoning in MLLMs without changing the autoregressive structure. Abstract: Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization on input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.

[86] Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks

Meher Md Saad

Main category: cs.CV

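TL;DR: Proposes a skeleton-based few-shot prototypical network framework that combines ST-GCN with a multi-scale temporal aggregation module; through metric learning it improves isolated sign language recognition under data scarcity and long-tail distributions, clearly outperforming standard classification on WLASL and showing good zero-shot generalization.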

Details Motivation: Because of data scarcity and the long-tail distribution of sign vocabulary, standard classification performs poorly in isolated sign language recognition and generalizes badly to rare classes; a method is needed that uses limited data effectively and generalizes better. Method: A few-shot prototypical network framework with a skeleton-based encoder combines ST-GCN with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics; episodic training learns a semantic metric space in which signs are classified by their distance to class prototypes. Result: The model reaches 43.75% Top-1 and 77.10% Top-5 accuracy on WLASL, more than 13% above a classification baseline with the same backbone, and nearly 30% zero-shot accuracy on the unseen SignASL dataset. Conclusion: The metric learning paradigm clearly outperforms standard classification in data-scarce settings, generalizes better, and offers a scalable route to recognizing large sign vocabularies. Abstract: Isolated Sign Language Recognition (ISLR) is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. However, robust ISLR is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary, where gathering sufficient examples for thousands of unique signs is prohibitively expensive. Standard classification approaches struggle under these conditions, often overfitting to frequent classes while failing to generalize to rare ones. To address this bottleneck, we propose a Few-Shot Prototypical Network framework adapted for a skeleton-based encoder. Unlike traditional classifiers that learn fixed decision boundaries, our approach utilizes episodic training to learn a semantic metric space where signs are classified based on their proximity to dynamic class prototypes. We integrate a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics. Experimental results on the WLASL dataset demonstrate the superiority of this metric learning paradigm: our model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. Crucially, this outperforms a standard classification baseline sharing the identical backbone architecture by over 13%, showing that the prototypical training strategy remains effective in data-scarce settings where standard classification fails. Furthermore, the model exhibits strong zero-shot generalization, achieving nearly 30% accuracy on the unseen SignASL dataset without fine-tuning, offering a scalable pathway for recognizing extensive sign vocabularies with limited data.
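The classification rule behind a prototypical network can be sketched in a few lines: each class prototype is the mean of its support embeddings, and a query takes the label of the nearest prototype. The sketch below assumes embeddings have already been produced by some encoder (the ST-GCN and MSTA modules are not modeled), uses squared Euclidean distance as the metric, and uses made-up two-dimensional vectors and sign labels.

```python
def build_prototypes(support):
    """support: dict mapping label -> list of embedding vectors.
    Returns one prototype (mean embedding) per label."""
    protos = {}
    for label, vecs in support.items():
        dim = len(vecs[0])
        protos[label] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return protos


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify(query, protos):
    """Assign the label whose prototype is closest to the query embedding."""
    return min(protos, key=lambda label: squared_distance(query, protos[label]))


# Hypothetical 2-shot episode with toy 2-D embeddings.
support = {
    "hello": [[1.0, 0.0], [0.8, 0.2]],
    "thanks": [[0.0, 1.0], [0.2, 0.8]],
}
protos = build_prototypes(support)
pred_a = classify([0.7, 0.3], protos)
pred_b = classify([0.1, 0.6], protos)
```

Because prototypes are recomputed from whatever support examples are supplied, new classes can be added at test time without retraining a classifier head. This is the property that makes zero-shot transfer to an unseen vocabulary possible in principle.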

[87] Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

Haojie Zheng,Shuchen Weng,Jingqi Liu,Siqi Yang,Boxin Shi,Xinlong Wang

Main category: cs.CV

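TL;DR: This paper proposes AVI-Edit, a framework for audio-synchronized video instance editing that achieves precise spatial and temporal control through granularity-aware mask refinement and a self-feedback audio agent, and builds a large-scale dataset to validate its advantages in visual quality, condition following, and audio-visual synchronization.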

Details Motivation: Existing video editing methods overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability needed for instance-level edits. Method: A granularity-aware mask refiner refines coarse user-provided masks, and a self-feedback audio agent curates high-quality audio guidance for fine-grained temporal control; a large-scale, instance-centric annotated dataset is also constructed. Result: Experiments show that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Conclusion: AVI-Edit achieves fine-grained, audio-synchronized video instance editing and advances joint audio-visual content generation. Abstract: Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Project page: https://hjzheng.net/projects/AVI-Edit/.

[88] Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration

Wenlong Jiao,Heyang Lee,Ping Wang,Pengfei Zhu,Qinghua Hu,Dongwei Ren

Main category: cs.CV

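TL;DR: This paper proposes SymUNet, a simple and efficient all-in-one image restoration method built on a symmetric U-Net; with carefully designed feature extraction and cross-scale propagation it reaches state-of-the-art performance without complex models. A semantic-enhanced variant, SE-SymUNet, further uses frozen CLIP features to strengthen degradation priors.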

Details Motivation: Existing all-in-one image restoration methods rely on complex architectures and degradation prompt strategies and leave the potential of basic feature design underexplored. The paper aims to show that a simple architecture is effective when degradation information is fully exploited. Method: A symmetric U-Net architecture (SymUNet) aligns feature scales between encoder and decoder and simplifies cross-scale propagation to preserve degradation signals, with simple additive fusion in the skip connections. SE-SymUNet additionally injects frozen CLIP semantic features through cross-attention to strengthen degradation priors. Result: SymUNet surpasses existing methods on multiple benchmark datasets while reducing computational cost; SE-SymUNet improves performance further, and experiments confirm the advantage on tasks such as denoising, deblurring, and adverse weather removal. Conclusion: A symmetric U-Net with good feature design is sufficient for state-of-the-art all-in-one image restoration without complex models, and provides a simpler, stronger baseline for future research. Abstract: All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across encoder-decoder and enabling streamlined cross-scale propagation, our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic-enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines SymUNet and SE-SymUNet establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at https://github.com/WenlongJiao/SymUNet.

[89] Salient Object Detection in Complex Weather Conditions via Noise Indicators

Quan Chen,Xiaokai Yang,Tingyu Wang,Rongfeng Lu,Xichun Sheng,Yaoqi Sun,Chenggang Yan

Main category: cs.CV

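TL;DR: This paper proposes a salient object detection (SOD) framework for diverse weather conditions that improves segmentation accuracy in complex weather by introducing a Noise Indicator Fusion Module (NIFM).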

Details Motivation: Most existing SOD methods assume low-noise visual conditions and ignore how weather noise in real scenes degrades segmentation accuracy; the paper aims to improve robustness and accuracy under complex weather. Method: The SOD framework consists of a specific encoder and a replaceable decoder; a one-hot noise indicator vector is introduced, and the NIFM combines it with semantic features to embed weather-aware priors through adaptive feature modulation. Result: Extensive experiments on the WXSOD dataset, with different training data proportions (100%, 50%, 30%) and multiple encoder-decoder combinations, show that the proposed framework (in particular the NIFM-enhanced encoder) outperforms a vanilla encoder under complex weather. Conclusion: The NIFM-enhanced SOD framework handles different kinds of weather noise, improves segmentation accuracy, and remains compatible with mainstream SOD decoders. Abstract: Salient object detection (SOD), a foundational task in computer vision, has advanced from single-modal to multi-modal paradigms to enhance generalization. However, most existing SOD methods assume low-noise visual conditions, overlooking the degradation of segmentation accuracy caused by weather-induced noise in real-world scenarios. In this paper, we propose a SOD framework tailored for diverse weather conditions, encompassing a specific encoder and a replaceable decoder. To enable handling of varying weather noises, we introduce a one-hot vector as a noise indicator to represent different weather types and design a Noise Indicator Fusion Module (NIFM). The NIFM takes both semantic features and the noise indicator as dual inputs and is inserted between consecutive stages of the encoder to embed weather-aware priors via adaptive feature modulation. Critically, the proposed specific encoder retains compatibility with mainstream SOD decoders. Extensive experiments are conducted on the WXSOD dataset under varying training data scales (100%, 50%, 30% of the full training set), three encoder and seven decoder configurations. Results show that the proposed SOD framework (particularly the NIFM-enhanced specific encoder) improves segmentation accuracy under complex weather conditions compared to a vanilla encoder.

[90] Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval

J. Xiao,Y. Guo,X. Zi,K. Thiyagarajan,C. Moreira,M. Prasad

Main category: cs.CV

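TL;DR: Proposes TRSLLaVA, a training-free text-to-text method for remote sensing image retrieval, and builds the RSRT dataset; using high-quality structured text, it outperforms zero-shot CLIP and is competitive with supervised models.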

Details Motivation: Semantic retrieval of remote sensing images suffers from a "semantic gap"; existing methods depend on costly domain-specific training, and there is no benchmark for evaluating the usefulness of VLM-generated text in zero-shot retrieval. Method: The RSRT dataset, with multiple structured captions per image, is constructed; cross-modal retrieval is recast as pure text matching, using VLM-generated descriptions in a unified text embedding space, with no model training or fine-tuning. Result: On the RSITMD and RSICD benchmarks the method performs strongly in the zero-shot setting; on RSITMD, for example, it reaches a mean recall of 42.62%, well above the 23.86% of the CLIP baseline, and surpasses several top supervised models. Conclusion: High-quality structured text representations offer an efficient, low-cost paradigm for remote sensing image retrieval and show the potential of training-free text retrieval in this field. Abstract: Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
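The retrieval-by-text-matching setup and the mean recall metric can be illustrated with a small sketch. It assumes that queries and captions have already been embedded into a shared text space by some sentence encoder (not shown), ranks captions by cosine similarity, and averages Recall@K over K = 1, 5, 10. That averaging convention is an assumption about how mean recall is defined here; the abstract does not spell it out, and the vectors are toy values.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_of_target(query, gallery, target_idx):
    """1-based rank of the ground-truth caption when sorted by similarity."""
    scores = [cosine(query, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda i: -scores[i])
    return order.index(target_idx) + 1


def recall_at_k(ranks, k):
    """Percentage of queries whose target appears in the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


def mean_recall(ranks, ks=(1, 5, 10)):
    return sum(recall_at_k(ranks, k) for k in ks) / len(ks)


# Toy caption embeddings (gallery) and query embeddings; query i targets caption i.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
ranks = [rank_of_target(q, gallery, i) for i, q in enumerate(queries)]
r1 = recall_at_k(ranks, 1)
mr = mean_recall(ranks)
```

In this toy case the third query ranks its target second, so Recall@1 is about 66.7% while Recall@5 and Recall@10 are 100%. Since both sides are text embeddings, no image encoder needs to be trained, which is the training-free aspect the abstract emphasizes.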

[91] Track and Caption Any Motion: Query-Free Motion Discovery and Description in Videos

Bishoy Galoaa,Sarah Ostadabbas

Main category: cs.CV

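TL;DR: Proposes TCAM, a motion-centric framework for automatic video understanding that discovers and describes motion patterns in videos on its own, without user queries.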

Details Motivation: Under challenging conditions such as occlusion, camouflage, or rapid movement, video understanding depends more on motion dynamics than on static appearance, so a method that identifies and describes motion patterns autonomously is needed. Method: A motion-field attention mechanism spatially aligns natural language descriptions with their corresponding trajectories, and contrastive vision-language representations are used to recognize and describe actions. Unified training combines global video-text alignment with fine-grained spatial correspondence, and multi-head cross-attention enables query-free discovery of multiple motion expressions. Result: On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval and 64.9 JF for spatial grounding, and discovers 4.8 relevant expressions per video on average with 84.7% precision, showing strong cross-task generalization. Conclusion: By combining motion patterns with vision-language representations, TCAM discovers and describes multiple motion activities in complex videos without user queries. Abstract: We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos in challenging conditions like occlusion, camouflage, or rapid movement often depends more on motion dynamics than static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval, 64.9 JF for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.

[92] Robust Multi-Disease Retinal Classification via Xception-Based Transfer Learning and W-Net Vessel Segmentation

Mohammad Sadegh Gholizadeh,Amir Arsalan Rezapour

Main category: cs.CV

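TL;DR: This paper presents a deep learning framework for automated diagnosis of eye diseases that combines deep feature extraction with interpretable image processing modules, using retinal vessel segmentation to assist classification and improve interpretability and clinical applicability.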

Details Motivation: The rising incidence of vision-threatening eye diseases calls for scalable and accurate screening; standard CNN models also lack interpretability for medical diagnosis. Method: A deep learning architecture combines deep feature extraction with interpretable image processing modules and uses high-fidelity retinal vessel segmentation as an auxiliary task to guide disease classification. Result: The model bases its predictions on clinically relevant morphological features, which improves interpretability and reduces false positives. Conclusion: The approach improves the accuracy of automated eye disease diagnosis while bringing model output closer to expert medical judgment, supporting deployment in clinical settings. Abstract: In recent years, the incidence of vision-threatening eye diseases has risen dramatically, necessitating scalable and accurate screening solutions. This paper presents a comprehensive study on deep learning architectures for the automated diagnosis of ocular conditions. To mitigate the "black-box" limitations of standard convolutional neural networks (CNNs), we implement a pipeline that combines deep feature extraction with interpretable image processing modules. Specifically, we focus on high-fidelity retinal vessel segmentation as an auxiliary task to guide the classification process. By grounding the model's predictions in clinically relevant morphological features, we aim to bridge the gap between algorithmic output and expert medical validation, thereby reducing false positives and improving deployment viability in clinical settings.

[93] Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces

Bishoy Galoaa,Xiangyu Bai,Sarah Ostadabbas

Main category: cs.CV

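TL;DR: Lang2Motion is a framework that generates language-guided point trajectories by aligning motion manifolds with joint embedding spaces, and clearly outperforms existing methods in text-to-trajectory retrieval and motion accuracy.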

Details Motivation: Existing work focuses mainly on human motion or video synthesis, offers no explicit trajectory generation for arbitrary objects, and makes cross-domain motion understanding and editing difficult. Method: A transformer-based auto-encoder is trained with dual supervision from textual descriptions and trajectory visualizations, both mapped through CLIP's frozen encoders; motion is extracted from real videos and trajectories are obtained by point tracking. Result: The model reaches 34.2% Recall@1 on text-to-trajectory retrieval, 12.5 points above video-based methods; motion accuracy improves by 33-52% (ADE reduced from 18.3-25.3 to 12.4); and human action recognition reaches 88.3% Top-1 accuracy despite training only on diverse object motions. Conclusion: Lang2Motion aligns language with motion trajectories, performs well across tasks, and supports style transfer, semantic interpolation, and latent-space editing, showing strong cross-domain generalization. Abstract: We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.

[94] DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM

Qintong Zhang,Junyuan Zhang,Zhifei Ren,Linke Ouyang,Zichen Wen,Junbo Niu,Yuan Qu,Bin Wang,Ka-Ho Chow,Conghui He,Wentao Zhang

Main category: cs.CV

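TL;DR: This paper proposes DOCR-Inspector, a vision-language-model-based method for fine-grained assessment of document parsing quality; using a taxonomy of 28 error types and the Chain-of-Checklist reasoning paradigm, it evaluates real-world parsing results comprehensively, and it is validated with the large-scale DOCRcase-200K dataset and the DOCRcaseBench benchmark.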

Details Motivation: Existing benchmarks carry dataset biases, produce unstable model rankings that correlate poorly with real-world performance, and report overall scores that hide specific error patterns, so they cannot reliably assess document parsing quality in real scenarios. Method: DOCR-Inspector formalizes document parsing assessment as fine-grained error detection; using VLM-as-a-Judge, it analyzes the image and the parsed output, then identifies errors and assigns them to 28 types. DOCRcase-200K is built as training data, and the Chain-of-Checklist reasoning paradigm supports hierarchical quality assessment. Result: On DOCRcaseBench, which contains 882 real-world cases, DOCR-Inspector-7B outperforms commercial models such as Gemini 2.5 Pro as well as leading open-source models, and its assessments provide useful guidance for refining parsing results. Conclusion: DOCR-Inspector can serve as a reliable document parsing evaluator that provides fine-grained quality analysis and supports continued improvement of large-scale document parsing systems. Abstract: Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis. Leveraging VLM-as-a-Judge, DOCR-Inspector analyzes a document image and its parsed output, identifies all errors, assigns them to one of 28 predefined types, and produces a comprehensive quality assessment. To enable this capability, we construct DOCRcase-200K for training and propose the Chain-of-Checklist reasoning paradigm to enable the hierarchical structure of parsing quality assessment. For empirical validation, we introduce DOCRcaseBench, a set of 882 real-world document parsing cases with manual annotations. On this benchmark, DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro, as well as leading open-source models. Further experiments demonstrate that its quality assessments provide valuable guidance for parsing results refinement, making DOCR-Inspector both a practical evaluator and a driver for advancing document parsing systems at scale. Model and code are released at: https://github.com/ZZZZZQT/DOCR-Inspector.

[95] K-Track: Kalman-Enhanced Tracking for Accelerating Deep Point Trackers on Edge Devices

Bishoy Galoaa,Pau Closas,Sarah Ostadabbas

Main category: cs.CV

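TL;DR: K-Track is a general-purpose, tracker-agnostic acceleration framework that combines sparse deep learning keyframe updates with lightweight Kalman filtering for efficient video point tracking, achieving a 5-10x speedup while retaining over 85% of the original accuracy.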

Details Motivation: Existing deep learning point trackers depend on per-frame GPU inference and are hard to deploy on edge devices with limited compute, power, and connectivity, so a solution that balances accuracy and efficiency is needed. Method: K-Track runs deep learning updates on keyframes and predicts intermediate frames with a Kalman filter based on Bayesian uncertainty propagation, forming a hybrid of deep learning and lightweight filtering. Result: Evaluated on multiple state-of-the-art point trackers, K-Track achieves real-time performance on edge platforms such as the NVIDIA Jetson Nano and RTX Titan, with a 5-10x inference speedup while retaining over 85% of the original accuracy. Conclusion: K-Track narrows the gap between modern high-accuracy point tracking algorithms and resource-constrained deployment, offering a practical path for real-world edge vision systems. Abstract: Point tracking in video sequences is a foundational capability for real-world computer vision applications, including robotics, autonomous systems, augmented reality, and video analysis. While recent deep learning-based trackers achieve state-of-the-art accuracy on challenging benchmarks, their reliance on per-frame GPU inference poses a major barrier to deployment on resource-constrained edge devices, where compute, power, and connectivity are limited. We introduce K-Track (Kalman-enhanced Tracking), a general-purpose, tracker-agnostic acceleration framework designed to bridge this deployment gap. K-Track reduces inference cost by combining sparse deep learning keyframe updates with lightweight Kalman filtering for intermediate frame prediction, using principled Bayesian uncertainty propagation to maintain temporal coherence. This hybrid strategy enables 5-10X speedup while retaining over 85% of the original trackers' accuracy. We evaluate K-Track across multiple state-of-the-art point trackers and demonstrate real-time performance on edge platforms such as the NVIDIA Jetson Nano and RTX Titan. By preserving accuracy while dramatically lowering computational requirements, K-Track provides a practical path toward deploying high-quality point tracking in real-world, resource-limited settings, closing the gap between modern tracking algorithms and deployable vision systems.
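The keyframe-plus-filter pattern can be sketched with a textbook constant-velocity Kalman filter on a single coordinate: the expensive tracker is called only every few frames, and the filter predicts in between. Everything below is an illustrative assumption rather than the K-Track implementation: the class `ConstantVelocityKF`, the function `track_with_keyframes`, the noise settings, and the stand-in `deep_tracker` callback are hypothetical.

```python
class ConstantVelocityKF:
    """Kalman filter with state [position, velocity] and a position-only measurement."""

    def __init__(self, pos, q=0.01, r=0.01):
        self.p, self.v = pos, 0.0
        self.P = [[1.0, 0.0], [0.0, 100.0]]  # velocity is initially very uncertain
        self.q, self.r = q, r

    def predict(self, dt=1.0):
        (a, b), (c, d) = self.P
        self.p += self.v * dt
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                 # innovation covariance
        k0, k1 = a / s, c / s          # Kalman gain
        y = z - self.p                 # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]


def track_with_keyframes(n_frames, deep_tracker, interval=5):
    """Call the expensive tracker only on keyframes; predict on the other frames."""
    estimates, calls, kf = [], 0, None
    for t in range(n_frames):
        if t % interval == 0:
            z = deep_tracker(t)
            calls += 1
            if kf is None:
                kf = ConstantVelocityKF(z)
            else:
                kf.predict()
                kf.update(z)
        else:
            kf.predict()
        estimates.append(kf.p)
    return estimates, calls


# Stand-in for a deep point tracker: a point moving 2 px per frame along one axis.
estimates, calls = track_with_keyframes(20, lambda t: 2.0 * t, interval=5)
```

With a keyframe interval of 5, the stand-in tracker runs on 4 of 20 frames. The first keyframe update pulls the velocity estimate close to the true 2 px/frame, so intermediate predictions stay near the true track. How the real system chooses keyframes and noise parameters is not specified in the abstract.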

[96] TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection

Jian-Yu Jiang-Lin,Kang-Yang Huang,Ling Zou,Ling Lo,Sheng-Ping Yang,Yu-Wen Tseng,Kun-Hsiang Lin,Chia-Ling Chen,Yu-Ting Ta,Yan-Tsung Wang,Po-Ching Chen,Hongxia Xie,Hong-Han Shuai,Wen-Huang Cheng

Main category: cs.CV

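TL;DR: TriDF is a comprehensive benchmark for interpretable DeepFake detection that covers 16 DeepFake types across image, video, and audio modalities, evaluates perception, detection, and hallucination, and reveals the interdependence between detection accuracy and explanation reliability.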

Details Motivation: As generative models advance, the risk of fabricated portrayals of individuals grows, threatening security, communication, and public trust; DeepFake detection systems must therefore both identify forgeries accurately and give reliable explanations. Method: The TriDF benchmark contains high-quality forgeries from advanced synthesis models and evaluates models on perception (identifying fine-grained artifacts against human annotations), detection (classification performance across forgery families), and hallucination (reliability of generated explanations). Result: Experiments show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making; the three aspects are strongly interdependent. Conclusion: TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, laying a foundation for trustworthy systems that counter synthetic media threats. Abstract: Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.

[97] NaviHydra: Controllable Navigation-guided End-to-end Autonomous Driving with Hydra-distillation

Hanfeng Wu,Marlon Steiner,Michael Schmidt,Alvaro Marcos-Ramiro,Christoph Stiller

Main category: cs.CV

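TL;DR: Proposes NaviHydra, a controllable end-to-end autonomous driving model guided by navigation commands; by distilling a rule-based system and introducing a navigation compliance metric, it generates safer and more controllable trajectories in complex environments.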

Details Motivation: Traditional rule-based systems adapt poorly to dynamic environments, and end-to-end methods struggle to follow explicit navigation commands; the model's ability to understand and execute high-level commands needs to improve. Method: NaviHydra is distilled from a rule-based simulator, uses BEV-based trajectory gathering to enhance feature extraction, and introduces a navigation compliance metric to evaluate route adherence. Result: The model achieves state-of-the-art performance on the NAVSIM benchmark, significantly outperforming baseline models, with stronger responsiveness to navigation commands and improved safety. Conclusion: NaviHydra combines the controllability of rule-based systems with the generalization of end-to-end models, improving trajectory quality and interpretability in complex driving scenarios. Abstract: The complexity of autonomous driving scenarios requires robust models that can interpret high-level navigation commands and generate safe trajectories. While traditional rule-based systems can react to these commands, they often struggle in dynamic environments, and end-to-end methods face challenges in complying with explicit navigation commands. To address this, we present NaviHydra, a controllable navigation-guided end-to-end model distilled from an existing rule-based simulator. Our framework accepts high-level navigation commands as control signals, generating trajectories that align with specified intentions. We utilize a Bird's Eye View (BEV) based trajectory gathering method to enhance the trajectory feature extraction. Additionally, we introduce a novel navigation compliance metric to evaluate adherence to the intended route, improving controllability and navigation safety. To comprehensively assess our model's controllability, we design a test that evaluates its response to various navigation commands. Our method significantly outperforms baseline models, achieving state-of-the-art results in the NAVSIM benchmark, demonstrating its effectiveness in advancing autonomous driving.

[98] XDen-1K: A Density Field Dataset of Real-World Objects

Jingxuan Zhang,Tianqi Yu,Yatu Zhang,Jinze Wu,Kaixin Yao,Jingyang Liu,Yuyao Zhang,Jiayuan Gu,Jingyi Yu

Main category: cs.CV

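TL;DR: This paper presents XDen-1K, the first large-scale multi-modal dataset of real-world physical properties, focused on volumetric density estimation; it combines X-ray scans with 3D models in an optimization framework and improves center-of-mass estimation and robotic manipulation performance.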

Details Motivation: Existing models focus on object surface geometry and appearance and neglect internal physical properties such as volumetric density, which are essential for robotic manipulation and physical simulation; the lack of large-scale real-world data is the main bottleneck. Method: The XDen-1K dataset covers 1,000 real objects with high-resolution 3D models, part-level annotations, and biplanar X-ray scans; an optimization framework recovers a high-fidelity volumetric density field from sparse X-ray views, and X-ray images are used as a conditioning signal for a volumetric segmentation network. Result: Experiments show that using the dataset improves the accuracy of center-of-mass estimation and the success rate of robotic manipulation tasks. Conclusion: XDen-1K provides a foundational resource and a new benchmark for physically grounded visual inference and embodied AI. Abstract: A deep understanding of the physical world is a central goal for embodied AI and realistic simulation. While current models excel at capturing an object's surface geometry and appearance, they largely neglect its internal physical properties. This omission is critical, as properties like volumetric density are fundamental for predicting an object's center of mass, stability, and interaction dynamics in applications ranging from robotic manipulation to physical simulation. The primary bottleneck has been the absence of large-scale, real-world data. To bridge this gap, we introduce XDen-1K, the first large-scale, multi-modal dataset designed for real-world physical property estimation, with a particular focus on volumetric density. The core of this dataset consists of 1,000 real-world objects across 148 categories, for which we provide comprehensive multi-modal data, including a high-resolution 3D geometric model with part-level annotations and a corresponding set of real-world biplanar X-ray scans. Building upon this data, we introduce a novel optimization framework that recovers a high-fidelity volumetric density field of each object from its sparse X-ray views. To demonstrate its practical value, we add X-ray images as a conditioning signal to an existing segmentation network and perform volumetric segmentation. Furthermore, we conduct experiments on downstream robotics tasks. The results show that leveraging the dataset can effectively improve the accuracy of center-of-mass estimation and the success rate of robotic manipulation. We believe XDen-1K will serve as a foundational resource and a challenging new benchmark, catalyzing future research in physically grounded visual inference and embodied AI.

[99] Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching

Javier Villena Toro,Mehdi Tarkian

Main category: cs.CV

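TL;DR: This paper proposes Geo6DPose, a lightweight, fully local, training-free method for zero-shot 6D pose estimation. It combines foundation-model features with a geometric filtering strategy, achieves sub-second inference on a single commodity GPU, and improves robustness to noise, clutter, and partial occlusion while matching the performance of larger models.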

Details Motivation: Existing zero-shot 6D pose estimation methods rely on large-scale models and cloud inference, which brings high latency, high energy consumption, and deployment risks, and makes it hard to meet the needs of real robotic systems for low-power, low-latency, local computation. Method: The method combines foundation-model visual features with a geometric filtering strategy: it computes similarity maps between onboarded template DINO descriptors and scene patches, establishes correspondences between scene patch centers and the object model coordinate system via projection, recovers the pose with correspondence-driven RANSAC, and ranks candidates with a weighted geometric alignment metric that combines reprojection consistency and spatial support. Result: Geo6DPose runs at 1.08 FPS (sub-second inference) on a single commodity GPU with an average recall of 53.7 AR, on par with larger zero-shot baselines, and needs no training, fine-tuning, or network access. Conclusion: By trading model scale for geometric reliability, Geo6DPose offers a practical, fully local 6D perception solution suited to resource-constrained robotic systems and compatible with evolving foundation backbones. Abstract: Recent progress in zero-shot 6D object pose estimation has been driven largely by large-scale models and cloud-based inference. However, these approaches often introduce high latency, elevated energy consumption, and deployment risks related to connectivity, cost, and data governance, factors that conflict with the practical constraints of real-world robotics, where compute is limited and on-device inference is frequently required. We introduce Geo6DPose, a lightweight, fully local, and training-free pipeline for zero-shot 6D pose estimation that trades model scale for geometric reliability. Our method combines foundation model visual features with a geometric filtering strategy: similarity maps are computed between onboarded template DINO descriptors and scene patches, and mutual correspondences are established by projecting scene patch centers to 3D and template descriptors to the object model coordinate system. Final poses are recovered via correspondence-driven RANSAC and ranked using a weighted geometric alignment metric that jointly accounts for reprojection consistency and spatial support, improving robustness to noise, clutter, and partial visibility. Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). It requires no training, fine-tuning, or network access, and remains compatible with evolving foundation backbones, advancing practical, fully local 6D perception for robotic deployment.
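
The abstract describes similarity maps between template descriptors and scene patches but does not say how correspondences are selected from them. One common choice is mutual nearest-neighbour matching under cosine similarity. The sketch below illustrates that idea only; it is an assumption, not necessarily what Geo6DPose does, and the function names and toy 2-D descriptors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def mutual_matches(template_desc, scene_desc):
    """Return (template_idx, scene_idx) pairs that are each other's best match."""
    # Similarity map: one row per template descriptor, one column per scene patch.
    sim = [[cosine(t, s) for s in scene_desc] for t in template_desc]
    best_scene = [max(range(len(scene_desc)), key=lambda j: row[j]) for row in sim]
    best_template = [
        max(range(len(template_desc)), key=lambda i: sim[i][j])
        for j in range(len(scene_desc))
    ]
    # Keep a pair only if the preference holds in both directions.
    return [(i, j) for i, j in enumerate(best_scene) if best_template[j] == i]

# Toy 2-D descriptors (hypothetical values, not real DINO features).
template = [[1.0, 0.0], [0.0, 1.0]]
scene = [[0.1, 0.9], [0.9, 0.1], [0.7, 0.7]]
matches = mutual_matches(template, scene)
```

In a full pipeline, the surviving pairs would be lifted to 3D and passed to a RANSAC pose solver; that stage is omitted here.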

[100] Optimal transport unlocks end-to-end learning for single-molecule localization

Romain Seailles,Jean-Baptiste Masson,Jean Ponce,Julien Mairal

Main category: cs.CV

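TL;DR: Proposes an end-to-end method for single-molecule localization microscopy (SMLM) based on an optimal-transport loss and an iterative neural network. It avoids non-maximum suppression (NMS) and outperforms existing techniques at moderate and high emitter densities.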

Details Motivation: Existing SMLM methods rely on non-maximum suppression (NMS) layers, which are non-differentiable and may discard true signals, limiting performance under dense emission and preventing end-to-end training. Method: Reformulate the SMLM training objective as a set-matching problem and design an optimal-transport loss that removes NMS; build an iterative neural network architecture that incorporates prior knowledge of the microscope's optical system. Result: Experiments on synthetic and real biological data show that the method outperforms the state of the art at moderate and high emitter densities. Conclusion: With a differentiable optimal-transport loss and a physics-informed network, the method achieves better SMLM reconstruction, especially in high-density settings, and advances live-cell imaging applications. Abstract: Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores -- fluorescent molecules stained onto the observed specimen -- over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at https://github.com/RSLLES/SHOT.
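
To make the set-matching view concrete: when predicted and true emitter sets have the same size and uniform weights, optimal transport reduces to finding the one-to-one assignment with the smallest total cost. The sketch below brute-forces that assignment for tiny sets. It illustrates the general idea under these assumptions and is not the loss from the paper, which must also handle differing set sizes and scale to many emitters; the names are hypothetical.

```python
from itertools import permutations

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def set_matching_loss(pred, target):
    """Minimum total squared distance over all one-to-one assignments.

    Brute force over permutations: only suitable for very small sets.
    """
    assert len(pred) == len(target)
    return min(
        sum(sq_dist(pred[i], target[j]) for i, j in enumerate(perm))
        for perm in permutations(range(len(target)))
    )

# Two predicted emitter positions and two ground-truth positions (toy values).
pred = [(0.0, 0.0), (1.0, 1.0)]
target = [(1.0, 1.5), (0.0, 0.5)]
loss = set_matching_loss(pred, target)
```

Because the minimum runs over all assignments, the loss does not depend on the order in which the emitters are listed, which is the property a set-based objective needs.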

[101] Sharp Monocular View Synthesis in Less Than a Second

Lars Mescheder,Wei Dong,Shiwei Li,Xuyang Bai,Marcel Santos,Peiyun Hu,Bruno Lecouat,Mingmin Zhen,Amaël Delaunoy,Tian Fang,Yanghai Tsin,Stephan R. Richter,Vladlen Koltun

Main category: cs.CV

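TL;DR: SHARP is a new method for photorealistic view synthesis from a single image. It regresses a 3D Gaussian representation in less than a second and supports real-time rendering and metric camera movements.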

Details Motivation: Existing single-image view synthesis methods are limited in realism, speed, and scale consistency; a more efficient solution with metric capability is needed. Method: SHARP regresses the 3D Gaussian scene representation of the input image in a single feedforward pass through a neural network; the representation has metric scale and renders high-quality novel views in real time on a standard GPU. Result: SHARP generalizes zero-shot across multiple datasets, reduces LPIPS by 25-34% and DISTS by 21-43%, and lowers synthesis time by three orders of magnitude, clearly outperforming the best prior model. Conclusion: SHARP balances efficiency, image quality, and metric accuracy, setting a new bar for single-image view synthesis. Abstract: We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at https://github.com/apple/ml-sharp

[102] CheXmask-U: Quantifying uncertainty in landmark-based anatomical segmentation for X-ray images

Matias Cosarinsky,Nicolas Gaggion,Rodrigo Echeveste,Enzo Ferrante

Main category: cs.CV

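TL;DR: This study proposes an uncertainty estimation method for anatomical landmark-based segmentation of chest X-ray images. It uses the variational latent space of a hybrid neural network architecture to derive two complementary uncertainty measures: latent uncertainty and predictive uncertainty.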

Details Motivation: Uncertainty estimation is essential for the safe clinical deployment of medical image segmentation systems, yet landmark-based segmentation remains underexplored from an uncertainty perspective. Method: Use a hybrid neural network that combines a convolutional encoder with a graph-based generative decoder; obtain latent uncertainty from the distribution parameters of the variational latent space, and compute predictive uncertainty from outputs generated by repeated stochastic sampling. Result: Experiments show that both measures increase with perturbation severity, identify unreliable predictions, and support out-of-distribution detection; the authors release CheXmask-U, a large-scale dataset of 657,566 chest X-ray landmark segmentations, together with accompanying tools. Conclusion: Uncertainty estimation helps improve the robustness and safety of landmark-based anatomical segmentation in chest X-ray applications and supports its safe clinical deployment. Abstract: Uncertainty estimation is essential for the safe clinical deployment of medical image segmentation systems, enabling the identification of unreliable predictions and supporting human oversight. While prior work has largely focused on pixel-level uncertainty, landmark-based segmentation offers inherent topological guarantees yet remains underexplored from an uncertainty perspective. In this work, we study uncertainty estimation for anatomical landmark-based segmentation on chest X-rays. Inspired by hybrid neural network architectures that combine standard image convolutional encoders with graph-based generative decoders, and leveraging their variational latent space, we derive two complementary measures: (i) latent uncertainty, captured directly from the learned distribution parameters, and (ii) predictive uncertainty, obtained by generating multiple stochastic output predictions from latent samples. Through controlled corruption experiments we show that both uncertainty measures increase with perturbation severity, reflecting both global and local degradation. We demonstrate that these uncertainty signals can identify unreliable predictions by comparing with manual ground-truth, and support out-of-distribution detection on the CheXmask dataset. More importantly, we release CheXmask-U (huggingface.co/datasets/mcosarinsky/CheXmask-U), a large scale dataset of 657,566 chest X-ray landmark segmentations with per-node uncertainty estimates, enabling researchers to account for spatial variations in segmentation quality when using these anatomical masks. 
Our findings establish uncertainty estimation as a promising direction to enhance robustness and safe deployment of landmark-based anatomical segmentation methods in chest X-ray. A fully working interactive demo of the method is available at huggingface.co/spaces/matiasky/CheXmask-U and the source code at github.com/mcosarinsky/CheXmask-U.
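
Predictive uncertainty of the kind described above can be summarized per landmark by measuring how much its predicted position varies across stochastic samples. The sketch below uses the sum of the x and y sample variances as that summary; this particular statistic and the function name are assumptions for illustration, and the abstract does not state the exact formula used for CheXmask-U.

```python
import statistics

def per_node_uncertainty(samples):
    """Spread of each landmark across stochastic predictions.

    samples: list of predictions, each a list of (x, y) landmark positions.
    Returns, per landmark, the sum of the population variances of x and y.
    """
    n_nodes = len(samples[0])
    out = []
    for k in range(n_nodes):
        xs = [s[k][0] for s in samples]
        ys = [s[k][1] for s in samples]
        out.append(statistics.pvariance(xs) + statistics.pvariance(ys))
    return out

# Three sampled predictions for two landmarks (toy values).
samples = [
    [(10.0, 20.0), (30.0, 40.0)],
    [(10.0, 20.0), (32.0, 40.0)],
    [(10.0, 20.0), (34.0, 40.0)],
]
unc = per_node_uncertainty(samples)
```

Here the first landmark is identical in every sample and gets zero uncertainty, while the second drifts along x and gets a positive value.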

[103] SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Peizheng Li,Zhenghao Zhang,David Holtz,Hang Yu,Yutong Yang,Yuzhi Lai,Rui Song,Andreas Geiger,Andreas Zell

Main category: cs.CV

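TL;DR: This paper proposes SpaceDrive, a spatially aware vision language model (VLM) framework for autonomous driving. By treating 3D spatial information as explicit positional encodings, it improves understanding of fine-grained spatial relationships and trajectory prediction accuracy.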

Details Motivation: Existing VLMs fall short in understanding fine-grained 3D spatial relationships, which is essential for autonomous driving systems that interact with the physical world. Method: Propose SpaceDrive, which applies a universal positional encoder to 3D coordinates from multi-view depth estimation, historical ego-states, and text prompts; these 3D positional encodings are superimposed on 2D visual tokens and also serve as a task-agnostic coordinate representation that replaces digit tokens for input and output. Result: SpaceDrive achieves state-of-the-art open-loop performance on nuScenes and a Driving Score of 78.02 on the Bench2Drive closed-loop benchmark, second best among existing VLM-based methods. Conclusion: By modeling 3D spatial information explicitly, SpaceDrive strengthens the spatial reasoning and trajectory planning of VLMs in autonomous driving. Abstract: End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating digit-by-digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods.

[104] Video Depth Propagation

Luigi Piccinelli,Thiemo Wandel,Christos Sakaridis,Wim Abbeloos,Luc Van Gool

Main category: cs.CV

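TL;DR: Proposes VeloDepth, an efficient and robust online video depth estimation method that uses spatiotemporal priors and feature propagation to improve temporal consistency and accuracy.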

Details Motivation: Existing video depth estimation methods fall short in either temporal consistency or computational efficiency and struggle to deliver both accuracy and real-time speed. Method: Design a new Propagation Module that propagates and refines depth features using optical-flow warping with learned residual corrections, and enforce temporal consistency structurally through the design. Result: The method achieves state-of-the-art temporal consistency and competitive accuracy on multiple benchmarks, with inference significantly faster than existing methods. Conclusion: VeloDepth offers a practical, efficient, and accurate solution for real-time depth estimation that suits a range of visual perception tasks. Abstract: Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at https://github.com/lpiccinelli-eth/velodepth

[105] IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation

Yuan-Ming Li,Qize Yang,Nan Lei,Shenghao Fu,Ling-An Zeng,Jian-Fang Hu,Xihan Wei,Wei-Shi Zheng

Main category: cs.CV

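TL;DR: This paper proposes IRMoGen, a new motion generation paradigm that interleaves motion generation with assessment and refinement tasks, enabling bidirectional knowledge flow between understanding and generation. On this basis, the authors build IRG-MotionLLM, the first model that seamlessly alternates between generation, assessment, and refinement; with a three-stage training scheme and an automated data engine, it clearly outperforms the baseline model on text-to-motion generation.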

Details Motivation: Existing motion-aware large language models usually treat understanding and generation as separate tasks with no interactive feedback between them, which limits performance. This paper introduces assessment and refinement steps to establish bidirectional knowledge flow and improve generation quality. Method: Propose the IRMoGen paradigm, which iterates generation, assessment, and refinement through text-motion dialogue; build the IRG-MotionLLM model with a three-stage training scheme; develop an automated data engine that synthesizes interleaved reasoning annotations from existing datasets. Result: Experiments show that assessment and refinement tasks significantly improve text-motion alignment; interleaving generation, assessment, and refinement yields consistent gains across training stages; IRG-MotionLLM outperforms the baseline model on standard text-to-motion benchmarks, and cross-evaluator testing confirms its effectiveness. Conclusion: By coupling generation, assessment, and refinement, the IRMoGen paradigm promotes synergy between motion understanding and generation and offers a new direction for motion modeling in multimodal large models. Abstract: Recent advances in motion-aware large language models have shown remarkable promise for unifying motion understanding and generation tasks. However, these models typically treat understanding and generation separately, limiting the mutual benefits that could arise from interactive feedback between tasks. In this work, we reveal that motion assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. Leveraging this insight, we propose Interleaved Reasoning for Motion Generation (IRMoGen), a novel paradigm that tightly couples motion generation with assessment and refinement through iterative text-motion dialogue. To realize this, we introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance. IRG-MotionLLM is developed progressively with a novel three-stage training scheme, initializing and subsequently enhancing native IRMoGen capabilities. To facilitate this development, we construct an automated data engine to synthesize interleaved reasoning annotations from existing text-motion datasets. Extensive experiments demonstrate that: (i) Assessment and refinement tasks significantly improve text-motion alignment; (ii) Interleaving motion generation, assessment, and refinement steps yields consistent performance gains across training stages; and (iii) IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks. 
Cross-evaluator testing further validates its effectiveness. Code & Data: https://github.com/HumanMLLM/IRG-MotionLLM/tree/main.

[106] LDP: Parameter-Efficient Fine-Tuning of Multimodal LLM for Medical Report Generation

Tianyu Zhou,Junyi Tang,Zehui Li,Dahong Qian,Suncheng Xiang

Main category: cs.CV

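TL;DR: This paper proposes LDP, a framework based on multimodal large language models (MLLMs) for generating colonic polyp diagnosis reports. By building the high-quality MMEndo dataset and combining LoRA with DPO, it lowers training cost while markedly improving the accuracy and clinical usability of automated diagnosis reports.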

Details Motivation: Because high-quality multimodal medical data is scarce, traditional automated colonoscopy reporting suffers from inconsistencies and hallucinations and does not meet clinical needs. A scalable, efficient report generation approach that conforms to clinical standards is therefore needed. Method: Build MMEndo, a multimodal dataset of expert-annotated colonoscopy image-text pairs; fine-tune the Qwen2-VL-7B model with parameter-efficient fine-tuning (LoRA) and align its outputs with clinical standards via Direct Preference Optimization (DPO). Result: LDP outperforms existing baselines on both automated metrics and clinical expert ratings, with a Physician Score of 7.2/10, and reduces training compute cost by 833x compared with full fine-tuning; additional validation on the IU-XRay dataset indicates good generalization. Conclusion: LDP offers primary healthcare a scalable, clinically viable solution for automated colonoscopy report generation, addressing both training efficiency and report quality under data scarcity. Abstract: Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework leveraging multimodal large language models (MLLMs) for professional polyp diagnosis report generation. Specifically, we curate MMEndo, a multimodal endoscopic dataset comprising expert-annotated colonoscopy image-text pairs. We fine-tune the Qwen2-VL-7B backbone using Parameter-Efficient Fine-Tuning (LoRA) and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that our LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluations (achieving a Physician Score of 7.2/10), significantly reducing training computational costs by 833x compared to full fine-tuning. The proposed solution offers a scalable, clinically viable path for primary healthcare, with additional validation on the IU-XRay dataset confirming its robustness.

[107] Blood Pressure Prediction for Coronary Artery Disease Diagnosis using Coronary Computed Tomography Angiography

Rene Lisasi,Michele Esposito,Chen Zhao

Main category: cs.CV

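TL;DR: Proposes an automated end-to-end pipeline, combined with a diffusion-based regression model, that predicts blood pressure distributions directly from coronary CT angiography. It avoids the high cost of traditional CFD computation and provides efficient, scalable support for non-invasive coronary artery disease diagnosis.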

Details Motivation: Traditional CFD simulation of coronary blood flow provides valuable hemodynamic markers but is computationally expensive and slow, which makes it hard to use at scale in clinical practice or for training AI models and limits the adoption of physiology-based coronary artery disease assessment. Method: Develop an automated end-to-end pipeline that extracts coronary geometry from CCTA images and generates simulation data, and introduce a diffusion-based regression model that predicts coronary blood pressure distributions directly from CCTA-derived features, with no CFD computation at inference time. Result: Evaluated on a dataset of simulated coronary hemodynamics, the model reaches an R² of 64.42%, a root mean squared error of 0.0974, and a normalized RMSE of 0.154, outperforming several baselines. Conclusion: The method provides a scalable, accessible framework for rapid, non-invasive prediction of coronary blood pressure and supports wider clinical use of physiology-based coronary artery disease diagnosis. Abstract: Computational fluid dynamics (CFD) based simulation of coronary blood flow provides valuable hemodynamic markers, such as pressure gradients, for diagnosing coronary artery disease (CAD). However, CFD is computationally expensive, time-consuming, and difficult to integrate into large-scale clinical workflows. These limitations restrict the availability of labeled hemodynamic data for training AI models and hinder broad adoption of non-invasive, physiology-based CAD assessment. To address these challenges, we develop an end-to-end pipeline that automates coronary geometry extraction from coronary computed tomography angiography (CCTA), streamlines simulation data generation, and enables efficient learning of coronary blood pressure distributions. The pipeline reduces the manual burden associated with traditional CFD workflows while producing consistent training data. We further introduce a diffusion-based regression model designed to predict coronary blood pressure directly from CCTA-derived features, bypassing the need for slow CFD computation during inference. Evaluated on a dataset of simulated coronary hemodynamics, the proposed model achieves state-of-the-art performance, with an R² of 64.42%, a root mean squared error of 0.0974, and a normalized RMSE of 0.154, outperforming several baseline approaches. This work provides a scalable and accessible framework for rapid, non-invasive blood pressure prediction to support CAD diagnosis.
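
For readers who want to relate the three reported numbers to each other, the sketch below computes R², RMSE, and a normalized RMSE from paired true and predicted values. The abstract does not say how the RMSE was normalized; dividing by the range of the true values is one common convention and is an assumption here, as is the function name.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, normalized RMSE) for paired true/predicted values."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    # Assumption: normalize by the range of the true values.
    nrmse = rmse / (max(y_true) - min(y_true))
    return r2, rmse, nrmse

# Toy values, not data from the paper.
r2, rmse, nrmse = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

Other normalizations (by the mean or the standard deviation of the targets) give different values for the same predictions, so a normalized RMSE is only comparable across papers that use the same convention.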

[108] What matters for Representation Alignment: Global Information or Spatial Structure?

Jaskirat Singh,Xingjian Leng,Zongze Wu,Liang Zheng,Richard Zhang,Eli Shechtman,Saining Xie

Main category: cs.CV

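TL;DR: This paper asks which aspect of the target representation matters more for generation quality when training generative models: global semantic information or spatial structure. Large-scale experiments show that spatial structure is more decisive than global semantic performance. Based on this, the authors propose iREPA, which strengthens the transfer of spatial information with a convolution layer and spatial normalization and significantly speeds up convergence.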

Details Motivation: Investigate whether the key factor behind generation performance in representation alignment (REPA) is the global semantic information of the target representation or its spatial structure, challenging the common view that stronger semantic performance necessarily yields better generation. Method: Run a large-scale empirical analysis across 27 vision encoders and multiple model scales; propose iREPA, which replaces the MLP projection layer with a convolution layer and adds a spatial normalization layer for the external representation to strengthen spatial information transfer. Result: Experiments show that spatial structure, not global semantic performance, drives generation quality; iREPA significantly speeds up convergence across encoders, model sizes, and training variants. Conclusion: Spatial structure is critical to generation performance in representation alignment; the design of representation alignment should be revisited with priority on transferring spatial information. Abstract: Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising; spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of spatial information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in <4 lines of code), termed iREPA, consistently improves convergence speed of REPA, across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT, etc.). Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at https://end2end-diffusion.github.io/irepa
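
The abstract characterizes spatial structure as the pairwise cosine similarity between patch tokens. A minimal sketch of that self-similarity matrix is shown below, assuming each patch token is a plain list of floats; the function name and the toy tokens are hypothetical, and this is not the iREPA code itself.

```python
import math

def pairwise_cosine(tokens):
    """Self-similarity matrix S with S[i][j] = cos(tokens[i], tokens[j])."""
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    n = len(tokens)
    return [
        [sum(a * b for a, b in zip(tokens[i], tokens[j])) / (norms[i] * norms[j])
         for j in range(n)]
        for i in range(n)
    ]

# Three toy patch tokens: the first and third point the same way,
# the second is orthogonal to both.
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
S = pairwise_cosine(tokens)
```

The matrix depends only on the directions of the tokens relative to one another, not on their magnitudes, so it captures which patches an encoder treats as alike regardless of overall feature scale.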

[109] Graph Laplacian Transformer with Progressive Sampling for Prostate Cancer Grading

Masum Shah Junayed,John Derek Van Vessem,Qian Wan,Gahie Nam,Sheida Nabavi

Main category: cs.CV

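TL;DR: Proposes a new method for prostate cancer grading on whole-slide images that combines a Graph Laplacian Attention-Based Transformer (GLAT) with an Iterative Refinement Module (IRM). By dynamically selecting key regions and reinforcing spatial consistency, it outperforms existing methods on multiple datasets.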

Details Motivation: Existing whole-slide image analysis methods mostly use random or static patch selection, which pulls in redundant or non-informative regions and hurts performance; they also struggle to capture diagnostically relevant regions and the spatial structure of tissue. Method: Combine GLAT with IRM. IRM extracts local features with a pretrained ResNet50, scores patch importance with a foundation model in no-gradient mode, and iteratively refines patch selection. GLAT builds a graph with patches as nodes, enforces spatial consistency through graph Laplacian constraints, and enhances discriminative tissue features with a learnable filtering mechanism. A convex aggregation mechanism dynamically adjusts patch weights to produce a robust slide-level representation. Result: Experiments on five public datasets and one private dataset show that the method outperforms the state of the art in performance, spatial consistency, and computational efficiency. Conclusion: Combining GLAT and IRM improves grading accuracy on prostate cancer whole-slide images; by focusing dynamically on key regions and modeling spatial relationships in tissue, it offers a more efficient and interpretable approach to pathology image analysis. Abstract: Prostate cancer grading from whole-slide images (WSIs) remains a challenging task due to the large-scale nature of WSIs, the presence of heterogeneous tissue structures, and the difficulty of selecting diagnostically relevant regions. Existing approaches often rely on random or static patch selection, leading to the inclusion of redundant or non-informative regions that degrade performance. To address this, we propose a Graph Laplacian Attention-Based Transformer (GLAT) integrated with an Iterative Refinement Module (IRM) to enhance both feature learning and spatial consistency. The IRM iteratively refines patch selection by leveraging a pretrained ResNet50 for local feature extraction and a foundation model in no-gradient mode for importance scoring, ensuring only the most relevant tissue regions are preserved. The GLAT models tissue-level connectivity by constructing a graph where patches serve as nodes, ensuring spatial consistency through graph Laplacian constraints and refining feature representations via a learnable filtering mechanism that enhances discriminative histological structures. Additionally, a convex aggregation mechanism dynamically adjusts patch importance to generate a robust WSI-level representation. Extensive experiments on five public and one private dataset demonstrate that our model outperforms state-of-the-art methods, achieving higher performance and spatial consistency while maintaining computational efficiency.
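
A graph Laplacian constraint is usually expressed through the quadratic form x^T L x, which is small when neighbouring nodes carry similar features. The sketch below builds the unnormalized Laplacian L = D - A for a small patch graph and evaluates that form for one scalar feature per node. It illustrates the standard construction only; how GLAT actually uses the Laplacian is not detailed in the abstract, and the names here are hypothetical.

```python
def laplacian(adj):
    """Unnormalized graph Laplacian L = D - A for a symmetric adjacency matrix."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]

def smoothness(features, adj):
    """Quadratic form x^T L x for one scalar feature per node.

    Equals the sum over edges (i, j) of a_ij * (x_i - x_j)^2.
    """
    L = laplacian(adj)
    n = len(features)
    return sum(features[i] * L[i][j] * features[j] for i in range(n) for j in range(n))

# Path graph over three patches: 0 - 1 - 2.
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
smooth = smoothness([1.0, 1.0, 1.0], adj)  # identical features on all patches
rough = smoothness([1.0, 0.0, 1.0], adj)   # middle patch disagrees with both neighbours
```

Adding such a term to a training loss penalizes feature maps in which adjacent patches disagree, which is one way to encourage spatial consistency.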

[110] Self-Ensemble Post Learning for Noisy Domain Generalization

Wang Lu,Jindong Wang

Main category: cs.CV

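TL;DR: This paper proposes SEPL, a self-ensemble post-learning method that improves the robustness of domain generalization models under noisy labels through feature probing training and prediction ensemble inference.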

Details Motivation: Existing domain generalization methods degrade under label noise because the noise worsens the enlargement of spurious features in deep layers. Method: Use latent features from the model's intermediate layers to train multiple probing classifiers and diversify features, train the probes with semi-supervised learning, and combine their predictions with a crowdsourcing inference approach. Result: Experiments show that SEPL improves the robustness of existing methods under noise and shows good potential in real-world scenarios. Conclusion: By exploiting the discriminative power of intermediate features and ensembling predictions from multiple views, SEPL strengthens a model's resistance to label noise and distribution shift. Abstract: While computer vision and machine learning have made great progress, their robustness is still challenged by two key issues: data distribution shift and label noise. When domain generalization (DG) encounters noise, noisy labels further exacerbate the emergence of spurious features in deep layers, i.e., spurious feature enlargement, leading to a degradation in the performance of existing algorithms. This paper, starting from domain generalization, explores how to make existing methods work again when they encounter noise. We find that the latent features inside the model have certain discriminative capabilities, and different latent features focus on different parts of the image. Based on these observations, we propose the Self-Ensemble Post Learning approach (SEPL) to diversify the features that can be leveraged. Specifically, SEPL consists of two parts: feature probing training and prediction ensemble inference. It leverages intermediate feature representations within the model architecture, training multiple probing classifiers to fully exploit the capabilities of pre-trained models, while the final predictions are obtained through the integration of outputs from these diverse classification heads. Considering the presence of noisy labels, we employ semi-supervised algorithms to train probing classifiers. Given that different probing classifiers focus on different areas, we integrate their predictions using a crowdsourcing inference approach. Extensive experimental evaluations demonstrate that the proposed method not only enhances the robustness of existing methods but also exhibits significant potential for real-world applications with high flexibility.
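
The abstract does not specify which crowdsourcing inference rule combines the probing classifiers. Majority voting is the simplest rule of that family and is used below as a stand-in, so this is an assumption rather than the SEPL procedure; the function name and the labels are hypothetical.

```python
from collections import Counter

def majority_vote(head_predictions):
    """Fuse label predictions from several probing classifiers.

    head_predictions: one list of predicted labels per classifier,
    all covering the same samples in the same order.
    """
    n_samples = len(head_predictions[0])
    fused = []
    for k in range(n_samples):
        votes = Counter(head[k] for head in head_predictions)
        fused.append(votes.most_common(1)[0][0])  # label with the most votes
    return fused

# Three probing heads voting on three samples (toy labels).
heads = [
    ["cat", "dog", "dog"],
    ["cat", "cat", "dog"],
    ["dog", "cat", "dog"],
]
fused = majority_vote(heads)
```

More elaborate crowdsourcing methods weight each head by an estimate of its reliability instead of counting every vote equally.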

[111] PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

Jianqi Chen,Biao Zhang,Xiangjun Tang,Peter Wonka

Main category: cs.CV

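TL;DR: This paper proposes PoseGAM, a geometry-aware multi-view framework that predicts the 6D pose of unseen objects directly from a query image and multiple template images, without explicit matching.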

Details Motivation: Existing 6D object pose estimation methods usually depend on explicit feature correspondences between the query image and the object model or template images, which is difficult for unseen objects. A method that generalizes better to unseen objects is therefore needed. Method: PoseGAM builds on recent multi-view foundation model architectures and incorporates object geometry through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. It predicts object pose directly from a query image and multiple template images, avoiding explicit matching. The authors also construct a large-scale synthetic dataset of more than 190k objects to improve robustness and generalization. Result: Extensive experiments on multiple benchmarks show state-of-the-art performance, with an average AR improvement of 5.1% over prior methods and gains of up to 17.6% on individual datasets, indicating strong generalization to unseen objects. Conclusion: With geometry-aware mechanisms and a multi-view learning framework, PoseGAM improves the performance and generalization of 6D pose estimation for unseen objects and points toward pose estimation methods that need no explicit matching. Abstract: 6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/ .

[112] SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation

Kehong Gong,Zhengyu Wen,Mingxi Xu,Weixia He,Qi Wang,Ning Zhang,Zhengyu Li,Chenbin Li,Dongze Lian,Wei Zhao,Xiaoyu He,Mingyuan Zhang

Main category: cs.CV

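TL;DR: This paper proposes SWiT-4D, a parameter-free, temporally consistent 4D mesh generation method based on a sliding-window Transformer. It integrates seamlessly with any DiT-based image-to-3D generator and produces high-quality animated 3D assets from monocular video with very little 4D supervision.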

Details Motivation: Because large-scale real 4D mesh datasets are lacking, generating high-quality 4D content from monocular video remains challenging; existing methods also depend on heavy 4D supervision and struggle to keep temporal consistency, so a new method is needed that exploits image-to-3D prior models and reduces reliance on 4D data. Method: SWiT-4D adopts a sliding-window Transformer that adds spatial-temporal modeling across frames without changing the forward process of the original image-to-3D model; an optimization-based trajectory module recovers global translation; the method supports videos of arbitrary length and achieves parameter-free, lossless temporal extension. Result: Fine-tuned on only a single video shorter than 10 seconds, SWiT-4D outperforms existing methods on the in-domain zoo-test set and on out-of-domain C4D, Objaverse, and in-the-wild videos, with strong geometric fidelity and temporal stability. Conclusion: SWiT-4D uses data efficiently and generates 4D content stably, producing high-quality, temporally coherent results under very limited 4D supervision, which indicates potential for practical deployment. Abstract: Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (<10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. 
Project page: https://animotionlab.github.io/SWIT4D/

[113] MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Jingli Lin,Runsen Xu,Shaohao Zhu,Sihan Yang,Peizhou Cao,Yunlong Ran,Miao Hu,Chenming Zhu,Yiman Xie,Yilin Long,Wenbo Hu,Dahua Lin,Tai Wang,Jiangmiao Pang

Main category: cs.CV

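TL;DR: This paper presents MMSI-Video-Bench, a fully human-annotated benchmark for evaluating the spatial intelligence of multimodal large language models (MLLMs) on video. It covers four levels (perception, planning, prediction, and cross-video reasoning) and reveals a large gap between current models and humans in spatial understanding.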

Details Motivation: Existing multimodal large language models lack a comprehensive evaluation of spatial understanding over continuous visual input; a holistic benchmark is needed to advance the field. Method: Build MMSI-Video-Bench, a high-quality benchmark of 1,106 questions over 1,278 video clips, with a four-level evaluation framework and three domain-oriented sub-benchmarks; 3DV experts review each item for precision and interpretability, and 25 mainstream MLLMs are evaluated systematically. Result: Most models perform near chance, and the best model still trails humans by nearly 60%; fine-grained error analysis reveals systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence; typical frame-sampling strategies, 3D spatial cues, and chain-of-thought prompting all fail to improve performance meaningfully. Conclusion: MMSI-Video-Bench provides a challenging testbed for video-based spatial intelligence, exposes the limits of current MLLMs, and points to directions for improvement. Abstract: Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework, Perception, Planning, Prediction, and Cross-Video Reasoning, through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human--AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. 
We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.

[114] From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Zongzhao Li,Xiangzhe Kong,Jiahui Su,Zongyang Ma,Mingze Li,Songyou Li,Yuelin Zhang,Yu Rong,Tingyang Xu,Deli Zhao,Wenbing Huang

Main category: cs.CV

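TL;DR: This paper introduces the concept of Microscopic Spatial Intelligence (MiSI) and builds the MiSI-Bench benchmark to evaluate how well vision-language models understand microscopic spatial relationships. Current models remain far below human level, although a fine-tuned small model can surpass humans on some tasks, which underscores the importance of integrating explicit scientific knowledge.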

Details Motivation: To advance the understanding of spatial relationships among invisible microscopic entities in scientific discovery, the capabilities of existing vision-language models in microscopic spatial intelligence need to be assessed with a systematic benchmark. Method: The authors propose the MiSI-Bench benchmark framework, with over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine tasks that range from elementary spatial transformations to complex relational identification. Result: Current state-of-the-art vision-language models perform significantly below human level on the benchmark; a fine-tuned 7B model surpasses humans on spatial transformation tasks but performs poorly on scientifically grounded tasks such as hydrogen bond recognition. Conclusion: Although small fine-tuned models can excel at specific tasks, progress toward scientific AGI requires further integration of explicit domain knowledge. Abstract: This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at https://huggingface.co/datasets/zongzhao/MiSI-bench.

[115] MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Kehong Gong,Zhengyu Wen,Weixia He,Mingxi Xu,Qi Wang,Ning Zhang,Zhengyu Li,Dongze Lian,Wei Zhao,Xiaoyu He,Mingyuan Zhang

Main category: cs.CV

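TL;DR: This paper presents MoCapAnything, a framework for Category-Agnostic Motion Capture (CAMoCap) that takes a monocular video and an arbitrary 3D character asset and generates a skeletal animation that drives that asset, supporting high-quality motion retargeting across species and skeletons.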

Details Motivation: Most existing motion capture techniques are limited to a specific species or template and lack general support for arbitrary 3D assets, which restricts their use in diverse content creation. Method: MoCapAnything combines three learnable modules (a Reference Prompt Encoder, a Video Feature Extractor, and a Unified Motion Decoder) with a lightweight inverse kinematics (IK) stage, using constraint-aware IK to recover asset-specific rotation animations. Result: The method generates high-quality skeletal animations on both in-domain benchmarks and in-the-wild videos, and achieves effective motion retargeting across species and heterogeneous rigs. Conclusion: MoCapAnything enables scalable, asset-prompted 3D motion capture and advances general-purpose, plug-and-play motion capture systems. Abstract: Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: https://animotionlab.github.io/MoCapAnything/

[116] PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction

Brandon Smock,Valerie Faucon-Morin,Max Sokolov,Libin Liang,Tayyibah Khanam,Maury Courtland

Main category: cs.CV

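TL;DR: This paper introduces PubTables-v2, a new large-scale dataset that supports challenging tasks such as multi-page table structure recognition, and uses it to develop the Page-Object Table Transformer (POTATR) for end-to-end, image-to-graph table extraction.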

Details Motivation: Progress in table extraction has been difficult to demonstrate because annotated data is lacking, particularly for table structure recognition in multi-page document contexts, so a large-scale, high-quality dataset is needed to move the field forward. Method: The authors build PubTables-v2, a new large-scale dataset supporting several currently challenging table extraction tasks; they use it to evaluate domain-specialized vision-language models (VLMs) and propose POTATR, an extension of the Table Transformer for comprehensive page-level table extraction. Result: PubTables-v2 is the first large-scale benchmark for multi-page table structure recognition; experiments show its usefulness for evaluating VLMs, and it is used to train POTATR for page-level table extraction. Conclusion: PubTables-v2 is an important resource for table extraction research and advances table understanding in complex document settings, while POTATR shows the potential of converting images directly into structured graph representations. Abstract: Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.

[117] DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Peiying Zhang,Nanxuan Zhao,Matthew Fisher,Yiran Xu,Jing Liao,Difan Liu

Main category: cs.CV

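TL;DR: DuetSVG is a unified multimodal model that jointly generates image tokens and the corresponding SVG tokens end to end. A test-time scaling strategy uses the model's visual predictions to guide SVG decoding, producing SVGs that are more visually faithful, semantically consistent, and syntactically clean.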

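Details Motivation: Existing SVG generation methods based on vision-language models generate only text during decoding and lack visual signals, so they struggle with complex semantics and often fail to produce visually appealing or geometrically coherent SVGs. Method: The authors propose DuetSVG, a unified multimodal model that jointly generates image and SVG tokens and is trained on both image and SVG datasets; at inference, a new test-time scaling strategy uses the model's own visual predictions to guide SVG decoding. Result: Experiments show that DuetSVG outperforms existing methods across a range of applications, producing SVGs that are visually faithful, semantically aligned, and syntactically clean. Conclusion: By combining visual and symbolic generation, DuetSVG improves the quality and consistency of SVG generation and offers a new approach to vector graphics generation under complex semantics. Abstract: Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.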

[118] FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Yulu Gan,Ligeng Zhu,Dandan Shan,Baifeng Shi,Hongxu Yin,Boris Ivanovic,Song Han,Trevor Darrell,Jitendra Malik,Marco Pavone,Boyi Li

Main category: cs.CV

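TL;DR: This paper presents FoundationMotion, a fully automated data curation pipeline for building large-scale, fine-grained motion datasets that improve models' understanding of motion and spatial reasoning.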

Details Motivation: Existing motion datasets depend on costly manual annotation, which limits their scale and leaves current models performing poorly on motion understanding tasks. Method: The pipeline detects and tracks objects in videos to extract motion trajectories, then uses large language models (LLMs) with those trajectories and the video frames to generate fine-grained captions and diverse question-answer pairs. Result: Open-source models fine-tuned on datasets produced by the pipeline outperform strong closed-source and open-source baselines, such as Gemini-2.5 Flash and Qwen2.5-VL-72B, on multiple motion understanding benchmarks. Conclusion: FoundationMotion offers a scalable way to build high-quality motion datasets and effectively improves models' motion understanding and spatial reasoning. Abstract: Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.

[119] BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

Shengao Wang,Wenqi Wang,Zecheng Wang,Max Whitton,Michael Wakeham,Arjun Chandra,Joey Huang,Pengyue Zhu,Helen Chen,David Li,Jeffrey Li,Shawn Li,Andrew Zagula,Amy Zhao,Andrew Zhu,Sayaka Nakamura,Yuki Yamamoto,Jerry Jun Yokono,Aaron Mueller,Bryan A. Plummer,Kate Saenko,Venkatesh Saligrama,Boqing Gong

Main category: cs.CV

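TL;DR: BabyVLM-V2 is a vision-language modeling framework grounded in infant developmental trajectories that advances vision foundation models through a longitudinal multimodal pretraining set and the DevCV Toolbox.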

Details Motivation: Inspired by early childhood developmental trajectories, the work aims to design a more sample-efficient, developmentally plausible pretraining approach for vision foundation models. Method: The authors build an infant-centric audiovisual corpus containing several forms of data and propose the DevCV Toolbox, an evaluation benchmark of ten multimodal cognitive tasks. Result: A compact model trained from scratch achieves competitive performance on the DevCV Toolbox, outperforming GPT-4o on some tasks. Conclusion: BabyVLM-V2 provides a principled, unified framework for research on developmentally grounded vision foundation models. Abstract: Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.

[120] Any4D: Unified Feed-Forward Metric 4D Reconstruction

Jay Karhade,Nikhil Keetha,Yuchen Zhang,Tanisha Gupta,Akash Sharma,Sebastian Scherer,Deva Ramanan

Main category: cs.CV

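TL;DR: This paper proposes Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction that supports multiple sensor inputs and substantially improves accuracy and compute efficiency.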

Details Motivation: Existing methods are mostly limited to two-view scene flow or sparse point tracking and struggle to fuse multimodal sensor data; an efficient, unified framework for dense, metric-scale 4D reconstruction is missing. Method: Any4D uses a modular 4D scene representation that combines egocentric factors expressed in camera coordinates (depth maps and intrinsics) with allocentric factors expressed in world coordinates (extrinsics and scene flow), and a multi-view transformer predicts per-pixel motion and geometry for N frames. Result: The model performs strongly across diverse setups, with 2-3x lower error and 15x faster computation, and supports RGB, RGB-D, IMU, and Radar Doppler inputs. Conclusion: Any4D offers an efficient, flexible, and scalable solution for dense 4D reconstruction under multimodal conditioning and opens avenues for downstream applications. Abstract: We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.
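
The split between egocentric and allocentric factors described for Any4D can be illustrated with standard pinhole geometry. The following is a minimal sketch, not the Any4D implementation: it assumes a pinhole camera, treats the extrinsics as a camera-to-world rigid transform, and uses hypothetical function names (`unproject`, `cam_to_world`, `advect`) that do not come from the paper.

```python
# Sketch under stated assumptions (pinhole camera, camera-to-world extrinsics).
# Function names are hypothetical and not taken from the Any4D paper or code.

def unproject(u, v, depth, fx, fy, cx, cy):
    """Egocentric step: back-project pixel (u, v) with metric depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def cam_to_world(p_cam, R, t):
    """Allocentric step: apply the rigid transform p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )

def advect(p_world, flow):
    """Move a world-space point by its scene-flow vector to get its next position."""
    return tuple(p + f for p, f in zip(p_world, flow))

# Example: camera located at world position (1, 0, 0), axes aligned with the world.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 0.0, 0.0]
p_cam = unproject(u=320.0, v=240.0, depth=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = cam_to_world(p_cam, R, t)
p_next = advect(p_world, flow=(0.0, 0.0, 0.5))
```

Keeping depth and intrinsics in the camera frame and extrinsics and scene flow in the world frame means each factor can be supplied by a sensor or predicted by a model independently, which is one plausible reading of why the abstract calls the representation modular.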

[121] GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting

Madhav Agarwal,Mingtian Zhang,Laura Sevilla-Lara,Steven McDonagh

Main category: cs.CV

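TL;DR: This paper proposes a Gaussian Splatting method guided by 3D Morphable Models that is driven directly by audio to generate real-time, stable talking-head videos, maintaining high visual fidelity while achieving good temporal consistency.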

Details Motivation: Existing speech-driven talking-head methods trade off real-time performance against temporal stability, which makes them hard to use in real applications. Method: 3D Morphable Models guide Gaussian Splatting to generate person-specific avatars, and a transformer network predicts the model parameters directly from audio to ensure temporal consistency. Result: From monocular video and independent audio input, the method generates high-quality, real-time talking-head videos with competitive quantitative and qualitative results. Conclusion: The method resolves temporal instability while preserving high visual quality, moving Gaussian Splatting closer to practical use in interactive avatars. Abstract: Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods either achieve high visual fidelity but are slow, or are fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle with one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos where we report competitive quantitative and qualitative performance.

[122] OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis

Xiang Fan,Sharath Girish,Vivek Ramanujan,Chaoyang Wang,Ashkan Mirzaei,Petr Sushko,Aliaksandr Siarohin,Sergey Tulyakov,Ranjay Krishna

Main category: cs.CV

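TL;DR: OmniView is a unified diffusion framework that generalizes across a range of 4D consistency tasks, including novel view synthesis, dynamic video generation, and text- or image-to-video generation with camera control, and it substantially outperforms existing methods on several benchmarks.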

Details Motivation: Existing camera-controlled diffusion methods are confined to specific subsets of 4D consistency tasks and are trained on fragmented data, so they lack generality. Method: OmniView represents space, time, and view conditions separately, allowing flexible combinations of inputs and supporting multiple 4D tasks within one framework. Result: On multiview NVS, dynamic NVS, static camera control, and text-conditioned video generation, image quality scores improve by 20% to 60% and camera trajectory errors drop by 4x. Conclusion: OmniView demonstrates that a generalist 4D video model with strong generalization is feasible, advancing general-purpose visual generation models. Abstract: Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, amongst others. Therefore, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at https://snap-research.github.io/OmniView/

[123] Mull-Tokens: Modality-Agnostic Latent Thinking

Arijit Ray,Ahmed Abdelkader,Chengzhi Mao,Bryan A. Plummer,Kate Saenko,Ranjay Krishna,Leonidas Guibas,Wen-Sheng Chu

Main category: cs.CV

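TL;DR: This paper proposes Mull-Tokens, modality-agnostic latent tokens for multimodal reasoning that can carry intermediate information freely between the text and image modalities, improving performance on spatial reasoning tasks.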

Details Motivation: Existing multimodal reasoning models depend on specialist tools, costly image generation, or handcrafted data, which makes them brittle and hard to scale; a simpler, more flexible way to reason freely across modalities is needed. Method: Mull-Tokens are modality-agnostic latent tokens pre-trained to hold intermediate information in either text or image form; they are first trained with supervision from interleaved text-image traces and then fine-tuned without supervision, using only the final answers. Result: Across four challenging spatial reasoning benchmarks (such as puzzle solving and perspective taking), Mull-Tokens improve on text-only and interleaved image-text baselines by +3% on average and by up to +16% on a reasoning-heavy puzzle-solving split. Conclusion: Mull-Tokens offer a simple and effective way for models to think abstractly across modalities, strengthening reasoning about space, time, affordances, and other complex real-world problems. Abstract: Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a puzzle solving reasoning-heavy split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.

[124] VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Delong Chen,Mustafa Shukor,Theo Moutakanni,Willy Chung,Jade Yu,Tejaswi Kasarla,Allen Bolourchi,Yann LeCun,Pascale Fung

Main category: cs.CV

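TL;DR: VL-JEPA is a vision-language model built on the Joint Embedding Predictive Architecture (JEPA). By predicting continuous embeddings of the target text instead of autoregressively generating tokens, it trains and runs more efficiently and compactly, and it outperforms existing models on several tasks.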

Details Motivation: Conventional vision-language models (VLMs) rely on autoregressive token generation, which is computationally expensive and spends capacity on surface-level linguistic variation; VL-JEPA aims to learn in an abstract representation space so that it focuses on task-relevant semantics and reduces redundant computation and parameters. Method: Following the JEPA framework, the model predicts continuous embedding vectors of the target text; it is compared against a standard VLM with the same vision encoder and training data; a lightweight text decoder is invoked only when needed to turn embeddings into text; selective decoding reduces the number of decoding operations. Result: Under matched conditions, VL-JEPA is stronger than the standard VLM with 50% fewer trainable parameters; selective decoding cuts decoding operations by 2.85x at similar performance; VL-JEPA surpasses CLIP, SigLIP2, and Perception Encoder on eight video classification and eight video retrieval datasets; on four VQA datasets it is comparable to InstructBLIP and QwenVL despite having only 1.6B parameters. Conclusion: By predicting in embedding space, VL-JEPA offers a more efficient, versatile, and parameter-lean approach to vision-language modeling that natively supports generation, classification, retrieval, and discriminative VQA without architectural changes. Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.

[125] AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation

Sharath Girish,Viacheslav Ivanov,Tsai-Shien Chen,Hao Chen,Aliaksandr Siarohin,Sergey Tulyakov

Main category: cs.CV

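TL;DR: AlcheMinT is a unified framework for subject-driven video generation that introduces explicit timestamp conditioning and a new positional encoding mechanism, enabling precise control over when multiple subjects appear and disappear in a video.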

Details Motivation: Existing subject-driven video generation methods lack fine-grained temporal control over subject appearance and disappearance, which limits their use in compositional video synthesis, storyboarding, and controllable animation. Method: AlcheMinT uses a novel positional encoding mechanism to model the temporal intervals associated with subject identities, and adds subject-descriptive text tokens to strengthen the binding between visual identity and the text prompt; it integrates with the pretrained model through token-wise concatenation, without additional cross-attention modules. Result: The authors establish a benchmark covering multi-subject identity preservation, video fidelity, and temporal adherence; experiments show that AlcheMinT maintains high-quality video generation while, for the first time, enabling precise temporal control over multi-subject generation. Conclusion: AlcheMinT achieves fine-grained temporal control over multiple subjects in subject-driven video with almost no added parameters, extending personalized video generation to complex temporal editing tasks. Abstract: Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multiple subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at https://snap-research.github.io/Video-AlcheMinT

[126] MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

Henghui Ding,Chang Liu,Shuting He,Kaining Ying,Xudong Jiang,Chen Change Loy,Yu-Gang Jiang

Main category: cs.CV

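TL;DR: This paper presents MeViS, a large-scale multi-modal dataset for segmenting and tracking target objects in videos from language descriptions of their motion. It emphasizes the role of motion in video understanding, evaluates the limitations of existing methods, and proposes LMPM++, which achieves the best performance.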

Details Motivation: Existing referring video segmentation datasets mostly focus on salient objects and language rich in static attributes, which underplays the role of motion in both video and language; a dataset that emphasizes motion expressions is needed to advance pixel-level video understanding based on motion reasoning. Method: The authors build MeViS, containing 33,072 human-annotated motion expressions (text and audio) covering 8,171 objects in 2,006 videos of complex scenes; on this basis they establish benchmarks for four tasks (RVOS, AVOS, RMOT, RMEG) and propose the improved model LMPM++ for RVOS/AVOS/RMOT. Result: The 15 existing methods evaluated on MeViS all perform poorly, showing the limits of current approaches to motion expression-guided video understanding; the proposed LMPM++ reaches new state-of-the-art results on multiple tasks. Conclusion: MeViS provides a valuable platform for motion expression-based video understanding, advances research on using motion cues for pixel-level video analysis, and reveals both the shortcomings of existing models in motion reasoning and directions for future work. Abstract: This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at https://henghuiding.com/MeViS/

[127] Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving

Jiawei Yang,Ziyu Chen,Yurong You,Yan Wang,Yiming Li,Yuxiao Chen,Boyi Li,Boris Ivanovic,Marco Pavone,Yue Wang

Main category: cs.CV

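TL;DR: Flex is an efficient and effective scene encoder that uses learnable scene tokens to jointly encode multi-view image information without relying on 3D priors such as BEV, raising inference speed while substantially improving autonomous driving performance.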

Details Motivation: The work addresses the computational bottleneck of processing multi-camera data in end-to-end autonomous driving and challenges the conventional assumption that 3D structural priors such as BEV are required. Method: Flex is a geometry-agnostic scene encoder that uses a small set of learnable scene tokens to learn a compact scene representation directly from the image tokens of all cameras and timesteps, without explicit 3D inductive biases. Result: Evaluated on 20,000 hours of driving data, Flex achieves 2.2x greater inference throughput than existing methods, substantially improves driving performance, and shows an emergent, unsupervised capability for scene decomposition. Conclusion: A data-driven, joint encoding strategy is more efficient and scalable than approaches that depend on 3D priors and offers a better path for future autonomous driving systems. Abstract: We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on explicit 3D inductive biases, such as Bird's-Eye-View (BEV), occupancy or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, our Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient and effective path for future autonomous driving systems.

[128] ClusIR: Towards Cluster-Guided All-in-One Image Restoration

Shengkai Hu,Jiaqi Ma,Jun Wan,Wenwen Min,Yongcheng Jing,Lefei Zhang,Dacheng Tao

Main category: cs.CV

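TL;DR: This paper proposes ClusIR, a framework that explicitly models degradation semantics through a cluster-guided mechanism and propagates cluster-aware cues across the spatial and frequency domains, enabling adaptive image restoration for diverse degradations.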

Details Motivation: Existing all-in-one image restoration methods struggle to model degradation types explicitly and adapt poorly to complex or mixed degradations. Method: ClusIR comprises a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM); learnable clustering separates degradation recognition from expert activation, and adaptive decomposition and modulation are performed in the frequency domain. Result: Experiments on multiple benchmarks show that ClusIR achieves competitive performance across a variety of degradation scenarios. Conclusion: Through cluster-guided synergy, ClusIR combines semantic cues with frequency-domain modulation and improves image restoration under multiple degradations. Abstract: All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios.

[129] E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

Qitao Zhao,Hao Tan,Qianqian Wang,Sai Bi,Kai Zhang,Kalyan Sunkavalli,Shubham Tulsiani,Hanwen Jiang

Main category: cs.CV

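TL;DR: This paper presents E-RayZer, a large 3D vision model that learns truly 3D-aware representations from unlabeled multi-view images in a self-supervised way. Unlike earlier methods, E-RayZer performs explicit geometric modeling and self-supervised 3D reconstruction directly in 3D space, which avoids shortcut solutions; a novel fine-grained learning curriculum provides convergence and scalability, and the model surpasses existing methods on a range of 3D downstream tasks.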

Details Motivation: Self-supervised pre-training has succeeded for language, 2D images, and video, but learning 3D-aware representations from multi-view images remains underexplored, and most methods infer 3D only indirectly and lack geometric consistency. Method: E-RayZer performs explicit geometric modeling and self-supervised 3D reconstruction directly in 3D space; a new fine-grained learning curriculum organizes samples from easy to hard and harmonizes heterogeneous data sources in an unsupervised manner. Result: E-RayZer significantly outperforms RayZer on pose estimation and matches or even surpasses fully supervised models such as VGGT; when transferred to 3D downstream tasks, its learned representations outperform leading visual pre-training models such as DINOv3, CroCo v2, VideoMAE V2, and RayZer. Conclusion: Through explicit 3D modeling and curriculum learning, E-RayZer achieves more geometrically grounded self-supervised 3D representation learning and sets a new paradigm for 3D-aware visual pre-training. Abstract: Self-supervised pre-training has revolutionized foundation models for languages, individual 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation, and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.

[130] Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration

Sicheng Mo,Thao Nguyen,Richard Zhang,Nick Kolkin,Siddharth Srinivasan Iyer,Eli Shechtman,Krishna Kumar Singh,Yong Jae Lee,Bolei Zhou,Yuheng Li

Main category: cs.CV

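TL;DR: This paper proposes Group Diffusion, which shares the attention mechanism across images so that samples are denoised collaboratively, substantially improving generation quality.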

Details Motivation: The work explores an untapped signal in diffusion model inference: whether samples can be generated collaboratively instead of independently. Method: Group Diffusion shares the attention mechanism across images at inference time, enabling joint denoising and modeling both intra-image and inter-image correspondence. Result: Larger group sizes yield stronger cross-sample attention and better generation quality; FID improves by up to 32.2% on ImageNet-256x256. Conclusion: Cross-sample inference is an effective and previously unexplored mechanism for generative modeling. Abstract: In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra- and inter-image correspondence. We observe a clear scaling effect - larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.
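
The idea of letting attention span the patches of several images can be shown with a toy example. This is a minimal sketch and not the GroupDiff implementation: it uses single-head attention with identity projections, assumes every image has the same number of patch tokens, and the names `per_image_attention` and `group_attention` are hypothetical.

```python
import math

def attention(tokens):
    """Single-head self-attention with identity Q/K/V projections over a list of tokens."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) / z for i in range(d)])
    return out

def per_image_attention(group):
    """Baseline: each image attends only to its own patches."""
    return [attention(img) for img in group]

def group_attention(group):
    """Group variant: flatten all patches so every patch can attend across images."""
    n = len(group[0])  # assumes every image has the same number of patches
    flat = [tok for img in group for tok in img]
    mixed = attention(flat)
    return [mixed[i * n:(i + 1) * n] for i in range(len(group))]

# Two toy "images" with two 2-D patch tokens each.
img_a = [[1.0, 0.0], [0.0, 1.0]]
img_b = [[2.0, 2.0], [0.5, -1.0]]
independent = per_image_attention([img_a, img_b])
joint = group_attention([img_a, img_b])
```

In the baseline, the output for `img_a` is the same whatever else is in the batch; in the group variant it changes with `img_b`, which is the cross-sample dependence the abstract refers to. How the actual model trains for or restricts this sharing is not specified in the abstract.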

[131] Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Tsai-Shien Chen,Aliaksandr Siarohin,Guocheng Gordon Qian,Kuan-Chieh Jackson Wang,Egor Nemchinov,Moayed Haji-Ali,Riza Alp Guler,Willi Menapace,Ivan Skorokhodov,Anil Kag,Jun-Yan Zhu,Sergey Tulyakov

Main category: cs.CV

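TL;DR: Proposes Omni-Attribute, the first open-vocabulary image attribute encoder, for high-fidelity, attribute-specific visual concept personalization; through joint design of data and model, it achieves state-of-the-art performance on multiple benchmarks.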

Details Motivation: Existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it hard to isolate a single attribute, often causing information leakage and incoherent synthesis. Method: (1) Curates a dataset of semantically linked image pairs annotated with positive and negative attributes, explicitly teaching the encoder what to preserve or suppress; (2) adopts a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. Result: The method performs well on open-vocabulary attribute retrieval, personalization, and compositional generation, reaching state-of-the-art performance on multiple benchmarks. Conclusion: Omni-Attribute effectively disentangles visual attributes and enables precise visual concept personalization, offering a new solution for open-vocabulary image editing. Abstract: Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.
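
The abstract does not give the form of the dual objective, so the sketch below is only one plausible reading: a generative loss plus a weighted contrastive term. Everything in it is an assumption made for illustration, including the InfoNCE-style term with a single positive and a single negative, the cosine similarity, the `temperature` and `weight` values, and the function names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_term(anchor, positive, negative, temperature=0.1):
    """InfoNCE-style loss with one positive and one negative (assumed form).

    anchor:   embedding of image A for a given attribute
    positive: embedding of the paired image B for an attribute the pair shares
    negative: embedding of image B for an attribute the pair does not share
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = math.exp(cosine(anchor, negative) / temperature)
    return -math.log(pos / (pos + neg))

def dual_objective(generative_loss, anchor, positive, negative, weight=0.5):
    """Hypothetical combined loss: generative fidelity + weighted contrastive term."""
    return generative_loss + weight * contrastive_term(anchor, positive, negative)

# Made-up 2-D embeddings: the positive aligns with the anchor, the negative does not.
aligned = contrastive_term([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
swapped = contrastive_term([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In this toy setup `aligned` is smaller than `swapped`, so minimizing the contrastive term pulls embeddings of a shared attribute together and pushes embeddings of a non-shared attribute apart, while the `weight` parameter trades this off against the generative loss.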

[132] Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision

Wentao Zhou,Xuweiyi Chen,Vignesh Rajagopal,Jeffrey Chen,Rohan Chandra,Zezhou Cheng

Main category: cs.CV

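TL;DR: Proposes StereoWalker, which augments robot navigation foundation models with stereo vision and explicit mid-level vision (such as depth estimation and pixel tracking), significantly improving navigation in dynamic, unstructured environments while reducing reliance on large amounts of training data.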

Details Motivation: Monocular vision suffers from depth-scale ambiguity, and ignoring mid-level vision priors means that large amounts of supervised data are needed in dynamic, complex environments, which limits the efficiency and performance of existing navigation foundation models. Method: Proposes StereoWalker, which uses stereo inputs to resolve depth-scale ambiguity and incorporates the geometric and motion structure provided by modern mid-level vision models; also curates a large stereo navigation dataset with automatic action annotation for training. Result: With mid-level vision, StereoWalker matches state-of-the-art performance using only 1.5% of the training data and surpasses it with the full data; stereo input also outperforms monocular input. Conclusion: Relying on monocular vision while ignoring mid-level vision priors is inefficient; combining stereo inputs with explicit mid-level vision substantially improves sample efficiency and navigation performance, pointing to a more efficient design for robot navigation foundation models. Abstract: The success of foundation models in language and vision motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
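
The claim that stereo inputs resolve the depth-scale ambiguity can be illustrated with the textbook relation for a rectified pinhole stereo pair, Z = f * B / d, where f is the focal length in pixels, B the baseline between the cameras, and d the disparity in pixels. The abstract does not say how StereoWalker consumes stereo input, so this is a sketch of the underlying geometry only; the function name and the rig parameters are hypothetical.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Metric depth of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 800 px focal length, 0.5 m baseline.
near = depth_from_disparity(800.0, 0.5, 40.0)  # 10.0 m
far = depth_from_disparity(800.0, 0.5, 10.0)   # 40.0 m
```

Because the baseline is a known physical length, the recovered depth is in metres. A single monocular image has no such reference, so a scene and a uniformly scaled copy of it produce the same image.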

[133] SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model

Yukai Shi,Weiyu Li,Zihao Wang,Hongyang Li,Xingyu Chen,Ping Tan,Lei Zhang

Main category: cs.CV

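TL;DR: Proposes SceneMaker, a decoupled 3D scene generation framework that separates the de-occlusion model from 3D object generation and introduces a unified pose estimation model, improving geometric quality and pose accuracy under severe occlusion and open-set settings.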

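Details Motivation: Lacking sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to produce both high-quality geometry and accurate poses under severe occlusion and open-set conditions. Method: First decouples the de-occlusion model from 3D object generation and strengthens it with image datasets and collected de-occlusion datasets to cover diverse open-set occlusion patterns; then proposes a unified pose estimation model that integrates global and local mechanisms into both self-attention and cross-attention; additionally constructs an open-set 3D scene dataset to improve generalization. Result: Extensive experiments show that the decoupled framework outperforms existing methods on both indoor and open-set scenes, with markedly better de-occlusion and pose estimation accuracy. Conclusion: Through its decoupled design and unified pose estimation, SceneMaker achieves higher-quality 3D scene generation under complex occlusion and open-set settings; code and datasets are publicly released. Abstract: We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. In addition, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at https://idea-research.github.io/SceneMaker/.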

[134] WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Ao Liang,Lingdong Kong,Tianyi Yan,Hongsi Liu,Wesley Yang,Ziqi Huang,Wei Yin,Jialong Zuo,Yixuan Hu,Dekai Zhu,Dongyue Lu,Youquan Liu,Guangfeng Jiang,Linfeng Li,Xiangtai Li,Long Zhuo,Lai Xing Ng,Benoit R. Cottereau,Changxin Gao,Liang Pan,Wei Tsang Ooi,Ziwei Liu

Main category: cs.CV

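TL;DR: Proposes WorldLens, a comprehensive benchmark for evaluating generative world models across five aspects (generation, reconstruction, action-following, downstream tasks, and human preference); together with the large-scale human-annotated dataset WorldLens-26K and the evaluation agent WorldLens-Agent, it forms a unified ecosystem for measuring the visual realism, geometric consistency, physical plausibility, and functional reliability of generated worlds.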

Details Motivation: Existing generative world models look realistic but often fail on physics and behavioral consistency, and the field lacks a unified evaluation standard, making it hard to judge whether generated worlds truly follow how the real world works. Method: Proposes the WorldLens benchmark with five evaluation aspects; constructs WorldLens-26K, a dataset of human-annotated videos, and trains WorldLens-Agent as an explainable, scalable automatic evaluation model. Result: No current model excels across all dimensions: models with strong textures often violate physics, while geometrically stable ones show low behavioral fidelity; WorldLens-Agent effectively aligns objective metrics with human judgment. Conclusion: Generated worlds should be judged not by visual realism alone but also by geometric, physical, and functional fidelity; WorldLens provides a standardized, multi-dimensional evaluation framework for future models. Abstract: Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.

[135] StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Tjark Behrens,Anton Obukhov,Bingxin Ke,Fabio Tosi,Matteo Poggi,Konrad Schindler

Main category: cs.CV

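TL;DR: StereoSpace is a diffusion-based framework for monocular-to-stereo synthesis that models geometry through viewpoint conditioning, without explicit depth or warping.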

Details Motivation: Existing monocular-to-stereo synthesis methods typically rely on explicit depth estimation or image warping, which can cause geometry leakage and inconsistency; a more robust, depth-free approach is needed. Method: Proposes StereoSpace, which uses a diffusion model for end-to-end viewpoint-conditioned generation in a canonical rectified space; the conditioning guides the generator to infer correspondences and fill disocclusions, without any ground-truth depth or proxy geometry. Result: Without ground truth or proxy geometry estimates at test time, StereoSpace outperforms existing methods on metrics such as iSQoE and MEt3R, showing sharp parallax and strong robustness on layered and non-Lambertian scenes. Conclusion: Viewpoint-conditioned diffusion is a scalable, depth-free solution for stereo generation. Abstract: We introduce StereoSpace, a diffusion-based framework for monocular-to-stereo synthesis that models geometry purely through viewpoint conditioning, without explicit depth or warping. A canonical rectified space and the conditioning guide the generator to infer correspondences and fill disocclusions end-to-end. To ensure fair and leakage-free evaluation, we introduce an end-to-end protocol that excludes any ground truth or proxy geometry estimates at test time. The protocol emphasizes metrics reflecting downstream relevance: iSQoE for perceptual comfort and MEt3R for geometric consistency. StereoSpace surpasses other methods from the warp & inpaint, latent-warping, and warped-conditioning categories, achieving sharp parallax and strong robustness on layered and non-Lambertian scenes. This establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.