cs.CL

[1] Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models

Lucky Susanto,Musa Izzanardi Wijanarko,Khumaisa Nur'aini,Farid Adilazuarda,Alham Fikri Aji,Derry Tanti Wijaya

Main category: cs.CL

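TL;DR: This paper examines whether pixel-based text rendering in the multimodal language model DualGPT truly bypasses tokenization. Reintroducing a text tokenizer brings back script-tokenizer misalignment for local non-Latin scripts (such as Javanese and Balinese), which undermines the original goal of bypassing the tokenization bottleneck. A custom tokenizer substantially outperforms the general-purpose Llama 2 tokenizer, showing that tokenizers remain a key barrier to equitable modeling of low-resource languages.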

Details Motivation: To examine whether visual rendering really frees a model from tokenization constraints, and in particular to test the effect of script-tokenizer alignment in the DualGPT architecture on four Indonesian low-resource languages written in non-Latin scripts. Method: On Javanese, Balinese, Sundanese, and Lampungnese, compare the Llama 2 tokenizer with a custom tokenizer inside the DualGPT architecture, evaluating OOV rate, fertility, and chrF++. Result: Despite its lower OOV and fertility rates, the Llama 2 tokenizer performs significantly worse than the custom tokenizer, which improves chrF++ by up to 30.15; visual rendering does not fully avoid tokenizer-induced misalignment. Conclusion: Text tokenizers remain a key barrier to equitable multimodal language models for low-resource, non-Latin-script languages; future work should design tokenizers carefully or avoid relying on general-purpose ones. Abstract: While pixel-based language modeling aims to bypass the sub-word tokenization bottleneck by rendering text as images, recent multimodal variants such as DualGPT reintroduce text tokenizers to improve autoregressive performance. We investigate a fundamental question: does visual rendering truly decouple a model from tokenization constraints? Focusing on four Indonesian low-resource local languages that have their own non-Latin scripts (i.e., Javanese, Balinese, Sundanese, and Lampungnese), we evaluate the impact of script-tokenizer alignment within the DualGPT architecture. Our results show that, despite visual rendering, reintegrating a text tokenizer into the architecture reintroduces the same issue that pixel-based language modeling aims to resolve, namely the tokenizer misalignment problem. We show that the Llama 2 tokenizer, despite having lower OOV and fertility rates, performs significantly worse than a custom tokenizer, which yields improvements of up to 30.15 chrF++. Our findings serve as a warning for future multimodal variants, as text tokenizers remain a significant barrier to equitable models.
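
The two tokenizer statistics named in this abstract can be illustrated with a small sketch. It assumes the common definitions (fertility as the average number of tokens per word, OOV rate as the share of tokens mapped to an unknown symbol); the paper may compute them differently. The greedy tokenizer, the vocabulary, and the `<unk>` symbol are hypothetical stand-ins, not the Llama 2 or custom tokenizer.

```python
from typing import Callable, List, Set

Tokenizer = Callable[[str], List[str]]


def make_greedy_tokenizer(vocab: Set[str], unk: str = "<unk>") -> Tokenizer:
    """Hypothetical toy tokenizer: greedy longest match, one <unk> per unmatched character."""
    def tokenize(word: str) -> List[str]:
        tokens, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                # No vocabulary entry starts at position i.
                tokens.append(unk)
                i += 1
        return tokens
    return tokenize


def fertility(words: List[str], tokenize: Tokenizer) -> float:
    """Average number of tokens produced per word (assumed definition)."""
    return sum(len(tokenize(w)) for w in words) / len(words)


def oov_rate(words: List[str], tokenize: Tokenizer, unk: str = "<unk>") -> float:
    """Fraction of produced tokens that are the unknown symbol (assumed definition)."""
    tokens = [t for w in words for t in tokenize(w)]
    return sum(t == unk for t in tokens) / len(tokens)


toy_tokenize = make_greedy_tokenizer({"ab", "c"})
toy_words = ["abc", "abd"]
toy_fertility = fertility(toy_words, toy_tokenize)
toy_oov = oov_rate(toy_words, toy_tokenize)
```

As the abstract stresses, low values on both statistics did not translate into better downstream chrF++ in the reported experiments, so they are best read as descriptive diagnostics rather than quality guarantees.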

[2] BiomechAgent: AI-Assisted Biomechanical Analysis Through Code-Generating Agents

R. James Cotton,Thomas Leonard

Main category: cs.CL

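TL;DR: BiomechAgent is a code-generating AI agent for clinical users that turns natural-language requests into code, simplifying biomechanical analysis of markerless motion capture data and supporting data querying, visualization, and clinical reasoning.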

Details Motivation: Markerless motion capture is becoming widely available, but analyzing the resulting data remains a barrier for clinicians without a programming background; low-barrier, usable analysis tools are needed. Method: Build BiomechAgent, a code-generating AI agent that combines biomechanics-specific instructions, specialized tools (such as gait event detection), and a systematic benchmark covering data retrieval, visualization, activity classification, temporal segmentation, and clinical reasoning; compare the effects of domain-specific prompts, tool integration, and a local open-weight model. Result: BiomechAgent is robust on data retrieval and visualization and shows emerging clinical reasoning ability; biomechanics-specific instructions and specialized tool integration significantly improve performance; the local open-weight model falls substantially behind the frontier cloud model in most domains other than database retrieval. Conclusion: BiomechAgent lowers the barrier to analyzing motion capture data and improves clinical usefulness and accessibility, confirming the value of domain adaptation and tool augmentation for medical AI agents. Abstract: Markerless motion capture is making quantitative movement analysis increasingly accessible, yet analyzing the resulting data remains a barrier for clinicians without programming expertise. We present BiomechAgent, a code-generating AI agent that enables biomechanical analysis through natural language and allows users to query databases, generate visualizations, and even interpret data without writing code. To evaluate BiomechAgent's capabilities, we developed a systematic benchmark spanning data retrieval, visualization, activity classification, temporal segmentation, and clinical reasoning. BiomechAgent achieved robust accuracy on data retrieval and visualization tasks and demonstrated emerging clinical reasoning capabilities. We used our dataset to systematically evaluate several of our design decisions. Biomechanically-informed, domain-specific instructions significantly improved performance over generic prompts, and integrating validated specialized tools for gait event detection substantially boosted accuracy on challenging spatiotemporal analysis where the base agent struggled. We also tested BiomechAgent using a local open-weight model instead of a frontier cloud-based LLM and found that performance was substantially diminished in most domains other than database retrieval. In short, BiomechAgent makes the data from accessible motion capture much more useful and accessible to end users.

[3] Bridging the Knowledge Void: Inference-time Acquisition of Unfamiliar Programming Languages for Coding Tasks

Chen Shen,Wei Cheng,Jingyue Yang,Huan Zhang,Yuhan Wu,Wei Hu

Main category: cs.CL

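TL;DR: This paper proposes the ILA-agent framework, which lets large language models learn an unfamiliar programming language at inference time by interacting with documentation and an execution environment, significantly outperforming conventional retrieval-augmented approaches.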

Details Motivation: Existing large language models perform poorly on programming languages they have not seen, while conventional fine-tuning depends on large amounts of data and adapts poorly to low-resource settings. Method: Propose the Inference-time Language Acquisition (ILA) paradigm and the ILA-agent framework, which gives the model a set of human-like tools (such as consulting documentation and executing code for verification) so that it learns a new language dynamically through structured interaction. Result: On the self-built low-resource benchmark Cangjie-bench (based on the novel statically-typed language Cangjie), ILA-agent significantly outperforms retrieval-augmented baselines on code generation, translation, and program repair; trajectory analysis reveals emergent behavior patterns and remaining performance bottlenecks. Conclusion: ILA-agent demonstrates that inference-time language acquisition is feasible and effective, offering a new paradigm for adapting to low-resource programming languages, although challenges such as depth of knowledge integration and execution robustness remain. Abstract: The proficiency of Large Language Models (LLMs) in coding tasks is often a reflection of their extensive pre-training corpora, and it typically collapses when they are confronted with previously unfamiliar programming languages. Departing from data-intensive finetuning, we investigate the paradigm of Inference-time Language Acquisition (ILA), where an LLM masters an unfamiliar language through dynamic interaction with limited external resources. In this paper, we propose ILA-agent, a general ILA framework that equips LLMs with a set of behavioral primitives. By modeling essential human-like behaviors as a suite of tools, ILA-agent enables LLMs to incrementally explore, apply, and verify language knowledge through structured interactions with the official documentation and execution environment. To provide a rigorous evaluation in a low-resource setting, we construct Cangjie-bench, a multi-task benchmark based on the novel statically-typed language Cangjie. We instantiate ILA-agent for Cangjie and evaluate its performance across code generation, translation, and program repair tasks. Results using diverse LLMs demonstrate that ILA-agent significantly outperforms retrieval-augmented baselines. Further analysis of agent trajectories characterizes the emergent behavior patterns while highlighting persisting performance gaps.

Jacqueline He,Jonathan Hayase,Wen-tau Yih,Sewoong Oh,Luke Zettlemoyer,Pang Wei Koh

Main category: cs.CL

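TL;DR: This paper proposes Anchored Decoding, an inference-time method that suppresses verbatim copying of training data by large language models. It anchors generation to a permissively trained safe model to control risk, substantially lowering copyright risk while preserving generation quality.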

Details Motivation: Modern language models tend to memorize and reproduce sensitive or copyright-protected content from their training data, creating privacy, copyright, and compliance risks. Method: Propose Anchored Decoding, a plug-and-play inference-time method that keeps the generation of a risky LM close to a permissively trained safe LM (such as the newly introduced TinyComma 1.8B); introduce an information budget with per-step constraints that yield a sequence-level guarantee; further propose the byte-level variant Anchored$_{\mathrm{Byte}}$ Decoding, which uses ByteSampler for cross-vocabulary fusion. Result: Across six model pairs and long-form evaluations, the method defines a new Pareto frontier between copyright risk (eliminating up to 75% of the measurable copying gap, averaged over six copying metrics) and utility (fluency and factuality nearly unchanged), at a modest inference overhead. Conclusion: Anchored Decoding offers a practical, controllable, and tunable risk-mitigation scheme for large models trained on mixed-license data, balancing safety with generation quality. Abstract: Modern language models (LMs) tend to memorize portions of their training data and emit verbatim spans. When the underlying sources are sensitive or copyright-protected, such reproduction raises issues of consent and compensation for creators and compliance risks for developers. We propose Anchored Decoding, a plug-and-play inference-time method for suppressing verbatim copying: it enables decoding from any risky LM trained on mixed-license data by keeping generation in bounded proximity to a permissively trained safe LM. Anchored Decoding adaptively allocates a user-chosen information budget over the generation trajectory and enforces per-step constraints that yield a sequence-level guarantee, enabling a tunable risk-utility trade-off. To make Anchored Decoding practically useful, we introduce a new permissively trained safe model (TinyComma 1.8B), as well as Anchored$_{\mathrm{Byte}}$ Decoding, a byte-level variant of our method that enables cross-vocabulary fusion via the ByteSampler framework (Hayase et al., 2025). We evaluate our methods across six model pairs on long-form evaluations of copyright risk and utility. Anchored and Anchored$_{\mathrm{Byte}}$ Decoding define a new Pareto frontier, preserving near-original fluency and factuality while eliminating up to 75% of the measurable copying gap (averaged over six copying metrics) between the risky baseline and a safe reference, at a modest inference overhead.

[5] Free Energy Mixer

Jiecheng Lu,Shihao Yang

Main category: cs.CL

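TL;DR: This paper proposes the Free Energy Mixer (FEM), a new attention read mechanism that achieves channel-wise selective reading through a value-driven log-linear tilt, improving performance while keeping the original computational complexity.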

Details Motivation: Standard attention reads keys/values through a per-head convex average, which blocks channel-wise selection; the paper aims to remove this limitation with a more flexible, value-aware read. Method: Propose the Free Energy Mixer (FEM), a free-energy (log-sum-exp) read that starts from a fast prior supplied by queries/keys and applies a value-driven, per-channel log-linear tilt to form a value-aware posterior read; instantiate a two-level gated variant that is plug-and-play with standard attention, linear attention, linear RNNs, and SSMs. Result: On NLP, vision, and time-series tasks, FEM consistently outperforms strong baselines at matched parameter budgets. Conclusion: FEM provides an efficient, scalable, value-aware attention read that increases expressiveness without raising asymptotic complexity and shows consistent gains across modalities. Abstract: Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
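
To make the "averaging to per-channel selection" behavior concrete, here is a minimal sketch of one plausible form of a free-energy read: for each channel c, the output is (1/beta) * log(sum_i p_i * exp(beta * v_ic)), where p is the prior over indices. This exact formula is an assumption inferred from the abstract's wording, not the paper's definition, and the function name `free_energy_read` is hypothetical; gating and learnable temperatures are omitted.

```python
import math
from typing import List


def free_energy_read(prior: List[float], values: List[List[float]], beta: float) -> List[float]:
    """Per-channel log-sum-exp read under a prior over indices (assumed form).

    prior[i]     : prior weight of index i (e.g. softmax attention weight), sums to 1
    values[i][c] : value of index i in channel c
    beta         : inverse temperature, beta > 0
    """
    n_channels = len(values[0])
    out = []
    for c in range(n_channels):
        # log p_i + beta * v_ic, skipping indices with zero prior mass
        terms = [math.log(p) + beta * v[c] for p, v in zip(prior, values) if p > 0]
        m = max(terms)  # subtract the max for a numerically stable log-sum-exp
        lse = m + math.log(sum(math.exp(t - m) for t in terms))
        out.append(lse / beta)
    return out


toy_prior = [0.5, 0.5]
toy_values = [[0.0, 1.0], [2.0, -1.0]]
near_average = free_energy_read(toy_prior, toy_values, beta=1e-4)  # close to the convex average
near_select = free_energy_read(toy_prior, toy_values, beta=50.0)   # close to the per-channel max
```

With a small beta the read approaches the prior-weighted average of each channel; with a large beta each channel approaches its own maximum over the prior's support, so different channels can effectively pick different indices, which a single convex average cannot do.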

[6] Your Language Model Secretly Contains Personality Subnetworks

Ruimeng Ye,Zihan Wang,Zinan Ling,Yang Xiao,Manling Li,Xiaolong Ma,Bo Hui

Main category: cs.CL

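TL;DR: This paper finds that the parameter space of large language models (LLMs) already contains subnetworks specialized for different personas, and proposes a lightweight, training-free method that extracts them using only activation statistics and contrastive pruning, yielding markedly better persona alignment at lower cost.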

Details Motivation: To determine whether LLMs must rely on external knowledge (prompting, RAG, or fine-tuning) to adapt to different personas, or whether their parameters already encode persona-related knowledge. Method: Use small calibration datasets to identify the activation signatures of different personas and design a masking strategy that extracts lightweight persona subnetworks; for binary-opposing personas (such as introvert-extrovert), introduce a contrastive pruning strategy that locates the parameters responsible for the statistical divergence. Result: Across diverse evaluation settings, the method achieves significantly stronger persona alignment than baselines that require external knowledge, while being more efficient; this supports the existence of persona-specialized subnetworks in the LLM parameter space. Conclusion: Persona behavior in LLMs is not only induced externally but is already embedded in the parameter space, offering a new perspective on controllable and interpretable personalization. Abstract: Humans shift between different personas depending on social context. Large Language Models (LLMs) demonstrate a similar flexibility in adopting different personas and behaviors. Existing approaches, however, typically adapt such behavior through external knowledge such as prompting, retrieval-augmented generation (RAG), or fine-tuning. We ask: do LLMs really need external context or parameters to adapt to different behaviors, or do they already have such knowledge embedded in their parameters? In this work, we show that LLMs already contain persona-specialized subnetworks in their parameter space. Using small calibration datasets, we identify distinct activation signatures associated with different personas. Guided by these statistics, we develop a masking strategy that isolates lightweight persona subnetworks. Building on these findings, we further ask how to discover opposing subnetworks within the model that lead to binary-opposing personas, such as introvert-extrovert. To further enhance separation in binary opposition scenarios, we introduce a contrastive pruning strategy that identifies parameters responsible for the statistical divergence between opposing personas. Our method is entirely training-free and relies solely on the language model's existing parameter space. Across diverse evaluation settings, the resulting subnetworks exhibit significantly stronger persona alignment than baselines that require external knowledge while being more efficient.
Our findings suggest that diverse human-like behaviors are not merely induced in LLMs, but are already embedded in their parameter space, pointing toward a new perspective on controllable and interpretable personalization in large language models.

[7] Open TutorAI: An Open-source Platform for Personalized and Immersive Learning with Generative AI

Mohamed El Hajji,Tarek Ait Baha,Aicha Dakir,Hammou Fadili,Youssef Es-Saady

Main category: cs.CL

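TL;DR: This paper introduces Open TutorAI, an open-source educational platform built on large language models and generative technologies that aims to deliver dynamic, personalized tutoring. The system combines natural language processing with customizable 3D avatars for multimodal interaction, builds a learner-specific AI assistant through a structured onboarding process, and offers content organization, embedded feedback, and dedicated interfaces for learners, educators, and parents, with an emphasis on usability, immersion, and adaptive support.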

Details Motivation: Existing educational chatbots lack contextual adaptability, real-time responsiveness, and pedagogical flexibility, which limits learner engagement and instructional effectiveness; open, integrative platforms that combine AI with immersive technologies are needed to support personalized, meaningful learning. Method: Build the open-source platform Open TutorAI, which integrates LLMs and generative technologies and combines NLP with customizable 3D avatars for multimodal interaction; collect learner goals and preferences through a structured onboarding process to generate a personalized AI assistant; provide both text and avatar interfaces, together with content management, embedded feedback, multi-role interfaces, and learning analytics. Result: The resulting platform offers a modular architecture, generative AI capabilities, and learner analytics, and provides adaptive learning support without requiring technical expertise; it enhances engagement and emotional presence and supports self-regulated learning, creating a more humanized, immersive educational environment. Conclusion: Open TutorAI points toward next-generation intelligent tutoring systems; its open-source framework, extensible design, and human-centered interaction model offer a reusable and evolvable technical path for AI-supported education. Abstract: Recent advances in artificial intelligence have created new possibilities for making education more scalable, adaptive, and learner-centered. However, existing educational chatbot systems often lack contextual adaptability, real-time responsiveness, and pedagogical agility, which can limit learner engagement and diminish instructional effectiveness. Thus, there is a growing need for open, integrative platforms that combine AI and immersive technologies to support personalized, meaningful learning experiences. This paper presents Open TutorAI, an open-source educational platform based on LLMs and generative technologies that provides dynamic, personalized tutoring. The system integrates natural language processing with customizable 3D avatars to enable multimodal learner interaction. Through a structured onboarding process, it captures each learner's goals and preferences in order to configure a learner-specific AI assistant. This assistant is accessible via both text-based and avatar-driven interfaces. The platform includes tools for organizing content, providing embedded feedback, and offering dedicated interfaces for learners, educators, and parents. This work focuses on learner-facing components, delivering a tool for adaptive support that responds to individual learner profiles without requiring technical expertise. Its assistant-generation pipeline and avatar integration enhance engagement and emotional presence, creating a more humanized, immersive learning environment.
Embedded learning analytics support self-regulated learning by tracking engagement patterns and generating actionable feedback. The result is Open TutorAI, which unites modular architecture, generative AI, and learner analytics within an open-source framework. It contributes to the development of next-generation intelligent tutoring systems.

[8] Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs

Tianyu Zhao,Siqi Li,Yasser Shoukry,Salma Elmalaki

Main category: cs.CL

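TL;DR: This paper uses user personality traits as a latent signal to improve personalized question answering with large language models (LLMs); it builds the personality-labeled preference dataset PACIFIC and designs an alignment framework, substantially raising answer-choice accuracy.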

Details Motivation: Existing preference-based LLM personalization is vulnerable to noisy, incomplete, or misleading preferences, whereas stable personality traits can serve as a deeper, more reliable latent signal behind those preferences. Method: (1) Show empirically that personality-consistent preferences substantially improve question-answering accuracy; (2) build PACIFIC, a dataset of 1200 cross-domain preference statements annotated with Big-Five (OCEAN) trait directions; (3) design a framework that lets an LLM automatically retrieve personality-aligned preferences and incorporate them during generation. Result: In personalized question answering, using personality-aligned preferences raises answer-choice accuracy from 29.25% to 76% compared with randomly selected preferences; PACIFIC supports personality-driven preference modeling and evaluation. Conclusion: Personality is an effective and robust basis for preference modeling, and introducing it into LLM personalization can markedly improve reliability and performance. Abstract: User preferences are increasingly used to personalize Large Language Model (LLM) responses, yet how to reliably leverage preference signals for answer generation remains under-explored. In practice, preferences can be noisy, incomplete, or even misleading, which can degrade answer quality when applied naively. Motivated by the observation that stable personality traits shape everyday preferences, we study personality as a principled "latent" signal behind preference statements. Through extensive experiments, we find that conditioning on personality-aligned preferences substantially improves personalized question answering: selecting preferences consistent with a user's inferred personality increases answer-choice accuracy from 29.25% to 76%, compared to using randomly selected preferences. Based on these findings, we introduce PACIFIC (Preference Alignment Choices Inference for Five-factor Identity Characterization), a personality-labeled preference dataset containing 1200 preference statements spanning diverse domains (e.g., travel, movies, education), annotated with Big-Five (OCEAN) trait directions. Finally, we propose a framework that enables an LLM to automatically retrieve personality-aligned preferences and incorporate them during answer generation.

Anagha Kulkarni,Parin Rajesh Jhaveri,Prasha Shrestha,Yu Tong Han,Reza Amini,Behrouz Madahian

Main category: cs.CL

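TL;DR: This paper proposes a long-context, long-form question answering system for legal documents that improves answer quality through domain-vocabulary deconstruction, complex layout parsing, and precise generation, and introduces a coverage metric for evaluating recall.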

Details Motivation: Legal documents have complex layouts, lengthy footnotes, and specialized language, which makes long-context, long-form question answering especially challenging. Method: Propose a question answering system that (a) deconstructs domain terminology to improve retrieval, (b) parses complex document structure and links body text with footnotes, and (c) generates comprehensive answers using precise legal terminology; also design a recall-based coverage metric and build a QA dataset annotated by legal and tax professionals. Result: Comprehensive experiments and ablation studies demonstrate the effectiveness and usability of the system on long legal-document question answering. Conclusion: The system addresses the difficulty of long-context question answering over legal documents and improves answer completeness, while the coverage metric makes human evaluation of recall easier. Abstract: Legal documents have complex document layouts involving multiple nested sections and lengthy footnotes, and they further use specialized linguistic devices like intricate syntax and domain-specific vocabulary to ensure precision and authority. These inherent characteristics of legal documents make question answering challenging, and particularly so when the answer to the question spans several pages (i.e. requires long context) and is required to be comprehensive (i.e. a long-form answer). In this paper, we address the challenges of long-context question answering in the context of long-form answers given the idiosyncrasies of legal documents. We propose a question answering system that can (a) deconstruct domain-specific vocabulary for better retrieval from source documents, (b) parse complex document layouts while isolating sections and footnotes and linking them appropriately, and (c) generate comprehensive answers using precise domain-specific vocabulary. We also introduce a coverage metric that classifies the performance into recall-based coverage categories, allowing human users to evaluate the recall with ease. We curate a QA dataset by leveraging the expertise of professionals from fields such as law and corporate tax. Through comprehensive experiments and ablation studies, we demonstrate the usability and merit of the proposed system.

[10] Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities

Ju Lin,Jing Pan,Ruizhi Li,Ming Sun,Yuzong Liu,Alaa Hassan,Jing Zheng,Florian Metze

Main category: cs.CL

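TL;DR: This paper studies how to give large language models (LLMs) directional multi-talker speech understanding, specifically for smart glasses. It proposes two new approaches, a cascaded system with a source separation front-end and an end-to-end system using serialized output training, both of which use the glasses' multi-microphone array for streaming directional processing and perform well on speech recognition and translation.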

Details Motivation: Existing speech LLMs are mostly trained on single-channel, single-talker data and are hard to apply directly to real multi-talker, multi-channel scenarios such as smart glasses, so their directional speech understanding needs to be strengthened. Method: Propose two ways to integrate directivity: (1) a cascaded system with a source separation front-end module; (2) an end-to-end system trained with serialized output training; both build on the multi-microphone array embedded in smart glasses and support streaming directional speech processing. Result: Experiments show that the proposed methods are effective in giving LLMs directional multi-talker speech understanding, with strong results on both speech recognition and speech translation. Conclusion: Front-end separation or end-to-end modeling combined with a multi-microphone array can give LLMs directional multi-talker speech understanding, offering a feasible technical path for real-world scenarios such as smart glasses. Abstract: Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to directly apply them to multi-talker and multi-channel speech understanding tasks. In this work, we present a comprehensive investigation on how to enable directional multi-talker speech understanding capabilities for LLMs, specifically in the smart glasses use case. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. Both approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner. Experimental results demonstrate the efficacy of our proposed methods in endowing LLMs with directional speech understanding capabilities, achieving strong performance in both speech recognition and speech translation tasks.

[11] Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice

Savan Doshi

Main category: cs.CL

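TL;DR: This paper proposes a risk-sensitive hallucination evaluation framework for assessing the potential harm of large language models in patient-facing medical question answering. It emphasizes quantifying high-risk language (treatment directives, contraindications, urgency cues, and high-risk medications) rather than looking only at factual correctness.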

Details Motivation: Existing hallucination evaluation focuses on factual correctness and ignores how much the harm of different errors varies in clinical settings; in particular, it cannot identify medical advice that sounds plausible, lacks support, and may be acted on by patients. Method: Build a hallucination quantification framework based on risk-bearing language (treatment directives, contraindications, urgency cues, mentions of high-risk medications) and combine it with a relevance measure to identify high-risk, low-grounding failures; apply it to three instruction-tuned language models using controlled patient-facing prompts designed as safety stress tests. Result: Models with similar surface-level behavior show substantially different risk profiles, and standard evaluation metrics fail to capture these differences. Conclusion: Hallucination evaluation should incorporate risk sensitivity, and its validity depends critically on task and prompt design. Abstract: Large language models are increasingly being used in patient-facing medical question answering, where hallucinated outputs can vary widely in potential harm. However, existing hallucination standards and evaluation metrics focus primarily on factual correctness, treating all errors as equally severe. This obscures clinically relevant failure modes, particularly when models generate unsupported but actionable medical language. We propose a risk-sensitive evaluation framework that quantifies hallucinations through the presence of risk-bearing language, including treatment directives, contraindications, urgency cues, and mentions of high-risk medications. Rather than assessing clinical correctness, our approach evaluates the potential impact of hallucinated content if acted upon. We further combine risk scoring with a relevance measure to identify high-risk, low-grounding failures. We apply this framework to three instruction-tuned language models using controlled patient-facing prompts designed as safety stress tests. Our results show that models with similar surface-level behavior exhibit substantially different risk profiles and that standard evaluation metrics fail to capture these distinctions. These findings highlight the importance of incorporating risk sensitivity into hallucination evaluation and suggest that evaluation validity is critically dependent on task and prompt design.
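
The idea of scoring risk-bearing language and pairing it with a relevance measure could be sketched roughly as follows. Everything here is an illustrative assumption rather than the paper's method: the phrase lists in `RISK_LEXICON`, the category-count score, the Jaccard word overlap used as "relevance", and both thresholds are hypothetical.

```python
import re
from typing import Dict, List, Set

# Hypothetical mini-lexicons, one per risk category named in the abstract.
RISK_LEXICON: Dict[str, List[str]] = {
    "treatment_directive": ["stop taking", "increase the dose", "double the dose"],
    "contraindication": ["do not combine", "should not be used with"],
    "urgency_cue": ["immediately", "right away", "emergency"],
    "high_risk_medication": ["warfarin", "insulin", "methotrexate"],
}


def risk_categories(answer: str) -> Set[str]:
    """Return the risk categories whose phrases appear in the answer."""
    text = answer.lower()
    return {
        category
        for category, phrases in RISK_LEXICON.items()
        if any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases)
    }


def risk_score(answer: str) -> float:
    """Fraction of risk categories triggered (assumed scoring rule)."""
    return len(risk_categories(answer)) / len(RISK_LEXICON)


def relevance(question: str, answer: str) -> float:
    """Jaccard overlap of word sets, a crude stand-in for a relevance measure."""
    q = set(re.findall(r"[a-z]+", question.lower()))
    a = set(re.findall(r"[a-z]+", answer.lower()))
    return len(q & a) / len(q | a) if (q | a) else 0.0


def is_high_risk_low_grounding(question: str, answer: str,
                               risk_min: float = 0.5,
                               relevance_max: float = 0.2) -> bool:
    """Flag answers that carry much risk-bearing language but little overlap with the question."""
    return risk_score(answer) >= risk_min and relevance(question, answer) <= relevance_max


toy_question = "Is it normal to feel tired after a long flight?"
risky_answer = "Double the dose of warfarin immediately."
benign_answer = "Feeling tired after a long flight is normal."
```

A real evaluation would need clinically reviewed lexicons or a classifier and a stronger grounding measure; the sketch only shows how a risk score and a relevance score can be combined into a single flag.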

[12] Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation

Geng Liu,Fei Zhu,Rong Feng,Changyi Ma,Shiqi Wang,Gaofeng Meng

Main category: cs.CL

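TL;DR: This paper proposes a Mediator-Assistant architecture that decouples intent understanding from task execution to address the performance drop of large language models in multi-turn conversation caused by vague user intent and model misinterpretation (the "Lost in Conversation" phenomenon).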

Details Motivation: Prior work attributes the performance drop of LLMs in multi-turn conversation to model unreliability; this paper argues that the root cause is an intent alignment gap between users and models rather than a capability deficit. Method: Propose a Mediator-Assistant architecture in which an experience-driven Mediator module uses historical interaction patterns to turn vague user inputs into explicit, well-structured instructions, bridging the gap between user intent and model interpretation. Result: Experiments show that the method significantly mitigates performance degradation in multi-turn conversation across diverse LLMs. Conclusion: The performance drop in multi-turn conversation stems from an intent alignment gap caused by structural ambiguity in conversational context, not from limited representational capacity; the proposed architecture addresses it without relying solely on scaling model size or improving training. Abstract: Multi-turn conversation has emerged as a predominant interaction paradigm for Large Language Models (LLMs). Users often employ follow-up questions to refine their intent, expecting LLMs to adapt dynamically. However, recent research reveals that LLMs suffer a substantial performance drop in multi-turn settings compared to single-turn interactions with fully specified instructions, a phenomenon termed "Lost in Conversation" (LiC). While this prior work attributes LiC to model unreliability, we argue that the root cause lies in an intent alignment gap rather than intrinsic capability deficits. In this paper, we first demonstrate that LiC is not a failure of model capability but rather a breakdown in interaction between users and LLMs. We theoretically show that scaling model size or improving training alone cannot resolve this gap, as it arises from structural ambiguity in conversational context rather than representational limitations. To address this, we propose to decouple intent understanding from task execution through a Mediator-Assistant architecture. By utilizing an experience-driven Mediator to explicate user inputs into explicit, well-structured instructions based on historical interaction patterns, our approach effectively bridges the gap between vague user intent and model interpretation. Experimental results demonstrate that this method significantly mitigates performance degradation in multi-turn conversations across diverse LLMs.

[13] ViHERMES: A Graph-Grounded Multihop Question Answering Benchmark and System for Vietnamese Healthcare Regulations

Long S. T. Nguyen,Quan M. Bui,Tin T. Ngo,Quynh T. N. Vo,Dung N. H. Le,Tho T. Quan

Main category: cs.CL

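TL;DR: This paper introduces the ViHERMES dataset for evaluating multihop question answering over Vietnamese healthcare regulatory documents and designs a graph-aware retrieval framework to improve multihop reasoning.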

Details Motivation: The multihop reasoning ability of existing question answering methods on healthcare regulatory documents, especially in low-resource languages such as Vietnamese, has not been systematically evaluated, mainly because benchmark datasets that support multihop reasoning are missing. Method: Build ViHERMES, a multihop QA dataset over Vietnamese healthcare regulations, using semantic clustering and graph-inspired data mining followed by LLM-based generation of question-answer pairs with structured evidence and reasoning annotations; also propose a graph-aware retrieval framework that models formal legal relations between legal units and supports legally valid, coherent context expansion. Result: Experiments show that ViHERMES is a challenging benchmark for multihop regulatory QA and that the graph-aware approach consistently outperforms strong retrieval-based baselines. Conclusion: ViHERMES fills a gap in multihop regulatory QA evaluation for low-resource languages, and its dataset and system implementation are publicly released as a resource for later work. Abstract: Question Answering (QA) over regulatory documents is inherently challenging due to the need for multihop reasoning across legally interdependent texts, a requirement that is particularly pronounced in the healthcare domain where regulations are hierarchically structured and frequently revised through amendments and cross-references. Despite recent progress in retrieval-augmented and graph-based QA methods, systematic evaluation in this setting remains limited, especially for low-resource languages such as Vietnamese, due to the lack of benchmark datasets that explicitly support multihop reasoning over healthcare regulations. In this work, we introduce the Vietnamese Healthcare Regulations-Multihop Reasoning Dataset (ViHERMES), a benchmark designed for multihop QA over Vietnamese healthcare regulatory documents. ViHERMES consists of high-quality question-answer pairs that require reasoning across multiple regulations and capture diverse dependency patterns, including amendment tracing, cross-document comparison, and procedural synthesis. To construct the dataset, we propose a controlled multihop QA generation pipeline based on semantic clustering and graph-inspired data mining, followed by large language model-based generation with structured evidence and reasoning annotations. We further present a graph-aware retrieval framework that models formal legal relations at the level of legal units and supports principled context expansion for legally valid and coherent answers.
Experimental results demonstrate that ViHERMES provides a challenging benchmark for evaluating multihop regulatory QA systems and that the proposed graph-aware approach consistently outperforms strong retrieval-based baselines. The ViHERMES dataset and system implementation are publicly available at https://github.com/ura-hcmut/ViHERMES.

[14] TernaryLM: Memory-Efficient Language Modeling via Native 1-Bit Quantization with Adaptive Layer-wise Scaling

Nisharg Nargund,Priyesh Shukla

Main category: cs.CL

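TL;DR: TernaryLM is a 132M-parameter Transformer language model with native 1-bit ternary quantization ({-1, 0, +1}) applied during training, which reduces memory use by 2.4x while retaining good language-modeling and downstream-task performance.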

Details Motivation: Large language models have high compute and memory costs that make deployment on edge devices difficult; existing post-training quantization cannot fully exploit extremely low-bit quantization, which motivates architectures that are quantized during training. Method: Propose the TernaryLM architecture, which uses native 1-bit ternary quantization together with straight-through estimators (STE) and adaptive per-layer scaling factors to learn quantization-aware representations end to end during training. Result: Validation perplexity of 58.42 on TinyStories; 82.47% F1 on MRPC; 2.4x memory reduction (498MB vs 1197MB) with comparable inference latency; middle Transformer layers are the most compatible with extreme quantization. Conclusion: Native 1-bit training is a feasible and promising direction for efficient neural language models and provides empirical grounding for non-uniform precision strategies. Abstract: Large language models (LLMs) achieve remarkable performance but demand substantial computational resources, limiting deployment on edge devices and resource-constrained environments. We present TernaryLM, a 132M parameter transformer architecture that employs native 1-bit ternary quantization {-1, 0, +1} during training, achieving significant memory reduction without sacrificing language modeling capability. Unlike post-training quantization approaches that quantize pre-trained full-precision models, TernaryLM learns quantization-aware representations from scratch using straight-through estimators and adaptive per-layer scaling factors. Our experiments demonstrate: (1) validation perplexity of 58.42 on TinyStories; (2) downstream transfer with 82.47 percent F1 on MRPC paraphrase detection; (3) 2.4x memory reduction (498MB vs 1197MB) with comparable inference latency; and (4) stable training dynamics across diverse corpora. We provide layer-wise quantization analysis showing that middle transformer layers exhibit highest compatibility with extreme quantization, informing future non-uniform precision strategies. Our results suggest that native 1-bit training is a promising direction for efficient neural language models. Code is available at https://github.com/1nisharg/TernaryLM-Memory-Efficient-Language-Modeling.
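
As a rough illustration of ternary quantization with a per-layer scale, the sketch below maps a layer's weights to codes in {-1, 0, +1} using the mean absolute weight as the scale. This absmean rule is a commonly used scheme chosen here as an assumption; the abstract does not specify how TernaryLM computes its adaptive scaling factors, and the function names are hypothetical. Only the forward quantization is shown; with a straight-through estimator, the backward pass would treat the rounding step as the identity so gradients reach the full-precision weights.

```python
from typing import List, Tuple


def ternarize(weights: List[float]) -> Tuple[List[int], float]:
    """Quantize one layer's weights to {-1, 0, +1} with a single scale (assumed absmean rule)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Reconstruct approximate weights from ternary codes and the layer scale."""
    return [c * scale for c in codes]


toy_weights = [0.9, -0.05, -1.2, 0.4]
toy_codes, toy_scale = ternarize(toy_weights)
toy_restored = dequantize(toy_codes, toy_scale)
```

Small weights collapse to 0 and the rest keep only their sign, so each layer stores ternary codes plus one floating-point scale.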

[15] Efficient Post-Training Pruning of Large Language Models with Statistical Correction

Peiqi Yu,Jinhao Wang,Xinyi Sui,Nam Ling,Wei Wang,Wei Jiang

Main category: cs.CL

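TL;DR: This paper proposes a lightweight post-training pruning framework based on first-order statistics of weights and activations. It calibrates importance scores with channel-wise statistics and applies an analytic energy compensation, requires no retraining, gradients, or second-order information, and improves pruning performance while staying computationally efficient.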

Details Motivation: Existing post-training pruning methods trade pruning quality against computational efficiency: heuristic methods are efficient but sensitive to activation outliers, while reconstruction-based methods have higher fidelity but heavy computational cost. Method: Use first-order statistics of weights and activations; calibrate magnitude-based importance scores with channel-wise statistics, then apply an analytic energy compensation after pruning to correct distributional distortion; the whole procedure needs no retraining, gradients, or second-order information. Result: Experiments across multiple LLM families, sparsity patterns, and evaluation tasks show that the method improves pruning performance at a computational cost comparable to heuristic methods. Conclusion: Simple statistical corrections alone can be effective for post-training pruning, suggesting a new direction for efficient LLM compression. Abstract: Post-training pruning is an effective approach for reducing the size and inference cost of large language models (LLMs), but existing methods often face a trade-off between pruning quality and computational efficiency. Heuristic pruning methods are efficient but sensitive to activation outliers, while reconstruction-based approaches improve fidelity at the cost of heavy computation. In this work, we propose a lightweight post-training pruning framework based on first-order statistical properties of model weights and activations. During pruning, channel-wise statistics are used to calibrate magnitude-based importance scores, reducing bias from activation-dominated channels. After pruning, we apply an analytic energy compensation to correct distributional distortions caused by weight removal. Both steps operate without retraining, gradients, or second-order information. Experiments across multiple LLM families, sparsity patterns, and evaluation tasks show that the proposed approach improves pruning performance while maintaining computational cost comparable to heuristic methods. The results suggest that simple statistical corrections can be effective for post-training pruning of LLMs.

[16] Do Large Language Models Reflect Demographic Pluralism in Safety?

Usman Naseem,Gautam Siddharth Kashyap,Sushant Kumar Ray,Rafiq Ali,Ebad Shabbir,Abdullah Mohammad

Main category: cs.CL

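TL;DR: This paper presents Demo-SafetyBench, an LLM safety evaluation benchmark that models demographic pluralism. It decouples value framing from responses at the prompt level and uses multi-model rating to achieve scalable, demographically robust safety evaluation.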

Details Motivation: Existing alignment datasets (such as ANTHROPIC-HH and DICES) rely on demographically narrow annotator pools and overlook how safety perception differs across communities, so they fail to reflect the inherently pluralistic nature of LLM safety. Method: Build Demo-SafetyBench in two stages: Stage I reclassifies DICES prompts into 14 safety domains with Mistral 7B while retaining demographic metadata, then expands low-resource domains with Llama-3.1-8B; Stage II uses Gemma-7B, GPT-4o, and LLaMA-2-7B as raters under zero-shot inference to evaluate pluralistic sensitivity, with balanced thresholds (delta = 0.5, tau = 10) to ensure reliability and low demographic sensitivity. Result: Demo-SafetyBench contains 43,050 samples; multi-model evaluation shows high reliability (ICC = 0.87) and low demographic sensitivity (DS = 0.12), supporting the scalability and demographic robustness of the approach. Conclusion: Explicitly modeling demographic pluralism at the prompt level is feasible and effective, and Demo-SafetyBench offers a basis for fairer, more inclusive LLM safety evaluation. Abstract: Large Language Model (LLM) safety is inherently pluralistic, reflecting variations in moral norms, cultural expectations, and demographic contexts. Yet, existing alignment datasets such as ANTHROPIC-HH and DICES rely on demographically narrow annotator pools, overlooking variation in safety perception across communities. Demo-SafetyBench addresses this gap by modeling demographic pluralism directly at the prompt level, decoupling value framing from responses. In Stage I, prompts from DICES are reclassified into 14 safety domains (adapted from BEAVERTAILS) using Mistral 7B-Instruct-v0.3, retaining demographic metadata and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based deduplication, yielding 43,050 samples. In Stage II, pluralistic sensitivity is evaluated using LLMs-as-Raters (Gemma-7B, GPT-4o, and LLaMA-2-7B) under zero-shot inference. Balanced thresholds (delta = 0.5, tau = 10) achieve high reliability (ICC = 0.87) and low demographic sensitivity (DS = 0.12), confirming that pluralistic safety evaluation can be both scalable and demographically robust.

[17] When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified

Gautam Siddharth Kashyap,Mark Dras,Usman Naseem

Main category: cs.CL

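TL;DR: This paper proposes AlignX, a two-stage framework that addresses Axis Collapse in multi-objective alignment of large language models (helpfulness, harmlessness, honesty), improving performance while reducing resource usage.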

Details Motivation: Existing approaches such as Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) suffer from interference between objectives and miscalibrated routing in multi-objective alignment, leading to "Axis Collapse": disjoint feature spaces cause catastrophic forgetting, and misrouted experts undermine inference reliability. Method: AlignX is a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features and mitigate forgetting; Stage 2 introduces a MoCaE module that calibrates expert routing using fractal and natural geometry to improve inference reliability. Result: Across Alpaca (helpfulness), BeaverTails (harmlessness), and TruthfulQA (honesty), AlignX achieves a +171.5% win rate, +110.1% truthfulness-informativeness, and 4.3% fewer safety violations; latency and memory usage drop by over 35%; results on four LLMs support its generalizability. Conclusion: AlignX mitigates Axis Collapse in multi-objective alignment, combining performance gains with computational efficiency and jointly improving helpfulness, harmlessness, and honesty. Abstract: Aligning Large Language Models (LLMs) with human values (being helpful, harmless, and honest; HHH) is important for safe deployment. Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, these works face challenges in multi-objective settings, such as SFT leading to interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), with +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability.

Debtanu Datta,Rajdeep Mukherjee,Adrijit Goswami,Saptarshi Ghosh

Main category: cs.CL

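TL;DR: This paper proposes a framework that injects legal domain knowledge to improve English-to-English and English-to-Hindi summarization of Indian court judgments. It covers both extractive and generative models, achieves significant gains on standard, factual consistency, and legal domain-specific metrics, and is validated by legal experts.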

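Details Motivation: Indian court judgments use intricate language and are loosely structured, and a large share of the population does not understand the complex English they are written in, so high-quality summaries in both English and Hindi are needed; existing models lack legal domain knowledge and fall short of practical needs. Method: (1) Build domain-specific pre-trained encoders for legal text to strengthen extractive neural summarization models; (2) continually pre-train generative models (including large language models) on large legal corpora in English and Hindi to inject legal domain knowledge. Result: The proposed approaches achieve statistically significant improvements on English-to-English and English-to-Hindi legal summarization across standard metrics (such as ROUGE), factual consistency metrics, and legal domain-specific metrics, and their effectiveness is validated by legal experts. Conclusion: Explicitly injecting legal domain knowledge into extractive and generative summarization models substantially improves multilingual summarization of Indian legal text and offers a viable path for legal AI applications in low-resource languages. Abstract: Summarizing Indian legal court judgments is a complex task not only due to the intricate language and unstructured nature of the legal texts, but also since a large section of the Indian population does not understand the complex English in which legal text is written, thus requiring summaries in Indian languages. In this study, we aim to improve the summarization of Indian legal text to generate summaries in both English and Hindi (the most widely spoken Indian language), by injecting domain knowledge into diverse summarization models. We propose a framework to enhance extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored for legal texts. Further, we explore the injection of legal domain knowledge into generative models (including Large Language Models) through continual pre-training on large legal corpora in English and Hindi. Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization, as measured by standard evaluation metrics, factual consistency metrics, and legal domain-specific metrics. Furthermore, these improvements are validated through domain experts, demonstrating the effectiveness of our approaches.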

[19] Measuring cross-language intelligibility between Romance languages with computational tools

Liviu P Dinu,Ana Sabina Uban,Bogdan Iordache,Anca Dinu,Simona Georgescu

Main category: cs.CL

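TL;DR: This paper proposes a new computational metric based on lexical similarity (both orthographic and semantic) for estimating mutual intelligibility among Romance languages, validates it on the five main Romance languages (French, Italian, Portuguese, Spanish, and Romanian), and finds that the scores correlate significantly with human cloze-test results.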

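Details Motivation: Assess mutual intelligibility among Romance languages, which traditional methods struggle to quantify, and examine how orthographic versus phonetic forms, different corpora, and different word-embedding models affect the estimates. Method: Propose a new computational metric combining surface (orthographic/phonetic) and semantic similarity, and compute pairwise intelligibility scores for the five main Romance languages using several parallel corpora and word-embedding models. Result: The scores reproduce the asymmetry in intelligibility between languages and correlate significantly with human cloze-test results. Conclusion: The lexical-similarity-based metric is an effective and reliable way to estimate mutual intelligibility between related languages, with good psycholinguistic validity. Abstract: We present an analysis of mutual intelligibility in related languages, applied to languages in the Romance family. We introduce a novel computational metric for estimating intelligibility based on lexical similarity using surface and semantic similarity of related words, and use it to measure mutual intelligibility for the five main Romance languages (French, Italian, Portuguese, Spanish, and Romanian), and compare results using both the orthographic and phonetic forms of words as well as different parallel corpora and vectorial models of word meaning representation. The obtained intelligibility scores confirm intuitions related to intelligibility asymmetry across languages and significantly correlate with results of cloze tests in human experiments.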
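
As a rough illustration of the surface (orthographic) component only, the sketch below scores a word pair by normalized Levenshtein distance. This is an assumption for illustration: the abstract does not state which surface measure the authors use, and the function names `levenshtein` and `surface_similarity` are hypothetical. The score is symmetric, so it cannot by itself reproduce the intelligibility asymmetry reported in the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Illustrative cognate pair: Spanish "noche" vs. Italian "notte" (night).
score = surface_similarity("noche", "notte")
```

The same function could be applied to phonetic transcriptions instead of spellings, which is one of the comparisons the abstract mentions.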

[20] DLLM Agent: See Farther, Run Faster

Huiling Zhen,Weizhe Lin,Renxi Liu,Kai Han,Yiming Li,Yuchuan Tian,Hanting Chen,Xiaoguang Li,Xiaosong Li,Chen Chen,Xianzhi Yu,Mingxuan Yuan,Youliang Yan,Peifeng Qin,Jun Wang,Yu Wang,Dacheng Tao,Yunhe Wang

Main category: cs.CL

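TL;DR: This paper studies diffusion large language models (DLLMs) in agentic multi-step decision making and finds that, at comparable accuracy, DLLM agents are on average over 30% faster end to end than autoregressive (AR) agents, with some cases exceeding 8x; they also need fewer interaction rounds and tool calls and plan more efficiently. Deploying DLLMs requires stronger tool-call training and aligned attention masking for multi-turn inputs to avoid information leakage.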

Details Motivation: Investigate whether, when the generation paradigm changes from autoregressive to diffusion while the agent framework and supervision are held fixed, diffusion backbones induce systematically different planning and tool-use behaviors and yield end-to-end efficiency gains. Method: Within a single agent framework (DeepDiver), instantiate DLLM and AR backbones and apply matched agent-oriented fine-tuning on the same trajectory data to obtain comparable DLLM and AR agents; evaluate them through benchmarks, case studies, failure-mode diagnosis, and attention-dynamics analysis. Result: At comparable accuracy, DLLM agents are on average more than 30% faster end to end (over 8x in some cases); on successful tasks they use fewer interaction rounds and tool calls, with higher planner hit rates and faster convergence. Two deployment challenges emerge: structured tool-call failures are more frequent and call for targeted training, and diffusion-style span corruption on multi-turn inputs needs aligned attention masking, without which performance degrades. Attention analysis suggests stronger global planning signals in DLLM agents. Conclusion: DLLMs offer clear efficiency advantages and better planning properties as agent backbones, but realizing this potential in multi-step decision making requires adapted training strategies and architectural choices, such as tool-call optimization and attention-mask alignment. Abstract: Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties, yet their implications for agentic multi-step decision making remain underexplored. We ask a concrete question: when the generation paradigm is changed but the agent framework and supervision are held fixed, do diffusion backbones induce systematically different planning and tool-use behaviors, and do these differences translate into end-to-end efficiency gains? We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow (DeepDiver) and performing matched agent-oriented fine-tuning on the same trajectory data, yielding diffusion-backed DLLM Agents and directly comparable AR agents. Across benchmarks and case studies, we find that, at comparable accuracy, DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup. Conditioned on correct task completion, DLLM Agents also require fewer interaction rounds and tool invocations, consistent with higher planner hit rates that converge earlier to a correct action path with less backtracking. We further identify two practical considerations for deploying diffusion backbones in tool-using agents. First, naive DLLM policies are more prone to structured tool-call failures, necessitating stronger tool-call-specific training to emit valid schemas and arguments. Second, for multi-turn inputs interleaving context and action spans, diffusion-style span corruption requires aligned attention masking to avoid spurious context-action information flow; without such alignment, performance degrades. Finally, we analyze attention dynamics across workflow stages and observe paradigm-specific coordination patterns, suggesting stronger global planning signals in diffusion-backed agents.

[21] SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning

Yijie Chen,Yijin Liu,Fandong Meng

Main category: cs.CL

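TL;DR: This paper proposes SED-SFT, which uses selective entropy regularization to mitigate mode collapse during supervised fine-tuning and thereby improves exploration efficiency and performance in the subsequent reinforcement learning stage.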

Details Motivation: Conventional supervised fine-tuning with cross-entropy loss tends to cause mode collapse, reducing response diversity and hurting exploration efficiency in subsequent reinforcement learning; existing fixes struggle to balance diversity and accuracy. Method: SED-SFT adds to the optimization objective a selective entropy regularization term that adapts to the token exploration space, combined with a selective masking mechanism. Result: Experiments on eight mathematical benchmarks show that SED-SFT significantly increases generation diversity with negligible extra computation; on Llama-3.2-3B-Instruct and Qwen2.5-Math-7B-Instruct, subsequent RL performance improves by 2.06 and 1.20 points on average, respectively. Conclusion: SED-SFT mitigates mode collapse in the SFT stage, increasing distributional diversity while preserving accuracy and providing a better initialization for LLM post-training. Abstract: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). However, the conventional SFT process, driven by Cross-Entropy (CE) loss, often induces mode collapse, where models over-concentrate on specific response patterns. This lack of distributional diversity severely restricts the exploration efficiency required for subsequent RL. While recent studies have attempted to improve SFT by replacing the CE loss, aiming to preserve diversity or refine the update policy, they fail to adequately balance diversity and accuracy, thereby yielding suboptimal performance after RL. To address the mode collapse problem, we propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective. Extensive experiments across eight mathematical benchmarks demonstrate that SED-SFT significantly enhances generation diversity with a negligible computational overhead increase compared with CE loss, yielding average improvements of 2.06 and 1.20 points in subsequent RL performance over standard CE-based baselines on Llama-3.2-3B-Instruct and Qwen2.5-Math-7B-Instruct, respectively. The code is publicly available at https://github.com/pppa2019/SED-SFT
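
To make the idea of a masked entropy bonus concrete, here is a minimal sketch of a cross-entropy loss with an entropy term that is applied only at selected token positions. The selection rule (apply the bonus when predictive entropy is at least `min_entropy`), the weight `lam`, and the function name `sed_style_loss` are all assumptions for illustration; the abstract does not specify the actual mask or objective used by SED-SFT, so consult the linked repository for the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sed_style_loss(token_logits, targets, lam=0.1, min_entropy=0.5):
    """Mean over tokens of: cross-entropy minus a masked entropy bonus.

    Hypothetical mask: the bonus applies only where the predictive entropy
    is at least `min_entropy`, i.e. where several continuations are plausible.
    """
    total = 0.0
    for logits, target in zip(token_logits, targets):
        probs = softmax(logits)
        ce = -math.log(probs[target])
        h = entropy(probs)
        bonus = lam * h if h >= min_entropy else 0.0
        total += ce - bonus
    return total / len(targets)


# One uncertain position (uniform logits) and one near-deterministic position.
loss = sed_style_loss([[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]], [0, 0])
```

With `lam=0` the function reduces to plain cross-entropy, which makes the effect of the regularizer easy to isolate in an ablation.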

[22] From Native Memes to Global Moderation: Cross-Cultural Evaluation of Vision-Language Models for Hateful Meme Detection

Mo Wang,Kaixuan Ren,Pratik Jalan,Ahmed Ashraf,Tuong Vy Vu,Rahul Seetharaman,Shah Nawaz,Usman Naseem

Main category: cs.CL

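TL;DR: This paper presents a systematic evaluation framework for diagnosing and quantifying the cross-cultural robustness of vision-language models (VLMs) on multilingual meme datasets. It finds that culturally aligned strategies such as native-language prompting and one-shot learning significantly improve hateful meme detection, whereas "translate-then-detect" hurts performance.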

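Details Motivation: Cultural context profoundly shapes how people interpret online content, yet current vision-language models (VLMs) are trained mainly from Western or English-centric perspectives, which limits their fairness and robustness in cross-cultural tasks such as hateful meme detection. Method: Build a systematic evaluation framework that analyzes the cross-cultural robustness of VLMs along three axes: (i) learning strategy (zero-shot vs. one-shot), (ii) prompting language (native vs. English), and (iii) the effect of translation on meaning and detection; evaluate empirically on multilingual meme datasets. Result: The "translate-then-detect" approach significantly degrades performance, whereas culturally aligned interventions such as native-language prompting and one-shot learning significantly improve hateful meme detection; models show a systematic convergence toward Western safety norms. Conclusion: Simple translation strategies should be avoided in favor of culturally aligned designs (such as native-language prompting and one-shot learning) in order to build globally robust multimodal moderation systems. Abstract: Cultural context profoundly shapes how people interpret online content, yet vision-language models (VLMs) remain predominantly trained through Western or English-centric lenses. This limits their fairness and cross-cultural robustness in tasks like hateful meme detection. We introduce a systematic evaluation framework designed to diagnose and quantify the cross-cultural robustness of state-of-the-art VLMs across multilingual meme datasets, analyzing three axes: (i) learning strategy (zero-shot vs. one-shot), (ii) prompting language (native vs. English), and (iii) translation effects on meaning and detection. Results show that the common "translate-then-detect" approach deteriorates performance, while culturally aligned interventions - native-language prompting and one-shot learning - significantly enhance detection. Our findings reveal systematic convergence toward Western safety norms and provide actionable strategies to mitigate such bias, guiding the design of globally robust multimodal moderation systems.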

[23] Let's Simplify Step by Step: Guiding LLM Towards Multilingual Unsupervised Proficiency-Controlled Sentence Simplification

Jingshen Zhang,Xin Ying Qiu,Lifang Lu,Zhuhua Huang,Yutao Hu,Yuechang Wu,JunYu Lu

Main category: cs.CL

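TL;DR: This paper proposes a framework for fine-grained, proficiency-controlled sentence simplification based on dynamic path planning, semantic-aware exemplar selection, and chain-of-thought generation with conversation history. On multilingual benchmarks it improves simplification effectiveness with fewer computational steps, but it also exposes a fundamental trade-off between simplification effectiveness and meaning preservation.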

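Details Motivation: Large language models have limited capability in proficiency-controlled sentence simplification when the simplification spans large readability gaps. Method: Propose a framework that decomposes complex simplifications into steps, comprising dynamic path planning, semantic-aware exemplar selection, and chain-of-thought generation with conversation history. Result: Evaluation on five languages across two benchmarks shows improved simplification effectiveness with 22-42% fewer computational steps; human evaluation reveals a trade-off between simplification effectiveness and meaning preservation, and human annotators show low agreement on semantic preservation judgments. Conclusion: Step-by-step simplification improves control, but preserving semantic fidelity during extensive simplification remains an open challenge. Abstract: Large language models demonstrate limited capability in proficiency-controlled sentence simplification, particularly when simplifying across large readability levels. We propose a framework that decomposes complex simplifications into manageable steps through dynamic path planning, semantic-aware exemplar selection, and chain-of-thought generation with conversation history for coherent reasoning. Evaluation on five languages across two benchmarks shows our approach improves simplification effectiveness while reducing computational steps by 22-42%. Human evaluation confirms the fundamental trade-off between simplification effectiveness and meaning preservation. Notably, even human annotators struggle to agree on semantic preservation judgments, highlighting the inherent complexity of this task. Our work shows that while step-by-step simplification improves control, preserving semantic fidelity during extensive simplification remains an open challenge.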

[24] Improving Variable-Length Generation in Diffusion Language Models via Length Regularization

Zicong Cheng,Ruixuan Jia,Jia Li,Guo-Wei Yang,Meng-Hao Guo,Shi-Min Hu

Main category: cs.CL

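TL;DR: This paper proposes LR-DLLM, a framework that uses explicit length regularization to correct length-induced confidence bias in diffusion large language models (DLLMs), enabling reliable variable-length generation without modifying the model or its training procedure.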

Details Motivation: When the target length is unknown (as in completion and infilling), existing DLLMs exhibit a systematic length-induced bias that distorts confidence estimates and leads to under-generation or redundant continuations, leaving them without reliable variable-length inference. Method: LR-DLLM treats generation length as an explicit variable and introduces a length regularization term that decouples semantic compatibility from length-induced uncertainty, correcting the confidence estimates; it supports dynamic adjustment of the generation span and requires no model changes or retraining. Result: 51.3% Pass@1 on HumanEvalInfilling under fully unknown lengths (+13.4% vs. DreamOn) and 51.5% average Pass@1 on four-language McEval (+14.3% vs. DreamOn). Conclusion: LR-DLLM mitigates the length-induced bias of DLLMs and provides a robust, plug-and-play inference solution for variable-length text generation. Abstract: Diffusion Large Language Models (DLLMs) are inherently ill-suited for variable-length generation, as their inference is defined on a fixed-length canvas and implicitly assumes a known target length. When the length is unknown, as in realistic completion and infilling, naively comparing confidence across mask lengths becomes systematically biased, leading to under-generation or redundant continuations. In this paper, we show that this failure arises from an intrinsic length-induced bias in generation confidence estimates, leaving existing DLLMs without a robust way to determine generation length and making variable-length inference unreliable. To address this issue, we propose LR-DLLM, a length-regularized inference framework for DLLMs that treats generation length as an explicit variable and achieves reliable length determination at inference time. It decouples semantic compatibility from length-induced uncertainty through an explicit length regularization that corrects biased confidence estimates. Based on this, LR-DLLM enables dynamic expansion or contraction of the generation span without modifying the underlying DLLM or its training procedure. Experiments show that LR-DLLM achieves 51.3% Pass@1 on HumanEvalInfilling under fully unknown lengths (+13.4% vs. DreamOn) and 51.5% average Pass@1 on four-language McEval (+14.3% vs. DreamOn).

[25] Learning to Self-Verify Makes Language Models Better Reasoners

Yuxin Chen,Yu Wang,Yi Zhang,Ziang Ye,Zhengzhou Cai,Yaorui Shi,Qi Gu,Hui Su,Xunliang Cai,Xiang Wang,An Zhang,Tat-Seng Chua

Main category: cs.CL

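TL;DR: This paper studies the asymmetry between generation and self-verification in large language models: improving generation does not bring matching gains in self-verification, but training for self-verification does improve generation. Building on this, the authors propose a multi-task reinforcement learning framework that jointly optimizes both objectives and validate it on multiple benchmarks.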

Details Motivation: Large language models are strong at generating reasoning paths but weak at verifying their own answers, a persistent asymmetry between generation and self-verification. Method: Analyze the asymmetry over the course of training, then propose a multi-task reinforcement learning framework that treats generation and self-verification as two independent but complementary objectives and optimizes them jointly. Result: The method outperforms generation-only training on both generation and verification; self-verification training alone reaches accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Conclusion: The asymmetry between generation and self-verification is one-directional; explicitly including self-verification in training improves both capabilities and offers a new paradigm for building more reliable and efficient reasoning models. Abstract: Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.

[26] SciClaimEval: Cross-modal Claim Verification in Scientific Papers

Xanh Ho,Yun-Ang Wu,Sunisth Kumar,Tian Cheng Xia,Florian Boudin,Andre Greiner-Petter,Akiko Aizawa

Main category: cs.CL

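TL;DR: This paper presents SciClaimEval, a new dataset for scientific claim verification built from authentic claims (including refuted ones) extracted directly from published papers. Refuted claims are created by modifying the supporting evidence (figures and tables) rather than altering the claims or relying on large language models. The dataset provides cross-modal evidence in image, LaTeX, HTML, and JSON formats and contains 1,664 expert-annotated samples from 180 papers across machine learning, natural language processing, and medicine. Benchmarking 11 open-source and proprietary vision-language models shows that figure-based verification remains a major challenge for all models, with a substantial gap to the human baseline.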

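Details Motivation: Existing scientific claim verification datasets lack authenticity and diversity, and genuinely refuted claims are especially hard to obtain; most approaches rely on LLMs to generate contradictions or edit the claim text, which does not reflect the difficulties of verification in real research settings. Method: Generate refuted claims by modifying the figures and tables of the original papers rather than the claim text; build the cross-modal dataset SciClaimEval, with figures as images and tables in image, LaTeX, HTML, and JSON formats; have domain experts annotate 1,664 samples; benchmark multiple models across three scientific domains. Result: SciClaimEval contains 1,664 annotated samples covering three domains; 11 mainstream multimodal models generally perform poorly on it, with a substantial gap to the human baseline, especially on figure-based verification. Conclusion: SciClaimEval, an authentic, cross-modal, expert-validated dataset for scientific claim verification, fills a gap in current research; the experiments expose fundamental limitations of current multimodal models in understanding scientific figure evidence and point to directions for future model development. Abstract: We present SciClaimEval, a new scientific dataset for the claim verification task. Unlike existing resources, SciClaimEval features authentic claims, including refuted ones, directly extracted from published papers. To create refuted claims, we introduce a novel approach that modifies the supporting evidence (figures and tables), rather than altering the claims or relying on large language models (LLMs) to fabricate contradictions. The dataset provides cross-modal evidence with diverse representations: figures are available as images, while tables are provided in multiple formats, including images, LaTeX source, HTML, and JSON. SciClaimEval contains 1,664 annotated samples from 180 papers across three domains (machine learning, natural language processing, and medicine), validated through expert annotation. We benchmark 11 multimodal foundation models, both open-source and proprietary, across the dataset. Results show that figure-based verification remains particularly challenging for all models, as a substantial performance gap remains between the best system and human baseline.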

[27] Letting Tutor Personas "Speak Up" for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization

Jaewook Lee,Alexander Scarlatos,Simon Woodhead,Andrew Lan

Main category: cs.CL

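TL;DR: This paper proposes an activation steering approach that uses tutor personas extracted from human tutor-student dialogues: a modified Bidirectional Preference Optimization (BiPO) learns steering vectors that let a large language model (LLM) adopt diverse, personalized tutoring styles without explicit prompting. The method substantially improves semantic alignment and preference-based evaluations, and the learned behavior is interpretable.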

Details Motivation: Existing LLM tutoring systems usually learn a single tutoring policy and do not model the diversity of real tutoring styles (such as different levels of scaffolding, directiveness, feedback, and affective support), even though human tutors adapt their behavior to the student's state and their pedagogical intent. Method: Modify Bidirectional Preference Optimization (BiPO) to extract diverse tutor personas from human tutor-student dialogues and learn a steering vector in the model's activation space that implicitly steers the LLM toward responses in a particular tutor's style, without explicit prompts. Result: The learned steering vector captures behavioral differences between tutors across dialogue contexts, improving semantic alignment with ground-truth tutor utterances and preference-based evaluations while largely preserving lexical similarity; analysis of the directional coefficients reveals consistent, interpretable behavior patterns across tutors (such as feedback intensity and degree of guidance). Conclusion: Activation steering is an efficient and interpretable mechanism that lets an LLM model personalized, diverse tutoring behavior from human dialogue data alone, offering a new path toward more natural and adaptive educational AI systems. Abstract: With the emergence of large language models (LLMs) as a powerful class of generative artificial intelligence (AI), their use in tutoring has become increasingly prominent. Prior works on LLM-based tutoring typically learn a single tutor policy and do not capture the diversity of tutoring styles. In real-world tutor-student interactions, pedagogical intent is realized through adaptive instructional strategies, with tutors varying the level of scaffolding, instructional directiveness, feedback, and affective support in response to learners' needs. These differences can all impact dialogue dynamics and student engagement. In this paper, we explore how tutor personas embedded in human tutor-student dialogues can be used to guide LLM behavior without relying on explicitly prompted instructions. We modify Bidirectional Preference Optimization (BiPO) to learn a steering vector, an activation-space direction that steers model responses towards certain tutor personas. We find that this steering vector captures tutor-specific variation across dialogue contexts, improving semantic alignment with ground-truth tutor utterances and increasing preference-based evaluations, while largely preserving lexical similarity. Analysis of the learned directional coefficients further reveals interpretable structure across tutors, corresponding to consistent differences in tutoring behavior. These results demonstrate that activation steering offers an effective and interpretable way for controlling tutor-specific variation in LLMs using signals derived directly from human dialogue data.

[28] Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation

Jiangnan Fang,Cheng-Tse Liu,Hanieh Deilamsalehy,Nesreen K. Ahmed,Puneet Mathur,Nedim Lipka,Franck Dernoncourt,Ryan A. Rossi

Main category: cs.CL

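TL;DR: This paper analyzes the bias of large language models (LLMs) used as judges for summarization. LLM judges increasingly prefer LLM-generated summaries over human-written ones as overlap between the judged summaries decreases, and the bias is widespread, indicating that evaluation needs to go beyond simple comparison.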

Details Motivation: Although LLM judges capture more than traditional metrics in summary evaluation, they show biases for length and order, among others, and are vulnerable to adversarial prompts; existing studies lack a fine-grained bias analysis tied to well-defined overlap metrics such as ROUGE and BLEU. Method: Measure overlap with human-written reference summaries using ROUGE and BLEU, and systematically test 9 recent LLMs with 1B to 12B parameters (including Gemma 3 and LLaMA 3 variants) as judges, analyzing how their preferences vary with overlap while controlling for confounders such as position bias. Result: (1) All but one of the LLM judges increasingly prefer LLM-generated summaries over human-written ones as overlap decreases; (2) the judges struggle to judge accurately even summaries with limited overlap; (3) the effect is independent of the judges' own position biases. Conclusion: LLM-as-a-judge shows a systematic overlap-related bias in summary evaluation; relying on a simple comparison is unreliable, and more robust evaluation techniques are needed. Abstract: Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, are better at reasoning, and are more robust to paraphrasing. However, LLM judges show biases for length and order among others, and are vulnerable to various adversarial input prompts. While recent studies have looked into these biases, few have analyzed them at a more granular level in relation to a well-defined overlap metric. In this work we provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarities (as measured by ROUGE and BLEU) between the judged summaries decrease, and this pattern extends to all but one model tested, and exists regardless of the models' own position biases. Additionally, we find that models struggle to judge even summaries with limited overlaps, suggesting that LLM-as-a-judge in the summary domain should rely on techniques beyond a simple comparison.
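
For readers unfamiliar with overlap metrics, the following is a simplified sketch of ROUGE-1 F1 (unigram overlap) between two summaries. It assumes lowercased whitespace tokenization and omits the stemming and other preprocessing found in standard ROUGE implementations, so its scores will not match those tools exactly; the function name `rouge1_f1` is hypothetical and is not taken from the paper.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection: each shared word counts min(cand, ref) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative pair: a short summary scored against a longer reference.
score = rouge1_f1("the cat", "the cat sat on the mat")
```

In an analysis like the one described, a score of this kind would be computed between the two judged summaries and used to bin the judge's decisions by overlap level.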

[29] SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents

Chen Zhang,Kuicai Dong,Dexun Li,Wenjun Li,Qu Yang,Wei Han,Yong Liu

Main category: cs.CL

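TL;DR: This paper proposes SRR-Judge, a framework for fine-grained, reliable assessment of each reasoning and search step taken by deep search agents. Combined with a rate-and-refine workflow and iterative rejection sampling fine-tuning, it substantially improves performance on complex search tasks.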

Details Motivation: Existing deep search agents built on large reasoning models (LRMs) are mostly trained with outcome-based supervision, which ignores the quality of intermediate reasoning and actions and leaves the training signal incomplete. Method: Propose SRR-Judge for reliable step-level assessment of reasoning and search actions; embed it in a modified ReAct-style rate-and-refine workflow that supports efficient post-training annotation; fine-tune the policy on SRR-annotated data with iterative rejection sampling. Result: SRR-Judge gives more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, and its ratings correlate strongly with final answer correctness; after alignment on SRR-annotated trajectories, average pass@1 improves by more than 10 percentage points across several deep search benchmarks. Conclusion: Fine-grained, reliable step-level supervision is important for improving search-integrated reasoning, and SRR-Judge provides an effective evaluation and training paradigm for building more robust and interpretable deep search agents. Abstract: Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches often train this ability using only outcome-based supervision, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to enhance the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, with its ratings showing strong correlation with final answer correctness. Moreover, aligning the policy with SRR-Judge annotated trajectories leads to substantial performance gains, yielding over a 10 percent average absolute pass@1 improvement across challenging deep search benchmarks.

[30] Attn-GS: Attention-Guided Context Compression for Efficient Personalized LLMs

Shenglai Zeng,Tianqi Zheng,Chuan Tian,Dante Everaert,Yau-Shian Wang,Yupin Huang,Michael J. Morais,Rohit Patki,Jinjin Tian,Xinnan Dai,Kai Guo,Monica Xiao Cheng,Hui Liu

Main category: cs.CL

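TL;DR: This paper proposes Attn-GS, a framework that uses the attention of large language models (LLMs) to identify the information that matters for personalization, enabling efficient context compression that sharply reduces token usage while maintaining performance.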

Details Motivation: Existing personalized LLM methods are limited by input token constraints and cannot include full user interaction histories and profiles; heuristic compression strategies ignore how LLMs internally prioritize different pieces of information. Method: Based on an empirical analysis of LLM attention patterns, Attn-GS first uses attention feedback from a marking model to identify important personalization sentences, then guides a compression model to generate task-relevant, high-quality compressed user contexts. Result: Attn-GS significantly outperforms various baselines across tasks, token limits, and settings, achieving performance close to using the full context while reducing token usage by 50 times. Conclusion: LLM attention patterns can reveal the signals that matter for personalization, and using them to guide compression is a viable way to improve both the efficiency and the quality of personalization. Abstract: Personalizing large language models (LLMs) to individual users requires incorporating extensive interaction histories and profiles, but input token constraints make this impractical due to high inference latency and API costs. Existing approaches rely on heuristic methods such as selecting recent interactions or prompting summarization models to compress user profiles. However, these methods treat context as a monolithic whole and fail to consider how LLMs internally process and prioritize different profile components. We investigate whether LLMs' attention patterns can effectively identify important personalization signals for intelligent context compression. Through preliminary studies on representative personalization tasks, we discover that (a) LLMs' attention patterns naturally reveal important signals, and (b) fine-tuning enhances LLMs' ability to distinguish between relevant and irrelevant information. Based on these insights, we propose Attn-GS, an attention-guided context compression framework that leverages attention feedback from a marking model to mark important personalization sentences, then guides a compression model to generate task-relevant, high-quality compressed user contexts. Extensive experiments demonstrate that Attn-GS significantly outperforms various baselines across different tasks, token limits, and settings, achieving performance close to using full context while reducing token usage by 50 times.

[31] Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models

Ningyu Xu,Qi Zhang,Xipeng Qiu,Xuanjing Huang

Main category: cs.CL

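TL;DR: This paper finds that large language models dynamically construct and use structured latent conceptual representations during in-context inference; these representations form a stable conceptual subspace in middle to late layers and play a causal role in the model's predictions.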

Details Motivation: Determine whether large language models functionally rely on human-like structured conceptual representations for reasoning, rather than merely exhibiting similar behavior. Method: Identify the conceptual subspace through representational analysis, then use causal mediation analysis and layer-wise analysis of attention heads to study its functional role and how it is constructed. Result: A conceptual subspace that is stable across contexts exists in middle to late layers; it plays a causal role in predictions; attention heads in early-to-middle layers construct it, and later layers use it to generate predictions. Conclusion: Large language models dynamically construct and functionally use structured latent representations in context for inference, shedding light on the computational mechanisms behind their flexible adaptation. Abstract: Large language models (LLMs) exhibit emergent behaviors suggestive of human-like reasoning. While recent work has identified structured, human-like conceptual representations within these models, it remains unclear whether they functionally rely on such representations for reasoning. Here we investigate the internal processing of LLMs during in-context concept inference. Our results reveal a conceptual subspace emerging in middle to late layers, whose representational structure persists across contexts. Using causal mediation analyses, we demonstrate that this subspace is not merely an epiphenomenon but is functionally central to model predictions, establishing its causal role in inference. We further identify a layer-wise progression where attention heads in early-to-middle layers integrate contextual cues to construct and refine the subspace, which is subsequently leveraged by later layers to generate predictions. Together, these findings provide evidence that LLMs dynamically construct and use structured, latent representations in context for inference, offering insights into the computational processes underlying flexible adaptation.

[32] Thinking Makes LLM Agents Introverted: How Mandatory Thinking Can Backfire in User-Engaged Agents

Jiatong Li,Changdae Oh,Hyeong Kyu Choi,Jindong Wang,Sharon Li

Main category: cs.CL

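TL;DR: This paper studies explicit reasoning ("thinking") in user-engaged LLM agents and finds that mandatory thinking often degrades performance because it makes agents more "introverted" and less forthcoming with information. Explicitly prompting for information disclosure reliably improves performance, highlighting the importance of information transparency in the design of reasoning agents.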

Details Motivation: Examine whether explicit reasoning is actually effective in realistic user-engaged LLM agent scenarios, where the performance of existing techniques remains unclear. Method: Run comprehensive experiments across 7 models, 3 benchmarks, and 2 thinking instantiations, combining a quantitative response taxonomy analysis with qualitative case studies of failure propagation. Result: Mandatory thinking often causes anomalous performance degradation; thinking shortens agent responses and reduces information disclosure, weakening agent-user information exchange; explicitly prompting for information disclosure reliably improves performance across model families. Conclusion: Awareness of information transparency is a crucial but underexplored dimension for the future design of reasoning agents in real-world scenarios. Abstract: Eliciting reasoning has emerged as a powerful technique for improving the performance of large language models (LLMs) on complex tasks by inducing thinking. However, its effectiveness in realistic user-engaged agent scenarios remains unclear. In this paper, we conduct a comprehensive study on the effect of explicit thinking in user-engaged LLM agents. Our experiments span seven models, three benchmarks, and two thinking instantiations, and we evaluate them through both a quantitative response taxonomy analysis and qualitative failure propagation case studies. Contrary to expectations, we find that mandatory thinking often backfires on agents in user-engaged settings, causing anomalous performance degradation across various LLMs. Our key finding reveals that thinking makes agents more "introverted" by shortening responses and reducing information disclosure to users, which weakens agent-user information exchange and leads to downstream task failures. Furthermore, we demonstrate that explicitly prompting for information disclosure reliably improves performance across diverse model families, suggesting that proactive transparency is a vital lever for agent optimization. Overall, our study suggests that information transparency awareness is a crucial yet underexplored perspective for the future design of reasoning agents in real-world scenarios. Our code is available at https://github.com/deeplearning-wisc/Thinking-Agent.

[33] Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models

Xuan Ding,Pengyu Tong,Ranjie Duan,Yunjian Zhang,Rui Sun,Yao Zhu

Main category: cs.CL

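TL;DR: This paper proposes a game-theoretic layer pruning framework that formulates layer pruning of large language models (LLMs) as a cooperative game and uses a lightweight surrogate network with stratified Monte Carlo sampling to estimate Shapley values efficiently, dynamically identifying critical layers and improving pruning quality at lower computational cost.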

Details Motivation: Existing layer pruning methods depend on static heuristic rules and ignore dependencies between layers, which limits pruning effectiveness; at the same time, the high inference cost of LLMs calls for more efficient pruning strategies. Method: Formulate layer pruning as a cooperative game in which each layer is a player and model performance is the utility; use a lightweight surrogate network to predict LLM performance for arbitrary layer combinations and so estimate each layer's marginal contribution; use stratified Monte Carlo mask sampling to approximate Shapley values efficiently. Result: The method consistently outperforms existing approaches in perplexity and zero-shot accuracy, achieving more efficient and effective layer pruning. Conclusion: The game-theoretic framework captures inter-layer dependencies dynamically, overcomes the limits of static rules, and offers a principled and practical approach to LLM compression. Abstract: While large language models (LLMs) demonstrate impressive performance across various tasks, their deployment in real-world scenarios is still constrained by high computational demands. Layer-wise pruning, a commonly employed strategy to mitigate inference costs, can partially address this challenge. However, existing approaches generally depend on static heuristic rules and fail to account for the interdependencies among layers, thereby limiting the effectiveness of the pruning process. To this end, this paper proposes a game-theoretic framework that formulates layer pruning as a cooperative game in which each layer acts as a player and model performance serves as the utility. As computing exact Shapley values is computationally infeasible for LLMs, we propose using a lightweight surrogate network to estimate layer-wise marginal contributions. This network can predict LLM performance for arbitrary layer combinations at a low computational cost. Additionally, we employ stratified Monte Carlo mask sampling to further reduce the cost of Shapley value estimation. This approach captures inter-layer dependencies and dynamically identifies critical layers for pruning. Extensive experiments demonstrate the consistent superiority of our method in terms of perplexity and zero-shot accuracy, achieving more efficient and effective layer-wise pruning for large language models.
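
The cooperative-game view can be illustrated with a small Monte Carlo Shapley estimator. Note the substitutions: this sketch uses plain permutation sampling rather than the paper's stratified mask sampling, and `toy_utility` is a hypothetical stand-in for the surrogate network that would predict model performance for a given set of kept layers. The weights and the interaction bonus are made-up numbers.

```python
import random


def shapley_permutation_estimate(n_players, utility, n_samples=200, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orders."""
    rng = random.Random(seed)
    values = [0.0] * n_players
    players = list(range(n_players))
    for _ in range(n_samples):
        rng.shuffle(players)
        coalition = set()
        prev = utility(frozenset(coalition))
        for p in players:
            coalition.add(p)
            curr = utility(frozenset(coalition))
            values[p] += curr - prev  # marginal contribution of p in this order
            prev = curr
    return [v / n_samples for v in values]


# Hypothetical utility: additive per-layer weights plus a bonus when
# layers 0 and 1 are both kept (a simple inter-layer dependency).
WEIGHTS = [3.0, 1.0, 2.0]


def toy_utility(kept_layers):
    score = sum(WEIGHTS[i] for i in kept_layers)
    if 0 in kept_layers and 1 in kept_layers:
        score += 1.0
    return score


estimates = shapley_permutation_estimate(3, toy_utility)
# Layers with the lowest estimated contribution are the first pruning candidates.
prune_order = sorted(range(3), key=lambda i: estimates[i])
```

Each sampled permutation requires one utility call per layer, which is why a cheap surrogate for the utility matters when the real utility is an LLM evaluation.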

[34] LLMs Know More About Numbers than They Can Say

Fengting Yuchi,Li Du,Jason Eisner

Main category: cs.CL

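TL;DR: This paper studies how well large language models (LLMs) internally represent numbers when comparing numerals in mixed notation (e.g., scientific notation versus plain decimals). Hidden layers linearly encode log-scale magnitude and relative order, yet the models' explicit answers are far less accurate; fine-tuning with the probe's loss as an auxiliary objective improves explicit numerical reasoning.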

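Details Motivation: Although current LLMs can solve math problems, they frequently err on numerical comparisons with mixed notation (e.g., $5.7 \times 10^2$ vs. $580$), raising the fundamental question of whether they actually know how large these numbers are. Method: Probe the hidden states of several smaller open-source LLMs: a single linear projection of a suitable hidden layer recovers an estimate of each number's log-magnitude, and a linear classifier is trained to decide which of two numbers is larger; the classifier probe's log-loss is then added as an auxiliary loss during fine-tuning. Result: Hidden layers encode log-magnitude fairly accurately (relative error of about 2.3% on restricted synthetic text and 19.06% on scientific papers); a linear classifier ranks pairs of numerals with over 90% accuracy; yet the models' explicit answers on the same pairs are only 50-70% accurate; fine-tuning with the probe's log-loss brings an additional 3.22% improvement in verbalized accuracy over base models. Conclusion: LLMs already hold an implicit representation of numerical magnitude and order that is not reliably carried through to their generated answers; the bottleneck lies in the output mapping rather than in a missing representation, and auxiliary supervision can strengthen numerical reasoning. Abstract: Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, $5.7 \times 10^2$ or $580$?" This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the log-magnitudes of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a pair of numerals encodes their ranking, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe's log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models' internal magnitude representations can enhance their numerical reasoning capabilities.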

[35] TodoEvolve: Learning to Architect Agent Planning Systems

Jiaxi Liu,Yanzuo Jiang,Guibin Zhang,Zihan Zhang,Heng Chang,Zhenfei Yin,Qibing Ren,Junchi Yan

Main category: cs.CL

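TL;DR: This paper proposes TodoEvolve, a meta-planning paradigm that builds PlanFactory, a unified modular design space, and trains the Todo-14B model with Impedance-Guided Preference Optimization (IGPO) to synthesize and dynamically revise task-adaptive planning architectures, surpassing hand-crafted planning modules on multiple agentic benchmarks.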

Details Motivation: Existing planning approaches rely on fixed, hand-crafted structures that cannot adapt to the structural diversity of open-ended problems. Method: Build PlanFactory, a unified modular design space covering topology, initialization, adaptation, and navigation; collect high-quality planning trajectories; and train Todo-14B with IGPO, a multi-objective reinforcement learning objective, so that it can synthesize and dynamically revise planning architectures on its own. Result: On five agentic benchmarks, TodoEvolve consistently outperforms carefully engineered planning modules while keeping API costs and runtime overhead low. Conclusion: TodoEvolve supports the viability of the meta-planning paradigm and offers a new route to flexible, adaptive, and efficient agent planning. Abstract: Planning has become a central capability for contemporary agent systems in navigating complex, long-horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation, we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design space that standardizes diverse planning paradigms within a unified codebase encompassing topology, initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B via Impedance-Guided Preference Optimization (IGPO), a multi-objective reinforcement learning objective that encourages the generation of planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.

[36] Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers

Yuhan Wang,Shiyu Ni,Zhikai Ding,Zihang Zhan,Yuanzi Li,Keping Bi

Main category: cs.CL

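TL;DR: This paper shows that existing training-free confidence calibration methods break down when a question has multiple correct answers, and introduces the MACE benchmark and Semantic Confidence Aggregation (SCA) to improve LLM calibration in multi-answer question answering.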

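Details Motivation: Existing training-free confidence calibration methods were designed mainly for single-answer question answering; when several answers are correct, disagreement among equally valid responses leads to systematic underestimation of confidence, and this setting has not been studied systematically. Method: Build MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct answers; evaluate 15 calibration methods on four LLM families (7B-72B); propose Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses. Result: As the number of correct answers grows, accuracy rises but estimated confidence keeps falling, causing severe miscalibration on questions with mixed answer counts; SCA achieves state-of-the-art calibration in mixed-answer settings while preserving strong calibration on single-answer questions. Conclusion: Multi-answer questions pose a serious challenge to current calibration practice; SCA is a general, effective, training-free improvement that can make LLMs more reliable in realistic tasks such as open-domain question answering. Abstract: Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence. To enable a systematic study of this phenomenon, we introduce MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct answers. Experiments across 15 representative calibration methods and four LLM families (7B-72B) reveal that while accuracy increases with answer cardinality, estimated confidence consistently decreases, causing severe miscalibration for questions with mixed answer counts. To address this issue, we propose Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses. SCA achieves state-of-the-art calibration performance under mixed-answer settings while preserving strong calibration on single-answer questions.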
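
The underestimation effect described above can be illustrated with a small sketch. This is not the paper's SCA: the abstract does not give the aggregation rule, so the `min_share` threshold and the assumption that sampled answers are already normalized to canonical strings are both hypothetical choices made for illustration.

```python
from collections import Counter


def top1_confidence(samples):
    """Consistency-style confidence: share of samples equal to the modal answer."""
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)


def aggregated_confidence(samples, min_share=0.2):
    """Sum the shares of every answer that reaches `min_share` of the samples.

    Assumption: answers with a high sample share are all plausible correct
    answers, so their probability mass is pooled instead of competing.
    """
    counts = Counter(samples)
    n = len(samples)
    return sum(c / n for c in counts.values() if c / n >= min_share)


if __name__ == "__main__":
    # Ten hypothetical samples for "Name a primary color" (several valid answers).
    samples = ["red"] * 4 + ["blue"] * 3 + ["green"] * 2 + ["purple"]
    print(top1_confidence(samples))        # share of the modal answer only
    print(aggregated_confidence(samples))  # pooled share of frequent answers
```

With one dominant answer the two scores coincide, which is in line with the abstract's claim that aggregation need not hurt single-answer calibration; when the mass is split across several valid answers, the top-1 score drops even though the model is rarely wrong.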

[37] SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization

Taolin Zhang,Hang Guo,Wang Lu,Tao Dai,Shu-Tao Xia,Jindong Wang

Main category: cs.CL

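TL;DR: This paper proposes SparseEval, which uses sparse optimization and gradient descent to select representative benchmark items (anchors), substantially reducing the computational cost of evaluating large language models while maintaining high accuracy and robustness.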

Details Motivation: As LLMs grow, evaluating them becomes increasingly expensive, which calls for efficient, low-cost benchmarking methods. Method: SparseEval exploits the sparsity of the model-item performance matrix, optimizes anchor weights by gradient descent, and combines an MLP with the Anchor Importance Score and Candidate Importance Score to refine the anchor selection iteratively. Result: Across multiple benchmarks the method shows low estimation error and high Kendall's $\tau$, supporting its accuracy, robustness, and practicality. Conclusion: SparseEval is the first method to apply gradient descent to efficient benchmark evaluation, offering a new approach to lightweight LLM evaluation. Abstract: As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's $\tau$ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real-world scenarios. Code is available at https://github.com/taolinzhang/SparseEval.
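
Kendall's $\tau$, one of the metrics reported above, measures how well a cheap estimate preserves the ranking of models obtained from the full benchmark. The sketch below computes the tau-a variant (ties count as neither concordant nor discordant); the abstract does not say which variant the paper uses, and the scores are made-up values for illustration.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a between two equally long score lists."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1      # the pair is ordered the same way in both lists
        elif sign < 0:
            discordant += 1      # the pair is ordered in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    full_scores = [0.81, 0.64, 0.72, 0.55]       # hypothetical full-benchmark accuracy
    estimated_scores = [0.78, 0.66, 0.61, 0.50]  # hypothetical anchor-based estimates
    print(kendall_tau_a(full_scores, estimated_scores))
```

Here five of the six model pairs keep their order and one is swapped, so $\tau = (5 - 1) / 6 \approx 0.67$; a value of 1 would mean the anchors reproduce the full ranking exactly.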

[38] Patches of Nonlinearity: Instruction Vectors in Large Language Models

Irina Bigoulaeva,Jonas Rohweder,Subhabrata Dutta,Iryna Gurevych

Main category: cs.CL

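TL;DR: From a mechanistic perspective, this paper studies how instruction representations are built and used in instruction-tuned language models. It finds that these representations, called Instruction Vectors (IVs), are localized and linearly separable yet interact causally in non-linear ways, proposes a new localization method that is free of linear assumptions, and shows that IVs act as circuit selectors.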

Details Motivation: Instruction-tuned language models are widely used, but how they process instructions internally remains unclear; this work aims to close that gap in mechanistic understanding. Method: Use causal mediation analysis to establish that instruction representations are localized; propose a new method, free of linear assumptions, for localizing information-processing pathways; analyze the role of Instruction Vectors (IVs) across the SFT and DPO stages of post-training. Result: IVs are fairly localized and combine linear separability with non-linear causal interaction; once task representations form in the early layers, different information pathways are selected in the later layers, so IVs act as circuit selectors. Conclusion: Instruction representations are not simple linear structures; their non-linear causal interaction calls into question the scope of the linear representation hypothesis that is common in mechanistic interpretability. The core function of IVs is to select the task-relevant computational pathways inside the model. Abstract: Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known of how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representation is fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability along with non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method to localize information processing in language models that is free from the implicit linear assumptions of patching-based techniques. We find that, conditioned on the task representations formed in the early layers, different information pathways are selected in the later layers to solve that task, i.e., IVs act as circuit selectors.

[39] Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation

Krzysztof Wróbel,Jan Maria Kowalski,Jerzy Surma,Igor Ciuciura,Maciej Szymański

Main category: cs.CL

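TL;DR: This paper presents Bielik Guard, a family of compact Polish-language content safety classifiers in two sizes, 0.1B parameters (based on MMLW-RoBERTa-base) and 0.5B parameters (based on PKOBP/polish-roberta-8k), fine-tuned on 6,885 community-annotated Polish texts covering five safety categories. On real user prompts the 0.1B model outperforms the same-size HerBERT-PL-Guard in precision and false positive rate; the models are publicly available and are designed to support appropriate responses rather than simple blocking.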

Details Motivation: As LLMs are increasingly deployed in Polish-language applications, efficient and accurate Polish content safety classifiers are needed. Method: Build two compact Polish safety classifiers, based on MMLW-RoBERTa-base (0.1B) and PKOBP/polish-roberta-8k (0.5B), and fine-tune them on 6,885 community-annotated Polish texts to detect five categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Result: The 0.5B model reaches a micro-F1 of 0.791 and a macro-F1 of 0.785 on the test set; the 0.1B model achieves 77.65% precision and a false positive rate of only 0.63% on real user prompts, well ahead of the same-size HerBERT-PL-Guard (31.55% precision, 4.70% FPR). Conclusion: Bielik Guard is an efficient, accurate, publicly available Polish safety classifier that balances performance and practicality, with particular attention to giving appropriate responses, rather than simply blocking content, for sensitive categories such as self-harm. Abstract: As Large Language Models (LLMs) become increasingly deployed in Polish language applications, the need for efficient and accurate content safety classifiers has become paramount. We present Bielik Guard, a family of compact Polish language safety classifiers comprising two model variants: a 0.1B parameter model based on MMLW-RoBERTa-base and a 0.5B parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on a community-annotated dataset of 6,885 Polish texts, these models classify content across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Our evaluation demonstrates that both models achieve strong performance on multiple benchmarks. The 0.5B variant offers the best overall discrimination capability with F1 scores of 0.791 (micro) and 0.785 (macro) on the test set, while the 0.1B variant demonstrates exceptional efficiency. Notably, Bielik Guard 0.1B v1.1 achieves superior precision (77.65%) and very low false positive rate (0.63%) on real user prompts, outperforming HerBERT-PL-Guard (31.55% precision, 4.70% FPR) despite identical model size. The models are publicly available and designed to provide appropriate responses rather than simple content blocking, particularly for sensitive categories like self-harm.
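
For readers unfamiliar with the metrics quoted above, the following sketch shows how precision, false positive rate, and micro versus macro F1 are computed from confusion counts. The per-category counts are invented for illustration and have no connection to the paper's data.

```python
def precision(tp, fp):
    """Share of flagged items that are truly unsafe."""
    return tp / (tp + fp)


def false_positive_rate(fp, tn):
    """Share of safe items that were wrongly flagged."""
    return fp / (fp + tn)


def f1(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts):
    """Pool the counts of all categories, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts):
    """Compute F1 per category, then take the unweighted mean."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


if __name__ == "__main__":
    # Hypothetical (tp, fp, fn) counts for two categories.
    counts = {"hate": (8, 2, 2), "self_harm": (1, 1, 3)}
    print(micro_f1(counts), macro_f1(counts))
    print(precision(8, 2), false_positive_rate(2, 98))
```

Macro F1 weights every category equally, so a rare category with poor recall pulls it down more than it pulls down micro F1; reporting both, as the abstract does, indicates how evenly a classifier performs across categories.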

[40] Lost in Translation? A Comparative Study on the Cross-Lingual Transfer of Composite Harms

Vaibhav Shukla,Hardik Sharma,Adith N Reganti,Soham Wasmatkar,Bagesh Kumar,Vrijendra Singh

Main category: cs.CL

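TL;DR: This paper introduces CompositeHarm, a benchmark that extends AttaQ and MMSafetyBench by translation to six languages, most of them Indic, to study how robust LLM safety alignment is under cross-lingual transfer. Attack success rates rise sharply in Indic languages such as Hindi, while contextual harms transfer more moderately; lightweight inference strategies make multilingual safety evaluation more scalable and energy-efficient.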

Details Motivation: LLM safety evaluation is still mostly confined to English; translation is used as a shortcut for multilingual evaluation, but it rarely captures how harmful intent or structure changes across languages, so a systematic study of how safety alignment holds up under syntactic and semantic shift is needed. Method: Build the CompositeHarm benchmark by combining two English datasets, AttaQ (structured adversarial attacks) and MMSafetyBench (contextual, real-world harms), and extending them into six languages: English, Hindi, Assamese, Marathi, Kannada, and Gujarati; measure attack success rates on three large models; adopt lightweight inference strategies inspired by edge-AI design to improve efficiency and scalability. Result: Attack success rates rise sharply in Indic languages, especially under adversarial syntax; contextual harms transfer more moderately; the lightweight inference strategies reduce redundant evaluation passes while preserving cross-lingual fidelity. Conclusion: Translated benchmarks are a necessary first step toward multilingual safety evaluation, but they are not sufficient for building grounded, resource-aware, language-adaptive safety systems. Abstract: Most safety evaluations of large language models (LLMs) remain anchored in English. Translation is often used as a shortcut to probe multilingual behavior, but it rarely captures the full picture, especially when harmful intent or structure morphs across languages. Some types of harm survive translation almost intact, while others distort or disappear. To study this effect, we introduce CompositeHarm, a translation-based benchmark designed to examine how safety alignment holds up as both syntax and semantics shift. It combines two complementary English datasets, AttaQ, which targets structured adversarial attacks, and MMSafetyBench, which covers contextual, real-world harms, and extends them into six languages: English, Hindi, Assamese, Marathi, Kannada, and Gujarati. Using three large models, we find that attack success rates rise sharply in Indic languages, especially under adversarial syntax, while contextual harms transfer more moderately. To ensure scalability and energy efficiency, our study adopts lightweight inference strategies inspired by edge-AI design principles, reducing redundant evaluation passes while preserving cross-lingual fidelity. This design makes large-scale multilingual safety testing both computationally feasible and environmentally conscious. Overall, our results show that translated benchmarks are a necessary first step, but not a sufficient one, toward building grounded, resource-aware, language-adaptive safety systems.

[41] Cross-Linguistic Persona-Driven Data Synthesis for Robust Multimodal Cognitive Decline Detection

Rui Feng,Zhiyao Luo,Liuyu Wu,Wei Wang,Yuting Song,Yong Liu,Kok Pin Ng,Jianqing Li,Xingyao Wang

Main category: cs.CL

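TL;DR: This paper proposes SynCog, a framework that combines controllable zero-shot multimodal data synthesis with Chain-of-Thought (CoT) deduction fine-tuning to address scarce clinical data, weak cross-lingual generalization, and poor interpretability in speech-based biomarker research on Mild Cognitive Impairment (MCI). It is evaluated for diagnostic performance on ADReSS and ADReSSo and for cross-lingual generalization on the Mandarin CIR-E cohort.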

Details Motivation: Speech-based digital biomarkers are promising for early identification of MCI, but acute clinical data scarcity, a lack of interpretable reasoning, and poor cross-lingual generalization hold back clinical trust and global use. Method: SynCog has two parts: (1) controllable zero-shot multimodal data synthesis, which simulates diverse virtual subjects to expand clinical corpora across languages; (2) CoT deduction fine-tuning of a Multimodal Large Language Model (MLLM) on the synthesized data, so that the model states its diagnostic reasoning explicitly. Result: Macro-F1 reaches 80.67% on ADReSS and 78.46% on ADReSSo, outperforming current baseline models; on CIR-E, an independent real-world Mandarin cohort, Macro-F1 is 48.71%, which the authors present as evidence of cross-linguistic generalization. Conclusion: SynCog is a step toward clinically trustworthy, linguistically inclusive cognitive assessment tools for global healthcare. Abstract: Speech-based digital biomarkers represent a scalable, non-invasive frontier for the early identification of Mild Cognitive Impairment (MCI). However, the development of robust diagnostic models remains impeded by acute clinical data scarcity and a lack of interpretable reasoning. Current solutions frequently struggle with cross-lingual generalization and fail to provide the transparent rationales essential for clinical trust. To address these barriers, we introduce SynCog, a novel framework integrating controllable zero-shot multimodal data synthesis with Chain-of-Thought (CoT) deduction fine-tuning. Specifically, SynCog simulates diverse virtual subjects with varying cognitive profiles to effectively alleviate clinical data scarcity. This generative paradigm enables the rapid, zero-shot expansion of clinical corpora across diverse languages, effectively bypassing data bottlenecks in low-resource settings and bolstering the diagnostic performance of Multimodal Large Language Models (MLLMs). Leveraging this synthesized dataset, we fine-tune a foundational multimodal backbone using a CoT deduction strategy, empowering the model to explicitly articulate diagnostic thought processes rather than relying on black-box predictions. Extensive experiments on the ADReSS and ADReSSo benchmarks demonstrate that augmenting limited clinical data with synthetic phenotypes yields competitive diagnostic performance, achieving Macro-F1 scores of 80.67% and 78.46%, respectively, outperforming current baseline models. Furthermore, evaluation on an independent real-world Mandarin cohort (CIR-E) demonstrates robust cross-linguistic generalization, attaining a Macro-F1 of 48.71%. These findings constitute a critical step toward providing clinically trustworthy and linguistically inclusive cognitive assessment tools for global healthcare.

[42] The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation

Arash Marioriyad,Omid Ghahroodi,Ehsaneddin Asgari,Mohammad Hossein Rohban,Mahdieh Soleymani Baghshah

Main category: cs.CL

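TL;DR: This paper studies how sensitive large language models (LLMs) used as automatic judges are to irrelevant contextual cues (metadata such as source, recency, or gender). Verdicts are easily swayed by these cues, yet the judges almost never acknowledge them in their rationales, revealing an "explanation gap".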

Details Motivation: LLMs are widely used as automatic judges, but it is unclear whether their verdicts truly rest on content quality and whether they are robust and explainable; their invariance to irrelevant metadata cues and their transparency about such cues need to be tested. Method: Inject six families of synthetic metadata cues (source, temporal, age, gender, ethnicity, educational status) into evaluation prompts and test six mainstream LLM judges on two datasets, ELI5 (factual QA) and LitBench (creative writing); measure verdict shift rate (VSR) and introduce cue acknowledgment rate (CAR) to quantify behavioral shift and explicit attribution. Result: Cues with strong effects, such as the provenance hierarchy (Expert > Human > LLM > Unknown), the recency preference (New > Old), and educational-status favoritism, cause substantial verdict shifts, yet CAR is typically at or near zero; CAR is also dataset-dependent: some models occasionally acknowledge cues on ELI5, while on LitBench acknowledgment often collapses even though large verdict shifts persist. Conclusion: LLM judges show a serious explanation gap: decisions are driven by implicit cues that the rationales do not report, which undermines the reliability of model-based evaluation in research and deployment. Abstract: Large language models (LLMs) are increasingly used as automatic judges to evaluate system outputs in tasks such as reasoning, question answering, and creative writing. A faithful judge should base its verdicts solely on content quality, remain invariant to irrelevant context, and transparently reflect the factors driving its decisions. We test this ideal via controlled cue perturbations (synthetic metadata labels injected into evaluation prompts) for six judge models: GPT-4o, Gemini-2.0-Flash, Gemma-3-27B, Qwen3-235B, Claude-3-Haiku, and Llama3-70B. Experiments span two complementary datasets with distinct evaluation regimes: ELI5 (factual QA) and LitBench (open-ended creative writing). We study six cue families: source, temporal, age, gender, ethnicity, and educational status. Beyond measuring verdict shift rates (VSR), we introduce cue acknowledgment rate (CAR) to quantify whether judges explicitly reference the injected cues in their natural-language rationales. Across cues with strong behavioral effects (e.g., provenance hierarchies (Expert > Human > LLM > Unknown), recency preferences (New > Old), and educational-status favoritism), CAR is typically at or near zero, indicating that shortcut reliance is largely unreported even when it drives decisions. Crucially, CAR is also dataset-dependent: explicit cue recognition is more likely to surface in the factual ELI5 setting for some models and cues, but often collapses in the open-ended LitBench regime, where large verdict shifts can persist despite zero acknowledgment. The combination of substantial verdict sensitivity and limited cue acknowledgment reveals an explanation gap in LLM-as-judge pipelines, raising concerns about reliability of model-based evaluation in both research and deployment.
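
The two metrics named above can be sketched as simple ratios over paired judge runs. The record layout, the keyword-matching test for acknowledgment, and the choice of all cued trials as the CAR denominator are assumptions made for illustration; the abstract does not specify how the paper detects acknowledgment.

```python
def verdict_shift_rate(trials):
    """Share of trials whose verdict changes once the cue is injected."""
    shifted = sum(1 for t in trials if t["baseline_verdict"] != t["cued_verdict"])
    return shifted / len(trials)


def cue_acknowledgment_rate(trials, cue_terms):
    """Share of cued rationales that mention any cue term (case-insensitive)."""
    def mentions_cue(rationale):
        text = rationale.lower()
        return any(term.lower() in text for term in cue_terms)

    acknowledged = sum(1 for t in trials if mentions_cue(t["cued_rationale"]))
    return acknowledged / len(trials)


if __name__ == "__main__":
    # Hypothetical paired runs: the same comparison judged without and with a
    # "written by an expert" label attached to one of the answers.
    trials = [
        {"baseline_verdict": "A", "cued_verdict": "B",
         "cued_rationale": "Answer B is clearer and more complete."},
        {"baseline_verdict": "A", "cued_verdict": "A",
         "cued_rationale": "Answer A cites concrete evidence."},
        {"baseline_verdict": "B", "cued_verdict": "A",
         "cued_rationale": "Answer A was written by an expert, so I trust it."},
        {"baseline_verdict": "B", "cued_verdict": "B",
         "cued_rationale": "Answer B is better organized."},
    ]
    print(verdict_shift_rate(trials))                   # 0.5
    print(cue_acknowledgment_rate(trials, ["expert"]))  # 0.25
```

The first trial is the pattern the paper highlights: the verdict flips after the cue is added, but the rationale gives only content-based reasons, so the shift counts toward VSR without counting toward CAR.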

[43] DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity

Jitai Hao,Qiang Huang,Yaowei Wang,Min Zhang,Jun Yu

Main category: cs.CL

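TL;DR: This paper proposes DeltaKV, a residual-based KV cache compression framework, together with the Sparse-vLLM inference engine; the combination substantially reduces KV cache memory for long-context LLMs and raises throughput while keeping accuracy near-lossless.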

Details Motivation: Existing KV cache compression and eviction methods struggle to balance accuracy, compression ratio, and hardware efficiency; the authors observe high similarity between distant tokens and highly shared latent components in KV representations. Method: Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references; Sparse-vLLM turns the compression into system-level speedups through decoupled memory management and kernels optimized for sparse and irregular KV layouts. Result: DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME; combined with Sparse-vLLM, it achieves up to 2$\times$ higher throughput than vLLM in long-context scenarios. Conclusion: Together, DeltaKV and Sparse-vLLM offer a practical, scalable path to deploying long-context LLMs. Abstract: The deployment of efficient long-context LLMs in applications like autonomous agents, long-chain reasoning, and creative writing is fundamentally bottlenecked by the linear growth of KV cache memory. Existing compression and eviction methods often struggle to balance accuracy, compression ratio, and hardware efficiency. We propose DeltaKV, a residual-based KV cache compression framework motivated by two empirical findings: long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. To translate compression gains into real system speedups, we further introduce Sparse-vLLM, a high-performance inference engine with decoupled memory management and kernels optimized for sparse and irregular KV layouts. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME. When integrated with Sparse-vLLM, it achieves up to 2$\times$ throughput improvement over vLLM in long-context scenarios, demonstrating a practical path toward scalable long-context LLM deployment. Code, model checkpoints, and datasets are available at https://github.com/CURRENTF/Sparse-vLLM.
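
The general idea of storing a residual against a similar earlier vector, rather than the vector itself, can be illustrated with a toy sketch. This is not DeltaKV's architecture: nearest-neighbor retrieval by squared distance and uniform scalar quantization with a fixed `step` are stand-ins chosen for illustration, and the vectors are invented.

```python
def nearest_reference(vec, references):
    """Index of the stored reference vector closest to `vec` (squared L2)."""
    return min(
        range(len(references)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, references[i])),
    )


def encode(vec, references, step=0.05):
    """Store a reference index plus the residual quantized to integer codes."""
    idx = nearest_reference(vec, references)
    codes = [round((a - b) / step) for a, b in zip(vec, references[idx])]
    return idx, codes


def decode(idx, codes, references, step=0.05):
    """Rebuild an approximation of the vector from reference + residual."""
    return [b + c * step for b, c in zip(references[idx], codes)]


if __name__ == "__main__":
    references = [[1.0, 2.0, 3.0], [-1.0, 0.5, 0.0]]  # hypothetical earlier keys
    new_key = [1.02, 1.91, 3.13]                      # similar to references[0]
    idx, codes = encode(new_key, references)
    restored = decode(idx, codes, references)
    print(idx, codes, restored)
```

Because the new vector is close to a stored reference, its residual is small and fits in a few low-magnitude integer codes, with a reconstruction error of at most half a quantization step per component; that is the intuition behind exploiting long-range inter-token similarity instead of evicting tokens.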

[44] Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

Po-Chun Chen,Hen-Hsen Huang,Hsin-Hsi Chen

Main category: cs.CL

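TL;DR: This paper proposes Diverge-to-Induce Prompting (DIP), a framework that generates several diverse high-level reasoning rationales and induces them into a single final plan, improving zero-shot reasoning accuracy without resource-intensive sampling.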

Details Motivation: Unguided reasoning paths in standard Chain-of-Thought prompting are unstable, and relying on a single reasoning strategy per question still limits performance across diverse tasks. Method: DIP has three steps: (1) prompt the LLM to generate multiple diverse high-level rationales for each question; (2) elaborate each rationale into a detailed, step-by-step draft plan; (3) induce the draft plans into a final plan. Result: Experiments show that DIP achieves higher zero-shot reasoning accuracy than single-strategy prompting, supporting the effectiveness of multi-plan induction. Conclusion: Generating diverse reasoning paths and inducing them into one plan improves the robustness and generalization of prompt-based LLM reasoning. Abstract: To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.
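
The three steps listed above can be sketched as a small pipeline around any text-completion function. The prompt wording, the `llm` callable, and the default of three rationales are all hypothetical; the abstract does not give the actual prompts or the number of rationales used.

```python
def diverge_to_induce(question, llm, num_rationales=3):
    """Return a final plan built from several diverse draft plans.

    `llm` is any callable mapping a prompt string to a completion string.
    """
    # Step 1: diverge into several high-level rationales.
    rationales = [
        llm(f"Question: {question}\n"
            f"Propose high-level strategy #{i + 1}, different from the others.")
        for i in range(num_rationales)
    ]
    # Step 2: elaborate each rationale into a step-by-step draft plan.
    drafts = [
        llm(f"Question: {question}\nStrategy: {r}\n"
            "Expand this strategy into a detailed step-by-step plan.")
        for r in rationales
    ]
    # Step 3: induce a single final plan from all drafts.
    joined = "\n\n".join(drafts)
    return llm(f"Question: {question}\nDraft plans:\n{joined}\n"
               "Combine these drafts into one final plan.")


if __name__ == "__main__":
    calls = []

    def fake_llm(prompt):
        # Stand-in for a real model: records the prompt and returns a label.
        calls.append(prompt)
        return f"<output {len(calls)}>"

    print(diverge_to_induce("What is 17 * 24?", fake_llm))
    print(len(calls))  # 3 rationales + 3 drafts + 1 induction = 7 calls
```

The number of model calls grows linearly with the number of rationales (2k + 1 in this sketch), which is one way to read the abstract's claim that the approach avoids resource-intensive sampling.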

[45] Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection

Chenwang Wu,Yiu-ming Cheung,Shuhai Zhang,Bo Han,Defu Lian

Main category: cs.CL

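TL;DR: This paper proposes a lightweight score calibration strategy based on Markov random fields that makes metric-based detectors of machine-generated text more robust to the randomness of generation, with significant gains in scenarios such as cross-LLM detection and paraphrasing attacks.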

Details Motivation: Machine-generated texts (MGTs) are convenient but carry risks such as disinformation and phishing, so reliable detection is needed; existing metric-based methods, however, produce token-level detection scores that are easily biased by the inherent randomness of generation. Method: Place representative metric-based methods in a unified framework; show theoretically and empirically that context detection scores exhibit two relationships, Neighbor Similarity and Initial Instability; model these relationships with Markov random fields and implement the calibration as a lightweight component via a mean-field approximation. Result: The method yields significant gains over baselines in a range of real-world scenarios, including cross-LLM detection and paraphrasing attacks, with negligible computational overhead. Conclusion: Relationships among context scores can be modeled and used for calibration; the resulting strategy is general, lightweight, and efficient, and it can be integrated seamlessly into existing metric-based detectors. Abstract: While machine-generated texts (MGTs) offer great convenience, they also pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGT generation process. To address this, we theoretically and empirically reveal two relationships of context detection scores that may aid calibration: Neighbor Similarity and Initial Instability. We then propose a Markov-informed score calibration strategy that models these relationships using Markov random fields, and implements it as a lightweight component via a mean-field approximation, allowing our method to be seamlessly integrated into existing detectors. Extensive experiments in various real-world scenarios, such as cross-LLM and paraphrasing attacks, demonstrate significant gains over baselines with negligible computational overhead. The code is available at https://github.com/tmlr-group/MRF_Calibration.

[46] TDGNet: Hallucination Detection in Diffusion Language Models via Temporal Dynamic Graphs

Arshia Hemmat,Philip Torr,Yongqiang Chen,Junchi Yu

Main category: cs.CL

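TL;DR: This paper proposes TDGNet, a temporal dynamic graph framework for hallucination detection in diffusion language models (D-LLMs). It models how token-level attention graphs evolve during denoising and combines sparsification, message passing, and temporal attention for efficient, accurate detection.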

Details Motivation: D-LLMs offer parallel denoising and bidirectional context, but hallucination detection for them is underexplored; detectors built for auto-regressive LLMs rely on single-pass cues and do not fit D-LLMs, where factuality evidence changes along the denoising trajectory (it may appear, drift, or be self-corrected). Method: At each denoising step, TDGNet builds and sparsifies the token-level attention graph, updates per-token memories via message passing, and then applies temporal attention to aggregate evidence across the whole trajectory for the final prediction. Result: On LLaDA-8B and Dream-7B across several QA benchmarks, TDGNet consistently improves AUROC over output-based, latent-based, and static-graph baselines, with single-pass inference and modest overhead. Conclusion: Temporal modeling and reasoning over attention graphs is important for robust hallucination detection in diffusion language models. Abstract: Diffusion language models (D-LLMs) offer parallel denoising and bidirectional context, but hallucination detection for D-LLMs remains underexplored. Prior detectors developed for auto-regressive LLMs typically rely on single-pass cues and do not directly transfer to diffusion generation, where factuality evidence is distributed across the denoising trajectory and may appear, drift, or be self-corrected over time. We introduce TDGNet, a temporal dynamic graph framework that formulates hallucination detection as learning over evolving token-level attention graphs. At each denoising step, we sparsify the attention graph and update per-token memories via message passing, then apply temporal attention to aggregate trajectory-wide evidence for final prediction. Experiments on LLaDA-8B and Dream-7B across QA benchmarks show consistent AUROC improvements over output-based, latent-based, and static-graph baselines, with single-pass inference and modest overhead. These results highlight the importance of temporal reasoning on attention graphs for robust hallucination detection in diffusion language models.

[47] Emergent Search and Backtracking in Latent Reasoning Models

Jasmine Cui,Charles Ye

Main category: cs.CL

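TL;DR: This paper studies how latent reasoning transformers (LRTs) reason without words. It finds that a structured search process emerges spontaneously in latent space, with exploration, tentative commitment, and then convergence or backtracking, and that this process is adaptive and self-correcting.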

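Details Motivation: Investigate whether a language model can reason effectively in continuous hidden space without explicit chain-of-thought, and understand the dynamics of that reasoning. Method: Analyze a latent reasoning transformer (LRT) on a multiple-choice QA benchmark, decoding its evolving beliefs at every step and observing belief-update trajectories, backtracking behavior, and the response to changed distractors. Result: The LRT spontaneously learns a three-phase structured search in latent space (exploration, tentative commitment, then convergence or backtracking); backtracking occurs in 32% of instances and is associated with a 34% accuracy gain over non-backtracking instances; backtracking is predominantly directed away from the semantically closest distractor and toward the correct answer; replacing distractors with implausible alternatives shortens exploration by 54%. Conclusion: Latent reasoning models achieve in activation space what chain-of-thought achieves in words, namely the ability to be wrong, notice, and recover, which indicates that non-verbal reasoning can be structured, adaptive, and self-correcting. Abstract: What happens when a language model thinks without words? Standard reasoning LLMs verbalize intermediate steps as chain-of-thought; latent reasoning transformers (LRTs) instead perform deliberation entirely in continuous hidden space. We investigate an LRT, decoding the model's evolving beliefs at every step on a multiple-choice QA benchmark. We find that the model spontaneously learns a structured search process in latent space. Deliberation follows a consistent trajectory: an exploration phase where probability mass spreads across candidates, tentative commitment to a frontrunner, and either convergence or backtracking. Backtracking is prevalent (32% of instances), beneficial (34% accuracy gain over non-backtracking instances), and predominantly directed away from the semantically closest distractor toward the correct answer. The search is adaptive: replacing distractors with implausible alternatives shortens exploration by 54%. Latent reasoning models achieve in activation space what chain-of-thought achieves through words: the ability to be wrong, notice, and recover.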

[48] Gender and Race Bias in Consumer Product Recommendations by Large Language Models

Ke Xu,Shera Potka,Alex Thomo

Main category: cs.CL

TL;DR: This paper investigates gender and race biases in large language model (LLM)-generated consumer product recommendations using prompt engineering and three bias detection methods, revealing significant demographic disparities and calling for more equitable recommendation systems.

Details Motivation: Large Language Models are increasingly used for consumer product recommendations, but their potential to embed and amplify gender and race biases remains underexplored. Method: The authors use prompt engineering to elicit product suggestions from LLMs for various race and gender groups, and apply three analytical methods (Marked Words, Support Vector Machines, and Jensen-Shannon Divergence) to identify and quantify biases. Result: Significant disparities in product recommendations across demographic groups were found. Conclusion: The findings underscore the need for more equitable LLM-based recommendation systems. Abstract: Large Language Models are increasingly employed in generating consumer product recommendations, yet their potential for embedding and amplifying gender and race biases remains underexplored. This paper serves as one of the first attempts to examine these biases within LLM-generated recommendations. We leverage prompt engineering to elicit product suggestions from LLMs for various race and gender groups and employ three analytical methods (Marked Words, Support Vector Machines, and Jensen-Shannon Divergence) to identify and quantify biases. Our findings reveal significant disparities in the recommendations for demographic groups, underscoring the need for more equitable LLM recommendation systems.
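
Of the three methods named above, Jensen-Shannon divergence is the easiest to show in a few lines: it compares two probability distributions, here the product categories recommended to two groups. The category counts below are invented, and how the paper builds its distributions is not stated in the abstract.

```python
import math


def normalize(counts, vocabulary):
    """Turn raw counts into a probability vector over a shared vocabulary."""
    total = sum(counts.get(item, 0) for item in vocabulary)
    return [counts.get(item, 0) / total for item in vocabulary]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    vocabulary = ["tools", "skincare", "electronics", "books"]
    group_a = {"tools": 40, "electronics": 35, "books": 25}    # hypothetical counts
    group_b = {"skincare": 45, "books": 30, "electronics": 25}
    p = normalize(group_a, vocabulary)
    q = normalize(group_b, vocabulary)
    print(js_divergence(p, q))
```

A value of 0 means both groups receive the same mix of recommendations, and a value of 1 bit means the recommended categories do not overlap at all, so larger values indicate stronger demographic disparity.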

[49] DIAL-SUMMER: A Structured Evaluation Framework of Hierarchical Errors in Dialogue Summaries

Sahana Ramnath,Nima Chitsazan,Mingyang Zhou,Chia-Hsuan Lee,Shi-Xiong Zhang,Stephen Rawls,Sambit Sahu,Sangwoo Cho,Xiang Ren,Genta Indra Winata,Akshaj Kumar Veldanda

Main category: cs.CL

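TL;DR: This paper proposes DIAL-SUMMER, a framework for evaluating dialogue summaries with an error taxonomy at two hierarchical levels (dialogue-level and within-turn-level). It builds a dataset with fine-grained error annotations, analyzes patterns in the annotated errors, and shows that LLM judges have limited ability to detect them.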

Details Motivation: Existing methods for evaluating dialogue summaries overlook complexities specific to dialogue, such as the shift in structure (from a scattered multi-speaker discussion to coherent sentences) and the shift in narration viewpoint (from first/second person to third person). Method: Propose the DIAL-SUMMER framework, including a two-level error taxonomy (dialogue-level and within-turn-level); build a manually annotated dataset of fine-grained errors; run empirical analyses and LLM-judge experiments. Result: Turns in the middle of a dialogue are the most frequently missed, and extrinsic hallucinations occur largely at the end of the summary; LLM judges perform poorly at detecting these errors, which underlines the difficulty of the task. Conclusion: DIAL-SUMMER offers a more comprehensive, fine-grained basis for evaluating dialogue summaries, exposes the limits of current LLMs on this task, and calls for future work to improve them. Abstract: Dialogues are a predominant mode of communication for humans, and it is immensely helpful to have automatically generated summaries of them (e.g., to revise key points discussed in a meeting, to review conversations between customer agents and product users). Prior works on dialogue summary evaluation largely ignore the complexities specific to this task: (i) shift in structure, from multiple speakers discussing information in a scattered fashion across several turns, to a summary's sentences, and (ii) shift in narration viewpoint, from speakers' first/second-person narration to standardized third-person narration in the summary. In this work, we introduce our framework DIAL-SUMMER to address the above. We propose DIAL-SUMMER's taxonomy of errors to comprehensively evaluate dialogue summaries at two hierarchical levels: DIALOGUE-LEVEL that focuses on the broader speakers/turns, and WITHIN-TURN-LEVEL that focuses on the information talked about inside a turn. We then present DIAL-SUMMER's dataset composed of dialogue summaries manually annotated with our taxonomy's fine-grained errors. We conduct empirical analyses of these annotated errors, and observe interesting trends (e.g., turns occurring in the middle of the dialogue are the most frequently missed in the summary, extrinsic hallucinations largely occur at the end of the summary). We also conduct experiments on LLM-Judges' capability at detecting these errors, through which we demonstrate the challenging nature of our dataset, the robustness of our taxonomy, and the need for future work in this field to enhance LLMs' performance in the same. Code and inference dataset coming soon.

[50] NLP for Local Governance Meeting Records: A Focus Article on Tasks, Datasets, Metrics and Benchmark

Ricardo Campos,José Pedro Evans,José Miguel Isidro,Miguel Marques,Luís Filipe Cunha,Alípio Jorge,Sérgio Nunes,Nuno Guimarães

Main category: cs.CL

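TL;DR: This article reviews three foundational natural language processing (NLP) tasks for making local governance meeting records more structured and interpretable: document segmentation, domain-specific entity extraction, and automatic text summarization. It covers methods, evaluation metrics, public resources, and domain-specific challenges.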

Details Motivation: Local governance meeting records are generally structured, but they vary widely in language, terminology, and format, which makes them hard for the public to interpret and for automated systems to process, limiting transparency and civic engagement. Method: Review three foundational NLP tasks (document segmentation, domain-specific entity extraction, and automatic text summarization) and analyze their methods, evaluation metrics, and available resources. Result: The article gives a structured overview of the key NLP tasks that support the structuring of meeting records and of how domain-specific challenges such as data scarcity, privacy constraints, and source variability affect them. Conclusion: NLP offers a workable route to more structured, accessible, and interpretable local governance meeting records, and this review provides a systematic reference for further research and application. Abstract: Local governance meeting records are official documents, in the form of minutes or transcripts, documenting how proposals, discussions, and procedural actions unfold during institutional meetings. While generally structured, these documents are often dense, bureaucratic, and highly heterogeneous across municipalities, exhibiting significant variation in language, terminology, structure, and overall organization. This heterogeneity makes them difficult for non-experts to interpret and challenging for intelligent automated systems to process, limiting public transparency and civic engagement. To address these challenges, computational methods can be employed to structure and interpret such complex documents. In particular, Natural Language Processing (NLP) offers well-established methods that can enhance the accessibility and interpretability of governmental records. In this focus article, we review foundational NLP tasks that support the structuring of local governance meeting documents. Specifically, we review three core tasks: document segmentation, domain-specific entity extraction and automatic text summarization, which are essential for navigating lengthy deliberations, identifying political actors and personal information, and generating concise representations of complex decision-making processes. In reviewing these tasks, we discuss methodological approaches, evaluation metrics, and publicly available resources, while highlighting domain-specific challenges such as data scarcity, privacy constraints, and source variability. By synthesizing existing work across these foundational tasks, this article provides a structured overview of how NLP can enhance the structuring and accessibility of local governance meeting records.

[51] LLMs and people both learn to form conventions -- just not with each other

Cameron R. Jones,Agnese Lombardi,Kyle Mahowald,Benjamin K. Bergen

Main category: cs.CL

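TL;DR: This paper studies whether large language models (LLMs) form communicative conventions in a multimodal communication game as humans do. Conventions emerge in same-type pairs (human-human, AI-AI), with accuracy and consistency rising and messages getting shorter, but not in mixed human-AI pairs. Even when LLMs are prompted to match human message length, accuracy and lexical overlap in human-LLM pairs remain significantly lower, suggesting that conversational alignment depends not only on imitation but also on shared interpretive biases.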

Details Motivation: To examine whether LLMs can spontaneously form shared communicative conventions in dialogue as humans do, and to understand the underlying reasons human-AI interaction fails. Method: A multimodal communication game with two experiments: Experiment 1 compares human-human, AI-AI, and human-AI pairs; Experiment 2 prompts LLMs to produce messages of humanlike length and then measures accuracy, consistency, and lexical overlap when they are paired with humans. Result: Same-type pairs (human-human, AI-AI) show convention formation (higher accuracy, higher consistency, shorter messages); mixed human-AI pairs do not; even with message length controlled, accuracy and lexical overlap in human-AI pairs remain significantly below those of same-type pairs. Conclusion: Conversational alignment requires more than the ability to imitate behavior; it also depends on interpretive biases shared by both parties. Current LLMs and humans differ fundamentally in interpretation, which limits effective collaboration. Abstract: Humans align to one another in conversation -- adopting shared conventions that ease communication. We test whether LLMs form the same kinds of conventions in a multimodal communication game. Both humans and LLMs display evidence of convention-formation (increasing the accuracy and consistency of their turns while decreasing their length) when communicating in same-type dyads (humans with humans, AI with AI). However, heterogeneous human-AI pairs fail -- suggesting differences in communicative tendencies. In Experiment 2, we ask whether LLMs can be induced to behave more like human conversants, by prompting them to produce superficially humanlike behavior. While the length of their messages matches that of human pairs, accuracy and lexical overlap in human-LLM pairs continue to lag behind those of both human-human and AI-AI pairs. These results suggest that conversational alignment requires more than just the ability to mimic previous interactions, but also shared interpretative biases toward the meanings that are conveyed.

[52] Pretraining with Token-Level Adaptive Latent Chain-of-Thought

Boyi Zeng,Yiqin Hao,He Li,Shixiang Song,Feichen Song,Zitong Wang,Siyuan Huang,Yi Xu,ZiWei He,Xinbing Wang,Zhouhan Lin

Main category: cs.CL

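TL;DR: This paper proposes adaptive latent CoT, a method that internalizes latent chain-of-thought (CoT) during pretraining. It dynamically allocates a variable-length latent reasoning trajectory to each token, improving model capability without adding parameters while reducing training and inference compute.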

Details Motivation: Scaling LLMs is constrained by scarce high-quality corpora and rising communication costs, so scaling axes beyond more parameters and more data are needed. Method: Proposes Pretraining with Token-Level Adaptive Latent CoT: before emitting each token, the model generates a variable-length latent CoT trajectory, with longer trajectories for difficult tokens and shorter (possibly zero-length) ones for easy tokens. The behavior emerges naturally from one-stage pretraining on general text, and token-wise adaptive halting reduces computation. Result: Experiments with Llama architectures show consistent improvements in language modeling perplexity and downstream accuracy, with fewer training FLOPs than prior recurrent baselines. Conclusion: Internalizing adaptive latent CoT is an effective scaling path that does not expand parameters and improves performance while reducing compute. Abstract: Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token -- allocating longer trajectories to difficult tokens and shorter (or even zero) trajectories to easy ones. Importantly, this behavior emerges naturally from one-stage pretraining on general text and reduces computation in both training and inference via token-wise adaptive halting. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs than prior recurrent baselines.

[53] CoRect: Context-Aware Logit Contrast for Hidden State Rectification to Resolve Knowledge Conflicts

Xuhua Ma,Richong Zhang,Zhijie Nie

Main category: cs.CL

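TL;DR: This paper proposes CoRect, which contrasts logits from contextualized and non-contextualized forward passes to identify and rectify bias introduced by parametric knowledge in deep FFN layers, improving the faithfulness and reliability of RAG systems.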

Details Motivation: RAG systems often produce unfaithful outputs because model-internal parametric knowledge overrides retrieved evidence; existing methods are limited to decoding-time adjustments or to weight editing that requires ground-truth targets. Method: CoRect (Context-Aware Logit Contrast for Hidden State Rectification) contrasts logits from contextualized and non-contextualized forward passes to locate FFN layers with high parametric bias, then rectifies hidden states without requiring ground-truth labels. Result: On question answering (QA) and summarization tasks, CoRect improves faithfulness and reduces hallucinations compared with several strong baselines. Conclusion: Parametric suppression is a key cause of knowledge conflicts in RAG; CoRect offers a label-free, plug-and-play hidden-state rectification mechanism that mitigates it and makes RAG more trustworthy. Abstract: Retrieval-Augmented Generation (RAG) often struggles with knowledge conflicts, where model-internal parametric knowledge overrides retrieved evidence, leading to unfaithful outputs. Existing approaches are often limited, relying either on superficial decoding adjustments or weight editing that necessitates ground-truth targets. Through layer-wise analysis, we attribute this failure to a parametric suppression phenomenon: specifically, in deep layers, certain FFN layers overwrite context-sensitive representations with memorized priors. To address this, we propose CoRect (Context-Aware Logit Contrast for Hidden State Rectification). By contrasting logits from contextualized and non-contextualized forward passes, CoRect identifies layers that exhibit high parametric bias without requiring ground-truth labels. It then rectifies the hidden states to preserve evidence-grounded information. Across question answering (QA) and summarization benchmarks, CoRect consistently improves faithfulness and reduces hallucinations compared to strong baselines.

[54] When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

Jaylen Jones,Zhehao Zhang,Yuting Ning,Eric Fosler-Lussier,Pierre-Luc St-Charles,Yoshua Bengio,Dawn Song,Yu Su,Huan Sun

Main category: cs.CL

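TL;DR: This paper presents the first conceptual and methodological framework for unintended behaviors of computer-use agents (CUAs), covering their definition, automatic elicitation, and analysis. It proposes AutoElicit, which iteratively perturbs instructions using execution feedback and, while keeping inputs benign, elicits many severely harmful unintended behaviors; the perturbations are shown to transfer across several frontier CUAs.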

Details Motivation: Existing work on unintended CUA behaviors is largely anecdotal and lacks both a systematic characterization of long-tail unintended behaviors in realistic scenarios and automated methods to detect them. Method: Proposes AutoElicit, an agentic framework that uses CUA execution feedback to iteratively perturb benign instructions and elicit severe unintended behaviors while keeping the perturbations realistic and benign. Result: AutoElicit surfaces hundreds of harmful unintended behaviors from frontier CUAs such as Claude 4.5 Haiku and Opus, and human-verified successful perturbations transfer to other CUAs. Conclusion: This work lays a foundation for systematically analyzing unintended CUA behaviors in realistic computer-use settings and advances standardized, automated CUA safety evaluation. Abstract: Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku and Opus. We further evaluate the transferability of human-verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.

[55] Document Reconstruction Unlocks Scalable Long-Context RLVR

Yao Xiao,Lei Wang,Yue Deng,Guanzheng Chen,Ziqi Jin,Jung-jae Kim,Xiaoli Li,Roy Ka-wei Lee,Lidong Bing

Main category: cs.CL

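TL;DR: This paper proposes an unsupervised reinforcement learning approach to RLVR that needs neither human annotation nor supervision from a strong teacher model. It improves the long-context understanding of LLMs through a paragraph reconstruction task and is validated on RULER and LongBench-v2.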

Details Motivation: Existing RLVR methods depend on costly gold-standard answers or expert-designed evaluation rubrics, which limits scalability; a low-cost, unsupervised way to strengthen long-context ability is needed. Method: An unsupervised RL training paradigm based on paragraph reconstruction: several paragraphs in a long document are replaced with special placeholders, and the LLM must identify and order the missing paragraphs from a candidate set to restore the original, which implicitly models global coherence. Result: Noticeable gains on RULER and a reasonable improvement on LongBench-v2 without any manually curated long-context QA data; ablations examine the effect of reward design, data strategy, training scheme, and data scale. Conclusion: Unsupervised paragraph reconstruction is an effective and practical way to strengthen long-context ability; it reduces dependence on annotation and teacher models, generalizes well, and is released openly for reproducibility. Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm to enhance the capabilities (i.e., long-context) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming. In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision. Specifically, we first replace a few paragraphs with special placeholders in a long document. LLMs are trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2. While acquiring noticeable gains on RULER, it can also achieve a reasonable improvement on LongBench v2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling effects on model performance. We publicly release our code, data, and models.
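The reconstruction task described in this abstract can be illustrated with a minimal sketch. The function names, the `<MISSING_k>` placeholder format, and the all-or-nothing reward below are assumptions made for illustration; the abstract does not specify the authors' placeholder tokens or reward design (which they ablate).

```python
import random

def build_reconstruction_task(paragraphs, num_masked, seed=0):
    """Mask `num_masked` paragraphs; return (masked_doc, candidates, answer).

    Hypothetical helper: the placeholder format is an assumption.
    """
    rng = random.Random(seed)
    masked_idx = sorted(rng.sample(range(len(paragraphs)), num_masked))
    # Replace each chosen paragraph with a numbered placeholder.
    masked_doc = list(paragraphs)
    for slot, idx in enumerate(masked_idx):
        masked_doc[idx] = f"<MISSING_{slot}>"
    # Present the removed paragraphs as candidates in shuffled order.
    removed = [paragraphs[i] for i in masked_idx]
    order = list(range(num_masked))
    rng.shuffle(order)
    candidates = [removed[i] for i in order]
    # answer[slot] is the index in `candidates` that fills that slot.
    answer = [order.index(slot) for slot in range(num_masked)]
    return masked_doc, candidates, answer

def reward(predicted, answer):
    """Verifiable reward (assumed form): 1.0 only for the exact ordering."""
    return 1.0 if list(predicted) == list(answer) else 0.0
```

Because the correct ordering is known by construction, the reward can be checked automatically without gold answers written by humans or a teacher model, which is the property the abstract relies on.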

[56] On convexity and efficiency in semantic systems

Nathaniel Imel,Noga Zaslavsky

Main category: cs.CL

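TL;DR: Using the Information Bottleneck (IB) framework, this paper analyzes the relation between convexity and communicative efficiency in semantic category systems. The two often co-occur but are fundamentally independent; efficiency explains attested color naming systems better than convexity and accounts for more empirical phenomena.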

Details Motivation: To clarify the analytical relation between convexity and communicative efficiency in human semantic category systems, and why they co-occur. Method: Combines analytical and empirical methods, studying semantic efficiency within the Information Bottleneck (IB) framework, with theoretical derivations and empirical tests on color naming systems. Result: Neither convexity nor efficiency entails the other; IB-optimal systems are mostly convex in color naming; efficiency discriminates attested from hypothetical color naming systems better than convexity; efficiency accounts for several empirical phenomena that convexity cannot. Conclusion: Convexity and efficiency are fundamentally distinct principles, and efficiency provides a more comprehensive account of semantic typology. Abstract: There are two widely held characterizations of human semantic category systems: (1) they form convex partitions of conceptual spaces, and (2) they are efficient for communication. While prior work observed that convexity and efficiency co-occur in color naming, the analytical relation between them and why they co-occur have not been well understood. We address this gap by combining analytical and empirical analyses that build on the Information Bottleneck (IB) framework for semantic efficiency. First, we show that convexity and efficiency are distinct in the sense that neither entails the other: there are convex systems which are inefficient, and optimally-efficient systems that are non-convex. Crucially, however, the IB-optimal systems are mostly convex in the domain of color naming, explaining the main empirical basis for the convexity approach. Second, we show that efficiency is a stronger predictor for discriminating attested color naming systems from hypothetical variants, with convexity adding negligible improvement on top of that. Finally, we discuss a range of empirical phenomena that convexity cannot account for but efficiency can. Taken together, our work suggests that while convexity and efficiency can yield similar structural observations, they are fundamentally distinct, with efficiency providing a more comprehensive account of semantic typology.

[57] Language Predicts Identity Fusion Across Cultures and Reveals Divergent Pathways to Violence

Devin R. Wright,Justin E. Lane,F. LeRon Shults

Main category: cs.CL

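TL;DR: This paper presents the Cognitive Linguistic Identity Fusion Score (CLIFS), a new measure of identity fusion based on cognitive linguistics, LLMs, and metaphor analysis. It outperforms existing methods across multiple datasets and, applied to extremist manifestos, identifies two high-fusion pathways to violence: ideological (the self defined in terms of the group) and grievance-driven (the group defined in terms of the self).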

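Details Motivation: With political polarization and violence on the rise, understanding the psychological roots of extremism is critical; prior research shows that identity fusion predicts willingness to engage in extreme acts, but efficient, scalable measurement tools are lacking. Method: Proposes the Cognitive Linguistic Identity Fusion Score (CLIFS), which combines cognitive linguistic patterns, LLMs, and metaphor analysis to measure identity fusion implicitly from text; its predictive validity is tested on several datasets from the United Kingdom and Singapore, and it is applied to extremist manifestos for a typological analysis. Result: CLIFS significantly outperforms existing methods in predicting validated fusion scores; in extremist texts it identifies two high-fusion pathways to violence: "ideological" (the self merged into the group, forming kin-like bonds) and "grievance-driven" (the group absorbed into the individual's identity). Conclusion: The study refines identity fusion theory and provides a scalable, language-based tool that supports empirical fusion research and early detection of extremism. Abstract: In light of increasing polarization and political violence, understanding the psychological roots of extremism is increasingly important. Prior research shows that identity fusion predicts willingness to engage in extreme acts. We evaluate the Cognitive Linguistic Identity Fusion Score, a method that uses cognitive linguistic patterns, LLMs, and implicit metaphor to measure fusion from language. Across datasets from the United Kingdom and Singapore, this approach outperforms existing methods in predicting validated fusion scores. Applied to extremist manifestos, two distinct high-fusion pathways to violence emerge: ideologues tend to frame themselves in terms of group, forming kinship bonds; whereas grievance-driven individuals frame the group in terms of their personal identity. These results refine theories of identity fusion and provide a scalable tool aiding fusion research and extremism detection.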

[58] Language Modeling and Understanding Through Paraphrase Generation and Detection

Jan Philip Wahle

Main category: cs.CL

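TL;DR: This thesis proposes decomposing paraphrases into their constituent linguistic aspects (paraphrase types) for a more fine-grained, cognitively grounded understanding of semantic equivalence. Experiments show that training models explicitly on paraphrase types improves performance on downstream tasks such as plagiarism detection and duplicate question identification, in some cases surpassing human baselines.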

Details Motivation: Existing paraphrase modeling mostly reduces the task to a binary judgment or a single rewrite, which hides the specific linguistic factors that preserve meaning and gives no fine-grained account of semantic equivalence. Method: Decomposes paraphrases into types according to linguistic features, builds correspondingly annotated datasets, and trains and evaluates machine learning models explicitly at the type level. Result: In plagiarism detection on Wikipedia and arXiv papers, models trained on paraphrase types reach 89.6% and 66.5% accuracy respectively, both above human baselines; on Quora duplicate question identification they also outperform models trained only on binary pairs. Conclusion: Fine-grained modeling of paraphrase types is a key route to better semantic understanding and improves generalization and robustness across several downstream tasks. Abstract: Language enables humans to share knowledge, reason about the world, and pass on strategies for survival and innovation across generations. At the heart of this process is not just the ability to communicate but also the remarkable flexibility in how we can express ourselves. We can express the same thoughts in virtually infinite ways using different words and structures - this ability to rephrase and reformulate expressions is known as paraphrase. Modeling paraphrases is a keystone to meaning in computational language models; being able to construct different variations of texts that convey the same meaning or not shows strong abilities of semantic understanding. If computational language models are to represent meaning, they must understand and control the different aspects that construct the same meaning as opposed to different meanings at a fine granularity. Yet most existing approaches reduce paraphrasing to a binary decision between two texts or to producing a single rewrite of a source, obscuring which linguistic factors are responsible for meaning preservation. In this thesis, I propose that decomposing paraphrases into their constituent linguistic aspects (paraphrase types) offers a more fine-grained and cognitively grounded view of semantic equivalence. I show that even advanced machine learning models struggle with this task. Yet, when explicitly trained on paraphrase types, models achieve stronger performance on related paraphrase tasks and downstream applications. For example, in plagiarism detection, language models trained on paraphrase types surpass human baselines: 89.6% accuracy compared to 78.4% for plagiarism cases from Wikipedia, and 66.5% compared to 55.7% for plagiarism of scientific papers from arXiv. In identifying duplicate questions on Quora, models trained with paraphrase types improve over models trained on binary pairs. Furthermore, I demonstrate that...

[59] New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR

Zhilin Wang,Yafu Li,Shunkai Zhang,Zhi Wang,Haoran Zhang,Xiaoye Qu,Yu Cheng

Main category: cs.CL

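TL;DR: This paper proposes a probabilistic framework that defines LLM capability as instance-level solvability and argues that Reinforcement Learning with Verifiable Rewards (RLVR) can drive the emergence of complex reasoning by raising the success probability of atomic steps. Experiments with the Algebrarium framework show that RLVR amplifies existing skills, that performance is governed by the joint probability of atomic steps, and that specific skills may be sacrificed to optimize the global reward.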

Details Motivation: To address the central debate over whether RLVR gives LLMs new capabilities or merely elicits latent ones; the paper argues for the former and sets out a quantifiable, probabilistic definition of capability. Method: Builds a probabilistic framework based on instance-level solvability, hypothesizing that complex reasoning emerges from higher atomic step success probabilities; using the Algebrarium framework, models are trained only on single-step operations and evaluated on unseen multi-step tasks. Result: (1) RLVR incentivizes exploration of previously inaccessible solution paths; (2) multi-step task performance depends strongly on the joint probability of atomic steps (Pearson correlation 0.69 to 0.96); (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed. Conclusion: By iteratively optimizing solvable problems, RLVR lets models develop new capabilities for previously unsolvable ones, offering an explanation of emergent abilities grounded in probability and optimization. Abstract: Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($\rho \in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.
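The "exponential decay of success rates" mentioned in this abstract can be made concrete with a small calculation. The sketch below assumes the steps of a chain succeed independently, which is a simplification for illustration and not a claim about the paper's exact model; `chain_success` is a hypothetical helper name.

```python
def chain_success(step_probs):
    """Probability that every step of a chain succeeds, assuming independence."""
    result = 1.0
    for p in step_probs:
        result *= p
    return result

# Compare a weaker and a sharper per-step success probability
# across chains of increasing length.
for p in (0.90, 0.99):
    rates = [round(chain_success([p] * n), 3) for n in (1, 5, 10, 20)]
    print(f"per-step p={p}: {rates}")
```

Under this assumption, a per-step probability of 0.90 gives roughly a 35% success rate on a 10-step chain, while 0.99 gives roughly 90%, which is why sharpening atomic steps can make previously unreachable multi-step problems solvable.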

Binglin Wu,Xianneng Li

Main category: cs.CL

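TL;DR: This paper proposes Legal-KAHRE, a hypergraph-neural-network-based entity and relation extraction algorithm for drug-related judgment documents. It combines a judicial domain knowledge dictionary, a neighbor-oriented packing strategy, a biaffine mechanism, and a hypergraph structure that encodes special case types such as joint crimes, and it significantly outperforms baseline models on the CAIL2022 dataset.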

Details Motivation: Existing entity and relation extraction methods for legal documents lack judicial domain knowledge and ignore characteristics specific to judicial text, such as joint crimes and combined punishment for multiple crimes. Method: Proposes the Legal-KAHRE algorithm: 1) a candidate span generator based on a neighbor-oriented packing strategy and a biaffine mechanism; 2) a legal dictionary carrying judicial domain knowledge, injected into the text encoding through multi-head attention; 3) explicit modeling of special judicial cases such as joint crimes in the hypergraph structure; 4) a hypergraph neural network for higher-order inference via message passing. Result: On the CAIL2022 information extraction dataset, Legal-KAHRE significantly outperforms existing baseline models. Conclusion: Adding judicial domain knowledge and structural priors (such as hypergraph modeling of special cases) improves entity and relation extraction for legal documents, supporting the effectiveness and practicality of Legal-KAHRE. Abstract: With the continuous progress of digitization in Chinese judicial institutions, a substantial amount of electronic legal document information has been accumulated. To unlock its potential value, entity and relation extraction for legal documents has emerged as a crucial task. However, existing methods often lack domain-specific knowledge and fail to account for the unique characteristics of the judicial domain. In this paper, we propose an entity and relation extraction algorithm based on a hypergraph neural network (Legal-KAHRE) for drug-related judgment documents. Firstly, we design a candidate span generator based on a neighbor-oriented packing strategy and a biaffine mechanism, which identifies spans likely to contain entities. Secondly, we construct a legal dictionary with judicial domain knowledge and integrate it into the text encoding representation using multi-head attention. Additionally, we incorporate domain-specific cases like joint crimes and combined punishment for multiple crimes into the hypergraph structure design. Finally, we employ a hypergraph neural network for higher-order inference via message passing. Experimental results on the CAIL2022 information extraction dataset demonstrate that our method significantly outperforms existing baseline models.

[61] When Does Context Help? Error Dynamics of Contextual Information in Large Language Models

Dingzirui Wang,Xuanliang Zhang,Keyan Xu,Qingfu Zhu,Wanxiang Che,Yang Deng

Main category: cs.CL

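TL;DR: This paper presents a unified theoretical framework for analyzing how arbitrary contextual information affects the inference performance of Transformer-based LLMs. It reveals the geometric mechanism by which context improves performance through an error correction vector, validates the theory in several settings, and uses it to guide context selection.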

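Details Motivation: There is no unified theoretical understanding of how contextual information (such as demonstrations, retrieved knowledge, or interaction history) improves LLM performance, especially beyond specific settings such as in-context learning. Method: Builds a unified theoretical framework based on output error dynamics; proves for a single-layer Transformer that the effect of context decomposes into the baseline error plus a contextual correction vector; derives the geometric conditions the correction vector must satisfy to reduce error and an upper bound on its norm; extends the results to multi-context and multi-layer Transformers; validates the theory with ICL, RAG, and memory evolution experiments. Result: Gives the alignment and norm constraints the contextual correction vector must meet; shows that the bound on its norm is determined by context-query relevance and complementarity; experiments support the theory and motivate a context selection strategy that improves performance by 0.6% across multiple tasks. Conclusion: The effect of contextual information on LLMs can be characterized uniformly as a geometric correction in error space, which provides a basis for understanding and optimizing context-augmentation methods. Abstract: Contextual information at inference time, such as demonstrations, retrieved knowledge, or interaction history, can substantially improve large language models (LLMs) without parameter updates, yet its theoretical role remains poorly understood beyond specific settings such as in-context learning (ICL). We present a unified theoretical framework for analyzing the effect of arbitrary contextual information in Transformer-based LLMs. Our analysis characterizes contextual influence through output error dynamics. In a single-layer Transformer, we prove that the context-conditioned error vector decomposes additively into the baseline error vector and a contextual correction vector. This yields necessary geometric conditions for error reduction: the contextual correction must align with the negative baseline error and satisfy a norm constraint. We further show that the contextual correction norm admits an explicit upper bound determined by context-query relevance and complementarity. These results extend to multi-context and multi-layer Transformers. Experiments across ICL, retrieval-augmented generation, and memory evolution validate our theory and motivate a principled context selection strategy that improves performance by 0.6%.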
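The geometric conditions stated in this abstract follow from a short calculation once the additive decomposition is granted. The symbols below ($e_0$ for the baseline error, $\Delta$ for the contextual correction, $e_c$ for the context-conditioned error) and the use of the Euclidean norm are our notation and assumptions, not necessarily the paper's.

```latex
\begin{align}
e_c &= e_0 + \Delta, \\
\|e_c\|^2 &= \|e_0\|^2 + 2\langle e_0, \Delta\rangle + \|\Delta\|^2, \\
\|e_c\| < \|e_0\| &\iff \langle -e_0, \Delta\rangle > \tfrac{1}{2}\|\Delta\|^2 .
\end{align}
```

Writing $\theta$ for the angle between $\Delta$ and $-e_0$, the last line says that for $\Delta \neq 0$ the error shrinks exactly when $\cos\theta > 0$ (alignment with the negative baseline error) and $\|\Delta\| < 2\|e_0\|\cos\theta$ (a norm constraint), which matches the two conditions the abstract names.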

[62] JUSTICE: Judicial Unified Synthesis Through Intermediate Conclusion Emulation for Automated Judgment Document Generation

Binglin Wu,Yingyi Zhang,Xianneng Li

Main category: cs.CL

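TL;DR: This paper proposes the JUSTICE framework, which emulates a judge's "Search -> Pre-Judge -> Write" cognitive workflow and, in particular, introduces a Pre-Judge stage to improve the legal accuracy and logical coherence of generated judgment documents.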

Details Motivation: Existing methods for automated judgment document generation omit the crucial Pre-Judge step, so judicial elements are acquired ineffectively and the Pre-Judge process is modeled inadequately, which weakens the legal rigor of the generated documents. Method: Proposes the JUSTICE framework with three core components: the Referential Judicial Element Retriever (RJER), the Intermediate Conclusion Emulator (ICE), and the Judicial Unified Synthesizer (JUS), which respectively retrieve the legal basis, generate a verifiable intermediate conclusion, and synthesize the final judgment. Result: Significantly outperforms strong baselines on an in-domain benchmark and an out-of-distribution dataset, with a 4.6% improvement in prison term prediction and gains in legal accuracy overall. Conclusion: Explicitly modeling the Pre-Judge process is essential for the legal coherence and accuracy of generated judgment documents, and JUSTICE offers legal AI a paradigm closer to judicial practice. Abstract: Automated judgment document generation is a significant yet challenging legal AI task. As the conclusive written instrument issued by a court, a judgment document embodies complex legal reasoning. However, existing methods often oversimplify this complex process, particularly by omitting the "Pre-Judge" phase, a crucial step where human judges form a preliminary conclusion. This omission leads to two core challenges: 1) the ineffective acquisition of foundational judicial elements, and 2) the inadequate modeling of the Pre-Judge process, which collectively undermine the final document's legal soundness. To address these challenges, we propose Judicial Unified Synthesis Through Intermediate Conclusion Emulation (JUSTICE), a novel framework that emulates the "Search -> Pre-Judge -> Write" cognitive workflow of human judges. Specifically, it introduces the Pre-Judge stage through three dedicated components: Referential Judicial Element Retriever (RJER), Intermediate Conclusion Emulator (ICE), and Judicial Unified Synthesizer (JUS). RJER first retrieves legal articles and a precedent case to establish a referential foundation. ICE then operationalizes the Pre-Judge phase by generating a verifiable intermediate conclusion. Finally, JUS synthesizes these inputs to craft the final judgment. Experiments on both an in-domain legal benchmark and an out-of-distribution dataset show that JUSTICE significantly outperforms strong baselines, with substantial gains in legal accuracy, including a 4.6% improvement in prison term prediction. Our findings underscore the importance of explicitly modeling the Pre-Judge process to enhance the legal coherence and accuracy of generated judgment documents.

[63] Improving Data and Reward Design for Scientific Reasoning in Large Language Models

Zijie Chen,Zhenghao Lin,Xiao Liu,Zhenzhong Lan,Yeyun Gong,Peng Cheng

Main category: cs.CL

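TL;DR: This paper presents the Dr. SCI dataset and post-training pipeline. It builds a large-scale, structured science question-answering dataset and designs a three-stage method (Exploration-Expanding SFT, Dynamic Difficulty Curriculum, and SciRubric-Guided RL) that substantially improves LLM reasoning on open-ended science questions.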

Details Motivation: To address the performance bottleneck that unreliable supervision and evaluation impose on LLMs for open-ended science questions, focusing on data construction and reward design in scientific post-training. Method: Builds the Dr. SCI dataset of 1M questions across eight STEM subjects, with verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics; on top of it, proposes the three-stage Dr. SCI post-training pipeline: (i) Exploration-Expanding SFT, (ii) Dynamic Difficulty Curriculum, (iii) SciRubric-Guided RL. Result: Qwen3-4B-Base trained with the Dr. SCI pipeline scores 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong baselines such as o1-mini and GPT-4o, with the largest gains on open-ended scientific reasoning. Conclusion: Systematic data construction and a post-training design tailored to scientific reasoning can mitigate unreliable supervision and evaluation on open-ended science questions and open a path for scientific LLMs. Abstract: Solving open-ended science questions remains challenging for large language models, particularly due to inherently unreliable supervision and evaluation. The bottleneck lies in the data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained using the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.

[64] An Attention-over-Attention Generative Model for Joint Multiple Intent Detection and Slot Filling

Wei Zhu

Main category: cs.CL

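TL;DR: This paper proposes a generative framework that jointly handles multiple intent detection and slot filling, with an attention-over-attention decoder designed to cope with multiple intents and interference between the sub-tasks. It also builds two new multi-intent SLU datasets using the NSP head of BERT.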

Details Motivation: Existing methods mainly target single-intent SLU, whereas in real scenarios users often express several intents in one utterance, which challenges existing systems and datasets. Method: Proposes a generative framework with an attention-over-attention decoder that models a variable number of intents and reduces interference between sub-tasks, and constructs two new multi-intent SLU datasets using the NSP head of BERT. Result: State-of-the-art performance on MixATIS, MixSNIPS, and the newly constructed datasets. Conclusion: Generative modeling with an attention-over-attention mechanism improves multi-intent SLU, and the new datasets provide additional benchmarks for the task. Abstract: In task-oriented dialogue systems, spoken language understanding (SLU) is a critical component, which consists of two sub-tasks, intent detection and slot filling. Most existing methods focus on single-intent SLU, where each utterance only has one intent. However, in real-world scenarios users usually express multiple intents in an utterance, which poses a challenge for existing dialogue systems and datasets. In this paper, we propose a generative framework to simultaneously address multiple intent detection and slot filling. In particular, an attention-over-attention decoder is proposed to handle the variable number of intents and the interference between the two sub-tasks by incorporating an inductive bias into the process of multi-task learning. Besides, we construct two new multi-intent SLU datasets based on single-intent utterances by taking advantage of the next sentence prediction (NSP) head of the BERT model. Experimental results demonstrate that our proposed attention-over-attention generative model achieves state-of-the-art performance on two public datasets, MixATIS and MixSNIPS, and our constructed datasets.

[65] Latent Reasoning with Supervised Thinking States

Ido Amos,Avi Caciularu,Mor Geva,Amir Globerson,Jonathan Herzig,Lior Shani,Idan Szpektor

Main category: cs.CL

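TL;DR: This paper proposes Thinking States, a method that lets LLMs reason while processing the input: it generates thinking tokens, maps them back into embedding space, and adds them to subsequent input, improving reasoning efficiency and performance.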

Details Motivation: Conventional chain-of-thought (CoT) improves performance on complex tasks but incurs high inference cost because it generates long rationales. Method: Thinking States generates sequences of thinking tokens while the input is being processed, maps them back into embedding space, and injects them into the following input; it is trained in parallel using natural language supervision and teacher-forcing. Result: Outperforms other latent reasoning methods on multiple reasoning tasks: it narrows the gap to CoT on math problems, matches CoT on 2-Hop QA with lower latency, and on state-tracking tasks generalizes better than CoT, extrapolating to longer sequences. Conclusion: Thinking States is an efficient, learnable, and parallelizable mechanism for reasoning during input processing that maintains or improves reasoning performance while reducing latency and generalizing better. Abstract: Reasoning with a chain-of-thought (CoT) enables Large Language Models (LLMs) to solve complex tasks but incurs significant inference costs due to the generation of long rationales. We propose Thinking States, a method that performs reasoning while the input is being processed. Specifically, Thinking States generates sequences of thinking tokens every few input tokens, transforms the thoughts back into embedding space, and adds them to the following input tokens. This has two key advantages. First, it captures the recurrent nature of CoT, but where the thought tokens are generated as the input is being processed. Second, since the thoughts are represented as tokens, they can be learned from natural language supervision, and using teacher-forcing, which is parallelizable. Empirically, Thinking States outperforms other latent reasoning methods on multiple reasoning tasks, narrowing the gap to CoT on math problems, and matching its performance on 2-Hop QA with improved latency. On state-tracking tasks, we show Thinking States leads to stronger reasoning behavior than CoT, successfully extrapolating to longer sequences than seen during training.

[66] UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models

Cheng Yang,Chufan Shi,Bo Shui,Yaokang Wu,Muzi Tao,Huijuan Wang,Ivan Yee Lee,Yong Liu,Xuezhe Ma,Taylor Berg-Kirkpatrick

Main category: cs.CL

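TL;DR: This paper presents the UReason benchmark for evaluating reasoning-driven image generation. Reasoning traces improve performance, but keeping intermediate thoughts as conditioning context interferes with visual synthesis, indicating that the bottleneck is contextual interference rather than insufficient reasoning capacity.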

Details Motivation: To examine the actual effect of chain-of-thought reasoning on image generation, in particular whether reasoning can be faithfully executed in pixels. Method: Builds UReason, a diagnostic benchmark of 2,000 instances across five task families, with an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation (conditioning only on the refined prompt). Result: Across eight open-source unified models, a "Reasoning Paradox" appears: reasoning traces generally beat direct generation, but keeping intermediate thoughts as conditioning context hurts generation, while conditioning only on the refined prompt yields substantial gains. Conclusion: The bottleneck for reasoning in image generation is contextual interference, not insufficient reasoning capacity; UReason provides a principled testbed for studying reasoning in unified models and motivates methods that integrate reasoning while mitigating interference. Abstract: To elicit capabilities for addressing complex and implicit visual requirements, recent unified multimodal models increasingly adopt chain-of-thought reasoning to guide image generation. However, the actual effect of reasoning on visual synthesis remains unclear. We present UReason, a diagnostic benchmark for reasoning-driven image generation that evaluates whether reasoning can be faithfully executed in pixels. UReason contains 2,000 instances across five task families: Code, Arithmetic, Spatial, Attribute, and Text reasoning. To isolate the role of reasoning traces, we introduce an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation which conditions only on the refined prompt. Across eight open-source unified models, we observe a consistent Reasoning Paradox: Reasoning traces generally improve performance over direct generation, yet retaining intermediate thoughts as conditioning context often hinders visual synthesis, and conditioning only on the refined prompt yields substantial gains. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity. UReason provides a principled testbed for studying reasoning in unified models and motivates future methods that effectively integrate reasoning for visual generation while mitigating interference.

[67] WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints

Zexuan Wang,Chenghao Yang,Yingqi Que,Zhenzhu Yang,Huaqing Yuan,Yiwen Wang,Zhengxuan Jiang,Shengjie Fang,Zhenhe Wu,Zhaohui Wang,Zhixin Yao,Jiashuo Liu,Jincheng Ren,Yuzhen Li,Yang Yang,Jiaheng Liu,Jian Yang,Zaiyuan Wang,Ge Zhang,Zhoufutu Wen,Wenhao Huang

Main category: cs.CL

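TL;DR: This paper presents the WorldTravel benchmark of 150 real-world travel scenarios, each requiring 15+ interdependent temporal and logical constraints, together with WorldTravel-Webscape, a multimodal environment in which models perceive constraint parameters directly from the visual layout of 2,000+ webpages. Frontier models such as GPT-5.2 reach only 32.67% and 19.33% feasibility in the text-only and multimodal settings respectively, exposing a Perception-Action Gap and a planning-horizon bottleneck at about 10 constraints, and pointing to the need for agents that combine high-fidelity visual perception with long-horizon reasoning.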

Details Motivation: Existing benchmarks mostly rely on loosely coupled constraints and idealized data, so they do not capture how, under tightly coupled real-world constraints, one decision affects the feasibility of everything after it, nor the practical difficulty of extracting parameters from dynamic webpages. Method: Builds the WorldTravel benchmark (150 scenarios, 5 cities, 15+ constraints on average) and the accompanying WorldTravel-Webscape multimodal environment (2,000+ rendered webpages), evaluates 10 frontier models for feasibility in text-only and multimodal settings, and analyzes the causes of performance collapse and the bottleneck threshold. Result: State-of-the-art models such as GPT-5.2 reach only 32.67% feasibility in the text-only setting, dropping to 19.33% in the multimodal environment; there is a marked Perception-Action Gap, and reasoning degrades sharply once a plan exceeds about 10 constraints. Conclusion: Current large models are severely limited on realistic planning with complex constraints, and next-generation autonomous agents need to unify high-fidelity visual perception with long-horizon logical reasoning. Abstract: Real-world autonomous planning requires coordinating tightly coupled constraints where a single decision dictates the feasibility of all subsequent actions. However, existing benchmarks predominantly feature loosely coupled constraints solvable through local greedy decisions and rely on idealized data, failing to capture the complexity of extracting parameters from dynamic web environments. We introduce WorldTravel, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints. To evaluate agents in realistic deployments, we develop WorldTravel-Webscape, a multi-modal environment featuring over 2,000 rendered webpages where agents must perceive constraint parameters directly from visual layouts to inform their planning. Our evaluation of 10 frontier models reveals a significant performance collapse: even the state-of-the-art GPT-5.2 achieves only 32.67% feasibility in text-only settings, which plummets to 19.33% in multi-modal environments. We identify a critical Perception-Action Gap and a Planning Horizon threshold at approximately 10 constraints where model reasoning consistently fails, suggesting that perception and reasoning remain independent bottlenecks. These findings underscore the need for next-generation agents that unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.

[68] ViGoEmotions: A Benchmark Dataset For Fine-grained Emotion Detection on Vietnamese Texts

Hung Quang Tran,Nam Tien Pham,Son T. Luu,Kiet Van Nguyen

Main category: cs.CL

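TL;DR: This paper introduces ViGoEmotions, a corpus of 20,664 Vietnamese social media comments annotated with 27 fine-grained emotions, and systematically evaluates how three preprocessing strategies (preserving emojis with rule-based normalization, converting emojis to text, and ViSoLex lexical normalization) affect the emotion-classification performance of several pre-trained Vietnamese Transformer models. ViSoBERT performs best (Macro F1 = 61.50%), and the way emojis are handled noticeably affects results.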

Details Motivation: Fine-grained emotion classification for Vietnamese lacks a high-quality, large-scale annotated corpus, which limits research and applications; in addition, emojis are central to emotional expression on social media, yet the effect of emoji preprocessing on model performance is unclear. Method: Builds the ViGoEmotions Vietnamese emotion corpus (20,664 comments, 27 emotion classes); compares three preprocessing strategies on eight Transformer models: preserving emojis with rule-based normalization, converting emojis to textual descriptions, and normalizing with the ViSoLex model; evaluates with Macro F1 and Weighted F1. Result: ViSoBERT achieves the highest Macro F1 (61.50%) and Weighted F1 (63.26%); converting emojis to text improves most BERT baselines, whereas preserving emojis works better for ViSoBERT and CafeBERT; removing emojis generally lowers performance; CafeBERT and PhoBERT also perform strongly. Conclusion: The ViGoEmotions corpus supports Vietnamese emotion classification effectively, but downstream performance depends heavily on the preprocessing strategy and on annotation quality; emojis should not simply be removed, and their handling should be matched to the model. Abstract: Emotion classification plays a significant role in emotion prediction and harmful content detection. Recent advancements in NLP, particularly through large language models (LLMs), have greatly improved outcomes in this field. This study introduces ViGoEmotions -- a Vietnamese emotion corpus comprising 20,664 social media comments in which each comment is classified into 27 fine-grained distinct emotions. To evaluate the quality of the dataset and its impact on emotion classification, eight pre-trained Transformer-based models were evaluated under three preprocessing strategies: preserving original emojis with rule-based normalization, converting emojis into textual descriptions, and applying ViSoLex, a model-based lexical normalization system. Results show that converting emojis into text often improves the performance of several BERT-based baselines, while preserving emojis yields the best results for ViSoBERT and CafeBERT. In contrast, removing emojis generally leads to lower performance. ViSoBERT achieved the highest Macro F1-score of 61.50% and Weighted F1-score of 63.26%. Strong performance was also observed from CafeBERT and PhoBERT. These findings highlight that while the proposed corpus can support diverse architectures effectively, preprocessing strategies and annotation quality remain key factors influencing downstream performance.
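
The ViGoEmotions entry reports both a Macro F1 and a Weighted F1 score. As a reading aid, the sketch below shows how the two averages differ on an imbalanced label set. It is a minimal illustration on invented toy labels, assuming a single-label setting; the helper names (`per_class_f1`, `macro_and_weighted_f1`) are hypothetical and this is not the paper's evaluation code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 averages classes equally; Weighted F1 weights each class by its support."""
    support = Counter(y_true)
    labels = sorted(support)
    scores = {label: per_class_f1(y_true, y_pred, label) for label in labels}
    macro = sum(scores.values()) / len(labels)
    weighted = sum(scores[label] * support[label] for label in labels) / len(y_true)
    return macro, weighted


# Toy data (invented): a frequent class predicted well, a rare class predicted poorly.
y_true = ["joy"] * 8 + ["anger"] * 2
y_pred = ["joy"] * 8 + ["joy", "anger"]
macro_f1, weighted_f1 = macro_and_weighted_f1(y_true, y_pred)
```

Because the rare class counts as much as the frequent one under macro averaging, Macro F1 is the lower of the two numbers whenever rare classes are the hard ones, which is consistent with the reported Macro F1 sitting below the Weighted F1.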

[69] Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning

Zhuoen Chen,Dongfang Li,Meishan Zhang,Baotian Hu,Min Zhang

Main category: cs.CL

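TL;DR: This paper proposes a cognitively inspired framework for long-context reasoning that uses chunk-wise compression and selective memory recall to improve the efficiency and performance of large language models on long texts.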

Details Motivation: To address the quadratic computational cost, information forgetting, and RAG-induced context fragmentation that large language models face in long-context processing. Method: Splits long inputs into chunks and encodes each with a learned compressor into compressed memory representations; a gating module dynamically selects relevant memory blocks, which a reasoning module with an evolving working memory processes iteratively; the compressor and reasoner are jointly optimized with reinforcement learning, while the gating module is trained separately as a classifier. Result: Achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a better accuracy-efficiency trade-off: up to a 2x reduction in peak GPU memory and a 6x inference speedup over MemAgent. Conclusion: The cognitively inspired framework alleviates the core bottlenecks of long-context processing and offers a new paradigm for efficient, scalable long-text reasoning. Abstract: Large Language Models (LLMs) face significant challenges in long-context processing, including quadratic computational costs, information forgetting, and the context fragmentation inherent in retrieval-augmented generation (RAG). We propose a cognitively inspired framework for efficient long-context inference based on chunk-wise compression and selective memory recall, rather than processing all raw tokens. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. A gating module dynamically selects relevant memory blocks, which are then iteratively processed by a reasoning module with an evolving working memory to solve downstream tasks. The compressor and reasoner are jointly optimized via end-to-end reinforcement learning, while the gating module is trained separately as a classifier. Experimental results show that the proposed method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a favorable accuracy-efficiency trade-off compared to strong long-context baselines. In particular, it achieves up to a 2 times reduction in peak GPU memory usage and a 6 times inference speedup over MemAgent.

[70] TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration

Linye Wei,Zixiang Luo,Pingzhi Tang,Meng Li

Main category: cs.CL

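TL;DR: This paper proposes TEAM, a framework that exploits the temporal and spatial consistency of expert routing to reduce the number of experts activated at each step in MoE diffusion large language models (dLLMs), substantially accelerating inference while keeping performance almost unchanged.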

Details Motivation: MoE dLLMs suffer from a fundamental mismatch in diffusion decoding: many experts are activated at each step while only a few tokens are accepted, which causes high inference overhead and hinders use in latency-sensitive scenarios. Method: Building on the strong consistency of expert routing over time (across denoising levels) and space (across token positions), TEAM designs three complementary expert-activation and decoding strategies: it conservatively selects the necessary experts for decoded and masked tokens, and performs aggressive speculative exploration across multiple candidate tokens. Result: TEAM achieves up to a 2.2x inference speedup over the baseline MoE dLLM with negligible performance degradation. Conclusion: TEAM is a plug-and-play acceleration framework that mitigates the mismatch between MoE architectures and the diffusion decoding paradigm, making dLLMs more practical to deploy. Abstract: Diffusion large language models (dLLMs) have recently gained significant attention due to their inherent support for parallel decoding. Building on this paradigm, Mixture-of-Experts (MoE) dLLMs with autoregressive (AR) initialization have further demonstrated strong performance competitive with mainstream AR models. However, we identify a fundamental mismatch between MoE architectures and diffusion-based decoding. Specifically, a large number of experts are activated at each denoising step, while only a small subset of tokens is ultimately accepted, resulting in substantial inference overhead and limiting their deployment in latency-sensitive applications. In this work, we propose TEAM, a plug-and-play framework that accelerates MoE dLLMs by enabling more accepted tokens with fewer activated experts. TEAM is motivated by the observation that expert routing decisions exhibit strong temporal consistency across denoising levels as well as spatial consistency across token positions. Leveraging these properties, TEAM employs three complementary expert activation and decoding strategies, conservatively selecting necessary experts for decoded and masked tokens and simultaneously performing aggressive speculative exploration across multiple candidates. Experimental results demonstrate that TEAM achieves up to 2.2x speedup over vanilla MoE dLLM, with negligible performance degradation. Code is released at https://github.com/PKU-SEC-Lab/TEAM-MoE-dLLM.

[71] Prism: Spectral-Aware Block-Sparse Attention

Xinghao Wang,Pengyu Wang,Xiaoran Liu,Fangxu Liu,Jason Chu,Kai Song,Xipeng Qiu

Main category: cs.CL

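TL;DR: This paper proposes Prism, a spectral-aware block-level attention method that addresses the loss of positional information caused by the interaction between mean pooling and RoPE in block-sparse attention, achieving up to a 5.1x speedup while preserving accuracy.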

Details Motivation: Existing block-sparse attention methods estimate block importance with coarse-grained attention (for example, mean pooling), but its interaction with RoPE loses high-frequency positional information, making the estimates inaccurate and the selection costly. Method: Proposes Prism, a training-free method that decomposes block selection into high-frequency and low-frequency branches and uses energy-based temperature calibration to restore the attenuated positional signals from the pooled representations, using only block-level operations throughout. Result: Prism maintains accuracy on par with full attention across multiple long-context tasks while delivering up to a 5.1x pre-filling speedup. Conclusion: Prism identifies and corrects, from a spectral perspective, the theoretical flaw of mean pooling in block-sparse attention, offering a new direction for efficient long-context modeling. Abstract: Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to $\mathbf{5.1\times}$ speedup.
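
The abstract's claim that mean pooling acts as a low-pass filter under RoPE can be checked numerically. The sketch below is only an illustration of that general effect, not Prism itself: it assumes the standard RoPE frequencies θ_j = base^(-2j/d), treats each dimension pair as a unit complex rotation e^{ipθ}, and averages over a block of B consecutive positions with identical content. The block size of 64 and head dimension of 128 are assumptions chosen for the example. The averaged rotation has magnitude |sin(Bθ/2) / (B sin(θ/2))|, which is close to 1 for small θ and close to 0 for large θ.

```python
import cmath
import math


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies, one per dimension pair."""
    return [base ** (-2 * j / head_dim) for j in range(head_dim // 2)]


def pooled_gain(theta, block_size):
    """Magnitude of the mean of the rotations e^{i*p*theta} for p = 0..block_size-1."""
    total = sum(cmath.exp(1j * p * theta) for p in range(block_size))
    return abs(total) / block_size


# Assumed sizes for illustration only.
freqs = rope_frequencies(head_dim=128)
gains = [pooled_gain(theta, block_size=64) for theta in freqs]

# gains[0] is the highest-frequency pair (theta = 1.0); gains[-1] is the lowest.
highest_freq_gain, lowest_freq_gain = gains[0], gains[-1]
```

In this toy setting the highest-frequency pair is almost entirely cancelled by pooling while the lowest-frequency pair passes through nearly unchanged, which is the "blind spot" for local positional information that the abstract describes.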

[72] Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI

Ziyan wang,Longlong Ma

Main category: cs.CL

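TL;DR: Through theoretical analysis and empirical experiments, this paper tests Chomsky's critique of large language models (LLMs), namely that LLMs cannot distinguish possible from impossible languages. The experiments construct several syntactically "impossible" languages (such as reversing whole sentences, or adding negation according to word-count parity) and test how well GPT-2 small and LSTM models learn them. GPT-2 small learns the impossible languages significantly worse than the possible one (p<.001), while the LSTM results are reported as in line with Chomsky's argument, which the authors take to highlight the distinctive role of the Transformer architecture. The study then proposes re-reading LLMs within Chomsky's theoretical framework and shifting the field from a "rationalist-romantic" paradigm toward functionalism and empiricism.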

Details Motivation: To respond to Chomsky's philosophical and cognitive-science critique of the nature of LLMs and to test its core claim: that LLMs lack the intrinsic causal and self-correcting structures on which human language acquisition depends, and therefore cannot recognize "impossible languages". Method: Combines a literature review with controlled experiments: 1) constructs several syntactically "impossible" artificial languages from English (such as reversing whole sentences, or inserting negation according to word-count parity); 2) runs two rounds of learning tasks on GPT-2 small and LSTM models; 3) uses Welch's t-test to compare model performance on possible and impossible languages. Result: GPT-2 small performs significantly worse on every impossible language than on the possible language (p<.001); the LSTM models' behavior is reported as in line with Chomsky's argument, which the authors read as evidence of the irreplaceable role of the evolution of the Transformer architecture. Conclusion: Chomsky's critique is illuminating but needs revision: LLMs are indeed not human-like language learners, but the limits of their abilities cannot simply be attributed to a lack of innate structure and should instead be re-examined in terms of architectural evolution and functional fit; LLM research should shift from a rationalist paradigm toward functionalism and empiricism. Abstract: In Chomsky's provocative critique "The False Promise of CHATGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures like humans, and are therefore not able to distinguish impossible languages. It stands as a representative of a fundamental challenge to the intellectual foundations of AI, for it integrally synthesizes major methodological issues within LLMs and possesses an iconic a priori rationalist perspective. We examine this famous critique both from the perspective of the existing literature in linguistics and psychology and through an experiment probing the capacity of LLMs to learn both possible and impossible languages. We constructed a set of syntactically impossible languages by applying certain transformations to English. These include reversing whole sentences, and adding negation based on word-count parity. Two rounds of controlled experiments were each conducted on GPT-2 small models and long short-term memory (LSTM) models. Statistical analysis (Welch's t-test) shows GPT-2 small models underperform in learning all of the impossible languages compared to their performance on the possible language (p<.001). On the other hand, LSTM models' performance tallies with Chomsky's argument, suggesting the irreplaceable role of the evolution of transformer architecture. Based on theoretical analysis and empirical findings, we propose a new vision within Chomsky's theory towards LLMs, and a shift of theoretical paradigm outside Chomsky, from his "rationalist-romantics" paradigm to functionalism and empiricism in LLMs research.
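
The two transformations named in the abstract, and the Welch's t-test used to compare conditions, can be sketched as follows. This is a hedged illustration only: the abstract does not specify the exact rules, so the word-level reversal, the choice of appending the marker `not` when the word count is odd, and all function names are assumptions made for the example rather than the authors' procedure.

```python
import math
from statistics import mean, variance


def reverse_sentence(sentence):
    """Assumed reading of 'reversing whole sentences': reverse the word order."""
    return " ".join(reversed(sentence.split()))


def parity_negation(sentence, marker="not"):
    """Assumed parity rule: append a negation marker when the word count is odd."""
    words = sentence.split()
    if len(words) % 2 == 1:
        words.append(marker)
    return " ".join(words)


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = variance(sample_a) / len(sample_a)  # squared standard error of mean a
    var_b = variance(sample_b) / len(sample_b)  # squared standard error of mean b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


reversed_example = reverse_sentence("the cat sat")
negated_example = parity_negation("the cat sat")
t_stat, dof = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])  # toy numbers, not paper data
```

Welch's test is the natural choice when the two groups of runs may have unequal variances, since it does not pool the variance estimates; turning `t_stat` and `dof` into a p-value would additionally require the Student t distribution, which is left out of this standard-library sketch.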

[73] Characterizing, Evaluating, and Optimizing Complex Reasoning

Haoran Zhang,Yafu Li,Zhi Wang,Zhilin Wang,Shunkai Zhang,Xiaoye Qu,Yu Cheng

Main category: cs.CL

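TL;DR: This paper proposes the ME² principle and a DAG-based evaluation method, builds the TRM-Preference dataset, and trains a Thinking Reward Model (TRM) to evaluate and optimize the reasoning quality of Large Reasoning Models (LRMs) at scale.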

Details Motivation: Existing work lacks a unified answer to three fundamental questions: what defines high-quality reasoning, how to reliably evaluate long, implicitly structured reasoning traces, and how to use evaluation signals to optimize reasoning. Method: Proposes the ME² principle (efficiency and effectiveness at both the macro and micro levels); models reasoning traces as directed acyclic graphs (DAGs) and designs a DAG-based pairwise evaluation method; builds the TRM-Preference dataset and trains a Thinking Reward Model (TRM). Result: Experiments show that thinking rewards are an effective optimization signal: selecting better reasoning at test time improves performance by up to 19.3%, and using them in reinforcement-learning training improves reasoning and task performance by up to 3.9%. Conclusion: The paper provides a unified framework, from definition through evaluation to optimization, for systematically improving the quality of complex reasoning in Large Reasoning Models. Abstract: Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle to characterize reasoning quality along macro- and micro-level concerning efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method, capturing complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks.

[74] GISA: A Benchmark for General Information-Seeking Assistant

Yutao Zhu,Xingshuo Zhang,Maosen Zhang,Jiajie Jin,Liancheng Zhang,Xiaoshuai Song,Kangzhi Zhao,Wencong Zeng,Ruiming Tang,Han Li,Ji-Rong Wen,Zhicheng Dou

Main category: cs.CL

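TL;DR: This paper introduces GISA, a benchmark for evaluating general information-seeking assistants. It contains 373 human-crafted, realistic queries, supports multiple answer formats and process-level supervision, and shows that current LLMs still perform poorly on complex information-seeking tasks.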

Details Motivation: Existing benchmarks for information-seeking agents construct tasks unnaturally, rely on static answers that are prone to contamination, and cover a narrow range of task types, so they do not reflect real needs. Method: Builds the GISA benchmark of 373 human-designed realistic queries with four structured answer formats (item, set, list, table); adds a dynamically updated live subset to resist data contamination; provides complete human search trajectories for process-level supervision and imitation learning. Result: Mainstream LLMs and commercial search products perform poorly on GISA: the best model reaches only a 19.30% exact match score, and performance drops markedly on tasks requiring complex planning and comprehensive information gathering. Conclusion: GISA exposes the capability bottlenecks of current information-seeking agents and offers a more realistic and more challenging evaluation platform, along with training signals, for future research. Abstract: The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
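
The GISA abstract notes that its four structured answer formats (item, set, list, table) enable deterministic evaluation. The sketch below shows one way such format-aware exact matching could work. It is an assumption-laden illustration: the normalization rule, the treatment of sets as order-insensitive and of lists and tables as order-sensitive, and the function names are all hypothetical and are not taken from the benchmark.

```python
def normalize(value):
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(str(value).lower().split())


def exact_match(pred, gold, fmt):
    """Deterministic comparison whose strictness depends on the answer format."""
    if fmt == "item":    # a single value
        return normalize(pred) == normalize(gold)
    if fmt == "set":     # order does not matter
        return {normalize(x) for x in pred} == {normalize(x) for x in gold}
    if fmt == "list":    # order matters
        return [normalize(x) for x in pred] == [normalize(x) for x in gold]
    if fmt == "table":   # rows and columns must line up
        return [[normalize(c) for c in row] for row in pred] == [
            [normalize(c) for c in row] for row in gold
        ]
    raise ValueError(f"unknown answer format: {fmt}")


# Toy answers (invented) showing how the same values score under different formats.
as_set = exact_match(["Oslo", "Lima"], ["lima", "oslo"], "set")
as_list = exact_match(["Oslo", "Lima"], ["lima", "oslo"], "list")
```

Declaring the format up front removes the need for a judge model to decide whether ordering or duplicates matter, which is what makes the score reproducible.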

[75] How Do Language Models Understand Tables? A Mechanistic Analysis of Cell Location

Xuanliang Zhang,Dingzirui Wang,Keyan Xu,Qingfu Zhu,Wanxiang Che

Main category: cs.CL

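TL;DR: Using activation patching and other interpretability techniques, this paper uncovers how large language models locate cells in linearized tables. It decomposes the mechanism into three stages (semantic binding, coordinate localization, and information extraction) and finds that models resolve coordinates with an ordinal mechanism that counts delimiters, that column indices are encoded in a linear subspace, and that multi-cell location tasks reuse the same attention heads.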

Details Motivation: Large language models (LLMs) are increasingly used for table-related tasks, but the internal mechanisms by which they process linearized two-dimensional structured tables remain unclear. Method: Applies activation patching and complementary interpretability techniques to dissect cell location, the atomic task of table understanding. Result: Models resolve coordinates through an ordinal mechanism that counts discrete delimiters; column indices are encoded in a linear subspace that allows the model's focus to be steered precisely through vector arithmetic; multi-cell location tasks reuse the same attention heads identified in atomic location. Conclusion: The study systematically reveals the internal mechanism of table understanding in Transformer architectures and provides a comprehensive account of how LLMs handle structured data. Abstract: While Large Language Models (LLMs) are increasingly deployed for table-related tasks, the internal mechanisms enabling them to process linearized two-dimensional structured tables remain opaque. In this work, we investigate the process of table understanding by dissecting the atomic task of cell location. Through activation patching and complementary interpretability techniques, we delineate the table understanding mechanism into a sequential three-stage pipeline: Semantic Binding, Coordinate Localization, and Information Extraction. We demonstrate that models locate the target cell via an ordinal mechanism that counts discrete delimiters to resolve coordinates. Furthermore, column indices are encoded within a linear subspace that allows for precise steering of model focus through vector arithmetic. Finally, we reveal that models generalize to multi-cell location tasks by multiplexing the identical attention heads identified during atomic location. Our findings provide a comprehensive explanation of table understanding within Transformer architectures.
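
To make the phrase "an ordinal mechanism that counts discrete delimiters" concrete, the sketch below performs cell location on a linearized table in exactly that way: a single left-to-right scan that increments a row counter at each row delimiter and a column counter at each column delimiter. It is an explicit symbolic analogue offered for intuition only; the delimiter choices (`\n` and `|`) and the function name are assumptions, and nothing here describes the model's actual internal computation.

```python
def locate_cell(linearized, row, col, row_sep="\n", col_sep="|"):
    """Return the cell at (row, col) by counting delimiters in one pass."""
    r = c = 0
    current = []
    for ch in linearized:
        if ch == row_sep or ch == col_sep:
            if r == row and c == col:   # the delimiter closes the target cell
                return "".join(current).strip()
            if ch == row_sep:           # new row: advance row count, reset column count
                r, c = r + 1, 0
            else:                       # new column within the same row
                c += 1
            current = []
        else:
            current.append(ch)
    if r == row and c == col:           # target is the final cell, with no trailing delimiter
        return "".join(current).strip()
    raise IndexError(f"no cell at row {row}, column {col}")


# Toy table (invented) linearized row by row.
table = "name|age|city\nAna|31|Lima\nBo|27|Oslo"
city_of_ana = locate_cell(table, row=1, col=2)
```

The scan never needs to know the table's width in advance; the coordinates emerge purely from how many delimiters have been seen, which is the sense of "ordinal" used in the abstract.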

[76] Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation

Archchana Sindhujan,Girish A. Koushik,Shenbin Qian,Diptesh Kanojia,Constantin Orăsan

Main category: cs.CL

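TL;DR: This paper presents the first segment-level quality estimation (QE) dataset for English to Malayalam and designs ALOPE-RL, a policy-based reinforcement learning framework that uses Direct Assessment scores and Translation Quality Remarks (TQR) as error-aware rewards, enabling small LLMs to reach state-of-the-art performance on a low-resource QE task.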

Details Motivation: Most existing QE methods rely on scalar scores and do not model specific translation errors explicitly; for low-resource language pairs such as English to Malayalam, performance is further limited by the scarcity of annotated data. Method: Builds the first segment-level English to Malayalam QE dataset (with DA scores and TQR comments) and proposes ALOPE-RL, a policy-based reinforcement learning framework that defines its reward signal jointly from DA and TQR, training LoRA adapters with 4-bit quantization for efficient fine-tuning of compact LLMs (<=4B parameters). Result: ALOPE-RL outperforms larger LLM baselines and leading encoder-based QE models on English to Malayalam QE, showing that error-aware, policy-driven learning is effective under limited data and compute. Conclusion: An error-aware reinforcement learning paradigm can markedly improve QE for low-resource languages while remaining interpretable and practical; the authors release the dataset, code, and models to support follow-up research. Abstract: Quality Estimation (QE) aims to assess the quality of machine translation (MT) outputs without relying on reference translations, making it essential for real-world, large-scale MT evaluation. Large Language Models (LLMs) have shown significant promise in advancing the field of quality estimation of machine translation. However, most of the QE approaches solely rely on scalar quality scores, offering no explicit information about the translation errors that should drive these judgments. Moreover, for low-resource languages where annotated QE data is limited, existing approaches struggle to achieve reliable performance. To address these challenges, we introduce the first segment-level QE dataset for English to Malayalam, a severely resource-scarce language pair in the QE domain, comprising human-annotated Direct Assessment (DA) scores and Translation Quality Remarks (TQR), which are short, contextual, free-form annotator comments that describe translation errors. We further introduce ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters based on policy rewards derived from DA score and TQR. Integrating error-aware rewards with ALOPE-RL enables LLMs to reason about translation quality beyond numeric scores. Despite being trained on a small-scale QE dataset, ALOPE-RL achieves state-of-the-art performance on English to Malayalam QE using compact LLMs (<=4B parameters) fine-tuned with LoRA and 4-bit quantization, outperforming both larger LLM-based baselines and leading encoder-based QE models. Our results demonstrate that error-aware, policy-based learning can deliver strong QE performance under limited data and compute budgets. We release our dataset, code, and trained models to support future research.

[77] VocalNet-MDM: Accelerating Streaming Speech LLM via Self-Distilled Masked Diffusion Modeling

Ziyang Cheng,Yuhao Wang,Heyang Liu,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang

Main category: cs.CL

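TL;DR: This paper proposes VocalNet-MDM, a non-autoregressive speech large language model based on Masked Diffusion Modeling (MDM). It uses Hierarchical Block-wise Masking and Iterative Self-Distillation to address the training-inference mismatch and iterative overhead of streaming speech interaction, substantially increasing decoding speed and lowering first-chunk latency with limited training data while maintaining recognition accuracy and text and speech quality.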

Details Motivation: Existing autoregressive speech LLMs suffer from inefficient serial generation and exposure bias, so a more efficient non-autoregressive paradigm is needed. Method: Proposes the VocalNet-MDM framework based on masked diffusion modeling; designs Hierarchical Block-wise Masking to align training with the states encountered at inference; introduces Iterative Self-Distillation to compress multi-step refinement and reduce latency. Result: Trained on only 6K hours of speech data, it achieves a 3.7x to 10x decoding speedup and a 34% reduction in first-chunk latency; recognition accuracy is competitive, and text quality and speech naturalness reach the state of the art. Conclusion: Masked diffusion modeling is a promising and scalable alternative paradigm for building low-latency, efficient speech LLMs. Abstract: Recent Speech Large Language Models (LLMs) have achieved impressive capabilities in end-to-end speech interaction. However, the prevailing autoregressive paradigm imposes strict serial constraints, limiting generation efficiency and introducing exposure bias. In this paper, we investigate Masked Diffusion Modeling (MDM) as a non-autoregressive paradigm for speech LLMs and introduce VocalNet-MDM. To adapt MDM for streaming speech interaction, we address two critical challenges: training-inference mismatch and iterative overhead. We propose Hierarchical Block-wise Masking to align training objectives with the progressive masked states encountered during block diffusion decoding, and Iterative Self-Distillation to compress multi-step refinement into fewer steps for low-latency inference. Trained on a limited scale of only 6K hours of speech data, VocalNet-MDM achieves a 3.7$\times$--10$\times$ decoding speedup and reduces first-chunk latency by 34% compared to AR baselines. It maintains competitive recognition accuracy while achieving state-of-the-art text quality and speech naturalness, demonstrating that MDM is a promising and scalable alternative for low-latency, efficient speech LLMs.

[78] Do Multilingual LLMs have specialized language heads?

Muhammad Naufil

Main category: cs.CL

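TL;DR: This paper examines whether multilingual large language models (LLMs) contain attention heads specialized for particular languages, and whether heads specific to non-target languages can be removed without hurting target-language performance, in order to make deployment more efficient.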

Details Motivation: A multilingual LLM deployed for only a subset of its languages carries redundancy; prior work on language specificity has focused on machine translation models, and no comparable study exists for multilingual LLMs, which handle a much wider range of tasks. Method: Analyzes the language specificity of attention heads in multilingual LLMs, identifying and removing heads associated with non-target languages and assessing the effect on target-language task performance. Result: Reports that multilingual LLMs contain language-specific attention heads and that heads tied to non-target languages can be removed selectively without significantly harming target-language performance. Conclusion: The finding suggests a route to efficient deployment of multilingual LLMs for specific language subsets, reducing model complexity while maintaining high accuracy in the target languages. Abstract: Multilingual large language models (LLMs) have gained significant popularity for their ability to process and generate text across multiple languages. However, deploying these models in production can be inefficient when only a subset of the supported languages is of interest. There has been some research on identifying whether machine translation models have language-specific or language-agnostic heads; however, to the best of our knowledge, no such research has been conducted for multilingual LLMs, which are capable of performing diverse tasks beyond just translation. This paper explores whether multilingual LLMs have specialized language attention heads for each language, and investigates the possibility of removing language-specific heads for unwanted languages without degrading performance in the targeted languages. Our findings could inform more efficient deployment strategies for multilingual LLMs, enabling reduced model complexity while maintaining high accuracy for targeted languages.

[79] Fundamental Reasoning Paradigms Induce Out-of-Domain Generalization in Language Models

Mingzi Cao,Xingwei Tan,Mahmud Akhter,Marco Valentino,Maria Liakata,Xi Wang,Nikolaos Aletras

Main category: cs.CL

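TL;DR: This paper systematically studies how the three fundamental reasoning paradigms (deduction, induction, and abduction) affect the generalization of large language models (LLMs). It builds a dataset of reasoning trajectories from symbolic tasks and uses several model-enhancement methods (such as fine-tuning, increasing depth, and conversion to MoE) to improve out-of-domain reasoning on realistic natural-language tasks, with gains of up to 14.60.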

Details Motivation: Although LLM reasoning is an active research area, the effect of the three fundamental reasoning paradigms (deduction, induction, abduction) on model generalization has not been systematically explored. Method: Builds a dataset of reasoning trajectories from symbolic tasks, each targeting one of the three paradigms; injects these reasoning skills into LLMs through fine-tuning, increased model depth, and conversion of a dense model to a mixture-of-experts; evaluates comprehensively on out-of-domain tasks that are expressed entirely in natural language and contain real-world knowledge. Result: The approach generalizes strongly to realistic out-of-domain tasks, with performance gains of up to 14.60. Conclusion: The three fundamental reasoning paradigms can guide LLMs toward more robust and more general reasoning, and their interplay matters for generalization. Abstract: Deduction, induction, and abduction are fundamental reasoning paradigms, core for human logical thinking. Although improving Large Language Model (LLM) reasoning has attracted significant research efforts, the extent to which the fundamental paradigms induce generalization has yet to be systematically explored. In this study, we shed light on how the interplay between these core paradigms influences LLMs' reasoning behavior. To this end, we first collect a new dataset of reasoning trajectories from symbolic tasks, each targeting one of the three fundamental paradigms, to abstract from concrete world knowledge. Then, we investigate effective ways for inducing these skills into LLMs. We experiment with a battery of methods including simple fine-tuning, and more complex approaches to increase model depth, or transform a dense model to a mixture-of-experts. We comprehensively evaluate induced models on realistic out-of-domain tasks, that are entirely formulated in natural language and contain real-world knowledge. Our results reveal that our approach yields strong generalizability with substantial performance gains (up to $14.60$) across realistic tasks.

[80] Learning to Judge: LLMs Designing and Applying Evaluation Rubrics

Clemencia Siro,Pourya Aliannejadi,Mohammad Aliannejadi

Main category: cs.CL

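TL;DR: This paper proposes GER-Eval to investigate whether large language models (LLMs) can design and apply their own rubrics for evaluating natural language generation. LLMs reliably generate interpretable, task-aware evaluation dimensions, but their scoring reliability degrades on factual and knowledge-intensive tasks; closed-source models (such as GPT-4o) outperform open-weight models (such as Llama). The study frames evaluation as a learned linguistic capability of LLMs and calls for jointly modeling human and LLM evaluative language to improve reliability and interpretability.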

Details Motivation: Static, human-defined rubrics are often misaligned with how large language models internally represent language quality, which limits their effectiveness as evaluators. Method: Proposes the GER-Eval framework, in which LLMs generate their own evaluation rubrics and apply them to system outputs; systematically assesses the semantic coherence of the generated criteria, their scoring reliability, and their alignment with human criteria. Result: LLMs consistently generate interpretable, task-aware evaluation dimensions and apply them consistently within a model, but scoring reliability drops in factual and knowledge-intensive settings; closed-source models (such as GPT-4o) achieve higher agreement and better cross-model generalization than open-weight models (such as Llama). Conclusion: Evaluation is a learned linguistic capability of LLMs, consistent within a single model but fragmented across models; new methods are needed that jointly model human and LLM evaluative language to improve reliability and interpretability. Abstract: Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate whether LLMs can design and apply their own evaluation rubrics. We evaluate the semantic coherence and scoring reliability of LLM-defined criteria and their alignment with human criteria. LLMs reliably generate interpretable and task-aware evaluation dimensions and apply them consistently within models, but their scoring reliability degrades in factual and knowledge-intensive settings. Closed-source models such as GPT-4o achieve higher agreement and cross-model generalization than open-weight models such as Llama. Our findings position evaluation as a learned linguistic capability of LLMs, consistent within models but fragmented across them, and call for new methods that jointly model human and LLM evaluative language to improve reliability and interpretability.

[81] Old wine in old glasses: Comparing computational and qualitative methods in identifying incivility on Persian Twitter during the #MahsaAmini movement

Hossein Kermani,Fatemeh Oudlajani,Pardis Yarahmadi,Hamideh Mahdi Soltani,Mohammad Makki,Zahra HosseiniKhoo

Main category: cs.CL

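TL;DR: This paper compares three approaches to detecting incivility in Persian tweets: human qualitative coding, supervised learning with ParsBERT, and large language models (ChatGPT). ParsBERT substantially outperforms several ChatGPT models at identifying hate speech, ChatGPT performs poorly on both explicit and implicit incivility, and the prompt language (English or Persian) has little effect on its outputs.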

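Details Motivation: For low-resource languages such as Persian, the effectiveness and suitability of different hate-speech detection methods need to be assessed, especially for uncivil content in tweets tied to a social movement. Method: Compares human qualitative coding, a supervised ParsBERT model, and several ChatGPT models on 47,278 tweets from the #MahsaAmini movement, evaluating accuracy and efficiency and testing the effect of prompt language. Result: ParsBERT substantially outperforms all ChatGPT models tested; ChatGPT struggles not only with subtle incivility but also with explicitly uncivil content; prompt language (English vs. Persian) makes no meaningful difference. Conclusion: For hate-speech analysis in low-resource languages such as Persian, a domain-fine-tuned specialized model (such as ParsBERT) is more reliable than a general-purpose large model (such as ChatGPT); human coding remains valuable as a reference but is costly; the study clarifies where each method is applicable. Abstract: This paper compares three approaches to detecting incivility in Persian tweets: human qualitative coding, supervised learning with ParsBERT, and large language models (ChatGPT). Using 47,278 tweets from the #MahsaAmini movement in Iran, we evaluate the accuracy and efficiency of each method. ParsBERT substantially outperforms seven evaluated ChatGPT models in identifying hate speech. We also find that ChatGPT struggles not only with subtle cases but also with explicitly uncivil content, and that prompt language (English vs. Persian) does not meaningfully affect its outputs. The study provides a detailed comparison of these approaches and clarifies their strengths and limitations for analyzing hate speech in a low-resource language context.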

[82] Challenges in Translating Technical Lectures: Insights from the NPTEL

Basudha Raje,Sadanand Venkatraman,Nandana TP,Soumyadeepa Das,Polkam Poojitha,M. Vijaykumar,Tanima Bagchi,Hema A. Murthy

Main category: cs.CL

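TL;DR: This study examines the practical applications and methodological implications of machine translation for Indian languages (Bangla, Malayalam, and Telugu), using material from the NPTEL platform and focusing on the construction of a spontaneous-speech corpus and on the limitations of existing evaluation metrics for morphologically rich, semantically compact languages.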

Details Motivation: In line with the support for multilingual educational technology in India's National Education Policy (NEP 2020), and to reflect linguistic diversity, Bangla, Malayalam, and Telugu are chosen as the languages of study; NPTEL, the largest MOOC portal, provides the corpus underlying the analysis. Method: Builds a spontaneous-speech corpus of technical lecture material with attention to register and lexical choice; analyzes how well existing surface-overlap evaluation metrics suit morphologically rich, semantically compact languages. Result: Mainstream metrics are insufficiently sensitive to morphologically complex, semantically compact languages, and mismatches in surface form lead to biased evaluation. Conclusion: Evaluation frameworks adapted to the properties of Indian languages are needed, and corpus construction should take register, naturalness of expression, and suitability for teaching into account together. Abstract: This study examines the practical applications and methodological implications of Machine Translation in Indian Languages, specifically Bangla, Malayalam, and Telugu, within emerging translation workflows and in relation to existing evaluation frameworks. The choice of languages prioritized in this study is motivated by a triangulation of linguistic diversity, which illustrates the significance of multilingual accommodation of educational technology under NEP 2020. This is further supported by the largest MOOC portal, i.e., NPTEL, which has served as a corpus to facilitate the arguments presented in this paper. Curating a spontaneous speech corpus that accounts for lucid delivery of technical concepts while retaining suitable register and lexical choices is crucial in a diverse country like India. The findings of this study highlight metric-specific sensitivity and the challenges posed by morphologically rich and semantically compact features when tested against surface-overlap metrics.

Clemencia Siro,Zahra Abbasiantaeb,Yifei Yuan,Mohammad Aliannejadi,Maarten de Rijke

Main category: cs.CL

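TL;DR: Through a user study with 73 participants, this work examines how multimodal (image plus text) clarifying questions compare with text-only clarifying questions in conversational search on two tasks: answering clarifying questions and query reformulation. The effect of images varies with the task and with user expertise: when answering clarifying questions, users prefer the multimodal format but perform better with text only; in query reformulation, images lead to more precise queries and better retrieval performance.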

Details Motivation: Although combining images with text has proven effective in many retrieval settings, its effect on user performance when used for clarifying questions in conversational search has been little explored. Method: Runs a user study with 73 participants comparing multimodal and text-only clarifying questions on two tasks, answering clarifying questions and query reformulation, and analyzes the results from several angles, including task type and user expertise. Result: 1) When answering clarifying questions, users prefer the multimodal format, but the text-only format yields better performance; 2) in query reformulation, images improve query precision and retrieval performance; 3) the effect of images depends on task type and user expertise. Conclusion: The usefulness of multimodal clarifying questions in conversational search is task-dependent, and system design should choose between formats strategically according to the search context and user characteristics. Abstract: Conversational search systems increasingly employ clarifying questions to refine user queries and improve the search experience. Previous studies have demonstrated the usefulness of text-based clarifying questions in enhancing both retrieval performance and user experience. While images have been shown to improve retrieval performance in various contexts, their impact on user performance when incorporated into clarifying questions remains largely unexplored. We conduct a user study with 73 participants to investigate the role of images in conversational search, specifically examining their effects on two search-related tasks: (i) answering clarifying questions and (ii) query reformulation. We compare the effect of multimodal and text-only clarifying questions in both tasks within a conversational search context from various perspectives. Our findings reveal that while participants showed a strong preference for multimodal questions when answering clarifying questions, preferences were more balanced in the query reformulation task. The impact of images varied with both task type and user expertise. In answering clarifying questions, images helped maintain engagement across different expertise levels, while in query reformulation they led to more precise queries and improved retrieval performance. Interestingly, for clarifying question answering, text-only setups demonstrated better user performance as they provided more comprehensive textual information in the absence of images. These results provide valuable insights for designing effective multimodal conversational search systems, highlighting that the benefits of visual augmentation are task-dependent and should be strategically implemented based on the specific search context and user characteristics.

[84] FactSim: Fact-Checking for Opinion Summarization

Leandro Anghinoni,Jorge Sanchez

Main category: cs.CL

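TL;DR: This paper proposes a novel, fully automated method for assessing the factual consistency of generative AI in opinion summarization. It measures the similarity, coverage, and consistency between the claims in a summary and those in the original reviews, and the resulting score correlates highly with human judgment.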

Details Motivation: In the era of large language models (LLMs), traditional automated metrics are limited in assessing the factual consistency of opinion summaries, so more comprehensive and precise evaluation techniques are needed. Method: Proposes a fully automated method that extracts claims from the summary and from the original reviews, computes their similarity, and combines coverage and consistency into an evaluation score; claim extraction uses a simple approach intended to handle variants such as negation, paraphrase, and expansion. Result: The proposed metric assigns higher scores to claims that are semantically similar but differently worded, and its correlation with human judgment is high when compared with state-of-the-art metrics. Conclusion: The method addresses shortcomings of the existing evaluation paradigm and offers a more reliable, scalable automated way to evaluate factual consistency in GenAI opinion summarization. Abstract: We explore the need for more comprehensive and precise evaluation techniques for generative artificial intelligence (GenAI) in text summarization tasks, specifically in the area of opinion summarization. Traditional methods, which leverage automated metrics to compare machine-generated summaries from a collection of opinion pieces, e.g. product reviews, have shown limitations due to the paradigm shift introduced by large language models (LLM). This paper addresses these shortcomings by proposing a novel, fully automated methodology for assessing the factual consistency of such summaries. The method is based on measuring the similarity between the claims in a given summary with those from the original reviews, measuring the coverage and consistency of the generated summary. To do so, we rely on a simple approach to extract factual assessment from texts that we then compare and summarize in a suitable score. We demonstrate that the proposed metric attributes higher scores to similar claims, regardless of whether the claim is negated, paraphrased, or expanded, and that the score has a high correlation to human judgment when compared to state-of-the-art metrics.

[85] PERSPECTRA: A Scalable and Configurable Pluralist Benchmark of Perspectives from Arguments

Shangrui Nie,Kian Omoomi,Lucie Flek,Zhixue Zhao,Charles Welch

Main category: cs.CL

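TL;DR: This paper introduces PERSPECTRA, a benchmark for evaluating how well large language models understand and reason about pluralistic viewpoints. It combines the structured debate graphs of Kialo with the linguistic diversity of Reddit to build a dataset of 3,810 enriched arguments, and defines three new tasks that expose systematic weaknesses of current LLMs in pluralism understanding.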

Details Motivation: Existing LLMs do not faithfully reflect the diversity of human viewpoints, and mainstream alignment research overlooks pluralism as a key property; debate-oriented data sources are promising, but each has limitations (costly human validation, missing structure, or detachment from natural discourse), so a new benchmark that combines structural clarity with linguistic diversity is needed. Method: PERSPECTRA uses a controlled retrieval-and-expansion pipeline to merge Kialo's explicit pro/con argument graphs with the real linguistic diversity of Reddit, building a dataset that covers 100 controversial topics, 762 stances, and 3,810 naturalistic argument variants; it defines three new tasks: opinion counting, opinion matching, and polarity check. Result: Experiments on several open-source and proprietary state-of-the-art LLMs show systematic errors such as overestimating the number of viewpoints and misjudging concessive structures, confirming that pluralism-aware understanding and reasoning are highly challenging. Conclusion: PERSPECTRA is the first scalable, configurable benchmark for evaluating pluralism, providing a solid foundation and a new evaluation paradigm for helping models better represent, distinguish, and reason over multiple perspectives. Abstract: Pluralism, the capacity to engage with diverse perspectives without collapsing them into a single viewpoint, is critical for developing large language models that faithfully reflect human heterogeneity. Yet this characteristic has not been carefully examined in the LLM research community and remains absent from most alignment studies. Debate-oriented sources provide a natural entry point for pluralism research. Previous work builds on online debate sources but remains constrained by costly human validation. Other debate-rich platforms such as Reddit and Kialo also offer promising material: Reddit provides linguistic diversity and scale but lacks clear argumentative structure, while Kialo supplies explicit pro/con graphs but remains overly concise and detached from natural discourse. We introduce PERSPECTRA, a pluralist benchmark that integrates the structural clarity of Kialo debate graphs with the linguistic diversity of real Reddit discussions. Using a controlled retrieval-and-expansion pipeline, we construct 3,810 enriched arguments spanning 762 pro/con stances on 100 controversial topics. Each opinion is expanded to multiple naturalistic variants, enabling robust evaluation of pluralism. We initialise three tasks with PERSPECTRA: opinion counting (identifying distinct viewpoints), opinion matching (aligning supporting stances and discourse to source opinions), and polarity check (inferring aggregate stance in mixed discourse).
Experiments with state-of-the-art open-source and proprietary LLMs highlight systematic failures, such as overestimating the number of viewpoints and misclassifying concessive structures, underscoring the difficulty of pluralism-aware understanding and reasoning. By combining diversity with structure, PERSPECTRA establishes the first scalable, configurable benchmark for evaluating how well models represent, distinguish, and reason over multiple perspectives.

[86] Map of Encoders -- Mapping Sentence Encoders using Quantum Relative Entropy

Gaifan Zhang,Danushka Bollegala

Main category: cs.CL

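TL;DR: This paper proposes a method for mapping sentence encoders based on quantum relative entropy. By building a visual map of 1101 publicly available sentence encoders, it reveals the relationships among them and predicts downstream task performance.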

Details Motivation: There are many sentence encoders but no systematic way to compare them, which makes their relationships and performance differences hard to understand. Method: First, represent each encoder by the embedding matrix of a sentence set; next, compute its Pairwise Inner Product (PIP) matrix; finally, use Quantum Relative Entropy (QRE) with respect to a unit base encoder to produce a feature vector, from which the map of encoders is built. Result: The map covers 1101 publicly available sentence encoders; encoders with similar attributes lie close together on the map, and the feature vectors accurately predict downstream performance in tasks such as retrieval and clustering. Conclusion: The method offers an effective, interpretable new paradigm for comparing, visualizing, and predicting the performance of sentence encoders at scale. Abstract: We propose a method to compare and visualise sentence encoders at scale by creating a map of encoders where each sentence encoder is represented in relation to the other sentence encoders. Specifically, we first represent a sentence encoder using an embedding matrix of a sentence set, where each row corresponds to the embedding of a sentence. Next, we compute the Pairwise Inner Product (PIP) matrix for a sentence encoder using its embedding matrix. Finally, we create a feature vector for each sentence encoder reflecting its Quantum Relative Entropy (QRE) with respect to a unit base encoder. We construct a map of encoders covering 1101 publicly available sentence encoders, providing a new perspective of the landscape of the pre-trained sentence encoders. Our map accurately reflects various relationships between encoders, where encoders with similar attributes are proximally located on the map. Moreover, our encoder feature vectors can be used to accurately infer downstream task performance of the encoders, such as in retrieval and clustering tasks, demonstrating the faithfulness of our map.
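
The PIP-then-QRE steps above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not say how the PIP matrix is normalised or what the "unit base encoder" is, so the trace normalisation in `pip_density`, the eigenvalue clipping in `matrix_log`, and the use of the maximally mixed matrix `I/n` as the base are all hypothetical choices. The formula itself, S(ρ‖σ) = Tr[ρ(log ρ − log σ)], is the standard definition of quantum relative entropy.

```python
import numpy as np


def pip_density(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise Inner Product matrix E E^T, scaled to unit trace (assumed normalisation)."""
    pip = embeddings @ embeddings.T
    return pip / np.trace(pip)


def matrix_log(rho: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Matrix logarithm of a symmetric PSD matrix via eigendecomposition.

    Eigenvalues are clipped to `eps` to avoid log(0); this is a numerical assumption.
    """
    vals, vecs = np.linalg.eigh(rho)
    vals = np.clip(vals, eps, None)
    return (vecs * np.log(vals)) @ vecs.T


def quantum_relative_entropy(rho: np.ndarray, sigma: np.ndarray) -> float:
    """S(rho || sigma) = Tr[rho (log rho - log sigma)]."""
    return float(np.trace(rho @ (matrix_log(rho) - matrix_log(sigma))))


# Hypothetical encoder output: 4 sentences embedded in 8 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))
rho = pip_density(embeddings)
base = np.eye(4) / 4  # assumed stand-in for the "unit base encoder"
print(quantum_relative_entropy(rho, base))
```

Against the maximally mixed base, the QRE reduces to log n minus the von Neumann entropy of ρ, so it is zero when all sentence embeddings are orthonormal and grows as the PIP spectrum becomes more concentrated.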

[87] LakeHopper: Cross Data Lakes Column Type Annotation through Model Adaptation

Yushi Sun,Xujia Li,Nan Tang,Quanqing Xu,Chuanhui Yang,Lei Chen

Main category: cs.CL

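TL;DR: This paper proposes LakeHopper, a framework that efficiently adapts a pre-trained language model to a new data lake by identifying knowledge gaps through LM interactions, selecting target data via clustering, and fine-tuning incrementally, which substantially reduces the annotations required on the target data lake.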

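Details Motivation: Existing LLM-based column type annotation methods depend on large amounts of annotated data from a specific source data lake and generalize poorly to new data lakes; migrating an existing model to a new data lake at minimal annotation cost is a key challenge. Method: The LakeHopper framework (1) uses language model interactions to identify and bridge the knowledge gap between the source and target data lakes; (2) applies a cluster-based selection strategy over unannotated columns; (3) uses an incremental fine-tuning mechanism that adapts to the target data lake while retaining shared knowledge. Result: On two different data lake transfer tasks, LakeHopper significantly outperforms baselines in both low-resource and high-resource settings, validating its effectiveness and robustness. Conclusion: LakeHopper provides an efficient, low-annotation-cost transfer learning solution for cross-data-lake column type annotation, addressing the core problems of knowledge gap, data selection, and knowledge forgetting. Abstract: Column type annotation is vital for tasks like data cleaning, integration, and visualization. Recent solutions rely on resource-intensive language models fine-tuned on well-annotated columns from a particular set of tables, i.e., a source data lake. In this paper, we study whether we can adapt an existing pre-trained LM-based model to a new (i.e., target) data lake to minimize the annotations required on the new data lake. However, challenges remain, including the source-target knowledge gap, selecting informative target data, and fine-tuning without losing shared knowledge. We propose LakeHopper, a framework that identifies and resolves the knowledge gap through LM interactions, employs a cluster-based data selection scheme for unannotated columns, and uses an incremental fine-tuning mechanism that gradually adapts the source model to the target data lake. Our experimental results validate the effectiveness of LakeHopper on two different data lake transfers under both low-resource and high-resource settings.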

[88] Affective Flow Language Model for Emotional Support Conversation

Chenghui Zou,Ning Wang,Tiesunlong Shen,Luwei Xiao,Chuan Ma,Xiangpeng Li,Rui Mao,Erik Cambria

Main category: cs.CL

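TL;DR: This paper proposes AFlow, a framework that models the continuous affective flow across multi-turn dialogue to provide fine-grained, prefix-level supervision for emotional support conversation. It markedly improves strategy coherence and empathetic response quality, and outperforms proprietary large models such as GPT-4o and Claude-3.5 on several metrics.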

Details Motivation: Existing emotional support conversation methods rely on sparse outcome-level signals, which give little guidance for intermediate strategy decisions in multi-turn dialogue and lead to weak performance on complex multi-turn support. Method: AFlow models a continuous affective flow along multi-turn trajectories, estimates intermediate utility over searched trajectories, and learns preference-consistent strategy transitions; a subpath-level flow-balance objective propagates preference signals to intermediate states to improve strategy coherence and empathetic quality. Result: AFlow significantly outperforms strong baselines across diverse emotional contexts; with a compact open-source backbone it surpasses proprietary large models such as GPT-4o and Claude-3.5 on major ESC metrics. Conclusion: Fine-grained, prefix-level affective flow modeling effectively improves strategy consistency and response quality in multi-turn emotional support conversation, offering a new paradigm for ESC alignment. Abstract: Large language models (LLMs) have been widely applied to emotional support conversation (ESC). However, complex multi-turn support remains challenging. This is because existing alignment schemes rely on sparse outcome-level signals, thus offering limited supervision for intermediate strategy decisions. To fill this gap, this paper proposes the affective flow language model for emotional support conversation (AFlow), a framework that introduces fine-grained supervision on dialogue prefixes by modeling a continuous affective flow along multi-turn trajectories. AFlow can estimate intermediate utility over searched trajectories and learn preference-consistent strategy transitions. To improve strategy coherence and empathetic response quality, a subpath-level flow-balance objective is presented to propagate preference signals to intermediate states. Experimental results show consistent and significant improvements over competitive baselines in diverse emotional contexts. Remarkably, AFlow with a compact open-source backbone outperforms proprietary LLMs such as GPT-4o and Claude-3.5 on major ESC metrics. Our code is available at https://github.com/chzou25-lgtm/AffectiveFlow.

[89] WildReward: Learning Reward Models from In-the-Wild Human Interactions

Hao Peng,Yunjia Qi,Xiaozhi Wang,Zijun Yao,Lei Hou,Juanzi Li

Main category: cs.CL

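TL;DR: This paper proposes WildReward, a reward model learned directly from real-world user interactions without human-annotated preference pairs. Trained with ordinal regression on 186k high-quality user feedback instances, it matches or exceeds conventional reward models while improving calibration and cross-sample consistency.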

Details Motivation: Conventional reward models depend on large numbers of human-annotated preference pairs, while the wide deployment of LLMs generates rich implicit user feedback signals; how to use these in-the-wild interactions to build reward models is a key question. Method: Using WildChat as the interaction source, the authors design a data cleaning and feedback extraction pipeline that yields 186k high-quality user feedback instances; WildReward is trained with ordinal regression directly on raw user feedback rather than on preference pairs. Result: WildReward matches or exceeds conventional reward models on multiple metrics, with better calibration and cross-sample consistency; user diversity improves model performance; applied to online DPO training, it yields significant improvements across multiple tasks. Conclusion: Training reward models directly from real user interaction data, without human preference annotation, is feasible and effective; WildReward offers a new paradigm for low-cost, highly scalable reward model construction. Abstract: Reward models (RMs) are crucial for the training of large language models (LLMs), yet they typically rely on large-scale human-annotated preference pairs. With the widespread deployment of LLMs, in-the-wild interactions have emerged as a rich source of implicit reward signals. This raises the question: Can we develop reward models directly from in-the-wild interactions? In this work, we explore this possibility by adopting WildChat as an interaction source and proposing a pipeline to extract reliable human feedback, yielding 186k high-quality instances for training WildReward via ordinal regression directly on user feedback without preference pairs. Extensive experiments demonstrate that WildReward achieves comparable or even superior performance compared to conventional reward models, with improved calibration and cross-sample consistency. We also observe that WildReward benefits directly from user diversity, where more users yield stronger reward models. Finally, we apply WildReward to online DPO training and observe significant improvements across various tasks. Code and data are released at https://github.com/THU-KEG/WildReward.
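
To make "ordinal regression directly on user feedback" concrete, here is a minimal sketch of one common formulation, the cumulative-link (proportional odds) model. Whether WildReward uses this exact loss is not stated in the abstract; the three-level feedback scale, the threshold values, and the function names are assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_probs(score: float, thresholds: list) -> list:
    """Class probabilities under a cumulative-link model.

    P(y <= k) = sigmoid(theta_k - score), with thresholds sorted ascending.
    """
    cumulative = [sigmoid(t - score) for t in thresholds] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


def ordinal_nll(score: float, label: int, thresholds: list) -> float:
    """Negative log-likelihood of an ordinal feedback label given a scalar reward."""
    return -math.log(ordinal_probs(score, thresholds)[label])


# Hypothetical 3-level feedback scale: 0 = negative, 1 = neutral, 2 = positive.
thresholds = [-1.0, 1.0]
print(ordinal_probs(3.0, thresholds))
print(ordinal_nll(3.0, 2, thresholds), ordinal_nll(3.0, 0, thresholds))
```

The appeal of this setup is that a single scalar reward per response is trained against graded feedback, so no pairing of responses is needed; a high score makes the positive label cheap and the negative label expensive.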

[90] Understanding Dynamic Compute Allocation in Recurrent Transformers

Ibraheem Muhammad Moosa,Suhas Lohit,Ye Wang,Moitreya Chatterjee,Wenpeng Yin

Main category: cs.CL

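TL;DR: This paper proposes the ANIRA framework and evaluates token-level adaptive computation on algorithmic and synthetic language tasks. It finds that compute allocation can align with task complexity spontaneously but does not guarantee algorithmic generalization, and that early decisions rely on static structural cues whereas online halting tracks algorithmic execution state more closely.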

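Details Motivation: Prior work on token-level adaptive computation lacks observable token-level difficulty labels and is confounded by architectural factors, so it is hard to verify whether compute allocation truly matches underlying complexity. Method: (1) Build algorithmic and synthetic language tasks with controllable complexity as an evaluation paradigm; (2) propose ANIRA, a unified recurrent Transformer framework that supports per-token variable-depth computation and isolates compute allocation decisions from other model factors; (3) systematically analyze alignment between compute allocation and complexity, generalization, and decision timing. Result: (1) Compute allocation aligns with task complexity without explicit difficulty supervision; (2) this alignment does not guarantee algorithmic generalization (for example, models fail to extrapolate to unseen input sizes); (3) early compute decisions rely on static structural cues, while online halting tracks algorithmic execution state more closely. Conclusion: The effectiveness of token-level adaptive computation must be evaluated rigorously with architectural effects isolated and complexity controlled; alignment with complexity is necessary but not sufficient, and algorithmic generalization requires additional mechanisms; decision timing significantly affects alignment quality. Abstract: Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. However, prior work is primarily evaluated on natural-language benchmarks using task-level metrics, where token-level difficulty is unobservable and confounded with architectural factors, making it unclear whether compute allocation truly aligns with underlying complexity. We address this gap through three contributions. First, we introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty, enabling direct testing of token-level compute allocation. Second, we propose ANIRA, a unified recurrent Transformer framework that supports per-token variable-depth computation while isolating compute allocation decisions from other model factors. Third, we use this framework to conduct a systematic analysis of token-level adaptive computation across alignment with complexity, generalization, and decision timing. Our results show that compute allocation aligned with task complexity can emerge without explicit difficulty supervision, but such alignment does not imply algorithmic generalization: models fail to extrapolate to unseen input sizes despite allocating additional computation. We further find that early compute decisions rely on static structural cues, whereas online halting more closely tracks algorithmic execution state.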

[91] Large Language Models for Geolocation Extraction in Humanitarian Crisis Response

G. Cafferata,T. Demarco,K. Kalimeri,Y. Mejova,M. G. Beiró

Main category: cs.CL

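TL;DR: This paper proposes a two-step framework that combines few-shot LLM-based named entity recognition with context-aware, agent-based geocoding to improve the accuracy and fairness of location extraction from humanitarian texts, particularly the coverage of underrepresented regions.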

Details Motivation: Existing automated geographic information extraction systems exhibit geographic and socioeconomic biases, leading to uneven visibility of crisis-affected regions; fairer and more accurate methods are needed. Method: A two-step framework: (1) few-shot LLM-based toponym recognition; (2) a context-based, agentic geocoding module that resolves ambiguous toponyms; evaluation uses an extended version of the HumSet dataset. Result: LLM-based methods substantially improve the precision and fairness of geolocation extraction, especially for underrepresented regions, and outperform existing state-of-the-art models on fairness metrics across geographic and socioeconomic dimensions. Conclusion: Combining LLM reasoning with the principles of responsible and inclusive AI can yield more equitable humanitarian geospatial data systems, advancing the goal of leaving no place behind in crisis analytics. Abstract: Humanitarian crises demand timely and accurate geographic information to inform effective response efforts. Yet, automated systems that extract locations from text often reproduce existing geographic and socioeconomic biases, leading to uneven visibility of crisis-affected regions. This paper investigates whether Large Language Models (LLMs) can address these geographic disparities in extracting location information from humanitarian documents. We introduce a two-step framework that combines few-shot LLM-based named entity recognition with an agent-based geocoding module that leverages context to resolve ambiguous toponyms. We benchmark our approach against state-of-the-art pretrained and rule-based systems using both accuracy and fairness metrics across geographic and socioeconomic dimensions. Our evaluation uses an extended version of the HumSet dataset with refined literal toponym annotations. Results show that LLM-based methods substantially improve both the precision and fairness of geolocation extraction from humanitarian texts, particularly for underrepresented regions. By bridging advances in LLM reasoning with principles of responsible and inclusive AI, this work contributes to more equitable geospatial data systems for humanitarian response, advancing the goal of leaving no place behind in crisis analytics.

[92] Is Reasoning Capability Enough for Safety in Long-Context Language Models?

Yu Fu,Haz Sameen Shahgir,Huanli Gong,Zhipeng Wei,N. Benjamin Erichson,Yue Dong

Main category: cs.CL

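TL;DR: This paper introduces a new threat model, compositional reasoning attacks, in which harmful intent is split into fragments scattered across a long context and a neutral reasoning prompt leads the model to synthesize them, implicitly triggering harmful behavior. Experiments show that stronger reasoning capability does not automatically improve safety, that safety alignment degrades as context grows, and that more inference-time compute substantially mitigates the attack.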

Details Motivation: It is hypothesized that stronger reasoning capability helps models recognize implicit harmful intent and thus improves safety, but this hypothesis has not been systematically tested in long-context settings with implicit harmful intent. Method: The paper proposes compositional reasoning attacks as a new threat model: a harmful query is decomposed into incomplete fragments scattered across a long context (up to 64k tokens), and a neutral reasoning prompt induces the model to retrieve and synthesize them; 14 frontier LLMs are evaluated empirically, and the separate effects of reasoning capability, context length, and inference-time compute on safety are analyzed. Result: (1) Models with stronger reasoning are not more robust and often assemble the harmful intent without refusing; (2) safety alignment degrades consistently as context length increases; (3) increasing inference-time compute (for example, more thinking steps) reduces the attack success rate on GPT-oss-120b by over 50 percentage points. Conclusion: Safety does not scale automatically with reasoning capability, especially in long-context inference; dedicated safety mechanisms are needed rather than reliance on improved reasoning. Abstract: Large language models (LLMs) increasingly combine long-context processing with advanced reasoning, enabling them to retrieve and synthesize information distributed across tens of thousands of tokens. A hypothesis is that stronger reasoning capability should improve safety by helping models recognize harmful intent even when it is not stated explicitly. We test this hypothesis in long-context settings where harmful intent is implicit and must be inferred through reasoning, and find that it does not hold. We introduce compositional reasoning attacks, a new threat model in which a harmful query is decomposed into incomplete fragments that are scattered throughout a long context. The model is then prompted with a neutral reasoning query that induces retrieval and synthesis, causing the harmful intent to emerge only after composition. Evaluating 14 frontier LLMs on contexts up to 64k tokens, we uncover three findings: (1) models with stronger general reasoning capability are not more robust to compositional reasoning attacks, often assembling the intent yet failing to refuse; (2) safety alignment consistently degrades as context length increases; and (3) inference-time reasoning effort is a key mitigating factor: increasing inference-time compute reduces attack success by over 50 percentage points on the GPT-oss-120b model. Together, these results suggest that safety does not automatically scale with reasoning capability, especially under long-context inference.

Sahajpreet Singh,Kokil Jaidka,Min-Yen Kan

Main category: cs.CL

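TL;DR: GitSearch is a gap-informed, real-time web retrieval framework for improving the quality and coverage of notes in community-based content moderation. On the political tweet dataset PolBench it substantially outperforms both existing methods and human-written notes.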

Details Motivation: Community moderation faces structural challenges and AI methods perform poorly in cold-start scenarios; human-perceived quality gaps (such as missing context) should be used as a key signal. Method: The GitSearch framework has a three-stage pipeline: identify information gaps, fill them through real-time targeted web retrieval, and generate platform-compliant notes; the PolBench benchmark (78,698 U.S. political tweets with their associated Community Notes) is built for evaluation. Result: GitSearch reaches 99% coverage, nearly double that of the best existing method; on helpfulness it beats human-written helpful notes with a 69% win rate and a mean helpfulness score of 3.87 vs. 3.36. Conclusion: By modeling human-perceived information gaps explicitly and combining them with real-time retrieval, GitSearch balances the scale and quality demands of community moderation and offers a new paradigm for cold-start scenarios. Abstract: Community-based moderation offers a scalable alternative to centralized fact-checking, yet it faces significant structural challenges, and existing AI-based methods fail in "cold start" scenarios. To tackle these challenges, we introduce GitSearch (Gap-Informed Targeted Search), a framework that treats human-perceived quality gaps, such as missing context, as first-class signals. GitSearch has a three-stage pipeline: identifying information deficits, executing real-time targeted web-retrieval to resolve them, and synthesizing platform-compliant notes. To facilitate evaluation, we present PolBench, a benchmark of 78,698 U.S. political tweets with their associated Community Notes. We find GitSearch achieves 99% coverage, almost doubling coverage over the state-of-the-art. GitSearch surpasses human-authored helpful notes with a 69% win rate and superior helpfulness scores (3.87 vs. 3.36), demonstrating retrieval effectiveness that balances the trade-off between scale and quality.

[94] How Should We Model the Probability of a Language?

Rasul Dent,Pedro Ortiz Suarez,Thibault Clérice,Benoît Sagot

Main category: cs.CL

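TL;DR: This position paper argues that the poor coverage of low-resource languages by current language identification (LID) systems stems from wrongly treating LID as decontextualized text classification. It proposes reframing LID as a routing problem and using environmental cues to improve the identification of tail languages.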

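Details Motivation: Commercial and research-grade LID systems cover the vast majority of the world's 7,000+ languages poorly, especially tail languages; the authors argue this limitation is self-imposed by the field, rooted in a mistaken framing of the LID task and in skewed institutional incentives. Method: A conceptual reframing that moves LID from decontextualized text classification to a routing problem based on environmental cues and local prior probability estimation; it emphasizes using context (such as region, platform, and user behavior) for dynamic, localized language identification. Result: No experimental results are reported; the paper offers a new theoretical framework and research direction, and calls on the community to attend to prior probability modeling and the integration of environmental information. Conclusion: The key to better tail-language LID coverage is not more training data or larger models but a paradigm shift: abandoning a single global prior in favor of routing-centered LID that adapts to local context. Abstract: Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.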

[95] Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models

Yuliang Liu,Yunchong Song,Yixuan Wang,Kewen Ge,Alex Lamb,Qipeng Guo,Kai Chen,Bowen Zhou,Zhouhan Lin

Main category: cs.CL

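TL;DR: This paper proposes Next Concept Prediction (NCP), a generative pretraining paradigm that builds on Next Token Prediction (NTP) by predicting discrete concepts spanning multiple tokens, which gives a harder pretraining objective. The model, ConceptLM, builds a concept vocabulary by vector-quantizing hidden states and trains jointly on NCP and NTP; its effectiveness and transferability are validated across several scales and benchmarks.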

Details Motivation: Token-based pretraining objectives such as NTP may not fully exploit the high-level semantic structure of language, so more challenging pretraining tasks at a coarser semantic granularity are needed to improve model capability. Method: Propose the Next Concept Prediction (NCP) paradigm; design ConceptLM, which discretizes hidden states with Vector Quantization (VQ) to build a concept vocabulary; jointly optimize the NCP and NTP objectives; train models from 70M to 1.5B parameters from scratch and run continual pretraining experiments on an 8B Llama model. Result: On 13 benchmarks, NCP consistently outperforms token-level baselines; continual pretraining of an 8B Llama model also improves performance; analysis suggests NCP strengthens language modeling by introducing a harder pretraining task. Conclusion: NCP is an effective new pretraining paradigm for strengthening language models and offers a path toward more powerful, more semantically aware models. Abstract: We propose Next Concept Prediction (NCP), a generative pretraining paradigm built on top of Next Token Prediction (NTP). NCP predicts discrete concepts that span multiple tokens, thereby forming a more challenging pretraining objective. Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary. It leverages both NCP and NTP to drive parameter updates and generates a concept to guide the generation of the following tokens. We train ConceptLM from scratch at scales ranging from 70M to 1.5B parameters with up to 300B training data, including Pythia and GPT-2 backbones. Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models. Furthermore, continual pretraining experiments on an 8B-parameter Llama model indicate that NCP can further improve an NTP-trained model. Our analysis suggests that NCP leads to more powerful language models by introducing a harder pretraining task, providing a promising path toward better language modeling.
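
The vector-quantization step above (mapping hidden states to entries of a concept vocabulary) can be sketched as a nearest-codebook lookup. This is an illustrative toy, not ConceptLM's implementation: the abstract does not say how hidden states of several tokens are combined into one concept, so the mean-pooling over a fixed `span` is an assumption, and `quantize`, `concept_ids`, and the codebook values are hypothetical.

```python
def quantize(hidden, codebook):
    """Return the index of the codebook vector nearest to `hidden` (squared Euclidean)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sqdist(hidden, codebook[i]))


def concept_ids(hidden_states, codebook, span=2):
    """Assign one discrete concept id per `span` consecutive token hidden states.

    Mean-pooling over the span is an assumption made for this sketch.
    """
    ids = []
    for start in range(0, len(hidden_states), span):
        chunk = hidden_states[start:start + span]
        dim = len(chunk[0])
        pooled = [sum(h[d] for h in chunk) / len(chunk) for d in range(dim)]
        ids.append(quantize(pooled, codebook))
    return ids


# Hypothetical 2-entry concept vocabulary and four token hidden states.
codebook = [[0.0, 0.0], [1.0, 1.0]]
hidden_states = [[0.1, 0.0], [0.0, 0.2], [0.9, 1.0], [1.1, 0.8]]
print(concept_ids(hidden_states, codebook))
```

The resulting id sequence is shorter than the token sequence, which is what makes "predict the next concept" a coarser and harder target than "predict the next token".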

[96] When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

Yuting Ning,Jaylen Jones,Zhehao Zhang,Chentao Ye,Weitong Ruan,Junyi Li,Rahul Gupta,Huan Sun

Main category: cs.CL

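TL;DR: This paper is the first to define and study misaligned action detection in computer-use agents (CUAs), with a framework covering both externally induced and internally arising misaligned actions. It builds MisActBench, a benchmark of realistic trajectories with action-level alignment labels, and designs DeAction, a universal guardrail that outperforms existing methods in offline and online evaluation, sharply reducing attack success rate without harming normal task performance.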

Details Motivation: CUAs are progressing rapidly but often produce misaligned actions that deviate from user intent, caused by external attacks (such as indirect prompt injection) or internal limitations (such as erroneous reasoning); these create safety risks and reduce efficiency and reliability, so systematic detection and protection are needed. Method: Define the new task of misaligned action detection; build MisActBench, a benchmark of realistic trajectories covering three common categories with human-annotated, action-level alignment labels; design DeAction, a lightweight, universal guardrail that detects misaligned actions before execution and corrects them iteratively through structured feedback. Result: On MisActBench, DeAction improves F1 over baselines by more than 15 points absolute; in online evaluation it reduces attack success rate by over 90% in adversarial settings while preserving or improving task success rate in benign environments, with only moderate latency overhead. Conclusion: The paper establishes misaligned action detection in CUAs as a research direction and validates DeAction as a practical, universal guardrail that is effective and robust, offering a scalable way to improve CUA safety and reliability. Abstract: Computer-use agents (CUAs) have made tremendous progress in the past year, yet they still frequently produce misaligned actions that deviate from the user's original intent. Such misaligned actions may arise from external attacks (e.g., indirect prompt injection) or from internal limitations (e.g., erroneous reasoning). They not only expose CUAs to safety risks, but also degrade task efficiency and reliability. This work makes the first effort to define and study misaligned action detection in CUAs, with comprehensive coverage of both externally induced and internally arising misaligned actions. We further identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. Moreover, we propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback. DeAction outperforms all existing baselines across offline and online evaluations with moderate latency overhead: (1) On MisActBench, it outperforms baselines by over 15% absolute in F1 score; (2) In online evaluation, it reduces attack success rate by over 90% under adversarial settings while preserving or even improving task success rate in benign environments.
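
The "detect before execution, then correct through feedback" loop described above can be sketched as a small control-flow skeleton. This is a hypothetical illustration only: `guarded_step`, `propose`, `is_aligned`, and `max_revisions` are invented names, and the abstract does not describe how DeAction's detector or its structured feedback are actually implemented.

```python
def guarded_step(propose, is_aligned, max_revisions=2):
    """Return an action that passes the alignment check, or None if none does.

    propose(feedback)  -> candidate action (feedback is None on the first try)
    is_aligned(action) -> (ok: bool, reason: str)
    """
    feedback = None
    for _ in range(max_revisions + 1):
        action = propose(feedback)
        ok, reason = is_aligned(action)
        if ok:
            return action      # only checked actions are handed on for execution
        feedback = reason      # feed the detector's reason back to the agent
    return None                # give up and block instead of executing


# Toy agent: first proposes an off-task action, then corrects itself.
def toy_propose(feedback):
    return "delete all files" if feedback is None else "open settings page"


# Toy detector: flags anything that mentions deletion.
def toy_check(action):
    if "delete" in action:
        return False, "action is unrelated to the user's request"
    return True, ""


print(guarded_step(toy_propose, toy_check))
```

The design point is that the check sits between proposal and execution, so a rejected action never reaches the environment, and the revision budget bounds the added latency.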

cs.CV [Back]

[97] Scalable spatial point process models for forensic footwear analysis

Alokesh Manna,Neil Spencer,Dipak K. Dey

Main category: cs.CV

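TL;DR: This paper proposes a hierarchical Bayesian model for quantifying the rarity of accidentals (wear-induced marks) on shoe prints, to improve the assessment of evidential strength in forensic footwear comparison.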

Details Motivation: Matching only make, model, and size is not enough to uniquely identify a suspect's shoe; the distinctive accidentals produced by wear must be used, and their rarity quantified, to strengthen the evidence. Method: Build a hierarchical Bayesian model formulated as a latent Gaussian model, use integrated nested Laplace approximations (INLA) for efficient inference, and introduce spatially varying coefficients to model the relationship between tread patterns and accidental locations. Result: The model outperforms existing methods on held-out data, improving the accuracy and reliability of shoe print analysis. Conclusion: The model provides a scalable, spatially aware, and more reliable quantitative framework for the statistical evaluation of accidentals in forensic footwear evidence. Abstract: Shoe print evidence recovered from crime scenes plays a key role in forensic investigations. By examining shoe prints, investigators can determine details of the footwear worn by suspects. However, establishing that a suspect's shoes match the make and model of a crime scene print may not be sufficient. Typically, thousands of shoes of the same size, make, and model are manufactured, any of which could be responsible for the print. Accordingly, a popular approach used by investigators is to examine the print for signs of "accidentals," i.e., cuts, scrapes, and other features that accumulate on shoe soles after purchase due to wear. While some patterns of accidentals are common on certain types of shoes, others are highly distinctive, potentially distinguishing the suspect's shoe from all others. Quantifying the rarity of a pattern is thus essential to accurately measuring the strength of forensic evidence. In this study, we address this task by developing a hierarchical Bayesian model. Our improvement over existing methods primarily stems from two advancements. First, we frame our approach in terms of a latent Gaussian model, thus enabling inference to be efficiently scaled to large collections of annotated shoe prints via integrated nested Laplace approximations. Second, we incorporate spatially varying coefficients to model the relationship between shoes' tread patterns and accidental locations. We demonstrate these improvements through superior performance on held-out data, which enhances accuracy and reliability in forensic shoe print analysis.

[98] Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making

Ruoyu Chen,Shangquan Sun,Xiaoqing Guo,Sanyi Zhang,Kangwei Liu,Shiming Liu,Zhangcheng Wang,Qunli Zhang,Hua Zhang,Xiaochun Cao

Main category: cs.CV

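TL;DR: This paper proposes an attribution-based method for aligning models with human priors. Human priors are encoded as input regions the model should rely on (such as bounding boxes), a highly faithful subset-selection attribution method exposes the model's decision evidence, and attribution that falls outside the prior regions is penalized during training, which improves both accuracy and the interpretability of decisions.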

Details Motivation: Conventional supervised learning provides only class labels, which lets models rely on shortcut correlations rather than the intended evidence; human priors can constrain model behavior, but learned representations often diverge from human perception, which makes alignment difficult. Method: Encode human priors as input regions the model is expected to rely on (such as bounding boxes); use a highly faithful subset-selection-based attribution method to expose the model's decision evidence during training; when the attribution region deviates substantially from the prior regions, apply a penalty, guiding alignment through a training objective with prior-induced attribution constraints. Result: The method is validated on image classification and on click decision tasks in MLLM-based GUI agents; in both conventional classification and autoregressive generation settings it consistently improves task accuracy and decision reasonability. Conclusion: Attribution-based human prior alignment bridges model representations and human cognition, improving performance and decision trustworthiness together, and offers a new path toward reliable, interpretable AI. Abstract: Reliable models should not only predict correctly, but also justify decisions with acceptable evidence. Yet conventional supervised learning typically provides only class-level labels, allowing models to achieve high accuracy through shortcut correlations rather than the intended evidence. Human priors can help constrain such behavior, but aligning models to these priors remains challenging because learned representations often diverge from human perception. To address this challenge, we propose an attribution-based human prior alignment method. We encode human priors as input regions that the model is expected to rely on (e.g., bounding boxes), and leverage a highly faithful subset-selection-based attribution approach to expose the model's decision evidence during training. When the attribution region deviates substantially from the prior regions, we penalize reliance on off-prior evidence, encouraging the model to shift its attribution toward the intended regions. This is achieved through a training objective that imposes attribution constraints induced by the human prior. We validate our method on both image classification and click decision tasks in MLLM-based GUI agent models. Across conventional classification and autoregressive generation settings, human prior alignment consistently improves task accuracy while also enhancing the model's decision reasonability.
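
One simple way to picture "penalize reliance on off-prior evidence" is to measure how much attribution mass falls outside the prior bounding box. The sketch below is an assumption-laden toy, not the paper's objective: it takes an attribution map as given (the subset-selection attribution method is not reproduced), and `off_prior_penalty`, the half-open box convention, and the `margin` tolerance are hypothetical.

```python
def off_prior_penalty(attribution, box, margin=0.0):
    """Fraction of attribution mass outside `box`, minus a tolerated `margin`.

    attribution: 2D list of non-negative scores (rows x cols)
    box: (row0, col0, row1, col1), half-open: rows row0..row1-1, cols col0..col1-1
    """
    total = sum(sum(row) for row in attribution)
    if total == 0:
        return 0.0  # no evidence attributed anywhere, nothing to penalize
    r0, c0, r1, c1 = box
    inside = sum(attribution[r][c] for r in range(r0, r1) for c in range(c0, c1))
    outside_fraction = (total - inside) / total
    # Penalize only deviations larger than the tolerated margin.
    return max(0.0, outside_fraction - margin)


# Toy 2x2 attribution map; the prior says the bottom-right cell matters.
attribution = [[1.0, 0.0],
               [0.0, 3.0]]
prior_box = (1, 1, 2, 2)
print(off_prior_penalty(attribution, prior_box))
```

In training, a term like this would be added to the task loss with some weight, so the model is rewarded for correct predictions that also rest on the intended region; making the attribution differentiable is a separate problem that this sketch does not address.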

[99] MAU-GPT: Enhancing Multi-type Industrial Anomaly Understanding via Anomaly-aware and Generalist Experts Adaptation

Zhuonan Wang,Zhenxuan Fan,Siwen Tan,Yu Zhong,Yuqian Yuan,Haoyuan Li,Hao Jiang,Wenqiao Zhang,Feifei Shao,Hongwei Wang,Jun Xiao

Main category: cs.CV

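TL;DR: This paper introduces MAU-Set, a dataset for industrial anomaly understanding, and MAU-GPT, a multimodal large model that uses an AMoE-LoRA mechanism to improve understanding and reasoning across defect classes, substantially outperforming existing methods in multiple industrial domains.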

Details Motivation: Existing industrial image analysis methods are limited by insufficient dataset coverage and poor generalization to complex anomaly patterns. Method: Build MAU-Set, a dataset with a hierarchical task structure spanning multiple industrial domains, and propose MAU-GPT, a multimodal large model adapted to industrial settings that introduces an AMoE-LoRA mechanism to unify the adaptation of anomaly-aware and generalist experts. Result: MAU-GPT consistently outperforms prior state-of-the-art methods across all tested domains, showing potential for scalable, automated industrial inspection. Conclusion: MAU-Set and MAU-GPT together provide a new benchmark and an effective solution for industrial anomaly understanding, advancing automated quality inspection. Abstract: As industrial manufacturing scales, automating fine-grained product image analysis has become critical for quality control. However, existing approaches are hindered by limited dataset coverage and poor model generalization across diverse and complex anomaly patterns. To address these challenges, we introduce MAU-Set, a comprehensive dataset for Multi-type industrial Anomaly Understanding. It spans multiple industrial domains and features a hierarchical task structure, ranging from binary classification to complex reasoning. Alongside this dataset, we establish a rigorous evaluation protocol to facilitate fair and comprehensive model assessment. Building upon this foundation, we further present MAU-GPT, a domain-adapted multimodal large model specifically designed for industrial anomaly understanding. It incorporates a novel AMoE-LoRA mechanism that unifies anomaly-aware and generalist experts adaptation, enhancing both understanding and reasoning across diverse defect classes. Extensive experiments show that MAU-GPT consistently outperforms prior state-of-the-art methods across all domains, demonstrating strong potential for scalable and automated industrial inspection.

[100] A General Model for Retinal Segmentation and Quantification

Zhonghua Wang,Lie Ju,Sijia Li,Wei Feng,Sijin Zhou,Ming Hu,Jianhao Xiong,Xiaoying Tang,Yifan Peng,Mingquan Lin,Yaodong Ding,Yong Zeng,Wenbin Wei,Li Dong,Zongyuan Ge

Main category: cs.CV

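TL;DR: RetSAM is a general retinal segmentation and quantification framework trained on large-scale data. It supports multi-target segmentation and the extraction of more than 30 standardized biomarkers, substantially improving segmentation accuracy and enabling large-scale studies of ocular and systemic disease.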

Details Motivation: Retinal imaging is convenient and widely accessible, but the lack of public multi-label datasets and of a unified segmentation-to-quantification pipeline limits large-scale studies linking retinal phenotypes to disease. Method: Propose the RetSAM framework, trained with a multi-stage strategy on private and public fundus images (more than 200,000), to perform multi-task segmentation of five anatomical structures, four phenotypic patterns, and more than 20 lesion types, and to convert the results into 30+ standardized biomarkers covering morphology, vascular geometry, and degenerative changes. Result: RetSAM leads in segmentation performance on 17 public datasets, improving average DSC by 3.9 percentage points and by up to 15 percentage points on challenging multi-task benchmarks; it generalizes well across populations, devices, and clinical settings; the biomarkers support systematic correlation analyses for major eye diseases including diabetic retinopathy, AMD, glaucoma, and pathologic myopia. Conclusion: RetSAM turns fundus images into standardized, interpretable quantitative phenotypes, providing a scalable, robust foundation for large-scale ophthalmic research and clinical translation. Abstract: Retinal imaging is fast, non-invasive, and widely available, offering quantifiable structural and vascular signals for ophthalmic and systemic health assessment. This accessibility creates an opportunity to study how quantitative retinal phenotypes relate to ocular and systemic diseases. However, such analyses remain difficult at scale due to the limited availability of public multi-label datasets and the lack of a unified segmentation-to-quantification pipeline. We present RetSAM, a general retinal segmentation and quantification framework for fundus imaging. It delivers robust multi-target segmentation and standardized biomarker extraction, supporting downstream ophthalmologic studies and oculomics correlation analyses. Trained on over 200,000 fundus images, RetSAM supports three task categories and segments five anatomical structures, four retinal phenotypic patterns, and more than 20 distinct lesion types. It converts these segmentation results into over 30 standardized biomarkers that capture structural morphology, vascular geometry, and degenerative changes. Trained with a multi-stage strategy using both private and public fundus data, RetSAM achieves superior segmentation performance on 17 public datasets. It improves on prior best methods by 3.9 percentage points in DSC on average, with up to 15 percentage points on challenging multi-task benchmarks, and generalizes well across diverse populations, imaging devices, and clinical settings.
The resulting biomarkers enable systematic correlation analyses across major ophthalmic diseases, including diabetic retinopathy, age-related macular degeneration, glaucoma, and pathologic myopia. Together, RetSAM transforms fundus images into standardized, interpretable quantitative phenotypes, enabling large-scale ophthalmic research and translation.
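
For readers unfamiliar with the DSC figures quoted above, the Dice similarity coefficient between a predicted mask A and a reference mask B is 2|A ∩ B| / (|A| + |B|). A minimal sketch on flat binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions of this example, not details taken from the RetSAM evaluation.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2 * intersection / total


# Toy example: prediction marks two pixels, reference marks one of them.
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(dice_coefficient(pred, target))
```

A gain of 3.9 percentage points in DSC therefore means the average overlap score, on a 0 to 100 scale, rose by 3.9.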

[101] Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models

Jiaxi Yang,Shicheng Liu,Yuchen Yang,Dongwon Lee

Main category: cs.CV

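TL;DR: This paper proposes CR-VLM, a configurable refusal mechanism based on activation steering. It combines refusal vector extraction, a gating mechanism, and a counterfactual vision enhancement module to achieve efficient, robust, and user-adaptive refusal behavior in VLMs.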

Details Motivation: Existing VLM refusal strategies are one-size-fits-all and cannot adapt to different user needs and contextual constraints, leading to under-refusal or over-refusal. Method: The CR-VLM framework has three parts: (1) a teacher-forced mechanism that extracts a configurable refusal vector; (2) a gating mechanism that prevents over-refusal; (3) a counterfactual vision enhancement module that aligns visual representations with refusal requirements. Result: Experiments across multiple datasets and VLMs show that CR-VLM achieves effective, efficient, and robust configurable refusal. Conclusion: CR-VLM offers a scalable path toward user-adaptive safety alignment in VLMs. Abstract: With the rapid advancement of Vision Language Models (VLMs), refusal mechanisms have become a critical component for ensuring responsible and safe model behavior. However, existing refusal strategies are largely one-size-fits-all and fail to adapt to diverse user needs and contextual constraints, leading to either under-refusal or over-refusal. In this work, we first explore the challenges mentioned above and develop Configurable Refusal in VLMs (CR-VLM), a robust and efficient approach for configurable refusal based on activation steering. CR-VLM consists of three integrated components: (1) extracting a configurable refusal vector via a teacher-forced mechanism to amplify the refusal signal; (2) introducing a gating mechanism that mitigates over-refusal by preserving acceptance for in-scope queries; and (3) designing a counterfactual vision enhancement module that aligns visual representations with refusal requirements. Comprehensive experiments across multiple datasets and various VLMs demonstrate that CR-VLM achieves effective, efficient, and robust configurable refusals, offering a scalable path toward user-adaptive safety alignment in VLMs.

[102] Vectra: A New Metric, Dataset, and Model for Visual Quality Assessment in E-Commerce In-Image Machine Translation

Qingyu Wu,Yuxuan Han,Haijun Li,Zhao Xu,Jianshan Zhao,Xu Jin,Longyue Wang,Weihua Luo

Main category: cs.CV

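TL;DR: This paper proposes Vectra, the first reference-free visual quality assessment framework for e-commerce In-Image Machine Translation (IIMT) driven by a multimodal large language model (MLLM). It comprises a multidimensional interpretable scoring system, a large-scale dataset of real product images, and a 4-billion-parameter MLLM, and it significantly improves correlation with human ratings.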

Details Motivation: Existing visual quality assessment methods for IIMT (e.g., SSIM, FID) lack explainability, while model-as-judge approaches lack domain-grounded, fine-grained reward signals, making it hard to handle context-dense product images with multimodal defects. Method: Proposes the Vectra framework, comprising: (1) Vectra Score, a 14-dimension interpretable quality metric with a spatially-aware Defect Area Ratio (DAR); (2) Vectra Dataset, built from 1.1M real product images, with a 2K evaluation benchmark, 30K reasoning annotations, and 3.5K expert preference annotations; (3) Vectra Model, a 4B-parameter MLLM that outputs both quantitative scores and diagnostic reasoning. Result: Vectra reaches state-of-the-art correlation with human rankings; its model outperforms leading MLLMs such as GPT-5 and Gemini-3 in scoring performance; the dataset and model will be released upon acceptance. Conclusion: Vectra is the first reference-free, interpretable, fine-grained, domain-adapted visual quality assessment framework for e-commerce IIMT, offering a new paradigm for evaluating multimodal generation. Abstract: In-Image Machine Translation (IIMT) powers cross-border e-commerce product listings; existing research focuses on machine translation evaluation, while visual rendering quality is critical for user engagement. When facing context-dense product imagery and multimodal defects, current reference-based methods (e.g., SSIM, FID) lack explainability, while model-as-judge approaches lack domain-grounded, fine-grained reward signals. To bridge this gap, we introduce Vectra, to the best of our knowledge, the first reference-free, MLLM-driven visual quality assessment framework for e-commerce IIMT. Vectra comprises three components: (1) Vectra Score, a multidimensional quality metric system that decomposes visual quality into 14 interpretable dimensions, with spatially-aware Defect Area Ratio (DAR) quantification to reduce annotation ambiguity; (2) Vectra Dataset, constructed from 1.1M real-world product images via diversity-aware sampling, comprising a 2K benchmark for system evaluation, 30K reasoning-based annotations for instruction tuning, and 3.5K expert-labeled preferences for alignment and evaluation; and (3) Vectra Model, a 4B-parameter MLLM that generates both quantitative scores and diagnostic reasoning. Experiments demonstrate that Vectra achieves state-of-the-art correlation with human rankings, and our model outperforms leading MLLMs, including GPT-5 and Gemini-3, in scoring performance. The dataset and model will be released upon acceptance.
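The abstract does not define how the Defect Area Ratio is computed. One plausible reading, shown below purely as an assumption, is the fraction of image pixels covered by the union of annotated defect boxes; the function name and the `(x0, y0, x1, y1)` box format are hypothetical.

```python
import numpy as np

def defect_area_ratio(image_shape, boxes):
    """Fraction of pixels covered by the union of defect boxes (assumed definition).

    image_shape is (height, width); each box is (x0, y0, x1, y1) in pixels,
    with x1 and y1 exclusive. Overlapping boxes are counted once.
    """
    height, width = image_shape
    mask = np.zeros((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image before marking it in the mask.
        mask[max(y0, 0):min(y1, height), max(x0, 0):min(x1, width)] = True
    return float(mask.sum()) / (height * width)

ratio = defect_area_ratio((10, 10), [(0, 0, 5, 5), (3, 3, 7, 7)])
```

Using a pixel mask rather than summing box areas keeps overlapping defects from being double-counted.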

[103] Robust and Real-Time Bangladeshi Currency Recognition: A Dual-Stream MobileNet and EfficientNet Approach

Subreena,Mohammad Amzad Hossain,Mirza Raquib,Saydul Akbar Murad,Farida Siddiqi Prity,Muhammad Hanif,Nick Rahimi

Main category: cs.CV

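TL;DR: This paper proposes a hybrid CNN architecture combining MobileNetV3-Large and EfficientNetB0 for Bangladeshi banknote recognition, improving accuracy and interpretability in complex scenes.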

Details Motivation: To reduce the risk of fraud and exploitation faced by visually impaired people who rely on others to identify banknotes; existing models generalize poorly to real-world scenes. Method: Builds a new Bangladeshi banknote dataset covering controlled and real-world scenarios and merges in four additional datasets, including public benchmarks; proposes a lightweight, efficient hybrid CNN that uses MobileNetV3-Large and EfficientNetB0 for joint feature extraction followed by an MLP classifier; applies LIME and SHAP for interpretability analysis. Result: Achieves 97.95% accuracy on controlled datasets, 92.84% on complex backgrounds, and 94.98% on all datasets combined; evaluated with five-fold cross-validation and seven metrics (including AUC and MCC). Conclusion: The proposed method balances accuracy, robustness, and on-device suitability, and XAI strengthens its trustworthiness, making it suitable for deployment in resource-constrained assistive technology. Abstract: Accurate currency recognition is essential for assistive technologies, particularly for visually impaired individuals who rely on others to identify banknotes. This dependency puts them at risk of fraud and exploitation. To address these challenges, we first build a new Bangladeshi banknote dataset that includes both controlled and real-world scenarios, ensuring a more comprehensive and diverse representation. Next, to enhance the dataset's robustness, we incorporate four additional datasets, including public benchmarks, to cover various complexities and improve the model's generalization. To overcome the limitations of current recognition models, we propose a novel hybrid CNN architecture that combines MobileNetV3-Large and EfficientNetB0 for efficient feature extraction. This is followed by an effective multilayer perceptron (MLP) classifier to improve performance while keeping computational costs low, making the system suitable for resource-constrained devices. The experimental results show that the proposed model achieves 97.95% accuracy on controlled datasets, 92.84% on complex backgrounds, and 94.98% accuracy when combining all datasets. The model's performance is thoroughly evaluated using five-fold cross-validation and seven metrics: accuracy, precision, recall, F1-score, Cohen's Kappa, MCC, and AUC. Additionally, explainable AI methods like LIME and SHAP are incorporated to enhance transparency and interpretability.

[104] Gaussian-Constrained LeJEPA Representations for Unsupervised Scene Discovery and Pose Consistency

Mohsen Mostafa

Main category: cs.CV

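TL;DR: This paper investigates using Gaussian-constrained image embeddings (inspired by LeJEPA) in unsupervised 3D scene reconstruction to improve scene discovery and camera pose estimation over multi-scene image collections, with the clearest gains on visually ambiguous, noisy real-world data such as IMC2025.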

Details Motivation: Addresses unsupervised 3D scene reconstruction from unstructured image collections, especially those drawn from multiple unrelated scenes with heavy visual ambiguity, and responds to the scene discovery and robust pose estimation challenges posed by IMC2025. Method: Presents three progressively refined pipelines, culminating in a LeJEPA-inspired approach that imposes isotropic Gaussian constraints on the image embedding space, and empirically evaluates how these constraints affect clustering consistency and pose estimation robustness. Result: On the IMC2025 dataset, Gaussian-constrained embeddings improve scene separation and pose plausibility over heuristic baselines, with the clearest advantage in visually ambiguous scenes. Conclusion: Theoretically motivated representation constraints such as the Gaussian constraint can help bridge self-supervised learning principles and practical SfM pipelines, offering a new direction for unsupervised 3D reconstruction. Abstract: Unsupervised 3D scene reconstruction from unstructured image collections remains a fundamental challenge in computer vision, particularly when images originate from multiple unrelated scenes and contain significant visual ambiguity. The Image Matching Challenge 2025 (IMC2025) highlights these difficulties by requiring both scene discovery and camera pose estimation under real-world conditions, including outliers and mixed content. This paper investigates the application of Gaussian-constrained representations inspired by LeJEPA (Joint Embedding Predictive Architecture) to address these challenges. We present three progressively refined pipelines, culminating in a LeJEPA-inspired approach that enforces isotropic Gaussian constraints on learned image embeddings. Rather than introducing new theoretical guarantees, our work empirically evaluates how these constraints influence clustering consistency and pose estimation robustness in practice. Experimental results on IMC2025 demonstrate that Gaussian-constrained embeddings can improve scene separation and pose plausibility compared to heuristic-driven baselines, particularly in visually ambiguous settings. These findings suggest that theoretically motivated representation constraints offer a promising direction for bridging self-supervised learning principles and practical structure-from-motion pipelines.
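As an illustration of what an isotropic Gaussian constraint on embeddings can look like, the sketch below penalizes deviation of a batch's mean from zero and of its covariance from the identity. This moment-matching penalty is a simplified stand-in chosen for clarity; it is not claimed to be the regularizer used in LeJEPA or in this paper, and the function name is hypothetical.

```python
import numpy as np

def isotropic_gaussian_penalty(embeddings):
    """Moment-matching penalty toward N(0, I) (simplified, assumed form).

    Returns ||mean||^2 + ||cov - I||_F^2 for a batch of shape (n, d).
    """
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    identity = np.eye(embeddings.shape[1])
    return float(mean @ mean + np.sum((cov - identity) ** 2))

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(5000, 3))
stretched = isotropic * np.array([3.0, 1.0, 1.0])  # anisotropic along one axis
penalty_iso = isotropic_gaussian_penalty(isotropic)
penalty_stretched = isotropic_gaussian_penalty(stretched)
```

The stretched batch incurs a much larger penalty than the isotropic one, which is the pressure such a constraint applies during training.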

[105] XAI-CLIP: ROI-Guided Perturbation Framework for Explainable Medical Image Segmentation in Multimodal Vision-Language Models

Thuraya Alzubaidi,Sana Ammar,Maryam Alsharqi,Islem Rekik,Muzammil Behzad

Main category: cs.CV

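TL;DR: This paper proposes XAI-CLIP, an ROI-guided perturbation framework that uses multimodal vision-language model embeddings to localize clinically relevant anatomical regions, improving the interpretability and efficiency of medical image segmentation models.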

Details Motivation: Existing XAI methods for medical image segmentation are computationally expensive, noisy, and often anatomically irrelevant, which limits clinical trust and deployment. Method: Proposes the XAI-CLIP framework, which uses multimodal models such as CLIP for language-guided ROI localization and a region-aware perturbation strategy to generate boundary-aware, anatomically consistent saliency maps. Result: On the FLARE22 and CHAOS datasets, compared with conventional perturbation methods, runtime drops by up to 60%, Dice improves by up to 44.6%, and IoU improves by up to 96.7% for occlusion-based explanations; qualitative results show cleaner and more anatomically consistent attribution maps. Conclusion: Integrating multimodal vision-language representations into perturbation-based XAI frameworks improves both the interpretability and the computational efficiency of medical image segmentation systems, supporting their clinical deployment. Abstract: Medical image segmentation is a critical component of clinical workflows, enabling accurate diagnosis, treatment planning, and disease monitoring. However, despite the superior performance of transformer-based models over convolutional architectures, their limited interpretability remains a major obstacle to clinical trust and deployment. Existing explainable artificial intelligence (XAI) techniques, including gradient-based saliency methods and perturbation-based approaches, are often computationally expensive, require numerous forward passes, and frequently produce noisy or anatomically irrelevant explanations. To address these limitations, we propose XAI-CLIP, an ROI-guided perturbation framework that leverages multimodal vision-language model embeddings to localize clinically meaningful anatomical regions and guide the explanation process. By integrating language-informed region localization with medical image segmentation and applying targeted, region-aware perturbations, the proposed method generates clearer, boundary-aware saliency maps while substantially reducing computational overhead. Experiments conducted on the FLARE22 and CHAOS datasets demonstrate that XAI-CLIP achieves up to a 60% reduction in runtime, a 44.6% improvement in Dice score, and a 96.7% increase in Intersection-over-Union for occlusion-based explanations compared to conventional perturbation methods. Qualitative results further confirm cleaner and more anatomically consistent attribution maps with fewer artifacts. Incorporating multimodal vision-language representations into perturbation-based XAI frameworks thus enhances both interpretability and efficiency, supporting transparent and clinically deployable medical image segmentation systems.
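The runtime saving from region-aware perturbation can be illustrated with a small occlusion sketch that only perturbs patches overlapping a given ROI mask. This is a generic illustration under stated assumptions, not the XAI-CLIP pipeline: the ROI mask is supplied directly rather than localized by a vision-language model, `score_fn` is a hypothetical stand-in for the segmentation model's score, and occlusion uses a zero baseline.

```python
import numpy as np

def roi_occlusion_saliency(image, roi_mask, score_fn, patch=2):
    """Occlusion saliency restricted to patches that overlap the ROI.

    image: 2D float array; roi_mask: 2D bool array of the same shape;
    score_fn: callable mapping an image to a scalar score (assumed interface).
    """
    base = score_fn(image)
    saliency = np.zeros(image.shape, dtype=float)
    height, width = image.shape
    for y in range(0, height, patch):
        for x in range(0, width, patch):
            if not roi_mask[y:y + patch, x:x + patch].any():
                continue  # patch lies outside the ROI: no forward pass spent on it
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = 0.0
            # Importance = score drop caused by hiding this patch.
            saliency[y:y + patch, x:x + patch] = base - score_fn(occluded)
    return saliency

image = np.ones((4, 4))
roi = np.zeros((4, 4), dtype=bool)
roi[:2, :2] = True
saliency = roi_occlusion_saliency(image, roi, lambda img: float(img[:2, :2].sum()))
```

On this 4x4 toy image with a 2x2 ROI, only one of the four patches is perturbed, so the scorer runs twice (baseline plus one occlusion) instead of five times.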

[106] Deep Learning Based Multi-Level Classification for Aviation Safety

Elaheh Sabziyan Varnousfaderani,Syed A. M. Shihab,Jonathan King

Main category: cs.CV

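TL;DR: This paper proposes a CNN-based image recognition framework for bird species classification and flock type and size estimation, to improve bird-strike warning and flight path prediction in aviation.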

Details Motivation: Existing bird-strike warning radar systems cannot identify bird species, yet flight behavior and altitude preferences differ markedly across species, which limits prediction accuracy and prevention effectiveness. Method: Builds a CNN-based image classification framework that identifies bird species, flock formation type, and flock size, and feeds the results into species-specific flight path prediction models. Result: Provides visual recognition of bird species, flock type, and flock size, supplying key visual semantic information for bird-strike risk assessment and flight trajectory prediction. Conclusion: The image classification framework can compensate for the lack of species identification in existing radar systems and make bird-strike warning systems more intelligent and fine-grained, with practical value for aviation safety. Abstract: Bird strikes pose a significant threat to aviation safety, often resulting in loss of life, severe aircraft damage, and substantial financial costs. Existing bird strike prevention strategies primarily rely on avian radar systems that detect and track birds in real time. A major limitation of these systems is their inability to identify bird species, an essential factor, as different species exhibit distinct flight behaviors and altitudinal preferences. To address this challenge, we propose an image-based bird classification framework using Convolutional Neural Networks (CNNs), designed to work with camera systems for autonomous visual detection. The CNN is designed to identify bird species and provide critical input to species-specific predictive models for accurate flight path prediction. In addition to species identification, we implemented dedicated CNN classifiers to estimate flock formation type and flock size. These characteristics provide valuable supplementary information for aviation safety. Specifically, flock type and size offer insights into collective flight behavior and trajectory dispersion. Flock size directly relates to the potential impact severity, as the overall damage risk increases with the combined kinetic energy of multiple birds.

[107] The Geometry of Representational Failures in Vision Language Models

Daniele Savietto,Declan Campbell,André Panisson,Marco Nurisso,Giovanni Petri,Jonathan D. Cohen,Alan Perotti

Main category: cs.CV

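TL;DR: By analyzing the representational geometry of visual concept vectors in open-weight vision-language models (VLMs), this paper finds that the degree of geometric overlap between concept vectors is strongly correlated with typical model errors in multi-object visual tasks (such as hallucination and confusion), providing a quantifiable mechanistic explanation of visual failures in VLMs.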

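Details Motivation: VLMs show puzzling failures in multi-object visual tasks (such as hallucination and misidentification) that resemble human cognitive limits like the binding problem, but the internal mechanisms remain unclear. Method: Analyzes the representational geometry of "concept vectors" (latent directions encoding visual concepts) in several open-weight VLMs (Qwen, InternVL, Gemma); validates them through concept steering interventions; quantifies the geometric overlap between concept vectors and relates it to the models' error patterns. Result: Geometric overlap between concept vectors strongly correlates with specific error patterns (such as hallucination and confusion of similar objects), and steering interventions reliably manipulate model behavior (e.g., forcing a red flower to be perceived as blue). Conclusion: Visual failures in VLMs can be explained mechanistically by the geometry of their internal concept representations, in particular the overlap between concept vectors, which offers a new route to diagnosing and improving multi-object visual reasoning. Abstract: Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirror human cognitive constraints, such as the "Binding Problem", the internal mechanisms driving them in artificial systems remain poorly understood. Here, we propose a mechanistic insight by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma), comparing methodologies to distill "concept vectors" - latent directions encoding visual concepts. We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework to understand how internal representations shape model behavior and drive visual failures.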
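One simple way to quantify "geometric overlap" between concept vectors is pairwise cosine similarity. The sketch below assumes that measure and assumes a difference-of-means extraction; the paper compares several distillation methods and may define overlap differently, and both function names are hypothetical.

```python
import numpy as np

def concept_vector(acts_with, acts_without):
    """Unit-norm difference of mean activations with and without a concept."""
    vec = acts_with.mean(axis=0) - acts_without.mean(axis=0)
    return vec / np.linalg.norm(vec)

def overlap_matrix(vectors):
    """Pairwise cosine similarity between concept vectors."""
    stacked = np.stack(vectors)
    stacked = stacked / np.linalg.norm(stacked, axis=1, keepdims=True)
    return stacked @ stacked.T

# Toy 2D concept directions: the first two overlap, the first and third do not.
red = np.array([1.0, 0.0])
crimson = np.array([1.0, 1.0])
square = np.array([0.0, 1.0])
overlaps = overlap_matrix([red, crimson, square])
```

Entries near 1 mark concept pairs whose directions nearly coincide; under the paper's finding, such pairs would be the ones most associated with confusions.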

[108] Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

Xiaomin Yu,Yi Xin,Wenjie Zhang,Chonghan Liu,Hanzhen Zhao,Xiaoxing Hu,Xinlei Yu,Ziyue Qiao,Hao Tang,Xue Yang,Xiaobin Hu,Chengwei Qin,Hui Xiong,Yu Qiao,Shuicheng Yan

Main category: cs.CV

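TL;DR: This paper proposes the Fixed-frame Modality Gap Theory, which precisely characterizes the geometric offset between multimodal representations (the modality gap), and builds on it to design ReAlign, a training-free alignment strategy. It further proposes ReVision, a scalable pretraining paradigm that uses massive unpaired text data to learn the distribution of visual representations, significantly reducing reliance on high-quality image-text pairs.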

Details Motivation: Existing methods are limited by oversimplified isotropic assumptions and struggle to bridge, at scale, the geometric offset inherent in multimodal contrastive learning: the Modality Gap. Method: Proposes the Fixed-frame Modality Gap Theory, which decomposes the modality gap into stable biases and anisotropic residuals; on this basis designs ReAlign, a training-free strategy that uses statistics from large-scale unpaired data and a three-step Anchor, Trace, and Centroid alignment to map text representations into the image representation distribution; then builds the ReVision training paradigm, which integrates ReAlign into MLLM pretraining so that the model first learns the distribution of visual representations from unpaired text and then undergoes visual instruction tuning. Result: ReAlign aligns modality representations efficiently; ReVision improves the scalability and training efficiency of MLLMs without relying on large-scale, high-quality image-text pairs; experiments show that statistically aligned unpaired data can effectively substitute for expensive image-text pairs. Conclusion: Precisely modeling the geometric structure of the modality gap is key to efficient multimodal alignment and scalable training of large models; ReAlign and ReVision offer a low-resource, robust path for multimodal large language models. Abstract: Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representation into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
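The notion of a modality gap as a systematic offset can be made concrete with unpaired statistics alone. The sketch below measures the gap as the difference between the image and text embedding centroids and removes it with a constant shift. It only addresses a constant offset (what the abstract calls a stable bias), not the anisotropic residuals, and it is not the three-step ReAlign procedure, whose details the abstract does not give; the function names are hypothetical.

```python
import numpy as np

def modality_gap(image_emb, text_emb):
    """Centroid difference between two embedding sets (needs no pairing)."""
    return image_emb.mean(axis=0) - text_emb.mean(axis=0)

def shift_text_to_image(text_emb, gap):
    """Translate text embeddings so their centroid matches the image centroid."""
    return text_emb + gap

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(200, 4)) + np.array([2.0, 0.0, 0.0, 0.0])
text_emb = rng.normal(size=(300, 4))  # different count: the sets are unpaired
gap = modality_gap(image_emb, text_emb)
aligned_text = shift_text_to_image(text_emb, gap)
residual_gap = np.linalg.norm(modality_gap(image_emb, aligned_text))
```

Because only per-modality means are used, the two sets can have different sizes and no image needs a matching caption.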

[109] Fair Context Learning for Evidence-Balanced Test-Time Adaptation in Vision-Language Models

Sanggeon Yun,Ryozo Masukawa,SungHeon Jeong,Wenjun Huang,Hanning Chen,Mohsen Imani

Main category: cs.CV

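TL;DR: This paper proposes Fair Context Learning (FCL), a test-time adaptation (TTA) framework that avoids entropy minimization. By decoupling augmentation-based exploration from fairness-driven calibration of text contexts, it mitigates the bias caused by shared visual features and achieves performance competitive with state-of-the-art methods across diverse domain-shift and fine-grained recognition tasks.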

Details Motivation: Existing prompt-based test-time adaptation methods mostly rely on entropy minimization, which can amplify spurious correlations and cause overconfident errors when classes share visual features; a method is needed that avoids entropy minimization and mitigates shared-evidence bias. Method: Proposes Fair Context Learning (FCL), which, under an additive evidence decomposition assumption, decouples adaptation into two steps: (i) augmentation-based exploration to identify plausible class candidates; (ii) fairness-driven calibration that adjusts text contexts to equalize sensitivity to common visual evidence, thereby suppressing over-attention to partial features. Result: Across diverse domain-shift and fine-grained recognition benchmarks, FCL achieves adaptation performance competitive with state-of-the-art TTA methods, and experiments support its theoretical motivation. Conclusion: By dropping entropy minimization and calibrating text contexts under a fairness constraint, FCL improves the robustness of zero-shot VLM recognition under distribution shift and offers a new paradigm for TTA. Abstract: Vision-Language Models (VLMs) such as CLIP enable strong zero-shot recognition but suffer substantial degradation under distribution shifts. Test-Time Adaptation (TTA) aims to improve robustness using only unlabeled test samples, yet most prompt-based TTA methods rely on entropy minimization, an approach that can amplify spurious correlations and induce overconfident errors when classes share visual features. We propose Fair Context Learning (FCL), an episodic TTA framework that avoids entropy minimization by explicitly addressing shared-evidence bias. Motivated by our additive evidence decomposition assumption, FCL decouples adaptation into (i) augmentation-based exploration to identify plausible class candidates, and (ii) fairness-driven calibration that adapts text contexts to equalize sensitivity to common visual evidence. This fairness constraint mitigates partial feature obsession and enables effective calibration of text embeddings without relying on entropy reduction. Through extensive evaluation, we empirically validate our theoretical motivation and show that FCL achieves competitive adaptation performance relative to state-of-the-art TTA methods across diverse domain-shift and fine-grained benchmarks.

[110] A Comparative Study of Adversarial Robustness in CNN and CNN-ANFIS Architectures

Kaaustaaub Shankar,Bharadwaj Dogga,Kelly Cohen

Main category: cs.CV

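TL;DR: This paper compares standard CNNs with ANFIS-augmented CNNs across multiple datasets and adversarial attacks, finding that the robustness gains from ANFIS augmentation depend on the architecture and are not universal.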

Details Motivation: CNNs perform well but lack interpretability and are vulnerable to adversarial attacks; existing neuro-fuzzy hybrids such as DCNFIS improve interpretability, but their robustness has not been studied in depth. Method: Replaces the fully connected classifiers of standard CNNs (ConvNet, VGG, ResNet18) with ANFIS to build ANFIS-augmented models, then evaluates robustness on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 under PGD (gradient-based) and Square (gradient-free) adversarial attacks. Result: ANFIS integration does not consistently improve clean accuracy; robustness gains are architecture-dependent: ResNet18-ANFIS becomes more robust, while VGG-ANFIS often underperforms its baseline. Conclusion: Neuro-fuzzy augmentation can improve adversarial robustness in specific architectures but is not universally beneficial, so it should be designed with the architecture in mind. Abstract: Convolutional Neural Networks (CNNs) achieve strong image classification performance but lack interpretability and are vulnerable to adversarial attacks. Neuro-fuzzy hybrids such as DCNFIS replace fully connected CNN classifiers with Adaptive Neuro-Fuzzy Inference Systems (ANFIS) to improve interpretability, yet their robustness remains underexplored. This work compares standard CNNs (ConvNet, VGG, ResNet18) with their ANFIS-augmented counterparts on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 under gradient-based (PGD) and gradient-free (Square) attacks. Results show that ANFIS integration does not consistently improve clean accuracy and has architecture-dependent effects on robustness: ResNet18-ANFIS exhibits improved adversarial robustness, while VGG-ANFIS often underperforms its baseline. These findings suggest that neuro-fuzzy augmentation can enhance robustness in specific architectures but is not universally beneficial.
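For readers unfamiliar with the gradient-based attack used in this comparison, the sketch below shows the core loop of L-infinity Projected Gradient Descent (PGD): take a signed gradient ascent step on the loss, then project back into the epsilon-ball around the clean input. To stay self-contained it attacks a logistic-regression model with an analytic gradient, which is an assumption made for illustration; the paper attacks CNN and CNN-ANFIS classifiers, and the parameter values here are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_attack(x, y, w, b, eps=0.2, step=0.05, iters=10):
    """L-infinity PGD against p = sigmoid(w.x + b) with label y in {0, 1}."""
    x_adv = x.copy()
    for _ in range(iters):
        p = sigmoid(x_adv @ w + b)
        grad = (p - y) * w                         # gradient of cross-entropy w.r.t. the input
        x_adv = x_adv + step * np.sign(grad)       # ascent step that increases the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project into the eps-ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)           # keep a valid pixel range
    return x_adv

w = np.array([1.0, -1.0])
x_clean = np.array([0.6, 0.4])   # classified as positive: w.x > 0
x_adv = pgd_attack(x_clean, y=1.0, w=w, b=0.0)
flipped = sigmoid(x_adv @ w) < 0.5
```

The perturbation never exceeds `eps` per coordinate, yet on this toy input it is enough to push the prediction across the decision boundary.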

[111] UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents

Yifan Ji,Zhipeng Xu,Zhenghao Liu,Zulong Chen,Qian Zhang,Zhibo Yang,Junyang Lin,Yu Gu,Ge Yu,Maosong Sun

Main category: cs.CV

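TL;DR: This paper introduces UNIKIE-BENCH, a unified benchmark for systematically evaluating large multimodal models (LMMs) on key information extraction (KIE) from real-world documents. It has two evaluation tracks, constrained-category and open-category, and reveals marked weaknesses of current LMMs in layout-aware reasoning and grounding accuracy.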

Details Motivation: Real-world documents vary widely in layout, visual quality, and task requirements, and existing KIE work lacks a systematic, diverse benchmark, making it hard to assess the practical KIE capabilities of LMMs comprehensively. Method: Builds the unified UNIKIE-BENCH benchmark with two complementary tracks: (1) a constrained-category KIE track based on scenario-predefined schemas, and (2) an open-category KIE track that extracts all key information explicitly present in the document; runs experiments on 15 state-of-the-art LMMs. Result: Current LMMs degrade substantially under diverse schema definitions, long-tail key fields, and complex layouts, and their performance differs markedly across document types and scenarios. Conclusion: LMMs still face fundamental challenges in grounding accuracy and layout-aware reasoning for KIE, which call for targeted improvements. Abstract: Key Information Extraction (KIE) from real-world documents remains challenging due to substantial variations in layout structures, visual quality, and task-specific information requirements. Recent Large Multimodal Models (LMMs) have shown promising potential for performing end-to-end KIE directly from document images. To enable a comprehensive and systematic evaluation across realistic and diverse application scenarios, we introduce UNIKIE-BENCH, a unified benchmark designed to rigorously evaluate the KIE capabilities of LMMs. UNIKIE-BENCH consists of two complementary tracks: a constrained-category KIE track with scenario-predefined schemas that reflect practical application needs, and an open-category KIE track that extracts any key information that is explicitly present in the document. Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts, along with pronounced performance disparities across different document types and scenarios. These findings underscore persistent challenges in grounding accuracy and layout-aware reasoning for LMM-based KIE. All code and datasets are available at https://github.com/NEUIR/UNIKIE-BENCH.

[112] OMNI-Dent: Towards an Accessible and Explainable AI Framework for Automated Dental Diagnosis

Leeje Jang,Yao-Yi Chiang,Angela M. Hastings,Patimaporn Pungchanchaikul,Martha B. Lucas,Emily C. Schultz,Jeffrey P. Louie,Mohamed Estai,Wen-Chen Wang,Ryan H. L. Ip,Boyen Huang

Main category: cs.CV

TL;DR: OMNI-Dent is a data-efficient, explainable dental diagnostic framework using a Vision-Language Model (VLM) guided by clinical reasoning, operating on multi-view smartphone photos without VLM fine-tuning.

Details Motivation: To overcome limitations of existing AI dental diagnosis methods (lack of clinical reasoning integration, heavy reliance on expert-annotated data, and poor generalization under real-world imaging conditions), especially for underserved populations with limited access to professional care. Method: Integrates dental expert heuristics into a VLM-based pipeline; processes multi-view smartphone photographs; leverages a pre-trained VLM's visual-linguistic capabilities without dental-specific fine-tuning; enables tooth-level evaluation with explainability. Result: A practical, early-stage assistive tool that supports dental abnormality identification and triage decisions using only smartphone images, requiring no curated clinical imaging or extensive labeled data. Conclusion: OMNI-Dent bridges the gap between AI-driven diagnosis and clinical reasoning, offering an accessible, scalable, and interpretable solution for low-resource oral healthcare settings. Abstract: Accurate dental diagnosis is essential for oral healthcare, yet many individuals lack access to timely professional evaluation. Existing AI-based methods primarily treat diagnosis as a visual pattern recognition task and do not reflect the structured clinical reasoning used by dental professionals. These approaches also require large amounts of expert-annotated data and often struggle to generalize across diverse real-world imaging conditions. To address these limitations, we present OMNI-Dent, a data-efficient and explainable diagnostic framework that incorporates clinical reasoning principles into a Vision-Language Model (VLM)-based pipeline. The framework operates on multi-view smartphone photographs, embeds diagnostic heuristics from dental experts, and guides a general-purpose VLM to perform tooth-level evaluation without dental-specific fine-tuning of the VLM. By utilizing the VLM's existing visual-linguistic capabilities, OMNI-Dent aims to support diagnostic assessment in settings where curated clinical imaging is unavailable. Designed as an early-stage assistive tool, OMNI-Dent helps users identify potential abnormalities and determine when professional evaluation may be needed, offering a practical option for individuals with limited access to in-person care.

[113] COMBOOD: A Semiparametric Approach for Detecting Out-of-distribution Data for Image Classification

Magesh Rajasekaran,Md Saiful Islam Sajol,Frej Berglind,Supratik Mukhopadhyay,Kamalika Das

Main category: cs.CV

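TL;DR: This paper proposes COMBOOD, an unsupervised semi-parametric framework for out-of-distribution (OOD) detection in image recognition. By combining signals from two distance metrics, nearest-neighbor and Mahalanobis, it improves detection accuracy in both near-OOD and far-OOD scenarios and outperforms existing methods on several benchmark datasets.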

Details Motivation: Existing OOD detection methods perform poorly in near-OOD scenarios, which are common in practice; a method is needed that handles both near-OOD and far-OOD detection well. Method: Proposes the COMBOOD framework, which fuses the non-parametric nearest-neighbor distance with the parametric Mahalanobis distance in a semi-parametric setting to produce a single OOD confidence score. Result: On OpenOOD v1 and v1.5 and on the documents dataset, COMBOOD exceeds state-of-the-art methods in both near-OOD and far-OOD detection accuracy, with statistically significant improvements on a majority of the benchmarks; its cost scales linearly with the dimension of the embedding space. Conclusion: COMBOOD is an efficient, accurate, and scalable OOD detection framework suited to practical automated systems. Abstract: Identifying out-of-distribution (OOD) data at inference time is crucial for many machine learning applications, especially for automation. We present a novel unsupervised semi-parametric framework, COMBOOD, for OOD detection with respect to image recognition. Our framework combines signals from two distance metrics, nearest-neighbor and Mahalanobis, to derive a confidence score for an inference point to be out-of-distribution. The former provides a non-parametric approach to OOD detection. The latter provides a parametric, simple, yet effective method for detecting OOD data points, especially in the far-OOD scenario, where the inference point is far apart from the training data set in the embedding space. However, its performance is not satisfactory in the near-OOD scenarios that arise in practical situations. Our COMBOOD framework combines the two signals in a semi-parametric setting to provide a confidence score that is accurate both for the near-OOD and far-OOD scenarios. We show experimental results with the COMBOOD framework for different types of feature extraction strategies. We demonstrate experimentally that COMBOOD outperforms state-of-the-art OOD detection methods on the OpenOOD (both version 1 and the most recent version 1.5) benchmark datasets (for both far-OOD and near-OOD) as well as on the documents dataset in terms of accuracy. On a majority of the benchmark datasets, the improvements in accuracy resulting from the COMBOOD framework are statistically significant. COMBOOD scales linearly with the size of the embedding space, making it ideal for many real-life applications.
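The two signals named in the abstract can be sketched in a few lines of NumPy. The combination rule below (a weighted sum) is an assumption made for illustration, since the abstract does not say how COMBOOD fuses the scores; the sketch also fits a single class-agnostic Gaussian for the Mahalanobis term, and all function names, `k`, and `weight` are hypothetical.

```python
import numpy as np

def fit_gaussian(train_emb):
    """Mean and inverse covariance of the training embeddings (parametric part)."""
    mean = train_emb.mean(axis=0)
    cov = np.cov(train_emb, rowvar=False) + 1e-6 * np.eye(train_emb.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis_score(x, mean, cov_inv):
    diff = x - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

def knn_score(x, train_emb, k=5):
    """Distance to the k-th nearest training embedding (non-parametric part)."""
    dists = np.linalg.norm(train_emb - x, axis=1)
    return float(np.sort(dists)[k - 1])

def combined_ood_score(x, train_emb, mean, cov_inv, k=5, weight=0.5):
    """Weighted sum of both distances (assumed fusion rule); higher = more OOD."""
    return (weight * knn_score(x, train_emb, k)
            + (1.0 - weight) * mahalanobis_score(x, mean, cov_inv))

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 4))
mean, cov_inv = fit_gaussian(train)
in_dist = combined_ood_score(np.zeros(4), train, mean, cov_inv)
far_ood = combined_ood_score(np.full(4, 10.0), train, mean, cov_inv)
```

In practice the two distances live on different scales, so a real system would likely normalize them before fusing; that step is omitted here for brevity.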

[114] PipeMFL-240K: A Large-scale Dataset and Benchmark for Object Detection in Pipeline Magnetic Flux Leakage Imaging

Tianyi Qu,Songxiao Yang,Haolin Wang,Huadong Song,Xiaoting Guo,Wenguang Hu,Guanlin Liu,Honghe Chen,Yafei Ou

Main category: cs.CV

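TL;DR: This paper introduces PipeMFL-240K, the first large-scale public object detection dataset and benchmark for pseudo-color pipeline Magnetic Flux Leakage (MFL) images. With about 240K images and 190K high-quality bounding-box annotations, it targets the comparability and reproducibility problems that the lack of standard data causes in automated MFL interpretation.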

Details Motivation: Progress of deep learning methods for MFL image interpretation has been limited mainly by the lack of a large-scale, public, finely annotated dataset and a unified benchmark, which makes fair comparison and reproducible evaluation difficult. Method: Builds the PipeMFL-240K dataset from 11 in-service pipelines spanning approximately 1,480 km, with 240,320 pseudo-color MFL images and 191,530 high-quality bounding-box annotations over 12 categories; the data exhibit a long-tailed distribution, many tiny objects of only a few pixels, and strong intra-class variability; systematic baseline experiments are run with state-of-the-art object detectors. Result: Current mainstream detectors remain clearly limited on this dataset, especially in generalizing to tiny objects and long-tail categories, which confirms how challenging the dataset is; several reliable baselines are established. Conclusion: As the first large-scale public MFL benchmark, PipeMFL-240K provides key infrastructure for intelligent pipeline integrity diagnostics, maintenance decisions, and algorithmic innovation, and is expected to advance reproducible research and practical adoption in the MFL field. Abstract: Pipeline integrity is critical to industrial safety and environmental protection, with Magnetic Flux Leakage (MFL) detection being a primary non-destructive testing technology. Despite the promise of deep learning for automating MFL interpretation, progress toward reliable models has been constrained by the absence of a large-scale public dataset and benchmark, making fair comparison and reproducible evaluation difficult. We introduce PipeMFL-240K, a large-scale, meticulously annotated dataset and benchmark for complex object detection in pipeline MFL pseudo-color images. PipeMFL-240K reflects real-world inspection complexity and poses several unique challenges: (i) an extremely long-tailed distribution over 12 categories, (ii) a high prevalence of tiny objects that often comprise only a handful of pixels, and (iii) substantial intra-class variability. The dataset contains 240,320 images and 191,530 high-quality bounding-box annotations, collected from 11 pipelines spanning approximately 1,480 km. Extensive experiments are conducted with state-of-the-art object detectors to establish baselines. Results show that modern detectors still struggle with the intrinsic properties of MFL data, highlighting considerable headroom for improvement, while PipeMFL-240K provides a reliable and challenging testbed to drive future research. As the first public dataset and the first benchmark of this scale and scope for pipeline MFL inspection, it provides a critical foundation for efficient pipeline diagnostics as well as maintenance planning and is expected to accelerate algorithmic innovation and reproducible research in MFL-based pipeline integrity assessment.

[115] VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

Zhiming Luo,Di Wang,Haonan Guo,Jing Zhang,Bo Du

Main category: cs.CV

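TL;DR: This paper proposes VLRS-Bench, the first vision-language reasoning benchmark designed specifically for complex reasoning in remote sensing (RS). It covers three dimensions (cognition, decision, and prediction), contains 2,000 long-form question-answer pairs, and reveals significant bottlenecks of current multimodal large models on RS reasoning tasks.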

Details Motivation: Existing remote sensing benchmarks are heavily biased toward perception tasks such as object recognition and scene classification and cannot support cognitively demanding RS applications; a dedicated benchmark for complex reasoning is needed. Method: Builds VLRS-Bench, the first RS-specific vision-language reasoning benchmark, organized along the cognition, decision, and prediction dimensions, with 2,000 question-answer pairs averaging 71 words, 14 tasks, and up to eight temporal phases; a specialized construction pipeline integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Result: Experiments show that state-of-the-art multimodal large language models (MLLMs) fall well short on VLRS-Bench, exposing key bottlenecks in complex RS reasoning. Conclusion: VLRS-Bench fills the gap left by the absence of a complex-reasoning benchmark in remote sensing and provides a new standard and direction for advancing multimodal models on RS cognitive tasks. Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average length of 71 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community.

[116] ShapBPT: Image Feature Attributions Using Data-Aware Binary Partition Trees

Muhammad Rashid,Elvio G. Amparore,Enrico Ferrari,Damiano Verda

Main category: cs.CV

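TL;DR: This paper proposes ShapBPT, a hierarchical Shapley value method built on a data-aware Binary Partition Tree (BPT), to improve the interpretability of computer vision models with more efficient and semantically clearer pixel-level feature attributions.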

Details Motivation: Existing hierarchical Shapley methods do not exploit the multiscale structure of images, so they converge slowly and align poorly with morphological features; no data-aware hierarchy tailored to visual data has been used, which limits interpretability. Method: Proposes ShapBPT, which assigns Shapley values over a Binary Partition Tree (BPT), a data-aware multiscale hierarchy designed for images, so that attributions follow the image's intrinsic morphology while reducing computational cost. Result: Experiments show that ShapBPT outperforms existing XCV methods in alignment with image structure and in computational efficiency; a 20-subject user study confirms that humans prefer its explanations. Conclusion: ShapBPT connects hierarchical Shapley theory with the characteristics of image data, offering a more efficient and semantically meaningful approach to visual interpretability. Abstract: Pixel-level feature attributions are an important tool in eXplainable AI for Computer Vision (XCV), providing visual insights into how image features influence model predictions. The Owen formula for hierarchical Shapley values has been widely used to interpret machine learning (ML) models and their learned representations. However, existing hierarchical Shapley approaches do not exploit the multiscale structure of image data, leading to slow convergence and weak alignment with the actual morphological features. Moreover, no prior Shapley method has leveraged data-aware hierarchies for Computer Vision tasks, leaving a gap in model interpretability of structured visual data. To address this, this paper introduces ShapBPT, a novel data-aware XCV method based on the hierarchical Shapley formula. ShapBPT assigns Shapley coefficients to a multiscale hierarchical structure tailored for images, the Binary Partition Tree (BPT). By using this data-aware hierarchical partitioning, ShapBPT ensures that feature attributions align with intrinsic image morphology, effectively prioritizing relevant regions while reducing computational overhead. This advancement connects hierarchical Shapley methods with image data, providing a more efficient and semantically meaningful approach to visual interpretability. Experimental results confirm ShapBPT's effectiveness, demonstrating superior alignment with image structures and improved efficiency over existing XCV methods, and a 20-subject user study confirms that ShapBPT explanations are preferred by humans.

[117] Enhancing IMU-Based Online Handwriting Recognition via Contrastive Learning with Zero Inference Overhead

Jindong Li,Dario Zanca,Vincent Christlein,Tim Hamann,Jens Barth,Peter Kämpf,Björn Eskofier

Main category: cs.CV

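TL;DR: This paper proposes the ECHWR framework, which adds a temporary auxiliary branch and a dual contrastive loss (including an error-based contrastive loss) to improve feature representations and accuracy in handwriting recognition. The auxiliary branch is removed after training to keep inference efficient, and character error rates drop significantly.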

Details Motivation: Online handwriting recognition with inertial measurement units on edge hardware faces memory constraints, so recognition accuracy and generalization must improve without adding inference overhead. Method: Proposes the Error-enhanced Contrastive Handwriting Recognition (ECHWR) training framework: a temporary auxiliary branch aligns sensor signals with semantic text embeddings; a dual contrastive objective combines an in-batch contrastive loss for modality alignment with a novel error-based contrastive loss that separates correct signals from synthetic hard negatives; the auxiliary branch is discarded after training. Result: On the OnHW-Words500 dataset, ECHWR reduces character error rates by up to 7.4% on the writer-independent split and 10.4% on the writer-dependent split, significantly outperforming existing methods; ablations indicate that the error-based contrastive loss is effective for unseen writing styles. Conclusion: Without increasing the complexity of the deployed model, ECHWR's contrastive learning strategy improves the accuracy and generalization of edge handwriting recognition, which suits privacy-sensitive, low-latency settings. Abstract: Online handwriting recognition using inertial measurement units opens up handwriting on paper as input for digital devices. Doing it on edge hardware improves privacy and lowers latency, but entails memory constraints. To address this, we propose Error-enhanced Contrastive Handwriting Recognition (ECHWR), a training framework designed to improve feature representation and recognition accuracy without increasing inference costs. ECHWR utilizes a temporary auxiliary branch that aligns sensor signals with semantic text embeddings during the training phase. This alignment is maintained through a dual contrastive objective: an in-batch contrastive loss for general modality alignment and a novel error-based contrastive loss that distinguishes between correct signals and synthetic hard negatives. The auxiliary branch is discarded after training, which allows the deployed model to keep its original, efficient architecture. Evaluations on the OnHW-Words500 dataset show that ECHWR significantly outperforms state-of-the-art baselines, reducing character error rates by up to 7.4% on the writer-independent split and 10.4% on the writer-dependent split. Finally, although our ablation studies indicate that solving specific challenges requires specific architectural and objective configurations, the error-based contrastive loss shows its effectiveness for handling unseen writing styles.
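An in-batch contrastive loss for modality alignment is commonly written in the InfoNCE form: each sensor embedding should score highest against its own text embedding, with the other texts in the batch acting as negatives. The NumPy sketch below assumes that form with a one-directional loss and a fixed `temperature`; it does not cover the paper's error-based loss with synthetic hard negatives, and the exact formulation used by ECHWR may differ.

```python
import numpy as np

def in_batch_contrastive_loss(signal_emb, text_emb, temperature=0.1):
    """InfoNCE-style loss; row i of each input is assumed to be a matching pair."""
    s = signal_emb / np.linalg.norm(signal_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature                      # cosine similarities, scaled
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # matching pairs sit on the diagonal

signals = np.eye(3)
texts_aligned = np.eye(3)
texts_shuffled = np.roll(np.eye(3), 1, axis=0)  # every pair is mismatched
loss_aligned = in_batch_contrastive_loss(signals, texts_aligned)
loss_shuffled = in_batch_contrastive_loss(signals, texts_shuffled)
```

Because the loss only shapes the encoder during training, the text branch that produces `text_emb` can be dropped at deployment, which is what gives the approach zero inference overhead.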

[118] Interpreting Physics in Video World Models

Sonia Joseph,Quentin Garrido,Randall Balestriero,Matthew Kowal,Thomas Fel,Shahab Bakhtiari,Blake Richards,Mike Rabbat

Main category: cs.CV

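TL;DR: Using several interpretability methods, this paper studies how physical variables are represented inside large-scale video encoders. It finds that physical information emerges abruptly at intermediate layers (the Physics Emergence Zone) and is represented through distributed, high-dimensional geometric structure (for example, direction is encoded with circular geometry) rather than in factorized form, which indicates that these models make accurate physical predictions without emulating a classical physics engine.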

Details Motivation: To determine whether video models must rely on factorized representations of physical variables to make accurate physical predictions, or whether they can represent such variables implicitly in a task-specific, distributed manner. Method: Layerwise probing, subspace geometry analysis, patch-level decoding, and targeted attention ablations are used to analyze where physical information appears and how it is organized in video transformer encoders. Result: Across architectures there is a Physics Emergence Zone, an intermediate depth at which physical variables abruptly become accessible. Scalar quantities such as speed and acceleration are available from early layers, whereas motion direction appears only in this zone and is encoded as a high-dimensional population structure with circular geometry that requires coordinated multi-feature intervention to control. Conclusion: Modern video models do not use factorized physical representations like a classical physics engine; they use a distributed representation that is nonetheless sufficient for physical prediction. Abstract: A long-standing question in physical reasoning is whether video-based models need to rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a task-specific, distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large-scale video encoders. Using layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder-based video transformers. Across architectures, we identify a sharp intermediate-depth transition -- which we call the Physics Emergence Zone -- at which physical variables become accessible. Physics-related representations peak shortly after this transition and degrade toward the output layers. Decomposing motion into explicit variables, we find that scalar quantities such as speed and acceleration are available from early layers onwards, whereas motion direction becomes accessible only at the Physics Emergence Zone. Notably, we find that direction is encoded through a high-dimensional population structure with circular geometry, requiring coordinated multi-feature intervention to control. 
These findings suggest that modern video models do not use factorized representations of physical variables like a classical physics engine. Instead, they use a distributed representation that is nonetheless sufficient for making physical predictions.
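To make "circular geometry" concrete, here is a deliberately tiny two-dimensional toy, assuming direction is stored as a point on the unit circle. The paper reports a high-dimensional population code, so `encode_direction` and `decode_direction` are hypothetical helpers that only illustrate why an angle needs at least two coordinated features rather than one scalar.

```python
import math

def encode_direction(theta):
    """Toy circular code: represent an angle as a point on the unit circle."""
    return (math.cos(theta), math.sin(theta))

def decode_direction(code):
    """Recover the angle in [0, 2*pi) from both coordinates jointly."""
    x, y = code
    return math.atan2(y, x) % (2 * math.pi)

def code_distance(a, b):
    """Euclidean distance between two codes."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# 359 degrees and 1 degree are far apart as scalars but adjacent on the circle.
near_wraparound = code_distance(
    encode_direction(math.radians(359)), encode_direction(math.radians(1))
)
round_trip = decode_direction(encode_direction(5.0))
```

Changing only one of the two coordinates moves the point off the circle instead of rotating it, which is a small-scale analogue of the coordinated multi-feature intervention described in the abstract.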

[119] Neural Sentinel: Unified Vision Language Model (VLM) for License Plate Recognition with Human-in-the-Loop Continual Learning

Karthik Sivakoti

Main category: cs.CV

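TL;DR: This paper proposes Neural Sentinel, a unified ALPR approach based on a vision language model (VLM) that performs license plate recognition, state classification, and vehicle attribute extraction in a single forward pass, improving accuracy, reducing latency and complexity, and supporting zero-shot generalization to auxiliary tasks.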

Details Motivation: Traditional multi-stage ALPR pipelines suffer from compounding errors, high latency, and architectural complexity, which calls for a more efficient and robust unified solution. Method: A PaliGemma 3B model is fine-tuned with LoRA; a Human-in-the-Loop (HITL) continual learning framework uses experience replay to prevent catastrophic forgetting; the VLM's ability to answer multiple visual questions enables unified inference. Result: Plate recognition accuracy reaches 92.3%, an improvement of 14.1% over EasyOCR and 9.9% over PaddleOCR; mean inference latency is 152 ms with an ECE of 0.048; zero-shot results cover vehicle color (89%), seatbelt detection (82%), and occupancy counting (78%). Conclusion: A VLM-driven unified architecture represents a paradigm shift for ALPR, combining higher accuracy, lower complexity, and emergent multi-task capabilities compared with traditional pipelines. Abstract: Traditional Automatic License Plate Recognition (ALPR) systems employ multi-stage pipelines consisting of object detection networks followed by separate Optical Character Recognition (OCR) modules, introducing compounding errors, increased latency, and architectural complexity. This research presents Neural Sentinel, a novel unified approach that leverages Vision Language Models (VLMs) to perform license plate recognition, state classification, and vehicle attribute extraction through a single forward pass. Our primary contribution lies in demonstrating that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images, achieving 92.3% plate recognition accuracy, which is a 14.1% improvement over EasyOCR and 9.9% improvement over PaddleOCR baselines. We introduce a Human-in-the-Loop (HITL) continual learning framework that incorporates user corrections while preventing catastrophic forgetting through experience replay, maintaining a 70:30 ratio of original training data to correction samples. The system achieves a mean inference latency of 152 ms with an Expected Calibration Error (ECE) of 0.048, indicating well-calibrated confidence estimates. Additionally, the VLM-first architecture enables zero-shot generalization to auxiliary tasks including vehicle color detection (89%), seatbelt detection (82%), and occupancy counting (78%) without task-specific training. 
Through extensive experimentation on real world toll plaza imagery, we demonstrate that unified vision language approaches represent a paradigm shift in ALPR systems, offering superior accuracy, reduced architectural complexity, and emergent multi-task capabilities that traditional pipeline approaches cannot achieve.
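The 70:30 experience-replay mix can be sketched as a batch builder. The abstract does not say whether the ratio is enforced per batch or over the whole fine-tuning set, so the per-batch interpretation, the function name `build_replay_batch`, and sampling with replacement are assumptions made for this illustration.

```python
import random

def build_replay_batch(original, corrections, batch_size,
                       original_share=0.7, seed=0):
    """Mix original samples with user corrections at a fixed ratio.

    Assumption: the 70:30 ratio is applied per batch. Sampling is with
    replacement so a small pool of corrections can still fill its share.
    """
    rng = random.Random(seed)
    n_original = round(batch_size * original_share)
    n_corrections = batch_size - n_original
    batch = (rng.choices(original, k=n_original)
             + rng.choices(corrections, k=n_corrections))
    rng.shuffle(batch)
    return batch

original_pool = [("orig", i) for i in range(100)]    # hypothetical data
correction_pool = [("fix", i) for i in range(5)]     # hypothetical data
batch = build_replay_batch(original_pool, correction_pool, batch_size=10)
n_orig = sum(1 for tag, _ in batch if tag == "orig")
n_fix = sum(1 for tag, _ in batch if tag == "fix")
```

Replaying original data alongside corrections is what keeps the model from overwriting earlier behavior while it absorbs new feedback.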

[120] Toward Accurate and Accessible Markerless Neuronavigation

Ziye Xie,Oded Schlesinger,Raj Kundu,Jessica Y. Choi,Pablo Iturralde,Dennis A. Turner,Stefan M. Goetz,Guillermo Sapiro,Angel V. Peterchev,J. Matias Di Martino

Main category: cs.CV

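TL;DR: This paper proposes a markerless neuronavigation method that needs no physical markers. It combines low-cost visible and infrared stereo/depth cameras with algorithmic modeling of facial geometry to track head position accurately in real time, reaching median errors of 2.32 mm and 2.01° across 50 subjects, outperforming earlier markerless approaches and suiting clinical uses such as transcranial magnetic stimulation.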

Details Motivation: Traditional neuronavigation depends on subject-mounted markers that can shift, require manual registration, and cause discomfort, so a more comfortable, stable, and low-cost alternative is needed. Method: Visible and infrared stereo and depth cameras are combined with 3D facial geometry modeling to track head pose in real time without markers, and accuracy is validated against a conventional marker-based system. Result: Across 50 subjects, the best markerless algorithms reach median discrepancies of 2.32 mm (position) and 2.01° (angle) relative to the marker-based system, a substantial improvement over prior markerless methods; fusing data from multiple sensors may improve accuracy further. Conclusion: The proposed markerless neuronavigation methods can reduce setup cost and complexity, improve patient comfort, and widen access to neuronavigation in clinical and research settings. Abstract: Neuronavigation is widely used in biomedical research and interventions to guide the precise placement of instruments around the head to support procedures such as transcranial magnetic stimulation. Traditional systems, however, rely on subject-mounted markers that require manual registration, may shift during procedures, and can cause discomfort. We introduce and evaluate markerless approaches that replace expensive hardware and physical markers with low-cost visible and infrared light cameras incorporating stereo and depth sensing combined with algorithmic modeling of the facial geometry. Validation with 50 human subjects yielded a median tracking discrepancy of only 2.32 mm and 2.01° for the best markerless algorithms compared to a conventional marker-based system, which indicates sufficient accuracy for transcranial magnetic stimulation and a substantial improvement over prior markerless results. The results suggest that integration of the data from the various camera sensors can improve the overall accuracy further. The proposed markerless neuronavigation methods can reduce setup cost and complexity, improve patient comfort, and expand access to neuronavigation in clinical and research settings.

[121] RECITYGEN -- Interactive and Generative Participatory Urban Design Tool with Latent Diffusion and Segment Anything

Di Mo,Mingyang Sun,Chengxiu Yin,Runjia Tian,Yanhong Wu,Liyan Xu

Main category: cs.CV

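TL;DR: This paper proposes RECITYGEN, a tool that combines latent diffusion models with interactive semantic segmentation so that members of the public can interactively generate urban street view images from text prompts, making urban design more participatory and inclusive.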

Details Motivation: Traditional top-down urban design often overlooks public input, leaving a gap between design aspirations and reality; digital tools have increased participation, but generative tools still lack intuitive, low-barrier ways for the public to interact. Method: State-of-the-art latent diffusion models are combined with interactive semantic segmentation to build RECITYGEN, which lets users generate and edit street view images with natural language prompts; the tool was applied with users in an urban regeneration pilot in Beijing. Result: In the pilot, the public used RECITYGEN to propose street improvements, and the generated results aligned well with public preferences, supporting the tool's feasibility and potential for strengthening participation and design responsiveness. Conclusion: RECITYGEN points to a shift toward more dynamic and inclusive urban planning and offers a scalable technical path for deeper public involvement in urban design. Abstract: Urban design profoundly impacts public spaces and community engagement. Traditional top-down methods often overlook public input, creating a gap between design aspirations and reality. Recent advancements in digital tools, like City Information Modelling and augmented reality, have enabled a more participatory process involving more stakeholders in urban design. Further, deep learning and latent diffusion models have lowered barriers for design generation, providing even more opportunities for participatory urban design. Combining state-of-the-art latent diffusion models with interactive semantic segmentation, we propose RECITYGEN, a novel tool that allows users to interactively create variational street view images of urban environments using text prompts. In a pilot project in Beijing, users employed RECITYGEN to suggest improvements for an ongoing Urban Regeneration project. Despite some limitations, RECITYGEN has shown significant potential in aligning with public preferences, indicating a shift towards more dynamic and inclusive urban planning methods. The source code for the project can be found at RECITYGEN GitHub.

[122] FADE: Selective Forgetting via Sparse LoRA and Self-Distillation

Carolina R. Kelsch,Leonardo S. B. Pereira,Natnael Mola,Luis H. Arribas,Juan C. S. M. Avedillo

Main category: cs.CV

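TL;DR: FADE is a fast two-stage unlearning method for text-to-image diffusion models that combines parameter localization with self-distillation to achieve efficient, reversible, and lightweight concept erasure.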

Details Motivation: Data protection regulations and responsible AI practices require models to forget specific data or concepts, yet unlearning in text-to-image diffusion models is computationally expensive and struggles to balance forgetting against retention. Method: FADE (Fast Adapter for Data Erasure) works in two stages. The first stage locates the most relevant parameters with gradient-based saliency and applies lightweight, localized updates through sparse LoRA adapters. The second stage uses a self-distillation objective that overwrites the concept to be forgotten with a user-defined surrogate while preserving behavior on retained data. The adapters can be merged or removed at runtime. Result: FADE achieves state-of-the-art performance on the UnlearnCanvas benchmark and in ablations on several datasets (Imagenette, LFW, Dog Breeds, SUN Attributes), showing strong concept erasure, high retention, and fine-grained control over the forgetting-retention trade-off. Conclusion: FADE offers an efficient, flexible, and deployable approach to targeted unlearning in diffusion models, suitable for selective unlearning in production settings. Abstract: Machine Unlearning aims to remove the influence of specific data or concepts from trained models while preserving overall performance, a capability increasingly required by data protection regulations and responsible AI practices. Despite recent progress, unlearning in text-to-image diffusion models remains challenging due to high computational costs and the difficulty of balancing effective forgetting with retention of unrelated concepts. We introduce FADE (Fast Adapter for Data Erasure), a two-stage unlearning method for image generation that combines parameter localization with self-distillation. FADE first identifies parameters most responsible for the forget set using gradient-based saliency and constrains updates through sparse LoRA adapters, ensuring lightweight, localized modifications. In a second stage, FADE applies a self-distillation objective that overwrites the forgotten concept with a user-defined surrogate while preserving behavior on retained data. The resulting adapters are memory-efficient, reversible, and can be merged or removed at runtime, enabling flexible deployment in production systems. We evaluated FADE on the UnlearnCanvas benchmark and conducted ablation studies on Imagenette, Labeled Faces in the Wild, AtharvaTaras Dog Breeds Dataset, and SUN Attributes datasets, demonstrating state-of-the-art unlearning performance with fine-grained control over the forgetting-retention trade-off. 
Our results demonstrate that FADE achieves strong concept erasure and high retainability across various domains, making it a suitable solution for selective unlearning in diffusion-based image generation models.
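The first stage, selecting parameters by gradient-based saliency and restricting updates to them, can be sketched with a top-k mask. The abstract does not specify the saliency score or the sparsity level, so the absolute-gradient score, the `keep_fraction` parameter, and the flat parameter list are assumptions for illustration only.

```python
def top_k_saliency_mask(gradients, keep_fraction=0.1):
    """Return a 0/1 mask marking the parameters with the largest |gradient|.

    Assumption: saliency is the absolute gradient on the forget set and a
    fixed fraction of parameters is kept; the paper may use another rule.
    """
    k = max(1, int(len(gradients) * keep_fraction))
    ranked = sorted(range(len(gradients)),
                    key=lambda i: abs(gradients[i]), reverse=True)
    mask = [0] * len(gradients)
    for i in ranked[:k]:
        mask[i] = 1
    return mask

def apply_sparse_update(params, update, mask):
    """Apply an update only where the mask is 1, leaving other weights intact."""
    return [p + u * m for p, u, m in zip(params, update, mask)]

forget_grads = [0.1, -2.0, 0.05, 1.5, 0.0]          # hypothetical gradients
mask = top_k_saliency_mask(forget_grads, keep_fraction=0.4)
new_params = apply_sparse_update([1.0] * 5, [0.5] * 5, mask)
```

Because unmasked weights never change, removing the update restores the original model, which mirrors the reversibility the abstract attributes to the adapters.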

[123] From Images to Decisions: Assistive Computer Vision for Non-Metallic Content Estimation in Scrap Metal

Daniil Storonkin,Ilia Dziub,Maksim Golyadkin,Ilya Makarov

Main category: cs.CV

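TL;DR: This paper proposes an assistive computer vision pipeline that automatically estimates the percentage of non-metallic inclusions (contamination) in scrap metal during railcar unloading and classifies the scrap type. It uses multi-instance learning (MIL) and multi-task learning (MTL), runs in near real time within the acceptance workflow, and improves objectivity, operator safety, and process integration.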

Details Motivation: Judging non-metallic inclusion content in scrap by eye is subjective and exposes inspectors to hazards from dust and moving machinery, so an automated, objective, and safe assessment method is needed. Method: Contamination assessment is modeled as a railcar-level regression task; multi-instance learning (MIL) handles the temporal image sequence, and multi-task learning (MTL) jointly estimates contamination and classifies scrap type. The system includes magnet/railcar detection, a versioned inference service, structured operator review, and an active-learning feedback loop. Result: The MIL model reaches MAE 0.27 and R² 0.83; the MTL model reaches MAE 0.36 with F1 0.79 for scrap classification. The system runs in near real time within the acceptance workflow and provides railcar-level predictions with confidence scores and operator overrides. Conclusion: The pipeline reduces subjective variability, improves on-site safety, and can be integrated into downstream workflows such as scrap acceptance and melt planning. Abstract: Scrap quality directly affects energy use, emissions, and safety in steelmaking. Today, the share of non-metallic inclusions (contamination) is judged visually by inspectors - an approach that is subjective and hazardous due to dust and moving machinery. We present an assistive computer vision pipeline that estimates contamination (in percent) from images captured during railcar unloading and also classifies scrap type. The method formulates contamination assessment as a regression task at the railcar level and leverages sequential data through multi-instance learning (MIL) and multi-task learning (MTL). Best results include MAE 0.27 and R² 0.83 by MIL; and an MTL setup reaches MAE 0.36 with F1 0.79 for scrap class. We also present the system running in near real time within the acceptance workflow: magnet/railcar detection segments temporal layers, a versioned inference service produces railcar-level estimates with confidence scores, and results are reviewed by operators with structured overrides; corrections and uncertain cases feed an active-learning loop for continual improvement. The pipeline reduces subjective variability, improves human safety, and enables integration into acceptance and melt-planning workflows.
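The two regression metrics reported above, MAE and R², follow standard definitions and can be computed as below. The contamination values in the example are hypothetical and are not taken from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical railcar-level contamination percentages.
true_pct = [1.0, 2.0, 3.0, 4.0]
pred_pct = [1.5, 2.0, 2.5, 4.0]
mae = mean_absolute_error(true_pct, pred_pct)
r2 = r_squared(true_pct, pred_pct)
```

MAE is expressed in the same unit as the target (percentage points of contamination here), while R² is unitless and measures the share of variance explained.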

[124] Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine

Minghao Han,Dingkang Yang,Yue Jiang,Yizhou Liu,Lihua Zhang

Main category: cs.CV

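TL;DR: OmniFysics is a compact omni-modal model that injects explicit physical knowledge through a physical data engine (FysicsAny and FysicsOmniCap), improving physical understanding across images, audio, video, and text while also supporting speech and image generation.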

Details Motivation: Physical understanding in existing omni-modal models is brittle because physical attributes are visually ambiguous and sparsely represented in web-scale data. Method: A physical data engine is built with two components: FysicsAny generates physics-grounded instruction-image pairs through hierarchical retrieval followed by physics-law-constrained verification and caption rewriting, and FysicsOmniCap distills web videos with audio-visual consistency filtering to produce high-quality video-instruction pairs. Training uses staged multimodal alignment and instruction tuning, latent-space flow matching for image generation, and an intent router. Result: The model is competitive on standard multimodal benchmarks and improves on physics-oriented evaluations. Conclusion: Explicitly injecting physical knowledge makes physical understanding in omni-modal models more robust, and OmniFysics offers a feasible path toward multimodal AI with physical common sense. Abstract: Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction–image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law-constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio–visual consistency filtering to generate high-fidelity video–instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.

[125] Contactless estimation of continuum displacement and mechanical compressibility from image series using a deep learning based framework

A. N. Maria Antony,T. Richter,E. Gladilin

Main category: cs.CV

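TL;DR: This paper proposes a deep learning based end-to-end approach for efficiently and accurately estimating continuum displacement and material compressibility from image series, overcoming the inefficiency of conventional iterative algorithms (such as FEM/FDM) in contactless assessment of mechanical properties.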

Details Motivation: Conventional contactless assessment of mechanical properties (non-rigid image registration and constitutive modelling based on FEM/FDM) is time-consuming and unsuitable for high-throughput data processing, so a more efficient and accurate method is needed. Method: Two deep neural networks are used, one for image registration to estimate the displacement field and one to estimate material compressibility directly from the image series; the end-to-end framework draws on higher-order features (such as the vorticity of the vector field) rather than only local displacement features. Result: The framework outperforms conventional approaches in both efficiency and accuracy, and it estimates material compressibility accurately even when the registration result deviates substantially from the reference displacement field locally; the authors attribute this to the model's ability to assess higher-order features such as vorticity. Conclusion: End-to-end deep learning models offer an efficient, robust, and accurate approach to contactless, non-invasive estimation of mechanical parameters, particularly in engineering and biomedical settings where direct measurement is not possible. Abstract: Contactless and non-invasive estimation of mechanical properties of physical media from optical observations is of interest for manifold engineering and biomedical applications, where direct physical measurements are not possible. Conventional approaches to the assessment of image displacement and non-contact material probing typically rely on time-consuming iterative algorithms for non-rigid image registration and constitutive modelling using discretization and iterative numerical solving techniques, such as Finite Element Method (FEM) and Finite Difference Method (FDM), which are not suitable for high-throughput data processing. Here, we present an efficient deep learning based end-to-end approach for the estimation of continuum displacement and material compressibility directly from the image series. Based on two deep neural networks for image registration and material compressibility estimation, this framework outperforms conventional approaches in terms of efficiency and accuracy. In particular, our experimental results show that the deep learning model trained on a set of reference data can accurately determine the material compressibility even in the presence of substantial local deviations of the mapping predicted by image registration from the reference displacement field. Our findings suggest that the remarkable accuracy of the deep learning end-to-end model originates from its ability to assess higher-order cognitive features, such as the vorticity of the vector field, rather than conventional local features of the image displacement.

[126] Bidirectional Reward-Guided Diffusion for Real-World Image Super-Resolution

Zihao Fan,Xin Lu,Yidi Liu,Jie Huang,Dong Li,Xueyang Fu,Zheng-Jun Zha

Main category: cs.CV

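TL;DR: Bird-SR is a bidirectional reward-guided diffusion framework for super-resolution. Through reward feedback learning (ReFL) it jointly uses synthetic and real low-resolution images and dynamically balances structural fidelity against perceptual enhancement, substantially improving real-world super-resolution.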

Details Motivation: Diffusion-based super-resolution methods trained on synthetic data perform poorly on real low-resolution images because of distribution shift. Method: Bird-SR (1) optimizes directly on synthetic pairs at early diffusion steps to protect structural fidelity; (2) applies quality-guided rewards at later steps to improve perceptual quality, using relative-advantage rewards for synthetic results and a semantic alignment constraint for real images to prevent reward hacking; and (3) adopts a dynamic fidelity-perception weighting strategy. Result: On multiple real-world super-resolution benchmarks, Bird-SR surpasses state-of-the-art methods in perceptual quality while preserving structural consistency. Conclusion: Through bidirectional reward guidance and staged optimization, Bird-SR mitigates the synthetic-to-real domain shift and offers a new approach to real-world super-resolution. Abstract: Diffusion-based super-resolution can synthesize rich details, but models trained on synthetic paired data often fail on real-world LR images due to distribution shifts. We propose Bird-SR, a bidirectional reward-guided diffusion framework that formulates super-resolution as trajectory-level preference optimization via reward feedback learning (ReFL), jointly leveraging synthetic LR-HR pairs and real-world LR images. Because structural fidelity is easily affected in ReFL, the model is directly optimized on synthetic pairs at early diffusion steps, which also facilitates structure preservation for real-world inputs, where the distribution gap is smaller at the structural level. For perceptual enhancement, quality-guided rewards are applied at later sampling steps to both synthetic and real LR images. To mitigate reward hacking, the rewards for synthetic results are formulated in a relative advantage space bounded by their clean counterparts, while real-world optimization is regularized via a semantic alignment constraint. Furthermore, to balance structural and perceptual learning, we adopt a dynamic fidelity-perception weighting strategy that emphasizes structure preservation at early stages and progressively shifts focus toward perceptual optimization at later diffusion steps. Extensive experiments on real-world SR benchmarks demonstrate that Bird-SR consistently outperforms state-of-the-art methods in perceptual quality while preserving structural consistency, validating its effectiveness for real-world super-resolution.
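The dynamic fidelity-perception weighting can be pictured as a schedule that starts fully on the fidelity term and ends fully on the perceptual term. The abstract does not give the schedule's shape, so the linear interpolation, the function names, and the step indexing (0 = earliest step) are assumptions for illustration.

```python
def fidelity_perception_weights(step, total_steps):
    """Return (w_fidelity, w_perception) for a given diffusion step.

    Assumption: a linear schedule where step 0 is the earliest step.
    Early steps favor structure; later steps favor perceptual quality.
    """
    if total_steps <= 1:
        return (1.0, 0.0)
    progress = step / (total_steps - 1)
    return (1.0 - progress, progress)

def combined_loss(fidelity_loss, perception_loss, step, total_steps):
    """Blend the two objectives with the step-dependent weights."""
    w_fid, w_per = fidelity_perception_weights(step, total_steps)
    return w_fid * fidelity_loss + w_per * perception_loss

first = fidelity_perception_weights(0, 10)
last = fidelity_perception_weights(9, 10)
midpoint_loss = combined_loss(2.0, 4.0, step=2, total_steps=5)
```

Any monotone schedule (cosine, sigmoid, piecewise) would express the same idea; the linear form is used here only because it is the simplest to verify.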

[127] MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation

Haoming Wang,Qiyao Xue,Weichen Liu,Wei Gao

Main category: cs.CV

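TL;DR: This paper proposes MosaicThinker, a new inference-time computing technique that strengthens the cross-frame spatial reasoning of small on-device vision language models (VLMs) by building a global semantic map and guiding reasoning over it with a visual prompt.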

Details Motivation: Existing vision language models lack 3D spatial knowledge and struggle with complex spatial reasoning across multiple video frames, while embodied AI increasingly needs spatial reasoning over video input. Method: MosaicThinker integrates spatial information from multiple frames into a unified global semantic map and uses a visual prompt to guide a small VLM's spatial reasoning over that map. Result: Experiments show that the method substantially improves the accuracy of resource-constrained embodied AI devices on cross-frame spatial reasoning tasks of varying type and complexity. Conclusion: MosaicThinker compensates for the weaknesses of small VLMs in 3D spatial perception and cross-frame reasoning, offering a practical way to enhance spatial reasoning for on-device embodied AI. Abstract: As embodied AI expands from traditional object detection and recognition to more advanced tasks of robot manipulation and actuation planning, visual spatial reasoning from the video inputs is necessary to perceive the spatial relationships of objects and guide device actions. However, existing visual language models (VLMs) have very weak capabilities in spatial reasoning due to the lack of knowledge about 3D spatial information, especially when the reasoning task involves complex spatial relations across multiple video frames. In this paper, we present a new inference-time computing technique for on-device embodied AI, namely MosaicThinker, which enhances the on-device small VLM's spatial reasoning capabilities on difficult cross-frame reasoning tasks. Our basic idea is to integrate fragmented spatial information from multiple frames into a unified space representation of global semantic map, and further guide the VLM's spatial reasoning over the semantic map via a visual prompt. Experimental results show that our technique can greatly enhance the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, over reasoning tasks with diverse types and complexities.

[128] WorldEdit: Towards Open-World Image Editing with a Knowledge-Informed Benchmark

Wang Lin,Feng Wang,Majun Zhang,Wentao Hu,Tao Jin,Zhou Zhao,Fei Wu,Jingyuan Chen,Alan Yuille,Sucheng Ren

Main category: cs.CV

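TL;DR: This paper proposes the WorldEdit dataset and a two-stage training framework to improve how image editing models understand and execute implicit, causal editing instructions, markedly improving knowledge plausibility and instruction following over existing models.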

Details Motivation: Existing image editing models struggle with implicit editing instructions (which describe the cause of a visual change without stating the outcome) because they rely on uniform editing strategies and lack the necessary world knowledge and causal reasoning. Method: The authors build WorldEdit, a high-quality causally driven image editing dataset, and the WorldEdit-Test benchmark, and propose a two-stage fine-tuning framework (demonstrated on Bagel) that integrates a causal verification reward. Result: The method significantly narrows the gap with GPT-4o and Nano-Banana on causal editing tasks and is competitive in both instruction following and knowledge plausibility. Conclusion: The WorldEdit dataset and causally aware training improve models' understanding and generation for implicit, causal editing intents, opening a path toward world-knowledge-driven image editing. Abstract: Recent advances in image editing models have demonstrated remarkable capabilities in executing explicit instructions, such as attribute manipulation, style transfer, and pose synthesis. However, these models often face challenges when dealing with implicit editing instructions, which describe the cause of a visual change without explicitly detailing the resulting outcome. These limitations arise because existing models rely on uniform editing strategies that are not equipped to handle the complex world knowledge and reasoning required for implicit instructions. To address this gap, we introduce WorldEdit, a dataset specifically designed to enable world-driven image editing. WorldEdit consists of high-quality editing samples, guided by paraphrased instructions that align with real-world causal logic. Furthermore, we provide WorldEdit-Test for evaluating existing models' performance on causal editing scenarios. With WorldEdit, we use a two-stage training framework for fine-tuning models like Bagel, integrating with a causal verification reward. Our results show that the proposed dataset and methods significantly narrow the gap with GPT-4o and Nano-Banana, demonstrating competitive performance not only in instruction following but also in knowledge plausibility, where many open-source systems typically struggle.

[129] TLC-Plan: A Two-Level Codebook Based Network for End-to-End Vector Floorplan Generation

Biao Xiong,Zhen Peng,Ping Wang,Qiegen Liu,Xian Zhong

Main category: cs.CV

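TL;DR: This paper proposes TLC-Plan, an end-to-end vector floorplan generation model based on a hierarchical VQ-VAE and a CodeTree representation. It generates topologically valid, diverse vector floorplans directly from the boundary, avoiding the structural inconsistencies of raster-based methods.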

Details Motivation: Existing floorplan generation methods operate in raster space and rely on post hoc vectorization, which causes structural inconsistencies and hinders end-to-end learning; inspired by modular, reusable human architectural workflows, the authors seek constraint-aware vector generation that supports compositional spatial reasoning. Method: TLC-Plan uses a two-level VQ-VAE to encode global, semantically labeled room bounding boxes and local polygon geometry; a CodeTree unifies the hierarchy; and an autoregressive transformer samples codes conditioned on the input boundary, without explicit topology or dimensional priors. Result: The model achieves state-of-the-art results on the RPLAN dataset (FID = 1.84, MSE = 2.06) and leading results on the LIFULL dataset, generating diverse, topologically valid, boundary-aligned vector floorplans. Conclusion: TLC-Plan delivers end-to-end, constraint-aware, scalable vector floorplan generation for real-world architectural applications; code and models are released. Abstract: Automated floorplan generation aims to improve design quality, architectural efficiency, and sustainability by jointly modeling global spatial organization and precise geometric detail. However, existing approaches operate in raster space and rely on post hoc vectorization, which introduces structural inconsistencies and hinders end-to-end learning. Motivated by compositional spatial reasoning, we propose TLC-Plan, a hierarchical generative model that directly synthesizes vector floorplans from input boundaries, aligning with human architectural workflows based on modular and reusable patterns. TLC-Plan employs a two-level VQ-VAE to encode global layouts as semantically labeled room bounding boxes and to refine local geometries using polygon-level codes. This hierarchy is unified in a CodeTree representation, while an autoregressive transformer samples codes conditioned on the boundary to generate diverse and topologically valid designs, without requiring explicit room topology or dimensional priors. Extensive experiments show state-of-the-art performance on the RPLAN dataset (FID = 1.84, MSE = 2.06) and leading results on the LIFULL dataset. The proposed framework advances constraint-aware and scalable vector floorplan generation for real-world architectural applications. Source code and trained models are released at https://github.com/rosolose/TLC-PLAN.
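At the core of any VQ-VAE is a nearest-neighbor lookup that maps a continuous encoder output to a discrete code. The sketch below shows that single quantization step with a toy codebook; how TLC-Plan structures its two codebooks and its CodeTree is not described in the abstract, so `quantize` and the example values are hypothetical.

```python
def quantize(vector, codebook):
    """Map a continuous vector to the index and entry of its nearest code.

    This is the generic vector-quantization step of a VQ-VAE, shown with
    squared Euclidean distance; it is not the paper's specific design.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]

toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]  # hypothetical codes
code_index, code_vector = quantize([0.9, 1.2], toy_codebook)
```

The resulting integer indices are what an autoregressive transformer can sample one at a time, which is why quantization makes the generation problem discrete.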

[130] Zero-Shot UAV Navigation in Forests via Relightable 3D Gaussian Splatting

Zinan Lv,Yeqian Qian,Chen Sang,Hao Liu,Danping Zou,Ming Yang

Main category: cs.CV

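TL;DR: This paper proposes a monocular-vision UAV navigation framework based on relightable 3D Gaussian Splatting and end-to-end reinforcement learning. It transfers zero-shot from high-fidelity simulation to complex outdoor environments such as forests, with strong illumination invariance and collision-free flight at up to 10 m/s.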

Details Motivation: UAVs that navigate unstructured outdoor environments with monocular vision face a large visual domain gap between simulation and reality; existing 3D Gaussian Splatting methods couple lighting with geometry, which makes it hard to adapt to changing real-world illumination and limits policy generalization. Method: Relightable 3D Gaussian Splatting decomposes scene components so that lighting can be edited in a physically grounded way. In a high-fidelity simulation built from real-world data, an end-to-end reinforcement learning policy maps monocular RGB input directly to continuous control commands, with training augmented by diverse synthesized lighting conditions (from strong directional sunlight to diffuse overcast skies). Result: A lightweight quadrotor achieves robust, collision-free navigation in real forest environments at up to 10 m/s, with significant resilience to drastic lighting changes and no fine-tuning. Conclusion: Combining decoupled lighting modeling with reinforcement learning is a key route to better cross-domain generalization in monocular visual navigation, and the method offers an efficient, practical solution for autonomous navigation in complex outdoor scenes. Abstract: UAV navigation in unstructured outdoor environments using passive monocular vision is hindered by the substantial visual domain gap between simulation and reality. While 3D Gaussian Splatting enables photorealistic scene reconstruction from real-world data, existing methods inherently couple static lighting with geometry, severely limiting policy generalization to dynamic real-world illumination. In this paper, we propose a novel end-to-end reinforcement learning framework designed for effective zero-shot transfer to unstructured outdoors. Within a high-fidelity simulation grounded in real-world data, our policy is trained to map raw monocular RGB observations directly to continuous control commands. To overcome photometric limitations, we introduce Relightable 3D Gaussian Splatting, which decomposes scene components to enable explicit, physically grounded editing of environmental lighting within the neural representation. By augmenting training with diverse synthesized lighting conditions ranging from strong directional sunlight to diffuse overcast skies, we compel the policy to learn robust, illumination-invariant visual features. Extensive real-world experiments demonstrate that a lightweight quadrotor achieves robust, collision-free navigation in complex forest environments at speeds up to 10 m/s, exhibiting significant resilience to drastic lighting variations without fine-tuning.

[131] Extended to Reality: Prompt Injection in 3D Environments

Zhuoheng Li,Ying Chen

Main category: cs.CV

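TL;DR: This paper proposes PI3D, a prompt injection attack on multimodal large language models (MLLMs) in 3D physical environments that places text-bearing physical objects in real scenes to disrupt model reasoning, and shows that existing defenses are insufficient against it.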

Details Motivation: Prior work on prompt injection focuses on pure text or digitally edited 2D images, while attack paths in real 3D physical environments remain unclear, so new security threats to MLLMs in embodied settings need to be explored. Method: The PI3D attack framework formulates and solves for an effective pose (position and orientation) of a text-bearing physical object in 3D space, so that the object induces the injected task while its placement remains physically plausible; effectiveness is evaluated across multiple models and camera trajectories. Result: PI3D succeeds against multiple MLLMs and remains effective under diverse camera trajectories; the existing defenses that were evaluated do not adequately mitigate the attack. Conclusion: Prompt injection in 3D physical environments is a practical and dangerous attack paradigm that exposes a key security weakness of MLLMs in embodied applications and calls for new defenses designed for the 3D physical world. Abstract: Multimodal large language models (MLLMs) have advanced the capabilities to interpret and act on visual input in 3D environments, empowering diverse applications such as robotics and situated conversational agents. When MLLMs reason over camera-captured views of the physical world, a new attack surface emerges: an attacker can place text-bearing physical objects in the environment to override MLLMs' intended task. While prior work has studied prompt injection in the text domain and through digitally edited 2D images, it remains unclear how these attacks function in 3D physical environments. To bridge the gap, we introduce PI3D, a prompt injection attack against MLLMs in 3D environments, realized through text-bearing physical object placement rather than digital image edits. We formulate and solve the problem of identifying an effective 3D object pose (position and orientation) with injected text, where the attacker's goal is to induce the MLLM to perform the injected task while ensuring that the object placement remains physically plausible. Experiments demonstrate that PI3D is an effective attack against multiple MLLMs under diverse camera trajectories. We further evaluate existing defenses and show that they are insufficient to defend against PI3D.

[132] Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Haoyu Zhang,Zhipeng Li,Yiwen Guo,Tianshu Yu

Main category: cs.CV

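TL;DR: This paper proposes the Ex-Omni framework, which decouples semantic reasoning from temporal generation and uses speech units as temporal scaffolding together with a unified token-as-query gated fusion mechanism, enabling large language models to model speech-driven 3D facial animation efficiently. It also releases the InstructEx dataset to support this task.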

Details Motivation: Existing omni-modal large language models (OLLMs) have not effectively integrated speech with 3D facial animation, mainly because discrete token-level semantic modeling is mismatched with dense, continuous facial motion, and because training data are limited. Method: The Ex-Omni framework (1) decouples semantic reasoning from temporal generation; (2) uses speech units as temporal scaffolding; (3) introduces a token-as-query gated fusion (TQGF) mechanism for controlled semantic injection; and (4) is accompanied by the InstructEx dataset. Result: Ex-Omni performs competitively against existing open-source OLLMs across experiments and stably generates aligned speech and 3D facial animation. Conclusion: Ex-Omni offers a feasible, efficient, open-source way to extend OLLMs with joint speech and 3D facial animation modeling, advancing natural human-computer interaction. Abstract: Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset that aims to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable aligned speech and facial animation generation.

[133] Privacy in Image Datasets: A Case Study on Pregnancy Ultrasounds

Rawisara Lohanimit,Yankun Wu,Amelia Katirai,Yuta Nakashima,Noa Garcia

Main category: cs.CV

TL;DR: This paper investigates the presence of sensitive pregnancy ultrasound images in the LAION-400M dataset using CLIP embeddings, uncovering thousands of instances containing private identifiers like names and locations, highlighting privacy risks and recommending improved curation and ethical practices.

Details Motivation: The increasing use of large-scale, minimally curated internet-collected datasets for training generative models raises serious concerns about inclusion of sensitive or private personal information, such as pregnancy ultrasound images. Method: Systematic examination of the LAION-400M dataset using CLIP embedding similarity to retrieve and identify pregnancy ultrasound images, followed by detection of private information (e.g., names, locations) within those images. Result: Thousands of pregnancy ultrasound images were retrieved; many contained high-risk private information, including names and locations, that could enable re-identification or impersonation. Conclusion: Public image datasets like LAION-400M pose significant privacy risks when sensitive medical images are present; the paper recommends improved dataset curation, stronger data privacy safeguards, and ethical guidelines for using public image datasets. Abstract: The rise of generative models has led to increased use of large-scale datasets collected from the internet, often with minimal or no data curation. This raises concerns about the inclusion of sensitive or private information. In this work, we explore the presence of pregnancy ultrasound images, which contain sensitive personal information and are often shared online. Through a systematic examination of the LAION-400M dataset using CLIP embedding similarity, we retrieve images containing pregnancy ultrasound and detect thousands of entities of private information such as names and locations. Our findings reveal that multiple images have high-risk information that could enable re-identification or impersonation. We conclude with recommended practices for dataset curation, data privacy, and ethical use of public image datasets.
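Retrieval by embedding similarity, as described above, amounts to ranking items by cosine similarity to a query embedding and keeping those above a threshold. The sketch below uses toy 2-D vectors in place of real CLIP embeddings; the threshold value, the `retrieve` helper, and the item names are assumptions, and the paper's actual retrieval procedure may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, embeddings, threshold=0.8):
    """Return item keys whose similarity to the query meets the threshold,
    ordered from most to least similar."""
    scored = [(cosine_similarity(query, vec), key)
              for key, vec in embeddings.items()]
    hits = [(score, key) for score, key in scored if score >= threshold]
    return [key for score, key in sorted(hits, reverse=True)]

# Toy stand-ins for image embeddings (real CLIP vectors have hundreds of dims).
toy_embeddings = {"img_a": [0.9, 0.1], "img_b": [0.0, 1.0], "img_c": [1.0, 0.0]}
matches = retrieve([1.0, 0.0], toy_embeddings)
```

In practice the threshold trades recall against manual review effort: a lower value surfaces more candidate images but also more false positives to inspect.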

[134] DuMeta++: Spatiotemporal Dual Meta-Learning for Generalizable Few-Shot Brain Tissue Segmentation Across Diverse Ages

Yongheng Sun,Jun Shu,Jianhua Ma,Fan Wang

Main category: cs.CV

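TL;DR: This paper proposes DuMeta++, a dual meta-learning framework that requires no paired longitudinal data, to address cross-age generalization in MRI brain tissue segmentation. It improves robustness and adaptability through meta-feature learning, meta-initialization learning, and memory-bank-driven class-aware regularization.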

Details Motivation: Existing methods rely on paired longitudinal data for self-supervised regularization to mitigate the segmentation performance drop caused by age-related brain changes, but such data are often unavailable in practice. Method: Proposes DuMeta++, a dual meta-learning framework comprising meta-feature learning (extracting age-agnostic semantic representations), meta-initialization learning (supporting data-efficient adaptation), and memory-bank-driven class-aware regularization (enforcing longitudinal consistency), together with a theoretical convergence proof. Result: In few-shot experiments on multiple datasets, including iSeg-2019, IBIS, OASIS, and ADNI, DuMeta++ outperforms existing methods in cross-age generalization. Conclusion: DuMeta++ effectively mitigates the age-induced distribution shift in MRI brain segmentation, achieving stable and robust cross-age generalization without paired longitudinal data. Abstract: Accurate segmentation of brain tissues from MRI scans is critical for neuroscience and clinical applications, but achieving consistent performance across the human lifespan remains challenging due to dynamic, age-related changes in brain appearance and morphology. While prior work has sought to mitigate these shifts by using self-supervised regularization with paired longitudinal data, such data are often unavailable in practice. To address this, we propose DuMeta++, a dual meta-learning framework that operates without paired longitudinal data. Our approach integrates: (1) meta-feature learning to extract age-agnostic semantic representations of spatiotemporally evolving brain structures, and (2) meta-initialization learning to enable data-efficient adaptation of the segmentation model. Furthermore, we propose a memory-bank-based class-aware regularization strategy to enforce longitudinal consistency without explicit longitudinal supervision. We theoretically prove the convergence of DuMeta++, ensuring stability. Experiments on diverse datasets (iSeg-2019, IBIS, OASIS, ADNI) under few-shot settings demonstrate that DuMeta++ outperforms existing methods in cross-age generalization. Code will be available at https://github.com/ladderlab-xjtu/DuMeta++.

[135] Condition Matters in Full-head 3D GANs

Heyuan Li,Huimin Zhang,Yuda Qiu,Zhengwentai Sun,Keru Zheng,Lingteng Qiu,Peihao Li,Qi Zuo,Ce Chen,Yujian Zheng,Yuming Gu,Zilong Dong,Xiaoguang Han

Main category: cs.CV

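TL;DR: This paper proposes conditioning on view-invariant semantic features to address the directional bias and global incoherence that arise when full-head 3D GANs are conditioned on the view angle. By building a synthetic multi-view head image dataset and using CLIP features extracted from the frontal view as a shared semantic condition, it improves generation quality, diversity, and global coherence.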

Details Motivation: Existing full-head 3D GANs are usually conditioned on the view angle, which biases the learned 3D space along the conditional view direction and causes large differences in generation quality and diversity across views as well as global incoherence. Method: Uses the CLIP semantic features of the frontal-view image as a shared, view-invariant condition for all views; uses FLUX.1 Kontext to extend high-quality frontal face datasets to multiple views and builds a synthetic head image dataset from them; consolidates multi-view supervision under the same semantic condition, improving training efficiency and generation coherence. Result: Significantly improves fidelity, diversity, and generalizability on full-head synthesis and single-view GAN inversion. Conclusion: A view-invariant semantic condition effectively decouples generative capability from view dependence, mitigates mode collapse, and improves the stability, coherence, and diversity of full-head 3D GANs. Abstract: Conditioning is crucial for stable training of full-head 3D GANs. Without any conditioning signal, the model suffers from severe mode collapse, making it impractical to train. However, a series of previous full-head 3D GANs conventionally choose the view angle as the conditioning input, which leads to a bias in the learned 3D full-head space along the conditional view direction. This is evident in the significant differences in generation quality and diversity between the conditional view and non-conditional views of the generated 3D heads, resulting in global incoherence across different head regions. In this work, we propose to use view-invariant semantic features as the conditioning input, thereby decoupling the generative capability of 3D heads from the viewing direction. To construct a view-invariant semantic condition for each training image, we create a novel synthesized head image dataset. We leverage FLUX.1 Kontext to extend existing high-quality frontal face datasets to a wide range of view angles. The CLIP image feature extracted from the frontal view is then used as a shared semantic condition across all views in the extended images, ensuring semantic alignment while eliminating directional bias. This also allows supervision from different views of the same subject to be consolidated under a shared semantic condition, which accelerates training and enhances the global coherence of the generated 3D heads.
Moreover, as GANs often experience slower improvements in diversity once the generator learns a few modes that successfully fool the discriminator, our semantic conditioning encourages the generator to follow the true semantic distribution, thereby promoting continuous learning and diverse generation. Extensive experiments on full-head synthesis and single-view GAN inversion demonstrate that our method achieves significantly higher fidelity, diversity, and generalizability.

[136] Understanding Real-World Traffic Safety through RoadSafe365 Benchmark

Xinyu Liu,Darryl C. Jacob,Yuxin Liu,Xinsong Du,Muchao Ye,Bolei Zhou,Pan He

Main category: cs.CV

TL;DR: RoadSafe365 is a large-scale vision-language benchmark for fine-grained traffic safety analysis, aligned with official standards, featuring hierarchical taxonomy, rich annotations, and multimodal QA data.

Details Motivation: Existing traffic benchmarks lack systematic evaluation aligned with official safety standards; there's a need for a dataset that bridges regulatory definitions (crash/incident/violation) with data-driven understanding. Method: Curated RoadSafe365—a systematically organized, hierarchical taxonomy-based benchmark—using diverse real-world dashcam and surveillance videos; annotated with attributes, multiple-choice QAs, and scene descriptions; evaluated via fine-tuning and cross-domain experiments. Result: 36,196 annotated clips, 864K QA options, 8.4K unique answers, 36K scene descriptions; strong baseline performance with consistent gains from fine-tuning; validated effectiveness on real and synthetic datasets. Conclusion: RoadSafe365 establishes a comprehensive, standardized, and scalable benchmark to advance reproducible, safety-standard-aligned research in vision-language traffic safety analysis. Abstract: Although recent traffic benchmarks have advanced multimodal data analysis, they generally lack systematic evaluation aligned with official safety standards. To fill this gap, we introduce RoadSafe365, a large-scale vision-language benchmark that supports fine-grained analysis of traffic safety from extensive and diverse real-world video data collections. Unlike prior works that focus primarily on coarse accident identification, RoadSafe365 is independently curated and systematically organized using a hierarchical taxonomy that refines and extends foundational definitions of crash, incident, and violation to bridge official traffic safety standards with data-driven traffic understanding systems. RoadSafe365 provides rich attribute annotations across diverse traffic event types, environmental contexts, and interaction scenarios, yielding 36,196 annotated clips from both dashcam and surveillance cameras. 
Each clip is paired with multiple-choice question-answer sets, comprising 864K candidate options, 8.4K unique answers, and 36K detailed scene descriptions collectively designed for vision-language understanding and reasoning. We establish strong baselines and observe consistent gains when fine-tuning on RoadSafe365. Cross-domain experiments on both real and synthetic datasets further validate its effectiveness. Designed for large-scale training and standardized evaluation, RoadSafe365 provides a comprehensive benchmark to advance reproducible research in real-world traffic safety analysis.

[137] The Double-Edged Sword of Data-Driven Super-Resolution: Adversarial Super-Resolution Models

Haley Duba-Sullivan,Steven R. Young,Emma J. Reid

Main category: cs.CV

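TL;DR: This paper proposes AdvSR, a framework that embeds adversarial behavior into super-resolution (SR) models during training, so that at inference time they induce misclassification in downstream tasks (such as object detection) without any access to the inputs, while maintaining good image quality metrics.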

Details Motivation: Data-driven SR methods are often used as a preprocessing step in imaging pipelines, but the model itself can become a new attack surface; existing adversarial attacks mostly rely on input perturbations or backdoor triggers, while covert model-level attacks remain unexplored. Method: AdvSR jointly optimizes reconstruction quality and a targeted adversarial objective (such as misclassification by a downstream classifier) during SR training, encoding the adversarial behavior directly into the model weights without relying on input access at inference time or on external triggers. Result: Evaluated on three SR models (SRCNN, EDSR, SwinIR) paired with a YOLOv11 detector, AdvSR achieves high attack success rates with minimal image quality degradation. Conclusion: AdvSR reveals a new model-level threat and warns that model sourcing and validation in safety-critical settings must be more rigorous and cannot rely solely on conventional image quality assessment. Abstract: Data-driven super-resolution (SR) methods are often integrated into imaging pipelines as preprocessing steps to improve downstream tasks such as classification and detection. However, these SR models introduce a previously unexplored attack surface into imaging pipelines. In this paper, we present AdvSR, a framework demonstrating that adversarial behavior can be embedded directly into SR model weights during training, requiring no access to inputs at inference time. Unlike prior attacks that perturb inputs or rely on backdoor triggers, AdvSR operates entirely at the model level. By jointly optimizing for reconstruction quality and targeted adversarial outcomes, AdvSR produces models that appear benign under standard image quality metrics while inducing downstream misclassification. We evaluate AdvSR on three SR architectures (SRCNN, EDSR, SwinIR) paired with a YOLOv11 classifier and demonstrate that AdvSR models can achieve high attack success rates with minimal quality degradation. These findings highlight a new model-level threat for imaging pipelines, with implications for how practitioners source and validate models in safety-critical applications.

[138] 3D Transport-based Morphometry (3D-TBM) for medical image analysis

Hongyu Kan,Kristofor Pas,Ivan Medri,Naqib Sad Pathan,Natasha Ironside,Shinjini Kundu,Jingjia He,Gustavo Kunde Rohde

Main category: cs.CV

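TL;DR: This paper introduces 3D-TBM, an open-source tool for morphological analysis of 3D medical images. It embeds images into a transport domain via transport maps and supports mapping results back to the original image space for clinical interpretability.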

Details Motivation: To promote broader adoption of Transport-Based Morphometry (TBM) in clinical imaging research and to address the lack of interpretability and ease of use in existing methods. Method: Builds the 3D-TBM framework, which includes data preprocessing, computation of optimal transport embeddings, visualization of main transport directions, identification of discriminating directions, and related analysis modules, and provides complete documentation and tutorials; the code is open-sourced through PyTransKit. Result: Delivers the first open-source, end-to-end TBM tool for 3D medical images, supporting transport-domain modeling and spatially interpretable analysis. Conclusion: 3D-TBM gives medical imaging researchers an easy-to-use, interpretable, and reproducible TBM analysis platform and may help bring TBM into clinical research practice. Abstract: Transport-Based Morphometry (TBM) has emerged as a new framework for 3D medical image analysis. By embedding images into a transport domain via invertible transformations, TBM facilitates effective classification, regression, and other tasks using transport-domain features. Crucially, the inverse mapping enables the projection of analytic results back into the original image space, allowing researchers to directly interpret clinical features associated with model outputs in a spatially meaningful way. To facilitate broader adoption of TBM in clinical imaging research, we present 3D-TBM, a tool designed for morphological analysis of 3D medical images. The framework includes data preprocessing, computation of optimal transport embeddings, and analytical methods such as visualization of main transport directions, together with techniques for discerning discriminating directions and related analysis methods. We also provide comprehensive documentation and practical tutorials to support researchers interested in applying 3D-TBM in their own medical imaging studies. The source code is publicly available through PyTransKit.

[139] TwistNet-2D: Learning Second-Order Channel Interactions via Spiral Twisting for Texture Recognition

Junbo Jacob Lian,Feng Xiong,Yujun Sun,Kaichen Ouyang,Mingyang Yu,Shengwei Fu,Zhong Rui,Zhang Yujun,Huiling Chen

Main category: cs.CV

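TL;DR: TwistNet-2D is a lightweight module that models feature co-occurrence and interaction through local channel products under directional spatial displacement, significantly improving texture and fine-grained recognition at minimal computational cost.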

Details Motivation: Existing methods face a fundamental tension when modeling second-order feature statistics: bilinear pooling and Gram matrices capture global channel correlations but lose spatial structure, while self-attention models spatial context but lacks explicit pairwise feature interactions. Method: Proposes the TwistNet-2D module, whose core is Spiral-Twisted Channel Interaction (STCI): one feature map is shifted along a prescribed direction and then multiplied channel-wise with the other, capturing the cross-position co-occurrence patterns of structured and periodic textures; four directional heads are aggregated, with learned channel reweighting and a sigmoid-gated residual path. Result: Adds only 3.5% parameters and 2% FLOPs over ResNet-18, yet consistently surpasses parameter-matched and even much larger baselines (such as ConvNeXt, Swin Transformer, and hybrid CNN-Transformer architectures) on four texture and fine-grained recognition benchmarks. Conclusion: Explicitly modeling local channel interactions under spatial displacement is an efficient way to strengthen second-order statistical modeling for texture recognition, and TwistNet-2D strikes a strong balance between accuracy and efficiency. Abstract: Second-order feature statistics are central to texture recognition, yet current methods face a fundamental tension: bilinear pooling and Gram matrices capture global channel correlations but collapse spatial structure, while self-attention models spatial context through weighted aggregation rather than explicit pairwise feature interactions. We introduce TwistNet-2D, a lightweight module that computes local pairwise channel products under directional spatial displacement, jointly encoding where features co-occur and how they interact. The core component, Spiral-Twisted Channel Interaction (STCI), shifts one feature map along a prescribed direction before element-wise channel multiplication, thereby capturing the cross-position co-occurrence patterns characteristic of structured and periodic textures. Aggregating four directional heads with learned channel reweighting and injecting the result through a sigmoid-gated residual path, TwistNet-2D incurs only 3.5% additional parameters and 2% additional FLOPs over ResNet-18, yet consistently surpasses both parameter-matched and substantially larger baselines (including ConvNeXt, Swin Transformer, and hybrid CNN-Transformer architectures) across four texture and fine-grained recognition benchmarks.
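The shift-then-multiply idea at the heart of STCI can be sketched in a few lines. This is only an illustration of "pairwise channel products under a directional spatial displacement": it assumes zero padding at the borders and a single shift direction, and it does not model the paper's four directional heads, channel reweighting, or gated residual path. The function names are hypothetical.

```python
def shift(channel, dy, dx):
    """Shift a 2-D map by (dy, dx) pixels, filling vacated cells with zeros."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx  # source pixel that lands on (y, x)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = channel[sy][sx]
    return out

def directional_channel_products(features, dy, dx):
    """For each channel pair (i, j), multiply channel i by the shifted channel j."""
    shifted = [shift(c, dy, dx) for c in features]
    h, w = len(features[0]), len(features[0][0])
    products = {}
    for i, a in enumerate(features):
        for j, b in enumerate(shifted):
            products[(i, j)] = [[a[y][x] * b[y][x] for x in range(w)] for y in range(h)]
    return products

# Two 2x2 channels; shift one pixel to the right (dy=0, dx=1).
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
products = directional_channel_products(features, dy=0, dx=1)
```

Each entry of `products[(i, j)]` is large only where channel `i` at a position and channel `j` one pixel to the left are both active, which is the kind of cross-position co-occurrence the abstract describes.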

[140] VideoNeuMat: Neural Material Extraction from Generative Video Models

Bowen Xue,Saeed Hadadan,Zheng Zeng,Fabrice Rousselle,Zahra Montazeri,Milos Hasan

Main category: cs.CV

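TL;DR: This paper proposes VideoNeuMat, a two-stage method that extracts reusable neural material assets from video diffusion models: it first finetunes a large video model to generate material videos under controlled lighting and viewpoints, then reconstructs compact neural material parameters with a finetuned smaller video model, yielding high-quality, diverse neural 3D materials that generalize well.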

Details Motivation: Existing material generation models are limited by the scarcity of high-quality training data; although video generation models can produce realistic material appearances, their material knowledge is entangled with geometry and lighting and is hard to disentangle and reuse. Method: A two-stage pipeline: (1) finetune the Wan 2.1 (14B) video model to generate material sample videos that follow a virtual gonioreflectometer trajectory (controlled camera and lighting changes); (2) build a Large Reconstruction Model (LRM) on a finetuned Wan 1.3B backbone that reconstructs neural material parameters from 17 generated video frames in a single inference pass. Result: Extracts compact neural materials from video models that are highly realistic, highly diverse, and generalize to novel views and lighting; performance far exceeds that of traditional methods trained on limited synthetic data. Conclusion: Internet-scale video diffusion models contain rich material priors, which can be transferred into standalone, reusable neural 3D material assets through appropriate disentanglement and reconstruction strategies. Abstract: Creating photorealistic materials for 3D rendering requires exceptional artistic skill. Generative models for materials could help, but are currently limited by the lack of high-quality training data. While recent video generative models effortlessly produce realistic material appearances, this knowledge remains entangled with geometry and lighting. We present VideoNeuMat, a two-stage pipeline that extracts reusable neural material assets from video diffusion models. First, we finetune a large video model (Wan 2.1 14B) to generate material sample videos under controlled camera and lighting trajectories, effectively creating a "virtual gonioreflectometer" that preserves the model's material realism while learning a structured measurement pattern. Second, we reconstruct compact neural materials from these videos through a Large Reconstruction Model (LRM) finetuned from a smaller Wan 1.3B video backbone. From 17 generated video frames, our LRM performs single-pass inference to predict neural material parameters that generalize to novel viewing and lighting conditions. The resulting materials exhibit realism and diversity far exceeding the limited synthetic training data, demonstrating that material knowledge can be successfully transferred from internet-scale video models into standalone, reusable neural 3D assets.

[141] Cross-View World Models

Rishabh Sharma,Gijs Hogervorst,Wayne E. Mackey,David J. Heeger,Stefano Martiniani

Main category: cs.CV

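TL;DR: This paper proposes the Cross-View World Model (XVWM), which learns view-invariant representations of an environment's 3D structure through a cross-view prediction objective (given a sequence of frames from one viewpoint, predict the future state from the same or a different viewpoint after an action is taken). This improves an agent's ability to plan across viewpoints and provides a foundation for perspective-taking in multi-agent settings.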

Details Motivation: Existing world models usually model and plan from a single viewpoint (such as the egocentric view), while some tasks (such as navigation) are easier to plan from other viewpoints (such as a bird's-eye view); world models that support multi-view modeling and consistent reasoning are therefore needed. Method: Proposes the Cross-View World Model (XVWM), trained on synchronized multi-view gameplay data from the Aimlabs platform with a cross-view prediction objective: given a frame sequence from one viewpoint and an action, predict the future state from another viewpoint; this objective forces the model to learn geometrically consistent, view-invariant representations of the 3D environment. Result: XVWM provides parallel imagination streams across viewpoints, letting the agent plan in the frame of reference best suited to the task while still executing from the egocentric view; experiments show that cross-view consistency provides a strong learning signal for spatial representations. Conclusion: Cross-view consistency is an effective form of regularization for building spatially grounded representations; XVWM not only increases single-agent planning flexibility but also provides a modeling foundation for perspective-taking in multi-agent settings. Abstract: World models enable agents to plan by imagining future states, but existing approaches operate from a single viewpoint, typically egocentric, even when other perspectives would make planning easier; navigation, for instance, benefits from a bird's-eye view. We introduce Cross-View World Models (XVWM), trained with a cross-view prediction objective: given a sequence of frames from one viewpoint, predict the future state from the same or a different viewpoint after an action is taken. Enforcing cross-view consistency acts as geometric regularization: because the input and output views may share little or no visual overlap, to predict across viewpoints, the model must learn view-invariant representations of the environment's 3D structure. We train on synchronized multi-view gameplay data from Aimlabs, an aim-training platform providing precisely aligned multi-camera recordings with high-frequency action labels. The resulting model gives agents parallel imagination streams across viewpoints, enabling planning in whichever frame of reference best suits the task while executing from the egocentric view. Our results show that multi-view consistency provides a strong learning signal for spatially grounded representations. Finally, predicting the consequences of one's actions from another viewpoint may offer a foundation for perspective-taking in multi-agent settings.

[142] Diabetic Retinopathy Lesion Segmentation through Attention Mechanisms

Aruna Jithesh,Chinmayi Karumuri,Venkata Kiran Reddy Kotha,Meghana Doddapuneni,Taehee Jeong

Main category: cs.CV

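TL;DR: This paper proposes Attention-DeepLab, a DeepLab-V3+ model augmented with an attention mechanism, for pixel-level segmentation of diabetic retinopathy (DR) lesions. It markedly improves detection of early lesions such as microaneurysms, making it more practical for clinical screening.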

Details Motivation: Existing deep learning methods have limited clinical applicability for DR lesion segmentation, in particular lacking the pixel-level lesion annotation needed to support ophthalmologists in accurate screening; microaneurysms are the earliest visible sign of DR, so detecting them accurately is critical for early intervention. Method: Segments four types of DR-related lesions (microaneurysms, soft exudates, hard exudates, and hemorrhages) on 757 fundus images from the DDR dataset; embeds an attention mechanism into the DeepLab-V3+ network to build the Attention-DeepLab model. Result: Compared with the baseline, mAP rises from 0.3010 to 0.3326 and mean IoU from 0.1791 to 0.1928; the microaneurysm detection metric rises markedly from 0.0205 to 0.0763. Conclusion: Introducing an attention mechanism effectively improves DeepLab-V3+ segmentation accuracy for DR lesions (especially early microaneurysms), providing a more reliable pixel-level diagnostic aid for systematic clinical DR screening. Abstract: Diabetic Retinopathy (DR) is an eye disease that arises from diabetes mellitus. It can cause vision loss and blindness. To prevent irreversible vision loss, early detection through systematic screening is crucial. Although researchers have developed numerous automated deep learning-based algorithms for DR screening, their clinical applicability remains limited, particularly in lesion segmentation. Our method provides pixel-level annotations for lesions, which in practice supports ophthalmologists in screening for DR from fundus images. In this work, we segmented four types of DR-related lesions: microaneurysms, soft exudates, hard exudates, and hemorrhages on 757 images from the DDR dataset. To enhance lesion segmentation, an attention mechanism was integrated with DeepLab-V3+. Compared to the baseline model, the Attention-DeepLab model increases mean average precision (mAP) from 0.3010 to 0.3326 and the mean Intersection over Union (IoU) from 0.1791 to 0.1928. The model also increased microaneurysm detection from 0.0205 to 0.0763, a clinically significant improvement. Microaneurysms are the earliest visible symptom of DR.

[143] Optimization of Precipitate Segmentation Through Linear Genetic Programming of Image Processing

Kyle Williams,Andrew Seltzman

Main category: cs.CV

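TL;DR: This paper proposes an image filtering and segmentation algorithm based on linear genetic programming (LGP) for automatically identifying precipitates in focused ion beam (FIB) cross-section micrographs, overcoming the inefficiency of hand annotation and its sensitivity to noise and artifacts. The method produces interpretable MATLAB code, reaches a 1.8% average error in pixel-level comparison, and processes a 3.6-megapixel image in about 2 seconds, significantly accelerating development of additively manufactured niobium-based copper alloys.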

Details Motivation: Microstructure analysis of additively manufactured niobium-based copper alloys currently relies on manual annotation, which is hampered by varying micrograph contrast, noise, and imaging artifacts and severely slows alloy development iteration. Method: Builds an LGP optimization environment based on a domain-specific language (DSL) whose programs are sequences of image-filtering blocks with tunable parameters; a genetic algorithm searches for the best filtering and segmentation pipeline and outputs human-readable MATLAB code. Result: With a population size of 60 and a maximum program length of 5 blocks, the resulting algorithm has an average pixel-level XOR error of 1.8% against human annotation; processing a single 3.6-megapixel image takes about 2 seconds. Conclusion: The automated method greatly improves the efficiency and reliability of precipitate identification, supporting rapid development and optimization of high-performance, low-activation, precipitation-hardened copper alloys for additively manufactured fusion reactor parts. Abstract: Current analysis of additively manufactured niobium-based copper alloys relies on hand annotation due to varying contrast, noise, and image artifacts present in micrographs, slowing iteration speed in alloy development. We present a filtering and segmentation algorithm for detecting precipitates in FIB cross-section micrographs, optimized using linear genetic programming (LGP), which accounts for the various artifacts. To this end, the optimization environment uses a domain-specific language for image processing to iterate on solutions. Programs in this language are a list of image-filtering blocks with tunable parameters that sequentially process an input image, allowing for reliable generation and mutation by a genetic algorithm. Our environment produces optimized human-interpretable MATLAB code representing an image filtering pipeline. Under ideal conditions (a population size of 60 and a maximum program length of 5 blocks), our system was able to find a near-human accuracy solution with an average evaluation error of 1.8% when comparing segmentations pixel-by-pixel to a human baseline using an XOR error evaluation. Our automation work enabled faster iteration cycles and furthered exploration of the material composition and processing space: our optimized pipeline algorithm processes a 3.6 megapixel image in about 2 seconds on average. This ultimately enables convergence on strong, low-activation, precipitation-hardened copper alloys for additively manufactured fusion reactor parts.
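The pixel-by-pixel XOR comparison mentioned in the abstract can be sketched as follows. This assumes binary masks stored as nested lists of 0/1 and normalizes the mismatch count by the total number of pixels; the paper's exact normalization is not stated here, and the function name is hypothetical.

```python
def xor_error(predicted, reference):
    """Fraction of pixels on which two binary masks disagree."""
    if len(predicted) != len(reference) or any(
        len(p) != len(r) for p, r in zip(predicted, reference)
    ):
        raise ValueError("masks must have the same shape")
    total = sum(len(row) for row in reference)
    mismatches = sum(
        p ^ r  # 1 where exactly one of the two masks marks the pixel
        for p_row, r_row in zip(predicted, reference)
        for p, r in zip(p_row, r_row)
    )
    return mismatches / total

predicted = [[1, 0, 0, 1], [0, 1, 1, 0]]
reference = [[1, 0, 1, 1], [0, 1, 1, 0]]
error = xor_error(predicted, reference)  # one mismatch out of eight pixels
```

A genetic algorithm could use a value like this, averaged over annotated images, as the fitness to minimize.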

[144] LUCID-SAE: Learning Unified Vision-Language Sparse Codes for Interpretable Concept Discovery

Difei Gu,Yunhe Gao,Gerasimos Chatzoudis,Zihan Dong,Guoning Zhang,Bangwei Guo,Yang Zhou,Mu Zhou,Dimitris Metaxas

Main category: cs.CV

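TL;DR: This paper proposes LUCID, a unified vision-language sparse autoencoder that achieves cross-modal feature alignment and interpretability through a shared latent dictionary and an optimal transport matching objective, requires no labels, and supports automated interpretation via term clustering.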

Details Motivation: Existing sparse autoencoders (SAEs) are trained separately per modality, producing features that are not directly understandable and explanations that do not transfer across domains. Method: Proposes LUCID, which learns a sparse latent dictionary shared by image patches and text tokens while reserving modality-private capacity; aligns the shared features with an unsupervised optimal transport matching objective; and designs an automated dictionary interpretation pipeline based on term clustering. Result: LUCID produces interpretable shared features that support patch-level grounding and cross-modal neuron correspondence and mitigate the concept clustering problem in similarity-based evaluation; analysis shows that its shared features cover diverse semantic categories, including objects, actions, attributes, and abstract concepts. Conclusion: LUCID offers a unified, interpretable, and robust sparse coding framework for multimodal representations, advancing cross-modal concept discovery and automated interpretation. Abstract: Sparse autoencoders (SAEs) offer a natural path toward comparable explanations across different representation spaces. However, current SAEs are trained per modality, producing dictionaries whose features are not directly understandable and whose explanations do not transfer across domains. In this study, we introduce LUCID (Learning Unified vision-language sparse Codes for Interpretable concept Discovery), a unified vision-language sparse autoencoder that learns a shared latent dictionary for image patch and text token representations, while reserving private capacity for modality-specific details. We achieve feature alignment by coupling the shared codes with a learned optimal transport matching objective without the need for labeling. LUCID yields interpretable shared features that support patch-level grounding, establish cross-modal neuron correspondence, and enhance robustness against the concept clustering problem in similarity-based evaluation. Leveraging the alignment properties, we develop an automated dictionary interpretation pipeline based on term clustering without manual observations. Our analysis reveals that LUCID's shared features capture diverse semantic categories beyond objects, including actions, attributes, and abstract concepts, demonstrating a comprehensive approach to interpretable multimodal representations.

[145] Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

Ruturaj Reddy,Hrishav Bakul Barua,Junn Yong Loo,Thanh Thi Nguyen,Ganesh Krishnasamy

Main category: cs.CV

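TL;DR: This paper proposes CLARITY, which uses a dynamic RGB-thermal fusion strategy guided by a vision-language model to improve semantic segmentation of road scenes under adverse illumination, and sets a new state of the art on the MFNet dataset.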

Details Motivation: Existing RGB-thermal fusion methods use static fusion strategies that cannot cope with varying illumination and shadow conditions, allowing modality-specific noise to propagate and hurting segmentation robustness. Method: Proposes CLARITY, a dynamic fusion network guided by vision-language model (VLM) priors that adaptively modulates each modality's contribution; introduces a mechanism that preserves valid dark-object semantics and a hierarchical decoder that improves structural consistency and sharpens boundaries on thin objects. Result: Achieves 62.3% mIoU and 77.5% mAcc on the MFNet dataset, setting a new state of the art. Conclusion: Dynamic, condition-aware multimodal fusion combined with VLM priors and a structured decoder design significantly improves the robustness and accuracy of semantic segmentation under complex illumination. Abstract: Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remains a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one that preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.
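For readers unfamiliar with the two reported metrics, the sketch below computes mean IoU and mean class accuracy from a confusion matrix using their standard definitions. It assumes `confusion[t][p]` counts pixels of true class `t` predicted as class `p`, skips classes that never occur, and assumes at least one class is present; the function name and the toy matrix are hypothetical, and the benchmark's own evaluation script may differ in such details.

```python
def miou_macc(confusion):
    """Return (mean IoU, mean per-class accuracy) from a square confusion matrix."""
    n = len(confusion)
    ious, accs = [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # class c pixels missed
        fp = sum(confusion[r][c] for r in range(n)) - tp  # other pixels labelled c
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
        if tp + fn > 0:
            accs.append(tp / (tp + fn))
    return sum(ious) / len(ious), sum(accs) / len(accs)

# Toy two-class example (hypothetical pixel counts).
confusion = [
    [8, 2],  # true class 0: 8 correct, 2 predicted as class 1
    [1, 9],  # true class 1: 1 predicted as class 0, 9 correct
]
miou, macc = miou_macc(confusion)
```

Here class 0 has IoU 8/11 and class 1 has IoU 9/12, while the per-class accuracies are 0.8 and 0.9, so mAcc is higher than mIoU, as is also the case for the figures quoted above.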

[146] Optimizing Few-Step Generation with Adaptive Matching Distillation

Lichen Bai,Zikai Zhou,Shitong Shao,Wenliang Zhong,Shuo Yang,Shuo Chen,Bojun Chen,Zeke Xie

Main category: cs.CV

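TL;DR: This paper proposes Adaptive Matching Distillation (AMD), which uses reward proxies to explicitly detect and escape "Forbidden Zones", improving the sample fidelity and training robustness of generative models.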

Details Motivation: Distribution Matching Distillation (DMD) is unstable in Forbidden Zones (regions where the teacher model provides unreliable guidance and the fake teacher exerts insufficient repulsive force), so a more robust optimization mechanism is needed. Method: Proposes a unified optimization framework that views existing methods as implicit strategies for avoiding Forbidden Zones; on this basis designs AMD, which includes a mechanism that dynamically prioritizes corrective gradients via structural signal decomposition and a Repulsive Landscape Sharpening technique that strengthens the repulsive potential. Result: Significantly improves performance on image and video generation tasks such as SDXL and Wan2.1 and on benchmarks such as VBench and GenEval; for example, the HPSv2 score on SDXL rises from 30.64 to 31.25. Conclusion: Explicitly rectifying optimization trajectories to avoid Forbidden Zones is key to raising the performance ceiling of few-step generative models. Abstract: Distribution Matching Distillation (DMD) is a powerful acceleration paradigm, yet its stability is often compromised in Forbidden Zones, regions where the real teacher provides unreliable guidance while the fake teacher exerts insufficient repulsive force. In this work, we propose a unified optimization framework that reinterprets prior art as implicit strategies to avoid these corrupted regions. Based on this insight, we introduce Adaptive Matching Distillation (AMD), a self-correcting mechanism that utilizes reward proxies to explicitly detect and escape Forbidden Zones. AMD dynamically prioritizes corrective gradients via structural signal decomposition and introduces Repulsive Landscape Sharpening to enforce steep energy barriers against failure mode collapse. Extensive experiments across image and video generation tasks (e.g., SDXL, Wan2.1) and rigorous benchmarks (e.g., VBench, GenEval) demonstrate that AMD significantly enhances sample fidelity and training robustness. For instance, AMD improves the HPSv2 score on SDXL from 30.64 to 31.25, outperforming state-of-the-art baselines. These findings validate that explicitly rectifying optimization trajectories within Forbidden Zones is essential for pushing the performance ceiling of few-step generative models.

[147] Row-Column Separated Attention Based Low-Light Image/Video Enhancement

Chengqi Dong,Zhiyuan Cao,Tuoshi Qi,Kexin Wu,Yixing Gao,Fan Tang

Main category: cs.CV

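TL;DR: This paper proposes a Row-Column Separated Attention (RCSA) module combined with an improved U-Net for low-light image and video enhancement. It uses the mean and maximum of feature-map rows and columns to bring in global information that guides local enhancement, improving detail recovery and noise suppression with fewer parameters and less computation, and designs two temporal loss functions to keep video enhancement temporally consistent.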

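Details Motivation: In low-light enhancement, U-Net tends to produce large local noise and lose detail because it makes poor use of global information; conventional attention mechanisms strengthen global modeling but add excessive computation and parameters. Method: Proposes the RCSA module, which takes the mean and maximum along the rows and columns of the feature map as input to fuse global information in a lightweight way; embeds it in an improved U-Net architecture; and designs two temporal loss functions for low-light video enhancement to maintain inter-frame consistency. Result: Experiments on the LOL and MIT Adobe FiveK image datasets and the SDSD video dataset show that the method outperforms existing methods in both subjective quality and objective metrics (such as PSNR and SSIM), with significantly fewer parameters and less computation than standard attention mechanisms. Conclusion: RCSA is an efficient, lightweight, and effective way to model global information that markedly improves the performance and stability of U-Net on low-light image and video enhancement. Abstract: The U-Net structure is widely used for low-light image/video enhancement. Without proper guidance from global information, the enhanced images exhibit areas with large local noise and loss of detail. Attention mechanisms can better focus on and use global information. However, applying attention to images can significantly increase the number of parameters and computations. We propose a Row-Column Separated Attention module (RCSA) inserted after an improved U-Net. The RCSA module's input is the mean and maximum of the rows and columns of the feature map, which utilizes global information to guide local information with fewer parameters. We propose two temporal loss functions to apply the method to low-light video enhancement and maintain temporal consistency. Extensive experiments on the LOL, MIT Adobe FiveK image, and SDSD video datasets demonstrate the effectiveness of our approach. The code is publicly available at https://github.com/cq-dong/URCSA.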
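The row and column statistics that feed the RCSA module can be illustrated with a short sketch for a single H x W channel. Only the pooling step is shown; how the module turns these statistics into attention weights is not described here, and the function name is hypothetical.

```python
def row_column_stats(feature_map):
    """Per-row and per-column mean and max of one H x W feature map."""
    columns = list(zip(*feature_map))  # transpose rows into columns
    return {
        "row_mean": [sum(r) / len(r) for r in feature_map],
        "row_max": [max(r) for r in feature_map],
        "col_mean": [sum(c) / len(c) for c in columns],
        "col_max": [max(c) for c in columns],
    }

feature_map = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
stats = row_column_stats(feature_map)
```

The appeal of this kind of input is its size: an H x W map is summarized by 2(H + W) numbers, so anything computed from it grows with the side lengths of the map rather than with the number of pixel pairs, which is consistent with the abstract's claim of fewer parameters than attention over the full image.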

[148] Perspective-aware fusion of incomplete depth maps and surface normals for accurate 3D reconstruction

Ondrej Hlinka,Georg Kaniak,Christian Kapeller

Main category: cs.CV

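TL;DR: This paper proposes a perspective-aware log-depth fusion method for reconstructing accurate 3D surfaces from depth maps and surface normal maps acquired with a single perspective camera, using normal information to fill in missing depth data.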

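Details Motivation: Existing orthographic gradient-based methods for depth-normal fusion ignore perspective projection, which distorts the reconstruction; in addition, sensors often produce missing depth data that must be filled in effectively. Method: Proposes a perspective-aware log-depth fusion method that extends conventional orthographic gradient-based methods by explicitly modeling perspective projection and uses surface normal information to inpaint regions with missing depth. Result: Experiments on the DiLiGenT-MV dataset show that the method achieves more accurate metric 3D reconstruction and confirm the key role of perspective modeling in fusion quality. Conclusion: Perspective-aware modeling is essential for depth-normal fusion; the proposed method is more accurate and robust than methods based on the conventional orthographic assumption. Abstract: We address the problem of reconstructing 3D surfaces from depth and surface normal maps acquired by a sensor system based on a single perspective camera. Depth and normal maps can be obtained through techniques such as structured-light scanning and photometric stereo, respectively. We propose a perspective-aware log-depth fusion approach that extends existing orthographic gradient-based depth-normals fusion methods by explicitly accounting for perspective projection, leading to metrically accurate 3D reconstructions. Additionally, the method handles missing depth measurements by leveraging available surface normal information to inpaint gaps. Experiments on the DiLiGenT-MV data set demonstrate the effectiveness of our approach and highlight the importance of perspective-aware depth-normals fusion.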
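Why log-depth is the natural variable under perspective projection can be seen from a short textbook derivation. It assumes an ideal pinhole camera with focal length $f$ in pixels, pixel coordinates $(u, v)$ measured from the principal point, depth $z(u, v)$, and surface normal $\mathbf{n} = (n_x, n_y, n_z)$; the symbols are chosen here for illustration, and the paper's own formulation may differ.

```latex
% Back-projected surface point under a pinhole model
\mathbf{P}(u, v) = z(u, v)\,\Bigl(\tfrac{u}{f},\ \tfrac{v}{f},\ 1\Bigr)^{\top}

% The normal is orthogonal to both tangent vectors
\mathbf{n} \cdot \frac{\partial \mathbf{P}}{\partial u} = 0,
\qquad
\mathbf{n} \cdot \frac{\partial \mathbf{P}}{\partial v} = 0

% Expanding and dividing by z gives gradients of the log-depth
\frac{\partial \log z}{\partial u} = -\frac{n_x}{u\,n_x + v\,n_y + f\,n_z},
\qquad
\frac{\partial \log z}{\partial v} = -\frac{n_y}{u\,n_x + v\,n_y + f\,n_z}

% Orthographic counterpart, with P = (u, v, z)
\frac{\partial z}{\partial u} = -\frac{n_x}{n_z},
\qquad
\frac{\partial z}{\partial v} = -\frac{n_y}{n_z}
```

Under perspective projection the normals therefore constrain the gradient of $\log z$ linearly, whereas the orthographic formulas constrain the gradient of $z$ itself. The two coincide only in the limit where $f\,n_z$ dominates $u\,n_x + v\,n_y$, which suggests why an orthographic fusion scheme would distort reconstructions away from the image center.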

[149] PTB-XL-Image-17K: A Large-Scale Synthetic ECG Image Dataset with Comprehensive Ground Truth for Deep Learning-Based Digitization

Naqcho Ali Mehdi

Main category: cs.CV

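TL;DR: This paper presents PTB-XL-Image-17K, a large-scale dataset of 17,271 synthetic 12-lead ECG images derived from the PTB-XL signal database, supporting research on the full ECG image digitization pipeline.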

Details Motivation: ECG image digitization is essential for exploiting historical paper or scanned ECG data, but the lack of large-scale datasets that combine images, ground truth signals, and rich annotations has severely hindered progress. Method: Builds the high-fidelity synthetic ECG image dataset PTB-XL-Image-17K from the PTB-XL signal database; provides five complementary data types (images, pixel-level segmentation masks, ground truth time-series signals, YOLO-format bounding boxes, and metadata); develops an open-source Python framework that supports controllable generation with parameters such as paper speed, voltage scale, sampling rate, and grid color. Result: Generates 17,271 high-quality ECG images with a 100% generation success rate and an average time of 1.35 seconds per sample; provides the first large-scale resource with full ground truth covering the complete pipeline of lead detection, waveform segmentation, and signal extraction. Conclusion: PTB-XL-Image-17K fills a key data gap in ECG image digitization research and supports systematic development and rigorous evaluation of algorithms; all resources are open source. Abstract: Electrocardiogram (ECG) digitization, the conversion of paper-based or scanned ECG images back into time-series signals, is critical for leveraging decades of legacy clinical data in modern deep learning applications. However, progress has been hindered by the lack of large-scale datasets providing both ECG images and their corresponding ground truth signals with comprehensive annotations. We introduce PTB-XL-Image-17K, a complete synthetic ECG image dataset comprising 17,271 high-quality 12-lead ECG images generated from the PTB-XL signal database. Our dataset uniquely provides five complementary data types per sample: (1) realistic ECG images with authentic grid patterns and annotations (50% with visible grid, 50% without), (2) pixel-level segmentation masks, (3) ground truth time-series signals, (4) bounding box annotations in YOLO format for both lead regions and lead name labels, and (5) comprehensive metadata including visual parameters and patient information. We present an open-source Python framework enabling customizable dataset generation with controllable parameters including paper speed (25/50 mm/s), voltage scale (5/10 mm/mV), sampling rate (500 Hz), grid appearance (4 colors), and waveform characteristics. The dataset achieves a 100% generation success rate with an average processing time of 1.35 seconds per sample. PTB-XL-Image-17K addresses critical gaps in ECG digitization research by providing the first large-scale resource supporting the complete pipeline: lead detection, waveform segmentation, and signal extraction with full ground truth for rigorous evaluation.
The dataset, generation framework, and documentation are publicly available at https://github.com/naqchoalimehdi/PTB-XL-Image-17K and https://doi.org/10.5281/zenodo.18197519.
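As a reminder of what "YOLO format" means for the bounding box annotations, the sketch below converts a pixel-space box into a label line of the form `class x_center y_center width height`, with all four coordinates normalized by the image size. The function name, the class id, and the image dimensions are hypothetical; the dataset's own class mapping and file layout are not described here.

```python
def to_yolo_line(class_id, box, image_width, image_height):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Hypothetical lead region on a 1000 x 500 pixel image.
line = to_yolo_line(0, (100, 50, 300, 150), image_width=1000, image_height=500)
```

Because the coordinates are relative, the same label file stays valid if the image is resized without cropping.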

[150] SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads

Tan Yu,Qian Qiao,Le Shen,Ke Zhou,Jincheng Hu,Dian Sheng,Bo Hu,Haoming Qin,Jun Gao,Changhai Zhou,Shunshun Yin,Siyuan Liu

Main category: cs.CV

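TL;DR: This paper proposes SoulX-FlashHead, a 1.3B-parameter framework for real-time, infinite-length, high-fidelity audio-driven portrait video generation. It improves stability and quality through Streaming-Aware Spatiotemporal Pre-training and Oracle-Guided Bidirectional Distillation, builds the large-scale aligned dataset VividHead, and achieves state-of-the-art performance on HDTF and VFHQ, with the Lite variant reaching 96 FPS.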

Details Motivation: Audio-driven portrait generation struggles to combine high-fidelity visual quality with low-latency streaming: large models are computationally expensive, while lightweight models suffer from weak facial representations and poor temporal stability. Method: Proposes the SoulX-FlashHead framework; introduces Streaming-Aware Spatiotemporal Pre-training with a Temporal Audio Context Cache to stabilize feature extraction from short audio fragments; designs Oracle-Guided Bidirectional Distillation, which uses ground-truth motion priors to suppress error accumulation and identity drift; and builds VividHead, a 782-hour strictly aligned dataset. Result: Achieves state-of-the-art performance on the HDTF and VFHQ benchmarks; the Lite variant runs at 96 FPS on a single RTX 4090 while maintaining visual coherence. Conclusion: SoulX-FlashHead delivers high-quality, efficient, and long-term stable real-time audio-driven talking-head generation, offering a practical solution for interactive virtual-human applications. Abstract: Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.

[151] SpatialReward: Bridging the Perception Gap in Online RL for Image Editing via Explicit Spatial Reasoning

Yancheng Long,Yankai Yang,Hongyang Wei,Wei Chen,Tianke Zhang,Haonan fan,Changyi Liu,Kaiyu Jiang,Jiankang Chen,Kaiyu Tang,Bin Wen,Fan Yang,Tingting Gao,Han Li,Shuo Yang

Main category: cs.CV

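TL;DR: This paper proposes SpatialReward, a reward model based on spatial reasoning that anchors its judgments to predicted edit regions to make image-editing evaluation more accurate, reaching state-of-the-art results on multiple benchmarks.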

Details Motivation: Existing image-editing evaluators suffer from "Attention Collapse": they neglect cross-image comparisons and fail to capture fine-grained details, which leads to inaccurate evaluation. Method: Proposes SpatialReward, which performs precise verification through explicit spatial reasoning, anchors semantic judgments to predicted edit regions, and is trained on a 260k spatial-aware dataset. Result: Achieves state-of-the-art performance on MMRB2 and EditReward-Bench and outperforms proprietary evaluators on MultiEditReward-Bench; used as an online RL signal, it improves OmniGen2 by +0.90 on GEdit-Bench. Conclusion: Spatial reasoning is essential for effective alignment in image editing. Abstract: Online Reinforcement Learning (RL) offers a promising avenue for complex image editing but is currently constrained by the scarcity of reliable and fine-grained reward signals. Existing evaluators frequently struggle with a critical perception gap we term "Attention Collapse," where models neglect cross-image comparisons and fail to capture fine-grained details, resulting in inaccurate perception and miscalibrated scores. To address these limitations, we propose SpatialReward, a reward model that enforces precise verification via explicit spatial reasoning. By anchoring reasoning to predicted edit regions, SpatialReward grounds semantic judgments in pixel-level evidence, significantly enhancing evaluative accuracy. Trained on a curated 260k spatial-aware dataset, our model achieves state-of-the-art performance on MMRB2 and EditReward-Bench, and outperforms proprietary evaluators on our proposed MultiEditReward-Bench. Furthermore, SpatialReward serves as a robust signal in online RL, boosting OmniGen2 by +0.90 on GEdit-Bench, surpassing the leading discriminative model and doubling the gain of GPT-4.1 (+0.45). These results demonstrate that spatial reasoning is essential for unlocking effective alignment in image editing.

[152] GlobalWasteData: A Large-Scale, Integrated Dataset for Robust Waste Classification and Environmental Monitoring

Misbah Ijaz,Saif Ur Rehman Khan,Abd Ur Rehman,Tayyaba Asif,Sebastian Vollmer,Andreas Dengel,Muhammad Nabeel Asim

Main category: cs.CV

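TL;DR: This paper introduces GlobalWasteData (GWD), a large-scale, unified, high-quality waste classification dataset that addresses the fragmentation, inconsistency, and bias of existing public datasets in order to support robust, generalizable AI waste recognition models.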

Details Motivation: Existing public waste classification datasets are fragmented, use inconsistent annotation formats, vary widely in image conditions, and have imbalanced class distributions, which makes them hard to combine and hard to use for training models that generalize. Method: Merges multiple public datasets into the unified GlobalWasteData (GWD) archive, containing 89,807 images across 14 main categories and 68 subclasses, with preprocessing that includes quality filtering, duplicate removal, and metadata generation. Result: GWD offers consistent labels, greater domain diversity, and a more balanced class distribution, markedly improving data reliability and usability. Conclusion: GWD provides a solid foundation for machine learning applications in environmental monitoring, recycling automation, and waste identification, and has been publicly released to promote follow-up research and reproducibility. Abstract: The growing amount of waste is a problem for the environment that requires efficient sorting techniques for various kinds of waste. An automated waste classification system is used for this purpose. The effectiveness of these Artificial Intelligence (AI) models depends on the quality and accessibility of publicly available datasets, which provide the basis for training and analyzing classification algorithms. Although several public waste classification datasets exist, they remain fragmented, inconsistent, and biased toward specific environments. Differences in class names, annotation formats, image conditions, and class distributions make it difficult to combine these datasets or train models that generalize well to real-world scenarios. To address these issues, we introduce the GlobalWasteData (GWD) archive, a large-scale dataset of 89,807 images across 14 main categories, annotated with 68 distinct subclasses. We compile this novel integrated GWD archive by merging multiple publicly available datasets into a single, unified resource. This GWD archive offers consistent labeling, improved domain diversity, and more balanced class representation, enabling the development of robust and generalizable waste recognition models. Additional preprocessing steps such as quality filtering, duplicate removal, and metadata generation further improve dataset reliability. Overall, this dataset offers a strong foundation for Machine Learning (ML) applications in environmental monitoring, recycling automation, and waste identification, and is publicly available to promote future research and reproducibility.

[153] Thermal odometry and dense mapping using learned odometry and Gaussian splatting

Tianhao Zhou,Yujia Chen,Zhihao Zhan,Yuhang Ming,Jianzhu Huai

Main category: cs.CV

TL;DR: TOM-GS is a novel thermal SLAM system combining learning-based odometry with Gaussian Splatting for dense mapping, specifically designed for thermal cameras under adverse conditions.

Details Motivation: Existing thermal odometry and mapping methods are mostly geometric, lack robustness across diverse datasets, and fail to generate dense maps; recent advances in Gaussian Splatting offer high-quality, efficient reconstruction, motivating its adaptation to thermal SLAM. Method: TOM-GS integrates learning-based monocular odometry with Gaussian Splatting for dense mapping, incorporating dedicated thermal image enhancement and monocular depth integration. Result: TOM-GS outperforms existing learning-based methods in motion estimation and novel-view rendering on thermal datasets, demonstrating superior robustness and dense reconstruction capability. Conclusion: Learning-based pipelines combined with Gaussian Splatting are highly effective for thermal odometry and dense mapping, establishing TOM-GS as a pioneering GS-based thermal SLAM system. Abstract: Thermal infrared sensors, with wavelengths longer than smoke particles, can capture imagery independent of darkness, dust, and smoke. This robustness has made them increasingly valuable for motion estimation and environmental perception in robotics, particularly in adverse conditions. Existing thermal odometry and mapping approaches, however, are predominantly geometric and often fail across diverse datasets while lacking the ability to produce dense maps. Motivated by the efficiency and high-quality reconstruction ability of recent Gaussian Splatting (GS) techniques, we propose TOM-GS, a thermal odometry and mapping method that integrates learning-based odometry with GS-based dense mapping. TOM-GS is among the first GS-based SLAM systems tailored for thermal cameras, featuring dedicated thermal image enhancement and monocular depth integration. Extensive experiments on motion estimation and novel-view rendering demonstrate that TOM-GS outperforms existing learning-based methods, confirming the benefits of learning-based pipelines for robust thermal odometry and dense reconstruction.

[154] Learning Brain Representation with Hierarchical Visual Embeddings

Jiawen Zheng,Haonan Jia,Ming Li,Yuhui Zheng,Yufeng Zeng,Yang Gao,Chen Liang

Main category: cs.CV

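TL;DR: This paper proposes a method that aligns brain signals with image representations using multiple pre-trained visual encoders and contrastive learning, and introduces a Fusion Prior to improve cross-modal distributional consistency, achieving a better balance between retrieval accuracy and reconstruction fidelity.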

Details Motivation: Current visual decoding methods mostly focus on high-level semantic features and neglect pixel-level details, which limits understanding of the human visual system; it also remains unclear to what degree brain signals encode visual information. Method: Uses multiple pre-trained visual encoders with distinct inductive biases to extract hierarchical, multi-scale visual representations, and aligns brain signals with visual embeddings through contrastive learning; additionally introduces a Fusion Prior that first learns a stable mapping on large-scale visual data and then matches brain features to this prior, strengthening distributional consistency across modalities. Result: Extensive quantitative and qualitative experiments show that the method achieves a better balance between retrieval accuracy and reconstruction fidelity. Conclusion: The method improves the representational power and generalization of brain-image alignment and helps to characterize more fully how visual information is encoded in the brain. Abstract: Decoding visual representations from brain signals has attracted significant attention in both neuroscience and artificial intelligence. However, the degree to which brain signals truly encode visual information remains unclear. Current visual decoding approaches explore various brain-image alignment strategies, yet most emphasize high-level semantic features while neglecting pixel-level details, thereby limiting our understanding of the human visual system. In this paper, we propose a brain-image alignment strategy that leverages multiple pre-trained visual encoders with distinct inductive biases to capture hierarchical and multi-scale visual representations, while employing a contrastive learning objective to achieve effective alignment between brain signals and visual embeddings. Furthermore, we introduce a Fusion Prior, which learns a stable mapping on large-scale visual data and subsequently matches brain features to this pre-trained prior, thereby enhancing distributional consistency across modalities. Extensive quantitative and qualitative experiments demonstrate that our method achieves a favorable balance between retrieval accuracy and reconstruction fidelity.
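
The abstract names a contrastive learning objective for aligning brain signals with visual embeddings but does not specify its form. As an illustration only, the sketch below implements a CLIP-style symmetric InfoNCE loss, which is one common choice and is an assumption here, not the authors' confirmed loss. The function name `info_nce` and the temperature default are hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(brain, image, temperature=0.07):
    """Symmetric InfoNCE over paired brain/image embeddings (pair i matches i)."""
    b = [_normalize(v) for v in brain]
    im = [_normalize(v) for v in image]
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(bi, ij)) / temperature for ij in im]
              for bi in b]
    transposed = [list(col) for col in zip(*logits)]
    # Average the brain-to-image and image-to-brain directions.
    return 0.5 * (_cross_entropy_on_diagonal(logits)
                  + _cross_entropy_on_diagonal(transposed))
```

The loss is small when each brain embedding is closest to its own image embedding and grows when the pairing is scrambled, which is the property a retrieval metric depends on.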

[155] IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation

Zhufeng Xu,Xuan Gao,Feng-Lin Liu,Haoxian Zhang,Zhixue Fang,Yu-Kun Lai,Xiaoqiang Liu,Pengfei Wan,Lin Gao

Main category: cs.CV

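TL;DR: This paper proposes IM-Animation, a new implicit motion representation that compresses per-frame motion into compact 1D motion tokens and adds a temporally consistent, mask-token-based retargeting module, addressing the spatial mismatch and scale variation problems of explicit methods as well as the identity leakage and motion-appearance entanglement of implicit methods.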

Details Motivation: Existing explicit methods (e.g., skeleton, DWPose) struggle with spatial misalignment and varying body proportions; implicit methods learn high-level motion semantics directly from the driving video but are prone to identity leakage and entanglement between motion and appearance. Method: Proposes an implicit motion representation based on 1D motion tokens to relax the spatial constraints of 2D representations and prevent identity leakage; designs a retargeting module based on temporally consistent mask tokens that introduces a temporal training bottleneck; and adopts a three-stage training strategy to improve efficiency and fidelity. Result: In extensive experiments, IM-Animation matches or exceeds state-of-the-art methods in generation quality. Conclusion: The proposed implicit motion representation and retargeting mechanism improve the robustness, consistency, and identity preservation of character animation, offering a new approach to motion modeling in video diffusion models. Abstract: Recent progress in video diffusion models has markedly advanced character animation, which synthesizes videos by animating a static identity image according to a driving video. Explicit methods represent motion using skeleton, DWPose or other explicit structured signals, but struggle to handle spatial mismatches and varying body scales. Implicit methods, on the other hand, capture high-level implicit motion semantics directly from the driving video, but suffer from identity leakage and entanglement between motion and appearance. To address the above challenges, we propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. This design relaxes strict spatial constraints inherent in 2D representations and effectively prevents identity information leakage from the motion video. Furthermore, we design a temporally consistent mask token-based retargeting module that enforces a temporal training bottleneck, mitigating interference from the source images' motion and improving retargeting consistency. Our methodology employs a three-stage training strategy to enhance the training efficiency and ensure high fidelity. Extensive experiments demonstrate that our implicit motion representation and the proposed IM-Animation achieve superior or competitive generative performance compared with state-of-the-art methods.

[156] Adaptive Image Zoom-in with Bounding Box Transformation for UAV Object Detection

Tao Wang,Chenyu Lin,Chenwei Tang,Jizhe Zhou,Deng Xiong,Jianan Li,Jian Zhao,Jiancheng Lv

Main category: cs.CV

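TL;DR: This paper proposes ZoomDet, an adaptive zoom-in framework that uses non-uniform zooming and a corner-aligned bounding box transformation to improve small-object detection in UAV images; it is architecture-independent and adds only a small amount of latency.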

Details Motivation: Foreground objects in UAV images are typically small and sparsely distributed, which makes it hard to optimize general-purpose object detectors effectively. Method: Proposes a lightweight offset prediction scheme and a box-based zooming objective for efficient non-uniform zooming; designs a corner-aligned bounding box transformation that trains the detector in the zoomed space and maps predicted boxes back to the original space at inference. Result: On SeaDronesSee, ZoomDet with Faster R-CNN improves mAP by more than 8.4 absolute points with only about 3 ms of added latency; effectiveness is also verified on VisDrone and UAVDT. Conclusion: ZoomDet is a simple, efficient, architecture-independent adaptive zoom-in framework that substantially improves small-object detection accuracy in UAV images and has potential for practical deployment. Abstract: Detecting objects from UAV-captured images is challenging due to the small object size. In this work, a simple and efficient adaptive zoom-in framework is explored for object detection on UAV images. The main motivation is that the foreground objects are generally smaller and sparser than those in common scene images, which hinders the optimization of effective object detectors. We thus aim to zoom in adaptively on the objects to better capture object features for the detection task. To achieve the goal, two core designs are required: i) How to conduct non-uniform zooming on each image efficiently? ii) How to enable object detection training and inference with the zoomed image space? Correspondingly, a lightweight offset prediction scheme coupled with a novel box-based zooming objective is introduced to learn non-uniform zooming on the input image. Based on the learned zooming transformation, a corner-aligned bounding box transformation method is proposed. The method warps the ground-truth bounding boxes to the zoomed space to learn object detection, and warps the predicted bounding boxes back to the original space during inference. We conduct extensive experiments on three representative UAV object detection datasets, including VisDrone, UAVDT, and SeaDronesSee. The proposed ZoomDet is architecture-independent and can be applied to an arbitrary object detection architecture. Remarkably, on the SeaDronesSee dataset, ZoomDet offers more than 8.4 absolute gain of mAP with a Faster R-CNN model, with only about 3 ms additional latency. The code is available at https://github.com/twangnh/zoomdet_code.
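
The abstract describes warping ground-truth boxes into the zoomed space for training and warping predicted boxes back to the original space at inference. The sketch below illustrates that round trip under a simplifying assumption: the zoom is a separable, monotone, piecewise-linear map defined by knot positions per axis. ZoomDet learns its transformation from predicted offsets, so the knot-based map and the names `warp_1d` and `warp_box` are hypothetical stand-ins, not the released code.

```python
from bisect import bisect_right


def warp_1d(value, src_knots, dst_knots):
    """Piecewise-linear monotone map that sends src_knots[i] to dst_knots[i]."""
    # Locate the segment containing `value`, clamped to the valid range.
    i = min(max(bisect_right(src_knots, value) - 1, 0), len(src_knots) - 2)
    s0, s1 = src_knots[i], src_knots[i + 1]
    d0, d1 = dst_knots[i], dst_knots[i + 1]
    return d0 + (value - s0) * (d1 - d0) / (s1 - s0)


def warp_box(box, x_src, x_dst, y_src, y_dst):
    """Warp a box (x0, y0, x1, y1) by mapping its corners through the zoom."""
    x0, y0, x1, y1 = box
    return (
        warp_1d(x0, x_src, x_dst),
        warp_1d(y0, y_src, y_dst),
        warp_1d(x1, x_src, x_dst),
        warp_1d(y1, y_src, y_dst),
    )


# Example zoom: the central band [40, 60] is stretched to [20, 80] on both axes.
ORIGINAL_KNOTS = [0.0, 40.0, 60.0, 100.0]
ZOOMED_KNOTS = [0.0, 20.0, 80.0, 100.0]

ground_truth = (45.0, 45.0, 55.0, 55.0)
# Training direction: original space -> zoomed space.
zoomed = warp_box(ground_truth, ORIGINAL_KNOTS, ZOOMED_KNOTS,
                  ORIGINAL_KNOTS, ZOOMED_KNOTS)
# Inference direction: swap the knot lists to invert the map.
restored = warp_box(zoomed, ZOOMED_KNOTS, ORIGINAL_KNOTS,
                    ZOOMED_KNOTS, ORIGINAL_KNOTS)
```

Because the map is monotone along each axis, transforming only the corners keeps the box axis-aligned, and swapping the knot lists gives the exact inverse.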

[157] CA-YOLO: Cross Attention Empowered YOLO for Biomimetic Localization

Zhen Zhang,Qing Zhao,Xiuhe Li,Cheng Wang,Guoqiang Zhu,Yu Zhang,Yining Huo,Hongyi Yu,Yi Zhang

Main category: cs.CV

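TL;DR: This paper proposes a bionic stabilized localization system based on CA-YOLO. It adds a small-target detection head and a Characteristic Fusion Attention Mechanism (CFAM) to improve detection accuracy and small-target recognition, and designs a bionic pan-tilt tracking control strategy inspired by the Vestibulo-Ocular Reflex (VOR); experiments show average accuracy gains of 3.94% on COCO and 4.90% on VisDrone.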

Details Motivation: Existing target localization systems are limited in accuracy and in their ability to recognize small targets, and struggle to meet the demands of modern complex environments. Method: Proposes the CA-YOLO model, which embeds bionic modules (a small-target detection head and CFAM) into the YOLO backbone network; and designs a VOR-inspired bionic pan-tilt tracking control strategy that includes central positioning, stability optimization, adaptive coefficient adjustment, and intelligent recapture. Result: CA-YOLO improves average accuracy by 3.94% on COCO and 4.90% on VisDrone; time-sensitive target localization experiments validate the effectiveness and practicality of the system. Conclusion: The bionic stabilized localization system substantially improves target localization accuracy and small-target recognition, combining theoretical novelty with engineering practicality. Abstract: In modern complex environments, achieving accurate and efficient target localization is essential in numerous fields. However, existing systems often face limitations in both accuracy and the ability to recognize small targets. In this study, we propose a bionic stabilized localization system based on CA-YOLO, designed to enhance both target localization accuracy and small target recognition capabilities. Acting as the "brain" of the system, the target detection algorithm emulates the visual focusing mechanism of animals by integrating bionic modules into the YOLO backbone network. These modules include the introduction of a small target detection head and the development of a Characteristic Fusion Attention Mechanism (CFAM). Furthermore, drawing inspiration from the human Vestibulo-Ocular Reflex (VOR), a bionic pan-tilt tracking control strategy is developed, which incorporates central positioning, stability optimization, adaptive control coefficient adjustment, and an intelligent recapture function. The experimental results show that CA-YOLO outperforms the original model on standard datasets (COCO and VisDrone), with average accuracy metrics improved by 3.94% and 4.90%, respectively. Further time-sensitive target localization experiments validate the effectiveness and practicality of this bionic stabilized localization system.

[158] Evaluating Object-Centric Models beyond Object Discovery

Krishnakant Singh,Simone Schaub-Meyer,Stefan Roth

Main category: cs.CV

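TL;DR: This paper proposes a new evaluation framework that measures object-centric learning (OCL) models more comprehensively on joint complex reasoning and localization tasks, using instruction-tuned vision-language models (VLMs) for scalable benchmarking and introducing a unified metric to resolve the separate evaluation of localization and representation usefulness in existing benchmarks.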

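Details Motivation: Existing evaluation of OCL models is limited to object discovery and simple reasoning tasks (such as image classification), which does not reflect the practical value of structured representations for compositional generalization and OOD robustness; localization and representation usefulness are also often evaluated separately, leading to inconsistent assessments. Method: 1) Uses instruction-tuned VLMs as evaluators to measure, across multiple VQA datasets, how well OCL representations support complex reasoning; 2) designs a unified "where + what" evaluation task and metric that jointly measures localization accuracy and representation usefulness; 3) introduces a multi-feature reconstruction baseline as a reference. Result: The proposed framework exposes differences between OCL models in complex reasoning and precise localization more effectively, and shows that unified evaluation is more discriminative and consistent than traditional disjoint evaluation. Conclusion: OCL models aimed at compositional generalization and OOD robustness should be tested with a unified, scalable evaluation paradigm that covers both localization and semantic representation usefulness; instruction-tuned VLMs are a powerful tool for assessing the usefulness of OCL representations. Abstract: Object-centric learning (OCL) aims to learn structured scene representations that support compositional generalization and robustness to out-of-distribution (OOD) data. However, OCL models are often not evaluated regarding these goals. Instead, most prior work focuses on evaluating OCL models solely through object discovery and simple reasoning tasks, such as probing the representation via image classification. We identify two limitations in existing benchmarks: (1) They provide limited insights on the representation usefulness of OCL models, and (2) localization and representation usefulness are assessed using disjoint metrics. To address (1), we use instruction-tuned VLMs as evaluators, enabling scalable benchmarking across diverse VQA datasets to measure how well VLMs leverage OCL representations for complex reasoning tasks. To address (2), we introduce a unified evaluation task and metric that jointly assess localization (where) and representation usefulness (what), thereby eliminating inconsistencies introduced by disjoint evaluation. Finally, we include a simple multi-feature reconstruction baseline as a reference point.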

[159] Fine-Grained Cat Breed Recognition with Global Context Vision Transformer

Mowmita Parvin Hera,Md. Shahriar Mahmud Kallol,Shohanur Rahman Nirob,Md. Badsha Bulbul,Jubayer Ahmed,M. Zhourul Islam,Hazrat Ali,Mohammmad Farhad Bulbul

Main category: cs.CV

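TL;DR: This paper presents a cat breed image classification method based on the Global Context Vision Transformer (GCViT-Tiny) that reaches 92.00% test accuracy on a subset of the Oxford-IIIT Pet dataset, supporting the effectiveness of transformer architectures for fine-grained image classification.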

Details Motivation: Cat breed recognition is challenging because of subtle differences in fur patterns, facial structure, and similar traits, and therefore requires accurate fine-grained image classification. Method: Uses the GCViT-Tiny vision transformer with data augmentation (rotation, horizontal flipping, and brightness adjustment), trained and evaluated on a subset of the Oxford-IIIT Pet dataset. Result: The GCViT-Tiny model reaches 92.00% test accuracy and 94.54% validation accuracy, and an online Hugging Face demo is provided. Conclusion: Transformer architectures, GCViT in particular, show strong performance and generalization on fine-grained classification tasks such as cat breed recognition, with potential applications in veterinary diagnostics and animal shelter management. Abstract: Accurate identification of cat breeds from images is a challenging task due to subtle differences in fur patterns, facial structure, and color. In this paper, we present a deep learning-based approach for classifying cat breeds using a subset of the Oxford-IIIT Pet Dataset, which contains high-resolution images of various domestic breeds. We employed the tiny variant of the Global Context Vision Transformer architecture (GCViT-Tiny) for cat breed recognition. To improve model generalization, we used extensive data augmentation, including rotation, horizontal flipping, and brightness adjustment. Experimental results show that the GCViT-Tiny model achieved a test accuracy of 92.00% and validation accuracy of 94.54%. These findings highlight the effectiveness of transformer-based architectures for fine-grained image classification tasks. Potential applications include veterinary diagnostics, animal shelter management, and mobile-based breed recognition systems. We also provide a Hugging Face demo at https://huggingface.co/spaces/bfarhad/cat-breed-classifier.

[160] Beyond Core and Penumbra: Bi-Temporal Image-Driven Stroke Evolution Analysis

Md Sazidur Rahman,Kjersti Engan,Kathinka Dæhli Kurz,Mahdieh Khanmohammadi

Main category: cs.CV

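TL;DR: This paper proposes a bi-temporal analysis framework that combines multimodal imaging features (statistical, radiomic, and deep learning embeddings) from admission CTP and post-treatment follow-up DWI with a six-class ROI partition to reveal how stroke tissue phenotypes evolve over time and which tissue is salvageable.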

Details Motivation: Single time-point segmentation cannot capture the biological heterogeneity and temporal evolution of stroke tissue; a finer characterization of state transitions in ischemic tissue and of outcome prediction is needed. Method: Builds a bi-temporal (T1: admission CTP; T2: follow-up DWI) registration and ROI partition framework with six region classes, extracts statistical descriptors, GLCM radiomic features, and deep embeddings from mJ-Net and nnU-Net, and performs clustering and discriminative analysis in feature space. Result: In 18 patients with successful reperfusion, the feature space shows meaningful region-level clustering; features of penumbra regions differ significantly by final outcome (recovered vs. infarcted), whereas core regions show no such difference; the penumbra separation index of the mJ-Net embeddings differs significantly from zero. Conclusion: Encoder-derived feature manifolds reflect the underlying phenotypes and state transitions of stroke tissue, offering a new paradigm for imaging-based quantification of stroke evolution. Abstract: Computed tomography perfusion (CTP) at admission is routinely used to estimate the ischemic core and penumbra, while follow-up diffusion-weighted MRI (DWI) provides the definitive infarct outcome. However, single time-point segmentations fail to capture the biological heterogeneity and temporal evolution of stroke. We propose a bi-temporal analysis framework that characterizes ischemic tissue using statistical descriptors, radiomic texture features, and deep feature embeddings from two architectures (mJ-Net and nnU-Net). Bi-temporal refers to admission (T1) and post-treatment follow-up (T2). All features are extracted at T1 from CTP, with follow-up DWI aligned to ensure spatial correspondence. Manually delineated masks at T1 and T2 are intersected to construct six regions of interest (ROIs) encoding both initial tissue state and final outcome. Features were aggregated per region and analyzed in feature space. Evaluation on 18 patients with successful reperfusion demonstrated meaningful clustering of region-level representations. Regions classified as penumbra or healthy at T1 that ultimately recovered exhibited feature similarity to preserved brain tissue, whereas infarct-bound regions formed distinct groupings. Both baseline GLCM and deep embeddings showed a similar trend: penumbra regions exhibit features that are significantly different depending on final state, whereas this difference is not significant for core regions.
Deep feature spaces, particularly mJ-Net, showed strong separation between salvageable and non-salvageable tissue, with a penumbra separation index that differed significantly from zero (Wilcoxon signed-rank test). These findings suggest that encoder-derived feature manifolds reflect underlying tissue phenotypes and state transitions, providing insight into imaging-based quantification of stroke evolution.
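
The abstract states that T1 and T2 masks are intersected to form six ROIs encoding initial tissue state and final outcome, without listing the six classes. One reading, assumed here and not confirmed by the source, is three T1 states (core, penumbra, healthy) crossed with two T2 outcomes (infarct or spared). The sketch below builds that partition over flattened voxel lists; the label names and `build_rois` are hypothetical.

```python
T1_STATES = ("core", "penumbra", "healthy")  # assumed admission-time labels
T2_OUTCOMES = ("infarct", "spared")          # assumed follow-up outcomes


def build_rois(t1_labels, t2_infarct):
    """Group voxel indices by (T1 state, T2 outcome).

    t1_labels:  one T1 state name per voxel.
    t2_infarct: one boolean per voxel, True if inside the follow-up infarct.
    """
    rois = {(state, outcome): []
            for state in T1_STATES for outcome in T2_OUTCOMES}
    for index, (state, infarcted) in enumerate(zip(t1_labels, t2_infarct)):
        outcome = "infarct" if infarcted else "spared"
        rois[(state, outcome)].append(index)
    return rois
```

Per-region features would then be aggregated over each index list, for example by averaging the feature vectors of the voxels in `rois[("penumbra", "spared")]`.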

[161] LLM-Guided Diagnostic Evidence Alignment for Medical Vision-Language Pretraining under Limited Pairing

Huimin Yan,Liang Bai,Xian Yang,Long Chen

Main category: cs.CV

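TL;DR: This paper proposes LGDEA, which uses large language models to extract key diagnostic evidence from radiology reports and build a shared diagnostic evidence space, enabling evidence-level cross-modal alignment, reducing reliance on paired data, and delivering significant gains on several downstream tasks.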

Details Motivation: Existing CLIP-style medical vision-language pretraining methods rely on global or local alignment, but global alignment is easily dominated by non-diagnostic information and local alignment struggles to integrate key diagnostic evidence, so reliable diagnostic representations are hard to learn in medical settings with limited paired data. Method: Proposes LLM-Guided Diagnostic Evidence Alignment (LGDEA), which uses large language models to extract key diagnostic evidence from radiology reports, builds a shared diagnostic evidence space, performs evidence-level cross-modal alignment, and makes effective use of abundant unpaired medical images and reports. Result: Achieves consistent and significant improvements on phrase grounding, image-text retrieval, and zero-shot classification, even rivaling pretraining methods that rely on substantial paired data. Conclusion: By introducing alignment at the level of diagnostic evidence, LGDEA improves the generalization and practicality of medical vision-language models in low-resource settings and offers a new way to reduce dependence on scarce paired data. Abstract: Most existing CLIP-style medical vision-language pretraining methods rely on global or local alignment with substantial paired data. However, global alignment is easily dominated by non-diagnostic information, while local alignment fails to integrate key diagnostic evidence. As a result, learning reliable diagnostic representations becomes difficult, which limits their applicability in medical scenarios with limited paired data. To address this issue, we propose an LLM-Guided Diagnostic Evidence Alignment method (LGDEA), which shifts the pretraining objective toward evidence-level alignment that is more consistent with the medical diagnostic process. Specifically, we leverage LLMs to extract key diagnostic evidence from radiology reports and construct a shared diagnostic evidence space, enabling evidence-aware cross-modal alignment and allowing LGDEA to effectively exploit abundant unpaired medical images and reports, thereby substantially alleviating the reliance on paired data. Extensive experimental results demonstrate that our method achieves consistent and significant improvements on phrase grounding, image-text retrieval, and zero-shot classification, and even rivals pretraining methods that rely on substantial paired data.

[162] MUFASA: A Multi-Layer Framework for Slot Attention

Sebastian Bock,Leonie Schüßler,Krishnakant Singh,Simone Schaub-Meyer,Stefan Roth

Main category: cs.CV

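TL;DR: This paper proposes MUFASA, a lightweight plug-and-play framework that runs slot attention on multiple ViT feature layers and fuses the resulting slots, improving both segmentation performance and training convergence in unsupervised object-centric learning.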

Details Motivation: Existing unsupervised object-centric learning methods extract slots only from the last ViT layer, ignoring the rich semantic information in other layers. Method: MUFASA runs slot attention in parallel on multiple feature layers of the ViT encoder and uses a fusion strategy to aggregate the multi-layer slots into a unified object-centric representation. Result: Substantially improves unsupervised object segmentation on multiple datasets, setting a new state of the art, while speeding up training convergence with only minor inference overhead. Conclusion: Making full use of multi-layer ViT semantic features strengthens the expressiveness and practicality of slot attention models, and MUFASA offers a general, efficient way to improve OCL methods. Abstract: Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.

[163] Revealing the Semantic Selection Gap in DINOv3 through Training-Free Few-Shot Segmentation

Hussni Mohd Zakir,Eric Tatt Wei Ho

Main category: cs.CV

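TL;DR: This paper proposes FSSDINO, a training-free few-shot semantic segmentation method that uses frozen DINOv3 features, class prototypes, and Gram-matrix refinement to rival complex adaptation methods on multiple benchmarks; it finds that last-layer features are strong but not optimal, exposing a "Semantic Selection Gap" in foundation models.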

Details Motivation: To explore the intrinsic few-shot semantic segmentation ability of frozen self-supervised ViTs such as DINOv3, test whether a minimal approach without fine-tuning or adaptation can be competitive, and diagnose the bottleneck in the semantic potential of their feature representations. Method: Proposes FSSDINO, which directly uses frozen DINOv3 last-layer features, builds class-specific prototypes, and sharpens discriminability through Gram-matrix refinement; complements this with an Oracle-guided layer-wise analysis that systematically evaluates each layer's features and compares several unsupervised and support-guided selection strategies. Result: FSSDINO is highly competitive on binary, multi-class, and cross-domain FSS benchmarks, rivaling methods with complex decoders or test-time adaptation; the Oracle analysis shows a markedly higher performance ceiling in intermediate layers, yet every existing selection strategy underperforms the default last layer. Conclusion: Last-layer features are a deceptively strong baseline; current unsupervised feature selection mechanisms cannot reliably locate high-fidelity semantic layers, which exposes a widespread "Semantic Selection Gap" in foundation models and calls for more robust layer selection. Abstract: Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, utilizing class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the "Last-Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potentials in DINOv3. The code is publicly available at https://github.com/hussni0997/fssdino.
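
The abstract describes a training-free baseline built on class-specific prototypes over frozen features. The sketch below shows the generic prototype-matching idea only: masked average pooling over support patch features, then cosine-similarity labeling of query patches. It omits the Gram-matrix refinement, and the use of an explicit background prototype is an assumption. The function names are hypothetical and are not taken from the FSSDINO repository.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def masked_mean(features, mask, keep):
    """Average the feature vectors whose mask value equals `keep`."""
    selected = [f for f, m in zip(features, mask) if m == keep]
    return [sum(col) / len(selected) for col in zip(*selected)]


def segment(query_feats, support_feats, support_mask):
    """Label each query patch 1 (foreground) or 0 (background).

    Prototypes come from the support image; no parameters are trained.
    """
    foreground = masked_mean(support_feats, support_mask, 1)
    background = masked_mean(support_feats, support_mask, 0)
    return [1 if cosine(q, foreground) > cosine(q, background) else 0
            for q in query_feats]
```

In practice the per-patch features would come from a frozen backbone layer; here they are plain lists so the matching step can be read in isolation.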

[164] FlexID: Training-Free Flexible Identity Injection via Intent-Aware Modulation for Text-to-Image Generation

Guandong Li,Yijun Ding

Main category: cs.CV

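TL;DR: FlexID is a training-free personalized text-to-image generation framework that uses intent-aware modulation, comprising a Semantic Identity Projector (SIP), a Visual Feature Anchor (VFA), and Context-Aware Adaptive Gating (CAG), to decouple and jointly optimize identity fidelity and textual adaptability.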

Details Motivation: Existing training-free methods face a conflict between identity fidelity and text editability and struggle to achieve both. Method: Proposes the FlexID framework, in which a Semantic Identity Projector (SIP) injects high-level priors into the language space, a Visual Feature Anchor (VFA) preserves structural fidelity in the latent space, and Context-Aware Adaptive Gating (CAG) dynamically adjusts the weights of the two streams according to editing intent and diffusion timestep. Result: On the IBench benchmark, FlexID achieves a state-of-the-art balance between identity consistency and text adherence and supports complex narrative generation. Conclusion: By orthogonally decoupling the identity representation and applying intent-driven dynamic modulation, FlexID eases the inherent tension between identity fidelity and semantic flexibility in training-free personalized generation. Abstract: Personalized text-to-image generation aims to seamlessly integrate specific identities into textual descriptions. However, existing training-free methods often rely on rigid visual feature injection, creating a conflict between identity fidelity and textual adaptability. To address this, we propose FlexID, a novel training-free framework utilizing intent-aware modulation. FlexID orthogonally decouples identity into two dimensions: a Semantic Identity Projector (SIP) that injects high-level priors into the language space, and a Visual Feature Anchor (VFA) that ensures structural fidelity within the latent space. Crucially, we introduce a Context-Aware Adaptive Gating (CAG) mechanism that dynamically modulates the weights of these streams based on editing intent and diffusion timesteps. By automatically relaxing rigid visual constraints when strong editing intent is detected, CAG achieves synergy between identity preservation and semantic variation. Extensive experiments on IBench demonstrate that FlexID achieves a state-of-the-art balance between identity consistency and text adherence, offering an efficient solution for complex narrative generation.

[165] VISOR: VIsual Spatial Object Reasoning for Language-driven Object Navigation

Francesco Taioli,Shiping Yang,Sonia Raychaudhuri,Marco Cristani,Unnat Jain,Angel X Chang

Main category: cs.CV

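TL;DR: This paper proposes a compact 3B-parameter Vision-Language-Action (VLA) agent that performs human-like embodied reasoning through explicit image-grounded reasoning for language-driven object navigation, improving explainability, generalization, and navigation efficiency.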

Details Motivation: Existing methods suffer from poor generalization, a lack of action-level explainability, error propagation, high computational cost, and difficulty integrating reasoning into the navigation policy. Method: Proposes a 3B-parameter Vision-Language-Action (VLA) agent that uses three-stage explicit image-grounded reasoning ("think", "think summary", and "action") to directly answer "Is this the target object?" and "Why should I take this action?", replacing end-to-end embedding matching and stitched multi-model pipelines. Result: Achieves stronger generalization, better action-level explainability, and more efficient navigation. Conclusion: Through structured, image-grounded reasoning, the proposed VLA agent overcomes key shortcomings of existing language-driven navigation methods and offers a more reliable, explainable, and efficient paradigm for embodied AI. Abstract: Language-driven object navigation requires agents to interpret natural language descriptions of target objects, which combine intrinsic and extrinsic attributes for instance recognition and commonsense navigation. Existing methods either (i) use end-to-end trained models with vision-language embeddings, which struggle to generalize beyond training data and lack action-level explainability, or (ii) rely on modular zero-shot pipelines with large language models (LLMs) and open-set object detectors, which suffer from error propagation, high computational cost, and difficulty integrating their reasoning back into the navigation policy. To this end, we propose a compact 3B-parameter Vision-Language-Action (VLA) agent that performs human-like embodied reasoning for both object recognition and action selection, removing the need for stitched multi-model pipelines. Instead of raw embedding matching, our agent employs explicit image-grounded reasoning to directly answer "Is this the target object?" and "Why should I take this action?" The reasoning process unfolds in three stages: "think", "think summary", and "action", yielding improved explainability, stronger generalization, and more efficient navigation. Code and dataset available upon acceptance.

[166] SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens

Xiaoyan Zhang,Zechen Bai,Haofan Wang,Yiren Song

Main category: cs.CV

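TL;DR: This paper proposes the SIGMA framework, which introduces selective multi-attribute tokens (style, content, subject, identity) to enable interleaved multi-condition generation in diffusion transformers, significantly improving controllability, cross-condition consistency, and visual quality.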

Details Motivation: Existing unified models such as Bagel support only single-condition inputs and lack the flexibility to synthesize results from multiple heterogeneous sources. Method: Proposes SIGMA, a unified post-training framework that introduces selective multi-attribute tokens and post-trains the Bagel backbone on 700K interleaved examples. Result: SIGMA substantially outperforms Bagel on compositional editing, selective attribute transfer, and fine-grained multimodal alignment, improving controllability, cross-condition consistency, and visual quality. Conclusion: SIGMA enables interleaved multi-condition generation in diffusion transformers, offering a more flexible and controllable paradigm for unified visual generation models. Abstract: Recent unified models such as Bagel demonstrate that paired image-edit data can effectively align multiple visual tasks within a single diffusion transformer. However, these models remain limited to single-condition inputs and lack the flexibility needed to synthesize results from multiple heterogeneous sources. We present SIGMA (Selective-Interleaved Generation with Multi-Attribute Tokens), a unified post-training framework that enables interleaved multi-condition generation within diffusion transformers. SIGMA introduces selective multi-attribute tokens, including style, content, subject, and identity tokens, which allow the model to interpret and compose multiple visual conditions in an interleaved text-image sequence. Through post-training on the Bagel unified backbone with 700K interleaved examples, SIGMA supports compositional editing, selective attribute transfer, and fine-grained multimodal alignment. Extensive experiments show that SIGMA improves controllability, cross-condition consistency, and visual quality across diverse editing and generation tasks, with substantial gains over Bagel on compositional tasks.

[167] Human Identification at a Distance: Challenges, Methods and Results on the Competition HID 2025

Jingzhe Ma,Meng Zhang,Jianlong Yu,Kun Liu,Zunxiao Xu,Xue Cheng,Junjie Zhou,Yanfei Wang,Jiahang Li,Zepeng Wang,Kazuki Osamura,Rujie Liu,Narishige Abe,Jingjie Wang,Shunli Zhang,Haojun Xie,Jiajun Wu,Weiming Wu,Wenxiong Kang,Qingshuo Gao,Jiaming Xiong,Xianye Ben,Lei Chen,Lichen Song,Junjian Cui,Haijun Xiong,Junhao Lu,Bin Feng,Mengyuan Liu,Ji Zhou,Baoquan Zhao,Ke Xu,Yongzhen Huang,Liang Wang,Manuel J Marin-Jimenez,Md Atiqur Rahman Ahad,Shiqi Yu

Main category: cs.CV

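TL;DR: This paper reviews the International Competition on Human Identification at a Distance (HID), focusing on the 2025 edition, which used the highly challenging SUSTech-Competition dataset. With no dedicated training data and a strong demand for cross-domain generalization, the best entry reached 94.2% accuracy, setting a new benchmark on the dataset; the paper also analyzes technical trends and future directions.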

Details Motivation: To advance gait recognition research and provide a fair, challenging evaluation platform, particularly for the recognition difficulties caused by real-world variation in clothing, carried objects, and view angle. Method: Within the HID competition framework, the SUSTech-Competition dataset is used with no dedicated training data, and a different random seed is applied each year to generate the evaluation splits in order to assess cross-domain generalization; participants train models on external data and submit results. Result: The best method in HID 2025 reached 94.2% recognition accuracy, surpassing the accuracy limits observed in HID 2023 and HID 2024; the paper also summarizes current mainstream technical trends. Conclusion: Continued algorithmic progress can break through previous performance ceilings; the SUSTech-Competition dataset has become an important benchmark for the robustness and generalization of gait recognition; future research should pay further attention to occlusion, few-shot adaptation, and fusion of heterogeneous multi-source data. Abstract: Human identification at a distance (HID) is challenging because traditional biometric modalities such as face and fingerprints are often difficult to acquire in real-world scenarios. Gait recognition provides a practical alternative, as it can be captured reliably at a distance. To promote progress in gait recognition and provide a fair evaluation platform, the International Competition on Human Identification at a Distance (HID) has been organized annually since 2020. Since 2023, the competition has adopted the challenging SUSTech-Competition dataset, which features substantial variations in clothing, carried objects, and view angles. No dedicated training data are provided, requiring participants to train their models using external datasets. Each year, the competition applies a different random seed to generate distinct evaluation splits, which reduces the risk of overfitting and supports a fair assessment of cross-domain generalization. While HID 2023 and HID 2024 already used this dataset, HID 2025 explicitly examined whether algorithmic advances could surpass the accuracy limits observed previously. Despite the heightened difficulty, participants achieved further improvements, and the best-performing method reached 94.2% accuracy, setting a new benchmark on this dataset. We also analyze key technical trends and outline potential directions for future research in gait recognition.

[168] Cross-Camera Cow Identification via Disentangled Representation Learning

Runcheng Wang,Yaru Chen,Guiguo Zhang,Honghua Jiang,Yongliang Qiao

Main category: cs.CV

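TL;DR: This paper proposes a cross-camera framework for identifying individual cows based on disentangled representation learning. Drawing on subspace identifiability theory, it decomposes images into multiple orthogonal latent subspaces to isolate biometric features that are stable across cameras, and reaches an average accuracy of 86.0% on a dataset spanning five camera nodes, significantly outperforming existing methods.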

Details Motivation: Existing cow identification methods perform well with a single camera, but their performance degrades severely in cross-camera deployment because of differences in illumination, background, viewpoint, and imaging properties, which limits large-scale use in real, dynamic farm environments. Method: Based on the Subspace Identifiability Guarantee (SIG) theory, builds a principle-driven feature disentanglement module that decomposes images into multiple orthogonal latent subspaces to isolate identity-related biometric features that are invariant across cameras; also constructs a high-quality dataset covering five heterogeneous cameras with varied lighting and angles. Result: Average identification accuracy of 86.0% across seven cross-camera tasks, significantly outperforming the source-only baseline (51.9%) and the strongest cross-camera baseline (79.8%). Conclusion: The study establishes a subspace-theory-driven feature disentanglement framework for collaborative cross-camera cow identification, offering a new paradigm for precise animal monitoring in uncontrolled smart farming environments. Abstract: Precise identification of individual cows is a fundamental prerequisite for comprehensive digital management in smart livestock farming. While existing animal identification methods excel in controlled, single-camera settings, they face severe challenges regarding cross-camera generalization. When models trained on source cameras are deployed to new monitoring nodes characterized by divergent illumination, backgrounds, viewpoints, and heterogeneous imaging properties, recognition performance often degrades dramatically. This limits the large-scale application of non-contact technologies in dynamic, real-world farming environments. To address this challenge, this study proposes a cross-camera cow identification framework based on disentangled representation learning. This framework leverages the Subspace Identifiability Guarantee (SIG) theory in the context of bovine visual recognition. By modeling the underlying physical data generation process, we designed a principle-driven feature disentanglement module that decomposes observed images into multiple orthogonal latent subspaces. This mechanism effectively isolates stable, identity-related biometric features that remain invariant across cameras, thereby substantially improving generalization to unseen cameras. We constructed a high-quality dataset spanning five distinct camera nodes, covering heterogeneous acquisition devices and complex variations in lighting and angles. 
Extensive experiments across seven cross-camera tasks demonstrate that the proposed method achieves an average accuracy of 86.0%, significantly outperforming the Source-only Baseline (51.9%) and the strongest cross-camera baseline method (79.8%). This work establishes a subspace-theoretic feature disentanglement framework for collaborative cross-camera cow identification, offering a new paradigm for precise animal monitoring in uncontrolled smart farming environments.

[169] Visualizing the Invisible: Enhancing Radiologist Performance in Breast Mammography via Task-Driven Chromatic Encoding

Hui Ye,Shilong Yang,Yexuan Xing,Juan Yu,Yaoqin Xie,Wei Zhang,Chulong Zhang

Main category: cs.CV

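TL;DR: This paper proposes the MammoColor framework, whose Task-Driven Chromatic Encoding (TDCE) module converts single-channel mammograms into color views with enhanced visual information, improving BI-RADS triage performance in dense breasts and reader specificity.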

Details Motivation: Mammography screening is less sensitive in dense breasts, where tissue overlap and subtle findings make perception difficult. Method: Proposes the end-to-end MammoColor framework, which couples a lightweight TDCE module with a three-level BI-RADS triage classifier, trained end-to-end on VinDr-Mammo and evaluated on internal, public, and external clinical cohorts; a multi-reader, multi-case (MRMC) controlled study compares grayscale images, TDCE images, and side-by-side display of both. Result: On VinDr-Mammo, AUC improved from 0.7669 to 0.8461 (P=0.004), and in dense breasts from 0.749 to 0.835; in the MRMC study, TDCE raised specificity from 0.90 to 0.96 (P=0.052) with comparable sensitivity. Conclusion: TDCE provides a task-optimized chromatic representation that may enhance the perceptual salience of lesions and reduce false-positive recalls in mammography triage. Abstract: Purpose: Mammography screening is less sensitive in dense breasts, where tissue overlap and subtle findings increase perceptual difficulty. We present MammoColor, an end-to-end framework with a Task-Driven Chromatic Encoding (TDCE) module that converts single-channel mammograms into TDCE-encoded views for visual augmentation. Materials and Methods: MammoColor couples a lightweight TDCE module with a BI-RADS triage classifier and was trained end-to-end on VinDr-Mammo. Performance was evaluated on an internal test set, two public datasets (CBIS-DDSM and INBreast), and three external clinical cohorts. We also conducted a multi-reader, multi-case (MRMC) observer study with a washout period, comparing (1) grayscale-only, (2) TDCE-only, and (3) side-by-side grayscale+TDCE. Results: On VinDr-Mammo, MammoColor improved AUC from 0.7669 to 0.8461 (P=0.004). Gains were larger in dense breasts (AUC 0.749 to 0.835). In the MRMC study, TDCE-encoded images improved specificity (0.90 to 0.96; P=0.052) with comparable sensitivity. Conclusion: TDCE provides a task-optimized chromatic representation that may improve perceptual salience and reduce false-positive recalls in mammography triage.

[170] ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

Wenjie Liu,Hao Wu,Xin Qiu,Yingqi Fan,Yihan Zhang,Anhao Zhao,Yunpu Ma,Xiaoyu Shen

Main category: cs.CV

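TL;DR: ViCA is a new multimodal large language model architecture that applies vision-text cross-attention only at selected layers and skips self-attention and feed-forward computation for visual tokens, greatly reducing visual-side compute while preserving 98% of baseline accuracy.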

Details Motivation: Existing MLLMs apply unified self-attention over visual and text tokens at every layer, making visual processing computationally expensive; the authors find that visual embeddings are already well aligned with the language space and that effective cross-modal interaction occurs in only a few layers. Method: Proposes the ViCA (Vision-only Cross-Attention) architecture, in which visual tokens bypass all self-attention and feed-forward layers and interact with text only through sparse cross-attention at a few selected layers. Result: Evaluated across 3 backbones and 9 benchmarks against 26 pruning-based baselines: ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%; it yields over 3.5x speedup in single-batch inference and over 10x in multi-batch inference; it is orthogonal to token pruning methods and can be combined with them. Conclusion: Dense visual self-attention is not necessary; ViCA balances performance and efficiency with a minimal design, provides a regular, hardware-friendly inference pipeline, and offers a new paradigm for lightweight multimodal models. Abstract: Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
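
The data flow described for ViCA can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cross_attend`, `vica_like_forward`) are hypothetical, the attention is single-head with identity projections instead of learned weights, and the text-side self-attention and feed-forward blocks are omitted. It only shows the key property that visual tokens are never updated, while text tokens read from them at a chosen subset of layers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(text, visual):
    """Single-head cross-attention: text rows are queries, visual rows are keys and values.

    Assumption: identity projections (no learned W_q, W_k, W_v), for illustration only.
    """
    d = len(text[0])
    updated = []
    for q in text:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in visual]
        weights = softmax(scores)
        context = [sum(w * v[i] for w, v in zip(weights, visual)) for i in range(d)]
        # Residual update applied to the text token only.
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated

def vica_like_forward(text, visual, num_layers, cross_layers):
    """Run a stack of layers in which visual tokens are never recomputed."""
    for layer in range(num_layers):
        # Text self-attention and feed-forward would run here (omitted in this sketch).
        if layer in cross_layers:
            text = cross_attend(text, visual)
    return text, visual

text_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
out_text, out_visual = vica_like_forward(text_tokens, visual_tokens, num_layers=4, cross_layers={1, 3})
print(out_visual == visual_tokens)  # visual tokens pass through unchanged
```

Because the visual tokens are fixed after projection, the per-layer cost on the visual side reduces to the cross-attention reads at the selected layers, which is the intuition behind the reported compute savings.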

[171] Automated rock joint trace mapping using a supervised learning model trained on synthetic data generated by parametric modelling

Jessica Ka Yi Chiu,Tom Frode Hansen,Eivind Magnus Paulsen,Ole Jakob Mengshoel

Main category: cs.CV

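TL;DR: This paper proposes a geology-driven machine learning method that combines geological modelling, synthetic data generation, and supervised image segmentation to automatically extract rock joint traces from images. The method performs well even when real data are scarce and labels are inconsistent, and fine-tuning on a small amount of real data in particular yields useful generalisation.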

Details Motivation: To address the scarcity of real annotated data, class imbalance, and noisy field labels (e.g. biased, incomplete, and inconsistent labels in the slope domain) in image-based rock joint mapping. Method: (1) Generate synthetic jointed rock images at field-relevant scales from discrete fracture network (DFN) models, preserving joint persistence, connectivity, and node-type distributions; (2) train semantic segmentation models using mixed training (synthetic plus real data) and a pretrain-then-fine-tune strategy. Result: Synthetic data can effectively support joint trace detection; mixed training performs well in the box domain, where labels are consistent, while fine-tuning is more robust in the slope domain, where labels are noisy; fine-tuning with only a small amount of real data gives useful generalisation; qualitative analysis shows clearer and more geologically meaningful results than the quantitative metrics indicate. Conclusion: The geology-driven framework of synthetic data plus fine-tuning improves the reliability and practicality of automated joint mapping and lays a foundation for further work on domain adaptation and evaluation. Abstract: This paper presents a geology-driven machine learning method for automated rock joint trace mapping from images. The approach combines geological modelling, synthetic data generation, and supervised image segmentation to address limited real data and class imbalance. First, discrete fracture network models are used to generate synthetic jointed rock images at field-relevant scales via parametric modelling, preserving joint persistence, connectivity, and node-type distributions. Second, segmentation models are trained using mixed training and pretraining followed by fine-tuning on real images. The method is tested in box and slope domains using several real datasets. The results show that synthetic data can support supervised joint trace detection when real data are scarce. Mixed training performs well when real labels are consistent (e.g. box-domain), while fine-tuning is more robust when labels are noisy (e.g. slope-domain where labels can be biased, incomplete, and inconsistent). Fully zero-shot prediction from the synthetic model remains limited, but useful generalisation is achieved by fine-tuning with a small amount of real data. Qualitative analysis shows clearer and more geologically meaningful joint traces than indicated by quantitative metrics alone. The proposed method supports reliable joint mapping and provides a basis for further work on domain adaptation and evaluation.

[172] TeleBoost: A Systematic Alignment Framework for High-Fidelity, Controllable, and Robust Video Generation

Yuanzhi Liang,Xuan'er Wu,Yirui Liu,Yijie Fang,Yizhen Fan,Ke Hao,Rui Li,Ruiying Liu,Ziqi Ni,Peng Yu,Yanbo Wang,Haibin Huang,Qizhen Weng,Chi Zhang,Xuelong Li

Main category: cs.CV

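TL;DR: This paper proposes a systematic post-training framework for video generation models that integrates supervised policy shaping, reward-driven reinforcement learning, and preference-based optimization to improve perceptual fidelity, temporal coherence, and prompt adherence under stability constraints.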

Details Motivation: To address challenges that video generation models face in real deployment, including high rollout cost, temporally compounding failures, and heterogeneous, weakly discriminative feedback, and to achieve controllable, robust, long-horizon instruction following. Method: Builds a stability-constrained multi-stage optimization stack that integrates, in sequence, supervised policy shaping, reward-driven reinforcement learning, and preference-driven refinement, with optimization staged in a diagnostic-driven manner. Result: Improves the perceptual fidelity, temporal coherence, and prompt adherence of video generation models while preserving the controllability established at initialization, and provides a clear blueprint for scalable, stable, and practical post-training pipelines. Conclusion: Treating post-training as a diagnostic-driven, systematic process rather than a collection of isolated tricks is key to building high-quality, production-oriented video generation models. Abstract: Post-training is the decisive step for converting a pretrained video generator into a production-oriented model that is instruction-following, controllable, and robust over long temporal horizons. This report presents a systematic post-training framework that organizes supervised policy shaping, reward-driven reinforcement learning, and preference-based refinement into a single stability-constrained optimization stack. The framework is designed around practical video-generation constraints, including high rollout cost, temporally compounding failure modes, and feedback that is heterogeneous, uncertain, and often weakly discriminative. By treating optimization as a staged, diagnostic-driven process rather than a collection of isolated tricks, the report summarizes a cohesive recipe for improving perceptual fidelity, temporal coherence, and prompt adherence while preserving the controllability established at initialization. The resulting framework provides a clear blueprint for building scalable post-training pipelines that remain stable, extensible, and effective in real-world deployment settings.

[173] Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

Hulingxiao He,Zijun Geng,Yuxin Peng

Main category: cs.CV

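TL;DR: This paper proposes Fine-R1, a multimodal large language model tailored for fine-grained visual recognition (FGVR). Using chain-of-thought supervised fine-tuning and Triplet Augmented Policy Optimization, it surpasses existing MLLMs and CLIP models with only 4-shot training and markedly improves generalization to both seen and unseen sub-categories.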

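Details Motivation: Existing MLLMs perform poorly on fine-grained visual recognition (FGVR): they depend on large amounts of annotated data, overfit to seen sub-categories, and generalize poorly, while dedicated contrastive models such as CLIP are strong but lack reasoning ability. Method: Proposes an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning (CoT-SFT), which builds a high-quality FGVR CoT dataset with four-step rationales of "visual analysis, candidate sub-categories, comparison, and prediction"; (2) Triplet Augmented Policy Optimization, comprising Intra-class Augmentation (mixing trajectories from anchor and positive images of the same category) to improve robustness to intra-class variance and Inter-class Augmentation (maximizing the response distinction across images of different sub-categories) to strengthen discriminative ability. Result: With only 4-shot training, Fine-R1 outperforms general MLLMs, reasoning MLLMs, and contrastive CLIP models on both seen and unseen sub-categories, showing strong zero-shot and few-shot generalization. Conclusion: Fine-R1 shows that structured reasoning training combined with augmentation strategies can effectively improve MLLM performance and generalization on knowledge-intensive fine-grained recognition, offering a new paradigm for low-resource FGVR. Abstract: Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated to discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transitioning the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. 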
With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise in working in knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous. Code is available at https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026.

[174] HistoMet: A Pan-Cancer Deep Learning Framework for Prognostic Prediction of Metastatic Progression and Site Tropism from Primary Tumor Histopathology

Yixin Chen,Ziyu Su,Lingbin Meng,Elshad Hasanov,Wei Chen,Anil Parwani,M. Khalid Khan Niazi

Main category: cs.CV

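TL;DR: This paper proposes HistoMet, a decision-aware, concept-aligned multiple instance learning framework that predicts the metastatic risk and metastatic site of primary tumors from whole-slide images, improving clinical predictive performance and interpretability.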

Details Motivation: Existing computational pathology methods usually treat metastatic status or site prediction as isolated tasks; they do not model the sequential clinical decision process of first assessing metastatic risk and then determining the specific site, and they offer little support for clinical interpretability. Method: Proposes a two-module decision pipeline that first estimates the likelihood of metastasis from the primary tumor and then conditionally predicts the metastatic site for high-risk cases; a pretrained pathology vision-language model is used to incorporate linguistically defined and data-adaptive metastasis-related concepts that guide representation learning. Result: Validated on a multi-institutional pan-cancer cohort of 6504 patients: under high-sensitivity screening at 95% sensitivity, it significantly reduces downstream workload while maintaining high-risk recall; conditional on metastatic cases, it achieves a macro F1 of 74.6 (±1.3) and a macro one-vs-rest AUC of 92.1. Conclusion: Explicitly modeling the clinical decision structure enables robust, deployable prediction of metastatic progression and site tropism directly from primary tumor histopathology images. Abstract: Metastatic progression remains the leading cause of cancer-related mortality, yet predicting whether a primary tumor will metastasize and where it will disseminate directly from histopathology remains a fundamental challenge. Although whole-slide images (WSIs) provide rich morphological information, prior computational pathology approaches typically address metastatic status or site prediction as isolated tasks, and do not explicitly model the clinically sequential decision process of metastatic risk assessment followed by downstream site-specific evaluation. To address this research gap, we present a decision-aware, concept-aligned MIL framework, HistoMet, for prognostic metastatic outcome prediction from primary tumor WSIs. Our proposed framework adopts a two-module prediction pipeline in which the likelihood of metastatic progression from the primary tumor is first estimated, followed by conditional prediction of metastatic site for high-risk cases. To guide representation learning and improve clinical interpretability, our framework integrates linguistically defined and data-adaptive metastatic concepts through a pretrained pathology vision-language model. We evaluate HistoMet on a multi-institutional pan-cancer cohort of 6504 patients with metastasis follow-up and site annotations. Under clinically relevant high-sensitivity screening settings (95 percent sensitivity), HistoMet significantly reduces downstream workload while maintaining high metastatic risk recall. Conditional on metastatic cases, HistoMet achieves a macro F1 of 74.6 with a standard deviation of 1.3 and a macro one-vs-rest AUC of 92.1. 
These results demonstrate that explicitly modeling clinical decision structure enables robust and deployable prognostic prediction of metastatic progression and site tropism directly from primary tumor histopathology.
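
The "high-sensitivity screening" operating point mentioned in the abstract can be made concrete with a small sketch. This is only an illustration under stated assumptions, not HistoMet's code: the helper names (`threshold_for_sensitivity`, `triage`) and the toy scores are hypothetical. It picks the largest risk threshold that still flags a target fraction of truly metastatic cases, then passes only the flagged cases on to a second-stage site predictor.

```python
import math

def threshold_for_sensitivity(scores, labels, target=0.95):
    """Largest threshold t such that flagging score >= t recovers at least
    `target` of the positive (label == 1) cases."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive case")
    needed = math.ceil(target * len(positives))  # positives that must be flagged
    return positives[needed - 1]

def triage(scores, threshold):
    """Indices of cases routed to the downstream site-prediction stage."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# Hypothetical stage-one risk scores and ground-truth metastasis labels.
risk_scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1, 0.3, 0.05]
metastasis = [1, 1, 1, 1, 0, 0, 0, 0]

t = threshold_for_sensitivity(risk_scores, metastasis, target=0.95)
flagged = triage(risk_scores, t)
print(f"threshold={t}, cases sent to stage two: {len(flagged)} of {len(risk_scores)}")
```

In practice the threshold would be chosen on a held-out validation set; the workload reduction is the share of cases that fall below it and never reach the second stage.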

[175] AD-MIR: Bridging the Gap from Perception to Persuasion in Advertising Video Understanding via Structured Reasoning

Binxiao Xu,Junyu Feng,Xiaopeng Lin,Haodong Li,Zhiyuan Feng,Bohan Zeng,Shaolin Lu,Ming Lu,Qi She,Wentao Zhang

Main category: cs.CV

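TL;DR: This paper proposes AD-MIR, a two-stage framework of structured memory construction followed by a structured reasoning agent, which links pixel-level perception of advertising videos to high-level marketing logic and improves advertising intent understanding.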

Details Motivation: Existing agents excel at general retrieval but struggle to bridge the cognitive gap between pixel-level perception and high-level marketing logic in advertising videos. Method: AD-MIR uses a two-stage architecture: the first stage, Structure-Aware Memory Construction, combines semantic retrieval with exact keyword matching to extract brand details and filter out background noise; the second stage, the Structured Reasoning Agent, decomposes the narrative through iterative inquiry, infers implicit persuasion tactics, and self-corrects against evidence from video frames. Result: On the AdsQA benchmark, AD-MIR surpasses the strongest general-purpose agent, DVD, by 1.8% in strict accuracy and 9.5% in relaxed accuracy, reaching state-of-the-art performance. Conclusion: Effective advertising understanding requires explicitly anchoring abstract marketing strategies in pixel-level visual evidence. Abstract: Multimodal understanding of advertising videos is essential for interpreting the intricate relationship between visual storytelling and abstract persuasion strategies. However, despite excelling at general search, existing agents often struggle to bridge the cognitive gap between pixel-level perception and high-level marketing logic. To address this challenge, we introduce AD-MIR, a framework designed to decode advertising intent via a two-stage architecture. First, in the Structure-Aware Memory Construction phase, the system converts raw video into a structured database by integrating semantic retrieval with exact keyword matching. This approach prioritizes fine-grained brand details (e.g., logos, on-screen text) while dynamically filtering out irrelevant background noise to isolate key protagonists. Second, the Structured Reasoning Agent mimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. Crucially, it employs an evidence-based self-correction mechanism that rigorously validates these insights against specific video frames, automatically backtracking when visual support is lacking. Evaluation on the AdsQA benchmark demonstrates that AD-MIR achieves state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy. These results underscore that effective advertising understanding demands explicitly grounding abstract marketing strategies in pixel-level evidence. The code is available at https://github.com/Little-Fridge/AD-MIR.

[176] Uncovering Modality Discrepancy and Generalization Illusion for General-Purpose 3D Medical Segmentation

Yichi Zhang,Feiyang Xiao,Le Xue,Wenbo Zhang,Gang Feng,Chenguang Zheng,Yuan Qi,Yuan Cheng,Zixin Hu

Main category: cs.CV

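TL;DR: This paper builds the UMD dataset (490 whole-body PET/CT and 464 whole-body PET/MRI scans) and presents the first systematic evaluation of how well 3D medical foundation models generalize between functional imaging (PET) and structural imaging (CT/MRI). Existing models degrade markedly on cross-modality tasks, showing that they remain far from truly general-purpose, and the authors call for a shift toward multi-modal training and evaluation.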

Details Motivation: Validation of existing 3D medical foundation models has focused mainly on structural imaging (e.g. CT, MRI) and has neglected key modalities such as functional imaging (e.g. PET), leading to severe modality bias and no objective assessment of real clinical robustness. Method: Curates UMD, a large-scale paired whole-body imaging dataset (PET/CT and PET/MRI); using intra-subject paired scans to control for other variables, it treats imaging modality as the primary independent variable and evaluates the cross-modality robustness of representative 3D segmentation foundation models. Result: Experiments reveal a large gap between literature-reported benchmarks and real-world performance; performance drops sharply when moving from the structural domain (CT/MRI) to the functional domain (PET), showing that current models lack true cross-modality generalization. Conclusion: Current 3D medical foundation models are far from general-purpose; a new paradigm centered on joint multi-modal training and evaluation is needed to close the gap between idealized benchmarking and actual clinical needs. Abstract: While emerging 3D medical foundation models are envisioned as versatile tools that offer general-purpose capabilities, their validation remains largely confined to regional and structural imaging, leaving a significant modality discrepancy unexplored. To provide a rigorous and objective assessment, we curate the UMD dataset comprising 490 whole-body PET/CT and 464 whole-body PET/MRI scans (~675k 2D images, ~12k 3D organ annotations) and conduct a thorough and comprehensive evaluation of representative 3D segmentation foundation models. Through intra-subject controlled comparisons of paired scans, we isolate imaging modality as the primary independent variable to evaluate model robustness in real-world applications. Our evaluation reveals a stark discrepancy between literature-reported benchmarks and real-world efficacy, particularly when transitioning from structural to functional domains. Such systemic failures underscore that current 3D foundation models are far from achieving truly general-purpose status, necessitating a paradigm shift toward multi-modal training and evaluation to bridge the gap between idealized benchmarking and comprehensive clinical utility. This dataset and analysis establish a foundational cornerstone for future research to develop truly modality-agnostic medical foundation models.

[177] From Dead Pixels to Editable Slides: Infographic Reconstruction into Native Google Slides via Vision-Language Region Understanding

Leonardo Gonzalez

Main category: cs.CV

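TL;DR: This paper presents Images2Slides, a system that uses a vision-language model to extract region-level structure from static infographics (PNG/JPG), maps it into the Google Slides coordinate system, and rebuilds it as an editable slide through the batch update API. It achieves a high element recovery rate on 29 synthetic test images, although layout fidelity leaves room for improvement.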

Details Motivation: Once an infographic is exported as an image, its content is no longer editable, which makes updates, localization, and reuse expensive. Method: Builds an API-driven pipeline: a vision-language model (VLM) first extracts region-level semantic structure (text, icons, charts, and so on) from the image; pixel coordinates are then mapped into Google Slides coordinates; finally, the Google Slides batch update API recreates a native, editable slide. The system is model-agnostic and supports multiple VLM backends through a common JSON region schema and deterministic postprocessing. Result: On a benchmark of 29 programmatically generated infographics, the overall element recovery rate is 0.989 ± 0.057 (text 0.985 ± 0.083, images 1.000 ± 0.000); mean text CER is 0.033 ± 0.149; IoU is 0.364 ± 0.161 for text regions and 0.644 ± 0.131 for image regions; the paper identifies engineering challenges such as text size calibration and non-uniform backgrounds, along with typical failure modes. Conclusion: Images2Slides automates the conversion of static infographics into editable slides and demonstrates the feasibility of VLM-driven region parsing followed by API reconstruction, but layout accuracy (especially text placement) still needs improvement, and the practical challenges it exposes give clear direction for future work. Abstract: Infographics are widely used to communicate information with a combination of text, icons, and data visualizations, but once exported as images their content is locked into pixels, making updates, localization, and reuse expensive. We describe Images2Slides, an API-based pipeline that converts a static infographic (PNG/JPG) into a native, editable Google Slides slide by extracting a region-level specification with a vision-language model (VLM), mapping pixel geometry into slide coordinates, and recreating elements using the Google Slides batch update API. The system is model-agnostic and supports multiple VLM backends via a common JSON region schema and deterministic postprocessing. On a controlled benchmark of 29 programmatically generated infographic slides with known ground-truth regions, Images2Slides achieves an overall element recovery rate of 0.989 ± 0.057 (text: 0.985 ± 0.083, images: 1.000 ± 0.000), with mean text transcription error CER = 0.033 ± 0.149 and mean layout fidelity IoU = 0.364 ± 0.161 for text regions and 0.644 ± 0.131 for image regions. We also highlight practical engineering challenges in reconstruction, including text size calibration and non-uniform backgrounds, and describe failure modes that guide future work.
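
Two steps named in the abstract, mapping pixel geometry into slide coordinates and scoring layout fidelity with IoU, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the function names are hypothetical, the mapping strategy (uniform scaling with centering) is one plausible choice, and the slide size of 9144000 x 5143500 EMU (10 in x 5.625 in at 914400 EMU per inch) is assumed to be the target page size.

```python
EMU_PER_INCH = 914400  # English Metric Units per inch
ASSUMED_SLIDE_SIZE_EMU = (10 * EMU_PER_INCH, 5143500)  # assumed 16:9 page, 10 in x 5.625 in

def pixel_box_to_emu(box, image_size, slide_size_emu=ASSUMED_SLIDE_SIZE_EMU):
    """Map an (x, y, width, height) pixel box to EMU, scaling the image
    uniformly to fit the slide and centering it (an assumed strategy)."""
    x, y, w, h = box
    img_w, img_h = image_size
    slide_w, slide_h = slide_size_emu
    scale = min(slide_w / img_w, slide_h / img_h)
    off_x = (slide_w - img_w * scale) / 2
    off_y = (slide_h - img_h * scale) / 2
    return (round(off_x + x * scale), round(off_y + y * scale),
            round(w * scale), round(h * scale))

def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A box in the middle of a 1920x1080 infographic, mapped onto the assumed slide.
print(pixel_box_to_emu((960, 540, 192, 108), (1920, 1080)))
# Two equal boxes offset by half their width overlap with IoU = 1/3.
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))
```

The second call shows why text regions can score a low IoU even when they look roughly right: a box that is the correct size but shifted by half its width already drops to one third.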

[178] Influence of Geometry, Class Imbalance and Alignment on Reconstruction Accuracy -- A Micro-CT Phantom-Based Evaluation

Avinash Kumar K M,Samarth S. Raut

Main category: cs.CV

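TL;DR: This paper evaluates how different segmentation algorithms (GMM, Otsu, RG) and geometry types (sphere, facemask, abdominal aortic aneurysm (AAA)) affect accuracy in the 3D reconstruction pipeline for medical scans. It compares voxel-based metrics (Dice, Jaccard, precision) with surface-based metrics (chamfer distance, Hausdorff distance) and finds that Otsu performs best overall, that Jaccard is better suited than Dice to thin-walled structures, and that voxel-based metrics are sensitive to class imbalance and alignment error.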

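Details Motivation: The accuracy of 3D medical models depends on many factors, including imaging hardware, segmentation methods, and mesh processing, but the effects of geometry type, class imbalance, and voxel and point cloud alignment have not been studied systematically. Method: Three geometries (sphere, facemask, AAA) were printed using SLA and scanned with micro-CT, then segmented using GMM, Otsu, and RG methods; alignment used the KU algorithm (voxel alignment) and the ICP algorithm (surface registration); Dice, Jaccard, and precision (voxel metrics) and chamfer distance and average Hausdorff distance (surface metrics) were computed. Result: Otsu was the most stable across all geometries; AAA had low overlap because of its thin wall and misalignment, and its specificity was most affected by class imbalance; surface metrics did not follow the voxel-metric trends; RG was best for the sphere, while GMM and Otsu were better for AAA; the facemask had the largest surface error, possibly because of ICP misalignment; Jaccard is more stringent than Dice and better suited to evaluating thin-walled structures. Conclusion: Segmentation accuracy reflects errors accumulated across the whole reconstruction pipeline; high voxel-based accuracy metrics can be misleading under class imbalance or poor alignment; reliable assessment requires accurate alignment at both the voxel and point cloud levels; the Jaccard index is better suited as an accuracy measure for thin-walled structures. Abstract: The accuracy of the 3D models created from medical scans depends on factors such as imaging hardware, segmentation methods, and mesh processing techniques. The effects of geometry type, class imbalance, voxel and point cloud alignment on accuracy remain to be thoroughly explored. This work evaluates the errors across the reconstruction pipeline and explores the use of voxel and surface-based accuracy metrics for different segmentation algorithms and geometry types. A sphere, a facemask, and an AAA were printed using the SLA technique and scanned using a micro-CT machine. Segmentation was performed using GMM, Otsu and RG based methods. Segmented and reference models, aligned using the KU algorithm, were quantitatively compared using metrics such as Dice score, Jaccard score, and precision. Surface meshes were registered with reference meshes using an ICP-based alignment process. Metrics such as chamfer distance and average Hausdorff distance were evaluated. The Otsu method was found to be the most suitable method for all the geometries. AAA yielded low overlap scores due to its small wall thickness and misalignment. The effect of class imbalance on specificity was observed the most for AAA. Surface-based accuracy metrics differed from the voxel-based trends. The RG method performed best for sphere, while GMM and Otsu performed better for AAA. The facemask surface was most error-prone, possibly due to misalignment during the ICP process. 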
Segmentation accuracy is a cumulative sum of errors across different stages of the reconstruction process. High voxel-based accuracy metrics may be misleading in cases of high class imbalance and sensitivity to alignment. The Jaccard index is found to be more stringent than the Dice and more suitable for accuracy assessment for thin-walled structures. Voxel and point cloud alignment should be ensured to make any reliable assessment of the reconstruction pipeline.
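
The claims about Jaccard being more stringent than Dice, and about specificity being inflated by class imbalance, follow from the metric definitions and can be checked on a toy case. The sketch below is not taken from the paper: the one-dimensional "walls", the one-voxel shift, and the volume of 1000 voxels are all illustrative assumptions. It uses the identity J = D / (2 - D), which implies J <= D for any pair of masks.

```python
def dice(ref, pred):
    """Dice score between two sets of voxel indices."""
    return 2 * len(ref & pred) / (len(ref) + len(pred))

def jaccard(ref, pred):
    """Jaccard index (intersection over union) between two voxel sets."""
    return len(ref & pred) / len(ref | pred)

def specificity(ref, pred, total_voxels):
    """True-negative rate over a volume with `total_voxels` voxels."""
    false_pos = len(pred - ref)
    true_neg = total_voxels - len(ref | pred)
    return true_neg / (true_neg + false_pos)

# Same one-voxel misalignment applied to a thin wall and to a thick block.
thin_ref, thin_pred = set(range(0, 2)), set(range(1, 3))
thick_ref, thick_pred = set(range(0, 20)), set(range(1, 21))

for name, ref, pred in [("thin", thin_ref, thin_pred), ("thick", thick_ref, thick_pred)]:
    d, j = dice(ref, pred), jaccard(ref, pred)
    s = specificity(ref, pred, total_voxels=1000)
    print(f"{name}: Dice={d:.3f} Jaccard={j:.3f} specificity={s:.3f}")
```

For the thin wall the same one-voxel shift gives a Dice of 0.5 and a Jaccard of about 0.333, against 0.95 and about 0.905 for the thick block, while specificity stays above 0.99 in both cases because the background dominates the volume.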

[179] Looking and Listening Inside and Outside: Multimodal Artificial Intelligence Systems for Driver Safety Assessment and Intelligent Vehicle Decision-Making

Ross Greer,Laura Fleig,Maitrayee Keskar,Erika Maquiling,Giovanni Tapia Lopez,Angel Martinez-Sanchez,Parthib Roy,Jake Rattigan,Mira Sur,Alejandra Vidrio,Thomas Marcotte,Mohan Trivedi

Main category: cs.CV

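TL;DR: This paper proposes the L-LIO framework, which adds the audio modality to the traditional LILO (looking-in-looking-out) framework to fuse multimodal perception of the driver, passengers, and people outside the vehicle and thereby improve intelligent vehicle safety. Three case studies illustrate the value of audio for driver state recognition, spoken instruction understanding, and disambiguating external interactions, and the paper discusses challenges including noise, privacy, and robustness.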

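Details Motivation: The existing LILO framework relies mainly on visual information and struggles with complex safety scenarios where vision is limited or semantic understanding is needed; audio can provide complementary, real-time, semantically rich information, with particular advantages for driver state assessment, natural human-vehicle interaction, and understanding the intent of external agents. Method: Proposes the L-LIO (Looking-and-Listening Inside-and-Outside) framework, which fuses in-vehicle and external audio with visual signals; collects custom in-vehicle and external audio datasets in real-world environments; and conducts three empirical studies: (1) classifying impairment states from driver speech; (2) collecting and analyzing passengers' natural language instructions; (3) comparing the effectiveness of audio and vision in disambiguating external guidance. Result: Pilot experiments show that audio provides safety-relevant insights, especially where visual information is insufficient or the context is complex (e.g. judging intoxication, parsing ambiguous spoken instructions, identifying the object a gesture points to); robustness is still limited by ambient noise, individual differences, and privacy concerns. Conclusion: By introducing audio, L-LIO extends the capabilities of the LILO framework and offers a new paradigm for multimodal safety intervention in intelligent vehicles; future work should focus on reliability and privacy protection in dynamic real-world environments. Abstract: The looking-in-looking-out (LILO) framework has enabled intelligent vehicle applications that understand both the outside scene and the driver state to improve safety outcomes, with examples in smart airbag deployment, takeover time prediction in autonomous control transitions, and driver attention monitoring. In this research, we propose an augmentation to this framework, making a case for the audio modality as an additional source of information to understand the driver, and in the evolving autonomy landscape, also the passengers and those outside the vehicle. We expand LILO by incorporating audio signals, forming the looking-and-listening inside-and-outside (L-LIO) framework to enhance driver state assessment and environment understanding through multimodal sensor fusion. We evaluate three example cases where audio enhances vehicle safety: supervised learning on driver speech audio to classify potential impairment states (e.g., intoxication), collection and analysis of passenger natural language instructions (e.g., "turn after that red building") to motivate how spoken language can interface with planning systems through audio-aligned instruction data, and limitations of vision-only systems where audio may disambiguate the guidance and gestures of external agents. Datasets include custom-collected in-vehicle and external audio samples in real-world environments. 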
Pilot findings show that audio yields safety-relevant insights, particularly in nuanced or context-rich scenarios where sound is critical to safe decision-making or visual signals alone are insufficient. Challenges include ambient noise interference, privacy considerations, and robustness across human subjects, motivating further work on reliability in dynamic real-world contexts. L-LIO augments driver and scene understanding through multimodal fusion of audio and visual sensing, offering new paths for safety intervention.

[180] Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning

Ross Greer,Maitrayee Keskar,Angel Martinez-Sanchez,Parthib Roy,Shashank Shriram,Mohan Trivedi

Main category: cs.CV

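TL;DR: This paper examines system-level uses of vision-language models (VLMs) for safety assessment and decision-making in autonomous driving through three complementary approaches: lightweight CLIP-based semantic hazard detection, task-aligned vision-language embeddings for trajectory planning, and natural language as a behavioral constraint on motion planning. The results indicate that VLMs improve safety through structured grounding and system design rather than simple feature injection.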

Details Motivation: 探索视觉-语言表征如何支持自动驾驶中感知、预测与规划环节的安全评估与决策,尤其针对语义风险识别、意图表达与行为约束等关键安全问题。 Method: 研究三种系统级用例:1)基于CLIP图像-文本相似度的类别无关危险筛查;2)将场景级视觉语言嵌入集成至Transformer轨迹规划器(Waymo数据集)并分析表征-任务对齐;3)利用自然语言指令(doScenes数据集)作为视觉接地的行为约束引导运动规划。 Result: 1)CLIP可实现低延迟、泛化性强的语义危险信号生成;2)全局嵌入直接注入规划器无效,凸显任务导向特征提取的必要性;3)乘客式自然语言指令能抑制严重规划失败并提升模糊场景下的安全行为。 Conclusion: 视觉-语言表征在自动驾驶安全中具有重要潜力,但其实现依赖于严谨的系统工程设计与结构化视觉-语言接地,而非简单特征融合。 Abstract: Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. 
Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
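
The first use case above turns image-text similarity into a scalar hazard signal. The sketch below illustrates that general idea only; it is not the paper's implementation. It assumes the image and prompt embeddings have already been produced by some CLIP-style encoder (here they are plain lists of floats), and the contrast against benign prompts, along with the names `cosine` and `hazard_signal`, is a hypothetical design choice made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hazard_signal(image_emb, hazard_prompt_embs, benign_prompt_embs):
    """Hypothetical score: best match to a hazard prompt minus best match
    to a benign prompt. Positive values lean toward 'hazard'."""
    hazard = max(cosine(image_emb, p) for p in hazard_prompt_embs)
    benign = max(cosine(image_emb, p) for p in benign_prompt_embs)
    return hazard - benign

# Toy 2-D embeddings standing in for encoder outputs (assumed values).
score = hazard_signal([1.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]])
print(score)
```

Because the score needs only a handful of dot products once the embeddings exist, a screen of this kind adds little latency on top of the encoder itself; where to set the decision threshold is left open here.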

[181] Process-of-Thought Reasoning for Videos

Jusheng Zhang,Kaitong Cai,Jian Wang,Yongsen Zheng,Kwok-Yan Lam,Keze Wang

Main category: cs.CV

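TL;DR: This paper proposes Process-of-Thought (PoT) Reasoning, a video understanding framework that decomposes inference into lightweight, verifiable steps (evidence selection, state updates, answer synthesis), improving temporal grounding accuracy and factual correctness while supporting model-agnostic, interpretable reasoning.

Details Motivation: Video understanding requires not only recognizing visual content but also temporally grounded, multi-step reasoning over long and noisy observations; existing methods lack traceability and robustness to distractors. Method: Proposes the Process-of-Thought (PoT) reasoning framework, which interleaves three stages: (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis; introduces a unified PoT trace representation that aligns intermediate decisions with temporal segments of the video; supports both closed-book reasoning and reasoning augmented with external tools. Result: On standard video reasoning tasks, PoT improves factual correctness and temporal grounding accuracy, increases robustness to distractors, reduces hallucinated explanations, and produces interpretable reasoning traces that can be diagnosed and reused. Conclusion: PoT is a model-agnostic, clearly structured, verifiable paradigm for video reasoning that combines performance gains with transparent reasoning, offering a new approach to complex video understanding tasks. Abstract: Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.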

[182] Semantic-Deviation-Anchored Multi-Branch Fusion for Unsupervised Anomaly Detection and Localization in Unstructured Conveyor-Belt Coal Scenes

Wenping Jin,Yuyang Tang,Li Zhu

Main category: cs.CV

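TL;DR: This paper proposes an unsupervised method for foreign-object anomaly detection and pixel-level localization in conveyor-belt coal-stream scenes, constructs CoalAD, the first dedicated benchmark, and designs a collaborative perception framework that fuses complementary cues from multiple perspectives, outperforming existing baselines on both image-level and pixel-level metrics.

Details Motivation: Foreign-object detection in conveyor-belt coal-stream scenes faces a highly unstructured environment, low-contrast targets, deformation, and occlusion, which markedly degrade conventional industrial anomaly detection methods; a dedicated dataset and method are needed. Method: Constructs the CoalAD benchmark dataset; proposes a complementary-cue collaborative perception framework that extracts and fuses anomaly evidence from three perspectives (object-level semantic composition modeling, semantic-attribution-based global deviation analysis, and fine-grained texture matching) to produce robust image-level scores and accurate pixel-level localization. Result: On CoalAD, the proposed method outperforms mainstream baselines on both image-level and pixel-level metrics; ablation studies validate each module. Conclusion: For foreign-object detection in unstructured coal-stream scenes, building a dedicated benchmark and a multi-perspective collaborative perception framework substantially improves the performance and robustness of unsupervised anomaly detection and localization. Abstract: Reliable foreign-object anomaly detection and pixel-level localization in conveyor-belt coal scenes are essential for safe and intelligent mining operations. This task is particularly challenging due to the highly unstructured environment: coal and gangue are randomly piled, backgrounds are complex and variable, and foreign objects often exhibit low contrast, deformation, and occlusion, resulting in coupling with their surroundings. These characteristics weaken the stability and regularity assumptions that many anomaly detection methods rely on in structured industrial settings, leading to notable performance degradation. To support evaluation and comparison in this setting, we construct CoalAD, a benchmark for unsupervised foreign-object anomaly detection with pixel-level localization in coal-stream scenes. We further propose a complementary-cue collaborative perception framework that extracts and fuses complementary anomaly evidence from three perspectives: object-level semantic composition modeling, semantic-attribution-based global deviation analysis, and fine-grained texture matching. The fused outputs provide robust image-level anomaly scoring and accurate pixel-level localization. Experiments on CoalAD demonstrate that our method outperforms widely used baselines across the evaluated image-level and pixel-level metrics, and ablation studies validate the contribution of each component. The code is available at https://github.com/xjpp2016/USAD.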

[183] A hybrid Kolmogorov-Arnold network for medical image segmentation

Deep Bhattacharyya,Ali Ayub,A. Ben Hamza

Main category: cs.CV

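TL;DR: This paper proposes U-KABS, a medical image segmentation framework that combines Kolmogorov-Arnold Networks (KANs) with a U-shaped encoder-decoder. Its learnable activation functions based on Bernstein polynomials and B-splines (KABS) improve the modeling of non-linear relationships and multi-scale structures, yielding superior performance on several benchmark datasets.

Details Motivation: Medical image segmentation is challenging because images are complex and highly variable and non-linear relationships are hard to model; stronger feature representation and context modeling are needed. Method: Proposes the U-KABS framework, which combines a convolutional plus squeeze-and-excitation stage (enhancing channel-wise features) with a KAN Bernstein Spline (KABS) stage (using learnable activation functions based on Bernstein polynomials and B-splines), inside a U-shaped encoder-decoder with skip connections between layers for multi-scale feature fusion and preservation of spatial detail. Result: On several medical imaging benchmark datasets, U-KABS outperforms strong baselines, particularly when segmenting complex anatomical structures. Conclusion: Designing KANs and the U-shaped architecture to work together improves the accuracy and robustness of medical image segmentation and supports the value of learnable spline activations for modeling both global smoothness and local adaptability. Abstract: Medical image segmentation plays a vital role in diagnosis and treatment planning, but remains challenging due to the inherent complexity and variability of medical images, especially in capturing non-linear relationships within the data. We propose U-KABS, a novel hybrid framework that integrates the expressive power of Kolmogorov-Arnold Networks (KANs) with a U-shaped encoder-decoder architecture to enhance segmentation performance. The U-KABS model combines the convolutional and squeeze-and-excitation stage, which enhances channel-wise feature representations, and the KAN Bernstein Spline (KABS) stage, which employs learnable activation functions based on Bernstein polynomials and B-splines. This hybrid design leverages the global smoothness of Bernstein polynomials and the local adaptability of B-splines, enabling the model to effectively capture both broad contextual trends and fine-grained patterns critical for delineating complex structures in medical images. Skip connections between encoder and decoder layers support effective multi-scale feature fusion and preserve spatial details. Evaluated across diverse medical imaging benchmark datasets, U-KABS demonstrates superior performance compared to strong baselines, particularly in segmenting complex anatomical structures.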
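
To make the phrase "learnable activation functions based on Bernstein polynomials" concrete, here is a minimal sketch of a one-dimensional activation written as a weighted sum of Bernstein basis polynomials on [0, 1]. It covers the standard mathematical definition only; how U-KABS parameterizes, combines with B-splines, or trains such activations is not specified in the abstract, and the function names below are illustrative assumptions.

```python
from math import comb

def bernstein_basis(n, x):
    """Values of the n+1 Bernstein basis polynomials of degree n at x in [0, 1]:
    B_{k,n}(x) = C(n, k) * x**k * (1 - x)**(n - k)."""
    return [comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1)]

def bernstein_activation(coeffs, x):
    """Activation phi(x) = sum_k coeffs[k] * B_{k,n}(x), with n = len(coeffs) - 1.
    In a KAN-style layer the coefficients would be the learnable parameters."""
    n = len(coeffs) - 1
    return sum(c * b for c, b in zip(coeffs, bernstein_basis(n, x)))

# With coefficients k/n the sum reproduces the identity function on [0, 1].
identity_coeffs = [k / 4 for k in range(5)]
print(bernstein_activation(identity_coeffs, 0.37))
```

Each basis polynomial is non-zero across the whole interval, so changing one coefficient reshapes the curve smoothly everywhere. This is the global smoothness the abstract contrasts with the local support of B-splines.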

[184] All-Optical Segmentation via Diffractive Neural Networks for Autonomous Driving

Yingjie Li,Daniel Robinson,Cunxi Yu

Main category: cs.CV

TL;DR: This paper proposes an all-optical computing framework based on diffractive optical neural networks (DONNs) for semantic segmentation and lane detection in autonomous driving, reducing energy consumption and avoiding analog-to-digital conversion overhead.

Details Motivation: Conventional deep neural networks consume substantial energy in autonomous driving, constrained in particular by extensive analog-to-digital conversion and image computation; a more energy-efficient, low-latency alternative is needed. Method: Designs and implements an all-optical computing framework that uses diffractive optical neural networks (DONNs) for RGB image segmentation and lane detection, with experiments on CityScapes, a custom indoor track dataset, and the CARLA simulator. Result: Demonstrates the effectiveness of DONNs for image segmentation on CityScapes; on the custom lane dataset and in CARLA simulation, evaluates the model's generalization across environments. Conclusion: DONNs offer an efficient, low-power all-optical solution for autonomous driving perception tasks with potential for practical deployment. Abstract: Semantic segmentation and lane detection are crucial tasks in autonomous driving systems. Conventional approaches predominantly rely on deep neural networks (DNNs), which incur high energy costs due to extensive analog-to-digital conversions and large-scale image computations required for low-latency, real-time responses. Diffractive optical neural networks (DONNs) have shown promising advantages over conventional DNNs on digital or optoelectronic computing platforms in energy efficiency. By performing all-optical image processing via light diffraction at the speed of light, DONNs save computation energy costs while reducing the overhead associated with analog-to-digital conversions by all-optical encoding and computing. In this work, we propose a novel all-optical computing framework for RGB image segmentation and lane detection in autonomous driving applications. Our experimental results demonstrate the effectiveness of the DONN system for image segmentation on the CityScapes dataset. Additionally, we conduct case studies on lane detection using a customized indoor track dataset and simulated driving scenarios in CARLA, where we further evaluate the model's generalizability under diverse environmental conditions.

[185] PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification

Qiuming Luo,Yuebing Li,Feng Li,Chang Kong

Main category: cs.CV

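TL;DR: This paper proposes PAND, a framework that improves lightweight networks on fine-grained visual classification through prompt-aware semantic calibration and neighborhood-aware structural distillation.

Details Motivation: In fine-grained visual classification (FGVC), distilling knowledge from large vision-language models (VLMs) into lightweight networks is hampered by fixed prompts and global alignment. Method: Proposes PAND, a two-stage framework: stage one performs prompt-aware semantic calibration to generate adaptive semantic anchors; stage two performs neighborhood-aware structural distillation to constrain the student's local decision structure. Result: PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks; a ResNet-18 student reaches 76.09% accuracy on CUB-200, 3.4% above the VL2Lite baseline. Conclusion: PAND decouples semantic calibration from structural transfer and improves lightweight models on FGVC, supporting the joint prompt-aware and neighborhood-aware distillation strategy. Abstract: Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy to constrain the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.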

[186] Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Haodong Li,Shaoteng Liu,Zhe Lin,Manmohan Chandraker

Main category: cs.CV

TL;DR: This paper proposes Rolling Sink, a training-free method that addresses the drop in visual quality when autoregressive video diffusion models generate beyond their training duration, enabling high-quality generation of ultra-long videos (5-30 minutes).

Details Motivation: Autoregressive video diffusion models are trained on short durations (e.g., 5 seconds); generating longer videos at test time causes marked visual degradation, a train-test horizon gap. Extending the training duration is too expensive computationally, so a training-free solution is needed. Method: Based on a systematic analysis of autoregressive cache maintenance, proposes Rolling Sink, which leaves model training unchanged and adjusts how the cache is updated and rolled at test time to generate ultra-long videos stably. Result: On a model trained only on 5-second clips, Rolling Sink generates high-quality 5-30 minute videos at 16 FPS with consistent subjects, stable colors, coherent structures, and smooth motion, outperforming existing SOTA methods. Conclusion: The training-free Rolling Sink bridges the performance gap of AR video models under open-ended test horizons and offers an efficient, practical approach to long-video generation. Abstract: Recently, autoregressive (AR) video diffusion models have achieved remarkable performance. However, due to their limited training durations, a train-test gap emerges when testing at longer horizons, leading to rapid visual degradations. Following Self Forcing, which studies the train-test gap within the training duration, this work studies the train-test gap beyond the training duration, i.e., the gap between the limited horizons during training and open-ended horizons during testing. Since open-ended testing can extend beyond any finite training window, and long-video training is computationally expensive, we pursue a training-free solution to bridge this gap. To explore a training-free solution, we conduct a systematic analysis of AR cache maintenance. These insights lead to Rolling Sink. Built on Self Forcing (trained on only 5s clips), Rolling Sink effectively scales the AR video synthesis to ultra-long durations (e.g., 5-30 minutes at 16 FPS) at test time, with consistent subjects, stable colors, coherent structures, and smooth motions. As demonstrated by extensive experiments, Rolling Sink achieves superior long-horizon visual fidelity and temporal consistency compared to SOTA baselines. Project page: https://rolling-sink.github.io/

[187] Uncertainty-Aware Counterfactual Traffic Signal Control with Predictive Safety and Starvation-Avoidance Constraints Using Vision-Based Sensing

Jayawant Bodagala,Balaji Bodagala

Main category: cs.CV

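TL;DR: This paper proposes UCATSC, a model-based traffic signal control system that performs counterfactual rollouts in belief space to predict and enforce hard safety and starvation-avoidance constraints, while also targeting delay, emissions, and interpretability.

Details Motivation: Real-world deployment of existing adaptive traffic signal control is limited by the uncertainty of vision-based perception, implicit safety, and non-interpretable policies, and such systems have been validated mainly in simulation. Method: UCATSC models intersection signal control as a constrained, partially observable stochastic decision process and uses explicit models to run counterfactual rollouts in belief space under hard safety and starvation-avoidance constraints. Result: The system is designed to reduce traffic delay and emissions while preventing safety-critical errors and producing interpretable control policies. Conclusion: By introducing explicit modeling and hard constraints, UCATSC aims to improve the safety, reliability, and interpretability of adaptive signal control in real-world settings. Abstract: Real-world deployment of adaptive traffic signal control, to date, remains limited due to the uncertainty associated with vision-based perception, implicit safety, and non-interpretable control policies learned and validated mainly in simulation. In this paper, we introduce UCATSC, a model-based traffic signal control system that models traffic signal control at an intersection using a stochastic decision process with constraints and under partial observability, taking into account the uncertainty associated with vision-based perception. Unlike reinforcement learning methods that learn to predict safety using reward shaping, UCATSC predicts and enforces hard constraints related to safety and starvation prevention during counterfactual rollouts in belief space. The system is designed to reduce traffic delay and emissions while preventing safety-critical errors and providing interpretable control policy outputs based on explicit models.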

[188] VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

Wenqi Liu,Yunxiao Wang,Shijie Ma,Meng Liu,Qile Su,Tianke Zhang,Haonan Fan,Changyi Liu,Kaiyu Jiang,Jiankang Chen,Kaiyu Tang,Bin Wen,Fan Yang,Tingting Gao,Han Li,Yinwei Wei,Xuemeng Song

Main category: cs.CV

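TL;DR: This paper proposes VideoTemp-o3, a unified agentic video reasoning framework that jointly models video grounding and question answering to improve long-video understanding, addressing the weak localization, inefficiency, and rigid workflows of existing methods.

Details Motivation: Conventional uniform frame sampling struggles to capture key visual evidence in long-video understanding, degrading performance and increasing hallucinations; existing agentic video reasoning methods suffer from weak localization, low efficiency, and rigid workflows. Method: Proposes the VideoTemp-o3 framework, a unified agentic paradigm that jointly models video grounding and question answering; designs a unified masking mechanism for supervised fine-tuning to balance exploration against noise; introduces dedicated rewards to mitigate reward hacking in reinforcement learning; builds a high-quality long-video grounded QA dataset and a corresponding benchmark. Result: Achieves clear gains on both long-video understanding and video grounding, supporting its strong localization, on-demand clipping, and ability to refine inaccurate localizations. Conclusion: VideoTemp-o3 offers a more efficient, flexible, and robust agentic reasoning paradigm for long-video understanding and advances the joint modeling of video grounding and question answering. Abstract: In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. In addition, from the data perspective, we develop an effective pipeline to construct high-quality long video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves remarkable performance on both long video understanding and grounding.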

[189] How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study

Simiao Ren,Yuchen Zhou,Xingyu Shen,Kidus Zewde,Tommy Duong,George Huang,Hatsanai,Tiangratanakul,Tsang,Ng,En Wei,Jiayu Xue

Main category: cs.CV

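TL;DR: This paper presents the first comprehensive zero-shot evaluation of 16 state-of-the-art AI-generated image detection methods across 12 datasets, 291 generators, and 2.6 million images. It reveals large, unstable differences in detector performance, shows that training data is critical to generalization, and finds that modern commercial generators strongly evade existing detectors.

Details Motivation: Existing benchmarks mainly evaluate fine-tuned models and lack a systematic assessment of detectors' real out-of-the-box (zero-shot) performance, which is the most common situation in practical deployment. Method: Runs a zero-shot evaluation of 16 SOTA detection methods (23 pretrained variants in total) on 12 diverse datasets (291 generators, 2.6 million images) and analyzes the stability and significance of performance with statistical tools such as Spearman correlation and the Friedman test. Result: (1) No universally best detector exists, and rankings are highly unstable; (2) mean accuracy differs by 37 percentage points between the best and worst detectors; (3) training data alignment causes performance swings of 20-60% among architecturally identical detectors; (4) modern commercial generators such as Flux Dev, Firefly v4, and Midjourney v7 drive most detectors down to 18-30% accuracy; (5) three patterns of cross-dataset generalization failure are identified. Conclusion: The "universal detector" paradigm is unreliable; practitioners should choose detectors according to their specific threat scenario rather than rely on published benchmark figures. Abstract: As AI-generated images proliferate across digital platforms, reliable detection methods have become critical for combating misinformation and maintaining content authenticity. While numerous deepfake detection methods have been proposed, existing benchmarks predominantly evaluate fine-tuned models, leaving a critical gap in understanding out-of-the-box performance, the most common deployment scenario for practitioners. We present the first comprehensive zero-shot evaluation of 16 state-of-the-art detection methods, comprising 23 pretrained detector variants (due to multiple released versions of certain detectors), across 12 diverse datasets, comprising 2.6 million image samples spanning 291 unique generators including modern diffusion models. Our systematic analysis reveals striking findings: (1) no universal winner exists, with detector rankings exhibiting substantial instability (Spearman $ρ$: 0.01 to 0.87 across dataset pairs); (2) a 37 percentage-point performance gap separates the best detector (75.0% mean accuracy) from the worst (37.5%); (3) training data alignment critically impacts generalization, causing up to 20-60% performance variance within architecturally identical detector families; (4) modern commercial generators (Flux Dev, Firefly v4, Midjourney v7) defeat most detectors, which achieve only 18-30% average accuracy; and (5) we identify three systematic failure patterns affecting cross-dataset generalization. 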
Statistical analysis confirms significant performance differences between detectors (Friedman test: $χ^2$=121.01, $p<10^{-16}$, Kendall $W$=0.524). Our findings challenge the "one-size-fits-all" detector paradigm and provide actionable deployment guidelines, demonstrating that practitioners must carefully select detectors based on their specific threat landscape rather than relying on published benchmark performance.
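
The ranking-instability finding rests on Spearman's rank correlation between detector rankings on pairs of datasets. The sketch below shows how such a coefficient can be computed from per-detector accuracies on two datasets. It uses the textbook formula for untied ranks, $ρ = 1 - 6\sum d_i^2 / (n(n^2-1))$, so it assumes no two detectors share the same accuracy on a dataset; the accuracy values are made up for illustration and do not come from the paper.

```python
def ranks(scores):
    """Rank positions (1 = highest score). Assumes all scores are distinct."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result

def spearman_rho(scores_a, scores_b):
    """Spearman rank correlation for two equal-length score lists without ties."""
    n = len(scores_a)
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical accuracies of four detectors on two datasets.
dataset_1 = [0.91, 0.84, 0.70, 0.55]
dataset_2 = [0.62, 0.88, 0.47, 0.79]
print(spearman_rho(dataset_1, dataset_2))
```

A value near 1 means the two datasets rank the detectors almost identically, while a value near 0 means that knowing the ranking on one dataset says little about the ranking on the other.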

[190] Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures

Simiao Ren

Main category: cs.CV

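TL;DR: This paper builds the first large-scale cross-paradigm benchmark for facial age estimation, comparing 34 models (22 specialized architectures and 12 general-purpose vision-language models, VLMs) on 8 standard datasets. Zero-shot VLMs clearly outperform most specialized models (average MAE 5.65 vs. 9.88) and are more robust for age verification at the 18-year threshold, challenging the assumption that task-specific architectures are necessary.

Details Motivation: No benchmark has systematically compared modern vision-language models (VLMs) with specialized age estimation models, leaving the capability boundaries of the two unclear; a unified evaluation is needed to guide future research. Method: Builds the first large-scale cross-paradigm benchmark covering 34 models (22 open specialized models and 12 general-purpose VLMs) and 8 standard datasets (1,100 test images per model in total); metrics include MAE, age verification error at the 18-year threshold, error under coarse binning, and a stratified analysis over 14 age groups. Result: Zero-shot VLMs reach an average MAE of 5.65 years, clearly better than non-LLM models (9.88 years); the best VLM (Gemini 3 Flash Preview, MAE 4.32) improves on the best non-LLM model (MiVOLO, MAE 5.10) by 15%; at the 18-year threshold, VLMs misclassify minors as adults far less often (13-25%) than non-LLM models (60-100%); all models perform worst at the extreme ages (<5 and 65+); coarse binning (8-9 classes) degrades MAE beyond 13 years. Conclusion: Task-specific architectures are not the best solution for age estimation, and VLMs perform better thanks to strong generalization; future work should focus on distilling VLM knowledge into lightweight specialized models rather than adding more specialized designs. Abstract: Facial age estimation is critical for content moderation, age verification, and deepfake detection, yet no prior benchmark has systematically compared modern vision-language models (VLMs) against specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models (22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs) across 8 standard datasets (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, AgeDB) totaling 1,100 test images per model. Our key finding is striking: zero-shot VLMs significantly outperform most specialized models, achieving an average MAE of 5.65 years compared to 9.88 for non-LLM models. The best VLM (Gemini 3 Flash Preview, MAE 4.32) outperforms the best non-LLM model (MiVOLO, MAE 5.10) by 15%. Only MiVOLO, which uniquely combines face and body features via Vision Transformers, competes with VLMs. We further analyze age verification at the 18-year threshold, revealing that non-LLM models exhibit 60-100% false adult rates on minors while VLMs achieve 13-25%, and demonstrate that coarse age binning (8-9 classes) consistently degrades MAE beyond 13 years. Our stratified analysis across 14 age groups reveals that all models struggle most at extreme ages (<5 and 65+). 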
These findings challenge the assumption that task-specific architectures are necessary for age estimation and suggest that the field should redirect toward distilling VLM capabilities into efficient specialized models.
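
The two headline metrics in this entry, mean absolute error (MAE) and the false adult rate at the 18-year threshold, are simple to compute. The sketch below shows one plausible reading: it assumes "false adult rate on minors" means the share of subjects with true age under 18 whose predicted age is 18 or above, which the abstract does not spell out. The ages are invented for illustration; only the two MAE values in the final lines are taken from the abstract.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages, in years."""
    errors = [abs(t - p) for t, p in zip(true_ages, predicted_ages)]
    return sum(errors) / len(errors)

def false_adult_rate(true_ages, predicted_ages, threshold=18):
    """Share of true minors (age < threshold) predicted at or above the threshold.
    This definition is an assumption made for illustration."""
    minors = [(t, p) for t, p in zip(true_ages, predicted_ages) if t < threshold]
    if not minors:
        raise ValueError("no minors in the sample")
    return sum(1 for _, p in minors if p >= threshold) / len(minors)

# Invented sample: three minors and one adult.
true_ages = [15, 16, 17, 30]
predicted = [19, 14, 18, 25]
print(mean_absolute_error(true_ages, predicted))
print(false_adult_rate(true_ages, predicted))

# Relative MAE improvement of the best VLM over MiVOLO, from the abstract's figures.
print(round((5.10 - 4.32) / 5.10 * 100))
```

The last line reproduces the 15% figure as a relative reduction in MAE, (5.10 - 4.32) / 5.10, rather than a difference in percentage points.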

[191] Back to Physics: Operator-Guided Generative Paths for SMS MRI Reconstruction

Zhibo Chen,Yu Guan,Yajuan Huang,Chaoqi Chen,XiangJi,Qiuyun Fan,Dong Liang,Qiegen Liu

Main category: cs.CV

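TL;DR: This paper proposes an operator-guided framework for SMS MRI reconstruction, built around OCDI-Net, which explicitly models and separates inter-slice interference from target-slice content. Combined with a two-stage inference procedure, it improves image fidelity and suppresses slice leakage in accelerated MRI reconstruction.

Details Motivation: Diffusion-based SMS MRI reconstruction methods mostly assume Gaussian noise and need extra consistency steps to bring in the SMS physics, which can be mismatched to the degradations caused by the actual acquisition operators. Method: Proposes an operator-guided framework that models the degradation trajectory defined by known acquisition operators and inverts it through deterministic updates; designs an operator-conditional dual-stream interaction network (OCDI-Net) that explicitly disentangles target-slice content from inter-slice interference and predicts structured degradations to support operator-aligned reconstruction; uses two-stage chained inference that first separates SMS slices and then completes in-plane k-space. Result: Experiments on fastMRI brain data and prospectively acquired in vivo diffusion MRI data show higher image fidelity and less slice leakage than conventional and learning-based SMS reconstructions. Conclusion: The operator-guided reconstruction paradigm captures the joint degradation from deterministic interference and missing data in SMS acquisition more accurately, and OCDI-Net with its two-stage pipeline offers a new approach to efficient, high-quality SMS MRI reconstruction. Abstract: Simultaneous multi-slice (SMS) imaging with in-plane undersampling enables highly accelerated MRI but yields a strongly coupled inverse problem with deterministic inter-slice interference and missing k-space data. Most diffusion-based reconstructions are formulated around Gaussian-noise corruption and rely on additional consistency steps to incorporate SMS physics, which can be mismatched to the operator-governed degradations in SMS acquisition. We propose an operator-guided framework that models the degradation trajectory using known acquisition operators and inverts this process via deterministic updates. Within this framework, we introduce an operator-conditional dual-stream interaction network (OCDI-Net) that explicitly disentangles target-slice content from inter-slice interference and predicts structured degradations for operator-aligned inversion, and we instantiate reconstruction as a two-stage chained inference procedure that performs SMS slice separation followed by in-plane completion. Experiments on fastMRI brain data and prospectively acquired in vivo diffusion MRI data demonstrate improved fidelity and reduced slice leakage over conventional and learning-based SMS reconstructions.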

[192] Open-Text Aerial Detection: A Unified Framework For Aerial Visual Grounding And Detection

Guoting Wei,Xia Yuan,Yang Zhou,Haizhao Jing,Yu Liu,Xianbiao Qi,Chunxia Zhao,Haokui Zhang,Rong Xiao

Main category: cs.CV

TL;DR: This paper proposes OTA-Det, the first framework to unify Open-Vocabulary Aerial Detection (OVAD) and Remote Sensing Visual Grounding (RSVG). Through task reformulation and dense semantic alignment it achieves fine-grained semantic understanding with multi-target detection, and it builds on RT-DETR for efficient real-time inference.

Details Motivation: OVAD supports only coarse category-level semantics and RSVG is limited to single-target localization; used in isolation, neither can combine rich semantic understanding with multi-target detection. Method: Proposes the unified OTA-Det framework: 1) a task reformulation strategy that unifies objectives and supervision mechanisms and enables joint training across both paradigms; 2) a dense semantic alignment strategy that establishes explicit correspondence at multiple granularities, from holistic expressions to individual attributes; 3) efficient modules that extend RT-DETR to open-text detection. Result: Reaches SOTA performance on six OVAD and RSVG benchmarks while maintaining real-time inference at 34 FPS. Conclusion: OTA-Det merges the OVAD and RSVG paradigms, overcomes the limitations of each, and provides a unified solution for remote sensing image understanding that combines semantic richness, localization accuracy, and runtime efficiency. Abstract: Open-Vocabulary Aerial Detection (OVAD) and Remote Sensing Visual Grounding (RSVG) have emerged as two key paradigms for aerial scene understanding. However, each paradigm suffers from inherent limitations when operating in isolation: OVAD is restricted to coarse category-level semantics, while RSVG is structurally limited to single-target localization. These limitations prevent existing methods from simultaneously supporting rich semantic understanding and multi-target detection. To address this, we propose OTA-Det, the first unified framework that bridges both paradigms into a cohesive architecture. Specifically, we introduce a task reformulation strategy that unifies task objectives and supervision mechanisms, enabling joint training across datasets from both paradigms with dense supervision signals. Furthermore, we propose a dense semantic alignment strategy that establishes explicit correspondence at multiple granularities, from holistic expressions to individual attributes, enabling fine-grained semantic understanding. To ensure real-time efficiency, OTA-Det builds upon the RT-DETR architecture, extending it from closed-set detection to open-text detection by introducing several highly efficient modules, achieving state-of-the-art performance on six benchmarks spanning both OVAD and RSVG tasks while maintaining real-time inference at 34 FPS.

[193] SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models

Weijiang Lv,Yaoxuan Feng,Xiaobo Xia,Jiayu Wang,Yan Jing,Wenchao Chen,Bo Chen

Main category: cs.CV

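TL;DR: This paper introduces SPD-Faith Bench, a benchmark for diagnosing how faithful the chain-of-thought reasoning of multimodal large language models (MLLMs) is. It identifies two systematic failure modes and, based on them, proposes SAGE, a training-free framework that improves visual evidence alignment and reasoning faithfulness.

Details Motivation: Prior work has focused mainly on perceptual hallucinations and overlooked unfaithfulness at the reasoning level; to remove the influence of linguistic priors and evaluate reasoning faithfulness specifically, a fine-grained diagnostic benchmark that forces explicit visual comparison is needed. Method: Builds SPD-Faith Bench on image difference reasoning; locates the root causes of failure by analyzing decaying visual attention and representation shifts in the model's residual stream; proposes the training-free SAGE framework for visual evidence calibration and perception-reasoning alignment. Result: Reveals two systematic failure modes in mainstream MLLMs, perceptual blindness and perception-reasoning dissociation; SAGE improves reasoning faithfulness, underscoring the importance of evaluating faithfulness explicitly. Conclusion: Reasoning faithfulness cannot be inferred from answer correctness alone and requires explicit evaluation on a dedicated benchmark; improving visual routing and perception-reasoning consistency is a key path toward trustworthy MLLM reasoning. Abstract: Chain-of-Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has mainly focused on perceptual hallucinations, leaving reasoning-level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD-Faith Bench, a diagnostic benchmark based on fine-grained image difference reasoning that enforces explicit visual comparison. Evaluations on state-of-the-art MLLMs reveal two systematic failure modes, perceptual blindness and perception-reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a training-free visual evidence-calibrated framework that improves visual routing and aligns reasoning with perception. Our results highlight the importance of explicitly evaluating faithfulness beyond response correctness. Our benchmark and codes are available at https://github.com/Johanson-colab/SPD-Faith-Bench.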

[194] FlashVID: Efficient Video Large Language Models via Training-free Tree-based Spatiotemporal Token Merging

Ziyang Fan,Keyu Chen,Ruilong Xing,Yulin Li,Li Jiang,Zhuotao Tian

Main category: cs.CV

TL;DR: FlashVID is a training-free inference acceleration framework for VLLMs. It compresses video visual tokens efficiently using Attention and Diversity-based Token Selection (ADTS) and Tree-based Spatiotemporal Token Merging (TSTM), preserving 99.1% of the original performance while retaining only 10% of the tokens.

Details Motivation: Existing VLLM acceleration methods compress spatial and temporal redundancy independently and ignore spatiotemporal relationships, which leads to suboptimal compression; highly correlated visual features in video change over time in position, scale, orientation, and other attributes, so spatiotemporal redundancy needs to be modeled jointly. Method: Proposes the FlashVID framework, which first uses Attention and Diversity-based Token Selection (ADTS) to pick the most representative tokens for a basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained elimination of spatiotemporal redundancy. The whole pipeline is training-free and plug-and-play. Result: Validated on three mainstream VLLMs and five video understanding benchmarks; retaining only 10% of visual tokens preserves 99.1% of LLaVA-OneVision's performance; the number of video frames fed to Qwen2.5-VL increases 10x, giving a relative improvement of 8.6% under the same computational budget. Conclusion: FlashVID is an efficient, general, training-free solution for accelerating VLLM video inference that improves long-video processing capability and computational efficiency. Abstract: Although Video Large Language Models (VLLMs) have shown remarkable capabilities in video understanding, they are required to process high volumes of visual tokens, causing significant computational inefficiency. Existing VLLM acceleration frameworks usually compress spatial and temporal redundancy independently, which overlooks the spatiotemporal relationships, thereby leading to suboptimal spatiotemporal compression. The highly correlated visual features are likely to change in spatial position, scale, orientation, and other attributes over time due to the dynamic nature of video. Building on this insight, we introduce FlashVID, a training-free inference acceleration framework for VLLMs. Specifically, FlashVID utilizes Attention and Diversity-based Token Selection (ADTS) to select the most representative tokens for basic video representation, then applies Tree-based Spatiotemporal Token Merging (TSTM) for fine-grained spatiotemporal redundancy elimination. Extensive experiments conducted on three representative VLLMs across five video understanding benchmarks demonstrate the effectiveness and generalization of our method. Notably, by retaining only 10% of visual tokens, FlashVID preserves 99.1% of the performance of LLaVA-OneVision. Consequently, FlashVID can serve as a training-free and plug-and-play module for extending long video frames, which enables a 10x increase in video frame input to Qwen2.5-VL, resulting in a relative improvement of 8.6% within the same computational budget. 
Code is available at https://github.com/Fanziyang-v/FlashVID.

[195] VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping

Sanoojan Baliah,Yohan Abeysinghe,Rusiru Thushara,Khan Muhammad,Abhinav Dhall,Karthik Nandakumar,Muhammad Haris Khan

Main category: cs.CV

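TL;DR: This paper proposes VFace, a training-free, plug-and-play method for video face swapping. Using frequency spectrum attention interpolation, target structure guidance, and flow-guided temporal smoothing of attention, it improves temporal consistency and visual quality without modifying the diffusion model.

Details Motivation: To address the temporal inconsistency and loss of identity features that arise when image-level, diffusion-based face swapping methods are applied to video. Method: Proposes three techniques: 1) Frequency Spectrum Attention Interpolation (FSAI) to preserve key identity characteristics; 2) plug-and-play attention injection for target structure guidance; 3) Flow-Guided Attention Temporal Smoothening (FGATS) to strengthen spatiotemporal coherence. The method needs no additional training or video-specific fine-tuning. Result: Experiments on multiple datasets show that VFace improves the temporal consistency and visual fidelity of video face swapping and is modular and practical. Conclusion: VFace is an efficient, general, training-free solution for video face swapping that integrates with existing image-level diffusion-based face swapping methods and mitigates frame-to-frame inconsistency. Abstract: We present a training-free, plug-and-play method, namely VFace, for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features from the target frame to the generation. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model to reduce temporal inconsistencies typically encountered in frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.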

[196] When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Shoubin Yu,Yue Zhang,Zun Wang,Jaehong Yoon,Huaxiu Yao,Mingyu Ding,Mohit Bansal

Main category: cs.CV

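TL;DR: This paper proposes the AVIC framework, which adaptively controls the use of visual imagination at test time to achieve more efficient and reliable performance on spatial reasoning tasks.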

Details Motivation: Existing MLLMs remain unreliable at visual spatial reasoning, especially when the answer depends on how a scene would look from unseen or alternative viewpoints; prior work brings in world models for visual imagination, but when imagination is needed, how much of it helps, and when it becomes harmful are still unclear. Method: Proposes the AVIC (Adaptive Visual Imagination Control) framework: at test time it explicitly assesses whether the current visual evidence is sufficient, then selectively invokes and scales visual imagination; the world model and the language model reason jointly. Result: On benchmarks including SAT, MMSI, and R2R, the study identifies scenarios where imagination is critical, marginal, or harmful; AVIC matches or outperforms fixed imagination strategies with substantially fewer world-model calls and language tokens. Conclusion: Test-time visual imagination should be treated as a controllable resource rather than enabled by default; AVIC shows that adaptive control is important for efficient and reliable spatial reasoning. Abstract: Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.

[197] Geometry-Aware Rotary Position Embedding for Consistent Video World Model

Chendong Xiang,Jiajun Liu,Jintao Zhang,Xiao Yang,Zhengwei Fang,Shizun Wang,Zijun Wang,Yingtian Zou,Hang Su,Jun Zhu

Main category: cs.CV

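TL;DR: This paper proposes ViewRope, a geometry-aware encoding for video transformers that injects camera-ray directions into self-attention to address the geometric drift caused by screen-space positional embeddings in predictive world models, substantially improving long-range spatial consistency and computational efficiency.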

Details Motivation: Existing predictive world models lack spatial persistence: scene structure is unstable and prone to hallucination when the camera revisits a location, because screen-space positional embeddings are incompatible with 3D projective geometry. Method: Proposes ViewRope, which injects camera-ray directions directly into video transformer self-attention; designs a geometry-aware frame-sparse attention mechanism that selectively attends to historical frames based on geometric cues; builds the ViewBench diagnostic suite to measure loop-closure fidelity and geometric drift. Result: ViewRope substantially improves 3D consistency over long trajectories, lowers geometric drift, and reduces computational cost. Conclusion: Attention parameterized with ray geometry as an inductive bias effectively strengthens the spatial persistence of world models and lays groundwork for geometrically robust interactive AI. Abstract: Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce ViewRope, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose Geometry-Aware Frame-Sparse Attention, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present ViewBench, a diagnostic suite measuring loop-closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
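
To make the notion of a camera-ray direction concrete, the following is a minimal sketch that maps a pixel to a unit ray in the camera frame under a standard pinhole model. It is not ViewRope's encoding, whose exact parameterization the abstract does not give; the function name `pixel_ray` and the intrinsics arguments `fx`, `fy`, `cx`, `cy` are assumptions for illustration.

```python
import math

def pixel_ray(u, v, fx, fy, cx, cy):
    """Unit ray direction (camera frame) through pixel (u, v) for a pinhole camera.

    fx, fy are focal lengths in pixels and (cx, cy) is the principal point;
    all four are hypothetical inputs chosen for this sketch.
    """
    # Back-project the pixel onto the plane z = 1.
    x = (u - cx) / fx
    y = (v - cy) / fy
    # Normalize so that only the direction remains.
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)

center = pixel_ray(320.0, 240.0, 500.0, 500.0, 320.0, 240.0)
offset = pixel_ray(820.0, 240.0, 500.0, 500.0, 320.0, 240.0)
```

Two pixels that are far apart on screen can still have nearby ray directions after a camera rotation, which is the kind of relation a screen-space position index cannot express.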

[198] Learning Self-Correction in Vision-Language Models via Rollout Augmentation

Yi Ding,Ziliang Qiu,Bolian Li,Ruqi Zhang

Main category: cs.CV

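TL;DR: This paper proposes Octopus, a framework that uses correction-specific rollout augmentation to address the sparsity of self-correction behavior in vision-language models, and introduces a response-masking strategy that decouples self-correction from direct reasoning, enabling efficient and stable reinforcement learning.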

Details Motivation: Existing reinforcement learning methods struggle to learn effective self-correction because such behavior occurs very rarely, which makes the learning signal extremely sparse. Method: Proposes correction-specific rollouts (Octopus), an augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts; introduces a response-masking strategy that decouples self-correction from direct reasoning. Result: Across 7 benchmarks, the proposed Octopus-8B reaches SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 points while needing only 0.72x the training time per step. Conclusion: The Octopus framework markedly improves the efficiency and stability of learning self-correction in vision-language models and offers a new direction for complex reasoning tasks. Abstract: Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 score while requiring only 0.72x training time per step.

[199] Recovering 3D Shapes from Ultra-Fast Motion-Blurred Images

Fei Yu,Shudan Guo,Shiqing Xin,Beibei Wang,Haisen Zhao,Wenzheng Chen

Main category: cs.CV

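TL;DR: This paper proposes a differentiable inverse rendering method for recovering 3D shapes from ultra-fast motion-blurred images; a fast barycentric coordinate solver substantially accelerates forward rendering and enables end-to-end shape optimization.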

Details Motivation: Traditional 3D reconstruction methods such as Multi-View Stereo (MVS) fail on images with extreme motion blur, yet such images are common in high-speed settings such as sports and industry, so a new approach is needed. Method: Proposes a novel differentiable inverse rendering framework; to remove the bottleneck caused by repeatedly computing barycentric coordinates in conventional rendering, it designs a fast differentiable barycentric coordinate solver that speeds up forward rendering by up to 4.57x. Result: The method is validated on two motion types, rapid translation and rotation: forward simulation is efficient and realistic, and 3D shapes are successfully recovered from blurred 2D images. Conclusion: The work extends the reach of vision-based 3D reconstruction by recovering geometry end to end from ultra-fast motion-blurred images. Abstract: We consider the problem of 3D shape recovery from ultra-fast motion-blurred images. While 3D reconstruction from static images has been extensively studied, recovering geometry from extreme motion-blurred images remains challenging. Such scenarios frequently occur in both natural and industrial settings, such as fast-moving objects in sports (e.g., balls) or rotating machinery, where rapid motion distorts object appearance and makes traditional 3D reconstruction techniques like Multi-View Stereo (MVS) ineffective. In this paper, we propose a novel inverse rendering approach for shape recovery from ultra-fast motion-blurred images. While conventional rendering techniques typically synthesize blur by averaging across multiple frames, we identify a major computational bottleneck in the repeated computation of barycentric weights. To address this, we propose a fast barycentric coordinate solver, which significantly reduces computational overhead and achieves a speedup of up to 4.57x, enabling efficient and photorealistic simulation of high-speed motion. Crucially, our method is fully differentiable, allowing gradients to propagate from rendered images to the underlying 3D shape, thereby facilitating shape recovery through inverse rendering. We validate our approach on two representative motion types: rapid translation and rotation. Experimental results demonstrate that our method enables efficient and realistic modeling of ultra-fast moving objects in the forward simulation. Moreover, it successfully recovers 3D shapes from 2D imagery of objects undergoing extreme translational and rotational motion, advancing the boundaries of vision-based 3D reconstruction. Project page: https://maxmilite.github.io/rec-from-ultrafast-blur/
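
As background for the barycentric weights named in the abstract, the snippet below is a minimal sketch of the textbook computation for a 2D point and a triangle. It is not the paper's fast solver; the function name `barycentric_weights` and the tuple-based points are assumptions for illustration. A rasterizer that averages many sub-frames to synthesize blur would repeat a computation like this per pixel and per sub-frame, which is where the cost accumulates.

```python
def barycentric_weights(p, a, b, c):
    """Return (wa, wb, wc) with p = wa*a + wb*b + wc*c and wa + wb + wc = 1."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    # Determinant of the edge matrix; its magnitude is twice the triangle area.
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("degenerate triangle")
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    wc = 1.0 - wa - wb
    return wa, wb, wc

# A point inside the unit right triangle.
weights = barycentric_weights((0.25, 0.25), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
```

All three weights lie in [0, 1] exactly when the point is inside the triangle, which is the test a rasterizer uses to decide pixel coverage.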

[200] Thinking in Structures: Evaluating Spatial Intelligence through Reasoning on Constrained Manifolds

Chen Yang,Guanxin Lin,Youquan He,Peiyao Chen,Guanghe Liu,Yufan Mo,Zhouyuan Xu,Linhao Wang,Guohui Zhang,Zihang Zhang,Shenxiang Zeng,Chen Wang,Jiansheng Fan

Main category: cs.CV

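TL;DR: This paper introduces SSI-Bench, a visual question answering benchmark for spatial reasoning on constrained manifolds that stresses 3D spatial understanding under geometric, topological, and physical constraints; existing VLMs perform far below humans on it, exposing fundamental weaknesses in structured spatial modeling and constraint-consistent reasoning.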

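Details Motivation: Existing VLM evaluations mostly rely on unconstrained 2D scenes, which lets models exploit pixel-level shortcuts instead of genuine spatial understanding; a demanding benchmark that reflects the geometric, topological, and physical constraints of the real physical world is needed. Method: Builds SSI-Bench: based on complex real-world 3D structures, it contains 1,000 ranking-style VQA questions covering geometric and topological reasoning and a range of compositional spatial operations (such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning); a fully human-driven pipeline (ten researchers, over 400 hours) keeps pixel-level cues low and structural rigor high. Result: Across 31 widely used VLMs, the best open-source model reaches only 22.2% and the strongest closed-source model 33.6%, while humans score 91.6%; prompting models to think yields only marginal gains; error analysis shows that models lack structural grounding and constraint-consistent 3D reasoning. Conclusion: SSI-Bench reveals serious shortcomings of current VLMs in understanding real physical space and underlines the need for a new generation of spatial intelligence models that jointly model geometric, topological, and physical constraints. Abstract: Spatial intelligence is crucial for vision-language models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Encouraging models to think yields only marginal gains, and error analysis points to failures in structural grounding and constraint-consistent 3D reasoning. Project page: https://ssi-bench.github.io.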

[201] WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning

Mert Sonmezer,Serge Vasylechko,Duygu Atasoy,Seyda Ertekin,Sila Kurugol

Main category: cs.CV

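TL;DR: WristMIR is a region-aware retrieval framework for pediatric wrist radiographs that needs no manual image annotation; it uses structured radiology report mining and bone-specific localization, and its two-stage retrieval (coarse global matching followed by region-conditioned reranking) markedly improves fracture pattern retrieval and classification.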

Details Motivation: Retrieving wrist fracture patterns is hard in clinical practice because the key signs are subtle and highly localized and are easily obscured by overlapping anatomy or differing imaging views; large, well-annotated datasets for medical image retrieval are also lacking. Method: Proposes the WristMIR framework: 1) uses MedGemma to mine dense radiology reports and generate global and region-level captions; 2) combines pre-processed wrist images with bone-specific crops of the distal radius, distal ulna, and ulnar styloid; 3) jointly trains global and local contrastive encoders; 4) performs two-stage retrieval, in which global matching first selects candidate exams and reranking is then conditioned on the anatomical region. Result: Image-to-text Recall@5 rises from 0.82% to 9.35%; fracture classification reaches AUROC 0.949 and AUPRC 0.953; in region-aware evaluation, mean F1 rises from 0.568 to 0.753; radiologists' mean clinical relevance score rises from 3.36 to 4.35. Conclusion: Anatomically guided retrieval can effectively strengthen diagnostic reasoning and clinical decision support in pediatric musculoskeletal imaging. Abstract: Retrieving wrist radiographs with analogous fracture patterns is challenging because clinically important cues are subtle, highly localized and often obscured by overlapping anatomy or variable imaging views. Progress is further limited by the scarcity of large, well-annotated datasets for case-based medical image retrieval. We introduce WristMIR, a region-aware pediatric wrist radiograph retrieval framework that leverages dense radiology reports and bone-specific localization to learn fine-grained, clinically meaningful image representations without any manual image-level annotations. Using MedGemma-based structured report mining to generate both global and region-level captions, together with pre-processed wrist images and bone-specific crops of the distal radius, distal ulna, and ulnar styloid, WristMIR jointly trains global and local contrastive encoders and performs a two-stage retrieval process: (1) coarse global matching to identify candidate exams, followed by (2) region-conditioned reranking aligned to a predefined anatomical bone region. WristMIR improves retrieval performance over strong vision-language baselines, raising image-to-text Recall@5 from 0.82% to 9.35%. Its embeddings also yield stronger fracture classification (AUROC 0.949, AUPRC 0.953). In region-aware evaluation, the two-stage design markedly improves retrieval-based fracture diagnosis, increasing mean F1 from 0.568 to 0.753, and radiologists rate its retrieved cases as more clinically relevant, with mean scores rising from 3.36 to 4.35. These findings highlight the potential of anatomically guided retrieval to enhance diagnostic reasoning and support clinical decision-making in pediatric musculoskeletal imaging. The source code is publicly available at https://github.com/quin-med-harvard-edu/WristMIR.
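
The coarse-to-fine retrieval described in the abstract can be sketched as a global shortlist followed by a rerank on one anatomical region. This is an illustrative sketch only, not the released WristMIR code: the dictionary layout (`global`, `regions`, `id`), the region key `distal_radius`, the toy 2D embeddings, and the `shortlist` and `top_k` parameters are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def two_stage_retrieve(query, gallery, region, shortlist=3, top_k=2):
    # Stage 1: coarse global matching keeps the best `shortlist` exams.
    coarse = sorted(gallery, key=lambda g: cosine(query["global"], g["global"]),
                    reverse=True)[:shortlist]
    # Stage 2: rerank the shortlist using only the chosen region embedding.
    fine = sorted(coarse,
                  key=lambda g: cosine(query["regions"][region], g["regions"][region]),
                  reverse=True)
    return [g["id"] for g in fine[:top_k]]

query = {"global": [1.0, 0.0], "regions": {"distal_radius": [1.0, 0.0]}}
gallery = [
    {"id": "A", "global": [1.0, 0.0], "regions": {"distal_radius": [0.0, 1.0]}},
    {"id": "B", "global": [0.9, 0.1], "regions": {"distal_radius": [1.0, 0.0]}},
    {"id": "C", "global": [0.8, 0.2], "regions": {"distal_radius": [0.7, 0.7]}},
    {"id": "D", "global": [0.0, 1.0], "regions": {"distal_radius": [1.0, 0.0]}},
]
ranked = two_stage_retrieve(query, gallery, "distal_radius")
```

In this toy gallery, exam D matches the query region perfectly but never reaches the rerank because its global embedding is dissimilar, while exam A tops the coarse stage and drops out after reranking. The shortlist size therefore trades recall against the cost of the second stage.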

[202] Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video

Zihui Gao,Ke Liu,Donny Y. Chen,Duochao Shi,Guosheng Lin,Hao Chen,Chunhua Shen

Main category: cs.CV

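TL;DR: SAGE is a new framework for scalably adapting geometric foundation models from Internet video streams; hierarchical mining and hybrid supervision (sparse geometric anchoring plus dense differentiable consistency) improve zero-shot generalization in 3D reconstruction.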

Details Motivation: Geometric foundation models hold great promise for 3D reconstruction but are limited by the scarcity of large-scale, diverse 3D annotations; Internet video is abundant, yet it lacks ground-truth geometry and contains observational noise, which makes it hard to use directly for geometric learning. Method: Proposes the SAGE framework: 1) informative training trajectory selection; 2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; 3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints; a regularization strategy based on anchor data prevents catastrophic forgetting. Result: On unseen benchmarks such as 7Scenes, TUM-RGBD, and Matterport3D, zero-shot Chamfer Distance drops by 20-42%, clearly ahead of existing methods. Conclusion: SAGE is, to the authors' knowledge, the first scalable adaptation of geometric foundation models from Internet video and establishes a new paradigm for general-purpose 3D learning. Abstract: Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) Informative training trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a regularization strategy using anchor data. Extensive experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks (7Scenes, TUM-RGBD, Matterport3D) compared to state-of-the-art baselines. To our knowledge, SAGE pioneers the adaptation of geometric foundation models via Internet video, establishing a scalable paradigm for general-purpose 3D learning.
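
For readers unfamiliar with the reported metric, here is a minimal brute-force sketch of a Chamfer distance between two point sets. Several conventions exist (squared or unsquared distances, sum or mean); this sketch assumes the mean unsquared nearest-neighbor distance summed over both directions, which may differ from the convention used in the paper's evaluation. The function name `chamfer_distance` is hypothetical.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance, both directions."""
    def one_way(src, dst):
        # For each source point, the distance to its closest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(predicted, reference)
```

The brute-force search is quadratic in the number of points; evaluations on dense reconstructions usually swap in a spatial index such as a k-d tree for the nearest-neighbor step.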

[203] Rethinking Practical and Efficient Quantization Calibration for Vision-Language Models

Zhenhao Shang,Haizhao Jing,Guoting Wei,Haokui Zhang,Rong Xiao,Jianqing Gao,Peng Wang

Main category: cs.CV

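TL;DR: This paper proposes TLQ, a new post-training quantization (PTQ) framework for vision-language models (VLMs); a gradient-based token-level importance mechanism and a multi-GPU layer-wise calibration strategy markedly improve quantized performance and stability without fine-tuning.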

Details Motivation: In vision-language models, visual and text tokens differ widely in activation distribution and in sensitivity to quantization error, so conventional PTQ calibration works poorly. Method: Proposes the Token-level Importance-aware Layer-wise Quantization (TLQ) framework: 1) uses gradient information to design a token-level importance integration mechanism and to build a token-level calibration set; 2) adopts a multi-GPU, quantization-exposed layer-wise calibration scheme that keeps calibration consistent with the true quantized inference path and spreads the computational load. Result: TLQ yields consistent gains across two VLMs, three model scales, and two quantization settings, showing strong quantization stability; it runs efficiently on consumer-grade GPUs such as the RTX3090, reducing reliance on large-memory GPUs such as the A100. Conclusion: TLQ offers a finer-grained PTQ calibration paradigm for VLMs that stays closer to real inference and balances effectiveness, stability, and hardware friendliness. Abstract: Post-training quantization (PTQ) is a primary approach for deploying large language models without fine-tuning, and the quantized performance is often strongly affected by the calibration in PTQ. By contrast, in vision-language models (VLMs), substantial differences between visual and text tokens in their activation distributions and sensitivities to quantization error pose significant challenges for effective calibration during PTQ. In this work, we rethink what PTQ calibration should align with in VLMs and propose the Token-level Importance-aware Layer-wise Quantization framework (TLQ). Guided by gradient information, we design a token-level importance integration mechanism for quantization error, and use it to construct a token-level calibration set, enabling a more fine-grained calibration strategy. Furthermore, TLQ introduces a multi-GPU, quantization-exposed layer-wise calibration scheme. This scheme keeps the layer-wise calibration procedure consistent with the true quantized inference path and distributes the complex layer-wise calibration workload across multiple RTX3090 GPUs, thereby reducing reliance on the large memory of A100 GPUs. TLQ is evaluated across two models, three model scales, and two quantization settings, consistently achieving performance improvements across all settings, indicating its strong quantization stability. The code will be released publicly.

[204] Which private attributes do VLMs agree on and predict well?

Olena Hrynenko,Darya Baranouskaya,Alina Elena Baia,Andrea Cavallaro

Main category: cs.CV

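TL;DR: This paper evaluates the zero-shot performance of open-source vision-language models (VLMs) on recognizing privacy-related attributes and finds that VLMs predict the presence of privacy attributes more often than human annotators, yet when the models agree strongly they can complement human annotation by surfacing overlooked attributes.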

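Details Motivation: To evaluate the zero-shot ability of open-source VLMs to recognize privacy-related visual attributes and to explore their potential as an aid for privacy annotation of large-scale image datasets. Method: Runs a zero-shot privacy attribute recognition evaluation of open-source VLMs, analyzes where they agree and disagree with human annotations, and examines how well the annotations complement each other when agreement among VLMs is high. Result: VLMs predict the presence of privacy attributes more frequently than human annotators; when inter-VLM agreement is high, they can identify privacy attributes that humans overlooked. Conclusion: Although VLMs tend to over-predict, in high-agreement cases they can effectively assist humans with privacy attribute annotation of large-scale image datasets. Abstract: Visual Language Models (VLMs) are often used for zero-shot detection of visual attributes in images. We present a zero-shot evaluation of open-source VLMs for privacy-related attribute recognition. We identify the attributes for which VLMs exhibit strong inter-annotator agreement, and discuss the disagreement cases of human and VLM annotations. Our results show that when evaluated against human annotations, VLMs tend to predict the presence of privacy attributes more often than human annotators. In addition to this, we find that in cases of high inter-annotator agreement between VLMs, they can complement human annotation by identifying attributes overlooked by human annotators. This highlights the potential of VLMs to support privacy annotations in large-scale image datasets.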

[205] Integrating Specialized and Generic Agent Motion Prediction with Dynamic Occupancy Grid Maps

Rabbia Asghar,Lukas Rummelhard,Wenqian Liu,Anne Spalanzani,Christian Laugier

Main category: cs.CV

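TL;DR: This paper proposes a unified prediction framework based on Dynamic Occupancy Grid Maps (DOGM) that uses a lightweight spatiotemporal backbone and a tailored interdependent loss to jointly predict future occupancy states, vehicle distribution, and scene flow, combining the strengths of agent-agnostic and agent-specific prediction and achieving better results on the nuScenes and Woven Planet datasets.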

Details Motivation: Among existing prediction methods, agent-agnostic models struggle to capture the behavioral complexity of dynamic agents, while agent-specific models generalize poorly to badly perceived or unrecognized agents; the two need to be combined for robust and safe motion prediction. Method: Proposes a unified framework based on Dynamic Occupancy Grid Maps (DOGM) that uses a lightweight spatiotemporal backbone and a tailored interdependent loss to jointly predict future occupancy state grids, vehicle grids, and scene flow grids; occupancy state information constrains the flow-guided evolution and accounts for obstacles and occlusions. Result: On the real-world nuScenes and Woven Planet datasets, the method outperforms baselines in predicting both dynamic vehicles and generic dynamic scene elements. Conclusion: The framework combines the strengths of scene-level and agent-level prediction, and its structured multi-task learning and physically motivated loss design improve robustness, generalization, and safety. Abstract: Accurate prediction of driving scenes is a challenging task due to uncertainty in sensor data, the complex behaviors of agents, and the possibility of multiple feasible futures. Existing prediction methods using occupancy grid maps primarily focus on agent-agnostic scene predictions, while agent-specific predictions provide specialized behavior insights with the help of semantic information. However, both paradigms face distinct limitations: agent-agnostic models struggle to capture the behavioral complexities of dynamic actors, whereas agent-specific approaches fail to generalize to poorly perceived or unrecognized agents; combining both enables robust and safer motion forecasting. To address this, we propose a unified framework by leveraging Dynamic Occupancy Grid Maps within a streamlined temporal decoding pipeline to simultaneously predict future occupancy state grids, vehicle grids, and scene flow grids. Relying on a lightweight spatiotemporal backbone, our approach is centered on a tailored, interdependent loss function that captures inter-grid dependencies and enables diverse future predictions. By using occupancy state information to enforce flow-guided transitions, the loss function acts as a regularizer that directs occupancy evolution while accounting for obstacles and occlusions. Consequently, the model not only predicts the specific behaviors of vehicle agents, but also identifies other dynamic entities and anticipates their evolution within the complex scene. Evaluations on real-world nuScenes and Woven Planet datasets demonstrate superior prediction performance for dynamic vehicles and generic dynamic scene elements compared to baseline methods.

[206] One-Shot Crowd Counting With Density Guidance For Scene Adaptaion

Jiwei Chen,Qi Wang,Junyu Gao,Jing Zhang,Dingyi Li,Jing-Jia Luo

Main category: cs.CV

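TL;DR: This paper proposes a few-shot cross-scene crowd counting method that uses local and global density features to guide the model in adapting to unseen surveillance scenes, and it outperforms existing SOTA methods on three datasets.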

Details Motivation: Existing crowd counting models generalize poorly across surveillance scenes and have difficulty adapting to unseen ones. Method: Introduces a few-shot learning framework with a multiple local density learner (which learns several prototypes representing different density distributions) and a global density feature extraction module, guiding adaptation to the target scene through local similarity matrices and global density features respectively. Result: The method is validated on three surveillance scene datasets and clearly outperforms recent state-of-the-art few-shot crowd counting methods. Conclusion: A few-shot strategy that combines local and global density modeling effectively improves the generalization of crowd counting models to new surveillance scenes. Abstract: Crowd scenes captured by cameras at different locations vary greatly, and existing crowd models have limited generalization for unseen surveillance scenes. To improve the generalization of the model, we regard different surveillance scenes as different category scenes, and introduce few-shot learning to make the model adapt to the unseen surveillance scene that belongs to the given exemplar category scene. To this end, we propose to leverage local and global density characteristics to guide the model of crowd counting for unseen surveillance scenes. Specifically, to enable the model to adapt to the density variations in the target scene, we propose the multiple local density learner to learn multiple prototypes which represent different density distributions in the support scene. Subsequently, the multiple local density similarity matrices are encoded and used to guide the model in a local way. To further adapt to the global density in the target scene, the global density features are extracted from the support image and are then used to guide the model in a global way. Experiments on three surveillance datasets show that the proposed method can adapt to the unseen surveillance scene and outperform recent state-of-the-art methods in few-shot crowd counting.

[207] D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning

Changli Tang,Tianyi Wang,Fengyun Rao,Jing Lyu,Chao Zhang

Main category: cs.CV

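TL;DR: This paper introduces D-ORCA, a dialogue-centric multimodal large language model optimized for robust audio-visual captioning, and builds DVD, a large-scale bilingual multi-party dialogue video dataset; combined with group relative policy optimization and three new reward functions, it substantially outperforms existing open-source models on speaker identification, speech recognition, and temporal grounding.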

Details Motivation: Accurately identifying who said what and when in a video is key to deep video understanding; the open-source ecosystem currently lacks high-quality, large-scale multi-party dialogue video datasets and models built for them. Method: Proposes D-ORCA, a dialogue-centric multimodal large language model; builds the bilingual DVD dataset (nearly 40,000 training videos and 2,000 evaluation videos); adopts group relative policy optimization with three new reward functions derived from metrics widely used in speech processing (speaker attribution, global speech content, and sentence-level temporal boundary alignment), applied for the first time as reinforcement learning objectives for audio-visual captioning. Result: D-ORCA substantially outperforms existing open-source models on speaker identification, speech recognition, and temporal grounding; with only 8B parameters it is competitive with Qwen3-Omni on several general-purpose audio-visual understanding benchmarks. Conclusion: D-ORCA demonstrates the effectiveness of dialogue-centric modeling and fine-grained reward-driven reinforcement learning for audio-visual understanding, and the DVD dataset fills a gap in open-source multi-party dialogue video resources. Abstract: Spoken dialogue is a primary source of information in videos; therefore, accurately identifying who spoke what and when is essential for deep video understanding. We introduce D-ORCA, a dialogue-centric omni-modal large language model optimized for robust audio-visual captioning. We further curate DVD, a large-scale, high-quality bilingual dataset comprising nearly 40,000 multi-party dialogue videos for training and 2,000 videos for evaluation in English and Mandarin, addressing a critical gap in the open-source ecosystem. To ensure fine-grained captioning accuracy, we adopt group relative policy optimization with three novel reward functions that assess speaker attribution accuracy, global speech content accuracy, and sentence-level temporal boundary alignment. These rewards are derived from evaluation metrics widely used in speech processing and, to our knowledge, are applied for the first time as reinforcement learning objectives for audio-visual captioning. Extensive experiments demonstrate that D-ORCA substantially outperforms existing open-source models in speaker identification, speech recognition, and temporal grounding. Notably, despite having only 8 billion parameters, D-ORCA achieves performance competitive with Qwen3-Omni across several general-purpose audio-visual understanding benchmarks. Demos are available at https://d-orca-llm.github.io/. Our code, data, and checkpoints will be available at https://github.com/WeChatCV/D-ORCA/.
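
Group relative policy optimization scores each sampled response relative to the other responses drawn for the same prompt. The following is a minimal sketch of that normalization step, assuming a single scalar reward per response; how D-ORCA combines its three rewards into such a scalar is not stated in the abstract, so the example rewards and the function name `group_relative_advantages` are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical scalar rewards for four captions sampled for one video.
advantages = group_relative_advantages([0.9, 0.5, 0.5, 0.1])
```

Responses above the group mean get positive advantages and those below get negative ones, so no separate value network is needed to provide a baseline. When all rewards in a group are equal, every advantage is zero and the group contributes no gradient signal.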

[208] EasyTune: Efficient Step-Aware Fine-Tuning for Diffusion-Based Motion Generation

Xiaofeng Tan,Wanjiang Weng,Haodong Lei,Hongsong Wang

Main category: cs.CV

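TL;DR: This paper proposes EasyTune, which fine-tunes the diffusion model independently at each denoising step to remove the recursive dependence that makes existing differentiable-reward alignment methods inefficient and memory-hungry; it also introduces a Self-refinement Preference Learning (SPL) mechanism to ease the scarcity of preference motion data, markedly improving alignment while cutting memory overhead and training time.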

Details Motivation: Existing methods that align motion generation diffusion models with differentiable rewards suffer from inefficient optimization, high memory consumption, and scarce preference motion data; the root cause is the recursive dependence between steps of the denoising trajectory. Method: Proposes EasyTune, which fine-tunes the diffusion model independently at each denoising step to break the recursive dependence, and designs a Self-refinement Preference Learning (SPL) mechanism that dynamically builds preference pairs and performs preference learning. Result: EasyTune outperforms DRaFT-50 by 8.2% in alignment improvement on MM-Dist, with only 31.16% of its additional memory overhead and a 7.3x training speedup. Conclusion: Decoupling the recursive dependence of the denoising trajectory is a key route to efficient and scalable reward alignment for motion generation models; EasyTune and SPL offer an efficient, lightweight, and data-efficient approach. Abstract: In recent years, motion generative models have undergone significant advancement, yet pose challenges in aligning with downstream objectives. Recent studies have shown that using differentiable rewards to directly align the preference of diffusion models yields promising results. However, these methods suffer from (1) inefficient and coarse-grained optimization with (2) high memory consumption. In this work, we first theoretically and empirically identify the key reason for these limitations: the recursive dependence between different steps in the denoising trajectory. Inspired by this insight, we propose EasyTune, which fine-tunes diffusion at each denoising step rather than over the entire trajectory. This decouples the recursive dependence, allowing us to perform (1) a dense and fine-grained, and (2) memory-efficient optimization. Furthermore, the scarcity of preference motion pairs restricts the availability of motion reward model training. To this end, we further introduce a Self-refinement Preference Learning (SPL) mechanism that dynamically identifies preference pairs and conducts preference learning. Extensive experiments demonstrate that EasyTune outperforms DRaFT-50 by 8.2% in alignment (MM-Dist) improvement while requiring only 31.16% of its additional memory overhead and achieving a 7.3x training speedup. The project page is available at https://xiaofeng-tan.github.io/projects/EasyTune/index.html.

[209] FSP-Diff: Full-Spectrum Prior-Enhanced DualDomain Latent Diffusion for Ultra-Low-Dose Spectral CT Reconstruction

Peng Peng,Xinrui Zhang,Junlin Wang,Lei Li,Shaoyu Wang,Qiegen Liu

Main category: cs.CV

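TL;DR: This paper proposes the FSP-Diff framework, which combines complementary feature construction, full-spectrum prior integration, and efficient latent diffusion synthesis to markedly improve the quality and efficiency of ultra-low-dose spectral CT reconstruction.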

Details Motivation: At ultra-low dose, the signal-to-noise ratio of spectral CT projection data drops sharply, which causes severe artifacts and loss of structural detail in reconstructed images. Method: Proposes FSP-Diff, a full-spectrum prior-enhanced dual-domain latent diffusion framework comprising: 1) complementary feature construction (combining image-domain reconstructions with projection-domain denoised results); 2) full-spectrum prior integration (fusing multi-energy projections into a high-SNR full-spectrum reference image); 3) efficient latent diffusion synthesis (embedding multi-path features into a compact latent space to accelerate diffusion-based reconstruction). Result: On simulated and real-world datasets, FSP-Diff significantly outperforms state-of-the-art methods in both image quality and computational efficiency. Conclusion: FSP-Diff offers a new and effective approach toward clinically viable ultra-low-dose spectral CT imaging. Abstract: Spectral computed tomography (CT) with photon-counting detectors holds immense potential for material discrimination and tissue characterization. However, under ultra-low-dose conditions, the sharply degraded signal-to-noise ratio (SNR) in energy-specific projections poses a significant challenge, leading to severe artifacts and loss of structural details in reconstructed images. To address this, we propose FSP-Diff, a full-spectrum prior-enhanced dual-domain latent diffusion framework for ultra-low-dose spectral CT reconstruction. Our framework integrates three core strategies: 1) Complementary Feature Construction: We integrate direct image reconstructions with projection-domain denoised results. While the former preserves latent textural nuances amidst heavy noise, the latter provides a stable structural scaffold to balance detail fidelity and noise suppression. 2) Full-Spectrum Prior Integration: By fusing multi-energy projections into a high-SNR full-spectrum image, we establish a unified structural reference that guides the reconstruction across all energy bins. 3) Efficient Latent Diffusion Synthesis: To alleviate the high computational burden of high-dimensional spectral data, multi-path features are embedded into a compact latent space. This allows the diffusion process to facilitate interactive feature fusion in a lower-dimensional manifold, achieving accelerated reconstruction while maintaining fine-grained detail restoration. Extensive experiments on simulated and real-world datasets demonstrate that FSP-Diff significantly outperforms state-of-the-art methods in both image quality and computational efficiency, underscoring its potential for clinically viable ultra-low-dose spectral CT imaging.

[210] Continuity-driven Synergistic Diffusion with Neural Priors for Ultra-Sparse-View CBCT Reconstruction

Junlin Wang,Jiancheng Fang,Peng Peng,Shaoyu Wang,Qiegen Liu

Main category: cs.CV

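TL;DR: This paper proposes Continuity-driven Synergistic Diffusion with Neural priors (CSDN) to address the trade-off between radiation dose and image quality in ultra-sparse-view CBCT reconstruction; neural priors model a continuous 3D attenuation distribution, and a dual-path diffusion strategy (sinogram and digital radiography) improves angular continuity and inter-slice consistency, yielding high-quality, low-dose CBCT reconstruction.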

Details Motivation: Clinical use of CBCT is limited by the trade-off between radiation dose and image quality; ultra-sparse angular sampling lowers the dose but causes severe undersampling artifacts and inter-slice inconsistency that undermine diagnostic reliability; existing methods struggle to balance angular continuity with spatial detail fidelity. Method: Proposes the CSDN framework: 1) neural priors build a continuous three-dimensional attenuation representation and synthesize physically consistent dense projections; 2) on top of this, a synergistic diffusion strategy combines Sinogram Refinement Diffusion (Sino-RD) to restore angular continuity with Digital Radiography Refinement Diffusion (DR-RD) to enforce inter-slice consistency; 3) a Dual-Projection Reconstruction Fusion (DPRF) module adaptively fuses the outputs of the two paths into a coherent volumetric reconstruction. Result: Experiments show that CSDN effectively suppresses artifacts and recovers fine textures under ultra-sparse-view conditions, outperforming current state-of-the-art methods. Conclusion: By combining neural priors with synergistic diffusion, CSDN achieves high-fidelity CBCT reconstruction at very low dose and offers a new approach to safe, accurate clinical imaging. Abstract: The clinical application of cone-beam computed tomography (CBCT) is constrained by the inherent trade-off between radiation exposure and image quality. Ultra-sparse angular sampling, employed to reduce dose, introduces severe undersampling artifacts and inter-slice inconsistencies, compromising diagnostic reliability. Existing reconstruction methods often struggle to balance angular continuity with spatial detail fidelity. To address these challenges, we propose a Continuity-driven Synergistic Diffusion with Neural priors (CSDN) for ultra-sparse-view CBCT reconstruction. Neural priors are introduced as a structural foundation to encode a continuous three-dimensional attenuation representation, enabling the synthesis of physically consistent dense projections from ultra-sparse measurements. Building upon this neural-prior-based initialization, a synergistic diffusion strategy is developed, consisting of two collaborative refinement paths: a Sinogram Refinement Diffusion (Sino-RD) process that restores angular continuity and a Digital Radiography Refinement Diffusion (DR-RD) process that enforces inter-slice consistency from the projection image perspective. The outputs of the two diffusion paths are adaptively fused by the Dual-Projection Reconstruction Fusion (DPRF) module to achieve coherent volumetric reconstruction. Extensive experiments demonstrate that the proposed CSDN effectively suppresses artifacts and recovers fine textures under ultra-sparse-view conditions, outperforming existing state-of-the-art techniques.

[211] Deepfake Synthesis vs. Detection: An Uneven Contest

Md. Tarek Hasan,Sanjay Saha,Shaojing Fan,Swakkhar Shatabda,Terence Sim

Main category: cs.CV

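TL;DR: This paper presents a comprehensive empirical analysis of state-of-the-art deepfake detection techniques and finds that existing detectors perform very poorly on deepfake videos produced by modern synthesis techniques such as diffusion models, NeRF, and improved GANs; human evaluators also struggle to spot high-quality fakes, showing that detection now lags clearly behind generation.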

Details Motivation: Deepfake techniques (such as diffusion models, NeRF, and enhanced GANs) are advancing quickly, making fake content ever more realistic and accessible; detection has also progressed (for example with Transformers and contrastive learning), but its real-world generalization and its ability to handle new kinds of fakes are unclear and need systematic evaluation. Method: Conducts a large-scale empirical analysis covering several SOTA deepfake detection models, together with human evaluation experiments, testing detection performance on deepfake videos generated by the latest synthesis methods (such as diffusion models, NeRF, and advanced GANs). Result: Most SOTA detection models degrade markedly on the newest deepfakes; human participants also identify the highest-quality deepfake videos with low accuracy; the experiments confirm that current detection capability lags well behind generation technology. Conclusion: Current deepfake detection methods have a serious capability gap; model iteration and methodological innovation must accelerate to close the gap with frontier generation techniques and protect the trustworthiness of digital content. Abstract: The rapid advancement of deepfake technology has significantly elevated the realism and accessibility of synthetic media. Emerging techniques, such as diffusion-based models and Neural Radiance Fields (NeRF), alongside enhancements in traditional Generative Adversarial Networks (GANs), have contributed to the sophisticated generation of deepfake videos. Concurrently, deepfake detection methods have seen notable progress, driven by innovations in Transformer architectures, contrastive learning, and other machine learning approaches. In this study, we conduct a comprehensive empirical analysis of state-of-the-art deepfake detection techniques, including human evaluation experiments against cutting-edge synthesis methods. Our findings highlight a concerning trend: many state-of-the-art detection models exhibit markedly poor performance when challenged with deepfakes produced by modern synthesis techniques, including poor performance by human participants against the best quality deepfakes. Through extensive experimentation, we provide evidence that underscores the urgent need for continued refinement of detection models to keep pace with the evolving capabilities of deepfake generation technologies. This research emphasizes the critical gap between current detection methodologies and the sophistication of new generation techniques, calling for intensified efforts in this crucial area of study.

[212] MCIE: Multimodal LLM-Driven Complex Instruction Image Editing with Spatial Guidance

Xuehai Bai,Xiaoling Gu,Akide Liu,Hangjie Yuan,YiFan Zhang,Jack Ma

Main category: cs.CV

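TL;DR: This paper proposes MCIE-E1, which uses a spatial-aware cross-attention module and a background-consistent cross-attention module to improve instruction compliance and background consistency in complex-instruction image editing, and builds a new dataset and a new benchmark, CIE-Bench, to validate it.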

Details Motivation: Existing instruction-based image editing methods struggle with complex, compositional editing instructions and suffer from two problems: insufficient instruction compliance and background inconsistency. Method: MCIE-E1 combines a spatial-aware cross-attention module (strengthening instruction-to-region alignment) with a background-consistent cross-attention module (preserving features in unedited regions); a data pipeline for complex instructions combines automatic MLLM filtering with human validation; a new benchmark, CIE-Bench, is introduced with two new evaluation metrics. Result: On CIE-Bench, MCIE-E1 outperforms SOTA methods both quantitatively and qualitatively, with a 23.96% improvement in instruction compliance. Conclusion: MCIE-E1 advances complex-instruction image editing along three axes (model architecture, data construction, and evaluation) and mitigates the instruction-compliance and background-consistency problems. Abstract: Recent advances in instruction-based image editing have shown remarkable progress. However, existing methods remain limited to relatively simple editing operations, hindering real-world applications that require complex and compositional instructions. In this work, we address these limitations from the perspectives of architectural design, data, and evaluation protocols. Specifically, we identify two key challenges in current models: insufficient instruction compliance and background inconsistency. To this end, we propose MCIE-E1, a Multimodal Large Language Model-Driven Complex Instruction Image Editing method that integrates two key modules: a spatial-aware cross-attention module and a background-consistent cross-attention module. The former enhances instruction-following capability by explicitly aligning semantic instructions with spatial regions through spatial guidance during the denoising process, while the latter preserves features in unedited regions to maintain background consistency. To enable effective training, we construct a dedicated data pipeline to mitigate the scarcity of complex instruction-based image editing datasets, combining fine-grained automatic filtering via a powerful MLLM with rigorous human validation. Finally, to comprehensively evaluate complex instruction-based image editing, we introduce CIE-Bench, a new benchmark with two new evaluation metrics. Experimental results on CIE-Bench demonstrate that MCIE-E1 consistently outperforms previous state-of-the-art methods in both quantitative and qualitative assessments, achieving a 23.96% improvement in instruction compliance.

[213] ForecastOcc: Vision-based Semantic Occupancy Forecasting

Riya Mohan,Juana Valeria Hurtado,Rohit Mohan,Abhinav Valada

Main category: cs.CV

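TL;DR: This paper presents ForecastOcc, the first end-to-end vision framework that forecasts semantic occupancy over multiple horizons directly from images, avoiding reliance on intermediate occupancy maps and yielding more accurate, semantically richer forecasts than the baselines.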

Details Motivation: Existing occupancy forecasting methods either ignore semantic information or depend on occupancy maps predicted by separate networks, which leads to error accumulation and makes it hard to learn spatio-temporal and semantic features jointly. Method: The end-to-end ForecastOcc framework comprises a temporal cross-attention module, a 2D-to-3D view transformer, a 3D occupancy encoder, and a semantic occupancy head, and forecasts multi-horizon semantic occupancy voxels directly from multi-view or monocular images. Result: The first semantic occupancy forecasting benchmark is established on Occ3D-nuScenes (multi-view) and SemanticKITTI (monocular), and ForecastOcc clearly outperforms the baselines constructed for it. Conclusion: ForecastOcc demonstrates the feasibility and benefit of jointly modeling geometric and semantic dynamics, giving autonomous driving a more robust, semantics-aware understanding of future scenes. Abstract: Autonomous driving requires forecasting both geometry and semantics over time to effectively reason about future environment states. Existing vision-based occupancy forecasting methods focus on motion-related categories such as static and dynamic objects, while semantic information remains largely absent. Recent semantic occupancy forecasting approaches address this gap but rely on past occupancy predictions obtained from separate networks. This makes current methods sensitive to error accumulation and prevents learning spatio-temporal features directly from images. In this work, we present ForecastOcc, the first framework for vision-based semantic occupancy forecasting that jointly predicts future occupancy states and semantic categories. Our framework yields semantic occupancy forecasts for multiple horizons directly from past camera images, without relying on externally estimated maps. We evaluate ForecastOcc in two complementary settings: multi-view forecasting on the Occ3D-nuScenes dataset and monocular forecasting on SemanticKITTI, where we establish the first benchmark for this task. We introduce the first baselines by adapting two 2D forecasting modules within our framework. Importantly, we propose a novel architecture that incorporates a temporal cross-attention forecasting module, a 2D-to-3D view transformer, a 3D encoder for occupancy prediction, and a semantic occupancy head for voxel-level forecasts across multiple horizons. 
Extensive experiments on both datasets show that ForecastOcc consistently outperforms baselines, yielding semantically rich, future-aware predictions that capture scene dynamics and semantics critical for autonomous driving.

[214] PhysDrape: Learning Explicit Forces and Collision Constraints for Physically Realistic Garment Draping

Minghai Chen,Mingyuan Liu,Yuxiang Huan

Main category: cs.CV

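TL;DR: This paper proposes PhysDrape, a hybrid method that couples a neural network with explicit physical solvers. Its differentiable force-solving and collision-projection modules improve the physical realism and robustness of garment simulation while guaranteeing geometric feasibility.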

Details Motivation: Existing deep learning-based garment simulation methods handle collisions with soft constraints, which makes geometric feasibility and physical plausibility hard to reconcile: stronger collision penalties tend to distort the mesh, while preserving shape leads to interpenetration. Method: PhysDrape (1) builds a physics-enriched graph (encoding material parameters and body proximity) that drives a physics-informed graph neural network to predict residual displacements, and (2) applies a differentiable two-stage solver: a learnable force solver first iterates toward quasi-static equilibrium under the StVK model, and a differentiable projection then strictly enforces the non-penetration constraint against the body surface. Result: Experiments show that PhysDrape achieves near-zero interpenetration with significantly lower strain energy, and surpasses existing methods in physical fidelity and real-time robustness, reaching SOTA performance. Conclusion: By embedding explicit physical constraints in an end-to-end differentiable framework, PhysDrape narrows the gap between neural modeling and physical realism and offers a new paradigm for high-quality real-time garment simulation. Abstract: Deep learning-based garment draping has emerged as a promising alternative to traditional Physics-Based Simulation (PBS), yet robust collision handling remains a critical bottleneck. Most existing methods enforce physical validity through soft penalties, creating an intrinsic trade-off between geometric feasibility and physical plausibility: penalizing collisions often distorts mesh structure, while preserving shape leads to interpenetration. To resolve this conflict, we present PhysDrape, a hybrid neural-physical solver for physically realistic garment draping driven by explicit forces and constraints. Unlike soft-constrained frameworks, PhysDrape integrates neural inference with explicit geometric solvers in a fully differentiable pipeline. Specifically, we propose a Physics-Informed Graph Neural Network conditioned on a physics-enriched graph -- encoding material parameters and body proximity -- to predict residual displacements. Crucially, we integrate a differentiable two-stage solver: first, a learnable Force Solver iteratively resolves unbalanced forces derived from the Saint Venant-Kirchhoff (StVK) model to ensure quasi-static equilibrium; second, a Differentiable Projection strictly enforces collision constraints against the body surface. This differentiable design guarantees physical validity through explicit constraints, while enabling end-to-end learning to optimize the network for physically consistent predictions. 
Extensive experiments demonstrate that PhysDrape achieves state-of-the-art performance, ensuring negligible interpenetration with significantly lower strain energy compared to existing baselines, achieving superior physical fidelity and robustness in real-time.
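
To illustrate the hard-constraint idea in the PhysDrape abstract, here is a minimal sketch of a collision projection step. It is not the authors' implementation: the body is assumed to be a single sphere, and `project_outside_sphere`, `margin`, and the sample vertices are hypothetical names chosen for illustration. The learned force solver and the gradient flow through the projection are omitted.

```python
import math

def project_outside_sphere(vertices, center, radius, margin=1e-3):
    """Move every vertex that lies inside the sphere onto the shell at
    radius + margin, along the ray from the sphere center through the vertex.
    Vertices already on or outside that shell are returned unchanged."""
    target = radius + margin
    projected = []
    for v in vertices:
        offset = [vi - ci for vi, ci in zip(v, center)]
        dist = math.sqrt(sum(o * o for o in offset))
        if dist >= target:
            projected.append(tuple(v))
        elif dist == 0.0:
            # Degenerate case: the direction is undefined, so pick +x (assumption).
            projected.append((center[0] + target, center[1], center[2]))
        else:
            scale = target / dist
            projected.append(tuple(ci + o * scale for ci, o in zip(center, offset)))
    return projected

garment = [(0.5, 0.0, 0.0), (0.0, 2.0, 0.0)]  # first vertex penetrates the body
fixed = project_outside_sphere(garment, center=(0.0, 0.0, 0.0), radius=1.0)
```

Unlike a soft penalty, a projection of this kind leaves no penetrating vertex after the step, which is the trade-off the abstract contrasts.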

[215] MIND: Benchmarking Memory Consistency and Action Control in World Models

Yixuan Ye,Xuanyu Lu,Yuxin Jiang,Yuchao Gu,Rui Zhao,Qiwei Liang,Jiachun Pan,Fengda Zhang,Weijia Wu,Alex Jinpeng Wang

Main category: cs.CV

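TL;DR: This paper introduces MIND, the first open-domain closed-loop benchmark for evaluating memory consistency and action control in world models. It contains 250 high-quality videos, comes with an evaluation framework and a baseline model, MIND-World, and exposes key challenges for current models in long-term memory consistency and generalization across action spaces.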

Details Motivation: World models lack a unified benchmark for systematically evaluating their core abilities to understand, remember, and predict dynamic visual environments, in particular the two fundamental abilities of memory consistency and action control. Method: The MIND benchmark comprises 250 videos at 1080p and 24 FPS (split by first-person and third-person view and by action space); an efficient evaluation framework quantifies memory consistency (temporal stability) and action control (contextual coherence across viewpoints); an interactive Video-to-World baseline, MIND-World, is introduced. Result: Experiments demonstrate the completeness of MIND and reveal significant bottlenecks in current world models: maintaining long-term memory consistency and generalizing across action spaces. Conclusion: MIND fills the gap in comprehensive world-model evaluation and provides a standardized testbed and clear directions for improving the joint modeling of memory and action in embodied intelligence. Abstract: World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces. Project page: https://csu-jpg.github.io/MIND.github.io/

[216] Enhanced Mixture 3D CGAN for Completion and Generation of 3D Objects

Yahia Hamdi,Nicolas Andrialovanirina,Kélig Mahé,Emilie Poisson Caillault

Main category: cs.CV

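TL;DR: This paper combines deep 3D convolutional GANs (CGANs) with a Mixture of Experts (MoE) framework to generate high-quality 3D models and reconstruct incomplete or damaged objects. Multiple specialized generators and an auxiliary-loss-free dynamic capacity constraint (DCC) mechanism improve performance while preserving training stability and computational efficiency.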

Details Motivation: Conventional GANs have difficulty modeling complex, heterogeneous, and structurally intricate 3D data, especially when inputs are incomplete, and incur high computational cost, which limits their practical use. Method: A 3D CGAN is combined with an MoE framework using multiple generators, each specialized for a distinct modality in the data, and an auxiliary-loss-free dynamic capacity constraint (DCC) mechanism guides expert selection and load balancing. Result: On 3D shape generation and completion with missing regions of varying sizes, the MoE-DCGAN method outperforms state-of-the-art approaches in both quantitative and qualitative evaluation. Conclusion: An MoE architecture strengthens a 3D generative model's ability to capture complex distributions and its robustness, and the DCC mechanism provides efficient, stable expert scheduling for 3D voxel processing. Abstract: The generation and completion of 3D objects represent a transformative challenge in computer vision. Generative Adversarial Networks (GANs) have recently demonstrated strong potential in synthesizing realistic visual data. However, they often struggle to capture complex and diverse data distributions, particularly in scenarios involving incomplete inputs or significant missing regions. These challenges arise mainly from the high computational requirements and the difficulty of modeling heterogeneous and structurally intricate data, which restrict their applicability in real-world settings. Mixture of Experts (MoE) models have emerged as a promising solution to these limitations. By dynamically selecting and activating the most relevant expert sub-networks for a given input, MoEs improve both performance and efficiency. In this paper, we investigate the integration of Deep 3D Convolutional GANs (CGANs) with a MoE framework to generate high-quality 3D models and reconstruct incomplete or damaged objects. The proposed architecture incorporates multiple generators, each specialized to capture distinct modalities within the dataset. Furthermore, an auxiliary loss-free dynamic capacity constraint (DCC) mechanism is introduced to guide the selection of categorical generators, ensuring a balance between specialization, training stability, and computational efficiency, which is critical for 3D voxel processing. We evaluated the model's ability to generate and complete shapes with missing regions of varying sizes and compared its performance with state-of-the-art approaches. 
Both quantitative and qualitative results confirm the effectiveness of the proposed MoE-DCGAN in handling complex 3D data.

[217] Vanilla Group Equivariant Vision Transformer: Simple and Effective

Jiahong Fu,Qi Xie,Deyu Meng,Zongben Xu

Main category: cs.CV

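TL;DR: This paper proposes a systematic framework for building Vision Transformers (ViTs) with guaranteed equivariance. By making patch embedding, self-attention, positional encodings, and down/up-sampling all equivariant, it combines strong performance with theoretical guarantees and extends as a plug-and-play replacement to architectures such as Swin Transformer.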

Details Motivation: Existing equivariant ViTs struggle to balance performance with equivariance, mainly because it is hard to design holistic equivariance across the diverse ViT modules, especially self-attention and patch embedding. Method: A simple framework systematically renders the core ViT components (patch embedding, self-attention, positional encodings, and down/up-sampling) equivariant, so that the overall architecture has theoretically guaranteed equivariance and can be transferred in a plug-and-play manner (e.g., to Swin Transformers). Result: The resulting equivariant ViTs consistently improve performance and data efficiency across a range of vision tasks and extend seamlessly to advanced architectures such as Swin Transformer. Conclusion: Systematic, module-level equivariant modification is an effective route to high-performing, theoretically grounded equivariant ViTs; the framework is general, extensible, and practical. Abstract: Incorporating symmetry priors as inductive biases to design equivariant Vision Transformers (ViTs) has emerged as a promising avenue for enhancing their performance. However, existing equivariant ViTs often struggle to balance performance with equivariance, primarily due to the challenge of achieving holistic equivariant modifications across the diverse modules in ViTs, particularly in harmonizing the Self-Attention mechanism with Patch Embedding. To address this, we propose a straightforward framework that systematically renders key ViT components, including patch embedding, self-attention, positional encodings, and Down/Up-Sampling, equivariant, thereby constructing ViTs with guaranteed equivariance. The resulting architecture serves as a plug-and-play replacement that is both theoretically grounded and practically versatile, scaling seamlessly even to Swin Transformers. Extensive experiments demonstrate that our equivariant ViTs consistently improve performance and data efficiency across a wide spectrum of vision tasks.

[218] Weak to Strong: VLM-Based Pseudo-Labeling as a Weakly Supervised Training Strategy in Multimodal Video-based Hidden Emotion Understanding Tasks

Yufei Wang,Haixu Liu,Tianxiang Xu,Chuancheng Shi,Hongsheng Xing

Main category: cs.CV

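TL;DR: This paper proposes a multimodal weakly supervised framework for automatically recognizing concealed emotions in video and reaches SOTA performance on the iMiGUE dataset. YOLO and DINOv2 extract visual features, Gemini 2.5 Pro generates pseudo-labels and reasoning texts, OpenPose key points are modeled with an MLP, and Transformers and BERT fuse the modalities, raising accuracy to over 0.69 despite severe class imbalance.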

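Details Motivation: Automatic recognition of "concealed emotions" in video remains hard: prior methods are limited on real-world datasets such as iMiGUE (accuracy below 0.6) and face scarce annotations and class imbalance. Method: A multimodal weakly supervised framework in which (1) YOLO 11x detects and crops people and DINOv2-Base extracts visual features; (2) Gemini 2.5 Pro, prompted with CoT + Reflection, generates pseudo-labels and reasoning texts; (3) OpenPose produces 137-dimensional key-point sequences augmented with inter-frame offset features, and an MLP replaces the usual GCN backbone for modeling spatiotemporal relationships; (4) an ultra-long-sequence Transformer encodes the image and key-point sequences separately, and the results are concatenated with BERT-encoded interview transcripts; (5) each modality is pre-trained in isolation and then fine-tuned jointly, with pseudo-labeled samples merged into the training set. Result: On the iMiGUE tennis-interview dataset, accuracy rises from under 0.6 in prior work to over 0.69, establishing a new public benchmark; the simplified MLP key-point backbone matches or even surpasses GCN-based counterparts on this task. Conclusion: Multimodal weak supervision, particularly pseudo-labels generated by a large model combined with a lightweight architecture, effectively addresses annotation scarcity, modality heterogeneity, and class imbalance in concealed-emotion recognition, and MLP-based key-point modeling has practical advantages. Abstract: To tackle the automatic recognition of "concealed emotions" in videos, this paper proposes a multimodal weak-supervision framework and achieves state-of-the-art results on the iMiGUE tennis-interview dataset. First, YOLO 11x detects and crops human portraits frame-by-frame, and DINOv2-Base extracts visual features from the cropped regions. Next, by integrating Chain-of-Thought and Reflection prompting (CoT + Reflection), Gemini 2.5 Pro automatically generates pseudo-labels and reasoning texts that serve as weak supervision for downstream models. Subsequently, OpenPose produces 137-dimensional key-point sequences, augmented with inter-frame offset features; the usual graph neural network backbone is simplified to an MLP to efficiently model the spatiotemporal relationships of the three key-point streams. An ultra-long-sequence Transformer independently encodes both the image and key-point sequences, and their representations are concatenated with BERT-encoded interview transcripts. Each modality is first pre-trained in isolation, then fine-tuned jointly, with pseudo-labeled samples merged into the training set for further gains. Experiments demonstrate that, despite severe class imbalance, the proposed approach lifts accuracy from under 0.6 in prior work to over 0.69, establishing a new public benchmark. The study also validates that an "MLP-ified" key-point backbone can match - or even surpass - GCN-based counterparts in this task.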

[219] Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

Xihang Yu,Rajat Talak,Lorenzo Shaikewitz,Luca Carlone

Main category: cs.CV

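TL;DR: This paper proposes Picasso, a physics-constrained multi-object scene reconstruction method that accounts for geometry, non-penetration, and physics, and combines fast rejection sampling with contact-graph reasoning to make reconstructions more physically plausible and more consistent with human intuition. It also releases the Picasso dataset and a metric for physical plausibility.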

Details Motivation: Conventional geometric reconstruction fits the sensor data, but under occlusion and noise it can produce physically implausible results (such as object interpenetration or unstable equilibrium), which hinders dynamics prediction and simulation-based planning for contact-rich scenes in a digital twin. Method: The Picasso physics-constrained reconstruction pipeline (1) jointly considers multi-object geometry, non-penetration constraints, and physics; (2) uses fast rejection sampling, guided by an inferred object contact graph, to reason over multi-object interactions; (3) is accompanied by the Picasso dataset with ground-truth annotations and a quantitative metric for physical plausibility. Result: Experiments on the new Picasso dataset and on the YCB-V dataset show that Picasso substantially outperforms existing methods, with reconstructions that are both more physically plausible and better aligned with human intuition. Conclusion: Multi-object scene reconstruction needs to model object interactions and physical constraints holistically; Picasso validates this paradigm and provides a new benchmark and tools for follow-up research. Abstract: In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. 
Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.
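
The rejection-sampling idea in the Picasso abstract can be sketched in a toy 2D setting. This is an illustrative assumption, not the paper's pipeline: objects are discs, the only physical check is non-penetration, and the contact graph that guides Picasso's samples is left out. The names `overlaps` and `sample_non_penetrating` and all numeric values are hypothetical.

```python
import math
import random

def overlaps(c1, c2, r1, r2):
    """True if two discs interpenetrate."""
    return math.dist(c1, c2) < r1 + r2

def sample_non_penetrating(estimates, radii, sigma=0.5, max_tries=1000, seed=0):
    """Draw noisy centers around the estimates until no pair of discs overlaps.
    Returns a list of centers, or None if every try was rejected."""
    rng = random.Random(seed)
    n = len(estimates)
    for _ in range(max_tries):
        sample = [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
                  for x, y in estimates]
        if not any(overlaps(sample[i], sample[j], radii[i], radii[j])
                   for i in range(n) for j in range(i + 1, n)):
            return sample  # accept the first physically valid sample
    return None

# Two unit discs whose estimated centers are only 1.5 apart (they interpenetrate).
scene = sample_non_penetrating([(0.0, 0.0), (1.5, 0.0)], radii=[1.0, 1.0])
```

Plain rejection sampling like this scales poorly as the number of objects grows, which is presumably why the abstract describes using an inferred contact graph to guide the samples.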

[220] DICE: Disentangling Artist Style from Content via Contrastive Subspace Decomposition in Diffusion Models

Tong Zhang,Ru Zhang,Jianyi Liu

Main category: cs.CV

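TL;DR: This paper proposes DICE, a training-free framework that erases artist style from diffusion models on the fly. It disentangles style from content through contrastive subspace decomposition and suppresses style mimicry efficiently while keeping the content intact.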

Details Motivation: The spread of diffusion models has made unauthorized imitation of artistic styles common, creating copyright and intellectual-property risks; existing countermeasures either need costly weight editing or depend on an explicitly specified style, which makes them hard to deploy. Method: DICE (Disentanglement of artist Style from Content via Contrastive Subspace Decomposition) builds contrastive triplets to separate style from non-style features in the latent space; formulates the disentanglement as a generalized eigenvalue problem to locate the style subspace precisely; and introduces an Adaptive Attention Decoupling Editing strategy that dynamically assesses the style concentration of each token and applies differential suppression and content enhancement to the QKV vectors. Result: Experiments show that DICE strikes a better balance between thorough style erasure and content fidelity, and disentangles style with only 3 seconds of additional overhead. Conclusion: DICE is a training-free, deployment-friendly style purification approach that curbs style mimicry in diffusion models while balancing safety and practicality. Abstract: The recent proliferation of diffusion models has made style mimicry effortless, enabling users to imitate unique artistic styles without authorization. In deployed platforms, this raises copyright and intellectual-property risks and calls for reliable protection. However, existing countermeasures either require costly weight editing as new styles emerge or rely on an explicitly specified editing style, limiting their practicality for deployment-side safety. To address this challenge, we propose DICE (Disentanglement of artist Style from Content via Contrastive Subspace Decomposition), a training-free framework for on-the-fly artist style erasure. Unlike style editing, which requires an explicitly specified replacement style, DICE performs style purification, removing the artist's characteristics while preserving the user-intended content. Our core insight is that a model cannot truly comprehend the artist style from a single text or image alone. Consequently, we abandon the traditional paradigm of identifying style from isolated samples. Instead, we construct contrastive triplets to compel the model to distinguish between style and non-style features in the latent space. By formalizing this disentanglement process as a solvable generalized eigenvalue problem, we achieve precise identification of the style subspace. Furthermore, we introduce an Adaptive Attention Decoupling Editing strategy that dynamically assesses the style concentration of each token and performs differential suppression and content enhancement on the QKV vectors. 
Extensive experiments demonstrate that DICE achieves a superior balance between the thoroughness of style erasure and the preservation of content integrity. DICE introduces an additional overhead of only 3 seconds to disentangle style, providing a practical and efficient technique for curbing style mimicry.

[221] ReRoPE: Repurposing RoPE for Relative Camera Control

Chunyang Li,Yuanbo Yang,Jiahao Shao,Hongyu Zhou,Katja Schwarz,Yiyi Liao

Main category: cs.CV

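TL;DR: This paper proposes ReRoPE, a framework that injects relative camera pose information into the underutilized low-frequency bands of the Rotary Positional Embeddings (RoPE) in pre-trained video diffusion models, enabling camera-controllable video generation without architectural changes or prohibitive training cost.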

Details Motivation: Existing methods encode camera poses relative to a fixed reference frame; such encodings lack shift-invariance, which leads to poor generalization and accumulated drift. Relative pose encodings between arbitrary view pairs are more robust but hard to integrate into pre-trained models at low cost. Method: ReRoPE injects relative camera pose information into the underutilized low-frequency spectral components of RoPE in pre-trained video diffusion models, acting as a plug-and-play module that requires neither architectural changes nor extensive retraining. Result: On image-to-video (I2V) and video-to-video (V2V) tasks, ReRoPE is evaluated for camera control accuracy and visual fidelity and delivers training-efficient, high-fidelity, controllable video generation. Conclusion: ReRoPE offers a lightweight, general, and efficient plug-and-play solution for controllable video generation that preserves pre-trained priors while providing precise camera control. Abstract: Video generation with controllable camera viewpoints is essential for applications such as interactive content creation, gaming, and simulation. Existing methods typically adapt pre-trained video models using camera poses relative to a fixed reference, e.g., the first frame. However, these encodings lack shift-invariance, often leading to poor generalization and accumulated drift. While relative camera pose embeddings defined between arbitrary view pairs offer a more robust alternative, integrating them into pre-trained video diffusion models without prohibitive training costs or architectural changes remains challenging. We introduce ReRoPE, a plug-and-play framework that incorporates relative camera information into pre-trained video diffusion models without compromising their generation capability. Our approach is based on the insight that Rotary Positional Embeddings (RoPE) in existing models underutilize their full spectral bandwidth, particularly in the low-frequency components. By seamlessly injecting relative camera pose information into these underutilized bands, ReRoPE achieves precise control while preserving strong pre-trained generative priors. We evaluate our method on both image-to-video (I2V) and video-to-video (V2V) tasks in terms of camera control accuracy and visual fidelity. Our results demonstrate that ReRoPE offers a training-efficient path toward controllable, high-fidelity video generation. See project page for more results: https://sisyphe-lee.github.io/ReRoPE/
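
As background for the ReRoPE entry, the following minimal sketch shows standard RoPE and the relative-position property it relies on: the attention score between a rotated query and key depends only on the difference of their positions. It does not show how ReRoPE maps camera poses onto the low-frequency bands, which the abstract does not specify; the function names and vectors are hypothetical.

```python
import math

def rope(vec, position, base=10000.0):
    """Apply standard rotary positional embedding to an even-length vector.
    Pair i is rotated by the angle position * base**(-2*i/d); a larger i
    means a lower rotation frequency."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]
score_a = dot(rope(q, 3), rope(k, 1))    # positions differ by 2
score_b = dot(rope(q, 10), rope(k, 8))   # same difference, both shifted by 7
```

The two scores agree up to floating-point error, which is the shift-invariance that fixed-reference camera encodings lack according to the abstract.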

[222] ViT-5: Vision Transformers for The Mid-2020s

Feng Wang,Sucheng Ren,Tiezheng Zhang,Predrag Neskovic,Anand Bhattad,Cihang Xie,Alan Yuille

Main category: cs.CV

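TL;DR: This paper proposes ViT-5, which systematically refines the components of the Vision Transformer (normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens) while keeping the standard Attention-FFN structure, and improves performance on both understanding and generation tasks.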

Details Motivation: Vision Transformer architectures have advanced considerably over the past five years, but the classic ViT backbone has not fully absorbed these improvements; the paper aims to integrate recent architectural techniques into ViT systematically and build a more advanced drop-in vision backbone. Method: While keeping the basic Attention-FFN structure, the key components (normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens) are updated one by one, forming a new generation of ViT called ViT-5. Result: ViT-5-Base reaches 84.2% top-1 accuracy on ImageNet-1k (versus 83.8% for DeiT-III-Base); in an SiT diffusion framework it lowers FID to 1.84 (versus 2.06 with a vanilla ViT); it also shows stronger representation learning, spatial reasoning, and cross-task transfer. Conclusion: ViT-5 is a simple, easily deployed ViT upgrade aligned with contemporary foundation-model practice, offering an effective drop-in replacement for mid-2020s vision backbones. Abstract: This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.

[223] VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval

Issar Tzachor,Dvir Samuel,Rami Ben-Ari

Main category: cs.CV

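TL;DR: This paper proposes a method that improves MLLM-based video-text retrieval using text-only alignment, without any visual supervision, and reaches SOTA results on multiple benchmarks.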

Details Motivation: Embeddings from generative Multimodal Large Language Models (MLLMs) still trail Video Foundation Models (VFMs) on video tasks; the paper explores how to use pre-trained MLLMs more effectively for video-text embedding and retrieval. Method: (1) A systematic layer-wise analysis shows that intermediate MLLM layers already encode rich task-relevant information; (2) intermediate-layer embeddings are combined with a calibrated MLLM head for zero-shot retrieval; (3) a lightweight text-based alignment strategy maps dense video captions to short summaries, so that video-text embeddings are learned from text supervision alone. Result: With text-only fine-tuning and no visual data, the method outperforms existing methods on common video retrieval benchmarks, often by a substantial margin, achieving state-of-the-art results. Conclusion: The intermediate layers of pre-trained MLLMs have strong representational power, and a text-guided alignment strategy unlocks their video understanding ability efficiently, giving high-performing video-text retrieval without visual fine-tuning. Abstract: Recent studies have adapted generative Multimodal Large Language Models (MLLMs) into embedding extractors for vision tasks, typically through fine-tuning to produce universal representations. However, their performance on video remains inferior to Video Foundation Models (VFMs). In this paper, we focus on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. Leveraging this insight, we demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training. Building on these findings, we introduce a lightweight text-based alignment strategy which maps dense video captions to short summaries and enables task-related video-text embedding learning without visual supervision. Remarkably, without any fine-tuning beyond text, our method outperforms current methods, often by a substantial margin, achieving state-of-the-art results across common video retrieval benchmarks.

[224] MMLSv2: A Multimodal Dataset for Martian Landslide Detection in Remote Sensing Imagery

Sidike Paheding,Abel Reyes-Angulo,Leo Thomas Ramos,Angel D. Sappa,Rajaneesh A.,Hiral P. B.,Sajin Kumar K. S.,Thomas Oommen

Main category: cs.CV

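TL;DR: This paper introduces MMLSv2, a dataset for landslide segmentation on the Martian surface with 664 multimodal images (RGB, DEM, slope, thermal inertia, and grayscale channels), plus a geographically disjoint test set of 276 images for assessing spatial generalization. Experiments show that the dataset supports stable training but remains challenging for fragmented, elongated, and small-scale landslide regions, and that performance drops noticeably on the isolated test set, underlining its value for evaluating robustness and generalization.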

Details Motivation: Existing Martian landslide segmentation datasets fall short in modality diversity, scale, and generalization evaluation, so a more comprehensive and challenging benchmark is needed. Method: The multimodal MMLSv2 dataset covers seven image bands and is split into training, validation, and test sets, with an additional geographically isolated test set; several mainstream segmentation models are benchmarked and analyzed on it. Result: Multiple segmentation models train stably and reach competitive performance on MMLSv2 but remain limited on fragmented, elongated, and small-scale landslide regions; performance drops clearly on the isolated test set, confirming its usefulness for assessing spatial generalization. Conclusion: MMLSv2 is a high-quality, multimodal benchmark for Martian landslide segmentation with an explicit generalization test, and should help advance semantic segmentation of planetary remote sensing imagery and research on model robustness. Abstract: We present MMLSv2, a dataset for landslide segmentation on Martian surfaces. MMLSv2 consists of multimodal imagery with seven bands: RGB, digital elevation model, slope, thermal inertia, and grayscale channels. MMLSv2 comprises 664 images distributed across training, validation, and test splits. In addition, an isolated test set of 276 images from a geographically disjoint region from the base dataset is released to evaluate spatial generalization. Experiments conducted with multiple segmentation models show that the dataset supports stable training and achieves competitive performance, while still posing challenges in fragmented, elongated, and small-scale landslide regions. Evaluation on the isolated test set leads to a noticeable performance drop, indicating increased difficulty and highlighting its value for assessing model robustness and generalization beyond standard in-distribution settings. The dataset will be available at: https://github.com/MAIN-Lab/MMLS_v2

[225] Building Damage Detection using Satellite Images and Patch-Based Transformer Methods

Smriti Siva,Jan Cross-Zamirski

Main category: cs.CV

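TL;DR: This paper studies multi-class building damage classification with Vision Transformer (ViT) models (DINOv2-small and DeiT) on noisy, class-imbalanced satellite imagery. It proposes a patch-based pre-processing pipeline and a frozen-head fine-tuning strategy, and reaches macro-averaged F1 scores on the xBD dataset that are competitive with CNN baselines.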

Details Motivation: Rapid building damage assessment is critical for post-disaster response, but label noise and severe class imbalance in satellite imagery make model training difficult. Method: DINOv2-small and DeiT are evaluated with a patch-level pre-processing pipeline that isolates structural features and reduces background noise, together with a frozen-head fine-tuning strategy; metrics are accuracy, precision, recall, and macro-averaged F1. Result: Small ViT models trained with the proposed method achieve macro-averaged F1 on the xBD dataset that is competitive with prior CNN baselines. Conclusion: ViT architectures are feasible and promising for classifying disaster imagery under noise and imbalance; with suitable pre-processing and fine-tuning they can reach practical performance, particularly when compute is limited. Abstract: Rapid building damage assessment is critical for post-disaster response. Damage classification models built on satellite imagery provide a scalable means of obtaining situational awareness. However, label noise and severe class imbalance in satellite data create major challenges. The xBD dataset offers a standardized benchmark for building-level damage across diverse geographic regions. In this study, we evaluate Vision Transformer (ViT) model performance on the xBD dataset, specifically investigating how these models distinguish between types of structural damage when training on noisy, imbalanced data. Specifically, we evaluate DINOv2-small and DeiT for multi-class damage classification. We propose a targeted patch-based pre-processing pipeline to isolate structural features and minimize background noise in training. We adopt a frozen-head fine-tuning strategy to keep computational requirements manageable. Model performance is evaluated through accuracy, precision, recall, and macro-averaged F1 scores. We show that small ViT architectures with our novel training method achieve competitive macro-averaged F1 relative to prior CNN baselines for disaster classification.
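
Because the entry above reports macro-averaged F1 under severe class imbalance, a short worked example may help show why that metric is preferred over accuracy in this setting. The sketch below is a generic pure-Python computation, not code from the paper; the labels and class meanings are hypothetical.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels: 0 = undamaged (majority), 1 = destroyed (minority).
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]  # a classifier that always predicts the majority class
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, classes=[0, 1])
```

Here accuracy is 0.8 while macro F1 is 4/9 (about 0.44): the majority class scores 8/9 and the missed minority class scores 0, so the average exposes the failure that accuracy hides.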

[226] PEGAsus: 3D Personalization of Geometry and Appearance

Jingyu Hu,Bin Hu,Ka-Hei Hui,Haipeng Li,Zhengzhe Liu,Daniel Cohen-Or,Chi-Wing Fu

Main category: cs.CV

TL;DR: PEGAsus is a new framework that generates personalized 3D shapes by learning shape concepts at both the geometry and appearance levels.

Details Motivation: Existing methods have difficulty flexibly extracting and composing the geometric and appearance attributes of 3D shapes across categories to support fine-grained, personalized generation. Method: 3D shape personalization is formulated as extracting category-agnostic geometric and appearance attributes from reference shapes and composing them with text to generate new shapes; a progressive optimization strategy decouples the learning of geometry and appearance concepts; the approach is extended to region-wise concept learning with context-aware and context-free losses. Result: Attributes are extracted effectively from a wide range of reference shapes and composed flexibly with text to generate new shapes; the method outperforms existing state-of-the-art approaches in both quantitative and qualitative experiments. Conclusion: PEGAsus enables finer-grained and more flexible personalized 3D shape generation and performs especially well in cross-category scenarios. Abstract: We present PEGAsus, a new framework capable of generating Personalized 3D shapes by learning shape concepts at both Geometry and Appearance levels. First, we formulate 3D shape personalization as extracting reusable, category-agnostic geometric and appearance attributes from reference shapes, and composing these attributes with text to generate novel shapes. Second, we design a progressive optimization strategy to learn shape concepts at both the geometry and appearance levels, decoupling the shape concept learning process. Third, we extend our approach to region-wise concept learning, enabling flexible concept extraction, with context-aware and context-free losses. Extensive experimental results show that PEGAsus is able to effectively extract attributes from a wide range of reference shapes and then flexibly compose these concepts with text to synthesize new shapes. This enables fine-grained control over shape generation and supports the creation of diverse, personalized results, even in challenging cross-category scenarios. Both quantitative and qualitative experiments demonstrate that our approach outperforms existing state-of-the-art solutions.

[227] Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval

Jing Zhang,Zhikai Li,Xuewen Liu,Qingyi Gu

Main category: cs.CV

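TL;DR: This paper proposes Efficient-SAM2, which adds object-aware Sparse Window Routing (SWR) and Sparse Memory Retrieval (SMR) to speed up SAM2 inference for video segmentation substantially while keeping accuracy high.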

Details Motivation: SAM2 performs very well in video object segmentation but is computationally expensive; existing acceleration work mostly retrains a lightweight backbone and leaves post-training acceleration largely unexplored. Method: Based on the observation that SAM2 perceives sparsely, two lightweight post-training acceleration mechanisms are proposed: (1) object-aware Sparse Window Routing (SWR) for the image encoder, which routes background regions to a lightweight shortcut branch; (2) object-aware Sparse Memory Retrieval (SMR) for memory attention, which lets only the salient memory tokens take part in computation and reuses the saliency pattern from their first recollection. Result: Efficient-SAM2 delivers a 1.68x speedup on the SAM2.1-L model with only a 1.0% accuracy drop on the SA-V test set, with negligible additional parameters and minimal training overhead. Conclusion: Exploiting a sparsity prior inspired by biological vision accelerates SAM2 efficiently without retraining a new backbone, offering a practical route to real-time video segmentation. Abstract: Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, the heavy computational burden hinders its application in real-time video processing. Although there have been efforts to improve the efficiency of SAM2, most of them focus on retraining a lightweight backbone, with little exploration into post-training acceleration. In this paper, we observe that SAM2 exhibits a sparse perception pattern, as biological vision does, which provides opportunities for eliminating redundant computation and acceleration: i) In the mask decoder, the attention primarily focuses on the foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, which results in unnecessary computation on background regions. ii) In the memory bank, only a small subset of tokens in each frame contribute significantly to memory attention, and the salient regions exhibit temporal consistency, making full-token computation redundant. With these insights, we propose Efficient-SAM2, which promotes SAM2 to adaptively focus on object regions while eliminating task-irrelevant computations, thereby significantly improving inference efficiency. Specifically, for the image encoder, we propose object-aware Sparse Window Routing (SWR), a window-level computation allocation mechanism that leverages the consistency and saliency cues from the previous-frame decoder to route background regions into a lightweight shortcut branch. 
Moreover, for memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which allows only the salient memory tokens in each frame to participate in computation, with the saliency pattern reused from their first recollection. With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers a 1.68x speedup on the SAM2.1-L model with only a 1.0% accuracy drop on the SA-V test set.
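
The idea behind Sparse Memory Retrieval can be illustrated with a small sketch: pick the most salient tokens of a memory frame once, then reuse that selection on later recalls. This is not the authors' implementation; the class name `SparseMemoryFrame`, the helper `select_salient_tokens`, the top-k selection rule, and the `keep_ratio` parameter are all assumptions made for illustration.

```python
def select_salient_tokens(attention_scores, keep_ratio):
    """Return the sorted indices of the highest-scoring tokens.

    keep_ratio is a hypothetical knob: the fraction of tokens to keep.
    """
    k = max(1, int(len(attention_scores) * keep_ratio))
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: attention_scores[i],
                    reverse=True)
    return sorted(ranked[:k])


class SparseMemoryFrame:
    """One memory frame that caches its saliency pattern on first recall."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.salient_idx = None  # filled in on the first retrieval

    def retrieve(self, attention_scores, keep_ratio=0.25):
        # The saliency pattern is computed once and reused afterwards,
        # so later calls ignore the scores they are given.
        if self.salient_idx is None:
            self.salient_idx = select_salient_tokens(attention_scores, keep_ratio)
        return [self.tokens[i] for i in self.salient_idx]


frame = SparseMemoryFrame(["a", "b", "c", "d"])
first = frame.retrieve([0.1, 0.7, 0.05, 0.15], keep_ratio=0.5)
second = frame.retrieve([0.9, 0.0, 0.0, 0.0], keep_ratio=0.5)
print(first, second)
```

Caching the indices is what removes the per-frame scoring cost; whether a fixed ratio or a score threshold works better is left open here.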

[228] Generating Adversarial Events: A Motion-Aware Point Cloud Framework

Hongwei Ren,Youxin Jiang,Qifei Gu,Xiangqian Wu

Main category: cs.CV

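TL;DR: This paper proposes MA-ADV, the first framework to generate adversarial examples for event cameras using point cloud representations. Through diffusion-based smoothing, spatio-temporal modeling, and an optimization strategy, it achieves attacks with a high success rate and low perturbation, revealing serious security risks in event-based perception systems.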

Details Motivation: Event cameras are widely used in safety-critical domains such as autonomous driving, but the deep neural networks they rely on are vulnerable to adversarial examples; mainstream event representations are non-differentiable, which makes gradient-based attacks hard to transfer and has left the topic under-studied. Method: The Motion-Aware Adversarial (MA-ADV) framework 1) maps events to point clouds to support differentiable optimization; 2) applies a diffusion-based approach to smooth high-frequency noise in the perturbations; 3) models the spatial and temporal relationships among events; 4) combines sample-wise Adam optimization, iterative refinement, and binary search to find the minimal-cost perturbation. Result: MA-ADV achieves a 100% attack success rate with the lowest perturbation cost across multiple benchmarks and shows strong robustness against mainstream defenses. Conclusion: MA-ADV is the first to show that adversarial examples can be constructed efficiently for event data, highlighting the serious security challenges facing event-based perception systems and the need for corresponding defenses. Abstract: Event cameras have been widely adopted in safety-critical domains such as autonomous driving, robotics, and human-computer interaction. A pressing challenge arises from the vulnerability of deep neural networks to adversarial examples, which poses a significant threat to the reliability of event-based systems. Nevertheless, research into adversarial attacks on events is scarce. This is primarily due to the non-differentiable nature of mainstream event representations, which hinders the extension of gradient-based attack methods. In this paper, we propose MA-ADV, a novel Motion-Aware Adversarial framework. To the best of our knowledge, this is the first work to generate adversarial events by leveraging point cloud representations. MA-ADV accounts for high-frequency noise in events and employs a diffusion-based approach to smooth perturbations, while fully leveraging the spatial and temporal relationships among events. Finally, MA-ADV identifies the minimal-cost perturbation through a combination of sample-wise Adam optimization, iterative refinement, and binary search. Extensive experimental results validate that MA-ADV ensures a 100% attack success rate with minimal perturbation cost, and also demonstrate enhanced robustness against defenses, underscoring the critical security challenges facing future event-based perception systems.
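
The binary-search step mentioned in the abstract can be sketched in isolation. The snippet below is a generic illustration, not MA-ADV itself: `attack_succeeds` is a hypothetical predicate standing in for "the classifier is fooled at this perturbation scale", and the search assumes success is monotone in the scale.

```python
def minimal_successful_scale(attack_succeeds, lo=0.0, hi=1.0, steps=20):
    """Binary-search the smallest scale in [lo, hi] where the attack succeeds.

    Assumes monotonicity: if the attack succeeds at scale s, it also
    succeeds at every larger scale. Returns None if it fails even at hi.
    """
    if not attack_succeeds(hi):
        return None
    for _ in range(steps):
        mid = (lo + hi) / 2
        if attack_succeeds(mid):
            hi = mid  # success: try a smaller perturbation
        else:
            lo = mid  # failure: a larger perturbation is needed
    return hi


# Toy stand-in for a model that is fooled once the scale reaches 0.3.
scale = minimal_successful_scale(lambda s: s >= 0.3)
print(round(scale, 4))
```

In a real attack each call to the predicate would involve running the optimizer and the victim model, so the number of search steps trades precision against cost.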

[229] Moving Beyond Functional Connectivity: Time-Series Modeling for fMRI-Based Brain Disorder Classification

Guoqi Yu,Xiaowei Hu,Angelica I. Aviles-Rivero,Anqi Qiu,Shujun Wang

Main category: cs.CV

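TL;DR: This paper proposes DeCI, a framework that models raw fMRI BOLD time series through cycle-drift decomposition and channel independence. It outperforms traditional functional connectivity methods on multiple datasets, showing that end-to-end temporal modeling is more effective.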

Details Motivation: Existing fMRI classification methods mostly rely on static functional connectivity (FC), which discards the temporal dynamics and nonlinear relationships in BOLD signals, so methods that model the raw time series directly are needed. Method: Several state-of-the-art time-series models (such as PatchTST, TimesNet, and TimeMixer) are benchmarked; the proposed DeCI framework combines ROI-level cycle and drift decomposition with separate modeling of each ROI (Channel-Independence). Result: On five public datasets, DeCI clearly outperforms FC-based methods and the temporal baselines, with higher classification accuracy and better generalization. Conclusion: End-to-end temporal modeling of raw BOLD signals captures complex brain dynamics better than traditional FC methods and should become a new paradigm for fMRI analysis. Abstract: Functional magnetic resonance imaging (fMRI) enables non-invasive brain disorder classification by capturing blood-oxygen-level-dependent (BOLD) signals. However, most existing methods rely on functional connectivity (FC) via Pearson correlation, which reduces 4D BOLD signals to static 2D matrices, discarding temporal dynamics and capturing only linear inter-regional relationships. In this work, we benchmark state-of-the-art temporal models (e.g., time-series models such as PatchTST, TimesNet, and TimeMixer) on raw BOLD signals across five public datasets. Results show these models consistently outperform traditional FC-based approaches, highlighting the value of directly modeling temporal information such as cycle-like oscillatory fluctuations and drift-like slow baseline trends. Building on this insight, we propose DeCI, a simple yet effective framework that integrates two key principles: (i) Cycle and Drift Decomposition to disentangle cycle and drift within each ROI (Region of Interest); and (ii) Channel-Independence to model each ROI separately, improving robustness and reducing overfitting. Extensive experiments demonstrate that DeCI achieves superior classification accuracy and generalization compared to both FC-based and temporal baselines. Our findings advocate for a shift toward end-to-end temporal modeling in fMRI analysis to better capture complex brain dynamics. The code is available at https://github.com/Levi-Ackman/DeCI.
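
To make the two principles concrete, here is a minimal sketch of a cycle/drift split applied to each ROI separately. The abstract does not say how DeCI performs the decomposition, so the moving-average filter, the `window` parameter, and the function names below are assumptions rather than the method in the linked repository.

```python
def moving_average(series, window):
    """Centered moving average (use an odd window); edges are padded
    by repeating the first and last values."""
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose_roi(series, window=5):
    """Split one ROI signal into (cycle, drift).

    drift: slow baseline trend estimated by smoothing.
    cycle: what remains after removing the drift.
    """
    drift = moving_average(series, window)
    cycle = [x - d for x, d in zip(series, drift)]
    return cycle, drift


def decompose_all(bold, window=5):
    """Channel independence: each ROI time series is processed on its own,
    with no information shared across ROIs."""
    return [decompose_roi(roi, window) for roi in bold]


bold = [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 2.0, 2.0, 2.0, 2.0]]
for cycle, drift in decompose_all(bold):
    print(cycle, drift)
```

By construction `cycle[i] + drift[i]` reproduces the input sample, so no information is lost by the split; a learned or frequency-based decomposition could be swapped in without changing the per-ROI structure.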

[230] PISCO: Precise Video Instance Insertion with Sparse Control

Xiangbo Gao,Renjie Li,Xinghao Chen,Yuheng Wu,Suofei Feng,Qing Yin,Zhengzhong Tu

Main category: cs.CV

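TL;DR: This paper proposes PISCO, a video diffusion model that supports arbitrary sparse keyframe control for precise video instance insertion, addressing precise spatial-temporal placement, physically consistent scene interaction, and preservation of the original dynamics.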

Details Motivation: Professional AI-assisted filmmaking requires precise, targeted modifications, whereas existing approaches depend on heavy prompt engineering and result selection and lack fine-grained control and high-fidelity post-processing. Method: PISCO is a video diffusion model that introduces Variable-Information Guidance, Distribution-Preserving Temporal Masking, and geometry-aware conditioning; the accompanying benchmark PISCO-Bench provides verified annotations and clean backgrounds, and evaluation uses both reference-based and reference-free perceptual metrics. Result: Under sparse control, PISCO consistently outperforms strong inpainting and video editing baselines and shows clear, monotonic improvement as more control signals are added. Conclusion: PISCO achieves high-precision video instance insertion with low user effort, moving AI video generation toward controllable, high-fidelity output. Abstract: The landscape of AI video generation is undergoing a pivotal shift: moving beyond general generation - which relies on exhaustive prompt-engineering and "cherry-picking" - towards fine-grained, controllable generation and high-fidelity post-processing. In professional AI-assisted filmmaking, it is crucial to perform precise, targeted modifications. A cornerstone of this transition is video instance insertion, which requires inserting a specific instance into existing footage while maintaining scene integrity. Unlike traditional video editing, this task imposes several requirements: precise spatial-temporal placement, physically consistent scene interaction, and the faithful preservation of original dynamics - all achieved under minimal user effort. In this paper, we propose PISCO, a video diffusion model for precise video instance insertion with arbitrary sparse keyframe control. PISCO allows users to specify a single keyframe, start-and-end keyframes, or sparse keyframes at arbitrary timestamps, and automatically propagates object appearance, motion, and interaction. To address the severe distribution shift induced by sparse conditioning in pretrained video diffusion models, we introduce Variable-Information Guidance for robust conditioning and Distribution-Preserving Temporal Masking to stabilize temporal generation, together with geometry-aware conditioning for realistic scene adaptation. 
We further construct PISCO-Bench, a benchmark with verified instance annotations and paired clean background videos, and evaluate performance using both reference-based and reference-free perceptual metrics. Experiments demonstrate that PISCO consistently outperforms strong inpainting and video editing baselines under sparse control, and exhibits clear, monotonic performance improvements as additional control signals are provided. Project page: xiangbogaobarry.github.io/PISCO.

[231] Tighnari v2: Mitigating Label Noise and Distribution Shift in Multimodal Plant Distribution Prediction via Mixture of Experts and Weakly Supervised Learning

Haixu Liu,Yufei Wang,Tianxiang Xu,Chuancheng Shi,Hongsheng Xing

Main category: cs.CV

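TL;DR: This paper proposes a multimodal fusion framework that combines Presence-Absence (PA) and Presence-Only (PO) plant distribution data. Through geographically aligned pseudo-label aggregation, a tri-modal cross-attention mechanism, and a mixture-of-experts inference strategy based on spatial proximity, it markedly improves large-scale, cross-species plant distribution prediction, especially when PA data are sparse and the geographic distribution shift is pronounced.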

Details Motivation: Plant distribution prediction suffers from scarce and costly PA data and from severe negative-sample noise in PO data; training on a mix of the two tends to degrade performance because of label noise, and there is also a pronounced geographic distribution shift between training and test samples. Method: The multimodal fusion framework 1) aggregates PO pseudo-labels according to the geographic coverage of satellite imagery to achieve geographic alignment; 2) uses Swin Transformer Base for remote sensing imagery, TabM for tabular features, and Temporal Swin Transformer for time-series modeling; 3) fuses the heterogeneous modalities with a stackable serial tri-modal cross-attention mechanism; 4) follows the mixture-of-experts paradigm by partitioning test samples according to their spatial proximity to PA samples and assigning a dedicated model to each partition for inference and post-processing. Result: On the GeoLifeCLEF 2025 dataset, the method outperforms existing approaches in scenarios with limited PA coverage and pronounced distribution shift. Conclusion: Fusing PA and PO data requires attention to geographic alignment, cross-modal cooperation, and robustness to noise; the proposed pseudo-label strategy, tri-modal fusion architecture, and spatially adaptive expert inference offer an effective approach to large-scale, cross-species plant distribution modeling. Abstract: Large-scale, cross-species plant distribution prediction plays a crucial role in biodiversity conservation, yet modeling efforts in this area still face significant challenges due to the sparsity and bias of observational data. Presence-Absence (PA) data provide accurate and noise-free labels, but are costly to obtain and limited in quantity; Presence-Only (PO) data, by contrast, offer broad spatial coverage and rich spatiotemporal distribution, but suffer from severe label noise in negative samples. To address these real-world constraints, this paper proposes a multimodal fusion framework that fully leverages the strengths of both PA and PO data. We introduce an innovative pseudo-label aggregation strategy for PO data based on the geographic coverage of satellite imagery, enabling geographic alignment between the label space and remote sensing feature space. In terms of model architecture, we adopt Swin Transformer Base as the backbone for satellite imagery, utilize the TabM network for tabular feature extraction, retain the Temporal Swin Transformer for time-series modeling, and employ a stackable serial tri-modal cross-attention mechanism to optimize the fusion of heterogeneous modalities. Furthermore, empirical analysis reveals significant geographic distribution shifts between PA training and test samples, and models trained by directly mixing PO and PA data tend to experience performance degradation due to label noise in PO data. 
To address this, we draw on the mixture-of-experts paradigm: test samples are partitioned according to their spatial proximity to PA samples, and different models trained on distinct datasets are used for inference and post-processing within each partition. Experiments on the GeoLifeCLEF 2025 dataset demonstrate that our approach achieves superior predictive performance in scenarios with limited PA coverage and pronounced distribution shifts.

[232] CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment

Yunzuo Hu,Wen Li,Jing Zhang

Main category: cs.CV

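TL;DR: This paper proposes the CAE-AV framework, which uses two modules, CASTE and CASE, together with lightweight objectives to mitigate audio-visual modality misalignment, reaching state-of-the-art performance on multiple benchmarks.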

Details Motivation: Audio-visual learning is affected by modality misalignment caused by off-screen sound sources and background clutter; existing methods tend to amplify irrelevant regions or moments, causing unstable training and degraded representation quality. Method: The Caption-aligned and Agreement-guided Enhancement (CAE-AV) framework comprises two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), and adds lightweight objectives: caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization. Result: With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA audio-visual benchmarks; qualitative analysis confirms its robustness to audio-visual misalignment. Conclusion: Through dynamic spatio-temporal balancing and semantically guided enrichment, CAE-AV effectively mitigates audio-visual misalignment and improves cross-modal representation quality and model stability. Abstract: Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules: Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE) to relieve audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
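
For readers unfamiliar with the InfoNCE objective named above, the following sketch computes the standard loss for a single anchor. How CAE-AV builds its caption-to-modality pairs is not described in the abstract, so the similarity values, the `temperature` default, and the function name are illustrative assumptions.

```python
import math


def info_nce(similarities, positive_index, temperature=0.07):
    """Standard InfoNCE loss for one anchor.

    similarities: similarity of the anchor to each candidate
                  (one positive, the rest negatives).
    Returns -log(softmax(similarities / temperature)[positive_index]).
    """
    logits = [s / temperature for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_index]


# Hypothetical caption-to-audio similarities; index 0 is the matching clip.
aligned = info_nce([0.9, 0.1, 0.0], positive_index=0)
misaligned = info_nce([0.9, 0.1, 0.0], positive_index=1)
print(aligned < misaligned)
```

The loss is small when the positive pair has the highest similarity and grows when a negative outranks it, which is what pushes matched caption and modality features together.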

[233] Language-Guided Transformer Tokenizer for Human Motion Generation

Sheng Yan,Yong Wang,Xin Du,Junsong Yuan,Mengyuan Liu

Main category: cs.CV

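TL;DR: This paper proposes a language-guided motion tokenization method (LG-Tok) that uses a Transformer architecture to align language and motion at the tokenization stage, improving reconstruction quality while reducing generation complexity. It also introduces a language-drop scheme to support language-free guidance, and it surpasses the state of the art on multiple benchmarks.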

Details Motivation: Existing motion tokenization methods improve reconstruction quality by increasing the number of tokens, which makes generative models harder to train; in addition, the prevailing convolutional architectures struggle to support global language guidance. Method: Language-Guided Tokenization (LG-Tok) 1) uses a Transformer-based tokenizer whose attention mechanism aligns language with motion; 2) introduces a language-drop training scheme so that the detokenizer supports language-free guidance during generation; 3) produces semantically compact, high-level motion tokens. Result: On HumanML3D and Motion-X, Top-1 reaches 0.542/0.582 (versus 0.500/0.528 for MARDM) and FID reaches 0.057/0.088 (versus 0.114/0.147); the lightweight LG-Tok-mini uses only half the tokens and still attains Top-1 of 0.521/0.588 and FID of 0.085/0.071. Conclusion: Language-guided tokenization combines high reconstruction quality with low generation complexity, and the Transformer architecture and the language-drop scheme are key to obtaining efficient semantic motion representations. Abstract: In this paper, we focus on motion discrete tokenization, which converts raw motion into compact discrete tokens--a process proven crucial for efficient motion generation. In this paradigm, increasing the number of tokens is a common approach to improving motion reconstruction quality, but more tokens make it more difficult for generative models to learn. To maintain high reconstruction quality while reducing generation complexity, we propose leveraging language to achieve efficient motion tokenization, which we term Language-Guided Tokenization (LG-Tok). LG-Tok aligns natural language with motion at the tokenization stage, yielding compact, high-level semantic representations. This approach not only strengthens both tokenization and detokenization but also simplifies the learning of generative models. Furthermore, existing tokenizers predominantly adopt convolutional architectures, whose local receptive fields struggle to support global language guidance. To this end, we propose a Transformer-based Tokenizer that leverages attention mechanisms to enable effective alignment between language and motion. Additionally, we design a language-drop scheme, in which language conditions are randomly removed during training, enabling the detokenizer to support language-free guidance during generation. 
On the HumanML3D and Motion-X generation benchmarks, LG-Tok achieves Top-1 scores of 0.542 and 0.582, outperforming the state-of-the-art MARDM (0.500 and 0.528), and FID scores of 0.057 and 0.088, versus 0.114 and 0.147 for MARDM. LG-Tok-mini uses only half the tokens while maintaining competitive performance (Top-1: 0.521/0.588, FID: 0.085/0.071), validating the efficiency of our semantic representations.
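
The language-drop scheme described in the abstract (randomly removing language conditions during training) can be sketched as a small batch-level helper. The function name, the `drop_prob` parameter, and the use of `None` to mark a removed condition are assumptions for illustration; the abstract does not specify how LG-Tok represents a dropped condition.

```python
import random


def apply_language_drop(text_conditions, drop_prob, rng):
    """Randomly remove text conditions from a training batch.

    Each caption is replaced by None with probability drop_prob, so the
    downstream module also sees unconditioned samples during training.
    """
    return [None if rng.random() < drop_prob else text
            for text in text_conditions]


rng = random.Random(0)  # seeded generator for reproducible batches
batch = ["a person walks forward", "a person jumps", "a person waves"]
print(apply_language_drop(batch, drop_prob=0.5, rng=rng))
```

Passing the random generator explicitly keeps the drop pattern reproducible across runs, which is convenient when comparing different drop probabilities.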

[234] UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science

Jie Zhang,Xingtong Yu,Yuan Fang,Rudi Stouffs,Zdravko Trivic

Main category: cs.CV

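TL;DR: This paper introduces the UGData dataset, the UGE training method, and the UGBench benchmark, which use explicit spatial alignment to improve the transferability of urban multimodal embeddings and deliver substantial gains on tasks such as image retrieval and geolocation ranking.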

Details Motivation: Existing urban understanding datasets lack explicit alignment between street-view images and urban spatial structure, which makes them ill-suited to spatially intensive tasks. Method: The authors build the spatially grounded UGData dataset, design the two-stage UGE training strategy (combining instruction-guided contrastive learning with graph-based spatial encoding), and develop the UGBench benchmark covering several urban understanding tasks. Result: UGE built on Qwen2.5-VL-7B improves image retrieval and geolocation ranking by up to 44% and 30% on training cities, and by over 30% and 22% on held-out cities. Conclusion: Explicit spatial alignment substantially strengthens the generalization and spatial reasoning of multimodal models on urban understanding tasks. Abstract: Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. We finally introduce UGBench, a comprehensive benchmark to evaluate how spatially grounded embeddings support diverse urban understanding tasks -- including geolocation ranking, image retrieval, urban perception, and spatial grounding. We develop UGE on multiple state-of-the-art VLM backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning. UGE built upon the Qwen2.5-VL-7B backbone achieves up to 44% improvement in image retrieval and 30% in geolocation ranking on training cities, and over 30% and 22% gains respectively on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.

[235] What, Whether and How? Unveiling Process Reward Models for Thinking with Images Reasoning

Yujin Zhou,Pengcheng Wen,Jiale Chen,Boqin Yin,Han Zhu,Jiaming Ji,Juntao Dai,Chi-Min Chan,Sirui Han

Main category: cs.CV

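TL;DR: This paper presents the first comprehensive benchmark designed specifically to evaluate Process Reward Models (PRMs) under the "thinking with images" paradigm. It defines 7 fine-grained error types, builds a dataset of 1,206 manually annotated reasoning trajectories, and exposes the limitations of current Large Vision Language Models (LVLMs) when used as PRMs.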

Details Motivation: Existing PRM benchmarks are text-centric and do not comprehensively assess reasoning under the "thinking with images" paradigm, in which dynamic image editing and re-encoding readily introduce diverse errors; dedicated PRMs and a matching benchmark are therefore needed. Method: By analyzing reasoning trajectories and running PRM-guided search experiments, the authors define 7 fine-grained error types; they build a benchmark of 1,206 manually annotated "thinking with images" reasoning trajectories spanning 4 categories and 16 subcategories; and they systematically evaluate mainstream LVLMs as PRMs. Result: Current LVLMs perform poorly as PRMs: their ability to evaluate visual reasoning processes is limited, performance differs markedly across error types, they show a positive evaluation bias, and they are sensitive to the position of reasoning steps. Conclusion: The benchmark effectively exposes the shortcomings of LVLMs on the PRM task and lays a foundation for further development of PRMs in LVLMs. Abstract: The rapid advancement of Large Vision Language Models (LVLMs) has demonstrated excellent abilities in various visual tasks. Building upon these developments, the thinking with images paradigm has emerged, enabling models to dynamically edit and re-encode visual information at each reasoning step, mirroring human visual processing. However, this paradigm introduces significant challenges as diverse errors may occur during reasoning processes. This necessitates Process Reward Models (PRMs) for distinguishing positive and negative reasoning steps, yet existing benchmarks for PRMs are predominantly text-centric and lack comprehensive assessment under this paradigm. To address these gaps, this work introduces the first comprehensive benchmark specifically designed for evaluating PRMs under the thinking with images paradigm. Our main contributions are: (1) Through extensive analysis of reasoning trajectories and guided search experiments with PRMs, we define 7 fine-grained error types and demonstrate both the necessity for specialized PRMs and the potential for improvement. (2) We construct a comprehensive benchmark comprising 1,206 manually annotated thinking with images reasoning trajectories spanning 4 categories and 16 subcategories for fine-grained evaluation of PRMs. (3) Our experimental analysis reveals that current LVLMs fall short as effective PRMs, exhibiting limited capabilities in visual reasoning process evaluation with significant performance disparities across error types, positive evaluation bias, and sensitivity to reasoning step positions. 
These findings demonstrate the effectiveness of our benchmark and establish crucial foundations for advancing PRMs in LVLMs.

[236] E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Xianjie Liu,Yiman Hu,Liang Wu,Ping Hu,Yixiong Zou,Jian Xu,Bo Zheng

Main category: cs.CV

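TL;DR: This paper proposes a multi-modal information density assessment framework for e-commerce short videos, builds E-VAds, the first benchmark for e-commerce short video understanding, and designs E-VAds-R1, a reinforcement-learning-based reasoning model that substantially improves commercial intent reasoning.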

Details Motivation: Existing video understanding models perform poorly on e-commerce short videos, which are goal-driven and dense in multi-modal signals, while mainstream benchmarks pay little attention to commercial intent reasoning. Method: The authors propose a multi-modal information density assessment framework; build E-VAds, the first benchmark for e-commerce short video understanding (3,961 videos and 19,785 Q&A pairs covering two dimensions: Perception, and Cognition and Reasoning); and develop the RL-based E-VAds-R1 model with a multi-grained reward mechanism, MG-GRPO. Result: E-VAds-R1 achieves a 109.2% performance gain on commercial intent reasoning with only a few hundred training samples. Conclusion: E-commerce short videos have higher multi-modal information density and call for dedicated benchmarks and reasoning models; E-VAds and E-VAds-R1 provide a new benchmark and an efficient method for this domain. Abstract: E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark (E-VAds), which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, namely Perception and Cognition and Reasoning, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.

[237] Geometric Image Editing via Effects-Sensitive In-Context Inpainting with Diffusion Transformers

Shuo Zhang,Wenzhuo Wu,Huayu Zhang,Jiarong Cheng,Xianghao Zang,Chao Ban,Hao Sun,Zhongjiang He,Tianwei Cao,Kongming Liang,Zhanyu Ma

Main category: cs.CV

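TL;DR: This paper proposes GeoEdit, a framework that uses a diffusion transformer module and an Effects-Sensitive Attention mechanism to address imprecise geometric transformations (translation, rotation, scaling) and inadequate lighting and shadow modeling in image editing. It also builds the RS-Objects dataset for training and outperforms existing methods on multiple metrics.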

Details Motivation: Existing image editing methods handle geometric transformations (such as translation, rotation, and scaling) and complex lighting and shadow effects with low precision and poor realism. Method: GeoEdit comprises a diffusion transformer module that controls geometric transformations and an Effects-Sensitive Attention mechanism that strengthens lighting and shadow modeling; the large-scale RS-Objects dataset (over 120,000 image pairs) is built for training. Result: On public benchmarks, GeoEdit clearly outperforms current state-of-the-art methods in visual quality, geometric editing accuracy, and realism. Conclusion: GeoEdit improves the precision and realism of geometric image editing and offers a practical tool for controllable editing in complex scenes. Abstract: Recent advances in diffusion models have significantly improved image editing. However, challenges persist in handling geometric transformations, such as translation, rotation, and scaling, particularly in complex scenes. Existing approaches suffer from two main limitations: (1) difficulty in achieving accurate geometric editing of object translation, rotation, and scaling; (2) inadequate modeling of intricate lighting and shadow effects, leading to unrealistic results. To address these issues, we propose GeoEdit, a framework that leverages in-context generation through a diffusion transformer module, which integrates geometric transformations for precise object edits. Moreover, we introduce Effects-Sensitive Attention, which enhances the modeling of intricate lighting and shadow effects for improved realism. To further support training, we construct RS-Objects, a large-scale geometric editing dataset containing over 120,000 high-quality image pairs, enabling the model to learn precise geometric editing while generating realistic lighting and shadows. Extensive experiments on public benchmarks demonstrate that GeoEdit consistently outperforms state-of-the-art methods in terms of visual quality, geometric accuracy, and realism.
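
The three geometric edits named in the abstract (translation, rotation, scaling) are commonly expressed as one 2D affine matrix in homogeneous coordinates. The sketch below shows only that textbook formulation; it says nothing about how GeoEdit encodes transformations internally, and the function names and the scale-rotate-translate ordering are assumptions.

```python
import math


def make_affine(tx=0.0, ty=0.0, angle_deg=0.0, scale=1.0):
    """3x3 homogeneous matrix that scales, then rotates about the origin,
    then translates by (tx, ty)."""
    angle = math.radians(angle_deg)
    c = math.cos(angle) * scale
    s = math.sin(angle) * scale
    return [[c, -s, tx],
            [s, c, ty],
            [0.0, 0.0, 1.0]]


def apply_affine(matrix, point):
    """Apply the affine matrix to a 2D point (x, y)."""
    x, y = point
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])


# Move an object corner: double its size, rotate 90 degrees, shift by (1, 2).
edit = make_affine(tx=1.0, ty=2.0, angle_deg=90.0, scale=2.0)
print(apply_affine(edit, (1.0, 0.0)))
```

Applying the same matrix to every corner of an object's bounding box gives the target placement that an editing model would then need to render with consistent lighting and shadows.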

[238] D²-VR: Degradation-Robust and Distilled Video Restoration with Synergistic Optimization Strategy

Jianfeng Liang,Shaocheng Shen,Botao Xu,Qiang Hu,Xiaoyun Zhang

Main category: cs.CV

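TL;DR: This paper proposes D²-VR, a video restoration framework built on a single-image diffusion model that uses degradation-robust flow alignment, adversarial distillation, and synergistic optimization to deliver high-quality restoration while accelerating sampling by 12×.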

Details Motivation: Video restoration methods that combine diffusion priors with temporal alignment offer high perceptual quality, but they suffer from high inference latency and temporal instability and struggle with complex real-world degradations. Method: The D²-VR framework 1) designs a Degradation-Robust Flow Alignment (DRFA) module that uses confidence-aware attention to filter unreliable motion cues; 2) applies adversarial distillation to compress the diffusion sampling trajectory into a few steps; 3) devises a synergistic optimization strategy that balances perceptual quality with strict temporal consistency. Result: Extensive experiments show state-of-the-art performance with a 12× faster sampling process. Conclusion: D²-VR balances the quality, efficiency, and temporal stability of video restoration and offers a feasible path to practical deployment. Abstract: The integration of diffusion priors with temporal alignment has emerged as a transformative paradigm for video restoration, delivering fantastic perceptual quality, yet the practical deployment of such frameworks is severely constrained by prohibitive inference latency and temporal instability when confronted with complex real-world degradations. To address these limitations, we propose D²-VR, a single-image diffusion-based video-restoration framework with low-step inference. To obtain precise temporal guidance under severe degradation, we first design a Degradation-Robust Flow Alignment (DRFA) module that leverages confidence-aware attention to filter unreliable motion cues. We then incorporate an adversarial distillation paradigm to compress the diffusion sampling trajectory into a rapid few-step regime. Finally, a synergistic optimization strategy is devised to harmonize perceptual quality with rigorous temporal consistency. Extensive experiments demonstrate that D²-VR achieves state-of-the-art performance while accelerating the sampling process by 12×.

[239] RealSynCol: a high-fidelity synthetic colon dataset for 3D reconstruction applications

Chiara Lena,Davide Milesi,Alessandro Casella,Luca Carlini,Joseph C. Norton,James Martin,Bruno Scaglioni,Keith L. Obstein,Roberto De Sire,Marco Spadaccini,Cesare Hassan,Pietro Valdastri,Elena De Momi

Main category: cs.CV

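TL;DR: This paper presents RealSynCol, a highly realistic synthetic colonoscopy dataset that addresses the limited robustness of deep learning models caused by the scarcity of real annotated colonoscopy data. The dataset is based on colon geometries extracted from 10 CT scans, rendered in a virtual environment, and comes with rich annotations including depth maps, optical flow, 3D meshes, and camera trajectories. Experiments show that RealSynCol markedly improves the generalization of depth and pose estimation models on clinical images.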

Details Motivation: The use of deep learning in colonoscopy is limited by the scarcity of large-scale real annotated data, so a high-quality synthetic alternative is needed. Method: Three-dimensional colon geometries are extracted from 10 CT scans and placed in a high-fidelity virtual endoscopic environment that simulates intraoperative conditions such as lighting and textures (for example, blood vessels); rendering produces 28,130 frames with paired multi-modal ground truth (depth maps, optical flow, 3D meshes, and camera trajectories). Result: In a benchmark on depth and pose estimation, RealSynCol yields markedly better generalization to real clinical images than other synthetic datasets. Conclusion: With its high realism and variability, RealSynCol is a powerful tool for developing deep learning algorithms that support colonoscopic diagnosis. Abstract: Deep learning has the potential to improve colonoscopy by enabling 3D reconstruction of the colon, providing a comprehensive view of mucosal surfaces and lesions, and facilitating the identification of unexplored areas. However, the development of robust methods is limited by the scarcity of large-scale ground truth data. We propose RealSynCol, a highly realistic synthetic dataset designed to replicate the endoscopic environment. Colon geometries extracted from 10 CT scans were imported into a virtual environment that closely mimics intraoperative conditions and rendered with realistic vascular textures. The resulting dataset comprises 28,130 frames, paired with ground truth depth maps, optical flow, 3D meshes, and camera trajectories. A benchmark study was conducted to evaluate the available synthetic colon datasets for the tasks of depth and pose estimation. Results demonstrate that the high realism and variability of RealSynCol significantly enhance generalization performance on clinical images, proving it to be a powerful tool for developing deep learning algorithms to support endoscopic diagnosis.

[240] Understanding and Optimizing Attention-Based Sparse Matching for Diverse Local Features

Qiang Wang

Main category: cs.CV

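TL;DR: This paper revisits the training of attention-based sparse image matching models, finds that detectors rather than descriptors are the main cause of performance differences, and proposes a new fine-tuning approach using keypoints from multiple sources to build a universal, detector-agnostic matching model.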

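Details Motivation: Existing attention-based sparse image matching models (such as LightGlue) contain an overlooked design choice that affects their performance, and the respective roles of detectors and descriptors in the matching framework remain unclear. Method: The paper analyzes design flaws in models such as LightGlue, systematically evaluates how detectors and descriptors affect performance, and proposes a fine-tuning strategy based on keypoints from diverse detectors to build a universal, detector-agnostic matching model. Result: When transferred zero-shot to new detectors, the resulting universal model matches or exceeds the accuracy of models trained specifically for those detectors. Conclusion: Detectors are the dominant factor in matching performance; fine-tuning on keypoints from multiple detectors yields a high-performing, detector-agnostic matching model and offers useful guidance for the future design and deployment of local features. Abstract: We revisit the problem of training attention-based sparse image matching models for various local features. We first identify one critical design choice that has been previously overlooked, which significantly impacts the performance of the LightGlue model. We then investigate the role of detectors and descriptors within the transformer-based matching framework, finding that detectors, rather than descriptors, are often the primary cause of performance differences. Finally, we propose a novel approach to fine-tune existing image matching models using keypoints from a diverse set of detectors, resulting in a universal, detector-agnostic model. When deployed as a zero-shot matcher for novel detectors, the resulting model achieves or exceeds the accuracy of models specifically trained for those features. Our findings offer valuable insights for the deployment of transformer-based matching models and the future design of local features.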

[241] Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

Yuhao Dong,Shulin Tian,Shuai Liu,Shuangrui Ding,Yuhang Zang,Xiaoyi Dong,Yuhang Cao,Jiaqi Wang,Ziwei Liu

Main category: cs.CV

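TL;DR: This paper proposes a new video in-context learning task (Demo-driven Video In-Context Learning) with an accompanying benchmark, Demo-ICL-Bench, and builds Demo-ICL, an MLLM dedicated to the task, whose two-stage training strategy improves the model's ability to learn from demonstrations.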

Details Motivation: Existing video understanding benchmarks mainly evaluate understanding based on a model's static internal knowledge and do not test its ability to learn and adapt from a few dynamic, novel examples. Method: The paper defines the Demo-driven Video In-Context Learning task; builds the Demo-ICL-Bench benchmark from 1,200 instructional videos with two kinds of demonstrations, text (subtitle summaries) and video; and designs the Demo-ICL model with a two-stage training strategy of video-supervised fine-tuning followed by information-assisted direct preference optimization. Result: Experiments confirm that Demo-ICL-Bench is challenging and that Demo-ICL clearly outperforms existing state-of-the-art MLLMs, validating the approach. Conclusion: This work fills a gap in the evaluation of video in-context learning and provides a new task, benchmark, and method for future research. Abstract: Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt from dynamic, novel contexts from few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about the target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1,200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) summarizing video subtitles for text demonstration; and (ii) corresponding instructional videos as video demonstrations. To effectively tackle this new challenge, we develop Demo-ICL, an MLLM with a two-stage training strategy: video-supervised fine-tuning and information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.

[242] Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries

Haocheng Lu,Nan Zhang,Wei Tao,Xiaoyang Qu,Guokuan Li,Jiguang Wan,Jianzong Wang

Main category: cs.CV

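TL;DR: Vista is a scene-aware streaming video question answering framework that uses scene-aware segmentation, compression, and recall to reason efficiently and scalably over long video streams while keeping latency low and memory use efficient.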

Details Motivation: Existing streaming video QA methods rely on fixed-size memory or naive compression, which often causes context loss or memory overflow and makes them ill-suited to long-form, real-time video understanding. Method: The Vista framework has three core components: (1) scene-aware segmentation, which dynamically clusters video frames into temporally and visually coherent scene units; (2) scene-aware compression, which compresses each scene into compact tokens stored in GPU memory while offloading the original frames to CPU memory; (3) scene-aware recall, which retrieves and reintegrates the relevant scene tokens on demand to answer a query. The framework is model-agnostic and compatible with a variety of vision-language backbones. Result: Vista achieves state-of-the-art performance on StreamingBench, markedly improving the accuracy and efficiency of question answering over long video streams while keeping latency low and memory use under control. Conclusion: Vista offers an efficient, scalable, and general approach to streaming video understanding in real-world settings and establishes a strong baseline for the task. Abstract: Streaming video question answering (Streaming Video QA) poses distinct challenges for multimodal large language models (MLLMs), as video frames arrive sequentially and user queries can be issued at arbitrary time points. Existing solutions relying on fixed-size memory or naive compression often suffer from context loss or memory overflow, limiting their effectiveness in long-form, real-time scenarios. We present Vista, a novel framework for scene-aware streaming video QA that enables efficient and scalable reasoning over continuous video streams. The innovation of Vista can be summarized in three aspects: (1) scene-aware segmentation, where Vista dynamically clusters incoming frames into temporally and visually coherent scene units; (2) scene-aware compression, where each scene is compressed into a compact token representation and stored in GPU memory for efficient index-based retrieval, while full-resolution frames are offloaded to CPU memory; and (3) scene-aware recall, where relevant scenes are selectively recalled and reintegrated into the model input upon receiving a query, enabling both efficiency and completeness. Vista is model-agnostic and integrates seamlessly with a variety of vision-language backbones, enabling long-context reasoning without compromising latency or memory efficiency. Extensive experiments on StreamingBench demonstrate that Vista achieves state-of-the-art performance, establishing a strong baseline for real-world streaming video understanding.
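
Scene-aware segmentation, the first of the three steps, can be approximated by grouping consecutive frames whose features stay similar. The abstract does not give Vista's clustering rule, so the cosine-similarity test, the `threshold` value, and the function names in this sketch are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment_scenes(frame_features, threshold=0.8):
    """Group consecutive frame indices into scenes.

    A new scene starts whenever a frame's similarity to the previous
    frame falls below the (hypothetical) threshold.
    """
    scenes = []
    for i, feature in enumerate(frame_features):
        if scenes and cosine(frame_features[i - 1], feature) >= threshold:
            scenes[-1].append(i)  # same scene as the previous frame
        else:
            scenes.append([i])    # first frame, or a scene change
    return scenes


features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(segment_scenes(features))
```

Because the rule only compares neighbouring frames, it runs in a single pass over the stream, which suits the setting where frames arrive sequentially; each resulting index group is the unit that would later be compressed and recalled.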

[243] TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation

Yiyang Cao,Yunze Deng,Ziyu Lin,Bin Feng,Xinggang Wang,Wenyu Liu,Dandan Zheng,Jingdong Chen

Main category: cs.CV

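TL;DR: This paper proposes TriC-Motion, a diffusion-based tri-domain causal framework for text-to-motion generation. It jointly models the spatial, temporal, and frequency domains and uses causal intervention to disentangle motion-irrelevant noise, improving generation quality.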

Details Motivation: Existing text-to-motion methods lack unified joint modeling of the spatial, temporal, and frequency domains, and are easily disturbed by motion-irrelevant noise, which distorts the generated motion. Method: TriC-Motion comprises three modules (Temporal Motion Encoding, Spatial Topology Modeling, and Hybrid Frequency Analysis), a Score-guided Tri-domain Fusion module that integrates their information, and a Causality-based Counterfactual Motion Disentangler that removes noise. Result: TriC-Motion reaches an R@1 of 0.612 on HumanML3D, outperforming existing state-of-the-art methods, and generates motion that is high-fidelity, coherent, diverse, and aligned with the text. Conclusion: Joint tri-domain modeling combined with causal intervention improves the quality of text-driven motion generation. Abstract: Text-to-motion generation, a rapidly evolving field in computer vision, aims to produce realistic and text-aligned motion sequences. Current methods primarily focus on spatial-temporal modeling or independent frequency domain analysis, lacking a unified framework for joint optimization across spatial, temporal, and frequency domains. This limitation hinders the model's ability to leverage information from all domains simultaneously, leading to suboptimal generation quality. Additionally, in motion generation frameworks, motion-irrelevant cues caused by noise are often entangled with features that contribute positively to generation, thereby leading to motion distortion. To address these issues, we propose Tri-Domain Causal Text-to-Motion Generation (TriC-Motion), a novel diffusion-based framework integrating spatial-temporal-frequency-domain modeling with causal intervention. TriC-Motion includes three core modeling modules for domain-specific modeling, namely Temporal Motion Encoding, Spatial Topology Modeling, and Hybrid Frequency Analysis. After comprehensive modeling, a Score-guided Tri-domain Fusion module integrates valuable information from the triple domains, simultaneously ensuring temporal consistency, spatial topology, motion trends, and dynamics. Moreover, the Causality-based Counterfactual Motion Disentangler is meticulously designed to expose motion-irrelevant cues to eliminate noise, disentangling the real modeling contributions of each domain for superior generation.
Extensive experimental results validate that TriC-Motion achieves superior performance compared to state-of-the-art methods, attaining an outstanding R@1 of 0.612 on the HumanML3D dataset. These results demonstrate its capability to generate high-fidelity, coherent, diverse, and text-aligned motion sequences. Code is available at: https://caoyiyang1105.github.io/TriC-Motion/.

[244] Gesture Matters: Pedestrian Gesture Recognition for AVs Through Skeleton Pose Evaluation

Alif Rizqullah Mahdi,Mahdi Rezaei,Natasha Merat

Main category: cs.CV

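TL;DR: This paper presents a gesture classification framework for traffic scenes based on 2D pose estimation. Using videos from the WIVW dataset, it extracts 76 static and dynamic features to recognise four gesture classes (Stop, Go, Thank & Greet, and No Gesture) with 87% accuracy; hand position and movement velocity are the most discriminative features.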

Details Motivation: Autonomous vehicles (AVs) struggle to interpret pedestrian gestures, yet gestures are an important form of non-verbal communication in traffic that compensates where formal rules fall short. Method: Using real-world videos from the WIVW dataset, the authors apply 2D pose estimation to obtain normalised keypoints and build a classification framework over 76 static and dynamic features, with gestures grouped into Stop, Go, Thank & Greet, and No Gesture. Result: Hand position and movement velocity are the most discriminative features, and overall classification accuracy reaches 87%. Conclusion: The approach improves AVs' ability to interpret pedestrian gestures and adds to the understanding of pedestrian behaviour in traffic. Abstract: Gestures are a key component of non-verbal communication in traffic, often helping pedestrian-to-driver interactions when formal traffic rules may be insufficient. This problem becomes more apparent when autonomous vehicles (AVs) struggle to interpret such gestures. In this study, we present a gesture classification framework using 2D pose estimation applied to real-world video sequences from the WIVW dataset. We categorise gestures into four primary classes (Stop, Go, Thank & Greet, and No Gesture) and extract 76 static and dynamic features from normalised keypoints. Our analysis demonstrates that hand position and movement velocity are especially discriminative in distinguishing between gesture classes, achieving a classification accuracy score of 87%. These findings not only improve the perceptual capabilities of AV systems but also contribute to the broader understanding of pedestrian behaviour in traffic contexts.
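
To make "static and dynamic features from normalised keypoints" concrete, here is a minimal sketch of one static feature (wrist position) and one dynamic feature (wrist speed). The keypoint names, the shoulder-based normalisation, and the choice of features are assumptions for illustration; the paper's actual 76 features are not listed in the abstract.

```python
import math

def normalise(keypoints):
    """Centre keypoints on the mid-shoulder point and scale by shoulder width."""
    lx, ly = keypoints["left_shoulder"]
    rx, ry = keypoints["right_shoulder"]
    cx, cy = (lx + rx) / 2, (ly + ry) / 2
    scale = math.dist((lx, ly), (rx, ry))
    return {k: ((x - cx) / scale, (y - cy) / scale) for k, (x, y) in keypoints.items()}

def wrist_features(prev_kp, curr_kp, dt):
    """Static feature: normalised wrist position. Dynamic feature: wrist speed."""
    p = normalise(prev_kp)["right_wrist"]
    c = normalise(curr_kp)["right_wrist"]
    return {"wrist_x": c[0], "wrist_y": c[1], "wrist_speed": math.dist(p, c) / dt}

# Hypothetical pixel coordinates for two consecutive frames, 0.5 s apart.
prev_kp = {"left_shoulder": (100, 100), "right_shoulder": (200, 100), "right_wrist": (200, 200)}
curr_kp = {"left_shoulder": (100, 100), "right_shoulder": (200, 100), "right_wrist": (200, 100)}
feats = wrist_features(prev_kp, curr_kp, dt=0.5)
```

Normalising by shoulder width makes the features roughly independent of how far the pedestrian stands from the camera.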

[245] Enhanced Food Category Recognition under Illumination-Induced Domain Shift

Keonvin Park,Aditya Pal,Jin Hong Mok

Main category: cs.CV

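TL;DR: This paper studies how illumination-induced domain shift affects multi-class food recognition, builds synthetic illumination-augmented datasets to improve robustness, and validates the approach in cross-dataset transfer and generalization.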

Details Motivation: In real-world settings such as automated conveyor-belt inspection, illumination changes cause domain shift that badly degrades visual food recognition; existing work is mostly limited to single categories or controlled environments, and public food datasets lack illumination annotations. Method: The authors run cross-dataset evaluation on Food-101 and Fruits-360, build synthetic illumination-augmented datasets by systematically varying light temperature and intensity, and combine cross-dataset transfer learning with domain generalization, focusing on illumination-sensitive categories such as apple-based classes. Result: Illumination mismatch causes a substantial accuracy drop; illumination-aware augmentation significantly improves robustness under domain shift while preserving real-time performance. Conclusion: Illumination robustness is essential for practical food recognition systems, and the paper offers practical guidance for reliable deployment in real inspection scenarios. Abstract: Visual food recognition systems deployed in real-world environments, such as automated conveyor-belt inspection, are highly sensitive to domain shifts caused by illumination changes. While recent studies have shown that lighting variations can significantly distort food perception by both humans and AI, existing works are often limited to single food categories or controlled settings, and most public food datasets lack explicit illumination annotations. In this work, we investigate illumination-induced domain shift in multi-class food category recognition using two widely adopted datasets, Food-101 and Fruits-360. We demonstrate substantial accuracy degradation under cross-dataset evaluation due to mismatched visual conditions. To address this challenge, we construct synthetic illumination-augmented datasets by systematically varying light temperature and intensity, enabling controlled robustness analysis without additional labels. We further evaluate cross-dataset transfer learning and domain generalization, with a focus on illumination-sensitive target categories such as apple-based classes. Experimental results show that illumination-aware augmentation significantly improves recognition robustness under domain shift while preserving real-time performance. Our findings highlight the importance of illumination robustness and provide practical insights for deploying reliable food recognition systems in real-world inspection scenarios.
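
A rough sketch of what "systematically varying light temperature and intensity" could look like in code. The per-channel gain values (`WARM`, `COOL`) and the simple multiplicative model are assumptions chosen for illustration; the abstract does not describe the authors' augmentation pipeline.

```python
def relight(pixel, gains=(1.0, 1.0, 1.0), intensity=1.0):
    """Scale an (R, G, B) pixel by per-channel gains and a global intensity."""
    return tuple(min(255, max(0, round(c * g * intensity))) for c, g in zip(pixel, gains))

def augment(image, gains, intensity):
    """Apply the same relighting to every pixel of a nested-list image."""
    return [[relight(p, gains, intensity) for p in row] for row in image]

# Hypothetical channel gains approximating warmer and cooler light sources.
WARM = (1.10, 1.00, 0.85)
COOL = (0.90, 1.00, 1.15)

image = [[(100, 100, 100)]]
warm_image = augment(image, WARM, 1.0)
bright_pixel = relight((200, 200, 200), intensity=2.0)  # clipped to the 8-bit range
```

Sweeping a grid of gain and intensity settings over an existing dataset yields augmented copies without any new labels, which matches the label-free robustness analysis the abstract describes.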

[246] Are Vision Foundation Models Foundational for Electron Microscopy Image Segmentation?

Caterina Fuster-Barceló,Virginie Uhlmann

Main category: cs.CV

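TL;DR: This paper studies how well vision foundation models (VFMs) generalize across datasets for mitochondria segmentation in electron microscopy (EM) images. VFMs perform well on a single EM dataset, but performance drops sharply when training jointly on multiple heterogeneous EM datasets. The analysis points to a pronounced representation mismatch between domains that current parameter-efficient fine-tuning (such as LoRA) does not resolve, so additional domain-alignment mechanisms are needed.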

Details Motivation: To determine whether the latent representations of vision foundation models (VFMs) are general enough to support effective transfer and reuse across heterogeneous electron microscopy (EM) datasets, specifically for mitochondria segmentation. Method: Three VFMs (DINOv2, DINOv3, and OpenCLIP) are evaluated on two public EM datasets (Lucchi++ and VNC) under two adaptation regimes: a frozen backbone with a lightweight segmentation head, and parameter-efficient fine-tuning (PEFT) via LoRA. The latent space is analysed with PCA, Fréchet DINOv2 distance, and linear probes. Result: Single-dataset training works well and LoRA consistently improves in-domain performance, but joint training on multiple datasets severely degrades all models, with only marginal gains from PEFT. The representation analysis confirms a pronounced and persistent domain mismatch between the two EM datasets. Conclusion: VFMs can deliver competitive segmentation within a single EM domain under lightweight adaptation, but current PEFT strategies are insufficient for a single robust model across heterogeneous EM datasets; domain-alignment mechanisms are needed. Abstract: Although vision foundation models (VFMs) are increasingly reused for biomedical image analysis, it remains unclear whether the latent representations they provide are general enough to support effective transfer and reuse across heterogeneous microscopy image datasets. Here, we study this question for the problem of mitochondria segmentation in electron microscopy (EM) images, using two popular public EM datasets (Lucchi++ and VNC) and three recent representative VFMs (DINOv2, DINOv3, and OpenCLIP). We evaluate two practical model adaptation regimes: a frozen-backbone setting in which only a lightweight segmentation head is trained on top of the VFM, and parameter-efficient fine-tuning (PEFT) via Low-Rank Adaptation (LoRA) in which the VFM is fine-tuned in a targeted manner to a specific dataset. Across all backbones, we observe that training on a single EM dataset yields good segmentation performance (quantified as foreground Intersection-over-Union), and that LoRA consistently improves in-domain performance. In contrast, training on multiple EM datasets leads to severe performance degradation for all models considered, with only marginal gains from PEFT. Exploration of the latent representation space through various techniques (PCA, Fréchet DINOv2 distance, and linear probes) reveals a pronounced and persistent domain mismatch between the two considered EM datasets in spite of their visual similarity, which is consistent with the observed failure of paired training.
These results suggest that, while VFMs can deliver competitive results for EM segmentation within a single domain under lightweight adaptation, current PEFT strategies are insufficient to obtain a single robust model across heterogeneous EM datasets without additional domain-alignment mechanisms.
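
The abstract quantifies segmentation quality as foreground Intersection-over-Union. A minimal sketch of that metric on binary masks follows; the nested-list mask format and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def foreground_iou(pred, target):
    """IoU of the foreground (non-zero) class between two binary masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0

pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
iou = foreground_iou(pred, target)  # 2 shared pixels out of 4 in the union
```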

[247] GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving

Linger Deng,Yuliang Liu,Wenwen Yu,Zujia Zhang,Jianzhong Ju,Zhenbo Luo,Xiang Bai

Main category: cs.CV

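TL;DR: This paper proposes GeoFocus, a framework that strengthens perception of local geometric features through a Critical Local Perceptor module and improves global figure encoding with VertexLang, a formal topology language, yielding gains in accuracy and robustness on several geometry datasets.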

Details Motivation: Geometry problem-solving remains hard for large models because it requires both global shape recognition and an understanding of local geometric relationships. Method: A two-module framework: (1) Critical Local Perceptor, which automatically identifies and emphasizes key local structures using 13 theory-based perception templates; (2) VertexLang, a compact formal topology language that encodes global figures through vertex coordinates and connectivity relations. Result: On Geo3K, GeoQA, and FormalGeo7K, accuracy improves by 4.7% over leading specialized models, with stronger visual robustness on MATHVERSE; local feature coverage increases by 61% and global perception training time drops by 20%. Conclusion: GeoFocus combines local geometric detail perception with global topology modeling, improving both the performance and efficiency of LMMs on geometric reasoning. Abstract: Geometry problem-solving remains a significant challenge for Large Multimodal Models (LMMs), requiring not only global shape recognition but also attention to intricate local relationships related to geometric theory. To address this, we propose GeoFocus, a novel framework comprising two core modules. 1) Critical Local Perceptor, which automatically identifies and emphasizes critical local structures (e.g., angles, parallel lines, comparative distances) through thirteen theory-based perception templates, boosting critical local feature coverage by 61% compared to previous methods. 2) VertexLang, a compact topology formal language, encodes global figures through vertex coordinates and connectivity relations. By replacing bulky code-based encodings, VertexLang reduces global perception training time by 20% while improving topology recognition accuracy. When evaluated on Geo3K, GeoQA, and FormalGeo7K, GeoFocus achieves a 4.7% accuracy improvement over leading specialized models and demonstrates superior robustness in MATHVERSE under diverse visual conditions. Project Page -- https://github.com/dle666/GeoFocus

[248] Automatic regularization parameter choice for tomography using a double model approach

Chuyang Wu,Samuli Siltanen

Main category: cs.CV

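TL;DR: This paper proposes an automatic regularization parameter selection method for image reconstruction in X-ray tomography, based on feedback control over two computational grids, to address the ill-posed inverse problem under limited data.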

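Details Motivation: Image reconstruction in X-ray tomography is an ill-posed inverse problem, especially with limited data. Regularization is necessary, but its effect depends heavily on the regularization parameter, which must balance data fidelity against prior information. Method: The method uses two different computational discretizations (two grids) of the same problem. A feedback control algorithm dynamically adjusts the regularization strength until the iterative reconstructions on the two grids are sufficiently similar, and selects the smallest parameter that satisfies this condition. Result: The method is validated on real tomographic data, where it selects a suitable parameter automatically. Conclusion: Two-grid feedback control offers an adaptive way to choose the regularization parameter for ill-posed inverse problems without manual tuning. Abstract: Image reconstruction in X-ray tomography is an ill-posed inverse problem, particularly with limited available data. Regularization is thus essential, but its effectiveness hinges on the choice of a regularization parameter that balances data fidelity against a priori information. We present a novel method for automatic parameter selection based on the use of two distinct computational discretizations of the same problem. A feedback control algorithm dynamically adjusts the regularization strength, driving an iterative reconstruction toward the smallest parameter that yields sufficient similarity between reconstructions on the two grids. The effectiveness of the proposed approach is demonstrated using real tomographic data.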
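
The feedback idea can be sketched as a simple controller on the logarithm of the regularization parameter. This is an illustration only: the controller form, `gain`, and `toy_discrepancy` (a stand-in for "difference between the reconstructions on the two grids") are assumptions, and the paper's actual control law is not given in the abstract. The sketch also assumes the discrepancy decreases monotonically as regularization increases, so the point where it equals the target is the smallest acceptable parameter.

```python
import math

def tune_regularization(discrepancy, target, alpha0=1e-3, gain=5.0, iters=200):
    """Drive discrepancy(alpha) toward `target` by adjusting log(alpha)."""
    log_alpha = math.log(alpha0)
    for _ in range(iters):
        error = discrepancy(math.exp(log_alpha)) - target
        # Too dissimilar (error > 0): regularize more. Too similar: regularize less.
        log_alpha += gain * error
    return math.exp(log_alpha)

def toy_discrepancy(alpha):
    """Hypothetical stand-in: the two-grid mismatch shrinks as alpha grows."""
    return 1.0 / (1.0 + alpha)

alpha = tune_regularization(toy_discrepancy, target=0.1)
```

In a real reconstruction, `discrepancy` would run (or continue) the iterative solver on both grids and compare the results on a common grid; working in log space keeps the parameter positive.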

[249] Thegra: Graph-based SLAM for Thermal Imagery

Anastasiia Kornilova,Ivan Moskalenko,Arabella Gromova,Gonzalo Ferrer,Alexander Menshchikov

Main category: cs.CV

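TL;DR: This paper proposes a sparse monocular SLAM system for thermal imagery built on general-purpose learned features (SuperPoint + LightGlue). Preprocessing, matching adaptations, and a confidence-weighted factor graph improve robustness, giving reliable localization and mapping without fine-tuning on thermal images.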

Details Motivation: Thermal imaging is useful in visually degraded environments such as low light, smoke, or bad weather, but its low texture, low contrast, and high noise make conventional feature-based SLAM difficult to apply. Method: The system uses the SuperPoint detector and LightGlue matcher (trained on large-scale visible-spectrum data), adds a thermal-specific preprocessing pipeline, modifies core SLAM modules to cope with sparse and outlier-prone matches, and feeds SuperPoint keypoint confidence scores into a confidence-weighted factor graph. Result: Experiments on public thermal datasets show reliable performance without dataset-specific training or fine-tuning. Conclusion: Features learned in the visible-spectrum domain transfer effectively to thermal SLAM when combined with targeted engineering (preprocessing, matching adaptation, confidence weighting), offering a practical route given the scarcity of thermal training data. Abstract: Thermal imaging provides a practical sensing modality for visual SLAM in visually degraded environments such as low illumination, smoke, or adverse weather. However, thermal imagery often exhibits low texture, low contrast, and high noise, complicating feature-based SLAM. In this work, we propose a sparse monocular graph-based SLAM system for thermal imagery that leverages general-purpose learned features -- the SuperPoint detector and LightGlue matcher, trained on large-scale visible-spectrum data to improve cross-domain generalization. To adapt these components to thermal data, we introduce a preprocessing pipeline to enhance input suitability and modify core SLAM modules to handle sparse and outlier-prone feature matches. We further incorporate keypoint confidence scores from SuperPoint into a confidence-weighted factor graph to improve estimation robustness. Evaluations on public thermal datasets demonstrate that the proposed system achieves reliable performance without requiring dataset-specific training or fine-tuning of the feature detector, which matters given the scarcity of quality thermal data. Code will be made available upon publication.

[250] TIBR4D: Tracing-Guided Iterative Boundary Refinement for Efficient 4D Gaussian Segmentation

He Wu,Xia Yan,Yanghui Xu,Liegang Xia,Jiazhou Chen

Main category: cs.CV

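TL;DR: This paper proposes TIBR4D, a learning-free 4D Gaussian segmentation framework that uses two-stage iterative boundary refinement (IGIT and RCC) to improve the accuracy and efficiency of object-level segmentation in dynamic 4D Gaussian scenes.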

Details Motivation: Object-level segmentation in dynamic 4D Gaussian scenes is challenged by complex motion, heavy occlusion, and ambiguous boundaries. Method: TIBR4D is a two-stage iterative boundary refinement: the first stage is Iterative Gaussian Instance Tracing (IGIT) at the temporal segment level, and the second is frame-wise Gaussian Rendering Range Control (RCC). A temporal segmentation merging strategy balances identity consistency and dynamic awareness. Result: On HyperNeRF and Neu3D, the method produces object Gaussian point clouds with clearer boundaries, higher accuracy, and better efficiency than existing state-of-the-art methods. Conclusion: The learning-free TIBR4D framework handles segmentation in dynamic 4D Gaussian scenes effectively, with gains in accuracy, boundary sharpness, and computational efficiency. Abstract: Object-level segmentation in dynamic 4D Gaussian scenes remains challenging due to complex motion, occlusions, and ambiguous boundaries. In this paper, we present an efficient learning-free 4D Gaussian segmentation framework that lifts video segmentation masks to 4D space, whose core is a two-stage iterative boundary refinement, TIBR4D. The first stage is an Iterative Gaussian Instance Tracing (IGIT) at the temporal segment level. It progressively refines Gaussian-to-instance probabilities through iterative tracing, and extracts corresponding Gaussian point clouds that better handle occlusions and preserve completeness of object structures compared to existing one-shot threshold-based methods. The second stage is a frame-wise Gaussian Rendering Range Control (RCC) via suppressing highly uncertain Gaussians near object boundaries while retaining their core contributions for more accurate boundaries. Furthermore, a temporal segmentation merging strategy is proposed for IGIT to balance identity consistency and dynamic awareness. Longer segments enforce stronger multi-frame constraints for stable identities, while shorter segments allow identity changes to be captured promptly. Experiments on HyperNeRF and Neu3D demonstrate that our method produces accurate object Gaussian point clouds with clearer boundaries and higher efficiency compared to SOTA methods.

[251] GOT-Edit: Geometry-Aware Generic Object Tracking via Online Model Editing

Shih-Fang Chen,Jun-Cheng Chen,I-Hong Jhuo,Yen-Yu Lin

Main category: cs.CV

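TL;DR: This paper proposes GOT-Edit, which uses online cross-modality model editing to integrate 3D geometric cues into a 2D generic object tracker, improving robustness and accuracy under occlusion and clutter.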

Details Motivation: Existing generic object tracking methods rely mainly on 2D features and neglect 3D geometric cues, so performance degrades under occlusion, distractors, and changes in appearance or geometry. Method: GOT-Edit is an online cross-modality model editing approach. It uses a pre-trained Visual Geometry Grounded Transformer to infer geometric cues from a few 2D images, and applies null-space constrained updates that incorporate geometric information while preserving semantic discrimination. Result: Experiments on multiple GOT benchmarks show that GOT-Edit improves robustness and accuracy, particularly under occlusion and clutter. Conclusion: GOT-Edit combines 2D semantics with 3D geometric reasoning and provides a more reliable approach to generic object tracking. Abstract: Human perception for effective object tracking in a 2D video stream arises from the implicit use of prior 3D knowledge combined with semantic reasoning. In contrast, most generic object tracking (GOT) methods primarily rely on 2D features of the target and its surroundings while neglecting 3D geometric cues, which makes them susceptible to partial occlusion, distractors, and variations in geometry and appearance. To address this limitation, we introduce GOT-Edit, an online cross-modality model editing approach that integrates geometry-aware cues into a generic object tracker from a 2D video stream. Our approach leverages features from a pre-trained Visual Geometry Grounded Transformer to enable geometric cue inference from only a few 2D images. To tackle the challenge of seamlessly combining geometry and semantics, GOT-Edit performs online model editing with null-space constrained updates that incorporate geometric information while preserving semantic discrimination, yielding consistently better performance across diverse scenarios. Extensive experiments on multiple GOT benchmarks demonstrate that GOT-Edit achieves superior robustness and accuracy, particularly under occlusion and clutter, establishing a new paradigm for combining 2D semantics with 3D geometric reasoning for generic object tracking.
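
The general idea behind a null-space constrained update can be shown on a single linear layer: project the proposed weight change so that the layer's output for a protected input stays exactly the same. This is a generic sketch of the technique, not GOT-Edit's procedure; which features are protected and how the update is computed are not stated in the abstract, and `key` here is a hypothetical protected feature vector.

```python
def project_to_null_space(delta, key):
    """Remove from each row of `delta` its component along `key`.

    After projection, (W + delta') @ key == W @ key, so the layer's
    response to the protected input is unchanged.
    """
    kk = sum(k * k for k in key)
    projected = []
    for row in delta:
        rk = sum(r * k for r, k in zip(row, key))
        projected.append([r - rk * k / kk for r, k in zip(row, key)])
    return projected

def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

delta = [[1.0, 2.0], [3.0, 4.0]]   # proposed weight update
key = [1.0, 0.0]                   # hypothetical protected (semantic) feature
safe_delta = project_to_null_space(delta, key)
effect_on_key = matvec(safe_delta, key)
```

With several protected vectors, the same idea applies with a projector onto the orthogonal complement of their span.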

[252] FLAG-4D: Flow-Guided Local-Global Dual-Deformation Model for 4D Reconstruction

Guan Yuan Tan,Ngoc Tuan Vu,Arghya Pal,Sailaja Rajanala,Raphael Phan C. -W.,Mettu Srinivas,Chee-Ming Ting

Main category: cs.CV

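TL;DR: FLAG-4D is a framework for novel view synthesis of dynamic scenes. It uses a dual-deformation network (IDN and GMN) to jointly model how 3D Gaussian primitives evolve through space and time, and fuses pretrained optical flow features to improve motion modeling accuracy and temporal consistency.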

Details Motivation: Existing methods rely on a single MLP to model temporal deformation and struggle to capture complex point motion and fine-grained dynamic detail consistently from sparse inputs. Method: A dual-deformation network: an Instantaneous Deformation Network (IDN) handles fine-grained local deformations and a Global Motion Network (GMN) models long-range dynamics, refined through mutual learning. A pretrained optical flow backbone supplies dense motion features, and a deformation-guided attention mechanism aligns the flow from adjacent frames with the state of each Gaussian. Result: FLAG-4D outperforms state-of-the-art methods in reconstruction fidelity, temporal coherence, and detail preservation. Conclusion: By decoupling and jointly optimizing local and global dynamics, and by incorporating an external motion prior, FLAG-4D provides a more robust and more detailed approach to 4D reconstruction. Abstract: We introduce FLAG-4D, a novel framework for generating novel views of dynamic scenes by reconstructing how 3D Gaussian primitives evolve through space and time. Existing methods typically rely on a single Multilayer Perceptron (MLP) to model temporal deformations, and they often struggle to capture complex point motions and fine-grained dynamic details consistently over time, especially from sparse input views. Our approach, FLAG-4D, overcomes this by employing a dual-deformation network that dynamically warps a canonical set of 3D Gaussians over time into new positions and anisotropic shapes. This dual-deformation network consists of an Instantaneous Deformation Network (IDN) for modeling fine-grained, local deformations and a Global Motion Network (GMN) for capturing long-range dynamics, refined through mutual learning. To ensure these deformations are both accurate and temporally smooth, FLAG-4D incorporates dense motion features from a pretrained optical flow backbone. We fuse these motion cues from adjacent timeframes and use a deformation-guided attention mechanism to align this flow information with the current state of each evolving 3D Gaussian. Extensive experiments demonstrate that FLAG-4D achieves higher-fidelity and more temporally coherent reconstructions with finer detail preservation than state-of-the-art methods.

[253] SemiNFT: Learning to Transfer Presets from Imitation to Appreciation via Hybrid-Sample Reinforcement Learning

Melany Yang,Yuhang Yu,Diwang Weng,Jinwei Chen,Wei Dong

Main category: cs.CV

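TL;DR: This paper proposes SemiNFT, a Diffusion Transformer-based framework for reference-based color retouching. It mirrors how humans learn art: first learning structure preservation and color mapping from paired data, then refining aesthetic perception with reinforcement learning, so that color transfer preserves structure while better matching human aesthetics.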

Details Motivation: Existing reference-based retouching methods perform global color mapping from pixel-level statistics alone, without semantic understanding or aesthetic judgment, and fall short of what non-experts need for photorealistic retouching. Method: SemiNFT is first trained on paired triplets to learn structure preservation and basic color mapping, then trained with reinforcement learning on unpaired data to refine aesthetic perception, using a hybrid online-offline reward mechanism to prevent forgetting of earlier skills. Result: SemiNFT outperforms state-of-the-art methods on standard preset transfer benchmarks and performs well on zero-shot tasks such as black-and-white photo colorization and anime-to-photo cross-domain transfer. Conclusion: SemiNFT brings a human-style learning trajectory to image retouching and goes beyond statistical matching, combining structural fidelity with aesthetic awareness. Abstract: Photorealistic color retouching plays a vital role in visual content creation, yet manual retouching remains inaccessible to non-experts due to its reliance on specialized expertise. Reference-based methods offer a promising alternative by transferring the preset color of a reference image to a source image. However, these approaches often operate as novice learners, performing global color mappings derived from pixel-level statistics, without a true understanding of semantic context or human aesthetics. To address this issue, we propose SemiNFT, a Diffusion Transformer (DiT)-based retouching framework that mirrors the trajectory of human artistic training: beginning with rigid imitation and evolving into intuitive creation. Specifically, SemiNFT is first taught with paired triplets to acquire basic structural preservation and color mapping skills, and then advanced to reinforcement learning (RL) on unpaired data to cultivate nuanced aesthetic perception. Crucially, during the RL stage, to prevent catastrophic forgetting of old skills, we design a hybrid online-offline reward mechanism that anchors aesthetic exploration with structural review. Extensive experiments show that SemiNFT not only outperforms state-of-the-art methods on standard preset transfer benchmarks but also demonstrates remarkable intelligence in zero-shot tasks, such as black-and-white photo colorization and cross-domain (anime-to-photo) preset transfer. These results confirm that SemiNFT transcends simple statistical matching and achieves a sophisticated level of aesthetic comprehension. Our project can be found at https://melanyyang.github.io/SemiNFT/.

[254] Overview and Comparison of AVS Point Cloud Compression Standard

Wei Gao,Wenxu Gao,Xingming Mu,Changhao Peng,Ge Li

Main category: cs.CV

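TL;DR: This paper reviews AVS PCC, the first point cloud compression standard developed by China's AVS Workgroup, covering its technical features and performance, and compares it with MPEG's G-PCC and V-PCC standards.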

Details Motivation: Point clouds are large, which makes transmission and storage difficult and calls for efficient compression standards. MPEG has already defined G-PCC and V-PCC, and China's AVS Workgroup has released its own AVS PCC standard, whose technology and performance merit a systematic review. Method: The paper analyses the AVS PCC standard technically, compares its coding tools and framework design with MPEG G-PCC/V-PCC, and collects published performance results for a side-by-side comparison. Result: The review identifies the new coding techniques adopted by AVS PCC (such as adaptive geometry quantization and multi-view attribute fusion) and reports rate-distortion or computational-complexity advantages over G-PCC/V-PCC in specific scenarios on several test sequences. Conclusion: AVS PCC is an innovative and practical point cloud compression standard with a clearly differentiated technical path, and it contributes to point cloud deployment and a more diverse standards ecosystem. Abstract: Point cloud is a prevalent 3D data representation format with significant application values in immersive media, autonomous driving, digital heritage protection, etc. However, the large data size of point clouds poses challenges to transmission and storage, which hinders wide deployment. Therefore, point cloud compression plays a crucial role in practical applications for both human and machine perception optimization. To this end, the Moving Picture Experts Group (MPEG) has established two standards for point cloud compression, including Geometry-based Point Cloud Compression (G-PCC) and Video-based Point Cloud Compression (V-PCC). In the meantime, the Audio Video coding Standard (AVS) Workgroup of China has also launched and completed the development of its first-generation point cloud compression standard, namely AVS PCC. This new standardization effort has adopted many new coding tools and techniques, which are different from the other counterpart standards. This paper reviews the AVS PCC standard from two perspectives, i.e., the related technologies and performance comparisons.

[255] Inspiration Seeds: Learning Non-Literal Visual Combinations for Generative Exploration

Kfir Goldberg,Elad Richardson,Yael Vinker

Main category: cs.CV

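TL;DR: This paper proposes Inspiration Seeds, a generative framework that supports designers' visual exploration and ideation early in the creative process. Given two input images and no text prompt, it generates diverse, visually coherent compositions that reveal latent relationships between them.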

Details Motivation: Existing generative models depend on carefully crafted text prompts and offer little support for the open-ended exploration from loosely related visual references that is common in design. Method: A feed-forward generative framework trained on synthetic triplets built purely by visual means: CLIP Sparse Autoencoders extract editing directions in CLIP latent space and isolate concept pairs, removing any reliance on language. Result: The model recombines two input images quickly and intuitively in a visually coherent way, producing diverse outputs that reveal their latent relationships, which suits the early, ambiguous stages of creative work. Conclusion: Inspiration Seeds shifts image generation from execution to exploration, giving visual ideation more natural support and extending where generative models fit in creative workflows. Abstract: While generative models have become powerful tools for image synthesis, they are typically optimized for executing carefully crafted textual prompts, offering limited support for the open-ended visual exploration that often precedes idea formation. In contrast, designers frequently draw inspiration from loosely connected visual references, seeking emergent connections that spark new ideas. We propose Inspiration Seeds, a generative framework that shifts image generation from final execution to exploratory ideation. Given two input images, our model produces diverse, visually coherent compositions that reveal latent relationships between inputs, without relying on user-specified text prompts. Our approach is feed-forward, trained on synthetic triplets of decomposed visual aspects derived entirely through visual means: we use CLIP Sparse Autoencoders to extract editing directions in CLIP latent space and isolate concept pairs. By removing the reliance on language and enabling fast, intuitive recombination, our method supports visual ideation at the early and ambiguous stages of creative work.

[256] Improving Reconstruction of Representation Autoencoder

Siyu Liu,Chujie Qin,Hubery Yin,Qixin Yan,Zheng-Peng Duan,Chen Li,Jing Lyu,Chun-Le Guo,Chongyi Li

Main category: cs.CV

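TL;DR: This paper proposes LV-RAE, which improves the reconstruction fidelity of latent diffusion models (LDMs) by adding the low-level information (such as color and texture) missing from semantic features. Decoder fine-tuning and controlled noise injection reduce the decoder's sensitivity in the high-dimensional latent space, giving both high-fidelity reconstruction and high-quality generation.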

Details Motivation: Latent diffusion models that use vision foundation models as image encoders gain generative performance, but their semantic features lack low-level detail (such as color and texture), which lowers reconstruction fidelity and has become a main bottleneck in scaling LDMs. Method: LV-RAE is a representation autoencoder that fuses semantic and low-level information. The analysis finds that the resulting high-dimensional, information-rich latents make the decoder sensitive to perturbations, so the authors fine-tune the decoder for robustness and smooth generated latents with controlled noise injection. Result: Experiments show that LV-RAE significantly improves reconstruction fidelity while preserving semantic abstraction and achieving strong generative quality. Conclusion: LV-RAE addresses both the imbalance between semantic and low-level information and the decoder's sensitivity, supporting high-fidelity reconstruction together with high-quality generation. Abstract: Recent work leverages Vision Foundation Models as image encoders to boost the generative performance of latent diffusion models (LDMs), as their semantic feature distributions are easy to learn. However, such semantic features often lack low-level information (e.g., color and texture), leading to degraded reconstruction fidelity, which has emerged as a primary bottleneck in further scaling LDMs. To address this limitation, we propose LV-RAE, a representation autoencoder that augments semantic features with missing low-level information, enabling high-fidelity reconstruction while remaining highly aligned with the semantic distribution. We further observe that the resulting high-dimensional, information-rich latents make decoders sensitive to latent perturbations, causing severe artifacts when decoding generated latents and consequently degrading generation quality. Our analysis suggests that this sensitivity primarily stems from excessive decoder responses along directions off the data manifold. Building on these insights, we propose fine-tuning the decoder to increase its robustness and smoothing the generated latents via controlled noise injection, thereby enhancing generation quality. Experiments demonstrate that LV-RAE significantly improves reconstruction fidelity while preserving the semantic abstraction and achieving strong generative quality. Our code is available at https://github.com/modyu-liu/LVRAE.

[257] Revisiting [CLS] and Patch Token Interaction in Vision Transformers

Alexis Marouani,Oriane Siméoni,Hervé Jégou,Piotr Bojanowski,Huy V. Vo

Main category: cs.CV

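TL;DR: This paper proposes specialized processing paths for the [CLS] class token and the patch tokens in Vision Transformers. By selectively decoupling the computation of the two token types in normalization layers and early QKV projections, it improves dense prediction tasks such as segmentation while maintaining classification accuracy, with only an 8% increase in parameters and no extra computational overhead.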

Details Motivation: In a standard ViT, the global [CLS] token and the local patch tokens are processed identically even though they serve different purposes, which creates friction in learning. The authors study how the two token types interact under different pre-training strategies and aim to improve their representations. Method: After observing that normalization layers implicitly differentiate the two token types, the authors design dedicated paths that process [CLS] and patch tokens separately in normalization layers and early QKV projections, selectively decoupling the computational flow. Result: Segmentation improves by over 2 mIoU points while classification accuracy stays strong, with only an 8% parameter increase and no added compute; ablations show which components benefit most and how the approach generalizes across model scales and learning frameworks. Conclusion: Lightweight, structured specialization for tokens with different semantic roles eases the conflict between global and local feature learning and is an effective way to improve dense prediction. Abstract: Vision Transformers have emerged as powerful, scalable and versatile representation learners. To capture both global and local features, a learnable [CLS] class token is typically prepended to the input sequence of patch tokens. Despite their distinct nature, both token types are processed identically throughout the model. In this work, we investigate the friction between global and local feature learning under different pre-training strategies by analyzing the interactions between class and patch tokens. Our analysis reveals that standard normalization layers introduce an implicit differentiation between these token types. Building on this insight, we propose specialized processing paths that selectively disentangle the computational flow of class and patch tokens, particularly within normalization layers and early query-key-value projections. This targeted specialization leads to significantly improved patch representation quality for dense prediction tasks. Our experiments demonstrate segmentation performance gains of over 2 mIoU points on standard benchmarks, while maintaining strong classification accuracy. The proposed modifications introduce only an 8% increase in parameters, with no additional computational overhead. Through comprehensive ablations, we provide insights into which architectural components benefit most from specialization and how our approach generalizes across model scales and learning frameworks.
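
One plausible reading of "specialized processing paths within normalization layers" is a layer norm with separate affine parameters for the class token and the patch tokens. The sketch below illustrates that reading only; the exact design in the paper is not described in the abstract, and the assumption that the [CLS] token sits at index 0 follows the usual ViT convention.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-6):
    """Standard layer normalization over one token's features."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]

def split_norm(tokens, cls_params, patch_params):
    """Normalize token 0 ([CLS]) and the patch tokens with separate (gamma, beta)."""
    out = []
    for i, tok in enumerate(tokens):
        gamma, beta = cls_params if i == 0 else patch_params
        out.append(layer_norm(tok, gamma, beta))
    return out

tokens = [[1.0, 3.0], [1.0, 3.0]]               # [CLS] token, then one patch token
cls_params = ([2.0, 2.0], [0.0, 0.0])           # hypothetical learned parameters
patch_params = ([1.0, 1.0], [0.0, 0.0])
normed = split_norm(tokens, cls_params, patch_params)
```

Because each token still passes through exactly one normalization, the token-wise compute is unchanged and only the parameter count grows, which is consistent with the "no additional computational overhead" claim.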

[258] Deep Learning-Based Fixation Type Prediction for Quality Assurance in Digital Pathology

Oskar Thaeter,Tanja Niedermair,Johannes Raffler,Ralf Huss,Peter J. Schüffler

Main category: cs.CV

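TL;DR: This paper proposes a deep learning model that predicts the fixation type of pathology slides (FFPE or FS) quickly and accurately from low-resolution pre-scan thumbnails, making high-throughput pathology quality control much more efficient.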

Details Motivation: Manual annotation of fixation type is error-prone and affects diagnostic accuracy; existing methods need full-resolution WSIs and do not scale to high-throughput quality control. Method: A deep learning model classifies fixation type from low-resolution pre-scan thumbnails only. It is trained on TUM data and validated across domains on multi-centre, multi-scanner datasets from TCGA, Augsburg, and Regensburg. Result: The model reaches an AUROC of 0.88 on TCGA, 4.8% above comparable pre-scan methods, and an AUROC of 0.72 on both the Augsburg and Regensburg sets; each slide takes 21 ms, 400 times faster than full-resolution methods. Conclusion: The method detects labelling errors efficiently without high-magnification scans and suits large-scale pathology quality control; scanner-induced domain shift remains a challenge, and future work should improve generalization across devices. Abstract: Accurate annotation of fixation type is a critical step in slide preparation for pathology laboratories. However, this manual process is prone to errors, impacting downstream analyses and diagnostic accuracy. Existing methods for verifying formalin-fixed, paraffin-embedded (FFPE) and frozen section (FS) fixation types typically require full-resolution whole-slide images (WSIs), limiting scalability for high-throughput quality control. We propose a deep-learning model to predict fixation types using low-resolution, pre-scan thumbnail images. The model was trained on WSIs from the TUM Institute of Pathology (n=1,200, Leica GT450DX) and evaluated on a class-balanced subset of The Cancer Genome Atlas dataset (TCGA, n=8,800, Leica AT2), as well as on class-balanced datasets from Augsburg (n=695 [392 FFPE, 303 FS], Philips UFS) and Regensburg (n=202, 3DHISTECH P1000). Our model achieves an AUROC of 0.88 on TCGA, outperforming comparable pre-scan methods by 4.8%. It also achieves AUROCs of 0.72 on Regensburg and Augsburg slides, underscoring challenges related to scanner-induced domain shifts. Furthermore, the model processes each slide in 21 ms, $400\times$ faster than existing high-magnification, full-resolution methods, enabling rapid, high-throughput processing. This approach provides an efficient solution for detecting labelling errors without relying on high-magnification scans, offering a valuable tool for quality control in high-throughput pathology workflows. Future work will improve and evaluate the model's generalisation to additional scanner types.
Our findings suggest that this method can increase accuracy and efficiency in digital pathology workflows and may be extended to other low-resolution slide annotations.
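
For readers unfamiliar with the AUROC figures quoted above, the metric equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal pairwise sketch follows; treating label 1 as one fixation class and the example scores are assumptions for illustration.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores for four slides; label 1 marks the positive class.
scores = [0.9, 0.8, 0.3, 0.6]
labels = [1, 0, 0, 1]
value = auroc(scores, labels)  # 3 of the 4 positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in the number of slides; a rank-based computation gives the same value for large evaluation sets.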

[259] WiFlow: A Lightweight WiFi-based Continuous Human Pose Estimation Network with Spatio-Temporal Feature Decoupling

Yi Dao,Lankai Zhang,Hao Liu,Haiwei Zhang,Wenbo Wang

Main category: cs.CV

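TL;DR: WiFlow is a framework for continuous human pose estimation from WiFi signals. It uses an encoder-decoder structure with spatio-temporally decoupled convolutions and axial attention, reaching high accuracy on a self-collected dataset (PCK@20 of 97.00%) with a lightweight model (4.82M parameters).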

Details Motivation: Existing WiFi-based pose estimation methods perform poorly on continuous motion and are computationally expensive; vision-style methods treat CSI as images and ignore its original sequential structure. Method: In WiFlow, the encoder extracts spatio-temporal CSI features with temporal and asymmetric convolutions and models structural dependencies between keypoints with axial attention; the decoder maps the high-dimensional features to keypoint coordinates. Training uses 360,000 self-collected synchronized CSI-pose samples. Result: On the self-collected dataset, WiFlow reaches a PCK@20 of 97.00% and a PCK@50 of 99.48%, with a mean per-joint position error of 0.008 m and only 4.82M parameters. Conclusion: WiFlow keeps accuracy high while substantially reducing computational complexity, setting a new baseline for practical WiFi-based pose estimation. Abstract: Human pose estimation is fundamental to intelligent perception in the Internet of Things (IoT), enabling applications ranging from smart healthcare to human-computer interaction. While WiFi-based methods have gained traction, they often struggle with continuous motion and high computational overhead. This work presents WiFlow, a novel framework for continuous human pose estimation using WiFi signals. Unlike vision-based approaches such as two-dimensional deep residual networks that treat Channel State Information (CSI) as images, WiFlow employs an encoder-decoder architecture. The encoder captures spatio-temporal features of CSI using temporal and asymmetric convolutions, preserving the original sequential structure of signals. It then refines keypoint features of human bodies to be tracked and captures their structural dependencies via axial attention. The decoder subsequently maps the encoded high-dimensional features into keypoint coordinates. Trained on a self-collected dataset of 360,000 synchronized CSI-pose samples from 5 subjects performing continuous sequences of 8 daily activities, WiFlow achieves a Percentage of Correct Keypoints (PCK) of 97.00% at a threshold of 20% (PCK@20) and 99.48% at PCK@50, with a mean per-joint position error of 0.008m. With only 4.82M parameters, WiFlow significantly reduces model complexity and computational cost, establishing a new performance baseline for practical WiFi-based human pose estimation. Our code and datasets are available at https://github.com/DY2434/WiFlow-WiFi-Pose-Estimation-with-Spatio-Temporal-Decoupling.git.
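
The PCK metric reported above counts a predicted keypoint as correct when it lies within a fraction of a reference length from the ground truth. The sketch below shows the computation; the choice of reference length (`ref_length`, for example a torso or bounding-box size) is an assumption, since the abstract does not say which normalisation WiFlow uses.

```python
import math

def pck(pred, truth, ref_length, threshold=0.2):
    """Fraction of keypoints within threshold * ref_length of the ground truth."""
    limit = threshold * ref_length
    correct = sum(1 for p, t in zip(pred, truth) if math.dist(p, t) <= limit)
    return correct / len(truth)

# Hypothetical 2D keypoints; errors are 1, 3, 0, and 0 units respectively.
truth = [(0, 0), (10, 0), (0, 10), (10, 10)]
pred = [(1, 0), (10, 3), (0, 10), (10, 10)]
pck20 = pck(pred, truth, ref_length=10, threshold=0.2)  # limit of 2 units
pck50 = pck(pred, truth, ref_length=10, threshold=0.5)  # limit of 5 units
```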

[260] A Machine Learning accelerated geophysical fluid solver

Yang Bai

Main category: cs.CV

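TL;DR: This thesis studies machine learning, in particular data-driven discretization, for solving partial differential equations (PDEs), specifically the shallow water and Euler equations. It implements classical solvers that are reported to outperform Pyclaw and proposes four deep neural network architectures for an ML-based solver, two of which produce satisfactory solutions.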

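Details Motivation: Machine learning has succeeded in areas such as image classification and NLP, but its application to problems with mathematical constraints such as PDEs is still immature; there is a need to improve the accuracy and stability of low-resolution PDE solves while retaining the strengths of traditional numerical methods, such as conservation laws. Method: A data-driven discretization approach predicts the coefficients of quasi-linear stencils used to compute function values or derivatives; shallow water and Euler equation solvers are implemented within a classical numerical framework (finite volume/finite difference); four deep neural networks are designed and compared for ML-based PDE solving. Result: The implemented classical solver performs much better than Pyclaw; two of the four neural networks output satisfactory solutions, supporting data-driven discretization as a way to improve the accuracy and stability of low-resolution simulation. Conclusion: Data-driven discretization is an effective way to combine machine learning with traditional numerical PDE solvers: it can inherit physical properties such as conservation while improving efficiency and robustness on structured grids, and the choice of network architecture is critical to the performance of an ML-based PDE solver. Abstract: Machine learning methods have been successful in many areas, like image classification and natural language processing. However, it still needs to be determined how to apply ML to areas with mathematical constraints, like solving PDEs. Among various approaches to applying ML techniques to solving PDEs, the data-driven discretization method presents a promising way of accelerating and improving existing PDE solvers on structured grids, where it predicts the coefficients of quasi-linear stencils for computing values or derivatives of a function at given positions. It can improve the accuracy and stability of low-resolution simulation compared with using traditional finite difference or finite volume schemes. Meanwhile, it can also benefit from traditional numerical schemes, for example by achieving conservation through finite-volume-type formulations. In this thesis, we have implemented the shallow water equation and Euler equation classic solver under a different framework. Experiments show that our classic solver performs much better than the Pyclaw solver. Then we propose four different deep neural networks for the ML-based solver. The results indicate that two of these approaches could output satisfactory solutions.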
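
To make the idea of stencil coefficients concrete, here is a minimal sketch of applying a three-point stencil to estimate a derivative on a periodic grid. It uses the classical second-order central-difference coefficients as a stand-in; in data-driven discretization a network would predict such coefficients per grid point, and that part is not shown. The function name and grid setup are illustrative assumptions, not the thesis code.

```python
import math

def apply_stencil(values, coeffs, offsets, i, dx):
    """Estimate a first derivative at index i as sum_k coeffs[k] * values[i + offsets[k]] / dx."""
    n = len(values)
    # Periodic indexing keeps the stencil valid at the boundaries.
    return sum(c * values[(i + o) % n] for c, o in zip(coeffs, offsets)) / dx

n = 64
dx = 2 * math.pi / n
u = [math.sin(j * dx) for j in range(n)]  # sample function u(x) = sin(x)

central = [-0.5, 0.0, 0.5]  # fixed classical coefficients (a learned model would replace these)
offsets = [-1, 0, 1]

approx = apply_stencil(u, central, offsets, i=5, dx=dx)
exact = math.cos(5 * dx)
error = abs(approx - exact)
```

The coefficients of a first-derivative stencil sum to zero, so a constant field has zero derivative; constraints of this kind are one way such schemes can retain properties of the classical method.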

[261] ALIVE: Animate Your World with Lifelike Audio-Video Generation

Ying Guo,Qijun Gan,Yifu Zhang,Jinlai Liu,Yifei Hu,Pan Xie,Dongjun Qian,Yu Zhang,Ruiqi Li,Yuqi Zhang,Ruibiao Lu,Xiaofeng Mei,Bo Han,Xiang Yin,Bingyue Peng,Zehuan Yuan

Main category: cs.CV

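TL;DR: ALIVE is a unified audio-video generation model built on a pretrained text-to-video (T2V) model. It supports text-driven and reference-driven audio-video generation and animation, achieves audio-visual synchronization by extending the MMDiT architecture with TA-CrossAttn and UniTemp-RoPE, and is accompanied by a high-quality data pipeline and a new benchmark; it is reported to outperform open-source models and to match or surpass commercial solutions.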

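Details Motivation: To advance video generation toward unified audio-video generation and to address the lack of audio modeling and audio-visual synchronization in existing T2V models, enabling more natural and controllable audio-video content generation. Method: Starting from a pretrained T2V model, the MMDiT architecture is extended with a joint audio-video branch: TA-CrossAttn provides temporally aligned cross-modal fusion and UniTemp-RoPE provides precise audio-visual alignment; a data pipeline covering audio-video captioning and quality control is designed; a new audio-video generation benchmark is built; the model undergoes continued pretraining and finetuning on million-scale high-quality data. Result: ALIVE consistently outperforms existing open-source models on audio-video generation tasks and matches or surpasses state-of-the-art commercial solutions (e.g., Sora-style systems) on key aspects such as audio-visual synchronization and reference animation; code and the benchmark are released. Conclusion: ALIVE upgrades a T2V model into a general audio-video generation foundation model, supports the effectiveness of joint audio-video modeling and alignment mechanisms, and gives the community a reproducible, extensible recipe and evaluation standard for audio-video generation. Abstract: Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks the Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities compared to the T2V foundation models. To support the audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch which includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline consisting of audio-video captioning, quality control, etc., is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark to perform a comprehensive model test and comparison. After continued pretraining and finetuning on million-level high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.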

[262] OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Feilong Tang,Xiang An,Yunyao Yan,Yin Xie,Bin Qin,Kaicheng Yang,Yifei Shen,Yuanhan Zhang,Chunyuan Li,Shikun Feng,Changrui Chen,Huajie Tan,Ming Hu,Manyuan Zhang,Bo Li,Ziyong Feng,Ziwei Liu,Zongyuan Ge,Jiankang Deng

Main category: cs.CV

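TL;DR: This paper proposes OneVision-Encoder (OV-Encoder), a vision encoder inspired by video codecs that achieves efficient video understanding by focusing on high-entropy regions (only 3.1%-25% of the input). It outperforms strong baselines such as Qwen3-ViT across image, video, and document tasks, supporting the core hypotheses that 'compression is intelligence' and that efficiency and accuracy are positively correlated.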

Details Motivation: Modern vision models compute uniformly over dense pixel grids, wasting compute on redundant background and neglecting the sparse but semantically rich predictive residuals in video (the high-entropy regions); the authors argue that architectures should align with the information-theoretic nature of video, i.e., codec principles. Method: OV-Encoder adopts a Codec Patchification strategy that processes only sparse regions with high signal entropy; it introduces a shared 3D RoPE to unify modeling under irregular spatio-temporal token layouts; and it is trained with a very large-scale cluster discrimination objective (over a million concepts), jointly learning object permanence and motion dynamics. Result: OV-Encoder consistently outperforms Qwen3-ViT and SigLIP2 on 16 image, video, and document understanding benchmarks, with an average improvement of 4.1% over Qwen3-ViT on video understanding tasks, while using substantially fewer visual tokens and less pretraining data. Conclusion: Codec-aligned, patch-level sparsity is a foundational principle for building scalable general-purpose vision models; efficiency and accuracy are not a trade-off but are positively correlated. Abstract: Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. These are the fundamental principles. Yet, modern vision architectures have strayed from these truths: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., Codecs. Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OneVision-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics. Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated.
When integrated into an LLM, it consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder as a scalable engine for next-generation visual generalists.
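
The idea of keeping only a small fraction of informative patches can be sketched as follows. This is a simplified stand-in, not the paper's Codec Patchification: it assumes that frame-to-frame residual energy is an acceptable proxy for signal entropy, and the function name, patch size, and keep ratio are hypothetical.

```python
def select_patches(prev_frame, cur_frame, patch, keep_ratio):
    """Return (row, col) patch indices with the largest frame-to-frame residual energy."""
    h, w = len(cur_frame), len(cur_frame[0])
    scores = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            # Sum of squared differences inside this patch (assumed entropy proxy).
            energy = sum(
                (cur_frame[y][x] - prev_frame[y][x]) ** 2
                for y in range(r, min(r + patch, h))
                for x in range(c, min(c + patch, w))
            )
            scores.append((energy, (r // patch, c // patch)))
    k = max(1, round(keep_ratio * len(scores)))
    scores.sort(key=lambda s: s[0], reverse=True)
    return sorted(idx for _, idx in scores[:k])

prev = [[0] * 8 for _ in range(8)]
cur = [row[:] for row in prev]
cur[5][6] = 9  # a single changed pixel in the bottom-right patch

kept = select_patches(prev, cur, patch=4, keep_ratio=0.25)
```

With four patches and a keep ratio of 0.25, only the patch containing the changed pixel survives, so the static background is never tokenized.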

[263] Low-Light Video Enhancement with An Effective Spatial-Temporal Decomposition Paradigm

Xiaogang Xu,Kun Zhou,Tao Hu,Jiafei Wu,Ruixing Wang,Hao Peng,Bei Yu

Main category: cs.CV

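TL;DR: This paper proposes VLLVE, a view-aware decomposition framework for low-light video enhancement (LLVE), and its extension VLLVE++. By separating view-independent (appearance) and view-dependent (shading) components and introducing cross-frame consistency constraints and a dual-structure enhancement network, the approach improves enhancement quality and robustness, particularly on real-world scenes and highly dynamic videos.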

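Details Motivation: Existing LLVE methods struggle to maintain content consistency and recover detail under severe noise, low visibility, and dynamic scenes; conventional single-frame or simple temporal modeling cannot effectively disentangle illumination changes from intrinsic appearance, leading to inconsistent or distorted results. Method: A view-aware decomposition strategy splits the video into a view-independent term (intrinsic appearance, modeled with dynamic cross-frame correspondences) and a view-dependent term (shading, stabilized by a scene-level continuity constraint); a dual-structure enhancement network provides cross-frame interaction and joint supervision; the extension VLLVE++ adds an additive residual term to simulate scene-adaptive degradations and supports end-to-end bidirectional learning for enhancement and degradation-aware correspondence refinement. Result: State-of-the-art performance is reported on mainstream LLVE benchmarks; enhancement quality improves notably on real-world and highly dynamic videos; the decomposition results are consistent and interpretable; the added parameter cost is small and the mechanism is compatible with existing networks. Conclusion: Through finer-grained video decomposition and bidirectional joint optimization, VLLVE and VLLVE++ offer a more robust, consistent, and generalizable paradigm for low-light video enhancement. Abstract: Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. In this paper, we present an innovative video decomposition strategy that incorporates view-independent and view-dependent components to enhance the performance of LLVE. The framework is called View-aware Low-light Video Enhancement (VLLVE). We leverage dynamic cross-frame correspondences for the view-independent term (which primarily captures intrinsic appearance) and impose a scene-level continuity constraint on the view-dependent term (which mainly describes the shading condition) to achieve consistent and satisfactory decomposition results. To further ensure consistent decomposition, we introduce a dual-structure enhancement network featuring a cross-frame interaction mechanism. By supervising different frames simultaneously, this network encourages them to exhibit matching decomposition features. This mechanism can seamlessly integrate with encoder-decoder single-frame networks, incurring minimal additional parameter costs. Building upon VLLVE, we propose a more comprehensive decomposition strategy by introducing an additive residual term, resulting in VLLVE++. This residual term can simulate scene-adaptive degradations, which are difficult to model using a decomposition formulation for common scenes, thereby further enhancing the ability to capture the overall content of videos.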
In addition, VLLVE++ enables bidirectional learning for both enhancement and degradation-aware correspondence refinement (in an end-to-end manner), effectively increasing reliable correspondences while filtering out incorrect ones. Notably, VLLVE++ demonstrates strong capability in handling challenging cases, such as real-world scenes and videos with high dynamics. Extensive experiments are conducted on widely recognized LLVE benchmarks.

[264] TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Linli Yao,Yuancheng Wei,Yaojie Zhang,Lei Li,Xinlong Chen,Feifan Song,Ziyue Wang,Kun Ouyang,Yuanxin Liu,Lingpeng Kong,Qi Liu,Pengfei Wan,Kun Gai,Yuanxing Zhang,Xu Sun

Main category: cs.CV

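TL;DR: This paper proposes the Omni Dense Captioning task, which aims to generate continuous, fine-grained, structured audio-visual narratives. It introduces a six-dimensional structural schema, a new benchmark (OmniDCBench), a unified evaluation metric (SodaM), and a baseline model (TimeChat-Captioner-7B), and reports significant gains on downstream audio-visual reasoning and temporal grounding.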

Details Motivation: Existing audio-visual captioning methods struggle to combine temporal continuity, semantic density, and structured expression, and there is no systematic framework or evaluation standard for fine-grained, script-like narration. Method: A six-dimensional structural schema is proposed for building 'script-like' dense captions; a high-quality human-annotated benchmark, OmniDCBench, is constructed; a time-aware evaluation metric, SodaM, is designed; a training dataset, TimeChatCap-42K, is built; and a baseline model, TimeChat-Captioner-7B, is trained with SFT and GRPO. Result: TimeChat-Captioner-7B surpasses Gemini-2.5-Pro on OmniDCBench; its dense captions significantly improve downstream performance on DailyOmni and WorldSense (audio-visual reasoning) and Charades-STA (temporal grounding). Conclusion: Omni Dense Captioning provides a richer, structured, time-aligned semantic representation for audio-visual understanding and moves dense captioning toward being practical, measurable, and reusable. Abstract: This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code will be made publicly available at https://github.com/yaolinli/TimeChat-Captioner.

[265] Towards Understanding Multimodal Fine-Tuning: Spatial Features

Lachin Naghashyar,Hunar Batra,Ashkan Khakzar,Philip Torr,Ronald Clark,Christian Schroeder de Witt,Constantin Venhoff

Main category: cs.CV

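TL;DR: This paper presents the first mechanistic analysis of how representations in vision-language models (VLMs) adapt during multimodal fine-tuning. Using stage-wise model diffing, it shows how a language model learns to "see": it identifies vision-preferring features that emerge or reorient, finds that a subset of them reliably encodes spatial relations, and traces their activation to a small group of attention heads.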

Details Motivation: Although vision-language models perform well, it is unclear how the language backbone's representations adapt during multimodal training and when vision-specific capabilities emerge. Method: Stage-wise model diffing is used to compare representational changes stage by stage; controlled experiments with spatial prompts and causal attribution analysis are combined to locate vision-preferring features and the key attention heads. Result: Vision-preferring features emerge or reorient during fine-tuning; a subset of them stably encodes spatial relations; the activation of these features can be attributed to a small number of specific attention heads. Conclusion: Stage-wise model diffing reveals when and where spatially grounded multimodal features arise, clarifies how visual grounding reshapes previously text-only features, improves the interpretability of multimodal training, and provides a foundation for understanding how pretrained language models acquire visual capabilities. Abstract: Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.

[266] Zero-shot System for Automatic Body Region Detection for Volumetric CT and MR Images

Farnaz Khun Jush,Grit Werner,Mark Klemens,Matthias Lenga

Main category: cs.CV

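TL;DR: This paper proposes three training-free, zero-shot methods for automatically identifying anatomical body regions in CT and MR images; the segmentation-based rule system performs best.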

Details Motivation: Existing body region identification methods depend heavily on unreliable DICOM metadata and are mostly supervised, which limits their generalization. Method: Three training-free, zero-shot pipelines are proposed: (1) a segmentation-driven rule-based system built on pre-trained multi-organ segmentation models; (2) a multimodal large language model (MLLM) guided by radiologist-defined rules; (3) a segmentation-aware MLLM that combines visual input with explicit anatomical evidence. Result: Evaluated on 887 CT/MR scans, the segmentation-driven rule-based system is the strongest and most consistent (weighted F1 of 0.947 for CT and 0.914 for MR); the MLLM performs well in visually distinctive regions, while the segmentation-aware MLLM reveals fundamental limitations. Conclusion: Zero-shot, training-free methods are feasible and effective; the segmentation-driven rule-based system in particular is robust across modalities and offers a new route to clinical deployment. Abstract: Reliable identification of anatomical body regions is a prerequisite for many automated medical imaging workflows, yet existing solutions remain heavily dependent on unreliable DICOM metadata. Current solutions mainly use supervised learning, which limits their applicability in many real-world scenarios. In this work, we investigate whether body region detection in volumetric CT and MR images can be achieved in a fully zero-shot manner by using knowledge embedded in large pre-trained foundation models. We propose and systematically evaluate three training-free pipelines: (1) a segmentation-driven rule-based system leveraging pre-trained multi-organ segmentation models, (2) a Multimodal Large Language Model (MLLM) guided by radiologist-defined rules, and (3) a segmentation-aware MLLM that combines visual input with explicit anatomical evidence. All methods are evaluated on 887 heterogeneous CT and MR scans with manually verified anatomical region labels. The segmentation-driven rule-based approach achieves the strongest and most consistent performance, with weighted F1-scores of 0.947 (CT) and 0.914 (MR), demonstrating robustness across modalities and atypical scan coverage. The MLLM performs competitively in visually distinctive regions, while the segmentation-aware MLLM reveals fundamental limitations.
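
For readers unfamiliar with the weighted F1-score reported above, the following minimal sketch computes it as the support-weighted mean of per-class F1. It assumes one label per scan and uses made-up region names; the paper's actual label set and whether scans can carry several region labels are not stated in the abstract.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (single label per sample)."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn  # number of true samples of this class
        score += f1 * support / total
    return score

# Hypothetical labels, for illustration only.
y_true = ["head", "head", "chest", "chest", "abdomen", "abdomen"]
y_pred = ["head", "head", "chest", "abdomen", "abdomen", "abdomen"]

score = weighted_f1(y_true, y_pred)
```

Weighting by support means frequent regions dominate the score, which is worth keeping in mind when a dataset has uneven coverage of body regions.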

[267] Rotated Lights for Consistent and Efficient 2D Gaussians Inverse Rendering

Geng Lin,Matthias Zwicker

Main category: cs.CV

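TL;DR: This paper proposes RotLight, which improves albedo estimation and global illumination handling in inverse rendering based on 2D Gaussian splatting (2DGS) by using a capture setup in which the object is simply rotated and by introducing a proxy mesh.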

Details Motivation: Existing inverse rendering methods are highly ambiguous when estimating materials, especially albedo, which leads to inaccurate color and shadows being 'baked' into the albedo; regularization alone does not resolve this. Method: The RotLight capture scheme, which only requires rotating the object a few times, is proposed to reduce the ambiguity; a proxy mesh is introduced into the 2DGS framework for accurate incident light tracing, a residual constraint, and improved global illumination modeling. Result: On synthetic and real-world datasets, the method notably improves albedo estimation quality while remaining computationally efficient. Conclusion: By combining a lightweight rotated capture with a proxy mesh, RotLight reduces the material-lighting ambiguity in inverse rendering and offers a new approach to high-quality, efficient inverse rendering. Abstract: Inverse rendering aims to decompose a scene into its geometry, material properties and light conditions under a certain rendering model. It has wide applications like view synthesis, relighting, and scene editing. In recent years, inverse rendering methods have been inspired by view synthesis approaches like neural radiance fields and Gaussian splatting, which are capable of efficiently decomposing a scene into its geometry and radiance. They then further estimate the material and lighting that lead to the observed scene radiance. However, the latter step is highly ambiguous and prior works suffer from inaccurate color and baked shadows in their albedo estimation despite their regularization. To this end, we propose RotLight, a simple capturing setup, to address the ambiguity. Compared to a usual capture, RotLight only requires the object to be rotated several times during the process. We show that as few as two rotations are effective in reducing artifacts. To further improve 2DGS-based inverse rendering, we additionally introduce a proxy mesh that not only allows accurate incident light tracing, but also enables a residual constraint and improves global illumination handling. We demonstrate with both synthetic and real world datasets that our method achieves superior albedo estimation while keeping efficient computation.

[268] FusionEdit: Semantic Fusion and Attention Modulation for Training-Free Image Editing

Yongwen Lai,Chaoqun Wang,Shaobo Min

Main category: cs.CV

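TL;DR: FusionEdit is a training-free framework for text-guided image editing. It automatically identifies editing and preserved regions from semantic discrepancies, generates soft masks with distance-aware latent fusion and a total variation loss, and performs statistical attention fusion by modulating DiT attention layers with AdaIN, achieving precise, controllable, and natural edits while preserving the identity of the source image.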

Details Motivation: Existing methods rely on explicit binary masks to constrain editing, but hard boundaries tend to introduce artifacts and reduce editability. Method: (1) Editing and preserved regions are identified automatically from the semantic discrepancy between the source and target prompts; (2) distance-aware latent fusion and a total variation loss produce a soft mask that reduces boundary artifacts; (3) AdaIN is used within DiT attention layers for statistical attention fusion, enhancing editability while maintaining global consistency. Result: The method significantly outperforms state-of-the-art methods across experiments. Conclusion: FusionEdit provides an efficient, training-free solution that improves precision, controllability, and naturalness. Abstract: Text-guided image editing aims to modify specific regions according to the target prompt while preserving the identity of the source image. Recent methods exploit explicit binary masks to constrain editing, but hard mask boundaries introduce artifacts and reduce editability. To address these issues, we propose FusionEdit, a training-free image editing framework that achieves precise and controllable edits. First, editing and preserved regions are automatically identified by measuring semantic discrepancies between source and target prompts. To mitigate boundary artifacts, FusionEdit performs distance-aware latent fusion along region boundaries to yield a soft and accurate mask, and employs a total variation loss to enforce smooth transitions, obtaining natural editing results. Second, FusionEdit leverages AdaIN-based modulation within DiT attention layers to perform a statistical attention fusion in the editing region, enhancing editability while preserving global consistency with the source image. Extensive experiments demonstrate that our FusionEdit significantly outperforms state-of-the-art methods. Code is available at https://github.com/Yvan1001/FusionEdit.
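
The AdaIN operation mentioned above can be shown in isolation. The sketch below applies standard adaptive instance normalization to a single channel of values: the content features are normalized and then rescaled to the style statistics. How FusionEdit wires this into DiT attention layers is not described in the abstract, so the sketch covers only the generic operation, and the sample values are arbitrary.

```python
import statistics

def adain(content, style, eps=1e-5):
    """Standard AdaIN on one channel: match the mean and std of content to those of style."""
    mu_c, mu_s = statistics.fmean(content), statistics.fmean(style)
    sd_c, sd_s = statistics.pstdev(content), statistics.pstdev(style)
    # eps guards against division by zero for constant content features.
    return [sd_s * (x - mu_c) / (sd_c + eps) + mu_s for x in content]

content = [1.0, 2.0, 3.0, 4.0]      # arbitrary content features
style = [10.0, 20.0, 30.0, 40.0]    # arbitrary style features

out = adain(content, style)
```

The output keeps the relative ordering of the content values but carries the mean and spread of the style values, which is the sense in which the fusion is statistical.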

[269] SynSacc: A Blender-to-V2E Pipeline for Synthetic Neuromorphic Eye-Movement Data and Sim-to-Real Spiking Model Training

Khadija Iddrisu,Waseem Shariff,Suzanne Little,Noel OConnor

Main category: cs.CV

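TL;DR: This paper presents SynSacc, a synthetic event dataset generated with Blender for classifying eye movements (saccades and fixations), and models it with spiking neural networks (SNNs). After finetuning on real event data, the models reach up to 0.83 accuracy and combine robustness with computational efficiency.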

Details Motivation: Conventional frame-based cameras suffer from motion blur and low temporal resolution, making it hard to capture rapid eye movements accurately; event cameras are better suited, but labeled real event data are scarce, so high-quality synthetic data are needed to support eye-movement analysis. Method: Blender is used to build a controlled synthetic eye-movement event dataset (SynSacc); two spiking neural network (SNN) architectures are trained on it and finetuned on real event data, and they are compared with artificial neural networks (ANNs) in performance and computational efficiency. Result: The models reach up to 0.83 accuracy on eye-movement classification and are stable across temporal resolutions; SNNs on synthetic event streams are substantially more computationally efficient than their ANN counterparts. Conclusion: Synthetic event data can compensate for the shortage of labeled real data; SNNs match the characteristics of event cameras and combine accuracy, robustness, and energy efficiency in eye-movement recognition, supporting the use of event-based vision in cognitive science. Abstract: The study of eye movements, particularly saccades and fixations, is fundamental to understanding the mechanisms of human cognition and perception. Accurate classification of these movements requires sensing technologies capable of capturing rapid dynamics without distortion. Event cameras, also known as Dynamic Vision Sensors (DVS), provide asynchronous recordings of changes in light intensity, thereby eliminating motion blur inherent in conventional frame-based cameras and offering superior temporal resolution and data efficiency. In this study, we introduce a synthetic dataset generated with Blender to simulate saccades and fixations under controlled conditions. Leveraging Spiking Neural Networks (SNNs), we evaluate its robustness by training two architectures and finetuning on real event data. The proposed models achieve up to 0.83 accuracy and maintain consistent performance across varying temporal resolutions, demonstrating stability in eye movement classification. Moreover, the use of SNNs with synthetic event streams yields substantial computational efficiency gains over artificial neural network (ANN) counterparts, underscoring the utility of synthetic data augmentation in advancing event-based vision. All code and datasets associated with this work are available at https://github.com/Ikhadija-5/SynSacc-Dataset.
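
The event-camera principle described above, recording changes in light intensity rather than frames, can be sketched for a single pixel. This is an idealized model that emits an event whenever log intensity moves by a fixed contrast threshold; it is not the pipeline used in the paper, and real simulators typically model additional sensor effects. The threshold and sample values are illustrative assumptions.

```python
import math

def generate_events(intensities, threshold=0.2):
    """Emit (sample_index, polarity) events when log intensity changes by >= threshold."""
    events = []
    ref = math.log(intensities[0])  # reference level at the last emitted event
    for t, value in enumerate(intensities[1:], start=1):
        delta = math.log(value) - ref
        while abs(delta) >= threshold:
            polarity = 1 if delta > 0 else -1  # +1 brighter (ON), -1 darker (OFF)
            events.append((t, polarity))
            ref += polarity * threshold
            delta = math.log(value) - ref
    return events

# One pixel that brightens and later returns to its original intensity.
events = generate_events([1.0, 1.0, 2.0, 2.0, 1.0], threshold=0.5)
```

Samples where the intensity does not change produce no events at all, which is the source of the data efficiency the abstract refers to.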

[270] Artifact Reduction in Undersampled 3D Cone-Beam CTs using a Hybrid 2D-3D CNN Framework

Johannes Thalhammer,Tina Dorosti,Sebastian Peterhansl,Daniela Pfeiffer,Franz Pfeiffer,Florian Schaff

Main category: cs.CV

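TL;DR: This paper proposes a hybrid framework that combines 2D and 3D deep learning models to reduce artifacts in undersampled CT images, balancing computational efficiency with inter-slice consistency.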

Details Motivation: Undersampled CT shortens scan time and lowers radiation dose but introduces artifacts that degrade image quality and diagnostic value, so an efficient artifact-reduction method is needed. Method: A two-stage hybrid network is used: a 2D U-Net first extracts feature maps slice by slice; these feature maps are then stacked across the volume and fed to a 3D decoder, which uses inter-slice context to reconstruct an artifact-free 3D volume. Result: Inter-slice consistency in the coronal and sagittal planes improves substantially while computational overhead stays low. Conclusion: The hybrid 2D-3D framework is a robust and efficient solution for 3D CT image post-processing. Abstract: Undersampled CT volumes minimize acquisition time and radiation exposure but introduce artifacts degrading image quality and diagnostic utility. Reducing these artifacts is critical for high-quality imaging. We propose a computationally efficient hybrid deep-learning framework that combines the strengths of 2D and 3D models. First, a 2D U-Net operates on individual slices of undersampled CT volumes to extract feature maps. These slice-wise feature maps are then stacked across the volume and used as input to a 3D decoder, which utilizes contextual information across slices to predict an artifact-free 3D CT volume. The proposed two-stage approach balances the computational efficiency of 2D processing with the volumetric consistency provided by 3D modeling. The results show substantial improvements in inter-slice consistency in the coronal and sagittal directions with low computational overhead. This hybrid framework presents a robust and efficient solution for high-quality 3D CT image post-processing. The code of this project can be found on GitHub: https://github.com/J-3TO/2D-3DCNN_sparseview/.

[271] Closing the Confusion Loop: CLIP-Guided Alignment for Source-Free Domain Adaptation

Shanshan Wang,Ziying Feng,Xiaozheng Shen,Xun Yang,Pichao Wang,Zhenwei He,Xingyi Zhang

Main category: cs.CV

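TL;DR: This paper proposes the CLIP-Guided Alignment (CGA) framework, which improves source-free domain adaptation (SFDA) by modeling and mitigating the asymmetric, dynamic class confusion that the source model exhibits when source data are unavailable; the gains are especially notable in fine-grained and confusion-prone scenarios.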

Details Motivation: Existing SFDA methods overlook the asymmetric and dynamic inter-class confusion of the source model on the target domain, which leads to noisy pseudo-labels and poor target discrimination, especially in fine-grained recognition. Method: The CGA framework has three parts: (1) the MCA module detects directional confusion pairs; (2) the MCC module uses CLIP to build confusion-aware textual prompts that yield more robust pseudo-labels; (3) the FAM module builds confusion-guided feature banks and aligns the representations of CLIP and the source model through contrastive learning. Result: CGA clearly outperforms existing SFDA methods on multiple datasets, with especially large gains in confusion-prone and fine-grained scenarios. Conclusion: Explicitly modeling inter-class confusion is important for effective source-free domain adaptation, and CGA offers a new approach to semantic ambiguity in SFDA. Abstract: Source-Free Domain Adaptation (SFDA) tackles the problem of adapting a pre-trained source model to an unlabeled target domain without accessing any source data, which is quite suitable for the field of data security. Although recent advances have shown that pseudo-labeling strategies can be effective, they often fail in fine-grained scenarios due to subtle inter-class similarities. A critical but underexplored issue is the presence of asymmetric and dynamic class confusion, where visually similar classes are unequally and inconsistently misclassified by the source model. Existing methods typically ignore such confusion patterns, leading to noisy pseudo-labels and poor target discrimination. To address this, we propose CLIP-Guided Alignment (CGA), a novel framework that explicitly models and mitigates class confusion in SFDA. Generally, our method consists of three parts: (1) MCA: first detects directional confusion pairs by analyzing the predictions of the source model in the target domain; (2) MCC: leverages CLIP to construct confusion-aware textual prompts (e.g. a truck that looks like a bus), enabling more context-sensitive pseudo-labeling; and (3) FAM: builds confusion-guided feature banks for both CLIP and the source model and aligns them using contrastive learning to reduce ambiguity in the representation space. Extensive experiments on various datasets demonstrate that CGA consistently outperforms state-of-the-art SFDA methods, with especially notable gains in confusion-prone and fine-grained scenarios. Our results highlight the importance of explicitly modeling inter-class confusion for effective source-free adaptation.
Our code can be found at https://github.com/soloiro/CGA
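
One way to picture directional confusion pairs is to count, over unlabeled target samples, how often class A is the top prediction while class B is the runner-up. The sketch below does exactly that; it is an assumed heuristic for illustration only, and the abstract does not say how the MCA module actually detects such pairs. The function name, the `min_count` cutoff, and the probability values are hypothetical.

```python
from collections import Counter

def directional_confusions(prob_rows, class_names, min_count=2):
    """Count ordered (top-1, top-2) class pairs and keep those seen at least min_count times."""
    counts = Counter()
    for probs in prob_rows:
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        counts[(class_names[ranked[0]], class_names[ranked[1]])] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

classes = ["truck", "bus", "car"]
probs = [  # made-up softmax outputs of a source model on target samples
    [0.55, 0.40, 0.05],
    [0.50, 0.45, 0.05],
    [0.48, 0.47, 0.05],
    [0.15, 0.80, 0.05],
]

pairs = directional_confusions(probs, classes)
```

Because the pairs are ordered, ("truck", "bus") and ("bus", "truck") are counted separately, which is what lets an asymmetric confusion pattern show up.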

[272] From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models

Masanari Oi,Koki Maeda,Ryuto Koike,Daisuke Oba,Nakamasa Inoue,Naoaki Okazaki

Main category: cs.CV

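TL;DR: This paper proposes the HATCH framework, which explicitly models cross-view correspondence and viewpoint transformation through two objectives, patch-level spatial alignment and action-then-answer reasoning, significantly improving multi-image spatial reasoning while preserving single-image reasoning performance.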

Details Motivation: Existing MLLMs do not explicitly and fully model the cross-view correspondence and stepwise viewpoint transformation mechanisms that human cognition uses for multi-image spatial reasoning. Method: The HATCH training framework has two complementary objectives: (1) patch-level spatial alignment, which aligns the features of spatially corresponding regions across views; (2) action-then-answer reasoning, which requires the model to generate explicit viewpoint transition actions before answering. Result: On three benchmarks, HATCH consistently and clearly outperforms baselines of comparable size and is competitive with much larger models, without harming single-image reasoning. Conclusion: Explicitly modeling human-inspired cross-view correspondence and viewpoint transformation is an effective way to improve multi-image spatial reasoning. Abstract: While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.

[273] Shifting the Breaking Point of Flow Matching for Multi-Instance Editing

Carmine Zaccagnino,Fabio Quattrini,Enis Simsar,Marta Tintoré Gazulla,Rita Cucchiara,Alessio Tonioni,Silvia Cascianelli

Main category: cs.CV

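TL;DR: This paper proposes Instance-Disentangled Attention, a mechanism that addresses semantic entanglement in existing flow matching models during multi-instance image editing and enables disentangled, instance-level editing in a single forward pass.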

Details Motivation: Existing flow-based image editing methods struggle with multi-instance scenarios, in which several regions of a reference image must be edited independently under text guidance; semantic interference arises because globally conditioned velocity fields and joint attention couple the edits. Method: Instance-Disentangled Attention partitions joint attention operations by instance, enforcing a binding between each textual instruction and its spatial region during velocity field estimation. Result: The approach is validated on natural image editing and on a newly built benchmark of text-dense infographics with region-level editing instructions; it improves edit disentanglement and locality while preserving global output coherence. Conclusion: The method enables single-pass, instance-level, disentangled text-guided image editing and offers a new approach to fine-grained controllable generation in complex scenes. Abstract: Flow matching models have recently emerged as an efficient alternative to diffusion, especially for text-guided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.
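
Partitioning joint attention by instance can be pictured as building a boolean attention mask. The sketch below is one plausible masking rule, assumed for illustration: text tokens and image tokens may interact only when they belong to the same instance, while image tokens still attend to each other to keep global coherence. The paper's actual partitioning scheme is not specified in the abstract, and all names here are hypothetical.

```python
def build_mask(text_instance_ids, image_region_ids):
    """Return mask[q][k]: True if query token q may attend to key token k."""
    # Token order: all text tokens first, then all image tokens.
    kinds = [("text", i) for i in text_instance_ids] + [("image", r) for r in image_region_ids]
    mask = []
    for q_kind, q_id in kinds:
        row = []
        for k_kind, k_id in kinds:
            if q_kind == "image" and k_kind == "image":
                row.append(True)          # image tokens keep global context (assumption)
            else:
                row.append(q_id == k_id)  # any pair involving text must share an instance
        mask.append(row)
    return mask

# Two instructions: instance 0 has two text tokens, instance 1 has one.
# Three image tokens: one in region 0, two in region 1.
mask = build_mask(text_instance_ids=[0, 0, 1], image_region_ids=[0, 1, 1])
```

Under this rule, the instruction for instance 0 can never write into region 1, which is the kind of binding between instructions and regions that the abstract describes.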

[274] MVAnimate: Enhancing Character Animation with Multi-View Optimization

Tianyu Sun,Zhoujie Fu,Bang Zhang,Guosheng Lin

Main category: cs.CV

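TL;DR: This paper proposes MVAnimate, a framework that fuses multi-view prior information and models 2D and 3D human pose jointly to improve the quality, temporal consistency, and spatial coherence of generated animation videos.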

Details Motivation: Existing animation generation methods that model human pose with 2D or 3D structures suffer from low output quality and insufficient training data, and struggle to produce high-quality animation videos. Method: The MVAnimate framework uses multi-view prior information to jointly model the 2D and 3D structure of dynamic figures and optimizes multi-view video generation to ensure spatio-temporal consistency. Result: Experiments on multiple datasets show that the method is robust across motion patterns and appearances and clearly outperforms existing animation generation methods. Conclusion: MVAnimate addresses the low quality and data scarcity problems in animation generation and offers a new approach to high-quality, multi-view-consistent character animation. Abstract: The demand for realistic and versatile character animation has surged, driven by its wide-ranging applications in various domains. However, the animation generation algorithms modeling human pose with 2D or 3D structures all face various problems, including low-quality output content and training data deficiency, preventing the related algorithms from generating high-quality animation videos. Therefore, we introduce MVAnimate, a novel framework that synthesizes both 2D and 3D information of dynamic figures based on multi-view prior information, to enhance the generated video quality. Our approach leverages multi-view prior information to produce temporally consistent and spatially coherent animation outputs, demonstrating improvements over existing animation methods. Our MVAnimate also optimizes the multi-view videos of the target character, enhancing the video quality from different views. Experimental results on diverse datasets highlight the robustness of our method in handling various motion patterns and appearances.

[275] VedicTHG: Symbolic Vedic Computation for Low-Resource Talking-Head Generation in Educational Avatars

Vineet Kumar Rakesh,Ahana Bhattacharjee,Soumya Mazumdar,Tapas Samanta,Hemendra Kumar Pandey,Amitabha Das,Sarbajit Pal

Main category: cs.CV

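TL;DR: This paper proposes Symbolic Vedic Computation, a lightweight, deterministic, CPU-friendly talking-head generation (THG) framework for offline or resource-constrained educational settings. It combines speech-to-phoneme-to-viseme mapping with Vedic-inspired symbolic coarticulation modeling and lightweight 2D rendering to generate talking-head video in real time on commodity CPUs with good synchronization accuracy and identity consistency.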

Details Motivation: Existing talking-head generation methods mostly rely on GPU neural rendering, large training sets, or compute-heavy diffusion models, which makes them hard to deploy in offline or low-resource educational environments; a lightweight, deterministic, CPU-runnable alternative is needed. Method: The Symbolic Vedic Computation framework: (1) speech is converted to a time-aligned phoneme stream; (2) phonemes are mapped to a compact viseme set; (3) symbolic coarticulation modeling based on the Vedic sutra Urdhva Tiryakbhyam generates smooth viseme trajectories; (4) a lightweight 2D renderer performs ROI warping, mouth compositing, and stabilization. Result: Under CPU-only execution, the method achieves good lip-sync accuracy, temporal stability, and identity consistency; compared with CPU-feasible baselines, it substantially reduces computational load and latency, supporting real-time educational avatars on low-end hardware. Conclusion: Symbolic Vedic Computation provides a practical, efficient, deployable approach to talking-head generation for resource-constrained educational settings and indicates that a symbolic, non-deep-learning method can balance generation quality and efficiency. Abstract: Talking-head avatars are increasingly adopted in educational technology to deliver content with social presence and improved engagement. However, many recent talking-head generation (THG) methods rely on GPU-centric neural rendering, large training sets, or high-capacity diffusion models, which limits deployment in offline or resource-constrained learning environments. A deterministic and CPU-oriented THG framework is described, termed Symbolic Vedic Computation, that converts speech to a time-aligned phoneme stream, maps phonemes to a compact viseme inventory, and produces smooth viseme trajectories through symbolic coarticulation inspired by the Vedic sutra Urdhva Tiryakbhyam. A lightweight 2D renderer performs region-of-interest (ROI) warping and mouth compositing with stabilization to support real-time synthesis on commodity CPUs. Experiments report synchronization accuracy, temporal stability, and identity consistency under CPU-only execution, alongside benchmarking against representative CPU-feasible baselines. Results indicate that acceptable lip-sync quality can be achieved while substantially reducing computational load and latency, supporting practical educational avatars on low-end hardware. GitHub: https://vineetkumarrakesh.github.io/vedicthg

[276] Multimodal Learning for Arcing Detection in Pantograph-Catenary Systems

Hao Dong,Eleni Chatzi,Olga Fink

Main category: cs.CV

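TL;DR: This paper proposes a multimodal framework (MultiDeepSAD) that combines high-resolution images with force measurements to detect arcing events at the pantograph-catenary interface more accurately and robustly, using real and synthetic datasets together with tailored pseudo-anomaly generation to improve performance under data scarcity and domain shift.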

Details Motivation: Arcing events are transient, occur in noisy environments, have scarce labeled data, and are easily confused with other transient phenomena, which makes them hard to detect and threatens the reliability of railway power supply. Method: Builds two synchronized vision-force datasets (real data from SBB, and public videos paired with synthetic force data); proposes MultiDeepSAD, a multimodal anomaly detection model that extends DeepSAD with a new loss formulation; introduces tailored pseudo-anomaly generation for images (synthetic arc-like artifacts) and force signals (simulated force irregularities). Result: In extensive experiments and ablation studies, the proposed method significantly outperforms baselines and remains more sensitive and robust under domain shift and when real arcing samples are scarce. Conclusion: Multimodal fusion and targeted data augmentation effectively improve arcing detection and provide a more practical intelligent monitoring solution for real railway systems. Abstract: The pantograph-catenary interface is essential for ensuring uninterrupted and reliable power delivery in electrified rail systems. However, electrical arcing at this interface poses serious risks, including accelerated wear of contact components, degraded system performance, and potential service disruptions. Detecting arcing events at the pantograph-catenary interface is challenging due to their transient nature, noisy operating environment, data scarcity, and the difficulty of distinguishing arcs from other similar transient phenomena. To address these challenges, we propose a novel multimodal framework that combines high-resolution image data with force measurements to more accurately and robustly detect arcing events. First, we construct two arcing detection datasets comprising synchronized visual and force measurements. One dataset is built from data provided by the Swiss Federal Railways (SBB), and the other is derived from publicly available videos of arcing events in different railway systems and synthetic force data that mimic the characteristics observed in the real dataset. Leveraging these datasets, we propose MultiDeepSAD, an extension of the DeepSAD algorithm for multiple modalities with a new loss formulation. Additionally, we introduce tailored pseudo-anomaly generation techniques specific to each data type, such as synthetic arc-like artifacts in images and simulated force irregularities, to augment training data and improve the discriminative ability of the model. Through extensive experiments and ablation studies, we demonstrate that our framework significantly outperforms baseline approaches, exhibiting enhanced sensitivity to real arcing events even under domain shifts and limited availability of real arcing observations.

[277] MOVA: Towards Scalable and Synchronized Video-Audio Generation

SII-OpenMOSS Team: Donghua Yu,Mingshu Chen,Qi Chen,Qi Luo,Qianyi Wu,Qinyuan Cheng,Ruixiao Li,Tianyi Liang,Wenbo Zhang,Wenming Tu,Xiangyu Peng,Yang Gao,Yanru Huo,Ying Zhu,Yinze Luo,Yiyang Zhang,Yuerong Song,Zhe Xu,Zhiyu Zhang,Chenchen Yang,Cheng Chang,Chushu Zhou,Hanfu Chen,Hongnan Ma,Jiaxi Li,Jingqi Tong,Junxi Liu,Ke Chen,Shimin Li,Songlin Wang,Wei Jiang,Zhaoye Fei,Zhiyuan Ning,Chunguo Li,Chenhui Li,Ziwei He,Zengfeng Huang,Xie Chen,Xipeng Qiu

Main category: cs.CV

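TL;DR: This paper presents MOVA, an open-source multimodal model for image-text to video-audio generation. It uses a Mixture-of-Experts (MoE) architecture with 32B total and 18B active parameters, generates high-quality synchronized audio-visual content (such as lip-synced speech, environment-aware sound effects, and content-matched music), and releases its weights and code to support the community.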

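Details Motivation: Existing audio-visual generation methods mostly rely on cascaded pipelines, which raise cost, accumulate errors, and degrade quality; meanwhile, advanced systems such as Sora 2 and Veo 3 are closed-source, which limits progress in the field. An open-source, high-quality, end-to-end joint audio-visual generation model is needed. Method: Proposes MOVA, built on a Mixture-of-Experts (MoE) architecture with 32B total parameters, of which 18B are active during inference; supports the IT2VA (Image-Text to Video-Audio) task; releases model weights and training/inference code, along with tools such as LoRA fine-tuning and prompt enhancement. Result: Achieves high-quality, temporally synchronized audio-visual generation, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music; shows strong generation ability on public benchmarks; code and weights are open-sourced and support efficient inference and customized fine-tuning. Conclusion: MOVA is the first open-source, high-performance, end-to-end audio-visual generation model; through its MoE architecture and systematic engineering optimization, it markedly improves generation quality and practicality and promotes open research and a creator ecosystem. Abstract: Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.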

[278] Addressing data annotation scarcity in Brain Tumor Segmentation on 3D MRI scan Using a Semi-Supervised Teacher-Student Framework

Jiaming Liu,Cheng Ding,Daoqiang Zhang

Main category: cs.CV

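TL;DR: This paper proposes a semi-supervised teacher-student framework that pairs an uncertainty-aware pseudo-labeling teacher with a student trained on a progressive, confidence-based curriculum to improve brain tumor segmentation on MRI, with marked gains in data efficiency and robustness when labeled data is limited.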

Details Motivation: Address the high cost of manual annotation and the large data heterogeneity across scanners and institutions in brain tumor MRI segmentation. Method: Designs an uncertainty-aware pseudo-labeling teacher that produces probabilistic masks and per-pixel uncertainty; trains the student with a progressive curriculum ordered by image-level confidence, combined with a dual-loss objective (learn from high-confidence regions, unlearn low-confidence ones), and adds an agreement-based pseudo-label refinement mechanism. Result: On BraTS 2021, validation DSC reaches 0.393 with only 10% of the labeled data and 0.872 with the full data; the teacher reaches a DSC of 0.922, while the student surpasses the teacher on the NCR/NET (0.797) and Edema (0.980) subregions and recovers the Enhancing class (DSC 0.620) where the teacher failed. Conclusion: A confidence-based curriculum and selective unlearning effectively handle weak supervision and noisy pseudo-labels, improving segmentation robustness and generalization. Abstract: Accurate brain tumor segmentation from MRI is limited by expensive annotations and data heterogeneity across scanners and sites. We propose a semi-supervised teacher-student framework that combines an uncertainty-aware pseudo-labeling teacher with a progressive, confidence-based curriculum for the student. The teacher produces probabilistic masks and per-pixel uncertainty; unlabeled scans are ranked by image-level confidence and introduced in stages, while a dual-loss objective trains the student to learn from high-confidence regions and unlearn low-confidence ones. Agreement-based refinement further improves pseudo-label quality. On BraTS 2021, validation DSC increased from 0.393 (10% data) to 0.872 (100%), with the largest gains in early stages, demonstrating data efficiency. The teacher reached a validation DSC of 0.922, and the student surpassed the teacher on tumor subregions (e.g., NCR/NET 0.797 and Edema 0.980); notably, the student recovered the Enhancing class (DSC 0.620) where the teacher failed. These results show that confidence-driven curricula and selective unlearning provide robust segmentation under limited supervision and noisy pseudo-labels.
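
The DSC values quoted above are Dice similarity coefficients. As a minimal sketch of how such a score is commonly computed for one binary mask (an illustration only, not the authors' evaluation code; the function name `dice_score` and the empty-mask convention are assumptions):

```python
def dice_score(pred, target):
    """Dice similarity coefficient between two binary masks.

    Both masks are flat lists of 0/1 values of equal length.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Number of voxels labeled 1 in both masks.
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 6-voxel example: 2 overlapping voxels, 3 positives in each mask.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means perfect overlap and 0.0 means none, so the reported rise from 0.393 to 0.872 reflects a large gain in overlap with the reference masks.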

[279] Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing

Hao Yang,Zhiyu Tan,Jia Gong,Luozheng Qin,Hesen Chen,Xiaomeng Yang,Yuqing Sun,Yuetan Lin,Mengping Yang,Hao Li

Main category: cs.CV

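TL;DR: Omni-Video 2 is an efficient video generation and editing framework that couples multimodal large language models (MLLMs) with video diffusion models. The MLLM produces target captions that guide the diffusion model, and a lightweight adapter injects multimodal conditions, enabling high-quality, scalable text-to-video generation and complex video editing.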

Details Motivation: Improve the understanding and execution of complex, compositional user instructions in video generation and editing, and close the gap between semantic understanding and generative guidance in existing methods. Method: Uses a pretrained MLLM to interpret user instructions and produce explicit target captions that provide strong semantic guidance; designs a lightweight adapter that injects multimodal conditional tokens into a pretrained text-to-video diffusion model, enabling parameter-efficient fine-tuning and reuse of generative priors. Result: Achieves strong performance on both the FiVE (fine-grained video editing) and VBench (text-to-video generation) benchmarks, with clear gains over existing methods on complex compositional editing tasks (such as object addition and removal, background replacement, and motion editing), while supporting high-quality video generation at the 14B scale. Conclusion: Omni-Video 2 demonstrates the advantage of combining an understanding-oriented MLLM with a generative diffusion model and provides a scalable, efficient solution for unified, controllable, high-quality video generation and editing. Abstract: We present Omni-Video 2, a scalable and computationally efficient model that connects pretrained multimodal large-language models (MLLMs) with video diffusion models for unified video generation and editing. Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions. In this way, the rich contextual representations from the understanding model are directly used to guide the generative process, thereby improving performance on complex and compositional editing. Moreover, a lightweight adapter is developed to inject multimodal conditional tokens into pretrained text-to-video diffusion models, allowing maximum reuse of their powerful generative priors in a parameter-efficient manner. Benefiting from these designs, we scale up Omni-Video 2 to a 14B video diffusion model on meticulously curated high-quality training data, supporting high-quality text-to-video generation and various video editing tasks such as object removal, addition, background change, complex motion editing, etc. We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation. The results demonstrate its superior ability to follow complex compositional instructions in video editing, while also achieving competitive or superior quality in video generation tasks.

[280] Any-to-All MRI Synthesis: A Unified Foundation Model for Nasopharyngeal Carcinoma and Its Downstream Applications

Yao Pu,Yiming Shi,Zhenxi Zhang,Peixin Yu,Yitao Zhuang,Xiang Wang,Hongzhao Chen,Jing Cai,Ge Ren

Main category: cs.CV

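TL;DR: This paper proposes a unified foundation model based on contrastive visual representation learning and vision-language alignment (VLA) for any-to-all synthesis of the MRI modalities needed in nasopharyngeal carcinoma (NPC) radiotherapy, markedly improving synthesis quality, robustness, and downstream radiotherapy task performance.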

Details Motivation: In clinical practice, some of the MRI modalities needed for NPC radiotherapy are often missing because of patient discomfort, long scan times, and high cost, which reduces the accuracy of radiotherapy planning; traditional MRI synthesis methods are modality-specific, adapt poorly to anatomy, and lack clinical interpretability. Method: Builds a unified foundation model that combines contrastive visual representation learning with CLIP-based vision-language alignment: a contrastive encoder extracts modality-invariant representations, and a text-guided decoder performs semantically consistent any-to-all MRI synthesis; the model is trained on 40,825 images from 13 institutions. Result: Across 26 internal and external validation sites (15,748 images), the model reaches an average SSIM of 0.90 and PSNR of 27, with high synthesis fidelity and strong robustness to noise and domain shift; the unified representation also improves downstream radiotherapy-related tasks such as segmentation. Conclusion: The unified foundation model bridges technical synthesis capability and clinical utility and offers a new paradigm for digital medicine in NPC care. Abstract: Magnetic resonance imaging (MRI) is essential for nasopharyngeal carcinoma (NPC) radiotherapy (RT), but practical constraints, such as patient discomfort, long scan times, and high costs, often lead to incomplete modalities in clinical practice, compromising RT planning accuracy. Traditional MRI synthesis methods are modality-specific, limited in anatomical adaptability, and lack clinical interpretability, failing to meet NPC's RT needs. Here, we developed a unified foundation model integrating contrastive visual representation learning and vision-language alignment (VLA) to enable any-to-all MRI synthesis. The model uses a contrastive encoder for modality-invariant representations and a CLIP-based text-informed decoder for semantically consistent synthesis, supporting any-to-all MRI synthesis via one unified foundation model. Trained on 40,825 images from 13 institutions, it achieves consistently high performance (average SSIM 0.90, PSNR 27) across 26 internal/external validation sites (15,748 images), with superior synthesis fidelity and robustness to noise and domain shifts. Meanwhile, its unified representation enhances downstream RT-relevant tasks (e.g., segmentation). This work advances digital medicine solutions for NPC care by leveraging foundation models to bridge technical synthesis and clinical utility.

[281] VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning

Hao Tan,Jun Lan,Senyuan Shi,Zichang Tan,Zijian Yu,Huijia Zhu,Weiqiang Wang,Jun Wan,Zhen Lei

Main category: cs.CV

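TL;DR: This paper proposes VideoVeritas, a framework that combines fine-grained perception with fact-based reasoning and uses Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL) to improve AI-generated video detection; it also builds MintVid, a light yet high-quality dataset for evaluation.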

Details Motivation: Growing video generation capability brings escalating security risks, and although existing multimodal large language models (MLLMs) reason well, their fine-grained perception is limited, so more reliable detection methods are needed. Method: Proposes the VideoVeritas framework with Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL); in the RL stage it uses perception pretext tasks such as spatiotemporal grounding and self-supervised object counting instead of directly optimizing the detection objective; also builds the MintVid dataset, which contains 3K videos from 9 state-of-the-art generators plus a real-world subset with factual errors. Result: Experiments show that existing methods tend to be biased toward either superficial reasoning or mechanical analysis, while VideoVeritas achieves more balanced and robust detection performance across multiple benchmarks. Conclusion: A framework that fuses fine-grained perception with factual reasoning (VideoVeritas), together with its perception pretext training strategy (PPRL) and evaluation dataset (MintVid), can effectively improve the reliability and generalization of AI-generated video detection. Abstract: The growing capability of video generation poses escalating security risks, making reliable detection increasingly essential. In this paper, we introduce VideoVeritas, a framework that integrates fine-grained perception and fact-based reasoning. We observe that while current multi-modal large language models (MLLMs) exhibit strong reasoning capacity, their granular perception ability remains limited. To mitigate this, we introduce Joint Preference Alignment and Perception Pretext Reinforcement Learning (PPRL). Specifically, rather than directly optimizing for the detection task, we adopt general spatiotemporal grounding and self-supervised object counting in the RL stage, enhancing detection performance with simple perception pretext tasks. To facilitate robust evaluation, we further introduce MintVid, a light yet high-quality dataset containing 3K videos from 9 state-of-the-art generators, along with a real-world collected subset that has factual errors in content. Experimental results demonstrate that existing methods tend to bias towards either superficial reasoning or mechanical analysis, while VideoVeritas achieves more balanced performance across diverse benchmarks.

[282] FlattenGPT: Depth Compression for Transformer with Layer Flattening

Ruihan Xu,Qingpei Guo,Yao Zhu,Xiangyang Ji,Ming Yang,Shiliang Zhang

Main category: cs.CV

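TL;DR: FlattenGPT flattens adjacent Transformer blocks into a single block, compressing model depth without discarding the knowledge of any block; this enables more effective detection and removal of parameter redundancy and substantially improves inference efficiency while preserving performance.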

Details Motivation: Existing entire-block pruning tends to discard important features and causes large performance drops, while channel pruning cannot reduce depth and struggles with inconsistent pruning ratios across layers; a method that compresses depth while preserving performance is needed. Method: Proposes FlattenGPT, which flattens two adjacent Transformer blocks into one new block to reduce network depth, then performs finer-grained parameter redundancy detection and pruning on the merged structure while fully retaining the knowledge learned by the original blocks. Result: On LLaMA-2/3 and Qwen-1.5 models, it retains 90-96% of zero-shot performance at a 20% compression ratio; it outperforms existing pruning methods in both zero-shot accuracy and WikiText-2 perplexity and noticeably accelerates LLM inference. Conclusion: FlattenGPT is an effective, architecture-compatible, and robust depth compression method that offers a new route to efficient deployment of Transformer models. Abstract: Recent works have indicated redundancy across transformer blocks, prompting the research of depth compression to prune less crucial blocks. However, current ways of entire-block pruning suffer from risks of discarding meaningful cues learned in those blocks, leading to substantial performance degradation. As another line of model compression, channel pruning can better preserve performance, while it cannot reduce model depth and is challenged by inconsistent pruning ratios for individual layers. To pursue better model compression and acceleration, this paper proposes FlattenGPT, a novel way to detect and reduce depth-wise redundancies. By flattening two adjacent blocks into one, it compresses the network depth and at the same time enables more effective parameter redundancy detection and removal. FlattenGPT preserves the knowledge learned in all blocks and remains consistent with the original transformer architecture. Extensive experiments demonstrate that FlattenGPT enhances model efficiency with a decent trade-off to performance. It outperforms existing pruning methods in both zero-shot accuracies and WikiText-2 perplexity across various model types and parameter sizes. On LLaMA-2/3 and Qwen-1.5 models, FlattenGPT retains 90-96% of zero-shot performance with a compression ratio of 20%. It also outperforms other pruning methods in accelerating LLM inference, making it promising for enhancing the efficiency of transformers.

[283] TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models

Xiangtian Zheng,Zishuo Wang,Yuxin Peng

Main category: cs.CV

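TL;DR: This paper proposes TiFRe, a framework that uses Text-guided Frame Sampling (TFS) and Frame Matching and Merging (FMM) to reduce the number of input video frames while retaining key semantic information, lowering computational cost and improving performance on video-language tasks.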

Details Motivation: Video MLLMs face high computational costs, especially the attention overhead of processing many video frames; naive frame reduction (such as fixed-FPS sampling) tends to lose important information in non-key frames and degrades performance. Method: Proposes the TiFRe framework: 1) Text-guided Frame Sampling (TFS), in which an LLM converts the user input into a CLIP-style prompt and pretrained CLIP encoders compute the semantic similarity between the prompt and each frame to select the most relevant frames; 2) Frame Matching and Merging (FMM), which merges information from non-key frames into the key frames to preserve video semantics. Result: Experiments show that TiFRe effectively reduces computational cost while improving performance on video-language tasks such as video understanding and question answering. Conclusion: TiFRe is a frame reduction method that balances efficiency and effectiveness; through text guidance and semantic merging, it reduces the number of input frames while maintaining or even improving model performance. Abstract: With the rapid development of Large Language Models (LLMs), Video Multi-Modal Large Language Models (Video MLLMs) have achieved remarkable performance in video-language tasks such as video understanding and question answering. However, Video MLLMs face high computational costs, particularly in processing numerous video frames as input, which leads to significant attention computation overhead. A straightforward approach to reduce computational costs is to decrease the number of input video frames. However, simply selecting key frames at a fixed frame rate (FPS) often overlooks valuable information in non-key frames, resulting in notable performance degradation. To address this, we propose Text-guided Video Frame Reduction (TiFRe), a framework that reduces input frames while preserving essential video information. TiFRe uses a Text-guided Frame Sampling (TFS) strategy to select key frames based on user input, which is processed by an LLM to generate a CLIP-style prompt. Pre-trained CLIP encoders calculate the semantic similarity between the prompt and each frame, selecting the most relevant frames as key frames. To preserve video semantics, TiFRe employs a Frame Matching and Merging (FMM) mechanism, which integrates non-key frame information into the selected key frames, minimizing information loss. Experiments show that TiFRe effectively reduces computational costs while improving performance on video-language tasks.
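
The frame selection step described above (score each frame against a prompt embedding, keep the most similar ones) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the TiFRe implementation: the embeddings are hand-written 2D vectors standing in for CLIP outputs, the names `cosine` and `select_key_frames` are hypothetical, and the FMM merging step is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_key_frames(prompt_emb, frame_embs, k):
    """Return indices of the k frames most similar to the prompt.

    Indices are returned in temporal order so the selected frames
    can be fed to a downstream model as a shortened video.
    """
    scores = [cosine(prompt_emb, f) for f in frame_embs]
    ranked = sorted(range(len(frame_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy embeddings (assumed); a real system would obtain them from CLIP encoders.
prompt_emb = [1.0, 0.0]
frame_embs = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
key_frames = select_key_frames(prompt_emb, frame_embs, k=2)
```

Selecting by similarity instead of a fixed frame rate is what lets the kept frames depend on the user's question; the frames that are not selected are the ones whose information FMM is described as merging back in.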

[284] Analysis of Converged 3D Gaussian Splatting Solutions: Density Effects and Prediction Limit

Zhendong Wang,Cihan Ruan,Jingchuan Xiao,Chuqing Shi,Wei Jiang,Wei Wang,Wenjie Liu,Nam Ling

Main category: cs.CV

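TL;DR: This paper studies the structural properties of the Rendering-Optimal References (RORs) that 3D Gaussian Splatting (3DGS) produces under multi-view optimization and finds stable patterns such as mixture-structured scales and bimodal radiance. Prediction experiments without rendering supervision further reveal density stratification: parameters in dense regions can be predicted directly from point clouds, whereas sparse regions depend on multi-view constraints. Finally, density-aware strategies are proposed to improve training robustness.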

Details Motivation: Understand the structure (RORs) that emerges in 3D Gaussian Splatting under standard multi-view optimization and what determines it, especially the coupling between geometric and appearance parameters. Method: Statistically analyzes the structural properties of RORs; designs learnability probes without rendering supervision, training models to predict RORs from point clouds; formalizes, through variance decomposition, the parameter coupling caused by visibility heterogeneity. Result: RORs exhibit mixture-structured scales and bimodal radiance; the analysis reveals density stratification: parameters in dense regions are geometry-correlated and predictable, while in sparse regions visibility heterogeneity produces strong covariance coupling between geometric and appearance parameters; RORs act both as geometric primitives and as view synthesis primitives. Conclusion: RORs are not a purely geometric representation but a density-driven hybrid that requires adaptively balancing feed-forward prediction and rendering-based refinement; density-aware strategies improve training robustness and inform new architecture designs. Abstract: We investigate what structure emerges in 3D Gaussian Splatting (3DGS) solutions from standard multi-view optimization. We term these Rendering-Optimal References (RORs) and analyze their statistical properties, revealing stable patterns: mixture-structured scales and bimodal radiance across diverse scenes. To understand what determines these parameters, we apply learnability probes by training predictors to reconstruct RORs from point clouds without rendering supervision. Our analysis uncovers fundamental density-stratification. Dense regions exhibit geometry-correlated parameters amenable to render-free prediction, while sparse regions show systematic failure across architectures. We formalize this through variance decomposition, demonstrating that visibility heterogeneity creates covariance-dominated coupling between geometric and appearance parameters in sparse regions. This reveals the dual character of RORs: geometric primitives where point clouds suffice, and view synthesis primitives where multi-view constraints are essential. We provide density-aware strategies that improve training robustness and discuss architectural implications for systems that adaptively balance feed-forward prediction and rendering-based refinement.

[285] Grow with the Flow: 4D Reconstruction of Growing Plants with Gaussian Flow Fields

Weihan Luo,Lily Goli,Sherwin Bahmani,Felix Taubner,Andrea Tagliasacchi,David B. Lindell

Main category: cs.CV

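TL;DR: This paper proposes a plant growth modeling method based on 3D Gaussian flow fields. It models nonlinear, continuous-time growth dynamics through time-varying derivatives of the Gaussian parameters and reconstructs the initial Gaussian set via reverse growth, achieving better image quality and geometric accuracy on multi-view timelapse data.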

Details Motivation: Plants continually generate new geometry as they grow (for example by branching and differentiating), and existing dynamic modeling methods (such as deformation fields and 4D Gaussian splatting) cannot handle newly added geometry or nonlinear continuous-time evolution. Method: Proposes a 3D Gaussian flow field representation that models plant growth as time derivatives of the Gaussian parameters (position, scale, orientation, color, opacity); initializes the Gaussian primitives by reconstructing the mature plant and learning a reverse growth process. Result: On multi-view plant growth timelapse datasets, both image quality and geometric accuracy improve over prior methods. Conclusion: The method provides a new approach to appearance modeling of growing 3D structures and addresses the difficulty of modeling organisms such as plants that produce new geometry over time. Abstract: Modeling the time-varying 3D appearance of plants during their growth poses unique challenges: unlike many dynamic scenes, plants generate new geometry over time as they expand, branch, and differentiate. Recent motion modeling techniques are ill-suited to this problem setting. For example, deformation fields cannot introduce new geometry, and 4D Gaussian splatting constrains motion to a linear trajectory in space and time and cannot track the same set of Gaussians over time. Here, we introduce a 3D Gaussian flow field representation that models plant growth as a time-varying derivative over Gaussian parameters (position, scale, orientation, color, and opacity), enabling nonlinear and continuous-time growth dynamics. To initialize a sufficient set of Gaussian primitives, we reconstruct the mature plant and learn a process of reverse growth, effectively simulating the plant's developmental history in reverse. Our approach achieves superior image quality and geometric accuracy compared to prior methods on multi-view timelapse datasets of plant growth, providing a new approach for appearance modeling of growing 3D structures.
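
To make the idea of "growth as a time derivative over parameters" concrete, the sketch below integrates a single scalar parameter forward and then backward in time with explicit Euler steps. It is only an illustration under strong assumptions: the paper's flow field is learned and acts on full Gaussian parameter sets, whereas here `growth_rate` is a hand-written, hypothetical derivative and the state is one number.

```python
import math


def growth_rate(value, t):
    """Hypothetical flow field: the parameter grows at 50% of its size per unit time."""
    return 0.5 * value


def integrate(value, t_start, t_end, steps=1000):
    """Explicit Euler integration of d(value)/dt = growth_rate(value, t).

    A negative step (t_end < t_start) runs the same dynamics in reverse,
    which mirrors the idea of shrinking a mature plant back to its start.
    """
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        value += dt * growth_rate(value, t)
        t += dt
    return value


seed_scale = 1.0
mature_scale = integrate(seed_scale, 0.0, 1.0)    # forward growth
recovered = integrate(mature_scale, 1.0, 0.0)     # reverse growth
exact = seed_scale * math.exp(0.5)                # closed-form solution for comparison
```

Because the trajectory is defined by a derivative, it can be queried at any time and need not be linear, which is the property the abstract contrasts with linear-trajectory 4D Gaussian splatting.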

[286] MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE

Ruijie Zhu,Jiahao Lu,Wenbo Hu,Xiaoguang Han,Jianfei Cai,Ying Shan,Chuanxia Zheng

Main category: cs.CV

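TL;DR: MotionCrafter is a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from monocular video, with substantial gains from a newly proposed joint representation and a 4D VAE.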

Details Motivation: Existing methods force 3D values to align strictly with the RGB VAE latent space even though the two distributions are fundamentally different, which leads to suboptimal performance; a more suitable representation and training strategy is needed. Method: Proposes a joint representation of dense 3D point maps and 3D scene flow in a shared coordinate system and designs a novel 4D VAE; adopts a new data normalization and VAE training strategy that avoids forced alignment with the RGB latent space. Result: Achieves state-of-the-art geometry reconstruction and scene flow estimation on multiple datasets, with improvements of 38.64% and 25.0% respectively, without any post-optimization. Conclusion: MotionCrafter shows that dropping the RGB VAE latent-space constraint in favor of a dedicated 4D representation and training strategy greatly improves 4D reconstruction and motion estimation from monocular video. Abstract: We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, and a novel 4D VAE to effectively learn this representation. Unlike prior work that forces the 3D value and latents to align strictly with RGB VAE latents, despite their fundamentally different distributions, we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page

[287] Modeling 3D Pedestrian-Vehicle Interactions for Vehicle-Conditioned Pose Forecasting

Guangxun Zhu,Xuan Liu,Nicolas Pugeault,Chongfeng Wei,Edmond S. L. Ho

Main category: cs.CV

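TL;DR: This paper proposes a 3D vehicle-conditioned pedestrian pose forecasting framework that incorporates surrounding vehicle information to improve prediction accuracy, and validates it on an enhanced Waymo-3DSkelMo dataset.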

Details Motivation: Accurately predicting pedestrian motion in complex urban environments is crucial for the safety and reliability of autonomous driving, yet existing methods often ignore the influence of surrounding vehicles on pedestrian behavior. Method: Builds a vehicle-conditioned 3D pedestrian pose forecasting network that adapts the TBIFormer architecture with a dedicated vehicle encoder and a pedestrian-vehicle interaction cross-attention module, and trains and evaluates it on the enhanced Waymo-3DSkelMo dataset. Result: Experiments show that the method substantially improves 3D pedestrian pose forecasting accuracy, confirming the value of modeling pedestrian-vehicle interactions. Conclusion: Vehicle-aware 3D pose forecasting is important for improving pedestrian motion prediction in autonomous driving systems. Abstract: Accurately predicting pedestrian motion is crucial for safe and reliable autonomous driving in complex urban environments. In this work, we present a 3D vehicle-conditioned pedestrian pose forecasting framework that explicitly incorporates surrounding vehicle information. To support this, we enhance the Waymo-3DSkelMo dataset with aligned 3D vehicle bounding boxes, enabling realistic modeling of multi-agent pedestrian-vehicle interactions. We introduce a sampling scheme to categorize scenes by pedestrian and vehicle count, facilitating training across varying interaction complexities. Our proposed network adapts the TBIFormer architecture with a dedicated vehicle encoder and pedestrian-vehicle interaction cross-attention module to fuse pedestrian and vehicle features, allowing predictions to be conditioned on both historical pedestrian motion and surrounding vehicles. Extensive experiments demonstrate substantial improvements in forecasting accuracy and validate different approaches for modeling pedestrian-vehicle interactions, highlighting the importance of vehicle-aware 3D pose prediction for autonomous driving. Code is available at: https://github.com/GuangxunZhu/VehCondPose3D

[288] WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

Yu Shang,Zhuohang Li,Yiding Ma,Weikang Su,Xin Jin,Ziyou Wang,Xin Zhang,Yinzhou Tang,Chen Gao,Wei Wu,Xihui Liu,Dhruv Shah,Zhaoxiang Zhang,Zhibo Chen,Jun Zhu,Yonghong Tian,Tat-Seng Chua,Wenwu Zhu,Yong Li

Main category: cs.CV

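TL;DR: This paper proposes WorldArena, a unified benchmark for systematically evaluating embodied world models on both perception and functionality, and reveals a "perception-functionality gap": high visual quality does not imply strong embodied task capability.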

Details Motivation: Current evaluation of embodied world models focuses too heavily on perceptual fidelity (such as video generation quality) and overlooks their functional utility in downstream decision-making tasks; a unified, multi-dimensional evaluation framework is missing. Method: Builds the WorldArena benchmark, which evaluates models along two dimensions: video perception quality (16 metrics across 6 sub-dimensions) and embodied task functionality (as data engines, policy evaluators, and action planners, combined with subjective human evaluation); proposes the holistic metric EWMScore. Result: Experiments on 14 representative models show a significant perception-functionality gap, meaning that high visual quality does not guarantee strong embodied task capability; the WorldArena benchmark and leaderboard are publicly released. Conclusion: Evaluation of embodied world models must cover both perception and functionality, and WorldArena provides a standardized framework for advancing truly functional world models in embodied AI. Abstract: While world models have emerged as a cornerstone of embodied intelligence by enabling agents to reason about environmental dynamics through action-conditioned prediction, their evaluation remains fragmented. Current evaluation of embodied world models has largely focused on perceptual fidelity (e.g., video generation quality), overlooking the functional utility of these models in downstream decision-making tasks. In this work, we introduce WorldArena, a unified benchmark designed to systematically evaluate embodied world models across both perceptual and functional dimensions. WorldArena assesses models through three dimensions: video perception quality, measured with 16 metrics across six sub-dimensions; embodied task functionality, which evaluates world models as data engines, policy evaluators, and action planners integrating with subjective human evaluation. Furthermore, we propose EWMScore, a holistic metric integrating multi-dimensional performance into a single interpretable index. Through extensive experiments on 14 representative models, we reveal a significant perception-functionality gap, showing that high visual quality does not necessarily translate into strong embodied task capability. WorldArena benchmark with the public leaderboard is released at https://worldarena.ai, providing a framework for tracking progress toward truly functional world models in embodied AI.

[289] Generalizing Sports Feedback Generation by Watching Competitions and Reading Books: A Rock Climbing Case Study

Arushi Rai,Adriana Kovashka

Main category: cs.CV

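TL;DR: This paper proposes using auxiliary web data (such as competition videos and coaching manuals) together with sports feedback data from other domains to improve the ability of video LLMs to generate expert feedback for unseen sports, and designs two new evaluation metrics, specificity and actionability, because traditional text generation metrics are poorly suited to judging sports feedback.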

Details Motivation: Existing video large language models generalize poorly on sports feedback generation, especially for sports not seen during fine-tuning; in addition, traditional text generation metrics cannot accurately measure the quality of sports feedback. Method: Using rock climbing as the case study, combines freely available web data from the target domain (competition videos, coaching manuals) with existing sports feedback data from a source domain for cross-domain knowledge transfer; proposes two dedicated evaluation metrics, specificity and actionability. Result: The proposed method markedly improves feedback generation quality on the target sport (rock climbing) and remains effective with limited annotated data; the new metrics better match the practical needs of sports feedback. Conclusion: Auxiliary-domain data and dedicated evaluation metrics can effectively ease the generalization bottleneck and evaluation mismatch of video LLMs in sports feedback generation, offering a viable path to expert feedback generation in low-resource settings. Abstract: While there is rapid progress in video-LLMs with advanced reasoning capabilities, prior work shows that these models struggle on the challenging task of sports feedback generation and require expensive and difficult-to-collect finetuning feedback data for each sport. This limitation is evident from the poor generalization to sports unseen during finetuning. Furthermore, traditional text generation evaluation metrics (e.g., BLEU-4, METEOR, ROUGE-L, BERTScore), originally developed for machine translation and summarization, fail to capture the unique aspects of sports feedback quality. To address the first problem, using rock climbing as our case study, we propose using auxiliary freely-available web data from the target domain, such as competition videos and coaching manuals, in addition to existing sports feedback from a disjoint, source domain to improve sports feedback generation performance on the target domain. To improve evaluation, we propose two evaluation metrics: (1) specificity and (2) actionability. Together, our approach enables more meaningful and practical generation of sports feedback under limited annotations.

[290] ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation

Zihan Yang,Shuyuan Tu,Licheng Zhang,Qi Dai,Yu-Gang Jiang,Zuxuan Wu

Main category: cs.CV

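TL;DR: ArcFlow is a few-step distillation framework for diffusion models that models the teacher's denoising path with non-linear flow trajectories. It parameterizes the velocity field as a mixture of continuous momentum processes that admits analytical integration, achieving a 40x speedup with only 2 function evaluations (NFEs) while preserving generation quality.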

Details Motivation: Existing distillation methods approximate the teacher trajectory with linear shortcuts, which cannot match its time-varying velocity directions and therefore degrade generation quality. Method: Proposes ArcFlow, which parameterizes the velocity field of the denoising trajectory as a mixture of continuous momentum processes; supports analytical integration to avoid numerical discretization error; trains via trajectory distillation with lightweight adapters. Result: On large models such as Qwen-Image-20B and FLUX.1-dev, fine-tuning fewer than 5% of the parameters yields a 40x speedup at 2 NFEs without significant quality degradation; effectiveness is verified on multiple benchmarks. Conclusion: By explicitly modeling non-linear trajectories with analytical integration, ArcFlow approximates the teacher with high precision in very few steps, balancing efficiency and generation quality and offering a new paradigm for efficient diffusion model inference. Abstract: Diffusion models have achieved remarkable generation quality, but they suffer from significant inference cost due to their reliance on multiple sequential denoising steps, motivating recent efforts to distill this inference process into a few-step regime. However, existing distillation methods typically approximate the teacher trajectory by using linear shortcuts, which makes it difficult to match its constantly changing tangent directions as velocities evolve across timesteps, thereby leading to quality degradation. To address this limitation, we propose ArcFlow, a few-step distillation framework that explicitly employs non-linear flow trajectories to approximate pre-trained teacher trajectories. Concretely, ArcFlow parameterizes the velocity field underlying the inference trajectory as a mixture of continuous momentum processes. This enables ArcFlow to capture velocity evolution and extrapolate coherent velocities to form a continuous non-linear trajectory within each denoising step. Importantly, this parameterization admits an analytical integration of this non-linear trajectory, which circumvents numerical discretization errors and results in high-precision approximation of the teacher trajectory. To train this parameterization into a few-step generator, we implement ArcFlow via trajectory distillation on pre-trained teacher models using lightweight adapters. This strategy ensures fast, stable convergence while preserving generative diversity and quality. Built on large-scale models (Qwen-Image-20B and FLUX.1-dev), ArcFlow only fine-tunes on less than 5% of original parameters and achieves a 40x speedup with 2 NFEs over the original multi-step teachers without significant quality degradation. Experiments on benchmarks show the effectiveness of ArcFlow both qualitatively and quantitatively.

[291] Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction

Hao Phung,Hadar Averbuch-Elor

Main category: cs.CV

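TL;DR: This paper proposes Raster2Seq, which frames the reconstruction of a structured vector-graphics representation from a rasterized floorplan as a sequence-to-sequence task. An autoregressive decoder with learnable anchors predicts polygon corner sequences that jointly encode geometry and semantics, reaching SOTA performance on multiple benchmarks.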

Details Motivation: Existing methods struggle to accurately reconstruct the structure and semantics of floorplans for complex indoor spaces with many rooms and a variable number of polygon corners. Method: Proposes the Raster2Seq framework, which represents floorplan elements (rooms, windows, doors, etc.) as labeled polygon corner sequences; designs an autoregressive decoder conditioned on image features and previously generated corners, and introduces learnable spatial anchors that guide the attention mechanism toward informative image regions. Result: Achieves SOTA performance on standard benchmarks such as Structure3D, CubiCasa5K, and Raster2Graph, and generalizes well to diverse room structures and complex geometric variations on the more challenging WAFFLE dataset. Conclusion: Through sequence modeling and anchor-guided autoregressive generation, Raster2Seq improves the accuracy and flexibility of structured reconstruction for complex floorplans and provides a more reliable vector representation for downstream understanding and CAD applications. Abstract: Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and varying numbers of polygon corners. To this end, we propose Raster2Seq, framing floorplan reconstruction as a sequence-to-sequence task in which floorplan elements, such as rooms, windows, and doors, are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners using guidance from learnable anchors. These anchors represent spatial coordinates in image space, hence allowing for effectively directing the attention mechanism to focus on informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, enabling efficient handling of complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structure3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contain diverse room structures and complex geometric variations.
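
One way to picture "labeled polygon sequences" is as a flat token list in which each element contributes a label token, its quantized corners, and an end marker. The sketch below is a hypothetical serialization written for illustration; the token names (`LABEL`, `CORNER`, `END`), the 256-bin quantization, and both function names are assumptions, not the format used by Raster2Seq.

```python
def polygons_to_sequence(polygons, bins=256):
    """Flatten labeled polygons into a single token sequence.

    polygons: list of (label, [(x, y), ...]) with x and y normalized to [0, 1).
    Coordinates are quantized to integer bins so each corner is a discrete token.
    """
    seq = []
    for label, corners in polygons:
        seq.append(("LABEL", label))
        for x, y in corners:
            seq.append(("CORNER", int(x * bins), int(y * bins)))
        seq.append(("END",))  # closes the current polygon
    return seq


def sequence_to_polygons(seq):
    """Inverse of polygons_to_sequence, returning quantized corners."""
    polygons, label, corners = [], None, []
    for tok in seq:
        if tok[0] == "LABEL":
            label, corners = tok[1], []
        elif tok[0] == "CORNER":
            corners.append((tok[1], tok[2]))
        elif tok[0] == "END":
            polygons.append((label, corners))
    return polygons


# A square room and a door segment on its right wall (toy data).
plan = [
    ("room", [(0.0, 0.0), (0.5, 0.0), (0.5, 0.5), (0.0, 0.5)]),
    ("door", [(0.5, 0.25), (0.5, 0.375)]),
]
tokens = polygons_to_sequence(plan)
decoded = sequence_to_polygons(tokens)
```

Because each polygon simply ends when its end marker appears, the same sequence format accommodates any number of rooms and any number of corners per room, which is the flexibility the abstract attributes to the autoregressive formulation.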

[292] WorldCompass: Reinforcement Learning for Long-Horizon World Models

Zehan Wang,Tengfei Wang,Haiyu Zhang,Xuhui Zuo,Junta Wu,Haoyuan Wang,Wenqiang Sun,Zhenwei Wang,Chenjie Cao,Hengshuang Zhao,Chunchao Guo,Zhou Zhao

Main category: cs.CV

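TL;DR: WorldCompass is a reinforcement learning post-training framework for long-horizon, interactive video world models. Through a clip-level sampling strategy, complementary reward functions, and an efficient RL algorithm, it significantly improves interaction accuracy and visual fidelity.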

Details Motivation: Existing world models explore inaccurately and inconsistently in long-horizon, interactive video generation and make poor use of interaction signals. Method: Proposes three core innovations: 1) a clip-level rollout strategy that enables efficient multi-sample generation and fine-grained reward evaluation; 2) complementary reward functions that cover both interaction accuracy and visual quality; 3) an efficient RL algorithm that combines a negative-aware fine-tuning strategy with various efficiency optimizations. Result: Evaluated on the SoTA open-source world model WorldPlay, WorldCompass significantly improves interaction accuracy and visual fidelity. Conclusion: WorldCompass offers an efficient and robust RL post-training paradigm for interactive video world models, moving them toward more realistic and controllable embodied intelligence. Abstract: This work presents WorldCompass, a novel Reinforcement Learning (RL) post-training framework for long-horizon, interactive video-based world models, enabling them to explore the world more accurately and consistently based on interaction signals. To effectively "steer" the world model's exploration, we introduce three core innovations tailored to the autoregressive video generation paradigm: 1) Clip-Level Rollout Strategy: We generate and evaluate multiple samples at a single target clip, which significantly boosts rollout efficiency and provides fine-grained reward signals. 2) Complementary Reward Functions: We design reward functions for both interaction-following accuracy and visual quality, which provide direct supervision and effectively suppress reward-hacking behaviors. 3) Efficient RL Algorithm: We employ a negative-aware fine-tuning strategy coupled with various efficiency optimizations to efficiently and effectively enhance model capacity. Evaluations on the SoTA open-source world model, WorldPlay, demonstrate that WorldCompass significantly improves interaction accuracy and visual fidelity across various scenarios.
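The abstract does not spell out how the two rewards are combined across the samples drawn for one clip. As a rough illustration only, the sketch below standardizes each reward within the group of samples for a single target clip and sums them, which is a common group-relative scheme and is an assumption here, not the paper's algorithm. The function names and the `quality_weight` parameter are hypothetical, and the negative-aware fine-tuning step is not implemented.

```python
import statistics
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Zero-mean, unit-variance scaling within one group of samples."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        # All samples scored the same, so this reward carries no signal.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def combined_advantages(interaction_rewards: List[float],
                        quality_rewards: List[float],
                        quality_weight: float = 1.0) -> List[float]:
    """Score each sample of one target clip relative to its siblings.

    Standardizing each reward separately keeps either signal from
    dominating purely because of its numeric scale.
    """
    if len(interaction_rewards) != len(quality_rewards):
        raise ValueError("both reward lists must cover the same samples")
    interaction = standardize(interaction_rewards)
    quality = standardize(quality_rewards)
    return [i + quality_weight * q for i, q in zip(interaction, quality)]


# Four hypothetical samples of the same clip: two follow the action, two do not,
# and all four have identical visual-quality scores.
advantages = combined_advantages([1.0, 0.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5])
```

In this toy case the quality reward is constant, so the ranking is driven entirely by interaction accuracy; a sample would need to do well on both signals to score highly when both vary.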

[293] Autoregressive Image Generation with Masked Bit Modeling

Qihang Yu,Qihao Liu,Ju He,Xinyang Zhang,Yang Liu,Liang-Chieh Chen,Xi Chen

Main category: cs.CV

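TL;DR: This paper challenges the dominance of continuous pipelines in visual generation. It finds that the performance gap between discrete and continuous methods stems mainly from the total number of bits in the latent space (i.e., the compression ratio) and shows that scaling up the codebook can close this gap. To exploit this, it designs masked Bit AutoRegressive modeling (BAR), a scalable framework in which an autoregressive transformer predicts discrete tokens bit by bit, reaching a new SOTA gFID of 0.99 on ImageNet-256 while lowering sampling cost and converging faster.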

Details Motivation: Discrete tokenizers are widely believed to underperform continuous methods; the authors question this view, aiming to identify the real source of the performance gap and to address the performance degradation or prohibitive training cost that existing discrete generation methods face when the codebook is scaled up. Method: Proposes the masked Bit AutoRegressive modeling (BAR) framework, which pairs an autoregressive transformer with a masked bit modeling head to predict discrete tokens bit by bit, supporting arbitrary codebook sizes and avoiding the bottleneck that codebook scaling causes in conventional discrete methods. Result: BAR achieves a gFID of 0.99 on ImageNet-256, outperforming leading continuous and discrete methods, while significantly reducing sampling cost and converging faster than prior continuous approaches. Conclusion: The potential of discrete visual generation has been underestimated; the performance gap is essentially a matter of bit budget rather than of discreteness itself. BAR shows that large-scale discrete modeling is feasible and efficient, providing a new benchmark and a practical path for the discrete generation paradigm. Abstract: This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight, suffering from performance degradation or prohibitive training costs with scaled codebooks. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts discrete tokens through progressively generating their constituent bits. BAR achieves a new state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms, while significantly reducing sampling costs and converging faster than prior continuous approaches. Project page is available at https://bar-gen.github.io/
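The bit-budget argument and the "constituent bits" view of a token can be illustrated with some simple arithmetic. The sketch below is not BAR itself: it only shows that a latent of N tokens drawn from a codebook of size C carries N · log2(C) bits, and that a token index in a codebook of size 2^K can be written as K binary digits, so a head that predicts bits needs K outputs rather than 2^K classes. The token count of 256 and the codebook sizes are made-up example values, and the function names are hypothetical.

```python
import math
from typing import List


def latent_bits(num_tokens: int, codebook_size: int) -> float:
    """Total bits carried by a discrete latent: N * log2(C)."""
    return num_tokens * math.log2(codebook_size)


def token_to_bits(index: int, num_bits: int) -> List[int]:
    """Write a token index as num_bits binary digits, most significant first."""
    if not 0 <= index < 2 ** num_bits:
        raise ValueError("index does not fit in the given number of bits")
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_token(bits: List[int]) -> int:
    """Inverse of token_to_bits."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


# Hypothetical example: 256 tokens per image.
small_budget = latent_bits(256, 2 ** 10)   # 1,024-entry codebook
large_budget = latent_bits(256, 2 ** 16)   # 65,536-entry codebook

# A single token from the 2**16 codebook becomes 16 binary targets.
bits = token_to_bits(40_000, num_bits=16)
recovered = bits_to_token(bits)
```

With these example numbers the larger codebook raises the budget from 2,560 to 4,096 bits per image without changing the token count, while the per-token output size grows only from 10 to 16 bits instead of from 1,024 to 65,536 classes.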